# Retrieval Augmented Generation (RAG)

- Retrieval - Find relevant information given a query.
- Augmented - Take the relevant information and augment the prompt to the LLM with it.
- Generation - Pass the augmented prompt to an LLM to generate the output.


## What
A nutrition chatbot built on a from-scratch RAG pipeline and an LLM.
## How
1. Download the "nutrition" PDFs.
2. Split the PDF text into chunks for embedding.
3. Embed the chunks, i.e. turn them into numerical representations (embeddings), and store them somewhere (not yet sure whether a vector DB is needed).
4. Build a retrieval system that uses vector search to find the chunks of text relevant to a query.
5. Create a prompt based on the retrieved text.
6. Send the prompt to an LLM for an answer.
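Step 5 is the least concrete item in the list, so here is a minimal sketch of how retrieved chunks might be folded into a prompt. The function name `build_prompt`, the template wording, and the sample chunks are illustrative assumptions, not the final implementation of this notebook.

```python
def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Augment a user query with retrieved context (hypothetical template)."""
    # Number each chunk so an answer can refer back to its source.
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

# Made-up chunks standing in for the output of the retrieval step.
chunks = [
    "Nutrient-dense foods contain significant amounts of essential nutrients.",
    "Empty-calorie foods provide many calories and very few nutrients.",
]
prompt = build_prompt("What are nutrient-dense foods?", chunks)
print(prompt)
```

The resulting string is what would be sent to the LLM in step 6.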

In [1]:
%pip install PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.26.0-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.0-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (24.1 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.26.0


# Document/text processing

In [2]:
import os
import requests

pdf_file = 'human-nutrition-pdf'
pdf_url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

# download the PDF only if it is not already on disk
if not os.path.exists(pdf_file):
  resp = requests.get(pdf_url)
  resp.raise_for_status()

  with open(pdf_file, 'wb') as f:
    f.write(resp.content)

print(f"PDF {pdf_file} found")


PDF human-nutrition-pdf found


| Task                      | PyMuPDF                       | PyPDF                      |
| ------------------------- | ----------------------------- | -------------------------- |
| Extract paragraphs neatly | ✅                             | ❌ (loses formatting)       |
| Merge multiple PDFs       | ⚠️ (can do, but not main use) | ✅ Built-in                 |
| Add watermarks            | ✅                             | ✅                          |
| Preview page as PNG       | ✅ `page.get_pixmap()`         | ❌ Not supported            |
| Read form field values    | ⚠️ Limited                    | ✅ `get_form_text_fields()` |
| Add annotation / comment  | ✅                             | ⚠️ Very limited            |


## splitting PDF into pages

In [3]:
import fitz
from tqdm.auto import tqdm

def _process_text(text: str) -> str:
  return text.replace('\n', ' ').strip()

doc = fitz.open(pdf_file)
print(f"total pages: {len(doc)}")

page_and_texts = []
for page_num, page in tqdm(enumerate(doc), total=len(doc)):
  text = page.get_text()
  text = _process_text(text)
  page_and_texts.append({
      "page_number": page_num - 41, # offset for the front matter (table of contents, etc.); those pages get negative numbers
      "page_char_count": len(text),
      "page_word_count": len(text.split(' ')),
      "page_sentence_count": len(text.split('. ')),
      "page_token_count": len(text)//4, # approximation: per OpenAI, 1 token ~ 4 characters of English text https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
      "text": text
  })


total pages: 1208



In [4]:
page_and_texts[55]

{'page_number': 14,
 'page_char_count': 947,
 'page_word_count': 167,
 'page_sentence_count': 6,
 'page_token_count': 236,
 'text': 'Image by  David De  Veroli on  unsplash.co m / CC0  Food Quality  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  One measurement of food quality is the amount of nutrients it  contains relative to the amount of energy it provides. High-quality  foods are nutrient-dense, meaning they contain significant amounts  of one or more essential nutrients relative to the amount of calories  they provide. Nutrient-dense foods are the opposite of “empty- calorie” foods such as carbonated sugary soft drinks, which provide  many calories and very little, if any, other nutrients. Food quality is  additionally associated with its taste, texture, appearance, microbial  content, and how much consumers like it.  Food: A Better Source of Nutrients  It is better to get all your micronutrients from the foods you eat  as op

In [5]:
# rough exploratory data analysis (EDA) of the per-page statistics
import pandas as pd

df = pd.DataFrame(page_and_texts)
df.head()
df.describe().round(2)


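|       | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count |
| ----- | ----------- | --------------- | --------------- | ------------------- | ---------------- |
| count | 1208.0      | 1208.0          | 1208.0          | 1208.0              | 1208.0           |
| mean  | 562.5       | 1148.0          | 198.3           | 9.97                | 286.62           |
| std   | 348.86      | 560.38          | 95.76           | 6.19                | 140.09           |
| min   | -41.0       | 0.0             | 1.0             | 1.0                 | 0.0              |
| 25%   | 260.75      | 762.0           | 134.0           | 4.0                 | 190.0            |
| 50%   | 562.5       | 1231.5          | 214.5           | 10.0                | 307.5            |
| 75%   | 864.25      | 1603.5          | 271.0           | 14.0                | 400.25           |
| max   | 1166.0      | 2308.0          | 429.0           | 32.0                | 577.0            |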


The average page holds about 287 tokens, so an average page fits whole into the all-mpnet-base-v2 model (input capacity of 384 tokens).  
Note: input beyond 384 tokens is silently truncated by the embedding model, so the longest pages (up to an estimated 577 tokens) would lose information.
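The budget check above can be sketched as follows. It reuses the notebook's own heuristic of roughly 4 characters per token; `estimate_tokens` and `fits_model` are hypothetical helper names, and the real count depends on the model's tokenizer.

```python
MAX_SEQ_LENGTH = 384  # input limit of all-mpnet-base-v2, in tokens

def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token for English text."""
    return len(text) // 4

def fits_model(text: str, limit: int = MAX_SEQ_LENGTH) -> bool:
    """True if the estimated token count is within the model's input limit."""
    return estimate_tokens(text) <= limit

# Placeholder strings with the mean and max page_char_count from the table above.
average_page = "a" * 1148
longest_page = "a" * 2308

print(estimate_tokens(average_page), fits_model(average_page))
print(estimate_tokens(longest_page), fits_model(longest_page))
```

An average page passes the check while the longest page does not, which is one reason to split pages into smaller units before embedding.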

## further text processing (splitting pages into sentences)

### why sentences?
1. Sentences are easier to handle than pages, e.g. for keeping the token count within the embedding model's limit, especially when pages are densely filled with text.
2. We can be specific and find out which group of sentences was used in the RAG pipeline.

### Use [spaCy](https://spacy.io/) to break text into sentences



In [6]:
from spacy.lang.en import English

In [7]:
nlp = English()
nlp.add_pipe('sentencizer')

for page in tqdm(page_and_texts):
  page['sentences'] = list(nlp(page['text']).sents)
  page['sentences'] = [str(sent) for sent in page['sentences']]

  page['page_sentence_count_spacy'] = len(page['sentences'])



In [8]:
df = pd.DataFrame(page_and_texts)
df.describe().round(2)

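|       | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | page_sentence_count_spacy |
| ----- | ----------- | --------------- | --------------- | ------------------- | ---------------- | ------------------------- |
| count | 1208.0      | 1208.0          | 1208.0          | 1208.0              | 1208.0           | 1208.0                    |
| mean  | 562.5       | 1148.0          | 198.3           | 9.97                | 286.62           | 10.32                     |
| std   | 348.86      | 560.38          | 95.76           | 6.19                | 140.09           | 6.3                       |
| min   | -41.0       | 0.0             | 1.0             | 1.0                 | 0.0              | 0.0                       |
| 25%   | 260.75      | 762.0           | 134.0           | 4.0                 | 190.0            | 5.0                       |
| 50%   | 562.5       | 1231.5          | 214.5           | 10.0                | 307.5            | 10.0                      |
| 75%   | 864.25      | 1603.5          | 271.0           | 14.0                | 400.25           | 15.0                      |
| max   | 1166.0      | 2308.0          | 429.0           | 32.0                | 577.0            | 28.0                      |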


## chunking sentences together

In [9]:
num_sentence_chunk_size = 10 # arbitrary: about one average page of sentences, which keeps most chunks within the embedding model's token limit

for page in tqdm(page_and_texts):
  page['sentence_chunks'] = [page['sentences'][i:i+num_sentence_chunk_size] for i in range(0, len(page['sentences']), num_sentence_chunk_size)]
  page['num_chunks'] = len(page['sentence_chunks'])


In [10]:
df = pd.DataFrame(page_and_texts)
df.describe().round(2)

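|       | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | page_sentence_count_spacy | num_chunks |
| ----- | ----------- | --------------- | --------------- | ------------------- | ---------------- | ------------------------- | ---------- |
| count | 1208.0      | 1208.0          | 1208.0          | 1208.0              | 1208.0           | 1208.0                    | 1208.0     |
| mean  | 562.5       | 1148.0          | 198.3           | 9.97                | 286.62           | 10.32                     | 1.53       |
| std   | 348.86      | 560.38          | 95.76           | 6.19                | 140.09           | 6.3                       | 0.64       |
| min   | -41.0       | 0.0             | 1.0             | 1.0                 | 0.0              | 0.0                       | 0.0        |
| 25%   | 260.75      | 762.0           | 134.0           | 4.0                 | 190.0            | 5.0                       | 1.0        |
| 50%   | 562.5       | 1231.5          | 214.5           | 10.0                | 307.5            | 10.0                      | 1.0        |
| 75%   | 864.25      | 1603.5          | 271.0           | 14.0                | 400.25           | 15.0                      | 2.0        |
| max   | 1166.0      | 2308.0          | 429.0           | 32.0                | 577.0            | 28.0                      | 3.0        |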
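
The grouping step in the chunking cell can be looked at in isolation. This is a small sketch of the same slicing idiom; `split_into_chunks` is a hypothetical helper name and the sentences are placeholders, not text from the PDF.

```python
def split_into_chunks(items: list[str], chunk_size: int) -> list[list[str]]:
    """Split a list into consecutive groups of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

# 23 placeholder sentences standing in for one page of text.
sentences = [f"Sentence {n}." for n in range(1, 24)]
chunks = split_into_chunks(sentences, chunk_size=10)

# Sizes of the resulting groups; the last one holds whatever is left over.
print([len(chunk) for chunk in chunks])
```

Because the last group only holds the remainder, it can be much shorter than the others.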


## build new metadata for each chunk

In [11]:
import re

chunks_metadata = []
for page in tqdm(page_and_texts):
  for sent_chunk in page['sentence_chunks']:
    chunk_dict = {"page_number": page['page_number']}
    chunk_text = ''.join(sent_chunk).replace('  ', ' ').strip()
    chunk_text = re.sub(r'\.([A-Z])', r'. \1', chunk_text) # ".A" -> ". A"
    chunk_dict["sentence_chunk"] = chunk_text
    chunk_dict['chunk_char_count'] = len(chunk_text)
    chunk_dict['chunk_word_count'] = len(chunk_text.split(' '))
    chunk_dict['chunk_token_count'] = len(chunk_text) // 4

    chunks_metadata.append(chunk_dict)

print(f"Total {len(chunks_metadata)} chunks")




Total 1843 chunks
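
The join-and-repair step inside the loop above can be tried on its own. The sketch below applies the same `replace` and `re.sub` calls to made-up sentences; `join_sentences` is a hypothetical helper name.

```python
import re

def join_sentences(sentences: list[str]) -> str:
    """Join sentences into one chunk and restore missing spaces after periods."""
    text = "".join(sentences).replace("  ", " ").strip()
    # ".A" -> ". A": add a space when a period is directly followed by a capital letter.
    return re.sub(r"\.([A-Z])", r". \1", text)

# Made-up sentences; the first two carry no surrounding whitespace,
# so a plain join glues them together.
sample = [
    "Vitamin C aids iron absorption.",
    "Eat citrus with meals.",
    " Beans are rich in iron.",
]
print(join_sentences(sample))

# Caveat: the same pattern also splits dotted abbreviations.
print(join_sentences(["Grown in the U.S.A. and abroad."]))
```

The second call shows a side effect worth knowing about: abbreviations such as "U.S.A." get spaces inserted as well.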


In [12]:
chunks_metadata[900]

{'page_number': 570,
 'sentence_chunk': 'Pantothenic Acid (Vitamin B5) makes up coenzyme A, which carries the carbons of glucose, fatty acids, and amino acids into the citric acid cycle as Acetyl-CoA. Pantothenic acid forms coenzyme A, which is the main carrier of carbon molecules in a cell. Acetyl-CoA is the carbon carrier of glucose, fatty acids, and amino acids into the citric acid cycle (Figure 9.14 “Pantothenic Acid’s Role in the Citric Acid Cycle”). Coenzyme A is also involved in the synthesis of lipids, cholesterol, and acetylcholine (a neurotransmitter). A Pantothenic Acid deficiency is exceptionally rare. Signs and symptoms include fatigue, irritability, numbness, muscle pain, and cramps. You may have seen pantothenic acid on many ingredients lists for skin and hair care products; however there is no good scientific evidence that pantothenic acid improves human skin or hair. Dietary Reference Intakes Because there is little information on the requirements for pantothenic acids

In [13]:
df = pd.DataFrame(chunks_metadata)
df.describe().round(2)

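|       | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
| ----- | ----------- | ---------------- | ---------------- | ----------------- |
| count | 1843.0      | 1843.0           | 1843.0           | 1843.0            |
| mean  | 583.38      | 734.44           | 112.33           | 183.23            |
| std   | 347.79      | 447.54           | 71.22            | 111.89            |
| min   | -41.0       | 12.0             | 3.0              | 3.0               |
| 25%   | 280.5       | 315.0            | 44.0             | 78.0              |
| 50%   | 586.0       | 746.0            | 114.0            | 186.0             |
| 75%   | 890.0       | 1118.5           | 173.0            | 279.0             |
| max   | 1166.0      | 1831.0           | 297.0            | 457.0             |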


Some chunks contain very few tokens (the minimum is 3), so check whether they can be filtered out.

In [14]:
min_token = 30
for row in df[df['chunk_token_count'] <= min_token].sample(10).iterrows():
  print(f"token count:{row[1]['chunk_token_count']}, content: {row[1]['sentence_chunk']}")

token count:25, content: http://www.ajcn.org/cgi/ pmidlookup?view=long&pmid=10197575. Accessed October 6, 2017. 640 | Magnesium
token count:28, content: Accessed September 22, 2017. Dietary, Behavioral, and Physical Activity Recommendations for Weight Management | 505
token count:25, content: The Polynesian Family System in Ka-‘u. Rutland, Vermont: Charles E. Tuttle Company 780 | Introduction
token count:29, content: 2. Lacto-vegetarian. This type of vegetarian diet includes dairy products but not eggs. Lifestyles and Nutrition | 27
token count:11, content: Accessed March 17, 2018. Sports Nutrition | 961
token count:30, content: Weight-bearing exercises requires your body to move against gravity which requires more energy. Men Sports Nutrition | 959
token count:3, content: 828 | Infancy
token count:3, content: 814 | Infancy
token count:29, content: 2010). EH. Net Encyclopedia. http://eh.net/?s=History+of+Food+and+Drug+Regulatio Protecting the Public Health | 1011
token count:25, conten

Most of them seem to be page headers, footers, and reference fragments.

In [15]:
# filter out short chunks
filtered_chunks = df[df['chunk_token_count'] > min_token].to_dict(orient="records")
print(f"Filtered chunks {len(filtered_chunks)}/{len(chunks_metadata)}")

Filtered chunks 1673/1843


# Embedding chunks

Use the pre-trained `all-mpnet-base-v2` model from sentence-transformers as the embedding model.

1. Model card on Hugging Face: https://huggingface.co/sentence-transformers/all-mpnet-base-v2#intended-uses
2. What are embeddings? https://vickiboykis.com/what_are_embeddings/index.html

In [None]:
%pip install sentence-transformers

In [17]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path='all-mpnet-base-v2', device='cpu')



In [18]:
%%time
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
embedding_model.to(device)

for item in tqdm(filtered_chunks):
  item['embedding'] = embedding_model.encode(item["sentence_chunk"])


CPU times: user 32.6 s, sys: 575 ms, total: 33.2 s
Wall time: 38.6 s


## Saving embeddings to file
Since there are fewer than 2,000 embeddings, a CSV file is enough.

In [19]:
filtered_chunks_df = pd.DataFrame(filtered_chunks)
embedding_file = 'text_chunks_and_embeddings_df.csv'
filtered_chunks_df.to_csv(embedding_file, index=False)

In [20]:
filtered_chunks_df['embedding']

Unnamed: 0,embedding
0,"[0.06742427, 0.09022814, -0.005095489, -0.0317..."
1,"[0.05521564, 0.059213977, -0.016616724, -0.020..."
2,"[0.027980184, 0.033981375, -0.020642668, 0.001..."
3,"[0.06825669, 0.0381275, -0.008468541, -0.01813..."
4,"[0.03302645, -0.008497635, 0.009571596, -0.004..."
...,...
1668,"[0.018562254, -0.016427767, -0.012704563, -0.0..."
1669,"[0.03347206, -0.057044085, 0.015148939, -0.010..."
1670,"[0.07705155, 0.009785576, -0.012181741, 0.0010..."
1671,"[0.10304516, -0.016470186, 0.008268461, 0.0377..."


In [21]:
# save to drive
save_to_drive=True
embedding_file_drive = '/content/drive/My Drive/rag_from_scratch/text_chunks_and_embeddings_df.csv'
if save_to_drive:
  import shutil
  from google.colab import drive
  drive.mount('/content/drive')

  shutil.copy(embedding_file, embedding_file_drive)

Mounted at /content/drive


### Chunking and embedding notes

1. How to choose an embedding model?  
Experiment with several candidates, using Hugging Face's MTEB leaderboard as a starting point:
https://huggingface.co/spaces/mteb/leaderboard
2. Other ways of chunking  
For more details, see https://www.pinecone.io/learn/chunking-strategies/  
Some references from LangChain: https://python.langchain.com/docs/modules/data_connection/document_transformers/  
3. Things to consider when creating embeddings  
- Size of input - If you need to embed longer sequences, choose a model with a larger input capacity.
- Size of embedding vector - Larger vectors generally give a better representation but require more compute/storage.
- Size of model - Larger models generally produce better embeddings but require more compute power/time to run.
- Open or closed - Open models can run on your own hardware, whereas closed models can be easier to set up but require an API call to get embeddings.

4. Where to store embeddings  
For a small dataset (roughly under 100,000 examples), keeping the embeddings in memory is good enough, e.g. as an `np.array` or `torch.tensor`.  
For a large dataset or a production environment, use a vector database.
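
To put the storage point in numbers, here is a back-of-the-envelope sketch. It assumes dense float32 vectors of dimension 768, which is what all-mpnet-base-v2 produces; `embedding_memory_mib` is a hypothetical helper, and the figure ignores the text and metadata stored next to the vectors.

```python
def embedding_memory_mib(num_vectors: int, dim: int = 768, bytes_per_value: int = 4) -> float:
    """Size of a dense float32 embedding matrix in MiB (1 MiB = 1024 * 1024 bytes)."""
    return num_vectors * dim * bytes_per_value / (1024 ** 2)

# This notebook's chunk count, the rough in-memory threshold, and a much larger corpus.
for n in (1_673, 100_000, 10_000_000):
    print(f"{n:>10,} vectors -> {embedding_memory_mib(n):,.1f} MiB")
```

The 1,673 chunks of this notebook need only a few MiB, and even 100,000 vectors stay under 300 MiB, which is why an in-memory array is enough at this scale.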

## Similarity Search

In [3]:
import shutil
from google.colab import drive
embedding_file_drive = '/content/drive/My Drive/rag_from_scratch/text_chunks_and_embeddings_df.csv'
drive.mount('/content/drive')
save_to_drive = True

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### load embeddings from pre-saved csv file to memory

In [69]:
import numpy as np
import pandas as pd
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

def load_embedding_from_file():
  if save_to_drive:
    text_chunks_and_embedding_df = pd.read_csv(embedding_file_drive)
  else:
    text_chunks_and_embedding_df = pd.read_csv(embedding_file)
  # embeddings loaded from the CSV are strings that look like "[1 2 3 4 ...]", so parse them back into arrays
  text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

  # convert from df to dict
  page_and_chunks_dict = text_chunks_and_embedding_df.to_dict(orient='records')
  embeddings_tensor = torch.tensor(np.array([item['embedding'] for item in page_and_chunks_dict]), dtype=torch.float32).to(device)
  print(embeddings_tensor.shape)
  return page_and_chunks_dict, embeddings_tensor

page_and_chunks_dict, embeddings_tensor = load_embedding_from_file()


torch.Size([1673, 768])


### load the embedding model (make sure it is the same one used to create the embeddings)

In [16]:
from sentence_transformers import SentenceTransformer

def get_embedding_model(model, device):
  return SentenceTransformer(model_name_or_path=model, device=device)

ebd_model = 'all-mpnet-base-v2'
embedding_model = get_embedding_model(ebd_model, device)

SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

So all-mpnet-base-v2 uses mean pooling by default and also outputs normalized embeddings.

Semantic search converts the text/query to an embedding using the same embedding model, then scores it against all the embeddings from the CSV file with e.g. dot product or cosine similarity to find the top-k results.

Cosine similarity compares "direction" rather than "magnitude", which amounts to normalizing the vectors before taking the dot product. Since the model's `Normalize()` layer already does that, the dot product gives the same scores.

### similarity search

In [67]:
from sentence_transformers import util
import textwrap

def wrap_text(text, wrap_length=80):
    """Return text wrapped to at most wrap_length characters per line."""
    return textwrap.fill(text, wrap_length)

def search_topk_results(query: str, k: int=5):
  # embed the query with the same model that produced the chunk embeddings
  query_embedding = embedding_model.encode(query, convert_to_tensor=True).to(device)

  # embeddings are L2-normalized, so the dot product equals cosine similarity
  dot_scores = util.dot_score(a=query_embedding, b=embeddings_tensor)[0]

  top_results_dot_product = torch.topk(dot_scores, k=k)
  # top_results_dot_product[0] # score
  # top_results_dot_product[1] # index in embeddings_tensor
  return top_results_dot_product

# query = "macronutrients functions"
query = "how to eat like ironman"
top_results_dot_product = search_topk_results(query)
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
  print(f'Score: {score}')
  print(wrap_text(page_and_chunks_dict[idx]['sentence_chunk']))


Score: 0.5774324536323547
• Get your protein from foods such as soybeans, tofu, tempeh, lentils, and
beans, beans, and more beans. Many of these foods are high in zinc too. • Eat
foods fortified with vitamins B12 and D and calcium. Some examples are soy milk
and fortified cereals. • Get enough iron in your diet by eating kidney beans,
lentils, whole-grain cereals, and leafy green vegetables. • To increase iron
absorption, eat foods with vitamin C at the same time. • Don’t forget that
carbohydrates and fats are required in your diet too, especially if you are
training. Eat whole-grain breads, cereals, and pastas. For fats, eat an avocado,
add some olive oil to a salad or stir-fry, or spread some peanut or cashew
butter on a bran muffin. Learning Activities Technology Note: The second edition
of the Human Nutrition Open Educational Resource (OER) textbook features
interactive learning activities.
Score: 0.49949949979782104
Remedies can include increasing the frequency of meals and adding

### dot product vs cosine similarity

Embedding vectors are representations of data with magnitude and direction in high-dimensional space.  
Two of the most common ways to compare them are *dot product* and *cosine similarity*.

| Similarity Measure | Description                                                                                                                                       | Code                                                                                                          |
|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------|
| **Dot Product**    | - Measure of magnitude and direction between two vectors<br>- Vectors that are aligned in direction and magnitude have a higher positive value<br>- Vectors that are opposite in direction and magnitude have a higher negative value | `torch.dot`, `np.dot`, `sentence_transformers.util.dot_score`                                                 |
| **Cosine Similarity** | - Vectors get normalized by magnitude/[Euclidean norm](https://en.wikipedia.org/wiki/Euclidean_vector)/L2 norm so they have unit length and are compared more so on direction<br>- Vectors that are aligned in direction have a value close to 1<br>- Vectors that are opposite in direction have a value close to -1 | `torch.nn.functional.cosine_similarity`, `1 - scipy.spatial.distance.cosine`, `sentence_transformers.util.cos_sim` |

Semantic similarity mainly depends on `direction`, so cosine similarity is the natural measure. Because the embedding model already includes a normalization step (L2 norm), the search code above uses the dot product `sentence_transformers.util.dot_score` directly, which gives the same scores for unit-length vectors.
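
The relationship between the two measures can be checked with a small dependency-free sketch. The helper functions and the two toy vectors are illustrative assumptions; in the notebook itself the same role is played by `util.dot_score` on the model's normalized embeddings.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a: list[float]) -> float:
    """Euclidean (L2) norm."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two norms."""
    return dot(a, b) / (norm(a) * norm(b))

def l2_normalize(a: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the magnitude

print(dot(a, b))                               # grows with magnitude
print(cosine_similarity(a, b))                 # depends on direction only
print(dot(l2_normalize(a), l2_normalize(b)))   # dot product of unit vectors
```

Up to floating-point rounding, the last two values agree, which is why the dot product is sufficient once the embeddings are normalized.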
