# Retrieval Augmented Generation (RAG)

- Retrieval - Find information relevant to a query.
- Augmented - Add that retrieved information to the prompt sent to the LLM.
- Generation - Pass the augmented prompt to an LLM to generate the output.
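The "augmented" step is the easiest one to picture in code. The sketch below is only an illustration: `augment_prompt` and the prompt wording are hypothetical, and the two context strings are paraphrased by hand rather than retrieved by a real system.

```python
def augment_prompt(query: str, context_items: list[str]) -> str:
    """Build a prompt that places retrieved passages ahead of the user query."""
    context = "\n".join(f"- {item}" for item in context_items)
    return (
        "Answer the query using only the context below.\n"
        f"Context:\n{context}\n"
        f"Query: {query}\n"
        "Answer:"
    )


# Hand-written stand-ins for passages a retriever would return.
retrieved = [
    "Nutrient-dense foods contain significant amounts of essential nutrients.",
    "Empty-calorie foods provide many calories and very few nutrients.",
]
prompt = augment_prompt("What is a nutrient-dense food?", retrieved)
print(prompt)
```

The resulting string is what would be sent to the LLM in the generation step.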


## What
A nutrition chatbot built on a from-scratch RAG pipeline and an LLM.
## How
1. Download "nutrition" PDFs.
2. Split the PDF text into chunks for embedding.
3. Embed the chunks: turn them into numerical representations (embeddings) and store them somewhere (whether a vector DB is needed is still an open question).
4. Build a retrieval system that uses vector search to find the chunks of text relevant to a query.
5. Create a prompt from the retrieved text.
6. Send the prompt to an LLM for an answer.
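Steps 3 and 4 can be sketched without any model at all. The example below is a toy under stated assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, and `cosine_similarity`, `retrieve`, and the three sample chunks are hypothetical names and data used only for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks ranked by similarity to the query."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:top_k]


chunks = [
    "pantothenic acid forms coenzyme a",
    "oils contain about 120 calories per tablespoon",
    "complementary foods include baby meats and vegetables",
]
best = retrieve("how many calories do oils contain", chunks)
print(best)
```

A real embedding model replaces `embed` with dense vectors, but the ranking logic stays the same: score every chunk against the query and keep the highest-scoring ones.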

In [2]:
%pip install PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.26.0-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.0-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (24.1 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.26.0


# Document/text processing

In [3]:
import os
import requests

pdf_file = 'human-nutrition-pdf'
pdf_url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

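# download only when the file is not already on disk
if not os.path.exists(pdf_file):
  resp = requests.get(pdf_url)
  resp.raise_for_status() # stop on HTTP errors instead of saving an error page

  with open(pdf_file, 'wb') as f:
    f.write(resp.content)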

print(f"PDF {pdf_file} found")


PDF human-nutrition-pdf found


| Task                      | PyMuPDF                       | PyPDF                      |
| ------------------------- | ----------------------------- | -------------------------- |
| Extract paragraphs neatly | ✅                             | ❌ (loses formatting)       |
| Merge multiple PDFs       | ⚠️ (can do, but not main use) | ✅ Built-in                 |
| Add watermarks            | ✅                             | ✅                          |
| Preview page as PNG       | ✅ `page.get_pixmap()`         | ❌ Not supported            |
| Read form field values    | ⚠️ Limited                    | ✅ `get_form_text_fields()` |
| Add annotation / comment  | ✅                             | ⚠️ Very limited            |


## splitting PDF into pages

In [4]:
import fitz
from tqdm.auto import tqdm

def _process_text(text: str) -> str:
  return text.replace('\n', ' ').strip()

doc = fitz.open(pdf_file)
print(f"total pages: {len(doc)}")

page_and_texts = []
for page_num, page in tqdm(enumerate(doc)):
  text = page.get_text()
  text = _process_text(text)
  page_and_texts.append({
      "page_number": page_num - 41, # offset by 41 so page 0 is the first page after the front matter (table of contents, etc.); no pages are actually skipped
      "page_char_count": len(text),
      "page_word_count": len(text.split(' ')),
      "page_sentence_count": len(text.split('. ')), # rough count: only splits on '. '
      "page_token_count": len(text)//4, # approximation: per OpenAI, 1 token ~ 4 characters of English text https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
      "text": text
  })


total pages: 1208


0it [00:00, ?it/s]

In [27]:
page_and_texts[55]

{'page_number': 14,
 'page_char_count': 948,
 'page_word_count': 168,
 'page_sentence_count': 6,
 'page_token_count': 237,
 'text': 'Image by  David De  Veroli on  unsplash.co m / CC0  Food Quality  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  One measurement of food quality is the amount of nutrients it  contains relative to the amount of energy it provides. High-quality  foods are nutrient-dense, meaning they contain significant amounts  of one or more essential nutrients relative to the amount of calories  they provide. Nutrient-dense foods are the opposite of “empty- calorie” foods such as carbonated sugary soft drinks, which provide  many calories and very little, if any, other nutrients. Food quality is  additionally associated with its taste, texture, appearance, microbial  content, and how much consumers like it.  Food: A Better Source of Nutrients  It is better to get all your micronutrients from the foods you eat  as op

In [5]:
# get a rough exploratory data analysis(EDA) of the data
import pandas as pd

df = pd.DataFrame(page_and_texts)
df.head()
df.describe().round(2)


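|       | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count |
| ----- | ----------- | --------------- | --------------- | ------------------- | ---------------- |
| count | 1208.0      | 1208.0          | 1208.0          | 1208.0              | 1208.0           |
| mean  | 562.5       | 1148.0          | 198.3           | 9.97                | 286.62           |
| std   | 348.86      | 560.38          | 95.76           | 6.19                | 140.09           |
| min   | -41.0       | 0.0             | 1.0             | 1.0                 | 0.0              |
| 25%   | 260.75      | 762.0           | 134.0           | 4.0                 | 190.0            |
| 50%   | 562.5       | 1231.5          | 214.5           | 10.0                | 307.5            |
| 75%   | 864.25      | 1603.5          | 271.0           | 14.0                | 400.25           |
| max   | 1166.0      | 2308.0          | 429.0           | 32.0                | 577.0            |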


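The average page is about 287 tokens, so a typical page fits within the 384-token input limit of the all-mpnet-base-v2 embedding model. Not every page does, though: by the same estimate the 75th percentile is about 400 tokens, so at least a quarter of the pages exceed the limit.  
PS. The embedding model silently truncates input beyond 384 tokens, so the text past the limit is not embedded and that information is lost.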
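A quick way to see how many texts would exceed the model's input limit is to apply the same 4-characters-per-token estimate to each one. The sketch below is an illustration only: `approx_token_count`, `fraction_over_limit`, and the sample strings are hypothetical, and a real tokenizer would give different counts.

```python
def approx_token_count(text: str) -> int:
    """Approximate token count with the 1 token ~ 4 characters rule of thumb."""
    return len(text) // 4


def fraction_over_limit(texts: list[str], max_tokens: int = 384) -> float:
    """Share of texts whose approximate token count exceeds max_tokens."""
    if not texts:
        return 0.0
    over = sum(1 for text in texts if approx_token_count(text) > max_tokens)
    return over / len(texts)


# Hypothetical page texts of different lengths (in characters).
sample_pages = ["a" * 948, "a" * 1600, "a" * 2308, "a" * 400]
print(fraction_over_limit(sample_pages))
```

Running the same function over the real page texts would show how much of the book is at risk of truncation if whole pages were embedded.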

## further text processing (splitting pages into sentences)

### why sentences?
1. Sentences are easier to handle than pages, e.g. for keeping the token count within the embedding model's limit, especially when pages are densely filled with text.
2. We can be specific and find out which groups of sentences were used in a RAG pipeline.

### Use [spaCy](https://spacy.io/) to break text into sentences



In [6]:
from spacy.lang.en import English

In [7]:
nlp = English()
nlp.add_pipe('sentencizer')

for page in tqdm(page_and_texts):
  page['sentences'] = list(nlp(page['text']).sents)
  page['sentences'] = [str(sent) for sent in page['sentences']]

  page['page_sentence_count_spacy'] = len(page['sentences'])


  0%|          | 0/1208 [00:00<?, ?it/s]

In [8]:
df = pd.DataFrame(page_and_texts)
df.describe().round(2)

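|       | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | page_sentence_count_spacy |
| ----- | ----------- | --------------- | --------------- | ------------------- | ---------------- | ------------------------- |
| count | 1208.0      | 1208.0          | 1208.0          | 1208.0              | 1208.0           | 1208.0                    |
| mean  | 562.5       | 1148.0          | 198.3           | 9.97                | 286.62           | 10.32                     |
| std   | 348.86      | 560.38          | 95.76           | 6.19                | 140.09           | 6.3                       |
| min   | -41.0       | 0.0             | 1.0             | 1.0                 | 0.0              | 0.0                       |
| 25%   | 260.75      | 762.0           | 134.0           | 4.0                 | 190.0            | 5.0                       |
| 50%   | 562.5       | 1231.5          | 214.5           | 10.0                | 307.5            | 10.0                      |
| 75%   | 864.25      | 1603.5          | 271.0           | 14.0                | 400.25           | 15.0                      |
| max   | 1166.0      | 2308.0          | 429.0           | 32.0                | 577.0            | 28.0                      |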
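The two sentence counts differ slightly (mean 9.97 for the `'. '` split versus 10.32 for spaCy). One plausible contributor is that splitting on `'. '` ignores sentences ending in `?` or `!`. The sketch below illustrates that effect with a small regex splitter; `naive_split` and `regex_split` are hypothetical helpers, and the regex is a dependency-free stand-in, not a description of how spaCy's sentencizer works internally.

```python
import re


def naive_split(text: str) -> list[str]:
    """Split on '. ' only, as in the page_sentence_count estimate."""
    return text.split(". ")


def regex_split(text: str) -> list[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


text = "Is fat bad? Not always. Oils are essential for health!"
print(len(naive_split(text)))
print(len(regex_split(text)))
```

Here the naive split finds two pieces while the regex split finds three, because the question mark is treated as a sentence boundary.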


## chunking sentences together

In [14]:
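# a range with a step of 10 yields the start index of each chunk
list(range(0, 25, 10))

# slicing past the end of a list is safe: it returns the remaining items
a = list(range(25))
a[20:130]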

[20, 21, 22, 23, 24]

In [15]:
num_sentence_chunk_size = 10 # arbitrary; chosen because ~10 sentences is about one average page, which fits the embedding model's token limit

for page in tqdm(page_and_texts):
  page['sentence_chunks'] = [page['sentences'][i:i+num_sentence_chunk_size] for i in range(0, len(page['sentences']), num_sentence_chunk_size)]
  page['num_chunks'] = len(page['sentence_chunks'])

  0%|          | 0/1208 [00:00<?, ?it/s]
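The list comprehension in the chunking cell can be pulled out into a small reusable helper. This is a minimal sketch: `split_list` is a hypothetical name, and the 25 placeholder sentences simply mirror the `range(0, 25, 10)` experiment.

```python
def split_list(items: list[str], chunk_size: int) -> list[list[str]]:
    """Split items into consecutive chunks of at most chunk_size elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


sentences = [f"Sentence {n}." for n in range(25)]
chunks = split_list(sentences, 10)
print([len(chunk) for chunk in chunks])  # [10, 10, 5]
```

The last chunk is shorter than the others whenever the sentence count is not a multiple of the chunk size, which is one source of the very short chunks that show up later.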

In [16]:
df = pd.DataFrame(page_and_texts)
df.describe().round(2)

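|       | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | page_sentence_count_spacy | num_chunks |
| ----- | ----------- | --------------- | --------------- | ------------------- | ---------------- | ------------------------- | ---------- |
| count | 1208.0      | 1208.0          | 1208.0          | 1208.0              | 1208.0           | 1208.0                    | 1208.0     |
| mean  | 562.5       | 1148.0          | 198.3           | 9.97                | 286.62           | 10.32                     | 1.53       |
| std   | 348.86      | 560.38          | 95.76           | 6.19                | 140.09           | 6.3                       | 0.64       |
| min   | -41.0       | 0.0             | 1.0             | 1.0                 | 0.0              | 0.0                       | 0.0        |
| 25%   | 260.75      | 762.0           | 134.0           | 4.0                 | 190.0            | 5.0                       | 1.0        |
| 50%   | 562.5       | 1231.5          | 214.5           | 10.0                | 307.5            | 10.0                      | 1.0        |
| 75%   | 864.25      | 1603.5          | 271.0           | 14.0                | 400.25           | 15.0                      | 2.0        |
| max   | 1166.0      | 2308.0          | 429.0           | 32.0                | 577.0            | 28.0                      | 3.0        |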


## build new metadata for each chunk

In [33]:
import re

chunks_metadata = []
for page in tqdm(page_and_texts):
  for sent_chunk in page['sentence_chunks']:
    chunk_dict = {"page_number": page['page_number']}
    chunk_text = ''.join(sent_chunk).replace('  ', ' ').strip()
    chunk_text = re.sub(r'\.([A-Z])', r'. \1', chunk_text) # ".A" -> ". A"
    chunk_dict["sentence_chunk"] = chunk_text
    chunk_dict['chunk_char_count'] = len(chunk_text)
    chunk_dict['chunk_word_count'] = len(chunk_text.split(' '))
    chunk_dict['chunk_token_count'] = len(chunk_text) // 4

    chunks_metadata.append(chunk_dict)

print(f"Total {len(chunks_metadata)} chunks")



  0%|          | 0/1208 [00:00<?, ?it/s]

Total 1843 chunks
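The join-then-patch step in the chunk-building cell is easier to follow on a tiny input. The sketch below assumes the sentence strings carry no trailing whitespace; `join_sentences` is a hypothetical wrapper around the same two lines, and the three sentences are made up for illustration.

```python
import re


def join_sentences(sent_chunk: list[str]) -> str:
    """Join sentences with no separator, then patch '.A' into '. A'."""
    text = "".join(sent_chunk).replace("  ", " ").strip()
    return re.sub(r"\.([A-Z])", r". \1", text)


sent_chunk = [
    "Oils are essential.",
    "They contain calories.",
    "120 calories per tablespoon.",
]
print(join_sentences(sent_chunk))
print(" ".join(sent_chunk))
```

The regex only restores the space before an uppercase letter, so a sentence that starts with a digit stays glued to the previous one. Joining with `" "` avoids that case, at the cost of slightly different character counts.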


In [34]:
chunks_metadata[900]

{'page_number': 570,
 'sentence_chunk': 'Pantothenic Acid (Vitamin B5) makes up coenzyme A, which carries the carbons of glucose, fatty acids, and amino acids into the citric acid cycle as Acetyl-CoA. Pantothenic acid forms coenzyme A, which is the main carrier of carbon molecules in a cell. Acetyl-CoA is the carbon carrier of glucose, fatty acids, and amino acids into the citric acid cycle (Figure 9.14 “Pantothenic Acid’s Role in the Citric Acid Cycle”). Coenzyme A is also involved in the synthesis of lipids, cholesterol, and acetylcholine (a neurotransmitter). A Pantothenic Acid deficiency is exceptionally rare. Signs and symptoms include fatigue, irritability, numbness, muscle pain, and cramps. You may have seen pantothenic acid on many ingredients lists for skin and hair care products; however there is no good scientific evidence that pantothenic acid improves human skin or hair. Dietary Reference Intakes Because there is little information on the requirements for pantothenic acids

In [35]:
df = pd.DataFrame(chunks_metadata)
df.describe().round(2)

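|       | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
| ----- | ----------- | ---------------- | ---------------- | ----------------- |
| count | 1843.0      | 1843.0           | 1843.0           | 1843.0            |
| mean  | 583.38      | 734.44           | 112.33           | 183.23            |
| std   | 347.79      | 447.54           | 71.22            | 111.89            |
| min   | -41.0       | 12.0             | 3.0              | 3.0               |
| 25%   | 280.5       | 315.0            | 44.0             | 78.0              |
| 50%   | 586.0       | 746.0            | 114.0            | 186.0             |
| 75%   | 890.0       | 1118.5           | 173.0            | 279.0             |
| max   | 1166.0      | 1831.0           | 297.0            | 457.0             |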


Some chunks have very few tokens (the minimum is 3). Sample the short ones to check whether they can be filtered out.

In [46]:
min_token = 50
for _, row in df[df['chunk_token_count'] <= min_token].sample(10).iterrows():
  print(f"token count:{row['chunk_token_count']}, content: {row['sentence_chunk']}")

token count:11, content: 978 | Food Supplements and Food Replacements
token count:16, content: Health Consequences and Benefits of High-Carbohydrate Diets | 267
token count:45, content: Although oils are essential for health they do contain about 120 calories per tablespoon. It is vital to balance oil consumption with total caloric intake. The MyPlate Planner | 747
token count:38, content: https://healthyforgood.heart.org/Eat- smart/Articles/Fish-and-Omega-3-Fatty-Acids. Updated March 24, 2017. Accessed October 5, 2017. Tools for Change | 335
token count:25, content: The Polynesian Family System in Ka-‘u. Rutland, Vermont: Charles E. Tuttle Company 780 | Introduction
token count:46, content: compares the recommended vitamins and minerals for lactating women to the levels for nonpregnant and pregnant women. Table 13.3 Recommended Nutrient Intakes during Lactation 824 | Infancy
token count:16, content: Complementary foods include baby meats, vegetables, Infancy | 837
token count:8, conte

Most of them seem to be headers, footers, and similar page furniture.
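Several of the sampled chunks consist of nothing but a page number and a section title separated by `|`. As a rough alternative to a pure length cutoff, such chunks could be detected by pattern. The sketch below is an assumption based only on the samples printed above: `looks_like_running_header` and both regular expressions are hypothetical and would need checking against the full data.

```python
import re

# Patterns inferred from the sampled chunks; assumptions, not a complete rule set.
PAGE_THEN_TITLE = re.compile(r"^\d+ \| [^.|]+$")  # e.g. "978 | Food Supplements ..."
TITLE_THEN_PAGE = re.compile(r"^[^.|]+ \| \d+$")  # e.g. "... High-Carbohydrate Diets | 267"


def looks_like_running_header(chunk_text: str) -> bool:
    """Return True when the whole chunk is just a page number and a title."""
    text = chunk_text.strip()
    return bool(PAGE_THEN_TITLE.match(text) or TITLE_THEN_PAGE.match(text))


samples = [
    "978 | Food Supplements and Food Replacements",
    "Health Consequences and Benefits of High-Carbohydrate Diets | 267",
    "It is vital to balance oil consumption with total caloric intake. The MyPlate Planner | 747",
]
for sample in samples:
    print(looks_like_running_header(sample), "|", sample)
```

The third sample is not flagged because it contains a full sentence before the footer. A chunk that is a sentence fragment without a period followed by a footer would still be flagged, so this check complements the token threshold rather than replacing it.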

In [50]:
# filter out short chunks
filtered_chunks = df[df['chunk_token_count'] > min_token].to_dict(orient="records")
filtered_chunks[70]

{'page_number': 27,
 'sentence_chunk': 'influence lasts through adulthood. People make food choices based on how they see others and want others to see them. For example, individuals who are surrounded by others who consume fast food are more likely to do the same. • Health concerns. Some people have significant food allergies, to peanuts for example, and need to avoid those foods. Others may have developed health issues which require them to follow a low salt diet. In addition, people who have never worried about their weight have a very different approach to eating than those who have long struggled with excess weight. • Emotions. There is a wide range in how emotional issues affect eating habits. When faced with a great deal of stress, some people tend to overeat, while others find it hard to eat at all.',
 'chunk_char_count': 778,
 'chunk_word_count': 135,
 'chunk_token_count': 194}