# Local RAG pipeline


## 0. Intro


### What is RAG?

- Retrieval - Find information relevant to a given query
- Augmented - Add the retrieved information to the input (prompt) we send to an LLM
- Generation - Pass the augmented prompt to an LLM to generate an output
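
To make the three steps concrete, here is a minimal toy sketch. It is not the pipeline built later in this notebook: retrieval here is plain word overlap rather than vector search, `generate` is a placeholder that does not call any model, and all function names and example documents are made up for illustration.

```python
def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Toy retrieval: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def augment(query: str, context: list[str]) -> str:
    """Build a prompt that places the retrieved passages before the question."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Use the context to answer.\nContext:\n{context_block}\nQuestion: {query}"


def generate(prompt: str) -> str:
    """Placeholder for an LLM call (assumption: no real model is used here)."""
    return f"[an LLM would answer here, given a prompt of {len(prompt)} characters]"


# Hypothetical mini knowledge base
documents = [
    "Vitamin C is found in citrus fruits.",
    "Protein is made of amino acids.",
    "Iodine is a nutrient found in seaweed.",
]

query = "Which foods contain vitamin C?"
context = retrieve(query, documents)   # Retrieval
prompt = augment(query, context)       # Augmentation
answer = generate(prompt)              # Generation (stubbed)
print(prompt)
print(answer)
```

The point of the sketch is only the data flow: the query selects passages, the passages are written into the prompt, and the prompt is what the model sees.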


### Why RAG?

Improve the generated outputs of LLMs.

1. Reduce hallucinations - plausible-looking text that is not necessarily factual
2. Work with custom data rather than only the internet-scale data an LLM was trained on


### What can RAG be used for?

1. Customer support Q&A chat
2. Email chain analysis
3. Company internal documentation chat
4. Textbook Q&A


### Why local?

1. Privacy - private documentation that you don't want to send to an API
2. Speed - no need to send data across the internet
3. Cost - no API fees when running on your own hardware


### To do list

- Build a RAG pipeline which enables us to chat with a PDF document, specifically an open-source nutrition textbook, ~1200 pages long.

- Write the code to:

1. Open a PDF document (you could use almost any PDF here).
2. Format the text of the PDF so it is ready for an embedding model (this process is known as text splitting/chunking).
3. Embed all of the chunks of text, turning them into numerical representations (embeddings) that we can store for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on passages from the textbook.


## 1. Document/text preprocessing


### Import and open PDF


In [19]:
# Import PDF

import os
import requests

# Get pdf document path
pdf_path = "data/human-nutrition-text.pdf"

# Download the PDF if it isn't already on disk
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # Enter the URL of the pdf
    url = "https://pressbooks.oer.hawaii.edu/humannutrition/open/download?type=pdf"

    # The local file name to save downloaded file
    filename = pdf_path

    # Make sure the target directory (e.g. "data/") exists before writing
    os.makedirs(os.path.dirname(filename), exist_ok=True)

    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Open the file in binary mode and save the response body
        with open(filename, "wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {response.status_code}")

else:
    print(f"[INFO] File {pdf_path} exists")

[INFO] File data/human-nutrition-text.pdf exists


In [20]:
# Open PDF
import fitz  # PyMuPDF
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text"""
    cleaned_text = text.replace("\n", " ").strip()

    # More text formatting functions can go in here
    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Opens a PDF and returns one dict of text and simple stats per page"""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # Shift the 0-based index by 17, so the first PDF page is numbered -17
        pages_and_texts.append({"page_number": page_number - 17,
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # rough estimate: 1 token ~ 4 characters
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]


[{'page_number': -17,
  'page_char_count': 15,
  'page_word_count': 2,
  'page_sentence_count_raw': 1,
  'page_token_count': 3.75,
  'text': 'Human Nutrition'},
 {'page_number': -16,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]
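
The per-page statistics are simple string operations, which explains a quirk in the output above: the empty page reports a word count of 1. The sketch below recomputes the same fields on two sample strings; `page_stats` is a hypothetical helper name, not part of the notebook.

```python
def page_stats(text: str) -> dict:
    """Recompute the per-page stats with the same string operations as above."""
    cleaned = text.replace("\n", " ").strip()
    return {
        "page_char_count": len(cleaned),
        "page_word_count": len(cleaned.split(" ")),
        "page_sentence_count_raw": len(cleaned.split(". ")),
        "page_token_count": len(cleaned) / 4,  # rough estimate: 1 token ~ 4 characters
    }


title_stats = page_stats("Human Nutrition\n")
blank_stats = page_stats("")

print(title_stats)
# "".split(" ") returns [""], a list with one empty string,
# so an empty page still counts as 1 word and 1 sentence.
print(blank_stats)
```

If exact zero counts matter, `text.split()` with no argument returns an empty list for an empty string, at the cost of changing the counts reported above.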

In [21]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 16,
  'page_char_count': 1773,
  'page_word_count': 211,
  'page_sentence_count_raw': 48,
  'page_token_count': 443.25,
  'text': '1. Lacto-ovo vegetarian. This is the most common form. This type of vegetarian diet includes the animal foods eggs and dairy products. 2. Lacto-vegetarian. This type of vegetarian diet includes dairy products but not eggs. 3. Ovo-vegetarian. This type of vegetarian diet includes eggs but not dairy products. 4. Vegan. This type of vegetarian diet does not include dairy, eggs, or any type of animal product or animal by-product. Lifestyles and Nutrition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted. Notes 1. https://health.gov/paguidelines/ 2. http://www.csep.ca/english/view.asp?x=804 3. Centers for Disease Control and Prevention (CDC). “Smoking and Tobacco Use.” http://www.cdc.gov/tobac

In [22]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | -17 | 15 | 2 | 1 | 3.75 | Human Nutrition |
| 1 | -16 | 0 | 1 | 1 | 0.0 | |
| 2 | -15 | 188 | 26 | 1 | 47.0 | Human Nutrition UNIVERSITY OF HAWAI‘I AT MĀNOA... |
| 3 | -14 | 607 | 100 | 5 | 151.75 | Human Nutrition by University of Hawai‘i at Mā... |
| 4 | -13 | 827 | 130 | 4 | 206.75 | Contents Preface xi About the Contributors xii... |


In [23]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
| --- | --- | --- | --- | --- | --- |
| count | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 |
| mean | 316.0 | 1756.98 | 270.41 | 16.31 | 439.25 |
| std | 192.69 | 1211.29 | 188.06 | 13.79 | 302.82 |
| min | -17.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 149.5 | 774.5 | 112.5 | 6.0 | 193.62 |
| 50% | 316.0 | 1584.0 | 249.0 | 14.0 | 396.0 |
| 75% | 482.5 | 2750.5 | 424.5 | 23.0 | 687.62 |
| max | 649.0 | 4555.0 | 757.0 | 99.0 | 1138.75 |


### Why do we care about token count?

1. Embedding models can't process an unlimited number of tokens

- This notebook uses the sentence-transformers/all-mpnet-base-v2 embedding model
- It accepts sequences of up to 384 tokens; longer inputs are truncated

2. LLMs can't process an unlimited number of tokens either, since their context windows are finite
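
As a rough illustration of the token budget, the sketch below applies the 4-characters-per-token rule of thumb used elsewhere in this notebook and compares the estimate against the 384-token limit. The helper names are hypothetical, and the estimate is only an approximation of what a real tokenizer would produce.

```python
MAX_SEQ_LENGTH = 384   # token limit of the embedding model, as stated above
CHARS_PER_TOKEN = 4    # rough rule of thumb, not an exact tokenizer count


def estimate_tokens(text: str) -> float:
    """Estimate the token count from the character count."""
    return len(text) / CHARS_PER_TOKEN


def fits_in_context(text: str, max_tokens: int = MAX_SEQ_LENGTH) -> bool:
    """True if the estimated token count is within the model's limit."""
    return estimate_tokens(text) <= max_tokens


short_text = "Human Nutrition"
long_text = "a" * 4555  # same character count as the longest page in the stats above

print(estimate_tokens(short_text), fits_in_context(short_text))
print(estimate_tokens(long_text), fits_in_context(long_text))
```

A page of 4555 characters comes out at roughly 1139 estimated tokens, about three times the limit, which is why whole pages are split into smaller chunks before embedding.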


### Further text processing (splitting pages into sentences)

- A simple approach is to split at ". "; a more robust one is to use an NLP library such as spaCy or NLTK.
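
To see why splitting at ". " alone is fragile, consider the small sketch below. The example text is made up; it contains three sentences, but abbreviations such as "Dr." and "Fig." also end in a period followed by a space.

```python
# Hypothetical text with three sentences and two abbreviations
text = "The RDA is 0.8 g per kg. Dr. Smith agrees. See Fig. 2 for details."

naive_sentences = text.split(". ")

for piece in naive_sentences:
    print(repr(piece))

# The decimal "0.8" survives (no space after the period),
# but "Dr." and "Fig." each trigger a spurious split.
print(len(naive_sentences))
```

The naive split produces five pieces instead of three. Whether a given library handles such cases correctly is worth checking on a sample of your own text.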


In [24]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer pipeline
nlp.add_pipe("sentencizer")

# Create document instance as an example
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2

# Print out our sentences split
list(doc.sents)

[This is a sentence., This is another sentence.]

In [25]:
pages_and_texts[500]

{'page_number': 483,
 'page_char_count': 3794,
 'page_word_count': 655,
 'page_sentence_count_raw': 33,
 'page_token_count': 948.5,
 'text': 'ENERGY AND MACRONUTRIENTS Energy needs relative to size are much greater in an infant than an adult. A baby’s resting metabolic rate is two times that of an adult. The RDA to meet energy needs changes as an infant matures and puts on more weight. The IOM uses a set of equations to calculate the total energy expenditure and resulting energy needs. For example, the equation for the first three months of life is (89 x weight [kg] −100) + 175 kcal. Based on these equations, the estimated energy requirement for infants from zero to six months of age is 472 to 645 kilocalories per day for boys and 438 to 593 kilocalories per day for girls. For infants ages six to twelve months, the estimated requirement is 645 to 844 kilocalories per day for boys and 593 to 768 kilocalories per day for girls. From the age one to age two, the estimated requirement rises

In [26]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure sentences are strings (default type is spaCy datatype)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])


In [27]:
random.sample(pages_and_texts, k=1)
# Has been split into sentences

[{'page_number': 241,
  'page_char_count': 1232,
  'page_word_count': 192,
  'page_sentence_count_raw': 10,
  'page_token_count': 308.0,
  'text': 'charged molecules, such as protons (H+), calcium, potassium, and magnesium which are also circulating in the blood. Albumin acts as a buffer against abrupt changes in the concentrations of these molecules, thereby balancing blood pH and maintaining the status quo. The protein hemoglobin also participates in acid-base balance by binding and releasing protons. TRANSPORT Albumin and hemoglobin also play a role in molecular transport. Albumin chemically binds to hormones, fatty acids, some vitamins, essential minerals, and drugs, and transports them throughout the circulatory system. Each red blood cell contains millions of hemoglobin molecules that bind oxygen in the lungs and transport it to all the tissues in the body. A cell’s plasma membrane is usually not permeable to large polar molecules, so to get the required nutrients and molecules i

In [28]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2) 

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
| --- | --- | --- | --- | --- | --- | --- |
| count | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 |
| mean | 316.0 | 1756.98 | 270.41 | 16.31 | 439.25 | 16.21 |
| std | 192.69 | 1211.29 | 188.06 | 13.79 | 302.82 | 13.64 |
| min | -17.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 149.5 | 774.5 | 112.5 | 6.0 | 193.62 | 6.0 |
| 50% | 316.0 | 1584.0 | 249.0 | 14.0 | 396.0 | 14.0 |
| 75% | 482.5 | 2750.5 | 424.5 | 23.0 | 687.62 | 23.0 |
| max | 649.0 | 4555.0 | 757.0 | 99.0 | 1138.75 | 101.0 |


### Splitting and chunking sentences together in groups of 10

- Smaller groups of sentences are easier to filter and inspect
- Chunks of about 10 sentences are more likely to fit into the embedding model's context window
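
A quick back-of-the-envelope check, using the mean values from the page statistics above, suggests why 10 sentences is a reasonable group size. This is only an estimate based on averages and on the 4-characters-per-token approximation.

```python
# Mean values taken from the describe() output above
mean_tokens_per_page = 439.25
mean_sentences_per_page = 16.21   # spaCy sentence count

sentences_per_chunk = 10
max_seq_length = 384              # embedding model limit stated earlier

tokens_per_sentence = mean_tokens_per_page / mean_sentences_per_page
estimated_chunk_tokens = tokens_per_sentence * sentences_per_chunk

print(f"~{tokens_per_sentence:.1f} tokens per sentence")
print(f"~{estimated_chunk_tokens:.0f} tokens per 10-sentence chunk")
print(f"Under the limit on average: {estimated_chunk_tokens < max_seq_length}")
```

On average a 10-sentence chunk lands at roughly 271 estimated tokens, under the 384-token limit. Averages hide outliers, though: the chunk statistics later in this section report a maximum of 765 estimated tokens, so some chunks would still be truncated.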


In [29]:
# Define split size
num_sentence_chunk_size = 10

# Create a function that splits a list into consecutive sublists of at most split_size items
# 20 -> 10, 10
# 25 -> 10, 10, 5
def split_list(input_list: list,
               split_size: int = num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i+split_size] for i in range(0, len(input_list), split_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [30]:
# Loop through pages and text and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         split_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

random.sample(pages_and_texts, k=1)


[{'page_number': 533,
  'page_char_count': 1309,
  'page_word_count': 160,
  'page_sentence_count_raw': 39,
  'page_token_count': 327.25,
  'text': 'Middle Age by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License, except where otherwise noted. Notes 1. Polan EU, Taylor DR. Journey Across the Life Span: Human Development and Health Promotion. Philadelphia: F. A. Davis Company; 2003, 192–93. 2. Polan EU, Taylor DR. Journey Across the Life Span: Human Development and Health Promotion. Philadelphia: F. A. Davis Company; 2003, 212–213. 3. Drewnowski A, Darmon, N. Food Choices and Diet Cost: an Economic Analysis. The Journal of Nutrition. 2005; 135(4), 900-904. http://jn.nutrition.org/content/135/4/900.full. Accessed December 12, 2017. 4. Voutilainen S, Nurmi T, Mursu J, Rissanen, TH. Carotenoids and Cardiovascular Health. Am J Clin Nutr. 2006; 83, 1265–71. http://www.aj

In [31]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 |
| mean | 316.0 | 1756.98 | 270.41 | 16.31 | 439.25 | 16.21 | 2.11 |
| std | 192.69 | 1211.29 | 188.06 | 13.79 | 302.82 | 13.64 | 1.38 |
| min | -17.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 149.5 | 774.5 | 112.5 | 6.0 | 193.62 | 6.0 | 1.0 |
| 50% | 316.0 | 1584.0 | 249.0 | 14.0 | 396.0 | 14.0 | 2.0 |
| 75% | 482.5 | 2750.5 | 424.5 | 23.0 | 687.62 | 23.0 | 3.0 |
| max | 649.0 | 4555.0 | 757.0 | 99.0 | 1138.75 | 101.0 | 11.0 |


### Splitting each chunk into its own item

- Each chunk of sentences gets its own embedding (numerical representation), which gives us a good level of granularity at retrieval time


In [32]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences into a paragraph-like string (no separator, so spaces between sentences are lost here)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()

        # Restore the space after a period that is directly followed by a capital letter (".A" -> ". A")
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)

        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get some stats on our chunks
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # rough estimate: 1 token ~ 4 characters

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)



1409
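
The join-and-regex step in the cell above is easier to follow on a tiny example. The sketch below uses made-up sentences and shows both what the regex repairs and what it misses; the `" ".join(...)` line is an alternative shown for comparison, not what the cell above does.

```python
import re

# Hypothetical sentences, already stripped of surrounding whitespace
sentence_chunk = ["This is one.", "This is two.", "Is it?", "Yes."]

# Step 1: join with no separator, as in the cell above
joined = "".join(sentence_chunk)

# Step 2: re-insert a space after any period directly followed by a capital letter
repaired = re.sub(r"\.([A-Z])", r". \1", joined)

# Alternative for comparison: join with a single space instead
spaced = " ".join(sentence_chunk)

print(joined)    # sentences run together
print(repaired)  # periods are repaired, the question mark is not
print(spaced)
```

Because the pattern only looks for a period, sentences ending in "?" or a closing quote stay glued to the next one, which matches chunks such as "refined grains?This is a hotly debated topic" seen in the sample output further down.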

In [33]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 20,
  'sentence_chunk': 'they gather additional evidence from multiple sources and finally come up with a conclusion. This organized process of inquiry used in science is called the scientific method. Figure 1.2 Scientific Method Steps In 1811, French chemist Bernard Courtois was isolating saltpeter for producing gunpowder to be used by Napoleon’s army. To carry out this isolation, he burned some seaweed and in the process, observed an intense violet vapor that crystallized when he exposed it to a cold surface. He sent the violet crystals to an expert on gases, Joseph Gay-Lussac, who identified the crystal as a new element. It was named iodine, the Greek word for violet. The following scientific record is some of what took place in order to conclude that iodine is a nutrient. Observation. Eating seaweed is a cure for goiter, a gross enlargement of the thyroid gland in the neck. Hypothesis.',
  'chunk_char_count': 877,
  'chunk_word_count': 145,
  'chunk_token_count': 2

In [34]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
| --- | --- | --- | --- | --- |
| count | 1409.0 | 1409.0 | 1409.0 | 1409.0 |
| mean | 339.14 | 830.21 | 127.02 | 207.55 |
| std | 194.96 | 535.48 | 86.02 | 133.87 |
| min | -17.0 | 3.0 | 1.0 | 0.75 |
| 25% | 170.0 | 344.0 | 42.0 | 86.0 |
| 50% | 337.0 | 848.0 | 132.0 | 212.0 |
| 75% | 514.0 | 1222.0 | 193.0 | 305.5 |
| max | 649.0 | 3060.0 | 483.0 | 765.0 |


### Filter out sentence chunks of 30 tokens or fewer


In [35]:
# Show random chunks of 30 tokens or fewer
min_token_length = 30
for _, row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token Count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Chunk token Count: 6.75 | Text: CHAPTER 17. FOOD SAFETY 571
Chunk token Count: 0.75 | Text: 186
Chunk token Count: 25.5 | Text: https://www.ncbi.nlm.nih.gov/pubmed/20047325. Accessed September 22, 2017. Central Nervous System | 60
Chunk token Count: 3.75 | Text: Human Nutrition
Chunk token Count: 25.25 | Text: https://www.choosemyplate.gov/fruits-nutrients-health. Accessed February 16, 2018. Introduction | 526
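
The threshold logic can also be sketched without pandas. The example below reuses three of the short chunks printed above plus one longer sentence from the textbook, and recomputes the token estimate as characters divided by 4. The variable names `kept` and `dropped` are illustrative only.

```python
min_token_length = 30

# Three short chunks from the output above, plus one longer passage
chunks = [
    {"sentence_chunk": "CHAPTER 17. FOOD SAFETY 571"},
    {"sentence_chunk": "186"},
    {"sentence_chunk": "Human Nutrition"},
    {"sentence_chunk": "Albumin acts as a buffer against abrupt changes in the "
                       "concentrations of these molecules, thereby balancing "
                       "blood pH and maintaining the status quo."},
]

# Same estimate as in the notebook: 1 token ~ 4 characters
for chunk in chunks:
    chunk["chunk_token_count"] = len(chunk["sentence_chunk"]) / 4

# Strictly greater than the threshold is kept; 30 tokens or fewer is dropped
kept = [c for c in chunks if c["chunk_token_count"] > min_token_length]
dropped = [c for c in chunks if c["chunk_token_count"] <= min_token_length]

print(f"kept: {len(kept)}, dropped: {len(dropped)}")
for c in dropped:
    print(c["chunk_token_count"], "|", c["sentence_chunk"])
```

Headers, page numbers, and stray titles fall below the threshold, while ordinary body text passes. The cut-off of 30 is a judgment call; a different document may need a different value.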


In [36]:
# Keep only chunks above the minimum token length and convert back to a list of dicts
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
random.sample(pages_and_chunks_over_min_token_len, k=1)

[{'page_number': 163,
  'sentence_chunk': 'Health Consequences and Benefits of High-Carbohydrate Diets Can America blame its obesity epidemic on the higher consumption of added sugars and refined grains?This is a hotly debated topic by both the scientific community and the general public. In this section, we will give a brief overview of the scientific evidence. ADDED SUGARS Figure 4.13 Sugar Consumption (In Teaspoons) From Various Sources The Food and Nutrition Board of the Institute of Medicine (IOM) defines added sugars as “sugars and syrups that are added to foods during processing or preparation.”The IOM goes on to state, “Major sources of added sugars include soft drinks, sports drinks, cakes, cookies, pies, fruitades, fruit punch, dairy desserts, and candy.”Processed foods, even microwaveable dinners, also contain added sugars. Added sugars do not include sugars that occur naturally in whole foods (such as an apple), but do include natural sugars such as brown sugar, corn syrup,

## 2. Document Embedding
