# Create and run a local RAG pipeline from scratch


## Intro


### What is RAG?

- Retrieval - Find relevant info given a query
- Augmented - Take relevant info and augment our input (prompt) to an LLM with that relevant info
- Generation - Pass the augmented prompt to an LLM so it generates an answer grounded in the retrieved info
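
The three steps above can be sketched end to end with toy stand-ins. This is only an illustration: `retrieve`, `augment` and `generate` are hypothetical names, the retrieval is simple word overlap rather than the vector search built later, and `generate` is a placeholder rather than a real LLM call.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of words (punctuation removed)."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Retrieval: rank documents by word overlap with the query (toy stand-in for vector search)."""
    query_words = tokenize(query)
    ranked = sorted(documents,
                    key=lambda doc: len(query_words & tokenize(doc)),
                    reverse=True)
    return ranked[:top_k]

def augment(query: str, context: list[str]) -> str:
    """Augmentation: put the retrieved passages into the prompt."""
    passages = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only this context:\n{passages}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    """Generation: placeholder for an LLM call (assumption: no model is loaded here)."""
    return f"[an LLM would answer here, given a {len(prompt)}-character prompt]"

documents = [
    "Water makes up more than 60 percent of total body weight.",
    "Riboflavin is an essential component of flavoproteins.",
]
query = "How much of body weight is water?"

context = retrieve(query, documents)
prompt = augment(query, context)
answer = generate(prompt)
print(prompt)
print(answer)
```

Swapping the word-overlap scoring for embedding similarity and the placeholder for a local model is what the rest of the pipeline works towards.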


### Why RAG?

RAG improves the generated outputs of LLMs:

1. Reduces hallucinations - fluent, plausible-looking text that is not necessarily factual
2. Lets the LLM work with custom data, not only the internet-scale data it was trained on


### What can RAG be used for?

1. Customer support Q&A chat
2. Email chain analysis
3. Company internal documentation chat
4. Textbook Q&A


### Why local?

1. Privacy - private documentation that you don't want to send to an API
2. Speed - no need to send data across the internet
3. Cost - no API fees when running on your own hardware


### To do list

- Build a RAG pipeline which enables us to chat with a PDF document, specifically an open-source nutrition textbook, ~1200 pages long.

- Write the code to:

1. Open a PDF document (you could use almost any PDF here).
2. Format the text of the PDF textbook ready for an embedding model (this process is known as text splitting/chunking).
3. Embed all of the chunks of text in the textbook, turning them into numerical representations (embeddings) which we can store for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on passages from the textbook.


## 1. Document/text processing and embedding creation


### Import and open PDF


In [32]:
# Import PDF

import os
import requests

# Get pdf document path
pdf_path = "human-nutrition-text.pdf"

# Download PDF
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # Enter the URL of the pdf
    url = "https://pressbooks.oer.hawaii.edu/humannutrition/open/download?type=pdf"

    # The local file name to save downloaded file
    filename = pdf_path

    # Send a GET request to the URL (the timeout stops the call from hanging indefinitely)
    response = requests.get(url, timeout=60)

    # Check if the request was successful
    if response.status_code == 200:
        # Open the file and save it
        with open(filename, "wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code: {response.status_code}")

else:
    print(f"[INFO] File {pdf_path} exists")

[INFO] File human-nutrition-text.pdf exists


In [33]:
# Open PDF
import fitz  # PyMuPDF
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip()

    # More text formatting functions can go in here
    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Reads a PDF page by page and collects each page's text plus basic statistics."""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # The offset of 17 lines the index up with the page numbers printed in this PDF
        pages_and_texts.append({"page_number": page_number - 17,
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # rough estimate: 1 token ~= 4 characters
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]


[{'page_number': -17,
  'page_char_count': 15,
  'page_word_count': 2,
  'page_sentence_count_raw': 1,
  'page_token_count': 3.75,
  'text': 'Human Nutrition'},
 {'page_number': -16,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]
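
The empty page above reports a `page_word_count` of 1 rather than 0. That follows from how `str.split(" ")` treats an empty string. A minimal sketch of the difference, using two hypothetical helper functions that are not part of the notebook:

```python
def count_words_explicit_space(text: str) -> int:
    """Mirror the notebook's approach: split on a literal space."""
    return len(text.split(" "))

def count_words_any_whitespace(text: str) -> int:
    """Alternative: split on runs of whitespace, so empty text gives 0."""
    return len(text.split())

empty_page = ""
title_page = "Human Nutrition"

print(count_words_explicit_space(empty_page))  # 1, because "".split(" ") returns [""]
print(count_words_any_whitespace(empty_page))  # 0, because "".split() returns []
print(count_words_explicit_space(title_page))  # 2
```

For rough statistics the off-by-one on empty pages is harmless, but it explains why the minimum word count in the summary table is 1.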

In [34]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 6,
  'page_char_count': 1593,
  'page_word_count': 252,
  'page_sentence_count_raw': 13,
  'page_token_count': 398.25,
  'text': 'WATER There is one other nutrient that we must have in large quantities: water. Water does not contain carbon, but is composed of two hydrogens and one oxygen per molecule of water. More than 60 percent of your total body weight is water. Without it, nothing could be transported in or out of the body, chemical reactions would not occur, organs would not be cushioned, and body temperature would fluctuate widely. On average, an adult consumes just over two liters of water per day from food and drink combined. Since water is so critical for life’s basic processes, the amount of water input and output is supremely important, a topic we will explore in detail in Chapter 4. Micronutrients Micronutrients are nutrients required by the body in lesser amounts, but are still essential for carrying out bodily functions. Micronutrients include all the es

In [35]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -17 | 15 | 2 | 1 | 3.75 | Human Nutrition |
| 1 | -16 | 0 | 1 | 1 | 0.0 | |
| 2 | -15 | 188 | 26 | 1 | 47.0 | Human Nutrition UNIVERSITY OF HAWAI‘I AT MĀNOA... |
| 3 | -14 | 607 | 100 | 5 | 151.75 | Human Nutrition by University of Hawai‘i at Mā... |
| 4 | -13 | 827 | 130 | 4 | 206.75 | Contents Preface xi About the Contributors xii... |


In [36]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 |
| mean | 316.0 | 1756.98 | 270.41 | 16.31 | 439.25 |
| std | 192.69 | 1211.29 | 188.06 | 13.79 | 302.82 |
| min | -17.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 149.5 | 774.5 | 112.5 | 6.0 | 193.62 |
| 50% | 316.0 | 1584.0 | 249.0 | 14.0 | 396.0 |
| 75% | 482.5 | 2750.5 | 424.5 | 23.0 | 687.62 |
| max | 649.0 | 4555.0 | 757.0 | 99.0 | 1138.75 |


### Why do we care about token count?

1. Embedding models have a limited input length

- This notebook uses the sentence-transformers/all-mpnet-base-v2 embedding model
- It accepts sequences of up to 384 tokens; longer inputs are truncated, so the extra text is not represented in the embedding

2. LLMs have a limited context window too, so the retrieved text has to fit in the prompt
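
A quick way to see why this matters is to compare the estimated token count of a page with the 384-token limit. The sketch below reuses the notebook's rule of thumb of 4 characters per token; the constant and function names are hypothetical, and a real tokenizer would give different (more accurate) counts.

```python
CHARS_PER_TOKEN = 4   # rule of thumb used in this notebook, not an exact tokenizer
MAX_SEQ_TOKENS = 384  # input limit of all-mpnet-base-v2

def estimate_tokens(text: str) -> float:
    """Estimate the token count from the character count."""
    return len(text) / CHARS_PER_TOKEN

def fits_embedding_model(text: str, max_tokens: int = MAX_SEQ_TOKENS) -> bool:
    """Return True if the estimated token count is within the model's limit."""
    return estimate_tokens(text) <= max_tokens

title_page = "Human Nutrition"
longest_page = "x" * 4555  # same length as the longest page in the summary table

print(estimate_tokens(title_page), fits_embedding_model(title_page))      # 3.75 True
print(estimate_tokens(longest_page), fits_embedding_model(longest_page))  # 1138.75 False
```

With a mean of about 439 estimated tokens per page, many whole pages would be truncated by the embedding model, which is the motivation for splitting pages into smaller chunks.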


### Further text processing (splitting pages into sentences)

- Split on ". " or use an NLP library such as spaCy or NLTK.
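
Splitting on ". " is simple but brittle. The sketch below shows two typical failure cases; `naive_sentence_split` is a hypothetical helper and the first example sentence is made up for illustration.

```python
def naive_sentence_split(text: str) -> list[str]:
    """Split text on a full stop followed by a space, dropping empty pieces."""
    return [part for part in text.split(". ") if part]

# Case 1: an abbreviation is treated as a sentence boundary
abbreviation_text = "Dr. Smith recommends vitamin C. It supports immune function."
print(naive_sentence_split(abbreviation_text))
# ['Dr', 'Smith recommends vitamin C', 'It supports immune function.']

# Case 2: a bullet-style page with no ". " comes back as one long "sentence"
bullet_text = "Diet Pros Cons DASH Diet • Helps to lower blood pressure • Reduces diabetes risk"
print(len(naive_sentence_split(bullet_text)))  # 1
```

The first case over-splits two sentences into three pieces, and the second leaves a whole page as a single piece, which is why a dedicated sentence splitter is used in the next cell.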


In [37]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer pipeline
nlp.add_pipe("sentencizer")

# Create document instance as an example
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2

# Print out our sentences split
list(doc.sents)

[This is a sentence., This is another sentence.]

In [38]:
pages_and_texts[500]

{'page_number': 483,
 'page_char_count': 3794,
 'page_word_count': 655,
 'page_sentence_count_raw': 33,
 'page_token_count': 948.5,
 'text': 'ENERGY AND MACRONUTRIENTS Energy needs relative to size are much greater in an infant than an adult. A baby’s resting metabolic rate is two times that of an adult. The RDA to meet energy needs changes as an infant matures and puts on more weight. The IOM uses a set of equations to calculate the total energy expenditure and resulting energy needs. For example, the equation for the first three months of life is (89 x weight [kg] −100) + 175 kcal. Based on these equations, the estimated energy requirement for infants from zero to six months of age is 472 to 645 kilocalories per day for boys and 438 to 593 kilocalories per day for girls. For infants ages six to twelve months, the estimated requirement is 645 to 844 kilocalories per day for boys and 593 to 768 kilocalories per day for girls. From the age one to age two, the estimated requirement rises

In [39]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure sentences are strings (default type is spaCy datatype)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])


In [40]:
random.sample(pages_and_texts, k=1)
# Has been split into sentences

[{'page_number': 612,
  'page_char_count': 2605,
  'page_word_count': 413,
  'page_sentence_count_raw': 1,
  'page_token_count': 651.25,
  'text': 'Diet Pros Cons DASH Diet • Recommended by the National Heart, Lung, and Blood Institute, the American Heart Association, and many physicians • Helps to lower blood pressure and cholesterol • Reduces risk of heart disease and stroke • Reduces risk of certain cancers • Reduces diabetes risk • There are very few negative factors associated with the DASH diet • Risk for hyponatremia Gluten-Free Diet • Reduces the symptoms of gluten intolerance, such as chronic diarrhea, cramping, constipation, and bloating • Promotes healing of the small intestines for people with celiac disease, preventing malnutrition • May be beneficial for other autoimmune diseases, such as Parkinson’s disease, rheumatoid arthritis, and multiple sclerosis • Risk of folate, iron, thiamin, riboflavin, niacin, and vitamin B6 deficiencies • Special gluten-free products can be h

In [41]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2) 

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 |
| mean | 316.0 | 1756.98 | 270.41 | 16.31 | 439.25 | 16.21 |
| std | 192.69 | 1211.29 | 188.06 | 13.79 | 302.82 | 13.64 |
| min | -17.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 149.5 | 774.5 | 112.5 | 6.0 | 193.62 | 6.0 |
| 50% | 316.0 | 1584.0 | 249.0 | 14.0 | 396.0 | 14.0 |
| 75% | 482.5 | 2750.5 | 424.5 | 23.0 | 687.62 | 23.0 |
| max | 649.0 | 4555.0 | 757.0 | 99.0 | 1138.75 | 101.0 |


### Splitting and chunking sentences together in groups of 10

- Makes text easier to filter and inspect
- Keeps each text chunk small enough to fit into the embedding model's context window
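
The number of chunks a page produces is just the sentence count divided by the chunk size, rounded up. A minimal sketch of that arithmetic, with `chunks_needed` as a hypothetical helper name:

```python
import math

def chunks_needed(num_sentences: int, chunk_size: int = 10) -> int:
    """Number of chunks when sentences are grouped into lists of at most chunk_size."""
    return math.ceil(num_sentences / chunk_size)

for num_sentences in (0, 20, 25, 101):
    print(num_sentences, "->", chunks_needed(num_sentences))
# 0 -> 0
# 20 -> 2
# 25 -> 3
# 101 -> 11
```

A page with 101 sentences (the maximum found by spaCy above) therefore ends up as 11 chunks, the last of which holds a single sentence.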


In [42]:
# Define split size
num_sentence_chunk_size = 10

# Create function to split a list into sublists of at most split_size items
# 20 -> 10, 10
# 25 -> 10, 10, 5
def split_list(input_list: list,
               split_size: int = num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i+split_size] for i in range(0, len(input_list), split_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [45]:
# Loop through pages and text and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         split_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

random.sample(pages_and_texts, k=1)


[{'page_number': 341,
  'page_char_count': 2431,
  'page_word_count': 352,
  'page_sentence_count_raw': 22,
  'page_token_count': 607.75,
  'text': 'Health Professional Fact Sheet: Thiamin. National Institutes of Health, Office of Dietary Supplements.https://ods.od.nih.gov/factsheets/Thiamin-HealthProfessional/ . Updated February 11, 2016 . Accessed October 5, 2017. Riboflavin (B2) Riboflavin is an essential component of flavoproteins, which are coenzymes involved in many metabolic pathways of carbohydrate, lipid, and protein metabolism. Flavoproteins aid in the transfer of electrons in the electron transport chain. Furthermore, the functions of other B-vitamin coenzymes, such as vitamin B6 and folate, are dependent on the actions of flavoproteins. The “flavin” portion of riboflavin gives a bright yellow color to riboflavin, an attribute that helped lead to its discovery as a vitamin. When riboflavin is taken in excess amounts (supplement form) the excess will be excreted through your 

In [47]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 |
| mean | 316.0 | 1756.98 | 270.41 | 16.31 | 439.25 | 16.21 | 2.11 |
| std | 192.69 | 1211.29 | 188.06 | 13.79 | 302.82 | 13.64 | 1.38 |
| min | -17.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 149.5 | 774.5 | 112.5 | 6.0 | 193.62 | 6.0 | 1.0 |
| 50% | 316.0 | 1584.0 | 249.0 | 14.0 | 396.0 | 14.0 | 2.0 |
| 75% | 482.5 | 2750.5 | 424.5 | 23.0 | 687.62 | 23.0 | 3.0 |
| max | 649.0 | 4555.0 | 757.0 | 99.0 | 1138.75 | 101.0 | 11.0 |


### Splitting each chunk into its own item

- Each chunk of sentences then gets its own embedding (numerical representation), which gives us a good level of granularity
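
The cell below joins the sentences of a chunk without a separator and then uses a regular expression to put the space back after each full stop that is directly followed by a capital letter. A small sketch of that join-and-repair step on its own, wrapped in a hypothetical `join_and_repair` helper:

```python
import re

def join_and_repair(sentences: list[str]) -> str:
    """Join sentences with no separator, then insert a space after '.' before a capital letter."""
    joined = "".join(sentences).replace("  ", " ").strip()
    return re.sub(r"\.([A-Z])", r". \1", joined)

print(join_and_repair(["Updated February 11, 2004.", "Accessed September 22, 2017."]))
# Updated February 11, 2004. Accessed September 22, 2017.

# Caveat: the pattern also fires inside abbreviations
print(join_and_repair(["Made in the U.S.A."]))
# Made in the U. S. A.
```

The caveat is usually acceptable for retrieval, but it is worth knowing when inspecting chunks by eye.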


In [58]:
import re  # the standard library module is enough for the pattern used below

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph-like structure
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()

        # Restore the space after a full stop that is directly followed by a capital (".A" -> ". A")
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)

        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get some stats on our chunks
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # rough estimate: 1 token ~= 4 characters

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)



1409

In [59]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 114,
  'sentence_chunk': 'Updated February 11, 2004. Accessed September 22, 2017. Sodium | 114',
  'chunk_char_count': 68,
  'chunk_word_count': 11,
  'chunk_token_count': 17.0}]
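
The sample above is a leftover footnote and page footer rather than useful content, and it is also very short. One possible way to handle such chunks, sketched under assumptions, is to drop everything below a minimum estimated token count. The threshold of 30 and the name `filter_short_chunks` are hypothetical; inspect your own chunks before choosing a cut-off.

```python
MIN_TOKEN_LENGTH = 30  # hypothetical threshold, not taken from the notebook

def filter_short_chunks(chunks: list[dict], min_tokens: float = MIN_TOKEN_LENGTH) -> list[dict]:
    """Keep only chunks whose estimated token count is above min_tokens."""
    return [chunk for chunk in chunks if chunk["chunk_token_count"] > min_tokens]

# Made-up longer chunk for contrast with the real short one
long_text = ("word " * 50).strip()

example_chunks = [
    # Copied from the sample output above
    {"page_number": 114,
     "sentence_chunk": "Updated February 11, 2004. Accessed September 22, 2017. Sodium | 114",
     "chunk_char_count": 68,
     "chunk_word_count": 11,
     "chunk_token_count": 17.0},
    {"page_number": 6,
     "sentence_chunk": long_text,
     "chunk_char_count": len(long_text),
     "chunk_word_count": len(long_text.split(" ")),
     "chunk_token_count": len(long_text) / 4},
]

kept = filter_short_chunks(example_chunks)
print([chunk["page_number"] for chunk in kept])  # [6]
```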