# Local RAG pipeline


## 0. Intro


### What is RAG?

- Retrieval - Find relevant info given a query
- Augmented - Take relevant info and augment our input (prompt) to an LLM with that relevant info
- Generation - Pass the augmented prompt to an LLM, which generates the output
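
The three steps above can be sketched in a few lines of plain Python. This is a toy illustration only: the word-overlap scoring stands in for the vector search built later in this notebook, the `documents` list and the `retrieve`/`augment` helpers are made up for the example, and the generation step is left as a comment because it needs an actual LLM.

```python
# Toy sketch of retrieve -> augment (all names and data here are hypothetical).
documents = [
    "Vitamin C is found in citrus fruits.",
    "Protein is made of amino acids.",
    "Water makes up most of the human body.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Retrieval: rank docs by the number of lowercase words shared with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(docs,
                    key=lambda doc: len(query_words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def augment(query: str, context: list[str]) -> str:
    """Augmentation: put the retrieved text into the prompt next to the query."""
    context_text = "\n".join(f"- {item}" for item in context)
    return f"Use the context to answer.\nContext:\n{context_text}\nQuestion: {query}"

query = "What is protein made of?"
context = retrieve(query, documents)
prompt = augment(query, context)
print(prompt)

# Generation: `prompt` would now be passed to an LLM to produce the answer.
```

Word overlap is far too crude for real use (it misses synonyms and is thrown off by punctuation), which is why the rest of the notebook uses embeddings instead.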


### Why RAG?

Improve the generation outputs of LLMs

1. Reduces hallucinations - fluent, good-looking text that is not necessarily factual
2. Works with custom data rather than only the internet-scale data the LLM was trained on


### What can RAG be used for?

1. Customer support Q&A chat
2. Email chain analysis
3. Company internal documentation chat
4. Textbook Q&A


### Why local?

1. Privacy - private documentation that you don't want to send to an API
2. Speed - no need to send data across the internet
3. Cost - no API fees when running on your own hardware


### To do list

- Build a RAG pipeline that lets us chat with a PDF document, specifically an open-source nutrition textbook (the copy loaded below has 667 pages).

- Write the code to:

1. Open a PDF document (you could use almost any PDF here).
2. Format the text of the PDF textbook ready for an embedding model (this process is known as text splitting/chunking).
3. Embed all of the chunks of text in the textbook, turning them into numerical representations (embeddings) that we can store for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on passages from the textbook.


## 1. Text pre-processing


### Import and open PDF


In [26]:
# Import PDF

import os
import requests

# Get pdf document path
pdf_path = "data/human-nutrition-text.pdf"

# Download PDF
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # Enter the URL of the pdf
    url = "https://pressbooks.oer.hawaii.edu/humannutrition/open/download?type=pdf"

    # The local file name to save downloaded file
    filename = pdf_path

    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Make sure the target folder exists, then save the file
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        with open(filename, "wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status code : {response.status_code}")

else:
    print(f"[INFO] File {pdf_path} exists")

[INFO] File data/human-nutrition-text.pdf exists


In [27]:
# Open PDF
import fitz
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text"""
    cleaned_text = text.replace("\n", " ").strip()

    # More text formatting functions can go in here
    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Opens a PDF and returns one dict of text and basic stats per page."""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # Offset page numbers by 17 so the front matter gets negative numbers
        pages_and_texts.append({"page_number": page_number - 17,
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                # Rough estimate: 1 token ~= 4 characters
                                "page_token_count": len(text) / 4,
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]


[{'page_number': -17,
  'page_char_count': 15,
  'page_word_count': 2,
  'page_sentence_count_raw': 1,
  'page_token_count': 3.75,
  'text': 'Human Nutrition'},
 {'page_number': -16,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [28]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 566,
  'page_char_count': 1098,
  'page_word_count': 169,
  'page_sentence_count_raw': 8,
  'page_token_count': 274.5,
  'text': 'known as hyponatremia (see Figure 16.11 “The Effect of Exercise on Sodium Levels”). When sodium levels in the blood are decreased, water moves into the cell through osmosis which causes swelling. Accumulation of fluid in the lungs and the brain can cause serious life threatening conditions such as a seizure, coma and death. In order to avoid hyponatremia, athletes should increase their consumption of sodium in the days leading up to an event and consume sodium-containing sports drinks during their race or game. The early signs of hyponatremia include nausea, muscle cramps, disorientation, and slurred speech. To learn more about the sports drinks that can optimize your performance, refer back to Chapter 3, Water and Electrolytes. Figure 16.11 The Effect of Exercise on Sodium Levels Image by Allison Calabrese / CC BY 4.0 Water and Electrolyte 

In [29]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -17 | 15 | 2 | 1 | 3.75 | Human Nutrition |
| 1 | -16 | 0 | 1 | 1 | 0.0 | |
| 2 | -15 | 188 | 26 | 1 | 47.0 | Human Nutrition UNIVERSITY OF HAWAI‘I AT MĀNOA... |
| 3 | -14 | 607 | 100 | 5 | 151.75 | Human Nutrition by University of Hawai‘i at Mā... |
| 4 | -13 | 827 | 130 | 4 | 206.75 | Contents Preface xi About the Contributors xii... |


In [30]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 |
| mean | 316.0 | 1756.98 | 270.41 | 16.31 | 439.25 |
| std | 192.69 | 1211.29 | 188.06 | 13.79 | 302.82 |
| min | -17.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 149.5 | 774.5 | 112.5 | 6.0 | 193.62 |
| 50% | 316.0 | 1584.0 | 249.0 | 14.0 | 396.0 |
| 75% | 482.5 | 2750.5 | 424.5 | 23.0 | 687.62 |
| max | 649.0 | 4555.0 | 757.0 | 99.0 | 1138.75 |


### Why do we care about token count?

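1. Embedding models can only process a limited number of tokens per input

- In this case, the sentence-transformers/all-mpnet-base-v2 embedding model is used
- It accepts sequences of up to 384 tokens and, by default, truncates anything longer, so text past that limit is not represented in the embedding

2. LLMs have a limited context window too, so we can't pass them unlimited text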
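
To make the limit concrete, here is a small sketch that applies the notebook's rough heuristic (1 token ≈ 4 characters) to the page statistics from the `df.describe()` output above. The constant and function names are made up for the example, and the 4-characters-per-token ratio is only an approximation, not the model's real tokenizer.

```python
# Rough heuristic used in this notebook: 1 token ~= 4 characters (an approximation).
CHARS_PER_TOKEN = 4
MAX_SEQ_LENGTH = 384  # token limit of all-mpnet-base-v2

def estimate_tokens(num_chars: float) -> float:
    """Estimate the token count of a text from its character count."""
    return num_chars / CHARS_PER_TOKEN

# Character counts copied from the df.describe() output above.
page_char_counts = {"mean page": 1756.98, "largest page": 4555}

for name, chars in page_char_counts.items():
    tokens = estimate_tokens(chars)
    fits = tokens <= MAX_SEQ_LENGTH
    print(f"{name}: ~{tokens:.0f} tokens, fits in one embedding: {fits}")
```

Both the average page and the largest page come out above 384 estimated tokens, which is why whole pages are split into smaller chunks before embedding.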


### Further text processing (splitting pages into sentences)

- Split at ". " (simple but naive) or use an NLP library such as spaCy or NLTK, which are built to detect sentence boundaries.
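
A quick sketch of why splitting on `". "` is only a rough approach. The example sentence is made up; the empty-string case mirrors the blank page at the start of the PDF, which reports a word count and raw sentence count of 1 despite having 0 characters.

```python
# Naive sentence splitting on ". " (the example text is hypothetical).
text = "Dr. Smith studies nutrition. He measures energy needs."
naive_sentences = text.split(". ")
print(naive_sentences)       # the abbreviation "Dr." is treated as a sentence end
print(len(naive_sentences))  # 3 pieces for 2 real sentences

# Splitting an empty string still returns one (empty) item,
# so an empty page is counted as having 1 sentence and 1 word.
print(len("".split(". ")), len("".split(" ")))
```

This is one reason the raw counts in the tables should be read as estimates rather than exact figures.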


In [31]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer pipeline
nlp.add_pipe("sentencizer")

# Create document instance as an example
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2

# Print out our sentences split
list(doc.sents)

[This is a sentence., This is another sentence.]

In [32]:
pages_and_texts[500]

{'page_number': 483,
 'page_char_count': 3794,
 'page_word_count': 655,
 'page_sentence_count_raw': 33,
 'page_token_count': 948.5,
 'text': 'ENERGY AND MACRONUTRIENTS Energy needs relative to size are much greater in an infant than an adult. A baby’s resting metabolic rate is two times that of an adult. The RDA to meet energy needs changes as an infant matures and puts on more weight. The IOM uses a set of equations to calculate the total energy expenditure and resulting energy needs. For example, the equation for the first three months of life is (89 x weight [kg] −100) + 175 kcal. Based on these equations, the estimated energy requirement for infants from zero to six months of age is 472 to 645 kilocalories per day for boys and 438 to 593 kilocalories per day for girls. For infants ages six to twelve months, the estimated requirement is 645 to 844 kilocalories per day for boys and 593 to 768 kilocalories per day for girls. From the age one to age two, the estimated requirement rises

In [33]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure sentences are strings (default type is spaCy datatype)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])


In [34]:
random.sample(pages_and_texts, k=1)
# Has been split into sentences

[{'page_number': 648,
  'page_char_count': 3679,
  'page_word_count': 585,
  'page_sentence_count_raw': 25,
  'page_token_count': 919.75,
  'text': 'Attributions 1. Figure 2.5 The Human Digestive System reused “Digestive system without labels” by Mariana Ruiz / Public Domain 2. Figure 2.6 Peristalsis in the Esophagus reused “Peristalsis” by OpenStax College / CC BY 3.0 3. Figure 2.9 The Absorption of Nutrients reused “ “Digestive system without labels” by Mariana Ruiz / Public Domain; “Simple columnar epithelial cells” by McortNGHH / CC BY 3.0 4. Figure 2.28 Body Composition reused “Male body silhouette” by mlampret / Public Domain 5. Figure 2.32 Fat Distribution reused “Body shapes” by Succubus MacAstaroth / Public Domain; “Simple red apple” by Sanja / Public Domain; “Pear” by Mrallowski / Public Domain 6. Figure 3.2 Distribution of Body Water reused “Male body silhouette” by mlampret / Public Domain 7. Figure 3.6 Regulating Water Intake reused “Female silhouette” by Pnx / Public Doma

In [35]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2) 

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 |
| mean | 316.0 | 1756.98 | 270.41 | 16.31 | 439.25 | 16.21 |
| std | 192.69 | 1211.29 | 188.06 | 13.79 | 302.82 | 13.64 |
| min | -17.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 149.5 | 774.5 | 112.5 | 6.0 | 193.62 | 6.0 |
| 50% | 316.0 | 1584.0 | 249.0 | 14.0 | 396.0 | 14.0 |
| 75% | 482.5 | 2750.5 | 424.5 | 23.0 | 687.62 | 23.0 |
| max | 649.0 | 4555.0 | 757.0 | 99.0 | 1138.75 | 101.0 |


### Splitting and chunking sentences together in groups of 10

- Makes the text easier to filter and inspect
- Keeps each text chunk small enough to fit into our embedding model's context window
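
A back-of-the-envelope check of the group size, using the page averages from the `df.describe()` output above and the 384-token limit of the embedding model. The variable names are illustrative, and the token figures rest on the rough 4-characters-per-token estimate.

```python
# Averages copied from the df.describe() output above.
mean_tokens_per_page = 439.25
mean_sentences_per_page = 16.21  # spaCy sentence count

chunk_size = 10        # sentences per chunk
max_seq_length = 384   # token limit of all-mpnet-base-v2

tokens_per_sentence = mean_tokens_per_page / mean_sentences_per_page
tokens_per_chunk = tokens_per_sentence * chunk_size

print(f"~{tokens_per_sentence:.1f} tokens per sentence")
print(f"~{tokens_per_chunk:.0f} tokens per {chunk_size}-sentence chunk")
print(f"Fits within {max_seq_length} tokens on average: {tokens_per_chunk <= max_seq_length}")
```

This only holds on average. Chunks made of unusually long sentences can still exceed the limit, so the chunk statistics computed later are worth checking.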


In [36]:
# Define split size
num_sentence_chunk_size = 10

# Create function to split a list into sublists of at most split_size items
# 20 -> 10, 10
# 25 -> 10, 10, 5
def split_list(input_list: list,
               split_size: int = num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i+split_size] for i in range(0, len(input_list), split_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [37]:
# Loop through pages and text and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         split_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

random.sample(pages_and_texts, k=1)


[{'page_number': 55,
  'page_char_count': 2622,
  'page_word_count': 429,
  'page_sentence_count_raw': 20,
  'page_token_count': 655.5,
  'text': 'the blood, and blood delivers the carbon dioxide to the lungs where it is exhaled. Also, the liver produces the waste product urea from the breakdown of amino acids and detoxifies many harmful substances, all of which require transport in the blood to the kidneys for excretion. ALL FOR ONE, ONE FOR ALL The eleven organ systems in the body completely depend on each other for continued survival as a complex organism. Blood allows for transport of nutrients, wastes, water, and heat, and is also a conduit of communication between organ systems. Blood’s importance to the rest of the body is aptly presented in its role in glucose delivery, especially to the brain. The brain metabolizes, on average, 6 grams of glucose per hour. In order to avert confusion, coma, and death, glucose must be readily available to the brain at all times. To accomplish t

In [38]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 | 667.0 |
| mean | 316.0 | 1756.98 | 270.41 | 16.31 | 439.25 | 16.21 | 2.11 |
| std | 192.69 | 1211.29 | 188.06 | 13.79 | 302.82 | 13.64 | 1.38 |
| min | -17.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 149.5 | 774.5 | 112.5 | 6.0 | 193.62 | 6.0 | 1.0 |
| 50% | 316.0 | 1584.0 | 249.0 | 14.0 | 396.0 | 14.0 | 2.0 |
| 75% | 482.5 | 2750.5 | 424.5 | 23.0 | 687.62 | 23.0 | 3.0 |
| max | 649.0 | 4555.0 | 757.0 | 99.0 | 1138.75 | 101.0 | 11.0 |


### Splitting each chunk into its own item

- This lets us embed each chunk of sentences into its own numerical representation, giving us a good level of granularity


In [39]:
import re  # the standard library module is enough for the pattern used below

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph like structure
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()

        # Restore the space after a full stop that is directly followed by a capital letter (".A" -> ". A")
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)

        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get some stats on our chunks
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4  # rough estimate: 1 token ~= 4 characters

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)



1409

In [40]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 642,
  'sentence_chunk': 'Coleman-Jensen A. Household Food Security in the United States in 2010. US Department of Agriculture, Economic Research Report, no. ERR-125.2011.https://www.ers.usda.gov/publications/pub-details/?pubid=44909. Accessed April 15, 2018.14. National School Lunch Program. US Department of Agriculture.https://www.fns.usda.gov/nslp/national-school-lunch-program- nslp.',
  'chunk_char_count': 363,
  'chunk_word_count': 33,
  'chunk_token_count': 90.75}]
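
The sample above still contains runs such as `2018.14.` and `.https://` with no space, because the regex in the cell above only patches a full stop followed by a capital letter. A possible alternative, sketched here under the assumption that each sentence string has already lost its trailing whitespace, is to join with a single space instead. The `sentence_chunk` list below is a hypothetical example shaped like the reference-list text.

```python
import re

# Hypothetical sentence strings, shaped like the reference-list sample above.
sentence_chunk = ["Accessed April 15, 2018.", "14.", "National School Lunch Program."]

# Approach from the cell above: join without spaces, then patch ".X" -> ". X".
joined_no_space = "".join(sentence_chunk)
patched = re.sub(r"\.([A-Z])", r". \1", joined_no_space)

# Alternative: join with a single space so no patching is needed.
joined_with_space = " ".join(sentence_chunk)

print(patched)            # the full stop before "14" is still glued to it
print(joined_with_space)  # every sentence boundary gets a space
```

Joining with a space changes the character counts slightly, so the chunk statistics would differ a little from the ones shown in this notebook.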

In [41]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 1409.0 | 1409.0 | 1409.0 | 1409.0 |
| mean | 339.14 | 830.21 | 127.02 | 207.55 |
| std | 194.96 | 535.48 | 86.02 | 133.87 |
| min | -17.0 | 3.0 | 1.0 | 0.75 |
| 25% | 170.0 | 344.0 | 42.0 | 86.0 |
| 50% | 337.0 | 848.0 | 132.0 | 212.0 |
| 75% | 514.0 | 1222.0 | 193.0 | 305.5 |
| max | 649.0 | 3060.0 | 483.0 | 765.0 |


### Create a filter for sentence chunks that are below 30 tokens


In [42]:
# Show random chunks of 30 tokens or fewer
min_token_length = 30
for _, row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token Count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Chunk token Count: 8.5 | Text: CHAPTER 18. NUTRITIONAL ISSUES 603
Chunk token Count: 7.0 | Text: CHAPTER 2. THE HUMAN BODY 29
Chunk token Count: 16.0 | Text: Figure 4.1 Carbohydrate Classification Scheme Introduction | 142
Chunk token Count: 9.75 | Text: Table 3.5 Salt Substitutes Sodium | 118
Chunk token Count: 0.75 | Text: 186


In [43]:
# Keep only chunks with more than min_token_length tokens, as a list of dicts
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
random.sample(pages_and_chunks_over_min_token_len, k=1)

[{'page_number': 154,
  'sentence_chunk': 'intestine. The products of bacterial digestion of these slow-releasing carbohydrates are short-chain fatty acids and some gases. The short-chain fatty acids are either used by the bacteria to make energy and grow, are eliminated in the feces, or are absorbed into cells of the colon, with a small amount being transported to the liver. Colonic cells use the short- chain fatty acids to support some of their functions. The liver can also metabolize the short-chain fatty acids into cellular energy. The yield of energy from dietary fiber is about 2 kilocalories per gram for humans, but is highly dependent upon the fiber type, with soluble fibers and resistant starches yielding more energy than insoluble fibers. Since dietary fiber is digested much less in the gastrointestinal tract than other carbohydrate types (simple sugars, many starches) the rise in blood glucose after eating them is less, and slower. These physiological attributes of high-fiber

## 2. Text Embedding

- Turn our text chunks into embeddings: fixed-length vectors of numbers (768 values each for the model used here) that represent the text
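
The reason embeddings are useful for retrieval is that vectors for related texts tend to point in similar directions, which can be measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings from the model have 768 dimensions, and the `cosine_similarity` helper is written here by hand rather than taken from a library.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any model).
protein_chunk = [0.9, 0.1, 0.0]
amino_acid_query = [0.8, 0.2, 0.1]
weather_query = [0.0, 0.1, 0.9]

related = cosine_similarity(protein_chunk, amino_acid_query)
unrelated = cosine_similarity(protein_chunk, weather_query)
print(f"related: {related:.2f}, unrelated: {unrelated:.2f}")
```

In the retrieval step, the query is embedded the same way as the chunks and the chunks with the highest similarity scores are returned.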


In [44]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", device="cuda")

# Create a list of sentences
sentences = ["The sentence transformer library provides an easy way to create embeddings.",
             "Sentences can be embedded one by one or in a list."]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding}")
    print("")

Sentence: The sentence transformer library provides an easy way to create embeddings.
Embedding: [-3.44286002e-02  2.95328889e-02 -2.33643297e-02  5.57256900e-02
 -2.19098255e-02 -6.47061830e-03  1.02849538e-02 -6.57803640e-02
  2.29717419e-02 -2.61120386e-02  3.80420797e-02  5.61402254e-02
 -3.68746594e-02  1.52788032e-02  4.37020846e-02 -5.19723482e-02
  4.89479229e-02  3.58106894e-03 -1.29750511e-02  3.54384072e-03
  4.23261896e-02  3.52606922e-02  2.49401350e-02  2.99177542e-02
 -1.99381709e-02 -2.39752736e-02 -3.33370618e-03 -4.30450253e-02
  5.72014228e-02 -1.32517572e-02 -3.54477540e-02 -1.13935806e-02
  5.55561222e-02  3.61096952e-03  8.88527154e-07  1.14027197e-02
 -3.82229686e-02 -2.43546139e-03  1.51313730e-02 -1.32674788e-04
  5.00659645e-02 -5.50877005e-02  1.73444767e-02  5.00959009e-02
 -3.75959501e-02 -1.04463724e-02  5.08322716e-02  1.24861198e-02
  8.67376402e-02  4.64143008e-02 -2.10690033e-02 -3.90251651e-02
  1.99697260e-03 -1.42345913e-02 -1.86795332e-02  2.826696

In [45]:
embeddings[0].shape

(768,)

In [46]:
embedding = embedding_model.encode("My name is Tony")
embedding, embedding.shape

(array([ 5.62499762e-02,  1.43900067e-02, -1.57591701e-02,  2.25988366e-02,
         2.84206178e-02,  1.51288798e-02, -2.87298323e-03,  5.61753772e-02,
        -1.53593114e-02,  1.87198892e-02, -3.03351469e-02,  6.82871370e-03,
         2.70711258e-02,  5.50674312e-02, -2.68677566e-02, -6.28586039e-02,
         2.30752979e-03, -5.52843027e-02,  7.35155717e-02, -1.07624792e-02,
        -2.73546763e-02, -8.40380439e-04, -4.41351300e-03,  3.90982255e-02,
        -2.33301073e-02,  1.73130073e-02,  7.06520081e-02, -1.16217192e-02,
         2.33279038e-02,  4.49071303e-02, -2.17582863e-02,  1.96407698e-02,
        -1.26328077e-02,  4.12252843e-02,  1.77576806e-06,  2.37653926e-02,
        -1.10460017e-02, -6.35289541e-03, -8.91105458e-03, -1.41535494e-02,
        -5.20250537e-02,  4.14466038e-02, -2.25551482e-02, -1.82777084e-02,
        -3.04814670e-02,  1.35036096e-01,  4.49801944e-02,  7.59741440e-02,
         2.99696135e-03,  4.24329638e-02,  8.18389654e-03,  2.38802470e-02,
        -7.6

In [47]:
%%time

embedding_model.to("cuda")

# Embed each chunk one by one
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])


  0%|          | 0/1303 [00:00<?, ?it/s]

CPU times: total: 5min 51s
Wall time: 4min 7s


In [48]:
# %%time

# # Embed all text in batches
# text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]
# text_chunk_embeddings = embedding_model.encode(text_chunks,
#                                                 batch_size=32,
#                                                 convert_to_tensor=True)

# Batching should be faster than embedding one chunk at a time
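
To see why batching is expected to help, the sketch below counts how many groups the 1303 filtered chunks fall into at `batch_size=32`. It does not call the embedding model; `make_batches` is a hypothetical helper that only mimics how a list is cut into batches.

```python
num_chunks = 1303  # chunks left after filtering (count shown by the embedding loop)
batch_size = 32

def make_batches(items: list, size: int) -> list[list]:
    """Cut a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

chunk_ids = list(range(num_chunks))
batches = make_batches(chunk_ids, batch_size)

print(f"One by one: {num_chunks} separate encode() calls")
print(f"Batched: {len(batches)} batches, last batch has {len(batches[-1])} chunks")
```

Fewer, larger batches generally make better use of a GPU than many single-item calls, though the actual speed-up depends on the hardware and on chunk lengths.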

### Save embeddings to file


In [49]:
pages_and_chunks_over_min_token_len[2]

{'page_number': -13,
 'sentence_chunk': 'Contents Preface xi About the Contributors xii Acknowledgements xvii Chapter 1. Basic Concepts in Nutrition Introduction 3 Food Quality 9 Units of Measure 11 Lifestyles and Nutrition 13 Achieving a Healthy Diet 17 Research and the Scientific Method 19 Types of Scientific Studies 23 Chapter 2. The Human Body Introduction 31 Basic Biology, Anatomy, and Physiology 36 The Digestive System 40 The Cardiovascular System 50 Central Nervous System 58 The Respiratory System 61 The Endocrine System 66 The Urinary System 68 The Muscular System 73 The Skeletal System 74 The Immune System 81 Indicators of Health: Body Mass Index, Body Fat Content, and Fat Distribution 82 Chapter 3. Water and Electrolytes Introduction 93 Overview of Fluid and Electrolyte Balance 96 Water’s Importance to Vitality 100 Regulation of Water Balance 105',
 'chunk_char_count': 827,
 'chunk_word_count': 130,
 'chunk_token_count': 206.75,
 'embedding': array([ 2.95152534e-02,  2.506555

In [51]:
# Save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "data/text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)
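
One thing to watch for when loading this file later: CSV is a text format, so each embedding array is likely stored as its printed string rather than as numbers. Below is a minimal sketch of turning such a string back into floats, assuming it looks like NumPy's default array printout (space-separated values inside square brackets, possibly wrapped over several lines). The `parse_embedding` helper is hypothetical, and only the first value in the example string comes from the sample above; the others are made up.

```python
# Illustrative string in the style of a printed NumPy array.
# The first value is copied from the sample above; the rest are made up.
stored = "[ 2.95152534e-02 -1.25000000e-01\n  5.00000000e-01]"

def parse_embedding(text: str) -> list[float]:
    """Strip the surrounding brackets and convert each whitespace-separated value."""
    return [float(value) for value in text.strip("[]").split()]

embedding = parse_embedding(stored)
print(embedding)
print(len(embedding))
```

If this conversion becomes a nuisance, a binary format that keeps arrays as arrays may be a better fit than CSV.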
