## 1. Document/text processing and embedding creation

Ingredients:
* PDF document of choice (could be any kind of document).
* Embedding model of choice.

Steps:
1. Import PDF document.
2. Process text for embedding.
3. Embed text chunks with embedding model.
4. Save embeddings to file for later use.

### Import PDF Document

In [1]:
import os
import requests # help download stuff

# Get PDF Document
pdf_path = "human-nutrition-text.pdf"

# Download
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # URL of the PDF to download
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

    # Send a GET request to the URL
    response = requests.get(url)
    if response.status_code == 200:
        # Write the response body to the local file
        with open(pdf_path, "wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been downloaded and saved as {pdf_path}")
    else:
        print(f"[INFO] Failed to download the file. Status Code: {response.status_code}")
else:
    print(f"File {pdf_path} exists.")

File human-nutrition-text.pdf exists.
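
For larger files, it can help to stream the download to disk rather than hold the whole response in memory. The following is a minimal sketch using only the standard library (`urllib.request` swapped in for `requests`). The helper name `download_if_missing` and the 30-second timeout are assumptions for illustration and are not part of the original notebook.

```python
import os
import shutil
import urllib.request


def download_if_missing(url: str, path: str, timeout: float = 30.0) -> bool:
    """Download url to path unless path already exists.

    Returns True if a download happened, False if the file was already there.
    (Hypothetical helper, not part of the original notebook.)
    """
    if os.path.exists(path):
        return False

    # Stream the response body to disk in chunks instead of loading it all at once.
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(url, timeout=timeout) as response, open(path, "wb") as file:
        shutil.copyfileobj(response, file)
    return True
```

Called as `download_if_missing(url, pdf_path)`, it would skip the request whenever the PDF is already on disk. Note that an interrupted download can leave a partial file behind, so a more careful version might write to a temporary path first.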


In [2]:
import fitz # requires: PyMuPDF
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text"""
    cleaned_text = text.replace("\n", " ").strip()

    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Opens a PDF and returns one dict of text and basic stats per page."""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text)
        # Offset the index by 41 so page numbers match those printed in the book
        pages_and_texts.append({"page_number": page_number - 41,
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(".")),
                                "page_token_count": len(text) / 4, # 1 token ~4 chars
                                "text": text
                               })
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path)
pages_and_texts[:2]


[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]
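
The empty page above still reports a word count and a raw sentence count of 1. That follows from how `str.split` behaves: splitting an empty string on an explicit separator returns a list containing one empty string. A small sketch (the helper name `page_stats` is hypothetical) reproduces the statistics of the two rows shown above:

```python
def page_stats(text: str) -> dict:
    """Compute the same rough per-page statistics as the cell above."""
    return {
        "page_char_count": len(text),
        "page_word_count": len(text.split(" ")),
        "page_sentence_count_raw": len(text.split(".")),
        "page_token_count": len(text) / 4,  # rough heuristic: 1 token ~ 4 characters
    }


title_stats = page_stats("Human Nutrition: 2020 Edition")
empty_stats = page_stats("")

print(title_stats)
print(empty_stats)
```

If exact counts matter, `text.split()` with no argument returns an empty list for an empty string, so the word count of a blank page would be 0 instead of 1.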

In [3]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 847,
  'page_char_count': 813,
  'page_word_count': 157,
  'page_sentence_count_raw': 7,
  'page_token_count': 203.25,
  'text': 'not only by the kinds of liquids given to an infant, but also by  the frequency and length of time that fluids are given. Giving a  child a bottle of juice or other sweet liquids several times each  day, or letting a baby suck on a bottle longer than a mealtime,  either when awake or asleep, can also cause early childhood caries.  In addition, this practice affects the development and position of  the teeth and the jaw. The risk of early childhood caries continues  into the toddler years as children begin to consume more foods  with a high sugar content. Therefore, parents should avoid putting  their children to bed with a bottle, and giving their children sugary  snacks and beverages. If a parent insists on giving their child a bottle  in bed, then it should be filled with water only.  Infancy  |  847'},
 {'page_number': 926,
  'page_char_c

In [4]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition |
| 1 | -40 | 0 | 1 | 1 | 0.0 | |
| 2 | -39 | 320 | 54 | 1 | 80.0 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | -38 | 212 | 32 | 3 | 53.0 | Human Nutrition: 2020 Edition by University of... |
| 4 | -37 | 797 | 147 | 3 | 199.25 | Contents Preface University of Hawai‘i at Mā... |


In [5]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 199.5 | 14.18 | 287.0 |
| std | 348.86 | 560.38 | 95.83 | 9.54 | 140.1 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 | 190.5 |
| 50% | 562.5 | 1231.5 | 216.0 | 13.0 | 307.88 |
| 75% | 864.25 | 1603.5 | 272.0 | 19.0 | 400.88 |
| max | 1166.0 | 2308.0 | 430.0 | 82.0 | 577.0 |


### Further text processing (splitting pages into sentences)

Two ways to do this:
1. Split the text on `"."`.
2. Use an NLP library such as spaCy or NLTK.
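
To see why a plain split on `"."` tends to overcount, consider a sentence containing decimals. The sketch below compares the naive split with a simple regular-expression heuristic. The example sentence is made up, and the regex is only an illustration of a rule-based approach, not what spaCy does internally.

```python
import re

# Made-up example text containing decimal numbers
text = "The chloride AI is 2.3 grams. See Table 3.7 for details."

# Option 1: a naive split on "." also breaks the decimals apart
naive_sentences = text.split(".")

# Illustrative alternative: split on whitespace that follows ., ! or ?
# and precedes an uppercase letter (a simple heuristic, not spaCy's rules)
regex_sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

print(len(naive_sentences))
print(regex_sentences)
```

The naive split yields five fragments for what is really two sentences, which is consistent with the raw sentence counts in the table above being only a rough estimate.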

In [6]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer component to the pipeline
nlp.add_pipe("sentencizer")

# Create document instance as an example
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2

# Print out our sentences split
list(doc.sents)

[This is a sentence., This is another sentence.]

In [7]:
pages_and_texts[:2]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [8]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all sentences are strings (spaCy returns Span objects by default)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])


In [9]:
random.sample(pages_and_texts, k=1)

[{'page_number': 967,
  'page_char_count': 889,
  'page_word_count': 157,
  'page_sentence_count_raw': 18,
  'page_token_count': 222.25,
  'text': 'Image by  Allison  Calabrese /  CC BY 4.0  Figure 16.8 The Female Athlete Triad  Iron  Iron deficiency is very common in athletes. During exercise, iron- containing proteins like hemoglobin and myoglobin are needed in  great amounts. An iron deficiency can impair muscle function to  limit work capacity leading to compromised training performance.  Some athletes in intense training may have an increase in iron losses  through sweat, urine, and feces. Iron losses are greater in females  than males due to the iron lost in blood every menstrual cycle.  Female athletes, distance runners and vegetarians are at the  greatest risk for developing iron deficiency.8 See Table 16.3 “The  triad. Published October 7, 2016. Accessed March 16,  2018.  8. Beard J, Tobin B. (2000). Iron Status and Exercise. The  American Journal of Clinical Nutrition, 72(2),

In [10]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 199.5 | 14.18 | 287.0 | 10.32 |
| std | 348.86 | 560.38 | 95.83 | 9.54 | 140.1 | 6.3 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 | 190.5 | 5.0 |
| 50% | 562.5 | 1231.5 | 216.0 | 13.0 | 307.88 | 10.0 |
| 75% | 864.25 | 1603.5 | 272.0 | 19.0 | 400.88 | 15.0 |
| max | 1166.0 | 2308.0 | 430.0 | 82.0 | 577.0 | 28.0 |


### Chunking our sentences together

In [11]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

def split_list(input_list: list[str],
               slice_size: int=num_sentence_chunk_size) -> list[list[str]]:
    """Splits input_list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]
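
As a worked example of the arithmetic, take a hypothetical page with 28 sentences (the maximum spaCy sentence count in the table above). With a chunk size of 10, the number of chunks is the sentence count divided by the chunk size, rounded up. The sketch redefines `split_list` so that it runs on its own, and the sentence strings are placeholders.

```python
import math


def split_list(input_list: list, slice_size: int = 10) -> list[list]:
    """Split input_list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]


# A hypothetical page with 28 placeholder sentences
sentences = [f"Sentence {i}." for i in range(28)]
chunks = split_list(sentences, slice_size=10)

chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)

# The number of chunks is the sentence count divided by the chunk size, rounded up
assert len(chunks) == math.ceil(len(sentences) / 10)
```

The last chunk simply holds whatever is left over, and a page with no sentences produces no chunks at all.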

In [12]:
# Loop through pages and text, and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                        slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])


In [13]:
random.sample(pages_and_texts, k=1)

[{'page_number': 268,
  'page_char_count': 1047,
  'page_word_count': 156,
  'page_sentence_count_raw': 24,
  'page_token_count': 261.75,
  'text': 'grain products daily were 30 percent less likely to have a heart  attack.8  The AHA makes the following statements on whole grains9:  • “Dietary fiber from whole grains, as part of an overall healthy  diet, helps reduce blood cholesterol levels and may lower risk  of heart disease.”  • “Fiber-containing foods, such as whole grains, help provide a  feeling of fullness with fewer calories and may help with  weight management.”  Figure 4.15 Grain Consumption Statistics in America  8. Liu S, Stampfer MJ, et al. (1999). Whole-Grain  Consumption and Risk of Coronary Heart Disease:  Results from the Nurses’ Health Study. American Journal  of Clinical Nutrition, 70(3), 412–19. http://www.ajcn.org/ content/70/3/412.long. Accessed September 27, 2017.  9. Whole Grains and Fiber. American Heart Association.  http://www.heart.org/HEARTORG/GettingHealth

In [14]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 199.5 | 14.18 | 287.0 | 10.32 | 1.53 |
| std | 348.86 | 560.38 | 95.83 | 9.54 | 140.1 | 6.3 | 0.64 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 | 190.5 | 5.0 | 1.0 |
| 50% | 562.5 | 1231.5 | 216.0 | 13.0 | 307.88 | 10.0 | 1.0 |
| 75% | 864.25 | 1603.5 | 272.0 | 19.0 | 400.88 | 15.0 | 2.0 |
| max | 1166.0 | 2308.0 | 430.0 | 82.0 | 577.0 | 28.0 | 3.0 |


### Splitting each chunk into its own item

In [15]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the list of sentences into a single paragraph-like string
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" => ". A" (works for any uppercase letter)

        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get some stats on our chunks
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token ~4 chars

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)


1843
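
The join and the regular expression in the loop above can be tried on a tiny example. This sketch assumes the sentence strings carry no trailing whitespace, so joining them with `""` leaves no space after each full stop. The example sentences are made up.

```python
import re

# Sentence strings as they might come out of the sentencizer (made-up examples)
sentence_chunk = ["Iron deficiency is common in athletes.", "See Table 16.3 for details."]

# Same two steps as in the loop above
joined = "".join(sentence_chunk).replace("  ", " ").strip()
fixed = re.sub(r"\.([A-Z])", r". \1", joined)

print(joined)
print(fixed)

# Caveat: the same pattern also splits dotted abbreviations
abbreviation = re.sub(r"\.([A-Z])", r". \1", "U.S.A. guidelines")
print(abbreviation)
```

The decimal in "16.3" is left alone because the pattern only matches a full stop followed by an uppercase letter, but abbreviations such as "U.S.A." do get extra spaces inserted.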

In [16]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 870,
  'sentence_chunk': 'http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=463  870 | Introduction',
  'chunk_char_count': 76,
  'chunk_word_count': 6,
  'chunk_token_count': 19.0}]

In [17]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 1843.0 | 1843.0 | 1843.0 | 1843.0 |
| mean | 583.38 | 734.1 | 112.74 | 183.52 |
| std | 347.79 | 447.51 | 71.24 | 111.88 |
| min | -41.0 | 12.0 | 3.0 | 3.0 |
| 25% | 280.5 | 315.0 | 45.0 | 78.75 |
| 50% | 586.0 | 745.0 | 115.0 | 186.25 |
| 75% | 890.0 | 1118.0 | 173.0 | 279.5 |
| max | 1166.0 | 1830.0 | 297.0 | 457.5 |


### Filter out short chunks of text

Very short chunks, such as stray page headers and footers, may not contain much useful information.

In [18]:
# Show random chunks of 30 tokens or fewer
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 15.25 | Text: Accessed November 30, 2017. Discovering Nutrition Facts | 737
Chunk token count: 24.25 | Text: biological, chemicals, or physical) and identify preventative 1014 | Protecting the Public Health
Chunk token count: 12.75 | Text: PART VI CHAPTER 6. PROTEIN Chapter 6. Protein | 357
Chunk token count: 3.75 | Text: 806 | Pregnancy
Chunk token count: 27.75 | Text: In exchange, for the reabsorption of sodium and water, potassium is excreted. Regulation of Water Balance | 169


In [19]:
# Keep only the rows with more than 30 tokens
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]
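
The same filter can be written without pandas as a list comprehension over the chunk dictionaries. The sketch below uses two chunks from the outputs above, reduced to the relevant fields, with the second chunk's text shortened for brevity.

```python
min_token_length = 30

# Two chunks from the outputs above (second text shortened for illustration)
chunks = [
    {"sentence_chunk": "806 | Pregnancy", "chunk_token_count": 3.75},
    {"sentence_chunk": "Human Nutrition: 2020 Edition by University of ...", "chunk_token_count": 52.5},
]

# Keep only chunks whose estimated token count exceeds the threshold
kept = [chunk for chunk in chunks if chunk["chunk_token_count"] > min_token_length]

print(len(kept))
```

The page footer is dropped and the licence paragraph is kept. A chunk with exactly 30 estimated tokens would also be dropped, since the comparison is strict.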

In [20]:
random.sample(pages_and_chunks_over_min_token_len, k=1)

[{'page_number': 809,
  'sentence_chunk': 'Baby-Friendly USA. (2020). The ten steps to successful breastfeeding.https://www.babyfriendlyusa.org/for- facilities/practice-guidelines/10-steps-and- international-code/ Infancy | 809',
  'chunk_char_count': 184,
  'chunk_word_count': 14,
  'chunk_token_count': 46.0}]

### Embedding our text chunks

In [21]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                     device="cpu")

# Create a list of sentences
sentences = [
    "Learning Activities Technology Note: The second edition of the Human Nutrition Open Educational Resource (OER) textbook features interactive learning activities",
    "I like horses!"
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

for sentence, embedding in embeddings_dict.items():
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding}")
    print("")



Sentence: Learning Activities Technology Note: The second edition of the Human Nutrition Open Educational Resource (OER) textbook features interactive learning activities
Embedding: [-3.04869073e-03 -5.04321009e-02  3.28285460e-05 -4.17669751e-02
  2.43278388e-02  4.45254929e-02  2.30051205e-02  4.34296392e-02
  5.52819073e-02 -1.17314244e-02  2.55363639e-02  5.15292399e-04
  7.15017784e-03  1.28226485e-02 -1.03462450e-02 -6.01724535e-02
  7.26220012e-03  2.78440509e-02 -2.65388135e-02  3.83527800e-02
 -3.79710388e-03  1.12740379e-02 -5.53640835e-02  2.00721305e-02
 -9.69741214e-03  3.75270285e-03 -2.41694357e-02  7.56419124e-03
  9.71736945e-03 -8.09573010e-02  4.51820623e-03  3.78629752e-02
 -1.64126009e-02 -3.07473708e-02  1.90306287e-06 -2.45039444e-02
 -2.88355574e-02  5.79970852e-02 -9.41525847e-02  2.38348190e-02
  8.19078088e-02  6.08036593e-02 -7.63305975e-03 -1.21022738e-03
  1.66139621e-02  5.39150201e-02  7.20746070e-02  1.96702965e-02
 -3.61063741e-02 -2.26881728e-02  6.367

In [22]:
embeddings[0].shape

(768,)

In [23]:
embedding = embedding_model.encode("My favorite animal is the dog")
embedding

array([-5.00165345e-03,  4.72070612e-02, -2.40024757e-02, -1.58585571e-02,
        2.43593287e-02,  7.41919205e-02, -7.20568597e-02, -4.86758631e-03,
       -4.06144373e-02, -2.54812911e-02, -4.35632430e-02,  7.13663325e-02,
       -6.53339028e-02, -3.90337780e-02,  1.16073042e-02, -3.58555801e-02,
        3.92476581e-02,  3.44534665e-02, -4.05611843e-03,  2.30281334e-02,
       -5.52244997e-03,  5.43379486e-02, -2.46011242e-02, -1.10480832e-02,
        1.83787495e-02,  2.62907073e-02, -8.49822350e-03, -2.58107781e-02,
        5.51881082e-03, -1.84886847e-02, -5.59564941e-02, -5.69908395e-02,
        1.31537057e-02,  7.42521510e-03,  1.27404417e-06,  1.13695366e-02,
       -8.80080182e-03,  2.10266025e-03,  6.48652762e-02, -6.27641678e-02,
        2.95485556e-02, -6.84317900e-04, -2.72781793e-02,  1.91967329e-03,
        1.73108894e-02,  2.81309858e-02,  5.70766069e-02,  8.49287808e-02,
       -5.50692976e-02,  3.90659012e-02, -1.84876267e-02, -5.11634275e-02,
       -2.97517218e-02,  

In [24]:
%%time

# embedding_model.to("cpu")

# # Embed each chunk one by one
# for item in tqdm(pages_and_chunks_over_min_token_len):
#     item["embedding"] = embedding_model.encode(item["sentence_chunk"])

CPU times: total: 0 ns
Wall time: 0 ns


In [25]:
%%time

embedding_model.to("cuda") # requires a CUDA-capable GPU

for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])


CPU times: total: 3min 38s
Wall time: 33.1 s


In [26]:
%%time

text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]
text_chunks[293]

CPU times: total: 0 ns
Wall time: 0 ns


'The chloride AI for adults, set by the IOM, is 2,300 milligrams. Therefore just ⅔ teaspoon of table salt per day is sufficient for chloride as well as sodium. The AIs for other age groups are listed in Table 3.7 “Adequate Intakes for Chloride”. Table 3.7 Adequate Intakes for Chloride Chloride | 191'

In [27]:
len(text_chunks)

1680

In [28]:
%%time

# Embed all text in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                              batch_size=32,
                                              convert_to_tensor=True)

text_chunk_embeddings

CPU times: total: 41.1 s
Wall time: 13.1 s


tensor([[ 0.0674,  0.0902, -0.0051,  ..., -0.0221, -0.0232,  0.0126],
        [ 0.0552,  0.0592, -0.0166,  ..., -0.0120, -0.0103,  0.0227],
        [ 0.0280,  0.0340, -0.0206,  ..., -0.0054,  0.0213,  0.0313],
        ...,
        [ 0.0771,  0.0098, -0.0122,  ..., -0.0409, -0.0752, -0.0241],
        [ 0.1030, -0.0165,  0.0083,  ..., -0.0574, -0.0283, -0.0295],
        [ 0.0864, -0.0125, -0.0113,  ..., -0.0522, -0.0337, -0.0299]],
       device='cuda:0')
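
Conceptually, `batch_size=32` means the chunks are passed to the model in groups of up to 32 instead of one at a time. The following sketch shows only that grouping step. The helper `iter_batches` is hypothetical and is not the library's internal implementation, and placeholder strings stand in for the real text chunks.

```python
def iter_batches(items: list, batch_size: int = 32):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder strings standing in for the 1680 text chunks
placeholder_chunks = [f"chunk {i}" for i in range(1680)]
batches = list(iter_batches(placeholder_chunks, batch_size=32))

print(len(batches))      # number of batches
print(len(batches[-1]))  # size of the final, partial batch
```

With 1680 chunks this gives 53 batches, the last one holding 16 chunks, so the model is called far fewer times than in the one-by-one loop.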

### Save embeddings to file

In [29]:
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [30]:
# Import saved file and view (note: the embedding column is read back as strings)
text_chunks_and_embeddings_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embeddings_df_load.head()

| | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|
| 0 | -39 | Human Nutrition: 2020 Edition UNIVERSITY OF HA... | 308 | 42 | 77.0 | [ 6.74242750e-02 9.02281553e-02 -5.09548420e-... |
| 1 | -38 | Human Nutrition: 2020 Edition by University of... | 210 | 30 | 52.5 | [ 5.52156307e-02 5.92139177e-02 -1.66167375e-... |
| 2 | -37 | Contents Preface University of Hawai‘i at Māno... | 766 | 116 | 191.5 | [ 2.79802009e-02 3.39813903e-02 -2.06426457e-... |
| 3 | -36 | Lifestyles and Nutrition University of Hawai‘i... | 941 | 144 | 235.25 | [ 6.82566836e-02 3.81275155e-02 -8.46855994e-... |
| 4 | -35 | The Cardiovascular System University of Hawai‘... | 998 | 152 | 249.5 | [ 3.30264494e-02 -8.49768426e-03 9.57158674e-... |
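
Because CSV stores everything as text, each embedding comes back as a string like the ones shown above. One way to turn such a string back into numbers is sketched below. The helper name `parse_embedding` is an assumption for illustration, and the example string contains only the first two values of the first row.

```python
def parse_embedding(embedding_str: str) -> list[float]:
    """Convert a string such as "[ 0.1 -0.2 0.3]" back into a list of floats."""
    # Drop the surrounding brackets, then split on any whitespace
    return [float(value) for value in embedding_str.strip("[]").split()]


# First two values of the first row above (the remaining values are omitted here)
example = "[ 6.74242750e-02 9.02281553e-02]"
parsed = parse_embedding(example)
print(parsed)
```

With the loaded DataFrame, something like `text_chunks_and_embeddings_df_load["embedding"].apply(parse_embedding)` should restore numeric vectors, which could then be wrapped in NumPy arrays or tensors as needed.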


## 2. RAG - Search and Answer