## 1. Document/text processing and embedding creation

Ingredients:
* PDF document of choice (could be any kind of document).
* Embedding model of choice.

Steps:
1. Import PDF document.
2. Process text for embedding.
3. Embed text chunks with embedding model.
4. Save embeddings to file for later use.

### Import PDF Document

In [1]:
import os
import requests # help download stuff

# Get PDF Document
pdf_path = "human-nutrition-text.pdf"

# Download
if not os.path.exists(pdf_path):
    print(f"[INFO] File doesn't exist, downloading...")

    # Enter URL of the pdf
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

    # The local filename to save the downloaded file
    filename = pdf_path

    # Send a GET request to the URL
    response = requests.get(url)
    if response.status_code == 200:
        # Open the file and save it
        with open(pdf_path, "wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] Failed to download the file. Status Code: {response.status_code}")
else:
    print(f"File {pdf_path} exists.")

File human-nutrition-text.pdf exists.


In [2]:
import fitz # requires: PyMuPDF
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text"""
    cleaned_text = text.replace("\n", " ").strip()

    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Opens a PDF and returns one dict of text and basic statistics per page"""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text)
        # Offset the index so page_number lines up with the book's printed page numbers
        pages_and_texts.append({"page_number": page_number - 41,
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(".")),
                                "page_token_count": len(text) / 4, # 1 token ~4 chars
                                "text": text
                               })
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path)
pages_and_texts[:2]

0it [00:00, ?it/s]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [3]:
import random

random.sample(pages_and_texts, k=3)

[{'page_number': 219,
  'page_char_count': 1248,
  'page_word_count': 221,
  'page_sentence_count_raw': 12,
  'page_token_count': 312.0,
  'text': 'Photo by  Jeremy  Ricketts on  unsplash.co m / CC0  chemicals found in coffee and tea. This means that when assessing  the benefits and consequences of your caffeine intake, you must  take into account how much caffeine in your diet comes from coffee  and tea versus how much you obtain from soft drinks.  There is scientific evidence supporting that higher consumption of  caffeine, mostly in the form of coffee, substantially reduces the risk  for developing Type 2 diabetes and Parkinson’s disease. There is a  lesser amount of evidence suggesting increased coffee consumption  lowers the risk of heart attacks in both men and women, and strokes  in women. In smaller population studies, decaffeinated coffee  sometimes performs as well as caffeinated coffee, bringing up the  hypothesis that there are beneficial chemicals in coffee other than  caf

In [4]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition |
| 1 | -40 | 0 | 1 | 1 | 0.0 | |
| 2 | -39 | 320 | 54 | 1 | 80.0 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | -38 | 212 | 32 | 3 | 53.0 | Human Nutrition: 2020 Edition by University of... |
| 4 | -37 | 797 | 147 | 3 | 199.25 | Contents Preface University of Hawai‘i at Mā... |


In [5]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 199.5 | 14.18 | 287.0 |
| std | 348.86 | 560.38 | 95.83 | 9.54 | 140.1 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 | 190.5 |
| 50% | 562.5 | 1231.5 | 216.0 | 13.0 | 307.88 |
| 75% | 864.25 | 1603.5 | 272.0 | 19.0 | 400.88 |
| max | 1166.0 | 2308.0 | 430.0 | 82.0 | 577.0 |


### Further text processing (splitting pages into sentences)

Two ways to do this:
1. Split on `"."` (simple, but it miscounts whenever a period is not a sentence boundary, as in decimals and abbreviations).
2. Use an NLP library such as spaCy or NLTK.
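To see why the first option only gives a rough count, here is a minimal sketch using a made-up example sentence (not taken from the PDF). It contains two real sentences but four `"."` characters:

```python
# Hypothetical example text: two sentences, with decimals inside them.
text = "The chloride AI is 2.3 grams per day. See Table 3.7 for details."

# Naive approach: treat every "." as a sentence boundary.
naive_parts = text.split(".")
naive_count = len(naive_parts)

# Dropping empty or whitespace-only pieces helps a little but still overcounts.
non_empty_count = len([part for part in naive_parts if part.strip()])

print(f"Naive count: {naive_count} | Non-empty count: {non_empty_count}")
```

This is one reason `page_sentence_count_raw` tends to come out higher than the spaCy-based count computed later in this section.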

In [6]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer pipeline
nlp.add_pipe("sentencizer")

# Create document instance as an example
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2

# Print out our sentences split
list(doc.sents)

[This is a sentence., This is another sentence.]

In [7]:
pages_and_texts[:2]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [8]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # Make sure all sentences are strings (by default they are spaCy Span objects)
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]

    # Count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [9]:
random.sample(pages_and_texts, k=1)

[{'page_number': 219,
  'page_char_count': 1248,
  'page_word_count': 221,
  'page_sentence_count_raw': 12,
  'page_token_count': 312.0,
  'text': 'Photo by  Jeremy  Ricketts on  unsplash.co m / CC0  chemicals found in coffee and tea. This means that when assessing  the benefits and consequences of your caffeine intake, you must  take into account how much caffeine in your diet comes from coffee  and tea versus how much you obtain from soft drinks.  There is scientific evidence supporting that higher consumption of  caffeine, mostly in the form of coffee, substantially reduces the risk  for developing Type 2 diabetes and Parkinson’s disease. There is a  lesser amount of evidence suggesting increased coffee consumption  lowers the risk of heart attacks in both men and women, and strokes  in women. In smaller population studies, decaffeinated coffee  sometimes performs as well as caffeinated coffee, bringing up the  hypothesis that there are beneficial chemicals in coffee other than  caf

In [10]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy |
|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 199.5 | 14.18 | 287.0 | 10.32 |
| std | 348.86 | 560.38 | 95.83 | 9.54 | 140.1 | 6.3 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 | 190.5 | 5.0 |
| 50% | 562.5 | 1231.5 | 216.0 | 13.0 | 307.88 | 10.0 |
| 75% | 864.25 | 1603.5 | 272.0 | 19.0 | 400.88 | 15.0 |
| max | 1166.0 | 2308.0 | 430.0 | 82.0 | 577.0 | 28.0 |


### Chunking our sentences together

In [11]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

def split_list(input_list: list[str],
               slice_size: int=num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]
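The number of chunks a page produces follows directly from its sentence count: the count divided by the chunk size, rounded up. A small sketch of that arithmetic (the helper name `expected_num_chunks` is ours for illustration, not part of the notebook):

```python
import math

def expected_num_chunks(num_sentences: int, slice_size: int = 10) -> int:
    """Number of slices produced when a list is cut into pieces of at most slice_size."""
    return math.ceil(num_sentences / slice_size)

# 25 items with a slice size of 10: two full chunks plus one partial chunk
chunks_for_25 = expected_num_chunks(25)

# A page with no sentences yields no chunks at all
chunks_for_0 = expected_num_chunks(0)

print(chunks_for_25, chunks_for_0)
```

With a chunk size of 10, a page needs more than 30 sentences before it produces a fourth chunk.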

In [12]:
# Loop through pages and text, and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                        slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [13]:
random.sample(pages_and_texts, k=1)

[{'page_number': 195,
  'page_char_count': 1752,
  'page_word_count': 304,
  'page_sentence_count_raw': 16,
  'page_token_count': 438.0,
  'text': 'Potassium  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  Potassium is the most abundant positively charged ion inside of  cells. Ninety percent of potassium exists in intracellular fluid, with  about 10 percent in extracellular fluid, and only 1 percent in blood  plasma. As with sodium, potassium levels in the blood are strictly  regulated. The hormone aldosterone is what primarily controls  potassium levels, but other hormones (such as insulin) also play  a role. When potassium levels in the blood increase, the adrenal  glands release aldosterone. The aldosterone acts on the collecting  ducts of kidneys, where it stimulates an increase in the number  of sodium-potassium pumps. Sodium is then reabsorbed and more  potassium is excreted. Because potassium is required for  maintaining sod

In [14]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_token_count | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 199.5 | 14.18 | 287.0 | 10.32 | 1.53 |
| std | 348.86 | 560.38 | 95.83 | 9.54 | 140.1 | 6.3 | 0.64 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 | 190.5 | 5.0 | 1.0 |
| 50% | 562.5 | 1231.5 | 216.0 | 13.0 | 307.88 | 10.0 | 1.0 |
| 75% | 864.25 | 1603.5 | 272.0 | 19.0 | 400.88 | 15.0 | 2.0 |
| max | 1166.0 | 2308.0 | 430.0 | 82.0 | 577.0 | 28.0 | 3.0 |


### Splitting each chunk into its own item

In [15]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # Join the sentences together into a paragraph like structure, aka join the list of sentences into one paragraph
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" => ". A" (will work for any uppercase letter)

        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get some stats on our chunks
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token ~4 chars

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)

  0%|          | 0/1208 [00:00<?, ?it/s]

1843
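Because the sentences are joined with an empty string, two sentences can end up glued together as `".A"`, which is what the `re.sub` call repairs. The sketch below isolates that one step on made-up strings (the function name `fix_missing_space_after_period` is hypothetical) and also shows a side effect worth knowing about:

```python
import re

def fix_missing_space_after_period(text: str) -> str:
    """Insert a space after any "." that is immediately followed by an uppercase letter."""
    return re.sub(r'\.([A-Z])', r'. \1', text)

# Intended case: two sentences glued together by the join.
# The decimal "2.23" is untouched because "2" is not an uppercase letter.
glued = fix_missing_space_after_period("They are well vascularized.Figure 2.23 shows the kidneys.")

# Side effect: dotted abbreviations are split apart as well.
abbreviation = fix_missing_space_after_period("Made in the U.S.A.")

print(glued)
print(abbreviation)
```

For this textbook the side effect is probably harmless, but it may matter for documents that are heavy on abbreviations.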

In [16]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 1123,
  'sentence_chunk': 'and about food that they eat. Anorexia results in extreme nutrient inadequacy and eventually to organ malfunction. Anorexia is relatively rare—the National Institute of Mental Health (NIMH) reports that 0.9 percent of females and 0.3 percent of males will have anorexia at some point in their lifetime, but it is an extreme example of how an unbalanced diet can affect health.2 Anorexia frequently manifests during adolescence and it has the highest rate of mortality of all mental illnesses. People with anorexia consume, on average, fewer than 1,000 kilocalories per day and exercise excessively. They are in a tremendous caloric imbalance. Moreover, some may participate in binge eating, self-induced vomiting, and purging with laxatives or enemas. The very first time a person starves him- or herself may trigger the onset of anorexia. The exact causes of anorexia are not completely known, but many things contribute to its development including econo

In [17]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 1843.0 | 1843.0 | 1843.0 | 1843.0 |
| mean | 583.38 | 734.1 | 112.74 | 183.52 |
| std | 347.79 | 447.51 | 71.24 | 111.88 |
| min | -41.0 | 12.0 | 3.0 | 3.0 |
| 25% | 280.5 | 315.0 | 45.0 | 78.75 |
| 50% | 586.0 | 745.0 | 115.0 | 186.25 |
| 75% | 890.0 | 1118.0 | 173.0 | 279.5 |
| max | 1166.0 | 1830.0 | 297.0 | 457.5 |
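The token counts in these statistics come from the 4-characters-per-token rule of thumb, not from a real tokenizer. They are still handy for a quick sanity check against an embedding model's input limit. The sketch below assumes a limit of 384 tokens, which is our understanding of the default for `all-mpnet-base-v2` (the model used later); treat that number as an assumption and check the `max_seq_length` of the model you actually load. The helper names are hypothetical.

```python
ASSUMED_MAX_TOKENS = 384  # assumption: verify against your embedding model's real limit

def estimate_token_count(text: str, chars_per_token: float = 4.0) -> float:
    """Rough token estimate using the ~4 characters per token rule of thumb."""
    return len(text) / chars_per_token

def may_be_truncated(text: str, max_tokens: int = ASSUMED_MAX_TOKENS) -> bool:
    """True if the estimated token count exceeds the assumed model limit."""
    return estimate_token_count(text) > max_tokens

title_estimate = estimate_token_count("Human Nutrition: 2020 Edition")

# Stand-in text with the same length as the longest chunk (1830 characters)
longest_chunk_flag = may_be_truncated("x" * 1830)

print(title_estimate, longest_chunk_flag)
```

If the flag is raised for many chunks, a smaller `num_sentence_chunk_size` is one option to consider, since text beyond the model's limit is typically truncated rather than embedded.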


### Filtering out short chunks of text

Very short chunks (page headers, figure captions, fragments of references) are unlikely to contain much useful information, so we filter them out.

In [18]:
# Show random chunks of 30 tokens or fewer
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 20.25 | Text: Published 2002. Accessed December 2, 2017. Pacific Based Dietary Guidelines | 761
Chunk token count: 12.5 | Text: Figure 11.2 The Structure of Hemoglobin Iron | 655
Chunk token count: 21.5 | Text: http://www.health.gov.fj/?page_id=1406. Accessed November 12, 2017. 652 | Introduction
Chunk token count: 25.5 | Text: http://www.ajcn.org/cgi/ pmidlookup?view=long&pmid=10197575. Accessed October 6, 2017. 640 | Magnesium
Chunk token count: 9.75 | Text: 1002 | The Causes of Food Contamination


In [19]:
# Keep only the rows of our DataFrame with more than 30 tokens
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]
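The same filter can be written without pandas. This is a minimal sketch on a hand-made list (the token counts are copied from the outputs above, and the last text is shortened for illustration); it makes the boundary explicit, since a chunk of exactly 30 tokens is dropped:

```python
min_token_length = 30

# Hypothetical miniature stand-in for the list of chunk dicts
sample_chunks = [
    {"sentence_chunk": "1002 | The Causes of Food Contamination", "chunk_token_count": 9.75},
    {"sentence_chunk": "Figure 11.2 The Structure of Hemoglobin Iron | 655", "chunk_token_count": 12.5},
    {"sentence_chunk": "(made-up chunk sitting exactly on the threshold)", "chunk_token_count": 30.0},
    {"sentence_chunk": "Human Nutrition: 2020 Edition by University of ...", "chunk_token_count": 52.5},
]

# Keep only chunks strictly above the threshold, mirroring the ">" used in the DataFrame filter
kept_chunks = [chunk for chunk in sample_chunks if chunk["chunk_token_count"] > min_token_length]

print(f"Kept {len(kept_chunks)} of {len(sample_chunks)} chunks")
```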

In [20]:
random.sample(pages_and_chunks_over_min_token_len, k=1)

[{'page_number': 115,
  'sentence_chunk': '“Kidney Position in Abdomen” by OpenStax College / CC BY 3.0 The kidneys lie on either side of the spine in the retroperitoneal space behind the main body cavity that contains the intestines. The kidneys are well protected by muscle, fat, and the lower ribs. They are roughly the size of your fist, and the male kidney is typically a bit larger than the female kidney. The kidneys are well vascularized, receiving about 25 percent of the cardiac output at rest. Figure 2.23 The Kidneys The kidneys (as viewed from the back of the body) are slightly protected by the ribs and are surrounded by fat for protection (not shown). The effects of failure of parts of the urinary system may range from inconvenient (incontinence) to fatal (loss of filtration and many other functions). The kidneys catalyze the final reaction in the synthesis of active vitamin D that in turn helps regulate Ca++. The kidney hormone EPO stimulates erythrocyte development and promot

### Embedding our text chunks

In [21]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                     device="cpu")

# Create a list of sentences
sentences = [
    "Learning Activities Technology Note: The second edition of the Human Nutrition Open Educational Resource (OER) textbook features interactive learning activities",
    "I like horses!"
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

for sentence, embedding in embeddings_dict.items():
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding}")
    print("")



Sentence: Learning Activities Technology Note: The second edition of the Human Nutrition Open Educational Resource (OER) textbook features interactive learning activities
Embedding: [-3.04869073e-03 -5.04321009e-02  3.28285460e-05 -4.17669751e-02
  2.43278388e-02  4.45254929e-02  2.30051205e-02  4.34296392e-02
  5.52819073e-02 -1.17314244e-02  2.55363639e-02  5.15292399e-04
  7.15017784e-03  1.28226485e-02 -1.03462450e-02 -6.01724535e-02
  7.26220012e-03  2.78440509e-02 -2.65388135e-02  3.83527800e-02
 -3.79710388e-03  1.12740379e-02 -5.53640835e-02  2.00721305e-02
 -9.69741214e-03  3.75270285e-03 -2.41694357e-02  7.56419124e-03
  9.71736945e-03 -8.09573010e-02  4.51820623e-03  3.78629752e-02
 -1.64126009e-02 -3.07473708e-02  1.90306287e-06 -2.45039444e-02
 -2.88355574e-02  5.79970852e-02 -9.41525847e-02  2.38348190e-02
  8.19078088e-02  6.08036593e-02 -7.63305975e-03 -1.21022738e-03
  1.66139621e-02  5.39150201e-02  7.20746070e-02  1.96702965e-02
 -3.61063741e-02 -2.26881728e-02  6.367

In [22]:
embeddings[0].shape

(768,)

In [23]:
embedding = embedding_model.encode("My favorite animal is the dog")
embedding

array([-5.00165345e-03,  4.72070612e-02, -2.40024757e-02, -1.58585571e-02,
        2.43593287e-02,  7.41919205e-02, -7.20568597e-02, -4.86758631e-03,
       -4.06144373e-02, -2.54812911e-02, -4.35632430e-02,  7.13663325e-02,
       -6.53339028e-02, -3.90337780e-02,  1.16073042e-02, -3.58555801e-02,
        3.92476581e-02,  3.44534665e-02, -4.05611843e-03,  2.30281334e-02,
       -5.52244997e-03,  5.43379486e-02, -2.46011242e-02, -1.10480832e-02,
        1.83787495e-02,  2.62907073e-02, -8.49822350e-03, -2.58107781e-02,
        5.51881082e-03, -1.84886847e-02, -5.59564941e-02, -5.69908395e-02,
        1.31537057e-02,  7.42521510e-03,  1.27404417e-06,  1.13695366e-02,
       -8.80080182e-03,  2.10266025e-03,  6.48652762e-02, -6.27641678e-02,
        2.95485556e-02, -6.84317900e-04, -2.72781793e-02,  1.91967329e-03,
        1.73108894e-02,  2.81309858e-02,  5.70766069e-02,  8.49287808e-02,
       -5.50692976e-02,  3.90659012e-02, -1.84876267e-02, -5.11634275e-02,
       -2.97517218e-02,  

In [24]:
%%time

# embedding_model.to("cpu")

# # Embed each chunk one by one
# for item in tqdm(pages_and_chunks_over_min_token_len):
#     item["embedding"] = embedding_model.encode(item["sentence_chunk"])

CPU times: total: 0 ns
Wall time: 0 ns


In [25]:
%%time

embedding_model.to("cuda")

for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

  0%|          | 0/1680 [00:00<?, ?it/s]

CPU times: total: 3min 17s
Wall time: 30.6 s


In [26]:
%%time

text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]
text_chunks[293]

CPU times: total: 0 ns
Wall time: 1 ms


'The chloride AI for adults, set by the IOM, is 2,300 milligrams. Therefore just ⅔ teaspoon of table salt per day is sufficient for chloride as well as sodium. The AIs for other age groups are listed in Table 3.7 “Adequate Intakes for Chloride”. Table 3.7 Adequate Intakes for Chloride Chloride | 191'

In [27]:
len(text_chunks)

1680

In [28]:
%%time

# Embed all text in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                              batch_size=32,
                                              convert_to_tensor=True)

text_chunk_embeddings

CPU times: total: 47.6 s
Wall time: 13.1 s


tensor([[ 0.0674,  0.0902, -0.0051,  ..., -0.0221, -0.0232,  0.0126],
        [ 0.0552,  0.0592, -0.0166,  ..., -0.0120, -0.0103,  0.0227],
        [ 0.0280,  0.0340, -0.0206,  ..., -0.0054,  0.0213,  0.0313],
        ...,
        [ 0.0771,  0.0098, -0.0122,  ..., -0.0409, -0.0752, -0.0241],
        [ 0.1030, -0.0165,  0.0083,  ..., -0.0574, -0.0283, -0.0295],
        [ 0.0864, -0.0125, -0.0113,  ..., -0.0522, -0.0337, -0.0299]],
       device='cuda:0')

### Save embeddings to file

In [29]:
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [30]:
# Import saved file and view
text_chunks_and_embeddings_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embeddings_df_load.head()

| | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|
| 0 | -39 | Human Nutrition: 2020 Edition UNIVERSITY OF HA... | 308 | 42 | 77.0 | [ 6.74242750e-02 9.02281553e-02 -5.09548420e-... |
| 1 | -38 | Human Nutrition: 2020 Edition by University of... | 210 | 30 | 52.5 | [ 5.52156307e-02 5.92139177e-02 -1.66167375e-... |
| 2 | -37 | Contents Preface University of Hawai‘i at Māno... | 766 | 116 | 191.5 | [ 2.79802009e-02 3.39813903e-02 -2.06426457e-... |
| 3 | -36 | Lifestyles and Nutrition University of Hawai‘i... | 941 | 144 | 235.25 | [ 6.82566836e-02 3.81275155e-02 -8.46855994e-... |
| 4 | -35 | The Cardiovascular System University of Hawai‘... | 998 | 152 | 249.5 | [ 3.30264494e-02 -8.49768426e-03 9.57158674e-... |


## 2. RAG - Search and Answer

### Similarity search

In [31]:
import random
import torch
import numpy as np
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import text and embedding df
text_chunks_and_embeddings_df = pd.read_csv("text_chunks_and_embeddings_df.csv")

# Convert embedding column back to np.array (it was converted to a string when saved to CSV)
text_chunks_and_embeddings_df["embedding"] = text_chunks_and_embeddings_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert our embeddings into torch.tensor
embeddings = torch.tensor(np.array(text_chunks_and_embeddings_df["embedding"].to_list()), dtype=torch.float32).to(device)

# Convert text and embeddings df to list of dicts
pages_and_chunks = text_chunks_and_embeddings_df.to_dict(orient="records")

text_chunks_and_embeddings_df

| | page_number | sentence_chunk | chunk_char_count | chunk_word_count | chunk_token_count | embedding |
|---|---|---|---|---|---|---|
| 0 | -39 | Human Nutrition: 2020 Edition UNIVERSITY OF HA... | 308 | 42 | 77.00 | [0.067424275, 0.0902281553, -0.0050954842, -0.... |
| 1 | -38 | Human Nutrition: 2020 Edition by University of... | 210 | 30 | 52.50 | [0.0552156307, 0.0592139177, -0.0166167375, -0... |
| 2 | -37 | Contents Preface University of Hawai‘i at Māno... | 766 | 116 | 191.50 | [0.0279802009, 0.0339813903, -0.0206426457, 0.... |
| 3 | -36 | Lifestyles and Nutrition University of Hawai‘i... | 941 | 144 | 235.25 | [0.0682566836, 0.0381275155, -0.00846855994, -... |
| 4 | -35 | The Cardiovascular System University of Hawai‘... | 998 | 152 | 249.50 | [0.0330264494, -0.00849768426, 0.00957158674, ... |
| ... | ... | ... | ... | ... | ... | ... |
| 1675 | 1164 | Flashcard Images Note: Most images in the flas... | 1304 | 186 | 326.00 | [0.0185622461, -0.0164277758, -0.0127045559, -... |
| 1676 | 1164 | Hazard Analysis Critical Control Points reused... | 374 | 51 | 93.50 | [0.0334720798, -0.0570440665, 0.015148947, -0.... |
| 1677 | 1165 | ShareAlike 11. Organs reused “Pancreas Organ A... | 1285 | 175 | 321.25 | [0.0770515352, 0.00978557486, -0.0121817607, 0... |
| 1678 | 1165 | Sucrose reused “Figure 03 02 05” by OpenStax B... | 410 | 63 | 102.50 | [0.10304518, -0.0164701659, 0.00826845318, 0.0... |


In [32]:
text_chunks_and_embeddings_df["embedding"]

0       [0.067424275, 0.0902281553, -0.0050954842, -0....
1       [0.0552156307, 0.0592139177, -0.0166167375, -0...
2       [0.0279802009, 0.0339813903, -0.0206426457, 0....
3       [0.0682566836, 0.0381275155, -0.00846855994, -...
4       [0.0330264494, -0.00849768426, 0.00957158674, ...
                              ...                        
1675    [0.0185622461, -0.0164277758, -0.0127045559, -...
1676    [0.0334720798, -0.0570440665, 0.015148947, -0....
1677    [0.0770515352, 0.00978557486, -0.0121817607, 0...
1678    [0.10304518, -0.0164701659, 0.00826845318, 0.0...
1679    [0.0863773674, -0.0125358775, -0.0112746563, 0...
Name: embedding, Length: 1680, dtype: object

In [33]:
embeddings.shape

torch.Size([1680, 768])
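The round trip through CSV is the fiddly part of this cell: each embedding is written out as its printed form and has to be parsed back into numbers. Here is a dependency-free sketch of the same idea as the `np.fromstring(x.strip("[]"), sep=" ")` call, using a three-value string modeled on the saved file (the function name `parse_embedding_string` is hypothetical):

```python
def parse_embedding_string(embedding_str: str) -> list[float]:
    """Turn a string such as "[ 0.1 -0.2 3e-05]" back into a list of floats."""
    # Strip the surrounding brackets, then split on any run of whitespace
    return [float(value) for value in embedding_str.strip("[]").split()]

# Shortened example in the same format as the "embedding" column of the CSV
saved_value = "[ 6.74242750e-02  9.02281553e-02 -5.09548420e-03]"
parsed_embedding = parse_embedding_string(saved_value)

print(parsed_embedding)
```

Storing vectors as text works at this scale, but it is slow to parse and easy to break. For larger collections, a binary format or a vector database may be a better fit.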

In [34]:
# Create model
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                     device=device)



In [35]:
# 1. Define the query
query = "Macronutrient functions"
print(f"Query: {query}")

# 2. Embed the query
query_embedding = embedding_model.encode(query, convert_to_tensor=True).to(device)

# 3. Get similarity scores with the dot product (use cosine similarity if outputs of model aren't normalized)
from time import perf_counter as timer
start_time = timer()
dot_scores = util.dot_score(a=query_embedding,
                           b=embeddings)[0]
end_time = timer()

print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

# Get the top-k results (we'll get top 5)
top_results_dot_product = torch.topk(dot_scores, k=5)
top_results_dot_product

Query: Macronutrient functions
[INFO] Time taken to get scores on 1680 embeddings: 0.00017 seconds.


torch.return_types.topk(
values=tensor([0.6843, 0.6717, 0.6517, 0.6493, 0.6478], device='cuda:0'),
indices=tensor([42, 47, 46, 51, 41], device='cuda:0'))
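The comment in step 3 says to use cosine similarity if the model's outputs are not normalized. The reason is that the dot product grows with vector length, while cosine similarity only measures direction; for unit-length vectors the two are identical. A small pure-Python sketch with made-up 2D vectors (all function names here are ours, for illustration only):

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def norm(a: list[float]) -> float:
    return math.sqrt(dot(a, a))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a: list[float]) -> list[float]:
    length = norm(a)
    return [x / length for x in a]

# Made-up vectors, each of length 5
query_vec = [3.0, 4.0]
chunk_vec = [4.0, 3.0]

raw_dot = dot(query_vec, chunk_vec)                          # depends on vector length
cos_sim = cosine_similarity(query_vec, chunk_vec)            # depends on direction only
unit_dot = dot(normalize(query_vec), normalize(chunk_vec))   # matches cos_sim

print(raw_dot, cos_sim, unit_dot)
```

When the embeddings are already unit length, the dot product gives the same ranking as cosine similarity while skipping the division.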

In [36]:
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

In [37]:
query = "Macronutrient functions"
print(f"Query: {query}\n")
print("Results:")
# Loop through zipped together scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    print("Text:")
    print(pages_and_chunks[idx]["sentence_chunk"])
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

Query: Macronutrient functions

Results:
Score: 0.6843
Text:
Macronutrients Nutrients that are needed in large amounts are called macronutrients. There are three classes of macronutrients: carbohydrates, lipids, and proteins. These can be metabolically processed into cellular energy. The energy from macronutrients comes from their chemical bonds. This chemical energy is converted into cellular energy that is then utilized to perform work, allowing our bodies to conduct their basic functions. A unit of measurement of food energy is the calorie. On nutrition food labels the amount given for “calories” is actually equivalent to each calorie multiplied by one thousand. A kilocalorie (one thousand calories, denoted with a small “c”) is synonymous with the “Calorie” (with a capital “C”) on nutrition food labels. Water is also a macronutrient in the sense that you require a large amount of it, but unlike the other macronutrients, it does not yield calories. Carbohydrates Carbohydrates are mol

### Turning our semantic search pipeline into functions

In [51]:
def retrieve_relevant_resources(query: str, embeddings: torch.Tensor, model=embedding_model, n_resources_to_return=10):
    """
    Embeds a query with model and returns the top-k scores and indices from embeddings
    """

    # Embed the query
    query_embedding = model.encode(query, convert_to_tensor=True)
    # print(query)

    # Get dot product scores on embeddings
    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

    scores, indices = torch.topk(dot_scores, n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.Tensor,
                                 pages_and_chunks=pages_and_chunks,
                                 n_resources_to_return=5):
    """
    Finds relevant passages given a query and prints them out along with their scores
    """
    scores, indices = retrieve_relevant_resources(query=query, embeddings=embeddings, n_resources_to_return=n_resources_to_return)
    
    # Loop through zipped together scores and indices from torch.topk
    for score, idx in zip(scores, indices):
        print(f"Score: {score:.4f}")
        print("Text:")
        print(pages_and_chunks[idx]["sentence_chunk"])
        print(f"Page number: {pages_and_chunks[idx]['page_number']}")
        print("\n")
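Both functions lean on `torch.topk`, which returns the k largest scores together with their positions, sorted from highest to lowest. To make that contract concrete, here is a pure-Python sketch of the same behaviour on made-up scores (the `top_k` helper is hypothetical and only meant for illustration, not as a replacement for `torch.topk`):

```python
def top_k(scores: list[float], k: int) -> tuple[list[float], list[int]]:
    """Return the k highest scores and their original indices, best first."""
    ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in ranked_indices], ranked_indices

# Made-up similarity scores for five chunks
example_scores = [0.12, 0.68, 0.05, 0.67, 0.65]
top_values, top_indices = top_k(example_scores, k=3)

print(top_values, top_indices)
```

The indices are what tie a score back to its text: each one is a position in `pages_and_chunks`, which is why that list must stay in the same order as the embeddings.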

In [53]:
query = "foods high in fiber"
# retrieve_relevant_resources(query, embeddings)
print_top_results_and_scores(query=query, embeddings=embeddings)

[INFO] Time taken to get scores on 1680 embeddings: 0.00008 seconds.
Score: 0.6964
Text:
• Change it up a bit and experience the taste and satisfaction of other whole grains such as barley, quinoa, and bulgur. • Eat snacks high in fiber, such as almonds, pistachios, raisins, and air-popped popcorn. Add an artichoke and green peas to your dinner plate more 276 | Carbohydrates and Personal Diet Choices
Page number: 276


Score: 0.6810
Text:
Dietary fiber is categorized as either water-soluble or insoluble. Some examples of soluble fibers are inulin, pectin, and guar gum and they are found in peas, beans, oats, barley, and rye. Cellulose and lignin are insoluble fibers and a few dietary sources of them are whole-grain foods, flax, cauliflower, and avocados. Cellulose is the most abundant fiber in plants, making up the cell walls and providing structure. Soluble fibers are more easily accessible to bacterial enzymes in the large intestine so they can be broken down to a greater extent than

### Getting an LLM for local generation 

In [57]:
# Get total GPU memory (total_memory is the device's capacity, not what is currently free)
import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2 ** 30))
print(f"Available GPU memory: {gpu_memory_gb} GB")

Available GPU memory: 8 GB


In [59]:
# Note: the following is Gemma focused, however, there are more and more LLMs of the 2B and 7B size appearing for local use.
if gpu_memory_gb < 5.1:
    print(f"Your available GPU memory is {gpu_memory_gb}GB, you may not have enough memory to run a Gemma LLM locally without quantization.")
elif gpu_memory_gb < 8.1:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in 4-bit precision.")
    use_quantization_config = True 
    model_id = "google/gemma-2b-it"
elif gpu_memory_gb < 19.0:
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 2B in float16 or Gemma 7B in 4-bit precision.")
    use_quantization_config = False 
    model_id = "google/gemma-2b-it"
else: # 19.0 GB or more (a plain else also covers exactly 19.0)
    print(f"GPU memory: {gpu_memory_gb} | Recommended model: Gemma 7B in 4-bit or float16 precision.")
    use_quantization_config = False 
    model_id = "google/gemma-7b-it"

print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")

GPU memory: 8 | Recommended model: Gemma 2B in 4-bit precision.
use_quantization_config set to: True
model_id set to: google/gemma-2b-it
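The same decision can be wrapped in a function so that every branch returns a value and the thresholds can be tested without a GPU. This is a sketch, not the notebook's own code: the function name `recommend_gemma_setup` is hypothetical, and the choice for the lowest tier (fall back to the quantized 2B model) is our assumption, since the cell above only prints a warning there.

```python
def recommend_gemma_setup(gpu_memory_gb: float) -> tuple[bool, str]:
    """Return (use_quantization_config, model_id) for a given amount of GPU memory in GB."""
    if gpu_memory_gb < 5.1:
        # Assumption: still try the smallest model in 4-bit; it may not fit
        return True, "google/gemma-2b-it"
    elif gpu_memory_gb < 8.1:
        # Gemma 2B in 4-bit precision
        return True, "google/gemma-2b-it"
    elif gpu_memory_gb < 19.0:
        # Gemma 2B in float16
        return False, "google/gemma-2b-it"
    else:
        # Gemma 7B without quantization
        return False, "google/gemma-7b-it"

use_quantization_config, model_id = recommend_gemma_setup(8)
print(f"use_quantization_config set to: {use_quantization_config}")
print(f"model_id set to: {model_id}")
```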


### Loading an LLM locally

In [74]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available

# 1. Create a quantization config
# Note: requires !pip install bitsandbytes accelerate
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                        bnb_4bit_compute_dtype=torch.float16)

# Bonus: flash attention 2 = faster attention mechanism
if (is_flash_attn_2_available()) and torch.cuda.get_device_capability(0)[0] >= 8:
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = "sdpa" # scaled dot product attention

# 2. Pick a model we'd like to use
# model_id already set

# 3. Instantiate tokenizer
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_id)

# 4. Instantiate the model
llm_model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_id,
                                                torch_dtype=torch.float16,
                                                quantization_config=quantization_config if use_quantization_config else None,
                                                low_cpu_mem_usage=False,
                                                attn_implementation=attn_implementation)

if not use_quantization_config:
    llm_model.to("cuda")


In [75]:
llm_model

GemmaForCausalLM(
  (model): GemmaModel(
    (embed_tokens): Embedding(256000, 2048, padding_idx=0)
    (layers): ModuleList(
      (0-17): 18 x GemmaDecoderLayer(
        (self_attn): GemmaSdpaAttention(
          (q_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear4bit(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear4bit(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): GemmaRotaryEmbedding()
        )
        (mlp): GemmaMLP(
          (gate_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (up_proj): Linear4bit(in_features=2048, out_features=16384, bias=False)
          (down_proj): Linear4bit(in_features=16384, out_features=2048, bias=False)
          (act_fn): GELUActivation()
        )
        (input_layernorm): GemmaRMSNorm()
        (post_attention_layernorm): GemmaRMSNorm()
     

In [214]:
def get_model_num_params(model):
    return sum([param.numel() for param in model.parameters()])

get_model_num_params(llm_model)

1515268096

In [216]:
def get_model_mem_size(model):
    # Get model parameters and buffer sizes
    mem_params = sum([param.nelement() * param.element_size() for param in model.parameters()])
    mem_buffers = sum([buf.nelement() * buf.element_size() for buf in model.buffers()])

    # Calculate model sizes
    model_mem_bytes = mem_params + mem_buffers
    model_mem_mb = model_mem_bytes / (1024**2)
    model_mem_gb = model_mem_mb / 1024

    return {"model_mem_bytes": model_mem_bytes,
           "model_mem_mb": round(model_mem_mb, 2),
           "model_mem_gb": round(model_mem_gb, 2)}

get_model_mem_size(llm_model)

{'model_mem_bytes': 2106740736, 'model_mem_mb': 2009.14, 'model_mem_gb': 1.96}

To load this model (Gemma 2B in 4-bit precision), we need at least ~2 GB of memory.
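The conversion from bytes to GB, and a back-of-the-envelope estimate of weight memory at different precisions, can be sketched as follows. The byte count is the one reported above; the parameter count of 2.5 billion is a round, hypothetical figure chosen for illustration, and the estimate ignores activations, buffers, and any layers that are not quantized.

```python
def bytes_to_mb_gb(num_bytes: int) -> tuple[float, float]:
    """Convert a byte count to (MB, GB) using powers of 1024, rounded to 2 decimals."""
    return round(num_bytes / 1024**2, 2), round(num_bytes / 1024**3, 2)

def estimate_weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Rough memory needed for the weights alone: parameters x bits / 8, in GB."""
    return num_params * bits_per_param / 8 / 1024**3

# Byte count reported by get_model_mem_size above
measured_mb, measured_gb = bytes_to_mb_gb(2106740736)
print(measured_mb, measured_gb)

# Hypothetical model with 2.5 billion parameters at three precisions
hypothetical_params = 2_500_000_000
for bits in (32, 16, 4):
    print(f"{bits}-bit: ~{estimate_weight_memory_gb(hypothetical_params, bits):.2f} GB")
```

Halving the bits per parameter halves the estimate, which is why 4-bit loading makes the difference on an 8 GB card.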

### Generate text with our LLM