[Daniel's Course](https://www.youtube.com/watch?v=qN_2fnOPY-M)

In [1]:
#%pip install -r requirements.txt

In [2]:
#%pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

In [3]:
!nvidia-smi

Mon Nov 11 11:24:30 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.90                 Driver Version: 565.90         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce GTX 1650      WDDM  |   00000000:01:00.0  On |                  N/A |
| N/A   55C    P8              5W /   50W |     505MiB /   4096MiB |     10%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

[Info](https://whimsical.com/simple-local-rag-workflow-39kToR3yNf7E8kY4sS2tjV)

In [4]:
import os


dataDir = "./data/human-nutrition-text.pdf"

if not os.path.exists(dataDir):
    print(f"Doesn't exist {dataDir}")
else:
    print(f"File exists {dataDir}")


File exists ./data/human-nutrition-text.pdf


In [5]:
import fitz 
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    return text.replace('\n', ' ').strip()


def read_pdf(path: str) -> list[dict]:
    doc = fitz.open(path)
    #print(f"Number of pages: {len(doc)}")
    pages_and_text = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text)
        pages_and_text.append({"page_number":page_number,
                               "page_char_count":len(text),
                               "page_word_count":len(text.split(" ")),
                               "page_sentence_count":len(text.split(". ")),
                               "page_token_count":len(text) / 4,
                               "page_text":text})
    return pages_and_text

In [6]:
pages_and_text = read_pdf(path = dataDir)


0it [00:00, ?it/s]

In [7]:

pages_and_text[:2]

[{'page_number': 0,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count': 1,
  'page_token_count': 7.25,
  'page_text': 'Human Nutrition: 2020 Edition'},
 {'page_number': 1,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count': 1,
  'page_token_count': 0.0,
  'page_text': ''}]
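
The empty page above (page_number 1) reports a page_word_count of 1 because `"".split(" ")` returns `[""]`. The sketch below reproduces the per-page statistics in plain Python to show the difference between `split(" ")` and `split()`. `page_stats` is a hypothetical helper used only for illustration; it is not part of the notebook.

```python
def page_stats(text: str) -> dict:
    """Rough per-page statistics, mirroring the fields built in read_pdf."""
    return {
        "char_count": len(text),
        # split(" ") keeps empty strings, so "" counts as one "word"
        "word_count_split_space": len(text.split(" ")),
        # split() with no argument drops empty strings, so "" counts as zero words
        "word_count_split": len(text.split()),
        # Same rule of thumb as the notebook: roughly 4 characters per token
        "token_estimate": len(text) / 4,
    }

empty = page_stats("")
title = page_stats("Human Nutrition: 2020 Edition")
print(empty)
print(title)
```

The same effect likely explains why the minimum of page_word_count in the summary statistics is 1 rather than 0.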

In [8]:
import random

random.sample(pages_and_text, 2)

[{'page_number': 780,
  'page_char_count': 439,
  'page_word_count': 66,
  'page_sentence_count': 3,
  'page_token_count': 109.75,
  'page_text': 'http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=420    An interactive or media element has been  excluded from this version of the text. You can  view it online here:  http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=420    An interactive or media element has been  excluded from this version of the text. You can  view it online here:  http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=420  Discovering Nutrition Facts  |  739'},
 {'page_number': 1160,
  'page_char_count': 1237,
  'page_word_count': 210,
  'page_sentence_count': 9,
  'page_token_count': 309.25,
  'page_text': 'Disease Prevention and Management  Eating fresh, healthy foods not only stimulates your taste buds, but  also can improve your quality of life and help you to live longer.  As discussed, food fuels your body and helps you to maintain a  healthy weight. Nutriti

In [9]:
import pandas as pd 

df = pd.DataFrame(pages_and_text)

df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count,page_token_count,page_text
0,0,29,4,1,7.25,Human Nutrition: 2020 Edition
1,1,0,1,1,0.0,
2,2,320,54,1,80.0,Human Nutrition: 2020 Edition UNIVERSITY OF ...
3,3,212,32,1,53.0,Human Nutrition: 2020 Edition by University of...
4,4,797,147,3,199.25,Contents Preface University of Hawai‘i at Mā...


In [10]:
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count,page_token_count
count,1208.0,1208.0,1208.0,1208.0,1208.0
mean,603.5,1148.0,199.5,10.52,287.0
std,348.86,560.38,95.83,6.55,140.1
min,0.0,0.0,1.0,1.0,0.0
25%,301.75,762.0,134.0,5.0,190.5
50%,603.5,1231.5,216.0,10.0,307.88
75%,905.25,1603.5,272.0,15.0,400.88
max,1207.0,2308.0,430.0,39.0,577.0


In [11]:
# from spacy.lang.uk import Ukrainian
from spacy.lang.en import English

nlp = English()

nlp.add_pipe("sentencizer")

doc = nlp("This is the first sentence. This is the second sentence. I like dogs")   

assert len(list(doc.sents)) == 3

print(list(doc.sents))


[This is the first sentence., This is the second sentence., I like dogs]


In [12]:
for item in tqdm(pages_and_text):
    item["sentences"] = list(nlp(item["page_text"]).sents)
    
    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    
    # Count the sentences 
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [13]:
# Inspect an example
random.sample(pages_and_text, k=1)

[{'page_number': 1150,
  'page_char_count': 1451,
  'page_word_count': 256,
  'page_sentence_count': 16,
  'page_token_count': 362.75,
  'page_text': 'Supplementation may also be helpful to a limited degree. Vitamin  D and antioxidants have been linked to lowering the risk of some  cancers (however taking an iron supplement may promote others),  but, obtaining vital nutrients from food first is the best way to  help prevent or manage cancer. In addition, regular and vigorous  exercise can lower the risk of breast and colon cancers, among  others. Also, wear sunblock, stay in the shade, and avoid the midday  sun to protect yourself from skin cancer, which is one of the most  common kinds of cancer.8  Diabetes  What Is Diabetes?  Diabetes is one of the top three diseases in America. It affects  millions of people and causes tens of thousands of deaths each year.  Diabetes is a metabolic disease of insulin deficiency and glucose  over-sufficiency.  Like  other  diseases,  genetics,  nutri

In [14]:
df = pd.DataFrame(pages_and_text)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count,page_token_count,page_sentence_count_spacy
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,603.5,1148.0,199.5,10.52,287.0,10.32
std,348.86,560.38,95.83,6.55,140.1,6.3
min,0.0,0.0,1.0,1.0,0.0,0.0
25%,301.75,762.0,134.0,5.0,190.5,5.0
50%,603.5,1231.5,216.0,10.0,307.88,10.0
75%,905.25,1603.5,272.0,15.0,400.88,15.0
max,1207.0,2308.0,430.0,39.0,577.0,28.0


In [15]:
num_sentence_chunk_size = 10

def split_list(input_list: list[str], 
               slice_size: int = num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))

split_list(test_list)



[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [16]:
for item in tqdm(pages_and_text):
    item["sentence_chunks"] = split_list(input_list = item["sentences"],
                                        slice_size = num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [17]:
random.sample(pages_and_text, k=1)

[{'page_number': 589,
  'page_char_count': 1243,
  'page_word_count': 242,
  'page_sentence_count': 1,
  'page_token_count': 310.75,
  'page_text': 'Vitamin  Sources  Recommended  Intake for  adults  Major functions  Deficiency  diseases and  symptoms  Vitamin A  (retinol,  retinal,  retinoic  acid,carotene,  beta-carotene)  Retinol: beef and  chicken liver,  skim milk, whole  milk, cheddar  cheese;  Carotenoids:  pumpkin, carrots,  squash, collards,  peas  700-900  mcg/day  Antioxidant,vision,  cell  differentiation,  reproduction,  immune function  Xerophthalmia, night  blindness, eye  infections;  poor growth,  dry skin,  impaired  immune  function  Vitamin D  Swordfish,  salmon, tuna,  orange juice  (fortified), milk  (fortified),  sardines, egg,  synthesis from  sunlight  600-800 IU/ day (15-20  mcg/day)  Absorption and  regulation of  calcium and  phosphorus,  maintenance of  bone  Rickets in  children:  abnormal  growth,  misshapen  bones, bowed  legs, soft  bones;  osteomalacia

In [18]:
df = pd.DataFrame(pages_and_text)

df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count,page_token_count,page_sentence_count_spacy,num_chunks
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,603.5,1148.0,199.5,10.52,287.0,10.32,1.53
std,348.86,560.38,95.83,6.55,140.1,6.3,0.64
min,0.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,301.75,762.0,134.0,5.0,190.5,5.0,1.0
50%,603.5,1231.5,216.0,10.0,307.88,10.0,1.0
75%,905.25,1603.5,272.0,15.0,400.88,15.0,2.0
max,1207.0,2308.0,430.0,39.0,577.0,28.0,3.0
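
The num_chunks column follows directly from the slicing in split_list: a page with n sentences yields ceil(n / 10) chunks, which is consistent with the maximum of 28 sentences giving 3 chunks. A small self-contained check, with split_list repeated here so the block runs on its own:

```python
import math

def split_list(input_list: list, slice_size: int = 10) -> list[list]:
    # Same slicing logic as the notebook's split_list
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Compare the number of chunks with ceil(n / slice_size) for a few sizes
chunk_counts = {n: len(split_list(list(range(n)))) for n in (0, 1, 10, 11, 28)}
for n, count in chunk_counts.items():
    assert count == math.ceil(n / 10)
print(chunk_counts)
```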


In [19]:
df.head(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count,page_token_count,page_text,sentences,page_sentence_count_spacy,sentence_chunks,num_chunks
0,0,29,4,1,7.25,Human Nutrition: 2020 Edition,[Human Nutrition: 2020 Edition],1,[[Human Nutrition: 2020 Edition]],1
1,1,0,1,1,0.0,,[],0,[],0


In [20]:
import re

pages_and_chunks = []

for item in tqdm(pages_and_text):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict ={}
        chunk_dict["page_number"] = item["page_number"]
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)
        chunk_dict["sentence_chunk"] = joined_sentence_chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len(joined_sentence_chunk.split(" "))
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4

        pages_and_chunks.append(chunk_dict)


  0%|          | 0/1208 [00:00<?, ?it/s]

In [21]:
random.sample(pages_and_chunks,1)

[{'page_number': 111,
  'sentence_chunk': 'Image by Gabriel Lee / CC BY-NC-SA Everyday Connection There has been significant talk about pre- and probiotic foods in the mainstream media. The World Health Organization defines probiotics as live bacteria that confer beneficial health effects on their host. They are sometimes called “friendly bacteria.”The most common bacteria labeled as probiotic is lactic acid bacteria (lactobacilli). They are added as live cultures to certain fermented foods such as yogurt. Prebiotics are indigestible foods, primarily soluble fibers, that stimulate the growth of certain strains of bacteria in the large intestine and provide health benefits to the host. A review article in the June 2008 issue of the Journal of Nutrition concludes that there is scientific 70 | The Digestive System',
  'chunk_char_count': 779,
  'chunk_word_count': 120,
  'chunk_token_count': 194.75}]
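
The regex in the cell above only repairs a period followed immediately by a capital letter. The sample output still contains `bacteria.”The` because a closing quote sits between the period and the capital. The sketch below isolates that substitution to show what it does and does not change; `add_space_after_period` is a hypothetical name used for illustration.

```python
import re

def add_space_after_period(text: str) -> str:
    # Insert a space between a period and an immediately following capital letter
    return re.sub(r"\.([A-Z])", r". \1", text)

fixed = add_space_after_period("It has gas.Similarly, depletion occurs.")
# A question mark or a closing quote before the capital is not matched
not_fixed = add_space_after_period("a full one?It does not matter")
# Side effect: dotted abbreviations are split as well
abbreviation = add_space_after_period("U.S.A")

print(fixed)
print(not_fixed)
print(abbreviation)
```

Whether the abbreviation side effect matters depends on the text; for retrieval over textbook prose it is probably a minor cost.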

In [22]:
len(pages_and_chunks)

1843

In [23]:
df = pd.DataFrame(pages_and_chunks)

df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,1843.0,1843.0,1843.0,1843.0
mean,624.38,734.1,112.74,183.52
std,347.79,447.51,71.24,111.88
min,0.0,12.0,3.0,3.0
25%,321.5,315.0,45.0,78.75
50%,627.0,745.0,115.0,186.25
75%,931.0,1118.0,173.0,279.5
max,1207.0,1830.0,297.0,457.5


In [24]:
min_token_length = 30
for _, row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row["chunk_token_count"]} | Text: {row["sentence_chunk"]}')

Chunk token count: 22.75 | Text: Building a protein involves three steps: transcription, translation, Defining Protein | 369
Chunk token count: 27.0 | Text: view it online here: http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=165 226 | Popular Beverage Choices
Chunk token count: 20.0 | Text: http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=507  Sports Nutrition | 971
Chunk token count: 17.5 | Text: The Obesity Myth. Gotham Books. Calories In Versus Calories Out | 1069
Chunk token count: 29.25 | Text: Abagovomab (monoclonal antibody) by Blake C / CC BY-SA 3.0 Figure 6.13 Antigens Protein’s Functions in the Body | 389


In [25]:
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient = "records")

pages_and_chunks_over_min_token_len[:2]

[{'page_number': 2,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': 3,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]

### Embedding

In [26]:
from sentence_transformers import SentenceTransformer

model_name_or_path = "all-mpnet-base-v2"
embedding_model = SentenceTransformer(model_name_or_path = model_name_or_path,
                                      device="cpu")



In [27]:
# Create a list of sentences to turn into numbers
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))


In [28]:
for sentence, embedding in embeddings_dict.items():
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding}")
    print("")


Sentence: The Sentences Transformers library provides an easy and open-source way to create embeddings.
Embedding: [-2.07982659e-02  3.03164814e-02 -2.01217812e-02  6.86484948e-02
 -2.55256258e-02 -8.47686827e-03 -2.07231977e-04 -6.32377416e-02
  2.81606596e-02 -3.33353728e-02  3.02633960e-02  5.30721396e-02
 -5.03526554e-02  2.62288544e-02  3.33313718e-02 -4.51577306e-02
  3.63045074e-02 -1.37121335e-03 -1.20171625e-02  1.14947166e-02
  5.04510924e-02  4.70856801e-02  2.11914051e-02  5.14606386e-02
 -2.03746390e-02 -3.58889215e-02 -6.67755026e-04 -2.94393897e-02
  4.95859198e-02 -1.05639463e-02 -1.52014066e-02 -1.31760491e-03
  4.48197499e-02  1.56023446e-02  8.60379259e-07 -1.21392065e-03
 -2.37978753e-02 -9.09372466e-04  7.34484568e-03 -2.53931386e-03
  5.23370616e-02 -4.68043797e-02  1.66214872e-02  4.71579544e-02
 -4.15599197e-02  9.01963329e-04  3.60278040e-02  3.42213996e-02
  9.68226939e-02  5.94829135e-02 -1.64984576e-02 -3.51249389e-02
  5.92516316e-03 -7.07909290e-04 -2.4103

In [29]:
embeddings[0].shape


(768,)

In [30]:
embedding = embedding_model.encode("I like coffee")

In [31]:
%%time
# # Embed each chunk one by one
# for item in tqdm(pages_and_chunks_over_min_token_len):
#     item["embedding"] = embedding_model.encode(item["sentence_chunk"])

CPU times: total: 0 ns
Wall time: 0 ns


In [32]:
%%time

# Send the model to the GPU
embedding_model.to("cuda") # requires a CUDA-capable GPU; per the nvidia-smi output above, this machine has an NVIDIA GeForce GTX 1650 (4 GB)

# Create embeddings one by one on the GPU
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

  0%|          | 0/1680 [00:00<?, ?it/s]

CPU times: total: 7min 20s
Wall time: 2min 8s


In [33]:
%%time

# Turn text chunks into a single list
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]


CPU times: total: 0 ns
Wall time: 0 ns


In [34]:
%%time
# Embed all texts in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=8, # tune the batch size to trade speed against GPU memory; 8 is used here
                                               convert_to_tensor=True) # optional to return embeddings as tensor instead of array

text_chunk_embeddings

CPU times: total: 3min 19s
Wall time: 1min 27s


tensor([[ 0.0674,  0.0902, -0.0051,  ..., -0.0221, -0.0232,  0.0126],
        [ 0.0552,  0.0592, -0.0166,  ..., -0.0120, -0.0103,  0.0227],
        [ 0.0280,  0.0340, -0.0206,  ..., -0.0054,  0.0213,  0.0313],
        ...,
        [ 0.0771,  0.0098, -0.0122,  ..., -0.0409, -0.0752, -0.0241],
        [ 0.1030, -0.0165,  0.0083,  ..., -0.0574, -0.0283, -0.0295],
        [ 0.0864, -0.0125, -0.0113,  ..., -0.0522, -0.0337, -0.0299]],
       device='cuda:0')

In [35]:
# Save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [36]:
# Import saved file and view
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,2,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,77.0,[ 6.74242601e-02 9.02281627e-02 -5.09549398e-...
1,3,Human Nutrition: 2020 Edition by University of...,210,30,52.5,[ 5.52155897e-02 5.92139363e-02 -1.66167151e-...
2,4,Contents Preface University of Hawai‘i at Māno...,766,116,191.5,[ 2.79801879e-02 3.39813977e-02 -2.06426699e-...
3,5,Lifestyles and Nutrition University of Hawai‘i...,941,144,235.25,[ 6.82566836e-02 3.81274931e-02 -8.46854411e-...
4,6,The Cardiovascular System University of Hawai‘...,998,152,249.5,[ 3.30264382e-02 -8.49767588e-03 9.57158953e-...


In [37]:
import random

import torch
import numpy as np 
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("text_chunks_and_embeddings_df.csv")

# Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))


# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)
embeddings.shape

torch.Size([1680, 768])
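
The round trip through CSV stores each embedding as NumPy's string form, for example `[ 6.74242601e-02 9.02281627e-02 ...`. The sketch below shows in plain Python what the parsing step amounts to: strip the brackets, split on whitespace, convert to floats. `parse_embedding` and the short sample string are illustrative assumptions, not values from the saved file.

```python
def parse_embedding(cell: str) -> list[float]:
    # Strip the surrounding brackets, then split on any whitespace (including newlines)
    return [float(token) for token in cell.strip("[]").split()]

# Made-up three-value example in the same layout NumPy uses when printing arrays
sample_cell = "[ 0.1 -0.25\n  3.0e-02]"
vector = parse_embedding(sample_cell)
print(vector)
```

This approach relies on the full array being written out. NumPy abbreviates printed arrays with more than 1000 elements by default, so 768-dimensional vectors survive, but longer vectors would need a different storage format.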

In [38]:
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device=device) # choose the device to load the model to



In [39]:
# 1. Define the query
# Note: This could be anything. But since we're working with a nutrition textbook, we'll stick with nutrition-based queries.
query = "macronutrients functions"
print(f"Query: {query}")

# 2. Embed the query to the same numerical space as the text examples 
# Note: It's important to embed your query with the same model you embedded your examples with.
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

# 3. Get similarity scores with the dot product (we'll time this for fun)
from time import perf_counter as timer

start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer()

print(f"Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

# 4. Get the top-k results (we'll keep this to 5)
top_results_dot_product = torch.topk(dot_scores, k=5)
top_results_dot_product

Query: macronutrients functions
Time taken to get scores on 1680 embeddings: 0.00059 seconds.


torch.return_types.topk(
values=tensor([0.6926, 0.6738, 0.6646, 0.6536, 0.6473], device='cuda:0'),
indices=tensor([42, 47, 41, 51, 46], device='cuda:0'))
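
To make the scoring step concrete, here is a plain-Python sketch of the same idea on toy 2-dimensional vectors: take the dot product of the query with every stored vector, then keep the k highest scores together with their indices. The names `dot`, `top_k`, `toy_vectors` and `toy_query` are hypothetical; the notebook itself uses `util.dot_score` and `torch.topk`.

```python
import heapq

def dot(a: list[float], b: list[float]) -> float:
    # Sum of element-wise products
    return sum(x * y for x, y in zip(a, b))

def top_k(query: list[float], vectors: list[list[float]], k: int = 2) -> list[tuple[int, float]]:
    scores = [dot(query, v) for v in vectors]
    # Indices of the k largest scores, highest first
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [(i, scores[i]) for i in best]

toy_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [0.8, 0.6]
results = top_k(toy_query, toy_vectors, k=2)
print(results)
```

As in the cell above, the returned indices point back into the list of stored items, which is how the scores are mapped to text chunks and page numbers.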

In [40]:
# Define helper function to print wrapped text 
import textwrap

def print_wrapped(text, wrap_length=40):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

In [41]:
print(f"Query: '{query}'\n")
print("Results:")
# Loop through zipped together scores and indices from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    # Print the page number too so we can reference the textbook further (and check the results)
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

Query: 'macronutrients functions'

Results:
Score: 0.6926
Text:
Macronutrients Nutrients that are needed
in large amounts are called
macronutrients. There are three classes
of macronutrients: carbohydrates,
lipids, and proteins. These can be
metabolically processed into cellular
energy. The energy from macronutrients
comes from their chemical bonds. This
chemical energy is converted into
cellular energy that is then utilized to
perform work, allowing our bodies to
conduct their basic functions. A unit of
measurement of food energy is the
calorie. On nutrition food labels the
amount given for “calories” is actually
equivalent to each calorie multiplied by
one thousand. A kilocalorie (one
thousand calories, denoted with a small
“c”) is synonymous with the “Calorie”
(with a capital “C”) on nutrition food
labels. Water is also a macronutrient in
the sense that you require a large
amount of it, but unlike the other
macronutrients, it does not yield
calories. Carbohydrates Carbohydrates
are 

In [42]:
# Get total GPU memory (the device's capacity, not the amount currently free)
import torch
gpu_memory_bytes = torch.cuda.get_device_properties(0).total_memory
gpu_memory_gb = round(gpu_memory_bytes / (2**30))
print(f"Total GPU memory: {gpu_memory_gb} GB")

Total GPU memory: 4 GB


In [None]:
def cosine_similarity(vector1, vector2):
    dot_product = torch.dot(vector1, vector2)

    # Get Euclidean/L2 norm of each vector (dividing by the norms removes magnitude and keeps direction)
    norm_vector1 = torch.sqrt(torch.sum(vector1**2))
    norm_vector2 = torch.sqrt(torch.sum(vector2**2))

    return dot_product / (norm_vector1 * norm_vector2)
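
A plain-Python sketch of how cosine similarity differs from the dot product: scaling a vector changes its dot product with another vector but not the cosine similarity, and for unit-length vectors the two scores coincide. The helper names and toy vectors are illustrative. Whether a given embedding model returns unit-length vectors is an assumption worth checking by computing the norm of one embedding.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the two L2 norms
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice the length
unit_a = [0.6, 0.8]  # a scaled to unit length

print(dot(a, a), dot(a, b))             # the dot product grows with magnitude
print(cosine(a, a), cosine(a, b))       # cosine similarity stays the same
print(dot(unit_a, unit_a), cosine(unit_a, unit_a))  # equal for unit vectors
```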

In [43]:
def retrieve_relevant_resources(query: str,
                                embeddings: torch.Tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=5,
                                print_time: bool=True):
    """
    Embeds a query with model and returns the top k cosine similarity scores and their indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query, 
                                   convert_to_tensor=True) 

    # Get cosine similarity scores between the query and every embedding
    start_time = timer()
    cos_scores = util.cos_sim(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

    # Keep the k highest scores and their positions in embeddings
    scores, indices = torch.topk(input=cos_scores, 
                                 k=n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.Tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_to_return: int=5):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.

    Note: Requires pages_and_chunks to be formatted in a specific way (see above for reference).
    """
    
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_to_return=n_resources_to_return)
    
    print(f"Query: {query}\n")
    print("Results:")
    # Loop through zipped together scores and indices
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
        print_wrapped(pages_and_chunks[index]["sentence_chunk"])
        # Print the page number too so we can reference the textbook further and check the results
        print(f"Page number: {pages_and_chunks[index]['page_number']}")
        print("\n")

In [44]:
query = "symptoms of pellagra"

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

[INFO] Time taken to get scores on 1680 embeddings: 0.00048 seconds.


(tensor([0.5000, 0.3741, 0.2959, 0.2793, 0.2721], device='cuda:0'),
 tensor([ 822,  853, 1536, 1555, 1531], device='cuda:0'))

In [45]:
# Print out the texts of the top scores
print_top_results_and_scores(query=query,
                             embeddings=embeddings)

[INFO] Time taken to get scores on 1680 embeddings: 0.00035 seconds.
Query: symptoms of pellagra

Results:
Score: 0.5000
Niacin deficiency is commonly known as
pellagra and the symptoms include
fatigue, decreased appetite, and
indigestion. These symptoms are then
commonly followed by the four D’s:
diarrhea, dermatitis, dementia, and
sometimes death. Figure 9.12 Conversion
of Tryptophan to Niacin Water-Soluble
Vitamins | 565
Page number: 606


Score: 0.3741
car. Does it drive faster with a half-
tank of gas or a full one?It does not
matter; the car drives just as fast as
long as it has gas. Similarly, depletion
of B vitamins will cause problems in
energy metabolism, but having more than
is required to run metabolism does not
speed it up. Buyers of B-vitamin
supplements beware; B vitamins are not
stored in the body and all excess will
be flushed down the toilet along with
the extra money spent. B vitamins are
naturally present in numerous foods, and
many other foods are enriched with the