## NutriBot
I am building NutriBot AI, a Retrieval-Augmented Generation (RAG)-based assistant that answers questions from large nutrition research documents (like the 1200-page HNutrients dataset).
It uses document embeddings and a local LLM to provide accurate, context-rich responses.

Goal: Help health professionals, students, and researchers quickly extract insights from large nutrition documents instead of manually searching through them.

## What is RAG?
RAG stands for Retrieval Augmented Generation.

Each step can be roughly broken down to:
Retrieval → The system first searches and retrieves relevant information from an external knowledge source (e.g., a database, documents) based on the user’s query.

Augmentation → The retrieved information is then added to (or used to enhance) the model’s input context.

Generation → Finally, a generative model (like an LLM) produces an answer that is grounded in the retrieved data.

In [1]:
# import os
from pathlib import Path
pdf_path = Path("HNutrition.pdf")

if pdf_path.is_file():
    print(f"File Found {pdf_path}")
else:
    print(f"Not Found {pdf_path}")


File Found HNutrition.pdf


In [2]:
import fitz
from tqdm.auto import tqdm # for progress bars, requires 

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip() 

    # Other potential text formatting functions can go here
    return cleaned_text

# Open PDF and get lines/pages
# Note: this only focuses on text, rather than images/figures etc
def open_and_read_pdf(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)  # open a document
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):  # iterate the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        text = text_formatter(text)
        pages_and_texts.append({"page_number": page_number - 41,  # adjust page numbers since our PDF starts on page 42
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split(". ")),
                                "page_token_count": len(text) / 4,  # 1 token = ~4 chars, see: https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them
                                "text": text})
    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path=pdf_path)
pages_and_texts[:2]


0it [00:00, ?it/s]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [3]:
#get a random sample of the page
import random
random.sample(pages_and_texts, k=2)

[{'page_number': 340,
  'page_char_count': 1685,
  'page_word_count': 288,
  'page_sentence_count_raw': 11,
  'page_token_count': 421.25,
  'text': 'Lipids and the Food Industry  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  What is the first thing that comes to mind when you read  ingredients such as “partially hydrogenated oil” and “hydrogenated  oil” on a food label? Do you think of heart disease, heart health, or  atherosclerosis? Most people probably do not. As we uncover what  hydrogenation is and why manufacturers use it, you will be better  equipped to adhere to healthier dietary choices and promote your  heart health.  Hydrogenation: The Good Gone Bad?  Food manufacturers are aware that fatty acids are susceptible to  attack by oxygen molecules because their points of unsaturation  render them vulnerable in this regard. When oxygen molecules  attack these points of unsaturation the modified fatty acid becomes  oxidized. T

In [4]:
import pandas as pd
df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
0,-41,29,4,1,7.25,Human Nutrition: 2020 Edition
1,-40,0,1,1,0.0,
2,-39,320,54,1,80.0,Human Nutrition: 2020 Edition UNIVERSITY OF ...
3,-38,212,32,1,53.0,Human Nutrition: 2020 Edition by University of...
4,-37,797,147,3,199.25,Contents Preface University of Hawai‘i at Mā...


## Further text processing (splitting pages into sentences)

In [5]:
from spacy.lang.en import English 

nlp = English()

# Add a sentencizer pipeline
nlp.add_pipe("sentencizer")

# Create a document instance as an example
doc = nlp("My name is Harshat.I am a B.tech Student.Doing B.Tech in CSE")
assert len(list(doc.sents)) == 3

# Access the sentences of the document
list(doc.sents)

[My name is Harshat., I am a B.tech Student., Doing B.Tech in CSE]

In [6]:
pages_and_texts[600]

{'page_number': 559,
 'page_char_count': 863,
 'page_word_count': 138,
 'page_sentence_count_raw': 9,
 'page_token_count': 215.75,
 'text': 'Image by  Allison  Calabrese /  CC BY 4.0  Korsakoff syndrome can cause similar symptoms as beriberi such  as confusion, loss of coordination, vision changes, hallucinations,  and may progress to coma and death. This condition is specific  to alcoholics as diets high in alcohol can cause thiamin deficiency.  Other individuals at risk include individuals who also consume diets  typically low in micronutrients such as those with eating disorders,  elderly, and individuals who have gone through gastric bypass  surgery.5  Figure 9.10 The Role of Thiamin  Figure 9.11 Beriberi, Thiamin Deficiency  5. Fact Sheets for Health Professionals: Thiamin. National  Institute of Health, Office of Dietary Supplements.   https://ods.od.nih.gov/factsheets/Thiamin- HealthProfessional/. Updated Feburary 11, 2016.  Accessed October 22, 2017.  Water-Soluble Vitamins  | 

In [7]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)
    
    # Make sure all sentences are strings
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    
    # Count the sentences 
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [8]:
#get a random sample of the page_and_texts
random.sample(pages_and_texts,k=1)

[{'page_number': 760,
  'page_char_count': 1400,
  'page_word_count': 221,
  'page_sentence_count_raw': 17,
  'page_token_count': 350.0,
  'text': 'Pacific Based Dietary  Guidelines  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  To reflect the unique food environment and practices of the Pacific,  the Secretariat of the Pacific Community (SPC) Public Health  division developed Dietary Guidelines for healthy eating to promote  and protect the health and future of Pacific Island peoples1. With  such a diverse food supply, it can be difficult to place some pacific  foods into the USDA 5 food group system. For example, ‘ulu,  otherwise known as breadfruit, is a fruit but also has many similar  properties and functions like whole grains as well due to its high  carbohydrate and fiber content.  Therefore, guidelines for healthy eating include a series of leaflets  and fact sheets that focus on traditional Pacific foods, food security,  

In [9]:
#visualize list of dictionaries into a DataFrame and get some stats.
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,199.5,10.52,287.0,10.32
std,348.86,560.38,95.83,6.55,140.1,6.3
min,-41.0,0.0,1.0,1.0,0.0,0.0
25%,260.75,762.0,134.0,5.0,190.5,5.0
50%,562.5,1231.5,216.0,10.0,307.88,10.0
75%,864.25,1603.5,272.0,15.0,400.88,15.0
max,1166.0,2308.0,430.0,39.0,577.0,28.0


## Chunking our sentences together

In [10]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10 

# Create a function that recursively splits a list into desired sizes
def split_list(input_list: list, 
               slice_size: int) -> list[list[str]]:
    """
    Splits the input_list into sublists of size slice_size (or as close as possible).

    For example, a list of 17 sentences would be split into two lists of [[10], [7]]
    """
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

# Loop through pages and texts and split sentences into chunks
for item in tqdm(pages_and_texts):
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size=num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [11]:
random.sample(pages_and_texts,k=1)

[{'page_number': 106,
  'page_char_count': 539,
  'page_word_count': 94,
  'page_sentence_count_raw': 3,
  'page_token_count': 134.75,
  'text': '“Major  Endocrine  Glands” by  National  Cancer  Institute /  Public  Domain  The Endocrine System  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  Figure 2.19 The Endocrine System  The functions of the endocrine system are intricately connected to  the body’s nutrition. This organ system is responsible for regulating  appetite, nutrient absorption, nutrient storage, and nutrient usage,  in addition to other functions, such as reproduction. The glands  106  |  The Endocrine System',
  'sentences': ['“Major  Endocrine  Glands” by  National  Cancer  Institute /  Public  Domain  The Endocrine System  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  Figure 2.19 The Endocrine System  The functions of the endocrine system are intricately conne

In [12]:
# Create a DataFrame to get stats
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy,num_chunks
count,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0,1208.0
mean,562.5,1148.0,199.5,10.52,287.0,10.32,1.53
std,348.86,560.38,95.83,6.55,140.1,6.3,0.64
min,-41.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,260.75,762.0,134.0,5.0,190.5,5.0,1.0
50%,562.5,1231.5,216.0,10.0,307.88,10.0,1.0
75%,864.25,1603.5,272.0,15.0,400.88,15.0,2.0
max,1166.0,2308.0,430.0,39.0,577.0,28.0,3.0


In [13]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]
        
        # Join the sentences together into a paragraph-like structure, aka a chunk (so they are a single string)
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk) # ".A" -> ". A" for any full-stop/capital letter combo 
        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        # Get stats about the chunk
        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token = ~4 characters
        
        pages_and_chunks.append(chunk_dict)

# How many chunks do we have?
len(pages_and_chunks)

  0%|          | 0/1208 [00:00<?, ?it/s]

1843

In [14]:
## View a random sample
random.sample(pages_and_chunks, k=1)

[{'page_number': 26,
  'sentence_chunk': 'Habitually grabbing a fast food sandwich for breakfast can seem convenient, but might not offer substantial nutrition. Yet getting in the habit of drinking an ample amount of water each day can yield multiple benefits. • Culture. The culture in which one grows up affects how one sees food in daily life and on special occasions. • Geography. Where a person lives influences food choices. For instance, people who live in Midwestern US states have less access to seafood than those living along the coasts. • Advertising. The media greatly influences food choice by persuading consumers to eat certain foods. • Social factors.',
  'chunk_char_count': 626,
  'chunk_word_count': 103,
  'chunk_token_count': 156.5}]

In [15]:
# Let Visualize our chunks
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,1843.0,1843.0,1843.0,1843.0
mean,583.38,734.1,112.74,183.52
std,347.79,447.51,71.24,111.88
min,-41.0,12.0,3.0,3.0
25%,280.5,315.0,45.0,78.75
50%,586.0,745.0,115.0,186.25
75%,890.0,1118.0,173.0,279.5
max,1166.0,1830.0,297.0,457.5


In [16]:
# Show random chunks with under 30 tokens in length
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(5).iterrows():
    print(f'Chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')

Chunk token count: 19.0 | Text: http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=519  Introduction | 991
Chunk token count: 28.75 | Text: Accessed September 22, 2017. Dietary, Behavioral, and Physical Activity Recommendations for Weight Management | 505
Chunk token count: 28.75 | Text: Journal of Nutrition, 138(6), 1250S–4S. http://jn.nutrition.org/content/138/6/ 1250S.long The Digestive System | 71
Chunk token count: 28.75 | Text: American Journal of Clinical Dietary, Behavioral, and Physical Activity Recommendations for Weight Management | 509
Chunk token count: 8.25 | Text: Regulation of Water Balance | 165


In [17]:
'''''
Looks like many of these are headers and footers of different pages.
They don't seem to offer too much information.
Let's filter our DataFrame/list of dictionaries to only include chunks with over 30 tokens in length.
'''''
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]

## Embedding our text chunks
While humans understand text, machines understand numbers best.
we use the sentence-transformers library which contains many pre-trained embedding models.Specifically, we get the all-mpnet-base-v2 model

In [18]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device="cuda") # choose the device to load the model to (note: GPU will often be much faster than CPU)

# Create a list of sentences to turn into numbers
sentences = [
    "The Sentences Transformers library provides an easy and open-source way to create embeddings.",
    "Sentences can be embedded one by one or as a list of strings.",
    "Embeddings are one of the most powerful concepts in machine learning!",
    "Learn to use embeddings well and you'll be well on your way to being an AI engineer."
]

# Sentences are encoded/embedded by calling model.encode()
embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences, embeddings))

# See the embeddings
for sentence, embedding in embeddings_dict.items():
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")



Sentence: The Sentences Transformers library provides an easy and open-source way to create embeddings.
Embedding: [-2.07981374e-02  3.03164218e-02 -2.01218128e-02  6.86483532e-02
 -2.55255643e-02 -8.47689621e-03 -2.07111196e-04 -6.32377416e-02
  2.81606149e-02 -3.33353467e-02  3.02634798e-02  5.30721173e-02
 -5.03526777e-02  2.62287464e-02  3.33313905e-02 -4.51578945e-02
  3.63044366e-02 -1.37112767e-03 -1.20171290e-02  1.14946542e-02
  5.04510924e-02  4.70857136e-02  2.11913381e-02  5.14607430e-02
 -2.03746390e-02 -3.58889140e-02 -6.67873712e-04 -2.94393301e-02
  4.95858900e-02 -1.05639435e-02 -1.52013786e-02 -1.31752936e-03
  4.48196791e-02  1.56022888e-02  8.60379657e-07 -1.21391716e-03
 -2.37978864e-02 -9.09424969e-04  7.34484987e-03 -2.53933878e-03
  5.23369685e-02 -4.68043461e-02  1.66214611e-02  4.71579283e-02
 -4.15599123e-02  9.01962689e-04  3.60279009e-02  3.42214443e-02
  9.68227684e-02  5.94828688e-02 -1.64984874e-02 -3.51249352e-02
  5.92517806e-03 -7.07964529e-04 -2.4103

In [19]:
#Embedding for one sentence
single_sentence = "Hi! I am Harshat."
single_embedding = embedding_model.encode(single_sentence)
print(f"Sentence: {single_sentence}")
print(f"Embedding:\n{single_embedding}")
print(f"Embedding size: {single_embedding.shape}")

Sentence: Hi! I am Harshat.
Embedding:
[ 3.43916900e-02 -7.04600662e-02 -3.50788496e-02  6.82661086e-02
  5.12629226e-02 -2.74087600e-02 -1.90097988e-02 -6.02451619e-03
  9.87662151e-02  5.02573978e-03  2.01086421e-02 -3.09305619e-02
 -3.34404968e-02  8.96232501e-02  3.91642042e-02 -1.94823872e-02
  4.08135355e-02 -5.22705317e-02  1.63302962e-02  1.99865662e-02
  7.64074102e-02  1.07072359e-02 -1.51920607e-02  5.93051836e-02
 -3.82512026e-02 -4.81653549e-02  4.32689749e-02  2.74892198e-03
  2.98219156e-02  1.54384586e-03  8.66539031e-02 -3.88130569e-03
  4.48752232e-02  3.38229276e-02  1.90175399e-06 -1.27896769e-02
  2.55781170e-02 -6.11090427e-03 -2.63585504e-02 -4.43911850e-02
  5.48399948e-02 -1.44742364e-02  4.16455902e-02  3.22761461e-02
  6.13439083e-03  2.95961667e-02  2.77900416e-02 -8.46813084e-04
  3.84629443e-02  5.60595505e-02 -1.16795367e-02 -4.22622561e-02
 -6.94573508e-04 -2.86393501e-02 -2.76706740e-02 -4.31966269e-03
  2.54626572e-03 -1.15740439e-02 -3.96988541e-02  6

In [20]:
# Turn text chunks into a single list
text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len][450]

## create the embeddings with a GPU.

In [21]:
%%time

# Send the model to the GPU
embedding_model.to("cuda") # requires a GPU installed, for reference on my local machine, I'm using a NVIDIA RTX 2050

# Create embeddings one by one on the GPU
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

  0%|          | 0/1680 [00:00<?, ?it/s]

CPU times: total: 4min 8s
Wall time: 42.4 s


In [22]:
%%time

# Embed all texts in batches
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size=32, # I found 16 works well for my use case
                                               convert_to_tensor=True) # optional to return embeddings as tensor instead of array

text_chunk_embeddings

CPU times: total: 172 ms
Wall time: 24 ms


tensor([ 3.3388e-02, -6.1417e-02, -2.2464e-02,  5.2329e-03,  5.4989e-02,
         1.1775e-02, -3.2594e-02,  2.6364e-02,  1.0218e-01, -2.3871e-03,
         6.8169e-03,  1.4789e-02,  8.9569e-03, -1.0108e-02, -6.5055e-03,
        -2.5122e-02, -4.3819e-03, -3.0392e-02, -4.4738e-02,  1.1128e-02,
         4.2078e-03, -2.2626e-02, -1.1900e-02, -2.3986e-03, -3.3243e-02,
         8.0504e-03,  3.6850e-02, -7.7539e-03,  7.8729e-03, -4.1051e-02,
        -5.7295e-03, -1.0844e-02,  1.5408e-02, -7.4839e-03,  2.0473e-06,
         3.4715e-02, -4.4108e-02,  1.7490e-02, -3.7902e-02,  4.1096e-02,
         4.0579e-02, -5.4899e-02,  8.1754e-03,  2.3780e-03,  5.3903e-02,
        -3.9419e-03, -1.1858e-02, -2.5497e-03,  1.2418e-03, -2.1682e-02,
         3.5438e-03,  4.2490e-02, -1.3505e-02,  6.7765e-02,  2.1849e-02,
         2.7498e-03, -9.1032e-03,  4.7123e-02,  2.2353e-03,  2.9456e-02,
        -5.9417e-03, -1.3723e-02,  4.1062e-03, -1.8517e-02,  3.5074e-02,
         9.0240e-02, -2.3282e-02, -6.9037e-03,  2.6

## Save embeddings to file

In [23]:
# Save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [24]:
# Import saved file and view
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-39,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,77.0,[ 6.74242899e-02 9.02281702e-02 -5.09548699e-...
1,-38,Human Nutrition: 2020 Edition by University of...,210,30,52.5,[ 5.52156232e-02 5.92139959e-02 -1.66167356e-...
2,-37,Contents Preface University of Hawai‘i at Māno...,766,116,191.5,[ 2.79801786e-02 3.39813903e-02 -2.06426717e-...
3,-36,Lifestyles and Nutrition University of Hawai‘i...,941,144,235.25,[ 6.82566687e-02 3.81275155e-02 -8.46854225e-...
4,-35,The Cardiovascular System University of Hawai‘...,998,152,249.5,[ 3.30264382e-02 -8.49764794e-03 9.57158674e-...


In [25]:
import random

import torch
import numpy as np 
import pandas as pd

device = "cuda" if torch.cuda.is_available() else "cpu"

# Import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("text_chunks_and_embeddings_df.csv")

# Convert embedding column back to np.array (it got converted to string when it got saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

# Convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

# Convert embeddings to torch tensor and send to device (note: NumPy arrays are float64, torch tensors are float32 by default)
embeddings = torch.tensor(np.array(text_chunks_and_embedding_df["embedding"].tolist()), dtype=torch.float32).to(device)
embeddings.shape

torch.Size([1680, 768])

In [26]:
text_chunks_and_embedding_df.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-39,Human Nutrition: 2020 Edition UNIVERSITY OF HA...,308,42,77.0,"[0.0674242899, 0.0902281702, -0.00509548699, -..."
1,-38,Human Nutrition: 2020 Edition by University of...,210,30,52.5,"[0.0552156232, 0.0592139959, -0.0166167356, -0..."
2,-37,Contents Preface University of Hawai‘i at Māno...,766,116,191.5,"[0.0279801786, 0.0339813903, -0.0206426717, 0...."
3,-36,Lifestyles and Nutrition University of Hawai‘i...,941,144,235.25,"[0.0682566687, 0.0381275155, -0.00846854225, -..."
4,-35,The Cardiovascular System University of Hawai‘...,998,152,249.5,"[0.0330264382, -0.00849764794, 0.00957158674, ..."


In [27]:
embeddings[0]

tensor([ 6.7424e-02,  9.0228e-02, -5.0955e-03, -3.1755e-02,  7.3908e-02,
         3.5198e-02, -1.9799e-02,  4.6769e-02,  5.3573e-02,  5.0123e-03,
         3.3393e-02, -1.6221e-03,  1.7608e-02,  3.6265e-02, -3.1667e-04,
        -1.0712e-02,  1.5426e-02,  2.6218e-02,  2.7765e-03,  3.6494e-02,
        -4.4411e-02,  1.8936e-02,  4.9012e-02,  1.6402e-02, -4.8578e-02,
         3.1829e-03,  2.7299e-02, -2.0476e-03, -1.2283e-02, -7.2805e-02,
         1.2045e-02,  1.0730e-02,  2.1000e-03, -8.1777e-02,  2.6783e-06,
        -1.8143e-02, -1.2080e-02,  2.4717e-02, -6.2747e-02,  7.3544e-02,
         2.2162e-02, -3.2877e-02, -1.8010e-02,  2.2295e-02,  5.6136e-02,
         1.7951e-03,  5.2593e-02, -3.3174e-03, -8.3388e-03, -1.0628e-02,
         2.3192e-03, -2.2393e-02, -1.5301e-02, -9.9306e-03,  4.6532e-02,
         3.5747e-02, -2.5476e-02,  2.6369e-02,  3.7491e-03, -3.8268e-02,
         2.5833e-02,  4.1287e-02,  2.5818e-02,  3.3297e-02, -2.5178e-02,
         4.5152e-02,  4.4903e-04, -9.9662e-02,  4.9

In [28]:
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2", 
                                      device=device) # choose the device to load the model 



In [29]:
# 1. Define the query
query = "macronutrients functions"
print(f"Query: {query}")
# 2. Embed the query to the same numerical space as the text examples 
# Note: It's important to embed your query with the same model you embedded your examples with.
query_embedding = embedding_model.encode(query, convert_to_tensor=True)

# 3. Get similarity scores with the dot product
from time import perf_counter as timer

start_time = timer()
dot_scores = util.dot_score(a=query_embedding, b=embeddings)[0]
end_time = timer()

print(f"Time take to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

# 4. Get the top-5 results 
top_results_dot_product = torch.topk(dot_scores, k=2)
top_results_dot_product

Query: macronutrients functions
Time take to get scores on 1680 embeddings: 0.00327 seconds.


torch.return_types.topk(
values=tensor([0.6926, 0.6738], device='cuda:0'),
indices=tensor([42, 47], device='cuda:0'))

In [30]:
# Define helper function to print wrapped text 
import textwrap

def print_wrapped(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    print(wrapped_text)

In [31]:
print(f"Query: '{query}'\n")
print("Results:")
# Loop through zipped together scores and indicies from torch.topk
for score, idx in zip(top_results_dot_product[0], top_results_dot_product[1]):
    print(f"Score: {score:.4f}")
    # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
    print("Text:")
    print_wrapped(pages_and_chunks[idx]["sentence_chunk"])
    # Print the page number too so we can reference the textbook further (and check the results)
    print(f"Page number: {pages_and_chunks[idx]['page_number']}")
    print("\n")

Query: 'macronutrients functions'

Results:
Score: 0.6926
Text:
Macronutrients Nutrients that are needed in large amounts are called
macronutrients. There are three classes of macronutrients: carbohydrates,
lipids, and proteins. These can be metabolically processed into cellular energy.
The energy from macronutrients comes from their chemical bonds. This chemical
energy is converted into cellular energy that is then utilized to perform work,
allowing our bodies to conduct their basic functions. A unit of measurement of
food energy is the calorie. On nutrition food labels the amount given for
“calories” is actually equivalent to each calorie multiplied by one thousand. A
kilocalorie (one thousand calories, denoted with a small “c”) is synonymous with
the “Calorie” (with a capital “C”) on nutrition food labels. Water is also a
macronutrient in the sense that you require a large amount of it, but unlike the
other macronutrients, it does not yield calories. Carbohydrates Carbohydrates
are 

In [33]:
import fitz

# Open PDF and load target page
pdf_path = "human-nutrition-text.pdf" # requires PDF to be downloaded
doc = fitz.open(pdf_path)
page = doc.load_page(5 + 41) # number of page (our doc starts page numbers on page 41)

# Get the image of the page
img = page.get_pixmap(dpi=300)

# Optional: save the image
#img.save("output_filename.png")
doc.close()

# Convert the Pixmap to a numpy array
img_array = np.frombuffer(img.samples_mv, 
                          dtype=np.uint8).reshape((img.h, img.w, img.n))

# Display the image using Matplotlib
import matplotlib.pyplot as plt
plt.figure(figsize=(13, 10))
plt.imshow(img_array)
plt.title(f"Query: '{query}' | Most relevant page:")
plt.axis('off') # Turn off axis
plt.show()

FileNotFoundError: no such file: 'human-nutrition-text.pdf'

## Functionizing our semantic search pipeline

In [34]:
def retrieve_relevant_resources(query: str,
                                embeddings: torch.tensor,
                                model: SentenceTransformer=embedding_model,
                                n_resources_to_return: int=2,
                                print_time: bool=True):
    """
    Embeds a query with model and returns top k scores and indices from embeddings.
    """

    # Embed the query
    query_embedding = model.encode(query,convert_to_tensor=True) 

    start_time = timer()
    dot_scores = util.dot_score(query_embedding, embeddings)[0]
    end_time = timer()

    if print_time:
        print(f"[INFO] Time taken to get scores on {len(embeddings)} embeddings: {end_time-start_time:.5f} seconds.")

    scores, indices = torch.topk(input=dot_scores, 
                                 k=n_resources_to_return)

    return scores, indices

def print_top_results_and_scores(query: str,
                                 embeddings: torch.tensor,
                                 pages_and_chunks: list[dict]=pages_and_chunks,
                                 n_resources_to_return: int=2):
    """
    Takes a query, retrieves most relevant resources and prints them out in descending order.

    Note: Requires pages_and_chunks to be formatted in a specific way (see above for reference).
    """
    
    scores, indices = retrieve_relevant_resources(query=query,
                                                  embeddings=embeddings,
                                                  n_resources_to_return=n_resources_to_return)
    
    print(f"Query: {query}\n")
    print("Results:")
    # Loop through zipped together scores and indicies
    for score, index in zip(scores, indices):
        print(f"Score: {score:.4f}")
        # Print relevant sentence chunk (since the scores are in descending order, the most relevant chunk will be first)
        print_wrapped(pages_and_chunks[index]["sentence_chunk"])
        # Print the page number too so we can reference the textbook further and check the results
        print(f"Page number: {pages_and_chunks[index]['page_number']}")
        print("\n")

In [None]:
query = "Explain Vitamin C "

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

In [None]:
# Print out the texts of the top scores
print_top_results_and_scores(query=query,
                             embeddings=embeddings)

# Loading an LLM locally

In [None]:
# Generating text with our LLM
import ollama

response = ollama.chat(
    model='llama3.2:1b',
    messages=[
        {'role': 'user', 'content': 'Explain machine learning in simple terms.'}
    ]
)
print(response['message']['content'])


In [None]:
import subprocess

result = subprocess.run(
    ["ollama", "run", "llama3.2:1b", "Hello llama "],
    capture_output=True,
    text=True,
    encoding='utf-8',
    errors='ignore'
)

print(result.stdout)

In [None]:
import ollama

# Original input
input_text = "What are the macronutrients, and what roles do they play in the human body?"
print(f"Input text:\n{input_text}")

# Simple Llama-style instruction wrapper (minimal and safe)
def format_llama_instruction(user_text, system_text=None):
    if system_text:
        return f"<<SYS>>\n{system_text}\n<</SYS>>\n\n{user_text}"
    return user_text

# Build prompt (you can replace system_text with your RAG context policy)
system_text = "You are a helpful, concise assistant. Answer clearly for a general audience."
prompt = format_llama_instruction(input_text, system_text=system_text)
print(f"\nPrompt (formatted):\n{prompt}")

# Generate via Ollama llama3.2:1b
response = ollama.generate(
    model="llama3.2:1b",
    prompt=prompt,
    options={
        "num_predict": 192,     # short output for 4 GB VRAM
        "temperature": 0.3,
        "top_p": 0.9,
        "num_ctx": 2048         # keep context modest on small GPU
    },
    stream=False
)

answer_text = response.get("response", "")
print("\nModel answer:\n", answer_text)



## Lets check out the last step Augmentation.First, let's put together a list of queries we can try out with our pipeline.

In [None]:
# Nutrition-style questions generated with GPT4
gpt4_questions = [
    "What are the macronutrients, and what roles do they play in the human body?",
    "How do vitamins and minerals differ in their roles and importance for health?",
    "Describe the process of digestion and absorption of nutrients in the human body.",
    "What role does fibre play in digestion? Name five fibre containing foods.",
    "Explain the concept of energy balance and its importance in weight management."
]

# Manually created question list
manual_questions = [
    "How often should infants be breastfed?",
    "What are symptoms of pellagra?",
    "How does saliva help with digestion?",
    "What is the RDI for protein per day?",
    "water soluble vitamins"
]

query_list = gpt4_questions + manual_questions

In [None]:
import random
query = random.choice(query_list)

print(f"Query: {query}")

# Get just the scores and indices of top related results
scores, indices = retrieve_relevant_resources(query=query,
                                              embeddings=embeddings)
scores, indices

# Augmenting our propt with context items

In [None]:
def prompt_formatter(query: str, context_items: list[dict]) -> list[dict]:
    """
    Augments query with text-based context from context_items.
    """
    context = "- " + "\n- ".join([item["sentence_chunk"] for item in context_items])

    best_prompt='''Based on the following context items, please answer the query.
Extract relevant passages from the context before answering the query. 
Do not include your reasoning, only return the final answer. 
Use the retrieved information from the RAG system to create a user-friendly, precise answer (around 30 words). 

Example 1:
Query: What are the fat-soluble vitamins?
Answer: The fat-soluble vitamins are A, D, E, and K. They are absorbed with dietary fats and stored in body tissues for functions like vision, bone health, and blood clotting.

Example 2:
Query: What causes type 2 diabetes?
Answer: Type 2 diabetes is mainly caused by insulin resistance linked to poor diet, obesity, and low physical activity, leading to high blood sugar and impaired insulin function.

Use the model: *LLaMA 3.2 1B*
Output limit: *200 tokens*

Now use the following context items to answer the user query:
{context}

Relevant passages: <extract relevant passages from the context here>
User query: {query}
Answer: '''

    best_prompt = best_prompt.format(context=context, query=query)

    response = [
        {"role": "user", "content": best_prompt}
    ]

    print(response[0]['content'])
    return response


In [None]:
from ollama import chat
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def ask(query, 
        top_k=2,
        temperature=0.7,
        max_new_tokens=512,
        format_answer_text=True, 
        return_answer_only=True):
    """
    Takes a query, retrieves top relevant context from embeddings,
    and generates an answer using the LLaMA 3.2 (1B) model via Ollama.
    """

''' Retrieve relevant context '''
    scores, indices = retrieve_relevant_resources(query=query, embeddings=embeddings)
    context_items = [pages_and_chunks[i] for i in indices]

''' Add score to context '''
    for i, item in enumerate(context_items):
        item["score"] = scores[i].cpu()

'''Format prompt '''
    prompt = prompt_formatter(query=query, context_items=context_items)

''' Call LLaMA via Ollama '''
    print("\nGenerating answer using LLaMA 3.2 (1B)...\n")

    response = chat(
        model="llama3.2:1b",
        messages=prompt,
        options={
            "temperature": temperature,
            "num_predict": max_new_tokens
        }
    )

'''Extract only content'''
    # Ollama sometimes returns a response object with a nested message
    answer_text = ""

    if hasattr(response, "message") and hasattr(response.message, "content"):
        answer_text = response.message.content
    elif isinstance(response, dict):
        answer_text = response.get("message", {}).get("content", "")
    else:
        text = str(response)
        if "content=" in text:
            answer_text = text.split("content=")[-1].split(", thinking")[0].strip('"\' ')
        else:
            answer_text = text

''' Clean result'''
    answer_text = (
        answer_text.replace("<bos>", "")
                   .replace("<eos>", "")
                   .replace("\u00A0", " ")
                   .strip()
    )
''' Return clean answer'''
    result = {"query": query, "answer": answer_text}
    return result if not return_answer_only else answer_text

In [None]:
import random
query="Macronutrients"
# query = random.choice(query_list)
print(f"Query: {query}\n")

result = ask(
    query=query,
    temperature=0.3,
    max_new_tokens=512,
    return_answer_only=False
)

print("Answer:\n")
# print(result["answer"])
print(f"Query: {result['query']}\nAnswer: {result['answer']}")
