### This notebook builds a RAG system from scratch using Python and a FAISS vector store. I use a sample PDF as the document to vectorise and retrieve from.

- Steps to follow:
    1. Open a document.
    2. Format the text for the embedding model.
    3. Embed all the chunks so they can be stored for later use.
    4. Build a retrieval system that searches the vector store and returns the chunks whose embeddings are most similar to the query.
    5. Create a prompt that incorporates the retrieved passages.
    6. Generate an answer to the query based on the passages from the text.

- Steps 1-3 : Document preprocessing and embedding creation
- Steps 4-6 : Search and Answer

## 1. Document pre-processing and embedding creation

Ingredients : 
- Data documents of any choice
- embedding model of choice

In [1]:
import os 
import requests

pdf_path = './Rag-From-Scratch/simple-local-rag/human-nutrition-text.pdf'


if not os.path.exists(pdf_path):
    print('File does not exist.')


In [6]:
import fitz 
from tqdm.auto import tqdm

def text_formatter(strng):
    cleaned_text = strng.replace('\n', ' ').strip()
    return cleaned_text

def open_and_read(pdf_path):
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    # pass the page count explicitly so tqdm can show progress against a total
    for page_number, page in tqdm(enumerate(doc), total=len(doc)):
        text = page.get_text()
        text = text_formatter(text)
        pages_and_texts.append({
            # shift the index so it lines up with the page numbers printed in the book
            'page_number' : page_number - 41,
            'page_char_count' : len(text),
            'page_word_count' : len(text.split(' ')) if len(text) > 0 else 0,
            'page_sentence_count' : len(text.split('. ')) if len(text) > 0 else 0,
            # rough estimate: 1 token is about 4 characters
            'page_token_count' : len(text)/4,
            'text' : text
        })
    return pages_and_texts

In [7]:
pages_and_text = open_and_read(pdf_path)
len(pages_and_text)
pages_and_text[:2]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 0,
  'page_sentence_count': 0,
  'page_token_count': 0.0,
  'text': ''}]

In [8]:
import pandas as pd

df = pd.DataFrame(pages_and_text)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | text |
|---|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | 7.25 | Human Nutrition: 2020 Edition |
| 1 | -40 | 0 | 0 | 0 | 0.0 | |
| 2 | -39 | 320 | 54 | 1 | 80.0 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | -38 | 212 | 32 | 1 | 53.0 | Human Nutrition: 2020 Edition by University of... |
| 4 | -37 | 797 | 147 | 3 | 199.25 | Contents Preface University of Hawai‘i at Mā... |


## Why care about the token count?

Token count matters because both the embedding model and the LLM have a limited context window, that is, a maximum length of input text they can accept. Input beyond that limit is typically truncated or rejected, so pages and chunks need to fit within it.
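
As a rough illustration of why the count matters, the sketch below applies the same 4-characters-per-token rule of thumb used in `open_and_read` and flags text that would likely exceed a model's input limit. The helper names and the 384-token limit are assumptions made for this example; check the actual limit of the model you use.

```python
# Rough token estimate: about 4 characters per token (a heuristic, not a real tokenizer).
CHARS_PER_TOKEN = 4

def estimate_tokens(text: str) -> float:
    """Approximate the token count of `text` from its character count."""
    return len(text) / CHARS_PER_TOKEN

def fits_context_window(text: str, max_tokens: int = 384) -> bool:
    """Return True if the estimated token count is within `max_tokens`.

    `max_tokens=384` is an assumed limit used only for this example.
    """
    return estimate_tokens(text) <= max_tokens

title = 'Human Nutrition: 2020 Edition'
print(estimate_tokens(title))            # 7.25, matching the first page above
print(fits_context_window(title))        # True
print(fits_context_window('a' * 2308))   # False: 2308 characters is about 577 estimated tokens
```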

In [9]:
from spacy.lang.en import English

nlp = English()

# Add a sentencizer pipeline
nlp.add_pipe('sentencizer')

# create a document instance as an example.
doc = nlp('This is a sentence. This is another sentence. This is the third sentence.')

assert len(list(doc.sents)) == 3

# print sentences split
list(doc.sents)


[This is a sentence., This is another sentence., This is the third sentence.]

In [10]:
# Our pdf dictionary
pages_and_text[600]

{'page_number': 559,
 'page_char_count': 863,
 'page_word_count': 138,
 'page_sentence_count': 9,
 'page_token_count': 215.75,
 'text': 'Image by  Allison  Calabrese /  CC BY 4.0  Korsakoff syndrome can cause similar symptoms as beriberi such  as confusion, loss of coordination, vision changes, hallucinations,  and may progress to coma and death. This condition is specific  to alcoholics as diets high in alcohol can cause thiamin deficiency.  Other individuals at risk include individuals who also consume diets  typically low in micronutrients such as those with eating disorders,  elderly, and individuals who have gone through gastric bypass  surgery.5  Figure 9.10 The Role of Thiamin  Figure 9.11 Beriberi, Thiamin Deficiency  5. Fact Sheets for Health Professionals: Thiamin. National  Institute of Health, Office of Dietary Supplements.   https://ods.od.nih.gov/factsheets/Thiamin- HealthProfessional/. Updated Feburary 11, 2016.  Accessed October 22, 2017.  Water-Soluble Vitamins  |  559

In [11]:
for item in tqdm(pages_and_text):
    item['sentences'] = list(nlp(item['text']).sents)

    # make sure all sentences are string. Default is a spacy datatype.
    item['sentences'] = [str(strng) for strng in item['sentences']]

    item['sentences_per_page'] = len(item['sentences'])

In [13]:
import random

random.sample(pages_and_text, k=1)

[{'page_number': 733,
  'page_char_count': 1581,
  'page_word_count': 255,
  'page_sentence_count': 16,
  'page_token_count': 395.25,
  'text': 'AMS, USDA. https://www.ams.usda.gov/about-ams/programs- offices/national-organic-program.  Food Labeling Guide. US Food and Drug Administration. http://www.fda.gov. Updated February 10, 2012. Accessed  November 28, 2017.  Health Claims  Often we hear news of a particular nutrient or food product that  contributes to our health or may prevent disease. A health claim  is a statement that links a particular food with a reduced risk of  developing disease.  Implied health claims include the use of  symbols, statements and other forms of communication that  suggest a relationship between a food  substance and disease  reduction. As such, health claims such as “Three grams of soluble  fiber from oatmeal daily in combination with a diet low in  cholesterol and saturated fat may reduce the risk of heart disease,”  must be evaluated by the FDA before i

In [15]:
# update dataframe
df = pd.DataFrame(pages_and_text)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | sentences_per_page |
|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 199.48 | 10.5 | 287.0 | 10.32 |
| std | 348.86 | 560.38 | 95.88 | 6.59 | 140.1 | 6.3 |
| min | -41.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 5.0 | 190.5 | 5.0 |
| 50% | 562.5 | 1231.5 | 216.0 | 10.0 | 307.88 | 10.0 |
| 75% | 864.25 | 1603.5 | 272.0 | 15.0 | 400.88 | 15.0 |
| max | 1166.0 | 2308.0 | 430.0 | 39.0 | 577.0 | 28.0 |


## Chunking approach

- Chunking means splitting larger pieces of text into smaller pieces of a suitable size, so that the inputs fit the embedding model and the LLM.
- There is no single correct way to chunk; it depends on your use case. Common approaches include fixed-size chunking, token- and word-based chunking, recursive token- and word-based chunking, and semantic chunking.
- We will use fixed-size chunking here, with 10 sentences per chunk.
- Each page will be subdivided into chunks of 10 sentences or fewer.
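
For contrast with the fixed 10-sentence approach, here is a minimal sketch of a token-budget variant: sentences are packed greedily into a chunk until an estimated token budget would be exceeded. This is not what the notebook does; `chunk_by_token_budget` and the 4-characters-per-token estimate are assumptions for illustration only.

```python
def chunk_by_token_budget(sentences, max_tokens=100):
    """Greedily group sentences so each chunk stays within an estimated token budget.

    Tokens are estimated as len(sentence) / 4. A single sentence that is
    longer than the budget still becomes its own (oversized) chunk.
    """
    chunks, current, current_tokens = [], [], 0.0
    for sentence in sentences:
        sentence_tokens = len(sentence) / 4
        # start a new chunk if adding this sentence would exceed the budget
        if current and current_tokens + sentence_tokens > max_tokens:
            chunks.append(current)
            current, current_tokens = [], 0.0
        current.append(sentence)
        current_tokens += sentence_tokens
    if current:
        chunks.append(current)
    return chunks

sentences = ['x' * 40, 'y' * 40, 'z' * 40]   # about 10 estimated tokens each
print([len(c) for c in chunk_by_token_budget(sentences, max_tokens=20)])  # [2, 1]
```

Compared with a fixed sentence count, this keeps chunk sizes closer to the model's limit at the cost of chunks with varying numbers of sentences.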

In [None]:
# Define split size to turn groups of sentences into chunks
num_sentence_chunk_size = 10

# create a function to split a list of sentences into consecutive slices of chunk size
def split_list(input_list, slice_size=num_sentence_chunk_size):
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

# quick check: 25 items should give chunks of 10, 10 and 5
test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]
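
The number of chunks per page follows from ceiling division: a list of n sentences cut into pieces of 10 yields ceil(n / 10) chunks. A quick check, with `expected_num_chunks` as a hypothetical helper name:

```python
import math

def expected_num_chunks(num_sentences: int, chunk_size: int = 10) -> int:
    """Number of slices produced when a list is cut into pieces of `chunk_size`."""
    return math.ceil(num_sentences / chunk_size)

print(expected_num_chunks(25))  # 3, as in the 25-item list above
print(expected_num_chunks(28))  # 3
print(expected_num_chunks(0))   # 0: an empty page produces no chunks
```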

In [17]:
# Loop through pages and split text into chunks

for item in tqdm(pages_and_text):
    item['sentence_chunks'] = split_list(item['sentences'])
    item['num_chunks'] = len(item['sentence_chunks'])

In [19]:
random.sample(pages_and_text, k =1)

[{'page_number': 383,
  'page_char_count': 434,
  'page_word_count': 90,
  'page_sentence_count': 3,
  'page_token_count': 108.5,
  'text': 'Proteins are  the  “workhorses”  of the body  and  participate  in many  bodily  functions.  Proteins  come in all  sizes and  shapes and  each is  specifically  structured  for its  particular  function.  Protein’s Functions in the  Body  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  Structure and Motion  Figure 6.9 Collagen Structure  Protein’s Functions in the Body  |  383',
  'sentences': ['Proteins are  the  “workhorses”  of the body  and  participate  in many  bodily  functions.',
   ' Proteins  come in all  sizes and  shapes and  each is  specifically  structured  for its  particular  function.',
   ' Protein’s Functions in the  Body  UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN  NUTRITION PROGRAM AND HUMAN NUTRITION PROGRAM  Structure and Motion  Figure 6.9 Collagen Structure

In [20]:
df = pd.DataFrame(pages_and_text)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count | page_token_count | sentences_per_page | num_chunks |
|---|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 199.48 | 10.5 | 287.0 | 10.32 | 1.53 |
| std | 348.86 | 560.38 | 95.88 | 6.59 | 140.1 | 6.3 | 0.64 |
| min | -41.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 5.0 | 190.5 | 5.0 | 1.0 |
| 50% | 562.5 | 1231.5 | 216.0 | 10.0 | 307.88 | 10.0 | 1.0 |
| 75% | 864.25 | 1603.5 | 272.0 | 15.0 | 400.88 | 15.0 | 2.0 |
| max | 1166.0 | 2308.0 | 430.0 | 39.0 | 577.0 | 28.0 | 3.0 |


### Splitting each chunk into its own item in the document dictionary. This gives a greater level of granularity.

In [29]:
import re

# Split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_text):
    for sentence_chunk in item['sentence_chunks']:
        chunk_dict = {}
        chunk_dict['page_number'] = item['page_number']
        joined_sentence_chunk = ''.join(sentence_chunk).replace('  ', ' ').strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)

        chunk_dict['sentence_chunk'] = joined_sentence_chunk
        chunk_dict['chunk_char_count'] = len(joined_sentence_chunk)
        chunk_dict['chunk_word_count'] = len(joined_sentence_chunk.split(' '))
        chunk_dict['chunk_token_count'] = len(joined_sentence_chunk)/4

        pages_and_chunks.append(chunk_dict)


len(pages_and_chunks)

1843
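
The `re.sub` call in the cell above repairs sentences that were joined without a space after the full stop. The short sketch below shows its effect in isolation, including one side effect worth knowing about: abbreviations with internal periods also get spaces inserted. `fix_missing_space` is a hypothetical wrapper name, and the sample strings are made up.

```python
import re

def fix_missing_space(text: str) -> str:
    """Insert a space after a period that is directly followed by a capital letter."""
    return re.sub(r'\.([A-Z])', r'. \1', text)

print(fix_missing_space('It yields calories.Carbohydrates are next.'))
# It yields calories. Carbohydrates are next.
print(fix_missing_space('Approved in the U.S.A. in 1958.'))
# Approved in the U. S. A. in 1958.
```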

In [32]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 1028,
  'sentence_chunk': 'The Pros and Cons of Food Additives The FDA works to protect the public from potentially dangerous additives. Passed in 1958, the Food Additives Amendment states that a manufacturer is responsible for demonstrating the safety of an additive before it can be approved. The Delaney Clause that was added to this legislation prohibits the approval of any additive found to cause cancer in animals or humans. However, most additives are considered to be “generally recognized as safe,” a status that is determined by the FDA and referred to as GRAS. Food additives are typically included in the processing stage to improve the quality and consistency of a product. Many additives also make items more “shelf stable,” meaning they will last a lot longer on store shelves and can generate more profit for store owners. Additives can also help to prevent spoilage that results from changes in temperature, damage during distribution, and other adverse conditions.

In [33]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

| | page_number | chunk_char_count | chunk_word_count | chunk_token_count |
|---|---|---|---|---|
| count | 1843.0 | 1843.0 | 1843.0 | 1843.0 |
| mean | 583.38 | 734.1 | 112.74 | 183.52 |
| std | 347.79 | 447.51 | 71.24 | 111.88 |
| min | -41.0 | 12.0 | 3.0 | 3.0 |
| 25% | 280.5 | 315.0 | 45.0 | 78.75 |
| 50% | 586.0 | 745.0 | 115.0 | 186.25 |
| 75% | 890.0 | 1118.0 | 173.0 | 279.5 |
| max | 1166.0 | 1830.0 | 297.0 | 457.5 |


In [38]:
# Filter chunks with very small text length. These chunks might not have useful information.
min_token_length = 30

# for row in df[df['chunk_token_count']<min_token_length].sample(5).iterrows():
#     print(f"Chunk token count : {row[1]['chunk_token_count']} | Text : {row[1]['sentence_chunk']}")

pages_and_chunks_over_min_token_len = df[df['chunk_token_count'] > min_token_length].to_dict(orient='records')
pages_and_chunks_over_min_token_len[:2]

[{'page_number': -39,
  'sentence_chunk': 'Human Nutrition: 2020 Edition UNIVERSITY OF HAWAI‘I AT MĀNOA FOOD SCIENCE AND HUMAN NUTRITION PROGRAM ALAN TITCHENAL, SKYLAR HARA, NOEMI ARCEO CAACBAY, WILLIAM MEINKE-LAU, YA-YUN YANG, MARIE KAINOA FIALKOWSKI REVILLA, JENNIFER DRAPER, GEMADY LANGFELDER, CHERYL GIBBY, CHYNA NICOLE CHUN, AND ALLISON CALABRESE',
  'chunk_char_count': 308,
  'chunk_word_count': 42,
  'chunk_token_count': 77.0},
 {'page_number': -38,
  'sentence_chunk': 'Human Nutrition: 2020 Edition by University of Hawai‘i at Mānoa Food Science and Human Nutrition Program is licensed under a Creative Commons Attribution 4.0 International License, except where otherwise noted.',
  'chunk_char_count': 210,
  'chunk_word_count': 30,
  'chunk_token_count': 52.5}]

In [39]:
random.sample(pages_and_chunks_over_min_token_len, k=2)

[{'page_number': 939,
  'sentence_chunk': 'https://www.acsm.org/docs/brochures/resistance- training.pdf. Accessed March 11, 2018. 4. Fitness Training: Elements of a Well-Rounded Routine. MayoClinic.com.http://www.mayoclinic.com/health/ fitness-training/HQ01305. Updated August 10, 2017. The Essential Elements of Physical Fitness | 939',
  'chunk_char_count': 292,
  'chunk_word_count': 28,
  'chunk_token_count': 73.0},
 {'page_number': 85,
  'sentence_chunk': 'Albumin helps maintain fluid balance between blood and tissues, as well as helping to maintain a constant blood pH. We have also learned that the water component of blood is essential for its actions as a transport vehicle, and that the electrolytes carried in blood help to maintain fluid balance and a constant pH. Furthermore, the high water content of blood helps maintain body temperature, and the constant flow of blood distributes heat throughout the body. Blood is exceptionally good at temperature The Cardiovascular System | 85

## Embedding our text chunks

In [69]:
# we are using an embedding model from sentence transformer library.

import torch
from sentence_transformers import SentenceTransformer

device = torch.device('cuda:2' if torch.cuda.is_available() else 'cpu')
print(device)
embedding_model = SentenceTransformer(model_name_or_path = 'all-mpnet-base-v2', device=device)

# for item in tqdm(pages_and_chunks_over_min_token_len):
#     # sentences are encoded by calling .encode on the model
#     item['embeddings'] = embedding_model.encode(item['sentence_chunk'])

text_chunks = [item['sentence_chunk'] for item in pages_and_chunks_over_min_token_len]
# text_chunks[419]

text_chunk_embeddings = embedding_model.encode(
    text_chunks,
    batch_size = 32,
    convert_to_tensor=True
)

text_chunk_embeddings



cuda:2




tensor([[ 0.0674,  0.0902, -0.0051,  ..., -0.0221, -0.0232,  0.0126],
        [ 0.0552,  0.0592, -0.0166,  ..., -0.0120, -0.0103,  0.0227],
        [ 0.0280,  0.0340, -0.0206,  ..., -0.0054,  0.0213,  0.0313],
        ...,
        [ 0.0771,  0.0098, -0.0122,  ..., -0.0409, -0.0752, -0.0241],
        [ 0.1030, -0.0165,  0.0083,  ..., -0.0574, -0.0283, -0.0295],
        [ 0.0864, -0.0125, -0.0113,  ..., -0.0522, -0.0337, -0.0299]],
       device='cuda:2')

In [70]:
# Implementing FAISS Vector store

import numpy as np
import faiss


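# Move the tensor from GPU to CPU and detach it from the graph,
# then convert to a numpy array of type float32 (the dtype FAISS expects)
text_chunk_embeddings = text_chunk_embeddings.detach().cpu().numpy().astype(np.float32)

# embedding dimension, read from the data instead of hard-coding 768
d = text_chunk_embeddings.shape[1]
# setting up the vector store: a flat (exact, brute-force) index using L2 distance
index = faiss.IndexFlatL2(d)
index.add(text_chunk_embeddings)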
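
`IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances. The pure-Python sketch below mimics that behaviour on a toy corpus so the `D` and `I` arrays returned by `index.search` are easier to interpret; `brute_force_l2_search` is a hypothetical stand-in, not part of FAISS.

```python
def brute_force_l2_search(corpus, query, k):
    """Return (distances, indices) of the k corpus vectors closest to `query`.

    Distances are squared Euclidean distances, sorted in ascending order.
    """
    scored = []
    for idx, vector in enumerate(corpus):
        dist = sum((v - q) ** 2 for v, q in zip(vector, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

corpus = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = brute_force_l2_search(corpus, query=[1.0, 1.0], k=2)
print(indices)    # [1, 0]
print(distances)  # [1.0, 2.0]
```

A flat index compares the query against every stored vector, which is exact but scales linearly with the number of chunks.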

In [None]:
xq = np.random.random((10, d)).astype('float32') # create random query

k=4 #nearest 4 neighbours

D,I = index.search(xq, k) #return distances and indices for each query
print(I)
print(D)

[[ 605  589  558    9]
 [ 613  113 1626  785]
 [1011  895   81  776]
 [ 123  122 1161 1092]
 [1600 1223 1444 1296]
 [ 330 1049  337  333]
 [1088  738 1081  745]
 [ 412 1053 1112  355]
 [ 288  739   74 1505]
 [1148 1170 1094 1168]]
[[245.42792 245.47127 245.88495 245.93845]
 [248.34288 248.54364 248.57002 248.5737 ]
 [267.95093 268.01514 268.10217 268.1548 ]
 [259.2244  259.28052 259.3379  259.382  ]
 [257.7188  257.83063 257.96027 258.00818]
 [249.28033 249.44339 249.5235  249.6034 ]
 [258.6822  258.7377  258.74884 258.8122 ]
 [273.278   273.3703  273.51318 273.52405]
 [251.01639 251.08907 251.09534 251.1025 ]
 [252.02844 252.2282  252.36292 252.41183]]


## Search and Retrieve

In [78]:
# Implement a re-rank model

from sentence_transformers import CrossEncoder

reranking_model = CrossEncoder('mixedbread-ai/mxbai-rerank-large-v1')




In [74]:
# convert query to embeddings using the same embedding model used to embed the data documents.

query = 'macronutrients functions'

# embed the query and convert it to the (1, d) float32 numpy array FAISS expects
query_embed = embedding_model.encode(query, convert_to_tensor=True)
query_embed = query_embed.cpu().numpy().astype('float32').reshape(1, -1)

k = 4  # number of nearest neighbours to return
D,I = index.search(query_embed, k)


print(f'I : {I}')
print(f'D : {D}')

# use double quotes for the f-strings so the inner single-quoted keys parse on Python < 3.12
for dist, idx in zip(D[0], I[0]):
    print(f'Distance : {dist}')
    print(f"Text : {pages_and_chunks_over_min_token_len[idx]['sentence_chunk']}")
    print(f"Page number : {pages_and_chunks_over_min_token_len[idx]['page_number']}")

I : [[42 47 41 51]]
D : [[0.61483824 0.6523455  0.67074776 0.69273096]]
Distance : 0.6148382425308228
Text : Macronutrients Nutrients that are needed in large amounts are called macronutrients. There are three classes of macronutrients: carbohydrates, lipids, and proteins. These can be metabolically processed into cellular energy. The energy from macronutrients comes from their chemical bonds. This chemical energy is converted into cellular energy that is then utilized to perform work, allowing our bodies to conduct their basic functions. A unit of measurement of food energy is the calorie. On nutrition food labels the amount given for “calories” is actually equivalent to each calorie multiplied by one thousand. A kilocalorie (one thousand calories, denoted with a small “c”) is synonymous with the “Calorie” (with a capital “C”) on nutrition food labels. Water is also a macronutrient in the sense that you require a large amount of it, but unlike the other macronutrients, it does not yie
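
If the embeddings are unit-normalised (an assumption here; verify it for your model, for example by checking the vector norms), a squared L2 distance `d` maps to cosine similarity as `cos = 1 - d / 2`, because `|a - b|^2 = |a|^2 + |b|^2 - 2 a.b = 2 - 2 cos`. The sketch below converts the distances printed above under that assumption; `l2_squared_to_cosine` is a hypothetical helper.

```python
def l2_squared_to_cosine(distance: float) -> float:
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - distance / 2.0

# distances taken from the search output above
for d in [0.61483824, 0.6523455, 0.67074776, 0.69273096]:
    print(f'distance {d:.4f} -> cosine similarity {l2_squared_to_cosine(d):.4f}')
```

Smaller distances correspond to higher similarity, so the ordering of results is the same under either measure.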

#### We could potentially improve the results by using a re-ranking model. Such a model is trained specifically to score how relevant each retrieved passage is to the query, so the search results can be reordered from most to least relevant.

In [79]:
retreived_docs = [pages_and_chunks_over_min_token_len[idx]['sentence_chunk'] for idx in I[0]]

results = reranking_model.rank(query, retreived_docs, return_documents=True, top_k=3)
results

[{'corpus_id': 0,
  'score': 0.97560567,
  'text': 'Macronutrients Nutrients that are needed in large amounts are called macronutrients. There are three classes of macronutrients: carbohydrates, lipids, and proteins. These can be metabolically processed into cellular energy. The energy from macronutrients comes from their chemical bonds. This chemical energy is converted into cellular energy that is then utilized to perform work, allowing our bodies to conduct their basic functions. A unit of measurement of food energy is the calorie. On nutrition food labels the amount given for “calories” is actually equivalent to each calorie multiplied by one thousand. A kilocalorie (one thousand calories, denoted with a small “c”) is synonymous with the “Calorie” (with a capital “C”) on nutrition food labels. Water is also a macronutrient in the sense that you require a large amount of it, but unlike the other macronutrients, it does not yield calories. Carbohydrates Carbohydrates are molecules co
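
The two-stage pattern used here (cheap retrieval of candidates, then a more careful re-scoring of only those candidates) can be sketched without any model. The word-overlap scorer below is a deliberately crude, hypothetical stand-in for the cross-encoder; it only illustrates the control flow and the shape of the output, and the candidate passages are made up.

```python
def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words that appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0

def rerank(query, candidates, top_k=3, score_fn=overlap_score):
    """Score every candidate against the query and return the best `top_k`.

    Each result keeps `corpus_id`, the candidate's position in the input list.
    """
    scored = [
        {'corpus_id': i, 'score': score_fn(query, text), 'text': text}
        for i, text in enumerate(candidates)
    ]
    return sorted(scored, key=lambda r: r['score'], reverse=True)[:top_k]

candidates = [
    'Vitamins are micronutrients.',
    'Macronutrients have several functions in the body.',
    'Water is also a macronutrient.',
]
for result in rerank('macronutrients functions', candidates, top_k=2):
    print(result['corpus_id'], round(result['score'], 2), result['text'])
```

Note that `corpus_id` refers to the position within the list handed to the re-ranker, not to the index in the vector store.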

In [None]:
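def retrieve_relevant_resources(query, embeddings, model, num_res_to_return):
    '''
    Embeds a query with the given model, searches the FAISS index and returns
    the top `num_res_to_return` re-ranked results.
    Note: `embeddings` is currently unused; the search runs against the global `index`.
    '''

    # embed the query as the (1, d) float32 numpy array FAISS expects
    query_embed = model.encode(query, convert_to_numpy=True).astype('float32').reshape(1, -1)
    # retrieve a few extra candidates so the re-ranker has more to choose from
    D,I = index.search(query_embed, num_res_to_return+5)
    retreived_docs = [pages_and_chunks_over_min_token_len[idx]['sentence_chunk'] for idx in I[0]]

    # re-rank the candidates and keep only the requested number of results
    return reranking_model.rank(query, retreived_docs, return_documents=True, top_k=num_res_to_return)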
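
The re-ranked results feed step 5 of the outline, building a prompt from the retrieved passages. One possible shape for that step is sketched below; `build_prompt` and the prompt wording are assumptions, and the only thing taken from the notebook is the result format (a list of dicts with a `text` key). The example results are made-up values.

```python
def build_prompt(query: str, results: list) -> str:
    """Format re-ranked passages and the user query into a single prompt string."""
    context = '\n'.join(f'- {item["text"]}' for item in results)
    return (
        'Answer the query using only the context passages below.\n\n'
        f'Context:\n{context}\n\n'
        f'Query: {query}\nAnswer:'
    )

# made-up results in the same shape as the re-ranker output
example_results = [
    {'corpus_id': 0, 'score': 0.98, 'text': 'There are three classes of macronutrients.'},
    {'corpus_id': 2, 'score': 0.71, 'text': 'Water does not yield calories.'},
]
print(build_prompt('macronutrients functions', example_results))
```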