## Local Retrieval Augmented Generation

In [48]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [57]:
%%time

print("Hello world")

Hello world
CPU times: user 29 µs, sys: 6 µs, total: 35 µs
Wall time: 35 µs


In [4]:
import pandas as pd
from tqdm.auto import tqdm
import numpy as np

### Step:1 Document chunking process

#### Data info
- Book: Encyclopedia of Foods: A Guide to Healthy Nutrition 
- Pages: 529
- Words: 2,91,478 After Cleaning
- link:https://ia803100.us.archive.org/11/items/encyclopediaoffoods.aguidetohealthynutritionpdfdrive.com/Encyclopedia%20of%20Foods.%20A%20Guide%20to%20Healthy%20Nutrition%20%28%20PDFDrive.com%20%29.pdf

In [43]:
pip install PyMuPDF 

Note: you may need to restart the kernel to use updated packages.


In [3]:
import fitz
from tqdm.auto import tqdm
from nltk.tokenize import word_tokenize, sent_tokenize

pdf_path = "Encyclopedia of Foods and Healthy Nutrition.pdf"

#Cleaning the text in the pdf
def text_formatting(text: str) -> str:
    cleaned_text = text.replace("\n", " ").strip()
    
    return cleaned_text

#open pdf
def open_and_read(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatting(text = text)
        pages_and_texts.append({"Page Number": page_number -12, 
                               "Page_char_count": len(text),
                               "Page_word_count": len(word_tokenize(text)),
                               "Page_sentence_count": len(sent_tokenize(text)),
                               "Page_token_count": len(text)/4,
                               "Text": text
                              })
        
    return pages_and_texts

pages_and_texts = open_and_read(pdf_path = pdf_path)
pages_and_texts[:2]

  from .autonotebook import tqdm as notebook_tqdm
529it [00:02, 251.88it/s]


[{'Page Number': -12,
  'Page_char_count': 0,
  'Page_word_count': 0,
  'Page_sentence_count': 0,
  'Page_token_count': 0.0,
  'Text': ''},
 {'Page Number': -11,
  'Page_char_count': 50,
  'Page_word_count': 8,
  'Page_sentence_count': 1,
  'Page_token_count': 12.5,
  'Text': 'ENCYCLOPEDIA of FOODS a guide to Healthy Nutrition'}]

In [5]:
pages_and_texts[10]

{'Page Number': -2,
 'Page_char_count': 1098,
 'Page_word_count': 201,
 'Page_sentence_count': 12,
 'Page_token_count': 274.5,
 'Text': 'x Acknowledgments Editorial Staff Editors-in-Chief Robert A. Rizza, M.D. Vay Liang W. Go., M.D. M. Molly McMahon, M.D. Gail G. Harrison, Ph.D., R.D. Associate Editors Jennifer K. Nelson, R.D. Kristine A. Kuhnert, R.D. Assistant Editor Sydne J. Newberry, Ph.D. Editorial Director LeAnn M. Stee Art Directors Karen E. Barrie Kathryn K. Shepel Medical Illustrators John V. Hagen Michael A. King Editorial Assistant Sharon L. Wadleigh Production Consultant Ronald R. Ward Photography Tony Kubat The vision for this book belongs to David H. Murdock, Chairman and Chief Executive Officer of Dole Food Company, Inc.  Mr. Murdock brought his vision and a request for assistance in making it a reality to two of the authors:  Robert A. Rizza, M.D., of Mayo Clinic, and Gail G. Harrison, Ph.D., R.D., University of California Los Angeles School of Public Health.  They each

In [6]:
import random
random.sample(pages_and_texts, k = 2)

[{'Page Number': 172,
  'Page_char_count': 3498,
  'Page_word_count': 690,
  'Page_sentence_count': 28,
  'Page_token_count': 874.5,
  'Text': '172 Part II:  Encyclopedia of Foods Varieties Fresh dates are classified as “soft,” “semi- soft,” and “dry,” depending on their mois- ture content.  The most common type is “semisoft,” a well-known example  of which is the large, flavorful Medjool from Morocco.  Other “semisoft” varieties are the firm-fleshed, amber Deglet Noor and the small, golden Zahidi.  The Barhi, Khadrawy, and Halawy are “soft” dates. “Dry” varieties contain relatively little moisture when ripe.  Thus, the term “dry” does not mean “dehydrated” or “dried.” Origin & botanical facts Dates originated somewhere in the desert area that stretches from India to North Africa.  Cultivation seems to have begun at least 8,000 years ago, when settlement began along the Jordan River and around the Dead Sea.  Archaeological evidence indicates that cultivation of dates was well establish

In [7]:
df = pd.DataFrame(pages_and_texts)
df.head(10)

Unnamed: 0,Page Number,Page_char_count,Page_word_count,Page_sentence_count,Page_token_count,Text
0,-12,0,0,0,0.0,
1,-11,50,8,1,12.5,ENCYCLOPEDIA of FOODS a guide to Healthy Nutri...
2,-10,34,5,1,8.5,This Page Intentionally Left Blank
3,-9,238,41,2,59.5,a guide to Healthy Nutrition Prepared by medic...
4,-8,1559,276,8,389.75,This book is printed on acid-free paper. Copyr...
5,-7,3000,545,22,750.0,v Foreword I believe that knowledge is power. ...
6,-6,5056,2157,1956,1264.0,vi Part I A Guide to Healthy Nutrition . . . ...
7,-5,5380,2465,2358,1345.0,Chapter 5 Preparing Healthful Meals . . . . ....
8,-4,3265,606,30,816.25,N utrition is important to all of us. What we...
9,-3,695,141,9,173.75,published in the area of nutrition. We decide...


In [8]:
df.describe()

Unnamed: 0,Page Number,Page_char_count,Page_word_count,Page_sentence_count,Page_token_count
count,529.0,529.0,529.0,529.0,529.0
mean,252.0,2764.897921,562.183365,29.807183,691.22448
std,152.853416,1366.470968,276.690873,132.70302,341.617742
min,-12.0,0.0,0.0,0.0,0.0
25%,120.0,1865.0,427.0,3.0,466.25
50%,252.0,2995.0,609.0,26.0,748.75
75%,384.0,3718.0,732.0,33.0,929.5
max,516.0,5380.0,2465.0,2358.0,1345.0


### Chunking

In [9]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(sent_tokenize(item["Text"]))
    
    item['sentences'] = [str(sentence) for sentence in item['sentences']]

100%|██████████| 529/529 [00:00<00:00, 3244.38it/s]


In [10]:
random.sample(pages_and_texts, k=1)

[{'Page Number': 207,
  'Page_char_count': 3161,
  'Page_word_count': 618,
  'Page_sentence_count': 28,
  'Page_token_count': 790.25,
  'Text': 'Fruits 207 Varieties The extensive cultivation of the sapodilla in India has resulted in numerous varieties. Brown Sugar produces fragrant, juicy fruits whose flesh is pale brown and richly sweet.  The flesh of the Prolific variety is light pinkish tan, mildly fragrant, smooth- textured, and sweet.  Russel bears large fruits that are rich and sweet, but it is not a prolific producer.  A new selection, Tikal, yields fruits that have an excellent flavor but are smaller. Origin & botanical facts The sapodilla plant is believed to have originated in the Yucatán peninsula of Mexico, northern Belize, and northeast Guatemala.  The plant was highly prized by the Aztecs, who called the fruit “tzapotl,” from which the Spanish derived the name sapodilla.  The plant is now grown in almost all the tropical and sub- tropical regions of Africa, Asia, the Eas

In [11]:
df = pd.DataFrame(pages_and_texts)
df.head(10)

Unnamed: 0,Page Number,Page_char_count,Page_word_count,Page_sentence_count,Page_token_count,Text,sentences
0,-12,0,0,0,0.0,,[]
1,-11,50,8,1,12.5,ENCYCLOPEDIA of FOODS a guide to Healthy Nutri...,[ENCYCLOPEDIA of FOODS a guide to Healthy Nutr...
2,-10,34,5,1,8.5,This Page Intentionally Left Blank,[This Page Intentionally Left Blank]
3,-9,238,41,2,59.5,a guide to Healthy Nutrition Prepared by medic...,[a guide to Healthy Nutrition Prepared by medi...
4,-8,1559,276,8,389.75,This book is printed on acid-free paper. Copyr...,"[This book is printed on acid-free paper., Cop..."
5,-7,3000,545,22,750.0,v Foreword I believe that knowledge is power. ...,[v Foreword I believe that knowledge is power....
6,-6,5056,2157,1956,1264.0,vi Part I A Guide to Healthy Nutrition . . . ...,"[vi Part I A Guide to Healthy Nutrition ., .,..."
7,-5,5380,2465,2358,1345.0,Chapter 5 Preparing Healthful Meals . . . . ....,"[Chapter 5 Preparing Healthful Meals ., ., .,..."
8,-4,3265,606,30,816.25,N utrition is important to all of us. What we...,"[N utrition is important to all of us., What w..."
9,-3,695,141,9,173.75,published in the area of nutrition. We decide...,"[published in the area of nutrition., We decid..."


In [12]:
df.describe()

Unnamed: 0,Page Number,Page_char_count,Page_word_count,Page_sentence_count,Page_token_count
count,529.0,529.0,529.0,529.0,529.0
mean,252.0,2764.897921,562.183365,29.807183,691.22448
std,152.853416,1366.470968,276.690873,132.70302,341.617742
min,-12.0,0.0,0.0,0.0,0.0
25%,120.0,1865.0,427.0,3.0,466.25
50%,252.0,2995.0,609.0,26.0,748.75
75%,384.0,3718.0,732.0,33.0,929.5
max,516.0,5380.0,2465.0,2358.0,1345.0


In [13]:
pages_and_texts[6]

{'Page Number': -6,
 'Page_char_count': 5056,
 'Page_word_count': 2157,
 'Page_sentence_count': 1956,
 'Page_token_count': 1264.0,
 'Text': 'vi Part I A Guide to Healthy Nutrition  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 Chapter 1 Optimizing Health  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5 The Dietary Reference Intakes (DRIs)  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5 America’s Health Goals  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6 The Dietary Guidelines for Americans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8 The Power of the Food Guide Pyramid  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

In [14]:
# Removing two pages has more the 60 sentences 
df[df["Page_sentence_count"] > 60]
#drop = -6, -5

Unnamed: 0,Page Number,Page_char_count,Page_word_count,Page_sentence_count,Page_token_count,Text,sentences
6,-6,5056,2157,1956,1264.0,vi Part I A Guide to Healthy Nutrition . . . ...,"[vi Part I A Guide to Healthy Nutrition ., .,..."
7,-5,5380,2465,2358,1345.0,Chapter 5 Preparing Healthful Meals . . . . ....,"[Chapter 5 Preparing Healthful Meals ., ., .,..."


In [15]:
pages_and_texts[3]

{'Page Number': -9,
 'Page_char_count': 238,
 'Page_word_count': 41,
 'Page_sentence_count': 2,
 'Page_token_count': 59.5,
 'Text': 'a guide to Healthy Nutrition Prepared by medical and nutrition experts from Mayo Clinic,  University of California Los Angeles, and Dole Food Company, Inc. Academic Press An  Imprint of Elsevier San Diego, California ENCYCLOPEDIA of FOODS',
 'sentences': ['a guide to Healthy Nutrition Prepared by medical and nutrition experts from Mayo Clinic,  University of California Los Angeles, and Dole Food Company, Inc.',
  'Academic Press An  Imprint of Elsevier San Diego, California ENCYCLOPEDIA of FOODS']}

In [16]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [17]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 275,
    chunk_overlap = 25,
    length_function = len,
    is_separator_regex = False
)

In [18]:
text_splitter

<langchain_text_splitters.character.RecursiveCharacterTextSplitter at 0x16848ff70>

In [19]:
text = "a guide to Healthy Nutrition Prepared by medical and nutrition experts from Mayo Clinic,  University of California Los Angeles, and Dole Food Company, Inc. Academic Press An  Imprint of Elsevier San Diego, California ENCYCLOPEDIA of FOODS"

In [20]:
print(str(text))

a guide to Healthy Nutrition Prepared by medical and nutrition experts from Mayo Clinic,  University of California Los Angeles, and Dole Food Company, Inc. Academic Press An  Imprint of Elsevier San Diego, California ENCYCLOPEDIA of FOODS


In [21]:
chunks = text_splitter.create_documents([text])
print(chunks)
print("***************************")
print("Chunk Length:", len(chunks))

[Document(metadata={}, page_content='a guide to Healthy Nutrition Prepared by medical and nutrition experts from Mayo Clinic,  University of California Los Angeles, and Dole Food Company, Inc. Academic Press An  Imprint of Elsevier San Diego, California ENCYCLOPEDIA of FOODS')]
***************************
Chunk Length: 1


In [22]:
for content in tqdm(pages_and_texts):
    page_text = str(content['Text'])
    chunks = text_splitter.create_documents([page_text])
    content['Chunks'] = chunks
    content['Total_chunks'] = len(chunks)

100%|██████████| 529/529 [00:00<00:00, 2630.98it/s]


In [23]:
pages_and_texts[7]

{'Page Number': -5,
 'Page_char_count': 5380,
 'Page_word_count': 2465,
 'Page_sentence_count': 2358,
 'Page_token_count': 1345.0,
 'Text': 'Chapter 5 Preparing Healthful Meals  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .125 Change Is Good  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .125 Creating Healthful Menus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .128 Food Safety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .148 Serving Safely  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .149 Refrigerating or Freezing Food . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

In [24]:
df_chunks = pd.DataFrame(pages_and_texts)
df_chunks.head(10)

Unnamed: 0,Page Number,Page_char_count,Page_word_count,Page_sentence_count,Page_token_count,Text,sentences,Chunks,Total_chunks
0,-12,0,0,0,0.0,,[],[],0
1,-11,50,8,1,12.5,ENCYCLOPEDIA of FOODS a guide to Healthy Nutri...,[ENCYCLOPEDIA of FOODS a guide to Healthy Nutr...,[page_content='ENCYCLOPEDIA of FOODS a guide t...,1
2,-10,34,5,1,8.5,This Page Intentionally Left Blank,[This Page Intentionally Left Blank],[page_content='This Page Intentionally Left Bl...,1
3,-9,238,41,2,59.5,a guide to Healthy Nutrition Prepared by medic...,[a guide to Healthy Nutrition Prepared by medi...,[page_content='a guide to Healthy Nutrition Pr...,1
4,-8,1559,276,8,389.75,This book is printed on acid-free paper. Copyr...,"[This book is printed on acid-free paper., Cop...",[page_content='This book is printed on acid-fr...,7
5,-7,3000,545,22,750.0,v Foreword I believe that knowledge is power. ...,[v Foreword I believe that knowledge is power....,[page_content='v Foreword I believe that knowl...,12
6,-6,5056,2157,1956,1264.0,vi Part I A Guide to Healthy Nutrition . . . ...,"[vi Part I A Guide to Healthy Nutrition ., .,...",[page_content='vi Part I A Guide to Healthy Nu...,21
7,-5,5380,2465,2358,1345.0,Chapter 5 Preparing Healthful Meals . . . . ....,"[Chapter 5 Preparing Healthful Meals ., ., .,...",[page_content='Chapter 5 Preparing Healthful M...,22
8,-4,3265,606,30,816.25,N utrition is important to all of us. What we...,"[N utrition is important to all of us., What w...",[page_content='N utrition is important to all ...,13
9,-3,695,141,9,173.75,published in the area of nutrition. We decide...,"[published in the area of nutrition., We decid...",[page_content='published in the area of nutrit...,3


In [40]:
df_chunks.describe()

Unnamed: 0,Page Number,Page_char_count,Page_word_count,Page_sentence_count,Page_token_count,Total_chunks
count,529.0,529.0,529.0,529.0,529.0,529.0
mean,252.0,2764.897921,562.183365,29.807183,691.22448,11.47448
std,152.853416,1366.470968,276.690873,132.70302,341.617742,5.41426
min,-12.0,0.0,0.0,0.0,0.0,0.0
25%,120.0,1865.0,427.0,3.0,466.25,8.0
50%,252.0,2995.0,609.0,26.0,748.75,12.0
75%,384.0,3718.0,732.0,33.0,929.5,15.0
max,516.0,5380.0,2465.0,2358.0,1345.0,22.0


In [41]:
df_chunks.shape

(529, 9)

In [27]:
min_length_token = 30
max_sentence_count = 60

# Correct the condition with parentheses and use the correct DataFrame reference
pages_and_chunks = df_chunks[(df_chunks['Page_token_count'] > min_length_token) &
                             (df_chunks['Page_sentence_count'] < max_sentence_count)]

# Display the first 3 rows
pages_and_chunks[:3]

Unnamed: 0,Page Number,Page_char_count,Page_word_count,Page_sentence_count,Page_token_count,Text,sentences,Chunks,Total_chunks
3,-9,238,41,2,59.5,a guide to Healthy Nutrition Prepared by medic...,[a guide to Healthy Nutrition Prepared by medi...,[page_content='a guide to Healthy Nutrition Pr...,1
4,-8,1559,276,8,389.75,This book is printed on acid-free paper. Copyr...,"[This book is printed on acid-free paper., Cop...",[page_content='This book is printed on acid-fr...,7
5,-7,3000,545,22,750.0,v Foreword I believe that knowledge is power. ...,[v Foreword I believe that knowledge is power....,[page_content='v Foreword I believe that knowl...,12


In [42]:
#42 Pages has less than 30 token - removed.
pages_and_chunks.shape

(482, 9)

In [43]:
pages_and_chunks.describe()

Unnamed: 0,Page Number,Page_char_count,Page_word_count,Page_sentence_count,Page_token_count,Total_chunks
count,482.0,482.0,482.0,482.0,482.0,482.0
mean,260.188797,2999.358921,604.726141,23.562241,749.83973,12.383817
std,151.733023,1128.716821,201.566319,15.177031,282.179205,4.515161
min,-9.0,238.0,35.0,1.0,59.5,1.0
25%,132.75,2154.25,492.25,9.0,538.5625,9.0
50%,263.5,3161.5,629.0,27.0,790.375,13.0
75%,390.75,3750.5,735.75,34.0,937.625,15.0
max,516.0,5138.0,1040.0,59.0,1284.5,21.0


In [44]:
pages_and_chunks['Chunks']

3      [page_content='a guide to Healthy Nutrition Pr...
4      [page_content='This book is printed on acid-fr...
5      [page_content='v Foreword I believe that knowl...
8      [page_content='N utrition is important to all ...
9      [page_content='published in the area of nutrit...
                             ...                        
524    [page_content='Yogurt Dressing, 139 Curried Tu...
525    [page_content='Index   513 hypertension, 54 os...
526    [page_content='Temperature, cooking, in cancer...
527    [page_content='Index   515 nectarine, 191 pass...
528    [page_content='White wines, 384 Whole fish, 31...
Name: Chunks, Length: 482, dtype: object

## ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

## Embedding the chunks

Embedding size: https://huggingface.co/spaces/mteb/leaderboard

##### Model info:
- Embedding dimension: 384 
- Max Token: 512
- Parameter: 22.7 Million 
- Size: 0.08GB    

In [31]:
from langchain_huggingface import HuggingFaceEmbeddings
#384 Embedding length
embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name = embedding_model_name)



In [45]:
embeddings

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
), model_name='sentence-transformers/all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)

In [34]:
text = "a guide to Healthy Nutrition Prepared by medical and nutrition experts from Mayo Clinic,  University of California Los Angeles, and Dole Food Company, Inc. Academic Press An  Imprint of Elsevier San Diego, California ENCYCLOPEDIA of FOODS"

In [35]:
print(text)

a guide to Healthy Nutrition Prepared by medical and nutrition experts from Mayo Clinic,  University of California Los Angeles, and Dole Food Company, Inc. Academic Press An  Imprint of Elsevier San Diego, California ENCYCLOPEDIA of FOODS


In [36]:
embed = embeddings.embed_query(text)
print(embed) #return a list

[0.031577836722135544, -0.05370346084237099, 0.02644321322441101, 0.07072191685438156, -0.08148486912250519, 0.022096510976552963, -0.059182945638895035, 0.08234719932079315, -0.07985066622495651, -0.04588228464126587, -0.009498351253569126, 0.007602713070809841, -0.01186411827802658, -0.1258339136838913, 0.0048511638306081295, -0.05299125239253044, 0.1097869947552681, -0.012395246885716915, -0.023458009585738182, -0.06528518348932266, 0.07566510140895844, 0.01877969689667225, 0.06333909183740616, 0.005573967471718788, -0.029701756313443184, -0.08036581426858902, 0.02370082400739193, -0.056687433272600174, -0.064668670296669, -0.002547063399106264, 0.016911126673221588, -0.05639523267745972, 0.06170494854450226, -0.020001664757728577, -0.042054433375597, 0.008509458974003792, 0.03881470859050751, -0.09231086075305939, -0.0546158105134964, 0.011697866953909397, 0.002867236966267228, -0.010792880319058895, -0.06540720164775848, 0.03195008635520935, -0.027261782437562943, -0.0110899535939

In [46]:
embed_array = np.array(embed)
embed_array.shape

(384,)

In [58]:
%%time

pages_and_chunks['Embeddings'] = None

for idx, row in tqdm(pages_and_chunks.iterrows(), total=pages_and_chunks.shape[0]):
    chunk_content = [chunk.page_content for chunk in row['Chunks']]
    
    embedding_result = embeddings.embed_documents(chunk_content)
    pages_and_chunks.at[idx, "Embeddings"] = embedding_result    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
100%|██████████| 482/482 [00:56<00:00,  8.48it/s]


CPU times: user 41.7 s, sys: 4.63 s, total: 46.4 s
Wall time: 56.9 s


In [59]:
pages_and_chunks

Unnamed: 0,Page Number,Page_char_count,Page_word_count,Page_sentence_count,Page_token_count,Text,sentences,Chunks,Total_chunks,Embeddings
3,-9,238,41,2,59.50,a guide to Healthy Nutrition Prepared by medic...,[a guide to Healthy Nutrition Prepared by medi...,[page_content='a guide to Healthy Nutrition Pr...,1,"[[0.031577836722135544, -0.05370346084237099, ..."
4,-8,1559,276,8,389.75,This book is printed on acid-free paper. Copyr...,"[This book is printed on acid-free paper., Cop...",[page_content='This book is printed on acid-fr...,7,"[[-0.05825991928577423, -0.020900370553135872,..."
5,-7,3000,545,22,750.00,v Foreword I believe that knowledge is power. ...,[v Foreword I believe that knowledge is power....,[page_content='v Foreword I believe that knowl...,12,"[[0.022774269804358482, 0.11820303648710251, -..."
8,-4,3265,606,30,816.25,N utrition is important to all of us. What we...,"[N utrition is important to all of us., What w...",[page_content='N utrition is important to all ...,13,"[[0.028853192925453186, -0.0001783517800504341..."
9,-3,695,141,9,173.75,published in the area of nutrition. We decide...,"[published in the area of nutrition., We decid...",[page_content='published in the area of nutrit...,3,"[[0.01428879052400589, -0.016935119405388832, ..."
...,...,...,...,...,...,...,...,...,...,...
524,512,3100,610,1,775.00,"Yogurt Dressing, 139 Curried Tuna Salad With P...","[Yogurt Dressing, 139 Curried Tuna Salad With ...","[page_content='Yogurt Dressing, 139 Curried Tu...",13,"[[-0.03047480434179306, -0.011794799007475376,..."
525,513,2747,559,1,686.75,"Index 513 hypertension, 54 osteoporosis, 67–...","[Index 513 hypertension, 54 osteoporosis, 67...","[page_content='Index 513 hypertension, 54 os...",11,"[[0.02409634180366993, 0.0400601327419281, -0...."
526,514,2719,557,1,679.75,"Temperature, cooking, in cancer, 77 Terpenes, ...","[Temperature, cooking, in cancer, 77 Terpenes,...","[page_content='Temperature, cooking, in cancer...",11,"[[0.025952516123652458, -0.029439203441143036,..."
527,515,2424,514,1,606.00,"Index 515 nectarine, 191 passion fruit, 195 ...","[Index 515 nectarine, 191 passion fruit, 195...","[page_content='Index 515 nectarine, 191 pass...",10,"[[0.05597793310880661, -0.05537722259759903, -..."


In [228]:
# Sava the dataframe in an CSV file
output_path = 'Chunks_embeddings.csv'
pages_and_chunks.to_csv(output_path, index = False)
print(f"Dataframe saved successfully {output_path}")

Dataframe saved successfully Chunks_embeddings.csv


In [60]:
chunks_embedding = pd.read_csv("Chunks_embeddings.csv")
chunks_embedding.head()

Unnamed: 0,Page Number,Page_char_count,Page_word_count,Page_sentence_count,Page_token_count,Text,sentences,Chunks,Total_chunks,Embeddings
0,-9,238,41,2,59.5,a guide to Healthy Nutrition Prepared by medic...,['a guide to Healthy Nutrition Prepared by med...,"[Document(metadata={}, page_content='a guide t...",1,"[[0.031577836722135544, -0.05370346084237099, ..."
1,-8,1559,276,8,389.75,This book is printed on acid-free paper. Copyr...,"['This book is printed on acid-free paper.', '...","[Document(metadata={}, page_content='This book...",7,"[[-0.05825991928577423, -0.020900370553135872,..."
2,-7,3000,545,22,750.0,v Foreword I believe that knowledge is power. ...,['v Foreword I believe that knowledge is power...,"[Document(metadata={}, page_content='v Forewor...",12,"[[0.022774269804358482, 0.11820303648710251, -..."
3,-4,3265,606,30,816.25,N utrition is important to all of us. What we...,"['N utrition is important to all of us.', 'Wha...","[Document(metadata={}, page_content='N utritio...",13,"[[0.028853192925453186, -0.0001783517800504341..."
4,-3,695,141,9,173.75,published in the area of nutrition. We decide...,"['published in the area of nutrition.', 'We de...","[Document(metadata={}, page_content='published...",3,"[[0.01428879052400589, -0.016935119405388832, ..."


In [61]:
pages_and_chunks.dtypes

Page Number              int64
Page_char_count          int64
Page_word_count          int64
Page_sentence_count      int64
Page_token_count       float64
Text                    object
sentences               object
Chunks                  object
Total_chunks             int64
Embeddings              object
dtype: object

In [62]:
print(type(pages_and_chunks["Chunks"].iloc[0]))

<class 'list'>


In [92]:
# Convert each list of embeddings to a NumPy array
pages_and_chunks.loc[:, 'Embeddings_array'] = pages_and_chunks['Embeddings'].apply(lambda x: np.array(x) if x else None)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pages_and_chunks.loc[:, 'Embeddings_array'] = pages_and_chunks['Embeddings'].apply(lambda x: np.array(x) if x else None)


In [63]:
pages_and_chunks.head()

Unnamed: 0,Page Number,Page_char_count,Page_word_count,Page_sentence_count,Page_token_count,Text,sentences,Chunks,Total_chunks,Embeddings
3,-9,238,41,2,59.5,a guide to Healthy Nutrition Prepared by medic...,[a guide to Healthy Nutrition Prepared by medi...,[page_content='a guide to Healthy Nutrition Pr...,1,"[[0.031577836722135544, -0.05370346084237099, ..."
4,-8,1559,276,8,389.75,This book is printed on acid-free paper. Copyr...,"[This book is printed on acid-free paper., Cop...",[page_content='This book is printed on acid-fr...,7,"[[-0.05825991928577423, -0.020900370553135872,..."
5,-7,3000,545,22,750.0,v Foreword I believe that knowledge is power. ...,[v Foreword I believe that knowledge is power....,[page_content='v Foreword I believe that knowl...,12,"[[0.022774269804358482, 0.11820303648710251, -..."
8,-4,3265,606,30,816.25,N utrition is important to all of us. What we...,"[N utrition is important to all of us., What w...",[page_content='N utrition is important to all ...,13,"[[0.028853192925453186, -0.0001783517800504341..."
9,-3,695,141,9,173.75,published in the area of nutrition. We decide...,"[published in the area of nutrition., We decid...",[page_content='published in the area of nutrit...,3,"[[0.01428879052400589, -0.016935119405388832, ..."


In [65]:
# Use this do similarity search without using VectorDatabase, since we have less data.
pages_and_chunks['Embeddings']

3      [[0.031577836722135544, -0.05370346084237099, ...
4      [[-0.05825991928577423, -0.020900370553135872,...
5      [[0.022774269804358482, 0.11820303648710251, -...
8      [[0.028853192925453186, -0.0001783517800504341...
9      [[0.01428879052400589, -0.016935119405388832, ...
                             ...                        
524    [[-0.03047480434179306, -0.011794799007475376,...
525    [[0.02409634180366993, 0.0400601327419281, -0....
526    [[0.025952516123652458, -0.029439203441143036,...
527    [[0.05597793310880661, -0.05537722259759903, -...
528    [[0.055915895849466324, -0.017874974757432938,...
Name: Embeddings, Length: 482, dtype: object

## Vector database

###### Faiss into:
  - IndexType: IndexFlatL2
  - Link: https://www.pingcap.com/article/mastering-faiss-vector-database-a-beginners-handbook/#:~:text=Scalability%3A%20Designed%20to%20handle%20datasets,or%20even%20billions%20of%20vectors.

In [66]:
import faiss
index = faiss.IndexFlatL2(384)

In [67]:
embeddings_matrix = np.vstack(pages_and_chunks['Embeddings'].values)

In [68]:
len(embeddings_matrix[0])

384

In [69]:
#Always reset the faiss.Index before adding the vector*
index.reset()
print(f"Number of vectors after reset: {index.ntotal}")

Number of vectors after reset: 0


In [70]:
index.add(embeddings_matrix)
print(f"Number of vectors after reset: {index.ntotal}") #With means 14,92,250words approximately

Number of vectors after reset: 5969


In [71]:
from langchain.docstore import InMemoryDocstore
from langchain.schema import Document

doc_contents = {}
doc_index = 0

for idx, row in enumerate(pages_and_chunks['Chunks']):
    for chunk in row:
        doc_contents[doc_index] = Document(chunk.page_content)
        doc_index += 1    

In [72]:
docstore = InMemoryDocstore(doc_contents)

In [73]:
doc_contents

{0: Document(metadata={}, page_content='a guide to Healthy Nutrition Prepared by medical and nutrition experts from Mayo Clinic,  University of California Los Angeles, and Dole Food Company, Inc. Academic Press An  Imprint of Elsevier San Diego, California ENCYCLOPEDIA of FOODS'),
 1: Document(metadata={}, page_content='This book is printed on acid-free paper. Copyright 2002 by Dole Food Company, Inc. An Imprint of Elsevier All rights reserved.  No part of this publication may be reproduced or transmitted in any form or by any means, electronic or  mechanical, including photocopy,'),
 2: Document(metadata={}, page_content='including photocopy, recording, or any information storage and retrieval system, without  permission in writing from Dole Food Company, Inc.  Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in  Oxford, UK: Phone: (+44) 1865 843830,'),
 3: Document(metadata={}, page_content='(+44) 1865 843830, fax: (+44) 1865 853333, e-mail: p

In [74]:
doc_id = 5968
if doc_id in docstore._dict:
    doc = docstore._dict[doc_id]
    print(f"Document found: {doc.page_content}") 
else:
    print(f"Document ID {doc.id} not found in docstore.")

Document found: ingredient, 282–283 Yellow snap beans, 257 Yogurt nutrients, 474–475 preparation and serving, 358–359 Yolk, egg, 297 Yuba, 333 Yuca, 223 Z Zea mays L., 231 Zinc, in napa cabbage, 229 Zingiber officinale, 237 Ziziphus jujuba, 180 516 Index


In [75]:
index_docstore_id = {i : i for i in range(len(doc_contents))}

In [76]:
index_docstore_id

{0: 0,
 1: 1,
 2: 2,
 3: 3,
 4: 4,
 5: 5,
 6: 6,
 7: 7,
 8: 8,
 9: 9,
 10: 10,
 11: 11,
 12: 12,
 13: 13,
 14: 14,
 15: 15,
 16: 16,
 17: 17,
 18: 18,
 19: 19,
 20: 20,
 21: 21,
 22: 22,
 23: 23,
 24: 24,
 25: 25,
 26: 26,
 27: 27,
 28: 28,
 29: 29,
 30: 30,
 31: 31,
 32: 32,
 33: 33,
 34: 34,
 35: 35,
 36: 36,
 37: 37,
 38: 38,
 39: 39,
 40: 40,
 41: 41,
 42: 42,
 43: 43,
 44: 44,
 45: 45,
 46: 46,
 47: 47,
 48: 48,
 49: 49,
 50: 50,
 51: 51,
 52: 52,
 53: 53,
 54: 54,
 55: 55,
 56: 56,
 57: 57,
 58: 58,
 59: 59,
 60: 60,
 61: 61,
 62: 62,
 63: 63,
 64: 64,
 65: 65,
 66: 66,
 67: 67,
 68: 68,
 69: 69,
 70: 70,
 71: 71,
 72: 72,
 73: 73,
 74: 74,
 75: 75,
 76: 76,
 77: 77,
 78: 78,
 79: 79,
 80: 80,
 81: 81,
 82: 82,
 83: 83,
 84: 84,
 85: 85,
 86: 86,
 87: 87,
 88: 88,
 89: 89,
 90: 90,
 91: 91,
 92: 92,
 93: 93,
 94: 94,
 95: 95,
 96: 96,
 97: 97,
 98: 98,
 99: 99,
 100: 100,
 101: 101,
 102: 102,
 103: 103,
 104: 104,
 105: 105,
 106: 106,
 107: 107,
 108: 108,
 109: 109,
 110: 110,

In [78]:
from langchain.vectorstores import FAISS

vector_store = FAISS(
    embedding_function = embeddings,
    index = index,
    docstore = docstore,
    index_to_docstore_id = index_docstore_id
)

In [87]:
#print(vector_store.docstore._dict)

In [81]:
doc_id = 6
print(vector_store.docstore._dict[doc_id])

page_content='San Diego, California 92101-4495, USA http://www.academicpress.com Academic Press Harcourt Place, 32 Jamestown Road, London, NW1 7BY, UK http://www.academicpress.com Library of Congress Catalog Card Number: 2001093328 ISBN-13: 978–0–12–219803–8 ISBN-10: 0–12–219803–4'


In [82]:
print(len(vector_store.docstore._dict))

5969


## Querying using Similarity search

In [88]:
user_query = "Name of the book?"
query_embedding = embeddings.embed_query(str(user_query))

results = vector_store.similarity_search_by_vector(query_embedding, k = 2)
#threshold = 0.5

for idx, result in enumerate(results):
    print(f"Result {idx}: {result.page_content}")

Result 0: ISBN-10: 0–12–219803–4 PRINTED IN CHINA 05     06     WP     9     8     7     6
Result 1: experts.  Another premise of the book is that accurate information does not have to be bor- ing.  Most of us are curious about what is in the food we eat, where  it comes from, and why one food is supposed to be good for us whereas too much of it may be bad.  The book is


### 2.Embedding model

#### Model info:
  - Model Name: stella-base-en-v2 
  -  Embedding dimension: 768 
  - Parameter: 55Million 
  - Model Size: 0.2GB 
  - Tokens: 512 
  - Language: English 
  - Link: https://huggingface.co/infgrad/stella-base-en-v2

In [89]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("infgrad/stella-base-en-v2")

No sentence-transformers model found with name infgrad/stella-base-en-v2. Creating a new one with mean pooling.


In [90]:
from langchain_huggingface import HuggingFaceEmbeddings

model = "infgrad/stella-base-en-v2"
transformer = HuggingFaceEmbeddings(model_name = model)

No sentence-transformers model found with name infgrad/stella-base-en-v2. Creating a new one with mean pooling.


In [91]:
transformer

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
), model_name='infgrad/stella-base-en-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, multi_process=False, show_progress=False)

In [94]:
text = "The objective of this research is to explore the use of Retrieval-Augmented Generation (RAG) in creating a personal chat-botpowered by pretrained large language models (LLMs)."

In [100]:
embedding = transformer.embed_documents(text)
print(embedding)

[[0.03906305506825447, -0.05364309251308441, -0.0020662248134613037, -0.008052281104028225, -0.05839700624346733, 0.009989478625357151, 0.016513945534825325, 0.02609889768064022, -0.04449855908751488, 0.00762022053822875, -0.05959286168217659, -0.031252291053533554, -0.02844049595296383, 0.032822754234075546, -0.03765871748328209, 0.06254729628562927, 0.030714958906173706, -0.0649227574467659, 0.012137998826801777, -0.03797019645571709, 0.027489423751831055, -0.0382908396422863, -0.08776166290044785, 0.01111998874694109, -0.006594307254999876, -0.04063411056995392, -0.0255275908857584, -0.002300069434568286, -0.0037461731117218733, 0.07875286787748337, 0.020094197243452072, -0.022283008322119713, -0.026245588436722755, -0.11467114090919495, 0.005676348228007555, -0.0378386564552784, -0.027858713641762733, -0.04658894240856171, -0.02453482151031494, -0.10782375931739807, -0.03223752602934837, -0.021054169163107872, -0.017035728320479393, 0.032424673438072205, -0.07873754948377609, 0.009

In [102]:
len(embedding[0])

768

In [103]:
new_df = pages_and_chunks.drop(columns=['Embeddings'])
new_df.head(10)

Unnamed: 0,Page Number,Page_char_count,Page_word_count,Page_sentence_count,Page_token_count,Text,sentences,Chunks,Total_chunks
3,-9,238,41,2,59.5,a guide to Healthy Nutrition Prepared by medic...,[a guide to Healthy Nutrition Prepared by medi...,[page_content='a guide to Healthy Nutrition Pr...,1
4,-8,1559,276,8,389.75,This book is printed on acid-free paper. Copyr...,"[This book is printed on acid-free paper., Cop...",[page_content='This book is printed on acid-fr...,7
5,-7,3000,545,22,750.0,v Foreword I believe that knowledge is power. ...,[v Foreword I believe that knowledge is power....,[page_content='v Foreword I believe that knowl...,12
8,-4,3265,606,30,816.25,N utrition is important to all of us. What we...,"[N utrition is important to all of us., What w...",[page_content='N utrition is important to all ...,13
9,-3,695,141,9,173.75,published in the area of nutrition. We decide...,"[published in the area of nutrition., We decid...",[page_content='published in the area of nutrit...,3
10,-2,1098,201,12,274.5,x Acknowledgments Editorial Staff Editors-in-C...,[x Acknowledgments Editorial Staff Editors-in-...,[page_content='x Acknowledgments Editorial Sta...,5
11,-1,3482,670,26,870.5,contributions are worthy of special mention. ...,"[contributions are worthy of special mention.,...",[page_content='contributions are worthy of spe...,14
15,3,2873,529,20,718.25,The Nutriants and Other Food Substances 3 T h...,[The Nutriants and Other Food Substances 3 T ...,[page_content='The Nutriants and Other Food Su...,12
16,4,593,107,4,148.25,"In this chapter, you will be introduced to the...","[In this chapter, you will be introduced to th...","[page_content='In this chapter, you will be in...",3
17,5,2319,427,16,579.75,C H A P T E r o n e OPTIMIZING HEALTH THE DI...,[C H A P T E r o n e OPTIMIZING HEALTH THE D...,[page_content='C H A P T E r o n e OPTIMIZIN...,10


In [104]:
print(type(new_df['Chunks'].iloc[0]))

<class 'list'>


In [105]:
%%time
from tqdm.auto import tqdm

new_df['Embeddings'] = None

for idx, row in tqdm(new_df.iterrows(), total = new_df.shape[0]):
    chunk_context = [chunk.page_content for chunk in row['Chunks']]
    
    embed_result = transformer.embed_documents(chunk_context)
    new_df.at[idx, "Embeddings"] = embed_result

100%|██████████| 482/482 [02:38<00:00,  3.03it/s]

CPU times: user 1min 22s, sys: 11.8 s, total: 1min 34s
Wall time: 2min 38s





In [106]:
new_df['Embeddings']

3      [[-0.047219518572092056, -0.017264654859900475...
4      [[-0.10556717216968536, 0.025198714807629585, ...
5      [[0.05545396730303764, -0.027879992499947548, ...
8      [[0.033604733645915985, -0.03632670268416405, ...
9      [[-0.02384265325963497, -0.06204993650317192, ...
                             ...                        
524    [[-0.03200452774763107, -0.0532161146402359, -...
525    [[-0.04697604104876518, -0.03409915417432785, ...
526    [[-0.033665843307971954, -0.04699765890836716,...
527    [[-0.08509013801813126, -0.08782226592302322, ...
528    [[-0.05062337592244148, 0.03341011330485344, -...
Name: Embeddings, Length: 482, dtype: object

In [107]:
print(type(new_df['Embeddings'].iloc[0]))

<class 'list'>


### Vector database

In [108]:
import faiss
stella = faiss.IndexFlatL2(768)

In [109]:
embedding_matrix = np.vstack(new_df['Embeddings'].values)

In [110]:
embedding_matrix.shape

(5969, 768)

In [111]:
stella.reset()
print(f"Number of vectors after reset: {stella.ntotal}")

Number of vectors after reset: 0


In [112]:
stella.add(embedding_matrix)
print(f"Number of vectors after reset: {stella.ntotal}")

Number of vectors after reset: 5969


In [113]:
from langchain.docstore import InMemoryDocstore
from langchain.schema import Document

Doc_store = {}
Doc_index = 0

for ind, row in enumerate(new_df['Chunks']):
    for chunk in row:
        Doc_store[Doc_index] = Document(chunk.page_content)
        Doc_index += 1

In [114]:
Doc_store

{0: Document(metadata={}, page_content='a guide to Healthy Nutrition Prepared by medical and nutrition experts from Mayo Clinic,  University of California Los Angeles, and Dole Food Company, Inc. Academic Press An  Imprint of Elsevier San Diego, California ENCYCLOPEDIA of FOODS'),
 1: Document(metadata={}, page_content='This book is printed on acid-free paper. Copyright 2002 by Dole Food Company, Inc. An Imprint of Elsevier All rights reserved.  No part of this publication may be reproduced or transmitted in any form or by any means, electronic or  mechanical, including photocopy,'),
 2: Document(metadata={}, page_content='including photocopy, recording, or any information storage and retrieval system, without  permission in writing from Dole Food Company, Inc.  Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in  Oxford, UK: Phone: (+44) 1865 843830,'),
 3: Document(metadata={}, page_content='(+44) 1865 843830, fax: (+44) 1865 853333, e-mail: p

In [115]:
store_doc = InMemoryDocstore(Doc_store)

In [116]:
store_doc

<langchain_community.docstore.in_memory.InMemoryDocstore at 0x3fc5fc610>

In [117]:
index_docstore = {i : i for i in range(len(Doc_store))}

In [118]:
index_docstore

{0: 0,
 1: 1,
 2: 2,
 3: 3,
 4: 4,
 5: 5,
 6: 6,
 7: 7,
 8: 8,
 9: 9,
 10: 10,
 11: 11,
 12: 12,
 13: 13,
 14: 14,
 15: 15,
 16: 16,
 17: 17,
 18: 18,
 19: 19,
 20: 20,
 21: 21,
 22: 22,
 23: 23,
 24: 24,
 25: 25,
 26: 26,
 27: 27,
 28: 28,
 29: 29,
 30: 30,
 31: 31,
 32: 32,
 33: 33,
 34: 34,
 35: 35,
 36: 36,
 37: 37,
 38: 38,
 39: 39,
 40: 40,
 41: 41,
 42: 42,
 43: 43,
 44: 44,
 45: 45,
 46: 46,
 47: 47,
 48: 48,
 49: 49,
 50: 50,
 51: 51,
 52: 52,
 53: 53,
 54: 54,
 55: 55,
 56: 56,
 57: 57,
 58: 58,
 59: 59,
 60: 60,
 61: 61,
 62: 62,
 63: 63,
 64: 64,
 65: 65,
 66: 66,
 67: 67,
 68: 68,
 69: 69,
 70: 70,
 71: 71,
 72: 72,
 73: 73,
 74: 74,
 75: 75,
 76: 76,
 77: 77,
 78: 78,
 79: 79,
 80: 80,
 81: 81,
 82: 82,
 83: 83,
 84: 84,
 85: 85,
 86: 86,
 87: 87,
 88: 88,
 89: 89,
 90: 90,
 91: 91,
 92: 92,
 93: 93,
 94: 94,
 95: 95,
 96: 96,
 97: 97,
 98: 98,
 99: 99,
 100: 100,
 101: 101,
 102: 102,
 103: 103,
 104: 104,
 105: 105,
 106: 106,
 107: 107,
 108: 108,
 109: 109,
 110: 110,

In [119]:
from langchain.vectorstores import FAISS

vector_database = FAISS(
    embedding_function = transformer,
    index = stella,
    docstore = store_doc,
    index_to_docstore_id = index_docstore
)

In [120]:
vector_database

<langchain_community.vectorstores.faiss.FAISS at 0x3fc5b1f60>

In [125]:
print(vector_database.docstore._dict)



In [121]:
len(vector_database.docstore._dict)

5969

### Vector database similarity search

In [348]:
user = "Summary of the book?"
query_embed = transformer.embed_query(str(user))

#Search using Eculidean distance
content = vector_database.similarity_search_with_score_by_vector(query_embed, k = 2)
threshold = 1.3

filtered_results = [
    (res.page_content, score) 
    for res, score in content
    if score < threshold
]

retrieved_content = []

for ide, (result_text, score) in enumerate(filtered_results):
    retrieved_content.append(result_text)
    print(f"Result:{ide} (Score: {score}): {result_text}")

Result:0 (Score: 1.189841389656067): may be bad.  The book is divided into two parts.  Part I provides the reader with an overview of the principles of nutrition, including the basis for the Food Guide Pyramid and for nutrition recommendations, how various nutrients differ, and how our nutrition needs differ
Result:1 (Score: 1.271240234375): seeks to answer three main questions:  What am I eating?  What should I eat? and Why?  The premise of the book is that well-informed people make well-informed decisions.  The theme of the book is moderation.  The standard is that all recommendations be based on valid


In [349]:
retrieved_content

['may be bad.  The book is divided into two parts.  Part I provides the reader with an overview of the principles of nutrition, including the basis for the Food Guide Pyramid and for nutrition recommendations, how various nutrients differ, and how our nutrition needs differ',
 'seeks to answer three main questions:  What am I eating?  What should I eat? and Why?  The premise of the book is that well-informed people make well-informed decisions.  The theme of the book is moderation.  The standard is that all recommendations be based on valid']

### LLM

In [131]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "true"

#### Model info
- Name: Google/T5-Base
- Size: 1GB
- Parameter: 223
- Year: 2020

##### Citation
 - @article{2020t5, \
      author  = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu}, \
      title   = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},\
      journal = {Journal of Machine Learning Research}, \
      year    = {2020}, \
      volume  = {21}, \
      number  = {140}, \
      pages   = {1-67}, \
      url     = {http://jmlr.org/papers/v21/20-074.html} \
    }

In [175]:
%%time

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer_T5 = AutoTokenizer.from_pretrained("google-t5/t5-base")
model_T5 = AutoModelForSeq2SeqLM.from_pretrained("google-t5/t5-base", pad_token_id = tokenizer_T5.eos_token_id)

CPU times: user 399 ms, sys: 174 ms, total: 573 ms
Wall time: 3.27 s


In [176]:
model_T5

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dro

In [350]:
retrieval = " ".join(retrieved_content)
retriever = f"Return the suitable context for the question. question: {user}?. context: {retrieval}"
print(retriever)

Return the suitable context for the question. question: Summary of the book? ?. context: may be bad.  The book is divided into two parts.  Part I provides the reader with an overview of the principles of nutrition, including the basis for the Food Guide Pyramid and for nutrition recommendations, how various nutrients differ, and how our nutrition needs differ seeks to answer three main questions:  What am I eating?  What should I eat? and Why?  The premise of the book is that well-informed people make well-informed decisions.  The theme of the book is moderation.  The standard is that all recommendations be based on valid


In [352]:
input_token = tokenizer_T5.encode(
    retriever,
    return_tensors = "pt"
)

In [353]:
print("ToKens:", input_token )

ToKens: tensor([[ 9778,     8,  3255,  2625,    21,     8,   822,     5,   822,    10,
         20698,    13,     8,   484,    58,     3,    58,     5,  2625,    10,
           164,    36,  1282,     5,    37,   484,    19,  8807,   139,   192,
          1467,     5,  2733,    27,   795,     8,  5471,    28,    46,  8650,
            13,     8,  5559,    13,  7470,     6,   379,     8,  1873,    21,
             8,  3139,  4637, 30237,    11,    21,  7470,  5719,     6,   149,
           796, 12128,  7641,     6,    11,   149,    69,  7470,   523,  7641,
          2762,     7,    12,  1525,   386,   711,   746,    10,   363,   183,
            27,  3182,    58,   363,   225,    27,     3,  1544,    58,    11,
          1615,    58,    37,     3, 17398,    13,     8,   484,    19,    24,
           168,    18,    77, 10816,   151,   143,   168,    18,    77, 10816,
          3055,     5,    37,  3800,    13,     8,   484,    19,  2175,  2661,
             5,    37,  1068,    19,    24, 

In [354]:
%%time

out = model_T5.generate(
    input_token,
    max_length = 250,
    min_length= 50, 
    num_beams = 2,
    early_stopping = True
)

CPU times: user 17.8 s, sys: 2.91 s, total: 20.7 s
Wall time: 6.01 s


In [355]:
print("T5 prediction:", out[0])

T5 prediction: tensor([    0,  9778,     8,  3255,  2625,    21,     8,   822,     3,     5,
         2733,    27,   795,     8,  5471,    28,    46,  8650,    13,     8,
         5559,    13,  7470,     6,   379,     8,  1873,    21,     8,  3139,
         4637, 30237,    11,    21,  7470,  5719,     6,   149,   796, 12128,
         7641,     6,    11,   149,    69,  7470,   523,  7641,     3,     5,
            1])


In [356]:
Human_text = tokenizer_T5.decode(
    out[0],
    skip_special_tokens = True
)
print("Final Result:", Human_text)

Final Result: Return the suitable context for the question. Part I provides the reader with an overview of the principles of nutrition, including the basis for the Food Guide Pyramid and for nutrition recommendations, how various nutrients differ, and how our nutrition needs differ.


End of the Road...♡