## 1. Indexing 

Indexing has four parts:
1. Document loading
2. Document splitting
3. Document embedding
4. Document storing



### Document loading 
* LangChain can load many kinds of documents, including files stored in Google Drive or Dropbox. After loading, LangChain stores the content in Document objects.

* Document is a LangChain class that stores a chunk of text together with its related metadata (such as page_number or page_title).

In [9]:
%pip install pypdf docx2txt langchain-community --quiet

Note: you may need to restart the kernel to use updated packages.


In [10]:
from langchain_community.document_loaders import PyPDFLoader 
import copy 

In [11]:
loaded_pdf = PyPDFLoader("/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf")
pages_pdf = loaded_pdf.load()  # returns a list with one Document per page

pages_pdf

Ignoring wrong pointing object 13 0 (offset 0)


[Document(metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='TDEE CALCULATOR\nLink To Online Calculator: https://www.freedieting.com/calorie-\ncalculator\nTDEE Explained\nYour Total Daily Energy Expenditure (TDEE) is the number of \ncalories that your body burns in one day. It is calculated by \nestimating how many calories you burn while resting (= Basal \nMetabolic Rate or BMR) and adding a certain number of calories \non top, depending on how often you exercise. \nThe simplest method of calculating your TDEE is by using an \nonline calculator such as the one I linked above. It will ask you for \nyour age, weight, height and weekly exercise. While the result will \nnot be 100% accurate – since we all have different metabolisms \nand BMRs – it will give you a good idea of how many calories you \nneed to consume in order to maintain your current weight. \nIn case the abbreviations used by many TDEE calculators confuse \nyou, here is an ex

In [13]:
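# The extracted text contains many newlines, which inflate the token count.

pages_pdf_copy = copy.deepcopy(pages_pdf)  # keep the original pages untouched

for page in pages_pdf_copy:
    # Collapse every run of whitespace (including newlines) into a single space.
    page.page_content = " ".join(page.page_content.split())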
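As a small illustration of what the cleanup loop does, the snippet below applies the same idiom to a made-up string (the sample text is an assumption, not the actual PDF content). Calling split() with no arguments splits on any run of whitespace, so joining the pieces with a single space removes newlines and repeated spaces.

```python
# Hypothetical sample text with the kind of line breaks PDF extraction produces.
raw = "TDEE CALCULATOR\nLink To Online   Calculator:\n\nTDEE Explained \n"

# split() with no arguments splits on any whitespace run and drops empty strings.
cleaned = " ".join(raw.split())

print(cleaned)
```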

### Document splitting


* LangChain offers several algorithms for splitting documents into chunks.

In [16]:
## Character splitter --> splits the text on a separator; chunk_size is measured in characters, not words, tokens, or sentences.
from langchain_text_splitters import CharacterTextSplitter 


splitter = CharacterTextSplitter(
    separator=".", 
    chunk_size=100, 
    chunk_overlap=30, 
)

splitted_documents = splitter.split_documents(pages_pdf_copy)

Created a chunk of size 139, which is longer than the specified 100
Created a chunk of size 188, which is longer than the specified 100
Created a chunk of size 109, which is longer than the specified 100
Created a chunk of size 207, which is longer than the specified 100
Created a chunk of size 214, which is longer than the specified 100
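The warnings above appear because a separator-based splitter can only cut at the separator: a sentence longer than chunk_size has no place to be cut, so it is emitted whole. The sketch below is a simplified illustration of that behaviour, not LangChain's actual implementation; the function name split_on_separator is hypothetical and chunk overlap is left out for brevity.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split text at separator, then greedily merge pieces up to chunk_size characters."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + " " + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # The piece may itself exceed chunk_size: there is no separator to cut at.
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Made-up text: one short sentence, one long sentence, one short sentence.
text = "Short one. " + "A much longer sentence without any full stop inside it " * 2 + ". End."
chunks = split_on_separator(text, separator=".", chunk_size=40)
for chunk in chunks:
    print(len(chunk), "oversized" if len(chunk) > 40 else "ok")
```

The long middle sentence ends up as a single oversized chunk, which mirrors the "longer than the specified 100" messages.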


In [17]:
splitted_documents

## The problem with this approach: there is no guarantee that each chunk contains all the necessary information about a topic, so we need a different approach.

[Document(metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='TDEE CALCULATOR Link To Online Calculator: https://www.freedieting'),
 Document(metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='com/calorie- calculator TDEE Explained Your Total Daily Energy Expenditure (TDEE) is the number of calories that your body burns in one day'),
 Document(metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='It is calculated by estimating how many calories you burn while resting (= Basal Metabolic Rate or BMR) and adding a certain number of calories on top, depending on how often you exercise'),
 Document(metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='The simplest method of calculating your TDEE is by using an online calculator such as the one I linked above'),
 Document(metadata={'source': '/workspaces/Learn

In [26]:
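## Markdown header splitter --> splits the document at its Markdown headers.
## Note: the PDF text used here contains no Markdown headers, so the result is a single chunk with empty metadata.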

from langchain_text_splitters.markdown import MarkdownHeaderTextSplitter 

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "title"), ("##", "sub_title")]
)

pages_md_split = md_splitter.split_text(pages_pdf_copy[1].page_content)

In [27]:
pages_md_split

[Document(metadata={}, page_content='TDEE Formulas (for the fitness geeks ;-) Harris-Benedict: Women BMR = 655 + (9.6 X weight in kg) + (1.8 x height in cm) – (4.7 x age in yrs) Men BMR = 66 + (13.7 X weight in kg) + (5 x height in cm) – (6.8 x age in yrs) Mifflin-St. Jeor: Women BMR = 10 x weight (kg) + 6.25 x height (cm) – 5 x age (y) – 161 Men BMR = 10 x weight (kg) + 6.25 x height (cm) – 5 x age (y) + 5 Katch-McArdle (need to know your bodyfat %): BMR = 370 + (21.6 x Lean Body Mass (kg)) Lean Body Mass = (Weight(kg) x (100-(Body Fat)))/100 To then calculate your TDEE, simply multiply your BMR by these activity factors: Sedentary (little to no exercise + work a desk job) = 1.2 Lightly Active (light exercise 1-3 days / week) = 1.375 Moderately Active (moderate exercise 3-5 days / week) = 1.55 Very Active (heavy exercise 6-7 days / week) = 1.725 Extremely Active (very heavy exercise, hard labor job, training 2x / day) = 1.9')]
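The output above is a single Document with empty metadata because the PDF text contains no lines starting with # or ##. To show what header-based splitting does when headers are present, here is a rough pure-Python sketch. It is an assumption-laden illustration, not the MarkdownHeaderTextSplitter implementation; split_by_headers and the sample text are hypothetical.

```python
def split_by_headers(markdown: str, headers: dict[str, str]) -> list[dict]:
    """Group lines under the most recent headers; `headers` maps prefix -> metadata key."""
    chunks: list[dict] = []
    metadata: dict[str, str] = {}
    lines: list[str] = []

    def flush() -> None:
        # Emit the text collected so far together with a snapshot of the header metadata.
        if lines:
            chunks.append({"metadata": dict(metadata), "page_content": "\n".join(lines)})
            lines.clear()

    # Check longer prefixes first so "##" is not mistaken for "#".
    prefixes = sorted(headers, key=len, reverse=True)
    for line in markdown.splitlines():
        stripped = line.strip()
        prefix = next((p for p in prefixes if stripped.startswith(p + " ")), None)
        if prefix is None:
            if stripped:
                lines.append(stripped)
            continue
        flush()
        # A new header replaces any header of the same or a deeper level.
        for p in prefixes:
            if len(p) >= len(prefix):
                metadata.pop(headers[p], None)
        metadata[headers[prefix]] = stripped[len(prefix):].strip()
    flush()
    return chunks


header_map = {"#": "title", "##": "sub_title"}
sample = "# TDEE\nIntro text.\n## Formulas\nBMR times activity factor."
chunks = split_by_headers(sample, header_map)
no_headers = split_by_headers("Just plain text.", header_map)
print(chunks)
print(no_headers)
```

Text without any headers comes back as one chunk with empty metadata, which matches what happened with the PDF page.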

### Document Embedding

In [24]:
%pip install --upgrade --quiet  langchain-google-genai faiss-cpu

Note: you may need to restart the kernel to use updated packages.


In [7]:
import os

from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Read the API key from the environment instead of hard-coding it in the notebook.
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001",
    google_api_key=os.environ["GOOGLE_API_KEY"],
    task_type="retrieval_query",
)
vector = embeddings.embed_query("hello, world!")

In [31]:
from langchain_community.vectorstores import FAISS
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore

In [32]:
# The index dimension must match the embedding size, so embed a sample query to measure it.
index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))

vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(), 
    index_to_docstore_id={}
) 
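For intuition, a flat L2 index performs an exact, exhaustive search: it compares the query with every stored vector and ranks them by squared Euclidean distance. The following pure-Python sketch shows that idea under simplifying assumptions; l2_search and the toy vectors are hypothetical, and FAISS itself is far more optimized.

```python
def l2_search(stored: list[list[float]], query: list[float], k: int) -> list[tuple[int, float]]:
    """Return the k nearest stored vectors as (position, squared L2 distance) pairs."""
    scored = [
        (position, sum((a - b) ** 2 for a, b in zip(vector, query)))
        for position, vector in enumerate(stored)
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
hits = l2_search(stored, query=[1.0, 2.0], k=2)
print(hits)
```

The returned positions are what a mapping such as index_to_docstore_id would translate back into document ids.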

In [41]:
from uuid import uuid4 

uuids = [str(uuid4()) for _ in range(len(splitted_documents))]

vector_store.add_documents(documents=splitted_documents, ids=uuids)

['24190fd4-86ee-430c-99b7-7e552e86de40',
 'a2b879b2-5611-4c04-bee7-c38107175fd2',
 '49cf3f2b-2094-476e-a809-ba8d0efc9ccb',
 'ffc64989-e14c-4ddf-b541-0d19853a18fb',
 'eede6893-3cb1-463a-a51f-82e7bfc29cae',
 '18864564-187f-4404-b53c-64e0152191e1',
 '1e1be44a-34a0-4cf2-b215-32fcb3c69eb0',
 '3f810586-e919-47d0-9631-2395412ac68d',
 'd2a16d05-295f-4d40-816b-e0036355404b',
 '0dea0b8d-e163-4dba-b42a-3a9710140272',
 'cf4e0337-6bc9-41cd-9893-a8002244ff7b',
 '8e9b30f4-7f7f-471e-87ff-145e7cf27167',
 '50ce37ca-900e-4dcc-8cf6-3dd008aa9f35',
 '456e0264-c2a1-451d-871b-dc5405acfc94',
 '3e03802b-2341-412a-92a1-b07f7700bd66',
 '74c86870-dc49-468f-a8cd-471d21cb393b',
 '37b304aa-9a22-488a-99fc-43fea5fbb18e',
 '3584cb3d-5bc4-42e4-9cfd-7b531db5adc9']

## Retrieval


LangChain supports several retrieval strategies (similarity search, MMR, HyDE, similarity search with a score threshold, and more; see the documentation).

In [47]:
question = "what is TDEE?"

# k is the number of documents to return and must be an integer; as_retriever returns a Runnable.
vb = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 1, "lambda_mult": 0.7})


In [48]:
vb.invoke(question)

[Document(id='ffc64989-e14c-4ddf-b541-0d19853a18fb', metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='The simplest method of calculating your TDEE is by using an online calculator such as the one I linked above')]

In [None]:
vector_store.max_marginal_relevance_search(query=question, lambda_mult=0.7, k = 2)


## lambda_mult = 1 removes the diversity component (pure similarity ranking)
## lambda_mult = 0 removes the similarity component (maximum diversity)

[Document(id='ffc64989-e14c-4ddf-b541-0d19853a18fb', metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='The simplest method of calculating your TDEE is by using an online calculator such as the one I linked above'),
 Document(id='a2b879b2-5611-4c04-bee7-c38107175fd2', metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='com/calorie- calculator TDEE Explained Your Total Daily Energy Expenditure (TDEE) is the number of calories that your body burns in one day')]
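To make the role of lambda_mult concrete, here is a minimal sketch of maximal marginal relevance over toy vectors. It assumes cosine similarity and the usual score lambda * relevance - (1 - lambda) * redundancy; the function names and vectors are hypothetical, and the library's implementation may differ in detail.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query: list[float], candidates: list[list[float]], k: int, lambda_mult: float) -> list[int]:
    """Greedily pick k candidate positions, trading relevance against redundancy."""
    selected: list[int] = []
    remaining = list(range(len(candidates)))

    def score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        # Redundancy: similarity to the closest already selected candidate.
        redundancy = max((cosine(candidates[i], candidates[j]) for j in selected), default=0.0)
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.
candidates = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

print(mmr(query, candidates, k=2, lambda_mult=1.0))  # relevance only
print(mmr(query, candidates, k=2, lambda_mult=0.3))  # diversity weighted more heavily
```

With lambda_mult=1.0 the two near-duplicates are returned; with a lower value the second pick shifts to the more distinct candidate.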

In [62]:
## Let's create a chain 

TEMPLATE = """
Answer the following question 
{question}

To answer the question, use the following context: 
{context} 

If you don't find the answer, just say "I don't know" 
"""

from langchain.prompts import PromptTemplate

template = PromptTemplate.from_template(TEMPLATE)

import os

# Read the API key from the environment instead of hard-coding it in the notebook.
api_key = os.environ["GROQ_API_KEY"]


#1st component 
from langchain_groq import ChatGroq
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.output_parsers import StructuredOutputParser

llm = ChatGroq(
    model="mixtral-8x7b-32768",
    temperature=0.0,
    max_retries=2,
    api_key=api_key
)


question = "what is TDEE?"

In [68]:


chain = ( 
    RunnableParallel({"context": vb, "question": RunnablePassthrough()}) 
    | template 
    | llm 
    # | StructuredOutputParser()
) 


In [69]:
chain.invoke("what is TDEE?")

AIMessage(content='Based on the provided context, TDEE stands for Total Daily Energy Expenditure. It is a measure of the total number of calories a person needs to consume in a day to maintain their current weight, taking into account their basal metabolic rate, physical activity level, and thermic effect of food. The simplest method of calculating your TDEE is by using an online calculator such as the one linked above.', additional_kwargs={}, response_metadata={'token_usage': {'completion_tokens': 90, 'prompt_tokens': 164, 'total_tokens': 254, 'completion_time': 0.142282938, 'prompt_time': 0.012063255, 'queue_time': 0.001169873, 'total_time': 0.154346193}, 'model_name': 'mixtral-8x7b-32768', 'system_fingerprint': 'fp_c5f20b5bb1', 'finish_reason': 'stop', 'logprobs': None}, id='run-945aa1a4-4459-4fce-91fb-94544d21d5b7-0', usage_metadata={'input_tokens': 164, 'output_tokens': 90, 'total_tokens': 254})
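The first step of the chain builds a dictionary with a context entry (the retriever result) and a question entry (the input passed through unchanged), which the template then fills in. The sketch below imitates that data flow in plain Python. Doc, fake_retriever, format_docs, and build_prompt are hypothetical stand-ins, not LangChain APIs. It also joins the page_content of the retrieved documents, a common pattern that keeps object representations out of the prompt.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def fake_retriever(question: str) -> list[Doc]:
    # Stand-in for the retriever: always returns the same two chunks.
    return [
        Doc("Your Total Daily Energy Expenditure (TDEE) is the number of calories that your body burns in one day"),
        Doc("The simplest method of calculating your TDEE is by using an online calculator"),
    ]


def format_docs(docs: list[Doc]) -> str:
    """Join only the text of each document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


TEMPLATE = (
    "Answer the following question\n{question}\n\n"
    "To answer the question, use the following context:\n{context}\n"
)


def build_prompt(question: str) -> str:
    # Same shape as the parallel step: one key per branch, both fed by the same input.
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    return TEMPLATE.format(**inputs)


prompt = build_prompt("what is TDEE?")
print(prompt)
```

In LCEL, a helper like format_docs could presumably be piped after the retriever (vb | format_docs), since plain functions are coerced to runnables when composed with one.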