## 1. Indexing 

Indexing has four parts:

1. Document loading
2. Document splitting
3. Document embedding
4. Document storing



### Document loading 
* LangChain can load many kinds of documents, including files from services such as Google Drive and Dropbox. After loading, LangChain stores the content in Document objects.

* Document is a LangChain class that stores a chunk of text and its related metadata (such as the page number or page title).
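
To make the shape of a Document concrete, here is a minimal sketch. `SimpleDocument` is a hypothetical stand-in written for illustration, not LangChain's own class; it only mirrors the two fields visible in the loader output (`page_content` and `metadata`).

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document, for illustration only."""
    page_content: str                              # the text of the chunk or page
    metadata: dict = field(default_factory=dict)   # e.g. source path, page number


doc = SimpleDocument(
    page_content="TDEE CALCULATOR",
    metadata={"source": "example.pdf", "page": 0},  # made-up values
)
print(doc.metadata["page"])
```

Keeping the metadata next to the text is what lets later steps report where a retrieved chunk came from.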

In [1]:
%pip install pypdf docx2txt langchain-community --quiet

Note: you may need to restart the kernel to use updated packages.


In [3]:
from langchain_community.document_loaders import PyPDFLoader 
import copy 

In [4]:
loaded_pdf = PyPDFLoader("/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf")
pages_pdf = loaded_pdf.load() # it outputs in pages 

pages_pdf

Ignoring wrong pointing object 13 0 (offset 0)


[Document(metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='TDEE CALCULATOR\nLink To Online Calculator: https://www.freedieting.com/calorie-\ncalculator\nTDEE Explained\nYour Total Daily Energy Expenditure (TDEE) is the number of \ncalories that your body burns in one day. It is calculated by \nestimating how many calories you burn while resting (= Basal \nMetabolic Rate or BMR) and adding a certain number of calories \non top, depending on how often you exercise. \nThe simplest method of calculating your TDEE is by using an \nonline calculator such as the one I linked above. It will ask you for \nyour age, weight, height and weekly exercise. While the result will \nnot be 100% accurate – since we all have different metabolisms \nand BMRs – it will give you a good idea of how many calories you \nneed to consume in order to maintain your current weight. \nIn case the abbreviations used by many TDEE calculators confuse \nyou, here is an ex

In [13]:
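# The text contains many newlines, which add unnecessary tokens.
# Work on a deep copy so the original pages stay unchanged.

pages_pdf_copy = copy.deepcopy(pages_pdf)

for page in pages_pdf_copy:
    # Collapse every run of whitespace (including newlines) into a single space.
    page.page_content = " ".join(page.page_content.split())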
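
To see what this whitespace cleanup does, here is a tiny standalone example. The input string is made up for illustration and is not taken from the PDF.

```python
# A made-up string with newlines and repeated spaces, similar to raw PDF text.
raw = "TDEE CALCULATOR\nLink To Online   Calculator:\n\nTDEE Explained"

# str.split() with no arguments splits on any run of whitespace,
# so joining with a single space normalizes the whole string.
cleaned = " ".join(raw.split())
print(cleaned)
```

One trade-off worth keeping in mind: this also removes the line structure, so any later step that relies on line breaks (for example a header-based splitter) no longer has them.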

### Document splitting


* LangChain offers several algorithms for splitting documents into chunks.

In [16]:
## Character splitter --> splits the text on a separator string and measures chunk size in characters (not words, tokens, or sentences).
from langchain_text_splitters import CharacterTextSplitter 


splitter = CharacterTextSplitter(
    separator=".", 
    chunk_size=100, 
    chunk_overlap=30, 
)

splitted_documents = splitter.split_documents(pages_pdf_copy)

Created a chunk of size 139, which is longer than the specified 100
Created a chunk of size 188, which is longer than the specified 100
Created a chunk of size 109, which is longer than the specified 100
Created a chunk of size 207, which is longer than the specified 100
Created a chunk of size 214, which is longer than the specified 100
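
The warnings above appear because the text is only cut at the separator: a sentence longer than `chunk_size` cannot be cut any further, so it becomes an oversized chunk. The following is a simplified sketch of that behaviour, not LangChain's actual implementation. `split_on_separator` is a hypothetical helper, and chunk overlap is left out for brevity.

```python
def split_on_separator(text, separator=".", chunk_size=100):
    """Split on `separator`, then greedily merge pieces up to `chunk_size` characters."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + " " + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would exceed the limit: close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# One long "sentence" (150 characters) followed by two short ones.
text = "a" * 150 + ". short one. another short"
chunks = split_on_separator(text)
for chunk in chunks:
    print(len(chunk), chunk[:30])
```

The first chunk is 150 characters long even though `chunk_size` is 100, which is the same situation the warnings describe.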


In [17]:
splitted_documents

## The problem with this approach: we can't be sure each chunk contains all the necessary information about a topic, so we need another approach.

[Document(metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='TDEE CALCULATOR Link To Online Calculator: https://www.freedieting'),
 Document(metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='com/calorie- calculator TDEE Explained Your Total Daily Energy Expenditure (TDEE) is the number of calories that your body burns in one day'),
 Document(metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='It is calculated by estimating how many calories you burn while resting (= Basal Metabolic Rate or BMR) and adding a certain number of calories on top, depending on how often you exercise'),
 Document(metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='The simplest method of calculating your TDEE is by using an online calculator such as the one I linked above'),
 Document(metadata={'source': '/workspaces/Learn

In [26]:
## Markdown header splitter --> splits a document on its Markdown headers.

from langchain_text_splitters.markdown import MarkdownHeaderTextSplitter 

md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "title"), ("##", "sub_title")]
)

pages_md_split = md_splitter.split_text(pages_pdf_copy[1].page_content)

In [27]:
pages_md_split

[Document(metadata={}, page_content='TDEE Formulas (for the fitness geeks ;-) Harris-Benedict: Women BMR = 655 + (9.6 X weight in kg) + (1.8 x height in cm) – (4.7 x age in yrs) Men BMR = 66 + (13.7 X weight in kg) + (5 x height in cm) – (6.8 x age in yrs) Mifflin-St. Jeor: Women BMR = 10 x weight (kg) + 6.25 x height (cm) – 5 x age (y) – 161 Men BMR = 10 x weight (kg) + 6.25 x height (cm) – 5 x age (y) + 5 Katch-McArdle (need to know your bodyfat %): BMR = 370 + (21.6 x Lean Body Mass (kg)) Lean Body Mass = (Weight(kg) x (100-(Body Fat)))/100 To then calculate your TDEE, simply multiply your BMR by these activity factors: Sedentary (little to no exercise + work a desk job) = 1.2 Lightly Active (light exercise 1-3 days / week) = 1.375 Moderately Active (moderate exercise 3-5 days / week) = 1.55 Very Active (heavy exercise 6-7 days / week) = 1.725 Extremely Active (very heavy exercise, hard labor job, training 2x / day) = 1.9')]
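
The result above is a single Document with empty metadata, most likely because the PDF text contains no Markdown header lines (and the earlier whitespace cleanup joined everything into one line). The sketch below illustrates the idea of header-based splitting on a small made-up Markdown string. `split_by_headers` is a hypothetical, simplified helper and not LangChain's implementation.

```python
def split_by_headers(text, headers):
    """Split `text` into sections at header lines.

    `headers` is a list of (prefix, metadata_key) pairs,
    e.g. [("#", "title"), ("##", "sub_title")].
    """
    # Check longer prefixes first so "##" is not mistaken for "#".
    ordered = sorted(headers, key=lambda h: len(h[0]), reverse=True)
    sections, metadata, body = [], {}, []

    def flush():
        if body:
            sections.append({"metadata": dict(metadata), "page_content": "\n".join(body)})
            body.clear()

    for line in text.splitlines():
        stripped = line.strip()
        for prefix, key in ordered:
            if stripped.startswith(prefix + " "):
                flush()
                # A new header resets every header of the same or deeper level.
                for other_prefix, other_key in headers:
                    if len(other_prefix) >= len(prefix):
                        metadata.pop(other_key, None)
                metadata[key] = stripped[len(prefix):].strip()
                break
        else:
            if stripped:
                body.append(stripped)
    flush()
    return sections


headers = [("#", "title"), ("##", "sub_title")]
with_headers = split_by_headers("# Guide\nintro\n## Formulas\nBMR text", headers)
without_headers = split_by_headers("TDEE Formulas BMR = 655 + ...", headers)
print(with_headers)
print(without_headers)
```

Text without header lines comes back as one section with empty metadata, which matches what the notebook output shows.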

### Document Embedding

In [24]:
%pip install --upgrade --quiet  langchain-google-genai faiss-cpu

Note: you may need to restart the kernel to use updated packages.


In [7]:
import os

from langchain_google_genai import GoogleGenerativeAIEmbeddings

# Read the API key from the environment instead of hard-coding it in the notebook.
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001",
    google_api_key=os.environ["GOOGLE_API_KEY"],
    task_type="retrieval_query",
)
vector = embeddings.embed_query("hello, world!")

In [31]:
from langchain_community.vectorstores import FAISS
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore

In [32]:
index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))

vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(), 
    index_to_docstore_id={}
)

In [41]:
from uuid import uuid4 

uuids = [str(uuid4()) for _ in range(len(splitted_documents))]

vector_store.add_documents(documents=splitted_documents, ids=uuids)

['24190fd4-86ee-430c-99b7-7e552e86de40',
 'a2b879b2-5611-4c04-bee7-c38107175fd2',
 '49cf3f2b-2094-476e-a809-ba8d0efc9ccb',
 'ffc64989-e14c-4ddf-b541-0d19853a18fb',
 'eede6893-3cb1-463a-a51f-82e7bfc29cae',
 '18864564-187f-4404-b53c-64e0152191e1',
 '1e1be44a-34a0-4cf2-b215-32fcb3c69eb0',
 '3f810586-e919-47d0-9631-2395412ac68d',
 'd2a16d05-295f-4d40-816b-e0036355404b',
 '0dea0b8d-e163-4dba-b42a-3a9710140272',
 'cf4e0337-6bc9-41cd-9893-a8002244ff7b',
 '8e9b30f4-7f7f-471e-87ff-145e7cf27167',
 '50ce37ca-900e-4dcc-8cf6-3dd008aa9f35',
 '456e0264-c2a1-451d-871b-dc5405acfc94',
 '3e03802b-2341-412a-92a1-b07f7700bd66',
 '74c86870-dc49-468f-a8cd-471d21cb393b',
 '37b304aa-9a22-488a-99fc-43fea5fbb18e',
 '3584cb3d-5bc4-42e4-9cfd-7b531db5adc9']
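
For intuition about what the index does with the stored vectors: `faiss.IndexFlatL2` performs an exhaustive search using squared Euclidean (L2) distance. The sketch below reproduces that idea in plain Python on toy 2-dimensional vectors. `l2_search` is a hypothetical helper, and the vectors are made up; real embeddings have hundreds of dimensions.

```python
def l2_search(index_vectors, query, k=2):
    """Return the k nearest vectors as (squared_distance, position) pairs."""
    scored = []
    for position, vector in enumerate(index_vectors):
        squared_distance = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((squared_distance, position))
    scored.sort()  # smallest distance first
    return scored[:k]


index_vectors = [[0, 0], [1, 0], [3, 4]]
results = l2_search(index_vectors, query=[1, 1], k=2)
print(results)
```

A flat index compares the query against every stored vector, which is exact but scales linearly with the number of chunks.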

## Retrieval


LangChain supports several retrieval strategies (for example similarity search, MMR, HyDE, and similarity search with a score threshold; see the documentation for more).

In [47]:
question = "what is TDEE?"

vb = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 1, "lambda_mult": 0.7})  ## k is the number of documents to return, so it must be an integer; as_retriever returns a Runnable


In [48]:
vb.invoke(question)

[Document(id='ffc64989-e14c-4ddf-b541-0d19853a18fb', metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='The simplest method of calculating your TDEE is by using an online calculator such as the one I linked above')]

In [None]:
vector_store.max_marginal_relevance_search(query=question, lambda_mult=0.7, k = 2)


## lambda_mult = 1 removes the diversity component
## lambda_mult = 0 removes the similarity component from the formula

[Document(id='ffc64989-e14c-4ddf-b541-0d19853a18fb', metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='The simplest method of calculating your TDEE is by using an online calculator such as the one I linked above'),
 Document(id='a2b879b2-5611-4c04-bee7-c38107175fd2', metadata={'source': '/workspaces/LearningLCEL/Data/TDEE+Calculator.pdf', 'page': 0}, page_content='com/calorie- calculator TDEE Explained Your Total Daily Energy Expenditure (TDEE) is the number of calories that your body burns in one day')]
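
The effect of `lambda_mult` is easier to see in a small standalone sketch of maximal marginal relevance (MMR). This is a simplified illustration on made-up 2-dimensional vectors, not the library's code. `mmr` and `cosine` are hypothetical helpers, and the score used is `lambda_mult * relevance - (1 - lambda_mult) * redundancy`.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k=2, lambda_mult=0.7):
    """Return the positions of k candidates, balancing relevance and diversity."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_position, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query, candidates[i])
            # Redundancy: similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_position, best_score = i, score
        selected.append(best_position)
        remaining.remove(best_position)
    return selected


query = [1.0, 0.0]
# Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.
candidates = [[1.0, 0.1], [1.0, 0.12], [0.5, 0.9]]

print(mmr(query, candidates, k=2, lambda_mult=1.0))  # pure relevance
print(mmr(query, candidates, k=2, lambda_mult=0.3))  # favours diversity
```

With `lambda_mult=1.0` the two near-duplicates are returned; with a lower value the second pick switches to the less similar but more diverse candidate.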

In [62]:
## Let's create a chain 

TEMPLATE = """
Answer the following question 
{question}

To answer the question, use the following context: 
{context} 

If you don't find the answer, just say "I don't know" 
"""

import os

from langchain.prompts import PromptTemplate

template = PromptTemplate.from_template(TEMPLATE)

# Read the Groq API key from the environment instead of hard-coding it.
api_key = os.environ["GROQ_API_KEY"]


#1st component 
from langchain_groq import ChatGroq
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.output_parsers import StructuredOutputParser

llm = ChatGroq(
    model="mixtral-8x7b-32768",
    temperature=0.0,
    max_retries=2,
    api_key=api_key
)


question = "what is TDEE?"

In [68]:


chain = ( 
    RunnableParallel({"context": vb, "question": RunnablePassthrough()}) 
    | template 
    | llm 
    # | StructuredOutputParser()
) 


In [69]:
chain.invoke("what is TDEE?")

AIMessage(content='Based on the provided context, TDEE stands for Total Daily Energy Expenditure. It is a measure of the total number of calories a person needs to consume in a day to maintain their current weight, taking into account their basal metabolic rate, physical activity level, and thermic effect of food. The simplest method of calculating your TDEE is by using an online calculator such as the one linked above.', additional_kwargs={}, response_metadata={'token_usage': {'completion_tokens': 90, 'prompt_tokens': 164, 'total_tokens': 254, 'completion_time': 0.142282938, 'prompt_time': 0.012063255, 'queue_time': 0.001169873, 'total_time': 0.154346193}, 'model_name': 'mixtral-8x7b-32768', 'system_fingerprint': 'fp_c5f20b5bb1', 'finish_reason': 'stop', 'logprobs': None}, id='run-945aa1a4-4459-4fce-91fb-94544d21d5b7-0', usage_metadata={'input_tokens': 164, 'output_tokens': 90, 'total_tokens': 254})
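
To show how the first two steps of the chain fit together, here is a plain-Python sketch with no LangChain dependency. `fake_retriever` and `build_prompt` are hypothetical stand-ins: the first returns canned text instead of searching the vector store, and the second mimics the parallel step followed by the prompt template. The model call itself is left out.

```python
TEMPLATE = """
Answer the following question
{question}

To answer the question, use the following context:
{context}

If you don't find the answer, just say "I don't know"
"""


def fake_retriever(question):
    # Hypothetical stand-in for the retriever: returns canned text.
    return ["Your Total Daily Energy Expenditure (TDEE) is the number of "
            "calories that your body burns in one day"]


def build_prompt(question, retriever):
    # Step 1 (parallel step): the same input feeds both branches.
    inputs = {"context": retriever(question), "question": question}
    # Step 2 (prompt template): fill the placeholders with the branch results.
    return TEMPLATE.format(
        question=inputs["question"],
        context="\n".join(inputs["context"]),
    )


prompt = build_prompt("what is TDEE?", fake_retriever)
print(prompt)
```

The question is passed through unchanged while the retriever turns it into context, and both end up in the same prompt text that is sent to the model.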

## Production-level RAG with memory and conversation history

In [4]:
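from langchain_community.document_loaders import PyPDFLoader

import os


def get_all_data_in_pages(folder_path: str):
    """Load every PDF in folder_path and return all pages as one list of Documents."""
    all_documents = []
    for file_name in os.listdir(folder_path):
        # Skip anything that is not a PDF so PyPDFLoader does not fail on other files.
        if not file_name.lower().endswith(".pdf"):
            continue
        path = os.path.join(folder_path, file_name)
        pages = PyPDFLoader(path).load()  # one Document per page
        all_documents.extend(pages)

    return all_documents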


all_documents = get_all_data_in_pages("./Data")

for i in all_documents: 
    i.page_content = " ".join( i.page_content.split() ) 

Ignoring wrong pointing object 13 0 (offset 0)


In [9]:
from langchain_text_splitters import CharacterTextSplitter 


splitter = CharacterTextSplitter(
    separator=".", 
    chunk_size=500, 
    chunk_overlap=30, 
)

splitted_documents = splitter.split_documents(all_documents)

In [10]:
splitted_documents

[Document(metadata={'source': './Data/Macronutrient+Food+List.pdf', 'page': 0}, page_content='Macronutrient Food List 1. Healthy Protein Sources 2. Healthy Carbohydrate Sources 3. Healthy Fat Sources'),
 Document(metadata={'source': './Data/Macronutrient+Food+List.pdf', 'page': 1}, page_content='Healthy Protein Sources Protein is the most important macronutrient in your diet. I suggest you build your main meals around a big portion of one of the sources below, as they are high in protein and (usually) low in calories. Good Protein Sources: - Eggs - Lean Meats - Chicken Breast - Beans & Legumes - Fish/Sea Food - Soy'),
 Document(metadata={'source': './Data/Macronutrient+Food+List.pdf', 'page': 2}, page_content='Healthy Carbohydrate Sources Carbs have a pretty bad rep in the fitness world - mostly undeserved. They are the body’s preferred source of energy and will give you more strength in the gym. For health reasons, make sure to stick to mostly healthy sources (see below). 10% - 20% of

In [None]:
import os

from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI

# Read the API key from the environment instead of hard-coding it.
google_api_key = os.environ["GOOGLE_API_KEY"]

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001", google_api_key=google_api_key, task_type="retrieval_query")
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", google_api_key=google_api_key)



AIMessage(content="I'm doing well, thank you for asking!  As a large language model, I don't experience emotions or feelings like humans do, but I'm functioning optimally and ready to assist you. How can I help you today?\n", additional_kwargs={}, response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': []}, id='run-cea9d51e-603f-4e00-9929-71dca833bb96-0', usage_metadata={'input_tokens': 5, 'output_tokens': 50, 'total_tokens': 55, 'input_token_details': {'cache_read': 0}})

In [16]:
from langchain_community.vectorstores import FAISS
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore

index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))

vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(), 
    index_to_docstore_id={}
) 

from uuid import uuid4 

uuids = [str(uuid4()) for _ in range(len(splitted_documents))]

vector_store.add_documents(documents=splitted_documents, ids=uuids)

['76b33373-9b5d-4abc-b6f7-a889d317cd82',
 '9d626946-8518-466d-87cb-7c7fdacca251',
 '1a4ea1a1-87b3-4307-a38e-8a5f9f3b0dfb',
 '67ce32e6-8e23-4be8-906a-8657b77548ec',
 'de7f1505-763d-46f9-ba96-825118cd3199',
 '17d77f2c-daef-4006-bf1c-49ffc6d2a2c9',
 'aa9b8801-7916-4038-88a0-0409cf1be7b8',
 '9c5abccc-29b2-4a75-ac36-a305bfece6a8',
 '84818fd5-5ad9-4f65-8540-ffde9b5f1f84',
 'd0a92d7c-09cf-4f01-9a6d-be860905b0ca',
 'e924ef2a-adcd-436c-b890-7e5e2641cf1c',
 'e1c91559-ada0-4eb6-b4f1-3ffbd43c4622']

In [26]:
vb = vector_store.as_retriever(search_type="mmr", search_kwargs={"k": 1, "lambda_mult": 0.7})  # k must be an integer

In [72]:
#1st component 
from langchain.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.output_parsers import StructuredOutputParser
from langchain.memory import ConversationSummaryMemory

TEMPLATE = """
Answer the following question 
{question}

To answer the question, use the following context: 
{context} 

If you don't find the answer, just say "I don't know" 
"""

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder


TEMPLATE_FOR_HISTORY = """
Given the chat history and the latest user question,
which might reference context in the chat history, formulate a standalone question
that can be understood without the chat history. Do NOT answer the question;
just reformulate it if needed. If it is already a standalone question,
return it as is.

If it is a follow-up question, mention details such as the agreement name, ID, and entities in the question.
"""

question_maker_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", TEMPLATE_FOR_HISTORY),
        MessagesPlaceholder(variable_name="history"),
        ("human", "{question}"),
    ]
)

question_maker_chain = question_maker_prompt | llm 

In [99]:
all_history = []

In [73]:
out = """Section 1: Counterparts The document includes the names and designations of the authorized signatories from UPL Limited. It specifies the date of authorization for each signatory. Section 2: Parties Defines the parties involved, namely UPL Limited and Mr. Murali Krishnan, as the service provider. Provides the registered office addresses of UPL Limited and the personal details of Mr. Murali Krishnan. Section 3: Status Indicates that the agreement has been signed. Section 4: Fees Outlines the payment terms and service fees to be paid by UPL Limited to the service provider. Includes details on the payment process and deductions that may be applicable. Section 5: Agreed Terms Defines various terms related to the engagement, business opportunities, confidential information, intellectual property rights, and services. Specifies the commencement date and termination date. Section 6: Notices Details the requirements for serving notices under the agreement, including the methods of delivery and deemed receipt. Section 7: Termination Outlines the conditions under which UPL Limited may terminate the agreement with the service provider. Specifies the rights of UPL Limited in case of termination. Section 8: Obligations on Termination Describes the obligations of the service provider upon termination, including the return of company property and confidential information. Section 9: Entire Agreement Acknowledges that the agreement constitutes the entire understanding between the service provider and UPL Limited, superseding any previous arrangements. Section 10: Definitions and Interpretations Reiterates the definitions and rules of interpretation within the agreement. Section 11: Governing Law and Jurisdiction Specifies that Indian law governs the agreement and designates the courts of Mumbai for exclusive jurisdiction. 
Section 12: Other Activities Addresses the service provider's engagement in other business activities during the agreement, subject to certain conditions. Section 13: Confidential Information Reiterates the definition of confidential information and intellectual property rights within the context of the engagement. Section 14: Variation States that any variation of the agreement must be in writing and signed by each party. Section 15: Duties Outlines the responsibilities of the service provider, including providing services with due care, skill, and ability, and promptly providing necessary information to UPL Limited. Section 16: Service Provider’s Representations and Warranties Specifies the undertakings, representations, and warranties of the service provider, including confidentiality and professional skills. Section 17: Expenses Details the reimbursement of expenses incurred by the service provider, subject to pre-approval and compliance with travel policies. This response summarizes each section in detail, ensuring no section or clause is missed, and maintains the same order of sections as presented in the conversation history. All key details from the relevant data provided have been included."""

In [78]:

from langchain_core.messages import HumanMessage, AIMessage

question_maker_chain.invoke({"question": "mention all nda of upl", "history": [HumanMessage(content="compare clm 12112?"), AIMessage(content=out)]})


AIMessage(content='List all Non-Disclosure Agreements (NDAs) associated with UPL Limited.\n', additional_kwargs={}, response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'safety_ratings': []}, id='run-4a8837d4-4452-4a2c-9863-b10d962425f3-0', usage_metadata={'input_tokens': 661, 'output_tokens': 17, 'total_tokens': 678, 'input_token_details': {'cache_read': 0}})

In [107]:
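from langchain_core.runnables import chain

# Build the RAG chain once instead of on every call.
rag_chain = (
    RunnableParallel({"context": vb, "question": RunnablePassthrough()})
    | PromptTemplate.from_template(TEMPLATE)
    | llm
)


@chain
def question_answer(question):
    # Pass only the last 6 messages (3 turns) as history to bound the prompt size.
    history = all_history[-6:]

    # Rewrite the question so it can be understood without the chat history.
    refined_question = question_maker_chain.invoke(
        {"question": question, "history": history}
    ).content
    print(f"refined question: {refined_question}")

    output = rag_chain.invoke(refined_question).content

    # Record the turn only after the answer succeeds, so a failed call
    # does not leave an unanswered question in the history.
    all_history.append(HumanMessage(content=refined_question))
    all_history.append(AIMessage(content=output))

    return output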
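
The `all_history[-6:]` slice acts as a sliding window over the conversation. The standalone sketch below shows which messages survive after several turns. `recent_history` is a hypothetical helper, and plain tuples stand in for the message objects.

```python
def recent_history(history, max_messages=6):
    """Return at most the last `max_messages` entries (6 messages = 3 turns)."""
    return history[-max_messages:]


history = []
for turn in range(1, 6):
    history.append(("human", f"question {turn}"))
    history.append(("ai", f"answer {turn}"))

window = recent_history(history)
for role, text in window:
    print(role, text)
```

After five turns only turns 3 to 5 are passed on, so a follow-up question that refers to something from turn 1 can no longer be resolved from the window.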



In [129]:
question = "how to calculate TDEE?"

In [130]:
print(question_answer.invoke(question))

refined question: How do you calculate Total Daily Energy Expenditure (TDEE), including how to determine the calories added based on activity level?



Retrying langchain_google_genai.chat_models._chat_with_retry.<locals>._chat_with_retry in 2.0 seconds as it raised ResourceExhausted: 429 Resource has been exhausted (e.g. check quota)..


ResourceExhausted: 429 Resource has been exhausted (e.g. check quota).
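
A 429 error like the one above is usually a rate or quota limit. One common way to cope with rate limits is to retry with exponential backoff. The sketch below is a generic illustration under that assumption. `call_with_backoff` and `flaky` are hypothetical, the failure is simulated, and a fake `sleep` is injected so the example runs instantly. In real code it would be safer to catch only the specific quota exception rather than `Exception`.

```python
import time


def call_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky():
    # Simulated call that fails twice before succeeding.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("429 quota exhausted (simulated)")
    return "ok"


waits = []  # record the requested delays instead of actually sleeping
result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

Backoff helps with short-lived rate limits; if the quota itself is exhausted, retrying will keep failing until the quota resets.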

In [131]:
all_history

[HumanMessage(content='What is TDEE?\n', additional_kwargs={}, response_metadata={}),
 AIMessage(content='TDEE stands for Total Daily Energy Expenditure, which is the number of calories your body burns in a day.\n', additional_kwargs={}, response_metadata={}),
 HumanMessage(content='How do I calculate or estimate my Total Daily Energy Expenditure (TDEE)?\n', additional_kwargs={}, response_metadata={}),
 AIMessage(content='Your Total Daily Energy Expenditure (TDEE) is calculated by first estimating your Basal Metabolic Rate (BMR), which represents the calories burned while resting.  Then, you add calories to your BMR based on your level of physical activity.  The document suggests using an online TDEE calculator linked in the document.\n', additional_kwargs={}, response_metadata={}),
 HumanMessage(content='How do I estimate my Basal Metabolic Rate (BMR)?\n', additional_kwargs={}, response_metadata={}),
 AIMessage(content="The provided text explains how to calculate Lean Body Mass and ho

'The macronutrients listed are protein, carbohydrates, and fat.\n'