In [22]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.vectorstores import Chroma
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from dotenv import load_dotenv
import os

load_dotenv()  # read GOOGLE_API_KEY and OPENAI_API_KEY from a local .env file, if present

In [6]:
model = ChatGoogleGenerativeAI(model="gemini-1.5-flash")

In [9]:
# Load the PDF; PyPDFLoader returns one Document per page
file_path = "./data/5pages.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()
print(len(docs))

4


In [10]:
print(docs[0])

page_content='Page 1 of 4 PDF Files 
Scan – Create – Reduce File Size  
 
 
It is recommended that you purchase an Adobe Acrobat product that 
allows you to read, create and manipulate PDF documents.  Go to http://www.adobe.com/products/acrobat/matrix.html
 to compare 
Adobe products and features –Adobe  Acrobat Standard is sufficient. 
 
 
Scanning Documents 
 
You should only have to scan docu ments that are not electronic, and 
when you are unable to create a PDF using PDFMaker or the Print 
Command from the applicat ion you are using.   
 
Signature Pages If you have a document such as a CV that requires a signature on a page only print the page that re quires the signature –printing the 
entire document and scanning it is not
 necessary or desired.  Once you 
sign and scan the signature page you can combine it with the original 
document using the Create PDF From Multiple Files feature. 
 Scanner Settings Before scanning documents rememb er to make certain that the 
following sett

In [21]:
print(docs[0].page_content[0:100]+"\n\n")
print("The Metadata of the above is ")
print(docs[0].metadata)

Page 1 of 4 PDF Files 
Scan – Create – Reduce File Size  
 
 
It is recommended that you purchase an


The Metadata of the above is 
{'source': './data/5pages.pdf', 'page': 0}


In [35]:
# Split pages into chunks of at most 800 characters, with up to 200 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=200)
chunks = text_splitter.split_documents(docs)
print(len(chunks))

7


In [36]:
print(chunks[0])
print("---------------------\n\n\n")
print(chunks[1])

page_content='Page 1 of 4 PDF Files 
Scan – Create – Reduce File Size  
 
 
It is recommended that you purchase an Adobe Acrobat product that 
allows you to read, create and manipulate PDF documents.  Go to http://www.adobe.com/products/acrobat/matrix.html
 to compare 
Adobe products and features –Adobe  Acrobat Standard is sufficient. 
 
 
Scanning Documents 
 
You should only have to scan docu ments that are not electronic, and 
when you are unable to create a PDF using PDFMaker or the Print 
Command from the applicat ion you are using.   
 
Signature Pages If you have a document such as a CV that requires a signature on a page only print the page that re quires the signature –printing the 
entire document and scanning it is not
 necessary or desired.  Once you' metadata={'source': './data/5pages.pdf', 'page': 0}
---------------------



page_content='entire document and scanning it is not
 necessary or desired.  Once you 
sign and scan the signature page you can combine it with the 

In [40]:
# Build the vector store and persist it to disk (run this cell only the first time)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(),
    persist_directory="./my_chroma_db",
)
retriever = vectorstore.as_retriever()

In [42]:
# Load the already-built database from disk
vectorstore = Chroma(persist_directory="./my_chroma_db", embedding_function=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()
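
The two cells above are alternatives: the first builds and persists the index, the second loads it from disk. Re-running the build cell against the same `persist_directory` likely adds the same chunks again instead of replacing them, which would show up as duplicate entries in the retrieved context. A minimal sketch of a guard follows; `index_exists` is a hypothetical helper, not part of LangChain or Chroma, and it only checks that the directory is non-empty, not that it holds a valid index.

```python
import os


def index_exists(persist_dir):
    """Return True if persist_dir exists and contains at least one entry."""
    return os.path.isdir(persist_dir) and len(os.listdir(persist_dir)) > 0


persist_dir = "./my_chroma_db"  # same directory as in the cells above

if index_exists(persist_dir):
    # Assumption: a non-empty directory means the index was already built
    print("Index found: load it with Chroma(persist_directory=...)")
else:
    print("No index yet: build it with Chroma.from_documents(...)")
```

With a check like this, only one of the two cells needs to run in any given session.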


In [43]:
system_prompt=(
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)


In [44]:
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

In [45]:
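# Chain that inserts the retrieved documents into {context} and calls the model
question_answer_chain = create_stuff_documents_chain(model, prompt)
# Chain that runs the retriever first, then the question-answering chain
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
results = rag_chain.invoke({"input": "write summary of this document"})
results["answer"]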

'This document provides instructions on creating PDF documents, specifically focusing on two methods: using Adobe Acrobat Standard or higher, and using an online converter called PS2PSF. The document also includes instructions for scanning documents and combining them into a single PDF file. \n'

In [46]:
results

{'input': 'write summary of this document',
 'context': [Document(metadata={'page': 2, 'source': './data/5pages.pdf'}, page_content='Page 3 of 4 Creating PDF Documents (continued) \n \nOption 2: If you do not have  Acrobat Standard or higher \ninstalled use PS2PSF.*   \n    \n 1. Open the file in its authoring app lication, and choose File > Print. \n2. Select “Print to File” and save. \n3. Open your browser and go to http://ps2pdf.com/convert.htm\n \n4. Click “browse” select the file you created in step 2 (.prn or .ps), \nclick “convert” \n5. Download the newly created PDF file. \n*Note: Some formatting changes ma y occur once converted (bullets \nmay turn to symbols and color may become black and white).'),
  Document(metadata={'page': 2, 'source': './data/5pages.pdf'}, page_content='Page 3 of 4 Creating PDF Documents (continued) \n \nOption 2: If you do not have  Acrobat Standard or higher \ninstalled use PS2PSF.*   \n    \n 1. Open the file in its authoring app lication, and choose

### Process Flow:

Retriever Finds the Relevant Documents:

The retriever searches the vector store and returns only the chunks most relevant to the user's query.

Create Stuff Documents Chain Inserts the Documents into the Prompt:

After the retriever returns the relevant chunks, the chain built by create_stuff_documents_chain concatenates their text and inserts ("stuffs") it into the {context} placeholder of the prompt.
This is plain text insertion, not vector embedding; the embedding step happened earlier, when the vector store was built.

LLM Answers the Query:

The prompt, which now contains the relevant chunks, is passed to the LLM, which generates an answer to the user's query.

In short, create_retrieval_chain runs the retriever first and then hands the retrieved chunks and the question to the stuff-documents chain, which builds the final prompt for the LLM.

This keeps the LLM focused on the relevant parts of the document, which keeps the prompt short and tends to improve the accuracy of the answer.
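
The flow above can be sketched without any LangChain dependency. This is a minimal illustration, not LangChain's implementation: `ToyDoc`, `toy_retrieve`, and `stuff_documents` are hypothetical names, and the word-overlap score is a stand-in for the embedding similarity search that a real vector store performs.

```python
from dataclasses import dataclass


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


def toy_retrieve(query, docs, k=2):
    """Rank docs by words shared with the query (stand-in for vector similarity)."""
    query_words = set(query.lower().split())

    def score(doc):
        return len(query_words & set(doc.page_content.lower().split()))

    return sorted(docs, key=score, reverse=True)[:k]


def stuff_documents(docs, template, separator="\n\n"):
    """Join the retrieved texts and insert them at {context} in the template."""
    context = separator.join(d.page_content for d in docs)
    return template.format(context=context)


docs = [
    ToyDoc("Scanner settings for scanning documents"),
    ToyDoc("Combine the signature page with the original document"),
    ToyDoc("Download the newly created PDF file"),
]
template = "Answer using only this context:\n\n{context}"

# Step 1: retrieve the best-matching chunk; Step 2: stuff it into the prompt
top = toy_retrieve("how do I scan documents", docs, k=1)
prompt_text = stuff_documents(top, template)
print(prompt_text)  # Step 3 would send this text to the LLM
```

The design point is that the LLM never sees the whole document collection, only the few chunks the retriever selected, pasted into the prompt as ordinary text.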