# QA over PDF file

## Intro
* We will create a Q&A app that can answer questions about PDF files.
* We will use a Document Loader to load text in a format usable by an LLM, then build a retrieval-augmented generation (RAG) pipeline to answer questions, including citations from the source material.
* **We will use a basic approach for this project. You will see more advanced ways to solve the same problem in later projects**.

In [None]:
#!pip install python-dotenv

In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

## Install LangChain

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [None]:
#!pip install langchain

## Connect with an LLM

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [1]:
#!pip install langchain-openai
!pip install -U langchain-community langchain langchain-huggingface huggingface_hub


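* NOTE: We use OpenAI by default because, at the time of writing, we consider its models the strongest LLMs on the market. The `ChatOpenAI` cell below was not executed in this run (`In [None]`); the executed cell after it (`In [2]`) connects to the open-source model Mistral through a Hugging Face endpoint instead. A later lesson shows how to connect to other open-source LLMs such as Llama 3 or Mistral.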

In [None]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo-0125")

In [2]:
from google.colab import userdata
from huggingface_hub import login
from langchain_huggingface import HuggingFaceEndpoint

# Read the Hugging Face token from Colab's secret store and log in.
HUGGINGFACEHUB_API_TOKEN = userdata.get('HUGGING_FACE_API_KEY')
login(token=HUGGINGFACEHUB_API_TOKEN)

repo_id = "mistralai/Mistral-7B-Instruct-v0.2"

# max_new_tokens caps the length of the generated text. max_length is not a
# declared field of HuggingFaceEndpoint, so passing it moves the value into
# model_kwargs with a warning.
llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    max_new_tokens=128,
    temperature=0.7,
    huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
)


## Load the PDF file
* The loader reads the PDF at the specified path into memory.
* It then extracts text data using the pypdf package.
* Finally, it creates a LangChain Document for each page of the PDF with the page's content and some metadata about where in the document the text came from.

If you are using the pre-loaded poetry shell, you do not need to install the following packages because they are already pre-loaded for you:

In [3]:
#!pip install langchain-community

In [5]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-5.1.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.1.0-py3-none-any.whl (297 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.1.0


In [6]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "/content/Be_Good.pdf"

loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

11


In [7]:
print(docs[0].page_content[0:100])
print(docs[0].metadata)

Be Good - Essay by Paul Graham
Be Good
Be good
April 2008(This essay is derived from a talk at the 2
{'source': '/content/Be_Good.pdf', 'page': 0}


## RAG
* We will use Chroma DB as the vector database (also called a vector store).
* Using a text splitter, we will split the loaded PDF into smaller documents that can more easily fit into an LLM's context window, then load them into a vector store.
* We can then create a retriever from the vector store for use in our RAG chain:

If you are using the pre-loaded poetry shell, you do not need to install the following package because it is already pre-loaded for you:

In [8]:
!pip install langchain_chroma

Collecting langchain_chroma
  Downloading langchain_chroma-0.2.0-py3-none-any.whl.metadata (1.7 kB)
Collecting chromadb!=0.5.10,!=0.5.11,!=0.5.12,!=0.5.4,!=0.5.5,!=0.5.7,!=0.5.9,<0.6.0,>=0.4.0 (from langchain_chroma)
  Downloading chromadb-0.5.23-py3-none-any.whl.metadata (6.8 kB)
Collecting fastapi<1,>=0.95.2 (from langchain_chroma)
  Downloading fastapi-0.115.6-py3-none-any.whl.metadata (27 kB)
Collecting build>=1.0.3 (from chromadb!=0.5.10,!=0.5.11,!=0.5.12,!=0.5.4,!=0.5.5,!=0.5.7,!=0.5.9,<0.6.0,>=0.4.0->langchain_chroma)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb!=0.5.10,!=0.5.11,!=0.5.12,!=0.5.4,!=0.5.5,!=0.5.7,!=0.5.9,<0.6.0,>=0.4.0->langchain_chroma)
  Downloading chroma_hnswlib-0.7.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb!=0.5.10,!=0.5.11,!=0.5.12,!=0.5.4,!=0.5.5,!=0.5.7,!=0.5.9,<0.6.0,>=0.4.0->lang

In [10]:
from langchain_chroma import Chroma
# from langchain_openai import OpenAIEmbeddings
from langchain_huggingface import HuggingFaceEndpointEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

splits = text_splitter.split_documents(docs)

model = "sentence-transformers/all-mpnet-base-v2"
embedding = HuggingFaceEndpointEmbeddings(
    model=model,
    task="feature-extraction",
    huggingfacehub_api_token= HUGGINGFACEHUB_API_TOKEN,
)

vectorstore = Chroma.from_documents(documents=splits, embedding=embedding)

retriever = vectorstore.as_retriever()
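
To make `chunk_size` and `chunk_overlap` more concrete, here is a minimal sketch of fixed-width splitting with overlap. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and lines, which this sketch ignores, and `split_with_overlap` is a hypothetical helper that is not part of LangChain.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-width chunks; each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Toy input: 2500 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.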

#### We will use two pre-defined chains to construct the final rag_chain:
* create_stuff_documents_chain
* create_retrieval_chain

Let's learn a little more about these two pre-defined chains.

#### create_stuff_documents_chain
The create_stuff_documents_chain takes a list of documents and formats them all into a prompt, then passes that prompt to an LLM. It passes ALL documents, so you should make sure they fit within the context window of the LLM you are using.
1. **Taking a List of Documents**: This function starts by receiving a group of documents that you provide.

2. **Formatting into a Prompt**: It then takes all these documents and organizes them into a specific prompt. A prompt is the text that is fed to a large language model (LLM) as its input.

3. **Passing to an LLM**: After formatting the documents into a prompt, this function sends the formatted prompt to a language model. The model will process this information to perform tasks like answering questions, generating text, etc.

4. **Fit within Context Window**: The function sends all the documents at once to the LLM. However, it's important to make sure that the total length of the prompt does not exceed what the LLM can handle at one time. This limit is known as the "context window" of the LLM. If the prompt is too long, the request may fail or the input may be truncated, so the model cannot use all of the documents.

In simpler terms, think of this chain as a way of taking several pieces of text, bundling them together in a specific way, and then feeding them to an LLM that reads and uses this bundled text to do its job. Just make sure the bundle isn’t too big for the LLM to handle at once!
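
The "stuffing" idea described above can be sketched in plain Python. This is an illustration under stated assumptions, not LangChain's implementation: `stuff_documents` and `max_chars` are hypothetical names, and a character count is used as a crude stand-in for the token count that a real context window is measured in.

```python
def stuff_documents(docs: list[str], question: str, max_chars: int = 4000) -> str:
    """Bundle every document into one prompt and refuse to exceed the budget."""
    context = "\n\n".join(docs)  # ALL documents go in, none are dropped
    prompt = (
        "Use the following pieces of retrieved context to answer the question.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    if len(prompt) > max_chars:
        raise ValueError(f"Prompt has {len(prompt)} characters, budget is {max_chars}")
    return prompt


# Toy documents standing in for retrieved chunks.
pages = ["Be good.", "Make something people want."]
prompt = stuff_documents(pages, "What is this essay about?")
print(prompt)
```

Raising an error on an oversized prompt is one possible design choice; another would be to drop the lowest-ranked documents until the prompt fits.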


#### create_retrieval_chain
The create_retrieval_chain takes in a user inquiry, which is then passed to the retriever to fetch relevant documents. Those documents (and original inputs) are then passed to an LLM to generate a response.
1. **Receiving a User Inquiry**: This process begins when a user asks a question or makes a request.

2. **Using a Retriever to Fetch Documents**: The function then uses a retriever to find documents that are relevant to the user's inquiry. This means it searches through available information to pick out parts that can help answer the question.

3. **Passing Information to an LLM**: After gathering the relevant documents, both these documents and the original user inquiry are sent to an LLM.

4. **Generating a Response**: The LLM processes all the information it receives to come up with an appropriate response, which is then given back to the user.

In simpler terms, this chain acts like a smart assistant that first looks up information based on your question, gathers useful details, and then uses those details along with your original question to craft a helpful answer.
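
The four steps above can be sketched without any LLM or vector store. Everything here is a stand-in chosen for illustration: `make_retrieval_chain`, `keyword_retriever`, and `stub_llm` are hypothetical names, the retriever ranks chunks by shared words instead of embeddings, and the "LLM" is a stub function. Only the shape of the result (the `input`, `context`, and `answer` keys) mirrors what the chain returns in this notebook.

```python
from typing import Callable


def make_retrieval_chain(
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> Callable[[dict], dict]:
    """Wire a retriever and a generator together: retrieve first, then answer."""
    def invoke(inputs: dict) -> dict:
        question = inputs["input"]              # 1. receive the user inquiry
        context = retrieve(question)            # 2. fetch relevant documents
        answer = generate(question, context)    # 3-4. pass both on and generate a response
        return {"input": question, "context": context, "answer": answer}
    return invoke


# Toy corpus standing in for the chunks held in the vector store.
CHUNKS = [
    "startups should make something people want",
    "a vector store indexes text embeddings",
    "the loader creates one document per page",
]


def keyword_retriever(query: str, k: int = 2) -> list[str]:
    """Rank chunks by how many words they share with the query (not embeddings)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        CHUNKS,
        key=lambda chunk: len(query_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def stub_llm(question: str, context: list[str]) -> str:
    """Placeholder for a real model call."""
    return f"(stub answer based on {len(context)} retrieved chunks)"


chain = make_retrieval_chain(keyword_retriever, stub_llm)
result = chain({"input": "what should startups make"})
print(result["answer"])
print(result["context"])
```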

In [11]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(llm, prompt)

rag_chain = create_retrieval_chain(retriever, question_answer_chain)

results = rag_chain.invoke({"input": "What is this article about?"})

results["answer"]

' How does Microsoft fit into it?'

* If you print the whole `results` you will see that **you get both the answer, and the context the LLM used to generate that answer**. See it below:

In [12]:
results

{'input': 'What is this article about?',
 'context': [Document(id='050f0f75-5e9e-420c-9c01-eba5a7907d18', metadata={'page': 10, 'source': '/content/Be_Good.pdf'}, page_content="Be Good - Essay by Paul Graham\nGoogle does.Most explicitly benevolent projects don't hold themselves sufficiently\naccountable.  They act as if having good intentions were enough to\nguarantee good effects.[3] Users dislike their\nnew operating system so much that they're starting petitions to\nsave the old one.  And the old one was nothing special.  The hackers\nwithin Microsoft must know in their hearts that if the company\nreally cared about users they'd just advise them to switch to OSX.Thanks to Trevor Blackwell, Paul\nBuchheit, Jessica Livingston,\nand Robert Morris for reading drafts of this.\nPage 11"),
  Document(id='086d7cc1-d73d-4837-b662-f415fc643f11', metadata={'page': 10, 'source': '/content/Be_Good.pdf'}, page_content="Be Good - Essay by Paul Graham\nGoogle does.Most explicitly benevolent project


* Looking more closely at the values under `context`, you can see that each one is a document containing a chunk of the ingested page content. These documents also preserve the original **metadata** from when you first loaded them:

In [14]:
print(results["context"][0].metadata)

{'page': 10, 'source': '/content/Be_Good.pdf'}


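* This particular chunk came from page 10 in the original PDF. The loader numbers pages from 0 (the first page shown earlier has `'page': 0`), so this is the eleventh and last page, which matches the "Page 11" footer in the chunk text. You can use this data to show which page in the PDF the answer came from, allowing users to quickly verify that answers are based on the source material.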
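
As a minimal sketch of how that metadata could be turned into citations, the snippet below builds a list of human-readable sources. `Doc` is a small stand-in for LangChain's `Document` so that the example runs on its own, and `format_sources` is a hypothetical helper. It assumes the `page` value is zero-based, as in the output above, and adds 1 for display.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_sources(context: list[Doc]) -> list[str]:
    """Return one 'source, page N' line per distinct (source, page) pair."""
    seen = set()
    lines = []
    for doc in context:
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page")
        if (source, page) in seen:
            continue  # several chunks can come from the same page
        seen.add((source, page))
        if page is None:
            lines.append(source)
        else:
            lines.append(f"{source}, page {page + 1}")  # zero-based to one-based
    return lines


# Metadata copied from the output above; the page text is abbreviated.
context = [
    Doc("Be Good - Essay by Paul Graham ...", {"page": 10, "source": "/content/Be_Good.pdf"}),
    Doc("Be Good - Essay by Paul Graham ...", {"page": 10, "source": "/content/Be_Good.pdf"}),
]
print(format_sources(context))  # ['/content/Be_Good.pdf, page 11']
```

With the real chain, the same helper could be called on `results["context"]`, since those documents expose the same `metadata` attribute.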

## How to execute the code from Visual Studio Code
* In Visual Studio Code, see the file 001-qa-from-pdf.py
* In the terminal, make sure you are in the directory of the file and run:
    * python 001-qa-from-pdf.py