# PDF QA PoC

## Introduction

This is a proof of concept (PoC) for a PDF question-answering (QA) tool. It uses LangChain to load a PDF file and extract its text, then uses a local LLM to generate answers to given questions.

## LangChain

This PoC is based on the LangChain local RAG tutorial: https://python.langchain.com/v0.2/docs/tutorials/local_rag/

In [11]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import GPT4AllEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma

In [12]:
# Load a PDF document
loader = PyPDFLoader("data/sample_en.pdf")
# Create a splitter with some overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
# Load the pdf and split it into chunks
documents = loader.load_and_split(splitter)

print(documents[0])

page_content='Stock prices quickly incorporate information from earnings
announcements, making it difficult to beat the market by
trading on these events. A replication of Martineau (2022).
Efficient-market hypothesis
The efficient-market hypothesis (EMH)[a] is
a hypothesis in financial econom ics that states
that asset prices reflect all available
information. A direct implication is that it is
impossible to "beat the market" consistently on
a risk-adjusted basis since market prices should' metadata={'source': 'data/sample_en.pdf', 'page': 0}
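
To make the `chunk_size` and `chunk_overlap` parameters more concrete, here is a minimal sketch of fixed-size splitting with overlap. It is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break on separators such as paragraph and line breaks, so its chunks rarely have exactly this shape. The function name `split_with_overlap` and the sample string are made up for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Hypothetical sample input, small enough to inspect by eye
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one. In the PoC, the overlap plays the same role: a sentence that straddles a chunk boundary is still likely to appear whole in at least one chunk.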


Next, we need an embedding model to create embeddings for the text chunks. The available GPT4All embedding models are listed here: https://docs.gpt4all.io/gpt4all_python/home.html#embeddings

In [13]:
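# Embed each chunk with a GPT4All embedding model and index the vectors in Chroma
model_name = "nomic-embed-text-v1.5.f16.gguf"
# Use a real boolean; download the model file if it is not available locally
gpt4all_kwargs = {"allow_download": True}
embeddings = GPT4AllEmbeddings(model_name=model_name, gpt4all_kwargs=gpt4all_kwargs)
vector_store = Chroma.from_documents(documents=documents, embedding=embeddings)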

In [14]:
# Ask a question to find similar text
question = "Is it possible to know how the market will evolve in the future?"
docs = vector_store.similarity_search(question)
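
Conceptually, `similarity_search` embeds the question and returns the chunks whose vectors lie closest to it. The sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. These are assumptions for illustration only: real embeddings have hundreds of dimensions, the distance metric Chroma actually applies depends on how the collection is configured, and the names `cosine_similarity`, `top_k`, and `toy_index` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the texts of the k entries most similar to the query vector."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy "vector store": (chunk text, made-up embedding) pairs
toy_index = [
    ("chunk about market efficiency", [0.9, 0.1, 0.0]),
    ("chunk about earnings announcements", [0.6, 0.4, 0.1]),
    ("chunk about something unrelated", [0.0, 0.1, 0.9]),
]
query_vector = [1.0, 0.2, 0.0]  # stands in for the embedded question

best = top_k(query_vector, toy_index, k=2)
print(best)
# ['chunk about market efficiency', 'chunk about earnings announcements']
```

In the PoC, the number of chunks returned can be controlled with the `k` argument of `similarity_search`.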

Next, we need an LLM to generate answers to the questions. The available GPT4All models are listed here: https://docs.gpt4all.io/gpt4all_python/home.html#load-llm

In [15]:
from langchain_community.llms import GPT4All

llm = GPT4All(model="Meta-Llama-3-8B-Instruct.Q4_0.gguf", max_tokens=2048, allow_download=True)


In [16]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# Create a prompt
prompt = PromptTemplate.from_template("Summarize the following text in 3 sentences: {docs}")

# Next we define the chain using a utility function
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = { "docs": format_docs } | prompt | llm | StrOutputParser()

# Run the chain on the docs retrieved above (the question itself is not part of the prompt)
chain.invoke(docs)

'\n\nThe text does not necessarily imply that stock prices are unpredictable because investors may adjust their behavior based on new information, such as a financial crisis. This could lead to changes in market price without being entirely random or unpredictable. Additionally, the efficient markets theory suggests that market prices reflect all available information and thus any potential biases or inefficiencies would be quickly corrected by market forces.\nFinal Answer: The final answer is Investors may adjust their behavior based on new information, which can affect market prices; the efficient markets theory suggests that market prices reflect all available information. I hope it is correct.'
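
Note that the chain above summarizes the retrieved chunks and never shows `question` to the model. To answer the question itself, the prompt needs both the retrieved context and the question. The following is a minimal sketch of such a prompt using plain string formatting. The template wording and the names `RAG_TEMPLATE` and `build_rag_prompt` are assumptions for illustration and are not taken from the tutorial; the example chunks are shortened stand-ins for real retrieved text.

```python
# Hypothetical prompt template combining retrieved context with the question
RAG_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context is not sufficient, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_rag_prompt(question: str, chunks: list[str]) -> str:
    """Join the retrieved chunks and fill them into the template with the question."""
    context = "\n\n".join(chunks)
    return RAG_TEMPLATE.format(context=context, question=question)


example_prompt = build_rag_prompt(
    "Is it possible to know how the market will evolve in the future?",
    [
        "Asset prices reflect all available information.",
        "It is impossible to beat the market consistently on a risk-adjusted basis.",
    ],
)
print(example_prompt)
```

With the objects from this notebook, the same idea would be roughly `llm.invoke(build_rag_prompt(question, [d.page_content for d in docs]))`, or a `PromptTemplate` with both `{context}` and `{question}` variables placed in the chain.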