![](2023-07-24-10-52-10.png)

# Building a Simple QA System for Chatting with a PDF

This part of the training is mostly hands-on: we walk through the code for building a PDF question-answering (QA) system with LangChain.

In [1]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ChatVectorDBChain
from langchain import PromptTemplate

In [2]:
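# Split text into chunks of at most 1500 characters, with up to 200 characters shared between neighboring chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)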
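
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of overlapping chunking using a plain sliding window over characters. This is a simplification, not how `RecursiveCharacterTextSplitter` works internally (it prefers to break on separators such as paragraphs and sentences); the function name `sliding_window_chunks` and the sample string are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap their neighbors."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


# Hypothetical toy input: small sizes make the overlap easy to see
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk when the overlap is large enough.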

In [3]:
persist_directory = './persist_directory/'

In [4]:
embedding = OpenAIEmbeddings()

In [5]:
pdf_path = "../resources_and_assets/llm_paper_know_dont_know.pdf"

In [6]:
# Load the PDF; PyPDFLoader returns one Document per page
loader = PyPDFLoader(pdf_path)
pdf_doc = loader.load()

In [7]:
all_splits = text_splitter.split_documents(pdf_doc)

In [8]:
# Embed every chunk and store the vectors in a Chroma collection saved under persist_directory
vectordb = Chroma.from_documents(all_splits, embedding=embedding,
        persist_directory=persist_directory)
vectordb.persist()
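
Conceptually, the vector store answers a query by comparing the query's embedding with the stored chunk embeddings and returning the closest chunks. The sketch below shows that idea with cosine similarity on hand-written three-dimensional vectors. Everything in it is an assumption for illustration: real OpenAI embeddings have far more dimensions, and Chroma's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored chunks most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical (text, embedding) pairs standing in for the embedded PDF chunks
toy_store = [
    ("chunk about self-knowledge", [0.9, 0.1, 0.0]),
    ("chunk about datasets", [0.1, 0.9, 0.0]),
    ("chunk about references", [0.0, 0.1, 0.9]),
]
toy_query = [0.8, 0.2, 0.0]  # hypothetical embedding of the user's question

print(top_k(toy_query, toy_store, k=2))
```

The retrieved chunks are what the chain later passes to the chat model as context for answering the question.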

In [9]:
# temperature=0 keeps the model's answers as deterministic as possible
pdf_qa = ChatVectorDBChain.from_llm(ChatOpenAI(temperature=0,
    model_name="gpt-3.5-turbo"), vectordb)



In [40]:
query = "What are the main contributions of this paper with regards to self-knowledge in LLMs?"
pdf_qa({"question": query, "chat_history": []})

{'question': 'What are the main contributions of this paper with regards to self-knowledge in LLMs?',
 'chat_history': [],
 'answer': 'The main contributions of this paper with regards to self-knowledge in LLMs are the findings that these models do possess a certain degree of self-knowledge, but there is a disparity when compared to human self-knowledge. The paper also highlights the need for further research to enhance the ability of LLMs to understand their own limitations. This could lead to more accurate and reliable responses from LLMs, positively impacting their applications in diverse fields.'}
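
The call above passes an empty `chat_history`. For follow-up questions, the chain expects the earlier turns as a list of `(question, answer)` tuples. Below is a minimal sketch of keeping that history up to date between calls. The helper `ask` and the stand-in `fake_chain` are hypothetical; `fake_chain` only mimics the input and output shape of `pdf_qa` so the example runs without an API key.

```python
def ask(chain, question, chat_history):
    """Send a question with the history so far, then record the new turn."""
    result = chain({"question": question, "chat_history": chat_history})
    chat_history.append((question, result["answer"]))
    return result["answer"]


def fake_chain(inputs):
    """Hypothetical stand-in for pdf_qa: reports which turn it is answering."""
    turn = len(inputs["chat_history"]) + 1
    return {"answer": f"answer {turn}"}


history = []
first = ask(fake_chain, "What is the paper about?", history)
second = ask(fake_chain, "How was that measured?", history)
print(history)
```

In the notebook, the same helper could be called as `ask(pdf_qa, query, history)`, assuming `pdf_qa` is the chain built earlier.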

Now let's look at the Q&A app for PDFs in `qa_pdf.py`.

Run the app with:

`streamlit run qa_pdf.py`

[Long context issue in LLMs](https://arxiv.org/pdf/2303.18223.pdf)

Long Context. One of the main drawbacks of Transformer-based language models is the context length is limited due to the involved quadratic computational costs in both time and memory. Meanwhile, there is an increasing demand for LLM applications with long context windows, such as in PDF processing and story writing [217]. ChatGPT has recently released an updated variant with a context window size of up to 16K tokens, which is much longer than the initial one, i.e., 4K tokens. Additionally, GPT-4 was launched with variants with context window of 32K tokens [46]. Next, we discuss two important factors that support long context modeling for LLMs.
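
To connect these context window sizes to the PDF QA system, here is a back-of-the-envelope sketch of how many retrieved chunks could fit in a prompt. It rests on loud assumptions: roughly 4 characters per token (a common rule of thumb for English text, not an exact tokenizer count), nominal window sizes of 4,000, 16,000, and 32,000 tokens, and an arbitrary 1,000 tokens reserved for the question, instructions, and answer. All names are hypothetical.

```python
CHARS_PER_TOKEN = 4  # assumption: rough average for English text, not a real tokenizer


def estimate_tokens(num_chars):
    """Estimate the token count of a text from its length in characters (rounded up)."""
    return -(-num_chars // CHARS_PER_TOKEN)


def max_chunks_in_context(context_tokens, chunk_chars, reserved_tokens):
    """How many chunks of chunk_chars characters fit after reserving some tokens."""
    chunk_tokens = estimate_tokens(chunk_chars)
    available = context_tokens - reserved_tokens
    return max(available // chunk_tokens, 0)


# chunk_chars=1500 matches the chunk_size used by the text splitter in this notebook
for window in (4_000, 16_000, 32_000):
    fits = max_chunks_in_context(window, chunk_chars=1500, reserved_tokens=1000)
    print(f"{window} tokens: about {fits} chunks")
```

For exact counts, a real tokenizer should replace the character heuristic; the point is only that a longer context window lets the chain pass more of the PDF to the model at once.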

# References

Below are notebooks from the OpenAI Cookbook and related resources on search and embeddings:
- https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb
- https://github.com/openai/openai-cookbook/blob/main/examples/Get_embeddings.ipynb
- https://github.com/openai/openai-cookbook/blob/main/examples/Code_search.ipynb
- https://github.com/openai/openai-cookbook/blob/main/examples/Customizing_embeddings.ipynb
- https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_Wikipedia_articles_for_search.ipynb
- https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
- [In-context learning abilities of ChatGPT models](https://arxiv.org/pdf/2303.18223.pdf)
- [Issue with long context](https://arxiv.org/pdf/2303.18223.pdf)