![](2023-07-24-10-52-10.png)

# Building a Simple QA System for Chatting with a PDF

This part of the training is mostly hands-on: we walk through the code for building a PDF question-answering (QA) system with LangChain.

In [28]:
# !pip install langchain
# !pip install openai
# !pip install pypdf
# !pip install chromadb

In [29]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ChatVectorDBChain
from langchain import PromptTemplate

In [30]:
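# Split text into chunks of up to 1500 characters, with 200 characters shared between neighbouring chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)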

In [31]:
persist_directory = './persist_directory/'

In [32]:
embedding = OpenAIEmbeddings()

In [33]:
pdf_path = "./assets-resources/llm_paper_know_dont_know.pdf"

In [34]:
loader = PyPDFLoader(pdf_path)
pdf_doc = loader.load()

In [35]:
all_splits = text_splitter.split_documents(pdf_doc)
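
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking in plain Python. It is a simplified fixed-window illustration, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter also tries to break on separators such as paragraphs and sentences). The function name `sliding_window_chunks` and the sample text are made up for this example.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 1500, chunk_overlap: int = 200) -> List[str]:
    """Cut text into fixed-size windows; each window repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical sample: 4000 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(4000))
chunks = sliding_window_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.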

In [36]:
vectordb = Chroma.from_documents(
    all_splits, embedding=embedding, persist_directory=persist_directory
)
vectordb.persist()
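
Conceptually, a vector store answers a query by returning the chunks whose embeddings are closest to the query embedding. The sketch below shows that idea with cosine similarity over a tiny in-memory dictionary. The 3-dimensional vectors and chunk labels are invented for illustration; real embeddings have far more dimensions, and Chroma's actual index and distance metric may differ from this brute-force version.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], store: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings" standing in for embedded PDF chunks.
toy_store = {
    "chunk about the SelfAware dataset": [0.9, 0.1, 0.0],
    "chunk about the evaluation method": [0.1, 0.9, 0.1],
    "chunk about the references": [0.0, 0.1, 0.9],
}
toy_query = [0.8, 0.2, 0.0]

results = top_k(toy_query, toy_store, k=2)
for name, score in results:
    print(f"{score:.3f}  {name}")
```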

In [37]:
pdf_qa = ChatVectorDBChain.from_llm(
    ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"), vectordb
)



In [38]:
query = "What are the main contributions of this paper with regards to self-knowledge in LLMs?"
pdf_qa({"question": query, "chat_history": []})

{'question': 'What are the main contributions of this paper with regards to self-knowledge in LLMs?',
 'chat_history': [],
 'answer': 'The main contributions of this paper with regards to self-knowledge in LLMs are:\n\n1. Development of a new dataset called SelfAware, which comprises a diverse range of commonly posed unanswerable questions.\n\n2. Proposal of an innovative evaluation technique based on text similarity to quantify the degree of uncertainty inherent in model outputs.\n\n3. Detailed analysis of 20 LLMs, including GPT-3, InstructGPT, and LLaMA, benchmarked against human self-knowledge, revealing a significant disparity between the most advanced LLMs and humans.\n\n4. Identification of the ability of LLMs to understand their own limitations and deficiencies in the unknowns, referred to as self-knowledge.\n\n5. Demonstration that in-context learning and instruction tuning can effectively enhance the self-knowledge of LLMs.\n\nOverall, this paper provides insights into the sel
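
The query above passes an empty `chat_history`. For follow-up questions, earlier turns would normally be passed along so the chain can resolve references such as "it" or "that dataset". The sketch below shows one way to keep that history, assuming the chain accepts `chat_history` as a list of `(question, answer)` tuples. The helper `ask` and the stand-in `fake_chain` are hypothetical; `fake_chain` only exists so the example runs without an API key, and in the notebook you would pass `pdf_qa` instead.

```python
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]  # assumed format: (question, answer) pairs


def ask(chain: Callable[[Dict], Dict], question: str, history: History) -> str:
    """Send the question plus earlier turns to the chain, then record the new turn."""
    result = chain({"question": question, "chat_history": history})
    history.append((question, result["answer"]))
    return result["answer"]


def fake_chain(inputs: Dict) -> Dict:
    """Hypothetical stand-in for pdf_qa: reports how many earlier turns it received."""
    turns = len(inputs["chat_history"])
    return {"answer": f"answer to {inputs['question']!r} after {turns} earlier turn(s)"}


history: History = []
first = ask(fake_chain, "What is SelfAware?", history)
second = ask(fake_chain, "How was it evaluated?", history)
print(second)
```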

[Long context issue in LLMs](https://arxiv.org/pdf/2303.18223.pdf)

Long Context. One of the main drawbacks of Transformer-based language models is the context length is limited due to the involved quadratic computational costs in both time and memory. Meanwhile, there is an increasing demand for LLM applications with long context windows, such as in PDF processing and story writing [217]. ChatGPT has recently released an updated variant with a context window size of up to 16K tokens, which is much longer than the initial one, i.e., 4K tokens. Additionally, GPT-4 was launched with variants with context window of 32K tokens [46]. Next, we discuss two important factors that support long context modeling for LLMs.
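
To relate these window sizes to the 1500-character chunks used earlier, here is a rough back-of-the-envelope sketch. It assumes about 4 characters per token, which is only a rule of thumb for English text and not an exact tokenizer count, and it assumes 500 tokens are reserved for the question and the answer. The exact window sizes (4096, 16384, 32768) are also assumptions standing in for "4K", "16K", and "32K".

```python
CHARS_PER_TOKEN = 4  # assumption: rough average for English, not a real tokenizer


def max_chunks_in_window(context_tokens: int, chunk_chars: int = 1500, reserved_tokens: int = 500) -> int:
    """Estimate how many retrieved chunks fit in the context window."""
    tokens_per_chunk = -(-chunk_chars // CHARS_PER_TOKEN)  # ceiling division
    usable_tokens = context_tokens - reserved_tokens
    return max(usable_tokens // tokens_per_chunk, 0)


# Assumed exact sizes for the windows mentioned in the passage.
windows = {"4K": 4096, "16K": 16384, "32K": 32768}
estimates = {name: max_chunks_in_window(size) for name, size in windows.items()}
for name, count in estimates.items():
    print(f"{name} window: about {count} chunks of 1500 characters")
```

Under these assumptions only a handful of chunks fit in the smallest window, which is why the chain retrieves a few relevant chunks instead of sending the whole PDF to the model.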

# References

- https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb

Below are notebooks from the OpenAI Cookbook and related resources on search and embeddings:
- https://github.com/openai/openai-cookbook/blob/main/examples/Get_embeddings.ipynb
- https://github.com/openai/openai-cookbook/blob/main/examples/Code_search.ipynb
- https://github.com/openai/openai-cookbook/blob/main/examples/Customizing_embeddings.ipynb
- https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_Wikipedia_articles_for_search.ipynb
- https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
- [In-context learning abilities of ChatGPT models](https://arxiv.org/pdf/2303.18223.pdf)
- [Issue with long context](https://arxiv.org/pdf/2303.18223.pdf)