![](2023-07-24-10-52-10.png)

# Building a Simple QA System for Chatting with a PDF

This part of the training is mostly hands-on: we walk through the code for building a PDF question-answering (QA) system with LangChain.

In [1]:
!pip install langchain
!pip install langchain-community
!pip install langchain-openai
!pip install langchainhub
!pip install pypdf
!pip install chromadb

In [2]:
import os

# Set the OpenAI API key

os.environ["OPENAI_API_KEY"] = "your openai key"

# OR (load from .env file)

# from dotenv import load_dotenv
# load_dotenv("./.env")

In [1]:
from langchain import hub
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

In [2]:
MODEL = "gpt-4o"

In [4]:
pdf_path = "./assets-resources/llm_paper_know_dont_know.pdf"
loader = PyPDFLoader(pdf_path) # LOAD

In [5]:
pdf_docs = loader.load_and_split() # SPLIT
pdf_docs

[Document(metadata={'source': './assets-resources/llm_paper_know_dont_know.pdf', 'page': 0}, page_content='Do Large Language Models Know What They Don’t Know?\nZhangyue Yin♢ Qiushi Sun♠ Qipeng Guo♢\nJiawen Wu♢ Xipeng Qiu♢∗ Xuanjing Huang♢\n♢School of Computer Science, Fudan University\n♠Department of Mathematics, National University of Singapore\n{yinzy21,jwwu21}@m.fudan.edu.cn qiushisun@u.nus.edu\n{qpguo16,xpqiu,xjhuang}@fudan.edu.cn\nAbstract\nLarge language models (LLMs) have a wealth\nof knowledge that allows them to excel in vari-\nous Natural Language Processing (NLP) tasks.\nCurrent research focuses on enhancing their\nperformance within their existing knowledge.\nDespite their vast knowledge, LLMs are still\nlimited by the amount of information they can\naccommodate and comprehend. Therefore, the\nability to understand their own limitations on\nthe unknows, referred to as self-knowledge,\nis of paramount importance. This study aims\nto evaluate LLMs’ self-knowledge by assess-\n

In [6]:
doc_obj = pdf_docs[0]
doc_obj

Document(metadata={'source': './assets-resources/llm_paper_know_dont_know.pdf', 'page': 0}, page_content='Do Large Language Models Know What They Don’t Know?\nZhangyue Yin♢ Qiushi Sun♠ Qipeng Guo♢\nJiawen Wu♢ Xipeng Qiu♢∗ Xuanjing Huang♢\n♢School of Computer Science, Fudan University\n♠Department of Mathematics, National University of Singapore\n{yinzy21,jwwu21}@m.fudan.edu.cn qiushisun@u.nus.edu\n{qpguo16,xpqiu,xjhuang}@fudan.edu.cn\nAbstract\nLarge language models (LLMs) have a wealth\nof knowledge that allows them to excel in vari-\nous Natural Language Processing (NLP) tasks.\nCurrent research focuses on enhancing their\nperformance within their existing knowledge.\nDespite their vast knowledge, LLMs are still\nlimited by the amount of information they can\naccommodate and comprehend. Therefore, the\nability to understand their own limitations on\nthe unknows, referred to as self-knowledge,\nis of paramount importance. This study aims\nto evaluate LLMs’ self-knowledge by assess-\ni

In [7]:
type(doc_obj)

langchain_core.documents.base.Document

In [8]:
doc_obj.page_content

'Do Large Language Models Know What They Don’t Know?\nZhangyue Yin♢ Qiushi Sun♠ Qipeng Guo♢\nJiawen Wu♢ Xipeng Qiu♢∗ Xuanjing Huang♢\n♢School of Computer Science, Fudan University\n♠Department of Mathematics, National University of Singapore\n{yinzy21,jwwu21}@m.fudan.edu.cn qiushisun@u.nus.edu\n{qpguo16,xpqiu,xjhuang}@fudan.edu.cn\nAbstract\nLarge language models (LLMs) have a wealth\nof knowledge that allows them to excel in vari-\nous Natural Language Processing (NLP) tasks.\nCurrent research focuses on enhancing their\nperformance within their existing knowledge.\nDespite their vast knowledge, LLMs are still\nlimited by the amount of information they can\naccommodate and comprehend. Therefore, the\nability to understand their own limitations on\nthe unknows, referred to as self-knowledge,\nis of paramount importance. This study aims\nto evaluate LLMs’ self-knowledge by assess-\ning their ability to identify unanswerable or\nunknowable questions. We introduce an auto-\nmated methodol

In [9]:
len(pdf_docs)

13

In [10]:
from IPython.display import display, Markdown

Markdown(doc_obj.page_content)

Do Large Language Models Know What They Don’t Know?
Zhangyue Yin♢ Qiushi Sun♠ Qipeng Guo♢
Jiawen Wu♢ Xipeng Qiu♢∗ Xuanjing Huang♢
♢School of Computer Science, Fudan University
♠Department of Mathematics, National University of Singapore
{yinzy21,jwwu21}@m.fudan.edu.cn qiushisun@u.nus.edu
{qpguo16,xpqiu,xjhuang}@fudan.edu.cn
Abstract
Large language models (LLMs) have a wealth
of knowledge that allows them to excel in vari-
ous Natural Language Processing (NLP) tasks.
Current research focuses on enhancing their
performance within their existing knowledge.
Despite their vast knowledge, LLMs are still
limited by the amount of information they can
accommodate and comprehend. Therefore, the
ability to understand their own limitations on
the unknows, referred to as self-knowledge,
is of paramount importance. This study aims
to evaluate LLMs’ self-knowledge by assess-
ing their ability to identify unanswerable or
unknowable questions. We introduce an auto-
mated methodology to detect uncertainty in the
responses of these models, providing a novel
measure of their self-knowledge. We further in-
troduce a unique dataset, SelfAware, consisting
of unanswerable questions from five diverse cat-
egories and their answerable counterparts. Our
extensive analysis, involving 20 LLMs includ-
ing GPT-3, InstructGPT, and LLaMA, discov-
ering an intrinsic capacity for self-knowledge
within these models. Moreover, we demon-
strate that in-context learning and instruction
tuning can further enhance this self-knowledge.
Despite this promising insight, our findings also
highlight a considerable gap between the capa-
bilities of these models and human proficiency
in recognizing the limits of their knowledge.
“True wisdom is knowing what you don’t know.”
–Confucius
1 Introduction
Recently, Large Language Models (LLMs) such
as GPT-4 (OpenAI, 2023), PaLM 2 (Anil et al.,
2023), and LLaMA (Touvron et al., 2023) have
shown exceptional performance on a wide range
of NLP tasks, including common sense reason-
ing (Wei et al., 2022; Zhou et al., 2022) and mathe-
∗ Corresponding author.
Unknows
KnowsUnknows
Knows
Known Knows Known Unknows
Unknown UnknowsUnknown Knows
Unlock
Figure 1: Know-Unknow Quadrant. The horizontal axis
represents the model’s memory capacity for knowledge,
and the vertical axis represents the model’s ability to
comprehend and utilize knowledge.
matical problem-solving (Lewkowycz et al., 2022;
Chen et al., 2022). Despite their ability to learn
from huge amounts of data, LLMs still have lim-
itations in their capacity to retain and understand
information. To ensure responsible usage, it is cru-
cial for LLMs to have the capability of recognizing
their limitations and conveying uncertainty when
responding to unanswerable or unknowable ques-
tions. This acknowledgment of limitations, also
known as “ knowing what you don’t know,” is a
crucial aspect in determining their practical appli-
cability. In this work, we refer to this ability as
model self-knowledge.
The Know-Unknow quadrant in Figure 1 il-
lustrates the relationship between the model’s
knowledge and comprehension. The ratio of
“Known Knows” to “Unknown Knows” demon-
strates the model’s proficiency in understanding
and applying existing knowledge. Techniques
such as Chain-of-Thought (Wei et al., 2022), Self-
Consistency (Wang et al., 2022), and Complex
CoT (Fu et al., 2022) can be utilized to increase
arXiv:2305.18153v2  [cs.CL]  30 May 2023

In [11]:
embeddings = OpenAIEmbeddings() # EMBED
embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x1274fec90>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x12764b150>, model='text-embedding-ada-002', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base='https://api.openai.com/v1', openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

In [15]:
embedding_lie = embeddings.embed_query("Lucas is a gorgeous man.")
embedding_truth = embeddings.embed_query("Lucas is a silly man.")


In [17]:
import numpy as np

# Calculate cosine similarity between the two embedding vectors
similarity = np.dot(embedding_lie, embedding_truth) / (np.linalg.norm(embedding_lie) * np.linalg.norm(embedding_truth))
print(f"Cosine similarity between the two sentences: {similarity}")


Cosine similarity between the two sentences: 0.8996095419150993


In [18]:
embedding_random = embeddings.embed_query("The sky is blue.")

In [19]:
similarity = np.dot(embedding_lie, embedding_random) / (np.linalg.norm(embedding_lie) * np.linalg.norm(embedding_random))
print(f"Cosine similarity between the two sentences: {similarity}")

Cosine similarity between the two sentences: 0.7751547827937032
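
The two scores above (about 0.90 for the two sentences about Lucas, about 0.78 against the unrelated sentence) come from the same formula: the dot product divided by the product of the vector norms. As a minimal sketch, the helper below wraps that formula in plain Python. The name `cosine_similarity` and the toy 3-dimensional vectors are assumptions for illustration only; they are not real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (made up for illustration, not real embeddings)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])      # 0
print(same_direction, orthogonal)
```

In the notebook, `cosine_similarity(embedding_lie, embedding_truth)` should give the same value as the NumPy expression, up to floating-point rounding.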


In [20]:
vectordb = Chroma.from_documents(pdf_docs, embedding=embeddings) # STORE
vectordb

<langchain_community.vectorstores.chroma.Chroma at 0x12774e610>

Definition of a [retriever](https://python.langchain.com/docs/modules/data_connection/retrievers/#:~:text=A%20retriever%20is,Document's%20as%20output.):

> A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store. A retriever does not need to be able to store documents, only to return (or retrieve) them. Vector stores can be used as the backbone of a retriever, but there are other types of retrievers as well.
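
To make the definition concrete, here is a toy retriever that ranks documents by word overlap with the query, with no embeddings or vector store involved. This is only a sketch of the idea: `ToyDocument` and `KeywordRetriever` are hypothetical names, not LangChain classes, and the `invoke` method merely mimics the calling convention used in this notebook.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class KeywordRetriever:
    """Returns up to k documents that share the most words with the query."""

    def __init__(self, docs: List[ToyDocument], k: int = 2) -> None:
        self.docs = docs
        self.k = k

    def invoke(self, query: str) -> List[ToyDocument]:
        query_words = set(query.lower().split())

        def overlap(doc: ToyDocument) -> int:
            # Number of distinct words shared by the query and the document
            return len(query_words & set(doc.page_content.lower().split()))

        ranked = sorted(self.docs, key=overlap, reverse=True)
        # Drop documents that share no words with the query
        return [doc for doc in ranked[: self.k] if overlap(doc) > 0]


toy_docs = [
    ToyDocument("the dataset contains unanswerable questions", {"page": 1}),
    ToyDocument("the sky is blue", {"page": 2}),
]
toy_retriever = KeywordRetriever(toy_docs, k=1)
hits = toy_retriever.invoke("which dataset has unanswerable questions")
print([doc.metadata for doc in hits])
```

The vector-store retriever used below follows the same query-in, documents-out contract, but ranks by embedding similarity instead of word overlap.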

In [21]:
retriever = vectordb.as_retriever() 
retriever

VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x12774e610>, search_kwargs={})

In [22]:
llm = ChatOpenAI(model=MODEL, temperature=0)

In [23]:
# source: https://python.langchain.com/v0.2/docs/tutorials/pdf_qa/#question-answering-with-rag

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

prompt

ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, say that you don't know. Use three sentences maximum and keep the answer concise.\n\n{context}"), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], input_types={}, partial_variables={}, template='{input}'), additional_kwargs={})])

In [24]:
question_answer_chain = create_stuff_documents_chain(llm, prompt)

question_answer_chain

RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableLambda(format_docs)
}), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
| ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, say that you don't know. Use three sentences maximum and keep the answer concise.\n\n{context}"), additional_kwargs={}), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], input_types={}, partial_variables={}, template='{input}'), additional_kwargs={})])
| ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x327897810>, async_client=<openai.resources.chat.completions.AsyncCompletion

The `create_stuff_documents_chain` function [returns an LCEL runnable](https://arc.net/l/quote/bnsztwth).
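
"Stuffing" here means placing the text of all retrieved chunks into the prompt's `{context}` slot. The sketch below illustrates that idea with plain string formatting; it is not the library's implementation. `TEMPLATE` and `stuff_documents` are hypothetical names, and joining chunks with a blank line is an assumption of this sketch.

```python
from typing import List

# Shortened, hypothetical template; it reuses the notebook's placeholder
# names {context} and {input}.
TEMPLATE = "Answer using only this context:\n\n{context}\n\nQuestion: {input}"


def stuff_documents(page_contents: List[str], question: str) -> str:
    """Join all chunks and place them in the prompt's {context} slot."""
    context = "\n\n".join(page_contents)
    return TEMPLATE.format(context=context, input=question)


stuffed_prompt = stuff_documents(["chunk A", "chunk B"], "What is the dataset?")
print(stuffed_prompt)
```

Because every chunk goes into a single prompt, this approach only works while the combined text fits in the model's context window.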

In [25]:
query = "What is the dataset refered in this paper?"

In [26]:
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

rag_chain

RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableBinding(bound=RunnableLambda(lambda x: x['input'])
           | VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x12774e610>, search_kwargs={}), kwargs={}, config={'run_name': 'retrieve_documents'}, config_factories=[])
})
| RunnableAssign(mapper={
    answer: RunnableBinding(bound=RunnableBinding(bound=RunnableAssign(mapper={
              context: RunnableLambda(format_docs)
            }), kwargs={}, config={'run_name': 'format_inputs'}, config_factories=[])
            | ChatPromptTemplate(input_variables=['context', 'input'], input_types={}, partial_variables={}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], input_types={}, partial_variables={}, template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the 

In [27]:
results = rag_chain.invoke({"input": query})

results

{'input': 'What is the dataset refered in this paper?',
 'context': [Document(metadata={'page': 1, 'source': './assets-resources/llm_paper_know_dont_know.pdf'}, page_content='et al., 2017). Our selection was guided by Sim-\nCSE (Gao et al., 2021), which allowed us to iden-\ntify and select the answerable questions semanti-\ncally closest to the unanswerable ones. From these\nsources, we accordingly drew samples of 1,487,\n182, and 668 questions respectively, amassing a\ntotal of 2,337. Given that these questions can be\neffectively addressed using information available\non Wikipedia, the foundational corpus for the train-\ning of current LLMs, it is plausible to infer that\nthe model possesses the requisite knowledge to\ngenerate accurate responses to these questions.\nOur dataset, christened SelfAware, incorporates\n1,032 unanswerable and 2,337 answerable ques-\ntions. To reflect real-world distribution, our dataset\n1The code pertinent to our study can be accessed\nhttps://github.com

In [28]:
from IPython.display import Markdown

final_answer = results["answer"]

Markdown(final_answer)

The dataset referred to in the paper is called "SelfAware." It comprises 1,032 unanswerable questions and 2,337 answerable questions, designed to evaluate the self-knowledge of large language models (LLMs).

In [29]:
query_summary = "Write a simple bullet points summary about this paper"

# reuse the same RAG chain for a new question (each call is independent; no chat history is kept)
output = rag_chain.invoke({"input": query_summary})

Markdown(output["answer"])

- The paper evaluates self-knowledge in language models, specifically GPT-3, InstructGPT, LLaMA, Alpaca, and Vicuna, using a dataset called SelfAware.
- SelfAware contains 1,032 unanswerable and 2,337 answerable questions, categorized into five types: no scientific consensus, imagination, completely subjective, too many variables, and philosophical.
- The study uses a similarity function to identify sentences with uncertain meanings and determines that a threshold of 0.75 provides the best balance of precision and recall for filtering uncertain sentences.

The final output is easy to verify: the metadata printed below shows that the retrieved context chunks came from pages 8, 2, 7, and 1 of the source PDF (page numbers are zero-based).

In [30]:
for doc in output['context']:
    print(doc.metadata)

{'page': 8, 'source': './assets-resources/llm_paper_know_dont_know.pdf'}
{'page': 2, 'source': './assets-resources/llm_paper_know_dont_know.pdf'}
{'page': 7, 'source': './assets-resources/llm_paper_know_dont_know.pdf'}
{'page': 1, 'source': './assets-resources/llm_paper_know_dont_know.pdf'}
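
For a quick check against the PDF, it can help to reduce this metadata to a sorted list of distinct pages. The helper below is a small sketch; `cited_pages` is a hypothetical name, and it operates on plain dictionaries shaped like the metadata printed above.

```python
from typing import Any, Dict, Iterable, List


def cited_pages(metadatas: Iterable[Dict[str, Any]]) -> List[int]:
    """Sorted, de-duplicated zero-based page indices from chunk metadata."""
    return sorted({int(meta["page"]) for meta in metadatas if "page" in meta})


# Metadata copied from the printed output above
printed_metadata = [
    {"page": 8, "source": "./assets-resources/llm_paper_know_dont_know.pdf"},
    {"page": 2, "source": "./assets-resources/llm_paper_know_dont_know.pdf"},
    {"page": 7, "source": "./assets-resources/llm_paper_know_dont_know.pdf"},
    {"page": 1, "source": "./assets-resources/llm_paper_know_dont_know.pdf"},
]
print(cited_pages(printed_metadata))  # [1, 2, 7, 8]
```

In the notebook, the equivalent call would be `cited_pages(doc.metadata for doc in output["context"])`.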


Let's now dig deeper into RAG over PDFs and build this RAG chain ourselves.

In [36]:
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages([
    ('system', system_prompt),
    ('human', '{input}')
])

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain_from_docs = (
    {
        'input': lambda x: x['input'],
        'context': lambda x: format_docs(x['context']), 
    }
    | prompt
    | llm
    | StrOutputParser()
)

In [39]:
# passing the input query to the retriever
retrieve_docs = (lambda x: x['input']) | retriever

In [41]:
chain = RunnablePassthrough.assign(context=retrieve_docs).assign(
    answer=rag_chain_from_docs
)
chain

RunnableAssign(mapper={
  context: RunnableLambda(lambda x: x['input'])
           | VectorStoreRetriever(tags=['Chroma', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.chroma.Chroma object at 0x11a1ced90>)
})
| RunnableAssign(mapper={
    answer: {
              input: RunnableLambda(...),
              context: RunnableLambda(...)
            }
            | ChatPromptTemplate(input_variables=['context', 'input'], messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, say that you don't know. Use three sentences maximum and keep the answer concise.\n\n{context}")), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], template='{input}'))])
            | ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x12bf88150>, async_cli
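
The two chained `.assign(...)` calls explain why the final result is a dictionary with `input`, `context`, and `answer` keys: each call passes the incoming dictionary through and adds one computed key. The plain-Python emulation below sketches that behavior under stated assumptions. The `assign` function, `fake_retrieve`, and `fake_answer` are hypothetical stand-ins, not LangChain code.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


def assign(state: State, **steps: Callable[[State], Any]) -> State:
    """Return a copy of `state` with one new key per step.

    Every step receives the incoming `state`, so steps passed in the same
    call cannot see each other's results.
    """
    new_values = {key: step(state) for key, step in steps.items()}
    return {**state, **new_values}


# Hypothetical stand-ins for the retriever and the prompt | llm | parser chain
def fake_retrieve(state: State) -> List[str]:
    return [f"chunk about {state['input']}"]


def fake_answer(state: State) -> str:
    return f"answered from {len(state['context'])} chunk(s)"


toy_state = assign({"input": "SelfAware"}, context=fake_retrieve)
toy_state = assign(toy_state, answer=fake_answer)  # can read toy_state["context"]
print(toy_state)
```

The second step depends on the first, which is why the real chain uses two separate `.assign` calls rather than one call with both keys.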

In [42]:
query = "According to this paper what are these reusable LLM-profiled components?" 
chain.invoke({'input': query})

{'input': 'According to this paper what are these reusable LLM-profiled components?',
 'context': [Document(metadata={'page': 0, 'source': './assets-resources/paper-llm-components.pdf'}, page_content='A Survey on LLM-Based Agents: Common Workflows and Reusable\nLLM-Profiled Components\nXinzhe Li\nSchool of IT, Deakin University, Australia\nlixinzhe@deakin.edu.au\nAbstract\nRecent advancements in Large Language Mod-\nels (LLMs) have catalyzed the development of so-\nphisticated frameworks for developing LLM-based\nagents. However, the complexity of these frame-\nworks r poses a hurdle for nuanced differentiation\nat a granular level, a critical aspect for enabling\nefficient implementations across different frame-\nworks and fostering future research. Hence, the\nprimary purpose of this survey is to facilitate a co-\nhesive understanding of diverse recently proposed\nframeworks by identifying common workflows and\nreusable LLM-Profiled Components (LMPCs).\n1 Introduction\nGenerative Lar

Adding structured sources:

In [43]:
# source: https://python.langchain.com/v0.2/docs/how_to/qa_sources/
from typing import List

from langchain_core.runnables import RunnablePassthrough
from typing_extensions import Annotated, TypedDict


# Desired schema for response
class AnswerWithSources(TypedDict):
    """An answer to the question, with sources."""

    answer: str
    sources: Annotated[
        List[str],
        ...,
        "List of sources (author + year) used to answer the question",
    ]


# Our rag_chain_from_docs has the following changes:
# - add `.with_structured_output` to the LLM;
# - remove the output parser
rag_chain_from_docs = (
    {
        "input": lambda x: x["input"],
        "context": lambda x: format_docs(x["context"]),
    }
    | prompt
    | llm.with_structured_output(AnswerWithSources)
)

retrieve_docs = (lambda x: x["input"]) | retriever

chain = RunnablePassthrough.assign(context=retrieve_docs).assign(
    answer=rag_chain_from_docs
)

response = chain.invoke({"input": query})
response

{'input': 'According to this paper what are these reusable LLM-profiled components?',
 'context': [Document(metadata={'page': 0, 'source': './assets-resources/paper-llm-components.pdf'}, page_content='A Survey on LLM-Based Agents: Common Workflows and Reusable\nLLM-Profiled Components\nXinzhe Li\nSchool of IT, Deakin University, Australia\nlixinzhe@deakin.edu.au\nAbstract\nRecent advancements in Large Language Mod-\nels (LLMs) have catalyzed the development of so-\nphisticated frameworks for developing LLM-based\nagents. However, the complexity of these frame-\nworks r poses a hurdle for nuanced differentiation\nat a granular level, a critical aspect for enabling\nefficient implementations across different frame-\nworks and fostering future research. Hence, the\nprimary purpose of this survey is to facilitate a co-\nhesive understanding of diverse recently proposed\nframeworks by identifying common workflows and\nreusable LLM-Profiled Components (LMPCs).\n1 Introduction\nGenerative Lar
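
With structured output, `response["answer"]` is itself a dictionary following the `AnswerWithSources` schema, so the answer text and its sources are read from nested keys. The sketch below formats such a response for display. `render_answer` is a hypothetical helper, and `example_response` holds placeholder values only; it is not actual model output.

```python
from typing import Any, Dict


def render_answer(response: Dict[str, Any]) -> str:
    """Format the structured answer and its sources as Markdown text."""
    structured = response["answer"]  # dict with "answer" and "sources" keys
    lines = [structured["answer"], "", "Sources:"]
    lines.extend(f"- {source}" for source in structured["sources"])
    return "\n".join(lines)


# Hypothetical placeholder response, NOT actual model output
example_response = {
    "input": "<question>",
    "context": [],
    "answer": {"answer": "<answer text>", "sources": ["<author, year>"]},
}
print(render_answer(example_response))
```

In the notebook, `Markdown(render_answer(response))` would display the result in the same way as the earlier answers.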

# References

Below are OpenAI Cookbook notebooks and related resources on search and embeddings:
- https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb
- https://github.com/openai/openai-cookbook/blob/main/examples/Get_embeddings.ipynb
- https://github.com/openai/openai-cookbook/blob/main/examples/Code_search.ipynb
- https://github.com/openai/openai-cookbook/blob/main/examples/Customizing_embeddings.ipynb
- https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_Wikipedia_articles_for_search.ipynb
- https://platform.openai.com/docs/guides/embeddings/what-are-embeddings
- [In-context learning abilities of ChatGPT models](https://arxiv.org/pdf/2303.18223.pdf)
- [Issue with long context](https://arxiv.org/pdf/2303.18223.pdf)