In [None]:
!pip install langchain
!pip install langchain-community
!pip install langchain-openai
!pip install langchainhub
!pip install pypdf
!pip install chromadb

In [None]:
import os

# Set OPENAI API Key

os.environ["OPENAI_API_KEY"] = "your openai key"

# OR (load from .env file)

# from dotenv import load_dotenv
# make sure you have python-dotenv installed
# load_dotenv("./.env")
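
Hard-coding the key in a notebook makes it easy to leak or to forget the placeholder is still there. The helper below is a minimal sketch of a guard that fails early when the key is missing or still set to the placeholder text. The name `require_env` and the placeholder check are assumptions for illustration, not part of LangChain or the OpenAI SDK.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable, or raise if it is
    missing, empty, or still the notebook's placeholder text."""
    value = environ.get(name, "").strip()
    # "your openai key" is the placeholder used in the cell above.
    if not value or value.lower().startswith("your "):
        raise RuntimeError(
            f"{name} is not set; export it or load it from a .env file."
        )
    return value


# Usage with a stand-in mapping so the sketch runs without a real key.
fake_env = {"OPENAI_API_KEY": "sk-example"}
print(require_env("OPENAI_API_KEY", environ=fake_env))
```

Calling `require_env("OPENAI_API_KEY")` with no second argument reads from the real process environment.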

Let's set up a study workflow using Jupyter notebooks, LLMs, and LangChain: load a paper, ask questions about it, generate quiz questions, and grade the answers.

In [15]:
import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

In [16]:
MODEL="gpt-4o-mini"

In [26]:
pdf_path = "./assets-resources/llm_paper_know_dont_know.pdf"
loader = PyPDFLoader(pdf_path) # LOAD
pdf_docs = loader.load_and_split() # SPLIT
pdf_docs

[Document(metadata={'source': './assets-resources/llm_paper_know_dont_know.pdf', 'page': 0}, page_content='Do Large Language Models Know What They Don’t Know?\nZhangyue Yin♢ Qiushi Sun♠ Qipeng Guo♢\nJiawen Wu♢ Xipeng Qiu♢∗ Xuanjing Huang♢\n♢School of Computer Science, Fudan University\n♠Department of Mathematics, National University of Singapore\n{yinzy21,jwwu21}@m.fudan.edu.cn qiushisun@u.nus.edu\n{qpguo16,xpqiu,xjhuang}@fudan.edu.cn\nAbstract\nLarge language models (LLMs) have a wealth\nof knowledge that allows them to excel in vari-\nous Natural Language Processing (NLP) tasks.\nCurrent research focuses on enhancing their\nperformance within their existing knowledge.\nDespite their vast knowledge, LLMs are still\nlimited by the amount of information they can\naccommodate and comprehend. Therefore, the\nability to understand their own limitations on\nthe unknows, referred to as self-knowledge,\nis of paramount importance. This study aims\nto evaluate LLMs’ self-knowledge by assess-\n
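
`load_and_split()` loads the PDF and passes the text through a text splitter, so each `Document` holds one chunk. The sketch below is not LangChain's splitter. It is a simplified fixed-size window with overlap, meant only to show the idea of chunking, and `split_text`, `chunk_size`, and `overlap` are hypothetical names chosen for this example.

```python
from typing import List


def split_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Cut text into windows of chunk_size characters that share
    `overlap` characters with the previous window."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_text("abcdefghij", chunk_size=4, overlap=1))
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.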

In [27]:
doc_obj = pdf_docs[0]
doc_obj

Document(metadata={'source': './assets-resources/llm_paper_know_dont_know.pdf', 'page': 0}, page_content='Do Large Language Models Know What They Don’t Know?\nZhangyue Yin♢ Qiushi Sun♠ Qipeng Guo♢\nJiawen Wu♢ Xipeng Qiu♢∗ Xuanjing Huang♢\n♢School of Computer Science, Fudan University\n♠Department of Mathematics, National University of Singapore\n{yinzy21,jwwu21}@m.fudan.edu.cn qiushisun@u.nus.edu\n{qpguo16,xpqiu,xjhuang}@fudan.edu.cn\nAbstract\nLarge language models (LLMs) have a wealth\nof knowledge that allows them to excel in vari-\nous Natural Language Processing (NLP) tasks.\nCurrent research focuses on enhancing their\nperformance within their existing knowledge.\nDespite their vast knowledge, LLMs are still\nlimited by the amount of information they can\naccommodate and comprehend. Therefore, the\nability to understand their own limitations on\nthe unknows, referred to as self-knowledge,\nis of paramount importance. This study aims\nto evaluate LLMs’ self-knowledge by assess-\ni

In [28]:
from IPython.display import display, Markdown

Markdown(doc_obj.page_content)

Do Large Language Models Know What They Don’t Know?
Zhangyue Yin♢ Qiushi Sun♠ Qipeng Guo♢
Jiawen Wu♢ Xipeng Qiu♢∗ Xuanjing Huang♢
♢School of Computer Science, Fudan University
♠Department of Mathematics, National University of Singapore
{yinzy21,jwwu21}@m.fudan.edu.cn qiushisun@u.nus.edu
{qpguo16,xpqiu,xjhuang}@fudan.edu.cn
Abstract
Large language models (LLMs) have a wealth
of knowledge that allows them to excel in vari-
ous Natural Language Processing (NLP) tasks.
Current research focuses on enhancing their
performance within their existing knowledge.
Despite their vast knowledge, LLMs are still
limited by the amount of information they can
accommodate and comprehend. Therefore, the
ability to understand their own limitations on
the unknows, referred to as self-knowledge,
is of paramount importance. This study aims
to evaluate LLMs’ self-knowledge by assess-
ing their ability to identify unanswerable or
unknowable questions. We introduce an auto-
mated methodology to detect uncertainty in the
responses of these models, providing a novel
measure of their self-knowledge. We further in-
troduce a unique dataset, SelfAware, consisting
of unanswerable questions from five diverse cat-
egories and their answerable counterparts. Our
extensive analysis, involving 20 LLMs includ-
ing GPT-3, InstructGPT, and LLaMA, discov-
ering an intrinsic capacity for self-knowledge
within these models. Moreover, we demon-
strate that in-context learning and instruction
tuning can further enhance this self-knowledge.
Despite this promising insight, our findings also
highlight a considerable gap between the capa-
bilities of these models and human proficiency
in recognizing the limits of their knowledge.
“True wisdom is knowing what you don’t know.”
–Confucius
1 Introduction
Recently, Large Language Models (LLMs) such
as GPT-4 (OpenAI, 2023), PaLM 2 (Anil et al.,
2023), and LLaMA (Touvron et al., 2023) have
shown exceptional performance on a wide range
of NLP tasks, including common sense reason-
ing (Wei et al., 2022; Zhou et al., 2022) and mathe-
∗ Corresponding author.
Unknows
KnowsUnknows
Knows
Known Knows Known Unknows
Unknown UnknowsUnknown Knows
Unlock
Figure 1: Know-Unknow Quadrant. The horizontal axis
represents the model’s memory capacity for knowledge,
and the vertical axis represents the model’s ability to
comprehend and utilize knowledge.
matical problem-solving (Lewkowycz et al., 2022;
Chen et al., 2022). Despite their ability to learn
from huge amounts of data, LLMs still have lim-
itations in their capacity to retain and understand
information. To ensure responsible usage, it is cru-
cial for LLMs to have the capability of recognizing
their limitations and conveying uncertainty when
responding to unanswerable or unknowable ques-
tions. This acknowledgment of limitations, also
known as “ knowing what you don’t know,” is a
crucial aspect in determining their practical appli-
cability. In this work, we refer to this ability as
model self-knowledge.
The Know-Unknow quadrant in Figure 1 il-
lustrates the relationship between the model’s
knowledge and comprehension. The ratio of
“Known Knows” to “Unknown Knows” demon-
strates the model’s proficiency in understanding
and applying existing knowledge. Techniques
such as Chain-of-Thought (Wei et al., 2022), Self-
Consistency (Wang et al., 2022), and Complex
CoT (Fu et al., 2022) can be utilized to increase
arXiv:2305.18153v2  [cs.CL]  30 May 2023

In [29]:
embeddings = OpenAIEmbeddings() # EMBED
vectordb = Chroma.from_documents(pdf_docs, embedding=embeddings) # STORE


# Definition of a retriever, from the LangChain docs
# (https://python.langchain.com/docs/modules/data_connection/retrievers/#:~:text=A%20retriever%20is,Document's%20as%20output.):
# "A retriever is an interface that returns documents given an unstructured query.
# It is more general than a vector store. A retriever does not need to be able to
# store documents, only to return (or retrieve) them. Vector stores can be used as
# the backbone of a retriever, but there are other types of retrievers as well."
retriever = vectordb.as_retriever()
llm = ChatOpenAI(model=MODEL, temperature=0)
# source: https://python.langchain.com/v0.2/docs/tutorials/pdf_qa/#question-answering-with-rag

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

# prompt
question_answer_chain = create_stuff_documents_chain(llm, prompt)
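
The cell above embeds the chunks, stores them, exposes them through a retriever, and "stuffs" the retrieved chunks into the `{context}` slot of the prompt. The sketch below mimics that flow with the standard library only, assuming a bag-of-words vector in place of OpenAI embeddings and a plain list in place of Chroma. All function names here (`embed`, `cosine`, `retrieve`, `build_prompt`) are hypothetical and exist only for this illustration.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, chunks: List[str], k: int = 2) -> str:
    """'Stuff' the retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(retrieve(query, chunks, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


chunks = [
    "SelfAware contains 1,032 unanswerable questions and 2,337 answerable questions.",
    "The encoder is composed of a stack of N = 6 identical layers.",
    "GPT-4 reaches a self-knowledge score of 75.47%, while humans reach 84.93%.",
]
query = "How many unanswerable questions does SelfAware contain?"
top = retrieve(query, chunks, k=1)
print(top[0])
print(build_prompt(query, chunks, k=1))
```

A real embedding model captures meaning rather than word overlap, but the control flow (rank chunks against the query, then paste the winners into the prompt) is the same.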

In [30]:
# question_answer_chain
# This method `create_stuff_documents_chain` [outputs an LCEL runnable](https://arc.net/l/quote/bnsztwth)
query = "What are the key components of the transformer architecture?"
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

results = rag_chain.invoke({"input": query})

results

{'input': 'What are the key components of the transformer architecture?',
 'context': [Document(metadata={'page': 2, 'source': './assets-resources/attention-paper.pdf'}, page_content='Figure 1: The Transformer - model architecture.\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\nrespectively.\n3.1 Encoder and Decoder Stacks\nEncoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two\nsub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-\nwise fully connected feed-forward network. We employ a residual connection [11] around each of\nthe two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is\nLayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer\nitself. To facilitate these residual

In [31]:
from IPython.display import Markdown

final_answer = results["answer"]

Markdown(final_answer)

The key components of the Transformer architecture include the encoder and decoder stacks, each composed of identical layers. Each encoder layer has a multi-head self-attention mechanism and a position-wise fully connected feed-forward network, while the decoder includes an additional multi-head attention sub-layer that attends to the encoder's output. Residual connections and layer normalization are employed around each sub-layer to facilitate training.

In [32]:
query_summary = "Write a simple bullet points summary about this paper"

# Each invoke() call is independent: no chat history is passed, so the chain does not remember previous questions
output = rag_chain.invoke({"input": query_summary})

Markdown(output["answer"])

- The paper introduces the Transformer, a new neural network architecture based solely on attention mechanisms, eliminating the need for recurrence and convolutions.
- It demonstrates superior performance in machine translation tasks, achieving state-of-the-art BLEU scores on English-to-German and English-to-French translations.
- The Transformer model is shown to generalize well to other tasks, such as English constituency parsing, with significantly reduced training time and costs.

In [33]:
def ask_pdf(pdf_qa, query):
    """Run `query` through the retrieval chain, print the exchange, and return the answer text."""
    print("QUERY: ", query)
    result = pdf_qa.invoke({"input": query})
    answer = result["answer"]
    print("ANSWER", answer)
    return answer


ask_pdf(rag_chain,"How does the self-attention mechanism in transformers differ from traditional sequence alignment methods?")

QUERY:  How does the self-attention mechanism in transformers differ from traditional sequence alignment methods?
ANSWER The self-attention mechanism in transformers allows for modeling dependencies between all positions in a sequence simultaneously, without regard to their distance, while traditional sequence alignment methods, such as those used in recurrent neural networks (RNNs), typically process sequences in a sequential manner. This means that self-attention can capture relationships between distant positions more effectively and with a constant number of operations, whereas traditional methods may struggle with long-range dependencies due to their sequential nature. Additionally, self-attention enables greater parallelization during training, improving computational efficiency.


'The self-attention mechanism in transformers allows for modeling dependencies between all positions in a sequence simultaneously, without regard to their distance, while traditional sequence alignment methods, such as those used in recurrent neural networks (RNNs), typically process sequences in a sequential manner. This means that self-attention can capture relationships between distant positions more effectively and with a constant number of operations, whereas traditional methods may struggle with long-range dependencies due to their sequential nature. Additionally, self-attention enables greater parallelization during training, improving computational efficiency.'

In [34]:
quiz_questions = ask_pdf(rag_chain, "Quiz me with 3 simple questions on the positional encodings and the role they play in transformers.")

quiz_questions

QUERY:  Quiz me with 3 simple questions on the positional encodings and the role they play in transformers.
ANSWER 1. What is the purpose of positional encodings in the Transformer architecture?  
2. How do positional encodings help the model understand the order of input sequences?  
3. What are the two types of positional encodings mentioned in the context of Transformers?


'1. What is the purpose of positional encodings in the Transformer architecture?  \n2. How do positional encodings help the model understand the order of input sequences?  \n3. What are the two types of positional encodings mentioned in the context of Transformers?'
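
The quiz comes back as one string containing a numbered list, but the later cells need the questions one at a time. A quick way to get there is to parse the numbering by hand, as in the sketch below. `parse_numbered_questions` is a hypothetical helper, and it only handles the `1.` / `1)` line shape seen in this output, which is the brittleness that the structured-output cells later in this notebook avoid.

```python
import re
from typing import List


def parse_numbered_questions(text: str) -> List[str]:
    """Extract the items of a '1. ...' / '2) ...' numbered list, one per line."""
    questions = []
    for line in text.splitlines():
        match = re.match(r"^\s*\d+[.)]\s*(.+?)\s*$", line)
        if match:
            questions.append(match.group(1))
    return questions


# The exact string returned by the cell above.
raw = (
    "1. What is the purpose of positional encodings in the Transformer architecture?  \n"
    "2. How do positional encodings help the model understand the order of input sequences?  \n"
    "3. What are the two types of positional encodings mentioned in the context of Transformers?"
)

for question in parse_numbered_questions(raw):
    print(question)
```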

In [3]:
from langchain_openai import ChatOpenAI
MODEL="gpt-4o-mini"
llm = ChatOpenAI(model=MODEL, temperature=0.0)

In [5]:
from pydantic import BaseModel, Field
from typing import List

class QA(BaseModel):
    questions: List[str] = Field(description='List of questions about a given context.')

In [6]:
from langchain_core.prompts.chat import SystemMessagePromptTemplate, HumanMessagePromptTemplate

template = "You transform unstructured questions about a topic into a structured list of questions."
system_message_prompt = SystemMessagePromptTemplate.from_template(template)
human_message_prompt = HumanMessagePromptTemplate.from_template("Return ONLY a PYTHON list containing the questions in this text: {questions}")

In [7]:
from langchain_core.prompts import ChatPromptTemplate

chat_prompt = ChatPromptTemplate.from_messages([system_message_prompt,human_message_prompt])

In [8]:
quiz_chain = chat_prompt | llm.with_structured_output(QA)

In [11]:
context_information = """
nalysis, involving 20 LLMs includ-
ing GPT-3, InstructGPT, and LLaMA, discov-
ering an intrinsic capacity for self-knowledge
within these models. Moreover, we demon-
strate that in-context learning and instruction
tuning can further enhance this self-knowledge.
Despite this promising insight, our findings also
highlight a considerable gap between the capa-
bilities of these models and human proficiency
in recognizing the limits of their knowledge.
“True wisdom is knowing what you don’t know.”
–Confucius
1 Introduction
Recently, Large Language Models (LLMs) such
as GPT-4 (OpenAI, 2023), PaLM 2 (Anil et al.,
2023), and LLaMA (Touvron et al., 2023) have
shown exceptional performance on a wide range
of NLP tasks, including common sense reason-
ing (Wei et al., 2022; Zhou et al., 2022) and mathe-
∗Corresponding author.UnknowsKnowsUnknowsKnowsKnown Knows Known UnknowsUnknown UnknowsUnknown KnowsUnlock
Figure 1: Know-Unknow Quadrant. The horizontal axis
represents the model’s memory capacity for knowledge,
and the vertical axis represents the model’s ability to
comprehend and utilize knowledge.
matical problem-solving (Lewkowycz et al., 2022;
Chen et al., 2022). Despite their ability to learn
from huge amounts of data, LLMs still have lim-
itations in their capacity to retain and understand
information. To ensure responsible usage, it is cru-
cial for LLMs to have the capability of recognizing
their limitations and conveying uncertainty when
responding to unanswerable or unknowable ques-
tions. This acknowledgment of limitations, also
known as “knowing what you don’t know,” is a
crucial aspect in determining their practical appli-
cability. In this work, we refer to this ability as
model self-knowledge.
The Know-Unknow quadrant in Figure 1 il-
lustrates the relationship between the model’s
knowledge and comprehension. The ratio of
“Known Knows” to “Unknown Knows” demon-
strates the model’s proficiency in understanding
and applying existing knowledge. Techniques
such as Chain-of-Thought (Wei et al., 2022), Self-
Consistency (Wang et al., 2022), and Complex
CoT (Fu et al., 2022) can be utilized to increase
arXiv:2305.18153v2  [cs.CL]  30 May 2023
this ratio, resulting in improved performance on
NLP tasks. We focus on the ratio of “Known Un-
knows” to “Unknown Unknows”, which indicates
the model’s self-knowledge level, specifically un-
derstanding its own limitations and deficiencies in
the unknows.
Existing datasets such as SQuAD2.0 (Rajpurkar
et al., 2018) and NewsQA (Trischler et al., 2017),
widely used in question answering (QA), have been
utilized to test the self-knowledge of models with
unanswerable questions. However, these questions
are context-specific and could become answerable
when supplemented with additional information.
Srivastava et al. (2022) attempted to address this by
evaluating LLMs’ competence in delineating their
knowledge boundaries, employing a set of 23 pairs
of answerable and unanswerable multiple-choice
questions. They discovered that these models’ per-
formance barely surpassed that of random guessing.
Kadavath et al. (2022) suggested probing the self-
knowledge of LLMs through the implementation
of a distinct "Value Head". Yet, this approach may
encounter difficulties when applied across varied
domains or tasks due to task-specific training. Con-
sequently, we redirect our focus to the inherent
abilities of LLMs, and pose the pivotal question:
“Do large language models know what they don’t
know?”.
In this study, we investigate the self-knowledge
of LLMs using a novel approach. By gathering
reference sentences with uncertain meanings, we
can determine whether the model’s responses re-
flect uncertainty using a text similarity algorithm.
We quantified the model’s self-knowledge using
the F1 score. To address the small and idiosyn-
cratic limitations of existing datasets, we created
a new dataset called SelfAware. This dataset com-
prises 1,032 unanswerable questions, which are dis-
tributed across five distinct categories, along with
an additional 2,337 questions that are classified as
answerable. Experimental results on GPT-3, In-
structGPT, LLaMA, and other LLMs demonstrate
that in-context learning and instruction tuning can
effectively enhance the self-knowledge of LLMs.
However, the self-knowledge exhibited by the cur-
rent state-of-the-art model, GPT-4, measures at
75.47%, signifying a notable disparity when con-
trasted with human self-knowledge, which is rated
at 84.93%.
Our key contributions to this field are summa-
rized as follows:
• We have developed a new dataset, SelfAware,
that comprises a diverse range of commonly
posed unanswerable questions.
• We propose an innovative evaluation tech-
nique based on text similarity to quantify the
degree of uncertainty inherent in model out-
puts.
• Through our detailed analysis of 20 LLMs,
benchmarked against human self-knowledge,
we identified a significant disparity between
the most advanced LLMs and humans 1.
2 Dataset Construction
To conduct a more comprehensive evaluation of
the model’s self-knowledge, we constructed a
dataset that includes a larger number and more di-
verse types of unanswerable questions than Know-
Unknowns dataset (Srivastava et al., 2022). To
facilitate this, we collected a corpus of 2,858 unan-
swerable questions, sourced from online platforms
like Quora and HowStuffWorks. These questions
were meticulously evaluated by three seasoned an-
notation analysts, each operating independently.
The analysts were permitted to leverage external
resources, such as search engines. To ensure the va-
lidity of our dataset, we retained only the questions
that all three analysts concurred were unanswerable.
This rigorous process yielded a finalized collection
of 1,032 unanswerable questions.
In pursuit of a comprehensive evaluation, we
opted for answerable questions drawn from three
datasets: SQuAD (Rajpurkar et al., 2016), Hot-
potQA (Yang et al., 2018), and TriviaQA (Joshi
et al., 2017). Our selection was guided by Sim-
CSE (Gao et al., 2021), which allowed us to iden-
tify and select the answerable questions semanti-
cally closest to the unanswerable ones. From these
sources, we accordingly drew samples of 1,487,
182, and 668 questions respectively, amassing a
total of 2,337. Given that these questions can be
effectively addressed using information available
on Wikipedia, the foundational corpus for the train-
ing of current LLMs, it is plausible to infer that
the model possesses the requisite knowledge to
generate accurate responses to these questions.
Our dataset, christened SelfAware, incorporates
1,032 unanswerable and 2,337 answerable ques-
tions. To reflect real-world distribution, our dataset
1The code pertinent to our study can be accessed
https://github.com/yinzhangyue/SelfAware
Category Description Example Percentage
No scientific
consensus
The answer is still up
for debate, with no consensus
in scientific community.
“Are we alone in the universe,
or will we discover alien
life at some point?”
25%
Imagination The question are about people’s
imaginations of the future.
"What will the fastest form of
transportation be in 2050?" 15%
Completely
subjective
The answer depends on
personal preference.
"Would you rather be shot
into space or explore the
deepest depths of the sea?"
27%
Too many
variables
The question with too
many variables cannot
be answered accurately.
“John made 6 dollars mowing lawns
and 18 dollars weed eating.
If he only spent 3 or 5 dollar a week,
how long would the money last him?”
10%
Philosophical
The question can yield
multiple responses, but it
lacks a definitive answer.
“How come god was
born from nothingness?” 23%
Table 1: Unanswerable questions in the SelfAware dataset that span across multiple categories.
contains a proportion of answerable questions that
is twice as large as the volume of unanswerable
ones. Nevertheless, to ensure the feasibility of test-
ing, we have purposefully capped the number of
answerable questions.
2.1 Dataset Analysis
To gain insight into the reasons precluding a cer-
tain answer, we undertook a manual analysis of
100 randomly selected unanswerable questions. As
tabulated in Table 1, we have broadly segregated
these questions into five distinctive categories. “No
Scientific Consensus" encapsulates questions that
ignite ongoing debates within the scientific com-
munity, such as those concerning the universe’s
origin. “Imagination" includes questions involving
speculative future scenarios, like envisaged events
over the next 50 years. “Completely Subjective"
comprises questions that are inherently personal,
where answers depend heavily on individual predis-
positions. “Too Many Variables" pertains to mathe-
matical problems that become unsolvable owing to
the overwhelming prevalence of variables. Lastly,
“Philosophical" represents questions of a profound,
often metaphysical, nature that resist concrete an-
swers. Ideally, upon encountering such questions,
the model should express uncertainty instead of
delivering conclusive responses.
3 Evaluation Method
This section elucidates the methodology employed
for assessing self-knowledge in the generated text.
In order to achieve this, we define a similarity func-
tion, fsim, to compute the similarity, S, between
a given sentence, t, and a collection of reference
sentences, U ={u1, u2, . . . , un}, endowed with
uncertain meanings
"""
questions_list = quiz_chain.invoke({"questions": context_information})
questions_list

QA(questions=['What is the intrinsic capacity for self-knowledge within large language models (LLMs)?', 'How can in-context learning and instruction tuning enhance self-knowledge in LLMs?', 'What is the gap between the capabilities of LLMs and human proficiency in recognizing their knowledge limits?', "What does the Know-Unknow quadrant illustrate about a model's knowledge and comprehension?", "What techniques can be utilized to improve the ratio of 'Known Knows' to 'Unknown Knows'?", "What is the significance of 'knowing what you don’t know' in LLMs?", 'How do existing datasets like SQuAD2.0 and NewsQA test the self-knowledge of models?', "What were the findings of Srivastava et al. (2022) regarding LLMs' performance on unanswerable questions?", "What is the proposed method to probe the self-knowledge of LLMs through a distinct 'Value Head'?", "What is the pivotal question posed in the study regarding LLMs' self-knowledge?", 'How was the self-knowledge of LLMs quantified in this study

In [12]:
questions = questions_list.questions

In [13]:
questions

['What is the intrinsic capacity for self-knowledge within large language models (LLMs)?',
 'How can in-context learning and instruction tuning enhance self-knowledge in LLMs?',
 'What is the gap between the capabilities of LLMs and human proficiency in recognizing their knowledge limits?',
 "What does the Know-Unknow quadrant illustrate about a model's knowledge and comprehension?",
 "What techniques can be utilized to improve the ratio of 'Known Knows' to 'Unknown Knows'?",
 "What is the significance of 'knowing what you don’t know' in LLMs?",
 'How do existing datasets like SQuAD2.0 and NewsQA test the self-knowledge of models?',
 "What were the findings of Srivastava et al. (2022) regarding LLMs' performance on unanswerable questions?",
 "What is the proposed method to probe the self-knowledge of LLMs through a distinct 'Value Head'?",
 "What is the pivotal question posed in the study regarding LLMs' self-knowledge?",
 'How was the self-knowledge of LLMs quantified in this study?',

In [35]:
# `questions` is the list of question strings taken from the structured `questions_list` output above.
answers = []
for q in questions:
    answers.append(ask_pdf(rag_chain,q))

QUERY:  What is the intrinsic capacity for self-knowledge within large language models (LLMs)?
ANSWER Large language models (LLMs) exhibit an intrinsic capacity for self-knowledge, which allows them to identify unanswerable or unknowable questions. This ability is assessed through their responses, reflecting uncertainty when faced with such questions. However, there remains a significant gap between the self-knowledge of LLMs and that of humans.
QUERY:  How can in-context learning and instruction tuning enhance self-knowledge in LLMs?
ANSWER In-context learning and instruction tuning can enhance self-knowledge in large language models (LLMs) by providing richer contextual information and structured guidance, which helps the models better understand their limitations. These techniques improve the models' ability to recognize unanswerable questions and convey uncertainty in their responses. Experimental results indicate that models like InstructGPT show significant improvements in self-k

In [36]:
evaluations = []

for q,a in zip(questions, answers):
    # Ask the same RAG chain to judge each of its own answers against the paper
    evaluations.append(ask_pdf(rag_chain,f"Is this: {a} the correct answer to this question: {q} \
        according to the paper? Return ONLY '''YES''' or '''NO'''. Output:"))

evaluations

QUERY:  Is this: Large language models (LLMs) exhibit an intrinsic capacity for self-knowledge, which allows them to identify unanswerable or unknowable questions. This ability is assessed through their responses, reflecting uncertainty when faced with such questions. However, there remains a significant gap between the self-knowledge of LLMs and that of humans. the correct answer to this question: What is the intrinsic capacity for self-knowledge within large language models (LLMs)?         according to the paper? Return ONLY '''YES''' or '''NO'''. Output:
ANSWER YES
QUERY:  Is this: In-context learning and instruction tuning can enhance self-knowledge in large language models (LLMs) by providing richer contextual information and structured guidance, which helps the models better understand their limitations. These techniques improve the models' ability to recognize unanswerable questions and convey uncertainty in their responses. Experimental results indicate that models like Instruc

['YES',
 'YES',
 'YES',
 'YES',
 'YES',
 "'''YES'''",
 'YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES']

In [37]:
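# Exact-match counting: the entry "'''YES'''" (quotes included) is not counted,
# so only 15 of the 16 verdicts match and the score comes out as 93.75%
yes_count = evaluations.count('YES')
score = str(yes_count/len(evaluations) * 100) + "%"
print(score)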

93.75%
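
The 93.75% above is an artifact of exact string matching: one verdict came back as `'''YES'''` with the quotes included, so `evaluations.count('YES')` skipped it. A minimal sketch of a more forgiving scorer follows, assuming verdicts only ever differ by surrounding quotes, whitespace, or case. `normalize_verdict` and `yes_rate` are hypothetical names, and the list reproduces the 16 verdicts printed above.

```python
from typing import List


def normalize_verdict(raw: str) -> str:
    """Strip whitespace and surrounding quote characters, then uppercase."""
    return raw.strip().strip("'\"`").strip().upper()


def yes_rate(evaluations: List[str]) -> float:
    """Percentage of verdicts that normalize to YES."""
    if not evaluations:
        return 0.0
    yes = sum(1 for e in evaluations if normalize_verdict(e) == "YES")
    return yes / len(evaluations) * 100


# The 16 verdicts from the evaluation cell, including the quoted one.
evaluations = ["YES"] * 5 + ["'''YES'''"] + ["YES"] * 10

exact = evaluations.count("YES") / len(evaluations) * 100
print(f"exact match: {exact}%")
print(f"normalized:  {yes_rate(evaluations)}%")
```

Normalizing in code is cheaper than trying to make the prompt stricter, since the judge model can still ignore formatting instructions.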


A more idiomatic LangChain way to do this would be:

In [38]:
from langchain_core.output_parsers import StrOutputParser

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

prompt_eval_template = """
You take in context, a question and a generated answer and you output ONLY a score of YES if the answer is correct,
or NO if the answer is not correct.

<context>
{context}
</context>

<question>
{question}
</question>

<answer>
{answer}
</answer>
"""

prompt_eval = ChatPromptTemplate.from_template(prompt_eval_template)

answer_eval_chain = (
    {
        'context': lambda x: format_docs(x['context']),
        'question': lambda x: x['question'],
        'answer': lambda x: x['answer'],
    }
    | prompt_eval
    | llm
    | StrOutputParser()
)

In [39]:
evaluations = []
for q,a in zip(questions, answers):
    evaluations.append(answer_eval_chain.invoke({'context': pdf_docs, 'question': q, 'answer': a}))

In [40]:
evaluations

['YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES',
 'YES']

In [41]:
yes_count = evaluations.count('YES')
score = str(yes_count/len(evaluations) * 100) + "%"
print(score)

100.0%


In this example notebook we introduced two ideas:
1. Structured outputs
2. Simple evaluation of RAG answers using the 'LLM-as-a-judge' strategy