# Lab | Langchain Evaluation

## Intro

Pick different sets of data and re-run this notebook. The point is for you to understand all the steps involved and the many different ways one can and should evaluate LLM applications.

What did you learn? - Let's discuss that in class

## LangChain: Evaluation

### Outline:

* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation

In [1]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY') 

### Example 1

#### Create our QandA application

In [2]:
%pip install langchain-huggingface

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import CSVLoader, TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.chains import LLMChain


Note: you may need to restart the kernel to use updated packages.


In [1]:
import os
import pandas as pd
from langchain_community.document_loaders import CSVLoader

file = r'data\OutdoorClothingCatalog_1000.csv'  # Raw string for Windows
if not os.path.exists(file):
    raise FileNotFoundError(f"File not found: {file}")

# Load with pandas for DataFrame operations
data = pd.read_csv(file, encoding='utf-8')

# Load with LangChain CSVLoader for document processing
loader = CSVLoader(file_path=file, encoding='utf-8', source_column='description')
docs = loader.load()
data.head()

Unnamed: 0.1,Unnamed: 0,name,description
0,0,Women's Campside Oxfords,This ultracomfortable lace-to-toe Oxford boast...
1,1,"Recycled Waterhog Dog Mat, Chevron Weave",Protect your floors from spills and splashing ...
2,2,Infant and Toddler Girls' Coastal Chill Swimsu...,"She'll love the bright colors, ruffles and exc..."
3,3,"Refresh Swimwear, V-Neck Tankini Contrasts",Whether you're going for a swim or heading out...
4,4,EcoFlex 3L Storm Pants,Our new TEK O2 technology makes our four-seaso...


In [2]:
from langchain.vectorstores import DocArrayInMemorySearch
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.indexes import VectorstoreIndexCreator

In [3]:
file = r'data\OutdoorClothingCatalog_1000.csv'  # Same file as in the loading cell above
if not os.path.exists(file):
    raise FileNotFoundError(f"File not found: {file}")

loader = CSVLoader(file_path=file, encoding='utf-8', source_column='description')
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2", model_kwargs={'device': 'cpu'})
).from_loaders([loader])



In [42]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
llm = ChatOpenAI(temperature=0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=index.vectorstore.as_retriever(),
    verbose=True,
    chain_type_kwargs={"document_separator": "<<<<>>>>>"}
)

#### Coming up with test datapoints

In [7]:
data.iloc[10]

Unnamed: 0                                                    10
name                           Cozy Comfort Pullover Set, Stripe
description    Perfect for lounging, this striped knit set li...
Name: 10, dtype: object

In [8]:
data.iloc[11]

Unnamed: 0                                                    11
name                  Ultra-Lofty 850 Stretch Down Hooded Jacket
description    This technical stretch down jacket from our Do...
Name: 11, dtype: object

#### Hard-coded examples

In [9]:
from langchain.prompts import PromptTemplate

In [11]:
from langchain.prompts import PromptTemplate
from langchain.schema import BaseOutputParser
from pydantic import BaseModel, Field
from langchain.chains import LLMChain

examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

# Define the prompt template
prompt_template = PromptTemplate(
    input_variables=["query"],
    template="Examples:\n"
             "1. Query: Do the Cozy Comfort Pullover Set have side pockets?\n"
             "   Answer: Yes\n"
             "2. Query: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?\n"
             "   Answer: The DownTek collection\n"
             "Query: {query}\n"
             "Answer:"
)

# Define the output model
class Answer(BaseModel):
    answer: str = Field(description="The answer to the query")

# Create the output parser
class AnswerOutputParser(BaseOutputParser):
    def parse(self, text: str) -> Answer:
        # Split the response to get the answer
        answer = text.strip().split("Answer:")[-1].strip()
        return Answer(answer=answer)

# Initialize the LLM
# llm = OpenAI()
llm = ChatOpenAI()

# Create the LLMChain
llm_chain = LLMChain(
    llm=llm,
    prompt=prompt_template,
    output_parser=AnswerOutputParser()
)

# Example query
query = "Is the Cozy Comfort Pullover Set available in different colors?"

# Run the chain
result = llm_chain.run({"query": query})

# Print the result
print(result)



answer='Yes, it is available in multiple colors such as grey, navy, and burgundy.'


#### LLM-Generated examples

In [12]:
from langchain.evaluation.qa import QAGenerateChain

In [13]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [14]:
llm_chain = LLMChain(llm=llm, prompt=prompt_template)

In [15]:
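# Note: iterating over a DataFrame slice yields its column labels, not its rows,
# so the chain below only sees the column names (hence the odd questions in the
# output). Pass docs[:5], the loaded documents, to generate questions from the
# product descriptions instead.
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)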



In [16]:
new_examples[0]

{'qa_pairs': {'query': 'What is the title of the document?',
  'answer': 'The title of the document is "Unnamed: 0."'}}

In [19]:
data.iloc[0]

Unnamed: 0                                                     0
name                                    Women's Campside Oxfords
description    This ultracomfortable lace-to-toe Oxford boast...
Name: 0, dtype: object

In [20]:
d_flattened = [ex['qa_pairs'] for ex in new_examples]
d_flattened

[{'query': 'What is the title of the document?',
  'answer': 'The title of the document is "Unnamed: 0."'},
 {'query': 'What is the sole content of the document provided?',
  'answer': 'The name.'},
 {'query': 'What is one key requirement for the role of a teacher?',
  'answer': 'One key requirement for the role of a teacher is the ability to effectively describe information.'}]

#### Combine examples

In [21]:
# examples += new_example
examples += d_flattened

In [22]:
examples[0]

{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'answer': 'Yes'}
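
The run of spaces inside the query comes from the backslash line continuation in the hard-coded examples: the indentation of the continued line becomes part of the string. Below is a minimal sketch of one way to normalize such queries before sending them to the chain. The helper name `normalize_query` is hypothetical and not part of LangChain.

```python
def normalize_query(query: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return " ".join(query.split())


# Same literal as in the hard-coded examples: the eight spaces of
# indentation on the continued line end up inside the string.
raw = "Do the Cozy Comfort Pullover Set\
        have side pockets?"

clean = normalize_query(raw)
print(repr(raw))
print(repr(clean))
```

Applying the helper to every `query` before calling `qa.invoke` would keep the stray whitespace out of both the retrieval step and the prompt.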

In [23]:
qa.invoke(examples[0]["query"])



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'result': 'Based on the provided context, the Cozy Comfort Pullover Set does not have side pockets. It has side seam pockets, a back zip pocket, two elastic mesh water bottle pockets, and a top compartment with a pocket with a double-seal zipper for quick access.'}

### Manual Evaluation - Fun part

In [24]:
import langchain
langchain.debug = True

In [25]:
qa.invoke(examples[0]["query"])

[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "Do the Cozy Comfort Pullover Set        have side pockets?"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "Do the Cozy Comfort Pullover Set        have side pockets?",
  "context": "Side seam pockets and back zip pocket, with mesh insert for quick drainage.<<<<>>>>>Two elastic mesh water bottle pockets.\r\nTop compartment includes pocket with double-seal zipper for quick access.\r\nSide<<<<>>>>>All pockets have sturdy pocket bags and offer plenty of room for a wallet, cell phone and more.\r\n\r\nGusseted crotch for ease of movement.\r\n\r\nImported.<<<<>>>>>Two elastic mesh water bottle pockets.\r\nTop compartment includes pocket with double-se"
}
[32;1

{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'result': 'Yes, the Cozy Comfort Pullover Set has side seam pockets and a back zip pocket, as well as two elastic mesh water bottle pockets.'}
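
With `langchain.debug = True`, the trace shows the exact context sent to the LLM. Because the chain was built with `document_separator` set to `<<<<>>>>>`, the context string can be split back into the retrieved chunks to check whether they are about the right product. A minimal sketch, using only the first two chunks copied from the trace above (the rest are omitted):

```python
SEPARATOR = "<<<<>>>>>"  # the document_separator passed to RetrievalQA above

# First part of the "context" value from the debug trace (later chunks omitted).
context = (
    "Side seam pockets and back zip pocket, with mesh insert for quick drainage."
    "<<<<>>>>>Two elastic mesh water bottle pockets.\r\n"
    "Top compartment includes pocket with double-seal zipper for quick access.\r\nSide"
)

# Split the stuffed context back into the individual retrieved chunks.
chunks = context.split(SEPARATOR)

for i, chunk in enumerate(chunks):
    mentions_product = "Cozy Comfort" in chunk
    print(f"chunk {i}: mentions product = {mentions_product}")
    print("   ", " ".join(chunk.split()))
```

Neither of these chunks names the Cozy Comfort Pullover Set, which suggests the retriever matched on "pockets" in other product descriptions. That is the kind of retrieval problem manual inspection is meant to surface.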

In [26]:
# Turn off the debug mode
langchain.debug = False

### LLM assisted evaluation

In [27]:
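# Note: d_flattened was already appended in the "Combine examples" cell,
# so running this cell adds the generated examples a second time.
examples += d_flattened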

In [28]:
examples

[{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
  'answer': 'Yes'},
 {'query': 'What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection'},
 {'query': 'What is the title of the document?',
  'answer': 'The title of the document is "Unnamed: 0."'},
 {'query': 'What is the sole content of the document provided?',
  'answer': 'The name.'},
 {'query': 'What is one key requirement for the role of a teacher?',
  'answer': 'One key requirement for the role of a teacher is the ability to effectively describe information.'},
 {'query': 'What is the title of the document?',
  'answer': 'The title of the document is "Unnamed: 0."'},
 {'query': 'What is the sole content of the document provided?',
  'answer': 'The name.'},
 {'query': 'What is one key requirement for the role of a teacher?',
  'answer': 'One key requirement for the role of a teacher is the ability to effectively describe information.'}]
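
The list above contains each LLM-generated pair twice, because `d_flattened` is appended to `examples` in two different cells. Duplicates cost extra LLM calls and skew any aggregate score. Here is a minimal sketch of removing them by query while keeping the original order. `dedupe_examples` is a hypothetical helper, and the demo list is a shortened stand-in with abbreviated answers.

```python
def dedupe_examples(examples):
    """Keep the first example for each distinct query, preserving order."""
    seen = set()
    unique = []
    for ex in examples:
        if ex["query"] not in seen:
            seen.add(ex["query"])
            unique.append(ex)
    return unique


# Shortened stand-in for the list printed above (answers abbreviated).
examples_demo = [
    {"query": "What is the title of the document?", "answer": "..."},
    {"query": "What is the sole content of the document provided?", "answer": "..."},
    {"query": "What is the title of the document?", "answer": "..."},
    {"query": "What is the sole content of the document provided?", "answer": "..."},
]

unique_examples = dedupe_examples(examples_demo)
print(len(examples_demo), "->", len(unique_examples))
```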

In [29]:
predictions = qa.batch(examples)



> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Entering new RetrievalQA chain...

> Finished chain.

> Finished chain.

> Finished chain.

> Finished chain.

> Finished chain.

> Finished chain.

> Finished chain.

> Finished chain.


In [30]:
predictions

[{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
  'answer': 'Yes',
  'result': 'Yes, the Cozy Comfort Pullover Set has side seam pockets and a back zip pocket, as well as two elastic mesh water bottle pockets.'},
 {'query': 'What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection',
  'result': 'The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.'},
 {'query': 'What is the title of the document?',
  'answer': 'The title of the document is "Unnamed: 0."',
  'result': 'The title of the document is "Maine Guide Canvas Rod Travel Case."'},
 {'query': 'What is the sole content of the document provided?',
  'answer': 'The name.',
  'result': 'The document provided includes information about an Executive Leather Briefcase, its description, specifications, and fabric & care details.'},
 {'query': 'What is one key requirement for the role of a teacher?',
  'answer': 'One key requirem

In [31]:
from langchain.evaluation.qa import QAEvalChain

In [32]:
llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)

In [33]:
graded_outputs = eval_chain.evaluate(examples, predictions)

In [34]:
graded_outputs

[{'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'INCORRECT'},
 {'results': 'INCORRECT'},
 {'results': 'CORRECT'},
 {'results': 'INCORRECT'},
 {'results': 'INCORRECT'},
 {'results': 'CORRECT'}]
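
The grades can be summarized as a single accuracy figure. This is a minimal sketch that assumes each graded output stores its grade under the `results` key, as in the output above. The `accuracy` helper is hypothetical and not part of LangChain.

```python
# Grades copied from the output above.
graded_outputs = [
    {"results": "CORRECT"},
    {"results": "CORRECT"},
    {"results": "INCORRECT"},
    {"results": "INCORRECT"},
    {"results": "CORRECT"},
    {"results": "INCORRECT"},
    {"results": "INCORRECT"},
    {"results": "CORRECT"},
]


def accuracy(graded, key="results"):
    """Fraction of examples the grader marked as exactly CORRECT."""
    grades = [g[key].strip().upper() for g in graded]
    return sum(grade == "CORRECT" for grade in grades) / len(grades)


print(accuracy(graded_outputs))  # 0.5
```

The comparison uses equality rather than a substring check, since `"CORRECT"` is also contained in `"INCORRECT"`.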

In [35]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    # print("Predicted Grade: " + graded_outputs[i]['results'])
    print()

Example 0:
Question: Do the Cozy Comfort Pullover Set        have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set has side seam pockets and a back zip pocket, as well as two elastic mesh water bottle pockets.

Example 1:
Question: What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.

Example 2:
Question: What is the title of the document?
Real Answer: The title of the document is "Unnamed: 0."
Predicted Answer: The title of the document is "Maine Guide Canvas Rod Travel Case."

Example 3:
Question: What is the sole content of the document provided?
Real Answer: The name.
Predicted Answer: The document provided includes information about an Executive Leather Briefcase, its description, specifications, and fabric & care details.

Example 4:
Question: What is one key requirement for the role of a

### Example 2
You can also evaluate your QA chains with the metrics offered by ragas.

In [45]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import TextLoader
import os

file = r'data\nyc_text.txt'  # Raw string for Windows
if not os.path.exists(file):
    raise FileNotFoundError(f"File not found: {file}")

# Load the text file as plain text (not as CSV)
with open(file, 'r', encoding='utf-8') as f:
    text_data = f.read()

# Load with LangChain TextLoader for document processing
loader = TextLoader(file_path=file, encoding='utf-8')
docs = loader.load()

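llm = ChatOpenAI(temperature=0)
# Note: `index` is still the catalog index built in Example 1. The NYC text
# loaded above is never indexed, so the retriever returns catalog documents
# and the chain answers "I don't know" below. Rebuild the index from `loader`
# (as in Example 1) to query the NYC text.
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=index.vectorstore.as_retriever(),
    return_source_documents=True,
)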

In [46]:
# testing it out

question = "How did New York City get its name?"
result = qa_chain.invoke({"query": question})
result["result"]

"I don't know, for historical information like that, I recommend checking reliable historical sources or visiting a library for accurate information on the origin of New York City's name."

In [47]:
result

{'query': 'How did New York City get its name?',
 'result': "I don't know, for historical information like that, I recommend checking reliable historical sources or visiting a library for accurate information on the origin of New York City's name.",
 'source_documents': [Document(metadata={'source': 'As part of our partnership with National Park Foundation, we\'re excited to offer this Find Your Park collectible patch – inspiring adventurers of all ages to get out there and find their own special connection to our national parks. \r\n\r\nSpecs: Dimensions: 2"H x 3½"W. \r\n\r\nWhy We Love It: We\'re dedicated to supporting organizations that help people get outside and we think we\'ve found our perfect match. The National Park Foundation, the official charitable partner of the National Park Service, works to protect an amazing network of more than 400 national park sites, many of which you\'ll find just a short trip away.\r\n\r\nFabric & Care: Machine wash and dry. \r\n\r\nAdditional Fe

To evaluate the QA system, we need a few relevant questions with reference answers. We've generated a few questions for you, but feel free to add your own.

In [48]:
eval_questions = [
    "What is the population of New York City as of 2020?",
    "Which borough of New York City has the highest population?",
    "What is the economic significance of New York City?",
    "How did New York City get its name?",
    "What is the significance of the Statue of Liberty in New York City?",
]

eval_answers = [
    "8,804,190",
    "Brooklyn",
    "New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter.",
    "New York City got its name when it came under British control in 1664. King Charles II of England granted the lands to his brother, the Duke of York, who named the city New York in his own honor.",
    "The Statue of Liberty in New York City holds great significance as a symbol of the United States and its ideals of liberty and peace. It greeted millions of immigrants who arrived in the U.S. by ship in the late 19th and early 20th centuries, representing hope and freedom for those seeking a better life. It has since become an iconic landmark and a global symbol of cultural diversity and freedom.",
]

examples = [
    {"query": q, "ground_truths": [eval_answers[i]]}
    for i, q in enumerate(eval_questions)
]

In [49]:
examples

[{'query': 'What is the population of New York City as of 2020?',
  'ground_truths': ['8,804,190']},
 {'query': 'Which borough of New York City has the highest population?',
  'ground_truths': ['Brooklyn']},
 {'query': 'What is the economic significance of New York City?',
  'ground_truths': ["New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter."]},
 {'query': 'How did New York City

#### Introducing RagasEvaluatorChain

`RagasEvaluatorChain` creates a wrapper around the metrics ragas provides (documented [here](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md)), making it easier to run these evaluations with LangChain and LangSmith. In the code below, the wrapper is imported as `EvaluatorChain` from `ragas.integrations.langchain`.

The evaluator chain has the following APIs:

- `__call__()`: call the evaluator chain directly on the result of a QA chain.
- `evaluate()`: evaluate on a list of examples (with the input queries) and predictions (outputs from the QA chain).
- `evaluate_run()`: called by LangSmith evaluators to evaluate LangSmith datasets.

Let's see each of them in action.

In [50]:
result = qa_chain.invoke({"query": eval_questions[1]})
result["result"]

"I don't know."

In [51]:
key_mapping = {
    "query": "question",
    "result": "answer",
    "source_documents": "contexts"
}

result_updated = {}
for old_key, new_key in key_mapping.items():
    if old_key in result:
        result_updated[new_key] = result[old_key]


In [52]:
result_updated

{'question': 'Which borough of New York City has the highest population?',
 'answer': "I don't know.",
 'contexts': [Document(metadata={'source': 'As part of our partnership with National Park Foundation, we\'re excited to offer this Find Your Park collectible patch – inspiring adventurers of all ages to get out there and find their own special connection to our national parks. \r\n\r\nSpecs: Dimensions: 2"H x 3½"W. \r\n\r\nWhy We Love It: We\'re dedicated to supporting organizations that help people get outside and we think we\'ve found our perfect match. The National Park Foundation, the official charitable partner of the National Park Service, works to protect an amazing network of more than 400 national park sites, many of which you\'ll find just a short trip away.\r\n\r\nFabric & Care: Machine wash and dry. \r\n\r\nAdditional Features: Simply iron on to a backpack or jacket; or sew on fabric surface for extra durability. Get your 2019 National Park annual pass with us. Learn more 

In [None]:
# !pip install --no-cache-dir recordclass

In [None]:
# !pip install ragas==0.1.9

In [54]:
%pip install ragas

from ragas.integrations.langchain import EvaluatorChain 
# from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
)

# create evaluation chains
faithfulness_chain   = EvaluatorChain(metric=faithfulness)
answer_rel_chain     = EvaluatorChain(metric=answer_relevancy)
context_recall_chain = EvaluatorChain(metric=context_recall)

Note: you may need to restart the kernel to use updated packages.


1. `__call__()`

Directly run the evaluation chain with the results from the QA chain. Do note that metrics like context_relevancy and faithfulness require the `source_documents` to be present.

In [55]:
# Recheck the result that we are going to validate.
result

{'query': 'Which borough of New York City has the highest population?',
 'result': "I don't know.",
 'source_documents': [Document(metadata={'source': 'As part of our partnership with National Park Foundation, we\'re excited to offer this Find Your Park collectible patch – inspiring adventurers of all ages to get out there and find their own special connection to our national parks. \r\n\r\nSpecs: Dimensions: 2"H x 3½"W. \r\n\r\nWhy We Love It: We\'re dedicated to supporting organizations that help people get outside and we think we\'ve found our perfect match. The National Park Foundation, the official charitable partner of the National Park Service, works to protect an amazing network of more than 400 national park sites, many of which you\'ll find just a short trip away.\r\n\r\nFabric & Care: Machine wash and dry. \r\n\r\nAdditional Features: Simply iron on to a backpack or jacket; or sew on fabric surface for extra durability. Get your 2019 National Park annual pass with us. Learn 

**Faithfulness**

In [57]:
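eval_result = faithfulness_chain(result_updated)
print(eval_result.keys())  # See what keys are available
# The score is stored under the metric name ("faithfulness"), so this lookup
# of "faithfulness_score" falls back to returning the whole result dict.
eval_result.get("faithfulness_score", eval_result)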

dict_keys(['question', 'answer', 'contexts', 'faithfulness'])


{'question': 'Which borough of New York City has the highest population?',
 'answer': "I don't know.",
 'contexts': [Document(metadata={'source': 'As part of our partnership with National Park Foundation, we\'re excited to offer this Find Your Park collectible patch – inspiring adventurers of all ages to get out there and find their own special connection to our national parks. \r\n\r\nSpecs: Dimensions: 2"H x 3½"W. \r\n\r\nWhy We Love It: We\'re dedicated to supporting organizations that help people get outside and we think we\'ve found our perfect match. The National Park Foundation, the official charitable partner of the National Park Service, works to protect an amazing network of more than 400 national park sites, many of which you\'ll find just a short trip away.\r\n\r\nFabric & Care: Machine wash and dry. \r\n\r\nAdditional Features: Simply iron on to a backpack or jacket; or sew on fabric surface for extra durability. Get your 2019 National Park annual pass with us. Learn more 

A high faithfulness score (returned under the `faithfulness` key) means that the answer is consistent with the source documents.

You can check lower faithfulness scores by changing the result (answer from LLM) or source_documents to something else.
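
As a rough intuition, and assuming the definition given in the ragas metrics documentation, faithfulness is the fraction of claims in the answer that are supported by the retrieved contexts. In ragas an LLM extracts and judges the claims. The sketch below hard-codes the verdicts instead, with made-up claims used only for illustration. `faithfulness_ratio` is a hypothetical helper, not a ragas function.

```python
def faithfulness_ratio(verdicts):
    """Fraction of answer claims judged as supported by the contexts.

    Returning 0.0 for an answer with no claims is an arbitrary choice
    made for this sketch.
    """
    if not verdicts:
        return 0.0
    return sum(1 for supported in verdicts if supported) / len(verdicts)


# Illustrative claims with hand-assigned verdicts (not produced by ragas).
claim_verdicts = {
    "The patch measures 2 by 3.5 inches.": True,    # stated in the context
    "The patch is made of leather.": False,         # not in the context
}

score = faithfulness_ratio(list(claim_verdicts.values()))
print(score)  # 0.5
```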

In [60]:
fake_result = result.copy()
fake_result["answer"] = "we are the champions"
fake_result["question"] = fake_result.pop("query", None)
fake_result["contexts"] = fake_result.pop("source_documents", None)
eval_result = faithfulness_chain(fake_result)
# Print the whole result, then read the score stored under the metric name
print(eval_result)
faithfulness_score = eval_result.get("faithfulness")
faithfulness_score

{'result': "I don't know.", 'answer': 'we are the champions', 'question': 'Which borough of New York City has the highest population?', 'contexts': [Document(metadata={'source': 'As part of our partnership with National Park Foundation, we\'re excited to offer this Find Your Park collectible patch – inspiring adventurers of all ages to get out there and find their own special connection to our national parks. \r\n\r\nSpecs: Dimensions: 2"H x 3½"W. \r\n\r\nWhy We Love It: We\'re dedicated to supporting organizations that help people get outside and we think we\'ve found our perfect match. The National Park Foundation, the official charitable partner of the National Park Service, works to protect an amazing network of more than 400 national park sites, many of which you\'ll find just a short trip away.\r\n\r\nFabric & Care: Machine wash and dry. \r\n\r\nAdditional Features: Simply iron on to a backpack or jacket; or sew on fabric surface for extra durability. Get your 2019 National Park 

**Context Recall**

In [64]:
# Find the matching example for the question
question = result_updated["question"]
ground_truth = None
for ex in examples:
    if ex["query"] == question:
        ground_truth = ex["ground_truths"]
        break

# Prepare input with all required keys
input_dict = result_updated.copy()
# Extract string from list if needed
if isinstance(ground_truth, list) and len(ground_truth) == 1:
    input_dict["ground_truth"] = ground_truth[0]
else:
    input_dict["ground_truth"] = ground_truth

eval_result = context_recall_chain(input_dict)
# The score is stored under "context_recall", so this lookup of
# "context_recall_score" falls back to returning the whole result dict.
eval_result.get("context_recall_score", eval_result)

{'question': 'Which borough of New York City has the highest population?',
 'answer': "I don't know.",
 'contexts': [Document(metadata={'source': 'As part of our partnership with National Park Foundation, we\'re excited to offer this Find Your Park collectible patch – inspiring adventurers of all ages to get out there and find their own special connection to our national parks. \r\n\r\nSpecs: Dimensions: 2"H x 3½"W. \r\n\r\nWhy We Love It: We\'re dedicated to supporting organizations that help people get outside and we think we\'ve found our perfect match. The National Park Foundation, the official charitable partner of the National Park Service, works to protect an amazing network of more than 400 national park sites, many of which you\'ll find just a short trip away.\r\n\r\nFabric & Care: Machine wash and dry. \r\n\r\nAdditional Features: Simply iron on to a backpack or jacket; or sew on fabric surface for extra durability. Get your 2019 National Park annual pass with us. Learn more 

A high context recall score (returned under the `context_recall` key) means that the ground truth is present in the source documents.

You can check lower context recall scores by changing the source_documents to something else.

In [68]:
from langchain.schema import Document

# Prepare the required input keys
input_dict = {
    "question": result["query"],
    "contexts": [Document(page_content="I love christmas")],
    "ground_truth": "Brooklyn"  # Provide the correct answer for the question
}

eval_result = context_recall_chain(input_dict)
eval_result.get("context_recall_score", eval_result)

{'question': 'Which borough of New York City has the highest population?',
 'contexts': [Document(metadata={}, page_content='I love christmas')],
 'ground_truth': 'Brooklyn',
 'context_recall': 0.5}
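
As the output shows, the chain returns its inputs plus one extra entry keyed by the metric name (`context_recall` here, `faithfulness` earlier), so a lookup of `context_recall_score` finds nothing. This is a minimal sketch of reading the score. `extract_score` is a hypothetical helper, and the dict is a simplified copy of the result above with the `Document` replaced by a plain string.

```python
def extract_score(eval_result: dict, metric_name: str):
    """Return the score stored under the metric's name, or None if absent."""
    return eval_result.get(metric_name)


# Simplified copy of the result printed above.
eval_result_demo = {
    "question": "Which borough of New York City has the highest population?",
    "contexts": ["I love christmas"],
    "ground_truth": "Brooklyn",
    "context_recall": 0.5,
}

print(extract_score(eval_result_demo, "context_recall"))        # 0.5
print(extract_score(eval_result_demo, "context_recall_score"))  # None
```

A score of 0.5 for a context that has nothing to do with Brooklyn is a reminder that these metrics are judged by an LLM and should be spot-checked by hand.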

2. `evaluate()`

Evaluate a list of inputs/queries and the outputs/predictions from the QA chain.

In [70]:
# run the queries as a batch for efficiency
predictions = qa_chain.batch(examples)

# evaluate
print("evaluating...")
results = []
for ex, pred in zip(examples, predictions):
    # Prepare input for the evaluator chain
    eval_input = {
        "question": ex["query"],
        "answer": pred["result"],
        "contexts": pred.get("source_documents", []),
    }
    results.append(faithfulness_chain(eval_input))
results

evaluating...


[{'question': 'What is the population of New York City as of 2020?',
  'answer': "I don't know.",
  'contexts': [Document(metadata={'source': 'As part of our partnership with National Park Foundation, we\'re excited to offer this Find Your Park collectible patch – inspiring adventurers of all ages to get out there and find their own special connection to our national parks. \r\n\r\nSpecs: Dimensions: 2"H x 3½"W. \r\n\r\nWhy We Love It: We\'re dedicated to supporting organizations that help people get outside and we think we\'ve found our perfect match. The National Park Foundation, the official charitable partner of the National Park Service, works to protect an amazing network of more than 400 national park sites, many of which you\'ll find just a short trip away.\r\n\r\nFabric & Care: Machine wash and dry. \r\n\r\nAdditional Features: Simply iron on to a backpack or jacket; or sew on fabric surface for extra durability. Get your 2019 National Park annual pass with us. Learn more at b

In [72]:
# evaluate context recall
print("evaluating...")
r = [
    context_recall_chain({
        "question": ex["query"],
        "answer": pred["result"],
        "contexts": pred.get("source_documents", []),
        # Unwrap a single-item ground_truths list into a plain string
        "ground_truth": (
            ex["ground_truths"][0]
            if "ground_truths" in ex and len(ex["ground_truths"]) == 1
            else ex.get("ground_truths", None)
        ),
    })
    for ex, pred in zip(examples, predictions)
]
r

evaluating...


[{'question': 'What is the population of New York City as of 2020?',
  'answer': "I don't know.",
  'contexts': [Document(metadata={'source': 'As part of our partnership with National Park Foundation, we\'re excited to offer this Find Your Park collectible patch – inspiring adventurers of all ages to get out there and find their own special connection to our national parks. \r\n\r\nSpecs: Dimensions: 2"H x 3½"W. \r\n\r\nWhy We Love It: We\'re dedicated to supporting organizations that help people get outside and we think we\'ve found our perfect match. The National Park Foundation, the official charitable partner of the National Park Service, works to protect an amazing network of more than 400 national park sites, many of which you\'ll find just a short trip away.\r\n\r\nFabric & Care: Machine wash and dry. \r\n\r\nAdditional Features: Simply iron on to a backpack or jacket; or sew on fabric surface for extra durability. Get your 2019 National Park annual pass with us. Learn more at b