# Lab | Langchain Evaluation

## Intro

Pick different sets of data and re-run this notebook. The point is for you to understand all steps involve and the many different ways one can and should evaluate LLM applications.

What did you learn? - Let's discuss that in class

## LangChain: Evaluation

### Outline:

* Example generation
* Manual evaluation (and debuging)
* LLM-assisted evaluation

In [1]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY') 

### Example 1

#### Create our QandA application

In [4]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
from langchain.llms import OpenAI
from langchain.document_loaders import CSVLoader, TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.chains import LLMChain
from langchain.embeddings import HuggingFaceEmbeddings



In [5]:
file = r'C:\Users\ITCC\OneDrive\Desktop\langchain lab\OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()

In [6]:
pip install --upgrade --force-reinstall sentence-transformers

^C
Note: you may need to restart the kernel to use updated packages.


Collecting sentence-transformers
  Using cached sentence_transformers-4.1.0-py3-none-any.whl.metadata (13 kB)
Collecting transformers<5.0.0,>=4.41.0 (from sentence-transformers)
  Using cached transformers-4.51.3-py3-none-any.whl.metadata (38 kB)
Collecting tqdm (from sentence-transformers)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting torch>=1.11.0 (from sentence-transformers)
  Using cached torch-2.6.0-cp312-cp312-win_amd64.whl.metadata (28 kB)
Collecting scikit-learn (from sentence-transformers)
  Using cached scikit_learn-1.6.1-cp312-cp312-win_amd64.whl.metadata (15 kB)
Collecting scipy (from sentence-transformers)
  Using cached scipy-1.15.2-cp312-cp312-win_amd64.whl.metadata (60 kB)
Collecting huggingface-hub>=0.20.0 (from sentence-transformers)
  Using cached huggingface_hub-0.30.2-py3-none-any.whl.metadata (13 kB)
Collecting Pillow (from sentence-transformers)
  Using cached pillow-11.2.1-cp312-cp312-win_amd64.whl.metadata (9.1 kB)
Collecting typing_ex

  You can safely remove it manually.
  You can safely remove it manually.
  You can safely remove it manually.
  You can safely remove it manually.
  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 3.5.0 requires fsspec[http]<=2024.12.0,>=2023.1.0, but you have fsspec 2025.3.2 which is incompatible.
langchain-core 0.3.54 requires packaging<25,>=23.2, but you have packaging 25.0 which is incompatible.


In [8]:
pip install docarray

Collecting docarray
  Downloading docarray-0.41.0-py3-none-any.whl.metadata (36 kB)
Collecting rich>=13.1.0 (from docarray)
  Using cached rich-14.0.0-py3-none-any.whl.metadata (18 kB)
Collecting types-requests>=2.28.11.6 (from docarray)
  Downloading types_requests-2.32.0.20250328-py3-none-any.whl.metadata (2.3 kB)
Collecting markdown-it-py>=2.2.0 (from rich>=13.1.0->docarray)
  Using cached markdown_it_py-3.0.0-py3-none-any.whl.metadata (6.9 kB)
Collecting mdurl~=0.1 (from markdown-it-py>=2.2.0->rich>=13.1.0->docarray)
  Using cached mdurl-0.1.2-py3-none-any.whl.metadata (1.6 kB)
Downloading docarray-0.41.0-py3-none-any.whl (302 kB)
Using cached rich-14.0.0-py3-none-any.whl (243 kB)
Downloading types_requests-2.32.0.20250328-py3-none-any.whl (20 kB)
Using cached markdown_it_py-3.0.0-py3-none-any.whl (87 kB)
Using cached mdurl-0.1.2-py3-none-any.whl (10.0 kB)
Installing collected packages: types-requests, mdurl, markdown-it-py, rich, docarray
Successfully installed docarray-0.41.0 mar

In [9]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2", model_kwargs = {'device': 'cpu'})
).from_loaders([loader])



In [10]:
llm = ChatOpenAI(temperature = 0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)

#### Coming up with test datapoints

In [53]:
data[5]

Document(metadata={'source': 'C:\\Users\\ITCC\\OneDrive\\Desktop\\langchain lab\\OutdoorClothingCatalog_1000.csv', 'row': 5}, page_content=": 5\nname: Smooth Comfort Check Shirt, Slightly Fitted\ndescription: Our men's slightly fitted check shirt is the perfect choice for your wardrobe! Customers love how it fits right out of the dryer. Size & Fit: Slightly Fitted, Relaxed through the chest and sleeve with a slightly slimmer waist. Fabric & Care: 100% cotton poplin, with wrinkle-free performance that won't wash out. Our innovative TrueCoolآ® fabric wicks moisture away from your skin and helps it dry quickly. Additional Features: Traditional styling with a button-down collar and a single patch pocket. Imported.")

In [55]:
data[8]

Document(metadata={'source': 'C:\\Users\\ITCC\\OneDrive\\Desktop\\langchain lab\\OutdoorClothingCatalog_1000.csv', 'row': 8}, page_content=': 8\nname: Mountain Man Fleece Jacket\ndescription: Our best-value fleece jacket is designed with inspiration from our archives and made from 100% recycled polyester for unbeatable comfort and wear-anywhere style. \n\nSize & Fit: Slightly Fitted. Best with lightweight layer. Falls at hip. \n\nWhy We Love It: Our designers took inspiration from the  archives to create this ultrasoft fleece jacket. We love how the heritage styling is updated with a modern, slimming fit. Plus, itâ€™s made from 100% recycled fleece â€“ so you can stay warm and feel good about it. \n\nFabric & Care: 100% recycled fleece is soft, cozy and gentle on the planet. Ultraplush fibers resist wind for even more warmth. Machine wash and dry. \n\nAdditional Features: Features our classic Mount Katahdin logo. Bart Boot lace-inspired zippers and drawcord. Two lower zippered hand poc

#### Hard-coded examples

In [13]:
from langchain.prompts import PromptTemplate

In [57]:
from langchain.prompts import PromptTemplate
from langchain.schema import BaseOutputParser
from pydantic import BaseModel, Field

examples = [
    {
        "query": "Does the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

# Define the prompt template
prompt_template = PromptTemplate(
    input_variables=["query"],
    template="Examples:\n"
             "1. Query: Does the Cozy Comfort Pullover Set have side pockets?\n"
             "   Answer: Yes\n"
             "2. Query: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?\n"
             "   Answer: The DownTek collection\n"
             "Query: {query}\n"
             "Answer:"
)

# Define the output model
class Answer(BaseModel):
    answer: str = Field(description="The answer to the query")

# Create the output parser
class AnswerOutputParser(BaseOutputParser):
    def parse(self, text: str) -> Answer:
        # Split the response to get the answer
        answer = text.strip().split("Answer:")[-1].strip()
        return Answer(answer=answer)

# Initialize the LLM
llm = ChatOpenAI()

# Create the LLMChain
llm_chain = LLMChain(
    llm=llm,
    prompt=prompt_template,
    output_parser=AnswerOutputParser()
)

# Example query
query = "Is the Cozy Comfort Pullover Set available in different colors?"

# Run the chain
result = llm_chain.run({"query": query})

# Print the result
print(result)



answer='Yes, the Cozy Comfort Pullover Set is available in multiple colors such as gray, navy blue, and blush pink.'


#### LLM-Generated examples

In [15]:
from langchain.evaluation.qa import QAGenerateChain

In [16]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [17]:
llm_chain = LLMChain(llm=llm, prompt=prompt_template)

In [58]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)



In [71]:
new_examples[3]

{'qa_pairs': {'query': 'What is the composition of the fabric used for the Refresh Swimwear, V-Neck Tankini Contrasts top?',
  'answer': 'The fabric is composed of 82% recycled nylon and 18% Lycra® spandex for the body, and 90% recycled nylon and 10% Lycra® spandex for the lining.'}}

In [73]:
data[3]

Document(metadata={'source': 'C:\\Users\\ITCC\\OneDrive\\Desktop\\langchain lab\\OutdoorClothingCatalog_1000.csv', 'row': 3}, page_content=": 3\nname: Refresh Swimwear, V-Neck Tankini Contrasts\ndescription: Whether you're going for a swim or heading out on an SUP, this watersport-ready tankini top is designed to move with you and stay comfortable. All while looking great in an eye-catching colorblock style. \n\nSize & Fit\nFitted: Sits close to the body.\n\nWhy We Love It\nNot only does this swimtop feel good to wear, its fabric is good for the earth too. In recycled nylon, with Lycraآ® spandex for the perfect amount of stretch. \n\nFabric & Care\nThe premium Italian-blend is breathable, quick drying and abrasion resistant. \nBody in 82% recycled nylon with 18% Lycraآ® spandex. \nLined in 90% recycled nylon with 10% Lycraآ® spandex. \nUPF 50+ rated â€“ the highest rated sun protection possible. \nHandwash, line dry.\n\nAdditional Features\nLightweight racerback straps are easy to get 

In [74]:
d_flattened = [data['qa_pairs'] for data in new_examples]
d_flattened

[{'query': "What are the key features and specifications of the Women's Campside Oxfords as listed in the document?",
  'answer': "The key features and specifications of the Women's Campside Oxfords include:"},
 {'query': 'What are the dimensions of the small and medium sizes of the Recycled Waterhog Dog Mat, Chevron Weave?',
  'answer': 'The small size of the Recycled Waterhog Dog Mat, Chevron Weave has dimensions of 18" x 28", while the medium size has dimensions of 22.5" x 34.5".'},
 {'query': "What are some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece described in the document?",
  'answer': "Some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece include bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ rated fabric for sun protection, crossover no-slip straps, fully lined bottom for secure fit and maximum coverage, and the recommendation for machine wash and l

#### Combine examples

In [75]:
# examples += new_example
examples += d_flattened

In [76]:
examples[0]

{'query': 'Does the Cozy Comfort Pullover Set have side pockets?',
 'answer': 'Yes'}

In [77]:
qa.invoke(examples[0]["query"])



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


{'query': 'Does the Cozy Comfort Pullover Set have side pockets?',
 'result': 'The Cozy Comfort Pullover Set does not mention having side pockets in the provided description.'}

### Manual Evaluation - Fun part

In [65]:
import langchain
langchain.debug = True

In [66]:
qa.invoke(examples[0]["query"])

[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "Does the Cozy Comfort Pullover Set have side pockets?"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "Does the Cozy Comfort Pullover Set have side pockets?",
  "context": "All pockets have sturdy pocket bags and offer plenty of room for a wallet, cell phone and more.\n\nGusseted crotch for ease of movement.\n\nImported.<<<<>>>>>: 73\nname: Cozy Cuddles Knit Pullover Set\ndescription: Perfect for lounging, this knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out. \n\nSize & Fit \nPants are Favorite Fit: Sits lower on the waist. \nRelaxed Fit: Our most gen

{'query': 'Does the Cozy Comfort Pullover Set have side pockets?',
 'result': 'The Cozy Comfort Pullover Set does not mention having side pockets in the provided context.'}

In [67]:
# Turn off the debug mode
langchain.debug = False

### LLM assisted evaluation

In [68]:
examples += d_flattened

In [69]:
examples

[{'query': 'Does the Cozy Comfort Pullover Set have side pockets?',
  'answer': 'Yes'},
 {'query': 'What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection'},
 {'query': "What are the key features and specifications of the Women's Campside Oxfords as listed in the document?",
  'answer': "The key features and specifications of the Women's Campside Oxfords include:"},
 {'query': 'What are the dimensions of the small and medium sizes of the Recycled Waterhog Dog Mat, Chevron Weave?',
  'answer': 'The small size of the Recycled Waterhog Dog Mat, Chevron Weave has dimensions of 18" x 28", while the medium size has dimensions of 22.5" x 34.5".'},
 {'query': "What are some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece described in the document?",
  'answer': "Some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece include bright colors, ruffles, exclusive whimsical prints, fou

In [78]:
predictions = qa.batch(examples)



[1m> Entering new RetrievalQA chain...[0m


[1m> Entering new RetrievalQA chain...[0m


[1m> Entering new RetrievalQA chain...[0m


[1m> Entering new RetrievalQA chain...[0m


[1m> Entering new RetrievalQA chain...[0m


[1m> Entering new RetrievalQA chain...[0m


[1m> Entering new RetrievalQA chain...[0m


[1m> Entering new RetrievalQA chain...[0m


[1m> Entering new RetrievalQA chain...[0m


[1m> Entering new RetrievalQA chain...[0m


[1m> Entering new RetrievalQA chain...[0m


[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m

[1m> Finished chain.[0m

[1m> Finished chain.[0m



In [79]:
predictions

[{'query': 'Does the Cozy Comfort Pullover Set have side pockets?',
  'answer': 'Yes',
  'result': 'The Cozy Comfort Pullover Set does not mention having side pockets in the provided context.'},
 {'query': 'What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection',
  'result': 'The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.'},
 {'query': "What are the key features and specifications of the Women's Campside Oxfords as listed in the document?",
  'answer': "The key features and specifications of the Women's Campside Oxfords include:",
  'result': "The key features and specifications of the Women's Campside Oxfords are:\n\n- Ultracomfortable lace-to-toe Oxford style\n- Made of super-soft canvas material\n- Thick cushioning for comfort\n- Quality construction for a broken-in feel from the first wear\n- Approximate weight: 1 lb. 1 oz. per pair\n- Comfortable EVA innersole with Cleansport NXT® antimicrobial 

In [80]:
from langchain.evaluation.qa import QAEvalChain

In [81]:
llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)

In [82]:
graded_outputs = eval_chain.evaluate(examples, predictions)

In [83]:
graded_outputs

[{'results': 'INCORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'}]

In [130]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    # print("Predicted Grade: " + graded_outputs[i]['text'])
    print()

Example 0:
Question: Does the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: The Cozy Comfort Pullover Set does not mention having side pockets in the provided context.

Example 1:
Question: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.

Example 2:
Question: What are the key features and specifications of the Women's Campside Oxfords as listed in the document?
Real Answer: The key features and specifications of the Women's Campside Oxfords include:
Predicted Answer: The key features and specifications of the Women's Campside Oxfords are:

- Ultracomfortable lace-to-toe Oxford style
- Made of super-soft canvas material
- Thick cushioning for comfort
- Quality construction for a broken-in feel from the first wear
- Approximate weight: 1 lb. 1 oz. per pair
- Comfortable EVA innersole with Cleansport 

### Example 2
One can also easily evaluate your QA chains with the metrics offered in ragas

In [131]:
#rom langchain_huggingface import HuggingFaceEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
loader = TextLoader(r"C:\Users\ITCC\OneDrive\Desktop\langchain lab\nyc_text.txt")
index = VectorstoreIndexCreator(embedding=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2", model_kwargs = {'device': 'cpu'})).from_loaders([loader])


llm = ChatOpenAI(temperature= 0)
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=index.vectorstore.as_retriever(),
    return_source_documents=True,
)



In [132]:
# testing it out

question = "How did New York City get its name?"
result = qa_chain.invoke({"query": question})
result["result"]

'New York City was originally named New Amsterdam by Dutch colonists in 1626. When the city came under British control in 1664, it was renamed New York after King Charles II of England granted the lands to his brother, the Duke of York. The city has been continuously named New York since November 1674.'

In [133]:
result

{'query': 'How did New York City get its name?',
 'result': 'New York City was originally named New Amsterdam by Dutch colonists in 1626. When the city came under British control in 1664, it was renamed New York after King Charles II of England granted the lands to his brother, the Duke of York. The city has been continuously named New York since November 1674.',
 'source_documents': [Document(id='c8cb22ae-1d48-46fc-99b8-cdf97d2e393a', metadata={'source': 'C:\\Users\\ITCC\\OneDrive\\Desktop\\langchain lab\\nyc_text.txt'}, page_content='The city and its metropolitan area constitute the premier gateway for legal immigration to the United States. As many as 800 languages are spoken in New York, making it the most linguistically diverse city in the world. New York City is home to more than 3.2 million residents born outside the U.S., the largest foreign-born population of any city in the world as of 2016.New York City traces its origins to a trading post founded on the southern tip of Manh

Now in order to evaluate the qa system we generated a few relevant questions. We've generated a few question for you but feel free to add any you want.

In [134]:
eval_questions = [
    "What is the population of New York City as of 2020?",
    "Which borough of New York City has the highest population?",
    "What is the economic significance of New York City?",
    "How did New York City get its name?",
    "What is the significance of the Statue of Liberty in New York City?",
]

eval_answers = [
    "8,804,190",
    "Brooklyn",
    "New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter.",
    "New York City got its name when it came under British control in 1664. King Charles II of England granted the lands to his brother, the Duke of York, who named the city New York in his own honor.",
    "The Statue of Liberty in New York City holds great significance as a symbol of the United States and its ideals of liberty and peace. It greeted millions of immigrants who arrived in the U.S. by ship in the late 19th and early 20th centuries, representing hope and freedom for those seeking a better life. It has since become an iconic landmark and a global symbol of cultural diversity and freedom.",
]

examples = [
    {"query": q, "ground_truths": [eval_answers[i]]}
    for i, q in enumerate(eval_questions)
]

In [135]:
examples

[{'query': 'What is the population of New York City as of 2020?',
  'ground_truths': ['8,804,190']},
 {'query': 'Which borough of New York City has the highest population?',
  'ground_truths': ['Brooklyn']},
 {'query': 'What is the economic significance of New York City?',
  'ground_truths': ["New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter."]},
 {'query': 'How did New York City

#### Introducing RagasEvaluatorChain

`RagasEvaluatorChain` creates a wrapper around the metrics ragas provides (documented [here](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md)), making it easier to run these evaluation with langchain and langsmith.

The evaluator chain has the following APIs

- `__call__()`: call the `RagasEvaluatorChain` directly on the result of a QA chain.
- `evaluate()`: evaluate on a list of examples (with the input queries) and predictions (outputs from the QA chain). 
- `evaluate_run()`: method implemented that is called by langsmith evaluators to evaluate langsmith datasets.

lets see each of them in action to learn more.

In [136]:
result = qa_chain.invoke({"query": eval_questions[1]})
result["result"]

'Manhattan (New York County) has the highest population density of any borough in New York City.'

In [137]:
key_mapping = {
    "query": "question",
    "result": "answer",
    "source_documents": "contexts"
}

result_updated = {}
for old_key, new_key in key_mapping.items():
    if old_key in result:
        result_updated[new_key] = result[old_key]


In [138]:
result_updated

{'question': 'Which borough of New York City has the highest population?',
 'answer': 'Manhattan (New York County) has the highest population density of any borough in New York City.',
 'contexts': [Document(id='deda3776-d4fc-461f-b436-37099270785d', metadata={'source': 'C:\\Users\\ITCC\\OneDrive\\Desktop\\langchain lab\\nyc_text.txt'}, page_content="New York City is the most populous city in the United States, with 8,804,190 residents incorporating more immigration into the city than outmigration since the 2010 United States census. More than twice as many people live in New York City as compared to Los Angeles, the second-most populous U.S. city; and New York has more than three times the population of Chicago, the third-most populous U.S. city. New York City gained more residents between 2010 and 2020 (629,000) than any other U.S. city, and a greater amount than the total sum of the gains over the same decade of the next four largest U.S. cities, Los Angeles, Chicago, Houston, and P

In [50]:
pip install --no-cache-dir recordclass

Collecting recordclass
  Downloading recordclass-0.23.1-cp312-cp312-win_amd64.whl.metadata (44 kB)
Downloading recordclass-0.23.1-cp312-cp312-win_amd64.whl (249 kB)
Installing collected packages: recordclass
Successfully installed recordclass-0.23.1
Note: you may need to restart the kernel to use updated packages.


In [51]:
pip install ragas==0.1.9

Collecting ragas==0.1.9
  Downloading ragas-0.1.9-py3-none-any.whl.metadata (5.2 kB)
Collecting pysbd>=0.3.4 (from ragas==0.1.9)
  Downloading pysbd-0.3.4-py3-none-any.whl.metadata (6.1 kB)
Collecting appdirs (from ragas==0.1.9)
  Downloading appdirs-1.4.4-py2.py3-none-any.whl.metadata (9.0 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets->ragas==0.1.9)
  Using cached fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Collecting packaging (from datasets->ragas==0.1.9)
  Using cached packaging-24.2-py3-none-any.whl.metadata (3.2 kB)
Downloading ragas-0.1.9-py3-none-any.whl (86 kB)
Downloading pysbd-0.3.4-py3-none-any.whl (71 kB)
Downloading appdirs-1.4.4-py2.py3-none-any.whl (9.6 kB)
Using cached fsspec-2024.12.0-py3-none-any.whl (183 kB)
Using cached packaging-24.2-py3-none-any.whl (65 kB)
Installing collected packages: appdirs, pysbd, packaging, fsspec, ragas
  Attempting uninstall: packaging
    Found existing installation: packaging 25.0

In [139]:
from ragas.integrations.langchain import EvaluatorChain 
# from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
)

# create evaluation chains
faithfulness_chain   = EvaluatorChain(metric=faithfulness)
answer_rel_chain     = EvaluatorChain(metric=answer_relevancy)
context_rel_chain    = EvaluatorChain(metric=context_relevancy)
context_recall_chain = EvaluatorChain(metric=context_recall)

1. `__call__()`

Directly run the evaluation chain with the results from the QA chain. Do note that metrics like context_relevancy and faithfulness require the `source_documents` to be present.

In [140]:
# Recheck the result that we are going to validate.
result

{'query': 'Which borough of New York City has the highest population?',
 'result': 'Manhattan (New York County) has the highest population density of any borough in New York City.',
 'source_documents': [Document(id='deda3776-d4fc-461f-b436-37099270785d', metadata={'source': 'C:\\Users\\ITCC\\OneDrive\\Desktop\\langchain lab\\nyc_text.txt'}, page_content="New York City is the most populous city in the United States, with 8,804,190 residents incorporating more immigration into the city than outmigration since the 2010 United States census. More than twice as many people live in New York City as compared to Los Angeles, the second-most populous U.S. city; and New York has more than three times the population of Chicago, the third-most populous U.S. city. New York City gained more residents between 2010 and 2020 (629,000) than any other U.S. city, and a greater amount than the total sum of the gains over the same decade of the next four largest U.S. cities, Los Angeles, Chicago, Houston, 

**Faithfulness**

In [98]:
pip install nest_asyncio


Note: you may need to restart the kernel to use updated packages.


In [None]:
eval_result = faithfulness_chain(result_updated)
eval_result["faithfulness_score"]

In [141]:
result_updated = {
"question": result["query"],
"answer": result["result"],
"contexts": [doc.page_content for doc in result["source_documents"]]
} 

In [142]:
eval_result = await faithfulness_chain.ainvoke(result_updated)
print("Faithfulness Score:", eval_result["faithfulness"])

Faithfulness Score: 0.5


High faithfulness_score means that there are exact consistency between the source documents and the answer.

You can check lower faithfulness scores by changing the result (answer from LLM) or source_documents to something else.

In [144]:
fake_result = result_updated.copy()
fake_result["result_updated"] = "we are the champions"
eval_result = faithfulness_chain(fake_result)
eval_result["faithfulness"]

0.5

**Context Relevancy**

In [147]:
eval_result

{'question': 'Which borough of New York City has the highest population?',
 'answer': 'Manhattan (New York County) has the highest population density of any borough in New York City.',
 'contexts': ["New York City is the most populous city in the United States, with 8,804,190 residents incorporating more immigration into the city than outmigration since the 2010 United States census. More than twice as many people live in New York City as compared to Los Angeles, the second-most populous U.S. city; and New York has more than three times the population of Chicago, the third-most populous U.S. city. New York City gained more residents between 2010 and 2020 (629,000) than any other U.S. city, and a greater amount than the total sum of the gains over the same decade of the next four largest U.S. cities, Los Angeles, Chicago, Houston, and Phoenix, Arizona combined. New York City's population is about 44% of New York State's population, and about 39% of the population of the New York metropo

In [148]:
result_updated_context = {
"question": result["query"],
"answer": result["result"],
"ground_truth": "Manhattan (New York County) has the highest population density of any borough in New York City.",
"contexts": [doc.page_content for doc in result["source_documents"]]
} 

In [149]:
eval_result = await context_recall_chain.ainvoke(result_updated_context)
eval_result["context_recall"]

1.0

High context_recall_score means that the ground truth is present in the source documents.

You can check lower context recall scores by changing the source_documents to something else.

In [150]:
from langchain.schema import Document
fake_result = result_updated_context.copy()
fake_result["source_documents"] = [Document(page_content="I love christmas")]
eval_result = await context_recall_chain.ainvoke(fake_result)
eval_result["context_recall"]

1.0

2. `evaluate()`

Evaluate a list of inputs/queries and the outputs/predictions from the QA chain.

In [None]:
# run the queries as a batch for efficiency
predictions = qa_chain.batch(examples)

# evaluate
print("evaluating...")
r = faithfulness_chain.evaluate(examples, predictions)
r

In [154]:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

# Step 1: Get predictions
predictions = qa_chain.batch(examples)

# Step 2: Format data
formatted_data = []
for i, example in enumerate(examples):
    prediction = predictions[i] if i < len(predictions) else None
    if prediction:
        item = {
            "question": example["query"],
            "answer": prediction["result"],
            "contexts": [doc.page_content for doc in prediction["source_documents"]] if "source_documents" in prediction else [],
            "ground_truth": example["ground_truths"][0]
        }
        formatted_data.append(item)

# Step 2.5: Convert to Hugging Face Dataset
dataset = Dataset.from_list(formatted_data)

# Step 3: Evaluate
print("evaluating...")
results = evaluate(dataset, metrics=[faithfulness])
results


evaluating...


Evaluating:   0%|          | 0/5 [00:00<?, ?it/s]

{'faithfulness': 0.9000}

In [156]:
from ragas.metrics import context_recall

# Step 3: Evaluate context recall
print("evaluating context recall...")
results = evaluate(dataset, metrics=[context_recall])
results


evaluating context recall...


Evaluating:   0%|          | 0/5 [00:00<?, ?it/s]

{'context_recall': 0.8000}