# Lab | LangChain Evaluation

## Intro

Pick different sets of data and re-run this notebook. The point is for you to understand all the steps involved and the many different ways one can and should evaluate LLM applications.

What did you learn? - Let's discuss that in class

Goal: This whole section is about learning how to test your AI. Not just build it — but test if it answers right, smart, and on point.

## LangChain: Evaluation

### Outline:

Example Generation 🧠

Manual Evaluation (you be the judge) ✋

LLM-assisted Evaluation (AI be the judge) 🤖

In [None]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY') 

### Example 1

#### Create our QandA application

In [None]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
from langchain.llms import OpenAI
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.document_loaders import CSVLoader, TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.chains import LLMChain


In [None]:
file = 'data/OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()

In [None]:
!pip install --upgrade --force-reinstall sentence-transformers

## makin' sure the sentence embedding tool is fresh

In [None]:
from langchain_community.document_loaders import CSVLoader
from langchain_community.embeddings import HuggingFaceEmbeddings


In [None]:
!pip install sentence-transformers
!pip install langchainhub

In [None]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2", model_kwargs = {'device': 'cpu'})
).from_loaders([loader])


## turnin' all your text into brainy math (embeddings)
#  so AI can find what it needs real fast


In [None]:
llm = ChatOpenAI(temperature = 0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)


#  settin' up our smart Q&A bot 
#  strict tone, no randomness, answers only from the catalog
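To see what `chain_type="stuff"` and `document_separator` are doing, here is a minimal sketch in plain Python. The function name `stuff_prompt` and the prompt wording are made up for illustration; LangChain's real template is different, but the idea of joining every retrieved chunk into one prompt is the same.

```python
# Hypothetical stand-in for the "stuff" strategy: every retrieved chunk is
# pasted into a single prompt, joined by the separator string.
def stuff_prompt(question, chunks, separator="<<<<>>>>>"):
    context = separator.join(chunks)
    return (
        "Use the following context to answer the question.\n"
        f"{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Two toy "retrieved" chunks (made-up text, not real catalog rows)
chunks = ["name: Cozy Comfort Pullover Set", "name: Ultra-Lofty 850 Jacket"]
prompt = stuff_prompt("Does the pullover set have pockets?", chunks)
print(prompt)
```

A loud separator like this makes it easy to spot where one document ends and the next begins when you read the debug logs later on.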


#### Coming up with test datapoints

In [None]:
data[10]

# peepin' at row 10 in your product catalog — makin' sure it loaded right

In [None]:
data[11]

# quick look at row 11 

#### Hard-coded examples

This means writing your own questions + expected answers.



In [None]:
from langchain.prompts import PromptTemplate

# bringin’ in the tool to make your prompt look fresh and structured


In [None]:
from langchain.prompts import PromptTemplate
from langchain.schema import BaseOutputParser
from pydantic import BaseModel, Field

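examples = [
    {
        # adjacent string literals are joined, so no stray indentation
        # sneaks into the question (a backslash continuation would keep it)
        "query": "Do the Cozy Comfort Pullover Set "
                 "have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty "
                 "850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]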
# feeding the AI examples so it knows how to answer



# Define the prompt template
prompt_template = PromptTemplate(
    input_variables=["query"],
    template="Examples:\n"
             "1. Query: Do the Cozy Comfort Pullover Set have side pockets?\n"
             "   Answer: Yes\n"
             "2. Query: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?\n"
             "   Answer: The DownTek collection\n"
             "Query: {query}\n"
             "Answer:"
)
# prompt layout — tellin’ the AI how to spit out answers




# Define the output model
class Answer(BaseModel):
    answer: str = Field(description="The answer to the query")
# setting up the answer format — just keep it clean.




# Create the output parser
class AnswerOutputParser(BaseOutputParser):
    def parse(self, text: str) -> Answer:
        # Split the response to get the answer
        answer = text.strip().split("Answer:")[-1].strip()
        return Answer(answer=answer)

# Initialize the LLM
# llm = OpenAI()
llm = ChatOpenAI()

# Create the LLMChain
llm_chain = LLMChain(
    llm=llm,
    prompt=prompt_template,
    output_parser=AnswerOutputParser()
)
# bundling all this into one smooth Q&A chain



# Example query
query = "Is the Cozy Comfort Pullover Set available in different colors?"

# Run the chain
result = llm_chain.run({"query": query})

# test question and seein' what the bot says 👀
print(result)
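The parser in the cell above boils down to one line of string handling. A small sketch of that same logic on its own, so you can try it without calling an LLM (the helper name `extract_answer` and the sample strings are made up for illustration):

```python
def extract_answer(text: str) -> str:
    # Keep only what follows the last "Answer:" marker, trimmed of whitespace.
    return text.strip().split("Answer:")[-1].strip()

with_marker = extract_answer("Query: any colors?\nAnswer: Yes, three colors. ")
without_marker = extract_answer("  No marker here  ")
print(with_marker)
print(without_marker)
```

Note the second case: when the model leaves out the `Answer:` marker, the whole response comes back unchanged, so this parser never fails loudly.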


#### LLM-Generated examples


Let the AI generate questions for you based on the docs. This is fast but needs checking.
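"Needs checking" can start with a few cheap automatic checks before you read the pairs yourself. A rough sketch, assuming each generated pair is a dict with `query` and `answer` keys; the helper `check_pair` and the word-overlap rule are illustrative assumptions, not part of LangChain:

```python
def check_pair(pair, doc_text):
    """Return a list of problems found in one generated Q&A pair."""
    problems = []
    query = pair.get("query", "").strip()
    answer = pair.get("answer", "").strip()
    if not query:
        problems.append("empty query")
    if not answer:
        problems.append("empty answer")
    # Crude grounding check: at least one answer word should appear in the doc.
    doc_words = set(doc_text.lower().split())
    if answer and not any(w in doc_words for w in answer.lower().split()):
        problems.append("answer shares no words with the document")
    return problems

# Toy document and pairs (made-up text)
doc = "Cozy Comfort Pullover Set, stripe. Side pockets and a relaxed fit."
good = {"query": "Does it have pockets?", "answer": "Yes, side pockets"}
bad = {"query": "What is the weight?", "answer": "Unknown"}
print(check_pair(good, doc))
print(check_pair(bad, doc))
```

A check like this only flags obvious junk; a human still has to confirm that the question and answer actually make sense.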

In [None]:
from langchain.evaluation.qa import QAGenerateChain
# importin’ the tool to let the AI write test questions for you 🧠


In [None]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())
# settin’ up the test question generator with ChatGPT


In [None]:
llm_chain = LLMChain(llm=llm, prompt=prompt_template)
# re-bootin' the LLM chain — keepin' it tight and ready for more


In [None]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)

# lettin’ the AI cook up Q&A pairs based on the first 5 docs


In [None]:
new_examples[0]

# peekin' at the first AI-made test question 👀


In [None]:
data[0]

# checkin' the first product doc — makin’ sure it matches the Q&A


In [None]:
# loop variable is named ex so it doesn't look like the catalog `data` list
d_flattened = [ex['qa_pairs'] for ex in new_examples]
d_flattened


# grabbin’ the Q&A pairs from each generated example 
# flattening the goods 


#### Combine examples

Mix both: your questions + AI-generated ones. Now you’ve got a whole test set.



In [None]:
# examples += new_example
examples += d_flattened

# stackin' the new AI-made questions on top of the old ones
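When you combine lists in a notebook, re-running a cell can stack the same questions twice. A small sketch of a merge that skips repeats; the helper `merge_examples` and the sample questions are hypothetical and only show the idea:

```python
def merge_examples(*example_lists):
    """Concatenate example lists, dropping repeats of the same query."""
    seen = set()
    merged = []
    for example_list in example_lists:
        for ex in example_list:
            # Normalize whitespace and case so near-identical queries match.
            key = " ".join(ex["query"].split()).lower()
            if key not in seen:
                seen.add(key)
                merged.append(ex)
    return merged

manual = [{"query": "Do the pullovers have pockets?", "answer": "Yes"}]
generated = [
    {"query": "do the pullovers   have pockets?", "answer": "Yes"},
    {"query": "What is the jacket made of?", "answer": "Down"},
]
test_set = merge_examples(manual, generated)
print(len(test_set))
```

The first copy of a question wins, so your hand-written examples take priority over generated ones if you pass them first.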


In [None]:
examples[0]

# lookin' at the first test example — what we askin' and expectin'

In [None]:
qa.invoke(examples[0]["query"])

# droppin’ the question into the Q&A bot — see what it spits back


### Manual Evaluation - Fun part

In [None]:
import langchain
langchain.debug = True

# turnin’ on detective mode — 
# logs everything so you see what’s poppin' under the hood 


In [None]:
qa.invoke(examples[0]["query"])

# rerunnin' the question with debug on 
# seein' all the behind-the-scenes sauce 


In [None]:
# Turn off the debug mode
langchain.debug = False

# chillin' the logs — turnin’ the noise off 


### LLM assisted evaluation

This is the part where you let the AI judge the AI based on how relevant, faithful, or on-topic the answers are.
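The idea behind an AI judge can be sketched without any API call: build a grading prompt from the question, the expected answer, and the bot's answer, then read back a verdict. Everything below (`GRADING_TEMPLATE`, `grade`, `fake_judge`) is a made-up illustration, not the prompt or code LangChain uses internally.

```python
GRADING_TEMPLATE = (
    "You are a teacher grading a quiz.\n"
    "QUESTION: {query}\n"
    "TRUE ANSWER: {answer}\n"
    "STUDENT ANSWER: {result}\n"
    "Reply with CORRECT or INCORRECT."
)

def grade(example, prediction, judge):
    """Format the grading prompt and normalize the judge's verdict."""
    prompt = GRADING_TEMPLATE.format(
        query=example["query"],
        answer=example["answer"],
        result=prediction["result"],
    )
    verdict = judge(prompt).strip().upper()
    # Check INCORRECT first, because the word CORRECT is contained in it.
    if "INCORRECT" in verdict:
        return "INCORRECT"
    if "CORRECT" in verdict:
        return "CORRECT"
    return "UNKNOWN"

def fake_judge(prompt):
    # Placeholder for an LLM call: pretend any answer starting with "Yes" is right.
    student = prompt.split("STUDENT ANSWER: ")[1].split("\n")[0]
    return "CORRECT" if student.startswith("Yes") else "INCORRECT"

example = {"query": "Does it have pockets?", "answer": "Yes"}
print(grade(example, {"result": "Yes, it has side pockets."}, fake_judge))
print(grade(example, {"result": "No pockets."}, fake_judge))
```

Swapping `fake_judge` for a real model call is where the judge gets its smarts; the surrounding plumbing stays the same.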

In [None]:
# examples already holds the AI-made Q&A pairs (added in "Combine examples"),
# so addin' d_flattened again would just double every generated question
# examples += d_flattened


# ⚡the full list of test examples —  AI's pop quiz sheet 

In [None]:
examples

In [None]:
predictions = qa.batch(examples)

# feedin’ all examples into your smart Q&A bot — gettin’ answers in one go 

In [None]:
predictions

##  bringin’ in the LangChain tool to judge if our bot's answers slap or flop ⚖️



In [None]:
from langchain.evaluation.qa import QAEvalChain

In [None]:
llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)

#  settin' up a serious AI judge (no randomness) to grade the answers 


In [None]:
graded_outputs = eval_chain.evaluate(examples, predictions)

# lettin’ the judge AI grade your bot’s answers based on truth & quality 💯


### peek at the report card — this shows how your bot did on each 😬


In [None]:
graded_outputs

In [None]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    # print("Predicted Grade: " + graded_outputs[i]['text'])
    print()


    # loopin’ through each test Q 
    # showin' what the AI was asked, what it said
    #  and what the real answer was 
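Once you have the grades, a one-number summary helps you compare runs. A minimal sketch, assuming each graded output is a dict whose verdict text starts with CORRECT or INCORRECT; the key name differs between LangChain versions (the commented-out line above uses `'text'`), so it is a parameter here and `"results"` is only an assumed default.

```python
def accuracy(graded, key="results"):
    """Share of graded outputs whose verdict is CORRECT (not INCORRECT)."""
    if not graded:
        return 0.0
    correct = sum(
        1 for g in graded
        if g[key].strip().upper().startswith("CORRECT")
    )
    return correct / len(graded)

# Made-up grades, only to show the calculation
sample = [{"results": "CORRECT"}, {"results": " INCORRECT"}, {"results": "correct"}]
print(accuracy(sample))
```

Print one of your own `graded_outputs` entries first to see which key your installed version actually uses.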


## Example 2: Using ragas 💥
You can also evaluate your QA chains with the metrics offered in ragas.



## Here we load a text doc about NYC and ask it real-world questions using that same LangChain power 🔥

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
loader = TextLoader("data/nyc_text.txt")
index = VectorstoreIndexCreator(embedding=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2",
                                                                 model_kwargs = {'device': 'mps'})).from_loaders([loader])
# 'mps' means it's runnin' on Apple Silicon MacBook GPU power; use 'cpu' (like in Example 1) on other machines



# 🔹 Set up the LLM Q&A chain


In [None]:
llm = ChatOpenAI(temperature= 0)
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=index.vectorstore.as_retriever(),
    return_source_documents=True,
)

# 🔹 Ask your question

In [None]:
# testing it out

question = "How did New York City get its name?"
result = qa_chain.invoke({"query": question})
result["result"]
# throwin’ the question at the chain 
# Get the final answer

## View the whole result

In [None]:
result

Now, to evaluate the QA system, we've generated a few relevant questions for you, but feel free to add any you want.

In [None]:
eval_questions = [
    "What is the population of New York City as of 2020?",
    "Which borough of New York City has the highest population?",
    "What is the economic significance of New York City?",
    "How did New York City get its name?",
    "What is the significance of the Statue of Liberty in New York City?",
]

eval_answers = [
    "8,804,190",
    "Brooklyn",
    "New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter.",
    "New York City got its name when it came under British control in 1664. King Charles II of England granted the lands to his brother, the Duke of York, who named the city New York in his own honor.",
    "The Statue of Liberty in New York City holds great significance as a symbol of the United States and its ideals of liberty and peace. It greeted millions of immigrants who arrived in the U.S. by ship in the late 19th and early 20th centuries, representing hope and freedom for those seeking a better life. It has since become an iconic landmark and a global symbol of cultural diversity and freedom.",
]

# these are your gold-standard answers — like the answer key for a quiz 💯


examples = [
    {"query": q, "ground_truths": [eval_answers[i]]}
    for i, q in enumerate(eval_questions)
]


# bundling Qs and their matching correct As into one package 📦
# this format is ready to be passed into an evaluator 
# (like Ragas or LangChain QA Eval)

In [None]:
examples

#### Introducing RagasEvaluatorChain

`RagasEvaluatorChain` (imported in this notebook as `EvaluatorChain` from `ragas.integrations.langchain`) creates a wrapper around the metrics ragas provides (documented [here](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md)), making it easier to run these evaluations with langchain and langsmith.

The evaluator chain has the following APIs:

- `__call__()`: call the `RagasEvaluatorChain` directly on the result of a QA chain.
- `evaluate()`: evaluate on a list of examples (with the input queries) and predictions (outputs from the QA chain).
- `evaluate_run()`: called by langsmith evaluators to evaluate langsmith datasets.

Let's see each of them in action to learn more.

# 💬 Ask a question + get an answer


In [None]:
result = qa_chain.invoke({"query": eval_questions[1]})
result["result"]


# askin' one of your real test questions 
# get the bot's answer back 

In [None]:
# Map your keys to Ragas format

key_mapping = {
    "query": "question",
    "result": "answer",
    "source_documents": "contexts"
}



#  Convert  result
result_updated = {}
for old_key, new_key in key_mapping.items():
    if old_key in result:
        result_updated[new_key] = result[old_key]





# This is your cleaned-up result — ready to be judged by the RagasEvaluatorChain 🧼
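The renaming step above can be wrapped in a small helper so you can reuse it for every result. The sketch below also turns the retrieved documents into plain strings; that part is an assumption about what the evaluator wants, so check it against your installed ragas version. `FakeDocument` and `to_eval_row` are hypothetical names used only here.

```python
from dataclasses import dataclass

@dataclass
class FakeDocument:
    # Stand-in for langchain's Document; only page_content matters here.
    page_content: str

KEY_MAPPING = {
    "query": "question",
    "result": "answer",
    "source_documents": "contexts",
}

def to_eval_row(result, key_mapping=KEY_MAPPING):
    """Rename QA-chain keys and flatten documents to their text."""
    row = {new: result[old] for old, new in key_mapping.items() if old in result}
    # Assumption: the evaluator expects plain strings, not Document objects.
    if "contexts" in row:
        row["contexts"] = [getattr(d, "page_content", d) for d in row["contexts"]]
    return row

demo = {
    "query": "Which borough has the highest population?",
    "result": "Brooklyn",
    "source_documents": [FakeDocument("Brooklyn is the most populous borough.")],
}
row = to_eval_row(demo)
print(row)
```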


In [None]:
result_updated

In [None]:
!pip install --no-cache-dir recordclass

In [None]:
!pip install ragas==0.1.9

In [None]:
from ragas.integrations.langchain import EvaluatorChain 
# from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
)

# these are the 4 big judge styles Ragas uses to score your bot 

# 🏆 Set up the Judges

In [None]:
# create evaluation chains
faithfulness_chain   = EvaluatorChain(metric=faithfulness)
answer_rel_chain     = EvaluatorChain(metric=answer_relevancy)
context_rel_chain    = EvaluatorChain(metric=context_relevancy)
context_recall_chain = EvaluatorChain(metric=context_recall)

faithfulness_chain – checks if your AI is makin' stuff up or stickin' to the source  (truth checker)

answer_rel_chain – checks if the answer actually hits what the question was askin'

context_rel_chain – checks if the retrieved docs even match the question topic 

context_recall_chain – checks if the real answer was even in the docs pulled 

1. `__call__()`

Directly run the evaluation chain with the results from the QA chain. Do note that metrics like context_relevancy and faithfulness require the `source_documents` to be present.

In [None]:
# Recheck the result that we are going to validate.
result

**Faithfulness**

In [None]:
eval_result = faithfulness_chain(result_updated)
eval_result["faithfulness_score"]

A high faithfulness_score means the answer is consistent with the source documents.

You can check lower faithfulness scores by changing the result (answer from LLM) or source_documents to something else.
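To build intuition for why a made-up answer scores low, here is a toy word-overlap version of the idea. This is NOT how ragas computes faithfulness (ragas uses an LLM to check statements against the context); `toy_faithfulness` is a hypothetical helper that only mimics the direction of the score.

```python
import re

def toy_faithfulness(answer, contexts):
    """Fraction of answer sentences whose words all appear in the contexts."""
    context_words = set(re.findall(r"\w+", " ".join(contexts).lower()))
    sentences = [s for s in re.split(r"[.!?]+", answer) if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(
        1 for s in sentences
        if set(re.findall(r"\w+", s.lower())) <= context_words
    )
    return supported / len(sentences)

contexts = ["Brooklyn is the most populous borough of New York City."]
grounded = toy_faithfulness("Brooklyn is the most populous borough.", contexts)
made_up = toy_faithfulness("we are the champions", contexts)
print(grounded, made_up)
```

The grounded answer is fully covered by the context, while the swapped-in answer has words the context never mentions, so its score drops.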

In [None]:
fake_result = result.copy()
fake_result["result"] = "we are the champions"
eval_result = faithfulness_chain(fake_result)
eval_result["faithfulness_score"]

**Context Recall**

In [None]:
eval_result = context_recall_chain(result)
eval_result["context_recall_score"]

High context_recall_score means that the ground truth is present in the source documents.

You can check lower context recall scores by changing the source_documents to something else.
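The same kind of toy sketch works for context recall: how much of the ground truth can be found in what the retriever pulled. Again, this is a word-overlap stand-in and not the ragas implementation; `toy_context_recall` is a hypothetical name.

```python
import re

def toy_context_recall(ground_truth, contexts):
    """Fraction of ground-truth words that show up somewhere in the contexts."""
    truth_words = set(re.findall(r"\w+", ground_truth.lower()))
    if not truth_words:
        return 0.0
    context_words = set(re.findall(r"\w+", " ".join(contexts).lower()))
    return len(truth_words & context_words) / len(truth_words)

truth = "Brooklyn"
hit = toy_context_recall(truth, ["Brooklyn is the most populous borough."])
miss = toy_context_recall(truth, ["I love christmas"])
print(hit, miss)
```

Swapping the retrieved text for something unrelated drives the score to zero, which mirrors the experiment in the next cell.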

In [None]:
from langchain.schema import Document
fake_result = result.copy()
fake_result["source_documents"] = [Document(page_content="I love christmas")]
eval_result = context_recall_chain(fake_result)
eval_result["context_recall_score"]

2. `evaluate()`


✅Like grading a whole exam instead of one question.✅

Evaluate a list of inputs/queries and the outputs/predictions from the QA chain.
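Conceptually, batch evaluation is a loop that pairs each example with its prediction, scores the pair, and summarizes. A minimal sketch with a hypothetical exact-match scorer standing in for a real metric; `evaluate_batch`, `exact_match`, and the demo data are illustrative assumptions.

```python
def evaluate_batch(examples, predictions, score_fn):
    """Score each (example, prediction) pair and report the mean."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must line up one-to-one")
    scores = [score_fn(ex, pred) for ex, pred in zip(examples, predictions)]
    mean = sum(scores) / len(scores) if scores else 0.0
    return {"scores": scores, "mean": mean}

def exact_match(example, prediction):
    # Hypothetical scorer: 1.0 when the prediction equals the first ground truth.
    return float(prediction["result"].strip() == example["ground_truths"][0])

examples_demo = [
    {"query": "Largest borough?", "ground_truths": ["Brooklyn"]},
    {"query": "2020 population?", "ground_truths": ["8,804,190"]},
]
predictions_demo = [{"result": "Brooklyn"}, {"result": "About 8.8 million"}]
report = evaluate_batch(examples_demo, predictions_demo, exact_match)
print(report)
```

Exact match is far too strict for free-text answers, which is the reason LLM-based metrics exist, but it keeps the loop easy to follow.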

# ✅ 1.  Evaluate Faithfulness (Truth Check)

In [None]:
# run the queries as a batch for efficiency
predictions = qa_chain.batch(examples)

# evaluate
print("evaluating...")
r = faithfulness_chain.evaluate(examples, predictions)
r

# 📊 2. Evaluate Context Recall (Did it retrieve the right source?)

In [None]:
# evaluate context recall
print("evaluating...")
r = context_recall_chain.evaluate(examples, predictions)
r

# This tells you whether the bot even retrieved 
# the right context to answer the question
#  *super important in real-world apps like legal, health, etc*
