<a href="https://colab.research.google.com/github/GiX007/agent-labs/blob/main/03_langchain/04_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangChain: Evaluation

## Outline

* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation
* LangChain evaluation platform

In [None]:
import os

from dotenv import load_dotenv, find_dotenv
dotenv_path = find_dotenv() or '/content/OPENAI_API_KEY.env' # read local .env file
load_dotenv(dotenv_path)

import warnings
warnings.filterwarnings('ignore')

Note: LLMs do not always produce the same results. When you run the code in your notebook, you may get slightly different answers.

In [None]:
llm_model = "gpt-4o-mini"

## Create our Q&A application

In [None]:
!pip install langchain langchain-openai langchain-community



In [None]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
from langchain_community.document_loaders import CSVLoader  # community integrations live in langchain-community
from langchain.indexes import VectorstoreIndexCreator
from langchain_community.vectorstores import DocArrayInMemorySearch

In [None]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()

In [None]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [None]:
!pip install docarray



In [None]:
# Create a vector store index from the loaded documents using embeddings for semantic search
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings
).from_loaders([loader])

In [None]:
# Create a RetrievalQA chain using the LLM with a simple "stuff" chain type and a retriever that fetches relevant documents from the vector store
llm = ChatOpenAI(temperature = 0.0, model=llm_model)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff", # "stuff" means all retrieved documents are combined and fed together to the LLM
    retriever=index.vectorstore.as_retriever(), # use the vector store index we created as the retriever
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)

### Coming up with test datapoints

In [None]:
data[10]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 10}, page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.")

In [None]:
data[11]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 11}, page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.')

### Hard-coded examples

In [None]:
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

### LLM-Generated examples

In [None]:
from langchain.evaluation.qa import QAGenerateChain

In [None]:
# QAGenerateChain.from_llm creates a chain that can generate question-answer pairs from a document using the specified LLM
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))

In [None]:
# the warning below can be safely ignored

In [None]:
# Generate question-answer pairs from the first 5 documents using the LLM
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)

In [None]:
new_examples[0]

{'qa_pairs': {'query': "What are the key features and construction details of the Women's Campside Oxfords as described in the document?",
  'answer': "The Women's Campside Oxfords are designed to be ultracomfortable with a lace-to-toe style, crafted from soft canvas material that provides a broken-in feel right from the first wear. They include thick cushioning for enhanced comfort, a comfortable EVA innersole featuring Cleansport NXT® antimicrobial odor control, and a vintage hunt, fish, and camping motif on the innersole. The shoes also have a moderate arch contour for support and an EVA foam midsole for additional cushioning. The outsole is made from molded rubber with a modified chain-tread pattern inspired by chains, and the approximate weight is 1 lb. 1 oz. per pair. For sizing, it is recommended to order your regular shoe size or to order up for half sizes not offered."}}

In [None]:
data[0]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")

Essentially, this step takes a few documents and uses an LLM to generate a question-answer pair for each one. The resulting pairs can be used for evaluation, fine-tuning, or building training datasets.

### Combine examples

In [None]:
examples += new_examples

In [None]:
qa.invoke(examples[0]["query"])



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Do the Cozy Comfort Pullover Set have side pockets?',
 'result': 'Yes, the Cozy Comfort Pullover Set has side pockets in the pull-on pants.'}

## Manual Evaluation

In [None]:
import langchain
langchain.debug = True

In [None]:
qa.invoke(examples[0]["query"])

[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "Do the Cozy Comfort Pullover Set have side pockets?"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "Do the Cozy Comfort Pullover Set have side pockets?",
  "context": ": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- 

{'query': 'Do the Cozy Comfort Pullover Set have side pockets?',
 'result': 'Yes, the Cozy Comfort Pullover Set has side pockets in the pull-on pants.'}

In [None]:
# Turn off the debug mode
langchain.debug = False

## LLM-assisted evaluation

In [None]:
predictions = qa.apply(examples)



> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...


ValueError: Missing some input keys: {'query'}
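
The error means that at least one entry in `examples` has no top-level `query` key. A minimal sketch of a check that could locate such entries before calling the chain is shown here; the helper name `find_missing_keys` and the `sample` list are hypothetical and used only for illustration.

```python
def find_missing_keys(examples, required=("query", "answer")):
    """Return (index, missing_keys) for every example lacking a required top-level key."""
    problems = []
    for i, ex in enumerate(examples):
        missing = [key for key in required if key not in ex]
        if missing:
            problems.append((i, missing))
    return problems


# Hypothetical sample mirroring the two shapes that appear in this notebook:
# a flat example and a generated example nested under "qa_pairs".
sample = [
    {"query": "Q1", "answer": "A1"},
    {"qa_pairs": {"query": "Q2", "answer": "A2"}},
]

print(find_missing_keys(sample))  # → [(1, ['query', 'answer'])]
```

Running a check like this on the combined list would point at the generated entries, whose keys sit one level down under `qa_pairs`.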

In [None]:
examples

[{'query': 'Do the Cozy Comfort Pullover Set have side pockets?',
  'answer': 'Yes'},
 {'query': 'What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection'},
 {'qa_pairs': {'query': "What are the key features and construction details of the Women's Campside Oxfords as described in the document?",
   'answer': "The Women's Campside Oxfords are designed to be ultracomfortable with a lace-to-toe style, crafted from soft canvas material that provides a broken-in feel right from the first wear. They include thick cushioning for enhanced comfort, a comfortable EVA innersole featuring Cleansport NXT® antimicrobial odor control, and a vintage hunt, fish, and camping motif on the innersole. The shoes also have a moderate arch contour for support and an EVA foam midsole for additional cushioning. The outsole is made from molded rubber with a modified chain-tread pattern inspired by chains, and the approximate weight is 1 lb. 1 oz. per pair. Fo

In [None]:
# Convert "qa_pairs" nested structure to flat "query"/"answer"
examples_flat = []
for ex in examples:
    if "qa_pairs" in ex:
        examples_flat.append({
            "query": ex["qa_pairs"]["query"],
            "answer": ex["qa_pairs"]["answer"]
        })
    else:
        examples_flat.append(ex)

In [None]:
examples_flat

[{'query': 'Do the Cozy Comfort Pullover Set have side pockets?',
  'answer': 'Yes'},
 {'query': 'What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection'},
 {'query': "What are the key features and construction details of the Women's Campside Oxfords as described in the document?",
  'answer': "The Women's Campside Oxfords are designed to be ultracomfortable with a lace-to-toe style, crafted from soft canvas material that provides a broken-in feel right from the first wear. They include thick cushioning for enhanced comfort, a comfortable EVA innersole featuring Cleansport NXT® antimicrobial odor control, and a vintage hunt, fish, and camping motif on the innersole. The shoes also have a moderate arch contour for support and an EVA foam midsole for additional cushioning. The outsole is made from molded rubber with a modified chain-tread pattern inspired by chains, and the approximate weight is 1 lb. 1 oz. per pair. For sizing, it i

In [None]:
predictions = qa.apply(examples_flat)



> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


> Entering new RetrievalQA chain...

> Finished chain.


In [None]:
predictions

[{'query': 'Do the Cozy Comfort Pullover Set have side pockets?',
  'answer': 'Yes',
  'result': 'Yes, the Cozy Comfort Pullover Set has side pockets in the pull-on pants.'},
 {'query': 'What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection',
  'result': 'The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.'},
 {'query': "What are the key features and construction details of the Women's Campside Oxfords as described in the document?",
  'answer': "The Women's Campside Oxfords are designed to be ultracomfortable with a lace-to-toe style, crafted from soft canvas material that provides a broken-in feel right from the first wear. They include thick cushioning for enhanced comfort, a comfortable EVA innersole featuring Cleansport NXT® antimicrobial odor control, and a vintage hunt, fish, and camping motif on the innersole. The shoes also have a moderate arch contour for support and an EVA foam midsole for add

In [None]:
from langchain.evaluation.qa import QAEvalChain

In [None]:
llm = ChatOpenAI(temperature=0, model=llm_model)
eval_chain = QAEvalChain.from_llm(llm)

In [None]:
# QAEvalChain compares the LLM-generated answers in 'predictions' (under 'result') with the ground-truth answers in 'examples_flat' (under 'answer')
graded_outputs = eval_chain.evaluate(examples_flat, predictions)

In [None]:
graded_outputs

[{'results': 'CORRECT'},
 {'results': 'INCORRECT'},
 {'results': 'GRADE: CORRECT'},
 {'results': 'GRADE: CORRECT'},
 {'results': 'GRADE: CORRECT'},
 {'results': 'GRADE: CORRECT'},
 {'results': 'GRADE: INCORRECT'}]
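
The grade strings above are not uniform: some are bare (`'CORRECT'`) and some carry a `GRADE:` prefix. A minimal sketch of how they could be normalized and summarized as an accuracy figure follows; `normalize_grade` and `accuracy` are hypothetical helpers, not part of LangChain, and `sample_grades` is copied from the output above.

```python
def normalize_grade(raw):
    """Strip an optional 'GRADE:' prefix and return 'CORRECT' or 'INCORRECT'."""
    text = raw.strip().upper()
    if text.startswith("GRADE:"):
        text = text[len("GRADE:"):].strip()
    if text not in ("CORRECT", "INCORRECT"):
        raise ValueError(f"Unrecognized grade: {raw!r}")
    return text


def accuracy(graded):
    """Fraction of graded outputs whose normalized grade is 'CORRECT'."""
    grades = [normalize_grade(g["results"]) for g in graded]
    if not grades:
        return 0.0
    return sum(g == "CORRECT" for g in grades) / len(grades)


# Grades copied from the graded_outputs shown above
sample_grades = [
    {"results": "CORRECT"},
    {"results": "INCORRECT"},
    {"results": "GRADE: CORRECT"},
    {"results": "GRADE: CORRECT"},
    {"results": "GRADE: CORRECT"},
    {"results": "GRADE: CORRECT"},
    {"results": "GRADE: INCORRECT"},
]

print(f"{accuracy(sample_grades):.2f}")  # → 0.71
```

The comparison uses exact equality after normalization because a substring test such as `"CORRECT" in text` would also match `INCORRECT`.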

In [None]:
# Print each question with its ground-truth answer, the chain's answer, and the grade
for i, (pred, grade) in enumerate(zip(predictions, graded_outputs)):
    print(f"Example {i}:")
    print("Question: " + pred['query'])
    print("Real Answer: " + pred['answer'])
    print("Predicted Answer: " + pred['result'])
    print("Predicted Grade: " + grade['results'])
    print()

Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set has side pockets in the pull-on pants.
Predicted Grade: CORRECT

Example 1:
Question: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: INCORRECT

Example 2:
Question: What are the key features and construction details of the Women's Campside Oxfords as described in the document?
Real Answer: The Women's Campside Oxfords are designed to be ultracomfortable with a lace-to-toe style, crafted from soft canvas material that provides a broken-in feel right from the first wear. They include thick cushioning for enhanced comfort, a comfortable EVA innersole featuring Cleansport NXT® antimicrobial odor control, and a vintage hunt, fish, and camping motif on the innersole. The shoes al

In [None]:
graded_outputs[0]

{'results': 'CORRECT'}
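
Since the grader is itself an LLM, it can help to pull out the examples it marked INCORRECT and review them by hand. The sketch below shows one way to do that; `failed_examples` is a hypothetical helper, and the sample data is copied from the first two results above.

```python
def failed_examples(predictions, graded):
    """Pair each prediction with its grade and keep those graded INCORRECT."""
    failures = []
    for i, (pred, grade) in enumerate(zip(predictions, graded)):
        # endswith handles both 'INCORRECT' and 'GRADE: INCORRECT'
        if grade["results"].strip().upper().endswith("INCORRECT"):
            failures.append({
                "index": i,
                "query": pred["query"],
                "expected": pred["answer"],
                "predicted": pred["result"],
            })
    return failures


# First two predictions and grades, copied from the outputs above
sample_predictions = [
    {
        "query": "Do the Cozy Comfort Pullover Set have side pockets?",
        "answer": "Yes",
        "result": "Yes, the Cozy Comfort Pullover Set has side pockets in the pull-on pants.",
    },
    {
        "query": "What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection",
        "result": "The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.",
    },
]
sample_graded = [{"results": "CORRECT"}, {"results": "INCORRECT"}]

for failure in failed_examples(sample_predictions, sample_graded):
    print(failure["index"], failure["query"])
# → 1 What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
```

In this sample, the flagged prediction does name the DownTek collection, which is the kind of grader mistake a manual review would catch.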

In this notebook, we walked through a full evaluation loop for a Q&A application. First, we used `QAGenerateChain` to generate synthetic Q&A pairs from catalog documents and combined them with hand-written examples. Next, we evaluated responses manually by setting `langchain.debug = True`, which prints the intermediate chain inputs and outputs for a query. Finally, we used `QAEvalChain` for LLM-assisted grading: `predictions = qa.apply(examples_flat)` ran each question (`query`) through the chain and stored the model's answer under `result`, and `eval_chain.evaluate(examples_flat, predictions)` compared each `result` against the ground-truth `answer` and returned one grade per example. Because the grader is itself an LLM, its grades are worth spot-checking: Example 1 above was graded INCORRECT even though the predicted answer names the DownTek collection.