# LangChain: Evaluation

## Outline:

* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation
* LangChain evaluation platform

In [1]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

Note: LLMs do not always produce the same results. When executing the code in your notebook, you may get slightly different answers than those in the video.

In [2]:
# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"

## Create our Q&A application

In [3]:
from langchain_community.document_loaders import CSVLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import DocArrayInMemorySearch
from langchain_core.prompts import ChatPromptTemplate
from langchain_classic.chains.combine_documents import create_stuff_documents_chain
from langchain_classic.chains import create_retrieval_chain
from IPython.display import display, Markdown

In [4]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file, encoding="utf-8")

In [5]:
data = loader.load()

In [6]:
splits = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=150
).split_documents(data)

In [7]:
llm = ChatOpenAI(temperature = 0.0, model=llm_model)

In [8]:
# Embed the chunks and index them in an in-memory vector store
embeddings = OpenAIEmbeddings()
vectorstore = DocArrayInMemorySearch.from_documents(splits, embeddings)
# Retrieve the 6 most similar chunks for each query
retriever = vectorstore.as_retriever(search_kwargs={"k": 6})
# To use the default search settings instead: retriever = vectorstore.as_retriever()

In [9]:
# Prompt: {context} is filled with retrieved docs, {input} is your question
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer ONLY using the context. If unsure, say you don't know.\n\n{context}"),
    ("human", "{input}")
])

# This replaces chain_type="stuff" and chain_type_kwargs={"document_separator": "..."}
doc_chain = create_stuff_documents_chain(
    llm=llm,
    prompt=prompt,
    document_separator="<<<<>>>>>"
)

In [10]:
# Retrieval chain replaces RetrievalQA.from_chain_type(...)
qa_chain = create_retrieval_chain(retriever, doc_chain)

### Coming up with test datapoints

In [11]:
data[10]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 10}, page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.")

In [12]:
data[11]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 11}, page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.')

### Hard-coded examples

In [13]:
# Ask a question; the chain returns a dict whose "answer" key holds the response
query = "Please list all your shirts with sun protection in a table in markdown and summarize each one."
result = qa_chain.invoke({"input": query})
answer = result["answer"]
print(answer)

| Name                                     | Summary                                                                                                                                                                                                                   |
|------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Men's Tropical Plaid Short-Sleeve Shirt  | Made of 100% polyester, UPF 50+ rated for superior sun protection. Features front and back cape venting, two front bellows pockets, and is wrinkle-resistant. Imported.                                                        |
| Men's Plaid Tropic Shirt, Short-Sleeve   | Made with 52% polyester and 48% nylon, UPF 50+ rated for sun protection. Features front and back cape venting, two front bellows pockets, and is wrinkle-fr

In [14]:
examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]
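Note that the backslash line continuations above keep the next line's indentation inside each query string, so the stored queries contain a run of spaces. A minimal sketch for collapsing that whitespace before evaluation; `normalize_query` is a hypothetical helper, not part of LangChain:

```python
import re

def normalize_query(text: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

# Query string as it is stored after the backslash continuation
raw = "Do the Cozy Comfort Pullover Set        have side pockets?"
print(normalize_query(raw))
```

Applying the helper to each `"query"` value is optional; the retriever and the grader generally tolerate the extra spaces, but clean strings are easier to read in the printed results.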

### LLM-Generated examples

In [15]:
from langchain_classic.evaluation.qa import QAGenerateChain


In [16]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))

In [17]:
# the warning below can be safely ignored

In [18]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)

  new_examples = example_gen_chain.apply_and_parse(


In [19]:
new_examples[0]

{'qa_pairs': {'query': "What is the approximate weight of one pair of Women's Campside Oxfords?",
  'answer': "The approximate weight of one pair of Women's Campside Oxfords is 1 lb. 1 oz."}}

In [20]:
data[0]

Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")

### Combine examples

In [21]:
new_examples[0]['qa_pairs']

{'query': "What is the approximate weight of one pair of Women's Campside Oxfords?",
 'answer': "The approximate weight of one pair of Women's Campside Oxfords is 1 lb. 1 oz."}

In [23]:
for ex in new_examples:
    print(ex['qa_pairs'])

{'query': "What is the approximate weight of one pair of Women's Campside Oxfords?", 'answer': "The approximate weight of one pair of Women's Campside Oxfords is 1 lb. 1 oz."}
{'query': 'What are the dimensions of the Small and Medium sizes for the Recycled Waterhog Dog Mat, Chevron Weave?', 'answer': 'The Small size has dimensions of 18" x 28" and the Medium size has dimensions of 22.5" x 34.5".'}
{'query': "What are the key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece?", 'answer': 'The key features of the swimsuit are bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ sun protection rating, crossover no-slip straps, fully lined bottom for secure fit and coverage, and machine wash and line dry care instructions.'}
{'query': 'What is the fabric composition of the Refresh Swimwear V-Neck Tankini Contrasts?', 'answer': 'The body of the swimtop is made of 82% recycled nylon and 18% Lycra® spandex, whil

In [24]:
for ex in new_examples:
    examples.append(ex['qa_pairs'])
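The loop above assumes every generated item parsed cleanly into a `qa_pairs` dict. A minimal sketch of the same merge that skips malformed items instead of raising; `merge_examples` is a hypothetical helper and the sample data is made up for illustration:

```python
def merge_examples(hand_written, generated):
    """Return hand_written plus every well-formed generated Q/A pair."""
    merged = list(hand_written)  # copy so the input list is not mutated
    for item in generated:
        pair = item.get("qa_pairs") if isinstance(item, dict) else None
        # Keep only pairs that carry both keys the evaluator expects
        if isinstance(pair, dict) and {"query", "answer"} <= set(pair):
            merged.append(pair)
    return merged

# Illustrative data only
hand_written = [{"query": "What collection is the jacket from?", "answer": "The DownTek collection"}]
generated = [
    {"qa_pairs": {"query": "What is the approximate weight?", "answer": "1 lb. 1 oz."}},
    {"qa_pairs": {"query": "missing answer"}},  # malformed: skipped
    {"text": "unparsed output"},                # malformed: skipped
]
merged = merge_examples(hand_written, generated)
print(len(merged))
```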

In [25]:
examples

[{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
  'answer': 'Yes'},
 {'query': 'What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection'},
 {'query': "What is the approximate weight of one pair of Women's Campside Oxfords?",
  'answer': "The approximate weight of one pair of Women's Campside Oxfords is 1 lb. 1 oz."},
 {'query': 'What are the dimensions of the Small and Medium sizes for the Recycled Waterhog Dog Mat, Chevron Weave?',
  'answer': 'The Small size has dimensions of 18" x 28" and the Medium size has dimensions of 22.5" x 34.5".'},
 {'query': "What are the key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece?",
  'answer': 'The key features of the swimsuit are bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ sun protection rating, crossover no-slip straps, fully lined bottom for secure fit and coverage, a

In [26]:
result = qa_chain.invoke({"input": examples[0]["query"]})
answer = result["answer"]

In [27]:
answer

'Yes, the Cozy Comfort Pullover Set pants have side pockets.'

## Manual Evaluation

In [28]:
import langchain
langchain.debug = True

In [29]:
qa_chain.invoke({"input": examples[0]["query"]})

{'input': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'context': [Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 10}, page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported."),
  Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 73}, page_content=": 73\nname: Cozy Cuddles Knit Pullover Set\ndescription: Perfect for lo

In [30]:
# Turn off the debug mode
langchain.debug = False

## LLM-assisted evaluation

In [31]:
type(examples[0]['query'])

str

In [32]:
batch_inputs = [{"input": ex['query']} for ex in examples]
predictions = qa_chain.batch(batch_inputs)

In [33]:
predictions

[{'input': 'Do the Cozy Comfort Pullover Set        have side pockets?',
  'context': [Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 10}, page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported."),
   Document(metadata={'source': 'OutdoorClothingCatalog_1000.csv', 'row': 73}, page_content=": 73\nname: Cozy Cuddles Knit Pullover Set\ndescription: Perfect for

In [34]:
from langchain_classic.evaluation.qa import QAEvalChain

In [35]:
llm = ChatOpenAI(temperature=0, model=llm_model)
eval_chain = QAEvalChain.from_llm(llm)

In [36]:
predictions_for_eval = [
    {
        "result": p["answer"],                                  # what the model said
        "question": p.get("input", ""),                         # optional but nice to have
        # optional: flatten docs to a single context string if you want the grader to see it
        "context": "\n\n".join(getattr(d, "page_content", "") for d in p.get("context", []))
    }
    for p in predictions
]
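The reshaping above can also be written as two small functions, which makes it easy to check without calling the model. This is a sketch under assumptions: `FakeDoc` is a hypothetical stand-in for a LangChain `Document`, and `flatten_context` and `to_eval_record` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only page_content is used."""
    page_content: str

def flatten_context(docs, separator="\n\n"):
    """Join the text of the retrieved documents into one string."""
    return separator.join(getattr(d, "page_content", "") for d in docs)

def to_eval_record(prediction):
    """Map one retrieval-chain output to the keys the grader reads."""
    return {
        "result": prediction["answer"],
        "question": prediction.get("input", ""),
        "context": flatten_context(prediction.get("context", [])),
    }

# Illustrative prediction shaped like the retrieval-chain output
sample_prediction = {
    "input": "Do the pants have side pockets?",
    "answer": "Yes.",
    "context": [FakeDoc("first chunk"), FakeDoc("second chunk")],
}
record = to_eval_record(sample_prediction)
print(record["context"])
```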

In [37]:
graded_outputs = eval_chain.evaluate(
    examples=examples,                 # expects keys: "query", "answer"
    predictions=predictions_for_eval,  # must include "result"; other keys are okay
    question_key="query",
    prediction_key="result",
    # answer_key defaults to "answer" which matches your examples
)

In [38]:
for i, (eg, pred, grade) in enumerate(zip(examples, predictions_for_eval, graded_outputs)):
    print(f"Example {i}:")
    print("Question:", eg["query"])
    print("Real Answer:", eg["answer"])
    print("Predicted Answer:", pred["result"])
    # peek at what's available
    print("Grade keys:", list(grade.keys()))

    # print a sensible primary field if present, else the whole dict
    print("Predicted Grade:", grade.get("text", grade.get("score", grade.get("reasoning", grade))))
    print()

Example 0:
Question: Do the Cozy Comfort Pullover Set        have side pockets?
Real Answer: Yes
Predicted Answer: The Cozy Comfort Pullover Set does not mention having side pockets in the description.
Grade keys: ['results']
Predicted Grade: {'results': 'INCORRECT'}

Example 1:
Question: What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Grade keys: ['results']
Predicted Grade: {'results': 'CORRECT'}

Example 2:
Question: What is the approximate weight of one pair of Women's Campside Oxfords?
Real Answer: The approximate weight of one pair of Women's Campside Oxfords is 1 lb. 1 oz.
Predicted Answer: The approximate weight of one pair of Women's Campside Oxfords is 1 lb. 1 oz.
Grade keys: ['results']
Predicted Grade: {'results': 'CORRECT'}

Example 3:
Question: What are the dimensions of the Small and Medium sizes for the Recy

In [39]:
graded_outputs[0]

{'results': 'INCORRECT'}
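The per-example grades can be rolled up into a single accuracy figure. A minimal sketch, assuming each grade is a dict with a `results` key holding `CORRECT` or `INCORRECT`, as in the output above; `summarize_grades` is a hypothetical helper, and the sample list mirrors the first three grades printed earlier.

```python
from collections import Counter

def summarize_grades(graded, key="results"):
    """Count each grade label and return (counts, fraction graded CORRECT)."""
    counts = Counter(str(g.get(key, "")).strip().upper() for g in graded)
    total = sum(counts.values())
    accuracy = counts["CORRECT"] / total if total else 0.0
    return counts, accuracy

# First three grades from the run shown above
sample = [{"results": "INCORRECT"}, {"results": "CORRECT"}, {"results": "CORRECT"}]
counts, accuracy = summarize_grades(sample)
print(dict(counts), f"{accuracy:.0%}")
```

Because the grader is itself an LLM, treat the resulting number as a rough signal and spot-check the `INCORRECT` cases by hand.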

Reminder: Download your notebook to your local computer to save your work.