<a href="https://colab.research.google.com/github/Decoding-Data-Science/airesidency/blob/main/1_evaluation_recipe-c7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Chatbot And RAG Evaluation

Retrieval Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with relevant external knowledge. It has become one of the most widely used approaches for building LLM applications.

This tutorial will show you how to evaluate your RAG applications using LangSmith. You'll learn:

1. How to create test datasets
2. How to run your RAG application on those datasets
3. How to measure your application's performance using different evaluation metrics

#### Overview
A typical RAG evaluation workflow consists of three main steps:

1. Creating a dataset with questions and their expected answers
2. Running your RAG application on those questions
3. Using evaluators to measure how well your application performed, looking at factors like:
   - Answer relevance
   - Answer accuracy
   - Retrieval quality
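
The three steps above can be sketched offline, without any LangSmith or OpenAI calls. Every name here (`toy_dataset`, `toy_app`, `exact_match`, `run_eval`) is a hypothetical stand-in for illustration and is not part of LangSmith's API; the canned lookup only imitates an LLM call.

```python
# Hypothetical offline sketch of the dataset -> app -> evaluator loop.
from statistics import mean

# Step 1: a dataset of questions and their expected answers.
toy_dataset = [
    {"inputs": {"question": "How many teaspoons are in one tablespoon?"},
     "outputs": {"answer": "3"}},
    {"inputs": {"question": "How many tablespoons are in 1/4 cup (US)?"},
     "outputs": {"answer": "4 tbsp"}},
]

# Step 2: the application under test; a canned lookup stands in for an LLM.
def toy_app(inputs: dict) -> dict:
    canned = {"How many teaspoons are in one tablespoon?": "3"}
    return {"response": canned.get(inputs["question"], "unknown")}

# Step 3: an evaluator compares the app output with the reference output.
def exact_match(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    return outputs["response"].strip() == reference_outputs["answer"].strip()

def run_eval(dataset, target, evaluators):
    """Return the mean score per evaluator across all examples."""
    scores = {ev.__name__: [] for ev in evaluators}
    for example in dataset:
        outputs = target(example["inputs"])
        for ev in evaluators:
            score = ev(example["inputs"], outputs, example["outputs"])
            scores[ev.__name__].append(float(score))
    return {name: mean(values) for name, values in scores.items()}

summary = run_eval(toy_dataset, toy_app, [exact_match])
print(summary)  # {'exact_match': 0.5}
```

The evaluator signature `(inputs, outputs, reference_outputs)` mirrors the one used by the evaluators later in this notebook, so the same functions could be exercised locally before being sent to a hosted run.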

For this tutorial, we'll create and evaluate a bot that answers questions about a few of Lilian Weng's insightful blog posts.

### Chatbot Evaluation

In [1]:
!pip install -q \
  python-dotenv \
  langsmith \
  langchain \
  langchain-openai \
  langchain-community \
  langchain-text-splitters \
  openai \
  pandas \
  tiktoken


[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.3/84.3 kB[0m [31m2.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m18.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m32.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m64.7/64.7 kB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m50.9/50.9 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires requests==2.32.4, but you have requests 2.32.5 which is incompatible.[0m[31m
[0m

In [2]:
import os
from google.colab import userdata

# Read secrets from Colab’s User Secrets
OPENAI_API_KEY = userdata.get("openai")
LANGSMITH_API_KEY = userdata.get("LANGSMITH_API_KEY")

if not OPENAI_API_KEY or not LANGSMITH_API_KEY:
    raise ValueError("Please set the 'openai' and 'LANGSMITH_API_KEY' secrets in Colab User Secrets.")

# Set environment variables so the rest of the notebook works unchanged
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["LANGSMITH_API_KEY"] = LANGSMITH_API_KEY
os.environ["LANGSMITH_TRACING"] = "true"
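
Outside Colab, `google.colab.userdata` is not available. A minimal sketch of an alternative, assuming the keys are already exported as environment variables (for example from a `.env` file loaded with python-dotenv, which the install cell pulls in). The helper name `require_env` is hypothetical.

```python
import os

def require_env(names):
    """Return {name: value} for each variable, raising if any is unset or empty."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}

# Usage outside Colab (left commented out so the sketch runs without real keys):
# keys = require_env(["OPENAI_API_KEY", "LANGSMITH_API_KEY"])
```

Failing early with the list of missing names tends to be easier to debug than an authentication error raised later by the first API call.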


In [3]:
from langsmith import Client

client = Client()

# Define dataset: recipe-focused Q/A for beginner evaluations
dataset_name = "Recipe Bot Evaluation — Q/A (Beginner Set)— 6th Dec"
dataset = client.create_dataset(dataset_name)

client.create_examples(
    dataset_id=dataset.id,
    examples=[
        {
            "inputs": {
                "question": "How many teaspoons are in one tablespoon?",
                "context": "US kitchen measurement equivalents."
            },
            "outputs": {"answer": "The number of teaspoons in one tablespoon is 3"}
        },
        {
            "inputs": {
                "question": "What is the safe internal temperature for cooked chicken (°C)?",
                "context": "Food safety guideline for poultry doneness."
            },
            "outputs": {"answer": "74°C"}
        },
        {
            "inputs": {
                "question": "Convert 2 US cups to milliliters.",
                "context": "Use the US legal cup for home cooking."
            },
            "outputs": {"answer": "480 ml"}
        },
        {
            "inputs": {
                "question": "What is the classic vinaigrette oil-to-acid ratio?",
                "context": "Standard salad dressing ratio."
            },
            "outputs": {"answer": "3:1"}
        },
        {
            "inputs": {
                "question": "Substitute for 1 cup light brown sugar using white sugar and molasses.",
                "context": "Common home-baking substitution."
            },
            "outputs": {"answer": "1 cup white sugar + 1 tbsp molasses"}
        },
        {
            "inputs": {
                "question": "Minimum internal temperature for medium-rare steak (°C).",
                "context": "Typical doneness temperature."
            },
            "outputs": {"answer": "57°C"}
        },
        {
            "inputs": {
                "question": "How many grams are in 1 ounce (oz)?",
                "context": "Kitchen weight conversion."
            },
            "outputs": {"answer": "28.35 g"}
        },
        {
            "inputs": {
                "question": "What gas is produced when baking soda reacts with an acid?",
                "context": "Leavening reaction in quick breads."
            },
            "outputs": {"answer": "Carbon dioxide"}
        },
        {
            "inputs": {
                "question": "Boiling time for a soft-boiled egg (runny yolk) after simmering starts.",
                "context": "Stovetop method, large eggs."
            },
            "outputs": {"answer": "6 minutes"}
        },
        {
            "inputs": {
                "question": "How many tablespoons are in 1/4 cup (US)?",
                "context": "US kitchen measurement equivalents."
            },
            "outputs": {"answer": "4 tbsp"}
        },
        {
            "inputs": {
                "question": "Q1",
                "context": "US Measurements"
            },
            "outputs": {"answer": "1 tbsp"}
        }
    ]
)


{'example_ids': ['f98dee4b-7ab2-4faa-9d4b-818d79c9636f',
  'de6d1dbe-6261-4cf4-b3a5-3e9b4ee8bd03',
  '69aab53f-740f-4e2a-bed6-99cc3b5fef43',
  '06aada0c-6894-43a6-939f-9b829ec1a041',
  '59450985-3b61-48cd-a1c7-f742fbf1ab3d',
  '66c55769-1789-4df9-884c-cc64c3bbf496',
  '322cf8bf-900e-4e0c-8138-82d8324efb97',
  '0598f608-da6f-4335-9e71-72d034582de9',
  '75d0d488-9fb1-4895-a9ee-06e572be64e5',
  '4f733315-e9f0-45a2-9d9e-2133cc412009',
  'f3251d72-eab9-4a88-aff5-4c902bf3431a'],
 'count': 11}
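
Before uploading, it can help to check that every example carries a non-empty question and answer. The following is a minimal sketch; `validate_examples` is a hypothetical helper, not a LangSmith function, and it assumes the same `inputs`/`outputs` layout used in the cell above.

```python
def validate_examples(examples):
    """Return a list of problems; an empty list means every example passed."""
    problems = []
    for i, example in enumerate(examples):
        question = example.get("inputs", {}).get("question", "")
        answer = example.get("outputs", {}).get("answer", "")
        if not isinstance(question, str) or not question.strip():
            problems.append(f"example {i}: missing question")
        if not isinstance(answer, str) or not answer.strip():
            problems.append(f"example {i}: missing answer")
    return problems

sample = [
    {"inputs": {"question": "How many grams are in 1 ounce (oz)?"},
     "outputs": {"answer": "28.35 g"}},
    {"inputs": {"question": ""}, "outputs": {"answer": "1 tbsp"}},
]
print(validate_examples(sample))  # ['example 1: missing question']
```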

### Define Metrics (LLM As A Judge)


In [4]:
import openai
from langsmith import wrappers

# Wrap the OpenAI client so its calls are traced in LangSmith
openai_client = wrappers.wrap_openai(openai.OpenAI())

eval_instructions = "You are a strict grader for short recipe Q&A."

def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    """LLM-as-a-judge: True when the judge labels the predicted answer CORRECT."""
    user_content = f"""You are grading the following question:
    {inputs['question']}
    Here is the real answer:
    {reference_outputs['answer']}
    You are grading the following predicted answer:
    {outputs['response']}
    Respond with CORRECT or INCORRECT:
    Grade:
    """
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": eval_instructions},
            {"role": "user", "content": user_content},
        ],
    ).choices[0].message.content

    # Normalize whitespace and case so a reply such as "correct\n" still counts
    return (response or "").strip().upper() == "CORRECT"
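
A judge model does not always reply with the bare token, and a substring test is unsafe because "INCORRECT" contains "CORRECT". A minimal sketch of a stricter parser follows; `parse_verdict` is a hypothetical helper, and returning `None` for unparseable replies is an assumption about how one might want to surface judge failures.

```python
def parse_verdict(text):
    """Map a judge reply to True (CORRECT), False (INCORRECT), or None (unparseable)."""
    if not text:
        return None
    # Drop surrounding whitespace and common trailing punctuation or quotes.
    token = text.strip().strip(".!\"'").upper()
    if token == "CORRECT":
        return True
    if token == "INCORRECT":
        return False
    return None

print(parse_verdict("CORRECT"))         # True
print(parse_verdict(" incorrect.\n"))   # False
print(parse_verdict("Grade: CORRECT"))  # None
```

Keeping the unparseable case separate makes it possible to count how often the judge ignores the requested format instead of silently scoring those replies as wrong.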

In [5]:
# Concision: checks whether the actual output is less than 2x the length of the expected answer.

def concision(outputs: dict, reference_outputs: dict) -> bool:
    return len(outputs["response"]) < 2 * len(reference_outputs["answer"])
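
To see how the length rule behaves on a short reference answer, here is a standalone variant of the check (returning a plain `bool`) applied to the "74°C" reference from the dataset. The two candidate responses are made up for illustration.

```python
def concision(outputs: dict, reference_outputs: dict) -> bool:
    # Passes when the response is shorter than twice the reference length.
    return len(outputs["response"]) < 2 * len(reference_outputs["answer"])

reference = {"answer": "74°C"}  # 4 characters, so a response must be under 8

print(concision({"response": "74°C"}, reference))        # True: 4 < 8
print(concision({"response": "About 74°C"}, reference))  # False: 10 is not < 8
```

Because the threshold scales with the reference length, very short references such as "3:1" leave almost no room for extra words, which is worth keeping in mind when reading the concision scores.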

### Run Evaluations

In [6]:
default_instructions = "Respond to the user's question in a short, concise manner (one or two words)."
def my_app(question: str, model: str = "gpt-4.1-nano-2025-04-14", instructions: str = default_instructions) -> str:
    return openai_client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

In [7]:
# Call my_app for every data point
def ls_target(inputs: dict) -> dict:
    return {"response": my_app(inputs["question"])}

In [8]:
# Run our evaluation
experiment_results = client.evaluate(
    ls_target,  # Your AI system
    data=dataset_name,
    evaluators=[correctness, concision],
    experiment_prefix="gpt-4.1-nano-2025-04-14_1",
)

View the evaluation results for experiment: 'gpt-4.1-nano-2025-04-14_1-1fcab761' at:
https://smith.langchain.com/o/c8f8810e-4941-552b-aef3-15ad938ead98/datasets/335a5760-8e7d-4ef3-8345-c0980ac2d618/compare?selectedSessions=c03364c7-2a63-4750-96aa-5b77cf5dc013





In [9]:
# Call my_app for every data point, this time with a different model
def ls_target(inputs: dict) -> dict:
    return {"response": my_app(inputs["question"], model="gpt-3.5-turbo")}

In [10]:
# Run our evaluation
experiment_results = client.evaluate(
    ls_target,  # Your AI system
    data=dataset_name,
    evaluators=[correctness, concision],
    experiment_prefix="gpt-3.5",
)

View the evaluation results for experiment: 'gpt-3.5-29fefc25' at:
https://smith.langchain.com/o/c8f8810e-4941-552b-aef3-15ad938ead98/datasets/335a5760-8e7d-4ef3-8345-c0980ac2d618/compare?selectedSessions=9a034e2f-50aa-4fcc-85ea-3b65c8f78a2f
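
Redefining `ls_target` for each model works in a notebook but makes it easy to lose track of which definition is live. One possible alternative is a small factory that builds a target per configuration. This is a sketch only: `make_target` and `fake_app` are hypothetical names, and `fake_app` stands in for `my_app` so the example runs without an API key.

```python
from typing import Callable

def make_target(app: Callable[..., str], **app_kwargs) -> Callable[[dict], dict]:
    """Build a target that forwards the question to `app` with fixed keyword arguments."""
    def target(inputs: dict) -> dict:
        return {"response": app(inputs["question"], **app_kwargs)}
    return target

# Stand-in for my_app: echoes the model name instead of calling an LLM.
def fake_app(question: str, model: str = "default-model") -> str:
    return f"[{model}] {question}"

target_a = make_target(fake_app, model="model-a")
target_b = make_target(fake_app, model="model-b")

print(target_a({"question": "Q"}))  # {'response': '[model-a] Q'}
print(target_b({"question": "Q"}))  # {'response': '[model-b] Q'}
```

With the notebook's own names, something like `make_target(my_app, model="gpt-3.5-turbo")` could then be passed to `client.evaluate` in place of a redefined `ls_target`, keeping one target object per experiment.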



