### Chatbot And RAG Evaluation

Retrieval Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with relevant external knowledge. It has become one of the most widely used approaches for building LLM applications.

This tutorial will show you how to evaluate your RAG applications using LangSmith. You'll learn:

1. How to create test datasets
2. How to run your RAG application on those datasets
3. How to measure your application's performance using different evaluation metrics

#### Overview
A typical RAG evaluation workflow consists of three main steps:

1. Creating a dataset with questions and their expected answers
2. Running your RAG application on those questions
3. Using evaluators to measure how well your application performed, looking at factors like:
 - Answer relevance
 - Answer accuracy
 - Retrieval quality
 
For this tutorial, we'll create and evaluate a bot that answers questions about a few of Lilian Weng's insightful blog posts.

### Chatbot Evaluation

In [1]:
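import os
from dotenv import load_dotenv

# Load OPENAI_API_KEY and LANGSMITH_API_KEY from a local .env file into os.environ
load_dotenv()

# Fail early with a clear message if a key is missing
# (assigning None to os.environ would raise a TypeError)
for key in ("OPENAI_API_KEY", "LANGSMITH_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"{key} is not set; add it to your .env file")

os.environ["LANGSMITH_TRACING"] = "true"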

In [3]:
from langsmith import Client

client = Client()

# Define dataset: these are your test cases
dataset_name = "Chatbots Evaluation 2"
dataset = client.create_dataset(dataset_name)
client.create_examples(
    dataset_id=dataset.id,
    examples=[
        {
            "inputs": {"question": "What is LangChain?"},
            "outputs": {"answer": "A framework for building LLM applications"},
        },
        {
            "inputs": {"question": "What is LangSmith?"},
            "outputs": {"answer": "A platform for observing and evaluating LLM applications"},
        },
        {
            "inputs": {"question": "What is OpenAI?"},
            "outputs": {"answer": "A company that creates Large Language Models"},
        },
        {
            "inputs": {"question": "What is Google?"},
            "outputs": {"answer": "A technology company known for search"},
        },
        {
            "inputs": {"question": "What is Mistral?"},
            "outputs": {"answer": "A company that creates Large Language Models"},
        }
    ]
)

{'example_ids': ['a3ca34ed-9916-4888-bb3d-10e48944a778',
  'c2b8dd13-37d5-4fc0-aa1b-bc8f7532c294',
  'febc493e-76d2-458a-9ac7-24b25f52589a',
  '5ab95691-7741-436d-995b-dcc50501c110',
  'bebd16d3-1284-43f1-94a0-1322b17c88cb'],
 'count': 5}
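
Re-running the cell above with the same `dataset_name` will typically fail, because a dataset with that name already exists. A minimal sketch of a get-or-create helper follows. `get_or_create_dataset` is a hypothetical name that is not part of the original notebook, and `client` is assumed to be the `langsmith.Client` instance created above.

```python
def get_or_create_dataset(client, name: str):
    """Return (dataset, created) for the dataset called `name`.

    Assumption: `client` is a langsmith.Client. The dataset is created
    only when no dataset with this name exists yet.
    """
    if client.has_dataset(dataset_name=name):
        # Reuse the existing dataset instead of failing on a duplicate name
        return client.read_dataset(dataset_name=name), False
    return client.create_dataset(name), True
```

With this helper, `dataset, created = get_or_create_dataset(client, dataset_name)` can replace the direct `create_dataset` call, and `client.create_examples(...)` can be skipped when `created` is `False` so that examples are not duplicated.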

### Define Metrics (LLM as a Judge)

In [18]:
import openai
from langsmith import wrappers

# Wrap the OpenAI client so that calls made by the judge are traced in LangSmith
openai_client = wrappers.wrap_openai(openai.OpenAI())

eval_instructions = "You are an expert professor specialized in grading students' answers to questions."

def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    """LLM-as-a-judge evaluator: True if the judge grades the answer CORRECT."""
    user_content = f"""You are grading the following question:
    {inputs['question']}
    Here is the real answer:
    {reference_outputs['answer']}
    You are grading the following predicted answer:
    {outputs['response']}
    Respond with CORRECT or INCORRECT:
    Grade:
    """
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": eval_instructions},
            {"role": "user", "content": user_content},
        ],
    ).choices[0].message.content

    # Normalize whitespace and case before comparing. Use equality rather than
    # a substring check, because "INCORRECT" contains "CORRECT".
    return (response or "").strip().upper() == "CORRECT"
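
Comparing the judge's reply to an exact string can be brittle: a reply such as `Grade: CORRECT.` or one with a trailing newline would be scored as incorrect. The following is a minimal sketch of a more tolerant parser. `parse_grade` is a hypothetical helper, not part of the original notebook, and it assumes the judge's final verdict is the last CORRECT or INCORRECT word in its reply.

```python
import re

def parse_grade(text) -> bool:
    """Return True only when the judge's last verdict word is CORRECT.

    Hypothetical helper. INCORRECT is listed first in the pattern, and the
    word boundaries prevent matching the CORRECT inside INCORRECT.
    """
    verdicts = re.findall(r"\b(INCORRECT|CORRECT)\b", (text or "").upper())
    # No verdict at all (empty or None reply) counts as not correct
    return bool(verdicts) and verdicts[-1] == "CORRECT"
```

Under these assumptions, the evaluator could end with `return parse_grade(response)` instead of the exact comparison.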

In [19]:
# Concision: checks whether the actual output is less than 2x the length of the expected answer.

def concision(outputs: dict, reference_outputs: dict) -> bool:
    return len(outputs["response"]) < 2 * len(reference_outputs["answer"])
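
Because this evaluator is a pure function, it can be checked offline before any experiment is run. The sketch below repeats the same length rule and applies it to made-up inputs. The sample responses are illustrative assumptions, not results from the application.

```python
def concision(outputs: dict, reference_outputs: dict) -> bool:
    # Same rule as above: the response must be shorter than twice the reference
    return len(outputs["response"]) < 2 * len(reference_outputs["answer"])

reference = {"answer": "A framework for building LLM applications"}

# Illustrative responses: one brief, one padded to exactly twice the reference length
brief = {"response": "A framework for building apps powered by LLMs."}
at_limit = {"response": "x" * (2 * len(reference["answer"]))}

brief_ok = concision(brief, reference)        # True: well under the limit
at_limit_ok = concision(at_limit, reference)  # False: the comparison is strict
```

The second case shows that a response of exactly twice the reference length does not pass, since the comparison uses `<` rather than `<=`.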

### Run Evaluations

In [20]:
default_instructions = "Respond to the user's question in a short, concise manner (one short sentence)."
def my_app(question: str, model: str = "gpt-4o-mini", instructions: str = default_instructions) -> str:
    return openai_client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": instructions},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content

In [21]:
# Target function: LangSmith calls this once per dataset example

def ls_target(inputs: dict) -> dict:
    return {"response": my_app(inputs["question"])}
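
Before spending API calls on a full experiment, it can help to exercise the target and evaluators locally. The sketch below is a hypothetical `dry_run` helper, not a LangSmith API. It passes each evaluator only the arguments it names (`inputs`, `outputs`, `reference_outputs`), which is an assumption about how these evaluators are written. It uses a stub target, `fake_target`, in place of `ls_target` so that nothing is sent to OpenAI.

```python
import inspect

def dry_run(target, evaluators, examples):
    """Run a target and its evaluators on in-memory examples (no network calls)."""
    rows = []
    for example in examples:
        outputs = target(example["inputs"])
        available = {
            "inputs": example["inputs"],
            "outputs": outputs,
            "reference_outputs": example["outputs"],
        }
        scores = {}
        for evaluator in evaluators:
            # Pass only the keyword arguments this evaluator declares
            wanted = inspect.signature(evaluator).parameters
            kwargs = {k: v for k, v in available.items() if k in wanted}
            scores[evaluator.__name__] = evaluator(**kwargs)
        rows.append({"inputs": example["inputs"], "outputs": outputs, "scores": scores})
    return rows

def fake_target(inputs: dict) -> dict:
    # Stand-in for ls_target: returns a canned answer instead of calling a model
    return {"response": f"Stub answer to: {inputs['question']}"}

def concision(outputs: dict, reference_outputs: dict) -> bool:
    return len(outputs["response"]) < 2 * len(reference_outputs["answer"])

examples = [
    {
        "inputs": {"question": "What is LangChain?"},
        "outputs": {"answer": "A framework for building LLM applications"},
    }
]
rows = dry_run(fake_target, [concision], examples)
```

Each entry in `rows` holds the example inputs, the target's outputs, and a `scores` dict keyed by evaluator name, which makes signature mismatches or key errors visible before a real run.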

In [22]:
# Run the evaluation

experiment_results = client.evaluate(
    ls_target,  # Your AI system
    data=dataset_name,
    evaluators=[correctness, concision],
    experiment_prefix="gpt-4o-mini",
)

View the evaluation results for experiment: 'gpt-4o-mini-b4912e4d' at:
https://smith.langchain.com/o/888508c2-6024-4c31-b81c-8eca3c339169/datasets/24c29ada-bd25-4343-b77e-bc80644635eb/compare?selectedSessions=92282181-6300-48fb-b5e8-fbfbbe2ae395




5it [00:08,  1.69s/it]


In [23]:
# Same target, but answered by gpt-4o, so the two experiments can be compared

def ls_target(inputs: dict) -> dict:
    return {"response": my_app(inputs["question"], model="gpt-4o")}

In [None]:
# Run the evaluation again with the gpt-4o target

experiment_results = client.evaluate(
    ls_target,  # Your AI system
    data=dataset_name,
    evaluators=[correctness, concision],
    experiment_prefix="gpt-4o",
)