# Introduction to Evaluation
Evaluation is a quantitative way of measuring the performance of an LLM-powered application.

### Why is it important?
- LLMs are non-deterministic: their behavior is not fully predictable, and a small change to the prompt, the base LLM, the input, or a configuration setting can significantly change the output.
- Evaluation provides a structured way to identify failures and compare changes across different versions of your application.
- This helps you iterate on and scale your AI application more reliably.

### Evaluation components
Evaluation is made up of three main components:
1. Dataset: A collection of test inputs and reference outputs.
2. Target Function: A target function defines what you are evaluating. For example:
    - You can evaluate a single node
    - You can evaluate a new prompt
    - You can evaluate a part of your application
    - You can evaluate the end-to-end application

3. Evaluators: These are functions for scoring outputs.
    - *NOTE:* An evaluator can be either an **online evaluator** (scoring live application traffic) or an **offline evaluator** (scoring against a dataset, as in this notebook)


![Eval Conceptual](../static/Eval.png)
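
To make the three components concrete before bringing in LangSmith, here is a minimal plain-Python sketch of how they fit together. Everything in it (`toy_target`, `exact_match`, `run_evaluation`, and the two arithmetic examples) is hypothetical and exists only for illustration; it is not how LangSmith runs evaluations internally.

```python
# A tiny, LangSmith-free illustration of the three evaluation components.

# 1. Dataset: test inputs paired with reference outputs
dataset = [
    {"inputs": {"question": "What is 2 + 2?"}, "outputs": {"answer": "4"}},
    {"inputs": {"question": "What is 3 * 3?"}, "outputs": {"answer": "9"}},
]


# 2. Target function: the thing being evaluated
#    (a hypothetical stand-in for an LLM application)
def toy_target(inputs: dict) -> dict:
    # The second canned answer is deliberately wrong so that one example fails
    canned = {"What is 2 + 2?": "4", "What is 3 * 3?": "6"}
    return {"answer": canned[inputs["question"]]}


# 3. Evaluator: scores one output against its reference output
def exact_match(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    return outputs["answer"] == reference_outputs["answer"]


def run_evaluation(dataset: list, target, evaluator) -> list:
    """Run the target on every example and score each output."""
    scores = []
    for example in dataset:
        outputs = target(example["inputs"])
        scores.append(evaluator(example["inputs"], outputs, example["outputs"]))
    return scores


scores = run_evaluation(dataset, toy_target, exact_match)
accuracy = sum(scores) / len(scores)
print(scores, accuracy)
```

The rest of this notebook follows the same shape, with a LangSmith dataset, the agent as the target, and an LLM-based grader as the evaluator.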


# Let's run a simple evaluation to test the correctness of LLM responses

## 1. Create a Dataset
A dataset is a collection of examples used for evaluating an application.
- An example is a pair of a test input and a reference output.

### a. Create examples

In [20]:
dataset_examples = [
    {
        "inputs": {"question": "What is the capital city of Kenya?"},
        "outputs": {"answer": "The capital city of Kenya is Nairobi."}
    },
    {
        "inputs": {"question": "Who developed the theory of relativity?"},
        "outputs": {"answer": "The theory of relativity was developed by Albert Einstein."}
    },
    {
        "inputs": {"question": "What is the largest planet in our solar system?"},
        "outputs": {"answer": "The largest planet in our solar system is Jupiter."}
    },
    {
        "inputs": {"question": "In which year did World War II end?"},
        "outputs": {"answer": "World War II ended in 1945."}
    },
    {
        "inputs": {"question": "What is the square root of 144?"},
        "outputs": {"answer": "The square root of 144 is 12."}
    },
    {
        "inputs": {"question": "Who wrote the play 'Romeo and Juliet'?"},
        "outputs": {"answer": "The play 'Romeo and Juliet' was written by William Shakespeare."}
    },
    {
        "inputs": {"question": "What is the chemical symbol for gold?"},
        "outputs": {"answer": "The chemical symbol for gold is Au."}
    },
    {
        "inputs": {"question": "Which continent is the Sahara Desert located in?"},
        "outputs": {"answer": "The Sahara Desert is located in Africa."}
    },
    {
        "inputs": {"question": "How many sides does a hexagon have?"},
        "outputs": {"answer": "A hexagon has six sides."}
    },
    {
        "inputs": {"question": "What is the freezing point of water in Celsius?"},
        "outputs": {"answer": "The freezing point of water is 0 degrees Celsius."}
    }
]
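
Because every example is expected to carry an `inputs` dict with a `question` and an `outputs` dict with an `answer`, it can be worth checking that shape before uploading anything. The helper below is a hypothetical sketch (`validate_examples` is not part of LangSmith), shown on a two-item sample in which the second example is deliberately broken.

```python
def validate_examples(examples: list) -> list:
    """Return a list of problems; an empty list means every example is well-formed."""
    problems = []
    for i, example in enumerate(examples):
        inputs = example.get("inputs")
        outputs = example.get("outputs")
        if not isinstance(inputs, dict) or not isinstance(inputs.get("question"), str):
            problems.append(f"example {i}: missing inputs['question']")
        if not isinstance(outputs, dict) or not isinstance(outputs.get("answer"), str):
            problems.append(f"example {i}: missing outputs['answer']")
    return problems


sample_examples = [
    {
        "inputs": {"question": "What is the capital city of Kenya?"},
        "outputs": {"answer": "The capital city of Kenya is Nairobi."},
    },
    # Deliberately missing "outputs" to show what a failure looks like
    {"inputs": {"question": "Who developed the theory of relativity?"}},
]

problems = validate_examples(sample_examples)
print(problems)
```

Calling `validate_examples(dataset_examples)` on the list above should return an empty list if all ten examples follow the expected shape.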


### b. Programmatically create a dataset in LangSmith

In [21]:
from langsmith import Client

client = Client()

dataset_name = "agent_evaluation"

# Create the dataset only if it does not already exist, so re-running this cell is safe
if not client.has_dataset(dataset_name=dataset_name):
    dataset = client.create_dataset(
        dataset_name, description="Dataset for evaluating agent responses"
    )

    # Add examples to the dataset
    client.create_examples(
        dataset_id=dataset.id,
        examples=dataset_examples,
    )

## 2. Define Target Function (What you are evaluating)
A target function wraps the part of the application you are evaluating: it takes an example's inputs and returns the outputs to be scored.

- In our case, we are going to evaluate the entire agent.

In [22]:
import sys
import os

sys.path.append(os.path.abspath(".."))

from src.agent import graph

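def target_function(inputs: dict) -> dict:
    """Run the agent on one example's inputs and return its answer."""
    # `inputs` is the example's "inputs" dict, e.g. {"question": "..."}
    response = graph.invoke(inputs)

    # Return only the field the evaluator compares against the reference output
    return {"answer": response["answer"]}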
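
The input/output contract of the target function can be smoke-tested without calling the real agent. The sketch below swaps in a hypothetical `FakeGraph` whose `invoke` returns a canned state; the class, its extra `steps` field, and the hard-coded answer are all assumptions made for illustration and say nothing about what `src.agent.graph` actually returns.

```python
class FakeGraph:
    """Hypothetical stand-in for the compiled agent graph (not part of src.agent)."""

    def invoke(self, inputs: dict) -> dict:
        # A real graph would return its final state; only "answer" is used below
        return {"question": inputs["question"], "answer": "Nairobi", "steps": 3}


fake_graph = FakeGraph()


def stub_target_function(inputs: dict) -> dict:
    """Same shape as target_function, but backed by the fake graph."""
    response = fake_graph.invoke(inputs)
    return {"answer": response["answer"]}


result = stub_target_function({"question": "What is the capital city of Kenya?"})
print(result)
```

The point of the check is that the returned dict has exactly the keys the evaluator will read, regardless of what else the graph state contains.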
    
    

## 3. Define Evaluator
Evaluators are functions that score how well your application performs on a particular example.

In [23]:
from typing_extensions import Annotated, TypedDict
from langchain.chat_models import init_chat_model


# Grade output schema
class CorrectnessGrade(TypedDict):
    explanation: Annotated[str, ..., "Explain your reasoning for the score"]
    correct: Annotated[bool, ..., "True if the answer is correct, False otherwise."]


# Grade prompt
correctness_instructions = """
You are a teacher grading a quiz. 
You will be given a QUESTION, the GROUND TRUTH (correct) ANSWER, and the STUDENT ANSWER. 

Here is the grade criteria to follow:
(1) Grade the student answers based ONLY on their factual accuracy relative to the ground truth answer. 
(2) Ensure that the student answer does not contain any conflicting statements.
(3) It is OK if the student answer contains more information than the ground truth answer, as long as it is factually accurate relative to the  ground truth answer.

Correctness:
A correctness value of True means that the student's answer meets all of the criteria.
A correctness value of False means that the student's answer does not meet all of the criteria.

Explain your reasoning in a step-by-step manner to ensure your reasoning and conclusion are correct. 
Avoid simply stating the correct answer at the outset.
"""

# Grader LLM
grader_llm = init_chat_model(model="gpt-4o").with_structured_output(
    CorrectnessGrade
)


def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    """Evaluator for answer accuracy: True if the grader LLM judges the answer correct."""

    answers = f"""\
        QUESTION: {inputs['question']}
        GROUND TRUTH ANSWER: {reference_outputs['answer']}
        STUDENT ANSWER: {outputs['answer']}
    """
    
    # Run evaluator
    grade = grader_llm.invoke(
        [
            {"role": "system", "content": correctness_instructions},
            {"role": "user", "content": answers},
        ]
    )
    return grade["correct"]
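
An evaluator does not have to call an LLM. As a cheap point of comparison, here is a sketch of a heuristic evaluator with the same `(inputs, outputs, reference_outputs)` signature as `correctness`. The name `token_overlap` and the choice to return a raw fraction rather than a thresholded boolean are assumptions made for this illustration.

```python
import re


def _tokens(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def token_overlap(inputs: dict, outputs: dict, reference_outputs: dict) -> float:
    """Fraction of reference-answer tokens that also appear in the output answer."""
    reference = _tokens(reference_outputs["answer"])
    if not reference:
        return 0.0
    candidate = _tokens(outputs["answer"])
    return len(reference & candidate) / len(reference)


reference = {"answer": "The chemical symbol for gold is Au."}

# A correct paraphrase shares 5 of the 7 reference tokens
paraphrase_score = token_overlap({}, {"answer": "Gold's chemical symbol is Au"}, reference)

# A factually wrong answer shares 6 of the 7 reference tokens
wrong_score = token_overlap({}, {"answer": "The chemical symbol for gold is Ag."}, reference)

print(paraphrase_score, wrong_score)
```

Note that the wrong answer scores higher than the correct paraphrase here. That weakness of surface-level matching is one reason to use an LLM grader like `correctness` when factual accuracy is what matters.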

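## 4. Run and View Results
Run the experiment. Each example in the dataset is passed through the target function, and each output is scored by the evaluators.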

In [24]:
# After running the evaluation, a link will be provided to view the results in LangSmith
experiment_results = client.evaluate(
    target_function,
    data=dataset_name,
    evaluators=[
        correctness,
        # can add multiple evaluators here
    ],
    experiment_prefix="agent-evaluation"
)

View the evaluation results for experiment: 'agent-evaluation-a36b1d94' at:
https://smith.langchain.com/o/5e26199c-44b7-5d71-a174-0781dc496380/datasets/7649f8d9-60f7-48c5-a40d-c2ed1ecde1cc/compare?selectedSessions=4577ffd6-3ce7-4858-98bb-3220ba64a8b6




10it [00:25,  2.57s/it]
