<center>
    <p style="text-align:center">
    <img alt="arize logo" src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="300"/>
        <br>
        <a href="https://docs.arize.com/arize/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/client_python">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg">Slack Community</a>
    </p>
</center>

# <center>Evaluation using Pydantic Evals</center>

In this tutorial, you will:

1. Use Pydantic Evals to define test cases and evaluate your LLM app on a simple question-answering task.
2. Log your results to Arize to track your experiments and traces.

## Step 1: Install dependencies

In [None]:
!pip install -q pydantic-evals "arize[Tracing]" arize-otel openai openinference-instrumentation-openai

## Step 2: Set up API keys and imports

In [None]:
from openai import OpenAI
from pydantic_evals import Case, Dataset
from getpass import getpass
import os

SPACE_ID = globals().get("SPACE_ID") or getpass(
    "🔑 Enter your Arize Space ID: "
)
API_KEY = globals().get("API_KEY") or getpass("🔑 Enter your Arize API Key: ")
OPENAI_API_KEY = globals().get("OPENAI_API_KEY") or getpass(
    "🔑 Enter your OpenAI API key: "
)

os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
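
If you would rather not retype keys on every run, one option is to check an environment variable first and only prompt when it is missing. The sketch below is a minimal illustration of that idea, not part of the Arize or OpenAI clients: the helper name `get_secret` and the variable name `DEMO_SECRET` are hypothetical.

```python
import os
from getpass import getpass


def get_secret(env_name, prompt, prompt_fn=getpass):
    """Return the value of the environment variable `env_name` if it is set,
    otherwise ask for it interactively with `prompt_fn` (getpass by default)."""
    value = os.environ.get(env_name)
    if value:
        return value
    return prompt_fn(prompt)


# Demonstration with a hypothetical variable name; no prompt appears
# because the variable is already set.
os.environ["DEMO_SECRET"] = "abc123"
print(get_secret("DEMO_SECRET", "Enter the demo secret: "))
```

In the notebook you could call the same helper once per credential, choosing whatever environment variable names you already use for your Arize and OpenAI keys.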

## Step 3: Set up Arize
Add Arize auto-instrumentation for OpenAI using arize-otel.

In [None]:
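from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Register an OpenTelemetry tracer provider that exports traces to Arize.
tracer_provider = register(
    space_id=SPACE_ID,
    api_key=API_KEY,
    project_name="pydantic-evals-tutorial",
)

# Auto-instrument the OpenAI client so each LLM call is traced.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)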

## Step 4: Define the Evaluation Dataset
Create a dataset of test cases for a question-answering task using Pydantic Evals.

1. Each `Case` represents a single test with an input (the question) and an expected output (the answer).
2. The `Dataset` groups these cases for evaluation.

In [None]:
cases = [
    Case(name="capital of France", inputs="What is the capital of France?", expected_output="Paris"),
    Case(name="author of Romeo and Juliet", inputs="Who wrote Romeo and Juliet?", expected_output="William Shakespeare"),
    Case(name="largest planet", inputs="What is the largest planet in our solar system?", expected_output="Jupiter")
]
dataset = Dataset(cases=cases)

## Step 5: Set up the LLM task to evaluate

In [None]:
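client = OpenAI(api_key=OPENAI_API_KEY)

def evaluate_case(case):
    """Ask the model the case's question and check the answer against the expected output."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": case.inputs}],
    )
    # message.content can be None (for example on a refusal), so fall back to an empty string.
    output = response.choices[0].message.content or ""
    print(output)
    # Case-insensitive substring match: correct if the expected answer appears anywhere in the output.
    return case.expected_output.lower() in output.lower()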
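
The correctness check in `evaluate_case` is a case-insensitive substring match. To see how that check behaves without calling the OpenAI API, it can be pulled out into a small standalone function. This is a minimal sketch; the helper name `contains_expected` is hypothetical and does not come from Pydantic Evals.

```python
def contains_expected(output, expected):
    """Return True if `expected` appears anywhere in `output`, ignoring case.

    `output` may be None or empty (for example when the model returns no text),
    in which case the answer is treated as incorrect.
    """
    if not output:
        return False
    return expected.strip().lower() in output.lower()


# Sample model outputs written by hand for illustration.
print(contains_expected("The capital of France is Paris.", "Paris"))
print(contains_expected("The largest planet is Saturn.", "Jupiter"))
print(contains_expected(None, "Paris"))
```

Note that a substring match is lenient: an output such as "It is not Paris" would still count as correct, so for stricter grading you may want exact matching or an LLM-based judge.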

## Step 6: Run your experiment and evaluation

In [None]:
results = [evaluate_case(case) for case in dataset.cases]

for case, result in zip(dataset.cases, results):
    print(f"Case: {case.name}, Correct: {result}")
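
The per-case booleans can also be rolled up into a single accuracy figure. The sketch below uses a hand-written list of outcomes in place of the real `results` list; the function name `summarize` is hypothetical.

```python
def summarize(results):
    """Return (number correct, total, accuracy) for a list of booleans."""
    total = len(results)
    correct = sum(bool(r) for r in results)
    # Guard against an empty result list to avoid dividing by zero.
    accuracy = correct / total if total else 0.0
    return correct, total, accuracy


example_results = [True, True, False]  # hypothetical outcomes, not real model results
correct, total, accuracy = summarize(example_results)
print(f"{correct}/{total} correct ({accuracy:.0%})")  # prints "2/3 correct (67%)"
```

In the notebook, passing the `results` list from the cell above to `summarize` gives the same summary for your actual run.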

## Step 7: See your results in Arize

<img src="https://storage.googleapis.com/arize-assets/fixtures/pydantic-evals.png" width="800"/>