<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>

# <center>Evaluation using Pydantic Evals</center>

1. Use Pydantic Evals to evaluate your LLM app for a simple question-answering task.
2. Log your results to Arize Phoenix to track your experiments and traces.

## Step 1: Install dependencies

In [5]:
!pip install -q pydantic-evals arize-phoenix openai openinference-instrumentation-openai "httpx<0.28.0,>=0.23.0"


## Step 2: Set up API keys and imports

In [1]:
import os
from getpass import getpass

from openai import OpenAI
from pydantic_evals import Case, Dataset

import phoenix as px

if os.getenv("OPENAI_API_KEY") is None:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

## Step 3: Enable Phoenix Tracing

Sign up for a free instance of [Phoenix Cloud](https://app.phoenix.arize.com) to get your API key. If you'd prefer, you can instead [self-host Phoenix](https://docs.arize.com/phoenix/deployment).

In [12]:
if os.getenv("PHOENIX_API_KEY") is None:
    os.environ["PHOENIX_API_KEY"] = getpass("Enter your Phoenix API key: ")

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com/"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.getenv('PHOENIX_API_KEY')}"

In [2]:
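from phoenix.otel import register

# Note: the explicit `endpoint` below targets a local Phoenix instance and takes
# precedence over PHOENIX_COLLECTOR_ENDPOINT. Omit it to send traces to Phoenix Cloud.
tracer_provider = register(
    endpoint="http://localhost:6006/v1/traces",
    project_name="pydantic-evals-tutorial",
    auto_instrument=True,  # instrument installed OpenInference libraries (here, OpenAI)
)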

🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: pydantic-evals-tutorial
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: http://localhost:6006/v1/traces
|  Transport: HTTP + protobuf
|  Transport Headers: {'authorization': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



## Step 4: Create Example Traces to Evaluate

Next, we'll run some example inputs through an LLM call to generate traces that we can evaluate. In practice, you'd likely already have an application you're tracing that you'd want to evaluate instead.

In [4]:
client = OpenAI()

inputs = [
    "What is the capital of France?",
    "Who wrote Romeo and Juliet?",
    "What is the largest planet in our solar system?",
]


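def generate_trace(question):
    # Each call is recorded as an LLM span by the OpenAI auto-instrumentation.
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. Only respond with the answer to the question as a single word or proper noun.",
            },
            {"role": "user", "content": question},
        ],
    )


# Use `question` rather than `input` to avoid shadowing the Python builtin.
for question in inputs:
    generate_trace(question)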

## Step 5: Export Traces from Phoenix

Export your traces from Phoenix so you can evaluate them with Pydantic Evals.

In [1]:
from phoenix.trace.dsl import SpanQuery

query = (
    SpanQuery()
    .where(
        "span_kind == 'LLM'",
    )
    .select(
        input="input.value",
        output="output.value",
    )
)

# The Phoenix Client can take this query and return the dataframe.
px.Client().query_spans(query, project_name="pydantic-evals-tutorial")
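
The exported `input` column may not contain the bare question. As a minimal sketch, assuming the OpenAI instrumentation stores `input.value` as the JSON-serialized chat request (an assumption to verify against your own spans), a helper like the hypothetical `extract_user_question` below could recover the user's question. The payload used here is made up for illustration and is not taken from a real trace.

```python
import json
from typing import Optional


def extract_user_question(input_value: str) -> Optional[str]:
    """Return the last user message from a JSON-serialized chat request.

    Assumes a payload shaped like {"messages": [{"role": ..., "content": ...}]}.
    Returns None when the payload is not JSON or contains no user message.
    """
    try:
        payload = json.loads(input_value)
    except (TypeError, json.JSONDecodeError):
        return None
    if not isinstance(payload, dict):
        return None
    messages = payload.get("messages", [])
    if not isinstance(messages, list):
        return None
    # Walk backwards so the most recent user turn wins.
    for message in reversed(messages):
        if isinstance(message, dict) and message.get("role") == "user":
            content = message.get("content")
            return content if isinstance(content, str) else None
    return None


# Hypothetical payload, shaped like the request sent in Step 4.
example_input = json.dumps(
    {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
    }
)
question = extract_user_question(example_input)
print(question)
```

If the query result is a pandas DataFrame, the helper could be applied column-wise, for example with `df["input"].map(extract_user_question)`.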


## Step 6: Define the Evaluation Dataset

Create a dataset of test cases for a question-answering task using Pydantic Evals.
1. Each `Case` is a single test with an input (the question) and an expected output (the answer).
2. The `Dataset` groups these cases for evaluation.

In [5]:
cases = [
    Case(
        name="capital of France", inputs="What is the capital of France?", expected_output="Paris"
    ),
    Case(
        name="author of Romeo and Juliet",
        inputs="Who wrote Romeo and Juliet?",
        expected_output="William Shakespeare",
    ),
    Case(
        name="largest planet",
        inputs="What is the largest planet in our solar system?",
        expected_output="Jupiter",
    ),
]
dataset = Dataset(cases=cases)

## Step 7: Set up the LLM task to evaluate


In [5]:
client = OpenAI()


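def evaluate_case(case):
    # Ask the model the question, then compare its answer with the expected output.
    response = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": case.inputs}]
    )
    output = response.choices[0].message.content or ""  # content may be None
    print(output)
    # Case-insensitive substring match: passes if the expected answer appears anywhere.
    is_correct = case.expected_output.lower() in output.strip().lower()
    return is_correct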
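
The correctness check in `evaluate_case` is a case-insensitive substring match, so it can be exercised without calling the model. The sketch below isolates that logic in two hypothetical helpers, `normalize` and `contains_expected`. Stripping punctuation and rejecting an empty expected answer are additions made here for illustration and are not part of the original check.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())


def contains_expected(expected: str, output: str) -> bool:
    """Return True if the normalized expected answer appears in the normalized output."""
    expected_norm = normalize(expected)
    if not expected_norm:
        # An empty string is a substring of everything, so treat it as a failure.
        return False
    return expected_norm in normalize(output)


print(contains_expected("Paris", "The capital of France is Paris."))  # True
print(contains_expected("William Shakespeare", "Shakespeare"))  # False
print(contains_expected("Jupiter", "Saturn"))  # False
```

The second call shows a limitation of substring matching: a shorter but arguably correct answer such as "Shakespeare" fails against the expected "William Shakespeare".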

## Step 8: Run your experiment and evaluation

In [6]:
results = [evaluate_case(case) for case in dataset.cases]

for case, result in zip(dataset.cases, results):
    print(f"Case: {case.name}, Correct: {result}")
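
Beyond printing each case, the list of booleans can be reduced to a pass rate. The following is a minimal sketch using a hypothetical `summarize` helper. The names and results below are made-up illustration data, not output from this notebook.

```python
from typing import Dict, List


def summarize(names: List[str], results: List[bool]) -> Dict[str, object]:
    """Aggregate per-case booleans into counts, accuracy, and the failing case names."""
    passed = sum(results)  # True counts as 1, False as 0
    total = len(results)
    return {
        "passed": passed,
        "total": total,
        "accuracy": passed / total if total else 0.0,
        "failed_cases": [name for name, ok in zip(names, results) if not ok],
    }


# Hypothetical results for illustration only.
example_names = ["capital of France", "author of Romeo and Juliet", "largest planet"]
example_results = [True, False, True]
summary = summarize(example_names, example_results)
print(summary)
```

With the notebook's own variables, the call would look like `summarize([case.name for case in dataset.cases], results)`.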


## Step 9: See your results in Arize Phoenix

Open the `pydantic-evals-tutorial` project in Phoenix to inspect the traces recorded for each LLM call.