# Using Evals

When the expected result is not deterministic, a normal test with exact assertions is not enough. Testing then
becomes more like running experiments.

This way you can, for example, test different models or different prompts for your use case and compare the results.
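
As a rough, library-free sketch of this idea, the snippet below runs the same cases against two variants and compares their average scores. The functions `variant_a` and `variant_b` and the toy scoring rule are hypothetical stand-ins, not part of this notebook; a real experiment would call an agent instead of returning canned text.

```python
from statistics import mean
from typing import Callable


# Hypothetical stand-ins for two model/prompt variants.
def variant_a(city: str) -> str:
    return f"{city} is a lively capital with museums, parks and historic sites."


def variant_b(city: str) -> str:
    return f"Visit {city}."


def names_city_with_detail(city: str, answer: str) -> float:
    """Toy score: 1.0 if the answer names the city and has at least 8 words."""
    return 1.0 if city in answer and len(answer.split()) >= 8 else 0.0


def run_experiment(task: Callable[[str], str], cities: list[str]) -> float:
    """Run one variant over all cases and return its average score."""
    return mean(names_city_with_detail(city, task(city)) for city in cities)


cities = ["Berlin", "Paris", "Tokyo"]
results = {
    "variant_a": run_experiment(variant_a, cities),
    "variant_b": run_experiment(variant_b, cities),
}
print(results)
```

The point is only the shape of the comparison: identical cases, identical scoring, different task under test.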

## The Example

This time we ask for tourist recommendations for a city.

In [1]:
from pydantic import BaseModel, Field
from dotenv import load_dotenv
import logfire
import os
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

# setup logging
load_dotenv()
logfire.configure(token=os.getenv("LOGFIRE_TOKEN"))
logfire.info('Example 2 configured!')
logfire.instrument_pydantic_ai()

#setup the LLM
BASE_URL = "http://127.0.0.1:1234/v1"
LM_STUDIO_MODEL = "openai/gpt-oss-20b"


class TouristicRecommendation(BaseModel):
    city: str = Field(..., description="The city name")
    country: str = Field(..., description="The country of the city")
    description: str = Field(..., description="A short description of the city")
    recommendations: list[str] = Field(..., description="A list of recommended touristic places")


model = OpenAIChatModel(LM_STUDIO_MODEL, provider=OpenAIProvider(BASE_URL))

# Set up the agent with a structured output type
agent = Agent(model, output_type=TouristicRecommendation, system_prompt="""
You are a touristic recommendation agent.
You are given a city name and you have to provide a touristic recommendation for this city.
Give a short description of about 200 chars.
Also recommend at least 3 touristic places for this city with a short description.
""")

#First test without tool
result = await agent.run("What do you recommend for the city of Berlin?")
result.output

17:56:43.570 Example 2 configured!
17:56:43.609 agent run
17:56:43.610   chat openai/gpt-oss-20b


TouristicRecommendation(city='Berlin', country='Germany', description='Vibrant capital blending historic sites, cutting‑edge culture, and green spaces. From the iconic Brandenburg Gate to contemporary art hubs, Berlin offers a dynamic urban experience.', recommendations=['Brandenburg Gate – historic symbol of reunification', 'Museum Island – UNESCO‑listed art & history treasures', 'Tiergarten – sprawling urban park for relaxation'])

## Setup cases

Each experiment contains a number of cases. Those can be defined in code or in YAML and loaded from disk.
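
To illustrate what "cases on disk" can look like, here is a minimal sketch that writes a case to a JSON file and reads it back using only the standard library. The field names mirror the `Case` arguments used in this notebook; whether `Dataset.from_file` accepts exactly this layout is an assumption to check against the pydantic_evals documentation, and the file name is hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Field names mirror the Case arguments used in this notebook; the exact
# schema expected by Dataset.from_file is an assumption, not verified here.
dataset_dict = {
    "cases": [
        {
            "name": "Berlin Test Case",
            "inputs": "What do you recommend for the city of Berlin?",
            "expected_output": {"city": "Berlin", "country": "Germany"},
            "metadata": {"difficulty": "easy"},
        },
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_dataset.json"  # hypothetical file name
    path.write_text(json.dumps(dataset_dict, indent=2), encoding="utf-8")
    loaded = json.loads(path.read_text(encoding="utf-8"))

case_names = [case["name"] for case in loaded["cases"]]
print(case_names)
```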

Let's start with a single case.


In [4]:
from pydantic_evals.evaluators import IsInstance
from pydantic_evals import Case, Dataset

case1 = Case(
    name='Berlin Test Case',
    inputs='What do you recommend for the city of Berlin?',
    expected_output={
        'city': 'Berlin',
        'country': 'Germany',
        'expected_description': 'The description should mention that Berlin is a city in Germany and its capital.',
        'expected_recommendations': ['Brandenburg Gate', 'Museum Island', 'Eastside Gallery']
    },
    metadata={'difficulty': 'easy'}
)

dataset = Dataset(cases=[case1])

#We start by just evaluating the return type.
dataset.add_evaluator(IsInstance(type_name='TouristicRecommendation'))


# The call to the agent is wrapped in a function
def call_agent(input: str) -> TouristicRecommendation:
    return agent.run_sync(input).output


report = await dataset.evaluate(call_agent)
report.print(include_input=True, include_output=True, include_durations=False)


18:04:45.124 evaluate call_agent
18:04:45.591   case: Berlin Test Case
18:04:45.592     execute call_agent
18:04:45.593       agent run
18:04:45.593         chat openai/gpt-oss-20b
18:04:48.136     evaluator: IsInstance


## Deterministic Evaluators

Now we add our own evaluators, which measure some deterministic parts of the result.

Such an evaluator returns a number, which is reported as a score.

In [5]:
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

dataset = Dataset(cases=[case1])
#We start by just evaluating the return type.
dataset.add_evaluator(IsInstance(type_name='TouristicRecommendation'))


class DescriptionLenghtEvaluator(Evaluator[str, TouristicRecommendation]):
    def evaluate(self, ctx: EvaluatorContext[str, TouristicRecommendation]) -> float:
        # The main agent's prompt asks for about 200 characters, so 151-250 scores best.
        desc_len = len(ctx.output.description)
        if desc_len <= 150:
            return 0.0
        if desc_len <= 250:
            return 1.0
        if desc_len < 350:
            return 0.5
        return 0.0


dataset.add_evaluator(DescriptionLenghtEvaluator())


class NumberRecommendationsEvaluator(Evaluator[str, TouristicRecommendation]):
    def evaluate(self, ctx: EvaluatorContext[str, TouristicRecommendation]) -> float:
        # The inputs are the prompt string, so the first type parameter is str.
        num_recommendations = len(ctx.output.recommendations)
        if num_recommendations <= 2:
            return 0.0
        if num_recommendations <= 4:
            return 1.0
        if num_recommendations < 10:
            return 0.5
        return 0.0


dataset.add_evaluator(NumberRecommendationsEvaluator())

report = await dataset.evaluate(call_agent)
report.print(include_input=True, include_output=True, include_durations=False)


18:04:53.957 evaluate call_agent
18:04:53.963   case: Berlin Test Case
18:04:53.963     execute call_agent
18:04:53.965       agent run
18:04:53.966         chat openai/gpt-oss-20b
18:04:56.699     evaluator: IsInstance
18:04:56.700     evaluator: DescriptionLenghtEvaluator
18:04:56.700     evaluator: NumberRecommendationsEvaluator
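
Because these evaluators are deterministic, their scoring rules can be checked in isolation, without calling a model. The sketch below pulls the description-length thresholds into a plain function and probes the boundary values. The function name `score_description_length` is hypothetical, and treating exactly 250 characters as "in range" is an assumption of this sketch.

```python
def score_description_length(length: int) -> float:
    """Score a description length against a target of roughly 200 characters."""
    if length <= 150:
        return 0.0  # too short
    if length <= 250:
        return 1.0  # close to the requested ~200 characters
    if length < 350:
        return 0.5  # somewhat too long
    return 0.0  # far too long


# Probe the values on both sides of each threshold.
boundary_scores = {
    n: score_description_length(n) for n in (150, 151, 250, 251, 349, 350)
}
print(boundary_scores)
```

Checking both sides of every threshold makes gaps visible, for example a length that matches none of the intended ranges and silently falls through to the final `return`.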


## Run with a different Model

Now let's see how a much smaller model handles this.



In [6]:
small_model = OpenAIChatModel("qwen/qwen3-4b-2507", provider=OpenAIProvider(BASE_URL))

# Set up an agent with the small model; the prompt is shortened a bit to provoke worse results
small_agent = Agent(small_model, output_type=TouristicRecommendation, system_prompt="""
You are a touristic recommendation agent.
You are given a city name and you have to provide a touristic recommendation for this city.
Give a short description.
""")


# result = await small_agent.run("What do you recommend for the city of Berlin?")
# result.output

#The call to the agent is wrapped in a function
def call_small_agent(input: str) -> TouristicRecommendation:
    return small_agent.run_sync(input).output


report = await dataset.evaluate(call_small_agent)
report.print(include_input=True, include_output=True, include_durations=False)

18:05:07.354 evaluate call_small_agent
18:05:07.360   case: Berlin Test Case
18:05:07.360     execute call_small_agent
18:05:07.362       small_agent run
18:05:07.364         chat qwen/qwen3-4b-2507
18:05:14.904     evaluator: IsInstance
18:05:14.916     evaluator: DescriptionLenghtEvaluator
18:05:14.917     evaluator: NumberRecommendationsEvaluator


## LLM as a Judge

Now we let an LLM judge the output of our agent and give a reason for each verdict.

In [15]:
import os

from pydantic_evals.evaluators import LLMJudge

# We load a longer dataset
loaded_dataset = Dataset.from_file(os.path.join(os.getcwd(), 'data', 'capitals_dataset.yaml'))

# Now add an LLMJudge

llm_evaluator = LLMJudge(rubric="""Does the output match the expected output?
Validate the expected description and recommendations.
There should be at least 2 of the expected recommendations in the output.""",
                         include_expected_output=True)
loaded_dataset.add_evaluator(llm_evaluator)

report = await loaded_dataset.evaluate(call_small_agent)
report.print(include_input=True, include_output=True, include_durations=False, include_reasons=True)

18:28:04.634 evaluate call_small_agent
18:28:04.640   case: Berlin Test Case
18:28:04.640     execute call_small_agent
18:28:04.641   case: Paris Test Case
18:28:04.641     execute call_small_agent
18:28:04.641   case: London Test Case
18:28:04.642     execute call_small_agent
18:28:04.642   case: Tokyo Test Case
18:28:04.642     execute call_small_agent
18:28:04.642   case: Rome Test Case
18:28:04.642     execute call_small_agent
18:28:04.642   case: Madrid Test Case
18:28:04.643     execute call_small_agent
18:28:04.643   case: Beijing Test Case
18:28:04.643     execute call_small_agent
18:28:04.643   case: Moscow Test Case
18:28:04.643     execute call_small_agent
18:28:04.643   case: Cairo Test Case
18:28:04.644     execute call_small_agent
18:28:04.644   case: New Delhi Test Case
18:28:04.644     execute call_small_agent
18:28:04.644   case: Canberra Test Case
18:28:04.644     execute call_small_agent
18:28:04.644   case: Ottawa Test Case
18:28:04.645     execute call_small_agent
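
One part of the rubric above, "at least 2 of the expected recommendations in the output", can also be approximated deterministically as a cross-check on the judge. The following is a minimal sketch assuming simple case-insensitive substring matching; the helper `count_expected_matches` is hypothetical and not part of pydantic_evals. The sample data is taken from the Berlin case and the Berlin output shown earlier.

```python
def count_expected_matches(expected: list[str], recommendations: list[str]) -> int:
    """Count expected places that appear (case-insensitively) in any recommendation."""
    lowered = [rec.lower() for rec in recommendations]
    return sum(any(place.lower() in rec for rec in lowered) for place in expected)


expected = ["Brandenburg Gate", "Museum Island", "Eastside Gallery"]
output = [
    "Brandenburg Gate – historic symbol of reunification",
    "Museum Island – UNESCO‑listed art & history treasures",
    "Tiergarten – sprawling urban park for relaxation",
]

matches = count_expected_matches(expected, output)
passes_rubric = matches >= 2
print(matches, passes_rubric)
```

Substring matching misses spelling variants such as "East Side Gallery", which is one reason to let an LLM judge this part of the rubric as well.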


In [16]:
# Let's try it with the big model again

report = await loaded_dataset.evaluate(call_agent)
report.print(include_input=True, include_output=True, include_durations=False, include_reasons=True)

18:29:05.778 evaluate call_agent
18:29:05.785   case: Berlin Test Case
18:29:05.785     execute call_agent
18:29:05.786   case: Paris Test Case
18:29:05.787     execute call_agent
18:29:05.787   case: London Test Case
18:29:05.787     execute call_agent
18:29:05.787   case: Tokyo Test Case
18:29:05.788     execute call_agent
18:29:05.788   case: Rome Test Case
18:29:05.788     execute call_agent
18:29:05.788   case: Madrid Test Case
18:29:05.789     execute call_agent
18:29:05.789   case: Beijing Test Case
18:29:05.790     execute call_agent
18:29:05.790   case: Moscow Test Case
18:29:05.790     execute call_agent
18:29:05.791   case: Cairo Test Case
18:29:05.791     execute call_agent
18:29:05.791   case: New Delhi Test Case
18:29:05.791     execute call_agent
18:29:05.791   case: Canberra Test Case
18:29:05.791     execute call_agent
18:29:05.792   case: Ottawa Test Case
18:29:05.792     execute call_agent
18:29:05.792   case: Washington DC Test Case
18:29:05.792     execute call_age