# Using Evals

When the expected result is not deterministic, a normal test is not enough. Testing then becomes
more like running experiments.

This way you can, for example, test different models or different prompts for your use case and compare the results.
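
The idea of an experiment can be sketched without any LLM at all. The snippet below is a minimal illustration, not the library's API: `variant_a` and `variant_b` are hypothetical stand-ins for two model or prompt variants, and `length_score` is an assumed toy scoring rule. Each variant is run on the same inputs and the mean score is compared.

```python
from statistics import mean
from typing import Callable


# Hypothetical stand-ins for two variants; a real task would call an LLM here.
def variant_a(city: str) -> str:
    return f"{city} is a lively city with museums, parks and a rich history."


def variant_b(city: str) -> str:
    return city


def length_score(output: str) -> float:
    """Toy scoring rule (assumption): 1.0 if the answer has at least 20 characters."""
    return 1.0 if len(output) >= 20 else 0.0


def run_experiment(task: Callable[[str], str], inputs: list[str]) -> float:
    """Run the task on every input and return the mean score."""
    return mean(length_score(task(item)) for item in inputs)


cities = ["Berlin", "Paris", "Rome"]
results = {
    "variant_a": run_experiment(variant_a, cities),
    "variant_b": run_experiment(variant_b, cities),
}
print(results)
```

Because both variants see identical inputs and the same scoring rule, the two numbers can be compared directly.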

## The Example

This time we are asking for tourist recommendations for a city.

In [17]:
import os

import logfire
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIChatModel
from pydantic_ai.providers.openai import OpenAIProvider

# setup logging
load_dotenv()
logfire.configure(token=os.getenv("LOGFIRE_TOKEN"))
logfire.info('Example 2 configured!')
logfire.instrument_pydantic_ai()

class TouristicRecommendation(BaseModel):
    city: str = Field(..., description="The city name")
    country: str = Field(..., description="The country of the city")
    description: str = Field(..., description="A short description of the city")
    recommendations: list[str] = Field(..., description="A list of recommended touristic places")

# Set up the LLM (a local server with an OpenAI-compatible API)
BASE_URL = "http://127.0.0.1:1234/v1"
LM_STUDIO_MODEL = "openai/gpt-oss-20b"

model = OpenAIChatModel(LM_STUDIO_MODEL, provider=OpenAIProvider(base_url=BASE_URL))

# Set up the agent with a structured output type
agent = Agent(model, output_type=TouristicRecommendation, system_prompt="""
You are a touristic recommendation agent.
You are given a city name and you have to provide a touristic recommendation for this city.
Give a short description of about 200 chars.
Also recommend at least 3 touristic places for this city with a short description.
""")

#First test without tool
result = await agent.run("What do you recommend for the city of Berlin?")
result.output

09:33:33.416 Example 2 configured!
09:33:33.428 agent run
09:33:33.429   chat openai/gpt-oss-20b


TouristicRecommendation(city='Berlin', country='Germany', description='Vibrant capital blending history, art and tech. From iconic landmarks to lively neighborhoods, it’s a cultural hub that never sleeps.', recommendations=['Brandenburg Gate – emblematic gate with imperial past', 'Museum Island – world‑class museums on Spree River', 'East Side Gallery – open‑air gallery of mural art'])

## Setup cases

Each experiment contains a number of cases. Those can be defined in code or in YAML and loaded from disk.

Let's start with a single case.


In [18]:
from pydantic_evals.evaluators import IsInstance
from pydantic_evals import Case, Dataset

case1 = Case(
    name='Berlin Test Case',
    inputs='What do you recommend for the city of Berlin?',
    expected_output={
        'city': 'Berlin',
        'country': 'Germany',
        'expected_description': 'The description should mention that Berlin is a city in Germany and its capital.',
        'expected_recommendations': ['Brandenburg Gate', 'Museum Island', 'Eastside Gallery']
    },
    metadata={'difficulty': 'easy'}
)

dataset = Dataset(cases=[case1])

#We start by just evaluating the return type.
dataset.add_evaluator(IsInstance(type_name='TouristicRecommendation'))


# The call to the agent is wrapped in a function
def call_agent(input: str) -> TouristicRecommendation:
    return agent.run_sync(input).output


report = await dataset.evaluate(call_agent)
report.print(include_input=True, include_output=True, include_durations=False)


09:35:14.448 evaluate call_agent
09:35:14.456   case: Berlin Test Case
09:35:14.457     execute call_agent
09:35:14.465       agent run
09:35:14.472         chat openai/gpt-oss-20b
09:35:16.653     evaluator: IsInstance


## Deterministic Evaluators

Now we add our own evaluators that measure some deterministic parts of the result.

Such an evaluator returns a number, which is reported as a score for the case.
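
The scoring logic of such an evaluator is just a function from a measurement to a number, so it can be checked in isolation. The following is a minimal sketch, assuming a target of roughly 200 characters and closed upper bounds for each band (both the function name `description_length_score` and the exact band limits are assumptions for illustration).

```python
def description_length_score(length: int) -> float:
    """Score a description length against a target of roughly 200 characters."""
    if length <= 150:
        return 0.0  # too short
    if length <= 250:
        return 1.0  # close to the requested length
    if length <= 350:
        return 0.5  # somewhat too long
    return 0.0  # far too long


# Probe the band edges to confirm every length falls into exactly one band.
for n in (150, 151, 250, 251, 350, 351):
    print(n, description_length_score(n))
```

Probing the edges like this is worthwhile, because strict comparisons on both sides of a boundary can silently leave a value such as exactly 250 without a band.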

In [19]:
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

dataset = Dataset(cases=[case1])
# Keep the return type check from before.
dataset.add_evaluator(IsInstance(type_name='TouristicRecommendation'))


class DescriptionLengthEvaluator(Evaluator[str, TouristicRecommendation]):
    """Scores how close the description is to the requested ~200 characters."""

    def evaluate(self, ctx: EvaluatorContext[str, TouristicRecommendation]) -> float:
        desc_len = len(ctx.output.description)
        if desc_len <= 150:
            return 0.0  # too short
        if desc_len <= 250:
            return 1.0  # close to the requested length
        if desc_len <= 350:
            return 0.5  # somewhat too long
        return 0.0  # far too long


dataset.add_evaluator(DescriptionLengthEvaluator())


class NumberRecommendationsEvaluator(Evaluator[str, TouristicRecommendation]):
    """Scores the number of recommendations; the prompt asks for at least 3."""

    def evaluate(self, ctx: EvaluatorContext[str, TouristicRecommendation]) -> float:
        num_recommendations = len(ctx.output.recommendations)
        if num_recommendations <= 2:
            return 0.0  # fewer than requested
        if num_recommendations == 3:
            return 1.0
        if num_recommendations < 10:
            return 0.5  # more than needed, but still acceptable
        return 0.0  # far too many


dataset.add_evaluator(NumberRecommendationsEvaluator())

report = await dataset.evaluate(call_agent)
report.print(include_input=True, include_output=True, include_durations=False)


09:39:05.306 evaluate call_agent
09:39:05.334   case: Berlin Test Case
09:39:05.335     execute call_agent
09:39:05.338       agent run
09:39:05.354         chat openai/gpt-oss-20b
09:39:08.317     evaluator: IsInstance
09:39:08.319     evaluator: DescriptionLengthEvaluator
09:39:08.320     evaluator: NumberRecommendationsEvaluator


## Run with a different Model

Now let's see how a much smaller model handles this.



In [None]:
small_model = OpenAIChatModel("qwen/qwen3-4b-2507", provider=OpenAIProvider(BASE_URL))

# Set up a small agent; the prompt is shortened a bit to provoke worse results
small_agent = Agent(small_model, output_type=TouristicRecommendation, system_prompt="""
You are a touristic recommendation agent.
You are given a city name and you have to provide a touristic recommendation for this city.
Give a short description.
""")


# result = await small_agent.run("What do you recommend for the city of Berlin?")
# result.output

#The call to the agent is wrapped in a function
def call_small_agent(input: str) -> TouristicRecommendation:
    return small_agent.run_sync(input).output


report = await dataset.evaluate(call_small_agent)
report.print(include_input=True, include_output=True, include_durations=False)

## LLM as a Judge

Now we let an LLM judge the output of our agent against a rubric and report the reasons for its verdict.

In [None]:
import os

#We load a longer dataset
loaded_dataset = Dataset.from_file(os.path.join(os.getcwd(), 'data', 'capitals_dataset.yaml'))

#now add an LLMJudge
from pydantic_evals.evaluators import LLMJudge

llm_evaluator = LLMJudge(rubric="""Does the output match the expected output?
Validate the expected description and recommendations.
There should be at least 2 of the expected recommendations in the output.""",
                         include_expected_output=True)
loaded_dataset.add_evaluator(llm_evaluator)

report = await loaded_dataset.evaluate(call_small_agent)
report.print(include_input=True, include_output=True, include_durations=False, include_reasons=True)
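
The last line of the rubric ("at least 2 of the expected recommendations") can also be approximated deterministically, which gives a cheap cross-check on the judge. This is a minimal sketch and not what `LLMJudge` does internally: the helper names `normalize` and `count_expected_hits` are hypothetical, and the sample data is adapted from the Berlin case and the agent output shown earlier.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def count_expected_hits(expected: list[str], actual: list[str]) -> int:
    """Count expected places whose normalized name occurs in any actual recommendation."""
    normalized_actual = [normalize(item) for item in actual]
    return sum(
        any(normalize(name) in item for item in normalized_actual)
        for name in expected
    )


expected = ["Brandenburg Gate", "Museum Island", "Eastside Gallery"]
actual = [
    "Brandenburg Gate – emblematic gate with imperial past",
    "Museum Island – world-class museums on Spree River",
    "East Side Gallery – open-air gallery of mural art",
]

hits = count_expected_hits(expected, actual)
passed = hits >= 2  # threshold taken from the rubric
print(hits, passed)
```

Normalizing away spaces and punctuation lets "Eastside Gallery" match "East Side Gallery", but a substring check still misses synonyms or translated names, which is where the LLM judge earns its cost.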

In [None]:
# Let's try it with the big model again

report = await loaded_dataset.evaluate(call_agent)
report.print(include_input=True, include_output=True, include_durations=False, include_reasons=True)