# 03: Evaluation

In Notebook 02 we ran individual questions by hand. This notebook evaluates the agent
systematically: we upload a dataset subset to Langfuse, run the agent on every item, and
score each response with an LLM-as-judge grader using the official DeepSearchQA methodology.

## What You'll Learn

1. Uploading a DeepSearchQA subset to Langfuse as a persistent dataset
2. The LLM-as-judge grader: precision, recall, F1, and the four outcome categories
3. A single-sample evaluation walkthrough
4. Running the full experiment with `run_experiment`
5. Inspecting and interpreting item-level results

## Prerequisites

Complete Notebooks 01 and 02. You'll need all credentials in `.env`:
- `GOOGLE_API_KEY`
- `LANGFUSE_PUBLIC_KEY` and `LANGFUSE_SECRET_KEY`
- `OPENAI_API_KEY` (for the LLM grader)

In [None]:
import json
import os
import tempfile
from pathlib import Path
from typing import Any

import pandas as pd
from aieng.agent_evals.evaluation import run_experiment
from aieng.agent_evals.evaluation.graders.config import LLMRequestConfig
from aieng.agent_evals.knowledge_qa import DeepSearchQADataset, KnowledgeGroundedAgent
from aieng.agent_evals.knowledge_qa.deepsearchqa_grader import (
    EvaluationOutcome,
    evaluate_deepsearchqa_async,
)
from aieng.agent_evals.knowledge_qa.notebook import display_response, run_with_display
from aieng.agent_evals.langfuse import upload_dataset_to_langfuse
from dotenv import load_dotenv
from IPython.display import HTML, display  # noqa: A004
from langfuse.experiment import Evaluation
from rich.console import Console
from rich.panel import Panel
from rich.table import Table


if Path("").absolute().name == "eval-agents":
    print(f"Working directory: {Path('').absolute()}")
else:
    os.chdir(Path("").absolute().parent.parent)
    print(f"Working directory set to: {Path('').absolute()}")

load_dotenv(verbose=True)
console = Console(width=100)

DATASET_NAME = "DeepSearchQA-Subset"

## 1. Uploading the Dataset to Langfuse

Langfuse stores our evaluation dataset so we can run multiple experiments against the same items
and compare results over time. Each dataset item has three fields:

- **`input`**: the question (sent to the agent)
- **`expected_output`**: the ground truth answer (given to the grader, never shown to the agent)
- **`metadata`**: `category`, `answer_type`, `example_id`

Items are deduplicated by a hash of their content, so running this cell again is safe.
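
The upload utility's actual hashing scheme is not shown in this notebook, so the following is only a sketch of how content-based deduplication can work in principle. The helper `content_hash` and the placeholder record values are hypothetical; only the three field names come from the description above.

```python
import hashlib
import json


def content_hash(record: dict) -> str:
    """Return a stable SHA-256 hex digest of a dataset record (illustrative only)."""
    # sort_keys makes the serialization independent of dict insertion order
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder values, not taken from DeepSearchQA
record = {
    "input": "Example question?",
    "expected_output": "Example answer",
    "metadata": {"example_id": 0, "category": "Finance & Economics", "answer_type": "Single Answer"},
}
reordered = {
    "metadata": record["metadata"],
    "expected_output": record["expected_output"],
    "input": record["input"],
}
changed = {**record, "expected_output": "Another answer"}

print(content_hash(record) == content_hash(reordered))  # True: same content, same ID
print(content_hash(record) == content_hash(changed))    # False: content differs
```

Under a scheme like this, re-uploading an unchanged item maps to the same ID, which is why a repeated upload would not create duplicates.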

In [None]:
dataset = DeepSearchQADataset()
examples = dataset.get_by_category("Finance & Economics")[:1]

console.print(f"Uploading [cyan]{len(examples)}[/cyan] examples to dataset '{DATASET_NAME}'...")

# Write examples to a temporary JSONL file for the upload utility
with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False, encoding="utf-8") as f:
    for ex in examples:
        record = {
            "input": ex.problem,
            "expected_output": ex.answer,
            "metadata": {
                "example_id": ex.example_id,
                "category": ex.problem_category,
                "answer_type": ex.answer_type,
            },
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    temp_path = f.name

try:
    await upload_dataset_to_langfuse(dataset_path=temp_path, dataset_name=DATASET_NAME)
finally:
    # Remove the temporary file even if the upload fails
    os.unlink(temp_path)

console.print(f"[green]✓[/green] Dataset '{DATASET_NAME}' ready in Langfuse")

## 2. The DeepSearchQA Grader

The grader is an LLM-as-judge that evaluates answers using the official DeepSearchQA methodology
from Appendix A of the paper. It handles both answer types:

- **Single Answer**: checks whether the response contains the one expected value
- **Set Answer**: checks which items from the ground truth set appear in the response,
  and flags any extra items the agent included

### Metrics

Let **S** = predicted items, **G** = ground truth items:

| Metric | Formula | Meaning |
|--------|---------|---------|
| **Precision** | \|S∩G\| / \|S\| | Of what the agent said, how much was correct |
| **Recall** | \|S∩G\| / \|G\| | Of the ground truth, how much did the agent find |
| **F1** | 2·P·R / (P+R) | Harmonic mean of precision and recall |
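
To make the formulas concrete, here is a minimal sketch of the arithmetic on plain Python sets. In the real grader the LLM judge decides which items match; the function name `precision_recall_f1` and the choice to return 0.0 for empty sets are assumptions made for this illustration.

```python
def precision_recall_f1(predicted: set[str], truth: set[str]) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for predicted items S against ground truth G."""
    overlap = len(predicted & truth)  # |S ∩ G|
    # Assumption: an empty denominator yields 0.0 rather than raising
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(truth) if truth else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator > 0 else 0.0
    return precision, recall, f1


# S has four items, two of which are in the three-item ground truth G
p, r, f1 = precision_recall_f1({"A", "B", "C", "D"}, {"A", "B", "E"})
print(f"{p:.3f} {r:.3f} {f1:.3f}")  # prints: 0.500 0.667 0.571
```

Here P = 2/4 and R = 2/3, so F1 = 2·(1/2)·(2/3) / (1/2 + 2/3) = 4/7 ≈ 0.571.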

### Outcome Classification

| Outcome | Condition | Interpretation |
|---------|-----------|----------------|
| `fully_correct` | S = G | Perfect answer |
| `correct_with_extraneous` | G ⊊ S | All ground truth items found, plus extra items |
| `partially_correct` | S∩G ≠ ∅ and G ⊈ S | Some, but not all, ground truth items found |
| `fully_incorrect` | S∩G = ∅ | No correct items |
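
The classification can be sketched with set operations, checking the conditions in the order the table lists them so that the first match wins. This is an illustration only: `classify` is a hypothetical helper, and the actual grader relies on the LLM judge to decide which items match rather than on exact string equality.

```python
def classify(predicted: set[str], truth: set[str]) -> str:
    """Map predicted items S and ground truth G to one of the four outcomes."""
    if predicted == truth:
        return "fully_correct"
    if truth < predicted:  # G is a proper subset of S: everything found, plus extras
        return "correct_with_extraneous"
    if predicted & truth:  # some overlap, but at least one ground truth item is missing
        return "partially_correct"
    return "fully_incorrect"


truth = {"A", "B"}
print(classify({"A", "B"}, truth))       # fully_correct
print(classify({"A", "B", "C"}, truth))  # correct_with_extraneous
print(classify({"A", "C"}, truth))       # partially_correct
print(classify({"C"}, truth))            # fully_incorrect
```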

### 2.1 Single-Sample Walkthrough

Before running at scale, let's walk through one example end-to-end: run the agent,
then grade its response with the LLM judge.

In [None]:
# Reproducibly select one Finance & Economics example
finance_examples = dataset.get_by_category("Finance & Economics")
example = finance_examples[0]

console.print(
    Panel(
        f"[bold]ID:[/bold] {example.example_id}\n"
        f"[bold]Category:[/bold] {example.problem_category}\n"
        f"[bold]Answer Type:[/bold] {example.answer_type}\n\n"
        f"[bold cyan]Question:[/bold cyan]\n{example.problem}\n\n"
        f"[bold yellow]Ground Truth:[/bold yellow]\n{example.answer}",
        title="Evaluation Example",
        border_style="blue",
    )
)

In [None]:
eval_agent = KnowledgeGroundedAgent(enable_planning=True)
eval_response = await run_with_display(eval_agent, example.problem)

display_response(
    console,
    eval_response.text,
    title="Agent Answer",
    subtitle=f"Duration: {eval_response.total_duration_ms / 1000:.1f}s  |  Tools: {len(eval_response.tool_calls)}",
)

In [None]:
console.print("[dim]Grading with LLM judge...[/dim]\n")

result = await evaluate_deepsearchqa_async(
    question=example.problem,
    answer=eval_response.text,
    ground_truth=example.answer,
    answer_type=example.answer_type,
)

outcome_color = {
    EvaluationOutcome.FULLY_CORRECT: "green",
    EvaluationOutcome.CORRECT_WITH_EXTRANEOUS: "yellow",
    EvaluationOutcome.PARTIALLY_CORRECT: "orange1",
    EvaluationOutcome.FULLY_INCORRECT: "red",
}.get(result.outcome, "white")

metrics_table = Table(title="Grader Results")
metrics_table.add_column("Metric", style="cyan")
metrics_table.add_column("Value", style="white")
metrics_table.add_row("Outcome", f"[{outcome_color}]{result.outcome.value}[/{outcome_color}]")
metrics_table.add_row("Precision", f"{result.precision:.3f}")
metrics_table.add_row("Recall", f"{result.recall:.3f}")
metrics_table.add_row("F1", f"[bold]{result.f1_score:.3f}[/bold]")
console.print(metrics_table)

if result.explanation:
    console.print(Panel(result.explanation, title="Grader Explanation", border_style="magenta"))

# Show per-item correctness for Set Answer questions
if result.correctness_details:
    details_table = Table(title="Correctness Details")
    details_table.add_column("Expected Item", style="white")
    details_table.add_column("Found", style="cyan", justify="center")
    for item, found in result.correctness_details.items():
        icon = "[green]✓[/green]" if found else "[red]✗[/red]"
        label = item[:60] + "..." if len(item) > 60 else item
        details_table.add_row(label, icon)
    console.print(details_table)

## 3. Running the Evaluation Experiment

`run_experiment` runs the agent against every item in the Langfuse dataset, scores each
response, and records results in Langfuse. Each call creates a new named experiment run
that you can compare to previous runs in the UI.

The experiment takes two functions:

- **`agent_task`** — receives a dataset item, runs the agent, returns the answer string
- **`deepsearchqa_evaluator`** — receives question, answer, and ground truth; returns grader scores

> **Note:** This makes one agent call and one grader call per item. Section 1 uploads a single
> item, so this run is short; with 10 items and `max_concurrency=1`, expect 20–40 minutes
> depending on model latency.
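
How `run_experiment` enforces `max_concurrency` is not shown in this notebook. A common pattern for bounding concurrent async work is an `asyncio.Semaphore`, sketched below. The helper `run_bounded` and its parameters are hypothetical and are not part of the `aieng` or Langfuse APIs.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable


async def run_bounded(
    items: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    max_concurrency: int = 1,
) -> list[Any]:
    """Run worker(item) for every item with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item: Any) -> Any:
        # Each task waits here until a slot is free
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in items))
```

With a limit of 1, items run strictly one after another, so total runtime is roughly the sum of the per-item times. Raising the limit trades shorter wall-clock time for more simultaneous API requests, which may hit rate limits.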

In [None]:
async def agent_task(*, item: Any, **kwargs: Any) -> str:
    """Run the Knowledge Agent on a Langfuse dataset item."""
    agent = KnowledgeGroundedAgent(enable_planning=True)
    response = await agent.answer_async(item.input)
    return response.text


async def deepsearchqa_evaluator(
    *,
    input: str,  # noqa: A002
    output: str,
    expected_output: str,
    metadata: dict[str, Any] | None = None,
    **kwargs: Any,
) -> list[Evaluation]:
    """LLM-as-judge grader using DeepSearchQA methodology."""
    answer_type = (metadata or {}).get("answer_type", "Set Answer")
    result = await evaluate_deepsearchqa_async(
        question=input,
        answer=output,
        ground_truth=expected_output,
        answer_type=answer_type,
        model_config=LLMRequestConfig(temperature=0.0),
    )
    return result.to_evaluations()

In [None]:
experiment_result = run_experiment(
    DATASET_NAME,
    name="knowledge-agent-baseline",
    task=agent_task,
    evaluators=[deepsearchqa_evaluator],
    description="Baseline Knowledge Agent on Finance & Economics questions.",
    max_concurrency=1,
)

console.print("[green]✓[/green] Experiment complete")
if experiment_result.dataset_run_url:
    display(
        HTML(
            f'<p>View experiment: <a href="{experiment_result.dataset_run_url}" target="_blank">{experiment_result.dataset_run_url}</a></p>'
        )
    )

## 4. Inspecting Results

The `ExperimentResult` object gives programmatic access to every item-level score.
Aggregate metrics are visible in the Langfuse experiment run summary in the UI.
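
Once item-level scores are flattened into rows, it can be useful to break a metric down by `answer_type`. The sketch below uses plain dictionaries so it runs on its own. The helper `mean_by_group`, the `F1` column name, and the sample values are assumptions for illustration; the real column names depend on what the grader's evaluations are called.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(rows: list[dict], group_key: str, metric: str) -> dict[str, float]:
    """Average one metric per group, skipping rows where the metric is missing."""
    groups: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        value = row.get(metric)
        if value is not None:
            groups[row.get(group_key, "")].append(value)
    return {group: mean(values) for group, values in groups.items()}


# Made-up scores, not real experiment output
sample_rows = [
    {"answer_type": "Single Answer", "F1": 1.0},
    {"answer_type": "Set Answer", "F1": 0.5},
    {"answer_type": "Set Answer", "F1": 0.75},
]
print(mean_by_group(sample_rows, "answer_type", "F1"))
```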

In [None]:
rows = []
for item_result in experiment_result.item_results:
    item = item_result.item
    question = str(item.input)
    row = {
        "question": question[:55] + "..." if len(question) > 55 else question,
        "answer_type": (item.metadata or {}).get("answer_type", ""),
    }
    for evaluation in item_result.evaluations or []:
        row[evaluation.name] = evaluation.value
    rows.append(row)

df = pd.DataFrame(rows)
print(df.to_string(index=False))

In [None]:
# Mean of numeric metrics
numeric_cols = [c for c in ["F1", "Precision", "Recall"] if c in df.columns]
if numeric_cols:
    means_table = Table(title="Mean Scores")
    means_table.add_column("Metric", style="cyan")
    means_table.add_column("Mean", style="white")
    for col in numeric_cols:
        means_table.add_row(col, f"{df[col].mean():.3f}")
    console.print(means_table)

# Outcome distribution
if "Outcome" in df.columns:
    outcome_table = Table(title="Outcome Distribution")
    outcome_table.add_column("Outcome", style="cyan")
    outcome_table.add_column("Count", style="white", justify="right")
    outcome_table.add_column("Fraction", style="dim", justify="right")
    total = len(df)
    for outcome, count in df["Outcome"].value_counts().items():
        outcome_table.add_row(str(outcome), str(count), f"{count / total:.0%}")
    console.print(outcome_table)

## 5. Iterating on the Agent

The dataset in Langfuse is persistent — you don't need to re-upload it. To evaluate a modified
agent, call `run_experiment` again with a new `name` argument. Langfuse will create a new
experiment run and you can compare runs side-by-side in the UI.
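
Because each run needs its own `name`, one option is to derive the name from the configuration being tested. The helper below is a hypothetical sketch; the naming scheme is an assumption, not a convention required by Langfuse or by `run_experiment`.

```python
from datetime import datetime, timezone
from typing import Optional


def experiment_name(base: str, *, enable_planning: bool, when: Optional[datetime] = None) -> str:
    """Build a run name that encodes the planning setting and a UTC timestamp."""
    when = when or datetime.now(timezone.utc)
    planning = "plan" if enable_planning else "noplan"
    return f"{base}-{planning}-{when:%Y%m%d-%H%M%S}"


fixed_time = datetime(2025, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(experiment_name("knowledge-agent", enable_planning=False, when=fixed_time))
# prints: knowledge-agent-noplan-20250102-030405
```

The resulting string could then be passed as the `name` argument, so that runs stay distinguishable when listed side by side.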

### Levers to Explore

- **System prompt**: edit `SYSTEM_INSTRUCTIONS_TEMPLATE` in `system_instructions.py` to change
  the search strategy, verification rules, or final answer format
- **Planning**: set `enable_planning=False` to skip PlanReAct and compare quality vs. speed
- **Model**: change the Gemini model in `KnowledgeGroundedAgent` for different capability/cost trade-offs
- **Dataset**: change the category passed to `get_by_category` in Section 1, or widen the `[:1]` slice to cover more examples

### What to Look for in Langfuse

- Items with **low F1** — did the agent fail to fetch the source? Stop early? Misread the question?
- Items with **`correct_with_extraneous`** — is the agent over-generating? Can the prompt be tightened?
- **Latency outliers** — which steps are slow? Is replanning happening unnecessarily?
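
The first two checks can also be done locally on item-level rows before opening the UI. This is a minimal sketch: `triage` is a hypothetical helper, the `F1` and `Outcome` keys assume the evaluation names used in Section 4, and the 0.5 threshold is arbitrary.

```python
def triage(rows: list[dict], f1_threshold: float = 0.5) -> dict[str, list[dict]]:
    """Split rows into low-F1 items (worst first) and over-generating items."""
    low_f1 = [row for row in rows if row.get("F1") is not None and row["F1"] < f1_threshold]
    extraneous = [row for row in rows if row.get("Outcome") == "correct_with_extraneous"]
    return {
        "low_f1": sorted(low_f1, key=lambda row: row["F1"]),
        "extraneous": extraneous,
    }


# Made-up rows, not real experiment output
example_rows = [
    {"question": "Q1", "F1": 0.2, "Outcome": "partially_correct"},
    {"question": "Q2", "F1": 0.8, "Outcome": "correct_with_extraneous"},
    {"question": "Q3", "F1": 0.0, "Outcome": "fully_incorrect"},
]
flagged = triage(example_rows)
print([row["question"] for row in flagged["low_f1"]])      # ['Q3', 'Q1']
print([row["question"] for row in flagged["extraneous"]])  # ['Q2']
```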

## Summary

In this notebook you:

1. **Uploaded** a DeepSearchQA subset to Langfuse as a persistent, reusable dataset
2. **Understood** the LLM-as-judge grader: precision, recall, F1, and the four outcome categories
3. **Walked through** a single-sample evaluation end-to-end
4. **Ran** a full experiment with `run_experiment` and inspected item-level scores
5. **Learned** how to iterate: re-run with a new experiment name to compare configurations in Langfuse

The evaluation pipeline is the foundation for systematic agent improvement — each iteration
produces a new experiment run that you can compare to the baseline in the Langfuse UI.

In [None]:
console.print(Panel("[green]✓[/green] Notebook complete!", title="Done", border_style="green"))
if experiment_result.dataset_run_url:
    display(
        HTML(
            f'<p>View experiment results: <a href="{experiment_result.dataset_run_url}" target="_blank">{experiment_result.dataset_run_url}</a></p>'
        )
    )