# 01: Why Evaluate Agents?

LLMs are probabilistic. Agents are LLMs orchestrating tools across multiple steps, so errors compound. Even if a model has a 95% chance of being correct on any single step, a 10-step task completes successfully only about 60% of the time (0.95^10 ≈ 0.60, assuming the steps fail independently).
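The arithmetic behind that estimate can be checked with a few lines of Python. This is a minimal sketch that assumes every step succeeds independently with the same probability; real agents can recover from some errors, so treat it as a rough model. The function name `task_success_probability` is illustrative.

```python
def task_success_probability(step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    return step_accuracy ** num_steps


# Show how quickly end-to-end success decays as tasks get longer.
for steps in (1, 5, 10, 20):
    print(steps, round(task_success_probability(0.95, steps), 3))
# 1 0.95
# 5 0.774
# 10 0.599
# 20 0.358
```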

Evaluation is how we move from "it seems to work" to "we know it works, and we can measure how well."

## What You'll Learn

1. The four dimensions of agent quality
2. Types of graders used to evaluate agents
3. Capability-driven improvement cycles
4. How evaluation results drive concrete system changes

No code execution required — this is a conceptual foundation for the rest of the bootcamp.

## 1. The Evaluation Framework

Agent quality has four dimensions that must be tracked separately — a correct final answer can still involve poor tool usage or incoherent reasoning.

- **Outcome Success** — Did the agent achieve the objective?
- **Tool Usage Quality** — Right tools, right arguments, right order, no redundancy
- **Reasoning Coherence** — Were intermediate steps logical and justified?
- **Cost-Performance** — Was the result worth the latency and tokens spent?
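One way to keep the four dimensions separate is to store them as separate fields per task. The sketch below is an assumption about how such a record might look; the class `EvalRecord`, its field names, and the 0.7 threshold are hypothetical and not part of the bootcamp harness.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated task, with each quality dimension stored separately."""
    task_id: str
    outcome_success: bool     # Outcome Success
    tool_usage_score: float   # Tool Usage Quality, 0.0 to 1.0
    reasoning_score: float    # Reasoning Coherence, 0.0 to 1.0
    latency_s: float          # Cost-Performance: wall-clock time
    total_tokens: int         # Cost-Performance: tokens spent


def hidden_problems(record: EvalRecord, min_score: float = 0.7) -> list[str]:
    """List weak dimensions on a task whose final answer was correct."""
    if not record.outcome_success:
        return []
    issues = []
    if record.tool_usage_score < min_score:
        issues.append("tool_usage")
    if record.reasoning_score < min_score:
        issues.append("reasoning")
    return issues


record = EvalRecord("task-001", True, 0.4, 0.9, 12.5, 8200)
print(hidden_problems(record))  # ['tool_usage']
```

A single aggregate score would hide the case shown here: the answer is correct, but the tool usage is poor.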

In the framework diagram, the pipeline beneath the four pillars shows the evaluation workflow: a **golden dataset** (ground truth) → **hybrid judges** → **go/no-go gate** for production readiness.
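The final stage of that workflow, the go/no-go gate, can be as simple as a set of threshold checks over aggregate metrics. The following is a hedged sketch: the metric names and threshold values are hypothetical and would need to be chosen per project.

```python
def release_gate(metrics: dict, thresholds: dict) -> tuple[bool, list[str]]:
    """Return (go, reasons); go is True only if every threshold is met."""
    reasons = []
    if metrics["pass_rate"] < thresholds["min_pass_rate"]:
        reasons.append("pass rate below minimum")
    if metrics["mean_tool_score"] < thresholds["min_tool_score"]:
        reasons.append("tool usage score below minimum")
    if metrics["mean_tokens"] > thresholds["max_mean_tokens"]:
        reasons.append("token cost above budget")
    return (not reasons, reasons)


# Hypothetical thresholds and metrics, for illustration only.
thresholds = {"min_pass_rate": 0.90, "min_tool_score": 0.80, "max_mean_tokens": 10_000}
metrics = {"pass_rate": 0.93, "mean_tool_score": 0.75, "mean_tokens": 7_500}

go, reasons = release_gate(metrics, thresholds)
print(go, reasons)  # False ['tool usage score below minimum']
```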

<img src="agent_evaluation_framework.png" width="900" />

## 2. Types of Graders

Not all agent outputs can be judged the same way. Three grader types fill complementary roles, and the right mix depends on what you're evaluating.

### Code-Based Graders
Rule-based checks: string matching (exact, regex, fuzzy), schema validation, tool call verification, outcome assertions, and transcript analysis.

- **Strengths:** Fast, cheap, deterministic, easy to debug
- **Weaknesses:** Brittle when valid variations exist; no nuance for open-ended outputs
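A minimal sketch of three code-based graders follows. It assumes tool calls are recorded as dictionaries with a `"name"` key, which is a hypothetical transcript format; the function names are also illustrative.

```python
import re


def grade_exact(output: str, expected: str) -> bool:
    """Exact string match after trimming whitespace."""
    return output.strip() == expected.strip()


def grade_regex(output: str, pattern: str) -> bool:
    """Pass if the pattern appears anywhere in the output."""
    return re.search(pattern, output) is not None


def grade_tool_sequence(tool_calls: list[dict], expected_names: list[str]) -> bool:
    """Pass if the agent called exactly the expected tools, in order."""
    return [call["name"] for call in tool_calls] == expected_names


answer = "The answer is 42."
print(grade_exact(answer, "42"))        # False: valid phrasing, but not identical
print(grade_regex(answer, r"\b42\b"))   # True

calls = [{"name": "search", "args": {"q": "weather"}}, {"name": "summarize", "args": {}}]
print(grade_tool_sequence(calls, ["search", "summarize"]))  # True
```

The first two checks illustrate the brittleness trade-off: the exact match rejects a correct answer that is phrased differently, while the regex tolerates it.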

### Model-Based Graders (LLM-as-Judge)
An LLM evaluates the agent output using rubrics, natural language assertions, pairwise comparison, or reference-based scoring. Multi-judge consensus reduces noise.

- **Strengths:** Flexible, scalable, handles subjective and freeform tasks
- **Weaknesses:** Non-deterministic, more expensive, requires calibration against human judgments

### Human Graders
Subject matter experts or crowd annotators review outputs, often via spot-check sampling or A/B testing. Inter-annotator agreement is measured to ensure reliability.

- **Strengths:** Gold-standard quality; essential for calibrating model-based graders
- **Weaknesses:** Slow and expensive; doesn't scale to continuous deployment

### How to Combine Them

| Grader Type | Best For | Trade-off |
|-------------|----------|-----------|
| **Code-based** | Format, schema, tool call checks | Fast but brittle |
| **Model-based** | Reasoning quality, nuanced correctness | Flexible but non-deterministic |
| **Human** | Calibration, edge cases, expert tasks | Accurate but slow and costly |

Scores can be binary, weighted, or hybrid — combining grader types compensates for each approach's individual weaknesses.
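One possible hybrid scheme treats code-based checks as hard gates and then takes a weighted average of model-based scores. This is a sketch of one design among many; the check names, score names, and weights below are hypothetical.

```python
def hybrid_score(
    code_checks: dict[str, bool],
    judge_scores: dict[str, float],
    weights: dict[str, float],
) -> float:
    """Return 0.0 if any code-based check fails, else a weighted judge average."""
    # Binary gate: cheap deterministic checks must all pass first.
    if not all(code_checks.values()):
        return 0.0
    # Weighted part: combine model-based scores (each in 0.0 to 1.0).
    total_weight = sum(weights.values())
    return sum(weights[name] * judge_scores[name] for name in weights) / total_weight


score = hybrid_score(
    code_checks={"valid_json": True, "called_search_tool": True},
    judge_scores={"reasoning": 0.9, "helpfulness": 0.6},
    weights={"reasoning": 2.0, "helpfulness": 1.0},
)
print(round(score, 2))  # 0.8
```

Gating on the deterministic checks keeps a fluent but malformed output from earning partial credit from the judge.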

## 3. Capability Evaluation: Hills to Climb

Hard benchmarks serve a different purpose from correctness checks: they define long-term improvement targets. A 40% pass rate on a difficult benchmark isn't a failure — it's a starting point.

The improvement cycle:

1. **Test** on hard tasks (low initial pass rate is expected)
2. **Identify gaps** — where exactly does the agent fail?
3. **Improve** — prompt, tooling, or model changes
4. **Re-test** — measure progress; check for regressions

SWE-bench Verified is a real example: coding agents went from ~40% to 80%+ pass rate over one year of this cycle.
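The re-test step of the cycle can be sketched as a comparison between two runs of the same benchmark. The example assumes each run is stored as a mapping from task ID to pass/fail, which is a hypothetical format; the task IDs are made up.

```python
def compare_runs(before: dict[str, bool], after: dict[str, bool]) -> dict:
    """Compare two runs on the tasks they share: progress and regressions."""
    shared = sorted(before.keys() & after.keys())
    if not shared:
        raise ValueError("The two runs have no tasks in common")
    return {
        "pass_rate_before": sum(before[t] for t in shared) / len(shared),
        "pass_rate_after": sum(after[t] for t in shared) / len(shared),
        "fixed": [t for t in shared if not before[t] and after[t]],
        "regressed": [t for t in shared if before[t] and not after[t]],
    }


run_1 = {"t1": True, "t2": False, "t3": False, "t4": True}
run_2 = {"t1": True, "t2": True, "t3": False, "t4": False}

report = compare_runs(run_1, run_2)
print(report["pass_rate_before"], report["pass_rate_after"])  # 0.5 0.5
print(report["fixed"], report["regressed"])                   # ['t2'] ['t4']
```

The example shows why the regression check matters: the pass rate is unchanged, yet one task was fixed and another broke.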

<img src="capability_evaluation.png" width="900" />

## 4. From Evaluation Results to System Changes

Evaluation only creates value if failures map to concrete improvements. Each failure pattern has a corresponding intervention:

- **Prompt** — rewrite instructions based on failure patterns; reduce token count
- **Tool Usage** — fix inefficient sequencing; eliminate redundant calls
- **Behavior** — address over-engineering or verbosity identified in transcripts
- **Planning & Reasoning** — add verification-aware subgoals; introduce self-critique

The loop closes when you re-run the same benchmark and see the delta.
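To decide which intervention to try first, failures can be tagged by category and tallied. The sketch below assumes each failed task has already been labeled with one of four tags matching the list above; the tag names and the `INTERVENTIONS` table are illustrative.

```python
from collections import Counter

# Hypothetical mapping from failure tag to the intervention category above.
INTERVENTIONS = {
    "prompt": "Rewrite instructions based on the failure pattern",
    "tool_usage": "Fix sequencing and remove redundant calls",
    "behavior": "Address over-engineering or verbosity",
    "planning": "Add verification-aware subgoals or self-critique",
}


def prioritize_interventions(failure_tags: list[str]) -> list[tuple[str, int, str]]:
    """Rank failure categories by frequency, paired with a suggested fix."""
    counts = Counter(failure_tags)
    return [
        (tag, count, INTERVENTIONS.get(tag, "Triage manually"))
        for tag, count in counts.most_common()
    ]


tags = ["tool_usage", "prompt", "tool_usage", "planning", "tool_usage", "prompt"]
for tag, count, action in prioritize_interventions(tags):
    print(f"{tag}: {count} failures -> {action}")
```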

<img src="system_optimization.png" width="900" />

## Summary

1. **Compounding errors** — 10-step agents need task-level evaluation, not just per-answer accuracy
2. **Four dimensions** — outcome, tool quality, reasoning, cost-performance
3. **Three grader types** — code-based (fast, deterministic), model-based (flexible, nuanced), human (gold-standard, for calibration)
4. **Capability benchmarks** — hard problems as hills to climb; progress is measured, not estimated
5. **Closed loop** — failure patterns map to prompt, tool, behavior, and reasoning changes

**Next:** In Notebook 02, we'll use the shared evaluation harness to run these ideas in code.