agenteval treats an agent run like a unit test. You declare the agent once with capa's capabilities.yaml, write scenarios (a prompt, fake MCP servers seeded with known state, and assertions), and run the whole suite from one CI-friendly command. Providers are pluggable, and Claude Code ships first.
Testing an AI agent is awkward. It calls real SaaS APIs, burns tokens on every run, and answers a little differently each time. So most agent "tests" are a human eyeballing the output once and calling it good. There is no red/green, nothing to gate a PR on, nothing that tells you a prompt edit quietly broke the Slack integration.
agenteval gives you the parts that are tedious to build yourself. You declare the agent once in capabilities.yaml, then write scenarios next to it: a prompt, a set of mock MCP servers seeded with state you control, and assertions about what should happen. Each run gets its own throwaway workspace, talks only to local mocks, and is checked against those assertions. The mocks record every tool call, so you can assert the agent posted to #releases exactly once, or never touched a delete_user tool at all.
One command runs the whole matrix of scenarios, models, and repeats, aggregates the results, and exits nonzero when something regresses. JUnit XML drops straight into CI.
- MCP mock infrastructure: stand up fake MCP servers from a
mock.yamlwith verbatim tool schemas, seed state, declarative responses, and an optional Python escape hatch for stateful tools. - Every tool call is recorded, so assertions can check what the agent did, not just what it said.
- Ephemeral workspace per run,
capa installto compilecapabilities.yamlinto provider config, the agent invocation, repeats, and result aggregation. - Declarative assertions (
mock_state,tool_called, file and output checks) plus an optional LLM judge scored against a rubric. - Reports as JSON, JUnit XML for native CI gating, markdown for PR comments, and a console summary.
- Pluggable providers: a new harness is a
Providersubclass that picks a differentcapa install -ptarget, with no changes to the runner.
For every (scenario, model, repeat) cell, agenteval builds an isolated environment, lets capa bootstrap the agent, runs it against local mocks, then checks the side effects. The mocks start before install so capa can validate tools against a live server.
flowchart TD
A["agenteval run"] --> B["load config + scenarios"]
B --> C{{"for each scenario x model x repeat"}}
subgraph Cell["one cell"]
direction TB
D["start MCP mocks + workspace"] --> E["capa install -p claude-code"]
E --> F["run agent"]
F --> G["assertions + optional LLM judge"]
G --> H["record result, teardown"]
end
C --> Cell
H --> C
C -->|all done| I["aggregate + emit reports"]
I --> J["exit nonzero on failure"]
The framework owns the workspace, the mocks, and the URL rewiring that connects them; it delegates all harness provisioning to capa install -p <provider>.
pip install agenteval-frameworkOr install it as an isolated CLI tool with uv:
uv tool install agenteval-frameworkThe distribution is agenteval-framework; the import package and the CLI are both agenteval.
Prerequisites (external, documented by their projects):
- The
capabinary on PATH. agenteval uses it to compilecapabilities.yaml. - The provider CLI. For Claude that is the
claudeCLI (Claude Code), authenticated via subscription orANTHROPIC_API_KEY.
agenteval init my-eval
cd my-evalThis drops an agenteval.yaml, an agent, and one example scenario next to your code.
You can also optionally install the agenteval skill to let your agent author tests:
capa add Minitour/agenteval@agentevalcp .env.example .env # add ANTHROPIC_API_KEY if your CLI needs itagenteval validate # structural check, no model calls
agenteval run -v # run the suiteOr run the bundled, fully working example:
agenteval run --root examples/my-eval -vExpected shape of the output:
scenario 'slack-release-note' provider=claude model=claude-opus-4-8
trial 1/1: PASS $0.1783 turns=4 13.2s judge=1.00
suite: 1/1 runs passed -> PASS
my-eval/
agenteval.yaml # provider, models, defaults, judge, report config
.env # ANTHROPIC_API_KEY etc (you bring; never committed)
agents/
release-bot/
capabilities.yaml # capa spec -> provider config
CLAUDE.md # agent base prompt (referenced by capabilities.yaml)
skills/slack/SKILL.md # local skills (optional)
scenarios/
slack-release-note/
scenario.yaml # agent ref, prompt, mcp list, assertions, judge
input/prompt.md # the task prompt
rubric.md # LLM-judge rubric (optional)
assets/ # files copied into the ephemeral workspace (optional)
mcp/
slack/
mock.yaml # schema ref + seed state + declarative responses
schema.json # verbatim tools/list (optional; can inline)
handler.py # optional Python hooks for stateful tools
reports/ # output (gitignored)
scenario.yaml:
id: slack-release-note
agent: release-bot
prompt_file: input/prompt.md
mcp: [slack] # which mocks (dirs under mcp/) to start
run:
repeats: 1
max_turns: 15
assertions:
- kind: mock_state # a message landed in #releases
server: slack
jsonpath: "$.messages[?(@.channel_name=='releases')]"
min_count: 1
- kind: tool_called # the agent actually posted
server: slack
tool: slack_send_message
min_count: 1
judge:
rubric_file: rubric.md # LLM-judged message quality
min_score: 0.6A scenario normally binds to one agent, but for benchmark-style comparisons you can give it a list. Each entry expands into its own (scenario, agent) cell that shares the same prompt, mocks, assertions, and rubric, so the whole matrix lives in one file (no symlinked duplicate scenarios). The cell ids are suffixed with the agent name to stay unique (e.g. my-task[agent-a], my-task[agent-b]).
id: my-task
agent: [agent-a, agent-b, agent-c] # one cell per agent--agent <name> (repeatable) overrides the agent: field at invocation time, which is handy for running a single variant from a multi-agent scenario.
When agents need different harnesses, a scenario can pin its own provider:, overriding the global one in agenteval.yaml (and falling through to it when absent). A single agenteval run then instantiates one provider per scenario as needed:
agent: my-agent
provider: my-provider # this scenario uses a specific provider; others fall back to the global one| kind | params | checks |
|---|---|---|
mock_state |
server, jsonpath, min_count, max_count |
JSONPath matches against a mock's final state |
tool_called |
tool, server, min_count, max_count |
a tool was (or was never, with max_count: 0) called |
file_exists |
path |
a file exists in the workspace |
file_contains |
path, values, mode |
file contains all/any of values |
dir_has_new_file |
path, matches, contents_include |
a matching file with given content exists |
output_contains |
values, mode, ignore_case |
final answer contains all/any of values |
output_matches |
pattern, ignore_case |
final answer matches a regex |
Each assertion is required: true by default; a required failure fails the run and sets a nonzero exit code. The jsonpath for mock_state supports the Goessner filter form ($.coll[?(@.field=='value')]) and plain paths natively, and falls back to jsonpath-ng for richer expressions.
mcp/<server>/mock.yaml is hybrid: declarative by default, with a Python escape hatch for stateful behaviour.
name: slack
schema: schema.json # verbatim tools/list, or inline `tools:`
handler: handler.py # optional; functions override named tools
seed: # initial state, reset before every run
channels:
- { id: C1, name: releases }
messages: []
responses: # declarative dispatch (used when no handler matches)
list_channels:
result: { ok: true, channels: "{{ state.channels }}" }
post_message:
mutate:
- append: { path: messages, value: { channel_name: releases, text: "{{ args.text }}" } }
result: { ok: true, ts: "{{ now }}" }Templating is Jinja2 over { args, state, now, uuid }. A value that is a single {{ expr }} keeps its native JSON type. Mutations support append, extend, set, and increment. A handler.py may export a HANDLERS dict (or tool_<name> functions); each handler is fn(args, ctx) and mutates ctx.state.data.
agenteval run run the suite, emit reports, exit nonzero on failure
agenteval list list discovered scenarios
agenteval validate check structure without calling any model
agenteval init scaffold a new eval project
run flags: --root, --filter <substr>, --provider, --agent (repeatable), --model (repeatable), --repeat N, --report-dir, --no-judge, --keep-workspace, -v.
agenteval run writes to report.dir (default reports/):
results.json: full per-run records plus per-cell aggregates (mean/stddev/median per metric, pass rate, judge score).junit.xml: one testcase per repeat, for native CI gating.report.md: a table plus a failure breakdown for PR comments.
The judge defaults to the claude-cli backend, which reuses the Claude Code CLI's auth (so it works without exporting ANTHROPIC_API_KEY). Set judge.backend: anthropic-api in agenteval.yaml to use the Anthropic SDK with ANTHROPIC_API_KEY instead.
Templates are included for both GitHub Actions (examples/my-eval/.github/workflows/agenteval.yml) and GitLab CI (examples/my-eval/.gitlab-ci.yml). Copy the one you need to your repo root, provide ANTHROPIC_API_KEY as a secret/variable, install capa and the claude CLI, then run agenteval run. The nonzero exit on failure gates the merge request; both templates upload reports/ as an artifact and publish junit.xml for native test reporting.
Subclass agenteval.providers.base.Provider (implement install, run, preflight), then register it:
# providers/my_provider.py
from agenteval.providers.base import Provider
from agenteval.providers.registry import register
class MyProvider(Provider):
name = "my-provider"
...
register(MyProvider)install rewrites the servers: URLs in capabilities.yaml to the local mock endpoints and compiles them for the target harness; run invokes the agent and returns a normalized ProviderRunOutput. No core changes required.
register() alone is not enough: the agenteval console script does not import your project, so the module that calls register() has to be loaded somehow. List it under plugins: in agenteval.yaml and the CLI imports it (with the project root on sys.path) before resolving providers — no wrapper script needed:
# agenteval.yaml
plugins:
- providers.my_provider # importable module path, relative to the project rootThen agenteval run --provider my-provider just works. Naming a provider after a built-in (e.g. claude) shadows the bundled one and emits a warning; pass register(MyProvider, force=True) to silence it if that is intentional.
See CONTRIBUTING.md. Security reports go through SECURITY.md.
MIT