
docs: benchmarking best practices guide #1137

@christso

Description

Objective

Add a docs page covering end-to-end best practices for using AgentV to benchmark coding agents. This consolidates patterns that are already supported but not discoverable — users shouldn't have to reverse-engineer these from scattered feature docs.

Target audience

Teams setting up agent benchmarks for the first time — comparing agents, tracking regressions, evaluating harness configurations.

Proposed outline

1. Reproducibility

  • Pin workspace templates in git alongside eval files
  • Use agentv bundle to compile a portable, self-contained eval directory for sharing (feat: immutable run bundle / reproducibility artifact #1133)
  • Git commit hash as provenance — the repo is your reproducibility unit
  • Record which targets/models ran in result artifacts
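The reproducibility workflow above can be sketched as a couple of shell steps. This is illustrative only: `agentv bundle` is the feature from #1133, but its arguments and the output paths here are assumptions, not the real CLI surface.

```shell
# Record provenance: the commit hash pins the eval files and workspace
# templates together, since they live in the same repo.
git rev-parse HEAD > results/provenance.txt

# Compile a portable, self-contained eval directory for sharing.
# (Arguments are hypothetical — see the bundle feature docs from #1133.)
agentv bundle ./evals/my-benchmark
```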

2. Evaluating the configured agent (not just pass/fail)

Most benchmarks run the bare agent against coding tasks and report binary pass/fail accuracy. AgentV can also evaluate how well the agent used its full harness:

  • Skill effectiveness: tool-trajectory to assert the right skill-triggered tools fired
  • MCP tool verification: tool-trajectory with arg matching on MCP tool calls
  • Workspace instruction adherence: rubric scoring whether the agent followed AGENTS.md/CLAUDE.md conventions
  • Composite harness quality: composite evaluator weighting correctness + tool usage + efficiency + instruction adherence

Include concrete EVAL.yaml examples for each pattern.
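As a starting point, a tool-trajectory case might look like the sketch below. The field names and evaluator schema are hypothetical — the published guide should take the real shapes from the evaluator docs.

```yaml
# Hypothetical EVAL.yaml sketch: assert that the right skill-triggered
# MCP tools fired, with arg matching. Schema is illustrative, not real.
cases:
  - id: customer-lookup-uses-mcp
    prompt: "Find the email address for customer 1042."
    evaluators:
      - type: tool-trajectory
        expect:
          - tool: mcp__customers__lookup
            args:
              id: 1042
```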

3. MCP servers in benchmarks

How to include MCP servers without introducing non-reproducibility:

  • Freeze MCP server data (SQLite dump, JSON fixtures) in the workspace template
  • Start/stop MCP server via before_all/after_all workspace hooks
  • The MCP server is real (agent calls real MCP tools), but the data is frozen
```yaml
workspace: ./cases/customer-lookup/
hooks:
  before_all:
    command: ./start-snapshot-mcp.sh
  after_all:
    command: ./stop-snapshot-mcp.sh
```

4. Comparing harness variants

Use target hooks to compare different agent configurations in a single eval run:

```yaml
execution:
  targets:
    - name: baseline
    - name: with-skills
      use_target: baseline
      hooks:
        before_each:
          command: ["setup-plugins.sh", "skills"]
    - name: with-mcp
      use_target: baseline
      hooks:
        before_each:
          command: ["setup-mcp.sh", "filesystem-server"]
```

Reference: Target Hooks docs

5. Managing expensive runs

  • --resume — skip completed tests after an interruption, append new results
  • --rerun-failed — rerun only failed/errored tests while keeping passing results
  • agentv eval aggregate <run-dir> — recompute stats from accumulated results
  • --workers N — control parallelism to stay within API rate limits
  • Circuit breaker pattern for provider-level failures (feat: circuit breaker for provider-level failure detection during eval runs #974)
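Together, these flags support an interrupt-and-continue loop. A sketch follows; the `agentv eval` subcommand shape is assumed from the flags listed above, not confirmed against the CLI:

```shell
# First attempt, capped parallelism to stay within API rate limits.
agentv eval --workers 4

# After an interruption: skip completed tests, append new results.
agentv eval --resume

# Rerun only failed/errored tests, keeping passing results.
agentv eval --rerun-failed

# Recompute stats from the accumulated results.
agentv eval aggregate <run-dir>
```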

6. Oracle validation for benchmark authoring

When authoring benchmark cases, validate that cases are solvable and test harnesses are sound by running a reference solution through the same verifier (#1136).
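A minimal oracle check, assuming a reference solution stored alongside the case and the same `./tests/verify.sh` verifier used for grading (all paths here are illustrative):

```shell
# Apply the reference solution to a fresh copy of the workspace template,
# then run the same verifier the agent's output will be graded with.
# If this fails, the case is unsolvable or the harness is unsound.
cp -r cases/customer-lookup/workspace /tmp/oracle-run
cp -r cases/customer-lookup/reference-solution/. /tmp/oracle-run/
(cd /tmp/oracle-run && ./tests/verify.sh)
```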

7. Decision matrix: what to evaluate

| Question | Evaluator | Example |
| --- | --- | --- |
| Did the task pass? | code-grader | `./tests/verify.sh` exit code |
| Did the agent use the right tools? | tool-trajectory | Assert MCP tool calls in order |
| Did it follow workspace conventions? | rubric | Check commit format, test-before-commit |
| Was tool usage efficient? | execution-metrics | `max_tool_calls`, `exploration_ratio` |
| Overall harness quality? | composite | Weighted average of all the above |
| Did the agent stay within budget? | cost / token-usage | Budget thresholds |
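The composite row could be expressed as a weighted combination of the other evaluators. The weights and field names below are illustrative, not the real schema:

```yaml
# Hypothetical composite evaluator: weighted harness-quality score.
evaluators:
  - type: composite
    weights:
      code-grader: 0.5        # did the task pass?
      tool-trajectory: 0.2    # right tools, in the right order
      rubric: 0.2             # workspace-convention adherence
      execution-metrics: 0.1  # tool-usage efficiency
```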

Acceptance signals

  • Docs page published at /docs/guides/benchmarking-best-practices or similar
  • Each section has at least one concrete EVAL.yaml example
  • Cross-links to relevant feature docs (target hooks, resume, evaluators)
