
docs: benchmarking best practices guide #1137

@christso

Description

Objective

Add a docs page covering end-to-end best practices for using AgentV to benchmark coding agents. This consolidates patterns that are already supported but not discoverable — users shouldn't have to reverse-engineer these from scattered feature docs.

Target audience

Teams setting up agent benchmarks for the first time — comparing agents, tracking regressions, evaluating harness configurations.

Proposed outline

1. Reproducibility

  • Pin workspace templates in git alongside eval files
  • Use agentv bundle to compile a portable, self-contained eval directory for sharing (feat: immutable run bundle / reproducibility artifact #1133)
  • Git commit hash as provenance — the repo is your reproducibility unit
  • Record which targets/models ran in result artifacts
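The reproducibility workflow above can be sketched as a couple of shell steps. This is illustrative only: `agentv bundle` is the feature from #1133, but its arguments and the output paths here are assumptions, not the real CLI surface.

```shell
# Record provenance: the commit hash pins the eval files and workspace
# templates together, since they live in the same repo.
git rev-parse HEAD > results/provenance.txt

# Compile a portable, self-contained eval directory for sharing.
# (Arguments are hypothetical — see the bundle feature docs from #1133.)
agentv bundle ./evals/my-benchmark
```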

2. Evaluating the configured agent (not just pass/fail)

Most benchmarks run the bare agent against coding tasks and report binary pass/fail accuracy. AgentV can also evaluate how well the agent used its full harness:

  • Skill effectiveness: tool-trajectory to assert the right skill-triggered tools fired
  • MCP tool verification: tool-trajectory with arg matching on MCP tool calls
  • Workspace instruction adherence: rubric scoring whether the agent followed AGENTS.md/CLAUDE.md conventions
  • Composite harness quality: composite evaluator weighting correctness + tool usage + efficiency + instruction adherence

Include concrete EVAL.yaml examples for each pattern.
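As a starting point, a tool-trajectory case might look like the sketch below. The field names and evaluator schema are hypothetical — the published guide should take the real shapes from the evaluator docs.

```yaml
# Hypothetical EVAL.yaml sketch: assert that the right skill-triggered
# MCP tools fired, with arg matching. Schema is illustrative, not real.
cases:
  - id: customer-lookup-uses-mcp
    prompt: "Find the email address for customer 1042."
    evaluators:
      - type: tool-trajectory
        expect:
          - tool: mcp__customers__lookup
            args:
              id: 1042
```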

3. MCP servers in benchmarks

How to include MCP servers without introducing non-reproducibility:

  • Freeze MCP server data (SQLite dump, JSON fixtures) in the workspace template
  • Start/stop MCP server via before_all/after_all workspace hooks
  • The MCP server is real (agent calls real MCP tools), but the data is frozen
```yaml
workspace: ./cases/customer-lookup/
hooks:
  before_all:
    command: ./start-snapshot-mcp.sh
  after_all:
    command: ./stop-snapshot-mcp.sh
```

4. Comparing harness variants

Use target hooks to compare different agent configurations in a single eval run:

```yaml
execution:
  targets:
    - name: baseline
    - name: with-skills
      use_target: baseline
      hooks:
        before_each:
          command: ["setup-plugins.sh", "skills"]
    - name: with-mcp
      use_target: baseline
      hooks:
        before_each:
          command: ["setup-mcp.sh", "filesystem-server"]
```

Reference: Target Hooks docs

5. Managing expensive runs

  • --resume — skip completed tests after an interruption, append new results
  • --rerun-failed — rerun only failed/errored tests while keeping passing results
  • agentv eval aggregate <run-dir> — recompute stats from accumulated results
  • --workers N — control parallelism to stay within API rate limits
  • Circuit breaker pattern for provider-level failures (feat: circuit breaker for provider-level failure detection during eval runs #974)
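Together, these flags support an interrupt-and-continue loop. A sketch follows; the `agentv eval` subcommand shape is assumed from the flags listed above, not confirmed against the CLI:

```shell
# First attempt, capped parallelism to stay within API rate limits.
agentv eval --workers 4

# After an interruption: skip completed tests, append new results.
agentv eval --resume

# Rerun only failed/errored tests, keeping passing results.
agentv eval --rerun-failed

# Recompute stats from the accumulated results.
agentv eval aggregate <run-dir>
```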

6. Oracle validation for benchmark authoring

When authoring benchmark cases, validate that cases are solvable and test harnesses are sound by running a reference solution through the same verifier (#1136).
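A minimal oracle check, assuming a reference solution stored alongside the case and the same `./tests/verify.sh` verifier used for grading (all paths here are illustrative):

```shell
# Apply the reference solution to a fresh copy of the workspace template,
# then run the same verifier the agent's output will be graded with.
# If this fails, the case is unsolvable or the harness is unsound.
cp -r cases/customer-lookup/workspace /tmp/oracle-run
cp -r cases/customer-lookup/reference-solution/. /tmp/oracle-run/
(cd /tmp/oracle-run && ./tests/verify.sh)
```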

7. Decision matrix: what to evaluate

| Question | Evaluator | Example |
| --- | --- | --- |
| Did the task pass? | code-grader | `./tests/verify.sh` exit code |
| Did the agent use the right tools? | tool-trajectory | Assert MCP tool calls in order |
| Did it follow workspace conventions? | rubric | Check commit format, test-before-commit |
| Was tool usage efficient? | execution-metrics | `max_tool_calls`, `exploration_ratio` |
| Overall harness quality? | composite | Weighted average of all the above |
| Did the agent stay within budget? | cost / token-usage | Budget thresholds |
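The composite row could be expressed as a weighted combination of the other evaluators. The weights and field names below are illustrative, not the real schema:

```yaml
# Hypothetical composite evaluator: weighted harness-quality score.
evaluators:
  - type: composite
    weights:
      code-grader: 0.5        # did the task pass?
      tool-trajectory: 0.2    # right tools, in the right order
      rubric: 0.2             # workspace-convention adherence
      execution-metrics: 0.1  # tool-usage efficiency
```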

Acceptance signals

  • Docs page published at /docs/guides/benchmarking-best-practices or similar
  • Each section has at least one concrete EVAL.yaml example
  • Cross-links to relevant feature docs (target hooks, resume, evaluators)
