Objective
Add a docs page covering end-to-end best practices for using AgentV to benchmark coding agents. This consolidates patterns that are already supported but not discoverable — users shouldn't have to reverse-engineer these from scattered feature docs.
Target audience
Teams setting up agent benchmarks for the first time — comparing agents, tracking regressions, evaluating harness configurations.
Proposed outline
1. Reproducibility
Pin workspace templates in git alongside eval files
Git commit hash as provenance — the repo is your reproducibility unit
Record which targets/models ran in result artifacts
agentv bundle to compile a portable, self-contained eval directory for sharing (feat: immutable run bundle / reproducibility artifact #1133)
2. Evaluating the configured agent (not just pass/fail)
Most benchmarks test the naked agent against coding tasks and report binary pass/fail accuracy. AgentV can evaluate how well the agent used its full harness:
Skill effectiveness — tool-trajectory to assert the right skill-triggered tools fired
MCP tool verification — tool-trajectory with arg matching on MCP tool calls
Workspace instruction adherence — rubric scoring whether the agent followed AGENTS.md/CLAUDE.md conventions
Composite scoring — composite evaluator weighting correctness + tool usage + efficiency + instruction adherence
Include concrete EVAL.yaml examples for each pattern.
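The composite pattern above reduces to a weighted average of per-dimension scores. A minimal sketch of that arithmetic — the dimension names, weights, and `composite_score` helper are illustrative assumptions, not AgentV's actual evaluator API:

```python
# Sketch: combine per-dimension scores in [0, 1] into one weighted result.
# Dimension names and weights are invented for illustration.

def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores; weights are normalized."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight

# Correctness dominates, but tool usage and instruction adherence
# still move the final score.
weights = {"correctness": 0.5, "tool_usage": 0.2,
           "efficiency": 0.1, "instruction_adherence": 0.2}
scores = {"correctness": 1.0, "tool_usage": 0.5,
          "efficiency": 0.8, "instruction_adherence": 1.0}
score = composite_score(scores, weights)  # ≈ 0.88
```

The same shape generalizes to any dimension set, which is why a docs example per pattern (as proposed above) is enough to get teams started.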
3. MCP servers in benchmarks
How to include MCP servers without introducing non-reproducibility:
before_all/after_all workspace hooks
4. Comparing harness variants
Use target hooks to compare different agent configurations in a single eval run:
Reference: Target Hooks docs
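Whatever the hook mechanism, variant comparison ultimately reduces to comparing per-target pass rates over the same cases. A small sketch of that aggregation — the record shape and target names are invented for illustration, not AgentV's result-artifact format:

```python
# Sketch: compare pass rates of two harness variants run over the same cases.
# The record fields (target, case, passed) are illustrative only.
from collections import defaultdict

results = [
    {"target": "claude-bare",   "case": "fix-auth-bug", "passed": False},
    {"target": "claude-bare",   "case": "add-endpoint", "passed": True},
    {"target": "claude-skills", "case": "fix-auth-bug", "passed": True},
    {"target": "claude-skills", "case": "add-endpoint", "passed": True},
]

def pass_rates(records: list[dict]) -> dict[str, float]:
    """Pass rate per target, as a fraction in [0, 1]."""
    totals, passes = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["target"]] += 1
        passes[r["target"]] += r["passed"]
    return {t: passes[t] / totals[t] for t in totals}

rates = pass_rates(results)
```

Running both variants in a single eval run, as the target-hooks approach allows, guarantees the case set is identical, so a difference in these rates is attributable to the harness rather than the benchmark.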
5. Managing expensive runs
--resume — skip completed tests after an interruption, append new results
--rerun-failed — rerun only failed/errored tests while keeping passing results
agentv eval aggregate <run-dir> — recompute stats from accumulated results
--workers N — control parallelism to stay within API rate limits
6. Oracle validation for benchmark authoring
When authoring benchmark cases, validate that cases are solvable and test harnesses are sound by running a reference solution through the same verifier (#1136).
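The oracle check can be sketched outside AgentV: drop a reference solution into the case's workspace, run the case's own verifier against it, and require a clean exit before publishing the case. A self-contained toy example — the workspace layout, `verify.sh` contents, and `oracle_validates` helper are invented for illustration:

```python
# Sketch: validate a benchmark case by running its reference solution
# through the same verifier the case will apply to agent output.
import subprocess
import tempfile
from pathlib import Path

def oracle_validates(workspace: Path) -> bool:
    """Return True if the reference solution passes the case's verifier."""
    result = subprocess.run(["sh", "tests/verify.sh"], cwd=workspace)
    return result.returncode == 0

# Build a toy workspace: reference solution plus the verifier script.
ws = Path(tempfile.mkdtemp())
(ws / "tests").mkdir()
(ws / "solution.py").write_text("print('hello')\n")
(ws / "tests" / "verify.sh").write_text(
    "python3 solution.py | grep -q hello\n"  # exit 0 iff output matches
)
assert oracle_validates(ws), "case is unsolvable or verifier is broken"
```

A case that fails this check is either unsolvable or has a broken harness; catching that before a run saves the full cost of benchmarking every agent against it.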
7. Decision matrix: what to evaluate
code-grader — ./tests/verify.sh exit code
tool-trajectory
rubric
execution-metrics
composite
cost/token-usage
Related issues
agentv bundle (reproducibility)
Acceptance signals
/docs/guides/benchmarking-best-practices or similar