Objective
Publish a public-facing tracker at agentv.dev showing daily benchmark results for coding agents, including harness variant comparison (baseline vs superpowers vs compound-engineering vs agent-skills).
Why this matters
A public tracker turns AgentV from "eval framework" into "the place to check if your coding agent got better or worse." It's the primary adoption and trust wedge. Users come for the results, stay for the framework.
Two complementary tracker surfaces
1. Agent regression tracker (daily)
Run standard benchmarks daily against major coding agents and publish results:
- Agents: Claude Code, Codex, Copilot (whatever is cheaply runnable daily)
- What makes it different: Show dimensions other trackers don't — tool efficiency, exploration ratio, cost per resolution, composite quality scores. Same benchmarks, richer columns.
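To make the extra columns concrete, here is a hedged sketch of how per-run metrics like tool efficiency, exploration ratio, and cost per resolution might be derived from raw run records. All field names below are illustrative assumptions, not AgentV's actual result schema:

```python
# Illustrative derivation of the tracker's extra columns.
# The record fields ("tool_calls", "cost_usd", ...) are assumptions,
# not AgentV's real output schema.

def tracker_row(runs):
    """Aggregate raw per-task run records into one tracker row."""
    resolved = [r for r in runs if r["resolved"]]
    total_tool_calls = sum(r["tool_calls"] for r in runs)
    productive_calls = sum(r["productive_tool_calls"] for r in runs)
    read_only_calls = sum(r["read_only_tool_calls"] for r in runs)
    total_cost = sum(r["cost_usd"] for r in runs)
    return {
        "resolve_rate": len(resolved) / len(runs),
        # tool efficiency: fraction of tool calls that advanced the task
        "tool_efficiency": productive_calls / total_tool_calls,
        # exploration ratio: read-only calls (ls, grep, cat) vs. all calls
        "exploration_ratio": read_only_calls / total_tool_calls,
        # cost per resolution: spend divided by tasks actually solved
        "cost_per_resolution": total_cost / max(len(resolved), 1),
    }

runs = [
    {"resolved": True, "tool_calls": 40, "productive_tool_calls": 30,
     "read_only_tool_calls": 16, "cost_usd": 0.90},
    {"resolved": False, "tool_calls": 60, "productive_tool_calls": 33,
     "read_only_tool_calls": 36, "cost_usd": 1.10},
]
row = tracker_row(runs)
```

The same aggregation runs per agent per day, so each tracker row is directly comparable across dates.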
2. Harness comparison tracker
Compare how the same agent performs with different harness configurations:
This is the "configured agent benchmark" that nobody else runs — testing the harness, not just the model.
Uses target hooks to configure each variant:
execution:
  targets:
    - name: claude-baseline
    - name: claude-superpowers
      use_target: claude-baseline
      hooks:
        before_each:
          command: ["./setup-plugins.sh", "superpowers"]
    - name: claude-compound
      use_target: claude-baseline
      hooks:
        before_each:
          command: ["./setup-plugins.sh", "compound"]
    - name: claude-agent-skills
      use_target: claude-baseline
      hooks:
        before_each:
          command: ["./setup-plugins.sh", "agent-skills"]
Benchmark sources
Curate a 50-task subset from upstream MIT-licensed SWE-bench datasets. Use Margin's task selection (Margin-Lab/swe-suites/swe-bench-pro-curated-50/suite.toml) as a reference for which tasks are good signal — but pull the actual data from upstream.
Import path: AgentV already has agentv import huggingface (#978) for importing SWE-bench datasets. Curate the 50-task subset, write workspace templates (Docker images from SWE-bench's own tooling), and package as EVAL.yaml files.
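The curation step itself is simple filtering. A minimal sketch, assuming the curated instance ids come from a reference list (like Margin's suite.toml) and the records from the upstream dataset export; the ids and record shape here are illustrative:

```python
# Sketch: filter an upstream SWE-bench export down to the curated
# 50-task subset. In practice the id list would be read from a file
# and the records from the upstream Hugging Face dataset; both are
# stubbed here for illustration.

CURATED_IDS = {"django__django-11099", "sympy__sympy-20590"}  # example ids

def curate(records, curated_ids):
    """Keep only records whose instance_id is in the curated subset."""
    return [r for r in records if r["instance_id"] in curated_ids]

upstream = [
    {"instance_id": "django__django-11099", "repo": "django/django"},
    {"instance_id": "sympy__sympy-20590", "repo": "sympy/sympy"},
    {"instance_id": "flask__flask-0000", "repo": "pallets/flask"},
]
subset = curate(upstream, CURATED_IDS)
```

Each surviving record then gets a workspace template and an EVAL.yaml wrapper in the benchmark pack repo.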
Benchmark pack repo
Benchmark data should live in a separate repo (e.g., EntityProcess/agentv-benchmarks), not in the main agentv repo:
- Benchmark data is heavy (workspace templates, Dockerfiles, test scripts)
- Benchmarks version independently from the framework
- git clone agentv shouldn't pull hundreds of MB of benchmark assets
The main agentv repo keeps small example evals in examples/ for feature demos.
Cost management
- Use cheaper model tiers for daily runs (Haiku, small Sonnet configs, GPT-5-mini)
- Run flagship models weekly, not daily
- Use --resume if runs get interrupted
- Budget thresholds via execution.budget_usd
- Curate a small daily subset (50 tasks) — full suite runs weekly
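The budget threshold could sit alongside the target definitions in the execution config. A sketch; only execution.budget_usd comes from this document, the value and surrounding keys are illustrative:

```yaml
execution:
  budget_usd: 15.00   # stop the daily run if spend exceeds this
  targets:
    - name: claude-baseline
```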
Publication
- Static site generation from benchmark.json results
- Embed JSON data in HTML (no live API needed)
- Hosted on agentv.dev/tracker
- Daily refresh via scheduled CI job
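A minimal sketch of the embed-JSON-in-HTML step, assuming a simple page layout (the element ids and file names are illustrative, not a committed design):

```python
# Sketch: render benchmark results into a static HTML page with the
# JSON embedded inline, so the published page needs no live API.
import json

def render_tracker(results):
    """Return a self-contained HTML page with results embedded as JSON."""
    payload = json.dumps(results)
    return f"""<!doctype html>
<html>
<head><title>AgentV tracker</title></head>
<body>
<div id="tracker"></div>
<script id="benchmark-data" type="application/json">{payload}</script>
<script>
  const data = JSON.parse(
    document.getElementById("benchmark-data").textContent);
  // ...client-side rendering of `data` into #tracker...
</script>
</body>
</html>"""

page = render_tracker([{"agent": "claude-baseline", "resolve_rate": 0.62}])
```

The scheduled CI job would regenerate this page from the latest benchmark.json and push it to agentv.dev/tracker.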
Infrastructure
Related
Acceptance signals