
feat: public eval tracker — daily benchmark runs with harness comparison #1139

@christso

Description


Objective

Publish a public-facing tracker at agentv.dev showing daily benchmark results for coding agents, including harness variant comparison (baseline vs superpowers vs compound-engineering vs agent-skills).

Why this matters

A public tracker turns AgentV from "eval framework" into "the place to check if your coding agent got better or worse." It's the primary adoption and trust wedge. Users come for the results, stay for the framework.


Two complementary tracker surfaces

1. Agent regression tracker (daily)

Run standard benchmarks daily against major coding agents and publish results:

  • Agents: Claude Code, Codex, Copilot (whichever are cheap enough to run daily)
  • What makes it different: show dimensions other trackers don't, such as tool efficiency, exploration ratio, cost per resolution, and composite quality scores. Same benchmarks, richer columns.
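One result row with those richer columns might look roughly like this (sketched in YAML for readability; the actual output is benchmark.json, and every field name and value here is illustrative, not a confirmed AgentV schema):

```yaml
# hypothetical shape of one tracker row
agent: claude-code
date: 2025-01-15
resolved_rate: 0.42               # standard pass rate, same as other trackers
tool_calls_per_task: 31.5         # tool efficiency
exploration_ratio: 0.27           # share of actions spent reading/searching vs editing
cost_per_resolution_usd: 1.84     # spend divided by resolved tasks
composite_quality: 0.61           # blended score across the above
```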

2. Harness comparison tracker

Compare how the same agent performs under different harness configurations.

This is the "configured agent benchmark" that nobody else runs: it tests the harness, not just the model.

Uses target hooks to configure each variant:

execution:
  targets:
    - name: claude-baseline            # unmodified agent, no plugins
    - name: claude-superpowers
      use_target: claude-baseline      # inherits the baseline target's config
      hooks:
        before_each:                   # installs the plugin before each task
          command: ["./setup-plugins.sh", "superpowers"]
    - name: claude-compound
      use_target: claude-baseline
      hooks:
        before_each:
          command: ["./setup-plugins.sh", "compound"]
    - name: claude-agent-skills
      use_target: claude-baseline
      hooks:
        before_each:
          command: ["./setup-plugins.sh", "agent-skills"]

Benchmark sources

Curate a 50-task subset from upstream MIT-licensed SWE-bench datasets. Use Margin's task selection (Margin-Lab/swe-suites/swe-bench-pro-curated-50/suite.toml) as a reference for which tasks give good signal, but pull the actual data from upstream.

Benchmark            Upstream repo                          License         Dataset
SWE-bench            princeton-nlp/SWE-bench                MIT             HuggingFace: princeton-nlp/SWE-bench
SWE-bench Verified   same repo, 500 human-validated tasks   MIT             HuggingFace: princeton-nlp/SWE-bench_Verified
SWE-bench Lite       same repo, 300-task subset             MIT             HuggingFace: princeton-nlp/SWE-bench_Lite
Terminal-Bench 2.0   harbor-framework/terminal-bench-2      check license

Import path: AgentV already has agentv import huggingface (#978) for importing SWE-bench datasets. Curate the 50-task subset, write workspace templates (Docker images from SWE-bench's own tooling), and package as EVAL.yaml files.
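A packaged task could then look roughly like this (a sketch only: the field names `tasks`, `workspace`, `image`, `prompt_file`, and `verify` are assumptions for illustration, not AgentV's confirmed EVAL.yaml schema, and the task ID is just a well-known SWE-bench example):

```yaml
# EVAL.yaml — hypothetical shape for the curated suite
name: swe-bench-curated-50
description: 50-task subset of SWE-bench Verified, pulled from upstream
tasks:
  - id: django__django-11099
    workspace:
      image: swebench/sweb.eval.django__django-11099   # built with SWE-bench's own Docker tooling
    prompt_file: tasks/django__django-11099/problem.md
    verify:
      command: ["./run_tests.sh"]                       # hypothetical: replays the FAIL_TO_PASS tests
```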


Benchmark pack repo

Benchmark data should live in a separate repo (e.g., EntityProcess/agentv-benchmarks), not in the main agentv repo:

  • Benchmark data is heavy (workspace templates, Dockerfiles, test scripts)
  • Benchmarks version independently from the framework
  • git clone agentv shouldn't pull hundreds of MB of benchmark assets

The main agentv repo keeps small example evals in examples/ for feature demos.


Cost management

  • Use cheaper model tiers for daily runs (Haiku, small Sonnet configs, GPT-5-mini)
  • Run flagship models weekly, not daily
  • Use --resume if runs get interrupted
  • Budget thresholds via execution.budget_usd
  • Curate a small daily subset (50 tasks) — full suite runs weekly
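In config terms, a daily run could pin a cheap tier and cap spend; a sketch assuming the `execution.budget_usd` field mentioned above (the `model` override is a hypothetical field, not confirmed AgentV config):

```yaml
execution:
  budget_usd: 25              # hard cap per daily run, per this issue
  targets:
    - name: claude-daily
      model: claude-haiku     # hypothetical: cheap tier daily; flagship models weekly
```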

Publication

  • Static site generation from benchmark.json results
  • Embed JSON data in HTML (no live API needed)
  • Hosted at agentv.dev/tracker
  • Daily refresh via scheduled CI job
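The daily refresh could be a scheduled CI job; a minimal GitHub Actions sketch (the workflow name, script paths, and what those scripts do are all assumptions):

```yaml
# .github/workflows/daily-tracker.yml (hypothetical)
name: daily-tracker
on:
  schedule:
    - cron: "0 6 * * *"                        # once daily, 06:00 UTC
jobs:
  run-and-publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/run-daily-suite.sh      # hypothetical: runs the curated suite against each target
      - run: ./scripts/build-static-site.sh    # hypothetical: embeds benchmark.json into static HTML
      # deploy step depends on the host behind agentv.dev/tracker
```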

Acceptance signals

  • Benchmark pack repo created with at least one curated suite from upstream MIT sources
  • At least one benchmark suite running daily against at least one agent
  • Results published at a public URL
  • Harness comparison (baseline vs at least 2 plugin variants) included
  • Historical trend visible (at least 7 days of data)
