
feat: public eval tracker — daily benchmark runs with harness comparison #1139

@christso

Description


Objective

Publish a public-facing tracker at agentv.dev showing daily benchmark results for coding agents, including harness variant comparison (baseline vs superpowers vs compound-engineering vs agent-skills).

Why this matters

A public tracker turns AgentV from "eval framework" into "the place to check if your coding agent got better or worse." It's the primary adoption and trust wedge. Users come for the results, stay for the framework.


Two complementary tracker surfaces

1. Agent regression tracker (daily)

Run standard benchmarks daily against major coding agents and publish results:

  • Agents: Claude Code, Codex, Copilot (whichever are cheap enough to run daily)
  • What makes it different: show dimensions other trackers don't, such as tool efficiency, exploration ratio, cost per resolution, and composite quality scores. Same benchmarks, richer columns.
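One result row with those richer columns might look roughly like this (sketched in YAML for readability; the actual output is benchmark.json, and every field name and value here is illustrative, not a confirmed AgentV schema):

```yaml
# hypothetical shape of one tracker row
agent: claude-code
date: 2025-01-15
resolved_rate: 0.42               # standard pass rate, same as other trackers
tool_calls_per_task: 31.5         # tool efficiency
exploration_ratio: 0.27           # share of actions spent reading/searching vs editing
cost_per_resolution_usd: 1.84     # spend divided by resolved tasks
composite_quality: 0.61           # blended score across the above
```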

2. Harness comparison tracker

Compare how the same agent performs under different harness configurations.

This is the "configured agent benchmark" that nobody else runs: it tests the harness, not just the model.

Uses target hooks to configure each variant:

execution:
  targets:
    - name: claude-baseline            # unmodified agent, no plugins
    - name: claude-superpowers
      use_target: claude-baseline      # inherits the baseline target's config
      hooks:
        before_each:                   # installs the plugin before each task
          command: ["./setup-plugins.sh", "superpowers"]
    - name: claude-compound
      use_target: claude-baseline
      hooks:
        before_each:
          command: ["./setup-plugins.sh", "compound"]
    - name: claude-agent-skills
      use_target: claude-baseline
      hooks:
        before_each:
          command: ["./setup-plugins.sh", "agent-skills"]

Benchmark sources

Curate a 50-task subset from upstream MIT-licensed SWE-bench datasets. Use Margin's task selection (Margin-Lab/swe-suites/swe-bench-pro-curated-50/suite.toml) as a reference for which tasks give good signal, but pull the actual data from upstream.

Benchmark            Upstream repo                          License         Dataset
SWE-bench            princeton-nlp/SWE-bench                MIT             HuggingFace: princeton-nlp/SWE-bench
SWE-bench Verified   same repo, 500 human-validated tasks   MIT             HuggingFace: princeton-nlp/SWE-bench_Verified
SWE-bench Lite       same repo, 300-task subset             MIT             HuggingFace: princeton-nlp/SWE-bench_Lite
Terminal-Bench 2.0   harbor-framework/terminal-bench-2      check license

Import path: AgentV already has agentv import huggingface (#978) for importing SWE-bench datasets. Curate the 50-task subset, write workspace templates (Docker images from SWE-bench's own tooling), and package as EVAL.yaml files.
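A packaged task could then look roughly like this (a sketch only: the field names `tasks`, `workspace`, `image`, `prompt_file`, and `verify` are assumptions for illustration, not AgentV's confirmed EVAL.yaml schema, and the task ID is just a well-known SWE-bench example):

```yaml
# EVAL.yaml — hypothetical shape for the curated suite
name: swe-bench-curated-50
description: 50-task subset of SWE-bench Verified, pulled from upstream
tasks:
  - id: django__django-11099
    workspace:
      image: swebench/sweb.eval.django__django-11099   # built with SWE-bench's own Docker tooling
    prompt_file: tasks/django__django-11099/problem.md
    verify:
      command: ["./run_tests.sh"]                       # hypothetical: replays the FAIL_TO_PASS tests
```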


Benchmark pack repo

Benchmark data should live in a separate repo (e.g., EntityProcess/agentv-benchmarks), not in the main agentv repo:

  • Benchmark data is heavy (workspace templates, Dockerfiles, test scripts)
  • Benchmarks version independently from the framework
  • git clone agentv shouldn't pull hundreds of MB of benchmark assets

The main agentv repo keeps small example evals in examples/ for feature demos.


Cost management

  • Use cheaper model tiers for daily runs (Haiku, small Sonnet configs, GPT-5-mini)
  • Run flagship models weekly, not daily
  • Use --resume if runs get interrupted
  • Budget thresholds via execution.budget_usd
  • Curate a small daily subset (50 tasks) — full suite runs weekly
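In config terms, a daily run could pin a cheap tier and cap spend; a sketch assuming the `execution.budget_usd` field mentioned above (the `model` override is a hypothetical field, not confirmed AgentV config):

```yaml
execution:
  budget_usd: 25              # hard cap per daily run, per this issue
  targets:
    - name: claude-daily
      model: claude-haiku     # hypothetical: cheap tier daily; flagship models weekly
```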

Publication

  • Static site generation from benchmark.json results
  • Embed JSON data in HTML (no live API needed)
  • Hosted at agentv.dev/tracker
  • Daily refresh via scheduled CI job
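The daily refresh could be a scheduled CI job; a minimal GitHub Actions sketch (the workflow name, script paths, and what those scripts do are all assumptions):

```yaml
# .github/workflows/daily-tracker.yml (hypothetical)
name: daily-tracker
on:
  schedule:
    - cron: "0 6 * * *"                        # once daily, 06:00 UTC
jobs:
  run-and-publish:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/run-daily-suite.sh      # hypothetical: runs the curated suite against each target
      - run: ./scripts/build-static-site.sh    # hypothetical: embeds benchmark.json into static HTML
      # deploy step depends on the host behind agentv.dev/tracker
```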

Acceptance signals

  • Benchmark pack repo created with at least one curated suite from upstream MIT sources
  • At least one benchmark suite running daily against at least one agent
  • Results published at a public URL
  • Harness comparison (baseline vs at least 2 plugin variants) included
  • Historical trend visible (at least 7 days of data)
