
examples/showcase: expand bug-fix-benchmark with rigorous multi-scenario workflow evals #1100

@christso

Description


Context

The current bug-fix-benchmark (examples/showcase/bug-fix-benchmark) compares engineering workflow plugins (agent-skills, superpowers, compound) against a baseline on a real GitHub repo. It has one test case — a single-file fix with the root cause and file location described in the prompt.

The recent SkillsBench paper provides methodology grounding. Key finding: Software Engineering showed the smallest improvement (+4.5pp) when tasks were prescriptive — agents could navigate them without plugin help.

Related: addyosmani/agent-skills#51

What's needed

1. More complex task scenarios

The current task is too prescriptive — the prompt names the file, the method, and the fix pattern. Add at least 4 new tasks covering distinct scenario types:

  • Multi-file bugs — root cause spans 2+ files, no location hints in the prompt
  • Regression bugs — "works on commit A, fails on commit B, find why"
  • Spec-driven implementation — given a spec, implement + add tests from scratch
  • Refactoring under test — restructure code without breaking existing test suite

All tasks must use the same agentv repo (https://github.com/EntityProcess/agentv) as the workspace so no new repo setup is needed.
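As a sketch, a multi-file task entry might look like the following. Field names other than `id` and `tags` (which appear later in this issue) are assumptions about the agentv eval schema, not confirmed syntax:

```yaml
tests:
  - id: fix-multi-file-auth
    tags: [extended, multi-file, bugfix]
    # Hypothetical fields below -- illustrative, not confirmed agentv schema
    workspace: https://github.com/EntityProcess/agentv
    prompt: |
      Login intermittently fails after a session refresh. Find and fix the
      root cause. It may span more than one file.
      # Deliberately no file, method, or fix-pattern hints in the prompt
```

The key property is the prompt body: unlike the existing task, it gives no location hints, so localization is part of what is being measured.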

2. Self-generated skills as a control condition

Add a fourth variant claude-self-generated alongside the existing three. Its workspaces/self-generated/CLAUDE.md should instruct the agent to write its own procedural knowledge before starting the task — something like: "Before solving this task, write a SKILL.md describing your approach and the engineering process you will follow. Then follow it." No plugin is installed. This isolates whether curated plugin content outperforms an agent's own self-generated process notes.
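A minimal sketch of the self-generated variant's CLAUDE.md — the exact wording beyond the quoted instruction above is illustrative:

```markdown
# workspaces/self-generated/CLAUDE.md

Before solving this task, write a SKILL.md describing your approach and the
engineering process you will follow (e.g. investigation steps, testing
strategy, definition of done). Then follow it.
```

No plugin directory is present in this workspace, so any procedural knowledge the agent uses is its own.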

3. Multi-trial runs with confidence intervals

Configure the eval with:

```yaml
trials:
  count: 5
  strategy: confidence_interval
```

This puts confidence intervals on pass-rate deltas instead of relying on single-run noise.

4. Token/cost/latency tracking

Add evaluators to measure the overhead of each plugin variant:

```yaml
evaluators:
  - type: token-usage
  - type: cost
  - type: latency
```

This answers "is the plugin worth its cost?" — SkillsBench found skills add ~13s and ~1,700 tokens per task on average.

5. Difficulty stratification and domain tagging

Tag each test case with a difficulty tier (core / extended / extreme, based on estimated human completion time) and a scenario type. This enables stratified analysis in agentv compare.

```yaml
tests:
  - id: fix-multi-file-auth
    tags: [extended, multi-file, bugfix]
```

6. Multi-model comparison

Run each variant across at least 2 model tiers (e.g. Sonnet 4.5 + Opus 4.6) to test whether skills compensate for model scale. SkillsBench found Haiku + skills (27.7%) outperformed Opus without skills (22.0%).
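One way to express this in the eval config — the `targets` key, field names, and model identifiers are assumptions about agentv's schema, shown only to pin down the shape of the requirement:

```yaml
# Hypothetical -- key names and model IDs are illustrative
targets:
  - model: claude-sonnet-4-5
  - model: claude-opus-4-6
# Each variant (baseline, agent-skills, superpowers, compound,
# claude-self-generated) runs against every target above.
```

Running the full variant matrix on both tiers is what lets agentv compare test the "skills compensate for model scale" hypothesis.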

Acceptance signals

  • At least 5 test cases total (existing + 4 new) covering distinct scenario types
  • claude-self-generated variant added with appropriate CLAUDE.md
  • trials: 5 with confidence_interval strategy configured
  • token-usage, cost, latency evaluators included
  • Each task tagged with difficulty tier and scenario type
  • Multi-model targets configured (at least 2 model tiers)
  • README updated with methodology notes and link to SkillsBench paper

Non-goals

  • Not reproducing SkillsBench (84 tasks, 7 model configs) — this is a focused workflow benchmark
  • Not adding Docker containerization — git workspace isolation is sufficient
  • Not covering domains outside software engineering
  • Not implementing leakage prevention CI — 5-10 tasks can be reviewed manually
  • Normalized gain and negative delta detection are already handled by agentv compare (feat(compare): add normalized gain metric #1101) — no changes needed there

Labels: docs (Improvements or additions to documentation), enhancement (New feature or request)

Status: Done