Context

The current bug-fix-benchmark (examples/showcase/bug-fix-benchmark) compares engineering workflow plugins (agent-skills, superpowers, compound) against a baseline on a real GitHub repo. It has one test case — a single-file fix with the root cause and file location described in the prompt.
The recent SkillsBench paper provides methodology grounding. Key finding: Software Engineering showed the smallest improvement (+4.5pp) when tasks were prescriptive — agents could navigate them without plugin help.
What's needed

1. More complex task scenarios

The current task is too prescriptive — the prompt names the file, method, and fix pattern. Add at least 4 new tasks covering distinct types:
- Multi-file bugs — root cause spans 2+ files, no location hints in the prompt
- Regression bugs — "works on commit A, fails on commit B, find why"
- Spec-driven implementation — given a spec, implement + add tests from scratch
- Refactoring under test — restructure code without breaking the existing test suite
All tasks must use the same agentv repo (https://github.com/EntityProcess/agentv) as the workspace so no new repo setup is needed.
2. Self-generated skills as a control condition
Add a fourth variant claude-self-generated alongside the existing three. Its workspaces/self-generated/CLAUDE.md should instruct the agent to write its own procedural knowledge before starting the task — something like: "Before solving this task, write a SKILL.md describing your approach and the engineering process you will follow. Then follow it." No plugin is installed. This isolates whether curated plugin content outperforms an agent's own self-generated process notes.
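A minimal sketch of what workspaces/self-generated/CLAUDE.md could contain (the wording beyond the quoted instruction above is illustrative, not final):

```markdown
# Self-generated skills control

Before solving this task, write a SKILL.md describing your approach and the
engineering process you will follow. Then follow it.

No plugin is installed in this workspace; rely only on the process you wrote
down in SKILL.md.
```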
3. Multi-trial runs with confidence intervals
Configure the eval with:

```yaml
trials:
  count: 5
  strategy: confidence_interval
```
This gives statistical significance to pass-rate deltas instead of single-run noise.
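To illustrate why 5 trials matter — a single run is mostly noise, and even 5 trials leave a wide interval around the observed pass rate — here is a standard Wilson score interval in Python (a generic statistics sketch, not agentv code):

```python
import math

def pass_rate_ci(passes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a pass rate; behaves sensibly at small n."""
    if trials == 0:
        return (0.0, 0.0)
    p = passes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (max(0.0, centre - half), min(1.0, centre + half))

# 4/5 passes looks like an 80% pass rate, but with only 5 trials the
# plausible range is roughly 38%-96% — deltas smaller than that are noise.
low, high = pass_rate_ci(4, 5)
```

This is why small pass-rate deltas between variants only become meaningful once intervals are attached.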
4. Token/cost/latency tracking
Add evaluators to measure the overhead of each plugin variant:
This answers "is the plugin worth its cost?" — SkillsBench found skills add ~13s and ~1,700 tokens per task on average.
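A sketch of the evaluator list, using the names from the acceptance signals below; the exact agentv evaluator schema is an assumption here:

```yaml
# Hypothetical config fragment; evaluator names from the acceptance
# signals, schema assumed rather than verified against agentv.
evaluators:
  - token-usage   # input/output tokens consumed per task
  - cost          # estimated spend per task, derived from token counts
  - latency       # wall-clock seconds per task
```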
5. Difficulty stratification and domain tagging
Tag each test case with difficulty tier (core / extended / extreme based on estimated human completion time) and scenario type. Enables stratified analysis in agentv compare.
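Per-test-case metadata might look like the following sketch (field names and the example case are illustrative; check the actual agentv test-case schema):

```yaml
# Illustrative only: field names and the case id are assumptions,
# not agentv's real schema.
cases:
  - id: multi-file-bug-example
    difficulty: extended        # core | extended | extreme
    scenario: multi-file-bug    # regression | spec-driven | refactor | ...
```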
6. Multi-model comparison

Run each variant across at least 2 model tiers (e.g. Sonnet 4.5 + Opus 4.6) to test whether skills compensate for model scale. SkillsBench found Haiku + skills (27.7%) outperformed Opus without skills (22.0%).
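A multi-model target list could be sketched as below; the `targets` key and the model identifiers are assumptions, not verified agentv syntax:

```yaml
# Illustrative; key name and model IDs are assumptions.
targets:
  - model: claude-sonnet-4-5
  - model: claude-opus-4-6
```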
Acceptance signals
- At least 5 test cases total (existing + 4 new) covering distinct scenario types
- claude-self-generated variant added with an appropriate CLAUDE.md
- trials: 5 with the confidence_interval strategy configured
- token-usage, cost, latency evaluators included
- Each task tagged with a difficulty tier and scenario type
- Multi-model targets configured (at least 2 model tiers)
- README updated with methodology notes and a link to the SkillsBench paper
Non-goals
- Not reproducing SkillsBench (84 tasks, 7 model configs) — this is a focused workflow benchmark
- Not adding Docker containerization — git workspace isolation is sufficient
- Not covering domains outside software engineering
- Not implementing leakage-prevention CI — 5-10 tasks can be reviewed manually
Related

- addyosmani/agent-skills#51
- agentv compare (feat(compare): add normalized gain metric #1101) — no changes needed there