
examples/showcase: expand bug-fix-benchmark with rigorous multi-scenario workflow evals #1100

@christso

Description


Context

The current bug-fix-benchmark (examples/showcase/bug-fix-benchmark) compares engineering workflow plugins (agent-skills, superpowers, compound) against a baseline on a real GitHub repo. It has one test case — a single-file fix with the root cause and file location described in the prompt.

The recent SkillsBench paper provides methodology grounding. Key finding: Software Engineering showed the smallest improvement (+4.5pp) when tasks were prescriptive — agents could navigate them without plugin help.

Related: addyosmani/agent-skills#51

What's needed

1. More complex task scenarios

The current task is too prescriptive — the prompt names the file, the method, and the fix pattern. Add at least 4 new tasks covering distinct scenario types:

  • Multi-file bugs — root cause spans 2+ files, no location hints in the prompt
  • Regression bugs — "works on commit A, fails on commit B, find why"
  • Spec-driven implementation — given a spec, implement + add tests from scratch
  • Refactoring under test — restructure code without breaking existing test suite

All tasks must use the same agentv repo (https://github.com/EntityProcess/agentv) as the workspace so no new repo setup is needed.
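As a sketch, a multi-file task entry might look like the following. Field names other than `id` and `tags` (which appear later in this issue) are assumptions about the agentv eval schema, not confirmed syntax:

```yaml
tests:
  - id: fix-multi-file-auth
    tags: [extended, multi-file, bugfix]
    # Hypothetical fields below -- illustrative, not confirmed agentv schema
    workspace: https://github.com/EntityProcess/agentv
    prompt: |
      Login intermittently fails after a session refresh. Find and fix the
      root cause. It may span more than one file.
      # Deliberately no file, method, or fix-pattern hints in the prompt
```

The key property is the prompt body: unlike the existing task, it gives no location hints, so localization is part of what is being measured.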

2. Self-generated skills as a control condition

Add a fourth variant claude-self-generated alongside the existing three. Its workspaces/self-generated/CLAUDE.md should instruct the agent to write its own procedural knowledge before starting the task — something like: "Before solving this task, write a SKILL.md describing your approach and the engineering process you will follow. Then follow it." No plugin is installed. This isolates whether curated plugin content outperforms an agent's own self-generated process notes.
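A minimal sketch of the self-generated variant's CLAUDE.md — the exact wording beyond the quoted instruction above is illustrative:

```markdown
# workspaces/self-generated/CLAUDE.md

Before solving this task, write a SKILL.md describing your approach and the
engineering process you will follow (e.g. investigation steps, testing
strategy, definition of done). Then follow it.
```

No plugin directory is present in this workspace, so any procedural knowledge the agent uses is its own.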

3. Multi-trial runs with confidence intervals

Configure the eval with:

```yaml
trials:
  count: 5
  strategy: confidence_interval
```

This puts confidence intervals on pass-rate deltas instead of relying on single-run noise.

4. Token/cost/latency tracking

Add evaluators to measure the overhead of each plugin variant:

```yaml
evaluators:
  - type: token-usage
  - type: cost
  - type: latency
```

This answers "is the plugin worth its cost?" — SkillsBench found skills add ~13s and ~1,700 tokens per task on average.

5. Difficulty stratification and domain tagging

Tag each test case with a difficulty tier (core / extended / extreme, based on estimated human completion time) and a scenario type. This enables stratified analysis in agentv compare.

```yaml
tests:
  - id: fix-multi-file-auth
    tags: [extended, multi-file, bugfix]
```

6. Multi-model comparison

Run each variant across at least 2 model tiers (e.g. Sonnet 4.5 + Opus 4.6) to test whether skills compensate for model scale. SkillsBench found Haiku + skills (27.7%) outperformed Opus without skills (22.0%).
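One way to express this in the eval config — the `targets` key, field names, and model identifiers are assumptions about agentv's schema, shown only to pin down the shape of the requirement:

```yaml
# Hypothetical -- key names and model IDs are illustrative
targets:
  - model: claude-sonnet-4-5
  - model: claude-opus-4-6
# Each variant (baseline, agent-skills, superpowers, compound,
# claude-self-generated) runs against every target above.
```

Running the full variant matrix on both tiers is what lets agentv compare test the "skills compensate for model scale" hypothesis.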

Acceptance signals

  • At least 5 test cases total (existing + 4 new) covering distinct scenario types
  • claude-self-generated variant added with appropriate CLAUDE.md
  • trials: 5 with confidence_interval strategy configured
  • token-usage, cost, latency evaluators included
  • Each task tagged with difficulty tier and scenario type
  • Multi-model targets configured (at least 2 model tiers)
  • README updated with methodology notes and link to SkillsBench paper

Non-goals

  • Not reproducing SkillsBench (84 tasks, 7 model configs) — this is a focused workflow benchmark
  • Not adding Docker containerization — git workspace isolation is sufficient
  • Not covering domains outside software engineering
  • Not implementing leakage prevention CI — 5-10 tasks can be reviewed manually
  • Normalized gain and negative delta detection are already handled by agentv compare (feat(compare): add normalized gain metric #1101) — no changes needed there

Labels: docs (Improvements or additions to documentation), enhancement (New feature or request)

Status: Done