docs(showcase): add bug-fix-benchmark example for SWE-bench style evaluation #1091

Merged
christso merged 8 commits into main from docs/bug-fix-benchmark-showcase on Apr 14, 2026

Conversation


@christso christso commented Apr 13, 2026

Summary

Add examples/showcase/bug-fix-benchmark/ — a SWE-bench style showcase that compares coding agent performance with and without engineering plugins on identical bug fix tasks.

Uses target-level hooks (from #1095) to configure each plugin variant via execution.targets in the eval file.

Closes #919

What It Does

Evaluates the same bug fix task across four configurations:

Target                Plugin                                 Workspace
claude-baseline       None                                   baseline settings
claude-superpowers    obra/superpowers                       superpowers plugin
claude-compound       EveryInc/compound-engineering-plugin   compound plugin
claude-agent-skills   addyosmani/agent-skills                agent-skills plugin
Metrics compared: tokens consumed, time to complete, fix correctness.

How It Works

A single claude base target is defined in targets.yaml. The eval file uses target-level hooks (execution.targets) so each variant runs a before_each hook that copies the appropriate .claude/settings.json into the workspace:

execution:
  targets:
    - name: claude-baseline
      use_target: claude
      hooks:
        before_each:
          command: ["bash", "../scripts/setup-variant.sh", "baseline"]
    - name: claude-superpowers
      use_target: claude
      hooks:
        before_each:
          command: ["bash", "../scripts/setup-variant.sh", "superpowers"]
    # ...
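The PR does not show the body of setup-variant.sh, but from the hook wiring above its job is clear: take a variant name and copy that variant's .claude/settings.json into the workspace. A hypothetical sketch (paths and the demo fixture are assumptions, not the actual script):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of scripts/setup-variant.sh: copy the named
# variant's .claude/settings.json into the target workspace.
set -euo pipefail

setup_variant() {
  local variant="$1" root="$2" workspace="$3"
  mkdir -p "$workspace/.claude"
  # Each variant directory under workspaces/ holds its own settings.json.
  cp "$root/workspaces/$variant/.claude/settings.json" \
    "$workspace/.claude/settings.json"
}

# Demo against a throwaway layout standing in for the example tree.
root="$(mktemp -d)"
workspace="$(mktemp -d)"
mkdir -p "$root/workspaces/baseline/.claude"
printf '{"defaultMode":"bypassPermissions"}\n' \
  > "$root/workspaces/baseline/.claude/settings.json"

setup_variant baseline "$root" "$workspace"
cat "$workspace/.claude/settings.json"
```

Because the hook runs as before_each, the copy happens once per test case, so each pooled workspace slot picks up the right plugin config before the agent starts.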

Files Added

examples/showcase/bug-fix-benchmark/
├── .agentv/targets.yaml           # Base claude target + grader targets
├── evals/bug-fixes.eval.yaml      # Test cases + target hooks per variant
├── workspaces/                    # Plugin config templates (copied by hooks)
│   ├── baseline/                  # bypassPermissions, no plugins
│   ├── superpowers/               # + superpowers plugin
│   ├── compound/                  # + compound-engineering plugin
│   └── agent-skills/              # + agent-skills plugin
├── scripts/
│   ├── setup-variant.sh           # Target hook: copy variant config into workspace
│   ├── mock-agent.sh              # Test harness without API keys
│   ├── setup-plugins.sh           # Install plugins into workspaces
│   └── import-swebench.sh         # Import SWE-bench dataset instances
└── README.md
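A later commit in this PR notes that the baseline template uses defaultMode: bypassPermissions rather than individual Bash allow rules. A minimal workspaces/baseline/.claude/settings.json could therefore look like the following sketch (the checked-in file may carry additional fields):

```json
{
  "defaultMode": "bypassPermissions"
}
```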

How to Run

# All variants (runs all 4 from execution.targets)
agentv eval evals/bug-fixes.eval.yaml --workers 3

# Specific variants only
agentv eval evals/bug-fixes.eval.yaml \
  --target claude-baseline,claude-superpowers --workers 2

# Compare results
agentv compare <baseline-results> <plugin-results>

Test Plan

  • bun run build — passes
  • bun run typecheck — passes
  • bun run lint — passes
  • bun run test — 2157 tests pass (1642 core + 67 eval + 448 cli)
  • bun run validate:examples — 56/56 valid
  • Dry-run with --dry-run --workers 1 — all 4 targets resolve hooks, clone repo, copy correct settings.json
  • Verified workspace slots contain correct plugin configs per variant

🤖 Generated with Claude Code


christso and others added 4 commits April 14, 2026 04:33
…luation

Add a showcase example demonstrating how to evaluate coding agents on
real-world bug fixes using public GitHub repositories with Docker workspace
isolation and commit-pinned repos.

Includes:
- EVAL.yaml with example test cases (null-check, fallback, property-access bugs)
- targets.yaml showing all auth options (subscription, API key, mock)
- mock-agent.sh for testing without API keys
- import-swebench.sh for importing SWE-bench dataset instances

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add workspace templates for comparing agent performance with and without
engineering plugins: superpowers, compound-engineering, agent-skills.

- Add workspaces/ with per-plugin .claude/settings.json configs
- Update targets.yaml with claude-baseline, claude-superpowers,
  claude-compound, claude-agent-skills targets
- Replace hypothetical test cases with real issue #912 bug fix task
- Add scripts/setup-plugins.sh for plugin installation
- Update README with comparison workflow and plugin details

Closes #919

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use defaultMode: bypassPermissions instead of listing individual
Bash allow rules, matching how the agentv dev environment is configured.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the removed workspace_template field with the target-level hooks
pattern from #1095. A single base 'claude' target is defined in
targets.yaml, and the eval file's execution.targets uses before_each
hooks to copy variant-specific plugin configs into the workspace.

Also fixes:
- Use 'id' instead of deprecated 'case' in test definitions
- Use full commit hash with resolve: local for base_commit
- Remove shallow clone (depth: 1) that prevented commit checkout

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@christso christso force-pushed the docs/bug-fix-benchmark-showcase branch from 68253cb to 29d75fe on April 14, 2026 04:46
christso and others added 4 commits April 14, 2026 04:55
Switch from provider: claude to provider: claude-cli with an executable
field that reads from CLAUDE_EXECUTABLE env var (defaults to "claude").
This allows using custom CLI binaries like claude-zai.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The script was speculative and non-functional (used deprecated fields,
hardcoded docker config, broken template variables). Not needed for the
benchmark showcase.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rtions

- Remove local .agentv/targets.yaml — use repo root targets instead
  (target files don't merge; the closest one shadows the rest, so the
  local file forced duplicating grader targets unnecessarily)
- Replace llm-grader assertion with inline rubric strings (auto-unwrapped
  to rubrics evaluator)
- Remove unused scripts: mock-agent.sh (broken with workspace repos),
  setup-plugins.sh (orphaned, settings.json already checked in)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ns assertion

- Replace hardcoded use_target: claude with ${{ AGENT_TARGET }} so the
  benchmark works with any provider via env var
- Add workspace.hooks.before_each.reset: fast for proper isolation
  when pool slots are reused across plugin variants
- Remove contains: effectiveCwd assertion (checks response text, not
  the diff); rubrics already validate the fix via file_changes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@christso christso marked this pull request as ready for review April 14, 2026 06:35
@christso christso merged commit 5af7455 into main Apr 14, 2026
3 of 4 checks passed
@christso christso deleted the docs/bug-fix-benchmark-showcase branch April 14, 2026 06:36

Development

Successfully merging this pull request may close these issues.

docs: evals for engineering plugins