docs(showcase): add bug-fix-benchmark example for SWE-bench style evaluation #1091

Merged
christso merged 8 commits into main from docs/bug-fix-benchmark-showcase on Apr 14, 2026

Conversation


@christso christso commented Apr 13, 2026

Summary

Add examples/showcase/bug-fix-benchmark/ — a SWE-bench style showcase that compares coding agent performance with and without engineering plugins on identical bug fix tasks.

Uses target-level hooks (from #1095) to configure each plugin variant via execution.targets in the eval file.

Closes #919

What It Does

Evaluates the same bug fix task across four configurations:

Target                Plugin                                 Workspace
claude-baseline       None                                   baseline settings
claude-superpowers    obra/superpowers                       superpowers plugin
claude-compound       EveryInc/compound-engineering-plugin   compound plugin
claude-agent-skills   addyosmani/agent-skills                agent-skills plugin
Metrics compared: tokens consumed, time to complete, fix correctness.

How It Works

A single claude base target is defined in targets.yaml. The eval file uses target-level hooks (execution.targets) so each variant runs a before_each hook that copies the appropriate .claude/settings.json into the workspace:

execution:
  targets:
    - name: claude-baseline
      use_target: claude
      hooks:
        before_each:
          command: ["bash", "../scripts/setup-variant.sh", "baseline"]
    - name: claude-superpowers
      use_target: claude
      hooks:
        before_each:
          command: ["bash", "../scripts/setup-variant.sh", "superpowers"]
    # ...
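The PR does not show the body of setup-variant.sh, but from the hook wiring above its job is clear: take a variant name and copy that variant's .claude/settings.json into the workspace. A hypothetical sketch (paths and the demo fixture are assumptions, not the actual script):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of scripts/setup-variant.sh: copy the named
# variant's .claude/settings.json into the target workspace.
set -euo pipefail

setup_variant() {
  local variant="$1" root="$2" workspace="$3"
  mkdir -p "$workspace/.claude"
  # Each variant directory under workspaces/ holds its own settings.json.
  cp "$root/workspaces/$variant/.claude/settings.json" \
    "$workspace/.claude/settings.json"
}

# Demo against a throwaway layout standing in for the example tree.
root="$(mktemp -d)"
workspace="$(mktemp -d)"
mkdir -p "$root/workspaces/baseline/.claude"
printf '{"defaultMode":"bypassPermissions"}\n' \
  > "$root/workspaces/baseline/.claude/settings.json"

setup_variant baseline "$root" "$workspace"
cat "$workspace/.claude/settings.json"
```

Because the hook runs as before_each, the copy happens once per test case, so each pooled workspace slot picks up the right plugin config before the agent starts.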

Files Added

examples/showcase/bug-fix-benchmark/
├── .agentv/targets.yaml           # Base claude target + grader targets
├── evals/bug-fixes.eval.yaml      # Test cases + target hooks per variant
├── workspaces/                    # Plugin config templates (copied by hooks)
│   ├── baseline/                  # bypassPermissions, no plugins
│   ├── superpowers/               # + superpowers plugin
│   ├── compound/                  # + compound-engineering plugin
│   └── agent-skills/              # + agent-skills plugin
├── scripts/
│   ├── setup-variant.sh           # Target hook: copy variant config into workspace
│   ├── mock-agent.sh              # Test harness without API keys
│   ├── setup-plugins.sh           # Install plugins into workspaces
│   └── import-swebench.sh         # Import SWE-bench dataset instances
└── README.md
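A later commit in this PR notes that the baseline template uses defaultMode: bypassPermissions rather than individual Bash allow rules. A minimal workspaces/baseline/.claude/settings.json could therefore look like the following sketch (the checked-in file may carry additional fields):

```json
{
  "defaultMode": "bypassPermissions"
}
```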

How to Run

# All variants (runs all 4 from execution.targets)
agentv eval evals/bug-fixes.eval.yaml --workers 3

# Specific variants only
agentv eval evals/bug-fixes.eval.yaml \
  --target claude-baseline,claude-superpowers --workers 2

# Compare results
agentv compare <baseline-results> <plugin-results>

Test Plan

  • bun run build — passes
  • bun run typecheck — passes
  • bun run lint — passes
  • bun run test — 2157 tests pass (1642 core + 67 eval + 448 cli)
  • bun run validate:examples — 56/56 valid
  • Dry-run with --dry-run --workers 1 — all 4 targets resolve hooks, clone repo, copy correct settings.json
  • Verified workspace slots contain correct plugin configs per variant

🤖 Generated with Claude Code


christso and others added 4 commits April 14, 2026 04:33
…luation

Add a showcase example demonstrating how to evaluate coding agents on
real-world bug fixes using public GitHub repositories with Docker workspace
isolation and commit-pinned repos.

Includes:
- EVAL.yaml with example test cases (null-check, fallback, property-access bugs)
- targets.yaml showing all auth options (subscription, API key, mock)
- mock-agent.sh for testing without API keys
- import-swebench.sh for importing SWE-bench dataset instances

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add workspace templates for comparing agent performance with and without
engineering plugins: superpowers, compound-engineering, agent-skills.

- Add workspaces/ with per-plugin .claude/settings.json configs
- Update targets.yaml with claude-baseline, claude-superpowers,
  claude-compound, claude-agent-skills targets
- Replace hypothetical test cases with real issue #912 bug fix task
- Add scripts/setup-plugins.sh for plugin installation
- Update README with comparison workflow and plugin details

Closes #919

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use defaultMode: bypassPermissions instead of listing individual
Bash allow rules, matching how the agentv dev environment is configured.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the removed workspace_template field with the target-level hooks
pattern from #1095. A single base 'claude' target is defined in
targets.yaml, and the eval file's execution.targets uses before_each
hooks to copy variant-specific plugin configs into the workspace.

Also fixes:
- Use 'id' instead of deprecated 'case' in test definitions
- Use full commit hash with resolve: local for base_commit
- Remove shallow clone (depth: 1) that prevented commit checkout

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@christso christso force-pushed the docs/bug-fix-benchmark-showcase branch from 68253cb to 29d75fe on April 14, 2026 04:46
christso and others added 4 commits April 14, 2026 04:55
Switch from provider: claude to provider: claude-cli with an executable
field that reads from CLAUDE_EXECUTABLE env var (defaults to "claude").
This allows using custom CLI binaries like claude-zai.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The script was speculative and non-functional (used deprecated fields,
hardcoded docker config, broken template variables). Not needed for the
benchmark showcase.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rtions

- Remove local .agentv/targets.yaml — use repo root targets instead
  (target files don't merge; the closest one shadows the rest, so the
  local file forced duplicating grader targets unnecessarily)
- Replace llm-grader assertion with inline rubric strings (auto-unwrapped
  to rubrics evaluator)
- Remove unused scripts: mock-agent.sh (broken with workspace repos),
  setup-plugins.sh (orphaned, settings.json already checked in)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ns assertion

- Replace hardcoded use_target: claude with ${{ AGENT_TARGET }} so the
  benchmark works with any provider via env var
- Add workspace.hooks.before_each.reset: fast for proper isolation
  when pool slots are reused across plugin variants
- Remove contains: effectiveCwd assertion (checks response text, not
  the diff); rubrics already validate the fix via file_changes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@christso christso marked this pull request as ready for review April 14, 2026 06:35
@christso christso merged commit 5af7455 into main Apr 14, 2026
3 of 4 checks passed
@christso christso deleted the docs/bug-fix-benchmark-showcase branch April 14, 2026 06:36

Development

Successfully merging this pull request may close these issues.

docs: evals for engineering plugins