docs(showcase): add bug-fix-benchmark example for SWE-bench style evaluation #1091
Merged
Conversation
…luation

Add a showcase example demonstrating how to evaluate coding agents on real-world bug fixes, using public GitHub repositories with Docker workspace isolation and commit-pinned repos.

Includes:
- EVAL.yaml with example test cases (null-check, fallback, property-access bugs)
- targets.yaml showing all auth options (subscription, API key, mock)
- mock-agent.sh for testing without API keys
- import-swebench.sh for importing SWE-bench dataset instances

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add workspace templates for comparing agent performance with and without engineering plugins: superpowers, compound-engineering, agent-skills.

- Add workspaces/ with per-plugin .claude/settings.json configs
- Update targets.yaml with claude-baseline, claude-superpowers, claude-compound, claude-agent-skills targets
- Replace hypothetical test cases with a real issue #912 bug fix task
- Add scripts/setup-plugins.sh for plugin installation
- Update README with comparison workflow and plugin details

Closes #919

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use defaultMode: bypassPermissions instead of listing individual Bash allow rules, matching how the agentv dev environment is configured. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
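As a sketch, the settings change described in this commit would land in a .claude/settings.json roughly like the following; the PR does not show the file itself, and the permissions wrapper object is assumed from the standard Claude Code settings shape:

```json
{
  "permissions": {
    "defaultMode": "bypassPermissions"
  }
}
```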
Replace the removed workspace_template field with the target-level hooks pattern from #1095. A single base 'claude' target is defined in targets.yaml, and the eval file's execution.targets uses before_each hooks to copy variant-specific plugin configs into the workspace.

Also fixes:
- Use 'id' instead of deprecated 'case' in test definitions
- Use full commit hash with resolve: local for base_commit
- Remove shallow clone (depth: 1) that prevented commit checkout

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
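The pattern this commit describes might be sketched as follows. Only execution.targets, before_each, id, base_commit, and resolve: local come from the commit message; the surrounding nesting, key names, and file paths are assumptions for illustration:

```yaml
# Hypothetical sketch — field nesting beyond the names quoted above is assumed.
execution:
  targets:
    - name: claude-superpowers          # variant of the single base 'claude' target
      hooks:
        before_each:
          - cp workspaces/superpowers/.claude/settings.json .claude/settings.json

tests:
  - id: fix-issue-912                   # 'id', not the deprecated 'case'
```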
Force-pushed from 68253cb to 29d75fe.
Switch from provider: claude to provider: claude-cli with an executable field that reads from CLAUDE_EXECUTABLE env var (defaults to "claude"). This allows using custom CLI binaries like claude-zai. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
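The resulting base target might look roughly like this; only provider: claude-cli, the executable field, and the CLAUDE_EXECUTABLE env var are taken from the commit message, and the rest of the structure is assumed:

```yaml
# Hypothetical sketch of the base target in targets.yaml.
targets:
  - name: claude
    provider: claude-cli
    executable: ${{ CLAUDE_EXECUTABLE }}   # falls back to "claude" when unset
```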
The script was speculative and non-functional (used deprecated fields, hardcoded docker config, broken template variables). Not needed for the benchmark showcase. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rtions

- Remove local .agentv/targets.yaml — use repo root targets instead (targets don't merge, closest shadows; the local one forced duplicating grader targets unnecessarily)
- Replace llm-grader assertion with inline rubric strings (auto-unwrapped to the rubrics evaluator)
- Remove unused scripts: mock-agent.sh (broken with workspace repos), setup-plugins.sh (orphaned; settings.json already checked in)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ns assertion

- Replace hardcoded use_target: claude with ${{ AGENT_TARGET }} so the benchmark works with any provider via env var
- Add workspace.hooks.before_each.reset: fast for proper isolation between pool slot reuse across plugin variants
- Remove contains: effectiveCwd assertion (checks response text, not the diff); rubrics already validate the fix via file_changes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
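The two config changes in this commit could be sketched as follows; use_target, ${{ AGENT_TARGET }}, and workspace.hooks.before_each.reset are the fields named above, while their exact placement in the eval file is an assumption:

```yaml
# Hypothetical sketch of the fields named in this commit.
use_target: ${{ AGENT_TARGET }}   # any provider via env var, not hardcoded 'claude'

workspace:
  hooks:
    before_each:
      reset: fast                 # isolate pool slots reused across plugin variants
```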
Summary
Add examples/showcase/bug-fix-benchmark/ — a SWE-bench style showcase that compares coding agent performance with and without engineering plugins on identical bug fix tasks.

Uses target-level hooks (from #1095) to configure each plugin variant via execution.targets in the eval file.

Closes #919
What It Does
Evaluates the same bug fix task across four configurations:

- claude-baseline
- claude-superpowers
- claude-compound
- claude-agent-skills

Metrics compared: tokens consumed, time to complete, fix correctness.
How It Works
A single claude base target is defined in targets.yaml. The eval file uses target-level hooks (execution.targets) so each variant runs a before_each hook that copies the appropriate .claude/settings.json into the workspace.

Files Added
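The file listing itself did not survive extraction; pieced together from the commits in this PR, the layout is roughly the following (reconstructed, not authoritative — file and directory names beyond those mentioned in the commits are guesses):

```
examples/showcase/bug-fix-benchmark/
├── EVAL.yaml        # test case (issue #912) + execution.targets hooks
├── README.md        # comparison workflow and plugin details
└── workspaces/      # per-plugin .claude/settings.json configs
    ├── superpowers/.claude/settings.json
    ├── compound-engineering/.claude/settings.json
    └── agent-skills/.claude/settings.json
```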
How to Run
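The original run instructions were lost in extraction. Based on the env vars and flags mentioned elsewhere in this PR, an invocation presumably looks something like the sketch below; the agentv subcommand and eval-file argument are assumptions, not the documented CLI:

```
# Assumed CLI shape — only AGENT_TARGET, CLAUDE_EXECUTABLE, and the
# --dry-run/--workers flags appear in this PR; the rest is a guess.
export CLAUDE_EXECUTABLE=claude   # or a custom binary such as claude-zai
for target in claude-baseline claude-superpowers claude-compound claude-agent-skills; do
  AGENT_TARGET="$target" agentv run examples/showcase/bug-fix-benchmark/EVAL.yaml --workers 1
done
```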
Test Plan
- bun run build — passes
- bun run typecheck — passes
- bun run lint — passes
- bun run test — 2157 tests pass (1642 core + 67 eval + 448 cli)
- bun run validate:examples — 56/56 valid
- --dry-run --workers 1 — all 4 targets resolve hooks, clone repo, copy correct settings.json

🤖 Generated with Claude Code