
SWE-Bench Adapter Implementation #41

@jharris1679

Description
Overview

Implement the SWE-Bench adapter as the first external benchmark integration. This enables users to run their agent configuration against real GitHub issue resolution tasks and measure resolution rates.

Parent Issue: #9
Linear Issue: ANS-466

Decisions Made

| Decision | Choice | Rationale |
| --- | --- | --- |
| Execution | Variant containers | True A/B testing isolation; aligns with sniffbench's core value prop |
| Evaluation | Full pipeline (local harness) | No point generating predictions without evaluation |
| Patch extraction | File diff | Most natural for coding agents; guaranteed valid format |

Implementation Plan

Phase 1: Foundation

  1. Create the `src/benchmark/` module structure
  2. Implement dataset loading from HuggingFace (`princeton-nlp/SWE-bench_Lite`)
  3. Build an instance runner that:
    • Clones the repo at `base_commit`
    • Mounts it into the variant container
    • Provides the `problem_statement` to the agent
    • Extracts the patch via `git diff`
  4. Add a `sniff bench swe-bench` CLI command
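The instance-runner steps above can be sketched roughly as follows. This is a minimal illustration, not the final implementation: the helper names (`clone_at_commit_cmds`, `extract_patch`) are hypothetical, and container mounting is omitted.

```python
import subprocess
from pathlib import Path


def clone_at_commit_cmds(repo: str, base_commit: str, dest: Path) -> list[list[str]]:
    """Build the git commands that reproduce an instance's base state.

    `repo` is an owner/name string (e.g. "django/django"), as provided
    by the SWE-Bench dataset's `repo` field.
    """
    url = f"https://github.com/{repo}.git"
    return [
        ["git", "clone", url, str(dest)],
        ["git", "-C", str(dest), "checkout", base_commit],
    ]


def extract_patch(workdir: Path) -> str:
    """After the agent edits files in the mounted repo, capture its work
    as a unified diff -- the patch format the evaluation harness consumes."""
    return subprocess.run(
        ["git", "-C", str(workdir), "diff"],
        capture_output=True, text=True, check=True,
    ).stdout
```

The repo is cloned and checked out on the host, mounted into the variant container, and the diff is taken after the agent finishes, so the patch is always a valid `git diff` regardless of how the agent edited files.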

Phase 2: Evaluation

  1. Integrate the SWE-Bench harness (`python -m swebench.harness.run_evaluation`)
  2. Parse results and report resolution rates
  3. Store benchmark results in runs tracking

Phase 3: Polish

  1. Incremental runs (resume from where we left off)
  2. Parallel instance execution
  3. Caching (repos, Docker images)

CLI Commands

```
sniff bench swe-bench --variant lite
sniff bench swe-bench --variant lite --limit 10
sniff bench swe-bench --use-variant control
```

Requirements

  • Docker with at least 16 GB RAM and 8 cores
  • ~100 GB disk for the environment-level image cache
  • Python 3.9+ with the `swebench` package
