test: end-to-end SWE-bench Docker eval run #988

@christso

Objective

Verify that the full SWE-bench evaluation pipeline works end-to-end: import from Hugging Face → Docker container → agent solves → code-grader scores → results in studio.

Context

The pieces are in place but have never been tested together:

Steps to verify

  1. Import a small SWE-bench instance
  2. Pull the SWE-bench Docker image
  3. Run the eval against a real agent provider
  4. Verify: the container starts at the correct commit, the agent runs, the code-grader runs the tests, and the results appear in studio
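
A minimal sketch of steps 1, 2, and the commit check in step 4, for illustration only. `EvalCase`, `to_eval_case`, `docker_commands`, and `commit_matches` are hypothetical names, not this project's API. The row fields (`instance_id`, `repo`, `base_commit`, `problem_statement`, `FAIL_TO_PASS`, `PASS_TO_PASS`) follow the published SWE-bench dataset as I understand it. The `/testbed` checkout path and the assumption that the image leaves `HEAD` at `base_commit` should be confirmed against the image actually pulled. The image name is left as a parameter because the issue does not name one.

```python
import json
from dataclasses import dataclass


@dataclass
class EvalCase:
    """Hypothetical eval-case shape; not this project's actual schema."""
    instance_id: str
    repo: str
    base_commit: str
    prompt: str
    fail_to_pass: list[str]
    pass_to_pass: list[str]


def _as_list(value) -> list[str]:
    # Test lists may arrive as JSON-encoded strings or as real lists.
    return json.loads(value) if isinstance(value, str) else list(value)


def to_eval_case(row: dict) -> EvalCase:
    """Map one SWE-bench dataset row (a plain dict) to an EvalCase."""
    return EvalCase(
        instance_id=row["instance_id"],
        repo=row["repo"],
        base_commit=row["base_commit"],
        prompt=row["problem_statement"],
        fail_to_pass=_as_list(row["FAIL_TO_PASS"]),
        pass_to_pass=_as_list(row["PASS_TO_PASS"]),
    )


def docker_commands(image: str) -> dict[str, list[str]]:
    """Build argv lists for pulling the image and printing the checked-out commit.

    Assumes the repository lives at /testbed inside the container.
    """
    return {
        "pull": ["docker", "pull", image],
        "check_commit": [
            "docker", "run", "--rm", "-w", "/testbed", image,
            "git", "rev-parse", "HEAD",
        ],
    }


def commit_matches(rev_parse_output: str, case: EvalCase) -> bool:
    """Compare `git rev-parse HEAD` output with the instance's base commit."""
    return rev_parse_output.strip() == case.base_commit
```

The argv lists can be passed to `subprocess.run(..., capture_output=True, text=True)`, and the captured stdout of `check_commit` fed to `commit_matches`. Keeping command construction separate from execution makes the commit check testable without Docker.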

Blocked by

Acceptance criteria

  • At least 1 SWE-bench instance runs end-to-end with a real agent
  • Code-grader correctly reports pass/fail
  • Results visible in studio
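
One way to make "correctly reports pass/fail" checkable is the usual SWE-bench resolution rule: an instance counts as resolved only if every `FAIL_TO_PASS` test and every `PASS_TO_PASS` test passes after the agent's change. The sketch below assumes that rule. The `grade` function, the `"PASSED"` status string, and the result shape are hypothetical and are not taken from this project's code-grader.

```python
def grade(
    test_status: dict[str, str],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> dict:
    """Return a pass/fail verdict for one instance.

    test_status maps test IDs to a status string. Any required test that is
    missing from the map, or has a status other than "PASSED", is a failure.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    not_passing = [t for t in required if test_status.get(t) != "PASSED"]
    return {
        # An instance with no required tests is never reported as resolved.
        "resolved": bool(required) and not not_passing,
        "not_passing": not_passing,
    }
```

Treating a missing test as a failure is a deliberate choice here: a test run that crashes before reporting anything should not be scored as a pass.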

Metadata

Assignees

No one assigned

    Labels

    in-progress: Claimed by an agent, do not duplicate work

    Projects

    Status

    Done
