Objective
Verify the full SWE-bench evaluation pipeline works end-to-end: import from HuggingFace → Docker container → agent solves → code-grader scores → results in studio.
Context
The pieces are in place but have never been tested together:
- [...] base_commit into docker-workspace (not yet implemented)
Steps to verify
- Import a small SWE-bench instance
- Pull the SWE-bench Docker image
- Run the eval against a real agent provider
- Verify that the container starts at the correct commit, the agent runs, the code-grader runs the tests, and the results appear in studio
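
The "container starts at correct commit" check could be scripted roughly as below. This is a minimal sketch, not the project's actual tooling: the function names are hypothetical, the image name is whatever was pulled in the previous step, and the repository path `/testbed` is an assumption about where the image keeps the checkout. It also assumes the image has `git` on its path and no entrypoint that swallows the command.

```python
import subprocess


def head_commit_command(image: str, repo_dir: str = "/testbed") -> list[str]:
    """Build a docker command that prints the HEAD commit of the repo in the image.

    repo_dir defaults to /testbed, which is an assumption about the image layout.
    """
    return ["docker", "run", "--rm", image, "git", "-C", repo_dir, "rev-parse", "HEAD"]


def commit_matches(head_output: str, base_commit: str) -> bool:
    """Return True when the container's HEAD equals the instance's base_commit.

    Whitespace and letter case are ignored; an empty base_commit never matches.
    """
    head = head_output.strip().lower()
    expected = base_commit.strip().lower()
    return bool(expected) and head == expected


def container_at_base_commit(image: str, base_commit: str, repo_dir: str = "/testbed") -> bool:
    """Run the check against a real image (requires a local Docker daemon)."""
    result = subprocess.run(
        head_commit_command(image, repo_dir),
        capture_output=True,
        text=True,
        check=True,  # raise if docker or git exits with a non-zero status
    )
    return commit_matches(result.stdout, base_commit)
```

Calling `container_at_base_commit(image, instance["base_commit"])` for the imported instance would give a yes/no answer for the first item of the verification step; the other items (agent run, code-grader, studio) still need to be checked by hand or with their own tooling.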
Blocked by
Acceptance criteria