feat(core): Docker workspace execution environments #971

Merged
christso merged 2 commits into main from feat/965-docker-workspace on Apr 8, 2026
Conversation

christso (Collaborator) commented Apr 8, 2026

Summary

Implements a Docker-based workspace type for coding benchmarks (SWE-bench).

Design: Agent runs on host, grader runs inside container.

  • workspace.docker.image field in EVAL.yaml schema
  • Docker workspace provider: pull → create → start → cp → exec → rm
  • Timeout + resource limits (memory, cpus)
  • Works with existing code-grader evaluator
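
A minimal sketch of the lifecycle listed above, written as argument arrays that an execFile-style runner could consume. The function name `lifecycleCommands`, the `ContainerSpec` shape, and the `sleep infinity` keep-alive command are assumptions for illustration and are not taken from this PR. A `start` step is included because `docker exec` needs a running container.

```typescript
// Hypothetical spec; field names follow the workspace.docker settings in this PR.
interface ContainerSpec {
  image: string;
  memory?: string; // e.g. "4g"
  cpus?: number;
}

// Build one argv array per lifecycle step (no shell strings).
function lifecycleCommands(
  spec: ContainerSpec,
  containerName: string,
  hostPath: string,
  containerPath: string,
  graderCmd: string[],
): string[][] {
  const limits: string[] = [];
  if (spec.memory !== undefined) limits.push("--memory", spec.memory);
  if (spec.cpus !== undefined) limits.push("--cpus", String(spec.cpus));
  return [
    ["docker", "pull", spec.image],
    // Assumed keep-alive command so the container stays up for cp/exec.
    ["docker", "create", "--name", containerName, ...limits, spec.image, "sleep", "infinity"],
    ["docker", "start", containerName],
    ["docker", "cp", hostPath, `${containerName}:${containerPath}`],
    ["docker", "exec", containerName, ...graderCmd],
    ["docker", "rm", "-f", containerName],
  ];
}
```

Keeping each step as an array means image names and paths are passed as single arguments rather than being interpolated into a shell command line.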

Closes #965

Acceptance Signals

  • A coding eval using a Docker workspace runs end-to-end with a real agent target
  • Container is destroyed after evaluation (no state leakage)
  • Timeout kills the container if evaluation exceeds limit
  • Existing git-based workspace templates continue working unchanged
  • At least one working example: SWE-bench instance evaluated through AgentV
  • bun run test passes with new Docker workspace tests
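
One way the timeout signal above could be met is to race the grader run against a timer and force-remove the container when the timer fires. This is a sketch under assumptions: `runWithTimeout` and the `kill` callback are hypothetical names, and the PR's actual mechanism is not shown on this page.

```typescript
// Race `work` against a timer; on expiry call `kill` (for example, a
// `docker rm -f <name>` wrapper) and reject. Names here are hypothetical.
async function runWithTimeout<T>(
  work: Promise<T>,
  timeoutMs: number,
  kill: () => Promise<void>,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      // Remove the container first, then fail the evaluation.
      kill().then(
        () => reject(new Error(`timed out after ${timeoutMs} ms`)),
        reject,
      );
    }, timeoutMs);
  });
  try {
    return await Promise.race([work, timedOut]);
  } finally {
    // Always clear the timer so a finished run does not keep the process alive.
    clearTimeout(timer);
  }
}
```

Since the `timeout` field in the schema is given in seconds, a caller of this sketch would pass `timeout * 1000`.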

Implements Docker-based workspace type for coding benchmarks (SWE-bench).
Agent runs on host, grader runs inside container.

Closes #965

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

cloudflare-workers-and-pages bot commented Apr 8, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 22888f2
Status: ✅  Deploy successful!
Preview URL: https://3a8c15b4.agentv.pages.dev
Branch Preview URL: https://feat-965-docker-workspace.agentv.pages.dev


Add support for running code-grader evaluations inside Docker containers,
enabling benchmarks like SWE-bench to run in isolated container environments.

New YAML schema:
  workspace:
    docker:
      image: <docker-image>
      timeout: 1800      # seconds
      memory: 4g         # optional
      cpus: 2            # optional
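
To illustrate what validating this block might involve, here is a dependency-free sketch. The PR itself uses a Zod schema; a hand-written check is swapped in here only so the example runs without extra packages. Treating `timeout` as required and the memory pattern (digits plus an optional b/k/m/g suffix) are assumptions, as is the function name `parseDockerConfig`.

```typescript
interface DockerWorkspaceConfig {
  image: string;
  timeout: number; // seconds; assumed required in this sketch
  memory?: string; // e.g. "4g"
  cpus?: number;
}

// Hypothetical validator for the parsed `workspace.docker` mapping.
function parseDockerConfig(raw: unknown): DockerWorkspaceConfig {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("workspace.docker must be a mapping");
  }
  const { image, timeout, memory, cpus } = raw as Record<string, unknown>;
  if (typeof image !== "string" || image.length === 0) {
    throw new Error("workspace.docker.image must be a non-empty string");
  }
  if (typeof timeout !== "number" || !Number.isFinite(timeout) || timeout <= 0) {
    throw new Error("workspace.docker.timeout must be a positive number of seconds");
  }
  const cfg: DockerWorkspaceConfig = { image, timeout };
  if (memory !== undefined) {
    // Assumed format: an integer with an optional unit suffix.
    if (typeof memory !== "string" || !/^\d+[bkmg]?$/i.test(memory)) {
      throw new Error("workspace.docker.memory must look like '512m' or '4g'");
    }
    cfg.memory = memory;
  }
  if (cpus !== undefined) {
    if (typeof cpus !== "number" || !Number.isFinite(cpus) || cpus <= 0) {
      throw new Error("workspace.docker.cpus must be a positive number");
    }
    cfg.cpus = cpus;
  }
  return cfg;
}
```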

Changes:
- Add DockerWorkspaceConfig type and Zod schema
- Create DockerWorkspaceProvider with full container lifecycle management
  (pull, create, start, cp, exec, rm), invoking docker via execFile so that
  arguments are never passed through a shell
- Update CodeEvaluator to run graders inside Docker when configured
- Thread dockerConfig through orchestrator evaluation pipeline
- Pull Docker image once during eval setup phase
- Add comprehensive unit tests with mock executor (28 test cases)
- Add docker-workspace example with EVAL.yaml
- Regenerate eval-schema.json

The provider uses a CommandExecutor interface for testability and always
cleans up containers in finally blocks, even on errors.
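
The pattern described above could look roughly like the following. The `CommandExecutor` signature, `gradeInContainer`, and `mockExecutor` are hypothetical stand-ins for illustration; the real interface in this PR may differ.

```typescript
// Hypothetical executor signature; a production version would wrap execFile.
interface CommandExecutor {
  run(file: string, args: string[]): Promise<{ stdout: string; exitCode: number }>;
}

async function gradeInContainer(
  exec: CommandExecutor,
  image: string,
  name: string,
  graderCmd: string[],
): Promise<string> {
  try {
    await exec.run("docker", ["create", "--name", name, image, "sleep", "infinity"]);
    await exec.run("docker", ["start", name]);
    const result = await exec.run("docker", ["exec", name, ...graderCmd]);
    if (result.exitCode !== 0) {
      throw new Error(`grader exited with code ${result.exitCode}`);
    }
    return result.stdout;
  } finally {
    // Runs on success and on any error above. Cleanup failures are swallowed
    // here so they cannot mask the original error.
    await exec.run("docker", ["rm", "-f", name]).catch(() => undefined);
  }
}

// Mock executor: records every call and can fail on a chosen docker subcommand.
function mockExecutor(failOn?: string): { executor: CommandExecutor; calls: string[] } {
  const calls: string[] = [];
  const executor: CommandExecutor = {
    async run(file, args) {
      calls.push([file, ...args].join(" "));
      if (failOn !== undefined && args[0] === failOn) {
        throw new Error(`${failOn} failed`);
      }
      return { stdout: "ok", exitCode: 0 };
    },
  };
  return { executor, calls };
}
```

Injecting the executor lets tests assert on the exact docker invocations, including that `rm -f` is issued after a failed `exec`, without a Docker daemon.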

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@christso christso marked this pull request as ready for review April 8, 2026 06:41
@christso christso merged commit 4a52120 into main Apr 8, 2026
4 checks passed
@christso christso deleted the feat/965-docker-workspace branch April 8, 2026 06:41
Development

Successfully merging this pull request may close these issues.

feat: Docker workspace execution environments for coding benchmarks
