
fix(phai): clean up sandbox docker containers after each eval case #60089

Merged: skoob13 merged 1 commit into master from chore/kill-docker-images on May 28, 2026

@skoob13 skoob13 commented May 26, 2026

Problem

Sandbox Docker containers were not being cleaned up during long-running parallel eval runs, so stale containers accumulated between cases.

Changes

  • Added `_cleanup_case_containers(task_id)` in `ee/hogai/eval/sandboxed/runner.py`, which runs `docker ps -a --filter name=task-sandbox-{task_id}- --format {{.ID}}` and force-removes each match.
  • Wrapped the body of `run_eval_case` in a `try` / `finally` so the cleanup fires on both the success and exception branches, regardless of what the workflow's own cleanup does.
  • Offloaded the cleanup to a worker thread via `asyncio.to_thread` so `docker rm -f` doesn't block the event loop while Docker waits on container shutdown.
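The three changes above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function bodies are guesses based on the description, and the `run` parameter (an injectable command runner) and the `case` callable are hypothetical additions so the sketch can be exercised without a Docker daemon. Error handling is omitted for brevity.

```python
import asyncio
import subprocess
from typing import Any, Awaitable, Callable

# Hypothetical: an injectable runner so the sketch can be tested without Docker.
Runner = Callable[..., subprocess.CompletedProcess]


def _cleanup_case_containers(task_id: str, run: Runner = subprocess.run) -> list[str]:
    """Force-remove every container whose name matches task-sandbox-{task_id}-."""
    listing = run(
        ["docker", "ps", "-a", "--filter", f"name=task-sandbox-{task_id}-", "--format", "{{.ID}}"],
        capture_output=True,
        text=True,
    )
    removed = []
    for container_id in listing.stdout.split():
        # `docker rm -f` can block while Docker waits on container shutdown.
        run(["docker", "rm", "-f", container_id], capture_output=True, text=True)
        removed.append(container_id)
    return removed


async def run_eval_case(
    task_id: str,
    case: Callable[[], Awaitable[Any]],
    run: Runner = subprocess.run,
) -> Any:
    """Run one eval case; the cleanup fires on both success and exception."""
    try:
        return await case()
    finally:
        # The blocking docker calls run in a worker thread, off the event loop.
        await asyncio.to_thread(_cleanup_case_containers, task_id, run)
```

Because the cleanup sits in `finally`, an exception raised by the case still propagates to the caller after the containers have been removed.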

Scoping by the `task_id` prefix (`get_sandbox_name_for_task` in `products/tasks/backend/temporal/process_task/utils.py:597`) guarantees we never touch a concurrently running case's container.
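To illustrate why the trailing hyphen in the filter matters, here is a small sketch. It assumes Docker's `name` filter matches on part of a container name and models that as a plain substring check, which is a simplification. The container names and their suffixes are made up for the example, and `sandbox_filter` and `matches` are hypothetical helpers.

```python
def sandbox_filter(task_id: str) -> str:
    """Build the name filter value that scopes cleanup to one task."""
    return f"task-sandbox-{task_id}-"


def matches(filter_value: str, container_name: str) -> bool:
    # Simplification: model Docker's name filter as a substring match.
    return filter_value in container_name


# Hypothetical container names; the suffix after the task id is invented.
names = ["task-sandbox-12-a1b2", "task-sandbox-123-c3d4", "task-sandbox-12-e5f6"]

# With the trailing hyphen, only task 12's containers match.
scoped = [n for n in names if matches(sandbox_filter("12"), n)]

# Without it, task 123's container would be swept up as well.
unscoped = [n for n in names if matches("task-sandbox-12", n)]
```

Under these assumptions, `scoped` holds the two containers for task `12`, while `unscoped` also includes the container for task `123`.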

How did you test this code?

  • `ruff check` and `ruff format --check` pass on the modified file.
  • Manual verification deferred to the next eval run: run `hogli test ee/hogai/eval/sandboxed/...`, then watch `docker ps --filter name=task-sandbox-` between cases; only the in-flight case's container should appear.

Publish to changelog?

no

Eval cases return as soon as the agent emits end_turn, well before the
ProcessTaskWorkflow's finally block runs cleanup_sandbox. With 16GB-per-
sandbox defaults and max_concurrency=2, accumulated containers exhaust
host memory during a session. Add a finally block in run_eval_case that
force-removes any container matching task-sandbox-{task_id}-*, scoped by
task_id so concurrent cases never touch each other's container.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@skoob13 skoob13 requested a review from a team May 26, 2026 13:41
@github-actions

🎭 Playwright didn't run on this PR — your changes touch code that could affect E2E behavior, but Playwright is opt-in via label now to keep CI cost down.

Add the run-playwright label if you want an E2E sweep before merging — CI will pick it up automatically.

Most PRs don't need this. Real regressions still get caught on master and fix-forward.


greptile-apps Bot commented May 26, 2026

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
ee/hogai/eval/sandboxed/runner.py:45-49
If `docker ps` exits with a non-zero return code without raising a Python exception (e.g. Docker daemon returns an error), `result.returncode` is never inspected, so the failure goes completely unlogged and the loop silently skips cleanup. Adding a check here surfaces this silent failure mode.

```suggestion
    except Exception:
        logger.warning("Failed to list sandbox containers for task %s", task_id, exc_info=True)
        return

    if result.returncode != 0:
        logger.warning(
            "docker ps returned exit code %d for task %s: %s",
            result.returncode,
            task_id,
            result.stderr.strip(),
        )
        return

    for container_id in result.stdout.strip().splitlines():
```

Reviews (1): Last reviewed commit: "fix(phai): clean up sandbox docker conta..."
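The failure mode described in the review comment can be reproduced without Docker. The sketch below uses a child Python process as a stand-in for a failing `docker ps`; the exit code and error text are invented for illustration. It shows that `subprocess.run` without `check=True` does not raise on a non-zero exit, so the failure is only visible through `returncode`.

```python
import subprocess
import sys

# Stand-in for a failing `docker ps`: a child process that exits with status 3.
result = subprocess.run(
    [sys.executable, "-c", "import sys; sys.stderr.write('daemon error'); sys.exit(3)"],
    capture_output=True,
    text=True,
)

# No exception was raised above; the failure only shows up in returncode.
failed = result.returncode != 0

# Skip the removal loop when listing failed, instead of iterating empty stdout.
container_ids = [] if failed else result.stdout.split()
```

Without the `returncode` check, the loop over `result.stdout` would iterate zero times and the listing failure would never be logged.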

Comment thread ee/hogai/eval/sandboxed/runner.py
@skoob13 skoob13 merged commit 0bc3530 into master May 28, 2026
221 checks passed
@skoob13 skoob13 deleted the chore/kill-docker-images branch May 28, 2026 16:32
deployment-status-posthog Bot commented May 28, 2026

Deploy status

| Environment | Status | Deployed At | Workflow |
|---|---|---|---|
| dev | ✅ Deployed | 2026-05-28 17:12 UTC | Run |
| prod-us | ✅ Deployed | 2026-05-28 17:34 UTC | Run |
| prod-eu | ✅ Deployed | 2026-05-28 17:37 UTC | Run |
