Skip to content

Fix v1 sandbox client reuse and upload errors#1483

Merged
xeophon merged 3 commits into
mainfrom
fix/v1-threaded-sandbox-uploads
May 29, 2026
Merged

Fix v1 sandbox client reuse and upload errors#1483
xeophon merged 3 commits into
mainfrom
fix/v1-threaded-sandbox-uploads

Conversation

@xeophon
Copy link
Copy Markdown
Member

@xeophon xeophon commented May 29, 2026

Summary

  • Reuse the threaded sandbox client for v1 sandbox leases instead of creating a fresh raw AsyncSandboxClient per sandbox
  • Keep shared clients open when individual sandbox leases are deleted
  • Wrap program file upload API/timeouts as SandboxError so a single rollout can fail without crashing the whole eval
  • Add lifecycle coverage for shared client cleanup and upload timeout wrapping

Validation

  • uvx ruff format
  • uvx ruff check --fix
  • source /Users/xeophon/code/work/verifiers/.venv/bin/activate && PYTHONPATH=/Users/xeophon/code/work/verifiers-pr-fix python -m pytest tests/test_v1_runtime_lifecycle.py -k "program_sandbox_group_scope_reuses_and_cleans or program_sandbox_global_scope_lives_until_teardown or upload_program_files_wraps_sandbox_upload_timeout" -q
  • source /Users/xeophon/code/work/verifiers/.venv/bin/activate && PYTHONPATH=/Users/xeophon/code/work/verifiers-pr-fix python -m pytest tests/test_v1_runtime_lifecycle.py -q
  • source /Users/xeophon/code/work/verifiers/.venv/bin/activate && PYTHONPATH=/Users/xeophon/code/work/verifiers-pr-fix python -m py_compile verifiers/v1/utils/sandbox_utils.py tests/test_v1_runtime_lifecycle.py

Note: local 100x1 SWE-bench Pro smoke with max_concurrent=100 and num_workers=10 completed after this patch; one sandbox upload timeout was contained as a sample-level SandboxError instead of crashing the eval.


Note

Medium Risk
Changes sandbox client lifetime and error handling on a hot path for evals; behavior is narrower than before (shared client, contained upload errors) and is covered by new lifecycle tests.

Overview
The v1 Runtime now keeps one lazy ThreadedAsyncSandboxClient and passes it into every program/tool/user sandbox lease instead of spinning up a separate AsyncSandboxClient per lease. SandboxLease gains owns_client: shared-runtime leases skip closing the client on delete, and Runtime.teardown deletes leases then always shuts down the shared client via close_sandbox_client (prefers teardown(), else aclose()).

run_background_job only forwards timeout when set, so the backend default (e.g. 900s) applies. upload_program_files maps APIError / UploadTimeoutError to SandboxError so a failed upload fails one rollout instead of crashing the whole eval.

Lifecycle tests cover group/global cleanup (delete vs client close), default background-job timeout, upload timeout wrapping, and delete errors without closing a borrowed client.

Reviewed by Cursor Bugbot for commit 04b8df1. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Fix v1 sandbox client reuse and surface upload errors

  • Runtime now lazily creates and caches a single ThreadedAsyncSandboxClient shared across all lease creation calls, replacing per-lease client instantiation in runtime.py.
  • SandboxLease gains an owns_client flag so shared clients are not closed when a non-owning lease is deleted; teardown closes the shared client after all leases are cleaned up.
  • upload_program_files in sandbox_utils.py now catches APIError and UploadTimeoutError from prime_sandboxes and re-raises as SandboxError with path and sandbox ID context.
  • A new close_sandbox_client helper prefers a synchronous teardown() method on the client before falling back to aclose().
  • Behavioral Change: sandbox client is now closed in a finally block during Runtime.teardown, so it is always closed even if lease deletions raise.

Macroscope summarized 04b8df1.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 89f1cc132b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/v1/utils/sandbox_utils.py Outdated
if _SHARED_SANDBOX_CLIENT is None:
from verifiers.utils.threaded_sandbox_client import ThreadedAsyncSandboxClient

_SHARED_SANDBOX_CLIENT = cast(SandboxClient, ThreadedAsyncSandboxClient())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Handle omitted command_timeout with threaded sandbox client

When a sandbox program omits command_timeout, run_sandbox_command passes None from sandbox_config.get("command_timeout") into lease.run_background_job. Because this line now always backs leases with ThreadedAsyncSandboxClient, that call reaches verifiers/utils/threaded_sandbox_client.py:78-94, where time.monotonic() + timeout raises TypeError for None. This breaks the default sandbox program path before the command runs; preserve the threaded client's default timeout or coerce None before calling it.

Useful? React with 👍 / 👎.

Comment thread verifiers/v1/utils/sandbox_utils.py Outdated
@macroscopeapp
Copy link
Copy Markdown

macroscopeapp Bot commented May 29, 2026

Approvability

Verdict: Needs human review

This PR modifies sandbox client lifecycle management including client reuse and cleanup patterns. Two unresolved review comments flag potential runtime issues (TypeError with None timeout, possible resource leaks). These substantive concerns about bug risks warrant human verification.

You can customize Macroscope's approvability policy. Learn more.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54d95fac8c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/v1/utils/sandbox_utils.py Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 54d95fa. Configure here.

Comment thread verifiers/v1/utils/sandbox_utils.py Outdated
@xeophon xeophon merged commit 0a39271 into main May 29, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants