fix(onboard): bump default sandbox-ready timeout to 300s for GPU images#3357

Closed
latenighthackathon wants to merge 2 commits into
NVIDIA:mainfrom
latenighthackathon:fix/onboard-sandbox-ready-timeout-default

Conversation

@latenighthackathon commented May 11, 2026

Summary

Closes #3344. Raises the default NEMOCLAW_SANDBOX_READY_TIMEOUT from 180s to 300s. The wait runs between openshell sandbox create returning and k3s reporting the pod Ready, and 180s is too aggressive when the sandbox image is GPU-attached and the host just spent several minutes uploading it to the gateway. wangericnv (NV QA) reported a fresh nemoclaw onboard --fresh on Ubuntu 24.04 / RTX 6000 Ada hitting "Sandbox 'brave-test' was created but did not become ready within 180s. The orphaned sandbox has been removed" after a healthy 525s image upload and a sandbox that was actually fine moments later; the wizard then deleted the working sandbox.
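
The failure mode described above can be modelled with a small simulation. This is only a sketch, not the wizard's actual wait loop: `simulateReadinessWait`, the 5-second poll interval, and the 200-second ready time are all hypothetical values chosen for illustration. It shows how a pod that would become Ready shortly after the deadline is reported as failed under a 180s budget but succeeds under 300s.

```typescript
// Hypothetical model of a readiness wait: poll on a fixed interval until the
// pod reports Ready or the timeout budget is spent. Not the real wizard code.
interface WaitResult {
  ready: boolean;
  elapsedSecs: number;
}

function simulateReadinessWait(
  readyAtSecs: number, // when the pod would actually become Ready
  timeoutSecs: number, // the readiness budget, e.g. 180 or 300
  pollIntervalSecs = 5, // assumed poll cadence
): WaitResult {
  for (let elapsed = 0; elapsed <= timeoutSecs; elapsed += pollIntervalSecs) {
    if (elapsed >= readyAtSecs) {
      return { ready: true, elapsedSecs: elapsed };
    }
  }
  // Budget exhausted: the caller would treat the sandbox as orphaned.
  return { ready: false, elapsedSecs: timeoutSecs };
}

// Illustrative value only: a pod that needs 200s to become Ready.
const hypotheticalReadyAtSecs = 200;
const under180 = simulateReadinessWait(hypotheticalReadyAtSecs, 180);
const under300 = simulateReadinessWait(hypotheticalReadyAtSecs, 300);

console.log(under180); // { ready: false, elapsedSecs: 180 }
console.log(under300); // { ready: true, elapsedSecs: 200 }
```

In this model the sandbox itself is identical in both runs; only the budget decides whether it is kept or deleted.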

Related Issue

Closes #3344

Changes

  • src/lib/onboard.ts:26: envInt("NEMOCLAW_SANDBOX_READY_TIMEOUT", 180) → envInt("NEMOCLAW_SANDBOX_READY_TIMEOUT", 300). 67% headroom covers the typical GPU + image-extract + GPU-device-attach case (3-5 min on RTX-class hardware) while still aborting cleanly on truly broken pods. The env var override path is unchanged, so hosts with even slower disks or larger custom images can extend further.
  • Updated the JSDoc comment to record the bump rationale and point at #3344 ([Ubuntu 24.04][Onboard] Brave-preset sandbox onboard times out: "Sandbox … did not become ready within 180s" after 500+s gateway-image upload).
  • test/onboard.test.ts — the existing "allows slow sandbox create recovery to wait beyond 60 seconds" test asserted the 180s literal; updated to assert 300s with a comment explaining the bump.
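
The default-plus-override behaviour described in the list above might look roughly like the following. This is a sketch under assumptions: the real `envInt` in src/lib/onboard.ts is not shown in this PR and presumably reads `process.env` directly, whereas this stand-in takes the environment as an explicit argument so it stays self-contained. The handling of non-numeric or non-positive values is also an assumption.

```typescript
// Hypothetical stand-in for the envInt helper named in the PR description.
// The real helper's signature and validation rules may differ.
type Env = Record<string, string | undefined>;

function envInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number.parseInt(raw, 10);
  // Assumption of this sketch: non-numeric or non-positive values are ignored.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

const TIMEOUT_VAR = "NEMOCLAW_SANDBOX_READY_TIMEOUT";
const DEFAULT_READY_TIMEOUT_SECS = 300; // 180 before this PR

// No override set: the new default applies.
const unset = envInt({}, TIMEOUT_VAR, DEFAULT_READY_TIMEOUT_SECS);

// Operator override for a slow host or a large custom image.
const overridden = envInt({ [TIMEOUT_VAR]: "600" }, TIMEOUT_VAR, DEFAULT_READY_TIMEOUT_SECS);

console.log(unset, overridden); // 300 600
```

In a Node process, the first argument would typically be `process.env`.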

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only

Verification

  • npx prek run --all-files passes for the staged files.
  • npx vitest run -t 'slow sandbox create recovery' test/onboard.test.ts passes (1/1).
  • npm run build:cli clean.
  • No secrets, API keys, or credentials committed.

Rebased on current upstream/main (commit eb15e55e).


Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>

Summary by CodeRabbit

  • Bug Fixes
    • Default sandbox readiness timeout clarified: GPU-enabled sandboxes now use 300s, non-GPU remain at 180s, reducing spurious readiness failures for longer GPU setups. Environment variable override still honored.
  • Tests
    • Updated tests to expect the revised GPU timeout behavior and added explanatory test comments.

copy-pr-bot Bot commented May 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented May 11, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d782e953-7da6-46d9-a11d-11d0f32688b4

📥 Commits

Reviewing files that changed from the base of the PR and between 2c0804e and 33e8e7c.

📒 Files selected for processing (3)
  • src/lib/onboard/env.ts
  • src/lib/onboard/sandbox-gpu-create.ts
  • test/onboard.test.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/onboard.test.ts
  • src/lib/onboard/env.ts

📝 Walkthrough

Default sandbox readiness timeout logic now uses a GPU-specific 300s default when sandboxGpuEnabled is true; the helper reads NEMOCLAW_SANDBOX_READY_TIMEOUT with that computed default, and tests/comments updated to expect/document 300s for GPU sandboxes.

Changes

Sandbox Readiness Timeout Extension

| Layer / File(s) | Summary |
| --- | --- |
| GPU-aware timeout constant and selection logic (`src/lib/onboard/sandbox-gpu-create.ts`) | Adds GPU_SANDBOX_READY_TIMEOUT_SECS = 300 and updates getSandboxReadyTimeoutSecs to compute defaultSecs from sandboxGpuEnabled, using it as the fallback for NEMOCLAW_SANDBOX_READY_TIMEOUT. |
| Configuration default updated (`src/lib/onboard/env.ts`) | SANDBOX_READY_TIMEOUT_SECS fallback changed from 180 to 300 seconds and doc comment expanded to note readiness semantics and the env override. |
| Test expectation and comments (`test/onboard.test.ts`) | Tests updated to expect 300s default for GPU-enabled sandboxes on linux and win32; added comments describing 180s non-GPU vs 300s GPU defaults and env override behavior. |
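
The GPU-aware selection summarised above could be sketched as follows. `GPU_SANDBOX_READY_TIMEOUT_SECS`, `getSandboxReadyTimeoutSecs`, and `sandboxGpuEnabled` are names taken from this PR; everything else is assumed for illustration, including the `NON_GPU_SANDBOX_READY_TIMEOUT_SECS` name, the explicit `env` parameter, and the validation of the override. The 180s non-GPU baseline follows the second commit message rather than the actual source, which is not shown here.

```typescript
// Sketch only: the real helper lives in src/lib/onboard/sandbox-gpu-create.ts
// and may be structured differently.
type Env = Record<string, string | undefined>;

const NON_GPU_SANDBOX_READY_TIMEOUT_SECS = 180; // hypothetical name for the baseline
const GPU_SANDBOX_READY_TIMEOUT_SECS = 300; // constant named in the PR

function getSandboxReadyTimeoutSecs(sandboxGpuEnabled: boolean, env: Env = {}): number {
  // Pick the default from the sandbox type first...
  const defaultSecs = sandboxGpuEnabled
    ? GPU_SANDBOX_READY_TIMEOUT_SECS
    : NON_GPU_SANDBOX_READY_TIMEOUT_SECS;

  // ...then let the env var override either path when it holds a positive integer.
  const raw = env["NEMOCLAW_SANDBOX_READY_TIMEOUT"];
  const parsed = raw === undefined ? Number.NaN : Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : defaultSecs;
}

console.log(getSandboxReadyTimeoutSecs(true)); // 300
console.log(getSandboxReadyTimeoutSecs(false)); // 180
console.log(getSandboxReadyTimeoutSecs(false, { NEMOCLAW_SANDBOX_READY_TIMEOUT: "420" })); // 420
```

Keeping the GPU constant private to the GPU path leaves the non-GPU default untouched, which is why the pre-existing 180s assertions for non-GPU callers would continue to hold.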

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3434: Touches GPU sandbox readiness-timeout wiring and is related to readiness timeout behavior.

Suggested labels

NemoClaw CLI, Sandbox

Suggested reviewers

  • ericksoa
  • cv

Poem

🐰 I wait a bit longer beneath the moonlight,
GPUs stretch their paws, preparing for flight,
From one-eighty to three-hundred, patience takes hold,
Tests hum in chorus, no timeout story told.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately describes the main change: increasing the default sandbox-ready timeout from 180s to 300s specifically for GPU-enabled sandboxes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


@wscurran added the Platform: Ubuntu, OpenShell, and fix labels May 11, 2026
@wscurran

✨ Thanks for submitting this detailed PR to bump the default sandbox-ready timeout to 300s for GPU images, addressing the issue reported in #3344 where the sandbox onboard process times out after a lengthy image upload. This change aims to provide sufficient headroom for the typical GPU + image-extract + GPU-device-attach case.


@cv closed this May 12, 2026
@cv reopened this May 12, 2026
@latenighthackathon force-pushed the fix/onboard-sandbox-ready-timeout-default branch 2 times, most recently from d043bbc to 2c0804e May 14, 2026 01:45

Closes NVIDIA#3344. The default `NEMOCLAW_SANDBOX_READY_TIMEOUT` was 180s,
which is the wait between `openshell sandbox create` returning and
k3s reporting the pod Ready. On RTX-class hardware with a GPU-attached
sandbox image, the image extract + GPU device attach + k3s pod
scheduling can legitimately exceed 3 minutes (wangericnv reported a
fresh onboard hitting the 180s wall on RTX 6000 Ada / Ubuntu 24.04
after a healthy 525s image upload). The onboard then surfaces a
confusing "Sandbox 'X' was created but did not become ready within
180s. The orphaned sandbox has been removed" message even though the
gateway and sandbox were fine, and deletes the working sandbox.

Raise the default to 300s. 67% headroom covers the common GPU+image
extract case while still aborting cleanly on truly broken pods. Hosts
with slower disks or larger custom images can still extend via
`NEMOCLAW_SANDBOX_READY_TIMEOUT`. Existing test that asserted the 180s
literal is updated to 300s with a comment pointing at the issue.

Signed-off-by: latenighthackathon <support@latenighthackathon.com>

Address CI test breakage exposed by the merge from main: upstream landed a
parametric getSandboxReadyTimeoutSecs(sandboxGpuEnabled) helper whose three
default-path assertions all expected 180s. Our PR bumped the shared default
to 300s, so both GPU and non-GPU callers regressed against those assertions.

Narrow the fix to the originally-reported scope (RTX-class GPU image extract
+ device attach hitting the 180s wall, NVIDIA#3344): non-GPU sandboxes keep the
180s baseline, GPU sandboxes get a 300s default via a new private
GPU_SANDBOX_READY_TIMEOUT_SECS constant in sandbox-gpu-create.ts.
NEMOCLAW_SANDBOX_READY_TIMEOUT continues to override both paths.

Signed-off-by: latenighthackathon <support@latenighthackathon.com>
Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>

@latenighthackathon force-pushed the fix/onboard-sandbox-ready-timeout-default branch from 44b90ee to 33e8e7c May 15, 2026 02:10
@latenighthackathon

Closing as superseded by #3436, which shipped in v0.0.43 and addresses the root cause of #3344.

The framing here matches the precedent set by #3435 / #3440: the original "Sandbox 'brave-test' was created but did not become ready within 180s" symptom was caused by the literal /proc/self/task/*/comm GPU policy crash, which prevented the sandbox from ever reaching ready; it was not legitimate slowness in the readiness wait. With the GPU policy fix in #3436 landed, the 180s budget is sufficient on RTX-class hardware. Operators who still hit the timeout on truly large custom images can override it via NEMOCLAW_SANDBOX_READY_TIMEOUT, which is now documented in #3440.

Thanks for the review attention here. Cheers!

@latenighthackathon deleted the fix/onboard-sandbox-ready-timeout-default branch May 18, 2026 05:10
Labels

fix, OpenShell, Platform: Ubuntu

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Ubuntu 24.04][Onboard] Brave-preset sandbox onboard times out: "Sandbox … did not become ready within 180s" after 500+s gateway-image upload

3 participants