fix(onboard): bump default sandbox-ready timeout to 300s for GPU images by latenighthackathon · Pull Request #3357 · NVIDIA/NemoClaw

latenighthackathon · 2026-05-11T19:59:41Z

Summary

Closes #3344. Raises the default NEMOCLAW_SANDBOX_READY_TIMEOUT from 180s to 300s. The wait covers the interval between openshell sandbox create returning and k3s reporting the pod Ready; 180s is too aggressive when the sandbox image is GPU-attached and the host has just spent several minutes uploading it to the gateway. wangericnv (NV QA) reported a nemoclaw onboard --fresh run on Ubuntu 24.04 / RTX 6000 Ada that failed with "Sandbox 'brave-test' was created but did not become ready within 180s. The orphaned sandbox has been removed" after a healthy 525s image upload. The sandbox was actually fine moments later, but by then the wizard had deleted it.
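To make the failure mode concrete, here is a minimal sketch of a readiness wait with a fixed deadline. It is not the NemoClaw implementation: `waitForReady`, `Clock`, `fakeClock`, `podReadyAfter`, the 5s poll interval, and the 200s readiness figure are all hypothetical and chosen only for illustration. A simulated clock is injected so the example runs instantly.

```typescript
// Hypothetical sketch: none of these names come from the NemoClaw source.
interface Clock {
  nowSecs(): number;             // current time in seconds
  sleepSecs(secs: number): void; // advance (or block for) `secs` seconds
}

interface ReadyOutcome {
  ready: boolean;
  waitedSecs: number;
}

// Poll `isReady` until it returns true or `timeoutSecs` has elapsed.
function waitForReady(
  isReady: () => boolean,
  timeoutSecs: number,
  clock: Clock,
  pollIntervalSecs = 5, // assumed interval, not taken from the PR
): ReadyOutcome {
  const start = clock.nowSecs();
  for (;;) {
    const waited = clock.nowSecs() - start;
    if (isReady()) return { ready: true, waitedSecs: waited };
    if (waited >= timeoutSecs) return { ready: false, waitedSecs: waited };
    // Never sleep past the deadline.
    clock.sleepSecs(Math.min(pollIntervalSecs, timeoutSecs - waited));
  }
}

// Simulated clock: sleeping just advances a counter.
function fakeClock(): Clock {
  let t = 0;
  return { nowSecs: () => t, sleepSecs: (s) => { t += s; } };
}

// A simulated pod that reports Ready `readySecs` after creation.
function podReadyAfter(readySecs: number, clock: Clock): () => boolean {
  const created = clock.nowSecs();
  return () => clock.nowSecs() - created >= readySecs;
}

// Illustrative numbers only: a pod that needs 200s to become Ready.
const clockA = fakeClock();
const tooShort = waitForReady(podReadyAfter(200, clockA), 180, clockA);
// tooShort is { ready: false, waitedSecs: 180 }

const clockB = fakeClock();
const longEnough = waitForReady(podReadyAfter(200, clockB), 300, clockB);
// longEnough is { ready: true, waitedSecs: 200 }
```

In this sketch a caller that cleans up on `ready: false` would remove a sandbox that needed only 20 more seconds, which is the behaviour the report describes.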

Related Issue

Closes #3344

Changes

- `src/lib/onboard.ts:26`: `envInt("NEMOCLAW_SANDBOX_READY_TIMEOUT", 180)` → `envInt("NEMOCLAW_SANDBOX_READY_TIMEOUT", 300)`. The 67% headroom covers the typical GPU + image-extract + GPU-device-attach case (3-5 min on RTX-class hardware) while still aborting cleanly on truly broken pods. The env var override path is unchanged, so hosts with even slower disks or larger custom images can extend the timeout further.
- Updated the JSDoc comment to record the rationale for the bump and to reference #3344 ([Ubuntu 24.04][Onboard] Brave-preset sandbox onboard times out: "Sandbox … did not become ready within 180s" after 500+s gateway-image upload).
- `test/onboard.test.ts`: the existing "allows slow sandbox create recovery to wait beyond 60 seconds" test asserted the 180s literal; it now asserts 300s, with a comment explaining the bump.
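The default-plus-override behaviour described in the list above can be sketched as follows. This is only an approximation of what an `envInt`-style helper might do; the real helper in `src/lib/onboard.ts` takes two arguments and may parse or validate differently. The explicit `env` parameter and the fall-back-on-invalid-input rule are assumptions made to keep the example self-contained.

```typescript
// Hypothetical sketch of an envInt-style helper. The third parameter is an
// assumption; a real CLI would most likely read from process.env directly.
type Env = Record<string, string | undefined>;

function envInt(name: string, fallback: number, env: Env): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number.parseInt(raw, 10);
  // Assumed policy: ignore non-numeric or non-positive values.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// With the variable unset, the new default applies.
const unsetTimeout = envInt("NEMOCLAW_SANDBOX_READY_TIMEOUT", 300, {});
// unsetTimeout === 300

// An operator with a slow disk or a large custom image extends the wait.
const overriddenTimeout = envInt("NEMOCLAW_SANDBOX_READY_TIMEOUT", 300, {
  NEMOCLAW_SANDBOX_READY_TIMEOUT: "600",
});
// overriddenTimeout === 600
```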

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only

Verification

npx prek run --all-files passes for the staged files.
npx vitest run -t 'slow sandbox create recovery' test/onboard.test.ts passes (1/1).
npm run build:cli clean.
No secrets, API keys, or credentials committed.

Rebased on current upstream/main (commit eb15e55e).

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>

Summary by CodeRabbit

Bug Fixes
- Default sandbox readiness timeout clarified: GPU-enabled sandboxes now use 300s, non-GPU remain at 180s, reducing spurious readiness failures for longer GPU setups. Environment variable override still honored.
Tests
- Updated tests to expect the revised GPU timeout behavior and added explanatory test comments.

copy-pr-bot · 2026-05-11T19:59:44Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-05-11T19:59:58Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d782e953-7da6-46d9-a11d-11d0f32688b4

📥 Commits

Reviewing files that changed from the base of the PR and between 2c0804e and 33e8e7c.

📒 Files selected for processing (3)

src/lib/onboard/env.ts
src/lib/onboard/sandbox-gpu-create.ts
test/onboard.test.ts

🚧 Files skipped from review as they are similar to previous changes (2)

test/onboard.test.ts
src/lib/onboard/env.ts

📝 Walkthrough

Default sandbox readiness timeout logic now uses a GPU-specific 300s default when sandboxGpuEnabled is true; the helper reads NEMOCLAW_SANDBOX_READY_TIMEOUT with that computed default, and tests/comments updated to expect/document 300s for GPU sandboxes.

Changes

Sandbox Readiness Timeout Extension

| Layer / File(s) | Summary |
|---|---|
| GPU-aware timeout constant and selection logic: `src/lib/onboard/sandbox-gpu-create.ts` | Adds `GPU_SANDBOX_READY_TIMEOUT_SECS = 300` and updates `getSandboxReadyTimeoutSecs` to compute `defaultSecs` from `sandboxGpuEnabled`, using it as the fallback for `NEMOCLAW_SANDBOX_READY_TIMEOUT`. |
| Configuration default updated: `src/lib/onboard/env.ts` | `SANDBOX_READY_TIMEOUT_SECS` fallback changed from 180 to 300 seconds and doc comment expanded to note readiness semantics and the env override. |
| Test expectation and comments: `test/onboard.test.ts` | Tests updated to expect 300s default for GPU-enabled sandboxes on linux and win32; added comments describing 180s non-GPU vs 300s GPU defaults and env override behavior. |
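As a rough illustration of the GPU-aware selection summarized in this walkthrough, the sketch below picks a default from `sandboxGpuEnabled` and lets `NEMOCLAW_SANDBOX_READY_TIMEOUT` override it. It follows the split stated in the second commit message and the CodeRabbit summary (180s non-GPU, 300s GPU). The constant name `BASE_SANDBOX_READY_TIMEOUT_SECS`, the explicit `env` parameter, and the parsing rules are assumptions, not the actual code in `sandbox-gpu-create.ts`.

```typescript
// Hypothetical sketch; only GPU_SANDBOX_READY_TIMEOUT_SECS,
// getSandboxReadyTimeoutSecs, sandboxGpuEnabled, and the env var name
// appear in the PR discussion. Everything else is assumed.
type Env = Record<string, string | undefined>;

const BASE_SANDBOX_READY_TIMEOUT_SECS = 180; // assumed name for the non-GPU baseline
const GPU_SANDBOX_READY_TIMEOUT_SECS = 300;

function getSandboxReadyTimeoutSecs(sandboxGpuEnabled: boolean, env: Env = {}): number {
  // Choose the fallback first, then let the env var override either path.
  const defaultSecs = sandboxGpuEnabled
    ? GPU_SANDBOX_READY_TIMEOUT_SECS
    : BASE_SANDBOX_READY_TIMEOUT_SECS;
  const raw = env.NEMOCLAW_SANDBOX_READY_TIMEOUT;
  const parsed = raw === undefined ? Number.NaN : Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : defaultSecs;
}

const gpuDefault = getSandboxReadyTimeoutSecs(true);   // 300
const cpuDefault = getSandboxReadyTimeoutSecs(false);  // 180
const gpuOverride = getSandboxReadyTimeoutSecs(true, {
  NEMOCLAW_SANDBOX_READY_TIMEOUT: "600",
});                                                     // 600
```

Keeping the GPU constant separate from the shared default is what lets the pre-existing non-GPU assertions keep expecting 180s.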

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

NVIDIA/NemoClaw#3434: Touches GPU sandbox readiness-timeout wiring and is related to readiness timeout behavior.

Suggested labels

NemoClaw CLI, Sandbox

Suggested reviewers

ericksoa
cv

Poem

🐰 I wait a bit longer beneath the moonlight,
GPUs stretch their paws, preparing for flight,
From one-eighty to three-hundred, patience takes hold,
Tests hum in chorus, no timeout story told.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately describes the main change: increasing the default sandbox-ready timeout from 180s to 300s specifically for GPU-enabled sandboxes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.

Comment @coderabbitai help to get the list of available commands and usage tips.

wscurran · 2026-05-11T21:44:41Z

✨ Thanks for submitting this detailed PR to bump the default sandbox-ready timeout to 300s for GPU images, addressing the issue reported in #3344 where the sandbox onboard process times out after a lengthy image upload. This change aims to provide sufficient headroom for the typical GPU + image-extract + GPU-device-attach case.

Related open issues:

#3344 [Ubuntu 24.04][Onboard] Brave-preset sandbox onboard times out: "Sandbox … did not become ready within 180s" after 500+s gateway-image upload

Closes NVIDIA#3344.

The default `NEMOCLAW_SANDBOX_READY_TIMEOUT` was 180s, which is the wait between `openshell sandbox create` returning and k3s reporting the pod Ready. On RTX-class hardware with a GPU-attached sandbox image, the image extract + GPU device attach + k3s pod scheduling can legitimately exceed 3 minutes (wangericnv reported a fresh onboard hitting the 180s wall on RTX 6000 Ada / Ubuntu 24.04 after a healthy 525s image upload). The onboard then surfaces a confusing "Sandbox 'X' was created but did not become ready within 180s. The orphaned sandbox has been removed" message even though the gateway and sandbox were fine, and deletes the working sandbox.

Raise the default to 300s. 67% headroom covers the common GPU+image extract case while still aborting cleanly on truly broken pods. Hosts with slower disks or larger custom images can still extend via `NEMOCLAW_SANDBOX_READY_TIMEOUT`.

Existing test that asserted the 180s literal is updated to 300s with a comment pointing at the issue.

Signed-off-by: latenighthackathon <support@latenighthackathon.com>

Address CI test breakage exposed by the merge from main: upstream landed a parametric getSandboxReadyTimeoutSecs(sandboxGpuEnabled) helper whose three default-path assertions all expected 180s. Our PR bumped the shared default to 300s, so both GPU and non-GPU callers regressed against those assertions.

Narrow the fix to the originally-reported scope (RTX-class GPU image extract + device attach hitting the 180s wall, NVIDIA#3344): non-GPU sandboxes keep the 180s baseline, GPU sandboxes get a 300s default via a new private GPU_SANDBOX_READY_TIMEOUT_SECS constant in sandbox-gpu-create.ts. NEMOCLAW_SANDBOX_READY_TIMEOUT continues to override both paths.

Signed-off-by: latenighthackathon <support@latenighthackathon.com>
Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>

latenighthackathon · 2026-05-15T04:28:19Z

Closing as superseded by #3436, which shipped in v0.0.43 and addresses the root cause of #3344.

The framing here matches the precedent set by #3435 / #3440: the original "Sandbox 'brave-test' was created but did not become ready within 180s" symptom was caused by the literal /proc/self/task/*/comm GPU policy crash, which prevented the sandbox from ever reaching ready, not by legitimate slowness in the readiness wait. With the GPU policy fix in #3436 landed, the 180s budget is sufficient on RTX-class hardware. Operators who still hit the timeout on truly large custom images can override it via NEMOCLAW_SANDBOX_READY_TIMEOUT, which is now documented in #3440.

Thanks for the review attention here. Cheers!

wscurran added the labels Platform: Ubuntu (Support for Linux Ubuntu), OpenShell (Support for OpenShell, a safe, private runtime for autonomous AI agents), and fix on May 11, 2026

cv closed this May 12, 2026

cv reopened this May 12, 2026

latenighthackathon force-pushed the fix/onboard-sandbox-ready-timeout-default branch 2 times, most recently from d043bbc to 2c0804e, on May 14, 2026 01:45

latenighthackathon added 2 commits May 15, 2026 02:07

latenighthackathon force-pushed the fix/onboard-sandbox-ready-timeout-default branch from 44b90ee to 33e8e7c on May 15, 2026 02:10

latenighthackathon closed this May 15, 2026

latenighthackathon deleted the fix/onboard-sandbox-ready-timeout-default branch May 18, 2026 05:10

latenighthackathon wants to merge 2 commits into NVIDIA:main from latenighthackathon:fix/onboard-sandbox-ready-timeout-default
