fix(onboard): bump default sandbox-ready timeout to 300s for GPU images #3357
latenighthackathon wants to merge 2 commits into
Walkthrough
Default sandbox readiness timeout logic now uses a GPU-specific 300s default when
Changes: Sandbox Readiness Timeout Extension
Thanks for submitting this detailed PR to bump the default sandbox-ready timeout to 300s for GPU images, addressing the issue reported in #3344 where the sandbox onboard process times out after a lengthy image upload. This change aims to provide sufficient headroom for the typical GPU + image-extract + GPU-device-attach case.
d043bbc to 2c0804e
Closes NVIDIA#3344.

The default `NEMOCLAW_SANDBOX_READY_TIMEOUT`, the wait between `openshell sandbox create` returning and k3s reporting the pod Ready, was 180s. On RTX-class hardware with a GPU-attached sandbox image, the image extract + GPU device attach + k3s pod scheduling can legitimately exceed 3 minutes (wangericnv reported a fresh onboard hitting the 180s wall on RTX 6000 Ada / Ubuntu 24.04 after a healthy 525s image upload). The onboard then surfaces a confusing "Sandbox 'X' was created but did not become ready within 180s. The orphaned sandbox has been removed" message even though the gateway and sandbox were fine, and deletes the working sandbox.

Raise the default to 300s. The 67% headroom over 180s covers the common GPU + image extract case while still aborting cleanly on truly broken pods. Hosts with slower disks or larger custom images can still extend via `NEMOCLAW_SANDBOX_READY_TIMEOUT`.

The existing test that asserted the 180s literal is updated to 300s, with a comment pointing at the issue.

Signed-off-by: latenighthackathon <support@latenighthackathon.com>
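To make the failure mode concrete, here is a minimal sketch of a readiness wait with a deadline. It is not the project's code: `waitForSandboxReady`, `Clock`, `makeFakeClock`, the 2s poll interval, and the 240s readiness time are all hypothetical, chosen only to show how a pod that becomes Ready after three minutes is reported as failed at 180s but succeeds at 300s. The clock is injected and synchronous so the example runs instantly.

```typescript
// Hypothetical sketch of a deadline-bounded readiness wait; all names here
// are assumptions for illustration, not identifiers from the repository.
interface Clock {
  now(): number;           // current time in milliseconds
  sleep(ms: number): void; // blocks, or advances a fake clock in tests
}

function waitForSandboxReady(
  isReady: () => boolean,
  timeoutSecs: number,
  clock: Clock,
  pollIntervalMs = 2000, // assumed poll interval
): boolean {
  const deadline = clock.now() + timeoutSecs * 1000;
  for (;;) {
    if (isReady()) return true;                // pod reported Ready in time
    if (clock.now() >= deadline) return false; // caller would report the timeout
    clock.sleep(pollIntervalMs);
  }
}

// Fake clock that only moves when sleep() is called.
function makeFakeClock(): Clock {
  let t = 0;
  return {
    now: () => t,
    sleep: (ms: number) => {
      t += ms;
    },
  };
}

// Illustrative scenario: a pod that turns Ready 240s after creation.
const clock180 = makeFakeClock();
const readyWithin180 = waitForSandboxReady(() => clock180.now() >= 240000, 180, clock180);

const clock300 = makeFakeClock();
const readyWithin300 = waitForSandboxReady(() => clock300.now() >= 240000, 300, clock300);
```

Under these assumptions `readyWithin180` is `false` and `readyWithin300` is `true`: the same healthy pod is treated as broken or fine depending only on the deadline.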
Address CI test breakage exposed by the merge from main: upstream landed a parametric getSandboxReadyTimeoutSecs(sandboxGpuEnabled) helper whose three default-path assertions all expected 180s. Our PR bumped the shared default to 300s, so both GPU and non-GPU callers regressed against those assertions.

Narrow the fix to the originally-reported scope (RTX-class GPU image extract + device attach hitting the 180s wall, NVIDIA#3344): non-GPU sandboxes keep the 180s baseline, GPU sandboxes get a 300s default via a new private GPU_SANDBOX_READY_TIMEOUT_SECS constant in sandbox-gpu-create.ts. NEMOCLAW_SANDBOX_READY_TIMEOUT continues to override both paths.

Signed-off-by: latenighthackathon <support@latenighthackathon.com>
Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>
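The selection rule described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in sandbox-gpu-create.ts: the `env` parameter, the `DEFAULT_SANDBOX_READY_TIMEOUT_SECS` name, and the handling of invalid or non-positive override values are hypothetical. Only `getSandboxReadyTimeoutSecs(sandboxGpuEnabled)`, `GPU_SANDBOX_READY_TIMEOUT_SECS`, `NEMOCLAW_SANDBOX_READY_TIMEOUT`, and the 180s/300s values come from the commit message.

```typescript
// Hypothetical sketch; the env parameter and fallback behavior are assumptions.
type Env = Record<string, string | undefined>;

const DEFAULT_SANDBOX_READY_TIMEOUT_SECS = 180; // non-GPU baseline (assumed name)
const GPU_SANDBOX_READY_TIMEOUT_SECS = 300;     // GPU default named in the commit

function getSandboxReadyTimeoutSecs(sandboxGpuEnabled: boolean, env: Env = {}): number {
  // The environment override applies to both the GPU and non-GPU paths.
  const raw = env["NEMOCLAW_SANDBOX_READY_TIMEOUT"];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  if (Number.isFinite(parsed) && parsed > 0) return parsed;

  // Without a usable override, pick the default by sandbox type.
  return sandboxGpuEnabled
    ? GPU_SANDBOX_READY_TIMEOUT_SECS
    : DEFAULT_SANDBOX_READY_TIMEOUT_SECS;
}

const nonGpuDefault = getSandboxReadyTimeoutSecs(false);
const gpuDefault = getSandboxReadyTimeoutSecs(true);
const overridden = getSandboxReadyTimeoutSecs(true, { NEMOCLAW_SANDBOX_READY_TIMEOUT: "600" });
```

Passing the environment in as an argument keeps the sketch free of Node-specific globals; a real helper would more likely read `process.env` directly.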
44b90ee to 33e8e7c
Closing as superseded by #3436, which shipped in v0.0.43 and addresses the root cause of #3344. The framing here matches the precedent set by #3435 / #3440: the original "Sandbox 'brave-test' was created but did not become ready within 180s" symptom was the literal

Thanks for the review attention here. Cheers!
Summary
Closes #3344. Raises the default `NEMOCLAW_SANDBOX_READY_TIMEOUT` from 180s to 300s. The wait runs between `openshell sandbox create` returning and k3s reporting the pod Ready, and 180s is too aggressive when the sandbox image is GPU-attached and the host just spent several minutes uploading it to the gateway. wangericnv (NV QA) reported a fresh `nemoclaw onboard --fresh` on Ubuntu 24.04 / RTX 6000 Ada hitting "Sandbox 'brave-test' was created but did not become ready within 180s. The orphaned sandbox has been removed" after a healthy 525s image upload and a sandbox that was actually fine moments later; the wizard then deleted the working sandbox.
Related Issue
Closes #3344
Changes
- `src/lib/onboard.ts:26`: `envInt("NEMOCLAW_SANDBOX_READY_TIMEOUT", 180)` → `envInt("NEMOCLAW_SANDBOX_READY_TIMEOUT", 300)`. 67% headroom covers the typical GPU + image-extract + GPU-device-attach case (3-5 min on RTX-class hardware) while still aborting cleanly on truly broken pods. The env var override path is unchanged, so hosts with even slower disks or larger custom images can extend further.
- `test/onboard.test.ts`: the existing "allows slow sandbox create recovery to wait beyond 60 seconds" test asserted the 180s literal; updated to assert 300s with a comment explaining the bump.
Type of Change
Verification
- `npx prek run --all-files` passes for the staged files.
- `npx vitest run -t 'slow sandbox create recovery' test/onboard.test.ts` passes (1/1).
- `npm run build:cli` clean.
- Rebased on current `upstream/main` (commit `eb15e55e`).

Signed-off-by: latenighthackathon <latenighthackathon@users.noreply.github.com>