docs(onboard): document NEMOCLAW_SANDBOX_READY_TIMEOUT #3440

Merged
cv merged 6 commits into main from docs/sandbox-ready-timeout on May 15, 2026

Conversation

Contributor

@laitingsheng laitingsheng commented May 13, 2026

Summary

`NEMOCLAW_SANDBOX_READY_TIMEOUT` has been a recognised env var since #2849, but no documentation accompanied it: `docs/reference/commands.md`, `docs/reference/troubleshooting.md`, and the inference / deployment guides only mention the companion `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` (added in #1620 and documented at that time). Operators who hit `Sandbox '<name>' was created but did not become ready within 180s` cannot find the workaround by searching the docs, and the two timeouts are easy to conflate. This PR closes the documentation gap left by #2849.

Originally attempted under #3435, which was closed because it mis-framed the docs as resolving #3344 / #3416 (the root cause of both was the GPU policy bug fixed in #3436, not a timeout misconfiguration). The docs themselves still have value as a follow-up to the env-var introductions, so this reopens them as a new PR with the correct framing.

Related Issue

Changes

  • docs/reference/commands.md: add NEMOCLAW_SANDBOX_READY_TIMEOUT and NEMOCLAW_LOCAL_INFERENCE_TIMEOUT to the Onboard Timeouts table.
  • docs/reference/troubleshooting.md: new troubleshooting entry "Sandbox onboard times out with 'did not become ready within Ns'" that distinguishes the readiness wait from the inference-probe budget, with a worked example.
  • docs/inference/use-local-inference.md: cross-link the two timeouts from the existing NEMOCLAW_LOCAL_INFERENCE_TIMEOUT section so readers of either knob land on the other.
  • docs/deployment/deploy-to-remote-gpu.md: new "First-Run Readiness Budget" section calling out DGX Station / cloud-VM / large-quantised-model conditions that exceed the default and showing how to raise it.

No code changes — the readiness behaviour is unchanged.
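
As a rough illustration of the workaround these docs describe, the sketch below raises both budgets before onboarding. The `600` value matches the deployment-guide example quoted later in this thread; the `300` for the inference probe is an illustrative assumption, not a recommended setting, and the `command -v` guard is only there so the snippet stays runnable on a host without the CLI.

```shell
# Raise the post-create sandbox readiness wait above its 180 s default.
export NEMOCLAW_SANDBOX_READY_TIMEOUT=600

# Separately raise the inference-server validation probe budget.
# 300 is an illustrative value (assumption), not taken from the docs.
export NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300

# Run onboarding only when the CLI is installed on this host.
if command -v nemoclaw >/dev/null 2>&1; then
  nemoclaw onboard
else
  echo "nemoclaw not found on PATH; timeouts are exported for a later run"
fi
```

The two variables are independent, so raising one does not extend the other.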

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary by CodeRabbit

  • Documentation
    • Added a “First-Run Readiness Budget” note for remote GPU hosts explaining longer initial sandbox build/upload times and advice to increase NEMOCLAW_SANDBOX_READY_TIMEOUT.
    • Clarified that NEMOCLAW_LOCAL_INFERENCE_TIMEOUT applies to inference-server validation while sandbox readiness uses NEMOCLAW_SANDBOX_READY_TIMEOUT (default 180s).
    • Expanded examples for exporting both timeouts and onboarding timeout messaging.
    • Added troubleshooting guidance and inspection steps when sandbox readiness timeouts delete partial sandboxes.
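
To make the timeout messaging concrete, here is a minimal sketch that reads the budget out of the readiness failure message and proposes a larger one. The message text is the one quoted in the PR summary; the sandbox name `demo` and the doubling heuristic are assumptions made purely for illustration, not behaviour of `nemoclaw` itself.

```shell
# Example failure line in the format quoted in the PR summary ('demo' is a made-up name).
msg="Sandbox 'demo' was created but did not become ready within 180s"

# Extract the sandbox name and the budget (in seconds) that expired.
name=$(printf '%s\n' "$msg" | sed -n "s/^Sandbox '\([^']*\)' was created.*/\1/p")
budget=$(printf '%s\n' "$msg" | sed -n 's/.*did not become ready within \([0-9][0-9]*\)s.*/\1/p')

# Doubling the expired budget is an arbitrary heuristic (assumption).
suggested=$((budget * 2))

echo "sandbox: $name, expired budget: ${budget}s"
echo "export NEMOCLAW_SANDBOX_READY_TIMEOUT=$suggested"
# prints: export NEMOCLAW_SANDBOX_READY_TIMEOUT=360
```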


Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
@github-actions
Contributor

🚀 Docs preview ready!

https://NVIDIA.github.io/NemoClaw/pr-preview/pr-3440/

@laitingsheng laitingsheng added the documentation Improvements or additions to documentation label May 13, 2026
Contributor

github-actions Bot commented May 13, 2026

E2E Advisor Recommendation

Required E2E: None
Optional E2E: None

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • None. Docs-only PR. The changes update explanatory deployment, inference, command reference, and troubleshooting text for existing timeout behavior and do not modify installer/onboarding code, sandbox lifecycle logic, credentials, network policy, inference routing, deployment scripts, workflows, or user-flow test assets.

Optional E2E

  • None.

New E2E recommendations

  • None.

Contributor

coderabbitai Bot commented May 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5089588d-d90c-419d-95a9-632ae7ed7336

📥 Commits

Reviewing files that changed from the base of the PR and between 569dfe4 and a1267d5.

📒 Files selected for processing (3)
  • docs/inference/use-local-inference.md
  • docs/reference/commands.md
  • docs/reference/troubleshooting.md
✅ Files skipped from review due to trivial changes (2)
  • docs/inference/use-local-inference.md
  • docs/reference/troubleshooting.md

Walkthrough

Adds documentation for the sandbox readiness timeout `NEMOCLAW_SANDBOX_READY_TIMEOUT`, clarifies that `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` applies only to the inference-server validation probe, and updates examples, reference, deployment guidance, and troubleshooting for sandbox onboarding timeouts and cleanup behavior.

Changes

Sandbox readiness & onboarding timeouts

| Layer / File(s) | Summary |
| --- | --- |
| CLI timeout reference & examples (`docs/reference/commands.md`) | Add `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` (inference-server validation probe) and `NEMOCLAW_SANDBOX_READY_TIMEOUT` (post-create sandbox readiness wait). Update onboarding timeout text and example exports; document that readiness timeouts delete orphaned sandboxes and that Ollama preserves partial downloads. |
| Deployment guidance: first-run readiness budget (`docs/deployment/deploy-to-remote-gpu.md`) | Add “First-Run Readiness Budget” section explaining why initial `nemoclaw onboard` can exceed defaults (cold build/upload, large models, remote VM), show example `export NEMOCLAW_SANDBOX_READY_TIMEOUT=600`, and note failed readiness deletes the partially-created sandbox before retry. |
| Local inference doc & troubleshooting link (`docs/inference/use-local-inference.md`) | Clarify `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` applies only to the inference-server probe; introduce `NEMOCLAW_SANDBOX_READY_TIMEOUT` (default 180s) as separate readiness budget and add combined export example for slow builds; update troubleshooting link text/anchor. |
| Troubleshooting: sandbox readiness timeout (`docs/reference/troubleshooting.md`) | New subsection for `did not become ready within Ns` explaining the 180s default readiness budget, common causes, how to extend via `NEMOCLAW_SANDBOX_READY_TIMEOUT`, orphaned sandbox deletion on expiry, and inspection commands (`openshell sandbox list`, `nemoclaw <name> status`). |
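
The inspection-and-retry flow summarised in the troubleshooting row can be sketched roughly as follows. Both commands are the ones named above; the sandbox name `my-sandbox` is hypothetical, and the `command -v` guard and `|| true` fallbacks are assumptions added so the snippet stays runnable on a host without either CLI installed.

```shell
# Hypothetical sandbox name; substitute the name used during onboarding.
SANDBOX_NAME="my-sandbox"

if command -v openshell >/dev/null 2>&1 && command -v nemoclaw >/dev/null 2>&1; then
  # List sandboxes known to the gateway. After a readiness timeout the
  # partially-created sandbox should be absent, because onboard deletes it.
  openshell sandbox list || true
  # Query the named sandbox; a failure here is expected if it was deleted.
  nemoclaw "$SANDBOX_NAME" status || true
else
  echo "openshell or nemoclaw not found on PATH; skipping inspection"
fi

# Retry with a larger readiness budget (600 is the deployment-guide example value).
export NEMOCLAW_SANDBOX_READY_TIMEOUT=600
```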

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • cv
  • ericksoa

🐰
First-run waits, I gently say,
Two timeouts keep the stalls at bay,
Raise the budget when builds run slow,
Partials kept so retries go,
Clean sandboxes hop — and off we go!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately and specifically describes the main change: adding documentation for NEMOCLAW_SANDBOX_READY_TIMEOUT, which is the primary objective across all modified documentation files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/reference/commands.md (1)

1170-1170: 💤 Low value

Consider splitting dense table cell content for clarity.

The Purpose column contains three sentences packed together. While table formatting makes strict one-sentence-per-line difficult, consider whether this cell could be more scannable.

As per coding guidelines, "One sentence per line in source (makes diffs readable)."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reference/commands.md` at line 1170, Split the dense Purpose cell for
`NEMOCLAW_SANDBOX_READY_TIMEOUT` into multiple lines/sentences in the markdown
source so each sentence lives on its own line: break the current three-sentence
paragraph into three separate lines (e.g., one line describing what the flag
controls, one line with examples/when to raise it, and one line describing the
behavior when the deadline expires), updating the table cell content for
`NEMOCLAW_SANDBOX_READY_TIMEOUT` accordingly so diffs are readable and follow
the "one sentence per line" guideline.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/deployment/deploy-to-remote-gpu.md`:
- Line 146: Replace the passive clause "the partially-created sandbox is deleted
first" with an active construction that names the actor; rewrite the sentence so
that onboard performs the action (e.g., "onboard deletes the partially-created
sandbox first, so the next attempt with the raised budget starts from a clean
state"), updating the sentence in the docs string that currently reads "If
onboard ends with `Sandbox '<name>' was created but did not become ready within
180s`, the partially-created sandbox is deleted first, so the next attempt with
the raised budget starts from a clean state." to use active voice and reference
"onboard" as the actor.

In `@docs/reference/commands.md`:
- Line 1178: The line containing "If a timeout fires, onboarding emits the
elapsed budget plus a hint to raise the relevant variable. The Ollama pull
preserves its partial download for the next attempt; the readiness wait deletes
the orphaned sandbox first so the next `nemoclaw onboard` starts clean." should
be split so each sentence is on its own line: place the first sentence on one
line and the remaining sentence(s) each on their own lines to follow the
one-sentence-per-line guideline and improve diff readability.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f89b8574-7c48-4cff-8fda-7359840892ea

📥 Commits

Reviewing files that changed from the base of the PR and between d12cce5 and 226a2c8.

📒 Files selected for processing (4)
  • docs/deployment/deploy-to-remote-gpu.md
  • docs/inference/use-local-inference.md
  • docs/reference/commands.md
  • docs/reference/troubleshooting.md

Comment thread docs/deployment/deploy-to-remote-gpu.md Outdated
Comment thread docs/reference/commands.md Outdated
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (4)
docs/deployment/deploy-to-remote-gpu.md (2)

133-133: ⚡ Quick win

Use active voice.

The phrase "is sized for" is passive. As per coding guidelines, "Active voice required. Flag passive constructions."

✏️ Suggested revision
-The post-create readiness wait defaults to 180 seconds (`NEMOCLAW_SANDBOX_READY_TIMEOUT`), which is sized for warm-cache, workstation-class onboarding and can be exceeded on:
+The post-create readiness wait defaults to 180 seconds (`NEMOCLAW_SANDBOX_READY_TIMEOUT`), which targets warm-cache, workstation-class onboarding and can be exceeded on:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/deployment/deploy-to-remote-gpu.md` at line 133, Rewrite the passive
sentence that references NEMOCLAW_SANDBOX_READY_TIMEOUT into active voice;
specifically change "which is sized for warm-cache, workstation-class
onboarding" to an active construction like "which we set for warm-cache,
workstation-class onboarding" or "which targets warm-cache, workstation-class
onboarding" so the line reads e.g. "The post-create readiness wait defaults to
180 seconds (NEMOCLAW_SANDBOX_READY_TIMEOUT), which we set for warm-cache,
workstation-class onboarding and can be exceeded on:"; update the sentence
containing NEMOCLAW_SANDBOX_READY_TIMEOUT to use one of these active
alternatives.

132-132: ⚡ Quick win

Replace colon with period and rewrite in active voice.

This sentence uses a colon to connect two clauses rather than introduce a list, and contains passive constructions ("is built," "uploaded"). As per coding guidelines, "Colons should only introduce a list. Flag colons used as general punctuation between clauses" and "Active voice required. Flag passive constructions."

✏️ Suggested revision
-On a remote GPU host, the first `nemoclaw onboard` typically does the slowest work of the lifecycle: the sandbox image is built locally and uploaded into the OpenShell gateway, which can stream hundreds of MiB over the VM's link before the readiness wait even starts.
+On a remote GPU host, the first `nemoclaw onboard` typically does the slowest work of the lifecycle.
+The sandbox image builds locally and uploads into the OpenShell gateway, streaming hundreds of MiB over the VM's link before the readiness wait even starts.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/deployment/deploy-to-remote-gpu.md` at line 132, Replace the colon with
a period and rewrite the sentence in active voice: locate the sentence
containing `nemoclaw onboard` and `OpenShell gateway` and change the passive
phrases "is built" and "uploaded" to active verbs (e.g., "builds the sandbox
image locally and uploads it to the OpenShell gateway"), and split the clauses
with a period so it reads as two clear, active sentences describing that
`nemoclaw onboard` performs the slowest work by building and uploading the
sandbox image, which can stream hundreds of MiB before readiness waits begin.
docs/reference/commands.md (2)

1169-1169: ⚡ Quick win

Split sentences and avoid weak intensifier.

This table cell contains two sentences on the same line. Additionally, "very large" is a weak intensifier. As per coding guidelines, "One sentence per line in source (makes diffs readable). Flag paragraphs where multiple sentences appear on the same line."

✏️ Suggested revision
-| `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` | `180` | Wall-clock timeout for the inference-server validation probe during onboard, in seconds. Raise on slow networks or for very large prompts. |
+| `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` | `180` | Wall-clock timeout for the inference-server validation probe during onboard, in seconds. Raise on slow networks or for large prompts. |

Note: For table cells, consider keeping the description concise to fit the table format, or break the longer explanation into a separate paragraph below the table if detailed guidance is needed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reference/commands.md` at line 1169, The table cell for
NEMOCLAW_LOCAL_INFERENCE_TIMEOUT currently contains two sentences and uses the
weak intensifier "very large"; change it to a single concise sentence and
replace "very large" with a more specific term (e.g., "extremely long prompts"
or "very long prompts"), or move the additional guidance to a separate sentence
below the table; update the cell describing NEMOCLAW_LOCAL_INFERENCE_TIMEOUT to
be one sentence only and, if needed, add a short paragraph after the table with
the expanded advice.

1170-1170: ⚡ Quick win

Split sentences to follow one-sentence-per-line formatting.

This table cell contains three sentences on the same line. As per coding guidelines, "One sentence per line in source (makes diffs readable). Flag paragraphs where multiple sentences appear on the same line."

✏️ Suggested revision
-| `NEMOCLAW_SANDBOX_READY_TIMEOUT` | `180` | Wall-clock timeout for the post-create readiness wait, in seconds. Raise when the sandbox image build, gateway upload, or in-sandbox boot exceeds the default (typical on 70B+ models, first-time gateway uploads over slow links, or DGX Station / remote-VM first runs). When the deadline expires onboarding deletes the orphaned sandbox and prints the retry hint. |
+| `NEMOCLAW_SANDBOX_READY_TIMEOUT` | `180` | Wall-clock timeout for the post-create readiness wait, in seconds. Raise when the sandbox image build, gateway upload, or in-sandbox boot exceeds the default (typical on 70B+ models, first-time gateway uploads over slow links, or DGX Station / remote-VM first runs). When the deadline expires, onboarding deletes the orphaned sandbox and prints the retry hint. |

Note: For table cells, consider keeping the description concise to fit the table format, or break the longer explanation into a separate paragraph below the table if multi-sentence guidance is needed.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reference/commands.md` at line 1170, The table cell for
`NEMOCLAW_SANDBOX_READY_TIMEOUT` contains multiple sentences on one line; edit
the table cell so each sentence is on its own source line
(one-sentence-per-line), e.g., split the current description into separate lines
for the short definition, the examples/when to raise, and the note about
deletion/retry hint; if the explanatory text is too long for a table cell, move
the longer guidance into a separate paragraph below the table and keep the cell
concise.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4d2a51e4-4e25-4e85-a621-28467d9074d6

📥 Commits

Reviewing files that changed from the base of the PR and between 226a2c8 and f206e09.

📒 Files selected for processing (2)
  • docs/deployment/deploy-to-remote-gpu.md
  • docs/reference/commands.md

@laitingsheng laitingsheng added the v0.0.41 Release target label May 13, 2026
@cv cv added v0.0.42 Release target and removed v0.0.41 Release target labels May 14, 2026
@cv cv added v0.0.43 Release target and removed v0.0.42 Release target labels May 14, 2026
@cv cv added v0.0.44 Release target and removed v0.0.43 Release target labels May 15, 2026
@cv cv merged commit 17f9a9e into main May 15, 2026
29 checks passed

Labels

documentation Improvements or additions to documentation v0.0.44 Release target

4 participants