docs(onboard): document NEMOCLAW_SANDBOX_READY_TIMEOUT #3435
laitingsheng wants to merge 1 commit into
Conversation
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
📝 Walkthrough
This PR updates documentation across four files to clarify that onboarding timeouts are split into two independent controls: inference-server probe timeout and post-create sandbox readiness timeout. The changes define the variables, provide configuration examples, and add troubleshooting guidance for scenarios where sandbox initialization exceeds the default 180-second readiness window.
Changes
Sandbox Readiness Timeout Configuration
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 1
🧹 Nitpick comments (7)
docs/deployment/deploy-to-remote-gpu.md (3)
132-133: ⚡ Quick win
Split into separate lines for readability.
This sentence spans the colon with a full independent clause after it.
Place the explanation starting at "the sandbox image" on a new line per the style guide's one-sentence-per-line rule.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/deployment/deploy-to-remote-gpu.md` around lines 132 - 133, Split the long sentence that begins "On a remote GPU host, the first `nemoclaw onboard`..." into two sentences/lines: leave the initial clause ("On a remote GPU host, the first `nemoclaw onboard` typically does the slowest work of the lifecycle:") on its own line, and move the explanation starting with "the sandbox image is built locally and uploaded into the OpenShell gateway..." to a new line as a separate sentence; update the paragraph so the post-create readiness wait details ("The post-create readiness wait defaults...") remain clearly separate and each sentence follows the one-sentence-per-line style.
146-146: ⚡ Quick win
Split compound sentence and prefer active voice.
This line contains two independent clauses and uses passive voice.
Split after "first." and rewrite "is deleted" in active voice: "NemoClaw deletes the partially-created sandbox first."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/deployment/deploy-to-remote-gpu.md` at line 146, Split the compound sentence into two sentences and change the passive clause to active voice: replace "the partially-created sandbox is deleted first, so the next attempt..." with "NemoClaw deletes the partially-created sandbox first. The next attempt with the raised budget starts from a clean state." Update the sentence containing "Sandbox '<name>' was created but did not become ready within 180s" accordingly.
133-134: ⚡ Quick win
Prefer active voice.
Several passive constructions appear in this paragraph:
- "the sandbox image is built" → "NemoClaw builds the sandbox image"
- "which is sized for" → "which fits"
- "can be exceeded on" → rewrite as "exceeds the default on"
As per coding guidelines, active voice is required in documentation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/deployment/deploy-to-remote-gpu.md` around lines 133 - 134, Rewrite the paragraph to use active voice: replace passive phrases such as "the sandbox image is built" with "NemoClaw builds the sandbox image", change "which is sized for" to "which fits", and change "can be exceeded on" to "exceeds the default on"; keep the variable name NEMOCLAW_SANDBOX_READY_TIMEOUT and the 180-second default, and ensure the sentence reads smoothly with these active-voice substitutions.
docs/reference/troubleshooting.md (2)
620-620: ⚡ Quick win
Prefer active voice.
"can be exceeded" is passive.
Rewrite as: "The 180-second default fits typical workstations but is insufficient when:" or restructure to show what causes the timeout.
As per coding guidelines, active voice is required.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/reference/troubleshooting.md` at line 620, Change the passive phrase "can be exceeded" to active voice in the sentence shown in the diff: replace "The 180-second default fits typical workstations but can be exceeded when:" with an active construction such as "The 180-second default fits typical workstations but is insufficient when:" (or another active phrasing that clearly states conditions causing timeout) in docs/reference/troubleshooting.md so the sentence follows the active-voice guideline.
634-634: ⚡ Quick win
Split compound sentence.
This line contains two independent clauses joined by ", so".
Split after "hint." to follow the one-sentence-per-line formatting rule.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/reference/troubleshooting.md` at line 634, Split the compound sentence that joins two independent clauses with ", so": change "When the deadline expires, NemoClaw deletes the partially-created sandbox before printing the retry hint, so the next `nemoclaw onboard` starts from a clean state." into two sentences by ending the first clause after "hint." and starting a new sentence "The next `nemoclaw onboard` starts from a clean state." to follow the one-sentence-per-line rule.
docs/reference/commands.md (2)
1167-1167: ⚡ Quick win
Split into separate lines.
This line contains two sentences.
Place "The Ollama pull preserves..." on a new line per the one-sentence-per-line formatting rule.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/reference/commands.md` at line 1167, The sentence "The Ollama pull preserves its partial download for the next attempt; the readiness wait deletes the orphaned sandbox first so the next `nemoclaw onboard` starts clean." should be split into two separate lines (one sentence per line) in the docs; locate that sentence in the paragraph containing "If a timeout fires, onboarding emits the elapsed budget plus a hint to raise the relevant variable." and break the compound sentence into: "The Ollama pull preserves its partial download for the next attempt." and on the next line "The readiness wait deletes the orphaned sandbox first so the next `nemoclaw onboard` starts clean."
1158-1158: 💤 Low value
Remove unnecessary intensifier.
"very large prompts" uses an overused intensifier.
Simplify to "large prompts" or be specific: "prompts exceeding typical token counts."
LLM pattern detected.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/reference/commands.md` at line 1158, The documentation entry for NEMOCLAW_LOCAL_INFERENCE_TIMEOUT uses an unnecessary intensifier ("very large prompts"); update the description for the `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` table row to remove "very" and use either "large prompts" or a more specific phrase such as "prompts exceeding typical token counts" so the entry reads e.g. "Wall-clock timeout for the inference-server validation probe during onboard, in seconds. Raise on slow networks or for large prompts (or specify token threshold)."
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/reference/commands.md`:
- Line 1159: The sentence in the docs entry for the
`NEMOCLAW_SANDBOX_READY_TIMEOUT` environment variable is missing a comma after
the introductory clause; edit the text in the `NEMOCLAW_SANDBOX_READY_TIMEOUT`
line so it reads "When the deadline expires, onboarding deletes the orphaned
sandbox..." (insert comma after "expires") to correctly separate the clause.
---
Nitpick comments:
In `@docs/deployment/deploy-to-remote-gpu.md`:
- Around line 132-133: Split the long sentence that begins "On a remote GPU
host, the first `nemoclaw onboard`..." into two sentences/lines: leave the
initial clause ("On a remote GPU host, the first `nemoclaw onboard` typically
does the slowest work of the lifecycle:") on its own line, and move the
explanation starting with "the sandbox image is built locally and uploaded into
the OpenShell gateway..." to a new line as a separate sentence; update the
paragraph so the post-create readiness wait details ("The post-create readiness
wait defaults...") remain clearly separate and each sentence follows the
one-sentence-per-line style.
- Line 146: Split the compound sentence into two sentences and change the
passive clause to active voice: replace "the partially-created sandbox is
deleted first, so the next attempt..." with "NemoClaw deletes the
partially-created sandbox first. The next attempt with the raised budget starts
from a clean state." Update the sentence containing "Sandbox '<name>' was
created but did not become ready within 180s" accordingly.
- Around line 133-134: Rewrite the paragraph to use active voice: replace
passive phrases such as "the sandbox image is built" with "NemoClaw builds the
sandbox image", change "which is sized for" to "which fits", and change "can be
exceeded on" to "exceeds the default on"; keep the variable name
NEMOCLAW_SANDBOX_READY_TIMEOUT and the 180-second default, and ensure the
sentence reads smoothly with these active-voice substitutions.
In `@docs/reference/commands.md`:
- Line 1167: The sentence "The Ollama pull preserves its partial download for
the next attempt; the readiness wait deletes the orphaned sandbox first so the
next `nemoclaw onboard` starts clean." should be split into two separate lines
(one sentence per line) in the docs; locate that sentence in the paragraph
containing "If a timeout fires, onboarding emits the elapsed budget plus a hint
to raise the relevant variable." and break the compound sentence into: "The
Ollama pull preserves its partial download for the next attempt." and on the
next line "The readiness wait deletes the orphaned sandbox first so the next
`nemoclaw onboard` starts clean."
- Line 1158: The documentation entry for NEMOCLAW_LOCAL_INFERENCE_TIMEOUT uses
an unnecessary intensifier ("very large prompts"); update the description for
the `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` table row to remove "very" and use either
"large prompts" or a more specific phrase such as "prompts exceeding typical
token counts" so the entry reads e.g. "Wall-clock timeout for the
inference-server validation probe during onboard, in seconds. Raise on slow
networks or for large prompts (or specify token threshold)."
In `@docs/reference/troubleshooting.md`:
- Line 620: Change the passive phrase "can be exceeded" to active voice in the
sentence shown in the diff: replace "The 180-second default fits typical
workstations but can be exceeded when:" with an active construction such as "The
180-second default fits typical workstations but is insufficient when:" (or
another active phrasing that clearly states conditions causing timeout) in
docs/reference/troubleshooting.md so the sentence follows the active-voice
guideline.
- Line 634: Split the compound sentence that joins two independent clauses with
", so": change "When the deadline expires, NemoClaw deletes the
partially-created sandbox before printing the retry hint, so the next `nemoclaw
onboard` starts from a clean state." into two sentences by ending the first
clause after "hint." and starting a new sentence "The next `nemoclaw onboard`
starts from a clean state." to follow the one-sentence-per-line rule.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 6dbb0844-efe9-4e84-97d5-7947fa7f7db9
📒 Files selected for processing (4)
- docs/deployment/deploy-to-remote-gpu.md
- docs/inference/use-local-inference.md
- docs/reference/commands.md
- docs/reference/troubleshooting.md
|----------|---------|---------|
| `NEMOCLAW_OLLAMA_PULL_TIMEOUT` | `1800` (30 minutes) | Wall-clock timeout for `ollama pull` during onboard, in seconds. Accepts integer or float values. Already-downloaded layers are kept; re-running the pull resumes them. |
| `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` | `180` | Wall-clock timeout for the inference-server validation probe during onboard, in seconds. Raise on slow networks or for very large prompts. |
| `NEMOCLAW_SANDBOX_READY_TIMEOUT` | `180` | Wall-clock timeout for the post-create readiness wait, in seconds. Raise when the sandbox image build, gateway upload, or in-sandbox boot exceeds the default (typical on 70B+ models, first-time gateway uploads over slow links, or DGX Station / remote-VM first runs). When the deadline expires onboarding deletes the orphaned sandbox and prints the retry hint. |
Add comma after introductory clause.
"When the deadline expires onboarding deletes" is missing a comma.
Insert comma after "expires" to separate the introductory clause: "When the deadline expires, onboarding deletes..."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/reference/commands.md` at line 1159, The sentence in the docs entry for
the `NEMOCLAW_SANDBOX_READY_TIMEOUT` environment variable is missing a comma
after the introductory clause; edit the text in the
`NEMOCLAW_SANDBOX_READY_TIMEOUT` line so it reads "When the deadline expires,
onboarding deletes the orphaned sandbox..." (insert comma after "expires") to
correctly separate the clause.
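For readers skimming the table quoted in this thread, the following is a minimal sketch of how an operator might raise the three budgets before retrying. The variable names and defaults come from the table; the specific values (600, 300, 3600) are illustrative assumptions, not recommendations from the PR, and the `command -v` guard is only there so the snippet is safe to paste on a host without the CLI.

```shell
# Illustrative values only: the numbers below are assumptions, not advice from the PR.
export NEMOCLAW_SANDBOX_READY_TIMEOUT=600     # post-create readiness wait (default 180)
export NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300   # inference-server validation probe (default 180)
export NEMOCLAW_OLLAMA_PULL_TIMEOUT=3600      # wall clock for `ollama pull` (default 1800)

# Retry onboarding only when the CLI is actually on PATH.
if command -v nemoclaw >/dev/null 2>&1; then
  nemoclaw onboard
else
  echo "nemoclaw not found on PATH; variables exported, onboarding skipped" >&2
fi
```

Each variable governs a different phase, so raising one does not extend the others.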
## Summary

`NEMOCLAW_SANDBOX_READY_TIMEOUT` has been a recognised env var since #2849, but no documentation accompanied it: `docs/reference/commands.md`, `docs/reference/troubleshooting.md`, and the inference / deployment guides only mention the companion `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` (added in #1620 and documented at that time).
Operators hitting `Sandbox '<name>' was created but did not become ready within 180s` have no doc-grep path to the workaround, and the two timeouts are easy to conflate.
This closes the documentation gap left by #2849.

Originally tried under #3435; closed because that PR mis-framed the docs as resolving #3344 / #3416 (the root cause of both was the GPU policy bug fixed in #3436, not a timeout misconfiguration).
The docs themselves still have value as a follow-up to the env-var introductions, so reopening as a new PR with the correct framing.

## Related Issue

<!-- Not closing any issue; this addresses the doc-gap surfaced while investigating #3344 and #3416 (both already fixed in code by #3436). -->

## Changes

- `docs/reference/commands.md`: add `NEMOCLAW_SANDBOX_READY_TIMEOUT` and `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` to the Onboard Timeouts table.
- `docs/reference/troubleshooting.md`: new troubleshooting entry "Sandbox onboard times out with 'did not become ready within Ns'" that distinguishes the readiness wait from the inference-probe budget, with a worked example.
- `docs/inference/use-local-inference.md`: cross-link the two timeouts from the existing `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` section so readers of either knob land on the other.
- `docs/deployment/deploy-to-remote-gpu.md`: new "First-Run Readiness Budget" section calling out DGX Station / cloud-VM / large-quantised-model conditions that exceed the default and showing how to raise it.

No code changes; the readiness behaviour is unchanged.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [x] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification

- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [ ] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Documentation**
  * Added a “First-Run Readiness Budget” note for remote GPU hosts explaining longer initial sandbox build/upload times and advice to increase NEMOCLAW_SANDBOX_READY_TIMEOUT.
  * Clarified that NEMOCLAW_LOCAL_INFERENCE_TIMEOUT applies to inference-server validation while sandbox readiness uses NEMOCLAW_SANDBOX_READY_TIMEOUT (default 180s).
  * Expanded examples for exporting both timeouts and onboarding timeout messaging.
  * Added troubleshooting guidance and inspection steps when sandbox readiness timeouts delete partial sandboxes.

<!-- review_stack_entry_start -->
[](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3440)
<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
`NEMOCLAW_SANDBOX_READY_TIMEOUT` (the post-create readiness wait, default 180s) was undocumented anywhere in `docs/`, so users hitting `Sandbox '<name>' was created but did not become ready within 180s` on heavy first runs had no doc-grep path to the workaround. This closes the gap and pairs the variable with `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` (the inference-probe budget); the two are easy to conflate.
Related Issue
Resolves #3344
Resolves #3416
Changes
- `docs/reference/commands.md`: add `NEMOCLAW_SANDBOX_READY_TIMEOUT` and `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` to the Onboard Timeouts table.
- `docs/reference/troubleshooting.md`: new entry "Sandbox onboard times out with 'did not become ready within Ns'" distinguishing the readiness wait from the inference-probe budget, with a worked example.
- `docs/inference/use-local-inference.md`: cross-link the two timeouts from the existing `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` section so readers of either knob land on the other.
- `docs/deployment/deploy-to-remote-gpu.md`: new "First-Run Readiness Budget" section calling out DGX Station / cloud-VM / large-quantised-model conditions that exceed the default and showing how to raise it.

No code changes; the readiness behaviour is unchanged.
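To make the readiness budget concrete, here is a rough sketch of a deadline-bounded wait of the kind the summary describes. It is not NemoClaw's implementation: the function name `wait_until_ready`, the `POLL_INTERVAL` variable, and the hint wording are all hypothetical, and only the `NEMOCLAW_SANDBOX_READY_TIMEOUT` name and its 180-second default come from the PR. The sketch assumes an integer number of seconds.

```shell
# Hypothetical sketch; none of the helper names below exist in NemoClaw.
# Usage: wait_until_ready CHECK_CMD [ARGS...]
# Polls CHECK_CMD until it succeeds or the readiness budget runs out.
wait_until_ready() {
  timeout_s="${NEMOCLAW_SANDBOX_READY_TIMEOUT:-180}"  # default from the docs table
  interval_s="${POLL_INTERVAL:-1}"                     # assumed polling interval
  deadline=$(( $(date +%s) + timeout_s ))
  while ! "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      # Per the PR text, the real tool deletes the partial sandbox before hinting.
      echo "not ready within ${timeout_s}s; raise NEMOCLAW_SANDBOX_READY_TIMEOUT and retry" >&2
      return 1
    fi
    sleep "$interval_s"
  done
  return 0
}
```

The budget is read on every call, so a retry with a larger exported value gets the longer deadline without any other change. The inference-probe budget would be a separate wait with its own variable.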
Type of Change
Verification
- `npx prek run --all-files` passes
- `npm test` passes
- `make docs` builds without warnings (doc changes only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Summary by CodeRabbit
Release Notes