docs(onboard): document NEMOCLAW_SANDBOX_READY_TIMEOUT #3435

Closed
laitingsheng wants to merge 1 commit into main from docs/sandbox-ready-timeout

@laitingsheng laitingsheng commented May 13, 2026

Summary

NEMOCLAW_SANDBOX_READY_TIMEOUT (the post-create readiness wait, default 180s) was not documented anywhere in docs/, so users who hit "Sandbox '<name>' was created but did not become ready within 180s" on heavy first runs had no doc-grep path to the workaround. This PR closes the gap and pairs the variable with NEMOCLAW_LOCAL_INFERENCE_TIMEOUT (the inference-probe budget), because the two are easy to conflate.

Related Issue

Resolves #3344
Resolves #3416

Changes

  • docs/reference/commands.md: add NEMOCLAW_SANDBOX_READY_TIMEOUT and NEMOCLAW_LOCAL_INFERENCE_TIMEOUT to the Onboard Timeouts table.
  • docs/reference/troubleshooting.md: new entry "Sandbox onboard times out with 'did not become ready within Ns'" distinguishing the readiness wait from the inference-probe budget, with a worked example.
  • docs/inference/use-local-inference.md: cross-link the two timeouts from the existing NEMOCLAW_LOCAL_INFERENCE_TIMEOUT section so readers of either knob land on the other.
  • docs/deployment/deploy-to-remote-gpu.md: new "First-Run Readiness Budget" section calling out DGX Station / cloud-VM / large-quantised-model conditions that exceed the default and showing how to raise it.
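
Raising the budget, as the new deployment section describes, comes down to exporting the variable before re-running onboarding. The following is a minimal sketch rather than the exact example from the docs; the values 600 and 300 are illustrative assumptions, not figures from this PR.

```shell
# Illustrative values only (assumptions); the documented default for both is 180 seconds.
export NEMOCLAW_SANDBOX_READY_TIMEOUT=600    # post-create readiness wait, in seconds
export NEMOCLAW_LOCAL_INFERENCE_TIMEOUT=300  # inference-server validation probe, in seconds

# With both budgets exported, re-run onboarding in the same shell:
#   nemoclaw onboard
```

The two variables are independent, so raising one leaves the other at its default.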

No code changes — the readiness behaviour is unchanged.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary by CodeRabbit

Release Notes

  • Documentation
    • Updated onboarding documentation with clearer guidance on configurable timeout settings for remote GPU deployment scenarios.
    • Added troubleshooting section addressing timeout issues that may occur during initial sandbox setup.
    • Expanded deployment, inference, and reference documentation with configuration examples and best practices.

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

copy-pr-bot Bot commented May 13, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai Bot commented May 13, 2026

📝 Walkthrough

This PR updates documentation across four files to clarify that onboarding timeouts are split into two independent controls: inference-server probe timeout and post-create sandbox readiness timeout. The changes define the variables, provide configuration examples, and add troubleshooting guidance for scenarios where sandbox initialization exceeds the default 180-second readiness window.
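
To make the split concrete, the sketch below reads the elapsed budget out of the readiness error quoted in this PR and raises only the readiness variable. It is an illustration under stated assumptions: the sandbox name `demo` is made up, and doubling the budget is an arbitrary policy chosen for the example, not something NemoClaw does or recommends.

```shell
# Example message in the format quoted in this PR; the sandbox name "demo" is made up.
msg="Sandbox 'demo' was created but did not become ready within 180s"

# Pull the elapsed budget (the N in "within Ns") out of the message.
budget=$(printf '%s\n' "$msg" | sed -n 's/.*did not become ready within \([0-9][0-9]*\)s.*/\1/p')

# Doubling is an arbitrary retry policy chosen for illustration, not NemoClaw behavior.
if [ -n "$budget" ]; then
  export NEMOCLAW_SANDBOX_READY_TIMEOUT=$((budget * 2))
fi

echo "$NEMOCLAW_SANDBOX_READY_TIMEOUT"
```

This message concerns only the post-create readiness wait; NEMOCLAW_LOCAL_INFERENCE_TIMEOUT bounds the inference-server validation probe and is left untouched here.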

Changes

Sandbox Readiness Timeout Configuration

| Layer / File(s) | Summary |
|-----------------|---------|
| Timeout environment variable documentation: docs/reference/commands.md | Adds reference documentation for NEMOCLAW_LOCAL_INFERENCE_TIMEOUT and NEMOCLAW_SANDBOX_READY_TIMEOUT with their purposes and defaults. Updates the shell example to export both variables and rewords the timeout behavior description to report elapsed budget and clarify sandbox deletion on readiness timeout. |
| Configuration guidance and troubleshooting: docs/inference/use-local-inference.md, docs/deployment/deploy-to-remote-gpu.md, docs/reference/troubleshooting.md | Clarifies scope of inference-probe timeout, introduces post-create sandbox readiness timeout as separate control, provides configuration examples for slow model builds, and adds new troubleshooting section for "did not become ready within 180s" errors with diagnostic commands. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

A readiness timeout so clear,
Two timeouts now appear!
Inference here, sandbox there—
No more guessing, no more care.
🐰 Hop, configure, and deploy!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'docs(onboard): document NEMOCLAW_SANDBOX_READY_TIMEOUT' directly and specifically describes the main change: documenting the previously undocumented environment variable. |
| Linked Issues check | ✅ Passed | The PR documents the NEMOCLAW_SANDBOX_READY_TIMEOUT environment variable and cross-links it with NEMOCLAW_LOCAL_INFERENCE_TIMEOUT across four docs files, directly addressing the core requirement from both #3344 and #3416: providing users with guidance on the 180s readiness timeout and how to extend it. |
| Out of Scope Changes check | ✅ Passed | All changes are documentation-only updates to four files (commands.md, troubleshooting.md, use-local-inference.md, deploy-to-remote-gpu.md) that exclusively document the readiness timeout behavior and provide examples, with no unrelated modifications. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@laitingsheng laitingsheng marked this pull request as ready for review May 13, 2026 03:15
@laitingsheng laitingsheng added the documentation Improvements or additions to documentation label May 13, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (7)
docs/deployment/deploy-to-remote-gpu.md (3)

132-133: ⚡ Quick win

Split into separate lines for readability.

This sentence spans the colon with a full independent clause after it.
Place the explanation starting at "the sandbox image" on a new line per the style guide's one-sentence-per-line rule.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/deployment/deploy-to-remote-gpu.md` around lines 132 - 133, Split the
long sentence that begins "On a remote GPU host, the first `nemoclaw
onboard`..." into two sentences/lines: leave the initial clause ("On a remote
GPU host, the first `nemoclaw onboard` typically does the slowest work of the
lifecycle:") on its own line, and move the explanation starting with "the
sandbox image is built locally and uploaded into the OpenShell gateway..." to a
new line as a separate sentence; update the paragraph so the post-create
readiness wait details ("The post-create readiness wait defaults...") remain
clearly separate and each sentence follows the one-sentence-per-line style.

146-146: ⚡ Quick win

Split compound sentence and prefer active voice.

This line contains two independent clauses and uses passive voice.
Split after "first." and rewrite "is deleted" in active voice: "NemoClaw deletes the partially-created sandbox first."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/deployment/deploy-to-remote-gpu.md` at line 146, Split the compound
sentence into two sentences and change the passive clause to active voice:
replace "the partially-created sandbox is deleted first, so the next attempt..."
with "NemoClaw deletes the partially-created sandbox first. The next attempt
with the raised budget starts from a clean state." Update the sentence
containing "Sandbox '<name>' was created but did not become ready within 180s"
accordingly.

133-134: ⚡ Quick win

Prefer active voice.

Several passive constructions appear in this paragraph:

  • "the sandbox image is built" → "NemoClaw builds the sandbox image"
  • "which is sized for" → "which fits"
  • "can be exceeded on" → rewrite as "exceeds the default on"

As per coding guidelines, active voice is required in documentation.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/deployment/deploy-to-remote-gpu.md` around lines 133 - 134, Rewrite the
paragraph to use active voice: replace passive phrases such as "the sandbox
image is built" with "NemoClaw builds the sandbox image", change "which is sized
for" to "which fits", and change "can be exceeded on" to "exceeds the default
on"; keep the variable name NEMOCLAW_SANDBOX_READY_TIMEOUT and the 180-second
default, and ensure the sentence reads smoothly with these active-voice
substitutions.
docs/reference/troubleshooting.md (2)

620-620: ⚡ Quick win

Prefer active voice.

"can be exceeded" is passive.
Rewrite as: "The 180-second default fits typical workstations but is insufficient when:" or restructure to show what causes the timeout.
As per coding guidelines, active voice is required.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reference/troubleshooting.md` at line 620, Change the passive phrase
"can be exceeded" to active voice in the sentence shown in the diff: replace
"The 180-second default fits typical workstations but can be exceeded when:"
with an active construction such as "The 180-second default fits typical
workstations but is insufficient when:" (or another active phrasing that clearly
states conditions causing timeout) in docs/reference/troubleshooting.md so the
sentence follows the active-voice guideline.

634-634: ⚡ Quick win

Split compound sentence.

This line contains two independent clauses joined by ", so".
Split after "hint." to follow the one-sentence-per-line formatting rule.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reference/troubleshooting.md` at line 634, Split the compound sentence
that joins two independent clauses with ", so": change "When the deadline
expires, NemoClaw deletes the partially-created sandbox before printing the
retry hint, so the next `nemoclaw onboard` starts from a clean state." into two
sentences by ending the first clause after "hint." and starting a new sentence
"The next `nemoclaw onboard` starts from a clean state." to follow the
one-sentence-per-line rule.
docs/reference/commands.md (2)

1167-1167: ⚡ Quick win

Split into separate lines.

This line contains two sentences.
Place "The Ollama pull preserves..." on a new line per the one-sentence-per-line formatting rule.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reference/commands.md` at line 1167, The sentence "The Ollama pull
preserves its partial download for the next attempt; the readiness wait deletes
the orphaned sandbox first so the next `nemoclaw onboard` starts clean." should
be split into two separate lines (one sentence per line) in the docs; locate
that sentence in the paragraph containing "If a timeout fires, onboarding emits
the elapsed budget plus a hint to raise the relevant variable." and break the
compound sentence into: "The Ollama pull preserves its partial download for the
next attempt." and on the next line "The readiness wait deletes the orphaned
sandbox first so the next `nemoclaw onboard` starts clean."

1158-1158: 💤 Low value

Remove unnecessary intensifier.

"very large prompts" uses an overused intensifier.
Simplify to "large prompts" or be specific: "prompts exceeding typical token counts."
LLM pattern detected.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reference/commands.md` at line 1158, The documentation entry for
NEMOCLAW_LOCAL_INFERENCE_TIMEOUT uses an unnecessary intensifier ("very large
prompts"); update the description for the `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT`
table row to remove "very" and use either "large prompts" or a more specific
phrase such as "prompts exceeding typical token counts" so the entry reads e.g.
"Wall-clock timeout for the inference-server validation probe during onboard, in
seconds. Raise on slow networks or for large prompts (or specify token
threshold)."
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/reference/commands.md`:
- Line 1159: The sentence in the docs entry for the
`NEMOCLAW_SANDBOX_READY_TIMEOUT` environment variable is missing a comma after
the introductory clause; edit the text in the `NEMOCLAW_SANDBOX_READY_TIMEOUT`
line so it reads "When the deadline expires, onboarding deletes the orphaned
sandbox..." (insert comma after "expires") to correctly separate the clause.

---

Nitpick comments:
In `@docs/deployment/deploy-to-remote-gpu.md`:
- Around line 132-133: Split the long sentence that begins "On a remote GPU
host, the first `nemoclaw onboard`..." into two sentences/lines: leave the
initial clause ("On a remote GPU host, the first `nemoclaw onboard` typically
does the slowest work of the lifecycle:") on its own line, and move the
explanation starting with "the sandbox image is built locally and uploaded into
the OpenShell gateway..." to a new line as a separate sentence; update the
paragraph so the post-create readiness wait details ("The post-create readiness
wait defaults...") remain clearly separate and each sentence follows the
one-sentence-per-line style.
- Line 146: Split the compound sentence into two sentences and change the
passive clause to active voice: replace "the partially-created sandbox is
deleted first, so the next attempt..." with "NemoClaw deletes the
partially-created sandbox first. The next attempt with the raised budget starts
from a clean state." Update the sentence containing "Sandbox '<name>' was
created but did not become ready within 180s" accordingly.
- Around line 133-134: Rewrite the paragraph to use active voice: replace
passive phrases such as "the sandbox image is built" with "NemoClaw builds the
sandbox image", change "which is sized for" to "which fits", and change "can be
exceeded on" to "exceeds the default on"; keep the variable name
NEMOCLAW_SANDBOX_READY_TIMEOUT and the 180-second default, and ensure the
sentence reads smoothly with these active-voice substitutions.

In `@docs/reference/commands.md`:
- Line 1167: The sentence "The Ollama pull preserves its partial download for
the next attempt; the readiness wait deletes the orphaned sandbox first so the
next `nemoclaw onboard` starts clean." should be split into two separate lines
(one sentence per line) in the docs; locate that sentence in the paragraph
containing "If a timeout fires, onboarding emits the elapsed budget plus a hint
to raise the relevant variable." and break the compound sentence into: "The
Ollama pull preserves its partial download for the next attempt." and on the
next line "The readiness wait deletes the orphaned sandbox first so the next
`nemoclaw onboard` starts clean."
- Line 1158: The documentation entry for NEMOCLAW_LOCAL_INFERENCE_TIMEOUT uses
an unnecessary intensifier ("very large prompts"); update the description for
the `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` table row to remove "very" and use either
"large prompts" or a more specific phrase such as "prompts exceeding typical
token counts" so the entry reads e.g. "Wall-clock timeout for the
inference-server validation probe during onboard, in seconds. Raise on slow
networks or for large prompts (or specify token threshold)."

In `@docs/reference/troubleshooting.md`:
- Line 620: Change the passive phrase "can be exceeded" to active voice in the
sentence shown in the diff: replace "The 180-second default fits typical
workstations but can be exceeded when:" with an active construction such as "The
180-second default fits typical workstations but is insufficient when:" (or
another active phrasing that clearly states conditions causing timeout) in
docs/reference/troubleshooting.md so the sentence follows the active-voice
guideline.
- Line 634: Split the compound sentence that joins two independent clauses with
", so": change "When the deadline expires, NemoClaw deletes the
partially-created sandbox before printing the retry hint, so the next `nemoclaw
onboard` starts from a clean state." into two sentences by ending the first
clause after "hint." and starting a new sentence "The next `nemoclaw onboard`
starts from a clean state." to follow the one-sentence-per-line rule.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 6dbb0844-efe9-4e84-97d5-7947fa7f7db9

📥 Commits

Reviewing files that changed from the base of the PR and between 17325de and a6d6cc4.

📒 Files selected for processing (4)
  • docs/deployment/deploy-to-remote-gpu.md
  • docs/inference/use-local-inference.md
  • docs/reference/commands.md
  • docs/reference/troubleshooting.md

|----------|---------|---------|
| `NEMOCLAW_OLLAMA_PULL_TIMEOUT` | `1800` (30 minutes) | Wall-clock timeout for `ollama pull` during onboard, in seconds. Accepts integer or float values. Already-downloaded layers are kept; re-running the pull resumes them. |
| `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` | `180` | Wall-clock timeout for the inference-server validation probe during onboard, in seconds. Raise on slow networks or for very large prompts. |
| `NEMOCLAW_SANDBOX_READY_TIMEOUT` | `180` | Wall-clock timeout for the post-create readiness wait, in seconds. Raise when the sandbox image build, gateway upload, or in-sandbox boot exceeds the default (typical on 70B+ models, first-time gateway uploads over slow links, or DGX Station / remote-VM first runs). When the deadline expires onboarding deletes the orphaned sandbox and prints the retry hint. |

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add comma after introductory clause.

"When the deadline expires onboarding deletes" is missing a comma.
Insert comma after "expires" to separate the introductory clause: "When the deadline expires, onboarding deletes..."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/reference/commands.md` at line 1159, The sentence in the docs entry for
the `NEMOCLAW_SANDBOX_READY_TIMEOUT` environment variable is missing a comma
after the introductory clause; edit the text in the
`NEMOCLAW_SANDBOX_READY_TIMEOUT` line so it reads "When the deadline expires,
onboarding deletes the orphaned sandbox..." (insert comma after "expires") to
correctly separate the clause.

@laitingsheng laitingsheng deleted the docs/sandbox-ready-timeout branch May 13, 2026 04:20
cv added a commit that referenced this pull request May 15, 2026
## Summary
`NEMOCLAW_SANDBOX_READY_TIMEOUT` has been a recognised env var since
#2849, but no documentation accompanied it —
`docs/reference/commands.md`, `docs/reference/troubleshooting.md`, and
the inference / deployment guides only mention the companion
`NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` (added in #1620 and documented at
that time). Operators hitting `Sandbox '<name>' was created but did not
become ready within 180s` have no doc-grep path to the workaround, and
the two timeouts are easy to conflate. This closes the documentation gap
left by #2849.

Originally tried under #3435; closed because that PR mis-framed the docs
as resolving #3344 / #3416 (the root cause of both was the GPU policy
bug fixed in #3436, not a timeout misconfiguration). The docs themselves
still have value as a follow-up to the env-var introductions, so
reopening as a new PR with the correct framing.

## Related Issue
<!-- Not closing any issue; this addresses the doc-gap surfaced while
investigating #3344 and #3416 (both already fixed in code by #3436). -->

## Changes
- `docs/reference/commands.md`: add `NEMOCLAW_SANDBOX_READY_TIMEOUT` and
`NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` to the Onboard Timeouts table.
- `docs/reference/troubleshooting.md`: new troubleshooting entry
"Sandbox onboard times out with 'did not become ready within Ns'" that
distinguishes the readiness wait from the inference-probe budget, with a
worked example.
- `docs/inference/use-local-inference.md`: cross-link the two timeouts
from the existing `NEMOCLAW_LOCAL_INFERENCE_TIMEOUT` section so readers
of either knob land on the other.
- `docs/deployment/deploy-to-remote-gpu.md`: new "First-Run Readiness
Budget" section calling out DGX Station / cloud-VM /
large-quantised-model conditions that exceed the default and showing how
to raise it.

No code changes — the readiness behaviour is unchanged.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [x] Doc only (prose changes, no code sample modifications)
- [ ] Doc only (includes code sample changes)

## Verification
- [ ] `npx prek run --all-files` passes
- [ ] `npm test` passes
- [ ] Tests added or updated for new or changed behavior
- [ ] No secrets, API keys, or credentials committed
- [ ] Docs updated for user-facing behavior changes
- [x] `make docs` builds without warnings (doc changes only)
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)

---
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Documentation**
* Added a “First-Run Readiness Budget” note for remote GPU hosts
explaining longer initial sandbox build/upload times and advice to
increase NEMOCLAW_SANDBOX_READY_TIMEOUT.
* Clarified that NEMOCLAW_LOCAL_INFERENCE_TIMEOUT applies to
inference-server validation while sandbox readiness uses
NEMOCLAW_SANDBOX_READY_TIMEOUT (default 180s).
* Expanded examples for exporting both timeouts and onboarding timeout
messaging.
* Added troubleshooting guidance and inspection steps when sandbox
readiness timeouts delete partial sandboxes.

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3440)

<!-- review_stack_entry_end -->
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Co-authored-by: Carlos Villela <cvillela@nvidia.com>