fix(connect): add readiness poll with timeout before connecting (Fixes #430)#466
fix(connect): add readiness poll with timeout before connecting (Fixes #430)#466cv merged 5 commits intoNVIDIA:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReplaces inline sandbox-list parsing with reusable parsers, updates readiness detection, adds polling with timeout and terminal-state handling to Changes
Sequence Diagram(s)sequenceDiagram
participant User as CLI (nemoclaw)
participant Nemoclaw as bin/nemoclaw.js
participant Onboard as bin/lib/onboard.js
participant Openshell as openshell CLI
User->>Nemoclaw: sandboxConnect(sandboxName)
Nemoclaw->>Openshell: "openshell sandbox list" (runCapture, ignoreError)
Openshell-->>Nemoclaw: table/text output
Nemoclaw->>Onboard: parseSandboxStatus(output, sandboxName)
Onboard-->>Nemoclaw: status (e.g., Ready, Pending, Failed, ...)
alt status == Ready
Nemoclaw->>User: print "Ready"
User->>Openshell: start port-forward & "openshell sandbox connect"
Openshell-->>User: interactive shell
else status in (Failed, Error, CrashLoopBackOff, ImagePullBackOff, Unknown, Evicted)
Nemoclaw->>User: print error + suggested "nemoclaw <name> logs --follow" and "nemoclaw <name> status"
Nemoclaw-->>User: exit(1)
else not found / timeout
Nemoclaw->>User: print not-found/timeout message (hint NEMOCLAW_CONNECT_TIMEOUT)
Nemoclaw-->>User: exit(1)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
bin/nemoclaw.js (1)
262-264: Consider adding validation feedback for invalid timeout values.The
|| 120fallback silently converts invalid values (e.g.,NEMOCLAW_CONNECT_TIMEOUT=abc) to 120. This is safe but could confuse users who expect their setting to take effect.♻️ Optional: Add warning for invalid timeout values
function sandboxConnect(sandboxName) { const { isSandboxReady, parseSandboxStatus } = require("./lib/onboard"); - const timeout = parseInt(process.env.NEMOCLAW_CONNECT_TIMEOUT, 10) || 120; + const rawTimeout = process.env.NEMOCLAW_CONNECT_TIMEOUT; + let timeout = 120; + if (rawTimeout !== undefined) { + const parsed = parseInt(rawTimeout, 10); + if (Number.isNaN(parsed) || parsed < 0) { + console.warn(` Warning: Invalid NEMOCLAW_CONNECT_TIMEOUT="${rawTimeout}", using default 120s`); + } else { + timeout = parsed || 120; // treat 0 as "use default" + } + } const interval = 3;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/nemoclaw.js` around lines 262 - 264, The current timeout assignment silently falls back to 120 when NEMOCLAW_CONNECT_TIMEOUT is invalid; change the logic around the parseInt call that assigns timeout so you first read raw = process.env.NEMOCLAW_CONNECT_TIMEOUT, attempt to parse with parseInt, and if raw is defined but Number.isNaN(parsed) emit a clear warning (e.g., console.warn) mentioning the invalid value and that 120s will be used, otherwise use the parsed value; keep the fallback to 120 for undefined values and reference the timeout constant and the NEMOCLAW_CONNECT_TIMEOUT env var in your message so users see what was wrong.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@bin/nemoclaw.js`:
- Around line 262-264: The current timeout assignment silently falls back to 120
when NEMOCLAW_CONNECT_TIMEOUT is invalid; change the logic around the parseInt
call that assigns timeout so you first read raw =
process.env.NEMOCLAW_CONNECT_TIMEOUT, attempt to parse with parseInt, and if raw
is defined but Number.isNaN(parsed) emit a clear warning (e.g., console.warn)
mentioning the invalid value and that 120s will be used, otherwise use the
parsed value; keep the fallback to 120 for undefined values and reference the
timeout constant and the NEMOCLAW_CONNECT_TIMEOUT env var in your message so
users see what was wrong.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 76b8a6ba-1a7f-4f66-9c3b-f10fe6224423
📒 Files selected for processing (3)
bin/lib/onboard.jsbin/nemoclaw.jstest/onboard-readiness.test.js
954e578 to
a645867
Compare
|
Addressed the CodeRabbit nitpick. Rebased on main, resolved conflicts . |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
bin/lib/onboard.js (1)
71-85: LGTM! Clean implementation with proper input validation.The function correctly handles ANSI stripping, exact name matching in the first column, and safely returns
nullfor missing/invalid input. The logic is consistent withisSandboxReadyabove.💡 Optional: Extract shared parsing logic to reduce duplication
Both
isSandboxReadyandparseSandboxStatusshare identical ANSI-stripping and column-parsing logic. If you anticipate adding more status-parsing utilities, consider extracting the common parsing into a helper:+function parseSandboxRow(output, sandboxName) { + if (!output || typeof output !== "string") return null; + const clean = output.replace(/\x1b\[[0-9;]*m/g, ""); + for (const line of clean.split("\n")) { + const cols = line.trim().split(/\s+/); + if (cols[0] === sandboxName) return cols; + } + return null; +} + function isSandboxReady(output, sandboxName) { - const clean = output.replace(/\x1b\[[0-9;]*m/g, ""); - return clean.split("\n").some((l) => { - const cols = l.trim().split(/\s+/); - return cols[0] === sandboxName && cols.includes("Ready") && !cols.includes("NotReady"); - }); + const cols = parseSandboxRow(output, sandboxName); + return cols && cols.includes("Ready") && !cols.includes("NotReady"); } function parseSandboxStatus(output, sandboxName) { - if (!output || typeof output !== "string") return null; - const clean = output.replace(/\x1b\[[0-9;]*m/g, ""); - for (const line of clean.split("\n")) { - const cols = line.trim().split(/\s+/); - if (cols[0] === sandboxName && cols.length >= 2) { - return cols[1]; - } - } - return null; + const cols = parseSandboxRow(output, sandboxName); + return cols && cols.length >= 2 ? cols[1] : null; }This also adds null-safety to
isSandboxReadywhich currently lacks input validation.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 71 - 85, Extract the common ANSI-stripping and column-splitting logic used in parseSandboxStatus and isSandboxReady into a small helper (e.g., cleanAndSplitLines or parseTableLines) and then call that helper from both parseSandboxStatus and isSandboxReady; ensure the helper performs null/typeof checks and returns a safe array of lines/columns so isSandboxReady gains the same input validation as parseSandboxStatus and duplication is removed.bin/nemoclaw.js (1)
297-339: Solid polling implementation with clear user feedback.The readiness gate logic handles the core requirements well: terminal state detection, progress display with elapsed time, and configurable timeout.
One minor consideration: If the sandbox doesn't exist at all (not in the list),
parseSandboxStatusreturnsnulland the status shows as "unknown" until timeout. This is acceptable since the sandbox might be in the process of being created, but you could optionally add a check after the first poll to detect "sandbox not found" early.💡 Optional: Early exit if sandbox never appears
let elapsed = 0; let ready = false; + let everSeen = status !== null; while (elapsed < timeout) { spawnSync("sleep", [String(interval)]); elapsed += interval; const poll = runCapture("openshell sandbox list 2>&1", { ignoreError: true }); if (isSandboxReady(poll, sandboxName)) { ready = true; break; } const cur = parseSandboxStatus(poll, sandboxName) || "unknown"; + if (cur !== "unknown") everSeen = true; if (TERMINAL.has(cur)) { console.error(""); console.error(` Sandbox '${sandboxName}' entered '${cur}' state.`); console.error(` Run: nemoclaw ${sandboxName} logs --follow`); process.exit(1); } + // If sandbox never appeared after reasonable time, exit early + if (!everSeen && elapsed >= 30) { + console.error(""); + console.error(` Sandbox '${sandboxName}' not found after ${elapsed}s.`); + console.error(` Check: openshell sandbox list`); + process.exit(1); + } process.stdout.write(`\r Status: ${cur.padEnd(20)} (${elapsed}s elapsed)`); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/nemoclaw.js` around lines 297 - 339, Add an early "sandbox not found" detection: after the initial runCapture("openshell sandbox list...") and after the first poll() in the loop, check if parseSandboxStatus(list, sandboxName) (or parseSandboxStatus(poll, sandboxName)) returns null/undefined meaning the sandbox is not present; if so, print a clear error message (e.g., "Sandbox '<name>' not found. Check creation or name.") and exit non‑zero. Update the blocks around isSandboxReady, parseSandboxStatus, list and poll to perform this check and return early instead of waiting the full timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/nemoclaw.js`:
- Around line 300-306: Update the TERMINAL set used to detect sandbox pod
failures to include the Kubernetes pod phases "Unknown" and the eviction reason
"Evicted": modify the constant TERMINAL (currently defined as new
Set(["Failed","Error","CrashLoopBackOff","ImagePullBackOff"])) to also contain
"Unknown" and "Evicted" so the branch that checks if (status &&
TERMINAL.has(status)) (which logs using sandboxName and exits via
process.exit(1)) will catch those cases as well.
---
Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 71-85: Extract the common ANSI-stripping and column-splitting
logic used in parseSandboxStatus and isSandboxReady into a small helper (e.g.,
cleanAndSplitLines or parseTableLines) and then call that helper from both
parseSandboxStatus and isSandboxReady; ensure the helper performs null/typeof
checks and returns a safe array of lines/columns so isSandboxReady gains the
same input validation as parseSandboxStatus and duplication is removed.
In `@bin/nemoclaw.js`:
- Around line 297-339: Add an early "sandbox not found" detection: after the
initial runCapture("openshell sandbox list...") and after the first poll() in
the loop, check if parseSandboxStatus(list, sandboxName) (or
parseSandboxStatus(poll, sandboxName)) returns null/undefined meaning the
sandbox is not present; if so, print a clear error message (e.g., "Sandbox
'<name>' not found. Check creation or name.") and exit non‑zero. Update the
blocks around isSandboxReady, parseSandboxStatus, list and poll to perform this
check and return early instead of waiting the full timeout.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1c8b36ca-68e5-4f54-9792-f2c5d72e54fe
📒 Files selected for processing (3)
bin/lib/onboard.jsbin/nemoclaw.jstest/onboard-readiness.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
- test/onboard-readiness.test.js
a645867 to
8dce304
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
bin/lib/onboard.js (1)
74-76: Use the status column directly for readiness.
cols.includes("Ready")makes readiness depend on every token in the row. Checking the status column keepsisSandboxReady()aligned withparseSandboxStatus()and avoids false positives from future extra columns.♻️ Suggested change
function isSandboxReady(output, sandboxName) { const cols = parseSandboxRow(output, sandboxName); - return cols !== null && cols.includes("Ready") && !cols.includes("NotReady"); + return cols !== null && cols[1] === "Ready"; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/lib/onboard.js` around lines 74 - 76, isSandboxReady currently checks readiness by searching the entire row for "Ready"/"NotReady", which can give false positives; change it to use the status value returned by parseSandboxStatus(output, sandboxName) (or otherwise extract the explicit status column from parseSandboxRow) and return true only when that status === "Ready" (and not "NotReady"); update isSandboxReady to call parseSandboxStatus/output -> status and base the boolean on that status.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/nemoclaw.js`:
- Around line 299-306: The terminal-state handling is inconsistent: the
immediate terminal branch (using parseSandboxStatus, TERMINAL and sandboxName)
prints both "nemoclaw <sandboxName> logs --follow" and "nemoclaw <sandboxName>
status" but the in-loop terminal branch omits the status command; update the
in-loop terminal branch to mirror the immediate branch by adding the same
console.error line that suggests running "nemoclaw <sandboxName> status" so both
branches output identical guidance for terminal statuses.
- Around line 313-315: The loop that waits for connection oversleeps because it
always calls spawnSync("sleep", [String(interval)]) regardless of remaining
time; change the logic in the wait/retry loop (variables elapsed, timeout,
interval and the while loop) to compute remaining = timeout - elapsed and sleep
for Math.max(0, Math.min(interval, remaining)) seconds (skip sleeping if
remaining <= 0) so the total wait never exceeds NEMOCLAW_CONNECT_TIMEOUT; keep
using spawnSync("sleep", [String(sleepSec)]) and update elapsed by the actual
sleepSec used.
- Line 297: The two calls to runCapture("openshell sandbox list", { ignoreError:
true }) must pass a timeout so the child process cannot hang; modify the calls
in bin/nemoclaw.js (the runCapture invocations used by connect) to include a
timeout option derived from NEMOCLAW_CONNECT_TIMEOUT (converted to milliseconds)
e.g. merge { ignoreError: true } with { timeout: <NEMOCLAW_CONNECT_TIMEOUT in
ms> } before calling runCapture; ensure you reuse the same timeout value used by
the outer polling loop so the subprocess timeout aligns with the overall connect
timeout.
---
Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 74-76: isSandboxReady currently checks readiness by searching the
entire row for "Ready"/"NotReady", which can give false positives; change it to
use the status value returned by parseSandboxStatus(output, sandboxName) (or
otherwise extract the explicit status column from parseSandboxRow) and return
true only when that status === "Ready" (and not "NotReady"); update
isSandboxReady to call parseSandboxStatus/output -> status and base the boolean
on that status.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0ec63e5c-a749-41d2-a442-7ae0c7e6e219
📒 Files selected for processing (3)
bin/lib/onboard.jsbin/nemoclaw.jstest/onboard-readiness.test.js
8dce304 to
286fe06
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (3)
bin/nemoclaw.js (3)
297-297:⚠️ Potential issue | 🟠 MajorBound each
openshell sandbox listsubprocess.
NEMOCLAW_CONNECT_TIMEOUTonly limits the outer loop. A wedgedrunCapture()here can still block forever beforeelapsedis checked again, which putsconnectback into the indefinite-hang state this PR is trying to eliminate.🩹 Suggested fix
const interval = 3; + const listOpts = { ignoreError: true, timeout: timeout * 1000 }; - const list = runCapture("openshell sandbox list 2>&1", { ignoreError: true }); + const list = runCapture("openshell sandbox list 2>&1", listOpts); if (!isSandboxReady(list, sandboxName)) { @@ - const poll = runCapture("openshell sandbox list 2>&1", { ignoreError: true }); + const poll = runCapture("openshell sandbox list 2>&1", listOpts);Verify that
bin/lib/runner.jsalready passes atimeoutoption through to the child process, and that these polling call sites are the remaining ones omitting it. Expected result:runCaptureshows timeout-capable plumbing, while the twosandbox listcalls in this function do not currently use it.#!/bin/bash set -euo pipefail echo "== runCapture timeout support ==" rg -n -C3 'function runCapture|const runCapture|timeout|execSync' bin/lib/runner.js echo echo "== sandbox list polling call sites ==" rg -n -C2 'runCapture\("openshell sandbox list 2>&1"' bin/nemoclaw.js bin/lib/onboard.jsAlso applies to: 316-316
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/nemoclaw.js` at line 297, The polling call to runCapture("openshell sandbox list 2>&1", { ignoreError: true }) is not bounded by the NEMOCLAW_CONNECT_TIMEOUT and can hang; update both sandbox-list runCapture invocations in this file to pass a timeout option (either the NEMOCLAW_CONNECT_TIMEOUT value or the remaining elapsed time calculated in the connect loop) so the child process is killed if it exceeds the limit, and confirm bin/lib/runner.js already wires a timeout option through to the child process before making the change to runCapture.
323-327:⚠️ Potential issue | 🟡 MinorKeep terminal-state guidance consistent inside the loop.
The immediate terminal-state branch prints both
logs --followandstatus, but the in-loop branch drops thestatuscommand. The failure UX should be identical regardless of when the sandbox breaks.🩹 Suggested fix
if (TERMINAL.has(cur)) { console.error(""); console.error(` Sandbox '${sandboxName}' entered '${cur}' state.`); console.error(` Run: nemoclaw ${sandboxName} logs --follow`); + console.error(` Run: nemoclaw ${sandboxName} status`); process.exit(1); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/nemoclaw.js` around lines 323 - 327, The terminal-state handling is inconsistent: inside the loop where TERMINAL.has(cur) is true (variables cur and sandboxName), the messages only suggest "nemoclaw <sandboxName> logs --follow" but the earlier immediate branch also suggests "nemoclaw <sandboxName> status"; update the in-loop branch that checks TERMINAL.has(cur) to print the same two guidance lines (both "logs --follow" and "status") before calling process.exit(1) so the failure UX is identical regardless of when the sandbox enters a terminal state.
313-315:⚠️ Potential issue | 🟡 MinorDon't sleep past the configured timeout.
NEMOCLAW_CONNECT_TIMEOUT=1still waits 3 seconds here, and larger non-multiples can overshoot by a full poll interval.⏱️ Suggested fix
let elapsed = 0; let ready = false; let everSeen = status !== null; while (elapsed < timeout) { - spawnSync("sleep", [String(interval)]); - elapsed += interval; + const sleepFor = Math.min(interval, timeout - elapsed); + if (sleepFor > 0) { + spawnSync("sleep", [String(sleepFor)]); + elapsed += sleepFor; + } const poll = runCapture("openshell sandbox list 2>&1", { ignoreError: true });🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bin/nemoclaw.js` around lines 313 - 315, The loop that waits using spawnSync("sleep", [String(interval)]) can overshoot the configured timeout because it always sleeps the full interval; change the sleep duration inside the while loop (in the block using elapsed, timeout, interval and spawnSync) to compute remaining = timeout - elapsed, break if remaining <= 0, and call spawnSync with String(Math.min(interval, remaining)); then add the actual slept duration (min(interval, remaining)) to elapsed instead of always adding interval.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/lib/onboard.js`:
- Around line 74-76: isSandboxReady currently scans all tokens from
parseSandboxRow output which can incorrectly detect "Ready" in later columns;
update isSandboxReady to use the parsed status column (as defined by
parseSandboxStatus / cols[1]) instead of searching the entire row, i.e., call
parseSandboxStatus or read cols[1] from parseSandboxRow and return true only
when that parsed status equals "Ready" (and not "NotReady") so sandboxConnect's
readiness check follows the new parsed status semantics.
---
Duplicate comments:
In `@bin/nemoclaw.js`:
- Line 297: The polling call to runCapture("openshell sandbox list 2>&1", {
ignoreError: true }) is not bounded by the NEMOCLAW_CONNECT_TIMEOUT and can
hang; update both sandbox-list runCapture invocations in this file to pass a
timeout option (either the NEMOCLAW_CONNECT_TIMEOUT value or the remaining
elapsed time calculated in the connect loop) so the child process is killed if
it exceeds the limit, and confirm bin/lib/runner.js already wires a timeout
option through to the child process before making the change to runCapture.
- Around line 323-327: The terminal-state handling is inconsistent: inside the
loop where TERMINAL.has(cur) is true (variables cur and sandboxName), the
messages only suggest "nemoclaw <sandboxName> logs --follow" but the earlier
immediate branch also suggests "nemoclaw <sandboxName> status"; update the
in-loop branch that checks TERMINAL.has(cur) to print the same two guidance
lines (both "logs --follow" and "status") before calling process.exit(1) so the
failure UX is identical regardless of when the sandbox enters a terminal state.
- Around line 313-315: The loop that waits using spawnSync("sleep",
[String(interval)]) can overshoot the configured timeout because it always
sleeps the full interval; change the sleep duration inside the while loop (in
the block using elapsed, timeout, interval and spawnSync) to compute remaining =
timeout - elapsed, break if remaining <= 0, and call spawnSync with
String(Math.min(interval, remaining)); then add the actual slept duration
(min(interval, remaining)) to elapsed instead of always adding interval.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c15045bb-4fce-4372-acf0-e66f5f52eef5
📒 Files selected for processing (3)
bin/lib/onboard.jsbin/nemoclaw.jstest/onboard-readiness.test.js
|
Thanks for working on this. The issue is real, and the added readiness/status UX is the right direction, but I don’t think this is ready to merge yet. The main blocker is that the new polling path can still hang indefinitely if
If you update those pieces, I’d be happy to re-review. The feature is worthwhile; it just needs to fully close the hang path before we merge it. |
286fe06 to
377044b
Compare
|
Addressed all four points from @kjw3 review. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bin/nemoclaw.js`:
- Around line 312-355: Compute a real wall-clock deadline (e.g., const deadline
= Date.now() + timeout * 1000) and use the remaining time to drive both the
subprocess timeout and sleep intervals instead of the elapsed counter; before
each runCapture("openshell sandbox list 2>&1", listOpts) (including the initial
call) update listOpts.timeout to Math.max(0, deadline - Date.now()) in
milliseconds, and when computing sleepFor derive it from the remaining seconds
(e.g., remainingSec = Math.ceil((deadline - Date.now())/1000); sleepFor =
Math.min(interval, remainingSec)); also check remaining time and error/exit if
deadline passed so the while loop using elapsed/timeout is replaced by checking
Date.now() against deadline in the loop that polls
isSandboxReady/parseSandboxStatus.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 87d573a4-6751-42cf-9ddb-8abf1d2f5d52
📒 Files selected for processing (3)
bin/lib/onboard.jsbin/nemoclaw.jstest/onboard-readiness.test.js
377044b to
8f7d2b6
Compare
|
Switched to a wall-clock deadline (Date.now()) so subprocess time counts toward NEMOCLAW_CONNECT_TIMEOUT — rebased on latest main, all readiness tests pass. |
89a490b to
5d79bef
Compare
…ream PR NVIDIA#466) Adapted from deepujain/fix/430-connect-timeout (Fixes NVIDIA#430). sandboxConnect now polls `openshell sandbox list` for Ready status before connecting. Previously, connecting while a sandbox was still pulling (2.4GB image) would hang silently with no output. - bin/lib/onboard.js: extract parseSandboxRow/parseSandboxStatus from isSandboxReady; export parseSandboxStatus - bin/nemoclaw.js: add pre-connect readiness poll (default 120s timeout, configurable via NEMOCLAW_CONNECT_TIMEOUT env var); shows live status with elapsed time; exits immediately on terminal states (Failed, CrashLoopBackOff, ImagePullBackOff, etc.) - test/onboard-readiness.test.js: add 11 parseSandboxStatus unit tests Adaptation notes: readiness poll runs before ensureLiveSandboxOrExit (not instead of it), keeping the gateway state check intact. Source: NVIDIA#466
ac4c836 to
ca76826
Compare
ca76826 to
8586271
Compare
|
Rebased this onto current main and finished the TS-side connect/onboard port. and pass locally on the rebased branch. |
…NVIDIA#430) Polls openshell sandbox list with status display and configurable timeout (NEMOCLAW_CONNECT_TIMEOUT, default 120s) so nemoclaw connect no longer hangs when the sandbox is starting or has crashed. Each runCapture call is bounded by the same timeout so a wedged subprocess cannot block forever. Fixes NVIDIA#430 Signed-off-by: Deepak Jain <deepujain@gmail.com>
Signed-off-by: dejain <deepujain@gmail.com>
e34a16a to
e38d80a
Compare
|
Rebased on current main and fixed the PR title and DCO body checks. The readiness wait is now replayed on top of the current TS connect path, npm run build:cli passes, and the parseSandboxStatus readiness tests pass locally. |
Fixes NVIDIA#430 Signed-off-by: Deepak Jain <deepujain@gmail.com>
|
Pushed one more follow-up for the TS-era connect path. connect was still bailing on a non-Ready sandbox phase before the readiness poll could run; that is fixed now. |
|
Thanks for this — we've reviewed the approach and it addresses a real gap. The current This is in the review queue. We'll follow up with any feedback. |
prekshivyas
left a comment
There was a problem hiding this comment.
Reviewed end-to-end. The PR cleanly addresses #430 and has converged nicely through the CodeRabbit iterations — all the structural concerns (subprocess timeout bounded by remaining wall-clock, deadline-based loop, terminal-state UX parity between immediate and in-loop branches, sandbox-not-found early exit at 30s) are resolved. No UX, security, onboarding, or code-quality regressions. Approving.
A few non-blocking nits for @deepujain to consider — happy to ship as-is and address in a follow-up if preferred:
-
Non-TTY output:
process.stdout.write('\r Status: …')produces concatenated lines when stdout isn't a TTY (captured logs, CI). Gating the progress update onprocess.stdout.isTTYwould keep piped output clean. Minor — the primary use case is interactive. -
Lost diagnostic detail on stuck phases: the old
ensureLiveSandboxOrExitincluded useful context ("usually happens when a process crash inside the sandbox prevented clean startup"). The new TERMINAL-state branch prints a simpler message. Consider preserving the richer text specifically forCrashLoopBackOff/Error. -
Named constant for
30: theelapsed >= 30sandbox-not-found threshold would benefit from a named constant + one-line comment. IfNEMOCLAW_CONNECT_TIMEOUTis set below 30s, this check is never reached (harmless, just worth documenting). -
Stylistic:
let ready = false+ post-loop check could be a labelled break, but the current form is perfectly readable. Ignore if you prefer it.
None are blockers.
## Summary Fixes the lingering `sandbox-stuck-recovery.test.ts > connect times out with guidance when sandbox is stuck in Provisioning` failure on `wsl-e2e` that has been red on `main` since #466 merged. My earlier #2198 fixed the sibling tests but this one kept hitting the CI test-budget ceiling. Fixes #2204. ## Root cause Not a mock-script bug — the mocks for `sandbox list` already return the phase string that `parseSandboxStatus` expects, and the tests pass cleanly on macOS and the Linux `checks` job. The failure is WSL-specific timing: - `NEMOCLAW_EXEC_TIMEOUT=30000` caps each outer `spawnSync` to the CLI at 30 seconds. - WSL process-spawn overhead is high: each mocked `openshell` invocation is a fresh Node.js child process, and the CLI makes 4+ of these (`sandbox get`, `policy get --full`, `inference get`, first `sandbox list`) before the readiness poll loop from #466 starts. - The 3 sibling tests in `sandbox-stuck-recovery.test.ts` that skip the poll loop were passing at **26–28s** on WSL — already within a few seconds of the ceiling. - The failing test is the only one that sets `NEMOCLAW_CONNECT_TIMEOUT=3`, so it actually enters the poll loop (pre-poll + 3s sleep + final poll + "Timed out" print). That pushes total runtime past 30s on WSL runners, so the outer `spawnSync` kills the CLI before the "Timed out" guidance prints. The `Status: unknown` artifact mentioned in #2204 is the CLI's display fallback `parseSandboxStatus(poll, ...) || "unknown"` kicking in after the child was killed mid-poll — not a missing subcommand in the mocks. ## Changes - `.github/workflows/wsl-e2e.yaml`: bump `NEMOCLAW_EXEC_TIMEOUT` from `30000` to `60000`, aligned with the existing `NEMOCLAW_TEST_TIMEOUT=60000`. Short comment explains why WSL needs the larger ceiling. ## Why this is the right fix - **Zero impact on fast runners.** Tests that finish in a few seconds on macOS/Linux return early; they don't pay for the larger ceiling. - **No product-code change.** Only the WSL CI outer-spawn budget moves. - **Targeted to the actual bottleneck.** The CLI isn't hanging — it's slow on WSL spawn. Bumping the budget matches the structural overhead. ## Considered and rejected - **Patching the mocks** (the #2204 hypothesis). No new subcommand is invoked; the mocks are correct. - **Reducing pre-poll overhead in product code** (e.g. skipping `policy get --full` on connect). Out of scope, risks regressing real behavior. - **Dropping `NEMOCLAW_CONNECT_TIMEOUT` from the test.** The test asserts timeout behavior — it must enter the poll. - **Bumping `macos-e2e`.** Not needed; those jobs pass comfortably. ## Type of Change - [x] Code change (bug fix — CI config only) ## Verification - [x] Local \`npx vitest run test/sandbox-stuck-recovery.test.ts test/sandbox-connect-inference.test.ts --project cli\` — 7/7 pass in 7.6s - [x] \`npx prek run --files .github/workflows/wsl-e2e.yaml\` — clean - [ ] **Real validation: this PR's \`wsl-e2e\` check must go green.** ## AI Disclosure - [x] AI-assisted — tool: Claude Code 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Chores** * Improved Windows Subsystem for Linux (WSL) test reliability by adjusting execution timeout settings to account for system overhead. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Summary Bumps the published doc version to `0.0.22` and documents the user-visible CLI behavior changes to `nemoclaw <name> connect` that landed since v0.0.21. Drafted via the `nemoclaw-contributor-update-docs` skill against commits in `v0.0.21..origin/main`, filtered through `docs/.docs-skip`. ## Changes - **`docs/project.json`** and **`docs/versions1.json`**: bump the published version from `0.0.20` to `0.0.22`; insert a `0.0.21` entry into the version list so the history stays contiguous. - **`docs/reference/commands.md`** → `nemoclaw <name> connect`: document two new behaviors. - Readiness poll with `NEMOCLAW_CONNECT_TIMEOUT` (integer seconds; default `120`) that replaces the silent hang when the sandbox is not yet `Ready` — right after onboarding, while the 2.4 GB image is still pulling (#466). - Post-connect hint is now agent-aware, names the correct TUI command for the sandbox's agent, and tells you to use `/exit` to leave the chat before `exit` returns you to the host shell (#2080). Feature PRs that shipped their own docs in the same commit are intentionally not re-documented here: - `channels list/add/remove` (#2139) — command reference and the "`openclaw channels` blocked inside the sandbox" troubleshooting entry landed with the feature. - `nemoclaw gc` (#2176) — documented as part of the destroy/rebuild image cleanup PR. Skipped per `docs/.docs-skip`: - `e6bad533 fix(shields): verify config lock and fail hard on re-lock failure (#2066)` — matched `skip-features: src/lib/shields.ts`. Other commits in the range (#2141 OpenShell version bump, #1819 plugin banner live inference probe, #2085 / #2146 Slack Socket Mode fixes, #2110 axios proxy fix, #1818 NIM curl timeouts, #1824 onboard gateway bootstrap recovery, and assorted CI / test / install plumbing) are internal behavior refinements with no doc-relevant surface change. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes for the modified files via the pre-commit hook, including `Regenerate agent skills from docs` (source ↔ generated parity confirmed) - [ ] `npm test` passes — skipped; the one pre-existing `test/cli.test.ts > unknown command exits 1` failure on `origin/main` is unrelated to these markdown/JSON-only changes - [ ] Tests added or updated for new or changed behavior — n/a, doc-only - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) — not run locally - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) — n/a, no new pages ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * `connect` now displays the sandbox phase while waiting for readiness and honors a configurable timeout via NEMOCLAW_CONNECT_TIMEOUT (default 120s). * TTY hints are agent-aware and instruct using `/exit` before returning to the host shell. * **Documentation** * Command docs updated to describe polling, timeout, and TTY guidance. * Project/docs metadata updated for versions 0.0.21 and 0.0.22 (package version bumped to 0.0.22). <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eady (#2222) ## Summary - PR #466 refactored `isSandboxReady` to delegate to `parseSandboxStatus`, which hardcodes the status to column 2 (`cols[1]`). The previous implementation used `cols.includes("Ready")` to check all columns. - When `openshell sandbox list` outputs the status in a column other than the second, the readyCheck never returns true, causing `openshell sandbox create` to hang indefinitely after image upload — the sandbox becomes Ready but NemoClaw never detects it. - This broke every nightly E2E test. Bisected to `49ec8955` (#466) with confirmed good run on the prior commit. - Restores the column-agnostic `cols.includes("Ready") && !cols.includes("NotReady")` check while keeping the `parseSandboxRow` helper. ## Bisect results | Commit | Result | |--------|--------| | `bd82e1f0` (pre-#466) | Pass | | `ff18fb8f` (midpoint) | Pass | | `49ec8955` (#466) | Fail — sandbox hangs | | `37ca70e6` (pre-#466, docs) | Pass | ## Test plan - [x] `vitest run test/onboard-readiness.test.ts` — 34 passed - [ ] Nightly E2E on this branch 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved sandbox readiness detection to be more resilient across different versions of the sandbox management tool, ensuring reliable status parsing regardless of column layout variations. <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
nemoclaw <name> connecthangs indefinitely with no output when the sandbox is not yet ready (starting, pulling image, crashed). This is especially problematic right after onboarding, when the README tells users to immediately runnemoclaw my-assistant connectwhile the 2.4 GB sandbox image may still be pulling.This change adds a readiness poll with status display and configurable timeout so users see what's happening and get clear errors on failure.
Fixes #430
Changes
parseSandboxStatus(output, sandboxName)— pure helper that extracts the status column for a named sandbox fromopenshell sandbox listoutput (strips ANSI, exact-matches name in first column). Exported for unit tests.sandboxConnect): Before connecting, pollsopenshell sandbox listevery 3s. Shows live status with elapsed time (Status: ContainerCreating (12s elapsed)). Exits immediately with actionable error on terminal states (Failed, Error, CrashLoopBackOff, ImagePullBackOff). Times out after 120s (configurable viaNEMOCLAW_CONNECT_TIMEOUTenv var). If sandbox is already Ready, connects immediately with no delay.parseSandboxStatuscovering: all status values, null/empty input, ANSI stripping, exact name matching, multi-sandbox output.Testing
npm run build:clinpx vitest run test/onboard-readiness.test.ts -t "parseSandboxStatus|sandbox readiness parsing"Signed-off-by: Deepak Jain deepujain@gmail.com