fix(connect): add readiness poll with timeout before connecting (Fixes #430) by deepujain · Pull Request #466 · NVIDIA/NemoClaw

deepujain · 2026-03-20T03:40:49Z

Summary

nemoclaw <name> connect hangs indefinitely with no output when the sandbox is not yet ready (starting, pulling image, crashed). This is especially problematic right after onboarding, when the README tells users to immediately run nemoclaw my-assistant connect while the 2.4 GB sandbox image may still be pulling.

This change adds a readiness poll with status display and configurable timeout so users see what's happening and get clear errors on failure.

Fixes #430

Changes

bin/lib/onboard.js: Added parseSandboxStatus(output, sandboxName) — pure helper that extracts the status column for a named sandbox from openshell sandbox list output (strips ANSI, exact-matches name in first column). Exported for unit tests.
bin/nemoclaw.js (sandboxConnect): Before connecting, polls openshell sandbox list every 3s. Shows live status with elapsed time (Status: ContainerCreating (12s elapsed)). Exits immediately with actionable error on terminal states (Failed, Error, CrashLoopBackOff, ImagePullBackOff). Times out after 120s (configurable via NEMOCLAW_CONNECT_TIMEOUT env var). If sandbox is already Ready, connects immediately with no delay.
test/onboard-readiness.test.js: Added 11 tests for parseSandboxStatus covering: all status values, null/empty input, ANSI stripping, exact name matching, multi-sandbox output.

Testing

npm run build:cli
npx vitest run test/onboard-readiness.test.ts -t "parseSandboxStatus|sandbox readiness parsing"

Signed-off-by: Deepak Jain deepujain@gmail.com

coderabbitai · 2026-03-20T03:41:07Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Replaces inline sandbox-list parsing with reusable parsers, updates readiness detection, adds polling with timeout and terminal-state handling to nemoclaw <name> connect, and adds tests for the new parser and behaviors.

Changes

Cohort / File(s)	Summary
Onboard helper `bin/lib/onboard.js`	Added `parseSandboxRow(output, sandboxName)` that strips ANSI and selects the exact-name row; added and exported `parseSandboxStatus(output, sandboxName)` returning the status column; refactored `isSandboxReady` to use the parser.
Connect flow `bin/nemoclaw.js`	Added readiness polling (3s interval) using `openshell sandbox list`, respects `NEMOCLAW_CONNECT_TIMEOUT` (default 120s) with live status output, treats several statuses as terminal (prints suggested `nemoclaw <name> logs --follow` and `nemoclaw <name> status` and exits 1), and only starts port-forward/connect after the sandbox is Ready.
Tests `test/onboard-readiness.test.js`	Added `parseSandboxStatus` tests covering normal statuses, ANSI-stripping, exact-name matching, multi-line selection, and null/empty input cases.

Sequence Diagram(s)

sequenceDiagram
  participant User as CLI (nemoclaw)
  participant Nemoclaw as bin/nemoclaw.js
  participant Onboard as bin/lib/onboard.js
  participant Openshell as openshell CLI

  User->>Nemoclaw: sandboxConnect(sandboxName)
  Nemoclaw->>Openshell: "openshell sandbox list" (runCapture, ignoreError)
  Openshell-->>Nemoclaw: table/text output
  Nemoclaw->>Onboard: parseSandboxStatus(output, sandboxName)
  Onboard-->>Nemoclaw: status (e.g., Ready, Pending, Failed, ...)
  alt status == Ready
    Nemoclaw->>User: print "Ready"
    User->>Openshell: start port-forward & "openshell sandbox connect"
    Openshell-->>User: interactive shell
  else status in (Failed, Error, CrashLoopBackOff, ImagePullBackOff, Unknown, Evicted)
    Nemoclaw->>User: print error + suggested "nemoclaw <name> logs --follow" and "nemoclaw <name> status"
    Nemoclaw-->>User: exit(1)
  else not found / timeout
    Nemoclaw->>User: print not-found/timeout message (hint NEMOCLAW_CONNECT_TIMEOUT)
    Nemoclaw-->>User: exit(1)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

"I hopped through logs and timeouts so merry,
Waiting for status — not stuck in a ferry.
I parsed every row, stripped colours away,
Now connect waits wisely and shows you the way. 🐇"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	The PR successfully implements all coding requirements from issue `#430`: adds readiness polling, status display, configurable timeout via NEMOCLAW_CONNECT_TIMEOUT, terminal state detection with actionable errors, and a 3s polling interval.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to issue `#430`: the parseSandboxStatus helper and isSandboxReady refactor support the polling implementation, tests validate the new helper, and no unrelated modifications are present.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main change: adding a readiness poll with timeout before connecting to prevent hangs.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

bin/nemoclaw.js (1)

262-264: Consider adding validation feedback for invalid timeout values.

The || 120 fallback silently converts invalid values (e.g., NEMOCLAW_CONNECT_TIMEOUT=abc) to 120. This is safe but could confuse users who expect their setting to take effect.

♻️ Optional: Add warning for invalid timeout values

 function sandboxConnect(sandboxName) {
   const { isSandboxReady, parseSandboxStatus } = require("./lib/onboard");
-  const timeout = parseInt(process.env.NEMOCLAW_CONNECT_TIMEOUT, 10) || 120;
+  const rawTimeout = process.env.NEMOCLAW_CONNECT_TIMEOUT;
+  let timeout = 120;
+  if (rawTimeout !== undefined) {
+    const parsed = parseInt(rawTimeout, 10);
+    if (Number.isNaN(parsed) || parsed < 0) {
+      console.warn(`  Warning: Invalid NEMOCLAW_CONNECT_TIMEOUT="${rawTimeout}", using default 120s`);
+    } else {
+      timeout = parsed || 120; // treat 0 as "use default"
+    }
+  }
   const interval = 3;

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 262 - 264, The current timeout assignment
silently falls back to 120 when NEMOCLAW_CONNECT_TIMEOUT is invalid; change the
logic around the parseInt call that assigns timeout so you first read raw =
process.env.NEMOCLAW_CONNECT_TIMEOUT, attempt to parse with parseInt, and if raw
is defined but Number.isNaN(parsed) emit a clear warning (e.g., console.warn)
mentioning the invalid value and that 120s will be used, otherwise use the
parsed value; keep the fallback to 120 for undefined values and reference the
timeout constant and the NEMOCLAW_CONNECT_TIMEOUT env var in your message so
users see what was wrong.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@bin/nemoclaw.js`:
- Around line 262-264: The current timeout assignment silently falls back to 120
when NEMOCLAW_CONNECT_TIMEOUT is invalid; change the logic around the parseInt
call that assigns timeout so you first read raw =
process.env.NEMOCLAW_CONNECT_TIMEOUT, attempt to parse with parseInt, and if raw
is defined but Number.isNaN(parsed) emit a clear warning (e.g., console.warn)
mentioning the invalid value and that 120s will be used, otherwise use the
parsed value; keep the fallback to 120 for undefined values and reference the
timeout constant and the NEMOCLAW_CONNECT_TIMEOUT env var in your message so
users see what was wrong.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 76b8a6ba-1a7f-4f66-9c3b-f10fe6224423

📥 Commits

Reviewing files that changed from the base of the PR and between dbfd78c and 954e578.

📒 Files selected for processing (3)

bin/lib/onboard.js
bin/nemoclaw.js
test/onboard-readiness.test.js

deepujain · 2026-03-21T23:02:49Z

Addressed the CodeRabbit nitpick. Rebased on main, resolved conflicts .

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

bin/lib/onboard.js (1)

71-85: LGTM! Clean implementation with proper input validation.

The function correctly handles ANSI stripping, exact name matching in the first column, and safely returns null for missing/invalid input. The logic is consistent with isSandboxReady above.

💡 Optional: Extract shared parsing logic to reduce duplication

Both isSandboxReady and parseSandboxStatus share identical ANSI-stripping and column-parsing logic. If you anticipate adding more status-parsing utilities, consider extracting the common parsing into a helper:

+function parseSandboxRow(output, sandboxName) {
+  if (!output || typeof output !== "string") return null;
+  const clean = output.replace(/\x1b\[[0-9;]*m/g, "");
+  for (const line of clean.split("\n")) {
+    const cols = line.trim().split(/\s+/);
+    if (cols[0] === sandboxName) return cols;
+  }
+  return null;
+}
+
 function isSandboxReady(output, sandboxName) {
-  const clean = output.replace(/\x1b\[[0-9;]*m/g, "");
-  return clean.split("\n").some((l) => {
-    const cols = l.trim().split(/\s+/);
-    return cols[0] === sandboxName && cols.includes("Ready") && !cols.includes("NotReady");
-  });
+  const cols = parseSandboxRow(output, sandboxName);
+  return cols && cols.includes("Ready") && !cols.includes("NotReady");
 }

 function parseSandboxStatus(output, sandboxName) {
-  if (!output || typeof output !== "string") return null;
-  const clean = output.replace(/\x1b\[[0-9;]*m/g, "");
-  for (const line of clean.split("\n")) {
-    const cols = line.trim().split(/\s+/);
-    if (cols[0] === sandboxName && cols.length >= 2) {
-      return cols[1];
-    }
-  }
-  return null;
+  const cols = parseSandboxRow(output, sandboxName);
+  return cols && cols.length >= 2 ? cols[1] : null;
 }

This also adds null-safety to isSandboxReady which currently lacks input validation.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 71 - 85, Extract the common ANSI-stripping
and column-splitting logic used in parseSandboxStatus and isSandboxReady into a
small helper (e.g., cleanAndSplitLines or parseTableLines) and then call that
helper from both parseSandboxStatus and isSandboxReady; ensure the helper
performs null/typeof checks and returns a safe array of lines/columns so
isSandboxReady gains the same input validation as parseSandboxStatus and
duplication is removed.

bin/nemoclaw.js (1)

297-339: Solid polling implementation with clear user feedback.

The readiness gate logic handles the core requirements well: terminal state detection, progress display with elapsed time, and configurable timeout.

One minor consideration: If the sandbox doesn't exist at all (not in the list), parseSandboxStatus returns null and the status shows as "unknown" until timeout. This is acceptable since the sandbox might be in the process of being created, but you could optionally add a check after the first poll to detect "sandbox not found" early.

💡 Optional: Early exit if sandbox never appears

     let elapsed = 0;
     let ready = false;
+    let everSeen = status !== null;
     while (elapsed < timeout) {
       spawnSync("sleep", [String(interval)]);
       elapsed += interval;
       const poll = runCapture("openshell sandbox list 2>&1", { ignoreError: true });
       if (isSandboxReady(poll, sandboxName)) {
         ready = true;
         break;
       }
       const cur = parseSandboxStatus(poll, sandboxName) || "unknown";
+      if (cur !== "unknown") everSeen = true;
       if (TERMINAL.has(cur)) {
         console.error("");
         console.error(`  Sandbox '${sandboxName}' entered '${cur}' state.`);
         console.error(`  Run:  nemoclaw ${sandboxName} logs --follow`);
         process.exit(1);
       }
+      // If sandbox never appeared after reasonable time, exit early
+      if (!everSeen && elapsed >= 30) {
+        console.error("");
+        console.error(`  Sandbox '${sandboxName}' not found after ${elapsed}s.`);
+        console.error(`  Check: openshell sandbox list`);
+        process.exit(1);
+      }
       process.stdout.write(`\r    Status: ${cur.padEnd(20)} (${elapsed}s elapsed)`);
     }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 297 - 339, Add an early "sandbox not found"
detection: after the initial runCapture("openshell sandbox list...") and after
the first poll() in the loop, check if parseSandboxStatus(list, sandboxName) (or
parseSandboxStatus(poll, sandboxName)) returns null/undefined meaning the
sandbox is not present; if so, print a clear error message (e.g., "Sandbox
'<name>' not found. Check creation or name.") and exit non‑zero. Update the
blocks around isSandboxReady, parseSandboxStatus, list and poll to perform this
check and return early instead of waiting the full timeout.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 300-306: Update the TERMINAL set used to detect sandbox pod
failures to include the Kubernetes pod phases "Unknown" and the eviction reason
"Evicted": modify the constant TERMINAL (currently defined as new
Set(["Failed","Error","CrashLoopBackOff","ImagePullBackOff"])) to also contain
"Unknown" and "Evicted" so the branch that checks if (status &&
TERMINAL.has(status)) (which logs using sandboxName and exits via
process.exit(1)) will catch those cases as well.

---

Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 71-85: Extract the common ANSI-stripping and column-splitting
logic used in parseSandboxStatus and isSandboxReady into a small helper (e.g.,
cleanAndSplitLines or parseTableLines) and then call that helper from both
parseSandboxStatus and isSandboxReady; ensure the helper performs null/typeof
checks and returns a safe array of lines/columns so isSandboxReady gains the
same input validation as parseSandboxStatus and duplication is removed.

In `@bin/nemoclaw.js`:
- Around line 297-339: Add an early "sandbox not found" detection: after the
initial runCapture("openshell sandbox list...") and after the first poll() in
the loop, check if parseSandboxStatus(list, sandboxName) (or
parseSandboxStatus(poll, sandboxName)) returns null/undefined meaning the
sandbox is not present; if so, print a clear error message (e.g., "Sandbox
'<name>' not found. Check creation or name.") and exit non‑zero. Update the
blocks around isSandboxReady, parseSandboxStatus, list and poll to perform this
check and return early instead of waiting the full timeout.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1c8b36ca-68e5-4f54-9792-f2c5d72e54fe

📥 Commits

Reviewing files that changed from the base of the PR and between 954e578 and a645867.

📒 Files selected for processing (3)

bin/lib/onboard.js
bin/nemoclaw.js
test/onboard-readiness.test.js

🚧 Files skipped from review as they are similar to previous changes (1)

test/onboard-readiness.test.js

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

bin/lib/onboard.js (1)

74-76: Use the status column directly for readiness.

cols.includes("Ready") makes readiness depend on every token in the row. Checking the status column keeps isSandboxReady() aligned with parseSandboxStatus() and avoids false positives from future extra columns.

♻️ Suggested change

 function isSandboxReady(output, sandboxName) {
   const cols = parseSandboxRow(output, sandboxName);
-  return cols !== null && cols.includes("Ready") && !cols.includes("NotReady");
+  return cols !== null && cols[1] === "Ready";
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 74 - 76, isSandboxReady currently checks
readiness by searching the entire row for "Ready"/"NotReady", which can give
false positives; change it to use the status value returned by
parseSandboxStatus(output, sandboxName) (or otherwise extract the explicit
status column from parseSandboxRow) and return true only when that status ===
"Ready" (and not "NotReady"); update isSandboxReady to call
parseSandboxStatus/output -> status and base the boolean on that status.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 299-306: The terminal-state handling is inconsistent: the
immediate terminal branch (using parseSandboxStatus, TERMINAL and sandboxName)
prints both "nemoclaw <sandboxName> logs --follow" and "nemoclaw <sandboxName>
status" but the in-loop terminal branch omits the status command; update the
in-loop terminal branch to mirror the immediate branch by adding the same
console.error line that suggests running "nemoclaw <sandboxName> status" so both
branches output identical guidance for terminal statuses.
- Around line 313-315: The loop that waits for connection oversleeps because it
always calls spawnSync("sleep", [String(interval)]) regardless of remaining
time; change the logic in the wait/retry loop (variables elapsed, timeout,
interval and the while loop) to compute remaining = timeout - elapsed and sleep
for Math.max(0, Math.min(interval, remaining)) seconds (skip sleeping if
remaining <= 0) so the total wait never exceeds NEMOCLAW_CONNECT_TIMEOUT; keep
using spawnSync("sleep", [String(sleepSec)]) and update elapsed by the actual
sleepSec used.
- Line 297: The two calls to runCapture("openshell sandbox list", { ignoreError:
true }) must pass a timeout so the child process cannot hang; modify the calls
in bin/nemoclaw.js (the runCapture invocations used by connect) to include a
timeout option derived from NEMOCLAW_CONNECT_TIMEOUT (converted to milliseconds)
e.g. merge { ignoreError: true } with { timeout: <NEMOCLAW_CONNECT_TIMEOUT in
ms> } before calling runCapture; ensure you reuse the same timeout value used by
the outer polling loop so the subprocess timeout aligns with the overall connect
timeout.

---

Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 74-76: isSandboxReady currently checks readiness by searching the
entire row for "Ready"/"NotReady", which can give false positives; change it to
use the status value returned by parseSandboxStatus(output, sandboxName) (or
otherwise extract the explicit status column from parseSandboxRow) and return
true only when that status === "Ready" (and not "NotReady"); update
isSandboxReady to call parseSandboxStatus/output -> status and base the boolean
on that status.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0ec63e5c-a749-41d2-a442-7ae0c7e6e219

📥 Commits

Reviewing files that changed from the base of the PR and between a645867 and 8dce304.

📒 Files selected for processing (3)

bin/lib/onboard.js
bin/nemoclaw.js
test/onboard-readiness.test.js

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (3)

bin/nemoclaw.js (3)

297-297: ⚠️ Potential issue | 🟠 Major

Bound each openshell sandbox list subprocess.

NEMOCLAW_CONNECT_TIMEOUT only limits the outer loop. A wedged runCapture() here can still block forever before elapsed is checked again, which puts connect back into the indefinite-hang state this PR is trying to eliminate.

🩹 Suggested fix

   const interval = 3;
+  const listOpts = { ignoreError: true, timeout: timeout * 1000 };
 
-  const list = runCapture("openshell sandbox list 2>&1", { ignoreError: true });
+  const list = runCapture("openshell sandbox list 2>&1", listOpts);
   if (!isSandboxReady(list, sandboxName)) {
@@
-      const poll = runCapture("openshell sandbox list 2>&1", { ignoreError: true });
+      const poll = runCapture("openshell sandbox list 2>&1", listOpts);

Verify that bin/lib/runner.js already passes a timeout option through to the child process, and that these polling call sites are the remaining ones omitting it. Expected result: runCapture shows timeout-capable plumbing, while the two sandbox list calls in this function do not currently use it.

#!/bin/bash
set -euo pipefail

echo "== runCapture timeout support =="
rg -n -C3 'function runCapture|const runCapture|timeout|execSync' bin/lib/runner.js

echo
echo "== sandbox list polling call sites =="
rg -n -C2 'runCapture\("openshell sandbox list 2>&1"' bin/nemoclaw.js bin/lib/onboard.js

Also applies to: 316-316

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` at line 297, The polling call to runCapture("openshell
sandbox list 2>&1", { ignoreError: true }) is not bounded by the
NEMOCLAW_CONNECT_TIMEOUT and can hang; update both sandbox-list runCapture
invocations in this file to pass a timeout option (either the
NEMOCLAW_CONNECT_TIMEOUT value or the remaining elapsed time calculated in the
connect loop) so the child process is killed if it exceeds the limit, and
confirm bin/lib/runner.js already wires a timeout option through to the child
process before making the change to runCapture.

323-327: ⚠️ Potential issue | 🟡 Minor

Keep terminal-state guidance consistent inside the loop.

The immediate terminal-state branch prints both logs --follow and status, but the in-loop branch drops the status command. The failure UX should be identical regardless of when the sandbox breaks.

🩹 Suggested fix

       if (TERMINAL.has(cur)) {
         console.error("");
         console.error(`  Sandbox '${sandboxName}' entered '${cur}' state.`);
         console.error(`  Run:  nemoclaw ${sandboxName} logs --follow`);
+        console.error(`  Run:  nemoclaw ${sandboxName} status`);
         process.exit(1);
       }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 323 - 327, The terminal-state handling is
inconsistent: inside the loop where TERMINAL.has(cur) is true (variables cur and
sandboxName), the messages only suggest "nemoclaw <sandboxName> logs --follow"
but the earlier immediate branch also suggests "nemoclaw <sandboxName> status";
update the in-loop branch that checks TERMINAL.has(cur) to print the same two
guidance lines (both "logs --follow" and "status") before calling
process.exit(1) so the failure UX is identical regardless of when the sandbox
enters a terminal state.

313-315: ⚠️ Potential issue | 🟡 Minor

Don't sleep past the configured timeout.

NEMOCLAW_CONNECT_TIMEOUT=1 still waits 3 seconds here, and larger non-multiples can overshoot by a full poll interval.

⏱️ Suggested fix

     let elapsed = 0;
     let ready = false;
     let everSeen = status !== null;
     while (elapsed < timeout) {
-      spawnSync("sleep", [String(interval)]);
-      elapsed += interval;
+      const sleepFor = Math.min(interval, timeout - elapsed);
+      if (sleepFor > 0) {
+        spawnSync("sleep", [String(sleepFor)]);
+        elapsed += sleepFor;
+      }
       const poll = runCapture("openshell sandbox list 2>&1", { ignoreError: true });

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 313 - 315, The loop that waits using
spawnSync("sleep", [String(interval)]) can overshoot the configured timeout
because it always sleeps the full interval; change the sleep duration inside the
while loop (in the block using elapsed, timeout, interval and spawnSync) to
compute remaining = timeout - elapsed, break if remaining <= 0, and call
spawnSync with String(Math.min(interval, remaining)); then add the actual slept
duration (min(interval, remaining)) to elapsed instead of always adding
interval.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 74-76: isSandboxReady currently scans all tokens from
parseSandboxRow output which can incorrectly detect "Ready" in later columns;
update isSandboxReady to use the parsed status column (as defined by
parseSandboxStatus / cols[1]) instead of searching the entire row, i.e., call
parseSandboxStatus or read cols[1] from parseSandboxRow and return true only
when that parsed status equals "Ready" (and not "NotReady") so sandboxConnect's
readiness check follows the new parsed status semantics.

---

Duplicate comments:
In `@bin/nemoclaw.js`:
- Line 297: The polling call to runCapture("openshell sandbox list 2>&1", {
ignoreError: true }) is not bounded by the NEMOCLAW_CONNECT_TIMEOUT and can
hang; update both sandbox-list runCapture invocations in this file to pass a
timeout option (either the NEMOCLAW_CONNECT_TIMEOUT value or the remaining
elapsed time calculated in the connect loop) so the child process is killed if
it exceeds the limit, and confirm bin/lib/runner.js already wires a timeout
option through to the child process before making the change to runCapture.
- Around line 323-327: The terminal-state handling is inconsistent: inside the
loop where TERMINAL.has(cur) is true (variables cur and sandboxName), the
messages only suggest "nemoclaw <sandboxName> logs --follow" but the earlier
immediate branch also suggests "nemoclaw <sandboxName> status"; update the
in-loop branch that checks TERMINAL.has(cur) to print the same two guidance
lines (both "logs --follow" and "status") before calling process.exit(1) so the
failure UX is identical regardless of when the sandbox enters a terminal state.
- Around line 313-315: The loop that waits using spawnSync("sleep",
[String(interval)]) can overshoot the configured timeout because it always
sleeps the full interval; change the sleep duration inside the while loop (in
the block using elapsed, timeout, interval and spawnSync) to compute remaining =
timeout - elapsed, break if remaining <= 0, and call spawnSync with
String(Math.min(interval, remaining)); then add the actual slept duration
(min(interval, remaining)) to elapsed instead of always adding interval.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c15045bb-4fce-4372-acf0-e66f5f52eef5

📥 Commits

Reviewing files that changed from the base of the PR and between 8dce304 and 286fe06.

📒 Files selected for processing (3)

bin/lib/onboard.js
bin/nemoclaw.js
test/onboard-readiness.test.js

kjw3 · 2026-03-23T22:16:56Z

Thanks for working on this. The issue is real, and the added readiness/status UX is the right direction, but I don’t think this is ready to merge yet.

The main blocker is that the new polling path can still hang indefinitely if openshell sandbox list itself wedges. Right now the outer loop has a timeout, but the two runCapture("openshell sandbox list 2>&1", ...) calls are still unbounded.

Please pass a timeout into both runCapture(...) calls so the child process cannot block forever.
The in-loop terminal-state branch should print the same guidance as the immediate terminal branch, including nemoclaw <name> status.
The sleep loop should not overshoot NEMOCLAW_CONNECT_TIMEOUT for non-multiples of 3 seconds.
isSandboxReady() should check the parsed status column directly instead of scanning the whole row for "Ready".

If you update those pieces, I’d be happy to re-review. The feature is worthwhile; it just needs to fully close the hang path before we merge it.

deepujain · 2026-03-23T22:45:38Z

Addressed all four points from @kjw3 review.
Rebased on latest main. All readiness/status tests pass.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 312-355: Compute a real wall-clock deadline (e.g., const deadline
= Date.now() + timeout * 1000) and use the remaining time to drive both the
subprocess timeout and sleep intervals instead of the elapsed counter; before
each runCapture("openshell sandbox list 2>&1", listOpts) (including the initial
call) update listOpts.timeout to Math.max(0, deadline - Date.now()) in
milliseconds, and when computing sleepFor derive it from the remaining seconds
(e.g., remainingSec = Math.ceil((deadline - Date.now())/1000); sleepFor =
Math.min(interval, remainingSec)); also check remaining time and error/exit if
deadline passed so the while loop using elapsed/timeout is replaced by checking
Date.now() against deadline in the loop that polls
isSandboxReady/parseSandboxStatus.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 87d573a4-6751-42cf-9ddb-8abf1d2f5d52

📥 Commits

Reviewing files that changed from the base of the PR and between 286fe06 and 377044b.

📒 Files selected for processing (3)

bin/lib/onboard.js
bin/nemoclaw.js
test/onboard-readiness.test.js

deepujain · 2026-03-23T23:06:26Z

Switched to a wall-clock deadline (Date.now()) so subprocess time counts toward NEMOCLAW_CONNECT_TIMEOUT — rebased on latest main, all readiness tests pass.

…res (NVIDIA#466)

…ream PR NVIDIA#466) Adapted from deepujain/fix/430-connect-timeout (Fixes NVIDIA#430). sandboxConnect now polls `openshell sandbox list` for Ready status before connecting. Previously, connecting while a sandbox was still pulling (2.4GB image) would hang silently with no output. - bin/lib/onboard.js: extract parseSandboxRow/parseSandboxStatus from isSandboxReady; export parseSandboxStatus - bin/nemoclaw.js: add pre-connect readiness poll (default 120s timeout, configurable via NEMOCLAW_CONNECT_TIMEOUT env var); shows live status with elapsed time; exits immediately on terminal states (Failed, CrashLoopBackOff, ImagePullBackOff, etc.) - test/onboard-readiness.test.js: add 11 parseSandboxStatus unit tests Adaptation notes: readiness poll runs before ensureLiveSandboxOrExit (not instead of it), keeping the gateway state check intact. Source: NVIDIA#466

deepujain · 2026-04-14T19:41:08Z

Rebased this onto current main and finished the TS-side connect/onboard port. and pass locally on the rebased branch.

…NVIDIA#430) Polls openshell sandbox list with status display and configurable timeout (NEMOCLAW_CONNECT_TIMEOUT, default 120s) so nemoclaw connect no longer hangs when the sandbox is starting or has crashed. Each runCapture call is bounded by the same timeout so a wedged subprocess cannot block forever. Fixes NVIDIA#430 Signed-off-by: Deepak Jain <deepujain@gmail.com>

Signed-off-by: dejain <deepujain@gmail.com>

deepujain · 2026-04-21T01:07:49Z

Rebased on current main and fixed the PR title and DCO body checks. The readiness wait is now replayed on top of the current TS connect path, npm run build:cli passes, and the parseSandboxStatus readiness tests pass locally.

Fixes NVIDIA#430 Signed-off-by: Deepak Jain <deepujain@gmail.com>

deepujain · 2026-04-21T02:02:25Z

Pushed one more follow-up for the TS-era connect path. connect was still bailing on a non-Ready sandbox phase before the readiness poll could run; that is fixed now. npm run build:cli passes, the parseSandboxStatus readiness tests pass, and a direct Pending -> Ready connect repro now reaches sandbox connect and exits cleanly.

wscurran · 2026-04-21T05:41:18Z

Thanks for this — we've reviewed the approach and it addresses a real gap. The current connect command detects crashed/stuck sandboxes but doesn't poll during image pulls, so users hitting a fresh nemoclaw my-assistant connect right after onboard get a misleading error. The readiness poll with elapsed-time display and configurable timeout here is the right fix for that.

This is in the review queue. We'll follow up with any feedback.

prekshivyas

Reviewed end-to-end. The PR cleanly addresses #430 and has converged nicely through the CodeRabbit iterations — all the structural concerns (subprocess timeout bounded by remaining wall-clock, deadline-based loop, terminal-state UX parity between immediate and in-loop branches, sandbox-not-found early exit at 30s) are resolved. No UX, security, onboarding, or code-quality regressions. Approving.

A few non-blocking nits for @deepujain to consider — happy to ship as-is and address in a follow-up if preferred:

Non-TTY output: process.stdout.write('\r Status: …') produces concatenated lines when stdout isn't a TTY (captured logs, CI). Gating the progress update on process.stdout.isTTY would keep piped output clean. Minor — the primary use case is interactive.
Lost diagnostic detail on stuck phases: the old ensureLiveSandboxOrExit included useful context ("usually happens when a process crash inside the sandbox prevented clean startup"). The new TERMINAL-state branch prints a simpler message. Consider preserving the richer text specifically for CrashLoopBackOff / Error.
Named constant for 30: the elapsed >= 30 sandbox-not-found threshold would benefit from a named constant + one-line comment. If NEMOCLAW_CONNECT_TIMEOUT is set below 30s, this check is never reached (harmless, just worth documenting).
Stylistic: let ready = false + post-loop check could be a labelled break, but the current form is perfectly readable. Ignore if you prefer it.

None are blockers.

…#2198)

## Summary Fixes the lingering `sandbox-stuck-recovery.test.ts > connect times out with guidance when sandbox is stuck in Provisioning` failure on `wsl-e2e` that has been red on `main` since #466 merged. My earlier #2198 fixed the sibling tests but this one kept hitting the CI test-budget ceiling. Fixes #2204. ## Root cause Not a mock-script bug — the mocks for `sandbox list` already return the phase string that `parseSandboxStatus` expects, and the tests pass cleanly on macOS and the Linux `checks` job. The failure is WSL-specific timing: - `NEMOCLAW_EXEC_TIMEOUT=30000` caps each outer `spawnSync` to the CLI at 30 seconds. - WSL process-spawn overhead is high: each mocked `openshell` invocation is a fresh Node.js child process, and the CLI makes 4+ of these (`sandbox get`, `policy get --full`, `inference get`, first `sandbox list`) before the readiness poll loop from #466 starts. - The 3 sibling tests in `sandbox-stuck-recovery.test.ts` that skip the poll loop were passing at **26–28s** on WSL — already within a few seconds of the ceiling. - The failing test is the only one that sets `NEMOCLAW_CONNECT_TIMEOUT=3`, so it actually enters the poll loop (pre-poll + 3s sleep + final poll + "Timed out" print). That pushes total runtime past 30s on WSL runners, so the outer `spawnSync` kills the CLI before the "Timed out" guidance prints. The `Status: unknown` artifact mentioned in #2204 is the CLI's display fallback `parseSandboxStatus(poll, ...) || "unknown"` kicking in after the child was killed mid-poll — not a missing subcommand in the mocks. ## Changes - `.github/workflows/wsl-e2e.yaml`: bump `NEMOCLAW_EXEC_TIMEOUT` from `30000` to `60000`, aligned with the existing `NEMOCLAW_TEST_TIMEOUT=60000`. Short comment explains why WSL needs the larger ceiling. ## Why this is the right fix - **Zero impact on fast runners.** Tests that finish in a few seconds on macOS/Linux return early; they don't pay for the larger ceiling. - **No product-code change.** Only the WSL CI outer-spawn budget moves. - **Targeted to the actual bottleneck.** The CLI isn't hanging — it's slow on WSL spawn. Bumping the budget matches the structural overhead. ## Considered and rejected - **Patching the mocks** (the #2204 hypothesis). No new subcommand is invoked; the mocks are correct. - **Reducing pre-poll overhead in product code** (e.g. skipping `policy get --full` on connect). Out of scope, risks regressing real behavior. - **Dropping `NEMOCLAW_CONNECT_TIMEOUT` from the test.** The test asserts timeout behavior — it must enter the poll. - **Bumping `macos-e2e`.** Not needed; those jobs pass comfortably. ## Type of Change - [x] Code change (bug fix — CI config only) ## Verification - [x] Local \`npx vitest run test/sandbox-stuck-recovery.test.ts test/sandbox-connect-inference.test.ts --project cli\` — 7/7 pass in 7.6s - [x] \`npx prek run --files .github/workflows/wsl-e2e.yaml\` — clean - [ ] **Real validation: this PR's \`wsl-e2e\` check must go green.** ## AI Disclosure - [x] AI-assisted — tool: Claude Code 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **Chores** * Improved Windows Subsystem for Linux (WSL) test reliability by adjusting execution timeout settings to account for system overhead.  Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

## Summary Bumps the published doc version to `0.0.22` and documents the user-visible CLI behavior changes to `nemoclaw <name> connect` that landed since v0.0.21. Drafted via the `nemoclaw-contributor-update-docs` skill against commits in `v0.0.21..origin/main`, filtered through `docs/.docs-skip`. ## Changes - **`docs/project.json`** and **`docs/versions1.json`**: bump the published version from `0.0.20` to `0.0.22`; insert a `0.0.21` entry into the version list so the history stays contiguous. - **`docs/reference/commands.md`** → `nemoclaw <name> connect`: document two new behaviors. - Readiness poll with `NEMOCLAW_CONNECT_TIMEOUT` (integer seconds; default `120`) that replaces the silent hang when the sandbox is not yet `Ready` — right after onboarding, while the 2.4 GB image is still pulling (#466). - Post-connect hint is now agent-aware, names the correct TUI command for the sandbox's agent, and tells you to use `/exit` to leave the chat before `exit` returns you to the host shell (#2080). Feature PRs that shipped their own docs in the same commit are intentionally not re-documented here: - `channels list/add/remove` (#2139) — command reference and the "`openclaw channels` blocked inside the sandbox" troubleshooting entry landed with the feature. - `nemoclaw gc` (#2176) — documented as part of the destroy/rebuild image cleanup PR. Skipped per `docs/.docs-skip`: - `e6bad533 fix(shields): verify config lock and fail hard on re-lock failure (#2066)` — matched `skip-features: src/lib/shields.ts`. Other commits in the range (#2141 OpenShell version bump, #1819 plugin banner live inference probe, #2085 / #2146 Slack Socket Mode fixes, #2110 axios proxy fix, #1818 NIM curl timeouts, #1824 onboard gateway bootstrap recovery, and assorted CI / test / install plumbing) are internal behavior refinements with no doc-relevant surface change. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [x] `npx prek run --all-files` passes for the modified files via the pre-commit hook, including `Regenerate agent skills from docs` (source ↔ generated parity confirmed) - [ ] `npm test` passes — skipped; the one pre-existing `test/cli.test.ts > unknown command exits 1` failure on `origin/main` is unrelated to these markdown/JSON-only changes - [ ] Tests added or updated for new or changed behavior — n/a, doc-only - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [ ] `make docs` builds without warnings (doc changes only) — not run locally - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) — n/a, no new pages ## AI Disclosure - [x] AI-assisted — tool: Claude Code --- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>  ## Summary by CodeRabbit * **New Features** * `connect` now displays the sandbox phase while waiting for readiness and honors a configurable timeout via NEMOCLAW_CONNECT_TIMEOUT (default 120s). * TTY hints are agent-aware and instruct using `/exit` before returning to the host shell. * **Documentation** * Command docs updated to describe polling, timeout, and TTY guidance. * Project/docs metadata updated for versions 0.0.21 and 0.0.22 (package version bumped to 0.0.22).  --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eady (#2222) ## Summary - PR #466 refactored `isSandboxReady` to delegate to `parseSandboxStatus`, which hardcodes the status to column 2 (`cols[1]`). The previous implementation used `cols.includes("Ready")` to check all columns. - When `openshell sandbox list` outputs the status in a column other than the second, the readyCheck never returns true, causing `openshell sandbox create` to hang indefinitely after image upload — the sandbox becomes Ready but NemoClaw never detects it. - This broke every nightly E2E test. Bisected to `49ec8955` (#466) with confirmed good run on the prior commit. - Restores the column-agnostic `cols.includes("Ready") && !cols.includes("NotReady")` check while keeping the `parseSandboxRow` helper. ## Bisect results | Commit | Result | |--------|--------| | `bd82e1f0` (pre-#466) | Pass | | `ff18fb8f` (midpoint) | Pass | | `49ec8955` (#466) | Fail — sandbox hangs | | `37ca70e6` (pre-#466, docs) | Pass | ## Test plan - [x] `vitest run test/onboard-readiness.test.ts` — 34 passed - [ ] Nightly E2E on this branch 🤖 Generated with [Claude Code](https://claude.com/claude-code)  ## Summary by CodeRabbit * **Bug Fixes** * Improved sandbox readiness detection to be more resilient across different versions of the sandbox management tool, ensuring reliable status parsing regardless of column layout variations.  Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

deepujain changed the title ~~fix(connect): add readiness poll with timeout before connecting (Fixe…~~ ix(connect): add readiness poll with timeout before connecting (Fixes #430) Mar 20, 2026

deepujain changed the title ~~ix(connect): add readiness poll with timeout before connecting (Fixes #430)~~ Fix(connect): add readiness poll with timeout before connecting (Fixes #430) Mar 20, 2026

coderabbitai Bot reviewed Mar 20, 2026

View reviewed changes

wscurran added priority: high Important issue that should be resolved in the next release observability Use this label to improve NemoClaw logging, metrics, and tracing. labels Mar 20, 2026

deepujain force-pushed the fix/430-connect-timeout branch from 954e578 to a645867 Compare March 21, 2026 23:02

coderabbitai Bot reviewed Mar 21, 2026

View reviewed changes

Comment thread bin/nemoclaw.js Outdated

deepujain force-pushed the fix/430-connect-timeout branch from a645867 to 8dce304 Compare March 21, 2026 23:24

coderabbitai Bot reviewed Mar 21, 2026

View reviewed changes

Comment thread bin/nemoclaw.js Outdated

Comment thread bin/nemoclaw.js Outdated

Comment thread bin/nemoclaw.js Outdated

deepujain force-pushed the fix/430-connect-timeout branch from 8dce304 to 286fe06 Compare March 22, 2026 00:15

coderabbitai Bot reviewed Mar 22, 2026

View reviewed changes

Comment thread bin/lib/onboard.js Outdated

deepujain force-pushed the fix/430-connect-timeout branch from 286fe06 to 377044b Compare March 23, 2026 22:45

coderabbitai Bot reviewed Mar 23, 2026

View reviewed changes

Comment thread bin/nemoclaw.js Outdated

deepujain force-pushed the fix/430-connect-timeout branch from 377044b to 8f7d2b6 Compare March 23, 2026 23:04

deepujain force-pushed the fix/430-connect-timeout branch 3 times, most recently from 89a490b to 5d79bef Compare March 26, 2026 01:33

mafueee pushed a commit to mafueee/NemoClaw that referenced this pull request Mar 28, 2026

fix(bootstrap): surface diagnostics for K8s namespace not ready failu…

a4883d8

…res (NVIDIA#466)

deepujain force-pushed the fix/430-connect-timeout branch 3 times, most recently from ac4c836 to ca76826 Compare April 1, 2026 04:25

wscurran added the status: rebase PR needs to be rebased against main before review can continue label Apr 14, 2026

deepujain force-pushed the fix/430-connect-timeout branch from ca76826 to 8586271 Compare April 14, 2026 19:11

ericksoa added v0.0.19 Release target v0.0.20 Release target and removed v0.0.19 Release target labels Apr 17, 2026

cv added v0.0.21 Release target and removed v0.0.20 Release target labels Apr 20, 2026

deepujain added 3 commits April 20, 2026 18:00

test: update connect recovery fixtures for readiness polling

a0ff17e

fix(cli): keep wrapper executable after rebase

e38d80a

Signed-off-by: dejain <deepujain@gmail.com>

deepujain force-pushed the fix/430-connect-timeout branch from e34a16a to e38d80a Compare April 21, 2026 01:06

deepujain changed the title ~~Fix(connect): add readiness poll with timeout before connecting (Fixes #430)~~ fix(connect): add readiness poll with timeout before connecting (Fixes #430) Apr 21, 2026

fix(connect): allow readiness polling for non-ready sandboxes

c9c0773

Fixes NVIDIA#430 Signed-off-by: Deepak Jain <deepujain@gmail.com>

wscurran requested a review from cv April 21, 2026 05:42

cv added v0.0.22 Release target and removed v0.0.21 Release target labels Apr 21, 2026

Merge branch 'main' into fix/430-connect-timeout

1d815d1

prekshivyas self-assigned this Apr 21, 2026

prekshivyas approved these changes Apr 21, 2026

View reviewed changes

cv merged commit 49ec895 into NVIDIA:main Apr 21, 2026
6 of 9 checks passed

prekshivyas mentioned this pull request Apr 21, 2026

fix(test): update connect tests for readiness poll behavior from #466 #2198

Merged

4 tasks

cv pushed a commit that referenced this pull request Apr 21, 2026

fix(test): update connect tests for readiness poll behavior from #466 (…

0deb4a3

…#2198)

jyaunches mentioned this pull request Apr 21, 2026

fix(test): sandbox-stuck-recovery and sandbox-connect-inference tests still failing after #2198 #2204

Closed

prekshivyas mentioned this pull request Apr 21, 2026

fix(ci): bump wsl-e2e NEMOCLAW_EXEC_TIMEOUT to 60s (Fixes #2204) #2205

Merged

5 tasks

miyoungc mentioned this pull request Apr 22, 2026

docs: prep for v0.0.22 #2209

Merged

13 tasks

ericksoa mentioned this pull request Apr 22, 2026

fix(gateway-state): restore column-agnostic Ready check in isSandboxReady #2222

Merged

2 tasks

Conversation

deepujain commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Testing

Uh oh!

coderabbitai Bot commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

deepujain commented Mar 21, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kjw3 commented Mar 23, 2026

Uh oh!

deepujain commented Mar 23, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

deepujain commented Mar 23, 2026

Uh oh!

deepujain commented Apr 14, 2026

Uh oh!

deepujain commented Apr 21, 2026

Uh oh!

deepujain commented Apr 21, 2026

Uh oh!

wscurran commented Apr 21, 2026

Uh oh!

prekshivyas left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

deepujain commented Mar 20, 2026 •

edited

Loading

coderabbitai Bot commented Mar 20, 2026 •

edited

Loading