
fix(connect): add readiness poll with timeout before connecting (Fixes #430)#466

Merged
cv merged 5 commits into NVIDIA:main from deepujain:fix/430-connect-timeout
Apr 21, 2026

Conversation

@deepujain
Contributor

@deepujain deepujain commented Mar 20, 2026

Summary

nemoclaw <name> connect hangs indefinitely with no output when the sandbox is not yet ready (starting, pulling image, crashed). This is especially problematic right after onboarding, when the README tells users to immediately run nemoclaw my-assistant connect while the 2.4 GB sandbox image may still be pulling.

This change adds a readiness poll with status display and configurable timeout so users see what's happening and get clear errors on failure.

Fixes #430

Changes

  • bin/lib/onboard.js: Added parseSandboxStatus(output, sandboxName) — pure helper that extracts the status column for a named sandbox from openshell sandbox list output (strips ANSI, exact-matches name in first column). Exported for unit tests.
  • bin/nemoclaw.js (sandboxConnect): Before connecting, polls openshell sandbox list every 3s. Shows live status with elapsed time (Status: ContainerCreating (12s elapsed)). Exits immediately with actionable error on terminal states (Failed, Error, CrashLoopBackOff, ImagePullBackOff). Times out after 120s (configurable via NEMOCLAW_CONNECT_TIMEOUT env var). If sandbox is already Ready, connects immediately with no delay.
  • test/onboard-readiness.test.js: Added 11 tests for parseSandboxStatus covering: all status values, null/empty input, ANSI stripping, exact name matching, multi-sandbox output.
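
As a rough illustration of the parsing helper described above, here is a minimal sketch. It assumes the `openshell sandbox list` table is whitespace-separated with the sandbox name in the first column and the status in the second; the sample table is made up for illustration and the real output may differ.

```javascript
// Sketch of a status parser for `openshell sandbox list` output.
// Assumption: whitespace-separated columns, name first, status second.
function parseSandboxStatus(output, sandboxName) {
  if (!output || typeof output !== "string") return null;
  // Strip ANSI colour (SGR) sequences so coloured output still matches.
  const clean = output.replace(/\x1b\[[0-9;]*m/g, "");
  for (const line of clean.split("\n")) {
    const cols = line.trim().split(/\s+/);
    // Exact match on the first column: "my" must not match "my-assistant".
    if (cols[0] === sandboxName && cols.length >= 2) return cols[1];
  }
  return null;
}

// Hypothetical sample output, for illustration only.
const sample = [
  "NAME            STATUS             AGE",
  "my-assistant    \x1b[33mContainerCreating\x1b[0m  12s",
  "other           Ready              3m",
].join("\n");

console.log(parseSandboxStatus(sample, "my-assistant")); // ContainerCreating
console.log(parseSandboxStatus(sample, "my"));           // null
```

Keeping the helper pure (string in, string or null out) is what makes it testable without a running sandbox.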

Testing

  • npm run build:cli
  • npx vitest run test/onboard-readiness.test.ts -t "parseSandboxStatus|sandbox readiness parsing"

Signed-off-by: Deepak Jain deepujain@gmail.com

@coderabbitai
Contributor

coderabbitai Bot commented Mar 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Replaces inline sandbox-list parsing with reusable parsers, updates readiness detection, adds polling with timeout and terminal-state handling to nemoclaw <name> connect, and adds tests for the new parser and behaviors.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Onboard helper: bin/lib/onboard.js | Added parseSandboxRow(output, sandboxName) that strips ANSI and selects the exact-name row; added and exported parseSandboxStatus(output, sandboxName) returning the status column; refactored isSandboxReady to use the parser. |
| Connect flow: bin/nemoclaw.js | Added readiness polling (3s interval) using openshell sandbox list, respects NEMOCLAW_CONNECT_TIMEOUT (default 120s) with live status output, treats several statuses as terminal (prints suggested nemoclaw <name> logs --follow and nemoclaw <name> status and exits 1), and only starts port-forward/connect after the sandbox is Ready. |
| Tests: test/onboard-readiness.test.js | Added parseSandboxStatus tests covering normal statuses, ANSI-stripping, exact-name matching, multi-line selection, and null/empty input cases. |

Sequence Diagram(s)

sequenceDiagram
  participant User as CLI (nemoclaw)
  participant Nemoclaw as bin/nemoclaw.js
  participant Onboard as bin/lib/onboard.js
  participant Openshell as openshell CLI

  User->>Nemoclaw: sandboxConnect(sandboxName)
  Nemoclaw->>Openshell: "openshell sandbox list" (runCapture, ignoreError)
  Openshell-->>Nemoclaw: table/text output
  Nemoclaw->>Onboard: parseSandboxStatus(output, sandboxName)
  Onboard-->>Nemoclaw: status (e.g., Ready, Pending, Failed, ...)
  alt status == Ready
    Nemoclaw->>User: print "Ready"
    User->>Openshell: start port-forward & "openshell sandbox connect"
    Openshell-->>User: interactive shell
  else status in (Failed, Error, CrashLoopBackOff, ImagePullBackOff, Unknown, Evicted)
    Nemoclaw->>User: print error + suggested "nemoclaw <name> logs --follow" and "nemoclaw <name> status"
    Nemoclaw-->>User: exit(1)
  else not found / timeout
    Nemoclaw->>User: print not-found/timeout message (hint NEMOCLAW_CONNECT_TIMEOUT)
    Nemoclaw-->>User: exit(1)
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

"I hopped through logs and timeouts so merry,
Waiting for status — not stuck in a ferry.
I parsed every row, stripped colours away,
Now connect waits wisely and shows you the way. 🐇"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | The PR successfully implements all coding requirements from issue #430: adds readiness polling, status display, configurable timeout via NEMOCLAW_CONNECT_TIMEOUT, terminal state detection with actionable errors, and a 3s polling interval. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to issue #430: the parseSandboxStatus helper and isSandboxReady refactor support the polling implementation, tests validate the new helper, and no unrelated modifications are present. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a readiness poll with timeout before connecting to prevent hangs. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

@deepujain deepujain changed the title fix(connect): add readiness poll with timeout before connecting (Fixe… ix(connect): add readiness poll with timeout before connecting (Fixes #430) Mar 20, 2026
@deepujain deepujain changed the title ix(connect): add readiness poll with timeout before connecting (Fixes #430) Fix(connect): add readiness poll with timeout before connecting (Fixes #430) Mar 20, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
bin/nemoclaw.js (1)

262-264: Consider adding validation feedback for invalid timeout values.

The || 120 fallback silently converts invalid values (e.g., NEMOCLAW_CONNECT_TIMEOUT=abc) to 120. This is safe but could confuse users who expect their setting to take effect.

♻️ Optional: Add warning for invalid timeout values
 function sandboxConnect(sandboxName) {
   const { isSandboxReady, parseSandboxStatus } = require("./lib/onboard");
-  const timeout = parseInt(process.env.NEMOCLAW_CONNECT_TIMEOUT, 10) || 120;
+  const rawTimeout = process.env.NEMOCLAW_CONNECT_TIMEOUT;
+  let timeout = 120;
+  if (rawTimeout !== undefined) {
+    const parsed = parseInt(rawTimeout, 10);
+    if (Number.isNaN(parsed) || parsed < 0) {
+      console.warn(`  Warning: Invalid NEMOCLAW_CONNECT_TIMEOUT="${rawTimeout}", using default 120s`);
+    } else {
+      timeout = parsed || 120; // treat 0 as "use default"
+    }
+  }
   const interval = 3;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 262 - 264, The current timeout assignment
silently falls back to 120 when NEMOCLAW_CONNECT_TIMEOUT is invalid; change the
logic around the parseInt call that assigns timeout so you first read raw =
process.env.NEMOCLAW_CONNECT_TIMEOUT, attempt to parse with parseInt, and if raw
is defined but Number.isNaN(parsed) emit a clear warning (e.g., console.warn)
mentioning the invalid value and that 120s will be used, otherwise use the
parsed value; keep the fallback to 120 for undefined values and reference the
timeout constant and the NEMOCLAW_CONNECT_TIMEOUT env var in your message so
users see what was wrong.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@bin/nemoclaw.js`:
- Around line 262-264: The current timeout assignment silently falls back to 120
when NEMOCLAW_CONNECT_TIMEOUT is invalid; change the logic around the parseInt
call that assigns timeout so you first read raw =
process.env.NEMOCLAW_CONNECT_TIMEOUT, attempt to parse with parseInt, and if raw
is defined but Number.isNaN(parsed) emit a clear warning (e.g., console.warn)
mentioning the invalid value and that 120s will be used, otherwise use the
parsed value; keep the fallback to 120 for undefined values and reference the
timeout constant and the NEMOCLAW_CONNECT_TIMEOUT env var in your message so
users see what was wrong.
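
The suggestion above can also be read as a small pure function. The sketch below follows the suggested diff; the name `resolveConnectTimeout` and the injectable `warn` parameter are hypothetical and only there to make the behaviour easy to exercise.

```javascript
const DEFAULT_TIMEOUT_S = 120;

// Hypothetical helper: turn the raw NEMOCLAW_CONNECT_TIMEOUT value into seconds.
// `warn` is injectable so the warning path can be observed in tests.
function resolveConnectTimeout(rawTimeout, warn = console.warn) {
  if (rawTimeout === undefined) return DEFAULT_TIMEOUT_S;
  const parsed = parseInt(rawTimeout, 10);
  if (Number.isNaN(parsed) || parsed < 0) {
    warn(`  Warning: Invalid NEMOCLAW_CONNECT_TIMEOUT="${rawTimeout}", using default ${DEFAULT_TIMEOUT_S}s`);
    return DEFAULT_TIMEOUT_S;
  }
  return parsed || DEFAULT_TIMEOUT_S; // treat 0 as "use default"
}

console.log(resolveConnectTimeout("30"));      // 30
console.log(resolveConnectTimeout(undefined)); // 120
console.log(resolveConnectTimeout("0"));       // 120
```

One caveat of `parseInt`: a value such as `10abc` parses as 10 without a warning, so stricter validation would need an explicit pattern check.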

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 76b8a6ba-1a7f-4f66-9c3b-f10fe6224423

📥 Commits

Reviewing files that changed from the base of the PR and between dbfd78c and 954e578.

📒 Files selected for processing (3)
  • bin/lib/onboard.js
  • bin/nemoclaw.js
  • test/onboard-readiness.test.js

@wscurran wscurran added priority: high Important issue that should be resolved in the next release observability Use this label to improve NemoClaw logging, metrics, and tracing. labels Mar 20, 2026
@deepujain deepujain force-pushed the fix/430-connect-timeout branch from 954e578 to a645867 Compare March 21, 2026 23:02
@deepujain
Contributor Author

Addressed the CodeRabbit nitpick. Rebased on main and resolved conflicts.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
bin/lib/onboard.js (1)

71-85: LGTM! Clean implementation with proper input validation.

The function correctly handles ANSI stripping, exact name matching in the first column, and safely returns null for missing/invalid input. The logic is consistent with isSandboxReady above.

💡 Optional: Extract shared parsing logic to reduce duplication

Both isSandboxReady and parseSandboxStatus share identical ANSI-stripping and column-parsing logic. If you anticipate adding more status-parsing utilities, consider extracting the common parsing into a helper:

+function parseSandboxRow(output, sandboxName) {
+  if (!output || typeof output !== "string") return null;
+  const clean = output.replace(/\x1b\[[0-9;]*m/g, "");
+  for (const line of clean.split("\n")) {
+    const cols = line.trim().split(/\s+/);
+    if (cols[0] === sandboxName) return cols;
+  }
+  return null;
+}
+
 function isSandboxReady(output, sandboxName) {
-  const clean = output.replace(/\x1b\[[0-9;]*m/g, "");
-  return clean.split("\n").some((l) => {
-    const cols = l.trim().split(/\s+/);
-    return cols[0] === sandboxName && cols.includes("Ready") && !cols.includes("NotReady");
-  });
+  const cols = parseSandboxRow(output, sandboxName);
+  return cols && cols.includes("Ready") && !cols.includes("NotReady");
 }

 function parseSandboxStatus(output, sandboxName) {
-  if (!output || typeof output !== "string") return null;
-  const clean = output.replace(/\x1b\[[0-9;]*m/g, "");
-  for (const line of clean.split("\n")) {
-    const cols = line.trim().split(/\s+/);
-    if (cols[0] === sandboxName && cols.length >= 2) {
-      return cols[1];
-    }
-  }
-  return null;
+  const cols = parseSandboxRow(output, sandboxName);
+  return cols && cols.length >= 2 ? cols[1] : null;
 }

This also adds null-safety to isSandboxReady which currently lacks input validation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 71 - 85, Extract the common ANSI-stripping
and column-splitting logic used in parseSandboxStatus and isSandboxReady into a
small helper (e.g., cleanAndSplitLines or parseTableLines) and then call that
helper from both parseSandboxStatus and isSandboxReady; ensure the helper
performs null/typeof checks and returns a safe array of lines/columns so
isSandboxReady gains the same input validation as parseSandboxStatus and
duplication is removed.
bin/nemoclaw.js (1)

297-339: Solid polling implementation with clear user feedback.

The readiness gate logic handles the core requirements well: terminal state detection, progress display with elapsed time, and configurable timeout.

One minor consideration: If the sandbox doesn't exist at all (not in the list), parseSandboxStatus returns null and the status shows as "unknown" until timeout. This is acceptable since the sandbox might be in the process of being created, but you could optionally add a check after the first poll to detect "sandbox not found" early.

💡 Optional: Early exit if sandbox never appears
     let elapsed = 0;
     let ready = false;
+    let everSeen = status !== null;
     while (elapsed < timeout) {
       spawnSync("sleep", [String(interval)]);
       elapsed += interval;
       const poll = runCapture("openshell sandbox list 2>&1", { ignoreError: true });
       if (isSandboxReady(poll, sandboxName)) {
         ready = true;
         break;
       }
       const cur = parseSandboxStatus(poll, sandboxName) || "unknown";
+      if (cur !== "unknown") everSeen = true;
       if (TERMINAL.has(cur)) {
         console.error("");
         console.error(`  Sandbox '${sandboxName}' entered '${cur}' state.`);
         console.error(`  Run:  nemoclaw ${sandboxName} logs --follow`);
         process.exit(1);
       }
+      // If sandbox never appeared after reasonable time, exit early
+      if (!everSeen && elapsed >= 30) {
+        console.error("");
+        console.error(`  Sandbox '${sandboxName}' not found after ${elapsed}s.`);
+        console.error(`  Check: openshell sandbox list`);
+        process.exit(1);
+      }
       process.stdout.write(`\r    Status: ${cur.padEnd(20)} (${elapsed}s elapsed)`);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 297 - 339, Add an early "sandbox not found"
detection: after the initial runCapture("openshell sandbox list...") and after
the first poll() in the loop, check if parseSandboxStatus(list, sandboxName) (or
parseSandboxStatus(poll, sandboxName)) returns null/undefined meaning the
sandbox is not present; if so, print a clear error message (e.g., "Sandbox
'<name>' not found. Check creation or name.") and exit non‑zero. Update the
blocks around isSandboxReady, parseSandboxStatus, list and poll to perform this
check and return early instead of waiting the full timeout.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 300-306: Update the TERMINAL set used to detect sandbox pod
failures to include the Kubernetes pod phases "Unknown" and the eviction reason
"Evicted": modify the constant TERMINAL (currently defined as new
Set(["Failed","Error","CrashLoopBackOff","ImagePullBackOff"])) to also contain
"Unknown" and "Evicted" so the branch that checks if (status &&
TERMINAL.has(status)) (which logs using sandboxName and exits via
process.exit(1)) will catch those cases as well.

---

Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 71-85: Extract the common ANSI-stripping and column-splitting
logic used in parseSandboxStatus and isSandboxReady into a small helper (e.g.,
cleanAndSplitLines or parseTableLines) and then call that helper from both
parseSandboxStatus and isSandboxReady; ensure the helper performs null/typeof
checks and returns a safe array of lines/columns so isSandboxReady gains the
same input validation as parseSandboxStatus and duplication is removed.

In `@bin/nemoclaw.js`:
- Around line 297-339: Add an early "sandbox not found" detection: after the
initial runCapture("openshell sandbox list...") and after the first poll() in
the loop, check if parseSandboxStatus(list, sandboxName) (or
parseSandboxStatus(poll, sandboxName)) returns null/undefined meaning the
sandbox is not present; if so, print a clear error message (e.g., "Sandbox
'<name>' not found. Check creation or name.") and exit non‑zero. Update the
blocks around isSandboxReady, parseSandboxStatus, list and poll to perform this
check and return early instead of waiting the full timeout.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1c8b36ca-68e5-4f54-9792-f2c5d72e54fe

📥 Commits

Reviewing files that changed from the base of the PR and between 954e578 and a645867.

📒 Files selected for processing (3)
  • bin/lib/onboard.js
  • bin/nemoclaw.js
  • test/onboard-readiness.test.js
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/onboard-readiness.test.js

Comment thread bin/nemoclaw.js Outdated
@deepujain deepujain force-pushed the fix/430-connect-timeout branch from a645867 to 8dce304 Compare March 21, 2026 23:24
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
bin/lib/onboard.js (1)

74-76: Use the status column directly for readiness.

cols.includes("Ready") makes readiness depend on every token in the row. Checking the status column keeps isSandboxReady() aligned with parseSandboxStatus() and avoids false positives from future extra columns.

♻️ Suggested change
 function isSandboxReady(output, sandboxName) {
   const cols = parseSandboxRow(output, sandboxName);
-  return cols !== null && cols.includes("Ready") && !cols.includes("NotReady");
+  return cols !== null && cols[1] === "Ready";
 }
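
To make the false-positive risk concrete, here is a small sketch comparing the two checks. The row is hypothetical: it invents a third column that happens to contain `Ready` while the status column says `Pending`. The function names `isReadyByRowScan` and `isReadyByStatusColumn` are illustrative only.

```javascript
function parseSandboxRow(output, sandboxName) {
  if (!output || typeof output !== "string") return null;
  const clean = output.replace(/\x1b\[[0-9;]*m/g, "");
  for (const line of clean.split("\n")) {
    const cols = line.trim().split(/\s+/);
    if (cols[0] === sandboxName) return cols;
  }
  return null;
}

// Scans every token in the row, as the code before this suggestion does.
function isReadyByRowScan(output, sandboxName) {
  const cols = parseSandboxRow(output, sandboxName);
  return cols !== null && cols.includes("Ready") && !cols.includes("NotReady");
}

// Looks only at the status column, as suggested.
function isReadyByStatusColumn(output, sandboxName) {
  const cols = parseSandboxRow(output, sandboxName);
  return cols !== null && cols[1] === "Ready";
}

// Hypothetical row: status is Pending, but a made-up third column says Ready.
const row = "my-assistant  Pending  Ready";
console.log(isReadyByRowScan(row, "my-assistant"));      // true (false positive)
console.log(isReadyByStatusColumn(row, "my-assistant")); // false
```
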
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/lib/onboard.js` around lines 74 - 76, isSandboxReady currently checks
readiness by searching the entire row for "Ready"/"NotReady", which can give
false positives; change it to use the status value returned by
parseSandboxStatus(output, sandboxName) (or otherwise extract the explicit
status column from parseSandboxRow) and return true only when that status ===
"Ready" (and not "NotReady"); update isSandboxReady to call
parseSandboxStatus/output -> status and base the boolean on that status.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 299-306: The terminal-state handling is inconsistent: the
immediate terminal branch (using parseSandboxStatus, TERMINAL and sandboxName)
prints both "nemoclaw <sandboxName> logs --follow" and "nemoclaw <sandboxName>
status" but the in-loop terminal branch omits the status command; update the
in-loop terminal branch to mirror the immediate branch by adding the same
console.error line that suggests running "nemoclaw <sandboxName> status" so both
branches output identical guidance for terminal statuses.
- Around line 313-315: The loop that waits for connection oversleeps because it
always calls spawnSync("sleep", [String(interval)]) regardless of remaining
time; change the logic in the wait/retry loop (variables elapsed, timeout,
interval and the while loop) to compute remaining = timeout - elapsed and sleep
for Math.max(0, Math.min(interval, remaining)) seconds (skip sleeping if
remaining <= 0) so the total wait never exceeds NEMOCLAW_CONNECT_TIMEOUT; keep
using spawnSync("sleep", [String(sleepSec)]) and update elapsed by the actual
sleepSec used.
- Line 297: The two calls to runCapture("openshell sandbox list", { ignoreError:
true }) must pass a timeout so the child process cannot hang; modify the calls
in bin/nemoclaw.js (the runCapture invocations used by connect) to include a
timeout option derived from NEMOCLAW_CONNECT_TIMEOUT (converted to milliseconds)
e.g. merge { ignoreError: true } with { timeout: <NEMOCLAW_CONNECT_TIMEOUT in
ms> } before calling runCapture; ensure you reuse the same timeout value used by
the outer polling loop so the subprocess timeout aligns with the overall connect
timeout.

---

Nitpick comments:
In `@bin/lib/onboard.js`:
- Around line 74-76: isSandboxReady currently checks readiness by searching the
entire row for "Ready"/"NotReady", which can give false positives; change it to
use the status value returned by parseSandboxStatus(output, sandboxName) (or
otherwise extract the explicit status column from parseSandboxRow) and return
true only when that status === "Ready" (and not "NotReady"); update
isSandboxReady to call parseSandboxStatus/output -> status and base the boolean
on that status.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0ec63e5c-a749-41d2-a442-7ae0c7e6e219

📥 Commits

Reviewing files that changed from the base of the PR and between a645867 and 8dce304.

📒 Files selected for processing (3)
  • bin/lib/onboard.js
  • bin/nemoclaw.js
  • test/onboard-readiness.test.js

Comment thread bin/nemoclaw.js Outdated
Comment thread bin/nemoclaw.js Outdated
Comment thread bin/nemoclaw.js Outdated
@deepujain deepujain force-pushed the fix/430-connect-timeout branch from 8dce304 to 286fe06 Compare March 22, 2026 00:15
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
bin/nemoclaw.js (3)

297-297: ⚠️ Potential issue | 🟠 Major

Bound each openshell sandbox list subprocess.

NEMOCLAW_CONNECT_TIMEOUT only limits the outer loop. A wedged runCapture() here can still block forever before elapsed is checked again, which puts connect back into the indefinite-hang state this PR is trying to eliminate.

🩹 Suggested fix
   const interval = 3;
+  const listOpts = { ignoreError: true, timeout: timeout * 1000 };
 
-  const list = runCapture("openshell sandbox list 2>&1", { ignoreError: true });
+  const list = runCapture("openshell sandbox list 2>&1", listOpts);
   if (!isSandboxReady(list, sandboxName)) {
@@
-      const poll = runCapture("openshell sandbox list 2>&1", { ignoreError: true });
+      const poll = runCapture("openshell sandbox list 2>&1", listOpts);

Verify that bin/lib/runner.js already passes a timeout option through to the child process, and that these polling call sites are the remaining ones omitting it. Expected result: runCapture shows timeout-capable plumbing, while the two sandbox list calls in this function do not currently use it.

#!/bin/bash
set -euo pipefail

echo "== runCapture timeout support =="
rg -n -C3 'function runCapture|const runCapture|timeout|execSync' bin/lib/runner.js

echo
echo "== sandbox list polling call sites =="
rg -n -C2 'runCapture\("openshell sandbox list 2>&1"' bin/nemoclaw.js bin/lib/onboard.js

Also applies to: 316-316

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` at line 297, The polling call to runCapture("openshell
sandbox list 2>&1", { ignoreError: true }) is not bounded by the
NEMOCLAW_CONNECT_TIMEOUT and can hang; update both sandbox-list runCapture
invocations in this file to pass a timeout option (either the
NEMOCLAW_CONNECT_TIMEOUT value or the remaining elapsed time calculated in the
connect loop) so the child process is killed if it exceeds the limit, and
confirm bin/lib/runner.js already wires a timeout option through to the child
process before making the change to runCapture.
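
For readers unfamiliar with how Node bounds a synchronous child process, the sketch below shows the underlying mechanism: the `timeout` option of `child_process.spawnSync`. The `runBounded` wrapper is hypothetical and is not the project's `runCapture`; whether `runCapture` forwards a timeout is exactly what the script above is meant to verify.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical wrapper: run a command, but give up after timeoutMs.
function runBounded(command, args, timeoutMs) {
  const result = spawnSync(command, args, { encoding: "utf8", timeout: timeoutMs });
  // When the timeout fires, Node kills the child and reports ETIMEDOUT.
  const timedOut = Boolean(result.error && result.error.code === "ETIMEDOUT");
  return { timedOut, stdout: result.stdout || "" };
}

// Stand-in for a wedged CLI: a Node child that would idle for 10 seconds.
const wedged = runBounded(process.execPath, ["-e", "setTimeout(() => {}, 10000)"], 300);
console.log(wedged.timedOut);

// A command that finishes quickly is unaffected by the bound.
const quick = runBounded(process.execPath, ["-e", "console.log('ok')"], 5000);
console.log(quick.timedOut, quick.stdout.trim());
```

Without such a bound, a single stuck list call would block before the outer loop ever rechecks its own timeout.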

323-327: ⚠️ Potential issue | 🟡 Minor

Keep terminal-state guidance consistent inside the loop.

The immediate terminal-state branch prints both logs --follow and status, but the in-loop branch drops the status command. The failure UX should be identical regardless of when the sandbox breaks.

🩹 Suggested fix
       if (TERMINAL.has(cur)) {
         console.error("");
         console.error(`  Sandbox '${sandboxName}' entered '${cur}' state.`);
         console.error(`  Run:  nemoclaw ${sandboxName} logs --follow`);
+        console.error(`  Run:  nemoclaw ${sandboxName} status`);
         process.exit(1);
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 323 - 327, The terminal-state handling is
inconsistent: inside the loop where TERMINAL.has(cur) is true (variables cur and
sandboxName), the messages only suggest "nemoclaw <sandboxName> logs --follow"
but the earlier immediate branch also suggests "nemoclaw <sandboxName> status";
update the in-loop branch that checks TERMINAL.has(cur) to print the same two
guidance lines (both "logs --follow" and "status") before calling
process.exit(1) so the failure UX is identical regardless of when the sandbox
enters a terminal state.

313-315: ⚠️ Potential issue | 🟡 Minor

Don't sleep past the configured timeout.

NEMOCLAW_CONNECT_TIMEOUT=1 still waits 3 seconds here, and larger non-multiples can overshoot by a full poll interval.

⏱️ Suggested fix
     let elapsed = 0;
     let ready = false;
     let everSeen = status !== null;
     while (elapsed < timeout) {
-      spawnSync("sleep", [String(interval)]);
-      elapsed += interval;
+      const sleepFor = Math.min(interval, timeout - elapsed);
+      if (sleepFor > 0) {
+        spawnSync("sleep", [String(sleepFor)]);
+        elapsed += sleepFor;
+      }
       const poll = runCapture("openshell sandbox list 2>&1", { ignoreError: true });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bin/nemoclaw.js` around lines 313 - 315, The loop that waits using
spawnSync("sleep", [String(interval)]) can overshoot the configured timeout
because it always sleeps the full interval; change the sleep duration inside the
while loop (in the block using elapsed, timeout, interval and spawnSync) to
compute remaining = timeout - elapsed, break if remaining <= 0, and call
spawnSync with String(Math.min(interval, remaining)); then add the actual slept
duration (min(interval, remaining)) to elapsed instead of always adding
interval.
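
The overshoot is easy to see with a little arithmetic. The sketch below is illustration only, not code from the PR: `naiveTotalWait` models the loop that always sleeps a full interval, and `sleepSchedule` models the clamped version suggested above.

```javascript
// Total seconds slept when every iteration sleeps the full interval.
function naiveTotalWait(timeout, interval) {
  let elapsed = 0;
  while (elapsed < timeout) elapsed += interval;
  return elapsed;
}

// Individual sleep durations when each sleep is clamped to the time remaining.
function sleepSchedule(timeout, interval) {
  const sleeps = [];
  let elapsed = 0;
  while (elapsed < timeout) {
    const sleepFor = Math.min(interval, timeout - elapsed);
    sleeps.push(sleepFor);
    elapsed += sleepFor;
  }
  return sleeps;
}

console.log(naiveTotalWait(1, 3)); // 3
console.log(naiveTotalWait(7, 3)); // 9
console.log(sleepSchedule(1, 3));  // [ 1 ]
console.log(sleepSchedule(7, 3));  // [ 3, 3, 1 ]
```
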
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/lib/onboard.js`:
- Around line 74-76: isSandboxReady currently scans all tokens from
parseSandboxRow output which can incorrectly detect "Ready" in later columns;
update isSandboxReady to use the parsed status column (as defined by
parseSandboxStatus / cols[1]) instead of searching the entire row, i.e., call
parseSandboxStatus or read cols[1] from parseSandboxRow and return true only
when that parsed status equals "Ready" (and not "NotReady") so sandboxConnect's
readiness check follows the new parsed status semantics.

---

Duplicate comments:
In `@bin/nemoclaw.js`:
- Line 297: The polling call to runCapture("openshell sandbox list 2>&1", {
ignoreError: true }) is not bounded by the NEMOCLAW_CONNECT_TIMEOUT and can
hang; update both sandbox-list runCapture invocations in this file to pass a
timeout option (either the NEMOCLAW_CONNECT_TIMEOUT value or the remaining
elapsed time calculated in the connect loop) so the child process is killed if
it exceeds the limit, and confirm bin/lib/runner.js already wires a timeout
option through to the child process before making the change to runCapture.
- Around line 323-327: The terminal-state handling is inconsistent: inside the
loop where TERMINAL.has(cur) is true (variables cur and sandboxName), the
messages only suggest "nemoclaw <sandboxName> logs --follow" but the earlier
immediate branch also suggests "nemoclaw <sandboxName> status"; update the
in-loop branch that checks TERMINAL.has(cur) to print the same two guidance
lines (both "logs --follow" and "status") before calling process.exit(1) so the
failure UX is identical regardless of when the sandbox enters a terminal state.
- Around line 313-315: The loop that waits using spawnSync("sleep",
[String(interval)]) can overshoot the configured timeout because it always
sleeps the full interval; change the sleep duration inside the while loop (in
the block using elapsed, timeout, interval and spawnSync) to compute remaining =
timeout - elapsed, break if remaining <= 0, and call spawnSync with
String(Math.min(interval, remaining)); then add the actual slept duration
(min(interval, remaining)) to elapsed instead of always adding interval.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c15045bb-4fce-4372-acf0-e66f5f52eef5

📥 Commits

Reviewing files that changed from the base of the PR and between 8dce304 and 286fe06.

📒 Files selected for processing (3)
  • bin/lib/onboard.js
  • bin/nemoclaw.js
  • test/onboard-readiness.test.js

Comment thread bin/lib/onboard.js Outdated
@kjw3
Contributor

kjw3 commented Mar 23, 2026

Thanks for working on this. The issue is real, and the added readiness/status UX is the right direction, but I don’t think this is ready to merge yet.

The main blocker is that the new polling path can still hang indefinitely if openshell sandbox list itself wedges. Right now the outer loop has a timeout, but the two runCapture("openshell sandbox list 2>&1", ...) calls are still unbounded.

  • Please pass a timeout into both runCapture(...) calls so the child process cannot block forever.
  • The in-loop terminal-state branch should print the same guidance as the immediate terminal branch, including nemoclaw <name> status.
  • The sleep loop should not overshoot NEMOCLAW_CONNECT_TIMEOUT for non-multiples of 3 seconds.
  • isSandboxReady() should check the parsed status column directly instead of scanning the whole row for "Ready".

If you update those pieces, I’d be happy to re-review. The feature is worthwhile; it just needs to fully close the hang path before we merge it.

@deepujain deepujain force-pushed the fix/430-connect-timeout branch from 286fe06 to 377044b Compare March 23, 2026 22:45
@deepujain
Contributor Author

Addressed all four points from @kjw3's review.
Rebased on latest main. All readiness/status tests pass.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bin/nemoclaw.js`:
- Around line 312-355: Compute a real wall-clock deadline (e.g., const deadline
= Date.now() + timeout * 1000) and use the remaining time to drive both the
subprocess timeout and sleep intervals instead of the elapsed counter; before
each runCapture("openshell sandbox list 2>&1", listOpts) (including the initial
call) update listOpts.timeout to Math.max(0, deadline - Date.now()) in
milliseconds, and when computing sleepFor derive it from the remaining seconds
(e.g., remainingSec = Math.ceil((deadline - Date.now())/1000); sleepFor =
Math.min(interval, remainingSec)); also check remaining time and error/exit if
deadline passed so the while loop using elapsed/timeout is replaced by checking
Date.now() against deadline in the loop that polls
isSandboxReady/parseSandboxStatus.
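
A minimal sketch of the deadline arithmetic described above, under stated assumptions: `remainingMs`, `nextSleepSeconds`, and `pollUntilReady` are hypothetical names, and the clock, check, and sleep functions are injected so the logic can be exercised without spawning `openshell`. It is not the code that landed in `bin/nemoclaw.js`.

```javascript
// Hypothetical helpers illustrating a wall-clock deadline; names are assumptions.

// Milliseconds left before the deadline, never negative.
function remainingMs(deadline, now) {
  return Math.max(0, deadline - now);
}

// Sleep for the poll interval, or less if the deadline is closer than that.
function nextSleepSeconds(deadline, now, intervalSec) {
  const remainingSec = Math.ceil(remainingMs(deadline, now) / 1000);
  return Math.min(intervalSec, remainingSec);
}

// `now`, `check`, and `sleep` are injected so the loop is testable.
// `check(budgetMs)` stands in for a bounded subprocess call that is given
// the remaining budget as its own timeout.
function pollUntilReady({ timeoutSec, intervalSec, now, check, sleep }) {
  const deadline = now() + timeoutSec * 1000;
  while (true) {
    const budget = remainingMs(deadline, now());
    if (budget === 0) return false; // deadline passed: report a timeout
    if (check(budget)) return true;
    const sleepFor = nextSleepSeconds(deadline, now(), intervalSec);
    if (sleepFor === 0) return false;
    sleep(sleepFor);
  }
}
```

Because each sleep is clamped to the remaining time, a 7-second timeout with a 3-second interval sleeps 3, 3, and then 1 second rather than overshooting to 9.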

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 87d573a4-6751-42cf-9ddb-8abf1d2f5d52

📥 Commits

Reviewing files that changed from the base of the PR and between 286fe06 and 377044b.

📒 Files selected for processing (3)
  • bin/lib/onboard.js
  • bin/nemoclaw.js
  • test/onboard-readiness.test.js

Comment thread bin/nemoclaw.js Outdated
@deepujain deepujain force-pushed the fix/430-connect-timeout branch from 377044b to 8f7d2b6 Compare March 23, 2026 23:04
@deepujain
Contributor Author

Switched to a wall-clock deadline (Date.now()) so subprocess time counts toward NEMOCLAW_CONNECT_TIMEOUT — rebased on latest main, all readiness tests pass.

@deepujain deepujain force-pushed the fix/430-connect-timeout branch 3 times, most recently from 89a490b to 5d79bef Compare March 26, 2026 01:33
mafueee pushed a commit to mafueee/NemoClaw that referenced this pull request Mar 28, 2026
quanticsoul4772 added a commit to quanticsoul4772/NemoClaw that referenced this pull request Mar 30, 2026
…ream PR NVIDIA#466)

Adapted from deepujain/fix/430-connect-timeout (Fixes NVIDIA#430).

sandboxConnect now polls `openshell sandbox list` for Ready status before
connecting. Previously, connecting while a sandbox was still pulling (2.4GB
image) would hang silently with no output.

- bin/lib/onboard.js: extract parseSandboxRow/parseSandboxStatus from
  isSandboxReady; export parseSandboxStatus
- bin/nemoclaw.js: add pre-connect readiness poll (default 120s timeout,
  configurable via NEMOCLAW_CONNECT_TIMEOUT env var); shows live status
  with elapsed time; exits immediately on terminal states (Failed,
  CrashLoopBackOff, ImagePullBackOff, etc.)
- test/onboard-readiness.test.js: add 11 parseSandboxStatus unit tests

Adaptation notes: readiness poll runs before ensureLiveSandboxOrExit (not
instead of it), keeping the gateway state check intact.
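
As a rough illustration of the "exits immediately on terminal states" behavior, the sketch below sorts a parsed status into three outcomes. The names `TERMINAL_STATES` and `classifyStatus` are hypothetical, and the state list is taken from the upstream PR description (Failed, Error, CrashLoopBackOff, ImagePullBackOff); the actual implementation may include others.

```javascript
// Hypothetical sketch; the state list comes from the upstream PR description.
const TERMINAL_STATES = new Set([
  "Failed",
  "Error",
  "CrashLoopBackOff",
  "ImagePullBackOff",
]);

// Returns "ready" (connect now), "terminal" (exit with guidance),
// or "waiting" (keep polling until the timeout).
function classifyStatus(status) {
  if (status === "Ready") return "ready";
  if (TERMINAL_STATES.has(status)) return "terminal";
  return "waiting";
}
```

Anything not recognized, including a missing status, falls through to "waiting" so that only the overall timeout ends the poll.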

Source: NVIDIA#466
@deepujain deepujain force-pushed the fix/430-connect-timeout branch 3 times, most recently from ac4c836 to ca76826 Compare April 1, 2026 04:25
@wscurran wscurran added the status: rebase PR needs to be rebased against main before review can continue label Apr 14, 2026
@deepujain deepujain force-pushed the fix/430-connect-timeout branch from ca76826 to 8586271 Compare April 14, 2026 19:11
@deepujain
Contributor Author

Rebased this onto current main and finished the TS-side connect/onboard port. The checks pass locally on the rebased branch.

@ericksoa ericksoa added v0.0.19 Release target v0.0.20 Release target and removed v0.0.19 Release target labels Apr 17, 2026
@cv cv added v0.0.21 Release target and removed v0.0.20 Release target labels Apr 20, 2026
…NVIDIA#430)

Polls openshell sandbox list with status display and configurable timeout
(NEMOCLAW_CONNECT_TIMEOUT, default 120s) so nemoclaw connect no longer hangs
when the sandbox is starting or has crashed.  Each runCapture call is bounded
by the same timeout so a wedged subprocess cannot block forever.

Fixes NVIDIA#430

Signed-off-by: Deepak Jain <deepujain@gmail.com>
Signed-off-by: dejain <deepujain@gmail.com>
@deepujain deepujain force-pushed the fix/430-connect-timeout branch from e34a16a to e38d80a Compare April 21, 2026 01:06
@deepujain deepujain changed the title Fix(connect): add readiness poll with timeout before connecting (Fixes #430) fix(connect): add readiness poll with timeout before connecting (Fixes #430) Apr 21, 2026
@deepujain
Contributor Author

Rebased on current main and fixed the PR title and DCO body checks. The readiness wait is now replayed on top of the current TS connect path, npm run build:cli passes, and the parseSandboxStatus readiness tests pass locally.

Fixes NVIDIA#430

Signed-off-by: Deepak Jain <deepujain@gmail.com>
@deepujain
Contributor Author

Pushed one more follow-up for the TS-era connect path. connect was still bailing on a non-Ready sandbox phase before the readiness poll could run; that is fixed now. npm run build:cli passes, the parseSandboxStatus readiness tests pass, and a direct Pending -> Ready connect repro now reaches sandbox connect and exits cleanly.

@wscurran
Contributor

Thanks for this — we've reviewed the approach and it addresses a real gap. The current connect command detects crashed/stuck sandboxes but doesn't poll during image pulls, so users hitting a fresh nemoclaw my-assistant connect right after onboard get a misleading error. The readiness poll with elapsed-time display and configurable timeout here is the right fix for that.

This is in the review queue. We'll follow up with any feedback.

@wscurran wscurran requested a review from cv April 21, 2026 05:42
@cv cv added v0.0.22 Release target and removed v0.0.21 Release target labels Apr 21, 2026
@prekshivyas prekshivyas self-assigned this Apr 21, 2026
Contributor

@prekshivyas prekshivyas left a comment

Reviewed end-to-end. The PR cleanly addresses #430 and has converged nicely through the CodeRabbit iterations — all the structural concerns (subprocess timeout bounded by remaining wall-clock, deadline-based loop, terminal-state UX parity between immediate and in-loop branches, sandbox-not-found early exit at 30s) are resolved. No UX, security, onboarding, or code-quality regressions. Approving.

A few non-blocking nits for @deepujain to consider — happy to ship as-is and address in a follow-up if preferred:

  1. Non-TTY output: process.stdout.write('\r Status: …') produces concatenated lines when stdout isn't a TTY (captured logs, CI). Gating the progress update on process.stdout.isTTY would keep piped output clean. Minor — the primary use case is interactive.

  2. Lost diagnostic detail on stuck phases: the old ensureLiveSandboxOrExit included useful context ("usually happens when a process crash inside the sandbox prevented clean startup"). The new TERMINAL-state branch prints a simpler message. Consider preserving the richer text specifically for CrashLoopBackOff / Error.

  3. Named constant for 30: the elapsed >= 30 sandbox-not-found threshold would benefit from a named constant + one-line comment. If NEMOCLAW_CONNECT_TIMEOUT is set below 30s, this check is never reached (harmless, just worth documenting).

  4. Stylistic: let ready = false + post-loop check could be a labelled break, but the current form is perfectly readable. Ignore if you prefer it.

None are blockers.
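
For nit 1, a minimal sketch of gating the carriage-return progress update on `isTTY`. The function name `writeStatus` and the exact message layout are assumptions for illustration; the stream is passed in so the same code works with `process.stdout` or a test double.

```javascript
// Hypothetical sketch; `writeStatus` is not a function from the PR.
function writeStatus(stream, status, elapsedSec) {
  const text = `Status: ${status} (${elapsedSec}s elapsed)`;
  if (stream.isTTY) {
    // Interactive terminal: return to column 0 and overwrite the previous update.
    stream.write(`\r  ${text}`);
  } else {
    // Piped or captured output: one update per line, no carriage returns.
    stream.write(`${text}\n`);
  }
}
```

In the CLI this would be called as `writeStatus(process.stdout, status, elapsed)`; when stdout is piped, `process.stdout.isTTY` is `undefined`, so the newline branch is taken.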

@cv cv merged commit 49ec895 into NVIDIA:main Apr 21, 2026
6 of 9 checks passed
brandonpelfrey pushed a commit that referenced this pull request Apr 21, 2026
## Summary

Fixes the lingering `sandbox-stuck-recovery.test.ts > connect times out
with guidance when sandbox is stuck in Provisioning` failure on
`wsl-e2e` that has been red on `main` since #466 merged. My earlier
#2198 fixed the sibling tests but this one kept hitting the CI
test-budget ceiling.

Fixes #2204.

## Root cause

Not a mock-script bug — the mocks for `sandbox list` already return the
phase string that `parseSandboxStatus` expects, and the tests pass
cleanly on macOS and the Linux `checks` job.

The failure is WSL-specific timing:

- `NEMOCLAW_EXEC_TIMEOUT=30000` caps each outer `spawnSync` to the CLI
at 30 seconds.
- WSL process-spawn overhead is high: each mocked `openshell` invocation
is a fresh Node.js child process, and the CLI makes 4+ of these
(`sandbox get`, `policy get --full`, `inference get`, first `sandbox
list`) before the readiness poll loop from #466 starts.
- The 3 sibling tests in `sandbox-stuck-recovery.test.ts` that skip the
poll loop were passing at **26–28s** on WSL — already within a few
seconds of the ceiling.
- The failing test is the only one that sets
`NEMOCLAW_CONNECT_TIMEOUT=3`, so it actually enters the poll loop
(pre-poll + 3s sleep + final poll + "Timed out" print). That pushes
total runtime past 30s on WSL runners, so the outer `spawnSync` kills
the CLI before the "Timed out" guidance prints.

The `Status: unknown` artifact mentioned in #2204 is the CLI's display
fallback `parseSandboxStatus(poll, ...) || "unknown"` kicking in after
the child was killed mid-poll — not a missing subcommand in the mocks.

## Changes

- `.github/workflows/wsl-e2e.yaml`: bump `NEMOCLAW_EXEC_TIMEOUT` from
`30000` to `60000`, aligned with the existing
`NEMOCLAW_TEST_TIMEOUT=60000`. Short comment explains why WSL needs the
larger ceiling.

## Why this is the right fix

- **Zero impact on fast runners.** Tests that finish in a few seconds on
macOS/Linux return early; they don't pay for the larger ceiling.
- **No product-code change.** Only the WSL CI outer-spawn budget moves.
- **Targeted to the actual bottleneck.** The CLI isn't hanging — it's
slow on WSL spawn. Bumping the budget matches the structural overhead.

## Considered and rejected

- **Patching the mocks** (the #2204 hypothesis). No new subcommand is
invoked; the mocks are correct.
- **Reducing pre-poll overhead in product code** (e.g. skipping `policy
get --full` on connect). Out of scope, risks regressing real behavior.
- **Dropping `NEMOCLAW_CONNECT_TIMEOUT` from the test.** The test
asserts timeout behavior — it must enter the poll.
- **Bumping `macos-e2e`.** Not needed; those jobs pass comfortably.

## Type of Change

- [x] Code change (bug fix — CI config only)

## Verification

- [x] Local \`npx vitest run test/sandbox-stuck-recovery.test.ts
test/sandbox-connect-inference.test.ts --project cli\` — 7/7 pass in
7.6s
- [x] \`npx prek run --files .github/workflows/wsl-e2e.yaml\` — clean
- [ ] **Real validation: this PR's \`wsl-e2e\` check must go green.**

## AI Disclosure

- [x] AI-assisted — tool: Claude Code

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Improved Windows Subsystem for Linux (WSL) test reliability by
adjusting execution timeout settings to account for system overhead.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@miyoungc miyoungc mentioned this pull request Apr 22, 2026
13 tasks
miyoungc added a commit that referenced this pull request Apr 22, 2026
## Summary

Bumps the published doc version to `0.0.22` and documents the
user-visible CLI behavior changes to `nemoclaw <name> connect` that
landed since v0.0.21. Drafted via the `nemoclaw-contributor-update-docs`
skill against commits in `v0.0.21..origin/main`, filtered through
`docs/.docs-skip`.

## Changes

- **`docs/project.json`** and **`docs/versions1.json`**: bump the
published version from `0.0.20` to `0.0.22`; insert a `0.0.21` entry
into the version list so the history stays contiguous.
- **`docs/reference/commands.md`** → `nemoclaw <name> connect`: document
two new behaviors.
- Readiness poll with `NEMOCLAW_CONNECT_TIMEOUT` (integer seconds;
default `120`) that replaces the silent hang when the sandbox is not yet
`Ready` — right after onboarding, while the 2.4 GB image is still
pulling (#466).
- Post-connect hint is now agent-aware, names the correct TUI command
for the sandbox's agent, and tells you to use `/exit` to leave the chat
before `exit` returns you to the host shell (#2080).

Feature PRs that shipped their own docs in the same commit are
intentionally not re-documented here:

- `channels list/add/remove` (#2139) — command reference and the
"`openclaw channels` blocked inside the sandbox" troubleshooting entry
landed with the feature.
- `nemoclaw gc` (#2176) — documented as part of the destroy/rebuild
image cleanup PR.

Skipped per `docs/.docs-skip`:

- `e6bad533 fix(shields): verify config lock and fail hard on re-lock
failure (#2066)` — matched `skip-features: src/lib/shields.ts`.

Other commits in the range (#2141 OpenShell version bump, #1819 plugin
banner live inference probe, #2085 / #2146 Slack Socket Mode fixes,
#2110 axios proxy fix, #1818 NIM curl timeouts, #1824 onboard gateway
bootstrap recovery, and assorted CI / test / install plumbing) are
internal behavior refinements with no doc-relevant surface change.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes for the modified files via the
pre-commit hook, including `Regenerate agent skills from docs` (source ↔
generated parity confirmed)
- [ ] `npm test` passes — skipped; the one pre-existing
`test/cli.test.ts > unknown command exits 1` failure on `origin/main` is
unrelated to these markdown/JSON-only changes
- [ ] Tests added or updated for new or changed behavior — n/a, doc-only
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only) — not run
locally
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)
— n/a, no new pages

## AI Disclosure

- [x] AI-assisted — tool: Claude Code

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* `connect` now displays the sandbox phase while waiting for readiness
and honors a configurable timeout via NEMOCLAW_CONNECT_TIMEOUT (default
120s).
* TTY hints are agent-aware and instruct using `/exit` before returning
to the host shell.

* **Documentation**
  * Command docs updated to describe polling, timeout, and TTY guidance.
* Project/docs metadata updated for versions 0.0.21 and 0.0.22 (package
version bumped to 0.0.22).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ericksoa added a commit that referenced this pull request Apr 22, 2026
…eady (#2222)

## Summary
- PR #466 refactored `isSandboxReady` to delegate to
`parseSandboxStatus`, which hardcodes the status to column 2
(`cols[1]`). The previous implementation used `cols.includes("Ready")`
to check all columns.
- When `openshell sandbox list` outputs the status in a column other
than the second, the readyCheck never returns true, causing `openshell
sandbox create` to hang indefinitely after image upload — the sandbox
becomes Ready but NemoClaw never detects it.
- This broke every nightly E2E test. Bisected to `49ec8955` (#466) with
confirmed good run on the prior commit.
- Restores the column-agnostic `cols.includes("Ready") &&
!cols.includes("NotReady")` check while keeping the `parseSandboxRow`
helper.
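
The restored check can be sketched as follows. The function name `isRowReady` is hypothetical and the whitespace split is an assumption about how the row is tokenized; only the `cols.includes(...)` expression comes from the description above.

```javascript
// Hypothetical sketch of the column-agnostic readiness check.
function isRowReady(line) {
  // Assumption: columns are separated by runs of whitespace.
  const cols = line.trim().split(/\s+/);
  // "Ready" may appear in any column; an explicit "NotReady" token vetoes it.
  return cols.includes("Ready") && !cols.includes("NotReady");
}
```

Because `includes` compares whole tokens, `NotReady` on its own never counts as `Ready`. The trade-off is that any column equal to `Ready` satisfies the check, not only the status column.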

## Bisect results
| Commit | Result |
|--------|--------|
| `bd82e1f0` (pre-#466) | Pass |
| `ff18fb8f` (midpoint) | Pass |
| `49ec8955` (#466) | Fail — sandbox hangs |
| `37ca70e6` (pre-#466, docs) | Pass |

## Test plan
- [x] `vitest run test/onboard-readiness.test.ts` — 34 passed
- [ ] Nightly E2E on this branch

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Improved sandbox readiness detection to be more resilient across
different versions of the sandbox management tool, ensuring reliable
status parsing regardless of column layout variations.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Labels

observability Use this label to improve NemoClaw logging, metrics, and tracing. priority: high Important issue that should be resolved in the next release v0.0.22 Release target

Development

Successfully merging this pull request may close these issues.

[Bug] nemoclaw connect hangs indefinitely with no timeout or status indicator when sandbox is starting up or has crashed

6 participants