Skip to content

fix(onboard): use effectivePort in dashboard wait and guard forward port collision#2086

Merged
ericksoa merged 5 commits intoNVIDIA:mainfrom
Dongni-Yang:fix/hermes-coexistence-2063
Apr 20, 2026
Merged

fix(onboard): use effectivePort in dashboard wait and guard forward port collision#2086
ericksoa merged 5 commits intoNVIDIA:mainfrom
Dongni-Yang:fix/hermes-coexistence-2063

Conversation

@Dongni-Yang
Copy link
Copy Markdown
Contributor

@Dongni-Yang Dongni-Yang commented Apr 20, 2026

Closes #2063nemoclaw onboard --agent hermes fails when an existing OpenClaw sandbox is already running.

Two bugs found and fixed:

Bug 1 (onboard.ts) — The sandbox readiness wait loop hardcoded DASHBOARD_PORT (18789) when probing inside every sandbox. Hermes runs on port 8642, so the loop always timed out after 28 s and printed a misleading "Dashboard taking longer than expected" warning.

Bug 2 (ensureDashboardForward) — No guard against a port already forwarded by a different sandbox. A second onboard using the same port would silently steal the first sandbox's forward, leaving it unreachable.

Changes

File Change
src/lib/onboard.ts Use effectivePort (= agent.forwardPort for non-OpenClaw agents) instead of hardcoded DASHBOARD_PORT in the readiness wait loop
src/lib/onboard.ts ensureDashboardForward: check openshell forward list before stop/start; throw a clear error if the port belongs to a different sandbox
src/lib/gateway-state.test.ts 20 new unit tests for isGatewayHealthy, getGatewayReuseState, isSandboxReady, hasStaleGateway

Root cause

createSandbox already computes effectivePort = agent ? agent.forwardPort : CONTROL_UI_PORT but the readiness loop immediately below used DASHBOARD_PORT directly. One-token fix makes the loop consistent with the rest of the function.

Test plan

  • npm run build:cli — no TypeScript errors
  • npm test — 1634 pass, 0 new failures
  • Manual: nemoclaw onboard then nemoclaw onboard --agent hermes — both complete; openshell forward list shows ports 18789 and 8642 as distinct entries

Signed-off-by: Dongni Yang dongniy@nvidia.com

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Readiness checks now probe the actual dynamically allocated dashboard port during sandbox startup.
    • Added validation to detect and error on port-forward conflicts to prevent routing to the wrong sandbox.
  • Tests

    • Updated test stubs to reflect richer port-forward status output and refined readiness command matching, and added a test for port-forward collision handling.

…nsureDashboardForward against port collision

When onboarding a non-OpenClaw agent (e.g. Hermes, forwardPort=8642) next
to an existing OpenClaw sandbox (forwardPort=18789), the dashboard readiness
loop probed the wrong port (hardcoded DASHBOARD_PORT=18789) inside the Hermes
sandbox. This caused a spurious 28-second wait and a misleading "Dashboard
taking longer than expected" warning on every Hermes onboard, masking the
real health state.

Additionally ensureDashboardForward had no guard against a port already
forwarded by a different sandbox. If two agents share the same forward port
the first sandbox's forward would be silently stolen, leaving it unreachable.

- Replace hardcoded DASHBOARD_PORT with effectivePort (already computed as
  agent.forwardPort for non-OpenClaw agents) in the readiness wait loop
- Add a pre-check in ensureDashboardForward: inspect `openshell forward list`
  before issuing stop/start; throw a clear actionable error if the port is
  claimed by a different sandbox
- Add gateway-state.test.ts (20 tests) covering isGatewayHealthy, getGatewayReuseState,
  isSandboxReady, hasStaleGateway, and related pure classifiers

Fixes NVIDIA#2063

Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 20, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5755f9c8-ff3d-44be-8178-68f362ed7d15

📥 Commits

Reviewing files that changed from the base of the PR and between cf1c597 and 2ef0638.

📒 Files selected for processing (1)
  • test/onboard.test.ts

📝 Walkthrough

Walkthrough

Readiness checks now curl the sandbox-local dashboard URL using the dynamically chosen effectivePort. ensureDashboardForward() parses openshell forward list output to detect port-forward collisions (throws if the local port is forwarded by another sandbox); tests updated to reflect new forward-list output and stricter exec command matching.

Changes

Cohort / File(s) Summary
Dashboard probing & forward validation
src/lib/onboard.ts
Readiness loop uses http://localhost:${effectivePort}/. ensureDashboardForward() runs openshell forward list, parses status lines to detect if portToStop is forwarded by a different sandbox and errors on collision; otherwise retains stop/start forward behavior.
Test mocking adjustments
test/onboard.test.ts
Updated runner.runCapture mocks so 'forward' 'list' returns structured status lines (<sandbox> 127.0.0.1 <port> <localPort> running) and tightened sandbox exec detection to the tokenized "'sandbox' 'exec'". Added test asserting createSandbox rejects on forward collision.

Sequence Diagram

sequenceDiagram
    participant Tool as NemoClaw Tool
    participant OpenShell as OpenShell CLI
    participant Sandbox as Sandbox Instance

    Tool->>OpenShell: forward list
    OpenShell-->>Tool: list of forward entries (status lines)
    Tool->>Tool: parse entries for portToStop

    alt port forwarded to different sandbox
        Tool->>Tool: throw error (port already forwarded by other sandbox)
    else port free or owned by same sandbox
        Tool->>OpenShell: forward stop (if needed)
        OpenShell-->>Tool: ack
        Tool->>OpenShell: forward start (establish mapping)
        OpenShell-->>Tool: ack
    end

    Tool->>Sandbox: curl http://localhost:{effectivePort}/ (readiness)
    Sandbox-->>Tool: HTTP response (ready / not ready)
Loading

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 I sniffed the ports and hopped with glee,
I checked the forwards and kept them free,
A tiny curl to the effective sea,
I warned of clashes loud and spry,
Tests now match — I twitched my ear and cried "Hi!" 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: fixing dashboard port usage (effectivePort) and adding guard against port collision in forward setup.
Linked Issues check ✅ Passed The PR implementation addresses all primary coding objectives from issue #2063: using effectivePort for readiness probe and detecting port-forward collisions before starting new forwards.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the two bugs preventing Hermes sandbox creation alongside OpenClaw; test updates properly support the implementation changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

…d wait loop

Upstream changed the sandbox readiness check from runCapture (shell string)
to runCaptureOpenshell (argv array) for safety. Our fix uses effectivePort
instead of CONTROL_UI_PORT. Resolution keeps both: the safer argv form from
upstream and the agent-aware port from this branch.

Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
src/lib/gateway-state.test.ts (1)

53-73: Add explicit coverage for active-unnamed reuse-state branches.

getGatewayReuseState has active-unnamed paths (see src/lib/gateway-state.ts Lines 90-114), but this suite currently covers "healthy", "foreign-active", "stale", and "missing" only.

Suggested test additions
 describe("getGatewayReuseState", () => {
@@
   it("returns missing when no state is present", () => {
     expect(getGatewayReuseState("", "", "")).toBe("missing");
   });
+
+  it("returns active-unnamed when connected and active gateway matches but metadata is missing", () => {
+    expect(getGatewayReuseState(STATUS_CONNECTED_NEMOCLAW, "", "")).toBe("active-unnamed");
+  });
+
+  it("returns active-unnamed when active gateway info exists without reusable metadata", () => {
+    expect(getGatewayReuseState("", "", ACTIVE_INFO_NEMOCLAW)).toBe("active-unnamed");
+  });
 });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/gateway-state.test.ts` around lines 53 - 73, Add tests for the
"active-unnamed" branches of getGatewayReuseState: create inputs where an active
gateway is present but its name is empty/unnamed and assert the function returns
"active-unnamed" (e.g., call getGatewayReuseState with a connected status like
STATUS_CONNECTED_NEMOCLAW or STATUS_CONNECTED_OTHER, provide an ACTIVE_INFO
value that represents an unnamed active gateway, and appropriate GW_INFO
values), and also add the complementary case where the local gateway has no
metadata but an unnamed active gateway exists to cover both unnamed-active
paths; use the existing test style and constants (getGatewayReuseState,
STATUS_CONNECTED_NEMOCLAW, STATUS_CONNECTED_OTHER, GW_INFO_NEMOCLAW,
ACTIVE_INFO_NEMOCLAW) to locate where to add these new it(...) cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/lib/gateway-state.test.ts`:
- Line 12: The test imports the compiled artifact
("../../dist/lib/gateway-state") which can cause stale builds to be tested;
change the import to point at the TypeScript source module (the gateway-state
source) instead of dist — i.e., replace the "../../dist/lib/gateway-state"
import in src/lib/gateway-state.test.ts with the corresponding source import
(e.g., "../../src/lib/gateway-state" or "./gateway-state" depending on relative
layout) so the tests run against the live TypeScript code, and ensure your test
runner is configured to load TS sources (ts-node/ts-jest or the
tsconfig.cli.json settings).

In `@src/lib/onboard.ts`:
- Around line 4739-4741: The error message that references
NEMOCLAW_DASHBOARD_PORT is misleading for agents that specify an explicit
forwardPort in their manifest; update the string that currently includes `Port
${portToStop} is already forwarded... Set NEMOCLAW_DASHBOARD_PORT...` so it
directs users to change the agent's dashboard port override (the manifest
`forwardPort` / local dashboard URL override) instead of suggesting the
environment variable; locate where `portToStop` is used to build this message
and replace the remediation text to reference `forwardPort`/local dashboard URL
override for Hermes-style agents rather than `NEMOCLAW_DASHBOARD_PORT`.
- Around line 4736-4742: The check using existingForwards.includes is too broad;
instead parse the runCaptureOpenshell(["forward","list"]) output by lines,
locate the single line that corresponds to the port (match the token containing
`:${portToStop}` as the port column), extract that line's sandbox name field and
compare it exactly to `sandboxName`; if the extracted owner exists and is not
equal to `sandboxName`, throw the same error. Use `existingForwards.split("\n")`
(or equivalent), find the line where the port column equals `:${portToStop}`,
parse the owner/sandbox column from that line, and base the collision check on
that exact owner rather than substring matches.

---

Nitpick comments:
In `@src/lib/gateway-state.test.ts`:
- Around line 53-73: Add tests for the "active-unnamed" branches of
getGatewayReuseState: create inputs where an active gateway is present but its
name is empty/unnamed and assert the function returns "active-unnamed" (e.g.,
call getGatewayReuseState with a connected status like STATUS_CONNECTED_NEMOCLAW
or STATUS_CONNECTED_OTHER, provide an ACTIVE_INFO value that represents an
unnamed active gateway, and appropriate GW_INFO values), and also add the
complementary case where the local gateway has no metadata but an unnamed active
gateway exists to cover both unnamed-active paths; use the existing test style
and constants (getGatewayReuseState, STATUS_CONNECTED_NEMOCLAW,
STATUS_CONNECTED_OTHER, GW_INFO_NEMOCLAW, ACTIVE_INFO_NEMOCLAW) to locate where
to add these new it(...) cases.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d01f7876-76bc-418b-8814-f88697cdb433

📥 Commits

Reviewing files that changed from the base of the PR and between 67db244 and d2e8310.

📒 Files selected for processing (2)
  • src/lib/gateway-state.test.ts
  • src/lib/onboard.ts

Comment thread src/lib/gateway-state.test.ts Outdated
hasStaleGateway,
getReportedGatewayName,
isGatewayConnected,
} from "../../dist/lib/gateway-state";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Importing from dist makes these unit tests unreliable.

Line 12 imports compiled output (../../dist/...) instead of source. That can silently test stale build artifacts and miss regressions in src/lib/gateway-state.ts.

Proposed fix
 } from "vitest";
 import {
   isGatewayHealthy,
   getGatewayReuseState,
   isSandboxReady,
   hasStaleGateway,
   getReportedGatewayName,
   isGatewayConnected,
-} from "../../dist/lib/gateway-state";
+} from "./gateway-state";

As per coding guidelines, "Use TypeScript for CLI code in src/lib/ with type-checking via tsconfig.cli.json".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
} from "../../dist/lib/gateway-state";
} from "vitest";
import {
isGatewayHealthy,
getGatewayReuseState,
isSandboxReady,
hasStaleGateway,
getReportedGatewayName,
isGatewayConnected,
} from "./gateway-state";
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/gateway-state.test.ts` at line 12, The test imports the compiled
artifact ("../../dist/lib/gateway-state") which can cause stale builds to be
tested; change the import to point at the TypeScript source module (the
gateway-state source) instead of dist — i.e., replace the
"../../dist/lib/gateway-state" import in src/lib/gateway-state.test.ts with the
corresponding source import (e.g., "../../src/lib/gateway-state" or
"./gateway-state" depending on relative layout) so the tests run against the
live TypeScript code, and ensure your test runner is configured to load TS
sources (ts-node/ts-jest or the tsconfig.cli.json settings).

Comment thread src/lib/onboard.ts
Comment thread src/lib/onboard.ts Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
src/lib/onboard.ts (2)

5564-5566: ⚠️ Potential issue | 🟠 Major

Match the conflicting forward on the exact local-port entry, not with blob-level substring checks.

This guard can still misclassify ownership. With output like 9999 -> other:8642, includes(\:${portToStop}`)fires even though local port8642is free, and any unrelated line containingsandboxNamesuppresses a real collision. Parse the singleforward listrow whose **left-hand local port** equalsportToStop`, then compare that row’s owner exactly before deciding whether to throw.

Suggested fix
   const existingForwards = runCaptureOpenshell(["forward", "list"], { ignoreError: true });
-  if (existingForwards?.includes(`:${portToStop}`) && !existingForwards?.includes(sandboxName)) {
+  const conflictingOwner = String(existingForwards || "")
+    .split(/\r?\n/)
+    .map((line) => line.trim())
+    .map((line) => {
+      const match = line.match(/^(\d+)\s*->\s*([^:]+):(\d+)$/);
+      return match
+        ? { localPort: match[1], owner: match[2], remotePort: match[3] }
+        : null;
+    })
+    .find((entry) => entry && entry.localPort === String(portToStop))?.owner;
+  if (conflictingOwner && conflictingOwner !== sandboxName) {
     throw new Error(
       `Port ${portToStop} is already forwarded for another sandbox. ` +
         `Set NEMOCLAW_DASHBOARD_PORT to a different port before onboarding ` +
         `a second sandbox.`,
     );
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5564 - 5566, The current check uses
existingForwards?.includes(`:${portToStop}`) which can match any substring;
instead parse the output of runCaptureOpenshell(["forward", "list"]) into rows,
split each row on the arrow or delimiter to extract the local-port (left-hand)
value and the owner field, find the single row whose local-port equals
portToStop, and only if that row exists and its owner is not exactly equal to
sandboxName throw the Error; update the logic around existingForwards,
runCaptureOpenshell, portToStop and sandboxName to perform exact matching on the
parsed left-hand port and exact owner comparison.

5567-5569: ⚠️ Potential issue | 🟡 Minor

The remediation hint is still wrong for agent-defined dashboard ports.

For agents that set forwardPort, NEMOCLAW_DASHBOARD_PORT does not change the claimed local forward port, so this sends Hermes-style users to a no-op. Point them at the local dashboard URL/port override instead (for example CHAT_UI_URL) or at the agent-specific forward-port setting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5567 - 5569, The current error message that
references NEMOCLAW_DASHBOARD_PORT is incorrect when an agent has an explicit
forwardPort; update the message generation in the onboarding code (where
portToStop is used and the string is built) to detect agent-defined forward port
(e.g., check the agent's forwardPort/config or the presence of a forwardPort
variable) and show a different remediation: if the agent defines forwardPort,
instruct the user to change the agent's forwardPort or override the local
dashboard URL (e.g., CHAT_UI_URL) or the agent-specific forward-port setting;
otherwise keep the existing suggestion to set NEMOCLAW_DASHBOARD_PORT. Ensure
the conditional logic is applied where the string is composed so the correct
hint is emitted for both cases.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5564-5566: The current check uses
existingForwards?.includes(`:${portToStop}`) which can match any substring;
instead parse the output of runCaptureOpenshell(["forward", "list"]) into rows,
split each row on the arrow or delimiter to extract the local-port (left-hand)
value and the owner field, find the single row whose local-port equals
portToStop, and only if that row exists and its owner is not exactly equal to
sandboxName throw the Error; update the logic around existingForwards,
runCaptureOpenshell, portToStop and sandboxName to perform exact matching on the
parsed left-hand port and exact owner comparison.
- Around line 5567-5569: The current error message that references
NEMOCLAW_DASHBOARD_PORT is incorrect when an agent has an explicit forwardPort;
update the message generation in the onboarding code (where portToStop is used
and the string is built) to detect agent-defined forward port (e.g., check the
agent's forwardPort/config or the presence of a forwardPort variable) and show a
different remediation: if the agent defines forwardPort, instruct the user to
change the agent's forwardPort or override the local dashboard URL (e.g.,
CHAT_UI_URL) or the agent-specific forward-port setting; otherwise keep the
existing suggestion to set NEMOCLAW_DASHBOARD_PORT. Ensure the conditional logic
is applied where the string is composed so the correct hint is emitted for both
cases.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 75946eec-47b5-4b07-8da1-39725da3a05e

📥 Commits

Reviewing files that changed from the base of the PR and between d2e8310 and 44c6324.

📒 Files selected for processing (1)
  • src/lib/onboard.ts

…n check and test mock

- ensureDashboardForward: replace blob substring match with line-level
  parser so the collision check is scoped to the exact port entry
  (format: "<port> -> <sandbox>:<port>") rather than the full listing
- Error message now references CHAT_UI_URL instead of
  NEMOCLAW_DASHBOARD_PORT — the env var only applies to OpenClaw default
  port; agents with explicit forwardPort in their manifest need
  CHAT_UI_URL to override the local listen address
- test/onboard.test.ts: fix runCapture mock to match
  openshellShellCommand output ('\''sandbox'\'' '\''exec'\'')
  so the dashboard-ready loop exits on the first iteration instead
  of sleeping 28 s and timing out the test
- Remove duplicate src/lib/gateway-state.test.ts (superseded by
  the existing test/gateway-state.test.ts which covers the same
  functions with broader scenarios)

Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Dongni-Yang Dongni-Yang changed the title fix(#2063): use agent forwardPort in dashboard wait; guard ensureDashboardForward against port collision fix(onboard): use effectivePort in dashboard wait and guard forward port collision Apr 20, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/lib/onboard.ts (1)

5564-5575: ⚠️ Potential issue | 🟠 Major

Match the forwarded endpoint by port suffix, not bare-token equality.

Line 5572 only matches when openshell forward list prints the local endpoint as the bare port. If the CLI emits 127.0.0.1:${portToStop} or 0.0.0.0:${portToStop}, this guard misses the collision and forward stop can still steal another sandbox’s forward.

Suggested fix
   const portLine = existingForwards
     ?.split("\n")
     .map((l) => l.trim())
     .find((l) => {
-      const localPort = l.split(/\s+/)[0];
-      return localPort === portToStop;
+      const localEndpoint = l.split(/\s+/)[0] || "";
+      return localEndpoint === portToStop || localEndpoint.endsWith(`:${portToStop}`);
     });
What is the output format of `openshell forward list` for local forwards? Does the local endpoint column include the host (for example `127.0.0.1:18789` or `0.0.0.0:8642`) or only the bare port?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib/onboard.ts` around lines 5564 - 5575, The current match logic in the
block using runCaptureOpenshell(["forward", "list"]) only compares the first
token as a bare port (variable portToStop) and misses entries like
"127.0.0.1:1234"; update the find and parsing so you compare the port suffix
instead: inside the .find callback extract the local endpoint token (same
split(/\s+/)[0]), then derive its port by taking the substring after the last
":" (or the whole token if no ":"), and compare that port string to portToStop;
likewise compute portOwner from portLine the same way (keep existing portOwner
extraction from split(/\s+/)[2] but treat that token's sandbox portion the same)
so collisions where host:port is printed are detected (refer to
runCaptureOpenshell, portToStop, portLine, portOwner, sandboxName).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/lib/onboard.ts`:
- Around line 5564-5575: The current match logic in the block using
runCaptureOpenshell(["forward", "list"]) only compares the first token as a bare
port (variable portToStop) and misses entries like "127.0.0.1:1234"; update the
find and parsing so you compare the port suffix instead: inside the .find
callback extract the local endpoint token (same split(/\s+/)[0]), then derive
its port by taking the substring after the last ":" (or the whole token if no
":"), and compare that port string to portToStop; likewise compute portOwner
from portLine the same way (keep existing portOwner extraction from
split(/\s+/)[2] but treat that token's sandbox portion the same) so collisions
where host:port is printed are detected (refer to runCaptureOpenshell,
portToStop, portLine, portOwner, sandboxName).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b4f72c64-7cac-45b3-8947-96cc64c7f933

📥 Commits

Reviewing files that changed from the base of the PR and between 44c6324 and bcaa37f.

📒 Files selected for processing (2)
  • src/lib/onboard.ts
  • test/onboard.test.ts

…mocks

openshell forward list outputs SANDBOX BIND PORT PID STATUS columns.
The ensureDashboardForward parser was reading column [0] (SANDBOX name)
as the port and column [2] with colon-split for the owner, which never
matched any real output — silently disabling the collision guard.

Fix: read PORT from column [2] and SANDBOX name from column [0].
Update all test mocks from the fictional "PORT -> NAME:PORT" format
to match the real openshell output: "NAME 127.0.0.1 PORT PID STATUS".

Regression introduced in this branch; related to open issue NVIDIA#2007.

Signed-off-by: Dongni Yang <dongniy@nvidia.com>
@wscurran wscurran added bug Something isn't working fix labels Apr 20, 2026
@wscurran
Copy link
Copy Markdown
Contributor

✨ Thanks for submitting this PR that proposes a fix to the dashboard wait and forward port collision. The changes and test plan you provided will help us review this further. Possibly related to issue #2063.


Possibly closes:


Possibly related open issues:

Exercises the ensureDashboardForward collision path added in this PR:
mock `forward list` to return a different sandbox owning port 18789,
then assert createSandbox throws with the expected error message.
Copy link
Copy Markdown
Contributor

@ericksoa ericksoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — the two bug fixes are correct and well-scoped:

  1. Bug 1 (readiness loop): Trivial one-token fix replacing hardcoded CONTROL_UI_PORT with the already-computed effectivePort. Clearly correct.

  2. Bug 2 (forward collision guard): Line-level parsing of openshell forward list columns (SANDBOX BIND PORT PID STATUS) with exact port match at column 2 and sandbox-name comparison at column 0. Error message correctly references CHAT_UI_URL for agent-specific port overrides.

I pushed one additional commit adding a test that exercises the collision detection path (mock forward list returning a different sandbox name, assert the expected error). All 136 tests pass.

Minor note: PR description still references "20 new unit tests" in gateway-state.test.ts, but those were added then removed within the PR (duplicate of existing test/gateway-state.test.ts). The description could be updated but isn't blocking.

@ericksoa ericksoa merged commit ac9afba into NVIDIA:main Apr 20, 2026
7 of 8 checks passed
ericksoa added a commit that referenced this pull request Apr 21, 2026
## Summary
- TC-SBX-10 fails because PR #2086 added a port-collision guard in
`ensureDashboardForward()` but the E2E test was not updated to use a
different `CHAT_UI_URL` for sandbox B
- Sets `CHAT_UI_URL=http://127.0.0.1:18790` when onboarding the second
sandbox

## Test plan
- [ ] Nightly E2E `sandbox-operations-e2e` job passes (TC-SBX-10,
TC-SBX-11)

Fixes regression from #2086.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Tests**
  * Updated test infrastructure configuration for sandbox operations.

---

**Note:** This release contains no user-facing changes. Updates are
limited to internal testing procedures.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
ericksoa pushed a commit that referenced this pull request Apr 21, 2026
## Summary

Catches up the user-facing reference and troubleshooting docs with the
CLI and policy behavior changes that landed in v0.0.21. Drafted via the
`nemoclaw-contributor-update-docs` skill against commits in
`v0.0.20..v0.0.21`, filtered through `docs/.docs-skip`.

## Changes

- **`docs/reference/commands.md`**
- `nemoclaw list`: session indicator (●) for connected sandboxes
(#2117).
- `nemoclaw <name> connect`: active-session note; auto-recovery from SSH
identity drift after a host reboot (#2117, #2064).
- `nemoclaw <name> status`: three-state Inference line (`healthy` /
`unreachable` / `not probed`) covering both local and remote providers;
new `Connected` line (#2002, #2117).
- `nemoclaw <name> destroy` and `rebuild`: active-session warning with
second confirm; rebuild reapplies policy presets to the recreated
sandbox (#2117, #2026).
- `nemoclaw <name> policy-add` and `policy-remove`: positional preset
argument and non-interactive flow via
`--yes`/`--force`/`NEMOCLAW_NON_INTERACTIVE=1` (#2070).
- `nemoclaw <name> policy-list`: registry-vs-gateway desync detection
(#2089).
- **`docs/reference/troubleshooting.md`**
- `Reconnect after a host reboot`: now reflects automatic stale
`known_hosts` pruning on `connect` (#2064).
- `Running multiple sandboxes simultaneously`: onboard's forward-port
collision guard (#2086).
- New section: `openclaw config set` or `unset` is blocked inside the
sandbox (#2081).
- **`docs/network-policy/customize-network-policy.md`**: non-interactive
`policy-add`/`policy-remove` form; preset preservation across rebuild
(#2070, #2026).
- **`docs/inference/use-local-inference.md`**: NIM section now covers
the NGC API key prompt with masked input and `docker login nvcr.io
--password-stdin` behavior (#2043).
- **Generated skills regenerated** to pick up the source changes
(`.agents/skills/nemoclaw-user-reference/references/{commands,troubleshooting}.md`,
plus minor heading-flow deltas elsewhere). The pre-commit `Regenerate
agent skills from docs` hook ran and confirmed source ↔ generated
parity.

Commits skipped per `docs/.docs-skip` or no doc impact: `bbbaa0fb`
(skip-features), `7cb482cb` (skip-features), `8dee23fd` (skip-terms),
plus the usual CI / test / refactor / install-plumbing churn.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes for the modified files (the one
failing test, `test/cli.test.ts > unknown command exits 1`, also fails
on `origin/main` and is unrelated to these markdown-only changes)
- [ ] `npm test` passes — skipped; same pre-existing CLI-dispatch test
failure unrelated to docs
- [ ] Tests added or updated for new or changed behavior — n/a, doc-only
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only) — not run
locally
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)
— n/a, no new pages

## AI Disclosure

- [x] AI-assisted — tool: Claude Code

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
  * Multi-session SSH connections with concurrent session support.
* Three-state inference health reporting (healthy/unreachable/not
probed) across all providers.
  * Automatic SSH host key rotation detection and recovery.
  * Non-interactive policy preset management via positional arguments.
  * Session indicators in sandbox list view.

* **Bug Fixes**
  * Protected destructive operations with active-session warnings.
  * Policy presets now preserved during sandbox rebuilds.

* **Documentation**
  * NGC registry authentication requirements for container images.
  * Multi-sandbox onboarding and reconnection guidance.
  * Troubleshooting updates for port conflicts and SSH issues.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

nemoclaw fails to create hermes sandbox next to openclaw

3 participants