fix(onboard): preserve concurrent instance gateway and dashboard during onboard #4598
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
📝 Walkthrough
Gateway handler and preflight logic updated to handle concurrent gateways.
Changes: Concurrent Gateway Ports Support
Sequence Diagram(s)
sequenceDiagram
participant Workflow as Nightly CI
participant Job as concurrent-gateway-ports-e2e
participant Test as test-concurrent-gateway-ports.sh
participant FakeServer as Fake_OpenAI_Server
participant SandboxA as Sandbox_A
participant SandboxB as Sandbox_B
Workflow->>Job: Trigger (schedule or workflow_dispatch)
Job->>Job: Authenticate to Docker Hub
Job->>Test: Execute test script
Test->>FakeServer: Start fake OpenAI server
Test->>SandboxA: Onboard (gateway port 8080)
SandboxA->>SandboxA: Allocate dashboard port (e.g. 18789)
Test->>SandboxB: Onboard (alternate gateway port)
SandboxB->>SandboxB: Allocate distinct dashboard port
Test->>Test: Verify both listeners and sandbox phases
Test->>SandboxB: Destroy sandbox B
Test->>SandboxA: Verify sandbox A remains healthy
Test->>Job: Exit with results
Job->>Workflow: Upload artifacts on failure
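The dashboard-port steps in the diagram (sandbox A takes a port such as 18789, sandbox B gets a distinct one) can be pictured as a first-free-port search. This is only a sketch under stated assumptions, not NemoClaw's allocator: `pickDashboardPort`, the default base of 18789 (borrowed from the diagram's example), and the attempt limit are all hypothetical.

```typescript
// Hypothetical helper: return the first port at or above `base` that is not
// already occupied. The real allocator may use a different strategy.
function pickDashboardPort(
  occupied: ReadonlySet<number>,
  base: number = 18789,
  maxAttempts: number = 100,
): number {
  for (let offset = 0; offset < maxAttempts; offset++) {
    const candidate = base + offset;
    if (!occupied.has(candidate)) return candidate;
  }
  throw new Error(`no free dashboard port in ${base}..${base + maxAttempts - 1}`);
}

// Sandbox A is onboarded first and takes the base port.
const portA = pickDashboardPort(new Set<number>());
// Sandbox B sees A's port as occupied and must land on a different one.
const portB = pickDashboardPort(new Set<number>([portA]));
```

The point of the sketch is that the second onboard never needs to free the first sandbox's port; it only needs to know which ports are taken.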
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
🌿 Preview your docs: https://nvidia-preview-pr-4598.docs.buildwithfern.com/nemoclaw
PR Review Advisor
Findings: 2 needs attention, 4 worth checking, 0 nice ideas
This is an automated advisory review. A human maintainer must make the final merge decision.
🧹 Nitpick comments (1)
src/lib/state/gateway.ts (1)
54-82: ⚡ Quick win: Consider extracting the shared liveness predicate.
The row-liveness check `(cols.includes("Ready") || cols.includes("Running")) && !cols.includes("NotReady")` is now duplicated in `isSandboxReady` (Line 57) and `listLiveSandboxNames` (Line 77). Extracting a small `isLiveSandboxRow(cols: string[])` helper keeps the two call sites from drifting if the OpenShell status vocabulary changes.
♻️ Proposed extraction
```diff
+function isLiveSandboxRow(cols: string[]): boolean {
+  return (cols.includes("Ready") || cols.includes("Running")) && !cols.includes("NotReady");
+}
+
 export function isSandboxReady(output: string, sandboxName: string): boolean {
   const cols = parseSandboxRow(output, sandboxName);
   if (!cols) return false;
-  return (cols.includes("Ready") || cols.includes("Running")) && !cols.includes("NotReady");
+  return isLiveSandboxRow(cols);
 }
@@
   for (const line of clean.split("\n")) {
     const cols = line.trim().split(/\s+/);
     if (cols.length < 2) continue;
     const name = cols[0];
     if (!name) continue;
-    if ((cols.includes("Ready") || cols.includes("Running")) && !cols.includes("NotReady")) {
+    if (isLiveSandboxRow(cols)) {
       names.push(name);
     }
   }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/state/gateway.ts` around lines 54 - 82, The liveness predicate is duplicated between isSandboxReady and listLiveSandboxNames; extract a helper like isLiveSandboxRow(cols: string[]): boolean that implements (cols.includes("Ready") || cols.includes("Running")) && !cols.includes("NotReady"), then replace the duplicated checks in isSandboxReady (which calls parseSandboxRow) and listLiveSandboxNames (which splits lines into cols) to call isLiveSandboxRow(cols) so both sites share the single canonical predicate.
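For readers who want to see the predicate's behavior in isolation, here is a small self-contained sketch. `isLiveSandboxRow` follows the proposal in this comment; `liveNames` and the two-column sample listing are assumptions made for illustration and do not claim to match real OpenShell output.

```typescript
// Shared predicate, as proposed in the review comment.
function isLiveSandboxRow(cols: string[]): boolean {
  return (cols.includes("Ready") || cols.includes("Running")) && !cols.includes("NotReady");
}

// Hypothetical listing parser: treats the first whitespace-separated column
// as the sandbox name and keeps only rows the predicate considers live.
function liveNames(listing: string): string[] {
  const names: string[] = [];
  for (const line of listing.split("\n")) {
    const cols = line.trim().split(/\s+/);
    if (cols.length < 2) continue;
    const name = cols[0];
    if (!name) continue;
    if (isLiveSandboxRow(cols)) names.push(name);
  }
  return names;
}

// Assumed sample rows: name followed by a status token.
const sampleListing = ["alpha  Ready", "beta  NotReady", "gamma  Running"].join("\n");
const live = liveNames(sampleListing);
```

Because `includes` matches whole array elements, a `NotReady` token never satisfies the `Ready` check by substring; the explicit `!cols.includes("NotReady")` guard covers rows that carry both tokens.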
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b3649375-6721-4481-9cea-45c8b1ea1c68
📒 Files selected for processing (6)
- docs/reference/troubleshooting.mdx
- src/lib/onboard.ts
- src/lib/onboard/preflight-gateway-cleanup-decision.test.ts
- src/lib/onboard/preflight-gateway-cleanup-decision.ts
- src/lib/state/gateway.ts
- test/gateway-state.test.ts
Selective E2E Results: ❌ Some jobs failed. Run: 26734588554
…omments) Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results: ❌ Some jobs failed. Run: 26736515114
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results: ❌ Some jobs failed. Run: 26740349870
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
…ay groundwork Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results: ✅ All requested jobs passed. Run: 26745027988
Selective E2E Results: ✅ All requested jobs passed. Run: 26745392857
…cessor Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results: ✅ All requested jobs passed. Run: 26746811574
…ULT_GATEWAY_NAME Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results: ✅ All requested jobs passed. Run: 26748043704
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results: ✅ All requested jobs passed. Run: 26749185051
Selective E2E Results: ✅ All requested jobs passed. Run: 26749747697
…es + stricter writes Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results: ❌ Some jobs failed. Run: 26997817230
…erminal Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results: ❌ Some jobs failed. Run: 26998381153
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Selective E2E Results: ✅ All requested jobs passed. Run: 26999184001
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2089-2100: The code currently calls
runCaptureOpenshell(["forward","list"], {ignoreError: true, ...}) so
failures/timeouts are treated as empty output and may cause an incorrect kill;
change this to surface command failures and bail on unknown ownership: call
runCaptureOpenshell without ignoreError (or wrap it in try/catch), check for
errors/timeouts and if the command failed log a warning and skip killing the SSH
forward for that port (i.e. do not fall through to the PID kill), otherwise
proceed to call getOccupiedPorts(forwardListOutput) as before; update the
runCaptureOpenshell usage and surrounding control flow in the block that
references runCaptureOpenshell and getOccupiedPorts so ownership is only assumed
on successful command output.
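The control flow requested here can be sketched as follows. This is an illustration under assumptions, not the code in `src/lib/onboard.ts`: `listForwards` stands in for running `openshell forward list` and is assumed to throw on failure or timeout, `parseOccupiedStub` is a hypothetical stand-in for `getOccupiedPorts` with an invented one-line-per-forward format, and the `unowned` kind is a name made up for this sketch.

```typescript
type ForwardOwnership =
  | { kind: "list-failed"; reason: string } // enumeration failed: ownership unknown
  | { kind: "owned-by-live"; sandbox: string } // a live sandbox holds the port
  | { kind: "unowned" }; // hypothetical kind: no live owner found

// Resolve who owns `port` without ever treating a failed listing as "nobody".
function resolveForwardOwnership(
  port: number,
  listForwards: () => string,
  parseOccupied: (output: string) => Map<number, string>,
): ForwardOwnership {
  let output: string;
  try {
    output = listForwards();
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { kind: "list-failed", reason };
  }
  const owner = parseOccupied(output).get(port);
  return owner === undefined ? { kind: "unowned" } : { kind: "owned-by-live", sandbox: owner };
}

// Only a successfully listed, unowned forward is eligible for the PID kill.
function shouldKillForward(ownership: ForwardOwnership): boolean {
  return ownership.kind === "unowned";
}

// Hypothetical parser for an assumed "<port> <sandbox>" line format.
function parseOccupiedStub(output: string): Map<number, string> {
  const occupied = new Map<number, string>();
  for (const line of output.split("\n")) {
    const [port, sandbox] = line.trim().split(/\s+/);
    if (port && sandbox) occupied.set(Number(port), sandbox);
  }
  return occupied;
}

const owned = resolveForwardOwnership(18789, () => "18789 sandbox-a", parseOccupiedStub);
const failed = resolveForwardOwnership(
  18789,
  () => {
    throw new Error("forward list timed out");
  },
  parseOccupiedStub,
);
const stale = resolveForwardOwnership(18789, () => "", parseOccupiedStub);
```

Separating "could not list" from "listed and found no owner" is the whole fix: with `ignoreError: true` the two cases collapse into the same empty output.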
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a79fe900-00d9-45be-90a6-125ebe458b06
📒 Files selected for processing (5)
- .github/workflows/nightly-e2e.yaml
- src/lib/onboard.ts
- src/lib/onboard/machine/handlers/gateway.test.ts
- src/lib/onboard/machine/handlers/gateway.ts
- test/e2e/test-concurrent-gateway-ports.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/nightly-e2e.yaml
Selective E2E Results: ✅ All requested jobs passed. Run: 26999971280
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2083-2089: The guard currently only invokes
tryCleanupOrphanedDashboardForward when port === DASHBOARD_PORT, which skips
orphan-forward handling when a custom --control-ui-port is used; update the
condition so any ssh listener for the dashboard check is handled regardless of
the numeric DASHBOARD_PORT (for example check portCheck.process === "ssh" &&
(port === DASHBOARD_PORT || label === "dashboard") or otherwise detect the
dashboard check by its label) and then call tryCleanupOrphanedDashboardForward
with the same args; if outcome.kind === "killed-still-blocked" replace portCheck
with outcome.portCheck, else if outcome.kind !== "not-openshell" continue — keep
the existing outcome handling but remove the strict numeric DASHBOARD_PORT
requirement so createSandbox can auto-allocate a different dashboard port later.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 25898d10-0f69-49a3-a11a-9ac86433b02f
📒 Files selected for processing (6)
- .github/workflows/nightly-e2e.yaml
- src/lib/onboard.ts
- src/lib/onboard/machine/handlers/gateway.test.ts
- src/lib/onboard/orphaned-dashboard-forward.test.ts
- src/lib/onboard/orphaned-dashboard-forward.ts
- test/e2e/test-concurrent-gateway-ports.sh
🚧 Files skipped from review as they are similar to previous changes (3)
- src/lib/onboard/machine/handlers/gateway.test.ts
- .github/workflows/nightly-e2e.yaml
- test/e2e/test-concurrent-gateway-ports.sh
Selective E2E Results: ✅ All requested jobs passed. Run: 27001186985
Selective E2E Results: ✅ All requested jobs passed. Run: 27001064634
prekshivyas left a comment
Re-reviewed against the current head (69dcdef) — refreshing my earlier approval, which predated a substantial rework. Verified the fix end to end, including the primitives it depends on:
- Gateway skip: `gateway.ts` skips `retireLegacyGatewayForDockerDriverUpgrade` when `gatewayReuseState === "foreign-active"` and normalizes to `missing`, so a second onboard starts its own per-port gateway alongside instead of retiring the neighbor's. The `foreign-active` state is real production logic: `getGatewayReuseState` (`src/lib/state/gateway.ts`) returns it when a live gateway with a different name exists (`(connected || activeInfo) && activeGatewayName !== gatewayName`), i.e. exactly the concurrent-instance case. Not dead code, and it's a pure, separately-tested function.
- Dashboard forward: the extracted `tryCleanupOrphanedDashboardForward` helper only kills a forward when there is no live owner. Ownership comes from `getOccupiedPorts`, which maps port→sandbox but only for live forwards, so a stale entry is killable and a live foreign owner is protected. `list-failed` and `owned-by-live` both skip the kill (and `forward list` is deliberately allowed to throw rather than swallow to empty, preventing a wrongful kill on enumeration failure).
- Integration: the `onboard.ts` caller, inside the `for (const {port} of requiredPorts)` loop, handles every outcome correctly: `killed-cleared` / `owned-by-live` / `list-failed` → `continue` (proceed, auto-allocate a different dashboard port); `not-openshell` / `killed-still-blocked` → fall through to the port-blocked error with refreshed diagnostics.
Unit tests cover the foreign-active no-retire branch and the helper outcomes; the concurrent-gateway-ports E2E asserts two sandboxes reach Ready on distinct gateway/dashboard ports and survive each other's teardown. CI green (28 pass / 1 skip). Good to merge.
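The outcome handling described in the Integration bullet can be written down as a small dispatch table. The five outcome kinds are the ones named in this review; the `PortCheck` shape, the `NextStep` type, and `nextStepFor` are hypothetical names introduced only for this sketch and are not claimed to exist in `onboard.ts`.

```typescript
// Assumed minimal shape of a port-check result, for illustration only.
type PortCheck = { ok: boolean; process?: string; pid?: number };

// Outcome kinds as named in the review above.
type CleanupOutcome =
  | { kind: "killed-cleared" }
  | { kind: "owned-by-live" }
  | { kind: "list-failed" }
  | { kind: "not-openshell" }
  | { kind: "killed-still-blocked"; portCheck: PortCheck };

// Hypothetical description of what the caller's loop does next.
type NextStep =
  | { action: "continue" } // move on; a dashboard port is auto-allocated later
  | { action: "report-blocked"; portCheck: PortCheck }; // fall through to the port-blocked error

function nextStepFor(outcome: CleanupOutcome, current: PortCheck): NextStep {
  switch (outcome.kind) {
    case "killed-cleared":
    case "owned-by-live":
    case "list-failed":
      return { action: "continue" };
    case "killed-still-blocked":
      // Use the refreshed diagnostics returned by the helper.
      return { action: "report-blocked", portCheck: outcome.portCheck };
    case "not-openshell":
      // The helper did nothing; report with the original diagnostics.
      return { action: "report-blocked", portCheck: current };
  }
}

const original: PortCheck = { ok: false, process: "ssh", pid: 4242 };
const refreshed: PortCheck = { ok: false, process: "ssh", pid: 4343 };
const afterLiveOwner = nextStepFor({ kind: "owned-by-live" }, original);
const afterStillBlocked = nextStepFor({ kind: "killed-still-blocked", portCheck: refreshed }, original);
```

Modeling the outcome as a discriminated union lets the compiler flag any kind the caller forgets to handle.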
…ay-drift-on-live-sandbox # Conflicts: # .github/workflows/nightly-e2e.yaml
♻️ Duplicate comments (1)
src/lib/onboard.ts (1)
2083-2089: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Handle non-default dashboard ports in the orphan-forward path.
This still gates the helper on `port === DASHBOARD_PORT`, so `--control-ui-port <non-default>` skips the cleanup/ownership check and falls straight into the fatal port-blocked path even though `createSandbox()` later auto-allocates a different dashboard port. That keeps the concurrent-instance bug alive for custom dashboard ports.
💡 Suggested fix
```diff
-    if (port === DASHBOARD_PORT && portCheck.process === "ssh" && portCheck.pid) {
+    if (envVar === "NEMOCLAW_DASHBOARD_PORT" && portCheck.process === "ssh" && portCheck.pid) {
       const outcome = await tryCleanupOrphanedDashboardForward({
         port,
         pid: portCheck.pid,
         label,
         portCheckOptions,
         captureProcessArgs,
         runCaptureOpenshell,
         run,
         sleepSeconds,
         checkPortAvailable,
       });
       if (outcome.kind === "killed-still-blocked") portCheck = outcome.portCheck;
       else if (outcome.kind !== "not-openshell") continue;
     }
```
As per coding guidelines, `src/lib/onboard.ts`: "This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 2083 - 2089, The code currently only calls tryCleanupOrphanedDashboardForward when port === DASHBOARD_PORT, which skips orphan-forward cleanup for custom dashboard ports; remove that gate and always invoke tryCleanupOrphanedDashboardForward (passing the current port, pid, label, portCheckOptions, captureProcessArgs, runCaptureOpenshell, run, sleepSeconds, checkPortAvailable) so any non-default dashboard port is checked/cleaned before falling into the fatal blocked-port path; keep the existing outcome handling (use outcome.portCheck when kind === "killed-still-blocked" and continue for every other kind except "not-openshell").
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 135a8518-cc25-4a94-b715-3b97eeef124d
📒 Files selected for processing (2)
- .github/workflows/nightly-e2e.yaml
- src/lib/onboard.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/nightly-e2e.yaml
## Summary

- Adds the `v0.0.60` section to `docs/about/release-notes.mdx` using the dev announcement from discussion #4877.
- Fills the source-doc gaps found during release-prep review across inference, policy tiers, command behavior, security boundaries, Hermes dashboard/tooling, runtime context, and troubleshooting.
- Refreshes generated agent skills under `.agents/skills/` from the current Fern docs output and upgrades Fern from `5.44.3` to `5.45.0`.

## Source summary

- #4037 -> `docs/reference/architecture.mdx`, `docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents system-only runtime context that stays out of visible chat.
- #4875 -> `docs/reference/architecture.mdx`, `docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents try-first sandbox network/filesystem guidance and clearer failure classification.
- #4788 -> `docs/security/best-practices.mdx`, `docs/about/release-notes.mdx`: Documents shared OpenClaw device-approval policy for startup and connect.
- #4768 -> `docs/reference/network-policies.mdx`, `docs/network-policy/integration-policy-examples.mdx`, `docs/get-started/quickstart.mdx`, `docs/get-started/quickstart-hermes.mdx`, `docs/reference/commands.mdx`: Documents `weather`, `public-reference`, and Hermes managed-tool gateway preset behavior.
- #3788 and #4864 -> `docs/reference/network-policies.mdx`, `docs/reference/commands.mdx`: Documents non-interactive policy-tier fail-fast behavior and interactive prompt fallback.
- #4756 and #4866 -> `docs/reference/commands.mdx`: Documents env-aware default sandbox resolution for `list`, `status`, and `tunnel` commands.
- #4320 -> `docs/reference/commands.mdx`: Documents `$$nemoclaw tunnel status` behavior.
- #4328 -> `docs/reference/commands.mdx`: Documents line-scoped policy preset descriptions in `policy-list`.
- #4580 and #4748 -> `docs/reference/architecture.mdx`: Documents package-managed OpenShell gateway service and Docker-driver gateway-marker behavior.
- #4598 -> `docs/manage-sandboxes/lifecycle.mdx`: Documents concurrent gateway/dashboard cleanup isolation by sandbox name and port.
- #4777 -> `docs/reference/troubleshooting.mdx`: Documents Docker GPU patch rollback behavior.
- #4610 -> `docs/reference/troubleshooting.mdx`, `docs/reference/commands.mdx`: Keeps mutable OpenClaw config permission guidance aligned and removes skipped experimental wording.
- #4868 -> `docs/reference/commands.mdx`: Keeps `.dockerignore` handling for custom `onboard --from <Dockerfile>` contexts in generated skills.
- #4870 -> `docs/reference/commands.mdx`, `docs/manage-sandboxes/runtime-controls.mdx`: Documents `NEMOCLAW_MINIMAL_BOOTSTRAP` and generated skill coverage.
- #4641 -> `docs/inference/inference-options.mdx`, `docs/reference/troubleshooting.mdx`: Documents local NVIDIA NIM platform-digest pulls and served-model id adoption.
- #4810 and #4867 -> `docs/inference/inference-options.mdx`: Documents stable NGC managed-vLLM image lineage and DGX Station DeepSeek V4 Flash coverage.
- #4852 -> `docs/inference/use-local-inference.mdx`, `docs/reference/troubleshooting.mdx`: Documents Ollama model fit filtering, 16K context floor, cold-load retry, and failed-model exclusion.
- #4847 -> `docs/inference/switch-inference-providers.mdx`: Documents API-family sync, Hermes `api_mode`, and Bedrock Runtime exception.
- #4800 -> `docs/inference/tool-calling-reliability.mdx`: Documents Nemotron managed-inference native tool-search fallback.
- #4333 -> `docs/inference/switch-inference-providers.mdx`: Documents interactive multimodal input prompting.
- #4086 -> `docs/reference/troubleshooting.mdx`: Keeps proxy bypass normalization in generated troubleshooting coverage.
- #4811 and #4855 -> `docs/get-started/quickstart-hermes.mdx`: Documents prebuilt Hermes dashboard assets and TUI recovery without runtime rebuilds.
- #4854 -> `docs/inference/switch-inference-providers.mdx`, `docs/reference/commands.mdx`: Documents Hermes proxy API-key placeholder preservation during inference switches.
- #4248 -> `docs/manage-sandboxes/messaging-channels.mdx`, `.agents/skills/`: Keeps messaging enrollment behavior aligned with manifest-hook implementation.
- #4771 -> `docs/security/best-practices.mdx`, `docs/security/credential-storage.mdx`: Documents Hermes placeholder-only secret boundary for sandbox-visible runtime files.
- #4787 -> `docs/security/best-practices.mdx`, `docs/about/release-notes.mdx`: Documents expanded memory scanner examples for OpenAI project keys and Slack app-level tokens.
- #4848 -> `docs/reference/commands.mdx`: Documents OpenClaw skill install mirroring into the agent home directory.
- #4790 -> `docs/about/release-notes.mdx`: Uses the prior release-prep structure and generated `.agents/skills/` refresh as the template for this release.

## Verification

- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx`
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ skills/ --prefix nemoclaw-user --doc-platform fern-mdx --dry-run`
- `npm run docs`
- `git diff --check`
- skip-term scan across `docs/`, `.agents/skills/`, and `skills/`
- `npm run build:cli`
- `npm run typecheck:cli`
- Commit and pre-push hook suites, including markdownlint, gitleaks, env-var docs gate, docs-to-skills verification, and skills YAML tests

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * DeepSeek-V4-Flash now available as default inference model for DGX Station.
  * Hermes dashboard improved with dedicated port and OAuth-authenticated tool gateway selection.
  * Added weather and public-reference policy presets for expanded agent capabilities.
  * Enhanced Ollama model selection with GPU memory filtering and automatic retry for timeouts.
* **Bug Fixes**
  * Improved policy tier validation to prevent invalid configurations.
  * Better sandbox cleanup scoping by port to prevent conflicts across deployments.
  * Added GPU patch failure recovery with automatic rollback.
* **Documentation**
  * Expanded troubleshooting guides for inference, security, and sandbox lifecycle.
  * Added .dockerignore best practices for custom deployments.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
Two preflight cleanup paths assumed the OpenShell gateway and dashboard forward were process-wide singletons. When a second NemoClaw onboard ran with `NEMOCLAW_GATEWAY_PORT=N`, the preflight retired the existing per-port gateway as "legacy" and killed the first sandbox's dashboard SSH forward, leaving the first sandbox unreachable. This PR scopes both cleanups so the second instance starts its own gateway alongside the first instead of replacing it.

Related Issue
Fixes #4422 · Refs #3053

#4422 is the specific SIGKILL-on-second-onboard symptom: a second onboard with `NEMOCLAW_GATEWAY_PORT=N` destroyed the previous instance's per-port gateway and dashboard forward. This PR fixes both preflight cleanups so concurrent onboards no longer step on each other.

#3053 is the broader ask: full multi-instance segregation of registry, credentials, snapshots, messaging, and lifecycle behind a configurable `NEMOCLAW_INSTANCE` identity. That work is out of scope here and tracked separately; this PR removes the destructive cross-talk that previously prevented two NemoClaw-managed sandboxes from coexisting at all, but does not yet introduce the instance identity primitive.

Changes
- `src/lib/onboard/machine/handlers/gateway.ts`: skip `retireLegacyGatewayForDockerDriverUpgrade` when `gatewayReuseState === "foreign-active"`. A foreign-active gateway is another sandbox's per-port `nemoclaw-<port>`, not legacy state to retire. Normalises to `"missing"` so the current onboard proceeds with its own per-port gateway alongside.
- `src/lib/onboard.ts`: dashboard-port preflight no longer kills an "orphaned SSH port-forward" when `openshell forward list` shows the port is held by another live sandbox. The runtime allocator picks a different dashboard port for this sandbox at create time instead.
- `src/lib/onboard/machine/handlers/gateway.test.ts`: unit test for the foreign-active no-retire branch.
- `test/e2e/test-concurrent-gateway-ports.sh`: new E2E that onboards two sandboxes (default + `NEMOCLAW_GATEWAY_PORT=18080`), asserts both reach `Ready`, distinct gateway ports (8080 + 18080), distinct dashboard ports (18789 + 18790), and that destroying one leaves the other intact. Each sandbox is queried via its own gateway with `openshell sandbox list -g <gateway-name>` so the global active-gateway pointer does not flip the read.
- `.github/workflows/nightly-e2e.yaml`: registers `concurrent-gateway-ports-e2e` in the dispatchable-jobs catalog, `needs` lists, and the advisor comment block. Also documents existing `openclaw-skill-cli-e2e` and `channels-add-remove-e2e` in the catalog so the PR-review E2E advisor surfaces them when relevant changes land. This catches up leftover automation from PRs #4766 (#4709, OpenClaw skill CLI) and #4745 (#3895, channels add/remove), where the tests shipped but were never advertised to the advisor.
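The gateway-side change can be illustrated with a short sketch. It is an approximation under assumptions rather than the handler's code: the foreign-active condition is the one quoted in review for `getGatewayReuseState`, while `isForeignActive`, `planGatewayStep`, the `mayRetireLegacy` flag, and the catch-all `"other"` state are hypothetical names standing in for logic this PR does not spell out.

```typescript
// "foreign-active" and "missing" are named in this PR; "other" is a
// placeholder for every reuse state not described here.
type ReuseState = "foreign-active" | "missing" | "other";

// Condition quoted in review: a live gateway exists under a different name.
function isForeignActive(
  connected: boolean,
  activeInfo: boolean,
  activeGatewayName: string,
  gatewayName: string,
): boolean {
  return (connected || activeInfo) && activeGatewayName !== gatewayName;
}

// Hypothetical planning step: a foreign-active gateway belongs to another
// sandbox, so legacy retirement is skipped and the state is treated as
// "missing" so this onboard starts its own per-port gateway.
function planGatewayStep(state: ReuseState): { mayRetireLegacy: boolean; effectiveState: ReuseState } {
  if (state === "foreign-active") {
    return { mayRetireLegacy: false, effectiveState: "missing" };
  }
  return { mayRetireLegacy: true, effectiveState: state };
}

// Second onboard on port 18080 while the first instance's gateway is live.
const foreign = isForeignActive(true, false, "nemoclaw-8080", "nemoclaw-18080");
const plan = planGatewayStep(foreign ? "foreign-active" : "other");
```

Normalising to `"missing"` reuses the existing "no gateway yet" path for the second instance instead of adding a separate start-alongside branch.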
Verification
- `npx prek run --all-files` passes
- `npm test` passes
- `npm run docs` builds without warnings (doc changes only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>