fix(onboard): preserve concurrent instance gateway and dashboard during onboard by laitingsheng · Pull Request #4598 · NVIDIA/NemoClaw

laitingsheng · 2026-06-01T04:02:42Z

Summary

Two preflight cleanup paths assumed the OpenShell gateway and dashboard forward were process-wide singletons. When a second NemoClaw onboard ran with NEMOCLAW_GATEWAY_PORT=N, the preflight retired the existing per-port gateway as "legacy" and killed the first sandbox's dashboard SSH forward — leaving the first sandbox unreachable. This PR scopes both cleanups so the second instance starts its own gateway alongside the first instead of replacing it.

Related Issue

Fixes #4422 · Refs #3053

#4422 is the specific SIGKILL-on-second-onboard symptom: a second onboard with NEMOCLAW_GATEWAY_PORT=N destroyed the previous instance's per-port gateway and dashboard forward. This PR fixes both preflight cleanups so concurrent onboards no longer step on each other.

#3053 is the broader ask — full multi-instance segregation of registry, credentials, snapshots, messaging, and lifecycle behind a configurable NEMOCLAW_INSTANCE identity. That work is out of scope here and tracked separately; this PR removes the destructive cross-talk that previously prevented two NemoClaw-managed sandboxes from coexisting at all, but does not yet introduce the instance identity primitive.

Changes

src/lib/onboard/machine/handlers/gateway.ts: skip retireLegacyGatewayForDockerDriverUpgrade when gatewayReuseState === "foreign-active". A foreign-active gateway is another sandbox's per-port nemoclaw-<port> — not legacy state to retire. Normalises to "missing" so the current onboard proceeds with its own per-port gateway alongside.
src/lib/onboard.ts: dashboard-port preflight no longer kills an "orphaned SSH port-forward" when openshell forward list shows the port is held by another live sandbox. The runtime allocator picks a different dashboard port for this sandbox at create time instead.
src/lib/onboard/machine/handlers/gateway.test.ts: unit test for the foreign-active no-retire branch.
test/e2e/test-concurrent-gateway-ports.sh: new E2E that onboards two sandboxes (default + NEMOCLAW_GATEWAY_PORT=18080), asserts both reach Ready, distinct gateway ports (8080 + 18080), distinct dashboard ports (18789 + 18790), and that destroying one leaves the other intact. Each sandbox is queried via its own gateway with openshell sandbox list -g <gateway-name> so the global active-gateway pointer does not flip the read.
.github/workflows/nightly-e2e.yaml: registers concurrent-gateway-ports-e2e in the dispatchable-jobs catalog, needs lists, and the advisor comment block. Also documents the existing openclaw-skill-cli-e2e and channels-add-remove-e2e jobs in the catalog so the PR-review E2E advisor surfaces them when relevant changes land. This catches up leftover automation from PRs #4766 (#4709, OpenClaw skill CLI) and #4745 (#3895, channels add/remove), where the tests shipped but were never advertised to the advisor.
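The gateway-handler change described above can be sketched as a small pure function. This is only an illustration under stated assumptions: `planGatewayStart`, `GatewayPlan`, and the `"healthy"` and `"stale"` state names are hypothetical, and `mayRetireLegacy` stands in for whatever conditions the real handler applies before calling `retireLegacyGatewayForDockerDriverUpgrade`. Only `"foreign-active"` and `"missing"` come from the PR description.

```typescript
// Hypothetical state set: only "foreign-active" and "missing" appear in the PR text.
type GatewayReuseState = "healthy" | "stale" | "missing" | "foreign-active";

interface GatewayPlan {
  // Whether the legacy-retirement step is allowed to run at all (assumed flag).
  mayRetireLegacy: boolean;
  // The state the rest of the handler should act on.
  effectiveState: GatewayReuseState;
}

function planGatewayStart(state: GatewayReuseState): GatewayPlan {
  if (state === "foreign-active") {
    // Another sandbox's per-port gateway is live. It is not legacy state, so
    // skip retirement and treat our own gateway as missing so it gets started
    // alongside the existing one.
    return { mayRetireLegacy: false, effectiveState: "missing" };
  }
  // Every other state keeps its existing behavior in this sketch.
  return { mayRetireLegacy: true, effectiveState: state };
}
```

Keeping the decision in a pure function like this would make the no-retire branch easy to cover in a unit test without starting a gateway.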

Type of Change

Code change (feature, bug fix, or refactor)
Code change with doc updates
Doc only (prose changes, no code sample modifications)
Doc only (includes code sample changes)

Verification

npx prek run --all-files passes
npm test passes
Tests added or updated for new or changed behavior
No secrets, API keys, or credentials committed
Docs updated for user-facing behavior changes
npm run docs builds without warnings (doc changes only)
Doc pages follow the style guide (doc changes only)
New doc pages include SPDX header and frontmatter (new pages only)

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

Summary by CodeRabbit

New Features
- Manage concurrent gateway ports safely across multiple sandboxes on the same host.
Bug Fixes
- Improved cleanup for orphaned SSH port-forwards that block dashboard ports.
Tests
- Added E2E test validating concurrent gateway-port scenarios.
- Added/updated unit tests for gateway-state and orphaned-forward handling.
Chores
- Added nightly E2E workflow job for concurrent gateway port testing and integrated it into reporting.
Documentation
- Expanded nightly E2E job documentation for related tests.


coderabbitai · 2026-06-01T04:02:53Z

📝 Walkthrough


Gateway handler and preflight logic updated to handle concurrent gateways (foreign-active case) and delegate orphaned OpenShell forward cleanup to a new helper; unit tests added; a comprehensive E2E script validates concurrent gateway/dashboard allocation; nightly workflow runs the test and uploads failure artifacts.

Changes

Concurrent Gateway Ports Support

| Layer / File(s) | Summary |
| --- | --- |
| Gateway handler foreign-active state handling `src/lib/onboard/machine/handlers/gateway.ts`, `src/lib/onboard/machine/handlers/gateway.test.ts` | When `gatewayReuseState` is `foreign-active`, legacy metadata retirement is skipped and the state is normalized to `missing` before starting the Docker-driver gateway. Unit test verifies retirement and legacy-replacement note are bypassed while the gateway starts. |
| Orphaned dashboard forward helper and tests `src/lib/onboard/orphaned-dashboard-forward.ts`, `src/lib/onboard/orphaned-dashboard-forward.test.ts` | Adds `tryCleanupOrphanedDashboardForward` plus DI types and runners; implements outcome classifications (not-openshell, list-failed, owned-by-live, killed-cleared, killed-still-blocked) and tests covering each classification and control flow. |
| Preflight port conflict awareness and wiring `src/lib/onboard.ts` | Preflight now calls `tryCleanupOrphanedDashboardForward` when a dashboard port is blocked by an SSH listener, refreshing port checks only if cleanup freed the port and otherwise following the existing port-unavailable error path. |
| Concurrent gateway ports E2E test `test/e2e/test-concurrent-gateway-ports.sh` | New bash E2E that starts a fake OpenAI server, ensures no default-install sandbox, onboards Sandbox A on 8080 and Sandbox B on a separate gateway port, verifies distinct dashboard ports, listeners, `nemoclaw list` contents, and that destroying B preserves A. |
| Nightly E2E workflow integration `.github/workflows/nightly-e2e.yaml` | Workflow docs and `workflow_dispatch` inputs updated; new job `concurrent-gateway-ports-e2e` added with repo-guard and selective-dispatch; job runs checkout, Docker Hub auth, NemoClaw install, invokes the E2E script, and uploads sandbox A/B onboard and sandbox B destroy artifacts on failure; job wired into `notify-on-failure`, `report-to-pr`, and `scorecard` needs. |
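The five outcome classifications listed for the orphaned-forward helper can be illustrated with a discriminated union and injected dependencies. This is a sketch, not the code in `orphaned-dashboard-forward.ts`: the function name `classifyBlockedDashboardPort`, the `ForwardDeps` interface, and every method on it are hypothetical stand-ins for the helper's real DI types and runners. Only the five `kind` strings come from the walkthrough.

```typescript
// Outcome kinds are taken from the walkthrough; the payload fields are assumed.
type ForwardOutcome =
  | { kind: "not-openshell" }
  | { kind: "list-failed" }
  | { kind: "owned-by-live"; sandbox: string }
  | { kind: "killed-cleared" }
  | { kind: "killed-still-blocked" };

// Hypothetical dependency surface so the logic can be tested without real processes.
interface ForwardDeps {
  isOpenShellForward(pid: number): boolean;
  listForwards(): Array<{ port: number; sandbox: string }> | null; // null = listing failed
  liveSandboxNames(): string[];
  kill(pid: number): void;
  isPortFree(port: number): boolean;
}

function classifyBlockedDashboardPort(
  port: number,
  pid: number,
  deps: ForwardDeps,
): ForwardOutcome {
  // Never touch a listener that is not an OpenShell SSH forward.
  if (!deps.isOpenShellForward(pid)) return { kind: "not-openshell" };

  // Without a forward list, ownership is unknown, so do not kill anything.
  const forwards = deps.listForwards();
  if (forwards === null) return { kind: "list-failed" };

  // A forward that belongs to a live sandbox is not orphaned.
  const live = new Set(deps.liveSandboxNames());
  const owner = forwards.find((f) => f.port === port && live.has(f.sandbox));
  if (owner) return { kind: "owned-by-live", sandbox: owner.sandbox };

  // Only now is the forward treated as orphaned and killed.
  deps.kill(pid);
  return deps.isPortFree(port)
    ? { kind: "killed-cleared" }
    : { kind: "killed-still-blocked" };
}
```

The ordering matters in this sketch: every branch that cannot prove the forward is orphaned returns before `kill` is reached.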

Sequence Diagram(s)

sequenceDiagram
  participant Workflow as Nightly CI
  participant Job as concurrent-gateway-ports-e2e
  participant Test as test-concurrent-gateway-ports.sh
  participant FakeServer as Fake_OpenAI_Server
  participant SandboxA as Sandbox_A
  participant SandboxB as Sandbox_B

  Workflow->>Job: Trigger (schedule or workflow_dispatch)
  Job->>Job: Authenticate to Docker Hub
  Job->>Test: Execute test script
  Test->>FakeServer: Start fake OpenAI server
  Test->>SandboxA: Onboard (gateway port 8080)
  SandboxA->>SandboxA: Allocate dashboard port (e.g. 18789)
  Test->>SandboxB: Onboard (alternate gateway port)
  SandboxB->>SandboxB: Allocate distinct dashboard port
  Test->>Test: Verify both listeners and sandbox phases
  Test->>SandboxB: Destroy sandbox B
  Test->>SandboxA: Verify sandbox A remains healthy
  Test->>Job: Exit with results
  Job->>Workflow: Upload artifacts on failure

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

fix, Sandbox, Docker, E2E, platform: container

Suggested reviewers

prekshivyas
cv

Poem

🐰 Two gateways hop, each on its own port,
Dashboards find homes, no collisions to thwart.
Tests spin a fake server to keep the beat,
Nightly CI listens and gathers each log sheet.
A rabbit cheers: concurrent and neat!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 24.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: preventing concurrent instance gateway/dashboard destruction during onboard when using different gateway ports. |
| Linked Issues check | ✅ Passed | The PR implements core requirements from `#4422`: skip retiring legacy gateway when gatewayReuseState is 'foreign-active' so concurrent instances preserve each other's gateways; preserve dashboard ports by checking if SSH forward is owned by another live sandbox before killing it; and adds E2E verification. |
| Out of Scope Changes check | ✅ Passed | All changes directly address the linked issue `#4422`: gateway.ts/onboard.ts handle concurrent gateway/dashboard preservation, orphaned-dashboard-forward.ts/test implements cleanup logic, E2E test validates concurrent instances, and CI workflow registers the test. |


github-actions · 2026-06-01T04:03:10Z

🌿 Preview your docs: https://nvidia-preview-pr-4598.docs.buildwithfern.com/nemoclaw

github-actions · 2026-06-01T04:06:44Z

PR Review Advisor

Findings: 2 needs attention, 4 worth checking, 0 nice ideas
Since last review: 1 prior item resolved, 5 still apply, 0 new items found

Review findings

🛠️ Needs attention

Withhold NVIDIA_API_KEY from checked-out target_ref code (.github/workflows/nightly-e2e.yaml:2096): The new concurrent-gateway-ports job reuses the workflow's target-ref checkout, which can run code from `${{ inputs.target_ref || github.ref }}` during workflow_dispatch, but still passes the repository NVIDIA_API_KEY secret to both checked-out `install.sh` and the checked-out E2E script. A selected PR/head ref could modify either file and exfiltrate the secret.
- Recommendation: Remove NVIDIA_API_KEY from this job if the fake OpenAI endpoint is sufficient, or gate it the same way as Docker Hub credentials so it is empty when `github.event_name == 'workflow_dispatch' && inputs.target_ref != ''`. Add a static workflow guard for this trusted-code boundary.
- Evidence: The checkout anchor uses `ref: ${{ inputs.target_ref || github.ref }}` and nearby comments already say explicit target_ref can execute untrusted PR-head code. The new job passes `NVIDIA_API_KEY: ${{ secrets.NVIDIA_API_KEY }}` to `bash install.sh` and to `bash test/e2e/test-concurrent-gateway-ports.sh`, while the script uses a local fake endpoint with `COMPATIBLE_API_KEY=dummy`.
Exercise the remaining #4422 coexistence clauses (test/e2e/test-concurrent-gateway-ports.sh:319): The new E2E validates core coexistence, distinct gateway/dashboard ports, list output, and destroying B leaves A healthy, but it does not exercise #4422's explicit `connect` requirement or independent agent-state/no-cross-talk requirement. Because the PR is marked Fixes #4422, those untested clauses leave acceptance incomplete.
- Recommendation: Extend the runtime scenario, or add targeted integration coverage, to run `nemoclaw sandbox-a connect` and `nemoclaw sandbox-b connect` through their own gateways and verify A's provider/credential/model/policy/config state remains unchanged after B onboard.
- Evidence: The #4422 Expected Result includes '`nemoclaw sandbox-a connect` and `nemoclaw sandbox-b connect` both work' and 'Independent agent state, no cross-talk'. The script verifies Ready/Running, listener ports, `nemoclaw list`, and destroy-B-leaves-A-ready, but does not call `connect` or inspect provider/credential/model/policy state.

🔎 Worth checking

Source-of-truth review needed: src/lib/onboard/orphaned-dashboard-forward.ts: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: The helper documents outcome meanings and safe skip behavior, but no comment or tracking reference identifies the source-level fix or removal condition.
Clarify strict behavior for explicit dashboard-port conflicts (src/lib/onboard.ts:2083): The orphaned-forward cleanup path treats `owned-by-live` and `list-failed` as a continue path from preflight, relying on later dashboard auto-allocation. That is safe for default auto-allocation, but the same branch is reached when `--control-ui-port` or a present `NEMOCLAW_DASHBOARD_PORT` selected a specific port, where users may expect that exact port to be honored or the command to fail.
- Recommendation: Define and test the contract for explicit dashboard ports held by a live foreign sandbox or when `openshell forward list` fails. If explicit ports must be strict, fall through to the generic port-blocked error instead of continuing; if auto-allocation is intended, document it and add tests.
- Evidence: `_preflightDashboardPort = opts.controlUiPort ?? (process.env.NEMOCLAW_DASHBOARD_PORT != null ? DASHBOARD_PORT : null)` makes explicit env/CLI ports enter this preflight check. The new `outcome.kind !== "not-openshell"` branch continues for `owned-by-live` and `list-failed`.
Document the removal condition for orphaned-forward cleanup (src/lib/onboard/orphaned-dashboard-forward.ts:42): The new helper is a localized recovery/workaround around OpenShell forward state. It documents the invalid states and has useful negative-path tests, but it does not state when the workaround can be removed, which risks permanent source-of-truth drift instead of fixing forward ownership/lifecycle at the source.
- Recommendation: Add a short comment or tracking reference explaining the source-level fix and removal condition, such as when OpenShell/NemoClaw records per-sandbox dashboard-forward ownership and reconciles stale forwards at create/destroy time rather than inferring ownership from process command lines.
- Evidence: `orphaned-dashboard-forward.ts` explains `not-openshell`, `list-failed`, `owned-by-live`, and kill outcomes, and tests cover those branches. No code comment identifies the upstream/source fix or removal condition.
Add a workflow guard for target_ref secret withholding (.github/workflows/nightly-e2e.yaml:2091): The workflow already encodes a trusted-code boundary for Docker Hub credentials, but the new dispatchable job bypasses that pattern for NVIDIA_API_KEY. Without a static guard, future jobs can repeat the same pattern even after this job is fixed.
- Recommendation: Add or extend workflow lint/static tests to assert that jobs using the target-ref checkout do not pass repository secrets to run steps when workflow_dispatch has a non-empty `target_ref`, unless the job checks out trusted workflow-ref scripts and documents the boundary.
- Evidence: The `dockerhub-auth-step` is gated with `if: ${{ github.event_name != 'workflow_dispatch' || inputs.target_ref == '' }}`, while this new job directly injects `NVIDIA_API_KEY` into checked-out code.
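A static guard of the kind recommended above might look like the following sketch. It assumes the workflow has already been parsed into plain objects. The `JobLike` and `StepLike` shapes, the `usesTargetRefCheckout` flag, and the function name are hypothetical, and the gate is detected by a simple substring match on the same `if:` expression the review quotes for `dockerhub-auth-step`. A real lint would need to handle YAML anchors and equivalent expressions.

```typescript
// Hypothetical, simplified view of a parsed workflow job.
interface StepLike {
  run?: string;
  env?: Record<string, string>;
  if?: string;
}
interface JobLike {
  usesTargetRefCheckout: boolean; // assumed to be derived elsewhere from the checkout step
  env?: Record<string, string>;
  steps: StepLike[];
}

const SECRET_REF = /\$\{\{\s*secrets\.[A-Za-z0-9_]+\s*\}\}/;
// The gate expression quoted in the review for the Docker Hub auth step.
const TRUSTED_GATE =
  "github.event_name != 'workflow_dispatch' || inputs.target_ref == ''";

function hasSecret(env: Record<string, string> | undefined): boolean {
  return Object.values(env ?? {}).some((value) => SECRET_REF.test(value));
}

// Returns "job#stepIndex" for each run step that can see a repository secret
// while executing target_ref code without the trusted-code gate.
function findUngatedSecretSteps(jobs: Record<string, JobLike>): string[] {
  const violations: string[] = [];
  for (const [name, job] of Object.entries(jobs)) {
    if (!job.usesTargetRefCheckout) continue;
    const jobLevelSecret = hasSecret(job.env);
    job.steps.forEach((step, index) => {
      if (step.run === undefined) return; // only run steps execute checked-out code
      const exposed = jobLevelSecret || hasSecret(step.env);
      const gated = (step.if ?? "").includes(TRUSTED_GATE);
      if (exposed && !gated) violations.push(`${name}#${index}`);
    });
  }
  return violations;
}
```

A test could feed the parsed nightly workflow through this function and assert that the returned list is empty.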

🌱 Nice ideas

None.

Consider writing more tests for

**Runtime validation:** Workflow static guard: workflow_dispatch with non-empty target_ref does not pass NVIDIA_API_KEY or other repository secrets to scripts checked out from target_ref. The changed surfaces involve workflow trusted-code boundaries, host port cleanup, OpenShell gateway lifecycle, and real sandbox coexistence. Unit tests cover important helper branches, and the E2E covers core coexistence, but several linked #4422 clauses and the workflow secret boundary still need behavior-specific validation.
**Runtime validation:** Concurrent sandboxes can both `nemoclaw sandbox-a connect` and `nemoclaw sandbox-b connect` through their own gateways after the second onboard.
**Runtime validation:** Both dashboard URLs respond on the host after the second onboard, not only that distinct ports appear in `nemoclaw list`.
**Runtime validation:** Sandbox A's provider credential, model, policy, and generated config remain unchanged after sandbox B onboards with its own gateway port.
**Correctness validation:** Explicit `--control-ui-port` held by a live foreign forward fails or auto-allocates according to the documented contract.
**Add a workflow guard for target_ref secret withholding:** Add or extend workflow lint/static tests to assert that jobs using the target-ref checkout do not pass repository secrets to run steps when workflow_dispatch has a non-empty `target_ref`, unless the job checks out trusted workflow-ref scripts and documents the boundary.
**Acceptance clause:** #4422: "First gateway still listening on 8080 + dashboard on 18789 + sandbox-a container Up + sandbox-a `Phase=Ready`". Add test evidence or identify existing coverage. The E2E verifies A's gateway port 8080 is listening, A's dashboard port parsed from `nemoclaw list` is 18789, and A reaches Ready/Running through the default gateway. It does not directly inspect the Docker container status beyond OpenShell phase.
**Acceptance clause:** #4422: "Second gateway listening on 8081 + dashboard on 18790 (auto-allocated) + sandbox-b container Up + sandbox-b `Phase=Ready`". Add test evidence or identify existing coverage. The E2E verifies B's configured gateway port is listening, B gets a distinct dashboard port, and B reaches Ready/Running, but defaults to 18080 rather than the literal 8081 and does not assert dashboard port 18790 exactly.
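The explicit-port question raised in the review could be pinned down with a small decision table. The sketch below shows the strict option only, one of the two contracts the advisor offers, and is not a statement of what the PR implements. `decidePreflightAction`, `PreflightAction`, and the `portWasExplicit` flag are hypothetical names, and routing `not-openshell` and `killed-still-blocked` to the generic error is an assumption based on the review's description of the existing path.

```typescript
type OutcomeKind =
  | "not-openshell"
  | "list-failed"
  | "owned-by-live"
  | "killed-cleared"
  | "killed-still-blocked";

// Hypothetical action names for what preflight does next.
type PreflightAction = "recheck-port" | "auto-allocate" | "fail-port-blocked";

// Strict contract sketch: a port the user asked for by flag or environment
// variable is either honored or the command fails. Only default ports may
// fall through to later auto-allocation.
function decidePreflightAction(
  kind: OutcomeKind,
  portWasExplicit: boolean,
): PreflightAction {
  switch (kind) {
    case "killed-cleared":
      // Cleanup freed the port, so the port check is worth repeating.
      return "recheck-port";
    case "owned-by-live":
    case "list-failed":
      return portWasExplicit ? "fail-port-blocked" : "auto-allocate";
    case "not-openshell":
    case "killed-still-blocked":
      // Assumed to follow the existing port-unavailable error path.
      return "fail-port-blocked";
  }
}
```

If auto-allocation is the intended contract instead, the `portWasExplicit` branch collapses to `"auto-allocate"` and the table becomes the thing to document and test.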

This is an automated advisory review. A human maintainer must make the final merge decision.

github-actions · 2026-06-01T04:06:56Z

E2E Advisor Recommendation

Required E2E: concurrent-gateway-ports-e2e
Optional E2E: double-onboard-e2e, tunnel-lifecycle-e2e, sandbox-survival-e2e

Dispatch hint: concurrent-gateway-ports-e2e

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

concurrent-gateway-ports-e2e (medium): Direct coverage for the changed behavior: two sandboxes on one host using distinct NEMOCLAW_GATEWAY_PORT values, live dashboard forward segregation, no collateral cleanup of another sandbox's gateway/forward, and destroying one sandbox without breaking the other.

Optional E2E

double-onboard-e2e (medium): Adjacent confidence for repeated onboard/re-onboard lifecycle behavior on a single host; useful because this PR changes onboard gateway reuse and preflight cleanup paths, but it does not specifically exercise concurrent gateway ports.
tunnel-lifecycle-e2e (medium): Adjacent confidence for OpenShell forward/tunnel lifecycle behavior, relevant to the dashboard SSH forward cleanup change but less directly targeted than the new concurrent gateway ports E2E.
sandbox-survival-e2e (medium): Optional sandbox lifecycle regression check around gateway stop/start and sandbox survival. Helpful because gateway reuse/recreation logic changed, but not merge-blocking when the targeted concurrent gateway test passes.

New E2E recommendations

None.

Dispatch hint

Workflow: .github/workflows/nightly-e2e.yaml
jobs input: concurrent-gateway-ports-e2e

github-actions · 2026-06-01T04:06:57Z

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw-double-same-provider
Optional scenario E2E: ubuntu-repo-cloud-openclaw, ubuntu-repo-cloud-openclaw-resume, wsl-repo-cloud-openclaw

Dispatch required scenario E2E:

gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-double-same-provider

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: medium

Required scenario E2E

ubuntu-repo-cloud-openclaw-double-same-provider: Onboarding gateway reuse/startup logic and dashboard-forward preflight cleanup changed. This routed scenario is the closest scenario-suite coverage for repeated OpenClaw onboarding against an existing gateway/sandbox state on Ubuntu.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-double-same-provider

Optional scenario E2E

ubuntu-repo-cloud-openclaw: Provides baseline Ubuntu OpenClaw onboarding coverage for the changed preflight/gateway path, though it does not specifically exercise concurrent non-default gateway ports.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw
ubuntu-repo-cloud-openclaw-resume: Adjacent lifecycle coverage for gateway state handling during resume paths touched by the gateway handler changes.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-resume
wsl-repo-cloud-openclaw: Optional platform-adjacent coverage because orphaned dashboard-forward cleanup relies on SSH/process/port behavior that can differ under WSL. Special-runner scenario, so not primary.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=wsl-repo-cloud-openclaw

Relevant changed files

src/lib/onboard.ts
src/lib/onboard/machine/handlers/gateway.ts
src/lib/onboard/orphaned-dashboard-forward.ts

coderabbitai

🧹 Nitpick comments (1)

src/lib/state/gateway.ts (1)

54-82: ⚡ Quick win

Consider extracting the shared liveness predicate.

The row-liveness check (cols.includes("Ready") || cols.includes("Running")) && !cols.includes("NotReady") is now duplicated in isSandboxReady (Line 57) and listLiveSandboxNames (Line 77). Extracting a small isLiveSandboxRow(cols: string[]) helper keeps the two call sites from drifting if the OpenShell status vocabulary changes.

♻️ Proposed extraction

+function isLiveSandboxRow(cols: string[]): boolean {
+  return (cols.includes("Ready") || cols.includes("Running")) && !cols.includes("NotReady");
+}
+
 export function isSandboxReady(output: string, sandboxName: string): boolean {
   const cols = parseSandboxRow(output, sandboxName);
   if (!cols) return false;
-  return (cols.includes("Ready") || cols.includes("Running")) && !cols.includes("NotReady");
+  return isLiveSandboxRow(cols);
 }
@@
   for (const line of clean.split("\n")) {
     const cols = line.trim().split(/\s+/);
     if (cols.length < 2) continue;
     const name = cols[0];
     if (!name) continue;
-    if ((cols.includes("Ready") || cols.includes("Running")) && !cols.includes("NotReady")) {
+    if (isLiveSandboxRow(cols)) {
       names.push(name);
     }
   }
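
To see how the extracted predicate behaves on individual rows, here is a minimal sketch. `listLiveNames` is a hypothetical stand-in for `listLiveSandboxNames`; the `clean` variable in the diff suggests the real function does some preprocessing that is skipped here, and the sample table layout is an assumption rather than real OpenShell output.

```typescript
// Shared predicate, copied from the proposed extraction above.
function isLiveSandboxRow(cols: string[]): boolean {
  return (cols.includes("Ready") || cols.includes("Running")) && !cols.includes("NotReady");
}

// Hypothetical stand-in for listLiveSandboxNames (no preprocessing of the input).
function listLiveNames(output: string): string[] {
  const names: string[] = [];
  for (const line of output.split("\n")) {
    const cols = line.trim().split(/\s+/);
    if (cols.length < 2) continue; // skip blank or single-column lines
    const name = cols[0];
    if (!name) continue;
    if (isLiveSandboxRow(cols)) names.push(name);
  }
  return names;
}

// Assumed sample output; the real column layout may differ.
const sampleSandboxList = [
  "NAME     STATUS",
  "alpha    Ready",
  "beta     NotReady",
  "gamma    Running",
  "delta    Provisioning",
].join("\n");

const liveNames = listLiveNames(sampleSandboxList);
console.log(liveNames); // only "alpha" and "gamma" pass the predicate
```

Because `Array.prototype.includes` compares whole elements, a `NotReady` column never matches `"Ready"`; the explicit `NotReady` exclusion only matters for a row that carries both tokens.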

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/state/gateway.ts` around lines 54 - 82, The liveness predicate is
duplicated between isSandboxReady and listLiveSandboxNames; extract a helper
like isLiveSandboxRow(cols: string[]): boolean that implements
(cols.includes("Ready") || cols.includes("Running")) &&
!cols.includes("NotReady"), then replace the duplicated checks in isSandboxReady
(which calls parseSandboxRow) and listLiveSandboxNames (which splits lines into
cols) to call isLiveSandboxRow(cols) so both sites share the single canonical
predicate.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b3649375-6721-4481-9cea-45c8b1ea1c68

📥 Commits

Reviewing files that changed from the base of the PR and between d5b6670 and 6737a62.

📒 Files selected for processing (6)

docs/reference/troubleshooting.mdx
src/lib/onboard.ts
src/lib/onboard/preflight-gateway-cleanup-decision.test.ts
src/lib/onboard/preflight-gateway-cleanup-decision.ts
src/lib/state/gateway.ts
test/gateway-state.test.ts

github-actions · 2026-06-01T04:21:27Z

Selective E2E Results — ❌ Some jobs failed

Run: 26734588554
Target ref: 6737a6229a21db222e3d6ba37d3bd1a8e4d5d822
Workflow ref: main
Requested jobs: double-onboard-e2e,onboard-negative-paths-e2e,sandbox-survival-e2e
Summary: 2 passed, 1 failed, 0 skipped

Job	Result
double-onboard-e2e	✅ success
onboard-negative-paths-e2e	❌ failure
sandbox-survival-e2e	✅ success

Failed jobs: onboard-negative-paths-e2e. Check run artifacts for logs.

github-actions · 2026-06-01T05:29:11Z

Selective E2E Results — ❌ Some jobs failed

Run: 26736515114
Target ref: ab5648a2f091b6796a0a5285cfe70e42f81ea48d
Workflow ref: main
Requested jobs: double-onboard-e2e,onboard-negative-paths-e2e
Summary: 1 passed, 1 failed, 0 skipped

Job	Result
double-onboard-e2e	✅ success
onboard-negative-paths-e2e	❌ failure

Failed jobs: onboard-negative-paths-e2e. Check run artifacts for logs.

github-actions · 2026-06-01T07:20:14Z

Selective E2E Results — ❌ Some jobs failed

Run: 26740349870
Target ref: 84bdda87bcbbd3ada9ef5238c80fd8755cba2dfb
Workflow ref: main
Requested jobs: double-onboard-e2e,onboard-negative-paths-e2e
Summary: 1 passed, 1 failed, 0 skipped

Job	Result
double-onboard-e2e	✅ success
onboard-negative-paths-e2e	❌ failure

Failed jobs: onboard-negative-paths-e2e. Check run artifacts for logs.

github-actions · 2026-06-01T08:59:52Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26745027988
Target ref: 0ef1f56470297468a85768165430568b21f4ad4c
Workflow ref: main
Requested jobs: cloud-onboard-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job	Result
cloud-onboard-e2e	✅ success

github-actions · 2026-06-01T09:08:50Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26745392857
Target ref: a1c55fecdb28bd62a3c436410c4773629228c2f4
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-survival-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job	Result
cloud-onboard-e2e	✅ success
sandbox-survival-e2e	✅ success

github-actions · 2026-06-01T09:45:57Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26746811574
Target ref: fa4550762968788c27181380a13ab8983833ee9c
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-operations-e2e,snapshot-commands-e2e
Summary: 3 passed, 0 failed, 0 skipped

Job	Result
cloud-onboard-e2e	✅ success
sandbox-operations-e2e	✅ success
snapshot-commands-e2e	✅ success

github-actions · 2026-06-01T10:11:50Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26748043704
Target ref: fdddf5366f0aa16d42abbb0ccbddc137e84e11f2
Workflow ref: main
Requested jobs: sandbox-operations-e2e,inference-routing-e2e,snapshot-commands-e2e
Summary: 3 passed, 0 failed, 0 skipped

Job	Result
inference-routing-e2e	✅ success
sandbox-operations-e2e	✅ success
snapshot-commands-e2e	✅ success

github-actions · 2026-06-01T10:36:51Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26749185051
Target ref: 48e58ae6fd800b4610f2099f9b671827a750930c
Workflow ref: main
Requested jobs: sandbox-operations-e2e,openclaw-inference-switch-e2e,sandbox-survival-e2e,snapshot-commands-e2e,onboard-resume-e2e
Summary: 4 passed, 0 failed, 0 skipped

Job	Result
onboard-resume-e2e	✅ success
openclaw-inference-switch-e2e	✅ success
sandbox-operations-e2e	⚠️ cancelled
sandbox-survival-e2e	✅ success
snapshot-commands-e2e	✅ success

github-actions · 2026-06-01T10:50:16Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26749747697
Target ref: b9e28babae3df68b1c8ac6ee2c8379bd28e33449
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-operations-e2e,openclaw-inference-switch-e2e,snapshot-commands-e2e,onboard-resume-e2e
Summary: 5 passed, 0 failed, 0 skipped

Job	Result
cloud-onboard-e2e	✅ success
onboard-resume-e2e	✅ success
openclaw-inference-switch-e2e	✅ success
sandbox-operations-e2e	✅ success
snapshot-commands-e2e	✅ success

github-actions · 2026-06-05T05:57:25Z

Selective E2E Results — ❌ Some jobs failed

Run: 26997817230
Target ref: fix/4422-refuse-gateway-drift-on-live-sandbox
Requested jobs: concurrent-gateway-ports-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job	Result
concurrent-gateway-ports-e2e	❌ failure

Failed jobs: concurrent-gateway-ports-e2e. Check run artifacts for logs.

github-actions · 2026-06-05T06:16:46Z

Selective E2E Results — ❌ Some jobs failed

Run: 26998381153
Target ref: fix/4422-refuse-gateway-drift-on-live-sandbox
Requested jobs: concurrent-gateway-ports-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job	Result
concurrent-gateway-ports-e2e	❌ failure

Failed jobs: concurrent-gateway-ports-e2e. Check run artifacts for logs.

github-actions · 2026-06-05T06:37:03Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26999184001
Target ref: fix/4422-refuse-gateway-drift-on-live-sandbox
Requested jobs: concurrent-gateway-ports-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job	Result
concurrent-gateway-ports-e2e	✅ success

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2089-2100: The code currently calls
runCaptureOpenshell(["forward","list"], {ignoreError: true, ...}) so
failures/timeouts are treated as empty output and may cause an incorrect kill;
change this to surface command failures and bail on unknown ownership: call
runCaptureOpenshell without ignoreError (or wrap it in try/catch), check for
errors/timeouts and if the command failed log a warning and skip killing the SSH
forward for that port (i.e. do not fall through to the PID kill), otherwise
proceed to call getOccupiedPorts(forwardListOutput) as before; update the
runCaptureOpenshell usage and surrounding control flow in the block that
references runCaptureOpenshell and getOccupiedPorts so ownership is only assumed
on successful command output.
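
The control flow requested above can be sketched as follows. This is an illustration under stated assumptions, not the code that landed: `runForwardList` is a hypothetical injected function standing in for `runCaptureOpenshell(["forward", "list"], ...)` called without `ignoreError`, and `isPortOwned` is a hypothetical predicate standing in for a lookup against `getOccupiedPorts(forwardListOutput)`.

```typescript
type ForwardListResult =
  | { kind: "listed"; output: string }
  | { kind: "list-failed"; reason: string };

// Surface failures and timeouts instead of collapsing them to an empty string.
function captureForwardList(runForwardList: () => string): ForwardListResult {
  try {
    return { kind: "listed", output: runForwardList() };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { kind: "list-failed", reason };
  }
}

// Killing is only considered when the listing succeeded AND shows no owner.
function shouldKillForward(
  result: ForwardListResult,
  port: number,
  isPortOwned: (output: string, port: number) => boolean, // hypothetical ownership lookup
): boolean {
  if (result.kind === "list-failed") return false; // ownership unknown: skip the kill
  return !isPortOwned(result.output, port);
}
```

Modelling the failure as its own variant forces every caller to decide what "unknown ownership" means, which is the point of the review comment: an empty string can no longer be mistaken for "nobody owns this port".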

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a79fe900-00d9-45be-90a6-125ebe458b06

📥 Commits

Reviewing files that changed from the base of the PR and between 40042b8 and beb9014.

📒 Files selected for processing (5)

.github/workflows/nightly-e2e.yaml
src/lib/onboard.ts
src/lib/onboard/machine/handlers/gateway.test.ts
src/lib/onboard/machine/handlers/gateway.ts
test/e2e/test-concurrent-gateway-ports.sh

🚧 Files skipped from review as they are similar to previous changes (1)

.github/workflows/nightly-e2e.yaml

github-actions · 2026-06-05T06:51:57Z

Selective E2E Results — ✅ All requested jobs passed

Run: 26999971280
Target ref: beb901495aae1048285cfcc1814755e1e909890f
Workflow ref: main
Requested jobs: sandbox-survival-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job	Result
sandbox-survival-e2e	✅ success

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2083-2089: The guard currently only invokes
tryCleanupOrphanedDashboardForward when port === DASHBOARD_PORT, which skips
orphan-forward handling when a custom --control-ui-port is used; update the
condition so any ssh listener for the dashboard check is handled regardless of
the numeric DASHBOARD_PORT (for example check portCheck.process === "ssh" &&
(port === DASHBOARD_PORT || label === "dashboard") or otherwise detect the
dashboard check by its label) and then call tryCleanupOrphanedDashboardForward
with the same args; if outcome.kind === "killed-still-blocked" replace portCheck
with outcome.portCheck, else if outcome.kind !== "not-openshell" continue — keep
the existing outcome handling but remove the strict numeric DASHBOARD_PORT
requirement so createSandbox can auto-allocate a different dashboard port later.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 25898d10-0f69-49a3-a11a-9ac86433b02f

📥 Commits

Reviewing files that changed from the base of the PR and between beb9014 and 69dcdef.

📒 Files selected for processing (6)

.github/workflows/nightly-e2e.yaml
src/lib/onboard.ts
src/lib/onboard/machine/handlers/gateway.test.ts
src/lib/onboard/orphaned-dashboard-forward.test.ts
src/lib/onboard/orphaned-dashboard-forward.ts
test/e2e/test-concurrent-gateway-ports.sh

🚧 Files skipped from review as they are similar to previous changes (3)

src/lib/onboard/machine/handlers/gateway.test.ts
.github/workflows/nightly-e2e.yaml
test/e2e/test-concurrent-gateway-ports.sh

github-actions · 2026-06-05T07:22:44Z

Selective E2E Results — ✅ All requested jobs passed

Run: 27001186985
Target ref: 69dcdefee627aadbd9268eeaaa4902645ef36b71
Workflow ref: main
Requested jobs: tunnel-lifecycle-e2e,sandbox-survival-e2e
Summary: 2 passed, 0 failed, 0 skipped

Job	Result
sandbox-survival-e2e	✅ success
tunnel-lifecycle-e2e	✅ success

github-actions · 2026-06-05T07:26:25Z

Selective E2E Results — ✅ All requested jobs passed

Run: 27001064634
Target ref: fix/4422-refuse-gateway-drift-on-live-sandbox
Requested jobs: concurrent-gateway-ports-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job	Result
concurrent-gateway-ports-e2e	✅ success

prekshivyas

Re-reviewed against the current head (69dcdef) — refreshing my earlier approval, which predated a substantial rework. Verified the fix end to end, including the primitives it depends on:

Gateway skip: gateway.ts skips retireLegacyGatewayForDockerDriverUpgrade when gatewayReuseState === "foreign-active" and normalizes to missing, so a second onboard starts its own per-port gateway alongside instead of retiring the neighbor's. The foreign-active state is real production logic — getGatewayReuseState (src/lib/state/gateway.ts) returns it when a live gateway with a different name exists ((connected || activeInfo) && activeGatewayName !== gatewayName), i.e. exactly the concurrent-instance case. Not dead code, and it's a pure, separately-tested function.
Dashboard forward: the extracted tryCleanupOrphanedDashboardForward helper only kills a forward when there is no live owner. Ownership comes from getOccupiedPorts, which maps port→sandbox but only for live forwards — so a stale entry is killable and a live foreign owner is protected. list-failed and owned-by-live both skip the kill (and forward list is deliberately allowed to throw rather than swallow to empty, preventing a wrongful kill on enumeration failure).
Integration: the onboard.ts caller, inside the for (const {port} of requiredPorts) loop, handles every outcome correctly — killed-cleared/owned-by-live/list-failed → continue (proceed, auto-allocate a different dashboard port); not-openshell/killed-still-blocked → fall through to the port-blocked error with refreshed diagnostics.
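
The outcome handling described in the Integration point can be sketched as a discriminated union. The outcome kinds are the ones named in this review; the payload shapes, the `PortCheck` fields, and the `decide` helper are assumptions for illustration and are not taken from `onboard.ts`.

```typescript
interface PortCheck {
  process: string;
  pid?: number;
}

// Outcome kinds as named in the review; payloads are assumed.
type CleanupOutcome =
  | { kind: "killed-cleared" }
  | { kind: "owned-by-live" }
  | { kind: "list-failed" }
  | { kind: "not-openshell" }
  | { kind: "killed-still-blocked"; portCheck: PortCheck };

type Decision =
  | { action: "continue" }
  | { action: "report-blocked"; portCheck: PortCheck };

// Hypothetical helper mirroring the caller's branches inside the requiredPorts loop.
function decide(outcome: CleanupOutcome, current: PortCheck): Decision {
  switch (outcome.kind) {
    case "killed-cleared":
    case "owned-by-live":
    case "list-failed":
      // Proceed; a different dashboard port can be allocated later.
      return { action: "continue" };
    case "killed-still-blocked":
      // Fall through to the port-blocked error with refreshed diagnostics.
      return { action: "report-blocked", portCheck: outcome.portCheck };
    case "not-openshell":
      // Not an OpenShell forward: report the original listener.
      return { action: "report-blocked", portCheck: current };
  }
}
```

With an exhaustive `switch`, adding a sixth outcome kind later would fail type checking until the caller chooses a branch for it.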

Unit tests cover the foreign-active no-retire branch and the helper outcomes; the concurrent-gateway-ports E2E asserts two sandboxes reach Ready on distinct gateway/dashboard ports and survive each other's teardown. CI green (28 pass / 1 skip). Good to merge.
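
The foreign-active condition quoted in the review, together with the handler's skip-and-normalise step, can be sketched roughly as below. Only `"foreign-active"` and `"missing"` are named in this thread; the `"other"` member, the `GatewayProbe` shape, and both function names are placeholders, and the real `getGatewayReuseState` in `src/lib/state/gateway.ts` takes different inputs.

```typescript
// Placeholder union: the real state type has members not shown in this thread.
type GatewayReuseState = "foreign-active" | "missing" | "other";

// Assumed input shape; field names follow the expression quoted in the review.
interface GatewayProbe {
  connected: boolean;
  activeInfo: boolean;
  activeGatewayName: string;
  gatewayName: string;
}

// A live gateway under a different name belongs to another instance.
function isForeignActive(p: GatewayProbe): boolean {
  return (p.connected || p.activeInfo) && p.activeGatewayName !== p.gatewayName;
}

// Hypothetical summary of the handler branch: skip legacy retirement for a
// foreign-active gateway and continue as if no gateway existed for this instance.
function planPreflight(state: GatewayReuseState): {
  mayRetireLegacy: boolean;
  effectiveState: GatewayReuseState;
} {
  if (state === "foreign-active") {
    return { mayRetireLegacy: false, effectiveState: "missing" };
  }
  // Other states keep their existing handling, which is not modelled here.
  return { mayRetireLegacy: true, effectiveState: state };
}
```

For example, a second onboard that resolves its own name to `nemoclaw-8081` while `nemoclaw-8080` is the active gateway would classify as foreign-active and leave the neighbor untouched.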

coderabbitai

♻️ Duplicate comments (1)

src/lib/onboard.ts (1)
2083-2089: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle non-default dashboard ports in the orphan-forward path.

This still gates the helper on port === DASHBOARD_PORT, so --control-ui-port <non-default> skips the cleanup/ownership check and falls straight into the fatal port-blocked path even though createSandbox() later auto-allocates a different dashboard port. That keeps the concurrent-instance bug alive for custom dashboard ports.
💡 Suggested fix
-      if (port === DASHBOARD_PORT && portCheck.process === "ssh" && portCheck.pid) {
+      if (envVar === "NEMOCLAW_DASHBOARD_PORT" && portCheck.process === "ssh" && portCheck.pid) {
         const outcome = await tryCleanupOrphanedDashboardForward({
           port, pid: portCheck.pid, label, portCheckOptions,
           captureProcessArgs, runCaptureOpenshell, run, sleepSeconds, checkPortAvailable,
         });
         if (outcome.kind === "killed-still-blocked") portCheck = outcome.portCheck;
         else if (outcome.kind !== "not-openshell") continue;
       }
As per coding guidelines, src/lib/onboard.ts: "This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."
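
A small sketch of why the numeric gate misses custom dashboard ports while the suggested `envVar` gate does not. The `RequiredPort` shape is inferred from the variable names in the suggested diff, and the `DASHBOARD_PORT` value below is an arbitrary placeholder rather than the real default; both are assumptions.

```typescript
const DASHBOARD_PORT = 9000; // placeholder value, not the real default

// Assumed shape of one entry in the requiredPorts loop.
interface RequiredPort {
  port: number;
  envVar: string;
  label: string;
}

interface PortCheck {
  process: string;
  pid?: number;
}

// Current gate: only matches when the dashboard uses the default port number.
function gateByNumericPort(entry: RequiredPort, check: PortCheck): boolean {
  return entry.port === DASHBOARD_PORT && check.process === "ssh" && Boolean(check.pid);
}

// Suggested gate: identifies the dashboard entry by its environment variable.
function gateByEnvVar(entry: RequiredPort, check: PortCheck): boolean {
  return entry.envVar === "NEMOCLAW_DASHBOARD_PORT" && check.process === "ssh" && Boolean(check.pid);
}

const customDashboard: RequiredPort = {
  port: 9100, // a non-default port, e.g. from --control-ui-port
  envVar: "NEMOCLAW_DASHBOARD_PORT",
  label: "dashboard",
};
const sshListener: PortCheck = { process: "ssh", pid: 4242 };

console.log(gateByNumericPort(customDashboard, sshListener)); // false
console.log(gateByEnvVar(customDashboard, sshListener)); // true
```

Keying the gate on the entry's identity rather than its current number keeps the ownership check attached to the dashboard port wherever it is configured.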
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 2083 - 2089, The code currently only calls
tryCleanupOrphanedDashboardForward when port === DASHBOARD_PORT, which skips
orphan-forward cleanup for custom dashboard ports; remove that gate and always
invoke tryCleanupOrphanedDashboardForward (passing the current port, pid, label,
portCheckOptions, captureProcessArgs, runCaptureOpenshell, run, sleepSeconds,
checkPortAvailable) so any non-default dashboard port is checked/cleaned before
falling into the fatal blocked-port path; keep the existing outcome handling
(use outcome.portCheck when kind === "killed-still-blocked" and continue only
when kind === "not-openshell").

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 135a8518-cc25-4a94-b715-3b97eeef124d

📥 Commits

Reviewing files that changed from the base of the PR and between 69dcdef and c1e28bd.

📒 Files selected for processing (2)

.github/workflows/nightly-e2e.yaml
src/lib/onboard.ts

🚧 Files skipped from review as they are similar to previous changes (1)

.github/workflows/nightly-e2e.yaml

## Summary

- Adds the `v0.0.60` section to `docs/about/release-notes.mdx` using the dev announcement from discussion #4877.
- Fills the source-doc gaps found during release-prep review across inference, policy tiers, command behavior, security boundaries, Hermes dashboard/tooling, runtime context, and troubleshooting.
- Refreshes generated agent skills under `.agents/skills/` from the current Fern docs output and upgrades Fern from `5.44.3` to `5.45.0`.

## Source summary

- #4037 -> `docs/reference/architecture.mdx`, `docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents system-only runtime context that stays out of visible chat.
- #4875 -> `docs/reference/architecture.mdx`, `docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents try-first sandbox network/filesystem guidance and clearer failure classification.
- #4788 -> `docs/security/best-practices.mdx`, `docs/about/release-notes.mdx`: Documents shared OpenClaw device-approval policy for startup and connect.
- #4768 -> `docs/reference/network-policies.mdx`, `docs/network-policy/integration-policy-examples.mdx`, `docs/get-started/quickstart.mdx`, `docs/get-started/quickstart-hermes.mdx`, `docs/reference/commands.mdx`: Documents `weather`, `public-reference`, and Hermes managed-tool gateway preset behavior.
- #3788 and #4864 -> `docs/reference/network-policies.mdx`, `docs/reference/commands.mdx`: Documents non-interactive policy-tier fail-fast behavior and interactive prompt fallback.
- #4756 and #4866 -> `docs/reference/commands.mdx`: Documents env-aware default sandbox resolution for `list`, `status`, and `tunnel` commands.
- #4320 -> `docs/reference/commands.mdx`: Documents `$$nemoclaw tunnel status` behavior.
- #4328 -> `docs/reference/commands.mdx`: Documents line-scoped policy preset descriptions in `policy-list`.
- #4580 and #4748 -> `docs/reference/architecture.mdx`: Documents package-managed OpenShell gateway service and Docker-driver gateway-marker behavior.
- #4598 -> `docs/manage-sandboxes/lifecycle.mdx`: Documents concurrent gateway/dashboard cleanup isolation by sandbox name and port.
- #4777 -> `docs/reference/troubleshooting.mdx`: Documents Docker GPU patch rollback behavior.
- #4610 -> `docs/reference/troubleshooting.mdx`, `docs/reference/commands.mdx`: Keeps mutable OpenClaw config permission guidance aligned and removes skipped experimental wording.
- #4868 -> `docs/reference/commands.mdx`: Keeps `.dockerignore` handling for custom `onboard --from <Dockerfile>` contexts in generated skills.
- #4870 -> `docs/reference/commands.mdx`, `docs/manage-sandboxes/runtime-controls.mdx`: Documents `NEMOCLAW_MINIMAL_BOOTSTRAP` and generated skill coverage.
- #4641 -> `docs/inference/inference-options.mdx`, `docs/reference/troubleshooting.mdx`: Documents local NVIDIA NIM platform-digest pulls and served-model id adoption.
- #4810 and #4867 -> `docs/inference/inference-options.mdx`: Documents stable NGC managed-vLLM image lineage and DGX Station DeepSeek V4 Flash coverage.
- #4852 -> `docs/inference/use-local-inference.mdx`, `docs/reference/troubleshooting.mdx`: Documents Ollama model fit filtering, 16K context floor, cold-load retry, and failed-model exclusion.
- #4847 -> `docs/inference/switch-inference-providers.mdx`: Documents API-family sync, Hermes `api_mode`, and Bedrock Runtime exception.
- #4800 -> `docs/inference/tool-calling-reliability.mdx`: Documents Nemotron managed-inference native tool-search fallback.
- #4333 -> `docs/inference/switch-inference-providers.mdx`: Documents interactive multimodal input prompting.
- #4086 -> `docs/reference/troubleshooting.mdx`: Keeps proxy bypass normalization in generated troubleshooting coverage.
- #4811 and #4855 -> `docs/get-started/quickstart-hermes.mdx`: Documents prebuilt Hermes dashboard assets and TUI recovery without runtime rebuilds.
- #4854 -> `docs/inference/switch-inference-providers.mdx`, `docs/reference/commands.mdx`: Documents Hermes proxy API-key placeholder preservation during inference switches.
- #4248 -> `docs/manage-sandboxes/messaging-channels.mdx`, `.agents/skills/`: Keeps messaging enrollment behavior aligned with manifest-hook implementation.
- #4771 -> `docs/security/best-practices.mdx`, `docs/security/credential-storage.mdx`: Documents Hermes placeholder-only secret boundary for sandbox-visible runtime files.
- #4787 -> `docs/security/best-practices.mdx`, `docs/about/release-notes.mdx`: Documents expanded memory scanner examples for OpenAI project keys and Slack app-level tokens.
- #4848 -> `docs/reference/commands.mdx`: Documents OpenClaw skill install mirroring into the agent home directory.
- #4790 -> `docs/about/release-notes.mdx`: Uses the prior release-prep structure and generated `.agents/skills/` refresh as the template for this release.

## Verification

- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx`
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ skills/ --prefix nemoclaw-user --doc-platform fern-mdx --dry-run`
- `npm run docs`
- `git diff --check`
- skip-term scan across `docs/`, `.agents/skills/`, and `skills/`
- `npm run build:cli`
- `npm run typecheck:cli`
- Commit and pre-push hook suites, including markdownlint, gitleaks, env-var docs gate, docs-to-skills verification, and skills YAML tests

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * DeepSeek-V4-Flash now available as default inference model for DGX Station.
  * Hermes dashboard improved with dedicated port and OAuth-authenticated tool gateway selection.
  * Added weather and public-reference policy presets for expanded agent capabilities.
  * Enhanced Ollama model selection with GPU memory filtering and automatic retry for timeouts.
* **Bug Fixes**
  * Improved policy tier validation to prevent invalid configurations.
  * Better sandbox cleanup scoping by port to prevent conflicts across deployments.
  * Added GPU patch failure recovery with automatic rollback.
* **Documentation**
  * Expanded troubleshooting guides for inference, security, and sandbox lifecycle.
  * Added .dockerignore best practices for custom deployments.

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>

fix(onboard): refuse gateway recreate when live sandboxes exist

6737a62

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

coderabbitai Bot reviewed Jun 1, 2026

fixup: address review (extract live-row helper, wire-up test, docs, c…

ab5648a

…omments) Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

fix(onboard): narrow refuse-recreate to confirmed stale drift only

84bdda8

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng added the fix label Jun 1, 2026

laitingsheng added 2 commits June 1, 2026 08:48

Merge branch 'main' into fix/4422-refuse-gateway-drift-on-live-sandbox

429eb9f

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

refactor(state): introduce getGatewayName resolver for parallel-gatew…

0ef1f56

…ay groundwork Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng changed the title ~~fix(onboard): refuse gateway recreate when live sandboxes exist~~ refactor(state): introduce getGatewayName resolver for parallel-gateway groundwork Jun 1, 2026

feat(registry): track per-sandbox gateway name with singleton backfill

a1c55fe

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

refactor(state): single source for gateway name + tighten registry ac…

fa45507

…cessor Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

refactor(state): replace remaining hard-coded gateway names with DEFA…

fdddf53

…ULT_GATEWAY_NAME Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng added 2 commits June 1, 2026 10:19

refactor(state): persist + validate gatewayName at registry boundary

48e58ae

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

fix(state): inline gatewayName validation to drop runner dep on platform

b9e28ba

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

refactor(state): default getSandboxGatewayName + migrate reused entri…

7cf4f48

…es + stricter writes Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng added the refactor label (PR restructures code without intended behavior change) and removed the fix label Jun 1, 2026

fix(e2e): retry verify_sandbox_alive on Provisioning until Ready or t…

3ce01f0

…erminal Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

fix(e2e): query each sandbox via its own gateway in verify_sandbox_alive

939a182

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

laitingsheng marked this pull request as ready for review June 5, 2026 06:43

Merge branch 'main' into fix/4422-refuse-gateway-drift-on-live-sandbox

beb9014

laitingsheng changed the title ~~refactor(state): introduce getGatewayName resolver for parallel-gateway groundwork~~ fix(onboard): preserve concurrent instance gateway and dashboard during onboard Jun 5, 2026

coderabbitai Bot reviewed Jun 5, 2026

Comment thread src/lib/onboard.ts Outdated

refactor(onboard): extract orphaned dashboard forward cleanup helper

69dcdef

Signed-off-by: Tinson Lai <tinsonl@nvidia.com>

coderabbitai Bot reviewed Jun 5, 2026

Comment thread src/lib/onboard.ts

laitingsheng added the v0.0.60 label (Release target) Jun 5, 2026

laitingsheng requested a review from prekshivyas June 5, 2026 07:52

laitingsheng removed the feature label (PR adds or expands user-visible functionality) Jun 5, 2026

prekshivyas approved these changes Jun 5, 2026

Merge remote-tracking branch 'origin/main' into fix/4422-refuse-gatew…

c1e28bd

…ay-drift-on-live-sandbox # Conflicts: # .github/workflows/nightly-e2e.yaml

coderabbitai Bot reviewed Jun 5, 2026

cv merged commit 668f2a1 into main Jun 5, 2026
33 checks passed

cv deleted the fix/4422-refuse-gateway-drift-on-live-sandbox branch June 5, 2026 17:54

miyoungc mentioned this pull request Jun 6, 2026

docs: refresh v0.0.60 release notes #4879

Merged

laitingsheng mentioned this pull request Jun 6, 2026

[macOS][Sandbox][Policy&Network] Second NEMOCLAW_GATEWAY_PORT instance breaks first sandbox (sandbox has no spec), dashboards share same port #4865

Open
