fix(onboard): use OpenShell gateway user service#4580

Merged
cv merged 13 commits into main from fix/4423-openshell-service-lifecycle-v60
Jun 5, 2026
Conversation

@ericksoa ericksoa commented May 31, 2026

Summary

Post-Computex / v0.0.60 follow-up for the gateway lifecycle half of #4423. This is intentionally separate from #4578, which remains the v0.0.56 safety hotfix for non-destructive status behavior.

  • Use OpenShell's package-managed openshell-gateway user service when its vendor/package unit is present, writing Docker-driver gateway env before restart.
  • Ignore per-user/stale gateway unit files so standalone recovery remains available when there is no package-managed OpenShell service.
  • Keep the existing standalone gateway launch path as an explicit compatibility fallback when the upstream service is unavailable.
  • Update docs/tests to describe package-service ownership vs. standalone fallback.
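The first two bullets can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the directory list and the function names are assumptions, chosen because systemd conventionally looks for package-installed user units under `/usr/lib/systemd/user`, while per-user units live under `~/.config/systemd/user`.

```typescript
// Hypothetical sketch: only vendor/package unit locations count as
// "package-managed". Per-user unit files are deliberately never consulted.
// The directory list below is an assumption, not taken from the PR.
const PACKAGE_USER_UNIT_DIRS: string[] = ["/usr/lib/systemd/user", "/lib/systemd/user"];
const GATEWAY_UNIT_FILE = "openshell-gateway.service";

type ExistsFn = (path: string) => boolean;

// Candidate unit-file paths that a package install could have created.
function packageGatewayUnitCandidates(): string[] {
  return PACKAGE_USER_UNIT_DIRS.map((dir) => `${dir}/${GATEWAY_UNIT_FILE}`);
}

// True only on Linux and only when a vendor/package unit file exists.
// `exists` is injected so tests can run without touching the filesystem.
function hasPackageManagedGatewayUnit(platform: string, exists: ExistsFn): boolean {
  if (platform !== "linux") return false;
  return packageGatewayUnitCandidates().some((candidate) => exists(candidate));
}
```

In production code, `exists` would typically be `fs.existsSync` and `platform` would be `process.platform`; a stale unit under a home directory is never probed, so the standalone path stays available.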

Validation

  • npm run build:cli
  • npm run typecheck:cli
  • npm run checks
  • npm run source-shape:check
  • npm run check:installer-hash
  • bash -n scripts/install-openshell.sh
  • bash test/e2e/test-openshell-version-pin.sh
  • npx vitest run src/lib/onboard/docker-driver-gateway-env.test.ts src/lib/onboard/docker-driver-gateway-service.test.ts test/install-openshell-version-check.test.ts test/runner.test.ts test/onboard-gateway-runtime.test.ts test/gateway-state-reconcile-2276.test.ts

Refs #4423
Follow-up to #4578

Summary by CodeRabbit

  • Documentation
    • Clarified Deployment Topology, uninstall/state-dir contents, Apple Silicon sandbox behavior, and environment-variable guidance for the standalone fallback.
  • New Features
    • Installer on Linux now attempts to start a package-managed OpenShell gateway and falls back to the standalone gateway when appropriate.
  • Tests
    • Expanded unit and e2e tests to cover package-managed vs standalone gateway flows and realistic checksum-driven installer scenarios.
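For readers unfamiliar with checksum-driven installer fixtures, the idea can be sketched as below. The real fixture lives in a bash E2E script; this TypeScript version is only an illustration, and the function names and asset name are hypothetical. It assumes the checksum file uses the common `sha256sum` layout of digest, two spaces, then file name.

```typescript
import { createHash } from "node:crypto";

// Hex-encoded SHA-256 digest of the given contents.
function sha256Hex(contents: string | Uint8Array): string {
  return createHash("sha256").update(contents).digest("hex");
}

interface FakeAsset {
  name: string; // hypothetical archive name
  contents: string; // fake payload written by the test double
}

// Render one "<digest>  <name>" line per asset, as sha256sum would print it.
function renderChecksumFile(assets: FakeAsset[]): string {
  return assets.map((asset) => `${sha256Hex(asset.contents)}  ${asset.name}`).join("\n") + "\n";
}
```

Because the digests are computed from the fake payloads themselves, the installer's verification step exercises a realistic match rather than a hard-coded bypass.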

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa ericksoa added labels May 31, 2026: bug (Something fails against expected or documented behavior), Platform: DGX Spark, NV QA (Bugs found by the NVIDIA QA Team), UAT (Issues flagged for User Acceptance Testing), v0.0.60 (Release target)
@ericksoa ericksoa self-assigned this May 31, 2026
coderabbitai Bot commented May 31, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 682e97c9-8f61-41a6-9aab-5a4a1482060f

📥 Commits

Reviewing files that changed from the base of the PR and between 7523a1e and eee1212.

📒 Files selected for processing (3)
  • docs/reference/architecture.mdx
  • docs/reference/commands-nemohermes.mdx
  • docs/reference/commands.mdx
💤 Files with no reviewable changes (3)
  • docs/reference/architecture.mdx
  • docs/reference/commands-nemohermes.mdx
  • docs/reference/commands.mdx

📝 Walkthrough

Walkthrough

Adds Linux systemd user-service orchestration for openshell-gateway, gates Debian env-override writes on service presence, integrates package-managed startup into onboarding (short-circuiting standalone fallback on success), broadens unit and e2e tests, and updates docs and readiness comments.
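The "gates env-override writes on service presence" part might look roughly like this. It is a sketch under stated assumptions: the dependency interface, the `KEY=value` file format, and the `0o600` mode are illustrative choices, not details confirmed by the PR.

```typescript
// Injected side effects so the gate can be tested without a real filesystem.
interface EnvOverrideDeps {
  hasGatewayService: () => boolean;
  writeFile: (path: string, contents: string) => void;
  chmod: (path: string, mode: number) => void;
}

// Serialize environment entries as KEY=value lines (assumed format).
function renderGatewayEnv(env: Record<string, string>): string {
  return Object.keys(env).map((key) => `${key}=${env[key]}`).join("\n") + "\n";
}

// Returns false without writing when no package-managed service is present;
// returns true after the file is written and its permissions are tightened.
function writeGatewayEnvOverride(path: string, env: Record<string, string>, deps: EnvOverrideDeps): boolean {
  if (!deps.hasGatewayService()) return false;
  deps.writeFile(path, renderGatewayEnv(env));
  deps.chmod(path, 0o600); // owner read/write only; the mode is an assumption
  return true;
}
```

Returning a boolean lets the caller and the tests distinguish "nothing to do on a standalone-only host" from "override written".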

Changes

Linux Gateway User Service Feature

| Layer | File(s) | Summary |
|---|---|---|
| Service detection types and unit path helpers | src/lib/onboard/docker-driver-gateway-service.ts | Exports constants/types and helpers to compute systemd user unit paths and check unit-file existence on Linux with injectable existsSync. |
| Service startup systemctl orchestration | src/lib/onboard/docker-driver-gateway-service.ts | Command-availability probing, spawn-sync abstraction, and startOpenShellGatewayUserService() implementing gated systemctl --user daemon-reload → enable → restart with structured result and fallback allowance. |
| Package-managed gateway startup wrapper | src/lib/onboard/docker-driver-gateway-service.ts | startPackageManagedDockerDriverGateway() attempts package-managed startup (or returns false if unit missing), polls for endpoint registration, validates health/TCP readiness, clears runtime files, and verifies sandbox-bridge reachability or exits/throws on failure. |
| Env override conditional write logic | src/lib/onboard/docker-driver-gateway-env.ts | Imports hasOpenShellGatewayUserService, re-exports startPackageManagedDockerDriverGateway, and changes writeDockerGatewayDebEnvOverride() to accept opts and return boolean (false when service absent; true after write + perms). |
| Env override conditional write tests | src/lib/onboard/docker-driver-gateway-env.test.ts | Updates existing test to assert wrote === true when systemd unit present; adds test asserting wrote === false and no gateway.env when only standalone gateway binary exists. |
| Onboard package-managed startup integration | src/lib/onboard.ts | Imports reportDockerDriverGatewayStartFailure; startDockerDriverGateway() computes gatewayEnv, writes DEB env override, delegates to startPackageManagedDockerDriverGateway(), and returns early on success. |
| Service module unit tests | src/lib/onboard/docker-driver-gateway-service.test.ts | Adds helpers and Vitest cases verifying platform detection, systemctl call sequence, fallback allowance on bus errors, non-fallback on restart failures, package-managed orchestration timing, and health-failure behavior. |
| E2E fake asset & checksum generation | test/e2e/test-openshell-version-pin.sh | Fake gh/curl handlers now create fake OpenShell assets, compute SHA-256 digests, and emit correct *-checksums-sha256.txt entries for openshell, gateway, and sandbox archives. |
| Documentation and TCP readiness comment updates | docs/reference/architecture.mdx, docs/reference/commands.mdx, docs/reference/commands-nemohermes.mdx, src/lib/onboard/gateway-tcp-readiness.ts | Deployment topology documents Linux package-managed service restart with standalone fallback and Apple Silicon Docker sandbox behavior; uninstall state dir contents and env-var descriptions updated for standalone-fallback gateway; TCP-readiness comment clarified. |
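The "Service startup systemctl orchestration" row describes a gated three-step sequence. A minimal sketch of that shape, assuming an injected command runner and an injected outage predicate (both hypothetical, as are the function and type names), could be:

```typescript
interface SystemctlStepResult {
  ok: boolean;
  reason?: string;
}

interface GatewayServiceStartResult {
  attempted: boolean;
  started: boolean;
  fallbackAllowed: boolean;
  reason?: string;
}

// Runs `systemctl <args>` and reports success; injected for testability.
type SystemctlRunner = (args: string[]) => SystemctlStepResult;

const GATEWAY_UNIT = "openshell-gateway";

// Run daemon-reload, enable, restart in order and stop at the first failure.
function startGatewayUserService(
  run: SystemctlRunner,
  looksLikeOutage: (reason: string) => boolean,
): GatewayServiceStartResult {
  const steps: string[][] = [["daemon-reload"], ["enable", GATEWAY_UNIT], ["restart", GATEWAY_UNIT]];
  for (const step of steps) {
    const result = run(["--user", ...step]);
    if (!result.ok) {
      const reason = result.reason ?? "unknown systemctl failure";
      return {
        attempted: true,
        started: false,
        fallbackAllowed: looksLikeOutage(reason),
        reason: `systemctl --user ${step.join(" ")} failed: ${reason}`,
      };
    }
  }
  return { attempted: true, started: true, fallbackAllowed: false };
}
```

A real runner would likely wrap `child_process.spawnSync("systemctl", args)`; keeping it injectable is what lets unit tests assert the exact call sequence.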

Sequence Diagrams

sequenceDiagram
  participant Onboard as startDockerDriverGateway
  participant EnvOverride as writeDockerGatewayDebEnvOverride
  participant PkgMgd as startPackageManagedDockerDriverGateway
  participant Service as startOpenShellGatewayUserService
  participant Systemctl as systemctl --user
  participant Gateway as openshell-gateway

  Onboard->>Onboard: Compute gatewayEnv from OpenShell --version
  Onboard->>EnvOverride: Write DEB gateway.env override
  EnvOverride-->>Onboard: wrote boolean

  Onboard->>PkgMgd: startPackageManagedDockerDriverGateway(...)
  PkgMgd->>PkgMgd: Check unit file exists

  alt Unit exists
    PkgMgd->>Service: startOpenShellGatewayUserService()
    Service->>Systemctl: daemon-reload
    Systemctl-->>Service: result
    Service->>Systemctl: enable openshell-gateway
    Systemctl-->>Service: result
    Service->>Systemctl: restart openshell-gateway
    Systemctl-->>Gateway: restart signal
    Gateway-->>Service: started / failure

    alt Service started
      Service-->>PkgMgd: started true
      PkgMgd-->>Onboard: success
    else Service failed
      Service-->>PkgMgd: fallbackAllowed / reason
      PkgMgd-->>Onboard: false or exit/throw
    end
  else Unit not found
    PkgMgd-->>Onboard: false
  end

  Onboard->>Onboard: Standalone gateway fallback path
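The coordinator role shown in the diagram can be condensed into a few lines. This sketch is synchronous for brevity and uses hypothetical names; the PR's helpers may be async and may exit or throw on hard failures instead of returning false.

```typescript
type GatewayOwner = "package-managed" | "standalone";

// Injected steps; each maps to one participant in the diagram.
interface GatewayStartSteps {
  writeEnvOverride: () => boolean; // false when no package service exists
  startPackageManaged: () => boolean; // true when the service took ownership
  startStandalone: () => void; // compatibility fallback
}

// Try the package-managed service first; only fall through to the
// standalone launch when the service path reports it did not take over.
function startGatewayWithFallback(steps: GatewayStartSteps): GatewayOwner {
  steps.writeEnvOverride();
  if (steps.startPackageManaged()) return "package-managed";
  steps.startStandalone();
  return "standalone";
}
```

The early return is the "short-circuit" the walkthrough mentions: when the package-managed path succeeds, the standalone launcher is never invoked.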

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4602: Modifies Docker-driver gateway startup behavior in startDockerDriverGateway, related to gateway health/liveness checks during polling.

Suggested labels

Docker, OpenShell

Suggested reviewers

  • prekshivyas
  • cv

Poem

🐇 I hopped through systemd gates at dawn,
Wrote envs, checked units, then moved on.
Fake assets hum, checksums crisp and bright,
Gateway wakes or falls back in the night.
Carrots for tests — delightfully light!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.70% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title clearly and specifically describes the main change: implementing the use of OpenShell's package-managed gateway user service as the primary ownership mechanism for gateway lifecycle management. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

github-actions Bot commented May 31, 2026

E2E Advisor Recommendation

Required E2E: cloud-onboard-e2e, sandbox-survival-e2e, openshell-gateway-upgrade-e2e, gateway-health-honest-e2e, gateway-drift-preflight-e2e
Optional E2E: docker-unreachable-gateway-start-e2e, openshell-version-pin-e2e, openclaw-onboard-security-posture-e2e

Auto-dispatched E2E: cloud-onboard-e2e, sandbox-survival-e2e, openshell-gateway-upgrade-e2e via nightly-e2e.yaml at eee12120d7be889187c125d5dd9a4bf7d3bcc557 (nightly run)

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • cloud-onboard-e2e (medium): Runs a real clean Linux onboard path and should catch regressions in the changed Docker-driver gateway startup handoff before sandbox creation.
  • sandbox-survival-e2e (high): Directly exercises sandbox survival across gateway stop/start and recovery; the PR changes gateway ownership, runtime-file cleanup, and fallback/reuse logic that this flow depends on.
  • openshell-gateway-upgrade-e2e (high): Validates old-to-current OpenShell gateway upgrade and durable sandbox state across gateway replacement, which is the closest existing end-to-end coverage for standalone-to-current gateway handoff behavior.
  • gateway-health-honest-e2e (medium): The PR changes Docker-driver gateway readiness and service startup sequencing; this regression guard ensures NemoClaw does not report a healthy gateway when the underlying gateway is not actually serving.
  • gateway-drift-preflight-e2e (medium): The package-managed service path clears standalone runtime breadcrumbs and bypasses some standalone drift handling; this existing regression checks stale/drifted gateway runtime detection remains fail-closed.

Optional E2E

  • docker-unreachable-gateway-start-e2e (low): Useful adjacent coverage for gateway-start failure handling and fallback diagnostics when host services are unavailable, but the primary changed behavior is service handoff rather than Docker outage handling.
  • openshell-version-pin-e2e (low): The PR edits this E2E script and the gateway-service distinction is adjacent to OpenShell helper binary/version installation behavior; useful to verify the regression guard still runs as intended.
  • openclaw-onboard-security-posture-e2e (high): Optional non-root full onboard check for security posture around host-side gateway ownership and trusted runtime guards after introducing trusted systemd user service detection.

New E2E recommendations

  • package-managed Docker-driver gateway service handoff (high): Existing E2Es cover general onboard, gateway health, drift, and upgrade behavior, but none appears to install or fake a trusted package-managed openshell-gateway user unit and assert that NemoClaw writes gateway.env, starts via systemctl --user, registers the endpoint, clears standalone PID breadcrumbs, and creates a sandbox through that service path.
    • Suggested test: package-managed-docker-gateway-service-e2e
  • untrusted/per-user systemd unit fallback (medium): The critical security rule that per-user units, partial units, and untrusted ExecStart paths must not take over gateway ownership is unit-tested but not covered by a real host/onboard E2E.
    • Suggested test: docker-gateway-untrusted-user-service-fallback-e2e
  • user systemd manager outage fallback (medium): The code intentionally falls back to standalone gateway startup when systemctl --user cannot reach the user manager or bus; an E2E would protect the real diagnostics and ensure onboarding still succeeds or fails with the intended standalone path.
    • Suggested test: docker-gateway-user-manager-outage-fallback-e2e

github-actions Bot commented May 31, 2026

E2E Scenario Advisor Recommendation

Required scenario E2E: ubuntu-repo-cloud-openclaw
Optional scenario E2E: ubuntu-repo-cloud-hermes, wsl-repo-cloud-openclaw

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • ubuntu-repo-cloud-openclaw: Onboarding Docker-driver gateway startup was changed to prefer the package-managed OpenShell user service with standalone fallback. The Ubuntu repo cloud OpenClaw scenario is the smallest routed scenario that exercises Linux repo-current onboarding, Docker-driver gateway health, sandbox creation, and inference-route readiness.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw

Optional scenario E2E

  • ubuntu-repo-cloud-hermes: Optional adjacent coverage for the same Linux Docker-driver gateway/onboarding path with the Hermes agent stack.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-hermes
  • wsl-repo-cloud-openclaw: Optional special-runner coverage for the Linux/WSL Docker-driver gateway path, useful if the package-managed user-service fallback behavior is suspected to differ under WSL.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=wsl-repo-cloud-openclaw

Relevant changed files

  • src/lib/onboard.ts
  • src/lib/onboard/docker-driver-gateway-env.ts
  • src/lib/onboard/docker-driver-gateway-service.ts
  • src/lib/onboard/gateway-tcp-readiness.ts

github-actions Bot commented May 31, 2026

PR Review Advisor

Findings: 0 needs attention, 1 worth checking, 0 nice ideas
Top item: PR review advisor unavailable

Review findings

🛠️ Needs attention

  • None.

🔎 Worth checking

  • PR review advisor unavailable: The automated advisor could not complete: Could not parse JSON from PR review advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/pr-review-advisor/pr-review-advisor-raw-output.txt
    • Recommendation: Re-run the PR Review Advisor or perform a manual review.
    • Evidence: Could not parse JSON from PR review advisor output; see /home/runner/work/NemoClaw/NemoClaw/artifacts/pr-review-advisor/pr-review-advisor-raw-output.txt

🌱 Nice ideas

  • None.
Consider writing more tests for
  • **Runtime validation**: Add or identify targeted runtime/integration validation for the changed behavior; do not report external E2E job pass/fail here. Runtime/sandbox/infrastructure paths need behavioral runtime validation: docs/reference/architecture.mdx, docs/reference/commands-nemohermes.mdx, docs/reference/commands.mdx, src/lib/onboard.ts, src/lib/onboard/docker-driver-gateway-env.ts, src/lib/onboard/docker-driver-gateway-service.ts, src/lib/onboard/gateway-tcp-readiness.ts.

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

ericksoa added 2 commits May 31, 2026 08:44
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@github-actions

Selective E2E Results — ❌ Some jobs failed

Run: 26717014158
Target ref: 04b7523a4b3e0692374bc8229b4eab11be5ee0df
Workflow ref: main
Requested jobs: cloud-onboard-e2e,openshell-gateway-upgrade-e2e,sandbox-survival-e2e,runtime-overrides-e2e
Summary: 0 passed, 3 failed, 0 skipped

Job Result
cloud-onboard-e2e ❌ failure
openshell-gateway-upgrade-e2e ⚠️ cancelled
runtime-overrides-e2e ❌ failure
sandbox-survival-e2e ❌ failure

Failed jobs: cloud-onboard-e2e, runtime-overrides-e2e, sandbox-survival-e2e. Check run artifacts for logs.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2414-2471: The helper function
tryStartPackageManagedDockerDriverGateway should be moved out of
src/lib/onboard.ts into a new sibling module (e.g.,
src/lib/onboard-package-managed.ts) so the big block no longer inflates
onboard.ts; keep startDockerDriverGateway() in onboard.ts as the coordinator and
have it import and call the relocated tryStartPackageManagedDockerDriverGateway.
Ensure you preserve the function signature and all referenced symbols
(dockerDriverGatewayService, clearDockerDriverGatewayRuntimeFiles,
registerDockerDriverGatewayEndpoint, runCaptureOpenshell, isGatewayHealthy,
isGatewayTcpReady, verifySandboxBridgeGatewayReachableOrExit, envInt,
sleepSeconds, GATEWAY_NAME) and their imports/exports so behavior is unchanged,
update module imports in onboard.ts, and export the helper from the new module
for testing/consumption.
- Around line 2443-2461: The call to clearDockerDriverGatewayRuntimeFiles
currently runs before the health-poll loop and removes runtime PID/marker files
prematurely; move that cleanup so it only runs after the gateway is confirmed
healthy (i.e., after isGatewayHealthy(status, namedInfo, currentInfo) && await
isGatewayTcpReady() succeeds) — update the code so
clearDockerDriverGatewayRuntimeFiles is invoked after the successful health
checks (for functions/variables involved: clearDockerDriverGatewayRuntimeFiles,
registerDockerDriverGatewayEndpoint, isGatewayHealthy, isGatewayTcpReady,
verifySandboxBridgeGatewayReachableOrExit) so recovery/fallback logic retains
runtime breadcrumbs until the service is truly up.
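The ordering requested in the second comment (clean up only after the gateway is confirmed healthy) can be illustrated with a small polling helper. This is a sketch with hypothetical names and injected dependencies, not the PR's implementation; the real health check combines isGatewayHealthy and an awaited isGatewayTcpReady.

```typescript
interface GatewayPollDeps {
  isHealthy: () => boolean; // stands in for the combined health + TCP check
  clearRuntimeFiles: () => void; // removes standalone PID/marker breadcrumbs
  sleep: (seconds: number) => void;
}

// Poll up to `attempts` times. Runtime files are cleared only once a poll
// succeeds, so a failed start leaves the breadcrumbs for recovery/fallback.
function waitForGatewayThenCleanup(deps: GatewayPollDeps, attempts: number, intervalSeconds: number): boolean {
  for (let attempt = 0; attempt < attempts; attempt += 1) {
    if (deps.isHealthy()) {
      deps.clearRuntimeFiles();
      return true;
    }
    // Do not sleep after the final failed attempt.
    if (attempt < attempts - 1) deps.sleep(intervalSeconds);
  }
  return false;
}
```

With this shape, a gateway that never becomes healthy returns false without touching the runtime files, which is the property the review comment asks for.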

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ad5de779-9431-49cf-972d-2394b910a053

📥 Commits

Reviewing files that changed from the base of the PR and between 9641ce0 and 04b7523.

📒 Files selected for processing (9)
  • docs/reference/architecture.mdx
  • docs/reference/commands.mdx
  • scripts/install-openshell.sh
  • src/lib/onboard.ts
  • src/lib/onboard/docker-driver-gateway-env.test.ts
  • src/lib/onboard/docker-driver-gateway-env.ts
  • src/lib/onboard/docker-driver-gateway-service.test.ts
  • src/lib/onboard/docker-driver-gateway-service.ts
  • test/install-openshell-version-check.test.ts

Comment thread src/lib/onboard.ts Outdated
Comment thread src/lib/onboard.ts Outdated
@github-actions

Selective E2E Results — ✅ All requested jobs passed

Run: 26717115183
Target ref: 90a5959cb9b39e7058ef96abd1468cbd46bd8b83
Workflow ref: main
Requested jobs: cloud-onboard-e2e,openshell-gateway-upgrade-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@github-actions

Selective E2E Results — ✅ All requested jobs passed

Run: 26717180037
Target ref: 795b3c376c96cd021ac9091c6160b37ab9ae2f51
Workflow ref: main
Requested jobs: cloud-onboard-e2e,openshell-gateway-upgrade-e2e,sandbox-survival-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled
sandbox-survival-e2e ⚠️ cancelled

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

2424-2426: Please run one service-path E2E before merge.

This branch now short-circuits the legacy startup flow when the package-managed gateway takes ownership, so openshell-gateway-upgrade-e2e plus one happy-path onboard flow such as cloud-e2e or sandbox-operations-e2e would give good coverage of the new handoff.

As per coding guidelines, src/lib/onboard.ts: "This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 2424 - 2426, Run a service-path E2E before
merging to validate the new short-circuit in src/lib/onboard.ts: execute the
openshell-gateway-upgrade-e2e test plus one happy-path onboarding E2E (cloud-e2e
or sandbox-operations-e2e) to confirm
dockerDriverGatewayEnv.startPackageManagedDockerDriverGateway correctly takes
ownership and that the legacy startup path still behaves when it should; verify
the flow around verifySandboxBridgeGatewayReachableOrExit and GATEWAY_NAME,
observe logs/errors and ensure no regressions in gateway handoff or sandbox
creation before merging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e9cba808-c0cf-473b-950a-5efbb2171a45

📥 Commits

Reviewing files that changed from the base of the PR and between 04b7523 and 795b3c3.

📒 Files selected for processing (3)
  • src/lib/onboard.ts
  • src/lib/onboard/docker-driver-gateway-env.ts
  • src/lib/onboard/docker-driver-gateway-service.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard/docker-driver-gateway-env.ts

@github-actions

Selective E2E Results — ❌ Some jobs failed

Run: 26717266019
Target ref: 098007d653201720acea8b9f796c267c9feff776
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-survival-e2e,openshell-gateway-upgrade-e2e
Summary: 1 passed, 2 failed, 0 skipped

Job Result
cloud-onboard-e2e ❌ failure
openshell-gateway-upgrade-e2e ✅ success
sandbox-survival-e2e ❌ failure

Failed jobs: cloud-onboard-e2e, sandbox-survival-e2e. Check run artifacts for logs.

@github-actions

Selective E2E Results — ❌ Some jobs failed

Run: 26718114347
Target ref: c24aa4e2f9f2f1d765c4d68ad3d19b78bf33f12ba
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-survival-e2e,openshell-gateway-upgrade-e2e
Summary: 0 passed, 3 failed, 0 skipped

Job Result
cloud-onboard-e2e ❌ failure
openshell-gateway-upgrade-e2e ❌ failure
sandbox-survival-e2e ❌ failure

Failed jobs: cloud-onboard-e2e, openshell-gateway-upgrade-e2e, sandbox-survival-e2e. Check run artifacts for logs.

@github-actions

Selective E2E Results — ✅ All requested jobs passed

Run: 26718192181
Target ref: c24aa4e2f798b44cedc7665ffc5750b610d2e28a
Workflow ref: main
Requested jobs: cloud-e2e,double-onboard-e2e,openshell-gateway-upgrade-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
cloud-e2e ⚠️ cancelled
double-onboard-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/reference/commands.mdx`:
- Line 1422: Update the description for the environment variable
NEMOCLAW_OPENSHELL_SANDBOX_BIN to use active voice to match the parallel entry
for NEMOCLAW_OPENSHELL_GATEWAY_BIN: replace the passive phrase "passed to the
Linux Docker-driver standalone fallback" with the active construction "used by
the Linux Docker-driver standalone fallback" while keeping the rest of the
sentence unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b6c6b2a5-0d6f-4db2-a49c-f202dd4ad0d1

📥 Commits

Reviewing files that changed from the base of the PR and between 098007d and c24aa4e.

📒 Files selected for processing (2)
  • docs/reference/commands.mdx
  • src/lib/onboard/docker-driver-gateway-service.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard/docker-driver-gateway-service.ts

Comment thread docs/reference/commands.mdx Outdated
@github-actions

Selective E2E Results — ✅ All requested jobs passed

Run: 26718200252
Target ref: 3f68c539819164ab6324c26c69a524181c2c3451
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-survival-e2e,openshell-gateway-upgrade-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled
sandbox-survival-e2e ⚠️ cancelled

@ericksoa ericksoa requested a review from cv May 31, 2026 17:04

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard/docker-driver-gateway-service.ts (1)

145-150: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Allow standalone fallback for bus/user-manager errors on restart too.

Limiting fallbackAllowed to daemon-reload failures turns a systemctl --user restart bus/user-manager outage into a hard stop. Since the caller in src/lib/onboard.ts (Lines 2411-2427) routes through this helper before the standalone path, that blocks the documented fallback even though the package-managed service is simply unavailable.

🔧 Proposed fix
       return {
         attempted: true,
-        fallbackAllowed: args[0] === "daemon-reload" && userManagerLooksUnavailable(result.reason ?? ""),
+        fallbackAllowed: userManagerLooksUnavailable(result.reason ?? ""),
         reason,
         started: false,
       };
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/docker-driver-gateway-service.ts` around lines 145 - 150, The
fallbackAllowed check in the runSystemctlUser error handling currently only
permits fallback for daemon-reload failures; update that condition to also allow
fallback when the invoked command is "restart" and the failure matches the user
manager/bus outage detector. In the error branch where runSystemctlUser
result.ok is false (inside docker-driver-gateway-service.ts), change the
fallbackAllowed expression that references args[0] and
userManagerLooksUnavailable(result.reason ?? "") so it returns true for args[0]
=== "daemon-reload" OR args[0] === "restart" when
userManagerLooksUnavailable(...) is true, ensuring restart failures due to
bus/user-manager outages permit the standalone fallback.
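To make the requested condition concrete, here is a hedged sketch of the step-aware variant described in the prompt above (the proposed diff instead drops the step check entirely). The function names and the error patterns are assumptions: the patterns are based on messages `systemctl --user` commonly prints when no user bus is reachable, and the PR's own `userManagerLooksUnavailable` may match different text.

```typescript
// Hypothetical detector for "the user manager or bus is not reachable".
// The patterns are illustrative, not copied from the PR.
function looksLikeUserManagerOutage(reason: string): boolean {
  return /failed to connect to bus|failed to connect to user scope bus/i.test(reason);
}

// Steps for which an outage should allow the standalone fallback.
const FALLBACK_ELIGIBLE_STEPS = new Set<string>(["daemon-reload", "restart"]);

// Allow fallback only when an eligible step failed AND the failure looks
// like an outage; a genuine restart failure of the unit stays a hard stop.
function isStandaloneFallbackAllowed(step: string, reason: string): boolean {
  return FALLBACK_ELIGIBLE_STEPS.has(step) && looksLikeUserManagerOutage(reason);
}
```

The distinction matters because a unit that starts and then crashes should surface as an error, whereas a host with no user systemd manager simply cannot use the package-managed path.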

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fc5291fc-ff9e-4407-9444-a734dc2c9a01

📥 Commits

Reviewing files that changed from the base of the PR and between c24aa4e and ce9cb66.

📒 Files selected for processing (3)
  • src/lib/onboard/docker-driver-gateway-service.test.ts
  • src/lib/onboard/docker-driver-gateway-service.ts
  • test/e2e/test-openshell-version-pin.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/lib/onboard/docker-driver-gateway-service.test.ts
  • test/e2e/test-openshell-version-pin.sh

@github-actions

Selective E2E Results — ✅ All requested jobs passed

Run: 26719041225
Target ref: 59b38da051a840236ef0d5de60bd900f64ac4008
Workflow ref: main
Requested jobs: all (no filter)
Summary: 5 passed, 0 failed, 2 skipped

Job Result
bedrock-runtime-compatible-anthropic-e2e ⚠️ cancelled
brave-search-e2e ✅ success
channels-add-remove-e2e ⚠️ cancelled
channels-stop-start-e2e ⚠️ cancelled
cloud-e2e ⚠️ cancelled
cloud-inference-e2e ⚠️ cancelled
cloud-onboard-e2e ⚠️ cancelled
credential-migration-e2e ⚠️ cancelled
credential-sanitization-e2e ⚠️ cancelled
device-auth-health-e2e ⚠️ cancelled
diagnostics-e2e ⚠️ cancelled
docs-validation-e2e ⚠️ cancelled
double-onboard-e2e ⚠️ cancelled
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-dashboard-e2e ⚠️ cancelled
hermes-discord-e2e ⚠️ cancelled
hermes-e2e ⚠️ cancelled
hermes-inference-switch-e2e ⚠️ cancelled
hermes-onboard-security-posture-e2e ⚠️ cancelled
hermes-root-entrypoint-smoke-e2e ✅ success
hermes-slack-e2e ⚠️ cancelled
inference-routing-e2e ⚠️ cancelled
issue-2478-crash-loop-recovery-e2e ⚠️ cancelled
issue-3600-gpu-proof-optional-e2e ✅ success
issue-4462-gateway-pinned-approval-characterization-e2e ⚠️ cancelled
issue-4462-scope-upgrade-approval-e2e ⚠️ cancelled
kimi-inference-compat-e2e ⚠️ cancelled
launchable-smoke-e2e ⚠️ cancelled
messaging-compatible-endpoint-e2e ⚠️ cancelled
messaging-providers-e2e ⚠️ cancelled
network-policy-e2e ⚠️ cancelled
onboard-negative-paths-e2e ⚠️ cancelled
onboard-repair-e2e ⚠️ cancelled
onboard-resume-e2e ⚠️ cancelled
openclaw-discord-pairing-e2e ⚠️ cancelled
openclaw-inference-switch-e2e ⚠️ cancelled
openclaw-onboard-security-posture-e2e ⚠️ cancelled
openclaw-slack-pairing-e2e ⚠️ cancelled
openclaw-tui-chat-correlation-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ⚠️ cancelled
rebuild-hermes-stale-base-e2e ⚠️ cancelled
rebuild-openclaw-e2e ⚠️ cancelled
runtime-overrides-e2e ⚠️ cancelled
sandbox-operations-e2e ⚠️ cancelled
sandbox-survival-e2e ⚠️ cancelled
shields-config-e2e ⚠️ cancelled
skill-agent-e2e ⚠️ cancelled
snapshot-commands-e2e ⚠️ cancelled
state-backup-restore-e2e ⚠️ cancelled
telegram-injection-e2e ⚠️ cancelled
token-rotation-e2e ⚠️ cancelled
tunnel-lifecycle-e2e ⚠️ cancelled
upgrade-stale-sandbox-e2e ⚠️ cancelled
vm-driver-privileged-exec-routing-e2e ✅ success

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26719097375
Target ref: 59b38da051a840236ef0d5de60bd900f64ac4008
Workflow ref: main
Requested jobs: cloud-onboard-e2e,openshell-gateway-upgrade-e2e
Summary: 0 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ⚠️ cancelled
openshell-gateway-upgrade-e2e ⚠️ cancelled

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26719158501
Target ref: 59b38da051a840236ef0d5de60bd900f64ac4008
Workflow ref: main
Requested jobs: all (no filter)
Summary: 55 passed, 0 failed, 2 skipped

Job Result
bedrock-runtime-compatible-anthropic-e2e ✅ success
brave-search-e2e ✅ success
channels-add-remove-e2e ✅ success
channels-stop-start-e2e ✅ success
cloud-e2e ✅ success
cloud-inference-e2e ✅ success
cloud-onboard-e2e ✅ success
credential-migration-e2e ✅ success
credential-sanitization-e2e ✅ success
device-auth-health-e2e ✅ success
diagnostics-e2e ✅ success
docs-validation-e2e ✅ success
double-onboard-e2e ✅ success
gpu-double-onboard-e2e ⏭️ skipped
gpu-e2e ⏭️ skipped
hermes-dashboard-e2e ✅ success
hermes-discord-e2e ✅ success
hermes-e2e ✅ success
hermes-inference-switch-e2e ✅ success
hermes-onboard-security-posture-e2e ✅ success
hermes-root-entrypoint-smoke-e2e ✅ success
hermes-slack-e2e ✅ success
inference-routing-e2e ✅ success
issue-2478-crash-loop-recovery-e2e ✅ success
issue-3600-gpu-proof-optional-e2e ✅ success
issue-4462-gateway-pinned-approval-characterization-e2e ✅ success
issue-4462-scope-upgrade-approval-e2e ✅ success
kimi-inference-compat-e2e ✅ success
launchable-smoke-e2e ✅ success
messaging-compatible-endpoint-e2e ✅ success
messaging-providers-e2e ✅ success
network-policy-e2e ✅ success
onboard-negative-paths-e2e ✅ success
onboard-repair-e2e ✅ success
onboard-resume-e2e ✅ success
openclaw-discord-pairing-e2e ✅ success
openclaw-inference-switch-e2e ✅ success
openclaw-onboard-security-posture-e2e ✅ success
openclaw-slack-pairing-e2e ✅ success
openclaw-tui-chat-correlation-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
overlayfs-autofix-e2e ✅ success
rebuild-hermes-e2e ✅ success
rebuild-hermes-stale-base-e2e ✅ success
rebuild-openclaw-e2e ✅ success
runtime-overrides-e2e ✅ success
sandbox-operations-e2e ✅ success
sandbox-survival-e2e ✅ success
shields-config-e2e ✅ success
skill-agent-e2e ✅ success
snapshot-commands-e2e ✅ success
state-backup-restore-e2e ✅ success
telegram-injection-e2e ✅ success
token-rotation-e2e ✅ success
tunnel-lifecycle-e2e ✅ success
upgrade-stale-sandbox-e2e ✅ success
vm-driver-privileged-exec-routing-e2e ✅ success

Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
@ericksoa
Contributor Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented May 31, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26720532591
Target ref: 7523a1ebf586a3749bd3b091f2d22c64add9b021
Workflow ref: main
Requested jobs: cloud-e2e,openshell-gateway-upgrade-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
cloud-e2e ✅ success
openshell-gateway-upgrade-e2e ⚠️ cancelled

@github-actions
Contributor

Selective E2E Results — ✅ All requested jobs passed

Run: 26720778107
Target ref: 7523a1ebf586a3749bd3b091f2d22c64add9b021
Workflow ref: main
Requested jobs: openshell-gateway-upgrade-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
openshell-gateway-upgrade-e2e ✅ success

@wscurran wscurran added the labels area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery), bug-fix (PR fixes a bug or regression), and platform: dgx-spark (Affects DGX Spark hardware or workflows), and removed the labels priority: high and bug (Something fails against expected or documented behavior), Jun 3, 2026
@github-actions
Contributor

github-actions Bot commented Jun 5, 2026

Selective E2E Results — ✅ All requested jobs passed

Run: 27042899089
Target ref: 19b68534075c8d4680ac587a208dbb1f40a82f0d
Workflow ref: main
Requested jobs: cloud-onboard-e2e,openshell-gateway-upgrade-e2e
Summary: 1 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success
openshell-gateway-upgrade-e2e ⚠️ cancelled

@cv cv enabled auto-merge (squash) June 5, 2026 22:33
@cv cv merged commit 2e3daf7 into main Jun 5, 2026
34 checks passed
@cv cv deleted the fix/4423-openshell-service-lifecycle-v60 branch June 5, 2026 22:36
@github-actions
Contributor

github-actions Bot commented Jun 5, 2026

Selective E2E Results — ✅ All requested jobs passed

Run: 27043502178
Target ref: eee12120d7be889187c125d5dd9a4bf7d3bcc557
Workflow ref: main
Requested jobs: cloud-onboard-e2e,sandbox-survival-e2e,openshell-gateway-upgrade-e2e
Summary: 3 passed, 0 failed, 0 skipped

Job Result
cloud-onboard-e2e ✅ success
openshell-gateway-upgrade-e2e ✅ success
sandbox-survival-e2e ✅ success

miyoungc added a commit that referenced this pull request Jun 6, 2026
## Summary
- Adds the `v0.0.60` section to `docs/about/release-notes.mdx` using the
dev announcement from discussion #4877.
- Fills the source-doc gaps found during release-prep review across
inference, policy tiers, command behavior, security boundaries, Hermes
dashboard/tooling, runtime context, and troubleshooting.
- Refreshes generated agent skills under `.agents/skills/` from the
current Fern docs output and upgrades Fern from `5.44.3` to `5.45.0`.

## Source summary
- #4037 -> `docs/reference/architecture.mdx`,
`docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents
system-only runtime context that stays out of visible chat.
- #4875 -> `docs/reference/architecture.mdx`,
`docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents
try-first sandbox network/filesystem guidance and clearer failure
classification.
- #4788 -> `docs/security/best-practices.mdx`,
`docs/about/release-notes.mdx`: Documents shared OpenClaw
device-approval policy for startup and connect.
- #4768 -> `docs/reference/network-policies.mdx`,
`docs/network-policy/integration-policy-examples.mdx`,
`docs/get-started/quickstart.mdx`,
`docs/get-started/quickstart-hermes.mdx`, `docs/reference/commands.mdx`:
Documents `weather`, `public-reference`, and Hermes managed-tool gateway
preset behavior.
- #3788 and #4864 -> `docs/reference/network-policies.mdx`,
`docs/reference/commands.mdx`: Documents non-interactive policy-tier
fail-fast behavior and interactive prompt fallback.
- #4756 and #4866 -> `docs/reference/commands.mdx`: Documents env-aware
default sandbox resolution for `list`, `status`, and `tunnel` commands.
- #4320 -> `docs/reference/commands.mdx`: Documents
`$$nemoclaw tunnel status` behavior.
- #4328 -> `docs/reference/commands.mdx`: Documents line-scoped policy
preset descriptions in `policy-list`.
- #4580 and #4748 -> `docs/reference/architecture.mdx`: Documents
package-managed OpenShell gateway service and Docker-driver
gateway-marker behavior.
- #4598 -> `docs/manage-sandboxes/lifecycle.mdx`: Documents concurrent
gateway/dashboard cleanup isolation by sandbox name and port.
- #4777 -> `docs/reference/troubleshooting.mdx`: Documents Docker GPU
patch rollback behavior.
- #4610 -> `docs/reference/troubleshooting.mdx`,
`docs/reference/commands.mdx`: Keeps mutable OpenClaw config permission
guidance aligned and removes skipped experimental wording.
- #4868 -> `docs/reference/commands.mdx`: Keeps `.dockerignore` handling
for custom `onboard --from <Dockerfile>` contexts in generated skills.
- #4870 -> `docs/reference/commands.mdx`,
`docs/manage-sandboxes/runtime-controls.mdx`: Documents
`NEMOCLAW_MINIMAL_BOOTSTRAP` and generated skill coverage.
- #4641 -> `docs/inference/inference-options.mdx`,
`docs/reference/troubleshooting.mdx`: Documents local NVIDIA NIM
platform-digest pulls and served-model id adoption.
- #4810 and #4867 -> `docs/inference/inference-options.mdx`: Documents
stable NGC managed-vLLM image lineage and DGX Station DeepSeek V4 Flash
coverage.
- #4852 -> `docs/inference/use-local-inference.mdx`,
`docs/reference/troubleshooting.mdx`: Documents Ollama model fit
filtering, 16K context floor, cold-load retry, and failed-model
exclusion.
- #4847 -> `docs/inference/switch-inference-providers.mdx`: Documents
API-family sync, Hermes `api_mode`, and Bedrock Runtime exception.
- #4800 -> `docs/inference/tool-calling-reliability.mdx`: Documents
Nemotron managed-inference native tool-search fallback.
- #4333 -> `docs/inference/switch-inference-providers.mdx`: Documents
interactive multimodal input prompting.
- #4086 -> `docs/reference/troubleshooting.mdx`: Keeps proxy bypass
normalization in generated troubleshooting coverage.
- #4811 and #4855 -> `docs/get-started/quickstart-hermes.mdx`: Documents
prebuilt Hermes dashboard assets and TUI recovery without runtime
rebuilds.
- #4854 -> `docs/inference/switch-inference-providers.mdx`,
`docs/reference/commands.mdx`: Documents Hermes proxy API-key
placeholder preservation during inference switches.
- #4248 -> `docs/manage-sandboxes/messaging-channels.mdx`,
`.agents/skills/`: Keeps messaging enrollment behavior aligned with
manifest-hook implementation.
- #4771 -> `docs/security/best-practices.mdx`,
`docs/security/credential-storage.mdx`: Documents Hermes
placeholder-only secret boundary for sandbox-visible runtime files.
- #4787 -> `docs/security/best-practices.mdx`,
`docs/about/release-notes.mdx`: Documents expanded memory scanner
examples for OpenAI project keys and Slack app-level tokens.
- #4848 -> `docs/reference/commands.mdx`: Documents OpenClaw skill
install mirroring into the agent home directory.
- #4790 -> `docs/about/release-notes.mdx`: Uses the prior release-prep
structure and generated `.agents/skills/` refresh as the template for
this release.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx`
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ skills/ --prefix nemoclaw-user --doc-platform fern-mdx --dry-run`
- `npm run docs`
- `git diff --check`
- skip-term scan across `docs/`, `.agents/skills/`, and `skills/`
- `npm run build:cli`
- `npm run typecheck:cli`
- Commit and pre-push hook suites, including markdownlint, gitleaks,
env-var docs gate, docs-to-skills verification, and skills YAML tests

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

## Release Notes

* **New Features**
  * DeepSeek-V4-Flash now available as default inference model for DGX Station.
  * Hermes dashboard improved with dedicated port and OAuth-authenticated tool gateway selection.
  * Added weather and public-reference policy presets for expanded agent capabilities.
  * Enhanced Ollama model selection with GPU memory filtering and automatic retry for timeouts.

* **Bug Fixes**
  * Improved policy tier validation to prevent invalid configurations.
  * Better sandbox cleanup scoping by port to prevent conflicts across deployments.
  * Added GPU patch failure recovery with automatic rollback.

* **Documentation**
  * Expanded troubleshooting guides for inference, security, and sandbox lifecycle.
  * Added .dockerignore best practices for custom deployments.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Carlos Villela <cvillela@nvidia.com>

Labels

  • area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
  • bug-fix (PR fixes a bug or regression)
  • NV QA (Bugs found by the NVIDIA QA Team)
  • platform: dgx-spark (Affects DGX Spark hardware or workflows)
  • UAT (Issues flagged for User Acceptance Testing)
  • v0.0.60 (Release target)

3 participants