fix(onboard): use OpenShell gateway user service#4580
Conversation
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
|
Worried about impact? Review this PR in Change Stack to explore blast radius before you approve or request changes. No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (3)
💤 Files with no reviewable changes (3)
📝 WalkthroughWalkthroughAdds Linux systemd user-service orchestration for openshell-gateway, gates Debian env-override writes on service presence, integrates package-managed startup into onboarding (short-circuiting standalone fallback on success), broadens unit and e2e tests, and updates docs and readiness comments. ChangesLinux Gateway User Service Feature
Sequence DiagramssequenceDiagram
participant Onboard as startDockerDriverGateway
participant EnvOverride as writeDockerGatewayDebEnvOverride
participant PkgMgd as startPackageManagedDockerDriverGateway
participant Service as startOpenShellGatewayUserService
participant Systemctl as systemctl --user
participant Gateway as openshell-gateway
Onboard->>Onboard: Compute gatewayEnv from OpenShell --version
Onboard->>EnvOverride: Write DEB gateway.env override
EnvOverride-->>Onboard: wrote boolean
Onboard->>PkgMgd: startPackageManagedDockerDriverGateway(...)
PkgMgd->>PkgMgd: Check unit file exists
alt Unit exists
PkgMgd->>Service: startOpenShellGatewayUserService()
Service->>Systemctl: daemon-reload
Systemctl-->>Service: result
Service->>Systemctl: enable openshell-gateway
Systemctl-->>Service: result
Service->>Systemctl: restart openshell-gateway
Systemctl-->>Gateway: restart signal
Gateway-->>Service: started / failure
alt Service started
Service-->>PkgMgd: started true
PkgMgd-->>Onboard: success
else Service failed
Service-->>PkgMgd: fallbackAllowed / reason
PkgMgd-->>Onboard: false or exit/throw
end
else Unit not found
PkgMgd-->>Onboard: false
end
Onboard->>Onboard: Standalone gateway fallback path
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
🌿 Preview your docs: https://nvidia-preview-pr-4580.docs.buildwithfern.com/nemoclaw |
E2E Advisor RecommendationRequired E2E: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
PR Review AdvisorFindings: 0 needs attention, 1 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Consider writing more tests for
This is an automated advisory review. A human maintainer must make the final merge decision. |
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ❌ Some jobs failedRun: 26717014158
|
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 2414-2471: The helper function
tryStartPackageManagedDockerDriverGateway should be moved out of
src/lib/onboard.ts into a new sibling module (e.g.,
src/lib/onboard-package-managed.ts) so the big block no longer inflates
onboard.ts; keep startDockerDriverGateway() in onboard.ts as the coordinator and
have it import and call the relocated tryStartPackageManagedDockerDriverGateway.
Ensure you preserve the function signature and all referenced symbols
(dockerDriverGatewayService, clearDockerDriverGatewayRuntimeFiles,
registerDockerDriverGatewayEndpoint, runCaptureOpenshell, isGatewayHealthy,
isGatewayTcpReady, verifySandboxBridgeGatewayReachableOrExit, envInt,
sleepSeconds, GATEWAY_NAME) and their imports/exports so behavior is unchanged,
update module imports in onboard.ts, and export the helper from the new module
for testing/consumption.
- Around line 2443-2461: The call to clearDockerDriverGatewayRuntimeFiles
currently runs before the health-poll loop and removes runtime PID/marker files
prematurely; move that cleanup so it only runs after the gateway is confirmed
healthy (i.e., after isGatewayHealthy(status, namedInfo, currentInfo) && await
isGatewayTcpReady() succeeds) — update the code so
clearDockerDriverGatewayRuntimeFiles is invoked after the successful health
checks (for functions/variables involved: clearDockerDriverGatewayRuntimeFiles,
registerDockerDriverGatewayEndpoint, isGatewayHealthy, isGatewayTcpReady,
verifySandboxBridgeGatewayReachableOrExit) so recovery/fallback logic retains
runtime breadcrumbs until the service is truly up.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ad5de779-9431-49cf-972d-2394b910a053
📒 Files selected for processing (9)
docs/reference/architecture.mdxdocs/reference/commands.mdxscripts/install-openshell.shsrc/lib/onboard.tssrc/lib/onboard/docker-driver-gateway-env.test.tssrc/lib/onboard/docker-driver-gateway-env.tssrc/lib/onboard/docker-driver-gateway-service.test.tssrc/lib/onboard/docker-driver-gateway-service.tstest/install-openshell-version-check.test.ts
Selective E2E Results — ✅ All requested jobs passedRun: 26717115183
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 26717180037
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
2424-2426: Please run one service-path E2E before merge.This branch now short-circuits the legacy startup flow when the package-managed gateway takes ownership, so
openshell-gateway-upgrade-e2eplus one happy-path onboard flow such ascloud-e2eorsandbox-operations-e2ewould give good coverage of the new handoff.As per coding guidelines,
src/lib/onboard.ts: "This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 2424 - 2426, Run a service-path E2E before merging to validate the new short-circuit in src/lib/onboard.ts: execute the openshell-gateway-upgrade-e2e test plus one happy-path onboarding E2E (cloud-e2e or sandbox-operations-e2e) to confirm dockerDriverGatewayEnv.startPackageManagedDockerDriverGateway correctly takes ownership and that the legacy startup path still behaves when it should; verify the flow around verifySandboxBridgeGatewayReachableOrExit and GATEWAY_NAME, observe logs/errors and ensure no regressions in gateway handoff or sandbox creation before merging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 2424-2426: Run a service-path E2E before merging to validate the
new short-circuit in src/lib/onboard.ts: execute the
openshell-gateway-upgrade-e2e test plus one happy-path onboarding E2E (cloud-e2e
or sandbox-operations-e2e) to confirm
dockerDriverGatewayEnv.startPackageManagedDockerDriverGateway correctly takes
ownership and that the legacy startup path still behaves when it should; verify
the flow around verifySandboxBridgeGatewayReachableOrExit and GATEWAY_NAME,
observe logs/errors and ensure no regressions in gateway handoff or sandbox
creation before merging.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e9cba808-c0cf-473b-950a-5efbb2171a45
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/onboard/docker-driver-gateway-env.tssrc/lib/onboard/docker-driver-gateway-service.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/onboard/docker-driver-gateway-env.ts
Selective E2E Results — ❌ Some jobs failedRun: 26717266019
|
Selective E2E Results — ❌ Some jobs failedRun: 26718114347
|
Selective E2E Results — ✅ All requested jobs passedRun: 26718192181
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/reference/commands.mdx`:
- Line 1422: Update the description for the environment variable
NEMOCLAW_OPENSHELL_SANDBOX_BIN to use active voice to match the parallel entry
for NEMOCLAW_OPENSHELL_GATEWAY_BIN: replace the passive phrase "passed to the
Linux Docker-driver standalone fallback" with the active construction "used by
the Linux Docker-driver standalone fallback" while keeping the rest of the
sentence unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b6c6b2a5-0d6f-4db2-a49c-f202dd4ad0d1
📒 Files selected for processing (2)
docs/reference/commands.mdxsrc/lib/onboard/docker-driver-gateway-service.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/onboard/docker-driver-gateway-service.ts
Selective E2E Results — ✅ All requested jobs passedRun: 26718200252
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard/docker-driver-gateway-service.ts (1)
145-150:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAllow standalone fallback for bus/user-manager errors on
restarttoo.Limiting
fallbackAllowedtodaemon-reloadfailures turns asystemctl --user restartbus/user-manager outage into a hard stop. Since the caller insrc/lib/onboard.ts(Lines 2411-2427) routes through this helper before the standalone path, that blocks the documented fallback even though the package-managed service is simply unavailable.🔧 Proposed fix
return { attempted: true, - fallbackAllowed: args[0] === "daemon-reload" && userManagerLooksUnavailable(result.reason ?? ""), + fallbackAllowed: userManagerLooksUnavailable(result.reason ?? ""), reason, started: false, };🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/docker-driver-gateway-service.ts` around lines 145 - 150, The fallbackAllowed check in the runSystemctlUser error handling currently only permits fallback for daemon-reload failures; update that condition to also allow fallback when the invoked command is "restart" and the failure matches the user manager/bus outage detector. In the error branch where runSystemctlUser result.ok is false (inside docker-driver-gateway-service.ts), change the fallbackAllowed expression that references args[0] and userManagerLooksUnavailable(result.reason ?? "") so it returns true for args[0] === "daemon-reload" OR args[0] === "restart" when userManagerLooksUnavailable(...) is true, ensuring restart failures due to bus/user-manager outages permit the standalone fallback.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/lib/onboard/docker-driver-gateway-service.ts`:
- Around line 145-150: The fallbackAllowed check in the runSystemctlUser error
handling currently only permits fallback for daemon-reload failures; update that
condition to also allow fallback when the invoked command is "restart" and the
failure matches the user manager/bus outage detector. In the error branch where
runSystemctlUser result.ok is false (inside docker-driver-gateway-service.ts),
change the fallbackAllowed expression that references args[0] and
userManagerLooksUnavailable(result.reason ?? "") so it returns true for args[0]
=== "daemon-reload" OR args[0] === "restart" when
userManagerLooksUnavailable(...) is true, ensuring restart failures due to
bus/user-manager outages permit the standalone fallback.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fc5291fc-ff9e-4407-9444-a734dc2c9a01
📒 Files selected for processing (3)
src/lib/onboard/docker-driver-gateway-service.test.tssrc/lib/onboard/docker-driver-gateway-service.tstest/e2e/test-openshell-version-pin.sh
🚧 Files skipped from review as they are similar to previous changes (2)
- src/lib/onboard/docker-driver-gateway-service.test.ts
- test/e2e/test-openshell-version-pin.sh
Selective E2E Results — ✅ All requested jobs passedRun: 26719041225
|
Selective E2E Results — ✅ All requested jobs passedRun: 26719097375
|
Selective E2E Results — ✅ All requested jobs passedRun: 26719158501
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Selective E2E Results — ✅ All requested jobs passedRun: 26720532591
|
Selective E2E Results — ✅ All requested jobs passedRun: 26720778107
|
…rvice-lifecycle-v60 # Conflicts: # docs/reference/commands.mdx # src/lib/onboard.ts
Selective E2E Results — ✅ All requested jobs passedRun: 27042899089
|
Selective E2E Results — ✅ All requested jobs passedRun: 27043502178
|
## Summary - Adds the `v0.0.60` section to `docs/about/release-notes.mdx` using the dev announcement from discussion #4877. - Fills the source-doc gaps found during release-prep review across inference, policy tiers, command behavior, security boundaries, Hermes dashboard/tooling, runtime context, and troubleshooting. - Refreshes generated agent skills under `.agents/skills/` from the current Fern docs output and upgrades Fern from `5.44.3` to `5.45.0`. ## Source summary - #4037 -> `docs/reference/architecture.mdx`, `docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents system-only runtime context that stays out of visible chat. - #4875 -> `docs/reference/architecture.mdx`, `docs/about/how-it-works.mdx`, `docs/about/release-notes.mdx`: Documents try-first sandbox network/filesystem guidance and clearer failure classification. - #4788 -> `docs/security/best-practices.mdx`, `docs/about/release-notes.mdx`: Documents shared OpenClaw device-approval policy for startup and connect. - #4768 -> `docs/reference/network-policies.mdx`, `docs/network-policy/integration-policy-examples.mdx`, `docs/get-started/quickstart.mdx`, `docs/get-started/quickstart-hermes.mdx`, `docs/reference/commands.mdx`: Documents `weather`, `public-reference`, and Hermes managed-tool gateway preset behavior. - #3788 and #4864 -> `docs/reference/network-policies.mdx`, `docs/reference/commands.mdx`: Documents non-interactive policy-tier fail-fast behavior and interactive prompt fallback. - #4756 and #4866 -> `docs/reference/commands.mdx`: Documents env-aware default sandbox resolution for `list`, `status`, and `tunnel` commands. - #4320 -> `docs/reference/commands.mdx`: Documents `$$nemoclaw tunnel status` behavior. - #4328 -> `docs/reference/commands.mdx`: Documents line-scoped policy preset descriptions in `policy-list`. - #4580 and #4748 -> `docs/reference/architecture.mdx`: Documents package-managed OpenShell gateway service and Docker-driver gateway-marker behavior. - #4598 -> `docs/manage-sandboxes/lifecycle.mdx`: Documents concurrent gateway/dashboard cleanup isolation by sandbox name and port. - #4777 -> `docs/reference/troubleshooting.mdx`: Documents Docker GPU patch rollback behavior. - #4610 -> `docs/reference/troubleshooting.mdx`, `docs/reference/commands.mdx`: Keeps mutable OpenClaw config permission guidance aligned and removes skipped experimental wording. - #4868 -> `docs/reference/commands.mdx`: Keeps `.dockerignore` handling for custom `onboard --from <Dockerfile>` contexts in generated skills. - #4870 -> `docs/reference/commands.mdx`, `docs/manage-sandboxes/runtime-controls.mdx`: Documents `NEMOCLAW_MINIMAL_BOOTSTRAP` and generated skill coverage. - #4641 -> `docs/inference/inference-options.mdx`, `docs/reference/troubleshooting.mdx`: Documents local NVIDIA NIM platform-digest pulls and served-model id adoption. - #4810 and #4867 -> `docs/inference/inference-options.mdx`: Documents stable NGC managed-vLLM image lineage and DGX Station DeepSeek V4 Flash coverage. - #4852 -> `docs/inference/use-local-inference.mdx`, `docs/reference/troubleshooting.mdx`: Documents Ollama model fit filtering, 16K context floor, cold-load retry, and failed-model exclusion. - #4847 -> `docs/inference/switch-inference-providers.mdx`: Documents API-family sync, Hermes `api_mode`, and Bedrock Runtime exception. - #4800 -> `docs/inference/tool-calling-reliability.mdx`: Documents Nemotron managed-inference native tool-search fallback. - #4333 -> `docs/inference/switch-inference-providers.mdx`: Documents interactive multimodal input prompting. - #4086 -> `docs/reference/troubleshooting.mdx`: Keeps proxy bypass normalization in generated troubleshooting coverage. - #4811 and #4855 -> `docs/get-started/quickstart-hermes.mdx`: Documents prebuilt Hermes dashboard assets and TUI recovery without runtime rebuilds. - #4854 -> `docs/inference/switch-inference-providers.mdx`, `docs/reference/commands.mdx`: Documents Hermes proxy API-key placeholder preservation during inference switches. - #4248 -> `docs/manage-sandboxes/messaging-channels.mdx`, `.agents/skills/`: Keeps messaging enrollment behavior aligned with manifest-hook implementation. - #4771 -> `docs/security/best-practices.mdx`, `docs/security/credential-storage.mdx`: Documents Hermes placeholder-only secret boundary for sandbox-visible runtime files. - #4787 -> `docs/security/best-practices.mdx`, `docs/about/release-notes.mdx`: Documents expanded memory scanner examples for OpenAI project keys and Slack app-level tokens. - #4848 -> `docs/reference/commands.mdx`: Documents OpenClaw skill install mirroring into the agent home directory. - #4790 -> `docs/about/release-notes.mdx`: Uses the prior release-prep structure and generated `.agents/skills/` refresh as the template for this release. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ skills/ --prefix nemoclaw-user --doc-platform fern-mdx --dry-run` - `npm run docs` - `git diff --check` - skip-term scan across `docs/`, `.agents/skills/`, and `skills/` - `npm run build:cli` - `npm run typecheck:cli` - Commit and pre-push hook suites, including markdownlint, gitleaks, env-var docs gate, docs-to-skills verification, and skills YAML tests <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit ## Release Notes * **New Features** * DeepSeek-V4-Flash now available as default inference model for DGX Station. * Hermes dashboard improved with dedicated port and OAuth-authenticated tool gateway selection. * Added weather and public-reference policy presets for expanded agent capabilities. * Enhanced Ollama model selection with GPU memory filtering and automatic retry for timeouts. * **Bug Fixes** * Improved policy tier validation to prevent invalid configurations. * Better sandbox cleanup scoping by port to prevent conflicts across deployments. * Added GPU patch failure recovery with automatic rollback. * **Documentation** * Expanded troubleshooting guides for inference, security, and sandbox lifecycle. * Added .dockerignore best practices for custom deployments. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Co-authored-by: Carlos Villela <cvillela@nvidia.com>
Summary
Post-Computex / v0.0.60 follow-up for the gateway lifecycle half of #4423. This is intentionally separate from #4578, which remains the v0.0.56 safety hotfix for non-destructive status behavior.
openshell-gatewayuser service when its vendor/package unit is present, writing Docker-driver gateway env before restart.Validation
npm run build:clinpm run typecheck:clinpm run checksnpm run source-shape:checknpm run check:installer-hashbash -n scripts/install-openshell.shbash test/e2e/test-openshell-version-pin.shnpx vitest run src/lib/onboard/docker-driver-gateway-env.test.ts src/lib/onboard/docker-driver-gateway-service.test.ts test/install-openshell-version-check.test.ts test/runner.test.ts test/onboard-gateway-runtime.test.ts test/gateway-state-reconcile-2276.test.tsRefs #4423
Follow-up to #4578
Summary by CodeRabbit