fix(uninstall): kill host openshell-gateway process during uninstall (#3516)#3554
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds host-level openshell-gateway termination to the uninstall "Stopping services" step and a unit test that stubs ChangesHost openshell-gateway process cleanup
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
…VIDIA#3516) When NemoClaw spawns openshell-gateway as a host process (non-containerized mode, v0.0.41+), `nemoclaw uninstall` did not stop it — the process survived and kept port 8080 bound. Add a `stopMatchingPids` call for openshell-gateway in the 'Stopping services' step, matching the existing pattern used for OpenShell forward processes and orphaned openshell processes. Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com>
90e3ea4 to
c9c03e7
Compare
|
✨ Thanks for submitting this detailed PR to fix the issue with the uninstall process not killing the running openshell-gateway host process on Ubuntu 24.04. This change aims to improve the reliability of the uninstall lifecycle by preserving the necessary cleanup of host-process gateways. Related open issues: |
## Summary Updates the NemoClaw documentation for the v0.0.45 release by summarizing the user-facing changes merged since v0.0.44 and bumping the docs version metadata. Refreshes generated user skills so agent-facing references match the source docs. ## Changes - Added v0.0.45 release notes covering onboarding recovery, local inference, channel cleanup, share mount diagnostics, uninstall cleanup, and security redaction updates. - Updated command and troubleshooting docs for sandbox name limits, GPU gateway reuse, DNS preflight behavior, channel removal cleanup, and share mount path validation. - Bumped docs version metadata to 0.0.45 and regenerated NemoClaw user skills from the docs. - Source summary: #3672 -> `docs/reference/commands.md`: documented channel removal detaching bridge providers and un-applying channel policy presets. - Source summary: #3678 -> `docs/about/release-notes.md`: documented Ollama streamed usage accounting in the release notes. - Source summary: #3670 -> `docs/reference/commands.md`, `docs/reference/troubleshooting.md`: documented safe GPU gateway replacement behavior. - Source summary: #3664 -> `docs/about/release-notes.md`: documented blueprint permission normalization in the release notes. - Source summary: #3181 -> `docs/reference/troubleshooting.md`: documented GPU toolkit guidance when host drivers work but passthrough is disabled. - Source summary: #3554 -> `docs/about/release-notes.md`: documented host `openshell-gateway` cleanup during uninstall. - Source summary: #3651 -> `docs/reference/troubleshooting.md`: documented the uncached `.invalid` DNS preflight probe. - Source summary: #3643 -> `docs/reference/commands.md`: included existing `NEMOCLAW_PROVIDER` interactive-mode behavior in generated docs. - Source summary: #3647 -> `docs/reference/commands.md`: documented remote sandbox path verification for `share mount`. - Source summary: #3646 -> `docs/reference/commands.md`: included existing local writable mount target guidance in generated docs. - Source summary: #3642 -> `docs/inference/use-local-inference.md`, `docs/reference/commands.md`: documented managed-vLLM model override and gated-model token checks. - Source summary: #3639 -> `docs/reference/commands.md`: documented the 63-character sandbox name limit. ## Type of Change - [ ] Code change (feature, bug fix, or refactor) - [ ] Code change with doc updates - [ ] Doc only (prose changes, no code sample modifications) - [x] Doc only (includes code sample changes) ## Verification - [ ] `npx prek run --all-files` passes - [ ] `npm test` passes - [ ] Tests added or updated for new or changed behavior - [x] No secrets, API keys, or credentials committed - [x] Docs updated for user-facing behavior changes - [x] `make docs` builds without warnings (doc changes only) - [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only) - [ ] New doc pages include SPDX header and frontmatter (new pages only) Commit hooks passed for the staged files. A standalone `npx prek run --all-files` attempt was blocked by sandbox access to `/Users/miyoungc/.cache/prek/prek.log`, so that checkbox is left unchecked. --- <!-- DCO sign-off required by CI. Run: git config user.name && git config user.email --> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Enhanced CLI command reference documentation with clearer guidance on onboarding, GPU passthrough, inference configuration, channel removal, and shared mounts. * Improved troubleshooting sections with better DNS resolution and GPU passthrough remediation steps. * Added documentation for overriding managed vLLM model selection. * Updated release notes for v0.0.45 reflecting infrastructure and workflow improvements. * **Version Bump** * Released v0.0.45. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/3755?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…VIDIA#3516) PR NVIDIA#3554 stopped the host gateway process during `nemoclaw uninstall`, but direct `openshell gateway destroy` (and the sandbox-destroy `--cleanup-gateway` flow) on Linux Docker-driver mode can still leave `/usr/local/bin/openshell-gateway` running on port 8080. Consolidate the cleanup logic in a shared helper used by uninstall, sandbox destroy, and onboard, and update recovery hints to surface a `sudo pkill -f openshell-gateway` remediation alongside the existing `openshell gateway remove` / `destroy` verbs. Key changes: - New src/lib/onboard/host-gateway-process.ts: shared stopper with PID-file primary path, opt-out pgrep fallback for orphans, Docker compat parent recognition (argv0 == docker AND mount-path token), bounded SIGTERM/SIGKILL with verified exit via `ps`, sudo remediation log when a privileged PID survives, and a warning when pgrep is unavailable so an uninstaller never claims success while an orphan is still bound. - src/lib/actions/uninstall/run-plan.ts: invoke the shared helper during the "Stopping services" step. - src/lib/actions/sandbox/destroy.ts: replace the local PID-file probe with the shared helper and disable the pgrep sweep so a same-host gateway run for an unrelated project is not torn down by `destroy --cleanup-gateway`. - src/lib/onboard.ts: route `stopDockerDriverGatewayProcess` and the drift restart through the shared helper (drift restart targets only the supplied PID); update the final-failure recovery banner to add the sudo pkill line on Linux while keeping both lifecycle verbs. - Recovery hint helpers (gpu-recovery.ts, gateway-gpu-passthrough.ts) and user-facing docs (`docs/manage-sandboxes/lifecycle.mdx`, `docs/reference/commands.mdx`, `docs/reference/troubleshooting.mdx`, `scripts/install.sh`) tell users to run `gateway remove` first with the legacy `gateway destroy -g` fallback, then `sudo pkill -f openshell-gateway` for stuck privileged processes. Tests: - New unit tests cover the PID-file path, pgrep fallback, false-positive rejection (random `vim /opt/nemoclaw/openshell-gateway`, unrelated cmdlines mentioning openshell-gateway), Docker compat parent matching, pgrep-skip behaviour when explicit PIDs are passed, and the pgrep-unavailable warning. - Existing uninstall + onboard tests updated for the new copy. - E2E verified by spawning a fake `/tmp/.../openshell-gateway` that binds a unique port and confirming both PID-file-only and pgrep-only branches stop it via the built dist/ entry point. Signed-off-by: Yimo Jiang <yimoj@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fixes #3516 —
nemoclaw uninstall --yesleaves the host-processopenshell-gatewayrunning and holding port 8080.Root cause
When host glibc satisfies the gateway requirement (v0.0.41+), NemoClaw spawns
/usr/local/bin/openshell-gatewaydirectly as a host process instead of inside a container. TheexecutePlan()"Stopping services" step did not include a kill for this host-process gateway — it only handled the containerized path viaopenshell gateway destroy.Fix
Add a
stopMatchingPids("openshell-gateway", ...)call in the "Stopping services" step ofexecutePlan(), following the existing pattern used for OpenShell forward processes and orphaned openshell processes. This usespgrep -f openshell-gatewayto find and SIGTERM/SIGKILL all matching gateway processes before the state directory cleanup.Testing
pgrepreturning a gateway PID and verifies it gets killedChanges
src/lib/actions/uninstall/run-plan.ts: 1 line — addstopMatchingPidsfor host gatewaysrc/lib/actions/uninstall/run-plan.test.ts: new test case for [Ubuntu 24.04][Install] nemoclaw uninstall does not kill running openshell-gateway host process — port 8080 leaks after uninstall #3516Signed-off-by: kagura-agent kagura-agent@users.noreply.github.com
Summary by CodeRabbit
Bug Fixes
Tests