fix(onboard): preserve PVC across host stop/start by recovering stopped gateway (#4187)#4210
Conversation
…ed gateway (NVIDIA#4187) After a host VM stop/start, the legacy k3s `openshell-cluster-<gateway>` container is stopped but still holds the local-path PVC volumes. verifyGatewayContainerRunning previously collapsed a stopped container into "missing", so the healthy-gateway preflight and machine handler went straight into destroyGatewayForReuse, which removes openshell-cluster-* Docker volumes — i.e. the PVC backing data — before sandbox recreation provisions a fresh, empty workspace. Split GatewayContainerState into running/stopped/missing/unknown and treat a stopped container as recoverable: try the existing recoverGatewayRuntime (openshell gateway start) before any destructive cleanup. On success we keep the registry, volumes, and the recorded sandbox intact; on failure we exit with guidance instead of silently deleting volumes. Truly missing containers, unhealthy HTTP, image drift, and unknown Docker states keep their existing semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughA new ChangesGateway recovery on stopped container
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
2085-2160: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winExtract this stopped-container recovery branch out of
src/lib/onboard.ts.CI is currently red on the onboard-entrypoint budget, and this inline recovery/image-drift block is the new net growth. Please move it behind a helper or the existing gateway handler so the top-level entrypoint stays flat and the PR becomes mergeable.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 2085 - 2160, Extract the "stopped" branch into a new async helper (e.g. handleStoppedContainerRecovery) that encapsulates the logic currently under "else if (containerState === 'stopped')" and returns the outcomes needed by the caller (at minimum { checkImageDrift: boolean, gatewayReuseState?: ReturnType<typeof destroyGatewayForReuse> } or similar). Move the block that calls recoverGatewayRuntime, runs runOpenshell(["forward","stop",...]), logs messages, calls destroyGatewayForReuse, and may call process.exit(1) into that helper, preserving uses of recoverGatewayRuntime, runOpenshell, destroyGatewayForReuse, getGatewayLocalEndpoint, and waitForGatewayHttpReady; then replace the inlined branch with a single await call to the helper and assign its returned checkImageDrift and gatewayReuseState into the surrounding scope so the subsequent code (including getGatewayClusterImageDrift) works unchanged.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
2094-2120: Please run the stop/start recovery E2Es on this branch.This change sits in the core onboarding recovery path that preserves PVC-backed workspaces across host restarts, so I'd want the stop/start and rebuild suites exercised before merge:
cloud-e2e,sandbox-operations-e2e,rebuild-openclaw-e2e,channels-stop-start-e2e,openshell-gateway-upgrade-e2e, andissue-3600-gpu-proof-optional-e2e.As per coding guidelines, "
src/lib/onboard.ts: This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow." and the listed E2E jobs are the recommended coverage for this path.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 2094 - 2120, This change modifies core onboarding stop/start recovery in src/lib/onboard.ts (see recoverGatewayRuntime(), getGatewayLocalEndpoint(), and the stopped-container branch that sets checkImageDrift and may call process.exit(1)), so run the full stop/start and rebuild E2E suites listed (cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e, channels-stop-start-e2e, openshell-gateway-upgrade-e2e, issue-3600-gpu-proof-optional-e2e) against this branch; if any test fails, trace failures back to the stopped-container recovery path, fix issues in recoverGatewayRuntime() or related error handling (preserve PVCs, proper port checks, non-destructive restart), re-run the failing suites until all pass, and only then mark the PR ready to merge.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 2085-2160: Extract the "stopped" branch into a new async helper
(e.g. handleStoppedContainerRecovery) that encapsulates the logic currently
under "else if (containerState === 'stopped')" and returns the outcomes needed
by the caller (at minimum { checkImageDrift: boolean, gatewayReuseState?:
ReturnType<typeof destroyGatewayForReuse> } or similar). Move the block that
calls recoverGatewayRuntime, runs runOpenshell(["forward","stop",...]), logs
messages, calls destroyGatewayForReuse, and may call process.exit(1) into that
helper, preserving uses of recoverGatewayRuntime, runOpenshell,
destroyGatewayForReuse, getGatewayLocalEndpoint, and waitForGatewayHttpReady;
then replace the inlined branch with a single await call to the helper and
assign its returned checkImageDrift and gatewayReuseState into the surrounding
scope so the subsequent code (including getGatewayClusterImageDrift) works
unchanged.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 2094-2120: This change modifies core onboarding stop/start
recovery in src/lib/onboard.ts (see recoverGatewayRuntime(),
getGatewayLocalEndpoint(), and the stopped-container branch that sets
checkImageDrift and may call process.exit(1)), so run the full stop/start and
rebuild E2E suites listed (cloud-e2e, sandbox-operations-e2e,
rebuild-openclaw-e2e, channels-stop-start-e2e, openshell-gateway-upgrade-e2e,
issue-3600-gpu-proof-optional-e2e) against this branch; if any test fails, trace
failures back to the stopped-container recovery path, fix issues in
recoverGatewayRuntime() or related error handling (preserve PVCs, proper port
checks, non-destructive restart), re-run the failing suites until all pass, and
only then mark the PR ready to merge.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: ead8a67c-bef3-4038-a163-3d4d2ade806a
📒 Files selected for processing (5)
src/lib/onboard.tssrc/lib/onboard/gateway-container-running.test.tssrc/lib/onboard/gateway-container-running.tssrc/lib/onboard/machine/handlers/gateway.test.tssrc/lib/onboard/machine/handlers/gateway.ts
…o a module CI's onboard-entrypoint-budget guard requires `src/lib/onboard.ts` to be net-neutral or smaller. The NVIDIA#4187 fix added ~33 lines inline; move the preflight reconciliation block (existing missing/unknown/HTTP/image-drift branches plus the new stopped-container recovery branch) into `src/lib/onboard/preflight-gateway-reuse.ts` so the entry point shrinks instead of growing. Pure refactor — no behavior change vs the previous commit. Adds focused unit coverage for the extracted module. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/onboard/preflight-gateway-reuse.test.ts (1)
148-161: ⚡ Quick winAdd the non-destructive
unknown+ HTTP-miss regression case.This suite only locks down the HTTP-ready half of the
unknownbranch. The implementation comment makes the HTTP-miss path the critical safety case, because regressing it back into cleanup would delete a live gateway after a transient Docker probe failure.🧪 Suggested test
it("does not destroy on unknown container state when HTTP is responding", async () => { const destroyForReuse = vi.fn(() => "missing" as GatewayReuseState); const deps = makeDeps({ verifyGatewayContainerRunning: vi.fn(() => "unknown" as GatewayContainerState), waitForGatewayHttpReady: vi.fn(async () => true), destroyGatewayForReuse: destroyForReuse, }); const result = await reconcilePreflightGatewayReuseState(deps); expect(result).toBe("healthy"); expect(destroyForReuse).not.toHaveBeenCalled(); expect(deps.exitProcess).not.toHaveBeenCalled(); }); + + it("does not destroy on unknown container state when HTTP is not responding", async () => { + const destroyForReuse = vi.fn(() => "missing" as GatewayReuseState); + const deps = makeDeps({ + verifyGatewayContainerRunning: vi.fn(() => "unknown" as GatewayContainerState), + waitForGatewayHttpReady: vi.fn(async () => false), + destroyGatewayForReuse: destroyForReuse, + }); + + const result = await reconcilePreflightGatewayReuseState(deps); + + expect(result).toBe("healthy"); + expect(destroyForReuse).not.toHaveBeenCalled(); + expect(deps.exitProcess).not.toHaveBeenCalled(); + });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/preflight-gateway-reuse.test.ts` around lines 148 - 161, Add a test that covers the regression case where verifyGatewayContainerRunning returns "unknown" while waitForGatewayHttpReady reports HTTP not responding (false) to ensure we do NOT destroy the gateway; create a mock destroyGatewayForReuse (e.g., destroyForReuse) and a deps via makeDeps where verifyGatewayContainerRunning returns "unknown" and waitForGatewayHttpReady resolves to false, call reconcilePreflightGatewayReuseState(deps), then assert the result is "healthy" (or the expected non-destroy state), that destroyForReuse was not called, and deps.exitProcess was not called; reference the existing test pattern using reconcilePreflightGatewayReuseState, makeDeps, verifyGatewayContainerRunning, waitForGatewayHttpReady, destroyGatewayForReuse, and exitProcess to locate where to add this case.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/onboard/preflight-gateway-reuse.test.ts`:
- Around line 148-161: Add a test that covers the regression case where
verifyGatewayContainerRunning returns "unknown" while waitForGatewayHttpReady
reports HTTP not responding (false) to ensure we do NOT destroy the gateway;
create a mock destroyGatewayForReuse (e.g., destroyForReuse) and a deps via
makeDeps where verifyGatewayContainerRunning returns "unknown" and
waitForGatewayHttpReady resolves to false, call
reconcilePreflightGatewayReuseState(deps), then assert the result is "healthy"
(or the expected non-destroy state), that destroyForReuse was not called, and
deps.exitProcess was not called; reference the existing test pattern using
reconcilePreflightGatewayReuseState, makeDeps, verifyGatewayContainerRunning,
waitForGatewayHttpReady, destroyGatewayForReuse, and exitProcess to locate where
to add this case.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 995716f7-326b-436e-b75f-db77574bf558
📒 Files selected for processing (3)
src/lib/onboard.tssrc/lib/onboard/preflight-gateway-reuse.test.tssrc/lib/onboard/preflight-gateway-reuse.ts
…IDIA#4187) Adds the regression test CodeRabbit flagged on the refactor commit: the preflight reconciliation must NOT destroy the gateway when Docker inspect returns "unknown" and the HTTP probe also fails. That sequence is the NVIDIA#2020 safety contract — a transient daemon hiccup plus a warm-up miss would otherwise route a live gateway through cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
Summary
After a host VM stop/start, NemoClaw was deleting the existing
openshell-cluster-*Docker volumes (which hold the legacy k3s local-path PVC backing data) before recreating the sandbox, so the user came back to an empty workspace. Split gateway container state to distinguish a stopped existing container from a missing one and attempt non-destructive recovery before any destructive cleanup.Related Issue
Fixes #4187
Changes
src/lib/onboard/gateway-container-running.ts: splitGatewayContainerStateintorunning/stopped/missing/unknown. A Docker inspect that succeeds with{{.State.Running}} == falsenow returnsstopped(was collapsed intomissing).src/lib/onboard/machine/handlers/gateway.ts: add astoppedbranch that calls a newrecoverGatewayRuntimedep (existingopenshell gateway startrecovery path) before any destructive cleanup. On success, continue with reuse (image-drift check still applies). On failure, exit with guidance instead of silently deleting volumes.src/lib/onboard.tspreflight: mirror the samestoppedrecovery flow for the early-preflight gateway sanity check. WirerecoverGatewayRuntimeinto the handler deps.stoppeddetection and three handler scenarios (recovery succeeds, recovery fails + refuse-to-destroy, recovery succeeds with image drift still triggering recreate).Type of Change
Verification
npm testpasses (2 pre-existing failures insrc/lib/inference/local.test.tson upstream/main, unrelated to this change)npm run typecheck:clipassesE2E note
Full hosted-VM stop/start reproduction was not possible from this workspace because the installed
openshell 0.0.39CLI does not exposegateway start/destroy(no lifecycle commands), so the buggy code path is gated off on this host. The affected population is hosts running an older OpenShell with the legacy k3s-in-Docker gateway. Coverage:gateway.test.tsexercise the full handler flow with the newstoppedstate.verifyGatewayContainerRunningreturnsstoppedfor an actually stoppedopenshell-cluster-*container (previously returnedmissing).Signed-off-by: Yimo Jiang yimoj@nvidia.com
Summary by CodeRabbit
New Features
Bug Fixes
Tests