fix(onboard): don't leave a default sandbox when cancelling at policy presets#4642
Conversation
… presets Pressing Ctrl+C at the [8/8] Policy presets step left a fully created OpenShell container registered as the default sandbox with no policies applied. Two things caused the orphan: 1. The sandbox was marked default during the "sandbox" step (in createSandbox and the sandbox state handler) — two steps before the policy prompt — so a cancel at the prompt left it as a valid default. 2. Nothing rolled the freshly-created sandbox back on cancel. Defer default-marking to the finalization step, which only runs once the policy presets are confirmed, and add a cancel-rollback guard that deletes the container and unregisters the sandbox when the operator cancels (Ctrl+C / SIGTERM) at the policy prompt. The guard is a two-key gate (armed after a brand-new createSandbox, fired only on an explicit cancel), so ordinary failure-path exits and recreate/rebuild of existing sandboxes are untouched. Fixes NVIDIA#4614 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Deferring default-marking to finalization (the NVIDIA#4614 fix) introduced a failure-path gap for recreate/rebuild: createSandbox removes the existing registry entry (clearing the default pointer) and re-registers it without a default, so a rebuild that fails before finalization left the operator's previously-default sandbox no longer marked default until a successful re-run. Snapshot whether the sandbox is the default at createSandbox entry — before the recreate tears it down — and restore that flag right after re-registration. A brand-new sandbox was never the default, so it stays deferred and the NVIDIA#4614 behaviour is unchanged; only a recreate of an already-default sandbox restores it eagerly. Refs NVIDIA#4614 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds default-preservation helpers and an idempotent sandbox cancel-rollback guard, defers marking created sandboxes as default until finalization, wires cancel-exit handlers to invoke rollback (delete + unregister), and updates tests to reflect the new timing. ChangesSandbox cancellation rollback and default flag timing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
6075-6092:⚠️ Potential issue | 🟠 Major | ⚡ Quick winThe active policy-presets TUI still skips rollback on cancel.
setupPoliciesWithSelection()wiresselectTierPresetsAndAccess(), but this selector's SIGTERM/Ctrl+C paths callprocess.exit(1)withoutsandboxCancelRollback.markCancelled(). Cancelling in the actual[8/8] Policy presetsprompt will therefore leave the newly created sandbox/container behind because the exit hook still seescancelled === false.🩹 Minimal fix
const onSigterm = () => { cleanup(); + sandboxCancelRollback.markCancelled(); process.exit(1); }; @@ } else if (key === "\x03") { cleanup(); + sandboxCancelRollback.markCancelled(); process.exit(1); } else if (key === "\x1b[A" || key === "k") {Also applies to: 6275-6288
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 6075 - 6092, The SIGTERM/Ctrl+C handlers in selectTierPresetsAndAccess() (the onSigterm and onData paths) call process.exit without marking the sandbox rollback as cancelled, so cancelling the "[8/8] Policy presets" prompt leaves the sandbox behind; update those interrupt paths to call sandboxCancelRollback.markCancelled() before invoking cleanup() and process.exit(1). Apply the same fix to the other identical block referenced (the second onSigterm/onData at the later range) so both SIGTERM and Ctrl+C branches invoke sandboxCancelRollback.markCancelled() prior to cleanup/exit.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3877-3886: The rollback arming uses isRecreateSandbox() which only
reflects the CLI flag and misses recreate paths inside createSandbox()
(drift/rotation/missing providers) causing
sandboxCancelRollback.arm(sandboxName) to wrongly delete rebuilt sandboxes;
change the condition to rely on whether the sandbox truly did NOT exist before
this operation (e.g., a boolean like originalSandboxExisted or
createdNewSandbox) that you set before any delete/recreate logic in
createSandbox(), and use that flag here instead of isRecreateSandbox() so
rollback is armed only for genuine brand-new creations.
- Around line 6332-6345: The inline process-hook orchestration (creating
sandboxCancelRollback via createSandboxCancelRollback and attaching
process.on("exit", () => sandboxCancelRollback.runIfArmed())) should be moved
into a small helper function (e.g., registerSandboxCancelRollback or
setupSandboxCancelRollback) inside the existing onboard helpers so this file
shrinks; implement that helper to build the rollback with deleteSandboxContainer
(using runOpenshell), removeSandboxFromRegistry (registry.removeSandbox) and
log, attach the exit handler to call runIfArmed, export the helper, and replace
the current inline block with a single call to the new helper to preserve
behavior while moving the implementation elsewhere.
---
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 6075-6092: The SIGTERM/Ctrl+C handlers in
selectTierPresetsAndAccess() (the onSigterm and onData paths) call process.exit
without marking the sandbox rollback as cancelled, so cancelling the "[8/8]
Policy presets" prompt leaves the sandbox behind; update those interrupt paths
to call sandboxCancelRollback.markCancelled() before invoking cleanup() and
process.exit(1). Apply the same fix to the other identical block referenced (the
second onSigterm/onData at the later range) so both SIGTERM and Ctrl+C branches
invoke sandboxCancelRollback.markCancelled() prior to cleanup/exit.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7a3387f6-3cfc-43db-b1ea-680f9de41ba9
📒 Files selected for processing (9)
src/lib/onboard.tssrc/lib/onboard/cancel-rollback.test.tssrc/lib/onboard/cancel-rollback.tssrc/lib/onboard/default-preservation.test.tssrc/lib/onboard/default-preservation.tssrc/lib/onboard/machine/handlers/finalization.test.tssrc/lib/onboard/machine/handlers/finalization.tssrc/lib/onboard/machine/handlers/sandbox.test.tssrc/lib/onboard/machine/handlers/sandbox.ts
…s under budget Addresses CodeRabbit review on NVIDIA#4642: - isRecreateSandbox() only reflects the CLI/env flag, but createSandbox() also recreates existing sandboxes for drift, credential rotation, missing providers, and non-ready state while that flag is false. Arming the cancel-rollback there would delete the rebuilt sandbox on Ctrl+C and wipe an operator's existing sandbox. Arm only when the sandbox did not exist before this run (captureSandboxPriorState().existed), so a cancel can never delete a pre-existing sandbox regardless of why it was being recreated. - Move the process-exit hook and the shared cancel-exit handler behind focused helpers (installSandboxCancelRollback, makeOnboardCancelExit) and collapse the pre-recreate snapshot into captureSandboxPriorState, bringing src/lib/onboard.ts back to net-negative under the codebase-growth guardrail. test/onboard.test.ts: createSandbox no longer marks the sandbox default (that is deferred to finalization), so it records no setDefault call. Refs NVIDIA#4614 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
5863-5875:⚠️ Potential issue | 🟠 Major | ⚡ Quick winWire the cancel helper into the preset/access selector too.
This only updates
selectPolicyTier()andpresetsCheckboxSelector(). The flow used bysetupPoliciesWithSelection()still passes throughselectTierPresetsAndAccess(), and that function still exits directly on SIGTERM/Ctrl+C, so cancelling on the actual preset/access screen bypassessandboxCancelRollbackand can still leave the fresh sandbox behind.Apply the same handler in `selectTierPresetsAndAccess()`
- const onSigterm = () => { - cleanup(); - process.exit(1); - }; + const onSigterm = makeOnboardCancelExit(sandboxCancelRollback, cleanup); process.once("SIGTERM", onSigterm); const onData = (key: string) => { if (key === "\r" || key === "\n") { cleanup(); process.stdout.write("\n"); resolve( ordered .filter((p) => included.has(p.name)) .map((p) => ({ name: p.name, access: accessModes[p.name] })), ); } else if (key === "\x03") { - cleanup(); - process.exit(1); + onSigterm(); } else if (key === "\x1b[A" || key === "k") {As per coding guidelines,
src/lib/onboard.ts: This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow.Also applies to: 6200-6209
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 5863 - 5875, selectTierPresetsAndAccess() still exits directly on SIGTERM/Ctrl+C and must be wired to the same cancel helper used in selectPolicyTier() and presetsCheckboxSelector(); create an onSigterm handler via makeOnboardCancelExit(sandboxCancelRollback, cleanup), register it with process.once("SIGTERM", onSigterm), and replace any direct process.exit or direct SIGINT handling (including any inline makeOnboardCancelExit invocations) with calls to that onSigterm handler (and ensure the Ctrl+C key code "\x03" branch invokes the cancel helper so cleanup + sandboxCancelRollback run and the function resolves/rejects consistently).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/onboard.ts`:
- Line 2786: The current code captures prior sandbox state with
captureSandboxPriorState(registry, sandboxName) before
pruneStaleSandboxEntry(sandboxName), causing stale registry entries to mark a
non-existent sandbox as "existed" and preserve wasDefault incorrectly; change
the flow so that you determine existence and defaultness from the
post-prune/live state (either call captureSandboxPriorState after
pruneStaleSandboxEntry or query liveExists directly) and pass that corrected
existed/wasDefault into restoreDefaultAfterRecreate/rollback-arming logic;
update any uses in the surrounding onboarding logic (including the variables set
from captureSandboxPriorState and any calls to restoreDefaultAfterRecreate) so
wasDefault is only preserved when the sandbox is actually live.
---
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 5863-5875: selectTierPresetsAndAccess() still exits directly on
SIGTERM/Ctrl+C and must be wired to the same cancel helper used in
selectPolicyTier() and presetsCheckboxSelector(); create an onSigterm handler
via makeOnboardCancelExit(sandboxCancelRollback, cleanup), register it with
process.once("SIGTERM", onSigterm), and replace any direct process.exit or
direct SIGINT handling (including any inline makeOnboardCancelExit invocations)
with calls to that onSigterm handler (and ensure the Ctrl+C key code "\x03"
branch invokes the cancel helper so cleanup + sandboxCancelRollback run and the
function resolves/rejects consistently).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 255f463d-945b-4ac3-9999-e1c907243d0b
📒 Files selected for processing (6)
src/lib/onboard.tssrc/lib/onboard/cancel-rollback.test.tssrc/lib/onboard/cancel-rollback.tssrc/lib/onboard/default-preservation.test.tssrc/lib/onboard/default-preservation.tstest/onboard.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/onboard/cancel-rollback.test.ts
…tence CodeRabbit review on NVIDIA#4642: captureSandboxPriorState() read registry state before pruneStaleSandboxEntry() ran, so a stale registry row was treated as an existing sandbox. That let a brand-new sandbox (with a leftover stale entry) be marked default immediately and skip rollback arming — reintroducing the bug for that case. Use the post-prune `liveExists` signal instead: arm cancel-rollback when the sandbox was not live before this run, and only preserve the default when it was a genuinely live default (`liveExists && wasSandboxDefault(...)`). Drop the pre-prune captureSandboxPriorState helper. Refs NVIDIA#4614 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
CodeRabbit review on NVIDIA#4642: selectTierPresetsAndAccess() — the preset/access screen reached via setupPoliciesWithSelection() — still exited directly on SIGTERM / Ctrl+C, bypassing the cancel-rollback. Cancelling there could still leave the freshly-created sandbox behind. Route its SIGTERM and Ctrl+C ("\x03") handlers through makeOnboardCancelExit(sandboxCancelRollback, cleanup), matching selectPolicyTier(). Refs NVIDIA#4614 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
…hantom A live end-to-end repro (build sandbox -> Ctrl+C at policy tier) showed the cancel-rollback deleted the container and removed the registry entry, but `nemoclaw list` still showed the sandbox. Its session-recovery (inventory keys on session.steps.sandbox.status === "complete") resurrected the just-removed entry as a phantom pointing at a deleted sandbox. Have the rollback also clear the onboard session (clearOnboardSession -> onboardSession.clearSession), so an aborted+rolled-back onboard leaves no resumable/recoverable record. The unit suite couldn't catch this — it doesn't exercise the list session-recovery path; the live repro did. Refs NVIDIA#4614 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
prekshivyas
left a comment
There was a problem hiding this comment.
APPROVE. Cleanly resolves #4614.
Verified the one real hazard — over-deletion of a pre-existing sandbox on cancel — is avoided: rollback arming is gated on the post-prune !liveExists signal (onboard.ts:3870), so recreate/drift/rotation/missing-provider rebuilds keep liveExists === true and are never armed; default-preservation reads the same post-prune state. All three policy-preset selectors route Ctrl-C and SIGTERM through makeOnboardCancelExit (markCancelled → exit(1)), and rollback fires from the process.on("exit") hook, deleting container + registry entry + onboard session (clearing the session prevents nemoclaw list from resurrecting a phantom). Default-marking is correctly deferred from the sandbox step to handleFinalizationState, so a cancel before finalization can never leave a default.
CI green on 8c767f6, growth guardrail satisfied, all 3 CodeRabbit threads addressed. Negative/edge coverage is good (disarm, not-cancelled, double-run, delete-failure).
Non-blocking note: the arm→policy window includes the non-interactive agent-setup step (no exit/prompt, default not yet set), so no default orphan is reachable there.
Signed-off-by: Prekshi Vyas prekshiv@nvidia.com
Summary
Cancelling
nemoclaw onboardwith Ctrl+C at the[8/8] Policy presetsstep left a fully created OpenShell container registered as the default sandbox withpolicies: none. This defers default-marking until presets are confirmed and, on an explicit cancel of a genuinely new sandbox, rolls it back completely (container + registry + onboard session).Related Issue
Fixes #4614
Changes
createSandboxand the sandbox state handler) — two steps before the policy prompt. Default is now set inhandleFinalizationState, which runs only once presets are confirmed.onboard/cancel-rollback.ts). On an explicit cancel (Ctrl+C / SIGTERM) at any policy-preset prompt — the tier selector (selectPolicyTier) and the preset/access selector (selectTierPresetsAndAccess) — the freshly-created sandbox is rolled back: OpenShell container deleted, registry entry removed, and the aborted onboard session cleared. The session clear is required becausenemoclaw list's session-recovery (keyed onsession.steps.sandbox.status === "complete") would otherwise resurrect the removed sandbox as a phantom.liveExistssignal (afterpruneStaleSandboxEntry), not the CLI--recreateflag. Any sandbox that already existed live — including drift / credential-rotation / missing-provider / non-ready recreate paths — is never deleted on cancel, and a stale registry row can't masquerade as an existing sandbox.onboard/default-preservation.ts). A recreate removes the registry entry (clearing the default pointer) and re-registers without one. The prior default is captured after the stale-entry prune (liveExists && wasSandboxDefault(...)) and restored after re-registration, so a rebuild that fails before finalization doesn't silently demote the operator's default sandbox. Brand-new sandboxes were never live/default, so the [All Platforms][Onboard] Ctrl+C at Policy presets leaves sandbox created and registered as default #4614 deferral is unchanged.New logic lives in focused modules under
src/lib/onboard/;src/lib/onboard.tsis net-negative under the codebase-growth guardrail.Type of Change
Verification
npm run typecheck:cliclean; all CI checks greencancel-rollback(rollback gate,installSandboxCancelRollback,makeOnboardCancelExit),default-preservation, and thesandbox/finalizationhandlers; plus an updatedtest/onboard.test.tsassertingcreateSandboxno longer marks the sandbox default.nemoclaw onboard, let it build the sandbox, and sent a literal Ctrl+C (\x03) at the Policy tier TUI. Result: OpenShell container deleted, sandbox deleted from the gateway, registry entry removed, onboard session cleared, the pre-existing default sandbox untouched,nemoclaw listclean (no phantom), exit code 1.nemoclaw listis clean.Signed-off-by: Dongni Yang dongniy@nvidia.com