fix(connect): auto-recover from SSH identity drift after host reboot#2057
fix(connect): auto-recover from SSH identity drift after host reboot#2057
Conversation
…ixes #2056) Three fixes for the post-reboot reconnection failure: 1. Registry recovery gate now triggers for bare `nemoclaw <name>` (no explicit action). Previously `args[0]` was undefined, which didn't match the allowlist, skipping recovery entirely. 2. Live gateway probe now runs when `requestedSandboxName` is set, even if the registry is empty and there's no session file. 3. Identity drift in `ensureLiveSandboxOrExit` now auto-clears stale SSH known_hosts entries and retries the sandbox lookup instead of immediately exiting with an error. Signed-off-by: Test User <test@example.com>
📝 WalkthroughWalkthroughExtended registry recovery probing and improved SSH identity-drift handling: the CLI now prunes stale Changes
Sequence DiagramsequenceDiagram
participant CLI as CLI (nemoclaw)
participant Gateway as SSH Gateway
participant KnownHosts as ~/.ssh/known_hosts
participant Registry as Sandbox Registry
CLI->>Gateway: getReconciledSandboxGatewayState()
Gateway-->>CLI: handshake identity change error
Note over CLI: Detect identity_drift
CLI->>KnownHosts: pruneKnownHostsEntries()
KnownHosts-->>CLI: stale gateway entries removed
CLI->>Gateway: getReconciledSandboxGatewayState() (retry)
alt Retry returns present
Gateway-->>CLI: sandbox state retrieved
CLI->>Registry: return reconciled state (present, recoveredGateway=true)
Registry-->>CLI: success
else Retry still failing
Gateway-->>CLI: failure
CLI-->>CLI: emit updated failure message (include optional retry output) and exit non-zero
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/nemoclaw.ts (1)
776-791:⚠️ Potential issue | 🟡 MinorHandle
retry.state === "missing"after host-key cleanup.At Line 776 retry can return
missing, but current flow treats it as generic reconnect failure and leaves stale registry/session state behind.💡 Proposed fix
const retry = await getReconciledSandboxGatewayState(sandboxName); if (retry.state === "present") { console.error(" ✓ Reconnected after clearing stale SSH host keys."); return retry; } + if (retry.state === "missing") { + registry.removeSandbox(sandboxName); + const session = onboardSession.loadSession(); + if (session && session.sandboxName === sandboxName) { + onboardSession.updateSession((s) => { + s.sandboxName = null; + return s; + }); + } + console.error(` Sandbox '${sandboxName}' is not present in the live OpenShell gateway.`); + console.error(" Removed stale local registry entry."); + process.exit(1); + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/nemoclaw.ts` around lines 776 - 791, The retry result from getReconciledSandboxGatewayState can be "missing" after clearing host keys but the current branch treats all failures the same; update the logic that checks retry.state (after getReconciledSandboxGatewayState(sandboxName)) to explicitly detect retry.state === "missing" and, in that case, clear any stale registry/session records tied to sandboxName (the same cleanup path used for a missing sandbox), emit a clear log that the sandbox is missing and registry/session state was cleaned, and then exit with an appropriate message directing the user to recreate/onboard the sandbox; keep the existing reconnect-success branch (retry.state === "present") unchanged and only fall through to the generic reconnect-failure block for other non-missing error states.
🧹 Nitpick comments (1)
test/reboot-identity-drift.test.ts (1)
10-13: Add a positive auto-recovery assertion and refresh stale header notes.Current identity-drift tests validate the unrecoverable path well, but they do not verify the new success path where retry works after host-key cleanup. Also, the Line 12 note (“Once the fix … lands”) is now outdated.
Also applies to: 273-295
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/reboot-identity-drift.test.ts` around lines 10 - 13, Update the test header/note text to remove the stale “Once the fix for `#2056` lands” comment and instead document the expected auto-recovery behavior, and modify the test titled "Identity drift is detected and surfaced (current behavior)" to include a positive auto-recovery assertion: after simulating identity-drift and performing the host-key cleanup/retry step (the same sequence already used in the test), assert that the subsequent retry succeeds (e.g., expect success/connection established instead of only the unrecoverable error), and apply the same header/note and assertion update to the parallel test block that mirrors this behavior elsewhere in the file.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@src/nemoclaw.ts`:
- Around line 776-791: The retry result from getReconciledSandboxGatewayState
can be "missing" after clearing host keys but the current branch treats all
failures the same; update the logic that checks retry.state (after
getReconciledSandboxGatewayState(sandboxName)) to explicitly detect retry.state
=== "missing" and, in that case, clear any stale registry/session records tied
to sandboxName (the same cleanup path used for a missing sandbox), emit a clear
log that the sandbox is missing and registry/session state was cleaned, and then
exit with an appropriate message directing the user to recreate/onboard the
sandbox; keep the existing reconnect-success branch (retry.state === "present")
unchanged and only fall through to the generic reconnect-failure block for other
non-missing error states.
---
Nitpick comments:
In `@test/reboot-identity-drift.test.ts`:
- Around line 10-13: Update the test header/note text to remove the stale “Once
the fix for `#2056` lands” comment and instead document the expected auto-recovery
behavior, and modify the test titled "Identity drift is detected and surfaced
(current behavior)" to include a positive auto-recovery assertion: after
simulating identity-drift and performing the host-key cleanup/retry step (the
same sequence already used in the test), assert that the subsequent retry
succeeds (e.g., expect success/connection established instead of only the
unrecoverable error), and apply the same header/note and assertion update to the
parallel test block that mirrors this behavior elsewhere in the file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 5c689297-8da6-4b48-a921-c1bc2f79c7b2
📒 Files selected for processing (3)
src/nemoclaw.tstest/cli.test.tstest/reboot-identity-drift.test.ts
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/nemoclaw.ts`:
- Around line 776-790: The code treats any non-"present" retry as a reconnect
failure; explicitly handle the "missing" retry result returned by
getReconciledSandboxGatewayState(sandboxName) — add a branch checking if
retry.state === "missing" and log a clear message that the sandbox is now
missing after pruning known_hosts (indicating the stale registry/session was
cleared), provide the correct next step (e.g., suggest `nemoclaw onboard`), and
return retry instead of falling through to the generic failure/error path so the
stale entry does not survive and the user gets the proper guidance.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: d901c9b8-edc6-4e33-adfb-e6741fdaa39a
📒 Files selected for processing (1)
src/nemoclaw.ts
| const retry = await getReconciledSandboxGatewayState(sandboxName); | ||
| if (retry.state === "present") { | ||
| console.error(" ✓ Reconnected after clearing stale SSH host keys."); | ||
| return retry; | ||
| } | ||
| // Retry failed — fall through to error | ||
| console.error( | ||
| " Existing sandbox connections cannot be reattached safely after this gateway identity change.", | ||
| ` Could not reconnect to sandbox '${sandboxName}' after clearing stale host keys.`, | ||
| ); | ||
| if (retry.output) { | ||
| console.error(retry.output); | ||
| } | ||
| console.error( | ||
| " Recreate this sandbox with `nemoclaw onboard` once the gateway runtime is stable.", | ||
| ); |
There was a problem hiding this comment.
Handle the "missing" retry result explicitly.
After pruning known_hosts, the retry can legitimately resolve to state === "missing". This branch currently reports a reconnect failure instead, so the stale registry/session entry survives and the user gets the wrong next step.
Suggested fix
const retry = await getReconciledSandboxGatewayState(sandboxName);
if (retry.state === "present") {
console.error(" ✓ Reconnected after clearing stale SSH host keys.");
return retry;
}
+ if (retry.state === "missing") {
+ registry.removeSandbox(sandboxName);
+ const session = onboardSession.loadSession();
+ if (session && session.sandboxName === sandboxName) {
+ onboardSession.updateSession((s) => {
+ s.sandboxName = null;
+ return s;
+ });
+ }
+ console.error(` Sandbox '${sandboxName}' is not present in the live OpenShell gateway.`);
+ console.error(" Removed stale local registry entry.");
+ process.exit(1);
+ }
// Retry failed — fall through to error🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/nemoclaw.ts` around lines 776 - 790, The code treats any non-"present"
retry as a reconnect failure; explicitly handle the "missing" retry result
returned by getReconciledSandboxGatewayState(sandboxName) — add a branch
checking if retry.state === "missing" and log a clear message that the sandbox
is now missing after pruning known_hosts (indicating the stale registry/session
was cleared), provide the correct next step (e.g., suggest `nemoclaw onboard`),
and return retry instead of falling through to the generic failure/error path so
the stale entry does not survive and the user gets the proper guidance.
|
Superseded by #2064 — same changes, correct git identity. |
…2064) ## Summary - Auto-recover from SSH identity drift after host reboot instead of requiring full re-onboard - Fix registry recovery gate to trigger for bare `nemoclaw <name>` (no explicit action) - Probe live gateway when `requestedSandboxName` is set, even with empty registry Fixes #2056 Supersedes #2057 (which had commits from incorrect git identity). ## Changes **`src/nemoclaw.ts`:** 1. **Import** `pruneKnownHostsEntries` from `./lib/onboard` 2. **Registry recovery gate** (line ~2408): added `""` to the action allowlist so bare `nemoclaw <name>` triggers recovery when registry is empty 3. **`shouldProbeLiveGateway`** (line ~477): include `requestedSandboxName` in the probe condition so recovery works when both registry and session are empty 4. **`ensureLiveSandboxOrExit` identity_drift handler** (line ~763): instead of `process.exit(1)`, clear stale `openshell-*` entries from `~/.ssh/known_hosts` using the existing `pruneKnownHostsEntries` helper and retry the sandbox lookup **`test/cli.test.ts`:** - Updated assertion for the identity drift test to match new auto-recovery behavior (error message changed from "gateway trust material rotated" to "Could not reconnect") **`test/reboot-identity-drift.test.ts`** (new): - 6 tests covering healthy reconnect, identity drift detection, and registry recovery from empty state — both bare and explicit command forms ## Type of Change - [x] Code change (feature, bug fix, or refactor) ## Verification - [x] `npm run build:cli` passes - [x] `npx vitest run test/reboot-identity-drift.test.ts` — 6/6 pass - [x] `npx vitest run test/cli.test.ts` — all pass (including updated identity_drift assertion) - [x] `npm test` — 1690 pass, 1 pre-existing failure (version string mismatch, unrelated) 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved SSH identity drift recovery after reboot by clearing stale host keys and attempting reconnection * Enhanced gateway probing to better detect live gateways when sandbox names are provided * Updated error messaging for reconnection failures * **Tests** * Added regression test suite for post-reboot SSH identity drift and registry recovery scenarios <!-- end of auto-generated comment: release notes by coderabbit.ai --> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
nemoclaw <name>(no explicit action)requestedSandboxNameis set, even with empty registryFixes #2056
Changes
src/nemoclaw.ts:""to the action allowlist so barenemoclaw <name>triggers recovery when registry is emptyshouldProbeLiveGateway(line ~477): includerequestedSandboxNamein the probe condition so recovery works when both registry and session are emptyensureLiveSandboxOrExitidentity_drift handler (line ~763): instead ofprocess.exit(1), clear staleopenshell-*entries from~/.ssh/known_hostsusing the existingpruneKnownHostsEntrieshelper and retry the sandbox lookuptest/cli.test.ts:test/reboot-identity-drift.test.ts(new):Type of Change
Verification
npm run build:clipassesnpx vitest run test/reboot-identity-drift.test.ts— 6/6 passnpx vitest run test/cli.test.ts— all pass (including updated identity_drift assertion)npm test— 1690 pass, 1 pre-existing failure (version string mismatch, unrelated)Summary by CodeRabbit
Bug Fixes
Tests