
fix(connect): auto-recover from SSH identity drift after host reboot #2057

Closed
ericksoa wants to merge 2 commits into main from fix/2056-reboot-identity-drift

Conversation

@ericksoa
Contributor

@ericksoa ericksoa commented Apr 18, 2026

Summary

  • Auto-recover from SSH identity drift after host reboot instead of requiring full re-onboard
  • Fix registry recovery gate to trigger for bare nemoclaw <name> (no explicit action)
  • Probe live gateway when requestedSandboxName is set, even with empty registry

Fixes #2056

Changes

src/nemoclaw.ts:

  1. Registry recovery gate (line ~2405): added "" to the action allowlist so bare nemoclaw <name> triggers recovery when registry is empty
  2. shouldProbeLiveGateway (line ~477): include requestedSandboxName in the probe condition so recovery works when both registry and session are empty
  3. ensureLiveSandboxOrExit identity_drift handler (line ~763): instead of process.exit(1), clear stale openshell-* entries from ~/.ssh/known_hosts using the existing pruneKnownHostsEntries helper and retry the sandbox lookup
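
The first two changes can be pictured as two small predicates. This is a minimal sketch under stated assumptions, not the code in src/nemoclaw.ts: the names `actionTriggersRecovery` and `ProbeInputs`, the function form of `shouldProbeLiveGateway`, and every allowlist entry other than `""` are hypothetical.

```typescript
// Hypothetical allowlist. Only the "" entry (bare `nemoclaw <name>`) comes from
// the PR description; "connect" and "status" are placeholders for illustration.
const RECOVERY_ACTIONS: ReadonlySet<string> = new Set(["", "connect", "status"]);

// Assumption: a missing action (args[0] === undefined) is normalized to ""
// before the allowlist check, so the bare form can match.
function actionTriggersRecovery(action: string | undefined): boolean {
  return RECOVERY_ACTIONS.has(action ?? "");
}

interface ProbeInputs {
  registrySandboxCount: number; // sandboxes known to the local registry
  sessionSandboxName: string | null; // sandbox recorded in the session file, if any
  requestedSandboxName: string | null; // name given on the command line, if any
}

// Probe the live gateway when any local hint exists, including a sandbox name
// that was only supplied on the command line.
function shouldProbeLiveGateway(inputs: ProbeInputs): boolean {
  return (
    inputs.registrySandboxCount > 0 ||
    Boolean(inputs.sessionSandboxName) ||
    Boolean(inputs.requestedSandboxName)
  );
}
```

With an empty registry and no session, `shouldProbeLiveGateway` in this sketch is true only when a sandbox name was requested, which is the post-reboot case the PR targets.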

test/cli.test.ts:

  • Updated assertion for the identity drift test to match new auto-recovery behavior (error message changed from "gateway trust material rotated" to "Could not reconnect")

test/reboot-identity-drift.test.ts (new):

  • 6 tests covering healthy reconnect, identity drift detection, and registry recovery from empty state — both bare and explicit command forms

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

  • npm run build:cli passes
  • npx vitest run test/reboot-identity-drift.test.ts — 6/6 pass
  • npx vitest run test/cli.test.ts — all pass (including updated identity_drift assertion)
  • npm test — 1690 pass, 1 pre-existing failure (version string mismatch, unrelated)

Summary by CodeRabbit

  • Bug Fixes

    • Improved recovery of sandbox registry during gateway synchronization.
    • Enhanced handling of SSH identity changes with better cleanup and retry logic.
    • Refined error messaging for unrecoverable gateway trust issues.
  • Tests

    • Added comprehensive regression test for SSH identity drift scenarios and gateway recovery.

…ixes #2056)

Three fixes for the post-reboot reconnection failure:

1. Registry recovery gate now triggers for bare `nemoclaw <name>` (no
   explicit action). Previously `args[0]` was undefined, which didn't
   match the allowlist, skipping recovery entirely.

2. Live gateway probe now runs when `requestedSandboxName` is set, even
   if the registry is empty and there's no session file.

3. Identity drift in `ensureLiveSandboxOrExit` now auto-clears stale
   SSH known_hosts entries and retries the sandbox lookup instead of
   immediately exiting with an error.
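
To make the third fix concrete, the sketch below shows one way stale entries could be filtered out of known_hosts content. It is a hypothetical stand-in, not the project's `pruneKnownHostsEntries` helper, whose real signature does not appear in this PR. It assumes that "openshell-*" means a plain-text host name beginning with `openshell-`, and it cannot match hashed host entries.

```typescript
// Hypothetical helper: takes the text of a known_hosts file and returns it
// with every entry for an "openshell-" host removed.
function pruneOpenshellHostEntries(knownHosts: string): string {
  return knownHosts
    .split("\n")
    .filter((line) => {
      const trimmed = line.trim();
      // Keep blank lines and comments untouched.
      if (trimmed === "" || trimmed.startsWith("#")) return true;
      const [first = "", second = ""] = trimmed.split(/\s+/);
      // An optional marker (@cert-authority, @revoked) precedes the host field.
      const hostField = first.startsWith("@") ? second : first;
      // The host field is a comma-separated list of names or patterns.
      return !hostField.split(",").some((host) => host.startsWith("openshell-"));
    })
    .join("\n");
}
```

Operating on a string rather than on the file keeps the filter easy to test. Reading and rewriting `~/.ssh/known_hosts` would be a separate, best-effort step.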

Signed-off-by: Test User <test@example.com>
@coderabbitai
Contributor

coderabbitai Bot commented Apr 18, 2026

📝 Walkthrough

Extended registry recovery probing and improved SSH identity-drift handling: the CLI now prunes stale ~/.ssh/known_hosts, retries gateway state resolution, and widens recovery triggering to include bare nemoclaw <name> invocations. New regression tests validate these behaviors.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Core CLI logic (src/nemoclaw.ts) | On identity_drift, best-effort prune stale ~/.ssh/known_hosts, retry getReconciledSandboxGatewayState(), return reconciled present if retry succeeds; set shouldProbeLiveGateway when requestedSandboxName is provided; include empty-action ("") in registry recovery allowlist. |
| Existing test adjustment (test/cli.test.ts) | Updated failure assertion: expect "Could not reconnect" instead of the previous "gateway trust material rotated after restart" in the gateway trust-rotation test. |
| New regression tests (test/reboot-identity-drift.test.ts) | Added tests that spawn the CLI with temp HOME and fake gateway states (healthy, identity-drift, down) to verify connect and bare-command behaviors, registry recovery when registry is empty, and identity-drift detection/output. |
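
Running a CLI against a throwaway HOME, as the new regression tests are described as doing, can be sketched with Node's standard library. The helper name `runWithTempHome`, the `RunResult` shape, and the directory prefix are assumptions for illustration. The actual test file's harness is not shown in this PR.

```typescript
import { spawnSync } from "node:child_process";
import { mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface RunResult {
  home: string; // the temporary HOME the child process saw
  status: number | null; // exit code, or null if the child was killed by a signal
  stdout: string;
  stderr: string;
}

// Hypothetical harness: runs a command with HOME pointed at a fresh temp
// directory so the real ~/.ssh and registry files are never touched.
function runWithTempHome(command: string, args: string[]): RunResult {
  const home = mkdtempSync(join(tmpdir(), "nemoclaw-test-"));
  try {
    const result = spawnSync(command, args, {
      env: { ...process.env, HOME: home },
      encoding: "utf8",
    });
    return { home, status: result.status, stdout: result.stdout, stderr: result.stderr };
  } finally {
    // Remove the temp directory even if spawning throws.
    rmSync(home, { recursive: true, force: true });
  }
}
```

A test would seed the temp directory (for example with a known_hosts file) before spawning. That step is omitted here because the sketch removes the directory as soon as the child exits.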

Sequence Diagram

sequenceDiagram
    participant CLI as CLI (nemoclaw)
    participant Gateway as SSH Gateway
    participant KnownHosts as ~/.ssh/known_hosts
    participant Registry as Sandbox Registry

    CLI->>Gateway: getReconciledSandboxGatewayState()
    Gateway-->>CLI: handshake identity change error
    Note over CLI: Detect identity_drift
    CLI->>KnownHosts: pruneKnownHostsEntries()
    KnownHosts-->>CLI: stale gateway entries removed
    CLI->>Gateway: getReconciledSandboxGatewayState() (retry)
    alt Retry returns present
        Gateway-->>CLI: sandbox state retrieved
        CLI->>Registry: return reconciled state (present, recoveredGateway=true)
        Registry-->>CLI: success
    else Retry still failing
        Gateway-->>CLI: failure
        CLI-->>CLI: emit updated failure message (include optional retry output) and exit non-zero
    end
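
The flow in the diagram can be sketched as a small function with its collaborators injected. This is an illustration under assumptions, not the body of `ensureLiveSandboxOrExit`: the `GatewayLookup` and `RecoveryDeps` shapes and the name `lookupWithDriftRecovery` are hypothetical, and `lookup` stands in for `getReconciledSandboxGatewayState`.

```typescript
interface GatewayLookup {
  state: string; // e.g. "present", "missing", "identity_drift"
  output?: string; // optional diagnostic text from the gateway probe
}

interface RecoveryDeps {
  lookup: (sandboxName: string) => Promise<GatewayLookup>;
  pruneStaleHostKeys: () => void;
}

// Look up the sandbox once. On identity drift, prune stale host keys and
// retry exactly once, reporting whether the retry recovered the connection.
async function lookupWithDriftRecovery(
  sandboxName: string,
  deps: RecoveryDeps,
): Promise<GatewayLookup & { recovered: boolean }> {
  const first = await deps.lookup(sandboxName);
  if (first.state !== "identity_drift") {
    return { ...first, recovered: false };
  }
  try {
    deps.pruneStaleHostKeys();
  } catch {
    // Best effort: a failed prune should not prevent the retry.
  }
  const retry = await deps.lookup(sandboxName);
  return { ...retry, recovered: retry.state === "present" };
}
```

Injecting `lookup` and `pruneStaleHostKeys` lets a test substitute a fake gateway that reports drift until the prune has run, which exercises the success path without a real SSH host.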

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibbled through known_hosts at dawn,
Cleared the stale keys the reboot had drawn.
I hopped, retried, and found the gate true,
No re-onboard — just a fresh rendezvous.
Hooray — back in the burrow, connection too! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely describes the main change: auto-recovery from SSH identity drift after host reboot, which is the primary objective of this PR. |
| Linked Issues check | ✅ Passed | All coding requirements from issue #2056 are met: identity drift auto-recovery with known_hosts clearing and retry [#2056], bare command registry recovery gate fix [#2056], and live gateway probing when requestedSandboxName is set [#2056]. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to resolving the identity drift recovery issue: core logic fixes in src/nemoclaw.ts, updated test assertions, and new regression tests. |

Contributor

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/nemoclaw.ts (1)

776-791: ⚠️ Potential issue | 🟡 Minor

Handle retry.state === "missing" after host-key cleanup.

At line 776, retry can return missing, but the current flow treats that as a generic reconnect failure and leaves stale registry/session state behind.

💡 Proposed fix
     const retry = await getReconciledSandboxGatewayState(sandboxName);
     if (retry.state === "present") {
       console.error("  ✓ Reconnected after clearing stale SSH host keys.");
       return retry;
     }
+    if (retry.state === "missing") {
+      registry.removeSandbox(sandboxName);
+      const session = onboardSession.loadSession();
+      if (session && session.sandboxName === sandboxName) {
+        onboardSession.updateSession((s) => {
+          s.sandboxName = null;
+          return s;
+        });
+      }
+      console.error(`  Sandbox '${sandboxName}' is not present in the live OpenShell gateway.`);
+      console.error("  Removed stale local registry entry.");
+      process.exit(1);
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/nemoclaw.ts` around lines 776 - 791, The retry result from
getReconciledSandboxGatewayState can be "missing" after clearing host keys but
the current branch treats all failures the same; update the logic that checks
retry.state (after getReconciledSandboxGatewayState(sandboxName)) to explicitly
detect retry.state === "missing" and, in that case, clear any stale
registry/session records tied to sandboxName (the same cleanup path used for a
missing sandbox), emit a clear log that the sandbox is missing and
registry/session state was cleaned, and then exit with an appropriate message
directing the user to recreate/onboard the sandbox; keep the existing
reconnect-success branch (retry.state === "present") unchanged and only fall
through to the generic reconnect-failure block for other non-missing error
states.
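
The three outcomes this comment distinguishes can be separated from their side effects by first classifying the retry result. The sketch below is illustrative only: `RetryAction` and `classifyRetry` are hypothetical names, while the two messages are copied from the code and the proposed fix quoted in this review.

```typescript
type RetryAction =
  | { kind: "reconnected" }
  | { kind: "cleanup_missing"; message: string }
  | { kind: "reconnect_failed"; message: string };

// Map the retry state to the action the caller should take. Registry/session
// cleanup and process exit are left to the caller.
function classifyRetry(sandboxName: string, retryState: string): RetryAction {
  switch (retryState) {
    case "present":
      return { kind: "reconnected" };
    case "missing":
      return {
        kind: "cleanup_missing",
        message: `Sandbox '${sandboxName}' is not present in the live OpenShell gateway.`,
      };
    default:
      return {
        kind: "reconnect_failed",
        message: `Could not reconnect to sandbox '${sandboxName}' after clearing stale host keys.`,
      };
  }
}
```

Keeping the classification pure means the "missing" branch can be unit-tested without touching the registry or calling `process.exit`.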
🧹 Nitpick comments (1)
test/reboot-identity-drift.test.ts (1)

10-13: Add a positive auto-recovery assertion and refresh stale header notes.

Current identity-drift tests validate the unrecoverable path well, but they do not verify the new success path where retry works after host-key cleanup. Also, the Line 12 note (“Once the fix … lands”) is now outdated.

Also applies to: 273-295

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/reboot-identity-drift.test.ts` around lines 10 - 13, Update the test
header/note text to remove the stale “Once the fix for `#2056` lands” comment and
instead document the expected auto-recovery behavior, and modify the test titled
"Identity drift is detected and surfaced (current behavior)" to include a
positive auto-recovery assertion: after simulating identity-drift and performing
the host-key cleanup/retry step (the same sequence already used in the test),
assert that the subsequent retry succeeds (e.g., expect success/connection
established instead of only the unrecoverable error), and apply the same
header/note and assertion update to the parallel test block that mirrors this
behavior elsewhere in the file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5c689297-8da6-4b48-a921-c1bc2f79c7b2

📥 Commits

Reviewing files that changed from the base of the PR and between f9b18c1 and f57eaac.

📒 Files selected for processing (3)
  • src/nemoclaw.ts
  • test/cli.test.ts
  • test/reboot-identity-drift.test.ts

@ericksoa ericksoa self-assigned this Apr 18, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/nemoclaw.ts`:
- Around line 776-790: The code treats any non-"present" retry as a reconnect
failure; explicitly handle the "missing" retry result returned by
getReconciledSandboxGatewayState(sandboxName) — add a branch checking if
retry.state === "missing" and log a clear message that the sandbox is now
missing after pruning known_hosts (indicating the stale registry/session was
cleared), provide the correct next step (e.g., suggest `nemoclaw onboard`), and
return retry instead of falling through to the generic failure/error path so the
stale entry does not survive and the user gets the proper guidance.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: d901c9b8-edc6-4e33-adfb-e6741fdaa39a

📥 Commits

Reviewing files that changed from the base of the PR and between f57eaac and 6b46279.

📒 Files selected for processing (1)
  • src/nemoclaw.ts

Comment thread src/nemoclaw.ts
Comment on lines +776 to 790
    const retry = await getReconciledSandboxGatewayState(sandboxName);
    if (retry.state === "present") {
      console.error(" ✓ Reconnected after clearing stale SSH host keys.");
      return retry;
    }
    // Retry failed — fall through to error
    console.error(
      " Existing sandbox connections cannot be reattached safely after this gateway identity change.",
      ` Could not reconnect to sandbox '${sandboxName}' after clearing stale host keys.`,
    );
    if (retry.output) {
      console.error(retry.output);
    }
    console.error(
      " Recreate this sandbox with `nemoclaw onboard` once the gateway runtime is stable.",
    );

⚠️ Potential issue | 🟠 Major

Handle the "missing" retry result explicitly.

After pruning known_hosts, the retry can legitimately resolve to state === "missing". This branch currently reports a reconnect failure instead, so the stale registry/session entry survives and the user gets the wrong next step.

Suggested fix
     const retry = await getReconciledSandboxGatewayState(sandboxName);
     if (retry.state === "present") {
       console.error("  ✓ Reconnected after clearing stale SSH host keys.");
       return retry;
     }
+    if (retry.state === "missing") {
+      registry.removeSandbox(sandboxName);
+      const session = onboardSession.loadSession();
+      if (session && session.sandboxName === sandboxName) {
+        onboardSession.updateSession((s) => {
+          s.sandboxName = null;
+          return s;
+        });
+      }
+      console.error(`  Sandbox '${sandboxName}' is not present in the live OpenShell gateway.`);
+      console.error("  Removed stale local registry entry.");
+      process.exit(1);
+    }
     // Retry failed — fall through to error

@ericksoa
Contributor Author

Superseded by #2064 — same changes, correct git identity.

@ericksoa ericksoa closed this Apr 18, 2026
ericksoa added a commit that referenced this pull request Apr 18, 2026
…2064)

## Summary

- Auto-recover from SSH identity drift after host reboot instead of
requiring full re-onboard
- Fix registry recovery gate to trigger for bare `nemoclaw <name>` (no
explicit action)
- Probe live gateway when `requestedSandboxName` is set, even with empty
registry

Fixes #2056

Supersedes #2057 (which had commits from incorrect git identity).

## Changes

**`src/nemoclaw.ts`:**
1. **Import** `pruneKnownHostsEntries` from `./lib/onboard`
2. **Registry recovery gate** (line ~2408): added `""` to the action
allowlist so bare `nemoclaw <name>` triggers recovery when registry is
empty
3. **`shouldProbeLiveGateway`** (line ~477): include
`requestedSandboxName` in the probe condition so recovery works when
both registry and session are empty
4. **`ensureLiveSandboxOrExit` identity_drift handler** (line ~763):
instead of `process.exit(1)`, clear stale `openshell-*` entries from
`~/.ssh/known_hosts` using the existing `pruneKnownHostsEntries` helper
and retry the sandbox lookup

**`test/cli.test.ts`:**
- Updated assertion for the identity drift test to match new
auto-recovery behavior (error message changed from "gateway trust
material rotated" to "Could not reconnect")

**`test/reboot-identity-drift.test.ts`** (new):
- 6 tests covering healthy reconnect, identity drift detection, and
registry recovery from empty state — both bare and explicit command
forms

## Type of Change

- [x] Code change (feature, bug fix, or refactor)

## Verification

- [x] `npm run build:cli` passes
- [x] `npx vitest run test/reboot-identity-drift.test.ts` — 6/6 pass
- [x] `npx vitest run test/cli.test.ts` — all pass (including updated
identity_drift assertion)
- [x] `npm test` — 1690 pass, 1 pre-existing failure (version string
mismatch, unrelated)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved SSH identity drift recovery after reboot by clearing stale
    host keys and attempting reconnection
  * Enhanced gateway probing to better detect live gateways when sandbox
    names are provided
  * Updated error messaging for reconnection failures

* **Tests**
  * Added regression test suite for post-reboot SSH identity drift and
    registry recovery scenarios

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>