Skip to content

fix(connect): auto-recover from SSH identity drift after host reboot#2064

Merged
ericksoa merged 1 commit intomainfrom
fix/2056-reboot-identity-drift-v2
Apr 18, 2026
Merged

fix(connect): auto-recover from SSH identity drift after host reboot#2064
ericksoa merged 1 commit intomainfrom
fix/2056-reboot-identity-drift-v2

Conversation

@ericksoa
Copy link
Copy Markdown
Contributor

@ericksoa ericksoa commented Apr 18, 2026

Summary

  • Auto-recover from SSH identity drift after host reboot instead of requiring full re-onboard
  • Fix registry recovery gate to trigger for bare nemoclaw <name> (no explicit action)
  • Probe live gateway when requestedSandboxName is set, even with empty registry

Fixes #2056

Supersedes #2057 (which had commits from incorrect git identity).

Changes

src/nemoclaw.ts:

  1. Import pruneKnownHostsEntries from ./lib/onboard
  2. Registry recovery gate (line ~2408): added "" to the action allowlist so bare nemoclaw <name> triggers recovery when registry is empty
  3. shouldProbeLiveGateway (line ~477): include requestedSandboxName in the probe condition so recovery works when both registry and session are empty
  4. ensureLiveSandboxOrExit identity_drift handler (line ~763): instead of process.exit(1), clear stale openshell-* entries from ~/.ssh/known_hosts using the existing pruneKnownHostsEntries helper and retry the sandbox lookup

test/cli.test.ts:

  • Updated assertion for the identity drift test to match new auto-recovery behavior (error message changed from "gateway trust material rotated" to "Could not reconnect")

test/reboot-identity-drift.test.ts (new):

  • 6 tests covering healthy reconnect, identity drift detection, and registry recovery from empty state — both bare and explicit command forms

Type of Change

  • Code change (feature, bug fix, or refactor)

Verification

  • npm run build:cli passes
  • npx vitest run test/reboot-identity-drift.test.ts — 6/6 pass
  • npx vitest run test/cli.test.ts — all pass (including updated identity_drift assertion)
  • npm test — 1690 pass, 1 pre-existing failure (version string mismatch, unrelated)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved SSH identity drift recovery after reboot by clearing stale host keys and attempting reconnection
    • Enhanced gateway probing to better detect live gateways when sandbox names are provided
    • Updated error messaging for reconnection failures
  • Tests

    • Added regression test suite for post-reboot SSH identity drift and registry recovery scenarios

(Fixes #2056)

Three fixes for the post-reboot reconnection failure:

1. Registry recovery gate now triggers for bare `nemoclaw <name>` (no
   explicit action). Previously `args[0]` was undefined, which didn't
   match the allowlist, skipping recovery entirely.

2. Live gateway probe now runs when `requestedSandboxName` is set, even
   if the registry is empty and there's no session file.

3. Identity drift in `ensureLiveSandboxOrExit` now auto-clears stale
   SSH known_hosts entries and retries the sandbox lookup instead of
   immediately exiting with an error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 18, 2026

📝 Walkthrough

Walkthrough

The pull request implements auto-recovery from SSH identity drift after host reboot by clearing stale SSH known_hosts entries and retrying gateway connections, while broadening registry recovery gating to handle bare sandbox commands.

Changes

Cohort / File(s) Summary
Identity Drift Recovery
src/nemoclaw.ts
Added pruneKnownHostsEntries import; identity_drift recovery path now clears stale SSH host keys from ~/.ssh/known_hosts and retries gateway connection. Updated console messaging to reflect SSH key clearing and reconnect attempt. Broadened shouldProbeLiveGateway logic and fixed registry recovery gating to include empty-string action for bare sandbox commands.
Test Assertions
test/cli.test.ts
Updated test expectation for gateway trust rotation scenario to expect "Could not reconnect" message instead of "gateway trust material rotated after restart".
Regression Test Suite
test/reboot-identity-drift.test.ts
New Vitest test suite with temporary isolated HOME fixtures, stub openshell executables, and helpers to simulate post-reboot SSH identity drift and registry recovery. Tests both successful sandbox connection and identity drift detection/recovery paths, plus registry recovery when gateway is live but registry is empty.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A reboot came, SSH keys turned strange,
But clever nemoclaw rose to the change—
Pruning old entries with practised care,
Reconnecting through drift-laden air,
The sandbox hops on, no re-onboard despair! 🏠✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main fix: auto-recovery from SSH identity drift after host reboot, which is the primary change across the files.
Linked Issues check ✅ Passed All coding requirements from issue #2056 are implemented: clearing stale known_hosts entries on identity_drift [#2056], retrying sandbox lookup after cleanup [#2056], and fixing the registry recovery gate to include bare commands [#2056].
Out of Scope Changes check ✅ Passed All changes directly address the three coordinated fixes outlined in issue #2056; no extraneous modifications or unrelated functionality were introduced.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/2056-reboot-identity-drift-v2

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/reboot-identity-drift.test.ts (1)

9-12: Outdated comment: this PR is the fix for #2056.

Line 12 says "Once the fix for #2056 lands, update these tests to assert auto-recovery" — but this PR implements that fix. Consider removing or updating this comment to avoid confusion for future readers.

♻️ Suggested fix
 // Simulates the post-reboot scenario where the gateway restarts with new SSH
 // keys, causing "handshake verification failed" errors. Verifies:
 //   1. The registry recovery gate triggers for bare `nemoclaw <name>` (no action)
-//   2. Identity drift is detected and surfaced (current behavior)
-//
-// Once the fix for `#2056` lands, update these tests to assert auto-recovery.
+//   2. Identity drift is detected, stale SSH keys are cleared, and reconnect is attempted
+//   3. Registry recovery works when the registry is empty but gateway is live
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/reboot-identity-drift.test.ts` around lines 9 - 12, The top block
comment in reboot-identity-drift.test.ts contains an outdated note "Once the fix
for `#2056` lands, update these tests to assert auto-recovery"; update or remove
that sentence so it reflects that this PR implements the fix for `#2056`—either
delete the line or replace it with a short note that the tests now assert
auto-recovery (keep the remaining comment lines about registry recovery and
identity drift if still relevant) to avoid confusion for future readers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@test/reboot-identity-drift.test.ts`:
- Around line 9-12: The top block comment in reboot-identity-drift.test.ts
contains an outdated note "Once the fix for `#2056` lands, update these tests to
assert auto-recovery"; update or remove that sentence so it reflects that this
PR implements the fix for `#2056`—either delete the line or replace it with a
short note that the tests now assert auto-recovery (keep the remaining comment
lines about registry recovery and identity drift if still relevant) to avoid
confusion for future readers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 86ff307d-eef8-4445-a479-aeec71b0014b

📥 Commits

Reviewing files that changed from the base of the PR and between bbbaa0f and 4620c7c.

📒 Files selected for processing (3)
  • src/nemoclaw.ts
  • test/cli.test.ts
  • test/reboot-identity-drift.test.ts

@ericksoa ericksoa merged commit 2d1649f into main Apr 18, 2026
17 checks passed
ericksoa pushed a commit that referenced this pull request Apr 21, 2026
## Summary

Catches up the user-facing reference and troubleshooting docs with the
CLI and policy behavior changes that landed in v0.0.21. Drafted via the
`nemoclaw-contributor-update-docs` skill against commits in
`v0.0.20..v0.0.21`, filtered through `docs/.docs-skip`.

## Changes

- **`docs/reference/commands.md`**
- `nemoclaw list`: session indicator (●) for connected sandboxes
(#2117).
- `nemoclaw <name> connect`: active-session note; auto-recovery from SSH
identity drift after a host reboot (#2117, #2064).
- `nemoclaw <name> status`: three-state Inference line (`healthy` /
`unreachable` / `not probed`) covering both local and remote providers;
new `Connected` line (#2002, #2117).
- `nemoclaw <name> destroy` and `rebuild`: active-session warning with
second confirm; rebuild reapplies policy presets to the recreated
sandbox (#2117, #2026).
- `nemoclaw <name> policy-add` and `policy-remove`: positional preset
argument and non-interactive flow via
`--yes`/`--force`/`NEMOCLAW_NON_INTERACTIVE=1` (#2070).
- `nemoclaw <name> policy-list`: registry-vs-gateway desync detection
(#2089).
- **`docs/reference/troubleshooting.md`**
- `Reconnect after a host reboot`: now reflects automatic stale
`known_hosts` pruning on `connect` (#2064).
- `Running multiple sandboxes simultaneously`: onboard's forward-port
collision guard (#2086).
- New section: `openclaw config set` or `unset` is blocked inside the
sandbox (#2081).
- **`docs/network-policy/customize-network-policy.md`**: non-interactive
`policy-add`/`policy-remove` form; preset preservation across rebuild
(#2070, #2026).
- **`docs/inference/use-local-inference.md`**: NIM section now covers
the NGC API key prompt with masked input and `docker login nvcr.io
--password-stdin` behavior (#2043).
- **Generated skills regenerated** to pick up the source changes
(`.agents/skills/nemoclaw-user-reference/references/{commands,troubleshooting}.md`,
plus minor heading-flow deltas elsewhere). The pre-commit `Regenerate
agent skills from docs` hook ran and confirmed source ↔ generated
parity.

Commits skipped per `docs/.docs-skip` or no doc impact: `bbbaa0fb`
(skip-features), `7cb482cb` (skip-features), `8dee23fd` (skip-terms),
plus the usual CI / test / refactor / install-plumbing churn.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes for the modified files (the one
failing test, `test/cli.test.ts > unknown command exits 1`, also fails
on `origin/main` and is unrelated to these markdown-only changes)
- [ ] `npm test` passes — skipped; same pre-existing CLI-dispatch test
failure unrelated to docs
- [ ] Tests added or updated for new or changed behavior — n/a, doc-only
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only) — not run
locally
- [x] Doc pages follow the [style
guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md)
(doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only)
— n/a, no new pages

## AI Disclosure

- [x] AI-assisted — tool: Claude Code

---
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
  * Multi-session SSH connections with concurrent session support.
* Three-state inference health reporting (healthy/unreachable/not
probed) across all providers.
  * Automatic SSH host key rotation detection and recovery.
  * Non-interactive policy preset management via positional arguments.
  * Session indicators in sandbox list view.

* **Bug Fixes**
  * Protected destructive operations with active-session warnings.
  * Policy presets now preserved during sandbox rebuilds.

* **Documentation**
  * NGC registry authentication requirements for container images.
  * Multi-sandbox onboarding and reconnection guidance.
  * Troubleshooting updates for port conflicts and SSH issues.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(connect): auto-recover from SSH identity drift after host reboot

2 participants