fix(connect): auto-recover from SSH identity drift after host reboot #2064
(Fixes #2056)

Three fixes for the post-reboot reconnection failure:

1. Registry recovery gate now triggers for bare `nemoclaw <name>` (no explicit action). Previously `args[0]` was undefined, which didn't match the allowlist, skipping recovery entirely.
2. Live gateway probe now runs when `requestedSandboxName` is set, even if the registry is empty and there's no session file.
3. Identity drift in `ensureLiveSandboxOrExit` now auto-clears stale SSH known_hosts entries and retries the sandbox lookup instead of immediately exiting with an error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
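The first two fixes come down to two small predicates. The sketch below is illustrative only, not the code in `src/nemoclaw.ts`: the allowlist contents other than `""`, the `ProbeInputs` shape, and the helper name `shouldAttemptRegistryRecovery` are assumptions made for the example, and this `shouldProbeLiveGateway` is a simplified stand-in for the real function of the same name.

```typescript
// Illustrative allowlist: "" stands for a bare `nemoclaw <name>` invocation.
// The other entries are placeholders, not the real list in src/nemoclaw.ts.
const RECOVERY_ACTIONS: ReadonlySet<string> = new Set(["", "connect", "status"]);

// args[0] is undefined when no action is given, so normalize it to "" before
// the allowlist check; otherwise a bare command would skip recovery.
function shouldAttemptRegistryRecovery(action: string | undefined): boolean {
  return RECOVERY_ACTIONS.has(action ?? "");
}

// Hypothetical shape for the inputs to the probe decision.
interface ProbeInputs {
  registryHasEntries: boolean;
  hasSessionFile: boolean;
  requestedSandboxName?: string;
}

// Probe the live gateway if any signal suggests a sandbox may exist,
// including an explicitly requested name with an empty registry and no session.
function shouldProbeLiveGateway(inputs: ProbeInputs): boolean {
  return (
    inputs.registryHasEntries ||
    inputs.hasSessionFile ||
    Boolean(inputs.requestedSandboxName)
  );
}
```

Treating the missing action as an empty string keeps the allowlist check a single set lookup rather than a special case for `undefined`.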
📝 Walkthrough
The pull request implements auto-recovery from SSH identity drift after host reboot by clearing stale SSH known_hosts entries and retrying gateway connections, while broadening registry recovery gating to handle bare sandbox commands.
Estimated code review effort: 3 (Moderate), ~20 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
🧹 Nitpick comments (1)
test/reboot-identity-drift.test.ts (1)
9-12: Outdated comment: this PR is the fix for #2056.
Line 12 says "Once the fix for #2056 lands, update these tests to assert auto-recovery", but this PR implements that fix. Consider removing or updating this comment to avoid confusion for future readers.
♻️ Suggested fix
 // Simulates the post-reboot scenario where the gateway restarts with new SSH
 // keys, causing "handshake verification failed" errors. Verifies:
 //   1. The registry recovery gate triggers for bare `nemoclaw <name>` (no action)
-//   2. Identity drift is detected and surfaced (current behavior)
-//
-// Once the fix for `#2056` lands, update these tests to assert auto-recovery.
+//   2. Identity drift is detected, stale SSH keys are cleared, and reconnect is attempted
+//   3. Registry recovery works when the registry is empty but gateway is live
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/reboot-identity-drift.test.ts` around lines 9 - 12, The top block comment in reboot-identity-drift.test.ts contains an outdated note "Once the fix for `#2056` lands, update these tests to assert auto-recovery"; update or remove that sentence so it reflects that this PR implements the fix for `#2056`—either delete the line or replace it with a short note that the tests now assert auto-recovery (keep the remaining comment lines about registry recovery and identity drift if still relevant) to avoid confusion for future readers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 86ff307d-eef8-4445-a479-aeec71b0014b
📒 Files selected for processing (3)
src/nemoclaw.ts
test/cli.test.ts
test/reboot-identity-drift.test.ts
## Summary

Catches up the user-facing reference and troubleshooting docs with the CLI and policy behavior changes that landed in v0.0.21. Drafted via the `nemoclaw-contributor-update-docs` skill against commits in `v0.0.20..v0.0.21`, filtered through `docs/.docs-skip`.

## Changes

- **`docs/reference/commands.md`**
  - `nemoclaw list`: session indicator (●) for connected sandboxes (#2117).
  - `nemoclaw <name> connect`: active-session note; auto-recovery from SSH identity drift after a host reboot (#2117, #2064).
  - `nemoclaw <name> status`: three-state Inference line (`healthy` / `unreachable` / `not probed`) covering both local and remote providers; new `Connected` line (#2002, #2117).
  - `nemoclaw <name> destroy` and `rebuild`: active-session warning with second confirm; rebuild reapplies policy presets to the recreated sandbox (#2117, #2026).
  - `nemoclaw <name> policy-add` and `policy-remove`: positional preset argument and non-interactive flow via `--yes`/`--force`/`NEMOCLAW_NON_INTERACTIVE=1` (#2070).
  - `nemoclaw <name> policy-list`: registry-vs-gateway desync detection (#2089).
- **`docs/reference/troubleshooting.md`**
  - `Reconnect after a host reboot`: now reflects automatic stale `known_hosts` pruning on `connect` (#2064).
  - `Running multiple sandboxes simultaneously`: onboard's forward-port collision guard (#2086).
  - New section: `openclaw config set` or `unset` is blocked inside the sandbox (#2081).
- **`docs/network-policy/customize-network-policy.md`**: non-interactive `policy-add`/`policy-remove` form; preset preservation across rebuild (#2070, #2026).
- **`docs/inference/use-local-inference.md`**: NIM section now covers the NGC API key prompt with masked input and `docker login nvcr.io --password-stdin` behavior (#2043).
- **Generated skills regenerated** to pick up the source changes (`.agents/skills/nemoclaw-user-reference/references/{commands,troubleshooting}.md`, plus minor heading-flow deltas elsewhere).

The pre-commit `Regenerate agent skills from docs` hook ran and confirmed source ↔ generated parity. Commits skipped per `docs/.docs-skip` or no doc impact: `bbbaa0fb` (skip-features), `7cb482cb` (skip-features), `8dee23fd` (skip-terms), plus the usual CI / test / refactor / install-plumbing churn.

## Type of Change

- [ ] Code change (feature, bug fix, or refactor)
- [ ] Code change with doc updates
- [ ] Doc only (prose changes, no code sample modifications)
- [x] Doc only (includes code sample changes)

## Verification

- [x] `npx prek run --all-files` passes for the modified files (the one failing test, `test/cli.test.ts > unknown command exits 1`, also fails on `origin/main` and is unrelated to these markdown-only changes)
- [ ] `npm test` passes: skipped; same pre-existing CLI-dispatch test failure unrelated to docs
- [ ] Tests added or updated for new or changed behavior: n/a, doc-only
- [x] No secrets, API keys, or credentials committed
- [x] Docs updated for user-facing behavior changes
- [ ] `make docs` builds without warnings (doc changes only): not run locally
- [x] Doc pages follow the [style guide](https://github.com/NVIDIA/NemoClaw/blob/main/docs/CONTRIBUTING.md) (doc changes only)
- [ ] New doc pages include SPDX header and frontmatter (new pages only): n/a, no new pages

## AI Disclosure

- [x] AI-assisted, tool: Claude Code

---

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Multi-session SSH connections with concurrent session support.
  * Three-state inference health reporting (healthy/unreachable/not probed) across all providers.
  * Automatic SSH host key rotation detection and recovery.
  * Non-interactive policy preset management via positional arguments.
  * Session indicators in sandbox list view.
* **Bug Fixes**
  * Protected destructive operations with active-session warnings.
  * Policy presets now preserved during sandbox rebuilds.
* **Documentation**
  * NGC registry authentication requirements for container images.
  * Multi-sandbox onboarding and reconnection guidance.
  * Troubleshooting updates for port conflicts and SSH issues.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Registry recovery gate now triggers for bare `nemoclaw <name>` (no explicit action)
- Live gateway probe now runs when `requestedSandboxName` is set, even with empty registry

Fixes #2056
Supersedes #2057 (which had commits from an incorrect git identity).
Changes
src/nemoclaw.ts:
- Import `pruneKnownHostsEntries` from `./lib/onboard`
- Add `""` to the action allowlist so bare `nemoclaw <name>` triggers recovery when registry is empty
- `shouldProbeLiveGateway` (line ~477): include `requestedSandboxName` in the probe condition so recovery works when both registry and session are empty
- `ensureLiveSandboxOrExit` identity_drift handler (line ~763): instead of `process.exit(1)`, clear stale `openshell-*` entries from `~/.ssh/known_hosts` using the existing `pruneKnownHostsEntries` helper and retry the sandbox lookup

test/cli.test.ts:

test/reboot-identity-drift.test.ts (new):

Type of Change
Verification
- `npm run build:cli` passes
- `npx vitest run test/reboot-identity-drift.test.ts`: 6/6 pass
- `npx vitest run test/cli.test.ts`: all pass (including updated identity_drift assertion)
- `npm test`: 1690 pass, 1 pre-existing failure (version string mismatch, unrelated)

🤖 Generated with Claude Code
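The identity_drift change listed under Changes (clear stale `openshell-*` entries, then retry the lookup) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the PR's code: `pruneHostsByPrefix`, `lookupWithDriftRecovery`, and the `LookupResult` states other than `identity_drift` are hypothetical names, and the real `pruneKnownHostsEntries` helper may have a different signature and matching rule. Hashed host entries (`|1|...`) cannot be matched by prefix and are left untouched here.

```typescript
// Hypothetical helper: drop known_hosts lines whose host field contains a
// name starting with `prefix`. Works on file contents, not on the file itself.
function pruneHostsByPrefix(contents: string, prefix: string): string {
  return contents
    .split("\n")
    .filter((line) => {
      const trimmed = line.trim();
      // Keep blank lines and comments as they are.
      if (trimmed === "" || trimmed.startsWith("#")) return true;
      const tokens = trimmed.split(/\s+/);
      // Lines may begin with a marker such as @cert-authority or @revoked;
      // in that case the host field is the second token.
      const hostField = (tokens[0].startsWith("@") ? tokens[1] : tokens[0]) ?? "";
      // The host field can list several comma-separated names.
      return !hostField.split(",").some((host) => host.startsWith(prefix));
    })
    .join("\n");
}

// Hypothetical result type; only "identity_drift" is named in the PR.
type LookupResult =
  | { state: "ok"; name: string }
  | { state: "identity_drift" }
  | { state: "missing" };

// On identity drift, clear stale keys once and retry the lookup a single time
// instead of exiting immediately.
function lookupWithDriftRecovery(
  lookup: () => LookupResult,
  clearStaleKeys: () => void,
): LookupResult {
  const first = lookup();
  if (first.state !== "identity_drift") return first;
  clearStaleKeys();
  return lookup();
}
```

Limiting the retry to one attempt means a second drift result still surfaces to the caller, so a persistent mismatch cannot loop forever.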