
fix(ssh): auto-repair stale pub that does not pair with local priv #3395

Merged
la14-1 merged 2 commits into OpenRouterTeam:main from AhmedTMM:fix/ssh-verify-keypair
May 7, 2026

Conversation

@AhmedTMM
Collaborator

@AhmedTMM AhmedTMM commented May 6, 2026

Summary

Fixes the silent-failure mode reported in Slack where spawn registered a local .pub with DigitalOcean, the droplet booted with that key in authorized_keys, and SSH then failed with Permission denied (publickey) 33 times because the local .priv didn't actually pair with the registered .pub.

Adds two exported helpers in shared/ssh-keys.ts:

  1. verifyKeyPair(priv, pub) — derives the pub from the priv via ssh-keygen -y -P "" -f <priv> and compares key-type + base64 (ignoring comment). Returns "match" | "mismatch" | "unverifiable".
  2. repairPubFromPriv(priv, pub) — on mismatch, backs up the stale .pub to <pub>.spawn-backup-<timestamp> and rewrites .pub from the derived key. The .priv is authoritative — any .pub that doesn't derive from it is wrong by definition, so the rewrite is safe.
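The comparison step described for verifyKeyPair can be sketched roughly as below. This is an illustrative sketch, not the code from shared/ssh-keys.ts: the names parsePubLine and comparePubs are hypothetical, the derived key is passed in as a string (or null when derivation failed), and treating an unparsable on-disk .pub as "mismatch" is an assumption.

```typescript
// Hypothetical helpers for illustration; not the PR's actual implementation.
type KeyPairStatus = "match" | "mismatch" | "unverifiable";

interface ParsedPub {
  type: string; // e.g. "ssh-ed25519"
  blob: string; // base64-encoded key material
}

// Parse an OpenSSH public key line of the form "<type> <base64> [comment]".
function parsePubLine(line: string): ParsedPub | null {
  const [type, blob] = line.trim().split(/\s+/);
  if (!type || !blob) return null;
  return { type, blob };
}

// derivedPub is the public key derived from the private key,
// or null when derivation failed (e.g. passphrase-protected key).
function comparePubs(derivedPub: string | null, pubFileContents: string): KeyPairStatus {
  if (derivedPub === null) return "unverifiable";
  const derived = parsePubLine(derivedPub);
  if (derived === null) return "unverifiable";
  const onDisk = parsePubLine(pubFileContents);
  // Assumption: a .pub that cannot be parsed cannot pair with the priv.
  if (onDisk === null) return "mismatch";
  // The comment field is deliberately ignored.
  return derived.type === onDisk.type && derived.blob === onDisk.blob
    ? "match"
    : "mismatch";
}
```

Because only the first two fields are compared, a .pub whose comment was edited (for example user@old-laptop changed to user@new-laptop) still counts as a match.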

discoverSshKeys() now runs verify-then-repair on every pair. Passphrase-protected / otherwise unverifiable keys are skipped silently — BatchMode SSH can't use them anyway without an active ssh-agent.
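A minimal sketch of the verify-then-repair flow follows. To keep it runnable anywhere, the filesystem is modeled as an in-memory Map and key derivation is injected as a function; both are assumptions made for illustration, as are the names verifyThenRepair, keyBody, and Outcome. Only the backup naming scheme (<pub>.spawn-backup-<timestamp>) comes from the description above.

```typescript
// Illustrative sketch only; the real discoverSshKeys() works on actual files.
type Files = Map<string, string>; // path -> file contents
type Derive = (privPath: string) => string | null; // null = cannot derive

interface Outcome {
  status: "match" | "repaired" | "skipped";
  backupPath?: string;
}

// Key type + base64 only; the trailing comment is ignored.
function keyBody(line: string): string {
  return line.trim().split(/\s+/).slice(0, 2).join(" ");
}

function verifyThenRepair(
  files: Files,
  privPath: string,
  pubPath: string,
  derive: Derive,
  timestamp: number,
): Outcome {
  const derived = derive(privPath);
  const current = files.get(pubPath);
  // Unverifiable (e.g. passphrase-protected): skip, create no backup.
  if (derived === null || current === undefined) return { status: "skipped" };
  if (keyBody(derived) === keyBody(current)) return { status: "match" };

  // Mismatch: the priv is authoritative, so keep the stale pub as a backup
  // and rewrite the .pub from the derived key.
  const backupPath = `${pubPath}.spawn-backup-${timestamp}`;
  files.set(backupPath, current);
  files.set(pubPath, derived.trim() + "\n");
  return { status: "repaired", backupPath };
}

// Usage: priv derives key A, but the .pub on disk holds stale key B.
const exampleFiles: Files = new Map([["id_ed25519.pub", "ssh-ed25519 AAAAkeyB old@host\n"]]);
const exampleDerive: Derive = () => "ssh-ed25519 AAAAkeyA";
const exampleOutcome = verifyThenRepair(exampleFiles, "id_ed25519", "id_ed25519.pub", exampleDerive, 1778111358881);
console.log(exampleOutcome.status, exampleOutcome.backupPath);
```

Writing the backup before overwriting the .pub means the stale contents are never lost, which is what makes an automatic rewrite reasonable here.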

Bumps CLI to 1.0.38.

Before / After (for the Slack user)

Before:

SSH key 'id_ed25519' already registered with DigitalOcean
Waiting for SSH handshake...
SSH handshake failed (1/33): Permission denied (publickey).
SSH handshake failed (2/33): Permission denied (publickey).
... 31 more ...

After:

Repaired ~/.ssh/id_ed25519.pub (stale public key replaced; original saved as ~/.ssh/id_ed25519.pub.spawn-backup-1778111358881).
Using 1 SSH key(s)
SSH handshake succeeded

The orphaned stale pub that was previously registered with DigitalOcean stays on the account but is unused. The user can delete it from the DO dashboard if they want.

How the original failure happens

  1. User has ~/.ssh/id_ed25519 (priv A) and ~/.ssh/id_ed25519.pub (pub B from a different machine, e.g. copied without the matching priv).
  2. ensureSshKey() fingerprints .pub (B), finds it on DO, logs "already registered."
  3. createServer() attaches all account keys to the droplet, including B.
  4. ssh -i id_ed25519 root@droplet → priv A presents pub A to the server, server only knows B → publickey denied.

With this PR, the mismatch is caught at step 1, during key discovery: spawn rewrites the local .pub from priv A (so the correct pub A is now on disk), and registration proceeds with the correct pub.
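The four steps can be captured in a toy model. Everything here is hypothetical and simplified for illustration: keys are plain strings, a private key is represented only by the public key it derives to, and launch and repairLocalPub are made-up names rather than spawn functions.

```typescript
// Toy model of the failure; not spawn's actual code.
interface LocalKeys {
  privDerivesTo: string; // the pub that the local priv actually pairs with
  pubOnDisk: string; // contents of the local .pub file
}

type LaunchResult = "handshake ok" | "permission denied";

function launch(local: LocalKeys, accountKeys: Set<string>): LaunchResult {
  // Step 2: the on-disk pub is registered with (or found on) the provider.
  accountKeys.add(local.pubOnDisk);
  // Step 3: the droplet receives all account keys in authorized_keys.
  const authorizedKeys = new Set(accountKeys);
  // Step 4: the server accepts only a key whose pub it knows.
  return authorizedKeys.has(local.privDerivesTo) ? "handshake ok" : "permission denied";
}

// The repair: the priv is authoritative, so the .pub is rewritten from it.
function repairLocalPub(local: LocalKeys): LocalKeys {
  return { ...local, pubOnDisk: local.privDerivesTo };
}

const stale: LocalKeys = { privDerivesTo: "pub-A", pubOnDisk: "pub-B" };
const doAccount = new Set<string>(["pub-B"]); // B is already registered

const before = launch(stale, doAccount);
const after = launch(repairLocalPub(stale), doAccount);
console.log(before, "->", after);
```

Note that pub-B stays in the account set after the repair, which mirrors the orphaned key left on the DigitalOcean account.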

Test plan

  • bun test src/__tests__/ssh-keys.test.ts src/__tests__/ssh-keys-cov.test.ts — 29/29 pass
  • bunx @biomejs/biome check src/ — clean (0 errors across 202 files)
  • Auto-repair rewrites .pub with derived contents and preserves stale contents in a backup
  • Passphrase-protected keys continue to be skipped silently (no backup file created)
  • verifyKeyPair still returns "match" | "mismatch" | "unverifiable" (no signature change)

Closes #3396.

…ders

When a local SSH .pub file doesn't actually pair with the corresponding
.priv (e.g. .pub copied from another machine, regenerated mid-flow, or
edited by hand), spawn would still register the .pub with the cloud
provider's key store. The registration check passes by fingerprint, the
droplet boots with that key in authorized_keys, and SSH then fails with
"Permission denied (publickey)" because the local .priv can't prove
ownership of the registered .pub. This produced the silent failure mode
where users saw "SSH key 'id_ed25519' already registered with
DigitalOcean" immediately followed by 33 "Permission denied" retries.

Adds verifyKeyPair() which derives the public key from the private key
via `ssh-keygen -y -P "" -f priv` and compares it (key type + base64,
ignoring the comment field) to the .pub file. discoverSshKeys() now
filters out mismatched pairs with a clear warning naming the offending
file, and silently skips passphrase-protected or otherwise
unverifiable keys (BatchMode SSH can't use them anyway).

Bumps CLI to 1.0.37.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@la14-1
Member

la14-1 commented May 6, 2026

Confirmed in the wild: this PR fixes the exact failure reported in Slack (thread) where a user hit SSH key 'id_ed25519' already registered with DigitalOcean followed by 33× Permission denied (publickey) when launching hermes. The local .pub was registered with DO (visible in their dashboard) but didn't pair with the local .priv — precisely the silent-failure mode verifyKeyPair() catches before any droplet is created.

Linked from Slack by SPA

@la14-1
Member

la14-1 commented May 6, 2026

Follow-up filed: #3396 to extend this from diagnose to auto-repair (rewrite the stale .pub from the matching .priv, with backup). The current PR is a clean diagnosis layer and should merge as-is; the repair step lands on top.

The Slack user who prompted this (thread) wouldn't be unblocked by diagnosis alone — they'd still have to run ssh-keygen -y -P "" -f ~/.ssh/id_ed25519 > ~/.ssh/id_ed25519.pub manually. #3396 makes spawn do that itself so the next hermes launch Just Works.
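For reference, the derivation half of that manual command could be scripted roughly as follows. This is a sketch under assumptions, not the PR's private derivePubFromPriv helper (whose body is not shown here): it assumes a Node-compatible runtime with ssh-keygen on the PATH, and the 5-second timeout is an arbitrary choice.

```typescript
import { execFileSync } from "node:child_process";

// Derive the public key from a private key file.
// Returns null when derivation fails for any reason: missing file,
// passphrase-protected key (the empty -P "" passphrase is rejected),
// or ssh-keygen not being installed.
function derivePubFromPriv(privPath: string): string | null {
  try {
    const out = execFileSync("ssh-keygen", ["-y", "-P", "", "-f", privPath], {
      encoding: "utf8",
      stdio: ["ignore", "pipe", "ignore"], // no stdin, so it can never prompt
      timeout: 5000, // assumption: arbitrary safety limit
    });
    const line = out.trim();
    return line === "" ? null : line;
  } catch {
    return null;
  }
}
```

Passing the arguments as an array avoids shell quoting issues with paths such as ~/.ssh/id_ed25519 (note that execFileSync does not expand ~, so callers would need to resolve the home directory themselves).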

Linked from Slack by SPA

When the local .pub doesn't derive from the matching .priv (stale copy
from another machine, etc.), the priv is still authoritative — any .pub
that doesn't derive from it is wrong by definition. Previously spawn
printed a warning and skipped the pair; now it backs up the stale .pub
as .pub.spawn-backup-<timestamp> and rewrites the .pub from the derived
key. The next launch uses the correct pub end-to-end, so the droplet
boots with a public key that actually pairs with the local priv and SSH
handshake succeeds instead of failing 33 times with "Permission denied
(publickey)".

Passphrase-protected keys (ssh-keygen -y cannot derive without the
passphrase) are still skipped silently — nothing to repair with.

Bumps CLI to 1.0.38.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@la14-1 la14-1 changed the title from "fix(ssh): verify pub/priv keypair before registering with cloud providers" to "fix(ssh): auto-repair stale pub that does not pair with local priv" May 6, 2026
@la14-1
Member

la14-1 commented May 6, 2026

Pushed commit 2a53420 extending this from diagnose to auto-repair:

  • New exported repairPubFromPriv(priv, pub) helper: on mismatch, backs up the stale .pub to <pub>.spawn-backup-<timestamp> and rewrites .pub from the derived key.
  • discoverSshKeys() now calls verify-then-repair instead of warn-and-skip. The Slack user's run would now succeed end-to-end instead of needing manual ssh-keygen -y recovery.
  • Extracted a private derivePubFromPriv helper so verifyKeyPair and repairPubFromPriv share the derivation (no double-invocation of ssh-keygen).
  • 29/29 tests pass (was 27; added auto-repair integration test + 2 repairPubFromPriv unit tests). Biome clean on full src/.
  • Bumped CLI to 1.0.38.
  • Title + body updated to reflect the broader scope. Closed #3396 ([Bug]: auto-repair stale .pub instead of skipping when it doesn't pair with local .priv) as rolled in.

Ready for re-review.

Updated from Slack by SPA

@AhmedTMM AhmedTMM marked this pull request as ready for review May 6, 2026 23:59
Member

@la14-1 la14-1 left a comment


LGTM — fixes the Slack-reported hermes launch failure end-to-end. Verify + auto-repair logic is tasteful (priv is authoritative, stale pub backed up with timestamp, passphrase keys silently skipped). Tests 29/29 and biome clean.

@la14-1 la14-1 merged commit 070be39 into OpenRouterTeam:main May 7, 2026
5 checks passed
