fix(ocap-kernel): re-dial relays on connection close with exponential backoff #860

Merged
sirtimid merged 6 commits into main from fix/relay-reconnection-859
Mar 3, 2026
Conversation

@sirtimid
Contributor

@sirtimid sirtimid commented Feb 27, 2026

Closes #859

Summary

  • Add relay health monitor in ConnectionFactory that detects when relay connections close and automatically re-dials them with exponential backoff (5s → 10s → 20s → 40s → 60s cap, 10 max attempts)
  • On successful re-dial, libp2p's reservation-store handles re-reservation internally — we only need to re-establish the connection
  • Clean up pending reconnect timers on stop() to prevent leaked timers
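
The backoff schedule quoted above can be sketched as a small pure function. This is only an illustration of the stated numbers (5s base, doubling, 60s cap, 10 attempts); the constant and function names are hypothetical and are not taken from the PR's code.

```typescript
// Hypothetical constants mirroring the values quoted in the summary.
const BASE_DELAY_MS = 5000;
const MAX_DELAY_MS = 60000;
const MAX_ATTEMPTS = 10;

/**
 * Delay before the given 1-based reconnect attempt, or undefined once
 * the attempt number falls outside 1..MAX_ATTEMPTS.
 */
function relayBackoffDelayMs(attempt: number): number | undefined {
  if (attempt < 1 || attempt > MAX_ATTEMPTS) {
    return undefined;
  }
  // Double the base delay for each prior attempt, then clamp to the cap.
  return Math.min(BASE_DELAY_MS * Math.pow(2, attempt - 1), MAX_DELAY_MS);
}

// Delays for attempts 1 through 6: 5000, 10000, 20000, 40000, 60000, 60000.
const firstDelays = [1, 2, 3, 4, 5, 6].map(relayBackoffDelayMs);
```

Attempts 5 through 10 all sit at the cap, so a relay that stays down is retried roughly once a minute until the attempt limit is reached.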

Test plan

  • Unit tests: relay reconnection triggered on connection:close for a relay peer
  • Unit tests: non-relay peer close does not trigger reconnection
  • Unit tests: backoff timing increases on consecutive failures (5s, 10s, 20s, 40s, 60s cap)
  • Unit tests: max retry limit (10) respected
  • Unit tests: stop() clears pending reconnect timers
  • Manual: two kernels connected via relay → kill relay → restart relay → kernels auto-reconnect

🤖 Generated with Claude Code


Note

Medium Risk
Adds automatic relay reconnection with timers/backoff and new shutdown coordination, which can affect connection stability and resource cleanup if edge cases are missed. Changes are localized to libp2p connection management and covered by targeted unit tests.

Overview
Adds automatic relay reconnection in ConnectionFactory: when a connection:close event involves a known relay peer, the factory schedules a re-dial using calculateReconnectionBackoff with exponential delays (5s base, 60s cap) and a max of 10 attempts, while preventing duplicate reconnect loops per relay.

Improves lifecycle cleanup by tracking pending reconnect timers, clearing them on stop(), and guarding against reconnects during/after shutdown or when the abort signal is already aborted. Expands unit tests to cover reconnect scheduling, backoff progression, max-attempt exhaustion, stop-time cleanup, and several race/edge cases (e.g., stop during in-flight dial).
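
A minimal sketch of the bookkeeping described in this overview: one pending timer per relay (so duplicate close events do not start a second loop), and a `stop()` that clears every timer and blocks later scheduling. The class, its method names, and the injected `dial` callback are assumptions for illustration; the real `ConnectionFactory` internals are not shown in this conversation.

```typescript
// Hypothetical dial callback; the PR dials through libp2p instead.
type RelayDial = (relayId: string) => Promise<void>;

class RelayReconnectScheduler {
  private readonly timers = new Map<string, ReturnType<typeof setTimeout>>();
  private stopped = false;
  private readonly dial: RelayDial;
  private readonly delayMs: number;

  constructor(dial: RelayDial, delayMs: number) {
    this.dial = dial;
    this.delayMs = delayMs;
  }

  /** Returns true if a reconnect was scheduled, false if it was skipped. */
  schedule(relayId: string): boolean {
    // Skip after shutdown, and skip if this relay already has a pending timer.
    if (this.stopped || this.timers.has(relayId)) {
      return false;
    }
    const timer = setTimeout(() => {
      this.timers.delete(relayId);
      this.dial(relayId).catch((error: unknown) => {
        console.error(`re-dial of ${relayId} failed`, error);
      });
    }, this.delayMs);
    this.timers.set(relayId, timer);
    return true;
  }

  pendingCount(): number {
    return this.timers.size;
  }

  /** Clears all pending timers so none outlive shutdown. */
  stop(): void {
    this.stopped = true;
    this.timers.forEach((timer) => clearTimeout(timer));
    this.timers.clear();
  }
}
```

Keying the timers by relay is what gives deduplication for free here: a second `connection:close` for the same relay finds an existing entry and returns early.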

Written by Cursor Bugbot for commit d4eefb8. This will update automatically on new commits. Configure here.

@github-actions
Contributor

github-actions Bot commented Feb 27, 2026

Coverage Report

| Status | Category | Percentage | Covered / Total |
|---|---|---|---|
| 🔵 | Lines | 76.11% (⬆️ +0.06%) | 6639 / 8722 |
| 🔵 | Statements | 76% (⬆️ +0.06%) | 6745 / 8874 |
| 🔵 | Functions | 73.95% (⬆️ +0.07%) | 1653 / 2235 |
| 🔵 | Branches | 75.38% (⬆️ +0.07%) | 2472 / 3279 |

File Coverage

Changed Files

| File | Stmts | Branches | Functions | Lines | Uncovered Lines |
|---|---|---|---|---|---|
| packages/ocap-kernel/src/remotes/platform/connection-factory.ts | 92.74% (⬇️ -2.81%) | 86.17% (⬆️ +0.06%) | 97.36% (⬆️ +0.49%) | 92.63% (⬇️ -2.82%) | 70, 79, 86, 166-171, 385, 530, 578-579, 584-588, 647, 660 |

Generated in workflow #3848 for commit d4eefb8 by the Vitest Coverage Report Action

@sirtimid sirtimid marked this pull request as ready for review February 27, 2026 16:30
@sirtimid sirtimid requested a review from a team as a code owner February 27, 2026 16:30
Comment thread packages/ocap-kernel/src/remotes/platform/connection-factory.ts
@sirtimid sirtimid enabled auto-merge February 27, 2026 17:11
@rekmarks rekmarks disabled auto-merge February 27, 2026 21:36
rekmarks
rekmarks previously approved these changes Feb 27, 2026
Member

@rekmarks rekmarks left a comment

Good to go, but a couple of things to consider.

Comment thread packages/ocap-kernel/src/remotes/platform/connection-factory.ts Outdated
Comment thread packages/ocap-kernel/src/remotes/platform/connection-factory.ts Outdated
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread packages/ocap-kernel/src/remotes/platform/connection-factory.ts
sirtimid and others added 6 commits March 1, 2026 15:12
… backoff (#859)

Circuit relay v2 reservations are lost when the TCP connection to the
relay drops (network blip, relay restart, idle timeout). libp2p's
built-in refresh only works while the connection stays alive, so peers
become permanently disconnected with NO_RESERVATION errors.

Add a relay health monitor in ConnectionFactory that listens for
connection:close events and automatically re-dials relay peers with
exponential backoff (5s base, 60s cap, 10 max attempts). On successful
re-dial, libp2p's reservation-store handles re-reservation internally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…failures

- Add #stopped flag to prevent reconnect scheduling during stop() teardown
- Wrap async setTimeout callback in IIFE to avoid unhandled rejections
- Log malformed relay addresses instead of silently skipping them
- Use error-level logging for reconnect exhaustion and failures
- Clear reconnect timers after libp2p.stop() to catch late connection:close events
- Add test for connection:close events firing during stop()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add .catch() on IIFE to prevent unhandled rejections if catch block throws
- Log warning when relay address lacks /p2p/<peerId> suffix
- Include error details in malformed relay address warning
- Log warning when relay address lookup fails unexpectedly
- Add tests: recovery after transient failures, duplicate close deduplication,
  aborted signal suppresses reconnect
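
The IIFE-plus-`.catch()` pattern mentioned in these commit notes can be sketched as below. `setTimeout` ignores the promise returned by an async callback, so a rejection would otherwise go unhandled. The helper name `runDetached` and its parameters are hypothetical, not names from the PR.

```typescript
/**
 * Runs an async task without awaiting it, routing any failure
 * (synchronous throw or rejected promise) to onError.
 */
function runDetached(
  task: () => Promise<void>,
  onError: (error: unknown) => void,
): void {
  // The async IIFE turns a synchronous throw inside task() into a rejection,
  // and the trailing .catch() ensures that rejection is always handled.
  (async () => {
    await task();
  })().catch(onError);
}

// Usage inside a timer callback: the callback itself stays synchronous.
const exampleTimer = setTimeout(() => {
  runDetached(
    async () => {
      throw new Error('dial failed');
    },
    (error) => console.error('relay reconnect failed', error),
  );
}, 0);
```

If `onError` itself throws, the promise returned by `.catch()` rejects and nothing handles it, so the handler should be kept simple, for example a single logging call.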

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…er stop

The recursive #reconnectRelay call from the catch block bypassed the
#stopped check that only existed in #scheduleRelayReconnect and the
timer callback. If stop() completed while a dial() was in-flight, the
subsequent rejection would schedule a new timer that stop() could never
clean up.

Add #stopped and signal.aborted checks at the top of #reconnectRelay
itself, and add a test that reproduces the exact race.
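
The race described above can be illustrated with a reduced sketch in which the retry path re-enters through a single guarded function. Because the guard sits at the top of `reconnect` itself, a dial that rejects after `stop()` cannot create a new timer. The class and member names are hypothetical stand-ins for `#reconnectRelay` and `#stopped`, and the attempt limit and backoff are left out for brevity.

```typescript
// Hypothetical dial callback standing in for the libp2p dial.
type DialRelay = (relayId: string) => Promise<void>;

class GuardedRelayReconnector {
  private stopped = false;
  private readonly timers = new Set<ReturnType<typeof setTimeout>>();
  private readonly dial: DialRelay;
  private readonly signal: AbortSignal;
  private readonly delayMs: number;

  constructor(dial: DialRelay, signal: AbortSignal, delayMs: number) {
    this.dial = dial;
    this.signal = signal;
    this.delayMs = delayMs;
  }

  /** Returns true if a timer was scheduled, false if the guard rejected it. */
  reconnect(relayId: string): boolean {
    // Guard at the top so every caller, including the retry below, is checked.
    if (this.stopped || this.signal.aborted) {
      return false;
    }
    const timer = setTimeout(() => {
      this.timers.delete(timer);
      this.dial(relayId).catch(() => {
        // A late rejection re-enters through the guarded entry point.
        this.reconnect(relayId);
      });
    }, this.delayMs);
    this.timers.add(timer);
    return true;
  }

  pendingTimerCount(): number {
    return this.timers.size;
  }

  stop(): void {
    this.stopped = true;
    this.timers.forEach((timer) => clearTimeout(timer));
    this.timers.clear();
  }
}
```

Placing the check in the entry point rather than in each caller means a future code path that calls `reconnect` gets the same protection without extra work.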

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… catch errors

Use `calculateReconnectionBackoff` from `@metamask/kernel-utils` instead
of manual exponential backoff math, adding full jitter to relay reconnect
delays. Log errors in the async IIFE `.catch()` instead of silently
swallowing them.
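
For readers unfamiliar with the term, "full jitter" generally means picking a uniformly random delay between zero and the exponential ceiling, which spreads out peers that lost the same relay at the same moment. The sketch below shows that general technique only. It is not the implementation of `calculateReconnectionBackoff`, whose signature is not shown here; all names and the injectable `random` parameter are assumptions.

```typescript
// Hypothetical constants matching the 5s base and 60s cap quoted in this PR.
const JITTER_BASE_MS = 5000;
const JITTER_CAP_MS = 60000;

/**
 * Full-jitter delay for a 1-based attempt: a uniform value in
 * [0, min(cap, base * 2^(attempt - 1))).
 */
function fullJitterDelayMs(
  attempt: number,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(
    JITTER_BASE_MS * Math.pow(2, attempt - 1),
    JITTER_CAP_MS,
  );
  return Math.floor(random() * ceiling);
}

// With a fixed random source of 0.5, attempt 3 yields half of 20000.
const midpointDelay = fullJitterDelayMs(3, () => 0.5);
```

Injecting the random source keeps the function deterministic under test while defaulting to `Math.random` in normal use.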

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… handler

The outer .catch() in #reconnectRelay logged errors but did not remove the
relay from #pendingRelayReconnects. If an unexpected throw occurred before
the inner try block, the stale entry permanently blocked future reconnection
attempts for that relay.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sirtimid sirtimid force-pushed the fix/relay-reconnection-859 branch from 9873928 to d4eefb8 Compare March 1, 2026 14:12
@sirtimid sirtimid requested a review from rekmarks March 2, 2026 12:16
@sirtimid sirtimid added this pull request to the merge queue Mar 3, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Mar 3, 2026
@sirtimid sirtimid added this pull request to the merge queue Mar 3, 2026
Merged via the queue into main with commit 7881c29 Mar 3, 2026
29 checks passed
@sirtimid sirtimid deleted the fix/relay-reconnection-859 branch March 3, 2026 15:35
Development

Successfully merging this pull request may close these issues.

fix(ocap-kernel): circuit relay reservations expire without renewal
