fix(ocap-kernel): re-dial relays on connection close with exponential backoff #860
Merged
rekmarks
previously approved these changes
Feb 27, 2026
Member
rekmarks
left a comment
Good to go, but a couple of things to consider.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
… backoff (#859)

Circuit relay v2 reservations are lost when the TCP connection to the relay drops (network blip, relay restart, idle timeout). libp2p's built-in refresh only works while the connection stays alive, so peers become permanently disconnected with NO_RESERVATION errors.

Add a relay health monitor in ConnectionFactory that listens for connection:close events and automatically re-dials relay peers with exponential backoff (5s base, 60s cap, 10 max attempts). On successful re-dial, libp2p's reservation-store handles re-reservation internally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
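The monitor described in this commit can be pictured roughly as follows. This is a minimal sketch, not the code in `ConnectionFactory`: `RelayMonitor`, `onConnectionClose`, `settle`, and the injected `schedule` callback are hypothetical names, and the real implementation reacts to libp2p's `connection:close` event rather than receiving a peer ID string directly.

```typescript
// Hypothetical sketch: decide whether a closed connection should trigger a
// relay re-dial, and keep at most one reconnect loop per relay.
type ScheduleFn = (relayPeerId: string, attempt: number) => void;

class RelayMonitor {
  private readonly relayPeerIds: Set<string>;
  private readonly pending = new Set<string>();
  private readonly schedule: ScheduleFn;

  constructor(relayPeerIds: string[], schedule: ScheduleFn) {
    this.relayPeerIds = new Set(relayPeerIds);
    this.schedule = schedule;
  }

  // Call with the remote peer ID of a connection that just closed.
  // Returns true if a reconnect loop was started.
  onConnectionClose(remotePeerId: string): boolean {
    if (!this.relayPeerIds.has(remotePeerId)) {
      return false; // not a known relay, nothing to do
    }
    if (this.pending.has(remotePeerId)) {
      return false; // a reconnect loop is already running for this relay
    }
    this.pending.add(remotePeerId);
    this.schedule(remotePeerId, 1); // first attempt
    return true;
  }

  // Call when a re-dial succeeds or the attempts are exhausted.
  settle(relayPeerId: string): void {
    this.pending.delete(relayPeerId);
  }
}
```

Tracking pending relays in a set is one simple way to ignore duplicate close events for the same relay; the PR may track this differently.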
…failures

- Add #stopped flag to prevent reconnect scheduling during stop() teardown
- Wrap async setTimeout callback in IIFE to avoid unhandled rejections
- Log malformed relay addresses instead of silently skipping them
- Use error-level logging for reconnect exhaustion and failures
- Clear reconnect timers after libp2p.stop() to catch late connection:close events
- Add test for connection:close events firing during stop()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
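The stop flag, the timer bookkeeping, and the IIFE wrapper from this commit fit together roughly like this. It is a sketch under assumptions: `ReconnectTimers`, `schedule`, and `pendingCount` are hypothetical names, and TypeScript `private` members stand in for the `#` private fields the PR mentions.

```typescript
// Hypothetical sketch of timer bookkeeping around stop(); names are
// illustrative, not the ones used in ConnectionFactory.
type Logger = (message: string, error: unknown) => void;

class ReconnectTimers {
  private stopped = false; // the PR uses a `#stopped` private field
  private readonly timers = new Set<ReturnType<typeof setTimeout>>();
  private readonly logError: Logger;

  constructor(logError: Logger) {
    this.logError = logError;
  }

  get pendingCount(): number {
    return this.timers.size;
  }

  // Schedule an async task. Returns false once shutdown has begun.
  schedule(task: () => Promise<void>, delayMs: number): boolean {
    if (this.stopped) {
      return false;
    }
    const timer = setTimeout(() => {
      this.timers.delete(timer);
      // setTimeout ignores the callback's return value, so a rejected async
      // callback would go unhandled. Run the task in an async IIFE and
      // attach .catch() so failures are logged instead.
      (async () => {
        if (this.stopped) {
          return;
        }
        await task();
      })().catch((error: unknown) => this.logError('reconnect task failed', error));
    }, delayMs);
    this.timers.add(timer);
    return true;
  }

  // Block new scheduling first, then cancel everything still pending.
  stop(): void {
    this.stopped = true;
    this.timers.forEach((timer) => clearTimeout(timer));
    this.timers.clear();
  }
}
```

Setting the flag before clearing the timers means a close event that arrives mid-teardown cannot add a timer that the clear step would miss.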
- Add .catch() on IIFE to prevent unhandled rejections if catch block throws
- Log warning when relay address lacks /p2p/<peerId> suffix
- Include error details in malformed relay address warning
- Log warning when relay address lookup fails unexpectedly
- Add tests: recovery after transient failures, duplicate close deduplication, aborted signal suppresses reconnect

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
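To illustrate the `/p2p/<peerId>` suffix check, here is a small sketch that extracts the peer ID from a relay address string and warns when the suffix is missing. It assumes plain string handling for clarity; `relayPeerIdFromAddress` and the `warn` callback are hypothetical, and the PR most likely relies on libp2p's multiaddr parsing instead.

```typescript
// Hypothetical helper: pull the relay's peer ID out of a multiaddr string.
// Plain string handling for illustration only; not the libp2p multiaddr API.
function relayPeerIdFromAddress(
  address: string,
  warn: (message: string) => void,
): string | undefined {
  const parts = address.split('/').filter((part) => part.length > 0);
  const last = parts[parts.length - 1];
  const hasSuffix = parts.length >= 2 && parts[parts.length - 2] === 'p2p';
  if (!hasSuffix || last === undefined) {
    // Without a peer ID the relay cannot be matched against close events,
    // so report it instead of skipping silently.
    warn(`relay address lacks /p2p/<peerId> suffix: ${address}`);
    return undefined;
  }
  return last;
}
```

For example, `/dns4/relay.example/tcp/443/wss/p2p/12D3KooWExamplePeerId` (a placeholder address) yields `12D3KooWExamplePeerId`, while `/ip4/127.0.0.1/tcp/9001` produces a warning and `undefined`.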
…er stop

The recursive #reconnectRelay call from the catch block bypassed the #stopped check that only existed in #scheduleRelayReconnect and the timer callback. If stop() completed while a dial() was in-flight, the subsequent rejection would schedule a new timer that stop() could never clean up.

Add #stopped and signal.aborted checks at the top of #reconnectRelay itself, and add a test that reproduces the exact race.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
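The race is easier to see in a reduced form. The sketch below is an assumption-laden model, not the PR's code: `RelayReconnector`, `dialOnce`, and the injected `dial` and `schedule` functions are hypothetical, the delay is fixed at 5s, and the attempt limit and logging are left out. The point is only that every retry re-enters `reconnect`, so the guard at its top also covers a dial that rejects after `stop()`.

```typescript
// Hypothetical model of the race fixed in this commit.
type Dial = () => Promise<void>;
type Schedule = (callback: () => void, delayMs: number) => void;

class RelayReconnector {
  private stopped = false; // `#stopped` in the PR
  private readonly dial: Dial;
  private readonly schedule: Schedule;
  private readonly signal: { aborted: boolean };

  constructor(dial: Dial, schedule: Schedule, signal: { aborted: boolean }) {
    this.dial = dial;
    this.schedule = schedule;
    this.signal = signal;
  }

  stop(): void {
    this.stopped = true;
  }

  // Entry point for every attempt, including retries from the catch block.
  reconnect(attempt: number): void {
    // Guard at the top: a dial that rejects after stop() re-enters here and
    // must not schedule a timer that stop() can no longer clean up.
    if (this.stopped || this.signal.aborted) {
      return;
    }
    this.schedule(() => {
      void this.dialOnce(attempt); // dialOnce handles its own failures
    }, 5000);
  }

  private async dialOnce(attempt: number): Promise<void> {
    try {
      await this.dial();
    } catch {
      this.reconnect(attempt + 1); // the retry passes through the guard
    }
  }
}
```

Placing the check in the function that schedules, rather than only in its callers, means any future call path gets the same protection.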
… catch errors

Use `calculateReconnectionBackoff` from `@metamask/kernel-utils` instead of manual exponential backoff math, adding full jitter to relay reconnect delays. Log errors in the async IIFE `.catch()` instead of silently swallowing them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
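For readers unfamiliar with full jitter, the following sketch shows the general technique using the constants quoted in this PR (5s base, 60s cap). The signature of `calculateReconnectionBackoff` is not visible on this page, so `backoffCeilingMs` and `fullJitterBackoffMs` are hypothetical stand-ins and make no claim to match it.

```typescript
// Hypothetical stand-in for exponential backoff with full jitter. This does
// not mirror the signature of `calculateReconnectionBackoff`.
const BASE_DELAY_MS = 5000;
const MAX_DELAY_MS = 60000;

// Ceiling for a 1-based attempt number: 5s, 10s, 20s, 40s, then capped at 60s.
function backoffCeilingMs(attempt: number): number {
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * Math.pow(2, attempt - 1));
}

// Full jitter: pick a delay uniformly between 0 and the ceiling, so relays
// that dropped at the same moment do not all re-dial in lockstep.
function fullJitterBackoffMs(
  attempt: number,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * backoffCeilingMs(attempt));
}
```

If the helper applies full jitter in this sense, the 5s, 10s, 20s, 40s, 60s progression would describe upper bounds on each delay rather than exact waits.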
… handler

The outer .catch() in #reconnectRelay logged errors but did not remove the relay from #pendingRelayReconnects. If an unexpected throw occurred before the inner try block, the stale entry permanently blocked future reconnection attempts for that relay.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
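A reduced illustration of the stale-entry problem and its fix follows. It is a sketch with hypothetical names (`PendingReconnects`, `start`, `finish`, `isPending`); in this model the caller's `run` function is expected to call `finish` on normal completion, which is an assumption and may not match how the PR clears entries on success or exhaustion.

```typescript
// Hypothetical sketch: clean up the pending set in the outer .catch().
class PendingReconnects {
  private readonly pending = new Set<string>();
  private readonly logError: (message: string, error: unknown) => void;

  constructor(logError: (message: string, error: unknown) => void) {
    this.logError = logError;
  }

  isPending(relayPeerId: string): boolean {
    return this.pending.has(relayPeerId);
  }

  // Normal-path cleanup, called by `run` when it finishes (assumption).
  finish(relayPeerId: string): void {
    this.pending.delete(relayPeerId);
  }

  // Returns false when a reconnect loop is already registered for the relay.
  start(relayPeerId: string, run: () => Promise<void>): boolean {
    if (this.pending.has(relayPeerId)) {
      return false;
    }
    this.pending.add(relayPeerId);
    (async () => {
      await run();
    })().catch((error: unknown) => {
      this.logError(`relay reconnect failed unexpectedly: ${relayPeerId}`, error);
      // Without this delete, the stale entry would make every later start()
      // for this relay return false.
      this.pending.delete(relayPeerId);
    });
    return true;
  }
}
```

Logging alone leaves the set out of sync with reality; removing the entry in the same handler keeps the relay eligible for a later reconnect.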
9873928 to
d4eefb8
rekmarks
approved these changes
Mar 3, 2026
Closes #859
Summary
- Add a relay health monitor in `ConnectionFactory` that detects when relay connections close and automatically re-dials them with exponential backoff (5s → 10s → 20s → 40s → 60s cap, 10 max attempts)
- On successful re-dial, libp2p's `reservation-store` handles re-reservation internally; we only need to re-establish the connection
- … `stop()` to prevent leaked timers

Test plan
- … `connection:close` for a relay peer
- `stop()` clears pending reconnect timers

🤖 Generated with Claude Code
Note
Medium Risk
Adds automatic relay reconnection with timers/backoff and new shutdown coordination, which can affect connection stability and resource cleanup if edge cases are missed. Changes are localized to libp2p connection management and covered by targeted unit tests.
Overview
Adds automatic relay reconnection in `ConnectionFactory`: when a `connection:close` event involves a known relay peer, the factory schedules a re-dial using `calculateReconnectionBackoff` with exponential delays (5s base, 60s cap) and a max of 10 attempts, while preventing duplicate reconnect loops per relay.

Improves lifecycle cleanup by tracking pending reconnect timers, clearing them on `stop()`, and guarding against reconnects during/after shutdown or when the abort signal is already aborted. Expands unit tests to cover reconnect scheduling, backoff progression, max-attempt exhaustion, stop-time cleanup, and several race/edge cases (e.g., stop during in-flight dial).

Written by Cursor Bugbot for commit d4eefb8.