@sirtimid
Contributor

@sirtimid sirtimid commented Jan 29, 2026

Closes #688

Summary

  • Add error pattern tracking to ReconnectionManager to detect permanently unreachable peers
  • When the same error code (ECONNREFUSED, EHOSTUNREACH, ENOTFOUND, or ENETUNREACH) occurs N times in a row (default 5), the peer is marked as permanently failed
  • Permanent failures stop reconnection attempts to avoid wasted retries for unreachable peers

Changes

kernel-errors:

  • Add getNetworkErrorCode() helper to extract error codes from errors
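As a rough illustration of what such a helper might look like, here is a minimal sketch. The signature and the `undefined` return for errors without a code are assumptions made for this example; the actual implementation in kernel-errors may differ.

```typescript
// Hypothetical sketch of an error-code extractor; not the actual kernel-errors code.
// Assumption: Node-style network errors carry a string `code` property
// (for example 'ECONNREFUSED'), and anything else yields undefined.
function getNetworkErrorCode(error: unknown): string | undefined {
  if (typeof error !== 'object' || error === null) {
    return undefined;
  }
  const { code } = error as { code?: unknown };
  return typeof code === 'string' && code.length > 0 ? code : undefined;
}
```

Returning `undefined` rather than throwing keeps the helper safe to call on arbitrary rejection values, which is one reasonable design for code that runs inside a retry loop.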

ocap-kernel:

  • Extend ReconnectionState with errorHistory and permanentlyFailed fields
  • Add recordError(), isPermanentlyFailed(), clearPermanentFailure() methods to ReconnectionManager
  • Update startReconnection() to return false for permanently failed peers and reset error history
  • Integrate error recording into reconnection lifecycle
  • Check permanent failure status before attempting reconnection
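The ocap-kernel changes listed above can be sketched roughly as follows. This is an illustrative model only: the class name `ReconnectionManagerSketch`, the `isReconnecting` field, and the choice to reset the history when a new reconnection session starts are assumptions made for the example, not the PR's actual code.

```typescript
// Error codes this PR names as candidates for permanent failure.
const PERMANENT_FAILURE_ERROR_CODES: ReadonlySet<string> = new Set([
  'ECONNREFUSED',
  'EHOSTUNREACH',
  'ENOTFOUND',
  'ENETUNREACH',
]);

// Simplified per-peer state; the real ReconnectionState has more fields.
type ReconnectionState = {
  isReconnecting: boolean;
  errorHistory: string[];
  permanentlyFailed: boolean;
};

class ReconnectionManagerSketch {
  private readonly states = new Map<string, ReconnectionState>();

  private readonly threshold: number;

  constructor(threshold = 5) {
    this.threshold = threshold;
  }

  private getState(peerId: string): ReconnectionState {
    let state = this.states.get(peerId);
    if (state === undefined) {
      state = { isReconnecting: false, errorHistory: [], permanentlyFailed: false };
      this.states.set(peerId, state);
    }
    return state;
  }

  // Record one failure and mark the peer permanently failed once the last
  // `threshold` errors are all the same permanent-failure code.
  recordError(peerId: string, errorCode: string): void {
    const state = this.getState(peerId);
    state.errorHistory.push(errorCode);
    if (state.errorHistory.length > this.threshold) {
      state.errorHistory.shift(); // keep only the most recent `threshold` entries
    }
    if (
      state.errorHistory.length === this.threshold &&
      PERMANENT_FAILURE_ERROR_CODES.has(errorCode) &&
      state.errorHistory.every((code) => code === errorCode)
    ) {
      state.permanentlyFailed = true;
    }
  }

  isPermanentlyFailed(peerId: string): boolean {
    return this.getState(peerId).permanentlyFailed;
  }

  clearPermanentFailure(peerId: string): void {
    const state = this.getState(peerId);
    state.permanentlyFailed = false;
    state.errorHistory = [];
  }

  // Returns false for permanently failed peers; otherwise starts a session
  // (assumed here to begin with an empty error history).
  startReconnection(peerId: string): boolean {
    const state = this.getState(peerId);
    if (state.permanentlyFailed) {
      return false;
    }
    if (!state.isReconnecting) {
      state.isReconnecting = true;
      state.errorHistory = [];
    }
    return true;
  }
}
```

In this model a caller would check the return value of `startReconnection()` and skip the retry loop entirely when it is `false`.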

Test plan

  • Unit tests for getNetworkErrorCode helper
  • Unit tests for error tracking in ReconnectionManager
  • Unit tests for permanent failure detection (consecutive identical errors)
  • Unit tests for clearing permanent failure state
  • Integration tests for reconnection lifecycle with permanent failure
  • All existing tests pass

🤖 Generated with Claude Code


Note

Medium Risk
Changes core reconnection control flow by adding stateful error-pattern tracking and an early-exit path that can stop retries; misclassification or integration bugs could cause peers to be marked failed and never reconnect until manually cleared.

Overview
Adds permanent-failure detection to remote reconnection: ReconnectionManager now tracks per-peer errorHistory (capped) and marks peers permanentlyFailed after N consecutive identical errors from a configured set, preventing further automatic reconnection.

Integrates this into reconnection-lifecycle by extracting an error code via new kernel-errors helper getNetworkErrorCode, recording it on failures, and giving up immediately when a peer becomes permanently failed (including when startReconnection now returns false). Manual reconnectPeer now clears permanent-failure state before retrying, and isRetryableNetworkError adds ENOTFOUND to the retryable set; tests are expanded/updated across kernel-errors, lifecycle, transport, and reconnection manager.

Written by Cursor Bugbot for commit 7c82f74. This will update automatically on new commits.

sirtimid and others added 4 commits January 29, 2026 15:07
Add error pattern tracking to ReconnectionManager to detect when a peer
is permanently unreachable. When the same error code (ECONNREFUSED,
EHOSTUNREACH, ENOTFOUND, or ENETUNREACH) occurs consecutively N times
(default 5), the peer is marked as permanently failed and reconnection
attempts stop.

Changes:
- Add error history tracking per peer in ReconnectionManager
- Add isPermanentlyFailed() and clearPermanentFailure() methods
- Add getNetworkErrorCode() helper to extract error codes
- Integrate error recording into reconnection lifecycle
- Check permanent failure status before attempting reconnection

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add comprehensive unit tests for:
- getNetworkErrorCode helper function
- Error tracking in ReconnectionManager (recordError, getErrorHistory)
- Permanent failure detection (isPermanentlyFailed)
- Clearing permanent failure state (clearPermanentFailure)
- Custom consecutive error threshold
- Integration with startReconnection, clearPeer, and clear

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add integration tests for permanent failure detection in the
reconnection lifecycle:
- Gives up when peer is permanently failed at start of loop
- Records errors after failed dial attempts
- Gives up when error triggers permanent failure
- Continues retrying when error does not trigger failure
- handleConnectionLoss skips reconnection for permanently failed peers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Update existing tests to work with permanent failure detection changes:
- Add getNetworkErrorCode export to index test
- Add getNetworkErrorCode mock to transport tests
- Update startReconnection mocks to return true
- Add isPermanentlyFailed and recordError mocks to ReconnectionManager

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@github-actions
Contributor

github-actions bot commented Jan 29, 2026

Coverage Report

| Status | Category | Percentage | Covered / Total |
| --- | --- | --- | --- |
| 🔵 | Lines | 88.49% (⬆️ +0.08%) | 5941 / 6713 |
| 🔵 | Statements | 88.39% (⬆️ +0.09%) | 6039 / 6832 |
| 🔵 | Functions | 87.18% (⬆️ +0.06%) | 1537 / 1763 |
| 🔵 | Branches | 84.76% (⬆️ +0.17%) | 2164 / 2553 |

File Coverage

Changed Files

| File | Stmts | Branches | Functions | Lines | Uncovered Lines |
| --- | --- | --- | --- | --- | --- |
| packages/kernel-errors/src/index.ts | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | |
| packages/kernel-errors/src/utils/getNetworkErrorCode.ts | 100% | 100% | 100% | 100% | |
| packages/kernel-errors/src/utils/isRetryableNetworkError.ts | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | 100% (🟰 ±0%) | |
| packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.ts | 90.69% (⬆️ +2.12%) | 89.74% (⬆️ +1.87%) | 80% (🟰 ±0%) | 90.69% (⬆️ +2.12%) | 118-122, 142-143, 267-268 |
| packages/ocap-kernel/src/remotes/platform/reconnection.ts | 98.5% (⬇️ -1.50%) | 96.15% (⬇️ -3.85%) | 100% (🟰 ±0%) | 98.48% (⬇️ -1.52%) | 308 |
| packages/ocap-kernel/src/remotes/platform/transport.ts | 86.66% (⬆️ +0.07%) | 81.53% (🟰 ±0%) | 75% (🟰 ±0%) | 86.66% (⬆️ +0.07%) | 103, 122-131, 163, 197-215, 238, 322, 384, 441, 467, 478 |

Generated in workflow #3477 for commit 7c82f74 by the Vitest Coverage Report Action

Fix issues identified in code review:
- Add bounds validation for consecutiveErrorThreshold (must be >= 1)
- Cap error history to prevent unbounded memory growth

The error history is now limited to the threshold size since we only
need the last N errors for pattern detection.
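The two review fixes described here could look something like the sketch below. The helper names `assertValidThreshold` and `appendCapped` are hypothetical, and the integer check goes slightly beyond the stated "must be >= 1" rule; it is an assumption added for the example.

```typescript
// Hypothetical helper: reject thresholds below 1.
// Assumption: non-integer thresholds are also rejected, which the PR text does not state.
function assertValidThreshold(threshold: number): number {
  if (!Number.isInteger(threshold) || threshold < 1) {
    throw new RangeError(
      `consecutiveErrorThreshold must be an integer >= 1, got ${threshold}`,
    );
  }
  return threshold;
}

// Hypothetical helper: append a code and keep only the newest `cap` entries,
// so the history can never grow past the threshold size.
function appendCapped(
  history: readonly string[],
  errorCode: string,
  cap: number,
): string[] {
  return [...history, errorCode].slice(-cap);
}
```

Because pattern detection only looks at the last N entries, capping the history at N bounds memory per peer without changing the detection result.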

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@sirtimid sirtimid marked this pull request as ready for review January 29, 2026 15:18
@sirtimid sirtimid requested a review from a team as a code owner January 29, 2026 15:18
sirtimid and others added 3 commits January 29, 2026 16:32
When a user explicitly calls reconnectPeer, clear the permanent failure
status so the reconnection can proceed. Previously, permanently failed
peers could not be manually reconnected because startReconnection would
return false without attempting any connection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Export both getNetworkErrorCode and isResourceLimitError from kernel-errors
- Handle rate limit and connection limit errors before permanent failure check
- Add both mocks to transport tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

…rogress

Move isReconnecting() check before clearPermanentFailure() in reconnectPeer
to prevent resetting error history during active reconnection attempts.

Previously, calling reconnectPeer while reconnection was in progress would
clear the accumulated error history, defeating the purpose of permanent
failure detection by resetting progress toward the failure threshold.
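The ordering described in this commit can be illustrated with a small sketch. The `ReconnectionManagerLike` type and the synchronous, boolean-returning shape of `reconnectPeer` are assumptions made for the example; the real function likely does more work.

```typescript
// Minimal, hypothetical view of the manager methods this ordering depends on.
type ReconnectionManagerLike = {
  isReconnecting(peerId: string): boolean;
  clearPermanentFailure(peerId: string): void;
  startReconnection(peerId: string): boolean;
};

// Sketch of a manual reconnect entry point.
function reconnectPeer(
  manager: ReconnectionManagerLike,
  peerId: string,
): boolean {
  // Check for an in-progress reconnection first, so an active attempt keeps
  // its accumulated error history.
  if (manager.isReconnecting(peerId)) {
    return false;
  }
  // Only an idle peer has its permanent-failure state (and history) cleared,
  // which lets an explicit user request proceed.
  manager.clearPermanentFailure(peerId);
  return manager.startReconnection(peerId);
}
```

With the two checks in the opposite order, every `reconnectPeer` call during an active loop would wipe the history and the threshold might never be reached.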

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

sirtimid and others added 2 commits January 29, 2026 23:17
Include ENOTFOUND (DNS lookup failed) in isRetryableNetworkError to enable
permanent failure detection for this error code. Previously, ENOTFOUND was
in PERMANENT_FAILURE_ERROR_CODES but not retryable, causing immediate
give-up after the first error without allowing errors to accumulate toward
the permanent failure threshold.
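To see why the retryable check and the permanent-failure set have to agree, consider this simplified decision function. It is a sketch only: `decideAfterFailedDial` is a hypothetical name, retryability is modeled as plain set membership, and the contents of the example retryable sets are assumptions apart from ENOTFOUND being added.

```typescript
const PERMANENT_FAILURE_ERROR_CODES: ReadonlySet<string> = new Set([
  'ECONNREFUSED',
  'EHOSTUNREACH',
  'ENOTFOUND',
  'ENETUNREACH',
]);

type Decision = 'retry' | 'give-up';

// Simplified model of one pass through the retry loop after a failed dial.
// `history` is mutated in place to accumulate error codes.
function decideAfterFailedDial(
  errorCode: string,
  retryableCodes: ReadonlySet<string>,
  history: string[],
  threshold: number,
): Decision {
  // A non-retryable error ends the loop before anything is recorded.
  if (!retryableCodes.has(errorCode)) {
    return 'give-up';
  }
  history.push(errorCode);
  const recent = history.slice(-threshold);
  const permanentlyFailed =
    recent.length === threshold &&
    PERMANENT_FAILURE_ERROR_CODES.has(errorCode) &&
    recent.every((code) => code === errorCode);
  return permanentlyFailed ? 'give-up' : 'retry';
}

// Illustrative retryable sets before and after the change (contents assumed).
const retryableBefore: ReadonlySet<string> = new Set(['ECONNREFUSED', 'ETIMEDOUT']);
const retryableAfter: ReadonlySet<string> = new Set([
  'ECONNREFUSED',
  'ETIMEDOUT',
  'ENOTFOUND',
]);
```

With `retryableBefore`, the first ENOTFOUND returns `'give-up'` and the history stays empty; with `retryableAfter`, the loop retries until the fifth consecutive ENOTFOUND reaches the threshold.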

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

resetBackoff() and resetAllBackoffs() now clear error history in addition
to resetting attempt counts. This prevents stale errors from accumulating
and triggering false permanent failure detection.

Previously, if a peer had 4 ECONNREFUSED errors, then successfully connected,
then had 1 more ECONNREFUSED error, it would be marked as permanently failed
(4+1=5). Now the error history is cleared on successful communication, so
only consecutive errors without intervening successes trigger permanent failure.
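The 4+1 scenario above can be replayed with a small model. The `PeerState` shape and the free-standing `resetBackoff` and `hasFailurePattern` functions are hypothetical simplifications (the check against the permanent-failure code set is omitted for brevity), not the ReconnectionManager API itself.

```typescript
// Simplified per-peer state for this example.
type PeerState = { attemptCount: number; errorHistory: string[] };

// After the fix: a successful communication resets both the attempt count
// and the error history.
function resetBackoff(state: PeerState): void {
  state.attemptCount = 0;
  state.errorHistory = [];
}

// True when the last `threshold` recorded errors are all the same code.
function hasFailurePattern(history: readonly string[], threshold = 5): boolean {
  const recent = history.slice(-threshold);
  return (
    recent.length === threshold && recent.every((code) => code === recent[0])
  );
}

// Replay: four failures, a success, then one more failure.
function replayScenario(clearHistoryOnSuccess: boolean): boolean {
  const state: PeerState = { attemptCount: 0, errorHistory: [] };
  for (let i = 0; i < 4; i += 1) {
    state.attemptCount += 1;
    state.errorHistory.push('ECONNREFUSED');
  }
  if (clearHistoryOnSuccess) {
    resetBackoff(state); // new behavior
  } else {
    state.attemptCount = 0; // old behavior: history survives the success
  }
  state.attemptCount += 1;
  state.errorHistory.push('ECONNREFUSED');
  return hasFailurePattern(state.errorHistory);
}
```

Here `replayScenario(false)` models the old behavior and reports a failure pattern, while `replayScenario(true)` models the fixed behavior and does not.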

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Remote comms: Error pattern analysis and permanent failure detection

2 participants