feat(ocap-kernel): add permanent failure detection for reconnection #789

sirtimid · 2026-01-29T14:21:42Z

Closes #688

Summary

Add error pattern tracking to ReconnectionManager to detect permanently unreachable peers
When the same error code (ECONNREFUSED, EHOSTUNREACH, ENOTFOUND, or ENETUNREACH) occurs consecutively N times (default 5), the peer is marked as permanently failed
Permanent failures stop reconnection attempts to avoid wasted retries for unreachable peers
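
The detection rule in the summary can be sketched as a pure function over a peer's recent error codes. This is a minimal illustration, not the code in this PR: the function name `looksPermanentlyFailed` is hypothetical, and the set below simply mirrors the four codes listed above.

```typescript
// Codes named in the summary; the constant and function names are illustrative.
const PERMANENT_FAILURE_ERROR_CODES: ReadonlySet<string> = new Set([
  'ECONNREFUSED',
  'EHOSTUNREACH',
  'ENOTFOUND',
  'ENETUNREACH',
]);

/**
 * Returns true when the last `threshold` entries of `history` are all the
 * same error code and that code belongs to the permanent-failure set.
 */
function looksPermanentlyFailed(
  history: readonly string[],
  threshold = 5,
): boolean {
  if (threshold < 1 || history.length < threshold) {
    return false;
  }
  const recent = history.slice(-threshold);
  const first = recent[0];
  return (
    first !== undefined &&
    PERMANENT_FAILURE_ERROR_CODES.has(first) &&
    recent.every((code) => code === first)
  );
}
```

Under these assumptions, five consecutive `ECONNREFUSED` entries satisfy the rule, while a run interrupted by a different code (or a run of codes outside the set) does not.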

Changes

kernel-errors:

Add getNetworkErrorCode() helper to extract error codes from errors
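
For readers unfamiliar with how such a helper might work, here is a rough sketch of a code extractor. It is not the implementation added in this PR: the name `getNetworkErrorCodeSketch`, the walk through `cause`, the depth limit, and the `'UNKNOWN'` fallback are all assumptions made for illustration.

```typescript
/**
 * Illustrative extractor: looks for a string `code` property (as found on
 * Node.js system errors), following nested `cause` values if needed.
 * The depth bound and the 'UNKNOWN' sentinel are assumptions.
 */
function getNetworkErrorCodeSketch(error: unknown): string {
  let current: unknown = error;
  // Bound the walk so a cyclic `cause` chain cannot loop forever.
  for (let depth = 0; depth < 5; depth += 1) {
    if (typeof current !== 'object' || current === null) {
      break;
    }
    const { code, cause } = current as { code?: unknown; cause?: unknown };
    if (typeof code === 'string' && code.length > 0) {
      return code;
    }
    // Wrapped errors often carry the original failure in `cause`.
    current = cause;
  }
  return 'UNKNOWN';
}
```

Returning a sentinel rather than `undefined` keeps the recorded history a plain list of strings, which is one possible design choice among several.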

ocap-kernel:

Extend ReconnectionState with errorHistory and permanentlyFailed fields
Add recordError(), isPermanentlyFailed(), clearPermanentFailure() methods to ReconnectionManager
Update startReconnection() to return false for permanently failed peers and reset error history
Integrate error recording into reconnection lifecycle
Check permanent failure status before attempting reconnection

Test plan

Unit tests for getNetworkErrorCode helper
Unit tests for error tracking in ReconnectionManager
Unit tests for permanent failure detection (consecutive identical errors)
Unit tests for clearing permanent failure state
Integration tests for reconnection lifecycle with permanent failure
All existing tests pass

🤖 Generated with Claude Code

Note

Medium Risk
Changes core reconnection control flow by adding stateful error-pattern tracking and an early-exit path that can stop retries; misclassification or integration bugs could cause peers to be marked failed and never reconnect until manually cleared.

Overview
Adds permanent-failure detection to remote reconnection: ReconnectionManager now tracks per-peer errorHistory (capped) and marks peers permanentlyFailed after N consecutive identical errors from a configured set, preventing further automatic reconnection.

Integrates this into reconnection-lifecycle by extracting an error code via new kernel-errors helper getNetworkErrorCode, recording it on failures, and giving up immediately when a peer becomes permanently failed (including when startReconnection now returns false). Manual reconnectPeer now clears permanent-failure state before retrying, and isRetryableNetworkError adds ENOTFOUND to the retryable set; tests are expanded/updated across kernel-errors, lifecycle, transport, and reconnection manager.

Written by Cursor Bugbot for commit 7c82f74. This will update automatically on new commits.

Add error pattern tracking to ReconnectionManager to detect when a peer is permanently unreachable. When the same error code (ECONNREFUSED, EHOSTUNREACH, ENOTFOUND, or ENETUNREACH) occurs consecutively N times (default 5), the peer is marked as permanently failed and reconnection attempts stop.

Changes:
- Add error history tracking per peer in ReconnectionManager
- Add isPermanentlyFailed() and clearPermanentFailure() methods
- Add getNetworkErrorCode() helper to extract error codes
- Integrate error recording into reconnection lifecycle
- Check permanent failure status before attempting reconnection

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add comprehensive unit tests for:
- getNetworkErrorCode helper function
- Error tracking in ReconnectionManager (recordError, getErrorHistory)
- Permanent failure detection (isPermanentlyFailed)
- Clearing permanent failure state (clearPermanentFailure)
- Custom consecutive error threshold
- Integration with startReconnection, clearPeer, and clear

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add integration tests for permanent failure detection in the reconnection lifecycle:
- Gives up when peer is permanently failed at start of loop
- Records errors after failed dial attempts
- Gives up when error triggers permanent failure
- Continues retrying when error does not trigger failure
- handleConnectionLoss skips reconnection for permanently failed peers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Update existing tests to work with permanent failure detection changes:
- Add getNetworkErrorCode export to index test
- Add getNetworkErrorCode mock to transport tests
- Update startReconnection mocks to return true
- Add isPermanentlyFailed and recordError mocks to ReconnectionManager

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

github-actions · 2026-01-29T14:28:59Z

Coverage Report

Status	Category	Percentage	Covered / Total
🔵	Lines	88.49% ⬆️ +0.08%	5941 / 6713
🔵	Statements	88.39% ⬆️ +0.09%	6039 / 6832
🔵	Functions	87.18% ⬆️ +0.06%	1537 / 1763
🔵	Branches	84.76% ⬆️ +0.17%	2164 / 2553

File Coverage

File	Stmts	Branches	Functions	Lines	Uncovered Lines
Changed Files
packages/kernel-errors/src/index.ts	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%
packages/kernel-errors/src/utils/getNetworkErrorCode.ts	100%	100%	100%	100%
packages/kernel-errors/src/utils/isRetryableNetworkError.ts	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%	100% 🟰 ±0%
packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.ts	90.69% ⬆️ +2.12%	89.74% ⬆️ +1.87%	80% 🟰 ±0%	90.69% ⬆️ +2.12%	118-122, 142-143, 267-268
packages/ocap-kernel/src/remotes/platform/reconnection.ts	98.5% ⬇️ -1.50%	96.15% ⬇️ -3.85%	100% 🟰 ±0%	98.48% ⬇️ -1.52%	308
packages/ocap-kernel/src/remotes/platform/transport.ts	86.66% ⬆️ +0.07%	81.53% 🟰 ±0%	75% 🟰 ±0%	86.66% ⬆️ +0.07%	103, 122-131, 163, 197-215, 238, 322, 384, 441, 467, 478

Generated in workflow #3477 for commit 7c82f74 by the Vitest Coverage Report Action

Fix issues identified in code review:
- Add bounds validation for consecutiveErrorThreshold (must be >= 1)
- Cap error history to prevent unbounded memory growth

The error history is now limited to the threshold size since we only need the last N errors for pattern detection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
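
The two review fixes described in this commit can be sketched as a pair of small helpers. The names `assertValidThreshold` and `pushCapped` are hypothetical, and the integer check goes slightly beyond the stated ">= 1" requirement; treat both as assumptions rather than the PR's actual code.

```typescript
/** Rejects thresholds below 1 (the integer check is an added assumption). */
function assertValidThreshold(threshold: number): number {
  if (!Number.isInteger(threshold) || threshold < 1) {
    throw new RangeError(
      `consecutiveErrorThreshold must be an integer >= 1, got ${threshold}`,
    );
  }
  return threshold;
}

/**
 * Appends `code` and keeps only the newest `cap` entries, so the history
 * never grows beyond what pattern detection needs.
 */
function pushCapped(
  history: readonly string[],
  code: string,
  cap: number,
): string[] {
  const next = [...history, code];
  return next.length > cap ? next.slice(next.length - cap) : next;
}
```

Capping at the threshold size is sufficient because the detection rule only ever inspects the last N entries.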

packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.ts

When a user explicitly calls reconnectPeer, clear the permanent failure status so the reconnection can proceed. Previously, permanently failed peers could not be manually reconnected because startReconnection would return false without attempting any connection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Export both getNetworkErrorCode and isResourceLimitError from kernel-errors
- Handle rate limit and connection limit errors before permanent failure check
- Add both mocks to transport tests

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

packages/ocap-kernel/src/remotes/platform/transport.ts

…rogress

Move isReconnecting() check before clearPermanentFailure() in reconnectPeer to prevent resetting error history during active reconnection attempts. Previously, calling reconnectPeer while reconnection was in progress would clear the accumulated error history, defeating the purpose of permanent failure detection by resetting progress toward the failure threshold.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
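
The ordering fix in this commit can be illustrated with a small sketch. The method names match those mentioned in this PR, but the `ManagerLike` type, the `reconnectPeerSketch` function, and its boolean return value are assumptions made for illustration.

```typescript
// Minimal shape of the manager used by this sketch (hypothetical type).
type ManagerLike = {
  isReconnecting(peerId: string): boolean;
  clearPermanentFailure(peerId: string): void;
  startReconnection(peerId: string): boolean;
};

/**
 * Manual reconnect: bail out first if a reconnection is already running,
 * and only then clear permanent-failure state and start a new attempt.
 * Returns whether a new reconnection was started (assumed convention).
 */
function reconnectPeerSketch(manager: ManagerLike, peerId: string): boolean {
  // Checking first means an in-flight reconnection keeps its error history.
  if (manager.isReconnecting(peerId)) {
    return false;
  }
  manager.clearPermanentFailure(peerId);
  return manager.startReconnection(peerId);
}
```

With the checks in the opposite order, every call made during an active reconnection would wipe the history and reset progress toward the threshold, which is the bug this commit describes.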

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

packages/ocap-kernel/src/remotes/platform/reconnection.ts

Include ENOTFOUND (DNS lookup failed) in isRetryableNetworkError to enable permanent failure detection for this error code. Previously, ENOTFOUND was in PERMANENT_FAILURE_ERROR_CODES but not retryable, causing immediate give-up after the first error without allowing errors to accumulate toward the permanent failure threshold.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
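
The interaction described here can be shown with a toy simulation. It is synchronous and heavily simplified (every dial fails with the same code, no backoff), and `simulateReconnect`, its parameters, and the order of the record and retryable checks are assumptions rather than the lifecycle code in this PR.

```typescript
type Outcome = 'not-retryable' | 'permanently-failed' | 'attempts-exhausted';

/**
 * Simulates a retry loop in which every dial fails with `failWith`.
 * `retryable` stands in for the set consulted by isRetryableNetworkError.
 */
function simulateReconnect(
  failWith: string,
  retryable: ReadonlySet<string>,
  threshold = 5,
  maxAttempts = 10,
): { outcome: Outcome; recorded: number } {
  let recorded = 0;
  for (let attempt = 0; attempt < maxAttempts; attempt += 1) {
    // Each simulated dial fails; the error is recorded first (assumed order).
    recorded += 1;
    if (!retryable.has(failWith)) {
      // A non-retryable code ends the loop before errors can accumulate.
      return { outcome: 'not-retryable', recorded };
    }
    if (recorded >= threshold) {
      return { outcome: 'permanently-failed', recorded };
    }
  }
  return { outcome: 'attempts-exhausted', recorded };
}
```

In this model, `ENOTFOUND` stops after one recorded error when it is missing from the retryable set, and reaches the threshold of five only once it is included.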

resetBackoff() and resetAllBackoffs() now clear error history in addition to resetting attempt counts. This prevents stale errors from accumulating and triggering false permanent failure detection.

Previously, if a peer had 4 ECONNREFUSED errors, then successfully connected, then had 1 more ECONNREFUSED error, it would be marked as permanently failed (4+1=5). Now the error history is cleared on successful communication, so only consecutive errors without intervening successes trigger permanent failure.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
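
The 4+1 scenario above can be traced with a small single-peer tracker. `ErrorStreak` is a hypothetical class written for this example. It borrows the `recordError` and `resetBackoff` names from the PR but omits per-peer state and the check against the configured code set.

```typescript
/** Toy single-peer tracker; not the ReconnectionManager from this PR. */
class ErrorStreak {
  private history: string[] = [];

  constructor(private readonly threshold = 5) {}

  /** Records a code and reports whether the threshold has been reached. */
  recordError(code: string): boolean {
    this.history.push(code);
    if (this.history.length > this.threshold) {
      this.history.shift();
    }
    const first = this.history[0];
    return (
      this.history.length === this.threshold &&
      this.history.every((entry) => entry === first)
    );
  }

  /** Called after successful communication; forgets earlier errors. */
  resetBackoff(): void {
    this.history = [];
  }
}

// Four failures, a success, then one more failure.
const streak = new ErrorStreak(5);
for (let i = 0; i < 4; i += 1) {
  streak.recordError('ECONNREFUSED');
}
streak.resetBackoff(); // the peer connected successfully
const failedAfterSuccess = streak.recordError('ECONNREFUSED'); // false
```

Because the reset empties the history, the fifth error overall is only the first in a new run, so the peer is not marked as permanently failed.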

sirtimid and others added 4 commits January 29, 2026 15:07

sirtimid marked this pull request as ready for review January 29, 2026 15:18

sirtimid requested a review from a team as a code owner January 29, 2026 15:18

cursor bot reviewed Jan 29, 2026

packages/ocap-kernel/src/remotes/platform/reconnection-lifecycle.ts (resolved)

sirtimid and others added 3 commits January 29, 2026 16:32

Merge remote-tracking branch 'origin/main' into sirtimid/permanent-failure-detection (a1de85d)

cursor bot reviewed Jan 29, 2026

packages/ocap-kernel/src/remotes/platform/transport.ts (outdated, resolved)

cursor bot reviewed Jan 29, 2026

packages/ocap-kernel/src/remotes/platform/reconnection.ts (resolved)

sirtimid and others added 2 commits January 29, 2026 23:17
