fix(core-backend): reconnection logic #6861
Conversation
```ts
}
// Use the dedicated forceReconnection method which performs a controlled
// disconnect-then-connect sequence to clean up subscription state
await this.#messenger.call('BackendWebSocketService:forceReconnection');
```
I did this because I want to fully control the reconnection via scheduleConnect
```ts
export type BackendWebSocketServiceForceReconnectionAction = {
  type: `BackendWebSocketService:forceReconnection`;
  handler: BackendWebSocketService['forceReconnection'];
};
```
I did this because I want to fully control the reconnection via scheduleConnect
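A hedged sketch of the controlled disconnect-then-connect flow described here. Only `forceReconnection` and the scheduling guard come from the diff; the class shape and callbacks are hypothetical stand-ins for the real service internals.

```typescript
// NOTE: illustrative sketch, not the actual BackendWebSocketService.
class WebSocketServiceSketch {
  #reconnectTimer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private readonly doDisconnect: () => void,
    private readonly doConnect: () => void,
  ) {}

  async forceReconnection(): Promise<void> {
    // Tear down the current connection (and its subscription state)...
    this.doDisconnect();
    // ...then route the reconnect through the normal scheduler, so the
    // "already scheduled" guard and backoff still apply.
    this.#scheduleConnect(0);
  }

  #scheduleConnect(delayMs: number): void {
    // Idempotent: defer to a reconnect that is already in flight.
    if (this.#reconnectTimer) {
      return;
    }
    this.#reconnectTimer = setTimeout(() => {
      this.#reconnectTimer = null;
      this.doConnect();
    }, delayMs);
  }
}
```

Calling `forceReconnection()` twice in quick succession disconnects twice but schedules only one connect attempt, which is the point of funneling everything through the scheduler.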
```ts
#connectionTimeout: NodeJS.Timeout | null = null;

#stableConnectionTimer: NodeJS.Timeout | null = null;
```
Even if the backend accepts a connection, we make sure that we don't re-initialise the `reconnectionAttempt` counter (used for the exponential backoff) right away. This ensures the exponential backoff is not bypassed.
```ts
// This prevents rapid loops when server accepts then immediately closes connections
if (this.#reconnectTimer) {
  return;
}
```
```ts
// If a reconnect is already scheduled, defer to it to avoid bypassing exponential backoff
```
```ts
  'AuthenticationController:getBearerToken',
);
if (!token) {
  this.#scheduleReconnect();
```
Move this into the catch block (L488).
```diff
-  #scheduleReconnect(): void {
+  #scheduleConnect(): void {
     // If a reconnect is already scheduled, don't schedule another one
     if (this.#reconnectTimer) {
```
Important: before, I was just replacing `this.#reconnectTimer`. Now I simply bail out if there is already a `reconnectTimer` in flight.
It looks like there was a breaking change to `AccountActivityService` not mentioned in the changelog, but other than that, this seems like a good set of changes.
```diff
-    // Advance time to trigger reconnection check
-    await completeAsyncOperations(50);
+    // With jitter (±25%), delay with reconnectDelay=50ms is 37.5-62.5ms
+    await completeAsyncOperations(70);
```
Tip: similar to the above, you might try mocking `Math.random` if this test starts failing intermittently.
I now mock `Math.random`, but I prefer to keep 70 because 50 was causing some test flakiness: given that this test uses `reconnectDelay: 50`, it's better to leave some room for the reconnection to actually happen.
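The ±25% jitter shape discussed in this thread can be made deterministic by injecting (or temporarily replacing) the randomness source. This is a minimal sketch of the idea, not the PR's actual test code; `jitteredDelay` is a hypothetical helper.

```typescript
// Map rand() in [0, 1) onto a multiplier in [0.75, 1.25), i.e. ±25% jitter.
function jitteredDelay(baseMs: number, rand: () => number = Math.random): number {
  return baseMs * (0.75 + rand() * 0.5);
}

// In a test, pin the randomness to the midpoint so the delay is exact:
const originalRandom = Math.random;
Math.random = () => 0.5; // midpoint => multiplier of exactly 1.0
try {
  // jitteredDelay(50) is exactly 50ms, so advancing fake timers by 50ms suffices
  console.log(jitteredDelay(50));
} finally {
  Math.random = originalRandom; // always restore the global
}
```

With `reconnectDelay: 50`, the unmocked delay lands anywhere in 37.5–62.5ms, which explains why advancing timers by only 50ms was flaky.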
```ts
// Increment attempts BEFORE calculating delay so backoff grows properly
this.#reconnectAttempts += 1;

// Calculate delay with exponential backoff
```
Nit: probably not important now, but we might want to reuse the same exponential backoff formula we use for network requests. The Cockatiel library provides a way to generate a series of backoffs; this is the formula itself: https://github.com/connor4312/cockatiel/blob/f87137b9bb55e0cdfb02e7d5c401e47c6f5fcc9f/src/backoff/ExponentialBackoff.ts
Thanks for this. I replaced it with the Cockatiel library and it works well.
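For reference, the basic capped-doubling shape underlying this is sketched below with the delays from the PR description (initial 1000ms, max 30000ms). Cockatiel's `ExponentialBackoff` layers decorrelated jitter on top of this shape; the plain function here is an assumption-level illustration, not Cockatiel's implementation.

```typescript
// Plain capped exponential backoff: delay doubles per attempt up to a ceiling.
function backoffDelay(attempt: number, initialMs = 1000, maxMs = 30_000): number {
  return Math.min(maxMs, initialMs * 2 ** attempt);
}

// attempts 0..5 => 1000, 2000, 4000, 8000, 16000, 30000 (32000 capped at 30000)
```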
## Explanation

This PR releases **@metamask/core-backend@3.0.0** with significant improvements to WebSocket reconnection reliability and robustness for real-time balance updates.

### Current State & Problems Addressed

The previous WebSocket implementation had several reliability issues:

1. **Duplicate reconnection timers** could be created when multiple events triggered reconnects simultaneously, leading to memory leaks and inflated reconnect attempt counters
2. **Rapid reconnection loops** could occur when the server accepted connections but immediately closed them
3. **Race conditions** in reconnection scheduling could bypass exponential backoff, causing aggressive retry behavior
4. **Manual subscription cleanup** was fragile when recovering from subscription issues or orphaned subscriptions

### Solution

This release introduces robust reconnection logic with the following improvements:

#### 1. Idempotent Reconnection Scheduling

- Prevents duplicate reconnection timers from being created
- Checks if a reconnect is already scheduled before creating a new timer
- Eliminates memory leaks from orphaned timers
- Prevents unnecessary growth of reconnect attempt counters

#### 2. Stable Connection Timer

- Connection must remain stable for 10 seconds before resetting reconnect attempts
- Prevents rapid reconnection loops when the server accepts then immediately closes connections
- Provides a grace period for connection stability before declaring success

#### 3. Force Reconnection Method

- New `forceReconnection()` public API for controlled subscription state cleanup
- Performs a controlled disconnect-then-reconnect sequence with exponential backoff
- Useful for recovering from subscription/unsubscription issues and cleaning up orphaned subscriptions
- Exposed via the `BackendWebSocketService:forceReconnection` messenger action

#### 4. Improved Error Handling

- Always schedules a reconnect on connection failure (exponential backoff prevents aggressive retries)
- Removes redundant schedule calls from error paths
- Better logging throughout for easier debugging

#### 5. Better State Management

- `disconnect()` now properly resets the reconnect attempts counter
- Fixed a race condition in `connect()` that could bypass exponential backoff
- More predictable state transitions

### Technical Details

**Reconnection Behavior:**

- Uses Cockatiel's exponential backoff with decorrelated jitter to prevent thundering herd
- Initial delay: 1000ms, max delay: 30000ms
- Backoff sequence resets only after 10 seconds of stable connection
- Automatic reconnect on unexpected disconnects; stays disconnected on manual disconnects

**Breaking Changes:**

- `AccountActivityService` messenger allowed actions updated: removed `BackendWebSocketService:disconnect`, added `BackendWebSocketService:forceReconnection`

### Impact

This improves the reliability of real-time balance updates across MetaMask Extension and Mobile by:

- ✅ Preventing connection thrashing and resource exhaustion
- ✅ Providing graceful recovery from network instability
- ✅ Eliminating memory leaks from reconnection logic
- ✅ Offering controlled subscription cleanup via `forceReconnection()`
- ✅ Enabling better monitoring with improved logging

## References

- Related to the broader WebSocket balance update initiative for MetaMask Mobile
- Fixes #6861 - WebSocket reconnection improvements

## Checklist

- [x] I've updated the test suite for new or updated code as appropriate
- [x] I've updated documentation (JSDoc, Markdown, etc.) for new or updated code as appropriate
- [x] I've communicated my changes to consumers by updating changelogs for packages I've changed, highlighting breaking changes as necessary
- [ ] I've prepared draft pull requests for clients and consumer packages to resolve any breaking changes (MetaMask Extension and Mobile updated)

---

> [!NOTE]
> Releases `@metamask/core-backend@3.0.0` with robust WebSocket reconnection and updates dependent packages/peer deps (`assets-controllers@83`, `bridge-controller@55`, `bridge-status-controller@55`); bumps monorepo version to 638.0.0.
>
> - **Core Backend**
>   - Release `@metamask/core-backend@3.0.0` with improved, idempotent WS reconnection, stable-connection timer, `forceReconnection()` API, logging and error-handling tweaks.
>   - BREAKING: `AccountActivityService` messenger actions updated (remove `disconnect`, add `BackendWebSocketService:forceReconnection`).
> - **Assets**
>   - `@metamask/assets-controllers@83.0.0`: BREAKING peer dep bump to `@metamask/core-backend@^3.0.0`.
> - **Bridge**
>   - `@metamask/bridge-controller@55.0.0`: BREAKING peer/dev dep bump to `@metamask/assets-controllers@^83.0.0`.
>   - `@metamask/bridge-status-controller@55.0.0`: BREAKING peer/dev dep bump to `@metamask/bridge-controller@^55.0.0`.
> - **Monorepo**
>   - Bump root version to `638.0.0`; update changelogs and package versions accordingly.
>
> Written by Cursor Bugbot for commit 54996e2.
## Explanation

### Current State

The WebSocket service experienced a critical issue when the backend server accepted connections but immediately disconnected afterward. I was only able to reproduce this behavior locally, not against the production endpoint.

### The Problem

When the server accepts the WebSocket connection but then disconnects rapidly:

- The `scheduleConnect` method could be called multiple times during rapid disconnect cycles

### The Solution

This PR fixes the issue with three key improvements:
#### 1. Centralize Reconnection via `scheduleConnect`

All reconnection logic now flows through a single entry point: `scheduleConnect`.

#### 2. Idempotent `scheduleConnect`

Makes `scheduleConnect` idempotent, so multiple calls don't create duplicate or overlapping connection attempts.

#### 3. Jitter on Reconnection

Adds randomized jitter to reconnection timing to prevent thundering herd problems.

### Combined Effect

With all three fixes in place, a server that accepts connections but disconnects immediately no longer triggers a reconnection storm.
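The three improvements above can be sketched as a single scheduling entry point. This is illustrative only: the names follow the PR description, but the internals are assumptions, not the actual `BackendWebSocketService` code.

```typescript
class ReconnectScheduler {
  #timer: ReturnType<typeof setTimeout> | null = null;
  #attempts = 0;

  constructor(
    private readonly connect: () => void,
    private readonly initialMs = 1000,
    private readonly maxMs = 30_000,
    private readonly rand: () => number = Math.random,
  ) {}

  scheduleConnect(): void {
    // Fix 2: idempotent — a reconnect already in flight wins over new requests.
    if (this.#timer) {
      return;
    }
    this.#attempts += 1;
    // Capped exponential backoff, with ±25% jitter (fix 3).
    const base = Math.min(this.maxMs, this.initialMs * 2 ** (this.#attempts - 1));
    const delay = base * (0.75 + this.rand() * 0.5);
    this.#timer = setTimeout(() => {
      this.#timer = null;
      this.connect(); // Fix 1: every reconnect flows through this single path
    }, delay);
  }
}
```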
### Technical Details

The fix centralizes all reconnection logic through `scheduleConnect`, makes it idempotent by guarding against duplicate timer creation, and adds jitter to the reconnection delay calculation to randomize timing across clients. This prevents both the individual reconnection storm (from rapid connect/disconnect cycles) and the collective thundering herd problem (from synchronized reconnections), while ensuring consistent behavior across the entire service.
### References

- `BackendWebSocketService` in `@metamask/core-backend`

### Checklist
> [!NOTE]
> Introduces a controlled `forceReconnection()` flow, idempotent/jittered backoff scheduling, and a stability timer to prevent reconnect thrashing; updates `AccountActivityService` to use it and expands tests.
>
> - Adds a `forceReconnection()` messenger action and method for controlled disconnect-then-reconnect.
> - Uses `ExponentialBackoff` with jitter; `connect()` no-ops if a reconnect is scheduled; a stable-connection timer (10s) resets attempts/backoff.
> - `disconnect()` is now synchronous and resets attempts; improved error handling and timer cleanup; avoids duplicate timers and races.
> - `AccountActivityService` now uses `BackendWebSocketService:forceReconnection` for recovery, removing the direct `disconnect`/`connect` sequence; method action type unions updated.