Skip to content

Conversation

@Kriys94
Copy link
Contributor

@Kriys94 Kriys94 commented Oct 16, 2025

Explanation

Current State

The WebSocket service experienced a critical issue when the backend server accepted connections but immediately disconnected afterward. I was only able to reproduce this behavior locally, not in prod endpoint.

The Problem

When the server accepts the WebSocket connection but then disconnects rapidly:

  • The client would immediately attempt to reconnect (fast reconnection)
  • This created a connection thrashing loop: connect → disconnect → reconnect → disconnect
  • Multiple code paths could trigger reconnection attempts independently
  • The scheduleConnect method could be called multiple times during rapid disconnect cycles
  • Multiple overlapping reconnection attempts could be scheduled simultaneously
  • All disconnected clients would reconnect at exactly the same time, creating a "thundering herd" problem
  • This led to unpredictable behavior, duplicate connection attempts, and worsened the reconnection storm

The Solution

This PR fixes the issue with three key improvements:

1. Centralize Reconnection via scheduleConnect

All reconnection logic now flows through a single entry point:

  • All code paths that need to trigger reconnection now use scheduleConnect
  • Eliminates scattered reconnection logic across the codebase
  • Provides a single point of control for reconnection behavior
  • Makes the reconnection strategy consistent and predictable
  • Easier to maintain, test, and debug reconnection logic

2. Idempotent scheduleConnect

Makes scheduleConnect idempotent to ensure multiple calls don't create duplicate or overlapping connection attempts:

  • Checks if a reconnection is already scheduled before creating a new one
  • Ensures only one reconnection timer is active at any given time
  • Eliminates race conditions from concurrent connection attempts
  • Maintains clear, single-threaded reconnection state

3. Jitter on Reconnection

Adds randomized jitter to reconnection timing to prevent thundering herd problems:

  • Randomizes the reconnection delay to spread out reconnection attempts across clients
  • When the server disconnects many clients simultaneously, they won't all reconnect at the exact same moment
  • Reduces load spikes on the server during recovery from incidents
  • Improves overall system stability and reliability

Combined Effect:
With all three fixes, when the server accepts connections but disconnects immediately:

  • All reconnection attempts flow through a single, centralized method (consistency)
  • Only one reconnection attempt will be scheduled per client (idempotency)
  • Reconnection attempts are staggered across the client population (jitter)
  • The reconnection logic follows proper backoff strategy with randomization
  • No duplicate timers or overlapping connection attempts
  • Predictable, controlled reconnection behavior that respects server health
  • Server experiences gradual reconnection load rather than synchronized spikes

Technical Details

The fix centralizes all reconnection logic through scheduleConnect, makes it idempotent by guarding against duplicate timer creation, and adds jitter to the reconnection delay calculation to randomize timing across clients. This prevents both the individual reconnection storm (from rapid connect/disconnect cycles) and the collective thundering herd problem (from synchronized reconnections), while ensuring consistent behavior across the entire service.

References

  • Related to BackendWebSocketService in @metamask/core-backend
  • Fixes # [ADD ISSUE NUMBER HERE]

Checklist

  • I've updated the test suite for new or updated code as appropriate
  • I've updated documentation (JSDoc, Markdown, etc.) for new or updated code as appropriate
  • I've communicated my changes to consumers by updating changelogs for packages I've changed, highlighting breaking changes as necessary
  • I've prepared draft pull requests for clients and consumer packages to resolve any breaking changes

Note

Introduces a controlled forceReconnection() flow, idempotent/jittered backoff scheduling, and stability timer to prevent reconnect thrashing; updates AccountActivityService to use it and expands tests.

  • BackendWebSocketService
    • New API: forceReconnection() messenger action and method for controlled disconnect-then-reconnect.
    • Reconnection: Idempotent scheduler using ExponentialBackoff with jitter; connect() no-ops if a reconnect is scheduled; stable-connection timer (10s) resets attempts/backoff.
    • Behavior changes: disconnect() now synchronous and resets attempts; improved error handling and timer cleanup; avoids duplicate timers and races.
    • Actions/Types: Expose forceReconnection; update method action types union.
  • AccountActivityService
    • Switch to BackendWebSocketService:forceReconnection for recovery; remove direct disconnect/connect sequence.
    • Update allowed messenger actions accordingly.
  • Tests
    • Add/adjust tests for new reconnection semantics, stable timer, idempotent scheduling, and force reconnection flows; minor test cleanup tweaks.
  • Changelog
    • Document new API, reconnection improvements, fixes for race conditions/memory leaks, and breaking messenger action change for AccountActivityService.

Written by Cursor Bugbot for commit 373a4f6. This will update automatically on new commits. Configure here.

@Kriys94 Kriys94 force-pushed the fix/WebsocketReconnection branch 5 times, most recently from d26e7f6 to f128e8b Compare October 20, 2025 09:09
@Kriys94 Kriys94 marked this pull request as ready for review October 20, 2025 09:14
@Kriys94 Kriys94 requested review from a team as code owners October 20, 2025 09:14
salimtb
salimtb previously approved these changes Oct 21, 2025
}
// Use the dedicated forceReconnection method which performs a controlled
// disconnect-then-connect sequence to clean up subscription state
await this.#messenger.call('BackendWebSocketService:forceReconnection');
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this because I want to fully control the reconnection via scheduleConnect

export type BackendWebSocketServiceForceReconnectionAction = {
type: `BackendWebSocketService:forceReconnection`;
handler: BackendWebSocketService['forceReconnection'];
};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this because I want to fully control the reconnection via scheduleConnect


#connectionTimeout: NodeJS.Timeout | null = null;

#stableConnectionTimer: NodeJS.Timeout | null = null;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if the backend accepts a connection, we make sure that we dont re-initialise the the reconnectionAttempt (used for the exponential backoff). This makes sure that the exponential backoff is not bypassed.

// This prevents rapid loops when server accepts then immediately closes connections
if (this.#reconnectTimer) {
return;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// If a reconnect is already scheduled, defer to it to avoid bypassing exponential backoff

'AuthenticationController:getBearerToken',
);
if (!token) {
this.#scheduleReconnect();
Copy link
Contributor Author

@Kriys94 Kriys94 Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this into the catch block L488

#scheduleReconnect(): void {
#scheduleConnect(): void {
// If a reconnect is already scheduled, don't schedule another one
if (this.#reconnectTimer) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important, before I was just replacing this.#reconnectTimer. Now I just bypass if there is already an reconnectTimer in flight

Copy link
Contributor

@mcmire mcmire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there was a breaking change to AccountActivityService not mentioned in the changelog, but other than that, seems like a good set of changes.

// Advance time to trigger reconnection check
await completeAsyncOperations(50);
// With jitter (±25%), delay with reconnectDelay=50ms is 37.5-62.5ms
await completeAsyncOperations(70);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tip: Similar as above, you might try mocking Math.random if this test starts failing intermittently.

Copy link
Contributor Author

@Kriys94 Kriys94 Oct 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I mock Math.random, but I prefer to keep 70 because 50 is causing some test flakiness, given that in this test, reconnectDelay: 50, it's better to have some room to make the reconnection really happen.

// Increment attempts BEFORE calculating delay so backoff grows properly
this.#reconnectAttempts += 1;

// Calculate delay with exponential backoff
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Probably not important now, but if we like we might want to reuse the same exponential backoff formula we are using for network requests. The Cockatiel library provides a way to generate a series of backoffs here: https://github.com/connor4312/cockatiel/blob/f87137b9bb55e0cdfb02e7d5c401e47c6f5fcc9f/src/backoff/ExponentialBackoff.ts (this is the formula itself).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THanks for this. I replaced with the Cockatiel library and it works well

@Kriys94 Kriys94 force-pushed the fix/WebsocketReconnection branch 2 times, most recently from 0c2f17b to d1bfe99 Compare October 22, 2025 12:40
@Kriys94 Kriys94 force-pushed the fix/WebsocketReconnection branch from d1bfe99 to d602e67 Compare October 22, 2025 12:58
@Kriys94 Kriys94 merged commit 8974527 into main Oct 22, 2025
255 checks passed
@Kriys94 Kriys94 deleted the fix/WebsocketReconnection branch October 22, 2025 15:23
@Kriys94 Kriys94 mentioned this pull request Oct 23, 2025
4 tasks
Kriys94 added a commit that referenced this pull request Oct 23, 2025
## Explanation

This PR releases **@metamask/core-backend@3.0.0** with significant
improvements to WebSocket reconnection reliability and robustness for
real-time balance updates.

### Current State & Problems Addressed

The previous WebSocket implementation had several reliability issues:

1. **Duplicate reconnection timers** could be created when multiple
events triggered reconnects simultaneously, leading to memory leaks and
inflated reconnect attempt counters
2. **Rapid reconnection loops** could occur when the server accepted
connections but immediately closed them
3. **Race conditions** in reconnection scheduling could bypass
exponential backoff, causing aggressive retry behavior
4. **Manual subscription cleanup** was fragile when recovering from
subscription issues or orphaned subscriptions

### Solution

This release introduces robust reconnection logic with the following
improvements:

#### 1. **Idempotent Reconnection Scheduling**
- Prevents duplicate reconnection timers from being created
- Checks if a reconnect is already scheduled before creating a new timer
- Eliminates memory leaks from orphaned timers
- Prevents unnecessary growth of reconnect attempt counters

#### 2. **Stable Connection Timer**
- Connection must remain stable for 10 seconds before resetting
reconnect attempts
- Prevents rapid reconnection loops when server accepts then immediately
closes connections
- Provides grace period for connection stability before declaring
success

#### 3. **Force Reconnection Method**
- New `forceReconnection()` public API for controlled subscription state
cleanup
- Performs a controlled disconnect-then-reconnect sequence with
exponential backoff
- Useful for recovering from subscription/unsubscription issues and
cleaning up orphaned subscriptions
- Exposed via `BackendWebSocketService:forceReconnection` messenger
action

#### 4. **Improved Error Handling**
- Always schedules reconnect on connection failure (exponential backoff
prevents aggressive retries)
- Removes redundant schedule calls from error paths
- Better logging throughout for easier debugging

#### 5. **Better State Management**
- `disconnect()` now properly resets reconnect attempts counter
- Fixed race condition in `connect()` that could bypass exponential
backoff
- More predictable state transitions

### Technical Details

**Reconnection Behavior:**
- Uses Cockatiel's exponential backoff with decorrelated jitter to
prevent thundering herd
- Initial delay: 1000ms, max delay: 30000ms
- Backoff sequence resets only after 10 seconds of stable connection
- Automatic reconnect on unexpected disconnects, stays disconnected on
manual disconnects

**Breaking Changes:**
- `AccountActivityService` messenger allowed actions updated: removed
`BackendWebSocketService:disconnect`, added
`BackendWebSocketService:forceReconnection`

### Impact

This improves the reliability of real-time balance updates across
MetaMask Extension and Mobile by:
- ✅ Preventing connection thrashing and resource exhaustion
- ✅ Providing graceful recovery from network instability  
- ✅ Eliminating memory leaks from reconnection logic
- ✅ Offering controlled subscription cleanup via `forceReconnection()`
- ✅ Enabling better monitoring with improved logging

## References

- Related to the broader WebSocket balance update initiative for
MetaMask Mobile
- Fixes #6861 - WebSocket reconnection improvements

## Checklist

- [x] I've updated the test suite for new or updated code as appropriate
- [x] I've updated documentation (JSDoc, Markdown, etc.) for new or
updated code as appropriate
- [x] I've communicated my changes to consumers by updating changelogs
for packages I've changed, highlighting breaking changes as necessary
- [ ] I've prepared draft pull requests for clients and consumer
packages to resolve any breaking changes (MetaMask Extension and Mobile
updated)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Releases `@metamask/core-backend@3.0.0` with robust WebSocket
reconnection and updates dependent packages/peer deps
(`assets-controllers@83`, `bridge-controller@55`,
`bridge-status-controller@55`); bumps monorepo version to 638.0.0.
> 
> - **Core Backend**
> - Release `@metamask/core-backend@3.0.0` with improved, idempotent WS
reconnection, stable-connection timer, `forceReconnection()` API,
logging and error‑handling tweaks.
> - BREAKING: `AccountActivityService` messenger actions updated (remove
`disconnect`, add `BackendWebSocketService:forceReconnection`).
> - **Assets**
> - `@metamask/assets-controllers@83.0.0`: BREAKING peer dep bump to
`@metamask/core-backend@^3.0.0`.
> - **Bridge**
> - `@metamask/bridge-controller@55.0.0`: BREAKING peer/dev dep bump to
`@metamask/assets-controllers@^83.0.0`.
> - `@metamask/bridge-status-controller@55.0.0`: BREAKING peer/dev dep
bump to `@metamask/bridge-controller@^55.0.0`.
> - **Monorepo**
> - Bump root version to `638.0.0`; update changelogs and package
versions accordingly.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
54996e2. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
Gudahtt added a commit that referenced this pull request Oct 24, 2025
…r/multichain-transactions-controller

* origin/main: (35 commits)
  feat: `JsonRpcEngineV2` (#6176)
  Release 641.0.0 (#6940)
  feat: Add transaction emulation actions (#6935)
  Release/640.0.0 (#6934)
  fix(core-backend): control randomness to fix flaky test (#6936)
  chore: Add `@metamask-previews/*` to NPM age gate exceptions (#6937)
  Release/639.0.0 (#6931)
  feat: make getCryptoApproveTransactionParams synchronous (#6930)
  feat: add new actions to `KeyringController` (#6928)
  feat: add `getAccounts` to `AccountsController` (#6927)
  chore: remove `Monad Mainnet` single call balance contract and add into account v4 (#6929)
  Release/638.0.0 (#6923)
  fix: Downgrade `multiformats` to `^9.9.0` to avoid ESM-only dependency (#6920)
  Release/637.0.0 (#6919)
  feat(account-tree-controller): add callbacks for hidden and pinned data (#6910)
  Release 636.0.0 (#6918)
  fix(core-backend): reconnection logic (#6861)
  fix:  Tx state listener and signature coverage (#6906)
  Release/635.0.0 (#6917)
  fix(base-controller): add TypeScript declaration file for legacy module resolution (#6915)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants