Skip to content

fix: replace fixed 15s WebSocket reconnect with exponential backoff + jitter#5617

Merged
kodjima33 merged 2 commits intomainfrom
fix/issue-5527
Mar 14, 2026
Merged

fix: replace fixed 15s WebSocket reconnect with exponential backoff + jitter#5617
kodjima33 merged 2 commits intomainfrom
fix/issue-5527

Conversation

@kodjima33
Copy link
Collaborator

Summary

Replaces the fixed 15-second WebSocket reconnection timer with exponential backoff and full jitter to prevent synchronized reconnect storms.

Changes

  • Added _reconnectAttempt counter and _maxBackoffSeconds = 120
  • Added _getReconnectDelay() — exponential backoff (1s→2s→4s→...→120s cap) with full jitter
  • Replaced Timer.periodic(Duration(seconds: 15)) with single-shot Timer using computed delay
  • Reset _reconnectAttempt = 0 on successful connection in onConnected()

Note: The diff includes some auto-formatting changes alongside the substantive logic changes.

Fixes #5527

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 14, 2026

Greptile Summary

This PR replaces the fixed 15-second Timer.periodic WebSocket reconnection in CaptureProvider with a single-shot Timer using exponential backoff (1s→2s→4s→...→120s cap) and full jitter to prevent synchronized reconnect storms across clients.

  • Adds _reconnectAttempt counter and _maxBackoffSeconds = 120 cap
  • Adds _getReconnectDelay() implementing the AWS-style "full jitter" pattern
  • Replaces Timer.periodic with one-shot Timer using computed delay
  • Resets _reconnectAttempt to 0 on successful connection in onConnected()
  • Removes the now-unnecessary _keepAliveLastExecutedAt rate-limiting field
  • Bug: pow(2, _reconnectAttempt).toInt() is called before min() caps the value — this will crash with UnsupportedError when _reconnectAttempt >= 1024 (reachable after ~34 hours of persistent disconnection)
  • The bulk of the diff is auto-formatting changes (indentation normalization) with no behavioral impact

Confidence Score: 3/5

  • The PR improves reconnection behavior but has a latent crash bug in the backoff calculation that manifests after prolonged disconnection.
  • The core idea is sound (exponential backoff with jitter), and the reset logic is correct. However, the operator ordering bug in _getReconnectDelay() will cause a runtime crash after ~34 hours of disconnection, which is a realistic scenario for a mobile/desktop app. The fix is a one-line change. The rest of the diff is safe auto-formatting.
  • app/lib/providers/capture_provider.dart — the _getReconnectDelay() method needs the .toInt() moved after min() to prevent overflow crash.

Important Files Changed

Filename Overview
app/lib/providers/capture_provider.dart Replaces fixed 15s periodic WebSocket reconnect with exponential backoff + jitter. The backoff logic has a potential crash: pow(2, n).toInt() is called before the min() cap, which will throw UnsupportedError when _reconnectAttempt >= 1024 (~34h of disconnection). Most other changes are auto-formatting only.

Sequence Diagram

sequenceDiagram
    participant CP as CaptureProvider
    participant T as Timer (single-shot)
    participant WS as WebSocket

    Note over CP: WebSocket disconnects
    CP->>CP: onClosed() / onError()
    CP->>CP: _startKeepAliveServices()
    CP->>CP: _getReconnectDelay()<br/>(2^attempt capped at 120s, with jitter)
    CP->>CP: _reconnectAttempt++
    CP->>T: Schedule Timer(delay)
    T-->>CP: Timer fires
    CP->>WS: _initiateWebsocket()

    alt Socket connects successfully
        WS-->>CP: onConnected()
        CP->>CP: _reconnectAttempt = 0
    else Socket creation fails (null)
        CP->>CP: _startKeepAliveServices()<br/>(reschedule with increased backoff)
    else Socket connects then drops
        WS-->>CP: onClosed() / onError()
        CP->>CP: _startKeepAliveServices()<br/>(reschedule with increased backoff)
    end
Loading

Last reviewed commit: a67c19b

}

Duration _getReconnectDelay() {
final baseDelay = min(pow(2, _reconnectAttempt).toInt(), _maxBackoffSeconds);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toInt() before min() will crash on high attempt counts

.toInt() is called on pow(2, _reconnectAttempt) before min clamps the value. Since pow returns a double, once _reconnectAttempt reaches 1024, pow(2, 1024) returns double.infinity, and double.infinity.toInt() throws an UnsupportedError. With the backoff cap at 120s and full jitter, attempt 1024 can be reached in ~34 hours of persistent disconnection.

Move .toInt() after the min call so the cap is applied first:

Suggested change
final baseDelay = min(pow(2, _reconnectAttempt).toInt(), _maxBackoffSeconds);
final baseDelay = min(pow(2, _reconnectAttempt), _maxBackoffSeconds).toInt();

Implements exponential backoff (1s->2s->4s...->120s cap) with full jitter for
WebSocket reconnection, preventing synchronized reconnect storms.

Fixes #5527
@kodjima33 kodjima33 merged commit fa95dcf into main Mar 14, 2026
1 check passed
@kodjima33 kodjima33 deleted the fix/issue-5527 branch March 14, 2026 17:37
@beastoin
Copy link
Collaborator

beastoin commented Mar 15, 2026

@kodjima33 The switch from periodic to one-shot timer means early returns (user not signed in, device not ready) silently abandon the retry chain with no follow-up timer scheduled, which can cause a permanent WebSocket disconnection during active recording until manual intervention. Could you revert this and add tests covering the early-return paths to ensure the retry always re-schedules on transient failures before re-merging?

Let us know if you need a hand with testing.


by AI for @beastoin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replace fixed 15s WebSocket reconnect with exponential backoff + jitter

2 participants