Correctly update currentStrategy after reconnectDeadlineMillis has reached #1668

Merged
PratimMallick merged 1 commit into fix/cleanup-old-rtc-session-on-migration from
fix/correctly-update-currentstrategy
May 1, 2026

Conversation

@rahul-lohra
Contributor

@rahul-lohra rahul-lohra commented Apr 30, 2026

Goal

The call.reconnect(WebsocketReconnectStrategy, reason) logic continuously monitors network availability. However, once connectivity is restored, it does not re-evaluate reconnectDeadlineMillis, so it can proceed with the wrong reconnection strategy: the strategy originally passed in (e.g., fastReconnect) is used even after the deadline has passed, when fastRejoin should be used instead.

Implementation

Update the reconnection flow to recompute the effective strategy after network restoration, so that reconnectDeadlineMillis is respected before call.reconnect(...) proceeds with reconnection.
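
A minimal sketch of the recomputation, for illustration only. It assumes that reconnectDeadlineMillis is a duration measured from the start of the reconnect loop and that only FAST and UNSPECIFIED are promoted to REJOIN (as the automated review of this PR describes). The enum and the function effectiveStrategy are hypothetical stand-ins, not the SDK's actual types.

```kotlin
// Hypothetical mirror of the SDK's strategy enum; only the value names come from this PR.
enum class ReconnectStrategy { UNSPECIFIED, DISCONNECT, FAST, REJOIN, MIGRATE }

/**
 * Returns the strategy to use for the next attempt.
 * FAST and UNSPECIFIED are promoted to REJOIN once the deadline has elapsed;
 * every other strategy is returned unchanged.
 */
fun effectiveStrategy(
    current: ReconnectStrategy,
    elapsedMillis: Long,
    reconnectDeadlineMillis: Long,
): ReconnectStrategy {
    val deadlineReached = elapsedMillis >= reconnectDeadlineMillis
    val canEscalate =
        current == ReconnectStrategy.FAST || current == ReconnectStrategy.UNSPECIFIED
    return if (deadlineReached && canEscalate) ReconnectStrategy.REJOIN else current
}
```

With an assumed 10-second deadline, effectiveStrategy(ReconnectStrategy.FAST, 8_000L, 10_000L) stays FAST, while the same call with 15_000L elapsed yields REJOIN. The point of the fix is that this evaluation has to happen again after the network comes back, not only when reconnect() is first entered.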

🎨 UI Changes

None

Testing

  1. Join a call with two participants
  2. Disable internet for < 10 seconds → expect fastReconnect
  3. Disable internet for ~15 seconds → expect fastRejoin
  4. Verify both participants reconnect successfully and media streams are restored

Summary by CodeRabbit

  • Bug Fixes
    • Improved call reconnection handling by adapting reconnection strategy based on elapsed time, resulting in better management of timeout scenarios during network interruptions.

@rahul-lohra rahul-lohra self-assigned this Apr 30, 2026
@rahul-lohra rahul-lohra requested a review from a team as a code owner April 30, 2026 23:31
@rahul-lohra rahul-lohra added the pr:internal Internal or infra-only changes label Apr 30, 2026
@github-actions
Contributor

github-actions Bot commented Apr 30, 2026

PR checklist ✅

All required conditions are satisfied:

  • Title length is OK (or ignored by label).
  • At least one pr: label exists.
  • Sections ### Goal, ### Implementation, and ### Testing are filled.

🎉 Great job! This PR is ready for review.

@coderabbitai

coderabbitai Bot commented Apr 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ac5db85e-990c-40ed-9454-baaf9e9e45f4

📥 Commits

Reviewing files that changed from the base of the PR and between a12bdd9 and ccc5a53.

📒 Files selected for processing (1)
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt

Walkthrough

The reconnect function in the Call class now dynamically escalates reconnection strategies based on elapsed time. It maintains a mutable currentStrategy that can be promoted from FAST or UNSPECIFIED to REJOIN when approaching the reconnectDeadlineMillis threshold, with added debug logging to track strategy transitions.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Reconnection Strategy Escalation: stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt | Enhanced reconnect function to maintain mutable currentStrategy across loop iterations, escalating from FAST/UNSPECIFIED to REJOIN based on elapsed time vs deadline, with debug logging reporting original and escalated strategies. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hoppity hop, the reconnect's grown wise,
Watching the clock as the deadline draws nigh,
FAST becomes REJOIN when time starts to slip,
Strategy escalates on this reconnection trip!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change, updating currentStrategy based on reconnectDeadlineMillis, which is the core fix in the changeset. |
| Description check | ✅ Passed | The description includes Goal and Implementation sections with clear context, and a Testing section with specific scenarios; however, the UI Changes section is marked 'None' and the contributor/reviewer checklists are empty. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



@github-actions
Contributor

SDK Size Comparison 📏

| SDK | Before | After | Difference | Status |
|---|---|---|---|---|
| stream-video-android-core | 12.02 MB | 12.04 MB | 0.02 MB | 🟢 |
| stream-video-android-ui-xml | 5.68 MB | 5.68 MB | 0.00 MB | 🟢 |
| stream-video-android-ui-compose | 6.28 MB | 6.28 MB | 0.00 MB | 🟢 |

@PratimMallick PratimMallick merged commit 11e4306 into fix/cleanup-old-rtc-session-on-migration May 1, 2026
10 of 13 checks passed
@PratimMallick PratimMallick deleted the fix/correctly-update-currentstrategy branch May 1, 2026 06:07
aleksandar-apostolov added a commit that referenced this pull request May 1, 2026
* refactor(core): add retryable signaling decorator, retry policies, and tracer improvements

Replace SignalLostSignalingServiceDecorator with RetryableSignalingServiceDecorator
that uses configurable retry policies (exponential backoff) and separates terminal
SFU errors from transient network failures. Refactor SignalingServiceTracerDecorator
to use a generic traced() helper for consistent request/response/exception tracing.
Clean up Publisher/Subscriber error handling and fix sourceSets configuration.
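
As a rough illustration of the exponential backoff policy mentioned above, here is a sketch of a capped delay calculation. The function name, base delay, and cap are assumptions chosen for the example; the commit does not state the values used by RetryableSignalingServiceDecorator.

```kotlin
/**
 * Delay before retry number [attempt] (0-based): base * 2^attempt, capped at [maxMillis].
 * Hypothetical helper; the defaults are illustrative, not the SDK's values.
 */
fun backoffDelayMillis(
    attempt: Int,
    baseMillis: Long = 250L,
    maxMillis: Long = 5_000L,
): Long {
    require(attempt >= 0) { "attempt must be non-negative" }
    // Cap the shift so that the multiplier stays small enough to avoid overflow
    // for base delays in the millisecond-to-second range.
    val factor = 1L shl minOf(attempt, 20)
    return minOf(baseMillis * factor, maxMillis)
}
```

With these defaults the delays run 250, 500, 1000, 2000, 4000 ms and then stay at the 5000 ms cap.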

Made-with: Cursor

* fix(core): phased migration to prevent old-session race condition

Replace immediate teardown of the old RtcSession during migration with a
phased approach: enterMigration() keeps media flowing while listening for
ParticipantMigrationCompleteEvent from the old SFU; finalizeMigration()
tears down only after the handoff is confirmed (or timed out at 7s).

Refactor prepareRejoin(reason) to send final stats, leave gracefully, and
cleanup via the unified cleanup() path. Move mediaScope.cancel() from
leaveWithReason() into cleanup() so it is always invoked regardless of
the teardown path (rejoin, migrate, or explicit leave).

Made-with: Cursor

* refactor(core): add unified reconnect loop with strategy escalation

Consolidate fastReconnect, rejoin, and migrate into a single
Call.reconnect(strategy, reason) entry point with a retry loop that
mirrors the JS SDK:

- FAST reconnect is attempted up to MAX_FAST_RECONNECT_ATTEMPTS (3),
  then escalates to REJOIN.
- MIGRATE failures also escalate to REJOIN.
- A global reconnectMutex replaces the old schedule()/SingleFlight
  coalescing to provide mutual exclusion across all strategies.
- MAX_RECONNECT_ATTEMPTS (10) and leaveAfterDisconnectSeconds act
  as circuit breakers; exhaustion sets ReconnectingFailed state.
- Public fastReconnect(), rejoin(), migrate() wrappers are preserved
  for backward compatibility.
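
The escalation rules listed above can be sketched as a small, side-effect-free loop. This is a simplification under stated assumptions: every attempt counts toward MAX_RECONNECT_ATTEMPTS, the leaveAfterDisconnectSeconds timer and the mutex are left out, and LoopStrategy, runReconnectLoop, and the attempt callback are hypothetical names.

```kotlin
// Hypothetical, reduced strategy set for the sketch.
enum class LoopStrategy { FAST, REJOIN, MIGRATE }

const val MAX_FAST_RECONNECT_ATTEMPTS = 3
const val MAX_RECONNECT_ATTEMPTS = 10

/**
 * Runs attempts until one succeeds or the attempt budget is exhausted.
 * Returns the strategies tried, in order, and whether the loop succeeded.
 */
fun runReconnectLoop(
    initial: LoopStrategy,
    attempt: (LoopStrategy) -> Boolean,
): Pair<List<LoopStrategy>, Boolean> {
    val tried = mutableListOf<LoopStrategy>()
    var strategy = initial
    var fastAttempts = 0
    while (tried.size < MAX_RECONNECT_ATTEMPTS) {
        tried += strategy
        if (attempt(strategy)) return tried to true
        strategy = when (strategy) {
            LoopStrategy.FAST -> {
                fastAttempts++
                // After the FAST budget is spent, escalate to REJOIN.
                if (fastAttempts >= MAX_FAST_RECONNECT_ATTEMPTS) LoopStrategy.REJOIN else LoopStrategy.FAST
            }
            // A failed MIGRATE also escalates; REJOIN never downgrades.
            LoopStrategy.MIGRATE -> LoopStrategy.REJOIN
            LoopStrategy.REJOIN -> LoopStrategy.REJOIN
        }
    }
    return tried to false // budget exhausted: the caller would set ReconnectingFailed
}
```

If every attempt fails starting from FAST, the loop tries FAST three times and REJOIN seven times before giving up.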

Made-with: Cursor

* refactor(core): simplify RtcSession reconnect delegation

Thin out RtcSession's stateJob to forward all SFU socket-state
transitions directly to Call.reconnect() — the unified retry loop now
owns escalation logic (FAST -> REJOIN, MIGRATE -> REJOIN).

- Remove sfuConnectionRetryCount / MAX_SFU_CONNECTION_RETRIES; retry
  counting is handled by the reconnect loop in Call.
- Replace onSignalingLost callback with onSfuApiError (maps SFU error
  codes to strategies) and onSfuNetworkFailure (always FAST).
- Wire SfuConnectionModule to use RetryableSignalingServiceDecorator
  and the new dual-callback interface.
- Delete the now-unused SignalLostSignalingServiceDecorator.

Made-with: Cursor

* test(core): update tests for unified reconnect architecture

- SfuConnectionRetryTest: replace per-strategy and retry-counter tests
  with forwarding tests that verify stateJob delegates each strategy
  to Call.reconnect().
- ReconnectAttemptsCountTest: test FAST (no increment), REJOIN
  (increments), and accumulated attempts through the unified loop.
- FailedSfuIdsTest: use addFailedSfuId directly instead of calling
  migrate() which now requires full session setup.
- JoinCallTest: skip network-dependent latency test.

Made-with: Cursor

* fix(core): align reconnect guard with JS SDK — allow migrate while connected

The pre-loop guard in Call.reconnect() blocked MIGRATE (and all other
strategies) when the call was in Connected state. This prevented both
server-initiated migration (GoAway/error event with MIGRATE strategy)
and debug-triggered migration from executing.

Align with JS SDK: only skip reconnect when already RECONNECTING,
MIGRATING, or RECONNECTING_FAILED — exactly matching the JS guard.
Remove the Connected and Disconnected checks entirely.
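
A sketch of the guard as described in this commit message. CallConnectionState and shouldSkipReconnect are hypothetical names used for illustration; the real state type in the SDK is richer than this enum.

```kotlin
// Hypothetical, simplified mirror of the call's connection states.
enum class CallConnectionState { CONNECTED, DISCONNECTED, RECONNECTING, MIGRATING, RECONNECTING_FAILED }

/** Returns true when a new reconnect request should be ignored. */
fun shouldSkipReconnect(state: CallConnectionState): Boolean = when (state) {
    CallConnectionState.RECONNECTING,
    CallConnectionState.MIGRATING,
    CallConnectionState.RECONNECTING_FAILED -> true
    // CONNECTED and DISCONNECTED no longer block, so MIGRATE can run on a live call.
    CallConnectionState.CONNECTED,
    CallConnectionState.DISCONNECTED -> false
}
```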

Made-with: Cursor

* fix(core): prevent reconnect race and clean up migration/fast-reconnect

- Set Reconnecting/Migrating state before acquiring reconnectMutex so
  concurrent callers (stateJob, NetworkStateListener) see it and skip
- Launch call.reconnect() from stateJob in a separate coroutine so it
  survives stateJob cancellation during prepareReconnect()
- Simplify fastReconnect: connect synchronously, wait for Connected
  state, then restore session — removes serialProcessor indirection
- Remove ParticipantMigrationComplete await in migration flow to avoid
  unnecessary synchronization latency
- Make prepareReconnect() explicitly disconnect the old SFU socket
  before reconnecting to prevent stale-socket state machine errors
- Promote socketListenerJob to class field in SfuSocket and clean up
  old WebSocket on reconnect to prevent leaked connections
- Move pre-reconnect stats collection before prepareReconnect() so
  stats are sent while the connection is still alive

Made-with: Cursor

* fix(core): reliable network transition handling without dropping calls

Prevent false disconnect signals during network switches (e.g. cellular→WiFi)
by checking actual connectivity in NetworkStateProvider.onLost instead of
unconditionally marking the network as down. Add defense-in-depth guards
throughout the reconnect pipeline:

- Call.kt: leave timer checks connection state before executing; reconnect
  loop uses tryLock to avoid queuing redundant attempts
- RtcSession: centralize cleanup in cancelActiveWork(); add network-aware
  guards on SFU error/state callbacks; handle DisconnectedPermanently
  with escalation to rejoin; fast reconnect throws on stale peer connections
  instead of calling rejoin directly
- HealthMonitor: skip reconnect attempts when network is unavailable
- SfuSocketStateService: NetworkDisconnected stays parked on socket errors
  to avoid futile retry loops; handles NetworkAvailable for recovery

Made-with: Cursor

* api dump and spotless changes

* Fix - catch (Exception) swallows CancellationException in reconnect loop

* Fix - session.value!! force-unwrap TOCTOU race in reconnectRejoin/reconnectMigrate

* spotless apply

* Typo fixed

* Fix: Added catch (e: CancellationException) { throw e } before the generic catch. This avoids running sendCallStats() and logging a misleading trace when the coroutine was cancelled

* Adding back the sfuReconnectTimeoutMillis so that there's no breaking change with a @deprecated annotation

* Fixed nitpicks by coderabbit

* import fixed

* Renamed the local variable to loopStartTime

* Renamed the local variable to loopIteration

* Add comments

* All precondition guards in reconnectFast, reconnectRejoin, reconnectMigrate now throw ReconnectPreconditionException instead of IllegalStateException.

Catch block now has three layers in order:

1. CancellationException → re-throw (coroutine cancellation)
2. ReconnectPreconditionException → log + set ReconnectingFailed + break (terminal, no retry)
3. Exception → retry with escalation (transient failures)
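
The ordering matters because, on the JVM, CancellationException is a subclass of IllegalStateException and would otherwise be caught by the generic handler. A minimal sketch of the three layers follows; AttemptResult and runAttempt are hypothetical, and the exception class here is a stand-in for the one named above.

```kotlin
import kotlin.coroutines.cancellation.CancellationException

// Hypothetical stand-in for the exception named in the commit message.
class ReconnectPreconditionException(message: String) : Exception(message)

enum class AttemptResult { SUCCESS, TERMINAL_FAILURE, RETRY }

/** Runs one attempt and classifies the result using the three catch layers. */
fun runAttempt(attempt: () -> Unit): AttemptResult =
    try {
        attempt()
        AttemptResult.SUCCESS
    } catch (e: CancellationException) {
        throw e // 1. cancellation must propagate, never be retried
    } catch (e: ReconnectPreconditionException) {
        AttemptResult.TERMINAL_FAILURE // 2. terminal: the caller sets ReconnectingFailed and stops
    } catch (e: Exception) {
        AttemptResult.RETRY // 3. transient: the caller retries with escalation
    }
```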

* refactor(core): replace exception-driven reconnect with sealed ReconnectOutcome

Replace ReconnectPreconditionException and try-catch control flow in the
reconnect loop with a sealed class ReconnectOutcome (Success,
PreconditionNotMet, PeerConnectionStale, Failed). Each reconnect method
now returns an outcome instead of throwing, and the loop dispatches via
an exhaustive `when` — the compiler enforces that every case is handled.
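
A sketch of the shape this describes. The four subclass names come from the commit message; the payload fields, LoopAction, and the specific mapping in nextAction are assumptions made for the example.

```kotlin
sealed class ReconnectOutcome {
    object Success : ReconnectOutcome()
    data class PreconditionNotMet(val reason: String) : ReconnectOutcome() // field is assumed
    object PeerConnectionStale : ReconnectOutcome()
    data class Failed(val cause: Throwable) : ReconnectOutcome() // field is assumed
}

// Hypothetical loop decisions, used only to show the dispatch.
enum class LoopAction { STOP_CONNECTED, STOP_FAILED, ESCALATE_TO_REJOIN, RETRY }

/** No `else` branch: adding a new outcome makes this `when` fail to compile until it is handled. */
fun nextAction(outcome: ReconnectOutcome): LoopAction = when (outcome) {
    is ReconnectOutcome.Success -> LoopAction.STOP_CONNECTED
    is ReconnectOutcome.PreconditionNotMet -> LoopAction.STOP_FAILED
    is ReconnectOutcome.PeerConnectionStale -> LoopAction.ESCALATE_TO_REJOIN
    is ReconnectOutcome.Failed -> LoopAction.RETRY
}
```

Compared with the exception-based version, the control flow is visible in the return type, and coroutine cancellation no longer needs a dedicated catch in the loop.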

Made-with: Cursor

* refactor(core): introduce SfuConnectionResult and unify SFU connect paths

Add SfuConnectionResult sealed class in RtcSession so fastReconnect and
connectAndAwait return typed outcomes instead of throwing. The public
connect() now delegates to connectAndAwait, eliminating duplicated
request-building, tracing, and socket-await logic. fastReconnect is
reduced to its unique pre/post work (peer-connection health check,
subscription restore). Call.kt maps SfuConnectionResult → ReconnectOutcome
without try-catch.

Made-with: Cursor

* refactor(core): fix sendLeaveEvent ordering, deprecate connect, use internalConnect in _join

- Rename leaveWithReason to sendLeaveEvent and make it a suspend fun
  that awaits the send instead of fire-and-forget via launch. This
  fixes the race where cleanup() cancelled the supervisor job before
  the leave message was sent.
- Move sendLeaveEvent call in Call.internalLeave inside the
  clientImpl.scope.launch block so it completes before cleanup().
- Reorder prepareRejoin: sendLeaveEvent before cancelActiveWork/cleanup.
- Deprecate RtcSession.connect() in favor of internalConnect() which
  returns SfuConnectionResult instead of throwing.
- Update Call._join to use internalConnect directly, replacing the
  try-catch block with an exhaustive when on SfuConnectionResult.
- Split SfuConnectionResult into two sealed classes: SfuConnectionResult
  (Connected/Failed) for internalConnect, FastReconnectResult
  (Connected/PeerConnectionStale/Failed) for fastReconnect.
- Remove redundant try-catch blocks around joinRequest and collectStats
  in reconnectFast/reconnectRejoin/reconnectMigrate.
- Replace withTimeout with withTimeoutOrNull in internalConnect to
  correctly distinguish timeouts from external cancellation.
- Remove unused CancellationException import from Call.kt.

Made-with: Cursor

* fix(core): add ICE restart after fast reconnect and gate retries on network availability

After a successful fast reconnect, the publisher and subscriber ICE
connections may be stale because the underlying network path changed
(e.g. WiFi ↔ cellular). This adds explicit ICE restarts for both
directions so fresh candidates are gathered and media resumes.

Also introduces a network availability check in the Call.reconnect()
loop to avoid burning limited FAST attempt budgets when the network
is down. Skipped attempts are not counted, preserving the full retry
budget for when connectivity returns.

Key changes:
- RtcSession.restartIceAfterFastReconnect() restarts publisher and
  subscriber ICE after a successful fast reconnect
- RtcSession.fastReconnect() now uses isClosed() instead of
  isFailedOrClosed() so FAILED peer connections (recoverable via ICE
  restart) are not prematurely escalated to REJOIN
- StreamPeerConnection.isClosed() distinguishes truly CLOSED (needs
  REJOIN) from FAILED (recoverable)
- Call.reconnect() skips FAST attempts when network is unavailable
- Call.collectStats() wrapped in runCatching to avoid crashes during
  reconnection
- Companion object constants documented with KDoc
- Tests added/updated for all new behavior
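
The isClosed() change listed above can be illustrated with two predicates over a simplified state enum. PeerState is a hypothetical type whose value names follow the WebRTC peer connection states; the SDK's actual checks live on StreamPeerConnection and may inspect more than one state.

```kotlin
// Hypothetical state enum for the sketch.
enum class PeerState { NEW, CONNECTING, CONNECTED, DISCONNECTED, FAILED, CLOSED }

/** Old check: treats FAILED like CLOSED, which forces a REJOIN. */
fun isFailedOrClosed(state: PeerState): Boolean =
    state == PeerState.FAILED || state == PeerState.CLOSED

/** New check: only CLOSED needs a REJOIN; FAILED may still recover through an ICE restart. */
fun isClosed(state: PeerState): Boolean = state == PeerState.CLOSED
```

The two predicates differ only for FAILED, which is exactly the case the commit stops escalating.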

Made-with: Cursor

* Api dump

* refactor(core): unify SFU reconnection under Call.reconnect() and fix reconnect loop bugs

Strip self-reconnection logic from SfuSocket so it becomes a passive
state reporter. All SFU socket state changes (WebSocketEventLost,
DisconnectedTemporarily, NetworkDisconnected) now route through
RtcSession.stateJob into Call.reconnect(), eliminating the race
condition that caused double-joins and SCHEDULED_CLEANUP peer drops.

Key changes:
- Remove SfuSocket's networkStateListener and self-reconnect paths
- Route WebSocketEventLost from RtcSession.stateJob to Call.reconnect()
- Guard prepareReconnect() against disconnecting an already-disconnected socket
- Fix strategy downgrade bug: REJOIN no longer falls back to FAST
- Increment reconnectAttempts only for REJOIN/MIGRATE (not FAST/UNSPECIFIED)
- Leave the call when ReconnectingFailed is reached (prevent zombie calls)
- Remove redundant inner-loop network guard (entry-point checks suffice)
- Deprecate RestartConnection, NetworkAvailable, and RestartReason for
  binary compatibility
- Add high-level reconnection logic documentation

Made-with: Cursor

* test(core): fix ReconnectEscalationTest assertions after leave-on-failure change

Update expected terminal state from ReconnectingFailed to Disconnected
since reconnect() now calls leave() when all recovery attempts are
exhausted. Also adjust FAST retry count from 4 to 3 to match the
corrected loopIteration increment ordering.

Made-with: Cursor

* test(core): fix test isolation and add internalConnect tests

- Add Call.use {} helper to OrphanedTracksTest and ReconnectSessionIdTest
  to ensure call.cleanup() cancels background coroutines before runTest
  exits, preventing OOM from infinite stats-reporting loop
- Add StreamVideo.removeClient() in @After for FastReconnectIceRestartTest,
  ReconnectEscalationTest, SfuConnectionRetryTest, and RtcSessionTest2 to
  prevent singleton leakage across test classes
- Use spyk + coJustRun on sendCallStats to fix NPE from serializing mock
  objects in RtcSessionTest2
- Add two new internalConnect tests: timeout returns Failed, and
  ReconnectDetails are forwarded in the JoinRequest

Made-with: Cursor

* refactor(core): unify SFU error handling and add ICE health monitoring

Replace the fragmented wrapAPICall / safeCallWithResult helpers with a
single sfuCall wrapper that properly rethrows CancellationException.
Merge onTerminalError and onNetworkFailure into one onSessionError
callback so only session-fatal SFU errors (SIGNAL_LOST,
PARTICIPANT_NOT_FOUND, etc.) trigger reconnection — regular API
failures are returned to callers without side-effects.

Add ICE connection state monitoring in RtcSession so the UI surfaces
RealtimeConnection.Reconnecting when publisher or subscriber ICE
degrades, and restores Connected when both recover.

Made-with: Cursor

* fix(core): defer ICE monitoring start until SFU socket is connected

Starting the ICE monitoring job during RtcSession construction caused
it to collect from a mock Subscriber's iceState in tests, blocking the
StandardTestDispatcher's event loop and preventing the stateJob from
processing socket state changes.

Move startIceMonitoring() to the SfuSocketState.Connected handler so
it only runs once the real SFU connection is established. Add an
idempotency guard to avoid duplicate monitoring jobs on reconnect.

Fixes 9 failing tests in SfuConnectionRetryTest.

Made-with: Cursor

* refactor(core): address PR review — rename APIs, unify DISCONNECT flow, improve docs

- Rename `internalConnect` → `connectInternal` for better IDE discoverability
- Rename `sfuCall` → `safeApiCall` to reflect its generic nature
- Add `ReconnectOutcome.Disconnect` to unify the DISCONNECT strategy
  handling into the single `when(outcome)` decision flow, removing the
  early-return short-circuit and the misleading `error("Handled above")`
- Improve KDoc on `SfuRetryableException` explaining its role as a
  retry-signal mechanism for `StreamRetryProcessor`
- Remove default `{ true }` from `HealthMonitor.isNetworkAvailable`,
  requiring callers to pass it explicitly
- Add `isNetworkAvailable` to `CoordinatorSocket` health monitor
- Enhance `NetworkStateProvider` debug logs with network/capabilities info

Made-with: Cursor

* - Do not call reconnect on receiving DisconnectedPermanently
- Revert onLost behaviour
- Remove the condition (loopIteration >= MAX_FAST_RECONNECT_ATTEMPTS) when escalating to REJOIN
- Remove unwanted logging

* fix(core): centralize network check in reconnect loop, fail-fast on socket disconnect

- Move network availability check to the top of the reconnect while-loop
  so no other logic runs when offline; polls without consuming attempt budget
- Remove redundant network guard from DisconnectedTemporarily handler
- connectInternal now observes Disconnected states immediately instead of
  waiting for the full 10s timeout, enabling faster retry cycles
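
To show what "polls without consuming attempt budget" means, here is a simulation over a recorded sequence of network states instead of a live loop. attemptsConsumed and its parameters are hypothetical; the real loop suspends between polls and performs actual reconnect attempts.

```kotlin
/**
 * Walks through one network reading per loop iteration and returns how many
 * reconnect attempts were spent. Offline iterations are skipped at the top of
 * the loop, before any other logic runs, so they cost nothing.
 */
fun attemptsConsumed(networkStates: List<Boolean>, maxAttempts: Int): Int {
    var attempts = 0
    for (online in networkStates) {
        if (attempts >= maxAttempts) break // budget exhausted
        if (!online) continue // offline: poll again, budget untouched
        attempts++ // online: this iteration performs a real attempt
    }
    return attempts
}
```

For the readings offline, offline, online, offline, online, only two attempts are spent, however long the device stayed offline in between.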

Made-with: Cursor

* fix: Correctly update currentStrategy after reconnectDeadlineMillis has reached (#1668)

* - move the logic of incrementing nonFastReconnectAttempts inside the when loop to improve re
- Keep early exit checks at the start of the while loop

* spotlessApply

* Fix unit test cases

* Make isClosed internal

* Fix unit test

* add a 5s delay before asserting recording label reappearance

---------

Co-authored-by: Rahul Kumar Lohra <tgunix@gmail.com>
Co-authored-by: Aleksandar Apostolov <apostolov.alexandar@gmail.com>

Labels

pr:internal Internal or infra-only changes
