fix(llc): improve the handling of SFU join errors#1191
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughBumps protocolModelsVersion, adds migratingFromList to coordinator join flows, adds webrtcVersion to ClientDetails, extends SFU join/retry/migration and exclusion logic, adds two SdkType enum values, adjusts WebRTC track handling, updates tests/fixtures, changelog, and example Gradle heap. Changes
Sequence Diagram(s)sequenceDiagram
participant Client as Client SDK
participant Coordinator as Coordinator Service
participant SFU as SFU Server
Client->>Coordinator: joinCall(callCid, migratingFromList)
Coordinator-->>Client: CoordinatorJoined (selected SFU)
Client->>SFU: join request (client details incl. webrtcVersion)
SFU-->>Client: join response (success) or error (e.g., sfuFull)
alt SFU join success
Client->>Client: proceed with in-call setup
else SFU join error (join-error)
Client->>Client: add SFU to sfusToExclude
Client->>Coordinator: joinCall(callCid, migratingFromList + currentSfu)
Coordinator-->>Client: CoordinatorJoined (alternative SFU)
Client->>SFU: join request to alternative SFU
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1191 +/- ##
========================================
+ Coverage 6.36% 6.42% +0.06%
========================================
Files 615 615
Lines 43362 43412 +50
========================================
+ Hits 2758 2788 +30
- Misses 40604 40624 +20 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/stream_video/lib/src/coordinator/retry/coordinator_client_retry.dart (1)
282-289:⚠️ Potential issue | 🟠 MajorForward
membersLimitin retry join delegation (currently dropped).Line 279 accepts
membersLimit, but the delegate call omits it. This can change join response shape when usingCoordinatorClientRetry.🔧 Proposed fix
() => _delegate.joinCall( callCid: callCid, ringing: ringing, create: create, migratingFrom: migratingFrom, migratingFromList: migratingFromList, video: video, + membersLimit: membersLimit, ),🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/stream_video/lib/src/coordinator/retry/coordinator_client_retry.dart` around lines 282 - 289, The retry wrapper is dropping the membersLimit parameter when delegating to the underlying client; update the closure passed to _delegate.joinCall inside CoordinatorClientRetry so it forwards the membersLimit argument (membersLimit: membersLimit) along with callCid, ringing, create, migratingFrom, migratingFromList, and video, ensuring the delegate receives the same membersLimit value and the join response shape remains consistent.
🧹 Nitpick comments (2)
packages/stream_video/lib/src/webrtc/rtc_manager.dart (1)
738-739: Rollback provisionaltracksstate on publish failuresLine [738] and Line [815] register a local track before publish is confirmed. If an error returns at Line [760], Line [826], or Line [848],
trackscan retain a non-published track, which can mislead later flows (getPublisherTrackByType, mute/unmute, device switching).Proposed fix
@@ - tracks[audioTrack.trackId] = audioTrack; + tracks[audioTrack.trackId] = audioTrack; + var publishedAny = false; @@ if (transceiverResult is Failure) { await mediaTrackClone.stop(); + if (!publishedAny) tracks.remove(audioTrack.trackId); return transceiverResult; } + publishedAny = true; @@ - tracks[videoTrack.trackId] = videoTrack; + tracks[videoTrack.trackId] = videoTrack; + var publishedAny = false; @@ if (!publishOptions.any((o) => o.trackType == videoTrack.trackType)) { + tracks.remove(videoTrack.trackId); _logger.w( @@ if (transceiverResult is Failure) { await mediaTrackClone.stop(); + if (!publishedAny) tracks.remove(videoTrack.trackId); return transceiverResult; } + publishedAny = true;Also applies to: 760-762, 815-816, 826-828, 848-850
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/stream_video/lib/src/webrtc/rtc_manager.dart` around lines 738 - 739, The provisional insertion of a local track into the tracks map (tracks[audioTrack.trackId] = audioTrack / tracks[videoTrack.trackId] = videoTrack and their updatedTrack variants) must be rolled back if publish fails; update the publish flow in the methods that register these provisional tracks (references: tracks, audioTrack.trackId, videoTrack.trackId, updatedTrack) to remove the entry from tracks when an error path is taken (the failure return points around the current lines and the other error locations mentioned) so the tracks map only contains actually published tracks—i.e., after setting tracks[...] = ..., attempt the publish, and on any exception or non-success response call tracks.remove(audioTrack.trackId) / tracks.remove(videoTrack.trackId) (and ensure you don't remove an already replaced/persisted track) before returning the error.packages/stream_video/test/src/call/call_test.dart (1)
67-75: StrengthenmigratingFromListassertions to validate behavior, not just presence.Using
any(named: 'migratingFromList')won’t catch regressions in exclusion-list construction. Assert concrete values for first join ([]) and migration join (contains previous SFU).✅ Suggested test tightening
- migratingFromList: any(named: 'migratingFromList'), + migratingFromList: <String>[], ... - migratingFromList: any(named: 'migratingFromList'), + migratingFromList: <String>[ + SampleCallData.defaultCredentials.sfuServer.name, + ], ... - migratingFromList: any(named: 'migratingFromList'), + migratingFromList: <String>[],Also applies to: 94-102, 141-149
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/stream_video/test/src/call/call_test.dart` around lines 67 - 75, The test currently uses any(named: 'migratingFromList') when verifying coordinatorClient.joinCall, which only checks presence and can miss regressions; change the expectations to assert concrete migratingFromList values: for a first join assert an empty list ([]) is passed to migratingFromList, and for migration join assertions assert that migratingFromList contains the previous SFU identifier (or the exact list expected), updating the verifier calls around coordinatorClient.joinCall and any other similar expectations (the other joinCall assertions in the test file) to check the exact migratingFromList contents instead of any(...).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/stream_video/CHANGELOG.md`:
- Line 4: Update the release note text in the CHANGELOG entry that currently
reads "Improved SFU join, recovery and migration logic to reduce failed joins
with reconnec loops when joining full or shutting down SFU" by fixing the typo
"reconnec loops" to "reconnection loops" (or "reconnect loops" if preferred) so
the user-facing changelog reads correctly; locate the line containing the phrase
"reconnec loops" and replace it with "reconnection loops".
In
`@packages/stream_video/lib/open_api/video/coordinator/model/join_call_request.dart`:
- Line 101: The equality uses deep list comparison for migratingFromList but the
hashCode uses the list's identity hash, breaking the equality/hashCode contract;
update the JoinCallRequest (or the class defining == and hashCode) to compute
the hash for migratingFromList using _deepEquality.hash(migratingFromList)
instead of migratingFromList.hashCode so equal lists produce the same hash;
ensure the symbol names to change are the equality check referencing
_deepEquality.equals(other.migratingFromList, migratingFromList) and the
hashCode implementation that currently uses migratingFromList.hashCode.
---
Outside diff comments:
In
`@packages/stream_video/lib/src/coordinator/retry/coordinator_client_retry.dart`:
- Around line 282-289: The retry wrapper is dropping the membersLimit parameter
when delegating to the underlying client; update the closure passed to
_delegate.joinCall inside CoordinatorClientRetry so it forwards the membersLimit
argument (membersLimit: membersLimit) along with callCid, ringing, create,
migratingFrom, migratingFromList, and video, ensuring the delegate receives the
same membersLimit value and the join response shape remains consistent.
---
Nitpick comments:
In `@packages/stream_video/lib/src/webrtc/rtc_manager.dart`:
- Around line 738-739: The provisional insertion of a local track into the
tracks map (tracks[audioTrack.trackId] = audioTrack / tracks[videoTrack.trackId]
= videoTrack and their updatedTrack variants) must be rolled back if publish
fails; update the publish flow in the methods that register these provisional
tracks (references: tracks, audioTrack.trackId, videoTrack.trackId,
updatedTrack) to remove the entry from tracks when an error path is taken (the
failure return points around the current lines and the other error locations
mentioned) so the tracks map only contains actually published tracks—i.e., after
setting tracks[...] = ..., attempt the publish, and on any exception or
non-success response call tracks.remove(audioTrack.trackId) /
tracks.remove(videoTrack.trackId) (and ensure you don't remove an already
replaced/persisted track) before returning the error.
In `@packages/stream_video/test/src/call/call_test.dart`:
- Around line 67-75: The test currently uses any(named: 'migratingFromList')
when verifying coordinatorClient.joinCall, which only checks presence and can
miss regressions; change the expectations to assert concrete migratingFromList
values: for a first join assert an empty list ([]) is passed to
migratingFromList, and for migration join assertions assert that
migratingFromList contains the previous SFU identifier (or the exact list
expected), updating the verifier calls around coordinatorClient.joinCall and any
other similar expectations (the other joinCall assertions in the test file) to
check the exact migratingFromList contents instead of any(...).
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: cc7822cf-879a-4a14-819f-4cee12f22d65
📒 Files selected for processing (16)
packages/stream_video/CHANGELOG.mdpackages/stream_video/lib/globals.dartpackages/stream_video/lib/open_api/video/coordinator/model/join_call_request.dartpackages/stream_video/lib/protobuf/video/sfu/models/models.pb.dartpackages/stream_video/lib/protobuf/video/sfu/models/models.pbenum.dartpackages/stream_video/lib/protobuf/video/sfu/models/models.pbjson.dartpackages/stream_video/lib/src/call/call.dartpackages/stream_video/lib/src/coordinator/coordinator_client.dartpackages/stream_video/lib/src/coordinator/open_api/coordinator_client_open_api.dartpackages/stream_video/lib/src/coordinator/retry/coordinator_client_retry.dartpackages/stream_video/lib/src/sfu/data/models/sfu_error.dartpackages/stream_video/lib/src/stream_video.dartpackages/stream_video/lib/src/webrtc/rtc_manager.dartpackages/stream_video/test/src/call/call_allow_multiple_active_calls_test.dartpackages/stream_video/test/src/call/call_test.dartpackages/stream_video/test/src/call/fixtures/call_test_helpers.dart
packages/stream_video/lib/open_api/video/coordinator/model/join_call_request.dart
Show resolved
Hide resolved
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
packages/stream_video/CHANGELOG.md (1)
4-4: Consider adding detail about the SFU exclusion mechanism.While the current description is accurate, users might benefit from understanding that the improvement works by excluding problematic SFUs when requesting a new one from the coordinator. This aligns with the migratingFromList feature documented in the code.
💡 Optional enhancement for clarity
-* Improved SFU join, recovery and migration logic to reduce failed joins with reconnect loops when joining full or shutting-down SFUs +* Improved SFU join, recovery and migration logic to reduce failed joins with reconnect loops when joining full or shutting-down SFUs by requesting alternative SFUs from the coordinator🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/stream_video/CHANGELOG.md` at line 4, The changelog line is vague about how SFU join/recovery improvements work; update the entry to mention the SFU exclusion mechanism by describing that the client/agent now supplies a migratingFromList (or similar exclusion list) to the coordinator so problematic SFUs are excluded when requesting a new SFU, reducing reconnect loops and failed joins during migration or when SFUs are full/shutting down; reference the migratingFromList feature in the description for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/stream_video/CHANGELOG.md`:
- Line 4: Update the CHANGELOG entry bullet (the line starting with "* Improved
SFU join, recovery and migration logic...") by removing the trailing whitespace
and rephrasing for clarity and grammar; for example, add the missing serial
comma after "recovery", pluralize "SFU" to "SFUs" and clarify the outcome by
changing "...to reduce failed joins with reconnect loops when joining full or
shutting-down SFU" to something like "...to reduce failed joins and reconnect
loops when joining full or shutting-down SFUs". Ensure there is no trailing
space at the end of the line.
In `@packages/stream_video/lib/src/call/call.dart`:
- Around line 936-941: The unrecoverable SFU error branch in _join currently
logs and returns early via the _isUnrecoverableSfuError check, which skips the
normal disconnect/cleanup; change the branch in _join so it invokes the same
teardown/cleanup path (e.g., call the existing disconnect/cleanup method used
elsewhere in this class such as the cleanup/disconnect/teardown helper) before
returning and ensure the call state is set to closed/failed (not left in
reconnecting). Locate the _isUnrecoverableSfuError check and add a call to the
class's disconnect/teardown routine and state update immediately after logging
and before returning.
---
Nitpick comments:
In `@packages/stream_video/CHANGELOG.md`:
- Line 4: The changelog line is vague about how SFU join/recovery improvements
work; update the entry to mention the SFU exclusion mechanism by describing that
the client/agent now supplies a migratingFromList (or similar exclusion list) to
the coordinator so problematic SFUs are excluded when requesting a new SFU,
reducing reconnect loops and failed joins during migration or when SFUs are
full/shutting down; reference the migratingFromList feature in the description
for clarity.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a31f881a-0ece-402a-a30c-6b801637154c
📒 Files selected for processing (3)
packages/stream_video/CHANGELOG.mdpackages/stream_video/lib/src/call/call.dartpackages/stream_video/lib/src/stream_video.dart
…-video-flutter into fix/full-sfu-join-erros
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/stream_video/lib/src/call/call.dart`:
- Around line 936-943: The _join flow currently calls leave() when
_isUnrecoverableSfuError(connectionFailure) is true, but that triggers _clear()
which cancels the in-flight _idConnect and causes join() to report "connect
cancelled" instead of the original SFU error; change _join so it does NOT call
leave() in this unrecoverable-SFU branch—instead log the unrecoverable SFU error
via _logger.e and propagate/return the original failure result (or rethrow) so
_idConnect and join() surface the true SFU error; keep leave() behavior for
other disconnect reasons and avoid invoking _clear() from this branch.
- Around line 1243-1255: Normalize forceMigratingFrom by treating empty or
whitespace-only strings as null before using it to compute migratingFrom and
building effectiveMigratingFromList: replace the direct use of
forceMigratingFrom in the migratingFrom assignment with a normalized value
(e.g., null if forceMigratingFrom is null or trim().isEmpty) and ensure
migratingFromList is filtered to exclude empty/whitespace entries before
spreading into effectiveMigratingFromList so no empty string is added to the
coordinator exclusion list; refer to forceMigratingFrom, migratingFrom,
migratingFromList, effectiveMigratingFromList, _reconnectStrategy and
_session?.config.sfuName to locate the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 252121cd-1816-4163-a32d-0eecd41a650e
📒 Files selected for processing (1)
packages/stream_video/lib/src/call/call.dart
There was a problem hiding this comment.
🧹 Nitpick comments (2)
packages/stream_video/test/src/call/call_join_sfu_error_test.dart (2)
147-199: Consider explicitly verifyingmigratingFromListvalues.The test uses
any(named: 'migratingFromList')for all calls, but verifying the actual values would ensure the SFU exclusion list is properly accumulated across retry attempts. According to the PR objectives, the list should grow as SFUs are excluded.💡 Suggested improvement
verifyInOrder([ () => localCoordinator.joinCall( callCid: any(named: 'callCid'), create: any(named: 'create'), migratingFrom: null, - migratingFromList: any(named: 'migratingFromList'), + migratingFromList: [], video: any(named: 'video'), membersLimit: any(named: 'membersLimit'), ), // ... first start call ... () => localCoordinator.joinCall( callCid: any(named: 'callCid'), create: any(named: 'create'), migratingFrom: SampleCallData.defaultCredentials.sfuServer.name, - migratingFromList: any(named: 'migratingFromList'), + migratingFromList: [SampleCallData.defaultCredentials.sfuServer.name], video: any(named: 'video'), membersLimit: any(named: 'membersLimit'), ),🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/stream_video/test/src/call/call_join_sfu_error_test.dart` around lines 147 - 199, The test should assert concrete migratingFromList values instead of any(...) so we can confirm the SFU exclusion list grows across retries: update the verifyInOrder expectations for localCoordinator.joinCall to check migratingFromList equals an empty list on the first call, equals a list containing the first excluded SFU on the second call, and equals a list containing both previously excluded SFUs on the third call (use the actual variables/constants used to represent excluded SFU names in the test), while keeping other argument matchers the same for localCoordinator.joinCall and localSession.start.
142-142: Consider verifying the join result.The test calls
await call.join()but discards the result. Capturing and asserting on it would strengthen the test by confirming the expected failure state.💡 Suggested improvement
- await call.join(); + final result = await call.join(); + + expect(result.isFailure, isTrue);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/stream_video/test/src/call/call_join_sfu_error_test.dart` at line 142, Capture the result of call.join() (e.g., assign to a variable like joinResult) and add assertions against its expected failure state instead of discarding it; locate the call to call.join() in the test (call.join()) and assert properties on the returned value (for example that it indicates failure, contains the expected error code/message, or that its success flag is false) to make the test verify the join outcome.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@packages/stream_video/test/src/call/call_join_sfu_error_test.dart`:
- Around line 147-199: The test should assert concrete migratingFromList values
instead of any(...) so we can confirm the SFU exclusion list grows across
retries: update the verifyInOrder expectations for localCoordinator.joinCall to
check migratingFromList equals an empty list on the first call, equals a list
containing the first excluded SFU on the second call, and equals a list
containing both previously excluded SFUs on the third call (use the actual
variables/constants used to represent excluded SFU names in the test), while
keeping other argument matchers the same for localCoordinator.joinCall and
localSession.start.
- Line 142: Capture the result of call.join() (e.g., assign to a variable like
joinResult) and add assertions against its expected failure state instead of
discarding it; locate the call to call.join() in the test (call.join()) and
assert properties on the returned value (for example that it indicates failure,
contains the expected error code/message, or that its success flag is false) to
make the test verify the join outcome.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 429929ba-ad42-4e6d-9189-de98bda5be3d
📒 Files selected for processing (1)
packages/stream_video/test/src/call/call_join_sfu_error_test.dart
Some SFU errors are preventing the join flow. The common case is SFU_FULL, which technically shouldn't be retried, as the SFU will likely reject the second attempt too. In this case, we immediately ask the coordinator for a brand new SFU, hopefully with enough capacity.
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores