feat(sync): graceful cooldowns for 429 fair-use + backend-busy stale-fails; surface the reason, back off, and log why #7451
…ntry, stop loop, no retry bump)
…r bumps retryCount
…nstead of a Sync pill
Resolve l10n conflicts: take main's version of all 49 ARBs (it carries #7449's keys — syncStatusBackingUp, syncStatusOnDevice, syncCardDownloadingTitle, syncFlowIntro), then re-add syncCardRateLimited × 49 from this branch. Generated AppLocalizations files regenerated via flutter gen-l10n. No code conflicts.
Calls SyncRateLimiter.instance.markLimited(retryAfterSeconds: 60) so the rate-limit UI (auto-sync card 'Fair-use limit reached', upload gating) can be tested locally without tripping the backend's fair-use cap.
…rly-return when limited
Without this the sync UI never updated when the cooldown flipped, and tapping Sync inside the cooldown briefly flashed the syncing state (Cancel pill) then 'completed', with no rate-limit message anywhere. Now markLimited triggers a provider rebuild, and Sync/SyncWal short-circuit without entering the syncing state, so the rate-limited card stays visible.
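The two behaviours described in this commit (notify on flip, early-return while limited) can be sketched in plain Dart. This is an illustration only: `CooldownGate` and `SyncController` are hypothetical stand-ins for SyncRateLimiter and SyncProvider, and the hand-rolled listener list stands in for Flutter's ChangeNotifier so the snippet has no Flutter dependency.

```dart
/// Hypothetical stand-in for the PR's SyncRateLimiter (names assumed).
class CooldownGate {
  final List<void Function()> _listeners = [];
  DateTime? _limitedUntil;

  void addListener(void Function() listener) => _listeners.add(listener);

  /// True while [now] is still before the stored cooldown deadline.
  bool isLimited(DateTime now) {
    final until = _limitedUntil;
    return until != null && now.isBefore(until);
  }

  /// Opens a cooldown window and tells listeners so the UI can rebuild.
  void markLimited(DateTime now, Duration cooldown) {
    _limitedUntil = now.add(cooldown);
    _notify();
  }

  /// Clears the cooldown, e.g. after a successful upload.
  void clear() {
    if (_limitedUntil == null) return;
    _limitedUntil = null;
    _notify();
  }

  void _notify() {
    for (final listener in List.of(_listeners)) {
      listener();
    }
  }
}

/// Hypothetical stand-in for the provider-level sync entry point.
class SyncController {
  SyncController(this.gate);

  final CooldownGate gate;
  bool isSyncing = false;

  /// Returns false without ever setting [isSyncing] while the gate is
  /// closed, so the UI never flashes the syncing state during a cooldown.
  Future<bool> syncWals(DateTime now, Future<void> Function() upload) async {
    if (gate.isLimited(now)) return false;
    isSyncing = true;
    try {
      await upload();
      gate.clear();
      return true;
    } finally {
      isSyncing = false;
    }
  }
}
```

Passing `now` in explicitly is a choice made for this sketch so the cooldown can be exercised without waiting on a real clock.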
…des the spinner while paused
- Add the isRateLimited branch to _buildProcessCard (reads SyncProvider.isRateLimited).
- Hide the upload/processing spinner during the cooldown so the card reads unambiguously as paused; per-row 'Uploaded · processing on Omi' still shows the background processing for already-uploaded WALs.
Same tweak as sync_page — the spinner contradicts a 'paused' message; per-row subtitles still convey that uploaded WALs are processing in the background.
So a shared dev log shows when the status GET itself is being throttled or 5xx'd — previously the reason the reconciler couldn't make progress was invisible.
…ckendBusy)
Same cooldown gate, but persisted with a reason so the UI can pick distinct copy for a fair-use 429 vs a backend-worker-saturation pause.
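A minimal sketch of what "persisted with a reason" could look like. The enum values and the two English strings are the ones quoted in this PR; the helper names, the storage key names, and the fallback to rateLimit for unknown values are assumptions, and the real app takes its copy from the l10n keys rather than hard-coded strings.

```dart
enum RateLimitReason { rateLimit, backendBusy }

/// Card copy per reason (hard-coded here for illustration only).
String cooldownCardCopy(RateLimitReason reason) {
  switch (reason) {
    case RateLimitReason.backendBusy:
      return 'Omi servers are busy — your recordings will sync once capacity returns';
    case RateLimitReason.rateLimit:
      return 'Fair-use limit reached — syncing will resume automatically';
  }
}

/// Encodes a cooldown for persistence. Key names are assumptions.
Map<String, Object> encodeCooldown(DateTime until, RateLimitReason reason) => {
      'untilMs': until.millisecondsSinceEpoch,
      'reason': reason.name,
    };

/// Decodes the stored reason; unknown or missing values fall back to
/// rateLimit (an assumed default, not confirmed by the PR).
RateLimitReason decodeReason(Map<String, Object?> stored) {
  final name = stored['reason'];
  return RateLimitReason.values.firstWhere(
    (r) => r.name == name,
    orElse: () => RateLimitReason.rateLimit,
  );
}
```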
…ryCount bump
The server's stale guard marks queued-but-unworked jobs 'failed' with 'Job timed out (background worker likely died)' after 600s. That's a backend-capacity issue, not a content failure; bumping retryCount mislabels the recording as 'Couldn't process — retrying' and the user keeps tapping Retry, which spawns more jobs that also stale out.

When the reconciler sees that specific error: revert the WAL to miss (file kept, no retryCount bump), and markLimited(backendBusy) to pause further uploads so the backlog can drain. The row falls back to the calm grey 'Waiting to sync' and the status card surfaces the cause.

Also adds 'reconcile_poll' breadcrumbs for every per-job outcome (transient/non-terminal/completed) and a 'reconcile_revert' event with the server's status/error/segments + backendBusy + retryCountBumped, so 'Log to file' shares show the actual cause instead of guesses.
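The revert rule in this commit can be sketched as follows. `Wal`, `WalStatus`, and `revertFailedWal` are hypothetical names, and the breadcrumb is modelled as a plain map whose keys mirror the fields listed above; the PR's real logging call and WAL model are not shown in this thread.

```dart
enum WalStatus { miss, uploaded }

/// Minimal stand-in for a WAL record (fields assumed).
class Wal {
  Wal({this.status = WalStatus.uploaded, this.retryCount = 0});

  WalStatus status;
  int retryCount;
}

/// Reverts [wal] to miss after a failed server job. The retry budget is
/// only spent when the failure is NOT a backend-busy stale-out.
/// Returns a 'reconcile_revert' breadcrumb describing what happened.
Map<String, Object?> revertFailedWal(
  Wal wal, {
  required bool backendBusy,
  String? serverStatus,
  String? serverError,
}) {
  wal.status = WalStatus.miss;
  final bumped = !backendBusy;
  if (bumped) {
    wal.retryCount += 1;
  }
  return {
    'event': 'reconcile_revert',
    'status': serverStatus,
    'error': serverError,
    'backendBusy': backendBusy,
    'retryCountBumped': bumped,
    'retryCount': wal.retryCount, // post-bump value
  };
}
```

In the PR the backend-busy branch also pauses uploads through the rate limiter; that step is left out here to keep the sketch focused on the retry budget.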
@greptile-apps review
Greptile Summary
This PR addresses two distinct backend conditions (HTTP 429 fair-use throttling and backend-busy stale-guard failures) that were both surfacing as alarming "Couldn't process — retrying" states, burning the per-recording retry budget with no backoff. It introduces a persisted, app-global
Confidence Score: 3/5
Safe to merge after fixing the missing …
The core rate-limiting and backoff logic is well-structured and the observability additions are valuable. The missing app/lib/providers/sync_provider.dart needs the …
Important Files Changed
    // attribute a failed segment to a specific member, so revert all
    // members for re-upload — the server dedups segments that already
    // succeeded, so completed work is not duplicated.
    //
    // Backend-busy detection: when the server's stale guard marks a
    // queued job 'failed' with this specific error, the job never
    // even reached a worker — it's a backend-capacity issue, not a
    // content failure. Don't bump retryCount (which would mislabel
    // the recording as 'failed'), and pause uploads via the rate
    // limiter so we stop submitting more jobs that will also stale
    // out. UI surfaces this as 'Backend busy' (distinct from 429).
    final backendBusy = (s.error ?? '').contains('background worker likely died');
    if (backendBusy) {
Fragile string-match for backend-busy detection
The backendBusy path matches the exact literal "background worker likely died" from the server's stale guard. If the backend message is ever rephrased or the error key changes (it currently says the error string is misleading per issue #7469), detection silently reverts to treating the failure as a content failure, bumping retryCount and eventually showing "Failed" — the exact bad behavior this PR fixes. Worth extracting as a named constant or adding a secondary signal (e.g., s.status == 'failed' with empty failedSegments) so a server-side rename is easy to track down.
…constant
Greptile P2: matching the literal 'background worker likely died' string is fragile (backend issue #7469 even proposes renaming it). Extract the string as a named constant and add a structural signal: status=='failed' with totalSegments==0 can only come from the stale guard, since mark_job_completed only marks 'failed' when total>0. The OR catches both the current backend behavior and a future rename.
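A minimal sketch of the combined check this commit describes. The constant name, the `SyncJobStatus` class, and its field names are assumptions made for illustration; only the error text and the status == 'failed' with totalSegments == 0 rule come from the commit message.

```dart
/// Error text currently emitted by the server's stale guard.
const String kStaleGuardErrorMarker = 'background worker likely died';

/// Minimal stand-in for the job-status payload (field names assumed).
class SyncJobStatus {
  const SyncJobStatus({
    required this.status,
    this.error,
    this.totalSegments = 0,
  });

  final String status;
  final String? error;
  final int totalSegments;
}

/// True when a failed job looks like a stale-guard rewrite rather than a
/// content failure: either the known message is present, or the job failed
/// without any segments ever being counted.
bool isBackendBusy(SyncJobStatus s) {
  final matchesMessage = (s.error ?? '').contains(kStaleGuardErrorMarker);
  final matchesShape = s.status == 'failed' && s.totalSegments == 0;
  return matchesMessage || matchesShape;
}
```

The structural half of the OR keeps working if the server message is renamed, which is the case the review comment was worried about.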
Greptile P2: a misconfigured Retry-After (e.g. 99999999) would lock the app out of syncing for years. Cap at 24h, which is well above any reasonable rate-limit window.
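The cap can be sketched as a small pure function. The helper name, the 60-second fallback for a missing or unparseable header, and the decision to handle only the delta-seconds form of Retry-After (not the HTTP-date form) are assumptions for this sketch; the 24h ceiling is the value stated above.

```dart
/// Upper bound on any cooldown, per the commit message.
const Duration kMaxCooldown = Duration(hours: 24);

/// Assumed fallback when the header is missing or unusable.
const Duration kDefaultCooldown = Duration(seconds: 60);

/// Turns a Retry-After header (delta-seconds form only) into a cooldown,
/// clamped to [kMaxCooldown]. Missing, non-numeric, or non-positive values
/// fall back to [kDefaultCooldown].
Duration cooldownFromRetryAfter(String? header) {
  final seconds = int.tryParse(header?.trim() ?? '');
  if (seconds == null || seconds <= 0) return kDefaultCooldown;
  final requested = Duration(seconds: seconds);
  return requested > kMaxCooldown ? kMaxCooldown : requested;
}
```

With this shape, a header of 99999999 yields a 24-hour pause instead of a multi-year lockout.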
Problem
Two distinct backend conditions both manifested as the same alarming amber "Couldn't process — retrying" on every recording, with users tapping Retry forever and amplifying the storm. The chain was the same in both cases: the client treated transient-but-systemic backend signals as content failures, burned each recording's retry budget, and never told the user or itself what was actually happening.
(1) Fair-use throttling (HTTP 429): a user draining a multi-day offline backlog past their fair-use cap got the cap enforced (~474× 429 in 48h in Cloud Run for one user). uploadLocalFilesV2 mapped 429 to a generic Exception('Rate limited or budget exhausted') and auto-sync's catch, which already excludes SocketException, didn't exclude 429. So every throttle bumped retryCount and the app re-fired uploads every minute with no backoff.

(2) Backend-busy stale-guard fails: when backend-sync's storage / postprocess pools are saturated (storage 96/96 + 29 queued, postprocess 24/24 + 80 queued observed), uploads are accepted (202), assigned a jobId, and sit in queued server-side. After 10 min the server's stale guard in database/sync_jobs.py:get_sync_job rewrites the status to failed with the misleading message "Job timed out (background worker likely died)", but no worker died; none was available. The reconciler reads failed, reverts the WAL, bumps retry. User taps Retry → new job → also queued → same fate → retry budget exhausted → red "Failed". Reproduced and confirmed end-to-end via reporter's reconcile_poll / reconcile_revert log breadcrumbs (added in this PR). Separate backend issue opened: #7469.

Fix
Surface the real reason, back off instead of hammering
- uploadLocalFilesV2 throws a typed SyncRateLimitedException on 429, parsing Retry-After.
- SyncRateLimiter (new, persisted, app-global) holds a cooldown timestamp + a RateLimitReason (rateLimit | backendBusy). It extends ChangeNotifier so SyncProvider rebuilds the UI the moment the gate flips.
- The gate is checked in _syncSingleWal, batch syncAll (gates entry, breaks the loop on 429 mid-batch), and single syncWal. syncWals / syncWal at the provider level early-return when limited, so no more "Cancel pill flashes then resets". Any successful upload clears the cooldown.
- A 429 does not bump retryCount; it is treated like SocketException (transient throttle, not content failure). Side-effect: rows that would have flipped to amber "Couldn't process — retrying" stay calm grey "Waiting to sync" automatically, so no new per-row state is needed.

Backend-busy detection (new in this PR)
The reconciler's failed / partial_failure branch detects the specific "background worker likely died" error from the server's stale guard and treats it as backend capacity, not a content failure:
- Reverts the WAL to miss without bumping retryCount.
- Marks SyncRateLimiter with reason: backendBusy.
- The status cards read SyncProvider.rateLimitReason and pick distinct, calm copy:
  - backendBusy → "Omi servers are busy — your recordings will sync once capacity returns".
  - rateLimit → "Fair-use limit reached — syncing will resume automatically".

Observability (the gap that prevented diagnosis before)
The reconciler used to silently break on transient / non-terminal and silently revert on failed / notFound; the server's status, error, failedSegments were received but discarded. With "Log to file" on in Settings → Developer, you now get the actual cause in a shareable log:
- fetchSyncJobStatus logs the HTTP status code when non-200 (so 429-on-the-GET, 5xx, etc. are visible; answers "why is the WAL stuck on 'Processing on Omi'?").
- reconcileUploadedWals emits reconcile_poll for every per-job outcome (transient / non-terminal-queued / non-terminal-processing / completed) and reconcile_revert on the notFound and failed/partial_failure branches with the server's status, error, failedSegments / totalSegments, the post-bump retryCount, and backendBusy / retryCountBumped flags.

This is how the reporter's stuck-recordings case was actually diagnosed: the breadcrumbs lined up with the server's "background worker likely died" error at the 10-minute mark, leading to the backend-busy fix above.

Files
| File | Change |
| --- | --- |
| app/lib/services/wals/sync_rate_limiter.dart | SyncRateLimiter, persisted, ChangeNotifier, with RateLimitReason |
| app/lib/backend/http/api/conversations.dart | SyncRateLimitedException, Retry-After parser, HTTP-status log on non-200 GET |
| app/lib/services/wals.dart, app/lib/services/wals/wal_interfaces.dart | … |
| app/lib/services/wals/local_wal_sync.dart | … syncAll, typed 429 catch in batch + single paths, backend-busy detection in reconciler, per-job poll / revert logging |
| app/lib/providers/sync_provider.dart | syncWals / syncWal early-return when limited; expose isRateLimited / rateLimitedUntil / rateLimitReason |
| app/lib/providers/capture_provider.dart | … _syncSingleWal, typed 429 catch (no retryCount bump), clear on success |
| app/lib/pages/conversations/auto_sync_page.dart, app/lib/pages/conversations/sync_page.dart | … |
| app/lib/l10n/app_en.arb + 48 locales + codegen | … (syncCardRateLimited, syncCardBackendBusy) |

Out of scope (explicit follow-ups)
- … queued vs processing in the stale guard, fix the misleading error string, scale the saturated pools, and add observability (log stale-guard rewrites, include UID in POST 202 accept logs).
- … uploaded indefinitely. Sketched earlier; not in this PR.

Verification
- flutter analyze clean on every touched file (the warnings shown for other files are pre-existing).
- flutter gen-l10n → zero untranslated.

🤖 Generated with Claude Code