feat(sync): graceful cooldowns for 429 fair-use + backend-busy stale-fails; surface the reason, back off, and log why by mdmohsin7 · Pull Request #7451 · BasedHardware/omi

mdmohsin7 · 2026-05-21T19:46:50Z

Problem

Two distinct backend conditions both manifested as the same alarming amber "Couldn't process — retrying" on every recording, with users tapping Retry forever and amplifying the storm. The chain was the same in both cases: the client treated transient-but-systemic backend signals as content failures, burned each recording's retry budget, and never told the user or itself what was actually happening.

(1) Fair-use throttling (HTTP 429) — a user draining a multi-day offline backlog past their fair-use cap got the cap enforced (~474× 429 in 48h in Cloud Run for one user). uploadLocalFilesV2 mapped 429 to a generic Exception('Rate limited or budget exhausted') and auto-sync's catch — which already excludes SocketException — didn't exclude 429. So every throttle bumped retryCount and the app re-fired uploads every minute with no backoff.

(2) Backend-busy stale-guard fails — when backend-sync's storage/postprocess pools are saturated (storage 96/96 + 29 queued, postprocess 24/24 + 80 queued observed), uploads are accepted (202), assigned a jobId, and sit in queued server-side. After 10 min the server's stale guard in database/sync_jobs.py:get_sync_job rewrites the status to failed with the misleading message "Job timed out (background worker likely died)" — but no worker died, none was available. The reconciler reads failed, reverts the WAL, bumps retry. User taps Retry → new job → also queued → same fate → retry budget exhausted → red "Failed". Reproduced and confirmed end-to-end via reporter's reconcile_poll / reconcile_revert log breadcrumbs (added in this PR). Separate backend issue opened: #7469.

Fix

Surface the real reason, back off instead of hammering

uploadLocalFilesV2 throws a typed SyncRateLimitedException on 429, parsing Retry-After.
SyncRateLimiter (new, persisted, app-global) holds a cooldown timestamp + a RateLimitReason (rateLimit | backendBusy). extends ChangeNotifier so SyncProvider rebuilds the UI the moment the gate flips.
Every upload entry point checks the cooldown and skips while active: auto-sync _syncSingleWal, batch syncAll (gates entry, breaks the loop on 429 mid-batch), single syncWal. syncWals/syncWal at the provider level early-return when limited — no more "Cancel pill flashes then resets". Any successful upload clears the cooldown.
429 no longer bumps retryCount — treated like SocketException (transient throttle, not content failure). Side-effect: rows that would have flipped to amber "Couldn't process — retrying" stay calm grey "Waiting to sync" automatically — no new per-row state needed.

Backend-busy detection (new in this PR)

Reconciler's failed/partial_failure branch detects the specific "background worker likely died" error from the server's stale guard and treats it as backend capacity, not a content failure:
- WAL reverts to miss without bumping retryCount.
- Triggers a 10-min cooldown via SyncRateLimiter with reason: backendBusy.
Status cards (auto-sync + manual) read SyncProvider.rateLimitReason and pick distinct, calm copy:
- backendBusy → "Omi servers are busy — your recordings will sync once capacity returns".
- rateLimit → "Fair-use limit reached — syncing will resume automatically".
Spinner is hidden in the rate-limited state on both cards so the card reads unambiguously as paused (per-row "Uploaded · processing on Omi" still shows the in-flight server work on the WALs that uploaded before the limit hit).

Observability (the gap that prevented diagnosis before)

The reconciler used to silently break on transient / non-terminal and silently revert on failed/notFound — the server's status, error, failedSegments were received but discarded. With "Log to file" on in Settings → Developer, you now get the actual cause in a shareable log:

fetchSyncJobStatus logs the HTTP status code when non-200 (so 429-on-the-GET, 5xx, etc. are visible — answers "why is the WAL stuck on 'Processing on Omi'?").
reconcileUploadedWals emits reconcile_poll for every per-job outcome (transient / non-terminal-queued / non-terminal-processing / completed) and reconcile_revert on the notFound and failed/partial_failure branches with the server's status, error, failedSegments/totalSegments, the post-bump retryCount, and backendBusy / retryCountBumped flags.

This is how the reporter's stuck-recordings case was actually diagnosed — the breadcrumbs lined up with the server's "background worker likely died" error at the 10-minute mark, leading to the backend-busy fix above.

Files

Area	File	Change
Gate (new)	`app/lib/services/wals/sync_rate_limiter.dart`	New `SyncRateLimiter`, persisted, ChangeNotifier, with `RateLimitReason`
HTTP	`app/lib/backend/http/api/conversations.dart`	Typed `SyncRateLimitedException`, `Retry-After` parser, HTTP-status log on non-200 GET
Barrels	`app/lib/services/wals.dart`, `app/lib/services/wals/wal_interfaces.dart`	Export the limiter + the exception
Engine	`app/lib/services/wals/local_wal_sync.dart`	Gate `syncAll`, typed 429 catch in batch + single paths, backend-busy detection in reconciler, per-job poll / revert logging
Provider	`app/lib/providers/sync_provider.dart`	Listens to limiter; `syncWals`/`syncWal` early-return when limited; expose `isRateLimited` / `rateLimitedUntil` / `rateLimitReason`
Auto-sync	`app/lib/providers/capture_provider.dart`	Gate `_syncSingleWal`, typed 429 catch (no `retryCount` bump), clear on success
UI	`app/lib/pages/conversations/auto_sync_page.dart`, `app/lib/pages/conversations/sync_page.dart`	"Fair-use limit reached" / "Omi servers are busy" branches selected by reason; spinner hidden while paused; pill hidden so taps don't no-op
l10n	`app/lib/l10n/app_en.arb` + 48 locales + codegen	Two new keys (`syncCardRateLimited`, `syncCardBackendBusy`)

Out of scope (explicit follow-ups)

Backend fix for the stale guard mis-flagging healthy queued jobs and for capacity: tracked in sync-job stale guard incorrectly marks healthy queued jobs as 'failed' (background-busy is mislabelled) #7469 (cc @beastoin). The client mitigation here is pattern-matching a server error string; the right fix is server-side. Lays out: distinguish queued vs processing in the stale guard, fix the misleading error string, scale the saturated pools, and add observability (log stale-guard rewrites, include UID in POST 202 accept logs).
Reconciler staleness bound for the different "stuck on Processing on Omi" case where the status GET is unreachable (429'd, 5xx'd) so WALs sit uploaded indefinitely. Sketched earlier; not in this PR.
Manual sync_page's full integration with feat(sync): redesign manual sync page (sync_page) to match auto-sync; fix 'stuck' processing perception #7449 already merged — this PR's branch carries the merge.

Verification

flutter analyze clean on every touched file (the warnings shown for other files are pre-existing).
flutter gen-l10n → zero untranslated.
Reporter ran the simulate path and the dev log shows the full reconcile-poll/revert breadcrumbs matching the expected flow; the WALs stop showing as "Couldn't process — retrying" when the server returns the stale-guard error.

🤖 Generated with Claude Code

…-After

…ntry, stop loop, no retry bump)

…r bumps retryCount

…nstead of a Sync pill

Resolve l10n conflicts: take main's version of all 49 ARBs (it carries #7449's keys — syncStatusBackingUp, syncStatusOnDevice, syncCardDownloadingTitle, syncFlowIntro), then re-add syncCardRateLimited × 49 from this branch. Generated AppLocalizations files regenerated via flutter gen-l10n. No code conflicts.

Calls SyncRateLimiter.instance.markLimited(retryAfterSeconds: 60) so the rate-limit UI (auto-sync card 'Fair-use limit reached', upload gating) can be tested locally without tripping the backend's fair-use cap.

…otify

…rly-return when limited Without this the sync UI never updated when the cooldown flipped, and tapping Sync inside the cooldown briefly flashed the syncing state (Cancel pill) then 'completed' — with no rate-limit message anywhere. Now markLimited triggers a provider rebuild, and Sync/SyncWal short-circuit without entering the syncing state, so the rate-limited card stays visible.

…des the spinner while paused - Add the isRateLimited branch to _buildProcessCard (reads SyncProvider.isRateLimited). - Hide the upload/processing spinner during the cooldown so the card reads unambiguously as paused; per-row 'Uploaded · processing on Omi' still shows the background processing for already-uploaded WALs.

Same tweak as sync_page — the spinner contradicts a 'paused' message; per-row subtitles still convey that uploaded WALs are processing in the background.

…t for shipping

So a shared dev log shows when the status GET itself is being throttled or 5xx'd — previously the reason the reconciler couldn't make progress was invisible.

…ckendBusy) Same cooldown gate, but persisted with a reason so the UI can pick distinct copy for a fair-use 429 vs a backend-worker-saturation pause.

…ryCount bump The server's stale guard marks queued-but-unworked jobs 'failed' with 'Job timed out (background worker likely died)' after 600s. That's a backend-capacity issue, not a content failure — bumping retryCount mislabels the recording as 'Couldn't process — retrying' and the user keeps tapping Retry, which spawns more jobs that also stale out. When the reconciler sees that specific error: revert the WAL to miss (file kept, no retryCount bump), and markLimited(backendBusy) to pause further uploads so the backlog can drain. The row falls back to the calm grey 'Waiting to sync' and the status card surfaces the cause. Also adds 'reconcile_poll' breadcrumbs for every per-job outcome (transient/non-terminal/completed) and a 'reconcile_revert' event with the server's status/error/segments + backendBusy + retryCountBumped, so 'Log to file' shares show the actual cause instead of guesses.

…ason

…reason

mdmohsin7 · 2026-05-23T13:18:10Z

@greptile-apps review

greptile-apps · 2026-05-23T13:22:15Z

Greptile Summary

This PR addresses two distinct backend conditions (HTTP 429 fair-use throttling and backend-busy stale-guard failures) that were both surfacing as alarming "Couldn't process — retrying" states, burning the per-recording retry budget with no backoff. It introduces a persisted, app-global SyncRateLimiter singleton that gates all upload entry points, distinguishes the two failure modes with typed exceptions and RateLimitReason, and surfaces calm, reason-specific copy in both the auto-sync and manual-sync status cards.

SyncRateLimiter (new): A ChangeNotifier singleton backed by SharedPreferences; markLimited / clear are called at all three upload sites (syncAll, _syncSingleWal, single WAL). The reconciler's failed branch detects the server's stale-guard error string and triggers a 10-min backendBusy cooldown without bumping retryCount.
Observability: fetchSyncJobStatus now logs HTTP status on non-200; the reconciler emits reconcile_poll / reconcile_revert breadcrumbs with job ID, WAL IDs, server error, and bump flags.
l10n: Two new string keys (syncCardRateLimited, syncCardBackendBusy) added to all 48 locales via code-gen.

Confidence Score: 3/5

Safe to merge after fixing the missing removeListener in SyncProvider.dispose — without it, any upload activity after the provider is disposed will trigger a Flutter assertion error in debug builds and a memory leak in release builds.

The core rate-limiting and backoff logic is well-structured and the observability additions are valuable. The missing removeListener in SyncProvider.dispose is the one concrete defect that would reliably fire whenever uploads happen after navigation. The backend-busy string-match fragility and uncapped retryAfterSeconds are worth noting but do not block the happy path.

app/lib/providers/sync_provider.dart needs the removeListener fix; app/lib/services/wals/local_wal_sync.dart and app/lib/services/wals/sync_rate_limiter.dart have the lower-priority concerns around string-matching fragility and the missing retryAfter cap.

Important Files Changed

Filename	Overview
app/lib/providers/sync_provider.dart	Adds `isRateLimited`/`rateLimitReason` getters and early-returns; missing `removeListener` in `dispose` causes a use-after-dispose error and memory leak.
app/lib/services/wals/sync_rate_limiter.dart	New singleton `SyncRateLimiter` persisted via SharedPreferences; solid design but no cap on server-supplied `retryAfterSeconds`.
app/lib/services/wals/local_wal_sync.dart	Adds rate-limit gate, typed 429 catch with batch-break, backend-busy detection via fragile string match, and detailed reconciler logging.
app/lib/backend/http/api/conversations.dart	Adds typed `SyncRateLimitedException`, `Retry-After` parser, and HTTP-status debug logging on non-200 job-status fetches — clean and self-contained.
app/lib/providers/capture_provider.dart	Auto-sync path gates on rate limiter before upload and handles `SyncRateLimitedException` without bumping retry count; correct and symmetric with batch path.
app/lib/pages/conversations/auto_sync_page.dart	Hides spinner and action pill in rate-limited state; shows distinct copy for `backendBusy` vs `rateLimit` reason.
app/lib/pages/conversations/sync_page.dart	Manual sync card mirrors auto-sync rate-limited UI branch consistently.

Comments Outside Diff (2)

app/lib/providers/sync_provider.dart, line 636-644 (link)

Missing removeListener for SyncRateLimiter in dispose

The constructor registers notifyListeners as a listener on the long-lived SyncRateLimiter.instance singleton, but dispose never calls SyncRateLimiter.instance.removeListener(notifyListeners). After SyncProvider is disposed (e.g., after navigating away), any subsequent markLimited() or clear() call on the singleton will call notifyListeners() on the dead provider — triggering a FlutterError: A [_ChangeNotifier] was used after being disposed assertion in debug mode and a memory leak in release mode.
app/lib/providers/sync_provider.dart, line 637-643 (link)

Add the matching removeListener call so the disposed provider is not kept alive through the singleton's listener list.

_{Reviews (1): Last reviewed commit: "chore(l10n): regenerate AppLocalizations..." | Re-trigger Greptile}

greptile-apps · 2026-05-23T13:22:25Z

              // attribute a failed segment to a specific member, so revert all
              // members for re-upload — the server dedups segments that already
              // succeeded, so completed work is not duplicated.
+              //
+              // Backend-busy detection: when the server's stale guard marks a
+              // queued job 'failed' with this specific error, the job never
+              // even reached a worker — it's a backend-capacity issue, not a
+              // content failure. Don't bump retryCount (which would mislabel
+              // the recording as 'failed'), and pause uploads via the rate
+              // limiter so we stop submitting more jobs that will also stale
+              // out. UI surfaces this as 'Backend busy' (distinct from 429).
+              final backendBusy = (s.error ?? '').contains('background worker likely died');
+              if (backendBusy) {


Fragile string-match for backend-busy detection

The backendBusy path matches the exact literal "background worker likely died" from the server's stale guard. If the backend message is ever rephrased or the error key changes (it currently says the error string is misleading per issue #7469), detection silently reverts to treating the failure as a content failure, bumping retryCount and eventually showing "Failed" — the exact bad behavior this PR fixes. Worth extracting as a named constant or adding a secondary signal (e.g., s.status == 'failed' with empty failedSegments) so a server-side rename is easy to track down.

…constant Greptile P2: matching the literal 'background worker likely died' string is fragile (backend issue #7469 even proposes renaming it). Extract the string as a named constant and add a structural signal — status=='failed' with totalSegments==0 can only come from the stale guard, since mark_job_completed only marks 'failed' when total>0. The OR catches both the current backend behavior and a future rename.

Greptile P2: a misconfigured Retry-After (e.g. 99999999) would lock the app out of syncing for years. Cap at 24h, which is well above any reasonable rate-limit window.

mdmohsin7 added 11 commits May 22, 2026 01:14

feat(sync): add SyncRateLimiter — persisted fair-use (429) cooldown gate

08407ca

feat(sync): throw typed SyncRateLimitedException on 429 + parse Retry…

de66a7a

…-After

feat(sync): export SyncRateLimitedException from the wal barrel

24d96c3

feat(sync): export SyncRateLimiter from the wals barrel

53a6154

feat(sync): honor 429 cooldown in batch + single upload paths (gate e…

1484e90

…ntry, stop loop, no retry bump)

feat(sync): auto-sync skips uploads during 429 cooldown; 429 no longe…

2f20aa8

…r bumps retryCount

feat(sync): expose isRateLimited / rateLimitedUntil on SyncProvider

1b6491c

feat(sync): status card shows 'Fair-use limit reached' when paused, i…

9134e6f

…nstead of a Sync pill

i18n(sync): add syncCardRateLimited (EN)

46ae386

i18n(sync): translate syncCardRateLimited across 48 locales

b893f9c

chore(l10n): regenerate AppLocalizations for syncCardRateLimited

5935c31

mdmohsin7 marked this pull request as ready for review May 22, 2026 11:25

mdmohsin7 added 16 commits May 23, 2026 15:51

feat(dev): 'Simulate sync rate limit (60s)' tile in Settings → Developer

c4570d1

Calls SyncRateLimiter.instance.markLimited(retryAfterSeconds: 60) so the rate-limit UI (auto-sync card 'Fair-use limit reached', upload gating) can be tested locally without tripping the backend's fair-use cap.

feat(sync): SyncRateLimiter is a ChangeNotifier — markLimited/clear n…

a5f67d8

…otify

feat(sync): auto_sync_page hides the spinner in the rate-limited card

4488ac2

Same tweak as sync_page — the spinner contradicts a 'paused' message; per-row subtitles still convey that uploaded WALs are processing in the background.

chore(dev): remove 'Simulate sync rate limit' tester — debug-only, no…

1bf9c1a

…t for shipping

feat(sync): log raw HTTP status of fetchSyncJobStatus when non-200

6402c55

So a shared dev log shows when the status GET itself is being throttled or 5xx'd — previously the reason the reconciler couldn't make progress was invisible.

feat(sync): SyncRateLimiter carries a RateLimitReason (rateLimit | ba…

beacc10

…ckendBusy) Same cooldown gate, but persisted with a reason so the UI can pick distinct copy for a fair-use 429 vs a backend-worker-saturation pause.

feat(sync): expose rateLimitReason on SyncProvider

687666b

feat(sync): auto_sync_page status card picks message by rate-limit re…

9edc19a

…ason

feat(sync): manual sync_page status card picks message by rate-limit …

4bf4581

…reason

i18n(sync): add syncCardBackendBusy EN

47cdaa9

i18n(sync): translate syncCardBackendBusy across 48 locales

9b11b4f

chore(l10n): regenerate AppLocalizations for syncCardBackendBusy

a78da81

mdmohsin7 mentioned this pull request May 23, 2026

sync-job stale guard incorrectly marks healthy queued jobs as 'failed' (background-busy is mislabelled) #7469

Open

mdmohsin7 changed the title ~~feat(sync): handle fair-use (429) throttling — surface the reason, back off instead of hammering~~ feat(sync): graceful cooldowns for 429 fair-use + backend-busy stale-fails; surface the reason, back off, and log why May 23, 2026

greptile-apps Bot reviewed May 23, 2026

View reviewed changes

mdmohsin7 added 2 commits May 23, 2026 19:00

fix(sync): clamp markLimited retryAfterSeconds at a 24h ceiling

28ed859

Greptile P2: a misconfigured Retry-After (e.g. 99999999) would lock the app out of syncing for years. Cap at 24h, which is well above any reasonable rate-limit window.

mdmohsin7 merged commit b0ca581 into main May 23, 2026
2 checks passed

mdmohsin7 deleted the feat/sync-rate-limit-handling branch May 23, 2026 14:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(sync): graceful cooldowns for 429 fair-use + backend-busy stale-fails; surface the reason, back off, and log why#7451

feat(sync): graceful cooldowns for 429 fair-use + backend-busy stale-fails; surface the reason, back off, and log why#7451
mdmohsin7 merged 29 commits into
mainfrom
feat/sync-rate-limit-handling

mdmohsin7 commented May 21, 2026 •

edited

Loading

Uh oh!

mdmohsin7 commented May 23, 2026

Uh oh!

greptile-apps Bot commented May 23, 2026 •

edited

Loading

Comments Outside Diff (2)

Uh oh!

greptile-apps Bot May 23, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mdmohsin7 commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Surface the real reason, back off instead of hammering

Backend-busy detection (new in this PR)

Observability (the gap that prevented diagnosis before)

Files

Out of scope (explicit follow-ups)

Verification

Uh oh!

mdmohsin7 commented May 23, 2026

Uh oh!

greptile-apps Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Comments Outside Diff (2)

Uh oh!

greptile-apps Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mdmohsin7 commented May 21, 2026 •

edited

Loading

greptile-apps Bot commented May 23, 2026 •

edited

Loading