Skip to content

feat(sync): graceful cooldowns for 429 fair-use + backend-busy stale-fails; surface the reason, back off, and log why#7451

Merged
mdmohsin7 merged 29 commits into
mainfrom
feat/sync-rate-limit-handling
May 23, 2026
Merged

feat(sync): graceful cooldowns for 429 fair-use + backend-busy stale-fails; surface the reason, back off, and log why#7451
mdmohsin7 merged 29 commits into
mainfrom
feat/sync-rate-limit-handling

Conversation

@mdmohsin7
Copy link
Copy Markdown
Member

@mdmohsin7 mdmohsin7 commented May 21, 2026

Problem

Two distinct backend conditions both manifested as the same alarming amber "Couldn't process — retrying" on every recording, with users tapping Retry forever and amplifying the storm. The chain was the same in both cases: the client treated transient-but-systemic backend signals as content failures, burned each recording's retry budget, and never told the user or itself what was actually happening.

(1) Fair-use throttling (HTTP 429) — a user draining a multi-day offline backlog past their fair-use cap got the cap enforced (~474× 429 in 48h in Cloud Run for one user). uploadLocalFilesV2 mapped 429 to a generic Exception('Rate limited or budget exhausted') and auto-sync's catch — which already excludes SocketExceptiondidn't exclude 429. So every throttle bumped retryCount and the app re-fired uploads every minute with no backoff.

(2) Backend-busy stale-guard fails — when backend-sync's storage/postprocess pools are saturated (storage 96/96 + 29 queued, postprocess 24/24 + 80 queued observed), uploads are accepted (202), assigned a jobId, and sit in queued server-side. After 10 min the server's stale guard in database/sync_jobs.py:get_sync_job rewrites the status to failed with the misleading message "Job timed out (background worker likely died)" — but no worker died, none was available. The reconciler reads failed, reverts the WAL, bumps retry. User taps Retry → new job → also queued → same fate → retry budget exhausted → red "Failed". Reproduced and confirmed end-to-end via reporter's reconcile_poll / reconcile_revert log breadcrumbs (added in this PR). Separate backend issue opened: #7469.

Fix

Surface the real reason, back off instead of hammering

  • uploadLocalFilesV2 throws a typed SyncRateLimitedException on 429, parsing Retry-After.
  • SyncRateLimiter (new, persisted, app-global) holds a cooldown timestamp + a RateLimitReason (rateLimit | backendBusy). extends ChangeNotifier so SyncProvider rebuilds the UI the moment the gate flips.
  • Every upload entry point checks the cooldown and skips while active: auto-sync _syncSingleWal, batch syncAll (gates entry, breaks the loop on 429 mid-batch), single syncWal. syncWals/syncWal at the provider level early-return when limited — no more "Cancel pill flashes then resets". Any successful upload clears the cooldown.
  • 429 no longer bumps retryCount — treated like SocketException (transient throttle, not content failure). Side-effect: rows that would have flipped to amber "Couldn't process — retrying" stay calm grey "Waiting to sync" automatically — no new per-row state needed.

Backend-busy detection (new in this PR)

  • Reconciler's failed/partial_failure branch detects the specific "background worker likely died" error from the server's stale guard and treats it as backend capacity, not a content failure:
    • WAL reverts to miss without bumping retryCount.
    • Triggers a 10-min cooldown via SyncRateLimiter with reason: backendBusy.
  • Status cards (auto-sync + manual) read SyncProvider.rateLimitReason and pick distinct, calm copy:
    • backendBusy"Omi servers are busy — your recordings will sync once capacity returns".
    • rateLimit"Fair-use limit reached — syncing will resume automatically".
  • Spinner is hidden in the rate-limited state on both cards so the card reads unambiguously as paused (per-row "Uploaded · processing on Omi" still shows the in-flight server work on the WALs that uploaded before the limit hit).

Observability (the gap that prevented diagnosis before)

The reconciler used to silently break on transient / non-terminal and silently revert on failed/notFound — the server's status, error, failedSegments were received but discarded. With "Log to file" on in Settings → Developer, you now get the actual cause in a shareable log:

  • fetchSyncJobStatus logs the HTTP status code when non-200 (so 429-on-the-GET, 5xx, etc. are visible — answers "why is the WAL stuck on 'Processing on Omi'?").
  • reconcileUploadedWals emits reconcile_poll for every per-job outcome (transient / non-terminal-queued / non-terminal-processing / completed) and reconcile_revert on the notFound and failed/partial_failure branches with the server's status, error, failedSegments/totalSegments, the post-bump retryCount, and backendBusy / retryCountBumped flags.

This is how the reporter's stuck-recordings case was actually diagnosed — the breadcrumbs lined up with the server's "background worker likely died" error at the 10-minute mark, leading to the backend-busy fix above.

Files

Area File Change
Gate (new) app/lib/services/wals/sync_rate_limiter.dart New SyncRateLimiter, persisted, ChangeNotifier, with RateLimitReason
HTTP app/lib/backend/http/api/conversations.dart Typed SyncRateLimitedException, Retry-After parser, HTTP-status log on non-200 GET
Barrels app/lib/services/wals.dart, app/lib/services/wals/wal_interfaces.dart Export the limiter + the exception
Engine app/lib/services/wals/local_wal_sync.dart Gate syncAll, typed 429 catch in batch + single paths, backend-busy detection in reconciler, per-job poll / revert logging
Provider app/lib/providers/sync_provider.dart Listens to limiter; syncWals/syncWal early-return when limited; expose isRateLimited / rateLimitedUntil / rateLimitReason
Auto-sync app/lib/providers/capture_provider.dart Gate _syncSingleWal, typed 429 catch (no retryCount bump), clear on success
UI app/lib/pages/conversations/auto_sync_page.dart, app/lib/pages/conversations/sync_page.dart "Fair-use limit reached" / "Omi servers are busy" branches selected by reason; spinner hidden while paused; pill hidden so taps don't no-op
l10n app/lib/l10n/app_en.arb + 48 locales + codegen Two new keys (syncCardRateLimited, syncCardBackendBusy)

Out of scope (explicit follow-ups)

Verification

  • flutter analyze clean on every touched file (the warnings shown for other files are pre-existing).
  • flutter gen-l10n → zero untranslated.
  • Reporter ran the simulate path and the dev log shows the full reconcile-poll/revert breadcrumbs matching the expected flow; the WALs stop showing as "Couldn't process — retrying" when the server returns the stale-guard error.

🤖 Generated with Claude Code

@mdmohsin7 mdmohsin7 marked this pull request as ready for review May 22, 2026 11:25
mdmohsin7 added 16 commits May 23, 2026 15:51
Resolve l10n conflicts: take main's version of all 49 ARBs (it carries
#7449's keys — syncStatusBackingUp, syncStatusOnDevice, syncCardDownloadingTitle,
syncFlowIntro), then re-add syncCardRateLimited × 49 from this branch.
Generated AppLocalizations files regenerated via flutter gen-l10n.

No code conflicts.
Calls SyncRateLimiter.instance.markLimited(retryAfterSeconds: 60) so the
rate-limit UI (auto-sync card 'Fair-use limit reached', upload gating)
can be tested locally without tripping the backend's fair-use cap.
…rly-return when limited

Without this the sync UI never updated when the cooldown flipped, and tapping
Sync inside the cooldown briefly flashed the syncing state (Cancel pill) then
'completed' — with no rate-limit message anywhere. Now markLimited triggers
a provider rebuild, and Sync/SyncWal short-circuit without entering the
syncing state, so the rate-limited card stays visible.
…des the spinner while paused

- Add the isRateLimited branch to _buildProcessCard (reads SyncProvider.isRateLimited).
- Hide the upload/processing spinner during the cooldown so the card reads
  unambiguously as paused; per-row 'Uploaded · processing on Omi' still shows
  the background processing for already-uploaded WALs.
Same tweak as sync_page — the spinner contradicts a 'paused' message; per-row
subtitles still convey that uploaded WALs are processing in the background.
So a shared dev log shows when the status GET itself is being throttled
or 5xx'd — previously the reason the reconciler couldn't make progress
was invisible.
…ckendBusy)

Same cooldown gate, but persisted with a reason so the UI can pick distinct
copy for a fair-use 429 vs a backend-worker-saturation pause.
…ryCount bump

The server's stale guard marks queued-but-unworked jobs 'failed' with
'Job timed out (background worker likely died)' after 600s. That's a
backend-capacity issue, not a content failure — bumping retryCount
mislabels the recording as 'Couldn't process — retrying' and the user
keeps tapping Retry, which spawns more jobs that also stale out.

When the reconciler sees that specific error: revert the WAL to miss
(file kept, no retryCount bump), and markLimited(backendBusy) to pause
further uploads so the backlog can drain. The row falls back to the
calm grey 'Waiting to sync' and the status card surfaces the cause.

Also adds 'reconcile_poll' breadcrumbs for every per-job outcome
(transient/non-terminal/completed) and a 'reconcile_revert' event with
the server's status/error/segments + backendBusy + retryCountBumped, so
'Log to file' shares show the actual cause instead of guesses.
@mdmohsin7 mdmohsin7 changed the title feat(sync): handle fair-use (429) throttling — surface the reason, back off instead of hammering feat(sync): graceful cooldowns for 429 fair-use + backend-busy stale-fails; surface the reason, back off, and log why May 23, 2026
@mdmohsin7
Copy link
Copy Markdown
Member Author

@greptile-apps review

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 23, 2026

Greptile Summary

This PR addresses two distinct backend conditions (HTTP 429 fair-use throttling and backend-busy stale-guard failures) that were both surfacing as alarming "Couldn't process — retrying" states, burning the per-recording retry budget with no backoff. It introduces a persisted, app-global SyncRateLimiter singleton that gates all upload entry points, distinguishes the two failure modes with typed exceptions and RateLimitReason, and surfaces calm, reason-specific copy in both the auto-sync and manual-sync status cards.

  • SyncRateLimiter (new): A ChangeNotifier singleton backed by SharedPreferences; markLimited / clear are called at all three upload sites (syncAll, _syncSingleWal, single WAL). The reconciler's failed branch detects the server's stale-guard error string and triggers a 10-min backendBusy cooldown without bumping retryCount.
  • Observability: fetchSyncJobStatus now logs HTTP status on non-200; the reconciler emits reconcile_poll / reconcile_revert breadcrumbs with job ID, WAL IDs, server error, and bump flags.
  • l10n: Two new string keys (syncCardRateLimited, syncCardBackendBusy) added to all 48 locales via code-gen.

Confidence Score: 3/5

Safe to merge after fixing the missing removeListener in SyncProvider.dispose — without it, any upload activity after the provider is disposed will trigger a Flutter assertion error in debug builds and a memory leak in release builds.

The core rate-limiting and backoff logic is well-structured and the observability additions are valuable. The missing removeListener in SyncProvider.dispose is the one concrete defect that would reliably fire whenever uploads happen after navigation. The backend-busy string-match fragility and uncapped retryAfterSeconds are worth noting but do not block the happy path.

app/lib/providers/sync_provider.dart needs the removeListener fix; app/lib/services/wals/local_wal_sync.dart and app/lib/services/wals/sync_rate_limiter.dart have the lower-priority concerns around string-matching fragility and the missing retryAfter cap.

Important Files Changed

Filename Overview
app/lib/providers/sync_provider.dart Adds isRateLimited/rateLimitReason getters and early-returns; missing removeListener in dispose causes a use-after-dispose error and memory leak.
app/lib/services/wals/sync_rate_limiter.dart New singleton SyncRateLimiter persisted via SharedPreferences; solid design but no cap on server-supplied retryAfterSeconds.
app/lib/services/wals/local_wal_sync.dart Adds rate-limit gate, typed 429 catch with batch-break, backend-busy detection via fragile string match, and detailed reconciler logging.
app/lib/backend/http/api/conversations.dart Adds typed SyncRateLimitedException, Retry-After parser, and HTTP-status debug logging on non-200 job-status fetches — clean and self-contained.
app/lib/providers/capture_provider.dart Auto-sync path gates on rate limiter before upload and handles SyncRateLimitedException without bumping retry count; correct and symmetric with batch path.
app/lib/pages/conversations/auto_sync_page.dart Hides spinner and action pill in rate-limited state; shows distinct copy for backendBusy vs rateLimit reason.
app/lib/pages/conversations/sync_page.dart Manual sync card mirrors auto-sync rate-limited UI branch consistently.

Comments Outside Diff (2)

  1. app/lib/providers/sync_provider.dart, line 636-644 (link)

    P1 Missing removeListener for SyncRateLimiter in dispose

    The constructor registers notifyListeners as a listener on the long-lived SyncRateLimiter.instance singleton, but dispose never calls SyncRateLimiter.instance.removeListener(notifyListeners). After SyncProvider is disposed (e.g., after navigating away), any subsequent markLimited() or clear() call on the singleton will call notifyListeners() on the dead provider — triggering a FlutterError: A [_ChangeNotifier] was used after being disposed assertion in debug mode and a memory leak in release mode.

  2. app/lib/providers/sync_provider.dart, line 637-643 (link)

    P1 Add the matching removeListener call so the disposed provider is not kept alive through the singleton's listener list.

Reviews (1): Last reviewed commit: "chore(l10n): regenerate AppLocalizations..." | Re-trigger Greptile

Comment on lines +955 to +967
// attribute a failed segment to a specific member, so revert all
// members for re-upload — the server dedups segments that already
// succeeded, so completed work is not duplicated.
//
// Backend-busy detection: when the server's stale guard marks a
// queued job 'failed' with this specific error, the job never
// even reached a worker — it's a backend-capacity issue, not a
// content failure. Don't bump retryCount (which would mislabel
// the recording as 'failed'), and pause uploads via the rate
// limiter so we stop submitting more jobs that will also stale
// out. UI surfaces this as 'Backend busy' (distinct from 429).
final backendBusy = (s.error ?? '').contains('background worker likely died');
if (backendBusy) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Fragile string-match for backend-busy detection

The backendBusy path matches the exact literal "background worker likely died" from the server's stale guard. If the backend message is ever rephrased or the error key changes (it currently says the error string is misleading per issue #7469), detection silently reverts to treating the failure as a content failure, bumping retryCount and eventually showing "Failed" — the exact bad behavior this PR fixes. Worth extracting as a named constant or adding a secondary signal (e.g., s.status == 'failed' with empty failedSegments) so a server-side rename is easy to track down.

Comment thread app/lib/services/wals/sync_rate_limiter.dart Outdated
mdmohsin7 added 2 commits May 23, 2026 19:00
…constant

Greptile P2: matching the literal 'background worker likely died' string is
fragile (backend issue #7469 even proposes renaming it). Extract the string
as a named constant and add a structural signal — status=='failed' with
totalSegments==0 can only come from the stale guard, since
mark_job_completed only marks 'failed' when total>0. The OR catches both
the current backend behavior and a future rename.
Greptile P2: a misconfigured Retry-After (e.g. 99999999) would lock the
app out of syncing for years. Cap at 24h, which is well above any
reasonable rate-limit window.
@mdmohsin7 mdmohsin7 merged commit b0ca581 into main May 23, 2026
2 checks passed
@mdmohsin7 mdmohsin7 deleted the feat/sync-rate-limit-handling branch May 23, 2026 14:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant