fix(desktop): reduce API polling frequency and optimize slow backend queries #6500

@beastoin

Problem

After PR #6175 migrated desktop CRUD from Rust backend to Python backend, desktop traffic is generating 504 timeouts on listing endpoints. Root cause is two-fold:

  1. Aggressive polling — desktop app polls 6+ endpoints every 15-30 seconds, generating ~4,275 requests/user/day across ~500 users (2.15M total req/day peak)
  2. Slow Firestore queries — p50 latency is sub-second but long tail (users with large collections) exceeds the 2-minute app timeout

The mobile endpoints are unaffected (/v2/messages stayed at ~3 504s/day), proving the backend itself didn't regress — the desktop migration just added massive new traffic that amplified a pre-existing tail latency issue.

Evidence

Before/after PR #6175 (mon's data):

| Endpoint | 504s/day before | 504s/day after | Desktop reqs/day added |
|---|---|---|---|
| /v1/action-items | 2 | 298 | 141K-648K |
| /v1/conversations | 12 | 203 | 97K-347K |
| /v2/desktop/messages | 0 (new) | 206 | 430K |
| /v1/conversations/count | 0 | 36 | 100% desktop |
| /v3/memories | 2 | 103 | 10K-30K |

No load balancer rate limiting — Cloud Armor not configured, zero 429s. All requests pass through.

Root Cause: Desktop Polling Timers

Every polling timer found in the desktop app, ranked by request volume:

| # | Source | File:Line | Endpoint | Interval | Req/user/hr | Guards |
|---|---|---|---|---|---|---|
| 1 | ChatProvider messagePoll | ChatProvider.swift:553 | GET /v1/messages | 15s | 240 | isSignedIn, !sending, !loading, messages not empty |
| 2 | DesktopHomeView refresh | DesktopHomeView.swift:238 | GET /v1/conversations + /v1/conversations/count | 30s | 240 (2 calls/tick) | isSignedIn, !loading |
| 3 | TasksStore auto-refresh | TasksStore.swift:165 | GET /v1/action-items | 30s | 120 | isActive (page visible), isSignedIn |
| 4 | MemoriesPage auto-refresh | MemoriesPage.swift:210 | GET /v3/memories | 30s | 120 | isActive (page visible), isSignedIn |
| 5 | CrispManager | CrispManager.swift:63 | GET /v1/crisp/unread | 120s | 30 | !AuthBackoff |
| 6 | TranscriptionRetryService | TranscriptionRetryService.swift:25 | GET /v1/conversations | 60s | 0-60 | hasPendingSessions |
| 7 | didBecomeActive cascade | DesktopHomeView.swift:200 | GET /v1/conversations + count | on app activate | ~48/hr | every cmd-tab back |
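The Req/user/hr figures for the fixed-interval timers follow directly from the interval and the number of calls per tick. A minimal sketch of that arithmetic (the helper name `requests_per_hour` is made up for illustration; intervals are taken from the table above):

```python
# Requests per user per hour for a fixed-interval polling timer.
def requests_per_hour(interval_seconds: int, calls_per_tick: int = 1) -> int:
    return (3600 // interval_seconds) * calls_per_tick

# (interval in seconds, API calls per tick), as listed in the table.
timers = {
    "ChatProvider messagePoll": (15, 1),
    "DesktopHomeView refresh": (30, 2),    # conversations + count
    "TasksStore auto-refresh": (30, 1),
    "MemoriesPage auto-refresh": (30, 1),
    "CrispManager": (120, 1),
    "TranscriptionRetryService": (60, 1),  # upper bound; only runs with pending sessions
}

rates = {name: requests_per_hour(i, c) for name, (i, c) in timers.items()}

for name, rate in rates.items():
    print(f"{name}: {rate} req/user/hr")
```

The didBecomeActive cascade is event-driven rather than timer-driven, so it is left out of this calculation.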

Key issues:

  • ChatProvider (15s) and DesktopHomeView (30s) have no page-visibility guard — they run even when the window is hidden. Menu bar apps keep windows alive indefinitely.
  • refreshConversations() makes 2 separate API calls (getConversations + getConversationsCount) per tick
  • No timers are stopped when the app window is closed/hidden
  • ~500 users x ~14,240 req/user/day (if app runs 24h as menu bar apps do) = ~7.1M theoretical max
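For reference, the fleet-wide totals quoted in this issue are just the per-user daily figures multiplied by the user count. A quick check using only numbers stated above (variable names are illustrative):

```python
USERS = 500  # approximate desktop user count from this issue

observed_per_user_day = 4_275      # measured average requests/user/day
theoretical_per_user_day = 14_240  # if the app runs 24h as a menu bar app

observed_total = USERS * observed_per_user_day
theoretical_total = USERS * theoretical_per_user_day

print(f"observed:    {observed_total:,} req/day")     # roughly 2.1M
print(f"theoretical: {theoretical_total:,} req/day")  # roughly 7.1M
```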

APIs Requiring Backend Optimization

These are the Python backend endpoints that 504 for heavy users (large Firestore collections):

| Endpoint | p50 | p99 | Max | 504 rate | Issue |
|---|---|---|---|---|---|
| GET /v1/conversations | 0.68s | 5.8s | 158s | 0.04% | Full doc reads including compressed transcript_segments, no field projection |
| GET /v1/conversations/count | fast | - | - | 0.02% | .count().get() on unindexed filter combos |
| GET /v1/action-items | 0.51s | 6.0s | 111s | 0.05% | Double Firestore query for has_more pagination + Python re-sort |
| GET /v2/desktop/messages | 0.35s | 4.0s | 111s | 0.03% | 3-4 sequential Firestore round-trips on POST (save_message) |
| GET /v3/memories | 0.64s | 24s | 110s | 0.53% | Hardcoded limit=5000 on first page, no pagination |

Specific backend code issues:

  • conversations.py:277 — reads full documents with no .select() field projection. Transcript segments are large compressed blobs unnecessary for list views.
  • action_items.py:234-246 — executes a second identical query with offset+limit just to check has_more. Should request limit+1 instead.
  • chat.py:689-720 (save_message): does 3-4 sequential Firestore round-trips (acquire_session + set + get + update). Should batch writes.
  • memories.py:48 — hardcodes limit=5000 when offset=0. Should cap at 200 and paginate.
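The limit+1 suggestion for has_more can be sketched as follows. This is not the actual action_items.py code: `fetch_page` and the `run_query(limit, offset)` callable are hypothetical stand-ins for whatever executes the Firestore query.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def fetch_page(
    run_query: Callable[[int, int], List[T]],  # hypothetical: (limit, offset) -> rows
    limit: int,
    offset: int = 0,
) -> Tuple[List[T], bool]:
    """Return one page plus has_more using a single query."""
    # Ask for one extra row; if it comes back, there is at least one more page.
    rows = run_query(limit + 1, offset)
    has_more = len(rows) > limit
    # Trim the sentinel row so callers only ever see `limit` items.
    return rows[:limit], has_more


# Usage with an in-memory list standing in for the collection.
items = list(range(25))
run_query = lambda limit, offset: items[offset:offset + limit]

first_page, first_has_more = fetch_page(run_query, limit=10, offset=0)
last_page, last_has_more = fetch_page(run_query, limit=10, offset=20)
```

The trade-off is one extra document read per page instead of a second full query.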

Proposed Fix Plan

Phase 1: Reduce polling frequency (desktop client, highest impact)

  • ChatProvider: 15s → 120s (or replace with push/WebSocket)
  • DesktopHomeView: 30s → 120s, add window-visibility guard
  • TasksStore/MemoriesPage: 30s → 120s (already have isActive guard)
  • Combine getConversations + getConversationsCount into single call
  • Add 60s cooldown on didBecomeActive refresh cascade
  • Stop CrispManager timer when Help tab not recently viewed
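The visibility guard and the didBecomeActive cooldown boil down to one small gate in front of each refresh. The desktop client is Swift; the sketch below only illustrates the logic in Python, and `RefreshGate` with its parameters is a hypothetical name, not existing code.

```python
import time
from typing import Callable, Optional


class RefreshGate:
    """Allow a refresh only if the window is visible and the cooldown has elapsed."""

    def __init__(self, cooldown_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._cooldown = cooldown_seconds
        self._clock = clock  # injectable so the gate can be tested without sleeping
        self._last: Optional[float] = None

    def should_refresh(self, window_visible: bool) -> bool:
        if not window_visible:
            return False  # hidden window: skip the tick entirely
        now = self._clock()
        if self._last is not None and now - self._last < self._cooldown:
            return False  # refreshed too recently
        self._last = now
        return True


# Usage: a 60s cooldown, as proposed for the didBecomeActive cascade.
gate = RefreshGate(cooldown_seconds=60)
if gate.should_refresh(window_visible=True):
    pass  # trigger the actual refresh here
```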

Phase 2: Optimize backend queries (server-side)

  • Add .select() field projection for list endpoints (skip transcript_segments)
  • Fix action-items has_more: request limit+1 instead of double query
  • Batch writes in save_message (session acquire + write + update)
  • Cap /v3/memories first-page limit to 200
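A rough sketch of the projection and limit-cap items, assuming the handler receives a google-cloud-firestore query or collection object (which provides `select()`, `offset()`, `limit()` and `stream()`). The field names in `LIST_FIELDS` and the helper names are placeholders, not the real schema or code.

```python
from typing import Any, Dict, List, Sequence

MAX_PAGE_SIZE = 200  # cap proposed in this issue for /v3/memories

# Hypothetical list-view fields; the point is that transcript_segments is absent.
LIST_FIELDS = ["title", "created_at", "status"]


def clamp_limit(requested: int, maximum: int = MAX_PAGE_SIZE) -> int:
    """Keep the page size within [1, maximum] regardless of what the client sent."""
    return max(1, min(requested, maximum))


def list_documents(query: Any, limit: int, offset: int = 0,
                   fields: Sequence[str] = LIST_FIELDS) -> List[Dict[str, Any]]:
    """Read only the projected fields for one capped page."""
    projected = (
        query.select(list(fields))   # field projection: skip large blobs
        .offset(offset)
        .limit(clamp_limit(limit))
    )
    return [snapshot.to_dict() for snapshot in projected.stream()]
```

With this shape, a client asking for limit=5000 still gets at most 200 documents per call and has to paginate for the rest.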

Phase 3: Infrastructure

  • Consider adding Cloud Armor rate limiting as safety net
  • Consider server-side response caching for list endpoints
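Server-side response caching could be as simple as a short-TTL cache keyed per user and endpoint. The sketch below is an in-process illustration only; `TTLCache` is a hypothetical helper, and a real deployment with multiple backend instances would more likely need a shared store and explicit invalidation on writes.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Tiny in-memory cache that recomputes a value once it is older than the TTL."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: skip the expensive query
        value = compute()
        self._entries[key] = (now, value)
        return value


# Usage: cache a list response for 30 seconds, keyed by (user id, endpoint).
cache = TTLCache(ttl_seconds=30)
result = cache.get_or_compute(("user-123", "/v1/conversations"), lambda: ["..."])
```

Even a TTL shorter than the polling interval would absorb the duplicate reads caused by overlapping timers and the activate cascade.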

Impact

  • Phase 1 alone would reduce desktop traffic by 4-8x (from ~2M to ~250-500K req/day)
  • Phase 2 would reduce p99 latency and eliminate remaining 504s for heavy users
  • Combined: should bring desktop 504s from ~800/day to near-zero
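The 4-8x range corresponds to the two interval changes in Phase 1: 30s → 120s cuts a timer's request rate by 4x, and 15s → 120s by 8x. A small check of that arithmetic, using the ~2M req/day baseline quoted above:

```python
def reduction_factor(old_interval_s: int, new_interval_s: int) -> float:
    """How many times fewer requests a timer sends after lengthening its interval."""
    return new_interval_s / old_interval_s

factor_30s = reduction_factor(30, 120)  # DesktopHomeView, TasksStore, MemoriesPage
factor_15s = reduction_factor(15, 120)  # ChatProvider

baseline = 2_000_000  # approximate current desktop req/day
upper_estimate = baseline / factor_30s  # if everything only improved 4x
lower_estimate = baseline / factor_15s  # if everything improved 8x

print(f"{lower_estimate:,.0f} to {upper_estimate:,.0f} req/day")
```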

Metadata

Labels: backend (Backend Task, python), bug (Something isn't working), desktop, p1 (Priority: Critical, score 22-29)
