
sync-local-files 504 timeouts on large payloads (>120s pipeline) #5941

@beastoin

Description


Problem

/v1/sync-local-files on backend-sync returns 504 Gateway Timeout — 41 occurrences in 48h across 17 users. All failures hit 125-170s latency, exceeding the 120s TimeoutMiddleware cutoff.

Successful requests: P50 = 17s, max = 55s. Failures correlate with larger payloads (1-32MB) and more speech (up to 487s of audio per request).

Pipeline Trace

The endpoint (backend/routers/sync.py:756) runs 4 serial stages, all synchronous within an async def:

| Stage | Function | Parallelism | Estimated time (large payload) |
|---|---|---|---|
| 1. Decode | `decode_files_to_wav()` | Sequential | ~5s |
| 2. VAD | `retrieve_vad_segments()` via `chunk_threads` | 5 threads/chunk | 30-60s (hosted VAD API has 300s timeout) |
| 3. Transcription + LLM | `process_segment()` via `chunk_threads` | 5 threads/chunk | 50-120s (Deepgram 10-30s + LLM 15-30s per segment) |
| 4. Cleanup | `_cleanup_files()` | Sequential | <1s |
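A minimal sketch of that serial shape, with stubs standing in for the real helpers in backend/routers/sync.py (only the call order is meaningful here):

```python
import asyncio

calls = []

# Stubs standing in for the real helpers; sketch only.
def decode_files_to_wav(files):
    calls.append("decode")
    return ["audio.wav"]

def retrieve_vad_segments(wavs):
    calls.append("vad")
    return ["seg-1", "seg-2"]

def process_segment(seg):
    calls.append(f"process:{seg}")

def _cleanup_files(wavs):
    calls.append("cleanup")

async def sync_local_files(files):
    # Four serial stages: each blocking helper is awaited in turn, so the
    # request latency is the sum of every stage's latency.
    loop = asyncio.get_running_loop()
    wavs = await loop.run_in_executor(None, decode_files_to_wav, files)
    segments = await loop.run_in_executor(None, retrieve_vad_segments, wavs)
    for seg in segments:
        await loop.run_in_executor(None, process_segment, seg)
    await loop.run_in_executor(None, _cleanup_files, wavs)

asyncio.run(sync_local_files(["file.opus"]))
```

Because nothing overlaps across stages, the worst-case stage estimates add up directly against the 120s budget.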

For a 487s-speech payload producing 8 segments:

  • VAD: 1-2 chunks × 30s = 30-60s
  • process_segment: 2 chunks × (Deepgram + LLM) = 50-120s
  • Total: 80-180s → exceeds 120s timeout
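A back-of-envelope check of the stage estimates above against the 120s middleware cutoff (stage names are illustrative):

```python
# Best/worst-case seconds per stage, taken from the pipeline table above.
TIMEOUT_S = 120
stage_estimates_s = {
    "decode": (5, 5),
    "vad": (30, 60),
    "transcription_llm": (50, 120),
    "cleanup": (0, 1),
}
best_s = sum(lo for lo, _ in stage_estimates_s.values())
worst_s = sum(hi for _, hi in stage_estimates_s.values())
# Worst case (186s) comfortably exceeds the 120s cutoff.
```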

Key Observations

  1. Data is NOT lost: The 504 fires via asyncio.wait_for which cancels the coroutine, but the sync threads continue running in the background. The conversation is eventually created/updated.
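Observation 1 can be reproduced in isolation: `asyncio.wait_for` cancels the awaiting coroutine, but a thread already running in the executor keeps going to completion (a toy sketch, not the real middleware):

```python
import asyncio
import threading
import time

work_done = threading.Event()

def slow_sync_work():
    # Stands in for the sync pipeline's thread work.
    time.sleep(0.2)
    work_done.set()

async def endpoint():
    # The pipeline offloads blocking work to threads and awaits it.
    await asyncio.get_running_loop().run_in_executor(None, slow_sync_work)

async def main():
    timed_out = False
    try:
        # Middleware-style cutoff: cancels the coroutine, not the thread.
        await asyncio.wait_for(endpoint(), timeout=0.05)
    except asyncio.TimeoutError:
        timed_out = True  # this is the 504 path
    finished_at_timeout = work_done.is_set()
    await asyncio.sleep(0.3)
    return timed_out, finished_at_timeout, work_done.is_set()

timed_out, finished_at_timeout, finished_later = asyncio.run(main())
```

The client sees a 504, yet the work still lands, which matches the "conversation eventually created" behavior in prod.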

  2. The 02:36-02:57 UTC Mar 22 cluster (13 failures, 9 IPs in 21 min) suggests Deepgram/LLM contention under concurrent load amplifies latency.

  3. process_segment serializes Deepgram + LLM per segment: Each thread makes a network call (Deepgram), then runs the LLM calls in process_conversation (get_transcript_structure + extract_action_items + folder assignment) back to back. No parallelism within a segment.

  4. VAD hosted API has 300s timeout (vad.py:34) — a single slow VAD response can consume half the budget.
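On observation 3: Deepgram must run first (the LLM calls need the transcript), but the LLM calls are independent of each other and could fan out. A sketch with hypothetical stand-in functions (sleeps mimic network latency):

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the per-segment calls.
def deepgram_transcribe(seg):
    time.sleep(0.05)
    return f"transcript:{seg}"

def get_transcript_structure(transcript):
    time.sleep(0.05)
    return "structure"

def extract_action_items(transcript):
    time.sleep(0.05)
    return "actions"

def assign_folder(transcript):
    time.sleep(0.05)
    return "folder"

def process_segment_parallel(seg):
    # Transcribe first, then fan the three independent LLM calls out
    # instead of running them back to back.
    transcript = deepgram_transcribe(seg)
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = [pool.submit(fn, transcript) for fn in
                   (get_transcript_structure, extract_action_items, assign_folder)]
        return [f.result() for f in futures]

outputs = process_segment_parallel("seg-1")
```

With real 15-30s LLM calls, collapsing three serial calls into one parallel round trims the per-segment LLM time to roughly the slowest single call.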

Instrumentation PR

A PR adds sync_timing structured logs at each stage boundary; these will show exactly where time is spent per request in prod.
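One possible shape for such a stage-boundary log (a sketch; the helper name and field names are assumptions, not necessarily what the PR uses):

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("sync")

@contextmanager
def sync_timing(stage, request_id):
    # Hypothetical helper: emits one structured JSON log line per stage.
    start = time.monotonic()
    try:
        yield
    finally:
        logger.info(json.dumps({
            "event": "sync_timing",
            "stage": stage,
            "request_id": request_id,
            "elapsed_s": round(time.monotonic() - start, 3),
        }))

# Capture log output for demonstration.
records = []

class _ListHandler(logging.Handler):
    def emit(self, record):
        records.append(record.getMessage())

logger.addHandler(_ListHandler())
logger.setLevel(logging.INFO)

with sync_timing("vad", "req-1"):
    time.sleep(0.01)

entry = json.loads(records[0])
```

Wrapping each of the four stages this way yields one queryable line per stage per request.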

Potential Fixes (for discussion, not CTO-verified)

  • Background processing: Return 202 Accepted immediately, process async, notify via FCM when done
  • Per-endpoint timeout override: TimeoutMiddleware.methods_timeout already supports this — set /v1/sync-local-files to 300s
  • Parallelize within segment: Run Deepgram and LLM in parallel where possible
  • Streaming progress: SSE endpoint so the client can follow status without holding the request open
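The first option (202 Accepted) is sketched below with plain asyncio, no framework specifics assumed; the real endpoint would schedule the pipeline and notify via FCM on completion:

```python
import asyncio

results = {}
_tasks = set()

async def process_sync(request_id, payload):
    # Stands in for decode -> VAD -> transcription -> LLM.
    await asyncio.sleep(0.05)
    results[request_id] = len(payload)
    # ...the real fix would fire the FCM notification here.

async def handle_sync_local_files(request_id, payload):
    # Respond immediately with 202; the pipeline runs in the background.
    task = asyncio.create_task(process_sync(request_id, payload))
    _tasks.add(task)                      # keep a strong reference
    task.add_done_callback(_tasks.discard)
    return {"status": 202, "request_id": request_id}

async def main():
    resp = await handle_sync_local_files("req-1", b"x" * 1024)
    accepted_before_done = resp["status"] == 202 and "req-1" not in results
    await asyncio.sleep(0.2)  # let the background task finish
    return accepted_before_done

accepted_before_done = asyncio.run(main())
```

This sidesteps the middleware timeout entirely, since the HTTP response no longer waits on the pipeline.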

Metadata

Labels

backend: Backend Task (python)
bug: Something isn't working
p2: Priority: Important (score 14-21)
