P0: Pusher retry exhaustion cascade — circuit breaker + graceful degradation #6022

@beastoin

Problem

Confirmed root cause of the 3.5h production outage on 2026-03-24 (14:56-18:24 UTC). When pusher service degrades, existing backend-listen sessions enter an infinite retry loop that saturates the event loop and crashes pods.

Incident Timeline

| Time | Event | Error Count |
|---|---|---|
| 14:44 | Pusher keepalive ping timeout errors begin | 81/min |
| 14:46 | Peak pusher errors | 228/min |
| 14:55 | Prometheus scrape drops, pods going unready | 141/min |
| 14:57 | Pods being killed | 14/min |
| 15:20-15:29 | Pusher retry storm during recovery | 409/min peak |

2,350+ pusher 1011 errors in 12 minutes, 3,000+ retry warnings during recovery.

Root Cause — Infinite Pusher Retry Loop

The pusher WS connection uses ping_interval=30, ping_timeout=60 (utils/pusher.py:37-38). When pusher can't respond to pings, the websockets library closes the connection with code 1011 after 60s.

Three concurrent tasks per session (audio_bytes_consume, transcript_consume, pusher_receive) each run on a 1s loop. When they discover the dead pusher connection, they catch ConnectionClosed, set pusher_connected = False, and call connect().

connect() calls connect_to_trigger_pusher(retries=5) with exponential backoff (base 1s, cap 15s). When all 5 retries fail, _connect() swallows the exception (transcribe.py:1456-1457). pusher_connected stays False. One second later, the loop runs again and retries.
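To make the retry timing concrete, here is a minimal sketch of the backoff schedule described above. It assumes the delay doubles on each retry, which the issue does not state explicitly, and `backoff_delays` is a hypothetical helper, not a function in utils/pusher.py.

```python
def backoff_delays(retries: int, base: float, cap: float) -> list[float]:
    """Delay before each retry: base * 2**attempt, clamped to cap.

    Assumption: the backoff factor is 2 (plain doubling).
    """
    return [min(cap, base * (2 ** attempt)) for attempt in range(retries)]


# Settings described above: 5 retries, base 1s, cap 15s.
current = backoff_delays(retries=5, base=1.0, cap=15.0)
print(current, sum(current))  # [1.0, 2.0, 4.0, 8.0, 15.0] 30.0
```

Under that assumption, one exhausted `connect()` call spends about 30s in backoff before the exception is swallowed and the 1s loop starts the next round.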

The loop never stops. The only exit is websocket_active = False (client disconnect, 90s inactivity timeout, or pod kill). For a 2-hour session with pusher down, this means ~36,000 failed connection attempts from a single session.

Scale: Hundreds of sessions × 30 pods × continuous retries = tens of thousands of concurrent retry attempts against dead pusher → event loop saturated → TCP probe timeouts → pod kills → 3.5h outage.

Three Bugs

  1. No circuit breaker — each session independently retries a dead pusher with no coordination
  2. Exception swallowed in _connect(): pusher_connected stays False, retry loop continues forever
  3. No graceful degradation — initial connect failure closes client WS with 1011 (transcribe.py:2611), causing app reconnect storm; ongoing sessions keep retrying forever instead of degrading

Solution — 3-Layer Defense (Codex Validated)

Layer 1: Pod-Level Circuit Breaker (utils/pusher.py)

Module singleton shared across all sessions on a pod:

CLOSED ──(≥20 failures in 30s)──► OPEN ──(60s cooldown)──► HALF_OPEN ──(probe succeeds)──► CLOSED
                                    │                          │
                                    ◄──────(probe fails)───────┘

When OPEN: connect_to_trigger_pusher() returns None immediately — zero retries, zero event loop work. Stops the herd stampede.
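The state diagram above could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: the thresholds (20 failures in 30s, 60s cooldown) come from the diagram, while the method names `allow_request`, `record_failure`, `record_success` and the injectable `clock` are hypothetical. It also assumes all sessions share one asyncio event loop, so no lock is used.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class PusherCircuitBreaker:
    """CLOSED -> OPEN -> HALF_OPEN -> CLOSED, shared by all sessions on a pod."""

    def __init__(self, failure_threshold: int = 20, window_s: float = 30.0,
                 cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures: Deque[float] = deque()  # timestamps of recent failures
        self._opened_at: Optional[float] = None
        self.state = "CLOSED"

    def allow_request(self) -> bool:
        """Return True if the caller may attempt a pusher connection."""
        now = self._clock()
        if self.state == "OPEN":
            if now - self._opened_at >= self.cooldown_s:
                self.state = "HALF_OPEN"  # this caller becomes the single probe
                return True
            return False
        if self.state == "HALF_OPEN":
            return False  # a probe is already in flight
        return True

    def record_success(self) -> None:
        self._failures.clear()
        self._opened_at = None
        self.state = "CLOSED"

    def record_failure(self) -> None:
        now = self._clock()
        if self.state == "OPEN":
            return  # late failure from an attempt started before the trip
        if self.state == "HALF_OPEN":
            self._trip(now)  # probe failed: restart the cooldown
            return
        self._failures.append(now)
        # Keep only failures inside the sliding window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()
        if len(self._failures) >= self.failure_threshold:
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state = "OPEN"
        self._opened_at = now
        self._failures.clear()
```

In this sketch, connect_to_trigger_pusher() would call `allow_request()` first and return None when it is False, then report the outcome of any attempt with `record_success()` or `record_failure()`. A probe that never reports back would leave the breaker in HALF_OPEN, so a real version would need a probe timeout.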

Layer 2: Per-Session Reconnect State Machine (routers/transcribe.py)

Replace 3 scattered auto_reconnect → connect() calls with a single pusher_reconnect_loop() task:

CONNECTED ──(lost)──► RECONNECT_BACKOFF ──(6 failures)──► DEGRADED ──(60s)──► HALF_OPEN_PROBE
     ▲                    │ (backoff: 1s → 60s)                                     │
     └──────────────────────────────────(probe succeeds)────────────────────────────┘

  • _audio_bytes_flush, _transcript_flush, pusher_receive: stop calling connect(), just set pusher_connected = False
  • Single pusher_reconnect_loop() handles all reconnection with backoff
  • After 6 failed attempts → enter DEGRADED mode
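A rough sketch of the per-session state machine and the single reconnect task is shown below. The state names, the 6-failure limit, the 1s to 60s backoff and the 60s degraded wait come from the diagram. Everything else is an assumption for illustration: the `SessionReconnectState` class, the doubling backoff, the `try_connect` and `is_active` callables, and the choice to return to DEGRADED when a probe fails (the diagram only shows the success path).

```python
import asyncio
from enum import Enum


class PusherState(Enum):
    CONNECTED = "connected"
    RECONNECT_BACKOFF = "reconnect_backoff"
    DEGRADED = "degraded"
    HALF_OPEN_PROBE = "half_open_probe"


class SessionReconnectState:
    MAX_FAILURES = 6
    BACKOFF_BASE_S = 1.0
    BACKOFF_CAP_S = 60.0
    DEGRADED_WAIT_S = 60.0

    def __init__(self) -> None:
        self.state = PusherState.CONNECTED
        self.failures = 0

    def on_connection_lost(self) -> None:
        if self.state is PusherState.CONNECTED:
            self.state = PusherState.RECONNECT_BACKOFF
            self.failures = 0

    def next_delay(self) -> float:
        """Seconds to wait before the next connection attempt."""
        if self.state is PusherState.DEGRADED:
            return self.DEGRADED_WAIT_S
        return min(self.BACKOFF_CAP_S, self.BACKOFF_BASE_S * 2 ** self.failures)

    def on_attempt_started(self) -> None:
        if self.state is PusherState.DEGRADED:
            self.state = PusherState.HALF_OPEN_PROBE

    def on_attempt_result(self, ok: bool) -> None:
        if ok:
            self.state = PusherState.CONNECTED
            self.failures = 0
        elif self.state is PusherState.HALF_OPEN_PROBE:
            self.state = PusherState.DEGRADED  # assumption: wait another 60s
        else:
            self.failures += 1
            if self.failures >= self.MAX_FAILURES:
                self.state = PusherState.DEGRADED


async def pusher_reconnect_loop(sm, try_connect, is_active, sleep=asyncio.sleep):
    """The only place a session reconnects; other tasks just flag the loss."""
    while is_active():
        if sm.state is PusherState.CONNECTED:
            await sleep(1.0)  # idle poll; a real loop might wait on an event
            continue
        await sleep(sm.next_delay())
        sm.on_attempt_started()
        sm.on_attempt_result(await try_connect())
```

Keeping the transitions in a plain class separates the timing policy from asyncio, so the policy can be unit-tested without real sleeps or sockets.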

Layer 3: Graceful Degradation

When degraded:

  • DG streaming continues — transcripts still flow to the app
  • Audio buffers stop accumulating for pusher (drop, don't replay)
  • Conversation processing routes to _create_conversation_fallback() (existing fallback at line 689) instead of pending forever
  • Initial connect failure no longer closes client WS — starts in degraded mode instead
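The degraded-mode behaviour listed above might look something like this in isolation. It is only a sketch: `SessionPusherRouting` and its methods are hypothetical names, and the `send_to_pusher` and `fallback` callables stand in for the real pusher path and _create_conversation_fallback().

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SessionPusherRouting:
    degraded: bool = False
    pending_audio: List[bytes] = field(default_factory=list)

    def on_audio(self, chunk: bytes) -> None:
        # Degraded: drop the chunk instead of buffering it for pusher.
        if self.degraded:
            return
        self.pending_audio.append(chunk)

    def enter_degraded(self) -> None:
        self.degraded = True
        self.pending_audio.clear()  # drop, don't replay

    def process_conversation(self, conversation_id: str,
                             send_to_pusher: Callable[[str], str],
                             fallback: Callable[[str], str]) -> str:
        # Route around pusher while degraded so nothing stays pending.
        if self.degraded:
            return fallback(conversation_id)
        return send_to_pusher(conversation_id)
```

The point of the sketch is that degraded mode bounds memory (no growing buffer) and bounds latency (conversations always take some path).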

Before vs After

| Before | After |
|---|---|
| Pusher dies → 300 sessions × 5 retries/sec × 30s each | Breaker trips in <30s → all sessions fail-fast (0ms) |
| Exception swallowed → retry loop forever | 6 failures → DEGRADED → single probe every 60s |
| 3 tasks per session race to reconnect | 1 reconnect loop per session |
| Initial pusher fail → close client WS → app reconnects | Initial fail → degraded mode → DG continues |

Files to Modify

| File | Change |
|---|---|
| backend/utils/pusher.py | Add PusherCircuitBreaker singleton (~50 lines), connect timeout 3s, conservative retry (3 retries, base 250ms, cap 2s) |
| backend/routers/transcribe.py | Remove auto-reconnect from 3 flush functions, add pusher_reconnect_loop() task, add degraded mode with fallback routing, remove 1011 hard close at line 2611 |
| backend/utils/metrics.py | Add circuit breaker state gauge, pusher reconnect counters |
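The connect timeout and conservative retry settings for backend/utils/pusher.py could be sketched as below. This assumes "3 retries" means three attempts in total and that the backoff doubles. `connect_with_retry` and `open_connection` are hypothetical names, and real code would also need to catch the websockets library's own exceptions, which this sketch does not.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def connect_with_retry(open_connection: Callable[[], Awaitable[T]],
                             retries: int = 3, base_s: float = 0.25,
                             cap_s: float = 2.0, timeout_s: float = 3.0,
                             sleep=asyncio.sleep) -> Optional[T]:
    """Try to connect up to `retries` times; return None if all attempts fail."""
    for attempt in range(retries):
        try:
            # Bound each attempt so a hung handshake cannot stall the session.
            return await asyncio.wait_for(open_connection(), timeout=timeout_s)
        except (OSError, asyncio.TimeoutError):
            if attempt < retries - 1:
                await sleep(min(cap_s, base_s * 2 ** attempt))
    return None
```

With these defaults the worst case per call is three 3s timeouts plus 0.75s of backoff, so a caller gets a definite None in under 10s instead of retrying indefinitely.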

Impact

Prevents the exact cascade that caused the 3.5h outage. When pusher goes down:

  • Users see uninterrupted transcription (DG streaming continues)
  • Conversations process via fallback path
  • Circuit breaker stops retry storm within 30s
  • Recovery is automatic when pusher comes back (half-open probe)

Dependencies


by AI for @beastoin
