Problem
Confirmed root cause of the 3.5h production outage on 2026-03-24 (14:56-18:24 UTC). When pusher service degrades, existing backend-listen sessions enter an infinite retry loop that saturates the event loop and crashes pods.
Incident Timeline
| Time | Event | Error Count |
|---|---|---|
| 14:44 | Pusher keepalive ping timeout errors begin | 81/min |
| 14:46 | Peak pusher errors | 228/min |
| 14:55 | Prometheus scrape drops, pods going unready | 141/min |
| 14:57 | Pods being killed | 14/min |
| 15:20-15:29 | Pusher retry storm during recovery | 409/min peak |
2,350+ pusher 1011 errors in 12 minutes, 3,000+ retry warnings during recovery.
Root Cause — Infinite Pusher Retry Loop
The pusher WS connection uses ping_interval=30, ping_timeout=60 (utils/pusher.py:37-38). When pusher can't respond to pings, the websockets library closes the connection with code 1011 after 60s.
Three concurrent tasks per session (audio_bytes_consume, transcript_consume, pusher_receive) each run on a 1s loop. When they discover the dead pusher connection, they catch ConnectionClosed, set pusher_connected = False, and call connect().
connect() calls connect_to_trigger_pusher(retries=5) with exponential backoff (base 1s, cap 15s). When all 5 retries fail, _connect() swallows the exception (transcribe.py:1456-1457). pusher_connected stays False. One second later, the loop runs again and retries.
The loop never stops. The only exit is websocket_active = False (client disconnect, 90s inactivity timeout, or pod kill). For a 2-hour session with pusher down, this means ~36,000 failed connection attempts from a single session (roughly 7,200 one-second loop iterations × 5 retries each).
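The per-session figure can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not a measurement: `backoff_delays` is a hypothetical helper, and the doubling factor is an assumption (the text above only gives base 1s and cap 15s). The second estimate assumes every backoff sleep is actually taken before the loop re-enters, which may not match the real code path.

```python
def backoff_delays(retries: int, base: float, cap: float) -> list[float]:
    """Delay before each retry, doubling from `base` and clamped at `cap` (assumed shape)."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


RETRIES = 5
SESSION_SECONDS = 2 * 60 * 60  # 2-hour session
LOOP_PERIOD_SECONDS = 1.0      # each task loops once per second

delays = backoff_delays(RETRIES, base=1.0, cap=15.0)

# Figure quoted above: one 5-retry burst per 1s loop iteration.
per_iteration_estimate = int(SESSION_SECONDS / LOOP_PERIOD_SECONDS) * RETRIES

# Alternative (assumption): all backoff sleeps elapse, then the loop waits 1s.
cycle_seconds = sum(delays) + LOOP_PERIOD_SECONDS
per_cycle_estimate = int(SESSION_SECONDS // cycle_seconds) * RETRIES

print(delays, per_iteration_estimate, per_cycle_estimate)
```

Either way the count is unbounded in session length, and it is per task: three tasks per session can each drive their own retries.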
Scale: Hundreds of sessions × 30 pods × continuous retries = tens of thousands of concurrent retry attempts against dead pusher → event loop saturated → TCP probe timeouts → pod kills → 3.5h outage.
Three Bugs
- No circuit breaker: each session independently retries a dead pusher with no coordination
- Exception swallowed in _connect(): pusher_connected stays False, retry loop continues forever
- No graceful degradation: initial connect failure closes client WS with 1011 (transcribe.py:2611), causing app reconnect storm; ongoing sessions keep retrying forever instead of degrading
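The swallowed-exception bug can be reproduced in miniature. The sketch below is not the code in transcribe.py: `FakeSession`, `connect_to_pusher` and `consume_loop` are hypothetical stand-ins, the 1s sleep is replaced by a zero-length yield, and an iteration budget plays the role of the client disconnect.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for one backend-listen session."""

    def __init__(self, max_iterations: int) -> None:
        self.websocket_active = True
        self.pusher_connected = False
        self.attempts = 0
        self._remaining = max_iterations  # stands in for client disconnect

    async def connect_to_pusher(self) -> None:
        self.attempts += 1
        raise ConnectionError("pusher unavailable")

    async def _connect(self) -> None:
        try:
            await self.connect_to_pusher()
            self.pusher_connected = True
        except Exception:
            pass  # the bug: the failure is swallowed, the caller learns nothing

    async def consume_loop(self) -> None:
        while self.websocket_active:
            if not self.pusher_connected:
                await self._connect()  # retried on every iteration, forever
            await asyncio.sleep(0)  # the real loop sleeps 1s here
            self._remaining -= 1
            if self._remaining == 0:
                self.websocket_active = False


session = FakeSession(max_iterations=100)
asyncio.run(session.consume_loop())
print(session.attempts, session.pusher_connected)
```

Every iteration produces a new attempt, and nothing short of websocket_active going False ends the loop.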
Solution — 3-Layer Defense (Codex Validated)
Layer 1: Pod-Level Circuit Breaker (utils/pusher.py)
Module singleton shared across all sessions on a pod:
CLOSED ──(≥20 failures in 30s)──► OPEN ──(60s cooldown)──► HALF_OPEN ──(probe succeeds)──► CLOSED
                                   │                           │
                                   ◄───────(probe fails)───────┘
When OPEN: connect_to_trigger_pusher() returns None immediately — zero retries, zero event loop work. Stops the herd stampede.
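A minimal sketch of such a breaker, using the thresholds from the diagram (20 failures in 30s, 60s cooldown). The method names (`allow_request`, `record_success`, `record_failure`) and the injectable `clock` are assumptions for illustration, not the planned API. It assumes all sessions on a pod share one event loop, so no locking is included.

```python
import time
from collections import deque
from enum import Enum
from typing import Callable, Deque


class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class PusherCircuitBreaker:
    """Sketch of a pod-level breaker; thresholds follow the diagram above."""

    def __init__(
        self,
        failure_threshold: int = 20,
        window_seconds: float = 30.0,
        cooldown_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._window = window_seconds
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures: Deque[float] = deque()
        self._opened_at = 0.0
        self.state = BreakerState.CLOSED

    def allow_request(self) -> bool:
        """Return True if a connection attempt may proceed."""
        if self.state is BreakerState.CLOSED:
            return True
        if self.state is BreakerState.OPEN:
            if self._clock() - self._opened_at >= self._cooldown:
                self.state = BreakerState.HALF_OPEN  # admit a single probe
                return True
            return False
        return False  # HALF_OPEN: a probe is already in flight

    def record_success(self) -> None:
        self._failures.clear()
        self.state = BreakerState.CLOSED

    def record_failure(self) -> None:
        now = self._clock()
        if self.state is BreakerState.OPEN:
            return  # late failure from an attempt that started before the trip
        if self.state is BreakerState.HALF_OPEN:
            self._trip(now)  # failed probe reopens the breaker
            return
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self._window:
            self._failures.popleft()  # forget failures outside the window
        if len(self._failures) >= self._threshold:
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state = BreakerState.OPEN
        self._opened_at = now
        self._failures.clear()


# Demo with a fake clock: 20 failures at t=0 trip the breaker.
now = [0.0]
breaker = PusherCircuitBreaker(clock=lambda: now[0])
for _ in range(20):
    breaker.record_failure()
print(breaker.state, breaker.allow_request())
```

Under this sketch, connect_to_trigger_pusher() would check allow_request() first and return None when it is False, then report the outcome of any attempt it does make.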
Layer 2: Per-Session Reconnect State Machine (routers/transcribe.py)
Replace 3 scattered auto_reconnect → connect() calls with a single pusher_reconnect_loop() task:
CONNECTED ──(lost)──► RECONNECT_BACKOFF ──(6 failures)──► DEGRADED ──(60s)──► HALF_OPEN_PROBE
    ▲                   │ (backoff: 1s → 60s)                                        │
    └────────────────────────────────(probe succeeds)────────────────────────────────┘
- _audio_bytes_flush, _transcript_flush, pusher_receive: stop calling connect(), just set pusher_connected = False
- Single pusher_reconnect_loop() handles all reconnection with backoff
- After 6 failed attempts → enter DEGRADED mode
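The state machine above could be sketched roughly as follows. The signature is hypothetical: `try_connect`, `sleep` and `on_state` are injected so the sketch runs without a real pusher, and the doubling backoff is an assumption (the diagram only gives the 1s → 60s range). It covers one loss-to-recovery episode; the real task would presumably also watch pusher_connected and websocket_active.

```python
import asyncio
from enum import Enum
from typing import Awaitable, Callable, List


class ReconnectState(Enum):
    CONNECTED = "connected"
    RECONNECT_BACKOFF = "reconnect_backoff"
    DEGRADED = "degraded"
    HALF_OPEN_PROBE = "half_open_probe"


async def pusher_reconnect_loop(
    try_connect: Callable[[], Awaitable[bool]],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
    on_state: Callable[[ReconnectState], None] = lambda state: None,
    max_failures: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    probe_interval: float = 60.0,
) -> None:
    """Run after the connection is lost; return once it is re-established."""
    on_state(ReconnectState.RECONNECT_BACKOFF)
    for attempt in range(max_failures):
        await sleep(min(max_delay, base_delay * 2 ** attempt))
        if await try_connect():
            on_state(ReconnectState.CONNECTED)
            return
    # Six consecutive failures: stop hammering pusher, probe occasionally.
    while True:
        on_state(ReconnectState.DEGRADED)
        await sleep(probe_interval)
        on_state(ReconnectState.HALF_OPEN_PROBE)
        if await try_connect():
            on_state(ReconnectState.CONNECTED)
            return


# Demo: seven failed attempts, then success on the second probe.
results = iter([False] * 7 + [True])
sleeps: List[float] = []
states: List[ReconnectState] = []


async def fake_connect() -> bool:
    return next(results)


async def fake_sleep(seconds: float) -> None:
    sleeps.append(seconds)


asyncio.run(pusher_reconnect_loop(fake_connect, fake_sleep, states.append))
print(sleeps, [state.name for state in states])
```

With one loop per session, the three consumer tasks no longer race each other to reconnect.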
Layer 3: Graceful Degradation
When degraded:
- DG streaming continues — transcripts still flow to the app
- Audio buffers stop accumulating for pusher (drop, don't replay)
- Conversation processing routes to _create_conversation_fallback() (existing fallback at line 689) instead of pending forever
- Initial connect failure no longer closes client WS — starts in degraded mode instead
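A small sketch of the routing decisions listed above. `PusherRouting` and its callbacks are hypothetical and only illustrate the two branches (drop audio, route to fallback); the real session state in transcribe.py is structured differently.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PusherRouting:
    """Hypothetical per-session helper; not the actual transcribe.py structure."""

    send_to_pusher: Callable[[bytes], None]
    process_via_pusher: Callable[[str], None]
    process_via_fallback: Callable[[str], None]  # e.g. _create_conversation_fallback
    degraded: bool = False
    dropped_chunks: int = 0

    def on_audio(self, chunk: bytes) -> None:
        if self.degraded:
            self.dropped_chunks += 1  # drop, don't buffer for replay
            return
        self.send_to_pusher(chunk)

    def on_conversation_ready(self, conversation_id: str) -> None:
        if self.degraded:
            self.process_via_fallback(conversation_id)
        else:
            self.process_via_pusher(conversation_id)


# Demo: one chunk sent while healthy, one dropped while degraded.
sent: List[bytes] = []
via_pusher: List[str] = []
via_fallback: List[str] = []
routing = PusherRouting(sent.append, via_pusher.append, via_fallback.append)

routing.on_audio(b"a")
routing.degraded = True
routing.on_audio(b"b")
routing.on_conversation_ready("conv-1")
print(sent, routing.dropped_chunks, via_fallback)
```

Dropping instead of buffering keeps memory flat for the duration of a pusher outage.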
Before vs After
| Before | After |
|---|---|
| Pusher dies → 300 sessions × 5 retries/sec × 30s each | Breaker trips in <30s → all sessions fail-fast (0ms) |
| Exception swallowed → retry loop forever | 6 failures → DEGRADED → single probe every 60s |
| 3 tasks per session race to reconnect | 1 reconnect loop per session |
| Initial pusher fail → close client WS → app reconnects | Initial fail → degraded mode → DG continues |
Files to Modify
| File | Change |
|---|---|
| backend/utils/pusher.py | Add PusherCircuitBreaker singleton (~50 lines), connect timeout 3s, conservative retry (3 retries, base 250ms, cap 2s) |
| backend/routers/transcribe.py | Remove auto-reconnect from 3 flush functions, add pusher_reconnect_loop() task, add degraded mode with fallback routing, remove 1011 hard close at line 2611 |
| backend/utils/metrics.py | Add circuit breaker state gauge, pusher reconnect counters |
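The connect timeout and conservative retry planned for utils/pusher.py might look roughly like this. `connect_with_retry` is a hypothetical name, "3 retries" is read here as three attempts in total, and the doubling backoff is assumed. Only OSError and timeouts are caught; real code would likely also need the websockets library's own exception types.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, TypeVar

T = TypeVar("T")


async def connect_with_retry(
    connect: Callable[[], Awaitable[T]],
    retries: int = 3,
    base_delay: float = 0.25,
    max_delay: float = 2.0,
    connect_timeout: float = 3.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Optional[T]:
    """Try `connect` up to `retries` times; return None if every attempt fails."""
    for attempt in range(retries):
        try:
            # Bound each attempt so a hung handshake cannot stall the caller.
            return await asyncio.wait_for(connect(), timeout=connect_timeout)
        except (OSError, asyncio.TimeoutError):
            if attempt < retries - 1:  # no sleep after the final attempt
                await sleep(min(max_delay, base_delay * 2 ** attempt))
    return None


# Demo: two refused connections, then success.
outcomes = iter([ConnectionError("down"), ConnectionError("down"), "ws"])
sleeps: List[float] = []


async def flaky_connect() -> str:
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


async def record_sleep(seconds: float) -> None:
    sleeps.append(seconds)


result = asyncio.run(connect_with_retry(flaky_connect, sleep=record_sleep))
print(result, sleeps)
```

Under these assumptions a fully failing call is bounded by three 3s timeouts plus 0.75s of backoff, instead of the roughly 30s of backoff per cycle in the current retry settings.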
Impact
Prevents the exact cascade that caused the 3.5h outage. When pusher goes down:
- Users see uninterrupted transcription (DG streaming continues)
- Conversations process via fallback path
- Circuit breaker stops retry storm within 30s
- Recovery is automatic when pusher comes back (half-open probe)
Dependencies
- asyncio.to_thread for DG connect
by AI for @beastoin