P0: Pusher retry exhaustion cascade — circuit breaker + graceful degradation #6022

### Problem

This is the confirmed root cause of the 3.5h production outage on 2026-03-24 (14:56-18:24 UTC). When the pusher service degrades, existing backend-listen sessions enter an **infinite retry loop** that saturates the event loop and crashes pods.

### Incident Timeline

| Time | Event | Error Count |
|------|-------|-------------|
| 14:44 | Pusher `keepalive ping timeout` errors begin | 81/min |
| 14:46 | Peak pusher errors | 228/min |
| 14:55 | Prometheus scrape drops, pods going unready | 141/min |
| 14:57 | Pods being killed | 14/min |
| 15:20-15:29 | Pusher retry storm during recovery | 409/min peak |

**2,350+ pusher 1011 errors** in 12 minutes, **3,000+ retry warnings** during recovery.

### Root Cause — Infinite Pusher Retry Loop

The pusher WS connection uses `ping_interval=30, ping_timeout=60` (`utils/pusher.py:37-38`). When pusher can't respond to pings, the `websockets` library closes the connection with code 1011 after 60s.

Three concurrent tasks per session (`audio_bytes_consume`, `transcript_consume`, `pusher_receive`) each run on a 1s loop. When they discover the dead pusher connection, they catch `ConnectionClosed`, set `pusher_connected = False`, and call `connect()`.

`connect()` calls `connect_to_trigger_pusher(retries=5)` with exponential backoff (base 1s, cap 15s). When all 5 retries fail, `_connect()` **swallows the exception** (`transcribe.py:1456-1457`). `pusher_connected` stays `False`. One second later, the loop runs again and retries.
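To make the timing of one retry cycle concrete, here is a minimal sketch of the backoff schedule. It assumes plain doubling with no jitter and one delay per attempt; `backoff_delays` is a hypothetical helper for illustration, not a function in `utils/pusher.py`.

```python
def backoff_delays(retries: int = 5, base: float = 1.0, cap: float = 15.0) -> list:
    """Delay for each retry: base * 2**attempt, clamped to cap (assumed formula)."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


# Current settings described above: 5 retries, base 1s, cap 15s.
current = backoff_delays()
# current == [1.0, 2.0, 4.0, 8.0, 15.0], sum(current) == 30.0

# Settings proposed under "Files to Modify": 3 retries, base 250ms, cap 2s.
proposed = backoff_delays(retries=3, base=0.25, cap=2.0)
# proposed == [0.25, 0.5, 1.0], sum(proposed) == 1.75
```

Under these assumptions one full cycle spends about 30s in backoff, which agrees with the "30s each" figure in the Before vs After table, while the proposed settings would give up in under 2s.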

**The loop never stops.** The only exit is `websocket_active = False` (client disconnect, 90s inactivity timeout, or pod kill). For a 2-hour session with pusher down, this means ~36,000 failed connection attempts from a single session.

**Scale**: Hundreds of sessions × 30 pods × continuous retries = tens of thousands of concurrent retry attempts against dead pusher → event loop saturated → TCP probe timeouts → pod kills → 3.5h outage.

### Three Bugs

1. **No circuit breaker** — each session independently retries a dead pusher with no coordination
2. **Exception swallowed in `_connect()`** — `pusher_connected` stays False, retry loop continues forever
3. **No graceful degradation** — initial connect failure closes client WS with 1011 (`transcribe.py:2611`), causing app reconnect storm; ongoing sessions keep retrying forever instead of degrading

### Solution — 3-Layer Defense (Codex Validated)

#### Layer 1: Pod-Level Circuit Breaker (`utils/pusher.py`)

Module singleton shared across all sessions on a pod:

```
CLOSED ──(≥20 failures in 30s)──► OPEN ──(60s cooldown)──► HALF_OPEN ──(probe succeeds)──► CLOSED
                                    │                          │
                                    ◄──────(probe fails)───────┘
```

When OPEN, `connect_to_trigger_pusher()` returns `None` immediately: zero retries, zero event loop work. This stops the thundering herd of sessions all retrying at once.
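The state diagram above could be implemented roughly as follows. This is a sketch under stated assumptions, not the final `PusherCircuitBreaker`: the method names (`allow_request`, `record_success`, `record_failure`) and the injectable `clock` are hypothetical, and only the thresholds (20 failures in 30s, 60s cooldown) come from the proposal.

```python
import time
from collections import deque
from enum import Enum


class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class PusherCircuitBreaker:
    """Sketch: trips after `threshold` failures within `window` seconds."""

    def __init__(self, threshold=20, window=30.0, cooldown=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.cooldown = cooldown
        self._clock = clock  # injectable so the breaker can be tested without waiting
        self._failures = deque()  # timestamps of recent failures
        self._opened_at = 0.0
        self.state = BreakerState.CLOSED

    def allow_request(self) -> bool:
        """Return True if the caller may attempt a pusher connection."""
        if self.state is BreakerState.CLOSED:
            return True
        if self.state is BreakerState.OPEN:
            if self._clock() - self._opened_at >= self.cooldown:
                self.state = BreakerState.HALF_OPEN  # this caller becomes the probe
                return True
            return False
        return False  # HALF_OPEN: a probe is already in flight, fail fast

    def record_success(self) -> None:
        self._failures.clear()
        self.state = BreakerState.CLOSED

    def record_failure(self) -> None:
        now = self._clock()
        if self.state is BreakerState.OPEN:
            return  # late failure from an attempt started before the trip
        if self.state is BreakerState.HALF_OPEN:
            self._trip(now)  # probe failed, restart the cooldown
            return
        self._failures.append(now)
        while self._failures and now - self._failures[0] > self.window:
            self._failures.popleft()  # forget failures outside the window
        if len(self._failures) >= self.threshold:
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state = BreakerState.OPEN
        self._opened_at = now
        self._failures.clear()


# Module-level singleton, shared by every session on the pod.
pusher_breaker = PusherCircuitBreaker()
```

In this sketch `connect_to_trigger_pusher()` would check `pusher_breaker.allow_request()` first and return `None` when it is `False`. Because all sessions share one asyncio event loop, no lock is used; a production version would likely also need a timeout so a cancelled probe cannot leave the breaker stuck in HALF_OPEN.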

#### Layer 2: Per-Session Reconnect State Machine (`routers/transcribe.py`)

Replace 3 scattered `auto_reconnect → connect()` calls with a single `pusher_reconnect_loop()` task:

```
CONNECTED ──(lost)──► RECONNECT_BACKOFF ──(6 failures)──► DEGRADED ──(60s)──► HALF_OPEN_PROBE
     ▲                    │ (backoff: 1s → 60s)                                     │
     └──────────────────────────────────(probe succeeds)────────────────────────────┘
```

- `_audio_bytes_flush`, `_transcript_flush`, `pusher_receive` — stop calling `connect()`, just set `pusher_connected = False`
- Single `pusher_reconnect_loop()` handles all reconnection with backoff
- After 6 failed attempts → enter DEGRADED mode
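A minimal sketch of what the single reconnect task might look like is shown here. The `Session` class, the `try_connect` callback, and the injectable `sleep` are hypothetical stand-ins for the real state in `routers/transcribe.py`; the 6-failure limit, the 60s backoff cap, and the 60s probe interval are taken from the diagram above, and plain doubling is assumed for the backoff.

```python
import asyncio
from enum import Enum
from typing import Awaitable, Callable


class PusherState(Enum):
    CONNECTED = "connected"
    RECONNECT_BACKOFF = "reconnect_backoff"
    DEGRADED = "degraded"
    HALF_OPEN_PROBE = "half_open_probe"


class Session:
    """Hypothetical stand-in for the per-session flags in transcribe.py."""

    def __init__(self) -> None:
        self.websocket_active = True
        self.pusher_connected = False
        self.state = PusherState.RECONNECT_BACKOFF


async def pusher_reconnect_loop(
    session: Session,
    try_connect: Callable[[], Awaitable[bool]],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
    max_failures: int = 6,
    base: float = 1.0,
    cap: float = 60.0,
    probe_interval: float = 60.0,
) -> None:
    """Only this task reconnects; the flush tasks just clear pusher_connected."""
    failures = 0
    while session.websocket_active:
        if session.pusher_connected:
            session.state = PusherState.CONNECTED
            failures = 0
            await sleep(1.0)  # poll until another task reports the connection lost
            continue
        if session.state is PusherState.CONNECTED:
            session.state = PusherState.RECONNECT_BACKOFF
        if session.state is PusherState.DEGRADED:
            await sleep(probe_interval)  # one probe per interval, no retry burst
            if not session.websocket_active:
                break
            session.state = PusherState.HALF_OPEN_PROBE
        if await try_connect():
            session.pusher_connected = True
            session.state = PusherState.CONNECTED
            continue
        if session.state is PusherState.HALF_OPEN_PROBE:
            session.state = PusherState.DEGRADED  # probe failed, stay degraded
            continue
        failures += 1
        if failures >= max_failures:
            session.state = PusherState.DEGRADED
        else:
            await sleep(min(cap, base * 2 ** (failures - 1)))
```

The loop exits as soon as `websocket_active` becomes `False`, so it has the same lifetime as the session. Unlike the current behavior, a session that cannot reach pusher settles at one connection attempt per `probe_interval` instead of retrying every second.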

#### Layer 3: Graceful Degradation

When degraded:
- **DG streaming continues** — transcripts still flow to the app
- **Audio buffers stop accumulating** for pusher (drop, don't replay)
- **Conversation processing routes to `_create_conversation_fallback()`** (existing fallback at line 689) instead of pending forever
- **Initial connect failure no longer closes client WS** — starts in degraded mode instead
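The routing rules in this list can be illustrated with a small sketch. `PusherRouter` and its callbacks are hypothetical names used only for illustration; in particular, discarding audio that was already buffered when the session enters degraded mode is an assumption based on "drop, don't replay".

```python
from collections import deque
from typing import Callable, Deque


class PusherRouter:
    """Hypothetical per-session router; the real logic lives in transcribe.py."""

    def __init__(
        self,
        send_to_pusher: Callable[[str], str],
        fallback: Callable[[str], str],
    ) -> None:
        self.degraded = False
        self.audio_buffer: Deque[bytes] = deque()
        self.dropped_bytes = 0  # candidate for a metrics counter
        self._send_to_pusher = send_to_pusher
        self._fallback = fallback  # stands in for _create_conversation_fallback()

    def enter_degraded(self) -> None:
        self.degraded = True
        # Assumption: audio buffered for pusher is discarded, not replayed later.
        self.dropped_bytes += sum(len(chunk) for chunk in self.audio_buffer)
        self.audio_buffer.clear()

    def exit_degraded(self) -> None:
        self.degraded = False

    def on_audio(self, chunk: bytes) -> None:
        if self.degraded:
            self.dropped_bytes += len(chunk)  # drop so memory stays bounded
            return
        self.audio_buffer.append(chunk)

    def process_conversation(self, conversation_id: str) -> str:
        if self.degraded:
            return self._fallback(conversation_id)
        return self._send_to_pusher(conversation_id)
```

Keeping the degraded check in one place means the transcript path never has to know whether pusher is reachable, which is what lets DG streaming continue untouched.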

### Before vs After

| Before | After |
|--------|-------|
| Pusher dies → 300 sessions × 5 retries/sec × 30s each | Breaker trips in <30s → all sessions fail-fast (0ms) |
| Exception swallowed → retry loop forever | 6 failures → DEGRADED → single probe every 60s |
| 3 tasks per session race to reconnect | 1 reconnect loop per session |
| Initial pusher fail → close client WS → app reconnects | Initial fail → degraded mode → DG continues |

### Files to Modify

| File | Change |
|------|--------|
| `backend/utils/pusher.py` | Add `PusherCircuitBreaker` singleton (~50 lines), connect timeout 3s, conservative retry (3 retries, base 250ms, cap 2s) |
| `backend/routers/transcribe.py` | Remove auto-reconnect from 3 flush functions, add `pusher_reconnect_loop()` task, add degraded mode with fallback routing, remove 1011 hard close at line 2611 |
| `backend/utils/metrics.py` | Add circuit breaker state gauge, pusher reconnect counters |

### Impact

Prevents the exact cascade that caused the 3.5h outage. When pusher goes down:
- Users see uninterrupted transcription (DG streaming continues)
- Conversations process via fallback path
- Circuit breaker stops retry storm within 30s
- Recovery is automatic when pusher comes back (half-open probe)

### Dependencies

- PR #6021 (merged) — `asyncio.to_thread` for DG connect
- No new pip packages required

---
_by AI for @beastoin_
