When Deepgram hits its concurrent connection limit (or any DG outage), backend-listen closes the client WebSocket with 1011. The client immediately reconnects, spawning a new session that tries DG again — creating a self-amplifying reconnect storm that escalated the Mar 25 incident from a DG capacity issue into a 2.5-hour total outage. Related: #6030 (pusher circuit breaker).
## Current Behavior

- `connect_to_deepgram_with_backoff()` retries 3x with exponential backoff (~8-10s total); if all attempts fail, the exception propagates up and the session closes with code 1011 (`transcribe.py:1125`)
- Client reconnects on 1011, creating a new session → new DG connection attempt → rejected again → infinite loop
- Each failed cycle also creates a new pusher connection (CB stays CLOSED because `record_success()` clears failures), producing zombie connections
- No fallback to alternative STT providers when DG is unavailable
- Existing DG connections stay alive via the `SafeDeepgramSocket` keepalive thread, so old connections are never released while new ones are rejected
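The retry-then-raise path above can be sketched as follows. `connect_with_backoff`, `DeepgramConnectionError`, and the delay constants are illustrative stand-ins for the actual `connect_to_deepgram_with_backoff()` in `streaming.py`, not its real code:

```python
import asyncio
import random


class DeepgramConnectionError(Exception):
    """Raised once every connection attempt has been exhausted."""


async def connect_with_backoff(connect, retries=3, base_delay=2.0):
    """Retry `connect` with exponential backoff; raise after the last attempt.

    With base_delay=2.0 the sleeps between attempts are roughly 2s + 4s,
    which plus per-attempt connect time is in the ~8-10s range cited above.
    When the exception propagates, the caller closes the session with 1011,
    which is what triggers the client reconnect loop.
    """
    for attempt in range(retries):
        try:
            return await connect()
        except Exception:
            if attempt == retries - 1:
                raise DeepgramConnectionError("Deepgram unavailable after retries")
            # Exponential backoff with a little jitter to avoid lockstep retries.
            await asyncio.sleep(base_delay * (2 ** attempt) * (1 + random.random() * 0.1))
```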
## Expected Behavior
When DG connections fail, keep the client WebSocket open in a degraded transcription mode and retry DG server-side with backoff — breaking the client reconnect storm feedback loop.
## Affected Areas

| File | Line | Description |
|------|------|-------------|
| `backend/utils/stt/streaming.py` | 412-440 | `connect_to_deepgram_with_backoff()` — raises on exhaustion, no CB |
| `backend/routers/transcribe.py` | 1123-1127 | Exception handler closes session with 1011 on DG init failure |
| `backend/routers/transcribe.py` | 710-729 | `_create_conversation_fallback()` — sync `process_conversation()` blocks event loop |
| `backend/utils/pusher.py` | 35-103 | Existing `PusherCircuitBreaker` — pattern to follow for DG |
| `backend/utils/stt/safe_socket.py` | 75-84 | Keepalive thread holds DG connections alive indefinitely |
## Solution

- **DG circuit breaker** (pod-level singleton, same pattern as `PusherCircuitBreaker`): track DG connection failures across sessions. When the failure rate exceeds a threshold, trip OPEN and fast-fail new DG attempts instead of burning 8-10s on retries that will fail.
- **Degraded transcription mode**: when the DG CB is OPEN, keep the client WS alive, send a `service_status: stt_degraded` event to the client, buffer audio, and retry DG server-side with jitter + capped concurrency on the HALF_OPEN probe.
- **Don't close with 1011 on DG failure**: remove the `websocket.close(code=1011)` path for STT init errors. Instead, enter degraded mode and keep the session open.
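A minimal sketch of the proposed breaker, following the CLOSED → OPEN → HALF_OPEN state machine described above. The class name matches the proposal, but all thresholds, method names, and the probe-capping scheme are illustrative assumptions, not taken from `PusherCircuitBreaker`'s actual implementation:

```python
import threading
import time


class DeepgramCircuitBreaker:
    """Pod-level singleton tracking DG connection failures across sessions.

    CLOSED: attempts pass through normally.
    OPEN: fast-fail new attempts (no 8-10s retry burn) until a cooldown expires.
    HALF_OPEN: allow a capped number of probe connections to test recovery.
    """

    def __init__(self, failure_threshold=5, open_seconds=30.0, max_probes=2):
        self._lock = threading.Lock()
        self._failures = 0
        self._state = "CLOSED"
        self._opened_at = 0.0
        self._probes = 0
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.max_probes = max_probes

    def allow_attempt(self) -> bool:
        with self._lock:
            if self._state == "CLOSED":
                return True
            if self._state == "OPEN":
                if time.monotonic() - self._opened_at >= self.open_seconds:
                    self._state = "HALF_OPEN"  # cooldown over, start probing
                    self._probes = 0
                else:
                    return False  # fast-fail while OPEN
            # HALF_OPEN: cap concurrent probe connections
            if self._probes < self.max_probes:
                self._probes += 1
                return True
            return False

    def record_success(self) -> None:
        with self._lock:
            self._failures = 0
            self._state = "CLOSED"

    def record_failure(self) -> None:
        with self._lock:
            self._failures += 1
            if self._state == "HALF_OPEN" or self._failures >= self.failure_threshold:
                self._state = "OPEN"
                self._opened_at = time.monotonic()
```

A failed HALF_OPEN probe trips straight back to OPEN, so a still-down DG costs at most `max_probes` attempts per cooldown window instead of one per client reconnect.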
## Files to Modify

- `backend/utils/stt/streaming.py` — add `DeepgramCircuitBreaker` singleton, check before connect attempts
- `backend/routers/transcribe.py` — replace 1011 close with degraded mode entry on DG failure, add server-side DG retry loop
- `backend/utils/stt/safe_socket.py` — release DG connection on CB trip (call `finish()` to free connection slots)
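The server-side retry loop that replaces the 1011 close might look like this sketch: consult the breaker before each attempt, back off with full jitter between attempts, and hand back a live socket once a probe succeeds. `retry_deepgram`, the breaker interface, and the backoff constants are assumptions for illustration:

```python
import asyncio
import random


async def retry_deepgram(connect, breaker, base=1.0, cap=30.0):
    """Server-side DG retry for degraded mode.

    Respects the circuit breaker (OPEN means skip the attempt entirely),
    sleeps a random fraction of a capped, doubling delay ("full jitter"),
    and returns the socket once a connection succeeds. The client WS stays
    open the whole time; audio is buffered elsewhere in the session.
    """
    delay = base
    while True:
        if breaker.allow_attempt():
            try:
                sock = await connect()
                breaker.record_success()
                return sock
            except Exception:
                breaker.record_failure()
        # Full jitter keeps a pod's worth of sessions from probing in lockstep.
        await asyncio.sleep(random.uniform(0, min(cap, delay)))
        delay = min(cap, delay * 2)
```

Full jitter (rather than plain exponential backoff) matters here because every degraded session on the pod runs this loop; without it they would all probe DG at the same instant when the breaker half-opens.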
## Impact
Breaks the self-amplifying feedback loop that turned a DG capacity limit into a full outage. Client sessions stay alive during DG degradation, eliminating the reconnect storm. No impact on normal operation — CB stays CLOSED when DG is healthy.
by AI for @beastoin