Add STT circuit breaker + degraded mode for Deepgram connection failures #6052

@beastoin

Description

When Deepgram (DG) hits its concurrent connection limit, or during any other DG outage, backend-listen closes the client WebSocket with code 1011. The client immediately reconnects, spawning a new session that tries DG again. The result is a self-amplifying reconnect storm, which escalated the Mar 25 incident from a DG capacity issue into a 2.5-hour total outage. Related: #6030 (pusher circuit breaker).

Current Behavior

  • connect_to_deepgram_with_backoff() retries 3x with exponential backoff (~8-10s total). If all attempts fail, the exception propagates up and the session closes with code 1011 (transcribe.py:1125)
  • The client reconnects on 1011, creating a new session → new DG connection attempt → rejected again, in an infinite loop
  • Each failed cycle also creates a new pusher connection (the pusher circuit breaker, CB, stays CLOSED because record_success() clears its failure count), producing zombie connections
  • No fallback to alternative STT providers when DG is unavailable
  • Existing DG connections stay alive via SafeDeepgramSocket keepalive thread, so old connections are never released while new ones are rejected

Expected Behavior

When DG connections fail, keep the client WebSocket open in a degraded transcription mode and retry DG server-side with backoff — breaking the client reconnect storm feedback loop.
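
The behavior described above can be sketched as a small asyncio helper. This is a minimal illustration under stated assumptions, not the backend's actual code: `run_degraded_mode`, the `connect` and `send_event` callables, the event payload shape, and the `stt_recovered` status are all hypothetical names chosen for the example. Only the `service_status` / `stt_degraded` pairing comes from this issue.

```python
import asyncio
import random
from collections import deque


async def run_degraded_mode(connect, send_event, audio_buffer, *,
                            base_delay=1.0, max_delay=30.0,
                            max_attempts=None, sleep=asyncio.sleep):
    """Retry the STT connection server-side while the client socket stays open.

    connect:      hypothetical async callable; returns an STT socket or
                  raises ConnectionError.
    send_event:   hypothetical async callable that pushes a JSON-like event
                  to the client.
    audio_buffer: a deque (ideally bounded with maxlen) holding audio chunks
                  received while STT is unavailable.
    """
    # Tell the client that transcription is degraded instead of closing with 1011.
    await send_event({"type": "service_status", "status": "stt_degraded"})

    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        try:
            stt = await connect()
        except ConnectionError:
            attempt += 1
            # Exponential backoff with full jitter, capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            await sleep(random.uniform(0, cap))
            continue

        # Connected: flush buffered audio in arrival order, then notify the client.
        while audio_buffer:
            await stt.send(audio_buffer.popleft())
        await send_event({"type": "service_status", "status": "stt_recovered"})
        return stt

    # Gave up after max_attempts; the caller decides what to do next.
    return None
```

Using a bounded `deque(maxlen=...)` for `audio_buffer` keeps memory flat during a long outage, at the cost of dropping the oldest audio. Whether that trade-off is acceptable is a product decision this sketch does not settle.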

Affected Areas

| File | Line | Description |
| --- | --- | --- |
| backend/utils/stt/streaming.py | 412-440 | connect_to_deepgram_with_backoff(): raises on exhaustion, no CB |
| backend/routers/transcribe.py | 1123-1127 | Exception handler closes session with 1011 on DG init failure |
| backend/routers/transcribe.py | 710-729 | _create_conversation_fallback(): SYNC process_conversation() blocks event loop |
| backend/utils/pusher.py | 35-103 | Existing PusherCircuitBreaker: pattern to follow for DG |
| backend/utils/stt/safe_socket.py | 75-84 | Keepalive thread holds DG connections alive indefinitely |

Solution

  1. DG circuit breaker (pod-level singleton, same pattern as PusherCircuitBreaker): track DG connection failures across sessions. When failure rate exceeds threshold, trip OPEN and fast-fail new DG attempts instead of burning 8-10s on retries that will fail.
  2. Degraded transcription mode: when DG CB is OPEN, keep client WS alive, send a service_status: stt_degraded event to the client, buffer audio, and retry DG server-side with jitter + capped concurrency on HALF_OPEN probe.
  3. Don't close with 1011 on DG failure: remove the websocket.close(code=1011) path for STT init errors. Instead, enter degraded mode and keep the session open.
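
As a rough illustration of item 1, a three-state breaker could look like the sketch below. It is not modeled on the real PusherCircuitBreaker, whose code is not shown here. The method names `allow_request` and `record_failure`, the thresholds, and the injectable `clock` are assumptions. For brevity it trips on a count of consecutive failures rather than a true failure rate.

```python
import threading
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class DeepgramCircuitBreaker:
    """Hypothetical pod-level breaker; thresholds are illustrative only."""

    def __init__(self, failure_threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self._lock = threading.Lock()
        self._failure_threshold = failure_threshold
        self._cooldown_s = cooldown_s
        self._clock = clock  # injectable so tests can control time
        self._state = State.CLOSED
        self._failures = 0
        self._opened_at = 0.0
        self._probe_in_flight = False

    @property
    def state(self):
        with self._lock:
            return self._state

    def allow_request(self):
        """Return True if a DG connection attempt may proceed right now."""
        with self._lock:
            if self._state is State.CLOSED:
                return True
            if self._state is State.OPEN:
                if self._clock() - self._opened_at < self._cooldown_s:
                    return False  # fast-fail instead of spending seconds on retries
                self._state = State.HALF_OPEN
                self._probe_in_flight = False
            # HALF_OPEN: cap concurrency at a single probe.
            if self._probe_in_flight:
                return False
            self._probe_in_flight = True
            return True

    def record_success(self):
        with self._lock:
            self._state = State.CLOSED
            self._failures = 0
            self._probe_in_flight = False

    def record_failure(self):
        with self._lock:
            self._probe_in_flight = False
            self._failures += 1
            # A failed probe re-opens immediately; otherwise trip at the threshold.
            if self._state is State.HALF_OPEN or self._failures >= self._failure_threshold:
                self._state = State.OPEN
                self._opened_at = self._clock()
```

A caller would check `allow_request()` before each connection attempt, enter degraded mode when it returns False, and report the outcome of every real attempt through `record_success()` or `record_failure()`.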

Files to Modify

  • backend/utils/stt/streaming.py — add DeepgramCircuitBreaker singleton, check before connect attempts
  • backend/routers/transcribe.py — replace 1011 close with degraded mode entry on DG failure, add server-side DG retry loop
  • backend/utils/stt/safe_socket.py — release DG connection on CB trip (call finish() to free connection slots)
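
One way to wire the safe_socket.py change is a small registry that the breaker can drain when it trips. The sketch below is hypothetical throughout: `DeepgramSocketRegistry` and its methods are invented for illustration, and the only thing taken from this issue is that a socket exposes `finish()`. Which sockets should be released on a trip (all of them, or only idle ones) is left open here; the sketch simply releases whatever was registered.

```python
import logging
import threading

logger = logging.getLogger(__name__)


class DeepgramSocketRegistry:
    """Hypothetical registry of live DG sockets that can be released in bulk."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sockets = set()

    def register(self, sock):
        with self._lock:
            self._sockets.add(sock)

    def unregister(self, sock):
        with self._lock:
            self._sockets.discard(sock)

    def release_all(self):
        """Call finish() on every registered socket; return how many succeeded."""
        # Swap the set out under the lock, then call finish() outside it so a
        # slow or blocking finish() cannot stall register()/unregister().
        with self._lock:
            sockets, self._sockets = list(self._sockets), set()

        released = 0
        for sock in sockets:
            try:
                sock.finish()
                released += 1
            except Exception:
                # One bad socket should not prevent releasing the others.
                logger.exception("failed to release DG socket")
        return released
```

The breaker's trip path could call `release_all()` once, right after moving to OPEN, so that connection slots held open by keepalive are freed before the first HALF_OPEN probe.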

Impact

Breaks the self-amplifying feedback loop that turned a DG capacity limit into a full outage. Client sessions stay alive during DG degradation, eliminating the reconnect storm. No impact on normal operation — CB stays CLOSED when DG is healthy.


by AI for @beastoin

Metadata

Labels

  • backend: Backend Task (python)
  • bug: Something isn't working
  • p1: Priority: Critical (score 22-29)
