Fix DG keepalive thread permanently killing connection (#5870)#5871
Fix DG keepalive thread permanently killing connection (#5870)#5871
Conversation
The SDK's built-in keepalive thread calls _signal_exit() on any exception, permanently killing the DG connection. Our VAD gate already handles keepalives. Remove the key entirely (string "false" is still truthy in Python). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Check return values of send() and keep_alive(). When either returns False or raises, mark connection as dead and stop forwarding audio. Prevents silent audio loss for the rest of the session. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Check is_connection_dead before sending audio to DG. When detected, log error and set dg_socket to None to stop further sends. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests: send returns False, keepalive returns False, keepalive exception, dead socket stops sending, passthrough mode resilience. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Required fixes before merge:
Please push a follow-up commit and I’ll re-review quickly. by AI for @beastoin |
Add dg_socket to nonlocal declaration so the dead-connection assignment (dg_socket = None) modifies the enclosing scope. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Greptile SummaryThis PR addresses a real bug in Deepgram SDK v4.8.1 where a background keepalive thread could permanently kill the WebSocket connection by calling The core idea is sound, but there are two places where the implementation doesn't fully deliver the stated behaviour:
The Confidence Score: 2/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant TA as transcribe.py<br/>flush_stt_buffer
participant GDS as GatedDeepgramSocket
participant DG as Deepgram SDK
TA->>GDS: send(chunk)
GDS->>GDS: _dg_dead? → False
GDS->>DG: conn.send(audio)
DG-->>GDS: returns False
GDS->>GDS: _dg_dead = True
GDS-->>TA: returns
Note over TA: Next audio chunk (~30 ms later)
TA->>TA: check is_connection_dead → True
TA->>TA: log error (EVERY call ⚠️)
TA->>TA: dg_socket = None (LOCAL only, no nonlocal)
Note over TA,GDS: Outer dg_socket still dead socket!
Note over TA: Next audio chunk
TA->>TA: dg_socket still not None (outer scope)
TA->>TA: check is_connection_dead → True
TA->>TA: log error AGAIN ⚠️
Note over GDS: Gate=None passthrough path
TA->>GDS: send(chunk)
GDS->>GDS: _dg_dead? → False (never set in passthrough)
GDS->>DG: conn.send(audio) → False
GDS-->>TA: returns (flag NOT set ⚠️)
|
| # Check if DG connection died (keepalive or send failure) | ||
| if hasattr(dg_socket, 'is_connection_dead') and dg_socket.is_connection_dead: | ||
| logger.error('DG connection died mid-session uid=%s session=%s', uid, session_id) | ||
| dg_socket = None # Stop sending to dead connection |
There was a problem hiding this comment.
dg_socket = None assignment is scoped locally — dead detection fires on every flush
flush_stt_buffer does not declare dg_socket as nonlocal, so the dg_socket = None assignment on this line only creates a local variable within the single invocation of flush_stt_buffer. The parameter dg_socket in the enclosing receive_data function is unaffected.
On every subsequent call to flush_stt_buffer (which happens every ~30 ms of incoming audio), dg_socket is re-read from receive_data's scope — still the dead socket — so the is_connection_dead branch is entered again, and logger.error(...) fires again. The PR description says "Dead connection logs once per detection, not per send", but the code does the opposite: the error is emitted once per 30 ms audio chunk for the rest of the session.
The fix is to add dg_socket to the nonlocal declaration so the assignment actually persists:
async def flush_stt_buffer(force: bool = False):
nonlocal stt_audio_buffer, soniox_profile_socket, deepgram_profile_socket, dg_usage_ms_pending
nonlocal dg_socket # required so the dead-detection assignment propagatesWithout this, dg_socket = None on line 2310 is a no-op for all future calls and the error log will flood at audio-chunk rate (~33×/s).
| if self._gate is None: | ||
| return self._conn.send(data) |
There was a problem hiding this comment.
Passthrough mode never marks connection dead
When self._gate is None (passthrough mode), the return value of self._conn.send(data) is forwarded to the caller but _dg_dead is never set to True, even if the DG SDK returns False. Subsequent calls still reach this branch and keep forwarding audio to the dead connection.
The early-exit guard at the top (if self._dg_dead: return) is only useful after _dg_dead has been set — but in passthrough mode that never happens.
The test test_passthrough_dead_on_send_false only verifies "don't crash"; it doesn't assert that later sends are dropped.
if self._gate is None:
ret = self._conn.send(data)
if ret is False:
self._dg_dead = True
returnThis gap matters even if all current production sockets use an active gate, because the gate=None path exists and could be exercised by future callers or during the VAD-error fallback (where self._gate is cleared).
Tests the caller-side pattern where is_connection_dead triggers dg_socket = None, matching transcribe.py:2307-2310 behavior. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Asserts that deepgram_options and deepgram_cloud_options do not contain the "keepalive" key, which spawns a dangerous background thread. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CP9 — Live Backend Validation EvidenceTest 1: DG connection without SDK keepalive (the fix)Test 2: Keepalive thread verification (old vs new)Confirms: Removing Test summary
by AI for @beastoin |
All checkpoints passed — PR ready for merge
7 commits, 125 unit tests pass, live DG test confirms fix works. Awaiting human merge approval. by AI for @beastoin |
…d detection (#5870) Dead-connection detection should work regardless of whether a VAD gate is used. SafeDeepgramSocket wraps the raw DG connection with send/keepalive monitoring. GatedDeepgramSocket now delegates is_connection_dead to it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Always wraps the raw DG connection with SafeDeepgramSocket for dead detection, regardless of whether a VAD gate is enabled. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both SafeDeepgramSocket and GatedDeepgramSocket expose is_connection_dead, so the hasattr check is no longer needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…5870) Tests now verify dead detection at both layers: SafeDeepgramSocket alone (no VAD gate) and GatedDeepgramSocket delegating to SafeDeepgramSocket. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove utils.stt.vad_gate from mock list since it has no heavy deps. Prevents test isolation issues when running both test files together. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Architecture Refactor: Dead detection moved to connection levelManager feedback: keepalive/dead detection shouldn't be VAD gate's responsibility — what if we don't have the VAD gate? Fix: Extracted New architecture:
safe_conn = SafeDeepgramSocket(dg_connection)
if vad_gate is not None:
return GatedDeepgramSocket(safe_conn, gate=vad_gate)
return safe_conn
@property
def is_connection_dead(self) -> bool:
if isinstance(self._conn, SafeDeepgramSocket):
return self._conn.is_connection_dead
return FalseTest coverage (129 passing):
Live DG test (repeated after refactor):WS backend pipeline test:
by AI for @beastoin |
All checkpoints passed — PR ready for merge
Key changes since last review:
12 commits, 129 unit tests pass, live DG test passes. Awaiting human merge approval. by AI for @beastoin |
5-Minute Live DG Connection Test — PASSPattern: 10s speech + 20s silence repeating (simulates real conversation), manual keepalives every 8s. Connection held for full 5 minutes with zero errors. by AI for @beastoin |
App + Live Test EvidenceApp Screenshots (flow-walker on emulator)
App successfully enters Listening mode and connects to the local backend with the keepalive fix applied. WS connection chain:
Live DG Connection Test — 5 minutes (300s)Backend log — WS → DG connection chainUnit tests — 129 passby AI for @beastoin |
Physical Device (Pixel 7a) Live Test EvidenceTranscription Working on Physical DevicePixel 7a (33041JEHN18287) connected to local backend (VPS 100.125.36.102:10161) via Tailscale:
SafeDeepgramSocket Dead Detection WorkingBackend logs confirm the fix works exactly as designed: Pipeline: Connection chain (from backend log)by AI for @beastoin |
SafeDeepgramSocket is about STT connection safety, not VAD gating. Moving it to streaming.py where process_audio_dg() wraps connections. Uses class attribute marker (_is_safe_dg_socket = True) with `is True` check to avoid circular import and MagicMock false positives. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GatedDeepgramSocket now uses getattr(_is_safe_dg_socket, None) is True to detect SafeDeepgramSocket without importing it (avoids circular import). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
#5870) GatedDeepgramSocket now delegates keep_alive() to underlying SafeDeepgramSocket, allowing callers to explicitly ping DG. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…5870) Replace 15s sleep with keepalive loop (5s intervals) to prevent DG 10s idle timeout on the main socket while audio routes to the profile socket. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Avoids heavy import chain (soniox_util → storage → GCP clients) when unit tests only need the socket wrapper. Tests now collect without GOOGLE_APPLICATION_CREDENTIALS. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…5870) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ests (#5870) Tests: keep_alive delegation alive/dead, SafeDeepgramSocket finalize/finish. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests: SafeDeepgramSocket/GatedDeepgramSocket return types from process_audio_dg, stabilization keepalive loop pattern, dead-stop. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Summary
Fixes #5870 — Deepgram SDK keepalive thread permanently kills connection on failure, causing silent transcription loss.
Root cause: The SDK's
_keep_alivethread calls_signal_exit()on any exception, permanently killing the DG connection. All subsequentsend()calls returnFalsesilently — no error, no log, transcription just stops.Fix (3 layers):
streaming.py): Removekeepalive: "true"fromDeepgramClientOptionsto prevent SDK from spawning the dangerous background thread. Our VAD gate already sends manualkeep_alive()calls.SafeDeepgramSocket(vad_gate.py): Connection-level wrapper with_dg_deadone-way latch that detects whensend()orkeep_alive()returnsFalseor raises. Works with or without VAD gate.GatedDeepgramSocketdelegates dead detection toSafeDeepgramSocket— no longer owns_dg_deaddirectly.process_audio_dg()(streaming.py): Raw DG connection always wrapped withSafeDeepgramSocket, then optionallyGatedDeepgramSocket.transcribe.py): Checkis_connection_deadbefore sending audio. If dead, null outdg_socketand log error.Architecture
Test evidence
App screenshots (flow-walker)
Live DG test — 5 minutes (300s)
Unit tests — 129 pass
Review cycles
nonlocalscoping bug → fixedSafeDeepgramSocket, always applied inprocess_audio_dg()Risks
keep_alive()in VAD gate. Already battle-tested (chaos v7, 88.1% silence savings)._dg_deadis a one-way latch — once set, the session cannot recover DG. Intentional since SDK's_signal_exit()already makes the connection unrecoverable.🤖 Generated with Claude Code