Description
The Deepgram SDK v4.8.1 has a built-in keepalive background thread that permanently kills the DG WebSocket connection on any single failure. After this, all audio is silently dropped for the rest of the user's session — no transcription, no error surfaced to the user.
Production impact: thousands of keep_alive failed errors per hour. Sessions silently lose transcription mid-conversation.
Root Cause
Three compounding problems:
1. SDK keepalive thread calls _signal_exit() on ANY exception
The SDK's _keep_alive thread (async_client.py:376-419) runs every 5 seconds. When it encounters any exception (DG server drops, transient network issue), it:
- Emits an Error event
- Calls super()._signal_exit(), which sets _exit_event, closes the socket, and sets _socket = None
- Returns (thread exits forever, no retry)
After _signal_exit(), every subsequent send() call checks _exit_event.is_set() and returns False silently. Audio is dropped with no error raised.
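The failure mode described above can be illustrated with a small mock. This is only a sketch of the described behavior, not the SDK's code: FakeSdkConnection, keep_alive_tick and the fail flag are hypothetical names invented for the illustration; only _signal_exit, _exit_event and _socket come from the description above.

```python
import threading


class FakeSdkConnection:
    """Hypothetical stand-in for the SDK connection (not the real Deepgram class)."""

    def __init__(self):
        self._exit_event = threading.Event()
        self._socket = object()  # placeholder for an open WebSocket
        self.sent = []

    def _signal_exit(self):
        # Mirrors the described behavior: flag the exit and drop the socket.
        self._exit_event.set()
        self._socket = None

    def keep_alive_tick(self, fail=False):
        """One keepalive iteration; any exception shuts the connection down."""
        try:
            if fail:
                raise ConnectionError("transient network issue")
            return True
        except Exception:
            self._signal_exit()
            return False  # no retry; the loop would end here

    def send(self, data):
        # Once the exit event is set, data is dropped without raising.
        if self._exit_event.is_set():
            return False
        self.sent.append(data)
        return True


conn = FakeSdkConnection()
first = conn.send(b"audio-1")          # delivered while the connection is healthy
conn.keep_alive_tick(fail=True)        # a single failure
later = [conn.send(b"audio-%d" % i) for i in range(2, 5)]
print(first, later, len(conn.sent))
```

After the single failed tick, every later send() returns False and nothing is raised, which is why the caller never notices unless it inspects the return value.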
2. Duplicate keepalives — SDK thread + our VAD gate
Both the SDK (every 5s, enabled by options={"keepalive": "true"} in streaming.py:290) and our VAD gate (vad_gate.py:736-742, every 5s during silence) send {"type": "KeepAlive"}. The SDK's is redundant and dangerous because of #1.
3. Our code doesn't check return values
vad_gate.py:739 calls self._conn.keep_alive() but ignores the return value. The SDK's keep_alive() returns False when the connection is dead (it swallows exceptions internally). Our code then calls record_keepalive(now) thinking it succeeded. Same for self._conn.send() at line 735.
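A minimal sketch of why the ignored return value matters. DeadConn, MiniGate, unchecked_keepalive and checked_keepalive are hypothetical names for this illustration; the only assumption taken from the description above is that keep_alive() reports failure by returning False rather than raising.

```python
class DeadConn:
    """Hypothetical mock of a connection that has already been shut down."""

    def keep_alive(self):
        return False  # failure is reported only through the return value


class MiniGate:
    """Hypothetical, stripped-down gate that only tracks the last keepalive time."""

    def __init__(self):
        self.last_keepalive = None

    def record_keepalive(self, now):
        self.last_keepalive = now


def unchecked_keepalive(conn, gate, now):
    # Current pattern: the result is discarded, so a failure looks like success.
    conn.keep_alive()
    gate.record_keepalive(now)


def checked_keepalive(conn, gate, now):
    # Proposed pattern: record the keepalive only when it was actually sent.
    ok = conn.keep_alive() is not False
    if ok:
        gate.record_keepalive(now)
    return ok


gate_unchecked = MiniGate()
unchecked_keepalive(DeadConn(), gate_unchecked, 100.0)

gate_checked = MiniGate()
checked_result = checked_keepalive(DeadConn(), gate_checked, 100.0)
```

With the unchecked variant the gate records a keepalive at 100.0 even though nothing was sent; with the checked variant the timestamp stays unset and the caller gets a False it can act on.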
Evidence
- Deepgram docs confirm 10-second idle timeout; recommended keepalive interval is 3-5 seconds
- SDK v4.8.1 is_keep_alive_enabled() uses Python truthiness, so {"keepalive": "false"} (a string) would still enable it
- v5/v6 SDK removed the automatic keepalive thread; developers manage keepalives manually
- Our VAD gate already handles keepalives correctly — the SDK's thread is redundant
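The truthiness point can be checked in isolation. The two helper functions below are hypothetical and do not reproduce the SDK's is_keep_alive_enabled(); they only contrast a truthiness-based check with one that parses the string.

```python
def enabled_by_truthiness(options):
    # Hypothetical check: any non-empty value counts as "on".
    return bool(options.get("keepalive"))


def enabled_by_parsing(options):
    # Hypothetical stricter check: only the string "true" counts as "on".
    return str(options.get("keepalive", "")).strip().lower() == "true"


cases = [{"keepalive": "true"}, {"keepalive": "false"}, {}]
truthy = [enabled_by_truthiness(c) for c in cases]
parsed = [enabled_by_parsing(c) for c in cases]
print(truthy)  # [True, True, False]
print(parsed)  # [True, False, False]
```

Under a truthiness check the only reliable way to switch the feature off is to omit the key, which is what Change 1 does.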
Solution
Change 1 — streaming.py:290-292: Remove SDK keepalive key
# BEFORE
deepgram_options = DeepgramClientOptions(options={"keepalive": "true", "termination_exception_connect": "true"})
deepgram_cloud_options = DeepgramClientOptions(options={"keepalive": "true", "termination_exception_connect": "true"})
# AFTER (remove "keepalive" key entirely — "false" string is still truthy!)
deepgram_options = DeepgramClientOptions(options={"termination_exception_connect": "true"})
deepgram_cloud_options = DeepgramClientOptions(options={"termination_exception_connect": "true"})

This prevents the SDK from spawning its _keep_alive task. Our VAD gate handles keepalives.
Change 2 — vad_gate.py: Check return values, detect dead connection
# Add in GatedDeepgramSocket.__init__:
self._dg_dead = False
# Change send() — audio path (line 734-735):
if gate_out.audio_to_send:
    ret = self._conn.send(gate_out.audio_to_send)
    if ret is False:
        logger.warning('DG send returned False, connection dead uid=%s session=%s',
                       self._gate.uid, self._gate.session_id)
        self._dg_dead = True
# Change send() — keepalive path (line 736-742):
elif self._gate.needs_keepalive(now):
    try:
        ret = self._conn.keep_alive()
        if ret is False:
            logger.warning('DG keep_alive returned False, connection dead uid=%s session=%s',
                           self._gate.uid, self._gate.session_id)
            self._dg_dead = True
        else:
            self._gate.record_keepalive(now)
    except Exception:
        logger.warning('DG keepalive exception, connection dead uid=%s session=%s',
                       self._gate.uid, self._gate.session_id)
        self._dg_dead = True
# Add property:
@property
def is_connection_dead(self) -> bool:
    return self._dg_dead

Change 3 — transcribe.py: Detect dead DG connection in audio loop
if hasattr(deepgram_socket, 'is_connection_dead') and deepgram_socket.is_connection_dead:
    logger.error('DG connection died mid-session uid=%s session=%s', uid, session_id)
    break  # End session cleanly (future: attempt reconnection)

Affected Areas
| File | Line | Change |
|---|---|---|
| backend/utils/stt/streaming.py | 290-292 | Remove "keepalive" key from DeepgramClientOptions |
| backend/utils/stt/vad_gate.py | 707, 720, 734-742 | Add _dg_dead flag, check return values |
| backend/routers/transcribe.py | audio loop | Check is_connection_dead and break |
Reproduction Script
See the attached reproduction script, which demonstrates the bug and the fix using mocks (no DG API key required): https://gist.github.com/placeholder
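For a quick look without opening the script, the shape of the fix can be sketched with mocks as follows. This is an illustration under simplifying assumptions, not the attached script: DeadConnection, LiveConnection, SketchGatedSocket and run_audio_loop are hypothetical names, the VAD gate is omitted, and an empty chunk stands in for "silence, keepalive due".

```python
import logging

logger = logging.getLogger("dg_sketch")


class DeadConnection:
    """Hypothetical mock: a connection that has already been shut down."""

    def send(self, data):
        return False

    def keep_alive(self):
        return False


class LiveConnection:
    """Hypothetical mock: a healthy connection that accepts everything."""

    def __init__(self):
        self.sent = []

    def send(self, data):
        self.sent.append(data)
        return True

    def keep_alive(self):
        return True


class SketchGatedSocket:
    """Simplified wrapper; the real GatedDeepgramSocket also runs the VAD gate."""

    def __init__(self, conn):
        self._conn = conn
        self._dg_dead = False

    def send(self, audio):
        if audio:
            if self._conn.send(audio) is False:
                logger.warning("send returned False, marking connection dead")
                self._dg_dead = True
        else:
            # Empty chunk: treat as silence and send a keepalive instead.
            try:
                if self._conn.keep_alive() is False:
                    self._dg_dead = True
            except Exception:
                self._dg_dead = True

    @property
    def is_connection_dead(self):
        return self._dg_dead


def run_audio_loop(sock, chunks):
    """Feed chunks until the socket reports a dead connection; return how many succeeded."""
    processed = 0
    for chunk in chunks:
        sock.send(chunk)
        if sock.is_connection_dead:
            break  # end the session instead of silently dropping audio
        processed += 1
    return processed


chunks = [b"a", b"", b"b"]
live_count = run_audio_loop(SketchGatedSocket(LiveConnection()), chunks)
dead_count = run_audio_loop(SketchGatedSocket(DeadConnection()), chunks)
print(live_count, dead_count)  # 3 0
```

With a healthy connection all three chunks are processed; with a dead one the loop exits on the first chunk, so the session ends visibly rather than continuing with no transcription.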
by AI for @beastoin