Deepgram keepalive thread permanently kills connection on failure — silent transcription loss #5870

@beastoin

Description

The Deepgram SDK v4.8.1 has a built-in keepalive background thread that permanently kills the Deepgram (DG) WebSocket connection on any single failure. After that, all audio is silently dropped for the rest of the user's session: there is no transcription, and no error is surfaced to the user.

Production impact: thousands of "keep_alive failed" errors per hour. Affected sessions silently lose transcription mid-conversation.

Root Cause

Three compounding problems:

1. SDK keepalive thread calls _signal_exit() on ANY exception

The SDK's _keep_alive thread (async_client.py:376-419) runs every 5 seconds. When it encounters any exception (for example, the DG server dropping the connection or a transient network issue), it:

  • Emits an Error event
  • Calls super()._signal_exit(), which sets _exit_event, closes the socket, and sets _socket = None
  • Returns (the thread exits for good, with no retry)

After _signal_exit(), every subsequent send() call checks _exit_event.is_set() and returns False silently. Audio is dropped with no error raised.
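The failure mode above can be illustrated without the SDK. This is a minimal sketch, not the SDK source: FakeSocketClient is a hypothetical stand-in, and only the names _exit_event, _socket, _signal_exit() and send() are taken from the description above.

```python
import threading


class FakeSocketClient:
    """Hypothetical stand-in for the SDK client; not the real Deepgram class."""

    def __init__(self):
        self._exit_event = threading.Event()
        self._socket = object()  # placeholder for an open WebSocket

    def _signal_exit(self):
        # Mirrors the behaviour described above: flag exit and drop the socket.
        self._exit_event.set()
        self._socket = None

    def send(self, data: bytes) -> bool:
        # Once exit is signalled, report failure without raising.
        if self._exit_event.is_set():
            return False
        return True


client = FakeSocketClient()
sent_before = client.send(b"audio")  # connection healthy
client._signal_exit()                # what a single keepalive failure triggers
sent_after = client.send(b"audio")   # dropped, and no exception is raised
```

A caller that ignores the return value cannot tell sent_before from sent_after, which is why the loss is silent.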

2. Duplicate keepalives — SDK thread + our VAD gate

Both the SDK (every 5s, enabled by options={"keepalive": "true"} in streaming.py:290) and our VAD gate (vad_gate.py:736-742, every 5s during silence) send {"type": "KeepAlive"}. The SDK's is redundant and dangerous because of #1.
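A single timer is enough to keep the connection alive during silence. The sketch below is an assumption about how such a timer could look: SilenceKeepalive is a hypothetical class, and only the method names needs_keepalive() and record_keepalive(), the 5-second interval and the message body come from this issue. The real logic in vad_gate.py may differ.

```python
import json

KEEPALIVE_INTERVAL_S = 5.0  # interval stated in this issue
KEEPALIVE_MESSAGE = json.dumps({"type": "KeepAlive"})


class SilenceKeepalive:
    """Hypothetical timer; the real gate in vad_gate.py may differ."""

    def __init__(self, interval: float = KEEPALIVE_INTERVAL_S):
        self._interval = interval
        self._last_sent = None  # None until the first keepalive is recorded

    def needs_keepalive(self, now: float) -> bool:
        # Due if nothing was sent yet, or the interval has elapsed.
        return self._last_sent is None or now - self._last_sent >= self._interval

    def record_keepalive(self, now: float) -> None:
        # Only call this after the keepalive was actually delivered.
        self._last_sent = now


timer = SilenceKeepalive()
due_times = []
for now in [0.0, 2.0, 5.0, 7.5, 10.0]:
    if timer.needs_keepalive(now):
        due_times.append(now)
        timer.record_keepalive(now)
```

With one sender on this schedule, a second keepalive source adds traffic but no extra protection.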

3. Our code doesn't check return values

vad_gate.py:739 calls self._conn.keep_alive() but ignores the return value. The SDK's keep_alive() returns False when the connection is dead (it swallows exceptions internally), so our code then calls record_keepalive(now) as if the keepalive had succeeded. The same applies to self._conn.send() at line 735, whose return value is also ignored.

Evidence

  • Deepgram docs confirm a 10-second idle timeout; the recommended keepalive interval is 3-5 seconds
  • SDK v4.8.1 is_keep_alive_enabled() uses Python truthiness — {"keepalive": "false"} (string) would still enable it
  • v5/v6 SDK removed the automatic keepalive thread; developers manage keepalives manually
  • Our VAD gate already handles keepalives correctly — the SDK's thread is redundant
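The truthiness pitfall in the second bullet is plain Python behaviour: any non-empty string is truthy, including "false". The function below is a hypothetical re-creation of a truthiness-based check for illustration, not the SDK's actual implementation.

```python
def is_keep_alive_enabled(options: dict) -> bool:
    # Hypothetical truthiness-based check; not copied from the SDK.
    return bool(options.get("keepalive", False))


results = {
    "string true": is_keep_alive_enabled({"keepalive": "true"}),
    "string false": is_keep_alive_enabled({"keepalive": "false"}),
    "key removed": is_keep_alive_enabled({"termination_exception_connect": "true"}),
}
```

Only the "key removed" case evaluates to False, which is why the fix deletes the key instead of setting it to "false".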

Solution

Change 1 — streaming.py:290-292: Remove SDK keepalive key

# BEFORE
deepgram_options = DeepgramClientOptions(options={"keepalive": "true", "termination_exception_connect": "true"})
deepgram_cloud_options = DeepgramClientOptions(options={"keepalive": "true", "termination_exception_connect": "true"})

# AFTER (remove "keepalive" key entirely — "false" string is still truthy!)
deepgram_options = DeepgramClientOptions(options={"termination_exception_connect": "true"})
deepgram_cloud_options = DeepgramClientOptions(options={"termination_exception_connect": "true"})

This prevents the SDK from spawning its _keep_alive task. Our VAD gate handles keepalives.

Change 2 — vad_gate.py: Check return values, detect dead connection

# Add in GatedDeepgramSocket.__init__:
self._dg_dead = False

# Change send() — audio path (line 734-735):
if gate_out.audio_to_send:
    ret = self._conn.send(gate_out.audio_to_send)
    if ret is False:
        logger.warning('DG send returned False, connection dead uid=%s session=%s',
                       self._gate.uid, self._gate.session_id)
        self._dg_dead = True

# Change send() — keepalive path (line 736-742):
elif self._gate.needs_keepalive(now):
    try:
        ret = self._conn.keep_alive()
        if ret is False:
            logger.warning('DG keep_alive returned False, connection dead uid=%s session=%s',
                          self._gate.uid, self._gate.session_id)
            self._dg_dead = True
        else:
            self._gate.record_keepalive(now)
    except Exception:
        logger.warning('DG keepalive exception, connection dead uid=%s session=%s',
                      self._gate.uid, self._gate.session_id)
        self._dg_dead = True

# Add property:
@property
def is_connection_dead(self) -> bool:
    return self._dg_dead
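The dead-connection flag can be exercised with mocks, with no DG API key. This is a trimmed-down sketch under assumptions: DeadFlagSocket is a hypothetical wrapper that omits the VAD gating and logging, and is not the real GatedDeepgramSocket.

```python
from unittest.mock import Mock


class DeadFlagSocket:
    """Hypothetical, trimmed-down wrapper; not the real GatedDeepgramSocket."""

    def __init__(self, conn):
        self._conn = conn
        self._dg_dead = False

    def send(self, audio: bytes) -> None:
        # A False return means the underlying connection has already exited.
        if self._conn.send(audio) is False:
            self._dg_dead = True

    def keep_alive(self) -> None:
        try:
            if self._conn.keep_alive() is False:
                self._dg_dead = True
        except Exception:
            # Treat a raised error the same as a False return.
            self._dg_dead = True

    @property
    def is_connection_dead(self) -> bool:
        return self._dg_dead


healthy = DeadFlagSocket(Mock(send=Mock(return_value=True)))
healthy.send(b"audio")

dead = DeadFlagSocket(Mock(send=Mock(return_value=False)))
dead.send(b"audio")

raising = DeadFlagSocket(Mock(keep_alive=Mock(side_effect=ConnectionError("closed"))))
raising.keep_alive()
```

The flag is sticky on purpose: once set it never clears, so the caller decides what to do with a dead connection.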

Change 3 — transcribe.py: Detect dead DG connection in audio loop

# getattr with a default covers sockets that do not expose the property
if getattr(deepgram_socket, 'is_connection_dead', False):
    logger.error('DG connection died mid-session uid=%s session=%s', uid, session_id)
    break  # End session cleanly (future: attempt reconnection)
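How the check behaves inside a loop can be sketched as follows. Everything here is illustrative: StubSocket, LegacySocket and run_audio_loop are hypothetical names, and the real audio loop in transcribe.py does considerably more work per chunk.

```python
class StubSocket:
    """Hypothetical stub that reports a dead connection after N sends."""

    def __init__(self, fail_after: int):
        self._sent = 0
        self._fail_after = fail_after
        self.is_connection_dead = False

    def send(self, chunk: bytes) -> None:
        self._sent += 1
        if self._sent >= self._fail_after:
            self.is_connection_dead = True


class LegacySocket:
    """Hypothetical socket without the is_connection_dead attribute."""

    def send(self, chunk: bytes) -> None:
        pass


def run_audio_loop(socket, chunks) -> int:
    processed = 0
    for chunk in chunks:
        # Stop feeding audio once the socket reports a dead connection.
        if getattr(socket, "is_connection_dead", False):
            break
        socket.send(chunk)
        processed += 1
    return processed


chunks = [b"a"] * 10
processed_stub = run_audio_loop(StubSocket(fail_after=3), chunks)
processed_legacy = run_audio_loop(LegacySocket(), chunks)
```

The loop stops on the first iteration after the flag is set, so at most one chunk is sent into a connection that has just died.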

Affected Areas

| File | Line | Change |
|------|------|--------|
| backend/utils/stt/streaming.py | 290-292 | Remove "keepalive" key from DeepgramClientOptions |
| backend/utils/stt/vad_gate.py | 707, 720, 734-742 | Add _dg_dead flag, check return values |
| backend/routers/transcribe.py | audio loop | Check is_connection_dead and break |

Reproduction Script

See the attached reproduction script, which demonstrates the bug and the fix using mocks (no DG API key required): https://gist.github.com/placeholder


by AI for @beastoin

Metadata

Labels: p1 (Priority: Critical, score 22-29)
