Replace presecond speech profile trick with speaker embedding for user identification #6117

@beastoin

Description

Problem

The backend listen pipeline uses a "presecond speech profile trick" to identify the user's voice: it prepends 10s of the user's speech profile audio + 5s padding before the actual stream, then assumes Deepgram's speaker == 0 is the user. This is:

  1. Redundant — speaker embedding matching (_match_speaker_embedding() in transcribe.py:1809-1980) already exists and can identify the user by voice biometrics
  2. Adds 15s startup delay: SPEECH_PROFILE_STABILIZE_DELAY = 15 seconds must elapse before actual transcription begins
  3. Fragile — assumes the speech profile audio will cause Deepgram to assign speaker_id=0 to the user, which isn't guaranteed
  4. Creates dual sockets — opens deepgram_socket (with preseconds=15) AND deepgram_profile_socket (preseconds=0), doubling Deepgram API costs per session

Current Architecture

The Presecond Trick (to be removed)

User speech profile (10s WAV) → 5s padding → actual audio stream
                                              ↓
                              Deepgram sees speaker_id=0 as "the user"
                              Words with start < 15s are filtered out

Key code locations:

  • streaming.py:20-23 — Constants: SPEECH_PROFILE_FIXED_DURATION=10, PADDING=5, STABILIZE_DELAY=15
  • streaming.py:347-350: is_user = True if word.speaker == 0 and preseconds > 0
  • transcribe.py:970-1150 — Dual socket creation, profile audio sending, stabilization wait
  • transcribe.py:1117-1146: send_initial_file_path() sends profile audio to Deepgram
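
For readers unfamiliar with the pipeline, the current behavior can be sketched roughly as follows. This is an illustration under assumptions, not the code in streaming.py: the Word class and label_words() are hypothetical stand-ins, and only the constant values and the is_user rule are taken from the locations listed above.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Values quoted in this issue for streaming.py:20-23.
SPEECH_PROFILE_FIXED_DURATION = 10  # seconds of profile audio
PADDING = 5                         # seconds of padding after the profile
PRESECONDS = SPEECH_PROFILE_FIXED_DURATION + PADDING  # 15 seconds in total


@dataclass
class Word:
    """Hypothetical stand-in for a word returned by the STT provider."""
    text: str
    speaker: int
    start: float  # seconds from the start of the audio sent to the provider


def label_words(words: List[Word], preseconds: int) -> List[Tuple[str, bool]]:
    """Drop words from the prepended profile audio and flag the user."""
    labelled = []
    for word in words:
        # Words that start inside the prepended window are profile audio.
        if word.start < preseconds:
            continue
        # The fragile assumption: speaker 0 is whoever spoke in the profile.
        is_user = word.speaker == 0 and preseconds > 0
        labelled.append((word.text, is_user))
    return labelled


words = [Word("profile", 0, 3.0), Word("hello", 0, 16.0), Word("hi", 1, 17.5)]
print(label_words(words, PRESECONDS))
```

If the diarizer assigns the user a speaker ID other than 0, every label in this sketch is wrong, which is the fragility described under Problem.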

Speaker Embedding (already exists, should replace)

User's stored embedding (512-dim vector)
    ↓
Extract embedding from live audio segment → cosine distance comparison
    ↓
If distance < 0.45 → same speaker (user identified)

Key code locations:

  • utils/stt/speaker_embedding.py: extract_embedding_from_bytes(), compare_embeddings(), is_same_speaker()
  • transcribe.py:1809-1980: _match_speaker_embedding() already matches speakers in real-time
  • database/users.py:240-264 — Speaker embedding stored per person in Firestore
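
The comparison step can be illustrated with a small, dependency-free sketch. It assumes cosine distance is computed as 1 minus cosine similarity; cosine_distance() and same_speaker() are hypothetical names used only here, and the real compare_embeddings() and is_same_speaker() in utils/stt/speaker_embedding.py may differ in signature and detail. Only the 0.45 threshold comes from this issue.

```python
import math
from typing import Sequence

SAME_SPEAKER_THRESHOLD = 0.45  # threshold quoted in this issue


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity (0.0 means the same direction)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector carries no voice information,
        # so treat it as maximally dissimilar rather than dividing by zero.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def same_speaker(a: Sequence[float], b: Sequence[float],
                 threshold: float = SAME_SPEAKER_THRESHOLD) -> bool:
    """True when the two embeddings are closer than the threshold."""
    return cosine_distance(a, b) < threshold


# Toy 2-dim vectors; stored embeddings are 512-dim per this issue.
print(same_speaker([1.0, 0.0], [1.0, 0.1]))  # nearly parallel vectors
print(same_speaker([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```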

Proposed Solution

  1. Extract user's speech profile embedding at session start (or use pre-computed embedding from Firestore)
  2. When Deepgram assigns speaker IDs, extract a short audio segment for each new speaker_id
  3. Compare extracted embedding against user's profile embedding using existing is_same_speaker() (threshold 0.45)
  4. If match → mark as user (same as current is_user=True behavior)
  5. Remove presecond trick: no more profile audio prepending, no dual sockets, no 15s delay
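
One possible shape for steps 1-4 is sketched below. Everything in it is an assumption made for illustration: UserSpeakerResolver is a hypothetical class, the extract callable stands in for whatever wraps extract_embedding_from_bytes(), and caching the first decision per speaker_id is a design choice that the real implementation may not want (for example, if Deepgram reassigns IDs mid-session).

```python
import math
from typing import Callable, Dict, Optional, Sequence

Embedding = Sequence[float]


def _cosine_distance(a: Embedding, b: Embedding) -> float:
    """1 - cosine similarity; zero vectors are treated as dissimilar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0.0 else 1.0 - dot / norm


class UserSpeakerResolver:
    """Hypothetical helper: decides which diarized speaker_id is the user."""

    def __init__(self, user_embedding: Optional[Embedding],
                 extract: Callable[[bytes], Embedding],
                 threshold: float = 0.45):
        self._user_embedding = user_embedding  # None if no speech profile
        self._extract = extract                # audio bytes -> embedding
        self._threshold = threshold
        self._decisions: Dict[int, bool] = {}  # speaker_id -> is_user

    def is_user(self, speaker_id: int, segment_audio: bytes) -> bool:
        # Users without a speech profile still work; nobody is flagged.
        if self._user_embedding is None:
            return False
        # Extract an embedding only the first time a speaker_id appears.
        if speaker_id not in self._decisions:
            embedding = self._extract(segment_audio)
            distance = _cosine_distance(embedding, self._user_embedding)
            self._decisions[speaker_id] = distance < self._threshold
        return self._decisions[speaker_id]


# Usage with a fake extractor that maps test audio to toy embeddings.
fake_vectors = {b"user-audio": [1.0, 0.0], b"guest-audio": [0.0, 1.0]}
resolver = UserSpeakerResolver([1.0, 0.05], fake_vectors.__getitem__)
print(resolver.is_user(0, b"guest-audio"), resolver.is_user(1, b"user-audio"))
```

In this sketch the user can land on any speaker_id, so nothing depends on Deepgram assigning speaker 0, and no profile audio needs to be sent on the socket.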

Benefits

  • Eliminates 15s startup delay — transcription begins immediately
  • Saves Deepgram API costs — single socket instead of dual
  • More accurate — voice biometric matching vs fragile speaker_id=0 assumption
  • Simpler code — removes ~100 lines of presecond handling

Acceptance Criteria

  • User identification uses speaker embedding comparison, not presecond trick
  • Speech profile audio is NOT prepended to the stream
  • Only one Deepgram socket opened per session (not two)
  • No 15s startup delay — transcription begins immediately
  • is_user flag is set correctly based on embedding match
  • Existing speaker embedding infrastructure reused (no new ML models)
  • All existing tests pass
  • Backward compatible — users without speech profiles still work (just no user identification)

Metadata

Assignees

No one assigned

    Labels

    backend: Backend Task (python); enhancement: New feature or request; p2: Priority: Important (score 14-21); transcription

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests