Problem
The backend listen pipeline uses a "presecond speech profile trick" to identify the user's voice: it prepends 10s of the user's speech profile audio + 5s padding before the actual stream, then assumes Deepgram's speaker == 0 is the user. This approach has four problems:
- Redundant: speaker embedding matching (_match_speaker_embedding() in transcribe.py:1809-1980) already exists and can identify the user by voice biometrics
- Adds a 15s startup delay: SPEECH_PROFILE_STABILIZE_DELAY = 15 means 15 seconds pass before actual transcription begins
- Fragile: assumes the speech profile audio will cause Deepgram to assign speaker_id=0 to the user, which isn't guaranteed
- Creates dual sockets: opens deepgram_socket (with preseconds=15) AND deepgram_profile_socket (preseconds=0), doubling Deepgram API costs per session
Current Architecture
The Presecond Trick (to be removed)
User speech profile (10s WAV) → 5s padding → actual audio stream
↓
Deepgram sees speaker_id=0 as "the user"
Words with start < 15s are filtered out
Key code locations:
streaming.py:20-23 — Constants: SPEECH_PROFILE_FIXED_DURATION=10, PADDING=5, STABILIZE_DELAY=15
streaming.py:347-350 — is_user = True if word.speaker == 0 and preseconds > 0
transcribe.py:970-1150 — Dual socket creation, profile audio sending, stabilization wait
transcribe.py:1117-1146 — send_initial_file_path() sends profile audio to Deepgram
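To make the fragility concrete, the filtering and labeling rule described above can be sketched in a few lines. This is an illustration only, not the code in streaming.py: the Word dataclass and the apply_presecond_trick() helper are hypothetical names, and only the two rules (drop words with start < preseconds, mark speaker 0 as the user when preseconds > 0) come from this issue.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Word:
    """Hypothetical stand-in for one diarized word from the STT provider."""
    text: str
    start: float      # seconds from the start of the audio sent to Deepgram
    speaker: int      # diarization label assigned by Deepgram
    is_user: bool = False


def apply_presecond_trick(words: List[Word], preseconds: float) -> List[Word]:
    """Drop words from the prepended profile audio and label the user.

    The user is identified purely by the diarization label: whoever
    Deepgram happens to call speaker 0 is treated as the user.
    """
    kept = []
    for word in words:
        if word.start < preseconds:
            continue  # word belongs to the profile audio or the padding
        kept.append(replace(word, is_user=(word.speaker == 0 and preseconds > 0)))
    return kept
```

Nothing in this rule looks at the voice itself, so if Deepgram assigns the profile audio any label other than 0, every later word is mislabeled.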
Speaker Embedding (already exists, should replace)
User's stored embedding (512-dim vector)
↓
Extract embedding from live audio segment → cosine distance comparison
↓
If distance < 0.45 → same speaker (user identified)
Key code locations:
utils/stt/speaker_embedding.py — extract_embedding_from_bytes(), compare_embeddings(), is_same_speaker()
transcribe.py:1809-1980 — _match_speaker_embedding() already matches speakers in real-time
database/users.py:240-264 — Speaker embedding stored per person in Firestore
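The comparison step in the diagram above can be sketched as follows. This is a minimal pure-Python illustration, not the code in utils/stt/speaker_embedding.py: cosine_distance() and is_same_speaker_sketch() are hypothetical names, and the real signatures of compare_embeddings() and is_same_speaker() are not shown in this issue. Only the cosine distance metric and the 0.45 threshold are taken from the description.

```python
import math
from typing import Sequence

SAME_SPEAKER_THRESHOLD = 0.45  # threshold quoted in this issue


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means identical direction."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector carries no voice information,
        # so treat it as maximally dissimilar rather than dividing by zero.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def is_same_speaker_sketch(
    a: Sequence[float],
    b: Sequence[float],
    threshold: float = SAME_SPEAKER_THRESHOLD,
) -> bool:
    """True when the two embeddings are closer than the threshold."""
    return cosine_distance(a, b) < threshold
```

Because the decision depends only on the two vectors, it does not matter which speaker label Deepgram assigns to the user.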
Proposed Solution
- Extract user's speech profile embedding at session start (or use pre-computed embedding from Firestore)
- When Deepgram assigns speaker IDs, extract a short audio segment for each new speaker_id
- Compare extracted embedding against user's profile embedding using existing is_same_speaker() (threshold 0.45)
- If match → mark as user (same as current is_user=True behavior)
- Remove presecond trick: no more profile audio prepending, no dual sockets, no 15s delay
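One possible shape for the steps above is a small resolver that decides once per Deepgram speaker_id and caches the result. This is a sketch under stated assumptions, not a proposed final design: UserSpeakerResolver is a hypothetical class, and the two callables stand in for extract_embedding_from_bytes() and is_same_speaker(), whose real signatures are not shown in this issue.

```python
from typing import Callable, Dict, Optional, Sequence

Embedding = Sequence[float]


class UserSpeakerResolver:
    """Decide, once per speaker_id, whether that speaker is the user."""

    def __init__(
        self,
        user_embedding: Embedding,
        extract_embedding: Callable[[bytes], Optional[Embedding]],
        is_same_speaker: Callable[[Embedding, Embedding], bool],
    ) -> None:
        self._user_embedding = user_embedding
        self._extract_embedding = extract_embedding
        self._is_same_speaker = is_same_speaker
        self._decisions: Dict[int, bool] = {}

    def is_user(self, speaker_id: int, segment_audio: bytes) -> bool:
        """Return the cached decision, or compute it from this segment."""
        if speaker_id in self._decisions:
            return self._decisions[speaker_id]
        embedding = self._extract_embedding(segment_audio)
        if embedding is None:
            # Assumption: extraction can fail on very short segments.
            # Do not cache, so a later segment can still decide.
            return False
        decision = self._is_same_speaker(self._user_embedding, embedding)
        self._decisions[speaker_id] = decision
        return decision
```

Caching keeps embedding extraction to one successful call per speaker. Whether a negative decision should be permanent, or re-checked on a longer segment, is a design choice this sketch leaves open.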
Benefits
- Eliminates 15s startup delay — transcription begins immediately
- Saves Deepgram API costs — single socket instead of dual
- More accurate — voice biometric matching vs fragile speaker_id=0 assumption
- Simpler code — removes ~100 lines of presecond handling
Acceptance Criteria
- is_user flag is set correctly based on embedding match