Fix speaker embedding dimension mismatch crash#6240
Conversation
v2→v3 migration can leave contacts with 512-dim embeddings tagged as v3. When the user has a 256-dim v3 embedding, scipy.cdist crashes on shape mismatch (~1,197 errors/day affecting 4 users). Two fixes: - compare_embeddings() returns max distance (2.0) on dim mismatch - Cache loading in transcribe.py filters out stale-dimension contacts Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Greptile SummaryThis PR fixes a crash (#6238) caused by a partially-failed v2→v3 speaker embedding migration that left some contacts tagged as Two-layer defence is added:
Issues found:
Confidence Score: 4/5Safe to merge after fixing the contradictory test docstring, which also masks a real edge-case in the filter logic. The production fix in backend/tests/unit/test_user_speaker_embedding.py — Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[speaker_identification_task starts] --> B[Load user embedding from Firestore]
B --> C{Embedding found?}
C -- Yes --> D[Add 256-dim user embedding to cache]
C -- No --> E[Cache remains empty]
D --> F[Loop over contacts]
E --> F
F --> G[Fetch contact speaker_embedding]
G --> H{speech_samples_version >= 3?}
H -- No --> I[Skip - migration not done]
H -- Yes --> J[Reshape to 1xD array]
J --> K{Cache non-empty AND dim mismatch with first entry?}
K -- Yes --> L[NEW: Log warning, skip stale 512-dim contact]
K -- No --> M[Add to person_embeddings_cache]
M --> F
L --> F
F --> N{Cache empty?}
N -- Yes --> O[Disable speaker ID task]
N -- No --> P[Consume segment queue]
P --> Q[compare_embeddings called]
Q --> R{Embedding dims match?}
R -- No --> S[NEW: Return 2.0 max distance instead of crash]
R -- Yes --> T[scipy.cdist cosine distance]
T --> U[find_best_match result]
S --> U
Reviews (1): Last reviewed commit: "Fix speaker embedding dimension mismatch..." | Re-trigger Greptile |
| def test_person_cache_loading_no_filter_when_cache_empty(self): | ||
| """When cache is empty (no user embedding), all persons should be loaded regardless of dim.""" | ||
| person_embeddings_cache = {} | ||
|
|
||
| persons = [ | ||
| {'id': 'p1', 'name': 'Alice', 'speaker_embedding': list(np.random.randn(512))}, | ||
| {'id': 'p2', 'name': 'Bob', 'speaker_embedding': list(np.random.randn(256))}, | ||
| ] | ||
|
|
||
| for person in persons: | ||
| emb = person.get('speaker_embedding') | ||
| if emb: | ||
| emb_array = np.array(emb, dtype=np.float32).reshape(1, -1) | ||
| if ( | ||
| person_embeddings_cache | ||
| and next(iter(person_embeddings_cache.values()))['embedding'].shape[1] != emb_array.shape[1] | ||
| ): | ||
| continue | ||
| person_embeddings_cache[person['id']] = { | ||
| 'embedding': emb_array, | ||
| 'name': person['name'], | ||
| } | ||
|
|
||
| # Both loaded since cache was empty when first person loaded | ||
| assert 'p1' in person_embeddings_cache | ||
| # p2 gets filtered because after p1 loads (512), p2 (256) mismatches | ||
| assert 'p2' not in person_embeddings_cache |
There was a problem hiding this comment.
Test docstring and inline comment contradict the assertions
The docstring claims "all persons should be loaded regardless of dim" but the test asserts the exact opposite — p2 (256-dim) is not loaded because after p1 (512-dim) is added to the cache it becomes the dimension reference, and subsequent non-matching entries are filtered.
Specifically:
- Line 694:
"""When cache is empty (no user embedding), all persons should be loaded regardless of dim."""— incorrect; p2 is filtered. - Line 716:
# Both loaded since cache was empty when first person loaded— incorrect; only p1 is loaded.
Beyond the misleading docs, this test also reveals a legitimate behavioral concern: when the user's own embedding is absent from the cache (e.g., loading failed), the first contact encountered sets the dimension reference. If that first contact happens to have a stale 512-dim embedding, all properly-migrated 256-dim contacts will be silently skipped. The test name test_person_cache_loading_no_filter_when_cache_empty implies the filter is inactive for an empty cache, but filtering still occurs for every entry after the first.
The test should be renamed to reflect actual behaviour and the docstring/comment corrected:
| def test_person_cache_loading_no_filter_when_cache_empty(self): | |
| """When cache is empty (no user embedding), all persons should be loaded regardless of dim.""" | |
| person_embeddings_cache = {} | |
| persons = [ | |
| {'id': 'p1', 'name': 'Alice', 'speaker_embedding': list(np.random.randn(512))}, | |
| {'id': 'p2', 'name': 'Bob', 'speaker_embedding': list(np.random.randn(256))}, | |
| ] | |
| for person in persons: | |
| emb = person.get('speaker_embedding') | |
| if emb: | |
| emb_array = np.array(emb, dtype=np.float32).reshape(1, -1) | |
| if ( | |
| person_embeddings_cache | |
| and next(iter(person_embeddings_cache.values()))['embedding'].shape[1] != emb_array.shape[1] | |
| ): | |
| continue | |
| person_embeddings_cache[person['id']] = { | |
| 'embedding': emb_array, | |
| 'name': person['name'], | |
| } | |
| # Both loaded since cache was empty when first person loaded | |
| assert 'p1' in person_embeddings_cache | |
| # p2 gets filtered because after p1 loads (512), p2 (256) mismatches | |
| assert 'p2' not in person_embeddings_cache | |
| def test_person_cache_loading_first_entry_sets_dim_reference(self): | |
| """When no user embedding pre-populates the cache, the first contact loaded sets the | |
| dimension reference; subsequent contacts with a different dimension are filtered. | |
| This means if the first contact still has a stale 512-dim embedding, correctly | |
| migrated 256-dim contacts will be skipped — a known edge-case when the user has | |
| no stored speaker embedding. | |
| """ | |
| person_embeddings_cache = {} | |
| persons = [ | |
| {'id': 'p1', 'name': 'Alice', 'speaker_embedding': list(np.random.randn(512))}, | |
| {'id': 'p2', 'name': 'Bob', 'speaker_embedding': list(np.random.randn(256))}, | |
| ] | |
| for person in persons: | |
| emb = person.get('speaker_embedding') | |
| if emb: | |
| emb_array = np.array(emb, dtype=np.float32).reshape(1, -1) | |
| if ( | |
| person_embeddings_cache | |
| and next(iter(person_embeddings_cache.values()))['embedding'].shape[1] != emb_array.shape[1] | |
| ): | |
| continue | |
| person_embeddings_cache[person['id']] = { | |
| 'embedding': emb_array, | |
| 'name': person['name'], | |
| } | |
| # p1 (512-dim) loaded first; p2 (256-dim) filtered because it doesn't match 512 | |
| assert 'p1' in person_embeddings_cache | |
| assert 'p2' not in person_embeddings_cache |
| if embedding1.shape[1] != embedding2.shape[1]: | ||
| return 2.0 |
There was a problem hiding this comment.
Dimension guard assumes 2-D array shape
embedding1.shape[1] raises IndexError if a 1-D array (D,) is ever passed instead of (1, D). All current callers reshape to (1, D) before calling this function, so the risk is low, but a defensive check (or using shape[-1]) would make the guard robust against future callers that skip the reshape:
| if embedding1.shape[1] != embedding2.shape[1]: | |
| return 2.0 | |
| if embedding1.ndim < 2 or embedding2.ndim < 2 or embedding1.shape[-1] != embedding2.shape[-1]: | |
| return 2.0 |
Codex review flagged that comparing against the first cache entry is order-dependent: a stale user embedding (512-dim) would poison the filter and drop valid 256-dim contacts. Instead, load all embeddings into cache and let compare_embeddings() return 2.0 on dimension mismatch at comparison time. Added observability log for mixed-dimension caches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The v2→v3 migration sets version=3 for contacts with no speech_samples but leaves the old 512-dim embedding untouched — confirmed by Firestore query on affected user (Kat: version=3, 512-dim, 0 samples). Three fixes: - Migration: clear speaker_embedding before bumping version when no samples exist (both v1→v2 and v2→v3 paths) - Hardening: transcribe.py and sync.py skip loading person embeddings when speech_samples is empty, making historical bad data inert - Tests: 3 new migration tests for stale embedding clearing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
lgtm |
Fixes #6238
Root Cause (verified via Firestore query)
The v2→v3 speaker embedding migration (
speaker_sample_migration.py:186-190) has a bug: when a contact has no speech_samples (0 audio files), it setsspeech_samples_version=3but leaves the old 512-dim embedding untouched. The old embedding from the v1/v2 model (pyannote/embedding, 512-dim) survives tagged as v3, crashingscipy.cdistwhen compared against the user's 256-dim v3 embedding (~1,197 errors/day across 4 users).Verified on affected user: contact "Kat" has
version=3,512-dim embedding,0 speech_samples. Contacts with samples (Chris Bond, Vince) were correctly re-extracted to 256-dim.Fix (4 commits, 2 Codex reviews)
Commit 1 — Crash prevention:
compare_embeddings()returns max distance (2.0) on dimension mismatch instead of crashing viascipy.cdistCommit 2 — Remove fragile cache filter (Codex review #1):
compare_embeddings()guardCommit 3 — Root cause fix (Codex review #2):
speaker_embeddingbefore bumping version when no samples exist (both v1→v2 and v2→v3 paths)speech_samplesis empty — makes historical bad data inertCommit 4 — Cleanup:
Changed Files
backend/utils/stt/speaker_embedding.pycompare_embeddings()backend/utils/speaker_sample_migration.pybackend/routers/transcribe.pybackend/routers/sync.pybackend/tests/unit/test_user_speaker_embedding.pybackend/tests/unit/test_speaker_sample_migration.pyTests
test_user_speaker_embedding.py) — 8 new for dimension guardtest_speaker_sample_migration.py) — 3 new for stale embedding clearingtest_speaker_id_pipeline.py) — existing, no regressionstest_short_audio_embedding.py) — existing, no regressionsDeploy
compare_embeddingsor the transcribe/sync speaker ID pathsgh workflow run "Deploy Backend to Cloud RUN" --repo BasedHardware/omi --ref main -f environment=prodNot in this PR
by AI for @beastoin