Skip to content

Fix speaker embedding dimension mismatch crash#6240

Merged
beastoin merged 4 commits into
mainfrom
fix/speaker-embedding-dim-mismatch-6238
Apr 1, 2026
Merged

Fix speaker embedding dimension mismatch crash#6240
beastoin merged 4 commits into
mainfrom
fix/speaker-embedding-dim-mismatch-6238

Conversation

@beastoin
Copy link
Copy Markdown
Collaborator

@beastoin beastoin commented Apr 1, 2026

Fixes #6238

Root Cause (verified via Firestore query)

The v2→v3 speaker embedding migration (speaker_sample_migration.py:186-190) has a bug: when a contact has no speech_samples (0 audio files), it sets speech_samples_version=3 but leaves the old 512-dim embedding untouched. The old embedding from the v1/v2 model (pyannote/embedding, 512-dim) survives tagged as v3, crashing scipy.cdist when compared against the user's 256-dim v3 embedding (~1,197 errors/day across 4 users).

Verified on affected user: contact "Kat" has version=3, 512-dim embedding, 0 speech_samples. Contacts with samples (Chris Bond, Vince) were correctly re-extracted to 256-dim.

Fix (4 commits, 2 Codex reviews)

Commit 1 — Crash prevention:

  • compare_embeddings() returns max distance (2.0) on dimension mismatch instead of crashing via scipy.cdist
  • Safety net for all callers (transcribe.py, sync.py)

Commit 2 — Remove fragile cache filter (Codex review #1):

  • Initial cache-level filter anchored on first entry — Codex flagged as order-dependent
  • Removed in favor of relying on compare_embeddings() guard

Commit 3 — Root cause fix (Codex review #2):

  • Migration: clear speaker_embedding before bumping version when no samples exist (both v1→v2 and v2→v3 paths)
  • Hardening: transcribe.py and sync.py skip loading person embeddings when speech_samples is empty — makes historical bad data inert

Commit 4 — Cleanup:

  • Removed dimension mismatch observability log (not needed)

Changed Files

File Change
backend/utils/stt/speaker_embedding.py Dimension guard in compare_embeddings()
backend/utils/speaker_sample_migration.py Clear stale embedding in no-samples path
backend/routers/transcribe.py Skip loading embeddings without speech_samples
backend/routers/sync.py Same hardening
backend/tests/unit/test_user_speaker_embedding.py 8 new tests
backend/tests/unit/test_speaker_sample_migration.py 3 new tests

Tests

  • 30 passed (test_user_speaker_embedding.py) — 8 new for dimension guard
  • 18 passed (test_speaker_sample_migration.py) — 3 new for stale embedding clearing
  • 98 passed (test_speaker_id_pipeline.py) — existing, no regressions
  • 12 passed (test_short_audio_embedding.py) — existing, no regressions

Deploy

  • Backend Cloud Run only — pusher doesn't call compare_embeddings or the transcribe/sync speaker ID paths
  • gh workflow run "Deploy Backend to Cloud RUN" --repo BasedHardware/omi --ref main -f environment=prod

Not in this PR

  • One-off backfill to clear existing incorrectly tagged contacts (follow-up)

by AI for @beastoin

v2→v3 migration can leave contacts with 512-dim embeddings tagged as v3.
When the user has a 256-dim v3 embedding, scipy.cdist crashes on shape
mismatch (~1,197 errors/day affecting 4 users). Two fixes:
- compare_embeddings() returns max distance (2.0) on dim mismatch
- Cache loading in transcribe.py filters out stale-dimension contacts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 1, 2026

Greptile Summary

This PR fixes a crash (#6238) caused by a partially-failed v2→v3 speaker embedding migration that left some contacts tagged as speech_samples_version=3 but still holding 512-dim embeddings from the old pyannote model. When the user already had a correct 256-dim v3 embedding, scipy.cdist would raise a ValueError on the shape mismatch — affecting ~1,197 errors/day across 4 users.

Two-layer defence is added:

  • compare_embeddings() in speaker_embedding.py — now short-circuits and returns 2.0 (max cosine distance) before calling cdist if the two embeddings have different dimensions. This is the universal safety net.
  • Cache loading in transcribe.py — before inserting a contact into person_embeddings_cache, the new code compares the contact's embedding dimension against the first item already in the cache (typically the user's own 256-dim embedding). Any contact whose dimension differs is logged at WARNING level and skipped.

Issues found:

  • The test test_person_cache_loading_no_filter_when_cache_empty has a docstring and inline comment that directly contradict its own assertions — the test proves p2 IS filtered, not that all entries are loaded. This also surfaces a real edge-case: when no user embedding is in the cache, the first contact sets the dimension reference and all contacts with a different dimension are silently dropped.
  • The dimension guard in compare_embeddings uses shape[1] which would IndexError on a 1-D array; shape[-1] with an ndim check would be more defensive.

Confidence Score: 4/5

Safe to merge after fixing the contradictory test docstring, which also masks a real edge-case in the filter logic.

The production fix in speaker_embedding.py and transcribe.py is correct and sound. The P1 finding is confined to the test file: the docstring and comment in test_person_cache_loading_no_filter_when_cache_empty claim the opposite of what the assertions verify, and the test name reinforces the wrong mental model. This also exposes a real edge-case where the first contact sets the dim reference when no user embedding is present — a scenario the test implies is safe but the assertions show is not.

backend/tests/unit/test_user_speaker_embedding.py — test_person_cache_loading_no_filter_when_cache_empty docstring and comment contradict the assertions.

Important Files Changed

Filename Overview
backend/utils/stt/speaker_embedding.py Adds a dimension guard to compare_embeddings that returns the max cosine distance (2.0) instead of crashing scipy.cdist when embeddings have mismatched shapes — a correct and minimal fix for the crash.
backend/routers/transcribe.py Adds a pre-filter when loading contact embeddings into the per-session cache: contacts whose embedding dimension differs from the first cached entry (the user's own embedding) are logged and skipped, preventing stale 512-dim embeddings from entering the comparison path.
backend/tests/unit/test_user_speaker_embedding.py Adds 8 unit tests for the two-layer fix; test_person_cache_loading_no_filter_when_cache_empty has a docstring and inline comment that directly contradict the actual assertions — the test proves p2 IS filtered, not that all entries are loaded.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[speaker_identification_task starts] --> B[Load user embedding from Firestore]
    B --> C{Embedding found?}
    C -- Yes --> D[Add 256-dim user embedding to cache]
    C -- No --> E[Cache remains empty]
    D --> F[Loop over contacts]
    E --> F
    F --> G[Fetch contact speaker_embedding]
    G --> H{speech_samples_version >= 3?}
    H -- No --> I[Skip - migration not done]
    H -- Yes --> J[Reshape to 1xD array]
    J --> K{Cache non-empty AND dim mismatch with first entry?}
    K -- Yes --> L[NEW: Log warning, skip stale 512-dim contact]
    K -- No --> M[Add to person_embeddings_cache]
    M --> F
    L --> F
    F --> N{Cache empty?}
    N -- Yes --> O[Disable speaker ID task]
    N -- No --> P[Consume segment queue]
    P --> Q[compare_embeddings called]
    Q --> R{Embedding dims match?}
    R -- No --> S[NEW: Return 2.0 max distance instead of crash]
    R -- Yes --> T[scipy.cdist cosine distance]
    T --> U[find_best_match result]
    S --> U
Loading

Reviews (1): Last reviewed commit: "Fix speaker embedding dimension mismatch..." | Re-trigger Greptile

Comment on lines +693 to +719
def test_person_cache_loading_no_filter_when_cache_empty(self):
"""When cache is empty (no user embedding), all persons should be loaded regardless of dim."""
person_embeddings_cache = {}

persons = [
{'id': 'p1', 'name': 'Alice', 'speaker_embedding': list(np.random.randn(512))},
{'id': 'p2', 'name': 'Bob', 'speaker_embedding': list(np.random.randn(256))},
]

for person in persons:
emb = person.get('speaker_embedding')
if emb:
emb_array = np.array(emb, dtype=np.float32).reshape(1, -1)
if (
person_embeddings_cache
and next(iter(person_embeddings_cache.values()))['embedding'].shape[1] != emb_array.shape[1]
):
continue
person_embeddings_cache[person['id']] = {
'embedding': emb_array,
'name': person['name'],
}

# Both loaded since cache was empty when first person loaded
assert 'p1' in person_embeddings_cache
# p2 gets filtered because after p1 loads (512), p2 (256) mismatches
assert 'p2' not in person_embeddings_cache
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Test docstring and inline comment contradict the assertions

The docstring claims "all persons should be loaded regardless of dim" but the test asserts the exact opposite — p2 (256-dim) is not loaded because after p1 (512-dim) is added to the cache it becomes the dimension reference, and subsequent non-matching entries are filtered.

Specifically:

  • Line 694: """When cache is empty (no user embedding), all persons should be loaded regardless of dim.""" — incorrect; p2 is filtered.
  • Line 716: # Both loaded since cache was empty when first person loaded — incorrect; only p1 is loaded.

Beyond the misleading docs, this test also reveals a legitimate behavioral concern: when the user's own embedding is absent from the cache (e.g., loading failed), the first contact encountered sets the dimension reference. If that first contact happens to have a stale 512-dim embedding, all properly-migrated 256-dim contacts will be silently skipped. The test name test_person_cache_loading_no_filter_when_cache_empty implies the filter is inactive for an empty cache, but filtering still occurs for every entry after the first.

The test should be renamed to reflect actual behaviour and the docstring/comment corrected:

Suggested change
def test_person_cache_loading_no_filter_when_cache_empty(self):
"""When cache is empty (no user embedding), all persons should be loaded regardless of dim."""
person_embeddings_cache = {}
persons = [
{'id': 'p1', 'name': 'Alice', 'speaker_embedding': list(np.random.randn(512))},
{'id': 'p2', 'name': 'Bob', 'speaker_embedding': list(np.random.randn(256))},
]
for person in persons:
emb = person.get('speaker_embedding')
if emb:
emb_array = np.array(emb, dtype=np.float32).reshape(1, -1)
if (
person_embeddings_cache
and next(iter(person_embeddings_cache.values()))['embedding'].shape[1] != emb_array.shape[1]
):
continue
person_embeddings_cache[person['id']] = {
'embedding': emb_array,
'name': person['name'],
}
# Both loaded since cache was empty when first person loaded
assert 'p1' in person_embeddings_cache
# p2 gets filtered because after p1 loads (512), p2 (256) mismatches
assert 'p2' not in person_embeddings_cache
def test_person_cache_loading_first_entry_sets_dim_reference(self):
"""When no user embedding pre-populates the cache, the first contact loaded sets the
dimension reference; subsequent contacts with a different dimension are filtered.
This means if the first contact still has a stale 512-dim embedding, correctly
migrated 256-dim contacts will be skippeda known edge-case when the user has
no stored speaker embedding.
"""
person_embeddings_cache = {}
persons = [
{'id': 'p1', 'name': 'Alice', 'speaker_embedding': list(np.random.randn(512))},
{'id': 'p2', 'name': 'Bob', 'speaker_embedding': list(np.random.randn(256))},
]
for person in persons:
emb = person.get('speaker_embedding')
if emb:
emb_array = np.array(emb, dtype=np.float32).reshape(1, -1)
if (
person_embeddings_cache
and next(iter(person_embeddings_cache.values()))['embedding'].shape[1] != emb_array.shape[1]
):
continue
person_embeddings_cache[person['id']] = {
'embedding': emb_array,
'name': person['name'],
}
# p1 (512-dim) loaded first; p2 (256-dim) filtered because it doesn't match 512
assert 'p1' in person_embeddings_cache
assert 'p2' not in person_embeddings_cache

Comment on lines +124 to +125
if embedding1.shape[1] != embedding2.shape[1]:
return 2.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Dimension guard assumes 2-D array shape

embedding1.shape[1] raises IndexError if a 1-D array (D,) is ever passed instead of (1, D). All current callers reshape to (1, D) before calling this function, so the risk is low, but a defensive check (or using shape[-1]) would make the guard robust against future callers that skip the reshape:

Suggested change
if embedding1.shape[1] != embedding2.shape[1]:
return 2.0
if embedding1.ndim < 2 or embedding2.ndim < 2 or embedding1.shape[-1] != embedding2.shape[-1]:
return 2.0

beastoin and others added 2 commits April 1, 2026 05:02
Codex review flagged that comparing against the first cache entry is
order-dependent: a stale user embedding (512-dim) would poison the filter
and drop valid 256-dim contacts. Instead, load all embeddings into cache
and let compare_embeddings() return 2.0 on dimension mismatch at
comparison time. Added observability log for mixed-dimension caches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The v2→v3 migration sets version=3 for contacts with no speech_samples
but leaves the old 512-dim embedding untouched — confirmed by Firestore
query on affected user (Kat: version=3, 512-dim, 0 samples).

Three fixes:
- Migration: clear speaker_embedding before bumping version when no
  samples exist (both v1→v2 and v2→v3 paths)
- Hardening: transcribe.py and sync.py skip loading person embeddings
  when speech_samples is empty, making historical bad data inert
- Tests: 3 new migration tests for stale embedding clearing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@beastoin beastoin merged commit ee36561 into main Apr 1, 2026
2 checks passed
@beastoin beastoin deleted the fix/speaker-embedding-dim-mismatch-6238 branch April 1, 2026 06:12
@beastoin
Copy link
Copy Markdown
Collaborator Author

beastoin commented Apr 1, 2026

lgtm

Glucksberg pushed a commit to Glucksberg/omi-local that referenced this pull request Apr 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Speaker embedding dimension mismatch: 512-dim v1/v2 embeddings tagged as v3 crash cdist comparison

1 participant