Fix speaker embedding dimension mismatch crash by beastoin · Pull Request #6240 · BasedHardware/omi

beastoin · 2026-04-01T04:52:33Z

Root Cause (verified via Firestore query)

The v2→v3 speaker embedding migration (speaker_sample_migration.py:186-190) has a bug: when a contact has no speech_samples (0 audio files), it sets speech_samples_version=3 but leaves the old 512-dim embedding untouched. The old embedding from the v1/v2 model (pyannote/embedding, 512-dim) survives tagged as v3, crashing scipy.cdist when compared against the user's 256-dim v3 embedding (~1,197 errors/day across 4 users).

Verified on affected user: contact "Kat" has version=3, 512-dim embedding, 0 speech_samples. Contacts with samples (Chris Bond, Vince) were correctly re-extracted to 256-dim.

Fix (4 commits, 2 Codex reviews)

Commit 1 — Crash prevention:

compare_embeddings() returns max distance (2.0) on dimension mismatch instead of crashing via scipy.cdist
Safety net for all callers (transcribe.py, sync.py)
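
A minimal sketch of the guard described in Commit 1, assuming embeddings arrive as `(1, D)` NumPy arrays and that the distance comes from `scipy.spatial.distance.cdist` with the cosine metric. The function body and the `MAX_COSINE_DISTANCE` constant are illustrative assumptions, not the code from `speaker_embedding.py`.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Cosine distance ranges over [0, 2]; 2.0 is treated as "no match".
MAX_COSINE_DISTANCE = 2.0


def compare_embeddings(embedding1, embedding2):
    """Illustrative guard: return the max distance instead of letting cdist raise."""
    # cdist raises ValueError when the two inputs have different column counts,
    # so the mismatch is checked before calling it.
    if embedding1.shape[1] != embedding2.shape[1]:
        return MAX_COSINE_DISTANCE
    return float(cdist(embedding1, embedding2, metric="cosine")[0, 0])


user = np.ones((1, 256), dtype=np.float32)   # stands in for a v3 embedding
stale = np.ones((1, 512), dtype=np.float32)  # stands in for a stale old-model embedding

same = compare_embeddings(user, user)
mismatch = compare_embeddings(user, stale)
```

With this shape of guard, a stale contact can never be selected as the best match, because its distance is always the worst possible value.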

Commit 2 — Remove fragile cache filter (Codex review #1):

Initial cache-level filter anchored on first entry — Codex flagged as order-dependent
Removed in favor of relying on compare_embeddings() guard
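
The order dependence that Codex flagged can be illustrated with a small sketch. It assumes the removed filter compared each new entry against the first entry already in the cache; `load_with_first_entry_filter` and the entry ids are hypothetical, and plain lists stand in for embeddings.

```python
def load_with_first_entry_filter(entries):
    """Hypothetical stand-in for the removed filter: first entry sets the dimension."""
    cache = {}
    for entry_id, embedding in entries:
        if cache:
            reference_dim = len(next(iter(cache.values())))
            if len(embedding) != reference_dim:
                continue  # dropped because it disagrees with the first entry
        cache[entry_id] = embedding
    return cache


stale = ("stale", [0.0] * 512)  # old-model dimension
valid_a = ("a", [0.0] * 256)    # v3 dimension
valid_b = ("b", [0.0] * 256)

# Same data, different load order, different result.
stale_first = load_with_first_entry_filter([stale, valid_a, valid_b])
valid_first = load_with_first_entry_filter([valid_a, stale, valid_b])
```

When the stale entry loads first, both valid entries are dropped, which is why the filter was removed in favor of the comparison-time guard.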

Commit 3 — Root cause fix (Codex review #2):

Migration: clear speaker_embedding before bumping version when no samples exist (both v1→v2 and v2→v3 paths)
Hardening: transcribe.py and sync.py skip loading person embeddings when speech_samples is empty — makes historical bad data inert
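
The migration change can be sketched as follows. This is an assumption-laden illustration, not the code in `speaker_sample_migration.py`: `migrate_contact` and `extract_embedding` are hypothetical names, and clearing is modelled as setting the field to `None`.

```python
def migrate_contact(person, target_version, extract_embedding):
    """Hypothetical migration step for one contact."""
    updated = dict(person)
    samples = updated.get("speech_samples") or []
    if samples:
        # Samples exist: re-extract with the new model.
        updated["speaker_embedding"] = extract_embedding(samples)
    else:
        # No audio to re-extract from: clear the old embedding so it cannot
        # survive under the new version tag.
        updated["speaker_embedding"] = None
    updated["speech_samples_version"] = target_version
    return updated


# A contact shaped like the affected one: old 512-dim embedding, no samples.
no_samples = {
    "speaker_embedding": [0.0] * 512,
    "speech_samples": [],
    "speech_samples_version": 2,
}
migrated = migrate_contact(no_samples, 3, lambda samples: [0.0] * 256)
```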

Commit 4 — Cleanup:

Removed dimension mismatch observability log (not needed)

Changed Files

| File | Change |
| --- | --- |
| `backend/utils/stt/speaker_embedding.py` | Dimension guard in `compare_embeddings()` |
| `backend/utils/speaker_sample_migration.py` | Clear stale embedding in no-samples path |
| `backend/routers/transcribe.py` | Skip loading embeddings without speech_samples |
| `backend/routers/sync.py` | Same hardening |
| `backend/tests/unit/test_user_speaker_embedding.py` | 8 new tests |
| `backend/tests/unit/test_speaker_sample_migration.py` | 3 new tests |

Tests

30 passed (test_user_speaker_embedding.py) — 8 new for dimension guard
18 passed (test_speaker_sample_migration.py) — 3 new for stale embedding clearing
98 passed (test_speaker_id_pipeline.py) — existing, no regressions
12 passed (test_short_audio_embedding.py) — existing, no regressions

Deploy

Backend Cloud Run only — pusher doesn't call compare_embeddings or the transcribe/sync speaker ID paths
gh workflow run "Deploy Backend to Cloud RUN" --repo BasedHardware/omi --ref main -f environment=prod

Not in this PR

One-off backfill to clear existing incorrectly tagged contacts (follow-up)

by AI for @beastoin

v2→v3 migration can leave contacts with 512-dim embeddings tagged as v3. When the user has a 256-dim v3 embedding, scipy.cdist crashes on shape mismatch (~1,197 errors/day affecting 4 users).

Two fixes:
- compare_embeddings() returns max distance (2.0) on dim mismatch
- Cache loading in transcribe.py filters out stale-dimension contacts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

greptile-apps · 2026-04-01T04:56:33Z

Greptile Summary

This PR fixes a crash (#6238) caused by a partially-failed v2→v3 speaker embedding migration that left some contacts tagged as speech_samples_version=3 but still holding 512-dim embeddings from the old pyannote model. When the user already had a correct 256-dim v3 embedding, scipy.cdist would raise a ValueError on the shape mismatch — affecting ~1,197 errors/day across 4 users.

Two-layer defence is added:

compare_embeddings() in speaker_embedding.py — now short-circuits and returns 2.0 (max cosine distance) before calling cdist if the two embeddings have different dimensions. This is the universal safety net.
Cache loading in transcribe.py — before inserting a contact into person_embeddings_cache, the new code compares the contact's embedding dimension against the first item already in the cache (typically the user's own 256-dim embedding). Any contact whose dimension differs is logged at WARNING level and skipped.

Issues found:

The test test_person_cache_loading_no_filter_when_cache_empty has a docstring and inline comment that directly contradict its own assertions — the test proves p2 IS filtered, not that all entries are loaded. This also surfaces a real edge-case: when no user embedding is in the cache, the first contact sets the dimension reference and all contacts with a different dimension are silently dropped.
The dimension guard in compare_embeddings uses shape[1] which would IndexError on a 1-D array; shape[-1] with an ndim check would be more defensive.

Confidence Score: 4/5

Safe to merge after fixing the contradictory test docstring, which also masks a real edge-case in the filter logic.

The production fix in speaker_embedding.py and transcribe.py is correct and sound. The P1 finding is confined to the test file: the docstring and comment in test_person_cache_loading_no_filter_when_cache_empty claim the opposite of what the assertions verify, and the test name reinforces the wrong mental model. This also exposes a real edge-case where the first contact sets the dim reference when no user embedding is present — a scenario the test implies is safe but the assertions show is not.

backend/tests/unit/test_user_speaker_embedding.py — test_person_cache_loading_no_filter_when_cache_empty docstring and comment contradict the assertions.

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/utils/stt/speaker_embedding.py | Adds a dimension guard to `compare_embeddings` that returns the max cosine distance (2.0) instead of crashing `scipy.cdist` when embeddings have mismatched shapes — a correct and minimal fix for the crash. |
| backend/routers/transcribe.py | Adds a pre-filter when loading contact embeddings into the per-session cache: contacts whose embedding dimension differs from the first cached entry (the user's own embedding) are logged and skipped, preventing stale 512-dim embeddings from entering the comparison path. |
| backend/tests/unit/test_user_speaker_embedding.py | Adds 8 unit tests for the two-layer fix; `test_person_cache_loading_no_filter_when_cache_empty` has a docstring and inline comment that directly contradict the actual assertions — the test proves p2 IS filtered, not that all entries are loaded. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[speaker_identification_task starts] --> B[Load user embedding from Firestore]
    B --> C{Embedding found?}
    C -- Yes --> D[Add 256-dim user embedding to cache]
    C -- No --> E[Cache remains empty]
    D --> F[Loop over contacts]
    E --> F
    F --> G[Fetch contact speaker_embedding]
    G --> H{speech_samples_version >= 3?}
    H -- No --> I[Skip - migration not done]
    H -- Yes --> J[Reshape to 1xD array]
    J --> K{Cache non-empty AND dim mismatch with first entry?}
    K -- Yes --> L[NEW: Log warning, skip stale 512-dim contact]
    K -- No --> M[Add to person_embeddings_cache]
    M --> F
    L --> F
    F --> N{Cache empty?}
    N -- Yes --> O[Disable speaker ID task]
    N -- No --> P[Consume segment queue]
    P --> Q[compare_embeddings called]
    Q --> R{Embedding dims match?}
    R -- No --> S[NEW: Return 2.0 max distance instead of crash]
    R -- Yes --> T[scipy.cdist cosine distance]
    T --> U[find_best_match result]
    S --> U

Reviews (1): Last reviewed commit: "Fix speaker embedding dimension mismatch..."

greptile-apps · 2026-04-01T04:56:36Z

+    def test_person_cache_loading_no_filter_when_cache_empty(self):
+        """When cache is empty (no user embedding), all persons should be loaded regardless of dim."""
+        person_embeddings_cache = {}
+
+        persons = [
+            {'id': 'p1', 'name': 'Alice', 'speaker_embedding': list(np.random.randn(512))},
+            {'id': 'p2', 'name': 'Bob', 'speaker_embedding': list(np.random.randn(256))},
+        ]
+
+        for person in persons:
+            emb = person.get('speaker_embedding')
+            if emb:
+                emb_array = np.array(emb, dtype=np.float32).reshape(1, -1)
+                if (
+                    person_embeddings_cache
+                    and next(iter(person_embeddings_cache.values()))['embedding'].shape[1] != emb_array.shape[1]
+                ):
+                    continue
+                person_embeddings_cache[person['id']] = {
+                    'embedding': emb_array,
+                    'name': person['name'],
+                }
+
+        # Both loaded since cache was empty when first person loaded
+        assert 'p1' in person_embeddings_cache
+        # p2 gets filtered because after p1 loads (512), p2 (256) mismatches
+        assert 'p2' not in person_embeddings_cache


Test docstring and inline comment contradict the assertions

The docstring claims "all persons should be loaded regardless of dim" but the test asserts the exact opposite — p2 (256-dim) is not loaded because after p1 (512-dim) is added to the cache it becomes the dimension reference, and subsequent non-matching entries are filtered.

Specifically:

Line 694: """When cache is empty (no user embedding), all persons should be loaded regardless of dim.""" — incorrect; p2 is filtered.

Line 716: # Both loaded since cache was empty when first person loaded — incorrect; only p1 is loaded.

Beyond the misleading docs, this test also reveals a legitimate behavioral concern: when the user's own embedding is absent from the cache (e.g., loading failed), the first contact encountered sets the dimension reference. If that first contact happens to have a stale 512-dim embedding, all properly-migrated 256-dim contacts will be silently skipped. The test name test_person_cache_loading_no_filter_when_cache_empty implies the filter is inactive for an empty cache, but filtering still occurs for every entry after the first.

The test should be renamed to reflect actual behaviour and the docstring/comment corrected:

Suggested change

-    def test_person_cache_loading_no_filter_when_cache_empty(self):
-        """When cache is empty (no user embedding), all persons should be loaded regardless of dim."""
-        person_embeddings_cache = {}
-
-        persons = [
-            {'id': 'p1', 'name': 'Alice', 'speaker_embedding': list(np.random.randn(512))},
-            {'id': 'p2', 'name': 'Bob', 'speaker_embedding': list(np.random.randn(256))},
-        ]
-
-        for person in persons:
-            emb = person.get('speaker_embedding')
-            if emb:
-                emb_array = np.array(emb, dtype=np.float32).reshape(1, -1)
-                if (
-                    person_embeddings_cache
-                    and next(iter(person_embeddings_cache.values()))['embedding'].shape[1] != emb_array.shape[1]
-                ):
-                    continue
-                person_embeddings_cache[person['id']] = {
-                    'embedding': emb_array,
-                    'name': person['name'],
-                }
-
-        # Both loaded since cache was empty when first person loaded
-        assert 'p1' in person_embeddings_cache
-        # p2 gets filtered because after p1 loads (512), p2 (256) mismatches
-        assert 'p2' not in person_embeddings_cache
+    def test_person_cache_loading_first_entry_sets_dim_reference(self):
+        """When no user embedding pre-populates the cache, the first contact loaded sets the
+        dimension reference; subsequent contacts with a different dimension are filtered.
+        This means if the first contact still has a stale 512-dim embedding, correctly
+        migrated 256-dim contacts will be skipped — a known edge-case when the user has
+        no stored speaker embedding.
+        """
+        person_embeddings_cache = {}
+
+        persons = [
+            {'id': 'p1', 'name': 'Alice', 'speaker_embedding': list(np.random.randn(512))},
+            {'id': 'p2', 'name': 'Bob', 'speaker_embedding': list(np.random.randn(256))},
+        ]
+
+        for person in persons:
+            emb = person.get('speaker_embedding')
+            if emb:
+                emb_array = np.array(emb, dtype=np.float32).reshape(1, -1)
+                if (
+                    person_embeddings_cache
+                    and next(iter(person_embeddings_cache.values()))['embedding'].shape[1] != emb_array.shape[1]
+                ):
+                    continue
+                person_embeddings_cache[person['id']] = {
+                    'embedding': emb_array,
+                    'name': person['name'],
+                }
+
+        # p1 (512-dim) loaded first; p2 (256-dim) filtered because it doesn't match 512
+        assert 'p1' in person_embeddings_cache
+        assert 'p2' not in person_embeddings_cache

greptile-apps · 2026-04-01T04:56:37Z

+    if embedding1.shape[1] != embedding2.shape[1]:
+        return 2.0


Dimension guard assumes 2-D array shape

embedding1.shape[1] raises IndexError if a 1-D array (D,) is ever passed instead of (1, D). All current callers reshape to (1, D) before calling this function, so the risk is low, but a defensive check (or using shape[-1]) would make the guard robust against future callers that skip the reshape:

Suggested change

-    if embedding1.shape[1] != embedding2.shape[1]:
-        return 2.0
+    if embedding1.ndim < 2 or embedding2.ndim < 2 or embedding1.shape[-1] != embedding2.shape[-1]:
+        return 2.0
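
The difference between `shape[1]` and the `ndim`-aware check can be seen in a short sketch. `dims_match` is a hypothetical helper written for illustration; it is not part of the PR.

```python
import numpy as np


def dims_match(embedding1, embedding2):
    """Hypothetical helper: True only for 2-D arrays with the same feature size."""
    if embedding1.ndim < 2 or embedding2.ndim < 2:
        return False
    return embedding1.shape[-1] == embedding2.shape[-1]


flat = np.zeros(256, dtype=np.float32)  # shape (256,), reshape was skipped
row = flat.reshape(1, -1)               # shape (1, 256), what callers pass today

# Indexing shape[1] on a 1-D array fails because the shape tuple has one element.
try:
    flat.shape[1]
    raised = False
except IndexError:
    raised = True
```

Under this variant a 1-D input is treated as a mismatch rather than raising, so the caller would receive the max distance.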

Codex review flagged that comparing against the first cache entry is order-dependent: a stale user embedding (512-dim) would poison the filter and drop valid 256-dim contacts. Instead, load all embeddings into cache and let compare_embeddings() return 2.0 on dimension mismatch at comparison time. Added observability log for mixed-dimension caches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The v2→v3 migration sets version=3 for contacts with no speech_samples but leaves the old 512-dim embedding untouched — confirmed by Firestore query on affected user (Kat: version=3, 512-dim, 0 samples).

Three fixes:
- Migration: clear speaker_embedding before bumping version when no samples exist (both v1→v2 and v2→v3 paths)
- Hardening: transcribe.py and sync.py skip loading person embeddings when speech_samples is empty, making historical bad data inert
- Tests: 3 new migration tests for stale embedding clearing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
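
The loader hardening in this commit can be sketched roughly as below, assuming each person record is a dict with `speaker_embedding` and `speech_samples` fields. `load_person_embeddings` is a hypothetical name; the real logic lives inline in `transcribe.py` and `sync.py` and may differ.

```python
def load_person_embeddings(people):
    """Hypothetical loader: cache only embeddings that are backed by audio samples."""
    cache = {}
    for person in people:
        embedding = person.get("speaker_embedding")
        if not embedding:
            continue
        # An embedding with no samples cannot have been re-extracted by the
        # migration, so it is treated as stale and never loaded.
        if not person.get("speech_samples"):
            continue
        cache[person["id"]] = {"embedding": embedding, "name": person["name"]}
    return cache


people = [
    {"id": "p1", "name": "Stale", "speaker_embedding": [0.0] * 512, "speech_samples": []},
    {"id": "p2", "name": "Valid", "speaker_embedding": [0.0] * 256, "speech_samples": ["sample-1"]},
]
cache = load_person_embeddings(people)
```

This keeps already-written bad records out of the comparison path without needing the backfill to have run.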

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

beastoin · 2026-04-01T06:12:21Z

lgtm

greptile-apps Bot reviewed Apr 1, 2026


beastoin and others added 2 commits April 1, 2026 05:02

beastoin mentioned this pull request Apr 1, 2026

Speaker embedding dimension mismatch: 512-dim v1/v2 embeddings tagged as v3 crash cdist comparison #6238

Closed

Remove dimension mismatch log — not needed

a93e52c

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

beastoin merged commit ee36561 into main Apr 1, 2026
2 checks passed

beastoin deleted the fix/speaker-embedding-dim-mismatch-6238 branch April 1, 2026 06:12

Glucksberg pushed a commit to Glucksberg/omi-local that referenced this pull request Apr 28, 2026

Fix speaker embedding dimension mismatch crash (BasedHardware#6240)

73eb6fb
