fix: add speaker embedding matching to offline sync (issue #5907) #5946
sungdark wants to merge 1 commit into BasedHardware:main
Conversation
Offline sync (sync_local_files / process_segment) was skipping the speaker identification pipeline, causing all transcribed segments to show generic 'SPEAKER_00', 'SPEAKER_01' labels instead of being matched against stored person embeddings. Live recording runs speaker_identification_task which calls get_speech_profile_matching_predictions to identify speakers from their voice embeddings. This fix adds the same call to process_segment after postprocess_words returns. Fixes BasedHardware#5907
Greptile Summary

This PR fixes issue #5907 by adding speaker embedding matching to the offline sync path. Key changes:
Minor issues found:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant sync_local_files
    participant process_segment
    participant deepgram_prerecorded
    participant get_speech_profile_matching_predictions
    participant SpeechProfileAPI
    participant DB
    Client->>sync_local_files: POST /v1/sync-local-files (audio .bin files)
    sync_local_files->>sync_local_files: decode_files_to_wav (.bin → .wav)
    sync_local_files->>sync_local_files: retrieve_vad_segments (split into speech segments)
    sync_local_files->>process_segment: process each segmented .wav (thread)
    process_segment->>deepgram_prerecorded: transcribe via signed URL
    deepgram_prerecorded-->>process_segment: transcript_segments (SPEAKER_00, SPEAKER_01…)
    Note over process_segment: NEW: speaker embedding matching
    process_segment->>get_speech_profile_matching_predictions: uid + wav_path + segments
    get_speech_profile_matching_predictions->>SpeechProfileAPI: POST audio + segments
    SpeechProfileAPI-->>get_speech_profile_matching_predictions: [{is_user, person_id}, …]
    get_speech_profile_matching_predictions-->>process_segment: matches list
    process_segment->>process_segment: set seg.is_user / seg.person_id
    process_segment->>DB: store/merge conversation with identified speakers
```
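The SpeechProfileAPI exchange in the diagram might be sketched as below. The payload shape, field names, and helper names are assumptions for illustration, not taken from the codebase; the pad-with-defaults behavior mirrors the error path described in the review comments.

```python
import base64
import json

def build_matching_request(uid, wav_bytes, segments):
    # Hypothetical payload: base64-encoded audio plus per-segment timing.
    return json.dumps({
        'uid': uid,
        'audio_b64': base64.b64encode(wav_bytes).decode('ascii'),
        'segments': [{'start': s, 'end': e} for s, e in segments],
    })

def parse_matching_response(body, n_segments):
    # Pad with "unknown speaker" defaults and truncate, so the caller
    # always receives exactly one entry per transcript segment.
    try:
        matches = json.loads(body)
        assert isinstance(matches, list)
    except Exception:
        matches = []
    default = {'is_user': False, 'person_id': None}
    return (matches + [default] * n_segments)[:n_segments]
```

Normalizing the response length at the client boundary would also remove the bounds-check concern raised later in the review.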
Reviews (1): Last reviewed commit: "fix: add speaker embedding matching to o..."
```python
# Speaker identification: match segments against stored person embeddings
# This uses the same pipeline as live recording (speaker_identification_task)
try:
    wav_path = path.replace('.bin', '.wav')
```
Unnecessary `.bin` → `.wav` path substitution
path passed to process_segment is always a .wav file — it comes from segmented_paths, which are populated inside retrieve_vad_segments with paths like f'{path_dir}/{segment_timestamp}.wav'. The .replace('.bin', '.wav') call has no effect here (there is no .bin in the path), so wav_path is always equal to path.
The misleading substitution is a silent no-op today, but it implies that a .bin path might arrive here. If the call-site ever changes, the speaker matching step would silently try to open a file whose name was never transformed, causing the except block to swallow the error with no speaker data written.
```diff
- wav_path = path.replace('.bin', '.wav')
+ wav_path = path  # path is already a .wav segment from retrieve_vad_segments
```
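If the call-site could ever pass a non-`.wav` path, failing loudly beats a silent substitution that the `except` block would swallow. A sketch; the helper name is hypothetical:

```python
import os

def resolve_wav_path(path):
    # Fail fast instead of silently passing an untransformed path
    # into the speaker-matching step, where the error would be swallowed.
    if not path.endswith('.wav'):
        raise ValueError(f'expected a .wav segment, got {os.path.basename(path)}')
    return path
```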
```python
for i, seg in enumerate(transcript_segments):
    seg.is_user = matches[i]['is_user']
    seg.person_id = matches[i].get('person_id')
```
No bounds-check before indexing `matches`
get_speech_profile_matching_predictions returns [{'is_user': False, 'person_id': None}] * len(segments) on the error paths, but on a successful API response it simply returns whatever the remote service returned — there is no guarantee the length matches transcript_segments. If the response contains fewer items, matches[i] raises an IndexError; if it contains more, extra matches are silently ignored.
The current except Exception wrapper will catch the IndexError and log it, so this is not a crash, but it means speaker identification is completely skipped when the API returns even one fewer result than expected.
Consider guarding the loop or falling back to a safe default when lengths differ:
```python
for i, seg in enumerate(transcript_segments):
    if i < len(matches):
        seg.is_user = matches[i]['is_user']
        seg.person_id = matches[i].get('person_id')
```
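An equivalent guard can be sketched with `zip`, which stops at the shorter sequence and so avoids the explicit index check; `apply_matches` here is a hypothetical helper, with `SimpleNamespace` standing in for the real segment objects:

```python
from types import SimpleNamespace

def apply_matches(transcript_segments, matches):
    # zip() truncates to the shorter input, so a short matches list
    # leaves trailing segments at their defaults instead of raising IndexError.
    for seg, match in zip(transcript_segments, matches):
        seg.is_user = match['is_user']
        seg.person_id = match.get('person_id')
    return transcript_segments

# Two segments, but the API returned only one match.
segments = [SimpleNamespace(is_user=False, person_id=None) for _ in range(2)]
result = apply_matches(segments, [{'is_user': True, 'person_id': 'p1'}])
```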
AI PRs with low effort are not welcome here. Thank you. — by CTO
Hey @sungdark 👋 Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request. After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:
Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out. Thank you for being part of the Omi community! 💜
Fix: Offline sync no speaker diarization (issue #5907)
Problem
Offline recording sync (sync_local_files / process_segment) was skipping the speaker identification pipeline, causing all transcribed segments to show generic 'SPEAKER_00', 'SPEAKER_01' labels instead of being matched against stored person embeddings. Live recording worked correctly because it runs speaker_identification_task which calls get_speech_profile_matching_predictions to identify speakers from their voice embeddings.
Solution
Added the same speaker embedding matching call to process_segment after postprocess_words returns. The get_speech_profile_matching_predictions function extracts speaker embeddings from the audio and matches them against stored person embeddings, setting is_user and person_id on each segment.
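A minimal sketch of that step, assuming the function and field names from the description above; the exact signature of `get_speech_profile_matching_predictions` and the `try`/`except` wrapper follow the review excerpts, with a length guard added:

```python
from types import SimpleNamespace

def run_speaker_matching(uid, wav_path, transcript_segments, get_matches):
    """Sketch of the step added after postprocess_words returns.

    get_matches stands in for get_speech_profile_matching_predictions;
    its exact signature in the codebase is an assumption here.
    """
    try:
        matches = get_matches(uid, wav_path, transcript_segments)
        for i, seg in enumerate(transcript_segments):
            if i < len(matches):  # tolerate short API responses
                seg.is_user = matches[i]['is_user']
                seg.person_id = matches[i].get('person_id')
    except Exception as e:
        # Best-effort: segments keep their generic SPEAKER_XX labels on failure.
        print(f'speaker matching skipped: {e}')
    return transcript_segments

# Fake matcher: marks only the first segment as the profile owner.
fake = lambda uid, path, segs: [{'is_user': True, 'person_id': None}]
segs = [SimpleNamespace(is_user=False, person_id=None) for _ in range(2)]
out = run_speaker_matching('u1', '/tmp/seg.wav', segs, fake)
```

Because the matching is wrapped in `try`/`except`, any API failure degrades to the pre-fix behavior (generic speaker labels) rather than aborting the sync.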
Changes
Testing
The fix follows the same pattern used in postprocess_conversation.py's _handle_segment_embedding_matching function and the speaker_identification_task in transcribe.py.
Closes #5907