feat: add OpenAI GPT-4o speaker diarization support #3814
Add support for OpenAI's gpt-4o-transcribe-diarize model with speaker diarization functionality. This enables automatic identification and separation of different speakers in transcriptions.
Changes:
- Add openaiDiarize enum and provider configuration
- Add openAIDiarize response schema for diarized_json format
- Extend speaker ID parsing to support "A", "B", "C" format
- Add factory method for OpenAI diarize provider
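The extended speaker ID parsing listed above could look roughly like the sketch below. It is written in Python purely for illustration and is not the PR's actual code: the helper name parse_speaker_id, the numeric label formats it accepts, and the fallback to speaker 0 are all assumptions. Only the "A", "B", "C" label format comes from the description.

```python
import re


def parse_speaker_id(label: str) -> int:
    """Map a speaker label to a zero-based index (hypothetical helper)."""
    text = label.strip()

    # Numeric labels such as "SPEAKER_01" or "3" (assumed formats):
    # use the trailing digits as the index.
    match = re.search(r"(\d+)$", text)
    if match:
        return int(match.group(1))

    # Single-letter labels such as "A", "B", "C": A -> 0, B -> 1, C -> 2.
    if len(text) == 1 and text.isascii() and text.isalpha():
        return ord(text.upper()) - ord("A")

    # Unrecognized label: fall back to speaker 0 (an assumption).
    return 0
```

Checking for trailing digits first keeps existing numeric labels working unchanged, so the letter branch only handles labels that previously had no mapping.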
Code Review
This pull request adds support for OpenAI's GPT-4o speaker diarization feature. The changes include adding a new SttProvider enum, defining its configuration and response schema, and updating the speaker ID parsing logic to handle the new format.
The implementation is mostly correct, but I've identified a couple of areas for improvement regarding maintainability. Specifically, there's a duplication of configuration logic for the new provider, and a hardcoded model name that could be made more flexible. My comments provide details and suggestions for refactoring these parts to make the code more robust and easier to maintain.
…parameter
- Change default language from 'ja' to 'en' for openAIDiarize provider
- Fix model parameter to use dynamic value instead of hardcoded string
- Ensure consistency with other STT providers
sir give me the demo then we go.
@beastoin I tested this branch on the iOS Simulator and verified the functionality!
Steps:
Test case:
Demo video and screenshots are attached below.
Simulator.Screen.Recording.-.iPhone15Pro-iOS17.-.2025-12-19.at.01.33.27.mov
LGTM @syou6162 Thank you and congrats on the first contribution to OMI!
* feat: add OpenAI GPT-4o speaker diarization support
  Add support for OpenAI's gpt-4o-transcribe-diarize model with speaker diarization functionality. This enables automatic identification and separation of different speakers in transcriptions.
  Changes:
  - Add openaiDiarize enum and provider configuration
  - Add openAIDiarize response schema for diarized_json format
  - Extend speaker ID parsing to support "A", "B", "C" format
  - Add factory method for OpenAI diarize provider
* fix: update OpenAI Diarize default language to English and fix model parameter
  - Change default language from 'ja' to 'en' for openAIDiarize provider
  - Fix model parameter to use dynamic value instead of hardcoded string
  - Ensure consistency with other STT providers



Summary
Add support for OpenAI's gpt-4o-transcribe-diarize model to enable speaker diarization when using OpenAI as the STT provider.

Background
The default STT engine (Deepgram) supports speaker diarization, allowing the app to annotate which speaker said each segment. However, for users who need better recognition accuracy in certain languages (e.g., Japanese), OpenAI Whisper is a better choice.
The problem is that the standard OpenAI Whisper model (whisper-1) does not support speaker diarization, resulting in all segments being attributed to "Speaker 0". This makes it difficult to:

Solution
OpenAI recently released gpt-4o-transcribe-diarize, a model that provides both high-quality transcription and speaker diarization. This PR adds it as a selectable option in the STT settings.

References