feat: add OpenAI GPT-4o speaker diarization support#3814

Merged
beastoin merged 2 commits into BasedHardware:main from syou6162:feature/openai-diarize-support
Dec 19, 2025

@syou6162 syou6162 commented Dec 17, 2025

Summary

Add support for OpenAI's gpt-4o-transcribe-diarize model to enable speaker diarization when using OpenAI as the STT provider.

Background

The default STT engine (Deepgram) supports speaker diarization, allowing the app to annotate which speaker said each segment. However, for users who need better recognition accuracy in certain languages (e.g., Japanese), OpenAI Whisper is a better choice.

The problem is that the standard OpenAI Whisper model (whisper-1) does not support speaker diarization, resulting in all segments being attributed to "Speaker 0". This makes it difficult to:

  • Correctly annotate who said what
  • Review conversations with multiple speakers in the app

Solution

OpenAI recently released gpt-4o-transcribe-diarize, a model that provides both high-quality transcription and speaker diarization. This PR adds it as a selectable option in the STT settings.
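
As an illustration of what "transcription plus diarization" means for the app, here is a minimal sketch in Python (the PR itself is Dart) that turns a diarized response into per-speaker segments. The payload shape (`segments` entries with `speaker`, `start`, `end`, `text`) and the names `Segment` and `parse_diarized_response` are assumptions made for this example, not the schema added by the PR.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Segment:
    speaker: str   # speaker label as returned by the model, e.g. "A"
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds
    text: str      # transcribed text for this segment


def parse_diarized_response(payload: dict[str, Any]) -> list[Segment]:
    """Convert an assumed diarized payload into a list of Segment objects."""
    segments = []
    for raw in payload.get("segments", []):
        segments.append(
            Segment(
                speaker=str(raw.get("speaker", "")),
                start=float(raw.get("start", 0.0)),
                end=float(raw.get("end", 0.0)),
                text=str(raw.get("text", "")).strip(),
            )
        )
    return segments


# Hypothetical sample payload, for illustration only.
sample = {
    "segments": [
        {"speaker": "A", "start": 0.0, "end": 2.1, "text": " Good evening."},
        {"speaker": "B", "start": 2.1, "end": 4.0, "text": " Thanks for having me."},
    ]
}

for segment in parse_diarized_response(sample):
    print(f"[{segment.start:.1f}-{segment.end:.1f}] {segment.speaker}: {segment.text}")
```

With whisper-1 there is no per-segment speaker label to read, which is why every segment ends up attributed to "Speaker 0".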

Add support for OpenAI's gpt-4o-transcribe-diarize model with speaker
diarization functionality. This enables automatic identification and
separation of different speakers in transcriptions.

Changes:
- Add openaiDiarize enum and provider configuration
- Add openAIDiarize response schema for diarized_json format
- Extend speaker ID parsing to support "A", "B", "C" format
- Add factory method for OpenAI diarize provider
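
The speaker ID change in the list above can be sketched roughly as follows. This is an illustrative Python version, not the PR's Dart code: the function name `speaker_label_to_id`, the fallback to 0, and the handling of labels such as "SPEAKER_01" are assumptions for the example.

```python
import re


def speaker_label_to_id(label) -> int:
    """Map a speaker label to a zero-based integer ID.

    Handles three assumed input styles:
      - integers, returned unchanged
      - single letters ("A", "B", "C"), mapped to 0, 1, 2
      - labels ending in digits ("SPEAKER_01", "Speaker 2"), mapped to that number
    Anything else falls back to 0.
    """
    if isinstance(label, int):
        return label

    text = str(label).strip()

    # Letter format: "A" -> 0, "B" -> 1, "C" -> 2, ...
    if re.fullmatch(r"[A-Za-z]", text):
        return ord(text.upper()) - ord("A")

    # Numeric suffix format: "SPEAKER_01" -> 1, "Speaker 2" -> 2
    match = re.search(r"(\d+)$", text)
    if match:
        return int(match.group(1))

    # Unknown format: attribute to the first speaker.
    return 0


for label in ["A", "B", "C", "SPEAKER_01", 3, "unknown"]:
    print(label, "->", speaker_label_to_id(label))
```

Keeping the existing numeric formats working alongside the letter format means other STT providers are unaffected by the change.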

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request adds support for OpenAI's GPT-4o speaker diarization feature. The changes include adding a new SttProvider enum, defining its configuration and response schema, and updating the speaker ID parsing logic to handle the new format.

The implementation is mostly correct, but I've identified a couple of areas for improvement regarding maintainability. Specifically, there's a duplication of configuration logic for the new provider, and a hardcoded model name that could be made more flexible. My comments provide details and suggestions for refactoring these parts to make the code more robust and easier to maintain.

Comment thread app/lib/models/stt_provider.dart Outdated
Comment thread app/lib/services/sockets/transcription_polling_service.dart
…parameter

- Change default language from 'ja' to 'en' for openAIDiarize provider
- Fix model parameter to use dynamic value instead of hardcoded string
- Ensure consistency with other STT providers
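
The idea behind this fix can be sketched as below: the request reads the model name and default language from the provider configuration instead of embedding a literal string. This is a hypothetical Python illustration; `SttProviderConfig`, `build_request_fields`, and the field names are assumptions, and the real change lives in the app's Dart code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SttProviderConfig:
    model: str                       # model name sent with every request
    default_language: str = "en"     # used when the caller does not pick a language
    response_format: str = "json"    # response format requested from the provider


def build_request_fields(config: SttProviderConfig, language: Optional[str] = None) -> dict:
    """Build form fields from the config rather than from hardcoded strings."""
    return {
        "model": config.model,
        "language": language or config.default_language,
        "response_format": config.response_format,
    }


# Hypothetical configuration for the diarize provider.
OPENAI_DIARIZE = SttProviderConfig(
    model="gpt-4o-transcribe-diarize",
    response_format="diarized_json",
)

print(build_request_fields(OPENAI_DIARIZE))
print(build_request_fields(OPENAI_DIARIZE, language="ja"))
```

With this shape, a user who needs Japanese still selects it explicitly, while the default stays consistent with the other providers.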
@syou6162 syou6162 marked this pull request as ready for review December 17, 2025 06:20
@beastoin
Collaborator

sir give me the demo then we go.

@syou6162
Collaborator Author

@beastoin I tested this branch on the iOS Simulator and verified the functionality!

Steps:

  1. Go to "Developer Settings" → "Transcription"
  2. Select "OpenAI GPT-4o (Speaker)"
  3. Save the settings and start transcription

Test case:
I used BBC News audio as an example of multi-speaker content. The OpenAI model successfully identified and separated multiple speakers in the transcription.

Demo video and screenshots are attached below.

Simulator.Screen.Recording.-.iPhone15Pro-iOS17.-.2025-12-19.at.01.33.27.mov
Screenshot 2025-12-19 1 44 54, Screenshot 2025-12-19 1 45 03, Screenshot 2025-12-19 1 45 11

@beastoin
Collaborator

LGTM @syou6162

Thank you and congrats on the first contribution to OMI!

@beastoin beastoin merged commit cff2113 into BasedHardware:main Dec 19, 2025
@syou6162 syou6162 deleted the feature/openai-diarize-support branch December 19, 2025 02:17
Glucksberg pushed a commit to Glucksberg/omi-local that referenced this pull request Apr 28, 2026
* feat: add OpenAI GPT-4o speaker diarization support

Add support for OpenAI's gpt-4o-transcribe-diarize model with speaker
diarization functionality. This enables automatic identification and
separation of different speakers in transcriptions.

Changes:
- Add openaiDiarize enum and provider configuration
- Add openAIDiarize response schema for diarized_json format
- Extend speaker ID parsing to support "A", "B", "C" format
- Add factory method for OpenAI diarize provider

* fix: update OpenAI Diarize default language to English and fix model parameter

- Change default language from 'ja' to 'en' for openAIDiarize provider
- Fix model parameter to use dynamic value instead of hardcoded string
- Ensure consistency with other STT providers