Migrate speaker diarization to OSS SDK by Alex-Wengg · Pull Request #1 · FluidInference/FluidAudio

Alex-Wengg · 2025-06-22T04:19:30Z

Migrating speaker diarization functionality from Slipbox to open source SDK. This creates a standalone, reusable component that other developers can integrate.

Extract SpeakerDiarizationManager as independent Swift package
Add SherpaOnnx wrapper integration
Include model auto-download functionality
Slipbox branch using this library is SeamlessAudioSwift
main Sources/SeamlessAudioSwift/SeamlessAudioSwift.swift

usage
let manager = SpeakerDiarizationManager()
await manager.initialize()
let segments = try await manager.performSegmentation(audioSamples)

… binary files - Migrate diarizer functionality from slipbox repo - Set up proper Swift Package structure with SherpaOnnx integration - Configure Git LFS for all .a library files to avoid GitHub size limits - Add comprehensive test suite - Fix module map and linker settings for proper C/Swift interop

BrandonWeng · 2025-06-22T20:56:49Z

Package.swift

@@ -0,0 +1,59 @@
+// swift-tools-version: 5.9


Lets use 6.1

.gitattributes

BrandonWeng

Great starting point! We can merge this first and make improvements as needed

Properly support different Apple architectures, currently we only support MacOS
Import and support our CoreML diarization models
Improve our benchmarks too

- Add .DS_Store and .swiftpm to .gitignore to exclude system files - Remove existing .DS_Store files from tracking - Remove .swiftpm directory from tracking (auto-generated by Xcode) - Add comprehensive README with proper attribution to SherpaOnnx - Include installation instructions, usage examples, and model attribution - Credit K2-FSA team and sherpa-onnx project for underlying libraries

- Update swift-tools-version from 5.9 to 6.0 - Remove Sendable conformance from SpeakerDiarizationManager to fix concurrency errors - Exclude lib/ directory from SherpaOnnxWrapper target to avoid unhandled file warnings - Update README requirements to reflect Swift 6.0+ and Xcode 16.0+ requirements - All tests passing with Swift 6.0 strict concurrency checking

- Updated swift-tools-version from 6.0 to 6.1 in Package.swift - Added .swift-version file specifying Swift 6.1.2 - Updated README.md to require Swift 6.1+ - Verified all tests pass with Swift 6.1.2

### Why is this change needed?  Taking inspiration from the silero https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py Updating our segmentation implementation and supporitng streaming VAD ```bash %swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming Building for debugging... [1/1] Write swift-version--58304C5D6DBC2206.txt Build of product 'fluidaudio' complete! (0.07s) [00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed [00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc [00:08:02.812] [INFO] [VadManager] VAD model loaded successfully [00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s [00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation... [00:08:02.820] [INFO] [VadAnalyze] • Speech Start at 1.200s [00:08:02.821] [INFO] [VadAnalyze] • Speech End at 2.700s [00:08:02.822] [INFO] [VadAnalyze] • Speech Start at 4.300s [00:08:02.825] [INFO] [VadAnalyze] • Speech End at 7.800s [00:08:02.828] [INFO] [VadAnalyze] • Speech Start at 13.700s [00:08:02.830] [INFO] [VadAnalyze] • Speech End at 16.200s [00:08:02.830] [INFO] [VadAnalyze] • Speech Start at 17.300s [00:08:02.832] [INFO] [VadAnalyze] • Speech End at 19.000s [00:08:02.839] [INFO] [VadAnalyze] • Speech Start at 29.600s [00:08:02.840] [INFO] [VadAnalyze] • Speech End at 30.600s [00:08:02.849] [INFO] [VadAnalyze] • Speech Start at 45.000s [00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments... [00:08:02.850] [INFO] [VadAnalyze] • Speech End at 45.500s [00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events % swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds Building for debugging... [1/1] Write swift-version--58304C5D6DBC2206.txt Build of product 'fluidaudio' complete! (0.07s) [00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed [00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc [00:08:08.309] [INFO] [VadManager] VAD model loaded successfully [00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s [00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation... [00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s [00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s) [00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s) [00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s) [00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s) [00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s) [00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s) [00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s) % ffmpeg -i voiceink-issue-279.wav -af silencedetect=noise=-30dB:d=0.5 -f null - ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers built with Apple clang version 17.0.0 (clang-1700.0.13.3) ... libavutil 60. 8.100 / 60. 8.100 libavcodec 62. 11.100 / 62. 11.100 libavformat 62. 3.100 / 62. 3.100 libavdevice 62. 1.100 / 62. 1.100 libavfilter 11. 4.100 / 11. 4.100 libswscale 9. 1.100 / 9. 1.100 libswresample 6. 1.100 / 6. 1.100 [aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono Input #0, wav, from 'voiceink-issue-279.wav': Duration: 00:00:45.66, bitrate: 256 kb/s Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s Stream mapping: Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native)) Press [q] to stop, [?] for help Output #0, null, to 'pipe:': Metadata: encoder : Lavf62.3.100 Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s Metadata: encoder : Lavc62.11.100 pcm_s16le [silencedetect @ 0xb22c6c420] silence_start: 0 [silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364 [silencedetect @ 0xb22c6c420] silence_start: 2.305687 [silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125 [silencedetect @ 0xb22c6c420] silence_start: 7.579813 [silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125 [silencedetect @ 0xb22c6c420] silence_start: 15.845063 [silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687 [silencedetect @ 0xb22c6c420] silence_start: 18.692625 [silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813 [silencedetect @ 0xb22c6c420] silence_start: 30.367563 [silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445 [silencedetect @ 0xb22c6c420] silence_start: 41.454687 [silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125 [out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00 ```

Migrate speaker diarization to OSS SDK

### Why is this change needed?  Taking inspiration from the silero https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py Updating our segmentation implementation and supporitng streaming VAD ```bash %swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming Building for debugging... [1/1] Write swift-version--58304C5D6DBC2206.txt Build of product 'fluidaudio' complete! (0.07s) [00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed [00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc [00:08:02.812] [INFO] [VadManager] VAD model loaded successfully [00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s [00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation... [00:08:02.820] [INFO] [VadAnalyze] • Speech Start at 1.200s [00:08:02.821] [INFO] [VadAnalyze] • Speech End at 2.700s [00:08:02.822] [INFO] [VadAnalyze] • Speech Start at 4.300s [00:08:02.825] [INFO] [VadAnalyze] • Speech End at 7.800s [00:08:02.828] [INFO] [VadAnalyze] • Speech Start at 13.700s [00:08:02.830] [INFO] [VadAnalyze] • Speech End at 16.200s [00:08:02.830] [INFO] [VadAnalyze] • Speech Start at 17.300s [00:08:02.832] [INFO] [VadAnalyze] • Speech End at 19.000s [00:08:02.839] [INFO] [VadAnalyze] • Speech Start at 29.600s [00:08:02.840] [INFO] [VadAnalyze] • Speech End at 30.600s [00:08:02.849] [INFO] [VadAnalyze] • Speech Start at 45.000s [00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments... [00:08:02.850] [INFO] [VadAnalyze] • Speech End at 45.500s [00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events % swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds Building for debugging... [1/1] Write swift-version--58304C5D6DBC2206.txt Build of product 'fluidaudio' complete! (0.07s) [00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed [00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc [00:08:08.309] [INFO] [VadManager] VAD model loaded successfully [00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s [00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation... [00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s [00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s) [00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s) [00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s) [00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s) [00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s) [00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s) [00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s) % ffmpeg -i voiceink-issue-279.wav -af silencedetect=noise=-30dB:d=0.5 -f null - ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers built with Apple clang version 17.0.0 (clang-1700.0.13.3) ... libavutil 60. 8.100 / 60. 8.100 libavcodec 62. 11.100 / 62. 11.100 libavformat 62. 3.100 / 62. 3.100 libavdevice 62. 1.100 / 62. 1.100 libavfilter 11. 4.100 / 11. 4.100 libswscale 9. 1.100 / 9. 1.100 libswresample 6. 1.100 / 6. 1.100 [aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono Input #0, wav, from 'voiceink-issue-279.wav': Duration: 00:00:45.66, bitrate: 256 kb/s Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s Stream mapping: Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native)) Press [q] to stop, [?] for help Output #0, null, to 'pipe:': Metadata: encoder : Lavf62.3.100 Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s Metadata: encoder : Lavc62.11.100 pcm_s16le [silencedetect @ 0xb22c6c420] silence_start: 0 [silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364 [silencedetect @ 0xb22c6c420] silence_start: 2.305687 [silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125 [silencedetect @ 0xb22c6c420] silence_start: 7.579813 [silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125 [silencedetect @ 0xb22c6c420] silence_start: 15.845063 [silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687 [silencedetect @ 0xb22c6c420] silence_start: 18.692625 [silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813 [silencedetect @ 0xb22c6c420] silence_start: 30.367563 [silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445 [silencedetect @ 0xb22c6c420] silence_start: 41.454687 [silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125 [out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00 ```

Alex-Wengg added 3 commits June 22, 2025 00:04

Working branch

de1ea4c

Move libonnxruntime.a to Git LFS

c0b42c8

BrandonWeng reviewed Jun 22, 2025

View reviewed changes

Package.swift Outdated

@@ -0,0 +1,59 @@

// swift-tools-version: 5.9

Copy link
Copy Markdown

Member

BrandonWeng Jun 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets use 6.1

BrandonWeng reviewed Jun 22, 2025

View reviewed changes

.gitattributes Show resolved Hide resolved

BrandonWeng approved these changes Jun 22, 2025

View reviewed changes

Alex-Wengg added 3 commits June 23, 2025 18:10

Upgrade to Swift 6.1.2

6bee0f9

- Updated swift-tools-version from 6.0 to 6.1 in Package.swift - Added .swift-version file specifying Swift 6.1.2 - Updated README.md to require Swift 6.1+ - Verified all tests pass with Swift 6.1.2

Alex-Wengg merged commit 9c56c81 into main Jun 23, 2025

BrandonWeng deleted the beta branch June 24, 2025 19:47

rohithjnayak mentioned this pull request Oct 31, 2025

Add support for x86_64 architecture #173

Closed

Alex-Wengg added a commit that referenced this pull request Jan 1, 2026

Merge pull request #1 from FluidInference/beta

51b5f2e

Migrate speaker diarization to OSS SDK

claude bot mentioned this pull request Feb 15, 2026

feat: integrate Qwen3-ForcedAligner-0.6B for per-word timestamp alignment #315

Closed

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migrate speaker diarization to OSS SDK#1

Migrate speaker diarization to OSS SDK#1
Alex-Wengg merged 6 commits intomainfrom
beta

Alex-Wengg commented Jun 22, 2025

Uh oh!

BrandonWeng Jun 22, 2025

Uh oh!

Uh oh!

BrandonWeng left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Alex-Wengg commented Jun 22, 2025

Uh oh!

BrandonWeng Jun 22, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

BrandonWeng left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants