Skip to content

Migrate speaker diarization to OSS SDK#1

Merged
Alex-Wengg merged 6 commits intomainfrom
beta
Jun 23, 2025
Merged

Migrate speaker diarization to OSS SDK#1
Alex-Wengg merged 6 commits intomainfrom
beta

Conversation

@Alex-Wengg
Copy link
Copy Markdown
Member

Migrating speaker diarization functionality from Slipbox to open source SDK. This creates a standalone, reusable component that other developers can integrate.

  • Extract SpeakerDiarizationManager as independent Swift package

  • Add SherpaOnnx wrapper integration

  • Include model auto-download functionality

  • Slipbox branch using this library is SeamlessAudioSwift

  • main Sources/SeamlessAudioSwift/SeamlessAudioSwift.swift

usage
let manager = SpeakerDiarizationManager()
await manager.initialize()
let segments = try await manager.performSegmentation(audioSamples)

… binary files

- Migrate diarizer functionality from slipbox repo
- Set up proper Swift Package structure with SherpaOnnx integration
- Configure Git LFS for all .a library files to avoid GitHub size limits
- Add comprehensive test suite
- Fix module map and linker settings for proper C/Swift interop
Package.swift Outdated
@@ -0,0 +1,59 @@
// swift-tools-version: 5.9
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets use 6.1

Copy link
Copy Markdown
Member

@BrandonWeng BrandonWeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great starting point! We can merge this first and make improvements as needed

Properly support different Apple architectures, currently we only support MacOS
Import and support our CoreML diarization models
Improve our benchmarks too

- Add .DS_Store and .swiftpm to .gitignore to exclude system files
- Remove existing .DS_Store files from tracking
- Remove .swiftpm directory from tracking (auto-generated by Xcode)
- Add comprehensive README with proper attribution to SherpaOnnx
- Include installation instructions, usage examples, and model attribution
- Credit K2-FSA team and sherpa-onnx project for underlying libraries
- Update swift-tools-version from 5.9 to 6.0
- Remove Sendable conformance from SpeakerDiarizationManager to fix concurrency errors
- Exclude lib/ directory from SherpaOnnxWrapper target to avoid unhandled file warnings
- Update README requirements to reflect Swift 6.0+ and Xcode 16.0+ requirements
- All tests passing with Swift 6.0 strict concurrency checking
- Updated swift-tools-version from 6.0 to 6.1 in Package.swift
- Added .swift-version file specifying Swift 6.1.2
- Updated README.md to require Swift 6.1+
- Verified all tests pass with Swift 6.1.2
@Alex-Wengg Alex-Wengg merged commit 9c56c81 into main Jun 23, 2025
@BrandonWeng BrandonWeng deleted the beta branch June 24, 2025 19:47
BrandonWeng added a commit that referenced this pull request Sep 17, 2025
### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

Taking inspiration from the silero
https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py

Updating our segmentation implementation and supporitng streaming VAD

```bash
%swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze]   • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze]   • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze]   • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze]   • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze]   • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze]   • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze]   • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze]   • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze]   • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze]   • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav  -af silencedetect=noise=-30dB:d=0.5  -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
Alex-Wengg added a commit that referenced this pull request Jan 1, 2026
Migrate speaker diarization to OSS SDK
Alex-Wengg pushed a commit that referenced this pull request Jan 1, 2026
### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

Taking inspiration from the silero
https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py

Updating our segmentation implementation and supporitng streaming VAD

```bash
%swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze]   • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze]   • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze]   • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze]   • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze]   • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze]   • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze]   • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze]   • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze]   • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze]   • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav  -af silencedetect=noise=-30dB:d=0.5  -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
SGD2718 pushed a commit that referenced this pull request Jan 4, 2026
### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

Taking inspiration from the silero
https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py

Updating our segmentation implementation and supporitng streaming VAD

```bash
%swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze]   • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze]   • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze]   • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze]   • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze]   • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze]   • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze]   • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze]   • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze]   • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze]   • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav  -af silencedetect=noise=-30dB:d=0.5  -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
Alex-Wengg pushed a commit that referenced this pull request Jan 5, 2026
### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

Taking inspiration from the silero
https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py

Updating our segmentation implementation and supporitng streaming VAD

```bash
%swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze]   • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze]   • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze]   • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze]   • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze]   • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze]   • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze]   • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze]   • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze]   • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze]   • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav  -af silencedetect=noise=-30dB:d=0.5  -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants