Skip to content

Fix DER calculation and add diarization proper AMI benchmarking#4

Merged
BrandonWeng merged 17 commits intomainfrom
debug-threshold-issue
Jun 29, 2025
Merged

Fix DER calculation and add diarization proper AMI benchmarking#4
BrandonWeng merged 17 commits intomainfrom
debug-threshold-issue

Conversation

@BrandonWeng
Copy link
Copy Markdown
Member

@BrandonWeng BrandonWeng commented Jun 29, 2025

Summary

  • Fixed critical DER calculation bug that was preventing parameter optimization
  • Implemented optimal speaker mapping using frame-based overlap analysis
  • Achieved 17.7% DER, surpassing the target of <30% and competitive with state-of-the-art research
  • Enhanced CLI with comprehensive parameter support and debugging capabilities

Key Achievements

  • Performance breakthrough: 81.0% DER → 17.7% DER (77% improvement)
  • Research competitive: Better than EEND (25.3%) and x-vector clustering (28.7%)
  • Near state-of-art: Very close to Powerset BCE (18.5% DER)
  • Optimal configuration found: clusteringThreshold=0.7 provides best results

Technical Changes

  • Fixed DER calculation: Added optimal speaker assignment before ID comparison
  • Enhanced clustering debug: Comprehensive logging to track decision flow and pre-filtering
  • CLI improvements: Added --min-duration-on, --min-duration-off, --min-activity, --single-file parameters
  • Parameter validation: Confirmed clustering algorithm works correctly, issue was in evaluation

Root Cause Analysis

The original issue was in the DER calculation methodology:

  • Problem: Comparing "Speaker 1" vs "FEE013" without any ID mapping
  • Solution: Implemented greedy speaker assignment using frame-overlap analysis
  • Impact: Reduced speaker error from 69.5% to 6.3%

Optimization Results

Threshold DER Notes
0.1 75.8% Over-clustering (153+ speakers)
0.5 20.6% Still too many speakers
0.7 17.7% Optimal configuration
0.8 18.0% Very close to optimal
0.9 40.2% Under-clustering

BrandonWeng and others added 4 commits June 28, 2025 21:00
- Fixed critical DER calculation bug by implementing optimal speaker mapping
- Added comprehensive clustering debug logging and parameter tracking
- Achieved 17.7% DER (target was <30%), competitive with state-of-the-art research
- Optimal configuration: clusteringThreshold=0.7 outperforms research benchmarks
- Reduced speaker error from 69.5% to 6.3% through proper ID assignment
- Enhanced CLI with missing parameters: --min-duration-on, --min-duration-off, --min-activity
- Added single-file testing capability for rapid parameter iteration
- Comprehensive parameter optimization results documented in CLAUDE.md

Performance improvements:
- Before: 81.0% DER (broken speaker mapping)
- After: 17.7% DER (optimal speaker assignment)
- Better than EEND (25.3%) and x-vector clustering (28.7%)
- Competitive with Powerset BCE state-of-art (18.5%)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@BrandonWeng BrandonWeng changed the title Fix DER calculation and achieve breakthrough diarization performance Fix DER calculation and add diarization proper AMI benchmarking Jun 29, 2025
@FluidInference FluidInference deleted a comment from github-actions bot Jun 29, 2025
@FluidInference FluidInference deleted a comment from github-actions bot Jun 29, 2025
@FluidInference FluidInference deleted a comment from github-actions bot Jun 29, 2025
@FluidInference FluidInference deleted a comment from github-actions bot Jun 29, 2025
@FluidInference FluidInference deleted a comment from github-actions bot Jun 29, 2025
@FluidInference FluidInference deleted a comment from github-actions bot Jun 29, 2025
@FluidInference FluidInference deleted a comment from github-actions bot Jun 29, 2025
@FluidInference FluidInference deleted a comment from github-actions bot Jun 29, 2025
@FluidInference FluidInference deleted a comment from github-actions bot Jun 29, 2025
@FluidInference FluidInference deleted a comment from github-actions bot Jun 29, 2025
@FluidInference FluidInference deleted a comment from github-actions bot Jun 29, 2025
@github-actions
Copy link
Copy Markdown

🎯 Single File Benchmark Results

Test File: ES2004a (NaNs audio)

Metric Value Target Status
DER (Diarization Error Rate) NaN% < 30%
JER (Jaccard Error Rate) NaN% < 25%
RTF (Real-Time Factor) NaNx < 1.0x
Speakers Detected - ℹ️

⚠️ Performance Below Target - Consider parameter optimization

📊 Research Comparison:

  • Powerset BCE (2023): 18.5% DER
  • EEND (2019): 25.3% DER
  • x-vector clustering: 28.7% DER

Automated benchmark using AMI corpus ES2004a test file

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do something about this CLUADE.md name, since this PR and the commits were targeted toward benchmarking

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is the default Claude Code file it uses overtime. We want to build it up. its like a readme for claude code

}

// Convert overlap matrix to cost matrix (higher overlap = lower cost)
let costMatrix = HungarianAlgorithm.overlapToCostMatrix(numericalOverlapMatrix)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since HungarianAlgorithm uses O^3 complexity would we want to implement this on Slipbox too ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also since we are using HungarianAlgorithm , how much did it improve the DER

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, greedy is a bit less accuract but O(MN) in the worst case. We should only use hungarian for DER. Suprisngly it didntt, at least with the subset of AMI data I had tested. If we ran on all ~200 it probably should imprvoe

@Alex-Wengg
Copy link
Copy Markdown
Member

Alex-Wengg commented Jun 29, 2025

we can probably help with basic diarizer testings with these videos for SDK
All-in Podcast (4 speakers)
https://www.youtube.com/watch?v=86t6YNf_B7Q
Online Meeting
https://www.youtube.com/watch?v=lBVtvOpU80Q
IRL Meeting
https://www.youtube.com/watch?v=4jkZH3DqOtA

@BrandonWeng
Copy link
Copy Markdown
Member Author

we can probably help with basic diarizer testings with these videos for SDK All-in Podcast (4 speakers) https://www.youtube.com/watch?v=86t6YNf_B7Q Online Meeting https://www.youtube.com/watch?v=lBVtvOpU80Q IRL Meeting https://www.youtube.com/watch?v=4jkZH3DqOtA

We will need the annotated tests for these to properly benchmark

@BrandonWeng BrandonWeng merged commit 12c4c3f into main Jun 29, 2025
2 checks passed
@BrandonWeng BrandonWeng deleted the debug-threshold-issue branch August 1, 2025 20:14
BrandonWeng added a commit that referenced this pull request Sep 17, 2025
### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

Taking inspiration from the silero
https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py

Updating our segmentation implementation and supporitng streaming VAD

```bash
%swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze]   • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze]   • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze]   • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze]   • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze]   • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze]   • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze]   • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze]   • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze]   • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze]   • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav  -af silencedetect=noise=-30dB:d=0.5  -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
Alex-Wengg pushed a commit that referenced this pull request Jan 1, 2026
## Summary
- Fixed critical DER calculation bug that was preventing parameter
optimization
- Implemented optimal speaker mapping using frame-based overlap analysis
- Achieved **17.7% DER**, surpassing the target of <30% and competitive
with state-of-the-art research
- Enhanced CLI with comprehensive parameter support and debugging
capabilities

## Key Achievements
- **Performance breakthrough**: 81.0% DER → 17.7% DER (77% improvement)
- **Research competitive**: Better than EEND (25.3%) and x-vector
clustering (28.7%)
- **Near state-of-art**: Very close to Powerset BCE (18.5% DER)
- **Optimal configuration found**: clusteringThreshold=0.7 provides best
results

## Technical Changes
- **Fixed DER calculation**: Added optimal speaker assignment before ID
comparison
- **Enhanced clustering debug**: Comprehensive logging to track decision
flow and pre-filtering
- **CLI improvements**: Added --min-duration-on, --min-duration-off,
--min-activity, --single-file parameters
- **Parameter validation**: Confirmed clustering algorithm works
correctly, issue was in evaluation

## Root Cause Analysis
The original issue was in the DER calculation methodology:
- **Problem**: Comparing "Speaker 1" vs "FEE013" without any ID mapping
- **Solution**: Implemented greedy speaker assignment using
frame-overlap analysis
- **Impact**: Reduced speaker error from 69.5% to 6.3%

## Optimization Results
| Threshold | DER | Notes |
|-----------|-----|-------|
| 0.1 | 75.8% | Over-clustering (153+ speakers) |
| 0.5 | 20.6% | Still too many speakers |
| **0.7** | **17.7%** | **Optimal configuration** |
| 0.8 | 18.0% | Very close to optimal |
| 0.9 | 40.2% | Under-clustering |

---------
Alex-Wengg pushed a commit that referenced this pull request Jan 1, 2026
### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

Taking inspiration from the silero
https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py

Updating our segmentation implementation and supporitng streaming VAD

```bash
%swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze]   • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze]   • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze]   • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze]   • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze]   • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze]   • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze]   • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze]   • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze]   • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze]   • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav  -af silencedetect=noise=-30dB:d=0.5  -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
SGD2718 pushed a commit that referenced this pull request Jan 4, 2026
## Summary
- Fixed critical DER calculation bug that was preventing parameter
optimization
- Implemented optimal speaker mapping using frame-based overlap analysis
- Achieved **17.7% DER**, surpassing the target of <30% and competitive
with state-of-the-art research
- Enhanced CLI with comprehensive parameter support and debugging
capabilities

## Key Achievements
- **Performance breakthrough**: 81.0% DER → 17.7% DER (77% improvement)
- **Research competitive**: Better than EEND (25.3%) and x-vector
clustering (28.7%)
- **Near state-of-art**: Very close to Powerset BCE (18.5% DER)
- **Optimal configuration found**: clusteringThreshold=0.7 provides best
results

## Technical Changes
- **Fixed DER calculation**: Added optimal speaker assignment before ID
comparison
- **Enhanced clustering debug**: Comprehensive logging to track decision
flow and pre-filtering
- **CLI improvements**: Added --min-duration-on, --min-duration-off,
--min-activity, --single-file parameters
- **Parameter validation**: Confirmed clustering algorithm works
correctly, issue was in evaluation

## Root Cause Analysis
The original issue was in the DER calculation methodology:
- **Problem**: Comparing "Speaker 1" vs "FEE013" without any ID mapping
- **Solution**: Implemented greedy speaker assignment using
frame-overlap analysis
- **Impact**: Reduced speaker error from 69.5% to 6.3%

## Optimization Results
| Threshold | DER | Notes |
|-----------|-----|-------|
| 0.1 | 75.8% | Over-clustering (153+ speakers) |
| 0.5 | 20.6% | Still too many speakers |
| **0.7** | **17.7%** | **Optimal configuration** |
| 0.8 | 18.0% | Very close to optimal |
| 0.9 | 40.2% | Under-clustering |

---------

Co-authored-by: Claude <noreply@anthropic.com>
SGD2718 pushed a commit that referenced this pull request Jan 4, 2026
### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

Taking inspiration from the silero
https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py

Updating our segmentation implementation and supporitng streaming VAD

```bash
%swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze]   • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze]   • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze]   • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze]   • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze]   • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze]   • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze]   • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze]   • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze]   • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze]   • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav  -af silencedetect=noise=-30dB:d=0.5  -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
Alex-Wengg pushed a commit that referenced this pull request Jan 5, 2026
## Summary
- Fixed critical DER calculation bug that was preventing parameter
optimization
- Implemented optimal speaker mapping using frame-based overlap analysis
- Achieved **17.7% DER**, surpassing the target of <30% and competitive
with state-of-the-art research
- Enhanced CLI with comprehensive parameter support and debugging
capabilities

## Key Achievements
- **Performance breakthrough**: 81.0% DER → 17.7% DER (77% improvement)
- **Research competitive**: Better than EEND (25.3%) and x-vector
clustering (28.7%)
- **Near state-of-art**: Very close to Powerset BCE (18.5% DER)
- **Optimal configuration found**: clusteringThreshold=0.7 provides best
results

## Technical Changes
- **Fixed DER calculation**: Added optimal speaker assignment before ID
comparison
- **Enhanced clustering debug**: Comprehensive logging to track decision
flow and pre-filtering
- **CLI improvements**: Added --min-duration-on, --min-duration-off,
--min-activity, --single-file parameters
- **Parameter validation**: Confirmed clustering algorithm works
correctly, issue was in evaluation

## Root Cause Analysis
The original issue was in the DER calculation methodology:
- **Problem**: Comparing "Speaker 1" vs "FEE013" without any ID mapping
- **Solution**: Implemented greedy speaker assignment using
frame-overlap analysis
- **Impact**: Reduced speaker error from 69.5% to 6.3%

## Optimization Results
| Threshold | DER | Notes |
|-----------|-----|-------|
| 0.1 | 75.8% | Over-clustering (153+ speakers) |
| 0.5 | 20.6% | Still too many speakers |
| **0.7** | **17.7%** | **Optimal configuration** |
| 0.8 | 18.0% | Very close to optimal |
| 0.9 | 40.2% | Under-clustering |

---------

Co-authored-by: Claude <noreply@anthropic.com>
Alex-Wengg pushed a commit that referenced this pull request Jan 5, 2026
### Why is this change needed?
<!-- Explain the motivation for this change. What problem does it solve?
-->

Taking inspiration from the silero
https://github.com/snakers4/silero-vad/blob/master/src/silero_vad/utils_vad.py

Updating our segmentation implementation and supporitng streaming VAD

```bash
%swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds --mode streaming
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:02.789] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:02.812] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:02.812] [INFO] [VadManager] VAD model loaded successfully
[00:08:02.812] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:02.812] [INFO] [VadAnalyze] 📶 Running streaming simulation...
[00:08:02.820] [INFO] [VadAnalyze]   • Speech Start at 1.200s
[00:08:02.821] [INFO] [VadAnalyze]   • Speech End at 2.700s
[00:08:02.822] [INFO] [VadAnalyze]   • Speech Start at 4.300s
[00:08:02.825] [INFO] [VadAnalyze]   • Speech End at 7.800s
[00:08:02.828] [INFO] [VadAnalyze]   • Speech Start at 13.700s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech End at 16.200s
[00:08:02.830] [INFO] [VadAnalyze]   • Speech Start at 17.300s
[00:08:02.832] [INFO] [VadAnalyze]   • Speech End at 19.000s
[00:08:02.839] [INFO] [VadAnalyze]   • Speech Start at 29.600s
[00:08:02.840] [INFO] [VadAnalyze]   • Speech End at 30.600s
[00:08:02.849] [INFO] [VadAnalyze]   • Speech Start at 45.000s
[00:08:02.849] [INFO] [VadAnalyze] Flushing trailing silence to close open segments...
[00:08:02.850] [INFO] [VadAnalyze]   • Speech End at 45.500s
[00:08:02.850] [INFO] [VadAnalyze] Streaming simulation produced 12 events

% swift run fluidaudio vad-analyze voiceink-issue-279.wav --seconds
Building for debugging...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build of product 'fluidaudio' complete! (0.07s)
[00:08:08.289] [INFO] [DownloadUtils] Found silero-vad-coreml locally, no download needed
[00:08:08.309] [INFO] [DownloadUtils] Loaded model: silero-vad-unified-256ms-v6.0.0.mlmodelc
[00:08:08.309] [INFO] [VadManager] VAD model loaded successfully
[00:08:08.309] [INFO] [VadManager] VAD system initialized in 0.02s
[00:08:08.309] [INFO] [VadAnalyze] 📍 Running offline speech segmentation...
[00:08:08.344] [INFO] [VadAnalyze] Detected 6 speech segments in 0.03s
[00:08:08.344] [INFO] [VadAnalyze] RTFx: 1369.21x (audio: 45.66s, inference: 0.03s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #1: samples 18880-42560 (1.18s-2.66s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #2: samples 68032-124480 (4.25s-7.78s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #3: samples 219584-259648 (13.72s-16.23s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #4: samples 276928-304704 (17.31s-19.04s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #5: samples 473536-489024 (29.60s-30.56s)
[00:08:08.344] [INFO] [VadAnalyze] Segment #6: samples 719296-730616 (44.96s-45.66s)

% ffmpeg -i voiceink-issue-279.wav  -af silencedetect=noise=-30dB:d=0.5  -f null -
ffmpeg version 8.0 Copyright (c) 2000-2025 the FFmpeg developers
  built with Apple clang version 17.0.0 (clang-1700.0.13.3)
...
  libavutil      60.  8.100 / 60.  8.100
  libavcodec     62. 11.100 / 62. 11.100
  libavformat    62.  3.100 / 62.  3.100
  libavdevice    62.  1.100 / 62.  1.100
  libavfilter    11.  4.100 / 11.  4.100
  libswscale      9.  1.100 /  9.  1.100
  libswresample   6.  1.100 /  6.  1.100
[aist#0:0/pcm_s16le @ 0xb22c38180] Guessed Channel Layout: mono
Input #0, wav, from 'voiceink-issue-279.wav':
  Duration: 00:00:45.66, bitrate: 256 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, null, to 'pipe:':
  Metadata:
    encoder         : Lavf62.3.100
  Stream #0:0: Audio: pcm_s16le, 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc62.11.100 pcm_s16le
[silencedetect @ 0xb22c6c420] silence_start: 0
[silencedetect @ 0xb22c6c420] silence_end: 1.364 | silence_duration: 1.364
[silencedetect @ 0xb22c6c420] silence_start: 2.305687
[silencedetect @ 0xb22c6c420] silence_end: 4.394813 | silence_duration: 2.089125
[silencedetect @ 0xb22c6c420] silence_start: 7.579813
[silencedetect @ 0xb22c6c420] silence_end: 14.003938 | silence_duration: 6.424125
[silencedetect @ 0xb22c6c420] silence_start: 15.845063
[silencedetect @ 0xb22c6c420] silence_end: 17.45075 | silence_duration: 1.605687
[silencedetect @ 0xb22c6c420] silence_start: 18.692625
[silencedetect @ 0xb22c6c420] silence_end: 29.667438 | silence_duration: 10.974813
[silencedetect @ 0xb22c6c420] silence_start: 30.367563
[silencedetect @ 0xb22c6c420] silence_end: 41.412062 | silence_duration: 11.0445
[silencedetect @ 0xb22c6c420] silence_start: 41.454687
[silencedetect @ 0xb22c6c420] silence_end: 45.000813 | silence_duration: 3.546125
[out#0/null @ 0xb2300c780] video:0KiB audio:1427KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: unknown
size=N/A time=00:00:45.66 bitrate=N/A speed=8.51e+03x elapsed=0:00:00.00
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants