
Update README.md #42

Merged

BrandonWeng merged 1 commit into main from BrandonWeng-patch-1 on Jul 27, 2025

Conversation

@BrandonWeng
Member

No description provided.

@BrandonWeng merged commit 9f18a1b into main on Jul 27, 2025
5 checks passed
@BrandonWeng deleted the BrandonWeng-patch-1 branch on July 27, 2025, 16:44
@github-actions

🗣️ Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Description |
| --- | --- | --- | --- |
| DER | 18.7% | <30% | Diarization Error Rate (lower is better) |
| JER | 22.6% | <25% | Jaccard Error Rate (lower is better) |
| RTF | 0.06x | <1.0x | Real-Time Factor (lower is faster) |
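For reference, DER is conventionally the sum of false-alarm, missed-speech, and speaker-confusion durations over the total reference speech time. A minimal sketch of that definition (function and argument names are illustrative, not FluidAudio's API):

```python
def der(false_alarm_s: float, missed_s: float, confusion_s: float,
        total_speech_s: float) -> float:
    """Diarization Error Rate: (false alarm + missed speech +
    speaker confusion) / total reference speech. Lower is better."""
    return (false_alarm_s + missed_s + confusion_s) / total_speech_s

# e.g. 10 s false alarm, 5 s missed, 3 s confused, over 100 s of speech
print(der(10.0, 5.0, 3.0, 100.0))  # 0.18, i.e. 18% DER
```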

⏱️ Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | 0.000 | 0.0 | Fetching diarization models |
| Model Compile | 5.032 | 7.7 | CoreML compilation |
| Audio Load | 0.106 | 0.2 | Loading audio file |
| Segmentation | 14.588 | 22.2 | Detecting speech regions |
| Embedding | 46.001 | 70.0 | Extracting speaker voices |
| Clustering | 0.028 | 0.0 | Grouping same speakers |
| Total | 65.756 | 100.0 | Full pipeline |
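The percentage column above can be sanity-checked by summing the stage times (the stages total 65.755 s; the table's 65.756 s differs only by rounding). A quick sketch:

```python
# Stage timings from the table above (seconds)
stages = {
    "Model Download": 0.000,
    "Model Compile": 5.032,
    "Audio Load": 0.106,
    "Segmentation": 14.588,
    "Embedding": 46.001,
    "Clustering": 0.028,
}

total = sum(stages.values())  # 65.755 s; table reports 65.756 s (rounding)
for name, t in stages.items():
    print(f"{name}: {100 * t / total:.1f}%")  # Embedding dominates at ~70%
```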

📊 Speaker Diarization Research Comparison

Comparing against state-of-the-art diarization methods

| Method | DER | Year | Notes |
| --- | --- | --- | --- |
| FluidAudio | 18.7% | 2025 | On-device CoreML |
| Powerset BCE | 18.5% | 2023 | Research baseline |
| EEND | 25.3% | 2019 | End-to-end neural |
| x-vector clustering | 28.7% | 2018 | Traditional approach |

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.4s meeting audio • 60.6s diarization time • Test runtime: 1m 31s • 07/27/2025, 12:48 PM EST

@github-actions

VAD Benchmark Results

Performance Comparison

| Metric | FluidAudio VAD | Industry Standard |
| --- | --- | --- |
| Accuracy | 98.0% | 85-90% |
| Precision | 96.2% | 85-95% |
| Recall | 100.0% | 80-90% |
| F1-Score | 98.0% | 85.9% (Sohn's VAD) |
| Processing Time | 424.0s (100 files) | ~1ms per 30ms chunk |

Industry Leaders:

  • Silero VAD: ~90-95% F1 (DNN-based, 1.8MB model)
  • WebRTC VAD: ~75-80% F1 (GMM-based, fast but lower accuracy)
  • Sohn's VAD: 77.5% F1 (traditional approach)
  • Modern DNNs: 85-97% F1 (varies by SNR conditions)
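The F1 scores quoted above are the standard harmonic mean of precision and recall; plugging in the table's precision (96.2%) and recall (100.0%) reproduces the ~98% figure. A minimal sketch:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1: harmonic mean of precision and recall (both in [0, 1])."""
    return 2 * precision * recall / (precision + recall)

# Table values: precision 96.2%, recall 100.0%
print(round(f1_score(0.962, 1.0), 3))  # 0.981, consistent with ~98% F1
```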
📊 Detailed Research Comparisons

| Paper | Dataset | F1-Score | Method |
| --- | --- | --- | --- |
| Silero VAD (2021) | TEDx | 88.1% | LSTM-based lightweight model |
| WebRTC VAD | MUSAN | 64.4% | GMM-based (traditional) |
| pyannote.audio (2020) | AMI | 85.9% | SincTDNN architecture |
| MarbleNet (2020) | AVA-Speech | 87.8% | 1D time-channel separable CNN |
| FluidAudio VAD | MUSAN-mini | 98.0% | CoreML-optimized Silero |

Note: Direct comparisons should consider dataset differences. MUSAN contains challenging noise conditions.

@github-actions

ASR Benchmark Results

| Dataset | WER (avg) | WER (median) | RTFx |
| --- | --- | --- | --- |
| test-clean | 4.44% | 0.00% | 1.54x |
| test-other | 8.26% | 2.78% | 1.47x |
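WER in the table above is the usual word error rate: edit-distance errors (substitutions, deletions, insertions) divided by the number of reference words. A minimal sketch of that definition (not the benchmark's actual scoring code):

```python
def wer(substitutions: int, deletions: int, insertions: int,
        reference_words: int) -> float:
    """Word Error Rate: (S + D + I) / N reference words. Lower is better."""
    return (substitutions + deletions + insertions) / reference_words

# e.g. 2 substitutions, 1 deletion, 1 insertion over 100 reference words
print(wer(2, 1, 1, 100))  # 0.04, i.e. 4% WER
```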

500 files per dataset • Test runtime: 39m36s • 07/27/2025, 01:26 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)
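The RTFx calculation described above can be sketched directly:

```python
def rtfx(audio_duration_s: float, processing_time_s: float) -> float:
    """Real-Time Factor: audio duration / processing time.
    Higher is better; >1.0 means faster than real time."""
    return audio_duration_s / processing_time_s

print(rtfx(10.0, 5.0))  # 2.0: 10 s of audio processed in 5 s
```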

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

Alex-Wengg pushed a commit that referenced this pull request Jan 1, 2026
SGD2718 pushed a commit that referenced this pull request Jan 4, 2026
Alex-Wengg pushed a commit that referenced this pull request Jan 5, 2026