docs: add existing benchmark datasets to model conversion guide #392
Alex-Wengg merged 4 commits into main
Conversation
Follow-up to #391. Documents all 11 existing benchmark datasets (LibriSpeech, FLEURS, AISHELL-1, Buckeye, AMI-SDM, etc.) with their domains, sizes, formats, and download locations so contributors know what's already available before creating new datasets.
Claude finished @Alex-Wengg's task — PR Review Complete ✅

I've completed the review of PR #392 and provided feedback using the GitHub review system.
Summary

This PR adds a valuable benchmark datasets table to the model conversion guide. The documentation is well-structured and mostly accurate. I identified a few minor inconsistencies:
Overall: Strong documentation improvement that will help contributors use existing datasets effectively. The minor issues don't block the PR but could be addressed for completeness.
Code Review Summary
This PR adds a comprehensive table of existing benchmark datasets to the model conversion guide, which is a valuable addition. The documentation is generally well-structured and the information appears accurate based on verification against the codebase.
Key Findings
Strengths:
- ✅ Comprehensive coverage of all 11 existing datasets
- ✅ Accurate dataset sizes and locations verified against codebase
- ✅ Clear guidance on using existing datasets vs creating new ones
- ✅ Good table formatting with domain, size, format, and download info
- ✅ Helpful note about auto-download functionality
- ✅ Updated checklist reference to point to the new table
Minor Issues Identified:
- **FLEURS dataset inconsistency:** The codebase uses both `FluidInference/fleurs` and `FluidInference/fleurs-full` repositories in different benchmark commands. The table shows `FluidInference/fleurs`, but some commands reference the `-full` variant.
- **Text normalization dataset location:** The table mentions "Checked in at `text-processing-rs/tests/data/`", but this appears to be a relative path that may not exist in the current repository structure.
- **Language count precision:** The table shows "24+ langs" for FLEURS, but the code shows exactly 30 supported languages. Consider being more precise.
Overall Assessment
This is a solid documentation improvement that will help contributors understand available benchmark datasets. The minor inconsistencies don't impact the core value of the documentation but could be clarified for completeness.
Recommendation: Approve with optional minor clarifications.
ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming.

25 files per dataset • Test runtime: 6m43s • 03/17/2026, 09:03 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time

Expected RTFx Performance on Physical M1 Hardware:
- M1 Mac: ~28x (clean), ~25x (other)

Testing methodology follows the HuggingFace Open ASR Leaderboard
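The RTFx formula above is simple enough to sketch in a few lines of Python. The helper name is my own; the example figures reuse the 1049.0s / 39.6s diarization run reported elsewhere in this thread:

```python
def rtfx(total_audio_seconds: float, total_processing_seconds: float) -> float:
    """Real-Time Factor: seconds of audio processed per second of wall-clock
    time. Higher is better; values above 1.0 mean faster than real time."""
    return total_audio_seconds / total_processing_seconds

# Example: 1049.0 s of audio processed in 39.6 s of wall-clock time
print(round(rtfx(1049.0, 39.6), 1))  # → 26.5
```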
PocketTTS Smoke Test ✅
Runtime: 0m31s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality may differ from Apple Silicon.
Qwen3-ASR int8 Smoke Test ✅
Runtime: 3m56s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 4m 18s • 2026-03-17T23:42:49.444Z
Speaker Diarization Benchmark Results

Speaker Diarization Performance
Evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown
Time spent in each stage of speaker diarization
Speaker Diarization Research Comparison
Research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 39.6s diarization time • Test runtime: 5m 26s • 03/17/2026, 08:02 PM EST
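For readers comparing against the research baselines cited above: DER is conventionally computed as the time-weighted sum of missed speech, false-alarm speech, and speaker-confusion errors over the total reference speech duration. A minimal sketch, with hypothetical component durations that are not taken from this CI run:

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """Standard DER as a percentage: (missed + false alarm + speaker
    confusion) durations divided by total reference speech duration."""
    return 100.0 * (missed + false_alarm + confusion) / total_speech

# Hypothetical component durations in seconds (illustration only):
print(round(diarization_error_rate(40.0, 25.0, 60.0, 900.0), 1))  # → 13.9
```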
Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)
Optimal clustering with the Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown
Time spent in each stage of batch diarization
Speaker Diarization Research Comparison
Offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 235.0s processing • Test runtime: 4m 5s • 03/17/2026, 08:14 PM EST
VAD Benchmark Results

Performance Comparison
Dataset Details
✅: Average F1-Score above 70%
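The F1 gate above is the usual harmonic mean of precision and recall; a minimal sketch (the precision/recall values below are hypothetical, not from this run):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; the VAD gate applies a 70%
    threshold to this value averaged across datasets."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Hypothetical frame-level VAD precision/recall (illustration only):
print(round(f1_score(0.82, 0.75), 3))  # → 0.783
```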
Parakeet EOU Benchmark Results ✅

Status: Benchmark passed

Performance Metrics
Streaming Metrics
Test runtime: 0m13s • 03/17/2026, 09:25 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O
Summary
Documentation/ModelConversion.md

Context
PR #391 added the model conversion guide but didn't mention the existing benchmark datasets. Contributors and coding agents should use the existing datasets for benchmarking rather than creating new ones unless a domain gap exists.
Test plan