feat: add Qwen3-ASR-0.6B CoreML speech recognition #281
Conversation
VAD Benchmark Results

Performance Comparison

Dataset Details

✅: Average F1-Score above 70%
Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode): optimal clustering with Hungarian algorithm for maximum accuracy

Offline VBx Pipeline Timing Breakdown: time spent in each stage of batch diarization

Speaker Diarization Research Comparison: offline VBx achieves competitive accuracy with batch processing

Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 301.3s processing • Test runtime: 6m 51s • 02/11/2026, 04:41 PM EST
Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Sortformer High-Latency • ES2004a • Runtime: 1m 46s • 2026-02-11T21:27:50.015Z
Speaker Diarization Benchmark Results

Speaker Diarization Performance: evaluating "who spoke when" detection accuracy

Diarization Pipeline Timing Breakdown: time spent in each stage of speaker diarization

Speaker Diarization Research Comparison: research baselines typically achieve 18-30% DER on standard datasets

Note: RTFx shown above is from the GitHub Actions runner. Performance is higher on Apple Silicon with ANE.

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 81.6s diarization time • Test runtime: 3m 49s • 02/11/2026, 04:30 PM EST
ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming.
25 files per dataset • Test runtime: 7m50s • 02/11/2026, 04:30 PM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: total audio duration ÷ total processing time
Expected RTFx performance on physical M1 hardware: ~28x (clean), ~25x (other)
Testing methodology follows the HuggingFace Open ASR Leaderboard.
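The RTFx formula above is simple enough to state as code; a minimal Swift sketch (function name is illustrative, not from the repo), using the diarization figures quoted earlier in this thread:

```swift
// RTFx (real-time factor): total audio duration divided by total
// processing time. Higher is better; RTFx = 28 means 28 seconds of
// audio are processed per second of wall-clock time.
func realTimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double {
    precondition(processingSeconds > 0, "processing time must be positive")
    return audioSeconds / processingSeconds
}

// 1049.0 s of meeting audio processed in 81.6 s of diarization time:
let rtfx = realTimeFactor(audioSeconds: 1049.0, processingSeconds: 81.6)
print(String(format: "RTFx: %.1fx", rtfx))  // RTFx: 12.9x
```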
Parakeet EOU Benchmark Results ✅

Status: Benchmark passed

Performance Metrics

Streaming Metrics

Test runtime: 1m26s • 02/11/2026, 04:23 PM EST
RTFx = Real-Time Factor (higher is better) • Processing includes model inference, audio preprocessing, state management, and file I/O
Force-pushed from ab5e438 to d0d92c4
Force-pushed from 78dbb82 to 8890841
int8 quantization does not improve performance for Qwen3-ASR on Apple Silicon. Testing showed int8 was slower (1.4x RTFx) than f32 (2.8x RTFx) due to runtime dequantization overhead across 28 decoder layers that run once per token during autoregressive generation.
Qwen3AsrConfig:
- Convert struct to enum with static properties
- Add Language enum with all 30 supported languages
- Add asrTextTokenId constant

Qwen3AsrManager:
- Convert class to actor for thread safety
- Add typed language support with Qwen3AsrConfig.Language
- Cache WhisperMelSpectrogram instance
- Replace print() with logger.debug()
- Remove dead repetition penalty code

Qwen3AsrModels:
- Fix computeUnits parameter (was ignored)
- Add vocab.json to modelsExist check
- Use native Float16 for embedding weights
- Add validation against Qwen3AsrConfig

Qwen3KVCache:
- Delete file (dead code, manager uses MLModel.makeState())

Qwen3RoPE:
- Remove unused Accelerate import
- Add Sendable conformance
- Fix MemoryLayout.size to .stride

WhisperMelSpectrogram:
- Fix hot path allocation (reuse imagSq buffer)
- Reference Qwen3AsrConfig for sampleRate/nMels
- Vectorize post-processing with vDSP/vForce
- Fix NFKD vs NFKC (use decomposedStringWithCompatibilityMapping)

Qwen3AsrBenchmark:
- Use typed Qwen3AsrConfig.Language enum
- Fix dataset label bug for AISHELL
- Add medianCER to JSON output
- Extract Qwen3BenchmarkSummary to avoid duplication
- Rename LibriSpeechFile to BenchmarkAudioFile
- Reuse AudioConverter instance

Qwen3TranscribeCommand:
- Use typed Qwen3AsrConfig.Language with validation
- Complete language list in usage (all 30)
- Compute duration from samples (remove AVFoundation)

TextNormalizer:
- Remove unused RegexBuilder import
- Add missing "six": "6" in numberWords
- Fix NSRange unicode bug (use utf16.count)
- Fix category checking with proper enum cases
- Make regex patterns static (compile once)
- Convert struct to enum

WERCalculator:
- Add Korean Hangul ranges to containsCJK
- Extract tokenizePair helper
- Add editDistanceChars for Character arrays
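The TextNormalizer NSRange fix ("use utf16.count") addresses a classic Swift pitfall worth spelling out; a minimal sketch of the bug (the string and pattern are illustrative, not from the repo):

```swift
import Foundation

// NSRange and NSRegularExpression operate on UTF-16 code units, but
// String.count counts Characters (extended grapheme clusters). For
// strings containing emoji or other non-BMP characters the two differ,
// so NSRange(location: 0, length: text.count) silently truncates the
// search range and can miss matches near the end of the string.
let text = "price 😀 100"
print(text.count)        // 11 Characters
print(text.utf16.count)  // 12 UTF-16 code units (the emoji takes two)

let regex = try! NSRegularExpression(pattern: "[0-9]+")
// Correct: cover the full UTF-16 extent of the string.
let fullRange = NSRange(location: 0, length: text.utf16.count)
let matches = regex.matches(in: text, range: fullRange)
print(matches.count)  // 1
```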
The key "six": "6" was present in both English and French sections, causing a Swift runtime crash on dictionary initialization.
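The crash occurs because Swift checks dictionary literals for duplicate keys at runtime, not compile time; a minimal reproduction (illustrative entries, not the actual TextNormalizer table):

```swift
// Initializing a Dictionary from a literal with two equal keys traps:
//   Fatal error: Dictionary literal contains duplicate keys
let numberWords: [String: String] = [
    "five": "5",
    "six":  "6",
    "six":  "6",  // duplicate key -> runtime trap on initialization
]
```

When per-language tables must be merged, `Dictionary(_:uniquingKeysWith:)` avoids the trap by letting the caller decide which value wins on collision.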
- Qwen3AsrModels: conform to Sendable, use @preconcurrency import CoreML
- Qwen3AsrManager: use @preconcurrency import CoreML
- Add beta warnings to Qwen3AsrManager and Qwen3AsrModels
- docs: add supported languages section to Qwen3-ASR.md
Remove beta warnings from Swift code (Qwen3AsrManager, Qwen3AsrModels) and add beta notice to Qwen3-ASR.md instead.
Document differences from original PyTorch implementation that may affect accuracy (fixed windows, greedy-only decoding, no streaming, etc.)
Implements a sliding-window streaming approach:
- Accumulates audio chunks and re-transcribes periodically
- Configurable chunk size (default 2s), min audio, max audio
- Returns partial results as audio accumulates
- No state persistence needed (works with the current CoreML model)

Note: True stateful streaming is not possible because CoreML MLState is opaque and non-serializable. This approach re-transcribes from the start on each update, which is acceptable for audio under 30s.
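The sliding-window approach can be sketched as follows (the actor shape, method names, and the injected transcribe closure are assumptions, not the shipped API; the 2s chunk cadence is from the change above):

```swift
// Accumulate incoming audio and periodically re-transcribe the whole
// buffer from the start. No decoder state is persisted between updates,
// so this works with the stateless CoreML model; cost grows with buffer
// length, which is acceptable for clips under ~30 s.
actor SlidingWindowTranscriber {
    private var buffer: [Float] = []
    private var lastTranscribedCount = 0
    private let sampleRate = 16_000
    private let chunkSeconds = 2.0  // emit a new partial every 2 s of audio

    // Hypothetical hook into the batch transcriber.
    private let transcribe: ([Float]) async throws -> String

    init(transcribe: @escaping ([Float]) async throws -> String) {
        self.transcribe = transcribe
    }

    /// Append a chunk; returns a partial transcript once enough new
    /// audio has accumulated, nil otherwise.
    func append(_ samples: [Float]) async throws -> String? {
        buffer.append(contentsOf: samples)
        let newSamples = buffer.count - lastTranscribedCount
        guard Double(newSamples) >= chunkSeconds * Double(sampleRate) else {
            return nil
        }
        lastTranscribedCount = buffer.count
        return try await transcribe(buffer)  // re-run over the full buffer
    }
}
```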
AISHELL-2 requires an application with institutional affiliation, while AISHELL-1 is openly available under Apache 2.0.
Downloads from FluidInference/fleurs on HuggingFace automatically when data is not present. Supports European languages available in the dataset.
- Add a 1-second pause every 25 files to allow CoreML MLState IOSurface memory to be reclaimed, preventing a crash at the ~200-file limit
- Update FLEURS download to use FluidInference/fleurs-full, which now has all 30 Qwen3-supported languages (13 Asian + 17 European)
- Update help text to reflect that all languages auto-download
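The periodic pause can be sketched like this (the harness function and transcribe closure are hypothetical; the 25-file cadence and 1s duration are from the change above):

```swift
import Foundation

// Hypothetical benchmark loop: `transcribe` stands in for the real
// per-file transcription call.
func runBenchmark(
    files: [URL],
    transcribe: (URL) async throws -> String
) async throws -> [String] {
    var results: [String] = []
    for (index, file) in files.enumerated() {
        results.append(try await transcribe(file))
        // Pause 1 s every 25 files so CoreML MLState IOSurface-backed
        // buffers can be reclaimed (avoids the ~200-file crash).
        if (index + 1) % 25 == 0 {
            try await Task.sleep(nanoseconds: 1_000_000_000)
        }
    }
    return results
}
```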
Already in .gitignore - libraries shouldn't track lock files.
Force-pushed from 7e58536 to c951331
Thank you very much for adding this great model!
What is this referring to?
@reneleonhardt it's beta for our CoreML conversion, not the original model
Encoder-decoder ASR pipeline using Qwen3-ASR-0.6B converted to CoreML.
Performance
Supported Languages
30 languages with automatic detection: Chinese, English, Cantonese, Japanese, Korean, Vietnamese, Thai, Indonesian, Malay, Hindi, Arabic, Turkish, Russian, German, French, Spanish, Portuguese, Italian, Dutch, Polish, Swedish, Danish, Finnish, Czech, Filipino, Persian, Greek, Hungarian, Macedonian, Romanian.
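A hypothetical call sketch using the typed language API described in the commits (the loader, helper, and method signatures are assumptions and may not match the shipped source):

```swift
// Assumed shape of the public API: an actor-based manager plus a
// Language enum on Qwen3AsrConfig. Names mirror the commit notes
// above but are not guaranteed to match the final code.
let manager = Qwen3AsrManager()
try await manager.loadModels()  // hypothetical model loader

// 16 kHz mono Float32 PCM, via a hypothetical helper:
let samples: [Float] = loadPcm16kMono("clip.wav")
let text = try await manager.transcribe(samples, language: .english)
print(text)
```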
Components
`qwen3-benchmark` and `qwen3-transcribe` commands

Models
CoreML Model: FluidInference/qwen3-asr-0.6b-coreml
Only the f32 variant is recommended (int8 is slower due to autoregressive decoding overhead).
Swift 6 Compatibility
- `@preconcurrency import CoreML` for actor isolation
- `Sendable` conformance for cross-isolation boundary support

🤖 Generated with Claude Code