feat: add Qwen3-ASR-0.6B CoreML speech recognition #281

Merged

Alex-Wengg merged 41 commits into main from qwen3-asr on Feb 12, 2026

Conversation

Alex-Wengg (Member) commented Feb 2, 2026

Beta: Qwen3-ASR is experimental and under active development.

Encoder-decoder ASR pipeline using Qwen3-ASR-0.6B converted to CoreML.

Performance

| Dataset | WER | CER | RTFx |
|---|---|---|---|
| LibriSpeech test-clean (2620 files) | 4.4% | - | 3.8x |
| AISHELL-1 Chinese (7176 files) | 10.3% | 6.6% | 3.8x |

Supported Languages

30 languages with automatic detection: Chinese, English, Cantonese, Japanese, Korean, Vietnamese, Thai, Indonesian, Malay, Hindi, Arabic, Turkish, Russian, German, French, Spanish, Portuguese, Italian, Dutch, Polish, Swedish, Danish, Finnish, Czech, Filipino, Persian, Greek, Hungarian, Macedonian, Romanian.

Components

  • Qwen3AsrManager: Autoregressive decoder with batched prefill
  • WhisperMelSpectrogram: Whisper-compatible mel spectrogram (pure Swift/vDSP)
  • Qwen3RoPE: Multi-resolution rotary position embeddings (M-RoPE)
  • Qwen3AsrModels: Model loading with auto-download from HuggingFace
  • CLI: qwen3-benchmark and qwen3-transcribe commands

Models

CoreML Model: FluidInference/qwen3-asr-0.6b-coreml

Only the f32 variant is recommended (int8 is slower due to autoregressive decoding overhead).

Swift 6 Compatibility

  • @preconcurrency import CoreML for actor isolation
  • Sendable conformance for cross-isolation boundary support

🤖 Generated with Claude Code


github-actions bot commented Feb 2, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---|---|---|---|---|---|---|
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 655.2x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 676.1x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

github-actions bot commented Feb 2, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 14.5% | <20% | ✅ | Diarization Error Rate (lower is better) |
| RTFx | 3.96x | >1.0x | ✅ | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 11.601 | 4.4 | Fetching diarization models |
| Model Compile | 4.972 | 1.9 | CoreML compilation |
| Audio Load | 0.085 | 0.0 | Loading audio file |
| Segmentation | 36.850 | 13.9 | VAD + speech detection |
| Embedding | 261.385 | 98.6 | Speaker embedding extraction |
| Clustering (VBx) | 3.091 | 1.2 | Hungarian algorithm + VBx clustering |
| Total | 265.012 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|---|---|---|---|
| FluidAudio (Offline) | 14.5% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 301.3s processing • Test runtime: 6m 51s • 02/11/2026, 04:41 PM EST

github-actions bot commented Feb 2, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
|---|---|---|---|
| DER | 33.4% | <35% | ✅ |
| Miss Rate | 24.4% | - | - |
| False Alarm | 0.1% | - | - |
| Speaker Error | 8.9% | - | - |
| RTFx | 19.1x | >1.0x | ✅ |
| Speakers | 4/4 | - | - |

Sortformer High-Latency • ES2004a • Runtime: 1m 46s • 2026-02-11T21:27:50.015Z

github-actions bot commented Feb 2, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 15.1% | <30% | ✅ | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | ✅ | Jaccard Error Rate |
| RTFx | 12.85x | >1.0x | ✅ | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 10.765 | 13.2 | Fetching diarization models |
| Model Compile | 4.613 | 5.6 | CoreML compilation |
| Audio Load | 0.096 | 0.1 | Loading audio file |
| Segmentation | 24.488 | 30.0 | Detecting speech regions |
| Embedding | 40.813 | 50.0 | Extracting speaker voices |
| Clustering | 16.325 | 20.0 | Grouping same speakers |
| Total | 81.691 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|---|---|---|
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: The RTFx shown above is from the GitHub Actions runner. On Apple Silicon with the ANE:

  • M2 MacBook Air (2022): ~150x real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 81.6s diarization time • Test runtime: 3m 49s • 02/11/2026, 04:30 PM EST

github-actions bot commented Feb 2, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.57% | 0.00% | 5.27x | ✅ |
| test-other | 1.80% | 0.00% | 3.14x | ✅ |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.80% | 0.00% | 4.97x | ✅ |
| test-other | 1.22% | 0.00% | 3.02x | ✅ |

Streaming (v3)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.59x | Streaming real-time factor |
| Avg Chunk Time | 1.533s | Average time to process each chunk |
| Max Chunk Time | 2.261s | Maximum chunk processing time |
| First Token | 2.001s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.56x | Streaming real-time factor |
| Avg Chunk Time | 1.615s | Average time to process each chunk |
| Max Chunk Time | 2.175s | Maximum chunk processing time |
| First Token | 1.716s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 7m50s • 02/11/2026, 04:30 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)
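The calculation above is just a ratio; a trivial helper (the function name is illustrative, not part of FluidAudio's API):

```swift
// Real-Time Factor: total audio duration divided by total processing time.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double {
    audioSeconds / processingSeconds
}

// 10 s of audio processed in 5 s is 2x faster than real time.
print(rtfx(audioSeconds: 10.0, processingSeconds: 5.0))  // 2.0
```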

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

github-actions bot commented Feb 2, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
|---|---|---|
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 7.55x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 68.1s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
|---|---|---|
| Avg Chunk Time | 0.068s | Average chunk processing time |
| Max Chunk Time | 0.136s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 1m26s • 02/11/2026, 04:23 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

Alex-Wengg force-pushed the qwen3-asr branch 5 times, most recently from ab5e438 to d0d92c4 on February 3, 2026 14:07
Alex-Wengg marked this pull request as ready for review February 8, 2026 05:29
Alex-Wengg force-pushed the qwen3-asr branch 11 times, most recently from 78dbb82 to 8890841 on February 8, 2026 18:35

int8 quantization does not improve performance for Qwen3-ASR on Apple
Silicon. Testing showed int8 was slower (1.4x RTFx) than f32 (2.8x RTFx)
due to runtime dequantization overhead across 28 decoder layers that run
once per token during autoregressive generation.

Qwen3AsrConfig:
- Convert struct to enum with static properties
- Add Language enum with all 30 supported languages
- Add asrTextTokenId constant

Qwen3AsrManager:
- Convert class to actor for thread safety
- Add typed language support with Qwen3AsrConfig.Language
- Cache WhisperMelSpectrogram instance
- Replace print() with logger.debug()
- Remove dead repetition penalty code

Qwen3AsrModels:
- Fix computeUnits parameter (was ignored)
- Add vocab.json to modelsExist check
- Use native Float16 for embedding weights
- Add validation against Qwen3AsrConfig

Qwen3KVCache:
- Delete file (dead code, manager uses MLModel.makeState())

Qwen3RoPE:
- Remove unused Accelerate import
- Add Sendable conformance
- Fix MemoryLayout.size to .stride
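The `.size` vs `.stride` fix matters whenever byte offsets into a contiguous buffer are computed; a minimal illustration (the `Padded` type is hypothetical, not from the PR):

```swift
// MemoryLayout<T>.size is the type's contiguous footprint; .stride is the
// spacing between consecutive array elements, including tail padding.
// Byte-offset math over buffers must use .stride.
struct Padded { var a: Int32; var b: Int8 }  // size 5, alignment 4, stride 8

let i = 3
let wrong = i * MemoryLayout<Padded>.size     // 15: lands mid-element
let right = i * MemoryLayout<Padded>.stride   // 24: start of element 3
print(wrong, right)
```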

WhisperMelSpectrogram:
- Fix hot path allocation (reuse imagSq buffer)
- Reference Qwen3AsrConfig for sampleRate/nMels
- Vectorize post-processing with vDSP/vForce
- Fix NFKD vs NFKC (use decomposedStringWithCompatibilityMapping)
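The vectorized post-processing amounts to the standard Whisper log-mel steps (log10, dynamic-range clamp, affine rescale); a sketch using the vDSP/vForce Swift overlays, assuming this is roughly the step the commit vectorizes (the PR's exact code may differ):

```swift
import Accelerate

// Whisper-style log-mel post-processing, vectorized with vDSP/vForce.
// Constants are Whisper's standard ones.
func postProcessMel(_ mel: [Float]) -> [Float] {
    // log10 with a floor to avoid log(0)
    var out = vForce.log10(vDSP.clip(mel, to: 1e-10...Float.greatestFiniteMagnitude))
    // clamp to within 8 decades of the maximum
    let floorValue = vDSP.maximum(out) - 8.0
    out = vDSP.clip(out, to: floorValue...Float.greatestFiniteMagnitude)
    // (x + 4) / 4 normalization
    return vDSP.divide(vDSP.add(4.0, out), 4.0)
}
```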

Qwen3AsrBenchmark:
- Use typed Qwen3AsrConfig.Language enum
- Fix dataset label bug for AISHELL
- Add medianCER to JSON output
- Extract Qwen3BenchmarkSummary to avoid duplication
- Rename LibriSpeechFile to BenchmarkAudioFile
- Reuse AudioConverter instance

Qwen3TranscribeCommand:
- Use typed Qwen3AsrConfig.Language with validation
- Complete language list in usage (all 30)
- Compute duration from samples (remove AVFoundation)

TextNormalizer:
- Remove unused RegexBuilder import
- Add missing "six": "6" in numberWords
- Fix NSRange unicode bug (use utf16.count)
- Fix category checking with proper enum cases
- Make regex patterns static (compile once)
- Convert struct to enum

WERCalculator:
- Add Korean Hangul ranges to containsCJK
- Extract tokenizePair helper
- Add editDistanceChars for Character arrays
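`editDistanceChars` is presumably a Levenshtein distance specialized to `[Character]`, the character-level counterpart used for CER; a minimal sketch, not the PR's exact code:

```swift
// Levenshtein edit distance over Character arrays, using two rolling rows.
// Conceptually, CER = editDistanceChars(ref, hyp) / ref.count.
func editDistanceChars(_ a: [Character], _ b: [Character]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var prev = Array(0...b.count)   // distances against the empty prefix of a
    for i in 1...a.count {
        var cur = [i] + [Int](repeating: 0, count: b.count)
        for j in 1...b.count {
            let substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)
            cur[j] = min(prev[j] + 1,      // deletion
                         cur[j - 1] + 1,   // insertion
                         substitution)
        }
        prev = cur
    }
    return prev[b.count]
}

print(editDistanceChars(Array("kitten"), Array("sitting")))  // 3
```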

The key "six": "6" was present in both the English and French sections,
causing a Swift runtime crash on dictionary initialization.

- Qwen3AsrModels: conform to Sendable, use @preconcurrency import CoreML
- Qwen3AsrManager: use @preconcurrency import CoreML
- Add beta warnings to Qwen3AsrManager and Qwen3AsrModels
- docs: add supported languages section to Qwen3-ASR.md

Remove beta warnings from Swift code (Qwen3AsrManager, Qwen3AsrModels)
and add a beta notice to Qwen3-ASR.md instead.

Document differences from the original PyTorch implementation
that may affect accuracy (fixed windows, greedy-only decoding,
no streaming, etc.)

Implements a sliding-window streaming approach:
- Accumulates audio chunks and re-transcribes periodically
- Configurable chunk size (default 2s), min audio, max audio
- Returns partial results as audio accumulates
- No state persistence needed (works with the current CoreML model)

Note: True stateful streaming is not possible because CoreML MLState is
opaque and non-serializable. This approach re-transcribes from the
start each update, acceptable for <30s audio.
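The accumulate-and-retranscribe loop described in this commit can be sketched as pure buffer bookkeeping (names, defaults, and the trimming policy here are illustrative, not the PR's API):

```swift
// Sliding-window accumulator: append chunks; signal a re-transcription once
// `chunkSize` new samples have arrived; cap the buffer at `maxSamples`, since
// the whole buffer is re-decoded on every update.
struct StreamingAccumulator {
    let chunkSize: Int
    let maxSamples: Int
    private(set) var buffer: [Float] = []
    private var lastTranscribedCount = 0

    init(chunkSeconds: Double = 2.0, maxSeconds: Double = 30.0) {
        let sampleRate = 16_000.0   // model expects 16 kHz mono audio
        chunkSize = Int(chunkSeconds * sampleRate)
        maxSamples = Int(maxSeconds * sampleRate)
    }

    // Returns true when the caller should re-transcribe `buffer` from the start.
    mutating func append(_ samples: [Float]) -> Bool {
        buffer.append(contentsOf: samples)
        if buffer.count > maxSamples {
            let dropped = buffer.count - maxSamples
            buffer.removeFirst(dropped)
            lastTranscribedCount = max(0, lastTranscribedCount - dropped)
        }
        guard buffer.count - lastTranscribedCount >= chunkSize else { return false }
        lastTranscribedCount = buffer.count
        return true
    }
}
```

With the default 2 s chunk size, one second of appended audio returns `false`; after a second second the accumulator signals a full re-transcription.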

AISHELL-2 requires an application with institutional affiliation;
AISHELL-1 is openly available under Apache 2.0.

Downloads from FluidInference/fleurs on HuggingFace automatically
when the data is not present. Supports the European languages available
in the dataset.

- Add a 1-second pause every 25 files to allow CoreML MLState IOSurface
  memory to be reclaimed, preventing a crash at the ~200-file limit
- Update FLEURS download to use FluidInference/fleurs-full, which now
  has all 30 Qwen3-supported languages (13 Asian + 17 European)
- Update help text to reflect that all languages auto-download

Already in .gitignore - libraries shouldn't track lock files.
Alex-Wengg merged commit 772feab into main on Feb 12, 2026 (10 checks passed)
Alex-Wengg deleted the qwen3-asr branch February 12, 2026 00:16
@reneleonhardt

Thank you very much for adding this great model!

Beta: Qwen3-ASR is experimental and under active development.

What is this referring to?
I couldn't find "beta" or "experimental" in the model card, GitHub, the blog entry or the paper.
https://github.com/QwenLM/Qwen3-ASR
https://huggingface.co/Qwen/Qwen3-ASR-0.6B
https://qwen.ai/blog?id=qwen3asr
https://arxiv.org/abs/2601.21337

@Alex-Wengg (Member, Author)

@reneleonhardt it's beta for our CoreML conversion, not the original model.
