feat: add Qwen3-ASR-0.6B CoreML speech recognition #281

Merged

Alex-Wengg merged 41 commits into main from qwen3-asr on Feb 12, 2026

Conversation

Alex-Wengg (Member) commented Feb 2, 2026

Beta: Qwen3-ASR is experimental and under active development.

Encoder-decoder ASR pipeline using Qwen3-ASR-0.6B converted to CoreML.

Performance

| Dataset | WER | CER | RTFx |
|---|---|---|---|
| LibriSpeech test-clean (2620 files) | 4.4% | - | 3.8x |
| AISHELL-1 Chinese (7176 files) | 10.3% | 6.6% | 3.8x |

Supported Languages

30 languages with automatic detection: Chinese, English, Cantonese, Japanese, Korean, Vietnamese, Thai, Indonesian, Malay, Hindi, Arabic, Turkish, Russian, German, French, Spanish, Portuguese, Italian, Dutch, Polish, Swedish, Danish, Finnish, Czech, Filipino, Persian, Greek, Hungarian, Macedonian, Romanian.

Components

  • Qwen3AsrManager: Autoregressive decoder with batched prefill
  • WhisperMelSpectrogram: Whisper-compatible mel spectrogram (pure Swift/vDSP)
  • Qwen3RoPE: Multi-resolution rotary position embeddings (M-RoPE)
  • Qwen3AsrModels: Model loading with auto-download from HuggingFace
  • CLI: qwen3-benchmark and qwen3-transcribe commands

Models

CoreML Model: FluidInference/qwen3-asr-0.6b-coreml

Only the f32 variant is recommended (int8 is slower due to autoregressive decoding overhead).

Swift 6 Compatibility

  • @preconcurrency import CoreML for actor isolation
  • Sendable conformance for cross-isolation boundary support

🤖 Generated with Claude Code


github-actions bot commented Feb 2, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---|---|---|---|---|---|---|
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 655.2x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 676.1x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

github-actions bot commented Feb 2, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 14.5% | <20% | ✅ | Diarization Error Rate (lower is better) |
| RTFx | 3.96x | >1.0x | ✅ | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 11.601 | 4.4 | Fetching diarization models |
| Model Compile | 4.972 | 1.9 | CoreML compilation |
| Audio Load | 0.085 | 0.0 | Loading audio file |
| Segmentation | 36.850 | 13.9 | VAD + speech detection |
| Embedding | 261.385 | 98.6 | Speaker embedding extraction |
| Clustering (VBx) | 3.091 | 1.2 | Hungarian algorithm + VBx clustering |
| Total | 265.012 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|---|---|---|---|
| FluidAudio (Offline) | 14.5% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 301.3s processing • Test runtime: 6m 51s • 02/11/2026, 04:41 PM EST

github-actions bot commented Feb 2, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
|---|---|---|---|
| DER | 33.4% | <35% | ✅ |
| Miss Rate | 24.4% | - | - |
| False Alarm | 0.1% | - | - |
| Speaker Error | 8.9% | - | - |
| RTFx | 19.1x | >1.0x | ✅ |
| Speakers | 4/4 | - | - |

Sortformer High-Latency • ES2004a • Runtime: 1m 46s • 2026-02-11T21:27:50.015Z

github-actions bot commented Feb 2, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
|---|---|---|---|---|
| DER | 15.1% | <30% | ✅ | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | ✅ | Jaccard Error Rate |
| RTFx | 12.85x | >1.0x | ✅ | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 10.765 | 13.2 | Fetching diarization models |
| Model Compile | 4.613 | 5.6 | CoreML compilation |
| Audio Load | 0.096 | 0.1 | Loading audio file |
| Segmentation | 24.488 | 30.0 | Detecting speech regions |
| Embedding | 40.813 | 50.0 | Extracting speaker voices |
| Clustering | 16.325 | 20.0 | Grouping same speakers |
| Total | 81.691 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|---|---|---|
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: The RTFx shown above is from the GitHub Actions runner. On Apple Silicon with the ANE:

  • M2 MacBook Air (2022): ~150x real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 81.6s diarization time • Test runtime: 3m 49s • 02/11/2026, 04:30 PM EST

github-actions bot commented Feb 2, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.57% | 0.00% | 5.27x | ✅ |
| test-other | 1.80% | 0.00% | 3.14x | ✅ |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---|---|---|---|---|
| test-clean | 0.80% | 0.00% | 4.97x | ✅ |
| test-other | 1.22% | 0.00% | 3.02x | ✅ |

Streaming (v3)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.59x | Streaming real-time factor |
| Avg Chunk Time | 1.533s | Average time to process each chunk |
| Max Chunk Time | 2.261s | Maximum chunk processing time |
| First Token | 2.001s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.56x | Streaming real-time factor |
| Avg Chunk Time | 1.615s | Average time to process each chunk |
| Max Chunk Time | 2.175s | Maximum chunk processing time |
| First Token | 1.716s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 7m50s • 02/11/2026, 04:30 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)
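The calculation above is just a ratio; a trivial helper (the function name is illustrative, not part of FluidAudio's API):

```swift
// Real-Time Factor: total audio duration divided by total processing time.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double {
    audioSeconds / processingSeconds
}

// 10 s of audio processed in 5 s is 2x faster than real time.
print(rtfx(audioSeconds: 10.0, processingSeconds: 5.0))  // 2.0
```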

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

github-actions bot commented Feb 2, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
|---|---|---|
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 7.55x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 68.1s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
|---|---|---|
| Avg Chunk Time | 0.068s | Average chunk processing time |
| Max Chunk Time | 0.136s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 1m26s • 02/11/2026, 04:23 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

Alex-Wengg force-pushed the qwen3-asr branch 5 times, most recently from ab5e438 to d0d92c4 on February 3, 2026 14:07
Alex-Wengg marked this pull request as ready for review February 8, 2026 05:29
Alex-Wengg force-pushed the qwen3-asr branch 11 times, most recently from 78dbb82 to 8890841 on February 8, 2026 18:35

int8 quantization does not improve performance for Qwen3-ASR on Apple
Silicon. Testing showed int8 was slower (1.4x RTFx) than f32 (2.8x RTFx)
due to runtime dequantization overhead across 28 decoder layers that run
once per token during autoregressive generation.

Qwen3AsrConfig:
- Convert struct to enum with static properties
- Add Language enum with all 30 supported languages
- Add asrTextTokenId constant

Qwen3AsrManager:
- Convert class to actor for thread safety
- Add typed language support with Qwen3AsrConfig.Language
- Cache WhisperMelSpectrogram instance
- Replace print() with logger.debug()
- Remove dead repetition penalty code

Qwen3AsrModels:
- Fix computeUnits parameter (was ignored)
- Add vocab.json to modelsExist check
- Use native Float16 for embedding weights
- Add validation against Qwen3AsrConfig

Qwen3KVCache:
- Delete file (dead code, manager uses MLModel.makeState())

Qwen3RoPE:
- Remove unused Accelerate import
- Add Sendable conformance
- Fix MemoryLayout.size to .stride
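The `.size` vs `.stride` fix matters whenever byte offsets into a contiguous buffer are computed; a minimal illustration (the `Padded` type is hypothetical, not from the PR):

```swift
// MemoryLayout<T>.size is the type's contiguous footprint; .stride is the
// spacing between consecutive array elements, including tail padding.
// Byte-offset math over buffers must use .stride.
struct Padded { var a: Int32; var b: Int8 }  // size 5, alignment 4, stride 8

let i = 3
let wrong = i * MemoryLayout<Padded>.size     // 15: lands mid-element
let right = i * MemoryLayout<Padded>.stride   // 24: start of element 3
print(wrong, right)
```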

WhisperMelSpectrogram:
- Fix hot path allocation (reuse imagSq buffer)
- Reference Qwen3AsrConfig for sampleRate/nMels
- Vectorize post-processing with vDSP/vForce
- Fix NFKD vs NFKC (use decomposedStringWithCompatibilityMapping)
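The vectorized post-processing amounts to the standard Whisper log-mel steps (log10, dynamic-range clamp, affine rescale); a sketch using the vDSP/vForce Swift overlays, assuming this is roughly the step the commit vectorizes (the PR's exact code may differ):

```swift
import Accelerate

// Whisper-style log-mel post-processing, vectorized with vDSP/vForce.
// Constants are Whisper's standard ones.
func postProcessMel(_ mel: [Float]) -> [Float] {
    // log10 with a floor to avoid log(0)
    var out = vForce.log10(vDSP.clip(mel, to: 1e-10...Float.greatestFiniteMagnitude))
    // clamp to within 8 decades of the maximum
    let floorValue = vDSP.maximum(out) - 8.0
    out = vDSP.clip(out, to: floorValue...Float.greatestFiniteMagnitude)
    // (x + 4) / 4 normalization
    return vDSP.divide(vDSP.add(4.0, out), 4.0)
}
```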

Qwen3AsrBenchmark:
- Use typed Qwen3AsrConfig.Language enum
- Fix dataset label bug for AISHELL
- Add medianCER to JSON output
- Extract Qwen3BenchmarkSummary to avoid duplication
- Rename LibriSpeechFile to BenchmarkAudioFile
- Reuse AudioConverter instance

Qwen3TranscribeCommand:
- Use typed Qwen3AsrConfig.Language with validation
- Complete language list in usage (all 30)
- Compute duration from samples (remove AVFoundation)

TextNormalizer:
- Remove unused RegexBuilder import
- Add missing "six": "6" in numberWords
- Fix NSRange unicode bug (use utf16.count)
- Fix category checking with proper enum cases
- Make regex patterns static (compile once)
- Convert struct to enum

WERCalculator:
- Add Korean Hangul ranges to containsCJK
- Extract tokenizePair helper
- Add editDistanceChars for Character arrays
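`editDistanceChars` is presumably a Levenshtein distance specialized to `[Character]`, the character-level counterpart used for CER; a minimal sketch, not the PR's exact code:

```swift
// Levenshtein edit distance over Character arrays, using two rolling rows.
// Conceptually, CER = editDistanceChars(ref, hyp) / ref.count.
func editDistanceChars(_ a: [Character], _ b: [Character]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var prev = Array(0...b.count)   // distances against the empty prefix of a
    for i in 1...a.count {
        var cur = [i] + [Int](repeating: 0, count: b.count)
        for j in 1...b.count {
            let substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)
            cur[j] = min(prev[j] + 1,      // deletion
                         cur[j - 1] + 1,   // insertion
                         substitution)
        }
        prev = cur
    }
    return prev[b.count]
}

print(editDistanceChars(Array("kitten"), Array("sitting")))  // 3
```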

The key "six": "6" was present in both the English and French sections,
causing a Swift runtime crash on dictionary initialization.

- Qwen3AsrModels: conform to Sendable, use @preconcurrency import CoreML
- Qwen3AsrManager: use @preconcurrency import CoreML
- Add beta warnings to Qwen3AsrManager and Qwen3AsrModels
- docs: add supported languages section to Qwen3-ASR.md

Remove beta warnings from Swift code (Qwen3AsrManager, Qwen3AsrModels)
and add a beta notice to Qwen3-ASR.md instead.

Document differences from the original PyTorch implementation
that may affect accuracy (fixed windows, greedy-only decoding,
no streaming, etc.)

Implements a sliding-window streaming approach:
- Accumulates audio chunks and re-transcribes periodically
- Configurable chunk size (default 2s), min audio, max audio
- Returns partial results as audio accumulates
- No state persistence needed (works with the current CoreML model)

Note: True stateful streaming is not possible because CoreML MLState is
opaque and non-serializable. This approach re-transcribes from the
start each update, acceptable for <30s audio.
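The accumulate-and-retranscribe loop described in this commit can be sketched as pure buffer bookkeeping (names, defaults, and the trimming policy here are illustrative, not the PR's API):

```swift
// Sliding-window accumulator: append chunks; signal a re-transcription once
// `chunkSize` new samples have arrived; cap the buffer at `maxSamples`, since
// the whole buffer is re-decoded on every update.
struct StreamingAccumulator {
    let chunkSize: Int
    let maxSamples: Int
    private(set) var buffer: [Float] = []
    private var lastTranscribedCount = 0

    init(chunkSeconds: Double = 2.0, maxSeconds: Double = 30.0) {
        let sampleRate = 16_000.0   // model expects 16 kHz mono audio
        chunkSize = Int(chunkSeconds * sampleRate)
        maxSamples = Int(maxSeconds * sampleRate)
    }

    // Returns true when the caller should re-transcribe `buffer` from the start.
    mutating func append(_ samples: [Float]) -> Bool {
        buffer.append(contentsOf: samples)
        if buffer.count > maxSamples {
            let dropped = buffer.count - maxSamples
            buffer.removeFirst(dropped)
            lastTranscribedCount = max(0, lastTranscribedCount - dropped)
        }
        guard buffer.count - lastTranscribedCount >= chunkSize else { return false }
        lastTranscribedCount = buffer.count
        return true
    }
}
```

With the default 2 s chunk size, one second of appended audio returns `false`; after a second second the accumulator signals a full re-transcription.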

AISHELL-2 requires an application with institutional affiliation;
AISHELL-1 is openly available under Apache 2.0.

Downloads from FluidInference/fleurs on HuggingFace automatically
when the data is not present. Supports the European languages available
in the dataset.

- Add a 1-second pause every 25 files to allow CoreML MLState IOSurface
  memory to be reclaimed, preventing a crash at the ~200-file limit
- Update FLEURS download to use FluidInference/fleurs-full, which now
  has all 30 Qwen3-supported languages (13 Asian + 17 European)
- Update help text to reflect that all languages auto-download

Already in .gitignore - libraries shouldn't track lock files.
Alex-Wengg merged commit 772feab into main on Feb 12, 2026 (10 checks passed)
Alex-Wengg deleted the qwen3-asr branch February 12, 2026 00:16
@reneleonhardt

Thank you very much for adding this great model!

Beta: Qwen3-ASR is experimental and under active development.

What is this referring to?
I couldn't find "beta" or "experimental" in the model card, GitHub, the blog entry or the paper.
https://github.com/QwenLM/Qwen3-ASR
https://huggingface.co/Qwen/Qwen3-ASR-0.6B
https://qwen.ai/blog?id=qwen3asr
https://arxiv.org/abs/2601.21337

@Alex-Wengg (Member, Author)

@reneleonhardt it's beta for our CoreML conversion, not the original model.
