Feat: Parakeet EOU streaming ASR with 160ms/320ms chunk support #216

Alex-Wengg · 2025-12-15T21:36:29Z

Add Parakeet EOU 120M streaming ASR with End-of-Utterance detection
Support 160ms and 320ms chunk sizes with automatic HuggingFace model downloads
benchmarks.md
Add GitHub Actions CI benchmark workflow for Parakeet EOU

Changes

StreamingEouAsrManager - streaming pipeline with configurable chunk sizes
NeMoMelSpectrogram - native Swift mel spectrogram with vDSP vectorization
RnntDecoder - RNN-T greedy decoder with EOU detection
Configurable EOU debounce (default 1280ms)

github-actions · 2025-12-15T21:39:18Z

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

Metric	Value	Target	Status	Description
DER	14.5%	<20%	✅	Diarization Error Rate (lower is better)
RTFx	3.27x	>1.0x	✅	Real-Time Factor (higher is faster)

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

Stage	Time (s)	%	Description
Model Download	15.781	4.9	Fetching diarization models
Model Compile	6.763	2.1	CoreML compilation
Audio Load	0.099	0.0	Loading audio file
Segmentation	34.756	10.8	VAD + speech detection
Embedding	317.042	98.9	Speaker embedding extraction
Clustering (VBx)	3.066	1.0	Hungarian algorithm + VBx clustering
Total	320.694	100	Full VBx pipeline

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

Method	DER	Mode	Description
FluidAudio (Offline)	14.5%	VBx Batch	On-device CoreML with optimal clustering
FluidAudio (Streaming)	17.7%	Chunk-based	First-occurrence speaker mapping
Research baseline	18-30%	Various	Standard dataset performance

Pipeline Details:

Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
Segmentation: VAD-based voice activity detection
Embeddings: WeSpeaker-compatible speaker embeddings
Clustering: PowerSet with VBx refinement
Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 354.9s processing • Test runtime: 5m 54s • 12/17/2025, 05:16 PM EST

github-actions · 2025-12-15T21:40:46Z

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

Metric	Value	Target	Status	Description
DER	15.1%	<30%	✅	Diarization Error Rate (lower is better)
JER	24.9%	<25%	✅	Jaccard Error Rate
RTFx	22.22x	>1.0x	✅	Real-Time Factor (higher is faster)

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

Stage	Time (s)	%	Description
Model Download	7.582	16.1	Fetching diarization models
Model Compile	3.249	6.9	CoreML compilation
Audio Load	0.103	0.2	Loading audio file
Segmentation	14.157	30.0	Detecting speech regions
Embedding	23.595	50.0	Extracting speaker voices
Clustering	9.438	20.0	Grouping same speakers
Total	47.217	100	Full pipeline

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

Method	DER	Notes
FluidAudio	15.1%	On-device CoreML
Research baseline	18-30%	Standard dataset performance

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

M2 MacBook Air (2022): runs at 150x real time (150 RTFx)
Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 47.2s diarization time • Test runtime: 1m 20s • 12/17/2025, 05:11 PM EST

github-actions · 2025-12-15T21:41:15Z

VAD Benchmark Results

Performance Comparison

Dataset	Accuracy	Precision	Recall	F1-Score	RTFx	Files
MUSAN	92.0%	86.2%	100.0%	92.6%	486.3x faster	50
VOiCES	92.0%	86.2%	100.0%	92.6%	386.0x faster	50

Dataset Details

MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

- Add StreamingEouAsrManager for real-time streaming ASR
- Add RnntDecoder for RNN-T decoding
- Add NeMoMelSpectrogram for audio preprocessing
- Add Tokenizer for sentencepiece tokenization
- Add StreamingEncoderState for encoder cache management
- Update HuggingFaceDownloader to support 160ms and 320ms models
- Add ParakeetEouCommand CLI for benchmarking
- Add TextNormalizerOfficial for proper WER calculation
- Support --chunk-ms flag for 160ms/320ms chunk sizes

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Add --eou-debounce parameter to control minimum silence duration before End-of-Utterance triggers. Default is 1280ms.

This allows users to reduce false EOU triggers during brief pauses in natural speech while keeping fast 160ms/320ms chunk sizes for low-latency transcription.

- Add eouDebounceMs parameter to StreamingEouAsrManager
- Implement debounce logic: count consecutive EOU predictions
- Add --eou-debounce CLI flag with 1280ms default
- Reset debounce timer when speech tokens are produced

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
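The debounce behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in StreamingEouAsrManager: the `EouDebouncer` type, its method names, and the choice to count the debounce window in whole chunks (rounded up) are all assumptions. Under that assumption, the 1280ms default corresponds to 8 consecutive EOU chunks at 160ms or 4 at 320ms.

```swift
/// Hypothetical helper (not from the PR): fires an End-of-Utterance event only
/// after enough consecutive EOU predictions to cover the debounce window.
struct EouDebouncer {
    let chunkMs: Int
    let debounceMs: Int
    private(set) var consecutiveEouChunks = 0

    init(chunkMs: Int, debounceMs: Int = 1280) {
        self.chunkMs = chunkMs
        self.debounceMs = debounceMs
    }

    /// Consecutive EOU chunks needed to cover the window (assumed: rounded up, at least 1).
    var requiredChunks: Int {
        max(1, (debounceMs + chunkMs - 1) / chunkMs)
    }

    /// Call once per processed chunk. Returns true when the debounced EOU fires.
    mutating func update(predictedEou: Bool, producedSpeechTokens: Bool) -> Bool {
        // Any speech token (or a non-EOU prediction) resets the counter.
        if producedSpeechTokens || !predictedEou {
            consecutiveEouChunks = 0
            return false
        }
        consecutiveEouChunks += 1
        if consecutiveEouChunks >= requiredChunks {
            consecutiveEouChunks = 0
            return true
        }
        return false
    }
}
```

A caller would create one debouncer per stream, for example `var debouncer = EouDebouncer(chunkMs: 320)`, and feed it the decoder's per-chunk result. A shorter window gives faster turn-taking at the cost of more false triggers during brief pauses.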

@unknown

MLMultiArrayDataType.int8 is not available on older macOS versions, causing CI build failures. The @unknown default case handles any unknown data types gracefully.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

…nchmark

- Remove redundant HuggingFaceDownloader.swift, use DownloadUtils instead
- Add parakeetEou160/320 to Repo enum with subPath support
- Add ModelNames.ParakeetEOU with required model names
- Update DownloadUtils.downloadRepo to handle repo subdirectories
- Add GitHub Actions workflow for 320ms Parakeet EOU benchmark (100 files)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

github-actions · 2025-12-15T22:58:54Z

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

Dataset	WER Avg	WER Med	RTFx	Status
test-clean	0.57%	0.00%	3.28x	✅
test-other	1.80%	0.00%	2.49x	✅

Parakeet v2 (English-optimized)

Dataset	WER Avg	WER Med	RTFx	Status
test-clean	0.40%	0.00%	3.40x	✅
test-other	1.56%	0.00%	1.68x	✅

Streaming (v3)

Metric	Value	Description
WER	0.00%	Word Error Rate in streaming mode
RTFx	0.39x	Streaming real-time factor
Avg Chunk Time	2.209s	Average time to process each chunk
Max Chunk Time	2.764s	Maximum chunk processing time
First Token	2.679s	Latency to first transcription token
Total Chunks	31	Number of chunks processed

Streaming (v2)

Metric	Value	Description
WER	0.00%	Word Error Rate in streaming mode
RTFx	0.21x	Streaming real-time factor
Avg Chunk Time	4.142s	Average time to process each chunk
Max Chunk Time	6.384s	Maximum chunk processing time
First Token	3.610s	Latency to first transcription token
Total Chunks	31	Number of chunks processed

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 9m3s • 12/17/2025, 05:12 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)
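As a worked illustration of the RTFx definition above, here is a small sketch. The function name `rtfx` is hypothetical and is not taken from the benchmark code; it simply divides audio duration by processing time and guards against a zero or negative denominator.

```swift
/// Hypothetical helper: Real-Time Factor = total audio duration / total processing time.
/// Returns nil when the processing time is not positive.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double? {
    guard processingSeconds > 0 else { return nil }
    return audioSeconds / processingSeconds
}

// The example from the note above: 10 seconds of audio processed in 5 seconds.
if let factor = rtfx(audioSeconds: 10, processingSeconds: 5) {
    print("RTFx: \(factor)x")
}
```

The same arithmetic is consistent with the diarization comment earlier in this thread: 1049.0s of audio over a 47.217s pipeline total gives about 22.22x.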

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

github-actions · 2025-12-15T23:03:02Z

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

Metric	Value	Description
WER (Avg)	6.91%	Average Word Error Rate
WER (Med)	4.00%	Median Word Error Rate
RTFx	4.87x	Real-time factor (higher = faster)
Total Audio	470.6s	Total audio duration processed
Total Time	112.5s	Total processing time

Streaming Metrics

Metric	Value	Description
Avg Chunk Time	0.113s	Average chunk processing time
Max Chunk Time	0.225s	Maximum chunk processing time
EOU Detections	0	Total End-of-Utterance detections

Test runtime: 2m4s • 12/17/2025, 05:05 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

Alex-Wengg · 2025-12-15T23:07:37Z

@claude

- Update BenchmarkJSONOutput to use nested summary structure
- Add BenchmarkSummary struct with proper field names (averageWER, medianWER, medianRTFx)
- Add StreamingMetrics struct for chunk processing times
- Calculate and output median WER and RTFx
- Remove dead code StreamingEncoderState.swift
- Apply swift-format to TextNormalizer files

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
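The median calculation and summary structure mentioned in this commit might look roughly like the sketch below. Only the field names averageWER, medianWER, and medianRTFx come from the commit message; the struct layout, the `median` helper, and the `summarize` function are assumptions made for illustration, and the real BenchmarkSummary may carry additional fields.

```swift
import Foundation

/// Median of a list of values; nil for an empty list.
func median(_ values: [Double]) -> Double? {
    guard !values.isEmpty else { return nil }
    let sorted = values.sorted()
    let mid = sorted.count / 2
    // Even count: average the two middle values. Odd count: take the middle one.
    return sorted.count % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
}

/// Sketch of a summary type; field names follow the commit message.
struct BenchmarkSummary: Codable, Equatable {
    let averageWER: Double
    let medianWER: Double
    let medianRTFx: Double
}

/// Hypothetical helper that builds a summary from per-file results.
func summarize(wers: [Double], rtfxs: [Double]) -> BenchmarkSummary? {
    guard let medianWER = median(wers), let medianRTFx = median(rtfxs) else { return nil }
    let averageWER = wers.reduce(0, +) / Double(wers.count)
    return BenchmarkSummary(averageWER: averageWER, medianWER: medianWER, medianRTFx: medianRTFx)
}
```

Reporting the median next to the average is useful here because a few badly transcribed files can pull the average WER well above the typical per-file value.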

- Pre-allocate reusable buffers to avoid allocations in hot path
- Vectorize preemphasis filter using vDSP_vsma
- Use memcpy for bulk padding copy
- Vectorize windowing with vDSP_vmul
- Use vDSP_mmul for mel filterbank matrix-vector multiply
- Vectorize power spectrum with vDSP_vsq and vDSP_vadd
- Flatten mel filterbank for efficient vDSP operations

RTFx improved from ~12x to ~19x on M2 hardware.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
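To illustrate one item from this list, the sketch below applies a pre-emphasis filter, y[n] = x[n] - a * x[n-1], with vDSP_vsma, which computes D = A * B + C element-wise with a scalar B. It is not the NeMoMelSpectrogram implementation: the function name, the 0.97 default coefficient, and the scalar fallback for platforms without Accelerate are assumptions. The sketch also binds `baseAddress` with `guard let` rather than force unwrapping it.

```swift
#if canImport(Accelerate)
import Accelerate
#endif

/// Hypothetical pre-emphasis helper: y[0] = x[0], y[n] = x[n] - coefficient * x[n-1].
func preemphasis(_ x: [Float], coefficient: Float = 0.97) -> [Float] {
    guard x.count > 1 else { return x }
    var y = [Float](repeating: 0, count: x.count)
    y[0] = x[0]
    #if canImport(Accelerate)
    var negCoef = -coefficient
    x.withUnsafeBufferPointer { (xPtr: UnsafeBufferPointer<Float>) -> Void in
        y.withUnsafeMutableBufferPointer { (yPtr: inout UnsafeMutableBufferPointer<Float>) -> Void in
            guard let xBase = xPtr.baseAddress, let yBase = yPtr.baseAddress else { return }
            // y[n+1] = x[n] * (-coefficient) + x[n+1] for n in 0..<(count - 1)
            vDSP_vsma(xBase, 1, &negCoef, xBase + 1, 1, yBase + 1, 1, vDSP_Length(xPtr.count - 1))
        }
    }
    #else
    // Scalar fallback so the sketch also runs where Accelerate is unavailable.
    for n in 1..<x.count {
        y[n] = x[n] - coefficient * x[n - 1]
    }
    #endif
    return y
}
```

In a hot path the output buffer would be pre-allocated and reused across chunks, as the first bullet of the commit describes, instead of being allocated on every call as it is here.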

The GitHub Actions workflow passes --output parakeet_eou_results.json but the command was ignoring this flag and always writing to benchmark_results.json. This caused the workflow to show N/A for all metrics since the expected JSON file was never created.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

Alex-Wengg · 2025-12-16T03:57:04Z

@claude

claude · 2025-12-16T03:57:17Z

Claude finished @Alex-Wengg's task

Current PR Status Analysis

Review current PR status and previous critical issues
Check if compilation errors have been resolved
Verify force unwrapping violations are addressed
Examine recent changes since last review
Provide updated recommendations

🎯 Current Status: Mixed Progress

✅ Resolved Issues

Compilation Error Fixed - TextNormalizerOfficial.swift:368 - englishSpellingNormalizer reference has been commented out
Print Statements Cleaned - HuggingFaceDownloader print statements have been converted to proper logging
CI Stability - Multiple benchmarks are passing (Swift Format, Parakeet EOU, VAD, TTS, Podspec)

🚨 Critical Issues Still Present

These must be fixed before merge:

Force Unwrapping Violations (violates CLAUDE.md guidelines):
- StreamingEouAsrManager.swift:248 - RnntDecoder(decoderModel: self.decoder!, jointModel: self.joint!)
- RnntDecoder.swift:35,37 - try! MLMultiArray(...) - 2 instances
- RnntDecoder.swift:92,109,123,124 - Multiple ! unwraps in model output processing
- Tokenizer.swift:9 - as! [String: String] force cast
Unsafe Memory Operations:
- NeMoMelSpectrogram.swift:161,173,210-212,229-231,256 - Multiple baseAddress! force unwraps
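For the force cast flagged in Tokenizer.swift, one possible replacement is a conditional cast that throws on unexpected input. The sketch below is an assumption about how that could look, not the actual Tokenizer code: `VocabError`, `loadVocab(from:)`, and the idea that the vocabulary is a flat JSON object of string keys and string values are all inferred from the `as! [String: String]` cast mentioned above.

```swift
import Foundation

/// Hypothetical error type for vocabulary loading.
enum VocabError: Error {
    case invalidFormat
}

/// Parses vocabulary JSON and fails with an error instead of crashing
/// when the payload is not a [String: String] dictionary.
func loadVocab(from data: Data) throws -> [String: String] {
    let object = try JSONSerialization.jsonObject(with: data)
    guard let vocab = object as? [String: String] else {
        throw VocabError.invalidFormat
    }
    return vocab
}
```

A caller can then surface a malformed or partially downloaded vocabulary file as a recoverable error, which matters for models fetched at runtime.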

📋 Required Actions Before Merge

Priority 1 - Critical Safety Issues:

// Fix StreamingEouAsrManager.swift:248
guard let decoder = self.decoder, let joint = self.joint else {
    throw StreamingError.modelsNotLoaded
}
self.rnntDecoder = RnntDecoder(decoderModel: decoder, jointModel: joint)

// Fix RnntDecoder.swift:35-37
guard let hState = try? MLMultiArray(shape: [NSNumber(value: layers), NSNumber(value: 1), NSNumber(value: hiddenSize)], dataType: .float32),
      let cState = try? MLMultiArray(shape: [NSNumber(value: layers), NSNumber(value: 1), NSNumber(value: hiddenSize)], dataType: .float32) else {
    throw DecoderError.stateInitializationFailed
}

Priority 2 - Model Output Safety:
Replace force unwraps in RnntDecoder.swift:92,109,123,124 with proper error handling using guard let statements.

🔍 Performance & Architecture Assessment

Positive Aspects:

✅ Comprehensive benchmarking infrastructure working well
✅ NeMoMelSpectrogram shows good performance optimization with vDSP
✅ Streaming architecture is well-structured with proper EOU detection
✅ Swift formatting compliance maintained

Architecture Notes:

Large file complexity (StreamingEouAsrManager.swift - 487 lines) could benefit from decomposition
Memory management in streaming contexts needs careful review due to MLMultiArray allocations

🚀 CI Status

Current runs show promising results:

✅ Swift Format: Passing
✅ Parakeet EOU Benchmark: Passing
✅ VAD Benchmark: Passing
🟡 Build and Test: In Progress
🟡 ASR Benchmark: In Progress
🟡 Diarizer Performance: In Progress

🎯 Recommendation

Do not merge until force unwrapping violations are resolved. These violate the project's critical development rules in CLAUDE.md and could cause runtime crashes. The fixes are straightforward but essential for production safety.

Once safety issues are addressed, this will be a valuable addition providing production-ready streaming ASR with EOU detection.

BrandonWeng · 2025-12-16T04:37:42Z

Documentation/Benchmarks.md

+| Chunk Size | WER (Avg) | RTFx | Total Time |
+|------------|-----------|------|------------|
+| 320ms      | 4.92%     | 12.48x | 1558s (26m) |
+| 160ms      | 8.29%     | 4.78x  | 4070s (68m) |


how does this compare with the huggingface nemo model?

the way they benchmarked it, e.g. "160ms streaming setting", i am not sure if it was really 160 ms audio chunking or if it was referring to 160 ms eou detection; it didn't really specify.

https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Sources/FluidAudioCLI/Utils/TextNormalizerOfficial.swift

Sources/FluidAudioCLI/Commands/ASR/AsrBenchmarkTypes.swift

Sources/FluidAudioCLI/Commands/ASR/AsrBenchmark.swift

BrandonWeng

mostly just nits - otherwise good to merge once you address them. great job!

also update the readme? add this and replace the streaming docs that we have right now?

- Remove TextNormalizerOfficial, use TextNormalizer everywhere
- Rename pureCoreML to useStreamingEou for clarity
- Remove duplicate comment in AsrBenchmark
- Add StreamingEouAsrManager API documentation
- Add streaming CLI docs to ASR/GettingStarted.md
- Add parakeet-eou command to CLI README
- Compact CLI documentation examples

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>


Remove tokenizer.model and preprocessorFile from required models list. These files don't exist in the HuggingFace repo and aren't used:

- preprocessor: Native Swift NeMoMelSpectrogram is used instead
- tokenizer.model: vocab.json is used for the Tokenizer class

Fixes streaming model download failure introduced in PR FluidInference#216.



Alex-Wengg force-pushed the feat/parakeet-eou-integration-2 branch from 16fa53f to 3c5a75a Compare December 15, 2025 21:54

Alex-Wengg and others added 2 commits December 15, 2025 17:03

Alex-Wengg force-pushed the feat/parakeet-eou-integration-2 branch from 3c5a75a to 4acfea2 Compare December 15, 2025 22:29

Alex-Wengg and others added 2 commits December 15, 2025 17:33


Alex-Wengg and others added 3 commits December 15, 2025 20:47

add benchmarks.md info

f14d218

Alex-Wengg changed the title ~~Feat/parakeet eou integration 2~~ Feat: Parakeet EOU streaming ASR with 160ms/320ms chunk support Dec 16, 2025

BrandonWeng reviewed Dec 16, 2025


Alex-Wengg marked this pull request as ready for review December 16, 2025 05:34

chore: update parakeet-eou model URL to FluidInference org

a5ac061

🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Alex-Wengg requested a review from BrandonWeng December 17, 2025 17:37

BrandonWeng reviewed Dec 17, 2025

Sources/FluidAudioCLI/Utils/TextNormalizerOfficial.swift (outdated)

BrandonWeng reviewed Dec 17, 2025

Sources/FluidAudioCLI/Commands/ASR/AsrBenchmarkTypes.swift (outdated)

BrandonWeng reviewed Dec 17, 2025

Sources/FluidAudioCLI/Commands/ASR/AsrBenchmark.swift (outdated)

BrandonWeng approved these changes Dec 17, 2025

Alex-Wengg merged commit 892da4f into main Dec 17, 2025
10 checks passed

Alex-Wengg deleted the feat/parakeet-eou-integration-2 branch December 17, 2025 22:18

keithah mentioned this pull request Dec 24, 2025

Phase 1: FluidAudio Speech-to-Text and Speaker Diarization keithah/overhear#23

Open
