Feat: Parakeet EOU streaming ASR with 160ms/320ms chunk support #216
Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode): optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown: time spent in each stage of batch diarization
Speaker Diarization Research Comparison: offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 354.9s processing • Test runtime: 5m 54s • 12/17/2025, 05:16 PM EST
Speaker Diarization Benchmark Results
Speaker Diarization Performance: evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown: time spent in each stage of speaker diarization
Speaker Diarization Research Comparison: research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 47.2s diarization time • Test runtime: 1m 20s • 12/17/2025, 05:11 PM EST
VAD Benchmark Results
Performance Comparison
Dataset Details
✅: Average F1-Score above 70%
- Add StreamingEouAsrManager for real-time streaming ASR
- Add RnntDecoder for RNN-T decoding
- Add NeMoMelSpectrogram for audio preprocessing
- Add Tokenizer for sentencepiece tokenization
- Add StreamingEncoderState for encoder cache management
- Update HuggingFaceDownloader to support 160ms and 320ms models
- Add ParakeetEouCommand CLI for benchmarking
- Add TextNormalizerOfficial for proper WER calculation
- Support --chunk-ms flag for 160ms/320ms chunk sizes

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Add --eou-debounce parameter to control minimum silence duration before End-of-Utterance triggers. Default is 1280ms. This allows users to reduce false EOU triggers during brief pauses in natural speech while keeping fast 160ms/320ms chunk sizes for low-latency transcription.
- Add eouDebounceMs parameter to StreamingEouAsrManager
- Implement debounce logic: count consecutive EOU predictions
- Add --eou-debounce CLI flag with 1280ms default
- Reset debounce timer when speech tokens are produced
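For readers who want to see the debounce idea in isolation, here is a minimal sketch. It is not the code from this PR: the `EouDebouncer` type, its method names, and the choice to count whole chunks (rounding the debounce window up) are assumptions made for illustration. Only the 1280ms default, the 160ms/320ms chunk sizes, and the "count consecutive EOU predictions, reset on speech tokens" behaviour come from the commit message.

```swift
import Foundation

/// Hypothetical helper, not the PR's API: tracks how many consecutive
/// chunks predicted End-of-Utterance and fires once the run is long enough.
struct EouDebouncer {
    /// Number of consecutive EOU chunks needed to cover the debounce window.
    let requiredChunks: Int
    private(set) var consecutiveEouChunks = 0

    init(debounceMs: Int = 1280, chunkMs: Int) {
        precondition(chunkMs > 0, "chunk size must be positive")
        // Round up so the required silence is never shorter than requested.
        requiredChunks = max(1, (debounceMs + chunkMs - 1) / chunkMs)
    }

    /// Feed one chunk's result; returns true when EOU should be reported.
    mutating func update(predictedEou: Bool, emittedSpeechTokens: Bool) -> Bool {
        // Speech tokens (or a non-EOU chunk) break the run and reset the count.
        if emittedSpeechTokens || !predictedEou {
            consecutiveEouChunks = 0
            return false
        }
        consecutiveEouChunks += 1
        return consecutiveEouChunks >= requiredChunks
    }
}
```

Under these assumptions the default 1280ms window corresponds to 4 chunks at 320ms and 8 chunks at 160ms, so a brief pause shorter than that never triggers EOU.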
MLMultiArrayDataType.int8 is not available on older macOS versions, causing CI build failures. The @unknown default case handles any unknown data types gracefully.
…nchmark
- Remove redundant HuggingFaceDownloader.swift, use DownloadUtils instead
- Add parakeetEou160/320 to Repo enum with subPath support
- Add ModelNames.ParakeetEOU with required model names
- Update DownloadUtils.downloadRepo to handle repo subdirectories
- Add GitHub Actions workflow for 320ms Parakeet EOU benchmark (100 files)
ASR Benchmark Results ✅
Status: All benchmarks passed
Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming
25 files per dataset • Test runtime: 9m3s • 12/17/2025, 05:12 PM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Expected RTFx Performance on Physical M1 Hardware:
• M1 Mac: ~28x (clean), ~25x (other)
Testing methodology follows HuggingFace Open ASR Leaderboard
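The RTFx definition above (total audio duration ÷ total processing time) can be written as a one-line helper. This is only an illustrative sketch; the function name and the optional return for a zero processing time are assumptions, not part of the benchmark code.

```swift
import Foundation

/// Hypothetical helper: real-time factor as audio duration over processing time.
/// Returns nil when the processing time is not positive, to avoid dividing by zero.
func realTimeFactor(audioSeconds: Double, processingSeconds: Double) -> Double? {
    guard processingSeconds > 0 else { return nil }
    return audioSeconds / processingSeconds
}

// Using figures quoted in the diarization test line of this thread:
// 1049.0s of meeting audio processed in 47.2s of diarization time.
let rtfx = realTimeFactor(audioSeconds: 1049.0, processingSeconds: 47.2)
```

With those inputs the arithmetic works out to roughly 22x, meaning the audio was processed about 22 times faster than it would take to play back.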
Parakeet EOU Benchmark Results ✅
Status: Benchmark passed
Performance Metrics
Streaming Metrics
Test runtime: 2m4s • 12/17/2025, 05:05 PM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O
- Update BenchmarkJSONOutput to use nested summary structure
- Add BenchmarkSummary struct with proper field names (averageWER, medianWER, medianRTFx)
- Add StreamingMetrics struct for chunk processing times
- Calculate and output median WER and RTFx
- Remove dead code StreamingEncoderState.swift
- Apply swift-format to TextNormalizer files
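As a rough picture of what a summary with these fields might look like, here is a small sketch. Only the field names `averageWER`, `medianWER`, and `medianRTFx` come from the commit message; the `median` and `summarize` helpers, the `Codable` conformance, and the choice to average per-file WER values are assumptions for illustration and may differ from the PR's actual structs.

```swift
import Foundation

/// Sketch of a summary record; field names follow the commit message.
struct BenchmarkSummary: Codable {
    let averageWER: Double
    let medianWER: Double
    let medianRTFx: Double
}

/// Hypothetical helper: median of a list, nil for an empty list.
func median(_ values: [Double]) -> Double? {
    guard !values.isEmpty else { return nil }
    let sorted = values.sorted()
    let mid = sorted.count / 2
    // Even count: mean of the two middle values; odd count: the middle value.
    return sorted.count % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
}

/// Hypothetical helper: builds a summary from per-file WER and RTFx values.
func summarize(wers: [Double], rtfxs: [Double]) -> BenchmarkSummary? {
    guard let medianWER = median(wers), let medianRTFx = median(rtfxs) else { return nil }
    let averageWER = wers.reduce(0, +) / Double(wers.count)
    return BenchmarkSummary(averageWER: averageWER, medianWER: medianWER, medianRTFx: medianRTFx)
}
```

Reporting the median next to the average is useful here because a few badly transcribed files can pull the average WER up without reflecting typical behaviour.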
- Pre-allocate reusable buffers to avoid allocations in hot path
- Vectorize preemphasis filter using vDSP_vsma
- Use memcpy for bulk padding copy
- Vectorize windowing with vDSP_vmul
- Use vDSP_mmul for mel filterbank matrix-vector multiply
- Vectorize power spectrum with vDSP_vsq and vDSP_vadd
- Flatten mel filterbank for efficient vDSP operations

RTFx improved from ~12x to ~19x on M2 hardware.
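To illustrate one item from the list above, the preemphasis filter y[n] = x[n] - a·x[n-1] maps naturally onto `vDSP_vsma`, which computes D = A·b + C element-wise for a scalar b. The sketch below is an assumption about how such a filter could be written, not the PR's NeMoMelSpectrogram code: the function name, the 0.97 default coefficient, and the scalar fallback for platforms without Accelerate are all illustrative choices.

```swift
import Foundation
#if canImport(Accelerate)
import Accelerate
#endif

/// Hypothetical preemphasis: y[0] = x[0], y[n] = x[n] - coefficient * x[n-1].
func preemphasis(_ x: [Float], coefficient: Float = 0.97) -> [Float] {
    guard x.count > 1 else { return x }
    var y = [Float](repeating: 0, count: x.count)
    y[0] = x[0]
    #if canImport(Accelerate)
    var scale = -coefficient
    x.withUnsafeBufferPointer { src in
        y.withUnsafeMutableBufferPointer { dst in
            guard let s = src.baseAddress, let d = dst.baseAddress else { return }
            // D[n] = A[n] * scale + C[n], with A = x[0...], C = x[1...], D = y[1...].
            vDSP_vsma(s, 1, &scale, s + 1, 1, d + 1, 1, vDSP_Length(x.count - 1))
        }
    }
    #else
    // Scalar fallback with the same result when Accelerate is unavailable.
    for n in 1..<x.count {
        y[n] = x[n] - coefficient * x[n - 1]
    }
    #endif
    return y
}
```

The single vDSP call replaces a per-sample loop, which is the kind of change the commit credits for the RTFx improvement.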
The GitHub Actions workflow passes --output parakeet_eou_results.json but the command was ignoring this flag and always writing to benchmark_results.json. This caused the workflow to show N/A for all metrics since the expected JSON file was never created.
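The bug described above comes down to honouring `--output` when it is present and falling back to the default file name otherwise. A minimal sketch of that resolution logic follows; the function name and the plain array scan are assumptions for illustration, and the real command may parse its arguments differently.

```swift
import Foundation

/// Hypothetical helper: returns the value following `--output`, or the default
/// file name when the flag is absent or has no value after it.
func resolveOutputPath(
    arguments: [String],
    defaultPath: String = "benchmark_results.json"
) -> String {
    guard let flagIndex = arguments.firstIndex(of: "--output") else {
        return defaultPath
    }
    let valueIndex = arguments.index(after: flagIndex)
    guard valueIndex < arguments.count else { return defaultPath }
    return arguments[valueIndex]
}

// Mirrors the workflow invocation described in the commit message.
let path = resolveOutputPath(arguments: ["--chunk-ms", "320", "--output", "parakeet_eou_results.json"])
```

With the flag respected, the workflow finds the JSON file it expects instead of reporting N/A.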
Claude finished @Alex-Wengg's task
Current PR Status Analysis
🎯 Current Status: Mixed Progress
✅ Resolved Issues
🚨 Critical Issues Still Present
These must be fixed before merge:
📋 Required Actions Before Merge
Priority 1 - Critical Safety Issues:
// Fix StreamingEouAsrManager.swift:248
guard let decoder = self.decoder, let joint = self.joint else {
throw StreamingError.modelsNotLoaded
}
self.rnntDecoder = RnntDecoder(decoderModel: decoder, jointModel: joint)
// Fix RnntDecoder.swift:35-37
guard let hState = try? MLMultiArray(shape: [NSNumber(value: layers), NSNumber(value: 1), NSNumber(value: hiddenSize)], dataType: .float32),
let cState = try? MLMultiArray(shape: [NSNumber(value: layers), NSNumber(value: 1), NSNumber(value: hiddenSize)], dataType: .float32) else {
throw DecoderError.stateInitializationFailed
}
Priority 2 - Model Output Safety:
🔍 Performance & Architecture Assessment
Positive Aspects:
Architecture Notes:
🚀 CI Status
Current runs show promising results:
🎯 Recommendation
Do not merge until force unwrapping violations are resolved. These violate the project's critical development rules in CLAUDE.md and could cause runtime crashes. The fixes are straightforward but essential for production safety.
Once safety issues are addressed, this will be a valuable addition providing production-ready streaming ASR with EOU detection.
| Chunk Size | WER (Avg) | RTFx | Total Time |
|------------|-----------|------|------------|
| 320ms | 4.92% | 12.48x | 1558s (26m) |
| 160ms | 8.29% | 4.78x | 4070s (68m) |
How does this compare with the Hugging Face NeMo model?
Mostly just nits; otherwise good to merge once you address them. Great job!
Also, can you update the README? Add this and replace the streaming docs that we have right now.
- Remove TextNormalizerOfficial, use TextNormalizer everywhere
- Rename pureCoreML to useStreamingEou for clarity
- Remove duplicate comment in AsrBenchmark
- Add StreamingEouAsrManager API documentation
- Add streaming CLI docs to ASR/GettingStarted.md
- Add parakeet-eou command to CLI README
- Compact CLI documentation examples

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add Parakeet EOU 120M streaming ASR with End-of-Utterance detection
- Support 160ms and 320ms chunk sizes with automatic HuggingFace model downloads
- benchmarks.md
- Add GitHub Actions CI benchmark workflow for Parakeet EOU

Changes
- StreamingEouAsrManager - streaming pipeline with configurable chunk sizes
- NeMoMelSpectrogram - native Swift mel spectrogram with vDSP vectorization
- RnntDecoder - RNN-T greedy decoder with EOU detection
- Configurable EOU debounce (default 1280ms)
Remove tokenizer.model and preprocessorFile from required models list. These files don't exist in the HuggingFace repo and aren't used:
- preprocessor: Native Swift NeMoMelSpectrogram is used instead
- tokenizer.model: vocab.json is used for the Tokenizer class

Fixes streaming model download failure introduced in PR FluidInference#216.

Changes