@Alex-Wengg Alex-Wengg commented Dec 15, 2025

  • Add Parakeet EOU 120M streaming ASR with End-of-Utterance detection
  • Support 160ms and 320ms chunk sizes with automatic HuggingFace model downloads
  • benchmarks.md
  • Add GitHub Actions CI benchmark workflow for Parakeet EOU

Changes

  • StreamingEouAsrManager - streaming pipeline with configurable chunk sizes
  • NeMoMelSpectrogram - native Swift mel spectrogram with vDSP vectorization
  • RnntDecoder - RNN-T greedy decoder with EOU detection
  • Configurable EOU debounce (default 1280ms)

github-actions bot commented Dec 15, 2025

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
|--------|-------|--------|--------|-------------|
| DER | 14.5% | <20% | | Diarization Error Rate (lower is better) |
| RTFx | 3.27x | >1.0x | | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|-------|----------|---|-------------|
| Model Download | 15.781 | 4.9 | Fetching diarization models |
| Model Compile | 6.763 | 2.1 | CoreML compilation |
| Audio Load | 0.099 | 0.0 | Loading audio file |
| Segmentation | 34.756 | 10.8 | VAD + speech detection |
| Embedding | 317.042 | 98.9 | Speaker embedding extraction |
| Clustering (VBx) | 3.066 | 1.0 | Hungarian algorithm + VBx clustering |
| Total | 320.694 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|--------|-----|------|-------------|
| FluidAudio (Offline) | 14.5% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 354.9s processing • Test runtime: 5m 54s • 12/17/2025, 05:16 PM EST

github-actions bot commented Dec 15, 2025

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
|--------|-------|--------|--------|-------------|
| DER | 15.1% | <30% | | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | | Jaccard Error Rate |
| RTFx | 22.22x | >1.0x | | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|-------|----------|---|-------------|
| Model Download | 7.582 | 16.1 | Fetching diarization models |
| Model Compile | 3.249 | 6.9 | CoreML compilation |
| Audio Load | 0.103 | 0.2 | Loading audio file |
| Segmentation | 14.157 | 30.0 | Detecting speech regions |
| Embedding | 23.595 | 50.0 | Extracting speaker voices |
| Clustering | 9.438 | 20.0 | Grouping same speakers |
| Total | 47.217 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|--------|-----|-------|
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150x real-time (RTFx of 150)
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 47.2s diarization time • Test runtime: 1m 20s • 12/17/2025, 05:11 PM EST

github-actions bot commented Dec 15, 2025

VAD Benchmark Results

Performance Comparison

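| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---------|----------|-----------|--------|----------|------|-------|
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 486.3x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 386.0x faster | 50 |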

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%
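
As a quick sanity check on the table above, the F1-Score column can be reproduced from the Precision and Recall columns using the standard harmonic-mean definition. This is a minimal sketch; `f1Score` is a hypothetical helper name, not a function from the benchmark code.

```swift
import Foundation

/// F1 is the harmonic mean of precision and recall.
/// Both inputs are fractions in 0...1 (hypothetical helper for illustration).
func f1Score(precision: Double, recall: Double) -> Double {
    let sum = precision + recall
    guard sum > 0 else { return 0 }
    return 2 * precision * recall / sum
}

// Precision 86.2% and recall 100.0%, as reported for both datasets.
let reportedF1 = f1Score(precision: 0.862, recall: 1.0)
print(String(format: "F1 = %.1f%%", reportedF1 * 100))  // F1 = 92.6%
```

With perfect recall, F1 is limited only by precision, which is why it sits above the 86.2% precision figure.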

@Alex-Wengg Alex-Wengg force-pushed the feat/parakeet-eou-integration-2 branch from 16fa53f to 3c5a75a on December 15, 2025 21:54
Alex-Wengg and others added 2 commits December 15, 2025 17:03
- Add StreamingEouAsrManager for real-time streaming ASR
- Add RnntDecoder for RNN-T decoding
- Add NeMoMelSpectrogram for audio preprocessing
- Add Tokenizer for sentencepiece tokenization
- Add StreamingEncoderState for encoder cache management
- Update HuggingFaceDownloader to support 160ms and 320ms models
- Add ParakeetEouCommand CLI for benchmarking
- Add TextNormalizerOfficial for proper WER calculation
- Support --chunk-ms flag for 160ms/320ms chunk sizes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add --eou-debounce parameter to control minimum silence duration before
End-of-Utterance triggers. Default is 1280ms. This allows users to
reduce false EOU triggers during brief pauses in natural speech while
keeping fast 160ms/320ms chunk sizes for low-latency transcription.

- Add eouDebounceMs parameter to StreamingEouAsrManager
- Implement debounce logic: count consecutive EOU predictions
- Add --eou-debounce CLI flag with 1280ms default
- Reset debounce timer when speech tokens are produced
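
The debounce behaviour described in this commit message can be sketched as a small counter. This is an illustration under assumptions, not the code in `StreamingEouAsrManager`: the type name `EouDebouncer`, its method names, and the choice to reset on any non-EOU chunk are all hypothetical.

```swift
import Foundation

/// Hypothetical helper, not the PR's implementation: reports end-of-utterance
/// only after enough consecutive EOU chunks to cover the debounce window.
struct EouDebouncer {
    let chunkMs: Int
    let debounceMs: Int
    private(set) var consecutiveEouChunks = 0

    init(chunkMs: Int, debounceMs: Int = 1280) {
        self.chunkMs = chunkMs
        self.debounceMs = debounceMs
    }

    /// Consecutive EOU chunks needed to span the window (ceiling division).
    var requiredChunks: Int {
        max(1, (debounceMs + chunkMs - 1) / chunkMs)
    }

    /// Feed one chunk's result; returns true when the utterance should end.
    mutating func update(predictedEou: Bool, emittedSpeechTokens: Bool) -> Bool {
        // Speech tokens (or a non-EOU prediction) restart the silence count.
        if emittedSpeechTokens || !predictedEou {
            consecutiveEouChunks = 0
            return false
        }
        consecutiveEouChunks += 1
        if consecutiveEouChunks >= requiredChunks {
            consecutiveEouChunks = 0
            return true
        }
        return false
    }
}

var demo = EouDebouncer(chunkMs: 320)
print(demo.requiredChunks)                       // 4
print(EouDebouncer(chunkMs: 160).requiredChunks) // 8
```

Under these assumptions the 1280ms default corresponds to 4 consecutive EOU chunks at 320ms or 8 at 160ms, so a brief pause shorter than that never ends the utterance.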

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@Alex-Wengg Alex-Wengg force-pushed the feat/parakeet-eou-integration-2 branch from 3c5a75a to 4acfea2 on December 15, 2025 22:29
Alex-Wengg and others added 2 commits December 15, 2025 17:33
MLMultiArrayDataType.int8 is not available on older macOS versions,
causing CI build failures. The @unknown default case handles any
unknown data types gracefully.
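
A minimal sketch of the pattern this commit message describes, assuming the fix was to drop the explicit `.int8` case and rely on `@unknown default`. The function `bytesPerElement(of:)` is hypothetical and only illustrates the switch shape; on newer SDKs the compiler may warn that a known case is unhandled, but the code still builds.

```swift
import CoreML

/// Hypothetical helper: element size for the data types that exist on all
/// supported OS versions. Newer cases fall through to `@unknown default`
/// instead of being named, so the code compiles against older SDKs.
func bytesPerElement(of dataType: MLMultiArrayDataType) -> Int? {
    switch dataType {
    case .double: return 8
    case .float32: return 4
    case .float16: return 2
    case .int32: return 4
    @unknown default: return nil  // e.g. types added in later OS releases
    }
}
```

Returning `nil` lets the caller decide whether an unrecognised type is an error, rather than crashing.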

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…nchmark

- Remove redundant HuggingFaceDownloader.swift, use DownloadUtils instead
- Add parakeetEou160/320 to Repo enum with subPath support
- Add ModelNames.ParakeetEOU with required model names
- Update DownloadUtils.downloadRepo to handle repo subdirectories
- Add GitHub Actions workflow for 320ms Parakeet EOU benchmark (100 files)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
github-actions bot commented Dec 15, 2025

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---------|---------|---------|------|--------|
| test-clean | 0.57% | 0.00% | 3.28x | |
| test-other | 1.80% | 0.00% | 2.49x | |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
|---------|---------|---------|------|--------|
| test-clean | 0.40% | 0.00% | 3.40x | |
| test-other | 1.56% | 0.00% | 1.68x | |

Streaming (v3)

| Metric | Value | Description |
|--------|-------|-------------|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.39x | Streaming real-time factor |
| Avg Chunk Time | 2.209s | Average time to process each chunk |
| Max Chunk Time | 2.764s | Maximum chunk processing time |
| First Token | 2.679s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
|--------|-------|-------------|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.21x | Streaming real-time factor |
| Avg Chunk Time | 4.142s | Average time to process each chunk |
| Max Chunk Time | 6.384s | Maximum chunk processing time |
| First Token | 3.610s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 9m3s • 12/17/2025, 05:12 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)
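
The RTFx definition above is a single division, sketched here for concreteness. `rtfx(audioSeconds:processingSeconds:)` is a hypothetical helper, not the benchmark's own function; the AMI figures are the audio duration and stage totals reported in the diarization comments earlier in this thread.

```swift
import Foundation

/// RTFx = total audio duration / total processing time (higher is faster).
/// Hypothetical helper; returns nil when the processing time is not positive.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double? {
    guard processingSeconds > 0 else { return nil }
    return audioSeconds / processingSeconds
}

// The example from the comment: 10 s of audio processed in 5 s.
print(rtfx(audioSeconds: 10, processingSeconds: 5) ?? 0)  // 2.0

// AMI ES2004a (1049.0 s of audio) with the reported pipeline totals.
let streaming = rtfx(audioSeconds: 1049.0, processingSeconds: 47.217) ?? 0
let offline = rtfx(audioSeconds: 1049.0, processingSeconds: 320.694) ?? 0
print(String(format: "%.2fx, %.2fx", streaming, offline))  // 22.22x, 3.27x
```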

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

github-actions bot commented Dec 15, 2025

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
|--------|-------|-------------|
| WER (Avg) | 6.91% | Average Word Error Rate |
| WER (Med) | 4.00% | Median Word Error Rate |
| RTFx | 4.87x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 112.5s | Total processing time |
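
For readers unfamiliar with the WER rows, the metric is the word-level edit distance between reference and hypothesis divided by the reference length. The sketch below is an assumption-laden illustration: `wordErrorRate` is a hypothetical name, and lowercasing plus whitespace splitting stands in for the project's real text normalizer, which is not reproduced here.

```swift
import Foundation

/// Word-level Levenshtein distance divided by the reference word count.
/// Hypothetical helper; normalization is reduced to lowercasing for brevity.
func wordErrorRate(reference: String, hypothesis: String) -> Double {
    let ref = reference.lowercased().split(separator: " ").map(String.init)
    let hyp = hypothesis.lowercased().split(separator: " ").map(String.init)
    guard !ref.isEmpty else { return hyp.isEmpty ? 0 : 1 }

    // previous[j] holds the distance between the processed reference prefix
    // and the first j hypothesis words.
    var previous = Array(0...hyp.count)
    for (i, refWord) in ref.enumerated() {
        var current = [i + 1] + Array(repeating: 0, count: hyp.count)
        for (j, hypWord) in hyp.enumerated() {
            let substitution = previous[j] + (refWord == hypWord ? 0 : 1)
            let deletion = previous[j + 1] + 1
            let insertion = current[j] + 1
            current[j + 1] = min(substitution, deletion, insertion)
        }
        previous = current
    }
    return Double(previous[hyp.count]) / Double(ref.count)
}

// One dropped word out of six reference words.
let exampleWer = wordErrorRate(reference: "the cat sat on the mat",
                               hypothesis: "the cat sat on mat")
print(String(format: "WER = %.1f%%", exampleWer * 100))  // WER = 16.7%
```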

Streaming Metrics

| Metric | Value | Description |
|--------|-------|-------------|
| Avg Chunk Time | 0.113s | Average chunk processing time |
| Max Chunk Time | 0.225s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 2m4s • 12/17/2025, 05:05 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

@Alex-Wengg

@claude

Alex-Wengg and others added 3 commits December 15, 2025 20:47
- Update BenchmarkJSONOutput to use nested summary structure
- Add BenchmarkSummary struct with proper field names (averageWER, medianWER, medianRTFx)
- Add StreamingMetrics struct for chunk processing times
- Calculate and output median WER and RTFx
- Remove dead code StreamingEncoderState.swift
- Apply swift-format to TextNormalizer files
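
The median reported alongside the average can be computed as below. This is a generic sketch with a hypothetical `median` helper and made-up per-file values, not the code added to the benchmark summary.

```swift
import Foundation

/// Median of a list of values; nil for an empty list (hypothetical helper).
func median(_ values: [Double]) -> Double? {
    guard !values.isEmpty else { return nil }
    let sorted = values.sorted()
    let mid = sorted.count / 2
    // Even count: average the two middle values. Odd count: take the middle.
    return sorted.count % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
}

// Illustrative per-file WER values (not benchmark data): one bad file pulls
// the average up while the median stays low.
let perFileWer = [0.0, 0.02, 0.04, 0.05, 0.40]
let average = perFileWer.reduce(0, +) / Double(perFileWer.count)
print(average, median(perFileWer) ?? 0)
```

Reporting both is useful because a few poorly transcribed files can dominate the average.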

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Pre-allocate reusable buffers to avoid allocations in hot path
- Vectorize preemphasis filter using vDSP_vsma
- Use memcpy for bulk padding copy
- Vectorize windowing with vDSP_vmul
- Use vDSP_mmul for mel filterbank matrix-vector multiply
- Vectorize power spectrum with vDSP_vsq and vDSP_vadd
- Flatten mel filterbank for efficient vDSP operations

RTFx improved from ~12x to ~19x on M2 hardware.
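
As an illustration of one item in this list, a preemphasis filter y[n] = x[n] - a·x[n-1] maps directly onto `vDSP_vsma`, which computes D = A·b + C element-wise. This is a sketch, not the code in `NeMoMelSpectrogram`: the function name is hypothetical and the 0.97 coefficient is an assumed default.

```swift
import Accelerate

/// Hypothetical sketch of a vectorized preemphasis filter.
/// y[0] = x[0]; y[n] = x[n] - coefficient * x[n-1] for n >= 1.
func preemphasis(_ x: [Float], coefficient: Float = 0.97) -> [Float] {
    guard x.count > 1 else { return x }
    var y = [Float](repeating: 0, count: x.count)
    y[0] = x[0]
    var negativeCoefficient = -coefficient
    x.withUnsafeBufferPointer { source in
        y.withUnsafeMutableBufferPointer { destination in
            // Bind base addresses safely instead of force unwrapping them.
            guard let src = source.baseAddress,
                  let dst = destination.baseAddress else { return }
            // dst[1...] = src[0..<n-1] * (-coefficient) + src[1...]
            vDSP_vsma(src, 1, &negativeCoefficient, src + 1, 1,
                      dst + 1, 1, vDSP_Length(x.count - 1))
        }
    }
    return y
}

print(preemphasis([1, 1, 1], coefficient: 0.5))  // [1.0, 0.5, 0.5]
```

The whole filter becomes one library call over the buffer, replacing a per-sample Swift loop.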

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@Alex-Wengg Alex-Wengg changed the title from "Feat/parakeet eou integration 2" to "Feat: Parakeet EOU streaming ASR with 160ms/320ms chunk support" on Dec 16, 2025
The GitHub Actions workflow passes --output parakeet_eou_results.json but
the command was ignoring this flag and always writing to benchmark_results.json.
This caused the workflow to show N/A for all metrics since the expected JSON
file was never created.
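
The fix amounts to reading the flag's value and falling back to the old filename only when the flag is absent. A minimal sketch, assuming simple space-separated arguments; `value(forFlag:in:fallback:)` is a hypothetical helper and does not reflect how the CLI actually parses its options.

```swift
import Foundation

/// Returns the argument following `flag`, or `fallback` if the flag is
/// missing or has no value after it (hypothetical helper).
func value(forFlag flag: String, in arguments: [String], fallback: String) -> String {
    guard let index = arguments.firstIndex(of: flag),
          arguments.indices.contains(index + 1) else {
        return fallback
    }
    return arguments[index + 1]
}

// Honour --output when given; otherwise keep the previous default filename.
let outputPath = value(forFlag: "--output",
                       in: CommandLine.arguments,
                       fallback: "benchmark_results.json")
print("Writing results to \(outputPath)")
```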

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@Alex-Wengg

@claude

claude bot commented Dec 16, 2025

Claude finished @Alex-Wengg's task

Current PR Status Analysis

  • Review current PR status and previous critical issues
  • Check if compilation errors have been resolved
  • Verify force unwrapping violations are addressed
  • Examine recent changes since last review
  • Provide updated recommendations

🎯 Current Status: Mixed Progress

Resolved Issues

  1. Compilation Error Fixed - TextNormalizerOfficial.swift:368 - englishSpellingNormalizer reference has been commented out
  2. Print Statements Cleaned - HuggingFaceDownloader print statements have been converted to proper logging
  3. CI Stability - Multiple benchmarks are passing (Swift Format, Parakeet EOU, VAD, TTS, Podspec)

🚨 Critical Issues Still Present

These must be fixed before merge:

  1. Force Unwrapping Violations (violates CLAUDE.md guidelines):

    • StreamingEouAsrManager.swift:248 - RnntDecoder(decoderModel: self.decoder!, jointModel: self.joint!)
    • RnntDecoder.swift:35,37 - try! MLMultiArray(...) - 2 instances
    • RnntDecoder.swift:92,109,123,124 - Multiple ! unwraps in model output processing
    • Tokenizer.swift:9 - as! [String: String] force cast
  2. Unsafe Memory Operations:

    • NeMoMelSpectrogram.swift:161,173,210-212,229-231,256 - Multiple baseAddress! force unwraps

📋 Required Actions Before Merge

Priority 1 - Critical Safety Issues:

// Fix StreamingEouAsrManager.swift:248
guard let decoder = self.decoder, let joint = self.joint else {
    throw StreamingError.modelsNotLoaded
}
self.rnntDecoder = RnntDecoder(decoderModel: decoder, jointModel: joint)

// Fix RnntDecoder.swift:35-37
guard let hState = try? MLMultiArray(shape: [NSNumber(value: layers), NSNumber(value: 1), NSNumber(value: hiddenSize)], dataType: .float32),
      let cState = try? MLMultiArray(shape: [NSNumber(value: layers), NSNumber(value: 1), NSNumber(value: hiddenSize)], dataType: .float32) else {
    throw DecoderError.stateInitializationFailed
}

Priority 2 - Model Output Safety:
Replace force unwraps in RnntDecoder.swift:92,109,123,124 with proper error handling using guard let statements.
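
The same `guard let` pattern applies to the `as! [String: String]` force cast flagged for Tokenizer.swift. The sketch below is illustrative only: it assumes the vocabulary JSON maps token-id strings to token text, and the names `VocabularyError` and `loadVocabulary(from:)` are hypothetical.

```swift
import Foundation

enum VocabularyError: Error {
    case notADictionary
    case invalidTokenId(String)
}

/// Hypothetical replacement for a force cast: parse the vocabulary JSON and
/// throw a descriptive error instead of crashing on unexpected content.
/// Assumes the file looks like {"0": "<blk>", "1": "the", ...}.
func loadVocabulary(from data: Data) throws -> [Int: String] {
    let object = try JSONSerialization.jsonObject(with: data)
    guard let raw = object as? [String: String] else {
        throw VocabularyError.notADictionary
    }
    var vocabulary: [Int: String] = [:]
    for (key, token) in raw {
        guard let id = Int(key) else {
            throw VocabularyError.invalidTokenId(key)
        }
        vocabulary[id] = token
    }
    return vocabulary
}
```

A malformed or truncated download then surfaces as a thrown error the caller can report, rather than a crash inside the tokenizer.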

🔍 Performance & Architecture Assessment

Positive Aspects:

  • ✅ Comprehensive benchmarking infrastructure working well
  • ✅ NeMoMelSpectrogram shows good performance optimization with vDSP
  • ✅ Streaming architecture is well-structured with proper EOU detection
  • ✅ Swift formatting compliance maintained

Architecture Notes:

  • Large file complexity (StreamingEouAsrManager.swift - 487 lines) could benefit from decomposition
  • Memory management in streaming contexts needs careful review due to MLMultiArray allocations

🚀 CI Status

Current runs show promising results:

  • ✅ Swift Format: Passing
  • ✅ Parakeet EOU Benchmark: Passing
  • ✅ VAD Benchmark: Passing
  • 🟡 Build and Test: In Progress
  • 🟡 ASR Benchmark: In Progress
  • 🟡 Diarizer Performance: In Progress

🎯 Recommendation

Do not merge until force unwrapping violations are resolved. These violate the project's critical development rules in CLAUDE.md and could cause runtime crashes. The fixes are straightforward but essential for production safety.

Once safety issues are addressed, this will be a valuable addition providing production-ready streaming ASR with EOU detection.


Comment on lines +235 to +238
| Chunk Size | WER (Avg) | RTFx | Total Time |
|------------|-----------|------|------------|
| 320ms | 4.92% | 12.48x | 1558s (26m) |
| 160ms | 8.29% | 4.78x | 4070s (68m) |
Member

how does this compare with the huggingface nemo model?

Contributor Author

They describe their benchmark as a "160ms streaming setting", but I am not sure whether that means 160 ms audio chunks or the 160 ms EOU detection; it didn't really specify.

[Screenshot: CleanShot 2025-12-15 at 11 43 26@2x]

https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1

@Alex-Wengg Alex-Wengg marked this pull request as ready for review December 16, 2025 05:34
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@BrandonWeng BrandonWeng left a comment

mostly just nits - otherwise good to merge once you address them. great job!

also update the readme? add this and replace the streaming docs that we have right now?

- Remove TextNormalizerOfficial, use TextNormalizer everywhere
- Rename pureCoreML to useStreamingEou for clarity
- Remove duplicate comment in AsrBenchmark
- Add StreamingEouAsrManager API documentation
- Add streaming CLI docs to ASR/GettingStarted.md
- Add parakeet-eou command to CLI README
- Compact CLI documentation examples

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@Alex-Wengg Alex-Wengg merged commit 892da4f into main Dec 17, 2025
10 checks passed
@Alex-Wengg Alex-Wengg deleted the feat/parakeet-eou-integration-2 branch December 17, 2025 22:18
Alex-Wengg added a commit that referenced this pull request Dec 18, 2025
- Add Parakeet EOU 120M streaming ASR with End-of-Utterance detection
- Support 160ms and 320ms chunk sizes with automatic HuggingFace model
downloads
- benchmarks.md
- Add GitHub Actions CI benchmark workflow for Parakeet EOU

Changes
- StreamingEouAsrManager - streaming pipeline with configurable chunk
sizes
- NeMoMelSpectrogram - native Swift mel spectrogram with vDSP
vectorization
- RnntDecoder - RNN-T greedy decoder with EOU detection
- Configurable EOU debounce (default 1280ms)

---------
aryasaatvik added a commit to AryaLabsHQ/FluidAudio that referenced this pull request Dec 18, 2025
Remove tokenizer.model and preprocessorFile from required models list.
These files don't exist in the HuggingFace repo and aren't used:
- preprocessor: Native Swift NeMoMelSpectrogram is used instead
- tokenizer.model: vocab.json is used for the Tokenizer class

Fixes streaming model download failure introduced in PR FluidInference#216.
Alex-Wengg added a commit that referenced this pull request Jan 1, 2026
Alex-Wengg added a commit that referenced this pull request Jan 1, 2026
Alex-Wengg added a commit that referenced this pull request Jan 1, 2026
Alex-Wengg added a commit that referenced this pull request Jan 1, 2026
Alex-Wengg added a commit that referenced this pull request Jan 1, 2026
Alex-Wengg added a commit that referenced this pull request Jan 1, 2026
SGD2718 pushed a commit that referenced this pull request Jan 4, 2026
Alex-Wengg added a commit that referenced this pull request Jan 5, 2026