Fix EOU frame count calculation for center-padded mel spectrograms #444
Alex-Wengg merged 33 commits into main from
Conversation
Fixes #441. The `AudioMelSpectrogram.computeFlat()` methods were using an incorrect frame count formula that didn't account for the nFFT/2 center padding applied before STFT processing.

Old formula: `numFrames = audioCount / hopLength`
New formula: `numFrames = 1 + (paddedCount - winLength) / hopLength`

This caused shape mismatches when using StreamingEouAsrManager with 320ms chunks, producing 63 frames instead of the expected 64. The fix applies to both `computeFlat()` and `computeFlatTransposed()` in their `.center` padding mode, matching NeMo's computation.
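The off-by-one can be reproduced numerically. This is an illustration only, not the Swift source; the parameter values (hop 160, window 400, nFFT 512 at 16 kHz) are assumptions based on typical NeMo mel settings:

```python
# Illustration of the old vs. new frame count formulas (assumed parameters,
# not read from AudioMelSpectrogram itself).
HOP, WIN, NFFT = 160, 400, 512

def frames_old(audio_count: int) -> int:
    return audio_count // HOP

def frames_new(audio_count: int) -> int:
    # Center padding adds nFFT/2 samples on each side before the STFT.
    padded_count = audio_count + 2 * (NFFT // 2)
    return 1 + (padded_count - WIN) // HOP

# 630 ms of 16 kHz audio: the old formula is off by one.
audio_count = 630 * 16000 // 1000   # 10080 samples
print(frames_old(audio_count))      # 63
print(frames_new(audio_count))      # 64
```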
Claude finished @Alex-Wengg's task

PR Review: Fix EOU frame count calculation for center-padded mel spectrograms

I've completed the code review and submitted detailed feedback through GitHub's review system.
Review Summary

This PR correctly fixes the frame count calculation for center-padded mel spectrograms. The key findings:

- ✅ Mathematical correctness: the new formula `1 + (paddedCount - winLength) / hopLength` accounts for center padding
- ✅ Consistency: both `computeFlat()` and `computeFlatTransposed()` use the same logic
- ✅ Alignment with NeMo: the implementation now matches NeMo's mel spectrogram computation

The changes are minimal, focused, and address the root cause of the shape mismatch issue in StreamingEouAsrManager. I've provided detailed technical analysis and minor suggestions for improvement in the GitHub review.
Code Review Summary
This PR correctly fixes the frame count calculation for center-padded mel spectrograms. The changes are mathematically sound and address the root cause described in issue #441.
Key Changes Analysis
Formula Fix (Lines 193-197 & 339-343):
- ✅ Correct: Changed from `audioCount / hopLength` to `1 + (paddedCount - winLength) / hopLength`
- ✅ Reasoning: The new formula properly accounts for nFFT/2 center padding applied before STFT processing
- ✅ Consistency: Both `computeFlat()` and `computeFlatTransposed()` now use the same logic
Code Organization (Lines 195-196):
- ✅ Improvement: Moved `padLength` and `paddedCount` calculations earlier in `computeFlat()` for better readability
- ✅ Consistency: Variables are now defined before use, making the logic flow clearer
Technical Correctness
The mathematical derivation is sound:
- With center padding, audio is padded by nFFT/2 on each side
- Total padded length = `audioCount + 2 * (nFFT/2)`
- Number of frames = `1 + (paddedLength - windowLength) / hopLength`
- This matches NeMo's mel spectrogram computation as stated in the comments
Test Coverage
The PR description indicates comprehensive testing:
- All chunk sizes (160ms, 320ms, 630ms, 1280ms) now produce correct frame counts
- The off-by-one error (63→64 frames for 630ms) has been resolved
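The quoted frame counts can be sanity-checked against the corrected formula. The mel parameters below (hop 160, window 400, nFFT 512 at 16 kHz) are assumptions, not read from the Swift source:

```python
# Sanity check of the center-padded frame count formula for several
# documented chunk sizes (illustrative parameters).
HOP, WIN, NFFT, SR = 160, 400, 512, 16000

def center_padded_frames(audio_count: int) -> int:
    padded = audio_count + 2 * (NFFT // 2)
    return 1 + (padded - WIN) // HOP

for ms, expected in [(160, 17), (630, 64), (1280, 129)]:
    samples = ms * SR // 1000
    got = center_padded_frames(samples)
    print(f"{ms}ms -> {got} frames (expected {expected})")
```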
Potential Concerns
Nit: The comment on lines 193-194 could be more concise:

`// Frame count for center-padded audio: 1 + (paddedCount - winLength) / hopLength`

Minor: Consider adding a unit test specifically for the frame count formula to prevent future regressions.
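A property-style regression check along the lines of that suggestion could compare the closed-form formula against brute-force frame enumeration. This is a sketch with assumed parameters, not the project's test code:

```python
# Regression check: the closed-form frame count must equal the number of
# window positions that fit inside the center-padded signal.
HOP, WIN, NFFT = 160, 400, 512

def frames_closed_form(audio_count: int) -> int:
    padded = audio_count + 2 * (NFFT // 2)
    return 1 + (padded - WIN) // HOP

def frames_by_enumeration(audio_count: int) -> int:
    padded = audio_count + 2 * (NFFT // 2)
    count, start = 0, 0
    while start + WIN <= padded:   # a frame fits while the window is in bounds
        count += 1
        start += HOP
    return count

for n in range(WIN, 20000, 37):    # arbitrary spread of audio lengths
    assert frames_closed_form(n) == frames_by_enumeration(n)
print("closed form matches enumeration")
```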
Verdict
This is a solid fix that addresses the core issue without introducing side effects. The changes are minimal, focused, and mathematically correct. The PR resolves the shape mismatch issue in StreamingEouAsrManager and aligns the implementation with NeMo's approach.
Recommendation: Approve once manual testing with StreamingEouAsrManager is completed.
Parakeet EOU Benchmark Results ✅

Status: Benchmark passed

Performance Metrics

Streaming Metrics

Test runtime: 1m28s • 03/27/2026, 06:24 PM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O
Speaker Diarization Benchmark Results

Speaker Diarization Performance
Evaluating "who spoke when" detection accuracy

Diarization Pipeline Timing Breakdown
Time spent in each stage of speaker diarization

Speaker Diarization Research Comparison
Research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 34.8s diarization time • Test runtime: 2m 4s • 03/27/2026, 06:31 PM EST
Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Sortformer High-Latency • ES2004a • Runtime: 2m 37s • 2026-03-27T22:34:43.604Z
Tests verify that AudioMelSpectrogram produces the correct number of mel frames for each EOU chunk size:
- 160ms: 17 frames
- 320ms: 64 frames (was 63 before fix)
- 1280ms: 129 frames

All tests pass with 10 different audio lengths per chunk size.
Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)
Optimal clustering with Hungarian algorithm for maximum accuracy

Offline VBx Pipeline Timing Breakdown
Time spent in each stage of batch diarization

Speaker Diarization Research Comparison
Offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 221.9s processing • Test runtime: 3m 59s • 03/27/2026, 06:37 PM EST
PocketTTS Smoke Test ✅

Runtime: 0m40s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality may differ from Apple Silicon.
Qwen3-ASR int8 Smoke Test ✅

Runtime: 3m3s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

Parakeet v2 (English-optimized)

Streaming (v3)

Streaming (v2)

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming
25 files per dataset • Test runtime: 6m43s • 03/27/2026, 06:30 PM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time

Expected RTFx Performance on Physical M1 Hardware:
- M1 Mac: ~28x (clean), ~25x (other)

Testing methodology follows HuggingFace Open ASR Leaderboard
Minor formatting changes:
- Fix line breaks in XCTAssert calls
- Reorder imports
- Add trailing commas
VAD Benchmark Results

Performance Comparison

Dataset Details

✅: Average F1-Score above 70%
Swift 6.1.3 (CI) and Swift 6.3 (local) have different import ordering behavior for @preconcurrency. Revert to avoid CI failures.
… mel

The new frame count formula `1 + (paddedCount - winLength) / hopLength` accounts for center padding, but SortformerDiarizerPipeline was still using the old formula `melLength * stride` to calculate samples consumed. This caused samplesConsumed to exceed audioBuffer.count, triggering the else branch that resets lastAudioSample and breaks preemphasis continuity.

Changes:
- Reverse the frame count formula to calculate actual samples consumed: `samplesConsumed = (melLength - 1) * hopLength + winLength - nFFT`
- Update producedMelFramesAvailable() to use the center-padded formula
- Make AudioMelSpectrogram.nFFT public so callers can access it

Fixes Devin Review issue in PR #444.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
🔴 Stale samplesNeeded formula causes finalization to silently drop diarization frames
The PR updated producedMelFramesAvailable() at SortformerDiarizerPipeline.swift:887-889 to use the new center-padded frame count formula (1 + (paddedCount - melWindow) / melStride), but the samplesNeeded guard in preprocessAudioToFeatureTargetLocked at SortformerDiarizerPipeline.swift:761-768 still uses the old formula (framesNeeded * config.melStride or (framesNeeded - 1) * config.melStride + config.melWindow). During finalization, exactFinalizationPaddingSamples (SortformerDiarizerPipeline.swift:807) consults the updated producedMelFramesAvailable() and may determine no padding is needed (returns 0). However, when the finalization loop then calls preprocessAudioToFeatureTargetLocked (SortformerDiarizerPipeline.swift:528), the stale samplesNeeded requires more audio than is available, so it returns without computing features. This causes makeStreamingChunkLocked() to return nil, the loop breaks with a warning, and the tail diarization frames are lost.
Concrete example showing the mismatch window
With defaults (melStride=160, melWindow=400, nFFT=512) and featureBuffer not empty, needing 10 additional frames:

- `producedMelFramesAvailable()` says 10 frames need `(10-1)*160 + 400 - 512 = 1328` samples (new formula)
- `samplesNeeded` requires `10 * 160 = 1600` samples (old formula)
- If `audioBuffer.count` is between 1328 and 1599 (a 272-sample / 17ms window), finalization adds no padding but also cannot compute features.
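Recomputing the two requirements directly from the stated defaults (values recomputed here, as a check on the arithmetic):

```python
# Sample requirements for 10 additional frames under each formula,
# with melStride=160, melWindow=400, nFFT=512.
STRIDE, WINDOW, NFFT = 160, 400, 512
frames_needed = 10

new_formula = (frames_needed - 1) * STRIDE + WINDOW - NFFT   # center-padded minimum
old_formula = frames_needed * STRIDE                         # stale samplesNeeded

print(new_formula)                # 1328
print(old_formula)                # 1600
print(old_formula - new_formula)  # 272-sample mismatch window (17 ms at 16 kHz)
```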
Revert samplesConsumed calculation to hop-aligned consumption to maintain proper streaming continuity.

Issue: The previous fix computed the minimum samples needed for frame production, but this created overlapping frames in streaming mode. With center padding, the padding affects frame *production* (how computeFlatTransposed generates frames), but samples should still be consumed in hop-aligned fashion for streaming continuity.

Fix: Use `melLength * hopLength` for samples consumed, regardless of center padding. This maintains proper frame boundaries across streaming chunks and prevents feature buffer corruption.

Addresses: #444 (review)
Tests: SortformerStreamingIntegrationTests pass (2/2)
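The production-vs-consumption distinction the commit draws can be shown numerically. The parameter values below (hop 160, window 400, nFFT 512) are assumptions, not read from the pipeline:

```python
# Chunk 1 emits m frames at sample offsets 0, HOP, ..., (m-1)*HOP. For the
# next chunk's first frame to continue that hop grid, consumption must be
# hop-aligned; the minimum-production amount lands short and re-reads audio.
HOP, WIN, NFFT = 160, 400, 512
m = 64  # mel frames produced by chunk 1

hop_aligned = m * HOP                         # next frame on the hop grid
min_production = (m - 1) * HOP + WIN - NFFT   # minimum samples to produce m frames

print(hop_aligned)                   # 10240
print(min_production)                # 9968
print(hop_aligned - min_production)  # 272 samples re-read if mis-consumed
```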
Benchmarks showed:
- 80ms: 63.31% WER (unusable - gibberish transcripts)
- 160ms: Model files incomplete (missing preprocessor.mlmodelc)
- 560ms: 0.59% WER, RTFx 1.8x ✅
- 1120ms: 0.59% WER, RTFx 1.9x ✅

Only keep the two working chunk sizes with excellent accuracy.

Changes:
- NemotronChunkSize enum: removed ms80 and ms160 cases
- ModelNames.Repo: removed nemotronStreaming80 and nemotronStreaming160
- CLI commands: updated help text and validation
- Tests: updated to test only supported chunk sizes
…urements

Ran full LibriSpeech test-clean benchmarks (2,620 files) on M2 hardware to verify and update documented performance metrics.

Changes:
- 320ms: 4.88% WER (was 4.87%), 19.25x RTFx (was 12.48x), 16.9m (was 26m)
- 160ms: 8.23% WER (was 8.29%), 5.78x RTFx (was 4.78x), 56.4m (was 68m)
- Added Median WER column: 0.00% (320ms), 5.26% (160ms)

WER values confirmed accurate (within 0.06%). Improved RTFx and faster times reflect actual M2 performance vs originally documented values.

Also fix import order in SortformerStreamingIntegrationTests.swift for consistency.
f79c2dd to 82c2c0f
Save benchmark results to /tmp/nemotron_{chunk_size}ms_benchmark.json for easy retrieval.

This ensures results are captured even when running in RELEASE mode, which logs to the unified logging system instead of terminal output.

Output includes:
- chunkSize, filesProcessed, totalWords, totalErrors
- wer, audioDuration, processingTime, rtfx
This will be merged today.
Throw error if totalWords=0, totalAudioDuration=0, or totalProcessingTime=0 to catch benchmark failures early instead of reporting misleading 0% WER / 0x RTFx.
Use inverted center-padded formula: (melLength - 1) * hopLength + winLength - nFFT This ensures samplesConsumed ≤ audioBuffer.count, preventing preemphasis state from being reset on every chunk. Previous formula (melLength * hopLength) always exceeded audioBuffer.count by ~272 samples, causing lastAudioSample=0 reset and corrupting preemphasis continuity.
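The claim that the inverted formula always stays within the buffer can be checked exhaustively over a range of lengths. Parameters are illustrative (hop 160, window 400, nFFT 512), not read from the Swift source:

```python
# Check that the inverted center-padded formula never consumes more samples
# than the audio buffer holds, for a spread of audio lengths.
HOP, WIN, NFFT = 160, 400, 512

def mel_frames(audio_count: int) -> int:
    return 1 + (audio_count + 2 * (NFFT // 2) - WIN) // HOP

for n in range(WIN, 30000, 53):
    m = mel_frames(n)
    inverted = (m - 1) * HOP + WIN - NFFT
    assert inverted <= n    # samplesConsumed stays within audioBuffer.count
print("inverted formula never overruns the buffer")
```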
Swift 6.1 and 6.3 have different formatting rules for import statements. Update CI to use 6.3 to prevent format conflicts.
Swift 6.3 is not yet available on GitHub Actions runners. Keep Swift 6.1 formatter with blank line between imports.
Swift 6.1 (CI) and 6.3 (local) have different rules for import statement spacing. Disable OrderedImports to prevent formatter conflicts.
If totalProcessingTime is 0, rtf would be 0 and rtfx = 1.0 / rtf would be +Infinity. Add guard to match AsrBenchmark.swift validation.
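A minimal sketch of the guard described in these two commits, rejecting degenerate totals before computing RTFx instead of reporting 0% WER or infinite RTFx. Names are illustrative, not the actual AsrBenchmark API:

```python
# Hypothetical helper mirroring the described validation: fail fast on
# zero totals rather than emitting misleading metrics.
def compute_rtfx(total_audio_duration: float, total_processing_time: float) -> float:
    if total_audio_duration <= 0:
        raise ValueError("benchmark produced no audio duration")
    if total_processing_time <= 0:
        raise ValueError("benchmark produced no processing time; RTFx would be infinite")
    return total_audio_duration / total_processing_time

print(compute_rtfx(1049.0, 34.8))  # roughly 30x
```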
🟡 requiredBufferedSamples in exactFinalizationPaddingSamples uses old formula, over-estimating padding
The requiredBufferedSamples calculation in exactFinalizationPaddingSamples (lines 853-878) was not updated to match the new center-padded frame count formula. For the empty buffer case it computes (frames-1)*melStride + melWindow and for the non-empty case max(melWindow, frames*melStride), but the actual minimum samples needed with center padding is (frames-1)*melStride + melWindow - nFFT. This over-estimates by up to nFFT (512 samples), causing more zero-padding than necessary during finalization, which produces extra silent frames at the end of diarization output.
(Refers to lines 855-877)
- Remove unused audioDuration calculation in single-file transcription
- Remove unused avgRtf calculation in benchmark
The bug was that modelDir used chunkSize.modelSubdirectory (e.g., "320ms") but downloadRepo appends repo.folderName (e.g., "parakeet-realtime-eou-120m-coreml/320ms"), causing downloads to go to a different path than where the code looks for models.

Fixed by using repo.folderName for modelDir instead of chunkSize.modelSubdirectory, so both download and load paths now match:

Download: modelsRoot/parakeet-realtime-eou-120m-coreml/320ms/
Load: modelsRoot/parakeet-realtime-eou-120m-coreml/320ms/
Updated AGENTS.md and CLAUDE.md to explain that OrderedImports rule is disabled due to Swift 6.1 (GitHub Actions CI) vs 6.3 (local) formatter incompatibility. Swift 6.3 is unavailable in GitHub Actions runners, causing the formatters to cycle between adding/removing blank lines between imports. Added inline comment in SortformerStreamingIntegrationTests.swift explaining the non-alphabetical import order.
HuggingFace sometimes returns 200 OK status with HTML error pages during rate limiting or timeouts, bypassing the 429/503 status code checks. Added validateJSONResponse() to detect HTML responses and throw a descriptive error instead of silently failing. Changed a silent 'return' to throw an invalidResponse error when JSON parsing fails.

This fixes PocketTTS/Kokoro download failures where the code would:
1. Receive an HTML error page with 200 OK status
2. Fail to parse it as JSON
3. Silently return (no files listed)
4. Create an empty model directory
5. Fail later when trying to load non-existent models
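The validation described above can be sketched as follows. This is a hypothetical helper (the real validateJSONResponse is Swift): check for an HTML body before attempting JSON parsing, and throw on parse failure instead of returning silently:

```python
import json

def validate_json_response(body: bytes) -> object:
    """Raise a descriptive error if a 200 OK response is actually an HTML error page."""
    text = body.decode("utf-8", errors="replace").lstrip()
    # Rate-limited responses can be HTML despite a 200 status code.
    if text.startswith("<!DOCTYPE") or text.startswith("<html"):
        raise ValueError("received HTML instead of JSON (likely rate limiting)")
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # Throw instead of silently returning, so callers fail fast.
        raise ValueError(f"invalid JSON response: {err}") from err

print(validate_json_response(b'[{"path": "model.mlmodelc"}]'))
```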
Summary
Fixes #441 - StreamingEouAsrManager with 320ms chunks was producing incorrect frame counts, causing shape mismatches.
- `AudioMelSpectrogram.computeFlat()`: use the correct frame count formula
- `AudioMelSpectrogram.computeFlatTransposed()`: same fix in `.center` padding mode
- Changed `numFrames = audioCount / hopLength` to `numFrames = 1 + (paddedCount - winLength) / hopLength`

Root Cause
The original formula didn't account for the center padding (nFFT/2 on each side) that's applied to audio before windowing. This caused the frame count to be off by 1, producing 63 frames instead of 64 for 630ms audio chunks.
Test Results
Frame Count Validation Tests
Added `EouChunkSizeFrameCountTests` - all passing.

Integration Tests (10 files per chunk size)
30 transcriptions total - 100% success rate:
✅ No shape mismatch errors detected across all 30 transcriptions
The 320ms chunk size (the problematic one from issue #441) now works perfectly and actually achieves the lowest WER!
Test Plan
- `AudioMelSpectrogramTests` pass
- `EouChunkSizeFrameCountTests` - all passing

All tests pass successfully.
🤖 Generated with Claude Code