Fix EOU frame count calculation for center-padded mel spectrograms #444
Alex-Wengg merged 33 commits into main from
Conversation
Fixes #441. The `AudioMelSpectrogram.computeFlat()` methods were using an incorrect frame count formula that didn't account for the nFFT/2 center padding applied before STFT processing.

Old formula: `numFrames = audioCount / hopLength`
New formula: `numFrames = 1 + (paddedCount - winLength) / hopLength`

This caused shape mismatches when using StreamingEouAsrManager with 320ms chunks, producing 63 frames instead of the expected 64. The fix applies to both `computeFlat()` and `computeFlatTransposed()` in their `.center` padding mode, matching NeMo's computation.
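The off-by-one can be reproduced numerically. This is an illustration only, not the Swift source; the parameter values (hop 160, window 400, nFFT 512 at 16 kHz) are assumptions based on typical NeMo mel settings:

```python
# Illustration of the old vs. new frame count formulas (assumed parameters,
# not read from AudioMelSpectrogram itself).
HOP, WIN, NFFT = 160, 400, 512

def frames_old(audio_count: int) -> int:
    return audio_count // HOP

def frames_new(audio_count: int) -> int:
    # Center padding adds nFFT/2 samples on each side before the STFT.
    padded_count = audio_count + 2 * (NFFT // 2)
    return 1 + (padded_count - WIN) // HOP

# 630 ms of 16 kHz audio: the old formula is off by one.
audio_count = 630 * 16000 // 1000   # 10080 samples
print(frames_old(audio_count))      # 63
print(frames_new(audio_count))      # 64
```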
Claude finished @Alex-Wengg's task

PR Review: Fix EOU frame count calculation for center-padded mel spectrograms

I've completed the code review and submitted detailed feedback through GitHub's review system.
Review Summary

This PR correctly fixes the frame count calculation for center-padded mel spectrograms. The key findings:

- ✅ Mathematical correctness: the new formula `1 + (paddedCount - winLength) / hopLength` accounts for center padding
- ✅ Consistency: both `computeFlat()` and `computeFlatTransposed()` use the same logic
- ✅ Alignment with NeMo: the implementation now matches NeMo's mel spectrogram computation

The changes are minimal, focused, and address the root cause of the shape mismatch issue in StreamingEouAsrManager. I've provided detailed technical analysis and minor suggestions for improvement in the GitHub review.
Code Review Summary
This PR correctly fixes the frame count calculation for center-padded mel spectrograms. The changes are mathematically sound and address the root cause described in issue #441.
Key Changes Analysis
Formula Fix (Lines 193-197 & 339-343):
- ✅ Correct: Changed from `audioCount / hopLength` to `1 + (paddedCount - winLength) / hopLength`
- ✅ Reasoning: The new formula properly accounts for nFFT/2 center padding applied before STFT processing
- ✅ Consistency: Both `computeFlat()` and `computeFlatTransposed()` now use the same logic
Code Organization (Lines 195-196):
- ✅ Improvement: Moved `padLength` and `paddedCount` calculations earlier in `computeFlat()` for better readability
- ✅ Consistency: Variables are now defined before use, making the logic flow clearer
Technical Correctness
The mathematical derivation is sound:
- With center padding, audio is padded by nFFT/2 on each side
- Total padded length = `audioCount + 2 * (nFFT/2)`
- Number of frames = `1 + (paddedLength - windowLength) / hopLength`
- This matches NeMo's mel spectrogram computation as stated in the comments
Test Coverage
The PR description indicates comprehensive testing:
- All chunk sizes (160ms, 320ms, 630ms, 1280ms) now produce correct frame counts
- The off-by-one error (63→64 frames for 630ms) has been resolved
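The quoted frame counts can be sanity-checked against the corrected formula. The mel parameters below (hop 160, window 400, nFFT 512 at 16 kHz) are assumptions, not read from the Swift source:

```python
# Sanity check of the center-padded frame count formula for several
# documented chunk sizes (illustrative parameters).
HOP, WIN, NFFT, SR = 160, 400, 512, 16000

def center_padded_frames(audio_count: int) -> int:
    padded = audio_count + 2 * (NFFT // 2)
    return 1 + (padded - WIN) // HOP

for ms, expected in [(160, 17), (630, 64), (1280, 129)]:
    samples = ms * SR // 1000
    got = center_padded_frames(samples)
    print(f"{ms}ms -> {got} frames (expected {expected})")
```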
Potential Concerns
Nit: The comment on lines 193-194 could be more concise:

`// Frame count for center-padded audio: 1 + (paddedCount - winLength) / hopLength`

Minor: Consider adding a unit test specifically for the frame count formula to prevent future regressions.
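A property-style regression check along the lines of that suggestion could compare the closed-form formula against brute-force frame enumeration. This is a sketch with assumed parameters, not the project's test code:

```python
# Regression check: the closed-form frame count must equal the number of
# window positions that fit inside the center-padded signal.
HOP, WIN, NFFT = 160, 400, 512

def frames_closed_form(audio_count: int) -> int:
    padded = audio_count + 2 * (NFFT // 2)
    return 1 + (padded - WIN) // HOP

def frames_by_enumeration(audio_count: int) -> int:
    padded = audio_count + 2 * (NFFT // 2)
    count, start = 0, 0
    while start + WIN <= padded:   # a frame fits while the window is in bounds
        count += 1
        start += HOP
    return count

for n in range(WIN, 20000, 37):    # arbitrary spread of audio lengths
    assert frames_closed_form(n) == frames_by_enumeration(n)
print("closed form matches enumeration")
```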
Verdict
This is a solid fix that addresses the core issue without introducing side effects. The changes are minimal, focused, and mathematically correct. The PR resolves the shape mismatch issue in StreamingEouAsrManager and aligns the implementation with NeMo's approach.
Recommendation: Approve once manual testing with StreamingEouAsrManager is completed.
Parakeet EOU Benchmark Results ✅

Status: Benchmark passed

Performance Metrics

Streaming Metrics

Test runtime: 1m28s • 03/27/2026, 06:24 PM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O
Speaker Diarization Benchmark Results

Speaker Diarization Performance
Evaluating "who spoke when" detection accuracy

Diarization Pipeline Timing Breakdown
Time spent in each stage of speaker diarization

Speaker Diarization Research Comparison
Research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 34.8s diarization time • Test runtime: 2m 4s • 03/27/2026, 06:31 PM EST
Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Sortformer High-Latency • ES2004a • Runtime: 2m 37s • 2026-03-27T22:34:43.604Z
Tests verify that AudioMelSpectrogram produces the correct number of mel frames for each EOU chunk size:
- 160ms: 17 frames
- 320ms: 64 frames (was 63 before fix)
- 1280ms: 129 frames

All tests pass with 10 different audio lengths per chunk size.
Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)
Optimal clustering with Hungarian algorithm for maximum accuracy

Offline VBx Pipeline Timing Breakdown
Time spent in each stage of batch diarization

Speaker Diarization Research Comparison
Offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 221.9s processing • Test runtime: 3m 59s • 03/27/2026, 06:37 PM EST
PocketTTS Smoke Test ✅

Runtime: 0m40s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality may differ from Apple Silicon.
Qwen3-ASR int8 Smoke Test ✅

Runtime: 3m3s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

Parakeet v2 (English-optimized)

Streaming (v3)

Streaming (v2)

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming
25 files per dataset • Test runtime: 6m43s • 03/27/2026, 06:30 PM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time

Expected RTFx Performance on Physical M1 Hardware:
- M1 Mac: ~28x (clean), ~25x (other)

Testing methodology follows HuggingFace Open ASR Leaderboard
Minor formatting changes:
- Fix line breaks in XCTAssert calls
- Reorder imports
- Add trailing commas
VAD Benchmark Results

Performance Comparison

Dataset Details

✅: Average F1-Score above 70%
Swift 6.1.3 (CI) and Swift 6.3 (local) have different import ordering behavior for @preconcurrency. Revert to avoid CI failures.
… mel

The new frame count formula `1 + (paddedCount - winLength) / hopLength` accounts for center padding, but SortformerDiarizerPipeline was still using the old formula `melLength * stride` to calculate samples consumed. This caused samplesConsumed to exceed audioBuffer.count, triggering the else branch that resets lastAudioSample and breaks preemphasis continuity.

Changes:
- Reverse the frame count formula to calculate actual samples consumed: `samplesConsumed = (melLength - 1) * hopLength + winLength - nFFT`
- Update producedMelFramesAvailable() to use the center-padded formula
- Make AudioMelSpectrogram.nFFT public so callers can access it

Fixes Devin Review issue in PR #444.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
🔴 Stale samplesNeeded formula causes finalization to silently drop diarization frames
The PR updated producedMelFramesAvailable() at SortformerDiarizerPipeline.swift:887-889 to use the new center-padded frame count formula (1 + (paddedCount - melWindow) / melStride), but the samplesNeeded guard in preprocessAudioToFeatureTargetLocked at SortformerDiarizerPipeline.swift:761-768 still uses the old formula (framesNeeded * config.melStride or (framesNeeded - 1) * config.melStride + config.melWindow). During finalization, exactFinalizationPaddingSamples (SortformerDiarizerPipeline.swift:807) consults the updated producedMelFramesAvailable() and may determine no padding is needed (returns 0). However, when the finalization loop then calls preprocessAudioToFeatureTargetLocked (SortformerDiarizerPipeline.swift:528), the stale samplesNeeded requires more audio than is available, so it returns without computing features. This causes makeStreamingChunkLocked() to return nil, the loop breaks with a warning, and the tail diarization frames are lost.
Concrete example showing the mismatch window
With defaults (melStride=160, melWindow=400, nFFT=512) and featureBuffer not empty, needing 10 additional frames:

- `producedMelFramesAvailable()` says 10 frames need `(10-1)*160 + 400 - 512 = 1328` samples (new formula)
- `samplesNeeded` requires `10 * 160 = 1600` samples (old formula)
- If `audioBuffer.count` is between 1328 and 1599 (a 272-sample / 17ms window), finalization adds no padding but also cannot compute features.
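Recomputing the two requirements directly from the stated defaults (values recomputed here, as a check on the arithmetic):

```python
# Sample requirements for 10 additional frames under each formula,
# with melStride=160, melWindow=400, nFFT=512.
STRIDE, WINDOW, NFFT = 160, 400, 512
frames_needed = 10

new_formula = (frames_needed - 1) * STRIDE + WINDOW - NFFT   # center-padded minimum
old_formula = frames_needed * STRIDE                         # stale samplesNeeded

print(new_formula)                # 1328
print(old_formula)                # 1600
print(old_formula - new_formula)  # 272-sample mismatch window (17 ms at 16 kHz)
```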
Revert samplesConsumed calculation to hop-aligned consumption to maintain proper streaming continuity.

Issue: The previous fix computed the minimum samples needed for frame production, but this created overlapping frames in streaming mode. With center padding, the padding affects frame *production* (how computeFlatTransposed generates frames), but samples should still be consumed in hop-aligned fashion for streaming continuity.

Fix: Use `melLength * hopLength` for samples consumed, regardless of center padding. This maintains proper frame boundaries across streaming chunks and prevents feature buffer corruption.

Addresses: #444 (review)
Tests: SortformerStreamingIntegrationTests pass (2/2)
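The production-vs-consumption distinction the commit draws can be shown numerically. The parameter values below (hop 160, window 400, nFFT 512) are assumptions, not read from the pipeline:

```python
# Chunk 1 emits m frames at sample offsets 0, HOP, ..., (m-1)*HOP. For the
# next chunk's first frame to continue that hop grid, consumption must be
# hop-aligned; the minimum-production amount lands short and re-reads audio.
HOP, WIN, NFFT = 160, 400, 512
m = 64  # mel frames produced by chunk 1

hop_aligned = m * HOP                         # next frame on the hop grid
min_production = (m - 1) * HOP + WIN - NFFT   # minimum samples to produce m frames

print(hop_aligned)                   # 10240
print(min_production)                # 9968
print(hop_aligned - min_production)  # 272 samples re-read if mis-consumed
```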
Benchmarks showed:
- 80ms: 63.31% WER (unusable - gibberish transcripts)
- 160ms: Model files incomplete (missing preprocessor.mlmodelc)
- 560ms: 0.59% WER, RTFx 1.8x ✅
- 1120ms: 0.59% WER, RTFx 1.9x ✅

Only keep the two working chunk sizes with excellent accuracy.

Changes:
- NemotronChunkSize enum: removed ms80 and ms160 cases
- ModelNames.Repo: removed nemotronStreaming80 and nemotronStreaming160
- CLI commands: updated help text and validation
- Tests: updated to test only supported chunk sizes
…urements

Ran full LibriSpeech test-clean benchmarks (2,620 files) on M2 hardware to verify and update documented performance metrics.

Changes:
- 320ms: 4.88% WER (was 4.87%), 19.25x RTFx (was 12.48x), 16.9m (was 26m)
- 160ms: 8.23% WER (was 8.29%), 5.78x RTFx (was 4.78x), 56.4m (was 68m)
- Added Median WER column: 0.00% (320ms), 5.26% (160ms)

WER values confirmed accurate (within 0.06%). Improved RTFx and faster times reflect actual M2 performance vs originally documented values.

Also fix import order in SortformerStreamingIntegrationTests.swift for consistency.
f79c2dd to 82c2c0f
Save benchmark results to /tmp/nemotron_{chunk_size}ms_benchmark.json for easy retrieval.

This ensures results are captured even when running in RELEASE mode, which logs to the unified logging system instead of terminal output.

Output includes:
- chunkSize, filesProcessed, totalWords, totalErrors
- wer, audioDuration, processingTime, rtfx
This will be merged today.
Throw error if totalWords=0, totalAudioDuration=0, or totalProcessingTime=0 to catch benchmark failures early instead of reporting misleading 0% WER / 0x RTFx.
Use inverted center-padded formula: (melLength - 1) * hopLength + winLength - nFFT This ensures samplesConsumed ≤ audioBuffer.count, preventing preemphasis state from being reset on every chunk. Previous formula (melLength * hopLength) always exceeded audioBuffer.count by ~272 samples, causing lastAudioSample=0 reset and corrupting preemphasis continuity.
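The claim that the inverted formula always stays within the buffer can be checked exhaustively over a range of lengths. Parameters are illustrative (hop 160, window 400, nFFT 512), not read from the Swift source:

```python
# Check that the inverted center-padded formula never consumes more samples
# than the audio buffer holds, for a spread of audio lengths.
HOP, WIN, NFFT = 160, 400, 512

def mel_frames(audio_count: int) -> int:
    return 1 + (audio_count + 2 * (NFFT // 2) - WIN) // HOP

for n in range(WIN, 30000, 53):
    m = mel_frames(n)
    inverted = (m - 1) * HOP + WIN - NFFT
    assert inverted <= n    # samplesConsumed stays within audioBuffer.count
print("inverted formula never overruns the buffer")
```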
Swift 6.1 and 6.3 have different formatting rules for import statements. Update CI to use 6.3 to prevent format conflicts.
Swift 6.3 is not yet available on GitHub Actions runners. Keep Swift 6.1 formatter with blank line between imports.
Swift 6.1 (CI) and 6.3 (local) have different rules for import statement spacing. Disable OrderedImports to prevent formatter conflicts.
If totalProcessingTime is 0, rtf would be 0 and rtfx = 1.0 / rtf would be +Infinity. Add guard to match AsrBenchmark.swift validation.
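A minimal sketch of the guard described in these two commits, rejecting degenerate totals before computing RTFx instead of reporting 0% WER or infinite RTFx. Names are illustrative, not the actual AsrBenchmark API:

```python
# Hypothetical helper mirroring the described validation: fail fast on
# zero totals rather than emitting misleading metrics.
def compute_rtfx(total_audio_duration: float, total_processing_time: float) -> float:
    if total_audio_duration <= 0:
        raise ValueError("benchmark produced no audio duration")
    if total_processing_time <= 0:
        raise ValueError("benchmark produced no processing time; RTFx would be infinite")
    return total_audio_duration / total_processing_time

print(compute_rtfx(1049.0, 34.8))  # roughly 30x
```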
🟡 requiredBufferedSamples in exactFinalizationPaddingSamples uses old formula, over-estimating padding
The requiredBufferedSamples calculation in exactFinalizationPaddingSamples (lines 853-878) was not updated to match the new center-padded frame count formula. For the empty buffer case it computes (frames-1)*melStride + melWindow and for the non-empty case max(melWindow, frames*melStride), but the actual minimum samples needed with center padding is (frames-1)*melStride + melWindow - nFFT. This over-estimates by up to nFFT (512 samples), causing more zero-padding than necessary during finalization, which produces extra silent frames at the end of diarization output.
(Refers to lines 855-877)
- Remove unused audioDuration calculation in single-file transcription
- Remove unused avgRtf calculation in benchmark
The bug was that modelDir used chunkSize.modelSubdirectory (e.g., "320ms") but downloadRepo appends repo.folderName (e.g., "parakeet-realtime-eou-120m-coreml/320ms"), causing downloads to go to a different path than where the code looks for models.

Fixed by using repo.folderName for modelDir instead of chunkSize.modelSubdirectory, so both download and load paths now match:

Download: modelsRoot/parakeet-realtime-eou-120m-coreml/320ms/
Load: modelsRoot/parakeet-realtime-eou-120m-coreml/320ms/
Updated AGENTS.md and CLAUDE.md to explain that OrderedImports rule is disabled due to Swift 6.1 (GitHub Actions CI) vs 6.3 (local) formatter incompatibility. Swift 6.3 is unavailable in GitHub Actions runners, causing the formatters to cycle between adding/removing blank lines between imports. Added inline comment in SortformerStreamingIntegrationTests.swift explaining the non-alphabetical import order.
HuggingFace sometimes returns 200 OK status with HTML error pages during rate limiting or timeouts, bypassing the 429/503 status code checks. Added validateJSONResponse() to detect HTML responses and throw a descriptive error instead of silently failing. Changed a silent 'return' to throw an invalidResponse error when JSON parsing fails.

This fixes PocketTTS/Kokoro download failures where the code would:
1. Receive an HTML error page with 200 OK status
2. Fail to parse it as JSON
3. Silently return (no files listed)
4. Create an empty model directory
5. Fail later when trying to load non-existent models
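The validation described above can be sketched as follows. This is a hypothetical helper (the real validateJSONResponse is Swift): check for an HTML body before attempting JSON parsing, and throw on parse failure instead of returning silently:

```python
import json

def validate_json_response(body: bytes) -> object:
    """Raise a descriptive error if a 200 OK response is actually an HTML error page."""
    text = body.decode("utf-8", errors="replace").lstrip()
    # Rate-limited responses can be HTML despite a 200 status code.
    if text.startswith("<!DOCTYPE") or text.startswith("<html"):
        raise ValueError("received HTML instead of JSON (likely rate limiting)")
    try:
        return json.loads(text)
    except json.JSONDecodeError as err:
        # Throw instead of silently returning, so callers fail fast.
        raise ValueError(f"invalid JSON response: {err}") from err

print(validate_json_response(b'[{"path": "model.mlmodelc"}]'))
```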
Summary
Fixes #441 - StreamingEouAsrManager with 320ms chunks was producing incorrect frame counts, causing shape mismatches.
- `AudioMelSpectrogram.computeFlat()`: use the correct frame count formula
- `AudioMelSpectrogram.computeFlatTransposed()`: same fix in `.center` padding mode
- Changed `numFrames = audioCount / hopLength` to `numFrames = 1 + (paddedCount - winLength) / hopLength`

Root Cause
The original formula didn't account for the center padding (nFFT/2 on each side) that's applied to audio before windowing. This caused the frame count to be off by 1, producing 63 frames instead of 64 for 630ms audio chunks.
Test Results
Frame Count Validation Tests
Added `EouChunkSizeFrameCountTests` - all passing.

Integration Tests (10 files per chunk size)
30 transcriptions total - 100% success rate:
✅ No shape mismatch errors detected across all 30 transcriptions
The 320ms chunk size (the problematic one from issue #441) now works perfectly and actually achieves the lowest WER!
Test Plan
- `AudioMelSpectrogramTests` pass
- `EouChunkSizeFrameCountTests` - all passing

All tests pass successfully.
🤖 Generated with Claude Code