
Fix EOU frame count calculation for center-padded mel spectrograms#444

Merged
Alex-Wengg merged 33 commits into main from fix-eou-320ms-frame-count on Mar 27, 2026

Conversation

Alex-Wengg (Member) commented Mar 27, 2026

Summary

Fixes #441 - StreamingEouAsrManager with 320ms chunks was producing incorrect frame counts, causing shape mismatches.

  • Updated AudioMelSpectrogram.computeFlat() to use the correct frame count formula
  • Updated AudioMelSpectrogram.computeFlatTransposed() with the .center padding mode
  • Changed from `numFrames = audioCount / hopLength` to `numFrames = 1 + (paddedCount - winLength) / hopLength`
  • This accounts for the nFFT/2 center padding applied before STFT processing, matching NeMo's computation

Root Cause

The original formula didn't account for the center padding (nFFT/2 on each side) that's applied to audio before windowing. This caused the frame count to be off by 1, producing 63 frames instead of 64 for 630ms audio chunks.
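The off-by-one is easy to reproduce numerically. A minimal Python sketch (function names are illustrative, not the actual Swift API; hop=160, win=400, nFFT=512 are the NeMo-style 16 kHz mel defaults cited later in this thread):

```python
# Sketch of the old vs. new frame-count formulas (illustrative names,
# not the actual Swift API). Defaults follow NeMo's 16 kHz mel config:
# hop 160, window 400, nFFT 512.
HOP, WIN, NFFT = 160, 400, 512

def frames_old(audio_count: int) -> int:
    # Old formula: ignores the nFFT/2 padding added to each side.
    return audio_count // HOP

def frames_new(audio_count: int) -> int:
    # New formula: count frames over the center-padded signal.
    padded = audio_count + 2 * (NFFT // 2)
    return 1 + (padded - WIN) // HOP

samples_630ms = int(0.630 * 16_000)   # 10080 samples
print(frames_old(samples_630ms))      # 63 -> the shape mismatch
print(frames_new(samples_630ms))      # 64 -> matches NeMo
```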

Test Results

Frame Count Validation Tests

Added EouChunkSizeFrameCountTests - all passing:

Integration Tests (10 files per chunk size)

30 transcriptions total - 100% success rate:

| Chunk Size | Files | Success | Avg WER | Overall WER |
|---|---|---|---|---|
| 160ms | 10/10 | 100% | 8.40% | 9.64% |
| 320ms | 10/10 | 100% | 4.92% | 5.72% |
| 1280ms | 10/10 | 100% | 7.19% | 7.83% |

✅ No shape mismatch errors detected across all 30 transcriptions

The 320ms chunk size (the problematic one from issue #441) now works perfectly and actually achieves the lowest WER!

Test Plan

  • All AudioMelSpectrogramTests pass
  • Added EouChunkSizeFrameCountTests - all passing
  • Integration test: 10 files × 3 chunk sizes = 30 successful transcriptions
  • WER calculation confirms transcription quality maintained (5-10% WER)
  • Verified no shape mismatch errors

All tests pass successfully.

🤖 Generated with Claude Code

Fixes #441

The AudioMelSpectrogram.computeFlat() methods were using an incorrect
formula for calculating frame count that didn't account for the nFFT/2
center padding applied before STFT processing.

Old formula: numFrames = audioCount / hopLength
New formula: numFrames = 1 + (paddedCount - winLength) / hopLength

This caused shape mismatches when using StreamingEouAsrManager with
320ms chunks, producing 63 frames instead of the expected 64 frames.

The fix applies to both computeFlat() and computeFlatTransposed()
methods in their .center padding mode, matching NeMo's computation.
claude bot (Contributor) commented Mar 27, 2026

Claude finished @Alex-Wengg's task.


PR Review: Fix EOU frame count calculation for center-padded mel spectrograms

I've completed the code review and submitted detailed feedback through GitHub's review system.

Todo List:

  • Create pending PR review
  • Get PR diff to understand the changes
  • Read the modified file for context
  • Analyze code changes and add inline comments
  • Submit comprehensive review

Review Summary

This PR correctly fixes the frame count calculation for center-padded mel spectrograms. The key findings:

Mathematical correctness: The new formula 1 + (paddedCount - winLength) / hopLength properly accounts for nFFT/2 center padding, fixing the off-by-one error

Consistency: Both computeFlat() and computeFlatTransposed() now use the same logic

Alignment with NeMo: The implementation now matches NeMo's mel spectrogram computation

The changes are minimal, focused, and address the root cause of the shape mismatch issue in StreamingEouAsrManager. I've provided detailed technical analysis and minor suggestions for improvement in the GitHub review.


claude bot (Contributor) left a comment


Code Review Summary

This PR correctly fixes the frame count calculation for center-padded mel spectrograms. The changes are mathematically sound and address the root cause described in issue #441.

Key Changes Analysis

Formula Fix (Lines 193-197 & 339-343):

  • Correct: Changed from audioCount / hopLength to 1 + (paddedCount - winLength) / hopLength
  • Reasoning: The new formula properly accounts for nFFT/2 center padding applied before STFT processing
  • Consistency: Both computeFlat() and computeFlatTransposed() now use the same logic

Code Organization (Lines 195-196):

  • Improvement: Moved padLength and paddedCount calculations earlier in computeFlat() for better readability
  • Consistency: Variables are now defined before use, making the logic flow clearer

Technical Correctness

The mathematical derivation is sound:

  • With center padding, audio is padded by nFFT/2 on each side
  • Total padded length = audioCount + 2 * (nFFT/2)
  • Number of frames = 1 + (paddedLength - windowLength) / hopLength
  • This matches NeMo's mel spectrogram computation as stated in the comments

Test Coverage

The PR description indicates comprehensive testing:

  • All chunk sizes (160ms, 320ms, 630ms, 1280ms) now produce correct frame counts
  • The off-by-one error (63→64 frames for 630ms) has been resolved

Potential Concerns

Nit: The comment on line 193-194 could be more concise:

// Frame count for center-padded audio: 1 + (paddedCount - winLength) / hopLength

Minor: Consider adding a unit test specifically for the frame count formula to prevent future regressions.

Verdict

This is a solid fix that addresses the core issue without introducing side effects. The changes are minimal, focused, and mathematically correct. The PR resolves the shape mismatch issue in StreamingEouAsrManager and aligns the implementation with NeMo's approach.

Recommendation: Approve once manual testing with StreamingEouAsrManager is completed.

github-actions bot commented Mar 27, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
|---|---|---|
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 7.26x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 67.2s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
|---|---|---|
| Avg Chunk Time | 0.067s | Average chunk processing time |
| Max Chunk Time | 0.134s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 1m28s • 03/27/2026, 06:24 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

github-actions bot commented Mar 27, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Description |
|---|---|---|---|
| DER | 15.1% | <30% | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | Jaccard Error Rate |
| RTFx | 30.18x | >1.0x | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 6.603 | 19.0 | Fetching diarization models |
| Model Compile | 2.830 | 8.1 | CoreML compilation |
| Audio Load | 0.058 | 0.2 | Loading audio file |
| Segmentation | 10.429 | 30.0 | Detecting speech regions |
| Embedding | 17.382 | 50.0 | Extracting speaker voices |
| Clustering | 6.953 | 20.0 | Grouping same speakers |
| Total | 34.772 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|---|---|---|
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): runs at roughly 150x real-time (RTFx)
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 34.8s diarization time • Test runtime: 2m 4s • 03/27/2026, 06:31 PM EST

devin-ai-integration[bot]

This comment was marked as resolved.

github-actions bot commented Mar 27, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target |
|---|---|---|
| DER | 33.4% | <35% |
| Miss Rate | 24.4% | - |
| False Alarm | 0.2% | - |
| Speaker Error | 8.8% | - |
| RTFx | 13.9x | >1.0x |
| Speakers | 4/4 | - |

Sortformer High-Latency • ES2004a • Runtime: 2m 37s • 2026-03-27T22:34:43.604Z

Tests verify that AudioMelSpectrogram produces the correct number of
mel frames for each EOU chunk size:
- 160ms: 17 frames
- 320ms: 64 frames (was 63 before fix)
- 1280ms: 129 frames

All tests pass with 10 different audio lengths per chunk size.
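These expected counts fall directly out of the center-padded formula. A quick Python check (illustrative names, NeMo-style defaults; note that per the PR description the 320ms configuration processes a 630ms audio window, which is why it yields 64 frames):

```python
# Check the commit's expected frame counts against the center-padded
# formula (illustrative names; hop 160, window 400, nFFT 512).
HOP, WIN, NFFT = 160, 400, 512

def num_frames(audio_count: int) -> int:
    padded = audio_count + 2 * (NFFT // 2)
    return 1 + (padded - WIN) // HOP

# samples of 16 kHz audio -> expected mel frames
# (the 320ms config processes a 630ms window, hence 10080 samples)
expected = {2560: 17, 10080: 64, 20480: 129}
for samples, frames in expected.items():
    assert num_frames(samples) == frames, (samples, frames)
print("all chunk-size frame counts match")
```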
github-actions bot commented Mar 27, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Description |
|---|---|---|---|
| DER | 14.5% | <20% | Diarization Error Rate (lower is better) |
| RTFx | 5.23x | >1.0x | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 13.361 | 6.7 | Fetching diarization models |
| Model Compile | 5.726 | 2.9 | CoreML compilation |
| Audio Load | 0.076 | 0.0 | Loading audio file |
| Segmentation | 21.280 | 10.6 | VAD + speech detection |
| Embedding | 199.864 | 99.5 | Speaker embedding extraction |
| Clustering (VBx) | 0.780 | 0.4 | Hungarian algorithm + VBx clustering |
| Total | 200.792 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|---|---|---|---|
| FluidAudio (Offline) | 14.5% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 221.9s processing • Test runtime: 3m 59s • 03/27/2026, 06:37 PM EST

github-actions bot commented Mar 27, 2026

PocketTTS Smoke Test ✅

| Check | Result |
|---|---|
| Build | |
| Model download | |
| Model load | |
| Synthesis pipeline | |
| Output WAV | ✅ (176.3 KB) |

Runtime: 0m40s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality may differ from Apple Silicon.

github-actions bot commented Mar 27, 2026

Qwen3-ASR int8 Smoke Test ✅

| Check | Result |
|---|---|
| Build | |
| Model download | |
| Model load | |
| Transcription pipeline | |
| Decoder size | 571 MB (vs 1.1 GB f32) |

Runtime: 3m3s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.

github-actions bot commented Mar 27, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx |
|---|---|---|---|
| test-clean | 0.57% | 0.00% | 4.28x |
| test-other | 1.35% | 0.00% | 2.94x |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx |
|---|---|---|---|
| test-clean | 0.80% | 0.00% | 4.74x |
| test-other | 1.00% | 0.00% | 3.23x |

Streaming (v3)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.54x | Streaming real-time factor |
| Avg Chunk Time | 1.621s | Average time to process each chunk |
| Max Chunk Time | 1.977s | Maximum chunk processing time |
| First Token | 2.035s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.57x | Streaming real-time factor |
| Avg Chunk Time | 1.589s | Average time to process each chunk |
| Max Chunk Time | 1.973s | Maximum chunk processing time |
| First Token | 1.601s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 6m43s • 03/27/2026, 06:30 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

Minor formatting changes:
- Fix line breaks in XCTAssert calls
- Reorder imports
- Add trailing commas
github-actions bot commented Mar 27, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---|---|---|---|---|---|---|
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 794.2x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 802.5x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

Swift 6.1.3 (CI) and Swift 6.3 (local) have different import ordering
behavior for @preconcurrency. Revert to avoid CI failures.
Alex-Wengg added a commit that referenced this pull request Mar 27, 2026
… mel

The new frame count formula `1 + (paddedCount - winLength) / hopLength`
accounts for center padding, but SortformerDiarizerPipeline was still using
the old formula `melLength * stride` to calculate samples consumed.

This caused samplesConsumed to exceed audioBuffer.count, triggering the
else branch that resets lastAudioSample and breaks preemphasis continuity.

Changes:
- Invert the frame count formula to calculate actual samples consumed:
  samplesConsumed = (melLength - 1) * hopLength + winLength - nFFT
- Update producedMelFramesAvailable() to use center-padded formula
- Make AudioMelSpectrogram.nFFT public so callers can access it

Fixes Devin Review issue in PR #444.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
devin-ai-integration bot (Contributor) left a comment


Devin Review found 1 new potential issue.

View 10 additional findings in Devin Review.


devin-ai-integration bot (Contributor) commented Mar 27, 2026


🔴 Stale samplesNeeded formula causes finalization to silently drop diarization frames

The PR updated producedMelFramesAvailable() at SortformerDiarizerPipeline.swift:887-889 to use the new center-padded frame count formula (1 + (paddedCount - melWindow) / melStride), but the samplesNeeded guard in preprocessAudioToFeatureTargetLocked at SortformerDiarizerPipeline.swift:761-768 still uses the old formula (framesNeeded * config.melStride or (framesNeeded - 1) * config.melStride + config.melWindow). During finalization, exactFinalizationPaddingSamples (SortformerDiarizerPipeline.swift:807) consults the updated producedMelFramesAvailable() and may determine no padding is needed (returns 0). However, when the finalization loop then calls preprocessAudioToFeatureTargetLocked (SortformerDiarizerPipeline.swift:528), the stale samplesNeeded requires more audio than is available, so it returns without computing features. This causes makeStreamingChunkLocked() to return nil, the loop breaks with a warning, and the tail diarization frames are lost.

Concrete example showing the mismatch window

With defaults (melStride=160, melWindow=400, nFFT=512) and featureBuffer not empty, needing 10 additional frames:

  • producedMelFramesAvailable() says 10 frames need (10-1)*160 + 400 - 512 = 1328 samples (new formula)
  • samplesNeeded requires 10 * 160 = 1600 samples (old formula)
  • If audioBuffer.count is between 1328 and 1599 (a 272-sample / 17ms window), finalization adds no padding but also cannot compute features.
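Plugging in the defaults makes the dead zone concrete (a Python sketch; variable names are illustrative):

```python
# Reproduce the gap between the two sample requirements
# (illustrative names; melStride=160, melWindow=400, nFFT=512).
STRIDE, WINDOW, NFFT = 160, 400, 512
frames_needed = 10

# Minimum samples per producedMelFramesAvailable() (center-padded formula)
min_new = (frames_needed - 1) * STRIDE + WINDOW - NFFT   # 1328

# Minimum samples per the stale samplesNeeded guard (old formula)
min_old = frames_needed * STRIDE                         # 1600

# Any buffer length in [min_new, min_old - 1] gets no finalization
# padding yet cannot compute features under the stale guard.
print(min_new, min_old, min_old - min_new)
```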


devin-ai-integration[bot]

This comment was marked as resolved.

Alex-Wengg added a commit that referenced this pull request Mar 27, 2026
Revert samplesConsumed calculation to hop-aligned consumption to maintain proper streaming continuity.

Issue: The previous fix computed minimum samples needed for frame production, but this created overlapping frames in streaming mode. With center padding, the padding affects frame *production* (how computeFlatTransposed generates frames), but samples should still be consumed in hop-aligned fashion for streaming continuity.

Fix: Use `melLength * hopLength` for samples consumed, regardless of center padding. This maintains proper frame boundaries across streaming chunks and prevents feature buffer corruption.

Addresses: #444 (review)
Tests: SortformerStreamingIntegrationTests pass (2/2)
Benchmarks showed:
- 80ms: 63.31% WER (unusable - gibberish transcripts)
- 160ms: Model files incomplete (missing preprocessor.mlmodelc)
- 560ms: 0.59% WER, RTFx 1.8x ✅
- 1120ms: 0.59% WER, RTFx 1.9x ✅

Only keep the two working chunk sizes with excellent accuracy.

Changes:
- NemotronChunkSize enum: removed ms80 and ms160 cases
- ModelNames.Repo: removed nemotronStreaming80 and nemotronStreaming160
- CLI commands: updated help text and validation
- Tests: updated to test only supported chunk sizes
…urements

Ran full LibriSpeech test-clean benchmarks (2,620 files) on M2 hardware to verify and update documented performance metrics.

Changes:
- 320ms: 4.88% WER (was 4.87%), 19.25x RTFx (was 12.48x), 16.9m (was 26m)
- 160ms: 8.23% WER (was 8.29%), 5.78x RTFx (was 4.78x), 56.4m (was 68m)
- Added Median WER column: 0.00% (320ms), 5.26% (160ms)

WER values confirmed accurate (within 0.06%). Improved RTFx and faster times reflect actual M2 performance vs originally documented values.

Also fix import order in SortformerStreamingIntegrationTests.swift for consistency.
Alex-Wengg force-pushed the fix-eou-320ms-frame-count branch from f79c2dd to 82c2c0f on March 27, 2026 18:02
devin-ai-integration[bot]

This comment was marked as resolved.

Save benchmark results to /tmp/nemotron_{chunk_size}ms_benchmark.json for easy retrieval.

This ensures results are captured even when running in RELEASE mode, which logs to the unified logging system instead of terminal output.

Output includes:
- chunkSize, filesProcessed, totalWords, totalErrors
- wer, audioDuration, processingTime, rtfx
Alex-Wengg (Member, Author) commented:
We independently traced this same bug from the Zoryn (GlibScribe) side — the 160ms EOU streaming path was hitting the shape mismatch on every audio chunk, causing a full fallback to Apple Speech for L1 streaming.

Root cause confirmed: PR #376 (LS-EEND) accidentally replaced computeFlat()'s correct center-padded formula (1 + (paddedCount - winLength) / hopLength → 17 frames) with the Sortformer-specific formula from computeFlatTransposed() (audioCount / hopLength → 16 frames) during the move to Shared/.

We've cherry-picked 05a59f3 onto our fork (VillaAI/audio-engine branch fix/eou-mel-frame-count) and are testing it in Zoryn now. Would be great to get this merged to main so downstream consumers pick it up.

This will be merged today.

Throw error if totalWords=0, totalAudioDuration=0, or totalProcessingTime=0
to catch benchmark failures early instead of reporting misleading 0% WER / 0x RTFx.
Use inverted center-padded formula: (melLength - 1) * hopLength + winLength - nFFT
This ensures samplesConsumed ≤ audioBuffer.count, preventing preemphasis state
from being reset on every chunk.

Previous formula (melLength * hopLength) always exceeded audioBuffer.count by ~272
samples, causing lastAudioSample=0 reset and corrupting preemphasis continuity.
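For a 64-frame (630ms) window the two consumption formulas differ by exactly the overshoot this commit describes, as a quick Python check shows (illustrative names):

```python
# Compare hop-aligned consumption with the inverted center-padded
# formula for a 64-frame (630ms) window (illustrative names).
HOP, WIN, NFFT = 160, 400, 512
mel_length = 64

consumed_hop_aligned = mel_length * HOP                  # 10240
consumed_inverted = (mel_length - 1) * HOP + WIN - NFFT  # 9968

# The hop-aligned value overshoots the available audio by 272 samples,
# which is what kept resetting lastAudioSample on every chunk.
print(consumed_hop_aligned - consumed_inverted)          # 272
```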
Swift 6.1 and 6.3 have different formatting rules for import statements.
Update CI to use 6.3 to prevent format conflicts.
Swift 6.3 is not yet available on GitHub Actions runners.
Keep Swift 6.1 formatter with blank line between imports.
devin-ai-integration[bot]

This comment was marked as resolved.

devin-ai-integration bot (Contributor) left a comment


Devin Review found 1 new potential issue.

View 18 additional findings in Devin Review.


devin-ai-integration bot (Contributor) commented:

🟡 requiredBufferedSamples in exactFinalizationPaddingSamples uses old formula, over-estimating padding

The requiredBufferedSamples calculation in exactFinalizationPaddingSamples (lines 853-878) was not updated to match the new center-padded frame count formula. For the empty buffer case it computes (frames-1)*melStride + melWindow and for the non-empty case max(melWindow, frames*melStride), but the actual minimum samples needed with center padding is (frames-1)*melStride + melWindow - nFFT. This over-estimates by up to nFFT (512 samples), causing more zero-padding than necessary during finalization, which produces extra silent frames at the end of diarization output.

(Refers to lines 855-877)
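The size of the over-estimate for the empty-buffer case can be checked directly (a Python sketch; illustrative names):

```python
# Padding over-estimate in the empty-buffer case
# (illustrative names; melStride=160, melWindow=400, nFFT=512).
STRIDE, WINDOW, NFFT = 160, 400, 512
frames = 10

required_old = (frames - 1) * STRIDE + WINDOW          # 1840 (stale formula)
required_new = (frames - 1) * STRIDE + WINDOW - NFFT   # 1328 (center-padded)

print(required_old - required_new)  # 512 == nFFT extra zero-padding
```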



- Remove unused audioDuration calculation in single-file transcription
- Remove unused avgRtf calculation in benchmark
devin-ai-integration[bot]

This comment was marked as resolved.

The bug was that modelDir used chunkSize.modelSubdirectory (e.g., "320ms")
but downloadRepo appends repo.folderName (e.g., "parakeet-realtime-eou-120m-coreml/320ms"),
causing downloads to go to a different path than where the code looks for models.

Fixed by using repo.folderName for modelDir instead of chunkSize.modelSubdirectory,
so both download and load paths now match correctly.

Download: modelsRoot/parakeet-realtime-eou-120m-coreml/320ms/
Load:     modelsRoot/parakeet-realtime-eou-120m-coreml/320ms/
Updated AGENTS.md and CLAUDE.md to explain that OrderedImports rule is disabled
due to Swift 6.1 (GitHub Actions CI) vs 6.3 (local) formatter incompatibility.
Swift 6.3 is unavailable in GitHub Actions runners, causing the formatters to
cycle between adding/removing blank lines between imports.

Added inline comment in SortformerStreamingIntegrationTests.swift explaining
the non-alphabetical import order.
HuggingFace sometimes returns 200 OK status with HTML error pages during
rate limiting or timeouts, bypassing the 429/503 status code checks.

Added validateJSONResponse() to detect HTML responses and throw a descriptive
error instead of silently failing. Changed silent 'return' to throw
invalidResponse error when JSON parsing fails.

This fixes PocketTTS/Kokoro download failures where the code would:
1. Receive HTML error page with 200 OK status
2. Fail to parse as JSON
3. Silently return (no files listed)
4. Create empty model directory
5. Fail later when trying to load non-existent models
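A sketch of the guard this commit describes, written in Python for illustration (the actual implementation is Swift; function and error names are hypothetical):

```python
import json

# Hypothetical sketch of the validateJSONResponse() guard: reject HTML
# error pages that arrive with a 200 OK status instead of JSON.
def validate_json_response(body: bytes) -> dict:
    text = body.decode("utf-8", errors="replace").lstrip()
    if text.startswith("<"):
        # HuggingFace can serve an HTML error page with 200 OK
        raise ValueError("invalidResponse: got HTML, likely rate limiting")
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Throw instead of silently returning an empty file list
        raise ValueError(f"invalidResponse: {exc}") from None

print(validate_json_response(b'{"files": []}'))  # {'files': []}
```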
@Alex-Wengg Alex-Wengg merged commit 06fc2ab into main Mar 27, 2026
12 checks passed
@Alex-Wengg Alex-Wengg deleted the fix-eou-320ms-frame-count branch March 27, 2026 22:41


Development

Successfully merging this pull request may close these issues.

Creating StreamingEouAsrManager with 320ms is broken
