Fix parakeet-ctc-ja download error: Prevent AsrModels from loading CTC-only models #516
…ames

AsrModels was incorrectly accepting .ctcJa and .ctcZhCn model versions, which use different decoder file names than TDT models:
- TDT models use Decoder.mlmodelc
- CTC Japanese models use CtcDecoder.mlmodelc
- CTC Chinese models use Decoder.mlmodelc (different structure)

This caused download to succeed but loading to fail with: "Model file not found: Decoder.mlmodelc"

Solution:
- Added validation in AsrModels.load() and download() to reject CTC-only models with clear error messages
- Error messages direct users to the correct manager classes: CtcJaManager and CtcZhCnManager
- Added tests to verify the validation works correctly

Fixes #514
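The validation described in this commit message can be pictured with a minimal sketch. The type names `SketchModelVersion` and `SketchUnsupportedVersionError`, the property names, and the wording of the error text are all assumptions for illustration; only the case names and the manager class names come from the commit message, and this is not the actual FluidAudio implementation.

```swift
import Foundation

// Hypothetical stand-in for the real model-version enum (assumed shape).
enum SketchModelVersion: String {
    case tdtJa, ctcJa, ctcZhCn

    /// CTC-only versions must not go through the TDT-oriented loader.
    var isCtcOnly: Bool {
        switch self {
        case .ctcJa, .ctcZhCn: return true
        case .tdtJa: return false
        }
    }

    /// Manager class the error message should point to (names from the commit message).
    var recommendedManager: String? {
        switch self {
        case .ctcJa: return "CtcJaManager"
        case .ctcZhCn: return "CtcZhCnManager"
        case .tdtJa: return nil
        }
    }
}

// Hypothetical error type; the message wording is an assumption.
struct SketchUnsupportedVersionError: Error, CustomStringConvertible {
    let version: SketchModelVersion
    let manager: String

    var description: String {
        "Model version .\(version.rawValue) is CTC-only and cannot be used here. Use \(manager) instead."
    }
}

/// Guard meant to run at the top of both the load and the download paths,
/// so a CTC-only version is rejected before any file is fetched.
func validateTdtCompatible(_ version: SketchModelVersion) throws {
    if version.isCtcOnly, let manager = version.recommendedManager {
        throw SketchUnsupportedVersionError(version: version, manager: manager)
    }
}
```

Running the same guard in both entry points means the failure surfaces before the download starts, rather than as a missing-file error at load time.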
VAD Benchmark Results
Performance Comparison
Dataset Details
✅: Average F1-Score above 70%

Parakeet EOU Benchmark Results ✅
Status: Benchmark passed
Performance Metrics
Streaming Metrics
Test runtime: 1m4s • 04/11/2026, 10:59 PM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: model inference, audio preprocessing, state management, and file I/O

Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode)
Optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown
Time spent in each stage of batch diarization
Speaker Diarization Research Comparison
Offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 288.3s processing • Test runtime: 4m 48s • 04/11/2026, 11:02 PM EST

Kokoro TTS Smoke Test ✅
Runtime: 0m38s
Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE; performance may differ from Apple Silicon.

PocketTTS Smoke Test ✅
Runtime: 0m44s
Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU; audio quality and performance may differ from Apple Silicon.

Qwen3-ASR int8 Smoke Test ✅
Performance Metrics
Runtime: 4m4s
Note: CI VM lacks physical GPU; CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.

Speaker Diarization Benchmark Results
Speaker Diarization Performance
Evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown
Time spent in each stage of speaker diarization
Speaker Diarization Research Comparison
Research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 52.4s diarization time • Test runtime: 2m 57s • 04/11/2026, 11:06 PM EST

Sortformer High-Latency Benchmark Results
ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 2m 56s • 2026-04-12T03:09:35.212Z

ASR Benchmark Results ✅
Status: All benchmarks passed
Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming
25 files per dataset • Test runtime: 6m23s • 04/11/2026, 11:13 PM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Expected RTFx Performance on Physical M1 Hardware:
• M1 Mac: ~28x (clean), ~25x (other)
Testing methodology follows HuggingFace Open ASR Leaderboard
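
As a worked illustration of the RTFx definition quoted above, here is a small sketch. The function name `rtfx` and its parameter labels are hypothetical. The input figures are the ones reported in the Offline VBx comment in this thread (1049.0s of meeting audio, 288.3s of processing).

```swift
/// RTFx as defined in the comment: total audio duration divided by total processing time.
/// Returns nil for a non-positive processing time instead of dividing by zero.
func rtfx(audioSeconds: Double, processingSeconds: Double) -> Double? {
    guard processingSeconds > 0 else { return nil }
    return audioSeconds / processingSeconds
}

// Figures from the Offline VBx comment: 1049.0s of audio, 288.3s of processing.
if let value = rtfx(audioSeconds: 1049.0, processingSeconds: 288.3) {
    print("RTFx:", value)  // roughly 3.6x real time
}
```

A value above 1 means the pipeline processes audio faster than it plays back, which is why higher is better.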
Thanks @Josscii for catching this! You're right that there's an inconsistency: The Issue:
The Problem:
But when
Question: This inconsistency could cause incorrect model downloads if someone tries to use |
Additional Fix: Corrected AsrModelVersion.tdtJa Repo Mapping
Thanks @Josscii for catching the repo inconsistency! I've added an additional commit to fix it.
The Issue
Evidence They're Separate Repos
Changes in Second Commit
All 33 tests now pass. ✅

Update: Investigating Repo Structure
@Josscii raised a good point about the repo mapping. I've discovered:
So the current code that has However, I need to verify if the repo enum values match the actual HuggingFace repo names. |

✅ Issue Resolved - Correct Fix in #519
After investigation with @Josscii, I found the real repo structure:
HuggingFace Repositories
The Correct Fixes
This PR (#516): ✅ Prevents
PR #519: ✅ Fixes
PR #518: ❌ Was incorrect (tried to change
Summary
✅ Final Resolution - All Issues Fixed
After investigation with @Josscii and verifying the actual HuggingFace repository structure, here's the complete fix:
The Truth
Pull Requests
Summary
Issue #514 is now completely resolved with better naming that reflects the actual repository contents.

…520)

## Problem
The enum name `Repo.parakeetCtcJa` is misleading because it implies the repository only contains CTC models, but it actually contains **both CTC and TDT models**.

## Verified Repository Contents
**`FluidInference/parakeet-ctc-0.6b-ja-coreml`** contains:
- ✅ CTC models: `CtcDecoder.mlmodelc`
- ✅ TDT v2 models: `Decoderv2.mlmodelc` + `Jointerv2.mlmodelc`
- Shared: `Preprocessor.mlmodelc`, `Encoder.mlmodelc`, `vocab.json`

## Solution
Renamed `Repo.parakeetCtcJa` → `Repo.parakeetJa` to accurately reflect that it's the Japanese models repository containing both decoder variants.

## Changes
- **ModelNames.swift**: Renamed enum case from `.parakeetCtcJa` to `.parakeetJa`
- **AsrModels.swift**: Updated `.ctcJa` and `.tdtJa` to use `.parakeetJa`
- **CtcJaModels.swift**: Updated repository reference
- **TdtJaModels.swift**: Updated repository reference and added comment

## Testing
- ✅ Build succeeds
- ✅ Both CTC and TDT Japanese managers now use the correct repository name

## Related
- Follow-up to #516 and #519
- Addresses naming clarity issue raised by @Josscii
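
To make the rename concrete, the following is a rough sketch of one repository case serving both decoder variants. `SketchRepo` and `SketchJaVariant` are hypothetical types, not the declarations in ModelNames.swift. The repository name and file names are taken from the commit message above, but the per-variant grouping of required files is an assumption.

```swift
// Hypothetical stand-in for the Repo enum after the rename.
enum SketchRepo: String {
    case parakeetJa = "FluidInference/parakeet-ctc-0.6b-ja-coreml"
}

// Hypothetical description of the two Japanese decoder variants.
enum SketchJaVariant {
    case ctc, tdt

    /// Both variants resolve to the same repository.
    var repo: SketchRepo { .parakeetJa }

    /// Files each variant would need: shared components plus its own decoder files.
    var requiredFiles: [String] {
        let shared = ["Preprocessor.mlmodelc", "Encoder.mlmodelc", "vocab.json"]
        switch self {
        case .ctc:
            return shared + ["CtcDecoder.mlmodelc"]
        case .tdt:
            return shared + ["Decoderv2.mlmodelc", "Jointerv2.mlmodelc"]
        }
    }
}
```

Keeping the repository neutral and the file list variant-specific is what the new name `parakeetJa` is meant to communicate.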

Problem
Issue #514 reported that downloading `parakeet-ctc-ja` models would succeed, but then fail during loading with: "Model file not found: Decoder.mlmodelc"

Root Cause
`AsrModels` (designed for TDT models) was incorrectly accepting `.ctcJa` and `.ctcZhCn` model versions, which use different decoder file names:
- TDT models: `Decoder.mlmodelc`
- CTC Japanese models: `CtcDecoder.mlmodelc`
- CTC Chinese models: `Decoder.mlmodelc` (but with different structure)

When users tried to load `.ctcJa` models via `AsrModels`:
- The download succeeded (`CtcDecoder.mlmodelc`)
- Loading failed (`Decoder.mlmodelc`)

Solution
Added validation in `AsrModels.load()` and `AsrModels.download()` to reject CTC-only model versions with clear error messages that direct users to the correct manager classes:
- `.ctcJa` → Use `CtcJaManager`
- `.ctcZhCn` → Use `CtcZhCnManager`

Changes
Modified Files
`Sources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/AsrModels.swift`
- `load()` method
- `download()` method
- `AsrModelsError` with guidance to correct manager

`Tests/FluidAudioTests/ASR/Parakeet/SlidingWindow/TDT/AsrModelsTests.swift`
- `.ctcJa` and `.ctcZhCn` are properly rejected

Testing
All 32 tests in `AsrModelsTests` pass, including the new validation tests:
- `testCtcJaModelRejectsAsrModelsLoad()`
- `testCtcJaModelRejectsAsrModelsDownload()`
- `testCtcZhCnModelRejectsAsrModelsLoad()`
- `testCtcZhCnModelRejectsAsrModelsDownload()`
- `testCtcOnlyModelsAreMarkedCorrectly()`

Example Error Message
Before (confusing):
After (clear guidance):
Closes #514