
Fix parakeet-ctc-ja download error: Prevent AsrModels from loading CTC-only models#516

Merged
Alex-Wengg merged 1 commit into main from fix/issue-514-ctc-ja-model-loading
Apr 12, 2026

Conversation


@Alex-Wengg commented Apr 12, 2026

Problem

Issue #514 reported that downloading parakeet-ctc-ja models would succeed, but then fail during loading with:

[WARN] First load failed: Model file not found: Decoder.mlmodelc

Root Cause

AsrModels (designed for TDT models) was incorrectly accepting .ctcJa and .ctcZhCn model versions, which use different decoder file names:

  • TDT models use Decoder.mlmodelc
  • Japanese CTC models use CtcDecoder.mlmodelc
  • Chinese CTC models use Decoder.mlmodelc (but with different structure)

When users tried to load .ctcJa models via AsrModels:

  1. Download succeeded (correct files downloaded: CtcDecoder.mlmodelc)
  2. Loading failed (looking for wrong file: Decoder.mlmodelc)

Solution

Added validation in AsrModels.load() and AsrModels.download() to reject CTC-only model versions with clear error messages that direct users to the correct manager classes:

  • For .ctcJa → Use CtcJaManager
  • For .ctcZhCn → Use CtcZhCnManager
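The guard described above could look roughly like the following standalone sketch. The stand-in enum cases, the `validate` helper, and the `unsupportedModelVersion` error case are illustrative names based on this description, not copied from the actual diff:

```swift
import Foundation

// Minimal stand-ins for the library types so the guard logic can be shown
// in isolation; the real check lives inside AsrModels.load()/download().
enum AsrModelVersion { case v2, v3, ctcJa, ctcZhCn, tdtJa }

enum AsrModelsError: Error, CustomStringConvertible {
    case unsupportedModelVersion(String)
    var description: String {
        switch self {
        case .unsupportedModelVersion(let message): return message
        }
    }
}

/// Reject CTC-only versions before any model file lookup happens,
/// pointing the caller at the dedicated manager class instead.
func validate(_ version: AsrModelVersion) throws {
    let ctcOnlyManagers: [AsrModelVersion: String] = [
        .ctcJa: "CtcJaManager",
        .ctcZhCn: "CtcZhCnManager",
    ]
    if let manager = ctcOnlyManagers[version] {
        throw AsrModelsError.unsupportedModelVersion(
            "CTC-only model .\(version) must be loaded via \(manager), not AsrModels")
    }
}
```

Failing fast here, before any file I/O, is what turns the misleading "file not found" error into actionable guidance.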

Changes

Modified Files

  • Sources/FluidAudio/ASR/Parakeet/SlidingWindow/TDT/AsrModels.swift

    • Added validation at the start of load() method
    • Added validation at the start of download() method
    • Throws descriptive AsrModelsError with guidance to correct manager
  • Tests/FluidAudioTests/ASR/Parakeet/SlidingWindow/TDT/AsrModelsTests.swift

    • Added 5 new tests for CTC-only model validation
    • Tests verify both .ctcJa and .ctcZhCn are properly rejected
    • Tests verify error messages contain correct manager class names

Testing

All 32 tests in AsrModelsTests pass, including the new validation tests:

  • testCtcJaModelRejectsAsrModelsLoad()
  • testCtcJaModelRejectsAsrModelsDownload()
  • testCtcZhCnModelRejectsAsrModelsLoad()
  • testCtcZhCnModelRejectsAsrModelsDownload()
  • testCtcOnlyModelsAreMarkedCorrectly()
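One of these rejection tests could be sketched as below. The `load(version:)` signature and async shape are assumptions based on the PR description, not copied from AsrModelsTests.swift:

```swift
import XCTest

final class AsrModelsCtcRejectionTests: XCTestCase {
    func testCtcJaModelRejectsAsrModelsLoad() async {
        do {
            _ = try await AsrModels.load(version: .ctcJa)
            XCTFail("Expected CTC-only .ctcJa to be rejected by AsrModels.load()")
        } catch {
            // The error message should direct users to the correct manager class.
            XCTAssertTrue("\(error)".contains("CtcJaManager"))
        }
    }
}
```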

Example Error Message

Before (confusing):

Model file not found: Decoder.mlmodelc

After (clear guidance):

CTC-only model .ctcJa must be loaded via CtcJaManager, not AsrModels

Closes #514



…ames

AsrModels was incorrectly accepting .ctcJa and .ctcZhCn model versions,
which use different decoder file names than TDT models:
- TDT models use Decoder.mlmodelc
- CTC Japanese models use CtcDecoder.mlmodelc
- CTC Chinese models use Decoder.mlmodelc (different structure)

This caused download to succeed but loading to fail with:
"Model file not found: Decoder.mlmodelc"

Solution:
- Added validation in AsrModels.load() and download() to reject
  CTC-only models with clear error messages
- Error messages direct users to the correct manager classes:
  CtcJaManager and CtcZhCnManager
- Added tests to verify the validation works correctly

Fixes #514

@devin-ai-integration (bot) left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.


@github-actions

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
|---|---|---|---|---|---|---|
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 775.6x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 793.0x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

@github-actions

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
|---|---|---|
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 11.43x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 41.9s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
|---|---|---|
| Avg Chunk Time | 0.042s | Average chunk processing time |
| Max Chunk Time | 0.084s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 1m4s • 04/11/2026, 10:59 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

@Alex-Wengg merged commit c4aaa5d into main Apr 12, 2026
12 checks passed
@Alex-Wengg deleted the fix/issue-514-ctc-ja-model-loading branch April 12, 2026 03:01
@github-actions

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Description |
|---|---|---|---|
| DER | 14.5% | <20% | Diarization Error Rate (lower is better) |
| RTFx | 4.06x | >1.0x | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 13.474 | 5.2 | Fetching diarization models |
| Model Compile | 5.774 | 2.2 | CoreML compilation |
| Audio Load | 0.059 | 0.0 | Loading audio file |
| Segmentation | 30.356 | 11.8 | VAD + speech detection |
| Embedding | 257.087 | 99.6 | Speaker embedding extraction |
| Clustering (VBx) | 0.876 | 0.3 | Hungarian algorithm + VBx clustering |
| Total | 258.162 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
|---|---|---|---|
| FluidAudio (Offline) | 14.5% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 288.3s processing • Test runtime: 4m 48s • 04/11/2026, 11:02 PM EST

@github-actions

Kokoro TTS Smoke Test ✅

| Check | Result |
|---|---|
| Build | |
| Model download | |
| Model load | |
| Synthesis pipeline | |
| Output WAV | ✅ (634.8 KB) |

Runtime: 0m38s

Note: Kokoro TTS uses CoreML flow matching + Vocos vocoder. CI VM lacks physical ANE — performance may differ from Apple Silicon.

@github-actions

PocketTTS Smoke Test ✅

| Check | Result |
|---|---|
| Build | |
| Model download | |
| Model load | |
| Synthesis pipeline | |
| Output WAV | ✅ (202.5 KB) |

Runtime: 0m44s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon.

@github-actions

Qwen3-ASR int8 Smoke Test ✅

| Check | Result |
|---|---|
| Build | |
| Model download | |
| Model load | |
| Transcription pipeline | |
| Decoder size | 571 MB (vs 1.1 GB f32) |

Performance Metrics

| Metric | CI Value | Expected on Apple Silicon |
|---|---|---|
| Median RTFx | 0.06x | ~2.5x |
| Overall RTFx | 0.06x | ~2.5x |

Runtime: 4m4s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.

@github-actions

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Description |
|---|---|---|---|
| DER | 15.1% | <30% | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | Jaccard Error Rate |
| RTFx | 20.00x | >1.0x | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
|---|---|---|---|
| Model Download | 10.088 | 19.2 | Fetching diarization models |
| Model Compile | 4.323 | 8.2 | CoreML compilation |
| Audio Load | 0.074 | 0.1 | Loading audio file |
| Segmentation | 15.729 | 30.0 | Detecting speech regions |
| Embedding | 26.214 | 50.0 | Extracting speaker voices |
| Clustering | 10.486 | 20.0 | Grouping same speakers |
| Total | 52.468 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
|---|---|---|
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150 RTFx real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 52.4s diarization time • Test runtime: 2m 57s • 04/11/2026, 11:06 PM EST

@github-actions

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target |
|---|---|---|
| DER | 33.4% | <35% |
| Miss Rate | 24.4% | - |
| False Alarm | 0.2% | - |
| Speaker Error | 8.8% | - |
| RTFx | 11.3x | >1.0x |
| Speakers | 4/4 | - |

Sortformer High-Latency • ES2004a • Runtime: 2m 56s • 2026-04-12T03:09:35.212Z

@github-actions

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx |
|---|---|---|---|
| test-clean | 0.57% | 0.00% | 5.19x |
| test-other | 1.19% | 0.00% | 3.35x |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx |
|---|---|---|---|
| test-clean | 0.80% | 0.00% | 5.24x |
| test-other | 1.62% | 0.00% | 3.32x |

Streaming (v3)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.63x | Streaming real-time factor |
| Avg Chunk Time | 1.481s | Average time to process each chunk |
| Max Chunk Time | 1.562s | Maximum chunk processing time |
| First Token | 1.746s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
|---|---|---|
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.61x | Streaming real-time factor |
| Avg Chunk Time | 1.482s | Average time to process each chunk |
| Max Chunk Time | 1.616s | Maximum chunk processing time |
| First Token | 1.461s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 6m23s • 04/11/2026, 11:13 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

@Alex-Wengg

Thanks @Josscii for catching this! You're right that there's an inconsistency:

The Issue:

  • AsrModelVersion.tdtJa uses Repo.parakeetCtcJa (comment says "TDT v2 models uploaded to CTC repo")
  • But TdtJaModels (the proper manager) uses Repo.parakeetTdtJa
  • And ModelNames.getRequiredModelNames() has separate cases for both repos with different model files

The Problem:
When repo = .parakeetCtcJa, it returns ModelNames.CTCJa.requiredModels which includes:

  • CtcDecoder.mlmodelc

But when repo = .parakeetTdtJa, it returns ModelNames.TDTJa.requiredModels which includes:

  • Decoderv2.mlmodelc
  • Jointerv2.mlmodelc
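The split described above could be sketched as a per-repo lookup. The file lists mirror this comment, but the enum shape and function name are illustrative, not the actual `ModelNames.getRequiredModelNames()` implementation:

```swift
// Illustrative per-repo required-model lookup for the two Japanese repos.
enum Repo { case parakeetCtcJa, parakeetTdtJa }

func requiredModelNames(for repo: Repo) -> [String] {
    switch repo {
    case .parakeetCtcJa:
        // CTC Japanese layout
        return ["Preprocessor.mlmodelc", "Encoder.mlmodelc", "CtcDecoder.mlmodelc"]
    case .parakeetTdtJa:
        // TDT v2 Japanese layout
        return ["Preprocessor.mlmodelc", "Encoder.mlmodelc",
                "Decoderv2.mlmodelc", "Jointerv2.mlmodelc"]
    }
}
```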

Question:
Are the TDT Japanese models actually in the CTC repo (FluidInference/parakeet-ctc-0.6b-ja-coreml), or should AsrModelVersion.tdtJa be using Repo.parakeetTdtJa (FluidInference/parakeet-tdt-0.6b-ja-coreml) instead?

This inconsistency could cause incorrect model downloads if someone tries to use AsrModels with .tdtJa version (though my PR now blocks that).

@Alex-Wengg

Additional Fix: Corrected AsrModelVersion.tdtJa Repo Mapping

Thanks @Josscii for catching the repo inconsistency! I've added an additional commit to fix it.

The Issue

AsrModelVersion.tdtJa was incorrectly mapped to Repo.parakeetCtcJa instead of Repo.parakeetTdtJa.

Evidence They're Separate Repos

  1. GitHub Actions workflow caches both repos separately (lines 42-43 in .github/workflows/japanese-asr-benchmark.yml)
  2. TdtJaModels class uses Repo.parakeetTdtJa
  3. ModelNames.getRequiredModelNames() returns different model files for each:
  • .parakeetCtcJa → ModelNames.CTCJa.requiredModels (includes CtcDecoder.mlmodelc)
  • .parakeetTdtJa → ModelNames.TDTJa.requiredModels (includes Decoderv2.mlmodelc and Jointerv2.mlmodelc)

Changes in Second Commit

  • Fixed .tdtJa to return .parakeetTdtJa instead of .parakeetCtcJa
  • Added testJapaneseModelRepoMapping() test to verify the repos are correctly separated

All 33 tests now pass. ✅

@Alex-Wengg

Update: Investigating Repo Structure

@Josscii raised a good point about the repo mapping. I've discovered:

  1. parakeet-0.6b-ja-coreml - CTC-only (Preprocessor, Encoder, CtcDecoder)
  2. parakeet-ctc-0.6b-ja-coreml - Contains BOTH CTC + TDT v2 models (CtcDecoder + Decoderv2 + Jointerv2)
  3. parakeet-tdt-0.6b-ja-coreml - Does NOT exist (returns 404)

So the current code that has AsrModelVersion.tdtJa pointing to .parakeetCtcJa appears to be correct, since that's where the TDT v2 models are stored.

However, I need to verify if the repo enum values match the actual HuggingFace repo names.

@Alex-Wengg

✅ Issue Resolved - Correct Fix in #519

After investigation with @Josscii, I found the real repo structure:

HuggingFace Repositories

  1. parakeet-ctc-0.6b-ja-coreml ✅ EXISTS - Contains BOTH:

    • CTC models: CtcDecoder.mlmodelc
    • TDT v2 models: Decoderv2.mlmodelc + Jointerv2.mlmodelc
  2. parakeet-tdt-0.6b-ja-coreml ❌ DOESN'T EXIST (404)

The Correct Fixes

This PR (#516): ✅ Prevents AsrModels from loading CTC-only models - CORRECT

PR #519: ✅ Fixes TdtJaModels to use Repo.parakeetCtcJa instead of non-existent Repo.parakeetTdtJa - NEW FIX

PR #518: ❌ Was incorrect (tried to change AsrModels.tdtJa which was already correct) - CLOSED

Summary

@Alex-Wengg

✅ Final Resolution - All Issues Fixed

After investigation with @Josscii and verifying the actual HuggingFace repository structure, here's the complete fix:

The Truth

FluidInference/parakeet-ctc-0.6b-ja-coreml contains BOTH:

  • CTC models: CtcDecoder.mlmodelc
  • TDT v2 models: Decoderv2.mlmodelc + Jointerv2.mlmodelc

Pull Requests

  1. #516 (this PR, merged ✅): Fix parakeet-ctc-ja download error: Prevent AsrModels from loading CTC-only models
  2. #520 (open 🔄, recommended): Refactor: Rename Repo.parakeetCtcJa to Repo.parakeetJa for accuracy
  3. #518 (closed ❌, incorrect): Fix AsrModelVersion.tdtJa repo mapping to use separate TDT repo
  4. #519 (closed ❌, superseded by #520): Fix TdtJaModels to use correct HuggingFace repo (parakeet-ctc-0.6b-ja-coreml)

Summary

Issue #514 is now completely resolved with better naming that reflects the actual repository contents.

Alex-Wengg added a commit that referenced this pull request Apr 12, 2026
…520)

## Problem

The enum name `Repo.parakeetCtcJa` is misleading because it implies the
repository only contains CTC models, but it actually contains **both CTC
and TDT models**.

## Verified Repository Contents

**`FluidInference/parakeet-ctc-0.6b-ja-coreml`** contains:
- ✅ CTC models: `CtcDecoder.mlmodelc`
- ✅ TDT v2 models: `Decoderv2.mlmodelc` + `Jointerv2.mlmodelc`
- Shared: `Preprocessor.mlmodelc`, `Encoder.mlmodelc`, `vocab.json`

## Solution

Renamed `Repo.parakeetCtcJa` → `Repo.parakeetJa` to accurately reflect
that it's the Japanese models repository containing both decoder
variants.

## Changes

- **ModelNames.swift**: Renamed enum case from `.parakeetCtcJa` to
`.parakeetJa`
- **AsrModels.swift**: Updated `.ctcJa` and `.tdtJa` to use
`.parakeetJa`
- **CtcJaModels.swift**: Updated repository reference
- **TdtJaModels.swift**: Updated repository reference and added comment
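
A minimal sketch of the rename, assuming the repo enum carries the HuggingFace path (the `remoteId` property name is illustrative, not taken from ModelNames.swift):

```swift
// After the rename: one Japanese repo case covering both decoder variants.
enum Repo {
    case parakeetJa  // formerly parakeetCtcJa; hosts CTC and TDT v2 decoders

    var remoteId: String {
        switch self {
        case .parakeetJa:
            return "FluidInference/parakeet-ctc-0.6b-ja-coreml"
        }
    }
}
```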

## Testing

- ✅ Build succeeds
- ✅ Both CTC and TDT Japanese managers now use the correct repository
name

## Related

- Follow-up to #516 and #519
- Addresses naming clarity issue raised by @Josscii
