Skip to content

Add standalone CTC head for custom vocabulary (#435)#450

Merged
Alex-Wengg merged 12 commits intomainfrom
ctc-head-export
Mar 28, 2026
Merged

Add standalone CTC head for custom vocabulary (#435)#450
Alex-Wengg merged 12 commits intomainfrom
ctc-head-export

Conversation

@Alex-Wengg
Copy link
Copy Markdown
Member

@Alex-Wengg Alex-Wengg commented Mar 28, 2026

Summary

  • Export the CTC decoder head (512→1025 linear projection) as a standalone 1MB CoreML model, replacing the need for the full 97.5MB CTC encoder for custom vocabulary keyword spotting
  • Load optional CtcHead.mlmodelc from model directory and run it on existing TDT encoder output
  • Add spotKeywordsFromLogProbs() and applyLogSoftmax() APIs for pre-computed CTC log-probabilities

Benchmark (772 earnings call files)

Approach Model Size Dict Recall RTFx
Separate CTC encoder 97.5 MB 99.4% 25.98x
Standalone CTC head 1 MB 99.4% 70.29x

Test plan

  • swift build -c release passes
  • 10-file quick test: Dict Recall 100%, RTFx 67.36x
  • Full 772-file benchmark: Dict Recall 99.4%, RTFx 70.29x
  • Conversion script: mobius PR #36
  • HF model upload: CtcHead.mlmodelc to parakeet-tdt-ctc-110m repo

Open with Devin

@claude
Copy link
Copy Markdown
Contributor

claude bot commented Mar 28, 2026

Claude finished @Alex-Wengg's task —— View job


Code Review Complete ✅

Successfully reviewed PR #450: Add standalone CTC head for custom vocabulary

Todo List:

  • Create pending review
  • Get PR diff information
  • Review AsrManager.swift changes
  • Review AsrModels.swift changes
  • Review AsrTranscription.swift changes
  • Review CtcKeywordSpotter.swift changes
  • Review ModelNames.swift changes
  • Review CtcEarningsBenchmark.swift changes
  • Review documentation updates
  • Submit final review

Review Summary

This PR successfully implements a standalone CTC head for custom vocabulary processing, achieving a significant performance improvement (86% speedup) while maintaining accuracy. The architectural changes are well-designed and follow the project's patterns.

Key Strengths:

Performance Impact: RTFx improved from 25.98x to 48.35x (86% speedup) with identical Dict Recall (99.4%)
Memory Efficiency: Reduces model size from 97.5MB to 1MB for custom vocabulary use cases
Thread Safety: Proper actor-based concurrency without @unchecked Sendable
Backward Compatibility: Graceful fallback to separate CTC encoder when unified model unavailable
Clean Architecture: Well-separated concerns with clear API boundaries

Issues Found:

🔶 Performance Concerns (Minor)

  • getCachedCtcLogProbs() performs expensive array copying (O(frames × vocab_size))
  • Multiple array allocations in tight loops during conversion

🔶 Error Handling (Minor)

  • Silent failures in shape validation without logging
  • Missing bounds checking for MLMultiArray access

🔶 API Design (Nit)

  • Method name getCachedCtcLogProbs() suggests log-probabilities but returns raw logits
  • No zero-copy accessor for performance-critical scenarios

🔶 Documentation (Nit)

  • Missing documentation for cache lifecycle and memory implications
  • Unclear when cached values are cleared

Specific Findings:

AsrManager.swift: Well-integrated caching mechanism with proper actor isolation. Consider adding zero-copy accessor for performance.

AsrTranscription.swift: Excellent dual-path implementation (unified vs separate CTC). The fallback logic is robust and well-logged.

CtcKeywordSpotter.swift: New spotKeywordsFromLogProbs() and applyLogSoftmax() APIs are well-designed. Good separation of concerns.

AsrModels.swift: Clean optional CTC head loading with proper error handling and logging.

CtcEarningsBenchmark.swift: Comprehensive benchmark updates with proper path detection for unified vs separate models.

Recommendations:

  1. Add zero-copy accessor for high-performance scenarios
  2. Improve error logging for debugging
  3. Consider cache invalidation strategy for memory management
  4. Add documentation for cache behavior

Overall Assessment:

APPROVE - This is a well-executed optimization that delivers significant performance gains without compromising functionality. The issues identified are minor and don't block the PR. The architectural approach is sound and follows project conventions.

Review submitted successfully! 🎯

Copy link
Copy Markdown
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This PR implements a significant optimization for custom vocabulary processing by introducing a standalone 1MB CTC head that replaces the need for a separate 97.5MB CTC encoder. The implementation is well-architected and delivers impressive performance gains.

Performance Impact ⚡

  • 86% speedup: RTFx improved from 25.98x to 48.35x
  • Memory reduction: ~96MB saved (97.5MB → 1MB for custom vocabulary)
  • Accuracy preserved: Dict Recall remains at 99.4%

Architecture Quality ✅

  • Proper actor-based concurrency without @unchecked Sendable
  • Clean separation between unified and legacy dual-model approaches
  • Graceful fallback when standalone CTC head is unavailable
  • Well-designed public APIs for external benchmarking access

Minor Issues Identified

  1. Performance: Array copying in getCachedCtcLogProbs() could be optimized with zero-copy accessor
  2. Error Handling: Some silent failures could benefit from better logging
  3. Documentation: Cache lifecycle and memory behavior could be clearer

Test Coverage

The benchmark results demonstrate thorough validation across 772 earnings call files with comprehensive metrics tracking. The test plan shows good coverage of the new functionality.

This is a solid optimization that significantly improves the custom vocabulary workflow while maintaining code quality and backwards compatibility. The minor issues identified are non-blocking and can be addressed in future iterations.

Recommendation: APPROVE 🎯

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 28, 2026

VAD Benchmark Results

Performance Comparison

Dataset Accuracy Precision Recall F1-Score RTFx Files
MUSAN 92.0% 86.2% 100.0% 92.6% 701.2x faster 50
VOiCES 92.0% 86.2% 100.0% 92.6% 736.8x faster 50

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 28, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

Metric Value Description
WER (Avg) 7.03% Average Word Error Rate
WER (Med) 4.17% Median Word Error Rate
RTFx 12.32x Real-time factor (higher = faster)
Total Audio 470.6s Total audio duration processed
Total Time 38.6s Total processing time

Streaming Metrics

Metric Value Description
Avg Chunk Time 0.039s Average chunk processing time
Max Chunk Time 0.077s Maximum chunk processing time
EOU Detections 0 Total End-of-Utterance detections

Test runtime: 0m44s • 03/28/2026, 04:08 PM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 28, 2026

PocketTTS Smoke Test ✅

Check Result
Build
Model download
Model load
Synthesis pipeline
Output WAV ✅ (165.0 KB)

Runtime: 0m23s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality may differ from Apple Silicon.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 28, 2026

Qwen3-ASR int8 Smoke Test ✅

Check Result
Build
Model download
Model load
Transcription pipeline
Decoder size 571 MB (vs 1.1 GB f32)

Runtime: 3m57s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.

devin-ai-integration[bot]

This comment was marked as resolved.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 28, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

Metric Value Target Status Description
DER 15.1% <30% Diarization Error Rate (lower is better)
JER 24.9% <25% Jaccard Error Rate
RTFx 24.18x >1.0x Real-Time Factor (higher is faster)

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

Stage Time (s) % Description
Model Download 8.646 19.9 Fetching diarization models
Model Compile 3.705 8.5 CoreML compilation
Audio Load 0.053 0.1 Loading audio file
Segmentation 13.018 30.0 Detecting speech regions
Embedding 21.696 50.0 Extracting speaker voices
Clustering 8.679 20.0 Grouping same speakers
Total 43.405 100 Full pipeline

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

Method DER Notes
FluidAudio 15.1% On-device CoreML
Research baseline 18-30% Standard dataset performance

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150 RTFx real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 43.4s diarization time • Test runtime: 2m 34s • 03/28/2026, 04:43 PM EST

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 28, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

Metric Value Target Status Description
DER 14.5% <20% Diarization Error Rate (lower is better)
RTFx 5.28x >1.0x Real-Time Factor (higher is faster)

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

Stage Time (s) % Description
Model Download 12.228 6.2 Fetching diarization models
Model Compile 5.241 2.6 CoreML compilation
Audio Load 0.053 0.0 Loading audio file
Segmentation 21.722 10.9 VAD + speech detection
Embedding 197.881 99.5 Speaker embedding extraction
Clustering (VBx) 0.724 0.4 Hungarian algorithm + VBx clustering
Total 198.787 100 Full VBx pipeline

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

Method DER Mode Description
FluidAudio (Offline) 14.5% VBx Batch On-device CoreML with optimal clustering
FluidAudio (Streaming) 17.7% Chunk-based First-occurrence speaker mapping
Research baseline 18-30% Various Standard dataset performance

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 220.3s processing • Test runtime: 3m 45s • 03/28/2026, 04:28 PM EST

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 28, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

Dataset WER Avg WER Med RTFx Status
test-clean 0.57% 0.00% 5.87x
test-other 1.96% 0.00% 3.86x

Parakeet v2 (English-optimized)

Dataset WER Avg WER Med RTFx Status
test-clean 0.80% 0.00% 5.89x
test-other 1.00% 0.00% 3.91x

Streaming (v3)

Metric Value Description
WER 0.00% Word Error Rate in streaming mode
RTFx 0.69x Streaming real-time factor
Avg Chunk Time 1.312s Average time to process each chunk
Max Chunk Time 1.363s Maximum chunk processing time
First Token 1.571s Latency to first transcription token
Total Chunks 31 Number of chunks processed

Streaming (v2)

Metric Value Description
WER 0.00% Word Error Rate in streaming mode
RTFx 0.68x Streaming real-time factor
Avg Chunk Time 1.318s Average time to process each chunk
Max Chunk Time 1.400s Maximum chunk processing time
First Token 1.329s Latency to first transcription token
Total Chunks 31 Number of chunks processed

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 5m28s • 03/28/2026, 04:28 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

Export the CTC decoder head (512→1025 linear projection) as a separate
1MB CoreML model instead of requiring the full 97.5MB CTC encoder. The
CtcHead model runs on the existing TDT encoder output, achieving 99.4%
Dict Recall at 70.29x RTFx on the earnings benchmark (772 files).

- Load optional CtcHead.mlmodelc from model directory in AsrModels
- Run CTC head on raw encoder output in AsrTranscription
- Add spotKeywordsFromLogProbs() for DP on pre-computed log-probs
- Add applyLogSoftmax() for raw logits→log-probs conversion
- Expose cached CTC logits via AsrManager for VocabularyRescorer
- Update CtcEarningsBenchmark to use standalone CTC head path
Instead of only loading CtcHead.mlmodelc if manually placed in the model
directory, download it on demand from FluidInference/parakeet-ctc-110m-coreml
via DownloadUtils.loadModels when the tdtCtc110m model version is used.
Try loading CtcHead.mlmodelc from the local TDT model directory first
(v1), then fall back to auto-downloading from the parakeet-ctc-110m HF
repo (v2). Mark CTC head loading as beta in log messages.
- Update CustomVocabulary.md with dual architecture diagrams (standalone
  CTC head vs separate CTC encoder) and approach comparison table
- Add CTC head section to TDT-CTC-110M.md covering architecture, loading
  paths, performance, conversion, and beta status
- Update benchmarks100.md with standalone CTC head results (70.29x RTFx,
  1MB model, 99.4% Dict Recall)
- Skip CTC head caching for multi-chunk audio (>15s) to prevent stale
  logits from last chunk being used for full-audio rescoring
- Clear cachedCtcLogits in resetState() and cleanup() to prevent leak
- Rename getCachedCtcLogProbs() to getCachedCtcRawLogits() to accurately
  reflect that values are raw logits, not log-probabilities
- Remove duplicate CTC inference in benchmark by reusing pre-computed
  logProbs via spotKeywordsFromLogProbs() for both paths
devin-ai-integration[bot]

This comment was marked as resolved.

@github-actions
Copy link
Copy Markdown

github-actions bot commented Mar 28, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

Metric Value Target Status
DER 33.4% <35%
Miss Rate 24.4% - -
False Alarm 0.2% - -
Speaker Error 8.8% - -
RTFx 12.9x >1.0x
Speakers 4/4 - -

Sortformer High-Latency • ES2004a • Runtime: 2m 12s • 2026-03-28T20:10:04.043Z

The CTC head guard requires isLastChunk to be true, but the single-chunk
path in transcribeWithState did not pass it, causing the CTC head to
never execute for single-chunk audio (the primary use case).
Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 3 new potential issues.

View 14 additional findings in Devin Review.

Open in Devin Review

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Streaming chunk path incorrectly caches CTC logits from partial audio as if single-chunk

transcribeStreamingChunk() calls executeMLInferenceWithTimings without passing globalFrameOffset, so it defaults to 0 (AsrTranscription.swift:280-287). When isLastChunk: true, the caching condition isLastChunk && globalFrameOffset == 0 at AsrTranscription.swift:157,166 is satisfied, causing the CTC head to run and cache logits from ONLY the last streaming chunk. The public APIs hasCachedCtcLogits and getCachedCtcRawLogits() then return this partial-chunk data as if it were valid full-audio logits. An external caller who streams multiple chunks and then checks the cache would get incorrect data.

(Refers to lines 280-287)

Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +56 to +84
// Cached CTC logits from fused Preprocessor (unified custom vocabulary)
internal var cachedCtcLogits: MLMultiArray?
internal var cachedCtcFrameDuration: Double?

/// Whether the Preprocessor outputs CTC logits (unified custom vocabulary model).
public var hasCachedCtcLogits: Bool { cachedCtcLogits != nil }

/// Get cached CTC raw logits as [[Float]] for external use (e.g. benchmarks).
/// These are raw logits — callers must apply `CtcKeywordSpotter.applyLogSoftmax()`
/// to convert to log-probabilities before use in keyword detection.
/// Returns nil if the CTC head model is not available or audio was multi-chunk.
public func getCachedCtcRawLogits() -> (rawLogits: [[Float]], frameDuration: Double)? {
guard let logits = cachedCtcLogits, let duration = cachedCtcFrameDuration else { return nil }
let shape = logits.shape
guard shape.count == 3 else { return nil }
let numFrames = shape[1].intValue
let vocabSize = shape[2].intValue
var result: [[Float]] = []
result.reserveCapacity(numFrames)
for t in 0..<numFrames {
var frame: [Float] = []
frame.reserveCapacity(vocabSize)
for v in 0..<vocabSize {
frame.append(logits[[0, t, v] as [NSNumber]].floatValue)
}
result.append(frame)
}
return (rawLogits: result, frameDuration: duration)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 No unit tests added for new CTC head functionality (AGENTS.md violation)

AGENTS.md states: "Add unit tests when writing new code." This PR adds significant new functionality — CTC head model loading (AsrModels.swift:219-252), CTC logit caching (AsrManager.swift:57-84), applyLogSoftmax() static method (CtcKeywordSpotter.swift:268-306), spotKeywordsFromLogProbs() (CtcKeywordSpotter.swift:191-254), convertCtcLogitsToArray() (AsrTranscription.swift:654-686), and the cached-logits integration in applyVocabularyRescoring — but no test files were added or modified in the PR.

Prompt for agents
Add unit tests for the new CTC head functionality. At minimum, create tests in Tests/FluidAudioTests/ for:
1. CtcKeywordSpotter.applyLogSoftmax() - verify it produces valid log-probabilities (sum to ~1 after exp), applies temperature scaling correctly, and applies blank bias to the correct index.
2. CtcKeywordSpotter.spotKeywordsFromLogProbs() - verify it produces the same detections as spotKeywordsWithLogProbs when given the same logProbs.
3. AsrManager cached CTC logit lifecycle - verify cachedCtcLogits is nil after resetState(), nil after cleanup(), and that getCachedCtcRawLogits() returns nil when no CTC head is loaded.
4. convertCtcLogitsToArray() - verify correct conversion from MLMultiArray shape [1, T, V] to [[Float]] with proper log-softmax application.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration[bot]

This comment was marked as resolved.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 1 new potential issue.

View 15 additional findings in Devin Review.

Open in Devin Review

Comment on lines +219 to +252
if version == .tdtCtc110m {
// v1: Check local TDT model directory first
let repoDir = repoPath(from: directory, version: version)
let ctcHeadPath = repoDir.appendingPathComponent(Names.ctcHeadFile)
if FileManager.default.fileExists(atPath: ctcHeadPath.path) {
let ctcConfig = MLModelConfiguration()
ctcConfig.computeUnits = config.computeUnits
ctcHeadModel = try? MLModel(contentsOf: ctcHeadPath, configuration: ctcConfig)
if ctcHeadModel != nil {
logger.info("[Beta] Loaded CTC head model from local directory")
} else {
logger.warning("CTC head model found but failed to load: \(ctcHeadPath.path)")
}
}

// v2: Fall back to downloading from parakeet-ctc-110m HF repo
if ctcHeadModel == nil {
do {
let ctcModels = try await DownloadUtils.loadModels(
.parakeetCtc110m,
modelNames: [Names.ctcHeadFile],
directory: parentDirectory,
computeUnits: config.computeUnits,
progressHandler: progressHandler
)
ctcHeadModel = ctcModels[Names.ctcHeadFile]
if ctcHeadModel != nil {
logger.info("[Beta] Loaded CTC head model from HF repo")
}
} catch {
logger.warning("CTC head model not available: \(error.localizedDescription)")
}
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Nested if statements in CTC head loading code violate AGENTS.md control flow rule

AGENTS.md states: "Nested if statements should be absolutely avoided. Use guard statements and inverted conditions to exit early." The new CTC head loading block has 3 levels of nesting: if version == .tdtCtc110mif FileManager.default.fileExistsif ctcHeadModel != nil. This could be restructured by extracting a helper method or using guard-based early exits.

Prompt for agents
In Sources/FluidAudio/ASR/Parakeet/AsrModels.swift lines 219-252, extract the CTC head loading logic into a separate private static method like `loadCtcHead(from directory: URL, parentDirectory: URL, config: MLModelConfiguration, progressHandler: ...)` that uses guard statements and early returns instead of nested ifs. The outer call site would become: `let ctcHeadModel = version == .tdtCtc110m ? try? await loadCtcHead(...) : nil`. Inside the helper, use guard for the file existence check and return early on failure, avoiding the 3-level nesting.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

@Alex-Wengg Alex-Wengg merged commit 9516d95 into main Mar 28, 2026
16 checks passed
@Alex-Wengg Alex-Wengg deleted the ctc-head-export branch March 28, 2026 20:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant