
feat: integrate Qwen3-ForcedAligner-0.6B for per-word timestamp alignment#315

Closed
Alex-Wengg wants to merge 13 commits into `main` from `feat/forced-aligner`

Conversation

@Alex-Wengg
Member

@Alex-Wengg Alex-Wengg commented Feb 15, 2026

Summary

  • Integrates the Qwen3-ForcedAligner-0.6B CoreML int8 models from alexwengg/Qwen3-ForcedAligner-0.6B-Coreml into FluidAudio
  • 3-model CoreML pipeline (audio encoder + embedding + decoder with LM head) producing per-word timestamps via a single non-autoregressive prefill pass
  • Adds ForcedAlignerManager public actor API and fluidaudio align CLI command

Architecture

| Component | File | Description |
| --- | --- | --- |
| Config | `ForcedAlignerConfig.swift` | Model constants, special token IDs, dimensions |
| Types | `ForcedAlignerTypes.swift` | `WordAlignment`, `ForcedAlignmentResult`, error types |
| Models | `ForcedAlignerModels.swift` | Download from HF + CoreML model loading |
| Mel | `ForcedAlignerMelSpectrogram.swift` | Slaney-scale mel spectrogram with STFT center padding |
| MRoPE | `ForcedAlignerMRoPE.swift` | Interleaved multi-dimensional rotary position embeddings |
| Tokenizer | `ForcedAlignerTokenizer.swift` | BPE tokenizer (vocab.json + merges.txt from HF) |
| Inference | `ForcedAlignerInference.swift` | Full 12-step alignment pipeline |
| Manager | `ForcedAlignerManager.swift` | Public actor API |
| CLI | `AlignCommand.swift` | `fluidaudio align audio.wav --text "transcript"` |
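
The actor API above can be exercised roughly as follows. This is a hypothetical usage sketch: only `ForcedAlignerManager` and the `align` entry point are named in this PR, so the initializer, loading method, and result field names below are assumptions for illustration, not the actual API surface.

```swift
import FluidAudio  // module name assumed from the repository

// Hypothetical sketch of calling the public actor API.
func alignExample(samples: [Float], transcript: String) async throws {
    let manager = ForcedAlignerManager()

    // Assumed loading step: fetches the 3 CoreML models from HF on first use.
    try await manager.downloadAndLoad()

    // Align 16 kHz mono samples against the known transcript.
    let result = try await manager.align(samples, text: transcript)

    // WordAlignment / ForcedAlignmentResult field names are assumptions.
    for word in result.words {
        print("\(word.word): \(word.startMs)-\(word.endMs) ms")
    }
}
```

The CLI path (`fluidaudio align audio.wav --text "transcript"`) wraps the same actor call.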

Key Design Decisions

  • Slaney mel scale (not HTK) — matches HuggingFace WhisperFeatureExtractor with mel_scale="slaney" + norm="slaney"
  • STFT center padding — reflect-pads audio by nFFT/2 on each side, matching torch.stft(center=True)
  • Stride-aware MLMultiArray parsing — CoreML outputs may have non-contiguous memory layout with padding between rows
  • LIS-based timestamp correction — fixes non-monotonic timestamps using Longest Increasing Subsequence
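
The LIS-based correction in the last bullet can be sketched as a pure function: keep the longest non-decreasing subsequence of predicted timestamps as trusted anchors and re-interpolate everything else. This is an illustrative reimplementation under assumed names, not the code from the diff.

```swift
// Illustrative sketch of LIS-based timestamp monotonicity repair.
// Not the PR's implementation — the name and details are assumptions.
func fixTimestamps(_ ts: [Int]) -> [Int] {
    guard ts.count > 1 else { return ts }

    // O(n^2) longest non-decreasing subsequence; fine for per-utterance word counts.
    var lisLen = [Int](repeating: 1, count: ts.count)
    var prev = [Int](repeating: -1, count: ts.count)
    var best = 0
    for i in 1..<ts.count {
        for j in 0..<i where ts[j] <= ts[i] && lisLen[j] + 1 > lisLen[i] {
            lisLen[i] = lisLen[j] + 1
            prev[i] = j
        }
        if lisLen[i] > lisLen[best] { best = i }
    }

    // Mark the subsequence members as trusted anchors.
    var keep = [Bool](repeating: false, count: ts.count)
    var k = best
    while k != -1 { keep[k] = true; k = prev[k] }

    // Replace untrusted entries by linear interpolation between anchors.
    var out = ts
    for i in 0..<ts.count where !keep[i] {
        let lo = (0..<i).reversed().first(where: { keep[$0] })
        let hi = ((i + 1)..<ts.count).first(where: { keep[$0] })
        switch (lo, hi) {
        case let (l?, h?): out[i] = out[l] + (out[h] - out[l]) * (i - l) / (h - l)
        case let (l?, nil): out[i] = out[l]
        case let (nil, h?): out[i] = out[h]
        default: break
        }
    }
    return out
}
```

One reason to prefer LIS over naive clamping is that it preserves the largest mutually consistent subset of predictions rather than trusting whichever value happens to come first.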

Test Results

Tested with 4 TTS-generated wav files of varying lengths:

| Audio | Transcript | Words | Timestamps |
| --- | --- | --- | --- |
| 1.39s | "hello world how are you today" | 6 | 0 - 1280ms ✓ |
| 2.36s | "the quick brown fox jumps over the lazy dog" | 9 | 0 - 2320ms ✓ |
| 2.93s | "artificial intelligence is transforming the world rapidly" | 7 | 0 - 2880ms ✓ |
| 4.77s | "I have a dream that one day this nation will rise up..." | 21 | 0 - 4720ms ✓ |

Test plan

  • Build succeeds (swift build)
  • swift-format lint passes
  • Test with short audio (1.39s, 6 words)
  • Test with medium audio (2.36s, 9 words)
  • Test with longer audio (4.77s, 21 words)
  • Timestamps are monotonically increasing
  • Timestamps fall within audio duration
  • Test with real speech recordings (not TTS)

🤖 Generated with Claude Code



@claude
Contributor

claude bot commented Feb 15, 2026

Claude finished @Alex-Wengg's task


PR Review: Forced Aligner Integration ✅

Comprehensive review of this forced aligner integration PR completed.

Todo List:

  • Create pending review
  • Get PR diff for line numbers
  • Review architecture and design decisions
  • Review individual files for code quality
  • Check adherence to project guidelines
  • Add inline comments for specific issues
  • Submit comprehensive review

Review Summary

This is a well-architected and comprehensive implementation of forced alignment using Qwen3-ForcedAligner-0.6B. The code demonstrates excellent understanding of the model pipeline and follows most project conventions well. However, there are several critical issues that should be addressed before merging.

Critical Issues (Blockers) 🚨

1. Actor Isolation Violation

  • File: ForcedAlignerManager.swift:72
  • Issue: align() method is not async but performs heavy computation
  • Impact: Could block the actor and cause UI freezing
  • Fix: Make the method async or move heavy computation off the actor's context

2. Thread Safety Concerns

  • File: ForcedAlignerMelSpectrogram.swift:14
  • Issue: Class marked as "not thread-safe" with mutable buffers used in actor context
  • Impact: Potential race conditions and memory corruption
  • Fix: Make buffers thread-local or implement proper synchronization

3. Memory Management

  • File: ForcedAlignerMelSpectrogram.swift and ForcedAlignerInference.swift
  • Issue: Large buffer allocations without explicit cleanup
  • Impact: Memory pressure on iOS devices
  • Fix: Implement proper memory management and consider using autoreleasepool

Major Issues (Should Fix) ⚠️

4. Error Handling in CLI

  • File: AlignCommand.swift:74
  • Issue: try await manager.align() should be async but the method isn't
  • Fix: Make manager.align() async and update call site

5. Tokenizer Input Validation

  • File: ForcedAlignerTokenizer.swift:89
  • Issue: No validation for text length or character set compatibility
  • Impact: Could cause runtime failures with certain inputs
  • Fix: Add input validation and sanitization

6. Model Loading Pattern

  • File: ForcedAlignerModels.swift:30
  • Issue: Uses .all as default compute units, may not be optimal for all devices
  • Fix: Consider .cpuAndGPU as default for better device compatibility

Minor Issues/Nits (Nice to have) 💡

7. Documentation Enhancement

  • Several methods could benefit from more detailed parameter documentation
  • Consider adding usage examples in key public APIs

8. Performance Optimizations

  • Consider caching mel filterbank calculations
  • MLMultiArray reuse could reduce allocations

9. Code Style

  • Overall excellent adherence to project conventions
  • Proper use of OSLog and error handling patterns
  • Good separation of concerns

Positive Highlights ✨

  1. Excellent Architecture - Clean separation between config, types, models, inference, and manager
  2. Comprehensive Error Handling - Well-defined error types with meaningful messages
  3. Performance Considerations - Good use of Accelerate framework for SIMD operations
  4. Documentation Quality - Extensive comments explaining the pipeline steps
  5. Testing Evidence - PR description shows thorough testing with various audio lengths
  6. Integration Quality - Proper CLI integration following existing patterns

Recommendations

  1. Priority 1: Fix the actor isolation and thread safety issues (issues 1-3 above)
  2. Priority 2: Address error handling inconsistencies (issues 4-5 above)
  3. Priority 3: Consider performance optimizations for production use
  4. Testing: Add unit tests for the tokenizer and mel spectrogram components
  5. Documentation: Add API documentation examples for public interfaces

Conclusion

This is a high-quality implementation that demonstrates deep understanding of the forced alignment pipeline. The modular architecture and comprehensive error handling are exemplary. However, the critical thread safety and actor isolation issues must be resolved before merging to prevent runtime issues in production.

Recommendation: Request changes to address critical issues, then approve. The implementation quality is excellent once these issues are resolved.


Contributor

@claude claude bot left a comment


Comprehensive review of the Qwen3-ForcedAligner integration. Overall well-architected implementation with some critical issues to address.

@github-actions

github-actions bot commented Feb 15, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | NaN% | <20% | ⚠️ | Diarization Error Rate (lower is better) |
| RTFx | NaNx | >1.0x | ⚠️ | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | NaN | NaN | Fetching diarization models |
| Model Compile | NaN | NaN | CoreML compilation |
| Audio Load | NaN | NaN | Loading audio file |
| Segmentation | NaN | NaN | VAD + speech detection |
| Embedding | NaN | NaN | Speaker embedding extraction |
| Clustering (VBx) | NaN | NaN | Hungarian algorithm + VBx clustering |
| Total | NaN | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
| --- | --- | --- | --- |
| FluidAudio (Offline) | NaN% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • NaNs meeting audio • NaNs processing • Test runtime: N/A • 03/07/2026, 10:38 AM EST

@github-actions

github-actions bot commented Feb 15, 2026

PocketTTS Smoke Test ✅

| Check | Result |
| --- | --- |
| Build | ✅ |
| Model download | ✅ |
| Model load | ✅ |
| Synthesis pipeline | ✅ |
| Output WAV | ✅ (213.8 KB) |

Runtime: 0m26s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality may differ from Apple Silicon.

@github-actions

github-actions bot commented Feb 15, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
| --- | --- | --- | --- | --- | --- | --- |
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 567.7x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 738.0x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

@github-actions

github-actions bot commented Feb 15, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
| --- | --- | --- |
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 10.85x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 45.3s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
| --- | --- | --- |
| Avg Chunk Time | 0.045s | Average chunk processing time |
| Max Chunk Time | 0.091s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 0m57s • 03/07/2026, 10:30 AM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

@github-actions

github-actions bot commented Feb 15, 2026

Qwen3-ASR int8 Smoke Test ✅

| Check | Result |
| --- | --- |
| Build | ✅ |
| Model download | ✅ |
| Model load | ✅ |
| Transcription pipeline | ✅ |
| Decoder size | 571 MB (vs 1.1 GB f32) |

Runtime: 4m31s

Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.

@github-actions

github-actions bot commented Feb 15, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
| --- | --- | --- | --- |
| DER | 33.4% | <35% | ✅ |
| Miss Rate | 24.4% | - | - |
| False Alarm | 0.1% | - | - |
| Speaker Error | 8.9% | - | - |
| RTFx | 19.3x | >1.0x | ✅ |
| Speakers | 4/4 | - | - |

Sortformer High-Latency • ES2004a • Runtime: 2m 48s • 2026-03-07T15:31:26.724Z

@github-actions

github-actions bot commented Feb 15, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
| --- | --- | --- | --- | --- |
| test-clean | 0.57% | 0.00% | 4.39x | ✅ |
| test-other | 1.59% | 0.00% | 2.59x | ✅ |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
| --- | --- | --- | --- | --- |
| test-clean | 0.80% | 0.00% | 3.92x | ✅ |
| test-other | 1.00% | 0.00% | 2.65x | ✅ |

Streaming (v3)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.48x | Streaming real-time factor |
| Avg Chunk Time | 1.934s | Average time to process each chunk |
| Max Chunk Time | 2.534s | Maximum chunk processing time |
| First Token | 2.283s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.45x | Streaming real-time factor |
| Avg Chunk Time | 1.934s | Average time to process each chunk |
| Max Chunk Time | 3.035s | Maximum chunk processing time |
| First Token | 1.944s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 8m22s • 03/07/2026, 10:40 AM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

@github-actions

github-actions bot commented Feb 15, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | 15.1% | <30% | ✅ | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | ✅ | Jaccard Error Rate |
| RTFx | 17.72x | >1.0x | ✅ | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | 9.066 | 15.3 | Fetching diarization models |
| Model Compile | 3.885 | 6.6 | CoreML compilation |
| Audio Load | 0.053 | 0.1 | Loading audio file |
| Segmentation | 17.752 | 30.0 | Detecting speech regions |
| Embedding | 29.587 | 50.0 | Extracting speaker voices |
| Clustering | 11.835 | 20.0 | Grouping same speakers |
| Total | 59.208 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
| --- | --- | --- |
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): runs at ~150x real-time (RTFx)
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 59.2s diarization time • Test runtime: 2m 59s • 03/07/2026, 10:38 AM EST

…ment

Add forced alignment pipeline using a 3-model CoreML architecture
(audio encoder + embedding + decoder with LM head) in a single
non-autoregressive prefill pass. Produces per-word timestamps by
aligning audio against a known transcript.

New files:
- ForcedAlignerConfig: model constants and special token IDs
- ForcedAlignerTypes: WordAlignment, ForcedAlignmentResult, error types
- ForcedAlignerModels: CoreML model download and loading
- ForcedAlignerMelSpectrogram: Slaney-scale mel with center padding
- ForcedAlignerMRoPE: interleaved multi-dimensional rotary embeddings
- ForcedAlignerTokenizer: BPE tokenizer with vocab.json/merges.txt
- ForcedAlignerInference: full 12-step alignment pipeline
- ForcedAlignerManager: public actor API
- AlignCommand: CLI command (`fluidaudio align`)

Document 5 bugs encountered during FluidAudio integration:
MLMultiArray stride issues, encoder 3D shape, Slaney vs HTK mel,
STFT center padding, and MRoPE position clamping.

Remove misplaced mobius doc, add ForcedAligner.md covering
architecture, pipeline steps, public API, CLI usage, and
file reference.

Add Documentation/ForcedAligner.md covering architecture, API usage,
CLI, test results, and limitations. Marked as beta.

The CoreML ANE runtime (e5rt) leaks IOSurface buffers on repeated
individual predictions for models with conv ops. After ~500 calls
the process crashes with allocation failures.

Fix: Use MLModel.predictions(fromBatch:) for the audio encoder
instead of per-chunk predict() calls. This reduces encoder calls
from ~N*23 (chunks per segment) to N (one batch per segment),
avoiding the leak accumulation entirely.

Also adds CoreMLPredictionWrapper (ObjC @autoreleasepool) for
embedding and decoder predictions, and changes default compute
units from .all to .cpuAndGPU.

Verified: 500 samples on Buckeye corpus, 9.4x RTFx, no crashes,
no model reload needed.
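
The batch-API fix described in this commit can be sketched as follows; the feature names (`melInput`, `encoderOutput`) are assumptions for illustration, since the real names depend on the converted CoreML model:

```swift
import CoreML

// Sketch of replacing per-chunk predict() calls with one batched call,
// which the commit found avoids ANE IOSurface leak accumulation.
func encodeChunks(model: MLModel, chunks: [MLMultiArray]) throws -> [MLMultiArray] {
    let providers = try chunks.map {
        try MLDictionaryFeatureProvider(dictionary: ["melInput": $0])
    }
    let batch = MLArrayBatchProvider(array: providers)

    // One batched call instead of chunks.count individual predictions.
    let outputs = try model.predictions(fromBatch: batch)

    return (0..<outputs.count).compactMap { i in
        outputs.features(at: i).featureValue(for: "encoderOutput")?.multiArrayValue
    }
}
```

The batch call lets CoreML manage the IOSurface allocation/release cycle once per segment rather than once per chunk.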
Adds `fluidaudio align-benchmark` for evaluating ForcedAligner against
the Buckeye Corpus (human-annotated word-level timestamps). Reports
AAS, tolerance percentiles, and RTFx.

Also adds `fluidaudio download --dataset buckeye` to fetch the
segmented Buckeye dataset from HuggingFace (alexwengg/buckeye).

Add rule to use Buckeye Corpus (not LibriSpeech) for forced alignment
evaluation since it has human-annotated word-level timestamps.

Documents all approaches tested (ObjC autorelease, batch API,
model surgery, baked compute units) with benchmark results.
Native batch API was the winner at 9.4x RTFx with no crashes.

Adds run_pytorch_benchmark.py for running PyTorch Qwen3-ForcedAligner
against the Buckeye Corpus. Auto-downloads dataset from HuggingFace
(alexwengg/buckeye) if not present locally.
@Alex-Wengg Alex-Wengg force-pushed the feat/forced-aligner branch from 44c1e2e to c0d4a56 on March 4, 2026 at 20:06
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 3 potential issues.

View 8 additional findings in Devin Review.

Open in Devin Review

```objc
        }
        [results addObject:result];
    }
    i = batchEnd - 1; // -1 because the for loop will increment
```
Contributor


🟡 NSUInteger underflow causes infinite loop in batchPredictWithModel when drainInterval is 0

In CoreMLPredictionWrapper.m:42, when drainInterval is 0, batchEnd = MIN(i + 0, inputs.count) equals i (which is 0 on the first iteration). The inner loop for j = i; j < batchEnd doesn't execute. Then i = batchEnd - 1 computes 0 - 1 on an NSUInteger (unsigned), which wraps to NSUIntegerMax. The subsequent i++ wraps it back to 0, creating an infinite loop. This method is currently dead code (only predictWithModel is called), but it's a public API that could be invoked by future callers.

Prompt for agents
In Sources/CoreMLPredictionWrapper/CoreMLPredictionWrapper.m, add a guard at the top of batchPredictWithModel to handle drainInterval == 0. Either return nil with an error, or default drainInterval to 1. The underflow occurs at line 42 where `i = batchEnd - 1` computes 0 - 1 on NSUInteger. A simple fix is to add at line 27 (after the results allocation): `if (drainInterval == 0) drainInterval = 1;`


Comment on lines +23 to +27

```swift
public actor ForcedAlignerManager {
    private var models: ForcedAlignerModels?
    private var tokenizer: ForcedAlignerTokenizer?

    public init() {}
```
Contributor


🔴 No unit tests added for the entire ForcedAligner module

AGENTS.md and CLAUDE.md both mandate: "Add unit tests when writing new code." This PR adds ~1400 lines of new library code across 8 files (ForcedAlignerConfig, ForcedAlignerTypes, ForcedAlignerModels, ForcedAlignerMelSpectrogram, ForcedAlignerMRoPE, ForcedAlignerTokenizer, ForcedAlignerInference, ForcedAlignerManager) with zero test coverage. Key components like BPE tokenization (encode()), mel spectrogram computation, LIS-based timestamp fixing (fixTimestamps()), and MRoPE computation are all pure functions that are straightforward to unit test without models.

Prompt for agents
Add a new test file Tests/FluidAudioTests/ForcedAlignerTests.swift with unit tests for the ForcedAligner module. At minimum, test: (1) ForcedAlignerTokenizer.encode() with known words and expected BPE token IDs, (2) ForcedAlignerTokenizer.tokenize() verifying correct timestamp token placement and word count, (3) ForcedAlignerMelSpectrogram.reflectPad() correctness, (4) ForcedAlignerMelSpectrogram.compute() with a simple sine wave input verifying output dimensions, (5) ForcedAlignerMRoPE.compute() verifying output dimensions and that padded positions repeat the last valid position, (6) ForcedAlignerInference.fixTimestamps() with monotonic input (no-op), decreasing input, and mixed input.
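
A minimal version of the suggested tests might look like the following XCTest sketch; the `ForcedAlignerInference.fixTimestamps` signature is assumed from the review text and may differ from the actual code:

```swift
import XCTest
@testable import FluidAudio  // module name assumed

// Hypothetical unit-test sketch for LIS-based timestamp repair,
// following the review's suggestion. Signatures are assumptions.
final class ForcedAlignerTests: XCTestCase {
    func testFixTimestampsMonotonicInputIsUntouched() {
        let ts = [0, 120, 240, 480]
        XCTAssertEqual(ForcedAlignerInference.fixTimestamps(ts), ts)
    }

    func testFixTimestampsRepairsNonMonotonicEntry() {
        let fixed = ForcedAlignerInference.fixTimestamps([0, 300, 120, 600])
        // Output must be non-decreasing and keep the endpoints.
        XCTAssertEqual(fixed.first, 0)
        XCTAssertEqual(fixed.last, 600)
        XCTAssertTrue(zip(fixed, fixed.dropFirst()).allSatisfy { $0 <= $1 })
    }
}
```

Tests like these run without any CoreML models, so they can execute in CI.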


```swift
/// to tokenize input text for the forced alignment pipeline.
///
/// The forced aligner formats input as:
/// `<|audio_start|><|audio_pad|><|audio_end|><timestamp><timestamp>word1<timestamp><timestamp>word2<timestamp><timestamp>`
```
Contributor


🟡 Tokenizer struct docstring shows leading timestamps but code produces none, risking incorrect future changes

The class-level docstring on ForcedAlignerTokenizer at line 13 states the format is <|audio_start|><|audio_pad|><|audio_end|><timestamp><timestamp>word1<timestamp><timestamp>word2<timestamp><timestamp> — showing leading <timestamp><timestamp> before the first word. However, the actual tokenize() implementation at lines 110-115 explicitly does NOT add leading timestamps (if i > 0), and the inline comment at line 109 correctly says (NO leading timestamps before the first word). The struct docstring contradicts the code and its own inline comment, which could mislead future developers into "fixing" the code to match the docstring and breaking alignment.

Suggested change

```diff
-/// `<|audio_start|><|audio_pad|><|audio_end|><timestamp><timestamp>word1<timestamp><timestamp>word2<timestamp><timestamp>`
+/// `<|audio_start|><|audio_pad|>...<|audio_end|>word1<timestamp><timestamp>word2<timestamp><timestamp>...wordN<timestamp><timestamp>`
```
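
Under the corrected layout, token assembly can be sketched as below. This is an illustrative sketch only; the real token IDs and the `tokenize()` signature are assumptions:

```swift
// Sketch of the (corrected) input layout: no timestamp pair before the
// first word, and every word followed by <timestamp><timestamp>.
// Token ID parameters are hypothetical.
func buildTokens(
    wordTokenIds: [[Int]], audioStart: Int, audioPad: Int,
    audioEnd: Int, timestamp: Int, padCount: Int
) -> [Int] {
    var tokens = [audioStart]
        + Array(repeating: audioPad, count: padCount)
        + [audioEnd]
    for (i, word) in wordTokenIds.enumerated() {
        if i > 0 { tokens += [timestamp, timestamp] }  // none before word 0
        tokens += word
    }
    tokens += [timestamp, timestamp]  // trailing pair after the last word
    return tokens
}
```

This mirrors the `if i > 0` guard the review cites at lines 110-115 of the tokenizer.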


…ce leak

The audio encoder's Conv2d layers route to ANE, which leaks IOSurface
buffers on repeated individual prediction() calls. Switch to the native
batch API (predictions(fromBatch:)) to manage IOSurface lifecycle in a
single allocation/release cycle, matching the approach in ForcedAligner.

Also removes the sleep(1) workaround from the benchmark and fixes the
pyproject.toml to use the PyPI qwen-asr package.
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 2 new potential issues.

View 11 additional findings in Devin Review.


```swift
import Foundation
import OSLog

private let logger = Logger(subsystem: "FluidAudio", category: "ForcedAlignerManager")
```
Contributor


🟡 ForcedAligner uses Logger instead of AppLogger, violating logging convention

CLAUDE.md specifies: "Logging: Use AppLogger(category:) from Shared/AppLogger.swift — not print() in production code." All four new ForcedAligner source files in Sources/FluidAudio/ use Logger(subsystem: "FluidAudio", category: ...) from OSLog directly instead of the project's AppLogger(category:) wrapper. This means ForcedAligner log messages won't be mirrored to console in DEBUG builds and won't use the project's standard subsystem "com.fluidinference". Affected files: ForcedAlignerInference.swift:7, ForcedAlignerModels.swift:5, ForcedAlignerTokenizer.swift:4, ForcedAlignerManager.swift:5.

Prompt for agents
Replace all `Logger(subsystem: "FluidAudio", category: ...)` declarations in the four ForcedAligner files with `AppLogger(category: ...)` to match the project convention. Files to change:
1. Sources/FluidAudio/ASR/ForcedAligner/ForcedAlignerManager.swift line 5: change to `private let logger = AppLogger(category: "ForcedAlignerManager")`
2. Sources/FluidAudio/ASR/ForcedAligner/ForcedAlignerInference.swift line 7: change to `private let logger = AppLogger(category: "ForcedAlignerInference")`
3. Sources/FluidAudio/ASR/ForcedAligner/ForcedAlignerModels.swift line 5: change to `private let logger = AppLogger(category: "ForcedAlignerModels")`
4. Sources/FluidAudio/ASR/ForcedAligner/ForcedAlignerTokenizer.swift line 4: change to `private let logger = AppLogger(category: "ForcedAlignerTokenizer")`
Also remove the `import OSLog` from each file and add `import Foundation` if not already present (AppLogger is in the same module so no extra import needed).


… safety net

Switch both Qwen3 ASR and ForcedAligner audio encoders from individual
prediction() calls to the native batch API (predictions(fromBatch:)),
reducing IOSurface leak rate from ANE Conv2d layers.

Add periodic model reload every 300 samples in align-benchmark to
reclaim leaked ANE IOSurface buffers on long runs (1000+ files).
Verified: 1000/1000 Buckeye segments complete without crash.

Remove the sleep(1) workaround from Qwen3 ASR benchmark.
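
The periodic-reload safety net described in this commit can be sketched as a simple loop. The 300-sample interval comes from the commit message; the `loadModels`/`process` helpers and the `ForcedAlignerModels` usage are hypothetical:

```swift
import Foundation

// Sketch of reclaiming leaked ANE IOSurface buffers by periodically
// rebuilding the CoreML models during a long benchmark run.
func runLongBenchmark(
    files: [URL],
    loadModels: () async throws -> ForcedAlignerModels,
    process: (URL, ForcedAlignerModels) async throws -> Void
) async throws {
    var models = try await loadModels()
    for (index, file) in files.enumerated() {
        // Every 300 samples, reload so the OS can release IOSurface
        // buffers leaked by the ANE runtime; the old instance is
        // deallocated on reassignment.
        if index > 0 && index % 300 == 0 {
            models = try await loadModels()
        }
        try await process(file, models)
    }
}
```

This trades a periodic reload cost for bounded IOSurface usage on 1000+ file runs.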
Contributor

@devin-ai-integration devin-ai-integration bot left a comment


Devin Review found 2 new potential issues.

View 13 additional findings in Devin Review.


Comment on lines -553 to -558

```swift
// Give system time to reclaim CoreML MLState IOSurface resources every 25 files.
// Without this pause, IOSurface limit (~200) is exhausted causing crashes.
if (index + 1) % 25 == 0 {
    logger.info("Memory cleanup pause...")
    try? await Task.sleep(for: .seconds(1))
}
```
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Removal of IOSurface cleanup pause may cause crashes in long ASR benchmarks

The per-25-files sleep was removed from Qwen3AsrBenchmark.runBenchmarkLoop (lines 553-558 on LEFT). The original comment explicitly warned: "Without this pause, IOSurface limit (~200) is exhausted causing crashes." The batch API was added for the audio encoder, but the autoregressive decoder still uses per-prediction calls with MLState. The original sleep helped the system reclaim MLState IOSurface resources between files. Since the batch API only batches encoder predictions and doesn't affect decoder resource lifecycle, removing this pause could re-introduce IOSurface exhaustion crashes on long benchmark runs (100+ files).

Prompt for agents
In Sources/FluidAudioCLI/Commands/ASR/Qwen3AsrBenchmark.swift, the IOSurface cleanup pause (sleep every 25 files) was removed from the runBenchmarkLoop function around the lines after the inner file processing loop (after the catch block). The audio encoder was batched, but the decoder still uses per-prediction MLState calls which also leak IOSurface buffers. Either re-add the periodic sleep/cleanup (e.g. every 25-50 files) or add a periodic model reload (similar to ForcedAligner's approach of reloading every 300 samples in AlignBenchmark.swift:357-364) to reclaim leaked decoder MLState IOSurface resources during long benchmark runs.


Comment on lines +81 to +96
let targetDir = directory ?? defaultCacheDirectory()
let modelsRoot = modelsRootDirectory()

if !force && modelsExist(at: targetDir) {
logger.info("ForcedAligner models already present at: \(targetDir.path)")
return targetDir
}

if force {
try? FileManager.default.removeItem(at: targetDir)
}

logger.info("Downloading ForcedAligner int8 models from HuggingFace...")
try await DownloadUtils.downloadRepo(.forcedAlignerInt8, to: modelsRoot)
logger.info("Successfully downloaded ForcedAligner models")
return targetDir
Contributor


🔴 ForcedAlignerModels.download(to:) ignores custom directory, always downloads to default path

When download(to:) is called with a non-nil directory argument, the method sets targetDir = directory (the custom path) but always downloads to modelsRoot = modelsRootDirectory() (the default Application Support path). After downloading, it returns targetDir — which still has no model files. This causes downloadAndLoad(to: customDir) (ForcedAlignerModels.swift:67-73) to download models to the default cache, then attempt to load(from: customDir) which will throw modelNotFound because the models were never placed there.

Prompt for agents
In Sources/FluidAudio/ASR/ForcedAligner/ForcedAlignerModels.swift, the download(to:force:) method on lines 77-97 needs to be fixed so that when a custom `directory` is provided, the models are actually downloaded to that directory (or its parent, depending on how downloadRepo works with folderName). Currently line 82 sets `modelsRoot = modelsRootDirectory()` unconditionally. When `directory` is non-nil, the download target should be derived from the custom directory. One fix: when `directory` is non-nil, compute `modelsRoot` as `directory.deletingLastPathComponent()` (assuming directory already includes the folderName suffix), or download directly to `directory`. Verify the behavior of `DownloadUtils.downloadRepo` — it appends `repo.folderName` to the `to:` parameter, so `modelsRoot` should be the parent of where models should land.


Resolve conflicts:
- Package.swift: keep CoreMLPredictionWrapper, drop removed ESpeakNG
  and FluidAudioEspeak targets (deleted on main)
- Package.resolved: take main's originHash
@Alex-Wengg Alex-Wengg closed this Mar 16, 2026