feat: integrate Qwen3-ForcedAligner-0.6B for per-word timestamp alignment #315
Alex-Wengg wants to merge 13 commits into main from
Conversation
Claude finished @Alex-Wengg's task: PR Review: Forced Aligner Integration ✅
Comprehensive review of this forced aligner integration PR completed.
Review Summary
This is a well-architected and comprehensive implementation of forced alignment using Qwen3-ForcedAligner-0.6B. The code demonstrates a strong understanding of the model pipeline and follows most project conventions. However, several critical issues should be addressed before merging.

Critical Issues (Blockers) 🚨
1. Actor Isolation Violation
2. Thread Safety Concerns
3. Memory Management

Major Issues (Should Fix)
Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode): optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown: time spent in each stage of batch diarization
Speaker Diarization Research Comparison: offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • NaNs meeting audio • NaNs processing • Test runtime: N/A • 03/07/2026, 10:38 AM EST
PocketTTS Smoke Test ✅
Runtime: 0m26s. Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality may differ from Apple Silicon.
VAD Benchmark Results
Performance Comparison
Dataset Details
✅: Average F1-Score above 70%
Parakeet EOU Benchmark Results ✅
Status: Benchmark passed
Performance Metrics
Streaming Metrics
Test runtime: 0m57s • 03/07/2026, 10:30 AM EST. RTFx = Real-Time Factor (higher is better) • Processing includes: model inference, audio preprocessing, state management, and file I/O
Qwen3-ASR int8 Smoke Test ✅
Runtime: 4m31s. Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
Sortformer High-Latency Benchmark Results
ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 2m 48s • 2026-03-07T15:31:26.724Z
ASR Benchmark Results ✅
Status: All benchmarks passed
Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming. 25 files per dataset • Test runtime: 8m22s • 03/07/2026, 10:40 AM EST. RTFx = Real-Time Factor (higher is better), calculated as: total audio duration ÷ total processing time.
Expected RTFx Performance on Physical M1 Hardware: M1 Mac ~28x (clean), ~25x (other). Testing methodology follows the HuggingFace Open ASR Leaderboard.
Speaker Diarization Benchmark Results
Speaker Diarization Performance: evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown: time spent in each stage of speaker diarization
Speaker Diarization Research Comparison: research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 59.2s diarization time • Test runtime: 2m 59s • 03/07/2026, 10:38 AM EST
…ment
Add forced alignment pipeline using a 3-model CoreML architecture (audio encoder + embedding + decoder with LM head) in a single non-autoregressive prefill pass. Produces per-word timestamps by aligning audio against a known transcript.
New files:
- ForcedAlignerConfig: model constants and special token IDs
- ForcedAlignerTypes: WordAlignment, ForcedAlignmentResult, error types
- ForcedAlignerModels: CoreML model download and loading
- ForcedAlignerMelSpectrogram: Slaney-scale mel with center padding
- ForcedAlignerMRoPE: interleaved multi-dimensional rotary embeddings
- ForcedAlignerTokenizer: BPE tokenizer with vocab.json/merges.txt
- ForcedAlignerInference: full 12-step alignment pipeline
- ForcedAlignerManager: public actor API
- AlignCommand: CLI command (`fluidaudio align`)
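ForcedAlignerMRoPE builds on rotary position embeddings. As a generic illustration of the underlying rotation (not the model's actual interleaved multi-dimensional variant; `applyRoPE` is a hypothetical name), each feature pair is rotated by a position-dependent angle:

```swift
import Foundation

// Generic rotary position embedding on one head dimension: each feature
// pair (x[2k], x[2k+1]) is rotated by an angle that depends on the token
// position and the pair index. Illustrative only; the model's MRoPE
// interleaves several position dimensions across pairs.
func applyRoPE(_ x: [Double], position: Int, base: Double = 10000) -> [Double] {
    precondition(x.count % 2 == 0)
    var out = x
    for k in 0..<(x.count / 2) {
        let theta = Double(position) / pow(base, Double(2 * k) / Double(x.count))
        let (c, s) = (cos(theta), sin(theta))
        out[2 * k]     = x[2 * k] * c - x[2 * k + 1] * s
        out[2 * k + 1] = x[2 * k] * s + x[2 * k + 1] * c
    }
    return out
}
```

Because each pair is a pure rotation, the vector norm is preserved, and position 0 is the identity.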
Document 5 bugs encountered during FluidAudio integration: MLMultiArray stride issues, encoder 3D shape, Slaney vs HTK mel, STFT center padding, and MRoPE position clamping.
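The Slaney-vs-HTK mismatch mentioned above is a common pitfall: the two mel scales use different formulas, so filterbanks built from them disagree. A sketch of the standard librosa-style definitions (not the project's actual implementation):

```swift
import Foundation

// HTK mel scale: a single log curve over the whole frequency range.
func htkMel(_ hz: Double) -> Double {
    2595.0 * log10(1.0 + hz / 700.0)
}

// Slaney mel scale: linear below 1 kHz, logarithmic above it.
func slaneyMel(_ hz: Double) -> Double {
    let fSp = 200.0 / 3.0          // Hz per mel in the linear region
    let minLogHz = 1000.0          // linear/log breakpoint
    let minLogMel = minLogHz / fSp // mel value at the breakpoint (15.0)
    let logStep = log(6.4) / 27.0  // log-region step size
    return hz < minLogHz ? hz / fSp : minLogMel + log(hz / minLogHz) / logStep
}
```

At 1 kHz the HTK value is near 1000 mel while the Slaney value is 15 mel, so mixing the two silently shifts every mel filter center.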
Remove misplaced mobius doc, add ForcedAligner.md covering architecture, pipeline steps, public API, CLI usage, and file reference.
Add Documentation/ForcedAligner.md covering architecture, API usage, CLI, test results, and limitations. Marked as beta.
The CoreML ANE runtime (e5rt) leaks IOSurface buffers on repeated individual predictions for models with conv ops. After ~500 calls the process crashes with allocation failures. Fix: Use MLModel.predictions(fromBatch:) for the audio encoder instead of per-chunk predict() calls. This reduces encoder calls from ~N*23 (chunks per segment) to N (one batch per segment), avoiding the leak accumulation entirely. Also adds CoreMLPredictionWrapper (ObjC @autoreleasepool) for embedding and decoder predictions, and changes default compute units from .all to .cpuAndGPU. Verified: 500 samples on Buckeye corpus, 9.4x RTFx, no crashes, no model reload needed.
Adds `fluidaudio align-benchmark` for evaluating ForcedAligner against the Buckeye Corpus (human-annotated word-level timestamps). Reports AAS, tolerance percentiles, and RTFx. Also adds `fluidaudio download --dataset buckeye` to fetch the segmented Buckeye dataset from HuggingFace (alexwengg/buckeye).
Add rule to use Buckeye Corpus (not LibriSpeech) for forced alignment evaluation since it has human-annotated word-level timestamps.
Documents all approaches tested (ObjC autorelease, batch API, model surgery, baked compute units) with benchmark results. Native batch API was the winner at 9.4x RTFx with no crashes.
Adds run_pytorch_benchmark.py for running PyTorch Qwen3-ForcedAligner against the Buckeye Corpus. Auto-downloads dataset from HuggingFace (alexwengg/buckeye) if not present locally.
        }
        [results addObject:result];
    }
    i = batchEnd - 1; // -1 because the for loop will increment
🟡 NSUInteger underflow causes infinite loop in batchPredictWithModel when drainInterval is 0
In CoreMLPredictionWrapper.m:42, when drainInterval is 0, batchEnd = MIN(i + 0, inputs.count) equals i (which is 0 on the first iteration). The inner loop for j = i; j < batchEnd doesn't execute. Then i = batchEnd - 1 computes 0 - 1 on an NSUInteger (unsigned), which wraps to NSUIntegerMax. The subsequent i++ wraps it back to 0, creating an infinite loop. This method is currently dead code (only predictWithModel is called), but it's a public API that could be invoked by future callers.
Prompt for agents
In Sources/CoreMLPredictionWrapper/CoreMLPredictionWrapper.m, add a guard at the top of batchPredictWithModel to handle drainInterval == 0. Either return nil with an error, or default drainInterval to 1. The underflow occurs at line 42 where `i = batchEnd - 1` computes 0 - 1 on NSUInteger. A simple fix is to add at line 27 (after the results allocation): `if (drainInterval == 0) drainInterval = 1;`
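The guarded loop shape can be sketched in Swift, where UInt mirrors NSUInteger's wrap-on-underflow behavior (names are illustrative, not the wrapper's actual API):

```swift
// Sketch of the batch loop from the review. With drainInterval == 0 the
// original code computed i = batchEnd - 1 on an unsigned index, wrapping
// to UInt.max. Clamping the interval to at least 1 removes the hazard,
// and advancing i to batchEnd avoids the -1/+1 dance entirely.
func batchIndices(count: UInt, drainInterval: UInt) -> [(start: UInt, end: UInt)] {
    let interval = max(drainInterval, 1)  // guard suggested in the review
    var ranges: [(start: UInt, end: UInt)] = []
    var i: UInt = 0
    while i < count {
        let batchEnd = min(i &+ interval, count)
        ranges.append((i, batchEnd))
        i = batchEnd  // advance past the batch; no underflow possible
    }
    return ranges
}
```

With the guard in place, a zero interval degrades to per-item batches instead of an infinite loop.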
public actor ForcedAlignerManager {
    private var models: ForcedAlignerModels?
    private var tokenizer: ForcedAlignerTokenizer?

    public init() {}
🔴 No unit tests added for the entire ForcedAligner module
AGENTS.md and CLAUDE.md both mandate: "Add unit tests when writing new code." This PR adds ~1400 lines of new library code across 8 files (ForcedAlignerConfig, ForcedAlignerTypes, ForcedAlignerModels, ForcedAlignerMelSpectrogram, ForcedAlignerMRoPE, ForcedAlignerTokenizer, ForcedAlignerInference, ForcedAlignerManager) with zero test coverage. Key components like BPE tokenization (encode()), mel spectrogram computation, LIS-based timestamp fixing (fixTimestamps()), and MRoPE computation are all pure functions that are straightforward to unit test without models.
Prompt for agents
Add a new test file Tests/FluidAudioTests/ForcedAlignerTests.swift with unit tests for the ForcedAligner module. At minimum, test: (1) ForcedAlignerTokenizer.encode() with known words and expected BPE token IDs, (2) ForcedAlignerTokenizer.tokenize() verifying correct timestamp token placement and word count, (3) ForcedAlignerMelSpectrogram.reflectPad() correctness, (4) ForcedAlignerMelSpectrogram.compute() with a simple sine wave input verifying output dimensions, (5) ForcedAlignerMRoPE.compute() verifying output dimensions and that padded positions repeat the last valid position, (6) ForcedAlignerInference.fixTimestamps() with monotonic input (no-op), decreasing input, and mixed input.
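For item (6), a minimal sketch of what an LIS-based monotonicity fix could look like; this is a hypothetical implementation for test design, and the library's fixTimestamps() may differ:

```swift
// Keep the longest non-decreasing subsequence of timestamps and replace
// every rejected entry with its predecessor, so the output is monotonic.
// O(n^2) LIS is fine at word-count scale.
func fixTimestamps(_ ts: [Double]) -> [Double] {
    guard !ts.isEmpty else { return [] }
    var lis = [Int](repeating: 1, count: ts.count)
    var prev = [Int](repeating: -1, count: ts.count)
    for i in 1..<ts.count {
        for j in 0..<i where ts[j] <= ts[i] && lis[j] + 1 > lis[i] {
            lis[i] = lis[j] + 1
            prev[i] = j
        }
    }
    // Walk back from the LIS endpoint to collect the kept indices.
    var keep = Set<Int>()
    var k = lis.indices.max(by: { lis[$0] < lis[$1] })!
    while k >= 0 {
        keep.insert(k)
        k = prev[k]
    }
    let firstKept = keep.min()!
    var out = ts
    for i in 0..<firstKept { out[i] = ts[firstKept] }          // leading outliers
    for i in (firstKept + 1)..<out.count where !keep.contains(i) {
        out[i] = out[i - 1]                                    // clamp rejected entries
    }
    return out
}
```

Tests would then cover the three cases the review names: monotonic input (no-op), a decreasing outlier, and a leading outlier.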
/// to tokenize input text for the forced alignment pipeline.
///
/// The forced aligner formats input as:
/// `<|audio_start|><|audio_pad|><|audio_end|><timestamp><timestamp>word1<timestamp><timestamp>word2<timestamp><timestamp>`
🟡 Tokenizer struct docstring shows leading timestamps but code produces none, risking incorrect future changes
The class-level docstring on ForcedAlignerTokenizer at line 13 states the format is <|audio_start|><|audio_pad|><|audio_end|><timestamp><timestamp>word1<timestamp><timestamp>word2<timestamp><timestamp> — showing leading <timestamp><timestamp> before the first word. However, the actual tokenize() implementation at lines 110-115 explicitly does NOT add leading timestamps (if i > 0), and the inline comment at line 109 correctly says (NO leading timestamps before the first word). The struct docstring contradicts the code and its own inline comment, which could mislead future developers into "fixing" the code to match the docstring and breaking alignment.
Suggested change:
- /// `<|audio_start|><|audio_pad|><|audio_end|><timestamp><timestamp>word1<timestamp><timestamp>word2<timestamp><timestamp>`
+ /// `<|audio_start|><|audio_pad|>...<|audio_end|>word1<timestamp><timestamp>word2<timestamp><timestamp>...wordN<timestamp><timestamp>`
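The corrected layout (no timestamp slots before the first word, two after every word) can be sketched as follows; `layoutTokens` is a hypothetical helper, not the library's tokenize(), and real words are further split into BPE pieces:

```swift
// Build the special-token layout for forced alignment input:
// audio region first, then each word followed by two timestamp slots.
func layoutTokens(words: [String], audioPadCount: Int) -> [String] {
    var tokens = ["<|audio_start|>"]
    tokens += Array(repeating: "<|audio_pad|>", count: audioPadCount)
    tokens.append("<|audio_end|>")
    for word in words {
        tokens.append(word)                       // in reality: the word's BPE pieces
        tokens += ["<timestamp>", "<timestamp>"]  // start/end slots filled by decoding
    }
    return tokens
}
```

Note the first token after `<|audio_end|>` is a word, matching the inline comment in tokenize() rather than the stale struct docstring.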
…ce leak
The audio encoder's Conv2d layers route to ANE, which leaks IOSurface buffers on repeated individual prediction() calls. Switch to the native batch API (predictions(fromBatch:)) to manage IOSurface lifecycle in a single allocation/release cycle, matching the approach in ForcedAligner. Also removes the sleep(1) workaround from the benchmark and fixes pyproject.toml to use the PyPI qwen-asr package.
import Foundation
import OSLog

private let logger = Logger(subsystem: "FluidAudio", category: "ForcedAlignerManager")
🟡 ForcedAligner uses Logger instead of AppLogger, violating logging convention
CLAUDE.md specifies: "Logging: Use AppLogger(category:) from Shared/AppLogger.swift — not print() in production code." All four new ForcedAligner source files in Sources/FluidAudio/ use Logger(subsystem: "FluidAudio", category: ...) from OSLog directly instead of the project's AppLogger(category:) wrapper. This means ForcedAligner log messages won't be mirrored to console in DEBUG builds and won't use the project's standard subsystem "com.fluidinference". Affected files: ForcedAlignerInference.swift:7, ForcedAlignerModels.swift:5, ForcedAlignerTokenizer.swift:4, ForcedAlignerManager.swift:5.
Prompt for agents
Replace all `Logger(subsystem: "FluidAudio", category: ...)` declarations in the four ForcedAligner files with `AppLogger(category: ...)` to match the project convention. Files to change:
1. Sources/FluidAudio/ASR/ForcedAligner/ForcedAlignerManager.swift line 5: change to `private let logger = AppLogger(category: "ForcedAlignerManager")`
2. Sources/FluidAudio/ASR/ForcedAligner/ForcedAlignerInference.swift line 7: change to `private let logger = AppLogger(category: "ForcedAlignerInference")`
3. Sources/FluidAudio/ASR/ForcedAligner/ForcedAlignerModels.swift line 5: change to `private let logger = AppLogger(category: "ForcedAlignerModels")`
4. Sources/FluidAudio/ASR/ForcedAligner/ForcedAlignerTokenizer.swift line 4: change to `private let logger = AppLogger(category: "ForcedAlignerTokenizer")`
Also remove the `import OSLog` from each file and add `import Foundation` if not already present (AppLogger is in the same module so no extra import needed).
… safety net
Switch both Qwen3 ASR and ForcedAligner audio encoders from individual prediction() calls to the native batch API (predictions(fromBatch:)), reducing the IOSurface leak rate from ANE Conv2d layers. Add a periodic model reload every 300 samples in align-benchmark to reclaim leaked ANE IOSurface buffers on long runs (1000+ files). Verified: 1000/1000 Buckeye segments complete without crash. Remove the sleep(1) workaround from the Qwen3 ASR benchmark.
// Give system time to reclaim CoreML MLState IOSurface resources every 25 files.
// Without this pause, IOSurface limit (~200) is exhausted causing crashes.
if (index + 1) % 25 == 0 {
    logger.info("Memory cleanup pause...")
    try? await Task.sleep(for: .seconds(1))
}
🔴 Removal of IOSurface cleanup pause may cause crashes in long ASR benchmarks
The per-25-files sleep was removed from Qwen3AsrBenchmark.runBenchmarkLoop (lines 553-558 on LEFT). The original comment explicitly warned: "Without this pause, IOSurface limit (~200) is exhausted causing crashes." The batch API was added for the audio encoder, but the autoregressive decoder still uses per-prediction calls with MLState. The original sleep helped the system reclaim MLState IOSurface resources between files. Since the batch API only batches encoder predictions and doesn't affect decoder resource lifecycle, removing this pause could re-introduce IOSurface exhaustion crashes on long benchmark runs (100+ files).
Prompt for agents
In Sources/FluidAudioCLI/Commands/ASR/Qwen3AsrBenchmark.swift, the IOSurface cleanup pause (sleep every 25 files) was removed from the runBenchmarkLoop function around the lines after the inner file processing loop (after the catch block). The audio encoder was batched, but the decoder still uses per-prediction MLState calls which also leak IOSurface buffers. Either re-add the periodic sleep/cleanup (e.g. every 25-50 files) or add a periodic model reload (similar to ForcedAligner's approach of reloading every 300 samples in AlignBenchmark.swift:357-364) to reclaim leaked decoder MLState IOSurface resources during long benchmark runs.
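Either mitigation (a pause or a model reload) reduces to running a cleanup hook on a fixed cadence inside the benchmark loop. A minimal sketch with hypothetical names:

```swift
// Run `cleanup` after every `interval` processed files. The hook body is
// where a Task.sleep pause or a model reload would go to let the system
// reclaim leaked IOSurface/MLState resources on long runs.
func runLoop(fileCount: Int, interval: Int, cleanup: () -> Void) -> Int {
    precondition(interval > 0)
    var cleanups = 0
    for index in 0..<fileCount {
        // ... process file `index` here ...
        if (index + 1) % interval == 0 {
            cleanup()
            cleanups += 1
        }
    }
    return cleanups
}
```

The cadence is a tuning knob: 25-50 files matches the removed pause, while 300 matches the AlignBenchmark reload interval.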
let targetDir = directory ?? defaultCacheDirectory()
let modelsRoot = modelsRootDirectory()

if !force && modelsExist(at: targetDir) {
    logger.info("ForcedAligner models already present at: \(targetDir.path)")
    return targetDir
}

if force {
    try? FileManager.default.removeItem(at: targetDir)
}

logger.info("Downloading ForcedAligner int8 models from HuggingFace...")
try await DownloadUtils.downloadRepo(.forcedAlignerInt8, to: modelsRoot)
logger.info("Successfully downloaded ForcedAligner models")
return targetDir
🔴 ForcedAlignerModels.download(to:) ignores custom directory, always downloads to default path
When download(to:) is called with a non-nil directory argument, the method sets targetDir = directory (the custom path) but always downloads to modelsRoot = modelsRootDirectory() (the default Application Support path). After downloading, it returns targetDir — which still has no model files. This causes downloadAndLoad(to: customDir) (ForcedAlignerModels.swift:67-73) to download models to the default cache, then attempt to load(from: customDir) which will throw modelNotFound because the models were never placed there.
Prompt for agents
In Sources/FluidAudio/ASR/ForcedAligner/ForcedAlignerModels.swift, the download(to:force:) method on lines 77-97 needs to be fixed so that when a custom `directory` is provided, the models are actually downloaded to that directory (or its parent, depending on how downloadRepo works with folderName). Currently line 82 sets `modelsRoot = modelsRootDirectory()` unconditionally. When `directory` is non-nil, the download target should be derived from the custom directory. One fix: when `directory` is non-nil, compute `modelsRoot` as `directory.deletingLastPathComponent()` (assuming directory already includes the folderName suffix), or download directly to `directory`. Verify the behavior of `DownloadUtils.downloadRepo` — it appends `repo.folderName` to the `to:` parameter, so `modelsRoot` should be the parent of where models should land.
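The suggested derivation can be sketched as pure path logic. This is a hypothetical helper assuming, as the review notes, that downloadRepo appends the repo folder name to its `to:` argument:

```swift
import Foundation

// Derive the download root: when the caller passes a custom directory that
// already ends in the repo folder name, the download root is its parent;
// otherwise fall back to the default models root.
func downloadRoot(customDirectory: URL?, defaultRoot: URL) -> URL {
    guard let dir = customDirectory else { return defaultRoot }
    return dir.deletingLastPathComponent()
}
```

With this change, downloadRepo places the models so that a subsequent load(from: customDirectory) finds them.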
Resolve conflicts:
- Package.swift: keep CoreMLPredictionWrapper, drop removed ESpeakNG and FluidAudioEspeak targets (deleted on main)
- Package.resolved: take main's originHash
Summary
- Integrates alexwengg/Qwen3-ForcedAligner-0.6B-Coreml into FluidAudio
- ForcedAlignerManager public actor API and fluidaudio align CLI command

Architecture
- ForcedAlignerConfig.swift
- ForcedAlignerTypes.swift: WordAlignment, ForcedAlignmentResult, error types
- ForcedAlignerModels.swift
- ForcedAlignerMelSpectrogram.swift
- ForcedAlignerMRoPE.swift
- ForcedAlignerTokenizer.swift
- ForcedAlignerInference.swift
- ForcedAlignerManager.swift
- AlignCommand.swift: fluidaudio align audio.wav --text "transcript"

Key Design Decisions
- WhisperFeatureExtractor with mel_scale="slaney" + norm="slaney"
- torch.stft(center=True)

Test Results
Tested with 4 TTS-generated wav files of varying lengths:
Test plan
- Builds (swift build)
🤖 Generated with Claude Code