feat: integrate Qwen3-ForcedAligner-0.6B for per-word timestamp alignment #315
Alex-Wengg wants to merge 13 commits into main from
Conversation
Claude finished @Alex-Wengg's task: PR Review: Forced Aligner Integration ✅
Comprehensive review of this forced aligner integration PR completed.
Review Summary
This is a well-architected and comprehensive implementation of forced alignment using Qwen3-ForcedAligner-0.6B. The code demonstrates a strong understanding of the model pipeline and follows most project conventions. However, several critical issues should be addressed before merging.

Critical Issues (Blockers) 🚨
1. Actor Isolation Violation
2. Thread Safety Concerns
3. Memory Management

Major Issues (Should Fix)
Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode): optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown: time spent in each stage of batch diarization
Speaker Diarization Research Comparison: offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • NaNs meeting audio • NaNs processing • Test runtime: N/A • 03/07/2026, 10:38 AM EST
PocketTTS Smoke Test ✅
Runtime: 0m26s. Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality may differ from Apple Silicon.
VAD Benchmark Results
Performance Comparison
Dataset Details
✅: Average F1-Score above 70%
Parakeet EOU Benchmark Results ✅
Status: Benchmark passed
Performance Metrics
Streaming Metrics
Test runtime: 0m57s • 03/07/2026, 10:30 AM EST. RTFx = Real-Time Factor (higher is better) • Processing includes: model inference, audio preprocessing, state management, and file I/O
Qwen3-ASR int8 Smoke Test ✅
Runtime: 4m31s. Note: CI VM lacks physical GPU — CoreML MLState (macOS 15) KV cache produces degraded results on virtualized runners. On Apple Silicon: ~1.3% WER / 2.5x RTFx.
Sortformer High-Latency Benchmark Results
ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 2m 48s • 2026-03-07T15:31:26.724Z
ASR Benchmark Results ✅
Status: All benchmarks passed
Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming. 25 files per dataset • Test runtime: 8m22s • 03/07/2026, 10:40 AM EST. RTFx = Real-Time Factor (higher is better), calculated as: total audio duration ÷ total processing time.
Expected RTFx Performance on Physical M1 Hardware: M1 Mac ~28x (clean), ~25x (other). Testing methodology follows the HuggingFace Open ASR Leaderboard.
Speaker Diarization Benchmark Results
Speaker Diarization Performance: evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown: time spent in each stage of speaker diarization
Speaker Diarization Research Comparison: research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 59.2s diarization time • Test runtime: 2m 59s • 03/07/2026, 10:38 AM EST
…ment
Add forced alignment pipeline using a 3-model CoreML architecture (audio encoder + embedding + decoder with LM head) in a single non-autoregressive prefill pass. Produces per-word timestamps by aligning audio against a known transcript.
New files:
- ForcedAlignerConfig: model constants and special token IDs
- ForcedAlignerTypes: WordAlignment, ForcedAlignmentResult, error types
- ForcedAlignerModels: CoreML model download and loading
- ForcedAlignerMelSpectrogram: Slaney-scale mel with center padding
- ForcedAlignerMRoPE: interleaved multi-dimensional rotary embeddings
- ForcedAlignerTokenizer: BPE tokenizer with vocab.json/merges.txt
- ForcedAlignerInference: full 12-step alignment pipeline
- ForcedAlignerManager: public actor API
- AlignCommand: CLI command (`fluidaudio align`)
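ForcedAlignerMRoPE builds on rotary position embeddings. As a generic illustration of the underlying rotation (not the model's actual interleaved multi-dimensional variant; `applyRoPE` is a hypothetical name), each feature pair is rotated by a position-dependent angle:

```swift
import Foundation

// Generic rotary position embedding on one head dimension: each feature
// pair (x[2k], x[2k+1]) is rotated by an angle that depends on the token
// position and the pair index. Illustrative only; the model's MRoPE
// interleaves several position dimensions across pairs.
func applyRoPE(_ x: [Double], position: Int, base: Double = 10000) -> [Double] {
    precondition(x.count % 2 == 0)
    var out = x
    for k in 0..<(x.count / 2) {
        let theta = Double(position) / pow(base, Double(2 * k) / Double(x.count))
        let (c, s) = (cos(theta), sin(theta))
        out[2 * k]     = x[2 * k] * c - x[2 * k + 1] * s
        out[2 * k + 1] = x[2 * k] * s + x[2 * k + 1] * c
    }
    return out
}
```

Because each pair is a pure rotation, the vector norm is preserved, and position 0 is the identity.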
Document 5 bugs encountered during FluidAudio integration: MLMultiArray stride issues, encoder 3D shape, Slaney vs HTK mel, STFT center padding, and MRoPE position clamping.
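The Slaney-vs-HTK mismatch mentioned above is a common pitfall: the two mel scales use different formulas, so filterbanks built from them disagree. A sketch of the standard librosa-style definitions (not the project's actual implementation):

```swift
import Foundation

// HTK mel scale: a single log curve over the whole frequency range.
func htkMel(_ hz: Double) -> Double {
    2595.0 * log10(1.0 + hz / 700.0)
}

// Slaney mel scale: linear below 1 kHz, logarithmic above it.
func slaneyMel(_ hz: Double) -> Double {
    let fSp = 200.0 / 3.0          // Hz per mel in the linear region
    let minLogHz = 1000.0          // linear/log breakpoint
    let minLogMel = minLogHz / fSp // mel value at the breakpoint (15.0)
    let logStep = log(6.4) / 27.0  // log-region step size
    return hz < minLogHz ? hz / fSp : minLogMel + log(hz / minLogHz) / logStep
}
```

At 1 kHz the HTK value is near 1000 mel while the Slaney value is 15 mel, so mixing the two silently shifts every mel filter center.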
Remove misplaced mobius doc, add ForcedAligner.md covering architecture, pipeline steps, public API, CLI usage, and file reference.
Add Documentation/ForcedAligner.md covering architecture, API usage, CLI, test results, and limitations. Marked as beta.
The CoreML ANE runtime (e5rt) leaks IOSurface buffers on repeated individual predictions for models with conv ops. After ~500 calls the process crashes with allocation failures. Fix: Use MLModel.predictions(fromBatch:) for the audio encoder instead of per-chunk predict() calls. This reduces encoder calls from ~N*23 (chunks per segment) to N (one batch per segment), avoiding the leak accumulation entirely. Also adds CoreMLPredictionWrapper (ObjC @autoreleasepool) for embedding and decoder predictions, and changes default compute units from .all to .cpuAndGPU. Verified: 500 samples on Buckeye corpus, 9.4x RTFx, no crashes, no model reload needed.
Adds `fluidaudio align-benchmark` for evaluating ForcedAligner against the Buckeye Corpus (human-annotated word-level timestamps). Reports AAS, tolerance percentiles, and RTFx. Also adds `fluidaudio download --dataset buckeye` to fetch the segmented Buckeye dataset from HuggingFace (alexwengg/buckeye).
Add rule to use Buckeye Corpus (not LibriSpeech) for forced alignment evaluation since it has human-annotated word-level timestamps.
Documents all approaches tested (ObjC autorelease, batch API, model surgery, baked compute units) with benchmark results. Native batch API was the winner at 9.4x RTFx with no crashes.
Adds run_pytorch_benchmark.py for running PyTorch Qwen3-ForcedAligner against the Buckeye Corpus. Auto-downloads dataset from HuggingFace (alexwengg/buckeye) if not present locally.
        }
        [results addObject:result];
    }
    i = batchEnd - 1; // -1 because the for loop will increment
🟡 NSUInteger underflow causes infinite loop in batchPredictWithModel when drainInterval is 0
In CoreMLPredictionWrapper.m:42, when drainInterval is 0, batchEnd = MIN(i + 0, inputs.count) equals i (which is 0 on the first iteration). The inner loop for j = i; j < batchEnd doesn't execute. Then i = batchEnd - 1 computes 0 - 1 on an NSUInteger (unsigned), which wraps to NSUIntegerMax. The subsequent i++ wraps it back to 0, creating an infinite loop. This method is currently dead code (only predictWithModel is called), but it's a public API that could be invoked by future callers.
Prompt for agents
In Sources/CoreMLPredictionWrapper/CoreMLPredictionWrapper.m, add a guard at the top of batchPredictWithModel to handle drainInterval == 0. Either return nil with an error, or default drainInterval to 1. The underflow occurs at line 42 where `i = batchEnd - 1` computes 0 - 1 on NSUInteger. A simple fix is to add at line 27 (after the results allocation): `if (drainInterval == 0) drainInterval = 1;`
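The guarded loop shape can be sketched in Swift, where UInt mirrors NSUInteger's wrap-on-underflow behavior (names are illustrative, not the wrapper's actual API):

```swift
// Sketch of the batch loop from the review. With drainInterval == 0 the
// original code computed i = batchEnd - 1 on an unsigned index, wrapping
// to UInt.max. Clamping the interval to at least 1 removes the hazard,
// and advancing i to batchEnd avoids the -1/+1 dance entirely.
func batchIndices(count: UInt, drainInterval: UInt) -> [(start: UInt, end: UInt)] {
    let interval = max(drainInterval, 1)  // guard suggested in the review
    var ranges: [(start: UInt, end: UInt)] = []
    var i: UInt = 0
    while i < count {
        let batchEnd = min(i &+ interval, count)
        ranges.append((i, batchEnd))
        i = batchEnd  // advance past the batch; no underflow possible
    }
    return ranges
}
```

With the guard in place, a zero interval degrades to per-item batches instead of an infinite loop.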
public actor ForcedAlignerManager {
    private var models: ForcedAlignerModels?
    private var tokenizer: ForcedAlignerTokenizer?

    public init() {}
🔴 No unit tests added for the entire ForcedAligner module
AGENTS.md and CLAUDE.md both mandate: "Add unit tests when writing new code." This PR adds ~1400 lines of new library code across 8 files (ForcedAlignerConfig, ForcedAlignerTypes, ForcedAlignerModels, ForcedAlignerMelSpectrogram, ForcedAlignerMRoPE, ForcedAlignerTokenizer, ForcedAlignerInference, ForcedAlignerManager) with zero test coverage. Key components like BPE tokenization (encode()), mel spectrogram computation, LIS-based timestamp fixing (fixTimestamps()), and MRoPE computation are all pure functions that are straightforward to unit test without models.
Prompt for agents
Add a new test file Tests/FluidAudioTests/ForcedAlignerTests.swift with unit tests for the ForcedAligner module. At minimum, test: (1) ForcedAlignerTokenizer.encode() with known words and expected BPE token IDs, (2) ForcedAlignerTokenizer.tokenize() verifying correct timestamp token placement and word count, (3) ForcedAlignerMelSpectrogram.reflectPad() correctness, (4) ForcedAlignerMelSpectrogram.compute() with a simple sine wave input verifying output dimensions, (5) ForcedAlignerMRoPE.compute() verifying output dimensions and that padded positions repeat the last valid position, (6) ForcedAlignerInference.fixTimestamps() with monotonic input (no-op), decreasing input, and mixed input.
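For item (6), a minimal sketch of what an LIS-based monotonicity fix could look like; this is a hypothetical implementation for test design, and the library's fixTimestamps() may differ:

```swift
// Keep the longest non-decreasing subsequence of timestamps and replace
// every rejected entry with its predecessor, so the output is monotonic.
// O(n^2) LIS is fine at word-count scale.
func fixTimestamps(_ ts: [Double]) -> [Double] {
    guard !ts.isEmpty else { return [] }
    var lis = [Int](repeating: 1, count: ts.count)
    var prev = [Int](repeating: -1, count: ts.count)
    for i in 1..<ts.count {
        for j in 0..<i where ts[j] <= ts[i] && lis[j] + 1 > lis[i] {
            lis[i] = lis[j] + 1
            prev[i] = j
        }
    }
    // Walk back from the LIS endpoint to collect the kept indices.
    var keep = Set<Int>()
    var k = lis.indices.max(by: { lis[$0] < lis[$1] })!
    while k >= 0 {
        keep.insert(k)
        k = prev[k]
    }
    let firstKept = keep.min()!
    var out = ts
    for i in 0..<firstKept { out[i] = ts[firstKept] }          // leading outliers
    for i in (firstKept + 1)..<out.count where !keep.contains(i) {
        out[i] = out[i - 1]                                    // clamp rejected entries
    }
    return out
}
```

Tests would then cover the three cases the review names: monotonic input (no-op), a decreasing outlier, and a leading outlier.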
/// to tokenize input text for the forced alignment pipeline.
///
/// The forced aligner formats input as:
/// `<|audio_start|><|audio_pad|><|audio_end|><timestamp><timestamp>word1<timestamp><timestamp>word2<timestamp><timestamp>`
🟡 Tokenizer struct docstring shows leading timestamps but code produces none, risking incorrect future changes
The class-level docstring on ForcedAlignerTokenizer at line 13 states the format is <|audio_start|><|audio_pad|><|audio_end|><timestamp><timestamp>word1<timestamp><timestamp>word2<timestamp><timestamp> — showing leading <timestamp><timestamp> before the first word. However, the actual tokenize() implementation at lines 110-115 explicitly does NOT add leading timestamps (if i > 0), and the inline comment at line 109 correctly says (NO leading timestamps before the first word). The struct docstring contradicts the code and its own inline comment, which could mislead future developers into "fixing" the code to match the docstring and breaking alignment.
Suggested change:
- /// `<|audio_start|><|audio_pad|><|audio_end|><timestamp><timestamp>word1<timestamp><timestamp>word2<timestamp><timestamp>`
+ /// `<|audio_start|><|audio_pad|>...<|audio_end|>word1<timestamp><timestamp>word2<timestamp><timestamp>...wordN<timestamp><timestamp>`
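The corrected layout (no timestamp slots before the first word, two after every word) can be sketched as follows; `layoutTokens` is a hypothetical helper, not the library's tokenize(), and real words are further split into BPE pieces:

```swift
// Build the special-token layout for forced alignment input:
// audio region first, then each word followed by two timestamp slots.
func layoutTokens(words: [String], audioPadCount: Int) -> [String] {
    var tokens = ["<|audio_start|>"]
    tokens += Array(repeating: "<|audio_pad|>", count: audioPadCount)
    tokens.append("<|audio_end|>")
    for word in words {
        tokens.append(word)                       // in reality: the word's BPE pieces
        tokens += ["<timestamp>", "<timestamp>"]  // start/end slots filled by decoding
    }
    return tokens
}
```

Note the first token after `<|audio_end|>` is a word, matching the inline comment in tokenize() rather than the stale struct docstring.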
…ce leak
The audio encoder's Conv2d layers route to ANE, which leaks IOSurface buffers on repeated individual prediction() calls. Switch to the native batch API (predictions(fromBatch:)) to manage IOSurface lifecycle in a single allocation/release cycle, matching the approach in ForcedAligner. Also removes the sleep(1) workaround from the benchmark and fixes pyproject.toml to use the PyPI qwen-asr package.
import Foundation
import OSLog

private let logger = Logger(subsystem: "FluidAudio", category: "ForcedAlignerManager")
🟡 ForcedAligner uses Logger instead of AppLogger, violating logging convention
CLAUDE.md specifies: "Logging: Use AppLogger(category:) from Shared/AppLogger.swift — not print() in production code." All four new ForcedAligner source files in Sources/FluidAudio/ use Logger(subsystem: "FluidAudio", category: ...) from OSLog directly instead of the project's AppLogger(category:) wrapper. This means ForcedAligner log messages won't be mirrored to console in DEBUG builds and won't use the project's standard subsystem "com.fluidinference". Affected files: ForcedAlignerInference.swift:7, ForcedAlignerModels.swift:5, ForcedAlignerTokenizer.swift:4, ForcedAlignerManager.swift:5.
Prompt for agents
Replace all `Logger(subsystem: "FluidAudio", category: ...)` declarations in the four ForcedAligner files with `AppLogger(category: ...)` to match the project convention. Files to change:
1. Sources/FluidAudio/ASR/ForcedAligner/ForcedAlignerManager.swift line 5: change to `private let logger = AppLogger(category: "ForcedAlignerManager")`
2. Sources/FluidAudio/ASR/ForcedAligner/ForcedAlignerInference.swift line 7: change to `private let logger = AppLogger(category: "ForcedAlignerInference")`
3. Sources/FluidAudio/ASR/ForcedAligner/ForcedAlignerModels.swift line 5: change to `private let logger = AppLogger(category: "ForcedAlignerModels")`
4. Sources/FluidAudio/ASR/ForcedAligner/ForcedAlignerTokenizer.swift line 4: change to `private let logger = AppLogger(category: "ForcedAlignerTokenizer")`
Also remove the `import OSLog` from each file and add `import Foundation` if not already present (AppLogger is in the same module so no extra import needed).
… safety net
Switch both Qwen3 ASR and ForcedAligner audio encoders from individual prediction() calls to the native batch API (predictions(fromBatch:)), reducing the IOSurface leak rate from ANE Conv2d layers. Add a periodic model reload every 300 samples in align-benchmark to reclaim leaked ANE IOSurface buffers on long runs (1000+ files). Verified: 1000/1000 Buckeye segments complete without crash. Remove the sleep(1) workaround from the Qwen3 ASR benchmark.
// Give system time to reclaim CoreML MLState IOSurface resources every 25 files.
// Without this pause, IOSurface limit (~200) is exhausted causing crashes.
if (index + 1) % 25 == 0 {
    logger.info("Memory cleanup pause...")
    try? await Task.sleep(for: .seconds(1))
}
🔴 Removal of IOSurface cleanup pause may cause crashes in long ASR benchmarks
The per-25-files sleep was removed from Qwen3AsrBenchmark.runBenchmarkLoop (lines 553-558 on LEFT). The original comment explicitly warned: "Without this pause, IOSurface limit (~200) is exhausted causing crashes." The batch API was added for the audio encoder, but the autoregressive decoder still uses per-prediction calls with MLState. The original sleep helped the system reclaim MLState IOSurface resources between files. Since the batch API only batches encoder predictions and doesn't affect decoder resource lifecycle, removing this pause could re-introduce IOSurface exhaustion crashes on long benchmark runs (100+ files).
Prompt for agents
In Sources/FluidAudioCLI/Commands/ASR/Qwen3AsrBenchmark.swift, the IOSurface cleanup pause (sleep every 25 files) was removed from the runBenchmarkLoop function around the lines after the inner file processing loop (after the catch block). The audio encoder was batched, but the decoder still uses per-prediction MLState calls which also leak IOSurface buffers. Either re-add the periodic sleep/cleanup (e.g. every 25-50 files) or add a periodic model reload (similar to ForcedAligner's approach of reloading every 300 samples in AlignBenchmark.swift:357-364) to reclaim leaked decoder MLState IOSurface resources during long benchmark runs.
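Either mitigation (a pause or a model reload) reduces to running a cleanup hook on a fixed cadence inside the benchmark loop. A minimal sketch with hypothetical names:

```swift
// Run `cleanup` after every `interval` processed files. The hook body is
// where a Task.sleep pause or a model reload would go to let the system
// reclaim leaked IOSurface/MLState resources on long runs.
func runLoop(fileCount: Int, interval: Int, cleanup: () -> Void) -> Int {
    precondition(interval > 0)
    var cleanups = 0
    for index in 0..<fileCount {
        // ... process file `index` here ...
        if (index + 1) % interval == 0 {
            cleanup()
            cleanups += 1
        }
    }
    return cleanups
}
```

The cadence is a tuning knob: 25-50 files matches the removed pause, while 300 matches the AlignBenchmark reload interval.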
let targetDir = directory ?? defaultCacheDirectory()
let modelsRoot = modelsRootDirectory()

if !force && modelsExist(at: targetDir) {
    logger.info("ForcedAligner models already present at: \(targetDir.path)")
    return targetDir
}

if force {
    try? FileManager.default.removeItem(at: targetDir)
}

logger.info("Downloading ForcedAligner int8 models from HuggingFace...")
try await DownloadUtils.downloadRepo(.forcedAlignerInt8, to: modelsRoot)
logger.info("Successfully downloaded ForcedAligner models")
return targetDir
🔴 ForcedAlignerModels.download(to:) ignores custom directory, always downloads to default path
When download(to:) is called with a non-nil directory argument, the method sets targetDir = directory (the custom path) but always downloads to modelsRoot = modelsRootDirectory() (the default Application Support path). After downloading, it returns targetDir — which still has no model files. This causes downloadAndLoad(to: customDir) (ForcedAlignerModels.swift:67-73) to download models to the default cache, then attempt to load(from: customDir) which will throw modelNotFound because the models were never placed there.
Prompt for agents
In Sources/FluidAudio/ASR/ForcedAligner/ForcedAlignerModels.swift, the download(to:force:) method on lines 77-97 needs to be fixed so that when a custom `directory` is provided, the models are actually downloaded to that directory (or its parent, depending on how downloadRepo works with folderName). Currently line 82 sets `modelsRoot = modelsRootDirectory()` unconditionally. When `directory` is non-nil, the download target should be derived from the custom directory. One fix: when `directory` is non-nil, compute `modelsRoot` as `directory.deletingLastPathComponent()` (assuming directory already includes the folderName suffix), or download directly to `directory`. Verify the behavior of `DownloadUtils.downloadRepo` — it appends `repo.folderName` to the `to:` parameter, so `modelsRoot` should be the parent of where models should land.
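The suggested derivation can be sketched as pure path logic. This is a hypothetical helper assuming, as the review notes, that downloadRepo appends the repo folder name to its `to:` argument:

```swift
import Foundation

// Derive the download root: when the caller passes a custom directory that
// already ends in the repo folder name, the download root is its parent;
// otherwise fall back to the default models root.
func downloadRoot(customDirectory: URL?, defaultRoot: URL) -> URL {
    guard let dir = customDirectory else { return defaultRoot }
    return dir.deletingLastPathComponent()
}
```

With this change, downloadRepo places the models so that a subsequent load(from: customDirectory) finds them.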
Resolve conflicts:
- Package.swift: keep CoreMLPredictionWrapper, drop removed ESpeakNG and FluidAudioEspeak targets (deleted on main)
- Package.resolved: take main's originHash
Summary
- Integrates alexwengg/Qwen3-ForcedAligner-0.6B-Coreml into FluidAudio
- ForcedAlignerManager public actor API and fluidaudio align CLI command

Architecture
- ForcedAlignerConfig.swift
- ForcedAlignerTypes.swift: WordAlignment, ForcedAlignmentResult, error types
- ForcedAlignerModels.swift
- ForcedAlignerMelSpectrogram.swift
- ForcedAlignerMRoPE.swift
- ForcedAlignerTokenizer.swift
- ForcedAlignerInference.swift
- ForcedAlignerManager.swift
- AlignCommand.swift: fluidaudio align audio.wav --text "transcript"

Key Design Decisions
- WhisperFeatureExtractor with mel_scale="slaney" + norm="slaney"
- torch.stft(center=True)

Test Results
Tested with 4 TTS-generated wav files of varying lengths:
Test plan
- Builds (swift build)
🤖 Generated with Claude Code