fix(asr): avoid v3 multilingual seam drift #604

Draft
vdt4534 wants to merge 8 commits into
FluidInference:fix/asr-594-french-chunk-boundary from
vdt4534:codex/asr-594-v3-seam-warmup

Conversation


@vdt4534 vdt4534 commented May 12, 2026

Summary

This is a stacked draft PR on top of #596 for issue #594. It keeps the opt-in multilingualChunkContinuity API, but changes the v3 implementation so the path fixes the original dropped-word repro without introducing English-token drift on additional French fixtures.

The change is intentionally v3-only. It does not use language hints, deterministic vocabulary filtering, or French/English-specific token rules.

What changed

  • Route the multilingualChunkContinuity special path only for parakeet-tdt-0.6b-v3; other model versions keep the default parallel path and log a warning if the flag is set.
  • Stop persisting TdtDecoderState across v3 batch chunks. Each chunk gets a fresh decoder state.
  • Prepend a short real-audio prefix to non-first chunks: 7 encoder frames, about 560 ms.
  • Decode through that prefix as warmup, but suppress prefix-region emitted tokens from the returned token window.
  • Add conservative silence-aligned chunk starts based only on audio energy near the nominal boundary. If no true near-silence exists nearby, the regular frame-aligned boundary is retained.
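
The changes above can be sketched numerically. Below is an illustrative Python sketch (not the Swift implementation): it shows the frame/sample arithmetic behind the 7-frame warmup prefix and a conservative energy-based search for a near-silent chunk start. The 16 kHz sample rate and 1,280-sample (80 ms) encoder frame size are inferred from the "7 encoder frames, about 560 ms" figure; the energy threshold and search window are made-up example values.

```python
SAMPLE_RATE = 16_000
SAMPLES_PER_ENCODER_FRAME = 1_280               # 80 ms per encoder frame
PREFIX_FRAMES = 7
PREFIX_SAMPLES = PREFIX_FRAMES * SAMPLES_PER_ENCODER_FRAME  # 8_960 samples = 560 ms

def silence_aligned_start(audio, nominal_start, search_frames=4,
                          energy_threshold=1e-4):
    """Return a nearby frame-aligned start whose frame is near-silent.

    Scans up to `search_frames` encoder frames on either side of the
    nominal boundary, nearest offsets first. If no frame's mean energy
    falls below the threshold, the nominal frame-aligned boundary is
    kept unchanged (the conservative fallback described above).
    """
    for offset in sorted(range(-search_frames, search_frames + 1), key=abs):
        start = nominal_start + offset * SAMPLES_PER_ENCODER_FRAME
        if start < 0 or start + SAMPLES_PER_ENCODER_FRAME > len(audio):
            continue
        frame = audio[start:start + SAMPLES_PER_ENCODER_FRAME]
        energy = sum(s * s for s in frame) / len(frame)
        if energy < energy_threshold:
            return start
    return nominal_start
```

Scanning nearest-first keeps the shift minimal when several near-silent frames exist around the boundary.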

Why

PR #596 fixes the original notes_1408_clean.wav drop, but in my test matrix its persisted decoder-state variant regressed two French fixtures by injecting English BPE at or near seams:

  • wwii_belgique_fr.wav: "In Belgique, le 8 mai has been jour férié"
  • user2_2026-05-12.wav: "blouses médicales blanches and portant des masques"

The working hypothesis is that carrying predictor state across chunks pairs chunk N decoder state with chunk N+1 encoder frames that were re-encoded under different positional/convolutional context. On multilingual v3, that out-of-distribution joint-network pair can collapse toward the model's English prior.

This variant avoids that persisted predictor/encoder mismatch while still giving each chunk enough real audio warmup to recover boundary words.

Validation

Fresh local checks:

  • swift build -c release --product fluidaudiocli
  • swift test --filter ChunkProcessorTests -> 44 tests, 0 failures
  • Five-fixture matrix with --multilingual-chunks -> clean transcripts on all five fixtures tested
| Fixture | PR #596, flag ON (before this patch) | This patch, flag ON |
| --- | --- | --- |
| notes_1408_clean.wav | clean | clean |
| wwii_belgique_fr.wav | English drift: "In Belgique ... has been" | clean |
| user_2026-05-12.wav | clean | clean |
| user2_2026-05-12.wav | English drift: "and portant des masques" | clean |
| climate_2026_fr_voice_memo.wav | current flag-OFF and bb96003 both drift at least once | clean |

The fifth fixture is a new 110.976s French Voice Memo converted to mono 16 kHz Int16 WAV. I can attach it to issue #594 as an additional regression fixture if useful.

Timing notes

This path is slower than default parallel chunking because it remains sequential.

Measured wall-clock with the release CLI on my machine:

| Fixture | PR branch, flag OFF (avg) | This patch, flag ON (avg) |
| --- | --- | --- |
| user2_2026-05-12.wav | 0.513s | 0.943s |
| climate_2026_fr_voice_memo.wav | 0.513s | 1.047s |

Compared to the old bb96003 pin on the same longer fixtures, this patch runs at about 1.4x the wall-clock time.

Alex-Wengg and others added 8 commits May 11, 2026 11:16
…ion termination (FluidInference#594)

Batch transcription drifted French to English at every ~15s chunk boundary
on parakeet-tdt-0.6b-v3-coreml. Streaming on the same audio was clean.
Root cause is three interacting issues:

1. ChunkProcessor created a fresh TdtDecoderState per chunk and SOS-primed
   with the blank token. For non-first chunks this starts the LSTM
   mid-utterance, biased toward TDT v3's English prior.

2. Non-first chunks received only ~80ms of mel-context prefix (from FluidInference#264),
   while streaming uses ~2s of actual leading audio. FastConformer's
   depthwise convs produce language-biased logits with too little audio
   history, even when the decoder state is correct.

3. When the decoder emits a sentence-final token mid-chunk, the LSTM
   enters a state where the joint predicts BLANK for the remaining frames,
   silently dropping audio. Masked by the per-chunk SOS reset; surfaces
   once state is persisted.

Fix:
- ChunkProcessor.process: serialize chunk processing, persist
  TdtDecoderState across chunks (matches SlidingWindowAsrManager).
- ChunkProcessor: extend non-first chunk audio prefix from 80ms mel-context
  to 2.0s of actual audio. Decoder skips prefix encoder frames via
  contextFrameAdjustment; timestamps remain anchored on global frames.
- TdtDecoderV3: after a sentence-final token, if the decoder emits a long
  blank-only streak with audio remaining, clear predictorOutput to
  re-engage emission while preserving LSTM state.
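
The blank-streak recovery in the third fix can be sketched as a small state machine. This is a hedged Python sketch, not the actual TdtDecoderV3 code; `BLANK_ID`, the sentence-final token set, and `max_blank_streak` are illustrative placeholders.

```python
BLANK_ID = 0
SENTENCE_FINAL = {".", "!", "?"}

class DecoderStateSketch:
    """Stand-in for the recovery-relevant bits of TdtDecoderState."""
    def __init__(self):
        self.predictor_output = "cached"   # placeholder for the cached tensor
        self.blank_streak = 0
        self.after_sentence_final = False

def step(state, emitted_token, token_text, max_blank_streak=8):
    """One decode step of recovery bookkeeping; returns True when the
    cached predictor output is cleared to re-engage emission (the LSTM
    hidden/cell state is deliberately left untouched)."""
    if emitted_token == BLANK_ID:
        state.blank_streak += 1
        if state.after_sentence_final and state.blank_streak > max_blank_streak:
            state.predictor_output = None   # clear cache only; keep LSTM state
            state.after_sentence_final = False
            state.blank_streak = 0
            return True
    else:
        state.blank_streak = 0
        state.after_sentence_final = token_text in SENTENCE_FINAL
    return False
```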

Verified on reporter's notes_1408_clean.wav: drift gone with --language fr.
English LibriSpeech test-clean smoke (N=5): WER unchanged vs main.
Streaming path unchanged.

Preserves FluidInference#264's chunk-boundary token-loss fix.

Closes FluidInference#594
…h ASR

French batch transcription with parakeet-tdt-0.6b-v3-coreml was drifting
to English at every ~15s chunk boundary because per-chunk SOS priming
re-applied the model's English-biased prior, and the encoder lacked
enough left context to escape it.

This commit wraps the chunk-boundary continuity behavior behind an opt-in
`multilingualChunkContinuity` boolean on ASRConfig (default `false`). The
default behavior is unchanged from main (parallel chunks, fresh SOS per
chunk, English WER 2.64% on LibriSpeech test-clean). When opted in, the
chunk path uses:

- Serialized chunk processing with persisted TdtDecoderState across
  chunks (LSTM hidden/cell + lastToken + predictorOutput).
- 2.0s real-audio prefix prepended to chunks N>=1 as streaming-style
  warmup; decoder consumes prefix frames normally.
- Reserved chunk size (~13s actual audio) so prefix fits within the
  encoder's 240,000-sample window.
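
The reserved chunk sizing falls out of simple arithmetic. A sketch, assuming 16 kHz audio and 1,280-sample (80 ms) encoder frames; the "~13s" figure above matches the frame-aligned result of 12.96 s:

```python
SAMPLE_RATE = 16_000
MAX_MODEL_SAMPLES = 240_000                  # 15 s encoder window
SAMPLES_PER_ENCODER_FRAME = 1_280            # 80 ms per frame (assumed)
PREFIX_SAMPLES = 2 * SAMPLE_RATE             # 2.0 s real-audio warmup prefix

def reserved_chunk_samples():
    """Largest frame-aligned chunk that still fits alongside the prefix."""
    budget = MAX_MODEL_SAMPLES - PREFIX_SAMPLES          # 208_000 samples
    return (budget // SAMPLES_PER_ENCODER_FRAME) * SAMPLES_PER_ENCODER_FRAME

# 207_360 samples = 12.96 s of actual audio per non-first chunk
```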

The in-decoder punctuation guard (recovery from blank-emission spirals
after sentence-final tokens) added in 2206dd1 is enabled in both
modes; bisection during PR investigation confirmed it is benign on the
English baseline.

CLI: pass --multilingual-chunks to fluidaudiocli transcribe and
asr-benchmark to enable the new path.

Validation (LibriSpeech test-clean, 100 files; reporter French audio):
- Flag OFF (default, matches origin/main): English WER 2.640%
- Flag ON: English WER 5.264%
- Flag OFF (reporter French): drifts ("rest of the key what is that")
  -- confirms gating restores original behavior.
- Flag ON (reporter French): correct, "reste de l'équipe" present.

Known limitation (flag ON): a few LibriSpeech files (e.g. 1089-134691-0009,
1188-133604-0011) still drop continuation content past a sentence-final
chunk seam due to encoder prosodic bleed >75 frames. Tracked as
followup; the chosen variant is the empirically best non-regressing
combination across 8 decoder-side hypotheses tested.
…path

Phase 2 of FluidInference#594. The unconditional contextFrameAdjustment from fix B
desyncs encoder/predictor at chunk boundaries, costing ~0.8pp WER on
LibriSpeech test-clean. Replace with the softer streaming-warmup
variant B': prepend the 2.0s real-audio prefix for encoder left context
but let the decoder emit naturally through the prefix region, relying
on the LCS/midpoint merger to discard prefix tokens during chunk merge.

Combined with seam-clear LSTM reset, LCS punctuation-blind matcher, and
effective-left-end in mergeByMidpoint, brings flag-ON path from 5.26%
to 4.19% English WER. Flag-OFF preserved at 2.6403%. French fix
preserved.
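
The B' merge idea (emit naturally through the prefix, discard duplicates at merge time) can be illustrated with a toy overlap merger. This is far simpler than the real LCS/midpoint merger (no timestamps, no punctuation-blind matching) and only demonstrates the principle.

```python
from difflib import SequenceMatcher

def merge_overlap(prev_tokens, next_tokens):
    """Join two token lists whose tail/head overlap, keeping the
    longest common run once; falls back to plain concatenation when no
    anchor is found (the real merger falls back to mergeByMidpoint)."""
    m = SequenceMatcher(a=prev_tokens, b=next_tokens, autojunk=False)
    a, b, size = m.find_longest_match(0, len(prev_tokens), 0, len(next_tokens))
    if size == 0:
        return prev_tokens + next_tokens
    return prev_tokens[:a + size] + next_tokens[b + size:]
```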
…Offset

In processWithMultilingualContinuity, the audio buffer for chunks N>=1
starts at `contextStart = chunkStart - contextSamples` (2.0s of real
audio prepended for encoder left context), but transcribeChunk was
being given `chunkStart` as its origin. Inside transcribeChunk,
`globalFrameOffset = chunkStart / samplesPerEncoderFrame` then placed
every token in chunks N>=1 at +25 frames (+2.0s) past its true
position in the original audio timeline.

Consequence: chunk N's prefix tokens (covering audio that actually
overlaps chunk N-1's tail) landed in timestamp space _after_ chunk
N-1's end, beyond the merger's 1.0s halfOverlapWindow tolerance. LCS
and contiguous matchers could not anchor across the boundary, so every
seam fell through to mergeByMidpoint, which duplicated ~2s of content
at every chunk join.

Pass `contextStart` instead. Prefix tokens now overlap chunk N-1
correctly, LCS matches anchor properly, and the merger can dedupe as
designed. LibriSpeech test-clean (100 files, flag ON): 4.19% → 2.90%
WER. Flag-OFF unchanged at 2.6403%. French fix preserved.
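
The bug described here is just an origin error in the frame mapping. A Python sketch, with assumed 16 kHz / 1,280-sample-frame constants:

```python
SAMPLE_RATE = 16_000
SAMPLES_PER_ENCODER_FRAME = 1_280              # 80 ms per frame
CONTEXT_SAMPLES = 2 * SAMPLE_RATE              # 2.0 s (25-frame) prefix

def global_frame(frame_in_buffer, buffer_origin_samples):
    """Map a token's frame index inside the chunk buffer to a global
    frame index on the original audio timeline."""
    return frame_in_buffer + buffer_origin_samples // SAMPLES_PER_ENCODER_FRAME

chunk_start = 10 * SAMPLE_RATE                 # chunk nominally begins at 10.0 s
context_start = chunk_start - CONTEXT_SAMPLES  # buffer really begins at 8.0 s

buggy = global_frame(0, chunk_start)    # 125 frames (10.0 s): +25-frame error
fixed = global_frame(0, context_start)  # 100 frames (8.0 s): true position
```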

Credit: Devin AI review on PR FluidInference#596.
Three followups from the bot review of PR FluidInference#596:

1. Update stale WER doc comment in AsrTypes.swift
   The `multilingualChunkContinuity` doc said "~4.40% English WER", which
   referred to an intermediate variant before the Devin globalFrameOffset
   fix landed. The validated landed number is 2.90% (vs 2.64% with flag
   off, a +0.26pp cost).

2. Warn when `multilingualChunkContinuity=true` on a non-v3 model
   The flag's sequential serialization + 2.0s audio prefix is designed
   to mitigate parakeet-tdt-0.6b-v3 English-prior drift. On v2 /
   tdtCtc110m / ctcZhCn / tdtJa it still produces correct output, but
   only adds latency with no benefit, so log a warning once when the
   path is entered with a non-v3 model.

3. Unit tests for the new code path (CLAUDE.md policy: "Add unit tests
   when writing new code")
   - ASRConfigTests: multilingualChunkContinuity defaults to false,
     preserves explicit true/false, and doesn't disturb other fields.
   - ChunkProcessorTests (via #if DEBUG accessors):
     * audio prefix is exactly 32000 samples (2.0s @ 16kHz), encoder-
       frame-aligned (multiple of 1280).
     * multilingual chunk size + prefix ≤ maxModelSamples (240000),
       and chunk size is frame-aligned.
     * multilingual chunk size is strictly smaller than default chunk
       size (it has to give up content to make prefix room).
     * chunkSamples(multilingualContinuity:) dispatches correctly to
       either the default or multilingual sizing path.
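
Those invariants can be restated as executable checks. The 32,000-sample prefix, 240,000-sample window, and 1,280-sample frames come from the text above; treating the default chunk size as the full 240,000-sample window is an assumption made here only for the strict-inequality check.

```python
PREFIX_SAMPLES = 32_000                      # 2.0 s @ 16 kHz
FRAME = 1_280                                # one encoder frame
MAX_MODEL_SAMPLES = 240_000
DEFAULT_CHUNK_SAMPLES = MAX_MODEL_SAMPLES    # assumption, see above
MULTILINGUAL_CHUNK_SAMPLES = (
    (MAX_MODEL_SAMPLES - PREFIX_SAMPLES) // FRAME
) * FRAME

def chunk_samples(multilingual_continuity):
    """Sketch of the chunkSamples(multilingualContinuity:) dispatch."""
    return (MULTILINGUAL_CHUNK_SAMPLES if multilingual_continuity
            else DEFAULT_CHUNK_SAMPLES)
```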
Devin PR-596 review (2nd pass) flagged two issues:

1. (real bug) `processWithMultilingualContinuity` persists `timeJump`
   across chunks. `TdtDecoderV3.decodeWithTimings` writes
   `decoderState.timeJump = currentTimeIndices - effectiveSequenceLength`
   for non-last chunks. On the next chunk,
   `TdtFrameNavigation.calculateInitialTimeIndices` either skips into
   the 2.0s prefix region (prevTimeJump > 0) or, when prevTimeJump == 0,
   returns the special-case `standardOverlapFrames` (25 = exactly the
   prefix length) which skips the prefix entirely. Either case breaks
   the merger's ability to anchor tokens across the chunk seam.

   Fix: explicitly `decoderState.timeJump = nil` after each non-last
   chunk so the next chunk's decoder starts at frame 0 of the buffer
   (which already begins with the 2.0s prefix). The punctuation-seam
   `reset()` already nils timeJump as a side effect; this clear handles
   every other boundary the same way.

2. (doc accuracy) Both ChunkProcessor.processWithParallelChunks and
   ASRConfig.multilingualChunkContinuity claimed the default path was
   bit-for-bit identical to pre-FluidInference#596 main. The shared
   `mergeChunks` / `mergeByMidpoint` punctuation-aware LCS matcher and
   trailing-punctuation midpoint adjustment apply to both paths.
   Empirically WER-neutral on LibriSpeech test-clean (validated 2.64%),
   but the merger algorithm is no longer literally identical. Doc
   updated to reflect this.

Tests: TdtDecoderStateTests gains `testTimeJumpNilingForMultilingualContinuityPath`
documenting that `timeJump` is independently nilable without disturbing
LSTM hidden/cell, lastToken, or predictorOutput. `testDecoderStateReset`
also now exercises the timeJump field in the populate/reset cycle.
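
The seam behavior at issue can be restated compactly. This is a simplified Python restatement of how this commit describes TdtFrameNavigation.calculateInitialTimeIndices, not the real Swift code; with a 25-frame (2.0 s) prefix at the start of each non-first chunk buffer:

```python
STANDARD_OVERLAP_FRAMES = 25      # == the 2.0 s prefix length in encoder frames

def initial_time_index(prev_time_jump):
    """Where decoding starts within the new chunk buffer, per this
    commit's description of the persisted-timeJump behavior."""
    if prev_time_jump is None:
        return 0                          # cleared timeJump: decode from frame 0
    if prev_time_jump == 0:
        return STANDARD_OVERLAP_FRAMES    # special case: skip the prefix entirely
    return prev_time_jump                 # lands partway into the prefix
```

Clearing timeJump (this commit's fix) forces every chunk to decode from frame 0 and re-emit through the prefix; the next commit in the stack reverts that change because the prefix-skip behavior is what the 2.90% flag-ON baseline was measured with.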
Reverts the code change from 0f9085f. The doc accuracy update (shared
mergeChunks/mergeByMidpoint logic applies to the default path too) is
retained.

Devin PR-596 review (2nd pass) flagged that
`processWithMultilingualContinuity` persists `decoderState.timeJump`
across chunks, causing `TdtFrameNavigation.calculateInitialTimeIndices`
to skip into (or, when the persisted value is 0, fully past via
`standardOverlapFrames=25`) the 2.0s prefix region. The static
analysis is correct: per the documented intent of that function the
behavior is unintended for this path's fresh-buffer-with-prefix layout.

Empirically, however, the prefix-skip is what produces the established
flag-ON English LibriSpeech `test-clean` baseline of 2.90% WER on this
PR. Clearing `decoderState.timeJump = nil` between chunks (the "fix")
regresses to 4.43% WER (validated 100-file run) because the decoder
re-emits tokens through the prefix region and the merger duplicates
tokens it cannot reliably de-anchor. Flag-OFF LibriSpeech stays at
2.6403% in both variants.

Keeping the persisted `timeJump` matches the baseline measurements the
PR was reviewed against. The shared-merger doc updates in
`processWithParallelChunks` and `ASRConfig.multilingualChunkContinuity`
remain valid and are not reverted. A short comment in
`processWithMultilingualContinuity` now explains the empirical
trade-off so future readers know not to "fix" it again.

Validation:
- swift test --filter TdtDecoderStateTests: 11/11 pass
- asr-benchmark --multilingual-chunks: 2.8986% (target ≤2.90%)
- asr-benchmark (flag off): 2.6403% (unchanged baseline)
@Alex-Wengg (Member)

I actually overhauled this change in #596.
