fix(asr): avoid v3 multilingual seam drift #604
Draft
vdt4534 wants to merge 8 commits into
Conversation
…ion termination (FluidInference#594)

Batch transcription drifted French to English at every ~15s chunk boundary on parakeet-tdt-0.6b-v3-coreml. Streaming on the same audio was clean. Root cause is three interacting issues:

1. ChunkProcessor created a fresh TdtDecoderState per chunk and SOS-primed with the blank token. For non-first chunks this starts the LSTM mid-utterance, biased toward TDT v3's English prior.
2. Non-first chunks received only ~80ms of mel-context prefix (from FluidInference#264), while streaming uses ~2s of actual leading audio. FastConformer's depthwise convs produce language-biased logits with too little audio history, even when the decoder state is correct.
3. When the decoder emits a sentence-final token mid-chunk, the LSTM enters a state where the joint predicts BLANK for the remaining frames, silently dropping audio. Masked by the per-chunk SOS reset; surfaces once state is persisted.

Fix:
- ChunkProcessor.process: serialize chunk processing, persist TdtDecoderState across chunks (matches SlidingWindowAsrManager).
- ChunkProcessor: extend non-first chunk audio prefix from 80ms mel-context to 2.0s of actual audio. Decoder skips prefix encoder frames via contextFrameAdjustment; timestamps remain anchored on global frames.
- TdtDecoderV3: after a sentence-final token, if the decoder emits a long blank-only streak with audio remaining, clear predictorOutput to re-engage emission while preserving LSTM state.

Verified on reporter's notes_1408_clean.wav: drift gone with --language fr. English LibriSpeech test-clean smoke (N=5): WER unchanged vs main. Streaming path unchanged. Preserves FluidInference#264's chunk-boundary token-loss fix.

Closes FluidInference#594
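The serialized-state part of the fix can be sketched as follows. This is a hypothetical reduction, not the actual ChunkProcessor code: `TdtDecoderState` is pared down to one field and `decodeChunk` is a stand-in for the real TDT decode.

```swift
// Minimal sketch (hypothetical types) of the serialized-chunk fix described
// above: one decoder state threaded through all chunks instead of a fresh
// SOS-primed state per chunk. The real logic lives in ChunkProcessor.process.

struct TdtDecoderState {
    var lastToken: Int?   // nil = SOS; LSTM hidden/cell elided for brevity
}

// Placeholder for the real TDT decode; mutates `state` in place.
func decodeChunk(_ chunk: [Float], state: inout TdtDecoderState) -> [Int] {
    if !chunk.isEmpty { state.lastToken = 0 }
    return []
}

func processSerially(chunks: [[Float]]) -> [Int] {
    var state = TdtDecoderState(lastToken: nil)  // SOS-primed once, for chunk 0 only
    var tokens: [Int] = []
    for chunk in chunks {
        // State persists across iterations: no per-chunk SOS reset.
        tokens += decodeChunk(chunk, state: &state)
    }
    return tokens
}
```

The point of the sketch is only the `inout` threading: the same state object crosses every chunk boundary, as the streaming path already does.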
…h ASR

French batch transcription with parakeet-tdt-0.6b-v3-coreml was drifting to English at every ~15s chunk boundary because per-chunk SOS priming re-applied the model's English-biased prior, and the encoder lacked enough left context to escape it.

This commit wraps the chunk-boundary continuity behavior behind an opt-in `multilingualChunkContinuity` boolean on ASRConfig (default `false`). The default behavior is unchanged from main (parallel chunks, fresh SOS per chunk, English WER 2.64% on LibriSpeech test-clean). When opted in, the chunk path uses:

- Serialized chunk processing with persisted TdtDecoderState across chunks (LSTM hidden/cell + lastToken + predictorOutput).
- 2.0s real-audio prefix prepended to chunks N>=1 as streaming-style warmup; decoder consumes prefix frames normally.
- Reserved chunk size (~13s actual audio) so the prefix fits within the encoder's 240,000-sample window.

The in-decoder punctuation guard (recovery from blank-emission spirals after sentence-final tokens) added in 2206dd1 is enabled in both modes; bisection during PR investigation confirmed it is benign on the English baseline.

CLI: pass --multilingual-chunks to fluidaudiocli transcribe and asr-benchmark to enable the new path.

Validation (LibriSpeech test-clean, 100 files; reporter French audio):
- Flag OFF (default, matches origin/main): English WER 2.640%
- Flag ON: English WER 5.264%
- Flag OFF (reporter French): drifts ("rest of the key what is that") -- confirms gating restores original behavior.
- Flag ON (reporter French): correct, "reste de l'équipe" present.

Known limitation (flag ON): a few LibriSpeech files (e.g. 1089-134691-0009, 1188-133604-0011) still drop continuation content past a sentence-final chunk seam due to encoder prosodic bleed >75 frames. Tracked as followup; the chosen variant is the empirically best non-regressing combination across 8 decoder-side hypotheses tested.
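As a rough illustration of the opt-in gate described above: only the flag name `multilingualChunkContinuity` comes from this PR; the struct shape and the dispatch function are assumptions for the sketch.

```swift
// Hedged sketch of the opt-in gate. ASRConfig's real fields and the real
// dispatch site are not shown here; only the flag name matches the PR.

struct ASRConfig {
    /// Default false: parallel chunks, fresh SOS per chunk (matches main).
    var multilingualChunkContinuity: Bool = false
}

func chunkPath(for config: ASRConfig) -> String {
    config.multilingualChunkContinuity
        ? "serialized chunks, persisted decoder state, 2.0s audio prefix"
        : "default parallel chunk path"
}
```

Defaulting the flag to `false` is what keeps the main-branch behavior (and its 2.64% English WER baseline) bit-compatible for existing callers.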
…path

Phase 2 of FluidInference#594. The unconditional contextFrameAdjustment from fix B desyncs encoder/predictor at chunk boundaries, costing ~0.8pp WER on LibriSpeech test-clean. Replace with the softer streaming-warmup variant B': prepend the 2.0s real-audio prefix for encoder left context but let the decoder emit naturally through the prefix region, relying on the LCS/midpoint merger to discard prefix tokens during chunk merge.

Combined with seam-clear LSTM reset, LCS punctuation-blind matcher, and effective-left-end in mergeByMidpoint, this brings the flag-ON path from 5.26% to 4.19% English WER. Flag-OFF preserved at 2.6403%. French fix preserved.
…Offset

In processWithMultilingualContinuity, the audio buffer for chunks N>=1 starts at `contextStart = chunkStart - contextSamples` (2.0s of real audio prepended for encoder left context), but transcribeChunk was being given `chunkStart` as its origin. Inside transcribeChunk, `globalFrameOffset = chunkStart / samplesPerEncoderFrame` then placed every token in chunks N>=1 at +25 frames (+2.0s) past its true position in the original audio timeline.

Consequence: chunk N's prefix tokens (covering audio that actually overlaps chunk N-1's tail) landed in timestamp space _after_ chunk N-1's end, beyond the merger's 1.0s halfOverlapWindow tolerance. LCS and contiguous matchers could not anchor across the boundary, so every seam fell through to mergeByMidpoint, which duplicated ~2s of content at every chunk join.

Fix: pass `contextStart` instead. Prefix tokens now overlap chunk N-1 correctly, LCS matches anchor properly, and the merger can dedupe as designed.

LibriSpeech test-clean (100 files, flag ON): 4.19% → 2.90% WER. Flag-OFF unchanged at 2.6403%. French fix preserved.

Credit: Devin AI review on PR FluidInference#596.
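The offset arithmetic behind this bug can be checked in isolation. The constants match the commit message (1280 samples per encoder frame, a 2.0s prefix); the concrete chunk position is invented for the example.

```swift
// Illustrative arithmetic for the globalFrameOffset bug described above.
// chunkStart is a made-up, frame-aligned position; the other constants are
// the ones stated in the commit message.

let samplesPerEncoderFrame = 1_280      // 80 ms of 16 kHz audio per encoder frame
let contextSamples = 32_000             // 2.0 s real-audio prefix
let chunkStart = 204_800                // hypothetical start of chunk N (12.8 s)
let contextStart = chunkStart - contextSamples  // where the buffer actually begins

// Buggy origin: tokens decoded from the buffer land 25 frames (2.0 s) late.
let buggyOffset = chunkStart / samplesPerEncoderFrame
// Fixed origin: tokens align with their true position in the original audio.
let fixedOffset = contextStart / samplesPerEncoderFrame

assert(buggyOffset - fixedOffset == 25)  // exactly the prefix length in frames
```

Since the merger's halfOverlapWindow tolerance is 1.0s, a constant 2.0s shift guarantees every seam misses the anchor window, which is why the failure was total rather than intermittent.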
Three followups from the bot review of PR FluidInference#596:

1. Update stale WER doc comment in AsrTypes.swift. The `multilingualChunkContinuity` doc said "~4.40% English WER", which referred to an intermediate variant before the Devin globalFrameOffset fix landed. The validated landed number is 2.90% (vs 2.64% with flag off, a +0.26pp cost).

2. Warn when `multilingualChunkContinuity=true` on a non-v3 model. The flag's sequential serialization + 2.0s audio prefix is designed to mitigate parakeet-tdt-0.6b-v3 English-prior drift. On v2 / tdtCtc110m / ctcZhCn / tdtJa it still produces correct output, but only adds latency with no benefit, so log a warning once when the path is entered with a non-v3 model.

3. Unit tests for the new code path (CLAUDE.md policy: "Add unit tests when writing new code"):
   - ASRConfigTests: multilingualChunkContinuity defaults to false, preserves explicit true/false, and doesn't disturb other fields.
   - ChunkProcessorTests (via #if DEBUG accessors):
     * audio prefix is exactly 32000 samples (2.0s @ 16kHz), encoder-frame-aligned (multiple of 1280).
     * multilingual chunk size + prefix ≤ maxModelSamples (240000), and chunk size is frame-aligned.
     * multilingual chunk size is strictly smaller than default chunk size (it has to give up content to make prefix room).
     * chunkSamples(multilingualContinuity:) dispatches correctly to either the default or multilingual sizing path.
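The prefix-alignment invariants those tests check reduce to simple arithmetic. This sketch uses the numbers stated above; the multilingual chunk size here is an assumption chosen to satisfy the invariants, not the constant from the real ChunkProcessor.

```swift
// Invariants from the commit message, as standalone checks. The multilingual
// chunk size (162 encoder frames, about 12.96 s) is illustrative only.

let sampleRate = 16_000
let samplesPerEncoderFrame = 1_280
let maxModelSamples = 240_000                 // the encoder's 15 s sample window

let prefixSamples = 2 * sampleRate            // 2.0 s prefix
assert(prefixSamples == 32_000)
assert(prefixSamples % samplesPerEncoderFrame == 0)   // encoder-frame-aligned

let multilingualChunkSamples = 162 * samplesPerEncoderFrame   // ~13 s of audio
assert(multilingualChunkSamples % samplesPerEncoderFrame == 0)
assert(multilingualChunkSamples + prefixSamples <= maxModelSamples)
assert(multilingualChunkSamples < maxModelSamples)    // room reserved for prefix
```

The last assertion is the "give up content to make prefix room" property: the multilingual chunk must be strictly smaller than a full-window chunk.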
Devin PR-596 review (2nd pass) flagged two issues:

1. (real bug) `processWithMultilingualContinuity` persists `timeJump` across chunks. `TdtDecoderV3.decodeWithTimings` writes `decoderState.timeJump = currentTimeIndices - effectiveSequenceLength` for non-last chunks. On the next chunk, `TdtFrameNavigation.calculateInitialTimeIndices` either skips into the 2.0s prefix region (prevTimeJump > 0) or, when prevTimeJump == 0, returns the special-case `standardOverlapFrames` (25 = exactly the prefix length), which skips the prefix entirely. Either case breaks the merger's ability to anchor tokens across the chunk seam. Fix: explicitly set `decoderState.timeJump = nil` after each non-last chunk so the next chunk's decoder starts at frame 0 of the buffer (which already begins with the 2.0s prefix). The punctuation-seam `reset()` already nils timeJump as a side effect; this clear handles every other boundary the same way.

2. (doc accuracy) Both ChunkProcessor.processWithParallelChunks and ASRConfig.multilingualChunkContinuity claimed the default path was bit-for-bit identical to pre-FluidInference#596 main. The shared `mergeChunks` / `mergeByMidpoint` punctuation-aware LCS matcher and trailing-punctuation midpoint adjustment apply to both paths. Empirically WER-neutral on LibriSpeech test-clean (validated 2.64%), but the merger algorithm is no longer literally identical. Doc updated to reflect this.

Tests: TdtDecoderStateTests gains `testTimeJumpNilingForMultilingualContinuityPath` documenting that `timeJump` is independently nilable without disturbing LSTM hidden/cell, lastToken, or predictorOutput. `testDecoderStateReset` also now exercises the timeJump field in the populate/reset cycle.
Reverts the code change from 0f9085f. The doc accuracy update (shared mergeChunks/mergeByMidpoint logic applies to the default path too) is retained.

Devin PR-596 review (2nd pass) flagged that `processWithMultilingualContinuity` persists `decoderState.timeJump` across chunks, causing `TdtFrameNavigation.calculateInitialTimeIndices` to skip into (or, when the persisted value is 0, fully past via `standardOverlapFrames=25`) the 2.0s prefix region. The static analysis is correct: per the documented intent of that function, the behavior is unintended for this path's fresh-buffer-with-prefix layout.

Empirically, however, the prefix-skip is what produces the established flag-ON English LibriSpeech test-clean baseline of 2.90% WER on this PR. Clearing `decoderState.timeJump = nil` between chunks (the "fix") regresses to 4.43% WER (validated 100-file run) because the decoder re-emits tokens through the prefix region and the merger duplicates tokens it cannot reliably de-anchor. Flag-OFF LibriSpeech stays at 2.6403% in both variants. Keeping the persisted `timeJump` matches the baseline measurements the PR was reviewed against.

The shared-merger doc updates in `processWithParallelChunks` and `ASRConfig.multilingualChunkContinuity` remain valid and are not reverted. A short comment in `processWithMultilingualContinuity` now explains the empirical trade-off so future readers know not to "fix" it again.

Validation:
- swift test --filter TdtDecoderStateTests: 11/11 pass
- asr-benchmark --multilingual-chunks: 2.8986% (target ≤2.90%)
- asr-benchmark (flag off): 2.6403% (unchanged baseline)
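To make the trade-off concrete, here is a hypothetical model of the start-index behavior as this thread describes it. `calculateInitialTimeIndices` and `standardOverlapFrames` are real names per the PR text, but this body is reconstructed from the description alone and is not the actual implementation.

```swift
// Hypothetical reduction of TdtFrameNavigation.calculateInitialTimeIndices,
// reconstructed only from this PR discussion. Do not treat as the real code.

let standardOverlapFrames = 25   // equals the 2.0 s prefix length in encoder frames

func initialTimeIndex(prevTimeJump: Int?) -> Int {
    guard let jump = prevTimeJump else {
        return 0   // cleared timeJump: decode from buffer frame 0, through the prefix
    }
    // jump > 0 skips partway into the prefix; jump == 0 skips it entirely.
    return jump > 0 ? jump : standardOverlapFrames
}

// Persisted timeJump == 0 skips exactly the prefix (the 2.90% WER baseline the
// revert keeps); clearing it to nil re-decodes the prefix, which regressed to
// 4.43% WER in this PR's measurements.
```

Under this model, the "correct" nil-clearing behavior and the "correct" WER are in direct tension, which is what the in-code comment added by this commit is meant to document.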
f7f58f5 to bfa14a1
Member
I actually overhauled this change. #596
Summary
This is a stacked draft PR on top of #596 for issue #594. It keeps the opt-in `multilingualChunkContinuity` API, but changes the v3 implementation so the path fixes the original dropped-word repro without introducing English-token drift on additional French fixtures.

The change is intentionally v3-only. It does not use language hints, deterministic vocabulary filtering, or French/English-specific token rules.
What changed
- `multilingualChunkContinuity` special path only for `parakeet-tdt-0.6b-v3`; other model versions keep the default parallel path and log a warning if the flag is set.
- No persisted `TdtDecoderState` across v3 batch chunks; each chunk gets a fresh decoder state.

Why
PR #596 fixes the original `notes_1408_clean.wav` drop, but in my test matrix its persisted decoder-state variant regressed two French fixtures by injecting English BPE at or near seams:

- `wwii_belgique_fr.wav`: "In Belgique, le 8 mai has been jour férié"
- `user2_2026-05-12.wav`: "blouses médicales blanches and portant des masques"

The working hypothesis is that carrying predictor state across chunks pairs chunk N decoder state with chunk N+1 encoder frames that were re-encoded under different positional/convolutional context. On multilingual v3, that out-of-distribution joint-network pair can collapse toward the model's English prior.
This variant avoids that persisted predictor/encoder mismatch while still giving each chunk enough real audio warmup to recover boundary words.
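A hedged sketch of this variant's chunk loop under the description above: fresh decoder state per chunk, plus a 2.0s real-audio warmup for every non-first chunk. All names and sizes here are illustrative stand-ins, not the actual FluidAudio code.

```swift
// Hedged sketch of this PR's variant: no cross-chunk predictor state, but each
// non-first chunk still gets 2.0 s of real leading audio as encoder warmup.
// Every identifier below is illustrative.

struct DecoderState { var lastToken: Int? }

// Placeholder decode: a fresh SOS-primed state is created for every chunk.
func decodeV3Chunk(_ audio: ArraySlice<Float>, warmupSamples: Int) -> [Int] {
    let state = DecoderState(lastToken: nil)   // never carried across chunks
    _ = (state, warmupSamples)                 // merger later drops warmup tokens
    return []
}

func processBatch(_ audio: [Float]) -> [Int] {
    let chunkSamples = 207_360                 // ~13 s, leaves room for the prefix
    let prefixSamples = 32_000                 // 2.0 s warmup at 16 kHz
    var tokens: [Int] = []
    var start = 0
    while start < audio.count {
        let contextStart = max(0, start - prefixSamples)  // chunk 0 has no prefix
        let end = min(audio.count, start + chunkSamples)
        tokens += decodeV3Chunk(audio[contextStart..<end],
                                warmupSamples: start - contextStart)
        start = end
    }
    return tokens
}
```

The design point is that the warmup audio gives the encoder real left context while the decoder never sees a state/encoder pairing it was not trained on, which is the mismatch hypothesized above.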
Validation
Fresh local checks:
- `swift build -c release --product fluidaudiocli`
- `swift test --filter ChunkProcessorTests` -> 44 tests, 0 failures
- `--multilingual-chunks` -> clean transcripts on all five fixtures tested:
  - `notes_1408_clean.wav`
  - `wwii_belgique_fr.wav` (the "In Belgique ... has been" drift is gone)
  - `user_2026-05-12.wav`
  - `user2_2026-05-12.wav` (the "and portant des masques" drift is gone)
  - `climate_2026_fr_voice_memo.wav`

The fifth fixture is a new 110.976s French Voice Memo converted to mono 16 kHz Int16 WAV. I can attach it to issue #594 as an additional regression fixture if useful.
Timing notes
This path is slower than default parallel chunking because it remains sequential.
Measured wall-clock with the release CLI on my machine:
- `user2_2026-05-12.wav`
- `climate_2026_fr_voice_memo.wav`

Compared to the old bb96003 pin on the same longer fixtures, this patch was about 1.4x wall-clock.