fix(asr): avoid v3 multilingual seam drift #604

Draft
vdt4534 wants to merge 8 commits into
FluidInference:fix/asr-594-french-chunk-boundary from
vdt4534:codex/asr-594-v3-seam-warmup

Conversation


@vdt4534 vdt4534 commented May 12, 2026

Summary

This is a stacked draft PR on top of #596 for issue #594. It keeps the opt-in multilingualChunkContinuity API, but changes the v3 implementation so the path fixes the original dropped-word repro without introducing English-token drift on additional French fixtures.

The change is intentionally v3-only. It does not use language hints, deterministic vocabulary filtering, or French/English-specific token rules.

What changed

  • Route the multilingualChunkContinuity special path only for parakeet-tdt-0.6b-v3; other model versions keep the default parallel path and log a warning if the flag is set.
  • Stop persisting TdtDecoderState across v3 batch chunks. Each chunk gets a fresh decoder state.
  • Prepend a short real-audio prefix to non-first chunks: 7 encoder frames, about 560 ms.
  • Decode through that prefix as warmup, but suppress prefix-region emitted tokens from the returned token window.
  • Add conservative silence-aligned chunk starts based only on audio energy near the nominal boundary. If no true near-silence exists nearby, the regular frame-aligned boundary is retained.
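
The changes above can be sketched numerically. Below is an illustrative Python sketch (not the Swift implementation): it shows the frame/sample arithmetic behind the 7-frame warmup prefix and a conservative energy-based search for a near-silent chunk start. The 16 kHz sample rate and 1,280-sample (80 ms) encoder frame size are inferred from the "7 encoder frames, about 560 ms" figure; the energy threshold and search window are made-up example values.

```python
SAMPLE_RATE = 16_000
SAMPLES_PER_ENCODER_FRAME = 1_280               # 80 ms per encoder frame
PREFIX_FRAMES = 7
PREFIX_SAMPLES = PREFIX_FRAMES * SAMPLES_PER_ENCODER_FRAME  # 8_960 samples = 560 ms

def silence_aligned_start(audio, nominal_start, search_frames=4,
                          energy_threshold=1e-4):
    """Return a nearby frame-aligned start whose frame is near-silent.

    Scans up to `search_frames` encoder frames on either side of the
    nominal boundary, nearest offsets first. If no frame's mean energy
    falls below the threshold, the nominal frame-aligned boundary is
    kept unchanged (the conservative fallback described above).
    """
    for offset in sorted(range(-search_frames, search_frames + 1), key=abs):
        start = nominal_start + offset * SAMPLES_PER_ENCODER_FRAME
        if start < 0 or start + SAMPLES_PER_ENCODER_FRAME > len(audio):
            continue
        frame = audio[start:start + SAMPLES_PER_ENCODER_FRAME]
        energy = sum(s * s for s in frame) / len(frame)
        if energy < energy_threshold:
            return start
    return nominal_start
```

Scanning nearest-first keeps the shift minimal when several near-silent frames exist around the boundary.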

Why

PR #596 fixes the original notes_1408_clean.wav drop, but in my test matrix its persisted decoder-state variant regressed two French fixtures by injecting English BPE at or near seams:

  • wwii_belgique_fr.wav: "In Belgique, le 8 mai has been jour férié"
  • user2_2026-05-12.wav: "blouses médicales blanches and portant des masques"

The working hypothesis is that carrying predictor state across chunks pairs chunk N decoder state with chunk N+1 encoder frames that were re-encoded under different positional/convolutional context. On multilingual v3, that out-of-distribution joint-network pair can collapse toward the model's English prior.

This variant avoids that persisted predictor/encoder mismatch while still giving each chunk enough real audio warmup to recover boundary words.

Validation

Fresh local checks:

  • swift build -c release --product fluidaudiocli
  • swift test --filter ChunkProcessorTests -> 44 tests, 0 failures
  • Five-fixture matrix with --multilingual-chunks -> clean transcripts on all five fixtures tested
| Fixture | PR #596, flag ON (before this patch) | This patch, flag ON |
| --- | --- | --- |
| notes_1408_clean.wav | clean | clean |
| wwii_belgique_fr.wav | English drift: "In Belgique ... has been" | clean |
| user_2026-05-12.wav | clean | clean |
| user2_2026-05-12.wav | English drift: "and portant des masques" | clean |
| climate_2026_fr_voice_memo.wav | current flag-OFF and bb96003 both drift at least once | clean |

The fifth fixture is a new 110.976s French Voice Memo converted to mono 16 kHz Int16 WAV. I can attach it to issue #594 as an additional regression fixture if useful.

Timing notes

This path is slower than default parallel chunking because it remains sequential.

Measured wall-clock with the release CLI on my machine:

| Fixture | PR branch, flag OFF (avg) | This patch, flag ON (avg) |
| --- | --- | --- |
| user2_2026-05-12.wav | 0.513s | 0.943s |
| climate_2026_fr_voice_memo.wav | 0.513s | 1.047s |

Compared to the old bb96003 pin on the same longer fixtures, this patch runs at about 1.4x the wall-clock time.

Alex-Wengg and others added 8 commits May 11, 2026 11:16
…ion termination (FluidInference#594)

Batch transcription drifted French to English at every ~15s chunk boundary
on parakeet-tdt-0.6b-v3-coreml. Streaming on the same audio was clean.
Root cause is three interacting issues:

1. ChunkProcessor created a fresh TdtDecoderState per chunk and SOS-primed
   with the blank token. For non-first chunks this starts the LSTM
   mid-utterance, biased toward TDT v3's English prior.

2. Non-first chunks received only ~80ms of mel-context prefix (from FluidInference#264),
   while streaming uses ~2s of actual leading audio. FastConformer's
   depthwise convs produce language-biased logits with too little audio
   history, even when the decoder state is correct.

3. When the decoder emits a sentence-final token mid-chunk, the LSTM
   enters a state where the joint predicts BLANK for the remaining frames,
   silently dropping audio. Masked by the per-chunk SOS reset; surfaces
   once state is persisted.

Fix:
- ChunkProcessor.process: serialize chunk processing, persist
  TdtDecoderState across chunks (matches SlidingWindowAsrManager).
- ChunkProcessor: extend non-first chunk audio prefix from 80ms mel-context
  to 2.0s of actual audio. Decoder skips prefix encoder frames via
  contextFrameAdjustment; timestamps remain anchored on global frames.
- TdtDecoderV3: after a sentence-final token, if the decoder emits a long
  blank-only streak with audio remaining, clear predictorOutput to
  re-engage emission while preserving LSTM state.
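
The blank-streak recovery in the third fix can be sketched as a small state machine. This is a hedged Python sketch, not the actual TdtDecoderV3 code; `BLANK_ID`, the sentence-final token set, and `max_blank_streak` are illustrative placeholders.

```python
BLANK_ID = 0
SENTENCE_FINAL = {".", "!", "?"}

class DecoderStateSketch:
    """Stand-in for the recovery-relevant bits of TdtDecoderState."""
    def __init__(self):
        self.predictor_output = "cached"   # placeholder for the cached tensor
        self.blank_streak = 0
        self.after_sentence_final = False

def step(state, emitted_token, token_text, max_blank_streak=8):
    """One decode step of recovery bookkeeping; returns True when the
    cached predictor output is cleared to re-engage emission (the LSTM
    hidden/cell state is deliberately left untouched)."""
    if emitted_token == BLANK_ID:
        state.blank_streak += 1
        if state.after_sentence_final and state.blank_streak > max_blank_streak:
            state.predictor_output = None   # clear cache only; keep LSTM state
            state.after_sentence_final = False
            state.blank_streak = 0
            return True
    else:
        state.blank_streak = 0
        state.after_sentence_final = token_text in SENTENCE_FINAL
    return False
```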

Verified on reporter's notes_1408_clean.wav: drift gone with --language fr.
English LibriSpeech test-clean smoke (N=5): WER unchanged vs main.
Streaming path unchanged.

Preserves FluidInference#264's chunk-boundary token-loss fix.

Closes FluidInference#594
…h ASR

French batch transcription with parakeet-tdt-0.6b-v3-coreml was drifting
to English at every ~15s chunk boundary because per-chunk SOS priming
re-applied the model's English-biased prior, and the encoder lacked
enough left context to escape it.

This commit wraps the chunk-boundary continuity behavior behind an opt-in
`multilingualChunkContinuity` boolean on ASRConfig (default `false`). The
default behavior is unchanged from main (parallel chunks, fresh SOS per
chunk, English WER 2.64% on LibriSpeech test-clean). When opted in, the
chunk path uses:

- Serialized chunk processing with persisted TdtDecoderState across
  chunks (LSTM hidden/cell + lastToken + predictorOutput).
- 2.0s real-audio prefix prepended to chunks N>=1 as streaming-style
  warmup; decoder consumes prefix frames normally.
- Reserved chunk size (~13s actual audio) so prefix fits within the
  encoder's 240,000-sample window.
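
The reserved chunk sizing falls out of simple arithmetic. A sketch, assuming 16 kHz audio and 1,280-sample (80 ms) encoder frames; the "~13s" figure above matches the frame-aligned result of 12.96 s:

```python
SAMPLE_RATE = 16_000
MAX_MODEL_SAMPLES = 240_000                  # 15 s encoder window
SAMPLES_PER_ENCODER_FRAME = 1_280            # 80 ms per frame (assumed)
PREFIX_SAMPLES = 2 * SAMPLE_RATE             # 2.0 s real-audio warmup prefix

def reserved_chunk_samples():
    """Largest frame-aligned chunk that still fits alongside the prefix."""
    budget = MAX_MODEL_SAMPLES - PREFIX_SAMPLES          # 208_000 samples
    return (budget // SAMPLES_PER_ENCODER_FRAME) * SAMPLES_PER_ENCODER_FRAME

# 207_360 samples = 12.96 s of actual audio per non-first chunk
```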

The in-decoder punctuation guard (recovery from blank-emission spirals
after sentence-final tokens) added in 2206dd1 is enabled in both
modes; bisection during PR investigation confirmed it is benign on the
English baseline.

CLI: pass --multilingual-chunks to fluidaudiocli transcribe and
asr-benchmark to enable the new path.

Validation (LibriSpeech test-clean, 100 files; reporter French audio):
- Flag OFF (default, matches origin/main): English WER 2.640%
- Flag ON: English WER 5.264%
- Flag OFF (reporter French): drifts ("rest of the key what is that")
  -- confirms gating restores original behavior.
- Flag ON (reporter French): correct, "reste de l'équipe" present.

Known limitation (flag ON): a few LibriSpeech files (e.g. 1089-134691-0009,
1188-133604-0011) still drop continuation content past a sentence-final
chunk seam due to encoder prosodic bleed >75 frames. Tracked as
followup; the chosen variant is the empirically best non-regressing
combination across 8 decoder-side hypotheses tested.
…path

Phase 2 of FluidInference#594. The unconditional contextFrameAdjustment from fix B
desyncs encoder/predictor at chunk boundaries, costing ~0.8pp WER on
LibriSpeech test-clean. Replace with the softer streaming-warmup
variant B': prepend the 2.0s real-audio prefix for encoder left context
but let the decoder emit naturally through the prefix region, relying
on the LCS/midpoint merger to discard prefix tokens during chunk merge.

Combined with seam-clear LSTM reset, LCS punctuation-blind matcher, and
effective-left-end in mergeByMidpoint, brings flag-ON path from 5.26%
to 4.19% English WER. Flag-OFF preserved at 2.6403%. French fix
preserved.
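
The B' merge idea (emit naturally through the prefix, discard duplicates at merge time) can be illustrated with a toy overlap merger. This is far simpler than the real LCS/midpoint merger (no timestamps, no punctuation-blind matching) and only demonstrates the principle.

```python
from difflib import SequenceMatcher

def merge_overlap(prev_tokens, next_tokens):
    """Join two token lists whose tail/head overlap, keeping the
    longest common run once; falls back to plain concatenation when no
    anchor is found (the real merger falls back to mergeByMidpoint)."""
    m = SequenceMatcher(a=prev_tokens, b=next_tokens, autojunk=False)
    a, b, size = m.find_longest_match(0, len(prev_tokens), 0, len(next_tokens))
    if size == 0:
        return prev_tokens + next_tokens
    return prev_tokens[:a + size] + next_tokens[b + size:]
```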
…Offset

In processWithMultilingualContinuity, the audio buffer for chunks N>=1
starts at `contextStart = chunkStart - contextSamples` (2.0s of real
audio prepended for encoder left context), but transcribeChunk was
being given `chunkStart` as its origin. Inside transcribeChunk,
`globalFrameOffset = chunkStart / samplesPerEncoderFrame` then placed
every token in chunks N>=1 at +25 frames (+2.0s) past its true
position in the original audio timeline.

Consequence: chunk N's prefix tokens (covering audio that actually
overlaps chunk N-1's tail) landed in timestamp space _after_ chunk
N-1's end, beyond the merger's 1.0s halfOverlapWindow tolerance. LCS
and contiguous matchers could not anchor across the boundary, so every
seam fell through to mergeByMidpoint, which duplicated ~2s of content
at every chunk join.

Pass `contextStart` instead. Prefix tokens now overlap chunk N-1
correctly, LCS matches anchor properly, and the merger can dedupe as
designed. LibriSpeech test-clean (100 files, flag ON): 4.19% → 2.90%
WER. Flag-OFF unchanged at 2.6403%. French fix preserved.
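
The bug described here is just an origin error in the frame mapping. A Python sketch, with assumed 16 kHz / 1,280-sample-frame constants:

```python
SAMPLE_RATE = 16_000
SAMPLES_PER_ENCODER_FRAME = 1_280              # 80 ms per frame
CONTEXT_SAMPLES = 2 * SAMPLE_RATE              # 2.0 s (25-frame) prefix

def global_frame(frame_in_buffer, buffer_origin_samples):
    """Map a token's frame index inside the chunk buffer to a global
    frame index on the original audio timeline."""
    return frame_in_buffer + buffer_origin_samples // SAMPLES_PER_ENCODER_FRAME

chunk_start = 10 * SAMPLE_RATE                 # chunk nominally begins at 10.0 s
context_start = chunk_start - CONTEXT_SAMPLES  # buffer really begins at 8.0 s

buggy = global_frame(0, chunk_start)    # 125 frames (10.0 s): +25-frame error
fixed = global_frame(0, context_start)  # 100 frames (8.0 s): true position
```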

Credit: Devin AI review on PR FluidInference#596.
Three followups from the bot review of PR FluidInference#596:

1. Update stale WER doc comment in AsrTypes.swift
   The `multilingualChunkContinuity` doc said "~4.40% English WER", which
   referred to an intermediate variant before the Devin globalFrameOffset
   fix landed. The validated landed number is 2.90% (vs 2.64% with flag
   off, a +0.26pp cost).

2. Warn when `multilingualChunkContinuity=true` on a non-v3 model
   The flag's sequential serialization + 2.0s audio prefix is designed
   to mitigate parakeet-tdt-0.6b-v3 English-prior drift. On v2 /
   tdtCtc110m / ctcZhCn / tdtJa it still produces correct output, but
   only adds latency with no benefit, so log a warning once when the
   path is entered with a non-v3 model.

3. Unit tests for the new code path (CLAUDE.md policy: "Add unit tests
   when writing new code")
   - ASRConfigTests: multilingualChunkContinuity defaults to false,
     preserves explicit true/false, and doesn't disturb other fields.
   - ChunkProcessorTests (via #if DEBUG accessors):
     * audio prefix is exactly 32000 samples (2.0s @ 16kHz), encoder-
       frame-aligned (multiple of 1280).
     * multilingual chunk size + prefix ≤ maxModelSamples (240000),
       and chunk size is frame-aligned.
     * multilingual chunk size is strictly smaller than default chunk
       size (it has to give up content to make prefix room).
     * chunkSamples(multilingualContinuity:) dispatches correctly to
       either the default or multilingual sizing path.
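
Those invariants can be restated as executable checks. The 32,000-sample prefix, 240,000-sample window, and 1,280-sample frames come from the text above; treating the default chunk size as the full 240,000-sample window is an assumption made here only for the strict-inequality check.

```python
PREFIX_SAMPLES = 32_000                      # 2.0 s @ 16 kHz
FRAME = 1_280                                # one encoder frame
MAX_MODEL_SAMPLES = 240_000
DEFAULT_CHUNK_SAMPLES = MAX_MODEL_SAMPLES    # assumption, see above
MULTILINGUAL_CHUNK_SAMPLES = (
    (MAX_MODEL_SAMPLES - PREFIX_SAMPLES) // FRAME
) * FRAME

def chunk_samples(multilingual_continuity):
    """Sketch of the chunkSamples(multilingualContinuity:) dispatch."""
    return (MULTILINGUAL_CHUNK_SAMPLES if multilingual_continuity
            else DEFAULT_CHUNK_SAMPLES)
```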
Devin PR-596 review (2nd pass) flagged two issues:

1. (real bug) `processWithMultilingualContinuity` persists `timeJump`
   across chunks. `TdtDecoderV3.decodeWithTimings` writes
   `decoderState.timeJump = currentTimeIndices - effectiveSequenceLength`
   for non-last chunks. On the next chunk,
   `TdtFrameNavigation.calculateInitialTimeIndices` either skips into
   the 2.0s prefix region (prevTimeJump > 0) or, when prevTimeJump == 0,
   returns the special-case `standardOverlapFrames` (25 = exactly the
   prefix length) which skips the prefix entirely. Either case breaks
   the merger's ability to anchor tokens across the chunk seam.

   Fix: explicitly `decoderState.timeJump = nil` after each non-last
   chunk so the next chunk's decoder starts at frame 0 of the buffer
   (which already begins with the 2.0s prefix). The punctuation-seam
   `reset()` already nils timeJump as a side effect; this clear handles
   every other boundary the same way.

2. (doc accuracy) Both ChunkProcessor.processWithParallelChunks and
   ASRConfig.multilingualChunkContinuity claimed the default path was
   bit-for-bit identical to pre-FluidInference#596 main. The shared
   `mergeChunks` / `mergeByMidpoint` punctuation-aware LCS matcher and
   trailing-punctuation midpoint adjustment apply to both paths.
   Empirically WER-neutral on LibriSpeech test-clean (validated 2.64%),
   but the merger algorithm is no longer literally identical. Doc
   updated to reflect this.

Tests: TdtDecoderStateTests gains `testTimeJumpNilingForMultilingualContinuityPath`
documenting that `timeJump` is independently nilable without disturbing
LSTM hidden/cell, lastToken, or predictorOutput. `testDecoderStateReset`
also now exercises the timeJump field in the populate/reset cycle.
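
The seam behavior at issue can be restated compactly. This is a simplified Python restatement of how this commit describes TdtFrameNavigation.calculateInitialTimeIndices, not the real Swift code; with a 25-frame (2.0 s) prefix at the start of each non-first chunk buffer:

```python
STANDARD_OVERLAP_FRAMES = 25      # == the 2.0 s prefix length in encoder frames

def initial_time_index(prev_time_jump):
    """Where decoding starts within the new chunk buffer, per this
    commit's description of the persisted-timeJump behavior."""
    if prev_time_jump is None:
        return 0                          # cleared timeJump: decode from frame 0
    if prev_time_jump == 0:
        return STANDARD_OVERLAP_FRAMES    # special case: skip the prefix entirely
    return prev_time_jump                 # lands partway into the prefix
```

Clearing timeJump (this commit's fix) forces every chunk to decode from frame 0 and re-emit through the prefix; the next commit in the stack reverts that change because the prefix-skip behavior is what the 2.90% flag-ON baseline was measured with.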
Reverts the code change from 0f9085f. The doc accuracy update (shared
mergeChunks/mergeByMidpoint logic applies to the default path too) is
retained.

Devin PR-596 review (2nd pass) flagged that
`processWithMultilingualContinuity` persists `decoderState.timeJump`
across chunks, causing `TdtFrameNavigation.calculateInitialTimeIndices`
to skip into (or, when the persisted value is 0, fully past via
`standardOverlapFrames=25`) the 2.0s prefix region. The static
analysis is correct: per the documented intent of that function the
behavior is unintended for this path's fresh-buffer-with-prefix layout.

Empirically, however, the prefix-skip is what produces the established
flag-ON English LibriSpeech `test-clean` baseline of 2.90% WER on this
PR. Clearing `decoderState.timeJump = nil` between chunks (the "fix")
regresses to 4.43% WER (validated 100-file run) because the decoder
re-emits tokens through the prefix region and the merger duplicates
tokens it cannot reliably de-anchor. Flag-OFF LibriSpeech stays at
2.6403% in both variants.

Keeping the persisted `timeJump` matches the baseline measurements the
PR was reviewed against. The shared-merger doc updates in
`processWithParallelChunks` and `ASRConfig.multilingualChunkContinuity`
remain valid and are not reverted. A short comment in
`processWithMultilingualContinuity` now explains the empirical
trade-off so future readers know not to "fix" it again.

Validation:
- swift test --filter TdtDecoderStateTests: 11/11 pass
- asr-benchmark --multilingual-chunks: 2.8986% (target ≤2.90%)
- asr-benchmark (flag off): 2.6403% (unchanged baseline)
@Alex-Wengg (Member)

I actually overhauled this change in #596.
