Spun out of #78 after the M3c-style loudness fix in PR #82 was empirically falsified as an ASR fix. This issue tracks the actual root cause of the Whisper WER regression reported in PR #77 / #79 / #78.
Background
#78 hypothesized that the post-2026-04-16 ~6 dB peak drop on Tier A clips drove Whisper WER from ~0.04–0.08 up to 0.18–0.28. The PR that closed it (#82) added a target-peak loudness contract and metadata trail, but a controlled lever probe run before merging cleanly falsified the loudness hypothesis.
Lever-probe data (2026-05-05, M4 Max, openai/whisper-large-v3, greedy)
Same source audio (post-fix re-render of sp_neu_a_0001_00), single global gain to a grid of peak / RMS targets:
| variant | peak (dBFS) | rms (dBFS) | WER | length-ratio |
|---|---|---|---|---|
| 04-15 reference (untouched) | −1.00 | −20.37 | 0.079 | 1.005 |
| post-fix as is | −2.00 | −23.06 | 0.286 | 0.762 |
| scaled to peak −1 dBFS | −1.00 | −22.06 | 0.286 | 0.762 |
| scaled to peak −3 dBFS | −3.00 | −24.06 | 0.286 | 0.762 |
| scaled to peak −6 dBFS | −6.00 | −27.06 | 0.286 | 0.762 |
| scaled to RMS −20.4 dBFS (= 04-15) | +0.69 (clipped) | −20.37 | 0.286 | 0.762 |
| scaled to RMS −16 dBFS | +5.06 (clipped) | −16.00 | 0.286 | 0.762 |
| scaled to RMS −14 dBFS | +7.06 (clipped) | −14.00 | 0.180 | 0.862 |
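The peak and RMS columns in the table follow from plain dB arithmetic: one global gain shifts peak and RMS by the same number of dB, and any variant whose peak lands above 0 dBFS clips. A minimal sketch that rebuilds those columns from the "post-fix as is" row (the function names are hypothetical and not part of the repo):

```python
# Starting point: the "post-fix as is" row of the table (values in dBFS).
BASE_PEAK_DB = -2.00
BASE_RMS_DB = -23.06


def apply_global_gain(gain_db: float) -> tuple[float, float, bool]:
    """Return (peak_dbfs, rms_dbfs, clipped) after one global gain in dB."""
    peak = round(BASE_PEAK_DB + gain_db, 2)
    rms = round(BASE_RMS_DB + gain_db, 2)
    # Anything above 0 dBFS exceeds full scale and clips.
    return peak, rms, peak > 0.0


def scale_to_peak(target_peak_db: float) -> tuple[float, float, bool]:
    """Gain chosen so that the peak hits the target."""
    return apply_global_gain(target_peak_db - BASE_PEAK_DB)


def scale_to_rms(target_rms_db: float) -> tuple[float, float, bool]:
    """Gain chosen so that the RMS hits the target."""
    return apply_global_gain(target_rms_db - BASE_RMS_DB)


if __name__ == "__main__":
    for target in (-1.0, -3.0, -6.0):
        print("peak target", target, scale_to_peak(target))
    for target in (-20.37, -16.0, -14.0):
        print("rms target", target, scale_to_rms(target))
```

Every RMS-targeted variant needs more than 2 dB of gain, which is why all three of them end up above 0 dBFS.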
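Six of the seven gain variants of the post-fix render (every row from "post-fix as is" through "scaled to RMS −16 dBFS") produce byte-identical Whisper hypotheses. Whisper's log-mel feature extractor internally normalizes, so peak/RMS within the spec range is invisible to it. The last row (RMS −14 dBFS) differs only because it is the most heavily clipped, with a nominal peak of +7.06 dBFS; that is signal degradation, not a level-driven improvement.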
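A rough way to see why a single global gain is such a weak lever: in the reference Whisper front end, the log spectrogram is clamped relative to its own maximum and then affinely rescaled, so an unclipped global gain can only shift every feature by the same constant. The sketch below is a simplified stand-in rather than the library code. It assumes the mel filterbank can be omitted (it is linear in power, so it does not change how a gain propagates), and its function names are hypothetical.

```python
import numpy as np


def whisper_style_log_spec(wav: np.ndarray, n_fft: int = 400, hop: int = 160) -> np.ndarray:
    """Simplified Whisper-style features: power STFT -> log10 -> clamp -> rescale."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    log_spec = np.log10(np.maximum(power, 1e-10))
    # Dynamic-range floor: nothing sits more than 8 log10 units below the maximum.
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    return (log_spec + 4.0) / 4.0


def gain_offset(wav: np.ndarray, gain_db: float) -> tuple[float, float]:
    """Return (min, max) of the per-bin feature change caused by a global gain."""
    gain = 10.0 ** (gain_db / 20.0)
    diff = whisper_style_log_spec(wav * gain) - whisper_style_log_spec(wav)
    return float(diff.min()), float(diff.max())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    wav = 0.1 * rng.standard_normal(16000)  # 1 s of noise at 16 kHz, far from clipping
    lo, hi = gain_offset(wav, gain_db=-6.0)
    print(f"feature shift: min={lo:.4f}, max={hi:.4f}")
```

The minimum and maximum shift coincide, so the spectro-temporal pattern is untouched and only a uniform offset of `gain_db / 40` remains. This does not by itself show that the decoder ignores that offset; the identical hypotheses in the table are the evidence for that.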
So the Whisper drop is content-driven, not level-driven. #78 saw two regressions that landed in the same window of PRs (loudness + Whisper WER) and treated their correlation as causation. They have a common cause (the M15 / #70 / #71 prosody work), not a causal relationship.
Reproduction
```bash
.venv/bin/python -c "
from pathlib import Path
import sys

import soundfile as sf, numpy as np, torch
from jiwer import wer
from transformers import pipeline

# normalize_for_wer lives in the repo scripts directory
sys.path.insert(0, 'scripts')
from m17_phase_a_validation import normalize_for_wer

asr = pipeline('automatic-speech-recognition', model='openai/whisper-large-v3',
               device=torch.device('mps' if torch.backends.mps.is_available() else 'cpu'),
               torch_dtype=torch.float32, chunk_length_s=30)

for label, path in [
    ('04-15 ref', 'data/m2a_wettest/agg_m_30-45_001/sp_neu_a_0001_00.wav'),
    ('current ', 'data/m17_loudness_repro/agg_m_30-45_001/sp_neu_a_0001_00.wav'),
]:
    wav, sr = sf.read(path, dtype='float32')
    if wav.ndim > 1:
        wav = wav.mean(axis=1)  # downmix to mono
    # Reference text sits next to the clip; skip empty lines and bracketed lines
    txt = Path(path).with_suffix('.txt').read_text(encoding='utf-8')
    ref = '\n'.join(l for l in txt.splitlines() if l and not l.startswith('[')).strip()
    # Greedy decoding, Hebrew transcription
    out = asr({'raw': wav.copy(), 'sampling_rate': sr},
              generate_kwargs={'language': 'he', 'task': 'transcribe', 'num_beams': 1, 'do_sample': False})
    w = wer(normalize_for_wer(ref), normalize_for_wer(out['text']))
    print(f'{label}: WER={w:.3f}')
"
```
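The script above prints only WER, while the probe table also reports a length-ratio. This issue does not define that metric, so the sketch below assumes it is the hypothesis word count divided by the reference word count. That assumption is consistent with reading 0.762 as roughly 24% of the words going missing. The helper name is hypothetical.

```python
def length_ratio(reference: str, hypothesis: str) -> float:
    """Hypothesis word count divided by reference word count (assumed definition)."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference is empty")
    return len(hyp_words) / len(ref_words)


if __name__ == "__main__":
    ref = "one two three four five six seven eight"
    hyp = "one two four five seven eight"  # two of eight words dropped
    print(f"length-ratio={length_ratio(ref, hyp):.3f}")
```

Printing it next to WER for both clips would show whether the regression is dominated by deletions (ratio well below 1) or by substitutions (ratio near 1).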
What changed in the suspect window (2026-04-15 → 2026-05-05)
Sorted by likelihood-of-impact-on-Whisper:
- #70: inter-word <break> tags to prevent Hebrew word merging. Adds explicit pauses between words. Whisper interprets sustained pauses as utterance boundaries; if breaks land at every word, Whisper may treat each chunk as an isolated short utterance, lose context, and over-trigger silence-token emission. This would also explain the 0.762 length-ratio (Whisper dropping ~24% of words it reads as silence).
- … [1,1,1,2,1]), included for completeness.
Suggested investigation path
- main HEAD: run openai/whisper-large-v3 on a freshly rendered sp_neu_a_0001, confirm WER ≈ 0.28.
Why this is independent of #82's fix
PR #82 ships:
- PreprocessingConfig.loudness_target_peak_dbfs field in GenerationMetadata so future loudness drift is diagnosable from clip metadata alone, the structural fix that prevents another silent #78-style regression.
- Tier A / Tier B/C consolidation through the shared peak_normalize_to_target helper.
None of this touches the prosody / SSML / mixer paths. The two changes are orthogonal; this issue is the actual ASR fix and should be treated as a separate workstream.
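For context on what stays out of scope here, a peak-normalization helper of the kind named above might look roughly like this. Only the name peak_normalize_to_target comes from this issue; the signature, the −1 dBFS default, and the returned gain are assumptions for illustration, not the code in PR #82.

```python
import numpy as np


def peak_normalize_to_target(wav: np.ndarray, target_peak_dbfs: float = -1.0) -> tuple[np.ndarray, float]:
    """Scale wav so that its absolute peak sits at target_peak_dbfs.

    Returns the scaled audio and the applied gain in dB, so the gain could be
    written to clip metadata. Signature and default value are assumptions.
    """
    if target_peak_dbfs > 0.0:
        raise ValueError("a target above 0 dBFS would clip")
    peak = float(np.max(np.abs(wav))) if wav.size else 0.0
    if peak == 0.0:
        return wav.copy(), 0.0  # silence: nothing to scale
    target_linear = 10.0 ** (target_peak_dbfs / 20.0)
    gain = target_linear / peak
    return wav * gain, float(20.0 * np.log10(gain))


if __name__ == "__main__":
    clip = np.array([0.0, 0.25, -0.5])
    scaled, gain_db = peak_normalize_to_target(clip, target_peak_dbfs=-6.0)
    print(scaled, f"gain={gain_db:+.2f} dB")
```

A helper like this applies one scalar gain to the whole clip, which is exactly the kind of change the lever probe showed Whisper to be insensitive to.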
References
- /tmp/m17_lever_probe.py during the #82 review (not committed; see PR #82 thread).