investigate(tts): Whisper WER regression on Tier A clips is NOT loudness-driven — bisect M15/#70/#71 prosody changes #83

@shaypal5

Spun out of #78 after the M3c-style loudness fix in PR #82 was empirically falsified as an ASR fix. This issue tracks the actual root cause of the Whisper WER regression reported in PR #77 / #79 / #78.

Background

#78 hypothesized that the post-2026-04-16 ~6 dB peak drop on Tier A clips drove Whisper WER from ~0.04–0.08 up to 0.18–0.28. The PR that closed it (#82) added a target-peak loudness contract and metadata trail, but a controlled lever probe run before merging cleanly falsified the loudness hypothesis.

Lever-probe data (2026-05-05, M4 Max, openai/whisper-large-v3, greedy)

Same source audio (post-fix re-render of sp_neu_a_0001_00), with a single global gain applied to reach each target in a grid of peak / RMS levels:

| variant | peak (dBFS) | RMS (dBFS) | WER | length-ratio |
|---|---|---|---|---|
| 04-15 reference (untouched) | −1.00 | −20.37 | 0.079 | 1.005 |
| post-fix as is | −2.00 | −23.06 | 0.286 | 0.762 |
| scaled to peak −1 dBFS | −1.00 | −22.06 | 0.286 | 0.762 |
| scaled to peak −3 dBFS | −3.00 | −24.06 | 0.286 | 0.762 |
| scaled to peak −6 dBFS | −6.00 | −27.06 | 0.286 | 0.762 |
| scaled to RMS −20.4 dBFS (= 04-15) | +0.69 (clipped) | −20.37 | 0.286 | 0.762 |
| scaled to RMS −16 dBFS | +5.06 (clipped) | −16.00 | 0.286 | 0.762 |
| scaled to RMS −14 dBFS | +7.06 (clipped) | −14.00 | 0.180 | 0.862 |

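Six of the seven post-fix variants produce byte-identical Whisper hypotheses (WER 0.286, length-ratio 0.762), including two that are already clipped past full scale. Whisper's log-mel feature extractor internally normalizes, so peak/RMS within the spec range is invisible to it. The one variant that differs (RMS −14 dBFS) does so only because it is clipped roughly 7 dB past ±1.0, which is signal degradation, not a level improvement.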
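Each row of the grid comes down to one multiplicative gain. A minimal sketch of that arithmetic, assuming a mono float waveform nominally in [-1, 1]; the helper names (`peak_dbfs`, `rms_dbfs`, `scale_to`) are illustrative and are not the repository's `peak_normalize_to_target`:

```python
import math

import numpy as np


def _to_db(x: float) -> float:
    # Amplitude ratio to decibels, floored to avoid log(0) on digital silence.
    return 20.0 * math.log10(max(x, 1e-12))


def peak_dbfs(wav: np.ndarray) -> float:
    return _to_db(float(np.max(np.abs(wav))))


def rms_dbfs(wav: np.ndarray) -> float:
    w = np.asarray(wav, dtype=np.float64)
    return _to_db(float(np.sqrt(np.mean(w * w))))


def scale_to(wav: np.ndarray, target_dbfs: float, mode: str = "peak"):
    """Apply one global gain so that peak (or RMS) lands on target_dbfs.

    Returns (scaled, gain_db, over_fraction). The result is NOT hard-clipped;
    over_fraction is the share of samples beyond full scale (|x| > 1.0),
    which would clip when written to fixed-point PCM.
    """
    if mode == "peak":
        current = peak_dbfs(wav)
    elif mode == "rms":
        current = rms_dbfs(wav)
    else:
        raise ValueError("mode must be 'peak' or 'rms'")
    gain_db = target_dbfs - current
    scaled = wav * (10.0 ** (gain_db / 20.0))
    over_fraction = float(np.mean(np.abs(scaled) > 1.0))
    return scaled, gain_db, over_fraction
```

Because a single gain moves peak and RMS by the same number of dB, the post-fix clip's peak-to-RMS gap (21.06 dB in the table) stays fixed. Matching the 04-15 RMS of −20.37 dBFS therefore pushes the peak to +0.69 dBFS, which is why the RMS-targeted rows are marked as clipped.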

So the Whisper WER regression is content-driven, not level-driven. #78 saw two regressions land in the same window of PRs (loudness and Whisper WER) and read their correlation as causation. The more likely picture is a common cause (the M15 / #70 / #71 prosody work), not a causal link between the two.
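The length-ratio column is the part of the probe that points at content rather than level. A minimal sketch of both metrics, assuming length-ratio means hypothesis word count divided by reference word count (the issue does not define it) and that both strings have already been normalized; the function names are illustrative, and the WER here is a plain word-level edit distance rather than a call into jiwer:

```python
def length_ratio(reference: str, hypothesis: str) -> float:
    """Hypothesis word count over reference word count (assumed definition)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference is empty")
    return len(hypothesis.split()) / len(ref_words)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference is empty")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion (reference word missing)
                cur[j - 1] + 1,          # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)
```

A ratio well below 1.0 alongside a high WER suggests the errors are dominated by deletions, i.e. words missing from the transcript, which a pure gain change would not be expected to cause.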

What changed in the suspect window (2026-04-15 → 2026-05-05)

Sorted by estimated likelihood of impact on Whisper:

  1. #70 (fix(tts): insert inter-word <break> tags to prevent Hebrew word merging). Adds explicit pauses between words. Whisper tends to interpret sustained pauses as utterance boundaries; if breaks land at every word, it may treat each chunk as an isolated short utterance, lose context, and over-trigger silence-token emission. This would also explain the 0.762 length-ratio (Whisper dropping ~24% of words it reads as silence).
  2. #51 (M15; feat(m15): SSML prosody tuning with research-validated Hebrew parameters). Changes rate / pitch / volume per intensity. Slower speech means longer audio and more chances for Whisper's silence detector to mis-segment. The same scene config now yields 63% longer audio (121 s → 198 s, tracked in #81), almost certainly from M15 and #70 combined.
  3. #71 (fix(tts): Azure SSML parsing error on adjacent break elements (#67)). Azure SSML hardening. Less likely: this was a defensive fix that merges adjacent break elements, but it could interact with #70's break placement.
  4. #68 (fix(config): halve pitch escalation at I4–I5 to eliminate helium effect). Changes high-intensity F0. Unlikely to affect Whisper at I1 (the test scene's intensity arc is [1,1,1,2,1]); included for completeness.
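The pause hypothesis in item 1 can be checked on the rendered audio before any revert is attempted. A rough sketch, assuming a mono float waveform; `pause_stats` is a hypothetical helper, and the 20 ms frame, −40 dB relative threshold, and 100 ms minimum pause are placeholder values to tune, not parameters taken from the pipeline:

```python
import numpy as np


def pause_stats(wav: np.ndarray, sr: int, frame_ms: float = 20.0,
                rel_threshold_db: float = -40.0, min_pause_ms: float = 100.0):
    """Return (silent_fraction, pause_count) for a mono waveform.

    A frame is silent when its RMS is more than |rel_threshold_db| dB below
    the loudest frame, so the result does not depend on overall gain.
    A pause is a run of silent frames lasting at least min_pause_ms.
    """
    frame = max(1, int(sr * frame_ms / 1000))
    n = len(wav) // frame
    if n == 0:
        return 0.0, 0
    frames = np.asarray(wav[: n * frame], dtype=np.float64).reshape(n, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    ref = rms.max()
    if ref > 0:
        silent = rms < ref * 10.0 ** (rel_threshold_db / 20.0)
    else:
        silent = np.ones(n, dtype=bool)  # the whole clip is digital silence

    min_frames = max(1, int(round(min_pause_ms / frame_ms)))
    pauses, run = 0, 0
    for is_silent in silent:
        if is_silent:
            run += 1
        else:
            if run >= min_frames:
                pauses += 1
            run = 0
    if run >= min_frames:  # trailing pause at the end of the clip
        pauses += 1
    return float(silent.mean()), pauses
```

Running this on the 04-15 reference and the current render of the same clip would show whether the pause count tracks the number of word boundaries, which is what the #70 explanation predicts.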

Suggested investigation path

  1. Reproduce at the current main HEAD: run openai/whisper-large-v3 on a freshly rendered sp_neu_a_0001 and confirm WER ≈ 0.28.
  2. Bisect by reverting one PR at a time: render the same scene with each suspect PR's code reverted (cherry-pick the revert onto a temp branch). Order: #70 first (highest prior probability), then #51, then #71, then #68.
  3. Identify the dominant contributor: the PR whose revert brings WER back below 0.10 (the 04-15 baseline range).
  4. Decide remediation, which depends on the PR. If #70: tune break-tag frequency or insertion rules so Whisper's silence-detector boundary heuristic is not tripped. If #51: the rate multipliers may be too aggressive. If #71: revisit the SSML normalization rules.
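Step 3 can be written down as a small decision rule once each revert has been rendered and scored. A sketch under the assumption that the per-revert WERs are collected into a mapping keyed by PR label; `dominant_contributor` and `BASELINE_WER_CEILING` are hypothetical names, and the 0.10 ceiling is the threshold stated in step 3:

```python
from typing import Mapping, Optional

# Step 3's threshold: a revert counts as recovered if WER falls below this.
BASELINE_WER_CEILING = 0.10


def dominant_contributor(wer_by_revert: Mapping[str, float],
                         ceiling: float = BASELINE_WER_CEILING) -> Optional[str]:
    """Return the reverted PR with the lowest WER under the ceiling.

    Returns None when no single revert recovers the baseline.
    """
    recovered = {pr: w for pr, w in wer_by_revert.items() if w < ceiling}
    if not recovered:
        return None
    return min(recovered, key=recovered.get)
```

A `None` result would mean that no single revert is sufficient. In that case the next thing to try is probably a combined revert, for example #51 together with #70, since those two are already suspected of jointly causing the duration increase.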

Why this is independent of #82's fix

PR #82 ships:

  • An explicit loudness contract (target peak + safety ceiling) in PreprocessingConfig.
  • A loudness_target_peak_dbfs field in GenerationMetadata, so future loudness drift is diagnosable from clip metadata alone. This is the structural fix that prevents another silent #78-style regression (rendered clip loudness regressed by ~6 dB between 2026-04-15 and 2026-05-05).
  • Tier A / Tier B/C consolidation through the shared peak_normalize_to_target helper.

None of this touches the prosody / SSML / mixer paths. The two changes are orthogonal; this issue tracks the actual ASR fix and should be treated as a separate workstream.

Reproduction

.venv/bin/python -c "
from pathlib import Path
import sys

import soundfile as sf
import torch
from jiwer import wer
from transformers import pipeline

# normalize_for_wer lives in scripts/, so put that directory on the import path.
sys.path.insert(0, 'scripts')
from m17_phase_a_validation import normalize_for_wer

# Whisper large-v3 on Apple MPS when available, otherwise CPU.
asr = pipeline('automatic-speech-recognition', model='openai/whisper-large-v3',
               device=torch.device('mps' if torch.backends.mps.is_available() else 'cpu'),
               torch_dtype=torch.float32, chunk_length_s=30)

for label, path in [
    ('04-15 ref', 'data/m2a_wettest/agg_m_30-45_001/sp_neu_a_0001_00.wav'),
    ('current  ', 'data/m17_loudness_repro/agg_m_30-45_001/sp_neu_a_0001_00.wav'),
]:
    wav, sr = sf.read(path, dtype='float32')
    # Downmix multi-channel audio to mono.
    if wav.ndim > 1:
        wav = wav.mean(axis=1)
    # The reference text sits next to the clip; skip blank lines and lines starting with a bracket.
    txt = Path(path).with_suffix('.txt').read_text(encoding='utf-8')
    ref = '\n'.join(line for line in txt.splitlines() if line and not line.startswith('[')).strip()
    # Greedy decoding: one beam, no sampling.
    out = asr({'raw': wav.copy(), 'sampling_rate': sr},
              generate_kwargs={'language': 'he', 'task': 'transcribe', 'num_beams': 1, 'do_sample': False})
    w = wer(normalize_for_wer(ref), normalize_for_wer(out['text']))
    print(f'{label}: WER={w:.3f}')
"
