investigate(tts): Whisper WER regression on Tier A clips is NOT loudness-driven — bisect M15/#70/#71 prosody changes #83

Spun out of #78 after the M3c-style loudness fix in PR #82 was empirically falsified as an ASR fix.  This issue tracks the *actual* root cause of the Whisper WER regression reported in #77, #79, and #78.

## Background

#78 hypothesized that the post-2026-04-16 ~6 dB peak drop on Tier A clips pushed Whisper WER up from ~0.04–0.08 to 0.18–0.28.  The PR that closed it (#82) added a target-peak loudness contract and metadata trail, but a controlled lever probe run before merging cleanly falsified the loudness hypothesis.

## Lever-probe data (2026-05-05, M4 Max, openai/whisper-large-v3, greedy)

Same source audio (post-fix re-render of `sp_neu_a_0001_00`), single global gain to a grid of peak / RMS targets:

| variant | peak (dBFS) | rms (dBFS) | WER | length-ratio |
|---|---:|---:|---:|---:|
| 04-15 reference (untouched) | −1.00 | −20.37 | **0.079** | 1.005 |
| post-fix as is | −2.00 | −23.06 | 0.286 | 0.762 |
| scaled to peak −1 dBFS | −1.00 | −22.06 | 0.286 | 0.762 |
| scaled to peak −3 dBFS | −3.00 | −24.06 | 0.286 | 0.762 |
| scaled to peak −6 dBFS | −6.00 | −27.06 | 0.286 | 0.762 |
| scaled to RMS −20.4 dBFS (= 04-15) | +0.69 (clipped) | −20.37 | 0.286 | 0.762 |
| scaled to RMS −16 dBFS | +5.06 (clipped) | −16.00 | 0.286 | 0.762 |
| scaled to RMS −14 dBFS | +7.06 (clipped) | −14.00 | 0.180 | 0.862 |

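**Six of the seven gain variants (rows 2–7) produce byte-identical Whisper hypotheses.**  Whisper's log-mel front end is effectively level-insensitive: peak/RMS anywhere in the spec range is invisible to it, and even the mildly clipped rows 6–7 leave the hypothesis unchanged.  The last row differs only because it is clipped far past ±1.0 (+7.06 dBFS), which is signal degradation rather than a loudness improvement, even though its WER happens to be lower.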

So the Whisper regression is content-driven, not level-driven.  #78 saw two regressions land in the same window of PRs (loudness and Whisper WER) and treated their correlation as causation.  They share a common cause (the M15 / #70 / #71 prosody work); neither causes the other.
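The gain grid in the table can be reproduced by applying one global gain per variant. The following is a minimal sketch, assuming mono float waveforms with full scale at ±1.0; the helper names (`scale_to_peak`, `scale_to_rms`, and so on) are illustrative and are not taken from the repo or from the uncommitted probe script.

```python
import numpy as np


def to_dbfs(linear: float) -> float:
    """Convert a linear amplitude to dBFS (full scale = 1.0)."""
    return float(20.0 * np.log10(max(linear, 1e-12)))


def peak_dbfs(wav: np.ndarray) -> float:
    """Sample peak of the waveform in dBFS."""
    return to_dbfs(float(np.max(np.abs(wav))))


def rms_dbfs(wav: np.ndarray) -> float:
    """Whole-clip RMS of the waveform in dBFS."""
    return to_dbfs(float(np.sqrt(np.mean(wav.astype(np.float64) ** 2))))


def apply_gain_db(wav: np.ndarray, gain_db: float) -> np.ndarray:
    """Apply a single global gain. No clipping or limiting is done here."""
    return wav * (10.0 ** (gain_db / 20.0))


def scale_to_peak(wav: np.ndarray, target_dbfs: float) -> np.ndarray:
    """Scale so the sample peak lands on target_dbfs."""
    return apply_gain_db(wav, target_dbfs - peak_dbfs(wav))


def scale_to_rms(wav: np.ndarray, target_dbfs: float) -> np.ndarray:
    """Scale so the RMS lands on target_dbfs; peaks may exceed 0 dBFS."""
    return apply_gain_db(wav, target_dbfs - rms_dbfs(wav))
```

Because the gain is global, peak and RMS move by the same number of dB. That is why the post-fix clip (peak −2.00, RMS −23.06) scaled to peak −1 dBFS lands at RMS −22.06, and why matching the 04-15 RMS of −20.37 dBFS pushes the peak to +0.69 dBFS.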

## What changed in the suspect window (2026-04-15 → 2026-05-05)

Sorted by likelihood-of-impact-on-Whisper:

1. **#70: inter-word `<break>` tags to prevent Hebrew word merging.**  Adds explicit pauses between words.  Whisper tends to interpret sustained pauses as utterance boundaries; if a break lands after every word, Whisper may treat each chunk as an isolated short utterance, lose context, and over-trigger silence-token emission.  This would also explain the 0.762 length-ratio (Whisper dropping ~24% of the words because it reads them as silence).
2. **#51 (M15): SSML prosody tuning.**  Changes rate / pitch / volume per intensity.  Slower speech → longer audio → more chances for Whisper's silence detector to mis-segment.  The same scene config now yields 63% longer audio (#81), almost certainly from M15 + #70 combined.
3. **#71: Azure SSML hardening.**  Less likely; this was a defensive fix that merges adjacent break elements.  It could still interact with #70's break placement.
4. **#68: halve pitch escalation at I4–I5.**  Changes high-intensity F0.  Unlikely to affect Whisper at I1 (the test scene has intensity arc `[1,1,1,2,1]`), but included for completeness.
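Before bisecting, the #70 hypothesis can be sanity-checked by comparing the pause structure of the 04-15 reference against the current render. Below is a rough sketch of a frame-energy pause counter; the function name and all thresholds (20 ms frames, −40 dB relative to the clip peak, 150 ms minimum pause) are assumptions chosen for illustration, not values from the pipeline.

```python
import numpy as np


def pause_durations(wav, sr, frame_ms=20.0, rel_threshold_db=-40.0,
                    min_pause_ms=150.0):
    """Return durations (seconds) of quiet runs lasting at least min_pause_ms.

    A frame counts as quiet when its RMS is below the clip peak scaled by
    rel_threshold_db. Leading and trailing silence are reported as pauses too.
    """
    frame = max(1, int(sr * frame_ms / 1000.0))
    n_frames = len(wav) // frame
    if n_frames == 0:
        return []
    frames = wav[: n_frames * frame].astype(np.float64).reshape(n_frames, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    peak = float(np.max(np.abs(wav)))
    if peak == 0.0:
        return [n_frames * frame / sr]  # the whole clip is digital silence
    quiet = rms < peak * 10.0 ** (rel_threshold_db / 20.0)

    pauses, run = [], 0
    for is_quiet in list(quiet) + [False]:  # sentinel flushes the final run
        if is_quiet:
            run += 1
            continue
        duration = run * frame / sr
        if run and duration >= min_pause_ms / 1000.0:
            pauses.append(duration)
        run = 0
    return pauses
```

Running this on both clips and comparing `len(pauses)` and `sum(pauses)` gives a cheap signal: per-word breaks from #70 should show up as many more pauses in the current render, whereas matching pause profiles would make #70 a less likely culprit.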

## Suggested investigation path

1. **Reproduce** at the current `main` HEAD — run `openai/whisper-large-v3` on a freshly rendered `sp_neu_a_0001`, confirm WER ≈ 0.28.
2. **Bisect by reverting one PR at a time** — render the same scene with each suspect PR's code reverted (cherry-pick the revert onto a temp branch).  Order: #70 first (highest prior probability), then #51, then #71, then #68.
3. **Identify the dominant contributor** — the PR whose revert drops WER below 0.10 (the 04-15 baseline range).
4. **Decide remediation** — depends on which PR it is.  If #70: tune break tag *frequency* or insertion rules so Whisper's silence-detector boundary heuristic is not tripped.  If #51: the rate multipliers may have become too aggressive.  If #71: revisit the SSML normalization rules.
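Step 3 can be made mechanical once each revert has a measured WER. The sketch below ranks reverts by how much WER they recover and flags the ones that get back under the 0.10 ceiling from step 3; `rank_reverts` and `RevertResult` are hypothetical names, and the WER inputs are expected to come from running the Reproduction snippet on each temp-branch render.

```python
from typing import Dict, List, NamedTuple


class RevertResult(NamedTuple):
    pr: str                  # label of the reverted PR, e.g. '#70'
    wer: float               # WER measured with that PR reverted
    recovered: float         # current_wer - wer (positive = improvement)
    restores_baseline: bool  # True if wer is below the baseline ceiling


def rank_reverts(wer_by_revert: Dict[str, float], current_wer: float,
                 baseline_ceiling: float = 0.10) -> List[RevertResult]:
    """Sort single-PR reverts by recovered WER, largest improvement first."""
    rows = [
        RevertResult(pr, w, current_wer - w, w < baseline_ceiling)
        for pr, w in wer_by_revert.items()
    ]
    return sorted(rows, key=lambda r: r.recovered, reverse=True)
```

If no single revert sets `restores_baseline`, that points to a combined effect (for example M15 plus #70), and pairwise reverts would be the next thing to try.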

## Why this is independent of #82's fix

PR #82 ships:
- An explicit loudness contract (target peak + safety ceiling) in `PreprocessingConfig`.
- A `loudness_target_peak_dbfs` field in `GenerationMetadata` so future loudness drift is diagnosable from clip metadata alone — the structural fix that prevents another silent #78-style regression.
- Tier A / Tier B/C consolidation through the shared `peak_normalize_to_target` helper.

None of this touches the prosody / SSML / mixer paths.  The two changes are orthogonal: #82 fixes loudness, while this issue tracks the actual ASR fix and should be treated as a separate workstream.

## Reproduction

```bash
.venv/bin/python -c "
import sys
from pathlib import Path

import soundfile as sf
import torch
from jiwer import wer
from transformers import pipeline

# normalize_for_wer lives under scripts/, which is not an installed package.
sys.path.insert(0, 'scripts')
from m17_phase_a_validation import normalize_for_wer

# Greedy Whisper large-v3 in float32, on MPS when available.
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
asr = pipeline('automatic-speech-recognition', model='openai/whisper-large-v3',
               device=device, torch_dtype=torch.float32, chunk_length_s=30)

for label, path in [
    ('04-15 ref', 'data/m2a_wettest/agg_m_30-45_001/sp_neu_a_0001_00.wav'),
    ('current  ', 'data/m17_loudness_repro/agg_m_30-45_001/sp_neu_a_0001_00.wav'),
]:
    wav, sr = sf.read(path, dtype='float32')
    if wav.ndim > 1:
        wav = wav.mean(axis=1)  # downmix to mono
    # Reference text: sibling .txt file, minus blank lines and lines starting with [
    txt = Path(path).with_suffix('.txt').read_text(encoding='utf-8')
    ref = '\n'.join(line for line in txt.splitlines() if line and not line.startswith('[')).strip()
    out = asr({'raw': wav.copy(), 'sampling_rate': sr},
              generate_kwargs={'language': 'he', 'task': 'transcribe', 'num_beams': 1, 'do_sample': False})
    w = wer(normalize_for_wer(ref), normalize_for_wer(out['text']))
    print(f'{label}: WER={w:.3f}')
"
```

## References

- #78 — original mis-attribution; closed by PR #82 with the loudness-contract fix.
- PR #82 — loudness contract + metadata trail.
- #77, #79 — original M17 Phase A spike that surfaced the WER numbers.
- #81 — same scene config now 63% longer (correlated symptom of the same prosody changes).
- Reproduction script saved at `/tmp/m17_lever_probe.py` during the #82 review (not committed; see PR #82 thread).
