Skip to content

fix(tts): #83 partial — revert inter-word breaks (low-intensity scenes only)#86

Merged
shaypal5 merged 3 commits into
mainfrom
investigate/83-whisper-wer-bisect
May 5, 2026
Merged

fix(tts): #83 partial — revert inter-word breaks (low-intensity scenes only)#86
shaypal5 merged 3 commits into
mainfrom
investigate/83-whisper-wer-bisect

Conversation

@shaypal5
Copy link
Copy Markdown
Member

@shaypal5 shaypal5 commented May 5, 2026

Problem

Whisper-large-v3 WER on Tier A clips dropped from ~0.04–0.08 (pre-2026-04-16) to ~0.286 on current main (#83). The same scene config also produces 63% longer audio (#81 — 121s → 198s on sp_neu_a_0001).

PR #82 falsified loudness as the cause via a controlled lever probe. #83 was spun out as the actual ASR fix.

Bisect — partial result

Same scene, same random_seed: 1401, same Whisper-large-v3 invocation across all rows. Two scenes (N=2): one low-intensity (intensity arc almost all I1), one high-intensity (I1→I5 arc).

scene variant duration WER length-ratio
sp_neu_a_0001 (low-intensity, arc [1,1,1,2,1]) 04-15 reference 121.0 s 0.079 1.000
sp_neu_a_0001 current main HEAD 197.8 s 0.286 0.753
sp_neu_a_0001 this PR (revert #70 + #71) 122.9 s 0.048 0.995
sp_it_a_0001 (high-intensity, arc [1,2,3,4,5,4,3]) 04-15 reference 155.9 s 0.056 1.009
sp_it_a_0001 this PR (revert #70 + #71) 146.6 s 0.322 0.709

What this PR fixes

What this PR does NOT fix

  • sp_it_a_0001 (high-intensity): residual regression remains. WER 0.322 vs 04-15 reference's 0.056 — same Whisper-silence-detector fingerprint (length-ratio 0.709, ~28% of words missing). The Hebrew text is byte-identical between the 04-15 reference and the revert branch (verified via diff), so the residual gap is purely audio rendering.

There is therefore a second TTS-side change in the post-2026-04-15 window that fires at I3+. PR #86 does not identify or fix it.

Why #70 was the dominant low-intensity cause

<break time="50ms"/> between every word pair does two things at every word boundary:

  1. Adds a small silence. Across ~100–200 word boundaries per clip, the cumulative silence is large, and Whisper's silence-detection heuristic treats the audio as fragmented short utterances → drops context → over-emits silence tokens.
  2. Resets the prosody pipeline at every word. Words can no longer flow into each other prosodically — cadence becomes uniform and unnaturally rhythmic.

Native-speaker A/B confirms the perceptual signature: "very over-pausy", "very robotic content tone", "weird, rhythmic, equal spacing between words" — all consistent with mechanical 50 ms inter-word silences regardless of natural prosody.

#71 is reverted alongside #70 because most of #71 only existed to defuse an SSML parse error (#67) that #70 itself introduced. However, #71 also bundled three independent hardenings unrelated to #70 — those have been re-introduced individually in this PR's second commit (see "Restored hardenings" below).

Bisect scope and limitations (be honest)

Restored hardenings (from #71, unrelated to #70)

The initial revert in this PR was a wholesale git revert of #70 + #71. PR #71 actually bundled three independent hardenings — only one (the adjacent-break merge) was caused by #70. The second commit on this branch restores the other two and narrows the third:

hardening independent of #70? status in this PR
_sanitize_text (strip XML 1.0 control chars from LLM output) Yes — independent bug class restored + regression test
Azure prosody-range clamping (±50% pitch, -50%..+200% rate) + warning logs Yes — independent bug class (e.g. speaker_BYS_F_6-10_001.yaml has pitch_delta_st=+9+54% unclamped) restored + regression tests
Adjacent-break merge in _apply_phrase_prosody No, but a residual case (phrase break_after → next phrase break_before) survives the #70 revert restored, narrowed to the surviving case + regression test

These were independently flagged by Copilot's review on this PR (3 review threads, all resolved).

Trade-off — #62 is reopened

#70 was solving #62 (Hebrew word merging — "most generated speech unintelligible to a native Hebrew speaker" per 2026-05-03 listening test). Reverting #70 brings the original word-merging problem back. This is a deliberate trade-off:

  • The Whisper WER regression is a 6× degradation that blocks all ASR-based eval.
  • Hebrew word-merging intelligibility is the listener's call and needs A/B comparison against alternative mitigations.

#62 is reopened with the following alternative-mitigation list. Any future attempt should be validated with both a native-speaker A/B listening test (the original failure mode is invisible to ASR metrics) and a Whisper WER regression check on sp_neu_a_0001 (the prior failure mode is invisible to a listener-only test).

  1. Add nikud (vowel diacritics) to high-error words via the Stage 1b normalization lexicon. Targets the actual cause (missing vowel disambiguation) rather than treating every word boundary as ambiguous.
  2. Explicit <phoneme alphabet="ipa"> tags for known-problematic words, scoped to the specific failures rather than applied globally.
  3. Test Google Chirp 3 HD as primary for he-IL — Azure he-IL appears to be the engine that needs the most help; Chirp may handle Hebrew word boundaries differently.
  4. If breaks are revisited — try shorter durations (10–20 ms) AND only at boundaries where merging is empirically observed (not every word). Note that the prosody-reset side-effect of <break> may persist even at low durations, so a listening test is required regardless of duration.

Changes per file

file change reason
synthbanshee/tts/ssml_builder.py reverted _inject_word_breaks and per-word break logic; restored _sanitize_text, prosody clamping (with warning logs), and a narrowed adjacent-break merge partial revert — keep #70's removal, keep #71's unrelated hardenings
tests/unit/test_tts.py dropped tests for the removed per-word-break behaviour; added TestSSMLInvariants class with five new tests pinning the restored hardenings + the #83 regression rule test_no_per_word_breaks_in_default_ssml is the regression test that would catch any future re-introduction of #70's approach
tests/unit/test_phrase_prosody.py restored pre-#70 assertion shapes text-injection no longer creates <break> children in the no-phrases case

No prosody / SSML defaults changed. Loudness contract from #82 remains in effect.

Test plan

  • pytest tests/unit -q1662 passed locally
  • ruff check synthbanshee/tts/ tests/unit/ — clean
  • mypy synthbanshee/tts/ssml_builder.py — clean
  • CI: all 6 checks green on the first pushed commit (d07e144); restoration commit (df65198) re-running CI now
  • Reproduce WER gap on the existing pre-rendered pair (no render cost) — confirmed 0.079 vs 0.286
  • Render sp_neu_a_0001 on this branch (low-intensity, 12 turns, cache-miss against Azure) — duration 122.9 s, WER 0.048
  • Render sp_it_a_0001 on this branch (high-intensity, 17 turns, cache-miss) — duration 146.6 s, WER 0.322 (residual cause confirmed; tracked in investigate(tts): #83 residual — Whisper WER regression on high-intensity (I3+) Tier A clips #87)
  • All three Copilot review threads resolved by the restoration commit

References

Copilot AI review requested due to automatic review settings May 5, 2026 20:41
@shaypal5 shaypal5 added this to the M17 milestone May 5, 2026
@shaypal5 shaypal5 added bugfix comp: tts TTS rendering, SSML, Azure/Google providers labels May 5, 2026
@github-actions

This comment has been minimized.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reverts the SSML “inter-word <break> injection” behavior (and its test suite) to restore pre-#70/#71 TTS output characteristics, addressing the reported Whisper WER and scene-duration regressions in the Hebrew TTS pipeline.

Changes:

  • Revert SSMLBuilder back to direct text injection (removing word-boundary <break> insertion and related SSML normalization logic).
  • Remove unit tests that asserted word-break injection and adjacent-break merging behavior.
  • Update phrase-prosody unit tests to match the restored “plain text” behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
synthbanshee/tts/ssml_builder.py Reverts SSML generation logic to remove inter-word break injection and associated handling.
tests/unit/test_tts.py Drops tests that validated inter-word breaks / adjacent-break merging / sanitization behavior.
tests/unit/test_phrase_prosody.py Updates assertions to match non-break-injected phrase prosody output.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread synthbanshee/tts/ssml_builder.py
Comment thread synthbanshee/tts/ssml_builder.py
Comment thread synthbanshee/tts/ssml_builder.py
…regression test

The initial revert in this branch (revert PR #70 + PR #71 wholesale) was
too aggressive: PR #71 bundled three independent hardenings, only one of
which was caused by PR #70.  This commit restores the two hardenings that
have nothing to do with inter-word break injection, and narrows the third
to the residual case that survives #70's revert.  Also adds the #83
regression test that pins the bisect finding.

Restored from #71 (and explicitly verified to NOT re-introduce #83):

- `_sanitize_text` + `_XML_INVALID_CHARS_RE` regex.  Defends against an
  LLM emitting XML 1.0 control characters that otherwise make the SSML
  unparseable by Azure.  Independent bug class from per-word breaks.
- Azure-range prosody clamping in `_semitones_to_percent`,
  `_rate_to_string`, `_volume_to_string`, plus warning logs on clamp
  activation.  `speaker_BYS_F_6-10_001.yaml` ships `pitch_delta_st=+9`
  → +54% unclamped, which Azure rejects.  Independent bug class.
- Adjacent `<break>` merging in `_apply_phrase_prosody`, narrowed to the
  phrase-after / phrase-before case (the only adjacent-break source that
  survives #70's revert).  The original #71 logic also had a word-break
  branch that is no longer reachable.

Added:

- `test_no_per_word_breaks_in_default_ssml` regression test pinned to
  #83.  The default multi-word SSML must not contain `<break>` tags;
  per-word break injection (PR #70) tripped Whisper's silence-detection
  heuristic and produced the WER regression.  Any future Hebrew word-
  merge mitigation (#62) must not re-introduce per-word breaks.
- `test_text_with_invalid_xml_chars_sanitized`,
  `test_prosody_pitch_clamped_to_azure_range`,
  `test_prosody_rate_clamped_to_azure_range`,
  `test_prosody_volume_clamped_to_azure_range`,
  `test_adjacent_phrase_breaks_are_merged` — pin the restored hardenings.

All three of these were independently flagged by Copilot's review on this
PR (resolves three Copilot review threads).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shaypal5 shaypal5 changed the title fix(tts): #83 revert inter-word break tags — root cause of Whisper WER & duration regressions fix(tts): #83 partial — revert inter-word breaks (low-intensity scenes only) May 5, 2026
@github-actions

This comment has been minimized.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

pr-agent-context report:

This is a refreshed snapshot of the current PR state.

This run includes a patch coverage gap on PR #86 in repository https://github.com/DataHackIL/SynthBanshee

Address the patch coverage gaps below, then push all of these changes in a single commit.

# Patch coverage

Patch test coverage is 53.33%; please raise it to 100%. These are the uncovered code lines:
- synthbanshee/tts/ssml_builder.py: 115, 118, 119, 120, 121, 137, 141

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: schedule
Workflow run: 25403543471 attempt 1
Comment timestamp: 2026-05-05T21:33:48.144278+00:00
PR head commit: df651981211de9d2d8db65caf992a31dac8f58a4

@shaypal5 shaypal5 merged commit a910f2e into main May 5, 2026
6 checks passed
@shaypal5 shaypal5 deleted the investigate/83-whisper-wer-bisect branch May 5, 2026 21:35
shaypal5 added a commit that referenced this pull request May 6, 2026
…oor + helium range (#90)

The bisect on PR #86 showed the residual sp_it_a_0001 WER regression
(0.322 vs 04-15's 0.056) is caused by M7 SpeakerState drift compounding
with #51's M15 style_map values, producing effective pitch +14 % to +17 %
and rate 1.27-1.33x at high-intensity turns.  That range simultaneously
sounds cartoonish to listeners (May-3 listening test "helium / oompa-
loompa") and trips Whisper-large-v3's silence-detection heuristic — the
classic length-ratio collapse to ~0.7 that hid the bug for weeks.

This PR ships a partial fix: a runtime effective-prosody cap that
addresses the canonical Whisper-backdoor fingerprint and the helium-
range pitch concern, plus the two detection layers Shay asked for to
catch this class of regression in the future.  It does NOT fully
restore high-intensity WER to the pre-#51 baseline — see #89 for the
follow-up workstream.

## Tier-3 Whisper validation (`sp_it_a_0001`)

| variant | dur | WER | length_ratio | hyp / ref |
|---|---:|---:|---:|---:|
| 04-15 reference | 155.9 s | 0.056 | 1.009 | 236 / 234 |
| post-#86 main (no cap) | 146.6 s | 0.322 | 0.709 | 166 / 234 |
| this PR (cap active) | 149.1 s | **0.129** | **0.906** | 212 / 234 |

  - Length-ratio recovers above the qa-report --asr 0.85 threshold.
  - WER reduced 2.5x (0.322 -> 0.129) but still above the 04-15 baseline
    of 0.056.  Failure mode shifts from silence-detector trip
    (~30 % of words missing) to substitution noise — distinct mechanism
    requiring a paired listening test to fix without breaking M15
    naturalness calibration.  Tracked in #89 with insights and four
    proposed approaches.

## The fix — effective-prosody runtime cap

`synthbanshee/tts/renderer._apply_effective_prosody_cap` clamps post-
state, post-randomization prosody before SSML emission:

  - pitch in [-3.0, +2.0] st  (~ +/- 12 % Azure)
  - rate  in [0.85, 1.20]
  - volume left to the existing +/-50 % Azure clamp (Whisper internally
    normalizes loudness, per #82's lever probe — not a Whisper-trip
    dimension).

Caps are anchored to the pre-#51 effective envelope, which produced the
04-15 reference clips with WER 0.04-0.08.  Tighter caps would diverge
further from M15 listening-test calibration; looser caps would re-trip
Whisper.  Each cap activation logs a warning and is recorded per turn.

## Detection layer 1 — static prosody-cap activations in metadata

  - `DialogueTurn.effective_prosody_caps` carries per-turn cap events.
  - `cli.py` rolls them up into
    `ClipMetadata.generation_metadata.effective_prosody_caps`
    (new `EffectiveProsodyCapEvent` model in labels/schema.py).
  - `qa-report` surfaces a new "Effective-Prosody Cap Activations (#87)"
    table per clip — runs on every batch, no Azure / Whisper required.
    Tier-3 render of sp_it_a_0001 recorded 14 cap activations across
    7 high-intensity turns; metadata example in PR description.

## Detection layer 2 — `qa-report --asr` Whisper backdoor check

New `synthbanshee/package/asr_sanity.py` provides a lazy-loaded
`WhisperRunner` and `compute_asr_metrics`.  `qa-report --asr` runs
Whisper-large-v3 on every clip in a directory, flags clips whose
length-ratio falls below `--asr-min-length-ratio` (default 0.85 — the
#87 fingerprint sat at ~0.71).  Heavy dependencies isolated in the new
`eval-asr` optional extra so normal generation/QA stays light.

Per the policy decision documented in CLAUDE.md ("ASR sanity check
policy"), Tier-3 ASR sanity is local-only (not in CI) for now — see
GH issue #88 for the deferred CI re-evaluation triggers.

## Tests

  - tests/unit/test_effective_prosody_cap.py: 11 tests covering the
    helper unit, render_utterance integration, and render_scene event
    propagation to DialogueTurn.
  - tests/unit/test_qa.py::TestProsodyCapRollup: 3 tests verifying
    cap-event aggregation in qa-report.
  - tests/unit/test_asr_sanity.py: 11 tests covering normalize_for_wer,
    AsrMetrics threshold semantics, and bracket-line stripping in the
    reference parser.  Heavy Whisper inference is exercised by the
    Tier-3 local run, not these tests.
  - 1687 unit tests pass (1662 baseline + 25 new); ruff + mypy clean.

## Docs

  - CLAUDE.md: new "ASR sanity check policy" section + "What NOT to do"
    bullets pinning the cap thresholds and the Tier-3 local-only policy.
  - pyproject.toml: new `eval-asr` optional extra.

Reduces #87 (does not fully close — see #89 for the residual WER work).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
shaypal5 added a commit that referenced this pull request May 7, 2026
…ft (#96)

* docs: sync .agent-plan.md to current state (M16 done, M17 in flight)

The state tracker was last updated 2026-04-22 and claimed the active task
was M8b. Since then M8b, M9a, M9b, M10a, M10b, M11, M13, M14, M15, M16
have all merged, and M17 (automated evaluation) has had its design (#73),
Phase A spike (#77, #79), and a wave of bug-fix PRs (#82, #85, #86, #90)
land — anyone reading this file got an 8-milestone-stale picture.

Rewrote:
- Current system state — list every merged V3 milestone with one-line
  summaries; promoted the loudness contract (#78) and effective-prosody
  cap (#87) to the architectural-invariants list since both are
  load-bearing for any agent editing TTS or preprocessing code.
- Active / next task — replaced "M8b" with the M17 ASR regression thread,
  noted PR #90 as the partial fix and #91 (rate-floor lift R) as the
  queued next experiment, including the WER ≤ 0.10 + listening-test pass
  criterion.
- Open threads table — new section listing the threads agents most often
  need to know about (M12 gate, M17 full automation, #62 word merging,
  #72 SSML parse, #88 CI ASR deferral).
- Context pointers — added splendor-brief as the orientation entrypoint
  and the .venv-vs-~/.local PATH trap.
- CI / Workflow notes — added the Tier-3 ASR local-only policy summary
  so PRs touching audio-rendering files don't merge without it.

This file is a quick-orientation summary — added a header line marking
the design-doc tracker as authoritative when details disagree, so the
next drift gets caught in a tracker diff rather than a stale summary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: delete .agent-plan.md (Option A; supersedes prior commit)

Self-review of commit 4361497 (the .agent-plan.md rewrite) caught that
the file fundamentally fails the duplication test: milestone status
duplicates docs/audio_generation_v3_design.md → Implementation Tracker;
open-thread state duplicates GitHub Issues + splendor brief; pointers
duplicate AGENTS.md; and the only non-duplicate section
(architectural invariants) is already covered in AGENTS.md (FLOAT
subtype line 72, MixedScene shift line 37, validate_audio peak line 85,
full #78 loudness contract line 35, full #87 effective-prosody cap
line 73).

The rewrite also reproduced the "if details disagree, the tracker wins"
disclaimer antipattern — the same one PR #94's review removed from
docs/spec.md §3.1 just one PR ago. A docs file whose explicit charter
is "I am allowed to be wrong relative to the canonical source" is
structured drift bait. The original .agent-plan.md got 15 days stale;
"rewrite more carefully" was the wrong response.

Changes:
- Delete .agent-plan.md (75-line summary that duplicated load-bearing
  state held authoritatively elsewhere).
- Update .claude/skills/open-feature-pr.md step 2 to drop the
  ".agent-plan.md" fallback for milestone-ID inference; the branch
  name + parent issue's milestone field already cover it, and pointing
  at the design-doc tracker as authoritative is more durable than
  pointing at a manually-maintained summary.
- Splendor maintenance: source forget src-9d9759e5ad... --apply.
  Removes the orphan source manifest, wiki summary page, and wiki
  index entry. Residual cross-references in 5 planning tasks and 5
  wiki pages remain — splendor surfaces them but doesn't auto-clean;
  they'll regenerate on next ingest of those sources.

Nothing added to AGENTS.md: the cross-cutting rules from .agent-plan.md
are already there. The remaining "invariants" (5-tuple mixer API
post-M8a, audible_* timeline use, MixMode no-audio-deps, _peak_limit vs
_normalize_peak naming) are mixer-internal details that belong in
module docstrings, not global agent rules — and AGENTS.md has its own
M8a drift bug (line 71 still says "4-tuple") that's better fixed in a
dedicated PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(agents): fix M3a/M8a/#74 drift in mixer Segment API description

AGENTS.md "TTS" section claimed `SceneMixer.mix_sequential` took a
4-tuple `(wav_bytes, pause_s, speaker_id, rms_target_dbfs)`. The actual
current API is the `Segment` dataclass in `synthbanshee/tts/mixer.py`
with six named fields (`wav_bytes`, `amount_s`, `speaker_id`,
`rms_target_dbfs`, `mix_mode`, `intensity`). The doc has been wrong
since at least M8a (added `mix_mode`); #74's Lombard tilt then added
`intensity`. Per the dataclass docstring, the named-fields move from
positional tuple was deliberate so call sites and reviewers can't
transpose args silently.

Surfaced during PR #96 review (delete-.agent-plan.md) — the original
.agent-plan.md and AGENTS.md disagreed about the segment API; turns out
they were both stale, and the design-doc tracker (#74 row line 219)
shows the dataclass move that neither AGENTS.md nor .agent-plan.md
caught.

Splendor re-ingest of AGENTS.md (and the opportunistic re-ingest of
docs/spec.md that ingest --changed picked up) is intentionally NOT in
this commit — it's 16 state/wiki/planning files of churn, scoped for a
dedicated splendor maintenance follow-up rather than mixed into a doc
fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs(agents): cross-reference Lombard tilt as #65 (issue), not #74 (PR)

Copilot's review of commit b20eaac caught a cross-reference convention
break: the codebase consistently uses #65 (the issue) when referencing
the Lombard high-shelf — see synthbanshee/tts/mixer.py lines 6, 80, 84,
88, 108, 183, 322, and docs/audio_generation_v3_design.md §4.2c
(line 215). PR #74 is the implementation that closed the issue, but
the cross-link convention is to point at the issue.

One-character fix: #74#65 in the M3a TTS bullet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix comp: tts TTS rendering, SSML, Azure/Google providers

Projects

None yet

2 participants