fix(tts): #83 partial — revert inter-word breaks (low-intensity scenes only)#86
Merged
Merged
Conversation
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Pull request overview
This PR reverts the SSML “inter-word <break> injection” behavior (and its test suite) to restore pre-#70/#71 TTS output characteristics, addressing the reported Whisper WER and scene-duration regressions in the Hebrew TTS pipeline.
Changes:
- Revert SSMLBuilder back to direct text injection (removing word-boundary
<break>insertion and related SSML normalization logic). - Remove unit tests that asserted word-break injection and adjacent-break merging behavior.
- Update phrase-prosody unit tests to match the restored “plain text” behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
synthbanshee/tts/ssml_builder.py |
Reverts SSML generation logic to remove inter-word break injection and associated handling. |
tests/unit/test_tts.py |
Drops tests that validated inter-word breaks / adjacent-break merging / sanitization behavior. |
tests/unit/test_phrase_prosody.py |
Updates assertions to match non-break-injected phrase prosody output. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…regression test The initial revert in this branch (revert PR #70 + PR #71 wholesale) was too aggressive: PR #71 bundled three independent hardenings, only one of which was caused by PR #70. This commit restores the two hardenings that have nothing to do with inter-word break injection, and narrows the third to the residual case that survives #70's revert. Also adds the #83 regression test that pins the bisect finding. Restored from #71 (and explicitly verified to NOT re-introduce #83): - `_sanitize_text` + `_XML_INVALID_CHARS_RE` regex. Defends against an LLM emitting XML 1.0 control characters that otherwise make the SSML unparseable by Azure. Independent bug class from per-word breaks. - Azure-range prosody clamping in `_semitones_to_percent`, `_rate_to_string`, `_volume_to_string`, plus warning logs on clamp activation. `speaker_BYS_F_6-10_001.yaml` ships `pitch_delta_st=+9` → +54% unclamped, which Azure rejects. Independent bug class. - Adjacent `<break>` merging in `_apply_phrase_prosody`, narrowed to the phrase-after / phrase-before case (the only adjacent-break source that survives #70's revert). The original #71 logic also had a word-break branch that is no longer reachable. Added: - `test_no_per_word_breaks_in_default_ssml` regression test pinned to #83. The default multi-word SSML must not contain `<break>` tags; per-word break injection (PR #70) tripped Whisper's silence-detection heuristic and produced the WER regression. Any future Hebrew word- merge mitigation (#62) must not re-introduce per-word breaks. - `test_text_with_invalid_xml_chars_sanitized`, `test_prosody_pitch_clamped_to_azure_range`, `test_prosody_rate_clamped_to_azure_range`, `test_prosody_volume_clamped_to_azure_range`, `test_adjacent_phrase_breaks_are_merged` — pin the restored hardenings. All three of these were independently flagged by Copilot's review on this PR (resolves three Copilot review threads). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
|
pr-agent-context report: This is a refreshed snapshot of the current PR state.
This run includes a patch coverage gap on PR #86 in repository https://github.com/DataHackIL/SynthBanshee
Address the patch coverage gaps below, then push all of these changes in a single commit.
# Patch coverage
Patch test coverage is 53.33%; please raise it to 100%. These are the uncovered code lines:
- synthbanshee/tts/ssml_builder.py: 115, 118, 119, 120, 121, 137, 141Run metadata: |
shaypal5
added a commit
that referenced
this pull request
May 6, 2026
…oor + helium range (#90) The bisect on PR #86 showed the residual sp_it_a_0001 WER regression (0.322 vs 04-15's 0.056) is caused by M7 SpeakerState drift compounding with #51's M15 style_map values, producing effective pitch +14 % to +17 % and rate 1.27-1.33x at high-intensity turns. That range simultaneously sounds cartoonish to listeners (May-3 listening test "helium / oompa- loompa") and trips Whisper-large-v3's silence-detection heuristic — the classic length-ratio collapse to ~0.7 that hid the bug for weeks. This PR ships a partial fix: a runtime effective-prosody cap that addresses the canonical Whisper-backdoor fingerprint and the helium- range pitch concern, plus the two detection layers Shay asked for to catch this class of regression in the future. It does NOT fully restore high-intensity WER to the pre-#51 baseline — see #89 for the follow-up workstream. ## Tier-3 Whisper validation (`sp_it_a_0001`) | variant | dur | WER | length_ratio | hyp / ref | |---|---:|---:|---:|---:| | 04-15 reference | 155.9 s | 0.056 | 1.009 | 236 / 234 | | post-#86 main (no cap) | 146.6 s | 0.322 | 0.709 | 166 / 234 | | this PR (cap active) | 149.1 s | **0.129** | **0.906** | 212 / 234 | - Length-ratio recovers above the qa-report --asr 0.85 threshold. - WER reduced 2.5x (0.322 -> 0.129) but still above the 04-15 baseline of 0.056. Failure mode shifts from silence-detector trip (~30 % of words missing) to substitution noise — distinct mechanism requiring a paired listening test to fix without breaking M15 naturalness calibration. Tracked in #89 with insights and four proposed approaches. ## The fix — effective-prosody runtime cap `synthbanshee/tts/renderer._apply_effective_prosody_cap` clamps post- state, post-randomization prosody before SSML emission: - pitch in [-3.0, +2.0] st (~ +/- 12 % Azure) - rate in [0.85, 1.20] - volume left to the existing +/-50 % Azure clamp (Whisper internally normalizes loudness, per #82's lever probe — not a Whisper-trip dimension). Caps are anchored to the pre-#51 effective envelope, which produced the 04-15 reference clips with WER 0.04-0.08. Tighter caps would diverge further from M15 listening-test calibration; looser caps would re-trip Whisper. Each cap activation logs a warning and is recorded per turn. ## Detection layer 1 — static prosody-cap activations in metadata - `DialogueTurn.effective_prosody_caps` carries per-turn cap events. - `cli.py` rolls them up into `ClipMetadata.generation_metadata.effective_prosody_caps` (new `EffectiveProsodyCapEvent` model in labels/schema.py). - `qa-report` surfaces a new "Effective-Prosody Cap Activations (#87)" table per clip — runs on every batch, no Azure / Whisper required. Tier-3 render of sp_it_a_0001 recorded 14 cap activations across 7 high-intensity turns; metadata example in PR description. ## Detection layer 2 — `qa-report --asr` Whisper backdoor check New `synthbanshee/package/asr_sanity.py` provides a lazy-loaded `WhisperRunner` and `compute_asr_metrics`. `qa-report --asr` runs Whisper-large-v3 on every clip in a directory, flags clips whose length-ratio falls below `--asr-min-length-ratio` (default 0.85 — the #87 fingerprint sat at ~0.71). Heavy dependencies isolated in the new `eval-asr` optional extra so normal generation/QA stays light. Per the policy decision documented in CLAUDE.md ("ASR sanity check policy"), Tier-3 ASR sanity is local-only (not in CI) for now — see GH issue #88 for the deferred CI re-evaluation triggers. ## Tests - tests/unit/test_effective_prosody_cap.py: 11 tests covering the helper unit, render_utterance integration, and render_scene event propagation to DialogueTurn. - tests/unit/test_qa.py::TestProsodyCapRollup: 3 tests verifying cap-event aggregation in qa-report. - tests/unit/test_asr_sanity.py: 11 tests covering normalize_for_wer, AsrMetrics threshold semantics, and bracket-line stripping in the reference parser. Heavy Whisper inference is exercised by the Tier-3 local run, not these tests. - 1687 unit tests pass (1662 baseline + 25 new); ruff + mypy clean. ## Docs - CLAUDE.md: new "ASR sanity check policy" section + "What NOT to do" bullets pinning the cap thresholds and the Tier-3 local-only policy. - pyproject.toml: new `eval-asr` optional extra. Reduces #87 (does not fully close — see #89 for the residual WER work). Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 6, 2026
shaypal5
added a commit
that referenced
this pull request
May 7, 2026
…ft (#96) * docs: sync .agent-plan.md to current state (M16 done, M17 in flight) The state tracker was last updated 2026-04-22 and claimed the active task was M8b. Since then M8b, M9a, M9b, M10a, M10b, M11, M13, M14, M15, M16 have all merged, and M17 (automated evaluation) has had its design (#73), Phase A spike (#77, #79), and a wave of bug-fix PRs (#82, #85, #86, #90) land — anyone reading this file got an 8-milestone-stale picture. Rewrote: - Current system state — list every merged V3 milestone with one-line summaries; promoted the loudness contract (#78) and effective-prosody cap (#87) to the architectural-invariants list since both are load-bearing for any agent editing TTS or preprocessing code. - Active / next task — replaced "M8b" with the M17 ASR regression thread, noted PR #90 as the partial fix and #91 (rate-floor lift R) as the queued next experiment, including the WER ≤ 0.10 + listening-test pass criterion. - Open threads table — new section listing the threads agents most often need to know about (M12 gate, M17 full automation, #62 word merging, #72 SSML parse, #88 CI ASR deferral). - Context pointers — added splendor-brief as the orientation entrypoint and the .venv-vs-~/.local PATH trap. - CI / Workflow notes — added the Tier-3 ASR local-only policy summary so PRs touching audio-rendering files don't merge without it. This file is a quick-orientation summary — added a header line marking the design-doc tracker as authoritative when details disagree, so the next drift gets caught in a tracker diff rather than a stale summary. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs: delete .agent-plan.md (Option A; supersedes prior commit) Self-review of commit 4361497 (the .agent-plan.md rewrite) caught that the file fundamentally fails the duplication test: milestone status duplicates docs/audio_generation_v3_design.md → Implementation Tracker; open-thread state duplicates GitHub Issues + splendor brief; pointers duplicate AGENTS.md; and the only non-duplicate section (architectural invariants) is already covered in AGENTS.md (FLOAT subtype line 72, MixedScene shift line 37, validate_audio peak line 85, full #78 loudness contract line 35, full #87 effective-prosody cap line 73). The rewrite also reproduced the "if details disagree, the tracker wins" disclaimer antipattern — the same one PR #94's review removed from docs/spec.md §3.1 just one PR ago. A docs file whose explicit charter is "I am allowed to be wrong relative to the canonical source" is structured drift bait. The original .agent-plan.md got 15 days stale; "rewrite more carefully" was the wrong response. Changes: - Delete .agent-plan.md (75-line summary that duplicated load-bearing state held authoritatively elsewhere). - Update .claude/skills/open-feature-pr.md step 2 to drop the ".agent-plan.md" fallback for milestone-ID inference; the branch name + parent issue's milestone field already cover it, and pointing at the design-doc tracker as authoritative is more durable than pointing at a manually-maintained summary. - Splendor maintenance: source forget src-9d9759e5ad... --apply. Removes the orphan source manifest, wiki summary page, and wiki index entry. Residual cross-references in 5 planning tasks and 5 wiki pages remain — splendor surfaces them but doesn't auto-clean; they'll regenerate on next ingest of those sources. Nothing added to AGENTS.md: the cross-cutting rules from .agent-plan.md are already there. The remaining "invariants" (5-tuple mixer API post-M8a, audible_* timeline use, MixMode no-audio-deps, _peak_limit vs _normalize_peak naming) are mixer-internal details that belong in module docstrings, not global agent rules — and AGENTS.md has its own M8a drift bug (line 71 still says "4-tuple") that's better fixed in a dedicated PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(agents): fix M3a/M8a/#74 drift in mixer Segment API description AGENTS.md "TTS" section claimed `SceneMixer.mix_sequential` took a 4-tuple `(wav_bytes, pause_s, speaker_id, rms_target_dbfs)`. The actual current API is the `Segment` dataclass in `synthbanshee/tts/mixer.py` with six named fields (`wav_bytes`, `amount_s`, `speaker_id`, `rms_target_dbfs`, `mix_mode`, `intensity`). The doc has been wrong since at least M8a (added `mix_mode`); #74's Lombard tilt then added `intensity`. Per the dataclass docstring, the named-fields move from positional tuple was deliberate so call sites and reviewers can't transpose args silently. Surfaced during PR #96 review (delete-.agent-plan.md) — the original .agent-plan.md and AGENTS.md disagreed about the segment API; turns out they were both stale, and the design-doc tracker (#74 row line 219) shows the dataclass move that neither AGENTS.md nor .agent-plan.md caught. Splendor re-ingest of AGENTS.md (and the opportunistic re-ingest of docs/spec.md that ingest --changed picked up) is intentionally NOT in this commit — it's 16 state/wiki/planning files of churn, scoped for a dedicated splendor maintenance follow-up rather than mixed into a doc fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(agents): cross-reference Lombard tilt as #65 (issue), not #74 (PR) Copilot's review of commit b20eaac caught a cross-reference convention break: the codebase consistently uses #65 (the issue) when referencing the Lombard high-shelf — see synthbanshee/tts/mixer.py lines 6, 80, 84, 88, 108, 183, 322, and docs/audio_generation_v3_design.md §4.2c (line 215). PR #74 is the implementation that closed the issue, but the cross-link convention is to point at the issue. One-character fix: #74 → #65 in the M3a TTS bullet. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Whisper-large-v3 WER on Tier A clips dropped from ~0.04–0.08 (pre-2026-04-16) to ~0.286 on current
main(#83). The same scene config also produces 63% longer audio (#81 — 121s → 198s onsp_neu_a_0001).PR #82 falsified loudness as the cause via a controlled lever probe. #83 was spun out as the actual ASR fix.
Bisect — partial result
Same scene, same
random_seed: 1401, same Whisper-large-v3 invocation across all rows. Two scenes (N=2): one low-intensity (intensity arc almost all I1), one high-intensity (I1→I5 arc).sp_neu_a_0001(low-intensity, arc[1,1,1,2,1])sp_neu_a_0001mainHEADsp_neu_a_0001sp_it_a_0001(high-intensity, arc[1,2,3,4,5,4,3])sp_it_a_0001What this PR fixes
sp_neu_a_0001(low-intensity): fully resolved. WER 0.286 → 0.048 (better than the 04-15 reference's 0.079). Length-ratio returns to ~1.0.sp_neu_a_0001: fully resolved. 197.8s → 122.9s ≈ 04-15 reference's 121.0s.What this PR does NOT fix
sp_it_a_0001(high-intensity): residual regression remains. WER 0.322 vs 04-15 reference's 0.056 — same Whisper-silence-detector fingerprint (length-ratio 0.709, ~28% of words missing). The Hebrew text is byte-identical between the 04-15 reference and the revert branch (verified viadiff), so the residual gap is purely audio rendering.There is therefore a second TTS-side change in the post-2026-04-15 window that fires at I3+. PR #86 does not identify or fix it.
Why #70 was the dominant low-intensity cause
<break time="50ms"/>between every word pair does two things at every word boundary:Native-speaker A/B confirms the perceptual signature: "very over-pausy", "very robotic content tone", "weird, rhythmic, equal spacing between words" — all consistent with mechanical 50 ms inter-word silences regardless of natural prosody.
#71 is reverted alongside #70 because most of #71 only existed to defuse an SSML parse error (#67) that #70 itself introduced. However, #71 also bundled three independent hardenings unrelated to #70 — those have been re-introduced individually in this PR's second commit (see "Restored hardenings" below).
Bisect scope and limitations (be honest)
sp_neu_a_0001, so feat(m15): SSML prosody tuning with research-validated Hebrew parameters #51 and fix(config): halve pitch escalation at I4–I5 to eliminate helium effect #68 were not additionally reverted. Their residual contribution to WER on low-intensity scenes is not bounded by this work; the arithmetic argument that feat(m15): SSML prosody tuning with research-validated Hebrew parameters #51's I1 rate slowdown (0.95–0.97×) cannot account for the regressions is plausible but not empirically verified.Restored hardenings (from #71, unrelated to #70)
The initial revert in this PR was a wholesale
git revertof #70 + #71. PR #71 actually bundled three independent hardenings — only one (the adjacent-break merge) was caused by #70. The second commit on this branch restores the other two and narrows the third:_sanitize_text(strip XML 1.0 control chars from LLM output)±50%pitch,-50%..+200%rate) + warning logsspeaker_BYS_F_6-10_001.yamlhaspitch_delta_st=+9→+54%unclamped)_apply_phrase_prosodyThese were independently flagged by Copilot's review on this PR (3 review threads, all resolved).
Trade-off — #62 is reopened
#70 was solving #62 (Hebrew word merging — "most generated speech unintelligible to a native Hebrew speaker" per 2026-05-03 listening test). Reverting #70 brings the original word-merging problem back. This is a deliberate trade-off:
#62 is reopened with the following alternative-mitigation list. Any future attempt should be validated with both a native-speaker A/B listening test (the original failure mode is invisible to ASR metrics) and a Whisper WER regression check on
sp_neu_a_0001(the prior failure mode is invisible to a listener-only test).<phoneme alphabet="ipa">tags for known-problematic words, scoped to the specific failures rather than applied globally.<break>may persist even at low durations, so a listening test is required regardless of duration.Changes per file
synthbanshee/tts/ssml_builder.py_inject_word_breaksand per-word break logic; restored_sanitize_text, prosody clamping (with warning logs), and a narrowed adjacent-break mergetests/unit/test_tts.pyTestSSMLInvariantsclass with five new tests pinning the restored hardenings + the #83 regression ruletest_no_per_word_breaks_in_default_ssmlis the regression test that would catch any future re-introduction of #70's approachtests/unit/test_phrase_prosody.py<break>children in the no-phrases caseNo prosody / SSML defaults changed. Loudness contract from #82 remains in effect.
Test plan
pytest tests/unit -q— 1662 passed locallyruff check synthbanshee/tts/ tests/unit/— cleanmypy synthbanshee/tts/ssml_builder.py— cleand07e144); restoration commit (df65198) re-running CI nowsp_neu_a_0001on this branch (low-intensity, 12 turns, cache-miss against Azure) — duration 122.9 s, WER 0.048sp_it_a_0001on this branch (high-intensity, 17 turns, cache-miss) — duration 146.6 s, WER 0.322 (residual cause confirmed; tracked in investigate(tts): #83 residual — Whisper WER regression on high-intensity (I3+) Tier A clips #87)References
sp_neuduration regression (197.8s → 122.9s) is fully resolved by this PR.