Skip to content

fix(tts): Azure SSML parsing error on adjacent break elements (#67)#71

Merged
shaypal5 merged 2 commits into
mainfrom
fix/ssml-parse-errors-67
May 3, 2026
Merged

fix(tts): Azure SSML parsing error on adjacent break elements (#67)#71
shaypal5 merged 2 commits into
mainfrom
fix/ssml-parse-errors-67

Conversation

@shaypal5
Copy link
Copy Markdown
Member

@shaypal5 shaypal5 commented May 3, 2026

Summary

  • Root cause: PR fix(tts): insert inter-word <break> tags to prevent Hebrew word merging #70's inter-word <break time="50ms"/> elements interact with phrase prosody break_before_ms (from "menace"/"slow"/"break_before" hints) to create adjacent <break> elements. Azure's SSML parser rejects this pattern with error 0x80045003.
  • Fix: Merge adjacent <break> elements instead of emitting consecutive breaks. Word-boundary breaks (50ms) are replaced by the phrase break; semantic breaks (e.g. a preceding break_after) are summed to preserve both pause intents.
  • Scope: Only affects high-intensity turns (I3–I5) in violent scene templates that use hints with break_before_ms > 0.

Changes

File Change
synthbanshee/tts/ssml_builder.py Merge adjacent breaks in _apply_phrase_prosody (word-boundary → replace, semantic → sum); clamp prosody values to Azure ranges with logging.warning; add _sanitize_text for XML-invalid chars as defense-in-depth
tests/unit/test_tts.py 9 regression tests: adjacent-break merging, duration semantics (replace vs sum), prosody clamping, text sanitization, full Hebrew I5 integration with niqqud and realistic AGG speaker params

Design decisions

  • Replace vs sum: A word-boundary break (50ms) is purely structural — the phrase break subsumes its intent. But a semantic break (e.g. 250ms break_after) represents intentional dramatic pause that should be preserved alongside the new phrase's break_before.
  • Clamping warns, doesn't silently swallow: If prosody values hit Azure's ceiling, that's a speaker-config or state-drift bug upstream. The warning makes it visible in logs.
  • _sanitize_text stays in the builder: Ideally invalid chars should be rejected at the LLM parsing boundary, but defense-in-depth here ensures the SSML layer never produces unparseable output regardless of upstream bugs.

Test plan

  • pytest tests/ — all 1688 tests pass (9 new)
  • ruff check — clean
  • mypy — clean
  • Manual: run Azure TTS on previously failing scenes (sp_it_a_0002, sp_sv_a_0001, sp_sv_a_0002, sp_neg_a_0002) to confirm they render successfully

Closes #67

🤖 Generated with Claude Code

…arse error (#67)

Azure's SSML parser rejects adjacent <break> elements (error 0x80045003).
PR #70's inter-word breaks interact with phrase prosody break_before/break_after
to create adjacent breaks in high-intensity turns with "menace"/"slow" hints.

Fix: merge adjacent breaks (use max duration) instead of emitting consecutive
break elements. Also add prosody value clamping to Azure's documented ranges
and text sanitization for XML 1.0 invalid characters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shaypal5 shaypal5 added type: fix Bug fix comp: tts TTS rendering, SSML, Azure/Google providers labels May 3, 2026
Copilot AI review requested due to automatic review settings May 3, 2026 22:59
@github-actions

This comment has been minimized.

…w test

Address self-review feedback:
- Merge logic now distinguishes word-boundary breaks (replace) from semantic
  breaks like break_after (sum durations) to preserve dramatic pause intent
- Prosody clamping now logs a warning instead of silently swallowing — makes
  upstream speaker config bugs visible
- Remove redundant ET.fromstring validation (ET.tostring can't produce invalid
  XML from a valid tree; _sanitize_text handles the only failure path)
- Add Hebrew regression test with actual AGG I5 parameters and niqqud text
- Add test for break_after → break_before summing on consecutive phrases

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 3, 2026

pr-agent-context report:

This run includes a patch coverage gap on PR #71 in repository https://github.com/DataHackIL/SynthBanshee

Address the patch coverage gaps below, then push all of these changes in a single commit.

# Patch coverage

Patch test coverage is 97.06%; please raise it to 100%. These are the uncovered code lines:
- synthbanshee/tts/ssml_builder.py: 178

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25293417025 attempt 1
Comment timestamp: 2026-05-03T23:06:49.561142+00:00
PR head commit: db82021071d3a0abba3da4add5496235ceb3ae85

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@shaypal5 shaypal5 merged commit 0bdb217 into main May 3, 2026
6 checks passed
@shaypal5 shaypal5 deleted the fix/ssml-parse-errors-67 branch May 3, 2026 23:21
shaypal5 added a commit that referenced this pull request May 5, 2026
sp_it_a_0003.yaml (intimate_terror_financial_control then jealousy_surveillance
templates, seed 1103) consistently failed with Azure TTS SSML parse error
0x80045003 ("TurnStarted; Received audio size: 0 bytes"). Both templates
exhibited the same failure under different slots, suggesting the cause is
upstream (in the SSML the script generator emits for IT-typology content
in this seed range), not template-specific.

Replaced with sp_neg_a_0003 (negative_argument_deescalation template, seed
1303), which renders cleanly. Final 10-clip spike set: 2 IT / 3 NEG / 3 NEU
/ 2 SV. The IT failure is itself a finding for the report — it suggests
PR #67 / #71's SSML escaping work isn't fully covering the LLM output
distribution and warrants a follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
shaypal5 added a commit that referenced this pull request May 5, 2026
…sults

Final report captures:

ASR — PASS strong on all three runs (10 clips, paired comparison):
  - openai/whisper-large-v3 (HF MPS): median WER 0.064
  - ivrit-ai/whisper-large-v3 (HF MPS): median WER 0.036  ← preferred
  - ivrit-ai/whisper-large-v3-ct2 int8 (faster-whisper CPU): median WER 0.044
  Δ int8 vs MPS = +0.008, well within ε=0.02 → CI plan validated.
  Bonus: int8 CT2 dodges the only hallucination loop (sp_it_a_0002_00) seen
  on HF MPS; opens the door to using CT2 in dev/QA, not just CI.

Hallucination detectors empirically validated on the 10 clips:
  trigram_repeat_ratio > 1.5 catches the outlier alone (ratio 2.00 vs 1.00
  for all other 9 clips). Replaces the wrong-by-one-thousandth length-ratio
  threshold from PR #77 review.

UTMOS — FAIL but with much richer evidence:
  - With paired same-5 comparison, all five degradations (4 SNR + lowpass)
    push UTMOS below clean. Direction is correct.
  - But max separation 0.207, well below 0.5 gate.
  - White-noise severity is INVERTED (more noise → higher UTMOS).
  - New clips score ~0.9 MOS points higher than originals — UTMOS is sensitive
    to TTS pipeline drift, dominating any within-clip degradation signal.
  Still NO-GO; recommend turn-segmented Option A re-spike before any E2 work.

Carry-over finding: Azure TTS SSML rejected sp_it_a_0003 with two different
intimate_terror templates. PR #67/#71 SSML escaping work doesn't cover this
case; tracking as a separate follow-up. Replaced with sp_neg_a_0003 in the
spike set.

Reproducibility settled: greedy decoding on MPS is byte-stable run-to-run
given the same normalization. Outlier WER 0.144 → 0.127 from PR #77 to this
report is fully explained by the new RTL-mark strip in normalize_for_wer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
shaypal5 added a commit that referenced this pull request May 5, 2026
…regression test

The initial revert in this branch (revert PR #70 + PR #71 wholesale) was
too aggressive: PR #71 bundled three independent hardenings, only one of
which was caused by PR #70.  This commit restores the two hardenings that
have nothing to do with inter-word break injection, and narrows the third
to the residual case that survives #70's revert.  Also adds the #83
regression test that pins the bisect finding.

Restored from #71 (and explicitly verified to NOT re-introduce #83):

- `_sanitize_text` + `_XML_INVALID_CHARS_RE` regex.  Defends against an
  LLM emitting XML 1.0 control characters that otherwise make the SSML
  unparseable by Azure.  Independent bug class from per-word breaks.
- Azure-range prosody clamping in `_semitones_to_percent`,
  `_rate_to_string`, `_volume_to_string`, plus warning logs on clamp
  activation.  `speaker_BYS_F_6-10_001.yaml` ships `pitch_delta_st=+9`
  → +54% unclamped, which Azure rejects.  Independent bug class.
- Adjacent `<break>` merging in `_apply_phrase_prosody`, narrowed to the
  phrase-after / phrase-before case (the only adjacent-break source that
  survives #70's revert).  The original #71 logic also had a word-break
  branch that is no longer reachable.

Added:

- `test_no_per_word_breaks_in_default_ssml` regression test pinned to
  #83.  The default multi-word SSML must not contain `<break>` tags;
  per-word break injection (PR #70) tripped Whisper's silence-detection
  heuristic and produced the WER regression.  Any future Hebrew word-
  merge mitigation (#62) must not re-introduce per-word breaks.
- `test_text_with_invalid_xml_chars_sanitized`,
  `test_prosody_pitch_clamped_to_azure_range`,
  `test_prosody_rate_clamped_to_azure_range`,
  `test_prosody_volume_clamped_to_azure_range`,
  `test_adjacent_phrase_breaks_are_merged` — pin the restored hardenings.

All three of these were independently flagged by Copilot's review on this
PR (resolves three Copilot review threads).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
shaypal5 added a commit that referenced this pull request May 5, 2026
…s only) (#86)

* Revert "fix(tts): Azure SSML parsing error on adjacent break elements (#67) (#71)"

This reverts commit 0bdb217.

* Revert "fix(tts): insert inter-word <break> tags to prevent Hebrew word merging (#70)"

This reverts commit d0c273b.

* restore(tts): reinstate hardenings from #71 unrelated to #70 + add #83 regression test

The initial revert in this branch (revert PR #70 + PR #71 wholesale) was
too aggressive: PR #71 bundled three independent hardenings, only one of
which was caused by PR #70.  This commit restores the two hardenings that
have nothing to do with inter-word break injection, and narrows the third
to the residual case that survives #70's revert.  Also adds the #83
regression test that pins the bisect finding.

Restored from #71 (and explicitly verified to NOT re-introduce #83):

- `_sanitize_text` + `_XML_INVALID_CHARS_RE` regex.  Defends against an
  LLM emitting XML 1.0 control characters that otherwise make the SSML
  unparseable by Azure.  Independent bug class from per-word breaks.
- Azure-range prosody clamping in `_semitones_to_percent`,
  `_rate_to_string`, `_volume_to_string`, plus warning logs on clamp
  activation.  `speaker_BYS_F_6-10_001.yaml` ships `pitch_delta_st=+9`
  → +54% unclamped, which Azure rejects.  Independent bug class.
- Adjacent `<break>` merging in `_apply_phrase_prosody`, narrowed to the
  phrase-after / phrase-before case (the only adjacent-break source that
  survives #70's revert).  The original #71 logic also had a word-break
  branch that is no longer reachable.

Added:

- `test_no_per_word_breaks_in_default_ssml` regression test pinned to
  #83.  The default multi-word SSML must not contain `<break>` tags;
  per-word break injection (PR #70) tripped Whisper's silence-detection
  heuristic and produced the WER regression.  Any future Hebrew word-
  merge mitigation (#62) must not re-introduce per-word breaks.
- `test_text_with_invalid_xml_chars_sanitized`,
  `test_prosody_pitch_clamped_to_azure_range`,
  `test_prosody_rate_clamped_to_azure_range`,
  `test_prosody_volume_clamped_to_azure_range`,
  `test_adjacent_phrase_breaks_are_merged` — pin the restored hardenings.

All three of these were independently flagged by Copilot's review on this
PR (resolves three Copilot review threads).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp: tts TTS rendering, SSML, Azure/Google providers type: fix Bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(tts): Azure SSML parsing error on 4 of 8 She-Proves scenes

2 participants