
feat(m15): SSML prosody tuning with research-validated Hebrew parameters #51

Merged
shaypal5 merged 4 commits into main from feat/m15-prosody-tuning on May 1, 2026

Conversation

@shaypal5 shaypal5 commented May 1, 2026

Summary

  • Update speaker YAML style_maps to research-consensus prosody values (rate, pitch, volume per intensity level) derived from Amir et al. (2003), T-RES Hebrew emotional prosody, and Gelfer (2005) F0 ranges
  • Add 2.0 semitone F0 drift bound to SpeakerState with f0_drift_exceeded property for cross-clip monitoring
  • Implement turn-level quality gates (synthbanshee/tts/quality_gates.py) — sustained-vowel detection (>2.8s reject), F0 guardrails (male [80,180] Hz, female [150,290] Hz), click detection (DC-offset jumps)
  • Wire quality gates into TTSRenderer.render_scene() with verbose logging on failure
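
As a rough illustration of the 2.0-semitone drift bound: the sketch below assumes drift is measured as the semitone distance between a speaker's current mean F0 and a baseline F0. The PR does not show how `SpeakerState` tracks F0, so the function names and signatures here are hypothetical; only the `MAX_F0_DRIFT_ST` name and its 2.0 value come from the description.

```python
import math

MAX_F0_DRIFT_ST = 2.0  # bound named in the PR; the helpers below are illustrative only


def semitone_drift(current_f0_hz: float, baseline_f0_hz: float) -> float:
    """Signed distance in semitones between two F0 values (12 semitones per octave)."""
    if current_f0_hz <= 0 or baseline_f0_hz <= 0:
        raise ValueError("F0 values must be positive")
    return 12.0 * math.log2(current_f0_hz / baseline_f0_hz)


def f0_drift_exceeded(current_f0_hz: float, baseline_f0_hz: float) -> bool:
    """True when the absolute drift is larger than the bound (hypothetical helper)."""
    return abs(semitone_drift(current_f0_hz, baseline_f0_hz)) > MAX_F0_DRIFT_ST
```

Under this assumption, a 100 Hz baseline drifting to 110 Hz is about 1.65 st (within the bound), while 115 Hz is about 2.42 st (exceeded).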

Changes by file

| File | Change |
|---|---|
| `configs/speakers/*.yaml` (6 files) | Updated style_map rate/pitch/volume to research consensus; replaced angry/sad styles with General |
| `configs/examples/*.yaml` (6 files) | Same updates for example speaker configs |
| `synthbanshee/tts/quality_gates.py` | New: three quality gate implementations + composite runner |
| `synthbanshee/tts/speaker_state.py` | Added MAX_F0_DRIFT_ST constant and f0_drift_exceeded property |
| `synthbanshee/tts/renderer.py` | Wire quality gates post-render; warn on F0 drift exceeded |
| `tests/unit/test_quality_gates.py` | New: 19 tests for all quality gates |
| `tests/unit/test_speaker_state.py` | 7 new tests for F0 drift bound |
| `tests/unit/test_config.py` | Updated assertions to match new General style |

Test plan

  • pytest tests/unit/ — 1322 passed
  • ruff check — all checks passed
  • mypy synthbanshee/tts/quality_gates.py synthbanshee/tts/renderer.py synthbanshee/tts/speaker_state.py — success, no issues
  • Pre-commit hooks pass (ruff, ruff-format, mypy, yaml check)

🤖 Generated with Claude Code

Tunes TTS prosody to research-consensus values from three independent reports
(Amir et al., T-RES, Gelfer 2005) and adds turn-level quality gates to reject
unrealistic renders before mixing.

Changes:
- Update all speaker YAML style_maps (rate, pitch, volume) to match the
  consensus table in wiki/topics/research-synthesis.md (lines 93-99)
- Replace 'angry'/'sad' express-as styles with 'General' (M14 confirmed
  express-as is not supported for he-IL voices)
- Add MAX_F0_DRIFT_ST (2.0 st) bound and f0_drift_exceeded property to
  SpeakerState for cross-clip drift monitoring
- New synthbanshee/tts/quality_gates.py module with three gates:
  - Sustained-vowel detection (>2.8 s reject)
  - F0 guardrails (male [80,180] Hz, female [150,290] Hz)
  - Click detection (DC-offset jumps)
- Wire quality gates into TTSRenderer.render_scene() with verbose logging
- Add comprehensive unit tests (19 new tests in test_quality_gates.py,
  7 new tests in test_speaker_state.py)
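
A minimal sketch of the F0 guardrail idea, for illustration only. The Hz ranges are the ones quoted above; the choice of the median over voiced frames as the test statistic, the pass-on-empty behavior, and the function name are assumptions, since the message does not describe how `quality_gates.py` extracts or summarizes F0.

```python
from statistics import median

# Ranges quoted in the commit message; everything else here is an assumption.
F0_RANGES_HZ = {"male": (80.0, 180.0), "female": (150.0, 290.0)}


def f0_within_guardrails(voiced_f0_hz: list[float], sex: str) -> bool:
    """Return True if the median voiced F0 lies inside the range for `sex`."""
    if not voiced_f0_hz:
        return True  # nothing voiced to judge; passing here is an assumption
    low, high = F0_RANGES_HZ[sex]
    return low <= median(voiced_f0_hz) <= high
```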

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 1, 2026 18:58
@shaypal5 shaypal5 added this to the M15 milestone May 1, 2026
@shaypal5 shaypal5 added the enhancement New feature or request label May 1, 2026

Copilot AI left a comment

Pull request overview

This PR tunes Hebrew SSML prosody controls (rate/pitch/volume) in speaker configs and introduces post-render turn-level audio validation (quality gates) plus a bounded cross-turn F0 drift monitor to catch unrealistic renders early in the TTS pipeline.

Changes:

  • Updated multiple speaker + example YAML style_map entries to research-consensus prosody parameters and standardized styles to "General".
  • Added MAX_F0_DRIFT_ST / f0_drift_exceeded to SpeakerState and integrated drift warnings into TTSRenderer.render_scene().
  • Added synthbanshee/tts/quality_gates.py (sustained vowel, F0 guardrails, click detection) and unit tests.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| `synthbanshee/tts/quality_gates.py` | New module implementing turn-level audio validation gates and a composite runner. |
| `synthbanshee/tts/renderer.py` | Runs quality gates after each turn render and logs gate failures; warns on accumulated F0 drift. |
| `synthbanshee/tts/speaker_state.py` | Adds a 2.0 semitone drift bound constant and an f0_drift_exceeded property. |
| `tests/unit/test_quality_gates.py` | New unit test coverage for all quality gates and the composite runner. |
| `tests/unit/test_speaker_state.py` | Adds unit tests covering the drift bound constant and property behavior. |
| `tests/unit/test_config.py` | Updates assertions to reflect "General" style usage at intensity 5/3. |
| `configs/speakers/speaker_VIC_F_25-40_004.yaml` | Updates prosody parameters per intensity level; standardizes style to "General". |
| `configs/speakers/speaker_SW_F_30-45_003.yaml` | Updates prosody parameters per intensity level; standardizes style to "General". |
| `configs/speakers/speaker_SW_F_30-45_002.yaml` | Updates prosody parameters per intensity level; standardizes style to "General". |
| `configs/speakers/speaker_BEN_M_40-55_005.yaml` | Updates prosody parameters per intensity level; standardizes style to "General". |
| `configs/speakers/speaker_BEN_M_40-55_004.yaml` | Updates prosody parameters per intensity level; standardizes style to "General". |
| `configs/speakers/speaker_AGG_M_30-45_003.yaml` | Updates prosody parameters per intensity level; standardizes style to "General". |
| `configs/examples/speaker_VIC_F_25-40_003.yaml` | Mirrors speaker prosody updates in example config. |
| `configs/examples/speaker_VIC_F_25-40_002.yaml` | Mirrors speaker prosody updates in example config; updates narrative comments accordingly. |
| `configs/examples/speaker_SW_F_30-45_001.yaml` | Mirrors speaker prosody updates in example config. |
| `configs/examples/speaker_BEN_M_40-55_003.yaml` | Mirrors speaker prosody updates in example config. |
| `configs/examples/speaker_AGG_M_30-45_002.yaml` | Mirrors speaker prosody updates in example config. |
| `configs/examples/speaker_AGG_M_30-45_001.yaml` | Mirrors speaker prosody updates in example config. |

Comment thread synthbanshee/tts/quality_gates.py
Comment thread synthbanshee/tts/quality_gates.py
Comment thread tests/unit/test_speaker_state.py Outdated
Comment thread synthbanshee/tts/renderer.py Outdated
Comment thread synthbanshee/tts/quality_gates.py
Comment thread synthbanshee/tts/quality_gates.py Outdated
…rsity

Self-review fixes:

1. Quality gates now retry on failure (up to quality_gate_retries=2 re-renders
   with different random seeds) before accepting a failed turn. Failures are
   persisted in DialogueTurn.quality_gate_failures for downstream observability.

2. Click detection: raised the threshold from 0.05 to 0.15 (avoids false
   positives on plosive transients /p/, /t/, /k/) and added an isolated-spike
   criterion: a diff event counts as a click only if the surrounding ±3 samples
   are below the threshold, which distinguishes single-sample DC jumps from
   multi-sample bursts.

3. F0 drift warning now prints the actual numeric bound (±2.0 st) instead of
   the class name.

4. Added quality_gate_failures field to DialogueTurn so gate results are
   persisted in output metadata.

5. Added quality_gates=True and quality_gate_retries=2 params to render_scene()
   so callers can disable gates for fast batch runs.

6. Restored inter-speaker prosody variation: each speaker instance now samples
   a different point within the research consensus ranges, preserving perceptual
   diversity while staying within validated bounds.
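
The isolated-spike criterion from item 2 can be sketched roughly as follows. This is one reading of the description, not the code in `quality_gates.py`: it assumes samples are floats in [-1, 1], that the threshold applies to the absolute sample-to-sample difference, and that "surrounding ±3 samples" refers to the neighbouring difference values. The function name is hypothetical.

```python
from typing import Sequence


def count_isolated_clicks(
    samples: Sequence[float], threshold: float = 0.15, guard: int = 3
) -> int:
    """Count sample-to-sample jumps above `threshold` whose neighbours are quiet."""
    # Absolute first difference of the waveform.
    diffs = [abs(b - a) for a, b in zip(samples, samples[1:])]
    clicks = 0
    for i, d in enumerate(diffs):
        if d <= threshold:
            continue
        # Up to `guard` difference values on each side of the candidate.
        neighbours = diffs[max(0, i - guard):i] + diffs[i + 1:i + 1 + guard]
        if all(n < threshold for n in neighbours):
            clicks += 1  # isolated jump: treated as a click
    return clicks
```

With this reading, a single step in DC level produces one isolated large difference and is counted, while a run of consecutive large differences (as in a plosive burst) is not.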

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix sustained-vowel duration calculation to account for frame overlap:
  duration = frame_len/sr + (N-1)*hop/sr (was N*hop/sr, underestimating)
- Rename test_agg_sustained_i5_may_exceed → test_agg_sustained_i5_stays_within_bound
  to clarify that the drift target is never exceeded (exponential convergence)
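
To make the duration fix concrete, here is a small worked sketch of both formulas. The frame parameters (16 kHz sample rate, 400-sample frames, 160-sample hop) are illustrative assumptions, not values taken from the PR, and the function names are hypothetical.

```python
def run_duration_s(n_frames: int, frame_len: int, hop: int, sr: int) -> float:
    """Span covered by N overlapping frames: one full frame plus (N-1) hops."""
    if n_frames <= 0:
        return 0.0
    return (frame_len + (n_frames - 1) * hop) / sr


def old_run_duration_s(n_frames: int, hop: int, sr: int) -> float:
    """Previous estimate (N hops), which ignores the tail of the last frame."""
    return n_frames * hop / sr


# Illustrative parameters only: 25 ms frames with a 10 ms hop at 16 kHz.
new = run_duration_s(280, frame_len=400, hop=160, sr=16000)
old = old_run_duration_s(280, hop=160, sr=16000)
```

With these numbers the corrected formula gives 2.815 s and the old one 2.8 s, so a run of 280 frames would only cross a ">2.8 s" threshold after the fix.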

Other Copilot comments (click detection, reject behavior, F0 drift warning)
were already addressed in the previous commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 1, 2026 19:13

Copilot AI left a comment

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.


Comment thread synthbanshee/tts/quality_gates.py Outdated
Comment thread synthbanshee/tts/renderer.py
Comment thread synthbanshee/tts/quality_gates.py
- Fix _wav_bytes_to_samples docstring to not claim PCM16-only (accepts any
  WAV subtype readable by soundfile)
- Log actual retries_attempted count instead of max retries configured
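
For context on the `retries_attempted` count, a retry loop of the kind described in the earlier self-review commit might look like the sketch below. The callables, their signatures, and the seed handling are hypothetical stand-ins; only the `quality_gate_retries` and `retries_attempted` names come from the commit messages.

```python
from typing import Callable


def render_with_retries(
    render: Callable[[int], bytes],           # hypothetical: seed -> WAV bytes
    run_gates: Callable[[bytes], list[str]],  # hypothetical: WAV bytes -> failure names
    base_seed: int = 0,
    quality_gate_retries: int = 2,
) -> tuple[bytes, list[str], int]:
    """Re-render with a new seed while gates fail, up to the retry budget."""
    audio = render(base_seed)
    failures = run_gates(audio)
    retries_attempted = 0
    while failures and retries_attempted < quality_gate_retries:
        retries_attempted += 1
        audio = render(base_seed + retries_attempted)
        failures = run_gates(audio)
    # The last render is accepted even if it still fails; the caller can
    # persist `failures` and log `retries_attempted` (not the configured max).
    return audio, failures, retries_attempted
```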

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions Bot commented May 1, 2026

pr-agent-context report:

This run includes patch coverage gaps on PR #51 in repository https://github.com/DataHackIL/SynthBanshee

Address the patch coverage gaps below, then push all of these changes in a single commit.

# Patch coverage

Patch test coverage is 94.78%; please raise it to 100%. These are the uncovered code lines:
- synthbanshee/tts/quality_gates.py: 88, 166, 183, 187, 254, 293
- synthbanshee/tts/renderer.py: 349

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: commit pushed
Workflow run: 25229232317 attempt 1
Comment timestamp: 2026-05-01T19:23:09.827900+00:00
PR head commit: dd6041ba1a6d5cb5e4842d5522fefe2baefb2579

@shaypal5 shaypal5 merged commit 7d30492 into main May 1, 2026
6 checks passed
@shaypal5 shaypal5 deleted the feat/m15-prosody-tuning branch May 1, 2026 20:15
shaypal5 added a commit that referenced this pull request May 1, 2026
- Mark M11, M13, M15 as Done in V3 implementation tracker (PRs #49#51)
- Update V3.1 recommended-order note: only M16 and M12 remain
- Fix 4 wiki pages: review_state human-authored → human-reviewed,
  remove extra created/updated fields not in splendor schema

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
shaypal5 added a commit that referenced this pull request May 1, 2026
* docs: update tracker (M11/M13/M15 done) + fix wiki frontmatter

- Mark M11, M13, M15 as Done in V3 implementation tracker (PRs #49#51)
- Update V3.1 recommended-order note: only M16 and M12 remain
- Fix 4 wiki pages: review_state human-authored → human-reviewed,
  remove extra created/updated fields not in splendor schema

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: fix GenerationMetadata type — dataclass → Pydantic BaseModel

The implementation uses a Pydantic BaseModel, not a dataclass.
Update both mentions in the V3 design doc to match the code.

Addresses COPILOT-1 on PR #53.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
shaypal5 added a commit that referenced this pull request May 6, 2026
…oor + helium range (#90)

The bisect on PR #86 showed the residual sp_it_a_0001 WER regression
(0.322 vs 04-15's 0.056) is caused by M7 SpeakerState drift compounding
with #51's M15 style_map values, producing effective pitch +14 % to +17 %
and rate 1.27-1.33x at high-intensity turns.  That range simultaneously
sounds cartoonish to listeners (May-3 listening test "helium / oompa-
loompa") and trips Whisper-large-v3's silence-detection heuristic — the
classic length-ratio collapse to ~0.7 that hid the bug for weeks.

This PR ships a partial fix: a runtime effective-prosody cap that
addresses the canonical Whisper-backdoor fingerprint and the helium-
range pitch concern, plus the two detection layers Shay asked for to
catch this class of regression in the future.  It does NOT fully
restore high-intensity WER to the pre-#51 baseline — see #89 for the
follow-up workstream.

## Tier-3 Whisper validation (`sp_it_a_0001`)

| variant | dur | WER | length_ratio | hyp / ref |
|---|---:|---:|---:|---:|
| 04-15 reference | 155.9 s | 0.056 | 1.009 | 236 / 234 |
| post-#86 main (no cap) | 146.6 s | 0.322 | 0.709 | 166 / 234 |
| this PR (cap active) | 149.1 s | **0.129** | **0.906** | 212 / 234 |

  - Length-ratio recovers above the qa-report --asr 0.85 threshold.
  - WER reduced 2.5x (0.322 -> 0.129) but still above the 04-15 baseline
    of 0.056.  Failure mode shifts from silence-detector trip
    (~30 % of words missing) to substitution noise — distinct mechanism
    requiring a paired listening test to fix without breaking M15
    naturalness calibration.  Tracked in #89 with insights and four
    proposed approaches.

## The fix — effective-prosody runtime cap

`synthbanshee/tts/renderer._apply_effective_prosody_cap` clamps post-
state, post-randomization prosody before SSML emission:

  - pitch in [-3.0, +2.0] st  (~ +/- 12 % Azure)
  - rate  in [0.85, 1.20]
  - volume left to the existing +/-50 % Azure clamp (Whisper internally
    normalizes loudness, per #82's lever probe — not a Whisper-trip
    dimension).

Caps are anchored to the pre-#51 effective envelope, which produced the
04-15 reference clips with WER 0.04-0.08.  Tighter caps would diverge
further from M15 listening-test calibration; looser caps would re-trip
Whisper.  Each cap activation logs a warning and is recorded per turn.
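
A minimal sketch of the clamping behavior described above, assuming pitch is expressed in semitones and rate as a multiplier. The real `_apply_effective_prosody_cap` signature and the shape of the recorded events are not shown in this message, so `cap_prosody` and the event dictionaries below are hypothetical.

```python
PITCH_CAP_ST = (-3.0, 2.0)  # semitones, bounds quoted in the commit message
RATE_CAP = (0.85, 1.20)     # rate multiplier, bounds quoted in the commit message


def _clamp(value: float, bounds: tuple[float, float]) -> float:
    low, high = bounds
    return min(max(value, low), high)


def cap_prosody(pitch_st: float, rate: float) -> tuple[float, float, list[dict]]:
    """Clamp pitch and rate, returning one event per parameter that was capped."""
    capped_pitch = _clamp(pitch_st, PITCH_CAP_ST)
    capped_rate = _clamp(rate, RATE_CAP)
    events: list[dict] = []
    if capped_pitch != pitch_st:
        events.append({"parameter": "pitch", "requested": pitch_st, "applied": capped_pitch})
    if capped_rate != rate:
        events.append({"parameter": "rate", "requested": rate, "applied": capped_rate})
    return capped_pitch, capped_rate, events
```

Values already inside both ranges pass through unchanged with an empty event list, which is what lets a per-turn event count serve as a cheap detection signal.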

## Detection layer 1 — static prosody-cap activations in metadata

  - `DialogueTurn.effective_prosody_caps` carries per-turn cap events.
  - `cli.py` rolls them up into
    `ClipMetadata.generation_metadata.effective_prosody_caps`
    (new `EffectiveProsodyCapEvent` model in labels/schema.py).
  - `qa-report` surfaces a new "Effective-Prosody Cap Activations (#87)"
    table per clip — runs on every batch, no Azure / Whisper required.
    Tier-3 render of sp_it_a_0001 recorded 14 cap activations across
    7 high-intensity turns; metadata example in PR description.

## Detection layer 2 — `qa-report --asr` Whisper backdoor check

New `synthbanshee/package/asr_sanity.py` provides a lazy-loaded
`WhisperRunner` and `compute_asr_metrics`.  `qa-report --asr` runs
Whisper-large-v3 on every clip in a directory, flags clips whose
length-ratio falls below `--asr-min-length-ratio` (default 0.85 — the
#87 fingerprint sat at ~0.71).  Heavy dependencies isolated in the new
`eval-asr` optional extra so normal generation/QA stays light.
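
The length-ratio check can be illustrated with a short sketch. It assumes the ratio is the hypothesis word count divided by the reference word count, which is consistent with the hyp / ref column in the validation table; the function names are hypothetical and do not reflect the actual `compute_asr_metrics` API.

```python
def length_ratio(hypothesis: str, reference: str) -> float:
    """Hypothesis word count divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return len(hypothesis.split()) / len(ref_words)


def below_length_ratio_threshold(ratio: float, min_length_ratio: float = 0.85) -> bool:
    """True when the transcript is suspiciously short relative to the reference."""
    return ratio < min_length_ratio
```

Applied to the word counts in the table, 166 / 234 gives about 0.709 (flagged) and 212 / 234 gives about 0.906 (not flagged).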

Per the policy decision documented in CLAUDE.md ("ASR sanity check
policy"), Tier-3 ASR sanity is local-only (not in CI) for now — see
GH issue #88 for the deferred CI re-evaluation triggers.

## Tests

  - tests/unit/test_effective_prosody_cap.py: 11 tests covering the
    helper unit, render_utterance integration, and render_scene event
    propagation to DialogueTurn.
  - tests/unit/test_qa.py::TestProsodyCapRollup: 3 tests verifying
    cap-event aggregation in qa-report.
  - tests/unit/test_asr_sanity.py: 11 tests covering normalize_for_wer,
    AsrMetrics threshold semantics, and bracket-line stripping in the
    reference parser.  Heavy Whisper inference is exercised by the
    Tier-3 local run, not these tests.
  - 1687 unit tests pass (1662 baseline + 25 new); ruff + mypy clean.

## Docs

  - CLAUDE.md: new "ASR sanity check policy" section + "What NOT to do"
    bullets pinning the cap thresholds and the Tier-3 local-only policy.
  - pyproject.toml: new `eval-asr` optional extra.

Reduces #87 (does not fully close — see #89 for the residual WER work).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>