feat(m15): SSML prosody tuning with research-validated Hebrew parameters #51
Merged
Tunes TTS prosody to research-consensus values from three independent reports (Amir et al., T-RES, Gelfer 2005) and adds turn-level quality gates to reject unrealistic renders before mixing.

Changes:
- Update all speaker YAML style_maps (rate, pitch, volume) to match the consensus table in wiki/topics/research-synthesis.md (lines 93-99)
- Replace 'angry'/'sad' express-as styles with 'General' (M14 confirmed express-as is not supported for he-IL voices)
- Add MAX_F0_DRIFT_ST (2.0 st) bound and f0_drift_exceeded property to SpeakerState for cross-clip drift monitoring
- New synthbanshee/tts/quality_gates.py module with three gates:
  - Sustained-vowel detection (>2.8 s reject)
  - F0 guardrails (male [80,180] Hz, female [150,290] Hz)
  - Click detection (DC-offset jumps)
- Wire quality gates into TTSRenderer.render_scene() with verbose logging
- Add comprehensive unit tests (19 new tests in test_quality_gates.py, 7 new tests in test_speaker_state.py)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
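The F0 guardrail listed above amounts to a range check per speaker gender. The following is a minimal sketch only: it assumes the gate receives per-frame F0 estimates in Hz with unvoiced frames encoded as 0.0 and judges the median of the voiced frames. The helper name `f0_within_guardrails` and the choice of the median are illustrative assumptions, not the actual API of `quality_gates.py`.

```python
from statistics import median

# Ranges quoted in the PR description, in Hz.
F0_BOUNDS_HZ = {"male": (80.0, 180.0), "female": (150.0, 290.0)}


def f0_within_guardrails(f0_hz, gender):
    """Return True if the median voiced F0 lies inside the gender's range.

    Hypothetical helper: assumes unvoiced frames are encoded as 0.0.
    """
    voiced = [f for f in f0_hz if f > 0.0]
    if not voiced:
        # No voiced frames at all: nothing to judge, so do not reject.
        return True
    lo, hi = F0_BOUNDS_HZ[gender]
    return lo <= median(voiced) <= hi
```

For example, a male turn whose voiced frames sit around 125 Hz passes, while a female turn rendered around 300 Hz would be flagged.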
Pull request overview
This PR tunes Hebrew SSML prosody controls (rate/pitch/volume) in speaker configs and introduces post-render turn-level audio validation (quality gates) plus a bounded cross-turn F0 drift monitor to catch unrealistic renders early in the TTS pipeline.
Changes:
- Updated multiple speaker + example YAML `style_map` entries to research-consensus prosody parameters and standardized styles to `"General"`.
- Added `MAX_F0_DRIFT_ST` / `f0_drift_exceeded` to `SpeakerState` and integrated drift warnings into `TTSRenderer.render_scene()`.
- Added `synthbanshee/tts/quality_gates.py` (sustained vowel, F0 guardrails, click detection) and unit tests.
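To make the drift bound concrete, here is a rough sketch of how a semitone drift check could look. It assumes the state tracks a baseline and a current F0 in Hz; the class `SpeakerStateSketch` and its fields are hypothetical stand-ins, and the real `SpeakerState` may store drift differently (for instance directly in semitones).

```python
import math
from dataclasses import dataclass

MAX_F0_DRIFT_ST = 2.0  # bound quoted in the PR, in semitones


def semitones_between(f0_hz, reference_hz):
    """Signed interval from reference_hz to f0_hz in semitones."""
    return 12.0 * math.log2(f0_hz / reference_hz)


@dataclass
class SpeakerStateSketch:
    """Illustrative stand-in only; not the real SpeakerState."""

    baseline_f0_hz: float
    current_f0_hz: float

    @property
    def f0_drift_st(self):
        return semitones_between(self.current_f0_hz, self.baseline_f0_hz)

    @property
    def f0_drift_exceeded(self):
        # Symmetric bound: drifting up or down by more than 2 st trips it.
        return abs(self.f0_drift_st) > MAX_F0_DRIFT_ST
```

Under these assumptions a speaker drifting from 200 Hz to 220 Hz (about 1.65 st) stays within the bound, while 200 Hz to 230 Hz (about 2.42 st) exceeds it.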
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `synthbanshee/tts/quality_gates.py` | New module implementing turn-level audio validation gates and a composite runner. |
| `synthbanshee/tts/renderer.py` | Runs quality gates after each turn render and logs gate failures; warns on accumulated F0 drift. |
| `synthbanshee/tts/speaker_state.py` | Adds a 2.0 semitone drift bound constant and an `f0_drift_exceeded` property. |
| `tests/unit/test_quality_gates.py` | New unit test coverage for all quality gates and the composite runner. |
| `tests/unit/test_speaker_state.py` | Adds unit tests covering the drift bound constant and property behavior. |
| `tests/unit/test_config.py` | Updates assertions to reflect "General" style usage at intensity 5/3. |
| `configs/speakers/speaker_VIC_F_25-40_004.yaml` | Updates prosody parameters per intensity level; standardizes style to "General". |
| `configs/speakers/speaker_SW_F_30-45_003.yaml` | Updates prosody parameters per intensity level; standardizes style to "General". |
| `configs/speakers/speaker_SW_F_30-45_002.yaml` | Updates prosody parameters per intensity level; standardizes style to "General". |
| `configs/speakers/speaker_BEN_M_40-55_005.yaml` | Updates prosody parameters per intensity level; standardizes style to "General". |
| `configs/speakers/speaker_BEN_M_40-55_004.yaml` | Updates prosody parameters per intensity level; standardizes style to "General". |
| `configs/speakers/speaker_AGG_M_30-45_003.yaml` | Updates prosody parameters per intensity level; standardizes style to "General". |
| `configs/examples/speaker_VIC_F_25-40_003.yaml` | Mirrors speaker prosody updates in example config. |
| `configs/examples/speaker_VIC_F_25-40_002.yaml` | Mirrors speaker prosody updates in example config; updates narrative comments accordingly. |
| `configs/examples/speaker_SW_F_30-45_001.yaml` | Mirrors speaker prosody updates in example config. |
| `configs/examples/speaker_BEN_M_40-55_003.yaml` | Mirrors speaker prosody updates in example config. |
| `configs/examples/speaker_AGG_M_30-45_002.yaml` | Mirrors speaker prosody updates in example config. |
| `configs/examples/speaker_AGG_M_30-45_001.yaml` | Mirrors speaker prosody updates in example config. |
…rsity

Self-review fixes:
1. Quality gates now retry on failure (up to quality_gate_retries=2 re-renders with different random seeds) before accepting a failed turn. Failures are persisted in DialogueTurn.quality_gate_failures for downstream observability.
2. Click detection raised threshold from 0.05 to 0.15 (avoids false positives on plosive transients /p/,/t/,/k/) and added isolated-spike criterion: only count a diff event as a click if surrounding ±3 samples are below threshold, which distinguishes single-sample DC jumps from multi-sample bursts.
3. F0 drift warning now prints the actual numeric bound (±2.0 st) instead of the class name.
4. Added quality_gate_failures field to DialogueTurn so gate results are persisted in output metadata.
5. Added quality_gates=True and quality_gate_retries=2 params to render_scene() so callers can disable gates for fast batch runs.
6. Restored inter-speaker prosody variation: each speaker instance now samples a different point within the research consensus ranges, preserving perceptual diversity while staying within validated bounds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
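The isolated-spike criterion in item 2 can be sketched as follows. This is an illustrative reading of the commit message, not the module's code: it assumes samples are floats normalized to [-1, 1], that a "diff event" is the absolute first difference between adjacent samples, and that "surrounding ±3 samples" refers to the neighboring differences. The function name `count_isolated_clicks` is hypothetical.

```python
CLICK_THRESHOLD = 0.15  # raised from 0.05 per the commit message
NEIGHBORHOOD = 3        # +/- positions that must stay below threshold


def count_isolated_clicks(samples, threshold=CLICK_THRESHOLD, radius=NEIGHBORHOOD):
    """Count first-difference spikes whose neighbors are all quiet."""
    diffs = [abs(b - a) for a, b in zip(samples, samples[1:])]
    clicks = 0
    for i, d in enumerate(diffs):
        if d <= threshold:
            continue
        lo = max(0, i - radius)
        hi = min(len(diffs), i + radius + 1)
        neighbors = diffs[lo:i] + diffs[i + 1:hi]
        # A burst (e.g. a plosive) has several large diffs in a row,
        # so it fails this test; a lone DC-offset step passes it.
        if all(n <= threshold for n in neighbors):
            clicks += 1
    return clicks
```

With this reading, a single step from 0.0 to 0.5 counts as one click, while a short oscillating burst produces several adjacent large differences and is not counted.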
- Fix sustained-vowel duration calculation to account for frame overlap: duration = frame_len/sr + (N-1)*hop/sr (was N*hop/sr, underestimating)
- Rename test_agg_sustained_i5_may_exceed → test_agg_sustained_i5_stays_within_bound to clarify that the drift target is never exceeded (exponential convergence)

Other Copilot comments (click detection, reject behavior, F0 drift warning) were already addressed in the previous commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
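A small worked comparison of the two duration formulas may help show why the fix matters near the 2.8 s reject threshold. The frame length (1024), hop (256), and sample rate (16000 Hz) below are assumed values for illustration only; the function names are hypothetical.

```python
def run_duration_s(n_frames, frame_len, hop, sr):
    """Duration covered by n_frames overlapping frames (corrected formula)."""
    if n_frames <= 0:
        return 0.0
    # First frame contributes its full length; each later frame adds one hop.
    return (frame_len + (n_frames - 1) * hop) / sr


def legacy_run_duration_s(n_frames, hop, sr):
    """Previous formula, which ignores the tail of the last frame."""
    return n_frames * hop / sr


corrected = run_duration_s(173, frame_len=1024, hop=256, sr=16000)
legacy = legacy_run_duration_s(173, hop=256, sr=16000)
```

With these assumed parameters, a run of 173 frames measures 2.768 s under the old formula (below the 2.8 s threshold) but 2.816 s under the corrected one (above it). The gap is always `(frame_len - hop) / sr`, here 0.048 s.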
Pull request overview
Copilot reviewed 19 out of 19 changed files in this pull request and generated 3 comments.
- Fix _wav_bytes_to_samples docstring to not claim PCM16-only (accepts any WAV subtype readable by soundfile)
- Log actual retries_attempted count instead of max retries configured

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
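The distinction between the configured retry limit and the retries actually attempted can be illustrated with a small loop. This is a sketch under assumptions: `render` and `run_gates` are hypothetical callables standing in for the TTS call and the composite gate runner, and seeds are simply incremented per attempt.

```python
def render_with_retries(render, run_gates, max_retries=2):
    """Render a turn, re-rendering with a new seed while gates fail.

    render(seed) -> audio; run_gates(audio) -> list of failure strings.
    Returns (audio, failures, retries_attempted). A turn that still
    fails after max_retries is accepted with its failures recorded.
    """
    retries_attempted = 0
    audio = render(0)
    failures = run_gates(audio)
    while failures and retries_attempted < max_retries:
        retries_attempted += 1
        audio = render(retries_attempted)  # different seed per attempt
        failures = run_gates(audio)
    return audio, failures, retries_attempted
```

Logging `retries_attempted` rather than `max_retries` means a turn that passed on its first re-render reports 1, not 2.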
pr-agent-context report: This run includes patch coverage gaps on PR #51 in repository https://github.com/DataHackIL/SynthBanshee
Address the patch coverage gaps below, then push all of these changes in a single commit.
# Patch coverage
Patch test coverage is 94.78%; please raise it to 100%. These are the uncovered code lines:
- synthbanshee/tts/quality_gates.py: 88, 166, 183, 187, 254, 293
- synthbanshee/tts/renderer.py: 349
shaypal5 added a commit that referenced this pull request May 1, 2026

- Mark M11, M13, M15 as Done in V3 implementation tracker (PRs #49–#51)
- Update V3.1 recommended-order note: only M16 and M12 remain
- Fix 4 wiki pages: review_state human-authored → human-reviewed, remove extra created/updated fields not in splendor schema

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
shaypal5 added a commit that referenced this pull request May 1, 2026

* docs: update tracker (M11/M13/M15 done) + fix wiki frontmatter

  - Mark M11, M13, M15 as Done in V3 implementation tracker (PRs #49–#51)
  - Update V3.1 recommended-order note: only M16 and M12 remain
  - Fix 4 wiki pages: review_state human-authored → human-reviewed, remove extra created/updated fields not in splendor schema

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: fix GenerationMetadata type - dataclass → Pydantic BaseModel

  The implementation uses a Pydantic BaseModel, not a dataclass. Update both mentions in the V3 design doc to match the code. Addresses COPILOT-1 on PR #53.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
This was referenced May 5, 2026
Open
shaypal5 added a commit that referenced this pull request May 6, 2026
…oor + helium range (#90)

The bisect on PR #86 showed the residual sp_it_a_0001 WER regression (0.322 vs 04-15's 0.056) is caused by M7 SpeakerState drift compounding with #51's M15 style_map values, producing effective pitch +14 % to +17 % and rate 1.27-1.33x at high-intensity turns. That range simultaneously sounds cartoonish to listeners (May-3 listening test "helium / oompa-loompa") and trips Whisper-large-v3's silence-detection heuristic: the classic length-ratio collapse to ~0.7 that hid the bug for weeks.

This PR ships a partial fix: a runtime effective-prosody cap that addresses the canonical Whisper-backdoor fingerprint and the helium-range pitch concern, plus the two detection layers Shay asked for to catch this class of regression in the future. It does NOT fully restore high-intensity WER to the pre-#51 baseline; see #89 for the follow-up workstream.

## Tier-3 Whisper validation (`sp_it_a_0001`)

| variant | dur | WER | length_ratio | hyp / ref |
|---|---:|---:|---:|---:|
| 04-15 reference | 155.9 s | 0.056 | 1.009 | 236 / 234 |
| post-#86 main (no cap) | 146.6 s | 0.322 | 0.709 | 166 / 234 |
| this PR (cap active) | 149.1 s | **0.129** | **0.906** | 212 / 234 |

- Length-ratio recovers above the qa-report --asr 0.85 threshold.
- WER reduced 2.5x (0.322 -> 0.129) but still above the 04-15 baseline of 0.056. Failure mode shifts from silence-detector trip (~30 % of words missing) to substitution noise, a distinct mechanism requiring a paired listening test to fix without breaking M15 naturalness calibration. Tracked in #89 with insights and four proposed approaches.

## The fix: effective-prosody runtime cap

`synthbanshee/tts/renderer._apply_effective_prosody_cap` clamps post-state, post-randomization prosody before SSML emission:

- pitch in [-3.0, +2.0] st (~ +/- 12 % Azure)
- rate in [0.85, 1.20]
- volume left to the existing +/-50 % Azure clamp (Whisper internally normalizes loudness, per #82's lever probe; not a Whisper-trip dimension).

Caps are anchored to the pre-#51 effective envelope, which produced the 04-15 reference clips with WER 0.04-0.08. Tighter caps would diverge further from M15 listening-test calibration; looser caps would re-trip Whisper. Each cap activation logs a warning and is recorded per turn.

## Detection layer 1: static prosody-cap activations in metadata

- `DialogueTurn.effective_prosody_caps` carries per-turn cap events.
- `cli.py` rolls them up into `ClipMetadata.generation_metadata.effective_prosody_caps` (new `EffectiveProsodyCapEvent` model in labels/schema.py).
- `qa-report` surfaces a new "Effective-Prosody Cap Activations (#87)" table per clip; runs on every batch, no Azure / Whisper required.

Tier-3 render of sp_it_a_0001 recorded 14 cap activations across 7 high-intensity turns; metadata example in PR description.

## Detection layer 2: `qa-report --asr` Whisper backdoor check

New `synthbanshee/package/asr_sanity.py` provides a lazy-loaded `WhisperRunner` and `compute_asr_metrics`. `qa-report --asr` runs Whisper-large-v3 on every clip in a directory, flags clips whose length-ratio falls below `--asr-min-length-ratio` (default 0.85; the #87 fingerprint sat at ~0.71). Heavy dependencies isolated in the new `eval-asr` optional extra so normal generation/QA stays light.

Per the policy decision documented in CLAUDE.md ("ASR sanity check policy"), Tier-3 ASR sanity is local-only (not in CI) for now; see GH issue #88 for the deferred CI re-evaluation triggers.

## Tests

- tests/unit/test_effective_prosody_cap.py: 11 tests covering the helper unit, render_utterance integration, and render_scene event propagation to DialogueTurn.
- tests/unit/test_qa.py::TestProsodyCapRollup: 3 tests verifying cap-event aggregation in qa-report.
- tests/unit/test_asr_sanity.py: 11 tests covering normalize_for_wer, AsrMetrics threshold semantics, and bracket-line stripping in the reference parser. Heavy Whisper inference is exercised by the Tier-3 local run, not these tests.
- 1687 unit tests pass (1662 baseline + 25 new); ruff + mypy clean.

## Docs

- CLAUDE.md: new "ASR sanity check policy" section + "What NOT to do" bullets pinning the cap thresholds and the Tier-3 local-only policy.
- pyproject.toml: new `eval-asr` optional extra.

Reduces #87 (does not fully close; see #89 for the residual WER work).

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
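The runtime cap described in the commit message above is essentially a clamp that also records when it fired. A minimal sketch, assuming pitch is expressed in semitones and rate as a multiplier; the function name `cap_effective_prosody_sketch` and the event dictionary layout are hypothetical and do not reflect the real `_apply_effective_prosody_cap` signature or the `EffectiveProsodyCapEvent` schema.

```python
PITCH_CAP_ST = (-3.0, 2.0)  # bounds quoted in the commit message
RATE_CAP = (0.85, 1.20)


def clamp(value, bounds):
    lo, hi = bounds
    return min(max(value, lo), hi)


def cap_effective_prosody_sketch(pitch_st, rate):
    """Clamp pitch/rate and return (pitch, rate, events).

    Each event records one cap activation (illustrative layout only).
    """
    events = []
    capped_pitch = clamp(pitch_st, PITCH_CAP_ST)
    capped_rate = clamp(rate, RATE_CAP)
    if capped_pitch != pitch_st:
        events.append({"param": "pitch", "requested": pitch_st, "applied": capped_pitch})
    if capped_rate != rate:
        events.append({"param": "rate", "requested": rate, "applied": capped_rate})
    return capped_pitch, capped_rate, events


def semitones_to_percent(st):
    """Convert a semitone offset to a relative frequency change in percent."""
    return (2.0 ** (st / 12.0) - 1.0) * 100.0
```

As a sanity check on the quoted "~ +/- 12 %" figure, +2.0 st corresponds to roughly +12.2 % and -3.0 st to roughly -15.9 % in relative frequency, so the percentage is an approximation that matches the upper bound more closely than the lower one.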
Summary
- F0 drift bound on `SpeakerState` with `f0_drift_exceeded` property for cross-clip monitoring
- Quality gates module (`synthbanshee/tts/quality_gates.py`): sustained-vowel detection (>2.8 s reject), F0 guardrails (male [80,180] Hz, female [150,290] Hz), click detection (DC-offset jumps)
- Quality gates wired into `TTSRenderer.render_scene()` with verbose logging on failure

Changes by file
- `configs/speakers/*.yaml` (6 files): replace `angry`/`sad` styles with `General`
- `configs/examples/*.yaml` (6 files)
- `synthbanshee/tts/quality_gates.py`
- `synthbanshee/tts/speaker_state.py`: `MAX_F0_DRIFT_ST` constant and `f0_drift_exceeded` property
- `synthbanshee/tts/renderer.py`
- `tests/unit/test_quality_gates.py`
- `tests/unit/test_speaker_state.py`
- `tests/unit/test_config.py`: `General` style

Test plan
- `pytest tests/unit/`: 1322 passed
- `ruff check`: all checks passed
- `mypy synthbanshee/tts/quality_gates.py synthbanshee/tts/renderer.py synthbanshee/tts/speaker_state.py`: success, no issues

🤖 Generated with Claude Code