
fix(tts): #91 — rate-floor lift R experiment (sp_it_a_0001) #95

Merged: shaypal5 merged 1 commit into main from fix/m17-rate-floor-lift on May 7, 2026


@shaypal5 shaypal5 commented May 7, 2026

Hypothesis (#91 R)

PR #90's symmetric pitch cap recovered sp_it_a_0001 WER from 0.322 → 0.129, but left a residual gap to the 04-15 baseline (0.056). #89 falsified pitch as the residual driver. R is the next single-knob lever: lift _EFFECTIVE_RATE_MIN from 0.85 → 0.95, on the hypothesis that VIC's I4/I5 slowdown (currently flooring at 0.85) is what still trips Whisper's silence-detection heuristic.

The risk this PR explicitly traded against: VIC slowdown at high intensity is a deliberate distress cue. Flooring rate at 0.95 may flatten it. The native-speaker listening test was the merge gate — Tier-3 ASR alone is not sufficient.

What this PR ships

A single-constant change plus an anchor test:

| file | change |
|------|--------|
| `synthbanshee/tts/renderer.py` | `_EFFECTIVE_RATE_MIN = 0.85` → `0.95`. Comment updated to anchor the new value to #91's R rationale. Pitch cap (`_EFFECTIVE_PITCH_*`) and rate ceiling (`_EFFECTIVE_RATE_MAX`) untouched. |
| `tests/unit/test_effective_prosody_cap.py` | Add `test_rate_floor_anchored_to_0_95_for_issue_91`, a literal-value anchor so an inadvertent revert is caught loudly. Existing rate-floor assertions reference the imported `_EFFECTIVE_RATE_MIN` symbol and auto-track the new value. |

Diff: 2 files, +16 / -1.
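
For illustration, the literal-value anchor pattern described above can be sketched roughly as follows. This is a self-contained approximation, not the code from the PR: in the repository the constant lives in `synthbanshee/tts/renderer.py` and the test imports it, whereas here it is redefined locally. The 1.20 ceiling is taken from the clamp breakdown further down, and `clamp_rate` and `test_rate_floor_tracks_symbol` are hypothetical names introduced only for this example.

```python
# Sketch only: in the repo these constants live in synthbanshee/tts/renderer.py.
_EFFECTIVE_RATE_MIN = 0.95  # lifted from 0.85 for issue #91 (R)
_EFFECTIVE_RATE_MAX = 1.20  # rate ceiling, untouched by this PR


def clamp_rate(rate: float) -> float:
    """Hypothetical helper: clamp a rate multiplier into the effective range."""
    return max(_EFFECTIVE_RATE_MIN, min(_EFFECTIVE_RATE_MAX, rate))


def test_rate_floor_anchored_to_0_95_for_issue_91() -> None:
    # Literal-value anchor: fails loudly if the constant is reverted.
    assert _EFFECTIVE_RATE_MIN == 0.95


def test_rate_floor_tracks_symbol() -> None:
    # Symbol-tracking assertion: follows whatever value the constant holds.
    assert clamp_rate(0.5) == _EFFECTIVE_RATE_MIN
```

The two tests play different roles: the symbol-tracking one keeps passing if the floor is retuned, while the literal anchor deliberately breaks so that any change to 0.95 has to be made on purpose.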

Tier-3 ASR sanity (local)

sp_it_a_0001 re-rendered at seed 1101 into data/m17_pr89_R_rate_floor/ (5 VIC turns re-rendered against Azure due to new floor; AGG and pitch SSML unchanged → cache hits). Whisper-large-v3 + jiwer over four variants:

| variant | dur | WER | length_ratio | hyp / ref |
|---------|----:|----:|-------------:|----------:|
| 04-15 baseline (target) | 155.9s | 0.056 | 1.013 | 236 / 233 |
| PR #90 reference (cap on) | 149.1s | 0.129 | 0.910 | 212 / 233 |
| B falsification (per-role pitch cap) | 149.1s | 0.129 | 0.910 | 212 / 233 |
| R (rate floor 0.95), this PR | 143.8s | 0.052 | 1.009 | 235 / 233 |

WER pass criterion (≤ 0.10) cleared with margin; R reaches WER 0.052, equivalent to the 04-15 baseline. Length-ratio fully recovered to 1.009 (vs PR #90's 0.910).
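
As a rough guide to how the two metrics relate to the word counts, here is a minimal, dependency-free sketch. It is an assumption-laden stand-in, not the evaluation script: the PR used Whisper-large-v3 + jiwer, while this uses a plain word-level Levenshtein distance with no text normalisation. The function names `word_error_rate` and `length_ratio` are hypothetical. The length ratio is taken to be hypothesis words over reference words, which is consistent with the reported figures (236 / 233 ≈ 1.013).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the previous ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def length_ratio(hyp_words: int, ref_words: int) -> float:
    """Hypothesis word count over reference word count."""
    return hyp_words / ref_words


# Word counts from the results reported above.
for label, hyp_n in [("baseline", 236), ("PR #90", 212), ("R", 235)]:
    print(f"{label}: {length_ratio(hyp_n, 233):.3f}")
```

A ratio below 1.0, as in the PR #90 row, means the transcript is shorter than the reference, which is what dropped words would produce.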

Cap-clamp breakdown (this render)

19 cap activations across the scene. Rate-floor clamps are the new behaviour from this PR:

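| turn | role | I | dim | pre | post |
|-----:|------|--:|-----|----:|-----:|
| 04 | VIC | 2 | rate floor | 0.912 | 0.950 |
| 06 | VIC | 2 | rate floor | 0.901 | 0.950 |
| 08 | VIC | 3 | rate floor | 0.878 | 0.950 |
| 10 | VIC | 3 | rate floor | 0.846 | 0.950 |
| 12 | VIC | 3 | rate floor | 0.819 | 0.950 |
| 14 | VIC | 4 | rate floor | 0.833 | 0.950 |
| 14 | VIC | 4 | pitch ceil | +2.29 | +2.00 |
| 16 | VIC | 3 | rate floor | 0.819 | 0.950 |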

Plus the unchanged PR #90 ceiling clamps on AGG turns (rate 1.20, pitch +2.00) — 11 events on AGG at I3-I5.
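
One way such clamp events could be recorded is sketched below. This is not the renderer's actual logic, which the PR does not show: `ClampEvent` and `clamp_with_event` are hypothetical names, and only the bounds stated in this PR (rate floor 0.95, rate ceiling 1.20, pitch ceiling +2.00 st) are used. The pitch floor is left unset because its value is not given here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ClampEvent:  # hypothetical record type, not from the repo
    dim: str    # e.g. "rate floor", "pitch ceil"
    pre: float  # value before clamping
    post: float  # value after clamping


def clamp_with_event(
    name: str, value: float, lo: Optional[float], hi: Optional[float]
) -> Tuple[float, Optional[ClampEvent]]:
    """Clamp value into [lo, hi]; return the result and an event if it moved."""
    if lo is not None and value < lo:
        return lo, ClampEvent(f"{name} floor", value, lo)
    if hi is not None and value > hi:
        return hi, ClampEvent(f"{name} ceil", value, hi)
    return value, None


# Turn 14 (VIC, I4) from the breakdown above: both dimensions clamp.
rate, rate_event = clamp_with_event("rate", 0.833, lo=0.95, hi=1.20)
pitch, pitch_event = clamp_with_event("pitch", 2.29, lo=None, hi=2.00)
events = [e for e in (rate_event, pitch_event) if e is not None]
```

Returning the event alongside the clamped value keeps the tally (19 activations in this render) a by-product of rendering rather than a separate pass.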

Listening test — completed (2026-05-07)

Native-speaker A/B by @shaypal5 between PR #90 reference and R candidate, focused on VIC at I3–I4.

Verdict: parity, not flattening. Both renders sound the same. In both, VIC at I3–I5 does not sound distressed — only the +2.0 st pitch ceiling is audible as an intensity cue, while the rate movement (whether floored at 0.85 or 0.95) is below perceptual threshold.

Verbatim: "both sound the same, which in both cases doesn't sound very distressed, just a robot whose pitch is a bit higher."

What that means for R

R does not regress vs the May-3-calibrated PR #90 cap — naturalness parity holds. The "VIC sounds distressed" bar wasn't being met by main either; that's a deeper TTS issue (rate + pitch knobs are not the right levers for distress in Azure he-IL voices) which is now tracked separately as #97.

Reframed merge rationale: R is a strict WER win at zero perceptual cost. Decoupling the WER fix (this PR) from the distress investigation (#97) is the right call — the two need different levers and shouldn't be gated on each other.

What R does NOT fix

Test plan

References

🤖 Generated with Claude Code

Single-knob change: lift `_EFFECTIVE_RATE_MIN` from 0.85 to 0.95 to test
whether VIC's I4/I5 slowdown is what drives the residual Whisper WER gap
on `sp_it_a_0001` after PR #90.

PR #90's symmetric pitch cap closed most of the #87 WER gap (0.322 →
0.129).  PR #89 (B) falsified pitch as the residual driver.  R is the
next single-knob lever: VIC at high intensity currently floors at rate
0.85 (style_map I5 0.90 × baseline 1.0 × drift toward 0.87 → clamps to
0.85).  Hypothesis: that floor still trips Whisper's silence-detection
heuristic.  Lifting the floor risks flattening VIC's deliberate distress
cue — the paired native-speaker listening test is the merge gate.

Tier-3 ASR sanity (local) on `sp_it_a_0001` (seed 1101, single render):

| variant                              |    dur |   WER | len_r |
|--------------------------------------|-------:|------:|------:|
| 04-15 baseline (target)              | 155.9s | 0.056 | 1.013 |
| PR #90 reference (cap on)            | 149.1s | 0.129 | 0.910 |
| B falsification (per-role pitch cap) | 149.1s | 0.129 | 0.910 |
| R (rate floor 0.95) — this PR        | 143.8s | 0.052 | 1.009 |

WER pass criterion (≤ 0.10) cleared with margin; R matches the 04-15
baseline.  Listening test pending — VIC at I4/I5 must still sound
distressed for the cap relaxation to be acceptable per CLAUDE.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shaypal5 shaypal5 added this to the M17 milestone May 7, 2026
@shaypal5 shaypal5 added the bugfix and comp: tts (TTS rendering, SSML, Azure/Google providers) labels May 7, 2026

Copilot AI review requested due to automatic review settings May 7, 2026 06:11

Copilot AI left a comment


Pull request overview

This PR adjusts the TTS “effective prosody” runtime cap to support issue #91’s R experiment, raising the effective rate floor to reduce Whisper WER regressions observed on high-intensity turns (notably sp_it_a_0001) while keeping the pitch caps and rate ceiling unchanged.

Changes:

  • Lift _EFFECTIVE_RATE_MIN in synthbanshee/tts/renderer.py from 0.85 to 0.95, with updated in-code rationale tying the change to #91.
  • Add a unit-test “anchor” assertion to loudly catch accidental reverts of the new floor value.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|------|-------------|
| `synthbanshee/tts/renderer.py` | Raises the effective rate floor constant and documents the #91 rationale/risks inline. |
| `tests/unit/test_effective_prosody_cap.py` | Adds a literal-value anchor test ensuring the new floor stays at 0.95. |



github-actions Bot commented May 7, 2026

pr-agent-context report:

This is a refreshed snapshot of the current PR state.

This run includes a patch coverage gap on PR #95 in repository https://github.com/DataHackIL/SynthBanshee

Address the patch coverage gaps below, then push all of these changes in a single commit.

# Patch coverage

Patch test coverage is 50%; please raise it to 100%. These are the uncovered code lines:
- synthbanshee/tts/renderer.py: 70

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: schedule
Workflow run: 25480717421 attempt 1
Comment timestamp: 2026-05-07T06:52:07.628445+00:00
PR head commit: b398b4ff68a7ac08d8912aea2c83c7619fc11d65

@shaypal5 shaypal5 marked this pull request as ready for review May 7, 2026 20:46
@shaypal5 shaypal5 merged commit f3b86c4 into main May 7, 2026
10 checks passed
@shaypal5 shaypal5 deleted the fix/m17-rate-floor-lift branch May 7, 2026 20:48


Development

Successfully merging this pull request may close these issues.

fix(tts): #87 follow-up — test rate-floor lift to address residual sp_it WER gap (R)
