TTS distress cue absent at I3–I5: rate + pitch are not sufficient signal #97

## Symptom

Native-speaker listening test on `sp_it_a_0001` (2026-05-07, A/B between the PR #90 reference and the PR #95 R candidate) found that **VIC at I3–I5 does not sound distressed in either render**. The only audible intensity cue is the **+2.0 st pitch ceiling**; the difference in rate floor (0.85 vs. 0.95, ~12%) was below the listener's perceptual threshold.
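For reference, the ~12% figure is the gap between the two rate floors measured against the lower floor. A quick check of the arithmetic (the variable names are illustrative, not taken from the codebase):

```python
# Rate floors compared in the A/B. Which PR carries which floor is not
# restated here; only the two values from the listening test are used.
low_floor, high_floor = 0.85, 0.95

rel_to_low = (high_floor - low_floor) / low_floor    # gap relative to 0.85
rel_to_high = (high_floor - low_floor) / high_floor  # gap relative to 0.95

print(f"{rel_to_low:.1%} relative to the low floor")    # 11.8%
print(f"{rel_to_high:.1%} relative to the high floor")  # 10.5%
```

Either way the step is a little over 10% of speaking rate, which the listener could not hear.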

Verdict, verbatim: *"both sound the same, which in both cases doesn't sound very distressed, just a robot whose pitch is a bit higher."*

## Why this matters

The cap layer's job (#87) is to bound prosody for both Whisper compatibility and naturalness. PR #95 confirmed Whisper compatibility holds — but exposed that **the prosody knobs the cap operates on (rate, pitch, volume) do not carry the distress signal at all** in Azure he-IL voices (`AvriNeural`, `HilaNeural`).

This means:
- Tightening or loosening rate/pitch caps will not improve perceived distress.
- The constrained-grid experiment **D** floated in #91 is now moot — it would optimize WER under prosody dimensions we know are perceptually flat for distress.
- Downstream M17 eval (E3 emotion, E5 LLM judge) on Tier A clips will likely underperform on intensity discrimination until this is addressed.

## Candidate levers (not committed; investigation order TBD)

Roughly ordered by expected impact and cost:

1. **`<mstts:express-as style="...">` per turn at high intensity**: Azure he-IL reportedly supports a subset of styles; `whispering`, `fearful`, `shouting`, `terrified`, `sad` are the candidates per Azure docs. Confirm per-voice support for `AvriNeural` and `HilaNeural` first, because Azure ignores an `express-as` element whose style is unsupported and renders default speech, which would turn the A/B into a silent no-op. We currently use `General` for everything. Single-knob test: pick one VIC turn at I4, swap the style, A/B against the current render. Free (no infra change).
2. **Disfluency density at high intensity**: gasps, breaks, false starts, restarts. Already partly modelled in `DisfluencyProfile`, but the VIC profile may be too sparse at I4–I5. Cheap to dial up.
3. **Breathiness post-process**: distinct from the M12 breathiness work that already failed its gate (May-3 memo); this would use a different synthesis path. Probably needs a research spike before committing.
4. **Per-phrase prosody jitter**: instead of one rate/pitch per turn, vary them across phrases within a turn. We have `PhraseProsody` plumbing already.
5. **Switch backend on VIC turns to Google Cloud TTS Chirp 3 HD he-IL**: the secondary backend per CLAUDE.md. It is a different voice model entirely and may carry distress better. Cost: per-turn dispatch, cache invalidation.
6. **Train a custom Azure neural voice**: heaviest, slowest, real money. Last resort.
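A minimal sketch of the single-knob test for lever 1, assuming the A/B pair differs only in the `mstts:express-as` wrapper. `build_ssml`, its parameters, and the choice of `he-IL-HilaNeural` for VIC are hypothetical and not taken from the repo. The element nesting and the decimal semitone pitch value should be checked against the Azure SSML docs before rendering.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"


def build_ssml(text, voice="he-IL-HilaNeural", style=None,
               rate=1.0, pitch_st=0.0):
    """Return an SSML string; wrap the turn in express-as only if style is set.

    Hypothetical helper: the voice assignment and parameter names are
    assumptions made for illustration.
    """
    body = (f'<prosody rate="{rate:.2f}" pitch="{pitch_st:+.1f}st">'
            f'{escape(text)}</prosody>')
    if style is not None:
        body = (f'<mstts:express-as style={quoteattr(style)}>'
                f'{body}</mstts:express-as>')
    return (f'<speak version="1.0" xmlns="{SSML_NS}" '
            f'xmlns:mstts="{MSTTS_NS}" xml:lang="he-IL">'
            f'<voice name={quoteattr(voice)}>{body}</voice></speak>')


# A/B pair for one VIC turn at I4: identical prosody, only the style differs.
turn_text = "בבקשה, תעזרו לי"  # placeholder line, not from sp_it_a_0001
baseline = build_ssml(turn_text, rate=0.95, pitch_st=2.0)
candidate = build_ssml(turn_text, style="fearful", rate=0.95, pitch_st=2.0)

# Both documents must at least be well-formed XML before they are sent.
for ssml in (baseline, candidate):
    ET.fromstring(ssml)
```

Holding rate and pitch fixed across the pair keeps the comparison to one variable, so any audible difference can be attributed to the style. If the two renders are indistinguishable, the first thing to rule out is that the voice does not support the requested style.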

## Relationship to existing tickets

- **#92 (naturalness backlog)** — lists individual issues (pronunciation, gender forms, breathiness gate fail). This is broader: it's "the entire distress cue is absent at I3–I5," not a single artifact. Worth keeping separate so the investigation has its own home, but cross-link.
- **May-3 listening test memo** — already flagged "M12 breathiness FAILED gate; systemic Hebrew TTS pronunciation, prosody, gender form issues." This issue is the *prosody* clause of that bullet, made concrete by PR #95's listening test.
- **PR #95 (R rate-floor lift)** — closes the WER gap on `sp_it_a_0001`; surfaces this issue as the residual perceptual problem. Merging PR #95 does not block this work.

## Pass criterion (when this is eventually opened)

A native-speaker listening test on `sp_it_a_0001` (or a fresh seed in the same intensity arc) where VIC at I4–I5 sounds **noticeably more distressed than at I1–I2** to the same listener. Whisper WER must remain ≤ 0.10 — i.e. whatever lever we pick must not reintroduce the silence-detector trip.
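The two-part criterion above could be checked mechanically along these lines. This is a sketch under stated assumptions: the 1–5 distress rating scale, the `min_gap` margin, and whitespace tokenization for WER are all hypothetical, and the project's own WER tooling and text normalization should take precedence.

```python
from statistics import mean


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                           # deletion
                           cur[j - 1] + 1,                        # insertion
                           prev[j - 1] + (ref_word != hyp_word))) # substitution
        prev = cur
    return prev[-1] / len(ref)


def passes_gate(low_ratings, high_ratings, wer,
                min_gap=1.0, wer_max=0.10):
    """Same listener rates I1-I2 and I4-I5 turns; both conditions must hold.

    min_gap is an assumed stand-in for "noticeably more distressed".
    """
    distress_ok = mean(high_ratings) - mean(low_ratings) >= min_gap
    return distress_ok and wer <= wer_max


wer = word_error_rate("a b c d e f g h i j", "a b c d e f g h i x")
print(wer, passes_gate([1, 2], [4, 5], wer))
```

A numeric margin is only a proxy; the listener's verbatim verdict remains the primary evidence, as in the PR #95 test.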

## Out of scope here

- AGG distress / aggression. Listening test focused on VIC; AGG perception was not flagged as broken in this verdict.
- Tier B / Tier C distress. Untested; may or may not generalize from Tier A findings.
- Loudness contract (#78) — already separately validated as Whisper-neutral.
