TTS distress cue absent at I3–I5: rate + pitch are not sufficient signal #97

@shaypal5

Description

Symptom

A native-speaker listening test on sp_it_a_0001 (2026-05-07, A/B between the PR #90 reference and the PR #95 R candidate) found that VIC at I3–I5 does not sound distressed in either render. The only audible intensity cue is the +2.0 st pitch ceiling; the rate movement (floored at 0.85 vs. 0.95, a ~12% difference) was below the listener's perceptual threshold.

Verdict, verbatim: "both sound the same, which in both cases doesn't sound very distressed, just a robot whose pitch is a bit higher."
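For scale, a quick back-of-the-envelope sketch of how large the two cues are in relative terms. It assumes "st" means equal-tempered semitones and that the rate floors are plain multipliers; the helper names are illustrative and not taken from the repo.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift given in equal-tempered semitones."""
    return 2.0 ** (semitones / 12.0)


def relative_change(old: float, new: float) -> float:
    """Fractional change when moving from `old` to `new`."""
    return new / old - 1.0


if __name__ == "__main__":
    # +2.0 st pitch ceiling expressed as a frequency ratio
    print(f"pitch ratio at +2.0 st: {semitones_to_ratio(2.0):.4f}")
    # rate floor 0.85 vs 0.95 expressed as a relative difference
    print(f"rate floor difference:  {relative_change(0.85, 0.95):.2%}")
```

Both cues come out at roughly 12% in relative terms (a ratio of about 1.12 for pitch, about 11.8% for rate), yet only the pitch shift was audible in the test.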

Why this matters

The cap layer's job (#87) is to bound prosody for both Whisper compatibility and naturalness. PR #95 confirmed Whisper compatibility holds — but exposed that the prosody knobs the cap operates on (rate, pitch, volume) do not carry the distress signal at all in Azure he-IL voices (AvriNeural, HilaNeural).

This means:

  • Tightening or loosening rate/pitch caps will not improve perceived distress.
  • The constrained-grid experiment D floated in #91 ("fix(tts): #87 follow-up — test rate-floor lift to address residual sp_it WER gap (R)") is now moot: it would optimize WER over prosody dimensions that we now know are perceptually flat for distress.
  • Downstream M17 eval (E3 emotion, E5 LLM judge) on Tier A clips will likely under-perform on intensity discrimination until this is addressed.

Candidate levers (not committed; investigation order TBD)

Roughly ordered by expected impact and cost:

  1. <mstts:express-as style="..."> per-turn at high intensity — Azure he-IL supports a subset of styles. whispering, fearful, shouting, terrified, sad are the candidates per Azure docs. We currently use General for everything. Single-knob test: pick one VIC turn at I4, swap style, A/B vs current. Free (no infra change).
  2. Disfluency density at high intensity — gasps, breaks, false starts, restarts. Already partly modelled in DisfluencyProfile but VIC profile may be too sparse at I4–I5. Cheap to dial up.
  3. Breathiness post-process — separate from M12 breathiness work that already failed gate (May-3 memo); a different synthesis path. Probably needs a research spike before committing.
  4. Per-phrase prosody jitter — instead of one rate/pitch per turn, vary across phrases within a turn. We have PhraseProsody plumbing already.
  5. Switch backend on VIC turns to Google Cloud TTS Chirp 3 HD he-IL — the secondary backend per CLAUDE.md. Different voice model entirely; may carry distress better. Cost: per-turn dispatch, cache invalidation.
  6. Train a custom Azure neural voice — heaviest, slowest, real money. Last resort.
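A minimal sketch of the single-knob test in lever 1, assuming the SSML is assembled as a plain string before it is sent to Azure. `build_ssml` and its parameters are hypothetical and do not reflect the repo's actual renderer; whether a given he-IL voice accepts a given style must be checked against the Azure voice list before relying on it.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def build_ssml(
    text: str,
    voice: str = "he-IL-HilaNeural",
    style: Optional[str] = None,  # e.g. "fearful"; None keeps the default style
    lang: str = "he-IL",
) -> str:
    """Wrap `text` in Azure SSML, optionally inside <mstts:express-as>."""
    body = escape(text)  # escape &, <, > so the turn text cannot break the XML
    if style is not None:
        body = (
            f"<mstts:express-as style={quoteattr(style)}>"
            f"{body}</mstts:express-as>"
        )
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


if __name__ == "__main__":
    turn = "placeholder VIC turn text"  # stand-in for one VIC turn at I4
    print(build_ssml(turn))                   # A: current render, no style
    print(build_ssml(turn, style="fearful"))  # B: candidate style
```

Rendering both variants for the same turn gives the A/B pair for the listening test while leaving rate, pitch, and volume untouched, so any perceived difference can be attributed to the style alone.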

Relationship to existing tickets

Pass criterion (when this is eventually opened)

A native-speaker listening test on sp_it_a_0001 (or a fresh seed in the same intensity arc) where VIC at I4–I5 sounds noticeably more distressed than at I1–I2 to the same listener. Whisper WER must remain ≤ 0.10 — i.e. whatever lever we pick must not reintroduce the silence-detector trip.
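The two-part gate above could be checked along these lines. This is a sketch only: it uses a plain word-level Levenshtein WER with whitespace tokenization and no text normalization, which may differ from the project's actual WER pipeline, and `passes_gate` and its arguments are hypothetical names.

```python
WER_CEILING = 0.10  # from the pass criterion: Whisper WER must remain <= 0.10


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Row-by-row dynamic programming over the edit-distance table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def passes_gate(reference: str, hypothesis: str, listener_hears_distress: bool) -> bool:
    """Both conditions must hold: the listener verdict and the WER ceiling."""
    return listener_hears_distress and word_error_rate(reference, hypothesis) <= WER_CEILING
```

The listener verdict stays a manual input here; only the WER half of the criterion is automatable.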

Out of scope here

Metadata

Labels: comp: tts (TTS rendering, SSML, Azure/Google providers), enhancement (New feature or request)
