Symptom
A native-speaker listening test on sp_it_a_0001 (2026-05-07, A/B between the PR #90 reference and the PR #95 R candidate) found that VIC at I3–I5 does not sound distressed in either render. The only audible intensity cue is the +2.0 st pitch ceiling; the rate change (floor at 0.85 vs. 0.95, a ~12 % difference) was below the perceptual threshold.
Verdict, verbatim: "both sound the same, which in both cases doesn't sound very distressed, just a robot whose pitch is a bit higher."
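For scale, the two cues in the A/B can be put on a common footing. This is a quick sketch, assuming the rate floors are multipliers of normal speaking rate and that "st" means semitones; both are readings of the text above, not confirmed against the cap-layer code.

```python
# Rate floors compared in the A/B (assumed to be multipliers of normal rate).
floor_reference = 0.85
floor_candidate = 0.95

# Relative rate change between the two renders.
rate_change = floor_candidate / floor_reference - 1.0

# A pitch shift of n semitones scales fundamental frequency by 2 ** (n / 12).
pitch_ceiling_st = 2.0
pitch_change = 2.0 ** (pitch_ceiling_st / 12.0) - 1.0

print(f"rate change:  {rate_change:.1%}")   # 11.8%
print(f"pitch change: {pitch_change:.1%}")  # 12.2%
```

The two shifts are of comparable relative size, yet only the pitch shift was reported as audible.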
Why this matters
The cap layer's job (#87) is to bound prosody for both Whisper compatibility and naturalness. PR #95 confirmed Whisper compatibility holds — but exposed that the prosody knobs the cap operates on (rate, pitch, volume) do not carry the distress signal at all in Azure he-IL voices (AvriNeural, HilaNeural).
This means:
Tightening or loosening rate/pitch caps will not improve perceived distress.
Downstream M17 eval (E3 emotion, E5 LLM judge) on Tier A clips will likely under-perform on intensity discrimination until this is addressed.
Candidate levers (not committed; investigation order TBD)
Roughly ordered by expected impact and cost:
<mstts:express-as style="..."> per-turn at high intensity — Azure he-IL supports a subset of styles. whispering, fearful, shouting, terrified, sad are the candidates per Azure docs. We currently use General for everything. Single-knob test: pick one VIC turn at I4, swap style, A/B vs current. Free (no infra change).
Disfluency density at high intensity — gasps, breaks, false starts, restarts. Already partly modelled in DisfluencyProfile but VIC profile may be too sparse at I4–I5. Cheap to dial up.
Breathiness post-process — separate from M12 breathiness work that already failed gate (May-3 memo); a different synthesis path. Probably needs a research spike before committing.
Per-phrase prosody jitter — instead of one rate/pitch per turn, vary across phrases within a turn. We have PhraseProsody plumbing already.
Switch backend on VIC turns to Google Cloud TTS Chirp 3 HD he-IL — the secondary backend per CLAUDE.md. Different voice model entirely; may carry distress better. Cost: per-turn dispatch, cache invalidation.
Train a custom Azure neural voice — heaviest, slowest, real money. Last resort.
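The single-knob style test from the first lever could be prepared roughly as follows. This is a minimal sketch, not the project's SSML builder: `build_ssml` is a hypothetical helper, the turn text is a placeholder, the choice of `he-IL-HilaNeural` for VIC is an assumption, and "General" is assumed to mean no `express-as` wrapper. Whether the he-IL voices actually honor the requested style is exactly what the A/B would establish.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from xml.sax.saxutils import escape

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"
QUOTE = {'"': "&quot;"}  # extra entity so values are safe inside attributes


def build_ssml(text: str, voice: str, style: Optional[str] = None) -> str:
    """Return SSML for one turn, optionally wrapped in mstts:express-as."""
    body = escape(text)
    if style is not None:
        body = (
            f'<mstts:express-as style="{escape(style, QUOTE)}">'
            f"{body}</mstts:express-as>"
        )
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang="he-IL">'
        f'<voice name="{escape(voice, QUOTE)}">{body}</voice>'
        "</speak>"
    )


turn_text = "טקסט לדוגמה"  # placeholder, not a real VIC line
voice = "he-IL-HilaNeural"  # assumption: which voice plays VIC is not stated

baseline = build_ssml(turn_text, voice)                    # current behaviour
candidate = build_ssml(turn_text, voice, style="fearful")  # one knob changed

# Both documents must at least be well-formed XML before they are synthesized.
for ssml in (baseline, candidate):
    ET.fromstring(ssml)
```

Keeping the text, voice, and prosody identical across the two renders means any audible difference can be attributed to the style alone.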
Relationship to existing tickets
#92 (tts: aggregate Hebrew TTS naturalness backlog from 2026-05-06 listening test): lists individual issues (pronunciation, gender forms, breathiness gate fail). This issue is broader: the entire distress cue is absent at I3–I5, not a single artifact. Worth keeping separate so the investigation has its own home, but cross-link.
May-3 listening test memo: already flagged "M12 breathiness FAILED gate; systemic Hebrew TTS pronunciation, prosody, gender form issues." This issue is the prosody clause of that bullet, made concrete by the listening test on PR #95 (the fix(tts) rate-floor lift R experiment for #91 on sp_it_a_0001).
PR #95 (rate-floor lift R experiment on sp_it_a_0001): surfaces this issue as the residual perceptual problem. Merging PR #95 does not block this work.
Pass criterion (when this is eventually opened)
A native-speaker listening test on sp_it_a_0001 (or a fresh seed in the same intensity arc) in which VIC at I4–I5 sounds noticeably more distressed than at I1–I2 to the same listener. Whisper WER must remain ≤ 0.10, i.e. whatever lever we pick must not reintroduce the silence-detector trip.
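The two-part criterion can be expressed as a small gate. This is a sketch under stated assumptions: `word_error_rate` and `passes_gate` are hypothetical helpers, WER is computed here as plain word-level edit distance over whitespace tokens, and the project's actual Whisper evaluation may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


WER_GATE = 0.10  # threshold taken from the pass criterion


def passes_gate(
    reference: str,
    whisper_transcript: str,
    listener_hears_more_distress: bool,
) -> bool:
    """Both conditions must hold: perceptual verdict and WER within the gate."""
    wer = word_error_rate(reference, whisper_transcript)
    return listener_hears_more_distress and wer <= WER_GATE
```

The listener verdict stays a manual input; only the WER half of the criterion is automatable in this form.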
Out of scope here
AGG distress / aggression. Listening test focused on VIC; AGG perception was not flagged as broken in this verdict.
Tier B / Tier C distress. Untested; may or may not generalize from Tier A findings.