investigate(tts): same scene config now yields 63% longer audio (121s → 198s) — confirm new duration regime is intentional #81

@shaypal5

Discovered while re-rendering sp_neu_a_0001 from its scene config on current main for #78.

Evidence

Same scene config (configs/scenes/she_proves/sp_neu_a_0001.yaml, 12 turns, intensity arc [1,1,1,2,1], AGG_M_30-45_001 + VIC_F_25-40_002), rendered before and after the prosody changes that landed in late April / early May:

| Date | Code path | Duration | Vs scene target_duration_minutes: 3.0 (180 s) | Vs She-Proves window (3–6 min) |
| --- | --- | --- | --- | --- |
| 2026-04-15 | pre-M15 / pre-#70 / pre-#68 | 121.0 s | 33% under target | below lower bound |
| 2026-05-05 | current main | 197.8 s | 10% over target | within range |

The new duration is closer to the spec target than the old one (10% over versus 33% under). But the +63% delta is large enough that any downstream code that estimated wall-cost or filename-budget for Tier A clips against the old duration regime is likely now off.
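For anyone double-checking the percentages, here is a minimal sketch of the arithmetic. The helper names (`pct_change`, `summarize`) are illustrative only and do not exist in the repo; the numbers are the ones quoted in this issue.

```python
TARGET_S = 180.0            # target_duration_minutes: 3.0
WINDOW_S = (180.0, 360.0)   # She-Proves window, 3 to 6 min


def pct_change(new, ref):
    """Signed percentage difference of `new` relative to `ref`."""
    return (new - ref) / ref * 100.0


def summarize(duration_s):
    """Return (rounded % vs target, position relative to the window)."""
    lo, hi = WINDOW_S
    if duration_s < lo:
        status = "below lower bound"
    elif duration_s > hi:
        status = "above upper bound"
    else:
        status = "within range"
    return round(pct_change(duration_s, TARGET_S)), status


old_s, new_s = 121.0, 197.8
print(summarize(old_s))                  # (-33, 'below lower bound')
print(summarize(new_s))                  # (10, 'within range')
print(round(pct_change(new_s, old_s)))   # 63
```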

Likely cause (not yet confirmed)

Cumulative effect of recent prosody work, in rough order of suspected magnitude:

A controlled bisect on the same scene config (cached LLM script + cached SSML where possible, only the rendering path changing) will identify the dominant contributor cheaply.
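One possible way to read off the dominant contributor once the bisect renders exist, sketched under assumptions: each intermediate code state has been rendered from the same cached script, and its duration has been measured by hand. `dominant_step` and the `(label, duration_s)` input format are hypothetical, not part of synthbanshee.

```python
def dominant_step(measurements):
    """Find the step with the largest duration change.

    `measurements` is a list of (label, duration_s) pairs ordered from
    oldest to newest code state. Returns ((label, delta_s), all_deltas),
    where each delta is attributed to the later state of the pair.
    """
    if len(measurements) < 2:
        raise ValueError("need at least two measurements")
    deltas = [
        (later_label, later_s - earlier_s)
        for (_, earlier_s), (later_label, later_s)
        in zip(measurements, measurements[1:])
    ]
    # Largest absolute change wins, so a shortening step is also caught.
    return max(deltas, key=lambda item: abs(item[1])), deltas
```

Because the suspected cause is cumulative, reporting every per-step delta (the second return value) is probably more useful than a single "first bad commit" answer.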

Why this is not a #78 dependency

#78 is purely about loudness (peak / RMS). Duration is independent. But both regressions came in the same window of PRs and surface together when re-rendering Tier A clips, so it's worth tracking the duration delta separately; otherwise the loudness fix could accidentally take ownership of "why are clips longer now."

Decision needed

Two questions:

  1. Is the new duration regime intentional? Per CLAUDE.md, She-Proves clips should be 3–6 min. 198 s = 3.3 min satisfies the lower bound; 121 s did not. If the team agrees the new duration is correct, this issue closes as "investigate, document, no code change."
  2. Does anything downstream budget on duration? E.g. label-generator phase boundaries, augmentation event placement, M17 evaluation runtime estimates. If yes, those budgets need to be re-derived against current TTS output.

Reproduction

.venv/bin/synthbanshee generate -c configs/scenes/she_proves/sp_neu_a_0001.yaml \
    -o /tmp/duration_repro -p she_proves
.venv/bin/python -c "import soundfile as sf; print(sf.info('/tmp/duration_repro/agg_m_30-45_001/sp_neu_a_0001_00.wav'))"

Old reference at data/m2a_wettest/agg_m_30-45_001/sp_neu_a_0001_00.wav is 121.0 s.
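To compare the old reference and the fresh render directly, something like the following could work. It is a sketch that uses the standard-library `wave` module instead of `soundfile` to stay dependency-free, and it assumes both files are plain PCM WAV (`wave` may reject other encodings). The function names are illustrative.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file: frame count divided by sample rate."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def duration_delta_pct(old_path, new_path):
    """Return (old_s, new_s, signed % change of new relative to old)."""
    old_s = wav_duration_seconds(old_path)
    new_s = wav_duration_seconds(new_path)
    return old_s, new_s, (new_s - old_s) / old_s * 100.0
```

Usage would be `duration_delta_pct("data/m2a_wettest/agg_m_30-45_001/sp_neu_a_0001_00.wav", "/tmp/duration_repro/agg_m_30-45_001/sp_neu_a_0001_00.wav")`, with both paths taken from this issue.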
