
feat: Phase 1 milestones 1.2 + 1.3 — multi-speaker TTS & LLM script generator#5

Merged
shaypal5 merged 2 commits into main from feat/phase-1-tts-script
Apr 7, 2026

Conversation

@shaypal5
Member

@shaypal5 shaypal5 commented Apr 7, 2026

Summary

  • synthbanshee/script/types.py: DialogueTurn and MixedScene dataclasses shared by the script and TTS modules
  • synthbanshee/script/generator.py: ScriptGenerator renders a Jinja2 prompt template, calls an LLM (Anthropic or OpenAI), parses the JSON dialogue response, validates Hebrew constraints, and caches results to disk (SHA-256 keyed). Also provides inject_disfluency() (Hebrew filled pauses: אממ / אה / אנ) and validate_script().
  • synthbanshee/script/templates/: base_scene.j2 for both projects + intimate_terror_coercive_control.j2 specialisation for she_proves IT scenes
  • synthbanshee/tts/mixer.py: SceneMixer.mix_sequential() decodes WAV bytes, resamples to 16 kHz, downmixes to mono, concatenates with silence gaps, and returns a MixedScene with per-turn onset/offset timestamps for label generation
  • synthbanshee/tts/renderer.py: TTSRenderer.render_scene() wires list[DialogueTurn] → per-turn render_utterance() → SceneMixer → MixedScene
  • configs/scenes/test_scene_001.yaml: fix script_template path (avdp/ → synthbanshee/)
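
The per-turn onset/offset bookkeeping described for mix_sequential() can be sketched roughly as follows. This is an illustration only, not the code in synthbanshee/tts/mixer.py: the `turn_timestamps` helper and its tuple layout are hypothetical, and it works on sample counts (as they would be after the resample to 16 kHz) instead of decoding real WAV bytes.

```python
SAMPLE_RATE_HZ = 16_000  # target rate named in the PR summary


def turn_timestamps(segments, sample_rate=SAMPLE_RATE_HZ):
    """Return (speaker_id, onset_s, offset_s) for each turn.

    `segments` is a list of (num_samples, pause_before_s, speaker_id)
    tuples. This layout is an assumption made for the sketch.
    """
    timeline = []
    cursor = 0  # write position in the mixed waveform, in samples
    for num_samples, pause_before_s, speaker_id in segments:
        # The silence gap comes first, so it shifts the onset of this turn.
        cursor += int(round(pause_before_s * sample_rate))
        onset = cursor
        cursor += num_samples
        timeline.append((speaker_id, onset / sample_rate, cursor / sample_rate))
    return timeline


if __name__ == "__main__":
    demo = [(16_000, 0.0, "spk_a"), (8_000, 0.5, "spk_b")]
    for row in turn_timestamps(demo):
        print(row)
```

Tracking the cursor in integer samples and converting to seconds only at the end keeps the timestamps aligned with the concatenated array, which matters when the timestamps are later used to generate labels.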

Test plan

  • 9 TestSceneMixer unit tests — empty segments, pause offsets, resample, stereo downmix, float32 dtype
  • 21 TestScriptGenerator + TestInjectDisfluency + TestValidateScript unit tests — cache hit/miss, LLM mock, markdown fence stripping, Hebrew validation, disfluency injection
  • 12 TestRenderScene integration tests — full render_scene() pipeline with mocked Azure TTS
  • All 127 existing tests still pass
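
The "markdown fence stripping" case in the test plan refers to LLM replies that wrap the JSON dialogue in a fenced code block. A minimal sketch of that kind of cleanup is shown below. The helper names `strip_markdown_fence` and `parse_dialogue` are hypothetical, and the actual parsing in generator.py may differ.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so the example can itself live in a fenced block

_FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*\s*\n(.*?)\n?\s*" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_markdown_fence(text: str) -> str:
    """Return the fenced body if `text` is one fenced block, else `text` stripped."""
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text.strip()


def parse_dialogue(raw: str):
    """Parse an LLM reply as JSON, tolerating an optional surrounding fence."""
    return json.loads(strip_markdown_fence(raw))


if __name__ == "__main__":
    reply = FENCE + 'json\n[{"speaker_id": "a", "text": "שלום"}]\n' + FENCE
    print(parse_dialogue(reply))
```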

🤖 Generated with Claude Code

… generator

- synthbanshee/script/types.py: DialogueTurn and MixedScene dataclasses
- synthbanshee/script/generator.py: ScriptGenerator (Anthropic/OpenAI, Jinja2
  prompt, SHA-256 disk cache, JSON response parsing, Hebrew script validation),
  inject_disfluency (Hebrew filled pauses), validate_script
- synthbanshee/script/templates/: base_scene.j2 for she_proves and elephant;
  intimate_terror_coercive_control.j2 for IT scenes
- synthbanshee/tts/mixer.py: SceneMixer.mix_sequential — decodes WAV bytes,
  resamples to 16 kHz, downmixes to mono, concatenates with silence gaps,
  returns MixedScene with per-turn onset/offset times
- synthbanshee/tts/renderer.py: TTSRenderer.render_scene — renders a
  list[DialogueTurn] → MixedScene via per-turn render_utterance + SceneMixer
- configs/scenes/test_scene_001.yaml: fix script_template path (avdp → synthbanshee)
- 42 new tests (9 mixer unit, 21 script-generator unit, 12 multi-speaker integration)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@shaypal5 shaypal5 self-assigned this Apr 7, 2026
@shaypal5 shaypal5 requested a review from Copilot April 7, 2026 12:32
@shaypal5 shaypal5 added the enhancement New feature or request label Apr 7, 2026

Copilot AI left a comment


Pull request overview

Adds Phase 1 building blocks for generating multi-speaker Hebrew dialogue scripts via LLMs and rendering them into a single mixed TTS waveform with per-turn timing metadata.

Changes:

  • Introduces ScriptGenerator with Jinja2 prompt rendering, LLM provider support (Anthropic/OpenAI), JSON parsing, Hebrew/script validation, disfluency injection, and on-disk caching.
  • Adds multi-speaker TTS pipeline: TTSRenderer.render_scene() → per-turn render → SceneMixer.mix_sequential() → MixedScene (timestamps + speaker_ids).
  • Adds unit/integration tests plus new script prompt templates and fixes a scene config template path.
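
As an illustration of the on-disk caching mentioned above, here is a minimal sketch of a SHA-256 keyed cache. The PR does not say which inputs go into the key, so hashing the rendered prompt together with the provider and model name is an assumption, and `cache_key` / `load_or_generate` are hypothetical names, not functions from generator.py.

```python
import hashlib
import json
from pathlib import Path


def cache_key(prompt: str, provider: str, model: str) -> str:
    """Derive a stable hex key from the inputs assumed to determine the output."""
    payload = json.dumps(
        {"prompt": prompt, "provider": provider, "model": model},
        sort_keys=True,        # stable field order, so equal inputs hash equally
        ensure_ascii=False,    # keep Hebrew text as-is before UTF-8 encoding
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def load_or_generate(cache_dir: Path, prompt, provider, model, generate):
    """Return the cached result for these inputs, calling `generate` on a miss."""
    path = cache_dir / f"{cache_key(prompt, provider, model)}.json"
    if path.exists():  # cache hit: no LLM call
        return json.loads(path.read_text(encoding="utf-8"))
    result = generate(prompt)  # cache miss: `generate` stands in for the LLM call
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result, ensure_ascii=False), encoding="utf-8")
    return result
```

Including the provider and model in the key means that switching between Anthropic and OpenAI would not silently reuse a script generated by the other backend.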

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/unit/test_script_generator.py | Unit tests for ScriptGenerator cache/parse/validation and disfluency injection. |
| tests/unit/test_mixer.py | Unit tests for sequential mixing, resampling, downmixing, and timing metadata. |
| tests/integration/test_multi_speaker.py | Integration coverage for render_scene() with mocked Azure TTS output. |
| synthbanshee/tts/renderer.py | Adds render_scene() orchestration for multi-turn rendering + mixing. |
| synthbanshee/tts/mixer.py | New SceneMixer that decodes WAV bytes, resamples/downmixes, concatenates, and returns MixedScene. |
| synthbanshee/tts/__init__.py | Re-exports SceneMixer from the TTS package. |
| synthbanshee/script/types.py | Adds shared DialogueTurn and MixedScene dataclasses for script/TTS boundary. |
| synthbanshee/script/templates/she_proves/intimate_terror_coercive_control.j2 | New specialized prompt template for IT coercive-control scenes. |
| synthbanshee/script/templates/she_proves/base_scene.j2 | New base prompt template for she_proves scenes. |
| synthbanshee/script/templates/elephant/base_scene.j2 | New base prompt template for elephant scenes. |
| synthbanshee/script/generator.py | New LLM script generator + cache + validation utilities. |
| synthbanshee/script/__init__.py | Exposes script generator/types from the script package. |
| configs/scenes/test_scene_001.yaml | Fixes script_template path to the new template location. |


Comment thread tests/integration/test_multi_speaker.py
Comment thread synthbanshee/script/templates/she_proves/intimate_terror_coercive_control.j2 Outdated
Comment thread synthbanshee/script/generator.py
Comment thread synthbanshee/tts/mixer.py Outdated
COPILOT-2: remove bogus {% extends %} from intimate_terror_coercive_control.j2;
  template is now standalone (Jinja2 extends without blocks silently drops all
  child content, making the domain-specific guidance invisible)

COPILOT-3: validate_script() now checks pause_before_s is finite and in [0.0, 1.5];
  negative or huge values would crash SceneMixer with negative array dimensions

COPILOT-4: fix mixer.py module docstring — segments are triples
  (wav_bytes, pause_before_s, speaker_id), not pairs

Coverage: add TestLLMDispatch covering _call_anthropic, _call_openai, _call_llm
  routing via sys.modules injection (lines 232–256, now at 100%)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
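
The COPILOT-3 check in the commit message above (pause_before_s must be finite and within [0.0, 1.5]) can be sketched as follows. This is a hypothetical stand-alone helper, not the body of validate_script(); representing turns as plain dicts is an assumption made for the example.

```python
import math

MAX_PAUSE_S = 1.5  # upper bound quoted in the COPILOT-3 note


def pause_errors(turns):
    """Return a list of problems; an empty list means every pause is valid."""
    errors = []
    for index, turn in enumerate(turns):
        pause = turn.get("pause_before_s", 0.0)
        if isinstance(pause, bool) or not isinstance(pause, (int, float)):
            errors.append(f"turn {index}: pause_before_s is not a number")
        elif not math.isfinite(pause):
            # NaN and +/-inf cannot be converted to a sample count.
            errors.append(f"turn {index}: pause_before_s is not finite")
        elif not 0.0 <= pause <= MAX_PAUSE_S:
            errors.append(
                f"turn {index}: pause_before_s={pause} outside [0.0, {MAX_PAUSE_S}]"
            )
    return errors


if __name__ == "__main__":
    print(pause_errors([{"pause_before_s": 0.3}, {"pause_before_s": -0.5}]))
```

Rejecting these values at validation time keeps the failure next to the script that caused it. Otherwise a negative pause only surfaces later, when the mixer tries to allocate a silence buffer with a negative length.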

@shaypal5 shaypal5 merged commit 421b884 into main Apr 7, 2026
6 checks passed
@shaypal5 shaypal5 deleted the feat/phase-1-tts-script branch April 7, 2026 15:46
@github-actions

github-actions Bot commented Apr 7, 2026

pr-agent-context report:

🚨 `pr-agent-context` failed while preparing PR context.

PR: #5
Error: CalledProcessError: Command '['git', '-C', '/home/runner/work/SynthBanshee/SynthBanshee/caller-repo', 'diff', '--unified=0', 'cf71bdfb309e17b1710bb6ce20c342c8a647f0c0...d81888b239e70ba07091891bd842119d70de3d85']' returned non-zero exit status 128.
Run: https://github.com/DataHackIL/SynthBanshee/actions/runs/24090573071

The workflow continued gracefully so this failure does not block CI.
Check the job logs for the full traceback.

Run metadata:

Tool ref: v4
Tool version: 4.0.14
Trigger: status updated
Workflow run: 24090573071 attempt 1
Comment timestamp: 2026-04-07T15:47:13.968572+00:00
PR head commit: d81888b239e70ba07091891bd842119d70de3d85

shaypal5 added a commit that referenced this pull request May 5, 2026
…metadata trail (lever probe killed Whisper hypothesis)

Self-review of the prior commit found the Whisper-fix framing empirically
false.  Reframed the PR around the actual deliverables (loudness contract
clarity + metadata trail) and addressed all blocking review points.

Empirical finding (lever probe, 2026-05-05, openai/whisper-large-v3):
across 8 single-global-gain variants of the same source audio at peaks
from −6 to +7 dBFS and RMS from −27 to −14 dBFS, Whisper produced
byte-identical hypotheses in 7/8 cases (WER 0.286, length-ratio 0.762).
Loudness in the spec range is invisible to Whisper's log-mel extractor.
The actual ASR regression is content-driven (likely M15 / #70 prosody
changes) and is now tracked separately in #83.

Review fixes applied:
- #4 Metadata trail (the structural fix that prevents #78 recurring):
  GenerationMetadata gains `loudness_target_peak_dbfs: float | None` and
  bumps `normalization_strategy` to "per_turn_rms_v2_target_peak".
  Without this field, the original regression hid for three weeks
  behind unchanged generator_version.
- #5 Pydantic upper bound −1.0 → −1.5.  0.5 dB headroom over the safety
  limiter ceiling guarantees the two stages cannot collide under
  float-arithmetic noise.
- #6 Step names flattened to literal tokens (peak_normalize, peak_limit,
  silence_pad — no embedded numeric parameters).  Configured value moves
  to PreprocessingResult.target_peak_dbfs as structured data so QA
  tooling reads policy via field access, not regex on a step string.
- #7 Renamed misleading `test_loud_clip_clamped_below_ceiling` to
  `test_loud_input_lands_at_target_not_at_ceiling` — the limiter is now
  a guaranteed no-op in normal flow; the target step is what does the
  work.
- #8 Replaced 440 Hz sine fixture with bandpass-filtered Gaussian noise
  targeting 18 dB crest, mirroring real Hebrew TTS post-M3a behaviour.
  Sine's 3 dB crest let regressions slip past peak-anchored asserts.
- #9 Tightened tolerances 0.5 → 0.1 dB (peak) and 0.5 → 0.2 dB (RMS
  contrast).  Original #78 deviation was 4.6 dB so 0.5 dB only caught
  it by luck; 0.1 dB catches subtle regressions.
- #10 Tier B/C cli.py path coverage: two new tests in
  test_tier_b_pipeline.py asserting (a) post-augment peak lands at the
  configured target and (b) GenerationMetadata carries the new
  normalization_strategy + loudness_target_peak_dbfs fields.
- #11 Dropped fictional "M3c" milestone tag everywhere.  Tracker row in
  audio_generation_v3_design.md is now indexed by #78.  peak_normalize_to_target
  docstring expanded to mention dual use (Stage 3a + Stage 3b post-augment).
- #3 Tier B/C −1.0 → −2.0 dBFS behaviour change called out explicitly in
  PR body, AGENTS.md, and the §4.7 spec change.  This is incidentally
  better for Tier B/C (gives 1 dB inter-sample-peak headroom for room IR
  / noise mixing), but it IS a change.

Verification:
- pytest 1720 passed / 1 skipped (added 4 tests, removed 1 obsolete)
- ruff: clean
- mypy: 36 unique errors, identical to main (zero new)
- Live re-render of sp_neu_a_0001 produces peak = -2.000 dBFS exactly,
  with metadata fields populated as expected.

Follow-up issue filed: #83 — investigate the actual Whisper WER cause
(prosody/duration, not loudness).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
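
The "single global gain" idea behind the target-peak step in the commit above can be illustrated with a small pure-Python sketch. It is not the project's peak_normalize_to_target() helper: `normalize_peak_sketch` is a hypothetical name, samples are plain floats in [-1, 1], and the −2.0 dBFS default is taken from the commit text.

```python
import math


def dbfs(value: float) -> float:
    """Convert a positive linear amplitude to dB relative to full scale."""
    return 20.0 * math.log10(value)


def peak(samples):
    return max(abs(s) for s in samples)


def rms(samples):
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_peak_sketch(samples, target_peak_dbfs=-2.0):
    """Scale the whole clip by one gain so its peak lands on the target."""
    current = peak(samples)
    if current == 0.0:  # pure silence: nothing to scale
        return list(samples)
    gain = 10.0 ** (target_peak_dbfs / 20.0) / current
    return [s * gain for s in samples]


if __name__ == "__main__":
    quiet_turn = [0.05, -0.05, 0.05, -0.05]
    loud_turn = [0.2, -0.4, 0.3, -0.1]
    out = normalize_peak_sketch(quiet_turn + loud_turn)
    print(round(dbfs(peak(out)), 3))
    # One gain multiplies every turn equally, so the loud/quiet RMS ratio is unchanged.
    print(rms(loud_turn) / rms(quiet_turn), rms(out[4:]) / rms(out[:4]))
```

Because every sample is multiplied by the same factor, per-turn RMS ratios are preserved, which is the property the commit contrasts with per-segment peak normalization.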
shaypal5 added a commit that referenced this pull request May 5, 2026
…es NOT recover Whisper — see #83) (#82)

* fix(preprocessing): #78 restore absolute clip loudness via single-gain target peak (M3c)

M3a per-turn RMS targeting + M3b limiter-only normalization left M3a-shaped
Tier A clips peaking ~6 dB below the −1 dBFS ceiling.  Whisper under-
transcribed (WER 0.04 → 0.28, length-ratio 0.76) and UTMOS read the
loosely-limited clips ~0.9 MOS higher, dominating M17 evaluation.

Fix is a *single global gain* post-mix loudness step — preserves per-turn
RMS ratios exactly (M3a's whole point), unlike the legacy per-segment peak
normalizer M3b removed.  Configurable target with a sane default.

Changes:
- synthbanshee/augment/preprocessing.py: new peak_normalize_to_target()
  helper; preprocess() runs it before the existing −1 dBFS safety limiter
- synthbanshee/config/preprocessing_config.py: target_peak_dbfs field,
  default −2.0 dBFS, range [−12.0, −1.0]
- synthbanshee/cli.py: Stage 3b post-augment normalize uses the same
  helper + same config so Tier A and Tier B/C exit at the same peak,
  fixing a pre-existing asymmetry that hid the regression on Tier A
- tests/unit/test_preprocessing.py: invert the M3b-era "quiet clip not
  scaled up" assertion, add direct unit tests for the helper, validate
  config range constraints
- tests/integration/test_loudness_regression.py: end-to-end guard
  asserting peak lands in [−3, −1] dBFS and per-turn RMS contrast
  survives a single global gain
- AGENTS.md, docs/spec.md §3 + §3.1, docs/audio_generation_v3_design.md
  §4.7 + tracker M3c row: spec text in sync with the two-stage policy

Verification (sp_neu_a_0001 same-config A/B):
  2026-04-15 reference          peak −1.00 dBFS  rms −20.37 dBFS
  2026-05-05 before fix         peak −5.62 dBFS  rms −26.68 dBFS
  2026-05-05 after fix          peak −2.00 dBFS  rms −23.06 dBFS

Closes #78

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(preprocessing): #78 review pass — reframe to loudness-contract + metadata trail (lever probe killed Whisper hypothesis)

Self-review of the prior commit found the Whisper-fix framing empirically
false.  Reframed the PR around the actual deliverables (loudness contract
clarity + metadata trail) and addressed all blocking review points.

Empirical finding (lever probe, 2026-05-05, openai/whisper-large-v3):
across 8 single-global-gain variants of the same source audio at peaks
from −6 to +7 dBFS and RMS from −27 to −14 dBFS, Whisper produced
byte-identical hypotheses in 7/8 cases (WER 0.286, length-ratio 0.762).
Loudness in the spec range is invisible to Whisper's log-mel extractor.
The actual ASR regression is content-driven (likely M15 / #70 prosody
changes) and is now tracked separately in #83.

Review fixes applied:
- #4 Metadata trail (the structural fix that prevents #78 recurring):
  GenerationMetadata gains `loudness_target_peak_dbfs: float | None` and
  bumps `normalization_strategy` to "per_turn_rms_v2_target_peak".
  Without this field, the original regression hid for three weeks
  behind unchanged generator_version.
- #5 Pydantic upper bound −1.0 → −1.5.  0.5 dB headroom over the safety
  limiter ceiling guarantees the two stages cannot collide under
  float-arithmetic noise.
- #6 Step names flattened to literal tokens (peak_normalize, peak_limit,
  silence_pad — no embedded numeric parameters).  Configured value moves
  to PreprocessingResult.target_peak_dbfs as structured data so QA
  tooling reads policy via field access, not regex on a step string.
- #7 Renamed misleading `test_loud_clip_clamped_below_ceiling` to
  `test_loud_input_lands_at_target_not_at_ceiling` — the limiter is now
  a guaranteed no-op in normal flow; the target step is what does the
  work.
- #8 Replaced 440 Hz sine fixture with bandpass-filtered Gaussian noise
  targeting 18 dB crest, mirroring real Hebrew TTS post-M3a behaviour.
  Sine's 3 dB crest let regressions slip past peak-anchored asserts.
- #9 Tightened tolerances 0.5 → 0.1 dB (peak) and 0.5 → 0.2 dB (RMS
  contrast).  Original #78 deviation was 4.6 dB so 0.5 dB only caught
  it by luck; 0.1 dB catches subtle regressions.
- #10 Tier B/C cli.py path coverage: two new tests in
  test_tier_b_pipeline.py asserting (a) post-augment peak lands at the
  configured target and (b) GenerationMetadata carries the new
  normalization_strategy + loudness_target_peak_dbfs fields.
- #11 Dropped fictional "M3c" milestone tag everywhere.  Tracker row in
  audio_generation_v3_design.md is now indexed by #78.  peak_normalize_to_target
  docstring expanded to mention dual use (Stage 3a + Stage 3b post-augment).
- #3 Tier B/C −1.0 → −2.0 dBFS behaviour change called out explicitly in
  PR body, AGENTS.md, and the §4.7 spec change.  This is incidentally
  better for Tier B/C (gives 1 dB inter-sample-peak headroom for room IR
  / noise mixing), but it IS a change.

Verification:
- pytest 1720 passed / 1 skipped (added 4 tests, removed 1 obsolete)
- ruff: clean
- mypy: 36 unique errors, identical to main (zero new)
- Live re-render of sp_neu_a_0001 produces peak = -2.000 dBFS exactly,
  with metadata fields populated as expected.

Follow-up issue filed: #83 — investigate the actual Whisper WER cause
(prosody/duration, not loudness).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(preprocessing): #78 mirror "5a then 5b" policy on Tier B/C path (PR #82 COPILOT-1)

cli.py Stage 3b previously ran only the target-peak step (5a), not the
safety limiter (5b), giving the documented two-stage policy a paper-vs-
reality gap.  The limiter is provably a no-op given Pydantic bounds
(target ≤ −1.5, ceiling = −1.0, PCM_16 quantisation cannot bridge the
0.5 dB margin), but applying it uniformly closes the asymmetry between
Tier A (preprocess()) and Tier B/C (cli.py) and prevents a future
spec/code drift from silently violating §4.7.

PR #82 review thread COPILOT-2 (docs/spec.md §3.1 steps 1-4 stale since
M14) is correct but pre-existing and unrelated to #78 — filed as #84
and resolved out-of-scope here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
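
The sp_neu_a_0001 A/B figures quoted in the first commit of the squash message above are consistent with a single global gain: a pure gain shifts peak and RMS by the same number of dB and leaves the crest factor (peak minus RMS, in dB) unchanged. A small check, using only the numbers from that verification table (the function name and the 0.01 dB tolerance are choices made for this sketch):

```python
def single_gain_check(before, after, tol_db=0.01):
    """Return (gain_db, consistent) for (peak_dbfs, rms_dbfs) pairs.

    `consistent` is True when peak and RMS moved by the same amount,
    which is what a single global gain would produce.
    """
    gain_from_peak = after[0] - before[0]
    gain_from_rms = after[1] - before[1]
    return round(gain_from_peak, 2), abs(gain_from_peak - gain_from_rms) <= tol_db


if __name__ == "__main__":
    before_fix = (-5.62, -26.68)  # peak dBFS, RMS dBFS (2026-05-05 before fix)
    after_fix = (-2.00, -23.06)   # peak dBFS, RMS dBFS (2026-05-05 after fix)
    gain_db, consistent = single_gain_check(before_fix, after_fix)
    print(gain_db, consistent)
    # Crest factor before and after the fix, in dB.
    print(round(before_fix[0] - before_fix[1], 2), round(after_fix[0] - after_fix[1], 2))
```

Both rows give a crest factor of 21.06 dB and imply the same 3.62 dB gain, so the reported after-fix RMS follows directly from the before-fix values.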

Labels

enhancement New feature or request


2 participants