feat: Phase 1 milestones 1.2 + 1.3 — multi-speaker TTS & LLM script generator by shaypal5 · Pull Request #5 · DataHackIL/SynthBanshee

shaypal5 · 2026-04-07T12:32:01Z

Summary

synthbanshee/script/types.py — DialogueTurn and MixedScene dataclasses shared by the script and TTS modules
synthbanshee/script/generator.py — ScriptGenerator: renders a Jinja2 prompt template, calls an LLM (Anthropic or OpenAI), parses the JSON dialogue response, validates Hebrew constraints, and caches results to disk (SHA-256 keyed). Also provides inject_disfluency() (Hebrew filled pauses: אממ / אה / אנ) and validate_script().
synthbanshee/script/templates/ — base_scene.j2 for both projects + intimate_terror_coercive_control.j2 specialisation for she_proves IT scenes
synthbanshee/tts/mixer.py — SceneMixer.mix_sequential(): decodes WAV bytes, resamples to 16 kHz, downmixes to mono, concatenates with silence gaps, returns MixedScene with per-turn onset/offset timestamps for label generation
synthbanshee/tts/renderer.py — TTSRenderer.render_scene(): wires list[DialogueTurn] → per-turn render_utterance() → SceneMixer → MixedScene
configs/scenes/test_scene_001.yaml — fix script_template path (avdp/ → synthbanshee/)
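
The sequential mixing step described for synthbanshee/tts/mixer.py can be sketched as follows. This is a minimal, dependency-free illustration and not the code in the PR: `MixedSceneSketch`, its field names, `decode_wav_mono` and `resample_linear` are hypothetical, the resampler is a naive linear interpolator standing in for whatever resampler the project actually uses, and only 16-bit PCM input is handled.

```python
import io
import struct
import wave
from dataclasses import dataclass, field

TARGET_SR = 16_000  # target sample rate named in the PR summary


@dataclass
class MixedSceneSketch:
    """Hypothetical stand-in for MixedScene; the field names are assumptions."""
    samples: list = field(default_factory=list)
    turn_onsets_s: list = field(default_factory=list)
    turn_offsets_s: list = field(default_factory=list)
    speaker_ids: list = field(default_factory=list)


def decode_wav_mono(wav_bytes):
    """Decode 16-bit PCM WAV bytes into mono floats in [-1, 1] plus the sample rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_ch, sr = wf.getnchannels(), wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)
    # Downmix by averaging the interleaved channels of each frame.
    mono = [sum(ints[i:i + n_ch]) / (n_ch * 32768.0) for i in range(0, len(ints), n_ch)]
    return mono, sr


def resample_linear(x, sr_in, sr_out):
    """Naive linear-interpolation resampler (no anti-alias filter)."""
    if sr_in == sr_out or not x:
        return list(x)
    out = []
    for j in range(round(len(x) * sr_out / sr_in)):
        pos = j * sr_in / sr_out
        i = min(int(pos), len(x) - 1)
        nxt = min(i + 1, len(x) - 1)
        frac = pos - i
        out.append(x[i] * (1.0 - frac) + x[nxt] * frac)
    return out


def mix_sequential(segments):
    """Concatenate (wav_bytes, pause_before_s, speaker_id) triples with silence gaps."""
    scene = MixedSceneSketch()
    for wav_bytes, pause_before_s, speaker_id in segments:
        # Silence gap before the turn, measured in target-rate samples.
        scene.samples.extend([0.0] * round(pause_before_s * TARGET_SR))
        mono, sr = decode_wav_mono(wav_bytes)
        audio = resample_linear(mono, sr, TARGET_SR)
        scene.turn_onsets_s.append(len(scene.samples) / TARGET_SR)
        scene.samples.extend(audio)
        scene.turn_offsets_s.append(len(scene.samples) / TARGET_SR)
        scene.speaker_ids.append(speaker_id)
    return scene
```

Deriving each onset and offset from the running sample count, rather than summing durations separately, keeps the timestamps consistent with the audio that label generation will later index into.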

Test plan

9 TestSceneMixer unit tests — empty segments, pause offsets, resample, stereo downmix, float32 dtype
21 TestScriptGenerator + TestInjectDisfluency + TestValidateScript unit tests — cache hit/miss, LLM mock, markdown fence stripping, Hebrew validation, disfluency injection
12 TestRenderScene integration tests — full render_scene() pipeline with mocked Azure TTS
All 127 existing tests still pass
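
The "markdown fence stripping" case in the test plan refers to LLM replies that wrap the JSON dialogue in a fenced code block. A minimal sketch of that parsing step follows; `strip_markdown_fence` and `parse_dialogue_json` are hypothetical names, and the assumption that the response is a JSON list of turn objects is mine, not confirmed by the PR.

```python
import json
import re

FENCE = "`" * 3  # a literal markdown code fence, built indirectly for readability

# Matches an optional language tag after the opening fence and captures the body.
_FENCE_RE = re.compile(
    rf"^\s*{FENCE}[\w-]*[ \t]*\n(.*?)\n?[ \t]*{FENCE}\s*$",
    re.DOTALL,
)


def strip_markdown_fence(text):
    """Return the body of a fenced block, or the stripped text if there is no fence."""
    match = _FENCE_RE.match(text)
    return match.group(1) if match else text.strip()


def parse_dialogue_json(raw):
    """Parse an LLM reply into a list of turn dicts (assumed response shape)."""
    data = json.loads(strip_markdown_fence(raw))
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of dialogue turns")
    return data
```

Unfenced replies pass through unchanged, so the same function can be used whether or not the model adds a fence.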

🤖 Generated with Claude Code

… generator

- synthbanshee/script/types.py: DialogueTurn and MixedScene dataclasses
- synthbanshee/script/generator.py: ScriptGenerator (Anthropic/OpenAI, Jinja2 prompt, SHA-256 disk cache, JSON response parsing, Hebrew script validation), inject_disfluency (Hebrew filled pauses), validate_script
- synthbanshee/script/templates/: base_scene.j2 for she_proves and elephant; intimate_terror_coercive_control.j2 for IT scenes
- synthbanshee/tts/mixer.py: SceneMixer.mix_sequential decodes WAV bytes, resamples to 16 kHz, downmixes to mono, concatenates with silence gaps, returns MixedScene with per-turn onset/offset times
- synthbanshee/tts/renderer.py: TTSRenderer.render_scene renders a list[DialogueTurn] → MixedScene via per-turn render_utterance + SceneMixer
- configs/scenes/test_scene_001.yaml: fix script_template path (avdp → synthbanshee)
- 42 new tests (9 mixer unit, 21 script-generator unit, 12 multi-speaker integration)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Adds Phase 1 building blocks for generating multi-speaker Hebrew dialogue scripts via LLMs and rendering them into a single mixed TTS waveform with per-turn timing metadata.

Changes:

Introduces ScriptGenerator with Jinja2 prompt rendering, LLM provider support (Anthropic/OpenAI), JSON parsing, Hebrew/script validation, disfluency injection, and on-disk caching.
Adds multi-speaker TTS pipeline: TTSRenderer.render_scene() → per-turn render → SceneMixer.mix_sequential() → MixedScene (timestamps + speaker_ids).
Adds unit/integration tests plus new script prompt templates and fixes a scene config template path.
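
The on-disk caching mentioned above is described in the PR summary as SHA-256 keyed. A minimal sketch of one way such a cache could work is shown below; which inputs feed the hash, the file layout, and the names `cache_key` and `cached_generate` are all assumptions for illustration, not the PR's implementation.

```python
import hashlib
import json
from pathlib import Path


def cache_key(prompt, provider, model):
    """Hash everything that influences the LLM output (assumed: prompt, provider, model)."""
    payload = json.dumps(
        {"prompt": prompt, "provider": provider, "model": model},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_generate(prompt, provider, model, cache_dir, call_llm):
    """Return cached turns if present; otherwise call the LLM and store the result."""
    path = Path(cache_dir) / f"{cache_key(prompt, provider, model)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    turns = call_llm(prompt)  # call_llm is a caller-supplied function in this sketch
    path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps Hebrew text readable in the cache file.
    path.write_text(json.dumps(turns, ensure_ascii=False), encoding="utf-8")
    return turns
```

Keying on the rendered prompt means any template or scene-config change produces a new cache entry, while repeated runs of the same scene avoid a second LLM call.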

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tests/unit/test_script_generator.py | Unit tests for ScriptGenerator cache/parse/validation and disfluency injection. |
| tests/unit/test_mixer.py | Unit tests for sequential mixing, resampling, downmixing, and timing metadata. |
| tests/integration/test_multi_speaker.py | Integration coverage for `render_scene()` with mocked Azure TTS output. |
| synthbanshee/tts/renderer.py | Adds `render_scene()` orchestration for multi-turn rendering + mixing. |
| synthbanshee/tts/mixer.py | New `SceneMixer` that decodes WAV bytes, resamples/downmixes, concatenates, and returns `MixedScene`. |
| synthbanshee/tts/\_\_init\_\_.py | Re-exports `SceneMixer` from the TTS package. |
| synthbanshee/script/types.py | Adds shared `DialogueTurn` and `MixedScene` dataclasses for script/TTS boundary. |
| synthbanshee/script/templates/she_proves/intimate_terror_coercive_control.j2 | New specialized prompt template for IT coercive-control scenes. |
| synthbanshee/script/templates/she_proves/base_scene.j2 | New base prompt template for she_proves scenes. |
| synthbanshee/script/templates/elephant/base_scene.j2 | New base prompt template for elephant scenes. |
| synthbanshee/script/generator.py | New LLM script generator + cache + validation utilities. |
| synthbanshee/script/\_\_init\_\_.py | Exposes script generator/types from the script package. |
| configs/scenes/test_scene_001.yaml | Fixes `script_template` path to the new template location. |

COPILOT-2: remove bogus {% extends %} from intimate_terror_coercive_control.j2; template is now standalone (Jinja2 extends without blocks silently drops all child content, making the domain-specific guidance invisible)

COPILOT-3: validate_script() now checks pause_before_s is finite and in [0.0, 1.5]; negative or huge values would crash SceneMixer with negative array dimensions

COPILOT-4: fix mixer.py module docstring: segments are triples (wav_bytes, pause_before_s, speaker_id), not pairs

Coverage: add TestLLMDispatch covering _call_anthropic, _call_openai, _call_llm routing via sys.modules injection (lines 232–256, now at 100%)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
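
The pause check described under COPILOT-3 could look roughly like the sketch below. The bounds [0.0, 1.5] and the finiteness requirement come from the commit message; the function name `pause_errors`, the dict-based turn representation, the default of 0.0 for a missing value, and the error wording are hypothetical.

```python
import math

MAX_PAUSE_S = 1.5  # upper bound quoted in the commit message


def pause_errors(turns):
    """Collect human-readable errors for invalid pause_before_s values."""
    errors = []
    for i, turn in enumerate(turns):
        pause = turn.get("pause_before_s", 0.0)  # assumed default when the key is absent
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(pause, bool) or not isinstance(pause, (int, float)):
            errors.append(f"turn {i}: pause_before_s must be a number")
        elif not math.isfinite(pause):
            errors.append(f"turn {i}: pause_before_s must be finite")
        elif not 0.0 <= pause <= MAX_PAUSE_S:
            errors.append(f"turn {i}: pause_before_s {pause} outside [0.0, {MAX_PAUSE_S}]")
    return errors
```

Rejecting these values before mixing matters because a mixer that allocates `pause_before_s * sample_rate` samples of silence would be asked for a negative or enormous array.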

github-actions · 2026-04-07T15:47:14Z

pr-agent-context report:

🚨 `pr-agent-context` failed while preparing PR context.

PR: #5
Error: CalledProcessError: Command '['git', '-C', '/home/runner/work/SynthBanshee/SynthBanshee/caller-repo', 'diff', '--unified=0', 'cf71bdfb309e17b1710bb6ce20c342c8a647f0c0...d81888b239e70ba07091891bd842119d70de3d85']' returned non-zero exit status 128.
Run: https://github.com/DataHackIL/SynthBanshee/actions/runs/24090573071

The workflow continued gracefully so this failure does not block CI.
Check the job logs for the full traceback.

Run metadata:

Tool ref: v4
Tool version: 4.0.14
Trigger: status updated
Workflow run: 24090573071 attempt 1
Comment timestamp: 2026-04-07T15:47:13.968572+00:00
PR head commit: d81888b239e70ba07091891bd842119d70de3d85

…es NOT recover Whisper — see #83) (#82)

* fix(preprocessing): #78 restore absolute clip loudness via single-gain target peak (M3c)

M3a per-turn RMS targeting + M3b limiter-only normalization left M3a-shaped Tier A clips peaking ~6 dB below the −1 dBFS ceiling. Whisper under-transcribed (WER 0.04 → 0.28, length-ratio 0.76) and UTMOS read the loosely-limited clips ~0.9 MOS higher, dominating M17 evaluation.

Fix is a *single global gain* post-mix loudness step: it preserves per-turn RMS ratios exactly (M3a's whole point), unlike the legacy per-segment peak normalizer M3b removed. Configurable target with a sane default.

Changes:
- synthbanshee/augment/preprocessing.py: new peak_normalize_to_target() helper; preprocess() runs it before the existing −1 dBFS safety limiter
- synthbanshee/config/preprocessing_config.py: target_peak_dbfs field, default −2.0 dBFS, range [−12.0, −1.0]
- synthbanshee/cli.py: Stage 3b post-augment normalize uses the same helper + same config so Tier A and Tier B/C exit at the same peak, fixing a pre-existing asymmetry that hid the regression on Tier A
- tests/unit/test_preprocessing.py: invert the M3b-era "quiet clip not scaled up" assertion, add direct unit tests for the helper, validate config range constraints
- tests/integration/test_loudness_regression.py: end-to-end guard asserting peak lands in [−3, −1] dBFS and per-turn RMS contrast survives a single global gain
- AGENTS.md, docs/spec.md §3 + §3.1, docs/audio_generation_v3_design.md §4.7 + tracker M3c row: spec text in sync with the two-stage policy

Verification (sp_neu_a_0001 same-config A/B):
2026-04-15 reference    peak −1.00 dBFS   rms −20.37 dBFS
2026-05-05 before fix   peak −5.62 dBFS   rms −26.68 dBFS
2026-05-05 after fix    peak −2.00 dBFS   rms −23.06 dBFS

Closes #78

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(preprocessing): #78 review pass — reframe to loudness-contract + metadata trail (lever probe killed Whisper hypothesis)

Self-review of the prior commit found the Whisper-fix framing empirically false. Reframed the PR around the actual deliverables (loudness contract clarity + metadata trail) and addressed all blocking review points.

Empirical finding (lever probe, 2026-05-05, openai/whisper-large-v3): across 8 single-global-gain variants of the same source audio at peaks from −6 to +7 dBFS and RMS from −27 to −14 dBFS, Whisper produced byte-identical hypotheses in 7/8 cases (WER 0.286, length-ratio 0.762). Loudness in the spec range is invisible to Whisper's log-mel extractor. The actual ASR regression is content-driven (likely M15 / #70 prosody changes) and is now tracked separately in #83.

Review fixes applied:
- #4 Metadata trail (the structural fix that prevents #78 recurring): GenerationMetadata gains `loudness_target_peak_dbfs: float | None` and bumps `normalization_strategy` to "per_turn_rms_v2_target_peak". Without this field, the original regression hid for three weeks behind unchanged generator_version.
- #5 Pydantic upper bound −1.0 → −1.5. 0.5 dB headroom over the safety limiter ceiling guarantees the two stages cannot collide under float-arithmetic noise.
- #6 Step names flattened to literal tokens (peak_normalize, peak_limit, silence_pad; no embedded numeric parameters). Configured value moves to PreprocessingResult.target_peak_dbfs as structured data so QA tooling reads policy via field access, not regex on a step string.
- #7 Renamed misleading `test_loud_clip_clamped_below_ceiling` to `test_loud_input_lands_at_target_not_at_ceiling`: the limiter is now a guaranteed no-op in normal flow; the target step is what does the work.
- #8 Replaced 440 Hz sine fixture with bandpass-filtered Gaussian noise targeting 18 dB crest, mirroring real Hebrew TTS post-M3a behaviour. Sine's 3 dB crest let regressions slip past peak-anchored asserts.
- #9 Tightened tolerances 0.5 → 0.1 dB (peak) and 0.5 → 0.2 dB (RMS contrast). Original #78 deviation was 4.6 dB so 0.5 dB only caught it by luck; 0.1 dB catches subtle regressions.
- #10 Tier B/C cli.py path coverage: two new tests in test_tier_b_pipeline.py asserting (a) post-augment peak lands at the configured target and (b) GenerationMetadata carries the new normalization_strategy + loudness_target_peak_dbfs fields.
- #11 Dropped fictional "M3c" milestone tag everywhere. Tracker row in audio_generation_v3_design.md is now indexed by #78. peak_normalize_to_target docstring expanded to mention dual use (Stage 3a + Stage 3b post-augment).
- #3 Tier B/C −1.0 → −2.0 dBFS behaviour change called out explicitly in PR body, AGENTS.md, and the §4.7 spec change. This is incidentally better for Tier B/C (gives 1 dB inter-sample-peak headroom for room IR / noise mixing), but it IS a change.

Verification:
- pytest 1720 passed / 1 skipped (added 4 tests, removed 1 obsolete)
- ruff: clean
- mypy: 36 unique errors, identical to main (zero new)
- Live re-render of sp_neu_a_0001 produces peak = -2.000 dBFS exactly, with metadata fields populated as expected.

Follow-up issue filed: #83, investigate the actual Whisper WER cause (prosody/duration, not loudness).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(preprocessing): #78 mirror "5a then 5b" policy on Tier B/C path (PR #82 COPILOT-1)

cli.py Stage 3b previously ran only the target-peak step (5a), not the safety limiter (5b), giving the documented two-stage policy a paper-vs-reality gap. The limiter is provably a no-op given Pydantic bounds (target ≤ −1.5, ceiling = −1.0, PCM_16 quantisation cannot bridge the 0.5 dB margin), but applying it uniformly closes the asymmetry between Tier A (preprocess()) and Tier B/C (cli.py) and prevents a future spec/code drift from silently violating §4.7.

PR #82 review thread COPILOT-2 (docs/spec.md §3.1 steps 1-4 stale since M14) is correct but pre-existing and unrelated to #78; filed as #84 and resolved out-of-scope here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
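
The single-global-gain idea in the #78 commit messages above can be illustrated with a short sketch. It is not the project's `peak_normalize_to_target()`: the function name with the `_sketch` suffix, the plain-list sample representation, and the helpers `dbfs` and `rms` are assumptions made for this example. The point it shows is that one scalar gain moves the clip peak to the target while leaving the RMS ratio between turns unchanged.

```python
import math


def dbfs(amplitude):
    """Convert a positive linear amplitude (1.0 = full scale) to dBFS."""
    return 20.0 * math.log10(amplitude)


def rms(samples):
    """Root-mean-square level of a non-empty sample sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def peak_normalize_to_target_sketch(samples, target_peak_dbfs=-2.0):
    """Scale the whole clip by one gain so its absolute peak lands at the target."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # pure silence: no gain can be derived
    gain = 10.0 ** (target_peak_dbfs / 20.0) / peak
    return [s * gain for s in samples]
```

Because every sample is multiplied by the same factor, the RMS of each turn scales by that factor too, so the contrast between a quiet turn and a loud turn is preserved. A per-segment normalizer would instead compute a separate gain per turn and flatten that contrast.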

shaypal5 self-assigned this Apr 7, 2026

shaypal5 requested a review from Copilot April 7, 2026 12:32

shaypal5 added the enhancement New feature or request label Apr 7, 2026

Copilot started reviewing on behalf of shaypal5 April 7, 2026 12:33

Copilot AI reviewed Apr 7, 2026

Comment thread tests/integration/test_multi_speaker.py

Comment thread synthbanshee/script/templates/she_proves/intimate_terror_coercive_control.j2 Outdated

Comment thread synthbanshee/script/generator.py

Comment thread synthbanshee/tts/mixer.py Outdated

shaypal5 merged commit 421b884 into main Apr 7, 2026
6 checks passed

shaypal5 deleted the feat/phase-1-tts-script branch April 7, 2026 15:46

shaypal5 mentioned this pull request May 5, 2026

spike(eval): M17 Phase A validation — Whisper + UTMOS on Hebrew clips #77

Merged

9 tasks

shaypal5 merged 2 commits into main from feat/phase-1-tts-script

2 participants
