feat: Phase 1 milestones 1.2 + 1.3 — multi-speaker TTS & LLM script generator#5
Merged
… generator
- synthbanshee/script/types.py: DialogueTurn and MixedScene dataclasses
- synthbanshee/script/generator.py: ScriptGenerator (Anthropic/OpenAI, Jinja2 prompt, SHA-256 disk cache, JSON response parsing, Hebrew script validation), inject_disfluency (Hebrew filled pauses), validate_script
- synthbanshee/script/templates/: base_scene.j2 for she_proves and elephant; intimate_terror_coercive_control.j2 for IT scenes
- synthbanshee/tts/mixer.py: SceneMixer.mix_sequential decodes WAV bytes, resamples to 16 kHz, downmixes to mono, concatenates with silence gaps, and returns MixedScene with per-turn onset/offset times
- synthbanshee/tts/renderer.py: TTSRenderer.render_scene renders a list[DialogueTurn] → MixedScene via per-turn render_utterance + SceneMixer
- configs/scenes/test_scene_001.yaml: fix script_template path (avdp → synthbanshee)
- 42 new tests (9 mixer unit, 21 script-generator unit, 12 multi-speaker integration)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
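A minimal sketch of the sequential-mixing idea described for SceneMixer.mix_sequential, using only the standard library. It is not the PR's implementation: it assumes 16-bit PCM input that is already at 16 kHz (so the resampling step is left out), and the return shape (a sample list plus `(speaker_id, onset_s, offset_s)` tuples) and the `make_wav` helper are hypothetical stand-ins for MixedScene and real TTS output.

```python
import io
import struct
import wave

TARGET_RATE = 16_000  # output rate stated in the commit message


def decode_mono(wav_bytes: bytes) -> list[float]:
    """Decode 16-bit PCM WAV bytes to mono floats in [-1, 1)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getframerate() != TARGET_RATE:
            raise ValueError("sketch only handles 16-bit PCM at 16 kHz")
        channels = wf.getnchannels()
        n_frames = wf.getnframes()
        raw = wf.readframes(n_frames)
    ints = struct.unpack(f"<{n_frames * channels}h", raw)
    # Downmix by averaging the channels of each frame.
    return [
        sum(ints[i : i + channels]) / (channels * 32768.0)
        for i in range(0, len(ints), channels)
    ]


def mix_sequential(segments):
    """segments: iterable of (wav_bytes, pause_before_s, speaker_id) triples."""
    samples: list[float] = []
    turns = []  # (speaker_id, onset_s, offset_s), hypothetical layout
    for wav_bytes, pause_before_s, speaker_id in segments:
        # Silence gap before the turn, then the turn itself.
        samples.extend([0.0] * round(pause_before_s * TARGET_RATE))
        onset_s = len(samples) / TARGET_RATE
        samples.extend(decode_mono(wav_bytes))
        turns.append((speaker_id, onset_s, len(samples) / TARGET_RATE))
    return samples, turns


def make_wav(n_frames: int, channels: int = 1) -> bytes:
    """Demo helper: a silent 16 kHz, 16-bit WAV with n_frames frames."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)
        wf.setframerate(TARGET_RATE)
        wf.writeframes(b"\x00\x00" * n_frames * channels)
    return buf.getvalue()


samples, turns = mix_sequential(
    [(make_wav(16_000), 0.0, "A"), (make_wav(8_000, channels=2), 0.5, "B")]
)
```

Because onsets and offsets are derived from the running sample count, the timing metadata stays aligned with the audio even when pauses round to a whole number of samples.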
Pull request overview
Adds Phase 1 building blocks for generating multi-speaker Hebrew dialogue scripts via LLMs and rendering them into a single mixed TTS waveform with per-turn timing metadata.
Changes:
- Introduces ScriptGenerator with Jinja2 prompt rendering, LLM provider support (Anthropic/OpenAI), JSON parsing, Hebrew/script validation, disfluency injection, and on-disk caching.
- Adds multi-speaker TTS pipeline: TTSRenderer.render_scene() → per-turn render → SceneMixer.mix_sequential() → MixedScene (timestamps + speaker_ids).
- Adds unit/integration tests plus new script prompt templates and fixes a scene config template path.
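For readers unfamiliar with the on-disk caching mentioned above, here is one way a SHA-256 keyed cache can be laid out. This is a sketch under assumptions, not ScriptGenerator's code: which inputs feed the key, the one-JSON-file-per-key layout, and the names `cache_key`, `cached_generate`, and `fake_llm` are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(prompt: str, provider: str, model: str) -> str:
    """Hash every input that can change the LLM output (assumed set of inputs)."""
    payload = json.dumps(
        {"prompt": prompt, "provider": provider, "model": model},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_generate(cache_dir: Path, prompt: str, provider: str, model: str, call_llm):
    """Return the cached turns if present, otherwise call the LLM and store them."""
    path = cache_dir / f"{cache_key(prompt, provider, model)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    turns = call_llm(prompt)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps Hebrew text readable in the cache file.
    path.write_text(json.dumps(turns, ensure_ascii=False), encoding="utf-8")
    return turns


calls = []


def fake_llm(prompt: str):
    """Stand-in for a provider call; records how often it is invoked."""
    calls.append(prompt)
    return [{"speaker_id": "A", "text": "..."}]


with tempfile.TemporaryDirectory() as tmp:
    first = cached_generate(Path(tmp), "scene prompt", "provider-x", "model-y", fake_llm)
    second = cached_generate(Path(tmp), "scene prompt", "provider-x", "model-y", fake_llm)
```

Keying on the rendered prompt rather than on the template name means any template edit invalidates the cache automatically.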
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/unit/test_script_generator.py | Unit tests for ScriptGenerator cache/parse/validation and disfluency injection. |
| tests/unit/test_mixer.py | Unit tests for sequential mixing, resampling, downmixing, and timing metadata. |
| tests/integration/test_multi_speaker.py | Integration coverage for render_scene() with mocked Azure TTS output. |
| synthbanshee/tts/renderer.py | Adds render_scene() orchestration for multi-turn rendering + mixing. |
| synthbanshee/tts/mixer.py | New SceneMixer that decodes WAV bytes, resamples/downmixes, concatenates, and returns MixedScene. |
| synthbanshee/tts/__init__.py | Re-exports SceneMixer from the TTS package. |
| synthbanshee/script/types.py | Adds shared DialogueTurn and MixedScene dataclasses for script/TTS boundary. |
| synthbanshee/script/templates/she_proves/intimate_terror_coercive_control.j2 | New specialized prompt template for IT coercive-control scenes. |
| synthbanshee/script/templates/she_proves/base_scene.j2 | New base prompt template for she_proves scenes. |
| synthbanshee/script/templates/elephant/base_scene.j2 | New base prompt template for elephant scenes. |
| synthbanshee/script/generator.py | New LLM script generator + cache + validation utilities. |
| synthbanshee/script/__init__.py | Exposes script generator/types from the script package. |
| configs/scenes/test_scene_001.yaml | Fixes script_template path to the new template location. |
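The parse and validation utilities listed for generator.py can be pictured roughly as follows. This is a hedged sketch only: the JSON field name `text`, the helper names, and the "at least one Hebrew letter per turn" rule are assumptions for illustration; the PR's actual Hebrew constraints are not shown on this page.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the sketch readable
_FENCE_RE = re.compile(
    r"^" + FENCE + r"[\w-]*[ \t]*\n(.*?)\n?" + FENCE + r"$", re.DOTALL
)
_HEBREW_RE = re.compile(r"[\u0590-\u05FF]")  # Unicode Hebrew block


def strip_markdown_fence(text: str) -> str:
    """Remove a single surrounding markdown fence if the reply is wrapped in one."""
    text = text.strip()
    match = _FENCE_RE.match(text)
    return match.group(1).strip() if match else text


def parse_turns(raw_response: str) -> list[dict]:
    """Parse an LLM reply into a list of turn dicts and check each has Hebrew text."""
    data = json.loads(strip_markdown_fence(raw_response))
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of dialogue turns")
    for index, turn in enumerate(data):
        text = turn.get("text", "") if isinstance(turn, dict) else ""
        if not _HEBREW_RE.search(text):
            raise ValueError(f"turn {index} contains no Hebrew characters")
    return data
```

Calling `parse_turns` on a reply wrapped in a fenced `json` block and on the same reply without the fence should give the same list, which is the behaviour the "markdown fence stripping" tests would be expected to pin down.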
COPILOT-2: remove bogus {% extends %} from intimate_terror_coercive_control.j2;
template is now standalone (Jinja2 extends without blocks silently drops all
child content, making the domain-specific guidance invisible)
COPILOT-3: validate_script() now checks pause_before_s is finite and in [0.0, 1.5];
negative or huge values would crash SceneMixer with negative array dimensions
COPILOT-4: fix mixer.py module docstring — segments are triples
(wav_bytes, pause_before_s, speaker_id), not pairs
Coverage: add TestLLMDispatch covering _call_anthropic, _call_openai, _call_llm
routing via sys.modules injection (lines 232–256, now at 100%)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
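The COPILOT-3 fix above can be illustrated with a small range check. This is a sketch, not the body of validate_script(): the function name `pause_errors`, the dict-shaped turns, and the error strings are hypothetical; only the finite check and the [0.0, 1.5] bounds come from the commit message.

```python
import math

MIN_PAUSE_S = 0.0
MAX_PAUSE_S = 1.5  # bounds quoted in the commit message


def pause_errors(turns: list[dict]) -> list[str]:
    """Collect one error string per turn whose pause_before_s is unusable."""
    errors = []
    for index, turn in enumerate(turns):
        pause = turn.get("pause_before_s", 0.0)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(pause, bool) or not isinstance(pause, (int, float)):
            errors.append(f"turn {index}: pause_before_s must be a number")
        elif not math.isfinite(pause):
            errors.append(f"turn {index}: pause_before_s must be finite")
        elif not MIN_PAUSE_S <= pause <= MAX_PAUSE_S:
            errors.append(
                f"turn {index}: pause_before_s={pause} outside "
                f"[{MIN_PAUSE_S}, {MAX_PAUSE_S}]"
            )
    return errors
```

The finite check comes first because NaN fails every comparison: a plain `pause < 0 or pause > 1.5` test would let it through to the mixer, where the silence length is computed from it.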
pr-agent-context report: 🚨 `pr-agent-context` failed while preparing PR context.
PR: #5
Error: CalledProcessError: Command '['git', '-C', '/home/runner/work/SynthBanshee/SynthBanshee/caller-repo', 'diff', '--unified=0', 'cf71bdfb309e17b1710bb6ce20c342c8a647f0c0...d81888b239e70ba07091891bd842119d70de3d85']' returned non-zero exit status 128.
Run: https://github.com/DataHackIL/SynthBanshee/actions/runs/24090573071
The workflow continued gracefully so this failure does not block CI.
Check the job logs for the full traceback.
shaypal5 added a commit that referenced this pull request May 5, 2026
…metadata trail (lever probe killed Whisper hypothesis)
Self-review of the prior commit found the Whisper-fix framing empirically false. Reframed the PR around the actual deliverables (loudness contract clarity + metadata trail) and addressed all blocking review points.
Empirical finding (lever probe, 2026-05-05, openai/whisper-large-v3): across 8 single-global-gain variants of the same source audio at peaks from −6 to +7 dBFS and RMS from −27 to −14 dBFS, Whisper produced byte-identical hypotheses in 7/8 cases (WER 0.286, length-ratio 0.762). Loudness in the spec range is invisible to Whisper's log-mel extractor. The actual ASR regression is content-driven (likely M15 / #70 prosody changes) and is now tracked separately in #83.
Review fixes applied:
- #4 Metadata trail (the structural fix that prevents #78 recurring): GenerationMetadata gains `loudness_target_peak_dbfs: float | None` and bumps `normalization_strategy` to "per_turn_rms_v2_target_peak". Without this field, the original regression hid for three weeks behind unchanged generator_version.
- #5 Pydantic upper bound −1.0 → −1.5. 0.5 dB headroom over the safety limiter ceiling guarantees the two stages cannot collide under float-arithmetic noise.
- #6 Step names flattened to literal tokens (peak_normalize, peak_limit, silence_pad; no embedded numeric parameters). Configured value moves to PreprocessingResult.target_peak_dbfs as structured data so QA tooling reads policy via field access, not regex on a step string.
- #7 Renamed misleading `test_loud_clip_clamped_below_ceiling` to `test_loud_input_lands_at_target_not_at_ceiling`: the limiter is now a guaranteed no-op in normal flow; the target step is what does the work.
- #8 Replaced 440 Hz sine fixture with bandpass-filtered Gaussian noise targeting 18 dB crest, mirroring real Hebrew TTS post-M3a behaviour. Sine's 3 dB crest let regressions slip past peak-anchored asserts.
- #9 Tightened tolerances 0.5 → 0.1 dB (peak) and 0.5 → 0.2 dB (RMS contrast). Original #78 deviation was 4.6 dB so 0.5 dB only caught it by luck; 0.1 dB catches subtle regressions.
- #10 Tier B/C cli.py path coverage: two new tests in test_tier_b_pipeline.py asserting (a) post-augment peak lands at the configured target and (b) GenerationMetadata carries the new normalization_strategy + loudness_target_peak_dbfs fields.
- #11 Dropped fictional "M3c" milestone tag everywhere. Tracker row in audio_generation_v3_design.md is now indexed by #78. peak_normalize_to_target docstring expanded to mention dual use (Stage 3a + Stage 3b post-augment).
- #3 Tier B/C −1.0 → −2.0 dBFS behaviour change called out explicitly in PR body, AGENTS.md, and the §4.7 spec change. This is incidentally better for Tier B/C (gives 1 dB inter-sample-peak headroom for room IR / noise mixing), but it IS a change.
Verification:
- pytest 1720 passed / 1 skipped (added 4 tests, removed 1 obsolete)
- ruff: clean
- mypy: 36 unique errors, identical to main (zero new)
- Live re-render of sp_neu_a_0001 produces peak = -2.000 dBFS exactly, with metadata fields populated as expected.
Follow-up issue filed: #83, investigate the actual Whisper WER cause (prosody/duration, not loudness).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
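To make the crest-factor point in fix #8 concrete, the sketch below computes crest factor (peak level minus RMS level, in dB) for a 440 Hz sine and for a sparse impulse train. The impulse train is only a deterministic stand-in chosen to land near the 18 dB figure; it is not the bandpass-filtered Gaussian noise fixture the commit describes, and `crest_factor_db` is a hypothetical helper name.

```python
import math

SAMPLE_RATE = 16_000


def crest_factor_db(samples: list[float]) -> float:
    """Peak-to-RMS ratio in dB."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(peak / rms)


# One second of a full-scale 440 Hz sine: crest factor is 20*log10(sqrt(2)).
sine = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

# One full-scale sample every 64 samples: RMS is 1/8 of the peak,
# so the crest factor is 20*log10(8).
impulses = [1.0 if n % 64 == 0 else 0.0 for n in range(SAMPLE_RATE)]

sine_crest = crest_factor_db(sine)          # about 3.01 dB
impulse_crest = crest_factor_db(impulses)   # about 18.06 dB
```

Both signals have the same peak, yet their RMS levels differ by roughly 15 dB. A fixture with a 3 dB crest therefore says little about how a peak-anchored assertion behaves on high-crest speech, which is one plausible reading of why the sine fixture was replaced.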
shaypal5 added a commit that referenced this pull request May 5, 2026
…es NOT recover Whisper; see #83) (#82)
* fix(preprocessing): #78 restore absolute clip loudness via single-gain target peak (M3c)
M3a per-turn RMS targeting + M3b limiter-only normalization left M3a-shaped Tier A clips peaking ~6 dB below the −1 dBFS ceiling. Whisper under-transcribed (WER 0.04 → 0.28, length-ratio 0.76) and UTMOS read the loosely-limited clips ~0.9 MOS higher, dominating M17 evaluation.
Fix is a *single global gain* post-mix loudness step that preserves per-turn RMS ratios exactly (M3a's whole point), unlike the legacy per-segment peak normalizer M3b removed. Configurable target with a sane default.
Changes:
- synthbanshee/augment/preprocessing.py: new peak_normalize_to_target() helper; preprocess() runs it before the existing −1 dBFS safety limiter
- synthbanshee/config/preprocessing_config.py: target_peak_dbfs field, default −2.0 dBFS, range [−12.0, −1.0]
- synthbanshee/cli.py: Stage 3b post-augment normalize uses the same helper + same config so Tier A and Tier B/C exit at the same peak, fixing a pre-existing asymmetry that hid the regression on Tier A
- tests/unit/test_preprocessing.py: invert the M3b-era "quiet clip not scaled up" assertion, add direct unit tests for the helper, validate config range constraints
- tests/integration/test_loudness_regression.py: end-to-end guard asserting peak lands in [−3, −1] dBFS and per-turn RMS contrast survives a single global gain
- AGENTS.md, docs/spec.md §3 + §3.1, docs/audio_generation_v3_design.md §4.7 + tracker M3c row: spec text in sync with the two-stage policy
Verification (sp_neu_a_0001 same-config A/B):
| Render | Peak | RMS |
|---|---|---|
| 2026-04-15 reference | −1.00 dBFS | −20.37 dBFS |
| 2026-05-05 before fix | −5.62 dBFS | −26.68 dBFS |
| 2026-05-05 after fix | −2.00 dBFS | −23.06 dBFS |
Closes #78
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(preprocessing): #78 review pass: reframe to loudness-contract + metadata trail (lever probe killed Whisper hypothesis)
Self-review of the prior commit found the Whisper-fix framing empirically false. Reframed the PR around the actual deliverables (loudness contract clarity + metadata trail) and addressed all blocking review points.
Empirical finding (lever probe, 2026-05-05, openai/whisper-large-v3): across 8 single-global-gain variants of the same source audio at peaks from −6 to +7 dBFS and RMS from −27 to −14 dBFS, Whisper produced byte-identical hypotheses in 7/8 cases (WER 0.286, length-ratio 0.762). Loudness in the spec range is invisible to Whisper's log-mel extractor. The actual ASR regression is content-driven (likely M15 / #70 prosody changes) and is now tracked separately in #83.
Review fixes applied:
- #4 Metadata trail (the structural fix that prevents #78 recurring): GenerationMetadata gains `loudness_target_peak_dbfs: float | None` and bumps `normalization_strategy` to "per_turn_rms_v2_target_peak". Without this field, the original regression hid for three weeks behind unchanged generator_version.
- #5 Pydantic upper bound −1.0 → −1.5. 0.5 dB headroom over the safety limiter ceiling guarantees the two stages cannot collide under float-arithmetic noise.
- #6 Step names flattened to literal tokens (peak_normalize, peak_limit, silence_pad; no embedded numeric parameters). Configured value moves to PreprocessingResult.target_peak_dbfs as structured data so QA tooling reads policy via field access, not regex on a step string.
- #7 Renamed misleading `test_loud_clip_clamped_below_ceiling` to `test_loud_input_lands_at_target_not_at_ceiling`: the limiter is now a guaranteed no-op in normal flow; the target step is what does the work.
- #8 Replaced 440 Hz sine fixture with bandpass-filtered Gaussian noise targeting 18 dB crest, mirroring real Hebrew TTS post-M3a behaviour. Sine's 3 dB crest let regressions slip past peak-anchored asserts.
- #9 Tightened tolerances 0.5 → 0.1 dB (peak) and 0.5 → 0.2 dB (RMS contrast). Original #78 deviation was 4.6 dB so 0.5 dB only caught it by luck; 0.1 dB catches subtle regressions.
- #10 Tier B/C cli.py path coverage: two new tests in test_tier_b_pipeline.py asserting (a) post-augment peak lands at the configured target and (b) GenerationMetadata carries the new normalization_strategy + loudness_target_peak_dbfs fields.
- #11 Dropped fictional "M3c" milestone tag everywhere. Tracker row in audio_generation_v3_design.md is now indexed by #78. peak_normalize_to_target docstring expanded to mention dual use (Stage 3a + Stage 3b post-augment).
- #3 Tier B/C −1.0 → −2.0 dBFS behaviour change called out explicitly in PR body, AGENTS.md, and the §4.7 spec change. This is incidentally better for Tier B/C (gives 1 dB inter-sample-peak headroom for room IR / noise mixing), but it IS a change.
Verification:
- pytest 1720 passed / 1 skipped (added 4 tests, removed 1 obsolete)
- ruff: clean
- mypy: 36 unique errors, identical to main (zero new)
- Live re-render of sp_neu_a_0001 produces peak = -2.000 dBFS exactly, with metadata fields populated as expected.
Follow-up issue filed: #83, investigate the actual Whisper WER cause (prosody/duration, not loudness).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* fix(preprocessing): #78 mirror "5a then 5b" policy on Tier B/C path (PR #82 COPILOT-1)
cli.py Stage 3b previously ran only the target-peak step (5a), not the safety limiter (5b), giving the documented two-stage policy a paper-vs-reality gap. The limiter is provably a no-op given Pydantic bounds (target ≤ −1.5, ceiling = −1.0, PCM_16 quantisation cannot bridge the 0.5 dB margin), but applying it uniformly closes the asymmetry between Tier A (preprocess()) and Tier B/C (cli.py) and prevents a future spec/code drift from silently violating §4.7.
PR #82 review thread COPILOT-2 (docs/spec.md §3.1 steps 1-4 stale since M14) is correct but pre-existing and unrelated to #78; filed as #84 and resolved out-of-scope here.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
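The "5a then 5b" policy described in these commits can be sketched in a few lines. The name peak_normalize_to_target comes from the commit message, but its signature here is assumed, and `peak_limit` is written as a simple scale-down-only limiter, which may differ from the project's safety limiter. Samples are plain Python floats rather than arrays to keep the sketch dependency-free.

```python
import math


def db_to_linear(db: float) -> float:
    return 10.0 ** (db / 20.0)


def peak_dbfs(samples: list[float]) -> float:
    return 20.0 * math.log10(max(abs(s) for s in samples))


def rms_dbfs(samples: list[float]) -> float:
    return 10.0 * math.log10(sum(s * s for s in samples) / len(samples))


def peak_normalize_to_target(samples, target_peak_dbfs=-2.0):
    """Step 5a: one global gain that moves the clip peak onto the target."""
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = db_to_linear(target_peak_dbfs) / peak
    return [s * gain for s in samples]


def peak_limit(samples, ceiling_dbfs=-1.0):
    """Step 5b (assumed form): scale down only if the peak exceeds the ceiling."""
    ceiling = db_to_linear(ceiling_dbfs)
    peak = max(abs(s) for s in samples)
    if peak <= ceiling:
        return list(samples)
    return [s * (ceiling / peak) for s in samples]


def tone(amplitude: float, n: int = 1600) -> list[float]:
    """Demo signal: 0.1 s of a 440 Hz sine at 16 kHz."""
    return [amplitude * math.sin(2 * math.pi * 440 * i / 16000) for i in range(n)]


loud_turn, quiet_turn = tone(0.5), tone(0.1)
clip = loud_turn + quiet_turn  # two turns mixed sequentially
normalized = peak_normalize_to_target(clip, -2.0)
out = peak_limit(normalized, -1.0)

contrast_before = rms_dbfs(loud_turn) - rms_dbfs(quiet_turn)
contrast_after = rms_dbfs(out[:1600]) - rms_dbfs(out[1600:])
```

Because a single gain multiplies every sample, the RMS difference between the two turns is the same before and after, while the clip peak lands on the target. With the target held at or below −1.5 dBFS and the ceiling at −1.0 dBFS, the limiter in this sketch never changes the signal, which matches the "no-op in normal flow" claim in the commit.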
Summary
- synthbanshee/script/types.py: DialogueTurn and MixedScene dataclasses shared by the script and TTS modules
- synthbanshee/script/generator.py: ScriptGenerator renders a Jinja2 prompt template, calls an LLM (Anthropic or OpenAI), parses the JSON dialogue response, validates Hebrew constraints, and caches results to disk (SHA-256 keyed). Also provides inject_disfluency() (Hebrew filled pauses: אממ / אה / אנ) and validate_script().
- synthbanshee/script/templates/: base_scene.j2 for both projects + intimate_terror_coercive_control.j2 specialisation for she_proves IT scenes
- synthbanshee/tts/mixer.py: SceneMixer.mix_sequential() decodes WAV bytes, resamples to 16 kHz, downmixes to mono, concatenates with silence gaps, and returns MixedScene with per-turn onset/offset timestamps for label generation
- synthbanshee/tts/renderer.py: TTSRenderer.render_scene() wires list[DialogueTurn] → per-turn render_utterance() → SceneMixer → MixedScene
- configs/scenes/test_scene_001.yaml: fix script_template path (avdp/ → synthbanshee/)
Test plan
- TestSceneMixer unit tests: empty segments, pause offsets, resample, stereo downmix, float32 dtype
- TestScriptGenerator + TestInjectDisfluency + TestValidateScript unit tests: cache hit/miss, LLM mock, markdown fence stripping, Hebrew validation, disfluency injection
- TestRenderScene integration tests: full render_scene() pipeline with mocked Azure TTS
🤖 Generated with Claude Code
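For the LLM mock in the test plan, one option that avoids installing any provider SDK is sys.modules injection, the approach a follow-up commit on this PR names for TestLLMDispatch. The sketch below shows the general pattern only: `fake_provider_sdk`, its `complete` function, and `call_llm` are hypothetical and do not mirror the Anthropic or OpenAI client APIs or the PR's _call_llm.

```python
import sys
import types
from unittest import mock


def call_llm(provider: str, prompt: str) -> str:
    """Hypothetical dispatcher that imports the provider SDK lazily."""
    if provider == "fake_provider":
        # The import is resolved through sys.modules at call time,
        # so a test can substitute a stub module before calling.
        import fake_provider_sdk

        return fake_provider_sdk.complete(prompt)
    raise ValueError(f"unknown provider: {provider}")


def run_with_fake_sdk(prompt: str) -> str:
    """Inject a stub module, call the dispatcher, then restore sys.modules."""
    fake = types.ModuleType("fake_provider_sdk")
    fake.complete = lambda p: "echo:" + p
    with mock.patch.dict(sys.modules, {"fake_provider_sdk": fake}):
        return call_llm("fake_provider", prompt)


reply = run_with_fake_sdk("hello")
```

`mock.patch.dict` removes the stub on exit, so the injected module does not leak into other tests. The pattern relies on the dispatcher importing inside the function body; a module-level import would already be bound before the test runs.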