fix(mixer): #66 BARGE_IN audible-overlap crossfade past TTS trailing silence #75
Merged
…silence

Root cause: Azure/Google TTS pads the end of an utterance with 100–300 ms of near-silence, and the mixer anchored BARGE_IN onset against the file end. A typical 200–500 ms barge-in depth therefore landed mostly inside the silence, so listeners heard the previous speaker stop, then a gap, then the new speaker start, which sounded like polite turn-taking instead of an interruption. Compounding the perception: the previous turn was hard-cut at the new turn's onset, leaving zero audible overlap of the two voices.

Fix:

- Add `_speech_end_sample()` (windowed RMS scan, scipy/numpy only) and anchor BARGE_IN onset against the previous turn's *speech* end, skipping trailing TTS silence padding.
- Truncate the previous turn at its speech end (not at the new onset) and apply a fade-out across the overlap region, producing a real audible crossfade between the two voices.
- OVERLAP path is unchanged (still anchored against file end).

Also: update the docstring to reflect that for an interrupted turn, `audible_ends_s` now sits *after* the new turn's `rendered_onsets_s` by the overlap-region length (the truncation point is the previous speech end, not the new onset).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
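To make the timing argument concrete, here is a minimal sketch comparing the two anchoring choices. The function name, the millisecond units, and the example durations are illustrative assumptions and are not taken from the mixer code.

```python
def barge_in_timing(file_len_ms: int, trailing_silence_ms: int,
                    depth_ms: int, anchor: str) -> tuple[int, int, int]:
    """Return (onset_ms, speech_after_onset_ms, gap_ms) for a BARGE_IN onset.

    Hypothetical helper for illustration only; not part of SceneMixer.
    """
    if anchor not in ("file_end", "speech_end"):
        raise ValueError("anchor must be 'file_end' or 'speech_end'")
    # Speech stops where the trailing TTS padding begins.
    speech_end_ms = file_len_ms - trailing_silence_ms
    anchor_ms = file_len_ms if anchor == "file_end" else speech_end_ms
    # The new turn starts `depth_ms` before the chosen anchor.
    onset_ms = anchor_ms - depth_ms
    # Previous-speaker speech that is still available after the new onset.
    speech_after_onset_ms = max(0, speech_end_ms - onset_ms)
    # Silence between the end of speech and the new onset, if any.
    gap_ms = max(0, onset_ms - speech_end_ms)
    return onset_ms, speech_after_onset_ms, gap_ms


# Assumed example: 2 s clip, 300 ms of padding, 200 ms barge-in depth.
old = barge_in_timing(2000, 300, 200, "file_end")    # (1800, 0, 100)
new = barge_in_timing(2000, 300, 200, "speech_end")  # (1500, 200, 0)
```

With the file-end anchor the new turn starts 100 ms after speech has already stopped, which is the audible gap described above. With the speech-end anchor the full 200 ms depth falls on real speech.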
Pull request overview
Fixes #66 by changing how BARGE_IN mixing is timed and truncated so the “interrupting” speaker audibly overlaps the prior speaker’s speech (not the prior file’s trailing near-silence), and by introducing a crossfade instead of a hard cut at the interruption point.
Changes:
- Add `_speech_end_sample()` (windowed RMS tail scan) and use it to anchor `BARGE_IN` onset against the previous turn’s speech end.
- Update `BARGE_IN` behavior to keep the previous turn through its speech end and apply a linear fade-out across the overlap region (audible crossfade).
- Add/adjust unit + integration tests to cover trailing-silence padding scenarios and confirm `OVERLAP` remains anchored to file end.
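As a rough illustration of the second bullet, a linear fade-out over a tail region could look like the sketch below. The function name and signature are hypothetical, and plain Python lists stand in for the numpy arrays the mixer presumably works on.

```python
def linear_fade_out(samples: list[float], fade_start: int, fade_len: int) -> list[float]:
    """Return a copy that fades linearly to silence over the given region.

    Hypothetical helper: gain goes from 1.0 at `fade_start` down towards 0.0
    across `fade_len` samples, and everything after the region is zeroed.
    """
    if fade_len <= 0 or fade_start < 0 or fade_start + fade_len > len(samples):
        raise ValueError("fade region must be non-empty and lie inside the clip")
    out = list(samples)
    for k in range(fade_len):
        out[fade_start + k] *= 1.0 - k / fade_len
    # Past the fade region the previous turn is fully truncated.
    for i in range(fade_start + fade_len, len(out)):
        out[i] = 0.0
    return out


faded = linear_fade_out([1.0] * 10, fade_start=4, fade_len=4)
```

In this reading, `fade_start` would be the new turn's onset and `fade_start + fade_len` the previous turn's speech end, so the fade spans exactly the overlap region.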
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| synthbanshee/tts/mixer.py | Implements speech-end detection and new BARGE_IN onset/truncation + crossfade semantics; updates module docs. |
| tests/unit/test_mixer.py | Adds targeted unit tests for _speech_end_sample() and acceptance/regression tests for BARGE_IN vs OVERLAP. |
| tests/integration/test_multi_speaker.py | Adds trailing-silence support to integration WAV helper to make truncation observable in end-to-end label metrics. |
… fade redesign, label semantic

Addresses self-review of #75:

1. **Signal-relative speech-end threshold.** Replace the hardcoded −40 dBFS absolute threshold with a per-segment relative threshold (40 dB below the segment's own peak window RMS). Auto-scales for whisper, normal, and Lombard-tilted (#65) turns; tested against a Hebrew-style fricative tail (vowel @ 0.4 + fricative @ −25 dB + silence) to confirm we don't cut quiet word-final phonemes.
2. **10 ms windows.** Drop window from 30 ms → 10 ms (matches `_EDGE_FADE_SAMPLES`). Cuts measurement uncertainty in interrupted-turn label offsets from ±30 ms to ≤ 10 ms.
3. **OVERLAP gets the same fix.** OVERLAP had the identical trailing-silence bug: both speakers were meant to be heard, but with TTS padding the new turn could land in silence. Speech-end anchoring now applies to both modes. The "OVERLAP-anchors-on-file-end" regression-guard test (which codified the broken behaviour) is replaced with one that asserts the correct behaviour.
4. **BARGE_IN fade redesign.** Drop the linear span-the-overlap fade-out (which sounded like a console fader pulling down the prior speaker over 500 ms) in favour of a short 60 ms fade at the truncation boundary. The prior speaker plays at full volume through the entire overlap region, matching real interruption physics.
5. **In-place tail mutation.** Replace whole-array `prev_mono.copy()` with in-place mutation of the tail region (the array is already a fresh copy from `_apply_edge_fades`).
6. **`LabelGenerator.truncated` driven by mix mode, not duration heuristic.** Old heuristic ("audible_duration < script_duration") silently went False for tight-TTS clips that lacked trailing silence, even when the script said BARGE_IN. New: `truncated=True` when the next turn's mix mode is `BARGE_IN` (read from `scene.mix_modes`), with the duration heuristic retained as a defensive fallback. Restores the integration test fixture to its original (no-trailing-silence) form; the test was previously masked by adding silence until the assertion passed.
7. **Property-based acceptance test.** Replace the brittle `assert overlap_rms > 0.32` (a magic number tuned to the specific fade shape) with a property: overlap RMS exceeds the louder solo voice's RMS by >= 1.30x. Holds regardless of fade shape, voice frequencies, or window length.
8. **Helper signature simplified.** `_speech_end_sample` drops the unused `sample_rate`, `window_ms`, and `threshold_dbfs` parameters.
9. **Docs.** Update `docs/audio_generation_v3_design.md` §4.6 and the `MixedScene` dataclass docstring with the new audible-end semantic (consecutive turns can overlap for BARGE_IN; consumers must not assume non-overlap). Audited the codebase for `audible_ends_s` consumers: only `LabelGenerator` reads it, and it now handles the new semantic correctly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
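The signal-relative detection from items 1 and 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `mixer.py`: the 16 kHz sample rate, the constant names, the behaviour for an all-zero clip, and the use of plain lists instead of numpy are all choices made for this example.

```python
import math

WINDOW_SAMPLES = 160     # 10 ms at an assumed 16 kHz sample rate
REL_THRESHOLD_DB = 40.0  # threshold sits 40 dB below the peak window RMS


def speech_end_sample(mono: list[float]) -> int:
    """Index one past the last window within 40 dB of the loudest window.

    Hypothetical stand-in for `_speech_end_sample`.
    """
    if not mono:
        return 0  # empty clip: nothing to scan
    window_rms = []
    for start in range(0, len(mono), WINDOW_SAMPLES):
        window = mono[start:start + WINDOW_SAMPLES]
        window_rms.append(math.sqrt(sum(x * x for x in window) / len(window)))
    peak = max(window_rms)
    if peak == 0.0:
        return 0  # assumption: an all-zero clip contains no speech
    threshold = peak * 10.0 ** (-REL_THRESHOLD_DB / 20.0)
    # Scan from the tail for the last window at or above the threshold.
    for idx in range(len(window_rms) - 1, -1, -1):
        if window_rms[idx] >= threshold:
            return min((idx + 1) * WINDOW_SAMPLES, len(mono))
    return 0  # not reached: the peak window always meets the threshold


def tone(freq_hz: float, amp: float, n: int, rate: int = 16_000) -> list[float]:
    """Synthetic sine used as a stand-in for voiced or fricative audio."""
    return [amp * math.sin(2 * math.pi * freq_hz * i / rate) for i in range(n)]


# Vowel at 0.4, a fricative 25 dB quieter, then 200 ms of padding.
clip = tone(200, 0.4, 8000) + tone(3000, 0.4 * 10 ** (-25 / 20), 1600) + [0.0] * 3200
end = speech_end_sample(clip)
```

Because the threshold is relative to the clip's own peak, scaling `clip` down to whisper level leaves the detected end unchanged, and the quiet fricative stays on the speech side of the boundary.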
pr-agent-context report: This is a refreshed snapshot of the current PR state.
This run includes an unresolved review comment and patch coverage gaps on PR #75 in repository https://github.com/DataHackIL/SynthBanshee
For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.
After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, address the patch coverage gaps below, and push all of these changes in a single
commit.
# Copilot Comments
## COPILOT-1
Location: synthbanshee/tts/mixer.py:248
URL: https://github.com/DataHackIL/SynthBanshee/pull/75#discussion_r3180298263
Root author: copilot-pull-request-reviewer
Comment:
Docstring for `_speech_end_sample` says empty input returns `len(mono)`, but implementation returns `0` (and unit tests assert 0). Please update the docstring to match the actual behavior (or change the implementation if the docstring is intended).
# Patch coverage
Patch test coverage is 30.77%; please raise it to 100%. These are the uncovered code lines:
- synthbanshee/labels/generator.py: 174, 175, 176, 179, 180
- synthbanshee/script/types.py: 100, 101
- synthbanshee/tts/mixer.py: 33, 35, 36, 78, 79, 80, 81, 213, 214, 215, 216, 217, 218, 219, 223, 224, 236, 248, 252, 253
Closes #66.
Problem
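In a BARGE_IN turn pair, listeners heard the previous speaker stop, then a noticeable pause, then the new speaker start, which sounded like polite turn-taking instead of an interruption. Reported in the 2026-05-03 listening test against `sp_it_a_0001_00`.

Two compounding causes:

1. Azure/Google TTS pads the end of an utterance with 100–300 ms of near-silence, and the mixer anchored the BARGE_IN onset against the file end, so a typical 200–500 ms barge-in depth landed mostly inside the silence.
2. The previous turn was hard-cut at the new turn's onset, leaving zero audible overlap of the two voices.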
Solution
In `SceneMixer.mix_sequential`:

- `_speech_end_sample()`: windowed RMS scan (10 ms windows; threshold is 40 dB below the segment's own peak window RMS, so it auto-scales for whisper, normal, and Lombard-tilted turns), scipy + numpy only.
- In-place mutation of the tail region (the array is already a fresh copy from `_apply_edge_fades`); no whole-array copy in the hot path.

In `LabelGenerator.generate_events_from_scene`:

- `truncated=True` when the next turn's `mix_mode == BARGE_IN` (read from `scene.mix_modes`). This is robust to whether the WAV carried trailing silence: a turn marked BARGE_IN by the script is always interrupted in the user-meaningful sense. The previous heuristic ("audible_duration < script_duration") silently produced `truncated=False` for tight-TTS clips even when the script said BARGE_IN. The duration heuristic is retained as a defensive fallback.

The previous turn's `audible_ends_s` for an interrupted turn now sits after the next turn's `rendered_onsets_s` by the overlap-region length. The module docstring, `MixedScene` dataclass docstring, and `docs/audio_generation_v3_design.md` §4.6 all flag the change explicitly: consumers must not assume `rendered_offsets_s[i] <= rendered_onsets_s[i+1]`.

Real-audio verification
Re-rendered `configs/scenes/she_proves/sp_it_a_0001.yaml` twice against the same script and TTS cache: once with the mixer at `main` (BEFORE), once with the fix (AFTER). Same 17-turn script in both renders; the only difference is the mixer logic. Both runs produced 2 BARGE_IN turn pairs (turns 10→11 and 12→13).

Measured RMS (in dBFS) in 100 ms windows around each new-turn onset, plus the minimum 30 ms-window RMS in ±50 ms straddling the cut (= "is there silence at the boundary?").
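The boundary measurement described above could be reproduced along these lines. This is a sketch, not the script used for the PR: the function names, the 16 kHz rate, the 5 ms hop between windows, and the `1e-10` log floor are assumptions made for illustration.

```python
import math

FLOOR = 1e-10  # assumed epsilon that keeps log10 finite for all-zero windows


def rms_dbfs(samples: list[float]) -> float:
    """RMS level of a window in dBFS, clamped at the assumed floor."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return 20.0 * math.log10(max(rms, FLOOR))


def min_boundary_rms_dbfs(mix: list[float], cut: int, rate: int = 16_000,
                          window_ms: int = 30, span_ms: int = 50,
                          hop_ms: int = 5) -> float:
    """Minimum 30 ms-window RMS within +/- 50 ms of the cut sample."""
    window = rate * window_ms // 1000
    span = rate * span_ms // 1000
    hop = max(1, rate * hop_ms // 1000)
    lo, hi = max(0, cut - span), min(len(mix), cut + span)
    if hi - lo < window:
        raise ValueError("boundary region is shorter than one window")
    # Slide the window across the region and keep the quietest reading.
    return min(rms_dbfs(mix[s:s + window]) for s in range(lo, hi - window + 1, hop))


# Synthetic check: a continuous tone versus the same tone with a silent gap.
tone = [0.5 * math.sin(2 * math.pi * 200 * i / 16_000) for i in range(16_000)]
gapped = tone[:7600] + [0.0] * 800 + tone[8400:]
continuous_db = min_boundary_rms_dbfs(tone, cut=8000)
gapped_db = min_boundary_rms_dbfs(gapped, cut=8000)
```

Taking the minimum over short windows is what makes a silent gap visible: a single all-zero window inside the ±50 ms region drives the reading to the floor, even when the surrounding 100 ms average still looks like speech.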
The `−200 dBFS` floor (= zero samples in float32 → log saturation) at the cut in BEFORE is the bug: a real silent gap straddling the BARGE_IN boundary. AFTER, the boundary region is continuously at speech-level energy, with both speakers active across the cut.

A/B listening snippets are at `/tmp/sb_66_listen/`:

- `barge_10_AB_stitched.wav` (BEFORE, 0.5 s gap, AFTER; AGG→VIC at I3)
- `barge_12_AB_stitched.wav` (BEFORE, 0.5 s gap, AFTER; AGG→VIC at I4)

Audit of `audible_ends_s` consumers

- `synthbanshee/labels/generator.py`: `truncated` now driven by `mix_modes`, not audible-vs-script duration
- `synthbanshee/script/types.py` (dataclass)
- `synthbanshee/tts/mixer.py` (producer)

No other source-tree consumer reads `audible_ends_s`. (`grep -rn audible_ends_s synthbanshee/`)

Changes per file
| File | Changes |
|---|---|
| `synthbanshee/tts/mixer.py` | `_speech_end_sample()` (signal-relative, 10 ms window); anchor BARGE_IN+OVERLAP at speech end; 60 ms fade at truncation boundary; in-place tail mutation; module docstring updated |
| `synthbanshee/labels/generator.py` | `truncated` driven by next-turn `mix_mode == BARGE_IN`; duration heuristic kept as fallback; docstring updated |
| `synthbanshee/script/types.py` | `MixedScene` docstring: flag that `rendered_offsets_s[i]` may exceed `rendered_onsets_s[i+1]` for interrupted turns |
| `docs/audio_generation_v3_design.md` | … `truncated` |
| `tests/unit/test_mixer.py` | `TestSpeechEndSample` (incl. fricative-tail and whisper tests); property-based acceptance test (overlap RMS > 1.30× louder solo); OVERLAP-fix test; existing barge-in tests updated for new audible-end semantic |
| `tests/integration/test_multi_speaker.py` | … `truncated` flag |

Test plan

- `pytest`: full suite, 1707 passed, 2 pre-existing warnings
- `ruff check`: passes (incl. pre-commit `ruff format`)
- `mypy`: clean on every changed file
- `test_overlap_anchored_on_speech_end_through_trailing_silence`: OVERLAP onset lands at speech end, not file end
- `test_does_not_clip_quiet_fricative_tail`: vowel + −25 dB fricative + silence; detector includes the fricative
- `test_threshold_is_signal_relative`: peak-relative threshold correctly detects whisper-level (−30 dBFS) speech end
- `test_barge_in_previous_turn_label_truncated`: `truncated=True` even with no trailing silence in the source WAV
- `sp_it_a_0001_00`: both BARGE_IN moments verified numerically (table above); A/B snippets at `/tmp/sb_66_listen/`

Out of scope

- `augment/preprocessing.py` is untouched

🤖 Generated with Claude Code