fix(mixer): #66 BARGE_IN audible-overlap crossfade past TTS trailing silence by shaypal5 · Pull Request #75 · DataHackIL/SynthBanshee

shaypal5 · 2026-05-04T08:33:25Z

Closes #66.

Verified on real audio. Re-rendered sp_it_a_0001_00 with both the old mixer (against a fresh script + cached TTS turns) and the fix; matched-script comparison at every BARGE_IN onset shows the silent gap is gone. Numbers below.

Problem

In a BARGE_IN turn pair, listeners heard the previous speaker stop, then a noticeable pause, then the new speaker start — sounding like polite turn-taking instead of an interruption. Reported in the 2026-05-03 listening test against sp_it_a_0001_00:

"At ~1:19 the woman audibly stops speaking BEFORE the man starts, with a noticeable time gap. Sounds like polite turn-taking, not an interruption."

Two compounding causes:

1. Onset anchored at file end, not speech end. Azure / Google TTS pad each utterance with 100–300 ms of trailing near-silence. With BARGE_IN depth in the 200–500 ms range, the overlap landed mostly inside that silence: the man overlapped the woman's silence, not her speech.
2. Hard cut at the new onset. Even with onset moved earlier, the previous turn was truncated exactly at the new turn's onset, so there was zero audible overlap.
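
A worked example of the first cause, using illustrative numbers picked from inside the ranges quoted above. The helper name and the specific values are assumptions for illustration; nothing here is taken from the mixer code.

```python
def audible_overlap_ms(depth_ms: float, trailing_silence_ms: float) -> float:
    """Overlap with actual speech when the onset is anchored at file end.

    The new turn starts `depth_ms` before the end of the previous file, but the
    last `trailing_silence_ms` of that file is TTS padding, so only the
    remainder (if any) lands on speech.
    """
    return max(0.0, float(depth_ms - trailing_silence_ms))


# Illustrative values inside the 200-500 ms depth and 100-300 ms padding ranges.
print(audible_overlap_ms(depth_ms=300, trailing_silence_ms=250))  # 50.0
print(audible_overlap_ms(depth_ms=200, trailing_silence_ms=300))  # 0.0
```

Even the residual 50 ms in the first case was then removed by the hard cut described in the second cause, which is why the two problems compound.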

Solution

In SceneMixer.mix_sequential:

- Add `_speech_end_sample()`: windowed RMS scan (10 ms windows; threshold is 40 dB below the segment's own peak window RMS, so it auto-scales for whisper, normal, and Lombard-tilted turns), scipy + numpy only.
- Anchor both BARGE_IN and OVERLAP onset against speech end instead of file end. Both modes intend audible co-occurrence; OVERLAP had the identical bug.
- For BARGE_IN, the previous turn plays at full volume through the entire overlap region (= up to its speech end), followed by a short 60 ms fade-out at the truncation boundary. This models a real interruption: the prior speaker keeps talking until overpowered, rather than being pulled down by a long console-fader fade.
- Tail is mutated in place (the array is already a fresh copy from `_apply_edge_fades`); no whole-array copy in the hot path.
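
The speech-end scan described above can be sketched roughly as follows. This is a minimal illustration, not the code in `synthbanshee/tts/mixer.py`: the 16 kHz sample rate, the constant names, and the handling of all-silent or very short input are assumptions. Only the 10 ms window, the 40 dB relative threshold, and the empty-input result of 0 come from this PR.

```python
import numpy as np

SAMPLE_RATE = 16_000         # assumed; the PR does not state the sample rate
WINDOW = SAMPLE_RATE // 100  # 10 ms windows
REL_THRESHOLD_DB = 40.0      # threshold sits 40 dB below the peak window RMS


def speech_end_sample(mono: np.ndarray) -> int:
    """Return the sample index just past the last window that holds speech.

    A trailing partial window is ignored in this sketch, and input shorter
    than one window is treated as containing no speech.
    """
    n_windows = len(mono) // WINDOW
    if n_windows == 0:
        return 0
    frames = mono[: n_windows * WINDOW].astype(np.float64).reshape(n_windows, WINDOW)
    rms = np.sqrt(np.mean(frames**2, axis=1))
    peak = rms.max()
    if peak == 0.0:
        return 0  # all-silent input (assumed behaviour)
    # Relative threshold: scales with the segment's own loudest window.
    threshold = peak * 10.0 ** (-REL_THRESHOLD_DB / 20.0)
    active = np.nonzero(rms >= threshold)[0]
    return int((active[-1] + 1) * WINDOW)


# 0.5 s of tone followed by 0.2 s of near-silent padding.
t = np.arange(SAMPLE_RATE // 2) / SAMPLE_RATE
speech = 0.4 * np.sin(2 * np.pi * 220 * t)
padding = 1e-4 * np.sin(2 * np.pi * 220 * np.arange(3200) / SAMPLE_RATE)
turn = np.concatenate([speech, padding])
print(speech_end_sample(turn), len(turn))
```

Because the threshold is relative, scaling the whole turn down (for example a whispered take) leaves the detected index unchanged, which is the auto-scaling property the bullet above refers to.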

In LabelGenerator.generate_events_from_scene:

truncated=True when the next turn's mix_mode == BARGE_IN (read from scene.mix_modes). This is robust to whether the WAV carried trailing silence — a turn marked BARGE_IN by the script is always interrupted in the user-meaningful sense. The previous heuristic ("audible_duration < script_duration") silently produced truncated=False for tight-TTS clips even when the script said BARGE_IN. The duration heuristic is retained as a defensive fallback.
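
The decision rule can be sketched as below. The `MixMode` enum here is a hypothetical stand-in (only `BARGE_IN` and `OVERLAP` are named in this PR; `SEQUENTIAL` is invented for the example), and the function name and list-based arguments are assumptions rather than the `LabelGenerator` API.

```python
from enum import Enum


class MixMode(Enum):
    """Hypothetical stand-in for the project's mix-mode type."""
    SEQUENTIAL = "sequential"  # assumed name for the non-overlapping mode
    OVERLAP = "overlap"
    BARGE_IN = "barge_in"


def is_truncated(i, mix_modes, audible_durations_s, script_durations_s):
    """True if turn i counts as interrupted.

    Primary signal: the next turn's mix mode is BARGE_IN.
    Fallback: the old duration heuristic (audible shorter than scripted).
    """
    next_is_barge_in = i + 1 < len(mix_modes) and mix_modes[i + 1] is MixMode.BARGE_IN
    shorter_than_script = audible_durations_s[i] < script_durations_s[i]
    return next_is_barge_in or shorter_than_script


modes = [MixMode.SEQUENTIAL, MixMode.SEQUENTIAL, MixMode.BARGE_IN]
durations = [2.0, 1.5, 1.0]  # tight TTS: audible == scripted for every turn
print([is_truncated(i, modes, durations, durations) for i in range(3)])
# [False, True, False]
```

With tight-TTS clips like these, the duration heuristic alone would report `False` for turn 1, which is the failure mode described above.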

The previous turn's audible_ends_s for an interrupted turn now sits after the next turn's rendered_onsets_s by the overlap-region length. The module docstring, MixedScene dataclass docstring, and docs/audio_generation_v3_design.md §4.6 all flag the change explicitly: consumers must not assume rendered_offsets_s[i] <= rendered_onsets_s[i+1].
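
For consumers, one way to cope with the relaxed invariant is to compute the overlap explicitly rather than assume turns are disjoint. The sketch below takes plain lists named after the fields mentioned above; the function name and the sample values are illustrative assumptions.

```python
def overlap_with_next_s(rendered_onsets_s, rendered_offsets_s):
    """Seconds by which each turn's audible window runs past the next turn's onset.

    Returns 0.0 for pairs that do not overlap; the result has one entry per
    consecutive pair of turns.
    """
    return [
        max(0.0, rendered_offsets_s[i] - rendered_onsets_s[i + 1])
        for i in range(len(rendered_onsets_s) - 1)
    ]


onsets = [0.0, 2.0, 3.25]
offsets = [2.0, 3.5, 5.0]  # turn 1 is interrupted: it ends after turn 2 starts
print(overlap_with_next_s(onsets, offsets))  # [0.0, 0.25]
```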

Real-audio verification

Re-rendered configs/scenes/she_proves/sp_it_a_0001.yaml twice against the same script and TTS cache: once with the mixer at main (BEFORE), once with the fix (AFTER). Same 17-turn script in both renders; the only difference is the mixer logic. Both runs produced 2 BARGE_IN turn pairs (turns 10→11 and 12→13).

Measured RMS (in dBFS) in 100 ms windows around each new-turn onset, plus the minimum 30 ms-window RMS in ±50 ms straddling the cut (= "is there silence at the boundary?"):

| BARGE_IN | Render | pre-onset 100 ms | post-onset 100 ms | min 30 ms within ±50 ms of onset |
|---|---|---|---|---|
| turn 10→11 (AGG→VIC, I3) | BEFORE | −101 dBFS | −97 dBFS | −200 dBFS (literal silence) |
| turn 10→11 (AGG→VIC, I3) | AFTER | −22 dBFS | −22 dBFS | −22 dBFS (continuous speech) |
| turn 12→13 (AGG→VIC, I4) | BEFORE | −93 dBFS | −98 dBFS | −200 dBFS (literal silence) |
| turn 12→13 (AGG→VIC, I4) | AFTER | −17 dBFS | −29 dBFS | −31 dBFS (no silent gap) |

The −200 dBFS floor (= zero samples in float32 → log saturation) at the cut in BEFORE is the bug: a real silent gap straddling the BARGE_IN boundary. AFTER, the boundary region is continuously at speech-level energy — both speakers active across the cut.
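
A measurement of this kind can be reproduced with something like the following sketch. It is not the script used for the table: the `1e-10` RMS clamp is one common way to end up with a −200 dBFS floor, and the 1 ms hop for the sliding 30 ms window, the function names, and the synthetic test signal are all assumptions.

```python
import math

import numpy as np

EPS = 1e-10  # assumed RMS clamp; 20 * log10(1e-10) = -200 dBFS


def rms_dbfs(x: np.ndarray) -> float:
    """RMS level of x in dBFS, clamped so digital silence reads about -200."""
    if x.size == 0:
        return 20.0 * math.log10(EPS)
    rms = float(np.sqrt(np.mean(x.astype(np.float64) ** 2)))
    return 20.0 * math.log10(max(rms, EPS))


def boundary_levels(mix: np.ndarray, onset: int, sample_rate: int):
    """Return (pre-onset 100 ms, post-onset 100 ms, min 30 ms level within +/-50 ms)."""
    w100 = sample_rate // 10
    w30 = sample_rate * 30 // 1000
    half = sample_rate * 50 // 1000
    hop = max(1, sample_rate // 1000)  # 1 ms hop (assumed)
    pre = rms_dbfs(mix[max(0, onset - w100):onset])
    post = rms_dbfs(mix[onset:onset + w100])
    lo, hi = max(0, onset - half), min(len(mix), onset + half)
    min30 = min(
        (rms_dbfs(mix[s:s + w30]) for s in range(lo, hi - w30 + 1, hop)),
        default=rms_dbfs(mix[lo:hi]),
    )
    return pre, post, min30


# Synthetic check: a continuous tone versus the same tone with a 100 ms hole.
sr = 16_000
t = np.arange(sr) / sr
continuous = (0.1 * np.sin(2 * np.pi * 220 * t)).astype(np.float32)
gapped = continuous.copy()
gapped[8000 - 800 : 8000 + 800] = 0.0
print(boundary_levels(continuous, 8000, sr)[2])
print(boundary_levels(gapped, 8000, sr)[2])
```

On the synthetic signals the continuous case stays at speech-like level while the gapped case bottoms out at the floor, mirroring the AFTER and BEFORE rows.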

A/B listening snippets are at /tmp/sb_66_listen/:

barge_10_AB_stitched.wav (BEFORE — 0.5 s gap — AFTER, AGG→VIC at I3)
barge_12_AB_stitched.wav (BEFORE — 0.5 s gap — AFTER, AGG→VIC at I4)

Audit of `audible_ends_s` consumers

| Consumer | Status |
|---|---|
| `synthbanshee/labels/generator.py` | Updated: `truncated` now driven by `mix_modes`, not audible-vs-script duration |
| `synthbanshee/script/types.py` (dataclass) | Docstring updated to flag the new invariant |
| `synthbanshee/tts/mixer.py` (producer) | Updated: full design described in module docstring |
| Tests | Updated; brittle invariant assertions replaced |

No other source-tree consumer reads audible_ends_s. (grep -rn audible_ends_s synthbanshee/)

Changes per file

| File | Change |
|---|---|
| `synthbanshee/tts/mixer.py` | `_speech_end_sample()` (signal-relative, 10 ms window); anchor BARGE_IN+OVERLAP at speech end; 60 ms fade at truncation boundary; in-place tail mutation; module docstring updated |
| `synthbanshee/labels/generator.py` | `truncated` driven by next-turn `mix_mode == BARGE_IN`; duration heuristic kept as fallback; docstring updated |
| `synthbanshee/script/types.py` | `MixedScene` docstring: flag that `rendered_offsets_s[i]` may exceed `rendered_onsets_s[i+1]` for interrupted turns |
| `docs/audio_generation_v3_design.md` | §4.6 rewritten: speech-end anchoring, fade design, audible-window overlap, mix-mode-driven `truncated` |
| `tests/unit/test_mixer.py` | New `TestSpeechEndSample` (incl. fricative-tail and whisper tests); property-based acceptance test (overlap RMS > 1.30× louder solo); OVERLAP-fix test; existing barge-in tests updated for new audible-end semantic |
| `tests/integration/test_multi_speaker.py` | Restored to original (no-trailing-silence) fixture; assertions still pass via the new mix-mode-driven `truncated` flag |
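
The property behind the acceptance test listed for `tests/unit/test_mixer.py` (overlap RMS > 1.30× the louder solo voice) can be illustrated with two synthetic tones. This is a sketch under assumptions, not the test in the repository: the helper names, tone frequencies, and equal amplitudes are chosen for the example.

```python
import numpy as np


def rms(x: np.ndarray) -> float:
    return float(np.sqrt(np.mean(x.astype(np.float64) ** 2)))


def overlap_gain(voice_a: np.ndarray, voice_b: np.ndarray) -> float:
    """Ratio of the mixed RMS to the louder solo RMS over the same region."""
    return rms(voice_a + voice_b) / max(rms(voice_a), rms(voice_b))


sr = 16_000
t = np.arange(sr // 2) / sr  # 0.5 s overlap region
voice_a = 0.3 * np.sin(2 * np.pi * 220 * t)
voice_b = 0.3 * np.sin(2 * np.pi * 330 * t)

# Both voices present: the ratio is close to sqrt(2) for equal-level,
# uncorrelated signals, comfortably above the 1.30 threshold.
assert overlap_gain(voice_a, voice_b) > 1.30

# Only one voice present (the old hard-cut behaviour): the ratio is 1.0.
assert overlap_gain(voice_a, np.zeros_like(voice_a)) < 1.30
```

For two uncorrelated voices at equal level the ratio approaches √2 ≈ 1.41, which would explain why 1.30 leaves some headroom; that reading is an inference, not something the PR states. With strongly unequal levels the ratio drops toward 1.0, so the sketch deliberately uses matched amplitudes.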

Test plan

Out of scope

TTS / SSML parameter changes (silence padding is upstream; we work around it in the mixer)
Preprocessing changes — augment/preprocessing.py is untouched

🤖 Generated with Claude Code

…silence

Root cause: Azure/Google TTS pads the end of an utterance with 100–300 ms of near-silence, and the mixer anchored BARGE_IN onset against the file end. A typical 200–500 ms barge-in depth therefore landed mostly inside the silence, so listeners heard the previous speaker stop, then a gap, then the new speaker start, sounding like polite turn-taking instead of an interruption. Compounding the perception: the previous turn was hard-cut at the new turn's onset, leaving zero audible overlap of the two voices.

Fix:
- Add `_speech_end_sample()` (windowed RMS scan, scipy/numpy only) and anchor BARGE_IN onset against the previous turn's *speech* end, skipping trailing TTS silence padding.
- Truncate the previous turn at its speech end (not at the new onset) and apply a fade-out across the overlap region, producing a real audible crossfade between the two voices.
- OVERLAP path is unchanged (still anchored against file end).

Also: update the docstring to reflect that for an interrupted turn, `audible_ends_s` now sits *after* the new turn's `rendered_onsets_s` by the overlap-region length (the truncation point is the previous speech end, not the new onset).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Fixes #66 by changing how BARGE_IN mixing is timed and truncated so the “interrupting” speaker audibly overlaps the prior speaker’s speech (not the prior file’s trailing near-silence), and by introducing a crossfade instead of a hard cut at the interruption point.

Changes:

Add _speech_end_sample() (windowed RMS tail scan) and use it to anchor BARGE_IN onset against the previous turn’s speech end.
Update BARGE_IN behavior to keep the previous turn through its speech end and apply a linear fade-out across the overlap region (audible crossfade).
Add/adjust unit + integration tests to cover trailing-silence padding scenarios and confirm OVERLAP remains anchored to file end.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| synthbanshee/tts/mixer.py | Implements speech-end detection and new `BARGE_IN` onset/truncation + crossfade semantics; updates module docs. |
| tests/unit/test_mixer.py | Adds targeted unit tests for `_speech_end_sample()` and acceptance/regression tests for `BARGE_IN` vs `OVERLAP`. |
| tests/integration/test_multi_speaker.py | Adds trailing-silence support to integration WAV helper to make truncation observable in end-to-end label metrics. |


… fade redesign, label semantic

Addresses self-review of #75:

1. **Signal-relative speech-end threshold.** Replace the hardcoded −40 dBFS absolute threshold with a per-segment relative threshold (40 dB below the segment's own peak window RMS). Auto-scales for whisper, normal, and Lombard-tilted (#65) turns; tested against a Hebrew-style fricative tail (vowel @ 0.4 + fricative @ −25 dB + silence) to confirm we don't cut quiet word-final phonemes.
2. **10 ms windows.** Drop window from 30 ms → 10 ms (matches `_EDGE_FADE_SAMPLES`). Cuts measurement uncertainty in interrupted-turn label offsets from ±30 ms to ≤ 10 ms.
3. **OVERLAP gets the same fix.** OVERLAP had the identical trailing-silence bug: both speakers were meant to be heard, but with TTS padding the new turn could land in silence. Speech-end anchoring now applies to both modes. The "OVERLAP-anchors-on-file-end" regression-guard test (which codified the broken behaviour) is replaced with one that asserts the correct behaviour.
4. **BARGE_IN fade redesign.** Drop the linear span-the-overlap fade-out (which sounded like a console fader pulling down the prior speaker over 500 ms) in favour of a short 60 ms fade at the truncation boundary. The prior speaker plays at full volume through the entire overlap region, matching real interruption physics.
5. **In-place tail mutation.** Replace whole-array `prev_mono.copy()` with in-place mutation of the tail region (the array is already a fresh copy from `_apply_edge_fades`).
6. **`LabelGenerator.truncated` driven by mix mode, not duration heuristic.** Old heuristic ("audible_duration < script_duration") silently went False for tight-TTS clips that lacked trailing silence, even when the script said BARGE_IN. New: `truncated=True` when the next turn's mix mode is `BARGE_IN` (read from `scene.mix_modes`), with the duration heuristic retained as a defensive fallback. Restores the integration test fixture to its original (no-trailing-silence) form; the test was previously masked by adding silence until the assertion passed.
7. **Property-based acceptance test.** Replace the brittle `assert overlap_rms > 0.32` (a magic number tuned to the specific fade shape) with a property: overlap RMS exceeds the louder solo voice's RMS by >= 1.30x. Holds regardless of fade shape, voice frequencies, or window length.
8. **Helper signature simplified.** `_speech_end_sample` drops the unused `sample_rate`, `window_ms`, and `threshold_dbfs` parameters.
9. **Docs.** Update `docs/audio_generation_v3_design.md` §4.6 and the `MixedScene` dataclass docstring with the new audible-end semantic (consecutive turns can overlap for BARGE_IN; consumers must not assume non-overlap).

Audited the codebase for `audible_ends_s` consumers: only `LabelGenerator` reads it, and it now handles the new semantic correctly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
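
Items 4 and 5 of this commit message (a short fade at the truncation boundary, applied in place) might look roughly like the sketch below. The function name, the linear fade shape, and zeroing everything past the truncation point are assumptions for illustration; the commit message specifies only the 60 ms length and the in-place mutation.

```python
import numpy as np


def fade_out_tail_inplace(
    mono: np.ndarray, end: int, sample_rate: int, fade_ms: float = 60.0
) -> None:
    """Fade the last `fade_ms` before `end` to zero and silence the rest, in place.

    Samples before the fade region are left untouched, so the prior speaker
    stays at full volume through the overlap. `mono` must be a float array.
    """
    n_fade = min(end, int(sample_rate * fade_ms / 1000.0))
    start = end - n_fade
    # Linear ramp from 1.0 down toward 0.0 (shape is an assumption).
    ramp = np.linspace(1.0, 0.0, n_fade, endpoint=False, dtype=mono.dtype)
    mono[start:end] *= ramp
    mono[end:] = 0.0


sr = 16_000
prev_turn = np.ones(sr, dtype=np.float32)  # stand-in for a turn's samples
fade_out_tail_inplace(prev_turn, end=8000, sample_rate=sr)
print(prev_turn[7039], prev_turn[7999], prev_turn[8000])
```

Only the slice from the fade start onward is written, which is the point of the in-place change: no whole-array copy is needed when the caller already owns a fresh buffer.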

github-actions · 2026-05-04T10:08:39Z

pr-agent-context report:

This is a refreshed snapshot of the current PR state.

This run includes an unresolved review comment and patch coverage gaps on PR #75 in repository https://github.com/DataHackIL/SynthBanshee

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, address the patch coverage gaps below, and push all of these changes in a single
commit.

# Copilot Comments

## COPILOT-1
Location: synthbanshee/tts/mixer.py:248
URL: https://github.com/DataHackIL/SynthBanshee/pull/75#discussion_r3180298263
Root author: copilot-pull-request-reviewer

Comment:
    Docstring for `_speech_end_sample` says empty input returns `len(mono)`, but implementation returns `0` (and unit tests assert 0). Please update the docstring to match the actual behavior (or change the implementation if the docstring is intended).

# Patch coverage

Patch test coverage is 30.77%; please raise it to 100%. These are the uncovered code lines:
- synthbanshee/labels/generator.py: 174, 175, 176, 179, 180
- synthbanshee/script/types.py: 100, 101
- synthbanshee/tts/mixer.py: 33, 35, 36, 78, 79, 80, 81, 213, 214, 215, 216, 217, 218, 219, 223, 224, 236, 248, 252, 253

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: schedule
Workflow run: 25313140597 attempt 1
Comment timestamp: 2026-05-04T10:08:06.976204+00:00
PR head commit: 83790384c85c6d43625343c3d9db1371905ae443

shaypal5 added bugfix type: fix Bug fix comp: mixer SceneMixer, MixMode, gap controller labels May 4, 2026

Copilot AI review requested due to automatic review settings May 4, 2026 08:33

Copilot started reviewing on behalf of shaypal5 May 4, 2026 08:33 View session

Copilot AI reviewed May 4, 2026

shaypal5 merged commit d2c0514 into main May 4, 2026
6 checks passed

shaypal5 deleted the fix/barge-in-gap-66 branch May 4, 2026 12:19

shaypal5 mentioned this pull request May 5, 2026

investigate(tts): #83 residual — Whisper WER regression on high-intensity (I3+) Tier A clips #87

Open

shaypal5 merged 2 commits into main from fix/barge-in-gap-66
