
fix(mixer): #66 BARGE_IN audible-overlap crossfade past TTS trailing silence (#75)

Merged: shaypal5 merged 2 commits into main from fix/barge-in-gap-66 on May 4, 2026
@shaypal5 shaypal5 commented May 4, 2026

Closes #66.

Verified on real audio. Re-rendered sp_it_a_0001_00 with both the old mixer (against a fresh script + cached TTS turns) and the fix; matched-script comparison at every BARGE_IN onset shows the silent gap is gone. Numbers below.

Problem

In a BARGE_IN turn pair, listeners heard the previous speaker stop, then a noticeable pause, then the new speaker start — sounding like polite turn-taking instead of an interruption. Reported in the 2026-05-03 listening test against sp_it_a_0001_00:

"At ~1:19 the woman audibly stops speaking BEFORE the man starts, with a noticeable time gap. Sounds like polite turn-taking, not an interruption."

Two compounding causes:

  1. Onset anchored at file end, not speech end. Azure / Google TTS pad each utterance with 100–300 ms of trailing near-silence. With BARGE_IN depth in the 200–500 ms range, the overlap landed mostly inside that silence — the man overlapped the woman's silence, not her speech.
  2. Hard cut at the new onset. Even with onset moved earlier, the previous turn was truncated exactly at the new turn's onset, so there was zero audible overlap.
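The arithmetic behind cause 1 can be sketched in a few lines. This is only an illustration: `speech_overlap_ms` is a hypothetical helper, not part of the mixer, and it assumes the old behaviour of placing the new onset `depth_ms` before the file end.

```python
def speech_overlap_ms(depth_ms: float, trailing_silence_ms: float) -> float:
    """Portion of the nominal overlap window that coincides with real speech.

    Assumes the old anchoring: the new turn starts `depth_ms` before the
    previous file's end, and the last `trailing_silence_ms` of that file is
    TTS padding rather than speech.
    """
    return max(0.0, float(depth_ms) - float(trailing_silence_ms))


# A 300 ms barge-in depth against 250 ms of padding reaches only 50 ms of speech.
print(speech_overlap_ms(300, 250))  # 50.0
# A 200 ms depth against 300 ms of padding starts entirely inside the silence.
print(speech_overlap_ms(200, 300))  # 0.0
```

Cause 2 then removed even that remainder, because the previous turn was cut at the new onset.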

Solution

In SceneMixer.mix_sequential:

  • Add _speech_end_sample() — windowed RMS scan (10 ms windows; threshold is 40 dB below the segment's own peak window RMS, so it auto-scales for whisper, normal, and Lombard-tilted turns), scipy + numpy only.
  • Anchor both BARGE_IN and OVERLAP onset against speech end instead of file end. Both modes intend audible co-occurrence; OVERLAP had the identical bug.
  • For BARGE_IN, the previous turn plays at full volume through the entire overlap region (= up to its speech end), then a short 60 ms fade-out at the truncation boundary. Models a real interruption — the prior speaker keeps talking until overpowered, not a long fade pulled by a console fader.
  • Tail is mutated in place (the array is already a fresh copy from _apply_edge_fades); no whole-array copy in the hot path.
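A minimal sketch of a signal-relative speech-end scan, under stated assumptions, is shown below. It is not the PR's `_speech_end_sample()`: the function name, the explicit parameters, the 16 kHz default, the zero-padding of the last partial window, and the all-silent return value are illustrative choices. Only the 10 ms window, the 40 dB peak-relative threshold, and the `0` result for empty input come from this thread.

```python
import numpy as np


def speech_end_sample(mono, sample_rate=16_000, window_ms=10.0, rel_threshold_db=40.0):
    """Return the index just past the last window that still holds speech.

    A window counts as speech when its RMS is no more than `rel_threshold_db`
    below the loudest window's RMS, so the threshold scales with the signal.
    """
    mono = np.asarray(mono, dtype=np.float64)
    if mono.size == 0:
        return 0  # empty input: matches the behaviour reported in review
    win = max(1, int(sample_rate * window_ms / 1000))
    n_windows = -(-mono.size // win)  # ceiling division
    padded = np.zeros(n_windows * win)
    padded[: mono.size] = mono  # assumption: zero-pad the final partial window
    rms = np.sqrt(np.mean(padded.reshape(n_windows, win) ** 2, axis=1))
    peak = rms.max()
    if peak == 0.0:
        return 0  # assumption: an all-silent segment has no speech end
    threshold = peak * 10.0 ** (-rel_threshold_db / 20.0)
    last_active = int(np.nonzero(rms >= threshold)[0][-1])
    return min(mono.size, (last_active + 1) * win)


sr = 16_000
t = np.arange(sr // 2) / sr                      # 0.5 s of "speech"
tone = 0.4 * np.sin(2 * np.pi * 220 * t)
padded_turn = np.concatenate([tone, np.zeros(int(0.2 * sr))])  # + 0.2 s padding
print(speech_end_sample(padded_turn))            # 8000
print(speech_end_sample(0.01 * padded_turn))     # 8000 (same result 40 dB quieter)
```

Because the threshold is relative to the segment's own peak, scaling the whole turn down leaves the detected end unchanged, which is the property the whisper test relies on.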

In LabelGenerator.generate_events_from_scene:

  • truncated=True when the next turn's mix_mode == BARGE_IN (read from scene.mix_modes). This is robust to whether the WAV carried trailing silence — a turn marked BARGE_IN by the script is always interrupted in the user-meaningful sense. The previous heuristic ("audible_duration < script_duration") silently produced truncated=False for tight-TTS clips even when the script said BARGE_IN. The duration heuristic is retained as a defensive fallback.
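The labelling rule above can be sketched as follows. Everything here is hypothetical scaffolding: the enum values, the function name `is_truncated`, and the assumption that `mix_modes[i]` describes how turn `i` is mixed against the turn before it. The real `LabelGenerator` reads these from the scene object.

```python
from enum import Enum


class MixMode(Enum):
    # Member names follow the PR text; the string values are placeholders.
    SEQUENTIAL = "sequential"
    OVERLAP = "overlap"
    BARGE_IN = "barge_in"


def is_truncated(i, mix_modes, audible_durations_s, script_durations_s):
    """Decide the `truncated` flag for turn `i`.

    Primary signal: the next turn barges in. Fallback: the old duration
    heuristic, kept for cases the mix mode does not explain.
    """
    next_is_barge_in = i + 1 < len(mix_modes) and mix_modes[i + 1] is MixMode.BARGE_IN
    shorter_than_script = audible_durations_s[i] < script_durations_s[i]
    return next_is_barge_in or shorter_than_script


modes = [MixMode.SEQUENTIAL, MixMode.BARGE_IN, MixMode.SEQUENTIAL]
durations = [2.0, 1.5, 1.0]  # tight TTS: audible duration equals script duration
print([is_truncated(i, modes, durations, durations) for i in range(3)])
# [True, False, False]
```

With tight TTS clips the duration heuristic alone would return `False` for turn 0; the mix-mode check is what marks it as interrupted.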

The previous turn's audible_ends_s for an interrupted turn now sits after the next turn's rendered_onsets_s by the overlap-region length. The module docstring, MixedScene dataclass docstring, and docs/audio_generation_v3_design.md §4.6 all flag the change explicitly: consumers must not assume rendered_offsets_s[i] <= rendered_onsets_s[i+1].
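The timing relationship and the short fade can be illustrated with a small sketch. The helper names are hypothetical, the fade is assumed to be linear, and it is assumed to occupy the last 60 ms before the truncation point; the PR text does not pin down either detail.

```python
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate for the illustration
FADE_MS = 60


def barge_in_timeline(prev_onset, prev_speech_end, depth_samples):
    """Return (next_onset, prev_audible_end) in absolute samples.

    The new turn is anchored `depth_samples` before the previous turn's
    speech end; the previous turn stays audible up to that speech end.
    """
    next_onset = prev_onset + max(0, prev_speech_end - depth_samples)
    prev_audible_end = prev_onset + prev_speech_end
    return next_onset, prev_audible_end


def fade_tail_in_place(prev, speech_end, sample_rate=SAMPLE_RATE, fade_ms=FADE_MS):
    """Fade out the final `fade_ms` before `speech_end`, then zero the rest.

    Mutates `prev` (a float array) directly; no copy of the whole turn is made.
    """
    fade_len = min(speech_end, int(sample_rate * fade_ms / 1000))
    prev[speech_end - fade_len : speech_end] *= np.linspace(1.0, 0.0, fade_len)
    prev[speech_end:] = 0.0


prev = np.ones(SAMPLE_RATE, dtype=np.float32)   # 1 s stand-in for the previous turn
fade_tail_in_place(prev, speech_end=12_000)
next_onset, prev_end = barge_in_timeline(0, 12_000, depth_samples=4_800)
print(next_onset, prev_end)  # 7200 12000
```

Here the previous turn's audible end (12000) sits 4800 samples after the next onset (7200), which is exactly the overlap that consumers can no longer assume away.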

Real-audio verification

Re-rendered configs/scenes/she_proves/sp_it_a_0001.yaml twice against the same script and TTS cache: once with the mixer at main (BEFORE), once with the fix (AFTER). Same 17-turn script in both renders; the only difference is the mixer logic. Both runs produced 2 BARGE_IN turn pairs (turns 10→11 and 12→13).

Measured RMS (in dBFS) in 100 ms windows around each new-turn onset, plus the minimum 30 ms-window RMS in ±50 ms straddling the cut (= "is there silence at the boundary?"):

| BARGE_IN pair | Render | Pre-onset 100 ms | Post-onset 100 ms | Min 30 ms within ±50 ms of onset |
| --- | --- | --- | --- | --- |
| turn 10→11 (AGG→VIC, I3) | BEFORE | −101 dBFS | −97 dBFS | −200 dBFS (literal silence) |
| turn 10→11 (AGG→VIC, I3) | AFTER | −22 dBFS | −22 dBFS | −22 dBFS (continuous speech) |
| turn 12→13 (AGG→VIC, I4) | BEFORE | −93 dBFS | −98 dBFS | −200 dBFS (literal silence) |
| turn 12→13 (AGG→VIC, I4) | AFTER | −17 dBFS | −29 dBFS | −31 dBFS (no silent gap) |

The −200 dBFS floor (= zero samples in float32 → log saturation) at the cut in BEFORE is the bug: a real silent gap straddling the BARGE_IN boundary. AFTER, the boundary region is continuously at speech-level energy — both speakers active across the cut.
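A measurement of this kind can be approximated with the sketch below. It is not the script used for the table: the function names, the 1 ms hop of the sliding 30 ms window, and the explicit −200 dB floor for all-zero windows are assumptions chosen to mirror the numbers described above.

```python
import numpy as np


def rms_dbfs(x, floor_db=-200.0):
    """RMS level in dBFS for float audio in [-1, 1]; all-zero input hits the floor."""
    x = np.asarray(x, dtype=np.float64)
    if x.size == 0:
        return floor_db
    rms = float(np.sqrt(np.mean(x ** 2)))
    if rms <= 0.0:
        return floor_db
    return max(floor_db, float(20.0 * np.log10(rms)))


def boundary_report(mix, onset, sample_rate=16_000, hop_ms=1.0):
    """Return (pre_100ms, post_100ms, min_30ms_within_50ms) in dBFS around `onset`."""
    w100 = int(0.100 * sample_rate)
    w30 = int(0.030 * sample_rate)
    half = int(0.050 * sample_rate)
    hop = max(1, int(sample_rate * hop_ms / 1000))
    pre = rms_dbfs(mix[max(0, onset - w100) : onset])
    post = rms_dbfs(mix[onset : onset + w100])
    lo, hi = max(0, onset - half), min(len(mix), onset + half)
    starts = range(lo, max(lo + 1, hi - w30 + 1), hop)
    min30 = min(rms_dbfs(mix[s : s + w30]) for s in starts)
    return pre, post, min30


sr = 16_000
t = np.arange(sr) / sr
speech = 0.1 * np.sin(2 * np.pi * 220 * t)       # continuous stand-in "speech"
gapped = speech.copy()
gapped[8_000 - 1_600 : 8_000 + 1_600] = 0.0      # silent gap straddling the onset
print(boundary_report(speech, onset=8_000))      # all three values near -23 dBFS
print(boundary_report(gapped, onset=8_000))      # (-200.0, -200.0, -200.0)
```

The third value is the one that separates the two renders: any fully silent 30 ms window near the boundary drives it to the floor.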

A/B listening snippets are at /tmp/sb_66_listen/:

  • barge_10_AB_stitched.wav (BEFORE — 0.5 s gap — AFTER, AGG→VIC at I3)
  • barge_12_AB_stitched.wav (BEFORE — 0.5 s gap — AFTER, AGG→VIC at I4)

Audit of audible_ends_s consumers

| Consumer | Status |
| --- | --- |
| synthbanshee/labels/generator.py | Updated: truncated now driven by mix_modes, not audible-vs-script duration |
| synthbanshee/script/types.py (dataclass) | Docstring updated to flag the new invariant |
| synthbanshee/tts/mixer.py (producer) | Updated: full design described in module docstring |
| Tests | Updated; brittle invariant assertions replaced |

No other source-tree consumer reads audible_ends_s. (grep -rn audible_ends_s synthbanshee/)

Changes per file

| File | Change |
| --- | --- |
| synthbanshee/tts/mixer.py | _speech_end_sample() (signal-relative, 10 ms window); anchor BARGE_IN+OVERLAP at speech end; 60 ms fade at truncation boundary; in-place tail mutation; module docstring updated |
| synthbanshee/labels/generator.py | truncated driven by next-turn mix_mode == BARGE_IN; duration heuristic kept as fallback; docstring updated |
| synthbanshee/script/types.py | MixedScene docstring: flag that rendered_offsets_s[i] may exceed rendered_onsets_s[i+1] for interrupted turns |
| docs/audio_generation_v3_design.md | §4.6 rewritten: speech-end anchoring, fade design, audible-window overlap, mix-mode-driven truncated |
| tests/unit/test_mixer.py | New TestSpeechEndSample (incl. fricative-tail and whisper tests); property-based acceptance test (overlap RMS > 1.30× louder solo); OVERLAP-fix test; existing barge-in tests updated for new audible-end semantic |
| tests/integration/test_multi_speaker.py | Restored to original (no-trailing-silence) fixture; assertions still pass via the new mix-mode-driven truncated flag |

Test plan

  • pytest — full suite: 1707 passed, 2 pre-existing warnings
  • ruff check — passes (incl. pre-commit ruff format)
  • mypy — clean on every changed file
  • Acceptance (BARGE_IN, property-based): overlap RMS > 1.30× louder solo voice — both speakers genuinely co-present after the new onset
  • OVERLAP fix verified: test_overlap_anchored_on_speech_end_through_trailing_silence — OVERLAP onset lands at speech end, not file end
  • Threshold robustness: test_does_not_clip_quiet_fricative_tail — vowel + −25 dB fricative + silence; detector includes the fricative
  • Threshold scaling: test_threshold_is_signal_relative — peak-relative threshold correctly detects whisper-level (−30 dBFS) speech end
  • Label semantic: test_barge_in_previous_turn_label_truncated asserts truncated=True even with no trailing silence in the source WAV
  • Existing edge cases preserved: full-depth BARGE_IN, zero-depth, as-first-segment, SEQUENTIAL three-timeline equality
  • Real-audio verification on sp_it_a_0001_00 — both BARGE_IN moments verified numerically (table above); A/B snippets at /tmp/sb_66_listen/
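The property-based acceptance check can be illustrated with synthetic tones. This is a sketch rather than the test in tests/unit/test_mixer.py: the helper names and the tone frequencies are assumptions, and only the 1.30× ratio comes from the test plan.

```python
import numpy as np


def rms(x):
    """Root-mean-square level of a signal."""
    return float(np.sqrt(np.mean(np.asarray(x, dtype=np.float64) ** 2)))


def overlap_is_copresent(overlap, solo_a, solo_b, min_ratio=1.30):
    """True when the overlap region is louder than either voice alone by `min_ratio`.

    For two uncorrelated voices at equal level the expected ratio is sqrt(2),
    so 1.30 leaves headroom while still rejecting a single-voice region.
    """
    return rms(overlap) > min_ratio * max(rms(solo_a), rms(solo_b))


sr = 16_000
t = np.arange(int(0.3 * sr)) / sr                 # 300 ms overlap region
voice_a = 0.3 * np.sin(2 * np.pi * 220 * t)       # stand-in for the previous speaker
voice_b = 0.3 * np.sin(2 * np.pi * 330 * t)       # stand-in for the interrupter
print(overlap_is_copresent(voice_a + voice_b, voice_a, voice_b))  # True
print(overlap_is_copresent(voice_b, voice_a, voice_b))            # False
```

The second call models the old hard cut, where only the new speaker is present after the onset, and the property correctly fails.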

Out of scope

  • TTS / SSML parameter changes (silence padding is upstream; we work around it in the mixer)
  • Preprocessing changes — augment/preprocessing.py is untouched

🤖 Generated with Claude Code

…silence

Root cause: Azure/Google TTS pads the end of an utterance with 100–300 ms of
near-silence, and the mixer anchored BARGE_IN onset against the file end.  A
typical 200–500 ms barge-in depth therefore landed mostly inside the silence,
so listeners heard the previous speaker stop, then a gap, then the new speaker
start — sounding like polite turn-taking instead of an interruption.

Compounding the perception: the previous turn was hard-cut at the new turn's
onset, leaving zero audible overlap of the two voices.

Fix:
- Add `_speech_end_sample()` (windowed RMS scan, scipy/numpy only) and anchor
  BARGE_IN onset against the previous turn's *speech* end, skipping trailing
  TTS silence padding.
- Truncate the previous turn at its speech end (not at the new onset) and
  apply a fade-out across the overlap region, producing a real audible
  crossfade between the two voices.
- OVERLAP path is unchanged (still anchored against file end).

Also: update the docstring to reflect that for an interrupted turn,
`audible_ends_s` now sits *after* the new turn's `rendered_onsets_s` by the
overlap-region length (the truncation point is the previous speech end, not
the new onset).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shaypal5 added the bugfix, type: fix, and comp: mixer labels May 4, 2026
Copilot AI review requested due to automatic review settings May 4, 2026 08:33
Copilot AI left a comment


Pull request overview

Fixes #66 by changing how BARGE_IN mixing is timed and truncated so the “interrupting” speaker audibly overlaps the prior speaker’s speech (not the prior file’s trailing near-silence), and by introducing a crossfade instead of a hard cut at the interruption point.

Changes:

  • Add _speech_end_sample() (windowed RMS tail scan) and use it to anchor BARGE_IN onset against the previous turn’s speech end.
  • Update BARGE_IN behavior to keep the previous turn through its speech end and apply a linear fade-out across the overlap region (audible crossfade).
  • Add/adjust unit + integration tests to cover trailing-silence padding scenarios and confirm OVERLAP remains anchored to file end.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| synthbanshee/tts/mixer.py | Implements speech-end detection and new BARGE_IN onset/truncation + crossfade semantics; updates module docs. |
| tests/unit/test_mixer.py | Adds targeted unit tests for _speech_end_sample() and acceptance/regression tests for BARGE_IN vs OVERLAP. |
| tests/integration/test_multi_speaker.py | Adds trailing-silence support to integration WAV helper to make truncation observable in end-to-end label metrics. |


Comment thread synthbanshee/tts/mixer.py
… fade redesign, label semantic

Addresses self-review of #75:

1. **Signal-relative speech-end threshold.** Replace the hardcoded −40 dBFS
   absolute threshold with a per-segment relative threshold (40 dB below the
   segment's own peak window RMS).  Auto-scales for whisper, normal, and
   Lombard-tilted (#65) turns; tested against a Hebrew-style fricative tail
   (vowel @ 0.4 + fricative @ −25 dB + silence) to confirm we don't cut quiet
   word-final phonemes.

2. **10 ms windows.** Drop window from 30 ms → 10 ms (matches
   `_EDGE_FADE_SAMPLES`).  Cuts measurement uncertainty in interrupted-turn
   label offsets from ±30 ms to ≤ 10 ms.

3. **OVERLAP gets the same fix.** OVERLAP had the identical trailing-silence
   bug — both speakers were meant to be heard, but with TTS padding the new
   turn could land in silence.  Speech-end anchoring now applies to both
   modes.  The "OVERLAP-anchors-on-file-end" regression-guard test (which
   codified the broken behaviour) is replaced with one that asserts the
   correct behaviour.

4. **BARGE_IN fade redesign.** Drop the linear span-the-overlap fade-out
   (which sounded like a console fader pulling down the prior speaker over
   500 ms) in favour of a short 60 ms fade at the truncation boundary.  The
   prior speaker plays at full volume through the entire overlap region —
   matching real interruption physics.

5. **In-place tail mutation.** Replace whole-array `prev_mono.copy()` with
   in-place mutation of the tail region (the array is already a fresh copy
   from `_apply_edge_fades`).

6. **`LabelGenerator.truncated` driven by mix mode, not duration heuristic.**
   Old heuristic ("audible_duration < script_duration") silently went False
   for tight-TTS clips that lacked trailing silence, even when the script
   said BARGE_IN.  New: `truncated=True` when the next turn's mix mode is
   `BARGE_IN` (read from `scene.mix_modes`), with the duration heuristic
   retained as a defensive fallback.  Restores the integration test fixture
   to its original (no-trailing-silence) form — the test was previously
   masked by adding silence until the assertion passed.

7. **Property-based acceptance test.** Replace the brittle `assert overlap_rms
   > 0.32` (a magic number tuned to the specific fade shape) with a property:
   overlap RMS exceeds the louder solo voice's RMS by >= 1.30x.  Holds
   regardless of fade shape, voice frequencies, or window length.

8. **Helper signature simplified.** `_speech_end_sample` drops the unused
   `sample_rate`, `window_ms`, and `threshold_dbfs` parameters.

9. **Docs.** Update `docs/audio_generation_v3_design.md` §4.6 and the
   `MixedScene` dataclass docstring with the new audible-end semantic
   (consecutive turns can overlap for BARGE_IN; consumers must not assume
   non-overlap).  Audited the codebase for `audible_ends_s` consumers — only
   `LabelGenerator` reads it, and it now handles the new semantic correctly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
github-actions Bot commented May 4, 2026

pr-agent-context report:

This is a refreshed snapshot of the current PR state.

This run includes an unresolved review comment and patch coverage gaps on PR #75 in repository https://github.com/DataHackIL/SynthBanshee

For each unresolved review comment, recommend one of: resolve as irrelevant, accept and implement
the recommended solution, open a separate issue and resolve as out-of-scope for this PR, accept and
implement a different solution, or resolve as already treated by the code.

After I reply with my decision per item, implement the accepted actions, resolve the corresponding
PR comments, address the patch coverage gaps below, and push all of these changes in a single
commit.

# Copilot Comments

## COPILOT-1
Location: synthbanshee/tts/mixer.py:248
URL: https://github.com/DataHackIL/SynthBanshee/pull/75#discussion_r3180298263
Root author: copilot-pull-request-reviewer

Comment:
    Docstring for `_speech_end_sample` says empty input returns `len(mono)`, but implementation returns `0` (and unit tests assert 0). Please update the docstring to match the actual behavior (or change the implementation if the docstring is intended).

# Patch coverage

Patch test coverage is 30.77%; please raise it to 100%. These are the uncovered code lines:
- synthbanshee/labels/generator.py: 174, 175, 176, 179, 180
- synthbanshee/script/types.py: 100, 101
- synthbanshee/tts/mixer.py: 33, 35, 36, 78, 79, 80, 81, 213, 214, 215, 216, 217, 218, 219, 223, 224, 236, 248, 252, 253

Run metadata:

Tool ref: v4
Tool version: 4.0.21
Trigger: schedule
Workflow run: 25313140597 attempt 1
Comment timestamp: 2026-05-04T10:08:06.976204+00:00
PR head commit: 83790384c85c6d43625343c3d9db1371905ae443

@shaypal5 shaypal5 merged commit d2c0514 into main May 4, 2026
6 checks passed
@shaypal5 shaypal5 deleted the fix/barge-in-gap-66 branch May 4, 2026 12:19

Development

Successfully merging this pull request may close these issues.

bug(mixer): barge-in has audible gap — woman stops before man starts
