TTS Audio Quality Improvements: VAD Silence Compression, WSOLA Time-Stretching & Advanced Settings UI by Ironship · Pull Request #10382 · SubtitleEdit/subtitleedit

Ironship · 2026-03-23T10:10:53Z

Context

This is a continuation of the Edge-TTS engine integration (PR #10378). That PR added Edge-TTS support with basic prosody controls (rate/pitch/volume), a pro audio chain, audio ducking, silence padding, and sample rate conversion. This PR addresses a fundamental problem that affects all TTS engines: what to do when the generated audio doesn't fit the subtitle duration.

The Problem

When TTS generates speech for a subtitle segment, the audio often exceeds the available time window. This is especially common when translating between languages with different information density (e.g., Spanish or German translations of English text are typically 20-40% longer).

The previous approach was:

Trim silence from start/end of the audio
Speed up the entire audio using FFmpeg's atempo filter

This worked, but had two significant quality issues:

Unnecessary speed-up: The audio contained internal pauses between words/phrases that could be shortened first, reducing or eliminating the need for tempo changes
Artifacts at high factors: atempo preserves pitch, but it produces audible artifacts on speech at higher speed factors (1.3x+), making voices sound unnatural

What This PR Adds

1. VAD-Based Internal Silence Compression (New)

The idea: Before touching the tempo of speech, first attack the silence. A level-based voice activity detection (VAD) pass identifies the gaps between words/phrases and shortens them. This is the least intrusive fix because no phonemes are altered.

Implementation: New CompressInternalSilence() method in FfmpegGenerator.cs using FFmpeg's silenceremove filter:

// silenceremove: stop_periods=-1 processes ALL silence gaps (not just first)
// stop_duration = max allowed silence length; stop_threshold = silence detection level
var filter = $"silenceremove=stop_periods=-1:stop_duration={maxSilence}:stop_threshold=-40dB";

Key parameters:

stop_periods=-1 — process ALL internal silence gaps, not just the first one
stop_duration — configurable maximum silence duration (default: 150ms)
stop_threshold=-40dB — silence detection threshold

Why this filter configuration: The existing TrimSilenceStartAndEnd only removes silence at the beginning and end of audio. By using stop_periods=-1, we target every pause inside the audio. The -40dB threshold is tuned for TTS output which has a clean noise floor. The configurable stop_duration parameter lets users control how much pause to keep — 150ms is a natural inter-word gap that still sounds comfortable.
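As a rough illustration of how the filter string above is assembled, here is a minimal Python sketch (not the project's C# code). The function name `build_silence_filter` and its parameters are hypothetical. One detail worth checking in any implementation is that the duration is always written with a `.` decimal separator, since that is what FFmpeg's option parser expects.

```python
def build_silence_filter(max_silence_seconds: float, threshold_db: int = -40) -> str:
    """Return a silenceremove filter string shaped like the one quoted above.

    Hypothetical helper for illustration; names are not from the PR.
    """
    if max_silence_seconds <= 0:
        raise ValueError("max_silence_seconds must be positive")
    # Fixed-point formatting in Python always uses '.' as the decimal
    # separator, independent of the user's locale.
    duration = f"{max_silence_seconds:.3f}"
    return (
        "silenceremove=stop_periods=-1"
        f":stop_duration={duration}"
        f":stop_threshold={threshold_db}dB"
    )


print(build_silence_filter(0.15))
```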

Example scenario: A 5-second TTS clip needs to fit in 4 seconds. The audio has 1.2 seconds of internal pauses. After VAD compression (capping pauses at 150ms), the audio is 4.1 seconds — now only 1.025x speed-up is needed instead of 1.25x. The difference is barely audible vs. clearly noticeable.
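The arithmetic in this scenario can be checked with a short Python sketch. The scenario only gives the total pause time (1.2 s) and the result (4.1 s); the split into two 0.6 s pauses below is an assumption chosen so that 0.3 s of silence remains after capping. Both function names are hypothetical.

```python
def compressed_duration(total_seconds, pauses, max_pause):
    """Duration after capping every internal pause at max_pause seconds."""
    removed = sum(max(0.0, pause - max_pause) for pause in pauses)
    return total_seconds - removed


def required_speed_factor(audio_seconds, slot_seconds):
    """Tempo factor needed to fit the slot; 1.0 means no speed-up."""
    return max(1.0, audio_seconds / slot_seconds)


# Assumed breakdown: two 0.6 s pauses (1.2 s of internal silence in total).
after_vad = compressed_duration(5.0, pauses=[0.6, 0.6], max_pause=0.15)
print(f"without compression: {required_speed_factor(5.0, 4.0):.3f}x")
print(f"with compression:    {required_speed_factor(after_vad, 4.0):.3f}x")
```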

2. High-Quality Time-Stretching via Rubberband/WSOLA (New)

The idea: When speed-up is still needed after silence compression, use a higher-quality algorithm than the default atempo. FFmpeg's rubberband filter uses the Rubber Band library, a dedicated time-stretching library that changes tempo while preserving the original pitch and exposes tuning options that suit speech.

Implementation: New ChangeSpeedHighQuality() method in FfmpegGenerator.cs:

var filter = $"rubberband=tempo={speedStr}:transients=smooth:engine=faster:window=short";

Parameter choices explained:

transients=smooth — smoother transient handling, better for speech than the default "crisp" mode which is designed for percussive music
engine=faster — uses the faster processing engine (sufficient quality for speech, where the frequency content is simpler than music)
window=short — short analysis window, better for speech which has rapidly changing formants vs. sustained musical notes

Automatic fallback: Not all FFmpeg builds include librubberband. The code handles this gracefully:

// If rubberband output file wasn't created, fall back to atempo
if (doHighQualityStretch && (!File.Exists(outputFileName2) || new FileInfo(outputFileName2).Length == 0))
{
    var fallbackProcess = FfmpegGenerator.ChangeSpeed(currentFile, outputFileName2, (float)factor);
    ...
}
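The fallback check above can be sketched in Python as follows. This is an illustration under stated assumptions, not the PR's implementation: the two stretcher functions are stand-ins for the real FFmpeg calls, and all names are hypothetical. The sketch also deletes any stale output file first, so that a leftover file from an earlier run cannot pass the existence check.

```python
import tempfile
from pathlib import Path


def output_is_usable(path: Path) -> bool:
    """True when the file exists and is not empty."""
    return path.is_file() and path.stat().st_size > 0


def stretch_with_fallback(src, dst, factor, primary, fallback) -> str:
    """Run primary; if it leaves no usable output, run fallback instead."""
    dst.unlink(missing_ok=True)  # a stale file must not pass the check
    primary(src, dst, factor)
    if output_is_usable(dst):
        return "primary"
    fallback(src, dst, factor)
    if not output_is_usable(dst):
        raise RuntimeError(f"no stretcher produced output for {src}")
    return "fallback"


# Stand-ins for the real FFmpeg calls (hypothetical, for illustration only).
def failing_stretcher(src, dst, factor):
    pass  # simulates a build without librubberband: nothing is written


def working_stretcher(src, dst, factor):
    dst.write_bytes(b"fake audio")


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "out.wav"
    used = stretch_with_fallback(Path(tmp) / "in.wav", out, 1.2,
                                 failing_stretcher, working_stretcher)
    print(used)
```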

Rubberband detection: A new IsRubberbandAvailable() method runs ffmpeg -filters and checks for rubberband presence. The result is displayed in the Advanced settings UI as "(installed)" or "(not found in FFmpeg)" next to the checkbox — similar to how Whisper models show installation status.
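A detection check of this kind could look roughly like the Python sketch below. It assumes that filter rows in the `ffmpeg -filters` listing have the shape "flags, name, input->output, description" and matches the name column exactly, so a mention of the word inside another filter's description does not count. The function names and the sample text are illustrative, not taken from the PR or from a real build.

```python
import subprocess


def has_filter(filters_output: str, name: str) -> bool:
    """Return True if `name` appears as a filter row in `ffmpeg -filters` output."""
    for line in filters_output.splitlines():
        columns = line.split()
        # Assumed row shape: <flags> <name> <in->out> <description...>
        if len(columns) >= 3 and columns[1] == name:
            return True
    return False


def ffmpeg_has_rubberband(ffmpeg_path: str = "ffmpeg") -> bool:
    """Ask the given FFmpeg binary whether it lists the rubberband filter."""
    try:
        result = subprocess.run(
            [ffmpeg_path, "-hide_banner", "-filters"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False  # FFmpeg missing or unresponsive: treat as unavailable
    return has_filter(result.stdout, "rubberband")


# Illustrative excerpt, not captured from a real build.
sample = """Filters:
  T.. = Timeline support
 ... atempo            A->A       Adjust audio tempo.
 ... rubberband        A->A       Apply time-stretching and pitch-shifting.
"""
print(has_filter(sample, "rubberband"))
```

Caching the result once per session avoids spawning FFmpeg every time the settings window opens.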

3. New Three-Stage Audio Pipeline

The FixSpeed() method in TextToSpeechViewModel.cs now implements a three-stage pipeline:

Stage 1: Trim silence from start/end (existing behavior, unchanged)
    ↓
Stage 2: VAD silence compression — shorten internal pauses (NEW, optional)
    ↓
  Check: does audio fit now? → YES → done, no speed change needed
    ↓ NO
Stage 3: Time-stretch to fit duration (improved: rubberband option)

This pipeline applies uniformly to all TTS engines (Edge-TTS, Piper, ElevenLabs, Azure, Google, AllTalk, Murf). This is correct because:

No engine does VAD or duration-fitting server-side — all engines return raw audio of whatever duration they naturally generate
No engine does silence compression — internal pauses are part of the TTS output
Engine-specific speed parameters (Edge-TTS rate, ElevenLabs speed) control the base speaking rate at generation time, while our pipeline handles the post-generation timing adjustment — these are complementary, not duplicating

The same pipeline was also applied to TrimAndAdjustSpeed() in ReviewSpeechViewModel.cs, so that regenerated segments during review use the same processing.
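The three-stage flow described above can be summarized in a small Python sketch. It follows the diagram literally (trim, optional compression, fit check, stretch) and treats the three processing steps as injected functions, so it makes no claim about how `FixSpeed()` is actually structured. `Clip`, `fit_to_slot`, and the stand-in lambdas are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    duration: float  # seconds


def fit_to_slot(clip, slot_seconds, trim, compress, stretch, vad_enabled=True):
    """Return (processed clip, applied speed factor)."""
    clip = trim(clip)                      # Stage 1: trim leading/trailing silence
    if vad_enabled:
        clip = compress(clip)              # Stage 2: shorten internal pauses
    if clip.duration <= slot_seconds:
        return clip, 1.0                   # fits already: no tempo change
    factor = clip.duration / slot_seconds  # Stage 3: stretch only as much as needed
    return stretch(clip, factor), factor


# Stand-in processing steps with made-up effects, for illustration only.
trim = lambda c: c
compress = lambda c: Clip(c.duration - 0.9)
stretch = lambda c, f: Clip(c.duration / f)

result, factor = fit_to_slot(Clip(5.0), 4.0, trim, compress, stretch)
print(f"{result.duration:.2f} s at {factor:.3f}x")
```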

4. Advanced TTS Settings Window (UI Refactor)

The problem: The main TTS window was becoming cluttered with too many controls (pro audio, ducking, VAD, time-stretch, silence padding, sample rate, Edge-TTS prosody). This made the UI intimidating for basic use.

Solution: Created a new AdvancedTtsSettings window (following the existing project pattern used by EncodingSettings and VoiceSettings), accessible via an "Advanced..." button.

The main TTS window now shows only:

TTS - Review audio segments (checkbox)
Add audio to video file (checkbox + encoding settings gear)
Advanced... (button)

The Advanced window groups all post-processing options and adds a description of what each option does.

Settings persistence: All settings are saved via Se.SaveSettings() when the user clicks OK. Settings persist across sessions — the user configures once and all subsequent TTS generations use the same settings.

Files Changed

Modified Files

| File | Changes |
|---|---|
| `src/UI/Logic/Media/FfmpegGenerator.cs` | Added 3 new methods: `IsRubberbandAvailable()`, `CompressInternalSilence()`, `ChangeSpeedHighQuality()` |
| `src/UI/Logic/Config/SeVideoTextToSpeech.cs` | Added 3 new settings: `VadSilenceCompressionEnabled`, `VadMaxSilenceSeconds`, `HighQualityTimeStretchEnabled` |
| `src/UI/Features/Video/TextToSpeech/TextToSpeechViewModel.cs` | Refactored `FixSpeed()` to 3-stage pipeline; moved advanced settings to new window; added `ShowAdvancedSettings` command |
| `src/UI/Features/Video/TextToSpeech/TextToSpeechWindow.cs` | Simplified UI: removed advanced controls, added "Advanced..." button |
| `src/UI/Features/Video/TextToSpeech/ReviewSpeech/ReviewSpeechViewModel.cs` | Updated `TrimAndAdjustSpeed()` to use same VAD + rubberband pipeline |
| `src/UI/DependencyInjectionExtensions.cs` | Registered `AdvancedTtsSettingsViewModel` in DI container |

New Files

| File | Purpose |
|---|---|
| `src/UI/Features/Video/TextToSpeech/AdvancedTtsSettings/AdvancedTtsSettingsViewModel.cs` | ViewModel for advanced settings dialog; loads/saves all post-processing settings |
| `src/UI/Features/Video/TextToSpeech/AdvancedTtsSettings/AdvancedTtsSettingsWindow.cs` | Avalonia Window with described settings sections, engine-specific visibility, rubberband status detection |

Compatibility

Cross-platform: All changes are pure C# .NET — no platform-specific code, no new NuGet dependencies
Linux/macOS: silenceremove filter is standard FFmpeg, available everywhere. rubberband filter is optional with automatic fallback to atempo
Backwards compatible: The new features are disabled by default, so existing behavior is unchanged unless the user explicitly enables them
All TTS engines: VAD compression and time-stretching improvements apply to all 7 supported engines (Edge-TTS, Piper, ElevenLabs, Azure Speech, Google Speech, AllTalk, Murf)

Testing

Tested on:

OS: Windows 11 Home 64-bit (build 10.0.22631)
CPU: AMD Ryzen 7 5800X3D 8-Core Processor
FFmpeg: 8.0-full_build (gyan.dev Windows build — includes librubberband)
Runtime: .NET 10.0.104

Scenarios verified:

VAD silence compression shortens internal pauses without affecting speech
High-quality rubberband time-stretch produces cleaner output than atempo at 1.3x–1.5x speed factors
Automatic fallback to atempo when rubberband is unavailable
Advanced settings window opens, all controls visible, settings persist after restart
Edge-TTS-specific controls (rate/pitch/volume) show only when Edge-TTS engine is selected
All existing TTS engines (Piper, ElevenLabs, Azure, Google, AllTalk, Murf) remain unaffected when new options are disabled

Commit eb9f24b: TTS Audio Quality: VAD silence compression, WSOLA time-stretching, Advanced Settings UI

niksedk merged commit 877837a into SubtitleEdit:main Mar 23, 2026
1 check passed

niksedk merged 1 commit into SubtitleEdit:main from Ironship:feature/tts-vad-wsola-advanced-settings
