TTS Audio Quality Improvements: VAD Silence Compression, WSOLA Time-Stretching & Advanced Settings UI #10382

Merged
niksedk merged 1 commit into SubtitleEdit:main from Ironship:feature/tts-vad-wsola-advanced-settings
Mar 23, 2026


Ironship (Contributor) commented Mar 23, 2026

Context

This is a continuation of the Edge-TTS engine integration (PR #10378). While that PR added Edge-TTS support with basic prosody controls (rate/pitch/volume), pro audio chain, audio ducking, silence padding, and sample rate conversion, this PR focuses on solving a fundamental problem that affects all TTS engines: what happens when generated audio doesn't fit the subtitle duration.

The Problem

When TTS generates speech for a subtitle segment, the audio often exceeds the available time window. This is especially common when translating between languages with different information density (e.g., English → Spanish or German translations are typically 20-40% longer).

The previous approach was:

  1. Trim silence from start/end of the audio
  2. Speed up the entire audio using FFmpeg's atempo filter

This worked, but had two significant quality issues:

  • Unnecessary speed-up: The audio contained internal pauses between words/phrases that could be shortened first, reducing or eliminating the need for tempo changes
  • Artifacts at high factors: atempo preserves pitch, so this is not a true pitch-shifted "chipmunk" effect, but it still produces audible artifacts on speech at higher speed factors (1.3x+), making voices sound unnatural

What This PR Adds

1. VAD-Based Internal Silence Compression (New)

The idea: Before touching the tempo of speech, first attack the silence. A Voice Activity Detection (VAD) style pass identifies the gaps between words/phrases and shortens them. This is the least intrusive way to save time, because shortening silence does not distort any phonemes.

Implementation: New CompressInternalSilence() method in FfmpegGenerator.cs using FFmpeg's silenceremove filter:

// silenceremove: stop_periods=-1 processes ALL silence gaps (not just first)
// stop_duration = max allowed silence length; stop_threshold = silence detection level
var filter = $"silenceremove=stop_periods=-1:stop_duration={maxSilence}:stop_threshold=-40dB";

Key parameters:

  • stop_periods=-1 — process ALL internal silence gaps, not just the first one
  • stop_duration — configurable maximum silence duration (default: 150ms)
  • stop_threshold=-40dB — silence detection threshold

Why this filter configuration: The existing TrimSilenceStartAndEnd only removes silence at the beginning and end of audio. By using stop_periods=-1, we target every pause inside the audio. The -40dB threshold is tuned for TTS output, which has a clean noise floor. The configurable stop_duration parameter lets users control how much pause to keep; 150ms is a natural inter-word gap that still sounds comfortable.
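The filter string described above could be assembled as in the sketch below. `BuildSilenceRemoveFilter` is a hypothetical helper, not the PR's `CompressInternalSilence()`. It assumes the maximum silence is held as a `double` in seconds, and it formats with the invariant culture so that locales with a decimal comma do not emit an invalid `stop_duration=0,15`.

```csharp
using System;

public static class SilenceFilterSketch
{
    // Hypothetical helper for illustration; names and signature are assumptions.
    public static string BuildSilenceRemoveFilter(double maxSilenceSeconds, int thresholdDb = -40)
    {
        if (double.IsNaN(maxSilenceSeconds) || maxSilenceSeconds <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(maxSilenceSeconds), "Must be positive.");
        }

        // FormattableString.Invariant forces "." as the decimal separator
        // regardless of the user's regional settings.
        return FormattableString.Invariant(
            $"silenceremove=stop_periods=-1:stop_duration={maxSilenceSeconds:0.###}:stop_threshold={thresholdDb}dB");
    }
}
```

With the 150ms default mentioned above, `BuildSilenceRemoveFilter(0.15)` yields the same filter text as the snippet in this section, with `stop_duration=0.15`.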

Example scenario: A 5-second TTS clip needs to fit in 4 seconds. The audio has 1.2 seconds of internal pauses. After VAD compression (capping pauses at 150ms), the audio is 4.1 seconds — now only 1.025x speed-up is needed instead of 1.25x. The difference is barely audible vs. clearly noticeable.
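The arithmetic in this scenario can be checked with a small sketch. The helper names are hypothetical, and the split of the 1.2 seconds into two 0.6-second pauses is an assumption made only so the numbers work out; the scenario itself states just the total.

```csharp
using System;
using System.Linq;

public static class SpeedFactorSketch
{
    // Duration after capping every internal pause at maxPauseSeconds.
    // Pauses already shorter than the cap are left untouched.
    public static double CompressedDuration(double totalSeconds, double[] pauseSeconds, double maxPauseSeconds)
    {
        var removed = pauseSeconds.Sum(p => Math.Max(0.0, p - maxPauseSeconds));
        return totalSeconds - removed;
    }

    // Speed-up needed to fit the audio into the subtitle window.
    // A value of 1.0 means the audio already fits and no tempo change is needed.
    public static double RequiredSpeedFactor(double audioSeconds, double windowSeconds)
    {
        return Math.Max(1.0, audioSeconds / windowSeconds);
    }
}
```

For a 5-second clip with two 0.6-second pauses capped at 0.15 seconds, `CompressedDuration` gives about 4.1 seconds, and `RequiredSpeedFactor` for a 4-second window drops from 1.25 to about 1.025.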

2. High-Quality Time-Stretching via Rubberband (New)

The idea: When speed-up is still needed after silence compression, use a better algorithm than the default atempo. FFmpeg's rubberband filter wraps the Rubber Band Library, a dedicated time-stretching and pitch-shifting library that changes tempo while preserving the original pitch. Note that WSOLA (Waveform Similarity Overlap-Add), which cuts the waveform into overlapping micro-segments and drops redundant portions, is the technique atempo itself is built on; Rubber Band works in the frequency domain with a phase-vocoder-style design.

Implementation: New ChangeSpeedHighQuality() method in FfmpegGenerator.cs:

var filter = $"rubberband=tempo={speedStr}:transients=smooth:engine=faster:window=short";

Parameter choices explained:

  • transients=smooth — smoother transient handling, better for speech than the default "crisp" mode which is designed for percussive music
  • engine=faster — uses the faster processing engine (sufficient quality for speech, where the frequency content is simpler than music)
  • window=short — short analysis window, better for speech which has rapidly changing formants vs. sustained musical notes

Automatic fallback: Not all FFmpeg builds include librubberband. The code handles this gracefully:

// If rubberband output file wasn't created, fall back to atempo
if (doHighQualityStretch && (!File.Exists(outputFileName2) || new FileInfo(outputFileName2).Length == 0))
{
    var fallbackProcess = FfmpegGenerator.ChangeSpeed(currentFile, outputFileName2, (float)factor);
    ...
}
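The fallback check above can be isolated into a small, testable unit. This is a sketch under assumptions: `StretchWithFallback` and its delegate parameters are hypothetical stand-ins for the rubberband and atempo FFmpeg invocations, which are not reproduced here.

```csharp
using System;
using System.IO;

public static class StretchFallbackSketch
{
    // An output counts as usable only if the file exists and is non-empty,
    // mirroring the File.Exists / Length == 0 check in the snippet above.
    public static bool OutputIsUsable(string path)
    {
        return File.Exists(path) && new FileInfo(path).Length > 0;
    }

    // Runs the high-quality stretch first and falls back when it leaves
    // no usable output. Returns the name of the path that produced the file.
    public static string StretchWithFallback(string outputPath, Action<string> highQuality, Action<string> fallback)
    {
        highQuality(outputPath);
        if (OutputIsUsable(outputPath))
        {
            return "rubberband";
        }

        fallback(outputPath);
        if (!OutputIsUsable(outputPath))
        {
            throw new IOException("Neither stretch method produced output: " + outputPath);
        }

        return "atempo";
    }
}
```

Checking the output file rather than the process exit code keeps the decision independent of how a particular FFmpeg build reports an unknown filter.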

Rubberband detection: A new IsRubberbandAvailable() method runs ffmpeg -filters and checks for rubberband presence. The result is displayed in the Advanced settings UI as "(installed)" or "(not found in FFmpeg)" next to the checkbox — similar to how Whisper models show installation status.
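A detection routine along these lines might look like the sketch below. It is not the PR's `IsRubberbandAvailable()`; the class and method names are hypothetical, and it assumes the usual `ffmpeg -filters` layout in which each row is a flags column followed by the filter name. Matching on the name column rather than on the whole line avoids false positives from filter descriptions.

```csharp
using System;
using System.Diagnostics;

public static class FilterProbeSketch
{
    // Pure parsing step: true if any row has filterName in its second column.
    public static bool ListsFilter(string filtersOutput, string filterName)
    {
        foreach (var line in filtersOutput.Split('\n'))
        {
            var tokens = line.Split(new[] { ' ', '\t', '\r' }, StringSplitOptions.RemoveEmptyEntries);
            if (tokens.Length >= 2 && tokens[1] == filterName)
            {
                return true;
            }
        }

        return false;
    }

    // Runs "ffmpeg -hide_banner -filters" and parses its standard output.
    // Any failure to start or run the process is treated as "not available".
    public static bool IsFilterAvailable(string ffmpegPath, string filterName)
    {
        try
        {
            var startInfo = new ProcessStartInfo
            {
                FileName = ffmpegPath,
                Arguments = "-hide_banner -filters",
                RedirectStandardOutput = true,
                UseShellExecute = false,
                CreateNoWindow = true,
            };

            using var process = Process.Start(startInfo);
            if (process == null)
            {
                return false;
            }

            var output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return ListsFilter(output, filterName);
        }
        catch (Exception)
        {
            return false;
        }
    }
}
```

Since the filter list does not change while the application runs, the result is a natural candidate for caching after the first call.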

3. New Three-Stage Audio Pipeline

The FixSpeed() method in TextToSpeechViewModel.cs now implements a three-stage pipeline:

Stage 1: Trim silence from start/end (existing behavior, unchanged)
    ↓
Stage 2: VAD silence compression — shorten internal pauses (NEW, optional)
    ↓
  Check: does audio fit now? → YES → done, no speed change needed
    ↓ NO
Stage 3: Time-stretch to fit duration (improved: rubberband option)

This pipeline applies uniformly to all TTS engines (Edge-TTS, Piper, ElevenLabs, Azure, Google, AllTalk, Murf). This is correct because:

  • No engine does VAD or duration-fitting server-side — all engines return raw audio of whatever duration they naturally generate
  • No engine does silence compression — internal pauses are part of the TTS output
  • Engine-specific speed parameters (Edge-TTS rate, ElevenLabs speed) control the base speaking rate at generation time, while our pipeline handles the post-generation timing adjustment — these are complementary, not duplicating

The same pipeline was also applied to TrimAndAdjustSpeed() in ReviewSpeechViewModel.cs, so that regenerated segments during review use the same processing.
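The control flow of the three stages can be sketched on durations alone. This is an illustration, not the PR's `FixSpeed()`: the `Fit` method, the `FitResult` type, and the delegates that stand in for the FFmpeg steps are all hypothetical. Passing an identity function for `compressSilence` models the case where the optional stage is disabled.

```csharp
using System;

public static class FitPipelineSketch
{
    public sealed class FitResult
    {
        public double DurationSeconds { get; set; }
        public double SpeedFactor { get; set; }
    }

    // trimEdges and compressSilence map a duration to the duration after that stage.
    public static FitResult Fit(
        double rawSeconds,
        double windowSeconds,
        Func<double, double> trimEdges,
        Func<double, double> compressSilence)
    {
        // Stage 1: trim leading and trailing silence.
        var duration = trimEdges(rawSeconds);

        // Stage 2: shorten internal pauses (identity function when disabled).
        duration = compressSilence(duration);

        // Check: if the audio fits now, skip the tempo change entirely.
        if (duration <= windowSeconds)
        {
            return new FitResult { DurationSeconds = duration, SpeedFactor = 1.0 };
        }

        // Stage 3: time-stretch by exactly the factor still required.
        return new FitResult { DurationSeconds = windowSeconds, SpeedFactor = duration / windowSeconds };
    }
}
```

The early return is the point of the ordering: every second removed in stages 1 and 2 reduces the factor stage 3 has to apply, and can remove the need for it altogether.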

4. Advanced TTS Settings Window (UI Refactor)

The problem: The main TTS window was becoming cluttered with too many controls (pro audio, ducking, VAD, time-stretch, silence padding, sample rate, Edge-TTS prosody). This made the UI intimidating for basic use.

Solution: Created a new AdvancedTtsSettings window (following the existing project pattern used by EncodingSettings and VoiceSettings), accessible via an "Advanced..." button.

The main TTS window now shows only:

  • TTS - Review audio segments (checkbox)
  • Add audio to video file (checkbox + encoding settings gear)
  • Advanced... (button)

[Screenshot: simplified main TTS window]

The Advanced window groups all post-processing options with descriptions explaining what each option does:

[Screenshot: Advanced TTS settings window]

Settings persistence: All settings are saved via Se.SaveSettings() when the user clicks OK. Settings persist across sessions — the user configures once and all subsequent TTS generations use the same settings.

Files Changed

Modified Files

| File | Changes |
| --- | --- |
| src/UI/Logic/Media/FfmpegGenerator.cs | Added 3 new methods: IsRubberbandAvailable(), CompressInternalSilence(), ChangeSpeedHighQuality() |
| src/UI/Logic/Config/SeVideoTextToSpeech.cs | Added 3 new settings: VadSilenceCompressionEnabled, VadMaxSilenceSeconds, HighQualityTimeStretchEnabled |
| src/UI/Features/Video/TextToSpeech/TextToSpeechViewModel.cs | Refactored FixSpeed() to 3-stage pipeline; moved advanced settings to new window; added ShowAdvancedSettings command |
| src/UI/Features/Video/TextToSpeech/TextToSpeechWindow.cs | Simplified UI: removed advanced controls, added "Advanced..." button |
| src/UI/Features/Video/TextToSpeech/ReviewSpeech/ReviewSpeechViewModel.cs | Updated TrimAndAdjustSpeed() to use same VAD + rubberband pipeline |
| src/UI/DependencyInjectionExtensions.cs | Registered AdvancedTtsSettingsViewModel in DI container |

New Files

| File | Purpose |
| --- | --- |
| src/UI/Features/Video/TextToSpeech/AdvancedTtsSettings/AdvancedTtsSettingsViewModel.cs | ViewModel for advanced settings dialog; loads/saves all post-processing settings |
| src/UI/Features/Video/TextToSpeech/AdvancedTtsSettings/AdvancedTtsSettingsWindow.cs | Avalonia Window with described settings sections, engine-specific visibility, rubberband status detection |

Compatibility

  • Cross-platform: All changes are pure C# .NET — no platform-specific code, no new NuGet dependencies
  • Linux/macOS: silenceremove filter is standard FFmpeg, available everywhere. rubberband filter is optional with automatic fallback to atempo
  • Backwards compatible: The new features (VAD silence compression and high-quality time-stretch) are disabled by default, so existing behavior is unchanged unless the user explicitly enables them
  • All TTS engines: VAD compression and time-stretching improvements apply to all 7 supported engines (Edge-TTS, Piper, ElevenLabs, Azure Speech, Google Speech, AllTalk, Murf)

Testing

Tested on:

  • OS: Windows 11 Home 64-bit (build 10.0.22631)
  • CPU: AMD Ryzen 7 5800X3D 8-Core Processor
  • FFmpeg: 8.0-full_build (gyan.dev Windows build — includes librubberband)
  • Runtime: .NET 10.0.104

Scenarios verified:

  • VAD silence compression shortens internal pauses without affecting speech
  • High-quality rubberband time-stretch produces cleaner output than atempo at 1.3x–1.5x speed factors
  • Automatic fallback to atempo when rubberband is unavailable
  • Advanced settings window opens, all controls visible, settings persist after restart
  • Edge-TTS-specific controls (rate/pitch/volume) show only when Edge-TTS engine is selected
  • All existing TTS engines (Piper, ElevenLabs, Azure, Google, AllTalk, Murf) remain unaffected when new options are disabled

niksedk merged commit 877837a into SubtitleEdit:main on Mar 23, 2026
1 check passed