TTS Audio Quality Improvements: VAD Silence Compression, WSOLA Time-Stretching & Advanced Settings UI #10382
Merged
niksedk merged 1 commit into SubtitleEdit:main on Mar 23, 2026
Context
This is a continuation of the Edge-TTS engine integration (PR #10378). While that PR added Edge-TTS support with basic prosody controls (rate/pitch/volume), pro audio chain, audio ducking, silence padding, and sample rate conversion, this PR focuses on solving a fundamental problem that affects all TTS engines: what happens when generated audio doesn't fit the subtitle duration.
The Problem
When TTS generates speech for a subtitle segment, the audio often exceeds the available time window. This is especially common when translating between languages with different information density (e.g., English → Spanish or German translations are typically 20-40% longer).
The previous approach was:
- Speed up the audio to fit the window using FFmpeg's atempo filter
This worked, but had two significant quality issues:
- Although atempo preserves pitch, its phase vocoder algorithm produces artifacts on speech at higher speed factors (1.3x+), making voices sound unnatural
What This PR Adds
1. VAD-Based Internal Silence Compression (New)
The idea: Before touching the tempo of speech, first attack the silence. A Voice Activity Detection approach precisely identifies and shortens gaps between words/phrases. This is optimal because it doesn't distort any phonemes.
Implementation: New CompressInternalSilence() method in FfmpegGenerator.cs using FFmpeg's silenceremove filter.
Key parameters:
- stop_periods=-1: process all internal silence gaps, not just the first one
- stop_duration: configurable maximum silence duration (default: 150ms)
- stop_threshold=-40dB: silence detection threshold
Why this filter configuration: The existing TrimSilenceStartAndEnd only removes silence at the beginning and end of audio. By using stop_periods=-1, we target every pause inside the audio. The -40dB threshold is tuned for TTS output, which has a clean noise floor. The configurable stop_duration parameter lets users control how much pause to keep; 150ms is a natural inter-word gap that still sounds comfortable.
Example scenario: A 5-second TTS clip needs to fit in 4 seconds. The audio has 1.2 seconds of internal pauses. After VAD compression (capping pauses at 150ms), the audio is 4.1 seconds, so only a 1.025x speed-up is needed instead of 1.25x. The difference is barely audible vs. clearly noticeable.
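To make the filter configuration and the example scenario concrete, here is a minimal C# sketch. It is not the PR's CompressInternalSilence() code: the class and method names are hypothetical, the filter string is assembled only from the three parameters listed above, and splitting the 1.2 seconds of pauses into two 0.6-second gaps is an assumption made purely for the arithmetic.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Illustrative sketch only. SilenceSketch and its members are hypothetical names,
// not the actual implementation in FfmpegGenerator.cs.
public static class SilenceSketch
{
    // Builds a silenceremove filter string from the three parameters described above.
    public static string BuildFilter(double maxSilenceSeconds = 0.15, int thresholdDb = -40)
    {
        // Invariant culture keeps "." as the decimal separator regardless of OS locale.
        var duration = maxSilenceSeconds.ToString("0.###", CultureInfo.InvariantCulture);
        var threshold = thresholdDb.ToString(CultureInfo.InvariantCulture);
        return $"silenceremove=stop_periods=-1:stop_duration={duration}:stop_threshold={threshold}dB";
    }

    // Rough estimate of clip length after every pause longer than the cap is shortened to the cap.
    public static double EstimateCompressedSeconds(double clipSeconds, double[] pauseSeconds, double capSeconds)
    {
        var removed = pauseSeconds.Sum(p => Math.Max(0.0, p - capSeconds));
        return clipSeconds - removed;
    }

    // Speed factor still needed to fit the subtitle window (1.0 means the audio already fits).
    public static double RequiredSpeedFactor(double audioSeconds, double windowSeconds)
    {
        return Math.Max(1.0, audioSeconds / windowSeconds);
    }
}
```

With the assumed pauses, EstimateCompressedSeconds(5.0, new[] { 0.6, 0.6 }, 0.15) comes out at about 4.1 seconds, and RequiredSpeedFactor then gives about 1.025 for the 4-second window, compared with 1.25 for the uncompressed clip. Real silenceremove output depends on how the threshold classifies the audio, so the estimate is only a rough guide.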
2. High-Quality Time-Stretching via Rubberband/WSOLA (New)
The idea: When speed-up is still needed after silence compression, use a better algorithm than the default atempo. The rubberband library implements WSOLA (Waveform Similarity Overlap-Add), which cuts the audio waveform into overlapping micro-segments and "squeezes" them by removing redundant wave portions while preserving the original pitch.
Implementation: New ChangeSpeedHighQuality() method in FfmpegGenerator.cs.
Parameter choices explained:
- transients=smooth: smoother transient handling, better for speech than the default "crisp" mode, which is designed for percussive music
- engine=faster: uses the faster processing engine (sufficient quality for speech, where the frequency content is simpler than music)
- window=short: short analysis window, better suited to speech, with its rapidly changing formants, than to sustained musical notes
Automatic fallback: Not all FFmpeg builds include librubberband. The code handles this gracefully by falling back to atempo.
Rubberband detection: A new IsRubberbandAvailable() method runs ffmpeg -filters and checks for rubberband presence. The result is displayed in the Advanced settings UI as "(installed)" or "(not found in FFmpeg)" next to the checkbox, similar to how Whisper models show installation status.
3. New Three-Stage Audio Pipeline
The FixSpeed() method in TextToSpeechViewModel.cs now implements a three-stage pipeline.
This pipeline applies uniformly to all TTS engines (Edge-TTS, Piper, ElevenLabs, Azure, Google, AllTalk, Murf). This is correct because:
- Engine-level speed settings (such as rate or ElevenLabs speed) control the base speaking rate at generation time, while our pipeline handles the post-generation timing adjustment; these are complementary, not redundant
The same pipeline was also applied to TrimAndAdjustSpeed() in ReviewSpeechViewModel.cs, so that regenerated segments during review use the same processing.
4. Advanced TTS Settings Window (UI Refactor)
The problem: The main TTS window was becoming cluttered with too many controls (pro audio, ducking, VAD, time-stretch, silence padding, sample rate, Edge-TTS prosody). This made the UI intimidating for basic use.
Solution: Created a new AdvancedTtsSettings window (following the existing project pattern used by EncodingSettings and VoiceSettings), accessible via an "Advanced..." button.
The main TTS window now shows only the essential controls.
The Advanced window groups all post-processing options, with descriptions explaining what each option does.
Settings persistence: All settings are saved via Se.SaveSettings() when the user clicks OK. Settings persist across sessions: the user configures once and all subsequent TTS generations use the same settings.
Files Changed
Modified Files
| File | Changes |
|---|---|
| src/UI/Logic/Media/FfmpegGenerator.cs | IsRubberbandAvailable(), CompressInternalSilence(), ChangeSpeedHighQuality() |
| src/UI/Logic/Config/SeVideoTextToSpeech.cs | VadSilenceCompressionEnabled, VadMaxSilenceSeconds, HighQualityTimeStretchEnabled |
| src/UI/Features/Video/TextToSpeech/TextToSpeechViewModel.cs | FixSpeed() now implements the 3-stage pipeline; moved advanced settings to new window; added ShowAdvancedSettings command |
| src/UI/Features/Video/TextToSpeech/TextToSpeechWindow.cs | |
| src/UI/Features/Video/TextToSpeech/ReviewSpeech/ReviewSpeechViewModel.cs | TrimAndAdjustSpeed() now uses the same VAD + rubberband pipeline |
| src/UI/DependencyInjectionExtensions.cs | AdvancedTtsSettingsViewModel registered in DI container |
New Files
- src/UI/Features/Video/TextToSpeech/AdvancedTtsSettings/AdvancedTtsSettingsViewModel.cs
- src/UI/Features/Video/TextToSpeech/AdvancedTtsSettings/AdvancedTtsSettingsWindow.cs
Compatibility
- The silenceremove filter is standard FFmpeg, available everywhere.
- The rubberband filter is optional, with automatic fallback to atempo.
- New settings default to false/0, so existing behavior is unchanged unless the user explicitly enables the new features.
Testing
Tested on:
- […] librubberband)
Scenarios verified:
- […] atempo at 1.3x–1.5x speed factors
- […] atempo when rubberband is unavailable
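The fallback scenario can be sketched as a small filter-selection helper. This is an illustration under assumptions, not the PR's IsRubberbandAvailable() or ChangeSpeedHighQuality() code: the class and method names are hypothetical, and the parser assumes each line printed by ffmpeg -filters has a flags column followed by the filter name. The rubberband string passes only tempo, transients and window, since the sketch does not assume that every FFmpeg build accepts the engine option.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Illustrative sketch only. TimeStretchSketch and its members are hypothetical names.
public static class TimeStretchSketch
{
    // Returns true if the text printed by `ffmpeg -filters` lists a filter named "rubberband".
    // Assumes the second whitespace-separated token on each line is the filter name.
    public static bool ListsRubberband(string ffmpegFiltersOutput)
    {
        return ffmpegFiltersOutput
            .Split('\n')
            .Select(line => line.Split(new[] { ' ', '\t', '\r' }, StringSplitOptions.RemoveEmptyEntries))
            .Any(tokens => tokens.Length >= 2 && tokens[1] == "rubberband");
    }

    // Picks the audio filter for a given speed factor, falling back to atempo
    // when high-quality stretching is disabled or rubberband is missing.
    public static string ChooseFilter(double speedFactor, bool highQualityEnabled, bool rubberbandAvailable)
    {
        var tempo = speedFactor.ToString("0.###", CultureInfo.InvariantCulture);
        if (highQualityEnabled && rubberbandAvailable)
        {
            return $"rubberband=tempo={tempo}:transients=smooth:window=short";
        }
        return $"atempo={tempo}";
    }
}
```

Matching on the exact filter name rather than a substring avoids a false positive if the word appears only in another filter's description. Caching the detection result would avoid launching FFmpeg for every segment, though whether the PR does this is not stated.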