05/29/2026
- Status: Stable, all features work, still developing
- Major Integrations and Modernizations
Added
Fish Audio S2-Pro Integration
- `extreme` Keyword — Switch TTS engine from Qwen3-TTS to Fish Audio S2-Pro for higher quality voice cloning and 80+ language support
- Fish Audio S2-Pro — Dual-autoregressive (4B + 400M) model with RVQ-based codec, voice effects via `[tag]` syntax
- Train Extreme — `train extreme voice:name` saves as a `.ttse` file (not `.tts`)
- Extreme in STS — Pre-processes the target voice reference through Fish Audio S2-Pro before Seed-VC conversion
- TTM Voice — Generate song via ACE-Step then extract clean vocals via SVS
- TTM Reference Stem Extraction — New `stem/(path)` syntax for extracting specific stems (drums, bass, vocals, etc.)
TranslateGemma 12B Integration
- TranslateGemma 12B — Any-to-any translation across 55 languages
- `translate (source-target)` Syntax — e.g., `translate (auto-ar)` for any-to-any translation
- TTS Dub Sub-Task — Video/audio dubbing with voice cloning, translation, and speed adjustment
- STT Overdose + Translate — Now supported (TranslateGemma decouples from ASR)
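
To illustrate the `translate (source-target)` syntax above, here is a minimal sketch of how such a directive could be pulled out of a prompt. The function name, the regular expression, and the choice to treat `auto` as "detect the source language" are assumptions made for illustration; this is not the project's actual parser, and it only handles simple alphabetic language codes.

```python
import re
from typing import NamedTuple, Optional


class TranslateDirective(NamedTuple):
    source: Optional[str]  # None stands for "auto" (assumed meaning: detect the source)
    target: str


# Hypothetical pattern: "translate (xx-yy)" with simple alphabetic codes.
_TRANSLATE_RE = re.compile(
    r"translate\s*\(\s*([A-Za-z]{2,8})\s*-\s*([A-Za-z]{2,8})\s*\)"
)


def parse_translate(prompt: str) -> Optional[TranslateDirective]:
    """Return the first translate directive found in the prompt, or None."""
    match = _TRANSLATE_RE.search(prompt)
    if match is None:
        return None
    source, target = (part.lower() for part in match.groups())
    return TranslateDirective(None if source == "auto" else source, target)


if __name__ == "__main__":
    print(parse_translate("translate (auto-ar)"))
    print(parse_translate("stt translate (en-fr) clip.wav"))
```

Region-qualified codes that themselves contain a hyphen would need a different delimiter rule than this sketch uses.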
STT Subtitle Sub-Task
- `subtitle` Keyword — Burn VibeVoice ASR subtitles onto video in ASS format
- Forced Alignment — MMS-FA per-word timestamps for accurate subtitle timing (3-5 word segments)
- Overlap Handling — Dual-line display for overlapping speech
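
As a rough illustration of the subtitle items above, the sketch below groups per-word timestamps into segments of 3 to 5 words and renders each segment as an ASS `Dialogue:` line. The helper names, the `(text, start, end)` word tuple, and the `Default` style name are assumptions for this example; the project's real segmentation and overlap handling may differ.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds), assumed shape


def ass_time(seconds: float) -> str:
    """Format seconds as an ASS timestamp, H:MM:SS.cc (centiseconds)."""
    total_cs = int(round(seconds * 100))
    hours, rem = divmod(total_cs, 360000)
    minutes, rem = divmod(rem, 6000)
    secs, cs = divmod(rem, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"


def group_words(words: List[Word], min_words: int = 3,
                max_words: int = 5) -> List[List[Word]]:
    """Chunk words into runs of max_words, then rebalance a too-short tail."""
    chunks = [words[i:i + max_words] for i in range(0, len(words), max_words)]
    if len(chunks) > 1 and len(chunks[-1]) < min_words:
        need = min_words - len(chunks[-1])
        # Borrow trailing words from the previous chunk so both stay in range.
        chunks[-1] = chunks[-2][-need:] + chunks[-1]
        chunks[-2] = chunks[-2][:-need]
    return chunks


def to_dialogue(segment: List[Word], style: str = "Default") -> str:
    """Render one segment as an ASS Dialogue event line."""
    start, end = segment[0][1], segment[-1][2]
    text = " ".join(word for word, _, _ in segment)
    return f"Dialogue: 0,{ass_time(start)},{ass_time(end)},{style},,0,0,0,,{text}"


if __name__ == "__main__":
    demo = [(f"w{i}", i * 0.5, i * 0.5 + 0.4) for i in range(7)]
    for segment in group_words(demo):
        print(to_dialogue(segment))
```

Each segment starts at its first word's start time and ends at its last word's end time, so subtitle timing follows the aligned words instead of fixed-length windows.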
SE Sound Enhancement Modernization
- SE renamed to Sound Enhancement — Expanded scope beyond just speech
- SE Sub-Modes — `se voice`, `se voice blend`, `se sr`, `se sr music`, `se sr voice`, etc.
- AudioSR Integration — New super-resolution model (haoheliu/versatile_audio_super_resolution)
(Full changelog available in CHANGELOG.md)