Major Update v5 - Fish S2-Pro, TranslateGemma 12B, AudioSR, TTM Stem Extraction

Latest

HAKORADev released this 05 Jun 14:55

6e47f27

05/29/2026

Status: Stable. All features work; development is ongoing.
Major Integrations and Modernizations

Added

Fish Audio S2-Pro Integration

extreme Keyword — Switch TTS engine from Qwen3-TTS to Fish Audio S2-Pro for higher quality voice cloning and 80+ language support
Fish Audio S2-Pro — Dual-autoregressive (4B + 400M) model with RVQ-based codec, voice effects via [tag] syntax
Train Extreme — train extreme voice:name saves as .ttse file (not .tts)
Extreme in STS — Pre-processes the target voice reference through Fish Audio S2-Pro before Seed-VC conversion
TTM Voice — Generate song via ACE-Step then extract clean vocals via SVS
TTM Reference Stem Extraction — New stem/(path) syntax for extracting specific stems (drums, bass, vocals, etc.)

TranslateGemma 12B Integration

TranslateGemma 12B — Any-to-any translation across 55 languages
translate (source-target) Syntax — Selects the language pair, e.g., translate (auto-ar)
TTS Dub Sub-Task — Video/audio dubbing with voice cloning, translation, and speed adjustment
STT Overdose + Translate — Now supported (TranslateGemma decouples translation from ASR)
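
As an illustration of the translate (source-target) syntax above, here is a minimal sketch of how such a directive could be pulled out of a prompt string. The function name parse_translate and the regular expression are hypothetical and not taken from the project; the sketch also assumes plain alphabetic language codes with a single hyphen between source and target.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for "translate (source-target)", e.g. "translate (auto-ar)".
# Assumes each code is purely alphabetic (no region suffixes).
_TRANSLATE_RE = re.compile(
    r"\btranslate\s*\(\s*([A-Za-z]+)\s*-\s*([A-Za-z]+)\s*\)",
    re.IGNORECASE,
)


def parse_translate(prompt: str) -> Optional[Tuple[str, str]]:
    """Return (source, target) in lowercase, or None if no directive is found."""
    match = _TRANSLATE_RE.search(prompt)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()


if __name__ == "__main__":
    print(parse_translate("tts dub translate (auto-ar)"))
    print(parse_translate("tts hello world"))
```

The real parser may accept more forms (for example, region-tagged codes); this sketch only covers the pattern shown in the release notes.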

STT Subtitle Sub-Task

subtitle Keyword — Burn VibeVoice ASR subtitles onto video in ASS format
Forced Alignment — MMS-FA per-word timestamps for accurate subtitle timing (3-5 word segments)
Overlap Handling — Dual-line display for overlapping speech
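
The subtitle timing described above can be sketched as follows: per-word timestamps are grouped into segments of 3 to 5 words, and each segment becomes one ASS Dialogue line. This is a sketch under stated assumptions, not the project's implementation. The helper names (ass_time, group_words, to_dialogue), the greedy grouping strategy, and the "Default" style name are all hypothetical; only the ASS timestamp layout (H:MM:SS.cc) and the Dialogue field order follow the ASS format itself.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def ass_time(seconds: float) -> str:
    """Format seconds as an ASS timestamp: H:MM:SS.cc (centiseconds)."""
    total_cs = int(round(seconds * 100))
    hours, rem = divmod(total_cs, 360000)
    minutes, rem = divmod(rem, 6000)
    secs, cs = divmod(rem, 100)
    return f"{hours}:{minutes:02d}:{secs:02d}.{cs:02d}"


def group_words(words: List[Word], min_words: int = 3, max_words: int = 5) -> List[List[Word]]:
    """Greedily chunk words into groups of max_words, then rebalance a short tail."""
    chunks = [words[i:i + max_words] for i in range(0, len(words), max_words)]
    if len(chunks) > 1 and len(chunks[-1]) < min_words:
        # Borrow words from the previous chunk so the last one reaches min_words.
        need = min_words - len(chunks[-1])
        chunks[-1] = chunks[-2][-need:] + chunks[-1]
        chunks[-2] = chunks[-2][:-need]
    return chunks


def to_dialogue(chunk: List[Word]) -> str:
    """Build one ASS Dialogue line spanning the first word's start to the last word's end."""
    text = " ".join(word[0] for word in chunk)
    start, end = ass_time(chunk[0][1]), ass_time(chunk[-1][2])
    return f"Dialogue: 0,{start},{end},Default,,0,0,0,,{text}"


if __name__ == "__main__":
    sample = [("one", 0.0, 0.3), ("two", 0.4, 0.7), ("three", 0.8, 1.1),
              ("four", 1.2, 1.5), ("five", 1.6, 1.9), ("six", 2.0, 2.5)]
    for chunk in group_words(sample):
        print(to_dialogue(chunk))
```

Overlapping speech would need additional handling (the release notes mention a dual-line display), which this sketch does not attempt.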

SE Sound Enhancement Modernization

SE renamed to Sound Enhancement — Expanded scope beyond just speech
SE Sub-Modes — se voice, se voice blend, se sr, se sr music, se sr voice, etc.
AudioSR Integration — New super-resolution model (haoheliu/versatile_audio_super_resolution)

(Full changelog available in CHANGELOG.md)
