Major Update v5 - Fish S2-Pro, TranslateGemma 12B, AudioSR, TTM Stem Extraction

@HAKORADev HAKORADev released this 05 Jun 14:55

05/29/2026

  • Status: Stable, all features work, still under active development
  • Major Integrations and Modernizations

Added

Fish Audio S2-Pro Integration

  • extreme Keyword — Switch TTS engine from Qwen3-TTS to Fish Audio S2-Pro for higher quality voice cloning and 80+ language support
  • Fish Audio S2-Pro — Dual-autoregressive (4B + 400M) model with RVQ-based codec, voice effects via [tag] syntax
  • Train Extreme — train extreme voice:name saves as a .ttse file (not .tts)
  • Extreme in STS — Pre-processes target voice reference through Fish S2 Pro before Seed-VC conversion
  • TTM Voice — Generate song via ACE-Step then extract clean vocals via SVS
  • TTM Reference Stem Extraction — New stem/(path) syntax for extracting specific stems (drums, bass, vocals, etc.)
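
The stem/(path) syntax above could be parsed roughly as follows. This is a minimal sketch, not the project's parser: it assumes "stem" stands for a stem name such as drums or vocals, that the path sits inside the parentheses, and that the path itself contains no parentheses. The names STEM_REF and parse_stem_reference are hypothetical.

```python
import re
from typing import Optional, Tuple

# Hypothetical grammar: "<stem>/(<path>)", e.g. "drums/(songs/demo.wav)".
# The release notes do not document the exact rules, so this is an assumption.
STEM_REF = re.compile(r"^(?P<stem>[a-z]+)/\((?P<path>[^()]+)\)$")


def parse_stem_reference(token: str) -> Optional[Tuple[str, str]]:
    """Return (stem, path) if token looks like stem/(path), else None."""
    match = STEM_REF.match(token.strip())
    if match is None:
        return None
    return match.group("stem"), match.group("path")
```

For example, parse_stem_reference("drums/(songs/demo.wav)") yields the stem name and the path as a pair, while a bare path without the stem prefix returns None.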

TranslateGemma 12B Integration

  • TranslateGemma 12B — Any-to-any translation across 55 languages
  • translate (source-target) Syntax — e.g., translate (auto-ar) for any-to-any translation
  • TTS Dub Sub-Task — Video/audio dubbing with voice cloning, translation, and speed adjustment
  • STT Overdose + Translate — Now supported (TranslateGemma decouples translation from ASR)
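
As an illustration of the translate (source-target) syntax, the sketch below extracts the two language fields from a prompt. It is an assumption-laden example rather than the tool's own code: it treats both fields as plain alphabetic codes (so "auto" and "ar" from the example above both match) and does not handle codes that themselves contain hyphens. TranslateDirective and parse_translate are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class TranslateDirective(NamedTuple):
    source: str  # language code, or "auto" as in the example from the notes
    target: str  # language code


# Matches "translate (xx-yy)" anywhere in the prompt.
# Assumption: both fields are alphabetic codes with no inner hyphen.
_TRANSLATE = re.compile(
    r"translate\s*\(\s*([A-Za-z]+)\s*-\s*([A-Za-z]+)\s*\)"
)


def parse_translate(prompt: str) -> Optional[TranslateDirective]:
    """Return the (source, target) pair, or None if no directive is present."""
    match = _TRANSLATE.search(prompt)
    if match is None:
        return None
    return TranslateDirective(match.group(1).lower(), match.group(2).lower())
```

With this sketch, parse_translate("translate (auto-ar)") gives source "auto" and target "ar".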

STT Subtitle Sub-Task

  • subtitle Keyword — Burn VibeVoice ASR subtitles onto video in ASS format
  • Forced Alignment — MMS-FA per-word timestamps for accurate subtitle timing (3-5 word segments)
  • Overlap Handling — Dual-line display for overlapping speech
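
The subtitle timing described above can be sketched as follows: per-word timestamps are chunked into 3-5 word segments, and each segment becomes one ASS Dialogue line. This is a simplified illustration, not the project's implementation. The word tuples stand in for forced-alignment output, the "Default" style name is an assumption, and the dual-line overlap handling is not covered.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def ass_time(seconds: float) -> str:
    """Format seconds as H:MM:SS.cc (ASS timestamps use centiseconds)."""
    cs = int(round(seconds * 100))
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, c = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{c:02d}"


def group_words(words: List[Word], min_words: int = 3,
                max_words: int = 5) -> List[List[Word]]:
    """Chunk words into segments of at most max_words.

    If the final segment is shorter than min_words, borrow words from the
    previous segment so both stay within the 3-5 word range.
    """
    segments = [words[i:i + max_words] for i in range(0, len(words), max_words)]
    if len(segments) >= 2 and len(segments[-1]) < min_words:
        need = min_words - len(segments[-1])
        segments[-1] = segments[-2][-need:] + segments[-1]
        segments[-2] = segments[-2][:-need]
    return segments


def to_dialogue(segment: List[Word], style: str = "Default") -> str:
    """Build one ASS event line spanning the first word's start to the last word's end."""
    start, end = segment[0][1], segment[-1][2]
    text = " ".join(word[0] for word in segment)
    # Fields: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
    return f"Dialogue: 0,{ass_time(start)},{ass_time(end)},{style},,0,0,0,,{text}"
```

Feeding seven aligned words through group_words produces a four-word segment followed by a three-word segment, each of which to_dialogue turns into a single event line for the [Events] section of an ASS file.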

SE Sound Enhancement Modernization

  • SE renamed to Sound Enhancement — Expanded scope beyond just speech
  • SE Sub-Modes — se voice, se voice blend, se sr, se sr music, se sr voice, etc.
  • AudioSR Integration — New super-resolution model (haoheliu/versatile_audio_super_resolution)

(Full changelog available in CHANGELOG.md)