Ikaruga/TTSdeFou

TTS de Fou — Qwen3-TTS Rust Client & Voice Clone Studio

Clone any voice. Script any dialogue. Generate audio.

A complete Rust toolkit for Qwen3-TTS — Alibaba's open-source text-to-speech model. We built a client library, a professional GUI, an ADSR envelope tester, and a system audio recorder. 4730 lines of Rust, zero Python on the client side.


What's included

ttsdefou/
├── client/                         # Rust library + GUI
│   ├── src/lib.rs                  # Core library (934 lines)
│   ├── examples/gui.rs             # Full desktop GUI (2911 lines)
│   ├── examples/gui_test_adsr.rs   # ADSR envelope tester (692 lines)
│   ├── examples/simple.rs          # Minimal example (76 lines)
│   └── Cargo.toml
└── recorder/                       # System audio recorder
    ├── src/main.rs                 # Loopback capture (117 lines)
    └── Cargo.toml

| Component | Lines | What it does |
|---|---|---|
| lib.rs | 934 | 3 async HTTP clients (VoiceDesign, CustomVoice, VoiceClone), Gradio SSE protocol, audio download, WAV handling |
| gui.rs | 2911 | 8-mode desktop GUI: TOML dialogue editor, multi-speaker, emotions, preview, save/load |
| gui_test_adsr.rs | 692 | Visual ADSR envelope editor for attack/decay/sustain/release testing |
| simple.rs | 76 | Generate speech in 10 lines of code |
| recorder | 117 | Capture system audio (loopback) to WAV; record any voice from any source |
| Total | 4730 | |

Architecture

graph TB
    subgraph CAPTURE["🎤 Voice Capture"]
        REC["Audio Recorder<br/><i>System loopback (WASAPI)<br/>48kHz, 32-bit float, stereo<br/>Ctrl+C to stop → .wav</i>"]
        SOURCE["Any audio source<br/><i>YouTube, Discord, game,<br/>movie, podcast...</i>"]
    end

    subgraph SERVERS["🐍 Qwen3-TTS Servers (Python — Alibaba)"]
        S1["Port 8001 — CustomVoice<br/><i>9 built-in speakers<br/>Serena, Vivian, Ryan, Aiden,<br/>Eric, Dylan, Ono Anna, Sohee, Uncle Fu<br/>WITH emotion control</i>"]
        S2["Port 8002 — VoiceDesign<br/><i>Describe the voice in text<br/>No .pt file needed</i>"]
        S3["Port 8003 — VoiceClone<br/><i>Your .pt voice clones<br/>Sound like anyone</i>"]
    end

    subgraph RUST["🦀 Rust Client (4730 lines)"]
        LIB["lib.rs — Core Library<br/><i>3 async clients<br/>Gradio SSE protocol<br/>Audio download + WAV save</i>"]
        GUI["gui.rs — Desktop GUI<br/><i>8 modes<br/>TOML dialogue editor<br/>Emotions panel<br/>Live preview</i>"]
        ADSR["gui_test_adsr.rs<br/><i>ADSR envelope editor<br/>Visual waveform</i>"]
    end

    subgraph TOML["📝 TOML Dialogue System"]
        DP["dialogue_perso_lines<br/><i>Cloned voices (.pt)</i>"]
        DC["dialogue_custom_lines<br/><i>Built-in speakers<br/>+ emotions</i>"]
        TIMING["Timing Control<br/><i>+0.5 = pause<br/>0.0 = continuous<br/>-0.8 = overlap/interrupt</i>"]
    end

    subgraph OUTPUT["🔊 Output"]
        WAV["WAV per line<br/>+ merged dialogue"]
    end

    SOURCE --> REC
    REC -->|".wav sample"| S3
    TOML --> GUI
    GUI --> LIB
    LIB -->|"HTTP + SSE"| S1
    LIB -->|"HTTP + SSE"| S2
    LIB -->|"HTTP + SSE"| S3
    S1 --> WAV
    S2 --> WAV
    S3 --> WAV

    style CAPTURE fill:#2a1a10,stroke:#ff8866,color:#ff8866
    style SERVERS fill:#0a2a1a,stroke:#44cc88,color:#44cc88
    style RUST fill:#1a1a2e,stroke:#f0c040,color:#f0c040
    style TOML fill:#1a0a2a,stroke:#cc88cc,color:#cc88cc
    style OUTPUT fill:#0d0d18,stroke:#8888cc,color:#8888cc

The Killer Feature — Voice Overlap & Timing

Qwen3-TTS generates one line at a time. We added a timing system that controls silence between lines:

graph LR
    subgraph TIMING["Timing System"]
        A["Character A speaks"]
        PAUSE["silence_after = 0.5<br/><i>Normal pause (500ms)</i>"]
        B["Character B speaks"]
        ZERO["silence_after = 0.0<br/><i>No gap — flows directly</i>"]
        C["Character C speaks"]
        OVERLAP["silence_after = -0.8<br/><i>OVERLAP — interrupts!<br/>D starts before C finishes</i>"]
        D["Character D reacts"]
    end

    A --> PAUSE --> B --> ZERO --> C --> OVERLAP --> D

    style PAUSE fill:#0a2a1a,stroke:#44cc88,color:#c8ccd4
    style ZERO fill:#1a1a2e,stroke:#8888cc,color:#c8ccd4
    style OVERLAP fill:#2a0a0a,stroke:#cc4444,color:#c8ccd4

Negative silence = overlap. This is what makes dialogues feel real: characters interrupt each other and conversations flow. A silence_after of -0.8 means the next line starts 800 ms before the current line finishes.

As far as we know, no other Qwen3-TTS client does this.
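
The timing rule above can be sketched as plain sample arithmetic. This is a minimal illustration, not the code from gui.rs: the `Clip` type and the `start_offsets` and `merge` functions are hypothetical names, and the sketch assumes mono f32 samples that all share one sample rate.

```rust
/// One generated line plus the gap that follows it.
/// Hypothetical type for illustration; the real project may store this differently.
struct Clip {
    samples: Vec<f32>,  // mono PCM in [-1.0, 1.0]
    silence_after: f32, // seconds; negative means the next clip overlaps this one
}

/// Start position (in samples) of every clip on the merged timeline.
fn start_offsets(clips: &[Clip], sample_rate: u32) -> Vec<usize> {
    let mut offsets = Vec::with_capacity(clips.len());
    let mut cursor: i64 = 0;
    for clip in clips {
        offsets.push(cursor as usize);
        let gap = (clip.silence_after * sample_rate as f32).round() as i64;
        // A positive gap pushes the next clip later, a negative gap pulls it earlier.
        // Clamp at zero so a large overlap never starts before the timeline begins.
        cursor = (cursor + clip.samples.len() as i64 + gap).max(0);
    }
    offsets
}

/// Mixes all clips into one buffer, summing samples where clips overlap.
/// Trailing silence after the last clip is not appended.
fn merge(clips: &[Clip], sample_rate: u32) -> Vec<f32> {
    let offsets = start_offsets(clips, sample_rate);
    let total = clips
        .iter()
        .zip(&offsets)
        .map(|(clip, &offset)| offset + clip.samples.len())
        .max()
        .unwrap_or(0);
    let mut out = vec![0.0_f32; total];
    for (clip, &offset) in clips.iter().zip(&offsets) {
        for (i, &sample) in clip.samples.iter().enumerate() {
            // Hard clamp keeps the sum in range; a real mixer might scale instead.
            out[offset + i] = (out[offset + i] + sample).clamp(-1.0, 1.0);
        }
    }
    out
}
```

With a 2-second clip followed by `silence_after = -0.8`, the next clip starts at 1.2 seconds and the two are summed for the 0.8 seconds they share.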


The 8 GUI Modes

graph TB
    subgraph MODES["8 Modes"]
        M1["VoiceDesign<br/><i>Describe voice style</i>"]
        M2["CustomVoice<br/><i>Pick speaker + emotion</i>"]
        M3["VoiceClone<br/><i>Use .pt clone file</i>"]
        M4["Dialogue<br/><i>Multi-line VoiceDesign</i>"]
        M5["DialogueCustom<br/><i>Multi-line built-in speakers</i>"]
        M6["DialoguePerso<br/><i>Multi-line cloned voices</i>"]
        M7["SavedVoice<br/><i>Reuse saved profile</i>"]
        M8["Emotions List<br/><i>Browse 20+ emotions</i>"]
    end

    style M1 fill:#0a2a1a,stroke:#44cc88,color:#c8ccd4
    style M2 fill:#0a2a1a,stroke:#44cc88,color:#c8ccd4
    style M3 fill:#2a1a10,stroke:#ff8866,color:#c8ccd4
    style M4 fill:#1a1a2e,stroke:#8888cc,color:#c8ccd4
    style M5 fill:#1a1a2e,stroke:#8888cc,color:#c8ccd4
    style M6 fill:#1a0a2a,stroke:#cc88cc,color:#c8ccd4
    style M7 fill:#12121e,stroke:#f0c040,color:#c8ccd4
    style M8 fill:#12121e,stroke:#f0c040,color:#c8ccd4

Audio Recorder — Capture Any Voice

The recorder/ tool captures system audio via Windows WASAPI loopback. It records whatever your speakers are playing — YouTube, Discord, a movie, a podcast.

// That's it. System loopback capture in Rust.
// WASAPI → 48kHz 32-bit float stereo → WAV file
// Ctrl+C to stop. Timestamped filename.

Use case: You hear a voice you want to clone. Run the recorder, play the audio, stop. You now have a clean WAV sample. Feed it to Qwen3-TTS VoiceClone server → get a .pt file → use it in dialogues forever.
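
For readers curious what "48kHz 32-bit float stereo → WAV" means on disk, here is a std-only sketch of the file layout. It is an illustration under assumptions, not the recorder's code: the project lists hound for WAV I/O, and `wav_f32_bytes` is a hypothetical helper. It writes the minimal 44-byte header; stricter writers also add a format extension and a `fact` chunk for float data.

```rust
/// Builds an in-memory WAV file from interleaved 32-bit float samples.
/// Hypothetical helper for illustration only.
fn wav_f32_bytes(interleaved: &[f32], channels: u16, sample_rate: u32) -> Vec<u8> {
    const BYTES_PER_SAMPLE: u32 = 4;
    const FORMAT_IEEE_FLOAT: u16 = 3; // format tag for float PCM

    let data_len = interleaved.len() as u32 * BYTES_PER_SAMPLE;
    let block_align = channels * BYTES_PER_SAMPLE as u16; // bytes per frame
    let byte_rate = sample_rate * block_align as u32;

    let mut out = Vec::with_capacity(44 + data_len as usize);
    // RIFF container header
    out.extend_from_slice(b"RIFF");
    out.extend_from_slice(&(36 + data_len).to_le_bytes());
    out.extend_from_slice(b"WAVE");
    // "fmt " chunk: 16 bytes describing the sample format
    out.extend_from_slice(b"fmt ");
    out.extend_from_slice(&16u32.to_le_bytes());
    out.extend_from_slice(&FORMAT_IEEE_FLOAT.to_le_bytes());
    out.extend_from_slice(&channels.to_le_bytes());
    out.extend_from_slice(&sample_rate.to_le_bytes());
    out.extend_from_slice(&byte_rate.to_le_bytes());
    out.extend_from_slice(&block_align.to_le_bytes());
    out.extend_from_slice(&32u16.to_le_bytes()); // bits per sample
    // "data" chunk: the samples themselves, little-endian
    out.extend_from_slice(b"data");
    out.extend_from_slice(&data_len.to_le_bytes());
    for sample in interleaved {
        out.extend_from_slice(&sample.to_le_bytes());
    }
    out
}
```

At 48 kHz stereo this format costs 384,000 bytes per second, so a 30-second voice sample is roughly 11.5 MB.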

Build & run

cd recorder
cargo build --release
./target/release/audio_recorder
# Recording... Press Ctrl+C to stop
# Output: enregistrement_20260411_1030.wav

TOML Dialogue Format

title = "My Dialogue"
author = "IkarugaRS"
date = "2026-04-11"

# Cloned voices (.pt files — port 8003)
[[dialogue_perso_lines]]
speaker = "Character A"
voice_file = "CharacterA.pt"
text = "Hey, listen to this—"
silence_after = -0.3               # B interrupts A

[[dialogue_perso_lines]]
speaker = "Character B"
voice_file = "CharacterB.pt"
text = "I already know!"
silence_after = 0.5

# Built-in speakers with emotions (port 8001)
[[dialogue_custom_lines]]
speaker = "Narrator"
voice_name = "Vivian"
text = "And so the argument began."
emotion = "amused"
silence_after = 0.0
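
On the Rust side, a file like this maps naturally onto a few plain structs. The sketch below is an assumption about the shape, not a copy of lib.rs or gui.rs: the struct names, the optional `emotion` field, and `routed_texts` are hypothetical. In the real project serde and toml handle the parsing; they are left out here so the sketch stays std-only.

```rust
const CUSTOM_VOICE_PORT: u16 = 8001; // built-in speakers + emotions
const VOICE_CLONE_PORT: u16 = 8003; // cloned .pt voices

/// Mirrors one [[dialogue_perso_lines]] entry.
#[derive(Debug, Clone, PartialEq)]
struct PersoLine {
    speaker: String,
    voice_file: String,
    text: String,
    silence_after: f32,
}

/// Mirrors one [[dialogue_custom_lines]] entry.
/// `emotion` is modelled as optional here; that is an assumption.
#[derive(Debug, Clone, PartialEq)]
struct CustomLine {
    speaker: String,
    voice_name: String,
    text: String,
    emotion: Option<String>,
    silence_after: f32,
}

/// Mirrors the whole file: header fields plus both line arrays.
#[derive(Debug, Clone, PartialEq, Default)]
struct Dialogue {
    title: String,
    author: String,
    date: String,
    dialogue_perso_lines: Vec<PersoLine>,
    dialogue_custom_lines: Vec<CustomLine>,
}

impl Dialogue {
    /// Pairs each line's text with the port of the server that should voice it.
    fn routed_texts(&self) -> Vec<(u16, &str)> {
        let perso = self
            .dialogue_perso_lines
            .iter()
            .map(|line| (VOICE_CLONE_PORT, line.text.as_str()));
        let custom = self
            .dialogue_custom_lines
            .iter()
            .map(|line| (CUSTOM_VOICE_PORT, line.text.as_str()));
        perso.chain(custom).collect()
    }
}

/// Builds a small dialogue by hand, loosely following the TOML example.
fn example_dialogue() -> Dialogue {
    Dialogue {
        title: "My Dialogue".into(),
        author: "IkarugaRS".into(),
        date: "2026-04-11".into(),
        dialogue_perso_lines: vec![
            PersoLine {
                speaker: "Character A".into(),
                voice_file: "CharacterA.pt".into(),
                text: "Hey, listen to this".into(),
                silence_after: -0.3,
            },
            PersoLine {
                speaker: "Character B".into(),
                voice_file: "CharacterB.pt".into(),
                text: "I already know!".into(),
                silence_after: 0.5,
            },
        ],
        dialogue_custom_lines: vec![CustomLine {
            speaker: "Narrator".into(),
            voice_name: "Vivian".into(),
            text: "And so the argument began.".into(),
            emotion: Some("amused".into()),
            silence_after: 0.0,
        }],
    }
}
```

Keeping the two line types separate means the `.pt` path and the speaker-plus-emotion pair can never be mixed up, and the routing to port 8003 or 8001 follows from the type alone.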

Rust Code — How it works

lib.rs — 3 async clients, one pattern

// Every client follows the Gradio SSE protocol:
// 1. POST /gradio_api/call/<endpoint>     → get event_id
// 2. GET  /gradio_api/call/<endpoint>/<id> → SSE stream
// 3. Parse SSE for "event: complete"       → extract audio URL
// 4. Download audio                        → AudioResult { data, sample_rate }

pub struct VoiceDesignClient { /* fields omitted */ } // Port 8002: describe voice
pub struct CustomVoiceClient { /* fields omitted */ } // Port 8001: pick speaker + emotion
pub struct VoiceCloneClient { /* fields omitted */ }  // Port 8003: use .pt file
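
Step 3 of the protocol above is the least obvious, so here is a std-only sketch of it. It assumes the SSE body has already been read into a string, and that a finished job arrives as an `event: complete` line followed by a `data:` line. `complete_payload` is a hypothetical name, not a function from lib.rs, and pulling the audio URL out of the JSON payload (step 4) is left to a JSON parser.

```rust
/// Returns the `data:` payload of the first `event: complete` in an SSE body,
/// or `None` if the stream ended without one (for example on an error event).
fn complete_payload(sse_body: &str) -> Option<&str> {
    let mut in_complete_event = false;
    for line in sse_body.lines() {
        let line = line.trim_end();
        if let Some(name) = line.strip_prefix("event:") {
            // Remember whether the event we are currently inside is the one we want.
            in_complete_event = name.trim() == "complete";
        } else if let Some(data) = line.strip_prefix("data:") {
            if in_complete_event {
                return Some(data.trim());
            }
        } else if line.is_empty() {
            // A blank line terminates the current SSE event.
            in_complete_event = false;
        }
    }
    None
}
```

Heartbeat and progress events are skipped because only the `data:` line that belongs to a `complete` event is returned.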

gui.rs — Background generation

// GUI never freezes. Generation runs in std::thread::spawn.
// Results come back via mpsc::channel.
// Click a line → hear it instantly (rodio playback).
// Save/load entire dialogues as TOML files.
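
The thread-plus-channel pattern described above looks roughly like this. It is a sketch under assumptions: `WorkerMsg` and `spawn_generation` are hypothetical names, and the `synth` closure stands in for the real call into the TTS client.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Messages the worker thread sends back to the UI thread.
enum WorkerMsg {
    LineDone { index: usize, samples: Vec<f32> },
    Finished,
}

/// Generates every line on a background thread and returns the receiving end
/// of the channel. `synth` is a placeholder for the actual TTS request.
fn spawn_generation<F>(lines: Vec<String>, synth: F) -> Receiver<WorkerMsg>
where
    F: Fn(&str) -> Vec<f32> + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for (index, text) in lines.iter().enumerate() {
            let samples = synth(text);
            if tx.send(WorkerMsg::LineDone { index, samples }).is_err() {
                return; // receiver dropped, e.g. the window was closed
            }
        }
        let _ = tx.send(WorkerMsg::Finished);
    });
    rx
}
```

A GUI would typically poll the receiver with `try_recv()` once per frame, so the render loop never blocks while a line is still being generated.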

Dependencies

| Crate | Purpose |
|---|---|
| reqwest 0.11 | Async HTTP (JSON, multipart) |
| tokio 1.0 | Async runtime |
| serde + toml | TOML dialogue serialization |
| eframe + egui 0.31 | Native desktop GUI |
| rodio 0.19 | Audio playback |
| cpal 0.15 | Audio device access |
| rfd 0.15 | File dialogs |
| hound 3.5 | WAV read/write |
| wasapi 0.22 | Windows audio capture (recorder) |
| anyhow | Error handling with context |

What we built vs original Qwen3-TTS

| Original (Python/Gradio) | Our Rust addition |
|---|---|
| Web browser interface | Native desktop app (2911 lines) |
| One speaker per call | Multi-speaker TOML dialogue system |
| No timing control | Overlap/interruption (negative silence) |
| No voice file management | .pt clone routing across 3 servers |
| No save/load | TOML project files |
| No preview | Line-by-line playback (rodio) |
| No batch merge | Auto-merge all lines → single WAV |
| No emotion browser | 20+ emotions with descriptions |
| No ADSR control | Visual envelope editor |
| No recording tool | System loopback recorder (WASAPI) |
| Python only | 100% Rust client (4730 lines) |

Quick Start

# 1. Start Qwen3-TTS servers (Python side)
# Terminal 1: CustomVoice (built-in speakers + emotions)
python run_server.py --mode custom_voice --port 8001
# Terminal 2: VoiceDesign (describe voice style)
python run_server.py --mode voice_design --port 8002
# Terminal 3: VoiceClone (cloned .pt voices)
python run_server.py --mode voice_clone --port 8003

# 2. Build the Rust client
cd client
cargo build --release --example gui

# 3. Launch the GUI
./target/release/examples/gui

# 4. Or use the simple example
cargo run --example simple

Credits

  • Qwen3-TTS / Qwen2.5-Omni — TTS model by Alibaba Qwen team
  • IkarugaRS — Design, testing, TOML format, voice management, dialogue system, timing concept
  • Akari — Rust code (library + GUI + ADSR + recorder)

License

MIT — The Rust code is ours. Qwen3-TTS has its own license (see upstream).
