Clone any voice. Script any dialogue. Generate audio.
A complete Rust toolkit for Qwen3-TTS — Alibaba's open-source text-to-speech model. We built a client library, a professional GUI, an ADSR envelope tester, and a system audio recorder. 4730 lines of Rust, zero Python on the client side.
```
ttsdefou/
├── client/                        # Rust library + GUI
│   ├── src/lib.rs                 # Core library (934 lines)
│   ├── examples/gui.rs            # Full desktop GUI (2911 lines)
│   ├── examples/gui_test_adsr.rs  # ADSR envelope tester (692 lines)
│   ├── examples/simple.rs         # Minimal example (76 lines)
│   └── Cargo.toml
└── recorder/                      # System audio recorder
    ├── src/main.rs                # Loopback capture (117 lines)
    └── Cargo.toml
```
| Component | Lines | What it does |
|---|---|---|
| lib.rs | 934 | 3 async HTTP clients (VoiceDesign, CustomVoice, VoiceClone), Gradio SSE protocol, audio download, WAV handling |
| gui.rs | 2911 | 8-mode desktop GUI — TOML dialogue editor, multi-speaker, emotions, preview, save/load |
| gui_test_adsr.rs | 692 | Visual ADSR envelope editor for attack/decay/sustain/release testing |
| simple.rs | 76 | Generate speech in 10 lines of code |
| recorder | 117 | Capture system audio (loopback) to WAV — record any voice from any source |
| Total | 4730 | |
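The Gradio SSE handling that lib.rs implements is, at its core, plain text parsing. A minimal sketch of pulling the completion payload out of a Gradio SSE stream — `extract_complete_data` is an illustrative name, not necessarily the function lib.rs uses:

```rust
// Gradio streams Server-Sent Events as "event: <name>" / "data: <json>" pairs.
// Scan the stream for the "complete" event and return its data payload.
fn extract_complete_data(sse: &str) -> Option<String> {
    let mut current_event = String::new();
    for line in sse.lines() {
        if let Some(name) = line.strip_prefix("event: ") {
            current_event = name.trim().to_string();
        } else if let Some(data) = line.strip_prefix("data: ") {
            if current_event == "complete" {
                return Some(data.trim().to_string());
            }
        }
    }
    None
}

fn main() {
    let stream = "event: generating\n\
                  data: null\n\
                  \n\
                  event: complete\n\
                  data: [{\"url\": \"http://127.0.0.1:8002/file=audio.wav\"}]\n";
    let payload = extract_complete_data(stream).unwrap();
    assert!(payload.contains("audio.wav"));
    println!("payload: {}", payload);
}
```

The real client does this over a streaming HTTP response rather than a complete string, but the event/data framing is the same.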
```mermaid
graph TB
    subgraph CAPTURE["🎤 Voice Capture"]
        REC["Audio Recorder<br/><i>System loopback (WASAPI)<br/>48kHz, 32-bit float, stereo<br/>Ctrl+C to stop → .wav</i>"]
        SOURCE["Any audio source<br/><i>YouTube, Discord, game,<br/>movie, podcast...</i>"]
    end
    subgraph SERVERS["🐍 Qwen3-TTS Servers (Python — Alibaba)"]
        S1["Port 8001 — CustomVoice<br/><i>9 built-in speakers<br/>Serena, Vivian, Ryan, Aiden,<br/>Eric, Dylan, Ono Anna, Sohee, Uncle Fu<br/>WITH emotion control</i>"]
        S2["Port 8002 — VoiceDesign<br/><i>Describe the voice in text<br/>No .pt file needed</i>"]
        S3["Port 8003 — VoiceClone<br/><i>Your .pt voice clones<br/>Sound like anyone</i>"]
    end
    subgraph RUST["🦀 Rust Client (4730 lines)"]
        LIB["lib.rs — Core Library<br/><i>3 async clients<br/>Gradio SSE protocol<br/>Audio download + WAV save</i>"]
        GUI["gui.rs — Desktop GUI<br/><i>8 modes<br/>TOML dialogue editor<br/>Emotions panel<br/>Live preview</i>"]
        ADSR["gui_test_adsr.rs<br/><i>ADSR envelope editor<br/>Visual waveform</i>"]
    end
    subgraph TOML["📝 TOML Dialogue System"]
        DP["dialogue_perso_lines<br/><i>Cloned voices (.pt)</i>"]
        DC["dialogue_custom_lines<br/><i>Built-in speakers<br/>+ emotions</i>"]
        TIMING["Timing Control<br/><i>+0.5 = pause<br/>0.0 = continuous<br/>-0.8 = overlap/interrupt</i>"]
    end
    subgraph OUTPUT["🔊 Output"]
        WAV["WAV per line<br/>+ merged dialogue"]
    end
    SOURCE --> REC
    REC -->|".wav sample"| S3
    TOML --> GUI
    GUI --> LIB
    LIB -->|"HTTP + SSE"| S1
    LIB -->|"HTTP + SSE"| S2
    LIB -->|"HTTP + SSE"| S3
    S1 --> WAV
    S2 --> WAV
    S3 --> WAV
    style CAPTURE fill:#2a1a10,stroke:#ff8866,color:#ff8866
    style SERVERS fill:#0a2a1a,stroke:#44cc88,color:#44cc88
    style RUST fill:#1a1a2e,stroke:#f0c040,color:#f0c040
    style TOML fill:#1a0a2a,stroke:#cc88cc,color:#cc88cc
    style OUTPUT fill:#0d0d18,stroke:#8888cc,color:#8888cc
```
Qwen3-TTS generates one line at a time. We added a timing system that controls silence between lines:
```mermaid
graph LR
    subgraph TIMING["Timing System"]
        A["Character A speaks"]
        PAUSE["silence_after = 0.5<br/><i>Normal pause (500ms)</i>"]
        B["Character B speaks"]
        ZERO["silence_after = 0.0<br/><i>No gap — flows directly</i>"]
        C["Character C speaks"]
        OVERLAP["silence_after = -0.8<br/><i>OVERLAP — interrupts!<br/>D starts before C finishes</i>"]
        D["Character D reacts"]
    end
    A --> PAUSE --> B --> ZERO --> C --> OVERLAP --> D
    style PAUSE fill:#0a2a1a,stroke:#44cc88,color:#c8ccd4
    style ZERO fill:#1a1a2e,stroke:#8888cc,color:#c8ccd4
    style OVERLAP fill:#2a0a0a,stroke:#cc4444,color:#c8ccd4
```
Negative silence = overlap. This is what makes dialogues feel real: characters interrupt each other, conversations flow. A `silence_after` of -0.8 means the next character starts speaking 800ms before the previous one finishes.
We haven't found another TTS client that does this.
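The merge arithmetic behind this is simple: track a write cursor, and let `silence_after` move it forward (silence), leave it in place (butt-join), or move it backward (overlap, where samples are summed). A minimal sketch assuming mono f32 samples — `merge_lines` is an illustrative name, not the GUI's actual function:

```rust
// Merge dialogue lines into one buffer, honoring silence_after:
// positive = insert silence, zero = butt-join, negative = overlap (mix).
fn merge_lines(lines: &[(Vec<f32>, f32)], sample_rate: u32) -> Vec<f32> {
    let mut out: Vec<f32> = Vec::new();
    let mut cursor: i64 = 0; // write position of the next line, in samples
    for (samples, silence_after) in lines {
        let start = cursor.max(0) as usize;
        let end = start + samples.len();
        if out.len() < end {
            out.resize(end, 0.0);
        }
        for (i, s) in samples.iter().enumerate() {
            out[start + i] += s; // += so overlapping regions mix both voices
        }
        // Next line starts relative to the END of this one.
        let offset = (silence_after * sample_rate as f32) as i64;
        cursor = end as i64 + offset;
    }
    out
}

fn main() {
    let sr = 10; // tiny sample rate to keep the numbers readable
    let a = vec![1.0; 10]; // 1 "second" of character A
    let b = vec![1.0; 10]; // 1 "second" of character B
    // -0.5 s: B starts 5 samples before A ends → 15-sample result
    let merged = merge_lines(&[(a, -0.5), (b, 0.0)], sr);
    assert_eq!(merged.len(), 15);
    assert_eq!(merged[7], 2.0); // overlap region sums both voices
    println!("merged {} samples", merged.len());
}
```

A real implementation would also clamp or normalize the mixed region to avoid clipping; that is omitted here for brevity.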
```mermaid
graph TB
    subgraph MODES["8 Modes"]
        M1["VoiceDesign<br/><i>Describe voice style</i>"]
        M2["CustomVoice<br/><i>Pick speaker + emotion</i>"]
        M3["VoiceClone<br/><i>Use .pt clone file</i>"]
        M4["Dialogue<br/><i>Multi-line VoiceDesign</i>"]
        M5["DialogueCustom<br/><i>Multi-line built-in speakers</i>"]
        M6["DialoguePerso<br/><i>Multi-line cloned voices</i>"]
        M7["SavedVoice<br/><i>Reuse saved profile</i>"]
        M8["Emotions List<br/><i>Browse 20+ emotions</i>"]
    end
    style M1 fill:#0a2a1a,stroke:#44cc88,color:#c8ccd4
    style M2 fill:#0a2a1a,stroke:#44cc88,color:#c8ccd4
    style M3 fill:#2a1a10,stroke:#ff8866,color:#c8ccd4
    style M4 fill:#1a1a2e,stroke:#8888cc,color:#c8ccd4
    style M5 fill:#1a1a2e,stroke:#8888cc,color:#c8ccd4
    style M6 fill:#1a0a2a,stroke:#cc88cc,color:#c8ccd4
    style M7 fill:#12121e,stroke:#f0c040,color:#c8ccd4
    style M8 fill:#12121e,stroke:#f0c040,color:#c8ccd4
```
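The ADSR tester works with the standard attack/decay/sustain/release amplitude envelope. A minimal sketch of the envelope math it visualizes — parameter values here are illustrative, and `adsr` is not necessarily the name used in gui_test_adsr.rs:

```rust
// Standard ADSR amplitude envelope, evaluated at sample time t (seconds).
// note_len is when the note is "released"; after that, release fades to 0.
fn adsr(t: f32, attack: f32, decay: f32, sustain: f32, release: f32, note_len: f32) -> f32 {
    if t < 0.0 {
        0.0
    } else if t < attack {
        t / attack                                    // ramp 0 → 1
    } else if t < attack + decay {
        1.0 - (1.0 - sustain) * (t - attack) / decay  // fall 1 → sustain
    } else if t < note_len {
        sustain                                       // hold at sustain level
    } else if t < note_len + release {
        sustain * (1.0 - (t - note_len) / release)    // fade sustain → 0
    } else {
        0.0
    }
}

fn main() {
    // attack 0.1 s, decay 0.2 s, sustain 0.5, release 0.3 s, note 1.0 s
    assert!((adsr(0.05, 0.1, 0.2, 0.5, 0.3, 1.0) - 0.5).abs() < 1e-4); // mid-attack
    assert!((adsr(0.50, 0.1, 0.2, 0.5, 0.3, 1.0) - 0.5).abs() < 1e-4); // sustain hold
    assert!(adsr(2.0, 0.1, 0.2, 0.5, 0.3, 1.0) == 0.0);                // fully released
    println!("envelope ok");
}
```

Multiplying generated samples by this curve shapes how a line fades in and out, which is what the visual editor lets you tune.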
The recorder/ tool captures system audio via Windows WASAPI loopback. It records whatever your speakers are playing — YouTube, Discord, a movie, a podcast.
```rust
// That's it. System loopback capture in Rust.
// WASAPI → 48kHz 32-bit float stereo → WAV file
// Ctrl+C to stop. Timestamped filename.
```

Use case: you hear a voice you want to clone. Run the recorder, play the audio, stop. You now have a clean WAV sample. Feed it to the Qwen3-TTS VoiceClone server → get a .pt file → use it in dialogues forever.
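For reference, the WAV container the recorder produces is simple enough to build by hand. A sketch of a RIFF/WAVE writer for 32-bit float samples — purely illustrative, since the actual recorder uses the hound crate for this:

```rust
// Minimal WAV (RIFF) byte layout for 32-bit IEEE float PCM.
// Illustration only; the real recorder writes WAV via hound.
fn wav_bytes(samples: &[f32], sample_rate: u32, channels: u16) -> Vec<u8> {
    let data_len = (samples.len() * 4) as u32;
    let byte_rate = sample_rate * channels as u32 * 4;
    let block_align = channels * 4;
    let mut w: Vec<u8> = Vec::new();
    w.extend_from_slice(b"RIFF");
    w.extend_from_slice(&(36 + data_len).to_le_bytes());
    w.extend_from_slice(b"WAVE");
    w.extend_from_slice(b"fmt ");
    w.extend_from_slice(&16u32.to_le_bytes());   // fmt chunk size
    w.extend_from_slice(&3u16.to_le_bytes());    // format 3 = IEEE float
    w.extend_from_slice(&channels.to_le_bytes());
    w.extend_from_slice(&sample_rate.to_le_bytes());
    w.extend_from_slice(&byte_rate.to_le_bytes());
    w.extend_from_slice(&block_align.to_le_bytes());
    w.extend_from_slice(&32u16.to_le_bytes());   // bits per sample
    w.extend_from_slice(b"data");
    w.extend_from_slice(&data_len.to_le_bytes());
    for s in samples {
        w.extend_from_slice(&s.to_le_bytes());
    }
    w
}

fn main() {
    let wav = wav_bytes(&[0.0f32; 48], 48_000, 2);
    assert_eq!(&wav[0..4], b"RIFF");
    assert_eq!(&wav[8..12], b"WAVE");
    assert_eq!(wav.len(), 44 + 48 * 4); // 44-byte header + sample data
    println!("wav ok, {} bytes total", wav.len());
}
```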
```bash
cd recorder
cargo build --release
./target/release/audio_recorder
# Recording... Press Ctrl+C to stop
# Output: enregistrement_20260411_1030.wav
```

```toml
title = "My Dialogue"
author = "IkarugaRS"
date = "2026-04-11"

# Cloned voices (.pt files — port 8003)
[[dialogue_perso_lines]]
speaker = "Character A"
voice_file = "CharacterA.pt"
text = "Hey, listen to this—"
silence_after = -0.3  # B interrupts A

[[dialogue_perso_lines]]
speaker = "Character B"
voice_file = "CharacterB.pt"
text = "I already know!"
silence_after = 0.5

# Built-in speakers with emotions (port 8001)
[[dialogue_custom_lines]]
speaker = "Narrator"
voice_name = "Vivian"
text = "And so the argument began."
emotion = "amused"
silence_after = 0.0
```

```rust
// Every client follows the Gradio SSE protocol:
// 1. POST /gradio_api/call/<endpoint>      → get event_id
// 2. GET  /gradio_api/call/<endpoint>/<id> → SSE stream
// 3. Parse SSE for "event: complete"       → extract audio URL
// 4. Download audio                        → AudioResult { data, sample_rate }

pub struct VoiceDesignClient  // Port 8002 — describe voice
pub struct CustomVoiceClient  // Port 8001 — pick speaker + emotion
pub struct VoiceCloneClient   // Port 8003 — use .pt file
```

```rust
// GUI never freezes. Generation runs in std::thread::spawn.
// Results come back via mpsc::channel.
// Click a line → hear it instantly (rodio playback).
// Save/load entire dialogues as TOML files.
```

| Crate | Purpose |
|---|---|
| reqwest 0.11 | Async HTTP (JSON, multipart) |
| tokio 1.0 | Async runtime |
| serde + toml | TOML dialogue serialization |
| eframe + egui 0.31 | Native desktop GUI |
| rodio 0.19 | Audio playback |
| cpal 0.15 | Audio device access |
| rfd 0.15 | File dialogs |
| hound 3.5 | WAV read/write |
| wasapi 0.22 | Windows audio capture (recorder) |
| anyhow | Error handling with context |
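The non-blocking GUI pattern described above (generation on a worker thread, results returned over a channel) reduces to standard-library primitives. A minimal sketch, with `fake_generate` standing in for the real HTTP + SSE call:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for the real TTS request, which goes over HTTP + SSE.
fn fake_generate(text: &str) -> Vec<f32> {
    vec![0.0; text.len()] // pretend: one sample per character
}

fn main() {
    let (tx, rx) = mpsc::channel();

    // Spawn generation so the UI thread never blocks on the network.
    thread::spawn(move || {
        let audio = fake_generate("Hello, world");
        tx.send(audio).expect("UI side hung up");
    });

    // In egui this would be a try_recv() inside the update loop;
    // here we block on recv() to keep the demo deterministic.
    let audio = rx.recv().expect("worker thread died");
    assert_eq!(audio.len(), 12);
    println!("received {} samples without blocking the UI", audio.len());
}
```

In the actual GUI, the update loop polls the receiver each frame with `try_recv`, so the window stays responsive while long generations run in the background.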
| Original (Python/Gradio) | Our Rust addition |
|---|---|
| Web browser interface | Native desktop app (2911 lines) |
| One speaker per call | Multi-speaker TOML dialogue system |
| No timing control | Overlap/interruption (negative silence) |
| No voice file management | .pt clone routing across 3 servers |
| No save/load | TOML project files |
| No preview | Line-by-line playback (rodio) |
| No batch merge | Auto-merge all lines → single WAV |
| No emotion browser | 20+ emotions with descriptions |
| No ADSR control | Visual envelope editor |
| No recording tool | System loopback recorder (WASAPI) |
| Python only | 100% Rust client (4730 lines) |
```bash
# 1. Start the Qwen3-TTS servers (Python side)

# Terminal 1: CustomVoice (built-in speakers + emotions)
python run_server.py --mode custom_voice --port 8001

# Terminal 2: VoiceDesign (describe voice style)
python run_server.py --mode voice_design --port 8002

# Terminal 3: VoiceClone (cloned .pt voices)
python run_server.py --mode voice_clone --port 8003

# 2. Build the Rust client
cd client
cargo build --release --example gui

# 3. Launch the GUI
./target/release/examples/gui

# 4. Or use the simple example
cargo run --example simple
```

- Qwen3-TTS / Qwen2.5-Omni — TTS model by the Alibaba Qwen team
- IkarugaRS — Design, testing, TOML format, voice management, dialogue system, timing concept
- Akari — Rust code (library + GUI + ADSR + recorder)
MIT — The Rust code is ours. Qwen3-TTS has its own license (see upstream).