A Gradio web app for testing open-source text-to-speech models and processing caption-to-speech video pipelines. Compare TTS models side by side, manage reusable voice profiles, and batch-process videos with spoken captions.
Requires Python 3.10–3.12. Several model dependencies (Kokoro, F5-TTS, etc.) do not yet support Python 3.13.
- Playground walkthrough — generating speech with Qwen3-TTS, narrated by the app itself
./run.sh

This creates a virtual environment, installs dependencies, and launches the web UI at http://localhost:7860. The server binds to 0.0.0.0, so it is also reachable from other machines on the network at http://<your-ip>:7860.
# Launch the web UI
tts-studio serve
# Generate speech from the command line
tts-studio generate "Hello, world." --model kokoro-82m --voice bf_emma -o hello.wav
# Process a captioned video
tts-studio caption video.mp4 --model kokoro-82m --voice bf_emma
# Batch-process a directory of videos
tts-studio batch ./videos/ --profile emma

Quality tiers: S = best-in-class, A = excellent, B = good, C = decent/niche.
| Model | Tier | Params | VRAM | Clone | Voices | Install |
|---|---|---|---|---|---|---|
| Qwen3-TTS-1.7B | S | 1.7B | ~10 GB | No | 9 | pip install -e ".[qwen3tts]" |
| Qwen3-TTS-1.7B Clone | S | 1.7B | ~10 GB | Yes | — | pip install -e ".[qwen3tts]" |
| Chatterbox | A | 350M | ~4 GB | Yes | — | pip install -e ".[chatterbox]" |
| Zonos-v0.1 | A | 1.6B | ~6 GB | Yes (10-30s) | — | pip install -e ".[zonos]" |
| Sesame CSM-1B | A | 1.1B | ~4.5 GB | Yes | — | pip install -e ".[sesame-csm]" |
| TADA-1B | A | 1B | ~5 GB | Yes | — | pip install -e ".[tada]" |
| Kokoro-82M | A | 82M | ~0.5 GB | No | 24 | included in .[all] |
| F5-TTS | A | — | ~2.5 GB | Yes | — | included in .[all] |
| Voxtral-4B | A | 4B | ~16 GB | Yes* | 20 (9 langs) | remote via vLLM |
| Dia-1.6B | B | 1.6B | ~10 GB | No | 2 | included in .[all] |
| Orpheus-3B | B | 3B | ~7 GB | No | 8 | included in .[all] |
| Spark-TTS-0.5B | B | 0.5B | ~2 GB | Yes | — | included in .[all] |
| OuteTTS-0.3-500M | B | 500M | ~2 GB | Yes | — | pip install -e ".[outetts]" |
* Voxtral voice cloning only works with the local variant, not via the remote API.
Qwen3-TTS ships as two variants sharing the same .[qwen3tts] dependency: the CustomVoice variant has 9 preset voices but no cloning, while the Base (Clone) variant supports voice cloning from reference audio but has no presets.
Models in the all group are installed by default with ./run.sh. Others need their extras group installed separately. Models that aren't installed show as "(not installed)" in the UI.
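As a rough aid for choosing from the table above, the sketch below filters models by a VRAM budget. The figures are the approximate values copied from the table; the function name and the dictionary keys are illustrative and are not part of TTS Studio. The Qwen3-TTS Clone variant is omitted because it shares the figure of the preset variant.

```python
# Approximate VRAM needs in GB, copied from the model table.
# Keys are display names for illustration, not necessarily the app's model IDs.
VRAM_GB = {
    "Qwen3-TTS-1.7B": 10.0,
    "Chatterbox": 4.0,
    "Zonos-v0.1": 6.0,
    "Sesame CSM-1B": 4.5,
    "TADA-1B": 5.0,
    "Kokoro-82M": 0.5,
    "F5-TTS": 2.5,
    "Voxtral-4B": 16.0,
    "Dia-1.6B": 10.0,
    "Orpheus-3B": 7.0,
    "Spark-TTS-0.5B": 2.0,
    "OuteTTS-0.3-500M": 2.0,
}


def models_within_budget(budget_gb: float) -> list[str]:
    """Return the models whose approximate VRAM need fits the budget, smallest first."""
    fitting = [(need, name) for name, need in VRAM_GB.items() if need <= budget_gb]
    # Sorting the (need, name) pairs orders by VRAM first, then alphabetically.
    return [name for _need, name in sorted(fitting)]


if __name__ == "__main__":
    for name in models_within_budget(4.0):
        print(name)
```

Treat the result as a starting point only: the table values are approximate, and actual usage depends on precision, batch size, and whatever else is loaded on the GPU.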
Profiles save a TTS configuration (model, voice, reference audio) under a friendly name for reuse across the UI and CLI.
Create a profiles.json in the project root:
{
  "emma": {
    "model_id": "kokoro-82m",
    "voice": "bf_emma"
  },
  "cloned-sarah": {
    "model_id": "f5-tts",
    "reference_audio": "reference_audio/sarah.wav",
    "reference_text": "This is Sarah speaking naturally."
  }
}

Use a profile from the CLI:

tts-studio generate "Good morning." --profile emma -o morning.wav

Convert captioned videos into narrated versions. Provide a video file and a matching .srt subtitle file; the pipeline parses the captions, generates speech for each segment, and renders the output.

tts-studio caption video.mp4 --srt video.srt --model kokoro-82m --voice bf_emma

Place video files alongside matching .srt files (same filename stem) in a directory:

tts-studio batch ./videos/ --profile emma --pattern "*.webm"

Create a caption_config.yaml to customise TTS and output settings:
tts:
  model_id: kokoro-82m
  voice: bf_emma
  speed: 1.0
  silence_before: 0.3
  silence_after: 0.5
output:
  format: mkv

Requires: ffmpeg installed on the server running TTS Studio.
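To make the caption-parsing step concrete, here is a minimal sketch of turning .srt text into timed segments using only the standard library. It is not the code in caption/srt_parser.py; the Segment dataclass and parse_srt function are names invented for this illustration, and the sketch assumes plain cues without positioning data after the timestamps.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    index: int
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


# SRT timestamps look like 00:01:02,345 (some files use a dot instead of a comma).
_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _to_seconds(stamp: str) -> float:
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content: str) -> list[Segment]:
    """Parse SRT text into segments; multi-line cue text is joined with spaces."""
    segments = []
    normalized = content.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        if len(lines) < 3:
            continue  # skip empty or malformed cues
        start, _arrow, end = lines[1].partition("-->")
        text = " ".join(line.strip() for line in lines[2:])
        segments.append(Segment(int(lines[0]), _to_seconds(start), _to_seconds(end), text))
    return segments


SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:04,000 --> 00:00:06,250
This cue spans
two lines.
"""

if __name__ == "__main__":
    for segment in parse_srt(SAMPLE):
        print(segment)
```

Each segment's start and end times give the window its generated speech has to fit into when the timeline is built.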
Any model can be offloaded to a remote server running an OpenAI-compatible TTS API (e.g. vLLM). Create an endpoints.json file in the project root:
{
  "voxtral-4b": {
    "url": "http://your-server:8000/v1",
    "model": "mistralai/Voxtral-4B-TTS-2603"
  }
}

Models with a configured endpoint show as "(remote)" in the UI. Models without local dependencies or a remote endpoint show as "(not installed)".
vLLM with the vllm-omni extension can serve TTS models with an OpenAI-compatible API. Currently supported TTS models: Voxtral-4B, Qwen3-TTS, and Fish Speech S2 Pro.
# Create a dedicated venv on the server
python3 -m venv ~/vllm-env && source ~/vllm-env/bin/activate
# Install vllm first, then vllm-omni (order matters — vllm-omni registers
# as a plugin and must be installed after vllm)
pip install vllm
pip install git+https://github.com/vllm-project/vllm-omni.git

If --omni is not recognised after installation, uninstall both and reinstall in order:
pip uninstall vllm vllm-omni -y
pip install vllm
pip install git+https://github.com/vllm-project/vllm-omni.git

The first run downloads the model weights from HuggingFace automatically.
Voxtral-4B (requires >= 16 GB VRAM):
vllm serve mistralai/Voxtral-4B-TTS-2603 \
  --omni \
  --trust-remote-code \
  --enforce-eager

Qwen3-TTS (requires >= 8 GB VRAM):

vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base --omni

The server listens on port 8000 by default. Add --port 8001 to change it.
If you have multiple GPUs, force a specific one:
CUDA_VISIBLE_DEVICES=0 CUDA_DEVICE_ORDER=PCI_BUS_ID vllm serve ...

vLLM runs one model per server process. To serve multiple models, start each on a different port.
On the machine running TTS Studio, create endpoints.json:
{
  "voxtral-4b": {
    "url": "http://your-server:8000/v1",
    "model": "mistralai/Voxtral-4B-TTS-2603"
  }
}

The model field must match the model name the server was started with.
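For scripting against the endpoint, a speech request can also be assembled in Python from an endpoints.json entry. This is a sketch using only the standard library; build_speech_request is a hypothetical helper rather than part of TTS Studio, and the payload fields assume the OpenAI-style /v1/audio/speech route that the server exposes.

```python
import json
import urllib.request


def build_speech_request(endpoint: dict, text: str, voice: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for an OpenAI-compatible speech route."""
    url = endpoint["url"].rstrip("/") + "/audio/speech"
    payload = {
        "input": text,
        "model": endpoint["model"],  # must match the name the server was started with
        "voice": voice,
        "response_format": "wav",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same structure as endpoints.json; inlined here so the sketch runs on its own.
ENDPOINTS = json.loads("""
{
  "voxtral-4b": {
    "url": "http://your-server:8000/v1",
    "model": "mistralai/Voxtral-4B-TTS-2603"
  }
}
""")

request = build_speech_request(ENDPOINTS["voxtral-4b"], "Hello, this is a test.", "neutral_female")

if __name__ == "__main__":
    print(request.get_method(), request.full_url)
    # Sending needs a reachable server, so it is left commented out:
    # with urllib.request.urlopen(request, timeout=60) as response:
    #     with open("test.wav", "wb") as out:
    #         out.write(response.read())
```

Reading the model name from the endpoint entry rather than hard-coding it keeps the request and the server configuration from drifting apart.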
# Check the server is up
curl http://your-server:8000/v1/models
# Generate a test clip
curl -X POST http://your-server:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello, this is a test.", "model": "mistralai/Voxtral-4B-TTS-2603", "voice": "neutral_female", "response_format": "wav"}' \
  --output test.wav

Dia is not supported by vLLM. Use Dia-TTS-Server instead:
git clone https://github.com/devnen/Dia-TTS-Server.git
cd Dia-TTS-Server
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python server.py

Dia requires >= 10 GB VRAM and must run in float32, because it produces garbage at half precision. The server listens on port 8003 with an OpenAI-compatible API.

{
  "dia-1.6b": {
    "url": "http://your-server:8003/v1",
    "model": "tts-1"
  }
}

Orpheus uses vLLM internally for inference and a SNAC decoder for audio. TTS Studio includes a bundled worker (workers/orpheus_worker.py) that can be started from the Models tab, or run manually:
# Install orpheus-speech into the vLLM venv
~/vllm-env/bin/pip install orpheus-speech
# Start the worker
~/vllm-env/bin/python workers/orpheus_worker.py --port 8001 --gpu 0

Orpheus requires ~7 GB VRAM. The worker exposes an OpenAI-compatible /v1/audio/speech endpoint.

{
  "orpheus-3b": {
    "url": "http://your-server:8001/v1",
    "model": "canopylabs/orpheus-3b-0.1-ft"
  }
}

- Voice cloning via reference audio is not yet supported through vLLM's HTTP API for Voxtral. Use the Voxtral worker (start it from the Models tab) for voice cloning support.
- vLLM serves one model per process. For multiple models, run separate processes on different ports.
See docs/remote-models.md for more detailed setup notes.
from tts_tests.api import caption_video, generate_tts
# Generate speech
audio, sr = generate_tts("Hello from Python.", model_id="kokoro-82m", voice="bf_emma")
# Caption a video using a saved profile
result = caption_video("video.mp4", profile="emma")
print(result.output_path)

# Generate speech
tts-studio generate "Hello." --profile emma -o hello.wav
# Caption a video (called from another script)
tts-studio caption /path/to/video.mp4 --profile emma --format mkv

When the web UI is running, the Gradio API is available at http://localhost:7860/api. Use the Gradio Python client or any HTTP client to call endpoints programmatically.
Create a new file in tts_tests/models/ that exports:
- MODEL_CLASS: a subclass of TTSModel implementing info(), load(), unload(), is_loaded(), and generate()
- is_available(): a function returning True if the model's dependencies are installed

The model is auto-discovered on startup. See any existing model file for reference.
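The shape of such a module might look like the sketch below. So that it runs on its own, it defines a stand-in TTSModel base class instead of importing the real one from tts_tests/base.py. The method signatures and return types are assumptions: the real info() and generate() presumably return ModelInfo and TTSResult objects, whose fields this README does not show. BeepModel is a toy that emits a tone per word rather than speech.

```python
import math
from abc import ABC, abstractmethod


class TTSModel(ABC):
    """Stand-in for tts_tests.base.TTSModel (assumed interface, for illustration only)."""

    @abstractmethod
    def info(self): ...

    @abstractmethod
    def load(self): ...

    @abstractmethod
    def unload(self): ...

    @abstractmethod
    def is_loaded(self): ...

    @abstractmethod
    def generate(self, text): ...


class BeepModel(TTSModel):
    """Toy model: emits a 0.1 s, 440 Hz tone for every word in the input."""

    SAMPLE_RATE = 16000

    def __init__(self) -> None:
        self._loaded = False

    def info(self) -> dict:
        # Placeholder metadata; the real base class uses a ModelInfo object.
        return {"id": "beep-demo", "name": "Beep (demo)"}

    def load(self) -> None:
        self._loaded = True  # a real model would load weights here

    def unload(self) -> None:
        self._loaded = False  # a real model would free GPU memory here

    def is_loaded(self) -> bool:
        return self._loaded

    def generate(self, text: str) -> tuple[list[float], int]:
        if not self._loaded:
            raise RuntimeError("call load() before generate()")
        samples_per_word = self.SAMPLE_RATE // 10
        audio: list[float] = []
        for _word in text.split():
            audio.extend(
                math.sin(2 * math.pi * 440 * n / self.SAMPLE_RATE)
                for n in range(samples_per_word)
            )
        return audio, self.SAMPLE_RATE


# The two names the registry is described as looking for.
MODEL_CLASS = BeepModel


def is_available() -> bool:
    """Report whether dependencies are importable; this toy only needs the stdlib."""
    return True
```

In a real model file, is_available() would typically try to import the model's optional dependency and return False on ImportError, which is what makes the model show as "(not installed)".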
tts_tests/
  app.py            # Web UI entry point
  cli.py            # CLI entry point and argument parsing
  api.py            # Python API helpers
  base.py           # TTSModel abstract base class, TTSResult, ModelInfo
  config.py         # Paths, device detection, endpoint loading
  profiles.py       # Voice profile management (profiles.json)
  registry.py       # Model discovery, loading/unloading, remote wrapping
  remote.py         # RemoteTTSModel HTTP wrapper
  audio_utils.py    # Audio processing utilities
  models/           # One file per TTS model
  caption/          # Caption-to-speech pipeline
    config.py       # Caption/TTS configuration (YAML)
    srt_parser.py   # SRT subtitle parsing
    tts_bridge.py   # Bridge between caption segments and TTS registry
    timeline.py     # Timeline construction from segments
    render.py       # ffmpeg rendering
    pipeline.py     # Pipeline orchestration
    batch.py        # Batch directory processing
  ui/               # Gradio interface (single generation, comparison)
endpoints.json      # Remote endpoint config (create from endpoints.json.example)
profiles.json       # Voice profiles (user-created)
run.sh              # Quick-start script

Informal notes from testing these models. Your mileage may vary depending on hardware, reference audio quality, and text content.
- Chatterbox produces voice cloning quality roughly on par with Qwen3-TTS Clone, despite being a much smaller model (350M vs 1.7B). It's noticeably faster too. The trade-off is that Chatterbox has no preset voices — you always need a reference audio clip.
- Qwen3-TTS is the best all-rounder for preset voices. The Eric and Vivian presets are particularly natural. The Clone variant is solid for voice cloning but slower than Chatterbox at inference.
- Kokoro is excellent for its size — 82M parameters with surprisingly good quality and very fast inference. Best choice when VRAM is tight or you need high throughput.
- Voxtral has the widest language coverage (9 languages, 20 voices) but requires a beefy GPU (~16 GB) and runs best via vLLM on a dedicated server.