Rust-native text-to-speech and speech synthesis for modern open-weight models.
any-tts is a Rust TTS library built around Candle with one trait-based API for Kokoro, OmniVoice, Qwen3-TTS, VibeVoice, VibeVoice Realtime, and Voxtral. It is aimed at developers who want local speech synthesis, multilingual TTS, reference-audio prompting, preset-voice workflows, or low-latency voice agents without rewriting their application around each model family.
You can point it at local files, hand it explicit paths from your own cache, feed it named in-memory byte assets from an object store, or let it resolve missing assets from Hugging Face while keeping the synthesis call site unchanged.
For Flow-like specifically: every public backend can now load from relative-path byte assets, so object_store reads can go straight into TtsConfig without writing temp files first.
- One API for Kokoro, OmniVoice, Qwen3-TTS, VibeVoice, VibeVoice Realtime, and Voxtral.
- Native Rust backends across the public model surface.
- Local path loading, in-memory byte bundles, per-file wiring, or Hugging Face fallback.
- GPU first when available through CUDA or Metal, with CPU fallback and optional Accelerate support for Apple CPU builds.
- Request-level control for `language`, `voice`, `reference_audio`, `instruct`, `max_tokens`, `temperature`, and `cfg_scale`.
- WAV output everywhere, with built-in WAV and MP3 input decoding for cleanup and reference-audio workflows.
| Model | Status in any-tts | Default upstream | Best at | Main tradeoff | Model license |
|---|---|---|---|---|---|
| Kokoro-82M | Public, native, lightweight | hexgrad/Kokoro-82M | Fast local TTS with small weights | Uses an in-tree pure-Rust phonemizer compatible with Kokoro's current public language set; parity tuning is still ongoing | Apache-2.0 |
| OmniVoice | Public, native | k2-fsa/OmniVoice | Huge language coverage and instruct-driven voice design | The current Rust backend does not yet expose upstream zero-shot cloning | Apache-2.0 |
| Qwen3-TTS-12Hz-1.7B | Public, native | Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice | Strong multilingual control, named speakers, and instruct handling | Heavy weights and extra speech-tokenizer assets | Apache-2.0 |
| VibeVoice-1.5B | Public, native | microsoft/VibeVoice-1.5B | Long-form multi-speaker speech diffusion with native Rust inference | Still early and currently optimized for single-request parity work rather than streaming performance | MIT |
| VibeVoice-Realtime-0.5B | Public, native | microsoft/VibeVoice-Realtime-0.5B | English-first upstream and depends on cached `voices/*.pt` presets instead of reference audio; best at low-latency preset-voice TTS with native Rust inference | MIT |
| Voxtral-4B-TTS-2603 | Public, native | mistralai/Voxtral-4B-TTS-2603 | Production-style voice agents, preset voices, low-latency oriented stack | Largest backend here and not commercially permissive | CC BY-NC 4.0 |
Important: the Rust crate is dual licensed under MIT OR Apache-2.0. The model weights are not. Always check the model-specific license before shipping. Voxtral is the one that changes the deployment story the most because its published checkpoint is CC BY-NC 4.0.
All six models above are exposed through the public ModelType enum and load_model() API.
- Use `Kokoro` when you want the smallest local deployment and simple preset-voice TTS.
- Use `OmniVoice` when language coverage and `instruct` matter more than named voices.
- Use `Qwen3Tts` when you want the strongest public request-control surface and named speakers.
- Use `VibeVoice` when you need long-form or reference-audio-conditioned generation.
- Use `VibeVoiceRealtime` when you want cached-prompt preset voices and faster time-to-first-audio.
- Use `Voxtral` when you want preset-voice, voice-agent-style deployment and can accept its model license.
For CPU-only builds, a recent stable Rust toolchain is enough. For GPU builds, compile the feature set that matches your machine. Kokoro no longer requires a system espeak-ng install: the repo now ships an in-tree pure-Rust phonemizer with an espeak-rs-compatible interface for the language set exposed by the current Kokoro backend.
Add the crate from crates.io:
```toml
[dependencies]
any-tts = "0.1"
```

Or opt into a smaller feature set:

```toml
[dependencies]
any-tts = { version = "0.1", default-features = false, features = ["kokoro", "download", "metal"] }
```

By default the crate enables `qwen3-tts`, `kokoro`, `omnivoice`, `vibevoice`, `voxtral`, and `download`. The `vibevoice` feature exposes both `ModelType::VibeVoice` and `ModelType::VibeVoiceRealtime`.
| Feature | What it does |
|---|---|
| `kokoro` | Enables the Kokoro backend. |
| `omnivoice` | Enables the native OmniVoice backend. |
| `qwen3-tts` | Enables the Qwen3-TTS backend. |
| `vibevoice` | Enables the native VibeVoice and VibeVoice Realtime backends. |
| `voxtral` | Enables the native Voxtral backend. |
| `download` | Allows missing model files to be pulled from Hugging Face Hub through the crate's built-in downloader. |
| `cuda` | Builds Candle with CUDA support. |
| `metal` | Builds Candle with Metal support for Apple GPUs. |
| `accelerate` | Enables Apple Accelerate support for CPU-heavy Apple builds. |
- `DeviceSelection::Auto` tries CUDA first, then Metal, then CPU.
- `DeviceSelection::Cpu`, `DeviceSelection::Cuda(0)`, and `DeviceSelection::Metal(0)` let you force the runtime target.
- `preferred_runtime_choice(ModelType::...)` returns the fastest safe device and dtype for the current machine.
- `TtsConfig::with_preferred_runtime()` applies that runtime choice in one builder call.
- `DType` can be set to `F32`, `F16`, or `BF16`.
- On CPU, models that cannot safely run BF16 fall back to `F32`.
- The native OmniVoice helper prefers `cuda:0 (bf16)`, then `metal:0 (f32)`, then `cpu (f32)`.
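As a quick illustration, here is a hedged sketch combining the device and dtype builders listed above; the Kokoro model path is a placeholder for wherever your weights live:

```rust
use any_tts::config::DType;
use any_tts::{load_model, DeviceSelection, ModelType, TtsConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Let the crate pick the fastest safe device and dtype for this machine.
    let auto = load_model(
        TtsConfig::new(ModelType::Kokoro)
            .with_model_path("./models/Kokoro-82M") // placeholder path
            .with_preferred_runtime(),
    )?;

    // Or force CPU with F32 explicitly, e.g. for reproducible CI runs.
    let cpu = load_model(
        TtsConfig::new(ModelType::Kokoro)
            .with_model_path("./models/Kokoro-82M")
            .with_device(DeviceSelection::Cpu)
            .with_dtype(DType::F32),
    )?;

    let _ = (auto, cpu);
    Ok(())
}
```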
```rust
use any_tts::{load_model, ModelType, SynthesisRequest, TtsConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = load_model(
        TtsConfig::new(ModelType::Qwen3Tts)
            .with_model_path("./models/Qwen3-TTS")
    )?;
    let audio = model.synthesize(
        &SynthesisRequest::new("Hello from Rust TTS.")
            .with_language("English")
            .with_voice("Ryan")
            .with_instruct("Calm, clear, slightly upbeat."),
    )?;
    audio.save_wav("hello.wav")?;
    Ok(())
}
```

If your runtime already has model artifacts in memory, use `ModelAssetBundle` or the `with_*_bytes()` builders instead of writing them to disk first.
```rust
use any_tts::{load_model, ModelAssetBundle, ModelType, SynthesisRequest, TtsConfig};

fn read_object(_key: &str) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    Ok(Vec::new())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let assets = ModelAssetBundle::new()
        .with_bytes("config.json", read_object("config.json")?)
        .with_bytes("tokenizer.json", read_object("tokenizer.json")?)
        .with_bytes("model.safetensors", read_object("model.safetensors")?)
        .with_bytes(
            "speech_tokenizer/model.safetensors",
            read_object("speech_tokenizer/model.safetensors")?,
        );
    let model = load_model(
        TtsConfig::new(ModelType::Qwen3Tts)
            .with_asset_bundle(assets)
    )?;
    let audio = model.synthesize(&SynthesisRequest::new("Hello from byte-backed assets."))?;
    let wav_bytes = audio.get_wav();
    let _ = wav_bytes;
    Ok(())
}
```

The relative paths in the asset bundle should match the model layout documented below, for example `config.json`, `audio_tokenizer/model.safetensors`, or `voice_embedding/Aurora.pt`.
Generated audio already comes back as AudioSamples, so output does not need filesystem paths either.
- `audio.get_wav()` returns a complete WAV file as `Vec<u8>` for every backend.
- `audio.save_wav()` is a convenience helper on top of the same byte encoder.
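Because the encoded WAV comes back as plain bytes, a service can keep the whole pipeline in memory. A sketch, where `upload_object` is a hypothetical stand-in for your object store or HTTP response writer:

```rust
use any_tts::{load_model, ModelType, SynthesisRequest, TtsConfig};

// Hypothetical sink standing in for an object store put or HTTP response body.
fn upload_object(key: &str, bytes: Vec<u8>) -> Result<(), Box<dyn std::error::Error>> {
    let _ = (key, bytes);
    Ok(())
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = load_model(
        TtsConfig::new(ModelType::Kokoro).with_model_path("./models/Kokoro-82M"),
    )?;
    let audio = model.synthesize(&SynthesisRequest::new("Bytes in, bytes out."))?;
    // No temp files: hand the encoded WAV straight to the sink.
    upload_object("renders/hello.wav", audio.get_wav())?;
    Ok(())
}
```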
```rust
use any_tts::{AudioSamples, DenoiseOptions};
use std::io::Cursor;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let input = std::fs::read("speech-with-music.mp3")?;
    let cleaned = AudioSamples::denoise_audio_stream(
        Cursor::new(input),
        DenoiseOptions::default(),
    )?;
    cleaned.save_wav("speech-cleaned.wav")?;
    Ok(())
}
```

The denoiser auto-detects WAV and MP3 input streams and applies a speech-band filter plus a short-time spectral gate. It is useful for attenuating steady background noise and background music, but it is not a full voice-isolation or source-separation model.
any-tts resolves model assets in four tiers, in this order:
- Explicit files you set on `TtsConfig` with methods like `with_config_file()` or `with_weight_file()`.
- Auto-discovery from `with_asset_bundle()` or `with_asset_bytes()` using model-relative paths.
- Auto-discovery from `with_model_path()` using the expected filenames for that backend.
- Hugging Face fallback through the `download` feature.
That means you can mix strategies. A service with its own artifact cache can hand over a few exact files and let the crate discover or download the rest.
You can inspect the documented manifest programmatically through ModelType::asset_requirements(). The expected relative paths are:
| Model | Required asset patterns | Optional asset patterns |
|---|---|---|
| Kokoro | `config.json`, `model.safetensors` or `*.pth` | `voices/*.pt` |
| OmniVoice | `config.json`, `tokenizer.json`, `model.safetensors` or `model-*-of-*.safetensors`, `audio_tokenizer/config.json`, `audio_tokenizer/model.safetensors` or `audio_tokenizer/model-*-of-*.safetensors` | `generation_config.json` |
| Qwen3-TTS | `config.json`, `tokenizer.json`, `model.safetensors` or `model-*-of-*.safetensors`, `speech_tokenizer/model.safetensors` or `speech_tokenizer/model-*-of-*.safetensors` | `speech_tokenizer/config.json`, `generation_config.json` |
| VibeVoice | `config.json`, `tokenizer.json`, `model.safetensors` or `model-*-of-*.safetensors` | `preprocessor_config.json`, `generation_config.json` |
| VibeVoice Realtime | `config.json`, `tokenizer.json`, `model.safetensors` | `preprocessor_config.json`, `voices/*.pt` |
| Voxtral | `params.json`, `tekken.json`, `consolidated.safetensors`, `voice_embedding/*.pt` | none |
Using these exact relative paths keeps the byte-based API predictable, because they are the same filenames `with_model_path()` auto-discovery expects on disk.
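A loader can also ask each backend for that manifest at runtime before staging files. A minimal sketch; the `Debug` formatting here is illustrative, since the concrete return type of `asset_requirements()` is whatever the crate version you build against exposes:

```rust
use any_tts::ModelType;

fn main() {
    // Inspect the documented asset manifest for one backend before
    // fetching anything from cache, object storage, or Hugging Face.
    let manifest = ModelType::Qwen3Tts.asset_requirements();
    println!("{manifest:?}");
}
```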
SynthesisRequest keeps the per-call control surface stable across models.
| Field | Purpose | Notes |
|---|---|---|
| `text` | Input text to synthesize | Required for every backend. |
| `language` | Language tag or model-specific language name | Supports ISO tags in several backends and `auto` where available. |
| `voice` | Named speaker or preset voice | Works for Kokoro, Qwen3 CustomVoice, VibeVoice Realtime, and Voxtral. OmniVoice rejects named voices, and full VibeVoice expects `reference_audio` instead. |
| `instruct` | Natural-language style control | Most useful on OmniVoice and Qwen3. |
| `max_tokens` | Upper bound on generated codec/audio tokens | Helpful for latency testing and smoke tests. |
| `temperature` | Sampling temperature | Supported where the backend uses it. |
| `cfg_scale` | Classifier-free guidance scale | Used by OmniVoice and other backends that expose CFG-like control. |
| `reference_audio` | Reference clip for voice cloning or prompt conditioning | Used by backends such as VibeVoice when conditioning from speech; unsupported backends return an explicit error. |
| `voice_embedding` | Precomputed embedding payload | Currently reusable with backends that accept embeddings directly. |
These are the example entry points that match the current public crate surface:
```sh
cargo run --example generate_kokoro --release
cargo run --example generate_qwen3_tts --release
cargo run --example generate_vibevoice --release --no-default-features --features vibevoice,download,metal
cargo run --example generate_vibevoice_realtime --release --no-default-features --features vibevoice,download,metal
cargo run --example generate_voxtral --release
cargo run --example generate_omnivoice --release --no-default-features --features omnivoice,download,metal
cargo run --example generate_comparison_suite --release --features metal -- --runtime all
cargo run --example benchmark_omnivoice --release --no-default-features --features omnivoice,download,metal -- --warmup 1 --iterations 3
```

Outputs are written under `output/` by the example binaries.
`generate_vibevoice` keeps writing the main raw render to the configured `VIBEVOICE_OUTPUT` path and also writes `*_base.wav`, `*_denoised_default.wav`, and `*_denoised_aggressive.wav` under `output/denoise/` by default. You can override that folder with `VIBEVOICE_DENOISE_DIR`.
`generate_vibevoice_realtime` targets `ModelType::VibeVoiceRealtime` and expects cached prompt presets under `models/VibeVoice-Realtime-0.5B/voices/` by default. Use `VIBEVOICE_REALTIME_MODEL_PATH`, `VIBEVOICE_REALTIME_VOICES_DIR`, `VIBEVOICE_REALTIME_VOICE`, `VIBEVOICE_REALTIME_DEVICE`, and `VIBEVOICE_REALTIME_OUTPUT` to override the defaults.
`generate_comparison_suite` writes a shared English and German comparison set under `output/model_comparison/cpu/` and `output/model_comparison/metal/`, plus `report.json` files with per-model load time, per-sample synthesis time, audio duration, and realtime factor. It loads one model at a time so the full suite can run sequentially on tighter memory budgets.
What it is
Kokoro is the compact option in this repo: an 82M-parameter StyleTTS2 plus ISTFTNet stack with Apache-licensed weights. In practice, it is the backend you reach for when you want a fast local model, simple deployment, and a much smaller download than the larger multilingual checkpoints.
What works in any-tts today
- Native Rust inference.
- Default output at 24 kHz.
- Named preset voices discovered from the `voices/` directory.
- Language tags exposed by the current backend: `en`, `ja`, `zh`, `ko`, `fr`, `de`, `it`, `pt`, `es`, `hi`.
- Optional voice-cloning support only when a checkpoint includes style-encoder weights.
Pros
- Small enough to be the practical local-first choice.
- Apache-2.0 model license makes deployment straightforward.
- Good fit for desktop apps, tools, and low-latency local generation.
- Simple model layout relative to the larger codebook-based stacks.
Cons
- Uses an in-tree pure-Rust phonemizer, so pronunciation parity with the upstream espeak-based pipeline is still being tuned.
- The common open release is mostly about preset voice packs, not raw zero-shot cloning.
- Less expressive control than the bigger instruct-heavy model families.
License
- Upstream model weights: Apache-2.0.
- Crate code using the model: MIT OR Apache-2.0.
What it is
OmniVoice is the ambition play in this repo. Upstream, it is a diffusion language model TTS stack aimed at omnilingual zero-shot speech generation with voice design and massive language coverage.
What works in any-tts today
- Native Candle backend.
- `language`, `instruct`, `cfg_scale`, and `max_tokens` request controls.
- Automatic runtime preference selection for CPU, CUDA, or Metal.
- Repo-exposed language set: `auto`, `en`, `zh`, `ja`, `ko`, `de`, `fr`, `es`, `pt`, `ru`, `it`.
What does not work yet in the Rust backend
- Named voices.
- Reference-audio voice cloning.
- Reusable voice embeddings.
The code returns explicit errors for those cases instead of silently falling back to Python.
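A hedged sketch of an instruct-driven OmniVoice request using the supported controls; `with_cfg_scale()` is an assumed builder name following the crate's naming pattern, and the model path and sample text are placeholders:

```rust
use any_tts::{load_model, ModelType, SynthesisRequest, TtsConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = load_model(
        TtsConfig::new(ModelType::OmniVoice).with_model_path("./models/OmniVoice"),
    )?;
    // No named voice here: OmniVoice rejects them and responds to instruct instead.
    let audio = model.synthesize(
        &SynthesisRequest::new("Good morning from the Rust backend.")
            .with_language("en")
            .with_instruct("Warm, unhurried narrator voice.")
            .with_cfg_scale(2.0), // assumed builder name for the cfg_scale field
    )?;
    audio.save_wav("omnivoice-en.wav")?;
    Ok(())
}
```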
Pros
- Strong upstream story for language coverage.
- Good fit for instruction-driven voice design.
- Benchmark helper in this repo already makes backend comparisons easy.
- Apache-2.0 model license.
Cons
- The current Rust implementation exposes less than the upstream model card promises.
- If your main requirement is zero-shot cloning from reference audio, this backend is not there yet in this crate.
- Heavier than Kokoro and less turnkey than the small local-first path.
License
- Upstream model weights: Apache-2.0.
- Crate code using the model: MIT OR Apache-2.0.
What it is
Qwen3-TTS is the control-heavy multilingual option. It uses a discrete multi-codebook language model plus a speech-tokenizer decoder and is designed for named speakers, instruction-following, and multiple TTS operating modes.
What works in any-tts today
- Native Rust backend.
- Default path points to `Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`.
- Named speaker generation for CustomVoice checkpoints.
- VoiceDesign checkpoints also work when selected through `with_hf_model_id()` or local files.
- 24 kHz output with the extra speech-tokenizer weights resolved alongside the main model.
- Repo-level language support tracks the current checkpoint config and includes `auto`.
Example: switch from CustomVoice to VoiceDesign
```rust
use any_tts::{load_model, ModelType, SynthesisRequest, TtsConfig};

let model = load_model(
    TtsConfig::new(ModelType::Qwen3Tts)
        .with_hf_model_id("Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign")
)?;
let audio = model.synthesize(
    &SynthesisRequest::new("This voice should sound sharp, precise, and quietly confident.")
        .with_language("English")
        .with_instruct("Female presenter voice, low warmth, clear diction, subtle authority."),
)?;
```

Pros
- Best overall control surface in the current public crate API.
- Strong multilingual coverage: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian upstream.
- Named speakers are easy to use from a single request builder.
- VoiceDesign gives you a second mode without changing the main crate API.
Cons
- Large download and memory footprint compared with Kokoro.
- Requires the separate speech-tokenizer decoder assets in addition to the main weights.
- Upstream Base-model voice cloning exists, but reference-audio cloning is not implemented in this crate yet.
License
- Upstream model weights: Apache-2.0.
- Crate code using the model: MIT OR Apache-2.0.
What it is
VibeVoice-1.5B is the long-form diffusion backend in this crate. It is the VibeVoice option to choose when you want native Rust inference with reference-audio-conditioned prompting instead of named preset voices.
What works in any-tts today
- Native Rust backend.
- `reference_audio`, `language`, `instruct`, `cfg_scale`, `max_tokens`, and `temperature` request controls.
- 24 kHz output with automatic runtime selection across CPU, CUDA, or Metal.
- Public `ModelType::VibeVoice` loading through the shared `vibevoice` feature.
What does not work yet in the Rust backend
- Named voice presets.
- Pre-extracted reusable voice embeddings.
- Low-latency streaming generation; the current path is still optimized for correctness and parity work.
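A sketch of reference-audio-conditioned generation with this backend; `with_reference_audio()` is an assumed builder name for the `reference_audio` field in the request table, and the paths are placeholders:

```rust
use any_tts::{load_model, ModelType, SynthesisRequest, TtsConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = load_model(
        TtsConfig::new(ModelType::VibeVoice).with_model_path("./models/VibeVoice-1.5B"),
    )?;
    let audio = model.synthesize(
        &SynthesisRequest::new("Condition the output on the reference speaker.")
            // Assumed builder name; the crate decodes WAV and MP3 reference clips.
            .with_reference_audio("./prompts/speaker.wav"),
    )?;
    audio.save_wav("vibevoice-conditioned.wav")?;
    Ok(())
}
```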
Pros
- Best fit in this repo for long-form, reference-audio-conditioned synthesis.
- Native Rust inference for a model family that is often driven from Python examples upstream.
- Shares the same trait-based API as the smaller and faster backends.
Cons
- Heavier and slower than Kokoro or VibeVoice Realtime.
- Still an early backend compared with the simpler local-first paths.
- Not the right choice if your application needs preset voice names or fast startup latency.
License
- Upstream model weights: MIT.
- Crate code using the model: MIT OR Apache-2.0.
What it is
VibeVoice Realtime is the smaller low-latency VibeVoice variant. In this crate it is exposed as ModelType::VibeVoiceRealtime and centers on cached-prompt voice presets rather than reference-audio cloning.
What works in any-tts today
- Native Rust backend.
- Cached-prompt voice presets discovered from `voices/*.pt`.
- Public example coverage through `generate_vibevoice_realtime`.
- 24 kHz output with the same Candle runtime selection flow used by the rest of the crate.
- Voice selection through `SynthesisRequest::with_voice()` when matching preset files are present.
What does not work yet in the Rust backend
- Reference-audio input.
- Pre-extracted voice embeddings.
- Arbitrary speaker cloning without upstream preset cache files.
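A preset-voice sketch for the realtime backend; the voice name must match a cached `voices/*.pt` file, and both the path and the voice name here are placeholders:

```rust
use any_tts::{load_model, ModelType, SynthesisRequest, TtsConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = load_model(
        TtsConfig::new(ModelType::VibeVoiceRealtime)
            .with_model_path("./models/VibeVoice-Realtime-0.5B"),
    )?;
    // "Alice" is a placeholder: it must correspond to a cached voices/Alice.pt
    // preset, otherwise the backend fails explicitly rather than guessing.
    let audio = model.synthesize(
        &SynthesisRequest::new("Low-latency preset-voice output.").with_voice("Alice"),
    )?;
    audio.save_wav("realtime.wav")?;
    Ok(())
}
```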
Pros
- Best fit here for low-latency preset-voice generation.
- Smaller checkpoint than the full VibeVoice-1.5B model.
- Good developer path for apps that reuse approved voices and want predictable startup behavior.
Cons
- Depends on `voices/*.pt` preset caches and fails explicitly if they are missing.
- The upstream model card is English-first and research-oriented.
- Less flexible than full VibeVoice when you need reference-audio conditioning or multi-speaker behavior.
License
- Upstream model weights: MIT.
- Crate code using the model: MIT OR Apache-2.0.
What it is
Voxtral is the biggest public backend in the repo and the most obviously voice-agent-oriented. It pairs a language model with acoustic generation and preset voice embeddings, and the published checkpoint is tuned for multilingual, low-latency TTS deployment scenarios.
What works in any-tts today
- Native Rust backend.
- Preset voice selection from the checkpoint's `voice_embedding/` assets.
- Optional direct `voice_embedding` reuse when you already have a compatible embedding.
- Repo-exposed languages: `en`, `fr`, `es`, `de`, `it`, `pt`, `nl`, `ar`, `hi`.
- Default sample rate is resolved from the model config and outputs at 24 kHz with the published checkpoint.
Pros
- Best fit here for production-style voice-agent workloads.
- Upstream model card emphasizes streaming and low time-to-first-audio.
- Comes with preset voices and a clear multilingual story.
Cons
- The open checkpoint does not ship reference-audio encoder weights, so raw voice cloning is unavailable.
- This is the heaviest public backend in the crate.
- The published model license is CC BY-NC 4.0, so commercial deployment needs extra care or a different model choice.
License
- Upstream model weights: CC BY-NC 4.0.
- Crate code using the model: MIT OR Apache-2.0.
```rust
use any_tts::{load_model, ModelType, TtsConfig};

let model = load_model(
    TtsConfig::new(ModelType::Kokoro)
        .with_model_path("./models/Kokoro-82M")
)?;
```

```rust
use any_tts::{load_model, ModelType, TtsConfig};

let model = load_model(
    TtsConfig::new(ModelType::Qwen3Tts)
        .with_config_file("/cache/config.json")
        .with_tokenizer_file("/cache/tokenizer.json")
        .with_weight_file("/cache/model-00001-of-00002.safetensors")
        .with_weight_file("/cache/model-00002-of-00002.safetensors")
        .with_speech_tokenizer_weight_file("/cache/speech-tokenizer.safetensors")
)?;
```

```rust
use any_tts::config::DType;
use any_tts::{load_model, DeviceSelection, ModelType, TtsConfig};

let model = load_model(
    TtsConfig::new(ModelType::OmniVoice)
        .with_device(DeviceSelection::Metal(0))
        .with_dtype(DType::F16)
)?;
```

This repo now includes the standard GitHub community files you would expect for an active project:
- Contributing
- Code of Conduct
- Security Policy
- Support Guide
- Issue templates under `.github/ISSUE_TEMPLATE/`
- A pull request template at `.github/PULL_REQUEST_TEMPLATE.md`
If you are adding a backend, model variant, or new loading flow, keep the public story honest: unsupported features should fail explicitly, examples should match exported API, and docs should separate experimental repo work from supported top-level surfaces.
The short version is in CONTRIBUTING.md.
Model weights, runtime backends, and artifact loading all change the risk profile of TTS systems. Please read SECURITY.md before disclosing a vulnerability publicly.
The crate metadata declares MIT OR Apache-2.0 for this repository's Rust code. That does not supersede the terms attached to any model weights you download and run through it.
This repo now has a clear public shape: six native backends, trait-based loading, byte-first asset support, and example coverage for local, multilingual, long-form, preset-voice, and realtime TTS workflows. The right way to think about it is not "a single-model wrapper" but "a Rust TTS platform layer that lets one application target multiple open model ecosystems without hiding their differences."