FlashTTS is an open-source, low-latency streaming TTS framework that addresses these limitations. It introduces a lagged multi-track architecture that natively processes streaming text and speech inputs, eliminating sentence-level buffering. Acoustic generation is accelerated by integrating parallel Multi-Token Prediction (MTP) with an X-pred mean flow matching decoder, achieving high-fidelity token-to-mel in exactly two function evaluations (2-NFE). By jointly optimizing input processing and decoding efficiency, FlashTTS offers a practical foundation for real-time speech dialogue systems. Experiments show substantially reduced First-Packet Latency (325ms) compared to robust streaming baselines, while preserving zero-shot voice cloning and multi-lingual intelligibility. Model code and checkpoints are released as open source.
- Native Streaming: Lagged multi-track architecture for streaming text and speech I/O without sentence-level buffering
- MTP + X-pred Mean Flow: Parallel Multi-Token Prediction with X-pred mean flow distillation for fast decoding
- 2-NFE Acoustic Generation: High-fidelity token-to-mel in exactly two function evaluations
- Low First-Packet Latency: 325ms first-packet latency versus robust streaming baselines
- Zero-Shot Voice Cloning: Strong few-shot speaker adaptation from reference audio
- Cross-Lingual Intelligibility: Preserved quality across languages
- Open Source: Model code and checkpoints released for research and deployment
You can listen to audio samples and interactive demos at the anonymous page
(please open in a new tab / window to avoid 404 on the anonymous host):
- FlashTTS Demo (anonymous):
https://anonymous.4open.science/w/flashtts_demo-7E57/
Figure 1. Architecture overview of FlashTTS: stacked inputs track structure (Stage 1) and multi-token prediction training (Stage 2).
Text/Speech Input
β
[Text Tokenizer] + [Speaker Extractor]
β
[LLM Decoder with MTP Acceleration]
β
[Mean Flow X-pred Module] (2-NFE)
β
[HiFi-GAN Vocoder]
β
Audio Output (24kHz)
-
FlashTTS (cosyvoice): Core TTS inference (LLM, tokenizer, frontend)
- Text tokenizer and speaker embedding (CAMPPlus)
- Decoder-only transformer with multi-token prediction
- Streaming-input stacking text and speech tracks
- FlashTTS model: LLM + MeanFlow + vocoder (inference-only)
-
jit_meanflow_xpred: Efficient acoustic generation
- Mean flow matching for 2-NFE mel-spectrogram generation
- X-pred training objective for improved convergence
-
Streaming: First chunk 24 tokens; then 18-token hop with 6-token lookahead per step (24-token context, output 18 tokens per chunk).
-
Vocoder: HiFi-GAN waveform generation, 24kHz.
From the repo root, run the inference example with FlashTTS (text + reference wav) or MeanFlow-only (token file + reference wav):
# FlashTTS: text-to-speech with voice cloning (needs model_dir with LLM + MeanFlow/vocoder)
python examples/inference.py --mode flash_tts --text "Hello, world." --prompt_wav path/to/ref.wav --output_dir out
# MeanFlow-only: token file + reference wav -> wav (no LLM; for testing acoustic model)
python examples/inference.py --mode meanflow_only --token_path path/to/tokens.npy --prompt_wav path/to/ref.wav --output_dir out
# With streaming (MeanFlow 24/18/6 token chunks)
python examples/inference.py --mode meanflow_only --token_path tokens.npy --prompt_wav ref.wav --output_dir out --streamSee examples/inference.py for --model_dir, --pretrained_model_dir, and other options.
# Clone repository
git clone <repo-url>
cd flashtts_opensource
# Install dependencies
pip install -r requirements.txt
FlashTTS uses a config YAML that defines only llm (no flow/hift); the MeanFlow backend is loaded automatically.
from cosyvoice.cli.cosyvoice import FlashTTS
# model_dir: path to your model dir (with llm.pt, cosyvoice2*.yaml, etc.)
# pretrained_model_dir: base dir for configs and MeanFlow/vocoder assets
model = FlashTTS('model_dir', pretrained_model_dir='path/to/pretrained')
# Zero-shot voice cloning
prompt_wav = 'path/to/prompt_audio.wav' # 16kHz reference
text = "Hello, this is a text-to-speech example."
for output in model.inference_zero_shot(text, "", prompt_wav):
audio = output['tts_speech'] # [1, samples]For fastest inference with the X-pred mean flow module (1 step only):
from jit_meanflow_xpred.infer.infer_meanflow_jit_xpred import (
inference_meanflow, initialize_model, initialize_vocoder
)
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load models
model = initialize_model(config_path, checkpoint_path, device)
vocoder = initialize_vocoder(vocoder_config, vocoder_ckpt, device)
# Run inference with 2 step (MeanFlow = ultra-fast)
wav, inference_time = inference_meanflow(
model=model,
vocoder=vocoder,
token=token_tensor, # [B, seq_len]
spk_emb=speaker_embedding, # [B, 192]
prompt_mel=reference_mel, # [B, T, 80]
steps=2, # MeanFlow: 2 ODE step
cfg_strength=2.0, # Classifier-free guidance strength
device=device
)
print(f"Generated audio shape: {wav.shape}")
print(f"Inference time: {inference_time:.3f}s")
print(f"RTF: {inference_time / (wav.shape[-1] / 24000):.4f}")# Batch inference
python -m jit_meanflow_xpred.infer.infer_meanflow_jit_xpred \
--prompt_wav ./wav_dir \
--token_path ./tokens.jsonl \
--output_dir ./outputs \
--batch \
--steps 1
# Streaming inference (24-token first chunk, 18-token hop, 6-token lookahead)
python -m jit_meanflow_xpred.infer.infer_meanflow_jit_xpred \
--prompt_wav ref.wav \
--token_path tokens.npy \
--output_dir ./outputs \
--stream \
--steps 1- π¨π³ Chinese (Mandarin)
- π¬π§ English
- π«π· French
- π©πͺ German
- π―π΅ Japanese
- π°π· Korean
flashtts_opensource/
βββ cosyvoice/ # FlashTTS core (LLM, frontend, CLI)
β βββ cli/ # Command-line interfaces
β β βββ cosyvoice.py # FlashTTS entry point
β β βββ model.py # FlashTTS model & inference wrapper
β β βββ frontend.py # Audio/text frontend
β βββ llm/ # Language model components
β βββ tokenizer/ # Text/speech tokenization
β βββ transformer/ # Decoder-only Transformer architecture
β βββ utils/ # Utility functions
β βββ hifigan/ # Vocoder components
β βββ vllm/ # vLLM integration (optional)
β
βββ jit_meanflow_xpred/ # Efficient acoustic generation (token2mel + vocoder)
β βββ model/ # CFM, DiT, mel_processing
β βββ infer/
β β βββ infer_meanflow_jit_xpred.py # MeanFlow inference (1-NFE)
β β βββ utils_infer.py
β βββ eval/ # Evaluation scripts
β βββ configs/ # YAML configs
β
βββ third_party/ # Third-party libraries (53 files)
β βββ campplus/ # Speaker embedding extraction
β β βββ tools.py # CAMPPlus inference
β β βββ checkpoints/ # Pre-trained embeddings
β βββ Matcha-TTS/ # Flow matching techniques
β
βββ examples/ # Inference examples
β βββ grpo/
β βββ huawei/
β βββ libritts/
β βββ magicdata-read/
β
βββ asset/ # Project assets
β βββ flashtts_overview.jpg # Architecture diagram
β
βββ README.md # This file
βββ LICENSE # Apache 2.0 License
βββ requirements.txt # Python dependencies
βββ .gitignore # Git ignore rules
Core model configs in jit_meanflow_xpred/configs/:
model:
arch:
num_layers: 16 # Transformer layers
hidden_size: 768 # Hidden dimension
num_heads: 16 # Attention heads
vocab_size: 6563 # Token vocabulary
text_num_embeds: 4096
mel_dim: 80 # Mel-spectrogram dimension
inference:
steps: 2 # MeanFlow: single step ODE
cfg_strength: 2.0 # Classifier-free guidance scale
chunk_size: 0 # 0 = full sequence, >0 = chunk mel framesExtracted using CAMPPlus model:
- Output dimension: 192
- Input requirement: 16kHz, single-channel audio
- Processing: Mel-spectrogram β DTDNN layers β 192-dim embedding
| Metric | Value | Notes |
|---|---|---|
| First-Packet Latency | 325ms | With streaming architecture |
| Mean Flow Steps | 2 | X-pred mean flow: few ODE steps |
| Token-to-Mel Time | ~100ms | On NVIDIA 4090 GPU |
| Mel-to-Wave Time | ~50ms | HiFi-GAN vocoding |
- PyTorch >= 1.13.0
- torchaudio >= 0.13.0
- numpy >= 1.21.0
- scipy >= 1.7.0
- onnxruntime >= 1.12.0
- omegaconf
- hyperpyyaml
See requirements.txt for complete list with specific versions.
This project is licensed under the Apache License 2.0. See LICENSE file for details.
FlashTTS builds upon:
- CosyVoice: TTS backbone and LLM decoder design
- F5-TTS: Flow matching backbone and generation
- HiFi-GAN: High-fidelity neural vocoder architecture
- CAMPPlus: Speaker embedding extraction via contrastive learning