StableTTS

Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3.

Introduction

As the first open-source TTS model to try combining flow-matching and DiT, StableTTS is a fast and lightweight TTS model for Chinese and English speech generation. It has only 10M parameters.

✨ Hugging Face demo: chinese_version, english_version

Pretrained models

We provide pretrained models ready for inference, finetuning and webui. Simply download and place the models in the ./checkpoints directory to get started.

Model Name	Task Details	Dataset	Download Link
StableTTS	text to mel	400h English	🤗
StableTTS	text to mel	100h Chinese	🤗
Vocos	mel to wav	2k English + Chinese + Japanese	🤗

Larger models, better pretrained models, and multilingual models are coming soon.

Installation

Set up PyTorch: Follow the official PyTorch guide to install pytorch and torchaudio. We recommend using the latest version for optimal performance.
Install dependencies: Run the following command to install the required Python packages:

pip install -r requirements.txt
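
After installing, a quick check like the one below can confirm that the core packages are visible to Python. This is a minimal sketch using only the standard library. The package list is an assumption (torch and torchaudio from the steps above, gradio from the webui), not the full contents of requirements.txt, and the helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Assumed package list: torch and torchaudio come from the installation steps,
# gradio from the webui; requirements.txt may list more.
REQUIRED = ["torch", "torchaudio", "gradio"]


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All core packages found.")
```

`find_spec` only locates a package without importing it, so the check stays fast even when the packages themselves are heavy to load.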

Inference

For detailed inference instructions, please refer to inference.ipynb.

We also provide a webui based on gradio; please refer to webui.py.

Training

Training your models with StableTTS is designed to be straightforward and efficient. Here’s how to get started:

Preparing Your Data

Generate text and audio pairs: Create a file list of text and audio pairs in the same format as ./filelists/example.txt. Recipes for some open-source datasets can be found in ./recipes.
Run preprocessing: Adjust the DataConfig in preprocess.py to set your input and output paths, then run the script. It processes the audio and text in your file list and outputs a JSON file with paths to mel features and phonemes. Note: Be sure to change language = 'chinese' in DataConfig when processing English or Japanese text.
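
The file list step might look like the sketch below. It assumes one `audio_path|text` pair per line, which is an assumption here; check ./filelists/example.txt for the exact layout. The function name `write_filelist` and the sample paths are hypothetical.

```python
import tempfile
from pathlib import Path


def write_filelist(pairs, out_path, sep="|"):
    """Write (audio_path, text) pairs as one 'audio_path|text' line each.

    The '|' separator is an assumption; match whatever
    ./filelists/example.txt actually uses.
    """
    lines = []
    for audio_path, text in pairs:
        text = " ".join(text.split())  # collapse runs of whitespace and newlines
        if sep in text:
            raise ValueError(f"text contains separator {sep!r}: {text!r}")
        lines.append(f"{audio_path}{sep}{text}")
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    # Hypothetical sample data, written to a temporary directory.
    pairs = [
        ("./audio/0001.wav", "Hello world."),
        ("./audio/0002.wav", "Flow matching\nis fun."),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "filelists" / "my_dataset.txt"
        print(write_filelist(pairs, target), "lines written")
        print(target.read_text(encoding="utf-8"), end="")
```

Rejecting text that contains the separator keeps every line unambiguous to parse later.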

Note: Since we use a reference encoder to capture speaker identity during training, no speaker ID is needed for multispeaker training and synthesis.

Start training

Adjust Training Configuration: In config.py, modify TrainConfig to set your file list path and adjust training parameters as needed.
Start the Training Process: Launch train.py to start training your model.

Note: For finetuning, download the pretrained model and place it in the model_save_path directory specified in TrainConfig. The training script will automatically detect and load the pretrained checkpoint.
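
Automatic checkpoint detection of this kind could work roughly like the sketch below. It is not the logic in train.py. The `checkpoint_<step>.pt` naming pattern and the function name `find_latest_checkpoint` are hypothetical, chosen only to illustrate picking the most recent file in model_save_path.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical naming scheme: checkpoint_<step>.pt
CHECKPOINT_PATTERN = re.compile(r"checkpoint_(\d+)\.pt")


def find_latest_checkpoint(model_save_path):
    """Return the checkpoint path with the highest step number, or None."""
    directory = Path(model_save_path)
    if not directory.is_dir():
        return None
    best_step, best_path = -1, None
    for path in directory.iterdir():
        match = CHECKPOINT_PATTERN.fullmatch(path.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("checkpoint_500.pt", "checkpoint_12000.pt", "notes.txt"):
            (Path(tmp) / name).touch()
        print(find_latest_checkpoint(tmp))
```

Comparing step numbers as integers avoids the string-ordering trap in which `checkpoint_500.pt` would sort after `checkpoint_12000.pt`.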

Experiment with Configurations

Feel free to explore the settings in config.py and adjust the hyperparameters!
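
One way to try different hyperparameters without editing the defaults in place is to copy a config with overrides. The sketch below uses a stand-in dataclass. The class name `ExampleTrainConfig`, its field names, and their values are all hypothetical and may not match the real TrainConfig in config.py.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExampleTrainConfig:
    """Stand-in for TrainConfig; every field and value here is hypothetical."""
    train_dataset_path: str = "./filelists/filelist.json"
    model_save_path: str = "./checkpoints"
    batch_size: int = 32
    learning_rate: float = 1e-4
    num_epochs: int = 100


base = ExampleTrainConfig()
# Derive an experiment config without mutating the defaults.
small_batch = replace(base, batch_size=16, learning_rate=5e-5)

if __name__ == "__main__":
    print(base)
    print(small_batch)
```

Keeping the base config frozen means each experiment is an explicit, separate object, which makes runs easier to compare.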

Model structure

We use the Diffusion Convolution Transformer block from HierSpeech++, which combines the original DiT with FFT (the Feed-Forward Transformer from FastSpeech) for better prosody.
In the flow-matching decoder, we add a FiLM layer before the DiT block to condition the model on the timestep embedding. We also add three ConvNeXt blocks before DiT. We found that this helps model convergence and improves sound quality.
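
As a reminder of what FiLM conditioning computes, the standard feature-wise linear modulation is shown below. Here $h$ is the hidden feature, $e_t$ is the timestep embedding, and $\gamma$ and $\beta$ are learned projections that produce a per-channel scale and shift. The symbol names are chosen for illustration, and the exact parameterization used in this repo may differ.

```latex
\mathrm{FiLM}(h \mid e_t) = \gamma(e_t) \odot h + \beta(e_t)
```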

References

The development of our models heavily relies on insights and code from various projects. We express our heartfelt thanks to the creators of the following:

Direct Inspirations

Matcha TTS: Essential flow-matching code.

Grad TTS: Diffusion model structure.

Stable Diffusion 3: Idea of combining flow-matching and DiT.

VITS: Code style and MAS insights, DistributedBucketSampler.

Additional References

plowtts-pytorch: MAS code used in training

Bert-VITS2: numba version of MAS and modern PyTorch code for VITS

fish-speech: dataclass usage and mel-spectrogram transforms using torchaudio

gpt-sovits: mel-style encoder for voice cloning

diffsinger: Chinese three-section phoneme scheme for Chinese G2P

coqui xtts: gradio webui

TODO

Release pretrained models.
Provide detailed finetuning instructions.
Support Japanese language.
User-friendly preprocessing and inference scripts.
Enhance documentation and citations.
Add Chinese version of README.
Release multilingual checkpoint.

Disclaimer

Any organization or individual is prohibited from using any technology in this repo to generate or edit someone's speech without their consent, including but not limited to the speech of government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.
