Skip to content

KdaiP/StableTTS

Repository files navigation

StableTTS

Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3.

Introduction

As the first open-source TTS model that tried to combine flow-matching and DiT, StableTTS is a fast and lightweight TTS model for chinese and english speech generation. It has only 10M parameters.

Huggingface demo: chinese_version english_version

Pretrained models

We provide pretrained models ready for inference, finetuning and webui. Simply download and place the models in the ./checkpoints directory to get started.

Model Name Task Details Dataset Download Link
StableTTS text to mel 400h english 🤗
StableTTS text to mel 100h chinese 🤗
Vocos mel to wav 2k english + chinese + japanese 🤗

Larger models, better pretrained models and multilingual models will comming soon...

Installation

  1. Set up pytorch: Follow the official PyTorch guide to install pytorch and torchaudio. We recommend using the latest version for optimal performance.

  2. Install Dependencies: Run the following command to install the required Python packages:

pip install -r requirements.txt

Inference

For detailed inference instructions, please refer to inference.ipynb

We also provide a webui based on gradio, please refer to webui.py

Training

Training your models with StableTTS is designed to be straightforward and efficient. Here’s how to get started:

Preparing Your Data

  1. Generate Text and Audio pairs: Generate the text and audio pair filelist as ./filelists/example.txt. Some recipes of open-source datasets could be found in ./recipes.

  2. Run Preprocessing: Adjust the DataConfig in preprocess.py to set your input and output paths, then run the script. This will process the audio and text according to your list, outputting a JSON file with paths to mel features and phonemes. Note: Ensure to change language = 'chinese' in DataConfig for English or Japanese text processing.

Note: Since we use reference encoder to capture speaker identity when training, there is no need for a speaker ID in multispeaker synthesis and training.

Start training

  1. Adjust Training Configuration: In config.py, modify TrainConfig to set your file list path and adjust training parameters as needed.

  2. Start the Training Process: Launch train.py to start training your model.

Note: For finetuning, download the pretrained model and place it in the model_save_path directory specified in TrainConfig. Training script will automatically detect and load the pretrained checkpoint.

Experiment with Configurations

Feel free to explore and modify settings in config.py to modify the hyperparameters!

Model structure

  • We use the Diffusion Convolution Transformer block from Hierspeech++, which is a combination of original DiT and FFT(Feed forward Transformer from fastspeech) for better prosody.

  • In flow-matching decoder, we add a FiLM layer before DiT block to condition timestep embedding into model. We also add three ConvNeXt blocks before DiT. We found it helps with model convergence and better sound quality

References

The development of our models heavily relies on insights and code from various projects. We express our heartfelt thanks to the creators of the following:

Direct Inspirations

Matcha TTS: Essential flow-matching code.

Grad TTS: Diffusion model structure.

Stable Diffusion 3: Idea of combining flow-matching and DiT.

Vits: Code style and MAS insights, DistributedBucketSampler.

Additional References:

plowtts-pytorch: codes of MAS in training

Bert-VITS2 : numba version of MAS and modern pytorch codes of Vits

fish-speech: dataclass usage and mel-spectrogram transforms using torchaudio

gpt-sovits: melstyle encoder for voice clone

diffsinger: chinese three section phoneme scheme for chinese g2p

coqui xtts: gradio webui

TODO

  • Release pretrained models.
  • Provide detailed finetuning instructions.
  • Support Japanese language.
  • User friendly preprocess and inference script.
  • Enhance documentation and citations.
  • Add chinese version of readme.
  • Release multilingual checkpoint.

Disclaimer

Any organization or individual is prohibited from using any technology in this repo to generate or edit someone's speech without his/her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.

About

Next-generation TTS model using flow-matching and DiT, inspired by Stable Diffusion 3

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published