Light Speed ⚡ is an open-source text-to-speech model based on VITS, with some modifications:
- uses ground-truth phoneme durations, obtained from an external forced aligner (such as the Montreal Forced Aligner), to upsample phoneme-level features to frame-level features; see the first sketch after this list. The result is a more robust model, at a slight cost in speech quality.
- employs dilated convolutions to expand the receptive field of the WaveNet flow module, enhancing its ability to capture long-range dependencies; see the second sketch after this list.
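To illustrate the first modification, here is a minimal sketch of duration-based upsampling in PyTorch. The tensor names, shapes, and channel count are illustrative assumptions, not Light Speed's actual code:

```python
import torch

phoneme_states = torch.randn(5, 192)       # 5 phonemes, 192 channels (made-up sizes)
durations = torch.tensor([3, 1, 4, 2, 5])  # ground-truth frames per phoneme

# Repeat each phoneme's hidden state for as many frames as it lasts,
# yielding a frame-level sequence of shape (sum(durations), channels).
frame_states = torch.repeat_interleave(phoneme_states, durations, dim=0)
print(frame_states.shape)                  # torch.Size([15, 192])
```

Because the durations come from a forced aligner rather than a learned alignment module, the model never has to discover the text-to-audio alignment itself, which is the source of the robustness noted above.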
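And a sketch of the second modification: stacking dilated 1-D convolutions to widen the receptive field. The kernel size, dilation schedule, and channel count are assumptions for illustration, not the model's actual configuration:

```python
import torch
import torch.nn as nn

channels, kernel_size = 192, 5
layers = []
for i in range(4):
    dilation = 2 ** i                             # 1, 2, 4, 8
    padding = (kernel_size - 1) * dilation // 2   # keep the frame count unchanged
    layers.append(nn.Conv1d(channels, channels, kernel_size,
                            dilation=dilation, padding=padding))
stack = nn.Sequential(*layers)

x = torch.randn(1, channels, 100)   # (batch, channels, frames)
y = stack(x)
print(y.shape)                      # torch.Size([1, 192, 100])

# Receptive field = 1 + (kernel_size - 1) * sum(dilations)
#                 = 1 + 4 * (1 + 2 + 4 + 8) = 61 frames,
# versus 1 + 4 * 4 = 17 frames for the same stack without dilation.
```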
We provide two pretrained models and demos:
- VN - Male voice: https://huggingface.co/spaces/ntt123/Vietnam-male-voice-TTS
- VN - Female voice: https://huggingface.co/spaces/ntt123/Vietnam-female-voice-TTS
Q: How do I create training data?
A: See the `./prepare_ljs_tfdata.ipynb` notebook for instructions on preparing the training data.
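One step such preparation typically involves is converting the aligner's phoneme intervals (in seconds) into per-phoneme frame counts. The intervals, sample rate, and hop length below are made-up values for illustration; the notebook documents the actual pipeline:

```python
sample_rate = 22050
hop_length = 256
frames_per_second = sample_rate / hop_length

# (phoneme, start_sec, end_sec) triples as a forced aligner might emit them.
intervals = [("HH", 0.00, 0.08), ("AH", 0.08, 0.21), ("L", 0.21, 0.30)]

durations = []
for _, start, end in intervals:
    # Round boundaries to frame indices so the counts sum to the total frame count.
    durations.append(round(end * frames_per_second) - round(start * frames_per_second))
print(durations)  # [7, 11, 8]
```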
Q: How can I train the model with 1 GPU?
A: Run: `python train.py`
Q: How can I train the model with 4 GPUs?
A: Run: `torchrun --standalone --nnodes=1 --nproc-per-node=4 train.py`
Q: How can I train a model to predict phoneme durations?
A: See the `./train_duration_model.ipynb` notebook.
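For orientation, here is an illustrative duration predictor: a small convolutional network mapping phoneme IDs to log-durations, trained with an MSE loss against the aligner's ground-truth frame counts. Every size and layer choice below is an assumption; the notebook defines the actual model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DurationPredictor(nn.Module):
    """Toy phoneme-duration model; sizes are illustrative, not this repo's."""

    def __init__(self, num_phonemes=100, channels=192):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, channels)
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv1d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Conv1d(channels, 1, 1)

    def forward(self, phoneme_ids):                    # (batch, T) int IDs
        x = self.embed(phoneme_ids).transpose(1, 2)    # (batch, channels, T)
        return self.proj(self.conv(x)).squeeze(1)      # (batch, T) log-durations

model = DurationPredictor()
phoneme_ids = torch.randint(0, 100, (2, 12))                        # a fake batch
log_durations = torch.log1p(torch.randint(1, 20, (2, 12)).float())  # fake targets
loss = F.mse_loss(model(phoneme_ids), log_durations)
loss.backward()
```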
Q: How can I generate speech with a trained model?
A: See the `./inference.ipynb` notebook.
- Most of the code in this repository is based on the official VITS repository.