- Colab demo is available!
- 3/9 Corrected a significant error!
- 30/8 Started training a model of A Certain Magical Index!
- 30/8 Multi-speaker training is available!
Inspired by Rcell, I replaced the word embedding of the TextEncoder in VITS with the output of the ContentEncoder used in Soft-VC, achieving any-to-one voice conversion with non-parallel data. Of course, any-to-many voice conversion is also doable! If you are interested in the performance of Soft-VC, you may refer to this demo. I've trained an acoustic model for 3 days on about 2000 audio clips.
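To make the idea concrete, here is a minimal PyTorch sketch of the modification. It assumes 256-dimensional HuBERT-Soft units and a 192-dimensional encoder channel size; the class and parameter names are illustrative, not the actual VITS code.

```python
import torch.nn as nn

class TextEncoder(nn.Module):
    """Simplified VITS-style text encoder where the nn.Embedding over
    phoneme IDs is replaced by a linear projection of soft speech units."""

    def __init__(self, unit_dim=256, hidden_dim=192):
        super().__init__()
        # Original VITS: self.emb = nn.Embedding(n_vocab, hidden_dim)
        # Here: project continuous soft units from the ContentEncoder instead.
        self.proj = nn.Linear(unit_dim, hidden_dim)
        # ... the rest of the encoder (attention blocks, etc.) is unchanged.

    def forward(self, units):
        # units: (batch, frames, unit_dim) extracted by the content encoder
        return self.proj(units)  # (batch, frames, hidden_dim)
```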
Audio files should be mono-channel WAV with a sampling rate of 22050 Hz.
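If your source audio does not meet these requirements, a small conversion script can help. This sketch uses librosa and soundfile, which are not dependencies of this repo; the directory names are placeholders.

```python
import librosa
import soundfile as sf
from pathlib import Path

SRC = Path("raw")         # hypothetical input directory
DST = Path("wavs/train")  # target directory from the layout below

DST.mkdir(parents=True, exist_ok=True)
for path in SRC.glob("*.wav"):
    # Load as mono and resample to 22050 Hz in one step.
    audio, sr = librosa.load(path, sr=22050, mono=True)
    sf.write(DST / path.name, audio, sr)
```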
Your dataset should look like:

```
└───wavs
    ├───dev
    │   ├───LJ001-0001.wav
    │   ├───...
    │   └───LJ050-0278.wav
    └───train
        ├───LJ002-0332.wav
        ├───...
        └───LJ047-0007.wav
```
Use the content encoder to extract speech units from the audio. For more information, refer to this repo.
```
cd hubert
python3 encode.py soft path/to/wavs/directory path/to/soft/directory --extension .wav
```
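To sanity-check the extracted units, you can load one of the output files. This assumes encode.py writes one NumPy .npy array per clip, shaped (frames, 256) for HuBERT-Soft; adjust the path to wherever your output landed.

```python
import numpy as np

# Hypothetical path; mirrors the dataset layout above.
units = np.load("path/to/soft/directory/train/LJ002-0332.npy")
print(units.shape)  # expected: (frames, 256)
```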
Then you need to generate filelists for your training and validation files. It's recommended that you prepare your filelists beforehand! A helper script for generating them is sketched after the formats below.

Your filelists should look like:

Single speaker:
```
path/to/wav|path/to/unit
...
```

Multi-speaker:
```
path/to/wav|id|path/to/unit
...
```
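As an example, here is a minimal script that pairs each wav with its unit file to produce a single-speaker filelist. The directory names and the .npy extension are assumptions based on the steps above.

```python
from pathlib import Path

WAV_DIR = Path("wavs/train")   # from the dataset layout above
UNIT_DIR = Path("soft/train")  # hypothetical output of encode.py

Path("filelists").mkdir(exist_ok=True)
with open("filelists/train.txt", "w") as f:
    for wav in sorted(WAV_DIR.glob("*.wav")):
        unit = UNIT_DIR / (wav.stem + ".npy")
        f.write(f"{wav}|{unit}\n")
```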
Single speaker:
```
python train.py -c configs/config.json -m model_name
```

Multi-speaker:
```
python train_ms.py -c configs/config.json -m model_name
```
You may also refer to train.ipynb.

For inference, please refer to inference.ipynb.
- QQ: 2235306122
- BILIBILI: Francis-Komizu
Special thanks to Rcell for giving me both inspiration and advice!