Training Voice Cloning model for another language #492

Closed
rlutsyshyn opened this issue Aug 13, 2020 · 40 comments

Comments

@rlutsyshyn

rlutsyshyn commented Aug 13, 2020

Hi! I already know how to train the synthesizer and vocoder, and how to create a suitable dataset. But if I want to train a voice cloning model for another language, e.g. Ukrainian, what else should I do?

@ghost

ghost commented Aug 13, 2020

Update synthesizer/utils/symbols.py to contain all valid characters in your text transcripts (the characters you want to train on). This is an example for Swedish: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/3eb96df1c6b4b3e46c28c6e75e699bffc6dd43be

However, be careful: in order for someone to run the model you've created they will also need to make the same changes to the file. I spent hours learning this the hard way trying to use the model in #257 because the creator was unavailable to help.
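One way to find out which characters need to go into symbols.py is to scan the transcripts themselves. The following is only a sketch: the helper names are hypothetical, and the `_` pad and `~` end-of-sequence markers are assumptions, so keep whatever special symbols your copy of symbols.py already defines.

```python
# Hypothetical helpers, not part of the repository: derive the character
# inventory for synthesizer/utils/symbols.py from your own transcripts.


def collect_characters(transcripts):
    """Return a sorted string of every distinct character in the transcripts."""
    chars = set()
    for line in transcripts:
        chars.update(line.strip())
    return "".join(sorted(chars))


def build_symbols(transcripts, pad="_", eos="~"):
    """Build a symbol list: special markers first, then the characters.

    The pad and eos values are assumptions for illustration.
    """
    characters = collect_characters(transcripts)
    clash = {pad, eos} & set(characters)
    if clash:
        raise ValueError(f"transcripts contain reserved symbols: {sorted(clash)}")
    return [pad, eos] + list(characters)


# Two made-up Ukrainian transcripts standing in for a real dataset.
transcripts = ["Добрий день!", "Як справи?"]
symbols = build_symbols(transcripts)
print(len(symbols), "symbols:", "".join(symbols))
```

Anyone who later runs the trained model needs the same symbol list in the same order, so it is worth saving the output alongside the checkpoint.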

@rlutsyshyn
Author

Thank you very much! Will try :)

@rlutsyshyn
Author

Can you also tell me: can I somehow fine-tune the pretrained model on some new voice samples without full retraining?

@ghost

ghost commented Aug 13, 2020

Yes, you can resume training on a pretrained model using a different dataset. The main use for this is single-speaker finetuning (process and examples in #437) but you could also finetune multi-speaker using the same process.

One more thing to add: the speaker encoder is trained on English and may not work well for other languages. If you have a large number of voice samples in your target language, you may wish to train a new encoder or at least finetune an existing one. (Data preprocessing for the encoder is not a smooth process, so set your expectations accordingly.)

There are some very good speaker encoders shared in #126 but the model size of 768 is too big to be practical for cloning. You can use this process to import the relevant weights from the model and finetune to a more useful dimension: #458 (comment)

@Ananas120

Hello, I am also going to try training a voice cloning model in another language (French for me), and I have some tips in case they help:

  • First: use transfer learning (load the checkpoint into your custom model; normally the last layer doesn’t have the same shape but all the others do, so load the weights for the other layers). It will really speed up training.
  • Second, as @blue-fish says, the encoder is trained on English, so I don’t know whether it carries over to other languages. (Instead you can use my Siamese network approach: I trained it on a mix of English and French datasets and got awesome results, so it is compatible with multiple languages.)
  • Another thing you can do is test the encoder on your dataset to see whether the results are good (if not, you can retrain it on your data).
  • Last thing: the vocoder is normally a universal vocoder, trained to convert mel spectrograms to waveforms regardless of language, so I think fine-tuning it on your language is only a bonus; it should give good results as it is.

Good luck with training!

@ghost

ghost commented Aug 13, 2020

the encoder is trained in english so don’t know if it is portable for other languages

The English encoder works all right for Swedish. There's info on setting it up and samples in #257 . Since encoder training is very intensive, you should just try it (either jump straight to synth preprocess and training, or do some speaker verification with Ukrainian utterances to see how well it performs).
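A rough version of that speaker verification check compares embeddings of utterances from the same speaker against embeddings from different speakers. The sketch below uses toy 3-dimensional vectors in place of real encoder outputs; the function names are hypothetical, and obtaining the embeddings from the encoder is left out.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_similarity(group_a, group_b):
    """Average similarity over all pairs, skipping a vector paired with itself."""
    scores = [
        cosine_similarity(a, b)
        for a in group_a
        for b in group_b
        if a is not b
    ]
    return sum(scores) / len(scores)


# Toy vectors standing in for utterance embeddings of two speakers.
speaker_1 = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
speaker_2 = [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]

same = (mean_similarity(speaker_1, speaker_1)
        + mean_similarity(speaker_2, speaker_2)) / 2
different = mean_similarity(speaker_1, speaker_2)
print(f"same-speaker: {same:.3f}  different-speaker: {different:.3f}")
```

If the same-speaker score is not clearly higher than the different-speaker score on your own language, that is a hint the encoder may need retraining or finetuning.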

@rlutsyshyn
Author

Thanks guys! Will try :)

@ghost

ghost commented Aug 21, 2020

@rlutsyshyn How is progress on your synthesizer model?

@rlutsyshyn
Author

@blue-fish Just collecting a lot of data :)

@afantasialiberal

Hello, I speak Spanish. Is there a tutorial for training it on my language? Sorry, I am a complete beginner with this, but it's a very fun project.

@ghost

ghost commented Aug 30, 2020

@afantasialiberal Please see #431 (comment) for a general outline of the process. There is no tutorial available at this time.

@rlutsyshyn
Author

Hey, I have a new issue when I try to run vocoder_preprocess. Preprocessing "starts" but runs 0 iterations (without any error).
The datasets/SV2TTS/vocoder/mels_gta directory exists but is empty, and datasets/SV2TTS/vocoder/synthesized.txt is also empty...
Maybe I missed something? I just fine-tuned the pretrained model on my own data (with the synthesizer there were no problems).

@ghost

ghost commented Sep 10, 2020

@rlutsyshyn Do you still have that issue with vocoder preprocess?

@rlutsyshyn
Author

rlutsyshyn commented Sep 10, 2020

@blue-fish I have an issue with the synthesizer now :) When I use 48 kHz audio and recalculate the parameters in synthesizer/hparams.py, after fine-tuning the output voice sounds like Alvin and the Chipmunks (very, very fast)... Maybe you have some advice for this case?
What are the main parameters to configure to get a normal voice in the output?

@Ananas120

Just to be sure: if you train the synthesizer to create 48 kHz mel spectrograms, you should also train the vocoder to generate 48 kHz audio (because the pretrained one is trained on 16 kHz audio).
You should also check that the parameters for the audio player etc. have been updated for your 48 kHz rate.

Good luck!

@rlutsyshyn
Author

@Ananas120 For the synthesizer, I can modify win_size, hop, etc. in hparams.py, but in vocoder/hparams.py I don't see anything like that, so what should I modify to fine-tune my vocoder for 48 kHz data?
Thanks :)

@Ananas120

Honestly, I don’t know; I think blue-fish can help you better with this.
If the audio only seems too fast but otherwise sounds good, it may just be a problem with the audio player's rate, regardless of the spectrogram rate seen by the vocoder (I don’t know whether it makes a difference to the vocoder if the spectrogram is 16 kHz or 48 kHz).
So you could search for where the toolbox uses something like sounddevice.play (sd.play).
You could also take the audio the vocoder generates and play it yourself with the rate set to 48 kHz (with IPython.display.Audio, for example, if you use a Jupyter notebook).
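The arithmetic behind a "chipmunk" effect from a rate mismatch is simple enough to check by hand. This is an illustration of that one possible cause only, with hypothetical function names; it does not diagnose the actual problem in this thread.

```python
def playback_speed(generated_rate_hz, playback_rate_hz):
    """Speed-up factor heard when audio made for one rate is played at another."""
    return playback_rate_hz / generated_rate_hz


def duration_seconds(num_samples, sample_rate_hz):
    """Length of a clip when its samples are played at the given rate."""
    return num_samples / sample_rate_hz


num_samples = 32000  # a clip that lasts two seconds at 16 kHz

speed = playback_speed(16000, 48000)
intended = duration_seconds(num_samples, 16000)
heard = duration_seconds(num_samples, 48000)
print(f"speed-up: {speed}x, intended {intended:.2f} s, heard {heard:.2f} s")
```

Audio generated for 16 kHz but played at 48 kHz runs three times too fast and sounds correspondingly higher pitched, so the rate passed to the player has to match the rate the waveform was generated at.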

@ghost

ghost commented Sep 11, 2020

@rlutsyshyn You need to train a vocoder from scratch. The good news is that it trains relatively fast and you should only need to do it once. Most people choose sampling rates of 22.05 or 24 kHz for faster inference, but that's your call.

In synthesizer hparams, you should set hop_size to 0.0125 * sample_rate, and win_size and n_fft to 4 times that number. The vocoder automatically picks up those hparams from the synthesizer. You'll also need to edit the upsampling factors in this line of code to match your new hop size. For example, 5*5*8 = 200 (the default hop size for 16 kHz).

voc_upsample_factors = (5, 5, 8) # NB - this needs to correctly factorise hop_length

When preprocessing data, the fmax can be adjusted. You can go as high as 0.5 * sample_rate (the Nyquist frequency). Higher is not necessarily better, because we only have 80 mel channels and each channel then needs to represent a wider range of frequencies. If you don't want to experiment, it is safe to leave fmax untouched at 7600 Hz.
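The rules in this comment can be turned into a small calculation. The sketch below is not repository code: the function names are hypothetical, the dictionary keys follow the hparam names used in this thread (hop_size, win_size, n_fft), and the rounding of a non-integer hop is my own assumption.

```python
import math


def derive_audio_hparams(sample_rate, frame_shift_s=0.0125):
    """Apply the rules above: hop = 0.0125 * sample_rate, window and FFT = 4 * hop."""
    # Rounding is an assumption: some rates (e.g. 22050 Hz) do not give an
    # integer hop, in which case the 0.0125 ratio only holds approximately.
    hop_size = round(sample_rate * frame_shift_s)
    win_size = 4 * hop_size
    return {
        "sample_rate": sample_rate,
        "hop_size": hop_size,
        "win_size": win_size,
        "n_fft": win_size,
        "max_fmax": sample_rate / 2,  # upper limit for fmax
    }


def factors_match(upsample_factors, hop_size):
    """True when the vocoder upsampling factors multiply to the hop size."""
    return math.prod(upsample_factors) == hop_size


for rate in (16000, 24000, 48000):
    print(derive_audio_hparams(rate))

default = derive_audio_hparams(16000)
print("(5, 5, 8) fits 16 kHz:", factors_match((5, 5, 8), default["hop_size"]))
```

Running the check before starting a long vocoder training run catches a mismatched `voc_upsample_factors` setting early.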

@rlutsyshyn
Author

Thanks for your response @blue-fish, but to train the vocoder I need a lot of 48 kHz data, which could be a problem. By the way:

  1. If I want to fine-tune the voice cloner on 22 or 24 kHz, do I also need to retrain the vocoder from scratch?
  2. 5 * 5 * 8 = 200, but how can I factor 600? 5 * 5 * 24? Or something else?

@ghost

ghost commented Sep 14, 2020

Hi @rlutsyshyn, you don't need to use the same datasets for synth and vocoder. You can preprocess a different 48khz dataset (even English) and it should generalize to Ukrainian if it has enough voices (several hundred or more). Use synthesizer_preprocess_audio.py , then copy SV2TTS/synthesizer/mels to SV2TTS/vocoder/mels_gta and SV2TTS/synthesizer/train.txt to SV2TTS/vocoder/synthesized.txt.

The downside to this approach is your trained vocoder will not compensate for any deficiencies of your synthesizer model. It is a missed opportunity to make the final output better.
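The copy step described above can be sketched as follows. It assumes the SV2TTS directory layout named in this comment and that the mel files end in .npy; the function name is hypothetical, and the demo runs against a throwaway directory rather than a real dataset.

```python
import shutil
import tempfile
from pathlib import Path


def reuse_synth_mels_for_vocoder(sv2tts_root):
    """Copy synthesizer mels and metadata into the vocoder's expected locations."""
    root = Path(sv2tts_root)
    synth_dir = root / "synthesizer"
    voc_dir = root / "vocoder"
    mels_gta = voc_dir / "mels_gta"
    mels_gta.mkdir(parents=True, exist_ok=True)

    copied = 0
    for mel in sorted((synth_dir / "mels").glob("*.npy")):
        shutil.copy2(mel, mels_gta / mel.name)
        copied += 1

    # The synthesizer metadata doubles as the vocoder's file list.
    shutil.copy2(synth_dir / "train.txt", voc_dir / "synthesized.txt")
    return copied


# Demo on a temporary, fake layout so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "SV2TTS"
    (root / "synthesizer" / "mels").mkdir(parents=True)
    (root / "synthesizer" / "mels" / "mel-example.npy").write_bytes(b"placeholder")
    (root / "synthesizer" / "train.txt").write_text("placeholder line\n", encoding="utf-8")

    copied = reuse_synth_mels_for_vocoder(root)
    vocoder_files = sorted(
        p.name for p in (root / "vocoder").rglob("*") if p.is_file()
    )

print(copied, vocoder_files)
```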

  1. You can continue to use Corentin's pretrained vocoder if your synthesizer hparams satisfy the following conditions.
    • num_mels = 80
    • (hop_size / sample_rate) = 0.0125
    • win_size = hop_size * 4
    • fmin = 55 and fmax = 7600

For proper vocoder inference, you need to edit either synthesizer/hparams.py or vocoder/hparams.py to set hop_size, win_size, and sample_rate to the old values (200, 800, 16 kHz). I don't know if it matters, but you may also want to set n_fft=800. The toolbox uses the synthesizer's sampling rate, so it is easier to edit that hparams file (otherwise you need to resample the wav after getting it back from the vocoder).

The reason this works is because the vocoder just sees a 2d array of shape (num_mels, frames) as input. There is no sample rate information contained in the mel spectrogram. You can even go the other direction, and take a synthesizer trained at 16khz and use the mels on a vocoder trained at 24 khz :)

  2. I've personally tried (4, 4, 4, 4) for 256, and (5, 6, 10) for 300, and the results were good. I have not read the WaveRNN paper, so I don't know how to select the upsampling factors. Maybe try (4, 5, 5, 6) for 600? An extra element does not add that many trainable parameters or affect inference speed significantly.
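To see which factor tuples are even possible for a given hop size, the candidates can be enumerated. This is a hypothetical helper for exploring options; it says nothing about which tuple trains best, which the comment above leaves open.

```python
def factorizations(n, parts, smallest=2):
    """Yield non-decreasing tuples of `parts` integers (>= smallest) with product n."""
    if parts == 1:
        if n >= smallest:
            yield (n,)
        return
    factor = smallest
    # The first factor of a non-decreasing tuple cannot exceed n ** (1 / parts).
    while factor ** parts <= n:
        if n % factor == 0:
            for rest in factorizations(n // factor, parts - 1, factor):
                yield (factor,) + rest
        factor += 1


three_way_200 = list(factorizations(200, 3))
four_way_600 = list(factorizations(600, 4))

print("hop 200, 3 factors:", three_way_200)
print("hop 600, 4 factors:", four_way_600)
```

The tuples mentioned in this thread, (5, 5, 8) for 200 and (4, 5, 5, 6) for 600, both appear in the respective lists.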

@rlutsyshyn
Author

@blue-fish Thanks for your fast response, will try this :)

@rlutsyshyn
Author

@blue-fish Hey! Can you give me some advice? When I fine-tuned only the synthesizer on my data (a 16 kHz English speaker), in testing I got a similar voice, but the words sound like bla bla bla ... bla bla bla. Is that a problem with the synthesizer, or do I have to train (fine-tune) the vocoder for that voice?
Thanks

@ghost

ghost commented Sep 18, 2020

@rlutsyshyn Are you taking the pretrained synthesizer (English) and finetuning on your Ukrainian data? That's not going to work because the mapping of letters to sounds will not match. You need to start the synthesizer training from scratch when working with a new language.

@Ananas120

For the classic Tacotron-2 model, training from English to another language works (French for me), but English and French sounds are not that far apart, so I suppose the mapping differs only slightly.
For this model it doesn’t work, but I think the fault lies not with the pretrained weights but with my encoder, my dataset, or my preprocessing.

@rlutsyshyn
Author

@blue-fish @Ananas120 I used the English synthesizer and tried to fine-tune it on English data that I recorded myself. I collected 400 utterance samples and tried to fine-tune the synthesizer on them, but got bla bla bla.

@ghost

ghost commented Sep 18, 2020

When finetuning, use the same embedding for all of your samples for faster convergence. I take the embedding of the first audio file and use it to overwrite all the others. For inference, make sure you load the same audio file used to generate your embeds for finetuning.

If it still doesn't work, check your preprocessing and also make sure the transcripts in train.txt match what is spoken in the audio files.

@rlutsyshyn
Author

@blue-fish Can you explain this same-embedding approach in more detail, please?

@Ananas120

At the moment I use a "speaker embedding" (the mean of all utterance embeddings). Is that preferable, or is it better to use a single "real" utterance embedding for all?

@ghost

ghost commented Sep 18, 2020

@rlutsyshyn You have 400 wav files in your training set for finetuning. When you run synthesizer_preprocess_embeds.py it will make embed-file1.npy, ... , embed-file400.npy, in SV2TTS/synthesizer/embeds. Copy the contents of file 1 to files 2-400, so that they are all the same.
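A minimal sketch of that overwrite step follows. It assumes the embeds are plain files named embed-*.npy in one directory and treats the alphabetically first one as "file 1"; the function name is hypothetical. It overwrites files in place, so work on a backup copy. The demo uses a temporary directory with placeholder bytes instead of real embeddings.

```python
import shutil
import tempfile
from pathlib import Path


def share_first_embedding(embeds_dir, pattern="embed-*.npy"):
    """Overwrite every embed file with the contents of the first one (by name)."""
    files = sorted(Path(embeds_dir).glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files matching {pattern} in {embeds_dir}")
    reference = files[0]
    for target in files[1:]:
        shutil.copyfile(reference, target)
    # Return the reference so you know which utterance to load at inference.
    return reference


with tempfile.TemporaryDirectory() as tmp:
    embeds = Path(tmp)
    for index in range(1, 4):
        (embeds / f"embed-file{index}.npy").write_bytes(f"data-{index}".encode())

    reference_name = share_first_embedding(embeds).name
    contents = {p.read_bytes() for p in embeds.glob("embed-*.npy")}

print(reference_name, contents)
```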

@Ananas120 I use the embedding of a real utterance so I can load the audio file in the toolbox to get the desired embedding. The mean or L2-norm is technically better but with a good encoder model it shouldn't make much of a difference.
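For comparison, the averaged "speaker embedding" mentioned in the question can be computed as below. Re-normalizing the mean to unit length is an assumption on my part about what is meant here, and the toy 2-dimensional vectors stand in for real utterance embeddings.

```python
import math


def mean_embedding(embeddings):
    """Average the utterance embeddings, then rescale the result to unit L2 norm."""
    dim = len(embeddings[0])
    mean = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


utterance_embeddings = [[1.0, 0.0], [0.0, 1.0]]
speaker_embedding = mean_embedding(utterance_embeddings)
print(speaker_embedding)
```

The trade-off raised in the reply still applies: an averaged embedding cannot be reproduced by loading a single audio file in the toolbox, whereas a real utterance embedding can.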

@rlutsyshyn
Author

@blue-fish

For inference, make sure you load the same audio file used to generate your embeds for finetuning.

what did you mean?

@ghost

ghost commented Sep 19, 2020

@rlutsyshyn After #492 (comment) , your entire dataset is using the embedding from file 1. The embedding corresponds to a specific audio file, let's call it file1.wav. When you test your new synthesizer in the toolbox (or demo_cli.py), you must remember to load file1.wav to generate the embedding.

@rlutsyshyn
Author

@blue-fish I tried what you said, but I still get the same results... bla bla bla. I checked the datasets/SV2TTS/synthesizer/train.txt file and everything looks good there, e.g.:
audio-Track 1 - 218.npy|mel-Track 1 - 218.npy|embed-Track 1 - 218.npy|113367|567|Track 1 - 218|You humans who listened to the low notes from the tuba rated it as bittersweet.|You humans who listened to the low notes from the tuba rated it as bittersweet

I used the first embedding for fine-tuning the model, and the same embedding for inference in the toolbox or demo_cli.py. While I fine-tuned the model, the loss was around 0.5 and wouldn't fall any further.

@ghost

ghost commented Sep 19, 2020

Your train.txt is improperly formatted. Here is an example line:

audio-p240_001.npy|mel-p240_001.npy|embed-p240_001.npy|38921|195|Please call Stella.
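A quick check for this kind of formatting problem is to count the pipe-separated fields on every line. The sketch below assumes six fields, as in the example line; the field names and the reading of fields 4 and 5 as integer counts are my own labels, not taken from the repository.

```python
EXPECTED_FIELDS = 6  # audio file | mel file | embed file | count | count | text


def check_metadata_line(line):
    """Return None if the line looks well formed, else a short description."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != EXPECTED_FIELDS:
        return f"expected {EXPECTED_FIELDS} fields, found {len(fields)}"
    audio, mel, embed, count_a, count_b, text = fields
    if not (count_a.isdigit() and count_b.isdigit()):
        return "fields 4 and 5 should be integers"
    if not text.strip():
        return "transcript is empty"
    return None


good = "audio-p240_001.npy|mel-p240_001.npy|embed-p240_001.npy|38921|195|Please call Stella."
bad = (
    "audio-Track 1 - 218.npy|mel-Track 1 - 218.npy|embed-Track 1 - 218.npy"
    "|113367|567|Track 1 - 218|You humans who listened to the low notes from "
    "the tuba rated it as bittersweet.|You humans who listened to the low "
    "notes from the tuba rated it as bittersweet"
)

good_result = check_metadata_line(good)
bad_result = check_metadata_line(bad)
print(good_result)
print(bad_result)
```

The earlier line from this thread fails the check because it carries two extra fields (the track name and a second copy of the transcript).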

@rlutsyshyn
Author

@blue-fish Thanks, now it works well. But how can I improve the quality of the output?

@ghost

ghost commented Sep 23, 2020

@rlutsyshyn That's something that I continue to work on now. I am experimenting with different synthesizer models and settings, but I still have not surpassed the pretrained models from Corentin.

@rlutsyshyn
Author

@blue-fish Can fine-tuning the vocoder improve the output audio quality?

@ghost

ghost commented Sep 23, 2020

@rlutsyshyn Yes, though you'll want to make sure you are satisfied with the synthesizer before moving on to vocoder training.

@rlutsyshyn
Author

@blue-fish Yes, I think I'm satisfied with the synthesizer model quality. But when I try to fine-tune the vocoder (on 16 kHz data), all I hear in the output is plain noise...

@ghost

ghost commented Oct 12, 2020

Closing this issue due to inactivity. @rlutsyshyn I think you know as much about this repo as I do now. My recommendation is to avoid finetuning the vocoder, since it will not improve the quality that much. If you need a better vocoder, train it from scratch.

@Adnan3234

How many voice samples of a particular voice are required to train the model?

This issue was closed.