Training Voice Cloning model for another language #492

rlutsyshyn · 2020-08-13T13:26:49Z

Hi! I already know how to train the synthesizer and vocoder, and how to create a relevant dataset. But if I want to train a voice cloning model for another language, e.g. Ukrainian, what else should I do?

ghost · 2020-08-13T14:27:40Z

Update synthesizer/utils/symbols.py to contain all valid characters in your text transcripts (the characters you want to train on). This is an example for Swedish: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/3eb96df1c6b4b3e46c28c6e75e699bffc6dd43be

However, be careful: in order for someone to run the model you've created they will also need to make the same changes to the file. I spent hours learning this the hard way trying to use the model in #257 because the creator was unavailable to help.
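A minimal sketch of the idea above, for checking that a symbol set covers every character in the transcripts. The constants are a hypothetical stand-in for whatever synthesizer/utils/symbols.py actually contains, and `build_symbols` and `missing_characters` are illustrative helpers, not functions from the repo.

```python
# Hypothetical stand-in for the symbol list in synthesizer/utils/symbols.py;
# the real file may be organised differently.
PAD, EOS = "_", "~"
BASE_CHARACTERS = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "1234567890!'\"(),-.:;? "
)
UKRAINIAN_LETTERS = (
    "АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"
    "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя"
)


def build_symbols(extra_characters=""):
    """Return pad, end-of-sequence, then each character once, in order."""
    ordered = [PAD, EOS] + list(BASE_CHARACTERS + extra_characters)
    return list(dict.fromkeys(ordered))


def missing_characters(transcripts, symbols):
    """List characters used in the transcripts that the symbol set lacks."""
    used = {ch for text in transcripts for ch in text}
    return sorted(used - set(symbols))


transcripts = ["Привіт, світе!"]
# Characters the unmodified set cannot represent:
print(missing_characters(transcripts, build_symbols()))
# After extending the set, nothing should be left over:
print(missing_characters(transcripts, build_symbols(UKRAINIAN_LETTERS)))
```

Running a check like this over the whole transcript file before training makes it easier to tell anyone who later uses the model exactly which characters they need to add.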

rlutsyshyn · 2020-08-13T14:54:26Z

Thank you very much! Will try :)

rlutsyshyn · 2020-08-13T15:18:51Z

Can you also tell me: can I somehow fine-tune the pretrained model on some new voice samples without full retraining?

ghost · 2020-08-13T15:27:17Z

Yes, you can resume training on a pretrained model using a different dataset. The main use for this is single-speaker finetuning (process and examples in #437) but you could also finetune multi-speaker using the same process.

One more thing to add, the speaker encoder is trained on English and may not work well for other languages. If you have a large number of voice samples in your target language, you may wish to train a new encoder or at least finetune an existing one. (Data preprocessing for encoder is not a smooth process so set your expectations accordingly).

There are some very good speaker encoders shared in #126 but the model size of 768 is too big to be practical for cloning. You can use this process to import the relevant weights from the model and finetune to a more useful dimension: #458 (comment)

Ananas120 · 2020-08-13T19:34:59Z

Hello, I am also going to try training a voice cloning model in another language (French in my case), and I have some tips for you in case they help:

First: use transfer learning (load the checkpoint into your custom model; normally the last layer doesn't have the same shape but all the others do, so load the weights for the other layers). It will really speed up training.
Secondly, as @blue-fish says, the encoder is trained on English, so I don't know if it is portable to other languages (alternatively you can use my siamese network approach; I trained it on a mix of English and French datasets and got awesome results, so it is compatible with multiple languages).
Another thing you can do is test the encoder on your dataset to see whether the results are good (if not, you can retrain it on your data).
Last thing: the vocoder is normally a universal vocoder, trained to convert mel spectrograms to waveforms regardless of language, so I think fine-tuning it on your language is only a bonus; it should normally give good results as it is.

Good luck with training!

ghost · 2020-08-13T20:20:56Z

> the encoder is trained in english so don't know if it is portable for other languages

The English encoder works all right for Swedish. There's info on setting it up and samples in #257 . Since encoder training is very intensive, you should just try it (either jump straight to synth preprocess and training, or do some speaker verification with Ukrainian utterances to see how well it performs).
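One rough way to do the speaker verification check suggested above is to compare cosine similarities between utterance embeddings of the same speaker and of different speakers. This is only a sketch: it assumes the embeddings have already been computed by the encoder and are available as plain lists of floats, and `speaker_separation` is an illustrative helper, not part of the repo.

```python
import math
from itertools import combinations, product


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def speaker_separation(embeds_by_speaker):
    """Return (mean same-speaker similarity, mean cross-speaker similarity).

    Needs at least two speakers, and at least one speaker with two utterances.
    """
    same = [
        cosine_similarity(a, b)
        for embeds in embeds_by_speaker.values()
        for a, b in combinations(embeds, 2)
    ]
    cross = [
        cosine_similarity(a, b)
        for first, second in combinations(embeds_by_speaker.values(), 2)
        for a, b in product(first, second)
    ]
    return sum(same) / len(same), sum(cross) / len(cross)


# Toy two-dimensional "embeddings" purely for illustration.
toy = {
    "speaker_a": [[1.0, 0.0], [0.9, 0.1]],
    "speaker_b": [[0.0, 1.0], [0.1, 0.9]],
}
same_mean, cross_mean = speaker_separation(toy)
print(same_mean > cross_mean)
```

If same-speaker similarity is not clearly higher than cross-speaker similarity on utterances in the target language, that would be a hint that the encoder needs finetuning or retraining.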

rlutsyshyn · 2020-08-14T09:11:13Z

Thanks guys! Will try :)

ghost · 2020-08-21T16:26:31Z

@rlutsyshyn How is progress on your synthesizer model?

rlutsyshyn · 2020-08-24T17:09:43Z

@blue-fish Just collecting a lot of data :)

afantasialiberal · 2020-08-30T06:22:15Z

Hello, I speak Spanish. Is there a tutorial for training it on my language? Sorry, I am very new to this, but it's a very fun project.

ghost · 2020-08-30T06:51:12Z

@afantasialiberal Please see #431 (comment) for a general outline of the process. There is no tutorial available at this time.

rlutsyshyn · 2020-09-05T11:16:57Z

Hey, I have a new issue when trying to run vocoder_preprocess. Preprocessing "starts" but runs 0 iterations (without any error).
I have datasets/SV2TTS/vocoder/mels_gta but it is empty, and datasets/SV2TTS/vocoder/synthesized.txt is also empty...
Maybe I missed something? I just fine-tuned the pretrained model on my own data (there were no problems with the synthesizer).

ghost · 2020-09-10T19:55:23Z

@rlutsyshyn Do you still have that issue with vocoder preprocess?

rlutsyshyn · 2020-09-10T20:27:01Z

@blue-fish I have an issue with the synthesizer now :) When I use 48 kHz audio and recalculate the parameters in synthesizer/hparams.py, after fine-tuning my voice sounds like Alvin and the Chipmunks (very, very fast)... Maybe you have some advice for this case?
What are the main parameters to configure to get a normal voice in the output?

Ananas120 · 2020-09-11T06:33:59Z

Just to be sure: if you train the synthesizer to create 48 kHz mel spectrograms, you should also train the vocoder to generate 48 kHz audio (because the pretrained one is trained on 16 kHz audio).
You should also check that the parameters for the audio player etc. have been updated for your 48 kHz rate.

Good luck!

rlutsyshyn · 2020-09-11T09:54:01Z

@Ananas120 For the synthesizer I can modify win_size, hop, etc. in hparams.py, but in vocoder/hparams.py I don't see anything like that, so what should I modify to fine-tune my vocoder for 48 kHz data?
Thanks :)

Ananas120 · 2020-09-11T10:00:49Z

Honestly, I don't know; I think blue-fish can help you better with this.
If the audio only seems too fast but otherwise sounds good, it may just be a problem with the audio player's rate, regardless of the rate of the spectrogram given to the vocoder (I don't know whether it changes anything for the vocoder if the spectrogram is 16 kHz or 48 kHz).
So you could search for where the toolbox uses something like sounddevice.play (sd.play).
You could also take the audio the vocoder generates and play it yourself with a 48 kHz parameter (with IPython.display.Audio, for example, if you use a Jupyter notebook).
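To illustrate why a mismatched playback rate changes the speed, here is a small sketch using only Python's standard `wave` module. It is not toolbox code; `write_wav` and `wav_duration_seconds` are hypothetical helpers. The same samples last three times longer when tagged as 16 kHz than as 48 kHz, so audio generated at a low rate and played at a higher one comes out fast and high-pitched.

```python
import math
import struct
import wave


def write_wav(path, samples, sample_rate):
    """Write mono 16-bit PCM; `samples` are floats in [-1.0, 1.0]."""
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)


def wav_duration_seconds(path):
    """Duration implied by the frame count and the rate stored in the header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# A 440 Hz tone, 48000 samples long.
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 48000) for n in range(48000)]
```

Writing `tone` once with `sample_rate=48000` and once with `sample_rate=16000` and listening to both files is a quick way to hear the effect.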

ghost · 2020-09-11T17:14:52Z

@rlutsyshyn You need to train a vocoder from scratch, the good news is that it trains relatively fast and you should only need to do it once. Most people choose sampling rates of 22.05 or 24 kHz for faster inference but that's your call.

In synthesizer hparams, you should modify hop_length to be 0.0125 * sample_rate , and win_length and n_fft to be 4 times that number. The vocoder automatically picks up those hparams from the synthesizer. You'll also need to edit the upsampling factors in this line of code, to match your new hop length. For example, 5*5*8 = 200 (the default hop length for 16 kHz).

Real-Time-Voice-Cloning/vocoder/hparams.py, line 26 in 8f71d67:

    voc_upsample_factors = (5, 5, 8)    # NB - this needs to correctly factorise hop_length

When preprocessing data, the fmax can be adjusted. You can go as high as 0.5 * sample rate (the Nyquist frequency). Higher is not necessarily better, because we only have 80 mel channels and each channel needs to represent a wider range of frequencies. If you don't want to experiment, it is safe to leave fmax untouched at 7600 Hz.
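The hop and window arithmetic described in this comment can be sketched as follows. `derive_hparams` and `factors_match_hop` are illustrative helpers rather than repo functions, and the rounding is an assumption: for rates such as 22050 Hz the product 0.0125 * sample_rate is not a whole number, and the thread does not say how to round it.

```python
from math import prod


def derive_hparams(sample_rate, frame_shift_seconds=0.0125):
    """Derive synthesizer hparams from the sample rate, per the rule above."""
    hop_length = round(sample_rate * frame_shift_seconds)  # rounding is an assumption
    win_length = 4 * hop_length
    return {
        "sample_rate": sample_rate,
        "hop_length": hop_length,
        "win_length": win_length,
        "n_fft": win_length,
        "max_fmax": sample_rate / 2,  # upper limit for fmax
    }


def factors_match_hop(upsample_factors, hop_length):
    """The vocoder upsampling factors must multiply out to the hop length."""
    return prod(upsample_factors) == hop_length


for rate in (16000, 24000, 48000):
    print(derive_hparams(rate))
```

For example, `factors_match_hop((5, 5, 8), derive_hparams(16000)["hop_length"])` holds for the default settings, while the same factors no longer match once the sample rate is raised.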

rlutsyshyn · 2020-09-14T05:37:25Z

Thanks for your response @blue-fish, but to train the vocoder I need a lot of 48 kHz data, which could be a problem. By the way:

If I want to fine-tune the voice cloner at 22 or 24 kHz, do I also need to retrain the vocoder from scratch?
5 * 5 * 8 = 200, but how can I split it for 600? 5 * 5 * 24? Or something else?

ghost · 2020-09-14T06:15:30Z

Hi @rlutsyshyn, you don't need to use the same datasets for synth and vocoder. You can preprocess a different 48khz dataset (even English) and it should generalize to Ukrainian if it has enough voices (several hundred or more). Use synthesizer_preprocess_audio.py , then copy SV2TTS/synthesizer/mels to SV2TTS/vocoder/mels_gta and SV2TTS/synthesizer/train.txt to SV2TTS/vocoder/synthesized.txt.
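The copy step above could be scripted roughly like this. It is a sketch that assumes the directory layout named in the comment; `reuse_synthesizer_mels_for_vocoder` is a hypothetical helper, and `dirs_exist_ok` requires Python 3.8 or later.

```python
import shutil
from pathlib import Path


def reuse_synthesizer_mels_for_vocoder(sv2tts_root):
    """Copy synthesizer mels and metadata to where vocoder training looks for them."""
    root = Path(sv2tts_root)
    synthesizer_dir = root / "synthesizer"
    vocoder_dir = root / "vocoder"
    vocoder_dir.mkdir(parents=True, exist_ok=True)
    # SV2TTS/synthesizer/mels -> SV2TTS/vocoder/mels_gta
    shutil.copytree(
        synthesizer_dir / "mels", vocoder_dir / "mels_gta", dirs_exist_ok=True
    )
    # SV2TTS/synthesizer/train.txt -> SV2TTS/vocoder/synthesized.txt
    shutil.copyfile(synthesizer_dir / "train.txt", vocoder_dir / "synthesized.txt")
```

Calling it with the path to the SV2TTS folder (for example `datasets/SV2TTS`) after running synthesizer_preprocess_audio.py should leave the vocoder folder populated.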

The downside to this approach is your trained vocoder will not compensate for any deficiencies of your synthesizer model. It is a missed opportunity to make the final output better.

You can continue to use Corentin's pretrained vocoder if your synthesizer hparams satisfy the following conditions.
- num_mels = 80
- (hop_size / sample_rate) = 0.0125
- win_size = hop_size * 4
- fmin = 55 and fmax = 7600
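The four conditions above can be written as a quick check. This is only a sketch: `can_reuse_pretrained_vocoder` is a hypothetical helper, and it takes the hparams as a plain dict with the key names used in the list.

```python
def can_reuse_pretrained_vocoder(hparams):
    """True if the synthesizer hparams meet the four listed conditions."""
    return (
        hparams["num_mels"] == 80
        # hop_size / sample_rate == 0.0125, written with integers (1 / 80)
        and hparams["hop_size"] * 80 == hparams["sample_rate"]
        and hparams["win_size"] == hparams["hop_size"] * 4
        and hparams["fmin"] == 55
        and hparams["fmax"] == 7600
    )


settings_48k = {
    "num_mels": 80,
    "sample_rate": 48000,
    "hop_size": 600,
    "win_size": 2400,
    "fmin": 55,
    "fmax": 7600,
}
print(can_reuse_pretrained_vocoder(settings_48k))
```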

For proper vocoder inference, you either need to edit synthesizer/hparams.py or vocoder/hparams.py to set hop_size, win_size, and sample_rate to the old values (200, 800, 16khz). I don't know if it matters but you may also want to set n_fft=800. The toolbox uses the synthesizer's sampling rate, so easier to edit that hparams file (otherwise you need to resample the wav after getting it back from the vocoder).

The reason this works is because the vocoder just sees a 2d array of shape (num_mels, frames) as input. There is no sample rate information contained in the mel spectrogram. You can even go the other direction, and take a synthesizer trained at 16khz and use the mels on a vocoder trained at 24 khz :)

I've personally tried (4, 4, 4, 4) for 256, and (5, 6, 10) for 300 and the results were good. Have not read the WaveRNN paper so I don't know how to select the upsampling factor. Maybe try (4, 5, 5, 6) for 600? An extra element does not add that many trainable parameters, or affect inference speed significantly.
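To list candidate upsampling factors for a given hop length, something like the sketch below may help. `factorizations` is an illustrative helper, and the bounds of 2 and 16 on each factor are arbitrary assumptions. It only enumerates tuples whose product matches; the thread does not establish which of them sounds best.

```python
def factorizations(hop_length, parts, smallest=2, largest=16):
    """Yield non-decreasing tuples of `parts` integers whose product is hop_length."""

    def walk(remaining, count, start):
        if count == 1:
            if start <= remaining <= largest:
                yield (remaining,)
            return
        for factor in range(start, largest + 1):
            if remaining % factor == 0:
                for rest in walk(remaining // factor, count - 1, factor):
                    yield (factor,) + rest

    yield from walk(hop_length, parts, smallest)


for candidate in factorizations(600, 4):
    print(candidate)
```

The tuples mentioned in this comment, (4, 4, 4, 4) for 256, (5, 6, 10) for 300 and (4, 5, 5, 6) for 600, all appear among the enumerated candidates for their hop lengths.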

rlutsyshyn · 2020-09-14T06:30:44Z

@blue-fish Thanks for your fast response, will try this :)

rlutsyshyn · 2020-09-18T06:52:22Z

@blue-fish Hey! Can you give me some advice? When I fine-tuned only the synthesizer on my data (a 16 kHz English speaker), testing gave a similar voice but the words come out like "bla bla bla ... bla bla bla". Is that a problem with the synthesizer, or do I have to train (fine-tune) the vocoder for that voice?
Thanks

ghost · 2020-09-18T18:23:50Z

@rlutsyshyn Are you taking the pretrained synthesizer (English) and finetuning on your Ukrainian data? That's not going to work because the mapping of letters to sounds will not match. You need to start the synthesizer training from scratch when working with a new language.

Ananas120 · 2020-09-18T18:28:21Z

For the classic Tacotron-2 model, transferring from English to another language works (French for me), but English and French sounds are not that far apart, so I suppose the mapping differs only slightly.
For this model it doesn't work, but I think the fault lies not with the pretrained weights but with my encoder, my dataset, or my preprocessing.

rlutsyshyn · 2020-09-18T19:22:40Z

@blue-fish @Ananas120 I used the English synthesizer and tried to fine-tune it on English data that I recorded myself. I collected 400 utterance samples and tried to fine-tune the synthesizer on them, but got "bla bla bla".

ghost · 2020-09-18T19:39:48Z

When finetuning, use the same embedding for all of your samples for faster convergence. I take the embedding of the first audio file and use it to overwrite all the others. For inference, make sure you load the same audio file used to generate your embeds for finetuning.

If it still doesn't work, check your preprocessing and also make sure the transcripts in train.txt matches what is spoken in the audio files.

rlutsyshyn · 2020-09-18T19:41:23Z

@blue-fish Can you explain this same-embedding approach in more detail, please?

Ananas120 · 2020-09-18T19:45:54Z

At the moment I use a « speaker embedding » (the mean of all utterance embeddings). Is that more useful, or is it better to use one single « real » utterance embedding for all?

ghost · 2020-09-18T20:31:48Z

@rlutsyshyn You have 400 wav files in your training set for finetuning. When you run synthesizer_preprocess_embeds.py it will make embed-file1.npy, ... , embed-file400.npy, in SV2TTS/synthesizer/embeds. Copy the contents of file 1 to files 2-400, so that they are all the same.

@Ananas120 I use the embedding of a real utterance so I can load the audio file in the toolbox to get the desired embedding. The mean or L2-norm is technically better but with a good encoder model it shouldn't make much of a difference.
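The copy step described for @rlutsyshyn could be done with a few lines like these. It is a sketch assuming the embed files follow the `embed-*.npy` naming mentioned in this comment; `share_single_embedding` is a hypothetical helper. It overwrites files in place, so back up the embeds folder first.

```python
import shutil
from pathlib import Path


def share_single_embedding(embeds_dir, reference_name):
    """Overwrite every embed-*.npy in the folder with the reference file's contents."""
    embeds_dir = Path(embeds_dir)
    reference = embeds_dir / reference_name
    overwritten = []
    for path in sorted(embeds_dir.glob("embed-*.npy")):
        if path != reference:
            shutil.copyfile(reference, path)
            overwritten.append(path.name)
    return overwritten
```

A call might look like `share_single_embedding("datasets/SV2TTS/synthesizer/embeds", "embed-file1.npy")`, where the file name is whichever utterance you plan to load again at inference time.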

rlutsyshyn · 2020-09-19T10:24:01Z

@blue-fish

> For inference, make sure you load the same audio file used to generate your embeds for finetuning.

What do you mean?

ghost · 2020-09-19T11:45:08Z

@rlutsyshyn After #492 (comment) , your entire dataset is using the embedding from file 1. The embedding corresponds to a specific audio file, let's call it file1.wav. When you test your new synthesizer in the toolbox (or demo_cli.py), you must remember to load file1.wav to generate the embedding.

rlutsyshyn · 2020-09-19T14:57:34Z

@blue-fish I tried to do what you said but I still get the same results... bla bla bla. I checked the datasets/SV2TTS/synthesizer/train.txt file and everything looks good there, e.g.:
audio-Track 1 - 218.npy|mel-Track 1 - 218.npy|embed-Track 1 - 218.npy|113367|567|Track 1 - 218|You humans who listened to the low notes from the tuba rated it as bittersweet.|You humans who listened to the low notes from the tuba rated it as bittersweet

I used the first embedding for fine-tuning the model, and the same embedding for inference in the toolbox or demo_cli.py. While fine-tuning, the loss was around 0.5 and would not fall any further.

ghost · 2020-09-19T15:41:13Z

Your train.txt is improperly formatted. Here is an example line:

audio-p240_001.npy|mel-p240_001.npy|embed-p240_001.npy|38921|195|Please call Stella.
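Judging only by the example line above, each line of train.txt has six fields separated by `|`: three file names, two whole numbers, and the transcript. A rough validator under that assumption is sketched below; `check_train_line` is a hypothetical helper, and it would misreport a transcript that itself contains a `|`.

```python
def check_train_line(line, expected_fields=6):
    """Return a list of problems found in one train.txt line (empty if none)."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != expected_fields:
        return [f"expected {expected_fields} fields, found {len(fields)}"]
    problems = []
    for index in (3, 4):
        if not fields[index].isdigit():
            problems.append(
                f"field {index + 1} is not a whole number: {fields[index]!r}"
            )
    if not fields[5].strip():
        problems.append("transcript is empty")
    return problems


good = "audio-p240_001.npy|mel-p240_001.npy|embed-p240_001.npy|38921|195|Please call Stella."
print(check_train_line(good))
```

Run against the line posted earlier in the thread, which carries extra fields after the frame count, the check reports a field-count mismatch.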

rlutsyshyn · 2020-09-19T19:32:01Z

@blue-fish Thanks, now it works well. But how can I improve the quality of the output?

ghost · 2020-09-23T06:43:44Z

@rlutsyshyn That's something that I continue to work on now. I am experimenting with different synthesizer models and settings, but I still have not surpassed the pretrained models from Corentin.

rlutsyshyn · 2020-09-23T13:15:44Z

@blue-fish Can fine-tuning the vocoder improve the output audio quality?

ghost · 2020-09-23T18:21:16Z

@rlutsyshyn Yes, though you'll want to make sure you are satisfied with the synthesizer before moving on to vocoder training.

rlutsyshyn · 2020-09-24T06:49:18Z

@blue-fish Yes, I think I'm satisfied with the quality of the synthesizer model. But when I try to fine-tune the vocoder (on 16 kHz data), all I hear in the output is noise...

ghost · 2020-10-12T08:26:17Z

Closing this issue due to inactivity. @rlutsyshyn I think you know as much about this repo as I do now. My recommendation is to avoid finetuning the vocoder, since it will not improve the quality that much. If you need a better vocoder train it from scratch.

Adnan3234 · 2023-06-28T07:26:50Z

How many voice samples of a particular voice are required to train the model?

hetpandya mentioned this issue Sep 11, 2020

Fine-tuning for hindi #525

Closed

ghost mentioned this issue Sep 22, 2020

How can I train a model in portuguese lang? #531

Closed

ghost mentioned this issue Sep 30, 2020

Can I recreate my mom voice from WhatsApp Recordings? #537

Closed

ghost closed this as completed Oct 12, 2020

ghost mentioned this issue Oct 8, 2021

Support for other languages #30

Open

Ca-ressemble-a-du-fake mentioned this issue Oct 31, 2021

TTS outputing different words than the ones typed in #883

Closed

This issue was closed.
