How long will it take to get a good result? #132

Closed
NileZhou opened this issue Jul 1, 2019 · 41 comments
@NileZhou commented Jul 1, 2019

train.py prints:
46739: -4.754265308
46740: -5.550816059
46741: -4.253830433
46742: -5.338192463
46743: -4.700691700
46744: -5.625311375
46745: -5.753829479
46746: -5.032420158
......

At what value of the second number (the loss) does the checkpoint become usable?

@zhengao1993

I don't know your batch size. My suggestion is that you run for at least 100 epochs.
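(For a rough sense of scale, assuming an LJSpeech-sized dataset of ~13,100 clips: at batch size 12, one epoch is ceil(13100 / 12) ≈ 1,092 iterations, so 100 epochs ≈ 109K iterations; at batch size 8 it is about 164K iterations.)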

@tianrengao

Training for 100 epochs should make your model able to generate a reasonable voice. But for the 512-channel version, you need more than 500 epochs to get high-quality audio.

@lingjzhu commented Oct 4, 2019

I fine-tuned the released LJ speech model on a Mandarin dataset. I used a Titan V with fp16 and a batch size of 8. It only took a few hours for the model to produce good quality speech in Mandarin. The loss was around -4.5~-5.2 when it converged.

@rafaelvalle (Contributor)

@lingjzhu can you share a few samples?

@lingjzhu commented Oct 4, 2019

@rafaelvalle Sure. I don't have access to those audio files right now, but I will share them next week.

@lingjzhu commented Oct 7, 2019

@rafaelvalle

Hi. Here are some samples. They sound really good. Thanks for your great work!

Trained for one day.
46k.zip

Trained for three days.
150k.zip

With tacotron2.
tacotron2_and_waveglow.zip

@rafaelvalle (Contributor)

These sound pretty good! Do you have a Tacotron2 implementation for Mandarin that you can share with our community?

@lingjzhu commented Oct 7, 2019

Yes. Actually, I made use of your Tacotron2 repo in implementing the Mandarin model (thanks again!), but it is still at the experimental stage. I will share the code and all the pre-trained models in a few months.

@rafaelvalle (Contributor)

Closing due to inactivity.

@shawnthu commented Nov 27, 2019

@lingjzhu Hi, I trained on a single-speaker, open-source Mandarin speech-synthesis dataset from scratch; the total duration is about 12 hours. The data config is as follows:

```json
{
    "segment_length": 16000,
    "sampling_rate": 16000,
    "filter_length": 743,
    "hop_length": 185,
    "win_length": 743,
    "mel_fmin": 0.0,
    "mel_fmax": 8000.0
}
```

and the model config is as follows:

```json
{
    "n_mel_channels": 80,
    "n_flows": 12,
    "n_group": 8,
    "n_early_every": 4,
    "n_early_size": 2,
    "WN_config": {
        "n_layers": 8,
        "n_channels": 256,
        "kernel_size": 3
    }
}
```

The loss curve looks pretty good:
[loss curve image]

But when I run inference (inference.py), even when feeding mels computed from the training data, the generated audio is quite bad and sounds like white noise. Sigma is set to 1 for training, and I do not change it at inference.
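For anyone debugging the same white-noise symptom, a quick sanity check is to resynthesize a training-set mel directly and sweep sigma: if every sigma still gives noise, the problem is more likely a config mismatch (e.g., whether the upsampling layer in glow.py still matches the non-default hop_length of 185) than the sampling temperature. A minimal sketch, assuming the checkpoint layout train.py saves ({'model': ...}) and the WaveGlow.infer(spect, sigma) signature from glow.py; the file paths are placeholders:

```python
import torch

# Load a trained checkpoint; train.py stores the model object under 'model'.
waveglow = torch.load("checkpoints/waveglow_46000", map_location="cpu")["model"]
waveglow = waveglow.remove_weightnorm(waveglow)  # as done in inference.py
waveglow.cuda().eval()

# A mel computed by mel2samp.py from a *training* utterance.
mel = torch.load("mels/train_utt.pt").cuda().unsqueeze(0)  # [1, n_mel_channels, T]

with torch.no_grad():
    # Training uses sigma=1.0; lower values at inference usually trade
    # expressiveness for less hiss.
    for sigma in (1.0, 0.8, 0.6):
        audio = waveglow.infer(mel, sigma=sigma)
```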

@shawnthu

I've solved it ^_^

@cjerry1243

@shawnthu how did you solve it?

@ghost commented Feb 13, 2020

@shawnthu how did you solve your issue?

@ricktjwong commented Jun 14, 2020


@lingjzhu For these examples, did you train your Mandarin model with Tacotron2 from scratch (using NVIDIA's tacotron2 repo), or did you fine-tune the pretrained LJSpeech Taco2 model? If from scratch, what were some of the training specs (e.g. the number of iterations before it converged)?

@lingjzhu commented Jun 15, 2020

@ricktjwong I trained the Mandarin model with the pretrained LJSpeech Taco2 model, which greatly speeds up the convergence. Those samples were produced by a model after a few thousand iterations. The attention plot became diagonal within a thousand iterations.

I tried to train from scratch but the model did not converge well even after 40k iterations.

You can find training details, code, training data, pre-trained models and demos in this repo: https://github.com/lingjzhu/probing-TTS-models

@shawnthu commented Jun 15, 2020

In fact, I recently finished two projects on Mandarin TTS:

  1. Single-speaker TTS
    Produces quite high-quality speech from Chinese text input
    link 1: https://pan.baidu.com/link/zhihu/7YhFzeueh1i3Yx4EVzXzJpp0cvZ2MWSQUsJk==
    link 2: https://pan.baidu.com/link/zhihu/7FhEzTuehEiVcrFXxkVwZrVUYsX15mQQUyB3==
    zhihu: https://www.zhihu.com/people/huang-xu-34
  2. Voice cloning
    Given no more than 30 seconds of audio from the target speaker, we can synthesize natural, similar-sounding speech from any input Chinese text
    reference: https://pan.baidu.com/link/zhihu/7hhWzWueh6i3YONnBDc3BMVDOhNDN0awZhp1==
    synthesized: https://pan.baidu.com/link/zhihu/7JhnzMuVhmi1aix2t2ThZrVmVlYJx2dQUmJU==

@ricktjwong

@lingjzhu Thanks, appreciate the reply!
@shawnthu Those samples sound good. What was your procedure for creating the two voices in the links? Do you have details on the voice cloning too? Thanks!

@AnkurDebnath35

I trained WaveGlow from scratch on a Hindi dataset for 30K iterations, and the loss seemed to converge at -5.5~-6.0. After inference I can make out the words and everything, but it's just not the voice of my speaker. During inference, sigma = 0.6.
https://drive.google.com/file/d/1DfUyef6XH8HEF-bJPddoLBs2vZnPZ3FA/view?usp=sharing

@AnkurDebnath35

Please let me know whether it is premature stopping or the sigma value that is causing these outputs.

@shawnthu

@ricktjwong Sorry, I can't share the details because it's commercial.

@AnkurDebnath35

I am just asking you to listen to my sample and tell me whether I need to train more or change the sigma value.

@shawnthu

@AnkurDebnath35 I'm sorry, I cannot open your link; it's blocked.
@lingjzhu In fact, training Tacotron from scratch also works, and it did not seem to take too much training time for me. Of course, a warm start can significantly speed up training.
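A minimal sketch of what a warm start can look like here, assuming the glow.WaveGlow constructor and the {'model': ...} checkpoint layout from this repo (the config mirrors the 256-channel defaults in config.json; the checkpoint filename is a placeholder):

```python
import torch
from glow import WaveGlow

# Model config matching the released 256-channel WaveGlow (see config.json).
waveglow_config = {
    "n_mel_channels": 80, "n_flows": 12, "n_group": 8,
    "n_early_every": 4, "n_early_size": 2,
    "WN_config": {"n_layers": 8, "n_channels": 256, "kernel_size": 3},
}
model = WaveGlow(**waveglow_config).cuda()

# Copy weights from the released checkpoint, then fine-tune with a fresh
# optimizer and a small learning rate on the new speaker/language.
ckpt = torch.load("waveglow_256channels.pt", map_location="cpu")
model.load_state_dict(ckpt["model"].state_dict())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```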

@AnkurDebnath35

Let me know if you can access it now
https://soundcloud.com/ankur-debnath-9482155/md01-002wav_synthesis-wav

@shawnthu commented Jun 16, 2020

@AnkurDebnath35 Yes, I heard it. It's quite bad. Is the input a ground-truth mel spectrogram, or Tacotron output? I think you should not change the default parameters; try training again. For example, the default sampling rate, hop length, and window length are important. You'd better not change them. Good luck!

@AnkurDebnath35

Everything is at default and consistent with my audio. Yes, the input is mel spectrograms of my test set generated with mel2samp.py, not Tacotron output; I am training WaveGlow first. Do you think it is under-trained, or could it be a normalization issue?

@shawnthu

Would you please show your loss curve here?

@shawnthu

@cjerry1243 @deepseek Have you ever tried the default config?

@AnkurDebnath35

Sorry, I only have the curve up to 18K iterations.
[loss curve image]

And to my surprise: the last sample I shared was from a model trained from scratch, and even at 30K iterations the result was poor. But last night I warm-started the model; just listen to the same sample at only 2K iterations:

https://soundcloud.com/ankur-debnath-9482155/md01-002wav-synthesis

@shawnthu

That just shows the pretrained model is quite helpful. After all, it was trained on a large dataset, so it generalizes well.

@AnkurDebnath35

In your opinion, should I stop or train further?
It has trained up to 4K iterations by now, and the loss is hovering near -6.7.
The audio at 2K iterations was close to the ground truth, though there is some noise in the background; that's it.

@shawnthu

I think that if fine-tuning the model can reach your goal, there's no need to train from scratch. So I suggest you just fine-tune the model with a small learning rate for a few thousand iterations.

@AnkurDebnath35

Exactly! It is already training now; I will wait up to maybe 10K iterations, or stop whenever the results seem satisfactory. Thanks a lot @shawnthu!

@shawnthu

My pleasure.

@AnkurDebnath35

Here is the training loss at around 9K iterations; not much improvement in audio quality compared to 2K iterations, though. Should I stop? Everyone here seems to train up to 50-60K iterations or even more.
[loss curve image]

@shawnthu

Too many iterations may hurt the model, since it may overfit. I think you should evaluate on a dev set rather than only watching the training loss.

@AnkurDebnath35

Yeah, but all the samples are unseen examples for the model, so it has not overfitted yet, though it surely can. I do have a validation set; I just can't find any code in this repo for validation. Can you point me to it?

@shawnthu

A simple way: split the raw dataset into train and dev sets, then calculate the loss on dev after every epoch.
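A sketch of what that could look like with this repo's pieces, assuming the Mel2Samp dataset from mel2samp.py and WaveGlowLoss from glow.py (the dev file list is a placeholder, and the data_config values mirror config.json defaults):

```python
import torch
from torch.utils.data import DataLoader
from mel2samp import Mel2Samp
from glow import WaveGlowLoss

# Same keys as the "data_config" block in config.json (default values shown).
data_config = {
    "segment_length": 16000, "sampling_rate": 22050,
    "filter_length": 1024, "hop_length": 256, "win_length": 1024,
    "mel_fmin": 0.0, "mel_fmax": 8000.0,
}
dev_loader = DataLoader(Mel2Samp("dev_files.txt", **data_config),
                        batch_size=4, shuffle=False)
criterion = WaveGlowLoss(sigma=1.0)

def dev_loss(model):
    """Average loss over the held-out dev set; call once per epoch in train.py."""
    model.eval()
    total, batches = 0.0, 0
    with torch.no_grad():
        for mel, audio in dev_loader:
            outputs = model((mel.cuda(), audio.cuda()))  # train.py feeds (mel, audio)
            total += criterion(outputs).item()
            batches += 1
    model.train()
    return total / max(batches, 1)
```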

@AnkurDebnath35

That I already have; it's just a question of where to modify train.py to accommodate val_loss.

@AnkurDebnath35

Can someone help me? I want to cite this repository in my paper, but I don't know which reference to cite it against.

@rafaelvalle (Contributor)

@AnkurDebnath35 Just cite our paper. And please, next time, do not ask questions that are unrelated to the thread.

```bibtex
@inproceedings{prenger2019waveglow,
  title={Waveglow: A flow-based generative network for speech synthesis},
  author={Prenger, Ryan and Valle, Rafael and Catanzaro, Bryan},
  booktitle={ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={3617--3621},
  year={2019},
  organization={IEEE}
}
```
