How long will it take to get a good result? #132

Closed
NileZhou opened this issue Jul 1, 2019 · 41 comments
@NileZhou commented Jul 1, 2019

train.py prints:
46739: -4.754265308
46740: -5.550816059
46741: -4.253830433
46742: -5.338192463
46743: -4.700691700
46744: -5.625311375
46745: -5.753829479
46746: -5.032420158
......

At what value of the second number (the loss) does the checkpoint become usable?

@zhengao1993

I don't know your batch size. My suggestion is that you run for at least 100 epochs.
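(For a rough sense of scale, assuming an LJSpeech-sized dataset of ~13,100 clips: at batch size 12, one epoch is ceil(13100 / 12) ≈ 1,092 iterations, so 100 epochs ≈ 109K iterations; at batch size 8 it is about 164K iterations.)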

@tianrengao

Training for 100 epochs should make your model able to generate a reasonable voice. But for the 512-channel version, you need more than 500 epochs to get high-quality audio.

@lingjzhu commented Oct 4, 2019

I fine-tuned the released LJ speech model on a Mandarin dataset. I used a Titan V with fp16 and a batch size of 8. It only took a few hours for the model to produce good quality speech in Mandarin. The loss was around -4.5~-5.2 when it converged.

@rafaelvalle (Contributor)

@lingjzhu can you share a few samples?

@lingjzhu commented Oct 4, 2019

@rafaelvalle Sure. I don't have access to those audio files right now, but I will share them next week.

@lingjzhu commented Oct 7, 2019

@rafaelvalle

Hi. Here are some samples. They sound really good. Thanks for your great work!

Trained for one day.
46k.zip

Trained for three days.
150k.zip

With tacotron2.
tacotron2_and_waveglow.zip

@rafaelvalle (Contributor)

These sound pretty good! Do you have a Tacotron2 implementation for Mandarin that you can share with our community?

@lingjzhu commented Oct 7, 2019

Yes. Actually, I made use of your Tacotron2 repo in implementing the Mandarin model (thanks again!), but it is still at the experimental stage. I will share the code and all the pre-trained models in a few months.

@rafaelvalle (Contributor)

Closing due to inactivity.

@shawnthu commented Nov 27, 2019

@lingjzhu Hi, I trained on a single-speaker, open-source Mandarin speech-synthesis dataset from scratch; the total duration is about 12 hours. The data config is as follows:

```json
{
    "segment_length": 16000,
    "sampling_rate": 16000,
    "filter_length": 743,
    "hop_length": 185,
    "win_length": 743,
    "mel_fmin": 0.0,
    "mel_fmax": 8000.0
}
```

and the model config is as follows:

```json
{
    "n_mel_channels": 80,
    "n_flows": 12,
    "n_group": 8,
    "n_early_every": 4,
    "n_early_size": 2,
    "WN_config": {
        "n_layers": 8,
        "n_channels": 256,
        "kernel_size": 3
    }
}
```

The loss curve looks pretty good:
[loss curve image]

But when I run inference (inference.py), even when feeding mels computed from the training data, the generated audio is quite bad and sounds like white noise. Sigma is set to 1 for training, and I do not change it at inference.
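For anyone debugging the same white-noise symptom, a quick sanity check is to resynthesize a training-set mel directly and sweep sigma: if every sigma still gives noise, the problem is more likely a config mismatch (e.g., whether the upsampling layer in glow.py still matches the non-default hop_length of 185) than the sampling temperature. A minimal sketch, assuming the checkpoint layout train.py saves ({'model': ...}) and the WaveGlow.infer(spect, sigma) signature from glow.py; the file paths are placeholders:

```python
import torch

# Load a trained checkpoint; train.py stores the model object under 'model'.
waveglow = torch.load("checkpoints/waveglow_46000", map_location="cpu")["model"]
waveglow = waveglow.remove_weightnorm(waveglow)  # as done in inference.py
waveglow.cuda().eval()

# A mel computed by mel2samp.py from a *training* utterance.
mel = torch.load("mels/train_utt.pt").cuda().unsqueeze(0)  # [1, n_mel_channels, T]

with torch.no_grad():
    # Training uses sigma=1.0; lower values at inference usually trade
    # expressiveness for less hiss.
    for sigma in (1.0, 0.8, 0.6):
        audio = waveglow.infer(mel, sigma=sigma)
```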

@shawnthu

I've solved it ^_^

@cjerry1243

@shawnthu how did you solve it?

@ghost commented Feb 13, 2020

@shawnthu how did you solve your issue?

@ricktjwong commented Jun 14, 2020


@lingjzhu For these examples, did you train your Mandarin model with Tacotron2 from scratch (using NVIDIA's tacotron2 repo), or did you fine-tune the pretrained LJSpeech Taco2 model? If from scratch, what were some of the training specs (e.g. the number of iterations before it converged)?

@lingjzhu commented Jun 15, 2020

@ricktjwong I trained the Mandarin model with the pretrained LJSpeech Taco2 model, which greatly speeds up the convergence. Those samples were produced by a model after a few thousand iterations. The attention plot became diagonal within a thousand iterations.

I tried to train from scratch but the model did not converge well even after 40k iterations.

You can find training details, code, training data, pre-trained models and demos in this repo: https://github.com/lingjzhu/probing-TTS-models

@shawnthu commented Jun 15, 2020

In fact, I recently finished two projects on Mandarin TTS:

  1. Single-speaker TTS
    Produces quite high-quality speech from Chinese text input
    link 1: https://pan.baidu.com/link/zhihu/7YhFzeueh1i3Yx4EVzXzJpp0cvZ2MWSQUsJk==
    link 2: https://pan.baidu.com/link/zhihu/7FhEzTuehEiVcrFXxkVwZrVUYsX15mQQUyB3==
    zhihu: https://www.zhihu.com/people/huang-xu-34
  2. Voice cloning
    Given no more than 30 seconds of audio from the target speaker, we can synthesize natural, similar-sounding speech from any input Chinese text
    reference: https://pan.baidu.com/link/zhihu/7hhWzWueh6i3YONnBDc3BMVDOhNDN0awZhp1==
    synthesized: https://pan.baidu.com/link/zhihu/7JhnzMuVhmi1aix2t2ThZrVmVlYJx2dQUmJU==

@ricktjwong

@lingjzhu Thanks, appreciate the reply!
@shawnthu Those samples sound good. What was your procedure for creating the two voices in the links? Do you have details on the voice cloning too? Thanks!

@AnkurDebnath35

I trained WaveGlow from scratch on a Hindi dataset for 30K iterations, and the loss seemed to converge at -5.5~-6.0. After inference I can make out the words and everything, but it's just not the voice of my speaker. During inference, sigma = 0.6.
https://drive.google.com/file/d/1DfUyef6XH8HEF-bJPddoLBs2vZnPZ3FA/view?usp=sharing

@AnkurDebnath35

Please let me know whether it is premature stopping or the sigma value that is causing these outputs.

@shawnthu

@ricktjwong Sorry, I can't share the details because it's commercial.

@AnkurDebnath35

I am just asking you to listen to my sample and tell me whether I need to train more or change the sigma value.

@shawnthu

@AnkurDebnath35 I'm sorry, I cannot open your link; it's blocked.
@lingjzhu In fact, training Tacotron from scratch also works, and it did not seem to take too much training time for me. Of course, a warm start can significantly speed up training.
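A minimal sketch of what a warm start can look like here, assuming the glow.WaveGlow constructor and the {'model': ...} checkpoint layout from this repo (the config mirrors the 256-channel defaults in config.json; the checkpoint filename is a placeholder):

```python
import torch
from glow import WaveGlow

# Model config matching the released 256-channel WaveGlow (see config.json).
waveglow_config = {
    "n_mel_channels": 80, "n_flows": 12, "n_group": 8,
    "n_early_every": 4, "n_early_size": 2,
    "WN_config": {"n_layers": 8, "n_channels": 256, "kernel_size": 3},
}
model = WaveGlow(**waveglow_config).cuda()

# Copy weights from the released checkpoint, then fine-tune with a fresh
# optimizer and a small learning rate on the new speaker/language.
ckpt = torch.load("waveglow_256channels.pt", map_location="cpu")
model.load_state_dict(ckpt["model"].state_dict())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```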

@AnkurDebnath35

Let me know if you can access it now
https://soundcloud.com/ankur-debnath-9482155/md01-002wav_synthesis-wav

@shawnthu commented Jun 16, 2020

@AnkurDebnath35 Yes, I heard it. It's quite bad. Is the input a ground-truth mel spectrogram, or Tacotron output? I think you should not change the default parameters; try training again. For example, the default sampling rate, hop length, and window length are important. You'd better not change them. Good luck!

@AnkurDebnath35

Everything is at default and consistent with my audio. Yes, the input is mel spectrograms of my test set generated with mel2samp.py, not Tacotron output; I am training WaveGlow first. Do you think it is under-trained, or could it be a normalization issue?

@shawnthu

Would you please show your loss curve here?

@shawnthu

@cjerry1243 @deepseek Have you ever tried the default config?

@AnkurDebnath35

Sorry, I only have the curve up to 18K iterations.
[loss curve image]

And to my surprise: the last sample I shared was from a model trained from scratch, and even at 30K iterations the result was poor. But last night I warm-started the model; just listen to the same sample at only 2K iterations:

https://soundcloud.com/ankur-debnath-9482155/md01-002wav-synthesis

@shawnthu

That just shows the pretrained model is quite helpful. After all, it was trained on a large dataset, so it generalizes well.

@AnkurDebnath35

In your opinion, should I stop or train further?
It has trained up to 4K iterations by now, and the loss is hovering near -6.7.
The audio at 2K iterations was close to the ground truth, though there is some noise in the background; that's it.

@shawnthu

I think that if fine-tuning the model can reach your goal, there's no need to train from scratch. So I suggest you just fine-tune the model with a small learning rate for a few thousand iterations.

@AnkurDebnath35

Exactly! It is already training now; I will wait up to maybe 10K iterations, or stop whenever the results seem satisfactory. Thanks a lot @shawnthu!

@shawnthu

My pleasure.

@AnkurDebnath35

Here is the training loss at around 9K iterations; not much improvement in audio quality compared to 2K iterations, though. Should I stop? Everyone here seems to train up to 50-60K iterations or even more.
[loss curve image]

@shawnthu

Too many iterations may hurt the model, since it may overfit. I think you should evaluate on a dev set rather than only watching the training loss.

@AnkurDebnath35

Yeah, but all the samples are unseen examples for the model, so it has not overfitted yet, though it surely can. I do have a validation set; I just can't find any code in this repo for validation. Can you point me to it?

@shawnthu

A simple way: split the raw dataset into train and dev sets, then calculate the loss on dev after every epoch.
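A sketch of what that could look like with this repo's pieces, assuming the Mel2Samp dataset from mel2samp.py and WaveGlowLoss from glow.py (the dev file list is a placeholder, and the data_config values mirror config.json defaults):

```python
import torch
from torch.utils.data import DataLoader
from mel2samp import Mel2Samp
from glow import WaveGlowLoss

# Same keys as the "data_config" block in config.json (default values shown).
data_config = {
    "segment_length": 16000, "sampling_rate": 22050,
    "filter_length": 1024, "hop_length": 256, "win_length": 1024,
    "mel_fmin": 0.0, "mel_fmax": 8000.0,
}
dev_loader = DataLoader(Mel2Samp("dev_files.txt", **data_config),
                        batch_size=4, shuffle=False)
criterion = WaveGlowLoss(sigma=1.0)

def dev_loss(model):
    """Average loss over the held-out dev set; call once per epoch in train.py."""
    model.eval()
    total, batches = 0.0, 0
    with torch.no_grad():
        for mel, audio in dev_loader:
            outputs = model((mel.cuda(), audio.cuda()))  # train.py feeds (mel, audio)
            total += criterion(outputs).item()
            batches += 1
    model.train()
    return total / max(batches, 1)
```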

@AnkurDebnath35

That I already have; it's just a question of where to modify train.py to accommodate val_loss.

@AnkurDebnath35

Can someone help me? I want to cite this repository in my paper, but I don't know which reference to cite it against.

@rafaelvalle (Contributor)

@AnkurDebnath35 Just cite our paper. And please, next time, do not ask questions that are unrelated to the thread.

```bibtex
@inproceedings{prenger2019waveglow,
  title={Waveglow: A flow-based generative network for speech synthesis},
  author={Prenger, Ryan and Valle, Rafael and Catanzaro, Bryan},
  booktitle={ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={3617--3621},
  year={2019},
  organization={IEEE}
}
```
