
Why is the 'training' parameter of dropout in Prenet set to True? #247

Closed
jjl1994 opened this issue Aug 1, 2019 · 18 comments

Comments
@jjl1994 commented Aug 1, 2019

In the Prenet code:

def forward(self, x):
    for linear in self.layers:
        x = F.dropout(F.relu(linear(x)), p=0.5, training=True)
    return x

Why is 'training=True'? Shouldn't it be 'training=self.training'? Does that mean we apply dropout at inference? I changed this to 'training=self.training' and the pre-trained model was unable to generate correct audio.
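For context, here is a minimal sketch (not from this repo) of the behavior in question: F.dropout consults only its training argument and ignores the module's train/eval mode, so hard-coding training=True keeps dropout active at inference.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.ones(1, 4)

# training=True: dropout fires regardless of module mode, so two
# "inference" calls give different, randomly masked outputs.
print(F.dropout(x, p=0.5, training=True))
print(F.dropout(x, p=0.5, training=True))

# training=False (what training=self.training becomes after model.eval()):
# dropout is the identity and the output is deterministic.
print(F.dropout(x, p=0.5, training=False))  # tensor([[1., 1., 1., 1.]])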

@yuxinyuan

This is mentioned in the original Tacotron2 paper.

In order to introduce output variation at inference time, dropout with probability 0.5 is applied only to layers in the pre-net of the autoregressive decoder.

The code just follows the specification.

@jjl1994 (Author) commented Aug 2, 2019

@yuxinyuan Hi, I noticed that this is mentioned in the Tacotron2 paper: 'to introduce output variation at inference time'. But why does the model output noise after I set this to False at inference? I don't understand why the model can only work with dropout set to True.

@jjl1994 (Author) commented Aug 2, 2019

@yuxinyuan When I test with the pre-trained model, even if I keep training=True but change the dropout rate to some other value, for example 0.3, the model also generates noise. That's weird. The pre-trained model only works with training=True and droprate=0.5.

@Yeongtae commented Aug 2, 2019

@jjl1994
The model has learned from only half of the information (dropout(0.5)) in the previous mel frame.

Giving it the full information (dropout(0.0)) of the previous mel makes it hard for the decoder to predict correctly: the full input is too much for a prenet that consists of (fc, dropout(0.5)) × 2.

@jjl1994 (Author) commented Aug 2, 2019

@Yeongtae I think the framework already compensates automatically when dropout is enabled: if it did not rescale the surviving activations of the (fc, dropout(0.5)) × 2 stack, it would not get the correct loss and could not update the network properly. This mechanism is mentioned in Fei-Fei Li's course. Also, if the model had only learned from half the information, we would always have to use dropout at inference, but in practice we don't; we only use dropout during training (in maybe 99% of cases).
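For example, a quick check (a sketch, not repo code) that PyTorch's inverted dropout already applies the 1/(1-p) compensation during training:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.ones(1000)

# Inverted dropout: surviving units are scaled by 1 / (1 - p) at training
# time, so the expected activation matches the input and no extra
# rescaling is needed at inference.
y = F.dropout(x, p=0.5, training=True)
print(y[y != 0].unique())  # tensor([2.]) -- survivors are doubled
print(y.mean())            # close to 1.0 in expectation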

@Yeongtae commented Aug 2, 2019

If you set the droprate to 0.15 instead of 0.5 and use training=self.training, you can work around this problem.

The text-to-mel model can then always produce audio for a given utterance, but it is harder to converge and less stable than with droprate=0.5.
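For reference, a minimal sketch of that workaround (the constructor signature is assumed from the usual Tacotron2 prenet, not copied from this repo):

import torch.nn.functional as F
from torch import nn

class Prenet(nn.Module):
    def __init__(self, in_dim, sizes=(256, 256), p=0.15):
        super().__init__()
        in_sizes = [in_dim] + list(sizes[:-1])
        self.layers = nn.ModuleList(
            nn.Linear(in_size, out_size, bias=False)
            for in_size, out_size in zip(in_sizes, sizes))
        self.p = p  # 0.15 as suggested above, instead of the usual 0.5

    def forward(self, x):
        for linear in self.layers:
            # dropout now switches off automatically after model.eval()
            x = F.dropout(F.relu(linear(x)), p=self.p, training=self.training)
        return x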

@jjl1994 (Author) commented Aug 2, 2019

@Yeongtae Hi, maybe (p=0.15 and dropout disabled) at inference will let the prenet give similar results to (p=0.5 and dropout enabled). It seems the autoregressive decoder is extremely sensitive to its input. Dropout before a conventional layer (such as a dense layer) will not affect its output much, but it really does affect the output of the autoregressive decoder.

@rafaelvalle (Contributor)

@jjl1994 You can run an experiment and train the model from scratch without dropout on the prenet.

@wizardk commented Aug 7, 2019

@jjl1994 You are right. Using dropout at inference is not wise. The Mozilla version of Tacotron2 works the way you want. Moreover, using a small dropout rate over so many parameters is not wise either. The original model structure is just a reference; you can optimize it.

@terryyizhong

Are there any experimental results from training the model with training=self.training?

@rafaelvalle (Contributor)

Closing due to inactivity.

@kevinmtian

@terryyizhong I can share the validation loss from setting training=self.training on my end. I trained on LJSpeech from scratch using the identical params provided in master. It looks very strange; I am looking into what could have gone wrong.

Validation loss 200:  8.105130
Validation loss 400: 12.825017
Validation loss 600: 11.753986
Validation loss 800: 14.233746
Validation loss 1000: 14.253099
Validation loss 1200: 18.660198
Validation loss 1400: 17.960465
Validation loss 1600: 19.330160
Validation loss 1800: 22.346097
Validation loss 2000: 23.067725
Validation loss 2200: 25.812730
Validation loss 2400: 26.288597
Validation loss 2600: 29.514675
Validation loss 2800: 27.077643
Validation loss 3000: 27.432822
Validation loss 3200: 29.471922
Validation loss 3400: 30.740887
Validation loss 3600: 30.523686
Validation loss 3800: 31.277980
Validation loss 4000: 31.414633
Validation loss 4200: 31.757557
Validation loss 4400: 30.777057
Validation loss 4600: 32.895072
Validation loss 4800: 33.554407

@terryyizhong

> @kevinmtian: I can share the validation loss from setting training=self.training on my end... (validation loss log quoted in full above)

Thanks for your information. I tried this before and encountered the same problem: the loss keeps going up after several steps. Looking forward to your finding the solution.

@zwlanpishu

> @kevinmtian: I can share the validation loss from setting training=self.training on my end... (validation loss log quoted in full above)

Have you solved the problem? I encountered the same problem when setting training=self.training: the validation loss keeps going up after some steps, especially when the reduction factor r = 1.

@terryyizhong

No, I think this issue should be reopened.

@CookiePPP commented May 14, 2020

@terryyizhong
@zwlanpishu
Do either of you have some alignment and predicted spectrogram pictures you can upload?
(Images tab in Tensorboard)

@zwlanpishu

> @CookiePPP: Do either of you have some alignment and predicted spectrogram pictures you can upload? (Images tab in Tensorboard)

I am still training. It is really hard to converge with a reduction factor r = 1. Usually, how many steps does it take to pick up an alignment on the LJSpeech dataset?

@zwlanpishu

> @CookiePPP: Do either of you have some alignment and predicted spectrogram pictures you can upload? (Images tab in Tensorboard)

When training with the prenet dropout set to training=self.training, the process is the same as before, so it also converges and gets an alignment. But the validation loss easily overfits with dropout disabled: it rises quickly after several epochs. As a result, the model cannot work with the prenet dropout disabled at inference. However, setting the prenet dropout to training=True solves the problem, with a non-overfitting validation loss. As discussed above, maybe (p = 0.15 and dropout disabled) at inference will let the prenet give similar results to (p = 0.5 and dropout enabled).

Training loss: [plot]
Validation loss with dropout disabled at inference: [plot]
