
During training, the loss value goes up and down and cannot converge; is that normal? Also, what should the final loss value look like? #49

Closed
yoyololicon opened this issue Nov 28, 2018 · 15 comments

Comments

@yoyololicon

I implemented a WaveGlow model in my own project. The code is almost the same as this repo, with some modifications (sketched below):

  1. Upsample the mel-spectrogram to the number of groups, so n_mel_channels in WN reduces to 80.
  2. Change logdet() in invertable1x1 to det().abs().log(), as #35 (pretrained model which can resume training) did, because in the first few runs the loss became NaN after thousands of steps.
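Roughly, the two changes look like this (a sketch, not my exact code; the helper name and shapes are just illustrative):

```python
import torch
import torch.nn.functional as F

# Modification 1 (illustrative): stretch the mel-spectrogram along time so that
# one 80-dim mel vector lines up with each audio group, letting WN condition on
# 80 channels instead of 80 * n_group.
def upsample_mel(mel, factor):
    # mel: (batch, 80, frames) -> (batch, 80, frames * factor)
    return F.interpolate(mel, scale_factor=factor, mode='nearest')

# Modification 2: compute the 1x1 convolution's log-determinant as
# det().abs().log() instead of logdet(), so a negative determinant gives a
# finite value instead of NaN.
W = torch.randn(8, 8)
log_det_original = torch.logdet(W)      # NaN when det(W) < 0
log_det_variant = W.det().abs().log()   # finite for any det(W) != 0
```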

n_channels is 256, so the model is 4~5 times smaller than the original. I run the model on two 1080 Ti GPUs using nn.DataParallel with a batch size of 8. After about 5k steps the loss is around -6 ~ -7 and I can hear some speech-like sentences in the model outputs. Then the loss value starts to go up and down, even above zero, and cannot go any lower. Adding 24 flows to the model doesn't help; with a batch size of 32 the problem still exists. Maybe it would improve with more steps, but after 70k steps I still cannot see any improvement. Has anyone had similar problems?

I also want to ask about the final loss as a reference. In my case, -11 is the smallest value I can get before the aforementioned problem happens. In #5, @azraelkuan got a loss of -18 at 56k steps; is that the normal loss value?

@jiqizaisikao

I trained for 170k steps, but the loss is still around -5.

@azraelkuan
Contributor

The -18 is not correct; I have fixed it. The reason is that I used the wrong audio values.
In my experience, the loss should be around -6 ~ -7 at the end.

@yoyololicon
Author

yoyololicon commented Nov 28, 2018

@azraelkuan Thank you for clearing things up!
@jiqizaisikao Did you get good results? When the loss value starts to change dramatically, my model can only produce noise and spikes.

@rafaelvalle
Contributor

The fact that you were able to train after taking the absolute value of the determinant suggests that your learning rate was too high. Given that we initialize the determinants to be positive, the determinant crossing from positive to negative suggests that during optimization we step over infinite error at determinant 0, which is bad.
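A toy numerical illustration of that singularity (not the repo's loss code; just scaled identity matrices):

```python
import torch

W = torch.eye(4)
for scale in [1.0, 0.1, 0.01, 1e-4]:
    det = torch.det(scale * W).item()
    neg_logdet = -torch.logdet(scale * W).item()
    print(f"det = {det:.1e}   -log det = {neg_logdet:.1f}")
# As det -> 0+ the -log det term in the loss grows without bound; at det <= 0
# it is undefined (inf/NaN). That is the infinite error the optimizer has to
# step over if the determinant flips sign.
```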

@yoyololicon
Author

> The fact that you were able to train after taking the absolute value of the determinant suggests that your learning rate was too high. Given that we initialize the determinants to be positive, the determinant crossing from positive to negative suggests that during optimization we step over infinite error at determinant 0, which is bad.

Thanks for the advice! Maybe 1e-4 is too high for a smaller model. I'll try decreasing it.

Here is my training curve. It is very unstable, but that might also be caused by a too-large learning rate.
[screenshot: training loss curve, 2018-11-29]

@jiqizaisikao

jiqizaisikao commented Nov 29, 2018

Not very good; the result has echo. It seems the pitch and amplitude have not converged, so the output sounds like an echoing voice, but I think it will get better with more training.
I used a learning rate of 5e-5.

@yoyololicon
Author

[screenshot: training loss curve, 2018-11-29]

I re-trained the model with a 5e-5 learning rate, and after 10k steps the loss started to jump around again.
I feel like I'm wasting my time tuning the parameters...
If anyone knows what's happening or how to fix this, please let me know.

@yoyololicon
Author

@rafaelvalle I still get NaN loss after changing back to logdet(). I checked log_det_W_total in the loss function and found it approached a very negative number, which means the determinant approached zero during training. So how did you prevent it from crossing zero? A smaller learning rate doesn't help for me.
[screenshot: TensorBoard curve of log_det_W_total, 2018-11-29]

I have tried using 1e-5 as the learning rate, but progress is too slow; after 8k steps the loss is still around -3.5.
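For anyone who wants to watch the same thing, this is roughly how I log the per-flow log-determinant terms (a sketch; it assumes the forward pass returns them as a list, and the names here are illustrative):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("logs/waveglow")

# Log each flow's log-determinant separately so a single W drifting toward a
# singular matrix shows up early, instead of being hidden inside the total.
def log_determinants(log_det_W_list, step):
    for k, log_det_W in enumerate(log_det_W_list):
        writer.add_scalar(f"log_det_W/flow_{k}", log_det_W.item(), step)
```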

@rafaelvalle
Contributor

We trained the model for 540k iterations with batch size 24. Rushing and increasing the learning rate is probably not in your interest.
I would suggest making sure your data does not contain all-silence samples and using a small learning rate.
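For example, a minimal sketch of a silence check for training segments (the 0.01 amplitude threshold and 95% fraction are arbitrary illustrative values, not something we ship):

```python
import torch

# Flag segments that are almost entirely below a small amplitude threshold so
# they can be skipped or re-cut before training.
def is_mostly_silence(audio, threshold=0.01, max_silent_fraction=0.95):
    # audio: 1-D float tensor normalized to [-1, 1]
    silent_fraction = (audio.abs() < threshold).float().mean().item()
    return silent_fraction > max_silent_fraction
```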

@triwoods

triwoods commented Nov 29, 2018

@rafaelvalle Thanks for the implementation. As you mentioned, on the LJSpeech dataset with a batch size of 24 it takes 540k iterations to reach good performance. I assume you used the 10 V100s from the paper; how many days does it take to train 540k iterations? For people who only have a single 1080 Ti, which can only fit a batch size of 1, does that mean I need roughly 540k * 24 = 12,960k iterations (a few months of training time) to reach the same performance?

@yoyololicon
Author

@rafaelvalle Yeah, you are probably right. But it seems everyone else is fine with a 1e-4 learning rate; it's weird that this only happens to me. And in my experience, a learning rate that is too large would not cause this kind of crazy loss curve. I might have missed something, but I have no clue.

Anyway, I uploaded my implementation on GitHub here. If anyone can help me find the solution, I would really appreciate it.

@yoyololicon
Author

I found out this might be caused by the implementation of the loss function. I was using Python's sum() to add up log_det_W_list and log_s_list. After switching to cumulatively adding the determinants and log_s in the forward pass, the model suddenly works. The training process is stable now; I should have noticed this earlier.
[screenshot: stable training loss curve, 2018-11-30]
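Schematically, the change looks like this (a sketch rather than my exact code; `flows` / `flow(audio)` are placeholders for the real modules):

```python
import torch

# Variant A (what I had): the forward pass returns Python lists of per-flow
# terms, and the loss function combines them with the built-in sum().
def loss_from_lists(z, log_s_list, log_det_W_list, sigma=1.0):
    log_s_total = sum(log_s.sum() for log_s in log_s_list)
    log_det_W_total = sum(log_det_W_list)
    loss = z.pow(2).sum() / (2 * sigma ** 2) - log_s_total - log_det_W_total
    return loss / z.numel()

# Variant B (the change that stabilized training for me): keep running totals
# inside the forward pass and return two tensors instead of two lists.
def forward_accumulate(audio, flows):
    log_s_total = 0.0
    log_det_W_total = 0.0
    for flow in flows:
        audio, log_s, log_det_W = flow(audio)  # placeholder per-flow step
        log_s_total = log_s_total + log_s.sum()
        log_det_W_total = log_det_W_total + log_det_W
    return audio, log_s_total, log_det_W_total
```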

@jcao-ai

jcao-ai commented Nov 30, 2018

@yoyololicon Sounds interesting. I have read your PR before. But do you know why it leads to the instability?

@yoyololicon
Author

> @yoyololicon Sounds interesting. I have read your PR before. But do you know why it leads to the instability?

Actually, I don't know. According to this thread, using sum() to combine multiple losses should not be a problem:
https://discuss.pytorch.org/t/how-to-combine-multiple-criterions-to-a-loss-function/348

@yxt132

yxt132 commented Jan 30, 2019

> I found out this might be caused by the implementation of the loss function. I was using Python's sum() to add up log_det_W_list and log_s_list. After switching to cumulatively adding the determinants and log_s in the forward pass, the model suddenly works. The training process is stable now; I should have noticed this earlier.
> [screenshot: stable training loss curve, 2018-11-30]

What learning rate did you end up using eventually? What were your final loss and number of iterations when you got good results?
