'NaN' when using nesterov momentum and high learning rates #765
Comments
Hi, good work tracking it down to a commit. My first assumption would have been that your learning rate was just too high, but since it behaved differently in the past we should figure out what's going on. Assuming the bug is not obvious (it's not to me), there are some things you can do to help this get fixed quicker:
I'm working on it now; however, I would like to ask that the Lasagne group also check on this. It was working fine in the past, for a long time, probably some six months before that commit, so it wasn't just a random "working thing".
Done, script added!
Test cases:
It doesn't happen when you remove a hidden layer. For example, the model below causes NaNs:
I think it might just be because of using a sigmoid activation function with binary crossentropy. If the result is exactly 1 or exactly 0, the error function produces a NaN, which ends up corrupting the weights and future outputs. Try using clipped results when calculating the error:
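The snippet that presumably followed here is not preserved; below is a minimal sketch of the clipping idea, with a made-up one-unit sigmoid network standing in for the model from the issue's script (shapes, names, and the epsilon value are my own choices):

```python
# Minimal sketch of the clipping suggestion (the network here is a stand-in,
# not the model from the issue's script).
import theano.tensor as T
import lasagne

input_var = T.matrix('inputs')
target_var = T.vector('targets')

l_in = lasagne.layers.InputLayer((None, 784), input_var=input_var)
l_out = lasagne.layers.DenseLayer(l_in, num_units=1,
                                  nonlinearity=lasagne.nonlinearities.sigmoid)

prediction = lasagne.layers.get_output(l_out).flatten()
# Keep predictions strictly inside (0, 1) so the log() in the cross-entropy
# never sees an exact 0 or 1; the epsilon value is an arbitrary choice.
eps = 1e-7
prediction = T.clip(prediction, eps, 1.0 - eps)
loss = lasagne.objectives.binary_crossentropy(prediction, target_var).mean()
```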
If that doesn't work, it might be because the default activation function for dense layers is ReLU, IIRC. The weights can get unrealistically high when the error is too big, which results in values large enough to cause NaNs. To address that, I would suggest scaling the weights or weight updates so they stay within a reasonable range.
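Continuing the sketch above, one way to keep updates in a bounded range is to constrain the gradient norm before the Nesterov momentum step. This is only an illustration of the suggestion; the max_norm and learning_rate values are arbitrary placeholders:

```python
# Clip the overall gradient norm before the update step; reuses `l_out` and
# `loss` from the sketch above. max_norm=5.0 is an arbitrary placeholder.
import theano
import lasagne

params = lasagne.layers.get_all_params(l_out, trainable=True)
grads = theano.grad(loss, params)
grads = lasagne.updates.total_norm_constraint(grads, max_norm=5.0)
updates = lasagne.updates.nesterov_momentum(grads, params,
                                            learning_rate=0.01, momentum=0.9)
```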
Thanks for the test script. So it seems to me the issue comes from the output dimension being marked broadcastable. It's still not clear to me why this difference happens - maybe @f0k can weigh in on whether it's a Theano bug, but I think we don't want to be marking those dims broadcastable anyway. Here's a simplified demonstration:
output:
versus
output:
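The original snippet and its two outputs are not preserved above. As a hedged reconstruction of the kind of comparison described (my own code, not Eben's), the following feeds a saturated sigmoid into binary_crossentropy, once with matching broadcast patterns and once with the prediction broadcastable and the target not:

```python
# Hedged reconstruction, not the original demonstration.
import numpy as np
import theano
import theano.tensor as T

floatX = theano.config.floatX

pre_plain = T.matrix('pre_plain')                              # pattern (False, False)
pre_bcast = T.TensorType(floatX, (False, True))('pre_bcast')   # pattern (False, True)
targets = T.matrix('targets')                                  # pattern (False, False)

loss_same = T.nnet.binary_crossentropy(T.nnet.sigmoid(pre_plain), targets).mean()
loss_diff = T.nnet.binary_crossentropy(T.nnet.sigmoid(pre_bcast), targets).mean()

f_same = theano.function([pre_plain, targets], loss_same)
f_diff = theano.function([pre_bcast, targets], loss_diff)

x = np.full((4, 1), 40.0, dtype=floatX)   # sigmoid(40) rounds to exactly 1.0 in float32
t = np.zeros((4, 1), dtype=floatX)

# When the log(sigmoid) rewrite is applied, both values stay finite; on Theano
# versions affected by this issue, the mismatched-pattern graph could return
# inf/NaN here instead.
print(f_same(x, t), f_diff(x, t))
```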
Thank you for bisecting this, Andre, and thank you for the clear demonstration, Eben!
I just commented on the mailing list post: https://groups.google.com/forum/#!topic/lasagne-users/y-yb6dO_Dzg
#715 did that on purpose to obtain a broadcastable tensor from a 1-unit dense layer. Unfortunately, nobody noticed how it affected the log(sigmoid) optimization, and we didn't have any tests for this either -- we're fully relying on Theano there. @andrelopes1705: A quick workaround for now is to have your target variable be a column vector as well:
target_var = T.vector()
...
loss = lasagne.objectives.binary_crossentropy(prediction, target_var.dimshuffle(0, 'x'))
Or, with even fewer changes to your code:
target_var = T.TensorType(theano.config.floatX, (False, True))('targets')
But we need to figure out how to change Lasagne or Theano to make this work seamlessly again with existing code using a plain target variable. @nouiz, any insights from your side?
Just to see how the graphs differ -- with same broadcast pattern for predictions and targets:
With different broadcast pattern:
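The actual graph dumps are not preserved above. As an illustrative sketch (not the original output), the two optimized graphs can be printed with debugprint and inspected for the stable softplus rewrite of log(sigmoid):

```python
# Illustrative sketch only; variable names and shapes are my own choices.
import theano
import theano.tensor as T
from theano.printing import debugprint

floatX = theano.config.floatX
pre = T.TensorType(floatX, (False, True))('pre')          # broadcastable column, as after #715
pred = T.nnet.sigmoid(pre)
t_same = T.TensorType(floatX, (False, True))('t_same')    # matching broadcast pattern
t_diff = T.matrix('t_diff')                               # plain, non-broadcastable target

# Same broadcast pattern:
debugprint(theano.function([pre, t_same], T.nnet.binary_crossentropy(pred, t_same).mean()))
# Different broadcast pattern:
debugprint(theano.function([pre, t_diff], T.nnet.binary_crossentropy(pred, t_diff).mean()))
```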
Someone here will work on that in the next few days.
I appreciate the support!
We merged a fix for that in Theano. Can you try it in your own environment to make sure the fix is complete?
Yes, of course! Should I try with my own test case, or do you want me to test with some script? Requesting instructions to test.
Great!
Yes, it works both for Eben's test case and mine. Good job @ReyhaneAskari!
Just run the same code you used for bisecting Lasagne. It should train fine now when you update Theano to the latest version from git, both with old and recent Lasagne versions.
It's fixed!
In my case, I changed the learning rate to 0.0012 and it works now.
Introduction
On August 21, I reported a bug on the lasagne group forum about this.
Basically, when you are using Nesterov momentum to train conv nets, if your learning rate starts too high, NaNs show up in the loss function (both training and validation loss) while accuracy remains normal. In my case, they start happening near epoch 9.
Note: the bug seems to happen more often when your dataset is unbalanced or has many examples of a single label.
This happens with the most recent updates of Theano and Lasagne.
Right now, I'm using an older version of Lasagne that doesn't have this issue:
"pip install --upgrade --no-deps git+git://github.com/Lasagne/Lasagne.git@5a009f9"
Jan Schlüter asked me to bisect the code with git to find the guilty commit.
I did it twice to confirm and found it.
4d4e0b0796634c23ad43889685ee4b428fe30f8a is the first bad commit
commit 4d4e0b0
Author: Jan Schlüter <jan.schlueter@ .at>
Date: Thu Jun 30 13:30:29 2016 +0200
:040000 040000 93ce1ab99b4d4bd14d3825a0b143109aa7d234b2 020a514a738f4fd6d29f538fa2d0fdc5af76555b M lasagne
The model I'm using and the learning rate I'm using are below:
Momentum is kept at the default.
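The original model and learning-rate listing are not preserved in this text. Purely as an illustrative stand-in for the kind of setup described (a small conv net with a one-unit sigmoid output trained with Nesterov momentum), not the reporter's actual architecture or hyperparameters:

```python
# Illustrative stand-in only: the original model and learning rate are not
# preserved here; all layer sizes and the learning rate are placeholders.
import theano
import theano.tensor as T
import lasagne

input_var = T.tensor4('inputs')
target_var = T.matrix('targets')

net = lasagne.layers.InputLayer((None, 1, 28, 28), input_var=input_var)
net = lasagne.layers.Conv2DLayer(net, num_filters=32, filter_size=(5, 5))
net = lasagne.layers.MaxPool2DLayer(net, pool_size=(2, 2))
net = lasagne.layers.DenseLayer(net, num_units=256)
net = lasagne.layers.DenseLayer(net, num_units=1,
                                nonlinearity=lasagne.nonlinearities.sigmoid)

prediction = lasagne.layers.get_output(net)
loss = lasagne.objectives.binary_crossentropy(prediction, target_var).mean()

params = lasagne.layers.get_all_params(net, trainable=True)
# The report only says the learning rate "starts too high"; 0.01 is a placeholder.
updates = lasagne.updates.nesterov_momentum(loss, params,
                                            learning_rate=0.01, momentum=0.9)
train_fn = theano.function([input_var, target_var], loss, updates=updates)
```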
Same issue!
The problem is in the commit identified above.
Thanks! I really need this bug fixed since my thesis is using lasagne and referencing it a lot!
Please help!
I recreated the bug using the MNIST dataset. However, due to the different data I'm using, I had to transform the labels with y[y > 1] = 0, so it simulates the imbalance of my dataset.
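For reference, a tiny sketch of that transform (the surrounding MNIST loading code is omitted; the random labels here are just a stand-in):

```python
# Collapse every label greater than 1 to 0, leaving a heavily unbalanced
# binary problem (mostly zeros, few ones).
import numpy as np

y = np.random.randint(0, 10, size=1000).astype(np.uint8)  # stand-in for MNIST labels
y[y > 1] = 0
print(np.bincount(y))  # roughly 90% zeros and 10% ones after the transform
```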
And yes, the bug starts happening either in the initial epochs (2 or 3) or before 10 epochs.
It doesn't happen on the version of Lasagne I mentioned before: git+git://github.com/Lasagne/Lasagne.git@5a009f9
I tested more than 10 times on the bleeding-edge version and more than 10 times on the version above.
The script in the zip and below simulates it.
Main_RebalancedMnist.zip