Softmax+crossentropy can lead to nan in gradient for unbounded activations #3162
Comments
Another Lasagne user hit the same problem, and was able to solve it via the same solution: Lasagne/Lasagne#333 (comment)
It surprises me that this isn't the case yet. I thought it was. Maybe the optimization exists, but we're doing something different and it isn't being triggered?
No, @jsusskin's example above just uses vanilla Theano. It's not because of Lasagne.
I had some discussion with @lamblin about this. We do cover some cases like this, but with more complicated graph rewrites that do stabilization and speed optimization together, and those don't cover every stability case. We think what needs to be done is an optimization, registered between the stabilize and specialize optimizations, that converts log(softmax(x)) to logsoftmax(x). The new op would compute the stabilized version of that expression. Could you make such an op? I can take care of the optimization (registering it in the right place). Done this way, we keep our combined stability/speed optimizations and also cover the missing stability case.
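(As a quick aside on why rewriting log(softmax(x)) into a fused logsoftmax(x) matters numerically, here is a minimal NumPy illustration with made-up logits chosen to overflow float64; it is not part of the original discussion.)

import numpy as np

x = np.array([[0., 500., 1000.]])   # large logits, e.g. coming out of an unbounded activation

# naive log(softmax(x)): exp(1000) overflows, so the softmax comes out as 0 / nan
sm = np.exp(x) / np.sum(np.exp(x), axis=1, keepdims=True)
naive = np.log(sm)

# fused logsoftmax: subtract the row max before exponentiating
xdev = x - x.max(axis=1, keepdims=True)
stable = xdev - np.log(np.sum(np.exp(xdev), axis=1, keepdims=True))

print naive    # [[-inf -inf  nan]]
print stable   # [[-1000.  -500.     0.]]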
There is an optimization that stabilizes the expression, but I think it gets triggered only if the target y is expressed as a vector of indices, not as a matrix of one-hot vectors.
You're right. If I change the example to use an integer vector for the target y, the nans in the gradient go away:

import theano
import theano.tensor as T
import numpy as np
x=T.matrix('x')
y=T.ivector('y')
# regular softmax and crossentropy
sm = T.nnet.softmax(x)
cm1=T.nnet.categorical_crossentropy(sm,y)
g1 = T.grad(cm1.mean(),x)
# numerically stable log-softmax with crossentropy
xdev = x-x.max(1,keepdims=True)
lsm = xdev - T.log(T.sum(T.exp(xdev),axis=1,keepdims=True))
sm2 = T.exp(lsm) # just used to show equivalence with sm
cm2=-lsm[T.arange(y.shape[0]), y]
g2 = T.grad(cm2.mean(),x)
# create some large inputs into a softmax
a=np.exp(10*np.random.rand(5,10).astype(theano.config.floatX))
# create some integer class labels
b=np.random.randint(0,10,5).astype(np.int32)
# show equivalence of softmax and exponentiated numerically stable log-softmax
f1=theano.function([x],[sm, sm2])
sm1,sm2=f1(a)
print np.allclose(sm1,sm2)
# now show that the two versions result in the same crossentropy cost
# this indicates that the forward function does provide some numerical stability
f2=theano.function([x,y],[cm1,cm2])
c1,c2 = f2(a,b)
print np.allclose(c1,c2)
# now, show that in the standard softmax case the gradients blow up
# while in the log-softmax case they don't
f3=theano.function([x,y],[g1,g2])
g1_,g2_ = f3(a,b)
print g1_
print g2_
The existing optimization for the 1-hot case probably knows about the
Why does it need a new Op for this? Wouldn't it be enough to directly replace log(softmax) with the stabilized expression?
cuDNN has a logsoftmax, and I think it would be faster to use it. Having the op lets us dispatch to cuDNN's implementation when it is available; if cuDNN isn't available, we can still do the full graph on the GPU.
Good find @lamblin -- as a motivation for why this is still important to address, consider the case where the target labels are probabilities rather than one-hot-coded labels. In that case each observation's target has to be a full vector summing to 1, so the integer-index form isn't available.
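(To make that concrete, here is a small sketch of the soft-target case, using the same stabilized expression as the snippet above; the variable names are mine, not from the thread.)

import theano
import theano.tensor as T
import numpy as np

x = T.matrix('x')
t = T.matrix('t')                              # each row is a probability distribution over classes
xdev = x - x.max(1, keepdims=True)
lsm = xdev - T.log(T.sum(T.exp(xdev), axis=1, keepdims=True))
cost = -T.sum(t * lsm, axis=1)                 # crossentropy against soft targets
g = T.grad(cost.mean(), x)
f = theano.function([x, t], [cost, g])

a = np.exp(10 * np.random.rand(5, 10).astype(theano.config.floatX))
p = np.random.rand(5, 10).astype(theano.config.floatX)
p /= p.sum(axis=1, keepdims=True)              # rows sum to 1
c_, g_ = f(a, p)
print g_                                       # finite, even for saturating inputs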
The problem also goes away if you explicitly write out the softmax function instead of using Theano's.
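The snippet and output from that comment were not preserved above. As a rough sketch of the idea (my guess, not the author's original code), the hand-written softmax is just the exponentiated stable log-softmax from the earlier example, fed into the library crossentropy:

# reuses x, y, lsm, a and b from the integer-vector snippet above
sm_manual = T.exp(lsm)                           # hand-written softmax, same values as T.nnet.softmax(x)
cm3 = T.nnet.categorical_crossentropy(sm_manual, y)
g3 = T.grad(cm3.mean(), x)
f4 = theano.function([x, y], g3)
print f4(a, b)                                   # reported to come out finite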
@lamblin
This issue is closed, and so is the corresponding one in Lasagne, so this should be stabilized now.
In #4451 (comment), my former self indicated that it's resolved.
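(For reference, assuming a Theano release that includes the LogSoftmax op discussed in this thread, which I believe is exposed as theano.tensor.nnet.logsoftmax, using it directly would look roughly like this; treat the exact name as an assumption.)

import theano
import theano.tensor as T
import numpy as np

x = T.matrix('x')
y = T.matrix('y')                          # one-hot or probabilistic targets
lsm = T.nnet.logsoftmax(x)                 # assumed name of the fused, stabilized op
cost = -T.sum(y * lsm, axis=1).mean()
g = T.grad(cost, x)
f = theano.function([x, y], g)

a = np.exp(10 * np.random.rand(5, 10).astype(theano.config.floatX))
b = np.eye(10, dtype=theano.config.floatX)[np.random.randint(0, 10, 5)]
print f(a, b)                              # finite gradients, per the comments above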
Could theano check bounded input for |
We have a special mode to help detect nans: http://deeplearning.net/software/theano/tutorial/nan_tutorial.html#run-in-nanguardmode-debugmode-or-monitormode Read the full page; it contains a lot of good information related to that.
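(For completeness, a minimal sketch of the NanGuardMode usage from that tutorial, applied to the f3 function and the x, y, g1, g2, a, b variables from the snippet above; the constructor arguments follow the linked docs.)

from theano.compile.nanguardmode import NanGuardMode

# recompile the gradient function so that any nan/inf raises an error immediately
f3_guarded = theano.function(
    [x, y], [g1, g2],
    mode=NanGuardMode(nan_is_error=True, inf_is_error=True, big_is_error=True))
f3_guarded(a, b)   # any nan or inf appearing during the computation now raises instead of propagating silently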
When using unbounded activation functions (e.g. ReLU), the softmax function can saturate. This can lead to nan gradients when paired with the categorical crossentropy cost.
If the softmax function is replaced with a numerically stable version of log-softmax and this is used directly in the cost function, then the gradients don't blow up.
It seems that this could be implemented as a pattern to recognize (softmax paired with categorical crossentropy).
Here's a code snippet that illustrates the problem: the regular softmax gives nans in the gradient, while the numerically stable log-softmax does not blow up. Interestingly, the experiment indicates that in the regular softmax case the crossentropy loss itself comes out numerically stable, but the gradient does not.
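The snippet itself did not survive in this copy of the thread. A reconstruction, adapted from the integer-vector variant quoted earlier but with one-hot matrix targets for y (the case that triggers the nan), would look roughly like this:

import theano
import theano.tensor as T
import numpy as np

x = T.matrix('x')
y = T.matrix('y')   # one-hot target matrix
# regular softmax and crossentropy
sm = T.nnet.softmax(x)
cm1 = T.nnet.categorical_crossentropy(sm, y)
g1 = T.grad(cm1.mean(), x)
# numerically stable log-softmax with crossentropy
xdev = x - x.max(1, keepdims=True)
lsm = xdev - T.log(T.sum(T.exp(xdev), axis=1, keepdims=True))
sm2 = T.exp(lsm)    # just used to show equivalence with sm
cm2 = -T.sum(y * lsm, axis=1)
g2 = T.grad(cm2.mean(), x)
# large inputs that saturate the softmax, and one-hot coded labels
a = np.exp(10 * np.random.rand(5, 10).astype(theano.config.floatX))
b = np.eye(10, dtype=theano.config.floatX)[np.random.randint(0, 10, 5)]
# softmax and exponentiated stable log-softmax agree
f1 = theano.function([x], [sm, sm2])
sm1_, sm2_ = f1(a)
print np.allclose(sm1_, sm2_)
# the two crossentropy costs agree (the forward pass is stable)
f2 = theano.function([x, y], [cm1, cm2])
c1, c2 = f2(a, b)
print np.allclose(c1, c2)
# but the gradient of the regular softmax version blows up, while the log-softmax one stays finite
f3 = theano.function([x, y], [g1, g2])
g1_, g2_ = f3(a, b)
print g1_
print g2_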
OUTPUT:
True
True
[[ nan nan nan nan nan nan nan nan nan nan]
[ nan nan nan nan nan nan nan nan nan nan]
[ nan nan nan nan nan nan nan nan nan nan]
[ nan nan nan nan nan nan nan nan nan nan]
[ nan nan nan nan nan nan nan nan nan nan]]
[[-0.2 0.2 0. 0. 0. 0. 0. 0. 0. 0. ]
[ 0. -0.2 0. 0.2 0. 0. 0. 0. 0. 0. ]
[ 0. 0. -0.2 0. 0. 0. 0.2 0. 0. 0. ]
[ 0. 0. 0. -0.2 0. 0. 0.2 0. 0. 0. ]
[ 0. 0. 0. 0. -0.2 0.2 0. 0. 0. 0. ]]