
Softmax+crossentropy can lead to nan in gradient for unbounded activations #3162

Closed
jsusskin opened this issue Jul 17, 2015 · 14 comments · Fixed by #3363

@jsusskin

When using unbounded activation functions (e.g. ReLU), the softmax function can saturate. This can lead to nan gradients when paired with the categorical crossentropy cost.
If the softmax function is replaced with a numerically stable version of log-softmax and this is used directly in the cost function, the gradients don't blow up.
It seems that this could be implemented as a pattern to recognize (softmax paired with categorical crossentropy).
Here's a code snippet that contrasts the regular softmax with the numerically stable log-softmax: the former gives nans in the gradient, while the latter does not blow up. Interestingly, the experiment indicates that with the regular softmax the crossentropy loss itself comes out numerically stable, but the gradient does not.

import theano
import theano.tensor as T
import numpy as np

x,y=T.matrices('xy')

# regular softmax and crossentropy
sm = T.nnet.softmax(x)
cm1=T.nnet.categorical_crossentropy(sm,y)
g1 = T.grad(cm1.mean(),x)

# numerically stable log-softmax with crossentropy
xdev = x-x.max(1,keepdims=True)
lsm = xdev - T.log(T.sum(T.exp(xdev),axis=1,keepdims=True))
sm2 = T.exp(lsm) # just used to show equivalence with sm
cm2=-T.sum(y*lsm,axis=1)
g2 = T.grad(cm2.mean(),x)


# create some large inputs into a softmax
a=np.exp(10*np.random.rand(5,10).astype(theano.config.floatX))
# create some one-hot coded labels
b=np.eye(5,10).astype(theano.config.floatX)

# show equivalence of softmax and exponentiated numerically stable log-softmax
f1=theano.function([x],[sm, sm2])
sm1,sm2=f1(a)
print np.allclose(sm1,sm2)

# now show that the two versions result in the same crossentropy cost
# this indicates that the forward function does provide some numerical stability
f2=theano.function([x,y],[cm1,cm2])
c1,c2 = f2(a,b)
print np.allclose(c1,c2)

# now, show that in the standard softmax case the gradients blow up 
# while in the log-softmax case they don't
f3=theano.function([x,y],[g1,g2])
g1_,g2_ = f3(a,b)
print g1_
print g2_

OUTPUT:
True
True
[[ nan nan nan nan nan nan nan nan nan nan]
[ nan nan nan nan nan nan nan nan nan nan]
[ nan nan nan nan nan nan nan nan nan nan]
[ nan nan nan nan nan nan nan nan nan nan]
[ nan nan nan nan nan nan nan nan nan nan]]
[[-0.2  0.2  0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.  -0.2  0.   0.2  0.   0.   0.   0.   0.   0. ]
 [ 0.   0.  -0.2  0.   0.   0.   0.2  0.   0.   0. ]
 [ 0.   0.   0.  -0.2  0.   0.   0.2  0.   0.   0. ]
 [ 0.   0.   0.   0.  -0.2  0.2  0.   0.   0.   0. ]]

@f0k
Contributor

f0k commented Jul 20, 2015

Another Lasagne user hit the same problem and was able to solve it with the same fix: Lasagne/Lasagne#333 (comment)
It would be great to solve this via a graph optimizer in Theano. I could take a stab at it given some pointers in the right direction, but would be happy for somebody else to do it.

@benanne
Contributor

benanne commented Jul 20, 2015

It seems that this could be implemented as a pattern to recognize (softmax paired with categorical crossentropy).

It surprises me that this isn't the case yet. I thought it was. Maybe the optimization exists, but we're doing something different and it isn't being triggered?

@f0k
Contributor

f0k commented Jul 20, 2015

Maybe the optimization exists, but we're doing something different and it isn't being triggered?

No, @jsusskin's example above just uses vanilla Theano. It's not because of Lasagne.

@nouiz
Member

nouiz commented Jul 20, 2015

I had some discussion with @lamblin about this. We do cover some cases like this, but with more complicated graphs that handle both stabilization and speed optimization, and those don't cover all cases of stability optimization.

We think what needs to be done is an optimization, registered between the stabilize and specialize optimizations, that would convert log(softmax(x)) to logsoftmax(x). This op would compute the stabilized version of it.

Could you make such an op? I can take care of the optimization (registering it in the right place). Done like this, we keep our combined stability/speed optimizations and cover the missing stability case.
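
For reference, here is a rough sketch (not the actual fix) of how such a local optimizer could look if it simply substituted the stabilized expression for log(softmax(x)); the optimizer name and the exact matching/registration details are assumptions for illustration, and the real change may instead introduce a dedicated LogSoftmax Op as proposed above:

import theano.tensor as T
from theano import gof
from theano.tensor.nnet.nnet import Softmax

@gof.local_optimizer([T.log])
def local_stabilize_log_softmax(node):
    # match log(softmax(x)) and substitute the stabilized expression
    if node.op == T.log and node.inputs[0].owner is not None \
            and isinstance(node.inputs[0].owner.op, Softmax):
        x = node.inputs[0].owner.inputs[0]
        xdev = x - x.max(1, keepdims=True)
        return [xdev - T.log(T.sum(T.exp(xdev), axis=1, keepdims=True))]
    return False

# it would then be registered in the stabilization phase, e.g. via
# theano.tensor.opt.register_stabilize(local_stabilize_log_softmax)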

@lamblin
Member

lamblin commented Jul 21, 2015

There is an optimization that stabilizes the expression, but I think it gets triggered only if the target y is expressed as a vector of indices, not as a matrix of one-hot vectors.
Adding new optimizations for that case would make a lot of sense.

@f0k
Contributor

f0k commented Jul 21, 2015

There is an optimization that stabilizes the expression, but I think it gets triggered only if the target y is expressed as a vector of indices, not as a matrix of one-hot vectors.

You're right. If I change the example to use an integer vector for y, it is stable:

import theano
import theano.tensor as T
import numpy as np

x=T.matrix('x')
y=T.ivector('y')

# regular softmax and crossentropy
sm = T.nnet.softmax(x)
cm1=T.nnet.categorical_crossentropy(sm,y)
g1 = T.grad(cm1.mean(),x)

# numerically stable log-softmax with crossentropy
xdev = x-x.max(1,keepdims=True)
lsm = xdev - T.log(T.sum(T.exp(xdev),axis=1,keepdims=True))
sm2 = T.exp(lsm) # just used to show equivalence with sm
cm2=-lsm[T.arange(y.shape[0]), y]
g2 = T.grad(cm2.mean(),x)


# create some large inputs into a softmax
a=np.exp(10*np.random.rand(5,10).astype(theano.config.floatX))
# create some integer class labels
b=np.random.randint(0,10,5).astype(np.uint8)

# show equivalence of softmax and exponentiated numerically stable log-softmax
f1=theano.function([x],[sm, sm2])
sm1,sm2=f1(a)
print np.allclose(sm1,sm2)

# now show that the two versions result in the same crossentropy cost
# this indicates that the forward function does provide some numerical stability
f2=theano.function([x,y],[cm1,cm2])
c1,c2 = f2(a,b)
print np.allclose(c1,c2)

# now, show that in the standard softmax case the gradients blow up 
# while in the log-softmax case they don't
f3=theano.function([x,y],[g1,g2])
g1_,g2_ = f3(a,b)
print g1_
print g2_

produces:

Using gpu device 0: GeForce GT 640
True
True
[[ 0.   0.   0.   0.   0.2  0.  -0.2  0.   0.   0. ]
 [ 0.   0.2 -0.2  0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.   0.  -0.2  0.   0.2  0.   0. ]
 [ 0.   0.2  0.   0.   0.  -0.2  0.   0.   0.   0. ]
 [ 0.  -0.2  0.2  0.   0.   0.   0.   0.   0.   0. ]]
[[ 0.   0.   0.   0.   0.2  0.  -0.2  0.   0.   0. ]
 [ 0.   0.2 -0.2  0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.   0.  -0.2  0.   0.2  0.   0. ]
 [ 0.   0.2  0.   0.   0.  -0.2  0.   0.   0.   0. ]
 [ 0.  -0.2  0.2  0.   0.   0.   0.   0.   0.   0. ]]

The existing optimization for the 1-hot case probably knows about the CrossentropyCategorical1Hot Op. categorical_crossentropy with a matrix for the target distribution does not instantiate a specific Op, but directly returns -tensor.sum(true_dist * tensor.log(coding_dist), axis=coding_dist.ndim-1). As @nouiz said, just stabilizing log(softmax) would be the way to go then.

convert log(softmax(x)) to logsoftmax(x). This op would compute the stabilized version of it.

Why does it need a new Op for this? Wouldn't it be enough to directly replace log(softmax) with the stabilized expression?

@nouiz
Member

nouiz commented Jul 21, 2015

cuDNN has a logsoftmax; I think it would be faster to use it, and having the full graph in place would make introducing it harder.

Then, if cuDNN isn't available, we can do the full graph on the GPU.

@jsusskin
Author

Good find @lamblin -- as a motivation for why this is important to address, consider the case where the target labels are probabilities and not one-hot-coded labels. In that case you need to represent the targets per observation as a vector summing to 1.
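
For instance, reusing the stable log-softmax from the snippets above with soft targets (each row of y a probability distribution rather than a one-hot vector); this is just an illustrative sketch following the same pattern as the earlier examples:

import theano
import theano.tensor as T
import numpy as np

x, y = T.matrices('x', 'y')

# numerically stable log-softmax, as above
xdev = x - x.max(1, keepdims=True)
lsm = xdev - T.log(T.sum(T.exp(xdev), axis=1, keepdims=True))
# crossentropy against a full target distribution
cost = -T.sum(y * lsm, axis=1)
grad = T.grad(cost.mean(), x)

f = theano.function([x, y], [cost, grad])

# large softmax inputs, as in the examples above
a = np.exp(10 * np.random.rand(5, 10).astype(theano.config.floatX))
# soft targets: each row sums to 1 but is not one-hot
b = np.random.rand(5, 10).astype(theano.config.floatX)
b /= b.sum(axis=1, keepdims=True)

c, g = f(a, b)
print np.isfinite(g).all()  # gradients stay finite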

@nouiz nouiz added the CCW label Jul 27, 2015
@manger719

The problem goes away if you explicitly write out the softmax function instead of using Theano's:

import theano
import theano.tensor as T
import numpy as np

x,y=T.matrices('xy')

# regular softmax and crossentropy
sm = T.exp(x)/(T.exp(x).sum(1,keepdims=True))
cm1=T.nnet.categorical_crossentropy(sm,y)
g1 = T.grad(cm1.mean(),x)

# numerically stable log-softmax with crossentropy
xdev = x-x.max(1,keepdims=True)
lsm = xdev - T.log(T.sum(T.exp(xdev),axis=1,keepdims=True))
sm2 = T.exp(lsm) # just used to show equivalence with sm
cm2=-T.sum(y*lsm,axis=1)
g2 = T.grad(cm2.mean(),x)


# create some large inputs into a softmax
a=np.exp(10*np.random.rand(5,10).astype(theano.config.floatX))
# create some one-hot coded labels
b=np.eye(5,10).astype(theano.config.floatX)

# show equivalence of softmax and exponentiated numerically stable log-softmax
f1=theano.function([x],[sm, sm2])
sm1,sm2=f1(a)
print np.allclose(sm1,sm2)

# now show that the two versions result in the same crossentropy cost
# this indicates that the forward function does provide some numerical stability
f2=theano.function([x,y],[cm1,cm2])
c1,c2 = f2(a,b)
print np.allclose(c1,c2)

# now, show that in the standard softmax case the gradients blow up 
# while in the log-softmax case they don't
f3=theano.function([x,y],[g1,g2])
g1_,g2_ = f3(a,b)
print g1_
print g2_

produces:

Using gpu device 0: GeForce GT 750M (CNMeM is disabled)
True
True
[[-0.2  0.   0.   0.   0.2  0.   0.   0.   0.   0. ]
 [ 0.  -0.2  0.   0.   0.   0.   0.   0.   0.   0.2]
 [ 0.   0.  -0.2  0.   0.   0.   0.   0.2  0.   0. ]
 [ 0.   0.   0.  -0.2  0.   0.   0.2  0.   0.   0. ]
 [ 0.   0.   0.   0.  -0.2  0.   0.   0.   0.   0.2]]
[[-0.2  0.   0.   0.   0.2  0.   0.   0.   0.   0. ]
 [ 0.  -0.2  0.   0.   0.   0.   0.   0.   0.   0.2]
 [ 0.   0.  -0.2  0.   0.   0.   0.   0.2  0.   0. ]
 [ 0.   0.   0.  -0.2  0.   0.   0.2  0.   0.   0. ]
 [ 0.   0.   0.   0.  -0.2  0.   0.   0.   0.   0.2]]

@capybaralet

@lamblin
Is it still the case that one-hot targets aren't stabilized?

@lamblin
Member

lamblin commented Oct 31, 2016

This issue is closed, and so is the corresponding one in Lasagne, so it should be stabilized now.
If you have code that suggests it is not, you can open another issue.
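
A quick sanity check one could run (a hypothetical snippet reusing the setup from the first post, not an official test): compile the plain softmax + categorical_crossentropy gradient with one-hot targets and confirm it stays finite for saturating inputs.

import theano
import theano.tensor as T
import numpy as np

x, y = T.matrices('x', 'y')
sm = T.nnet.softmax(x)
cost = T.nnet.categorical_crossentropy(sm, y)
grad = T.grad(cost.mean(), x)
f = theano.function([x, y], grad)

a = np.exp(10 * np.random.rand(5, 10).astype(theano.config.floatX))
b = np.eye(5, 10).astype(theano.config.floatX)
# True if the stabilization kicks in; nan entries would make this False
print np.isfinite(f(a, b)).all()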

@f0k
Contributor

f0k commented Oct 31, 2016

Is it still the case that one-hot targets aren't stabilized?

In #4451 (comment), my former self indicated that it's resolved.

@jiangnanhugo

Could Theano check that the input to binary_crossentropy is bounded? I'm getting NaN after several hours of training.

@nouiz
Member

nouiz commented Oct 2, 2017

We have a special mode to help detect nans: http://deeplearning.net/software/theano/tutorial/nan_tutorial.html#run-in-nanguardmode-debugmode-or-monitormode

Read the full page; it contains a lot of good information related to that.
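
For example, a minimal way to compile a function under NanGuardMode, which raises an error as soon as a nan, inf, or very large value flows through the graph (the graph here is just a placeholder):

import theano
import theano.tensor as T
from theano.compile.nanguardmode import NanGuardMode

x = T.matrix('x')
y = T.nnet.sigmoid(x)

# any function can be compiled in this mode; it errors out at the first bad value
f = theano.function(
    [x], y,
    mode=NanGuardMode(nan_is_error=True, inf_is_error=True, big_is_error=True))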
