
Softmax+crossentropy can lead to nan in gradient for unbounded activations #3162

Closed
jsusskin opened this issue Jul 17, 2015 · 14 comments · Fixed by #3363

@jsusskin

When using unbounded activation functions (e.g. ReLU), the softmax function can saturate. This can lead to nan gradients when paired with the categorical crossentropy cost.
If the softmax function is replaced with a numerically stable version of log-softmax and this is used directly in the cost function, the gradients don't blow up.
It seems that this could be implemented as a pattern to recognize (softmax paired with categorical crossentropy).
Here's a code snippet that contrasts the regular softmax with the numerically stable log-softmax: the former gives nans in the gradient, while the latter does not blow up. Interestingly, the experiment indicates that with the regular softmax the crossentropy loss itself comes out numerically stable, but the gradient does not.

import theano
import theano.tensor as T
import numpy as np

x,y=T.matrices('xy')

# regular softmax and crossentropy
sm = T.nnet.softmax(x)
cm1=T.nnet.categorical_crossentropy(sm,y)
g1 = T.grad(cm1.mean(),x)

# numerically stable log-softmax with crossentropy
xdev = x-x.max(1,keepdims=True)
lsm = xdev - T.log(T.sum(T.exp(xdev),axis=1,keepdims=True))
sm2 = T.exp(lsm) # just used to show equivalence with sm
cm2=-T.sum(y*lsm,axis=1)
g2 = T.grad(cm2.mean(),x)


# create some large inputs into a softmax
a=np.exp(10*np.random.rand(5,10).astype(theano.config.floatX))
# create some one-hot coded labels
b=np.eye(5,10).astype(theano.config.floatX)

# show equivalence of softmax and exponentiated numerically stable log-softmax
f1=theano.function([x],[sm, sm2])
sm1,sm2=f1(a)
print np.allclose(sm1,sm2)

# now show that the two versions result in the same crossentropy cost
# this indicates that the forward function does provide some numerical stability
f2=theano.function([x,y],[cm1,cm2])
c1,c2 = f2(a,b)
print np.allclose(c1,c2)

# now, show that in the standard softmax case the gradients blow up 
# while in the log-softmax case they don't
f3=theano.function([x,y],[g1,g2])
g1_,g2_ = f3(a,b)
print g1_
print g2_

OUTPUT:
True
True
[[ nan nan nan nan nan nan nan nan nan nan]
[ nan nan nan nan nan nan nan nan nan nan]
[ nan nan nan nan nan nan nan nan nan nan]
[ nan nan nan nan nan nan nan nan nan nan]
[ nan nan nan nan nan nan nan nan nan nan]]
[[-0.2  0.2  0.   0.   0.   0.   0.   0.   0.   0. ]
 [ 0.  -0.2  0.   0.2  0.   0.   0.   0.   0.   0. ]
 [ 0.   0.  -0.2  0.   0.   0.   0.2  0.   0.   0. ]
 [ 0.   0.   0.  -0.2  0.   0.   0.2  0.   0.   0. ]
 [ 0.   0.   0.   0.  -0.2  0.2  0.   0.   0.   0. ]]

@f0k
Contributor

f0k commented Jul 20, 2015

Another Lasagne user hit the same problem and was able to solve it with the same fix: Lasagne/Lasagne#333 (comment)
It would be great to solve this via a graph optimizer in Theano. I could take a stab at it given some pointers in the right direction, but would be happy for somebody else to do it.

@benanne
Contributor

benanne commented Jul 20, 2015

It seems that this could be implemented as a pattern to recognize (softmax paired with categorical crossentropy).

It surprises me that this isn't the case yet. I thought it was. Maybe the optimization exists, but we're doing something different and it isn't being triggered?

@f0k
Contributor

f0k commented Jul 20, 2015

Maybe the optimization exists, but we're doing something different and it isn't being triggered?

No, @jsusskin's example above just uses vanilla Theano. It's not because of Lasagne.

@nouiz
Member

nouiz commented Jul 20, 2015

I had some discussion with @lamblin about this. We do cover some cases like this, but with more complicated graphs that handle both stabilization and speed optimization, and those don't cover all cases of stability optimization.

We think what needs to be done is an optimization, registered between the stabilize and specialize optimizations, that would convert log(softmax(x)) to logsoftmax(x). This op would compute the stabilized version of it.

Could you make such an op? I can take care of the optimization (registering it in the right place). Done like this, we keep our combined stability/speed optimizations and cover the missing stability case.
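
For reference, here is a rough sketch (not the actual fix) of how such a local optimizer could look if it simply substituted the stabilized expression for log(softmax(x)); the optimizer name and the exact matching/registration details are assumptions for illustration, and the real change may instead introduce a dedicated LogSoftmax Op as proposed above:

import theano.tensor as T
from theano import gof
from theano.tensor.nnet.nnet import Softmax

@gof.local_optimizer([T.log])
def local_stabilize_log_softmax(node):
    # match log(softmax(x)) and substitute the stabilized expression
    if node.op == T.log and node.inputs[0].owner is not None \
            and isinstance(node.inputs[0].owner.op, Softmax):
        x = node.inputs[0].owner.inputs[0]
        xdev = x - x.max(1, keepdims=True)
        return [xdev - T.log(T.sum(T.exp(xdev), axis=1, keepdims=True))]
    return False

# it would then be registered in the stabilization phase, e.g. via
# theano.tensor.opt.register_stabilize(local_stabilize_log_softmax)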

@lamblin
Member

lamblin commented Jul 21, 2015

There is an optimization that stabilizes the expression, but I think it gets triggered only if the target y is expressed as a vector of indices, not as a matrix of one-hot vectors.
Adding new optimizations for that case would make a lot of sense.

@f0k
Contributor

f0k commented Jul 21, 2015

There is an optimization that stabilizes the expression, but I think it gets triggered only if the target y is expressed as a vector of indices, not as a matrix of one-hot vectors.

You're right. If I change the example to use an integer vector for y, it is stable:

import theano
import theano.tensor as T
import numpy as np

x=T.matrix('x')
y=T.ivector('y')

# regular softmax and crossentropy
sm = T.nnet.softmax(x)
cm1=T.nnet.categorical_crossentropy(sm,y)
g1 = T.grad(cm1.mean(),x)

# numerically stable log-softmax with crossentropy
xdev = x-x.max(1,keepdims=True)
lsm = xdev - T.log(T.sum(T.exp(xdev),axis=1,keepdims=True))
sm2 = T.exp(lsm) # just used to show equivalence with sm
cm2=-lsm[T.arange(y.shape[0]), y]
g2 = T.grad(cm2.mean(),x)


# create some large inputs into a softmax
a=np.exp(10*np.random.rand(5,10).astype(theano.config.floatX))
# create some integer class labels
b=np.random.randint(0,10,5).astype(np.uint8)

# show equivalence of softmax and exponentiated numerically stable log-softmax
f1=theano.function([x],[sm, sm2])
sm1,sm2=f1(a)
print np.allclose(sm1,sm2)

# now show that the two versions result in the same crossentropy cost
# this indicates that the forward function does provide some numerical stability
f2=theano.function([x,y],[cm1,cm2])
c1,c2 = f2(a,b)
print np.allclose(c1,c2)

# now, show that in the standard softmax case the gradients blow up 
# while in the log-softmax case they don't
f3=theano.function([x,y],[g1,g2])
g1_,g2_ = f3(a,b)
print g1_
print g2_

produces:

Using gpu device 0: GeForce GT 640
True
True
[[ 0.   0.   0.   0.   0.2  0.  -0.2  0.   0.   0. ]
 [ 0.   0.2 -0.2  0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.   0.  -0.2  0.   0.2  0.   0. ]
 [ 0.   0.2  0.   0.   0.  -0.2  0.   0.   0.   0. ]
 [ 0.  -0.2  0.2  0.   0.   0.   0.   0.   0.   0. ]]
[[ 0.   0.   0.   0.   0.2  0.  -0.2  0.   0.   0. ]
 [ 0.   0.2 -0.2  0.   0.   0.   0.   0.   0.   0. ]
 [ 0.   0.   0.   0.   0.  -0.2  0.   0.2  0.   0. ]
 [ 0.   0.2  0.   0.   0.  -0.2  0.   0.   0.   0. ]
 [ 0.  -0.2  0.2  0.   0.   0.   0.   0.   0.   0. ]]

The existing optimization for the 1-hot case probably knows about the CrossentropyCategorical1Hot Op. categorical_crossentropy with a matrix for the target distribution does not instantiate a specific Op, but directly returns -tensor.sum(true_dist * tensor.log(coding_dist), axis=coding_dist.ndim-1). As @nouiz said, just stabilizing log(softmax) would be the way to go then.

convert log(softmax(x)) to logsoftmax(x). This op would compute the stabilized version of it.

Why does it need a new Op for this? Wouldn't it be enough to directly replace log(softmax) with the stabilized expression?

@nouiz
Member

nouiz commented Jul 21, 2015

cuDNN has a logsoftmax; I think it would be faster to use it, and having the full graph in place would make introducing it harder.

Then, if cuDNN isn't available, we can do the full graph on the GPU.

@jsusskin
Author

Good find @lamblin -- as a motivation for why this is important to address, consider the case where the target labels are probabilities and not one-hot-coded labels. In that case you need to represent the targets per observation as a vector summing to 1.
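
For instance, reusing the stable log-softmax from the snippets above with soft targets (each row of y a probability distribution rather than a one-hot vector); this is just an illustrative sketch following the same pattern as the earlier examples:

import theano
import theano.tensor as T
import numpy as np

x, y = T.matrices('x', 'y')

# numerically stable log-softmax, as above
xdev = x - x.max(1, keepdims=True)
lsm = xdev - T.log(T.sum(T.exp(xdev), axis=1, keepdims=True))
# crossentropy against a full target distribution
cost = -T.sum(y * lsm, axis=1)
grad = T.grad(cost.mean(), x)

f = theano.function([x, y], [cost, grad])

# large softmax inputs, as in the examples above
a = np.exp(10 * np.random.rand(5, 10).astype(theano.config.floatX))
# soft targets: each row sums to 1 but is not one-hot
b = np.random.rand(5, 10).astype(theano.config.floatX)
b /= b.sum(axis=1, keepdims=True)

c, g = f(a, b)
print np.isfinite(g).all()  # gradients stay finite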

@nouiz nouiz added the CCW label Jul 27, 2015
@manger719

The problem goes away if you explicitly write out the softmax function instead of using Theano's:

import theano
import theano.tensor as T
import numpy as np

x,y=T.matrices('xy')

# regular softmax and crossentropy
sm = T.exp(x)/(T.exp(x).sum(1,keepdims=True))
cm1=T.nnet.categorical_crossentropy(sm,y)
g1 = T.grad(cm1.mean(),x)

# numerically stable log-softmax with crossentropy
xdev = x-x.max(1,keepdims=True)
lsm = xdev - T.log(T.sum(T.exp(xdev),axis=1,keepdims=True))
sm2 = T.exp(lsm) # just used to show equivalence with sm
cm2=-T.sum(y*lsm,axis=1)
g2 = T.grad(cm2.mean(),x)


# create some large inputs into a softmax
a=np.exp(10*np.random.rand(5,10).astype(theano.config.floatX))
# create some one-hot coded labels
b=np.eye(5,10).astype(theano.config.floatX)

# show equivalence of softmax and exponentiated numerically stable log-softmax
f1=theano.function([x],[sm, sm2])
sm1,sm2=f1(a)
print np.allclose(sm1,sm2)

# now show that the two versions result in the same crossentropy cost
# this indicates that the forward function does provide some numerical stability
f2=theano.function([x,y],[cm1,cm2])
c1,c2 = f2(a,b)
print np.allclose(c1,c2)

# now, show that in the standard softmax case the gradients blow up 
# while in the log-softmax case they don't
f3=theano.function([x,y],[g1,g2])
g1_,g2_ = f3(a,b)
print g1_
print g2_

produces:

Using gpu device 0: GeForce GT 750M (CNMeM is disabled)
True
True
[[-0.2  0.   0.   0.   0.2  0.   0.   0.   0.   0. ]
 [ 0.  -0.2  0.   0.   0.   0.   0.   0.   0.   0.2]
 [ 0.   0.  -0.2  0.   0.   0.   0.   0.2  0.   0. ]
 [ 0.   0.   0.  -0.2  0.   0.   0.2  0.   0.   0. ]
 [ 0.   0.   0.   0.  -0.2  0.   0.   0.   0.   0.2]]
[[-0.2  0.   0.   0.   0.2  0.   0.   0.   0.   0. ]
 [ 0.  -0.2  0.   0.   0.   0.   0.   0.   0.   0.2]
 [ 0.   0.  -0.2  0.   0.   0.   0.   0.2  0.   0. ]
 [ 0.   0.   0.  -0.2  0.   0.   0.2  0.   0.   0. ]
 [ 0.   0.   0.   0.  -0.2  0.   0.   0.   0.   0.2]]

@capybaralet

@lamblin
Is it still the case that one-hot targets aren't stabilized?

@lamblin
Member

lamblin commented Oct 31, 2016

This issue is closed, and so is the corresponding one in Lasagne, so it should be stabilized now.
If you have code that suggests it is not, you can open another issue.
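
A quick sanity check one could run (a hypothetical snippet reusing the setup from the first post, not an official test): compile the plain softmax + categorical_crossentropy gradient with one-hot targets and confirm it stays finite for saturating inputs.

import theano
import theano.tensor as T
import numpy as np

x, y = T.matrices('x', 'y')
sm = T.nnet.softmax(x)
cost = T.nnet.categorical_crossentropy(sm, y)
grad = T.grad(cost.mean(), x)
f = theano.function([x, y], grad)

a = np.exp(10 * np.random.rand(5, 10).astype(theano.config.floatX))
b = np.eye(5, 10).astype(theano.config.floatX)
# True if the stabilization kicks in; nan entries would make this False
print np.isfinite(f(a, b)).all()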

@f0k
Contributor

f0k commented Oct 31, 2016

Is it still the case that one-hot targets aren't stabilized?

In #4451 (comment), my former self indicated that it's resolved.

@jiangnanhugo

Could Theano check that the input to binary_crossentropy is bounded? I'm getting NaN after several hours of training.

@nouiz
Member

nouiz commented Oct 2, 2017

We have a special mode to help detect nans: http://deeplearning.net/software/theano/tutorial/nan_tutorial.html#run-in-nanguardmode-debugmode-or-monitormode

Read the full page; it contains a lot of good information related to that.
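
For example, a minimal way to compile a function under NanGuardMode, which raises an error as soon as a nan, inf, or very large value flows through the graph (the graph here is just a placeholder):

import theano
import theano.tensor as T
from theano.compile.nanguardmode import NanGuardMode

x = T.matrix('x')
y = T.nnet.sigmoid(x)

# any function can be compiled in this mode; it errors out at the first bad value
f = theano.function(
    [x], y,
    mode=NanGuardMode(nan_is_error=True, inf_is_error=True, big_is_error=True))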
