Leaky rectifier nonlinearity #163

Merged
merged 3 commits into Lasagne:master on Mar 17, 2015

Conversation

f0k
Member

@f0k f0k commented Mar 11, 2015

This adds the leaky rectifier nonlinearity with custom leakiness. Example use:

input_var = theano.tensor.matrix('x')
layer = InputLayer((10, 20), input_var)
layer = DenseLayer(layer, 30, nonlinearity=lasagne.nonlinearities.leaky_rectify(0.05))
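
For reference, the leaky rectifier computes f(x) = x for x >= 0 and f(x) = leakiness * x for x < 0, which this PR currently implements as T.maximum(leakiness * x, x) (valid for leakiness < 1).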

Note that it's a callable class, not a function. We can discuss whether to name it leaky_rectify to blend in with the others or LeakyRectify to set it apart. Thinking about it, I'm actually leaning towards the latter.

@ebenolson
Member

Would it be better to implement it analogously to relu? I.e.:

return (x + (1 - self.leakiness) * abs(x)) / (2 - self.leakiness)

@f0k
Member Author

f0k commented Mar 11, 2015

would it be better to implement analogously to relu?

Hmm, I'm not sure, but thank you for doing the thinking :) A clear advantage of that formulation is that it works both for Theano expressions and numpy arrays. The disadvantage is that it's harder to read (but that can be documented) and that it leads to a more complex graph. Performance-wise I don't see any difference:

In [1]: import theano
Using gpu device 0: GeForce GT 640

In [2]: T = theano.tensor

In [3]: x = T.matrix('x')

In [5]: relu1 = T.maximum(0, x)

In [6]: relu2 = (x + abs(x)) / 2.0

In [7]: relu3 = 0.5 * (x + abs(x))  # could be faster than division

In [8]: lrelu1 = T.maximum(0.1*x, x)

In [9]: lrelu2 = (x + (1 - 0.1) * abs(x)) / (2 - 0.1)

In [10]: fn_relu1 = theano.function([x], relu1)

In [11]: fn_relu2 = theano.function([x], relu2)

In [12]: fn_relu3 = theano.function([x], relu3)

In [13]: fn_lrelu1 = theano.function([x], lrelu1)

In [14]: fn_lrelu2 = theano.function([x], lrelu2)

In [15]: x = np.random.randn(10000, 1000)

In [16]: %timeit fn_relu1(x)
10 loops, best of 3: 19.9 ms per loop

In [17]: %timeit fn_relu1(x)
10 loops, best of 3: 19.8 ms per loop

In [18]: %timeit fn_relu2(x)
10 loops, best of 3: 20 ms per loop

In [19]: %timeit fn_relu2(x)
10 loops, best of 3: 19.9 ms per loop

In [20]: %timeit fn_relu3(x)
10 loops, best of 3: 20.1 ms per loop

In [21]: %timeit fn_relu3(x)
10 loops, best of 3: 20.2 ms per loop

In [22]: %timeit fn_lrelu1(x)
10 loops, best of 3: 20.2 ms per loop

In [23]: %timeit fn_lrelu1(x)
10 loops, best of 3: 20.2 ms per loop

In [24]: %timeit fn_lrelu2(x)
10 loops, best of 3: 20.2 ms per loop

In [25]: %timeit fn_lrelu2(x)
10 loops, best of 3: 20.3 ms per loop

Looking at @SnippyHolloW's code, the original comparison was between T.switch and the abs trick, not between T.maximum and the abs trick. And his "quick benchmark" with the %timeit just compares the time needed to build the expression, not to execute it (but he makes a remark on T.switch being 2.5 times slower when used to train a network).

So maybe the question should be: Shouldn't we just use T.maximum(0, x) for the regular rectifier instead?

We can discuss whether to name it leaky_rectify to blend in with the others or LeakyRectify to set it apart.

Regarding this, what about naming it LeakyRectify and providing a shortcut leaky_rectify = LeakyRectify(0.01) using the default leakiness from the original paper?
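
A minimal sketch of what that could look like (just to illustrate the proposed naming scheme; not necessarily the exact PR code):

import theano.tensor as T

class LeakyRectify(object):
    """Leaky rectifier nonlinearity with configurable leakiness (sketch)."""
    def __init__(self, leakiness=0.01):
        self.leakiness = leakiness

    def __call__(self, x):
        # straightforward formulation; the PR later switches to an abs()-based one
        return T.maximum(self.leakiness * x, x)

# module-level shortcut using the default leakiness from the original paper
leaky_rectify = LeakyRectify(0.01)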

/edit: By the way, Travis failed because MNIST couldn't be downloaded from deeplearning.net, which is down.

@benanne
Member

benanne commented Mar 11, 2015

We should definitely add leaky rectification, but I'm a little concerned about the interface: leaky_rectify is now a callable which returns a callable, which is inconsistent with the other nonlinearities defined in the module. Maybe we should distinguish these two types of functions.

Another approach is to define all nonlinearities as first-order functions, possibly with multiple arguments, and then encourage users to define lambdas (or functions) to set all arguments but the first to the desired values if they don't want to use the default, e.g.:

l1 = DenseLayer(..., nonlinearity=lambda x: lasagne.nonlinearities.leaky_rectify(x, leakiness=0.2))

This is of course more verbose.

Regarding the implementation, I actually did profile T.maximum versus the current implementation for rectify and found the latter to be a bit faster. Not a whole lot, but the improvement was pretty significant for such a small change. You wouldn't expect the rectification implementation to matter a whole lot since it's such a small part of the computation, but the difference was definitely noticeable. I guess it all depends on the network size and other hyperparameters.

@f0k
Member Author

f0k commented Mar 11, 2015

Maybe we should distinguish these two types of functions.

Yes, that's why I'd change it to CamelCase. Wouldn't that be enough of a distinction?

This is of course more verbose.

And you couldn't pickle it any more.
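
To illustrate (a standalone sketch, not from the PR): a module-level callable class instance pickles fine, while a lambda does not:

import pickle

class LeakyRectify(object):
    # toy version just for this pickling demo
    def __init__(self, leakiness=0.01):
        self.leakiness = leakiness
    def __call__(self, x):
        return max(self.leakiness * x, x)

nonlin_class = LeakyRectify(0.2)
nonlin_lambda = lambda x: max(0.2 * x, x)

pickle.dumps(nonlin_class)   # works: pickled by class reference plus instance dict
try:
    pickle.dumps(nonlin_lambda)
except (pickle.PicklingError, AttributeError):  # PicklingError on CPython
    print("lambda is not picklable")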

@benanne
Member

benanne commented Mar 11, 2015

Yeah, CamelCase would work I guess. Good point about the pickling.

Looks like Travis is failing because deeplearning.net is down (can't download MNIST), I will restart it when it's back up.

@f0k
Member Author

f0k commented Mar 11, 2015

Looks like Travis is failing because deeplearning.net is down (can't download MNIST), I will restart it when it's back up.

Yep, I said so above ;) No need to restart, I'll switch to CamelCase tomorrow and then it has to run again anyway.

Remaining open question from above:

what about naming it LeakyRectify and providing a shortcut leaky_rectify = LeakyRectify(0.01) using the default leakiness from the original paper?

@f0k
Member Author

f0k commented Mar 12, 2015

Renamed to CamelCase and added a lowercase shortcut for the default leakiness. Travis passes.

@benanne
Member

benanne commented Mar 15, 2015

Looks good, will merge it tomorrow if there are no further comments!

@f0k
Member Author

f0k commented Mar 16, 2015

return (x + (1 - self.leakiness) * abs(x)) / (2 - self.leakiness)

To be sure, I did a test and profiled it (as part of a CNN forward pass):

0.617s       8.40e-04s     C      735        3   GpuElemwise{Composite{maximum((i0 * (i1 + i2)), (i1 + i2))}}[(0, 1)]

compared to:

0.579s       7.88e-04s     C      735        3   GpuElemwise{Composite{(i0 * ((i1 + i2) + (i3 * Abs((i1 + i2)))))}}[(0, 1)]

So @ebenolson's formulation is indeed a tiny bit faster for the forward pass, and consistently so.

I ran my four variations from the test above again, this time with Theano's profiling mode:

   0.7%    98.3%       0.137s       1.37e-03s     C      100        1   GpuElemwise{Composite{maximum((i0 * i1), i1)}}[(0, 1)]
   0.6%    98.9%       0.131s       1.31e-03s     C      100        1   GpuElemwise{Maximum}[(0, 1)]
   0.5%    99.5%       0.111s       1.11e-03s     C      100        1   GpuElemwise{Composite{(i0 * (i1 + (i2 * Abs(i1))))}}[(0, 1)]
   0.5%   100.0%       0.110s       1.10e-03s     C      100        1   GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))}}[(0, 1)]

So yes, the more complicated formulation is indeed faster. My previous test was flawed because it included the transfer costs.

The backward pass is troubling me a bit, though... I'll report back.

@f0k
Member Author

f0k commented Mar 16, 2015

Backward pass graphs (before optimization):

======= relu1: =======
Elemwise{mul} [@A] 'grelu1'   
 |Elemwise{EQ} [@B] ''   
 | |Elemwise{maximum} [@C] ''   
 | | |DimShuffle{x,x} [@D] ''   
 | | | |TensorConstant{0} [@E]
 | | |x [@F]
 | |x [@F]
 |out [@G]

======= relu2: =======
Elemwise{add,no_inplace} [@A] 'grelu2'   
 |Elemwise{true_div} [@B] ''   
 | |out [@C]
 | |DimShuffle{x,x} [@D] ''   
 |   |TensorConstant{2.0} [@E]
 |Elemwise{true_div} [@F] ''   
   |Elemwise{mul} [@G] ''   
   | |Elemwise{true_div} [@B] ''   
   | |x [@H]
   |Elemwise{Abs} [@I] ''   
     |x [@H]

======= leaky relu1: =======
Elemwise{add,no_inplace} [@A] 'glrelu1'   
 |Elemwise{mul} [@B] ''   
 | |Elemwise{mul} [@C] ''   
 | | |Elemwise{EQ} [@D] ''   
 | | | |Elemwise{maximum} [@E] ''   
 | | | | |Elemwise{mul,no_inplace} [@F] ''   
 | | | | | |DimShuffle{x,x} [@G] ''   
 | | | | | | |TensorConstant{0.10000000149} [@H]
 | | | | | |x [@I]
 | | | | |x [@I]
 | | | |Elemwise{mul,no_inplace} [@F] ''   
 | | |out [@J]
 | |DimShuffle{x,x} [@G] ''   
 |Elemwise{mul} [@K] ''   
   |Elemwise{EQ} [@L] ''   
   | |Elemwise{maximum} [@M] ''   
   | | |Elemwise{mul,no_inplace} [@F] ''   
   | | |x [@I]
   | |x [@I]
   |out [@J]

======= leaky relu2: =======
Elemwise{add,no_inplace} [@A] 'glrelu2'   
 |Elemwise{true_div} [@B] ''   
 | |out [@C]
 | |DimShuffle{x,x} [@D] ''   
 |   |TensorConstant{1.89999997616} [@E]
 |Elemwise{true_div} [@F] ''   
   |Elemwise{mul} [@G] ''   
   | |Elemwise{mul} [@H] ''   
   | | |Elemwise{true_div} [@B] ''   
   | | |DimShuffle{x,x} [@I] ''   
   | |   |TensorConstant{0.899999976158} [@J]
   | |x [@K]
   |Elemwise{Abs} [@L] ''   
     |x [@K]

The corresponding profiles:

======= relu1: =======
   1.3%    97.7%       0.075s       7.46e-04s     C      100        1   GpuElemwise{maximum,no_inplace}
   1.2%    98.9%       0.066s       6.58e-04s     C      100        1   GpuElemwise{Composite{Cast{float32}(EQ(i0, i1))}}[(0, 0)]
   1.1%   100.0%       0.065s       6.49e-04s     C      100        1   GpuElemwise{Mul}[(0, 0)]

======= relu2: =======
   1.4%   100.0%       0.074s       7.39e-04s     C      100        1   GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 1)]

======= leaky relu1: =======
   1.5%    95.3%       0.085s       8.45e-04s     C      100        1   GpuElemwise{maximum,no_inplace}
   1.4%    96.7%       0.078s       7.76e-04s     C      100        1   GpuElemwise{Composite{((i0 * i1 * i2) + (i3 * i2))}}[(0, 1)]
   1.2%    97.9%       0.068s       6.78e-04s     C      100        1   GpuElemwise{Composite{Cast{float32}(EQ(i0, i1))}}[(0, 0)]
   1.2%    99.0%       0.066s       6.59e-04s     C      100        1   GpuElemwise{Composite{Cast{float32}(EQ(i0, i1))}}[(0, 1)]
   1.0%   100.0%       0.056s       5.56e-04s     C      100        1   GpuElemwise{mul,no_inplace}

======= leaky relu2: =======
   1.4%   100.0%       0.075s       7.49e-04s     C      100        1   GpuElemwise{Composite{((i0 * i1) + (i2 * i1 * sgn(i3)))}}[(0, 1)]

So the backward passes are also faster for the more complicated formulation: as the profiles show, the abs()-based gradient compiles into a single Elemwise composite (one kernel), while the maximum-based one is split across several kernels. I'll change this PR accordingly.

@f0k
Member Author

f0k commented Mar 16, 2015

@ebenolson: Your formula was wrong. Did you try it? A correct one is:

((1 + self.leakiness) * x + (1 - self.leakiness) * abs(x)) / 2
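
(Check: with a = self.leakiness, for x > 0 this gives ((1 + a)*x + (1 - a)*x) / 2 = x, and for x < 0 it gives ((1 + a)*x - (1 - a)*x) / 2 = a*x, as intended.)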

The timing results still hold; it's a lot faster especially for the backward pass. I've updated the PR (in a slightly more efficient formulation).

@ebenolson
Member

@ebenolson: Your formula was wrong. Did you try it?

Nope, that's embarrassing. I've been using it for PReLU and never noticed; I suppose the error just got absorbed into the parameter.

+1 for test driven development.
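
For example, a quick numpy check along these lines (a sketch, not part of the PR's test suite) would have caught the wrong slope:

import numpy as np

def leaky_rectify_abs(x, leakiness=0.1):
    # abs()-based formulation with the corrected coefficients
    return ((1 + leakiness) * x + (1 - leakiness) * np.abs(x)) / 2

x = np.random.randn(1000).astype(np.float32)
expected = np.where(x > 0, x, 0.1 * x)
assert np.allclose(leaky_rectify_abs(x, 0.1), expected)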

@f0k
Member Author

f0k commented Mar 17, 2015

@benanne: Travis passes, ready to merge this time!

benanne added a commit that referenced this pull request Mar 17, 2015
Leaky rectifier nonlinearity
@benanne benanne merged commit 93250bb into Lasagne:master Mar 17, 2015
@benanne
Member

benanne commented Mar 17, 2015

Done :)

@f0k f0k deleted the leaky-relu branch March 17, 2015 18:30
@r2007

r2007 commented Mar 24, 2015

Considering that the gradient is always needed in NN training, I added the gradient computation to the test, plus a new method I found after googling.

(MacBook Air without GPU)

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
d = {
        "T.maximum(0, x)": lambda x: T.maximum(0, x),
        "0.5 * (x + abs(x))": lambda x: 0.5 * (x + abs(x)),
        "x * (x > 0)": lambda x: x * (x > 0),
        "T.switch(x<0, 0, x)": lambda x: T.switch(x<0, 0, x)
      }
z = np.random.randn(10000, 1000)
for name, f in d.iteritems():
    cost = theano.function([x], f(x))
    grad = theano.function([x], theano.grad(f(x).sum(), x))
    cost_grad = theano.function([x], [f(x), theano.grad(f(x).sum(), x)])
    print(name)
    %timeit cost(z)
    %timeit grad(z)
    %timeit cost_grad(z)
    c, g = cost_grad(z)
    assert np.all(c == np.where(z > 0, z, 0))
    assert np.all(g == (z > 0))

output:

T.switch(x<0, 0, x)
10 loops, best of 3: 30.3 ms per loop
10 loops, best of 3: 80.1 ms per loop
10 loops, best of 3: 152 ms per loop
T.maximum(0, x)
10 loops, best of 3: 74.3 ms per loop
10 loops, best of 3: 81.1 ms per loop
10 loops, best of 3: 120 ms per loop
0.5 * (x + abs(x))
10 loops, best of 3: 29.5 ms per loop
10 loops, best of 3: 86.2 ms per loop
10 loops, best of 3: 117 ms per loop
x * (x > 0)
10 loops, best of 3: 28.8 ms per loop
10 loops, best of 3: 29 ms per loop
10 loops, best of 3: 71.5 ms per loop

@benanne
Member

benanne commented Mar 24, 2015

Here's the output of the same code running on a GTX 980:

T.switch(x<0, 0, x)
100 loops, best of 3: 17.8 ms per loop
100 loops, best of 3: 16.5 ms per loop
10 loops, best of 3: 82 ms per loop
T.maximum(0, x)
100 loops, best of 3: 16.5 ms per loop
100 loops, best of 3: 17.5 ms per loop
10 loops, best of 3: 26.1 ms per loop
0.5 * (x + abs(x))
100 loops, best of 3: 16.4 ms per loop
100 loops, best of 3: 16.4 ms per loop
10 loops, best of 3: 25.6 ms per loop
x * (x > 0)
100 loops, best of 3: 17.3 ms per loop
100 loops, best of 3: 16.8 ms per loop
1 loops, best of 3: 40.3 ms per loop

Our existing implementation (0.5 * (x + abs(x))) seems to have the edge here, although the difference with T.maximum(0, x) is almost negligible.

@r2007

r2007 commented Mar 24, 2015

So, GPU first ;)

@r2007

r2007 commented Mar 24, 2015

@benanne could you test the following on the GTX 980?
I'm interested in the performance of the two leaky rectifier formulations on the GPU.
Thanks.

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
d = {
        "T.maximum(0, x)": lambda x: T.maximum(0, x),
        "0.5 * (x + abs(x))": lambda x: 0.5 * (x + abs(x)),
        "x * (x > 0)": lambda x: x * (x > 0),
        "T.switch(x<0, 0, x)": lambda x: T.switch(x<0, 0, x),
        "f1 * x + f2 * abs(x)": lambda x: 0.5 * 1.1 * x + 0.5 * 0.9 * abs(x),
        "T.maximum(0.1*x, x)": lambda x: T.maximum(0.1*x, x)
      }
z = np.random.randn(10000, 1000)
for name, f in d.iteritems():
    cost = theano.function([x], f(x))
    grad = theano.function([x], theano.grad(f(x).sum(), x))
    cost_grad = theano.function([x], [f(x), theano.grad(f(x).sum(), x)])
    print(name)
    %timeit cost(z)
    %timeit grad(z)
    %timeit cost_grad(z)
    c, g = cost_grad(z)
    #assert np.all(c == np.where(z > 0, z, 0))
    #assert np.all(g == (z > 0))

@benanne
Member

benanne commented Mar 24, 2015

Alright then, just this once ;) but I don't plan to make a habit out of it!

f1 * x + f2 * abs(x)
100 loops, best of 3: 16.4 ms per loop
100 loops, best of 3: 16.4 ms per loop
10 loops, best of 3: 25.7 ms per loop
T.maximum(0, x)
100 loops, best of 3: 16.6 ms per loop
100 loops, best of 3: 17.4 ms per loop
10 loops, best of 3: 25.5 ms per loop
T.maximum(0.1*x, x)
100 loops, best of 3: 16.7 ms per loop
100 loops, best of 3: 18.4 ms per loop
10 loops, best of 3: 27 ms per loop
x * (x > 0)
100 loops, best of 3: 17.3 ms per loop
100 loops, best of 3: 16.5 ms per loop
10 loops, best of 3: 39.4 ms per loop
T.switch(x<0, 0, x)
100 loops, best of 3: 17.6 ms per loop
100 loops, best of 3: 16.5 ms per loop
10 loops, best of 3: 79.5 ms per loop
0.5 * (x + abs(x))
100 loops, best of 3: 16.4 ms per loop
100 loops, best of 3: 16.5 ms per loop
10 loops, best of 3: 25.1 ms per loop

@f0k
Member Author

f0k commented Mar 24, 2015

Considering that the gradient is always needed in NN training, I added the gradient computation to the test, plus a new method I found after googling.

Sorry to interrupt the happy benchmarking party, but as I noted above, this way of testing is flawed for the GPU because it includes the transfer costs. Also, your gradient benchmark does more than just compute the gradient.

My setup was like this:

import theano
T = theano.tensor
x = T.matrix('x')
# define forward expression
# (comment in the expression you want to time)
#relu = T.maximum(0, x)
relu = 0.5 * (x + abs(x))
# define backward expression
out = T.matrix('out')
grelu = theano.grad(None, x, known_grads={relu: out})
# compile backward expression
fgrelu = theano.function([x, out], grelu)
# optional: print graph
if False:
    from theano.printing import debugprint
    print "Original graph:"
    debugprint(grelu)
    print "Optimized graph:"
    debugprint(fgrelu)
# run
print "Running benchmark."
import numpy as np
inp = np.random.randn(10000, 1000).astype(np.float32)
outp = np.random.randn(10000, 1000).astype(np.float32)
for _ in range(100):
    fgrelu(inp, outp)

Run this with CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=device=gpu,floatX=float32,profile=1 ./name_of_the_file.py to get a profile, and add up everything but the HostToGpu. Alternatively, read the Theano documentation to learn how to get both the graph and the inputs onto the GPU so there are neither transfers nor copy operations in the compiled functions at all.
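
For example, one way to keep both the graph and the data on the GPU is to put everything into shared variables, so the compiled function neither reads nor writes host memory (a sketch, not the exact setup I used):

import numpy as np
import theano

# with device=gpu and floatX=float32, these shared variables live on the GPU
x = theano.shared(np.random.randn(10000, 1000).astype(np.float32), name='x')
out = theano.shared(np.random.randn(10000, 1000).astype(np.float32), name='out')
result = theano.shared(np.zeros((10000, 1000), dtype=np.float32), name='result')

relu = 0.5 * (x + abs(x))
grelu = theano.grad(None, x, known_grads={relu: out})

# write the gradient into another shared variable instead of returning it,
# so the compiled graph contains no host<->device transfers
fgrelu = theano.function([], [], updates=[(result, grelu)])

for _ in range(100):
    fgrelu()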

Our 0.5 * (x + abs(x)) formulation is a lot faster than anything else for the backward pass because it gets compiled into a single Elemwise{Composite{...}} expression and hence a single kernel launch, while all the others get compiled into multiple separate kernel launches.
/edit: On CPU, things may well look different, of course, because there's no such thing as kernel launch overhead. But we're focusing on GPU performance here. On a side note, these kinds of tweaks should actually be done by Theano, but that's close to impossible (it's an easy optimization for the forward pass, but would be very complex for the backward pass).
