Leaky rectifier nonlinearity #163

Merged
merged 3 commits into Lasagne:master on Mar 17, 2015

Conversation

f0k
Member

@f0k f0k commented Mar 11, 2015

This adds the leaky rectifier nonlinearity with custom leakiness. Example use:

input_var = theano.tensor.matrix('x')
layer = InputLayer((10, 20), input_var)
layer = DenseLayer(layer, 30, nonlinearity=lasagne.nonlinearities.leaky_rectify(0.05))
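
For reference, the leaky rectifier computes f(x) = x for x >= 0 and f(x) = leakiness * x for x < 0, which this PR currently implements as T.maximum(leakiness * x, x) (valid for leakiness < 1).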

Note that it's a callable class, not a function. We can discuss whether to name it leaky_rectify to blend in with the others or LeakyRectify to set it apart. Thinking about it, I'm actually leaning towards the latter.

@ebenolson
Member

Would it be better to implement it analogously to relu? I.e.:

return (x + (1 - self.leakiness) * abs(x)) / (2 - self.leakiness)

@f0k
Member Author

f0k commented Mar 11, 2015

would it be better to implement analogously to relu?

Hmm, I'm not sure, but thank you for doing the thinking :) A clear advantage of that formulation is that it works both for Theano expressions and numpy arrays. The disadvantage is that it's harder to read (but that can be documented) and that it leads to a more complex graph. Performance-wise I don't see any difference:

In [1]: import theano
Using gpu device 0: GeForce GT 640

In [2]: T = theano.tensor

In [3]: x = T.matrix('x')

In [5]: relu1 = T.maximum(0, x)

In [6]: relu2 = (x + abs(x)) / 2.0

In [7]: relu3 = 0.5 * (x + abs(x))  # could be faster than division

In [8]: lrelu1 = T.maximum(0.1*x, x)

In [9]: lrelu2 = (x + (1 - 0.1) * abs(x)) / (2 - 0.1)

In [10]: fn_relu1 = theano.function([x], relu1)

In [11]: fn_relu2 = theano.function([x], relu2)

In [12]: fn_relu3 = theano.function([x], relu3)

In [13]: fn_lrelu1 = theano.function([x], lrelu1)

In [14]: fn_lrelu2 = theano.function([x], lrelu2)

In [15]: x = np.random.randn(10000, 1000)

In [16]: %timeit fn_relu1(x)
10 loops, best of 3: 19.9 ms per loop

In [17]: %timeit fn_relu1(x)
10 loops, best of 3: 19.8 ms per loop

In [18]: %timeit fn_relu2(x)
10 loops, best of 3: 20 ms per loop

In [19]: %timeit fn_relu2(x)
10 loops, best of 3: 19.9 ms per loop

In [20]: %timeit fn_relu3(x)
10 loops, best of 3: 20.1 ms per loop

In [21]: %timeit fn_relu3(x)
10 loops, best of 3: 20.2 ms per loop

In [22]: %timeit fn_lrelu1(x)
10 loops, best of 3: 20.2 ms per loop

In [23]: %timeit fn_lrelu1(x)
10 loops, best of 3: 20.2 ms per loop

In [24]: %timeit fn_lrelu2(x)
10 loops, best of 3: 20.2 ms per loop

In [25]: %timeit fn_lrelu2(x)
10 loops, best of 3: 20.3 ms per loop

Looking at @SnippyHolloW's code, the original comparison was between T.switch and the abs trick, not between T.maximum and the abs trick. And his "quick benchmark" with the %timeit just compares the time needed to build the expression, not to execute it (but he makes a remark on T.switch being 2.5 times slower when used to train a network).

So maybe the question should be: Shouldn't we just use T.maximum(0, x) for the regular rectifier instead?

We can discuss whether to name it leaky_rectify to blend in with the others or LeakyRectify to set it apart.

Regarding this, what about naming it LeakyRectify and providing a shortcut leaky_rectify = LeakyRectify(0.01) using the default leakiness from the original paper?
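
A minimal sketch of what that could look like (just to illustrate the proposed naming scheme; not necessarily the exact PR code):

import theano.tensor as T

class LeakyRectify(object):
    """Leaky rectifier nonlinearity with configurable leakiness (sketch)."""
    def __init__(self, leakiness=0.01):
        self.leakiness = leakiness

    def __call__(self, x):
        # straightforward formulation; the PR later switches to an abs()-based one
        return T.maximum(self.leakiness * x, x)

# module-level shortcut using the default leakiness from the original paper
leaky_rectify = LeakyRectify(0.01)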

/edit: By the way, Travis failed because MNIST couldn't be downloaded from deeplearning.net, which is down.

@benanne
Member

benanne commented Mar 11, 2015

We should definitely add leaky rectification, but I'm a little concerned about the interface: leaky_rectify is now a callable which returns a callable, which is inconsistent with the other nonlinearities defined in the module. Maybe we should distinguish these two types of functions.

Another approach is to define all nonlinearities as first-order functions, possibly with multiple arguments, and then encourage users to define lambdas (or functions) to set all arguments but the first to the desired values if they don't want to use the default, e.g.:

l1 = DenseLayer(..., nonlinearity=lambda x: lasagne.nonlinearities.leaky_rectify(x, leakiness=0.2))

This is of course more verbose.

Regarding the implementation, I actually did profile T.maximum versus the current implementation for rectify and found the latter to be a bit faster. Not a whole lot, but the improvement was pretty significant for such a small change. You wouldn't expect the rectification implementation to matter a whole lot since it's such a small part of the computation, but the difference was definitely noticeable. I guess it all depends on the network size and other hyperparameters.

@f0k
Member Author

f0k commented Mar 11, 2015

Maybe we should distinguish these two types of functions.

Yes, that's why I'd change it to CamelCase. Wouldn't that be enough of a distinction?

This is of course more verbose.

And you couldn't pickle it any more.
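
To illustrate (a standalone sketch, not from the PR): a module-level callable class instance pickles fine, while a lambda does not:

import pickle

class LeakyRectify(object):
    # toy version just for this pickling demo
    def __init__(self, leakiness=0.01):
        self.leakiness = leakiness
    def __call__(self, x):
        return max(self.leakiness * x, x)

nonlin_class = LeakyRectify(0.2)
nonlin_lambda = lambda x: max(0.2 * x, x)

pickle.dumps(nonlin_class)   # works: pickled by class reference plus instance dict
try:
    pickle.dumps(nonlin_lambda)
except (pickle.PicklingError, AttributeError):  # PicklingError on CPython
    print("lambda is not picklable")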

@benanne
Member

benanne commented Mar 11, 2015

Yeah, CamelCase would work I guess. Good point about the pickling.

Looks like Travis is failing because deeplearning.net is down (can't download MNIST), I will restart it when it's back up.

@f0k
Member Author

f0k commented Mar 11, 2015

Looks like Travis is failing because deeplearning.net is down (can't download MNIST), I will restart it when it's back up.

Yep, I said so above ;) No need to restart, I'll switch to CamelCase tomorrow and then it has to run again anyway.

Remaining open question from above:

what about naming it LeakyRectify and providing a shortcut leaky_rectify = LeakyRectify(0.01) using the default leakiness from the original paper?

@f0k
Member Author

f0k commented Mar 12, 2015

Renamed to CamelCase and added a lowercase shortcut for the default leakiness. Travis passes.

@benanne
Member

benanne commented Mar 15, 2015

Looks good, will merge it tomorrow if there are no further comments!

@f0k
Member Author

f0k commented Mar 16, 2015

return (x + (1 - self.leakiness) * abs(x)) / (2 - self.leakiness)

To be sure, I did a test and profiled it (as part of a CNN forward pass):

0.617s       8.40e-04s     C      735        3   GpuElemwise{Composite{maximum((i0 * (i1 + i2)), (i1 + i2))}}[(0, 1)]

compared to:

0.579s       7.88e-04s     C      735        3   GpuElemwise{Composite{(i0 * ((i1 + i2) + (i3 * Abs((i1 + i2)))))}}[(0, 1)]

So @ebenolson's formulation is indeed a tiny bit faster for the forward pass, and consistently so.

I ran my four variations from the test above again, this time with Theano's profiling mode:

   0.7%    98.3%       0.137s       1.37e-03s     C      100        1   GpuElemwise{Composite{maximum((i0 * i1), i1)}}[(0, 1)]
   0.6%    98.9%       0.131s       1.31e-03s     C      100        1   GpuElemwise{Maximum}[(0, 1)]
   0.5%    99.5%       0.111s       1.11e-03s     C      100        1   GpuElemwise{Composite{(i0 * (i1 + (i2 * Abs(i1))))}}[(0, 1)]
   0.5%   100.0%       0.110s       1.10e-03s     C      100        1   GpuElemwise{Composite{(i0 * (i1 + Abs(i1)))}}[(0, 1)]

So yes, the more complicated formulation is indeed faster. My previous test was flawed because it included the transfer costs.

The backward pass is troubling me a bit, though... I'll report back.

@f0k
Member Author

f0k commented Mar 16, 2015

Backward pass graphs (before optimization):

======= relu1: =======
Elemwise{mul} [@A] 'grelu1'   
 |Elemwise{EQ} [@B] ''   
 | |Elemwise{maximum} [@C] ''   
 | | |DimShuffle{x,x} [@D] ''   
 | | | |TensorConstant{0} [@E]
 | | |x [@F]
 | |x [@F]
 |out [@G]

======= relu2: =======
Elemwise{add,no_inplace} [@A] 'grelu2'   
 |Elemwise{true_div} [@B] ''   
 | |out [@C]
 | |DimShuffle{x,x} [@D] ''   
 |   |TensorConstant{2.0} [@E]
 |Elemwise{true_div} [@F] ''   
   |Elemwise{mul} [@G] ''   
   | |Elemwise{true_div} [@B] ''   
   | |x [@H]
   |Elemwise{Abs} [@I] ''   
     |x [@H]

======= leaky relu1: =======
Elemwise{add,no_inplace} [@A] 'glrelu1'   
 |Elemwise{mul} [@B] ''   
 | |Elemwise{mul} [@C] ''   
 | | |Elemwise{EQ} [@D] ''   
 | | | |Elemwise{maximum} [@E] ''   
 | | | | |Elemwise{mul,no_inplace} [@F] ''   
 | | | | | |DimShuffle{x,x} [@G] ''   
 | | | | | | |TensorConstant{0.10000000149} [@H]
 | | | | | |x [@I]
 | | | | |x [@I]
 | | | |Elemwise{mul,no_inplace} [@F] ''   
 | | |out [@J]
 | |DimShuffle{x,x} [@G] ''   
 |Elemwise{mul} [@K] ''   
   |Elemwise{EQ} [@L] ''   
   | |Elemwise{maximum} [@M] ''   
   | | |Elemwise{mul,no_inplace} [@F] ''   
   | | |x [@I]
   | |x [@I]
   |out [@J]

======= leaky relu2: =======
Elemwise{add,no_inplace} [@A] 'glrelu2'   
 |Elemwise{true_div} [@B] ''   
 | |out [@C]
 | |DimShuffle{x,x} [@D] ''   
 |   |TensorConstant{1.89999997616} [@E]
 |Elemwise{true_div} [@F] ''   
   |Elemwise{mul} [@G] ''   
   | |Elemwise{mul} [@H] ''   
   | | |Elemwise{true_div} [@B] ''   
   | | |DimShuffle{x,x} [@I] ''   
   | |   |TensorConstant{0.899999976158} [@J]
   | |x [@K]
   |Elemwise{Abs} [@L] ''   
     |x [@K]

The corresponding profiles:

======= relu1: =======
   1.3%    97.7%       0.075s       7.46e-04s     C      100        1   GpuElemwise{maximum,no_inplace}
   1.2%    98.9%       0.066s       6.58e-04s     C      100        1   GpuElemwise{Composite{Cast{float32}(EQ(i0, i1))}}[(0, 0)]
   1.1%   100.0%       0.065s       6.49e-04s     C      100        1   GpuElemwise{Mul}[(0, 0)]

======= relu2: =======
   1.4%   100.0%       0.074s       7.39e-04s     C      100        1   GpuElemwise{Composite{((i0 * i1) + (i0 * i1 * sgn(i2)))}}[(0, 1)]

======= leaky relu1: =======
   1.5%    95.3%       0.085s       8.45e-04s     C      100        1   GpuElemwise{maximum,no_inplace}
   1.4%    96.7%       0.078s       7.76e-04s     C      100        1   GpuElemwise{Composite{((i0 * i1 * i2) + (i3 * i2))}}[(0, 1)]
   1.2%    97.9%       0.068s       6.78e-04s     C      100        1   GpuElemwise{Composite{Cast{float32}(EQ(i0, i1))}}[(0, 0)]
   1.2%    99.0%       0.066s       6.59e-04s     C      100        1   GpuElemwise{Composite{Cast{float32}(EQ(i0, i1))}}[(0, 1)]
   1.0%   100.0%       0.056s       5.56e-04s     C      100        1   GpuElemwise{mul,no_inplace}

======= leaky relu2: =======
   1.4%   100.0%       0.075s       7.49e-04s     C      100        1   GpuElemwise{Composite{((i0 * i1) + (i2 * i1 * sgn(i3)))}}[(0, 1)]

So the backward passes are also faster for the more complicated formulation: as the profiles show, the abs()-based gradient compiles into a single Elemwise composite (one kernel), while the maximum-based one is split across several kernels. I'll change this PR accordingly.

@f0k
Member Author

f0k commented Mar 16, 2015

@ebenolson: Your formula was wrong. Did you try it? A correct one is:

((1 + self.leakiness) * x + (1 - self.leakiness) * abs(x)) / 2
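
(Check: with a = self.leakiness, for x > 0 this gives ((1 + a)*x + (1 - a)*x) / 2 = x, and for x < 0 it gives ((1 + a)*x - (1 - a)*x) / 2 = a*x, as intended.)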

The timing results still hold; it's a lot faster especially for the backward pass. I've updated the PR (in a slightly more efficient formulation).

@ebenolson
Member

@ebenolson: Your formula was wrong. Did you try it?

Nope, that's embarrassing. I've been using it for PReLU and never noticed; I suppose the error just got absorbed into the parameter.

+1 for test driven development.
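
For example, a quick numpy check along these lines (a sketch, not part of the PR's test suite) would have caught the wrong slope:

import numpy as np

def leaky_rectify_abs(x, leakiness=0.1):
    # abs()-based formulation with the corrected coefficients
    return ((1 + leakiness) * x + (1 - leakiness) * np.abs(x)) / 2

x = np.random.randn(1000).astype(np.float32)
expected = np.where(x > 0, x, 0.1 * x)
assert np.allclose(leaky_rectify_abs(x, 0.1), expected)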

@f0k
Member Author

f0k commented Mar 17, 2015

@benanne: Travis passes, ready to merge this time!

benanne added a commit that referenced this pull request Mar 17, 2015
Leaky rectifier nonlinearity
@benanne benanne merged commit 93250bb into Lasagne:master Mar 17, 2015
@benanne
Member

benanne commented Mar 17, 2015

Done :)

@f0k f0k deleted the leaky-relu branch March 17, 2015 18:30
@r2007

r2007 commented Mar 24, 2015

Considering that the gradient is always needed in NN training, I added the gradient computation to the test, plus a new method I found after googling.

(MacBook Air without GPU)

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
d = {
        "T.maximum(0, x)": lambda x: T.maximum(0, x),
        "0.5 * (x + abs(x))": lambda x: 0.5 * (x + abs(x)),
        "x * (x > 0)": lambda x: x * (x > 0),
        "T.switch(x<0, 0, x)": lambda x: T.switch(x<0, 0, x)
      }
z = np.random.randn(10000, 1000)
for name, f in d.iteritems():
    cost = theano.function([x], f(x))
    grad = theano.function([x], theano.grad(f(x).sum(), x))
    cost_grad = theano.function([x], [f(x), theano.grad(f(x).sum(), x)])
    print(name)
    %timeit cost(z)
    %timeit grad(z)
    %timeit cost_grad(z)
    c, g = cost_grad(z)
    assert np.all(c == np.where(z > 0, z, 0))
    assert np.all(g == (z > 0))

output:

T.switch(x<0, 0, x)
10 loops, best of 3: 30.3 ms per loop
10 loops, best of 3: 80.1 ms per loop
10 loops, best of 3: 152 ms per loop
T.maximum(0, x)
10 loops, best of 3: 74.3 ms per loop
10 loops, best of 3: 81.1 ms per loop
10 loops, best of 3: 120 ms per loop
0.5 * (x + abs(x))
10 loops, best of 3: 29.5 ms per loop
10 loops, best of 3: 86.2 ms per loop
10 loops, best of 3: 117 ms per loop
x * (x > 0)
10 loops, best of 3: 28.8 ms per loop
10 loops, best of 3: 29 ms per loop
10 loops, best of 3: 71.5 ms per loop

@benanne
Member

benanne commented Mar 24, 2015

Here's the output of the same code running on a GTX 980:

T.switch(x<0, 0, x)
100 loops, best of 3: 17.8 ms per loop
100 loops, best of 3: 16.5 ms per loop
10 loops, best of 3: 82 ms per loop
T.maximum(0, x)
100 loops, best of 3: 16.5 ms per loop
100 loops, best of 3: 17.5 ms per loop
10 loops, best of 3: 26.1 ms per loop
0.5 * (x + abs(x))
100 loops, best of 3: 16.4 ms per loop
100 loops, best of 3: 16.4 ms per loop
10 loops, best of 3: 25.6 ms per loop
x * (x > 0)
100 loops, best of 3: 17.3 ms per loop
100 loops, best of 3: 16.8 ms per loop
1 loops, best of 3: 40.3 ms per loop

Our existing implementation (0.5 * (x + abs(x))) seems to have the edge here, although the difference with T.maximum(0, x) is almost negligible.

@r2007

r2007 commented Mar 24, 2015

So, GPU first ;)

@r2007

r2007 commented Mar 24, 2015

@benanne could you test the following on the GTX 980?
I'm interested in the performance of the two leaky rectifier formulations on the GPU.
Thanks.

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
d = {
        "T.maximum(0, x)": lambda x: T.maximum(0, x),
        "0.5 * (x + abs(x))": lambda x: 0.5 * (x + abs(x)),
        "x * (x > 0)": lambda x: x * (x > 0),
        "T.switch(x<0, 0, x)": lambda x: T.switch(x<0, 0, x),
        "f1 * x + f2 * abs(x)": lambda x: 0.5 * 1.1 * x + 0.5 * 0.9 * abs(x),
        "T.maximum(0.1*x, x)": lambda x: T.maximum(0.1*x, x)
      }
z = np.random.randn(10000, 1000)
for name, f in d.iteritems():
    cost = theano.function([x], f(x))
    grad = theano.function([x], theano.grad(f(x).sum(), x))
    cost_grad = theano.function([x], [f(x), theano.grad(f(x).sum(), x)])
    print(name)
    %timeit cost(z)
    %timeit grad(z)
    %timeit cost_grad(z)
    c, g = cost_grad(z)
    #assert np.all(c == np.where(z > 0, z, 0))
    #assert np.all(g == (z > 0))

@benanne
Member

benanne commented Mar 24, 2015

Alright then, just this once ;) but I don't plan to make a habit out of it!

f1 * x + f2 * abs(x)
100 loops, best of 3: 16.4 ms per loop
100 loops, best of 3: 16.4 ms per loop
10 loops, best of 3: 25.7 ms per loop
T.maximum(0, x)
100 loops, best of 3: 16.6 ms per loop
100 loops, best of 3: 17.4 ms per loop
10 loops, best of 3: 25.5 ms per loop
T.maximum(0.1*x, x)
100 loops, best of 3: 16.7 ms per loop
100 loops, best of 3: 18.4 ms per loop
10 loops, best of 3: 27 ms per loop
x * (x > 0)
100 loops, best of 3: 17.3 ms per loop
100 loops, best of 3: 16.5 ms per loop
10 loops, best of 3: 39.4 ms per loop
T.switch(x<0, 0, x)
100 loops, best of 3: 17.6 ms per loop
100 loops, best of 3: 16.5 ms per loop
10 loops, best of 3: 79.5 ms per loop
0.5 * (x + abs(x))
100 loops, best of 3: 16.4 ms per loop
100 loops, best of 3: 16.5 ms per loop
10 loops, best of 3: 25.1 ms per loop

@f0k
Member Author

f0k commented Mar 24, 2015

Considering that the gradient is always needed in NN training, I added the gradient computation to the test, plus a new method I found after googling.

Sorry to interrupt the happy benchmarking party, but as I noted above, this way of testing is flawed for the GPU because it includes the transfer costs. Also, your gradient benchmark does more than just compute the gradient.

My setup was like this:

import theano
T = theano.tensor
x = T.matrix('x')
# define forward expression
# (comment in the expression you want to time)
#relu = T.maximum(0, x)
relu = 0.5 * (x + abs(x))
# define backward expression
out = T.matrix('out')
grelu = theano.grad(None, x, known_grads={relu: out})
# compile backward expression
fgrelu = theano.function([x, out], grelu)
# optional: print graph
if False:
    from theano.printing import debugprint
    print "Original graph:"
    debugprint(grelu)
    print "Optimized graph:"
    debugprint(fgrelu)
# run
print "Running benchmark."
import numpy as np
inp = np.random.randn(10000, 1000).astype(np.float32)
outp = np.random.randn(10000, 1000).astype(np.float32)
for _ in range(100):
    fgrelu(inp, outp)

Run this with CUDA_LAUNCH_BLOCKING=1 THEANO_FLAGS=device=gpu,floatX=float32,profile=1 ./name_of_the_file.py to get a profile, and add up everything but the HostToGpu. Alternatively, read the Theano documentation to learn how to get both the graph and the inputs onto the GPU so there are neither transfers nor copy operations in the compiled functions at all.
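
For example, one way to keep both the graph and the data on the GPU is to put everything into shared variables, so the compiled function neither reads nor writes host memory (a sketch, not the exact setup I used):

import numpy as np
import theano

# with device=gpu and floatX=float32, these shared variables live on the GPU
x = theano.shared(np.random.randn(10000, 1000).astype(np.float32), name='x')
out = theano.shared(np.random.randn(10000, 1000).astype(np.float32), name='out')
result = theano.shared(np.zeros((10000, 1000), dtype=np.float32), name='result')

relu = 0.5 * (x + abs(x))
grelu = theano.grad(None, x, known_grads={relu: out})

# write the gradient into another shared variable instead of returning it,
# so the compiled graph contains no host<->device transfers
fgrelu = theano.function([], [], updates=[(result, grelu)])

for _ in range(100):
    fgrelu()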

Our 0.5 * (x + abs(x)) formulation is a lot faster than anything else for the backward pass because it gets compiled into a single Elemwise{Composite{...}} expression and hence a single kernel launch, while all the others get compiled into multiple separate kernel launches.
/edit: On CPU, things may well look different, of course, because there's no such thing as kernel launch overhead. But we're focusing on GPU performance here. On a side note, these kinds of tweaks should actually be done by Theano, but that's close to impossible (it's an easy optimization for the forward pass, but would be very complex for the backward pass).
