Leaky rectifier nonlinearity #163
Conversation
Would it be better to implement it analogously to relu? I.e.:
Hmm, I'm not sure, but thank you for doing the thinking :) A clear advantage of that formulation is that it works both for Theano expressions and numpy arrays. The disadvantage is that it's harder to read (but that can be documented) and that it leads to a more complex graph. Performance-wise I don't see any difference:
Looking at @SnippyHolloW's code, the original comparison was between So maybe the question should be: Shouldn't we just use
Regarding this, what about naming it

/edit: By the way, Travis failed because MNIST couldn't be downloaded from deeplearning.net, which is down.
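Coming back to the advantage mentioned above (working for both Theano expressions and numpy arrays), a minimal illustrative sketch, not from the thread itself: the abs-based formulation only uses arithmetic operators and abs(), which numpy arrays and Theano expressions both overload, so one function handles either input.

import numpy as np
import theano.tensor as T

def rectify(x):
    # works for numpy arrays as well as Theano expressions,
    # since it only relies on arithmetic operators and abs()
    return 0.5 * (x + abs(x))

print(rectify(np.array([-1.0, 2.0])))  # numpy result: [ 0.  2.]
print(rectify(T.vector('v')))          # a symbolic Theano expression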
We should definitely add leaky rectification, but I'm a little concerned about the interface: Another approach is to define all nonlinearities as first-order functions, possibly with multiple arguments, and then encourage users to define lambdas (or functions) to set all arguments but the first to the desired values if they don't want to use the default, e.g.:
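The original snippet is not preserved here; a hypothetical sketch of that interface (the function and argument names are illustrative assumptions, not the code from this PR) could look like:

import theano.tensor as T

def leaky_rectify(x, leakiness=0.01):
    # plain first-order function: extra arguments are exposed directly
    return T.maximum(leakiness * x, x)

# users wanting a non-default leakiness would bind it themselves:
# nonlinearity = lambda x: leaky_rectify(x, leakiness=0.1)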
This is of course more verbose. Regarding the implementation, I actually did profile
Yes, that's why I'd change it to CamelCase. Wouldn't that be enough of a distinction?
And you couldn't pickle it any more.
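To spell out the pickling concern with a generic sketch (nothing PR-specific assumed): pickle stores functions by module and name, and a lambda has no importable name to store.

import pickle

f = lambda x: 0.1 * x  # stand-in for a user-bound nonlinearity
try:
    pickle.dumps(f)
except pickle.PicklingError as e:
    print("lambdas cannot be pickled: %s" % e)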
Yeah, CamelCase would work I guess. Good point about the pickling. Looks like Travis is failing because deeplearning.net is down (can't download MNIST); I will restart it when it's back up.
Yep, I said so above ;) No need to restart, I'll switch to CamelCase tomorrow and then it has to run again anyway. Remaining open question from above:
Renamed to CamelCase and added a lowercase shortcut for the default leakiness. Travis passes.
Looks good, will merge it tomorrow if there are no further comments!
To be sure, I did a test and profiled it (as part of a CNN forward pass):
compared to:
So @ebenolson's formulation is indeed a tiny bit faster for the forward pass, and consistently so. I did my four variations as in the test above again, this time with Theano's profile mode:
So yes, the more complicated formulation is indeed faster. My previous test was flawed because it included the transfer costs. The backward pass is troubling me a bit, though... I'll report back.
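For reference, a minimal sketch of how such per-op profiles can be collected (my own example rather than the exact setup used above, assuming Theano's profile option of theano.function):

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
relu = 0.5 * (x + abs(x))                     # expression under test
f = theano.function([x], relu, profile=True)  # collect per-op timings
inp = np.random.randn(10000, 1000)
for _ in range(100):
    f(inp)
f.profile.summary()                           # print the per-op breakdown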
Backward pass graphs (before optimization):
The corresponding profiles:
So the backward passes are also faster for the more complicated formulation. I'll change this PR accordingly.
@ebenolson: Your formula was wrong. Did you try it? A correct one is:

((1 + self.leakiness) * x + (1 - self.leakiness) * abs(x)) / 2

The timing results still hold; it's a lot faster, especially for the backward pass. I've updated the PR (in a slightly more efficient formulation).
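A quick numpy check (my own sketch, not part of the PR) that this abs-based form matches the maximum-based definition of the leaky rectifier:

import numpy as np

leakiness = 0.01
x = np.random.randn(1000)
maximum_based = np.maximum(leakiness * x, x)
abs_based = ((1 + leakiness) * x + (1 - leakiness) * np.abs(x)) / 2
assert np.allclose(maximum_based, abs_based)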
Nope, that's embarrassing. I've been using it for PReLU and never noticed; I suppose the error just got absorbed into the parameter. +1 for test-driven development.
@benanne: Travis passes, ready to merge this time!
Done :)
Considering that the gradient is always needed in NNs, I added the gradient computation to the test, plus a new method I found after googling. (MacBook Air without GPU)

import numpy as np
import theano
import theano.tensor as T
x = T.matrix('x')
d = {
"T.maximum(0, x)": lambda x: T.maximum(0, x),
"0.5 * (x + abs(x))": lambda x: 0.5 * (x + abs(x)),
"x * (x > 0)": lambda x: x * (x > 0),
"T.switch(x<0, 0, x)": lambda x: T.switch(x<0, 0, x)
}
z = np.random.randn(10000, 1000)
for name, f in d.iteritems():
    cost = theano.function([x], f(x))
    grad = theano.function([x], theano.grad(f(x).sum(), x))
    cost_grad = theano.function([x], [f(x), theano.grad(f(x).sum(), x)])
    print(name)
    %timeit cost(z)
    %timeit grad(z)
    %timeit cost_grad(z)
    c, g = cost_grad(z)
    assert np.all(c == np.where(z > 0, z, 0))
    assert np.all(g == (z > 0))

output:
Here's the output of the same code running on a GTX 980:
Our existing implementation (
So, GPU first ;)
@benanne could you test the following on GTX 980?
Alright then, just this once ;) but I don't plan to make a habit out of it!
Sorry to interrupt the happy benchmarking party, but as I noted above, this way of testing is flawed for the GPU because it includes the transfer costs. Also, your gradient benchmark does more than just computing the gradient. My setup was like this:

import theano
T = theano.tensor
x = T.matrix('x')
# define forward expression
# (comment in the expression you want to time)
#relu = T.maximum(0, x)
relu = 0.5 * (x + abs(x))
# define backward expression
out = T.matrix('out')
grelu = theano.grad(None, x, known_grads={relu: out})
# compile backward expression
fgrelu = theano.function([x, out], grelu)
# optional: print graph
if False:
    from theano.printing import debugprint
    print "Original graph:"
    debugprint(grelu)
    print "Optimized graph:"
    debugprint(fgrelu)
# run
print "Running benchmark."
import numpy as np
inp = np.random.randn(10000, 1000).astype(np.float32)
outp = np.random.randn(10000, 1000).astype(np.float32)
for _ in range(100):
    fgrelu(inp, outp)

Run this with
Our
This adds the leaky rectifier nonlinearity with custom leakiness. Example use:
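A sketch of what such a use could look like (the import paths and layer names here are my assumption of the surrounding Lasagne API, not copied from the PR):

from lasagne.layers import InputLayer, DenseLayer
from lasagne.nonlinearities import LeakyRectify

l_in = InputLayer((None, 50))
# pass a leaky rectifier with custom leakiness as the layer's nonlinearity
l_hidden = DenseLayer(l_in, num_units=100, nonlinearity=LeakyRectify(leakiness=0.1))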
It's obviously a callable class, not a function. We can discuss whether to name it leaky_rectify to blend in with the others or LeakyRectify to set it apart. Thinking about it, I'm actually leaning towards the latter.