Add PReLU Layer #1940
Conversation
Thank you for sharing this, tnarihi. I've been running it and I've been seeing improved performance.
Good to hear that, @jyegerlehner!
* @param param provides PReLUParameter prelu_param,
*     with PReLULayer options:
*   - init_value (\b optional, default 0.25).
*     all negative slopes over channels are set to this value.
How about using a Filler for this (like InnerProductLayer and ConvolutionLayer)?
Agreed, that seems more flexible. In this case, is filler a good name for the parameter?
Yup, filler sounds good to me.
Hey Takuya, thanks for creating this PR. Besides the one comment I made above, this looks good to me. At some point, when we figure out a better way of handling composition, it would be good to add a (Edit: if you'd prefer, you could separate the appropriate piece of this out into a
Interesting! I believe adding more primitive layers would be nicer from a research perspective, and reusing them makes the source code clearer and easier to maintain. I will consider separating that piece of code out. Thanks, Jeff!
Changed to use a filler.
Hi @tnarihi, I tried your PReLU on the cifar10 and mnist examples in caffe and it gets worse results. With the cifar10 example train_full.sh, ReLU gets accuracy 0.8181; after I change all ReLUs to PReLUs, it drops to 0.7495. Any idea why?
Thanks for reporting, @hycis. I tried only the cifar_quick example. Let me try to reproduce the results, maybe this weekend. In the meantime, please try different learning rates and initializations.
@tnarihi, @hycis, the improved performance I alluded to above was from a model that had a couple of things different from the one I was comparing it to; one of those things was the use of PReLU. I haven't done a clean A/B comparison of PReLU vs. leaky ReLU, so sorry if I wrongly attributed the improvement to PReLU. I will try a clean comparison of leaky ReLU vs. PReLU on my own model.
@jyegerlehner Also, there is a "Very Leaky ReLU", where the negative slope is ~0.1-0.3 rather than 0.01.
Here's a result comparing training of two models that are identical except that leaky ReLUs with negative_slope=0.1 are switched out for PReLUs with initial_value=0.1 (I was running the patch from before initial_value was changed over to filler). Both models were initialized from the same .caffemodel.

Edit: Updated charts to reflect the behavior of the latest PR code.

This is the kind of observation that led to my "improved behavior" comment above. However, there seems to be a problem. I notice that the initial loss computed by caffe was a bit different for the two cases (25.17 for ReLU, 25.62 for PReLU). It should be identical, since a PReLU with initial_value = 0.1 ought to forward-propagate identically to a leaky ReLU with negative_slope = 0.1 before any training has happened, unless I'm confused about PReLUs. Furthermore, setting the learning rate for the PReLU to zero ought to make it behave identically to a ReLU if its initial_value is set identically to the ReLU's negative_slope. In other words, I have layers in the ReLU net like this:

layer {
name: "encode1_relu"
type: "ReLU"
bottom: "encode1_conv"
top: "encode1_conv"
relu_param: {
negative_slope: 0.1
}
}

And in the PReLU version, I replace those with:

layer {
name: "encode1_prelu"
type: "PReLU"
bottom: "encode1_conv"
top: "encode1_conv"
param {
lr_mult: 0
}
prelu_param: {
filler { value: 0.1 type: "constant" }
}
}

So if I do that and run the same test, I should see loss vs. iterations identical to the ReLU version. I did this, and it turns out they are different.
Does anyone see errors in my reasoning, or have better ideas? Edit: I don't show the loss at iteration 0 just because it changes the Y-axis scaling enough to obscure the trends later.
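As a sanity check on the reasoning above, here is a minimal NumPy sketch (an illustration added here, not code from the PR) of the two forward rules: a PReLU whose slope is held fixed at 0.1 should produce exactly the same outputs as a leaky ReLU with negative_slope = 0.1, which is why the initial losses would be expected to match.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.1):
    # Caffe-style leaky ReLU: max(0, x) + negative_slope * min(0, x)
    return np.maximum(0, x) + negative_slope * np.minimum(0, x)

def prelu(x, slope):
    # PReLU forward rule from He et al.: max(0, x) + a * min(0, x)
    return np.maximum(0, x) + slope * np.minimum(0, x)

x = np.random.randn(1000)
assert np.allclose(leaky_relu(x, 0.1), prelu(x, 0.1))  # identical before any training
```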
Thanks for pointing that out, @ducha-aiki. Yes, I had somehow already stumbled across that. My ReLUs are already very leaky, using negative_slope = 0.1.
@jyegerlehner
@jyegerlehner Thanks for reporting your experiments; those are really useful. As you suggested, I've just added a new test case that checks whether PReLU produces numbers consistent with Leaky ReLU.
Sorry... the test was wrong, but it passes now anyway.
Now I've figured out that something is wrong with in-place computation in the GPU backward pass. I will look into the GPU code.
I found the bug. It was due to calling an incorrect API for copying data. Please see the commit message for details. Now it should work correctly. @jyegerlehner
Sorry, I replied to another person.
@tnarihi said:
Wow, that was quick; I had just finished changing over my net.prototxt to remove in-place computation. Most of what I had to say is stale now in light of your new work. Both the ReLU and PReLU showed improved performance after removing in-place computation. As far as the conclusions above are concerned, it doesn't make any difference: PReLU still performs better. And PReLU with learning rate = 0 and negative slope = 0.1 still gives slightly different results than ReLU with negative slope = 0.1. It's a small difference, but consistent. Not sure if it's worth going through to find the root cause of that difference. I didn't find anything wrong with the PReLU tests, so I'm inclined to think it's good; it could be a methodological error on my part. I will pull your fix, restore the net.prototxt to in-place computation, and report back the results, which I expect will be identical to the not-in-place results. Edit: Updated the charts above with the new results. PReLU still better.
@hycis I reproduced the behavior you reported, where the MNIST/lenet and Cifar examples both perform worse when ReLUs are switched over to PReLUs. Note that both of those models use in-place computation for the ReLUs. I then pulled @tnarihi's latest change that fixes in-place computation and repeated the test. I found that MNIST/lenet with PReLUs and with ReLUs perform nearly identically. In the case of Cifar, I found that PReLU has superior accuracy at 70K iterations: 0.8216 for the PReLU model vs. 0.8156 for ReLU. So I think we can call that one solved.
It's great that the implementation has been proven to be correct. Will there be an end-to-end example that reproduces the results of the paper? Thanks!
@futurely Surely it would be nice, but nowadays even a peer-reviewed paper is not required to test algorithms on ImageNet (He et al. report performance only on it), which takes >= 2 weeks of GPU work (pretty costly even if you have a free GPU and pay only for electricity).
Agreed. After this PR is merged, it would be nice if someone reproduces the results of the paper and puts the model into the Model Zoo! I would like to keep this PR as is. I came up with another possible fix for this PR: the paper says they don't use weight decay for the PReLU slopes, since weight decay tends to push the slopes toward zero and biases PReLU back toward plain ReLU.
Wouldn't it be better to force the weight decay for PReLU to be 0? Does anyone have comments or suggestions?
I think it is enough to set param { decay_mult: 0 } in an example which explains its usage.
TYPED_TEST(NeuronLayerTest, TestPReLUInplace) {
Capitalize Place (TestPReLUInPlace).
@jeffdonahue Done.
@tnarihi I just tried decay_mult: 0 on cifar10_full, but still only get 0.75. @jyegerlehner I wonder how you get 0.82?
@hycis I'm running the experiment again and will report back when it's done. I didn't change any hyperparameters; just the out-of-the-box examples/cifar10/train_full.sh, with the ReLUs changed to default PReLUs. I also did not use tnarihi's trick of decay_mult = 0.0. I'm not clear: is your 0.75 PReLU accuracy from after pulling tnarihi's in-place-computation fix?
Thanks again for the layer and all the fixes, @tnarihi.
@hycis This time the ReLU version produced 0.8172 accuracy at 70K iterations, and the PReLU version produced 0.8177 at 70K iterations. I could post the modified shell scripts and prototxt I used if that would help you reproduce the result.
@jyegerlehner I am not sure why either. I did a pull and also did what you mentioned, but I'm just not getting the same result as you. It would be great if you could share your prototxt and shell scripts. Thanks. My email is hyciswu@gmail.com.
@hycis Well, that's troubling. OK, here's what I used: https://gist.github.com/jyegerlehner/b2f073aa8e213f0a9167 Please let us know what you find; I'm worried perhaps I have an error on my end.
After I pulled and rebuilt on the latest commit, I was able to improve full cifar10 from 0.7562 to 0.8184, and to 0.8193 with decay_mult=0, at 70000 iterations using PReLU. Thanks @jyegerlehner and @tnarihi.
A bit of (mostly useless) Caffe trivia: I just realized that even before this PR we could already implement PReLU (very inefficiently) by composition; at least the
...is equivalent to this sequence of layers:
To be honest though, when I tried both I got slightly different results, so I'm not 100% sure that's right, but I've already spent way more time on this than was warranted, so I won't look into it further...
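For reference, the identity that makes a composition like this possible (a sketch of the math only; not necessarily the exact layer sequence tried above) is that PReLU can be written with two plain ReLUs and a learnable per-channel scale: PReLU(x) = ReLU(x) - a * ReLU(-x). A small NumPy check of that identity:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def prelu(x, a):
    # PReLU forward rule: max(0, x) + a * min(0, x)
    return np.maximum(0, x) + a * np.minimum(0, x)

x = np.random.randn(1000)
a = 0.25
# Compose PReLU from primitives: ReLU on x, ReLU on -x scaled by the slope, summed.
composed = relu(x) - a * relu(-x)
assert np.allclose(prelu(x, a), composed)
```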
@jeffdonahue Layer composition is a fine hobby. Thanks for commenting.
@tnarihi Just wondering, is there a way to output the learned slope coefficients of PReLU from the saved caffemodel?
@hycis

import caffe

net = caffe.Net("<proto_path>", "<caffemodel_path>", caffe.TRAIN)
slopes_blob = net.layers['prelu1'][0]  # your prelu layer name
print slopes_blob.data  # This is a numpy array of the slopes

I haven't tested this script, but a script like this should work.
@tnarihi Thanks for the quick reply.
Sorry, it should be net.params['prelu1'][0] rather than net.layers['prelu1'][0].
Hi @tnarihi
@hycis Right.
Hi @tnarihi,
Hi @happynear, I think that's the case only if the negative slopes are all positive. If we allow slopes to be negative, we cannot figure out the pre-activation values from the output alone.
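To spell that point out (my own illustration, assuming the standard PReLU forward rule): with in-place computation the backward pass only sees the output y, and the input x can be recovered from y only when the slope a is strictly positive; once slopes may be zero or negative, the mapping is no longer invertible and the layer needs the original bottom data.

```python
import numpy as np

def prelu_forward(x, a):
    return np.where(x > 0, x, a * x)

def recover_input(y, a):
    # Invert the forward rule: a positive output must have come from a positive
    # input, and a non-positive output from a negative input scaled by a.
    # This identifies the correct branch only when a > 0.
    return np.where(y > 0, y, y / a)

x = np.random.randn(1000)

y = prelu_forward(x, 0.25)
assert np.allclose(recover_input(y, 0.25), x)   # recoverable when a > 0

y = prelu_forward(x, -0.25)                     # with a negative slope every output is >= 0,
recovered = recover_input(y, -0.25)             # so the sign of y no longer identifies the branch
assert not np.allclose(recovered, x)
```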
@tnarihi
@happynear Sorry to be late.
One naive implementation of doing this is that every layer copies bottom.diff to a temporary buffer. EDIT: This still needs additional memory.
Now I think we have two choices:
When I use PReLULayer to train the ImageNet model, the loss at the beginning is about 7. But when I fine-tune from "bvlc_reference_caffenet.caffemodel", the loss at the beginning is about 80! Why?
Maybe the bigger loss is due to the much larger number of nonzero responses in PReLU, but I am not sure exactly. The reference model is trained with ReLU (not PReLU), which is equivalent to the initial state of PReLU with the following setting:
You should start with this setting. The default value of the negative slope is 0.25.
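To make the setting above concrete (this is my reading, since the exact snippet did not come through): a plain ReLU corresponds to a negative slope of 0, so starting fine-tuning from exactly the reference model's behavior presumably means initializing the PReLU slopes to 0 (e.g. a constant filler with value 0) rather than the default 0.25. A quick NumPy check of that equivalence:

```python
import numpy as np

x = np.random.randn(1000)
relu_out = np.maximum(0, x)                             # plain ReLU (negative slope 0)
prelu_out = np.maximum(0, x) + 0.0 * np.minimum(0, x)   # PReLU with all slopes filled to 0
assert np.allclose(relu_out, prelu_out)                 # identical at the start of fine-tuning
```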
@tnarihi Thank you! I want to train a model with PReLU, and I used the reference model for an experiment fine-tuning a PReLU model. So I used the default value of the negative slope, 0.25.
A new type of ReLU has been designed to address the overfitting problem: http://arxiv.org/abs/1505.00853.
@happynear Thank you. I have read this paper. In my experience, the initial negative slope should be adjusted if you fine-tune from a pre-trained model.
Replacement of #1880 for master branch development.