Added Swish layer #6002
Conversation
Is it necessary to modify caffe.proto to add the Swish layer?
@a26908456 Swish does not require any parameters (like ReLU), so there is no need to modify caffe.proto.
@anmikh - thank you for contributing to caffe! If I understand the cited paper correctly, then "swish" amounts to y = x * sigmoid(x), which can already be composed from the existing "Sigmoid" and "Eltwise" layers.
The advantage of using existing layers: there is no new code to write, test, and maintain.
If you are troubled by the fact that you need to write a longer prototxt, the net definition can be generated programmatically.
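For reference, a minimal prototxt sketch of that composition (the blob and layer names here are illustrative, not taken from this PR):

```
# Swish(x) = x * sigmoid(x), built from existing Caffe layers
layer {
  name: "act_sigmoid"
  type: "Sigmoid"
  bottom: "conv1"
  top: "act_sigmoid"
}
layer {
  name: "act_swish"
  type: "Eltwise"
  bottom: "conv1"
  bottom: "act_sigmoid"
  top: "act_swish"
  eltwise_param { operation: PROD }  # elementwise product x * sigmoid(x)
}
```

This covers the beta-free definition; the beta variant would additionally need a fixed scaling of the sigmoid input (e.g. a Power layer), which is where the "3 layers per activation" count mentioned later in the thread comes from.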
There are many other activations that have been published recently, with all sorts of guarantees, and it is very tempting to have a different "layer" for each and every one of them. However, I think a more "long term" approach is to build upon existing layers as building blocks and only implement layers that cannot be constructed using the existing arsenal. It is very difficult to maintain code for so many layers when in fact most of the code basically implements the same functions.
BTW, I just came across a Swish implementation with GPU support. See here.
@shaibagon @a26908456 In version 1 of the original paper https://arxiv.org/abs/1710.05941v1 Swish is just f(x) = x * sigmoid(x); the updated version of the paper adds a trainable beta parameter, f(x) = x * sigmoid(beta * x).
@anmikh I see. I suppose if we go into all the trouble of actually implementing Swish as a dedicated layer, it should be as efficient as possible and also cover the updated variant with the beta parameter.
@shaibagon Of course Swish can be implemented as a combination of the existing layers. But it's more convenient to use a "Swish" layer directly in the .prototxt file rather than generating it with Python code. On the other hand, I agree that it's more difficult to maintain code for new layers; they need to be newly tested and so on... What do you mean about my GPU implementation? Do you mean that intermediate storing of the sigmoid values shouldn't be used? For sure, I'll add a version with the beta parameter.
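For context, a sketch of the gradient (written here for the beta-free v1 definition) that makes caching the sigmoid output worthwhile in the backward pass:

```
% f is Swish without beta; sigma is the logistic sigmoid
\begin{aligned}
f(x)  &= x\,\sigma(x) \\
f'(x) &= \sigma(x) + x\,\sigma(x)\bigl(1-\sigma(x)\bigr) \\
      &= f(x) + \sigma(x)\bigl(1 - f(x)\bigr)
\end{aligned}
```

With the stored sigmoid values and the top (forward) data, the backward pass needs no extra exponentials.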
@anmikh I wish to thank you again for your effort on making this high quality contribution to caffe! THANKS!
@shaibagon I implemented the second version of Swish, with the beta parameter.
Thanks for your contribution! I have a problem when I use the Swish function, so I really need some help.
I met an error when loading my net (the Swish layer type was not recognized).
How can I deal with it? PS: My roommate has helped me out just now; we need to add the following at the end of "swish_layer.cpp": INSTANTIATE_CLASS(SwishLayer); REGISTER_LAYER_CLASS(Swish);
@anmikh : Meanwhile a new variant is proposed which outperforms the vanilla Swish. Check this paper : https://arxiv.org/abs/1801.07145v1 |
@Coderx7 This brings up the question again: should we have a separate layer for every new activation function out there, even if it is just a simple combination of existing functions? E-Swish is essentially Swish multiplied by a fixed constant beta, so it could be assembled from existing blocks. Normally I would argue that Caffe should be a collection of layers with orthogonal functionalities, with the user being responsible for arranging complex functions from them like building blocks. In this case (Swish) I am still not sure if we should include it in the framework... it is more elegant than building it from existing layers (3 layers for an activation function starts to look clumsy; now multiply that by the number of nonlinearities in the net), but whether it is faster or more numerically stable (the reason why we have SoftmaxWithLoss, for example) - we don't know. Would you mind sharing your general opinion on this, @shelhamer, @naibaf7?
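For illustration only (this assumes the Swish layer from this PR plus the existing Power layer; the beta value 1.5 is an arbitrary example, not from the thread), E-Swish could be composed like this:

```
layer {
  name: "act_swish"
  type: "Swish"
  bottom: "conv1"
  top: "act_swish"
}
layer {
  # Power computes (shift + scale * x) ^ power, so this is a fixed scaling by beta
  name: "act_eswish"
  type: "Power"
  bottom: "act_swish"
  top: "act_eswish"
  power_param { power: 1 scale: 1.5 shift: 0 }
}
```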
@Noiredd: When we have a complete implementation that works decently, why not use it? Yes, I get it: implementing it through what we have available is the first thing that comes to mind. But it gets really tedious if you are going to implement everything this way and have it tested.
Performance-wise it makes sense to have as few kernel calls as possible, especially on OpenCL. I also agree that it becomes tedious (and slow!) to have a combination of layers instead of a fixed function that does it all in one. And the PR LGTM. Could be merged.
@naibaf7, @Coderx7 that sounds good to me. The only thing this PR misses is REGISTER_LAYER_CLASS. I also agree that all possible variants should be included in one layer. So we can either merge this as is and optionally add the other ones in another PR, or maybe you want to include @Coderx7's suggestion in this one?
Of course, I'll add REGISTER_LAYER_CLASS and test everything again in 4 days, on Monday. I suppose it would be better to finish this PR as is and then, based on the new paper mentioned above, improve the existing functionality.
@Noiredd I've added the missing REGISTER_LAYER_CLASS to this layer and tested it.
Thank you @anmikh, merged!
* added swish layer (cpu)
* swish layer: added tests
* swish layer: optimized backpropagation
* swish layer: added cuda implementation
* swish layer: added beta parameter
* swish layer: incorporated sigmoid layer
* swish layer: fix comment of last added parameter
* swish layer: added REGISTER_LAYER_CLASS
@hezw2016 I need to check the actual implementation, but I'm afraid even when the exponent is ...
Added a novel activation function, named "Swish" (paper: https://arxiv.org/abs/1710.05941). Swish has the properties of one-sided boundedness at zero, smoothness, and non-monotonicity. Experiments, described in the paper above, show that Swish tends to work better than ReLU on deeper models across a number of challenging datasets.
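A minimal usage sketch of the new layer (the swish_param / beta field follows the "added beta parameter" commit above; the exact field names should be checked against the caffe.proto changes in this PR):

```
layer {
  name: "conv1_swish"
  type: "Swish"
  bottom: "conv1"
  top: "conv1_swish"
  swish_param { beta: 1.0 }  # default beta = 1 gives f(x) = x * sigmoid(x)
}
```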