
Added Swish layer #6002

Merged: 8 commits into BVLC:master, Mar 17, 2018
Conversation

@anmikh (Contributor) commented Oct 20, 2017

Added a novel activation function, named "Swish" (paper: https://arxiv.org/abs/1710.05941). Swish has the properties of one-sided boundedness at zero, smoothness, and non-monotonicity. Experiments, described in the paper above, show that Swish tends to work better than ReLU on deeper models across a number of challenging datasets.
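For reference, a minimal NumPy sketch (mine, not code from this PR) of the activation and of the derivative form typically used in a backward pass; the beta parameter anticipates v2 of the paper and defaults to 1, which recovers the definition used here.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    """Swish forward pass: f(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)

def swish_grad(x, beta=1.0):
    """Derivative: f'(x) = beta*f(x) + sigmoid(beta*x) * (1 - beta*f(x))."""
    f = swish(x, beta)
    s = sigmoid(beta * x)
    return beta * f + s * (1.0 - beta * f)

x = np.linspace(-5.0, 5.0, 11)
print(swish(x))       # bounded below, unbounded above, non-monotonic near zero
print(swish_grad(x))  # smooth everywhere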

@JokaHD commented Oct 28, 2017

Is it necessary to modify caffe.proto to add the Swish layer?

@Noiredd (Member) commented Oct 30, 2017

@a26908456 Swish does not require any parameters (like ReLU), so there is no need to modify caffe.proto.

@shaibagon (Member):

@anmikh - thank you for contributing to caffe!

If I understand the cited paper correctly, then "swish" amounts to y = x*sigmoid(beta*x).
Why don't you use existing caffe layers to implement this?

layer {
  name: "swish/beta_x"
  type: "Scale"
  bottom: "x"
  top: "swish/beta_x"
  param { lr_mult: 1 decay_mult: 0 }
  scale_param { bias_term: false num_axes: 0 filler { type: "constant" value: 1 } }
}
layer {
  name: "swish/sig"
  type: "Sigmoid"
  bottom: "swish/beta_x"
  top: "swish/sig"
}
layer {
  name: "swish"
  type: "Eltwise"
  bottom: "x"
  bottom: "swish/sig"
  eltwise_param { operation: PROD }
}

The advantage of using existing layers:

  1. You have more flexibility w.r.t. the scale beta: learn it, omit it, make it a scalar or per-channel, etc.
  2. You get a GPU implementation(!): you cannot seriously train a model when your activation is a CPU-only layer.
  3. The layers are already tested and in frequent use: more eyes to fix bugs and improve performance.

If you are troubled by the fact that you need to write a longer prototxt:

  1. You can use NetSpec() (the Python interface) to write the proto for you, and have a Python function that outputs "swish" activations for you (a rough sketch follows after this list).
  2. You can implement a layer that is composed of existing layers (see, e.g., the "Scale" layer, which has a "Bias" layer inside, or the "SoftmaxWithLoss" layer, which has "Softmax" inside).
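A rough sketch of option 1 (mine, not part of this PR), assuming pycaffe's NetSpec; the helper name and blob names are arbitrary, and it simply emits the Scale -> Sigmoid -> Eltwise(PROD) composition shown above:

import caffe
from caffe import layers as L, params as P

def swish_block(bottom, learn_beta=True):
    # beta * x: a single scalar Scale (num_axes: 0) initialised to 1;
    # with learn_beta=False the lr_mult of 0 keeps beta fixed at 1
    beta_x = L.Scale(bottom,
                     param=dict(lr_mult=1 if learn_beta else 0, decay_mult=0),
                     scale_param=dict(bias_term=False, num_axes=0,
                                      filler=dict(type='constant', value=1)))
    sig = L.Sigmoid(beta_x)                                   # sigmoid(beta * x)
    prod = L.Eltwise(bottom, sig, operation=P.Eltwise.PROD)   # x * sigmoid(beta * x)
    return beta_x, sig, prod

n = caffe.NetSpec()
n.x = L.Input(shape=dict(dim=[1, 16, 32, 32]))
n.swish_beta_x, n.swish_sig, n.swish = swish_block(n.x)
print(n.to_proto())  # emits the three-layer prototxt shown above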

@shaibagon (Member):

Many other activations have been published recently, with all sorts of guarantees, and it is very tempting to have a different "layer" for each and every one of them. However, I think a more "long term" approach is to build upon existing layers as building blocks, and to only implement layers that cannot be constructed from the existing arsenal.
For instance, "SELU" can be implemented as a "Scale" layer combined with an "ELU" layer, and so on.
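For concreteness, a hedged NetSpec sketch of that SELU composition (mine, not part of this PR); the alpha and lambda constants are the fixed values from the SELU paper, and lr_mult: 0 keeps the scale frozen:

import caffe
from caffe import layers as L

SELU_ALPHA = 1.6732632423543772   # fixed ELU alpha from the SELU paper
SELU_LAMBDA = 1.0507009873554805  # fixed output scale (lambda) from the SELU paper

n = caffe.NetSpec()
n.x = L.Input(shape=dict(dim=[1, 16, 32, 32]))
n.elu = L.ELU(n.x, alpha=SELU_ALPHA)
# SELU(x) = lambda * ELU(x; alpha): a frozen (lr_mult: 0) scalar Scale applies lambda
n.selu = L.Scale(n.elu, param=dict(lr_mult=0, decay_mult=0),
                 scale_param=dict(bias_term=False, num_axes=0,
                                  filler=dict(type='constant', value=SELU_LAMBDA)))
print(n.to_proto())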

It is very difficult to maintain code for so many layers when, in fact, most of the code basically implements the same functions.

@shaibagon (Member):

BTW, I just came across a Swish implementation with GPU support. See here.

@anmikh (Contributor, Author) commented Oct 30, 2017

@shaibagon @a26908456 In version 1 of the original paper (https://arxiv.org/abs/1710.05941v1), Swish is just y = x*sigmoid(x), so you don't need to modify caffe.proto to use it.
But on the 27th of October, version 2 of the paper was released, and a beta parameter is now needed.

@shaibagon (Member):

@anmikh I see. I suppose that if we go to all the trouble of actually implementing "Swish", we would like to have beta as an optional parameter? As for a GPU implementation, have you looked at the implementation I linked to?
What is your opinion on realizing "Swish" as a combination of existing layers?

@anmikh (Contributor, Author) commented Oct 30, 2017

@shaibagon Of course, Swish can be implemented as a combination of the existing layers. But it's more convenient to use a "Swish" layer directly in the .prototxt file rather than generating it with Python code. On the other hand, I agree that it's harder to maintain code for new layers; they need to be tested from scratch, and so on...

What do you mean about my GPU implementation? Do you mean that the intermediate sigmoid values shouldn't be stored?

For sure, I'll add a version with the beta parameter.

@shaibagon (Member):

@anmikh
(1) Please disregard my comment about the GPU; I did not see your .cu submission. Sorry.
(2) Have you considered using C++ code to build "Swish" as a combination of existing layers without duplicating code (like solution 2 I proposed)?
(3) I think unit testing should include a test for the forward function, not only for the gradient. This test should exist even if we opt to implement "Swish" as a composition of layers.

shaibagon requested review from shaibagon and removed the review request, October 30, 2017 10:38
@anmikh (Contributor, Author) commented Oct 30, 2017

@shaibagon

  1. Ok
  2. I thought about a combination of layers when I was implementing Swish, but at first decided to implement it without one (initially I didn't save the intermediate sigmoid values). Now that I do save the intermediate sigmoid values, and to avoid duplication, I agree that a combination of existing layers, as in SoftmaxWithLoss and others, is more correct. I'll implement it that way in the new version of Swish with the beta parameter.
  3. The forward-function test was implemented here.

@shaibagon (Member):

@anmikh
The intermediate layers can be "in-place", so the sigmoid values need not be stored separately.

I want to thank you again for your effort in making this high-quality contribution to caffe! THANKS!

shaibagon self-requested a review, October 30, 2017 11:31
@anmikh (Contributor, Author) commented Nov 12, 2017

@shaibagon I implemented the 2nd version of Swish with the beta parameter: in 81dfc8f without intermediate layers, and in cdca72e with an intermediate sigmoid layer to avoid code duplication.

@ldsyou92 commented Dec 19, 2017

Thanks for your contribution! I have a problem when I use the Swish function, so I really need some help.
The problem is: I have successfully compiled caffe, but when I use Swish in my prototxt file, like

layer {
  name: "swish1"
  type: "Swish"
  bottom: "conv1"
  top: "conv1"
}

I get the following error:

I1219 18:28:34.587666  8240 layer_factory.hpp:77] Creating layer swish1
F1219 18:28:34.587718  8240 layer_factory.hpp:81] Check failed: registry.count(type) == 1 (0 vs. 1) Unknown layer type: Swish (known types: AbsVal, Accuracy, ArgMax, BNLL, BatchNorm, BatchReindex, Bias, Concat, ContrastiveLoss, Convolution, Crop, Data, Deconvolution, Dropout, DummyData, ELU, Eltwise, Embed, EuclideanLoss, Exp, Filter, Flatten, HDF5Data, HDF5Output, HingeLoss, Im2col, ImageData, InfogainLoss, InnerProduct, Input, LRN, LSTM, LSTMUnit, Log, MVN, MemoryData, MultinomialLogisticLoss, PReLU, Parameter, Pooling, Power, Python, RNN, ReLU, Reduction, Reshape, SPP, Scale, Sigmoid, SigmoidCrossEntropyLoss, Silence, Slice, Softmax, SoftmaxWithLoss, Split, TanH, Threshold, Tile, WindowData)

How can I deal with it?

PLUS: My roommate has just helped me out; we need to add the following lines at the end of "swish_layer.cpp":

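// INSTANTIATE_CLASS instantiates the float and double templates of the layer;
// REGISTER_LAYER_CLASS adds "Swish" to the layer factory so it can be created
// from the prototxt type name.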
INSTANTIATE_CLASS(SwishLayer);
REGISTER_LAYER_CLASS(Swish);

Noiredd removed the focus label, Feb 9, 2018
@Coderx7 (Contributor) commented Mar 6, 2018

@anmikh: Meanwhile, a new variant has been proposed that outperforms vanilla Swish. Check this paper: https://arxiv.org/abs/1801.07145v1
Now that you have implemented Swish, I think it would be a good idea to have this implemented as well.

@Noiredd (Member) commented Mar 7, 2018

@Coderx7 This raises the question again: should we have a separate layer for every new activation function out there, even if it is just a simple combination of existing functions? E-Swish is essentially beta*swish(x), which can already be constructed with a PowerLayer (multiply by beta), a SigmoidLayer, and an EltwiseLayer (multiply beta*x by the sigmoid of x) - or a PowerLayer plus Swish, if we decide to merge this.

Normally I would argue that Caffe should be a collection of layers with orthogonal functionality, with the user responsible for arranging complex functions from them like building blocks. In this case (Swish) I am still not sure whether we should include it in the framework... it is more elegant than building it from existing layers (three layers for an activation function starts to look clumsy; now multiply that by the number of nonlinearities in the net), but whether it is faster or more numerically stable (the reason we have SoftmaxWithLoss, for example) - we don't know.
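For reference, a rough NetSpec sketch (mine, not part of this PR) of the two constructions just described; the beta value is only illustrative, and variant B assumes the Swish layer from this PR is available:

import caffe
from caffe import layers as L, params as P

BETA = 1.5  # illustrative E-Swish beta; the paper explores values between 1 and 2

n = caffe.NetSpec()
n.x = L.Input(shape=dict(dim=[1, 16, 32, 32]))

# Variant A, from generic layers: beta*x (Power), sigmoid(x), and their product
n.beta_x = L.Power(n.x, power=1.0, scale=BETA, shift=0.0)
n.sig = L.Sigmoid(n.x)
n.eswish_a = L.Eltwise(n.beta_x, n.sig, operation=P.Eltwise.PROD)

# Variant B, if this PR is merged: scale the Swish output by beta
n.swish = L.Swish(n.x)
n.eswish_b = L.Power(n.swish, power=1.0, scale=BETA, shift=0.0)

print(n.to_proto())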

Would you mind sharing your general opinion on this, @shelhamer, @naibaf7?

@Coderx7 (Contributor) commented Mar 7, 2018

@Noiredd: When we have a complete implementation that works decently, why not use it? Yes, I get it: implementing it with what is already available is the first thing that comes to mind. But it gets really tedious if you are going to implement everything this way and have it tested.

@naibaf7 (Member) commented Mar 7, 2018

Performance-wise it makes sense to have as few kernel calls as possible, especially on OpenCL.
I think Swish and Swish-with-beta should be implemented in the same layer, though.
Putting all available variants of Swish behind a configuration option in the Swish layer is the best way to go about it, I think.

I also agree that it becomes tedious (and slow!) to have a combination of layers instead of a fixed function that does it all-in-one.

And the PR LGTM. Could be merged.

@Noiredd (Member) commented Mar 7, 2018

@naibaf7, @Coderx7 that sounds good to me.

The only thing this PR is missing is REGISTER_LAYER_CLASS (see @ldsyou92's comment above) - @anmikh, please add that and we'll merge.

I also agree that all possible variants should be included in one layer. So we can either merge this as-is and optionally add the other variants in another PR, or maybe you want to include @Coderx7's suggestion in this one?

@anmikh (Contributor, Author) commented Mar 8, 2018

Of course, I'll add REGISTER_LAYER_CLASS and test everything again in 4 days, on Monday.

I suppose it would be better to finish this PR as-is and then, based on the new paper mentioned above, improve the existing functionality.

@anmikh (Contributor, Author) commented Mar 13, 2018

@Noiredd I've added the missing registration to this layer and tested it.

Noiredd merged commit dabbc91 into BVLC:master, Mar 17, 2018
@Noiredd (Member) commented Mar 17, 2018

Thank you @anmikh, merged!

beniz pushed a commit to jolibrain/caffe that referenced this pull request Mar 25, 2018
* added swish layer (cpu)

* swish layer: added tests

* swish layer: optimized backpropogation

* swish layer: added cuda implementation

* swish layer: added beta parameter

* swish layer: incorporated sigmoid layer

* swish layer: fix comment of last added parameter

* swish layer: added REGISTER_LAYER_CLASS
@hezw2016 commented:

@shaibagon
"For instance "SELU" can be implemented as a "Scale" layer with "ELU" layer."
Given that the alpha and lambda values are fixed, I think a "Power" layer would be more suitable than "Scale".

@shaibagon (Member):

@hezw2016 I need to check the actual implementation, but I'm afraid that even when the exponent is 1, the "Power" layer still calls the pow() function, which is expensive.

XinYao1994 pushed a commit to XinYao1994/caffe that referenced this pull request Aug 29, 2018