
Added Swish layer #6002

Merged: 8 commits into BVLC:master, Mar 17, 2018
Conversation

@anmikh (Contributor) commented Oct 20, 2017

Added a novel activation function, named "Swish" (paper: https://arxiv.org/abs/1710.05941). Swish has the properties of one-sided boundedness at zero, smoothness, and non-monotonicity. Experiments, described in the paper above, show that Swish tends to work better than ReLU on deeper models across a number of challenging datasets.
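For reference, a minimal NumPy sketch (mine, not code from this PR) of the activation and of the derivative form typically used in a backward pass; the beta parameter anticipates v2 of the paper and defaults to 1, which recovers the definition used here.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    """Swish forward pass: f(x) = x * sigmoid(beta * x)."""
    return x * sigmoid(beta * x)

def swish_grad(x, beta=1.0):
    """Derivative: f'(x) = beta*f(x) + sigmoid(beta*x) * (1 - beta*f(x))."""
    f = swish(x, beta)
    s = sigmoid(beta * x)
    return beta * f + s * (1.0 - beta * f)

x = np.linspace(-5.0, 5.0, 11)
print(swish(x))       # bounded below, unbounded above, non-monotonic near zero
print(swish_grad(x))  # smooth everywhere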

@JokaHD commented Oct 28, 2017

Is it necessary to modify caffe.proto to add the Swish layer?

@Noiredd (Member) commented Oct 30, 2017

@a26908456 Swish does not require any parameters (like ReLU), so there is no need to modify caffe.proto.

@shaibagon (Member):

@anmikh - thank you for contributing to caffe!

If I understand the cited paper correctly, then "swish" amounts to y = x*sigmoid(beta*x).
Why don't you use existing caffe layers to implement this?

layer {
  name: "swish/beta_x"
  type: "Scale"
  bottom: "x"
  top: "swish/beta_x"
  param { lr_mult: 1 decay_mult: 0 }
  scale_param { bias_term: false num_axes: 0 filler { type: "constant" value: 1 } }
}
layer {
  name: "swish/sig"
  type: "Sigmoid"
  bottom: "swish/beta_x"
  top: "swish/sig"
}
layer {
  name: "swish"
  type: "Eltwise"
  bottom: "x"
  bottom: "swish/sig"
  eltwise_param { operation: PROD }
}

The advantage of using existing layers:

  1. You have more flexibility w.r.t. the scale beta: learn it, omit it, make it a scalar or per-channel, etc.
  2. You get a GPU implementation(!): you cannot seriously train a model when your activation is a CPU-only layer.
  3. The layers are already tested and in frequent use: more eyes to fix bugs and improve performance.

If you are troubled by the fact that you need to write a longer prototxt:

  1. You can use NetSpec() (the Python interface) to write the proto for you, and have a Python function that outputs "swish" activations for you (a rough sketch follows after this list).
  2. You can implement a layer that is composed of existing layers (see, e.g., the "Scale" layer, which has a "Bias" layer inside, or the "SoftmaxWithLoss" layer, which has "Softmax" inside).
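A rough sketch of option 1 (mine, not part of this PR), assuming pycaffe's NetSpec; the helper name and blob names are arbitrary, and it simply emits the Scale -> Sigmoid -> Eltwise(PROD) composition shown above:

import caffe
from caffe import layers as L, params as P

def swish_block(bottom, learn_beta=True):
    # beta * x: a single scalar Scale (num_axes: 0) initialised to 1;
    # with learn_beta=False the lr_mult of 0 keeps beta fixed at 1
    beta_x = L.Scale(bottom,
                     param=dict(lr_mult=1 if learn_beta else 0, decay_mult=0),
                     scale_param=dict(bias_term=False, num_axes=0,
                                      filler=dict(type='constant', value=1)))
    sig = L.Sigmoid(beta_x)                                   # sigmoid(beta * x)
    prod = L.Eltwise(bottom, sig, operation=P.Eltwise.PROD)   # x * sigmoid(beta * x)
    return beta_x, sig, prod

n = caffe.NetSpec()
n.x = L.Input(shape=dict(dim=[1, 16, 32, 32]))
n.swish_beta_x, n.swish_sig, n.swish = swish_block(n.x)
print(n.to_proto())  # emits the three-layer prototxt shown above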

@shaibagon (Member):

Many other activations have been published recently, with all sorts of guarantees, and it is very tempting to have a different "layer" for each and every one of them. However, I think a more "long term" approach is to build upon existing layers as building blocks, and to only implement layers that cannot be constructed from the existing arsenal.
For instance, "SELU" can be implemented as a "Scale" layer combined with an "ELU" layer, and so on.
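For concreteness, a hedged NetSpec sketch of that SELU composition (mine, not part of this PR); the alpha and lambda constants are the fixed values from the SELU paper, and lr_mult: 0 keeps the scale frozen:

import caffe
from caffe import layers as L

SELU_ALPHA = 1.6732632423543772   # fixed ELU alpha from the SELU paper
SELU_LAMBDA = 1.0507009873554805  # fixed output scale (lambda) from the SELU paper

n = caffe.NetSpec()
n.x = L.Input(shape=dict(dim=[1, 16, 32, 32]))
n.elu = L.ELU(n.x, alpha=SELU_ALPHA)
# SELU(x) = lambda * ELU(x; alpha): a frozen (lr_mult: 0) scalar Scale applies lambda
n.selu = L.Scale(n.elu, param=dict(lr_mult=0, decay_mult=0),
                 scale_param=dict(bias_term=False, num_axes=0,
                                  filler=dict(type='constant', value=SELU_LAMBDA)))
print(n.to_proto())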

It is very difficult to maintain code for so many layers when, in fact, most of the code basically implements the same functions.

@shaibagon (Member):

BTW, I just came across a Swish implementation with GPU support. See here.

@anmikh (Contributor, Author) commented Oct 30, 2017

@shaibagon @a26908456 In version 1 of the original paper (https://arxiv.org/abs/1710.05941v1), Swish is just y = x*sigmoid(x), so you don't need to modify caffe.proto to use it.
But on the 27th of October, version 2 of the paper was released, and a beta parameter is now needed.

@shaibagon (Member):

@anmikh I see. I suppose that if we go to all the trouble of actually implementing "Swish", we would like to have beta as an optional parameter? As for a GPU implementation, have you looked at the implementation I linked to?
What is your opinion on realizing "Swish" as a combination of existing layers?

@anmikh (Contributor, Author) commented Oct 30, 2017

@shaibagon Of course, Swish can be implemented as a combination of the existing layers. But it's more convenient to use a "Swish" layer directly in the .prototxt file rather than generating it with Python code. On the other hand, I agree that it's harder to maintain code for new layers; they need to be tested from scratch, and so on...

What do you mean about my GPU implementation? Do you mean that the intermediate sigmoid values shouldn't be stored?

For sure, I'll add a version with the beta parameter.

@shaibagon (Member):

@anmikh
(1) Please disregard my comment about the GPU; I did not see your .cu submission. Sorry.
(2) Have you considered using C++ code to build "Swish" as a combination of existing layers without duplicating code (like solution 2 I proposed)?
(3) I think unit testing should include a test for the forward function, not only for the gradient. This test should exist even if we opt to implement "Swish" as a composition of layers.

shaibagon requested review from shaibagon and removed the review request, October 30, 2017 10:38
@anmikh (Contributor, Author) commented Oct 30, 2017

@shaibagon

  1. Ok
  2. I thought about a combination of layers when I was implementing Swish, but at first decided to implement it without one (initially I didn't save the intermediate sigmoid values). Now that I do save the intermediate sigmoid values, and to avoid duplication, I agree that a combination of existing layers, as in SoftmaxWithLoss and others, is more correct. I'll implement it that way in the new version of Swish with the beta parameter.
  3. The forward-function test was implemented here.

@shaibagon (Member):

@anmikh
The intermediate layers can be "in-place", so the sigmoid values need not be stored separately.

I want to thank you again for your effort in making this high-quality contribution to caffe! THANKS!

shaibagon self-requested a review, October 30, 2017 11:31
@anmikh (Contributor, Author) commented Nov 12, 2017

@shaibagon I implemented the 2nd version of Swish with the beta parameter: in 81dfc8f without intermediate layers, and in cdca72e with an intermediate sigmoid layer to avoid code duplication.

@ldsyou92 commented Dec 19, 2017

Thanks for your contribution! I have a problem when I use the Swish function, so I really need some help.
The problem is: I have successfully compiled caffe, but when I use Swish in my prototxt file, like

layer {
  name: "swish1"
  type: "Swish"
  bottom: "conv1"
  top: "conv1"
}

I get the following error:

I1219 18:28:34.587666  8240 layer_factory.hpp:77] Creating layer swish1
F1219 18:28:34.587718  8240 layer_factory.hpp:81] Check failed: registry.count(type) == 1 (0 vs. 1) Unknown layer type: Swish (known types: AbsVal, Accuracy, ArgMax, BNLL, BatchNorm, BatchReindex, Bias, Concat, ContrastiveLoss, Convolution, Crop, Data, Deconvolution, Dropout, DummyData, ELU, Eltwise, Embed, EuclideanLoss, Exp, Filter, Flatten, HDF5Data, HDF5Output, HingeLoss, Im2col, ImageData, InfogainLoss, InnerProduct, Input, LRN, LSTM, LSTMUnit, Log, MVN, MemoryData, MultinomialLogisticLoss, PReLU, Parameter, Pooling, Power, Python, RNN, ReLU, Reduction, Reshape, SPP, Scale, Sigmoid, SigmoidCrossEntropyLoss, Silence, Slice, Softmax, SoftmaxWithLoss, Split, TanH, Threshold, Tile, WindowData)

How can I deal with it?

PLUS: My roommate has just helped me out; we need to add the following lines at the end of "swish_layer.cpp":

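// INSTANTIATE_CLASS instantiates the float and double templates of the layer;
// REGISTER_LAYER_CLASS adds "Swish" to the layer factory so it can be created
// from the prototxt type name.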
INSTANTIATE_CLASS(SwishLayer);
REGISTER_LAYER_CLASS(Swish);

Noiredd removed the focus label, Feb 9, 2018
@Coderx7 (Contributor) commented Mar 6, 2018

@anmikh: Meanwhile, a new variant has been proposed that outperforms vanilla Swish. Check this paper: https://arxiv.org/abs/1801.07145v1
Now that you have implemented Swish, I think it would be a good idea to have this implemented as well.

@Noiredd (Member) commented Mar 7, 2018

@Coderx7 This raises the question again: should we have a separate layer for every new activation function out there, even if it is just a simple combination of existing functions? E-Swish is essentially beta*swish(x), which can already be constructed with a PowerLayer (multiply by beta), a SigmoidLayer, and an EltwiseLayer (multiply beta*x by the sigmoid of x) - or a PowerLayer plus Swish, if we decide to merge this.

Normally I would argue that Caffe should be a collection of layers with orthogonal functionality, with the user responsible for arranging complex functions from them like building blocks. In this case (Swish) I am still not sure whether we should include it in the framework... it is more elegant than building it from existing layers (three layers for an activation function starts to look clumsy; now multiply that by the number of nonlinearities in the net), but whether it is faster or more numerically stable (the reason we have SoftmaxWithLoss, for example) - we don't know.
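For reference, a rough NetSpec sketch (mine, not part of this PR) of the two constructions just described; the beta value is only illustrative, and variant B assumes the Swish layer from this PR is available:

import caffe
from caffe import layers as L, params as P

BETA = 1.5  # illustrative E-Swish beta; the paper explores values between 1 and 2

n = caffe.NetSpec()
n.x = L.Input(shape=dict(dim=[1, 16, 32, 32]))

# Variant A, from generic layers: beta*x (Power), sigmoid(x), and their product
n.beta_x = L.Power(n.x, power=1.0, scale=BETA, shift=0.0)
n.sig = L.Sigmoid(n.x)
n.eswish_a = L.Eltwise(n.beta_x, n.sig, operation=P.Eltwise.PROD)

# Variant B, if this PR is merged: scale the Swish output by beta
n.swish = L.Swish(n.x)
n.eswish_b = L.Power(n.swish, power=1.0, scale=BETA, shift=0.0)

print(n.to_proto())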

Would you mind sharing your general opinion on this, @shelhamer, @naibaf7?

@Coderx7 (Contributor) commented Mar 7, 2018

@Noiredd: When we have a complete implementation that works decently, why not use it? Yes, I get it: implementing it with what is already available is the first thing that comes to mind. But it gets really tedious if you are going to implement everything this way and have it tested.

@naibaf7 (Member) commented Mar 7, 2018

Performance-wise it makes sense to have as few kernel calls as possible, especially on OpenCL.
I think Swish and Swish-with-beta should be implemented in the same layer, though.
Putting all available variants of Swish behind a configuration option in the Swish layer is the best way to go about it, I think.

I also agree that it becomes tedious (and slow!) to have a combination of layers instead of a fixed function that does it all-in-one.

And the PR LGTM. Could be merged.

@Noiredd (Member) commented Mar 7, 2018

@naibaf7, @Coderx7 that sounds good to me.

The only thing this PR is missing is REGISTER_LAYER_CLASS (see @ldsyou92's comment above) - @anmikh, please add that and we'll merge.

I also agree that all possible variants should be included in one layer. So we can either merge this as-is and optionally add the other variants in another PR, or maybe you want to include @Coderx7's suggestion in this one?

@anmikh (Contributor, Author) commented Mar 8, 2018

Of course, I'll add REGISTER_LAYER_CLASS and test everything again in 4 days, on Monday.

I suppose it would be better to finish this PR as-is and then, based on the new paper mentioned above, improve the existing functionality.

@anmikh (Contributor, Author) commented Mar 13, 2018

@Noiredd I've added the missing registration to this layer and tested it.

Noiredd merged commit dabbc91 into BVLC:master, Mar 17, 2018
@Noiredd (Member) commented Mar 17, 2018

Thank you @anmikh, merged!

beniz pushed a commit to jolibrain/caffe that referenced this pull request Mar 25, 2018
* added swish layer (cpu)

* swish layer: added tests

* swish layer: optimized backpropogation

* swish layer: added cuda implementation

* swish layer: added beta parameter

* swish layer: incorporated sigmoid layer

* swish layer: fix comment of last added parameter

* swish layer: added REGISTER_LAYER_CLASS
@hezw2016 commented:

@shaibagon
"For instance "SELU" can be implemented as a "Scale" layer with "ELU" layer."
Given that the alpha and lambda values are fixed, I think a "Power" layer would be more suitable than "Scale".

@shaibagon (Member):

@hezw2016 I need to check the actual implementation, but I'm afraid that even when the exponent is 1, the "Power" layer still calls the pow() function, which is expensive.

XinYao1994 pushed a commit to XinYao1994/caffe that referenced this pull request Aug 29, 2018