
Add PReLU Layer #1940

Merged
merged 1 commit into from Mar 12, 2015
Conversation

@tnarihi
Contributor

tnarihi commented Feb 22, 2015

Replacement of #1880 for master branch development.

@jyegerlehner
Contributor

Thank you for sharing this, tnarihi. I've been running it and I've been seeing improved performance.
Edit: Wait, there's more to this. See discussion below.

@tnarihi
Contributor Author

tnarihi commented Feb 28, 2015

Good to hear that, @jyegerlehner!

* @param param provides PReLUParameter prelu_param,
* with PReLULayer options:
* - init_value (\b optional, default 0.25).
* all negative slopes over channels are set to this value.
Contributor

How about using a Filler for this (like InnerProductLayer and ConvolutionLayer)?

Contributor Author

Agreed. It seems more flexible. In that case, is filler a good name for the parameter?

Contributor

Yup, filler sounds good to me

@jeffdonahue
Contributor

Hey Takuya, thanks for creating this PR. Besides the one comment I made above, this looks good to me. At some point when we figure out a better way of handling composition, it would be good to add a DiagonalInnerProductLayer (which handles the elementwise multiplication by a parameter -- I can clean up my implementation of this and PR it) and give it the responsibility of handling the parameters. If we had such a layer, we could implement PReLU as the composition EltwiseSum(ReLU(x), DiagonalInnerProduct(ReLU(Power(scale = -1, x)))), but based on the MSR work this seems like a useful enough shorthand for those 5 layers to deserve a name. (I guess with the combined kernel calls it may also be significantly faster on GPU?)

(Edit: if you'd prefer, you could separate the appropriate piece of this out into a DiagInnerProductLayer yourself, but this is useful as is.)
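As a quick reference for the function being discussed: PReLU learns one negative slope per channel and computes f(x) = max(0, x) + a_c * min(0, x). A minimal numpy sketch (shapes, names, and the 0.25 init are illustrative, not taken from the PR code):

import numpy as np

def prelu(x, a):
    # x: (N, C, H, W) activations; a: length-C vector of learnable negative slopes
    return np.maximum(x, 0) + a.reshape(1, -1, 1, 1) * np.minimum(x, 0)

x = np.random.randn(2, 3, 4, 4)
a = np.full(3, 0.25)   # the default initial slope mentioned in the layer docs above
y = prelu(x, a)        # positive inputs pass through; negative inputs are scaled by a_c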

@tnarihi
Contributor Author

tnarihi commented Mar 2, 2015

Interesting! I believe adding more primitive layers would be nicer from a research perspective, and reusing them makes the source code more readable, clearer, and easier to maintain.

I will consider separating that piece of code out into a DiagonalInnerProductLayer when your PR arrives, if I can do so without increasing the computational cost.

Thanks, Jeff!

@tnarihi
Contributor Author

tnarihi commented Mar 4, 2015

Changed to use a FillerParameter to set the initial values of the negative slopes.

@hycis

hycis commented Mar 5, 2015

Hi @tnarihi, I tried your PReLU on the cifar10 and mnist examples in Caffe and it gets worse results. With the cifar10 example train_full.sh, ReLU gets accuracy 0.8181; after I change all ReLUs to PReLUs, it drops to 0.7495. Any idea why?

@tnarihi
Contributor Author

tnarihi commented Mar 5, 2015

Thanks for reporting, @hycis. I only tried the cifar_quick example. Let me try to reproduce the results, maybe this weekend. In the meantime, please try different learning rates and initializations.
@jyegerlehner If you have any comments, they would be helpful. Thanks!

@jyegerlehner
Contributor

@tnarihi, @hycis, the improved performance I alluded to above was from a model that had a couple of things different from the one I was comparing it to; one of those things was the use of PReLU. I haven't done a clean A/B comparison of PReLU vs leaky ReLU. Sorry if I wrongly attributed the improvement to the PReLU. I will try a clean comparison of leaky ReLU vs PReLU on my own model.

@ducha-aiki
Contributor

@jyegerlehner There is also a "Very Leaky ReLU", where the negative slope is ~0.1-0.3 rather than 0.01.
It has been used in the Kaggle CIFAR-10 competition (http://blog.kaggle.com/2015/01/02/cifar-10-competition-winners-interviews-with-dr-ben-graham-phil-culliton-zygmunt-zajac/ ). So the improved performance in the MSRA paper could also come from this rather than from the learnability of the slope. However, a learned parameter is much better than manual dark magic.

@jyegerlehner
Contributor

Here's a result comparing the training of two models that are identical except that leaky ReLUs with negative_slope=0.1 are switched out for PReLUs with initial_value=0.1 (I was running the patch from before initial_value was changed over to filler). Both models were initialized from the same .caffemodel (using ../../../caffe/build/tools/caffe-d train --solver=solver.prototxt --weights=net.caffemodel) with the same solver parameters, so the initial state of the two should be identical.

Edit: Updated charts to reflect behavior of latest PR code.

[chart: training loss vs. iterations, leaky ReLU vs. PReLU]

This is the kind of observation that led to my "improved behavior" comment above.

However, there seems to be a problem. I notice that the initial loss computed by Caffe was a bit different for the two cases (25.17 for ReLU, 25.62 for PReLU). It should be identical, since a PReLU with initial_value = 0.1 ought to forward propagate identically to a leaky ReLU with negative_slope = 0.1 before any training has happened, unless I'm confused about PReLUs. Furthermore, setting the learning rate for the PReLU to zero ought to make it behave identically to the ReLU if its initial_value is set to the ReLU's negative_slope. In other words, I have layers in the ReLU net like this:

layer {
  name: "encode1_relu"
  type: "ReLU"
  bottom: "encode1_conv"
  top: "encode1_conv"
  relu_param: {
    negative_slope: 0.1
  }
}

And in the PReLU version, I replace those with:

layer {
  name: "encode1_prelu"
  type: "PReLU"
  bottom: "encode1_conv"
  top: "encode1_conv"
  param {
    lr_mult: 0
  }
  prelu_param: {
    filler { value: 0.1 type: "constant" }
  }
}

So if I do that and run the same test, I should see a loss-vs-iterations curve identical to the ReLU version. I did this, and it turns out they are different.

[chart: training loss vs. iterations, ReLU with negative_slope 0.1 vs. PReLU with lr_mult 0]

So I think this suggests we need to look at the forward and gradient tests of PReLU. We could write a test that forward propagates through a leaky ReLU and a PReLU and asserts they produce the same result (assuming the ReLU's negative_slope is set to the PReLU's initial_value), although I imagine the forward tests must already assert correct behavior. And the same for backward propagation, with the learning rate of the PReLU set to zero. I think I'll go try that.

Anyone see errors in my reasoning, or have better ideas?

Edit: I don't show the loss at iteration =0 just because it changes Y axis scaling enough to obscure the trends later.
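A rough numpy sketch of the consistency check proposed above (just the arithmetic, not the actual Caffe gtest): with the PReLU slope pinned at the leaky ReLU's negative_slope, the forward outputs and the input gradients should agree.

import numpy as np

slope = 0.1
x = np.random.randn(1000)
x = x[np.abs(x) > 1e-3]   # stay away from the kink at zero for the numeric gradient check

leaky = np.where(x > 0, x, slope * x)                  # leaky ReLU, negative_slope = 0.1
prelu = np.maximum(x, 0) + slope * np.minimum(x, 0)    # PReLU with its slope held at 0.1
assert np.allclose(leaky, prelu)                       # forward passes agree

# input gradient of the shared formula, checked against a central finite difference
f = lambda v: np.maximum(v, 0) + slope * np.minimum(v, 0)
eps = 1e-6
analytic = np.where(x > 0, 1.0, slope)
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
assert np.allclose(analytic, numeric, atol=1e-4)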

@jyegerlehner
Contributor

Thanks for pointing that out, @ducha-aiki. Yes, somehow I had stumbled across that. My ReLUs are already very leaky, using negative_slope = 0.1.

@ducha-aiki
Contributor

@jyegerlehner
Do you use in-place computation? Maybe the differences are caused by some issue with in-place computation in PReLU?

@tnarihi
Contributor Author

tnarihi commented Mar 8, 2015

@jyegerlehner Thanks for reporting your experiments; those are really useful. As you suggested, I've just added a new test case that checks whether PReLU produces numbers consistent with leaky ReLU (negative_slope=0.25) in Forward/Backward, and I confirmed the test passes. Please see the final commit I've just made and check whether my code is right.
Now I am suspicious about the behavior of in-place computation, as @ducha-aiki mentioned. I will soon add another test to check whether it works.
Thanks, collaborators!

@tnarihi
Contributor Author

tnarihi commented Mar 8, 2015

Sorry... the test was wrong, but now it passes anyway.

@tnarihi
Contributor Author

tnarihi commented Mar 8, 2015

Now I've figured out that something is wrong with in-place computation in the GPU backward pass. I will look into the GPU code.

[==========] 136 tests from 4 test cases ran. (27581 ms total)
[  PASSED  ] 134 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] NeuronLayerTest/2.TestPReLUInplace, where TypeParam = caffe::FloatGPU
[  FAILED  ] NeuronLayerTest/3.TestPReLUInplace, where TypeParam = caffe::DoubleGPU

 2 FAILED TESTS
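For context on why the in-place case is delicate, a toy numpy illustration (not the Caffe code): both backward terms of PReLU need the original bottom values, which an in-place forward overwrites, so the layer has to keep a copy (the bottom_memory_ blob discussed further down in this thread).

import numpy as np

a = 0.25
x = np.random.randn(8).astype(np.float32)          # bottom data (pre-activation)
saved = x.copy()                                   # the stashed copy an in-place layer must keep

x[...] = np.maximum(x, 0) + a * np.minimum(x, 0)   # in-place forward: bottom now holds top

top_diff = np.ones_like(x)
grad_a = np.sum(top_diff * np.minimum(saved, 0))   # slope gradient needs the original input
grad_x = top_diff * np.where(saved > 0, 1.0, a)    # and so does the input gradient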

@tnarihi
Contributor Author

tnarihi commented Mar 8, 2015

I found the bug. It was due to calling an incorrect API for copying data. Please see the commit message for details. Now it should work correctly. @jyegerlehner @zhongwen @hycis Please try the latest commit version.

@tnarihi
Contributor Author

tnarihi commented Mar 8, 2015

Sorry, I replied to another person.

@jyegerlehner
Contributor

@tnarihi said:

Now I figured out in-place computation in GPU backward was something wrong.

I found the bug.

Wow, that was quick; I had just finished changing over my net.prototxt to remove in-place computation. Most of what I had to say is stale now in light of your new work. Both the ReLU and PReLU models showed improved performance after removing in-place computation. As concerns the conclusions above, it doesn't make any difference: PReLU still performs better. And PReLU with learning rate = 0 and negative slope = 0.1 still gives slightly different results than ReLU with negative slope = 0.1. It's a small difference, but consistent. Not sure if it's worth going through to find the root cause of that difference. I didn't find anything wrong with the PReLU tests, so I'm inclined to think it's good. Could be a methodological error on my part.

Will pull your fix, restore the net.prototxt to in-place computation, and report back the results, which I expect will be identical to the not-in-place computation results.

Edit: Updated charts above with the new results. PReLU still better.

@jyegerlehner
Contributor

@hycis I reproduced the behavior you reported where MNIST/lenet and Cifar examples both perform worse when ReLUs are switched over to PReLUs. Note that both of those models use in-place computation for the ReLUs. I then pulled @tnarihi's latest change that fixes in-place computation, and repeated the test. I found that MNIST/lenet with PReLUs and ReLUs perform nearly identically to each other. In the case of Cifar, I found that PReLU has superior accuracy at 70K iterations: 0.8216 for PReLU model vs 0.8156 accuracy for ReLU. So I think we can call that one solved.

@futurely

It's great that the implementation has been shown to be correct. Will there be an end-to-end example that gets the same results as the paper? Thanks!

@ducha-aiki
Contributor

Will there be an end-to-end example that gets the same results as the paper?

@futurely Surely it would be nice, but these days even peer-reviewed papers are not required to check algorithms on ImageNet (He et al. report performance only on it), which takes >= 2 weeks of GPU work (pretty costly even if you have a free GPU and pay only for electricity).

@tnarihi
Contributor Author

tnarihi commented Mar 10, 2015

Agreed. After this PR is merged, it would be nice if someone reproduced the result of the paper and put it in the Model Zoo! I would like to keep this PR as is.

I came up with another possible fix for this PR. The paper says they don't use weight decay for PReLU:

It is worth noticing that we do not use weight decay (l2 regularization) when updating ai. A weight decay tends to push ai to zero, and thus biases PReLU toward ReLU. Even without regularization, the learned coefficients rarely have a magnitude larger than 1 in our experiments.

Wouldn't it be better to force weight decay for PReLU to be 0? Does anyone have comments or suggestions?

@ducha-aiki
Contributor

I think it is enough to set param { decay_mult: 0 } in an example that explains its usage.

}
}

TYPED_TEST(NeuronLayerTest, TestPReLUInplace) {
Contributor

capitalize Place (TestPReLUInPlace)

@tnarihi
Contributor Author

tnarihi commented Mar 12, 2015

@jeffdonahue Done.

@hycis

hycis commented Mar 12, 2015

@tnarihi I just tried decay_mult: 0 on cifar10_full, but I still only get 0.75. @jyegerlehner I wonder how you got 0.82?

layer {
  name: "relu3"
  type: "PReLU"
  bottom: "conv3"
  top: "conv3"
  param {
    decay_mult: 0
  }
}

@jyegerlehner
Contributor

@jyegerlehner I wonder how you got 0.82?

@hycis I'm running the experiment again. Will report back when it's done. I didn't change any hyperparameters. Just the out-of-the-box examples/cifar10/train_full.sh, with the ReLUs changed to default PReLUs. I also did not use tnarihi's trick of decay_mult = 0.0.

I'm not clear. Is your 0.75 PReLU accuracy after pulling tnarihi's in-place-computation fix?

jeffdonahue added a commit that referenced this pull request Mar 12, 2015
@jeffdonahue jeffdonahue merged commit c67a3fa into BVLC:master Mar 12, 2015
@jeffdonahue
Contributor

Thanks again for the layer and all the fixes @tnarihi.

@jyegerlehner
Contributor

@hycis This time the ReLU version produced 0.8172 accuracy at 70K iterations, and PReLU version produced 0.8177 at 70K iterations.

I could post the modified shell scripts and prototxt I used if that would help you to reproduce the result.

@hycis

hycis commented Mar 12, 2015

@jyegerlehner I am not sure why either. I did a pull and also did as you mentioned, but I'm just not getting the same result as you. It would be great if you could share your prototxt and shell scripts. Thanks. My email is hyciswu@gmail.com.

@jyegerlehner
Contributor

@hycis Well that's troubling. OK here's what I used:

https://gist.github.com/jyegerlehner/b2f073aa8e213f0a9167

Please let us know what you find. I'm worried perhaps I have an error on my end.

@hycis

hycis commented Mar 13, 2015

After I pulled and rebuilt on the latest commit, I was able to improve full cifar10 with PReLU from 0.7562 to 0.8184, and to 0.8193 with decay_mult=0, at 70000 iterations. Thanks @jyegerlehner and @tnarihi.

@jeffdonahue
Contributor

A bit of (mostly useless) Caffe trivia: I just realized that even before this PR we could already implement PReLU (very inefficiently) by composition, at least the !channel_shared version -- the "diagonal" multiplication is equivalent to a 1x1 convolution where num_output and group are both set to the number of input channels. But ConvolutionLayer isn't at all optimized for this case, as it loops over groups, so this PReLU layer is a lot faster. In case anyone is curious, I mean that if conv1 has C channels, this PReLU layer...:

layer {
  name: "conv1-prelu"
  type: "PReLU" param { decay_mult: 0 }
  bottom: "conv1"
  top: "conv1-prelu"
}

...is equivalent to this sequence of layers:

layer {
  name: "conv1-prelu1"
  type: "ReLU"
  bottom: "conv1"
  top: "conv1-prelu1"
}
layer {
  name: "conv1-prelu2"
  type: "Power" power_param { scale: -1 }
  bottom: "conv1"
  top: "conv1-prelu2"
}
layer {
  name: "conv1-prelu3"
  type: "ReLU"
  bottom: "conv1-prelu2"
  top: "conv1-prelu3"
}
layer {
  name: "conv1-prelu4"
  type: "Convolution"
  bottom: "conv1-prelu3"
  top: "conv1-prelu4"
  param { decay_mult: 0 }
  convolution_param {
    bias_term: false
    weight_filler { type: "constant" value: 0.25 }
    kernel_size: 1
    group: C
    num_output: C
  }
}
layer {
  name: "conv1-prelu5"
  type: "Eltwise" eltwise_param { operation: SUM }
  bottom: "conv1-prelu1"
  bottom: "conv1-prelu4"
  top: "conv1-prelu"
}

To be honest though, when I tried both I got slightly different results, so I'm not 100% sure that's right, but I've already spent way more time on this than was warranted, so I won't look into it further...
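For anyone who wants to poke at this without wiring up the prototxt, here is a plain numpy paraphrase of what those five layers compute elementwise (the per-channel grouping is elided, and a stands in for the 1x1 convolution weight):

import numpy as np

def composed(x, a):
    branch1 = np.maximum(x, 0)      # conv1-prelu1: ReLU(x)
    branch2 = np.maximum(-x, 0)     # conv1-prelu2/3: Power(scale: -1) followed by ReLU
    branch2 = a * branch2           # conv1-prelu4: grouped 1x1 conv, constant weight, no bias
    return branch1 + branch2        # conv1-prelu5: elementwise SUM

y = composed(np.random.randn(10), 0.25)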

@shelhamer
Member

@jeffdonahue layer composition is a fine hobby. Thanks for commenting with the PReLU thoughts.

@hycis

hycis commented Mar 24, 2015

@tnarihi Just wondering, is there a way to output the learned slope coefficients of PReLU from the saved caffemodel?

@tnarihi
Contributor Author

tnarihi commented Mar 24, 2015

@hycis
Yes, but it seems like a general Caffe question. There isn't any special difference from other layers such as InnerProduct. If you work with the Caffe Python interface, do something like the following:

net = caffe.Net("<proto_path>", "<caffemodel_path>", caffe.TRAIN)
slopes_blob = net.layers['prelu1'][0]  # your prelu layer name
print slopes_blob.data  # This is a numpy array of the slopes

I haven't tested this script, but something like it should work.

@hycis

hycis commented Mar 24, 2015

@tnarihi Thanks for the quick reply.
I tried net.layers['prelu1'] but got an invalid index type error, so I tried net.params['prelu1'], which gives me some numbers. So I guess net.params corresponds to net.layers?

@tnarihi
Contributor Author

tnarihi commented Mar 24, 2015

Sorry, it should be params.
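For later readers, the corrected snippet in full (same placeholder paths and example layer name as above; likewise untested, just a sketch):

import caffe

net = caffe.Net("<proto_path>", "<caffemodel_path>", caffe.TRAIN)
slopes_blob = net.params['prelu1'][0]   # the PReLU layer's single parameter blob
print(slopes_blob.data)                 # numpy array of the learned negative slopes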

@hycis

hycis commented Mar 24, 2015

Hi @tnarihi,
I also observed that the PReLU units for the same feature map have the same slope coefficient? When I check net.params['prelu1'][0].shape for a PReLU layer after a convolution layer, the dimension is equal to the number of feature maps I set for the convolution layer.

@tnarihi
Contributor Author

tnarihi commented Mar 24, 2015

@hycis Right.

@happynear

Hi @tnarihi,
I am wondering why an additional blob "bottom_memory_" is used. When the layer is computed in-place, the bottom_data can be obtained as top_data / slope_data. My GPU memory is only 2 GB, and I do not want to spend more memory on an activation layer.

@tnarihi tnarihi deleted the prelu2 branch April 1, 2015 16:27
@tnarihi
Contributor Author

tnarihi commented Apr 1, 2015

Hi @happynear,

I think that's the case only if the negative slopes are all positive. If we allow slopes to be negative, we cannot recover the pre-activation values from top_data and slope_data alone. Another way to reduce memory consumption is to keep the bottom signs (positive or negative) in a 1-byte array (e.g. int8) instead of the actual values (Dtype = float, 4 bytes); then we can reconstruct the pre-activation values using the signs, top_data and slope_data. Actually, I have one idea in mind for removing the temporary memory (one of the authors of the original paper contacted me and kindly gave me advice), but it requires modifying the Caffe Net/Layer framework, and I don't have time to work on it. If you have any other ideas, I'd be happy to discuss.
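A small numpy illustration of that point (toy values only; the reconstruction assumes a nonzero slope):

import numpy as np

def prelu(x, a):
    return np.maximum(x, 0) + a * np.minimum(x, 0)

x = np.array([2.0, -4.0])

# With a positive slope the output keeps the input's sign, so top_data and
# slope_data are enough to recover the pre-activation:
top = prelu(x, 0.25)                                        # [ 2. , -1. ]
assert np.allclose(np.where(top > 0, top, top / 0.25), x)

# With a negative slope, two different inputs collapse onto the same output,
# so top_data and slope_data alone are no longer enough:
assert prelu(2.0, -0.5) == prelu(-4.0, -0.5) == 2.0         # two inputs, same output

# The int8 idea: one stored sign per element restores invertibility
signs = (x >= 0).astype(np.int8)                            # saved during the forward pass
top = prelu(x, -0.5)
assert np.allclose(np.where(signs == 1, top, top / -0.5), x)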

@happynear

@tnarihi
Yeah, I hadn't considered the negative case. The slopes did indeed come out negative in some cases I experimented with in Matlab. Could you tell me what the idea is?

@tnarihi
Contributor Author

tnarihi commented Apr 8, 2015

@happynear Sorry to be late.

  1. We create a global blob shared_buff to serve as a shared buffer.
  2. At every PReLU layer during the forward pass, we copy the pre-activation values (bottom values) into bottom.diff (bottom[0]->mutable_diff()), which is not used during the forward pass.
  3. During the backward pass, every layer that follows a PReLU copies its bottom.diff (PReLU's top.diff = PReLU's bottom.diff) to shared_buff (reshaping if necessary), in order to avoid overwriting the stored PReLU pre-activation.
  4. At every PReLU in backprop, we take the pre-activation values from shared_buff and use them for the backward diff computation.

One naive implementation of this is for every layer to copy its bottom.diff to shared_buff, but that involves unnecessary copying if the following layer is not a PReLU. Otherwise, we would need to implement some kind of communication interface so that layers know what their top/bottom layers are, or introduce switching variables so that layers know whether they should copy their bottom.diff to shared_buff or not.
Does that make sense?

EDIT: This still needs the additional shared_buff memory, but it is much smaller than keeping a buffer for every PReLU when we have many PReLUs.
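A toy numpy paraphrase of those four steps (plain arrays standing in for blobs and their diffs; a sketch of the idea, not a Caffe patch):

import numpy as np

a = 0.25
data = np.random.randn(4, 3).astype(np.float32)   # data blob shared in-place by PReLU and the layer above
diff = np.zeros_like(data)                        # the corresponding diff blob
shared_buff = np.empty_like(data)                 # step 1: one global scratch buffer

# ---- forward ----
diff[...] = data                                            # step 2: stash the pre-activation in the idle diff
data[...] = np.maximum(data, 0) + a * np.minimum(data, 0)   # in-place PReLU forward
# ... the next layer's forward consumes `data` here ...

# ---- backward ----
shared_buff[...] = diff                                     # step 3: rescue the stash before it is clobbered
diff[...] = 1.0                                             # the next layer writes its gradient into the shared diff

x = shared_buff                                             # step 4: PReLU reads its pre-activation back
grad_a = np.sum(diff * np.minimum(x, 0))                    # gradient for the slope
diff[...] = diff * np.where(x > 0, 1.0, a)                  # gradient w.r.t. PReLU's input, written in place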

@tnarihi
Contributor Author

tnarihi commented Apr 8, 2015

Now I think we have two choices:

  1. Use an int8 array (possibly even a bit array) to store the sign of the pre-activation, which reduces the extra memory by 75% (97% with a bit array).
  2. Store the pre-activation in bottom.diff (this needs modifications to Caffe itself beyond PReLULayer).

@futurely futurely mentioned this pull request Apr 9, 2015
@qingqing01

When I use PReLULayer to train an ImageNet model, the loss at the beginning is about 7. But when I fine-tune bvlc_reference_caffenet.caffemodel, the loss at the beginning is about 80! Why is that?

@tnarihi
Contributor Author

tnarihi commented May 5, 2015

Maybe the bigger loss is due to the much larger number of nonzero responses with PReLU, but I am not sure exactly. The reference model was trained with ReLU (not PReLU), which is equivalent to the initial state of a PReLU with the following setting:

layer {
  type: "PReLU"
  ....
  prelu_param { filler { type: "constant" value: 0 } }
}

You should start with this setting. The default value of the negative slope is 0.25.

@qingqing01

@tnarihi Thank you! I want to train a model with PReLU, and I used the reference model to run a fine-tuning experiment with a PReLU model. So I used the default value of the negative slope, 0.25.

@happynear

A new type of ReLU has been designed to address the overfitting problem: http://arxiv.org/abs/1505.00853.
Maybe we should open a new issue.

@qingqing01

@happynear Thank you. I have read this paper. In my experience, the initial negative slope should be adjusted if you fine-tune from a pre-trained model.
