Add PReLU Layer #1940

Merged

merged 1 commit into BVLC:master from tnarihi:prelu2 on Mar 12, 2015

9 participants
Contributor

tnarihi commented Feb 22, 2015

Replacement of #1880 for master branch development.

tnarihi referenced this pull request Feb 22, 2015

Closed

Add PReLULayer #1880

shelhamer added the JD label Feb 22, 2015

Contributor

jyegerlehner commented Feb 27, 2015

Thank you for sharing this, tnarihi. I've been running it, and I've been seeing improved performance.
Edit: Wait, there's more to this. See the discussion below.

Contributor

tnarihi commented Feb 28, 2015

Good to hear that, @jyegerlehner!

@jeffdonahue jeffdonahue and 1 other commented on an outdated diff Mar 1, 2015

include/caffe/neuron_layers.hpp
@@ -654,6 +654,89 @@ class ThresholdLayer : public NeuronLayer<Dtype> {
Dtype threshold_;
};
+/**
+ * @brief Parameterized Rectified Linear Unit non-linearity @f$
+ * y_i = \max(0, x_i) + a_i \min(0, x_i)
+ * @f$. The differences from ReLULayer are 1) negative slopes are
+ * learnable through backprop and 2) negative slopes can vary across
+ * channels.
+ */
+template <typename Dtype>
+class PReLULayer : public NeuronLayer<Dtype> {
+ public:
+ /**
+ * @param param provides PReLUParameter prelu_param,
+ * with PReLULayer options:
+ * - init_value (\b optional, default 0.25).
+ * all negative slopes over channels are set to this value.
@jeffdonahue

jeffdonahue Mar 1, 2015

Contributor

How about using a Filler for this (like InnerProductLayer and ConvolutionLayer)?

@tnarihi

tnarihi Mar 2, 2015

Contributor

Agreed. That seems more flexible. In that case, is filler a good name for the parameter?

@jeffdonahue

jeffdonahue Mar 3, 2015

Contributor

Yup, filler sounds good to me

Contributor

jeffdonahue commented Mar 1, 2015

Hey Takuya, thanks for creating this PR. Besides the one comment I made above, this looks good to me. At some point when we figure out a better way of handling composition, it would be good to add a DiagonalInnerProductLayer (which handles the elementwise multiplication by a parameter -- I can clean up my implementation of this and PR it) and give it the responsibility of handling the parameters. If we had such a layer, we could implement PReLU as the composition EltwiseSum(ReLU(x), DiagonalInnerProduct(ReLU(Power(scale = -1, x)))), but based on the MSR work this seems like a useful enough shorthand for those 5 layers to deserve a name. (I guess with the combined kernel calls it may also be significantly faster on GPU?)

(Edit: if you'd prefer, you could separate the appropriate piece of this out into a DiagInnerProductLayer yourself, but this is useful as is.)
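The composition sketched above can be checked numerically. Below is a minimal numpy sketch (my own illustration, not code from this PR), assuming a single scalar slope a; note that the diagonal weights have to be the negated slopes, since ReLU(Power(scale = -1, x)) = max(0, -x) = -min(0, x):

```python
import numpy as np

def prelu(x, a):
    # PReLU: y = max(0, x) + a * min(0, x)
    return np.maximum(x, 0) + a * np.minimum(x, 0)

def composed(x, a):
    # EltwiseSum(ReLU(x), DiagonalInnerProduct(ReLU(Power(scale=-1, x))))
    # with the diagonal weights set to -a
    relu = lambda v: np.maximum(v, 0)
    return relu(x) + (-a) * relu(-x)

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
assert np.allclose(prelu(x, 0.25), composed(x, 0.25))
```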

Contributor

tnarihi commented Mar 2, 2015

Interesting! I believe adding more primitive layers would be nicer from a research perspective, and reusing them makes the source code more readable, clearer, and easier to maintain.

I will consider separating that piece of code out into a DiagonalInnerProductLayer when your PR arrives, if I can do so without increasing computational cost.

Thanks, Jeff!

Contributor

tnarihi commented Mar 4, 2015

Changed to use FillerParameter to fill the initial values of the negative slopes.

hycis commented Mar 5, 2015

Hi @tnarihi, I tried your PReLU on the cifar10 and mnist examples in caffe and it gets worse results. With the cifar10 example train_full.sh, ReLU gets accuracy 0.8181; after I changed all ReLU to PReLU, it drops to 0.7495. Any idea why?

Contributor

tnarihi commented Mar 5, 2015

Thanks for reporting, @hycis. I only tried the cifar_quick example. Let me try to reproduce the results, maybe this weekend. In the meantime, please try different learning rates and initializations.
@jyegerlehner If you have any comments, it would be helpful. Thanks!

Contributor

jyegerlehner commented Mar 6, 2015

@tnarihi, @hycis, the improved performance I alluded to above was from a model that differed in a couple of ways from the one I was comparing it to; one of those differences was the use of PReLU. I haven't done a clean A/B comparison of PReLU vs. leaky ReLU. Sorry if I wrongly attributed the improvement to PReLU. I will try a clean comparison of leaky ReLU vs. PReLU on my own model.

Contributor

ducha-aiki commented Mar 6, 2015

@jyegerlehner Also, there is a "very leaky ReLU", where the negative slope is ~0.1-0.3 rather than 0.01.
It was used in the Kaggle CIFAR-10 competition (http://blog.kaggle.com/2015/01/02/cifar-10-competition-winners-interviews-with-dr-ben-graham-phil-culliton-zygmunt-zajac/ ). So the improved performance in the MSRA paper could also come from this rather than from the learnability of the slope. However, learning the parameter is much better than manual dark magic.

Contributor

jyegerlehner commented Mar 7, 2015

Here's a result comparing the training of two models that are identical except that leaky ReLUs with negative_slope=0.1 are switched out for PReLUs with initial_value=0.1 (I was running the patch from before initial_value was changed over to filler). Both models were initialized from the same .caffemodel (using ../../../caffe/build/tools/caffe-d train --solver=solver.prototxt --weights=net.caffemodel) with the same solver parameters, so the initial state of the two should be identical.

Edit: Updated charts to reflect behavior of latest PR code.

[image: training-loss charts comparing leaky ReLU and PReLU]

This is the kind of observation that led to my "improved behavior" comment above.

However, there seems to be a problem. I notice that the initial loss computed by caffe was a bit different for the two cases (25.17 for ReLU, 25.62 for PReLU). It should be identical, since a PReLU with initial_value = 0.1 ought to forward-propagate identically to a leaky ReLU with negative_slope = 0.1 before any training has happened, unless I'm confused about PReLUs. Furthermore, setting the learning rate for the PReLU to zero ought to make it behave identically to the ReLU if its initial_value is set to the ReLU's negative_slope. In other words, I have layers in the ReLU net like this:

layer {
  name: "encode1_relu"
  type: "ReLU"
  bottom: "encode1_conv"
  top: "encode1_conv"
  relu_param: {
    negative_slope: 0.1
  }
}

And in the PReLU version, I replace those with:

layer {
  name: "encode1_prelu"
  type: "PReLU"
  bottom: "encode1_conv"
  top: "encode1_conv"
  param {
    lr_mult: 0
  }
  prelu_param: {
    filler { value: 0.1 type: "constant" }
  }
}

So if I do that and run the same test, I should see a loss-vs.-iterations curve identical to the ReLU version. I did this, and it turns out they are different.

[image: loss curves for ReLU vs. PReLU with lr_mult: 0]

So I think this suggests we need to look at the forward and gradient tests of PReLU. We could write a test that forward-propagates through a leaky ReLU and a PReLU and asserts they produce the same result (assuming ReLU's negative_slope is set to PReLU's initial_value), though I imagine the forward tests must already assert correct behavior. The same goes for backward propagation, with the learning rate of the PReLU set to zero. I think I'll go try that.

Anyone see errors in my reasoning, or have better ideas?

Edit: I don't show the loss at iteration 0 only because it changes the Y-axis scaling enough to obscure the later trends.
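The equivalence being probed here can be stated directly: before any update, a PReLU whose slopes are constant-filled with value a must forward-propagate exactly like a leaky ReLU with negative_slope = a. A small numpy sketch of that check (my illustration, not the Caffe test code):

```python
import numpy as np

def leaky_relu(x, negative_slope):
    return np.where(x > 0, x, negative_slope * x)

def prelu(x, slopes):
    # x shaped (N, C, H, W); one slope per channel
    a = slopes.reshape(1, -1, 1, 1)
    return np.maximum(x, 0) + a * np.minimum(x, 0)

rng = np.random.RandomState(0)
x = rng.randn(2, 3, 4, 4)
slopes = np.full(3, 0.1)   # constant filler, value 0.1
# Before any training, the two nonlinearities must agree exactly.
assert np.allclose(prelu(x, slopes), leaky_relu(x, 0.1))
```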

Contributor

jyegerlehner commented Mar 7, 2015

Thanks for pointing that out @ducha-aiki . Yes, somehow I had stumbled across that. My ReLUs are already very leaky, using negative_slope = 0.1.

Contributor

ducha-aiki commented Mar 7, 2015

@jyegerlehner
Do you use in-place computation? Maybe the differences are caused by some issue with in-place computation in PReLU?

Contributor

tnarihi commented Mar 8, 2015

@jyegerlehner Thanks for reporting your experiments; those are really useful. As you suggested, I've just added a new test case that checks whether PReLU produces numbers consistent with leaky ReLU (negative_slope=0.25) in Forward/Backward, and I confirmed that the test passes. Please see the final commit I've just made and check whether my code is right.
Now I am suspicious about the behavior of in-place computation, as @ducha-aiki mentioned. I will soon add another test to check whether it works.
Thanks, collaborators!
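To illustrate why in-place computation is delicate here: the slope gradient needs the original bottom values, but an in-place forward overwrites them with the top values, which is exactly what the bottom_memory_ buffer guards against. A hypothetical numpy illustration (not the actual Caffe code or the actual bug):

```python
import numpy as np

a = 0.25
buf = np.array([-2.0, 1.0, -0.5])   # blob used as both bottom and top (in-place)
saved = buf.copy()                  # the role played by bottom_memory_

buf[:] = np.maximum(buf, 0) + a * np.minimum(buf, 0)   # in-place forward

top_diff = np.ones_like(buf)
# d(loss)/d(a) accumulates top_diff * x over the negative region of x:
slope_diff_good = np.sum(top_diff * saved * (saved <= 0))  # uses saved bottom
slope_diff_bad = np.sum(top_diff * buf * (buf <= 0))       # reads overwritten data
assert not np.isclose(slope_diff_good, slope_diff_bad)
```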

Contributor

tnarihi commented Mar 8, 2015

Sorry... the test was wrong, but now it passes anyway.

Contributor

tnarihi commented Mar 8, 2015

Now I've figured out that in-place computation in the GPU backward pass was wrong. I will look into the GPU code.

[==========] 136 tests from 4 test cases ran. (27581 ms total)
[  PASSED  ] 134 tests.
[  FAILED  ] 2 tests, listed below:
[  FAILED  ] NeuronLayerTest/2.TestPReLUInplace, where TypeParam = caffe::FloatGPU
[  FAILED  ] NeuronLayerTest/3.TestPReLUInplace, where TypeParam = caffe::DoubleGPU

 2 FAILED TESTS
Contributor

tnarihi commented Mar 8, 2015

I found the bug. It was due to calling an incorrect API for copying data; please see the commit message for details. Now it should work correctly. @jyegerlehner @zhongwen @hycis Please try the latest commit.

Contributor

tnarihi commented Mar 8, 2015

Sorry, I replied to another person.

Contributor

jyegerlehner commented Mar 8, 2015

@tnarihi said:

Now I figured out in-place computation in GPU backward was something wrong.

I found the bug.

Wow, that was quick; I had just finished changing over my net.prototxt to remove in-place computation. Most of what I had to say is stale now in light of your new work. Both the ReLU and PReLU nets showed improved performance after removing in-place computation. As concerns the conclusions above, it doesn't make any difference: PReLU still performs better. And PReLU with learning rate = 0 and negative slope = 0.1 still gives slightly different results than ReLU with negative slope = 0.1. It's a small difference, but consistent. I'm not sure it's worth tracking down the root cause of that difference. I didn't find anything wrong with the PReLU tests, so I'm inclined to think it's good; it could be a methodological error on my part.

Will pull your fix, restore the net.prototxt to in-place computation, and report back the results, which I expect will be identical to the not-in-place computation results.

Edit: Updated charts above with the new results. PReLU still better.

Contributor

jyegerlehner commented Mar 8, 2015

@hycis I reproduced the behavior you reported, where the MNIST/lenet and Cifar examples both perform worse when ReLUs are switched over to PReLUs. Note that both of those models use in-place computation for the ReLUs. I then pulled @tnarihi's latest change that fixes in-place computation and repeated the test. I found that MNIST/lenet with PReLUs and with ReLUs perform nearly identically. In the case of Cifar, I found that PReLU has superior accuracy at 70K iterations: 0.8216 for the PReLU model vs. 0.8156 for ReLU. So I think we can call that one solved.

futurely commented:

It's great that the implementation has been proved to be correct. Will there be any end-to-end example that can get the same result of the paper? Thanks!

Contributor

ducha-aiki commented Mar 10, 2015

Will there be any end-to-end example that can get the same result of the paper?

@futurely Surely it would be nice, but nowadays even peer-reviewed papers are not required to validate algorithms on ImageNet (He et al. report performance only on it), which takes >= 2 weeks of GPU work (pretty costly even if you have a free GPU and pay only for electricity).

Contributor

tnarihi commented Mar 10, 2015

Agreed. After this PR is merged, it would be nice if anyone reproduced the result of the paper and put it into the Model Zoo! I would like to keep this PR as is.

I came up with another possible fix for this PR. The paper describes that they don't use weight decay for PReLU:

It is worth noticing that we do not use weight decay (l2 regularization) when updating ai. A weight decay tends to push ai to zero, and thus biases PReLU toward ReLU. Even without regularization, the learned coefficients rarely have a magnitude larger than 1 in our experiments.

Wouldn't it be better to force weight decay for PReLU to be 0? Does anyone have comments or suggestions?

Contributor

ducha-aiki commented Mar 10, 2015

I think it is enough to set param { decay_mult: 0 } in an example which explains its usage.

@jeffdonahue jeffdonahue commented on an outdated diff Mar 10, 2015

src/caffe/test/test_neuron_layer.cpp
+ GaussianFiller<Dtype> filler(filler_param);
+ filler.Fill(tmp_blob.get());
+ caffe_copy(blob_top_2->count(), tmp_blob->cpu_data(),
+ this->blob_top_->mutable_cpu_diff());
+ caffe_copy(blob_top_2->count(), tmp_blob->cpu_data(),
+ blob_top_2->mutable_cpu_diff());
+ vector<bool> propagate_down;
+ propagate_down.push_back(true);
+ prelu.Backward(this->blob_top_vec_, propagate_down, this->blob_bottom_vec_);
+ relu.Backward(blob_top_vec_2, propagate_down, blob_bottom_vec_2);
+ for (int s = 0; s < blob_bottom_2->count(); ++s) {
+ EXPECT_EQ(this->blob_bottom_->cpu_diff()[s], blob_bottom_2->cpu_diff()[s]);
+ }
+}
+
+TYPED_TEST(NeuronLayerTest, TestPReLUInplace) {
@jeffdonahue

jeffdonahue Mar 10, 2015

Contributor

capitalize Place (TestPReLUInPlace)

@jeffdonahue jeffdonahue commented on an outdated diff Mar 10, 2015

src/caffe/layers/prelu_layer.cpp
+ // keep top_diff unchanged.
+ if (this->param_propagate_down_[0]) {
+ Dtype* slope_diff = this->blobs_[0]->mutable_cpu_diff();
+ caffe_set(this->blobs_[0]->count(), Dtype(0), slope_diff);
+ for (int i = 0; i < count; ++i) {
+ int c = (i / hw) % channels / div_factor;
+ slope_diff[c] += top_diff[i] * bottom_data[i] * (bottom_data[i] <= 0);
+ }
+ }
+ // Propagate to bottom
+ if (propagate_down[0]) {
+ Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
+ for (int i = 0; i < count; ++i) {
+ int c = (i / hw) % channels / div_factor;
+ bottom_diff[i] = top_diff[i] * ((bottom_data[i] > 0)
+ + slope_data[c] * (bottom_data[i] <= 0));
@jeffdonahue

jeffdonahue Mar 10, 2015

Contributor

use 4 space indent when continuing statement from previous line
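The two gradient branches in the CPU loop quoted above can be mirrored in numpy to make the math explicit. This is a sketch under the 4D (N, C, H, W) layout with per-channel slopes, not the layer's actual code:

```python
import numpy as np

def prelu_backward(top_diff, bottom_data, slopes):
    # One slope per channel; bottom_data shaped (N, C, H, W).
    a = slopes.reshape(1, -1, 1, 1)
    neg = bottom_data <= 0
    # Slope gradient: accumulate top_diff * x over the negative region
    # of each channel (the slope_diff[c] += ... loop above).
    slope_diff = np.sum(top_diff * bottom_data * neg, axis=(0, 2, 3))
    # Bottom gradient: pass-through where x > 0, scaled by a_c where x <= 0.
    bottom_diff = top_diff * np.where(neg, a, 1.0)
    return slope_diff, bottom_diff

g = np.ones((1, 1, 1, 2))
x = np.array([[[[-1.0, 2.0]]]])
sd, bd = prelu_backward(g, x, np.array([0.5]))
assert np.isclose(sd[0], -1.0) and np.allclose(bd, [[[[0.5, 1.0]]]])
```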

@jeffdonahue jeffdonahue commented on an outdated diff Mar 10, 2015

src/caffe/layers/prelu_layer.cpp
+ const int hw = bottom[0]->height() * bottom[0]->width();
+ const int channels = bottom[0]->channels();
+ const Dtype* slope_data = this->blobs_[0]->cpu_data();
+
+ // For in-place computation
+ if (bottom[0] == top[0]) {
+ caffe_copy(count, bottom_data, bottom_memory_.mutable_cpu_data());
+ }
+
+ // if channel_shared, channel index in the following computation becomes
+ // always zero.
+ const int div_factor = channel_shared_ ? channels : 1;
+ for (int i = 0; i < count; ++i) {
+ int c = (i / hw) % channels / div_factor;
+ top_data[i] = std::max(bottom_data[i], Dtype(0))
+ + slope_data[c] * std::min(bottom_data[i], Dtype(0));
@jeffdonahue

jeffdonahue Mar 10, 2015

Contributor

use 4 space indent when continuing statement from previous line
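The flat-index arithmetic in the forward loop above, c = (i / hw) % channels / div_factor, can be checked with a small numpy sketch (my illustration, not the layer's code). With channel_shared, div_factor equals channels, so c is always 0:

```python
import numpy as np

def prelu_forward(bottom, slopes, channel_shared=False):
    n, channels, h, w = bottom.shape
    hw = h * w
    div_factor = channels if channel_shared else 1
    flat = bottom.ravel()          # same memory order as the C++ loop
    top = np.empty_like(flat)
    for i in range(flat.size):
        c = (i // hw) % channels // div_factor
        top[i] = max(flat[i], 0.0) + slopes[c] * min(flat[i], 0.0)
    return top.reshape(bottom.shape)

x = np.array([[[[-2.0, 1.0]], [[-4.0, 3.0]]]])   # shape (1, 2, 1, 2)
assert np.allclose(prelu_forward(x, np.array([0.5, 0.25])),
                   [[[[-1.0, 1.0]], [[-1.0, 3.0]]]])
```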

@jeffdonahue jeffdonahue commented on an outdated diff Mar 10, 2015

src/caffe/layers/prelu_layer.cpp
+ if (prelu_param.has_filler()) {
+ filler.reset(GetFiller<Dtype>(prelu_param.filler()));
+ } else {
+ FillerParameter filler_param;
+ filler_param.set_type("constant");
+ filler_param.set_value(0.25);
+ filler.reset(GetFiller<Dtype>(filler_param));
+ }
+ filler->Fill(this->blobs_[0].get());
+ }
+ if (channel_shared_) {
+ CHECK_EQ(this->blobs_[0]->count(), 1)
+ << "Negative slope size is inconsistent with prototxt config";
+ } else {
+ CHECK_EQ(this->blobs_[0]->count(), channels)
+ << "Nagative slope size is inconsistent with prototxt config";
@jeffdonahue

jeffdonahue Mar 10, 2015

Contributor

typo -- Negative

Contributor

jeffdonahue commented Mar 10, 2015

Yeah, I don't think we should default to or require a different decay_mult. I'm ready to merge this once the minor style errors I commented above are fixed -- in general 4 space indent should be used instead of 2 space indent to continue lines, I commented on a couple of them but please try to fix throughout. Also, please squash your history into a single commit when you're ready to merge, then comment again and we can merge this.

Thanks @tnarihi!

Contributor

tnarihi commented Mar 11, 2015

@ducha-aiki @jeffdonahue Thanks for giving your opinions regarding weight decay. Let's keep decay_mult settable; I just thought forcing decay to 0 might be beginner-friendly.

@jeffdonahue Thanks for reviewing! I fixed the style errors, then rebased and squashed everything into a single commit.

Contributor

jeffdonahue commented Mar 11, 2015

Sorry for not noticing the above ND blob-related comments in my first pass -- the weights & internal Blob shapes should be shaped like in other layers. Since this uses the 4D indexing, could you also add a CHECK_EQ(4, bottom[0]->num_axes()) like the one in ConvolutionLayer to the top of Reshape? This will not work for >4D blobs, which is safe, as an error will be raised on the first call to num()/channels()/..., but it's slightly friendlier to add an explicit check with a helpful message in Reshape, as the failure on the num() call would probably look confusing.

Contributor

jeffdonahue commented Mar 11, 2015

Actually, I guess it should be a CHECK_GE(4, bottom[0]->num_axes()), as the old methods will return 1 for <4D blobs, and you probably want to be able to use this layer on 2D IP layer outputs, for example.

futurely commented:

@ducha-aiki, why can't we just fine-tune a pre-trained model, replacing ReLU with PReLU, to reduce the training time to tens of hours?

hycis commented Mar 11, 2015

@tnarihi thanks for the fixes. @jyegerlehner thanks for reporting your results; you guys are awesome. May I know what hyperparameters you set for the PReLU for cifar10? What I did is simply change "ReLU" to "PReLU", using all the default hyperparameters, but I get a worse result of 0.75.

layer {
  name: "relu3"
  type: "PReLU"
  bottom: "conv3"
  top: "conv3"
}

Contributor

tnarihi commented Mar 12, 2015

@hycis As I mentioned above, could you try decay_mult: 0?

layer {
  name: "relu3"
  type: "PReLU"
  bottom: "conv3"
  top: "conv3"
  param {
    decay_mult: 0
  }
}
Contributor

tnarihi commented Mar 12, 2015

Thanks for the reminder about N-D blobs, Jeff! I made some modifications to my code and used CHECK_GE(bottom[0]->num_axes(), 2) instead of CHECK_GE(4, bottom[0]->num_axes()). Now we can use any blob shape that has >= 2 axes, and axis 1 (0-based) is treated as channels. Please review my changes. If it is ready, I will squash them into one commit again.
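The generalization can be pictured in numpy (a sketch of the shape handling only, not the layer's code): treat axis 1 as channels and collapse every axis after it into a single inner dimension, which is why the per-channel inner count stops being "hw" and becomes a generic "dim":

```python
import numpy as np

def prelu_forward_nd(bottom, slopes):
    # Any blob with >= 2 axes; axis 1 is the channel axis.
    num, channels = bottom.shape[0], bottom.shape[1]
    x = bottom.reshape(num, channels, -1)   # collapse trailing axes into "dim"
    a = slopes.reshape(1, channels, 1)
    y = np.maximum(x, 0) + a * np.minimum(x, 0)
    return y.reshape(bottom.shape)

# Works for 2D InnerProduct outputs as well as 4D conv outputs:
x2d = np.array([[-1.0, 2.0], [3.0, -4.0]])   # (num, channels)
assert np.allclose(prelu_forward_nd(x2d, np.array([0.5, 0.25])),
                   [[-0.5, 2.0], [3.0, -1.0]])
```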

Contributor

tnarihi commented Mar 12, 2015

Oops, this should work for blobs of arbitrary shape.

EDIT: Theoretically this should work for blobs of any dimensionality, but I think restricting to blobs with >= 2 axes is good enough in practice. I would like to keep this PR restricted to >= 2-axis blobs.

Contributor

jeffdonahue commented Mar 12, 2015

Thanks for updating this to support ND blobs, Takuya! This looks great other than one minor nitpick: the meaning of the hw variable name may be a little unclear now that it doesn't come from multiplying height and width. Otherwise (once squashed) this looks mergeable to me.

Contributor

tnarihi commented Mar 12, 2015

Thanks for reviewing, Jeff! I will modify as you suggested soon.

@tnarihi tnarihi PReLU Layer and its tests
described in Kaiming He et al, "Delving Deep into Rectifiers: Surpassing
Human-Level Performance on ImageNet Classification", arxiv 2015.

Below are the commit messages from the development history.

PReLULayer takes FillerParameter for init

PReLU testing consistency with ReLU

Fix: PReLU test consistency check

PReLU tests in-place computation, and it failed in GPU

Fix: PReLU in-place backward in GPU

PReLULayer called an incorrect API for copying
data (caffe_gpu_memcpy). The first argument of `caffe_gpu_memcpy` should be
the size of the memory region in bytes. I modified it to use the `caffe_copy` function.

Fix: style errors

Fix: number of axes of input blob must be >= 2

Use 1D blob, zero-D blob.

Rename: hw -> dim
bb5bf43
Contributor

tnarihi commented Mar 12, 2015

@jeffdonahue Done.

hycis commented Mar 12, 2015

@tnarihi I just tried decay_mult: 0 on cifar10_full, but still only get 0.75. @jyegerlehner I wonder how you get 0.82?

layer {
  name: "relu3"
  type: "PReLU"
  bottom: "conv3"
  top: "conv3"
  param {
    decay_mult: 0
  }
}

Contributor

jyegerlehner commented Mar 12, 2015

@jyegerlehner I wonder how you get 0.82?

@hycis I'm running the experiment again and will report back when it's done. I didn't change any hyperparameters; I just used the out-of-the-box examples/cifar10/train_full.sh, with the ReLUs changed to default PReLUs. I also did not use tnarihi's trick of decay_mult = 0.0.

One thing I'm not clear on: is your 0.75 PReLU accuracy from after pulling tnarihi's in-place-computation fix?

@jeffdonahue jeffdonahue added a commit that referenced this pull request Mar 12, 2015

@jeffdonahue jeffdonahue Merge pull request #1940 from tnarihi/prelu2
Add PReLU Layer
c67a3fa

@jeffdonahue jeffdonahue merged commit c67a3fa into BVLC:master Mar 12, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Contributor

jeffdonahue commented Mar 12, 2015

Thanks again for the layer and all the fixes @tnarihi.

Contributor

jyegerlehner commented Mar 12, 2015

@hycis This time the ReLU version produced 0.8172 accuracy at 70K iterations, and PReLU version produced 0.8177 at 70K iterations.

I could post the modified shell scripts and prototxt I used if that would help you to reproduce the result.

hycis commented Mar 12, 2015

@jyegerlehner I am not sure why either. I did a pull and also did as you mentioned, but I'm just not getting the same result as you. It would be great if you could share your prototxt and shell scripts. Thanks. My email is hyciswu@gmail.com.

Contributor

jyegerlehner commented Mar 12, 2015

@hycis Well, that's troubling. OK, here's what I used:

https://gist.github.com/jyegerlehner/b2f073aa8e213f0a9167

Please let us know what you find. I'm worried perhaps I have an error on my end.

hycis commented Mar 13, 2015

After I pulled and rebuilt on the latest commit, I was able to improve full cifar10 from 0.7562 to 0.8184, and to 0.8193 with decay_mult=0, at 70000 iterations using PReLU. Thanks @jyegerlehner and @tnarihi!

Contributor

jeffdonahue commented Mar 15, 2015

A bit of (mostly useless) Caffe trivia: I just realized that even before this PR we could already implement PReLU (very inefficiently) by composition, at least the !channel_shared version -- the "diagonal" multiplication is equivalent to a 1x1 convolution where num_output and group are both set to the number of input channels. But ConvolutionLayer isn't at all optimized for this case, as it loops over groups, so this dedicated layer is a lot faster. In case anyone is curious, I mean that if conv1 has C channels, this PReLU layer...:

layer {
  name: "conv1-prelu"
  type: "PReLU" param { decay_mult: 0 }
  bottom: "conv1"
  top: "conv1-prelu"
}

...is equivalent to this sequence of layers:

layer {
  name: "conv1-prelu1"
  type: "ReLU"
  bottom: "conv1"
  top: "conv1-prelu1"
}
layer {
  name: "conv1-prelu2"
  type: "Power" power_param { scale: -1 }
  bottom: "conv1"
  top: "conv1-prelu2"
}
layer {
  name: "conv1-prelu3"
  type: "ReLU"
  bottom: "conv1-prelu2"
  top: "conv1-prelu3"
}
layer {
  name: "conv1-prelu4"
  type: "Convolution"
  bottom: "conv1-prelu3"
  top: "conv1-prelu4"
  param { decay_mult: 0 }
  convolution_param {
    bias_term: false
    weight_filler { type: "constant" value: 0.25 }
    kernel_size: 1
    group: C
    num_output: C
  }
}
layer {
  name: "conv1-prelu5"
  type: "Eltwise" eltwise_param { operation: SUM }
  bottom: "conv1-prelu1"
  bottom: "conv1-prelu4"
  top: "conv1-prelu"
}

To be honest, though, when I tried both I got slightly different results, so I'm not 100% sure that's right; but I've already spent way more time on this than was warranted, so I won't look into it further...
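For what it's worth, here is a quick numpy check of the algebra behind the composition (my own sketch, with a scalar slope a). Since ReLU(Power(scale = -1, x)) = max(0, -x) = -min(0, x), the Eltwise SUM reproduces PReLU only if the 1x1 convolution's diagonal weights are the negated slopes; a positive constant filler of 0.25 flips the sign of the negative branch. This is only a guess about the algebra, not a diagnosis of the experiment above:

```python
import numpy as np

relu = lambda v: np.maximum(v, 0)
x = np.linspace(-2, 2, 9)
a = 0.25

prelu_y = relu(x) + a * np.minimum(x, 0)   # PReLU with slope a
with_neg_w = relu(x) + (-a) * relu(-x)     # 1x1 conv weights filled with -0.25
with_pos_w = relu(x) + a * relu(-x)        # 1x1 conv weights filled with +0.25

assert np.allclose(prelu_y, with_neg_w)      # matches PReLU
assert not np.allclose(prelu_y, with_pos_w)  # negative branch sign flipped
```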

Owner

shelhamer commented Mar 15, 2015

@jeffdonahue layer composition is a fine hobby. Thanks for commenting with
the PReLU thoughts.

hycis commented Mar 24, 2015

@tnarihi Just wondering, is there a way to output the learned slope coefficients of PReLU from the saved caffemodel?

Contributor

tnarihi commented Mar 24, 2015

@hycis
Yes, but it seems like a general Caffe question; there isn't any special difference from other layers such as InnerProduct. If you work with the Caffe Python interface, do something like the following:

net = caffe.Net("<proto_path>", "<caffemodel_path>", caffe.TRAIN)
slopes_blob = net.layers['prelu1'][0]  # your prelu layer name
print slopes_blob.data  # This is a numpy array of the slopes

I haven't tested this script, but something like it should work.

hycis commented Mar 24, 2015

@tnarihi Thanks for the quick reply.
I tried net.layers['prelu1'] but got an invalid index type error, so I tried net.params['prelu1'], which gave me some numbers. So I guess net.params corresponds to net.layers?

Contributor

tnarihi commented Mar 24, 2015

Sorry, it should be params.

hycis commented Mar 24, 2015

Hi @tnarihi,
I also observed that the PReLU units within the same feature map share the same slope coefficient. When I check the dimension of net.params['prelu1'][0].shape for a PReLU layer after a convolution layer, it is equal to the number of feature maps I set for the convolution layer.

Contributor

tnarihi commented Mar 24, 2015

@hycis Right.

happynear commented:

Hi @tnarihi,
I am wondering why an additional blob "bottom_memory_" is used. When the layer computes in place, the bottom_data can be recovered as top_data / slope_data. My GPU has only 2 GB of memory, and I don't want to spend more of it on an activation layer.

tnarihi deleted the tnarihi:prelu2 branch Apr 1, 2015

Contributor

tnarihi commented Apr 1, 2015

Hi @happynear,

I think that's the case only if the negative slopes are all positive. If we allow slopes to be negative, we cannot recover the pre-activation values from top_data and slope_data alone. Another way to reduce memory consumption is to keep the bottom signs (positive or negative) in a 1-byte array (e.g. int8) instead of the actual values (Dtype = float, 4 bytes); then we can reconstruct the pre-activation values from the signs, top_data, and slope_data. Actually, I have one idea in mind to remove the temporary memory (one of the authors of the original paper contacted me and kindly gave me advice), but it requires modifying the Caffe Net/Layer framework, and I don't have time to work on it. If you have any other ideas, I'd be happy to discuss them.

@tnarihi
Yeah, I hadn't considered the negative case. The slopes did indeed come out negative in some cases I experimented with in Matlab. Could you tell me what the idea is?

Contributor

tnarihi commented Apr 8, 2015

@happynear Sorry for the late reply.

  1. We create a global blob shared_buff as a shared buffer.
  2. At every PReLU layer during the forward pass, we copy the pre-activation values (bottom values) into bottom.diff (bottom[0]->mutable_diff()), which is not used during the forward pass.
  3. During the backward pass, every layer that follows a PReLU copies its bottom.diff (the PReLU's top.diff = the PReLU's bottom.diff) to shared_buff (reshaping if necessary), in order to avoid overwriting the stored PReLU pre-activations.
  4. At every PReLU in backprop, we take the pre-activation values from shared_buff and use them for the backward diff computation.

One naive implementation is to have every layer copy its bottom.diff to shared_buff, but that involves unnecessary copies whenever the following layer is not a PReLU. Otherwise, we would have to implement some kind of communication interface so that layers know what their top/bottom layers are, or introduce switching variables so that a layer knows whether it should copy its bottom.diff to shared_buff.
Does that make sense?

EDIT: This still needs the additional shared_buff memory, but it is much smaller than keeping a buffer for every PReLU when the net has many of them.

Contributor

tnarihi commented Apr 8, 2015

Now I think we have two choices:

  1. Use an int8 array (possibly even a bit array) to store the sign of the pre-activation, reducing memory consumption by 75% (97% with a bit array).
  2. Store the pre-activation into bottom.diff (this requires modifying Caffe itself, not just PReLULayer).

futurely referenced this pull request Apr 9, 2015

Closed

MSRA weight filler #1946

When I use PReLULayer to train the ImageNet model, the loss at the beginning is about 7. But when I fine-tune bvlc_reference_caffenet.caffemodel, the loss at the beginning is about 80! Why?

Contributor

tnarihi commented May 5, 2015

Maybe the bigger loss is due to a much larger number of nonzero responses in PReLU, but I am not sure exactly. The reference model was trained with ReLU (not PReLU), which is equivalent to the initial state of a PReLU with the following setting:

layer {
    type: "PReLU"
    ....
    prelu_param { filler { type: 'constant' value: 0 } }
}

You should start with this setting. The default value of the negative slope is 0.25.

@tnarihi Thank you! I want to train a model with PReLU, and I used the reference model to run a fine-tuning experiment with a PReLU model, so I used the default negative-slope value of 0.25.

A new type of ReLU has been designed to address the overfitting problem: http://arxiv.org/abs/1505.00853 .
Maybe we should open a new issue.

@happynear Thank you, I have read this paper. In my experience, the initial negative slope should be adjusted if you fine-tune a pre-trained model.
