Yet another batch normalization PR #3229
Conversation
ducha-aiki
commented on the diff
Oct 21, 2015
+    }
+  }
+}
+layer {
+  name: "pool1"
+  type: "Pooling"
+  bottom: "conv1"
+  top: "pool1"
+  pooling_param {
+    pool: MAX
+    kernel_size: 3
+    stride: 2
+  }
+}
+
+layer {
ducha-aiki (Contributor)
commented on the diff
Oct 21, 2015
@@ -0,0 +1,230 @@
+#include <algorithm>
+#include <vector>
+
+#include "caffe/common_layers.hpp"
+#include "caffe/layer.hpp"
+#include "caffe/util/math_functions.hpp"
+
+namespace caffe {
+
+template <typename Dtype>
+void BatchNormLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
+    const vector<Blob<Dtype>*>& top) {
+  BatchNormParameter param = this->layer_param_.batch_norm_param();
ducha-aiki (Contributor)
@cdoersch Looks good to me apart from the things I have commented on.
jeffdonahue
commented on an outdated diff
Oct 21, 2015
+#include "caffe/util/math_functions.hpp"
+
+namespace caffe {
+
+template <typename Dtype>
+void BatchNormLayer<Dtype>::LayerSetUp(const vector<Blob<Dtype>*>& bottom,
+    const vector<Blob<Dtype>*>& top) {
+  BatchNormParameter param = this->layer_param_.batch_norm_param();
+  moving_average_fraction_ = param.moving_average_fraction();
+  use_global_stats_ = this->phase_ == TEST;
+  if (param.has_use_global_stats())
+    use_global_stats_ = param.use_global_stats();
+  if (bottom[0]->num_axes() == 1)
+    channels_ = 1;
+  else
+    channels_ = bottom[0]->channels();
jeffdonahue (Contributor)
jeffdonahue
commented on an outdated diff
Oct 21, 2015
+  if (num_by_chans_.num_axes() == 0 ||
+      num_by_chans_.shape(0) != numbychans) {
+    sz[0] = numbychans;
+    num_by_chans_.Reshape(sz);
+    caffe_set(batch_sum_multiplier_.count(), Dtype(1),
+        batch_sum_multiplier_.mutable_cpu_data());
+  }
+}
+
+template <typename Dtype>
+void BatchNormLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
+    const vector<Blob<Dtype>*>& top) {
+  const Dtype* bottom_data = bottom[0]->cpu_data();
+  Dtype* top_data = top[0]->mutable_cpu_data();
+  int num = bottom[0]->shape(0);
+  int spatial_dim = bottom[0]->height() * bottom[0]->width();
jeffdonahue
and 1 other
commented on an outdated diff
Oct 21, 2015
+  // TODO(cdoersch): The caching is only needed because later in-place layers
+  // might clobber the data.  Can we skip this if they won't?
+  caffe_copy(x_norm_.count(), top_data,
+      x_norm_.mutable_cpu_data());
+}
+
+template <typename Dtype>
+void BatchNormLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
+    const vector<bool>& propagate_down,
+    const vector<Blob<Dtype>*>& bottom) {
+  CHECK(!use_global_stats_);
+  const Dtype* top_diff = top[0]->cpu_diff();
+  const Dtype* top_data = x_norm_.cpu_data();
+  Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
+  int num = bottom[0]->shape()[0];
+  int spatial_dim = bottom[0]->height() * bottom[0]->width();
jeffdonahue (Contributor)
jeffdonahue
commented on the diff
Oct 21, 2015
+  caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
+      num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
+      mean_.mutable_cpu_data());
+  caffe_cpu_gemv<Dtype>(CblasNoTrans, channels_ * num, spatial_dim,
+      1. / (num * spatial_dim), temp_.cpu_data(),
+      spatial_sum_multiplier_.cpu_data(), 0.,
+      num_by_chans_.mutable_cpu_data());
+  caffe_cpu_gemv<Dtype>(CblasTrans, num, channels_, 1.,
+      num_by_chans_.cpu_data(), batch_sum_multiplier_.cpu_data(), 0.,
+      variance_.mutable_cpu_data());
+  this->blobs_[2]->mutable_cpu_data()[0] *= moving_average_fraction_;
+  this->blobs_[2]->mutable_cpu_data()[0] += 1;
+  caffe_cpu_axpby(mean_.count(), Dtype(1), mean_.cpu_data(),
+      moving_average_fraction_, this->blobs_[0]->mutable_cpu_data());
+  Dtype m = Dtype(bottom[0]->count()/channels_);
+  caffe_cpu_axpby(variance_.count(), m/(m-1), variance_.cpu_data(),
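The bookkeeping at the end of the quoted hunk can be paraphrased in plain Python (a sketch, not the layer's actual API): blobs_[0] and blobs_[1] accumulate fraction-discounted sums of the per-batch statistics, blobs_[2] accumulates the matching normalizer, and the biased per-batch variance gets the m/(m-1) correction before accumulation. The last axpby call is cut off above, so the variance line here is an assumption modeled on the mean update.

```python
def update_running_stats(sum_mean, sum_var, scale_factor,
                         batch_mean, batch_var, m,
                         moving_average_fraction=0.999):
    # scale_factor plays the role of blobs_[2][0]: the discounted count
    # that the stored sums must later be divided by.
    scale_factor = scale_factor * moving_average_fraction + 1.0
    # blobs_[0] update: Y = 1 * batch_mean + fraction * Y  (caffe_cpu_axpby).
    sum_mean = batch_mean + moving_average_fraction * sum_mean
    # blobs_[1] update (assumed symmetric to the mean update): the biased
    # per-batch variance is rescaled by m/(m-1) to be unbiased first,
    # with m = count / channels.
    sum_var = (m / (m - 1.0)) * batch_var + moving_average_fraction * sum_var
    return sum_mean, sum_var, scale_factor

# The usable estimates are then sum_mean / scale_factor and
# sum_var / scale_factor.
```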
jeffdonahue (Contributor)
Thanks for putting this PR together @cdoersch, and for the early work @ducha-aiki! Besides the above comment, this looks good to me. I haven't tried the examples, but if they really don't converge, I'm not sure yet how we should handle that... they probably shouldn't be merged as is; broken examples aren't great for most users... Perhaps they should temporarily use the 1x1 convolution hack until we have the dedicated layers merged as well?
@jeffdonahue they successfully converge and actually work (when using the fixed example). What is not working is in-place computation, which I propose for now just to flag with

@cdoersch @ducha-aiki my experiments with in-place computation of this layer do not converge either, although I have had convergence with this version of batch norm derived from #1965: https://github.com/HyeonwooNoh/caffe/blob/master/src/caffe/layers/bn_layer.cu. It could be worth a glance before the merge.

Ah, I see, thanks for the clarification/reiteration @ducha-aiki. I agree with your suggestion in that case -- there should either be such a check (and ideally the then-unnecessary variance caching would be removed), or we should try to fix in-place computation.
jeffdonahue
and 1 other
commented on an outdated diff
Oct 21, 2015
+    backend: LMDB
+  }
+}
+layer {
+  name: "conv1"
+  type: "Convolution"
+  bottom: "data"
+  top: "conv1"
+  param {
+    lr_mult: 1
+  }
+  param {
+    lr_mult: 2
+  }
+  convolution_param {
+    num_output: 32
jeffdonahue (Contributor)
jeffdonahue
commented on an outdated diff
Oct 21, 2015
+  type: "Pooling"
+  bottom: "conv1"
+  top: "pool1"
+  pooling_param {
+    pool: MAX
+    kernel_size: 3
+    stride: 2
+  }
+}
+
+layer {
+  name: "bn1"
+  type: "BatchNorm"
+  bottom: "pool1"
+  top: "bn1"
+  batch_norm_param {
jeffdonahue (Contributor)
jeffdonahue
commented on an outdated diff
Oct 21, 2015
+  top: "pool3"
+  pooling_param {
+    pool: AVE
+    kernel_size: 3
+    stride: 2
+  }
+}
+
+layer {
+  name: "ip1"
+  type: "InnerProduct"
+  bottom: "pool3"
+  top: "ip1"
+  param {
+    lr_mult: 1
+    decay_mult: 250
jeffdonahue (Contributor)
@cdoersch now it works :)

@jeffdonahue @ducha-aiki I've fixed the lr's and decays. Can you confirm that I've made all the required changes?

@cdoersch LGTM

Thanks again @cdoersch and @ducha-aiki! LGTM as well.
jeffdonahue
added a commit that referenced this pull request
Oct 23, 2015
39f69fb
jeffdonahue
merged commit 39f69fb into BVLC:master
Oct 23, 2015
1 check passed
Great work
cysin
commented
Oct 23, 2015
Are there more examples and tutorials on how to use it?
ducha-aiki
referenced
this pull request
Oct 23, 2015
Closed
Batch normalization layer with test and examples #1965
@jeffdonahue #2996 could be used for Scale + bias (in non-channel-shared mode).
beniz
referenced
this pull request
in beniz/deepdetect
Oct 23, 2015
Open
Support for batch normalization #7
beniz
commented
Oct 25, 2015
@cdoersch, @ducha-aiki thanks for the work throughout this PR, I can confirm the training phase is working great. However, I'm struggling with inference from a stored model: inference (caffe::TEST) appears to work fine during training, but not from a freshly loaded model. From investigating the code, my understanding is that mean_ and variance_ are not stored as part of the model and thus cannot be loaded. From the paper, my understanding is that at the moment this may prevent inference from a stored Caffe model trained with BN. Is this correct?
@beniz During the testing phase (or whenever use_global_stats_ is true), the mean and variance information should be read from the BatchNormLayer's parameter blobs. Whether you're doing this from a freshly loaded model or a model that you've been training shouldn't matter. If the code is really behaving as you say, then there may be a bug. Note that switching to global statistics is not guaranteed to work: if the mean and variance statistics are not stationary (and they generally won't be unless the network has converged), then the estimates may be inaccurate. You should add some debugging statements to make sure the code is indeed executing along the paths you think it is, because I think there's a chance that you're not using global stats for your training-time test phases.
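A toy numpy sketch of the two modes being discussed (illustrative names, not the layer's API): training normalizes with the batch's own statistics, while TEST / use_global_stats normalizes with stored running estimates, so the two outputs agree only once the running estimates match the data's statistics.

```python
import numpy as np

def batch_norm(x, running_mean, running_var, use_global_stats, eps=1e-5):
    # x: (N, C) activations. With use_global_stats=False the batch's own
    # statistics are used; with True, the stored running estimates are.
    if use_global_stats:
        mean, var = running_mean, running_var
    else:
        mean = x.mean(axis=0)
        var = x.var(axis=0)  # biased estimate, as in the layer's forward pass
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 10.0], [3.0, 14.0]])
train_out = batch_norm(x, None, None, use_global_stats=False)
# If the running estimates differ from this batch's statistics, the
# TEST-phase output differs too -- which is why unconverged statistics
# can make test-time results look broken.
test_out = batch_norm(x, np.array([0.0, 0.0]), np.array([1.0, 1.0]),
                      use_global_stats=True)
```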
beniz
commented
Oct 25, 2015
All points well understood @cdoersch, and thanks for the quick answer. I've re-checked the execution path through global_stats and that the mean and variance are in the saved blobs, and I believe there's no bug. So this leaves the stationarity requirement (though my nets do converge, and test results during training are just fine most of the time). Since we're at it, would it be possible to get more details on this comment from common_layers.hpp:

Thanks again.
beniz
commented
Oct 25, 2015
I've put the reworked protobuf files for GoogleNet with BN here (modified from https://github.com/lim0606/caffe-googlenet-bn): Training is fine and fast. Note that the type of the test input layer needs to be modified in order to work with straight Caffe.
@beniz these statistics are collected through running averages; updating the associated parameters has nothing to do with optimizing the global objective function. Hence we don't want the solver trying to update these parameters, because it will only mess them up.
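In prototxt terms, the usual way to keep the solver away from those blobs is zero multipliers on the layer's three parameters (a sketch; layer and blob names are illustrative):

```
layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "pool1"
  top: "bn1"
  # The three parameter blobs hold the running mean, the running variance,
  # and the moving-average scale factor; lr_mult/decay_mult of 0 keeps the
  # solver from touching them.
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
}
```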
@beniz I guess the BN-wise GoogleNet does not have intermediate supervised layers. Even if it does, I guess the loss weights are wrong in your configuration file, since their sum is larger than 1.
beniz
commented
Oct 26, 2015
My understanding is that it doesn't matter whether they sum to 1 or not. Also, they are the same as in https://github.com/BVLC/caffe/blob/master/models/bvlc_googlenet/train_val.prototxt The original GGNet paper reports that a single intermediate supervised layer is enough, with a 0.6% positive effect in the end. I don't remember what the BN paper says about them.
@beniz in the end, you are right, no practical effect.
cysin
commented
Oct 27, 2015
@beniz I tried your googlenet_bn but got the following errors:

Am I missing something?
beniz
commented
Oct 27, 2015
@cysin weird, it's been running on several datasets including ILSVRC on my side without issue. Note though that inception_3c is an area where the network differs from bvlc_googlenet. You can find a picture of the full BN net here: https://raw.githubusercontent.com/lim0606/caffe-googlenet-bn/master/inception_bn.png Are you running this network with a crop size different from 224, maybe?
happynear
commented
Oct 28, 2015
@cysin
Since all the variables are positive integers, this is equivalent to
While for pooling, it is somewhat weird:
They do not always match each other, so we must be careful with the
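The mismatch being described is presumably Caffe's output-size arithmetic: convolution rounds the quotient down while pooling rounds it up. A quick sketch of the two formulas (standard Caffe behavior, written from memory):

```python
import math

def conv_out(i, k, s, p=0):
    # Convolution output size: floor division.
    return (i + 2 * p - k) // s + 1

def pool_out(i, k, s, p=0):
    # Pooling output size: Caffe rounds up instead.
    return int(math.ceil((i + 2.0 * p - k) / s)) + 1

# With stride 2 the two only agree when (i - k) is even:
for i in (13, 14):
    print(i, conv_out(i, 3, 2), pool_out(i, 3, 2))
```

This is why a 3x3/stride-2 pooling stage can produce a spatial size that a parallel stride-2 convolution branch does not, and why a non-standard crop size can break a net that was laid out for 224x224 inputs.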
cysin
commented
Oct 28, 2015
@beniz It turned out to be the size issue. I changed the image size to 224x224 and now it seems all fine.
lukeyeager
referenced
this pull request
in NVIDIA/caffe
Oct 28, 2015
Merged
Cherry-pick batch normalization PR #51
cysin
commented
Oct 30, 2015
Can the bn layer be removed at test phase? If not, can this feature be added?
beniz
commented
Oct 30, 2015
@cysin not sure who you are asking, but yes, my understanding is that the BN layer is required during the test phase. From the code, the BN layer switches to global stats whenever the caffe::TEST phase is on, so you should not have to add anything to the .prototxt files regarding the BN layer, if that is what you're asking...
beniz
commented
Nov 9, 2015
This PR seems relevant here: #3299
Hrant-Khachatrian
commented on the diff
Nov 19, 2015
+  convolution_param {
+    num_output: 32
+    pad: 2
+    kernel_size: 5
+    stride: 1
+    weight_filler {
+      type: "gaussian"
+      std: 0.01
+    }
+    bias_filler {
+      type: "constant"
+    }
+  }
+}
+
+
Hrant-Khachatrian
siddharthm83
commented
Nov 22, 2015
Do I need to define a batch norm layer separately for the train & test phases (i.e. like below; params removed to shorten)? Or, if I define it just once with no phase mentioned, will it automatically switch to global stats during the test phase?
siddharthm83
commented
Nov 27, 2015
beniz
commented
Nov 28, 2015
@siddharthm83 yes, I believe it switches automatically. As for the model, I guess you are asking for a trained model. While I do have one that I can share, I haven't done so yet because I cannot get it to work correctly in pure prediction mode outside the training/testing phase. As far as my short investigation went, the culprit is these lines: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/batch_norm_layer.cpp#L143-L149 Somehow, when predicting over a single image for instance, these two operations turn the final softmax into NaN. @cdoersch to be very honest I'm not sure what you mean by 'replicate variance to input size'.
@beniz the input is NxCxHxW, but the estimate of the variance (variance_) is 1xCx1x1. Hence, the variance needs to be replicated (i.e. tiled) so that it's the same size as the input. The final line is the one that actually divides the input by the variance (in retrospect, this line should probably have a separate comment). If you have a single input and you're using fully-connected layers, you're going to be measuring the variance of a single number, which is not defined (I'm pretty sure it will be calculated as 0, which will produce NaNs when you divide). If you're seeing the NaNs when use_global_stats is true, however, then it's probably a different issue. Note that the original version of the code for accumulating global statistics was incorrect; it was fixed recently, but the fix is likely to cause problems with global statistics that were computed using the old code.
siddharthm83
commented
Nov 28, 2015
@beniz, yes, I was referring to the GN-BN model trained on Imagenet-1000. In the testing phase, shouldn't the global mean/variance (stored during training) be used? The mean/variance shouldn't be calculated. I can check too on my side (in the middle of training a smaller network; will check when training is finished).
beniz
commented
Nov 29, 2015
@cdoersch thanks, this was very certainly the culprit. Tried again with an intermediate snapshot from a new run and it's working fine now. @siddharthm83 use_global_stats is set to true in TEST (aka prediction) mode, and thus the mean/variance are reused.
siddharthm83
commented
Nov 30, 2015
Good to hear @beniz. If you do plan to upload it to the model zoo, let me know (saves some time in training one myself :).
cuihenggang
commented
Dec 4, 2015
Does anyone have a train_val file for the Inception-BN network? Thank you so much! Cui
siddharthm83
commented
Dec 4, 2015
@cuihenggang: @beniz has an example here: https://github.com/beniz/deepdetect/tree/master/templates/caffe/googlenet_bn
cuihenggang
commented
Dec 4, 2015
@siddharthm83 Great! Thank you @beniz
happynear
commented
Dec 4, 2015
@beniz ,
beniz
commented
Dec 4, 2015
@happynear where? I may have missed something.
siddharthm83
commented
Dec 4, 2015
How can I set a constant learning rate for the batch norm parameters (gamma, beta)? E.g., I want the learning rate for the conv/fully-connected layers to start from 0.01 and cool down by a factor of 10 every n epochs, but I want to keep my batch-norm parameters (gamma, beta as in the paper) constant at, say, 0.01 throughout learning. If this is possible, some info on how to change the prototxt would be much appreciated :) @cdoersch
@siddharthm83 I feel like I answered this somewhere before, but I can't find it now... The
siddharthm83
commented
Dec 7, 2015
Thanks @cdoersch. I read the code as well, and I understand that the batch norm layer outputs only (x - mean)/sqrt(var).
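For completeness, the gamma/beta step that the layer leaves out (and that the thread proposes delegating to a separate layer) is just a per-channel affine transform with broadcasting. A numpy sketch with illustrative names:

```python
import numpy as np

def channelwise_affine(x, gamma, beta):
    # The paper's gamma/beta scale-and-shift, applied per channel.
    # x: (N, C, H, W); gamma, beta: (C,).
    return gamma.reshape(1, -1, 1, 1) * x + beta.reshape(1, -1, 1, 1)

x = np.random.randn(2, 3, 4, 4)
# What BatchNormLayer itself produces: per-channel whitening only.
x_hat = (x - x.mean(axis=(0, 2, 3), keepdims=True)) / \
        np.sqrt(x.var(axis=(0, 2, 3), keepdims=True) + 1e-5)
# The missing learnable step:
y = channelwise_affine(x_hat, gamma=np.array([1.0, 2.0, 0.5]),
                       beta=np.array([0.0, -1.0, 3.0]))
```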
siddharthm83
commented
Dec 7, 2015
@cdoersch, if I understand correctly, it is a bit tedious to implement scale/shift, since one would need to manually set the size of
happynear
commented
Dec 7, 2015
@siddharthm83 ,
@siddharthm83 @happynear I totally agree, the situation right now is pretty awkward. My impression, though, is that there are some high-level philosophical issues here: we don't want the layer catalogue to grow too large, and we don't want to overfit caffe to the way we do neural networks right now. I know that for a long time, the caffe developers considered treating parameters like we treat data: they're just another input to the network, just one that gets updated by SGD rather than through disk reads. If you had such a parameter layer, plus an Eltwise layer that supports expansion of singleton dimensions, then the EltwiseAffine layer would be totally redundant. We probably wouldn't want to support it going forward, and caffe hates deprecating things. Of course, if the caffe devs have no plans to move forward on that front, then I'm all for merging the EltwiseAffine layer. It's definitely better than the situation we have now. It's unlikely that the EltwiseAffine layer would be included by default with BatchNorm layers, though, for backwards-compatibility reasons.
@cdoersch a
beniz
commented
Dec 7, 2015
@siddharthm83 @happynear thanks for catching this, I've re-read the original paper this weekend as well. The ReLU of GoogleNet may mitigate the effect for now, but I'll update the net once there's agreement on the best way to fix the BN. My favor goes to an integrated scale/shift, of course.
@beniz @shelhamer @happynear @cdoersch Let me, as the author of EltwiseAffine, say something against it in the special BatchNorm case :)
ChannelScalarLayer and BiasLayer make sense to me :)
siddharthm83
commented
Dec 7, 2015
@cdoersch @shelhamer, thanks for the response. A
@ducha-aiki that's a very interesting result! As a side note, my ICCV paper (http://arxiv.org/abs/1505.05192) doesn't actually use the scale/shift layer, because my goal was to prevent the network activations from collapsing to zero. I haven't tried the network with the scale/shift layer. In my opinion, this is one of the main advantages of batch normalization: it forces the network to try to learn something even when the problem is extremely hard, rather than giving up and ignoring the input. A few others at CMU working on unsupervised learning have been using it this way as well. Do the scale/shift layers break this property? Not clear, though it seems like they might. If anyone wants to play around with it, the source code for that paper is up now.
siddharthm83
commented
Dec 7, 2015
@ducha-aiki, do you have an example prototxt? When I use the
@siddharthm83 slope_filler {
Because you don't want to decrease the activations 100x after batch normalization, which is what your code does.
siddharthm83
commented
Dec 9, 2015
@ducha-aiki, that makes sense. I will send some sample snippets and a prototxt file on your PR page.
siddharthm83
referenced
this pull request
Dec 9, 2015
Closed
Added layer for learnable eltwise y=kx+b #2996
happynear
commented
Dec 9, 2015
Adding BN without scale/shift before ReLU really hurts the capacity of the model, since the output of BN is expected to have zero mean, making the bias term in the convolution layer meaningless. However, when the BN layer is added after ReLU, there is no need to append a scale/shift layer after it, because scale/shift are themselves linear transformations. In the Batch Normalization paper, they did not do any experiments to analyse where it is best to put the BN layer. They claimed that BN + ReLU can produce a stable distribution; I can't understand this. I am looking forward to your results.
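The point about the bias term is easy to verify numerically: mean subtraction absorbs any per-channel constant added before BN, so a bias in the preceding convolution has no effect on the BN output. A numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))  # pre-activation outputs, (N, C)

def bn(v, eps=1e-5):
    # Per-channel batch normalization over the batch axis.
    return (v - v.mean(axis=0)) / np.sqrt(v.var(axis=0) + eps)

bias = np.array([5.0, -3.0, 0.5, 100.0])
# Adding a per-channel constant before BN changes nothing after it:
same = np.allclose(bn(x), bn(x + bias))
```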
@happynear @siddharthm83 @cdoersch @beniz @shelhamer
cuihenggang
commented
Jan 7, 2016
Hi, I assume the BatchNormalization layer is pretty much done (is it?). I wonder, has anyone tried training the ILSVRC12 task using the Inception-BN network? What validation accuracies have we gotten?
Ok, I chatted with some other BVLC folks and it sounds like we're going to go ahead and merge some kind of per-channel scale/shift layers. What PRs do we currently have that do this? I'm currently aware of #2996, as well as #2079/#3021. I also vaguely remember someone referencing a PR with separate scaling and shifting layers, but I can't find it now, so maybe I'm imagining things. Let's first try to come to some consensus about which one should be merged, and then I'll review it. Right now I think #2996 looks pretty straightforward.
@cdoersch I am for separating #2996 into bias and scale, but I am also OK with it as it is now (rebased to current). So should I start doing this?
I'm not sure I see the point in separating it into two layers. I think it would be better if there are just options in the prototxt to turn off the bias or scale and save the computation if desired. I think the general use case for this will involve both of them, and combining them should save memory. I might pick a different name though -- EltwiseAffine makes it sound like something different is happening for each element, when it's really a per-channel operation. Maybe PerChannelAffine?
@cdoersch agreed, flags to turn them off/on are better than separation.
@ducha-aiki I guess to follow the 'Eltwise' pattern, it should be ChanwiseAffine. ChannelwiseAffine would probably be more clear, though.
@cdoersch OK, then I will clean it up, add flags and rebase.
@cdoersch ScalarLayer (#3021) and BiasLayer (in the newly-created #3550) are what I've been using to learn the BN params. I'd be interested to know how the performance and memory use compares with the combined approach in #2996 from @ducha-aiki.
cdoersch
referenced
this pull request
Jan 13, 2016
Closed
Add BiasLayer to add two Blobs with broadcasting #3550
@jeffdonahue I can test both variants. P.S. caffenet128 training with the BN-EA layer is almost finished, and it looks like EA helps, at least in the BN-before-nonlinearity setup. We'll see if it helps for BN-after-ReLU, which performs much better.
Thanks @ducha-aiki, that would be great. For reference, this Python function (using NetSpec) should do in-place batch norm with ScalarLayer + BiasLayer:

@jeffdonahue Speed: S+B
So my implementation is a bit faster.
nian-liu
commented
Jan 22, 2016
Hi everyone, I am not clear on what the param "moving_average_fraction" means and how to determine its value. Could anyone give me a hint?
classner
commented
Feb 3, 2016
@nian-liu It is the weight with which the 'old' moving average parts are down-weighed every iteration, i.e., (1 - moving_average_fraction) gives a measure of the speed of decay of the mean (the higher, the faster the decay). I just observed that this layer is not using the running mean and variance with exponential decay during training. This means that it becomes 'risky to use' (to say the least) with very small batch sizes, in the most extreme case with batch size one (which has the nice advantage for semantic segmentation that the network input can be dynamically resized). It can especially lead to large discrepancies between training and testing performance when the per-batch statistics do not approximate the global statistics well. What was the reasoning behind the decision to do this? It requires the assumption that the mean and variance over a batch are a reasonable estimate for the mean and variance of the entire dataset, which may or may not be the case, and increasingly is not ;) for small batch sizes.
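A small sketch of the estimator being discussed (illustrative, scalar statistics only): each batch's statistic enters a fraction-discounted sum, the stored scale factor normalizes it, and a batch from k iterations ago therefore carries weight fraction^k.

```python
def ema_estimate(batch_means, fraction=0.99):
    # How the running mean estimate is formed: a fraction-discounted sum of
    # per-batch means, normalized by the matching discounted count.
    s, scale = 0.0, 0.0
    for m in batch_means:
        scale = scale * fraction + 1.0
        s = s * fraction + m
    return s / scale

# With fraction=0.99 a batch from k iterations ago carries weight 0.99**k,
# so roughly the last 1/(1 - 0.99) = 100 batches dominate the estimate.
print(ema_estimate([1.0, 1.0, 1.0], fraction=0.99))  # exactly 1.0
```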
cuihenggang
commented
Feb 24, 2016
So do we have a new train_val protobuf for ImageNet that includes scale/shift operations after batch normalization?
cuihenggang
commented
Feb 24, 2016
Actually, while looking at the BatchNormLayer implementation, I found some issues. There is a default option, use_global_stats_, that lets the BatchNormLayer store and use the moving average of the mean and variance values. I think in the original paper they normalize only inside the mini-batch, without considering the previous samples, which makes the normalized mini-batch white (mean=0 and variance=1). With the use of a moving average of mean and variance, the output of our BatchNormLayer won't be white, because those are not the real mean and variance of this mini-batch. Will this cause any problems?
classner
commented
Feb 24, 2016
Actually, by default it does not use the moving average during training; during testing it does. That may lead to exactly the problem I described above...
ducha-aiki
referenced
this pull request
in smichalowski/google_inception_v3_for_caffe
Mar 16, 2016
Closed
Where is ScaleBias layers? #2
cuihenggang
commented
Apr 3, 2016
Actually I have a question regarding the BatchNormLayer design. Are there any specific reasons why we chose to implement scale and bias in a separate ScaleLayer, rather than inside the BatchNormLayer? Aren't we consuming more memory by adding an extra ScaleLayer after each BatchNormLayer?
This was referenced Apr 8, 2016
d4nst
commented
Jun 15, 2016
I am currently testing some nets with and without batch normalization, and I see that the memory consumption for nets with batch normalization is twice as much. For example, using this resnet I can train with a batch size of 24 images on my GPU. However, if I remove all the BatchNorm layers I can use up to 60 images or so. The problem seems to be in the BatchNorm backward pass. What is the reason for this? It seems like a very high memory consumption.

cdoersch commented Oct 21, 2015
This PR squashes together #1965 and #3161 to make sure that proper credit is given. The final functionality is much more like #3161: we ultimately decided that the scale/shift could be implemented as a separate layer (and should hence get its own PR) and the data shuffling, if it gets merged, should also be done as a separate PR (I have not reviewed that code closely enough to say whether it is mergeable). This version includes the global stats computations, and fixes the issue where #3161 was using the biased variance estimate (took a little while to convince myself that this is indeed the correct estimator to use).
It would be great if @ducha-aiki and @jeffdonahue could take a look at this.
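On the biased-vs-unbiased point mentioned above: per-batch normalization uses the biased (divide-by-m) variance, while the accumulated global estimate gets the m/(m-1) correction before being stored. The relationship is easy to check (numpy sketch):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
m = x.size

biased = x.var()          # divides by m   -- the estimator used to normalize
unbiased = x.var(ddof=1)  # divides by m-1 -- the estimator accumulated into
                          # the running (global) variance

# The two differ by exactly the m/(m-1) factor the layer applies when
# accumulating the moving average:
matches = np.isclose(biased * m / (m - 1), unbiased)
```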