Batch Norm: Further Documentation and Simplified Definition #4704

Merged
shelhamer merged 4 commits into BVLC:master from shelhamer:groom-batch-norm on Sep 16, 2016

Conversation

7 participants
Owner

shelhamer commented Sep 10, 2016 edited

The current state of batch norm is somewhat unclear and requires a tedious specification of learning rates for correctness. This PR documents the batch norm layer in more detail, clarifying its blobs and how to handle the bias and shift, and ensures that the batch norm statistics are not mistakenly mangled by the solver. As the mean, variance, and bias correction are not learnable parameters to be optimized, they should not be updated by the solver, and this PR enforces this exclusion.

A further PR could optionally include the scale and shift in the batch norm layer (which for now are handled by a separate layer), and this would align with the cuDNN interface, but this PR is helpful in itself to avoid accidents.
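For reference, the usual pattern pairs `BatchNorm` with a `Scale` layer that learns the scale and shift, as the docs describe. A minimal sketch (layer and blob names are illustrative):

```
layer {
  name: "bn1"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1"
}
layer {
  name: "scale1"
  type: "Scale"
  bottom: "conv1"
  top: "conv1"
  scale_param { bias_term: true }  # learn scale and shift in one layer
}
```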

@shelhamer shelhamer [docs] clarify handling of bias and scaling by BiasLayer, ScaleLayer
A bias/scaling can be applied wherever desired by defining the
respective layers, and `ScaleLayer` can handle both as a memory
optimization.
04f9a77
Contributor

bwilbertz commented Sep 11, 2016

@shelhamer Thanks a lot for these clarifications. When I started using Caffe's BatchNorm a few weeks ago, it took me quite some time to work all of this out myself.

Since you are already looking into the BatchNorm implementation, please, please also revert the commit 0ad1d8a, since it makes it impossible to build up an accurate global estimator for the variance.

This commit switched the calculation of Var(X) from E[X^2] - (E[X])^2 to E[(X - E[X])^2]. Unfortunately, in each forward pass the estimator m_b for E[X] is based only on the values of the current mini-batch (a rather poor estimator compared to the moving mean in blob[0]), which then adds the empirical mean of the non-linear transformation (X - m_b)^2 to the global stats.

In contrast, @cdoersch's original implementation stored E[X^2] in blob[1], so that the final estimator for Var(X) was computed as E[X^2] - m^2, where m is the global estimator for E[X] computed over all batches (modulo the moving average factor).

This becomes a particular issue if you want to compute high-accuracy estimators of the mean and variance for further fine-tuning or inference and set moving_average_fraction = 1 (similar to what Kaiming He does for his ResNets; c.f. point 5 in Disclaimer and known issues).
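The difference between the two accumulation schemes can be sketched numerically (illustrative Python, not Caffe code), using mini-batches whose means differ. With moving_average_fraction = 1, the original E[X^2] scheme recovers the global variance exactly, while averaging the per-batch E[(X - m_b)^2] misses the between-batch component:

```python
import numpy as np

rng = np.random.default_rng(0)
# Mini-batches drawn around different means, as happens with real, ordered data.
batches = [rng.normal(loc=mu, scale=1.0, size=64) for mu in (-2.0, 0.0, 2.0)]
data = np.concatenate(batches)
n = len(batches)

# Original scheme: accumulate per-batch E[X] and E[X^2] (moving_average_fraction = 1),
# then form Var(X) = E[X^2] - m^2 with the *global* mean estimate m.
m = sum(b.mean() for b in batches) / n
var_orig = sum((b ** 2).mean() for b in batches) / n - m ** 2

# Scheme after 0ad1d8a: accumulate E[(X - m_b)^2] with the per-batch mean m_b,
# i.e. average only the within-batch variances.
var_new = sum(((b - b.mean()) ** 2).mean() for b in batches) / n

print(var_orig, var_new, data.var())  # var_orig matches data.var(); var_new is too small
```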

@bwilbertz bwilbertz and 1 other commented on an outdated diff Sep 11, 2016

include/caffe/layers/batch_norm_layer.hpp
*
- * By default, during training time, the network is computing global mean/
- * variance statistics via a running average, which is then used at test
- * time to allow deterministic outputs for each input. You can manually
- * toggle whether the network is accumulating or using the statistics via the
- * use_global_stats option. IMPORTANT: for this feature to work, you MUST
- * set the learning rate to zero for all three parameter blobs, i.e.,
- * param {lr_mult: 0} three times in the layer definition.
+ * By default, during training time, the network is computing global
+ * mean/variance statistics via a running average, which is then used at test
+ * time to allow deterministic outputs for each input. You can manually toggle
+ * whether the network is accumulating or using the statistics via the
+ * use_global_stats option. IMPORTANT: for this feature to work, you MUST set
+ * the learning rate to zero for all three blobs, i.e., param {lr_mult: 0} three
+ * times in the layer definition. For reference, these three blobs are (0)
+ * mean, (1) variance and (2) m, the correction for the batch size.
@bwilbertz

bwilbertz Sep 11, 2016

Contributor

Isn't blob[2] rather the cumulative moving average factor than a correction for the batch size?

It is updated as b(n+1) = alpha*b(n) + 1 with b(0) = 0.

@shelhamer

shelhamer Sep 13, 2016

Owner

Right, it is. Comment updated accordingly.
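The recurrence for blob[2] can be sketched in a few lines (illustrative Python; the function name is hypothetical). With moving_average_fraction alpha = 1 it simply counts batches, so dividing the accumulated statistics by it yields an unweighted average; with alpha < 1 it converges to the geometric-series normalizer:

```python
def cumulative_factor(num_batches, alpha):
    # blob[2] update rule: b(n+1) = alpha * b(n) + 1, with b(0) = 0
    b = 0.0
    for _ in range(num_batches):
        b = alpha * b + 1.0
    return b

# alpha = 1: blob[2] equals the number of accumulated batches
# alpha < 1: blob[2] approaches 1 / (1 - alpha) as num_batches grows
```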

shelhamer added some commits Aug 28, 2016

@shelhamer shelhamer [docs] identify batch norm layer blobs 3b6fd1d
@shelhamer shelhamer batch norm: hide statistics from solver, simplifying layer definition
batch norm statistics are not learnable parameters subject to solver
updates, so they must be shielded from the solver. `BatchNorm` layer now
masks its statistics for itself by zeroing parameter learning rates
instead of relying on the layer definition.

n.b. declaring `param`s for batch norm layers is no longer allowed.
c8f446f
@shelhamer shelhamer batch norm: auto-upgrade old layer definitions w/ param messages
automatically strip old batch norm layer definitions including `param`
messages. the batch norm layer used to require manually masking its
state from the solver by setting `param { lr_mult: 0 }` messages for
each of its statistics. this is now handled automatically by the layer.
a8ec123
Owner

shelhamer commented Sep 13, 2016

@bwilbertz

Since you are already looking into the BatchNorm implementation, please, please also revert the commit 0ad1d8a, since it makes it impossible to build up an accurate global estimator for the variance.

This commit switched the calculation of Var(X) from E[X^2] - (E[X])^2 to E[(X - E[X])^2]. Unfortunately, in each forward pass the estimator m_b for E[X] is based only on the values of the current mini-batch (a rather poor estimator compared to the moving mean in blob[0]), which then adds the empirical mean of the non-linear transformation (X - m_b)^2 to the global stats.

Re: #3299 I'll try to review this once I have time for another round of batch norm reform. Note that it was addressing a numerical instability issue, so the solution isn't as simple as just reverting it.

Contributor

bwilbertz commented Sep 13, 2016

@shelhamer This kind of numerical issue only occurs if N is extremely large and, at the same time, the variance is very small.
But for exactly that case the epsilon parameter was introduced into batch norm, so one can easily avoid any numerical catastrophe by increasing epsilon (in case someone really is piping more or less constant data into the net).
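A tiny sketch of why epsilon suffices (illustrative Python, not Caffe code): the normalization scale 1/sqrt(var + eps) stays finite even when the variance estimate underflows to zero, or comes out slightly negative due to cancellation in E[X^2] - (E[X])^2, as long as eps exceeds the cancellation error:

```python
import numpy as np

eps = 1e-5  # batch norm's eps parameter; increase it for nearly constant data
# -1e-6 mimics a slightly negative variance from cancellation in E[X^2] - (E[X])^2
variances = np.array([1.0, 1e-12, 0.0, -1e-6])
scales = 1.0 / np.sqrt(variances + eps)  # every entry is finite and bounded
```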

Btw: the epsilon correction is missing in MVNLayer (where this kind of numerical problem had its origin, see #3162).

@shelhamer shelhamer merged commit 25422de into BVLC:master Sep 16, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

shelhamer deleted the shelhamer:groom-batch-norm branch Sep 16, 2016

matthieu637 commented on c8f446f Sep 21, 2016 edited

Since this commit, I can no longer copy a network:

caffe::Net<double>* old_network;  // network to copy; contains a batch norm layer

caffe::NetParameter net_param;
old_network->ToProto(&net_param);
new caffe::Net<double>(net_param);
// batch_norm_layer.cpp:39] Check failed: this->layer_param_.param_size() == 0 (3 vs. 0)
// Cannot configure batch normalization statistics as layer parameters.

What is the new way to do it?

Apparently calling clear_param() on LayerParameter is enough.

Owner

shelhamer replied Sep 21, 2016

Note that Caffe has an automatic definition upgrade path that is used by the caffe binary and the interfaces like pycaffe, so you don't have to do this manually. This is the part that handles batch norm: https://github.com/BVLC/caffe/blob/master/src/caffe/util/upgrade_proto.cpp#L1003-L1024

Does this mean I don't need to specify

  param {
    lr_mult: 0
    decay_mult: 0
  }

in the batch norm layer when training, since this commit?
Thanks.

Shouldn't decay_mult also be set to zero? I'm seeing a very large regularization term when using batch normalization.

Contributor

shaibagon replied Jan 12, 2017

what about

  param { name: "want_to_share_this" }

How can one share BatchNorm params (e.g., for a Siamese network, following a "Scale" layer)?

related to #5171

Contributor

bwilbertz commented Sep 23, 2016

@shelhamer I think there is still an issue with the batch norm upgrade:

The operation new Net<Dtype>(net_param)->ToProto(...) is no longer idempotent.

That means that if we pass an upgraded net_param into the Net constructor, we receive back a net which needs the batch norm upgrade again (because the first round of LayerSetUp will add 3 params, and the second one will refuse to work unless all params are removed).

Unless this is intended behaviour, I would suggest changing those lines to something like:

  for (int i = 0; i < this->blobs_.size(); ++i) {
    if (this->layer_param_.param_size() == i) {
      ParamSpec* fixed_param_spec = this->layer_param_.add_param();
      fixed_param_spec->set_lr_mult(0.f);
    } else {
      CHECK_EQ(this->layer_param_.param(i).lr_mult(), 0.f)
          << "Cannot configure batch normalization statistics as layer parameters.";
    }
  }
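The non-idempotence and the suggested fix can be modeled with a toy sketch (illustrative Python, not the actual C++; function names are hypothetical):

```python
def current_setup(params):
    # Models BatchNormLayer::LayerSetUp after this PR: any pre-existing
    # param message triggers the check failure quoted above.
    if len(params) != 0:
        raise ValueError("Cannot configure batch normalization statistics "
                         "as layer parameters.")
    return params + [{"lr_mult": 0.0}] * 3  # the layer masks its three statistics

def suggested_setup(params):
    # The variant suggested above: tolerate params that already are the
    # masked statistics, adding only the missing ones.
    out = list(params)
    for i in range(3):
        if len(out) == i:
            out.append({"lr_mult": 0.0})
        else:
            assert out[i]["lr_mult"] == 0.0
    return out

# suggested_setup(suggested_setup(p)) == suggested_setup(p), so a round trip
# through the constructor no longer requires stripping params first.
```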
Owner

shelhamer commented Sep 24, 2016

@bwilbertz thanks for raising this, I will look into it soon.

Contributor

shaibagon commented on a8ec123 Jan 12, 2017 edited

Clearing param of "BatchNorm" layer (line 1020) prevents sharing these parameters with other layers (like in e.g., Siamese networks). Please see issue #5171.

WenzhMicrosoft replied Jul 27, 2017 edited

When loading from a caffemodel, this message is always shown: "Successfully upgraded batch norm layers using deprecated". Maybe we should remove BatchNorm's params before storing them.
