Batch Norm: Further Documentation and Simplified Definition #4704
Conversation
@shelhamer Thanks a lot for these clarifications. It took me quite some time, when I started using Caffe's BatchNorm a few weeks ago, to get all these things clear by myself. Since you are already looking into the BatchNorm implementation, please, please also revert commit 0ad1d8a, since it makes it impossible to build up an accurate global estimator for the variance. That commit switched the calculation of Var(X) from EX^2 - (EX)^2 to E(X - EX)^2. Unfortunately, in each forward pass the estimator m_b for EX is based only on the values of the current mini-batch (a rather poor estimator compared to the moving mean in …). In contrast, @cdoersch's original implementation stored EX^2 in …. This becomes an issue especially if you want to compute high-accuracy estimators of mean and variance for further finetuning or inference and set …
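To make the concern concrete, here is a minimal numpy sketch of the two estimator styles. It uses equal-weight averaging rather than Caffe's exponential moving average and ignores the stored correction factor, so it only illustrates the shape of the problem: averaging the per-batch E[(X - m_b)^2] is biased low by roughly (B-1)/B, while accumulating the raw moments EX and EX^2 recovers the variance.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0
# 20000 mini-batches of size B = 8
X = rng.normal(0.0, np.sqrt(true_var), size=(20000, 8))

# Style A (as in the original implementation): accumulate averages of
# E[X] and E[X^2], then recover Var(X) = E[X^2] - (E[X])^2.
var_a = (X**2).mean() - X.mean()**2

# Style B (as in commit 0ad1d8a): average the per-batch E[(X - m_b)^2],
# where m_b is the mean of the current mini-batch only.
m_b = X.mean(axis=1, keepdims=True)
var_b = ((X - m_b)**2).mean()

# var_a is close to 4.0; var_b is biased low toward (1 - 1/8) * 4.0 = 3.5
print(var_a, var_b)
```

Caffe's actual layer compensates with the stored batch-size correction, but the sketch shows why the noisy per-batch mean m_b degrades the global variance estimate.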
bwilbertz
and 1 other
commented on an outdated diff
Sep 11, 2016
```diff
  *
- * By default, during training time, the network is computing global mean/
- * variance statistics via a running average, which is then used at test
- * time to allow deterministic outputs for each input. You can manually
- * toggle whether the network is accumulating or using the statistics via the
- * use_global_stats option. IMPORTANT: for this feature to work, you MUST
- * set the learning rate to zero for all three parameter blobs, i.e.,
- * param {lr_mult: 0} three times in the layer definition.
+ * By default, during training time, the network is computing global
+ * mean/variance statistics via a running average, which is then used at test
+ * time to allow deterministic outputs for each input. You can manually toggle
+ * whether the network is accumulating or using the statistics via the
+ * use_global_stats option. IMPORTANT: for this feature to work, you MUST set
+ * the learning rate to zero for all three blobs, i.e., param {lr_mult: 0} three
+ * times in the layer definition. For reference, these three blobs are (0)
+ * mean, (1) variance and (2) m, the correction for the batch size.
```
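As a concrete illustration of the documented requirement, a BatchNorm layer frozen against solver updates might look like the following (the layer and blob names here are hypothetical, not from this PR):

```protobuf
layer {
  name: "bn1"        # hypothetical layer name
  type: "BatchNorm"
  bottom: "conv1"    # hypothetical input blob
  top: "conv1"
  # Freeze all three internal blobs so the solver does not touch them:
  # (0) mean, (1) variance, (2) the batch-size correction factor.
  param { lr_mult: 0 }
  param { lr_mult: 0 }
  param { lr_mult: 0 }
}
```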
bwilbertz
Contributor
Re: #3299 I'll try to review this once I have time for another round of batch norm reform. Note that it was addressing a numerical instability issue, so the solution isn't as simple as just reverting it.
@shelhamer This kind of numerical issue only occurs if N is extremely large and, at the same time, the variance is very small. By the way: the epsilon correction is missing in MVNLayer (where this kind of numerical problem had its origin, see #3162)
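For context on the instability being discussed: in single precision, E[X^2] - (E[X])^2 subtracts two large, nearly equal numbers whenever the mean dominates the standard deviation, so the result is quantized far more coarsely than the variance itself. A small numpy sketch of the effect (not Caffe's actual code):

```python
import numpy as np

# Large mean (100) relative to a tiny standard deviation (1e-3):
# E[X^2] and (E[X])^2 are both ~1e4, where a float32 ulp is ~1e-3,
# so their difference cannot resolve the true variance of ~1e-6.
rng = np.random.default_rng(0)
x = (100.0 + rng.normal(0.0, 1e-3, 100_000)).astype(np.float32)

ex = x.mean()                       # float32 accumulation
ex2 = (x**2).mean()
var_moments = ex2 - ex**2           # quantized to multiples of ~1e-3: useless here

var_shifted = ((x - ex)**2).mean()  # E[(X - EX)^2]: numerically stable
print(var_moments, var_shifted)
```

This is the trade-off in the thread: the shifted form is robust when the variance is tiny next to the mean, while the raw-moments form is what allows accumulating a global estimate across batches.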
shelhamer
merged commit 25422de
into
BVLC:master
Sep 16, 2016
1 check passed
shelhamer
deleted the
shelhamer:groom-batch-norm branch
Sep 16, 2016
shelhamer
added a commit
that referenced
this pull request
Sep 17, 2016
shelhamer
6d6b88c
matthieu637
commented on c8f446f
Sep 21, 2016
Since this commit, I can no longer copy a network.
What is the new way to do it?
matthieu637
replied
Sep 21, 2016
Apparently calling clear_param() on LayerParameter is enough.

Note that Caffe has an automatic definition upgrade path that is made use of in the …
D-X-Y
replied
Oct 19, 2016
Does this mean I don't need to specify … in the batchnorm layer when training, since this commit?
jspark1105
replied
Nov 30, 2016
Shouldn't decay_mult also be set to zero? I'm seeing a very large regularization term when using batch normalization.
What about …? How can one share BatchNorm params (e.g., for a Siamese network, following …)? Related to #5171
@shelhamer I think there is still an issue with the batch norm upgrade: the operation … That means if we pass an upgraded net_param into the Net constructor, we receive back a net which now needs the batch norm upgrade again (because the first round of LayerSetUp will add 3 params and the second one will refuse to work unless all params are removed). Unless this is intended behaviour, I would suggest changing those lines into something like: …
@bwilbertz thanks for raising this, I will look into it soon.
Clearing …
WenzhMicrosoft
replied
Jul 27, 2017
When loading from a caffemodel, this message is always shown: "Successfully upgraded batch norm layers using deprecated …". Maybe we should remove BatchNorm's params before storing them.
shelhamer commented Sep 10, 2016 (edited)
The current state of batch norm is somewhat unclear and requires a tedious specification of learning rates for correctness. This PR documents the batch norm layer in more detail, clarifying its blobs and how to handle the bias and shift, and ensures that the batch norm statistics are not mistakenly mangled by the solver. As the mean, variance, and bias correction are not learnable parameters to be optimized, they should not be updated by the solver, and this PR enforces this exclusion.
A further PR could optionally include the scale and shift in the batch norm layer (which for now are handled by a separate layer), and this would align with the cuDNN interface, but this PR is helpful in itself to avoid accidents.