Update BatchNormLayer #3299

Merged
merged 2 commits into BVLC:master on Nov 14, 2015

Conversation

@kkhoot (Contributor) commented Nov 7, 2015

This PR updates BatchNormLayer.

  1. Enhance the numerical stability of the variance computation by using var(X) = E((X-EX)^2) (see the sketch after this list).
  2. Modify the exponential moving average used in the global statistics computation.
    • The current implementation can be slightly inaccurate, since it saves m/(m-1)E(X)^2 and subtracts E(X)^2 from it rather than m/(m-1)E(X)^2.
    • Change the moving-average formula to S = alpha * Xt + (1-alpha)*S, which does not need a separate scale factor.
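
For illustration, a minimal standalone sketch (not the PR diff) of why point 1 matters: computing the variance as E(X^2) - E(X)^2 can lose precision through cancellation when the mean is large relative to the spread, whereas E((X-EX)^2) avoids that.

#include <cstddef>
#include <vector>

// Biased variance via E((X - EX)^2): two passes, numerically stable.
double variance_centered(const std::vector<double>& x) {
  double mean = 0.0;
  for (double v : x) mean += v;
  mean /= x.size();
  double var = 0.0;
  for (double v : x) var += (v - mean) * (v - mean);
  return var / x.size();
}

// Biased variance via E(X^2) - E(X)^2: one pass, but subject to cancellation.
double variance_moments(const std::vector<double>& x) {
  double sum = 0.0, sum_sq = 0.0;
  for (double v : x) { sum += v; sum_sq += v * v; }
  const double mean = sum / x.size();
  return sum_sq / x.size() - mean * mean;
}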

@cdoersch (Contributor) commented Nov 9, 2015

I think you mean E(X)^2 for parts of what you're saying, but yes, this is a good point. This is incorrect in the current implementation. I agree with merging the first two points in this PR. @ronghanghu, can you mark this as a bugfix as well as an enhancement?

However, I don't agree with merging the PR as-is. The problem with using S = alpha * Xt + (1-alpha)*S is that it will be biased if the number of iterations is small relative to 1/alpha. It's actually possible that some users will want to set alpha=0 if they want to integrate over an entire dataset. Hence, I would recommend against making this change.
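
To make the bias concrete (a hedged gloss, not part of the original comment), unrolling the proposed update over T iterations gives

S_T = (1 - \alpha)^T S_0 + \alpha \sum_{t=1}^{T} (1 - \alpha)^{T-t} X_t

so when T << 1/\alpha, the weight (1 - \alpha)^T on the initial value S_0 is still close to 1, and the estimate is dominated by the initialization rather than by the data.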

Also, while we're doing this, can we add two more simple changes:

  1. Remove the "TODO(cdoersch): allow an option to use an unbiased variance estimate, like the paper does." comment. It doesn't really make sense at the moment (it referred to old code); the unbiased estimate is what we should always be using.

  2. Make the backward pass work when use_global_stats_ is active. It's actually a trivial change: basically add a

if (use_global_stats_) {
  // CPU version shown; the GPU path uses caffe_gpu_div and temp_.gpu_data()
  caffe_div(temp_.count(), top_diff, temp_.cpu_data(), bottom_diff);
  return;
}

instead of the current CHECK. There's really no good reason why this shouldn't be allowed, and arguably people will sometimes want to turn batch normalization off halfway through training.

@kkhoot force-pushed the fix_bn branch 2 times, most recently from 641e3b4 to b0707be, on November 10, 2015 at 14:55
@kkhoot (Contributor, Author) commented Nov 10, 2015

@cdoersch I miswrote it. As you said, it is E(X)^2. Thanks for the clarification.

The problem with using S = alpha * Xt + (1-alpha)*S is that it will be biased if the number of iterations is small relative to 1/alpha.

It is true that the exponential moving average (EMA) has a problem with initial values. The formula S = alpha * Xt + (1-alpha)*S is the standard EMA, and it can be combined with (although not always) better initialization tricks. The initial-value problem can be ameliorated by averaging some number of starting samples and using that mean as the seed value. I think the current implementation can have a bias towards recent samples, since with a small alpha a few new samples dominate the sum.
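
A minimal sketch of the seeding trick described above (an illustration, not the code in this PR): average the first few samples to obtain the seed, then switch to the standard EMA update.

struct SeededEma {
  double alpha;   // EMA weight for new samples
  int warmup;     // number of initial samples used to seed S
  double S = 0.0;
  int seen = 0;
  SeededEma(double a, int w) : alpha(a), warmup(w) {}
  void update(double x) {
    ++seen;
    if (seen <= warmup) {
      S += (x - S) / seen;                // running mean of the first `warmup` samples
    } else {
      S = alpha * x + (1.0 - alpha) * S;  // standard EMA afterwards
    }
  }
};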

It's actually possible that some users will want to set alpha=0 if they want to integrate over an entire dataset.

What you are describing is probably the cumulative moving average (CMA). CMA and the new initialization method have been implemented in commits on another branch of mine. I am thinking about bringing those commits here. Could you please check them? c1884c1 18ef30f
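
For reference, the standard CMA update (a textbook definition, not taken from those commits) is

S_n = S_{n-1} + \frac{X_n - S_{n-1}}{n} = \frac{1}{n} \sum_{t=1}^{n} X_t

i.e. the plain mean of all samples seen so far, which is the "integrate over the entire dataset" behaviour mentioned above.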

Also, while we're doing this, can we add two more simple changes:

I made the changes in a new commit. Please check it.

@cdoersch (Contributor)

I still disagree with these changes. It's a lot of extra complexity--not to mention breaking backwards compatibility--for essentially no benefit.

You are certainly correct that the current initialization scheme is biased toward more recent samples, but it's exactly the same bias that would be there in the moving average. In general, it's better to use such an exponential weighting of the samples even in the beginning, since the user explicitly told us how much more to trust recent samples than old samples. It's common that you want to do this if you don't expect the statistics to be stationary.

A side note from reading your other commits: we assign numbers to fields in the .proto definitions because, in binary files, the fields are stored by number, not by name. Hence, for backward compatibility, you should never change the number associated with an existing field; new fields should get the next higher number.

Finally, why did you take out the line that copies the top diff into x_norm_.diff()? This is necessary for the correctness of the in-place backward computation, since the computation clobbers bottom.diff() before it's done reading from top.diff(). If this wasn't obvious to you, maybe we should add a comment to explain why that line is there.
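
For context, roughly the guard being referred to, as it would sit near the top of BatchNormLayer<Dtype>::Backward_cpu (a sketch; the exact code may differ):

const Dtype* top_diff;
if (bottom[0] == top[0]) {
  // In-place case: bottom's diff aliases top's diff, and the computation below
  // overwrites bottom_diff before it has finished reading top_diff, so keep a
  // private copy of the top diff in x_norm_'s diff buffer.
  caffe_copy(x_norm_.count(), top[0]->cpu_diff(), x_norm_.mutable_cpu_diff());
  top_diff = x_norm_.cpu_diff();
} else {
  top_diff = top[0]->cpu_diff();
}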

@kkhoot (Contributor, Author) commented Nov 10, 2015

OK. The extra complexity was the reason I did not bring in those changes at first. I agree that there will be little benefit with the two separate moving averages.

we assign numbers to fields in prototxt's because in binary files, the fields are stored by number, not by name

Thanks for this. I did not know about it.

This is necessary for the correctness of in-place backwards computation, since the computation clobbers bottom.diff() before it's done reading from top.diff()

I did not catch that. I will revert that change. Maybe it would be better to add an in-place case to the tests.

I will update the moving average part in the first commit.

@kkhoot force-pushed the fix_bn branch 2 times, most recently from 0d50137 to daa8860, on November 10, 2015 at 18:20
@kkhoot (Contributor, Author) commented Nov 10, 2015

Update is done.

@shelhamer (Member)

@cdoersch could you review the latest update? Thanks.

@cdoersch (Contributor)

I'd say it's almost there. However, @kkhoot, you also removed the m/(m-1) correction, and I don't see any replacement for it in your code. This is needed to get the unbiased estimate of the variance. What happened to it?

@kkhoot (Contributor, Author) commented Nov 12, 2015

@cdoersch Please see the above lines.

@cdoersch (Contributor)

Ok, I see it now. That's not where I expected it. If you do it there and use_global_stats is false, won't you end up dividing the batch by the unbiased variance estimate?

When you're doing batch normalization at training time, the empirical variance--i.e. \sum_i X_i^2--of the output should be 1; as far as I understand, you're not supposed to do the bias correction until you're using global statistics at test time.

@kkhoot (Contributor, Author) commented Nov 12, 2015

The above lines are executed only when use_global_stats_ is true.

@cdoersch (Contributor)

Oh I see--I thought it was if (!use_global_stats_).

If that's the case, then it's wrong for a different reason. There's no reason that m shouldn't change from training time to testing time; it's common for people to deploy networks with a different batch size than the one they trained with. This correction needs to be done if and only if use_global_stats_ is false, but the corrected value must only be stored in the parameter, and not used for actually performing batch normalization.

I guess this isn't obvious either. You should add a comment.
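
A hedged restatement of the rule (notation is ours, not the PR diff): with batch size m and biased batch variance

\hat{\sigma}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \bar{x})^2,

the current batch is normalized with \hat{\sigma}^2 itself, y_i = (x_i - \bar{x}) / \sqrt{\hat{\sigma}^2 + \varepsilon}, while the value accumulated into the stored running variance is the unbiased estimate \frac{m}{m-1} \hat{\sigma}^2. At test time (use_global_stats_ true) the stored value is used as-is, so the test-time batch size never enters the computation.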

@kkhoot (Contributor, Author) commented Nov 12, 2015

OK. Good point. I moved the correction to the !use_global_stats_ part.

@cysin commented Nov 13, 2015

I used BN layers in my model and the GPU memory consumption increased significantly. Any idea about this?

@cdoersch (Contributor)

@kkhoot @shelhamer LGTM. Evan, please merge this when you get a chance. Thanks @kkhoot for all your hard work!

@cysin we try to keep the conversation on PRs relevant to the PR itself, so I'm not going to answer here unless it's specifically this PR that increases your memory usage (and I don't think it does). You can ask your question on caffe-users or look at the code to see the memory requirements.

shelhamer added a commit that referenced this pull request Nov 14, 2015
Update BatchNormLayer: more numerical stability and backward with global stats
@shelhamer merged commit fb43ea1 into BVLC:master on Nov 14, 2015
@shelhamer (Member)

Thanks for the improvements @kkhoot and thanks for the reviews @cdoersch!

@kkhoot (Contributor, Author) commented Nov 16, 2015

@cdoersch Thanks for the reviews. BatchNormLayer helps a lot in training my model. Thanks again.
