Update BatchNormLayer #3299

Merged
merged 2 commits into BVLC:master on Nov 14, 2015

Conversation

@kkhoot (Contributor) commented Nov 7, 2015

This PR updates BatchNormLayer.

  1. Enhance the numerical stability of the variance computation by using var(X) = E((X-EX)^2) (see the sketch after this list).
  2. Modify the exponential moving average used in the global statistics computation.
    • The current implementation can be slightly inaccurate, since it saves m/(m-1)E(X)^2 and subtracts E(X)^2 from it rather than m/(m-1)E(X)^2.
    • Change the moving-average formula to S = alpha * Xt + (1-alpha)*S, which does not need a separate scale factor.
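
For illustration, a minimal standalone sketch (not the PR diff) of why point 1 matters: computing the variance as E(X^2) - E(X)^2 can lose precision through cancellation when the mean is large relative to the spread, whereas E((X-EX)^2) avoids that.

#include <cstddef>
#include <vector>

// Biased variance via E((X - EX)^2): two passes, numerically stable.
double variance_centered(const std::vector<double>& x) {
  double mean = 0.0;
  for (double v : x) mean += v;
  mean /= x.size();
  double var = 0.0;
  for (double v : x) var += (v - mean) * (v - mean);
  return var / x.size();
}

// Biased variance via E(X^2) - E(X)^2: one pass, but subject to cancellation.
double variance_moments(const std::vector<double>& x) {
  double sum = 0.0, sum_sq = 0.0;
  for (double v : x) { sum += v; sum_sq += v * v; }
  const double mean = sum / x.size();
  return sum_sq / x.size() - mean * mean;
}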

@cdoersch (Contributor) commented Nov 9, 2015

I think you mean E(X)^2 for parts of what you're saying, but yes, this is a good point. This is incorrect in the current implementation. I agree with merging the first two points in this PR. @ronghanghu, can you mark this as a bugfix as well as an enhancement?

However, I don't agree with merging the PR as-is. The problem with using S = alpha * Xt + (1-alpha)*S is that it will be biased if the number of iterations is small relative to 1/alpha. It's actually possible that some users will want to set alpha=0 if they want to integrate over an entire dataset. Hence, I would recommend against making this change.
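
To make the bias concrete (a hedged gloss, not part of the original comment), unrolling the proposed update over T iterations gives

S_T = (1 - \alpha)^T S_0 + \alpha \sum_{t=1}^{T} (1 - \alpha)^{T-t} X_t

so when T << 1/\alpha, the weight (1 - \alpha)^T on the initial value S_0 is still close to 1, and the estimate is dominated by the initialization rather than by the data.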

Also, while we're doing this, can we add two more simple changes:

  1. Remove the "TODO(cdoersch): allow an option to use an unbiased variance estimate, like the paper does." comment. It doesn't really make sense at the moment (it referred to old code); the unbiased estimate is what we should always be using.

  2. Make the backward pass work when use_global_stats_ is active. It's actually a trivial change: basically add a

if (use_global_stats_) {
  // CPU version shown; the GPU path uses caffe_gpu_div and temp_.gpu_data()
  caffe_div(temp_.count(), top_diff, temp_.cpu_data(), bottom_diff);
  return;
}

instead of the current CHECK. There's really no good reason why this shouldn't be allowed, and arguably people will sometimes want to turn batch normalization off halfway through training.

@kkhoot force-pushed the fix_bn branch 2 times, most recently from 641e3b4 to b0707be, on November 10, 2015 at 14:55
@kkhoot (Contributor, Author) commented Nov 10, 2015

@cdoersch I miswrote it. As you said, it is E(X)^2. Thanks for the clarification.

The problem with using S = alpha * Xt + (1-alpha)*S is that it will be biased if the number of iterations is small relative to 1/alpha.

It is true that the exponential moving average (EMA) has a problem with initial values. The formula S = alpha * Xt + (1-alpha)*S is the standard EMA, and it can be combined with (although not always) better initialization tricks. The initial-value problem can be ameliorated by averaging some number of starting samples and using that mean as the seed value. I think the current implementation can have a bias towards recent samples, since with a small alpha a few new samples dominate the sum.
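
A minimal sketch of the seeding trick described above (an illustration, not the code in this PR): average the first few samples to obtain the seed, then switch to the standard EMA update.

struct SeededEma {
  double alpha;   // EMA weight for new samples
  int warmup;     // number of initial samples used to seed S
  double S = 0.0;
  int seen = 0;
  SeededEma(double a, int w) : alpha(a), warmup(w) {}
  void update(double x) {
    ++seen;
    if (seen <= warmup) {
      S += (x - S) / seen;                // running mean of the first `warmup` samples
    } else {
      S = alpha * x + (1.0 - alpha) * S;  // standard EMA afterwards
    }
  }
};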

It's actually possible that some users will want to set alpha=0 if they want to integrate over an entire dataset.

What you are describing is probably the cumulative moving average (CMA). CMA and the new initialization method have been implemented in commits on another branch of mine. I am thinking about bringing those commits here. Could you please check them? c1884c1 18ef30f
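
For reference, the standard CMA update (a textbook definition, not taken from those commits) is

S_n = S_{n-1} + \frac{X_n - S_{n-1}}{n} = \frac{1}{n} \sum_{t=1}^{n} X_t

i.e. the plain mean of all samples seen so far, which is the "integrate over the entire dataset" behaviour mentioned above.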

Also, while we're doing this, can we add two more simple changes:

I made the changes in a new commit. Please check it.

@cdoersch (Contributor)

I still disagree with these changes. It's a lot of extra complexity--not to mention breaking backwards compatibility--for essentially no benefit.

You are certainly correct that the current initialization scheme is biased toward more recent samples, but it's exactly the same bias that would be there in the moving average. In general, it's better to use such an exponential weighting of the samples even in the beginning, since the user explicitly told us how much more to trust recent samples than old samples. It's common that you want to do this if you don't expect the statistics to be stationary.

A side note from reading your other commits: we assign numbers to fields in the .proto definitions because, in binary files, the fields are stored by number, not by name. Hence, for backward compatibility, you should never change the number associated with an existing field; new fields should get the next higher number.

Finally, why did you take out the line that copies the top diff into x_norm_.diff()? This is necessary for the correctness of the in-place backward computation, since the computation clobbers bottom.diff() before it's done reading from top.diff(). If this wasn't obvious to you, maybe we should add a comment to explain why that line is there.
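
For context, roughly the guard being referred to, as it would sit near the top of BatchNormLayer<Dtype>::Backward_cpu (a sketch; the exact code may differ):

const Dtype* top_diff;
if (bottom[0] == top[0]) {
  // In-place case: bottom's diff aliases top's diff, and the computation below
  // overwrites bottom_diff before it has finished reading top_diff, so keep a
  // private copy of the top diff in x_norm_'s diff buffer.
  caffe_copy(x_norm_.count(), top[0]->cpu_diff(), x_norm_.mutable_cpu_diff());
  top_diff = x_norm_.cpu_diff();
} else {
  top_diff = top[0]->cpu_diff();
}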

@kkhoot (Contributor, Author) commented Nov 10, 2015

OK. The extra complexity was the reason I did not bring in those changes at first. I agree that there will be little benefit with the two separate moving averages.

we assign numbers to fields in prototxt's because in binary files, the fields are stored by number, not by name

Thanks for this. I did not know about it.

This is necessary for the correctness of in-place backwards computation, since the computation clobbers bottom.diff() before it's done reading from top.diff()

I did not catch that. I will revert that change. Maybe it would be better to add an in-place case to the tests.

I will update the moving average part in the first commit.

@kkhoot force-pushed the fix_bn branch 2 times, most recently from 0d50137 to daa8860, on November 10, 2015 at 18:20
@kkhoot (Contributor, Author) commented Nov 10, 2015

Update is done.

@shelhamer (Member)

@cdoersch could you review the latest update? Thanks.

@cdoersch (Contributor)

I'd say it's almost there. However, @kkhoot, you also removed the m/(m-1) correction, and I don't see any replacement for it in your code. This is needed to get the unbiased estimate of the variance. What happened to it?

@kkhoot (Contributor, Author) commented Nov 12, 2015

@cdoersch Please see the above lines.

@cdoersch (Contributor)

Ok, I see it now. That's not where I expected it. If you do it there and use_global_stats is false, won't you end up dividing the batch by the unbiased variance estimate?

When you're doing batch normalization at training time, the empirical variance--i.e. \sum_i X_i^2--of the output should be 1; as far as I understand, you're not supposed to do the bias correction until you're using global statistics at test time.

@kkhoot (Contributor, Author) commented Nov 12, 2015

The above lines are executed only when use_global_stats_ is true.

@cdoersch (Contributor)

Oh I see--I thought it was if (!use_global_stats_).

If that's the case, then it's wrong for a different reason. There's no reason that m shouldn't change from training time to testing time; it's common for people to deploy networks with a different batch size than the one they trained with. This correction needs to be done if and only if use_global_stats_ is false, but the corrected value must only be stored in the parameter, and not used for actually performing batch normalization.

I guess this isn't obvious either. You should add a comment.
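
A hedged restatement of the rule (notation is ours, not the PR diff): with batch size m and biased batch variance

\hat{\sigma}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \bar{x})^2,

the current batch is normalized with \hat{\sigma}^2 itself, y_i = (x_i - \bar{x}) / \sqrt{\hat{\sigma}^2 + \varepsilon}, while the value accumulated into the stored running variance is the unbiased estimate \frac{m}{m-1} \hat{\sigma}^2. At test time (use_global_stats_ true) the stored value is used as-is, so the test-time batch size never enters the computation.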

@kkhoot (Contributor, Author) commented Nov 12, 2015

OK. Good point. I moved the correction to the !use_global_stats_ part.

@cysin commented Nov 13, 2015

I used BN layers in my model and the GPU memory consumption increased significantly. Any idea about this?

@cdoersch (Contributor)

@kkhoot @shelhamer LGTM. Evan, please merge this when you get a chance. Thanks @kkhoot for all your hard work!

@cysin we try to keep the conversation on PRs relevant to the PR itself, so I'm not going to answer here unless it's specifically this PR that increases your memory usage (and I don't think it does). You can ask your question on caffe-users or look at the code to see the memory requirements.

shelhamer added a commit that referenced this pull request Nov 14, 2015
Update BatchNormLayer: more numerical stability and backward with global stats
@shelhamer merged commit fb43ea1 into BVLC:master on Nov 14, 2015
@shelhamer (Member)

Thanks for the improvements @kkhoot and thanks for the reviews @cdoersch!

@kkhoot (Contributor, Author) commented Nov 16, 2015

@cdoersch Thanks for the reviews. BatchNormLayer helps a lot in training my model. Thanks again.
