BatchNorm does not converge in digits #629

Open
sodeypunk opened this issue Mar 12, 2016 · 32 comments

@sodeypunk

I was having problems with the BatchNorm layer not converging when using the NVIDIA version of Caffe. The problem did not occur when using the BVLC Caffe branch.

@gheinrich
Contributor

Hello, do you have more details to share about this?

@sodeypunk
Author

Hi,

Most of the details were logged in the Caffe Google Groups forum:

Batchnorm does not converge

The first post is copied below.

My model, based on 3 convolutions, has no problem converging when using Dropout and LRN. After I substituted both layers with a BatchNorm layer, it no longer converges. I believe I'm using it correctly, but it doesn't work no matter how small I set the learning rate. Can anyone here shed some light on why?

Original Model:


layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 48
    pad: 0
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "drop1"
  type: "Dropout"
  bottom: "conv1"
  top: "conv1"
  dropout_param {
    dropout_ratio: 0.5
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "conv1"
  top: "conv1"
}
layer {
  name: "norm1"
  type: "LRN"
  bottom: "conv1"
  top: "norm1"
  lrn_param {
    local_size: 3
    alpha: 0.0001
    beta: 0.75
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "norm1"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}

.... Repeat x3

New Model:


layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 48
    pad: 0
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "conv1"
  top: "conv1"
}
layer {
  name: "conv1_BN"
  type: "BatchNorm" include { phase: TRAIN}
  bottom: "conv1"
  top: "conv1_BN"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  batch_norm_param {
    use_global_stats: false
    moving_average_fraction: 0.95
  }
}
layer {
  name: "conv1_BN"
  type: "BatchNorm" include { phase: TEST}
  bottom: "conv1"
  top: "conv1_BN"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  batch_norm_param {
    use_global_stats: true
    moving_average_fraction: 0.95
  }
}
layer {
  name: "pool1"
  type: "Pooling"
  bottom: "conv1_BN"
  top: "pool1"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}

.....repeat x3

@slayton58

slayton58 commented Mar 14, 2016

Try adding the following to the batch_norm_param fields:

scale_filler {
  type: "constant"
  value: 1
}
bias_filler {
  type: "constant"
  value: 0
}
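
For context, a minimal sketch of where these fillers would sit in the poster's conv1_BN layer, assuming NVIDIA/caffe's BatchNormParameter (which defines these filler fields); this is only an illustration, not a verified fix:

layer {
  name: "conv1_BN"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1_BN"
  batch_norm_param {
    scale_filler {
      type: "constant"
      value: 1
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}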

@gheinrich
Contributor

Are you using the latest versions of cuDNN (v4.0.5) and nv-caffe 0.14? In particular, there were changes in the cuDNN release that fixed some issues around batch normalization which were present in the release candidates (v4.0.4 and earlier). From the release notes:

UPDATES SINCE RELEASE CANDIDATE

The API of cudnnBatchNormalizationBackward has been changed to include an additional set of scaling parameters (alphaParamsDiff and betaParamsDiff) applied to the dBnScaleResult and dBnBiasResult outputs of the function.

The prior restriction of batch size 512 in all Batch Normalization routines has been removed.

Numerical stability and performance of cudnnBatchNormalizationBackward in some cases has been improved.

@sodeypunk
Author

Yes, it was nv-caffe 0.14, retrieved around Feb 2016. I only installed the latest BVLC Caffe after caffe-nv was failing, and I did not upgrade cuDNN, so I am not sure which version of cuDNN I currently have (will report later).

This leads me to believe that the caffe-nv version was the cause. I'll try slayton's advice and will report back.

@gheinrich
Contributor

Actually it looks like we're at 4.0.7 now, sorry:

$ dpkg -s libcudnn4 
Package: libcudnn4
...
Version: 4.0.7

@mfernezir

@sodeypunk

I've just come across a slightly different batch normalization setup that is working for me in DIGITS. There is a prototxt file posted in the DIGITS user group: Digits3 GoogLeNet Batch Normalization?

After changing some old names ("BN" to "BatchNorm" and "shift_filler" to "bias_filler"), this is how the DIGITS batch normalization layer, followed by its activation, looks:

## BatchNorm
layer {
  bottom: "conv1/7x7_s2"
  name: "conv1/7x7_s2/bn"
  top: "conv1/7x7_s2/bn"
  type: "BatchNorm"
  param {
    lr_mult: 1
    decay_mult: 0
  }
  param {
    lr_mult: 1
    decay_mult: 0
  }
  batch_norm_param {
    scale_filler {
      type: "constant"
      value: 1
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  bottom: "conv1/7x7_s2/bn"
  top: "conv1/7x7_s2/bn"
  name: "conv1/relu_7x7"
  type: "ReLU"
}

Now, a few observations (hopefully to get my head around it too). First, there is no mention of use_global_stats and no separate TRAIN and TEST definitions. Both the train_val and deploy proto have the same layer definition. If I got it right, this is because Caffe automatically infers the correct state for the use_global_stats variable, as mentioned in BVLC/caffe#3347.

This means that by default (since the following is set in batch_norm_layer.cpp) you don't have to set use_global_stats in the prototxt at all:
use_global_stats_ = this->phase_ == TEST;

Second, this approach is different. There are only two lr_mult parameters here and they are not all set to zero as suggested elsewhere. I'm guessing that these two correspond to the gamma and beta parameters mentioned in the paper (Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift). I'm not sure what exactly the three lr_mult parameters mean, except that they relate to some global parameters which the solver isn't supposed to update in that approach. I got that from here: is Batch Normalization supported by Caffe?

The parameters are the collected batch norm statistics. The parameter learning rates need to be set to zero or else the solver will think these are learnable parameters that need to have updates applied to them (with momentum, weight decay, and so forth).
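
For reference, a minimal sketch of the BVLC-style layer that quote refers to, with my annotation of what the three parameter blobs hold (layer and blob names are only illustrative):

layer {
  name: "conv1_bn"
  type: "BatchNorm"
  bottom: "conv1"
  top: "conv1_bn"
  param { #blob 0: running mean
    lr_mult: 0
    decay_mult: 0
  }
  param { #blob 1: running variance
    lr_mult: 0
    decay_mult: 0
  }
  param { #blob 2: moving-average accumulator
    lr_mult: 0
    decay_mult: 0
  }
}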

Finally, there is a difference between the current BVLC and NVIDIA proto. BVLC's doesn't have scale_filler and bias_filler in the BatchNormParameter message, while NVIDIA's does. https://github.com/NVIDIA/caffe/blob/caffe-0.14/src/caffe/proto/caffe.proto#L465-L483
https://github.com/BVLC/caffe/blob/master/src/caffe/proto/caffe.proto#L493-L503

@lukeyeager
Member

@mfernezir you may be interested in BVLC/caffe#3919. It seems like you've got a good grasp of the situation, so please comment on that PR if you have any suggestions!

@mfernezir

@lukeyeager I've added a comment about batch normalization layer usage. I'm not sure if there are some remaining issues there, hope this helps!

@engharat

engharat commented Apr 14, 2016

@mfernezir I'm so confused by all the different batch norm implementations!
If I use your batch normalization I get bad accuracy results. This is with a standard AlexNet train_val and your BatchNorm definition added before every convolution layer:

layer {
  name: "bn0"
  type: "BatchNorm"
  bottom: "data"
  top: "bn0"
  batch_norm_param {
    scale_filler {
      type: "constant"
      value: 1
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
   param {
    lr_mult: 1
    decay_mult: 0
  }
  param {
    lr_mult: 1
    decay_mult: 0
  }
}

With this setup the training loss goes down while the accuracy stays at zero. If I use 'Classify One' to see the batch norm effect, I get wrong (very high) mean and variance on the data and weights.
If I use this form of batch norm parameters:

layer {
  name: "bn0"
  type: "BatchNorm"
  bottom: "data"
  top: "bn0"
  batch_norm_param {
    moving_average_fraction: 0.98
    use_global_stats: false
    scale_filler {
      type: "constant"
      value: 1 
    }
    bias_filler {
      type: "constant"
      value: 0 
    }
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 1
    decay_mult: 1
  }  
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }  
}

I get an accuracy that keeps moving up and down, somewhat converging but not in a uniform way.
If I try to visualize the weights via 'Classify One', I get the mean 0 / var 1 effect of batch normalization, but the network seems to always misclassify the sample.
I'm using nvcaffe v0.15: https://github.com/drnikolaev/caffe/tree/caffe-0.15

@dgschwend

@engharat I've had some (limited) success using the following notation:

layer {
  name: "bn0"
  type: "BatchNorm"
  bottom: "data"
  top: "bn0"
  batch_norm_param {
    scale_filler {
      type: "constant"
      value: 1
    }
    bias_filler {
      type: "constant"
      value: 0
    }
    moving_average_fraction: 0.95
    engine: CUDNN  
  }
}

The use_global_stats parameter is automatically derived during run-time (see batch_norm_layer.cpp). Using engine: CUDNN ensures that the CUDNN implementation is used, since the normal CUDA implementation and the CUDNN implementation are totally incompatible and need different parameters to work.
For me, adding BN with this setup resulted in slightly faster convergence, but also slightly lower accuracy (it's not an AlexNet, but close):
[screenshot: training curves]
Also, try to start with just one BatchNorm layer, as I had problems when using too many of them, too! And you might need to adjust your learning rate parameter (in my case, lowered by a factor of 4).
Good luck!

@engharat

@dgschwend I'll try your implementation in the next few minutes! Meanwhile, could you please confirm that you get mean 0 / var 1.0 out of the BatchNorm layers when visualizing weights with the DIGITS 'Classify One' option?
I suspect your batch normalization is not working properly. While the slightly faster / slightly worse convergence is merely suspicious, the main benefit of this technique is being able to use much higher learning rates. The second version of batch norm I posted, even with all this weird behaviour, lets me use a 0.08 learning rate instead of the usual 0.01 on AlexNet!
And even though the use_global_stats parameter is automatically derived, if I remove use_global_stats: false I get no mean/var normalization on the BatchNorm output layer when using the 'Classify One' weight visualization option in DIGITS! I'm starting to think that the weight visualization script may not be inferring use_global_stats correctly.

@dgschwend

@engharat
I get

  • mean: -0.0359341 / var: 0.332215
  • mean: 0.127164 / var: 0.502211
  • mean: 0.092401 / var: 0.413767

and

  • mean: 0.175899 / var: 1.06823
  • mean: 0.125708 / var: 0.688624
  • mean: 0.19241 / var: 0.722601

in the two BN layers, using three arbitrary test images. I'd say that's not too far from the ideal mean 0 / var 1. However the need for lowering the learning rate is very suspicious.

@mfernezir

@engharat Note that in the setup I found in DIGITS user group BN comes after each convolutional layer and just before activations (ReLU). This may or may not have some importance regarding your issues. I haven't tried using BN directly after Data layers like you are doing here. Also, there might be some differences between your Caffe fork and the current NVIDIA's one. I'm using 0.14.2.

Here are my GoogleNet v3 training and deploy files generated with DIGITS. Some notes: this is modified for a 151-class problem, xavier initialization is changed to msra and, most importantly, this is an older format generated with DIGITS 3.2. There is a change in DIGITS 3.3 which requires slightly different syntax to determine test and train layers (I haven't moved to that yet).

This network had severe overfitting issues for the problem at hand, but no convergence issues. In the end I used different setups and smaller images, but also including one net with BN layers with good results (81% accuracy for my problem). However, I can't actually attribute the effect to BN layers since another version without them ended up practically the same (with differences in training procedure and intermediate results).

I didn't specify the engine parameter manually like @dgschwend did, but hopefully Caffe inferred it correctly with engine parameter set to "DEFAULT". I'm using CUDNN4.

v3-inception-151class.zip

@wlike

wlike commented Apr 19, 2016

@sodeypunk you can put BatchNorm before ReLU and give that a try

@engharat

@mfernezir Thanks for the network you provided! It doesn't seem to be a v3 Inception; it looks more like the v1 GoogLeNet with added batch norm!


@mfernezir

@engharat

Ah yes, it's just v1 with LRN replaced with BN and BN added after the convolutional layers. I haven't really used that net besides one quick run. I just wanted to try BN layers and then I used them on another net.

In any case, I hope you've got your network running.

@engharat

I have some quite interesting behaviour to show you. I still cannot get batch norm to perform correctly.
As a reminder, I'm training from scratch on a big, private and quite easy dataset; easy meaning that within a few epochs any network can reach 95% accuracy.
Here is an example of the first output (so after the first 3-4 inception layers, I don't remember exactly) of a GoogLeNet v1, when NOT using batch normalization:
[screenshot: training graph]
Everything goes smoothly.
The same network, same first output, but with batch normalization added:
[screenshot: training graph]
The accuracy jumps around wildly until the learning rate becomes very small.
And here are some other examples on a slightly modified dataset, still pretty easy; remember that I should get 70-80% accuracy after 2 epochs. I tried different batch norm hyperparameters.
This is with the exponential moving average parameter at 0.95:
[screenshot: training graph]
This is with 0.999 (which should be the default value):
[screenshot: training graph]
And finally with use_global_stats: false:
[screenshot: training graph]

After trying every solution I could think of to keep the accuracy from oscillating (and from sitting at 6% after 2 epochs!), I ran an interesting experiment. I extracted the features from the last layer of my network and trained a linear SVM on those features using some other data. This is just to check how the output layer performs as a feature extractor.
The network without batch norm reached 73% average accuracy with the SVM. The network with batch norm that I let converge into the high nineties gave me 75% accuracy with the SVM. But the most incredible thing (to me) is that when I train the SVM on features from the networks that are at 10% accuracy after 2 epochs, I get 70% with the SVM!
This means that the accuracy values I get from DIGITS during training are completely wrong with batch norm. Something is not performing as it should on the validation data, whereas if I load a network snapshot via Python code and extract the layer values to train an SVM, everything works fine.
@mfernezir: this is what happens if I use your batch norm parameters:
[screenshot: training graph]
As you can see, after 1 epoch the training loss is at 0.38, which corresponds to 85-90% accuracy in the other tests. But the accuracy calculated by DIGITS is still wrong (24%).
I'm thinking it is something related to the exponential moving average, i.e. to the use of accumulated statistics during training/validation versus testing...

@mfernezir

I've also been meaning to comment again since I've tried a few more batch normalization networks in the last couple of days. I have also observed similar strange effects: low training loss and small validation accuracy on high learning rates and then sudden validation accuracy jumps on low learning rates.

There is an interesting comment and code change in the BVLC version which should solve problems with validation accuracy calculations:

BVLC/caffe#3919 (comment)

@gheinrich
Contributor

You could try using another method to compute accuracy: on DIGITS 3.3, on the model page after training is finished, select the model corresponding to epoch 2. Then in "Test a list of images", upload the val.txt file from your dataset job folder. Click "Classify many". This will show a new page with Top-1 accuracy, a confusion matrix, etc. (provided it doesn't time out as you have quite a few images in the validation set).

@mfernezir

The issue is that batch normalization networks have to calculate global mean and variance statistics from batch statistics and then store those values inside the actual model to be used for inference. If these calculations are wrong in some extreme cases, as mentioned in the BVLC comment above, then no matter how we compute validation or test accuracy the results will be wrong.

Still, I've just run some inference tests with DIGITS 3.3.0. I have a couple of different nets trained on the same dataset, some with BN layers and some without. The problem is that I've run into another bug. All of my BN networks use the scale parameter in the Data layer, for example:

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    scale: 0.0078125
    crop_size: 227
  }
  data_param {
    batch_size: 32
  }
}

However, this parameter is not written into the deploy file. This causes much lower accuracy with 'Classify Many' compared to the validation accuracy reported by the graphs. This is also the case for another network that doesn't have BN layers but does use scaling. For that network, on 50 random validation images the accuracy reported with 'Classify Many' is zero percent while it should be around 25%. Models without the scaling parameter (and without batch normalization) have similar 'Classify Many' and graph validation accuracies, although even that can't be said decisively since my validation set is around 100k images.

Maybe engharat could confirm that graph accuracies and classify many accuracies coincide, but this is likely not relevant for the underlying Caffe issue.

@gheinrich
Contributor

Indeed, the Data layer isn't propagated to the deploy network (since this layer type is for reading from an LMDB). Can you use a Power layer instead (like there)?
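
For illustration, a minimal Power layer sketch that reproduces the scale: 0.0078125 transform inside the network itself, so it also appears in the deploy prototxt (the layer and blob names here are only placeholders):

layer {
  name: "scale_data"
  type: "Power"
  bottom: "data"
  top: "data_scaled"
  power_param {
    power: 1
    scale: 0.0078125  #same factor as the transform_param scale above
    shift: 0
  }
}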

Thanks a lot for your detailed analysis of the BN issue. cc @drnikolaev - can you suggest anything to make progress on this?

@mfernezir

Yep, I could use a Power layer instead and I'll do that for convenience in future networks.

@s271

s271 commented Jul 19, 2016

It seems the problem is that cudnnBatchNormalizationForwardTraining produces a wrong running mean or variance. I tried replacing cudnnBatchNormalizationForwardInference with standard BN code for blobs 3 and 4, and the result was exactly the same. Another small problem is that epsilon_ is not initialized, but fixing it does not help.

@mathmanu

mathmanu commented Sep 26, 2016

I was struggling with the convergence issue, but finally the following worked for me. Specifying the engine as CAFFE is important. CUDNN BatchNorm doesn't converge for me.

layer {
  name: "bn2"
  bottom: "conv2"
  top: "conv2"
  type: "BatchNorm"
  param { #scale
    lr_mult: 1
    decay_mult: 1
  }
  param { #shift/bias
    lr_mult: 1
    decay_mult: 1
  }
  param { #global mean
    lr_mult: 0
    decay_mult: 0
  }
  param { #global var
    lr_mult: 0
    decay_mult: 0
  }
  batch_norm_param {
    scale_filler {
      type: "constant"
      value: 1
    }
    bias_filler {
      type: "constant"
      value: 0
    }
    engine: CAFFE
  }
}

@achaiah

achaiah commented Nov 18, 2016

@mathmanu I just verified that "engine: CAFFE" is what makes it train vs not train.

@lukeyeager - any idea why engine: CUDNN does not converge? Looks like there's almost a 10x slowdown when using engine: CAFFE (but at least it's working)

@mathmanu

The top and bottom blobs need to be different for engine:CUDNN BatchNorm (it cannot run in-place). This constraint does not exist for engine:CAFFE BatchNorm. This is the reason for the non-convergence.

See the following thread for a working prototxt with CUDNN BatchNorm. BVLC/caffe#3919
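
To illustrate the constraint, a minimal sketch of an engine:CUDNN BatchNorm layer with distinct bottom and top blobs (the names here are only illustrative), as opposed to the in-place form used above:

layer {
  name: "bn2"
  type: "BatchNorm"
  bottom: "conv2"
  top: "conv2_bn"  #distinct from the bottom blob, i.e. not in-place
  batch_norm_param {
    scale_filler {
      type: "constant"
      value: 1
    }
    bias_filler {
      type: "constant"
      value: 0
    }
    engine: CUDNN
  }
}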

@borisgin @lukeyeager - Since others are also facing the same convergence issue and wondering about it, why not put a check/exit in the CUDNN BatchNorm reshape function when the top and bottom blobs are the same? This will save a lot of headache.

@achaiah

achaiah commented Nov 21, 2016

Thanks, that cleared things up a lot. Sad to see such divergence between BVLC/Caffe and NVIDIA/Caffe. I can't keep all the differences straight. Hope there will be more similarity in the future so that networks can be ported from one to the other without much effort.

@engharat

And the saddest thing is that you cannot test/use any BVLC pretrained batch norm networks on NVCaffe, because the two batch norm formats are not compatible; while at the prototxt level you can still translate between the two formats, to use pretrained Inception/residual networks you have to install BVLC Caffe.
Anyway, I think BVLC Caffe should converge towards NVCaffe: if you train a classic AlexNet on those two forks you will find very little difference, but if you train a heavily batch-normed network you will find that BVLC is 50% slower in training and uses 30% more GPU memory (slowing it down further, because it forces you to use a smaller batch size), so NVCaffe wins easily.

@engharat

After reading the link posted by mathmanu, I see that BVLC Caffe is converging towards the cuDNN batch norm, with all the benefits and speed that brings. Still, I don't fully understand whether with that cuDNN implementation we can seamlessly switch caffemodel networks between BVLC Caffe and NVCaffe, or whether there is still some implementation difference that would cause a mismatch.

@mathmanu

Forget about compatibility with BVLC/caffe; there is no compatibility between engine:CAFFE and engine:CUDNN BatchNorm within NVIDIA/caffe itself.

I hope someone from NVIDIA will clarify what the plan is for fixing these inconsistencies.

@mathmanu

mathmanu commented Nov 22, 2016

My suggestions to fix these issues are the following:

  1. Rename NVIDIA/caffe's BatchNorm to BatchNormScale, since it now includes scaling as well.

  2. Put a check/exit in the CUDNN BatchNormScale reshape function if the top and bottom blobs are the same, so that the user gets a warning.

  3. Fix the inconsistency in blob shape between engine:CAFFE and engine:CUDNN.

  4. Currently I have to specify too many parameters in the new BatchNorm layer. This is unnecessary:

layer {
  name: "bn_conv1"
  bottom: "conv1"
  top: "conv1"
  type: "BatchNorm"
  param { #scale
    lr_mult: 1
    decay_mult: 1
  }
  param { #shift/bias
    lr_mult: 1
    decay_mult: 1
  } 
  param { #global mean
    lr_mult: 0
    decay_mult: 0
  }
  param { #global var
    lr_mult: 0
    decay_mult: 0
  }

  batch_norm_param {
    scale_filler {
      type: "constant"
      value: 1
    }
    bias_filler {
      type: "constant"
      value: 0
    }
    engine: CUDNN
  }
}

(4a). In BatchNormScale, if you change the order of the blobs to global_mean, global_variance, scale, bias, global_counter, then I don't have to specify 4 param fields for lr_mult and decay_mult, but only 2.

(4b). If the definition of the scale and bias fields in BatchNormParameter is changed to:
optional float scale_filler = 5 [default = 1];
optional float bias_filler = 6 [default = 0];
then I don't have to specify these in the prototxt either.

  5. Keep the original BatchNorm from BVLC/caffe as it is, untouched, so that compatibility with BVLC/caffe is not affected and old BVLC/caffe models can be used for fine-tuning. If possible, provide a CUDNN version of this original BatchNorm without scaling as well, so that it can be accelerated.
