Batch Normalization Survey #3684

JiayiFeng · 2017-08-25T18:54:14Z

This is a survey of the main principles and difficulties of batch normalization. Its implementations in Caffe2 and TensorFlow are also covered.

related #3658

JiayiFeng · 2017-08-25T18:54:39Z

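Main principles

During the training of a neural network, the parameters of every layer change at each iteration. As a result, apart from the original input data, the distribution of the input to every later layer (that is, the output of the previous layer) keeps changing. If the output distribution of some layer is poor, it is likely to make the following layers harder to train. To address this, the data can be normalized again (zero mean, unit variance) between layers inside the network: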

x_std = (x - E[x]) / STD[x]

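where E[x] and STD[x] are the mean and standard deviation of the data in the current batch along each feature dimension.

However, this normalization is crude and may well destroy a useful data distribution. So after the simple normalization above, batch_norm applies a further scale-and-shift transformation to the output x_std: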

y = scale * x_std + bias

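where scale and bias are both learnable parameters. Clearly, when scale == STD[x] and bias == E[x], the distribution of the original data x is fully recovered. By learning scale and bias, batch_norm lets the distribution of each feature between layers be adjusted toward whatever state works best.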
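The normalization and the scale/bias step can be sketched in NumPy as follows. This is only a minimal illustration, not any framework's implementation: the function name `batch_norm_forward` and the `eps` argument are assumptions that do not appear in the discussion above. It also checks that setting scale to STD[x] and bias to E[x] maps the output back to x.

```python
import numpy as np

def batch_norm_forward(x, scale, bias, eps=0.0):
    """Normalize each feature over the batch axis, then apply scale and bias.

    x has shape (batch, features); scale and bias have shape (features,).
    eps is a hypothetical stabilizer; 0.0 reproduces the formulas exactly.
    """
    mean = x.mean(axis=0)                 # E[x] per feature
    std = np.sqrt(x.var(axis=0) + eps)    # STD[x] per feature
    x_std = (x - mean) / std
    return scale * x_std + bias

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4))

# With scale = 1 and bias = 0 the output has zero mean and unit variance.
y = batch_norm_forward(x, np.ones(4), np.zeros(4))

# Setting scale to STD[x] and bias to E[x] undoes the normalization.
x_back = batch_norm_forward(x, x.std(axis=0), x.mean(axis=0))
```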

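Note that the forward computation of batch_norm needs E[x] and STD[x], but these two values are only meaningful during train. At infer time there is no batch at all, hence no in-batch mean and variance. These two values therefore have to be specified some other way for infer; a common practice is to use their averages over the training batches.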

JiayiFeng · 2017-08-25T18:55:13Z

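Main difficulties

Compared with ordinary ops, batch_norm poses two main implementation challenges:

- It behaves differently in train and infer. During train, E[x] and STD[x] are computed from the data in the batch, while in infer they are usually obtained by averaging these two values over all training batches.
- Besides the normal forward computation, the batch_norm op itself also has to compute the averages of E[x] and STD[x] along the way and save them for use in infer.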
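Both challenges show up even in a toy version. The sketch below is an assumption-laden illustration: the class `SimpleBatchNorm` and its fields are hypothetical names, and it keeps strict averages of the per-batch statistics, which is the simplest reading of "average over all training batches".

```python
import numpy as np

class SimpleBatchNorm:
    """Toy batch norm that keeps strict averages of the batch statistics."""

    def __init__(self, num_features):
        self.scale = np.ones(num_features)
        self.bias = np.zeros(num_features)
        self.mean_sum = np.zeros(num_features)  # sum of per-batch E[x]
        self.std_sum = np.zeros(num_features)   # sum of per-batch STD[x]
        self.num_batches = 0

    def forward(self, x, is_test):
        if is_test:
            # infer: use the averages collected during training
            mean = self.mean_sum / self.num_batches
            std = self.std_sum / self.num_batches
        else:
            # train: use the statistics of the current batch and record them
            mean = x.mean(axis=0)
            std = x.std(axis=0)
            self.mean_sum += mean
            self.std_sum += std
            self.num_batches += 1
        return self.scale * (x - mean) / std + self.bias

rng = np.random.default_rng(0)
bn = SimpleBatchNorm(num_features=3)
for _ in range(100):
    bn.forward(rng.normal(5.0, 2.0, size=(64, 3)), is_test=False)

# A single sample can be normalized at infer time, with no batch available.
single = bn.forward(np.full((1, 3), 5.0), is_test=True)
```

The forward pass has a side effect in train mode (it mutates the saved statistics), which is exactly what makes this op awkward compared with a stateless one.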

JiayiFeng · 2017-08-25T18:55:47Z

Implementation in Caffe2

In Caffe2, batch_norm is defined in Python as a layer (batch_normalization.py). This layer, however, contains only a single op, SpatialBNOp:

net.SpatialBN([input_blob, self.scale,
               self.bias, self.rm, self.riv],
              output_blobs,
              momentum=self.momentum,
              is_test=is_test,
              order=self.order)

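The arguments mean the following:

[input_blob, self.scale, self.bias, self.rm, self.riv]: the inputs of the op
- input_blob: the input data
- self.scale: the trainable parameter scale
- self.bias: the trainable parameter bias
- self.rm: the average of E[x] over all batches trained so far
- self.riv: the reciprocal of the average of STD[x] over all batches trained so far
output_blobs: the outputs of the op
momentum: the update rate, used to compute the averages of E[x] and STD[x] during training. Its exact role is explained further below
is_test: marks whether the op is currently running in train or infer
order: the storage layout of the data, NCHW or NHWC

Of these, momentum, is_test and order are passed to SpatialBNOp as attributes.

The C++ implementation of SpatialBNOp is in spatial_batch_norm_op.h and spatial_batch_norm_op.cc.

The following describes how Caffe2 overcomes the two implementation difficulties above: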

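Different behavior in train and infer

Caffe2's SpatialBNOp has an attribute named is_test, and the C++ code runs different logic depending on whether it is true or false. is_test is set automatically on the Python side. It is normally false; when running infer, Python generates a SpatialBNOp with is_test set to true.

Computing the averages of `E[x]` and `STD[x]`

What Caffe2 computes is not their strict average but an approximate average obtained through a weighted update. Taking E[x] as an example, the formula is: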

running_mean_arr = running_mean_arr * momentum + E[x] * (1.0 - momentum);

For the first batch, running_mean_arr = E[x]. Once all training is finished, the value of running_mean_arr is used in place of E[x] in infer.
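To make the weighted update concrete, here is a small NumPy sketch of the same formula. It is not Caffe2 code: the helper `update_running_mean` is a hypothetical name, and using `None` to detect the first batch is an assumption made only for this illustration.

```python
import numpy as np

def update_running_mean(running_mean, batch_mean, momentum):
    """One step of the weighted update; None means no batch has been seen."""
    if running_mean is None:
        return batch_mean.copy()          # first batch: running_mean = E[x]
    return running_mean * momentum + batch_mean * (1.0 - momentum)

rng = np.random.default_rng(0)
running_mean = None
batch_means = []
for _ in range(200):
    batch = rng.normal(loc=1.5, scale=1.0, size=(32, 2))
    batch_means.append(batch.mean(axis=0))
    running_mean = update_running_mean(running_mean, batch_means[-1], momentum=0.9)

# The strict average needs every batch mean (or a sum and a count);
# the weighted update only needs the previous running value.
strict_mean = np.mean(batch_means, axis=0)
```

With a momentum close to 1 the running value changes slowly and weights recent batches more heavily than old ones, so it tracks the strict average only approximately.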

To implement this computation, SpatialBNOp defines the following inputs and outputs in C++:

.Input(
        0,
        "X",
        "The input 4-dimensional tensor of shape NCHW or NHWC depending "
        "on the order parameter.")
    .Input(
        1,
        "scale",
        "The scale as a 1-dimensional tensor of size C to be applied to the "
        "output.")
    .Input(
        2,
        "bias",
        "The bias as a 1-dimensional tensor of size C to be applied to the "
        "output.")
    .Input(
        3,
        "mean",
        "The running mean (training) or the estimated mean (testing) "
        "as a 1-dimensional tensor of size C.")
    .Input(
        4,
        "var",
        "The running variance (training) or the estimated "
        "variance (testing) as a 1-dimensional tensor of size C.")
    .Output(0, "Y", "The output 4-dimensional tensor of the same shape as X.")
    .Output(
        1,
        "mean",
        "The running mean after the spatial BN operator. Must be in-place "
        "with the input mean. Should not be used for testing.")
    .Output(
        2,
        "var",
        "The running variance after the spatial BN operator. Must be "
        "in-place with the input var. Should not be used for testing.")
    .Output(
        3,
        "saved_mean",
        "Saved mean used during training to speed up gradient "
        "computation. Should not be used for testing.")
    .Output(
        4,
        "saved_var",
        "Saved variance used during training to speed up "
        "gradient computation. Should not be used for testing.");

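In each training iteration, the op first computes E[x] within the batch, then reads the running_mean_arr accumulated over all previous iterations from Input(3), updates it with the formula above, and writes it to Output(1). E[x] itself is also written to Output(3) to make the backward computation easier.

STD[x] is computed and updated in a similar way.

In addition, by calling EnforceInplace() when registering SpatialBNOp, Caffe2 forces Input(3) and Output(1), and likewise Input(4) and Output(2), to share memory. This is what keeps running_mean_arr continuously updated.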
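The effect of sharing memory between an input and an output can be imitated in NumPy. This is only an analogy and does not use any Caffe2 API; `spatial_bn_update` is a hypothetical function, and passing the same array as both input and output stands in for the enforced in-place pair.

```python
import numpy as np

def spatial_bn_update(batch_mean, running_mean_in, running_mean_out, momentum):
    """Write the updated running mean into running_mean_out."""
    np.multiply(running_mean_in, momentum, out=running_mean_out)
    running_mean_out += batch_mean * (1.0 - momentum)

running_mean = np.zeros(3)              # persistent buffer, plays Input(3)
batch_mean = np.array([1.0, 2.0, 3.0])

# Passing the same array as input and output mimics the enforced in-place pair:
# every call sees the result of the previous call without any explicit copy.
for _ in range(2):
    spatial_bn_update(batch_mean, running_mean, running_mean, momentum=0.5)
```

If the input and output were separate buffers, something outside the op would have to copy the output back to the input after every iteration.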

JiayiFeng · 2017-08-25T19:03:41Z

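Implementation in TensorFlow

In TensorFlow, batch_normalization is also a single op, implemented in the FusedBatchNormOp class. The approach is almost exactly the same as in Caffe2. One fairly obvious difference is that TensorFlow does not update running_mean_arr directly through shared memory as Caffe2 does; the update is computed on the Python side instead.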

lcy-seso · 2017-08-26T00:55:14Z

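One thing I'm curious about: when computing mean and std, does everyone compute them independently on each thread / each GPU and then save only the moving average of mean and std from the main GPU?
There are some methods and implementations that discuss reducing the cost of merging the mean and std computation, but it seems few people actually do this, right?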

JiayiFeng · 2017-08-26T01:22:39Z

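So far I have only looked at the single-threaded implementation and thought about how it fits our current framework. I haven't looked at the multi-threaded implementation yet; I'll keep looking into it~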

lcy-seso · 2017-08-26T01:26:05Z

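There is not much difference in the multi-threaded computation. With multiple threads the data is split, and mean and std are computed from the data each thread was assigned. This yields multiple copies of mean and std, and only one copy is kept when saving.

Merging the data of all threads to compute mean and std is more troublesome. I have seen many methods discussing how to merge the results, but not many seem to have actually been implemented in the frameworks (I haven't surveyed this..)?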
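For reference, one standard way to merge per-shard statistics is to combine the counts, means and variances directly, so the raw data never has to be gathered in one place. The NumPy sketch below is an illustration under that assumption and is not taken from any of the frameworks discussed; `merge_stats` is a hypothetical helper.

```python
import numpy as np

def merge_stats(counts, means, variances):
    """Combine per-shard (count, mean, variance) into whole-batch statistics."""
    counts = np.asarray(counts, dtype=float)[:, None]   # shape (shards, 1)
    means = np.asarray(means)
    variances = np.asarray(variances)
    total = counts.sum()
    mean = (counts * means).sum(axis=0) / total
    # total variance = average within-shard variance + between-shard variance
    var = (counts * (variances + (means - mean) ** 2)).sum(axis=0) / total
    return mean, var

rng = np.random.default_rng(0)
batch = rng.normal(size=(96, 4))
shards = np.array_split(batch, 3)        # pretend each shard lives on one GPU

mean, var = merge_stats(
    [len(s) for s in shards],
    [s.mean(axis=0) for s in shards],
    [s.var(axis=0) for s in shards],
)
```

The arithmetic is cheap; the cost in a real multi-GPU setting comes from the extra synchronization needed to exchange these small tensors in every forward pass.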

lcy-seso · 2017-08-26T01:29:15Z

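One more question, which for now is probably not important / does not need to be considered? It doesn't seem hard to do either..

Computationally, layer normalization is exactly batch normalization with the input matrix transposed: the subsequent computation is identical to batch norm, and the output is transposed back at the end (the layer output width must be fixed).

After transposing, there is no need to estimate mean and std over the mini-batch, so train and infer no longer need to be distinguished. This method was proposed by Hinton. It doesn't seem to be used by many people (the paper's citation count is quite low), but Google's own papers do use it..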
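The transpose relationship is easy to check numerically. The sketch below covers only the normalization step and leaves out the learnable scale and bias; both function names are hypothetical and chosen just for this illustration.

```python
import numpy as np

def normalize_over_axis0(x):
    """The batch norm normalization step: statistics are taken over axis 0."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

def layer_norm_via_transpose(x):
    """Transpose, reuse the batch norm normalization, transpose back."""
    return normalize_over_axis0(x.T).T

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))             # (batch, features)

out = layer_norm_via_transpose(x)
# Direct layer normalization: statistics over the feature axis of each sample.
direct = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)
```

Because each sample is normalized with its own statistics, the result for one sample does not depend on what else is in the batch, which is why no running averages are needed at infer time.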

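I wonder whether batch norm could be reused for this very conveniently (TF implements the two separately).

Maybe wrapping batch norm once would be enough? This just occurred to me, and it may not need to be considered...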

JiayiFeng · 2017-08-26T01:41:33Z

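Actually I think this can be seen as a question of op granularity. If we only provide fine-grained ops on the C++ side and compose the various layers on the Python side, then reuse should be fairly easy to achieve.

The current refactoring design follows this approach. For example, the C++ side only provides operations such as mul, add and sigmoid, and fc is then implemented in Python.

But whether this would cause serious performance problems also seems well worth considering.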

lcy-seso · 2017-08-26T02:07:36Z

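Yes. Batch norm assembled this way takes something like 9 or 11 small operations, with many intermediate results in the forward and backward passes. That brings extra computation (which could be simplified away) and extra memory, and unnecessary memory consumption gets in the way of training deep networks....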

qingqing01 · 2017-08-26T02:40:43Z

@Canpio When implementing BatchNorm, I'd like one feature added: support for using the global moving mean/var during training. Paddle currently does not support this.

JiayiFeng · 2017-08-27T00:58:45Z

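After thinking about it, to implement batch norm in the current framework the following questions need to be considered:

How can BatchNormOp behave differently in train and infer?
We can put a bool is_train in the attributes and branch on it with an if inside the op. To support alternating between train and infer, the op's attributes must be changeable at run time, and the interface for changing them must be exposed to Python.
How should the update of running_mean_arr be implemented?

- Force the input and the output to share memory. This option requires adding a way to enforce shared memory to the current framework. Also, if the shared variable is used by other ops as well, there may be read/write conflicts.
- Copy the output back to the input on the Python side. This may have performance problems.

Should the batch norm layer be assembled in Python or implemented as a single op?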

JiayiFeng · 2017-08-28T22:26:54Z

@qingqing01 That should be doable; I don't think it's much of a problem.

JiayiFeng mentioned this issue Aug 29, 2017

Design doc: Batch Normalization Operator #3748

Merged

JiayiFeng closed this as completed Dec 28, 2017

yzbx mentioned this issue Aug 20, 2018

batch normalization ISCAS007/torchseg#13

Open
