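# Batch Normalization
This section covers the batch normalization layer, which makes deeper neural networks easier to train. Standardizing the input data makes the distributions of the individual features similar, which often makes it easier to train an effective model.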

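Standardizing the input data as a preprocessing step is usually effective enough for shallow models. As training proceeds and the parameters of each layer are updated, the outputs near the output layer rarely change drastically. For deep neural networks, however, even when the input data has been standardized, parameter updates during training can still easily cause drastic changes in the outputs near the output layer. This numerical instability usually makes it hard to train an effective deep model.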

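Batch normalization was proposed precisely to address this challenge of training deep models. During training, it uses the mean and standard deviation of each minibatch to continually adjust the intermediate outputs of the network, so that the intermediate outputs of every layer are numerically more stable.
## Batch normalization layer
Batch normalization is applied slightly differently to fully connected layers and to convolutional layers.
### Batch normalization for fully connected layers
In a fully connected layer, the batch normalization layer is usually placed between the affine transformation and the activation function. Let the input of the fully connected layer be $\boldsymbol{u}$, the weight and bias parameters be $\boldsymbol{W}$ and $\boldsymbol{b}$, and the activation function be $\phi$. Denote the batch normalization operator by $\text{BN}$. The output of a fully connected layer with batch normalization is then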

$$\phi(\text{BN}(\boldsymbol{x})),$$

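where the batch normalization input $\boldsymbol{x}$ is given by the affine transformation

$$\boldsymbol{x} = \boldsymbol{W}\boldsymbol{u} + \boldsymbol{b}.$$

Consider a minibatch of $m$ samples. The affine transformation outputs a new minibatch $\mathcal{B} = \{\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)}\}$, which is exactly the input to the batch normalization layer. For any sample $\boldsymbol{x}^{(i)} \in \mathbb{R}^d, 1 \leq i \leq m$ in the minibatch $\mathcal{B}$, the output of the batch normalization layer is likewise a $d$-dimensional vector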

$$\boldsymbol{y}^{(i)} = \text{BN}(\boldsymbol{x}^{(i)}),$$

which is obtained in the following steps. First, compute the mean and variance over the minibatch $\mathcal{B}$:

$$\boldsymbol{\mu}_\mathcal{B} \leftarrow \frac{1}{m}\sum_{i = 1}^{m} \boldsymbol{x}^{(i)},$$

$$\boldsymbol{\sigma}_\mathcal{B}^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m}(\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_\mathcal{B})^2,$$

where the square is computed element-wise. Next, standardize $\boldsymbol{x}^{(i)}$ using an element-wise square root and element-wise division:

$$\hat{\boldsymbol{x}}^{(i)} \leftarrow \frac{\boldsymbol{x}^{(i)} - \boldsymbol{\mu}_\mathcal{B}}{\sqrt{\boldsymbol{\sigma}_\mathcal{B}^2 + \epsilon}},$$

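Here $\epsilon > 0$ is a small constant that keeps the denominator greater than 0. On top of this standardization, the batch normalization layer introduces two learnable model parameters: the scale parameter $\boldsymbol{\gamma}$ and the shift parameter $\boldsymbol{\beta}$. Both have the same shape as $\boldsymbol{x}^{(i)}$, that is, they are $d$-dimensional vectors. They are combined with $\hat{\boldsymbol{x}}^{(i)}$ by element-wise multiplication (symbol $\odot$) and addition, respectively: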

$${\boldsymbol{y}}^{(i)} \leftarrow \boldsymbol{\gamma} \odot \hat{\boldsymbol{x}}^{(i)} + \boldsymbol{\beta}.$$

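This gives the batch normalization output $\boldsymbol{y}^{(i)}$ for $\boldsymbol{x}^{(i)}$. Note that the learnable scale and shift parameters keep open the possibility of not normalizing $\boldsymbol{x}^{(i)}$ at all: the model only has to learn $\boldsymbol{\gamma} = \sqrt{\boldsymbol{\sigma}_\mathcal{B}^2 + \epsilon}$ and $\boldsymbol{\beta} = \boldsymbol{\mu}_\mathcal{B}$. In other words, if batch normalization does not help, the learned model can in theory do without it.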
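The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `batch_norm_forward`, the toy minibatch, and the value of `eps` are assumptions made for this example. It also checks the identity case, where $\boldsymbol{\gamma} = \sqrt{\boldsymbol{\sigma}_\mathcal{B}^2 + \epsilon}$ and $\boldsymbol{\beta} = \boldsymbol{\mu}_\mathcal{B}$ recover the original input.

```python
import math


def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    """Batch-normalize a minibatch of d-dimensional vectors.

    batch: list of m samples, each a list of d floats.
    gamma, beta: lists of d floats (scale and shift parameters).
    Returns (outputs, mean, var); mean and var are per-feature statistics.
    """
    m = len(batch)
    d = len(batch[0])
    # Per-feature mean over the minibatch.
    mean = [sum(x[j] for x in batch) / m for j in range(d)]
    # Per-feature variance, dividing by m as in the formula above.
    var = [sum((x[j] - mean[j]) ** 2 for x in batch) / m for j in range(d)]
    outputs = []
    for x in batch:
        # Standardize, then scale by gamma and shift by beta, element-wise.
        x_hat = [(x[j] - mean[j]) / math.sqrt(var[j] + eps) for j in range(d)]
        outputs.append([gamma[j] * x_hat[j] + beta[j] for j in range(d)])
    return outputs, mean, var


# Toy minibatch (hypothetical data): m = 4 samples, d = 2 features.
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
eps = 1e-5

# gamma = 1, beta = 0: each feature ends up with zero mean over the minibatch.
y, mean, var = batch_norm_forward(batch, [1.0, 1.0], [0.0, 0.0], eps)

# gamma = sqrt(var + eps), beta = mean: the normalization is undone.
gamma_id = [math.sqrt(v + eps) for v in var]
y_id, _, _ = batch_norm_forward(batch, gamma_id, mean, eps)
```

In the second call `y_id` matches `batch` up to floating-point error, which is the "no normalization" solution described above.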

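### Batch normalization for convolutional layers
For a convolutional layer, batch normalization takes place after the convolution and before the activation function. If the convolution outputs multiple channels, we batch-normalize the output of each channel separately, and every channel has its own scale and shift parameters, both of which are scalars. Suppose the minibatch contains $m$ samples and that, on a single channel, the convolution output has height $p$ and width $q$. We then batch-normalize all $m \times p \times q$ elements of that channel together. When standardizing these elements we use the same mean and variance, namely the mean and variance of the $m \times p \times q$ elements in that channel.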
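A small sketch of the per-channel computation follows. It assumes a channels-last layout `(m, p, q, c)` stored as nested Python lists; the names `batch_norm_conv` and `channel_values` and the toy data are hypothetical and chosen only for illustration.

```python
import math


def channel_values(tensor, k):
    """Collect all m * p * q values of channel k from an (m, p, q, c) nested list."""
    return [pixel[k] for sample in tensor for row in sample for pixel in row]


def batch_norm_conv(batch, gamma, beta, eps=1e-5):
    """Batch-normalize each channel over its m * p * q elements.

    gamma and beta hold one scalar per channel.
    """
    c = len(batch[0][0][0])
    # One shared mean and variance per channel.
    stats = []
    for k in range(c):
        values = channel_values(batch, k)
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats.append((mean, 1.0 / math.sqrt(var + eps)))
    # Normalize every element with the statistics of its own channel.
    return [
        [
            [
                [gamma[k] * (pixel[k] - stats[k][0]) * stats[k][1] + beta[k]
                 for k in range(c)]
                for pixel in row
            ]
            for row in sample
        ]
        for sample in batch
    ]


# Toy input (hypothetical): m = 2 samples, p = q = 2, c = 2 channels.
# Channel 1 is channel 0 scaled by 2, so both normalize to nearly the same values.
x = [[[[float(i * 4 + r * 2 + s) * (k + 1) for k in range(2)]
       for s in range(2)] for r in range(2)] for i in range(2)]
normalized = batch_norm_conv(x, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

Here each channel is normalized over $2 \times 2 \times 2 = 8$ elements, so a single mean and variance is shared across all samples and spatial positions of that channel.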

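### Batch normalization at prediction time
When training with batch normalization, we can set the batch size somewhat larger so that the mean and variance computed within a batch are more accurate. When the trained model is used for prediction, we want it to produce a deterministic output for any input. The output for a single sample should therefore not depend on the mean and variance of a random minibatch, which batch normalization would otherwise require. A common approach is to estimate the sample mean and variance of the entire training dataset with a moving average and to use these estimates at prediction time to obtain deterministic outputs. Like the dropout layer, the batch normalization layer thus computes different results in training mode and in prediction mode.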
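The moving-average idea can be sketched as follows. The class name `RunningStats`, the function `normalize_inference`, and the constant batch statistics fed in are hypothetical. The update rule `moving = momentum * moving + (1 - momentum) * batch_value` and the initial values (mean 0, variance 1) follow the Keras convention, taken here as an assumption rather than as the only possible choice.

```python
import math


class RunningStats:
    """Exponential moving averages of per-feature batch mean and variance."""

    def __init__(self, dim, momentum=0.99):
        self.momentum = momentum
        self.mean = [0.0] * dim  # moving mean starts at zero
        self.var = [1.0] * dim   # moving variance starts at one

    def update(self, batch_mean, batch_var):
        """Blend the statistics of the current minibatch into the averages."""
        mom = self.momentum
        self.mean = [mom * a + (1 - mom) * b for a, b in zip(self.mean, batch_mean)]
        self.var = [mom * a + (1 - mom) * b for a, b in zip(self.var, batch_var)]


def normalize_inference(x, stats, gamma, beta, eps=1e-3):
    """Normalize one sample with the stored statistics, not with batch statistics."""
    return [
        g * (xi - mu) / math.sqrt(v + eps) + b
        for xi, mu, v, g, b in zip(x, stats.mean, stats.var, gamma, beta)
    ]


# Training phase (toy): every minibatch reports mean 2.0 and variance 4.0.
stats = RunningStats(dim=1, momentum=0.9)
for _ in range(200):
    stats.update([2.0], [4.0])

# Prediction phase: the output depends only on the sample and the stored statistics.
y_first = normalize_inference([4.0], stats, gamma=[1.0], beta=[0.0])
y_again = normalize_inference([4.0], stats, gamma=[1.0], beta=[0.0])
```

Because no minibatch statistics enter `normalize_inference`, the same sample always yields the same output, whatever other samples are predicted alongside it.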

## LeNet with batch normalization
The BatchNormalization class defined in the layers module of tf.keras makes batch normalization simple to use.

tf.keras.layers.BatchNormalization(
    axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True,
    beta_initializer='zeros', gamma_initializer='ones',
    moving_mean_initializer='zeros', moving_variance_initializer='ones',
    beta_regularizer=None, gamma_regularizer=None, beta_constraint=None,
    gamma_constraint=None, renorm=False, renorm_clipping=None, renorm_momentum=0.99,
    fused=None, trainable=True, virtual_batch_size=None, adjustment=None, name=None,
    **kwargs
)

Normalize the activations of the previous layer at each batch, i.e. applies a transformation that maintains the mean activation close to 0 and the activation standard deviation close to 1.

Batch normalization differs from other layers in several key aspects:

1) Adding BatchNormalization with training=True to a model causes the result of one example to depend on the contents of all other examples in a minibatch. Be careful when padding batches or masking examples, as these can change the minibatch statistics and affect other examples.

2) Updates to the weights (moving statistics) are based on the forward pass of a model rather than the result of gradient computations.

3) When performing inference using a model containing batch normalization, it is generally (though not always) desirable to use accumulated statistics rather than mini-batch statistics. This is accomplished by passing training=False when calling the model, or using model.predict.

Arguments:
* axis: Integer, the axis that should be normalized (typically the features axis). For instance, after a Conv2D layer with data_format="channels_first", set axis=1 in BatchNormalization.
* momentum: Momentum for the moving average.
* epsilon: Small float added to variance to avoid dividing by zero.
* center: If True, add offset of beta to normalized tensor. If False, beta is ignored.
* scale: If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer.
* beta_initializer: Initializer for the beta weight.
* gamma_initializer: Initializer for the gamma weight.
* moving_mean_initializer: Initializer for the moving mean.
* moving_variance_initializer: Initializer for the moving variance.
* beta_regularizer: Optional regularizer for the beta weight.
* gamma_regularizer: Optional regularizer for the gamma weight.
* beta_constraint: Optional constraint for the beta weight.
* gamma_constraint: Optional constraint for the gamma weight.
* renorm: Whether to use Batch Renormalization (https://arxiv.org/abs/1702.03275). This adds extra variables during training. The inference is the same for either value of this parameter.
* renorm_clipping: A dictionary that may map keys 'rmax', 'rmin', 'dmax' to scalar Tensors used to clip the renorm correction. The correction (r, d) is used as corrected_value = normalized_value * r + d, with r clipped to [rmin, rmax], and d to [-dmax, dmax]. Missing rmax, rmin, dmax are set to inf, 0, inf, respectively.
* renorm_momentum: Momentum used to update the moving means and standard deviations with renorm. Unlike momentum, this affects training and should be neither too small (which would add noise) nor too large (which would give stale estimates). Note that momentum is still applied to get the means and variances for inference.
* fused: If True, use a faster, fused implementation, or raise a ValueError if the fused implementation cannot be used. If None, use the faster implementation if possible. If False, do not use the fused implementation.
* trainable: Boolean, if True the variables will be marked as trainable.
* virtual_batch_size: An int. By default, virtual_batch_size is None, which means batch normalization is performed across the whole batch. When virtual_batch_size is not None, instead perform "Ghost Batch Normalization", which creates virtual sub-batches which are each normalized separately (with shared gamma, beta, and moving statistics). Must divide the actual batch size during execution.
* adjustment: A function taking the Tensor containing the (dynamic) shape of the input tensor and returning a pair (scale, bias) to apply to the normalized values (before gamma and beta), only during training. For example, if axis==-1, adjustment = lambda shape: ( tf.random.uniform(shape[-1:], 0.93, 1.07), tf.random.uniform(shape[-1:], -0.1, 0.1)) will scale the normalized value by up to 7% up or down, then shift the result by up to 0.1 (with independent scaling and bias for each feature but shared across all examples), and finally apply gamma and/or beta. If None, no adjustment is applied. Cannot be specified if virtual_batch_size is specified.

Call arguments:
* inputs: Input tensor (of any rank).
* training: Python boolean indicating whether the layer should behave in training mode or in inference mode.
    * training=True: The layer will normalize its inputs using the mean and variance of the current batch of inputs.
    * training=False: The layer will normalize its inputs using the mean and variance of its moving statistics, learned during training.

Input shape:

Arbitrary. Use the keyword argument input_shape (tuple of integers, does not include the samples axis) when using this layer as the first layer in a model.

Output shape:

Same shape as input.



In [3]:
import tensorflow as tf
from tensorflow.keras import layers, models, losses
import numpy as np 
import pandas as pd 
import plotly as py 
import plotly.graph_objects as go 
import datetime

print('Tensorflow version:', tf.__version__)
print('Numpy version:', np.__version__)
print('Pandas version:', pd.__version__)
print('Plotly version:', py.__version__)


Tensorflow version: 2.2.0
Numpy version: 1.18.1
Pandas version: 1.0.1
Plotly version: 4.8.1


In [4]:
def bulid_lenet_batchnorm():
    net = models.Sequential()
    net.add(layers.Conv2D(filters=6,kernel_size=5, input_shape=(28,28,1)))
    net.add(layers.BatchNormalization())
    net.add(layers.Activation('sigmoid'))
    net.add(layers.MaxPool2D(pool_size=2, strides=2))
    net.add(layers.Conv2D(filters=16,kernel_size=5))
    net.add(layers.BatchNormalization())
    net.add(layers.Activation('sigmoid'))
    net.add(layers.MaxPool2D(pool_size=2, strides=2))
    net.add(layers.Flatten())
    net.add(layers.Dense(120))
    net.add(layers.BatchNormalization())
    net.add(layers.Activation('sigmoid'))
    net.add(layers.Dense(84))
    net.add(layers.BatchNormalization())
    net.add(layers.Activation('sigmoid'))
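    # Output layer with 10 classes. Softmax is the conventional activation for
    # sparse_categorical_crossentropy; sigmoid is kept to match the recorded run.
    net.add(layers.Dense(10,activation='sigmoid'))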
    
    net.compile(loss='sparse_categorical_crossentropy',
              optimizer=tf.keras.optimizers.RMSprop(),
              metrics=['accuracy'])
    net.summary()
    return net

def train_lenet(net, batch_size, epochs):
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape((60000, 28, 28, 1)).astype('float32') / 255
    x_test = x_test.reshape((10000, 28, 28, 1)).astype('float32') / 255
    %load_ext tensorboard
    log_dir = './log/lenetnorm/' + datetime.datetime.now().strftime('%Y%m%d-%H%M%S')
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir = log_dir, histogram_freq=1)
    history = net.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_split=0.2, callbacks=[tensorboard_callback])
    net.save_weights('./ModelTrained/lenetnorm_weights')
    %tensorboard --logdir log/lenetnorm
    test_scores = net.evaluate(x_test, y_test, verbose=2)
    print('Test loss:', test_scores[0])
    print('Test accuracy:', test_scores[1])
    return net



In [5]:
train_lenet(bulid_lenet_batchnorm(),64, 5)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 24, 24, 6)         156       
_________________________________________________________________
batch_normalization (BatchNo (None, 24, 24, 6)         24        
_________________________________________________________________
activation (Activation)      (None, 24, 24, 6)         0         
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 12, 12, 6)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 8, 8, 16)          2416      
_________________________________________________________________
batch_normalization_1 (Batch (None, 8, 8, 16)          64        
_________________________________________________________________
activation_1 (Activation)    (None, 8, 8, 16)          0

Reusing TensorBoard on port 6006 (pid 9100), started 3 days, 0:01:07 ago. (Use '!kill 9100' to kill it.)

313/313 - 0s - loss: 0.1423 - accuracy: 0.9538
Test loss: 0.14229196310043335
Test accuracy: 0.9538000226020813


<tensorflow.python.keras.engine.sequential.Sequential at 0x1b1a9e7bd08>
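
The Param # column in the summary above can be checked by hand. The sketch below assumes that each convolution has one bias per filter and that each batch normalization layer stores four values per channel (gamma, beta, moving mean, moving variance). The helper names are illustrative and are not part of Keras.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights of a square-kernel convolution plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * filters + filters


def batch_norm_params(channels):
    """gamma, beta, moving mean and moving variance: one value each per channel."""
    return 4 * channels


# Layer names as they appear in the summary above.
param_counts = {
    "conv2d": conv2d_params(5, 1, 6),
    "batch_normalization": batch_norm_params(6),
    "conv2d_1": conv2d_params(5, 6, 16),
    "batch_normalization_1": batch_norm_params(16),
}

for name, count in param_counts.items():
    print(name, count)
```

Of the four values a batch normalization layer keeps per channel, only gamma and beta are updated by gradient descent; the moving mean and moving variance are updated during the forward pass.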

In [None]:
%tensorboard --logdir log/lenetnorm --port 6006

Launching TensorBoard...