## Vanishing/Exploding Gradients Problem

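Ways to mitigate vanishing or exploding gradients:

### 1. Initialization strategy

The variance of each layer's outputs should be as close as possible to the variance of its inputs.  
The remedy is to use a dedicated initialization strategy (rather than a plain normal distribution). In the formulas below, uniform initialization draws weights from $[-r, r]$, and normal initialization uses mean 0 and standard deviation $\sigma$.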

* Logistic uniform: $ r = \sqrt{\dfrac{6}{n_\text{inputs} + n_\text{outputs}}} $
* Logistic normal: $ \sigma = \sqrt{\dfrac{2}{n_\text{inputs} + n_\text{outputs}}} $
* Hyperbolic tangent uniform: $ r = 4 \sqrt{\dfrac{6}{n_\text{inputs} + n_\text{outputs}}} $
* Hyperbolic tangent normal: $ \sigma = 4 \sqrt{\dfrac{2}{n_\text{inputs} + n_\text{outputs}}} $
* ReLU (and its variants) uniform: $ r = \sqrt{2} \sqrt{\dfrac{6}{n_\text{inputs} + n_\text{outputs}}} $
* ReLU (and its variants) normal: $ \sigma = \sqrt{2} \sqrt{\dfrac{2}{n_\text{inputs} + n_\text{outputs}}} $
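
As a rough illustration of the table, the sketch below computes $r$ and $\sigma$ for one layer and checks that both parameterizations give the same weight variance (a uniform distribution on $[-r, r]$ has variance $r^2/3$). The function name `init_params`, the `gain` argument and the layer sizes are illustrative assumptions; plain Python is used so the snippet runs without TensorFlow.

```python
import math
import random

def init_params(n_inputs, n_outputs, gain=1.0):
    """Return (r, sigma) for a layer, following the formulas in the table.

    gain is the activation-dependent factor, e.g. sqrt(2) for the ReLU rows
    (hypothetical helper, not part of TensorFlow).
    """
    fan_sum = n_inputs + n_outputs
    r = gain * math.sqrt(6.0 / fan_sum)      # uniform init draws from [-r, r]
    sigma = gain * math.sqrt(2.0 / fan_sum)  # normal init uses mean 0, std sigma
    return r, sigma

# Example sizes (assumed): a 784 -> 300 layer with the ReLU factor sqrt(2)
r, sigma = init_params(784, 300, gain=math.sqrt(2))

# Draw weights uniformly from [-r, r] and measure their standard deviation
rng = random.Random(42)
weights = [rng.uniform(-r, r) for _ in range(100_000)]
mean = sum(weights) / len(weights)
empirical_std = math.sqrt(sum((w - mean) ** 2 for w in weights) / len(weights))

print("r =", r, "sigma =", sigma, "empirical std =", empirical_std)
```

The empirical standard deviation of the uniform samples should land close to `sigma`, which is why the uniform and normal rows of the table are interchangeable in terms of variance.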

In [None]:
# He initialization for ReLU (defaults: factor=2.0, mode='FAN_IN', normal distribution)
he_init = tf.contrib.layers.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_init, name="hidden1")

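### 2. Nonsaturating Activation Functions

In general: ELU > leaky ReLU > ReLU > tanh > logistic

If runtime speed matters more, then ReLU > ELU (ELU is slower to compute because of the exponential)

Default $\alpha$:  
leaky ReLU: 0.01; ELU: 1

The ELU activation function: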

$
\operatorname{ELU}_\alpha(z) =
\begin{cases}
\alpha(\exp(z) - 1) & \text{if } z < 0\\
z & \text{if } z \ge 0
\end{cases}
$
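
To make the definition concrete, here is a minimal plain-Python sketch of ELU next to leaky ReLU, using the default $\alpha$ values listed above. The function names are illustrative; in TensorFlow the ELU is provided by `tf.nn.elu`.

```python
import math

def elu(z, alpha=1.0):
    """ELU: alpha * (exp(z) - 1) for z < 0, and z itself for z >= 0."""
    return alpha * (math.exp(z) - 1.0) if z < 0 else z

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: max(alpha * z, z), i.e. a small slope alpha for z < 0."""
    return max(alpha * z, z)

for z in (-5.0, -1.0, 0.0, 2.0):
    print(z, elu(z), leaky_relu(z))
```

For large negative inputs ELU saturates toward $-\alpha$, while leaky ReLU keeps decreasing linearly with slope $\alpha$.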

In [None]:
# ELU
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu, name="hidden1")

In [None]:
# leaky ReLU
def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)

hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu, name="hidden1")

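### 3. Batch Normalization

Although a good initialization strategy can significantly reduce vanishing/exploding gradients at the start of training,  
it does not guarantee that the problem won't come back later during training.

In general, during training, when the parameters of earlier layers change, the input distribution of later layers changes too (the internal covariate shift problem).

BN adds an operation just before each layer's activation function.  
It zero-centers and normalizes the inputs, then lets the model learn the optimal scale and shift (mean) of each layer's inputs.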

$
\begin{split}
1.\quad & \mathbf{\mu}_B = \dfrac{1}{m_B}\sum\limits_{i=1}^{m_B}{\mathbf{x}^{(i)}}\\
2.\quad & {\mathbf{\sigma}_B}^2 = \dfrac{1}{m_B}\sum\limits_{i=1}^{m_B}{(\mathbf{x}^{(i)} - \mathbf{\mu}_B)^2}\\
3.\quad & \hat{\mathbf{x}}^{(i)} = \dfrac{\mathbf{x}^{(i)} - \mathbf{\mu}_B}{\sqrt{{\mathbf{\sigma}_B}^2 + \epsilon}}\\
4.\quad & \mathbf{z}^{(i)} = \gamma \hat{\mathbf{x}}^{(i)} + \beta
\end{split}
$

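$\mu_B$ and ${\sigma_B}^2$ are the mean and variance computed over the whole mini-batch $B$; $m_B$ is the number of instances in the mini-batch

$\hat{\mathbf{x}}^{(i)}$ is the zero-centered and normalized input  
$\epsilon$ is a tiny number that prevents division by zero (typically $10^{-3}$); it is called the smoothing term

$\gamma$ is the scaling parameter for the layer  
$\beta$ is the shifting (offset) parameter for the layer  
$\mathbf{z}^{(i)}$ is the output of the BN operation: a scaled and shifted version of the inputs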
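
The four steps can be sketched in plain Python for a mini-batch of scalar inputs (a single feature). This is only an illustration under assumptions: `batch_norm_forward` is a hypothetical helper, $\gamma$ and $\beta$ are fixed by hand rather than learned, and real implementations operate on tensors with one $\gamma$/$\beta$ pair per feature.

```python
import math

def batch_norm_forward(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Apply steps 1-4 to a list of scalars and return (z, mu_B, var_B)."""
    m_B = len(batch)
    mu_B = sum(batch) / m_B                              # step 1: batch mean
    var_B = sum((x - mu_B) ** 2 for x in batch) / m_B    # step 2: batch variance
    x_hat = [(x - mu_B) / math.sqrt(var_B + eps)         # step 3: normalize
             for x in batch]
    z = [gamma * xh + beta for xh in x_hat]              # step 4: scale and shift
    return z, mu_B, var_B

batch = [2.0, 4.0, 6.0, 8.0]   # toy mini-batch (assumed values)
z, mu_B, var_B = batch_norm_forward(batch)
print("mu_B =", mu_B, "var_B =", var_B)
print("z =", z)
```

With $\gamma = 1$ and $\beta = 0$ the output has zero mean and a variance slightly below 1, because $\epsilon$ is added to the variance in the denominator.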

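**Applying only the normalization formula above to the outputs of some layer A before feeding them to the next layer B would distort the features that layer A has learned. For example, suppose an intermediate layer has learned features that lie in the two saturated tails of a sigmoid activation. Forcing them to zero mean and unit standard deviation moves them into the central part of the sigmoid, which destroys the feature distribution that the layer learned.**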

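From $\mathbf{z}^{(i)} = \gamma \hat{\mathbf{x}}^{(i)} + \beta$:  
when $\gamma=\sqrt{{\mathbf{\sigma}_B}^2 + \epsilon}$ and $\beta=\mathbf{\mu}_B$, substituting step 3 gives  
$\mathbf{z}^{(i)} = \sqrt{{\mathbf{\sigma}_B}^2 + \epsilon} \cdot \dfrac{\mathbf{x}^{(i)} - \mathbf{\mu}_B}{\sqrt{{\mathbf{\sigma}_B}^2 + \epsilon}} + \mathbf{\mu}_B = \mathbf{x}^{(i)}$, so the features originally learned by the layer can be recovered.  
This is why the learnable reconstruction parameters $\gamma$ and $\beta$ are introduced: they let the network learn to restore the feature distribution that the original network needed to learn.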

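Each batch-normalized layer has 4 parameters:  
$\gamma$ (scale), $\beta$ (offset), $\mu$ (mean), $\sigma$ (standard deviation)  
$\gamma$ and $\beta$ are trained by backpropagation, while $\mu$ and $\sigma$ are moving averages estimated during training and used in place of the batch statistics at test time.

BN greatly reduces the vanishing gradients problem, so:  
1. The model can use saturating activation functions such as tanh and the logistic (sigmoid) function  
2. The network becomes much less sensitive to weight initialization  
3. Larger learning rates can be used, which significantly speeds up training  
4. BN also acts as a regularizer

Drawbacks:  
BN adds complexity to the model  
Because every layer requires extra computation, predictions are slower  
If fast predictions are needed, first check how well ELU + He initialization performs before turning to BN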

### Implementing BN with TensorFlow

In [None]:
import tensorflow as tf

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

training = tf.placeholder_with_default(False, shape=(), name='training')

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = tf.layers.batch_normalization(hidden2, training=training, momentum=0.9)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = tf.layers.batch_normalization(logits_before_bn, training=training,
                                       momentum=0.9)

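In tf.layers.batch_normalization:  
training: whether to use the mean and standard deviation of the current mini-batch (during training) or the running averages accumulated over training (during testing)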
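
How the running statistics used at test time are maintained can be sketched as follows. The update rule is assumed here to be the standard exponential moving average, `moving = momentum * moving + (1 - momentum) * batch_value`; the class is a hypothetical plain-Python stand-in for a single feature (with $\gamma$ and $\beta$ omitted), not TensorFlow's implementation.

```python
import math

class RunningBatchNorm:
    """Toy single-feature BN that tracks moving averages (illustrative only)."""

    def __init__(self, momentum=0.9, eps=1e-3):
        self.momentum = momentum
        self.eps = eps
        self.moving_mean = 0.0   # starts at 0
        self.moving_var = 1.0    # starts at 1

    def __call__(self, batch, training):
        if training:
            # Use the statistics of the current mini-batch ...
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # ... and update the moving averages as a side effect
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # At test time, rely on the moving averages instead
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]

bn = RunningBatchNorm()
for _ in range(200):                      # repeated training steps on one toy batch
    bn([2.0, 4.0, 6.0, 8.0], training=True)

print(bn.moving_mean, bn.moving_var)      # both approach the batch statistics
print(bn([5.0], training=False))          # normalized with the moving averages
```

A momentum closer to 1 makes the moving averages change more slowly, so they are smoother but take more steps to catch up with the batch statistics.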

To avoid repeating the same arguments, Python's partial() function can be used.
tf.layers.dense() is incompatible with tf.contrib.layers.arg_scope(), so functools.partial() is used instead.
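
As a quick standalone illustration of what `partial()` does, the sketch below pre-binds keyword arguments on a hypothetical stand-in function (`fake_batch_norm` is made up for this example and only echoes its arguments).

```python
from functools import partial

def fake_batch_norm(inputs, training=False, momentum=0.99):
    """Hypothetical stand-in that just reports the arguments it received."""
    return {"inputs": inputs, "training": training, "momentum": momentum}

# Pre-bind the arguments that would otherwise be repeated on every call
my_layer = partial(fake_batch_norm, training=True, momentum=0.9)

print(my_layer("hidden1"))                 # uses training=True, momentum=0.9
print(my_layer("hidden2", momentum=0.5))   # a pre-bound keyword can still be overridden
```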

In [None]:
from functools import partial

# Pre-bind the training and momentum arguments of tf.layers.batch_normalization
my_batch_norm_layer = partial(tf.layers.batch_normalization,
                              training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = my_batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = my_batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = my_batch_norm_layer(logits_before_bn)

### Using the ELU activation function with Batch Normalization

In [None]:
batch_norm_momentum = 0.9
learning_rate = 0.01  # assumed value; tune as needed

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None,), name="y")  # 1-D vector of class ids
training = tf.placeholder_with_default(False, shape=(), name='training')

with tf.name_scope("dnn"):
    he_init = tf.contrib.layers.variance_scaling_initializer()

    # Batch Normalization with shared arguments
    my_batch_norm_layer = partial(
            tf.layers.batch_normalization,
            training=training,
            momentum=batch_norm_momentum)

    # Dense layers with He initialization
    my_dense_layer = partial(
            tf.layers.dense,
            kernel_initializer=he_init)

    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    bn1 = tf.nn.elu(my_batch_norm_layer(hidden1))
    hidden2 = my_dense_layer(bn1, n_hidden2, name="hidden2")
    bn2 = tf.nn.elu(my_batch_norm_layer(hidden2))
    logits_before_bn = my_dense_layer(bn2, n_outputs, name="outputs")
    logits = my_batch_norm_layer(logits_before_bn)

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    
init = tf.global_variables_initializer()
saver = tf.train.Saver()
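
For intuition about the `loss` and `eval` scopes, here is a plain-Python sketch of what sparse softmax cross-entropy and top-1 accuracy compute for a small batch. The helper names and the toy logits are assumptions for illustration; the TensorFlow ops do the same kind of computation on whole tensors.

```python
import math

def sparse_softmax_xentropy(logits, label):
    """Cross-entropy of one example: -log(softmax(logits)[label])."""
    shift = max(logits)                                   # subtract max for stability
    log_sum_exp = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return log_sum_exp - logits[label]

def top1_accuracy(batch_logits, labels):
    """Fraction of examples whose largest logit is at the true label."""
    hits = [max(range(len(row)), key=row.__getitem__) == y
            for row, y in zip(batch_logits, labels)]
    return sum(hits) / len(hits)

batch_logits = [[2.0, 0.5, -1.0],    # toy logits for 3 classes (assumed values)
                [0.1, 0.2, 3.0],
                [1.0, 1.5, 0.0]]
labels = [0, 2, 0]                   # integer class ids, as with the placeholder y

losses = [sparse_softmax_xentropy(row, y) for row, y in zip(batch_logits, labels)]
loss = sum(losses) / len(losses)     # mean over the batch, like tf.reduce_mean
accuracy = top1_accuracy(batch_logits, labels)
print("loss =", loss, "accuracy =", accuracy)
```

Note that the labels are integer class ids rather than one-hot vectors, which is what the "sparse" variant of the loss expects.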

In [None]:
n_epochs = 20
batch_size = 200

**Because tf.layers.batch_normalization() is used here, the extra update operations must be run explicitly. They update the moving averages of the mean and variance at every training step.**

In [None]:
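# Assumes MNIST is loaded as below; the data directory is a placeholder
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/")

# Ops that update the BN moving averages; they must be run alongside training_op
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)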

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")

**For neural networks with few hidden layers, ELU and BN are unlikely to bring much benefit, but for deeper networks they can improve the model significantly.**