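### **Making training more stable**
- Goal: keep gradient values within a reasonable range
  - For example, [1e-6, 1e-3]
- **Turn multiplication into addition**
  - ResNet, LSTM
- **Normalization**
  - Gradient normalization, gradient clipping
- **Sensible weight initialization and activation functions**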
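
As an illustration of the gradient clipping item, here is a minimal sketch in plain Python. The helper name `clip_by_global_norm` and the flat-list representation of gradients are assumptions made for this example; deep learning frameworks ship their own utilities for this.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm.

    Hypothetical helper for illustration only.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within the allowed range, leave unchanged
    scale = max_norm / total_norm  # shrink every component by the same factor
    return [g * scale for g in grads]

# A gradient of norm 5 is scaled down to norm 1; its direction is preserved.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped)  # approximately [0.6, 0.8]
```

Clipping only bounds the gradient from above; it does nothing for vanishing gradients, which is why the notes also list architectural changes and initialization.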

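### **Keeping the variance of every layer constant**
- Treat each layer's outputs and gradients as random variables
- Keep their means and variances the same across layers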

![20230809122858](https://cdn.jsdelivr.net/gh/Corner430/Picture1/images/20230809122858.png)

Here $a$ and $b$ are both constants.

--------------------------------------

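### **Weight initialization**
- Initialize parameters randomly within a reasonable range
- Numerical instability is more likely at the start of training
  - Far from the optimum, the loss surface can be very complex
  - Near the optimum, the surface tends to be flatter
- Initializing with $\mathscr{N}(0, 0.01)$ may be fine for small networks, but there is no guarantee it works for deep neural networks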
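
To make the last point concrete, the sketch below pushes a random vector through a stack of purely linear layers and records the variance of the activations after each layer. It assumes that the 0.01 in $\mathscr{N}(0, 0.01)$ is a standard deviation, and the function names, width, and depth are arbitrary choices for this example.

```python
import random

def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def forward_variances(width, depth, weight_std, seed=0):
    """Return the activation variance after each of `depth` linear layers.

    Every layer is width x width with weights drawn from N(0, weight_std^2)
    and no activation function (an assumption of this sketch).
    """
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]  # input with variance ~1
    history = []
    for _ in range(depth):
        # Each output unit is a fresh random weight row dotted with h.
        h = [sum(rng.gauss(0.0, weight_std) * x for x in h) for _ in range(width)]
        history.append(variance(h))
    return history

history = forward_variances(width=64, depth=10, weight_std=0.01)
print(history[0], history[-1])  # the variance collapses toward zero with depth
```

With these settings each layer scales the variance by roughly 64 × 0.01² = 0.0064, so the signal all but disappears after ten layers. Calling `forward_variances(64, 10, (1 / 64) ** 0.5)` instead should keep the variance within a few multiples of 1.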

[References](https://www.bilibili.com/video/BV1u64y1i75a/?p=2&share_source=copy_web&vd_source=a7ae9163cb2cd121bfd86ea1f4ecd2ef&t=331)

### Example: MLP
- Assumptions
  - The $w^t_{i,j}$ are i.i.d. with $\mathbb{E}[w^t_{i,j}] = 0$ and $\mathrm{Var}[w^t_{i,j}] = \gamma_t$
  - $h_i^{t-1}$ is independent of $w^t_{i,j}$
- Assume no activation function: $\mathbf{h}^t = \mathbf{W}^t\mathbf{h}^{t-1}$, where $\mathbf{W}^t \in \mathbb{R}^{n_t \times n_{t-1}}$
$$\mathbb{E}[h^t_i] = \mathbb{E}[\sum_{j} w^t_{i,j}h^{t-1}_j] = \sum_{j} \mathbb{E}[w^t_{i,j}]\mathbb{E}[h^{t-1}_j] = 0$$

### **Forward variance**

$$
\begin{aligned}
    \mathrm{Var}[h_i^t] & = \mathbb{E}[(h_i^t)^2] - (\mathbb{E}[h_i^t])^2 = \mathbb{E}[(\sum_{j} w_{i,j}^t h_j^{t-1})^2] \\
        & = \mathbb{E}[\sum_{j} (w_{i,j}^t)^2 (h_j^{t-1})^2 + \sum_{j \neq k} w_{i,j}^t w_{i,k}^t h_j^{t-1} h_k^{t-1}] \\
        & = \sum_{j} \mathbb{E} [(w_{i,j}^t)^2] \mathbb{E}[(h_j^{t-1})^2] \\
        & = \sum_{j} \mathrm{Var}[w_{i,j}^t] \mathrm{Var}[h_j^{t-1}] \\
        & = n_{t-1} \gamma_t \mathrm{Var}[h_j^{t-1}] \\
        \text{Requiring } \mathrm{Var}[h_i^t] = \mathrm{Var}[h_j^{t-1}] \qquad \qquad & \Rightarrow n_{t-1} \gamma_t = 1
\end{aligned}
$$
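
The relation $\mathrm{Var}[h_i^t] = n_{t-1} \gamma_t \mathrm{Var}[h_j^{t-1}]$ can be checked numerically. The following Monte Carlo sketch assumes Gaussian weights and unit-variance Gaussian inputs (the derivation itself only needs zero mean and independence); the function name and the sample sizes are choices made for this example.

```python
import random

def simulate_unit_variance(n_in, gamma, trials=4000, seed=0):
    """Estimate Var[h_i^t] for one output unit h_i^t = sum_j w_{i,j} h_j^{t-1}.

    Inputs h^{t-1} ~ N(0, 1) and weights w ~ N(0, gamma), redrawn every trial.
    """
    rng = random.Random(seed)
    w_std = gamma ** 0.5
    outputs = []
    for _ in range(trials):
        h_prev = [rng.gauss(0.0, 1.0) for _ in range(n_in)]   # Var[h^{t-1}] = 1
        w_row = [rng.gauss(0.0, w_std) for _ in range(n_in)]  # Var[w] = gamma
        outputs.append(sum(w * h for w, h in zip(w_row, h_prev)))
    mean = sum(outputs) / trials
    return sum((o - mean) ** 2 for o in outputs) / trials

n_in = 50
var_preserved = simulate_unit_variance(n_in, gamma=1.0 / n_in)  # n_in * gamma = 1
var_amplified = simulate_unit_variance(n_in, gamma=4.0 / n_in)  # n_in * gamma = 4
print(var_preserved, var_amplified)  # estimates should land near 1 and near 4
```

The estimates are noisy, but they should track $n_{t-1} \gamma_t$: the variance is preserved only when that product is 1.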

### **Backward mean and variance**
- Analogous to the forward case
$$\frac{\partial \mathscr{l}}{\partial \mathbf{h}^{t-1}} = \frac{\partial \mathscr{l}}{\partial \mathbf{h}^{t}} \mathbf{W}^{t} \quad \Rightarrow \quad (\frac{\partial \mathscr{l}}{\partial \mathbf{h}^{t-1}})^T = (\mathbf{W}^{t})^T (\frac{\partial \mathscr{l}}{\partial \mathbf{h}^{t}})^T$$

$$\mathbb{E}[\frac{\partial \mathscr{l}}{\partial h^{t-1}_i}] = 0, \qquad \mathrm{Var}[\frac{\partial \mathscr{l}}{\partial h^{t-1}_i}] = n_t \gamma_t \mathrm{Var}[\frac{\partial \mathscr{l}}{\partial h^{t}_j}] \quad \Rightarrow \quad n_t \gamma_t = 1$$
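
For reference, the backward case can be expanded element by element in the same way as the forward one. This sketch additionally assumes that the incoming gradients $\partial \mathscr{l} / \partial h_j^t$ are zero-mean, share one variance across $j$, and are independent of the weights; these assumptions mirror the forward case and are not stated explicitly in the notes.

$$
\begin{aligned}
    \frac{\partial \mathscr{l}}{\partial h_i^{t-1}} & = \sum_{j} \frac{\partial \mathscr{l}}{\partial h_j^{t}} w_{j,i}^t \\
    \mathbb{E}[\frac{\partial \mathscr{l}}{\partial h_i^{t-1}}] & = \sum_{j} \mathbb{E}[\frac{\partial \mathscr{l}}{\partial h_j^{t}}] \mathbb{E}[w_{j,i}^t] = 0 \\
    \mathrm{Var}[\frac{\partial \mathscr{l}}{\partial h_i^{t-1}}] & = \sum_{j} \mathrm{Var}[w_{j,i}^t] \mathrm{Var}[\frac{\partial \mathscr{l}}{\partial h_j^{t}}] = n_t \gamma_t \mathrm{Var}[\frac{\partial \mathscr{l}}{\partial h_j^{t}}]
\end{aligned}
$$

The sum now runs over the $n_t$ outputs of layer $t$ rather than its $n_{t-1}$ inputs, which is why the backward condition involves $n_t$.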

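### **Xavier initialization**
- **It is hard to satisfy both $n_{t-1} \gamma_t = 1$ and $n_t \gamma_t = 1$, because in general $n_{t-1} \neq n_t$; these are the numbers of neurons in the previous layer and in the current layer**
- Xavier strikes a compromise: $\gamma_t(n_{t-1} + n_t) / 2 = 1 \quad \rightarrow \quad \gamma_t = 2 / (n_{t-1} + n_t)$
  - Normal distribution: $\mathscr{N}(0, \sqrt{2 / (n_{t-1} + n_t)})$, where the second argument is the standard deviation
  - Uniform distribution: $\mathscr{U}(-\sqrt{6 / (n_{t-1} + n_t)}, \sqrt{6 / (n_{t-1} + n_t)})$
    - The distribution $\mathscr{U}[-a, a]$ has variance $a^2 / 3$
- Adapts to changes in the weight shape, especially $n_t$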
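
The two Xavier variants can be sketched in plain Python as follows. The function names and the nested-list weight layout are assumptions made for this example, not a framework API; the point is only that both samplers target the same variance $2 / (n_{t-1} + n_t)$.

```python
import math
import random

def xavier_variance(n_in, n_out):
    """Target weight variance gamma_t = 2 / (n_in + n_out)."""
    return 2.0 / (n_in + n_out)

def xavier_normal(n_in, n_out, rng):
    """n_out x n_in weights from a normal with std sqrt(2 / (n_in + n_out))."""
    std = math.sqrt(xavier_variance(n_in, n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def xavier_uniform(n_in, n_out, rng):
    """n_out x n_in weights from U(-a, a) with a = sqrt(6 / (n_in + n_out))."""
    bound = math.sqrt(6.0 / (n_in + n_out))  # a^2 / 3 equals the target variance
    return [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]

def sample_variance(matrix):
    """Population variance over all entries of a nested-list matrix."""
    flat = [w for row in matrix for w in row]
    mean = sum(flat) / len(flat)
    return sum((w - mean) ** 2 for w in flat) / len(flat)

rng = random.Random(0)
n_in, n_out = 120, 80
target = xavier_variance(n_in, n_out)  # 2 / 200 = 0.01
var_normal = sample_variance(xavier_normal(n_in, n_out, rng))
var_uniform = sample_variance(xavier_uniform(n_in, n_out, rng))
print(target, var_normal, var_uniform)  # both estimates should be close to target
```

Because the bound of the uniform variant is chosen so that $a^2 / 3 = \gamma_t$, the two distributions differ in shape but not in variance.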

-----------------------------------------

### **Assuming a linear activation function**
- Assume $\sigma(x) = \alpha x + \beta$
$$\mathbf{h}' = \mathbf{W}^t \mathbf{h}^{t-1} \quad \text{and} \quad \mathbf{h}^{t} = \sigma(\mathbf{h}')$$

$$\mathbb{E}[h^t_i] = \mathbb{E}[\alpha h_i' + \beta] = \beta \quad \Rightarrow \quad \beta=0 $$

<div style="text-align: center;">
    <img src="https://cdn.jsdelivr.net/gh/Corner430/Picture1/images/20230809140633.png" alt="20230809140633" />
</div>


### Backward
- Assume $\sigma(x) = \alpha x + \beta$
$$\frac{\partial \mathscr{l}}{\partial \mathbf{h}'} = \alpha \frac{\partial \mathscr{l}}{\partial \mathbf{h}^t} \quad \text{and} \quad \frac{\partial \mathscr{l}}{\partial \mathbf{h}^{t-1}} = \frac{\partial \mathscr{l}}{\partial \mathbf{h}'} \mathbf{W}^t$$

$$\mathbb{E}[\frac{\partial \mathscr{l}}{\partial h^{t-1}_i}] = 0 \quad \Rightarrow \quad \beta = 0$$

$$\mathrm{Var}[\frac{\partial \mathscr{l}}{\partial h^{t-1}_i}] = \alpha^2 n_t \gamma_t \mathrm{Var}[\frac{\partial \mathscr{l}}{\partial h_j^t}] \quad \Rightarrow \quad \alpha = 1 \ \text{(given } n_t \gamma_t = 1 \text{)}$$

> **In short, a linear activation function has to be the identity: its output must equal its input**

-----------------------------------------------

### Checking common activation functions
- Use Taylor expansions

$$
\begin{aligned}
\text{sigmoid}(x) &= \frac{1}{2} + \frac{x}{4} - \frac{x^3}{48} + O(x^5) \\
\tanh(x) &= 0 + x - \frac{x^3}{3} + O(x^5) \\
\text{relu}(x) &= 0 + x \quad \text{for} \quad x \geq 0
\end{aligned}
$$

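- Adjust sigmoid based on the expansions above so that it behaves like the identity near the origin, **considering only a neighborhood of x = 0**:
$$4 \cdot \text{sigmoid}(x) - 2$$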
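
A quick numerical comparison shows what the rescaling does. The sketch below uses only the standard library; `scaled_sigmoid` is a name chosen for this example.

```python
import math

def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def scaled_sigmoid(x):
    """4 * sigmoid(x) - 2, which has value 0 and slope 1 at x = 0."""
    return 4.0 * sigmoid(x) - 2.0

# Compare the three functions close to zero and further away from it.
for x in (-0.1, 0.0, 0.1, 2.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  "
          f"tanh={math.tanh(x):.4f}  scaled_sigmoid={scaled_sigmoid(x):.4f}")
```

Near zero the scaled sigmoid and tanh both stay close to $x$, while the plain sigmoid sits around 0.5 with slope 1/4. Away from zero all of them saturate, so the approximation really is local. The scaled version is also equal to $2\tanh(x/2)$.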

### **Summary**
- **Sensible choices of initial weights and activation functions can improve numerical stability**