Parameter to optimize: $w$

Loss function: $loss$

Learning rate: $lr$

Each iteration processes one $batch$

$t$ is the total number of $batch$ iterations so far

## Parameter update steps

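- Compute the gradient of the loss function with respect to the current parameters at step $t$: $g_t = \frac{\partial{loss}}{\partial{w_t}}$
- Compute the first-order momentum $m_t$ and the second-order momentum $V_t$ at step $t$; the first-order momentum is a function of the gradient, and the second-order momentum is a function of the squared gradient
- Compute the descent step at step $t$: $\eta_t = \frac{lr·m_t}{\sqrt{V_t}}$
- Compute the parameters at step $t+1$: $w_{t+1} = w_t - \eta_t = w_t - \frac{lr·m_t}{\sqrt{V_t}}$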

>The optimizers differ only in the functions they use for the first-order and second-order momentum.
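
The four steps above can be sketched in plain Python as one function that takes the momentum definitions as a parameter. This is only an illustrative sketch: the names `update`, `sgd_moments`, and `grad`, the toy loss $(w-3)^2$, and the use of plain floats instead of TensorFlow variables are assumptions, not part of the original code.

```python
import math

def update(w, g, lr, moments):
    """One step of the generic framework for a scalar parameter."""
    m, v = moments(g)              # step 2: first- and second-order momentum
    eta = lr * m / math.sqrt(v)    # step 3: eta_t = lr * m_t / sqrt(V_t)
    return w - eta                 # step 4: w_{t+1} = w_t - eta_t

def sgd_moments(g):
    # Plain SGD as one instance: m_t = g_t, V_t = 1
    return g, 1.0

def grad(w):
    # Step 1 for the hypothetical loss (w - 3)^2: d(loss)/dw = 2 * (w - 3)
    return 2.0 * (w - 3.0)

w = 0.0
for t in range(50):
    w = update(w, grad(w), lr=0.1, moments=sgd_moments)
```

Swapping `sgd_moments` for a function that keeps running averages would give the other optimizers without touching `update`.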

## SGD: stochastic gradient descent

$$
m_t = g_t \\ V_t = 1
$$

$$
\eta_t = lr·m_t/\sqrt{V_t} = lr · g_t
$$

$$
w_{t+1} = w_t - \eta_t = w_t - lr·m_t/\sqrt{V_t} = w_t - lr·g_t
$$

That is, $w_{t+1} = w_t - lr·\frac{\partial{loss}}{\partial{w_t}}$

```python
w1.assign_sub(lr * grads[0])
b1.assign_sub(lr * grads[1])
```
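
To see these two `assign_sub` lines in context, here is a minimal, self-contained sketch that fits $y = w·x + b$ with the same update. The toy data, the hand-written `gradients` helper, and the use of plain Python floats in place of `tf.Variable` and `tape.gradient` are all assumptions made for illustration; the whole toy dataset is treated as a single batch.

```python
# Hypothetical toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

def gradients(w, b):
    """Gradients of the mean squared error for the model y = w*x + b."""
    n = len(xs)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb

w, b, lr = 0.0, 0.0, 0.05
for step in range(2000):
    gw, gb = gradients(w, b)
    w -= lr * gw   # plays the role of w1.assign_sub(lr * grads[0])
    b -= lr * gb   # plays the role of b1.assign_sub(lr * grads[1])
```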

## SGDM: SGD with first-order momentum added

$$
m_t = \beta·m_{t-1} + (1-\beta)·g_t \\ V_t = 1
$$

$$
\eta_t = lr·m_t/\sqrt{V_t} = lr · m_t = lr · (\beta·m_{t-1} + (1-\beta)·g_t)
$$

$$
w_{t+1} = w_t - \eta_t = w_t - lr·(\beta·m_{t-1} + (1-\beta)·g_t)
$$

In code:

```python
m_w, m_b = 0, 0
beta = 0.9

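# ...

# Compute the gradient of loss with respect to each parameter
grads = tape.gradient(loss, [w1, b1])

##########################################################################
# sgd-momentum
# Update the running averages of the gradients, then step along them
m_w = beta * m_w + (1 - beta) * grads[0]
m_b = beta * m_b + (1 - beta) * grads[1]
w1.assign_sub(lr * m_w)
b1.assign_sub(lr * m_b)

# ...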
```
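
One way to build intuition for the first-order momentum is to feed the update a constant gradient $g$. Under that assumption, and starting from $m_0 = 0$, the recurrence gives $m_t = (1-\beta^t)·g$, so $m_t$ starts small and ramps up toward $g$. The sketch below checks this in plain Python; `sgdm_step` and the constant gradient are hypothetical and not part of the original code.

```python
def sgdm_step(w, m, g, lr=0.1, beta=0.9):
    """One SGDM update for a scalar parameter; returns the new (w, m)."""
    m = beta * m + (1.0 - beta) * g   # m_t = beta * m_{t-1} + (1 - beta) * g_t
    w = w - lr * m                    # V_t = 1, so eta_t = lr * m_t
    return w, m

w, m = 0.0, 0.0
history = []                          # m_t for t = 1..10
for t in range(1, 11):
    w, m = sgdm_step(w, m, g=1.0)     # hypothetical constant gradient
    history.append(m)
```

With $\beta = 0.9$ the first value of $m_t$ is only a tenth of the gradient, which is the start-up bias that Adam's correction terms compensate for.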

## Adagrad: SGD with second-order momentum added

$$
m_t = g_t \\ V_t = \sum_{\tau=1}^t g_{\tau}^2
$$

$$
\eta_t = lr·m_t/\sqrt{V_t} = lr · g_t /\sqrt{\sum_{\tau=1}^t g_{\tau}^2}
$$

$$
w_{t+1} = w_t - \eta_t = w_t - lr · g_t /\sqrt{\sum_{\tau=1}^t g_{\tau}^2}
$$

In code:

```python
v_w, v_b = 0, 0

# Compute the gradient of loss with respect to each parameter
grads = tape.gradient(loss, [w1, b1])

##########################################################################
# adagrad
v_w += tf.square(grads[0])
v_b += tf.square(grads[1])
w1.assign_sub(lr * grads[0] / tf.sqrt(v_w))
b1.assign_sub(lr * grads[1] / tf.sqrt(v_b))
```
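
Because $V_t$ only ever accumulates, the effective step of Adagrad shrinks as training goes on. With a constant gradient $g$, $V_t = t·g^2$, so the step is $lr/\sqrt{t}$ in magnitude. The plain-Python sketch below illustrates this; `adagrad_step` and the constant gradient are hypothetical names and values chosen for the example.

```python
import math

def adagrad_step(w, v, g, lr=0.1):
    """One Adagrad update for a scalar parameter; returns (w, v, step)."""
    v = v + g * g                    # V_t: running sum of squared gradients
    step = lr * g / math.sqrt(v)     # eta_t = lr * g_t / sqrt(V_t)
    return w - step, v, step

w, v = 0.0, 0.0
steps = []                           # eta_t for t = 1..4
for t in range(1, 5):
    w, v, step = adagrad_step(w, v, g=2.0)   # hypothetical constant gradient
    steps.append(step)
```

The sketch follows the formulas in this section literally. Library implementations usually add a small constant to the denominator so that a zero accumulated gradient does not cause a division by zero.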

## RMSProp: SGD with second-order momentum added

$$
m_t = g_t \\ V_t = \beta·V_{t-1} + (1 - \beta)·g_t^2
$$

$$
\eta_t = lr·m_t/\sqrt{V_t} = lr · g_t /\sqrt{\beta·V_{t-1} + (1 - \beta)·g_t^2}
$$

$$
w_{t+1} = w_t - \eta_t = w_t - lr · g_t /\sqrt{\beta·V_{t-1} + (1 - \beta)·g_t^2}
$$

In code:

```python
v_w, v_b = 0, 0
beta = 0.9

# Compute the gradient of loss with respect to each parameter
grads = tape.gradient(loss, [w1, b1])

##########################################################################
# rmsprop
v_w = beta * v_w + (1 - beta) * tf.square(grads[0])
v_b = beta * v_b + (1 - beta) * tf.square(grads[1])
w1.assign_sub(lr * grads[0] / tf.sqrt(v_w))
b1.assign_sub(lr * grads[1] / tf.sqrt(v_b))
```
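
RMSProp replaces the running sum with an exponential moving average, so the step does not decay toward zero. With a constant gradient $g$ and $V_0 = 0$, $V_t = (1-\beta^t)·g^2$, and the step magnitude $lr/\sqrt{1-\beta^t}$ settles at $lr$. The sketch below shows this in plain Python; `rmsprop_step` and the constant gradient are assumptions made for illustration.

```python
import math

def rmsprop_step(w, v, g, lr=0.1, beta=0.9):
    """One RMSProp update for a scalar parameter; returns (w, v, step)."""
    v = beta * v + (1.0 - beta) * g * g   # V_t: moving average of g^2
    step = lr * g / math.sqrt(v)          # eta_t = lr * g_t / sqrt(V_t)
    return w - step, v, step

w, v = 0.0, 0.0
steps = []                                # eta_t for t = 1..200
for t in range(1, 201):
    w, v, step = rmsprop_step(w, v, g=2.0)    # hypothetical constant gradient
    steps.append(step)
```

Note that the first step is larger than $lr$, because $V_1$ is only $(1-\beta)·g^2$.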

## Adam: combines the first-order momentum of SGDM with the second-order momentum of RMSProp
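
In the notation of the earlier sections, the update that the code below implements can be written as follows. The hatted symbols $\hat{m}_t$ and $\hat{V}_t$ are names introduced here for the bias-corrected momenta (`m_w_correction` and `v_w_correction` in the code), and $\beta_1$, $\beta_2$ correspond to `beta1` and `beta2`.

$$
m_t = \beta_1·m_{t-1} + (1-\beta_1)·g_t \\ V_t = \beta_2·V_{t-1} + (1-\beta_2)·g_t^2
$$

$$
\hat{m}_t = \frac{m_t}{1-\beta_1^t} \\ \hat{V}_t = \frac{V_t}{1-\beta_2^t}
$$

$$
\eta_t = lr·\hat{m}_t/\sqrt{\hat{V}_t}
$$

$$
w_{t+1} = w_t - \eta_t = w_t - lr·\hat{m}_t/\sqrt{\hat{V}_t}
$$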

```python
m_w, m_b = 0, 0
v_w, v_b = 0, 0
beta1, beta2 = 0.9, 0.999
delta_w, delta_b = 0, 0
global_step = 0

# Compute the gradient of loss with respect to each parameter
grads = tape.gradient(loss, [w1, b1])

##########################################################################
# adam
# Count this batch before applying bias correction; with global_step == 0
# the factor 1 - beta ** 0 is zero and the division below would fail.
global_step += 1

# First-order momentum (as in SGDM) and second-order momentum (as in RMSProp)
m_w = beta1 * m_w + (1 - beta1) * grads[0]
m_b = beta1 * m_b + (1 - beta1) * grads[1]
v_w = beta2 * v_w + (1 - beta2) * tf.square(grads[0])
v_b = beta2 * v_b + (1 - beta2) * tf.square(grads[1])

# Bias correction for the zero-initialized momenta
m_w_correction = m_w / (1 - tf.pow(beta1, int(global_step)))
m_b_correction = m_b / (1 - tf.pow(beta1, int(global_step)))
v_w_correction = v_w / (1 - tf.pow(beta2, int(global_step)))
v_b_correction = v_b / (1 - tf.pow(beta2, int(global_step)))

w1.assign_sub(lr * m_w_correction / tf.sqrt(v_w_correction))
b1.assign_sub(lr * m_b_correction / tf.sqrt(v_b_correction))
```
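
A consequence of the bias correction is that the very first Adam step has magnitude $lr$ regardless of how large the gradient is: at $t = 1$ the corrected momenta reduce to $g_1$ and $g_1^2$. The plain-Python sketch below checks this for two gradients that differ by a factor of 100. The function `adam_step`, the scalar setting, and the gradient values are hypothetical, and the 1-based step count `t` is an assumption about how `global_step` is meant to be used.

```python
import math

def adam_step(w, m, v, g, t, lr=0.1, beta1=0.9, beta2=0.999):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g        # first-order momentum
    v = beta2 * v + (1.0 - beta2) * g * g    # second-order momentum
    m_hat = m / (1.0 - beta1 ** t)           # bias-corrected momenta
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / math.sqrt(v_hat)
    return w, m, v

# First step from w = 0 with a small and a large (negative) gradient
w_small, _, _ = adam_step(0.0, 0.0, 0.0, g=-6.0, t=1)
w_large, _, _ = adam_step(0.0, 0.0, 0.0, g=-600.0, t=1)
```

Both calls move the parameter by about `lr` in the direction opposite to the gradient.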