### Batch Gradient Descent

Every step computes the gradient over all samples in the training set.

This becomes slow when the number of samples is large.

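
As a point of reference, and assuming the mean squared error loss of linear regression that the comparison later in this section uses, the loss and its full-batch gradient over m samples can be written as:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - X_b^{(i)}\theta\right)^2, \qquad \nabla J(\theta) = \frac{2}{m} \cdot X_b^T \cdot (X_b\theta - y)$$

Each evaluation touches every row of the data matrix, which is why the cost of a single step grows with m.
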
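### Stochastic Gradient Descent

- Each step uses a single randomly chosen sample i instead of the whole training set. The resulting vector is not the true gradient of the loss function, but it still indicates a useful search direction: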

$$2\cdot \begin{pmatrix}
(X_b^{(i)}\theta - y^{(i)}) \cdot X^{(i)}_0 \\\\
(X_b^{(i)}\theta - y^{(i)}) \cdot X^{(i)}_1 \\\\
(X_b^{(i)}\theta - y^{(i)}) \cdot X^{(i)}_2 \\\\
\ldots \\\\
(X_b^{(i)}\theta - y^{(i)}) \cdot X^{(i)}_n
\end{pmatrix} = 
2 \cdot (X_b^{(i)})^T \cdot (X_b^{(i)}\theta - y^{(i)})$$
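
One way to see why a single-sample direction is still useful: averaged over all samples, it equals the full-batch gradient. The sketch below checks this numerically. The synthetic data, the seed, the comparison point `theta`, and the helper name `single_sample_gradient` are assumptions made for illustration; `x_b_i` stands for one row of `X_b`, with the constant 1 for the intercept as its first entry.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
m = 1000
x = rng.normal(size=m)
X_b = np.hstack([np.ones((m, 1)), x.reshape(-1, 1)])
y = 4.0 * x + 3.0 + rng.normal(0, 3, size=m)
theta = np.array([1.0, 1.0])  # arbitrary point at which to compare gradients


def single_sample_gradient(theta, x_b_i, y_i):
    # 2 * (X_b^(i))^T * (X_b^(i) theta - y^(i)) for one row x_b_i
    return 2.0 * x_b_i * (x_b_i.dot(theta) - y_i)


# Full-batch gradient of the mean squared error
batch_gradient = 2.0 * X_b.T.dot(X_b.dot(theta) - y) / m

# Single-sample gradients for every row, then their mean
per_sample = np.array(
    [single_sample_gradient(theta, X_b[i], y[i]) for i in range(m)]
)
mean_of_samples = per_sample.mean(axis=0)

print(np.allclose(batch_gradient, mean_of_samples))
```

The two vectors should agree up to floating-point error, while any individual row of `per_sample` can point quite far from `batch_gradient`.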

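- Because each step follows a noisy single-sample direction, a fixed learning rate would keep θ bouncing around the minimum, so the learning rate η should shrink as the iterations proceed: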

$$\eta = \frac{a}{i\_iters + b}$$

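This borrows the idea of simulated annealing. Here `i_iters` is the current iteration count, and a and b are hyperparameters of stochastic gradient descent.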
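
A minimal sketch of this schedule follows. The function name `learning_rate` is hypothetical, and the defaults a = 5 and b = 50 are borrowed from the `t0` and `t1` values in the `sgd` function later in this section.

```python
def learning_rate(i_iters, a=5.0, b=50.0):
    """Decaying learning rate eta = a / (i_iters + b)."""
    return a / (i_iters + b)


# The rate at a few iteration counts, from the first step to a late one
rates = [learning_rate(i) for i in (0, 50, 950, 9950)]
print(rates)  # [0.1, 0.05, 0.005, 0.0005]
```

The offset b keeps the first few steps from being disproportionately large (with b = 0 the very first step would divide by zero), and a sets the overall scale of the steps.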

- Procedure

![Stochastic gradient descent](..\img\FEF5319D-E8B8-42ec-B385-F7924051C439.png)

### Comparing gradient descent and stochastic gradient descent

#### Gradient descent

In [1]:
import numpy as np
import matplotlib.pyplot as plt

In [2]:
m = 100000

x = np.random.normal(size=m)
X = x.reshape(-1,1)
y = 4. * x + 3. + np.random.normal(0, 3, size=m)

In [3]:
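def J(theta, x_b, y):
    # Mean squared error of the linear model x_b.dot(theta)
    try:
        return np.sum((y - x_b.dot(theta)) ** 2) / len(y)
    except Exception:
        # Treat any numerical failure as an infinitely bad loss
        return float('inf')

def dJ(theta, x_b, y):
    # Gradient of J over the full data set: (2 / m) * X_b^T (X_b theta - y)
    return x_b.T.dot(x_b.dot(theta) - y) * 2. / len(y)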

def gradient_descent(x_b, y, initial_theta, eta, n_iters=10000, epsilon=1e-8):
    theta = initial_theta
    i_iter = 0
    while i_iter < n_iters:
        gradient = dJ(theta, x_b, y)
        last_theta = theta
        theta = theta - eta * gradient
        i_iter = i_iter + 1
        # Stop once the loss changes by less than epsilon between steps
        if abs(J(theta, x_b, y) - J(last_theta, x_b, y)) < epsilon:
            break
    return theta

In [4]:
%%time
X_b = np.hstack([np.ones((len(x),1)), X])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01
theta = gradient_descent(X_b, y, initial_theta, eta)

Wall time: 615 ms


In [5]:
theta

array([2.99507793, 3.9981764 ])

#### Stochastic gradient descent

In [6]:
def dJ_sgd(theta, x_b_i, y_i):
    # Search direction from one sample: 2 * (X_b^(i))^T (X_b^(i) theta - y^(i))
    return x_b_i.T.dot(x_b_i.dot(theta) - y_i) * 2.

In [7]:
def sgd(x_b, y, initial_theta, n_iters=10000):
    # t0 and t1 play the roles of a and b in eta = a / (i_iters + b)
    t0 = 5
    t1 = 50
    def learn_rate(t):
        return t0 / (t + t1)
    theta = initial_theta

    for cur_iters in range(n_iters):
        # Draw one sample index uniformly at random (with replacement)
        rand_i = np.random.randint(len(x_b))
        gradient = dJ_sgd(theta, x_b[rand_i], y[rand_i])
        theta = theta - learn_rate(cur_iters) * gradient

    return theta

In [8]:
%%time
X_b = np.hstack([np.ones((len(x),1)), X])
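initial_theta = np.zeros(X_b.shape[1])
# sgd uses its own decaying learning rate, so no eta is passed;
# the number of steps is only a third of the number of samples
theta = sgd(X_b, y, initial_theta, n_iters=len(X_b) // 3)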

Wall time: 248 ms


In [9]:
theta

array([3.00065576, 4.01802475])
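
Both results can be judged against the exact least-squares solution of the same kind of problem. The sketch below is an assumption-laden illustration rather than part of the original notebook: it regenerates data from the same model (intercept 3, slope 4, noise standard deviation 3) with its own fixed seed, so its numbers will differ slightly from the outputs above, and it relies on `np.linalg.lstsq` for the closed-form fit.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is arbitrary, for reproducibility only
m = 100000
x = rng.normal(size=m)
X_b = np.hstack([np.ones((m, 1)), x.reshape(-1, 1)])
y = 4.0 * x + 3.0 + rng.normal(0, 3, size=m)

# Exact minimizer of the mean squared error: [intercept, slope]
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta_exact)
```

With this many samples the exact solution lands close to (3, 4), which gives a baseline for how far the batch and stochastic estimates are from the best achievable fit.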