# Stochastic Gradient Descent 

- The gradient descent method from the previous section is shown in the figure below:
[![GD8.png](https://i.postimg.cc/nzp9p0g3/GD8.png)](https://postimg.cc/FfCHxg9S)
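- At every step we compute the gradient over all samples; this is called **batch gradient descent**.
- With a very large number of samples this is still expensive. The remedy is **stochastic gradient descent**.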

[![GD9.png](https://i.postimg.cc/rszmwbJN/GD9.png)](https://postimg.cc/9w5VxLsD)

- We pick a random index $i$, use that single sample to compute a vector, and take the search step in that direction.
[![GD11.png](https://i.postimg.cc/hGjPrNFW/GD11.png)](https://postimg.cc/LJcp4C6N)
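
The idea of replacing the full gradient with the gradient of one random sample can be checked numerically. The sketch below is an illustration on synthetic data (the seed, the sample count, and the helper names `full_gradient` and `per_sample_gradients` are assumptions, not part of the notebook). It shows that the average of all single-sample gradients equals the batch gradient, and it counts how many individual sample gradients point away from the batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)           # assumed seed, for reproducibility only
m = 1000
x = rng.normal(size=m)
X_b = np.column_stack([np.ones(m), x])   # prepend the intercept column
y = 4.0 * x + 3.0 + rng.normal(0, 3, size=m)
theta = np.zeros(2)

def full_gradient(theta, X_b, y):
    # Gradient of the mean squared error over all samples
    return X_b.T.dot(X_b.dot(theta) - y) * 2.0 / len(y)

def per_sample_gradients(theta, X_b, y):
    # Row i holds the gradient of the squared error of sample i alone
    residual = X_b.dot(theta) - y
    return 2.0 * residual[:, None] * X_b

full = full_gradient(theta, X_b, y)
per_sample = per_sample_gradients(theta, X_b, y)
mean_of_samples = per_sample.mean(axis=0)

# Number of single-sample gradients whose dot product with the batch gradient is negative
opposed = int(np.sum(per_sample.dot(full) < 0))

print("batch gradient:          ", full)
print("mean of sample gradients:", mean_of_samples)
print("samples pointing away from the batch gradient:", opposed, "of", m)
```

A single random sample is therefore right on average, but any individual step can be a poor direction.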

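- In stochastic gradient descent there is no guarantee that the search direction actually decreases the loss function,
- let alone that it is the direction of fastest decrease.
- We want $\eta$ to become smaller as the number of iterations grows, so $\eta$ takes the form shown on the right of the figure.
- Here a and b are two hyperparameters.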
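
A decaying learning rate can be sketched in a few lines. The form $\eta = a / (i + b)$ used here is an assumption about the figure; it is the form computed by the `learning_rate` helper in the SGD code of section 2, where the two hyperparameters are named `t0` and `t1` and set to 5 and 50.

```python
def learning_rate(cur_iter, a=5.0, b=50.0):
    # eta = a / (cur_iter + b); a and b are the two hyperparameters.
    # b keeps the first steps from being huge, a scales the whole schedule.
    return a / (cur_iter + b)

checkpoints = (0, 10, 100, 1000, 10000)
rates = [learning_rate(i) for i in checkpoints]

for i, eta in zip(checkpoints, rates):
    print(f"iteration {i:>5}: eta = {eta:.5f}")
```

With these values the step size starts at 0.1 and shrinks roughly like 1/i, so late, noisy steps move theta only slightly.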

### 1. Batch gradient descent

In [1]:
import numpy as np
import matplotlib.pyplot as plt

In [2]:
m = 100000

x = np.random.normal(size = m)
X = x.reshape(-1, 1)
y = 4. * x + 3. + np.random.normal(0, 3, size=m)

In [3]:
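def J(theta, X_b, y):
    # Mean squared error of the linear model; fall back to inf if the computation fails
    try:
        return np.sum((y - X_b.dot(theta))**2) / len(y)
    except Exception:
        return float('inf')

def dJ(theta, X_b, y):
    # Gradient of J computed over all samples
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)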

def gradient_descent(X_b, y, initial_theta, eta, n_iters = 1e4, epsilon=1e-8):
    theta = initial_theta
    i_iter = 0
    
    while i_iter < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient

        if np.abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break
        
        i_iter += 1
    
    return theta

In [4]:

X_b = np.hstack([np.ones([len(X), 1]), X])
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01
theta = gradient_descent(X_b, y, initial_theta, eta)

In [5]:
theta

array([2.99674668, 3.98888301])
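
As a sanity check on the batch result, the same least-squares problem can be solved directly with `np.linalg.lstsq`. This is a sketch on freshly generated data with an assumed seed, so the digits will not match the notebook output above; the point is only that the minimizer lies close to the true intercept 3 and slope 4.

```python
import numpy as np

rng = np.random.default_rng(42)          # assumed seed, not the notebook's data
m = 100000
x = rng.normal(size=m)
y = 4.0 * x + 3.0 + rng.normal(0, 3, size=m)
X_b = np.column_stack([np.ones(m), x])   # intercept column plus the feature

# Direct least-squares solution of X_b @ theta ~ y
theta_exact, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(theta_exact)
```

Gradient descent with a small enough `eta` converges to this same minimizer, which is why its result is close to `[3, 4]` as well.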

### 2. Stochastic gradient descent

In [6]:
def dJ(theta, X_b_i, y_i):
    # Gradient estimated from a single sample, so there is no division by m
    return X_b_i.T.dot(X_b_i.dot(theta) - y_i) * 2.

In [7]:
def sgd(X_b, y, initial_theta, n_iters):
    
    t0 = 5
    t1 = 50
    
    def learning_rate(cur_iter):
        return t0 / (cur_iter + t1)
    
    theta = initial_theta
    for cur_iter in range(n_iters):
        # Pick a random sample index i
        rand_i = np.random.randint(len(X_b))
        gradient = dJ(theta, X_b[rand_i], y[rand_i])
        # Update theta
        theta = theta - learning_rate(cur_iter) * gradient

    return theta

In [8]:

X_b = np.hstack([np.ones([len(X), 1]), X])
initial_theta = np.zeros(X_b.shape[1])
theta = sgd(X_b, y, initial_theta, n_iters=len(X_b)//3)
# We ran only len(X_b)//3 single-sample updates, so at most a third of the samples were used, yet the result is already very good

In [9]:
theta

array([3.04247436, 3.9865213 ])

The resulting theta is almost identical to the one found by batch gradient descent.

### 3. Using our own SGD implementation

In [10]:
from LR.LinearRegression import LinearRegression

In [11]:
lin_reg = LinearRegression()

In [12]:
lin_reg.fit_sgd(X, y, n_iters=2)

LinearRegression()

In [13]:
lin_reg.coef_

array([3.98430367])

In [14]:
lin_reg.intercept_

2.9997998753071804
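
The source of `LR.LinearRegression.fit_sgd` is not shown here. The call with `n_iters=2` suggests that `n_iters` counts full passes over the data rather than single updates. The following is a minimal sketch under that assumption; the function name `sgd_epochs`, its parameters, and the seed are hypothetical and are not the module's actual code.

```python
import numpy as np

def sgd_epochs(X_b, y, initial_theta, n_epochs=5, t0=5.0, t1=50.0, seed=None):
    # Hypothetical epoch-based SGD: every sample is visited once per epoch
    rng = np.random.default_rng(seed)
    m = len(X_b)
    theta = initial_theta.copy()
    for epoch in range(n_epochs):
        # A fresh random order for each pass over the data
        for step, i in enumerate(rng.permutation(m)):
            gradient = X_b[i] * (X_b[i].dot(theta) - y[i]) * 2.0
            # The learning rate keeps decaying across epochs
            eta = t0 / (epoch * m + step + t1)
            theta = theta - eta * gradient
    return theta

rng = np.random.default_rng(1)           # assumed seed, synthetic data
m = 10000
x = rng.normal(size=m)
y = 4.0 * x + 3.0 + rng.normal(0, 3, size=m)
X_b = np.column_stack([np.ones(m), x])

theta = sgd_epochs(X_b, y, np.zeros(2), n_epochs=2, seed=1)
print(theta)
```

Shuffling once per epoch guarantees that no sample is skipped, which sampling with `np.random.randint` cannot promise.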

#### Using real data

In [15]:
from sklearn import datasets

In [16]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
boston = datasets.load_boston()
X = boston.data
y = boston.target

X = X[y < 50.0]
y = y[y < 50.0]

In [17]:
from LR.model_selection import train_test_split
# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, seed=333)

In [18]:
# Standardize the features
from sklearn.preprocessing import StandardScaler

standardScaler = StandardScaler()
standardScaler.fit(X_train)
X_train_standard = standardScaler.transform(X_train)
X_test_standard = standardScaler.transform(X_test)
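
What the scaler does in the cell above can be written out in plain NumPy: it learns the per-feature mean and standard deviation from the training set only and applies the same shift and scale to the test set. The sketch below uses synthetic stand-in data (the array shapes, the seed, and the names `X_tr` and `X_te` are assumptions), not the Boston data.

```python
import numpy as np

rng = np.random.default_rng(0)                             # assumed seed
X_tr = rng.normal(loc=10.0, scale=3.0, size=(200, 4))      # stand-in training set
X_te = rng.normal(loc=10.0, scale=3.0, size=(50, 4))       # stand-in test set

# Statistics come from the training set only
mean_ = X_tr.mean(axis=0)
scale_ = X_tr.std(axis=0)          # population standard deviation (ddof=0)

X_tr_standard = (X_tr - mean_) / scale_
X_te_standard = (X_te - mean_) / scale_   # reuse the training statistics

print(X_tr_standard.mean(axis=0).round(6))
print(X_tr_standard.std(axis=0).round(6))
```

Bringing the features to a common scale matters for SGD because a single learning rate is shared by all components of theta.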

In [19]:
lin_reg2 = LinearRegression()

In [25]:
%time lin_reg2.fit_sgd(X_train_standard, y_train)

Wall time: 31 ms


LinearRegression()

In [26]:
lin_reg2.score(X_test_standard, y_test)

0.8623420963713099
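
The `score` method of `LR.LinearRegression` is not shown. Assuming it returns the coefficient of determination $R^2$, as the `score` method of scikit-learn regressors does, the value could be computed as in this sketch (the function name `r_squared` is hypothetical):

```python
import numpy as np

def r_squared(y_true, y_predict):
    # R^2 = 1 - MSE(prediction) / variance(y_true)
    mse = np.mean((y_true - y_predict) ** 2)
    return 1.0 - mse / np.var(y_true)

y_true = np.array([3.0, 5.0, 7.0, 9.0])

perfect = r_squared(y_true, y_true)                       # exact predictions
baseline = r_squared(y_true, np.full(4, y_true.mean()))   # always predict the mean

print(perfect, baseline)
```

A score of 1 means perfect predictions and 0 means no better than always predicting the mean of `y_true`.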

### 4. SGD in scikit-learn

In [27]:
from sklearn.linear_model import SGDRegressor

In [32]:
sgd_reg = SGDRegressor()
%time sgd_reg.fit(X_train_standard, y_train)
sgd_reg.score(X_test_standard, y_test)

Wall time: 996 µs




0.8700295750106639

In [30]:
SGDRegressor?

In [36]:
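# Note: newer scikit-learn releases replaced n_iter with max_iter; there, use SGDRegressor(max_iter=100)
sgd_reg = SGDRegressor(n_iter=100)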
%time sgd_reg.fit(X_train_standard, y_train)
sgd_reg.score(X_test_standard, y_test)

Wall time: 7 ms




0.8715081559708394