## Linear Regression

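Linear regression is one of the simplest models in machine learning. Its hypothesis function is:
$$
h(\mathbf{x})=\theta^T\mathbf{x}=\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_3+\cdots+\theta_mx_m
$$
where $x_0=1$, so that $\theta_0$ acts as the intercept term.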
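
As a small illustration of the hypothesis function, the sketch below evaluates $h(\mathbf{x})$ for a batch of samples with NumPy. The helper names `add_bias` and `hypothesis` and the numbers are made up for this example; they are not part of any library.

```python
import numpy as np


def add_bias(X):
    """Prepend a column of ones so that x_0 = 1 for every sample."""
    return np.hstack((np.ones((X.shape[0], 1)), X))


def hypothesis(X, theta):
    """Evaluate h(x) = theta^T x for every row of X (bias column included)."""
    return X @ theta


# Two samples with two features each (illustrative values only).
X_demo = add_bias(np.array([[1.0, 2.0], [3.0, 4.0]]))
theta_demo = np.array([0.5, 1.0, -1.0])  # theta_0, theta_1, theta_2
h_demo = hypothesis(X_demo, theta_demo)
print(h_demo)  # [-0.5 -0.5]
```

Storing the constant $x_0=1$ as an explicit column lets the intercept be handled by the same dot product as the other parameters.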

The loss function is usually defined as the squared error, as follows:

$$
J(\theta)=\frac{1}{2n}\sum_{i=1}^{n}(h(\mathbf{x}^i)-y^i)^2
$$

The factor $\frac{1}{2}$ is included so that it cancels when the derivative is taken later.
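
To make the cancellation explicit, here is a short derivation of the partial derivative with respect to a single parameter $\theta_j$. It uses only the definitions of $h$ and $J$ given above, together with $\partial h(\mathbf{x}^i)/\partial\theta_j = x_j^i$:

$$
\frac{\partial J(\theta)}{\partial \theta_j}
=\frac{1}{2n}\sum_{i=1}^{n}2\,(h(\mathbf{x}^i)-y^i)\,\frac{\partial h(\mathbf{x}^i)}{\partial \theta_j}
=\frac{1}{n}\sum_{i=1}^{n}(h(\mathbf{x}^i)-y^i)\,x_j^i
$$

The factor 2 produced by differentiating the square cancels the $\frac{1}{2}$, leaving the average of the residuals weighted by the $j$-th feature.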

Solving for $\theta$ then becomes a convex optimization problem, which is usually solved with gradient descent:

Repeat {

\begin{equation}
\begin{aligned}
\theta_0 = & \theta_0-\alpha\frac{1}{n}\sum_{i=1}^{n}(h(\mathbf{x}^i)-y^i)x_0^i \\
\theta_1 = & \theta_1-\alpha\frac{1}{n}\sum_{i=1}^{n}(h(\mathbf{x}^i)-y^i)x_1^i \\
\theta_2 = & \theta_2-\alpha\frac{1}{n}\sum_{i=1}^{n}(h(\mathbf{x}^i)-y^i)x_2^i \\
\vdots \\
\theta_m = & \theta_m-\alpha\frac{1}{n}\sum_{i=1}^{n}(h(\mathbf{x}^i)-y^i)x_m^i \\
\end{aligned}
\end{equation}

(simultaneously)

} until convergence

Here $\alpha$ is the learning rate, typically chosen between 0.03 and 0.1.
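
The update rule can be sketched in vectorized form as follows. This is a minimal illustration on synthetic data, not a reference implementation: the function name `gradient_descent`, the tolerance `tol`, the iteration cap `max_iter` and the generated data are all assumptions made for this example.

```python
import numpy as np


def gradient_descent(X, y, alpha=0.1, tol=1e-8, max_iter=10000):
    """Batch gradient descent for linear regression.

    X must already contain the leading column of ones (x_0 = 1).
    """
    n, m = X.shape
    theta = np.zeros(m)
    for _ in range(max_iter):
        gradient = X.T @ (X @ theta - y) / n   # all partial derivatives at once
        theta_new = theta - alpha * gradient   # simultaneous update of every theta_j
        if np.sum(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta


# Synthetic, noise-free data generated from y = 2 + 3 * x_1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X_syn = np.column_stack((np.ones_like(x1), x1))
y_syn = 2.0 + 3.0 * x1

theta_hat = gradient_descent(X_syn, y_syn)
print(theta_hat)  # close to [2. 3.]
```

Computing the whole gradient from the old `theta` before assigning `theta_new` is what makes the update simultaneous.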

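Below, the Boston Housing Data dataset is used as an example to implement multivariate linear regression in code. The dataset has 14 attributes (including the house price to be predicted) and 506 samples. Some of the data is shown below: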

| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
| -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
|0.00632 | 18.00  | 2.310 | 0 | 0.5380 | 6.5750 | 65.20 | 4.0900  | 1 | 296.0 | 15.30 | 396.90  | 4.98 | 24.00 |
|0.02731 |  0.00  | 7.070 | 0 | 0.4690 | 6.4210 | 78.90 | 4.9671  | 2 | 242.0 | 17.80 | 396.90  | 9.14 | 21.60 |
|0.02729 |  0.00  | 7.070 | 0 | 0.4690 | 7.1850 | 61.10 | 4.9671  | 2 | 242.0 | 17.80 | 392.83  | 4.03 | 34.70 |
|0.03237 |  0.00  | 2.180 | 0 | 0.4580 | 6.9980 | 45.80 | 6.0622  | 3 | 222.0 | 18.70 | 394.63  | 2.94 | 33.40 |
|0.06905 |  0.00  | 2.180 | 0 | 0.4580 | 7.1470 | 54.20 | 6.0622  | 3 | 222.0 | 18.70 | 396.90  | 5.33 | 36.20 |

1. **CRIM**      per capita crime rate by town
2. **ZN**        proportion of residential land zoned for lots over 25,000 sq.ft.
3. **INDUS**     proportion of non-retail business acres per town
4. **CHAS**      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. **NOX**       nitric oxides concentration (parts per 10 million)
6. **RM**        average number of rooms per dwelling
7. **AGE**       proportion of owner-occupied units built prior to 1940
8. **DIS**       weighted distances to five Boston employment centres
9. **RAD**       index of accessibility to radial highways
10. **TAX**      full-value property-tax rate per \$10,000
11. **PTRATIO**  pupil-teacher ratio by town
12. **B**        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. **LSTAT**    % lower status of the population
14. **MEDV**     Median value of owner-occupied homes in \$1000's (the house price to be predicted)
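
The attributes span very different ranges (NOX is around 0.5 while TAX is in the hundreds), which is one reason to standardize the features before running gradient descent. The sketch below standardizes three columns of the five sample rows shown in the table using plain NumPy. Using only these five rows is an assumption made for illustration; in practice the statistics would be computed over all 506 samples.

```python
import numpy as np

# NOX, RM and TAX columns from the five sample rows shown above.
sample = np.array([
    [0.5380, 6.5750, 296.0],
    [0.4690, 6.4210, 242.0],
    [0.4690, 7.1850, 242.0],
    [0.4580, 6.9980, 222.0],
    [0.4580, 7.1470, 222.0],
])

mean = sample.mean(axis=0)
std = sample.std(axis=0)              # population standard deviation (ddof=0)
standardized = (sample - mean) / std  # zero mean, unit variance per column

print(standardized.mean(axis=0))  # approximately 0 for every column
print(standardized.std(axis=0))   # approximately 1 for every column
```

After this transformation every column is on a comparable scale, so a single learning rate can suit all parameters.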

In [None]:
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

# load_boston was removed in scikit-learn 1.2, so this call needs an older release.
boston = datasets.load_boston()

X, y = boston.data, boston.target

# Standardize every feature to zero mean and unit variance.
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Prepend a column of ones so that x_0 = 1 and theta_0 acts as the intercept.
X = np.hstack((np.ones((len(X), 1)), X))

print(X.shape, y.shape)

# Initialize as floats; an integer array would truncate every update.
theta = np.ones(X.shape[1])

print(theta.shape)


def cost(X, y, theta):
    """Squared-error cost J(theta) = 1 / (2n) * sum((X.theta - y)^2)."""
    residual = np.dot(X, theta) - y
    return np.sum(residual ** 2) / (2 * len(X))


def converge(theta0, theta1, tol=1e-6):
    """Return True once the total absolute change in theta falls below tol."""
    return np.sum(np.abs(theta1 - theta0)) < tol


alpha = 0.03  # learning rate
n = len(X)

while True:
    theta_pre = theta.copy()
    print(theta_pre)

    # Simultaneous update: every theta_j is computed from theta_pre only.
    theta = theta_pre - alpha / n * np.dot(X.T, np.dot(X, theta_pre) - y)

    print('Cost:{}'.format(cost(X, y, theta)))

    if converge(theta_pre, theta):
        break

print(theta)
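
One way to sanity-check a gradient descent result is to compare it with a direct least-squares solution of the same problem. The sketch below does this on synthetic data so that it runs without the housing dataset. The data, the fixed iteration count and the variable names are assumptions made for this example, and `np.linalg.lstsq` stands in as the reference solver.

```python
import numpy as np

# Synthetic data with a known linear relationship plus a little noise.
rng = np.random.default_rng(42)
features = rng.normal(size=(200, 3))
X_chk = np.hstack((np.ones((200, 1)), features))  # x_0 = 1
true_theta = np.array([1.0, 2.0, -3.0, 0.5])
y_chk = X_chk @ true_theta + 0.01 * rng.normal(size=200)

# Batch gradient descent with the same update rule as above.
theta_gd = np.zeros(4)
for _ in range(5000):
    gradient = X_chk.T @ (X_chk @ theta_gd - y_chk) / len(X_chk)
    theta_gd -= 0.1 * gradient

# Direct least-squares solution for comparison.
theta_ls, *_ = np.linalg.lstsq(X_chk, y_chk, rcond=None)

print(np.max(np.abs(theta_gd - theta_ls)))  # should be very small
```

If the two parameter vectors differ noticeably, the learning rate or the stopping criterion of the gradient descent loop is the first thing to inspect.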

References:

[1] https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names
