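Evaluation metrics for regression models:
- Mean squared error (MSE)
- Root mean squared error (RMSE)
- Mean absolute error (MAE)
- R-squared (coefficient of determination, R²)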

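# Multiple Linear Regression
Simple (univariate) linear regression has a single independent variable X, and the fitted result is a straight line.

Multiple linear regression has several independent variables.

With two independent variables, the fitted result is a plane in three-dimensional space; with more than two, it is a hyperplane.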
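
To make the "plane" picture concrete, here is a minimal sketch that fits a two-feature linear regression to synthetic, noise-free data. The data, the true coefficients (1, 2, 3), and the names `X_demo`, `y_demo`, and `plane` are made up for illustration and are not part of the housing example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic points lying exactly on the plane y = 1 + 2*x1 + 3*x2 (illustrative values).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(50, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1]

# With two features, the fitted model is a plane: intercept_ + coef_[0]*x1 + coef_[1]*x2.
plane = LinearRegression().fit(X_demo, y_demo)
print(plane.intercept_, plane.coef_)  # close to 1.0 and [2.0, 3.0]
```

Because the data contain no noise, the fit recovers the plane up to floating-point error; real data would only approximate it.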

# Boston Housing Price Regression Dataset

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [5]:
housing_data = pd.read_csv('data/housing.csv')

In [7]:
housing_data.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PIRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


CRIM: Per capita crime rate by town.

ZN: Proportion of residential land zoned for lots over 25,000 sq. ft.

INDUS: Proportion of non-retail business acres per town.

CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise).

NOX: Nitric oxide concentration (parts per 10 million).

RM: Average number of rooms per dwelling.

AGE: Proportion of owner-occupied units built prior to 1940.

DIS: Weighted mean of distances to five Boston employment centres.

RAD: Index of accessibility to radial highways.

TAX: Full-value property-tax rate per $10,000.

PTRATIO: Pupil-teacher ratio by town (the column is spelled `PIRATIO` in the header row above).

B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town; a measure of the town's racial composition.

LSTAT: Percentage of the population with lower socioeconomic status.

MEDV: Median value of owner-occupied homes, in thousands of dollars.

In [13]:
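# Feature matrix (uppercase by convention for matrices).
# Note: this still contains the 'Unnamed: 0' index column shown by head().
X = housing_data.drop(['MEDV'], axis=1)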

In [9]:
y = housing_data['MEDV']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)    # split into training and test sets (test_size defaults to 0.25)
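
As a small sketch of what this call does by default, the example below splits 100 made-up rows and checks the resulting sizes. The arrays `X_demo` and `y_demo` are placeholders, not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 placeholder samples with 2 features each.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# With no test_size given, 25% of the rows go to the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)

# Fixing random_state makes the shuffle reproducible: the same seed gives the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, random_state=0)
print(np.array_equal(X_te, X_te2))  # True
```

Passing `test_size=0.2` (or another fraction) changes the proportion; the seed only controls which rows land in each part.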

In [11]:
model = LinearRegression()

In [12]:
model.fit(X_train, y_train)    # fit the model to the training data

In [14]:
y_pred_class = model.predict(X_test)
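
For intuition about what `predict` returns, the sketch below reproduces the predictions of a fitted `LinearRegression` by hand from its `coef_` and `intercept_` attributes. The data are randomly generated and the names `X_demo`, `y_demo`, and `demo_model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: 30 samples, 3 features, a linear signal plus a little noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 3))
y_demo = X_demo @ np.array([0.5, -1.0, 2.0]) + 4.0 + rng.normal(scale=0.1, size=30)

demo_model = LinearRegression().fit(X_demo, y_demo)

# A linear model predicts: (features times coefficients) plus the intercept.
manual_pred = X_demo @ demo_model.coef_ + demo_model.intercept_
print(np.allclose(manual_pred, demo_model.predict(X_demo)))  # True
```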

In [15]:
y_pred_class

array([24.95233283, 23.61699724, 29.20588553, 11.96070515, 21.33362042,
       19.46954895, 20.42228421, 21.52044058, 18.98954101, 19.950983  ,
        4.92468244, 16.09694058, 16.93599574,  5.33508402, 39.84434398,
       32.33549843, 22.32772572, 36.54017819, 31.03300611, 23.32172503,
       24.92086498, 24.26106474, 20.71504422, 30.45072552, 22.45009234,
        9.87470006, 17.70324412, 17.974775  , 35.69932012, 20.7940972 ,
       18.10554174, 17.68317865, 19.71354713, 23.79693873, 29.06528958,
       19.23738284, 10.97815878, 24.56199978, 17.32913052, 15.20340817,
       26.09337458, 20.87706795, 22.26187518, 15.32582693, 22.85847963,
       25.08887173, 19.74138819, 22.70744911,  9.66708558, 24.46175926,
       20.72654169, 17.52545047, 24.45596997, 30.10668865, 13.31250981,
       21.52052342, 20.65642932, 15.34285652, 13.7741129 , 22.07429287,
       17.53293957, 21.60707766, 32.91050188, 31.32796114, 17.64346364,
       32.69909854, 18.56579207, 19.32110821, 18.81256692, 23.04

# Mean Absolute Error (MAE)
- The mean of the absolute differences between the predicted and true values.

In [17]:
metrics.mean_absolute_error(y_test, y_pred_class)

3.6683301481357242
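
The definition can be checked by hand on a few numbers. This is a minimal NumPy sketch with made-up values; `metrics.mean_absolute_error` should give the same result on the same inputs.

```python
import numpy as np

# Made-up true and predicted values.
y_true_demo = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

# Absolute errors are 0.5, 0.0, 1.5, 1.0; their mean is the MAE.
mae = np.mean(np.abs(y_true_demo - y_pred_demo))
print(mae)  # 0.75
```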

# Mean Squared Error (MSE)
- All missing values must be removed before computing the MSE.

In [18]:
metrics.mean_squared_error(y_test, y_pred_class)

29.782245092302468
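
The following sketch shows why missing values matter, using plain NumPy and made-up numbers: a single `NaN` turns the whole mean into `NaN`, so the affected pairs are dropped first. The mask name `keep` is an illustrative choice.

```python
import numpy as np

# Made-up values; the second true value is missing.
y_true_demo = np.array([3.0, np.nan, 5.0, 2.5, 7.0])
y_pred_demo = np.array([2.5, 1.0, 5.0, 4.0, 8.0])

# Without cleaning, the NaN propagates through the mean.
naive_mse = np.mean((y_true_demo - y_pred_demo) ** 2)

# Keep only the pairs where both values are present, then average the squared errors.
keep = ~np.isnan(y_true_demo) & ~np.isnan(y_pred_demo)
mse = np.mean((y_true_demo[keep] - y_pred_demo[keep]) ** 2)
print(naive_mse, mse)  # nan 0.875
```

With a pandas DataFrame, calling `dropna()` before splitting the data achieves the same cleanup.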

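# Root Mean Squared Error (RMSE)
- The square root of the MSE, expressed in the same units as the target (here, thousands of dollars).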

In [21]:
from math import sqrt

print(sqrt(metrics.mean_squared_error(y_test, y_pred_class)))

5.457311159564064
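
As a rough illustration of how RMSE differs from MAE, the sketch below uses four made-up residuals, one of them large. Squaring gives the large error more weight, so the RMSE comes out above the MAE. The last lines recompute the RMSE from the MSE value reported earlier in this notebook.

```python
import math
import numpy as np

# Made-up residuals (prediction minus truth), with one large error.
errors = np.array([1.0, 1.0, 1.0, 5.0])

mae = np.mean(np.abs(errors))            # (1 + 1 + 1 + 5) / 4 = 2.0
rmse = math.sqrt(np.mean(errors ** 2))   # sqrt((1 + 1 + 1 + 25) / 4) = sqrt(7)
print(mae, rmse)

# Square root of the MSE printed above in this notebook.
rmse_from_mse = math.sqrt(29.782245092302468)
print(rmse_from_mse)
```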


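# Root Mean Squared Logarithmic Error (RMSLE)
- `metrics.mean_squared_log_error` returns the mean squared logarithmic error (MSLE); the RMSLE is the square root of that value.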

In [23]:
metrics.mean_squared_log_error(y_test, y_pred_class)

0.10011074434044048
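
The value printed above is the MSLE. A minimal sketch of the relationship, assuming the usual definition based on `log(1 + y)`: the MSLE is computed by hand on a few made-up prices, and the square root gives the RMSLE. The same square root is then applied to the MSLE reported above.

```python
import numpy as np

# Made-up true and predicted prices (in thousands of dollars); all must be non-negative.
y_true_demo = np.array([24.0, 21.6, 34.7])
y_pred_demo = np.array([25.0, 23.6, 29.2])

# MSLE: mean squared difference of log(1 + y); log1p(y) computes log(1 + y).
msle = np.mean((np.log1p(y_pred_demo) - np.log1p(y_true_demo)) ** 2)
rmsle = np.sqrt(msle)
print(msle, rmsle)

# RMSLE corresponding to the MSLE printed above in this notebook.
rmsle_from_msle = np.sqrt(0.10011074434044048)
print(rmsle_from_msle)
```

Because the errors are measured on a log scale, this metric reflects relative rather than absolute differences.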

# R-squared (Coefficient of Determination)

In [24]:
metrics.r2_score(y_test, y_pred_class)

0.6354638433202117
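
R-squared compares the model's squared error with that of a baseline that always predicts the mean of the true values: R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up numbers; `metrics.r2_score` should agree on the same inputs.

```python
import numpy as np

# Made-up true and predicted values.
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.5, 2.0, 3.0, 3.5])

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)          # residual sum of squares: 0.5
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # total sum of squares: 5.0
r2 = 1.0 - ss_res / ss_tot
print(r2)  # about 0.9

# Always predicting the mean gives SS_res equal to SS_tot, so R-squared is 0.
baseline = np.full_like(y_true_demo, y_true_demo.mean())
r2_baseline = 1.0 - np.sum((y_true_demo - baseline) ** 2) / ss_tot
print(r2_baseline)  # 0.0
```

A value near 1 means the model explains most of the variance; a negative value means it does worse than the mean baseline.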

# Adjusted R-Squared