张子豪 2019-11-26

Evaluation metrics for regression models:
- Mean squared error (MSE)<br>
$$MSE = \frac{1}{m}\sum_{i=1}^{m}(y_i-\hat{y_i})^2$$
<br>
<br>
- Root mean squared error (RMSE)<br>
$$RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i-\hat{y_i})^2}$$
<br>
<br>
- Mean absolute error (MAE)<br>
$$MAE = \frac{1}{m}\sum_{i=1}^{m}|y_i - \hat{y_i}|$$
<br>
<br>
- R squared (coefficient of determination)<br>
$$R^2 = 1 - \frac{MSE(\hat{y},y)}{Var(y)}$$
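
The four formulas above can be computed directly from their definitions. The following is a minimal sketch in plain Python; the function name `regression_metrics` and the toy values are assumptions made for illustration and are not taken from the housing dataset.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R^2 computed directly from the formulas."""
    m = len(y_true)
    errors = [yt - yp for yt, yp in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / m
    mae = sum(abs(e) for e in errors) / m
    mean_y = sum(y_true) / m
    # Population variance of the true values, i.e. the MSE of always predicting the mean
    var_y = sum((yt - mean_y) ** 2 for yt in y_true) / m
    return {"MSE": mse, "RMSE": sqrt(mse), "MAE": mae, "R2": 1 - mse / var_y}

# Hypothetical toy data
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
scores = regression_metrics(y_true, y_pred)
print(scores)
```

On this toy data the errors are 1, 0, -1, 0, so MSE and MAE are both 0.5 and R squared is 0.9.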

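## Multiple Linear Regression
Simple (univariate) linear regression has a single independent variable X, and the fitted result is a straight line.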
$$\hat{y} =\omega X + b$$
<img src='https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1564638256971&di=35113f82f5519650a67caffeb4eff630&imgtype=0&src=http%3A%2F%2Fmmbiz.qpic.cn%2Fmmbiz_jpg%2FxmJDsm3tEXxSwvFMABytfRJdk1BR4DPJyAnpNVKRAEFLczCSMKrmVmGWBBlMBhsqcczBFkcbfibpyFvM4xfL35Q%2F640%3Fwx_fmt%3Djpeg' width=300>
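Multiple linear regression has several independent variables, each with its own slope (coefficient). Categorical variables (for example, whether a day falls on a weekend) are handled with one-hot encoding.
$$\hat{y} = \omega_1X_1 + \omega_2X_2 + \omega_3 X_3 + \cdots + \omega_nX_n + b$$
The fitted result of multiple linear regression is a hyperplane in a higher-dimensional space.<br>
The figure below illustrates regression with two independent variables, where the fitted result is a plane in three-dimensional space.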
<img src='https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1564638401236&di=40cef28dd8dc76d8b5f6ebb3a1bd7b9c&imgtype=0&src=http%3A%2F%2Fwww.pianshen.com%2Fimages%2F701%2F0c4b7be6f1139b1ba7d3930c7455f245.png' width=300>
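
To make the role of the one-hot column concrete, here is a small sketch that fits a two-variable regression with NumPy's least-squares solver. All data and variable names (`temperature`, `day_type`, `is_weekend`) are hypothetical. Because there are only two categories, one of the two one-hot columns is dropped so that it is not collinear with the intercept.

```python
import numpy as np

# Hypothetical data: one numeric feature and one categorical feature
temperature = np.array([10.0, 15.0, 20.0, 25.0, 12.0, 18.0])
day_type = np.array(["weekday", "weekend", "weekday", "weekend", "weekend", "weekday"])

# One-hot encode the category, keeping only the "weekend" indicator column
is_weekend = (day_type == "weekend").astype(float)

# Targets generated from known coefficients: y = 2 * temperature + 5 * is_weekend + 1
y = 2.0 * temperature + 5.0 * is_weekend + 1.0

# Design matrix; the trailing column of ones lets the solver estimate the intercept b
X = np.column_stack([temperature, is_weekend, np.ones_like(temperature)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
w1, w2, b = coef
print(w1, w2, b)
```

Since the targets are noise-free, the solver should recover the coefficients used to generate them (approximately 2, 5 and 1).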

# Boston Housing Price Regression Dataset

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
housing_data = pd.read_csv('boston_house_price_english.csv')

In [3]:
housing_data.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [4]:
housing_data.tail()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
501,0.06263,0.0,11.93,0,0.573,6.593,69.1,2.4786,1,273,21.0,391.99,9.67,22.4
502,0.04527,0.0,11.93,0,0.573,6.12,76.7,2.2875,1,273,21.0,396.9,9.08,20.6
503,0.06076,0.0,11.93,0,0.573,6.976,91.0,2.1675,1,273,21.0,396.9,5.64,23.9
504,0.10959,0.0,11.93,0,0.573,6.794,89.3,2.3889,1,273,21.0,393.45,6.48,22.0
505,0.04741,0.0,11.93,0,0.573,6.03,80.8,2.505,1,273,21.0,396.9,7.88,11.9


In [5]:
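# Without parentheses this returns the bound method, whose repr is shown below;
# call housing_data.info() to get the per-column summary instead.
housing_data.info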

<bound method DataFrame.info of         CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD  TAX  \
0    0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296   
1    0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242   
2    0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242   
3    0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222   
4    0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222   
..       ...   ...    ...   ...    ...    ...   ...     ...  ...  ...   
501  0.06263   0.0  11.93     0  0.573  6.593  69.1  2.4786    1  273   
502  0.04527   0.0  11.93     0  0.573  6.120  76.7  2.2875    1  273   
503  0.06076   0.0  11.93     0  0.573  6.976  91.0  2.1675    1  273   
504  0.10959   0.0  11.93     0  0.573  6.794  89.3  2.3889    1  273   
505  0.04741   0.0  11.93     0  0.573  6.030  80.8  2.5050    1  273   

     PTRATIO       B  LSTAT  MEDV  
0       15.3  396.90   4.98  24.0  
1       17.8  396.9

In [12]:
housing_data.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36,21.2
75%,3.677082,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


In [6]:
X =  housing_data.drop(["MEDV"],axis = 1)

In [7]:
y = housing_data["MEDV"]

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [9]:
model = LinearRegression()

In [10]:
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [11]:
y_pred_class = model.predict(X_test)

In [13]:
y_pred_class

array([24.95233283, 23.61699724, 29.20588553, 11.96070515, 21.33362042,
       19.46954895, 20.42228421, 21.52044058, 18.98954101, 19.950983  ,
        4.92468244, 16.09694058, 16.93599574,  5.33508402, 39.84434398,
       32.33549843, 22.32772572, 36.54017819, 31.03300611, 23.32172503,
       24.92086498, 24.26106474, 20.71504422, 30.45072552, 22.45009234,
        9.87470006, 17.70324412, 17.974775  , 35.69932012, 20.7940972 ,
       18.10554174, 17.68317865, 19.71354713, 23.79693873, 29.06528958,
       19.23738284, 10.97815878, 24.56199978, 17.32913052, 15.20340817,
       26.09337458, 20.87706795, 22.26187518, 15.32582693, 22.85847963,
       25.08887173, 19.74138819, 22.70744911,  9.66708558, 24.46175926,
       20.72654169, 17.52545047, 24.45596997, 30.10668865, 13.31250981,
       21.52052342, 20.65642932, 15.34285652, 13.7741129 , 22.07429287,
       17.53293957, 21.60707766, 32.91050188, 31.32796114, 17.64346364,
       32.69909854, 18.56579207, 19.32110821, 18.81256692, 23.04

# Mean Absolute Error (MAE)
- The average of the absolute differences between the predicted and true values

$$ Mean\ Absolute\ Error = \frac{1}{N} \sum_{i=1}^{N} |y_{i} -  \hat{y_{i}}|$$

In [13]:
metrics.mean_absolute_error(y_test, y_pred_class)

3.66833014813572

# Mean Squared Error (MSE)

- Missing values must be removed (or imputed) before computing MSE, since a single NaN makes the sum undefined

$$ Mean\ Squared\ Error = \frac{1}{N} \sum_{i=1}^{N} (y_{i} -  \hat{y_{i}})^2$$

In [14]:
metrics.mean_squared_error(y_test, y_pred_class)

29.782245092302343

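# Root Mean Squared Error (RMSE)

- RMSE is the square root of MSE
- RMSE is never smaller than MAE on the same predictions, and is usually larger because squaring weights large errors more heavily
- RMSE has the same units as the original data
- RMSE (through MSE) is easy to differentiate, so it is commonly used as an evaluation metric for regression models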


$$ Root\ Mean\ Squared\ Error =\sqrt{ \frac{1}{N} \sum_{i=1}^{N} (y_{i} -  \hat{y_{i}})^2}$$

In [15]:
from math import sqrt

print(sqrt(metrics.mean_squared_error(y_test, y_pred_class)))

5.457311159564052
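
The relationship between RMSE and MAE can be checked on a small example. The sketch below uses two hypothetical error vectors with the same total absolute error: one spread evenly and one concentrated in a single large miss. The helper names `mae` and `rmse` are illustrative only.

```python
from math import sqrt

def mae(errors):
    """Mean of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Square root of the mean of the squared errors."""
    return sqrt(sum(e ** 2 for e in errors) / len(errors))

even_errors = [2.0, 2.0, 2.0, 2.0]     # every prediction is off by the same amount
outlier_errors = [0.0, 0.0, 0.0, 8.0]  # same total absolute error, one large miss

print(mae(even_errors), rmse(even_errors))        # 2.0 2.0
print(mae(outlier_errors), rmse(outlier_errors))  # 2.0 4.0
```

MAE is identical in both cases, while RMSE doubles when the error is concentrated in one sample. The two metrics coincide only when all errors have the same magnitude.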


# Root Mean Squared Logarithmic Error (RMSLE)

$$ Root\ Mean\ Squared\ Log\ Error =\sqrt{ \frac{1}{N} \sum_{i=1}^{N} (\log (y_{i} + 1) -  \log (\hat{y_{i}} + 1))^2}$$

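Advantages of RMSLE

1. RMSLE penalizes under-prediction more heavily than over-prediction, which suits scenarios where under-prediction is the costlier mistake. When forecasting demand for shared bikes, for example, under-prediction leaves too few bikes available.

If the true value is 1000 and the prediction is 600, then RMSE = 400 and RMSLE = 0.510.

If the true value is 1000 and the prediction is 1400, then RMSE = 400 and RMSLE = 0.336.

With the same RMSE, the RMSLE is larger when the prediction falls below the true value, so under-prediction is penalized more.

2. When the target values span a wide range, RMSE is dominated by the largest values. Even if many small values are predicted well, a single badly predicted large value makes RMSE large. Conversely, a worse algorithm that is more accurate on that one large value but off on many small values may end up with a lower RMSE. Taking the log before computing RMSE mitigates this problem to some extent. RMSE is generally appropriate only when the target values lie in a fairly narrow, evenly spread range.

Intuitively, when a few predictions differ greatly from their true values, the log function reduces their influence on the overall error.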
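
The two single-sample cases above (true value 1000, predictions 600 and 1400) can be reproduced with a few lines of plain Python. The function name `rmsle` is an illustrative helper, not a library call, and the natural logarithm is assumed.

```python
from math import log, sqrt

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(y + 1) on both sides."""
    squared_log_errors = [
        (log(yt + 1) - log(yp + 1)) ** 2 for yt, yp in zip(y_true, y_pred)
    ]
    return sqrt(sum(squared_log_errors) / len(squared_log_errors))

under = rmsle([1000], [600])   # prediction below the true value
over = rmsle([1000], [1400])   # prediction above the true value by the same amount
print(f"{under:.3f} {over:.3f}")  # 0.510 0.336
```

Both predictions miss by 400, yet the under-prediction scores noticeably worse, because the error is measured on the ratio of the two values and not on their difference.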


In [17]:
# Note: this returns the mean squared log error (MSLE); take its square root to obtain RMSLE
metrics.mean_squared_log_error(y_test, y_pred_class)

0.10011074434043424

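# R Squared (Coefficient of Determination)

In classification problems, a random classifier is often used as the baseline model; on a balanced binary problem its accuracy is 0.5.

In regression problems, the baseline model is a regressor that always outputs the mean of the target.

R squared is one minus the ratio of the regression model's MSE to the baseline model's MSE.

If a regression model is as bad as the baseline, R squared is 0.

If a regression model predicts every value exactly, R squared is 1.

If a regression model is worse than the baseline, R squared is negative.

$$R^2 = 1 - \frac{MSE(model)}{MSE(baseline)} = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y_i})^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}$$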

$$R^2 = 1 - \frac{MSE(\hat{y},y)}{Var(y)}$$

In [18]:
metrics.r2_score(y_test, y_pred_class)

0.6354638433202131
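
The three reference cases (baseline, perfect, worse than baseline) can be verified with a short sketch. The helper `r_squared` and the toy targets are assumptions for illustration; they are unrelated to the housing data.

```python
def r_squared(y_true, y_pred):
    """R^2 as one minus the ratio of model squared error to baseline squared error."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # model error
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)               # mean-baseline error
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0, 5.0]

baseline_pred = [3.0] * 5                  # always predict the mean of y_true
perfect_pred = list(y_true)                # predict every value exactly
worse_pred = [5.0, 4.0, 3.0, 2.0, 1.0]     # predictions in reversed order

print(r_squared(y_true, baseline_pred))  # 0.0
print(r_squared(y_true, perfect_pred))   # 1.0
print(r_squared(y_true, worse_pred))     # -3.0
```

The reversed predictions have four times the squared error of the mean baseline, which is why R squared drops to -3.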

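# Adjusted R-Squared

With the other variables held fixed, adding a new variable never lowers R squared on the training data, but the extra feature makes the model more complex and the apparent gain in accuracy can be spurious. Adjusted R squared effectively adds a penalty on the number of features.

In other words, given two models with the same sample size and the same R squared, the one using fewer variables is better. This is the idea behind Occam's razor: entities should not be multiplied beyond necessity.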


$$\bar{R^2} = 1 - (1 - R^2)(\frac{n - 1}{n - k - 1})$$

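k: number of features

n: number of data samples


Adjusted R squared takes the number of features into account.
Intuitively, when we add features, the denominator $n-(k+1)$ shrinks, so adjusted R squared decreases if R squared stays the same.

In practice, if we add a good feature, R squared rises by enough that adjusted R squared rises as well.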
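
The penalty can be seen by holding R squared fixed and varying the number of features. The sketch below uses the denominator $n-(k+1)$ described above. The function name `adjusted_r2`, the sample size, and the feature counts are hypothetical values chosen for illustration; the R squared of 0.635 is a rounded version of the score reported earlier.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n samples and k features, with denominator n - (k + 1)."""
    if n - k - 1 <= 0:
        raise ValueError("need more samples than features plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2 = 0.635   # rounded R^2, held fixed for the comparison
n = 100      # hypothetical number of samples

few_features = adjusted_r2(r2, n, k=5)
many_features = adjusted_r2(r2, n, k=13)
print(few_features, many_features)
```

With the same R squared, the model that uses more features gets the lower adjusted score, and both scores fall below the unadjusted value.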


# Adjusted R-Squared vs. RMSE

The absolute value of RMSE does not by itself tell how good or bad a model is, because it depends on the scale of the target; it is mainly useful for comparing models on the same data. A fortune-teller's made-up predictions might happen to give a small RMSE, but a lucky guess proves nothing. Adjusted R², by contrast, can be read on its own: a model with an adjusted R² of 0.05 explains almost none of the variance and is clearly poor.

However, if prediction accuracy is all we care about, RMSE is a good choice. It is computationally simple, easy to differentiate, and convenient for comparing the errors of different models, which is why it is the default metric for many models and is often used as the evaluation metric in data science competitions.