
## SVM regression models
## K-Nearest Neighbor regression models
## Decision Tree regression models
## Ensemble regression models
### Patrick 🌰

In [1]:
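# Import the Boston housing price loader from sklearn.datasets.
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older release.
from sklearn.datasets import load_boston
# Load the housing data into the variable boston.
boston = load_boston()
# Print the dataset description.
print(boston.DESCR)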


.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

In [2]:
from sklearn.model_selection import train_test_split


import numpy as np

X = boston.data
y = boston.target

# Randomly sample 25% of the data as the test set; the rest is used for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.25)

# Inspect the range of the regression target values.
print("The max target value is", np.max(boston.target))
print("The min target value is", np.min(boston.target))
print("The average target value is", np.mean(boston.target))


The max target value is 50.0
The min target value is 5.0
The average target value is 22.532806324110677


## Standardizing the training and test data

In [3]:
# Import the standardization module from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler

# Initialize separate scalers for the features and the target.
ss_X = StandardScaler()
ss_y = StandardScaler()
# StandardScaler expects 2-D input, so reshape the 1-D targets into single-column arrays.
y_test = y_test.reshape(-1, 1)
y_train = y_train.reshape(-1, 1)
# Standardize features and targets: fit on the training data, then apply the same transform to the test data.
X_train = ss_X.fit_transform(X_train)
X_test = ss_X.transform(X_test)

y_train = ss_y.fit_transform(y_train)
y_test = ss_y.transform(y_test)
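
To make the standardization step concrete, here is a minimal pure-Python sketch of what fitting, transforming, and inverting a scaler amounts to. The helper names (`fit_scaler`, `transform`, `inverse_transform`) and the sample values are hypothetical and only for illustration; the notebook itself relies on `StandardScaler`.

```python
import math

def fit_scaler(values):
    """Return (mean, std) using the population standard deviation, as StandardScaler does."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def transform(values, mean, std):
    """Shift to zero mean and scale to unit variance."""
    return [(v - mean) / std for v in values]

def inverse_transform(values, mean, std):
    """Map standardized values back to the original scale."""
    return [v * std + mean for v in values]

# Hypothetical target values, not taken from the Boston data.
y_train_demo = [10.0, 20.0, 30.0, 40.0]
y_test_demo = [15.0, 35.0]

# The statistics come from the training split only and are reused for the test split.
mean, std = fit_scaler(y_train_demo)
y_train_scaled = transform(y_train_demo, mean, std)
y_test_scaled = transform(y_test_demo, mean, std)
y_test_restored = inverse_transform(y_test_scaled, mean, std)

print(y_train_scaled)
print(y_test_restored)
```

Fitting on the training split alone keeps information about the test targets out of the preprocessing step, which is why the cell above calls `fit_transform` on the training data and only `transform` on the test data.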



## Linear regression models: LinearRegression & SGDRegressor

In [6]:
# Import LinearRegression from sklearn.linear_model.
from sklearn.linear_model import LinearRegression

# Initialize LinearRegression with the default configuration.
lr = LinearRegression()
# Estimate the parameters on the training data.
lr.fit(X_train, y_train)
# Predict on the test data.
lr_y_predict = lr.predict(X_test)



In [7]:
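# Import SGDRegressor from sklearn.linear_model.
from sklearn.linear_model import SGDRegressor

# Initialize SGDRegressor with the default configuration.
sgdr = SGDRegressor()
# Estimate the parameters on the training data. ravel() passes y as the 1-D array
# SGDRegressor expects, which avoids the column_or_1d DataConversionWarning.
sgdr.fit(X_train, y_train.ravel())
# Predict on the test data.
sgdr_y_predict = sgdr.predict(X_test)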


In [8]:
# Use LinearRegression's built-in score method and print the result.
print('The value of default measurement of LinearRegression is', lr.score(X_test, y_test))

# Import r2_score, mean_squared_error and mean_absolute_error from sklearn.metrics to evaluate regression performance.
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Compute R-squared with r2_score and print the result.
print('The value of R-squared of LinearRegression is', r2_score(y_test, lr_y_predict))

# Compute the mean squared error on the original price scale and print the result.
print('The mean squared error of LinearRegression is', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(lr_y_predict)))

# Compute the mean absolute error on the original price scale and print the result.
print('The mean absolute error of LinearRegression is', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(lr_y_predict)))


The value of default measurement of LinearRegression is 0.675795501452948
The value of R-squared of LinearRegression is 0.675795501452948
The mean squared error of LinearRegression is 25.139236520353457
The mean absolute error of LinearRegression is 3.5325325437053983
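
The three metrics used above can be written out by hand, which also shows why the notebook calls `inverse_transform` before computing MSE and MAE but not before R-squared. The sketch below uses hypothetical values and hypothetical helper names (`mse`, `mae`, `r2`); it is not the scikit-learn implementation.

```python
def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical targets and predictions on the original scale.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 10.0]

# The same values after shifting by 6.0 and dividing by 2.0 (an illustrative scaling).
scaled_true = [(v - 6.0) / 2.0 for v in y_true]
scaled_pred = [(v - 6.0) / 2.0 for v in y_pred]

print("R2  original / scaled:", r2(y_true, y_pred), r2(scaled_true, scaled_pred))
print("MSE original / scaled:", mse(y_true, y_pred), mse(scaled_true, scaled_pred))
print("MAE original / scaled:", mae(y_true, y_pred), mae(scaled_true, scaled_pred))
```

R-squared is unchanged when targets and predictions are shifted and scaled together, so it can be computed on standardized values. MSE and MAE depend on the units, so they are only interpretable as price errors after mapping back to the original scale.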


In [9]:
# Use SGDRegressor's built-in score method and print the result.
print('The value of default measurement of SGDRegressor is', sgdr.score(X_test, y_test))

# Compute R-squared with r2_score and print the result.
print('The value of R-squared of SGDRegressor is', r2_score(y_test, sgdr_y_predict))

# Compute the mean squared error on the original price scale and print the result.
print('The mean squared error of SGDRegressor is', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(sgdr_y_predict)))

# Compute the mean absolute error on the original price scale and print the result.
print('The mean absolute error of SGDRegressor is', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(sgdr_y_predict)))


The value of default measurement of SGDRegressor is 0.6585897732863233
The value of R-squared of SGDRegressor is 0.6585897732863233
The mean squared error of SGDRegressor is 26.473390956285545
The mean absolute error of SGDRegressor is 3.549437134404973


* Linear regression is the simplest and easiest regression model to use. Its assumption of a linear relationship between the features and the regression target is also what limits its range of application: in real-world data, most features cannot be guaranteed to relate to the target in a strictly linear way. Even so, when the relationships among the features are unknown, a linear regression model still serves as a reasonable baseline system for most scientific experiments.
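
A small sketch can illustrate how the linearity assumption limits the model. It fits a one-feature least-squares line by the closed-form formula, first to data that really is linear and then to data that is quadratic. The data and the helper names (`fit_line`, `r2`) are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r2(ys, preds):
    """Coefficient of determination."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
y_linear = [2.0 * x + 1.0 for x in xs]   # target is linear in x
y_quadratic = [x ** 2 for x in xs]       # target is not linear in x

slope_lin, intercept_lin = fit_line(xs, y_linear)
slope_quad, intercept_quad = fit_line(xs, y_quadratic)

r2_linear = r2(y_linear, [slope_lin * x + intercept_lin for x in xs])
r2_quadratic = r2(y_quadratic, [slope_quad * x + intercept_quad for x in xs])

print("R2 on linear target:   ", r2_linear)
print("R2 on quadratic target:", r2_quadratic)
```

On the quadratic target the best line is flat, so it explains none of the variance, yet the fit is cheap and easy to interpret, which is what makes it useful as a baseline to compare other models against.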


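## Support vector machine regression
* Train support vector regression models configured with three different kernel functions, and predict on the test data with each
* Evaluate the three kernel configurations on the same test set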

In [10]:
# Import the support vector machine (regression) model from sklearn.svm.
from sklearn.svm import SVR

# Train an SVR with a linear kernel and predict on the test samples.
# ravel() passes y as a 1-D array, which avoids the column_or_1d DataConversionWarning.
linear_svr = SVR(kernel='linear')
linear_svr.fit(X_train, y_train.ravel())
linear_svr_y_predict = linear_svr.predict(X_test)

# Train an SVR with a polynomial kernel and predict on the test samples.
poly_svr = SVR(kernel='poly')
poly_svr.fit(X_train, y_train.ravel())
poly_svr_y_predict = poly_svr.predict(X_test)

# Train an SVR with a radial basis function (RBF) kernel and predict on the test samples.
rbf_svr = SVR(kernel='rbf')
rbf_svr.fit(X_train, y_train.ravel())
rbf_svr_y_predict = rbf_svr.predict(X_test)


In [11]:
# Evaluate the three SVR configurations on the same test set using R-squared, MSE and MAE.
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
print('R-squared value of linear SVR is', linear_svr.score(X_test, y_test))
print('The mean squared error of linear SVR is', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(linear_svr_y_predict)))
print('The mean absolute error of linear SVR is', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(linear_svr_y_predict)))


R-squared value of linear SVR is 0.650659546421538
The mean squared error of linear SVR is 27.088311013556027
The mean absolute error of linear SVR is 3.4328013877599624


In [12]:
print('R-squared value of Poly SVR is', poly_svr.score(X_test, y_test))
print('The mean squared error of Poly SVR is', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(poly_svr_y_predict)))
print('The mean absolute error of Poly SVR is', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(poly_svr_y_predict)))


R-squared value of Poly SVR is 0.40365065102550846
The mean squared error of Poly SVR is 46.24170053103929
The mean absolute error of Poly SVR is 3.73840737104651


In [13]:
print('R-squared value of RBF SVR is', rbf_svr.score(X_test, y_test))
print('The mean squared error of RBF SVR is', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(rbf_svr_y_predict)))
print('The mean absolute error of RBF SVR is', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(rbf_svr_y_predict)))


R-squared value of RBF SVR is 0.7559887416340944
The mean squared error of RBF SVR is 18.920948861538733
The mean absolute error of RBF SVR is 2.6067819999501114


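* The results above show how differently configured models perform on the same data; for this family of models, performance can be changed by choosing a different kernel function.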
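
For reference, the three kernels compared above can be written as plain functions of two feature vectors. This is a sketch of the standard formulas, not of scikit-learn's internals; the `gamma` value is an arbitrary illustrative choice rather than the library default, and the vectors are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def linear_kernel(a, b):
    """k(a, b) = a . b"""
    return dot(a, b)

def poly_kernel(a, b, gamma=1.0, coef0=0.0, degree=3):
    """k(a, b) = (gamma * a . b + coef0) ** degree"""
    return (gamma * dot(a, b) + coef0) ** degree

def rbf_kernel(a, b, gamma=0.5):
    """k(a, b) = exp(-gamma * squared Euclidean distance between a and b)"""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

# Two hypothetical standardized feature vectors.
a = [1.0, 2.0]
b = [2.0, 0.5]

print("linear:", linear_kernel(a, b))
print("poly:  ", poly_kernel(a, b))
print("rbf:   ", rbf_kernel(a, b))
```

The linear kernel keeps the model linear in the features, while the polynomial and RBF kernels measure similarity nonlinearly, which may be one reason the RBF configuration scored best on this data.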

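## K-nearest neighbor regression
* Predict Boston housing prices with differently configured K-nearest neighbor regression models
* Evaluate the predictive performance of the two configurations on the Boston housing data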

In [14]:
# Import KNeighborsRegressor (K-nearest neighbor regressor) from sklearn.neighbors.
from sklearn.neighbors import KNeighborsRegressor

# Initialize a K-nearest neighbor regressor that predicts by plain averaging: weights='uniform'.
uni_knr = KNeighborsRegressor(weights='uniform')
uni_knr.fit(X_train, y_train)
uni_knr_y_predict = uni_knr.predict(X_test)

# Initialize a K-nearest neighbor regressor that predicts by distance-weighted averaging: weights='distance'.
dis_knr = KNeighborsRegressor(weights='distance')
dis_knr.fit(X_train, y_train)
dis_knr_y_predict = dis_knr.predict(X_test)


In [15]:
# Evaluate the uniform-weighted K-nearest neighbor model on the test set using R-squared, MSE and MAE.
print('R-squared value of uniform-weighted KNeighborsRegressor:', uni_knr.score(X_test, y_test))
print('The mean squared error of uniform-weighted KNeighborsRegressor:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(uni_knr_y_predict)))
print('The mean absolute error of uniform-weighted KNeighborsRegressor:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(uni_knr_y_predict)))


R-squared value of uniform-weighted KNeighborsRegressor: 0.6907212176346006
The mean squared error of uniform-weighted KNeighborsRegressor: 23.981877165354337
The mean absolute error of uniform-weighted KNeighborsRegressor: 2.9650393700787396


In [16]:
# Evaluate the distance-weighted K-nearest neighbor model on the test set using R-squared, MSE and MAE.
print('R-squared value of distance-weighted KNeighborsRegressor:', dis_knr.score(X_test, y_test))
print('The mean squared error of distance-weighted KNeighborsRegressor:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(dis_knr_y_predict)))
print('The mean absolute error of distance-weighted KNeighborsRegressor:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(dis_knr_y_predict)))


R-squared value of distance-weighted KNeighborsRegressor: 0.7201094821421603
The mean squared error of distance-weighted KNeighborsRegressor: 21.703073090490353
The mean absolute error of distance-weighted KNeighborsRegressor: 2.801125502210876
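
The difference between the two `weights` settings can be sketched on a one-feature toy problem. The function `knn_predict`, the data, and `k=3` are hypothetical, and the handling of exact matches is a simplification rather than scikit-learn's behavior.

```python
def knn_predict(train_x, train_y, query, k=3, weights="uniform"):
    """Predict one value from the k nearest 1-D training points."""
    # Sort training points by distance to the query and keep the k closest.
    neighbors = sorted((abs(x - query), y) for x, y in zip(train_x, train_y))[:k]
    if weights == "uniform":
        # Plain average: every neighbor counts equally.
        return sum(y for _, y in neighbors) / len(neighbors)
    # Simplification: an exact match (distance 0) returns its own target directly.
    for distance, y in neighbors:
        if distance == 0:
            return y
    # Distance weighting: closer neighbors get larger weights (1 / distance).
    inverse = [1.0 / distance for distance, _ in neighbors]
    weighted_sum = sum(w * y for w, (_, y) in zip(inverse, neighbors))
    return weighted_sum / sum(inverse)

# Hypothetical training data.
train_x = [1.0, 2.0, 4.0, 8.0]
train_y = [10.0, 20.0, 40.0, 80.0]

uniform_prediction = knn_predict(train_x, train_y, query=2.5, weights="uniform")
distance_prediction = knn_predict(train_x, train_y, query=2.5, weights="distance")

print("uniform: ", uniform_prediction)
print("distance:", distance_prediction)
```

With distance weighting, the prediction is pulled toward the nearest neighbor (the point at 2.0), whereas uniform weighting treats all three neighbors alike.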


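## Regression trees
* Strictly speaking, in the sense of producing continuous predictions, a regression tree is not a "regression algorithm": each leaf node returns the mean of the "cluster" of training samples that fall into it, so the model can only output a finite set of values rather than arbitrary continuous ones.
* Fit a regression tree to the Boston housing training data and predict on the test data
* Evaluate the predictive performance of a single regression tree on the Boston housing data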
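
The point about leaf means can be illustrated with a depth-one tree (a "stump") whose split threshold is fixed by hand. The data, the threshold, and the helper names are hypothetical; `DecisionTreeRegressor` chooses its own splits.

```python
def fit_stump(xs, ys, threshold):
    """Store the mean target on each side of a fixed threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return threshold, sum(left) / len(left), sum(right) / len(right)

def predict_stump(stump, x):
    """Return the mean of whichever leaf x falls into."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Hypothetical room counts and prices.
rooms = [4.0, 5.0, 6.0, 7.0, 8.0]
prices = [12.0, 16.0, 22.0, 30.0, 42.0]

stump = fit_stump(rooms, prices, threshold=6.5)
queries = [4.5, 5.5, 6.4, 6.6, 7.5, 9.0]
predictions = [predict_stump(stump, q) for q in queries]

print(predictions)
```

Six different inputs produce only two distinct outputs, one per leaf. A deeper tree has more leaves and therefore more possible outputs, but the prediction remains piecewise constant.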

In [17]:
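# Import DecisionTreeRegressor from sklearn.tree.
from sklearn.tree import DecisionTreeRegressor
# Initialize DecisionTreeRegressor with the default configuration.
dtr = DecisionTreeRegressor()
# Build the regression tree from the Boston housing training data.
dtr.fit(X_train, y_train)
# Predict on the test data with the default single regression tree and store the result in dtr_y_predict.
dtr_y_predict = dtr.predict(X_test)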


In [18]:
# Evaluate the default regression tree on the test set using R-squared, MSE and MAE.
print('R-squared value of DecisionTreeRegressor:', dtr.score(X_test, y_test))
print('The mean squared error of DecisionTreeRegressor:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(dtr_y_predict)))
print('The mean absolute error of DecisionTreeRegressor:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(dtr_y_predict)))


R-squared value of DecisionTreeRegressor: 0.6894984401640106
The mean squared error of DecisionTreeRegressor: 24.07669291338583
The mean absolute error of DecisionTreeRegressor: 3.160629921259843


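* Having covered decision (classification) trees and regression trees, we can summarize the **advantages** of tree models:
1. Tree models can handle nonlinear relationships between the features and the target;
2. Tree models do not require features to be standardized or uniformly quantified; in principle, both numerical and categorical features can be used directly when building the tree and when predicting;
3. For these reasons, a tree model can also expose its decision process directly, which makes its predictions interpretable.
* **Tree models also have some notable drawbacks:**
1. Precisely because tree models can fit complex nonlinear relationships, they easily become overly complex and lose accuracy on new data (generalization);
2. The top-down structure of a tree can change substantially after small changes in the data, so predictions are relatively unstable;
3. Building the optimal tree from training data is an NP-hard problem, meaning no algorithm is known that finds the optimum in polynomial time. Greedy algorithms such as the one used here therefore find only suboptimal trees, which is why we often turn to ensemble models to seek better performance across many suboptimal solutions.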
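
The greedy strategy mentioned in the last point can be sketched for a single feature: at each node, try every candidate threshold and keep the one that minimizes the summed squared error of the two children, without looking ahead to later splits. The helper names and data are hypothetical, and this is a simplified illustration rather than scikit-learn's implementation.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, cost) of the single split with the lowest total SSE."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        # Skip positions where neighboring feature values are identical.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
        cost = sse([y for _, y in pairs[:i]]) + sse([y for _, y in pairs[i:]])
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best

# Hypothetical data with two obvious groups.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

threshold, cost = best_split(xs, ys)
print("best threshold:", threshold, "cost:", cost)
```

Each split is locally optimal, but a sequence of locally optimal splits need not yield the globally best tree, which is the sense in which the result is only a suboptimal solution.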


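## Ensemble models (regression)
* Extremely Randomized Trees (Extra Trees) differ from an ordinary random forest in how each split node is chosen. Both consider a random subset of features at each node, but where a random forest searches every candidate feature for its best threshold, Extra Trees draws a random threshold for each candidate feature and keeps the best of these random splits according to an impurity criterion (information gain or Gini impurity for classification, variance-based criteria such as mean squared error for regression).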
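
Assuming the usual description of Extra Trees, in which split thresholds are drawn at random instead of being searched exhaustively, the contrast with a random-forest-style split can be sketched on one feature. The helper names, data, and seed are hypothetical; a real implementation repeats this for several candidate features and keeps the best resulting split.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def split_cost(xs, ys, threshold):
    """Total SSE of the two children produced by a threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return sse(left) + sse(right)

def exhaustive_threshold(xs, ys):
    """Random-forest style: evaluate every midpoint between distinct sorted values."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    return min(candidates, key=lambda t: split_cost(xs, ys, t))

def random_threshold(xs, rng):
    """Extra-Trees style: draw one threshold uniformly between the min and max."""
    return rng.uniform(min(xs), max(xs))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]

best_t = exhaustive_threshold(xs, ys)
rand_t = random_threshold(xs, rng)

print("searched threshold:", best_t, "cost:", split_cost(xs, ys, best_t))
print("random threshold:  ", rand_t, "cost:", split_cost(xs, ys, rand_t))
```

A single random split is never better than the searched one on the training data, but it is cheaper to compute and makes the individual trees less correlated, which is the trade-off the ensemble relies on.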

In [19]:
# Import RandomForestRegressor, ExtraTreesRegressor and GradientBoostingRegressor from sklearn.ensemble.
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor

# Train a RandomForestRegressor and predict on the test data; the result is stored in rfr_y_predict.
# ravel() passes y as a 1-D array, which avoids the column_or_1d DataConversionWarning.
rfr = RandomForestRegressor()
rfr.fit(X_train, y_train.ravel())
rfr_y_predict = rfr.predict(X_test)

# Train an ExtraTreesRegressor and predict on the test data; the result is stored in etr_y_predict.
etr = ExtraTreesRegressor()
etr.fit(X_train, y_train.ravel())
etr_y_predict = etr.predict(X_test)

# Train a GradientBoostingRegressor and predict on the test data; the result is stored in gbr_y_predict.
gbr = GradientBoostingRegressor()
gbr.fit(X_train, y_train.ravel())
gbr_y_predict = gbr.predict(X_test)


In [20]:
# Evaluate the default random forest regressor on the test set using R-squared, MSE and MAE.
print('R-squared value of RandomForestRegressor:', rfr.score(X_test, y_test))
print('The mean squared error of RandomForestRegressor:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(rfr_y_predict)))
print('The mean absolute error of RandomForestRegressor:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(rfr_y_predict)))


R-squared value of RandomForestRegressor: 0.7904039225449876
The mean squared error of RandomForestRegressor: 16.252351181102362
The mean absolute error of RandomForestRegressor: 2.427401574803149


In [23]:
# Evaluate the default extra-trees regressor on the test set using R-squared, MSE and MAE.
print('R-squared value of ExtraTreesRegressor:', etr.score(X_test, y_test))
print('The mean squared error of ExtraTreesRegressor:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(etr_y_predict)))
print('The mean absolute error of ExtraTreesRegressor:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(etr_y_predict)))

# Use the trained extra-trees model to print each feature's contribution to the prediction target.
print(set(zip(etr.feature_importances_, boston.feature_names)))


R-squared value of ExtraTreesRegressor: 0.8237325190594427
The mean squared error of ExtraTreesRegressor: 13.668008661417323
The mean absolute error of ExtraTreesRegressor: 2.3737795275590554
{(0.05950056111756904, 'INDUS'), (0.03608112426859637, 'TAX'), (0.3041872533259853, 'LSTAT'), (0.04372876118169809, 'NOX'), (0.015833329972827086, 'CHAS'), (0.019750503274310577, 'PTRATIO'), (0.02074390785570981, 'CRIM'), (0.01217830614762584, 'RAD'), (0.03550605742418185, 'DIS'), (0.4196553958498484, 'RM'), (0.01479100178223286, 'AGE'), (0.0027600352520164013, 'ZN'), (0.01528376254739847, 'B')}
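
Because a set has no defined order, the importances printed above are hard to read at a glance. One way to present them is to sort by importance, as in the sketch below; the values are copied from the output above, and the dictionary name `feature_importances` is only for illustration.

```python
# Importances copied from the printed output of the extra-trees model above.
feature_importances = {
    'INDUS': 0.05950056111756904,
    'TAX': 0.03608112426859637,
    'LSTAT': 0.3041872533259853,
    'NOX': 0.04372876118169809,
    'CHAS': 0.015833329972827086,
    'PTRATIO': 0.019750503274310577,
    'CRIM': 0.02074390785570981,
    'RAD': 0.01217830614762584,
    'DIS': 0.03550605742418185,
    'RM': 0.4196553958498484,
    'AGE': 0.01479100178223286,
    'ZN': 0.0027600352520164013,
    'B': 0.01528376254739847,
}

# Sort from most to least important.
ranked = sorted(feature_importances.items(), key=lambda item: item[1], reverse=True)

for name, score in ranked:
    print(name.ljust(8), round(score, 4))
```

In this run, RM (average rooms per dwelling) and LSTAT together account for roughly 72% of the total importance. Since the model is randomized, the exact values would differ between runs.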


In [24]:
# Evaluate the default gradient boosting regressor on the test set using R-squared, MSE and MAE.
print('R-squared value of GradientBoostingRegressor:', gbr.score(X_test, y_test))
print('The mean squared error of GradientBoostingRegressor:', mean_squared_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(gbr_y_predict)))
print('The mean absolute error of GradientBoostingRegressor:', mean_absolute_error(ss_y.inverse_transform(y_test), ss_y.inverse_transform(gbr_y_predict)))


R-squared value of GradientBoostingRegressor: 0.828755989313912
The mean squared error of GradientBoostingRegressor: 13.27848227468911
The mean absolute error of GradientBoostingRegressor: 2.3076587315993566


### Ranking of classic regression models on the Boston housing price problem (by test-set R-squared, best first)
* GradientBoostingRegressor
* ExtraTreesRegressor
* RandomForestRegressor
* SVM Regressor (RBF Kernel)
* KNN Regressor (Distance-weighted)
* KNN Regressor (Uniform-weighted)
* DecisionTreeRegressor
* LinearRegression
* SGDRegressor
* SVM Regressor (Linear Kernel)
* SVM Regressor (Poly Kernel)
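
The ordering can be cross-checked by sorting the test-set R-squared values printed in the cells above. The scores below are copied from those outputs; the dictionary name and the model labels are only for illustration.

```python
# Test-set R-squared values copied from the outputs printed earlier in this notebook.
r2_scores = {
    "LinearRegression": 0.675795501452948,
    "SGDRegressor": 0.6585897732863233,
    "SVM Regressor (Linear Kernel)": 0.650659546421538,
    "SVM Regressor (Poly Kernel)": 0.40365065102550846,
    "SVM Regressor (RBF Kernel)": 0.7559887416340944,
    "KNN Regressor (Uniform-weighted)": 0.6907212176346006,
    "KNN Regressor (Distance-weighted)": 0.7201094821421603,
    "DecisionTreeRegressor": 0.6894984401640106,
    "RandomForestRegressor": 0.7904039225449876,
    "ExtraTreesRegressor": 0.8237325190594427,
    "GradientBoostingRegressor": 0.828755989313912,
}

# Sort model names from highest to lowest R-squared.
ranking = sorted(r2_scores, key=r2_scores.get, reverse=True)

for position, name in enumerate(ranking, start=1):
    print(position, name, round(r2_scores[name], 4))
```

Several of these models are randomized and were fit without a fixed `random_state`, so closely spaced entries (for example the uniform-weighted KNN and the single decision tree) could swap places on another run.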