# <center> [Kaggle] Telco Customer Churn: A Telecom Customer Churn Prediction Case Study

---

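## <font face="仿宋">Introduction to Part 4

&emsp;&emsp;<font face="仿宋">In Parts 2 and 3 of this case study we covered feature engineering in detail. Broadly, feature engineering splits into three areas: data preprocessing, feature derivation, and feature selection. Data preprocessing cleans and organizes the dataset until it can be modeled. It includes missing-value handling, outlier handling, and re-encoding, and it must be done before any modeling. Feature derivation and feature selection are closer to optimization techniques that help a model break through the performance ceiling of the current dataset. Part 2 also gave a complete account of how to train, tune, and interpret the interpretable models, namely logistic regression and decision trees, and these two algorithms have been our main tools for model testing in every part so far.

&emsp;&emsp;<font face="仿宋">Part 4 turns to practical techniques for training and tuning ensemble learning models. Ensemble models are less interpretable than logistic regression and decision trees, but in most modeling scenarios they give better predictions, which makes them the most commonly used algorithms wherever predictive performance comes first.

&emsp;&emsp;<font face="仿宋">Overall, this part has a single goal: to use a range of optimization methods to reach the performance ceiling of each mainstream ensemble algorithm. In other words, we focus on single-model optimization. The ensemble algorithms covered include random forest, XGBoost, LightGBM, and CatBoost, the most widely used ones today. The optimization strategies include hyperparameter optimizers, feature derivation and selection, and single-model self-fusion, which are among the most advanced, most effective, and also most complex ways to improve a single model. Some rest on fairly deep theory and others on practical experience, but either way we hope to push each ensemble algorithm to its limit on the current dataset. Note that throughout this process we treat the feature derivation and selection methods introduced earlier as model optimization methods, and we judge them solely by the final model results. Large-scale feature derivation and selection around ensemble models is also where these techniques deliver the most value.

&emsp;&emsp;<font face="仿宋">Once the single models have reached their limits, we move to the next stage: model fusion. Multi-model fusion, and even multi-layer fusion, is only meaningful and effective once each single model has been pushed to its limit.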

---

# <center>Part 4. Training and Optimization Techniques for Ensemble Algorithms

In [3]:
# Core data science libraries
import numpy as np
import pandas as pd

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Time module
import time

import warnings
warnings.filterwarnings('ignore')

# sklearn
# Data preprocessing
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
# Utility functions
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# Common estimators
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestRegressor,ExtraTreesRegressor,GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, BayesianRidge

# Grid search
from sklearn.model_selection import GridSearchCV

# Support for custom estimators
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin

# Custom module
from telcoFunc import *

# Feature derivation module
import features_creation as fc
from features_creation import *

# Model fusion module
import manual_ensemble as me
from manual_ensemble import *

# re-related modules
import inspect, re

# Other modules
from tqdm import tqdm
import gc

# Model loading and TPE search, imported explicitly in case the
# wildcard imports above do not already provide these names
from joblib import load
from hyperopt import hp, fmin, tpe

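## <center>Ch.3.12 Model Fusion for Regression Problems in Practice

&emsp;&emsp;With three regression models trained, we now fuse their results. As noted earlier, fusion for regression does not differ in essence from fusion for classification. Both fall into two broad families: basic combiners (averaging and weighted averaging) and learned combiners (Stacking and Blending). Regression and classification differ only in the details of how each method is applied.

&emsp;&emsp;We first look at how to use these fusion methods on a regression problem, summarize the points to watch for along the way, and adapt the fusion helper functions defined earlier so that they also work for regression.

&emsp;&emsp;First, following the previous section, we load the data and the saved models:

- Loading the data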

In [4]:
from sklearn.datasets import fetch_california_housing

# Load the data
cal_housing = fetch_california_housing()
X = pd.DataFrame(cal_housing.data, columns=cal_housing.feature_names)
y = cal_housing.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)

# Reset the index
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

In [5]:
X_train.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| 0 | 6.6134 | 4.0 | 6.560729 | 0.939271 | 1552.0 | 3.1417 | 37.93 | -121.97 |
| 1 | 2.3578 | 41.0 | 5.455598 | 1.007722 | 1070.0 | 4.131274 | 32.8 | -117.15 |
| 2 | 5.5111 | 16.0 | 5.716747 | 1.037905 | 3903.0 | 2.689869 | 34.25 | -118.61 |
| 3 | 8.1124 | 52.0 | 6.623188 | 1.019324 | 1153.0 | 2.785024 | 34.11 | -118.14 |
| 4 | 6.2957 | 25.0 | 6.627832 | 1.008091 | 2128.0 | 3.443366 | 37.25 | -121.81 |


In [6]:
X_train.shape

(16512, 8)

- Loading the models

In [10]:
RF_reg = load('./models/RF_reg.joblib')
ET_reg = load('./models/ET_reg.joblib')
GBR_reg = load('./models/GBR_reg.joblib')

In [6]:
mean_squared_error(RF_reg.predict(X_train), y_train), mean_squared_error(RF_reg.predict(X_test), y_test)

(0.03228467081263358, 0.23113640374016972)

In [7]:
mean_squared_error(ET_reg.predict(X_train), y_train), mean_squared_error(ET_reg.predict(X_test), y_test)

(1.3687030793344334e-07, 0.22259777187847052)

In [8]:
mean_squared_error(GBR_reg.predict(X_train), y_train), mean_squared_error(GBR_reg.predict(X_test), y_test)

(0.02536793820105726, 0.19804404020292288)

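To make later comparisons easier, the single-model MSE values are summarized below (lower is better):

| Model | Training MSE | Test MSE |
| ------ | ------ | ------ |
| <center>Random Forest_OPT | <center>0.0322 | <center>0.2311 |
| <center>Extra Trees_OPT | <center>1.37e-07 | <center>0.2225 |
| <center>GBDT_OPT | <center>0.0253 | <center>0.1980 |

## 12. Model Fusion for Regression Problems in Practice

### 1. Basic combiners

&emsp;&emsp;We start with the basic combiners, that is, averaging and voting. In classification, fusion can work on either class predictions or probability predictions. Regression has only one kind of output, the numeric prediction of the label, so fusion is limited to simple averaging and weighted averaging, and there is no hard voting. We first compute the three models' predictions on the training and test sets for the manual fusion below: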

In [9]:
train_prediction_RF = RF_reg.predict(X_train)
train_prediction_RF

array([3.0558111 , 1.48645396, 2.75037489, ..., 1.22637157, 2.01426659,
       2.01472055])

In [10]:
# Predictions on the training set
train_prediction_RF = RF_reg.predict(X_train)
train_prediction_ET = ET_reg.predict(X_train)
train_prediction_GBR = GBR_reg.predict(X_train)

# Predictions on the test set
test_prediction_RF = RF_reg.predict(X_test)
test_prediction_ET = ET_reg.predict(X_test)
test_prediction_GBR = GBR_reg.predict(X_test)

#### 1.1 Simple averaging

&emsp;&emsp;We begin with simple averaging, which is very easy to do: just take the mean of the three models' outputs:

In [297]:
Voting_soft_train = np.mean([train_prediction_RF, train_prediction_ET, train_prediction_GBR], axis=0)
Voting_soft_test = np.mean([test_prediction_RF, test_prediction_ET, test_prediction_GBR], axis=0)

In [298]:
Voting_soft_train 

array([3.06245493, 1.53372996, 2.72675295, ..., 1.25156342, 2.00781961,
       1.95747529])

In [299]:
mean_squared_error(Voting_soft_train, y_train), mean_squared_error(Voting_soft_test, y_test)

(0.010958893356275578, 0.2049222750790844)

| Model | Training MSE | Test MSE |
| ------ | ------ | ------ |
| <center>Random Forest_OPT | <center>0.0322 | <center>0.2311 |
| <center>Extra Trees_OPT | <center>1.37e-07 | <center>0.2225 |
| <center>GBDT_OPT | <center>0.0253 | <center>0.1980 |
| <center>Simple averaging | <center>0.0109 | <center>0.2049 |

As usual, simple averaging does not beat the best single model on the test set.

&emsp;&emsp;Simple averaging can also be done with VotingRegressor from sklearn:

In [11]:
from sklearn.ensemble import VotingRegressor

In [243]:
VotingRegressor?

Init signature: VotingRegressor(estimators, *, weights=None, n_jobs=None, verbose=False)
Docstring:
Prediction voting regressor for unfitted estimators.

A voting regressor is an ensemble meta-estimator that fits several base
regressors, each on the whole dataset. Then it averages the individual
predictions to form a final prediction.

Read more in the :ref:`User Guide <voting_regressor>`.

.. versionadded:: 0.21

Parameters
----------
estimators : list of (str, estimator) tuples
    Invoking the ``fit`` method on the ``VotingRegressor`` will fit clones
    of those original estimators that will be stored in the class attribute
    ``self.estimators_``. An estimator can be set to ``'drop'`` using
    ``set_params``.

    .. versionchanged:: 0.21
        

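As we can see, the averaging estimator for regression in sklearn still belongs to the Voting family, but VotingRegressor has far fewer parameters than VotingClassifier. Apart from the required estimators, the only parameter we need to care about is weights. It works the same way as in VotingClassifier: it sets the weight of each estimator and is used for weighted-average fusion.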
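To make the role of `weights` concrete, here is a minimal sketch on synthetic data. The data, the two base models, and the names `X_demo`, `demo_estimators`, and `demo_weights` are illustrative assumptions, not part of the housing experiment. It checks that the prediction of a weighted VotingRegressor equals a hand-computed weighted average of its fitted base models.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only, unrelated to the housing dataset)
rng = np.random.default_rng(12)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Two hypothetical base models and an arbitrary 1:3 weighting
demo_estimators = [('lin', LinearRegression()),
                   ('tree', DecisionTreeRegressor(max_depth=3, random_state=12))]
demo_weights = [1, 3]

vr_demo = VotingRegressor(demo_estimators, weights=demo_weights).fit(X_demo, y_demo)

# Recompute the fused prediction by hand from the fitted clones in estimators_
base_preds = np.column_stack([est.predict(X_demo) for est in vr_demo.estimators_])
manual_pred = np.average(base_preds, axis=1, weights=demo_weights)

weights_match = np.allclose(vr_demo.predict(X_demo), manual_pred)
```

The weights do not need to sum to 1, because the weighted average divides by their sum.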

&emsp;&emsp;Next we build the estimators list and use VotingRegressor for simple averaging:

In [12]:
estimators = [('RF', RF_reg), ('ET', ET_reg), ('GBR', GBR_reg)]

In [110]:
VR = VotingRegressor(estimators).fit(X_train, y_train)

In [111]:
mean_squared_error(VR.predict(X_train), y_train), mean_squared_error(VR.predict(X_test), y_test)

(0.010958893356275578, 0.2049222750790844)

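The result matches the manual implementation exactly.

#### 1.2 Weighted averaging with empirical weights

&emsp;&emsp;Simple averaging rarely beats the best single model, but weighted averaging is a different story. With flexible weights for the different estimators, weighted averaging often gives a better result, and in some cases it even outperforms fusion with a learned combiner.

- Recap of weight design strategies

&emsp;&emsp;First, a recap of the weight design strategies for weighted averaging introduced earlier. Weight design falls into two approaches: the empirical approach and hyperparameter search. The empirical approach finds a reasonably good set of weights through trial and error. Hyperparameter search treats the weights as hyperparameters and uses flexible, efficient TPE search to find the best set. In practice the empirical approach has two variants, multiplicative-gradient weights and exponential-gradient weights, and hyperparameter search can use the weights found empirically to trim its search space, which helps it find the best weights faster and more precisely.

&emsp;&emsp;All of these methods apply directly to regression. Because regression outputs are continuous and numerically much richer, a small change in the weights may already give a better result, so fusion methods have more room to explore and often fuse better. We try the empirical approach first and hyperparameter search second.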
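The two empirical schemes can be sketched as a small helper that turns model scores into rank-based weights. The function name `rank_weights` and its parameters are hypothetical and introduced only for illustration. It assumes the scores are errors such as MSE, where lower is better.

```python
import numpy as np

def rank_weights(scores, scheme="multiple", base=10):
    """Turn error scores (lower is better) into rank-based raw weights.

    scheme="multiple"    -> 1, 2, 3, ...       from worst to best
    scheme="exponential" -> 1, base, base**2, ... from worst to best
    """
    scores = np.asarray(scores, dtype=float)
    # rank 0 = worst model (largest error), rank n-1 = best model
    ranks = np.argsort(np.argsort(-scores))
    if scheme == "multiple":
        return (ranks + 1).astype(float)
    if scheme == "exponential":
        return float(base) ** ranks
    raise ValueError("scheme must be 'multiple' or 'exponential'")

# Test MSE of RF, ET, GBDT as reported in the summary table
test_mse = [0.2311, 0.2225, 0.1980]
w_multiple = rank_weights(test_mse, "multiple")
w_exponential = rank_weights(test_mse, "exponential")

# Share of the total weight that goes to the best model under each scheme
share_multiple = w_multiple.max() / w_multiple.sum()
share_exponential = w_exponential.max() / w_exponential.sum()
```

Under the multiplicative scheme the best model holds half of the total weight, while under the exponential scheme it holds about 90%, which is why the two schemes behave so differently when one model is clearly stronger.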

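- Setting weights empirically

&emsp;&emsp;The first thing to try empirically is multiplicative weights: rank the models from worst to best and use each model's rank as its weight. Here GBDT is better than extra trees, which is better than random forest, so GBDT, extra trees, and random forest get weights 3, 2, and 1 respectively:

| Model | Training MSE | Test MSE |
| ------ | ------ | ------ |
| <center>Random Forest_OPT | <center>0.0322 | <center>0.2311 |
| <center>Extra Trees_OPT | <center>1.37e-07 | <center>0.2225 |
| <center>GBDT_OPT | <center>0.0253 | <center>0.1980 |
| <center>Simple averaging | <center>0.0109 | <center>0.2049 |

The implementation is as follows: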

In [304]:
Voting_weight1_train = (train_prediction_RF * 1 + 
                        train_prediction_ET * 2 + 
                        train_prediction_GBR * 3) / (1 + 2 + 3)
Voting_weight1_test = (test_prediction_RF * 1 + 
                       test_prediction_ET * 2 + 
                       test_prediction_GBR * 3) / (1 + 2 + 3)

In [305]:
mean_squared_error(Voting_weight1_train, y_train), mean_squared_error(Voting_weight1_test, y_test)

(0.01065539056373778, 0.19896041813115542)

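The fused result beats simple averaging but still falls short of the best single model on the test set (0.1989 vs. 0.1980). Relative to that best single model (GBDT), the training and test scores also move in opposite directions: training MSE drops while test MSE rises slightly:

| Model | Training MSE | Test MSE |
| ------ | ------ | ------ |
| <center>Random Forest_OPT | <center>0.0322 | <center>0.2311 |
| <center>Extra Trees_OPT | <center>1.37e-07 | <center>0.2225 |
| <center>GBDT_OPT | <center>0.0253 | <center>0.1980 |
| <center>Simple averaging | <center>0.0109 | <center>0.2049 |
| <center>Multiplicative-gradient weights | <center>0.0106 | <center>0.1989 |

&emsp;&emsp;Since the multiplicative weights do not work well, we next try exponential-gradient weights, that is, weights of 1, 10, and 100 for the three models (random forest, extra trees, GBDT). The implementation is as follows: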

In [306]:
Voting_weight2_train = (train_prediction_RF * 1 + 
                        train_prediction_ET * 10 + 
                        train_prediction_GBR * 100) / (1 + 10 + 100)
Voting_weight2_test = (test_prediction_RF * 1 + 
                       test_prediction_ET * 10 + 
                       test_prediction_GBR * 100) / (1 + 10 + 100)

In [308]:
mean_squared_error(Voting_weight2_train, y_train), mean_squared_error(Voting_weight2_test, y_test)

(0.02092639703571113, 0.1963037475747155)

The exponential-gradient weights give the best result so far and surpass the best single-model score:

| Model | Training MSE | Test MSE |
| ------ | ------ | ------ |
| <center>Random Forest_OPT | <center>0.0322 | <center>0.2311 |
| <center>Extra Trees_OPT | <center>1.37e-07 | <center>0.2225 |
| <center>GBDT_OPT | <center>0.0253 | <font color="red"><center>**0.1980** |
| <center>Simple averaging | <center>0.0109 | <center>0.2049 |
| <center>Multiplicative-gradient weights | <center>0.0106 | <center>0.1989 |
| <center>Exponential-gradient weights | <center>0.0209 | <font color="red"><center>**0.1963** |

Once an empirical scheme proves effective, we can fine-tune around the 1:10:100 base weights. For example, setting the weights to 5:10:100 gives:

In [315]:
Voting_weight3_train = (train_prediction_RF * 5 + 
                        train_prediction_ET * 10 + 
                        train_prediction_GBR * 100) / (5 + 10 + 100)
Voting_weight3_test = (test_prediction_RF * 5 + 
                       test_prediction_ET * 10 + 
                       test_prediction_GBR * 100) / (5 + 10 + 100)

In [316]:
mean_squared_error(Voting_weight3_train, y_train), mean_squared_error(Voting_weight3_test, y_test)

(0.020792380614160604, 0.19613799284676225)

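The result improves slightly again (test MSE 0.19630 to 0.19614). We could keep trying by hand, but a more sensible and efficient approach is to design a hyperparameter space around the 1:10:100 base weights and let TPE search find a better set of weights. We try this shortly.

&emsp;&emsp;Besides the manual implementation, VotingRegressor also provides a quick way to do weighted-average fusion. First, the multiplicative-gradient weights: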

In [266]:
estimators

[('RF',
  RandomForestRegressor(max_features=2, max_samples=0.9990956320729892,
                        n_estimators=619, random_state=12)),
 ('ET',
  ExtraTreesRegressor(max_depth=39, max_features=0.6, n_estimators=516,
                      random_state=12)),
 ('GBR',
  GradientBoostingRegressor(learning_rate=0.07948277691156985, max_depth=7,
                            n_estimators=499, random_state=12,
                            subsample=0.9053472864274166))]

In [252]:
VR_weight1 = VotingRegressor(estimators, weights=[1, 2, 3]).fit(X_train, y_train)

In [253]:
mean_squared_error(VR_weight1.predict(X_train), y_train), mean_squared_error(VR_weight1.predict(X_test), y_test)

(0.01065539056373778, 0.19896041813115542)

In [265]:
VR_weight1.predict(X_train)

array([3.06341202, 1.52727695, 2.71233779, ..., 1.24522128, 2.01855007,
       1.95797275])

Next, the exponential-gradient weights:

In [254]:
VR_weight2 = VotingRegressor(estimators, weights=[1, 10, 100]).fit(X_train, y_train)
mean_squared_error(VR_weight2.predict(X_train), y_train), mean_squared_error(VR_weight2.predict(X_test), y_test)

(0.02092639703571113, 0.1963037475747155)

In [255]:
VR_weight3 = VotingRegressor(estimators, weights=[5, 10, 100]).fit(X_train, y_train)
mean_squared_error(VR_weight3.predict(X_train), y_train), mean_squared_error(VR_weight3.predict(X_test), y_test)

(0.020792380614160604, 0.19613799284676225)

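The results match the manual implementation exactly.

> Note that although exponential-gradient weights beat multiplicative-gradient weights in both the regression and the classification experiments in this course, both are worth trying in real modeling work, and neither is guaranteed to beat the other. In fact, it is precisely because the models differ considerably in quality in both experiments that the exponential strategy wins.

#### 1.3 Weighted averaging with TPE search

- TPE-based weight search strategies

&emsp;&emsp;Next we try the TPE-based weight search.

&emsp;&emsp;In the earlier classification experiments, weighted-average fusion with TPE weight search was prone to overfitting: the training score improved while the test score got worse. We proposed three remedies. The first replaces the raw fusion score with a cross-validated one, so that weights are selected by the more trustworthy mean validation score rather than the training score. The second uses the weights found empirically to trim the search space, which both speeds up the search and helps it escape spurious optima. The third applies cross-training at the model-training stage, replaces the original training set with the out-of-fold predictions stitched together from the validation folds (train_oof), and selects weights by the fusion result on train_oof. Because every prediction in train_oof comes from a model that did not see that sample during training, the fusion score on it is more trustworthy, which improves the generalization of the selected weights. We now try each scheme in turn:

- Scheme 1: TPE search with cross-validation

&emsp;&emsp;We start with TPE search based on cross-validation. Note that the objective function must return the mean validation score. The search is set up as follows: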

In [139]:
# Define the hyperparameter space
params_space = {'weight1': hp.uniform("weight1",0,1),
                'weight2': hp.uniform("weight2",0,1),
                'weight3': hp.uniform("weight3",0,1)}

In [140]:
# Define the objective function
def hyperopt_objective_weight(params, train=True):
    # Collect the three candidate weights in the same order as `estimators`
    weights = [params['weight1'], params['weight2'], params['weight3']]

    if train:
        # Search mode: return the mean validation MSE over 5-fold cross-validation
        VR_weight_search = VotingRegressor(estimators, weights=weights)
        val_score = cross_val_score(VR_weight_search, 
                                    X_train, 
                                    y_train, 
                                    scoring='neg_mean_squared_error', 
                                    n_jobs=15,
                                    cv=5).mean()
        # The scorer returns negative MSE, so flip the sign to get a loss to minimize
        res = -val_score
    else:
        # Evaluation mode: return the fused model refitted on the full training set
        res = VotingRegressor(estimators, 
                              weights=weights).fit(X_train, y_train)
    return res

In [143]:
# Define the optimization function
def param_hyperopt_weight(max_evals):
    params_best = fmin(fn = hyperopt_objective_weight,
                       space = params_space,
                       algo = tpe.suggest,
                       max_evals = max_evals)    
    return params_best

We first try 50 iterations. Because the models must be retrained on every trial, the search takes a long time. The result is:

In [148]:
params_best = param_hyperopt_weight(50)

100%|████████████████████████████████████████████████| 50/50 [30:02<00:00, 36.06s/trial, best loss: 0.2064834684512073]


In [149]:
params_best

{'weight1': 0.11106741979888705,
 'weight2': 0.20768131703488266,
 'weight3': 0.9143087708368308}

In [150]:
VR = hyperopt_objective_weight(params_best, train=False)
mean_squared_error(VR.predict(X_train), y_train), mean_squared_error(VR.predict(X_test), y_test)

(0.016947070903517353, 0.19576421177473802)

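Compared with the exponential-gradient weights, both training and test MSE drop to some degree (training 0.0209 to 0.0169, test 0.1963 to 0.1957), which indicates that we have found a better set of weights.

| Model | Training MSE | Test MSE |
| ------ | ------ | ------ |
| <center>Random Forest_OPT | <center>0.0322 | <center>0.2311 |
| <center>Extra Trees_OPT | <center>1.37e-07 | <center>0.2225 |
| <center>GBDT_OPT | <center>0.0253 | <font color="red"><center>**0.1980** |
| <center>Simple averaging | <center>0.0109 | <center>0.2049 |
| <center>Multiplicative-gradient weights | <center>0.0106 | <center>0.1989 |
| <center>Exponential-gradient weights | <center>0.0209 | <font color="red"><center>**0.1963** |
| <center>TPE search with cross-validation (50 iterations) | <center>0.0169 | <font color="red"><center>**0.1957** |

&emsp;&emsp;The key observation is that, compared with the earlier classification experiments, the TPE-based weight search overfits much less here. Two things point to this. First, relative to the exponential-gradient weights, the training and test scores move in the same direction. Second, the weights found by the search are fairly close to the exponential-gradient weights found empirically, in that both give GBDT by far the largest weight. Different methods are approaching a similar optimum, which indirectly suggests that overfitting from TPE weight search is less severe in this setting than it was in the classification case.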

In [149]:
params_best

{'weight1': 0.11106741979888705,
 'weight2': 0.20768131703488266,
 'weight3': 0.9143087708368308}

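&emsp;&emsp;The reduced overfitting can also be seen by comparing three sets of results: the mean cross-validation score, the training score, and the test score. Cross-validation gives an evaluation that generalizes better. We can verify this by hand as follows: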

In [128]:
VR_weight1 = VotingRegressor(estimators, weights=[1, 2, 3]).fit(X_train, y_train)

In [129]:
mean_squared_error(VR_weight1.predict(X_train), y_train), mean_squared_error(VR_weight1.predict(X_test), y_test)

(0.01065539056373778, 0.19896041813115542)

In [130]:
VR_weight2 = VotingRegressor(estimators, weights=[1, 10, 100]).fit(X_train, y_train)
mean_squared_error(VR_weight2.predict(X_train), y_train), mean_squared_error(VR_weight2.predict(X_test), y_test)

(0.02092639703571113, 0.1963037475747155)

In [131]:
VR_weight3 = VotingRegressor(estimators, weights=[5, 10, 100]).fit(X_train, y_train)
mean_squared_error(VR_weight3.predict(X_train), y_train), mean_squared_error(VR_weight3.predict(X_test), y_test)

(0.020792380614160604, 0.19613799284676225)

After adding cross-validation:

In [132]:
val_score_weight1 = cross_val_score(VR_weight1, X_train, y_train, scoring='neg_mean_squared_error', n_jobs=15, cv=5).mean()

In [135]:
val_score_weight1

-0.20975354476853605

In [133]:
val_score_weight2 = cross_val_score(VR_weight2, X_train, y_train, scoring='neg_mean_squared_error', n_jobs=15, cv=5).mean()

In [136]:
val_score_weight2

-0.20719220099122068

In [134]:
val_score_weight3 = cross_val_score(VR_weight3, X_train, y_train, scoring='neg_mean_squared_error', n_jobs=15, cv=5).mean()

In [137]:
val_score_weight3

-0.20697517984665487

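| Weights | Training MSE | Mean validation MSE | Test MSE |
| ------ | ------ | ------ | ------ |
| <center>1:2:3 | <center>0.0106 | <center>0.2097 | <center>0.1989 |
| <center>1:10:100 | <center>0.0209 | <center>0.2071 | <center>0.1963 |
| <center>5:10:100 | <center>0.0207 | <center>0.2069 | <center>0.1961 |

In this experiment the mean validation score is not only much closer to the test score than the training score is, but also moves in step with it. This further confirms that the validation score is a useful signal, and it makes us more confident that TPE weight search works for this dataset.

&emsp;&emsp;Why does the overfitting seen in classification ease off in regression? Whether a given fusion method overfits depends on the data. We can think of TPE weight search as an "algorithm" in its own right: any algorithm behaves differently on different datasets, sometimes overfitting and sometimes generalizing well, and both are normal. Only an actual experiment can settle the question.

> Suppressing overfitting and improving generalization are really the same thing.

&emsp;&emsp;Having established that weight-search fusion generalizes acceptably here, we can increase the number of iterations. With 100 iterations the results are: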

In [16]:
params_best = param_hyperopt_weight(100)

100%|█████████████████████████████████████████████| 100/100 [50:59<00:00, 30.60s/trial, best loss: 0.20605715906734842]


In [17]:
params_best

{'weight1': 0.0024817595519138385,
 'weight2': 0.289661638124085,
 'weight3': 0.8367511006808657}

In [18]:
VR = hyperopt_objective_weight(params_best, train=False)
mean_squared_error(VR.predict(X_train), y_train), mean_squared_error(VR.predict(X_test), y_test)

(0.014009603224766298, 0.19532482892782888)

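The result improves further:

| Model | Training MSE | Test MSE |
| ------ | ------ | ------ |
| <center>Random Forest_OPT | <center>0.0322 | <center>0.2311 |
| <center>Extra Trees_OPT | <center>1.37e-07 | <center>0.2225 |
| <center>GBDT_OPT | <center>0.0253 | <font color="red"><center>**0.1980** |
| <center>Simple averaging | <center>0.0109 | <center>0.2049 |
| <center>Multiplicative-gradient weights | <center>0.0106 | <center>0.1989 |
| <center>Exponential-gradient weights | <center>0.0209 | <font color="red"><center>**0.1963** |
| <center>TPE search with cross-validation (50 iterations) | <center>0.0169 | <center>0.1957 |
| <center>TPE search with cross-validation (100 iterations) | <center>0.0140 | <font color="red"><center>**0.1953** |

- Scheme 2: TPE weight search with a trimmed search space

&emsp;&emsp;In the classification fusion experiments we discussed trimming the search space to suppress overfitting. Beyond that, trimming can also make the search more efficient and improve generalization. Although TPE search fusion shows little overfitting in this case, trimming can still improve the fusion result. As before, we trim around the empirically found 1:10:100 weights, using the following ranges: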

In [157]:
# Define the hyperparameter space
params_space = {'weight1': hp.uniform("weight1",0,0.05),
                'weight2': hp.uniform("weight2",0.05,0.1),
                'weight3': hp.uniform("weight3",0.5,1)}
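Trimming around 1:10:100 works on a 0 to 1 scale because a weighted average depends only on the ratio of the weights, not on their absolute size. The sketch below illustrates this with made-up predictions. The helper `weighted_average` and the array `preds` are hypothetical names used only for this illustration.

```python
import numpy as np

def weighted_average(preds, weights):
    """Row-wise weighted average; preds has one column per model."""
    weights = np.asarray(weights, dtype=float)
    return (np.asarray(preds) * weights).sum(axis=1) / weights.sum()

# Made-up predictions from three models for four samples (illustrative values)
preds = np.array([[2.9, 2.8, 3.0],
                  [1.1, 1.4, 1.3],
                  [2.7, 2.8, 2.5],
                  [4.5, 4.5, 4.8]])

fused_a = weighted_average(preds, [1, 10, 100])
fused_b = weighted_average(preds, [0.01, 0.1, 1.0])   # same ratio, different scale
same_result = np.allclose(fused_a, fused_b)
```

Rescaled this way, 1:10:100 becomes 0.01:0.1:1, which lies inside or on the edge of the three trimmed ranges defined above.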

Then we run the TPE search:

In [158]:
params_best = param_hyperopt_weight(50)

100%|███████████████████████████████████████████████| 50/50 [30:11<00:00, 36.23s/trial, best loss: 0.20640153459947724]


In [159]:
params_best

{'weight1': 0.024057728548545276,
 'weight2': 0.09590853047157558,
 'weight3': 0.5172496557013706}

In [160]:
VR = hyperopt_objective_weight(params_best, train=False)
mean_squared_error(VR.predict(X_train), y_train), mean_squared_error(VR.predict(X_test), y_test)

(0.0180191974133278, 0.19562547355816728)

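Compared with the 50-iteration result before trimming, the final result improves slightly (test MSE 0.19576 to 0.19563).

| Model | Training MSE | Test MSE |
| ------ | ------ | ------ |
| <center>Random Forest_OPT | <center>0.0322 | <center>0.2311 |
| <center>Extra Trees_OPT | <center>1.37e-07 | <center>0.2225 |
| <center>GBDT_OPT | <center>0.0253 | <font color="red"><center>**0.1980** |
| <center>Simple averaging | <center>0.0109 | <center>0.2049 |
| <center>Multiplicative-gradient weights | <center>0.0106 | <center>0.1989 |
| <center>Exponential-gradient weights | <center>0.0209 | <font color="red"><center>**0.1963** |
| <center>TPE search with cross-validation (50 iterations) | <center>0.0169 | <center>0.1957 |
| <center>TPE search with cross-validation (100 iterations) | <center>0.0140 | <font color="red"><center>**0.1953** |
| <center>TPE search with trimmed space (50 iterations) | <center>0.0180 | <center>0.1956 |

As an exercise, try 100 iterations with the trimmed space; the result should come close to that of the cross-validated search. These experiments again support the effectiveness of weight-search fusion and of search-space trimming. At this point 0.1953 is the best test score we have obtained.

- Scheme 3: TPE weight search with cross-training

&emsp;&emsp;Next we try TPE search based on cross-training. Like search-space trimming, cross-training was originally proposed to counter overfitting, but it can still improve fusion in settings without overfitting. Because cross-training does not retrain the models on every trial and only searches repeatedly over train_oof, each trial is much cheaper and a very large number of iterations becomes feasible. When TPE weight search does not overfit badly in the first place, massive iteration can approach the theoretical performance ceiling, and this example illustrates that well.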
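To show why searching over out-of-fold predictions is so cheap, here is a dependency-light sketch. It swaps TPE for plain random search, purely to avoid needing hyperopt, and it uses synthetic stand-ins for the out-of-fold columns and labels. The names `oof`, `y_true`, and `fused_mse` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(12)

# Synthetic stand-ins for three out-of-fold prediction columns and the labels
y_true = rng.normal(size=500)
oof = np.column_stack([y_true + rng.normal(scale=s, size=500)
                       for s in (0.6, 0.5, 0.3)])

def fused_mse(weights, preds, y):
    """MSE of the weighted-average fusion of the columns of preds."""
    weights = np.asarray(weights, dtype=float)
    fused = (preds * weights).sum(axis=1) / weights.sum()
    return float(np.mean((fused - y) ** 2))

# Start from equal weights, then try random candidates.
# Each trial is pure arithmetic: no model is refitted.
best_w = np.ones(3)
best_loss = fused_mse(best_w, oof, y_true)
equal_loss = best_loss
for _ in range(2000):
    cand = rng.uniform(0.0, 1.0, size=3)
    if cand.sum() == 0:
        continue  # guard against a degenerate all-zero draw
    loss = fused_mse(cand, oof, y_true)
    if loss < best_loss:
        best_w, best_loss = cand, loss
```

A smarter optimizer such as TPE only changes how candidates are proposed; the cost per trial stays a few array operations either way.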

&emsp;&emsp;Cross-training again uses the train_cross function defined earlier, extended to build the train_oof dataset for regression problems:

In [162]:
train_cross?

Signature:
train_cross(
    X_train,
    y_train,
    X_test,
    estimators,
    test_size=0.2,
    n_splits=5,
    random_state=12,
    blending=False,
)
Docstring:
Cross-training function for the first-level learners in Stacking fusion

:param X_train: training features
:param y_train: training labels
:param X_test: test features
:param estimators: first-level learners, a list of (name, estimator) tuples
:param n_splits: number of cross-training folds
:param test_size: fraction of data held out for blending
:param random_state: random seed
:param blending: whether to perform blending fusion

:return: the oof training data and the averaged test-set predictions created by cross-training, containing both features and label, with the label in the last column
File:      d:\work\jupyter\telco\正式课程\manual_ensemble.py
Type:      function


In [16]:
def train_cross(X_train, 
                y_train, 
                X_test, 
                estimators, 
                test_size = 0.2, 
                n_splits = 5, 
                random_state = 12, 
                blending = False, 
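                regress = False):
    """
    Cross-training function for the first-level learners in Stacking fusion
    
    :param X_train: training features
    :param y_train: training labels
    :param X_test: test features
    :param estimators: first-level learners, a list of (name, estimator) tuples
    :param n_splits: number of cross-training folds
    :param test_size: fraction of data held out for blending; unused when blending is False
    :param random_state: random seed
    :param blending: whether to perform blending fusion
    :param regress: whether this is a regression problem
    
    :return: the oof training data and the averaged test-set predictions created by cross-training, containing both features and label, with the label in the last column
    """    
    # Convert an array label to a Series so that the index operations below work
    if isinstance(y_train, np.ndarray):
        y_train = pd.Series(y_train)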
    
    if blending == True:
        X, X1, y, y1 = train_test_split(X_train, y_train, test_size=test_size, random_state=random_state)
        m = X1.shape[0]
        X = X.reset_index(drop=True)
        y = y.reset_index(drop=True)
        X1 = X1.reset_index(drop=True)
        y1 = y1.reset_index(drop=True)
    else:
        m = X_train.shape[0]
        X = X_train.reset_index(drop=True)
        y = y_train.reset_index(drop=True)
    
    n = len(estimators)
    m_test = X_test.shape[0]
    
    columns = []
    for estimator in estimators:
        columns.append(estimator[0] + '_oof')
    
    train_oof = pd.DataFrame(np.zeros((m, n)), columns=columns)
    
    columns = []
    for estimator in estimators:
        columns.append(estimator[0] + '_predict')
    
    test_predict = pd.DataFrame(np.zeros((m_test, n)), columns=columns)
    
    # Instantiate the K-fold splitter
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    
    # Run cross-training
    for estimator in estimators:
        model = estimator[1]
        oof_colName = estimator[0] + '_oof'
        predict_colName = estimator[0] + '_predict'
        
        for train_part_index, eval_index in kf.split(X, y):
            # Fit on the training folds
            X_train_part = X.loc[train_part_index]
            y_train_part = y.loc[train_part_index]
            model.fit(X_train_part, y_train_part)
            
            # Regression problem
            if regress == True:
                if blending == True:
                    # Predict on the hold-out set and average over folds
                    train_oof[oof_colName] += model.predict(X1) / n_splits
                    # Predict on the test set and average over folds
                    test_predict[predict_colName] += model.predict(X_test) / n_splits
                else:
                    # Predict on the validation fold
                    X_eval_part = X.loc[eval_index]
                    # Write the validation-fold predictions into the oof dataset
                    # (a single .loc call avoids chained assignment)
                    train_oof.loc[eval_index, oof_colName] = model.predict(X_eval_part)
                    # Accumulate the averaged test-set predictions
                    test_predict[predict_colName] += model.predict(X_test) / n_splits
            
            # Classification problem
            else:
                if blending == True:
                    # Predict on the hold-out set and average over folds
                    train_oof[oof_colName] += model.predict_proba(X1)[:, 1] / n_splits
                    # Predict on the test set and average over folds
                    test_predict[predict_colName] += model.predict_proba(X_test)[:, 1] / n_splits
                else:
                    # Predict on the validation fold
                    X_eval_part = X.loc[eval_index]
                    # Write the validation-fold class-1 probabilities into the oof dataset
                    # (a single .loc call avoids chained assignment)
                    train_oof.loc[eval_index, oof_colName] = model.predict_proba(X_eval_part)[:, 1]
                    # Accumulate the averaged test-set predictions
                    test_predict[predict_colName] += model.predict_proba(X_test)[:, 1] / n_splits
    
    # Append the label column
    if blending == True:
        train_oof[y1.name] = y1
    else:
        train_oof[y.name] = y
        
    return train_oof, test_predict

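The rewrite only adds a check for regression problems and the corresponding handling code. The train_oof and test_predict datasets are built almost exactly as in the classification case. The only difference is that regression uses .predict for its output, while classification uses .predict_proba and slices out the predicted probability of class 1.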
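The out-of-fold idea in the non-blending branch can be shown in miniature. The sketch below is not the course's `train_cross`. It is a simplified stand-in that uses NumPy only, with ordinary least squares as an assumed placeholder model and a hypothetical helper name `oof_predictions`. Each row's prediction comes from a model fitted without that row.

```python
import numpy as np

def oof_predictions(X, y, n_splits=5, seed=12):
    """Out-of-fold predictions from a least-squares model (illustrative stand-in)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    oof = np.zeros(len(X))
    for k in range(n_splits):
        eval_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        # Fit on the other folds (features plus an intercept column)
        A_train = np.column_stack([X[train_idx], np.ones(len(train_idx))])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Predict only the held-out fold and store it at the matching rows
        A_eval = np.column_stack([X[eval_idx], np.ones(len(eval_idx))])
        oof[eval_idx] = A_eval @ coef
    return oof

# Noise-free synthetic data, so the out-of-fold predictions should recover y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + 0.5
oof_demo = oof_predictions(X_demo, y_demo)
```

With real, noisy data the out-of-fold error is typically larger than the in-sample error, which is what makes it a more honest basis for choosing fusion weights.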

&emsp;&emsp;Next we test the function:

In [17]:
train_oof, test_predict = train_cross(X_train, y_train, X_test, estimators, regress=True)

In [18]:
train_oof

| | RF_oof | ET_oof | GBR_oof | None |
| ------ | ------ | ------ | ------ | ------ |
| 0 | 2.910792 | 2.806845 | 3.010536 | 3.07000 |
| 1 | 1.122422 | 1.356957 | 1.299047 | 1.66700 |
| 2 | 2.748327 | 2.771082 | 2.520479 | 2.76600 |
| 3 | 4.495247 | 4.514102 | 4.789784 | 5.00001 |
| 4 | 2.743010 | 2.723145 | 2.453989 | 2.51800 |
| ... | ... | ... | ... | ... |
| 16507 | 2.241706 | 2.312674 | 2.191293 | 1.91300 |
| 16508 | 1.402586 | 1.442147 | 1.425944 | 1.61800 |
| 16509 | 1.097334 | 1.105217 | 1.088715 | 1.34000 |
| 16510 | 2.166915 | 2.142442 | 2.269993 | 1.92900 |


In [21]:
test_predict

| | RF_predict | ET_predict | GBR_predict |
| ------ | ------ | ------ | ------ |
| 0 | 1.995458 | 2.160231 | 2.073867 |
| 1 | 2.142218 | 1.961178 | 2.063859 |
| 2 | 2.096183 | 2.206849 | 2.074315 |
| 3 | 1.544865 | 1.594343 | 1.440555 |
| 4 | 1.164448 | 1.159081 | 1.009008 |
| ... | ... | ... | ... |
| 4123 | 2.744251 | 2.657388 | 2.597423 |
| 4124 | 3.054298 | 2.946849 | 2.727207 |
| 4125 | 0.837954 | 0.890181 | 0.851294 |
| 4126 | 3.904305 | 3.897062 | 4.076651 |


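Note that the last column of train_oof, named None, is the label. Normally the label name is taken from y.name, but here the y passed to the function is an array. It is converted to a Series inside the function, but that Series has no name, so the column name is None: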

In [7]:
y_train

array([3.07 , 1.667, 2.766, ..., 1.34 , 1.929, 1.84 ])

In [18]:
pd.Series(y).name

We can set the name of the last column by hand:

In [19]:
cal_housing.target_names

['MedHouseVal']

In [22]:
cal_housing.target_names[0]

'MedHouseVal'

In [23]:
train_oof.columns = ['RF_oof', 'ET_oof', 'GBR_oof', 'MedHouseVal']

In [24]:
train_oof.head()

| | RF_oof | ET_oof | GBR_oof | MedHouseVal |
| ------ | ------ | ------ | ------ | ------ |
| 0 | 2.910792 | 2.806845 | 3.010536 | 3.07 |
| 1 | 1.122422 | 1.356957 | 1.299047 | 1.667 |
| 2 | 2.748327 | 2.771082 | 2.520479 | 2.766 |
| 3 | 4.495247 | 4.514102 | 4.789784 | 5.00001 |
| 4 | 2.74301 | 2.723145 | 2.453989 | 2.518 |


Alternatively, convert y to a named Series before passing it in.
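A minimal sketch of that alternative: a Series built from a plain array has no name, while one built with `name=` carries it into any column created from `.name`. The three label values are the first training labels shown above, and the `GBR_oof` values are rounded placeholders for illustration.

```python
import numpy as np
import pandas as pd

y_array = np.array([3.07, 1.667, 2.766])   # first three training labels shown above

unnamed = pd.Series(y_array)                       # .name is None
named = pd.Series(y_array, name='MedHouseVal')     # .name is 'MedHouseVal'

# A column created from .name inherits the label name
demo = pd.DataFrame({'GBR_oof': [3.01, 1.30, 2.52]})   # placeholder predictions
demo[named.name] = named
```

Passing `pd.Series(y_train, name='MedHouseVal')` to train_cross would therefore label the last column without any renaming afterwards.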

&emsp;&emsp;With train_oof and test_predict in hand, we can fuse them by weighted averaging. We first try simple averaging:

In [224]:
train_VR_mean = train_oof.iloc[:, :3].mean(1)
train_VR_mean

0        2.909391
1        1.259475
2        2.679963
3        4.599711
4        2.640048
           ...   
16507    2.248558
16508    1.423559
16509    1.097089
16510    2.193117
16511    2.129306
Length: 16512, dtype: float64

In [226]:
test_VR_mean = test_predict.iloc[:, :3].mean(1)
test_VR_mean

0       2.076519
1       2.055752
2       2.125782
3       1.526588
4       1.110846
          ...   
4123    2.666354
4124    2.909452
4125    0.859810
4126    3.959339
4127    2.017245
Length: 4128, dtype: float64

In [228]:
mean_squared_error(train_VR_mean, y_train), mean_squared_error(test_VR_mean, y_test)

(0.21608936247690866, 0.20849425035096503)

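Cross-training by itself does not improve simple averaging: the test MSE here is 0.2085, against 0.2049 for simple averaging of the original models.

&emsp;&emsp;We now move on to TPE weight search fusion: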

In [139]:
# Define the hyperparameter space
params_space = {'weight1': hp.uniform("weight1",0,1),
                'weight2': hp.uniform("weight2",0,1),
                'weight3': hp.uniform("weight3",0,1)}

In [247]:
# Define the objective function
def hyperopt_objective_weight(params, train=True):
    weight1 = params['weight1']
    weight2 = params['weight2']
    weight3 = params['weight3']
    
    weights = np.array([weight1, weight2, weight3])
    
    if train == True:
        res_train = (train_oof.iloc[:, :3] * weights).sum(1) / weights.sum()
        MSE_res = mean_squared_error(res_train, y_train)
        res = MSE_res
    else:
        res = weights
    return res

Note that the weighting step in the objective function relies on array broadcasting. The broadcast multiplication works as follows:

In [229]:
train_oof.iloc[:, 0]

0        2.910792
1        1.122422
2        2.748327
3        4.495247
4        2.743010
           ...   
16507    2.241706
16508    1.402586
16509    1.097334
16510    2.166915
16511    2.227121
Name: RF_oof, Length: 16512, dtype: float64

In [232]:
train_oof.iloc[:, :3] * np.array([1, 10, 100])

| | RF_oof | ET_oof | GBR_oof |
| ------ | ------ | ------ | ------ |
| 0 | 2.910792 | 28.068451 | 301.053632 |
| 1 | 1.122422 | 13.569574 | 129.904728 |
| 2 | 2.748327 | 27.710817 | 252.047942 |
| 3 | 4.495247 | 45.141018 | 478.978426 |
| 4 | 2.743010 | 27.231453 | 245.398855 |
| ... | ... | ... | ... |
| 16507 | 2.241706 | 23.126736 | 219.129305 |
| 16508 | 1.402586 | 14.421474 | 142.594369 |
| 16509 | 1.097334 | 11.052171 | 108.871537 |
| 16510 | 2.166915 | 21.424420 | 226.999336 |


In [233]:
(train_oof.iloc[:, :3] * np.array([1, 10, 100])).sum(1)

0        332.032875
1        144.596723
2        282.507086
3        528.614692
4        275.373318
            ...    
16507    244.497747
16508    158.418430
16509    121.021042
16510    250.590670
16511    220.568547
Length: 16512, dtype: float64

In [235]:
np.array([1, 10, 100]).sum()

111

Then we define the optimization function:

In [248]:
# Define the optimization function
def param_hyperopt_weight(max_evals):
    params_best = fmin(fn = hyperopt_objective_weight,
                       space = params_space,
                       algo = tpe.suggest,
                       max_evals = max_evals)    
    return params_best

Now we run the search. Since no model is retrained and each iteration is pure arithmetic, we can afford a much larger number of iterations:

In [249]:
params_best = param_hyperopt_weight(200)

100%|█████████████████████████████████████████████| 200/200 [00:01<00:00, 153.74trial/s, best loss: 0.2066999070373698]


In [250]:
params_best

{'weight1': 0.03132637208709112,
 'weight2': 0.09982461974413209,
 'weight3': 0.5003967731059809}

In [253]:
best_weights = hyperopt_objective_weight(params_best, train=False)
best_weights

array([0.03132637, 0.09982462, 0.50039677])

In [257]:
res_test = (test_predict.iloc[:, :3] * best_weights).sum(1) / best_weights.sum()
MSE_res = mean_squared_error(res_test, y_test)

In [258]:
MSE_res

0.19333050577750574

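200 iterations take only about 1 s and give the best fusion result so far. Note that the best weights all fall inside the trimmed ranges from Scheme 2, with weight2 and weight3 sitting right next to the bounds 0.1 and 0.5, and the space-definition cell shown above carries the earlier execution counter In [139]. The search therefore appears to have run with the trimmed space still in effect rather than the full (0, 1) space. Next we increase the number of iterations: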

In [259]:
params_best = param_hyperopt_weight(2000)

100%|███████████████████████████████████████████| 2000/2000 [00:29<00:00, 68.78trial/s, best loss: 0.20668595198010906]


In [260]:
params_best

{'weight1': 0.019872909249536313,
 'weight2': 0.09999748042937995,
 'weight3': 0.5000702673971477}

In [261]:
best_weights = hyperopt_objective_weight(params_best, train=False)
best_weights

array([0.01987291, 0.09999748, 0.50007027])

In [262]:
res_test = (test_predict.iloc[:, :3] * best_weights).sum(1) / best_weights.sum()
MSE_res = mean_squared_error(res_test, y_test)

In [263]:
MSE_res

0.1930182288329917

After 2000 iterations the result improves a little further, so we increase the number of iterations again:

In [264]:
params_best = param_hyperopt_weight(10000)

100%|█████████████████████████████████████████| 10000/10000 [09:20<00:00, 17.83trial/s, best loss: 0.20668588719715994]


In [265]:
params_best

{'weight1': 0.01933442915160935,
 'weight2': 0.09999448846226298,
 'weight3': 0.5000126791510763}

In [266]:
best_weights = hyperopt_objective_weight(params_best, train=False)
best_weights

array([0.01933443, 0.09999449, 0.50001268])

In [267]:
res_test = (test_predict.iloc[:, :3] * best_weights).sum(1) / best_weights.sum()
MSE_res = mean_squared_error(res_test, y_test)

In [268]:
MSE_res

0.19300368584356492

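As the number of iterations grows, the blending result keeps improving, and the train_oof score and the test-set score move in the same direction, which suggests the method is quite stable on this dataset:

| Iterations | train_oof score | Test-set score |
| ------ | ------ | ------ |
| <center>200 | <center>0.2066999070373698 | <center>0.19333 |
| <center>2000 | <center>0.20668595198010906 | <center>0.19301 |
| <center>10000 | <center>0.20668588719715994 | <center>0.19300 |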

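&emsp;&emsp;Even for the TPE weight search based on cross-training, we can still clip the search space to further improve search efficiency and the prediction result. Using clipping ranges similar to the earlier ones, the TPE search gives the following result: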

In [294]:
# Define the hyperparameter space
params_space = {'weight1': hp.uniform("weight1",0,0.01),
                'weight2': hp.uniform("weight2",0.05,0.1),
                'weight3': hp.uniform("weight3",0.5,1)}

In [297]:
params_best = param_hyperopt_weight(2000)

100%|███████████████████████████████████████████| 2000/2000 [00:29<00:00, 68.43trial/s, best loss: 0.20669530153582757]


In [298]:
params_best

{'weight1': 0.009995130515269924,
 'weight2': 0.09999831798285891,
 'weight3': 0.5001811581252161}

In [299]:
best_weights = hyperopt_objective_weight(params_best, train=False)
best_weights

array([0.00999513, 0.09999832, 0.50018116])

In [300]:
res_test = (test_predict.iloc[:, :3] * best_weights).sum(1) / best_weights.sum()
MSE_res = mean_squared_error(res_test, y_test)

In [301]:
MSE_res

0.19274993739359678

The result improves once more. This gives us the best result obtained with weighted-average blending:

| Model | Training score | Test score |
| ------ | ------ | ------ |
| <center>Random Forest_OPT | <center>0.0322 | <center>0.2311 |
| <center>Extra Trees_OPT | <center>1.37e-07 | <center>0.2225 |
| <center>GBDT_OPT | <center>0.0253 | <font color="red"><center>**0.1980** |
| <center>Simple averaging | <center>0.0109 | <center>0.2049 |
| <center>Multiplicative gradient weights | <center>0.0106 | <center>0.1989 |
| <center>Exponential gradient weights | <center>0.0209 | <font color="red"><center>**0.1963** |
| <center>Cross-validation-based TPE search (50 iterations) | <center>0.0169 | <center>0.1957 |
| <center>Cross-validation-based TPE search (100 iterations) | <center>0.0140 | <font color="red"><center>**0.1953** |
| <center>Clipped-space TPE search (50 iterations) | <center>0.0180 | <center>0.1956 |
| <center>Cross-training TPE search (200 iterations) | <center>0.2067 | <center>0.19333 |
| <center>Cross-training TPE search (2000 iterations) | <center>0.2067 | <center>0.19301 |
| <center>Cross-training TPE search (10000 iterations) | <center>0.2067 | <center>0.19300 |
| <center>Cross-training + clipped-space TPE search (2000 iterations) |<center>0.2067 | <font color="red"><center>**0.1927** |

&emsp;&emsp;As in the classification case, we can also run the search directly on the test set to probe the upper bound of weighted-average blending:

In [279]:
# Define the objective function (scored on the test set, only to probe the upper bound)
def hyperopt_objective_weight(params, train=True):
    weight1 = params['weight1']
    weight2 = params['weight2']
    weight3 = params['weight3']
    
    weights = np.array([weight1, weight2, weight3])
    
    if train == True:
        res_test = (test_predict.iloc[:, :3] * weights).sum(1) / weights.sum()
        MSE_res = mean_squared_error(res_test, y_test)
        res = MSE_res
    else:
        res = weights
    return res

In [280]:
# Define the optimization function
def param_hyperopt_weight(max_evals):
    params_best = fmin(fn = hyperopt_objective_weight,
                       space = params_space,
                       algo = tpe.suggest,
                       max_evals = max_evals)    
    return params_best

In [281]:
params_best = param_hyperopt_weight(1000)

100%|██████████████████████████████████████████| 1000/1000 [00:09<00:00, 102.55trial/s, best loss: 0.19170078526754766]


In [282]:
params_best

{'weight1': 1.2340960509984558e-06,
 'weight2': 0.050004127069823705,
 'weight3': 0.9989348939544079}

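The upper bound of weighted-average blending on the test set is 0.1917, very close to the 0.1927 obtained through training, which shows that the earlier weighted blending worked well. To stress this once more: a test set normally has no labels, so this upper-bound test only serves to validate the method within the course and is not a general modeling technique.

### 2. Learned combiner methods

&emsp;&emsp;Next we try model fusion with learned combiners, namely Stacking and Blending.

#### 2.1 Stacking

&emsp;&emsp;We start with Stacking. A quick recap of how Stacking works: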

<center><img src="http://ml2022.oss-cn-hangzhou.aliyuncs.com/img/image-20221103152633511.png" alt="image-20221103152633511" style="zoom:40%;" />

Since we have already built the train_oof dataset and test_predict, we can train the meta-learner directly on train_oof and then predict on test_predict.

In [25]:
train_oof

Unnamed: 0,RF_oof,ET_oof,GBR_oof,MedHouseVal
0,2.910792,2.806845,3.010536,3.07000
1,1.122422,1.356957,1.299047,1.66700
2,2.748327,2.771082,2.520479,2.76600
3,4.495247,4.514102,4.789784,5.00001
4,2.743010,2.723145,2.453989,2.51800
...,...,...,...,...
16507,2.241706,2.312674,2.191293,1.91300
16508,1.402586,1.442147,1.425944,1.61800
16509,1.097334,1.105217,1.088715,1.34000
16510,2.166915,2.142442,2.269993,1.92900
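
For readers who want to reproduce a table like train_oof on their own data, the sketch below builds out-of-fold predictions with scikit-learn's `cross_val_predict` and then trains a linear meta-learner on them. The synthetic data, the two base models and the choice to refit each base model on the full training set for the test features are assumptions made for illustration; the notebook's own train_oof and test_predict were produced earlier and may have been built differently.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict, train_test_split

# Synthetic regression data standing in for the real dataset
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

base_models = {
    "RF": RandomForestRegressor(n_estimators=30, random_state=0),
    "GBR": GradientBoostingRegressor(n_estimators=50, random_state=0),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Level-1 training features: every row is predicted by a model that never saw it
oof = np.column_stack(
    [cross_val_predict(m, X_tr, y_tr, cv=cv) for m in base_models.values()]
)
# Level-1 test features: each base model is refit on all training rows (an assumption)
test_pred = np.column_stack(
    [m.fit(X_tr, y_tr).predict(X_te) for m in base_models.values()]
)

# Meta-learner trained on the out-of-fold predictions only
meta = LinearRegression().fit(oof, y_tr)
stack_pred = meta.predict(test_pred)
print(oof.shape, test_pred.shape, stack_pred.shape)
```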


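- Choosing the meta-learner for regression fusion

&emsp;&emsp;Whether the task is regression or classification, and whether the method is Stacking or Blending, the choice of meta-learner follows the same logic: to avoid overfitting, the meta-learner should usually be a structurally simple model with modest predictive power. In the earlier classification material we recommended logistic regression or a very simple tree model as the meta-learner; for regression Stacking, linear regression or Bayesian regression is the usual choice.

&emsp;&emsp;We first run a broad experiment that tests many types of regression models as the meta-learner, and then analyze the meta-learner selection problem: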

In [22]:
# Linear regression
from sklearn.linear_model import LinearRegression
lr_reg = LinearRegression().fit(train_oof.iloc[:, :3], y_train)
lr_train_prediction = lr_reg.predict(train_oof.iloc[:, :3])
lr_test_prediction = lr_reg.predict(test_predict)
print('The results of LR-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(lr_train_prediction, y_train), 
                                       mean_squared_error(lr_test_prediction, y_test)))

# Ridge regression
from sklearn.linear_model import Ridge
Ridge_reg = Ridge().fit(train_oof.iloc[:, :3], y_train)
Ridge_train_prediction = Ridge_reg.predict(train_oof.iloc[:, :3])
Ridge_test_prediction = Ridge_reg.predict(test_predict)
print('The results of Ridge-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(Ridge_train_prediction, y_train), 
                                       mean_squared_error(Ridge_test_prediction, y_test)))


# LASSO
from sklearn.linear_model import Lasso
Lasso_reg = Lasso().fit(train_oof.iloc[:, :3], y_train)
Lasso_train_prediction = Lasso_reg.predict(train_oof.iloc[:, :3])
Lasso_test_prediction = Lasso_reg.predict(test_predict)
print('The results of Lasso-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(Lasso_train_prediction, y_train), 
                                       mean_squared_error(Lasso_test_prediction, y_test)))


# Elastic net
from sklearn.linear_model import ElasticNet
ElasticNet_reg = ElasticNet().fit(train_oof.iloc[:, :3], y_train)
ElasticNet_train_prediction = ElasticNet_reg.predict(train_oof.iloc[:, :3])
ElasticNet_test_prediction = ElasticNet_reg.predict(test_predict)
print('The results of ElasticNet-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(ElasticNet_train_prediction, y_train), 
                                       mean_squared_error(ElasticNet_test_prediction, y_test)))

# Bayesian regression
from sklearn.linear_model import BayesianRidge
BayesianRidge_reg = BayesianRidge().fit(train_oof.iloc[:, :3], y_train)
BayesianRidge_train_prediction = BayesianRidge_reg.predict(train_oof.iloc[:, :3])
BayesianRidge_test_prediction = BayesianRidge_reg.predict(test_predict)
print('The results of BayesianRidge-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(BayesianRidge_train_prediction, y_train), 
                                       mean_squared_error(BayesianRidge_test_prediction, y_test)))


# SVR
from sklearn.svm import SVR
SVR_reg = SVR().fit(train_oof.iloc[:, :3], y_train)
SVR_train_prediction = SVR_reg.predict(train_oof.iloc[:, :3])
SVR_test_prediction = SVR_reg.predict(test_predict)
print('The results of SVR-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(SVR_train_prediction, y_train), 
                                       mean_squared_error(SVR_test_prediction, y_test)))


# Decision tree regression
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor().fit(train_oof.iloc[:, :3], y_train)
tree_train_prediction = tree_reg.predict(train_oof.iloc[:, :3])
tree_test_prediction = tree_reg.predict(test_predict)
print('The results of tree_reg-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(tree_train_prediction, y_train), 
                                       mean_squared_error(tree_test_prediction, y_test)))


# Bagging
from sklearn.ensemble import BaggingRegressor
bagging_reg = BaggingRegressor().fit(train_oof.iloc[:, :3], y_train)
bagging_train_prediction = bagging_reg.predict(train_oof.iloc[:, :3])
bagging_test_prediction = bagging_reg.predict(test_predict)
print('The results of Bagging-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(bagging_train_prediction, y_train), 
                                       mean_squared_error(bagging_test_prediction, y_test)))

# Random forest
RFR = RandomForestRegressor().fit(train_oof.iloc[:, :3], y_train)
RFR_train_prediction = RFR.predict(train_oof.iloc[:, :3])
RFR_test_prediction = RFR.predict(test_predict)
print('The results of RF-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(RFR_train_prediction, y_train), 
                                       mean_squared_error(RFR_test_prediction, y_test)))

# AdaBoost
from sklearn.ensemble import AdaBoostRegressor
ABR = AdaBoostRegressor().fit(train_oof.iloc[:, :3], y_train)
ABR_train_prediction = ABR.predict(train_oof.iloc[:, :3])
ABR_test_prediction = ABR.predict(test_predict)
print('The results of AdaBoost-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(ABR_train_prediction, y_train), 
                                       mean_squared_error(ABR_test_prediction, y_test)))

# GBDT
from sklearn.ensemble import GradientBoostingRegressor
GBR = GradientBoostingRegressor().fit(train_oof.iloc[:, :3], y_train)
GBR_train_prediction = GBR.predict(train_oof.iloc[:, :3])
GBR_test_prediction = GBR.predict(test_predict)
print('The results of GBR-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(GBR_train_prediction, y_train), 
                                       mean_squared_error(GBR_test_prediction, y_test)))

# XGB
from xgboost import XGBRegressor
XGB = XGBRegressor().fit(train_oof.iloc[:, :3], y_train)
XGB_train_prediction = XGB.predict(train_oof.iloc[:, :3])
XGB_test_prediction = XGB.predict(test_predict)
print('The results of XGB-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(XGB_train_prediction, y_train), 
                                       mean_squared_error(XGB_test_prediction, y_test)))

The results of LR-final:
Train-MSE: 0.205622, Test-MSE: 0.192784
The results of Ridge-final:
Train-MSE: 0.205622, Test-MSE: 0.192803
The results of Lasso-final:
Train-MSE: 1.109248, Test-MSE: 1.150578
The results of ElasticNet-final:
Train-MSE: 0.549212, Test-MSE: 0.561142
The results of BayesianRidge-final:
Train-MSE: 0.205622, Test-MSE: 0.192801
The results of SVR-final:
Train-MSE: 0.205412, Test-MSE: 0.194418
The results of tree_reg-final:
Train-MSE: 0.000000, Test-MSE: 0.377318
The results of Bagging-final:
Train-MSE: 0.042989, Test-MSE: 0.224572
The results of RF-final:
Train-MSE: 0.031436, Test-MSE: 0.209634
The results of AdaBoost-final:
Train-MSE: 0.280633, Test-MSE: 0.272866
The results of GBR-final:
Train-MSE: 0.190360, Test-MSE: 0.194770
The results of XGB-final:
Train-MSE: 0.114721, Test-MSE: 0.207962


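| Meta-learner | Training score | Test score |
| ------ | ------ | ------ |
| <center>Linear regression | <center>0.205622 | <center>0.192784 |
| <center>Bayesian regression | <center>0.205622 | <center>0.192801 |

Here we tested the linear family (linear regression, ridge regression, LASSO and elastic net), Bayesian regression, a support vector machine and a decision tree, and also several ensemble models. Linear regression and Bayesian regression clearly perform best. LASSO and elastic net underfit, while most of the remaining models overfit to varying degrees. Below we analyze how the different models behave as meta-learners: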

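&emsp;&emsp;First, ridge regression, LASSO and elastic net. These three models add a perturbation term to the loss function, originally to handle the case where the coefficient matrix in linear regression is not invertible and the problem cannot be solved. On a simple dataset with no collinearity problem, their fit and prediction accuracy therefore do not match plain linear regression, which is the main reason they tend to underfit here, most visibly LASSO and elastic net at their default settings.

> From a machine learning perspective, the perturbation term in ridge regression, LASSO and elastic net is simply the structural-risk penalty term.

&emsp;&emsp;The loss functions of the four models compare as follows:

$$\text{Linear regression: } \min_{w} || X w - y||_2^2$$
$$\text{Ridge regression: } \min_{w} || X w - y||_2^2 + \alpha ||w||_2^2$$
$$\text{LASSO: } \min_{w} { \frac{1}{2n_{\text{samples}}} ||X w - y||_2 ^ 2 + \alpha ||w||_1}$$
$$\text{Elastic net: } \min_{w} { \frac{1}{2n_{\text{samples}}} ||X w - y||_2 ^ 2 + \alpha \rho ||w||_1 +
\frac{\alpha(1-\rho)}{2} ||w||_2 ^ 2}$$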
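
To make the role of the penalty term concrete, here is the standard closed-form comparison between ordinary least squares and ridge regression for the two objectives listed first. The intercept is ignored for brevity and $I$ denotes the identity matrix; both are notational assumptions not used elsewhere in this section.

$$\hat{w}_{\text{OLS}} = (X^\top X)^{-1} X^\top y$$
$$\hat{w}_{\text{ridge}} = (X^\top X + \alpha I)^{-1} X^\top y$$

With $\alpha > 0$ the matrix $X^\top X + \alpha I$ is invertible even when $X^\top X$ is singular, and when $X^\top X$ is itself invertible the ridge solution approaches the least-squares solution as $\alpha \to 0$.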

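Removing the underfitting of these three models is simple: adjust the hyperparameters by hand to weaken the influence of the perturbation term on the loss function. For example, gradually reducing $\alpha$ in ridge regression gradually improves the predictions: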

In [48]:
# Ridge regression
from sklearn.linear_model import Ridge
Ridge_reg = Ridge().fit(train_oof.iloc[:, :3], y_train)
Ridge_train_prediction = Ridge_reg.predict(train_oof.iloc[:, :3])
Ridge_test_prediction = Ridge_reg.predict(test_predict)
print('The results of LR-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(Ridge_train_prediction, y_train), 
                                       mean_squared_error(Ridge_test_prediction, y_test)))

The results of LR-final:
Train-MSE: 0.205622, Test-MSE: 0.192803


In [49]:
Ridge?

Init signature:
Ridge(
    alpha=1.0,
    *,
    fit_intercept=True,
    normalize='deprecated',
    copy_X=True,
    max_iter=None,
    tol=0.001,
    solver='auto',
    positive=False,
    random_state=None,
)
Docstring:     
Linear least squares with l2 regularization.

Minimizes the objective function::

||y - Xw||^2_2 + alpha * ||w||^2_2

This model solves a regression model where the loss function is


In [51]:
Ridge_reg = Ridge(alpha=0.5).fit(train_oof.iloc[:, :3], y_train)
Ridge_train_prediction = Ridge_reg.predict(train_oof.iloc[:, :3])
Ridge_test_prediction = Ridge_reg.predict(test_predict)
print('The results of LR-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(Ridge_train_prediction, y_train), 
                                       mean_squared_error(Ridge_test_prediction, y_test)))

The results of LR-final:
Train-MSE: 0.205622, Test-MSE: 0.192794


In [52]:
Ridge_reg = Ridge(alpha=0.1).fit(train_oof.iloc[:, :3], y_train)
Ridge_train_prediction = Ridge_reg.predict(train_oof.iloc[:, :3])
Ridge_test_prediction = Ridge_reg.predict(test_predict)
print('The results of LR-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(Ridge_train_prediction, y_train), 
                                       mean_squared_error(Ridge_test_prediction, y_test)))

The results of LR-final:
Train-MSE: 0.205622, Test-MSE: 0.192786


Similarly, reducing $\alpha$ in LASSO also improves the meta-learner's predictions:

In [33]:
# LASSO
from sklearn.linear_model import Lasso
Lasso_reg = Lasso().fit(train_oof.iloc[:, :3], y_train)
Lasso_train_prediction = Lasso_reg.predict(train_oof.iloc[:, :3])
Lasso_test_prediction = Lasso_reg.predict(test_predict)
print('The results of Lasso-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(Lasso_train_prediction, y_train), 
                                       mean_squared_error(Lasso_test_prediction, y_test)))

The results of Lasso-final:
Train-MSE: 1.109248, Test-MSE: 1.150578


In [53]:
Lasso?

Init signature:
Lasso(
    alpha=1.0,
    *,
    fit_intercept=True,
    normalize='deprecated',
    precompute=False,
    copy_X=True,
    max_iter=1000,
    tol=0.0001,
    warm_start=False,
    positive=False,
    random_state=None,
    selection='cyclic',
)
Docstring:     
Linear Model trained with L1

In [54]:
# LASSO
Lasso_reg = Lasso(alpha=0.1).fit(train_oof.iloc[:, :3], y_train)
Lasso_train_prediction = Lasso_reg.predict(train_oof.iloc[:, :3])
Lasso_test_prediction = Lasso_reg.predict(test_predict)
print('The results of Lasso-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(Lasso_train_prediction, y_train), 
                                       mean_squared_error(Lasso_test_prediction, y_test)))

The results of Lasso-final:
Train-MSE: 0.216589, Test-MSE: 0.203833


In [55]:
from sklearn.linear_model import Lasso
Lasso_reg = Lasso(alpha=0.01).fit(train_oof.iloc[:, :3], y_train)
Lasso_train_prediction = Lasso_reg.predict(train_oof.iloc[:, :3])
Lasso_test_prediction = Lasso_reg.predict(test_predict)
print('The results of Lasso-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(Lasso_train_prediction, y_train), 
                                       mean_squared_error(Lasso_test_prediction, y_test)))

The results of Lasso-final:
Train-MSE: 0.205775, Test-MSE: 0.193191


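> Comparing the results at the same absolute value of $\alpha$, LASSO performs worse than ridge regression, which shows that at equal $\alpha$ LASSO penalizes structural risk more heavily than ridge regression. Its L1 penalty can also shrink coefficients exactly to zero, which is why LASSO is sometimes used for feature selection.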
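
Part of the gap between LASSO and ridge at the same $\alpha$ comes from the different scaling of the data-fit term in the loss functions listed above: LASSO divides it by $2n_{\text{samples}}$, ridge does not. The sketch below, on synthetic data, illustrates this. The data, the variable names and the `alpha_max` formula (the smallest $\alpha$ at which all LASSO coefficients become zero under that objective) are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
y = X @ np.array([0.1, 0.3, 0.6]) + rng.normal(scale=0.3, size=n)

# Smallest alpha that zeroes every LASSO coefficient, computed on centered data
Xc, yc = X - X.mean(axis=0), y - y.mean()
alpha_max = np.abs(Xc.T @ yc).max() / n

lasso_zero = Lasso(alpha=alpha_max * 1.01).fit(X, y)   # just above the threshold
lasso_small = Lasso(alpha=alpha_max * 0.01).fit(X, y)  # far below the threshold
ridge = Ridge(alpha=1.0).fit(X, y)                     # alpha=1 is tiny relative to n

print("alpha_max  :", alpha_max)
print("lasso_zero :", lasso_zero.coef_)
print("lasso_small:", lasso_small.coef_)
print("ridge      :", ridge.coef_)
```

On data of this kind an $\alpha$ below 1 is already enough to push every LASSO coefficient to zero, while ridge with $\alpha = 1$ stays close to the least-squares fit.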

For elastic net, reducing $\alpha$ while increasing $\rho$ (`l1_ratio` in scikit-learn) improves the model:

In [34]:
# Elastic net
from sklearn.linear_model import ElasticNet
ElasticNet_reg = ElasticNet().fit(train_oof.iloc[:, :3], y_train)
ElasticNet_train_prediction = ElasticNet_reg.predict(train_oof.iloc[:, :3])
ElasticNet_test_prediction = ElasticNet_reg.predict(test_predict)
print('The results of Lasso-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(ElasticNet_train_prediction, y_train), 
                                       mean_squared_error(ElasticNet_test_prediction, y_test)))

The results of Lasso-final:
Train-MSE: 0.549212, Test-MSE: 0.561142


In [56]:
ElasticNet?

Init signature:
ElasticNet(
    alpha=1.0,
    *,
    l1_ratio=0.5,
    fit_intercept=True,
    normalize='deprecated',
    precompute=False,
    max_iter=1000,
    copy_X=True,
    tol=0.0001,
    warm_start=False,
    positive=False,
    random_state=None,
    selection='cyclic',
)

In [57]:
# Elastic net
from sklearn.linear_model import ElasticNet
ElasticNet_reg = ElasticNet(alpha=0.1, l1_ratio=0.9).fit(train_oof.iloc[:, :3], y_train)
ElasticNet_train_prediction = ElasticNet_reg.predict(train_oof.iloc[:, :3])
ElasticNet_test_prediction = ElasticNet_reg.predict(test_predict)
print('The results of Lasso-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(ElasticNet_train_prediction, y_train), 
                                       mean_squared_error(ElasticNet_test_prediction, y_test)))

The results of Lasso-final:
Train-MSE: 0.214485, Test-MSE: 0.203406


In [58]:
from sklearn.linear_model import ElasticNet
ElasticNet_reg = ElasticNet(alpha=0.01, l1_ratio=0.99).fit(train_oof.iloc[:, :3], y_train)
ElasticNet_train_prediction = ElasticNet_reg.predict(train_oof.iloc[:, :3])
ElasticNet_test_prediction = ElasticNet_reg.predict(test_predict)
print('The results of Lasso-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(ElasticNet_train_prediction, y_train), 
                                       mean_squared_error(ElasticNet_test_prediction, y_test)))

The results of Lasso-final:
Train-MSE: 0.205773, Test-MSE: 0.193204


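However, no matter how we tune them, the predictions of this group of models do not surpass linear regression here (a consequence of how the models work). In most cases we therefore do not consider them as meta-learners.

&emsp;&emsp;Bayesian regression, by contrast, is another candidate meta-learner for regression Stacking. Whereas linear regression solves for the coefficients by least squares, Bayesian regression estimates them through Bayesian inference. The derivation is fairly involved, the hyperparameters it exposes have almost no effect on the fitted model, and its predictive power is modest (on par with linear regression), so the course does not cover its theory in depth. Interested readers can consult the material on Bayesian linear regression in Section 3.3 of *Pattern Recognition and Machine Learning* (PRML).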

In [60]:
BayesianRidge?

Init signature:
BayesianRidge(
    *,
    n_iter=300,
    tol=0.001,
    alpha_1=1e-06,
    alpha_2=1e-06,
    lambda_1=1e-06,
    lambda_2=1e-06,
    alpha_init=None,
    lambda_init=None,
    compute_score=False,
    fit_intercept=True,
    normalize='deprecated',
    copy_X=True,
    verbose

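&emsp;&emsp;Note that scikit-learn's Bayesian regression exposes four hyperparameters, alpha_1, alpha_2, lambda_1 and lambda_2, all of which default to 1e-6. Because of numerical precision issues, the effect of these four hyperparameters on the model is hard to control during modeling, and hyperparameter optimization may even fail to find an optimal combination. For example, with the default parameters the Bayesian regression meta-learner performs as follows: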

In [35]:
# Bayesian regression
from sklearn.linear_model import BayesianRidge
BayesianRidge_reg = BayesianRidge().fit(train_oof.iloc[:, :3], y_train)
BayesianRidge_train_prediction = BayesianRidge_reg.predict(train_oof.iloc[:, :3])
BayesianRidge_test_prediction = BayesianRidge_reg.predict(test_predict)
print('The results of BayesianRidge-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(BayesianRidge_train_prediction, y_train), 
                                       mean_squared_error(BayesianRidge_test_prediction, y_test)))

The results of BayesianRidge-final:
Train-MSE: 0.205622, Test-MSE: 0.192801


Running a hyperparameter search on it gives the following result:

In [114]:
BR_space = {'alpha_1': hp.uniform('alpha_1', 1e-6, 1e-2), 
            'alpha_2': hp.uniform('alpha_2', 1e-6, 1e-2), 
            'lambda_1': hp.uniform('lambda_1', 1e-6, 1e-2), 
            'lambda_2': hp.uniform('lambda_2', 1e-6, 1e-2)}

In [115]:
def BR_param_objective(params, train=True):
    
    # Read the hyperparameters
    alpha_1 = params['alpha_1']
    alpha_2 = params['alpha_2']
    lambda_1 = params['lambda_1']
    lambda_2 = params['lambda_2']
        
    # Build the model
    BayesianRidge_reg = BayesianRidge(n_iter=200000, 
                                      alpha_1=alpha_1, 
                                      alpha_2=alpha_2, 
                                      lambda_1=lambda_1, 
                                      lambda_2=lambda_2)
    if train == True:
        res = -cross_val_score(BayesianRidge_reg, 
                               train_oof.iloc[:, :3], 
                               y_train, 
                               scoring='neg_mean_squared_error', 
                               n_jobs=15).mean()
    else:
        res = BayesianRidge_reg.fit(train_oof.iloc[:, :3], y_train)
        
    return res

In [116]:
def BR_param_search(max_evals=500):
    params_best = fmin(BR_param_objective,
                       space = BR_space,
                       algo = tpe.suggest,
                       max_evals = max_evals)
    return params_best

In [117]:
BR_best_param = BR_param_search(2000)

100%|███████████████████████████████████████████| 2000/2000 [00:55<00:00, 35.79trial/s, best loss: 0.20591380724221203]


In [118]:
BR_best_param

{'alpha_1': 0.00781092764422319,
 'alpha_2': 0.00919958807547112,
 'lambda_1': 0.009997955393171507,
 'lambda_2': 1.5458991132666618e-06}

In [119]:
BR_reg = BR_param_objective(BR_best_param, train=False)

In [120]:
mean_squared_error(BR_reg.predict(train_oof.iloc[:, :3]), y_train), mean_squared_error(BR_reg.predict(test_predict), y_test)

(0.2056222234194353, 0.19280159602984515)

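The result barely changes, and this is in fact the typical situation when modeling with Bayesian regression. So when linear regression or Bayesian regression serves as the Stacking meta-learner, hyperparameter tuning is usually unnecessary (linear regression has no hyperparameters, and tuning Bayesian regression brings little gain).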
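
The insensitivity can be checked quickly on synthetic data. The sketch below fits `BayesianRidge` twice, once with the default hyperpriors and once with all four values raised to 1e-2, and compares the fitted coefficients. The data and variable names are illustrative assumptions; with a few thousand samples the difference is expected to be very small, although the exact size depends on the data.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
y = X @ np.array([-0.06, 0.35, 0.73]) + rng.normal(scale=0.45, size=n)

# Default hyperpriors (all four values are 1e-6)
default_fit = BayesianRidge().fit(X, y)
# Hyperpriors moved to the upper end of the search range used above
shifted_fit = BayesianRidge(alpha_1=1e-2, alpha_2=1e-2,
                            lambda_1=1e-2, lambda_2=1e-2).fit(X, y)

max_diff = np.abs(default_fit.coef_ - shifted_fit.coef_).max()
print(default_fit.coef_)
print(shifted_fit.coef_)
print("largest coefficient difference:", max_diff)
```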

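> As a brief aside, hyperparameter optimization is ineffective for Bayesian regression because, on a given dataset, different hyperparameter values may lead to the same training result, which leaves the optimizer without a criterion for choosing between them. Even if these values affect the test set differently, they perform the same on the training set, so the optimizer cannot tell which combination is best and the outcome of the optimization is unstable. This is a weakness of Bayesian regression, but it also has an advantage: unlike linear regression, its performance does not degrade under collinearity, so it is more robust overall. In the model fusion stage it is still advisable to fit both models, output results several times and select among them.

&emsp;&emsp;With more complex tree models or ensemble algorithms as the meta-learner, overfitting is practically unavoidable and, as in the classification case, hyperparameter optimization cannot fully resolve it. Taking the decision tree as an example, the hyperparameter optimization process is as follows: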

In [74]:
# Decision tree regression
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor().fit(train_oof.iloc[:, :3], y_train)
tree_train_prediction = tree_reg.predict(train_oof.iloc[:, :3])
tree_test_prediction = tree_reg.predict(test_predict)
print('The results of tree_reg-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(tree_train_prediction, y_train), 
                                       mean_squared_error(tree_test_prediction, y_test)))

The results of tree_reg-final:
Train-MSE: 0.000000, Test-MSE: 0.385108


In [85]:
tree_reg.get_params()

{'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'random_state': None,
 'splitter': 'best'}

In [None]:
from sklearn.model_selection import GridSearchCV

# Instantiate the decision tree estimator
tree_final = DecisionTreeRegressor()

# min_samples_split must be an integer >= 2 (or a fraction), so the grid starts at 2
tree_param = {'max_depth': np.arange(2, 16, 1).tolist(), 
              'min_samples_split': np.arange(2, 5, 1).tolist(), 
              'min_samples_leaf': np.arange(1, 4, 1).tolist(), 
              'max_leaf_nodes':np.arange(6, 30, 1).tolist()}

# Instantiate the grid search estimator
tfg = GridSearchCV(estimator = tree_final,
                   param_grid = tree_param,
                   n_jobs = 12)

tfg.fit(train_oof.iloc[:, :3], y_train)

GridSearchCV(estimator=DecisionTreeRegressor(), n_jobs=12,
             param_grid={'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
                                       14, 15],
                         'max_leaf_nodes': [6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
                                            16, 17, 18, 19, 20, 21, 22, 23, 24,
                                            25, 26, 27, 28, 29],
                         'min_samples_leaf': [1, 2, 3],
                         'min_samples_split': [2, 3, 4]})

In [81]:
tfg.best_params_

{'max_depth': 5,
 'max_leaf_nodes': 29,
 'min_samples_leaf': 3,
 'min_samples_split': 2}

In [82]:
tree_train_prediction = tfg.predict(train_oof.iloc[:, :3])
tree_test_prediction = tfg.predict(test_predict)
print('The results of tree_reg-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(tree_train_prediction, y_train), 
                                       mean_squared_error(tree_test_prediction, y_test)))

The results of tree_reg-final:
Train-MSE: 0.203716, Test-MSE: 0.200287


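Hyperparameter optimization does limit the overfitting, but the result is still not as good as linear regression or Bayesian regression. Other ensemble models behave similarly, so we do not go through them one by one.

| Meta-learner | Training score | Test score |
| ------ | ------ | ------ |
| <center>Linear regression | <center>0.205622 | <center>0.192784 |
| <center>Bayesian regression | <center>0.205622 | <center>0.192801 |

&emsp;&emsp;The overall conclusion from these experiments is that, for the meta-learner in regression Stacking, linear regression and Bayesian regression should be preferred. Only in the rare case of very large-scale model fusion would a more complex model be considered as the meta-learner.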

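- Linear meta-learners versus weighted-average blending

&emsp;&emsp;Cross-training-based weighted-average blending and Stacking with a linear meta-learner are essentially very similar: both multiply the train_oof features by coefficients to form the final prediction. The difference is that linear regression and Bayesian regression allow a constant term after the weighted sum, and their coefficients are not required to be non-negative or to sum to 1. In the current example, the linear regression meta-learner ends up with the following coefficients: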

In [24]:
lr_reg.coef_

array([-0.06107945,  0.35375414,  0.73250402])

The constant term is:

In [36]:
lr_reg.intercept_

-0.05678572469977139

Bayesian regression is similar:

In [25]:
BayesianRidge_reg.coef_

array([-0.05900869,  0.35252576,  0.7317419 ])

In [51]:
BayesianRidge_reg.intercept_

-0.056949861828635484

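The actual fusion is then a weighted sum that uses these coefficients as weights, plus the constant term. Taking linear regression as an example, the final prediction is computed as: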

In [38]:
(train_oof.iloc[:, :3] * lr_reg.coef_).sum(1) + lr_reg.intercept_

0        2.963588
1        1.306244
2        2.601891
3        4.774066
4        2.536553
           ...   
16507    2.229541
16508    1.412220
16509    1.064653
16510    2.231537
16511    2.022835
Length: 16512, dtype: float64

In [31]:
lr_reg.predict(train_oof.iloc[:, :3])

array([2.96358772, 1.30624403, 2.60189094, ..., 1.06465319, 2.23153731,
       2.02283499])

The weighted-average blending introduced earlier can be viewed as Stacking with a linear regression meta-learner that has no constant term:

In [47]:
best_param

array([1.23409605e-06, 5.00041271e-02, 9.98934894e-01])

In this case the coefficients of the "linear regression" are:

In [44]:
best_param / best_param.sum()

array([1.17651701e-06, 4.76710917e-02, 9.52327732e-01])

In [45]:
best_param_norm = best_param / best_param.sum()

In [46]:
(train_oof.iloc[:, :3] * best_param_norm).sum(1)

0        3.000826
1        1.301808
2        2.532426
3        4.776642
4        2.466820
           ...   
16507    2.197079
16508    1.426716
16509    1.089502
16510    2.263913
16511    1.974831
Length: 16512, dtype: float64

Because the two model equations have nearly the same structure, their results usually do not differ in any fundamental way:

In [49]:
# Linear regression result
mean_squared_error(lr_test_prediction, y_test)

0.19278440531899835

In [50]:
# Weighted-average blending result
mean_squared_error((test_predict * best_param_norm).sum(1), y_test)

0.19170078526754764
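
The structural similarity can be sketched with scikit-learn alone. Below, an unconstrained `LinearRegression` plays the Stacking meta-learner, and a `LinearRegression(fit_intercept=False, positive=True)` plays a blender with non-negative weights and no constant term. The synthetic predictions and variable names are illustrative assumptions. The constrained regression does not force the weights to sum to 1, so it is only an approximation of the normalized weighted average used in this section, not the notebook's TPE-based search.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 1000
y = rng.normal(loc=2.0, scale=1.0, size=n)
# Three noisy "base model" predictions of y, from weakest to strongest
preds = np.column_stack([y + rng.normal(scale=s, size=n) for s in (0.6, 0.5, 0.4)])

# Stacking meta-learner: free coefficients plus an intercept
stacker = LinearRegression().fit(preds, y)
# Blending-like model: non-negative coefficients, no intercept
blender = LinearRegression(fit_intercept=False, positive=True).fit(preds, y)

# Rescale the blender coefficients so they sum to 1, as in a weighted average
norm_weights = blender.coef_ / blender.coef_.sum()

mse_stack = mean_squared_error(y, stacker.predict(preds))
mse_blend = mean_squared_error(y, blender.predict(preds))
print("stacker coef:", stacker.coef_, "intercept:", stacker.intercept_)
print("blender coef:", blender.coef_, "normalized:", norm_weights)
print("train MSE   :", mse_stack, mse_blend)
```

On the training data the unconstrained model can never be worse than the constrained one, since it optimizes over a larger set of coefficients; how the two compare on new data is an empirical question.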

| Model | Training score | Test score |
| ------ | ------ | ------ |
| <center>Random Forest_OPT | <center>0.0322 | <center>0.2311 |
| <center>Extra Trees_OPT | <center>1.37e-07 | <center>0.2225 |
| <center>GBDT_OPT | <center>0.0253 | <font color="red"><center>**0.1980** |
| <center>Simple averaging | <center>0.0109 | <center>0.2049 |
| <center>Multiplicative gradient weights | <center>0.0106 | <center>0.1989 |
| <center>Exponential gradient weights | <center>0.0209 | <font color="red"><center>**0.1963** |
| <center>Cross-validation-based TPE search (50 iterations) | <center>0.0169 | <center>0.1957 |
| <center>Cross-validation-based TPE search (100 iterations) | <center>0.0140 | <font color="red"><center>**0.1953** |
| <center>Clipped-space TPE search (50 iterations) | <center>0.0180 | <center>0.1956 |
| <center>Cross-training TPE search (200 iterations) | <center>0.2067 | <center>0.19333 |
| <center>Cross-training TPE search (2000 iterations) | <center>0.2067 | <center>0.19301 |
| <center>Cross-training TPE search (10000 iterations) | <center>0.2067 | <center>0.19300 |
| <center>Cross-training + clipped-space TPE search (2000 iterations) |<center>0.2067 | <font color="red"><center>**0.1927** |
| <center>Stacking+LR | <center>0.2056 | <center>0.1928 |
| <center>Stacking+BR | <center>0.2056 | <center>0.1928 |

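- Optimizing the meta-learner

&emsp;&emsp;Precisely because linear regression and Bayesian regression as meta-learners often do not differ structurally from weighted-average blending, and because neither needs hyperparameter optimization, the key to Stacking for regression is whether Bagging can be used to optimize the meta-learner. The procedure is as follows:

First, check how Bagging performs with default hyperparameters: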

In [121]:
bagging_reg = BaggingRegressor().fit(train_oof.iloc[:, :3], y_train)
bagging_train_prediction = bagging_reg.predict(train_oof.iloc[:, :3])
bagging_test_prediction = bagging_reg.predict(test_predict)
print('The results of GBR-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(bagging_train_prediction, y_train), 
                                       mean_squared_error(bagging_test_prediction, y_test)))

The results of GBR-final:
Train-MSE: 0.044133, Test-MSE: 0.229778


In [123]:
BaggingRegressor?

Init signature:
BaggingRegressor(
    base_estimator=None,
    n_estimators=10,
    *,
    max_samples=1.0,
    max_features=1.0,
    bootstrap=True,
    bootstrap_features=False,
    oob_score=False,
    warm_start=False,
    n_jobs=None,
    random_state=None,
    verbose=0,
)
Docstring:     
A Bagging 

Next, replace the default base estimator with linear regression. The results are as follows:

In [178]:
start = time.time()

# Define the hyperparameter grid
parameter_space = {
    "n_estimators": range(10, 21), 
    "max_samples": np.arange(0.1, 1.1, 0.1).tolist()}

# Instantiate the Bagging model and the grid search
bagging_final = BaggingRegressor(base_estimator=LinearRegression(), random_state=21)

BG = GridSearchCV(bagging_final, parameter_space, n_jobs=15)

# Fit the grid search
BG.fit(train_oof.iloc[:, :3], y_train)

print(time.time()-start)

1.5333952903747559


In [179]:
BG.best_params_

{'max_samples': 0.9, 'n_estimators': 10}

In [180]:
BG.best_score_

0.8442282084464816
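
Note that `best_score_` is the mean cross-validated score under the search's scorer, and when no `scoring` argument is given a regressor is scored with R². The value above is therefore an R² and is not on the same scale as the MSE values in the tables. A minimal sketch of running the search with an MSE-based scorer follows. It uses synthetic data (`X_demo`, `y_demo` are illustrative assumptions, not the tutorial's `train_oof`) and passes the base learner positionally, so the call does not depend on whether the parameter is named `base_estimator` or `estimator` in the installed scikit-learn version.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for three columns of level-1 predictions (assumption)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo.sum(axis=1) + rng.normal(scale=0.5, size=300)

# Small grid to keep the sketch fast
param_grid = {"n_estimators": [10, 15], "max_samples": [0.5, 1.0]}

# scoring="neg_mean_squared_error" makes best_score_ a negated MSE
search = GridSearchCV(BaggingRegressor(LinearRegression(), random_state=21),
                      param_grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

cv_mse = -search.best_score_  # flip the sign to recover the mean CV MSE
print(search.best_params_, cv_mse)
```

With this scorer, the selected hyperparameters are chosen by the same metric that is later reported on the test set.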

In [181]:
mean_squared_error(BG.predict(train_oof.iloc[:, :3]), y_train), mean_squared_error(BG.predict(test_predict), y_test)

(0.20562645147601102, 0.19285269873019006)

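Comparing the results, Bagging with a linear base estimator is a clear improvement over default Bagging (test MSE 0.1929 versus 0.2298), although it still trails plain linear regression and Bayesian regression by a small margin.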

| Meta-learner | Train MSE | Test MSE |
| ------ | ------ | ------ |
| <center>Linear regression | <center>0.205622 | <center>0.192784 |
| <center>Bayesian regression | <center>0.205622 | <center>0.192801 |
| <center>Bagging | <center>0.044133 | <center>0.229778 |
| <center>Bagging+LR | <center>0.205626 | <center>0.192852 |
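
One way to see why Bagging over a linear base learner lands so close to plain linear regression: an average of linear models is itself a linear model, so Bagging changes the fitted coefficients only slightly and adds no new structure. The sketch below checks this on synthetic data (`X_meta` and `y_meta` are illustrative assumptions) by fitting a single linear model to the bagged predictions and measuring how much is left unexplained.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for three level-1 prediction columns (assumption)
rng = np.random.RandomState(0)
X_meta = rng.normal(size=(500, 3))
y_meta = X_meta @ np.array([0.2, 0.3, 0.5]) + rng.normal(scale=0.1, size=500)

# Bagging with a linear base learner, passed positionally for version robustness
bag_lr = BaggingRegressor(LinearRegression(), n_estimators=10,
                          max_samples=0.9, random_state=21).fit(X_meta, y_meta)

# Fit one ordinary linear model to the bagged predictions
bag_pred = bag_lr.predict(X_meta)
surrogate = LinearRegression().fit(X_meta, bag_pred)

# Largest disagreement between the bagged ensemble and its linear surrogate
max_gap = np.max(np.abs(surrogate.predict(X_meta) - bag_pred))
print(max_gap)  # numerically zero: the bagged ensemble is exactly linear
```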

Next, use Bayesian regression as the base estimator in the Bagging process:

In [186]:
start = time.time()

# Define the hyperparameter grid
parameter_space = {
    "n_estimators": range(10, 21), 
    "max_samples": np.arange(0.1, 1.1, 0.1).tolist()}

# Instantiate the Bagging model and the grid search
bagging_final = BaggingRegressor(base_estimator=BayesianRidge(), random_state=1)

BG = GridSearchCV(bagging_final, parameter_space, n_jobs=15)

# Fit the grid search
BG.fit(train_oof.iloc[:, :3], y_train)

print(time.time()-start)

1.8406734466552734


In [187]:
BG.best_params_

{'max_samples': 1.0, 'n_estimators': 19}

In [188]:
BG.best_score_

0.844191916402167

In [189]:
mean_squared_error(BG.predict(train_oof.iloc[:, :3]), y_train), mean_squared_error(BG.predict(test_predict), y_test)

(0.205627607165818, 0.19259294571070013)

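Compared further against weighted averaging, Stacking achieves a slightly better result on the current dataset (test MSE 0.192592 versus 0.1927).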

| Model | Train MSE | Test MSE |
| ------ | ------ | ------ |
| <center>Random Forest_OPT | <center>0.0322 | <center>0.2311 |
| <center>Extra Trees_OPT | <center>1.37e-07 | <center>0.2225 |
| <center>GBDT_OPT | <center>0.0253 | <font color="red"><center>0.1980 |
| <center>Simple averaging | <center>0.0109 | <center>0.2049 |
| <center>Multiplicative-gradient weights | <center>0.0106 | <center>0.1989 |
| <center>Exponential-gradient weights | <center>0.0209 | <font color="red"><center>0.1963 |
| <center>CV-based TPE search (50 iterations) | <center>0.0169 | <center>0.1957 |
| <center>CV-based TPE search (100 iterations) | <center>0.0140 | <font color="red"><center>0.1953 |
| <center>TPE search with space pruning (50 iterations) | <center>0.0180 | <center>0.1956 |
| <center>TPE search with cross-training (10,000 iterations) | <center>0.2066 | <font color="red"><center>0.1927 |
| <center>Stacking linear regression | <center>0.205622 | <center>0.192784 |
| <center>Stacking Bayesian regression | <center>0.205622 | <center>0.192801 |
| <center>Stacking Bagging | <center>0.044133 | <center>0.229778 |
| <center>Stacking Bagging+LR | <center>0.205626 | <center>0.192852 |
| <center>Stacking Bagging+BR | <center>0.205627 | <font color="red"><center>**0.192592** |

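#### 2.2 Blending

&emsp;&emsp;Next, we try Blending. The core difference from Stacking is that Blending sets aside a hold-out set, which isolates the data seen by the level-1 learners from the data seen by the meta-learner and thereby further suppresses overfitting during fusion. The Blending workflow is recapped below: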

<center><img src="http://ml2022.oss-cn-hangzhou.aliyuncs.com/img/Blending模型融合流程 (2).jpeg" alt="Blending模型融合流程 (2)" style="zoom:50%;" />

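At the same time, although Blending was introduced as a refinement of Stacking, splitting off a hold-out set costs the level-1 learners training data, so Blending does not always beat Stacking. How the hold-out set is split also becomes a key factor in whether Blending succeeds. Blending optimization for regression problems is discussed in detail in the next subsection.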
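
The data flow described above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the tutorial's tuned pipeline: it uses synthetic data from `make_regression`, two small untuned level-1 learners, and a linear meta-learner, and every variable name is illustrative. The point is the isolation: level-1 learners see only the level-1 part, and the meta-learner sees only their predictions on the hold-out part.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data as a stand-in for the tutorial's dataset (assumption)
X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=12)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=12)

# 1) Split the training set into a level-1 part and a hold-out part (8:2)
X_tr1, X_tr2, y_tr1, y_tr2 = train_test_split(X_tr, y_tr, test_size=0.2, random_state=12)

# 2) Fit the level-1 learners on the level-1 part only
level1 = {
    "RF": RandomForestRegressor(n_estimators=50, random_state=12),
    "GBR": GradientBoostingRegressor(n_estimators=50, random_state=12),
}
for model in level1.values():
    model.fit(X_tr1, y_tr1)

# 3) Level-1 predictions on the hold-out and test sets become meta-features
holdout_meta = np.column_stack([m.predict(X_tr2) for m in level1.values()])
test_meta = np.column_stack([m.predict(X_te) for m in level1.values()])

# 4) Fit the meta-learner on hold-out predictions, then score on the test set
meta = LinearRegression().fit(holdout_meta, y_tr2)
blend_mse = mean_squared_error(y_te, meta.predict(test_meta))
print(blend_mse)
```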

&emsp;&emsp;Next, we implement the full Blending workflow by hand. The first step is the data split; as usual, the hold-out set is split off at a ratio of 8:2:

In [152]:
X_train1, X_train2, y_train1, y_train2 = train_test_split(X_train, y_train,  test_size=0.2, random_state=12)

In [153]:
X_train1.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| 13620 | 4.875 | 10.0 | 5.168022 | 1.078591 | 921.0 | 2.495935 | 34.1 | -118.18 |
| 14303 | 2.4196 | 26.0 | 8.518248 | 2.70073 | 253.0 | 1.846715 | 37.68 | -122.08 |
| 4419 | 2.9926 | 25.0 | 4.63449 | 1.046638 | 2946.0 | 3.195228 | 33.74 | -117.91 |
| 1948 | 0.7403 | 37.0 | 4.491429 | 1.148571 | 1046.0 | 2.988571 | 37.96 | -122.37 |
| 2560 | 2.8882 | 32.0 | 4.61039 | 1.006494 | 550.0 | 3.571429 | 38.72 | -121.71 |


In [154]:
X_train.shape

(16512, 8)

In [155]:
X_train1.shape

(13209, 8)

In [157]:
X_train2.shape

(3303, 8)

In [156]:
y_train1

array([2.004, 2.75 , 1.832, ..., 1.591, 1.266, 1.114])
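
The split sizes reported above follow from how `train_test_split` rounds a fractional `test_size`: the hold-out count is rounded up and the remainder goes to the first part. A quick check, assuming only the row count matters (`dummy_X` is a placeholder array, not the real features):

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_train = 16512                    # rows in X_train, from the output above
dummy_X = np.zeros((n_train, 1))   # placeholder; only the row count matters
part1, part2 = train_test_split(dummy_X, test_size=0.2, random_state=12)

# Hold-out size is ceil(0.2 * 16512) = ceil(3302.4) = 3303
expected_holdout = math.ceil(0.2 * n_train)
print(len(part1), len(part2))      # 13209 3303
```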

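&emsp;&emsp;Next comes model training. Because Blending re-splits the training data, the level-1 learners have to be retrained. We again run a hyperparameter search for random forest, Extra Trees, and GBDT, using the same optimization approach as in the previous subsection. The implementation is as follows.

- Training the level-1 learners for Blending

&emsp;&emsp;First, the random forest model: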

In [158]:
max_features_range = ["auto", "sqrt", "log2", None] + np.arange(0.1, 1., 0.1).tolist()
max_features_range

['auto',
 'sqrt',
 'log2',
 None,
 0.1,
 0.2,
 0.30000000000000004,
 0.4,
 0.5,
 0.6,
 0.7000000000000001,
 0.8,
 0.9]

In [160]:
RF_space = {'max_features': hp.choice('max_features', max_features_range),
            'n_estimators': hp.quniform('n_estimators', 20, 700, 1), 
            'max_samples': hp.uniform('max_samples', 0.2, 1)}

In [161]:
def RF_param_objective(params, train=True):
    
    # Read the hyperparameters
    n_estimators = int(params['n_estimators'])
    max_samples = params['max_samples']

    if train == True:
        max_features = params['max_features']
        
    else:
        # fmin reports the hp.choice entry as an index into max_features_range
        max_features = max_features_range[params['max_features']]
        
    # Build the model
    reg_RF = RandomForestRegressor(n_estimators = n_estimators, 
                                   max_samples = max_samples, 
                                   max_features = max_features,
                                   random_state=12)

    if train == True:
        res = -cross_val_score(reg_RF, X_train1, y_train1, scoring='neg_mean_squared_error', n_jobs=15).mean()
    else:
        # Note: this refit uses the full X_train, which contains the hold-out rows X_train2
        res = reg_RF.fit(X_train, y_train)
        
    return res

In [162]:
def RF_param_search(max_evals=500):
    params_best = fmin(RF_param_objective,
                       space = RF_space,
                       algo = tpe.suggest,
                       max_evals = max_evals)
    return params_best

Then run the hyperparameter search. Here we try 100 evaluations and obtain the following result:

In [163]:
RF_best_param = RF_param_search(100)

100%|██████████████████████████████████████████████| 100/100 [13:08<00:00,  7.89s/trial, best loss: 0.2516114786950027]


In [164]:
RF_best_param

{'max_features': 7, 'max_samples': 0.9975624861890051, 'n_estimators': 626.0}
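
The raw result above is not directly usable as constructor arguments: `fmin` reports an `hp.choice` entry as an index into the candidate list (7 here) and `hp.quniform` entries as floats (626.0). The sketch below shows the decoding step in plain Python. `decode_rf_params` is a hypothetical helper introduced only for illustration, and the candidate list is written with rounded floats rather than the `np.arange` values shown earlier.

```python
# Candidate list mirroring max_features_range, with rounded floats (assumption)
max_features_range = ["auto", "sqrt", "log2", None] + [round(0.1 * k, 1) for k in range(1, 10)]

def decode_rf_params(raw):
    """Hypothetical helper: turn fmin's raw output into constructor-ready values."""
    return {
        "max_features": max_features_range[raw["max_features"]],  # index -> value
        "max_samples": raw["max_samples"],                         # already a float
        "n_estimators": int(raw["n_estimators"]),                  # 626.0 -> 626
    }

raw_best = {"max_features": 7, "max_samples": 0.9975624861890051, "n_estimators": 626.0}
decoded = decode_rf_params(raw_best)
print(decoded)
```

Index 7 maps to `0.4`, which is the same lookup the `train=False` branch of the objective function performs.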

In [165]:
RF_Blending = RF_param_objective(RF_best_param, train=False)

In [166]:
mean_squared_error(RF_Blending.predict(X_train1), y_train1), mean_squared_error(RF_Blending.predict(X_test), y_test)

(0.03232190567836679, 0.23142760467841164)

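As the table shows, the model tuned on the reduced training split scores slightly worse than the earlier optimized random forest (test MSE 0.2314 versus 0.2311):

| Model | Train MSE | Test MSE |
| ------ | ------ | ------ |
| <center>Random Forest_OPT | <center>0.0322 | <center>0.2311 |
| <center>Extra Trees_OPT | <center>1.37e-07 | <center>0.2225 |
| <center>GBDT_OPT | <center>0.0253 | <center>0.1980 |
| <center>Random Forest_Blending | <center>0.0323 | <center>0.2314 |

&emsp;&emsp;Next, train the Extra Trees model. The training process is as follows: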

In [167]:
ET_space = {'max_depth': hp.quniform('max_depth', 2, 50, 1), 
            'n_estimators': hp.quniform('n_estimators', 20, 700, 1), 
            'max_features': hp.choice('max_features', max_features_range)}

In [168]:
def ET_param_objective(params, train=True):
    
    # Read the hyperparameters
    max_depth = int(params['max_depth'])
    n_estimators = int(params['n_estimators'])
    
    if train == True:
        max_features = params['max_features']
        
    else:
        # fmin reports the hp.choice entry as an index into max_features_range
        max_features = max_features_range[params['max_features']]
    
    # Build the model
    reg_ET = ExtraTreesRegressor(max_depth = max_depth, 
                                 n_estimators = n_estimators, 
                                 max_features = max_features, 
                                 random_state=12)

    if train == True:
        res = -cross_val_score(reg_ET, X_train1, y_train1, scoring='neg_mean_squared_error', n_jobs=15).mean()
    else:
        # Note: this refit uses the full X_train, which contains the hold-out rows X_train2
        res = reg_ET.fit(X_train, y_train)
        
    return res

In [169]:
def ET_param_search(max_evals=500):
    params_best = fmin(ET_param_objective,
                       space = ET_space,
                       algo = tpe.suggest,
                       max_evals = max_evals)
    return params_best

In [170]:
ET_best_param = ET_param_search(100)

100%|█████████████████████████████████████████████| 100/100 [04:23<00:00,  2.63s/trial, best loss: 0.24416803714988985]


In [171]:
ET_best_param

{'max_depth': 32.0, 'max_features': 7, 'n_estimators': 339.0}

In [172]:
ET_Blending = ET_param_objective(ET_best_param, train=False)

In [173]:
ET_Blending

ExtraTreesRegressor(max_depth=32, max_features=0.4, n_estimators=339,
                    random_state=12)

Evaluate the model on the training and test sets:

In [174]:
mean_squared_error(ET_Blending.predict(X_train1), y_train1), mean_squared_error(ET_Blending.predict(X_test), y_test)

(0.0003684233856979253, 0.22303701455306532)

Likewise, the Extra Trees model is slightly worse than before (test MSE 0.2230 versus 0.2225):

| Model | Train MSE | Test MSE |
| ------ | ------ | ------ |
| <center>Random Forest_OPT | <center>0.0322 | <center>0.2311 |
| <center>Extra Trees_OPT | <center>1.37e-07 | <center>0.2225 |
| <center>GBDT_OPT | <center>0.0253 | <center>0.1980 |
| <center>Random Forest_Blending | <center>0.0323 | <center>0.2314 |
| <center>Extra Trees_Blending | <center>0.0003 | <center>0.2230 |

&emsp;&emsp;Finally, train the GBDT model. The training process is as follows:

In [175]:
GBR_space = {'n_estimators': hp.quniform('n_estimators', 20, 701, 1),
             'learning_rate': hp.uniform('learning_rate', 0.02, 0.2),
             'subsample': hp.uniform('subsample', 0.1, 1.0),
             'max_depth': hp.quniform('max_depth', 2, 20, 1)}

In [176]:
def GBR_param_objective(params, train=True):
    n_estimators = int(params['n_estimators'])
    learning_rate = params['learning_rate']
    subsample = params['subsample']
    max_depth = int(params['max_depth'])
    
    reg_GBR = GradientBoostingRegressor(n_estimators = n_estimators, 
                                        learning_rate = learning_rate, 
                                        subsample = subsample, 
                                        max_depth = max_depth, 
                                        random_state=12)
    if train == True:
        res = -cross_val_score(reg_GBR, X_train1, y_train1, scoring='neg_mean_squared_error', n_jobs=15).mean()
    else:
        # Note: this refit uses the full X_train, which contains the hold-out rows X_train2
        res = reg_GBR.fit(X_train, y_train)
    return res

In [177]:
def GBR_param_search(max_evals=500):
    params_best = fmin(GBR_param_objective,
                       space = GBR_space,
                       algo = tpe.suggest,
                       max_evals = max_evals)
    return params_best

In [178]:
GBR_best_param = GBR_param_search(50)

100%|███████████████████████████████████████████████| 50/50 [07:14<00:00,  8.70s/trial, best loss: 0.21898687697414307]


In [179]:
GBR_best_param

{'learning_rate': 0.07520948105762229,
 'max_depth': 6.0,
 'n_estimators': 608.0,
 'subsample': 0.57387966677063}

In [180]:
GBR_Blending = GBR_param_objective(GBR_best_param, train=False)

In [181]:
GBR_Blending

GradientBoostingRegressor(learning_rate=0.07520948105762229, max_depth=6,
                          n_estimators=608, random_state=12,
                          subsample=0.57387966677063)

In [182]:
mean_squared_error(GBR_Blending.predict(X_train1), y_train1), mean_squared_error(GBR_Blending.predict(X_test), y_test)

(0.046335758486162816, 0.2015167616055982)

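All three models score worse than their counterparts from the previous subsection: test MSE rises by about 0.0003 for random forest, 0.0005 for Extra Trees, and 0.0035 for GBDT:

| Model | Train MSE | Test MSE |
| ------ | ------ | ------ |
| <center>Random Forest_OPT | <center>0.0322 | <center>0.2311 |
| <center>Extra Trees_OPT | <center>1.37e-07 | <center>0.2225 |
| <center>GBDT_OPT | <center>0.0253 | <center>0.1980 |
| <center>Random Forest_Blending | <center>0.0323 | <center>0.2314 |
| <center>Extra Trees_Blending | <center>0.0003 | <center>0.2230 |
| <center>GBDT_Blending | <center>0.0463 | <center>0.2015 |

- Training the meta-learner for Blending

&emsp;&emsp;Next comes meta-learner training, which starts with preparing the meta-learner's training data. Unlike Stacking, Blending trains the meta-learner on the level-1 learners' predictions for the hold-out set. Here we build the `train_oof_blending` dataset as follows: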

In [183]:
RF_Blending.predict(X_train2)

array([1.35470607, 1.63394409, 2.60049043, ..., 1.95473165, 2.86436927,
       3.0097876 ])

In [184]:
train_oof_blending = pd.DataFrame({'RF_oof_blending': RF_Blending.predict(X_train2), 
                                   'ET_oof_blending': ET_Blending.predict(X_train2),
                                   'GBR_oof_blending': GBR_Blending.predict(X_train2)})

In [185]:
train_oof_blending

| | RF_oof_blending | ET_oof_blending | GBR_oof_blending |
| ------ | ------ | ------ | ------ |
| 0 | 1.354706 | 1.322524 | 1.356082 |
| 1 | 1.633944 | 1.641734 | 1.727232 |
| 2 | 2.600490 | 2.605924 | 2.720413 |
| 3 | 1.376776 | 1.234060 | 1.499121 |
| 4 | 0.688415 | 0.609165 | 0.784617 |
| ... | ... | ... | ... |
| 3298 | 3.573957 | 3.653165 | 3.708306 |
| 3299 | 2.915882 | 2.909153 | 2.976229 |
| 3300 | 1.954732 | 1.880811 | 1.829703 |
| 3301 | 2.864369 | 2.705647 | 2.730535 |


Then build the `test_predict_blending` dataset from the level-1 learners' predictions on the test set:

In [186]:
test_predict_blending = pd.DataFrame({'RF_oof_blending': RF_Blending.predict(X_test), 
                                      'ET_oof_blending': ET_Blending.predict(X_test),
                                      'GBR_oof_blending': GBR_Blending.predict(X_test)})

In [187]:
test_predict_blending

| | RF_oof_blending | ET_oof_blending | GBR_oof_blending |
| ------ | ------ | ------ | ------ |
| 0 | 2.012107 | 2.119217 | 2.020528 |
| 1 | 2.023596 | 1.970096 | 1.983483 |
| 2 | 2.092524 | 2.172498 | 2.025244 |
| 3 | 1.621594 | 1.528015 | 1.408382 |
| 4 | 1.152850 | 1.178118 | 0.983733 |
| ... | ... | ... | ... |
| 4123 | 2.722764 | 2.673605 | 2.814638 |
| 4124 | 3.165139 | 2.821868 | 3.272724 |
| 4125 | 0.819155 | 0.905988 | 0.833379 |
| 4126 | 4.021703 | 3.926552 | 4.247018 |


&emsp;&emsp;Next, we test meta-learners for Blending. As with Stacking, we start with a broad sweep over twelve regression models:

In [190]:
# Linear regression
lr_reg = LinearRegression().fit(train_oof_blending, y_train2)
lr_train_prediction = lr_reg.predict(train_oof_blending)
lr_test_prediction = lr_reg.predict(test_predict_blending)
print('The results of LR-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(lr_train_prediction, y_train2), 
                                       mean_squared_error(lr_test_prediction, y_test)))

# Ridge regression
Ridge_reg = Ridge().fit(train_oof_blending, y_train2)
Ridge_train_prediction = Ridge_reg.predict(train_oof_blending)
Ridge_test_prediction = Ridge_reg.predict(test_predict_blending)
print('The results of Ridge-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(Ridge_train_prediction, y_train2), 
                                       mean_squared_error(Ridge_test_prediction, y_test)))


# LASSO
Lasso_reg = Lasso().fit(train_oof_blending, y_train2)
Lasso_train_prediction = Lasso_reg.predict(train_oof_blending)
Lasso_test_prediction = Lasso_reg.predict(test_predict_blending)
print('The results of Lasso-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(Lasso_train_prediction, y_train2), 
                                       mean_squared_error(Lasso_test_prediction, y_test)))


# Elastic net
ElasticNet_reg = ElasticNet().fit(train_oof_blending, y_train2)
ElasticNet_train_prediction = ElasticNet_reg.predict(train_oof_blending)
ElasticNet_test_prediction = ElasticNet_reg.predict(test_predict_blending)
print('The results of ElasticNet-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(ElasticNet_train_prediction, y_train2), 
                                       mean_squared_error(ElasticNet_test_prediction, y_test)))

# Bayesian ridge regression
BayesianRidge_reg = BayesianRidge().fit(train_oof_blending, y_train2)
BayesianRidge_train_prediction = BayesianRidge_reg.predict(train_oof_blending)
BayesianRidge_test_prediction = BayesianRidge_reg.predict(test_predict_blending)
print('The results of BayesianRidge-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(BayesianRidge_train_prediction, y_train2), 
                                       mean_squared_error(BayesianRidge_test_prediction, y_test)))


# SVR
SVR_reg = SVR().fit(train_oof_blending, y_train2)
SVR_train_prediction = SVR_reg.predict(train_oof_blending)
SVR_test_prediction = SVR_reg.predict(test_predict_blending)
print('The results of SVR-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(SVR_train_prediction, y_train2), 
                                       mean_squared_error(SVR_test_prediction, y_test)))


# Decision tree regression
tree_reg = DecisionTreeRegressor().fit(train_oof_blending, y_train2)
tree_train_prediction = tree_reg.predict(train_oof_blending)
tree_test_prediction = tree_reg.predict(test_predict_blending)
print('The results of tree_reg-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(tree_train_prediction, y_train2), 
                                       mean_squared_error(tree_test_prediction, y_test)))


# Bagging
bagging_reg = BaggingRegressor().fit(train_oof_blending, y_train2)
bagging_train_prediction = bagging_reg.predict(train_oof_blending)
bagging_test_prediction = bagging_reg.predict(test_predict_blending)
print('The results of Bagging-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(bagging_train_prediction, y_train2), 
                                       mean_squared_error(bagging_test_prediction, y_test)))

# Random forest
RFR = RandomForestRegressor().fit(train_oof_blending, y_train2)
RFR_train_prediction = RFR.predict(train_oof_blending)
RFR_test_prediction = RFR.predict(test_predict_blending)
print('The results of RFR-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(RFR_train_prediction, y_train2), 
                                       mean_squared_error(RFR_test_prediction, y_test)))

# AdaBoost
ABR = AdaBoostRegressor().fit(train_oof_blending, y_train2)
ABR_train_prediction = ABR.predict(train_oof_blending)
ABR_test_prediction = ABR.predict(test_predict_blending)
print('The results of ABR-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(ABR_train_prediction, y_train2), 
                                       mean_squared_error(ABR_test_prediction, y_test)))

# GBDT
GBR = GradientBoostingRegressor().fit(train_oof_blending, y_train2)
GBR_train_prediction = GBR.predict(train_oof_blending)
GBR_test_prediction = GBR.predict(test_predict_blending)
print('The results of GBR-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(GBR_train_prediction, y_train2), 
                                       mean_squared_error(GBR_test_prediction, y_test)))

# XGB
XGB = XGBRegressor().fit(train_oof_blending, y_train2)
XGB_train_prediction = XGB.predict(train_oof_blending)
XGB_test_prediction = XGB.predict(test_predict_blending)
print('The results of XGB-final:')
print('Train-MSE: %f, Test-MSE: %f' % (mean_squared_error(XGB_train_prediction, y_train2), 
                                       mean_squared_error(XGB_test_prediction, y_test)))

The results of LR-final:
Train-MSE: 0.000295, Test-MSE: 0.225926
The results of Ridge-final:
Train-MSE: 0.000299, Test-MSE: 0.225453
The results of Lasso-final:
Train-MSE: 0.764305, Test-MSE: 0.914045
The results of ElasticNet-final:
Train-MSE: 0.299994, Test-MSE: 0.475550
The results of BayesianRidge-final:
Train-MSE: 0.000295, Test-MSE: 0.225926
The results of SVR-final:
Train-MSE: 0.002158, Test-MSE: 0.222606
The results of tree_reg-final:
Train-MSE: 0.000000, Test-MSE: 0.224173
The results of Bagging-final:
Train-MSE: 0.000064, Test-MSE: 0.223413
The results of RFR-final:
Train-MSE: 0.000049, Test-MSE: 0.223417
The results of ABR-final:
Train-MSE: 0.004504, Test-MSE: 0.224965
The results of GBR-final:
Train-MSE: 0.000233, Test-MSE: 0.223113
The results of XGB-final:
Train-MSE: 0.000043, Test-MSE: 0.224620
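
A sweep like the one above can also be written as a loop over a dictionary of candidates, which keeps each label tied to its model. The sketch below is one possible shape under stated assumptions: `evaluate_meta_learners` is a hypothetical helper, the meta-features are synthetic, and only a handful of scikit-learn regressors are included to keep it short.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

def evaluate_meta_learners(candidates, X_meta_train, y_meta_train, X_meta_test, y_meta_test):
    """Hypothetical helper: fit each candidate, return {name: (train_mse, test_mse)}."""
    results = {}
    for name, model in candidates.items():
        model.fit(X_meta_train, y_meta_train)
        results[name] = (
            mean_squared_error(y_meta_train, model.predict(X_meta_train)),
            mean_squared_error(y_meta_test, model.predict(X_meta_test)),
        )
    return results

# Synthetic meta-features: three noisy copies of the target (assumption)
rng = np.random.RandomState(0)
target = rng.normal(size=600)
meta_X = np.column_stack([target + rng.normal(scale=s, size=600) for s in (0.3, 0.4, 0.5)])

candidates = {"LR": LinearRegression(), "Ridge": Ridge(), "Lasso": Lasso(),
              "BayesianRidge": BayesianRidge(),
              "Tree": DecisionTreeRegressor(random_state=0)}
scores = evaluate_meta_learners(candidates, meta_X[:400], target[:400],
                                meta_X[400:], target[400:])
for name, (train_mse, test_mse) in scores.items():
    print('%s  Train-MSE: %f, Test-MSE: %f' % (name, train_mse, test_mse))
```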


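The results show that every meta-learner performs worse than before: test MSE sits between roughly 0.223 and 0.226 for all models except Lasso and ElasticNet, far behind Stacking and even behind weighted-average fusion, and linear regression and Bayesian regression no longer stand out. The cause is meta-learner overfitting. Even linear regression reaches a training MSE of 0.000295 against a test MSE of 0.2259, which indicates that `train_oof_blending` leaves a linear model very little to learn; combined with the weaker level-1 learners, the overall Blending result degrades. One likely contributor is that the `train=False` branch of the objective functions above fits each level-1 learner on `X_train`, which contains `X_train2`, so the hold-out predictions are effectively in-sample.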

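> In general, if the level-1 learners lose little accuracy during Blending while the meta-learner overfits, the hold-out set can be given a larger share of the data. The situation here is unusual, however: the level-1 learners also degrade quickly, so we can essentially conclude that Blending is ineffective for the current problem.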
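
When a linear meta-learner reaches a near-zero training MSE, it is worth checking whether the hold-out predictions it was trained on are truly out-of-sample. The sketch below contrasts the two cases on synthetic data (all names are illustrative assumptions): an Extra Trees model fitted on all rows, hold-out included, versus one fitted on the level-1 part only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data as a stand-in for the tutorial's training set (assumption)
X_all, y_all = make_regression(n_samples=1000, n_features=8, noise=20.0, random_state=12)
X_part1, X_hold, y_part1, y_hold = train_test_split(X_all, y_all, test_size=0.2, random_state=12)

# Fitted on every row, hold-out included: hold-out predictions are in-sample
et_full = ExtraTreesRegressor(n_estimators=50, random_state=12).fit(X_all, y_all)
# Fitted on the level-1 part only: hold-out predictions are out-of-sample
et_part = ExtraTreesRegressor(n_estimators=50, random_state=12).fit(X_part1, y_part1)

mse_in_sample = mean_squared_error(y_hold, et_full.predict(X_hold))
mse_out_of_sample = mean_squared_error(y_hold, et_part.predict(X_hold))
print(mse_in_sample, mse_out_of_sample)
```

In the in-sample case the level-1 predictions nearly reproduce the targets, so a meta-learner trained on them learns to trust that model almost completely, and that trust does not carry over to the test set.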

&emsp;&emsp;With Blending performing this poorly overall, meta-learner optimization is unlikely to rescue it:

In [199]:
start = time.time()

# Define the hyperparameter grid
parameter_space = {
    "n_estimators": range(10, 21), 
    "max_samples": np.arange(0.1, 1.1, 0.1).tolist()}

# Instantiate the Bagging model and the grid search
bagging_final = BaggingRegressor(base_estimator=LinearRegression(), random_state=22)

BG = GridSearchCV(bagging_final, parameter_space, n_jobs=15)

# Fit the grid search
BG.fit(train_oof_blending, y_train2)

print(time.time()-start)

0.9558672904968262


In [200]:
BG.best_params_

{'max_samples': 0.6, 'n_estimators': 11}

In [201]:
BG.best_score_

0.9997714009375949

In [202]:
mean_squared_error(BG.predict(train_oof_blending), y_train2), mean_squared_error(BG.predict(test_predict_blending), y_test)

(0.00029533441090624546, 0.22602045954083486)

Then use Bayesian regression as the base estimator in the Bagging process:

In [207]:
start = time.time()

# Define the hyperparameter grid
parameter_space = {
    "n_estimators": range(10, 21), 
    "max_samples": np.arange(0.1, 1.1, 0.1).tolist()}

# Instantiate the Bagging model and the grid search
bagging_final = BaggingRegressor(base_estimator=BayesianRidge(), random_state=1)

BG = GridSearchCV(bagging_final, parameter_space, n_jobs=15)

# Fit the grid search
BG.fit(train_oof_blending, y_train2)

print(time.time()-start)

1.1673202514648438


In [208]:
BG.best_params_

{'max_samples': 0.1, 'n_estimators': 10}

In [209]:
BG.best_score_

0.999772111450814

In [210]:
mean_squared_error(BG.predict(train_oof_blending), y_train2), mean_squared_error(BG.predict(test_predict_blending), y_test)

(0.0002955653247170366, 0.22601985525046403)

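&emsp;&emsp;This completes a series of model fusion exercises for regression problems. In terms of method families, regression and classification do not differ in essence, but the practical effect of each specific method varies widely from case to case: optimizations that worked for the earlier classification problem do not necessarily work for the current regression problem. This is the norm in model fusion practice. No single method is guaranteed to help, so we need to try repeatedly and keep the best output, and, more importantly, understand the principles behind these methods in order to adapt them and keep improving.

&emsp;&emsp;Of course, the discussion of fusion methods for regression problems is not yet complete. In the next subsection we continue with advanced optimization strategies and add regression support to the automatic fusion function.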