# <center> [Kaggle] Telco Customer Churn: Telecom Customer Churn Prediction Case Study

---

## <font face="仿宋">Introduction to Part 4

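&emsp;&emsp;<font face="仿宋">In Parts 2 and 3 of this case study we covered feature engineering in detail. Broadly, feature engineering splits into three areas: data preprocessing, feature derivation, and feature selection. Data preprocessing cleans and organizes the dataset until it can be modeled; it includes missing-value handling, outlier handling, and data re-encoding, and it must be done before any modeling. Feature derivation and feature selection are better seen as optimization techniques that help a model break through the performance ceiling of the current dataset. Part 2 also gave a complete treatment of how to train, tune, and interpret interpretable machine learning models, namely logistic regression and decision trees, and so far these two algorithms have served as the main models for testing in every part.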

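&emsp;&emsp;<font face="仿宋">Part 4 turns to practical techniques for training and tuning ensemble learning models. Ensemble models are less interpretable than logistic regression and decision trees, but in most modeling scenarios they produce better predictions, which is why they are the most commonly used algorithms wherever predictive performance comes first.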

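&emsp;&emsp;<font face="仿宋">Overall, this part has a single goal: to use a range of optimization methods to reach the performance ceiling of each mainstream ensemble algorithm. In other words, this part is a detailed discussion of single-model optimization strategies. The ensemble algorithms covered are the current mainstream ones: random forest, XGBoost, LightGBM, and CatBoost. The optimization strategies include hyperparameter optimizers, feature derivation and selection, and single-model self-fusion. These are, to date, the most advanced and effective ways to improve a single model, and also the most complex. Some rest on fairly deep theory and many come from practical experience, but either way we want to push every ensemble algorithm to its limit on the current dataset. Note that throughout this process we treat the feature derivation and selection methods introduced earlier as model optimization methods, and judge them solely by the final model results. Large-scale feature derivation and selection around ensemble models is also where these techniques deliver their greatest value.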

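&emsp;&emsp;<font face="仿宋">Once the single models have reached their limit, we move on to the next stage: model fusion. Keep in mind that further multi-model fusion, or even multi-layer fusion, is only meaningful and effective once the single models have been pushed to their limit.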

---

# <center>Part 4. Training and Optimization Techniques for Ensemble Algorithms

In [1]:
# Core data science libraries
import numpy as np
import pandas as pd

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Timing
import time

import warnings
warnings.filterwarnings('ignore')

# scikit-learn
# Data preprocessing
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

# Utility functions
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import KFold

# Common estimators
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier

# Grid search
from sklearn.model_selection import GridSearchCV

# Base classes for custom estimators
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin

# Custom module
from telcoFunc import *

# Feature derivation module
import features_creation as fc
from features_creation import *

# Model fusion module
import manual_ensemble as me
from manual_ensemble import *

# Explicit imports for names used below
# (the star imports above may already provide them)
from joblib import load
from hyperopt import hp, fmin, tpe

# re and inspect
import inspect, re

# Other modules
from tqdm import tqdm
import gc

&emsp;&emsp;Then run the data cleaning steps from Part 1:

In [2]:
# Read the data
tcc = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

# Mark continuous and discrete fields
# Discrete fields
category_cols = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
                'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 
                'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
                'PaymentMethod']

# Continuous fields
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
 
# Label
target = 'Churn'

# ID column
ID_col = 'customerID'

# Check that the field lists cover every column
assert len(category_cols) + len(numeric_cols) + 2 == tcc.shape[1]

# Convert continuous fields
tcc['TotalCharges']= tcc['TotalCharges'].apply(lambda x: x if x!= ' ' else np.nan).astype(float)
tcc['MonthlyCharges'] = tcc['MonthlyCharges'].astype(float)

# Fill missing values
tcc['TotalCharges'] = tcc['TotalCharges'].fillna(0)

# Convert the label to 0/1 (plain assignment instead of an in-place replace on a column selection)
tcc['Churn'] = tcc['Churn'].map({'Yes': 1, 'No': 0})

In [3]:
features = tcc.drop(columns=[ID_col, target]).copy()
labels = tcc['Churn'].copy()

&emsp;&emsp;Next, create the ordinal-encoded dataset and the dataset with derived time-series features:

In [4]:
# Split into training and test sets
train, test = train_test_split(tcc, random_state=22)

X_train = train.drop(columns=[ID_col, target]).copy()
X_test = test.drop(columns=[ID_col, target]).copy()

y_train = train['Churn'].copy()
y_test = test['Churn'].copy()

X_train_seq = pd.DataFrame()
X_test_seq = pd.DataFrame()

# Derive year
X_train_seq['tenure_year'] = ((72 - X_train['tenure']) // 12) + 2014
X_test_seq['tenure_year'] = ((72 - X_test['tenure']) // 12) + 2014

# Derive month
X_train_seq['tenure_month'] = (72 - X_train['tenure']) % 12 + 1
X_test_seq['tenure_month'] = (72 - X_test['tenure']) % 12 + 1

# Derive quarter
X_train_seq['tenure_quarter'] = ((X_train_seq['tenure_month']-1) // 3) + 1
X_test_seq['tenure_quarter'] = ((X_test_seq['tenure_month']-1) // 3) + 1

# One-hot encoding
enc = preprocessing.OneHotEncoder()
enc.fit(X_train_seq)

seq_new = list(X_train_seq.columns)

# Build the one-hot encoded DataFrame with column names
X_train_seq = pd.DataFrame(enc.transform(X_train_seq).toarray(), 
                           columns = cate_colName(enc, seq_new, drop=None))

X_test_seq = pd.DataFrame(enc.transform(X_test_seq).toarray(), 
                          columns = cate_colName(enc, seq_new, drop=None))

# Restore the original index
X_train_seq.index = X_train.index
X_test_seq.index = X_test.index

In [5]:
ord_enc = OrdinalEncoder()
ord_enc.fit(X_train[category_cols])

X_train_OE = pd.DataFrame(ord_enc.transform(X_train[category_cols]), columns=category_cols)
X_train_OE.index = X_train.index
X_train_OE = pd.concat([X_train_OE, X_train[numeric_cols]], axis=1)

X_test_OE = pd.DataFrame(ord_enc.transform(X_test[category_cols]), columns=category_cols)
X_test_OE.index = X_test.index
X_test_OE = pd.concat([X_test_OE, X_test[numeric_cols]], axis=1)

Then come the third-party libraries, prepared data, and trained models needed for the model fusion part:

In [6]:
# Instantiate the KFold splitter
kf = KFold(n_splits=5, random_state=12, shuffle=True)

# Reset the index of the training features and labels
X_train_OE = X_train_OE.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)

train_part_index_l = []
eval_index_l = []

for train_part_index, eval_index in kf.split(X_train_OE, y_train):
    train_part_index_l.append(train_part_index)
    eval_index_l.append(eval_index)
    
# Training features
X_train1 = X_train_OE.loc[train_part_index_l[0]]
X_train2 = X_train_OE.loc[train_part_index_l[1]]
X_train3 = X_train_OE.loc[train_part_index_l[2]]
X_train4 = X_train_OE.loc[train_part_index_l[3]]
X_train5 = X_train_OE.loc[train_part_index_l[4]]

# Validation features
X_eval1 = X_train_OE.loc[eval_index_l[0]]
X_eval2 = X_train_OE.loc[eval_index_l[1]]
X_eval3 = X_train_OE.loc[eval_index_l[2]]
X_eval4 = X_train_OE.loc[eval_index_l[3]]
X_eval5 = X_train_OE.loc[eval_index_l[4]]

# Training labels
y_train1 = y_train.loc[train_part_index_l[0]]
y_train2 = y_train.loc[train_part_index_l[1]]
y_train3 = y_train.loc[train_part_index_l[2]]
y_train4 = y_train.loc[train_part_index_l[3]]
y_train5 = y_train.loc[train_part_index_l[4]]

# Validation labels
y_eval1 = y_train.loc[eval_index_l[0]]
y_eval2 = y_train.loc[eval_index_l[1]]
y_eval3 = y_train.loc[eval_index_l[2]]
y_eval4 = y_train.loc[eval_index_l[3]]
y_eval5 = y_train.loc[eval_index_l[4]]

train_set = [(X_train1, y_train1), 
             (X_train2, y_train2), 
             (X_train3, y_train3), 
             (X_train4, y_train4), 
             (X_train5, y_train5)]

eval_set = [(X_eval1, y_eval1), 
            (X_eval2, y_eval2), 
            (X_eval3, y_eval3), 
            (X_eval4, y_eval4), 
            (X_eval5, y_eval5)]

In [7]:
# Random forest models
grid_RF_1 = load('./models/grid_RF_1.joblib') 
grid_RF_2 = load('./models/grid_RF_2.joblib') 
grid_RF_3 = load('./models/grid_RF_3.joblib') 
grid_RF_4 = load('./models/grid_RF_4.joblib') 
grid_RF_5 = load('./models/grid_RF_5.joblib') 

RF_1 = grid_RF_1.best_estimator_
RF_2 = grid_RF_2.best_estimator_
RF_3 = grid_RF_3.best_estimator_
RF_4 = grid_RF_4.best_estimator_
RF_5 = grid_RF_5.best_estimator_

RF_l = [RF_1, RF_2, RF_3, RF_4, RF_5]

# Decision tree models
grid_tree_1 = load('./models/grid_tree_1.joblib')
grid_tree_2 = load('./models/grid_tree_2.joblib')
grid_tree_3 = load('./models/grid_tree_3.joblib')
grid_tree_4 = load('./models/grid_tree_4.joblib')
grid_tree_5 = load('./models/grid_tree_5.joblib')

tree_1 = grid_tree_1.best_estimator_
tree_2 = grid_tree_2.best_estimator_
tree_3 = grid_tree_3.best_estimator_
tree_4 = grid_tree_4.best_estimator_
tree_5 = grid_tree_5.best_estimator_

tree_l = [tree_1, tree_2, tree_3, tree_4, tree_5]

# Logistic regression models
grid_lr_1 = load('./models/grid_lr_1.joblib')
grid_lr_2 = load('./models/grid_lr_2.joblib')
grid_lr_3 = load('./models/grid_lr_3.joblib')
grid_lr_4 = load('./models/grid_lr_4.joblib')
grid_lr_5 = load('./models/grid_lr_5.joblib')

lr_1 = grid_lr_1.best_estimator_
lr_2 = grid_lr_2.best_estimator_
lr_3 = grid_lr_3.best_estimator_
lr_4 = grid_lr_4.best_estimator_
lr_5 = grid_lr_5.best_estimator_

lr_l = [lr_1, lr_2, lr_3, lr_4, lr_5]

In [8]:
eval1_predict_proba_RF = pd.Series(RF_l[0].predict_proba(X_eval1)[:, 1], index=X_eval1.index)
eval2_predict_proba_RF = pd.Series(RF_l[1].predict_proba(X_eval2)[:, 1], index=X_eval2.index)
eval3_predict_proba_RF = pd.Series(RF_l[2].predict_proba(X_eval3)[:, 1], index=X_eval3.index)
eval4_predict_proba_RF = pd.Series(RF_l[3].predict_proba(X_eval4)[:, 1], index=X_eval4.index)
eval5_predict_proba_RF = pd.Series(RF_l[4].predict_proba(X_eval5)[:, 1], index=X_eval5.index)

eval_predict_proba_RF = pd.concat([eval1_predict_proba_RF, 
                                   eval2_predict_proba_RF, 
                                   eval3_predict_proba_RF, 
                                   eval4_predict_proba_RF, 
                                   eval5_predict_proba_RF]).sort_index()

eval1_predict_proba_tree = pd.Series(tree_l[0].predict_proba(X_eval1)[:, 1], index=X_eval1.index)
eval2_predict_proba_tree = pd.Series(tree_l[1].predict_proba(X_eval2)[:, 1], index=X_eval2.index)
eval3_predict_proba_tree = pd.Series(tree_l[2].predict_proba(X_eval3)[:, 1], index=X_eval3.index)
eval4_predict_proba_tree = pd.Series(tree_l[3].predict_proba(X_eval4)[:, 1], index=X_eval4.index)
eval5_predict_proba_tree = pd.Series(tree_l[4].predict_proba(X_eval5)[:, 1], index=X_eval5.index)

eval_predict_proba_tree = pd.concat([eval1_predict_proba_tree, 
                                     eval2_predict_proba_tree, 
                                     eval3_predict_proba_tree, 
                                     eval4_predict_proba_tree, 
                                     eval5_predict_proba_tree]).sort_index()

eval1_predict_proba_lr = pd.Series(lr_l[0].predict_proba(X_eval1)[:, 1], index=X_eval1.index)
eval2_predict_proba_lr = pd.Series(lr_l[1].predict_proba(X_eval2)[:, 1], index=X_eval2.index)
eval3_predict_proba_lr = pd.Series(lr_l[2].predict_proba(X_eval3)[:, 1], index=X_eval3.index)
eval4_predict_proba_lr = pd.Series(lr_l[3].predict_proba(X_eval4)[:, 1], index=X_eval4.index)
eval5_predict_proba_lr = pd.Series(lr_l[4].predict_proba(X_eval5)[:, 1], index=X_eval5.index)

eval_predict_proba_lr = pd.concat([eval1_predict_proba_lr, 
                                   eval2_predict_proba_lr, 
                                   eval3_predict_proba_lr, 
                                   eval4_predict_proba_lr, 
                                   eval5_predict_proba_lr]).sort_index()

In [9]:
test_predict_proba_RF = []

for i in range(5):
    test_predict_proba_RF.append(RF_l[i].predict_proba(X_test_OE)[:, 1])

test_predict_proba_RF = np.array(test_predict_proba_RF)
test_predict_proba_RF = test_predict_proba_RF.mean(0)

test_predict_proba_tree = []

for i in range(5):
    test_predict_proba_tree.append(tree_l[i].predict_proba(X_test_OE)[:, 1])

test_predict_proba_tree = np.array(test_predict_proba_tree)
test_predict_proba_tree = test_predict_proba_tree.mean(0)

test_predict_proba_lr = []

for i in range(5):
    test_predict_proba_lr.append(lr_l[i].predict_proba(X_test_OE)[:, 1])

test_predict_proba_lr = np.array(test_predict_proba_lr)
test_predict_proba_lr = test_predict_proba_lr.mean(0)

## <center>Ch.3 Basic Model Fusion Methods

## 11. Advanced Optimization of Blending

### 1. Basic optimization ideas

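&emsp;&emsp;In the previous section we covered how to run Blending by hand and how to run it quickly with the manual_ensemble function library. We now turn to more advanced optimization strategies for Blending. Compared with Stacking, the core difference in Blending is the hold-out split and the resulting isolation between the training data of the level-1 learners and that of the meta-learner. How to optimize around the hold-out split strategy is therefore the main lever for improving Blending.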

- Approach 1: search for the best hold-out ratio

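&emsp;&emsp;As the previous section showed, the hold-out set is a double-edged sword. Splitting off a hold-out set isolates the data and so improves the generalization of the fused result, but a hold-out ratio that is too large or too small hurts Blending. The larger the hold-out ratio, the weaker the level-1 learners and the stronger the meta-learner; with a small hold-out ratio the level-1 learners are stronger, but the meta-learner is at high risk of overfitting (in the extreme case of an empty hold-out set, Blending degenerates into Stacking). The most obvious optimization is therefore to find a moderate hold-out ratio that balances the competing learning capacities of the level-1 learners and the meta-learner, and so improves the final Blending result. This is the first line of optimization.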

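&emsp;&emsp;Turning this idea into an executable plan is not easy, though. The difficulty is not the code but the limited compute. Machine learning is an empirical, after-the-fact discipline, so finding a suitable hold-out ratio takes a very large number of trials; the typical approach is to treat the ratio as a hyperparameter and let an optimizer search for a reliable value. But as the previous section showed, a single Blending run can easily take half an hour or even several hours, so a very large number of trials is essentially out of reach for an individual user. Even Bayesian optimization, which estimates the best ratio from relatively few trials, needs at least 100 to 500 evaluations to search the 50%-90% range. To make the search feasible, each Blending run has to be as short as possible. For example, the level-1 learners can skip per-model hyperparameter optimization during cross-training. This lowers the accuracy of each individual Blending run, but the shorter runs let the outer optimizer find a good ratio quickly, after which a stronger Blending can be trained at that ratio.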
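To make the trade-off concrete, the following sketch (not part of the original notebook) counts how many rows the level-1 learners and the meta-learner each receive at different hold-out ratios. The helper `split_sizes` is hypothetical, and the row count 5282 is an assumption corresponding to a 75% training split of the 7043-row Telco dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_sizes(n_rows, ratios, random_state=12):
    """Return {ratio: (n_level1_rows, n_holdout_rows)} for each hold-out ratio."""
    idx = np.arange(n_rows)  # stand-in for the row index of the training set
    sizes = {}
    for r in ratios:
        part, holdout = train_test_split(idx, test_size=r, random_state=random_state)
        sizes[r] = (len(part), len(holdout))
    return sizes

# Assumed training-set size; adjust to the actual number of rows
sizes = split_sizes(5282, [0.1, 0.2, 0.3, 0.4, 0.5])
for r, (n_part, n_hold) in sizes.items():
    print(f"hold-out={r:.1f}: level-1 rows={n_part}, meta-learner rows={n_hold}")
```

Every row moved into the hold-out set is taken away from the level-1 learners, which is why the two sides cannot both be maximized.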

- Approach 2: split several times and build a (weighted) average fusion over the Blending results

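&emsp;&emsp;Beyond this, long experience with model fusion suggests another way to look at differences between results: the variation caused by different hold-out ratios may itself be a stepping stone to a better result. We can set up several hold-out sets with different ratios, train a separate Blending pipeline on each, have all of them predict the same test set, and then fuse those predictions by (weighted) averaging. This amounts to a multi-layer model fusion. For example, we can use five train : hold-out splits from 5:5 to 9:1, train five Blending pipelines, and then (weighted-)average their results. The basic workflow is shown in the figure below: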

<center><img src="http://ml2022.oss-cn-hangzhou.aliyuncs.com/img/image-20221019190652003.png" alt="image-20221019190652003" style="zoom:50%;" />

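Compared with Approach 1, the second approach saves time and effort and often gives a decent result. In theory, what makes (weighted) average fusion work is that its inputs are strong yet different. In the workflow above, each Blending result provides the strength, and the hold-out ratio strongly affects the Blending result, which provides the difference. That is the theoretical basis for this approach.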

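&emsp;&emsp;Voting and averaging can themselves be viewed as two-layer models: the first layer consists of the level-1 learners, and the second layer is simply the vote or the average. For convenience, from here on we refer to the base learners of voting and averaging as level-1 learners too. On that view, the process above is a (weighted) average fusion whose level-1 learners are Blending pipelines. Because only five models take part in the fusion, each can be trained as well as possible; for the second-layer (weighted) average, we can try either a simple average or a weighted average with manually set weights.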
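The second-layer averaging step can be sketched in a few lines of NumPy. This is an illustration only: `average_fusion` is a hypothetical helper, and the three probability vectors and the weights are made-up values standing in for the test-set predictions of separate Blending runs.

```python
import numpy as np

def average_fusion(prob_list, weights=None):
    """(Weighted) average of several probability vectors of equal length."""
    probs = np.vstack(prob_list)  # shape: (n_models, n_samples)
    return np.average(probs, axis=0, weights=weights)

# Toy predicted probabilities from three hypothetical Blending runs
p1 = np.array([0.10, 0.80, 0.40])
p2 = np.array([0.20, 0.70, 0.60])
p3 = np.array([0.30, 0.90, 0.50])

simple = average_fusion([p1, p2, p3])                           # equal weights
weighted = average_fusion([p1, p2, p3], weights=[0.5, 0.3, 0.2])  # manual weights
labels = (weighted >= 0.5).astype(int)                          # threshold at 0.5
```

With manual weights, a natural choice is to give more weight to the runs that scored better on their own validation data, although that choice is itself something to validate.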

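> However complex a machine learning algorithm is, it is at bottom a computational procedure, no different in kind from taking an average. An algorithm is, by its basic definition, a computational procedure, so computing a mean is itself an algorithm.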

- Expected effect of the optimizations

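&emsp;&emsp;Either approach is clearly a more complex fusion pipeline. As noted at the start of the Stacking section, the more complex the pipeline, the more easily it overfits, and the strategies in this section are no exception: as the fusion pipeline grows more complex, the fused result becomes more prone to overfitting. In practice, though, how strongly a complex fusion pipeline overfits depends heavily on the data. In general, the simpler the data (fewer samples, fewer features), the stronger the tendency to overfit; on more complex datasets, the strong learning capacity behind a complex fusion pipeline often improves the final predictions. Since the current dataset is fairly simple, these strategies will most likely overfit here, but on more complex datasets later on they play a central role in raising the score. This section therefore focuses on the reasoning behind each method and on how to implement it, rather than on comparing prediction results.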

> More methods bring more varied results and more possibilities, and producing that variety is the core task of the model fusion stage.

&emsp;&emsp;Next, we implement both strategies and test whether they improve the fused result.

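### 2. Approach 1: search for the best split ratio

&emsp;&emsp;We start with the first approach: using a Bayesian optimizer to search for the best hold-out ratio. As explained above, this requires first building a cheaper Blending pipeline, then wrapping it in an objective function used to search for the best hold-out ratio, and finally running one last high-accuracy Blending at that ratio to improve the predictions.

- Simplifying the Blending pipeline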

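&emsp;&emsp;We first build a faster Blending pipeline. As level-1 learners we directly use the three models whose best hyperparameters were previously found on the full training data, and we skip the per-fold hyperparameter search, cross-training with fixed hyperparameters instead. We also simplify the meta-learner step by plugging in logistic regression directly. The resulting faster Blending pipeline is: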

In [55]:
tree = load('./models/tree_model.joblib')
RF = load('./models/RF_0.joblib')
logistic_search = load('./models/logistic_search.joblib')

lr = logistic_search.best_estimator_

In [56]:
estimators = [('lr', lr), ('tree', tree), ('RF', RF)]

In [57]:
start = time.time()
train_oof_blending, test_predict_blending = train_cross(X_train_OE, 
                                                        y_train, 
                                                        X_test_OE,
                                                        estimators, 
                                                        blending=True)
print(time.time()-start)

1.351806879043579


In [58]:
lr = LogisticRegression().fit(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1])
print('The results of LR-final:')
print('Train2-Accuracy: %f, Test-Accuracy: %f' % (lr.score(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1]), lr.score(test_predict_blending, y_test)))

The results of LR-final:
Train2-Accuracy: 0.822138, Test-Accuracy: 0.788756


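A single run takes only about 1 s. With the default train : hold-out split of 8:2, the final test accuracy is about 0.789. The training and prediction data at this point look as follows: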

In [29]:
train_oof_blending

| | lr_oof | tree_oof | RF_oof | Churn |
|---|---|---|---|---|
| 0 | 0.067367 | 0.063791 | 0.024685 | 0 |
| 1 | 0.470657 | 0.388356 | 0.350602 | 1 |
| 2 | 0.147403 | 0.131717 | 0.173163 | 0 |
| 3 | 0.712578 | 0.779534 | 0.721254 | 1 |
| 4 | 0.427978 | 0.484931 | 0.421293 | 0 |
| ... | ... | ... | ... | ... |
| 1052 | 0.023757 | 0.044198 | 0.048487 | 0 |
| 1053 | 0.597838 | 0.552773 | 0.559185 | 0 |
| 1054 | 0.012680 | 0.044198 | 0.029577 | 0 |
| 1055 | 0.004052 | 0.044198 | 0.014272 | 0 |


In [30]:
test_predict_blending

| | lr_predict | tree_predict | RF_predict |
|---|---|---|---|
| 0 | 0.031213 | 0.044198 | 0.004815 |
| 1 | 0.235520 | 0.131717 | 0.339551 |
| 2 | 0.004348 | 0.044198 | 0.002016 |
| 3 | 0.026678 | 0.044198 | 0.009825 |
| 4 | 0.061598 | 0.063791 | 0.043481 |
| ... | ... | ... | ... |
| 1756 | 0.158728 | 0.211204 | 0.179236 |
| 1757 | 0.026341 | 0.044198 | 0.068920 |
| 1758 | 0.148644 | 0.131717 | 0.167651 |
| 1759 | 0.493872 | 0.473148 | 0.565447 |


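> In theory, using models trained on the full training data to evaluate hold-out ratios may leak hold-out information in advance. However, the hold-out set will be re-split many times, and retraining the models each time is impractical, so this leakage cannot be avoided. Moreover, in practice the best single models are always trained on the full training set before model fusion, so the Blending pipeline above is very easy to set up in real modeling work.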
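For readers without the course's `manual_ensemble` module, the basic Blending idea can be sketched with scikit-learn alone. This is a simplified illustration under stated assumptions: it uses synthetic data rather than the Telco dataset, fits each level-1 learner once on the training part instead of cross-training, and is not the implementation behind `train_cross`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Telco dataset)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

# Split off the hold-out set that only the meta-learner trains on
X_part, X_hold, y_part, y_hold = train_test_split(
    X_train, y_train, test_size=0.2, random_state=12)

base_models = [('lr', LogisticRegression(max_iter=1000)),
               ('tree', DecisionTreeClassifier(max_depth=3, random_state=0)),
               ('RF', RandomForestClassifier(n_estimators=50, random_state=0))]

hold_meta, test_meta = [], []
for name, model in base_models:
    model.fit(X_part, y_part)                              # level-1 training
    hold_meta.append(model.predict_proba(X_hold)[:, 1])    # meta-learner features
    test_meta.append(model.predict_proba(X_test)[:, 1])    # test-time features

hold_meta = np.column_stack(hold_meta)
test_meta = np.column_stack(test_meta)

# The meta-learner is fitted only on hold-out predictions
meta = LogisticRegression().fit(hold_meta, y_hold)
test_acc = meta.score(test_meta, y_test)
print('hold-out accuracy:', meta.score(hold_meta, y_hold))
print('test accuracy:', test_acc)
```

Because the level-1 learners never see the hold-out rows, the meta-learner's inputs are genuine out-of-sample predictions, which is the data isolation this section relies on.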

- Searching for the best hold-out ratio

&emsp;&emsp;Next we set up a TPE search for the best hold-out ratio. First come the search space and the objective function:

In [31]:
train_cross?

Signature:
train_cross(
    X_train,
    y_train,
    X_test,
    estimators,
    test_size=0.2,
    n_splits=5,
    random_state=12,
    blending=False,
)
Docstring:
Cross-training function for the level-1 learners in Stacking fusion

:param X_train: training set features
:param y_train: training set labels
:param X_test: test set features
:param estimators: level-1 learners, a list of (name, estimator) tuples
:param n_splits: number of cross-training folds
:param test_size: hold-out proportion used for blending
:param random_state: random seed
:param blending: whether to perform blending fusion

:return: the oof training data built by cross-training and the averaged test set predictions; the oof data contains both features and labels, with the label in the last column
File:      d:\work\jupyter\telco\正式课程\manual_ensemble.py
Type:      function


In [32]:
split_space = {'test_size': hp.uniform('test_size', 0.1, 0.5)}

In [33]:
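def split_res(params, train=True):
    # Hold-out ratio proposed by the optimizer
    test_size = params['test_size']
    train_oof_blending, test_predict_blending = train_cross(X_train_OE, 
                                                            y_train, 
                                                            X_test_OE,
                                                            estimators, 
                                                            blending=True, 
                                                            test_size=test_size)
    # Fit the meta-learner on the hold-out predictions (the label is the last column)
    lr = LogisticRegression().fit(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1])
    if train:
        # Training mode: return the negative accuracy, because fmin minimizes
        res = -lr.score(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1])
    else:
        # Test mode: return the data produced for the given split ratio
        res = (train_oof_blending, test_predict_blending)
    return res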

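As before, the objective function has a training mode and a test mode. In training mode it returns the negative of the meta-learner's accuracy on the hold-out set (negated because fmin minimizes its objective); in test mode it returns the oof training data and the test-set predictions for the given split ratio. The test mode could equally be set up to return the final predicted probabilities on the hold-out and test sets.

Finally, the optimization function: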

In [71]:
def param_split_res(max_evals):
    params_best = fmin(fn = split_res,
                       space = split_space,
                       algo = tpe.suggest,
                       max_evals = max_evals, 
                       rstate=np.random.RandomState(11))    
    
    return params_best

&emsp;&emsp;Next, we check that the optimization runs:

In [72]:
best_split = param_split_res(100)

100%|█████████████████████████████████████████████| 100/100 [02:14<00:00,  1.34s/trial, best loss: -0.8321167883211679]


In [73]:
best_split

{'test_size': 0.1295881342481578}

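> Restricting the precision of the searched value would make the search more efficient. Given how small the dataset is, such a high-precision result is not really needed.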
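One way to limit precision is to search on a coarse grid of ratios instead of a continuous interval. The sketch below only emulates that rounding with NumPy and does not call hyperopt; hyperopt offers `hp.quniform(label, low, high, q)` for quantized spaces of this kind. The step size 0.01 and the helper `quantize` are assumptions chosen for illustration.

```python
import numpy as np

def quantize(x, q):
    """Round x to the nearest multiple of q."""
    return np.round(np.asarray(x) / q) * q

rng = np.random.default_rng(11)
raw = rng.uniform(0.1, 0.5, size=100)       # 100 continuous candidate ratios
coarse = np.round(quantize(raw, 0.01), 2)   # the same candidates on a 0.01 grid

n_unique_raw = len(np.unique(raw))
n_unique_coarse = len(np.unique(coarse))
print(n_unique_raw, n_unique_coarse)
```

On a 0.01 grid there are at most 41 distinct ratios between 0.1 and 0.5, so repeated candidates could reuse a cached Blending result instead of triggering a new run.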

The best hold-out ratio found by the search is about 13%. Next we test the Blending result at this ratio:

In [74]:
train_oof_blending, test_predict_blending = split_res(best_split,train=False)

In [75]:
train_oof_blending

| | lr_oof | tree_oof | RF_oof | Churn |
|---|---|---|---|---|
| 0 | 0.074883 | 0.069631 | 0.026061 | 0 |
| 1 | 0.475950 | 0.409022 | 0.347158 | 1 |
| 2 | 0.137676 | 0.090934 | 0.193092 | 0 |
| 3 | 0.713064 | 0.763057 | 0.715137 | 1 |
| 4 | 0.415286 | 0.560982 | 0.385007 | 0 |
| ... | ... | ... | ... | ... |
| 680 | 0.607143 | 0.763057 | 0.743935 | 1 |
| 681 | 0.005990 | 0.050797 | 0.008541 | 0 |
| 682 | 0.388037 | 0.294272 | 0.307472 | 0 |
| 683 | 0.656585 | 0.235635 | 0.612801 | 1 |


In [76]:
lr = LogisticRegression().fit(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1])
print('The results of LR-final:')
print('Train-oof-Accuracy: %f, Test-Accuracy: %f' % (lr.score(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1]), lr.score(test_predict_blending, y_test)))

The results of LR-final:
Train-oof-Accuracy: 0.832117, Test-Accuracy: 0.789324


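Adjusting the hold-out ratio improves the final fused result slightly (test accuracy 0.7888 to 0.7893), which suggests that balancing the learning capacity of the level-1 learners and the meta-learner helps Blending.

- A higher-accuracy Blending

&emsp;&emsp;With the best split ratio in hand, we switch to a higher-accuracy Blending pipeline and test whether the result improves further: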

In [91]:
# Define the level-1 learners
lr_hyper = lr_cascade(lr_params_space)
tree_hyper = tree_cascade(tree_params_space)
RF_hyper = RF_cascade(RF_params_space)

estimators = [('lr', lr_hyper), ('tree', tree_hyper), ('rf', RF_hyper)]

In [87]:
train_oof_blending, test_predict_blending = train_cross(X_train_OE, 
                                                        y_train, 
                                                        X_test_OE, 
                                                        estimators=estimators, 
                                                        test_size=0.12958,
                                                        blending=True)

100%|███████████████████████████████████████████████| 20/20 [02:05<00:00,  6.28s/trial, best loss: -0.7916762792073351]
100%|███████████████████████████████████████████████| 20/20 [02:08<00:00,  6.44s/trial, best loss: -0.7837928867199053]
100%|████████████████████████████████████████████████| 20/20 [02:09<00:00,  6.49s/trial, best loss: -0.788195800059154]
100%|███████████████████████████████████████████████| 20/20 [02:09<00:00,  6.45s/trial, best loss: -0.7936372375036971]
100%|███████████████████████████████████████████████| 20/20 [02:07<00:00,  6.37s/trial, best loss: -0.7960854776693285]
100%|███████████████████████████████████████████| 1000/1000 [00:40<00:00, 24.44trial/s, best loss: -0.8006555013309671]
100%|███████████████████████████████████████████| 1000/1000 [00:40<00:00, 24.76trial/s, best loss: -0.7965746081041111]
100%|███████████████████████████████████████████| 1000/1000 [00:40<00:00, 24.77trial/s, best loss: -0.7944531943212068]
100%|███████████████████████████████████

Next, we pass the result to the meta-learner optimization function and test the final effect:

In [92]:
# Define the meta-learner search spaces
lr_final_param = [{'thr': np.arange(0.1, 1.1, 0.1).tolist(), 'penalty': ['l1'], 'C': np.arange(0.1, 1.1, 0.1).tolist(), 'solver': ['saga']}, 
                  {'thr': np.arange(0.1, 1.1, 0.1).tolist(), 'penalty': ['l2'], 'C': np.arange(0.1, 1.1, 0.1).tolist(), 'solver': ['lbfgs', 'newton-cg', 'sag', 'saga']}]

tree_final_param = {'max_depth': np.arange(2, 16, 1).tolist(), 
                    'min_samples_split': np.arange(2, 5, 1).tolist(),  # scikit-learn requires an integer min_samples_split >= 2
                    'min_samples_leaf': np.arange(1, 4, 1).tolist(), 
                    'max_leaf_nodes':np.arange(6, 30, 1).tolist()}

param_space_l = [lr_final_param, tree_final_param]

In [93]:
# Define the list of meta-learners
lr = logit_threshold()
tree = DecisionTreeClassifier()
final_model_l = [lr, tree]

In [89]:
# Train and search the meta-learners
best_res_final, best_test_predict_final = final_model_opt(final_model_l, 
                                                          param_space_l, 
                                                          train_oof_blending.iloc[:, :-1], 
                                                          train_oof_blending.iloc[:, -1], 
                                                          test_predict_blending)

In [94]:
best_res_final

0.8335766423357664

In [90]:
accuracy_score((best_test_predict_final >= 0.5) * 1, y_test)

0.787052810902896

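After running a stronger Blending, the training accuracy rises (0.8321 to 0.8336) but the test accuracy falls (0.7893 to 0.7871), which indicates some overfitting in the final fused result. It also shows that for a simple dataset the fusion pipeline with the greatest learning capacity is not necessarily the best choice. Leaving a little slack in the fusion stage may do more for the final result.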

### 3. Approach 2: split several times and build a (weighted) average fusion over the Blending results

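&emsp;&emsp;Next we try the second strategy: split off hold-out sets at several different ratios, train one Blending result per split, and then fuse them by simple or weighted averaging. With the helper functions already defined, the code is not complicated; the only real concern is compute time. In the earlier Blending runs, a full pass of level-1 cross-search training plus meta-learner optimization took an estimated 45 min even with few search iterations for the level-1 learners, so training five complete Blending results takes about 3.75 hours.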

In [96]:
(45 * 5) / 60

3.75

Recall the complete workflow:

<center><img src="http://ml2022.oss-cn-hangzhou.aliyuncs.com/img/image-20221019190652003.png" alt="image-20221019190652003" style="zoom:50%;" />
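The five runs in this workflow differ only in `test_size`, so they can also be driven by a loop that collects each run's score and test-set predictions for the later averaging step. The sketch below is hypothetical: `run_blendings` is an illustrative helper, and `fake_blending` is a stand-in that returns a made-up score and random probabilities. In the notebook, the callable would instead wrap the level-1 cross-training and meta-learner optimization for one ratio.

```python
import numpy as np

def run_blendings(ratios, run_one):
    """Call run_one(test_size) for each ratio; collect scores and test predictions."""
    scores, preds = {}, {}
    for r in ratios:
        score, test_pred = run_one(r)
        scores[r] = score
        preds[r] = np.asarray(test_pred)
    return scores, preds

def fake_blending(test_size, n_test=4):
    """Hypothetical stand-in for one full Blending run: (score, test probabilities)."""
    rng = np.random.default_rng(int(test_size * 100))
    return 0.8, rng.uniform(0, 1, size=n_test)

ratios = [0.1, 0.2, 0.3, 0.4, 0.5]
scores, preds = run_blendings(ratios, fake_blending)

# Simple average over the five runs' test-set probabilities
mean_pred = np.mean([preds[r] for r in ratios], axis=0)
```

Keeping the predictions in a dictionary keyed by ratio makes it easy to drop or reweight individual runs when testing different averaging schemes.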

- The first-stage Blending runs

&emsp;&emsp;First, the five Blending runs with different hold-out ratios:

In [7]:
# Cross-train the level-1 learners
lr_hyper = lr_cascade(lr_params_space)
tree_hyper = tree_cascade(tree_params_space)
RF_hyper = RF_cascade(RF_params_space)

estimators = [('lr', lr_hyper), ('tree', tree_hyper), ('rf', RF_hyper)]

train_oof_blending, test_predict_blending = train_cross(X_train_OE, 
                                                        y_train, 
                                                        X_test_OE, 
                                                        estimators=estimators, 
                                                        test_size=0.1,
                                                        blending=True)

# Train and optimize the meta-learner
lr = logit_threshold()
tree = DecisionTreeClassifier()
final_model_l = [lr, tree]

best_res_final1, best_test_predict_final1 = final_model_opt(final_model_l, 
                                                            param_space_l, 
                                                            train_oof_blending.iloc[:, :-1], 
                                                            train_oof_blending.iloc[:, -1], 
                                                            test_predict_blending)

100%|███████████████████████████████| 20/20 [01:46<00:00,  5.33s/trial, best loss: -0.7935282522996058]
100%|███████████████████████████████| 20/20 [01:43<00:00,  5.17s/trial, best loss: -0.7856397399543538]
100%|███████████████████████████████| 20/20 [01:40<00:00,  5.03s/trial, best loss: -0.7890580261428868]
100%|███████████████████████████████| 20/20 [01:39<00:00,  4.96s/trial, best loss: -0.7972643336330314]
100%|███████████████████████████████| 20/20 [01:31<00:00,  4.59s/trial, best loss: -0.7909509647970122]
100%|███████████████████████████| 1000/1000 [00:30<00:00, 32.62trial/s, best loss: -0.7956362819005464]
100%|███████████████████████████| 1000/1000 [00:30<00:00, 32.91trial/s, best loss: -0.7861650183276853]
100%|███████████████████████████| 1000/1000 [00:30<00:00, 32.58trial/s, best loss: -0.7887986721073379]
100%|███████████████████████████| 1000/1000 [00:30<00:00, 32.38trial/s, best loss: -0.7999000622449685]
100%|███████████████████████████| 1000/1000 [00:30<00:00, 32.84t

In [8]:
# Cross-train the level-1 learners
lr_hyper = lr_cascade(lr_params_space)
tree_hyper = tree_cascade(tree_params_space)
RF_hyper = RF_cascade(RF_params_space)

estimators = [('lr', lr_hyper), ('tree', tree_hyper), ('rf', RF_hyper)]

train_oof_blending, test_predict_blending = train_cross(X_train_OE, 
                                                        y_train, 
                                                        X_test_OE, 
                                                        estimators=estimators, 
                                                        test_size=0.2,
                                                        blending=True)

# Train and optimize the meta-learner
lr = logit_threshold()
tree = DecisionTreeClassifier()
final_model_l = [lr, tree]

best_res_final2, best_test_predict_final2 = final_model_opt(final_model_l, 
                                                            param_space_l, 
                                                            train_oof_blending.iloc[:, :-1], 
                                                            train_oof_blending.iloc[:, -1], 
                                                            test_predict_blending)

100%|███████████████████████████████| 20/20 [01:38<00:00,  4.95s/trial, best loss: -0.7855029585798816]
100%|███████████████████████████████| 20/20 [01:37<00:00,  4.88s/trial, best loss: -0.7884615384615385]
100%|███████████████████████████████| 20/20 [01:36<00:00,  4.84s/trial, best loss: -0.7834319526627219]
100%|███████████████████████████████| 20/20 [01:42<00:00,  5.15s/trial, best loss: -0.7931952662721894]
100%|███████████████████████████████| 20/20 [01:37<00:00,  4.88s/trial, best loss: -0.7884615384615384]
100%|███████████████████████████| 1000/1000 [00:28<00:00, 35.20trial/s, best loss: -0.7973372781065089]
100%|███████████████████████████| 1000/1000 [00:28<00:00, 34.94trial/s, best loss: -0.8041420118343193]
100%|███████████████████████████| 1000/1000 [00:28<00:00, 34.54trial/s, best loss: -0.7899408284023668]
100%|███████████████████████████| 1000/1000 [00:29<00:00, 34.33trial/s, best loss: -0.8023668639053254]
100%|███████████████████████████| 1000/1000 [00:28<00:00, 34.92t

In [9]:
# Cross-train the level-1 learners
lr_hyper = lr_cascade(lr_params_space)
tree_hyper = tree_cascade(tree_params_space)
RF_hyper = RF_cascade(RF_params_space)

estimators = [('lr', lr_hyper), ('tree', tree_hyper), ('rf', RF_hyper)]

train_oof_blending, test_predict_blending = train_cross(X_train_OE, 
                                                        y_train, 
                                                        X_test_OE, 
                                                        estimators=estimators, 
                                                        test_size=0.3,
                                                        blending=True)

# Train and optimize the meta-learner
lr = logit_threshold()
tree = DecisionTreeClassifier()
final_model_l = [lr, tree]

best_res_final3, best_test_predict_final3 = final_model_opt(final_model_l, 
                                                            param_space_l, 
                                                            train_oof_blending.iloc[:, :-1], 
                                                            train_oof_blending.iloc[:, -1], 
                                                            test_predict_blending)

100%|███████████████████████████████| 20/20 [01:26<00:00,  4.33s/trial, best loss: -0.7893132345543513]
100%|███████████████████████████████| 20/20 [01:34<00:00,  4.73s/trial, best loss: -0.7916832441578635]
100%|███████████████████████████████| 20/20 [01:26<00:00,  4.30s/trial, best loss: -0.7789077148214204]
100%|███████████████████████████████| 20/20 [01:23<00:00,  4.18s/trial, best loss: -0.7802550647093793]
100%|███████████████████████████████| 20/20 [01:29<00:00,  4.47s/trial, best loss: -0.7890451365070655]
100%|███████████████████████████| 1000/1000 [00:26<00:00, 37.87trial/s, best loss: -0.8052179082635936]
100%|███████████████████████████| 1000/1000 [00:26<00:00, 37.42trial/s, best loss: -0.7994632322678008]
100%|███████████████████████████| 1000/1000 [00:26<00:00, 37.16trial/s, best loss: -0.8015559976219876]
100%|███████████████████████████| 1000/1000 [00:27<00:00, 36.83trial/s, best loss: -0.7866825581927104]
100%|███████████████████████████| 1000/1000 [00:26<00:00, 37.12t

In [10]:
# Cross-train the level-1 learners
lr_hyper = lr_cascade(lr_params_space)
tree_hyper = tree_cascade(tree_params_space)
RF_hyper = RF_cascade(RF_params_space)

estimators = [('lr', lr_hyper), ('tree', tree_hyper), ('rf', RF_hyper)]

train_oof_blending, test_predict_blending = train_cross(X_train_OE, 
                                                        y_train, 
                                                        X_test_OE, 
                                                        estimators=estimators, 
                                                        test_size=0.4,
                                                        blending=True)

# Train and optimize the meta-learner
lr = logit_threshold()
tree = DecisionTreeClassifier()
final_model_l = [lr, tree]

best_res_final4, best_test_predict_final4 = final_model_opt(final_model_l, 
                                                            param_space_l, 
                                                            train_oof_blending.iloc[:, :-1], 
                                                            train_oof_blending.iloc[:, -1], 
                                                            test_predict_blending)

100%|███████████████████████████████| 20/20 [01:21<00:00,  4.09s/trial, best loss: -0.7838264299802762]
100%|███████████████████████████████| 20/20 [01:21<00:00,  4.09s/trial, best loss: -0.7838264299802762]
100%|████████████████████████████████| 20/20 [01:23<00:00,  4.16s/trial, best loss: -0.790138067061144]
100%|███████████████████████████████| 20/20 [01:21<00:00,  4.08s/trial, best loss: -0.7865877712031557]
100%|███████████████████████████████| 20/20 [01:26<00:00,  4.31s/trial, best loss: -0.7858780226436193]
100%|███████████████████████████| 1000/1000 [00:25<00:00, 39.70trial/s, best loss: -0.7897435897435898]
100%|███████████████████████████| 1000/1000 [00:25<00:00, 39.31trial/s, best loss: -0.7960552268244576]
100%|███████████████████████████| 1000/1000 [00:25<00:00, 39.90trial/s, best loss: -0.7917159763313609]
100%|███████████████████████████| 1000/1000 [00:25<00:00, 39.84trial/s, best loss: -0.7956607495069032]
100%|███████████████████████████| 1000/1000 [00:25<00:00, 39.72t

In [11]:
# Cross-training of the first-level learners
lr_hyper = lr_cascade(lr_params_space)
tree_hyper = tree_cascade(tree_params_space)
RF_hyper = RF_cascade(RF_params_space)

estimators = [('lr', lr_hyper), ('tree', tree_hyper), ('rf', RF_hyper)]

train_oof_blending, test_predict_blending = train_cross(X_train_OE, 
                                                        y_train, 
                                                        X_test_OE, 
                                                        estimators=estimators, 
                                                        test_size=0.5,
                                                        blending=True)

# Training and tuning of the meta-learner
lr = logit_threshold()
tree = DecisionTreeClassifier()
final_model_l = [lr, tree]

best_res_final5, best_test_predict_final5 = final_model_opt(final_model_l, 
                                                            param_space_l, 
                                                            train_oof_blending.iloc[:, :-1], 
                                                            train_oof_blending.iloc[:, -1], 
                                                            test_predict_blending)

100%|███████████████████████████████| 20/20 [01:17<00:00,  3.89s/trial, best loss: -0.7940349343999642]
100%|███████████████████████████████| 20/20 [01:06<00:00,  3.34s/trial, best loss: -0.7823154403773543]
100%|███████████████████████████████| 20/20 [01:09<00:00,  3.48s/trial, best loss: -0.7775772242949818]
100%|███████████████████████████████| 20/20 [01:09<00:00,  3.46s/trial, best loss: -0.7889393073622175]
100%|███████████████████████████████| 20/20 [01:14<00:00,  3.71s/trial, best loss: -0.7913033735560709]
100%|███████████████████████████| 1000/1000 [00:23<00:00, 42.29trial/s, best loss: -0.7945178313334006]
100%|███████████████████████████| 1000/1000 [00:23<00:00, 42.56trial/s, best loss: -0.7922523612651675]
100%|███████████████████████████| 1000/1000 [00:23<00:00, 42.48trial/s, best loss: -0.7908227174436714]
100%|███████████████████████████| 1000/1000 [00:23<00:00, 42.49trial/s, best loss: -0.8078608001971922]
100%|███████████████████████████| 1000/1000 [00:23<00:00, 42.97t
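
&emsp;&emsp;The cells above rely on helpers defined earlier in the notebook (`train_cross`, `lr_cascade`, `final_model_opt`, and so on), so the mechanics of a single Blending round are not visible here. The following is a minimal sketch of the general idea, not the notebook's actual implementation. It assumes plain scikit-learn estimators with untuned, illustrative hyperparameters and a synthetic dataset, and the function name `blending_round` is hypothetical. The base learners are fit on one part of the training data, their predicted probabilities on the hold-out part become the meta-learner's training features, and `test_size` controls how large that hold-out part is.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def blending_round(X_train, y_train, X_test, test_size, seed=0):
    """One Blending round (hypothetical helper, for illustration only)."""
    # Split the training data: base learners see X_fit, the meta-learner sees X_hold
    X_fit, X_hold, y_fit, y_hold = train_test_split(
        X_train, y_train, test_size=test_size, random_state=seed, stratify=y_train)

    # Untuned stand-ins for the tuned first-level learners used in the notebook
    base_learners = {
        'lr': LogisticRegression(max_iter=1000),
        'tree': DecisionTreeClassifier(max_depth=4, random_state=seed),
        'rf': RandomForestClassifier(n_estimators=50, random_state=seed),
    }

    hold_features, test_features = {}, {}
    for name, model in base_learners.items():
        model.fit(X_fit, y_fit)
        # Positive-class probabilities become the meta-learner's input features
        hold_features[name] = model.predict_proba(X_hold)[:, 1]
        test_features[name] = model.predict_proba(X_test)[:, 1]

    # The meta-learner is trained only on hold-out predictions
    meta = LogisticRegression(max_iter=1000)
    meta.fit(pd.DataFrame(hold_features), y_hold)
    return meta.predict_proba(pd.DataFrame(test_features))[:, 1]


# Synthetic data, used here only so the sketch runs on its own
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# One column of test-set probabilities per hold-out ratio
blending_demo = pd.DataFrame({
    f'res_{ts}': blending_round(X_tr, y_tr, X_te, test_size=ts)
    for ts in (0.4, 0.5)
})
```

&emsp;&emsp;Each hold-out ratio gives the meta-learner a different training set, which is why the resulting test-set predictions differ from run to run and are worth fusing in a second stage.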

&emsp;&emsp;Here we save the fused predictions of each Blending run so that they can be reused later:

In [19]:
best_test_predict_final1

array([0.0647482 , 0.28947368, 0.0647482 , ..., 0.0647482 , 0.65714286,
       0.0647482 ])

In [21]:
Blending_res = pd.DataFrame({'res1':best_test_predict_final1, 
                             'res2':best_test_predict_final2, 
                             'res3':best_test_predict_final3, 
                             'res4':best_test_predict_final4, 
                             'res5':best_test_predict_final5})

In [22]:
Blending_res

| | res1 | res2 | res3 | res4 | res5 |
|---|---|---|---|---|---|
| 0 | 0.064748 | 0.051689 | 0.072438 | 0.052375 | 0.056628 |
| 1 | 0.289474 | 0.277228 | 0.234146 | 0.230654 | 0.263626 |
| 2 | 0.064748 | 0.051689 | 0.042431 | 0.041876 | 0.048764 |
| 3 | 0.064748 | 0.051689 | 0.042431 | 0.052375 | 0.056628 |
| 4 | 0.064748 | 0.072704 | 0.042431 | 0.052375 | 0.048764 |
| ... | ... | ... | ... | ... | ... |
| 1756 | 0.289474 | 0.137391 | 0.110875 | 0.156221 | 0.141935 |
| 1757 | 0.064748 | 0.051689 | 0.042431 | 0.052375 | 0.056628 |
| 1758 | 0.064748 | 0.118314 | 0.110875 | 0.129351 | 0.112634 |
| 1759 | 0.657143 | 0.630565 | 0.579210 | 0.476724 | 0.500612 |


In [127]:
# Write the results to a local file
Blending_res.to_csv('Blending_res.csv', index=False)

In [129]:
# The results can be loaded later as follows
# Blending_res = pd.read_csv('Blending_res.csv')


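With the five sets of test-set predictions in hand, we now fuse them.

- Second-stage average fusion & weighted fusion with manually set weights

&emsp;&emsp;Second-stage fusion needs some care. Because the meta-learner in each Blending run is trained and makes its predictions on a different data split, we cannot first evaluate the fusion on the training set and then apply it to the test set. At this stage the fusion can only be carried out once, directly on the test-set predictions, and the result submitted as is. When the test labels are unknown (as in a competition), the final performance is visible only from the score returned after online submission. We are therefore limited to simple averaging or to a weighted average with manually set weights. We start with the simple mean: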

In [135]:
Blending_res.mean(axis=1) >= 0.5

0       False
1       False
2       False
3       False
4       False
        ...  
1756    False
1757    False
1758    False
1759     True
1760    False
Length: 1761, dtype: bool

In [144]:
# accuracy_score expects (y_true, y_pred); accuracy itself is symmetric, so the value is unchanged
accuracy_score(y_test, Blending_res.mean(axis=1) > 0.5)

0.7932992617830777

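The final fused result is noticeably better than that of Scheme 1. Next we consider a manually weighted average. We use the weighting strategy introduced in Part 4.3.3: rank the results by their training-set scores, then use the ranks as the weights of the weighted average. The process is as follows: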

In [31]:
pd.Series([best_res_final1, 
           best_res_final2, 
           best_res_final3, 
           best_res_final4, 
           best_res_final5], index=['best_res_final1', 
                                    'best_res_final2', 
                                    'best_res_final3', 
                                    'best_res_final4', 
                                    'best_res_final5'])

best_res_final1    0.829868
best_res_final2    0.834437
best_res_final3    0.816404
best_res_final4    0.825367
best_res_final5    0.814086
dtype: float64

In [177]:
Blending_res1 = ((Blending_res['res1'] * 4) + 
                 (Blending_res['res2'] * 5) + 
                 (Blending_res['res3'] * 2) + 
                 (Blending_res['res4'] * 3) + 
                 (Blending_res['res5'] * 1)) / 15

In [178]:
# accuracy_score expects (y_true, y_pred); accuracy itself is symmetric, so the value is unchanged
accuracy_score(y_test, Blending_res1 > 0.5)

0.7938671209540034

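The fused result improves slightly further, from 0.7933 to 0.7939, which on the 1761 test samples corresponds to one additional correct prediction.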
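
The rank-based weights above were typed in by hand. As a minimal sketch, they could instead be derived from the training-set scores with `Series.rank`. The scores below are copied from the output shown earlier; the names `train_scores`, `weights`, and `demo_res` are illustrative and do not appear in the original notebook, and `demo_res` holds only rows 0 and 1759 of `Blending_res` so that the sketch runs on its own.

```python
import pandas as pd

# Training-set scores of the five Blending runs (copied from the output above)
train_scores = pd.Series({'res1': 0.829868,
                          'res2': 0.834437,
                          'res3': 0.816404,
                          'res4': 0.825367,
                          'res5': 0.814086})

# Rank ascending: the lowest score gets weight 1, the highest gets weight 5.
# method='first' breaks ties by position; this tie-breaking rule is an assumption.
weights = train_scores.rank(method='first').astype(int)

# Two rows of Blending_res (index 0 and 1759), used as a small stand-in
demo_res = pd.DataFrame({'res1': [0.064748, 0.657143],
                         'res2': [0.051689, 0.630565],
                         'res3': [0.072438, 0.579210],
                         'res4': [0.052375, 0.476724],
                         'res5': [0.056628, 0.500612]},
                        index=[0, 1759])

# Multiplying a DataFrame by a Series aligns the weights with the column names
weighted = (demo_res * weights).sum(axis=1) / weights.sum()
weighted_label = weighted > 0.5
```

With these scores the derived weights come out as 4, 5, 2, 3, 1 for `res1` through `res5`, matching the hand-written expression. Replacing `demo_res` with the full `Blending_res` should then give the same weighted average as `Blending_res1`.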

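&emsp;&emsp;This completes our walkthrough of the two advanced optimization strategies for Blending. In practice, the two methods generally behave much as shown in this section: in most cases Scheme 2 outperforms Scheme 1, and Scheme 2 is also one of the very few second-stage fusion strategies that is genuinely practical.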