# <center> [Kaggle] Telco Customer Churn: Customer Churn Prediction Case Study

---

## <font face="仿宋">Introduction to Part 4

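&emsp;&emsp;<font face="仿宋">In Parts 2 and 3 of this case study, we covered feature engineering techniques in detail. Broadly, feature engineering falls into three categories: data preprocessing, feature derivation, and feature selection. Data preprocessing cleans and organizes the dataset until it is ready for modeling; its techniques include missing-value handling, outlier handling, and data re-encoding, and it must be completed before any model is built. Feature derivation and feature selection, by contrast, are optimization techniques that can help a model break through the performance ceiling of the current dataset. In Part 2 we also gave a complete account of how to train, optimize, and interpret interpretable machine learning models, namely logistic regression and decision trees, and so far we have relied mainly on these two algorithms for model testing in each part.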

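&emsp;&emsp;<font face="仿宋">In Part 4, we turn to practical techniques for training and optimizing ensemble learning models. Ensemble models are less interpretable than logistic regression and decision trees, but in most modeling scenarios they produce better predictions, which makes them the most commonly used algorithms when predictive performance is the priority.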

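&emsp;&emsp;<font face="仿宋">Overall, this part has a single goal: to use a range of optimization methods to reach the performance ceiling of each mainstream ensemble algorithm. In other words, we will examine single-model optimization strategies in detail. The ensemble algorithms covered are the current mainstream ones, including random forest, XGBoost, LightGBM, and CatBoost. The optimization strategies include hyperparameter optimizers, feature derivation and selection, and single-model self-fusion. These are currently among the most advanced, most effective, and most complex ways to improve a single model. Some rest on fairly deep theory and many come from practical experience, but in every case we aim to push each ensemble algorithm to its limit on the current dataset. Note that in this process we treat the feature derivation and selection methods introduced earlier as model optimization methods, and we judge them solely by the final model results. Large-scale feature derivation and selection around ensemble models is also where these techniques deliver the most value.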

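&emsp;&emsp;<font face="仿宋">Once the single models have reached their limits, we move on to the next stage: model fusion. Further multi-model fusion, or even multi-layer fusion, is only meaningful and effective after each single model has been pushed to its limit.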

---

# <center>Part 4. Training and Optimization Techniques for Ensemble Algorithms

In [1]:
# Core data science libraries
import numpy as np
import pandas as pd

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Timing module
import time

import warnings
warnings.filterwarnings('ignore')

# scikit-learn
# Data preprocessing
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

# Utility functions
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
# KFold is used later in this section, so import it explicitly
from sklearn.model_selection import KFold, RepeatedKFold

# Common estimators
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier

# Grid search
from sklearn.model_selection import GridSearchCV

# Support for custom estimators
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin

# Custom module
from telcoFunc import *

# Feature derivation module
import features_creation as fc
from features_creation import *

# Model fusion module
import manual_ensemble as me
from manual_ensemble import *

# Introspection and regular expression modules
import inspect, re

# Other modules
from tqdm import tqdm
import gc

&emsp;&emsp;Next, run the data cleaning steps from Part 1:

In [2]:
# Load the data
tcc = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

# Mark continuous and categorical fields
# Categorical fields
category_cols = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
                'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 
                'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
                'PaymentMethod']

# Continuous fields
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
 
# Label
target = 'Churn'

# ID column
ID_col = 'customerID'

# Check that every column is accounted for (the 2 covers the ID and label columns)
assert len(category_cols) + len(numeric_cols) + 2 == tcc.shape[1]

# Convert continuous fields
tcc['TotalCharges']= tcc['TotalCharges'].apply(lambda x: x if x!= ' ' else np.nan).astype(float)
tcc['MonthlyCharges'] = tcc['MonthlyCharges'].astype(float)

# Fill missing values
tcc['TotalCharges'] = tcc['TotalCharges'].fillna(0)

# Convert the label manually
# (assign the result back instead of calling replace(..., inplace=True) on a selected column)
tcc['Churn'] = tcc['Churn'].map({'Yes': 1, 'No': 0})

In [3]:
features = tcc.drop(columns=[ID_col, target]).copy()
labels = tcc['Churn'].copy()

&emsp;&emsp;Next, create the ordinal-encoded dataset and the dataset with derived time-series features:

In [4]:
# Split into training and test sets
train, test = train_test_split(tcc, random_state=22)

X_train = train.drop(columns=[ID_col, target]).copy()
X_test = test.drop(columns=[ID_col, target]).copy()

y_train = train['Churn'].copy()
y_test = test['Churn'].copy()

X_train_seq = pd.DataFrame()
X_test_seq = pd.DataFrame()

# Derive the year
X_train_seq['tenure_year'] = ((72 - X_train['tenure']) // 12) + 2014
X_test_seq['tenure_year'] = ((72 - X_test['tenure']) // 12) + 2014

# Derive the month
X_train_seq['tenure_month'] = (72 - X_train['tenure']) % 12 + 1
X_test_seq['tenure_month'] = (72 - X_test['tenure']) % 12 + 1

# Derive the quarter
X_train_seq['tenure_quarter'] = ((X_train_seq['tenure_month']-1) // 3) + 1
X_test_seq['tenure_quarter'] = ((X_test_seq['tenure_month']-1) // 3) + 1

# One-hot encoding
enc = preprocessing.OneHotEncoder()
enc.fit(X_train_seq)

seq_new = list(X_train_seq.columns)

# Build one-hot encoded DataFrames with column names
X_train_seq = pd.DataFrame(enc.transform(X_train_seq).toarray(), 
                           columns = cate_colName(enc, seq_new, drop=None))

X_test_seq = pd.DataFrame(enc.transform(X_test_seq).toarray(), 
                          columns = cate_colName(enc, seq_new, drop=None))

# Align the index
X_train_seq.index = X_train.index
X_test_seq.index = X_test.index
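
The year, month, and quarter formulas above are easier to follow on a few concrete values. The sketch below restates them as a plain function; the function name and the reading of 72 as the maximum tenure and 2014 as the earliest sign-up year are assumptions made for illustration, not something the cell states.

```python
def tenure_to_period(tenure, max_tenure=72, base_year=2014):
    """Map tenure (months as a customer) to an assumed sign-up year, month and quarter."""
    offset = max_tenure - tenure        # months after the earliest assumed sign-up date
    year = offset // 12 + base_year
    month = offset % 12 + 1
    quarter = (month - 1) // 3 + 1
    return year, month, quarter


for t in (72, 68, 3, 0):
    print(t, tenure_to_period(t))
```

A customer with the maximum tenure of 72 maps to the first month of 2014, while a customer with tenure 0 maps to the first month of 2020.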

In [5]:
ord_enc = OrdinalEncoder()
ord_enc.fit(X_train[category_cols])

X_train_OE = pd.DataFrame(ord_enc.transform(X_train[category_cols]), columns=category_cols)
X_train_OE.index = X_train.index
X_train_OE = pd.concat([X_train_OE, X_train[numeric_cols]], axis=1)

X_test_OE = pd.DataFrame(ord_enc.transform(X_test[category_cols]), columns=category_cols)
X_test_OE.index = X_test.index
X_test_OE = pd.concat([X_test_OE, X_test[numeric_cols]], axis=1)

Next come the third-party libraries, prepared data, and trained models needed for the model fusion part:

In [6]:
# Third-party libraries newly used in this section
from joblib import dump, load
from sklearn.ensemble import VotingClassifier
from hyperopt import hp, fmin, tpe, Trials
from numpy.random import RandomState
from sklearn.model_selection import cross_val_score

In [7]:
# Instantiate the KFold splitter
kf = KFold(n_splits=5, random_state=12, shuffle=True)

# Reset the index of the training features and labels
X_train_OE = X_train_OE.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)

train_part_index_l = []
eval_index_l = []

for train_part_index, eval_index in kf.split(X_train_OE, y_train):
    train_part_index_l.append(train_part_index)
    eval_index_l.append(eval_index)
    
# Training features
X_train1 = X_train_OE.loc[train_part_index_l[0]]
X_train2 = X_train_OE.loc[train_part_index_l[1]]
X_train3 = X_train_OE.loc[train_part_index_l[2]]
X_train4 = X_train_OE.loc[train_part_index_l[3]]
X_train5 = X_train_OE.loc[train_part_index_l[4]]

# Validation features
X_eval1 = X_train_OE.loc[eval_index_l[0]]
X_eval2 = X_train_OE.loc[eval_index_l[1]]
X_eval3 = X_train_OE.loc[eval_index_l[2]]
X_eval4 = X_train_OE.loc[eval_index_l[3]]
X_eval5 = X_train_OE.loc[eval_index_l[4]]

# Training labels
y_train1 = y_train.loc[train_part_index_l[0]]
y_train2 = y_train.loc[train_part_index_l[1]]
y_train3 = y_train.loc[train_part_index_l[2]]
y_train4 = y_train.loc[train_part_index_l[3]]
y_train5 = y_train.loc[train_part_index_l[4]]

# Validation labels
y_eval1 = y_train.loc[eval_index_l[0]]
y_eval2 = y_train.loc[eval_index_l[1]]
y_eval3 = y_train.loc[eval_index_l[2]]
y_eval4 = y_train.loc[eval_index_l[3]]
y_eval5 = y_train.loc[eval_index_l[4]]

train_set = [(X_train1, y_train1), 
             (X_train2, y_train2), 
             (X_train3, y_train3), 
             (X_train4, y_train4), 
             (X_train5, y_train5)]

eval_set = [(X_eval1, y_eval1), 
            (X_eval2, y_eval2), 
            (X_eval3, y_eval3), 
            (X_eval4, y_eval4), 
            (X_eval5, y_eval5)]

In [8]:
# Random forest models
grid_RF_1 = load('./models/grid_RF_1.joblib') 
grid_RF_2 = load('./models/grid_RF_2.joblib') 
grid_RF_3 = load('./models/grid_RF_3.joblib') 
grid_RF_4 = load('./models/grid_RF_4.joblib') 
grid_RF_5 = load('./models/grid_RF_5.joblib') 

RF_1 = grid_RF_1.best_estimator_
RF_2 = grid_RF_2.best_estimator_
RF_3 = grid_RF_3.best_estimator_
RF_4 = grid_RF_4.best_estimator_
RF_5 = grid_RF_5.best_estimator_

RF_l = [RF_1, RF_2, RF_3, RF_4, RF_5]

# Decision tree models
grid_tree_1 = load('./models/grid_tree_1.joblib')
grid_tree_2 = load('./models/grid_tree_2.joblib')
grid_tree_3 = load('./models/grid_tree_3.joblib')
grid_tree_4 = load('./models/grid_tree_4.joblib')
grid_tree_5 = load('./models/grid_tree_5.joblib')

tree_1 = grid_tree_1.best_estimator_
tree_2 = grid_tree_2.best_estimator_
tree_3 = grid_tree_3.best_estimator_
tree_4 = grid_tree_4.best_estimator_
tree_5 = grid_tree_5.best_estimator_

tree_l = [tree_1, tree_2, tree_3, tree_4, tree_5]

# Logistic regression models
grid_lr_1 = load('./models/grid_lr_1.joblib')
grid_lr_2 = load('./models/grid_lr_2.joblib')
grid_lr_3 = load('./models/grid_lr_3.joblib')
grid_lr_4 = load('./models/grid_lr_4.joblib')
grid_lr_5 = load('./models/grid_lr_5.joblib')

lr_1 = grid_lr_1.best_estimator_
lr_2 = grid_lr_2.best_estimator_
lr_3 = grid_lr_3.best_estimator_
lr_4 = grid_lr_4.best_estimator_
lr_5 = grid_lr_5.best_estimator_

lr_l = [lr_1, lr_2, lr_3, lr_4, lr_5]

In [9]:
eval1_predict_proba_RF = pd.Series(RF_l[0].predict_proba(X_eval1)[:, 1], index=X_eval1.index)
eval2_predict_proba_RF = pd.Series(RF_l[1].predict_proba(X_eval2)[:, 1], index=X_eval2.index)
eval3_predict_proba_RF = pd.Series(RF_l[2].predict_proba(X_eval3)[:, 1], index=X_eval3.index)
eval4_predict_proba_RF = pd.Series(RF_l[3].predict_proba(X_eval4)[:, 1], index=X_eval4.index)
eval5_predict_proba_RF = pd.Series(RF_l[4].predict_proba(X_eval5)[:, 1], index=X_eval5.index)

eval_predict_proba_RF = pd.concat([eval1_predict_proba_RF, 
                                   eval2_predict_proba_RF, 
                                   eval3_predict_proba_RF, 
                                   eval4_predict_proba_RF, 
                                   eval5_predict_proba_RF]).sort_index()

eval1_predict_proba_tree = pd.Series(tree_l[0].predict_proba(X_eval1)[:, 1], index=X_eval1.index)
eval2_predict_proba_tree = pd.Series(tree_l[1].predict_proba(X_eval2)[:, 1], index=X_eval2.index)
eval3_predict_proba_tree = pd.Series(tree_l[2].predict_proba(X_eval3)[:, 1], index=X_eval3.index)
eval4_predict_proba_tree = pd.Series(tree_l[3].predict_proba(X_eval4)[:, 1], index=X_eval4.index)
eval5_predict_proba_tree = pd.Series(tree_l[4].predict_proba(X_eval5)[:, 1], index=X_eval5.index)

eval_predict_proba_tree = pd.concat([eval1_predict_proba_tree, 
                                     eval2_predict_proba_tree, 
                                     eval3_predict_proba_tree, 
                                     eval4_predict_proba_tree, 
                                     eval5_predict_proba_tree]).sort_index()

eval1_predict_proba_lr = pd.Series(lr_l[0].predict_proba(X_eval1)[:, 1], index=X_eval1.index)
eval2_predict_proba_lr = pd.Series(lr_l[1].predict_proba(X_eval2)[:, 1], index=X_eval2.index)
eval3_predict_proba_lr = pd.Series(lr_l[2].predict_proba(X_eval3)[:, 1], index=X_eval3.index)
eval4_predict_proba_lr = pd.Series(lr_l[3].predict_proba(X_eval4)[:, 1], index=X_eval4.index)
eval5_predict_proba_lr = pd.Series(lr_l[4].predict_proba(X_eval5)[:, 1], index=X_eval5.index)

eval_predict_proba_lr = pd.concat([eval1_predict_proba_lr, 
                                   eval2_predict_proba_lr, 
                                   eval3_predict_proba_lr, 
                                   eval4_predict_proba_lr, 
                                   eval5_predict_proba_lr]).sort_index()

In [10]:
test_predict_proba_RF = []

for i in range(5):
    test_predict_proba_RF.append(RF_l[i].predict_proba(X_test_OE)[:, 1])

test_predict_proba_RF = np.array(test_predict_proba_RF)
test_predict_proba_RF = test_predict_proba_RF.mean(0)

test_predict_proba_tree = []

for i in range(5):
    test_predict_proba_tree.append(tree_l[i].predict_proba(X_test_OE)[:, 1])

test_predict_proba_tree = np.array(test_predict_proba_tree)
test_predict_proba_tree = test_predict_proba_tree.mean(0)

test_predict_proba_lr = []

for i in range(5):
    test_predict_proba_lr.append(lr_l[i].predict_proba(X_test_OE)[:, 1])

test_predict_proba_lr = np.array(test_predict_proba_lr)
test_predict_proba_lr = test_predict_proba_lr.mean(0)

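## <center>Ch.3 Basic Methods of Model Fusion

- Overview of optimization methods for Stacking fusion

&emsp;&emsp;In the earlier Stacking experiments, it was easy to see that however we optimized each stage of the pipeline, the final result still tended to overfit. Fundamentally, this is an inevitable consequence of repeated training on the same data: the first-level learners and the meta-learner are trained over and over on one dataset, and no additional information enters the training process, so the overall overfitting is fairly pronounced.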

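&emsp;&emsp;There are broadly two ways to reduce Stacking's overfitting and thereby improve the fusion result. The first works on the training method: if the training procedure is modified so that different models learn with different tendencies, the first-level learners become more independent of one another, which reduces overfitting and improves generalization. This approach is known as cascade optimization. The second adjusts how data and features are allocated. For example, we can give the models in different layers different data, which is known as Blending; we can also give the models in different layers different features, which is known as feature augmentation.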

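&emsp;&emsp;Both strategies were distilled from long practice and may improve the fusion result. Note that they only may help; neither is guaranteed to work (if one were, it would be treated as a required step, as cross-training is). Applying Stacking optimizations therefore requires adapting to the situation at hand, and when necessary a brute-force approach is reasonable: try every method and keep the best result.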

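&emsp;&emsp;In any case, these methods are essential to master. This section introduces Blending, whose procedure is relatively simple and clear, and the next section focuses on cascade optimization. Feature augmentation is more complex and will be covered later in the course.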

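## 11. Blending: Principles and Practice

### 1. Basic principles of Blending

- Training and prediction workflow of Blending

&emsp;&emsp;The basic procedure of Blending resembles Stacking: both use a two-layer architecture in which the first-level learners are trained first and their outputs are then fed to the meta-learner for training. The difference is that, to avoid the overfitting caused by training repeatedly on one set of data, Blending further splits the training set into a training part and a holdout set, usually at a ratio between 5:5 and 9:1. The training part is used to fit the first-level learners. The first-level learners then predict on the holdout set, and their predictions are concatenated into a dataset similar to the oof dataset in Stacking, which is used to train the meta-learner. This completes the training of both layers. The basic process is as follows: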

<center><img src="http://ml2022.oss-cn-hangzhou.aliyuncs.com/img/image-20221008213642678.png" alt="image-20221008213642678" style="zoom:50%;" />

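As the figure shows, the main difference between Blending and Stacking is that the first-level learners and the meta-learner are trained on entirely different data, which avoids the overfitting caused by repeated training on the same data.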

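&emsp;&emsp;Note, however, that although splitting the data avoids repeated training, each layer now sees less training data, so the accuracy of each layer's models tends to drop. When every individual model is less accurate, it is uncertain whether the fused result will still be better. In other words, Blending does not necessarily beat Stacking; in most cases we can only say that Blending is another promising fusion procedure. When pursuing the best possible fusion result, it is usually necessary to try both and keep the better one.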

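> Sometimes we even train several groups of meta-learners with different fusion procedures and then fuse those. Such methods extend the fusion structure and are explored in more depth in later case studies.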

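&emsp;&emsp;The ratio of the training part to the holdout set generally falls between 5:5 and 9:1. If the training part is too large (the holdout set too small), the meta-learner has little data to learn from and its bias grows. If the training part is too small (the holdout set too large), the first-level learners learn poorly; because the first-level learners are the core of a good fusion result, too small a training part is fatal to the fusion. When the best ratio is unknown, a common default is an 8:2 split.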
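
To make the ratios concrete, the sketch below computes how many rows each part receives for several holdout shares. It assumes the 5,282 labelled training rows used later in this section and mirrors how scikit-learn's `train_test_split` sizes a fractional `test_size` (the holdout count is rounded up and the rest goes to the training part); the helper name is made up for illustration.

```python
import math


def blending_split_sizes(n_samples, holdout_ratio):
    """Return (training rows, holdout rows) for a fractional holdout share."""
    n_holdout = math.ceil(holdout_ratio * n_samples)  # the holdout part is rounded up
    return n_samples - n_holdout, n_holdout


n_labelled = 5282  # labelled training rows in this case study (4225 + 1057)
for ratio in (0.5, 0.4, 0.3, 0.2, 0.1):
    print(ratio, blending_split_sizes(n_labelled, ratio))
```

With the 8:2 split the meta-learner is trained on only 1,057 rows, which is one reason simple meta-learners are usually preferred.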

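> Note that the holdout set here resembles the validation set in cross-validation: both are special-purpose subsets split off from the training set. In many scenarios the labels of the data used for final evaluation are unavailable, so all the labelled data on hand can be treated as the training set and then split into a training part and a holdout set. In this case study we set aside a labelled test set in advance mainly for teaching purposes, to simulate submitting results online for verification.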

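> The split ratio in Blending can also be treated as a hyperparameter and tuned by hyperparameter search. This belongs to cascade optimization and is covered in the next section.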

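&emsp;&emsp;Once both layers have been trained in the Blending workflow, predicting on new data works exactly as in Stacking: feed the new data to the first-level learners to obtain their predictions, then feed those predictions to the meta-learner.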
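
The training and prediction workflow described above can be sketched end to end in a few lines. This is a minimal illustration on synthetic data, not the case study's tuned pipeline: the dataset, the model settings, and the helper `first_level_output` are all assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; the case study uses the Telco dataset)
X, y = make_classification(n_samples=2000, n_features=10, random_state=12)
X_fit, X_new, y_fit, y_new = train_test_split(X, y, random_state=22)

# 1. Split the labelled data into a training part and a holdout set (8:2)
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_fit, y_fit, test_size=0.2, random_state=12)

# 2. Fit the first-level learners on the training part only
base_models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=4, random_state=12),
    RandomForestClassifier(n_estimators=50, max_depth=6, random_state=12),
]
for model in base_models:
    model.fit(X_tr, y_tr)


def first_level_output(models, X):
    """Stack each model's positive-class probability into one column per model."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])


# 3. Train the meta-learner on the holdout-set predictions
meta = LogisticRegression().fit(first_level_output(base_models, X_hold), y_hold)

# 4. Predict new data: first-level learners first, then the meta-learner
new_features = first_level_output(base_models, X_new)
new_pred = meta.predict(new_features)
print(new_pred.shape, meta.score(new_features, y_new))
```

The first-level learners never see the holdout rows during fitting, and the meta-learner never sees the training part, which is the information isolation that distinguishes Blending from Stacking.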

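&emsp;&emsp;As for the choice of meta-learner, Blending isolates information at the dataset level, so in theory the meta-learner has more room to learn and could be paired with a stronger model. In practice, however, the meta-learner has limited training data, and that data still consists of first-level predictions, so it often still shows a strong tendency to overfit. Overall, the meta-learner in Blending should be selected and optimized in essentially the same way as in Stacking.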

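- Optimization methods for Blending

&emsp;&emsp;Blending also admits many optimization methods similar to those for Stacking. For example, the first-level learners can be cross-trained, with each fold's model predicting on the holdout set and the predictions averaged before being passed to the meta-learner. The cross-training of the first-level learners can also be combined with hyperparameter optimization, and the meta-learner can use the final_model_opt workflow defined earlier. All of this can be implemented by borrowing or adapting the functions defined in the Stacking part. In addition, the new key variable, the holdout split ratio, gives rise to a family of strategies for optimizing that ratio, which we discuss in detail in the next section.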

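&emsp;&emsp;Blending also has an optimization strategy where it holds a particular advantage: feature augmentation, that is, adding the holdout-set features to the meta-learner's inputs. In Part 4.3.6 we tried a similar operation in Stacking, but it amplified the tendency to overfit. In Blending, by contrast, the first-level learners never learn directly from the holdout features, so these features still carry considerable value for the meta-learner. Sill et al. (2009), in "Feature-weighted linear stacking" (arXiv preprint arXiv:0911.0460), even report that combining the holdout features with the first-level predictions on the holdout set through polynomial feature derivation can effectively improve the meta-learner's generalization. This is also the focus of the later feature augmentation module.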
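
As a rough illustration of feature augmentation, the sketch below joins the holdout features to a first-level prediction and then adds pairwise interaction terms before fitting the meta-learner. It uses synthetic data and a single first-level model, and it is not a reproduction of the method in the cited paper; it only shows the shape of the idea.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=1000, n_features=5, random_state=12)
X_tr, X_hold, y_tr, y_hold = train_test_split(X, y, test_size=0.2, random_state=12)

# One first-level learner, fitted on the training part only
base = DecisionTreeClassifier(max_depth=3, random_state=12).fit(X_tr, y_tr)
hold_pred = base.predict_proba(X_hold)[:, [1]]   # kept 2-D: one column

# Plain augmentation: the prediction column plus the raw holdout features
augmented = np.hstack([hold_pred, X_hold])

# Degree-2 interaction terms; this also includes feature-by-feature products,
# not only prediction-by-feature products
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
augmented_poly = poly.fit_transform(augmented)

meta = LogisticRegression(max_iter=1000).fit(augmented_poly, y_hold)
print(augmented.shape, augmented_poly.shape)
```

At prediction time the same `hstack` and `poly.transform` steps would have to be applied to the new data before calling the meta-learner.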

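### 2. Implementing Blending by hand

&emsp;&emsp;With the basic principles of Blending covered, we now run the Blending workflow by hand.

#### 2.1 Splitting the training and holdout sets

&emsp;&emsp;First we split the training and holdout sets. We use an 8:2 ratio, naming the training part train1 and the holdout set train2: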

In [15]:
X_train_OE.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,tenure,MonthlyCharges,TotalCharges
0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,2.0,2.0,0.0,2.0,2.0,2.0,2.0,1.0,0.0,68,79.6,5515.8
1,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,0.0,2.0,0.0,0.0,0.0,1.0,2.0,3,80.0,241.3
2,1.0,0.0,0.0,0.0,1.0,0.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,3.0,4,19.0,73.45
3,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,1.0,3.0,10,55.55,551.3
4,0.0,1.0,0.0,0.0,1.0,0.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,3.0,4,20.05,91.45


In [16]:
X_train1, X_train2, y_train1, y_train2 = train_test_split(X_train_OE, y_train,  test_size=0.2, random_state=12)

In [17]:
X_train1.shape

(4225, 19)

In [18]:
X_train2.shape

(1057, 19)

In [19]:
X_train1.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,tenure,MonthlyCharges,TotalCharges
2303,0.0,0.0,1.0,1.0,0.0,1.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,32,35.15,1051.05
1277,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,29,25.1,712.85
1855,0.0,0.0,0.0,0.0,1.0,2.0,1.0,0.0,0.0,2.0,0.0,0.0,2.0,0.0,1.0,2.0,16,88.45,1422.1
4280,0.0,0.0,1.0,1.0,1.0,2.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,0.0,68,89.05,6185.8
1316,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2,74.95,151.75


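Note that the index of each dataset is now in shuffled order.

#### 2.2 Training the first-level learners as single models

&emsp;&emsp;Next we train the first-level learners. Here we first tune each model by hand with grid search, which is also a very high-precision search process.

- Random forest training

&emsp;&emsp;First, random forest. The hyperparameter optimization proceeds as before; we omit the earlier rounds of search and show only the final search and its ranges: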

In [20]:
start = time.time()

# Define the hyperparameter space
# (min_samples_split starts at 2, the smallest integer value scikit-learn accepts)
parameter_space = {"min_samples_leaf": range(7, 10), 
                   "min_samples_split": range(2, 4),
                   "max_depth": range(5, 8),
                   "max_leaf_nodes": [None] + list(range(32, 49, 2)), 
                   "n_estimators": range(9, 12), 
                   "max_features":['sqrt', 'log2'] + list(range(4, 8)), 
                   "max_samples":[None, 0.55, 0.6, 0.65]}

# Instantiate the model and the grid search
RF_blending = RandomForestClassifier(random_state=12)
grid_RF_blending = GridSearchCV(RF_blending, parameter_space, n_jobs=15)

# Train the model
grid_RF_blending.fit(X_train1, y_train1)

print(time.time()-start)

135.82048630714417


In [58]:
grid_RF_blending.best_params_

{'max_depth': 6,
 'max_features': 6,
 'max_leaf_nodes': None,
 'max_samples': 0.6,
 'min_samples_leaf': 8,
 'min_samples_split': 2,
 'n_estimators': 10}

Then evaluate the model on the training, holdout, and test sets:

In [59]:
grid_RF_blending.score(X_train1, y_train1), grid_RF_blending.score(X_train2, y_train2), grid_RF_blending.score(X_test_OE, y_test)

(0.8191715976331361, 0.8136234626300851, 0.7830777967064169)

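The test-set accuracy is noticeably lower. Somewhat coincidentally, the training and holdout accuracies are close to each other but differ markedly from the test accuracy. This suggests that, under the current split, the training and holdout sets follow similar patterns for the random forest while the test set differs, which also raises the overfitting risk of the final Blending.

> In general, if the training and holdout sets are highly consistent, one could in theory enlarge the holdout share. This is only a theoretical suggestion and is not actionable in practice: how similar is similar enough, and how much data should then be moved to the holdout set?

- Logistic regression training

&emsp;&emsp;Next, logistic regression, again optimized to the highest precision and standard: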

In [21]:
# Set up the transformer pipeline
logistic_pre = ColumnTransformer([
    ('cat', preprocessing.OneHotEncoder(drop='if_binary'), category_cols), 
    ('num', 'passthrough', numeric_cols)
])

num_pre = ['passthrough', preprocessing.StandardScaler(), preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')]

# Instantiate the logistic regression estimator
logistic_blending = logit_threshold(max_iter=int(1e8))

# Set up the machine learning pipeline
logistic_pipe = make_pipeline(logistic_pre, logistic_blending)

# Define the hyperparameter space
cw_l = [None, 'balanced']
#cw_l.extend([{1: x} for x in np.arange(1, 4, 0.2)])
logistic_param = [
    {'columntransformer__num':num_pre, 'logit_threshold__thr': np.arange(0.1, 1, 0.1).tolist(), 'logit_threshold__penalty': ['l1'], 'logit_threshold__C': np.arange(0.1, 1.1, 0.1).tolist(), 'logit_threshold__solver': ['saga'], 'logit_threshold__class_weight':cw_l}, 
    {'columntransformer__num':num_pre, 'logit_threshold__thr': np.arange(0.1, 1, 0.1).tolist(), 'logit_threshold__penalty': ['l2'], 'logit_threshold__C': np.arange(0.1, 1.1, 0.1).tolist(), 'logit_threshold__solver': ['lbfgs', 'newton-cg', 'sag', 'saga'], 'logit_threshold__class_weight':cw_l}, 
]

# Instantiate the grid search
grid_lr_blending = GridSearchCV(estimator = logistic_pipe,
                                param_grid = logistic_param,
                                scoring='accuracy',
                                n_jobs = 15)

s = time.time()
grid_lr_blending.fit(X_train1, y_train1)
print(time.time()-s, "s")

618.0271792411804 s
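
The search above tunes a `thr` parameter alongside the usual logistic regression hyperparameters. The `logit_threshold` class comes from the course's `telcoFunc` module and is not shown here, so the sketch below is a hypothetical stand-in named `ThresholdLogit` that illustrates how a tunable decision threshold can be exposed to `GridSearchCV`. Its constructor arguments and behavior are assumptions, not the course's implementation.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


class ThresholdLogit(ClassifierMixin, BaseEstimator):
    """Hypothetical logistic regression whose decision threshold `thr` is tunable."""

    def __init__(self, C=1.0, thr=0.5, max_iter=1000):
        self.C = C
        self.thr = thr
        self.max_iter = max_iter

    def fit(self, X, y):
        self.model_ = LogisticRegression(C=self.C, max_iter=self.max_iter).fit(X, y)
        self.classes_ = self.model_.classes_
        return self

    def predict_proba(self, X):
        return self.model_.predict_proba(X)

    def predict(self, X):
        # Predict the positive class only when its probability reaches the threshold
        positive = self.predict_proba(X)[:, 1] >= self.thr
        return np.where(positive, self.classes_[1], self.classes_[0])


# Synthetic imbalanced data (an assumption), roughly 25% positives
X, y = make_classification(n_samples=600, n_features=6,
                           weights=[0.75, 0.25], random_state=12)

search = GridSearchCV(ThresholdLogit(),
                      {"thr": [0.3, 0.4, 0.5, 0.6], "C": [0.1, 1.0]},
                      scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
```

Because `thr` is a constructor argument, the grid search treats it like any other hyperparameter, which is how the threshold can be searched together with `C` and the penalty type.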


In [89]:
grid_lr_blending.best_score_

0.8073372781065089

In [61]:
grid_lr_blending.score(X_train1, y_train1), grid_lr_blending.score(X_train2, y_train2), grid_lr_blending.score(X_test_OE, y_test)

(0.8073372781065089, 0.8183538315988647, 0.7825099375354913)

- Decision tree model

&emsp;&emsp;Finally, decision tree training:

In [22]:
tree_model = DecisionTreeClassifier(random_state=12)

tree_param = {'max_depth': np.arange(2, 16, 1).tolist(), 
              'min_samples_split': np.arange(2, 5, 1).tolist(),  # integer values must be at least 2
              'min_samples_leaf': np.arange(1, 4, 1).tolist(), 
              'max_leaf_nodes':np.arange(6, 30, 1).tolist()}

grid_tree_blending = GridSearchCV(estimator = tree_model,
                                  param_grid = tree_param,
                                  n_jobs = 12).fit(X_train1, y_train1)

In [63]:
grid_tree_blending.best_score_

0.8011834319526627

In [64]:
grid_tree_blending.score(X_train1, y_train1), grid_tree_blending.score(X_train2, y_train2), grid_tree_blending.score(X_test_OE, y_test)

(0.8137278106508876, 0.7994323557237465, 0.7717206132879046)

|Model|Training set|Holdout set|Test set|
|:--:|:--:|:--:|:--:|
|RF (single model)|0.8191|0.8136|0.7830|
|LR (single model)|0.8073|0.8183|0.7825|
|Tree (single model)|0.8137|0.7994|0.7717|

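#### 2.3 Creating the meta-learner's training and prediction data

&emsp;&emsp;With the first-level learners trained, we create the training and prediction data for the meta-learner, the counterparts of train_oof and test_predict in Stacking. For easy reference, we name them train_oof_blending and test_predict_blending.

&emsp;&emsp;Unlike in Stacking, train_oof_blending holds the first-level learners' predictions on the holdout set rather than out-of-fold predictions over the full training set, while test_predict_blending holds their predictions on the test set. They can be created as follows:

- Creating the train_oof_blending dataset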

In [96]:
grid_lr_blending.predict_proba(X_train2)[:, 1]

array([0.08218637, 0.5260078 , 0.14814201, ..., 0.01149645, 0.00475372,
       0.04643898])

In [97]:
train_oof_blending = pd.DataFrame({'lr_oof_blending': grid_lr_blending.predict_proba(X_train2)[:, 1], 
                                   'RF_oof_blending': grid_RF_blending.predict_proba(X_train2)[:, 1],
                                   'tree_oof_blending': grid_tree_blending.predict_proba(X_train2)[:, 1]})

In [98]:
train_oof_blending

Unnamed: 0,lr_oof_blending,RF_oof_blending,tree_oof_blending
0,0.082186,0.041656,0.051867
1,0.526008,0.448663,0.462500
2,0.148142,0.160260,0.194373
3,0.719639,0.747084,0.787986
4,0.388377,0.400153,0.524138
...,...,...,...
1052,0.023818,0.038907,0.026367
1053,0.556429,0.614626,0.717557
1054,0.011496,0.048237,0.026367
1055,0.004754,0.007760,0.026367


- Creating the test_predict_blending dataset

In [99]:
test_predict_blending = pd.DataFrame({'lr_oof_blending': grid_lr_blending.predict_proba(X_test_OE)[:, 1], 
                                      'RF_oof_blending': grid_RF_blending.predict_proba(X_test_OE)[:, 1],
                                      'tree_oof_blending': grid_tree_blending.predict_proba(X_test_OE)[:, 1]})

In [100]:
test_predict_blending

Unnamed: 0,lr_oof_blending,RF_oof_blending,tree_oof_blending
0,0.040403,0.057965,0.026367
1,0.221748,0.202559,0.194373
2,0.006364,0.003488,0.026367
3,0.030417,0.016963,0.026367
4,0.064667,0.034394,0.051867
...,...,...,...
1756,0.178809,0.145293,0.109890
1757,0.034349,0.053799,0.026367
1758,0.124882,0.204525,0.194373
1759,0.502636,0.505389,0.462500


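#### 2.4 Training and testing the meta-learner

&emsp;&emsp;Next comes the training and prediction of the meta-learner. First, we can run a comparison experiment like the one in Stacking, using each common classifier in turn as the meta-learner and testing the result: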

In [70]:
# Logistic regression
lr = LogisticRegression().fit(train_oof_blending, y_train2)
print('The results of LR-final:')
print('Train2-Accuracy: %f, Test-Accuracy: %f' % (lr.score(train_oof_blending, y_train2), lr.score(test_predict_blending, y_test)))

# Decision tree (named tree_final so it does not shadow the imported sklearn `tree` module)
tree_final = DecisionTreeClassifier().fit(train_oof_blending, y_train2)
print('The results of tree-final:')
print('Train2-Accuracy: %f, Test-Accuracy: %f' % (tree_final.score(train_oof_blending, y_train2), tree_final.score(test_predict_blending, y_test)))

# K-nearest neighbors classifier
from sklearn import neighbors
KNN = neighbors.KNeighborsClassifier().fit(train_oof_blending, y_train2)
print('The results of KNN-final:')
print('Train2-Accuracy: %f, Test-Accuracy: %f' % (KNN.score(train_oof_blending, y_train2), KNN.score(test_predict_blending, y_test)))

# Support vector machine
from sklearn import svm
SVM = svm.SVC().fit(train_oof_blending, y_train2)
print('The results of SVM-final:')
print('Train2-Accuracy: %f, Test-Accuracy: %f' % (SVM.score(train_oof_blending, y_train2), SVM.score(test_predict_blending, y_test)))

# Naive Bayes (Gaussian)
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(train_oof_blending, y_train2)
print('The results of GaussianNB-final:')
print('Train2-Accuracy: %f, Test-Accuracy: %f' % (gnb.score(train_oof_blending, y_train2), gnb.score(test_predict_blending, y_test)))

# Bagging
from sklearn.ensemble import BaggingClassifier
bagging = BaggingClassifier().fit(train_oof_blending, y_train2)
print('The results of Bagging-final:')
print('Train2-Accuracy: %f, Test-Accuracy: %f' % (bagging.score(train_oof_blending, y_train2), bagging.score(test_predict_blending, y_test)))

# Random forest
RFC = RandomForestClassifier().fit(train_oof_blending, y_train2)
print('The results of RandomForest-final:')
print('Train2-Accuracy: %f, Test-Accuracy: %f' % (RFC.score(train_oof_blending, y_train2), RFC.score(test_predict_blending, y_test)))

# AdaBoost
from sklearn.ensemble import AdaBoostClassifier
ABC = AdaBoostClassifier().fit(train_oof_blending, y_train2)
print('The results of AdaBoost-final:')
print('Train2-Accuracy: %f, Test-Accuracy: %f' % (ABC.score(train_oof_blending, y_train2), ABC.score(test_predict_blending, y_test)))

# GBDT
from sklearn.ensemble import GradientBoostingClassifier
GBC = GradientBoostingClassifier().fit(train_oof_blending, y_train2)
print('The results of GBDT-final:')
print('Train2-Accuracy: %f, Test-Accuracy: %f' % (GBC.score(train_oof_blending, y_train2), GBC.score(test_predict_blending, y_test)))

# XGB
from xgboost import XGBClassifier
XGB = XGBClassifier().fit(train_oof_blending, y_train2)
print('The results of XGB-final:')
print('Train2-Accuracy: %f, Test-Accuracy: %f' % (XGB.score(train_oof_blending, y_train2), XGB.score(test_predict_blending, y_test)))

The results of LR-final:
Train2-Accuracy: 0.818354, Test-Accuracy: 0.788189
The results of tree-final:
Train2-Accuracy: 1.000000, Test-Accuracy: 0.732538
The results of KNN-final:
Train2-Accuracy: 0.850520, Test-Accuracy: 0.766042
The results of SVM-final:
Train2-Accuracy: 0.826868, Test-Accuracy: 0.787053
The results of GaussianNB-final:
Train2-Accuracy: 0.807001, Test-Accuracy: 0.780239
The results of Bagging-final:
Train2-Accuracy: 0.978240, Test-Accuracy: 0.758660
The results of RandomForest-final:
Train2-Accuracy: 1.000000, Test-Accuracy: 0.762635
The results of AdaBoost-final:
Train2-Accuracy: 0.838221, Test-Accuracy: 0.779671
The results of GBDT-final:
Train2-Accuracy: 0.886471, Test-Accuracy: 0.781374
The results of XGB-final:
Train2-Accuracy: 0.965941, Test-Accuracy: 0.767746


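&emsp;&emsp;These results show that Blending performs about as well as Stacking overall, and that logistic regression and similar models again do best as meta-learners. The best fused result (logistic regression, test accuracy about 0.788) also exceeds the best single model (random forest, about 0.783). Long practice suggests that, without feature augmentation, the meta-learner has limited room to learn, so we still recommend selecting and optimizing the meta-learner as in Stacking.

|Model|Training set|Holdout set|Test set|
|:--:|:--:|:--:|:--:|
|RF (single model)|0.8191|0.8136|0.7830|
|LR (single model)|0.8073|0.8183|0.7825|
|Tree (single model)|0.8137|0.7994|0.7717|
|lr_final|-|0.8183|0.7881|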
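
The meta-learner comparison above repeats the same fit-and-score steps for every model. A more compact way to run such a comparison is to loop over a dictionary of candidates, as in the sketch below. The helper name and the synthetic three-column data standing in for the first-level predictions are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def compare_meta_learners(candidates, X_meta, y_meta, X_meta_test, y_meta_test):
    """Fit each candidate and return {name: (holdout accuracy, test accuracy)}."""
    results = {}
    for name, model in candidates.items():
        model.fit(X_meta, y_meta)
        results[name] = (model.score(X_meta, y_meta),
                         model.score(X_meta_test, y_meta_test))
    return results


# Synthetic three-column stand-in for the first-level predictions (an assumption)
X, y = make_classification(n_samples=1200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=12)
X_meta, X_meta_test, y_meta, y_meta_test = train_test_split(
    X, y, test_size=0.5, random_state=12)

candidates = {
    "LR": LogisticRegression(),
    "tree": DecisionTreeClassifier(random_state=12),
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(random_state=12),
    "GBDT": GradientBoostingClassifier(random_state=12),
}

results = compare_meta_learners(candidates, X_meta, y_meta, X_meta_test, y_meta_test)
for name, (hold_acc, test_acc) in sorted(results.items(), key=lambda kv: -kv[1][1]):
    print(f"{name:>12}  holdout={hold_acc:.4f}  test={test_acc:.4f}")
```

Sorting by test accuracy makes the gap between holdout and test scores easy to scan, which is the overfitting signal discussed above for the tree-based meta-learners.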

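### 3. Optimization strategies for Blending

&emsp;&emsp;As with Stacking, Blending optimization splits into cross-training of the first-level learners (including automated hyperparameter optimization) and optimization of the meta-learner.

#### 3.1 Cross-training the first-level learners

&emsp;&emsp;First, the first-level learners in Blending can also be cross-trained, and the procedure is simpler than in Stacking. Because Blending requires the first-level learners to predict on new data, cross-training no longer needs to assemble an oof dataset; it only needs to train several models, have each predict on the same new data, and average the predictions. This mirrors how the first-level learners in Stacking predict on the test data.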
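
The averaging step described above can be shown in isolation. The sketch below trains one model per fold of the training part, has every fold's model predict on the same holdout set, and accumulates `prediction / n_splits`, which equals the plain mean of the fold predictions. The synthetic data and the single logistic regression learner are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

# Synthetic stand-in data (an assumption)
X, y = make_classification(n_samples=1000, n_features=8, random_state=12)
X_tr, X_hold, y_tr, y_hold = train_test_split(X, y, test_size=0.2, random_state=12)

n_splits = 5
kf = KFold(n_splits=n_splits, shuffle=True, random_state=12)

hold_pred = np.zeros(X_hold.shape[0])   # running average of holdout predictions
fold_preds = []                         # kept only to verify the average below

for fit_idx, _ in kf.split(X_tr):
    # Fit on four of the five folds; the remaining fold is not needed in Blending
    model = LogisticRegression(max_iter=1000).fit(X_tr[fit_idx], y_tr[fit_idx])
    fold_pred = model.predict_proba(X_hold)[:, 1]
    fold_preds.append(fold_pred)
    hold_pred += fold_pred / n_splits

# Accumulating pred / n_splits is the same as averaging the fold predictions
print(np.allclose(hold_pred, np.mean(fold_preds, axis=0)))
```

Each fold's model is trained on a slightly different subset, so the averaged column is usually more stable than the prediction of any single fitted model.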

&emsp;&emsp;In practice, we implement this by rewriting the previously defined train_cross function.

In [105]:
from manual_ensemble import *

In [106]:
train_cross?

Signature:
train_cross(
    X_train,
    y_train,
    X_test,
    estimators,
    n_splits=5,
    random_state=12,
)
Docstring:
Cross-training function for first-level learners in Stacking fusion

:param X_train: training set features
:param y_train: training set labels
:param X_test: test set features
:param estimators: first-level learners, a list of (name, estimator) tuples
:param n_splits: number of cross-training folds
:param random_state: random seed

:return: the oof training data created by cross-training and the averaged test set predictions
File:      d:\work\jupyter\telco\正式课程\manual_ensemble.py
Type:      function


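&emsp;&emsp;train_cross is rewritten as follows. The core change is a blending flag: for Blending, the function additionally splits a holdout set off the training set, and instead of concatenating oof predictions it predicts repeatedly and averages. Moreover, in Blending the oof dataset no longer covers every row of the training set, and the further split happens inside the function, so the output must carry the labels that match the oof rows. The oof dataset therefore now contains both features and labels, with the label in the last column for easy extraction: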

In [12]:
y_train.name

'Churn'

In [1]:
def train_cross(X_train, y_train, X_test, estimators, test_size=0.2, n_splits=5, random_state=12, blending=False):
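    """
    Cross-training function for first-level learners in Stacking and Blending fusion
    
    :param X_train: training set features
    :param y_train: training set labels
    :param X_test: test set features
    :param estimators: first-level learners, a list of (name, estimator) tuples
    :param n_splits: number of cross-training folds
    :param test_size: share of the training set used as the holdout set in Blending
    :param random_state: random seed
    :param blending: whether to run Blending fusion
    
    :return: the oof training data (features plus the label in the last column) and the averaged test set predictions
    """    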
    # Prepare the data; for Blending, first split a holdout set off the training set
    if blending:
        X, X1, y, y1 = train_test_split(X_train, y_train, test_size=test_size, random_state=random_state)
        m = X1.shape[0]
        X = X.reset_index(drop=True)
        y = y.reset_index(drop=True)
        X1 = X1.reset_index(drop=True)
        y1 = y1.reset_index(drop=True)
    else:
        m = X_train.shape[0]
        X = X_train.reset_index(drop=True)
        y = y_train.reset_index(drop=True)
    
    n = len(estimators)
    m_test = X_test.shape[0]
    
    columns = []
    for estimator in estimators:
        columns.append(estimator[0] + '_oof')
    
    train_oof = pd.DataFrame(np.zeros((m, n)), columns=columns)
    
    columns = []
    for estimator in estimators:
        columns.append(estimator[0] + '_predict')
    
    test_predict = pd.DataFrame(np.zeros((m_test, n)), columns=columns)
    
    # Instantiate the K-fold splitter
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    
    # Run cross-training
    for estimator in estimators:
        model = estimator[1]
        oof_colName = estimator[0] + '_oof'
        predict_colName = estimator[0] + '_predict'
        
        for train_part_index, eval_index in kf.split(X, y):
            # Fit on the training folds
            X_train_part = X.loc[train_part_index]
            y_train_part = y.loc[train_part_index]
            model.fit(X_train_part, y_train_part)
            if blending:
                # Predict on the holdout set and accumulate the average
                train_oof[oof_colName] += model.predict_proba(X1)[:, 1] / n_splits
                # Predict on the test set and accumulate the average
                test_predict[predict_colName] += model.predict_proba(X_test)[:, 1] / n_splits
            else:
                # Select the validation fold
                X_eval_part = X.loc[eval_index]
                # Write the validation-fold predictions into the oof dataset
                # (a single .loc call; chained assignment may fail to update the frame)
                train_oof.loc[eval_index, oof_colName] = model.predict_proba(X_eval_part)[:, 1]
                # Predict on the test set and accumulate the average
                test_predict[predict_colName] += model.predict_proba(X_test)[:, 1] / n_splits
    
    # Append the label column
    if blending:
        train_oof[y1.name] = y1
    else:
        train_oof[y.name] = y
        
    return train_oof, test_predict

Then test the function:

In [33]:
estimators = [('lr', grid_lr_blending.best_estimator_), ('tree', grid_tree_blending.best_estimator_), ('rf', grid_RF_blending.best_estimator_)]

In [34]:
train_oof_blending, test_predict_blending = train_cross(X_train_OE, y_train, X_test_OE, estimators, blending=True)

In [35]:
train_oof_blending

Unnamed: 0,lr_oof,tree_oof,rf_oof,Churn
0,0.065765,0.042398,0.044780,0
1,0.471787,0.349819,0.407269,1
2,0.150529,0.233834,0.172493,0
3,0.713892,0.684692,0.735245,1
4,0.426013,0.516487,0.436409,0
...,...,...,...,...
1052,0.022700,0.030196,0.046418,0
1053,0.593308,0.647285,0.590749,0
1054,0.010672,0.030196,0.043482,0
1055,0.003120,0.030196,0.015776,0


The oof features and labels can be extracted separately as follows:

In [43]:
# oof dataset features
train_oof_blending.iloc[:, :-1]

Unnamed: 0,lr_oof,tree_oof,rf_oof
0,0.065765,0.042398,0.044780
1,0.471787,0.349819,0.407269
2,0.150529,0.233834,0.172493
3,0.713892,0.684692,0.735245
4,0.426013,0.516487,0.436409
...,...,...,...
1052,0.022700,0.030196,0.046418
1053,0.593308,0.647285,0.590749
1054,0.010672,0.030196,0.043482
1055,0.003120,0.030196,0.015776


In [44]:
# oof dataset labels
train_oof_blending.iloc[:, -1]

0       0
1       1
2       0
3       1
4       0
       ..
1052    0
1053    0
1054    0
1055    0
1056    0
Name: Churn, Length: 1057, dtype: int64

In [36]:
test_predict_blending

| | lr_predict | tree_predict | rf_predict |
|---|---|---|---|
| 0 | 0.028461 | 0.030196 | 0.018056 |
| 1 | 0.246439 | 0.233834 | 0.285498 |
| 2 | 0.003332 | 0.030196 | 0.009496 |
| 3 | 0.023823 | 0.030196 | 0.007995 |
| 4 | 0.058757 | 0.042398 | 0.016456 |
| ... | ... | ... | ... |
| 1756 | 0.162301 | 0.102205 | 0.206897 |
| 1757 | 0.023821 | 0.030196 | 0.023841 |
| 1758 | 0.149621 | 0.203031 | 0.145774 |
| 1759 | 0.493815 | 0.492869 | 0.456376 |


Then feed these datasets into a logistic regression meta-learner and check the result:

In [127]:
lr = LogisticRegression().fit(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1])
print('The results of LR-final:')
print('Train2-Accuracy: %f, Test-Accuracy: %f' % (lr.score(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1]), lr.score(test_predict_blending, y_test)))

The results of LR-final:
Train2-Accuracy: 0.825922, Test-Accuracy: 0.793299


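The final result shows a clear improvement, with test accuracy reaching 0.7933.

> Of course, model fusion results involve a degree of chance. In most cases, both Stacking and Blending improve model performance only at roughly the third decimal place.

&emsp;&emsp;In addition, the rewritten train_cross function can now run the first-level learner cross-training and build the train_oof dataset for both Stacking and Blending. The rewritten function should later replace the original train_cross function in manual_ensemble.py.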
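The Blending branch of the cross-training step can be illustrated on synthetic data. The sketch below is a simplified, hypothetical stand-in (`blending_oof_sketch` is not the tutorial's train_cross): it assumes the training data is split into a fitting part and a holdout part, that every fold model is fitted on the fitting part only, and that its predictions on the holdout set and on the test set are averaged over the folds.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeClassifier


def blending_oof_sketch(X, y, X_test, estimators, n_splits=5, holdout_size=0.2, seed=12):
    """Hypothetical, simplified stand-in for train_cross(..., blending=True)."""
    # Hold out part of the training data; the meta-learner is trained on it only.
    X_fit, X_hold, y_fit, y_hold = train_test_split(
        X, y, test_size=holdout_size, random_state=seed)

    train_oof = pd.DataFrame(index=range(len(X_hold)))
    test_predict = pd.DataFrame(index=range(len(X_test)))
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)

    for name, model in estimators:
        hold_pred = np.zeros(len(X_hold))
        test_pred = np.zeros(len(X_test))
        for fit_idx, _ in folds.split(X_fit):
            model.fit(X_fit[fit_idx], y_fit[fit_idx])
            # Every fold model scores the same holdout and test rows; average them.
            hold_pred += model.predict_proba(X_hold)[:, 1] / n_splits
            test_pred += model.predict_proba(X_test)[:, 1] / n_splits
        train_oof[name + '_oof'] = hold_pred
        test_predict[name + '_predict'] = test_pred

    # The label column goes last, mirroring the layout of train_oof_blending.
    train_oof['label'] = y_hold
    return train_oof, test_predict


# Synthetic data only; none of the numbers relate to the Telco dataset.
X_all, y_all = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.25, random_state=0)
demo_estimators = [('lr', LogisticRegression(max_iter=1000)),
                   ('tree', DecisionTreeClassifier(max_depth=3, random_state=0))]
oof_demo, test_demo = blending_oof_sketch(X_tr, y_tr, X_te, demo_estimators)
print(oof_demo.shape, test_demo.shape)
```

In this layout the oof frame has one row per holdout sample rather than one per training sample, which is why the Blending oof dataset is much smaller than the full training set.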

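#### 3.2 Meta-learner optimization

&emsp;&emsp;Next, we continue with the optimization of the meta-learner in Blending. This part is almost identical to Stacking: choose a decision tree or a logistic regression model, run three stages (single-model tuning, cross-training, and Bagging ensembling), and output the best of the 6 results (2 models × 3 stages).

- Manual implementation

&emsp;&emsp;As before, we first try the optimization by hand, starting with hyperparameter tuning of the logistic regression model: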

In [130]:
# Define the hyperparameter space
logistic_param = [
    {'thr': np.arange(0.1, 1, 0.1).tolist(), 'penalty': ['l1'], 'C': np.arange(0.1, 1.1, 0.1).tolist(), 'solver': ['saga']}, 
    {'thr': np.arange(0.1, 1, 0.1).tolist(), 'penalty': ['l2'], 'C': np.arange(0.1, 1.1, 0.1).tolist(), 'solver': ['lbfgs', 'newton-cg', 'sag', 'saga']}, 
]

# Instantiate the estimator
logistic_final = logit_threshold(max_iter=int(1e6))

# Run the grid search
lfg = GridSearchCV(estimator = logistic_final,
                   param_grid = logistic_param,
                   scoring='accuracy',
                   n_jobs = 15).fit(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1])

lfg.score(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1]), lfg.score(test_predict_blending, y_test)

(0.8278145695364238, 0.787052810902896)

In [131]:
lfg.best_params_

{'C': 0.2, 'penalty': 'l2', 'solver': 'lbfgs', 'thr': 0.4}
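The `thr` entry in the search space is the decision threshold of the custom logit_threshold estimator, which is defined earlier in this case study and not repeated here. As a rough illustration of how a threshold can be exposed to GridSearchCV as an ordinary hyperparameter, here is a minimal sketch. `ThresholdLogitSketch` is a hypothetical name, it assumes binary 0/1 labels, and it is not the actual logit_threshold implementation.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


class ThresholdLogitSketch(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in: logistic regression plus a tunable cut-off (0/1 labels)."""

    def __init__(self, thr=0.5, C=1.0, max_iter=1000):
        # Parameters are stored unchanged so that get_params/clone work in a grid search.
        self.thr = thr
        self.C = C
        self.max_iter = max_iter

    def fit(self, X, y):
        self.clf_ = LogisticRegression(C=self.C, max_iter=self.max_iter).fit(X, y)
        self.classes_ = self.clf_.classes_
        return self

    def predict_proba(self, X):
        return self.clf_.predict_proba(X)

    def predict(self, X):
        # Label a row positive when its class-1 probability reaches thr.
        return (self.predict_proba(X)[:, 1] >= self.thr).astype(int)


# Synthetic data only; the grid is deliberately tiny.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
search = GridSearchCV(ThresholdLogitSketch(),
                      {'thr': [0.3, 0.4, 0.5, 0.6], 'C': [0.1, 1.0]},
                      scoring='accuracy', cv=5).fit(X_demo, y_demo)
print(search.best_params_)
```

Because the threshold only changes `predict`, the fitted probabilities are identical across `thr` values; the search simply picks the cut-off that gives the best cross-validated accuracy.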

&emsp;&emsp;Next is the decision tree model:

In [132]:
# Instantiate the decision tree estimator
tree_final = DecisionTreeClassifier()

# min_samples_split must be an integer >= 2, so the range starts at 2
tree_param = {'max_depth': np.arange(2, 16, 1).tolist(), 
              'min_samples_split': np.arange(2, 5, 1).tolist(), 
              'min_samples_leaf': np.arange(1, 4, 1).tolist(), 
              'max_leaf_nodes':np.arange(6, 30, 1).tolist()}

# Instantiate the grid search estimator
tfg = GridSearchCV(estimator = tree_final,
                   param_grid = tree_param,
                   n_jobs = 12).fit(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1])

tfg.score(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1]), tfg.score(test_predict_blending, y_test)

(0.826868495742668, 0.7859170925610448)

In [133]:
tfg.best_params_

{'max_depth': 2,
 'max_leaf_nodes': 6,
 'min_samples_leaf': 1,
 'min_samples_split': 2}

Next comes cross-training:

In [119]:
# Logistic regression cross-training
res = np.zeros(test_predict_blending.shape[0])

folds = RepeatedKFold(n_splits=5, n_repeats=2, random_state=12)
n_fits = folds.get_n_splits()  # 5 splits x 2 repeats = 10 fitted models

for trn_idx, val_idx in folds.split(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1]):
    lfg = GridSearchCV(estimator = logit_threshold(max_iter=int(1e6)),
                       param_grid = logistic_param,
                       scoring='accuracy',
                       n_jobs = 15)
    # split() yields positional indices, so select rows with iloc
    lfg.fit(train_oof_blending.iloc[trn_idx, :-1], train_oof_blending.iloc[trn_idx, -1])
    # Average the test-set probabilities over all fitted models
    res += lfg.predict_proba(test_predict_blending)[:, 1] / n_fits
    
print(accuracy_score((res >= 0.5) * 1, y_test))

0.7881885292447472


In [45]:
# Decision tree cross-training
res = np.zeros(test_predict_blending.shape[0])

folds = RepeatedKFold(n_splits=5, n_repeats=2, random_state=12)
n_fits = folds.get_n_splits()  # 5 splits x 2 repeats = 10 fitted models

for trn_idx, val_idx in folds.split(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1]):
    tfg = GridSearchCV(estimator = DecisionTreeClassifier(),
                       param_grid = tree_param,
                       n_jobs = 12)
    # split() yields positional indices, so select rows with iloc
    tfg.fit(train_oof_blending.iloc[trn_idx, :-1], train_oof_blending.iloc[trn_idx, -1])
    # Average the test-set probabilities over all fitted models
    res += tfg.predict_proba(test_predict_blending)[:, 1] / n_fits

print(accuracy_score((res >= 0.5) * 1, y_test))

0.7830777967064169
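Both cross-training cells above average the test-set probabilities over every model fitted by RepeatedKFold, which with 5 splits and 2 repeats means 10 models. The following self-contained sketch reproduces that averaging pattern on synthetic data, with a plain LogisticRegression standing in for the grid-searched meta-learner (an assumption made only to keep the example short).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedKFold, train_test_split

# Synthetic stand-ins for the oof features/labels and the test-set predictions.
X_all, y_all = make_classification(n_samples=500, n_features=6, random_state=1)
X_meta, X_new, y_meta, y_new = train_test_split(X_all, y_all, test_size=0.3, random_state=1)

folds = RepeatedKFold(n_splits=5, n_repeats=2, random_state=12)
n_models = folds.get_n_splits()  # 5 splits x 2 repeats = 10 fitted models

avg_proba = np.zeros(X_new.shape[0])
for trn_idx, _ in folds.split(X_meta, y_meta):
    model = LogisticRegression(max_iter=1000).fit(X_meta[trn_idx], y_meta[trn_idx])
    # Each model contributes an equal share of the final probability.
    avg_proba += model.predict_proba(X_new)[:, 1] / n_models

# Convert the averaged probability into a class label with a 0.5 cut-off.
y_pred = (avg_proba >= 0.5).astype(int)
print(n_models, accuracy_score(y_new, y_pred))
```

Reading the divisor from `get_n_splits()` keeps the average correct if `n_splits` or `n_repeats` is changed later.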


&emsp;&emsp;Finally, try passing a hyperparameter-tuned base estimator into Bagging for training:

In [121]:
start = time.time()

# Define the hyperparameter space
parameter_space = {
    "n_estimators": range(10, 21), 
    "max_samples": np.arange(0.1, 1.1, 0.1).tolist(),
    'max_features':np.arange(0.1, 1.1, 0.1).tolist()}

# Instantiate the model and the search estimator
bagging_final = BaggingClassifier(LogisticRegression(C=0.3, 
                                                     penalty='l1',
                                                     solver='saga'))
BG = GridSearchCV(bagging_final, parameter_space, n_jobs=15).fit(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1])

BG.score(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1]), BG.score(test_predict_blending, y_test)

(0.8221381267738883, 0.7864849517319704)

In [114]:
# Define the hyperparameter space
parameter_space = {
    "n_estimators": range(10, 21), 
    "max_samples": np.arange(0.1, 1.1, 0.1).tolist()}

# Instantiate the model and the search estimator
bagging_final = BaggingClassifier(DecisionTreeClassifier(max_depth=3,
                                                         max_leaf_nodes=7,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2))
BG = GridSearchCV(bagging_final, parameter_space, n_jobs=15).fit(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1])

BG.score(train_oof_blending.iloc[:, :-1], train_oof_blending.iloc[:, -1]), BG.score(test_predict_blending, y_test)

(0.8334910122989593, 0.7932992617830777)

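The optimization workflow still produces a good result. As the steps above suggest, the previously defined final_model_opt function can again run the whole workflow; only the input data objects need to be adjusted.

- Blending meta-learner optimization with final_model_opt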

In [140]:
final_model_opt?

Signature: final_model_opt(final_model_l, param_space_l, X, y, test_predict)
Docstring:
Automatic optimization and prediction function for the Stacking meta-learner

:param final_model_l: list of candidate meta-learners
:param param_space_l: list of hyperparameter search spaces, one per candidate meta-learner
:param X: oof_train training features
:param y: oof_train training labels
:param test_predict: test-set predictions output by the first-level estimators

:return: the best scores of the candidate meta-learners on oof_train, and the best meta-learner's predictions on test_predict
File:      c:\users\vdmion\appdata\local\temp\ipykernel_16160\67447601.py
Type:      function


In [141]:
lr = logit_threshold()
tree = DecisionTreeClassifier()
final_model_l = [lr, tree]

In [142]:
lr_final_param = [{'thr': np.arange(0.1, 1.1, 0.1).tolist(), 'penalty': ['l1'], 'C': np.arange(0.1, 1.1, 0.1).tolist(), 'solver': ['saga']}, 
                  {'thr': np.arange(0.1, 1.1, 0.1).tolist(), 'penalty': ['l2'], 'C': np.arange(0.1, 1.1, 0.1).tolist(), 'solver': ['lbfgs', 'newton-cg', 'sag', 'saga']}]

tree_final_param = {'max_depth': np.arange(2, 16, 1).tolist(), 
                    'min_samples_split': np.arange(2, 5, 1).tolist(), 
                    'min_samples_leaf': np.arange(1, 4, 1).tolist(), 
                    'max_leaf_nodes':np.arange(6, 30, 1).tolist()}

param_space_l = [lr_final_param, tree_final_param]

In [149]:
best_res_final, best_test_predict_final = final_model_opt(final_model_l, 
                                                          param_space_l, 
                                                          train_oof_blending.iloc[:, :-1], 
                                                          train_oof_blending.iloc[:, -1], 
                                                          test_predict_blending)

In [160]:
accuracy_score((best_test_predict_final >= 0.5) * 1, y_test)

0.7950028392958546
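The body of final_model_opt is defined earlier in the case study and is not shown in this section. To make the selection idea concrete, the sketch below covers only the first of its stages, under stated assumptions: each candidate meta-learner is grid-searched, candidates are compared on cross-validated accuracy, and the winner predicts class-1 probabilities for the new data. `pick_best_meta_learner` is a hypothetical helper, not the tutorial's function, and it leaves out the cross-training and Bagging stages.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier


def pick_best_meta_learner(models, param_spaces, X, y, X_new, cv=5):
    """Hypothetical, reduced version of the selection idea; not final_model_opt itself."""
    best_score, best_search = -1.0, None
    for model, space in zip(models, param_spaces):
        search = GridSearchCV(model, space, scoring='accuracy', cv=cv).fit(X, y)
        # Compare candidates on cross-validated accuracy, never on the test set.
        if search.best_score_ > best_score:
            best_score, best_search = search.best_score_, search
    # Return the winning score and the winner's class-1 probabilities on X_new.
    return best_score, best_search.predict_proba(X_new)[:, 1]


# Synthetic stand-ins for the oof features/labels and the test-set predictions.
X_all, y_all = make_classification(n_samples=400, n_features=6, random_state=7)
X_meta, X_new, y_meta, y_new = train_test_split(X_all, y_all, test_size=0.25, random_state=7)

candidates = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]
spaces = [{'C': [0.1, 0.5, 1.0]}, {'max_depth': [2, 3, 4], 'min_samples_split': [2, 3, 4]}]

demo_score, demo_proba = pick_best_meta_learner(candidates, spaces, X_meta, y_meta, X_new)
print(demo_score, demo_proba[:5])
```

Selecting on the cross-validated score rather than on test accuracy keeps the test set untouched until the final evaluation.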

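#### 3.3 Automatic hyperparameter optimization during first-level learner cross-training

&emsp;&emsp;The last piece is automatic hyperparameter optimization inside the first-level cross-training. Blending and Stacking differ only in which training data they use; the training workflow itself is identical. This functionality can therefore reuse the optimization workflow defined in the previous section: instantiate estimators that search their own hyperparameters, then pass them to the modified train_cross function. First, instantiate the estimators: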

In [153]:
lr_hyper = lr_cascade(lr_params_space)
tree_hyper = tree_cascade(tree_params_space)
RF_hyper = RF_cascade(RF_params_space)

estimators = [('lr', lr_hyper), ('tree', tree_hyper), ('rf', RF_hyper)]

Then run the cross-training with automatic hyperparameter optimization and create the train_oof_blending and test_predict_blending datasets:

In [154]:
train_oof_blending, test_predict_blending = train_cross(X_train_OE, y_train, X_test_OE, estimators=estimators, blending=True)

100%|███████████████████████████████████████████████| 20/20 [02:34<00:00,  7.74s/trial, best loss: -0.7867455621301775]
100%|███████████████████████████████████████████████| 20/20 [03:11<00:00,  9.58s/trial, best loss: -0.7881656804733728]
100%|███████████████████████████████████████████████| 20/20 [02:13<00:00,  6.66s/trial, best loss: -0.7898714451578608]
100%|███████████████████████████████████████████████| 20/20 [02:08<00:00,  6.43s/trial, best loss: -0.7934197826178186]
100%|███████████████████████████████████████████████| 20/20 [02:04<00:00,  6.21s/trial, best loss: -0.7870298096157342]
100%|███████████████████████████████████████████| 1000/1000 [00:42<00:00, 23.59trial/s, best loss: -0.7988165680473374]
100%|███████████████████████████████████████████| 1000/1000 [00:41<00:00, 24.03trial/s, best loss: -0.7917159763313609]
100%|███████████████████████████████████████████| 1000/1000 [00:41<00:00, 23.95trial/s, best loss: -0.7986249248115043]
100%|███████████████████████████████████
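Each progress bar above corresponds to one hyperparameter search, so a fresh search runs on every fold. The lr_cascade, tree_cascade and RF_cascade classes are defined earlier in the case study. As a stand-in that needs nothing beyond scikit-learn, the sketch below uses GridSearchCV, which is also an estimator whose `fit` runs a search and then refits the best model, to show why such a self-tuning estimator can be dropped into a fold loop unchanged. The parameter grid and data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=3)

# Stand-in for lr_cascade: calling fit() runs the search, then refits the best model.
self_tuning_lr = GridSearchCV(LogisticRegression(max_iter=1000),
                              {'C': [0.01, 0.1, 1.0, 10.0]},
                              scoring='accuracy', cv=3)

oof = np.zeros(len(y_demo))
chosen_C = []
for trn_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=12).split(X_demo):
    # A fresh search runs on each fold's training rows.
    self_tuning_lr.fit(X_demo[trn_idx], y_demo[trn_idx])
    chosen_C.append(self_tuning_lr.best_params_['C'])
    # The refitted best model predicts the rows it has not seen.
    oof[val_idx] = self_tuning_lr.predict_proba(X_demo[val_idx])[:, 1]

print(chosen_C)
```

The chosen hyperparameters may differ from fold to fold, which is expected: each fold tunes on its own training rows.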

Next, pass the datasets to the meta-learner optimization function to test the final result:

In [155]:
best_res_final, best_test_predict_final = final_model_opt(final_model_l, 
                                                          param_space_l, 
                                                          train_oof_blending.iloc[:, :-1], 
                                                          train_oof_blending.iloc[:, -1], 
                                                          test_predict_blending)

In [163]:
accuracy_score((best_test_predict_final >= 0.5) * 1, y_test)

0.7950028392958546

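This completes the automatically trained and optimized Blending fusion. As before, mind the computation time: for Blending, training and tuning the first-level learners and producing the oof dataset takes about 50 minutes.

#### 3.4 Testing the performance limit of Blending

&emsp;&emsp;Finally, as in the previous section, we increase the number of hyperparameter search iterations for the single models during cross-training to test the performance limit of Blending under the current workflow: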

In [110]:
lr_hyper = lr_cascade(lr_params_space, max_evals=50)
tree_hyper = tree_cascade(tree_params_space)
RF_hyper = RF_cascade(RF_params_space, max_evals=1000)

estimators = [('lr', lr_hyper), ('tree', tree_hyper), ('rf', RF_hyper)]

In [111]:
train_oof_blending, test_predict_blending = train_cross(X_train_OE, y_train, X_test_OE, estimators=estimators, blending=True)

100%|███████████████████████████████████████████████| 50/50 [05:16<00:00,  6.32s/trial, best loss: -0.7924260355029585]
100%|███████████████████████████████████████████████| 50/50 [05:28<00:00,  6.58s/trial, best loss: -0.7912426035502959]
100%|███████████████████████████████████████████████| 50/50 [05:13<00:00,  6.28s/trial, best loss: -0.7922346720382727]
100%|███████████████████████████████████████████████| 50/50 [05:07<00:00,  6.15s/trial, best loss: -0.7964955866101529]
100%|███████████████████████████████████████████████| 50/50 [05:18<00:00,  6.37s/trial, best loss: -0.7893947151230293]
100%|███████████████████████████████████████████| 1000/1000 [00:41<00:00, 23.90trial/s, best loss: -0.7988165680473374]
100%|███████████████████████████████████████████| 1000/1000 [00:41<00:00, 24.02trial/s, best loss: -0.7917159763313609]
100%|███████████████████████████████████████████| 1000/1000 [00:41<00:00, 24.12trial/s, best loss: -0.7986249248115043]
100%|███████████████████████████████████

In [155]:
best_res_final, best_test_predict_final = final_model_opt(final_model_l, 
                                                          param_space_l, 
                                                          train_oof_blending.iloc[:, :-1], 
                                                          train_oof_blending.iloc[:, -1], 
                                                          test_predict_blending)

In [157]:
accuracy_score((best_test_predict_final >= 0.5) * 1, y_test)

0.7961385576377058

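In the limit test, one run takes about an hour and a half. The final fusion result is slightly better than the single-model results and slightly worse than the automatic Stacking result from the previous section (0.7978). This completes the full manual and automatic Blending fusion and optimization workflow.

&emsp;&emsp;Although Blending does worse than Stacking here, this result holds only for the current dataset split ratio and random seed; if these settings change, the final fusion result changes with them. Adjusting how the holdout set is split and used is an advanced topic in Blending optimization and a strategy unique to Blending. It is discussed in detail in the next section.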
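The dependence on the holdout split can be checked directly. The sketch below is a deliberately reduced Blending pipeline on synthetic data (two untuned base models, a logistic regression meta-learner, no cross-training), so its numbers say nothing about the Telco results; it only shows how to vary the holdout seed and record the resulting test accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = make_classification(n_samples=800, n_features=8, random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.25, random_state=5)

scores = {}
for seed in range(5):
    # Only the holdout split changes between runs.
    X_fit, X_hold, y_fit, y_hold = train_test_split(
        X_tr, y_tr, test_size=0.2, random_state=seed)
    base = [LogisticRegression(max_iter=1000).fit(X_fit, y_fit),
            DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_fit, y_fit)]
    # First-level predictions become the meta-learner's features.
    hold_feats = np.column_stack([m.predict_proba(X_hold)[:, 1] for m in base])
    test_feats = np.column_stack([m.predict_proba(X_te)[:, 1] for m in base])
    meta = LogisticRegression(max_iter=1000).fit(hold_feats, y_hold)
    scores[seed] = meta.score(test_feats, y_te)

print(scores)
```

Comparing the spread of these scores with the gap between two fusion methods gives a rough sense of whether that gap is larger than split-to-split noise.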