# <center> [Kaggle] Telco Customer Churn: Customer Churn Prediction Case Study

---

## <font face="仿宋">Overview of Part 4

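&emsp;&emsp;<font face="仿宋">In Parts 2 and 3 of this case study, we covered feature engineering in detail. Broadly, feature engineering falls into three categories: data preprocessing, feature derivation, and feature selection. Data preprocessing cleans and organizes the dataset to the point where it can be modeled. It includes missing-value handling, outlier handling, and data re-encoding, and it must be done before any modeling. Feature derivation and feature selection are better viewed as optimization techniques that help a model break through the performance ceiling of the current dataset. In Part 2 we also gave a complete treatment of how to train, optimize, and interpret interpretable machine learning models, namely logistic regression and decision trees, and so far we have relied mainly on these two algorithms for model testing in each part.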

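&emsp;&emsp;<font face="仿宋">In Part 4, we turn to practical techniques for training and optimizing ensemble learning models. Ensemble models are less interpretable than logistic regression and decision trees, but in most modeling scenarios they produce better predictions, which is why they are currently the most commonly used algorithms when performance is the priority.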

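&emsp;&emsp;<font face="仿宋">Overall, this part has a single goal: to use a range of optimization methods to reach the performance ceiling of each mainstream ensemble algorithm. In other words, we focus on single-model optimization strategies. The algorithms covered are random forest, XGBoost, LightGBM, and CatBoost, currently the most widely used ensemble methods. The optimization strategies include hyperparameter optimizers, feature derivation and selection, and single-model self-fusion. These are among the most advanced and effective methods for improving single-model performance, and also the most complex. Some rest on fairly deep theory and others are rules of thumb, but either way we aim to push every ensemble algorithm to its limit on the current dataset. Note that throughout this process we treat the feature derivation and selection methods introduced earlier as model optimization methods, and we judge them solely by the final model results. Large-scale feature derivation and selection around ensemble models is also where these techniques deliver the most value.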

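&emsp;&emsp;<font face="仿宋">Once single models have reached their limit, we move on to the next stage, model fusion. Further multi-model fusion, or even multi-layer fusion, is only meaningful and effective once each single model has been pushed to its limit.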

---

# <center>Part 4. Training and Optimization Techniques for Ensemble Algorithms

In [30]:
# Core data science libraries
import numpy as np
import pandas as pd

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Time module
import time

import warnings
warnings.filterwarnings('ignore')

# scikit-learn
# Data preprocessing
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

# Metrics and data-splitting utilities (KFold is used for the cross-training folds below)
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split, KFold

# Common estimators
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Grid search
from sklearn.model_selection import GridSearchCV

# Base classes for custom estimators
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin

# Custom modules
from telcoFunc import *
# Feature derivation module
import features_creation as fc
from features_creation import *

# Introspection and regular expressions
import inspect, re

# Other modules
from tqdm import tqdm
import gc

&emsp;&emsp;Next, we carry out the data cleaning steps from Part 1:

In [31]:
# Load the data
tcc = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

# Label the continuous and categorical columns
# Categorical columns
category_cols = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
                'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 
                'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
                'PaymentMethod']

# Continuous columns
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']

# Label
target = 'Churn'

# ID column
ID_col = 'customerID'

# Check that the column groups cover every column
assert len(category_cols) + len(numeric_cols) + 2 == tcc.shape[1]

# Convert the continuous columns (blank TotalCharges entries become NaN)
tcc['TotalCharges'] = tcc['TotalCharges'].apply(lambda x: x if x != ' ' else np.nan).astype(float)
tcc['MonthlyCharges'] = tcc['MonthlyCharges'].astype(float)

# Fill missing values
tcc['TotalCharges'] = tcc['TotalCharges'].fillna(0)

# Encode the label manually (assignment avoids in-place replace on a column selection)
tcc['Churn'] = tcc['Churn'].map({'Yes': 1, 'No': 0})

In [32]:
features = tcc.drop(columns=[ID_col, target]).copy()
labels = tcc['Churn'].copy()

&emsp;&emsp;We also create the ordinal-encoded dataset and the dataset of derived time-series features:

In [33]:
# Split into training and test sets
train, test = train_test_split(tcc, random_state=22)

X_train = train.drop(columns=[ID_col, target]).copy()
X_test = test.drop(columns=[ID_col, target]).copy()

y_train = train['Churn'].copy()
y_test = test['Churn'].copy()

X_train_seq = pd.DataFrame()
X_test_seq = pd.DataFrame()

# Derive the year
X_train_seq['tenure_year'] = ((72 - X_train['tenure']) // 12) + 2014
X_test_seq['tenure_year'] = ((72 - X_test['tenure']) // 12) + 2014

# Derive the month
X_train_seq['tenure_month'] = (72 - X_train['tenure']) % 12 + 1
X_test_seq['tenure_month'] = (72 - X_test['tenure']) % 12 + 1

# Derive the quarter
X_train_seq['tenure_quarter'] = ((X_train_seq['tenure_month']-1) // 3) + 1
X_test_seq['tenure_quarter'] = ((X_test_seq['tenure_month']-1) // 3) + 1

# One-hot encoding (fitted on the training set only)
enc = preprocessing.OneHotEncoder()
enc.fit(X_train_seq)

seq_new = list(X_train_seq.columns)

# Build one-hot encoded DataFrames with column names
X_train_seq = pd.DataFrame(enc.transform(X_train_seq).toarray(), 
                           columns = cate_colName(enc, seq_new, drop=None))

X_test_seq = pd.DataFrame(enc.transform(X_test_seq).toarray(), 
                          columns = cate_colName(enc, seq_new, drop=None))

# Restore the original index
X_train_seq.index = X_train.index
X_test_seq.index = X_test.index
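
As a quick sanity check on the date derivation above, the sketch below applies the same arithmetic to a few `tenure` values. It only restates what the formulas imply: a tenure of 72 months maps to January 2014 and a tenure of 0 maps to January 2020. The helper name `tenure_to_period` is made up for illustration and does not appear in the notebook.

```python
def tenure_to_period(tenure):
    """Hypothetical helper mirroring the notebook's year/month/quarter formulas."""
    offset = 72 - tenure              # months elapsed since the earliest start month
    year = offset // 12 + 2014
    month = offset % 12 + 1
    quarter = (month - 1) // 3 + 1
    return year, month, quarter

for t in (72, 61, 1, 0):
    print(t, tenure_to_period(t))
```

A customer with `tenure = 1` therefore lands in December 2019 (quarter 4), one month before the implied reference date.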

In [34]:
ord_enc = OrdinalEncoder()
ord_enc.fit(X_train[category_cols])

X_train_OE = pd.DataFrame(ord_enc.transform(X_train[category_cols]), columns=category_cols)
X_train_OE.index = X_train.index
X_train_OE = pd.concat([X_train_OE, X_train[numeric_cols]], axis=1)

X_test_OE = pd.DataFrame(ord_enc.transform(X_test[category_cols]), columns=category_cols)
X_test_OE.index = X_test.index
X_test_OE = pd.concat([X_test_OE, X_test[numeric_cols]], axis=1)

Next come the third-party libraries, prepared data, and trained models needed for the model fusion part:

In [35]:
# Third-party libraries new to this section
from joblib import dump, load
from sklearn.ensemble import VotingClassifier
from hyperopt import hp, fmin, tpe
from numpy.random import RandomState
from sklearn.model_selection import cross_val_score

In [51]:
class VotingClassifier_threshold(BaseEstimator, ClassifierMixin, TransformerMixin):
    """Wraps VotingClassifier; with voting='soft', predict() applies the threshold `thr`."""
    
    def __init__(self, estimators, voting='hard', weights=None, thr=0.5):
        self.estimators = estimators
        self.voting = voting
        self.weights = weights
        self.thr = thr
        
    def fit(self, X, y):
        VC = VotingClassifier(estimators = self.estimators, 
                              voting = self.voting, 
                              weights = self.weights)
        
        VC.fit(X, y)
        self.clf = VC
        
        return self
        
    def predict_proba(self, X):
        if self.voting == 'soft':
            res_proba = self.clf.predict_proba(X)
        else:
            res_proba = None
        return res_proba
    
    def predict(self, X):
        if self.voting == 'soft':
            res = (self.clf.predict_proba(X)[:, 1] >= self.thr) * 1
        else:
            res = self.clf.predict(X)
        return res
    
    def score(self, X, y):
        acc = accuracy_score(self.predict(X), y)
        return acc

In [36]:
# Instantiate the KFold splitter
kf = KFold(n_splits=5, random_state=12, shuffle=True)

# Reset the index of the training features and labels
X_train_OE = X_train_OE.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)

train_part_index_l = []
eval_index_l = []

for train_part_index, eval_index in kf.split(X_train_OE, y_train):
    train_part_index_l.append(train_part_index)
    eval_index_l.append(eval_index)
    
# Training-fold features
X_train1 = X_train_OE.loc[train_part_index_l[0]]
X_train2 = X_train_OE.loc[train_part_index_l[1]]
X_train3 = X_train_OE.loc[train_part_index_l[2]]
X_train4 = X_train_OE.loc[train_part_index_l[3]]
X_train5 = X_train_OE.loc[train_part_index_l[4]]

# Validation-fold features
X_eval1 = X_train_OE.loc[eval_index_l[0]]
X_eval2 = X_train_OE.loc[eval_index_l[1]]
X_eval3 = X_train_OE.loc[eval_index_l[2]]
X_eval4 = X_train_OE.loc[eval_index_l[3]]
X_eval5 = X_train_OE.loc[eval_index_l[4]]

# Training-fold labels
y_train1 = y_train.loc[train_part_index_l[0]]
y_train2 = y_train.loc[train_part_index_l[1]]
y_train3 = y_train.loc[train_part_index_l[2]]
y_train4 = y_train.loc[train_part_index_l[3]]
y_train5 = y_train.loc[train_part_index_l[4]]

# Validation-fold labels
y_eval1 = y_train.loc[eval_index_l[0]]
y_eval2 = y_train.loc[eval_index_l[1]]
y_eval3 = y_train.loc[eval_index_l[2]]
y_eval4 = y_train.loc[eval_index_l[3]]
y_eval5 = y_train.loc[eval_index_l[4]]

train_set = [(X_train1, y_train1), 
             (X_train2, y_train2), 
             (X_train3, y_train3), 
             (X_train4, y_train4), 
             (X_train5, y_train5)]

eval_set = [(X_eval1, y_eval1), 
            (X_eval2, y_eval2), 
            (X_eval3, y_eval3), 
            (X_eval4, y_eval4), 
            (X_eval5, y_eval5)]

In [37]:
# Random forest model group
grid_RF_1 = load('grid_RF_1.joblib') 
grid_RF_2 = load('grid_RF_2.joblib') 
grid_RF_3 = load('grid_RF_3.joblib') 
grid_RF_4 = load('grid_RF_4.joblib') 
grid_RF_5 = load('grid_RF_5.joblib') 

RF_1 = grid_RF_1.best_estimator_
RF_2 = grid_RF_2.best_estimator_
RF_3 = grid_RF_3.best_estimator_
RF_4 = grid_RF_4.best_estimator_
RF_5 = grid_RF_5.best_estimator_

RF_l = [RF_1, RF_2, RF_3, RF_4, RF_5]

# Decision tree model group
grid_tree_1 = load('grid_tree_1.joblib')
grid_tree_2 = load('grid_tree_2.joblib')
grid_tree_3 = load('grid_tree_3.joblib')
grid_tree_4 = load('grid_tree_4.joblib')
grid_tree_5 = load('grid_tree_5.joblib')

tree_1 = grid_tree_1.best_estimator_
tree_2 = grid_tree_2.best_estimator_
tree_3 = grid_tree_3.best_estimator_
tree_4 = grid_tree_4.best_estimator_
tree_5 = grid_tree_5.best_estimator_

tree_l = [tree_1, tree_2, tree_3, tree_4, tree_5]

# Logistic regression model group
grid_lr_1 = load('grid_lr_1.joblib')
grid_lr_2 = load('grid_lr_2.joblib')
grid_lr_3 = load('grid_lr_3.joblib')
grid_lr_4 = load('grid_lr_4.joblib')
grid_lr_5 = load('grid_lr_5.joblib')

lr_1 = grid_lr_1.best_estimator_
lr_2 = grid_lr_2.best_estimator_
lr_3 = grid_lr_3.best_estimator_
lr_4 = grid_lr_4.best_estimator_
lr_5 = grid_lr_5.best_estimator_

lr_l = [lr_1, lr_2, lr_3, lr_4, lr_5]

In [38]:
eval1_predict_proba_RF = pd.Series(RF_l[0].predict_proba(X_eval1)[:, 1], index=X_eval1.index)
eval2_predict_proba_RF = pd.Series(RF_l[1].predict_proba(X_eval2)[:, 1], index=X_eval2.index)
eval3_predict_proba_RF = pd.Series(RF_l[2].predict_proba(X_eval3)[:, 1], index=X_eval3.index)
eval4_predict_proba_RF = pd.Series(RF_l[3].predict_proba(X_eval4)[:, 1], index=X_eval4.index)
eval5_predict_proba_RF = pd.Series(RF_l[4].predict_proba(X_eval5)[:, 1], index=X_eval5.index)

eval_predict_proba_RF = pd.concat([eval1_predict_proba_RF, 
                                   eval2_predict_proba_RF, 
                                   eval3_predict_proba_RF, 
                                   eval4_predict_proba_RF, 
                                   eval5_predict_proba_RF]).sort_index()

eval1_predict_proba_tree = pd.Series(tree_l[0].predict_proba(X_eval1)[:, 1], index=X_eval1.index)
eval2_predict_proba_tree = pd.Series(tree_l[1].predict_proba(X_eval2)[:, 1], index=X_eval2.index)
eval3_predict_proba_tree = pd.Series(tree_l[2].predict_proba(X_eval3)[:, 1], index=X_eval3.index)
eval4_predict_proba_tree = pd.Series(tree_l[3].predict_proba(X_eval4)[:, 1], index=X_eval4.index)
eval5_predict_proba_tree = pd.Series(tree_l[4].predict_proba(X_eval5)[:, 1], index=X_eval5.index)

eval_predict_proba_tree = pd.concat([eval1_predict_proba_tree, 
                                     eval2_predict_proba_tree, 
                                     eval3_predict_proba_tree, 
                                     eval4_predict_proba_tree, 
                                     eval5_predict_proba_tree]).sort_index()

eval1_predict_proba_lr = pd.Series(lr_l[0].predict_proba(X_eval1)[:, 1], index=X_eval1.index)
eval2_predict_proba_lr = pd.Series(lr_l[1].predict_proba(X_eval2)[:, 1], index=X_eval2.index)
eval3_predict_proba_lr = pd.Series(lr_l[2].predict_proba(X_eval3)[:, 1], index=X_eval3.index)
eval4_predict_proba_lr = pd.Series(lr_l[3].predict_proba(X_eval4)[:, 1], index=X_eval4.index)
eval5_predict_proba_lr = pd.Series(lr_l[4].predict_proba(X_eval5)[:, 1], index=X_eval5.index)

eval_predict_proba_lr = pd.concat([eval1_predict_proba_lr, 
                                   eval2_predict_proba_lr, 
                                   eval3_predict_proba_lr, 
                                   eval4_predict_proba_lr, 
                                   eval5_predict_proba_lr]).sort_index()

In [39]:
test_predict_proba_RF = []

for i in range(5):
    test_predict_proba_RF.append(RF_l[i].predict_proba(X_test_OE)[:, 1])

test_predict_proba_RF = np.array(test_predict_proba_RF)
test_predict_proba_RF = test_predict_proba_RF.mean(0)

test_predict_proba_tree = []

for i in range(5):
    test_predict_proba_tree.append(tree_l[i].predict_proba(X_test_OE)[:, 1])

test_predict_proba_tree = np.array(test_predict_proba_tree)
test_predict_proba_tree = test_predict_proba_tree.mean(0)

test_predict_proba_lr = []

for i in range(5):
    test_predict_proba_lr.append(lr_l[i].predict_proba(X_test_OE)[:, 1])

test_predict_proba_lr = np.array(test_predict_proba_lr)
test_predict_proba_lr = test_predict_proba_lr.mean(0)

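## <center>Ch.3 Basic Model Fusion Methods

## 6. Evaluation and Improvement of the Cross-Training Weight Search Strategy

&emsp;&emsp;The process above already yields a reasonably good fusion result consistently, but this weighted soft-voting procedure can still be optimized further. We first review the process and then discuss possible improvements.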
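
For reference, here is a minimal sketch of the thresholded, weighted soft vote that the rest of this section keeps tuning. The function name `weighted_soft_vote` and the toy probabilities are illustrative assumptions, not part of the notebook; the arithmetic mirrors the objective functions used below (weighted average of positive-class probabilities, then a comparison against `thr`).

```python
import numpy as np

def weighted_soft_vote(probas, weights, thr=0.5):
    """Blend positive-class probabilities and apply a decision threshold.

    probas  : array-like of shape (n_models, n_samples)
    weights : non-negative weights, one per model (need not sum to 1)
    """
    probas = np.asarray(probas, dtype=float)
    weights = np.asarray(weights, dtype=float)
    blended = weights @ probas / weights.sum()
    return blended, (blended >= thr).astype(int)

probas = [[0.2, 0.6, 0.9],    # e.g. logistic regression
          [0.4, 0.4, 0.7],    # e.g. decision tree
          [0.1, 0.6, 0.8]]    # e.g. random forest
blended, labels = weighted_soft_vote(probas, weights=[1, 1, 2], thr=0.5)
print(blended, labels)
```

Because the weights are divided by their sum, only their ratios matter, which is why the search spaces below can sample each weight independently from [0, 1].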

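### 1. Evaluating the cross-training weight search scheme

- Search-space pruning is no longer feasible

&emsp;&emsp;After working through the rule-of-thumb approach combined with search-space pruning, you may wonder whether pruning the search space could also improve generalization here.

&emsp;&emsp;The answer is no. The root cause is that we now have only one criterion for judging weights: validation accuracy. Given how well TPE searches, we are unlikely to find by hand a set of weights with better validation accuracy than TPE finds, and without evidence that a hand-picked result is better, we have no basis for pruning the search space. Before cross-training, the rule-of-thumb approach judged candidate weights by accuracy on the training set, while the TPE search used mean validation accuracy. It was precisely this difference in criteria that made it possible for training-set accuracy to be a more useful signal than mean validation accuracy, and only then was pruning the search space based on training-set accuracy effective. After cross-training, training-set accuracy is simply validation accuracy, because the validation folds piece together into the full training set.

> To split hairs for a moment: even under cross-training, each model still has its own 80% of the data that it was trained on. Could each model predict on its own training portion, and could we average those predictions to compute a training-set accuracy? Yes, but there is no point. When training and validation information are fully isolated, trusting the training-set result over the validation result is not a direction that helps generalization.<center><img src="https://s2.loli.net/2022/05/24/6rvmpInWyJcwiF1.png" alt="image-20220524210942174" style="zoom:33%;" />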
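
The statement that the validation folds "piece together" into the training set can be checked on toy data. The sketch below uses a hand-rolled shuffled 5-fold partition on 23 made-up sample indices (an assumption for illustration; the notebook itself uses `KFold`) and confirms that every sample receives exactly one out-of-fold prediction, from a model that never saw it.

```python
import numpy as np

n_samples, n_splits = 23, 5
rng = np.random.default_rng(12)
folds = np.array_split(rng.permutation(n_samples), n_splits)  # shuffled 5-fold partition

oof_pred = np.full(n_samples, np.nan)
for k, eval_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    # Training and validation indices of the same fold never overlap
    assert np.intersect1d(train_idx, eval_idx).size == 0
    # Stand-in for "model k, trained on train_idx, predicts eval_idx"
    oof_pred[eval_idx] = k

covered = np.sort(np.concatenate(folds))
print(np.array_equal(covered, np.arange(n_samples)), np.isnan(oof_pred).any())
```

This is why, after cross-training, "training-set accuracy" computed from out-of-fold predictions and "validation accuracy" are the same quantity.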

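- The performance limit of weight search based on cross-training

&emsp;&emsp;However, if we take a "God's-eye view" and run the weight search directly on the test set, we can in fact obtain better weights: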

In [40]:
# Define the hyperparameter space
params_space = {'thr': hp.uniform("thr", 0.4, 0.6), 
                'weight1': hp.uniform("weight1",0,1),
                'weight2': hp.uniform("weight2",0,1),
                'weight3': hp.uniform("weight3",0,1)}

In [41]:
# Define the objective function (scored on the test set: an oracle, for reference only)
def hyperopt_objective_weight(params):
    thr = params['thr']
    weight1 = params['weight1']
    weight2 = params['weight2']
    weight3 = params['weight3']
    
    weights_sum = weight1 + weight2 + weight3

    predict_probo_weight = (test_predict_proba_lr * weight1 + 
                            test_predict_proba_tree * weight2 + 
                            test_predict_proba_RF * weight3) / weights_sum

    res_weight = (predict_probo_weight >= thr) * 1

    eval_score = accuracy_score(res_weight, y_test)
    
    return -eval_score

In [42]:
# Define the optimization function
def param_hyperopt_weight(max_evals):
    params_best = fmin(fn = hyperopt_objective_weight,
                       space = params_space,
                       algo = tpe.suggest,
                       max_evals = max_evals, 
                       rstate = np.random.default_rng(17))    
    return params_best

In [43]:
params_best = param_hyperopt_weight(5000)

100%|███████████████████████████████████████████| 5000/5000 [01:40<00:00, 49.98trial/s, best loss: -0.8006814310051107]


In [44]:
params_best

{'thr': 0.46824033457779035,
 'weight1': 9.187915210945308e-05,
 'weight2': 0.0003131011546776627,
 'weight3': 0.7000333202399632}

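This shows that with the weights above, test-set accuracy can reach 80%. If we narrowed the search space accordingly, we might obtain a better result. In actual modeling, however, the test set is completely unknown, so we have no justification for pruning the search space (pruning would lower our only evaluation metric, validation accuracy). So even though the ceiling of this process is about 80% test accuracy, the best result achieved in practice with the process above is only 79.72%.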
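
The gap between the oracle result and the practical result can be reproduced on synthetic data. In the sketch below, the data generator, the candidate pool, and all function names are assumptions made for illustration, and plain random candidates stand in for TPE. Weights chosen on a validation split are compared with weights chosen directly on the test split; the second can never score lower on the test split, because it is the maximum over the same candidate pool.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_split(n):
    """Synthetic labels plus three noisy 'model' probabilities (illustrative only)."""
    y = rng.integers(0, 2, size=n)
    noise = rng.normal(0.0, [[0.35], [0.45], [0.30]], size=(3, n))
    probas = np.clip(0.3 + 0.4 * y + noise, 0, 1)
    return probas, y

def accuracy(probas, y, w, thr):
    blended = w @ probas / w.sum()
    return np.mean((blended >= thr) == y)

val_p, val_y = make_split(400)
test_p, test_y = make_split(400)

# Shared pool of random candidates: three weights plus a threshold in [0.4, 0.6]
candidates = [(rng.uniform(0.01, 1, 3), rng.uniform(0.4, 0.6)) for _ in range(300)]

best_on_val = max(candidates, key=lambda c: accuracy(val_p, val_y, *c))
best_on_test = max(candidates, key=lambda c: accuracy(test_p, test_y, *c))

honest = accuracy(test_p, test_y, *best_on_val)    # achievable without seeing the test set
oracle = accuracy(test_p, test_y, *best_on_test)   # "God's-eye view" upper bound
print(honest, oracle)
```

The oracle number is useful only as a ceiling: it tells us how much headroom the weight search has, not what a deployable pipeline would achieve.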

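### 2. Improvements to cross-training weight search

&emsp;&emsp;Next we discuss ways to improve the cross-training weight search strategy. Note that this strategy is already one of the best-performing and most stable approaches to weighted fusion. It makes good use of the TPE optimizer while keeping overfitting in check, it should always be tried when fusing models, and it often outperforms Stacking, Blending, and the other methods discussed later. We explore further improvements anyway, not only to add more tools to your model fusion toolbox, but above all to broaden your thinking. As noted at the start of this section, model fusion is very flexible in practice, and applying recipes mechanically may not be enough to set you apart. Adapting methods to the problem at hand often takes an exchange of ideas. The methods introduced here are all upgraded versions of cross-training weight search. They will not necessarily do better on this dataset, but the ideas behind them are worth borrowing, and they may do better in later case studies and on other datasets. For these reasons they are worth presenting in detail.

#### 2.1 Fine-grained weight search

- Description

&emsp;&emsp;Overall, the process above can be improved in at least two directions. First, we can treat every model as independent: each model gets its own weight from the search on the validation set and takes part in test-set prediction on its own, which gives more flexibility on the validation set and may improve the fusion. As shown in the figure below, lr_1, tree_1, and RF_1 are all trained on Parts 1-4 of the data, so these three models can be fused by weighting on Part 5, which yields a separate weight for each of the three. The other models are handled the same way, so every model ends up with its own weight. At prediction time, each model predicts the test set separately, and the final prediction is the weighted combination: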

<center><img src="https://s2.loli.net/2022/05/24/8mIKPwJOGoZRcL2.png" alt="image-20220524232358976" style="zoom:33%;" />

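- Evaluation

&emsp;&emsp;Giving each model its own weight clearly raises the ceiling on test-set performance substantially, but this many continuous hyperparameters puts pressure on the TPE search. The ceiling is higher and the validation set is still trustworthy, yet TPE may fail to find even the validation-set optimum, so the result in practice is not necessarily better than the original strategy. What we can say is that if the number of cross-validation folds is adjusted sensibly (for example, reduced), or if the data is fairly consistent across folds, this method can work well.

#### 2.2 Multi-level hierarchical weighted fusion

- Description

&emsp;&emsp;The second approach is hierarchical fusion. We can treat each group of models simply as five models trained on one dataset (just like the logistic regression, decision tree, and random forest trained on the full dataset in Section 1), and then fuse within the group by weighting, obtaining each model's weight from predictions on the training set (that is, the VotingClassifier + cross_val_score procedure). For example, the within-group weighted fusion for RF_l looks like this: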

<center>

<img src="https://s2.loli.net/2022/05/25/HErYdURczIotaZW.png" alt="image-20220525010307470" style="zoom:50%;" />

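Then, once each group has its own set of weights and we have three different weighted predictions on the training set, we compute the between-group weights following the previous subsection: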

<center><img src="https://s2.loli.net/2022/05/25/lpJdSjZtenRYP7C.png" alt="image-20220525010429071" style="zoom:50%;" />

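As you can see, this is a weighted fusion carried out level by level, which we can call multi-level hierarchical weighted fusion.

&emsp;&emsp;When predicting on the test set, we likewise predict with each model first, then combine the predictions within each group using the first-level weights, and finally compute the result using the between-group weights.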
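
The two-stage prediction described above is algebraically the same as a single weighted average in which each model's effective weight is its within-group weight multiplied by its group's weight. The sketch below checks this on random stand-in probabilities; all arrays and names are illustrative assumptions, and weights at both levels are assumed to be normalized to sum to 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n_groups, n_folds, n_samples = 3, 5, 8
probas = rng.uniform(size=(n_groups, n_folds, n_samples))   # stand-in predictions

within = rng.uniform(size=(n_groups, n_folds))
within /= within.sum(axis=1, keepdims=True)                  # level 1: weights inside each group
between = rng.uniform(size=n_groups)
between /= between.sum()                                     # level 2: weights across groups

# Two-stage prediction: blend inside each group, then blend the group outputs
group_pred = np.einsum('gf,gfs->gs', within, probas)
two_stage = between @ group_pred

# Equivalent single-stage blend with one effective weight per model
effective = between[:, None] * within                        # shape (3, 5)
flat = np.einsum('gf,gfs->s', effective, probas)

print(np.allclose(two_stage, flat), effective.sum())
```

So the hierarchical scheme explores the same family of per-model weights as the fine-grained scheme; what differs is how the weights are searched (two small searches instead of one large one).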

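- Evaluation

&emsp;&emsp;You may have noticed that the ultimate purpose of multi-level hierarchical fusion is also to give each estimator its own weight, just as in fine-grained weight search. The difference is that the within-group fusion leaks some validation-set information (the same model is cross-validated on all of the data), and in exchange each search is efficient. In other words, fine-grained search has a higher ceiling on fusion performance that is hard to reach, with the hyperparameter optimizer as the bottleneck, while multi-level hierarchical search has a lower ceiling that is easier to reach. Each method has its strengths and weaknesses.

> Another direction worth considering starts from model training: put the model hyperparameters and the weight hyperparameters into one machine learning pipeline and tune them jointly. Putting this into practice requires more background, namely how model diversity affects fusion performance. We will discuss this in the next stage.

## 7. Fine-grained weighted fusion

&emsp;&emsp;Next, let's try the fine-grained weighted fusion process. The method is not difficult, but it is tedious, mainly because of the large number of weight parameters to set up. We can search all of the parameters at once as follows: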

In [64]:
# Define the hyperparameter space (one threshold plus 3 models x 5 folds = 15 weights)
params_space = {'thr': hp.uniform("thr", 0.4, 0.6), 
                'weight_lr1': hp.uniform("weight_lr1",0,1),
                'weight_lr2': hp.uniform("weight_lr2",0,1),
                'weight_lr3': hp.uniform("weight_lr3",0,1), 
                'weight_lr4': hp.uniform("weight_lr4",0,1), 
                'weight_lr5': hp.uniform("weight_lr5",0,1), 
                'weight_tree1': hp.uniform("weight_tree1",0,1),
                'weight_tree2': hp.uniform("weight_tree2",0,1),
                'weight_tree3': hp.uniform("weight_tree3",0,1), 
                'weight_tree4': hp.uniform("weight_tree4",0,1), 
                'weight_tree5': hp.uniform("weight_tree5",0,1), 
                'weight_RF1': hp.uniform("weight_RF1",0,1),
                'weight_RF2': hp.uniform("weight_RF2",0,1),
                'weight_RF3': hp.uniform("weight_RF3",0,1), 
                'weight_RF4': hp.uniform("weight_RF4",0,1), 
                'weight_RF5': hp.uniform("weight_RF5",0,1)}
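
The 16 entries above follow a regular naming pattern, so they can also be generated in a loop. The sketch below builds only the names and `(low, high)` bounds in plain Python; the `bounds` dictionary is a hypothetical intermediate, and turning each entry into a search dimension with `hp.uniform(name, low, high)` is left as the obvious last step.

```python
model_names = ['lr', 'tree', 'RF']
n_folds = 5

# (low, high) bounds for every search dimension; 'thr' is the decision threshold
bounds = {'thr': (0.4, 0.6)}
bounds.update({f'weight_{m}{k}': (0.0, 1.0)
               for m in model_names
               for k in range(1, n_folds + 1)})

print(len(bounds))
print(list(bounds)[:3])
```

Generating the keys this way keeps the search space and the objective function in sync if the number of folds or models changes.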

In [71]:
# Define the objective function (weights are normalized within each validation fold)
def hyperopt_objective_weight(params):
    thr = params['thr']
    weight_lr1 = params['weight_lr1']
    weight_lr2 = params['weight_lr2']
    weight_lr3 = params['weight_lr3']
    weight_lr4 = params['weight_lr4']
    weight_lr5 = params['weight_lr5']
    
    weight_tree1 = params['weight_tree1']
    weight_tree2 = params['weight_tree2']
    weight_tree3 = params['weight_tree3']
    weight_tree4 = params['weight_tree4']
    weight_tree5 = params['weight_tree5']
    
    weight_RF1 = params['weight_RF1']
    weight_RF2 = params['weight_RF2']
    weight_RF3 = params['weight_RF3']
    weight_RF4 = params['weight_RF4']
    weight_RF5 = params['weight_RF5']
    
    eval1_predict_proba_weight = (pd.Series(lr_1.predict_proba(X_eval1)[:, 1], index=X_eval1.index) * weight_lr1 + 
                                  pd.Series(tree_1.predict_proba(X_eval1)[:, 1], index=X_eval1.index) * weight_tree1 + 
                                  pd.Series(RF_1.predict_proba(X_eval1)[:, 1], index=X_eval1.index) * weight_RF1) / (weight_lr1 + weight_tree1 + weight_RF1)

    eval2_predict_proba_weight = (pd.Series(lr_2.predict_proba(X_eval2)[:, 1], index=X_eval2.index) * weight_lr2 + 
                                  pd.Series(tree_2.predict_proba(X_eval2)[:, 1], index=X_eval2.index) * weight_tree2 + 
                                  pd.Series(RF_2.predict_proba(X_eval2)[:, 1], index=X_eval2.index) * weight_RF2) / (weight_lr2 + weight_tree2 + weight_RF2)    
    
    eval3_predict_proba_weight = (pd.Series(lr_3.predict_proba(X_eval3)[:, 1], index=X_eval3.index) * weight_lr3 + 
                                  pd.Series(tree_3.predict_proba(X_eval3)[:, 1], index=X_eval3.index) * weight_tree3 + 
                                  pd.Series(RF_3.predict_proba(X_eval3)[:, 1], index=X_eval3.index) * weight_RF3) / (weight_lr3 + weight_tree3 + weight_RF3)    
    
    eval4_predict_proba_weight = (pd.Series(lr_4.predict_proba(X_eval4)[:, 1], index=X_eval4.index) * weight_lr4 + 
                                  pd.Series(tree_4.predict_proba(X_eval4)[:, 1], index=X_eval4.index) * weight_tree4 + 
                                  pd.Series(RF_4.predict_proba(X_eval4)[:, 1], index=X_eval4.index) * weight_RF4) / (weight_lr4 + weight_tree4 + weight_RF4)    
    
    eval5_predict_proba_weight = (pd.Series(lr_5.predict_proba(X_eval5)[:, 1], index=X_eval5.index) * weight_lr5 + 
                                  pd.Series(tree_5.predict_proba(X_eval5)[:, 1], index=X_eval5.index) * weight_tree5 + 
                                  pd.Series(RF_5.predict_proba(X_eval5)[:, 1], index=X_eval5.index) * weight_RF5) / (weight_lr5 + weight_tree5 + weight_RF5)        
    
    eval_predict_proba_weight = pd.concat([eval1_predict_proba_weight,
                                           eval2_predict_proba_weight, 
                                           eval3_predict_proba_weight, 
                                           eval4_predict_proba_weight, 
                                           eval5_predict_proba_weight]).sort_index()
    
    eval_predict = (eval_predict_proba_weight >= thr) * 1
    
    eval_acc = accuracy_score(eval_predict, y_train)
    
    return -eval_acc
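
The objective above calls `predict_proba` fifteen times per trial, even though those outputs do not depend on the weights being searched. One possible speed-up, sketched here on random stand-in data, is to compute the out-of-fold probabilities once and let the objective do only the weighted average. The array layout, the equal fold sizes, and the function signature are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_folds, n_models, fold_size = 5, 3, 6

# Stand-in for cached out-of-fold probabilities: in the notebook these would be
# lr_k / tree_k / RF_k .predict_proba(X_eval_k)[:, 1], computed once before the search.
cached = rng.uniform(size=(n_folds, n_models, fold_size))
y_true = rng.integers(0, 2, size=(n_folds, fold_size))

def objective(weights, thr):
    """weights has shape (n_folds, n_models); returns negative accuracy."""
    weights = np.asarray(weights, dtype=float)
    blended = np.einsum('km,kms->ks', weights, cached) / weights.sum(axis=1, keepdims=True)
    return -np.mean((blended >= thr) == y_true)

loss = objective(rng.uniform(0.01, 1, size=(n_folds, n_models)), thr=0.5)
print(loss)
```

With the model calls hoisted out of the loop, each trial costs a handful of array operations, which matters when running thousands of TPE iterations.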

In [88]:
# Define the optimization function
def param_hyperopt_weight(max_evals):
    params_best = fmin(fn = hyperopt_objective_weight,
                       space = params_space,
                       algo = tpe.suggest,
                       max_evals = max_evals, 
                       rstate = np.random.default_rng(2))    
    return params_best

In [91]:
best_params = param_hyperopt_weight(5000)

100%|███████████████████████████████████████████| 5000/5000 [10:42<00:00,  7.78trial/s, best loss: -0.8241196516471033]


In [84]:
def muti_weight_test_acc(params):
    # Test-set accuracy of the 15-model blend, normalized by the sum of all 15 weights
    thr = params['thr']
    weight_lr1 = params['weight_lr1']
    weight_lr2 = params['weight_lr2']
    weight_lr3 = params['weight_lr3']
    weight_lr4 = params['weight_lr4']
    weight_lr5 = params['weight_lr5']
    
    weight_lr_l = np.array([weight_lr1, weight_lr2, weight_lr3, weight_lr4, weight_lr5])
    weight_lr_sum = weight_lr_l.sum()
    
    weight_tree1 = params['weight_tree1']
    weight_tree2 = params['weight_tree2']
    weight_tree3 = params['weight_tree3']
    weight_tree4 = params['weight_tree4']
    weight_tree5 = params['weight_tree5']
    
    weight_tree_l = np.array([weight_tree1, weight_tree2, weight_tree3, weight_tree4, weight_tree5])
    weight_tree_sum = weight_tree_l.sum()
    
    weight_RF1 = params['weight_RF1']
    weight_RF2 = params['weight_RF2']
    weight_RF3 = params['weight_RF3']
    weight_RF4 = params['weight_RF4']
    weight_RF5 = params['weight_RF5']
    
    weight_RF_l = np.array([weight_RF1, weight_RF2, weight_RF3, weight_RF4, weight_RF5])
    weight_RF_sum = weight_RF_l.sum()
    
    test_predict_proba = (lr_1.predict_proba(X_test_OE)[:, 1] * weight_lr1 + 
                          lr_2.predict_proba(X_test_OE)[:, 1] * weight_lr2 + 
                          lr_3.predict_proba(X_test_OE)[:, 1] * weight_lr3 + 
                          lr_4.predict_proba(X_test_OE)[:, 1] * weight_lr4 + 
                          lr_5.predict_proba(X_test_OE)[:, 1] * weight_lr5 + 
                          tree_1.predict_proba(X_test_OE)[:, 1] * weight_tree1 + 
                          tree_2.predict_proba(X_test_OE)[:, 1] * weight_tree2 + 
                          tree_3.predict_proba(X_test_OE)[:, 1] * weight_tree3 + 
                          tree_4.predict_proba(X_test_OE)[:, 1] * weight_tree4 + 
                          tree_5.predict_proba(X_test_OE)[:, 1] * weight_tree5 + 
                          RF_1.predict_proba(X_test_OE)[:, 1] * weight_RF1 + 
                          RF_2.predict_proba(X_test_OE)[:, 1] * weight_RF2 + 
                          RF_3.predict_proba(X_test_OE)[:, 1] * weight_RF3 + 
                          RF_4.predict_proba(X_test_OE)[:, 1] * weight_RF4 + 
                          RF_5.predict_proba(X_test_OE)[:, 1] * weight_RF5) / (weight_lr_sum + 
                                                                              weight_tree_sum + 
                                                                              weight_RF_sum)
                          
                
    
    test_predict = (test_predict_proba >= thr) * 1
    
    test_acc = accuracy_score(test_predict, y_test)
    return test_acc

In [92]:
muti_weight_test_acc(best_params)

0.7915956842703009

With 5000 iterations, test-set accuracy is 0.79159. From here we could keep increasing the number of iterations to improve the precision of the search.
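
One detail is worth noticing when reading the two functions side by side: the validation objective normalizes the three weights inside each fold, while the test-set function normalizes by the sum of all 15 weights. The sketch below, on made-up probabilities, shows that these two conventions generally give different blends and coincide only when every fold's weights have the same sum. It is an observation about the arithmetic, not a claim about which convention is preferable.

```python
import numpy as np

rng = np.random.default_rng(3)
n_folds, n_models, n_test = 5, 3, 10
test_probas = rng.uniform(size=(n_folds, n_models, n_test))   # stand-in test predictions

def global_norm(w):
    """One weighted average over all 15 models (as in the test-set function)."""
    return np.einsum('km,kms->s', w, test_probas) / w.sum()

def per_fold_norm(w):
    """Normalize inside each fold, then average the five blends equally."""
    blends = np.einsum('km,kms->ks', w, test_probas) / w.sum(axis=1, keepdims=True)
    return blends.mean(axis=0)

w_random = rng.uniform(0.01, 1, size=(n_folds, n_models))
w_balanced = w_random / w_random.sum(axis=1, keepdims=True)   # every fold sums to 1

differs = not np.allclose(global_norm(w_random), per_fold_norm(w_random))
agrees = np.allclose(global_norm(w_balanced), per_fold_norm(w_balanced))
print(differs, agrees)
```

Under global normalization, a fold whose three weights happen to be large contributes more to the test prediction than a fold with small weights, which the validation objective never measured.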

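## 8. Multi-level hierarchical weighted fusion

&emsp;&emsp;We now continue with the second improvement, multi-level hierarchical weighted fusion.

### 1. Within-group weighted fusion

&emsp;&emsp;First, the weighted fusion within each group:

- Random forest group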

In [95]:
estimators_RF = [('RF_1',RF_1), ('RF_2',RF_2), ('RF_3',RF_3), ('RF_4',RF_4), ('RF_5',RF_5)]

In [96]:
# Define the hyperparameter space
params_space = {'thr': hp.uniform("thr", 0.4, 0.6), 
                'weight1': hp.uniform("weight1",0,1),
                'weight2': hp.uniform("weight2",0,1),
                'weight3': hp.uniform("weight3",0,1), 
                'weight4': hp.uniform("weight4",0,1), 
                'weight5': hp.uniform("weight5",0,1)}

In [97]:
# Define the objective function
def hyperopt_objective_weight(params):
    thr = params['thr']
    weight1 = params['weight1']
    weight2 = params['weight2']
    weight3 = params['weight3']
    weight4 = params['weight4']
    weight5 = params['weight5']
    
    weights = [weight1, weight2, weight3, weight4, weight5]
    
    # Create the thresholded soft-voting estimator
    VC_weight_search = VotingClassifier_threshold(estimators=estimators_RF, 
                                                  weights=weights,
                                                  voting='soft', 
                                                  thr=thr)

    # Mean cross-validated accuracy over the validation folds
    val_score = cross_val_score(VC_weight_search, 
                                X_train_OE, 
                                y_train, 
                                scoring='accuracy', 
                                n_jobs=15,
                                cv=5).mean()
    
    return -val_score

In [98]:
# Define the optimization function
def param_hyperopt_weight(max_evals):
    params_best = fmin(fn = hyperopt_objective_weight,
                       space = params_space,
                       algo = tpe.suggest,
                       max_evals = max_evals, 
                       rstate = np.random.default_rng(2))    
    return params_best

In [99]:
best_params = param_hyperopt_weight(500)

100%|█████████████████████████████████████████████| 500/500 [05:35<00:00,  1.49trial/s, best loss: -0.8084048264097934]


In [100]:
def weights_extract(best_params):
    thr = best_params['thr']
    weight1 = best_params['weight1']
    weight2 = best_params['weight2']
    weight3 = best_params['weight3']
    weight4 = best_params['weight4']
    weight5 = best_params['weight5']

    weights_sum = (weight1 + weight2 + weight3 + weight4 + weight5)

    weight1 = weight1 / weights_sum
    weight2 = weight2 / weights_sum
    weight3 = weight3 / weights_sum
    weight4 = weight4 / weights_sum
    weight5 = weight5 / weights_sum

    weights = [weight1, weight2, weight3, weight4, weight5]
    return weights
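
The same normalization can be written without naming each weight, which helps when the number of models in a group changes. The helper below is a hypothetical alternative to `weights_extract`, not the notebook's function. It assumes the weight keys share a common prefix and have single-digit suffixes, so that sorting the keys alphabetically matches model order.

```python
import numpy as np

def normalize_weights(best_params, prefix='weight'):
    """Return the weights whose keys start with `prefix`, ordered by key, scaled to sum to 1."""
    keys = sorted(k for k in best_params if k.startswith(prefix))
    w = np.array([best_params[k] for k in keys], dtype=float)
    return (w / w.sum()).tolist()

demo = {'thr': 0.47, 'weight1': 0.2, 'weight2': 0.6, 'weight3': 0.2,
        'weight4': 0.5, 'weight5': 0.5}
weights = normalize_weights(demo)
print(weights)
```

The threshold is ignored here, as in `weights_extract`: only the blended probabilities, not thresholded labels, are passed on to the between-group stage.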

In [102]:
RF_weights = weights_extract(best_params)

Then output the weighted fusion predictions on the validation set (the training set):

In [103]:
eval_predict_proba_RF = 0

for i in range(5):
    eval_predict_proba_RF += (RF_l[i].predict_proba(X_train_OE)[:, 1]) * RF_weights[i]

eval_predict_proba_RF

array([0.03881102, 0.54438575, 0.12338471, ..., 0.5642814 , 0.03396568,
       0.01746549])

And the predictions on the test set:

In [104]:
test_predict_proba_RF = 0

for i in range(5):
    test_predict_proba_RF += (RF_l[i].predict_proba(X_test_OE)[:, 1]) * RF_weights[i]

test_predict_proba_RF

array([0.03111997, 0.30563842, 0.01820175, ..., 0.14348214, 0.53063581,
       0.1075506 ])

- Logistic regression group

In [105]:
estimators_lr = [('lr_1',lr_1), ('lr_2',lr_2), ('lr_3',lr_3), ('lr_4',lr_4), ('lr_5',lr_5)]

In [106]:
# Define the hyperparameter space
params_space = {'thr': hp.uniform("thr", 0.4, 0.6), 
                'weight1': hp.uniform("weight1",0,1),
                'weight2': hp.uniform("weight2",0,1),
                'weight3': hp.uniform("weight3",0,1), 
                'weight4': hp.uniform("weight4",0,1), 
                'weight5': hp.uniform("weight5",0,1)}

In [107]:
# Define the objective function
def hyperopt_objective_weight(params):
    thr = params['thr']
    weight1 = params['weight1']
    weight2 = params['weight2']
    weight3 = params['weight3']
    weight4 = params['weight4']
    weight5 = params['weight5']
    
    weights = [weight1, weight2, weight3, weight4, weight5]
    
    # Create the thresholded soft-voting estimator
    VC_weight_search = VotingClassifier_threshold(estimators=estimators_lr, 
                                                  weights=weights,
                                                  voting='soft', 
                                                  thr=thr)

    # Mean cross-validated accuracy over the validation folds
    val_score = cross_val_score(VC_weight_search, 
                                X_train_OE, 
                                y_train, 
                                scoring='accuracy', 
                                n_jobs=15,
                                cv=5).mean()
    
    return -val_score

In [108]:
# Define the optimization function
def param_hyperopt_weight(max_evals):
    params_best = fmin(fn = hyperopt_objective_weight,
                       space = params_space,
                       algo = tpe.suggest,
                       max_evals = max_evals, 
                       rstate = np.random.default_rng(17))    
    return params_best

In [109]:
best_params = param_hyperopt_weight(300)

100%|█████████████████████████████████████████████| 300/300 [01:44<00:00,  2.87trial/s, best loss: -0.8114340543562397]


In [110]:
lr_weights = weights_extract(best_params)

Then output the weighted fusion predictions on the validation set (the training set):

In [111]:
eval_predict_proba_lr = 0

for i in range(5):
    eval_predict_proba_lr += (lr_l[i].predict_proba(X_train_OE)[:, 1]) * lr_weights[i]

eval_predict_proba_lr

array([0.01318283, 0.55771798, 0.14512801, ..., 0.68283328, 0.05042271,
       0.00420453])

And the predictions on the test set:

In [112]:
test_predict_proba_lr = 0

for i in range(5):
    test_predict_proba_lr += (lr_l[i].predict_proba(X_test_OE)[:, 1]) * lr_weights[i]

test_predict_proba_lr

array([0.04505546, 0.23169159, 0.0052967 , ..., 0.12388625, 0.50385759,
       0.0633811 ])

- Decision tree group

In [113]:
estimators_tree = [('tree_1',tree_1), ('tree_2',tree_2), ('tree_3',tree_3), ('tree_4',tree_4), ('tree_5',tree_5)]

In [114]:
# Define the hyperparameter space
params_space = {'thr': hp.uniform("thr", 0.4, 0.6), 
                'weight1': hp.uniform("weight1",0,1),
                'weight2': hp.uniform("weight2",0,1),
                'weight3': hp.uniform("weight3",0,1), 
                'weight4': hp.uniform("weight4",0,1), 
                'weight5': hp.uniform("weight5",0,1)}

In [115]:
# Define the objective function
def hyperopt_objective_weight(params):
    thr = params['thr']
    weight1 = params['weight1']
    weight2 = params['weight2']
    weight3 = params['weight3']
    weight4 = params['weight4']
    weight5 = params['weight5']
    
    weights = [weight1, weight2, weight3, weight4, weight5]
    
    # Create the thresholded soft-voting estimator
    VC_weight_search = VotingClassifier_threshold(estimators=estimators_tree, 
                                                  weights=weights,
                                                  voting='soft', 
                                                  thr=thr)

    # Mean cross-validated accuracy over the validation folds
    val_score = cross_val_score(VC_weight_search, 
                                X_train_OE, 
                                y_train, 
                                scoring='accuracy', 
                                n_jobs=15,
                                cv=5).mean()
    
    return -val_score

In [116]:
# Define the optimization function
def param_hyperopt_weight(max_evals):
    params_best = fmin(fn = hyperopt_objective_weight,
                       space = params_space,
                       algo = tpe.suggest,
                       max_evals = max_evals, 
                       rstate = np.random.default_rng(17))    
    return params_best

In [117]:
best_params = param_hyperopt_weight(300)

100%|██████████████████████████████████████████████| 300/300 [00:13<00:00, 22.13trial/s, best loss: -0.797234167598406]


In [118]:
tree_weights = weights_extract(best_params)

Then output the weighted fusion predictions on the validation set (the training set):

In [119]:
eval_predict_proba_tree = 0

for i in range(5):
    eval_predict_proba_tree += (tree_l[i].predict_proba(X_train_OE)[:, 1]) * tree_weights[i]

eval_predict_proba_tree

array([0.04365352, 0.74609312, 0.16049057, ..., 0.39059284, 0.04624722,
       0.04365352])

And the predictions on the test set:

In [120]:
test_predict_proba_tree = 0

for i in range(5):
    test_predict_proba_tree += (tree_l[i].predict_proba(X_test_OE)[:, 1]) * tree_weights[i]

test_predict_proba_tree

array([0.04365352, 0.1870317 , 0.04365352, ..., 0.1870317 , 0.45025595,
       0.17253686])

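### 2. Between-group fusion

&emsp;&emsp;Next we perform the between-group fusion. This step closely follows the procedure in the previous subsection, so the code can be reused directly: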

In [121]:
# Define the hyperparameter space
params_space = {'thr': hp.uniform("thr", 0.4, 0.6), 
                'weight1': hp.uniform("weight1",0,1),
                'weight2': hp.uniform("weight2",0,1),
                'weight3': hp.uniform("weight3",0,1)}

In [122]:
# Define the objective function
def hyperopt_objective_weight(params):
    thr = params['thr']
    weight1 = params['weight1']
    weight2 = params['weight2']
    weight3 = params['weight3']
    
    weights_sum = weight1 + weight2 + weight3

    predict_probo_weight = (eval_predict_proba_lr * weight1 + 
                            eval_predict_proba_tree * weight2 + 
                            eval_predict_proba_RF * weight3) / weights_sum

    res_weight = (predict_probo_weight >= thr) * 1

    eval_score = accuracy_score(res_weight, y_train)
    
    return -eval_score

In [148]:
# Define the optimization function
def param_hyperopt_weight(max_evals):
    params_best = fmin(fn = hyperopt_objective_weight,
                       space = params_space,
                       algo = tpe.suggest,
                       max_evals = max_evals, 
                       rstate = np.random.default_rng(2))    
    return params_best

In [149]:
params_best = param_hyperopt_weight(5000)

100%|███████████████████████████████████████████| 5000/5000 [01:32<00:00, 54.34trial/s, best loss: -0.8354789852328663]


In [150]:
params_best

{'thr': 0.455650120218397,
 'weight1': 0.018974448598142818,
 'weight2': 0.0013087603673773525,
 'weight3': 0.8149604936936505}

Then we apply the same weights to the test set. The result is as follows:

In [151]:
def test_acc(params_best):
    thr = params_best['thr']
    weight1 = params_best['weight1']
    weight2 = params_best['weight2']
    weight3 = params_best['weight3']

    weights_sum = weight1 + weight2 + weight3

    test_predict_proba = (((test_predict_proba_lr * weight1 + 
                            test_predict_proba_tree * weight2 + 
                            test_predict_proba_RF * weight3) / weights_sum) >= thr) * 1

    print(accuracy_score(test_predict_proba, y_test))

In [152]:
test_acc(params_best)

0.7961385576377058
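
To put the reported accuracies in perspective, it helps to convert them into counts of correctly classified test samples. The sketch below assumes the CSV is the standard 7,043-row release and that the split uses the default 25% test share, which gives 1,761 test rows; the three accuracy values are copied from the outputs in this section, and they are consistent with that test-set size.

```python
n_total = 7043                    # assumed row count of the Telco churn CSV
n_test = -(-n_total // 4)         # ceil(0.25 * n_total), the default test share

reported = {
    'oracle search on test set': 0.8006814310051107,
    'fine-grained weights':      0.7915956842703009,
    'multi-level fusion':        0.7961385576377058,
}
correct = {name: round(acc * n_test) for name, acc in reported.items()}

for name, k in correct.items():
    print(f'{name}: {k} / {n_test} correct')
```

Under these assumptions the multi-level result sits 8 samples above the fine-grained result and 8 samples below the oracle ceiling, so differences at this scale should be read with some caution.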
