# <center> [Kaggle] Telco Customer Churn: A Telecom Customer Churn Prediction Case Study

---

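## <font face="仿宋">Guide to Part 4

&emsp;&emsp;<font face="仿宋">In Parts 2 and 3 of this case study we covered feature engineering techniques in detail. Broadly, feature engineering falls into three groups: data preprocessing, feature derivation, and feature selection. Data preprocessing brings the dataset to a state in which it can be modeled. It includes missing-value handling, outlier handling, and data re-encoding, and it must be done before any modeling. Feature derivation and feature selection are better viewed as optimization tools that help a model break through the performance ceiling of the current dataset. Part 2 also gave a complete treatment of training, tuning, and interpreting the interpretable models, namely logistic regression and decision trees, and so far these two algorithms have been our main vehicles for model testing in every part.

&emsp;&emsp;<font face="仿宋">In Part 4 we turn to practical techniques for training and optimizing ensemble learners. Ensemble models are less interpretable than logistic regression or decision trees, but in most modeling scenarios they give better predictions, which makes them the most common choice when performance comes first.

&emsp;&emsp;<font face="仿宋">This part has a single goal: to use every optimization method at hand to reach the performance ceiling of each mainstream ensemble algorithm. In other words, we discuss single-model optimization strategies in depth. The algorithms covered are random forest, XGBoost, LightGBM, and CatBoost, currently the most widely used ensemble methods. The optimization strategies include hyperparameter optimizers, feature derivation and selection, and single-model self-blending. To date these are among the most advanced, most effective, and also most complex ways of improving a single model. Some rest on fairly deep theory and others on experience, but in every case we aim to push each ensemble algorithm to its limit on the current dataset. Note that throughout this process we treat the feature derivation and selection methods introduced earlier as model optimization methods and judge them solely by the final model results. Large-scale feature derivation and selection around ensemble models is also where these techniques deliver the most value.

&emsp;&emsp;<font face="仿宋">Once single models have reached their limit we move to the next stage, model fusion. Further multi-model fusion, or even multi-layer fusion, is only meaningful and effective after each single model has been pushed to its limit.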

---

# <center>Part 4. Training and Optimization Techniques for Ensemble Algorithms

In [1]:
# Core data science libraries
import numpy as np
import pandas as pd

# Visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Time module
import time

import warnings
warnings.filterwarnings('ignore')

# scikit-learn
# Data preprocessing
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

# Utility functions
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Common estimators
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Grid search
from sklearn.model_selection import GridSearchCV

# Support for custom estimators
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin

# Custom module
from telcoFunc import *
# Feature derivation module
import features_creation as fc
from features_creation import *

# Introspection and regular expressions
import inspect, re

# Other modules
from tqdm import tqdm
import gc

&emsp;&emsp;Next, run the data cleaning steps from Part 1:

In [2]:
# Load the data
tcc = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

# Label the continuous and categorical fields
# Categorical fields
category_cols = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
                'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 
                'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
                'PaymentMethod']

# Continuous fields
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
 
# Label
target = 'Churn'

# ID column
ID_col = 'customerID'

# Check that the columns are fully partitioned
assert len(category_cols) + len(numeric_cols) + 2 == tcc.shape[1]

# Convert the continuous fields
tcc['TotalCharges']= tcc['TotalCharges'].apply(lambda x: x if x!= ' ' else np.nan).astype(float)
tcc['MonthlyCharges'] = tcc['MonthlyCharges'].astype(float)

# Fill missing values
tcc['TotalCharges'] = tcc['TotalCharges'].fillna(0)

# Convert the label manually
# (assigning the mapped column avoids in-place replace on a column selection,
#  which newer pandas versions warn about or silently ignore)
tcc['Churn'] = tcc['Churn'].map({'Yes': 1, 'No': 0})

In [3]:
features = tcc.drop(columns=[ID_col, target]).copy()
labels = tcc['Churn'].copy()

&emsp;&emsp;We also create the ordinal-encoded dataset and the dataset with derived time-series features:

In [4]:
# Split into training and test sets
train, test = train_test_split(tcc, random_state=22)

X_train = train.drop(columns=[ID_col, target]).copy()
X_test = test.drop(columns=[ID_col, target]).copy()

y_train = train['Churn'].copy()
y_test = test['Churn'].copy()

X_train_seq = pd.DataFrame()
X_test_seq = pd.DataFrame()

# Derive the year
X_train_seq['tenure_year'] = ((72 - X_train['tenure']) // 12) + 2014
X_test_seq['tenure_year'] = ((72 - X_test['tenure']) // 12) + 2014

# Derive the month
X_train_seq['tenure_month'] = (72 - X_train['tenure']) % 12 + 1
X_test_seq['tenure_month'] = (72 - X_test['tenure']) % 12 + 1

# Derive the quarter
X_train_seq['tenure_quarter'] = ((X_train_seq['tenure_month']-1) // 3) + 1
X_test_seq['tenure_quarter'] = ((X_test_seq['tenure_month']-1) // 3) + 1

# One-hot encoding
enc = preprocessing.OneHotEncoder()
enc.fit(X_train_seq)

seq_new = list(X_train_seq.columns)

# Build the one-hot encoded DataFrames with column names
X_train_seq = pd.DataFrame(enc.transform(X_train_seq).toarray(), 
                           columns = cate_colName(enc, seq_new, drop=None))

X_test_seq = pd.DataFrame(enc.transform(X_test_seq).toarray(), 
                          columns = cate_colName(enc, seq_new, drop=None))

# Align the index with the original data
X_train_seq.index = X_train.index
X_test_seq.index = X_test.index
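
To make the arithmetic of the tenure-based derivation concrete, here is a small standalone sketch. It follows the same formulas as the cell above, which treat a tenure of 72 months as a customer who joined in January 2014. The helper name `tenure_to_period` is hypothetical and not part of the original notebook.

```python
def tenure_to_period(tenure):
    """Map months of tenure to an assumed (year, month, quarter) of joining."""
    offset = 72 - tenure            # months after the assumed start (January 2014)
    year = offset // 12 + 2014
    month = offset % 12 + 1
    quarter = (month - 1) // 3 + 1
    return year, month, quarter

for t in (72, 60, 1, 0):
    print(t, tenure_to_period(t))
```

A customer with `tenure=72` maps to the first month of 2014, while `tenure=1` maps to December 2019, so longer tenure always means an earlier assumed joining date.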

In [5]:
ord_enc = OrdinalEncoder()
ord_enc.fit(X_train[category_cols])

X_train_OE = pd.DataFrame(ord_enc.transform(X_train[category_cols]), columns=category_cols)
X_train_OE.index = X_train.index
X_train_OE = pd.concat([X_train_OE, X_train[numeric_cols]], axis=1)

X_test_OE = pd.DataFrame(ord_enc.transform(X_test[category_cols]), columns=category_cols)
X_test_OE.index = X_test.index
X_test_OE = pd.concat([X_test_OE, X_test[numeric_cols]], axis=1)

In [6]:
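# Third-party libraries newly used in this section
from joblib import dump, load
from sklearn.ensemble import VotingClassifier
from hyperopt import hp, fmin, tpe
from numpy.random import RandomState
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold  # required by the manual five-fold split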

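## <center>Ch.3 Basic Model Fusion Methods

### V. Weight-Search Fusion Based on Cross-Training

&emsp;&emsp;Next we try the second refinement of weight search: split the training data into training and validation folds by hand, and run hyperparameter search and optimization for each model separately on every training fold. The procedure is not complicated, but it involves a great deal of repeated training and hyperparameter tuning.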

<center><img src="https://s2.loli.net/2022/05/18/1UAlpgwt2mSbTQ5.png" alt="image-20220518205529331" style="zoom:33%;" />

### 1. Data Preparation

&emsp;&emsp;First we split the current dataset into five folds by hand, which KFold makes quick:

In [7]:
# Instantiate the KFold splitter
kf = KFold(n_splits=5, random_state=12, shuffle=True)

# Reset the index of the training features and training labels
X_train_OE = X_train_OE.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)

Next we create the training and validation sets for each split, as follows:

In [8]:
# Each iteration yields one training/validation split
for train_part_index, eval_index in kf.split(X_train_OE, y_train):
    print(train_part_index)
    print(eval_index)
    break

[   1    2    5 ... 5279 5280 5281]
[   0    3    4 ... 5271 5274 5275]


In [9]:
X_train_OE.loc[train_part_index]

| |gender|SeniorCitizen|Partner|Dependents|PhoneService|MultipleLines|InternetService|OnlineSecurity|OnlineBackup|DeviceProtection|TechSupport|StreamingTV|StreamingMovies|Contract|PaperlessBilling|PaymentMethod|tenure|MonthlyCharges|TotalCharges|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|1|0.0|0.0|1.0|1.0|1.0|0.0|1.0|0.0|2.0|0.0|2.0|0.0|0.0|0.0|1.0|2.0|3|80.00|241.30|
|2|1.0|0.0|0.0|0.0|1.0|0.0|2.0|1.0|1.0|1.0|1.0|1.0|1.0|0.0|0.0|3.0|4|19.00|73.45|
|5|0.0|0.0|1.0|1.0|1.0|2.0|1.0|2.0|0.0|2.0|0.0|0.0|0.0|1.0|0.0|0.0|69|84.45|5848.60|
|6|0.0|0.0|0.0|0.0|1.0|0.0|0.0|0.0|2.0|2.0|0.0|0.0|0.0|0.0|1.0|1.0|26|54.75|1406.90|
|7|0.0|0.0|0.0|0.0|1.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|0.0|1.0|3.0|11|44.65|472.25|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|5277|1.0|0.0|1.0|0.0|1.0|0.0|1.0|2.0|2.0|2.0|2.0|2.0|2.0|2.0|1.0|2.0|52|106.30|5487.00|
|5278|0.0|1.0|0.0|0.0|1.0|2.0|0.0|2.0|0.0|0.0|0.0|0.0|0.0|0.0|1.0|2.0|16|54.10|889.00|
|5279|0.0|1.0|0.0|0.0|1.0|2.0|1.0|0.0|2.0|2.0|0.0|2.0|2.0|0.0|1.0|2.0|28|106.15|3152.50|
|5280|0.0|0.0|1.0|1.0|1.0|0.0|2.0|1.0|1.0|1.0|1.0|1.0|1.0|1.0|0.0|1.0|15|20.35|335.95|


In [10]:
train_part_index_l = []
eval_index_l = []

for train_part_index, eval_index in kf.split(X_train_OE, y_train):
    train_part_index_l.append(train_part_index)
    eval_index_l.append(eval_index)

In [11]:
train_part_index_l

[array([   1,    2,    5, ..., 5279, 5280, 5281]),
 array([   0,    1,    2, ..., 5279, 5280, 5281]),
 array([   0,    1,    2, ..., 5279, 5280, 5281]),
 array([   0,    1,    3, ..., 5275, 5277, 5281]),
 array([   0,    2,    3, ..., 5278, 5279, 5280])]

In [12]:
# Training features
X_train1 = X_train_OE.loc[train_part_index_l[0]]
X_train2 = X_train_OE.loc[train_part_index_l[1]]
X_train3 = X_train_OE.loc[train_part_index_l[2]]
X_train4 = X_train_OE.loc[train_part_index_l[3]]
X_train5 = X_train_OE.loc[train_part_index_l[4]]

# Validation features
X_eval1 = X_train_OE.loc[eval_index_l[0]]
X_eval2 = X_train_OE.loc[eval_index_l[1]]
X_eval3 = X_train_OE.loc[eval_index_l[2]]
X_eval4 = X_train_OE.loc[eval_index_l[3]]
X_eval5 = X_train_OE.loc[eval_index_l[4]]

# Training labels
y_train1 = y_train.loc[train_part_index_l[0]]
y_train2 = y_train.loc[train_part_index_l[1]]
y_train3 = y_train.loc[train_part_index_l[2]]
y_train4 = y_train.loc[train_part_index_l[3]]
y_train5 = y_train.loc[train_part_index_l[4]]

# Validation labels
y_eval1 = y_train.loc[eval_index_l[0]]
y_eval2 = y_train.loc[eval_index_l[1]]
y_eval3 = y_train.loc[eval_index_l[2]]
y_eval4 = y_train.loc[eval_index_l[3]]
y_eval5 = y_train.loc[eval_index_l[4]]

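&emsp;&emsp;For this first manual cross-validation run we list all five training and validation splits explicitly so that the process is easy to follow. Once you are comfortable with it, the whole operation can be done in a single loop.

&emsp;&emsp;Next we put the training features and labels, and the validation features and labels, into two separate lists for later use: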

In [13]:
train_set = [(X_train1, y_train1), 
             (X_train2, y_train2), 
             (X_train3, y_train3), 
             (X_train4, y_train4), 
             (X_train5, y_train5)]

In [14]:
eval_set = [(X_eval1, y_eval1), 
            (X_eval2, y_eval2), 
            (X_eval3, y_eval3), 
            (X_eval4, y_eval4), 
            (X_eval5, y_eval5)]
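
The explicit variables above can also be produced in one loop. The sketch below is a minimal illustration on small synthetic data that stands in for `X_train_OE` and `y_train` (an assumption made so the example runs on its own). All names ending in `_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic stand-in for X_train_OE / y_train (assumption: default RangeIndex)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(23, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(rng.integers(0, 2, size=23))

kf_demo = KFold(n_splits=5, shuffle=True, random_state=12)

train_set_demo, eval_set_demo = [], []
for train_idx, eval_idx in kf_demo.split(X_demo):
    # KFold yields positional indices, so iloc is the safe accessor
    train_set_demo.append((X_demo.iloc[train_idx], y_demo.iloc[train_idx]))
    eval_set_demo.append((X_demo.iloc[eval_idx], y_demo.iloc[eval_idx]))

# Every row lands in exactly one validation fold
eval_rows = np.concatenate([X.index.to_numpy() for X, _ in eval_set_demo])
print(sorted(eval_rows.tolist()) == list(range(len(X_demo))))
```

Because each row appears in exactly one validation fold, the per-fold validation predictions can later be stitched back into one prediction per training row.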

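### 2. Model Training

&emsp;&emsp;With the data split, we now train a model and tune its hyperparameters on each training fold. There are two basic ways to do this. The simplest is to train and tune each model separately on the five training folds, following the tuning strategy from the previous section one model at a time, so that each model type yields five different tuned models. The second is batch hyperparameter search: define a very large search space, loop over the five training folds, and train and tune five models. If every model's best hyperparameters fall strictly inside the search space, all five are taken as optimal. If any model's best hyperparameters lie on the boundary of the space, the space must be redefined for that model (using its own training fold) and the search repeated. The first approach needs a lot of manual work but saves compute time. The second is more automated but costs far more compute. For fast-to-search models such as decision trees and logistic regression, batch search works well. For random forest, where tuning even a single model is slow, we recommend tuning each model by hand.

> Model training is tedious but crucial: its quality directly determines the final fusion result.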
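
To get a feel for why exhaustive search becomes expensive, the number of model fits can be counted before launching a search. The sketch below uses a hypothetical grid whose shape resembles a late-stage random forest search. It assumes five-fold cross-validation, which is the default for `GridSearchCV`.

```python
from math import prod

# Hypothetical grid with the shape of a late-stage random forest search
demo_space = {
    "min_samples_leaf": range(4, 7),
    "min_samples_split": range(2, 4),
    "max_depth": range(7, 11),
    "max_leaf_nodes": [None] + list(range(31, 34)),
    "n_estimators": range(93, 96),
    "max_features": ["sqrt", "log2"] + list(range(5, 8)),
    "max_samples": [None, 0.49, 0.5, 0.51],
}

n_candidates = prod(len(values) for values in demo_space.values())
cv_folds = 5                       # assumed number of cross-validation folds
n_fits = n_candidates * cv_folds   # the final refit is not counted

print(n_candidates, n_fits)
```

Even with only two to five values per hyperparameter, the grid multiplies out to thousands of candidates, which is why the random forest search is narrowed round by round instead of run once over a wide space.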

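#### 2.1 Cross-Training the Random Forest

- Training on X_train1 & y_train1

&emsp;&emsp;First we train and tune a model on the first training fold. Each round of the search can follow the method of the previous subsection. Because this training fold differs little from the full training set, the search can also start from the best hyperparameter set found in the previous subsection. Only the settings of the final search round are shown here: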

In [288]:
start = time.time()

# Define the hyperparameter space
parameter_space = {
    "min_samples_leaf": range(4, 7), 
    "min_samples_split": range(2, 4),  # 1 is not a valid value in scikit-learn, so it is dropped
    "max_depth": range(7, 11),
    "max_leaf_nodes": [None] + list(range(31, 34)), 
    "n_estimators": range(93, 96), 
    "max_features":['sqrt', 'log2'] + list(range(5, 8)), 
    "max_samples":[None, 0.49, 0.5, 0.51]}

# Instantiate the model and the search estimator
RF_1 = RandomForestClassifier(random_state=12)
grid_RF_1 = GridSearchCV(RF_1, parameter_space, n_jobs=15)

# Fit the model
grid_RF_1.fit(X_train1, y_train1)

print(time.time()-start)

336.0223355293274


In [289]:
grid_RF_1.best_score_

0.805680473372781

In [290]:
grid_RF_1.best_params_

{'max_depth': 8,
 'max_features': 6,
 'max_leaf_nodes': 32,
 'max_samples': 0.5,
 'min_samples_leaf': 5,
 'min_samples_split': 2,
 'n_estimators': 94}

In [291]:
grid_RF_1.score(X_train1, y_train1), grid_RF_1.score(X_eval1, y_eval1), grid_RF_1.score(X_test_OE, y_test)

(0.818698224852071, 0.8164616840113529, 0.7819420783645656)

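&emsp;&emsp;Limited by the smaller sample, the model obtained on this training fold does not perform as well as the model trained on the full training set. Notice also that in many cases the validation score is closer to the test score than the training score is. This is because the validation fold, like the test set, took no part in fitting the model, so it reflects generalization ability better than the training fold does.

&emsp;&emsp;So far we have built only one model. We still need to repeat the process on the other four training/validation splits.

- Training on X_train2 & y_train2

&emsp;&emsp;Next we tune hyperparameters on training fold 2: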

In [361]:
start = time.time()

# Define the hyperparameter space
parameter_space = {
    "min_samples_leaf": range(4, 7), 
    "min_samples_split": range(2, 4),  # 1 is not a valid value in scikit-learn, so it is dropped
    "max_depth": range(8, 11),
    "max_leaf_nodes": [None] + list(range(39, 42)), 
    "n_estimators": range(79, 82), 
    "max_features":['sqrt', 'log2'] + list(range(3, 7)), 
    "max_samples":[None, 0.369, 0.37, 0.371]}

# Instantiate the model and the search estimator
RF_2 = RandomForestClassifier(random_state=12)
grid_RF_2 = GridSearchCV(RF_2, parameter_space, n_jobs=15)

# Fit the model
grid_RF_2.fit(X_train2, y_train2)

print(time.time()-start)

234.58312821388245


In [362]:
grid_RF_2.best_score_

0.8120710059171599

In [363]:
grid_RF_2.best_params_

{'max_depth': 9,
 'max_features': 5,
 'max_leaf_nodes': 40,
 'max_samples': 0.37,
 'min_samples_leaf': 5,
 'min_samples_split': 2,
 'n_estimators': 80}

In [23]:
grid_RF_2.score(X_train2, y_train2), grid_RF_2.score(X_eval2, y_eval2), grid_RF_2.score(X_test_OE, y_test)

(0.8260355029585799, 0.7965941343424787, 0.7893242475865985)

- Training on X_train3 & y_train3

&emsp;&emsp;The hyperparameter search for fold 3 is as follows:

In [364]:
start = time.time()

# Define the hyperparameter space
parameter_space = {
    "min_samples_leaf": range(1, 4), 
    "min_samples_split": range(6, 9),
    "max_depth": range(9, 12),
    "max_leaf_nodes": [None] + list(range(10, 70, 20)), 
    "n_estimators": range(96, 99), 
    "max_features":['sqrt', 'log2'] + list(range(4, 7)),
    "max_samples":[None] + list(range(1998, 2001))}

# Instantiate the model and the search estimator
RF_3 = RandomForestClassifier(random_state=12)
grid_RF_3 = GridSearchCV(RF_3, parameter_space, n_jobs=15)

# Fit the model
grid_RF_3.fit(X_train3, y_train3)

print(time.time()-start)

364.97769808769226


In [365]:
grid_RF_3.best_score_

0.8130577587533399

In [366]:
grid_RF_3.best_params_

{'max_depth': 10,
 'max_features': 5,
 'max_leaf_nodes': None,
 'max_samples': 1999,
 'min_samples_leaf': 2,
 'min_samples_split': 7,
 'n_estimators': 97}

In [367]:
grid_RF_3.score(X_train3, y_train3), grid_RF_3.score(X_eval3, y_eval3), grid_RF_3.score(X_test_OE, y_test)

(0.862991008045433, 0.7964015151515151, 0.7853492333901193)

- Training on X_train4 & y_train4

&emsp;&emsp;The hyperparameter search for fold 4 is as follows:

In [283]:
start = time.time()

# Define the hyperparameter space
parameter_space = {
    "min_samples_leaf": range(4, 7), 
    "min_samples_split": range(2, 4),  # 1 is not a valid value in scikit-learn, so it is dropped
    "max_depth": range(9, 12),
    "max_leaf_nodes": [None] + list(range(74, 77)), 
    "n_estimators": range(94, 97), 
    "max_features":['sqrt', 'log2'] + list(range(4, 8)), 
    "max_samples":[None, 0.459, 0.46, 0.461]}

# Instantiate the model and the search estimator
RF_4 = RandomForestClassifier(random_state=12)
grid_RF_4 = GridSearchCV(RF_4, parameter_space, n_jobs=15)

# Fit the model
grid_RF_4.fit(X_train4, y_train4)

print(time.time()-start)

319.0900549888611


In [284]:
grid_RF_4.best_score_

0.8116398785793221

In [285]:
grid_RF_4.best_params_

{'max_depth': 10,
 'max_features': 'sqrt',
 'max_leaf_nodes': 75,
 'max_samples': 0.46,
 'min_samples_leaf': 5,
 'min_samples_split': 2,
 'n_estimators': 95}

In [286]:
grid_RF_4.score(X_train4, y_train4), grid_RF_4.score(X_eval4, y_eval4), grid_RF_4.score(X_test_OE, y_test)

(0.8338854708944629, 0.7982954545454546, 0.7887563884156729)

- Training on X_train5 & y_train5

&emsp;&emsp;The hyperparameter search for fold 5 is as follows:

In [276]:
start = time.time()

# Define the hyperparameter space
parameter_space = {
    "min_samples_leaf": range(2, 5), 
    "min_samples_split": range(2, 4),  # 1 is not a valid value in scikit-learn, so it is dropped
    "max_depth": range(7, 10),
    "max_leaf_nodes": [None] + list(range(10, 70, 20)), 
    "n_estimators": range(94, 97), 
    "max_features":['sqrt', 'log2'] + list(range(1, 7)),
    "max_samples":[None] + list(range(2000, 2003))}

# Instantiate the model and the search estimator
RF_5 = RandomForestClassifier(random_state=12)
grid_RF_5 = GridSearchCV(RF_5, parameter_space, n_jobs=15)

# Fit the model
grid_RF_5.fit(X_train5, y_train5)

print(time.time()-start)

371.5700776576996


In [277]:
grid_RF_5.best_score_

0.8125841062011275

In [278]:
grid_RF_5.best_params_

{'max_depth': 8,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': 2001,
 'min_samples_leaf': 3,
 'min_samples_split': 2,
 'n_estimators': 95}

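#### 2.2 Saving the Models

&emsp;&emsp;Training a group of models takes effort, but careful and rigorous training is what makes fusion work. If tuning starts from the very first search round, training the five models above takes roughly 4 hours of run time. So once training is done, be sure to save the models locally: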

In [170]:
dump(grid_RF_1, 'grid_RF_1.joblib') 
dump(grid_RF_2, 'grid_RF_2.joblib') 
dump(grid_RF_3, 'grid_RF_3.joblib') 
dump(grid_RF_4, 'grid_RF_4.joblib') 
dump(grid_RF_5, 'grid_RF_5.joblib') 

['grid_RF_5.joblib']

They can later be loaded back with the following statements:

In [171]:
grid_RF_1 = load('grid_RF_1.joblib') 
grid_RF_2 = load('grid_RF_2.joblib') 
grid_RF_3 = load('grid_RF_3.joblib') 
grid_RF_4 = load('grid_RF_4.joblib') 
grid_RF_5 = load('grid_RF_5.joblib') 

We also put the fitted models themselves (rather than the search objects) into a list for later use:

In [172]:
RF_1 = grid_RF_1.best_estimator_
RF_2 = grid_RF_2.best_estimator_
RF_3 = grid_RF_3.best_estimator_
RF_4 = grid_RF_4.best_estimator_
RF_5 = grid_RF_5.best_estimator_

In [173]:
RF_l = [RF_1, RF_2, RF_3, RF_4, RF_5]
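
The save and load steps can also be written as a loop. This is a minimal sketch that uses plain dictionaries in place of the fitted search objects and writes to a temporary directory, both assumptions made so it runs anywhere. The file names are hypothetical.

```python
import os
import tempfile
from joblib import dump, load

# Plain dicts stand in for the fitted search objects (assumption for brevity)
demo_models = [{"fold": i, "best_depth": 7 + i} for i in range(1, 6)]

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = [os.path.join(tmp_dir, f"grid_demo_{i}.joblib") for i in range(1, 6)]
    for model, path in zip(demo_models, paths):
        dump(model, path)                       # one file per fold
    restored = [load(path) for path in paths]   # read them back in the same order

print(restored == demo_models)
```

Keeping the fold number in the file name preserves the pairing between each model and the validation fold it must be scored on.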

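#### 2.3 Evaluating Model Performance

&emsp;&emsp;Next we run a quick test of this group of models. During training we saw how each individual model did on its training, validation, and test data. But we will not use any one of these models alone. We combine the outputs of the whole group, so the five models need to be evaluated as a group. The main yardstick is how the five models perform on their validation folds. A quick way is to average the five validation scores: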

In [174]:
# Mean validation accuracy over the five folds
eval_score = 0

for i  in range(5):
    X, y = eval_set[i]
    eval_score += RF_l[i].score(X, y)
    
eval_score / 5

0.8173065207419512

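&emsp;&emsp;With training and validation information strictly isolated, the validation accuracy drops somewhat. A more rigorous way to compute validation accuracy is to stitch the five validation folds together into a complete "training set" and then compute the accuracy on that "training set". The five validation folds are assembled as follows: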

<center><img src="https://s2.loli.net/2022/05/10/bPyRBSnD8ZwmQI9.png" alt="image-20220510162151351" style="zoom:50%;" />

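We first output the predictions for each validation fold and store them as Series objects. Each Series uses the index of its validation samples, which makes the later concatenation easy: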

In [129]:
eval1_predict_proba_RF = pd.Series(RF_l[0].predict_proba(X_eval1)[:, 1], index=X_eval1.index)
eval2_predict_proba_RF = pd.Series(RF_l[1].predict_proba(X_eval2)[:, 1], index=X_eval2.index)
eval3_predict_proba_RF = pd.Series(RF_l[2].predict_proba(X_eval3)[:, 1], index=X_eval3.index)
eval4_predict_proba_RF = pd.Series(RF_l[3].predict_proba(X_eval4)[:, 1], index=X_eval4.index)
eval5_predict_proba_RF = pd.Series(RF_l[4].predict_proba(X_eval5)[:, 1], index=X_eval5.index)

Then we concatenate them into one complete Series and sort it by index:

In [130]:
eval_predict_proba_RF = pd.concat([eval1_predict_proba_RF, 
                                   eval2_predict_proba_RF, 
                                   eval3_predict_proba_RF, 
                                   eval4_predict_proba_RF, 
                                   eval5_predict_proba_RF]).sort_index()

eval_predict_proba_RF

0       0.044787
1       0.572187
2       0.161815
3       0.250871
4       0.122533
          ...   
5277    0.082653
5278    0.346562
5279    0.551481
5280    0.049011
5281    0.002783
Length: 5282, dtype: float64

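The concatenated validation predictions have the same length as the training set: the validation folds have been "stitched" into a complete training set. More precisely, the validation folds give us a full set of predictions for the training set. Next, using 0.5 as the threshold, we compute the accuracy on the training set (that is, on the complete validation set):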

In [131]:
accuracy_score((eval_predict_proba_RF >= 0.5) * 1, y_train)

0.8173040514956456

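The accuracy is 0.8173. Note that it differs slightly from the mean of the five validation accuracies (0.81730405 versus 0.81730652), because the five folds are not all the same size. For rigor, and because it makes the weighted-fusion code that follows simpler, we recommend the pooled version when evaluating a group of models as a whole.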

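> Concatenated validation-fold predictions are also a required data object for Stacking.

> We have five models here and need a single final output. Have you thought about fusing these five models with weights?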
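
The small gap between the mean of the fold accuracies and the pooled accuracy comes from unequal fold sizes. The sketch below illustrates this with hypothetical counts of correct predictions. Only the fold sizes (5282 rows split into folds of 1057, 1057, 1056, 1056, and 1056) follow the split used here.

```python
# Hypothetical per-fold results: (number of correct predictions, fold size)
folds = [(860, 1057), (845, 1057), (870, 1056), (850, 1056), (880, 1056)]

# Mean of the five fold accuracies: every fold counts equally
mean_of_fold_acc = sum(c / n for c, n in folds) / len(folds)

# Pooled accuracy: every sample counts equally
pooled_acc = sum(c for c, _ in folds) / sum(n for _, n in folds)

print(round(mean_of_fold_acc, 6), round(pooled_acc, 6))
```

The two numbers agree to about five decimal places and would be identical if all folds had the same size.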

&emsp;&emsp;We can also compute the test accuracy. As before, a simple option is the mean test accuracy of the five models:

In [132]:
test_score = 0

for i in range(5):
    test_score += RF_l[i].score(X_test_OE, y_test)
    
test_score / 5

0.7904599659284497

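Note, however, that if we use such a group of models to predict on test data, the actual prediction procedure should be soft voting: average the probabilities output by the models and apply a threshold of 0.5. The basic process is: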

In [133]:
test_predict_proba_RF = []

for i in range(5):
    test_predict_proba_RF.append(RF_l[i].predict_proba(X_test_OE)[:, 1])

test_predict_proba_RF = np.array(test_predict_proba_RF)

In [134]:
test_predict_proba_RF

array([[3.93370391e-02, 2.49403422e-01, 2.83410667e-02, ...,
        1.49452536e-01, 4.98110882e-01, 1.18764226e-01],
       [4.61954631e-02, 2.45875467e-01, 3.00760027e-02, ...,
        1.38993336e-01, 5.02889767e-01, 1.16184501e-01],
       [2.00137393e-03, 4.16662468e-01, 4.90918017e-04, ...,
        1.42216654e-01, 6.08904016e-01, 8.56019321e-02],
       [3.59908839e-02, 3.22891626e-01, 1.62280766e-02, ...,
        1.52682083e-01, 5.04891040e-01, 1.27421886e-01],
       [2.25733214e-02, 3.25064759e-01, 6.08561589e-03, ...,
        1.43386982e-01, 5.38634836e-01, 8.61879770e-02]])

Then we average across the five models (along axis 0) to get the mean predicted probability for each test sample:

In [135]:
test_predict_proba_RF = test_predict_proba_RF.mean(0)
test_predict_proba_RF

array([0.02921962, 0.31197955, 0.01624434, ..., 0.14534632, 0.53068611,
       0.1068321 ])

Compute the final accuracy:

In [136]:
accuracy_score((test_predict_proba_RF>=0.5)*1, y_test)

0.7915956842703009

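The two results differ slightly. Note that this cross-trained group of random forests scores slightly lower on the test set than the model trained on the full training set. The root cause is that each model sees less data, which weakens its discriminative power. However, weaker individual models do not necessarily mean a worse fusion result, a point we discuss in detail later.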

|Models|eval_train_score|test_score|
|:--:|:--:|:--:|
|Logistic+grid|0.8104|0.7836|
|tree+grid|0.7991|0.7683|
|RF+grid|0.8104|0.7955|
|RF_l|0.8173|0.7915|
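
The difference between averaging accuracies and soft voting can be seen on a toy case. The probabilities below are hypothetical and chosen by hand. They are not taken from the models above.

```python
import numpy as np

# Hypothetical positive-class probabilities: 3 models (rows) x 4 samples (columns)
probas = np.array([[0.45, 0.80, 0.10, 0.55],
                   [0.52, 0.70, 0.20, 0.40],
                   [0.58, 0.90, 0.05, 0.45]])
y_true = np.array([1, 1, 0, 0])

# Option 1: threshold each model separately, then average the accuracies
per_model_acc = [((p >= 0.5).astype(int) == y_true).mean() for p in probas]

# Option 2 (soft vote): average the probabilities first, then threshold once
soft_pred = (probas.mean(axis=0) >= 0.5).astype(int)
soft_acc = (soft_pred == y_true).mean()

print(np.mean(per_model_acc), soft_acc)
```

In this toy case the first model is wrong on two samples, but its errors are outvoted once the probabilities are averaged, so the soft vote scores higher than the mean of the individual accuracies.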

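#### 2.4 Cross-Training Decision Trees and Testing Their Performance

- Training the decision trees

&emsp;&emsp;Hyperparameter search for a decision tree is simple and fast, so we can define a fairly large search space, loop over the five training folds, and train and tune five decision trees. First, a quick test of how long it takes to train and tune one model over a large parameter space: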

In [34]:
# Instantiate the decision tree estimator
tree_model = DecisionTreeClassifier()

tree_param = {'max_depth': np.arange(2, 16, 1).tolist(), 
              'min_samples_split': np.arange(2, 5, 1).tolist(),  # integer values must be at least 2
              'min_samples_leaf': np.arange(1, 4, 1).tolist(), 
              'max_leaf_nodes':np.arange(6, 30, 1).tolist()}

In [35]:
# Instantiate the grid search estimator
tree_search = GridSearchCV(estimator = tree_model,
                           param_grid = tree_param,
                           n_jobs = 12)

In [36]:
# Fit on one training fold and time it
s = time.time()
tree_search.fit(X_train2, y_train2)
print(time.time()-s, "s")

10.22828483581543 s


We can then loop over the five folds and train:

In [37]:
tree_search_l = []

for X, y in tqdm(train_set):
    tree_model = DecisionTreeClassifier(random_state=12)
    tree_search = GridSearchCV(estimator = tree_model,
                               param_grid = tree_param,
                               n_jobs = 12).fit(X, y)
    tree_search_l.append(tree_search)

100%|████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:45<00:00,  9.11s/it]


In [38]:
len(tree_search_l)

5

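&emsp;&emsp;Note that after training the five models, each one must be checked to see whether its best hyperparameters lie inside the initial search range. If a value sits on the boundary of the range, the range must be reset and the search rerun. The results shown here have already been through several such checks.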
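
The boundary check can be automated with a small helper. The function below is a hypothetical sketch, not part of the original notebook. It assumes the grid values are plain Python numbers (as produced by `.tolist()`) and ignores non-numeric candidates such as `None` or strings.

```python
def params_on_boundary(best_params, param_grid):
    """Return names of numeric parameters whose best value is the smallest or largest candidate."""
    flagged = []
    for name, best in best_params.items():
        numeric = [v for v in param_grid[name] if isinstance(v, (int, float))]
        if isinstance(best, (int, float)) and len(numeric) > 1:
            if best == min(numeric) or best == max(numeric):
                flagged.append(name)
    return flagged

demo_grid = {"max_depth": list(range(2, 16)), "max_leaf_nodes": list(range(6, 30))}
print(params_on_boundary({"max_depth": 5, "max_leaf_nodes": 29}, demo_grid))
```

A flagged parameter is a hint rather than a verdict: a value at a hard lower limit, such as the smallest value a parameter is allowed to take, does not call for widening the range.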

- Saving the decision trees

&emsp;&emsp;Once training is complete, the models can be saved locally:

In [21]:
dump(tree_search_l[0], 'grid_tree_1.joblib') 
dump(tree_search_l[1], 'grid_tree_2.joblib') 
dump(tree_search_l[2], 'grid_tree_3.joblib') 
dump(tree_search_l[3], 'grid_tree_4.joblib') 
dump(tree_search_l[4], 'grid_tree_5.joblib') 

['grid_tree_5.joblib']

In later sessions the models can be loaded directly as follows and collected into a list:

In [28]:
grid_tree_1 = load('grid_tree_1.joblib')
grid_tree_2 = load('grid_tree_2.joblib')
grid_tree_3 = load('grid_tree_3.joblib')
grid_tree_4 = load('grid_tree_4.joblib')
grid_tree_5 = load('grid_tree_5.joblib')

tree_1 = grid_tree_1.best_estimator_
tree_2 = grid_tree_2.best_estimator_
tree_3 = grid_tree_3.best_estimator_
tree_4 = grid_tree_4.best_estimator_
tree_5 = grid_tree_5.best_estimator_

tree_l = [tree_1, tree_2, tree_3, tree_4, tree_5]

- Validation performance

&emsp;&emsp;Following the method introduced earlier, we build validation-based predictions for the complete training set:

In [29]:
eval1_predict_proba_tree = pd.Series(tree_l[0].predict_proba(X_eval1)[:, 1], index=X_eval1.index)
eval2_predict_proba_tree = pd.Series(tree_l[1].predict_proba(X_eval2)[:, 1], index=X_eval2.index)
eval3_predict_proba_tree = pd.Series(tree_l[2].predict_proba(X_eval3)[:, 1], index=X_eval3.index)
eval4_predict_proba_tree = pd.Series(tree_l[3].predict_proba(X_eval4)[:, 1], index=X_eval4.index)
eval5_predict_proba_tree = pd.Series(tree_l[4].predict_proba(X_eval5)[:, 1], index=X_eval5.index)

Then we concatenate them into one complete Series and sort it by index:

In [30]:
eval_predict_proba_tree = pd.concat([eval1_predict_proba_tree, 
                                     eval2_predict_proba_tree, 
                                     eval3_predict_proba_tree, 
                                     eval4_predict_proba_tree, 
                                     eval5_predict_proba_tree]).sort_index()

eval_predict_proba_tree

0       0.037669
1       0.787986
2       0.222819
3       0.259434
4       0.107345
          ...   
5277    0.062959
5278    0.222819
5279    0.438538
5280    0.066419
5281    0.026367
Length: 5282, dtype: float64

Finally, compute the accuracy:

In [31]:
accuracy_score((eval_predict_proba_tree >= 0.5) * 1, y_train)

0.7913669064748201

We also test this group of decision trees on the test set:

In [32]:
test_predict_proba_tree = []

for i in range(5):
    test_predict_proba_tree.append(tree_l[i].predict_proba(X_test_OE)[:, 1])

test_predict_proba_tree = np.array(test_predict_proba_tree)
test_predict_proba_tree = test_predict_proba_tree.mean(0)

In [33]:
test_predict_proba_tree

array([0.04647312, 0.15890043, 0.04647312, ..., 0.15890043, 0.43740054,
       0.1303992 ])

Compute the final accuracy:

In [34]:
accuracy_score((test_predict_proba_tree>=0.5)*1, y_test)

0.7745599091425327

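Interestingly, after cross-training, the group of decision trees reaches a higher test accuracy than the single decision tree did (0.7745 versus 0.7683). Let us briefly consider why.

&emsp;&emsp;The cross-training procedure closely resembles drawing bootstrap samples and then performing Bagging. The only difference is that we do not go on to assign different features to different trees. If we did, this would be a simple random forest. Moreover, a decision tree is inherently unstable: differences in the sample can lead to large differences in the fitted model, which makes it well suited to Bagging, and combining just a few trees gives an immediate gain. A random forest, in contrast, is already a strong and very stable learner, so simple cross-training does not create much diversity among the models. Instead, the reduced sample size noticeably hurts each forest's learning, and overall discriminative power drops.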

|Models|eval_train_score|test_score|
|:--:|:--:|:--:|
|Logistic+grid|0.8104|0.7836|
|tree+grid|0.7991|0.7683|
|tree_l|0.7913|<font color="red">**0.7745(↑)**</font>|
|RF+grid|0.8104|0.7955|
|RF_l|0.8173|<font color="green">**0.7915(↓)**</font>|

> This result also shows why decision trees are "easy to ensemble" while random forests are "hard to ensemble".
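
The intuition that averaging helps unstable learners more than stable ones can be illustrated with a toy simulation. This is not a decision tree or a random forest: each "model" is simply the true value plus independent noise, which is a strong simplifying assumption, since real cross-trained models share most of their training data and are therefore correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 0.3

# Toy stand-ins: 10000 repetitions of 5 "models" each
unstable = true_value + rng.normal(0, 0.20, size=(10000, 5))   # high-variance learner
stable = true_value + rng.normal(0, 0.02, size=(10000, 5))     # low-variance learner

for name, preds in [("unstable", unstable), ("stable", stable)]:
    single_sd = preds[:, 0].std()             # spread of one model's output
    averaged_sd = preds.mean(axis=1).std()    # spread after averaging 5 models
    print(name, round(float(single_sd), 3), round(float(averaged_sd), 3))
```

Averaging shrinks the spread of both learners by a similar factor, but the absolute reduction is far larger for the unstable one, which is where the visible gain in accuracy comes from.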

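#### 2.5 Cross-Training Logistic Regression and Testing Its Performance

- Cross-training logistic regression

&emsp;&emsp;As with the decision trees, we batch cross-train five logistic regression models: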

In [592]:
# Candidate preprocessing steps for the continuous features
# (defined before the search space because the space refers to it)
num_pre = ['passthrough', 
           preprocessing.StandardScaler(), 
           preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')]

# Define the hyperparameter space
logistic_param = [
    {'columntransformer__num':num_pre, 'logit_threshold__thr': np.arange(0.1, 1, 0.1).tolist(), 'logit_threshold__penalty': ['l1'], 'logit_threshold__C': np.arange(0.1, 1.1, 0.1).tolist(), 'logit_threshold__solver': ['saga']}, 
    {'columntransformer__num':num_pre, 'logit_threshold__thr': np.arange(0.1, 1, 0.1).tolist(), 'logit_threshold__penalty': ['l2'], 'logit_threshold__C': np.arange(0.1, 1.1, 0.1).tolist(), 'logit_threshold__solver': ['lbfgs', 'newton-cg', 'sag', 'saga']}, 
]

In [593]:
lr_search_l = []

for X, y in tqdm(train_set):
    # Instantiate the estimators
    logistic_model = logit_threshold(max_iter=int(1e6))
    logistic_pre = ColumnTransformer([('cat', preprocessing.OneHotEncoder(drop='if_binary'), category_cols), 
                                      ('num', 'passthrough', numeric_cols)])
    # Build the pipeline
    logistic_pipe = make_pipeline(logistic_pre, logistic_model)
    # Run the grid search
    logistic_search = GridSearchCV(estimator = logistic_pipe,
                                   param_grid = logistic_param,
                                   scoring='accuracy',
                                   n_jobs = 15).fit(X, y)
    lr_search_l.append(logistic_search)

100%|███████████████████████████████████████████████████████████████████████████████████| 5/5 [19:24<00:00, 232.80s/it]


In [594]:
len(lr_search_l)

5

- Saving and loading the models

&emsp;&emsp;Likewise, we save this batch of trained logistic regression models locally:

In [595]:
dump(lr_search_l[0], 'grid_lr_1.joblib') 
dump(lr_search_l[1], 'grid_lr_2.joblib') 
dump(lr_search_l[2], 'grid_lr_3.joblib') 
dump(lr_search_l[3], 'grid_lr_4.joblib') 
dump(lr_search_l[4], 'grid_lr_5.joblib') 

['grid_lr_5.joblib']

They can later be loaded as follows and collected into a list:

In [35]:
grid_lr_1 = load('grid_lr_1.joblib')
grid_lr_2 = load('grid_lr_2.joblib')
grid_lr_3 = load('grid_lr_3.joblib')
grid_lr_4 = load('grid_lr_4.joblib')
grid_lr_5 = load('grid_lr_5.joblib')

lr_1 = grid_lr_1.best_estimator_
lr_2 = grid_lr_2.best_estimator_
lr_3 = grid_lr_3.best_estimator_
lr_4 = grid_lr_4.best_estimator_
lr_5 = grid_lr_5.best_estimator_

lr_l = [lr_1, lr_2, lr_3, lr_4, lr_5]

- Testing model performance

&emsp;&emsp;Next we test the performance of the logistic regression group:

In [36]:
eval1_predict_proba_lr = pd.Series(lr_l[0].predict_proba(X_eval1)[:, 1], index=X_eval1.index)
eval2_predict_proba_lr = pd.Series(lr_l[1].predict_proba(X_eval2)[:, 1], index=X_eval2.index)
eval3_predict_proba_lr = pd.Series(lr_l[2].predict_proba(X_eval3)[:, 1], index=X_eval3.index)
eval4_predict_proba_lr = pd.Series(lr_l[3].predict_proba(X_eval4)[:, 1], index=X_eval4.index)
eval5_predict_proba_lr = pd.Series(lr_l[4].predict_proba(X_eval5)[:, 1], index=X_eval5.index)

Then we concatenate them into one complete Series and sort it by index:

In [37]:
eval_predict_proba_lr = pd.concat([eval1_predict_proba_lr, 
                                   eval2_predict_proba_lr, 
                                   eval3_predict_proba_lr, 
                                   eval4_predict_proba_lr, 
                                   eval5_predict_proba_lr]).sort_index()

eval_predict_proba_lr

0       0.011289
1       0.542331
2       0.154121
3       0.273393
4       0.158399
          ...   
5277    0.083575
5278    0.365228
5279    0.674365
5280    0.050536
5281    0.005250
Length: 5282, dtype: float64

Finally, compute the accuracy:

In [38]:
accuracy_score((eval_predict_proba_lr >= 0.5) * 1, y_train)

0.8087845513063233

We also test the logistic regression group on the test set:

In [39]:
test_predict_proba_lr = []

for i in range(5):
    test_predict_proba_lr.append(lr_l[i].predict_proba(X_test_OE)[:, 1])

test_predict_proba_lr = np.array(test_predict_proba_lr)
test_predict_proba_lr = test_predict_proba_lr.mean(0)

In [40]:
test_predict_proba_lr

array([0.045946  , 0.2337113 , 0.00519206, ..., 0.12532499, 0.5019908 ,
       0.0627869 ])

Compute the final accuracy:

In [41]:
accuracy_score((test_predict_proba_lr>=0.5)*1, y_test)

0.7887563884156729

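As with the decision trees, cross-training also improves the final ensemble result for logistic regression. However, because regularized logistic regression is fairly stable, the gain is not as large as it was for the decision trees.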

|Models|eval_train_score|test_score|
|:--:|:--:|:--:|
|Logistic+grid|0.8104|0.7836|
|lr_l|0.8087|<font color="red">**0.7887(↑)**</font>|
|tree+grid|0.7991|0.7683|
|tree_l|0.7913|<font color="red">**0.7745(↑)**</font>|
|RF+grid|0.8104|0.7955|
|RF_l|0.8173|<font color="green">**0.7915(↓)**</font>|

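&emsp;&emsp;Of course, we did not create these three groups of models in order to average their votes in a Bagging-like ensemble. The aim is to apply a more flexible fusion strategy and obtain a better fusion result.

### 3. Fusion Weight Search Based on Cross-Training

- Testing the basic workflow

&emsp;&emsp;Having trained three groups of models with validation information strictly isolated, we return to the weight search problem: can we find a better set of fusion weights under this strict isolation? We have run TPE search several times already, and the basic workflow is not complicated. But with several "groups" of models, how should we measure the performance of a given set of weights and threshold? Earlier searches were judged by the mean validation accuracy. Now that we have split the data by hand and cross-computed predictions, we have validation-based predictions for the complete training set with no information leakage. These predictions represent generalization ability better than the mean of validation scores introduced earlier, so we search for weights using each model group's (validation-based) predictions on the training set. The flow of the computation is shown below: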

<center><img src="https://s2.loli.net/2022/05/23/ZJs9oHAFQ2wa8Vt.png" alt="image-20220523213524224" style="zoom:33%;" />

For example, with weights of 10:1:100 and a threshold of 0.5, the weighted fusion prediction is computed as follows:

In [465]:
weights = [10, 1, 100]
thr = 0.5

weight1 = weights[0]
weight2 = weights[1]
weight3 = weights[2]

weights_sum = weight1 + weight2 + weight3

predict_probo_weight = (eval_predict_proba_lr * weight1 + 
                        eval_predict_proba_tree * weight2 + 
                        eval_predict_proba_RF * weight3) / weights_sum

res_weight = (predict_probo_weight >= thr) * 1

accuracy_score(res_weight, y_train)

0.8053767512305945
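
For a single sample the weighted fusion reduces to simple arithmetic. The sketch below uses hypothetical probabilities for the three model groups. The helper name `weighted_soft_vote` is not part of the original notebook.

```python
def weighted_soft_vote(probas, weights, thr=0.5):
    """Weighted average of positive-class probabilities, followed by one threshold."""
    fused = sum(p * w for p, w in zip(probas, weights)) / sum(weights)
    return fused, int(fused >= thr)

# Hypothetical outputs of the logistic, tree, and forest groups for one sample
fused, label = weighted_soft_vote([0.60, 0.90, 0.40], weights=[10, 1, 100])
print(round(fused, 4), label)
```

Here two of the three groups predict the positive class, yet the fused probability is (6 + 0.9 + 40) / 111, which falls below 0.5 because the third group carries most of the weight.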

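- Full-range search

&emsp;&emsp;On this basis we can set up the TPE hyperparameter search as follows. Because information was isolated very strictly during this round of training, validation performance should measure generalization reasonably well. So we first try the search without trimming the search space, and test whether the search over weights and threshold still shows a strong tendency to overfit: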

In [191]:
# Define the hyperparameter space
params_space = {'thr': hp.uniform("thr", 0.4, 0.6), 
                'weight1': hp.uniform("weight1",0,1),
                'weight2': hp.uniform("weight2",0,1),
                'weight3': hp.uniform("weight3",0,1)}

In [192]:
# Define the objective function
def hyperopt_objective_weight(params):
    thr = params['thr']
    weight1 = params['weight1']
    weight2 = params['weight2']
    weight3 = params['weight3']
    
    weights_sum = weight1 + weight2 + weight3

    predict_probo_weight = (eval_predict_proba_lr * weight1 + 
                            eval_predict_proba_tree * weight2 + 
                            eval_predict_proba_RF * weight3) / weights_sum

    res_weight = (predict_probo_weight >= thr) * 1

    eval_score = accuracy_score(res_weight, y_train)
    
    return -eval_score

In [198]:
# Define the optimization function
def param_hyperopt_weight(max_evals):
    params_best = fmin(fn = hyperopt_objective_weight,
                       space = params_space,
                       algo = tpe.suggest,
                       max_evals = max_evals, 
                       rstate = np.random.default_rng(17))    
    return params_best

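Note that because we compute directly on the stored validation predictions, the search is far more efficient than VotingClassifier + cross_val_score. Once the search finishes we can inspect the result: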

In [199]:
params_best = param_hyperopt_weight(5000)

100%|███████████████████████████████████████████| 5000/5000 [01:39<00:00, 50.50trial/s, best loss: -0.8199545626656569]


In [200]:
params_best

{'thr': 0.46902429302219506,
 'weight1': 0.03440652839563954,
 'weight2': 0.030945060261316243,
 'weight3': 0.9086713560495386}

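&emsp;&emsp;Although the three weights share the same search range, the search assigns a very large weight to the random forest group, while the logistic regression and decision tree groups receive similar and very small weights. This allocation matches each group's validation performance: the better a group does on validation, the larger its weight. If validation performance is trustworthy, the final fusion result should be good.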
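
Because the fusion divides by the sum of the weights, only their ratio matters. Normalizing the raw weights reported by the search makes each group's share easier to read. This is a small sketch using those reported values.

```python
# Raw weights reported by the search
raw = {"weight1": 0.03440652839563954,
       "weight2": 0.030945060261316243,
       "weight3": 0.9086713560495386}

total = sum(raw.values())
normalized = {name: w / total for name, w in raw.items()}   # shares that sum to 1

for name, share in normalized.items():
    print(name, round(share, 4))
```

Multiplying all three raw weights by the same positive constant would leave every fused prediction unchanged, so the normalized shares carry all the information.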

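&emsp;&emsp;Next we test this weighted fusion on the test set. We can wrap the process in a function. As before, the test prediction is the weighted average over the three model groups, and each group's test output is simply the mean predicted probability of its five models, that is, the test_predict_proba_lr, test_predict_proba_tree, and test_predict_proba_RF created earlier. The test score is computed as follows: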

In [201]:
def test_acc(params_best):
    thr = params_best['thr']
    weight1 = params_best['weight1']
    weight2 = params_best['weight2']
    weight3 = params_best['weight3']

    weights_sum = weight1 + weight2 + weight3

    test_predict_proba = (((test_predict_proba_lr * weight1 + 
                            test_predict_proba_tree * weight2 + 
                            test_predict_proba_RF * weight3) / weights_sum) >= thr) * 1

    print(accuracy_score(test_predict_proba, y_test))

Then compute the test accuracy:

In [202]:
test_acc(params_best)

0.797274275979557


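The fusion result reaches 0.797, the best result we have obtained so far without trimming the hyperparameter search space. It also supports the view that strictly isolating training and validation information clearly helps generalization.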

|Models|eval_train_score|test_score|
|:--:|:--:|:--:|
|Logistic+grid|0.8104|0.7836|
|lr_l|0.8087|<font color="red">**0.7887(↑)**</font>|
|tree+grid|0.7991|0.7683|
|tree_l|0.7913|<font color="red">**0.7745(↑)**</font>|
|RF+grid|0.8104|0.7955|
|RF_l|0.8173|<font color="green">**0.7915(↓)**</font>|
|weight_search_auto|0.8199|<font color="red">**0.7972(↑)**</font>|

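This fusion result is also better than the test performance of every model group that takes part in the fusion (0.7972 versus at most 0.7915), which shows what model fusion can deliver in practice. Moreover, compared with the experience-based approach plus search-space trimming, weighted fusion based on cross-training produces a good result more reliably. We do not need many rounds of trial and error guided by experience. We only need to train each group of models step by step and then run a TPE search with a large number of iterations.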
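
The same objective can be explored without hyperopt. The sketch below swaps TPE for plain random search, which is a simpler technique than the one used in this section, and runs on synthetic out-of-fold probabilities rather than the Telco predictions. All names and data here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(17)

# Synthetic out-of-fold probabilities for three model groups (not the Telco data)
n = 2000
y = rng.integers(0, 2, size=n)
signal = np.where(y == 1, 0.65, 0.35)
oof = np.clip(np.stack([signal + rng.normal(0, s, n) for s in (0.30, 0.40, 0.25)]), 0, 1)

def fused_accuracy(weights, thr):
    """Accuracy of the weighted soft vote at a given threshold."""
    fused = weights @ oof / weights.sum()
    return ((fused >= thr).astype(int) == y).mean()

# Start from equal weights, then keep the best of 500 random draws
best_w, best_thr = np.ones(3), 0.5
best_acc = baseline_acc = fused_accuracy(best_w, best_thr)
for _ in range(500):
    w, thr = rng.uniform(0, 1, 3), rng.uniform(0.4, 0.6)
    acc = fused_accuracy(w, thr)
    if acc > best_acc:
        best_w, best_thr, best_acc = w, thr, acc

print(baseline_acc, best_acc, best_w / best_w.sum(), best_thr)
```

Since each evaluation only reweights stored predictions, thousands of trials cost very little, which is the same property that makes the TPE search fast here.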

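- Testing hard voting

&emsp;&emsp;With a good result from weighted soft voting, we next try weighted hard voting. Converting probabilities into votes along the way discards much useful information, so in the vast majority of cases weighted hard voting is consistently weaker than weighted soft voting. Here we run a quick test of weighted hard voting:

&emsp;&emsp;First we convert the probabilities into votes, for both the validation predictions and the test predictions: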

In [233]:
eval_predict_lr = (eval_predict_proba_lr >= 0.5) * 1
eval_predict_tree = (eval_predict_proba_tree >= 0.5) * 1
eval_predict_RF = (eval_predict_proba_RF >= 0.5) * 1

In [241]:
test_predict_lr = (test_predict_proba_lr >= 0.5) * 1
test_predict_tree = (test_predict_proba_tree >= 0.5) * 1
test_predict_RF = (test_predict_proba_RF >= 0.5) * 1

We then search on the validation votes as before:

In [234]:
# Define the hyperparameter space
params_space = {'thr': hp.uniform("thr", 0.4, 0.6), 
                'weight1': hp.uniform("weight1",0,1),
                'weight2': hp.uniform("weight2",0,1),
                'weight3': hp.uniform("weight3",0,1)}

In [235]:
# Define the objective function
def hyperopt_objective_weight(params):
    thr = params['thr']
    weight1 = params['weight1']
    weight2 = params['weight2']
    weight3 = params['weight3']
    
    weights_sum = weight1 + weight2 + weight3

    predict_probo_weight = (eval_predict_lr * weight1 + 
                            eval_predict_tree * weight2 + 
                            eval_predict_RF * weight3) / weights_sum

    res_weight = (predict_probo_weight >= thr) * 1

    eval_score = accuracy_score(res_weight, y_train)
    
    return -eval_score

In [236]:
# Define the optimization function
def param_hyperopt_weight(max_evals):
    params_best = fmin(fn = hyperopt_objective_weight,
                       space = params_space,
                       algo = tpe.suggest,
                       max_evals = max_evals, 
                       rstate = np.random.default_rng(17))    
    return params_best

In [237]:
params_best = param_hyperopt_weight(5000)

100%|███████████████████████████████████████████| 5000/5000 [01:41<00:00, 49.06trial/s, best loss: -0.8173040514956456]


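After finding the best weights we apply them to the test set. Hard voting scores lower than soft voting on both the validation predictions (0.8173 versus 0.8199) and the test set (0.7915 versus 0.7972). Both hard-voting scores coincide with those of the random forest group alone, which suggests that the search simply ended up following the random forest's votes. For weighted voting fusion, weighted soft voting should therefore be the first choice.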

In [243]:
thr = params_best['thr']
weight1 = params_best['weight1']
weight2 = params_best['weight2']
weight3 = params_best['weight3']

weights_sum = weight1 + weight2 + weight3

test_predict_proba = (((test_predict_lr * weight1 + 
                        test_predict_tree * weight2 + 
                        test_predict_RF * weight3) / weights_sum) >= thr) * 1

print(accuracy_score(test_predict_proba, y_test))

0.7915956842703009
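
A single hypothetical sample shows how hard voting discards information. The probabilities below are made up for illustration, and equal weights are assumed.

```python
probas = [0.49, 0.49, 0.90]   # hypothetical outputs of three model groups for one sample
weights = [1, 1, 1]

# Soft vote: average the probabilities
soft_score = sum(p * w for p, w in zip(probas, weights)) / sum(weights)

# Hard vote: turn each probability into a 0/1 vote first, then average the votes
votes = [int(p >= 0.5) for p in probas]
hard_score = sum(v * w for v, w in zip(votes, weights)) / sum(weights)

print(round(soft_score, 4), int(soft_score >= 0.5))   # fused probability and label
print(round(hard_score, 4), int(hard_score >= 0.5))   # fused vote share and label
```

The two near-miss probabilities of 0.49 count as full "no" votes under hard voting, so the confident 0.90 is outvoted, while the soft vote keeps that confidence and predicts the positive class.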


&emsp;&emsp;This completes the full weight-search fusion workflow based on cross-training. Next we look at how to optimize this workflow further.