## 集成学习：对比与调参
虽然现在深度学习大行其道，但以`XGBoost`、`LightGBM`、`CatBoost`为代表的`Boosting`算法仍有其广泛的用武之地。抛开深度学习适用的文本、图像、语音和视频等非结构化数据应用，对于训练样本较少的结构化数据领域，`Boosting`算法仍然是第一选择。

### 1 三大Boosting算法对比
`XGBoost`、`LightGBM`和`CatBoost`都是目前经典的`SOTA`(`state of the art`)Boosting算法，这三个模型都是以决策树为支撑的集成学习框架，其中XGBoost是对原始版本GBDT算法的改进，而`LightGBM`和`CatBoost`在`XGBoost`基础上做了进一步的优化，在精度和速度上各有所长。

三大Boosting算法主要有两个方面的区别：

第一，模型树的构造方式有所不同，`XGBoost`使用按层生长(`level-wise`)的决策树构建策略，LightGBM则使用按叶子生长(leaf-wise)的构建策略，而CatBoost使用对称树结构(oblivious-tree)，其决策树都是完全二叉树。

第二，对于类别特征的处理有较大区别，`XGBoost`本身不具备自动处理类别特征的能力，对于数据中的类别特征，需要我们手动处理变换成数值后才能输入到模型中；`LightGBM`中则需要指定类别特征名称，算法会自动对齐进行处理；`CatBoost`以处理类别特征而闻名，通过目标变量统计等特征编码方式也能实现高效处理类别特征。

#### 1.1 数据预处理
下面以 `Kaggle` 2015年的`flights`数据集为例，分别用`XGBoost`、`LightGBM`和`CatBoost`模型进行实验：

该数据集共有500多万条航班记录数据，特征有31个。我们采取抽样的方式从原始数据集中抽取1%的数据，并筛选11个特征用作演示，经过预处理后重新构建训练集，目标是构建对航班是否延误的二分类模型。

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
flights = pd.read_csv('flights.csv')
flights = flights.sample(frac=0.01, random_state=10) # 数据集抽样1%
flights

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,...,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
411984,2015,1,28,3,WN,103,N7728D,DCA,MKE,705,...,811.0,1.0,0,0,,,,,,
3591965,2015,8,11,2,B6,153,N592JB,JFK,PBI,1859,...,345.0,337.0,0,0,,0.0,0.0,82.0,255.0,0.0
526451,2015,2,4,3,DL,1187,N921DN,MSP,DCA,1735,...,2043.0,-19.0,0,0,,,,,,
1336011,2015,3,27,5,WN,171,N407WN,DEN,RDU,1815,...,2313.0,-7.0,0,0,,,,,,
3424502,2015,8,1,6,WN,4330,N7751A,ATL,RIC,2125,...,2318.0,13.0,0,0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4837220,2015,10,29,4,EV,4727,N14907,11775,13930,1444,...,1625.0,0.0,0,0,,,,,,
3829515,2015,8,26,3,B6,886,N568JB,RDU,JFK,1010,...,1127.0,-21.0,0,0,,,,,,
49110,2015,1,4,7,US,2024,N194UW,SEA,PHX,825,...,1147.0,-23.0,0,0,,,,,,
4936756,2015,11,5,4,WN,1700,N905WN,BUF,MDW,545,...,610.0,-30.0,0,0,,,,,,


In [2]:
flights = flights[["MONTH","DAY","DAY_OF_WEEK","AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT",
                 "ORIGIN_AIRPORT","AIR_TIME", "DEPARTURE_TIME","DISTANCE","ARRIVAL_DELAY"]] # 选11个特征
flights= flights.reset_index(drop=True)
flights

Unnamed: 0,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,DESTINATION_AIRPORT,ORIGIN_AIRPORT,AIR_TIME,DEPARTURE_TIME,DISTANCE,ARRIVAL_DELAY
0,1,28,3,WN,103,MKE,DCA,102.0,713.0,634,1.0
1,8,11,2,B6,153,PBI,JFK,134.0,111.0,1028,337.0
2,2,4,3,DL,1187,DCA,MSP,111.0,1734.0,931,-19.0
3,3,27,5,WN,171,RDU,DEN,173.0,1807.0,1436,-7.0
4,8,1,6,WN,4330,RIC,ATL,63.0,2151.0,481,13.0
...,...,...,...,...,...,...,...,...,...,...,...
58186,10,29,4,EV,4727,13930,11775,81.0,1438.0,462,0.0
58187,8,26,3,B6,886,JFK,RDU,61.0,1005.0,427,-21.0
58188,1,4,7,US,2024,PHX,SEA,131.0,825.0,1107,-23.0
58189,11,5,4,WN,1700,MDW,BUF,75.0,541.0,468,-30.0


In [3]:
flights["ARRIVAL_DELAY"] = (flights["ARRIVAL_DELAY"]>10)*1 # 延误超过10分钟看作是延误，bool类型转换为int类型
flights["ARRIVAL_DELAY"].unique()

array([0, 1], dtype=int32)

In [4]:
cols = ["AIRLINE","FLIGHT_NUMBER","DESTINATION_AIRPORT","ORIGIN_AIRPORT"] # 类别特征
for col in cols:
    print(len(flights[col].unique()))

14
6163
633
644


In [5]:
for item in cols:
    flights[item] = flights[item].astype("category").cat.codes

X_train, X_test, y_train, y_test = train_test_split(flights.drop(["ARRIVAL_DELAY"], axis=1), flights["ARRIVAL_DELAY"], random_state=10, test_size=0.3) # 划分数据集

#### 1.2 XGBoost在flights数据集上的测试

In [6]:
from sklearn.metrics import roc_auc_score
import xgboost as xgb
import time

# 设置模型参数
params = {
    'booster': 'gbtree', # 基于树
    'objective': 'binary:logistic',   
    'gamma': 0.1, # 剪枝中用到的最小损失下降值
    'max_depth': 8,
    'lambda': 2,
    'subsample': 0.7, # 表示用于训练的样本比例
    'colsample_bytree': 0.7, # 表示用于训练的特征比例
    'min_child_weight': 3, # 一个叶子节点的最小权重
    'eta': 0.001,# 学习速率
    'seed': 1000,
    'nthread': 4, # 线程数量，用于并行计算
}

# 训练
t0 = time.time()
num_rounds = 500 # 表示训练轮数，即树的个数
dtrain = xgb.DMatrix(X_train, y_train)
model_xgb = xgb.train(params, dtrain, num_rounds)
print('training spend {} seconds'.format(time.time()-t0))

# 测试
t1 = time.time()
dtest = xgb.DMatrix(X_test)
y_pred = model_xgb.predict(dtest)
print('testing spend {} seconds'.format(time.time()-t1))
y_pred_train = model_xgb.predict(dtrain)
print(f"训练集auc：{roc_auc_score(y_train, y_pred_train)}")
print(f"测试集auc：{roc_auc_score(y_test, y_pred)}")

training spend 6.46193265914917 seconds
testing spend 0.057062625885009766 seconds
训练集auc：0.752049936354563
测试集auc：0.6965194979943091


#### 1.3 LightGBM在flights数据集上的测试

In [7]:
import lightgbm as lgb
d_train = lgb.Dataset(X_train, label=y_train)

# 设置模型参数
params = {
    "max_depth": 5, 
    "learning_rate" : 0.05, 
    "num_leaves": 500,  
    "n_estimators": 300
}

cate_features_name = ["MONTH","DAY","DAY_OF_WEEK","AIRLINE","DESTINATION_AIRPORT", "ORIGIN_AIRPORT"] # 类别特征

t0 = time.time()
model_lgb = lgb.train(params, d_train, categorical_feature = cate_features_name)
print('training spend {} seconds'.format(time.time()-t0))
t1 = time.time()

y_pred = model_lgb.predict(X_test)
print('testing spend {} seconds'.format(time.time()-t1))
y_pred_train = model_lgb.predict(X_train)
print(f"训练集auc：{roc_auc_score(y_train, y_pred_train)}")
print(f"测试集auc：{roc_auc_score(y_test, y_pred)}")

New categorical_feature is ['AIRLINE', 'DAY', 'DAY_OF_WEEK', 'DESTINATION_AIRPORT', 'MONTH', 'ORIGIN_AIRPORT']


You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1862
[LightGBM] [Info] Number of data points in the train set: 40733, number of used features: 10
[LightGBM] [Info] Start training from score 0.211327
training spend 0.43550848960876465 seconds
testing spend 0.0230252742767334 seconds
训练集auc：0.8867447004324996
测试集auc：0.7033506245405025


#### 1.2 CatBoost在flights数据集上的测试

In [8]:
import catboost as cb
cat_features_index = [0,1,2,3,4,5,6]

t0 = time.time()
model_cb = cb.CatBoostClassifier(eval_metric="AUC", one_hot_max_size=50, depth=6, iterations=300, l2_leaf_reg=1, learning_rate=0.1)
model_cb.fit(X_train,y_train, cat_features= cat_features_index)
print('training spend {} seconds'.format(time.time()-t0))

t1 = time.time()
y_pred = model_cb.predict(X_test)
print('testing spend {} seconds'.format(time.time()-t1))
y_pred_train = model_cb.predict(X_train)
print(f"训练集auc：{roc_auc_score(y_train, y_pred_train)}")
print(f"测试集auc：{roc_auc_score(y_test, y_pred)}")

0:	total: 198ms	remaining: 59.1s
1:	total: 254ms	remaining: 37.8s
2:	total: 303ms	remaining: 30s
3:	total: 353ms	remaining: 26.1s
4:	total: 403ms	remaining: 23.8s
5:	total: 454ms	remaining: 22.2s
6:	total: 505ms	remaining: 21.1s
7:	total: 551ms	remaining: 20.1s
8:	total: 602ms	remaining: 19.5s
9:	total: 657ms	remaining: 19s
10:	total: 719ms	remaining: 18.9s
11:	total: 772ms	remaining: 18.5s
12:	total: 821ms	remaining: 18.1s
13:	total: 875ms	remaining: 17.9s
14:	total: 925ms	remaining: 17.6s
15:	total: 979ms	remaining: 17.4s
16:	total: 1.03s	remaining: 17.1s
17:	total: 1.08s	remaining: 16.9s
18:	total: 1.11s	remaining: 16.5s
19:	total: 1.16s	remaining: 16.2s
20:	total: 1.21s	remaining: 16.1s
21:	total: 1.26s	remaining: 16s
22:	total: 1.32s	remaining: 15.9s
23:	total: 1.38s	remaining: 15.9s
24:	total: 1.44s	remaining: 15.8s
25:	total: 1.49s	remaining: 15.7s
26:	total: 1.54s	remaining: 15.6s
27:	total: 1.6s	remaining: 15.5s
28:	total: 1.65s	remaining: 15.4s
29:	total: 1.7s	remaining: 15.3

由上面的实验结果可以看出在没有做进一步数据特征工程和超参数调优的情况下，在该数据集上，`LightGBM`无论是精度上还是速度上都要优于`XGBoost`和`CatBoost`，`CatBoost`的表现最差。

### 2 常用超参数调优方法
我们将不经过模型训练得到的参数叫做`超参数`(`hyperparameter`)，机器学习中常用的调参方法包括`网格搜索法`(`grid search`)、`随机搜索法`(`random search`)和`贝叶斯优化`(`bayesian optimization`)。

#### 2.1 网格搜索法
网格搜索法是一种常用的超参数调优方法，常用于优化三个或者更少数量的超参数，本质上是一种穷举法。对于每个超参数，使用者选择一个较小的有限集去搜索，然后将这些超参数经过笛卡尔乘积得到若干组超参数。网格搜索使用每组超参数训练模型，挑选验证集误差最小的超参数作为最优化超参数。

`sklearn`通过`model_selection`模块下的`GridSearchCV`来实现网格搜索调参，并且这个调参过程是加了交叉验证的。下面展示XGBoost的网格搜索示例：

In [9]:
# 基于XGBoost的GridSearch搜索范例
from sklearn.model_selection import GridSearchCV

model = xgb.XGBClassifier()

# 待搜索的参数列表实例
params_lst = {
    'max_depth': [3,5,7], 
    'min_child_weight': [1,3,6], 
    'n_estimators': [100,200,300],
    'learning_rate': [0.01, 0.05, 0.1]
}

# verbose：表示日志输出的详细程度
# n_jobs：表示并行计算的数量，即同时运行的任务数，-1 表示使用所有可用的 CPU 进行并行计算
t0 = time.time()
grid_search = GridSearchCV(model, param_grid=params_lst, cv=3, verbose=10, n_jobs=-1)
grid_search.fit(X_train, y_train)
print('gridsearch for xgb spend', time.time()-t0, 'seconds.')
print(grid_search.best_params_)

Fitting 3 folds for each of 81 candidates, totalling 243 fits
gridsearch for xgb spend 67.24716424942017 seconds.
{'learning_rate': 0.1, 'max_depth': 5, 'min_child_weight': 6, 'n_estimators': 300}


#### 2.2 随机搜索
随即搜索是在指定超参数范围内或者分布上随即搜寻最优超参数。相较于网格搜索方法，给定超参数分布，并不是所有超参数都要尝试，而是会从给定分布中抽样固定数量的参数，实际仅对这些抽样到的超参数进行实验。

`sklearn`通过model_selection模块下的`RandomizedSearchCV`方法进行随即搜索。

In [10]:
# 基于XGBoost的RandomizedSearch搜索范例
from sklearn.model_selection import RandomizedSearchCV # 通过 n_iter 进行手动设置或是自动根据参数空间大小确定采样次数

model = xgb.XGBClassifier()

# 待搜索的参数列表实例
params_lst = {
    'max_depth': [3,5,7], 
    'min_child_weight': [1,3,6], 
    'n_estimators': [100,200,300],
    'learning_rate': [0.01, 0.05, 0.1]
}

t0 = time.time()
random_search = RandomizedSearchCV(model, params_lst, random_state=0)
random_search.fit(X_train, y_train)
print(random_search.best_params_)
print('randomsearch for xgb spend', time.time()-t0, 'seconds.')

{'n_estimators': 300, 'min_child_weight': 6, 'max_depth': 5, 'learning_rate': 0.1}
randomsearch for xgb spend 60.41917014122009 seconds.


#### 2.3 贝叶斯调参
贝叶斯调参（Bayesian Optimization）是一种基于贝叶斯定理的超参数优化方法，主要用于优化黑盒函数。相对于传统的网格搜索或随机搜索，贝叶斯调参基于样本观测的不确定性建立模型，通过贝叶斯公式计算后验分布，从中选择期望相对较高的超参数进行优化。

贝叶斯调参的关键思想是利用样本观测不断地更新模型的先验（前提假设），并在已知前提基础上预测下一个实验的结果。通过预测的结果，更新模型的后验分布，即已知样本后，超参数分布的概率密度函数。在超参数概率密度函数的统计学指标（比如期望和方差）的基础上，生成下一组超参数值，直到达到最优结果为止。在整个过程中，下一组待测试的超参数都是在当前已有数据的基础上，反向计算已有样本的超参数概率分布，并根据超参数的期望值（或模式）来进行搜索，在一定程度上使搜索更加“聪明”和高效。

In [59]:
# pip install bayesian-optimization

# 定义相关参数
num_rounds = 3000

params = {
    'eta': 0.1,
    'silent': 1,
    'eval_metric': 'auc',
    'verbose_eval': True,
    'seed': 2023
}

# 定义目标优化函数
def xgb_evaluate(min_child_weight, colsample_bytree, max_depth, subsample, gamma,alpha):

    params['min_child_weight'] = int(min_child_weight)
    params['cosample_bytree'] = max(min(colsample_bytree, 1), 0)
    params['max_depth'] = int(max_depth)
    params['subsample'] = max(min(subsample, 1), 0)
    params['gamma'] = max(gamma, 0)
    params['alpha'] = max(alpha, 0)

    cv_result = xgb.cv(params, dtrain, num_boost_round=num_rounds, nfold=5, seed=2023, callbacks=[xgb.callback.EarlyStopping(rounds=50)])

    return cv_result['test-auc-mean'].values[-1]

In [63]:
# pip install bayesian-optimization
from bayes_opt import BayesianOptimization
#-*-coding:utf-8-*-

num_iter = 25
init_points = 5

t0 = time.time()
xgbBO = BayesianOptimization(xgb_evaluate, {
    'min_child_weight': (1, 20),
    'colsample_bytree': (0.1, 1),
    'max_depth': (5, 15),
    'subsample': (0.5, 1),
    'gamma': (0, 10),
    'alpha': (0, 10),
})
xgbBO.maximize(init_points=init_points, n_iter=num_iter)
print('bayesianSearch for xgb spend', time.time()-t0, 'seconds.')

|   iter    |  target   |   alpha   | colsam... |   gamma   | max_depth | min_ch... | subsample |
-------------------------------------------------------------------------------------------------
Parameters: { "cosample_bytree", "silent", "verbose_eval" } are not used.

Parameters: { "cosample_bytree", "silent", "verbose_eval" } are not used.

Parameters: { "cosample_bytree", "silent", "verbose_eval" } are not used.

Parameters: { "cosample_bytree", "silent", "verbose_eval" } are not used.

Parameters: { "cosample_bytree", "silent", "verbose_eval" } are not used.

| [0m1        [0m | [0m0.7193   [0m | [0m7.155    [0m | [0m0.5762   [0m | [0m0.9578   [0m | [0m9.488    [0m | [0m13.18    [0m | [0m0.9343   [0m |
Parameters: { "cosample_bytree", "silent", "verbose_eval" } are not used.

Parameters: { "cosample_bytree", "silent", "verbose_eval" } are not used.

Parameters: { "cosample_bytree", "silent", "verbose_eval" } are not used.

Parameters: { "cosample_bytree", "silent",