# Ensemble Learning

https://blog.csdn.net/weixin_38753213/article/details/119686632  
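Ensemble learning builds multiple learners and combines them with some strategy to complete a learning task; the combined learner often performs significantly better than any single learner.  
It is not a standalone machine learning algorithm; rather, it builds several models on the data and combines them. A weak learner (weak estimator) is defined as a model that performs at least as well as random guessing, that is, for binary classification, any model whose accuracy is not below 50%.  
Depending on how the individual learners are generated, current ensemble methods fall roughly into two groups: sequential methods, in which the individual learners depend strongly on one another and must be generated serially, represented by Boosting; and parallel methods, in which the learners have no strong dependence and can be generated simultaneously, represented by Bagging and Random Forest.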
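
To illustrate why combining weak learners can help, the sketch below computes the accuracy of a majority vote over `n` learners under the idealized assumption that their errors are independent and that each is correct with the same probability `p`. The function name `majority_vote_accuracy` and the value `p = 0.6` are illustrative choices, not part of the original notebook; real learners are correlated, so the gain in practice is smaller.

```python
from math import comb


def majority_vote_accuracy(n_learners: int, p: float) -> float:
    """Probability that a strict majority of n independent learners is correct.

    Assumes binary classification, an odd number of learners, and that every
    learner is correct with the same probability p (an idealization).
    """
    k_min = n_learners // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_learners, k) * p**k * (1 - p) ** (n_learners - k)
        for k in range(k_min, n_learners + 1)
    )


# Accuracy of the vote grows with the ensemble size when p > 0.5
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

With `p` above 0.5 the vote becomes more accurate as `n` grows; with `p` below 0.5 it becomes worse, which is why the weak learners are required to be no worse than chance.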

In [1]:
from IPython.display import Image
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
# Version-check helper (originally added for scikit-learn 0.18 checks).
# distutils was removed in Python 3.12, so packaging.version is used instead.
from packaging.version import Version
from sklearn import __version__ as sklearn_version
from sklearn.metrics import classification_report, confusion_matrix  
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split  
from sklearn.preprocessing import StandardScaler  

In [2]:
iris = datasets.load_iris() 
#http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html
X = iris.data[:, :]
y = iris.target  # species column (class labels)
print('Class labels:', np.unique(y))
#Output:Class labels: [0 1 2]

Class labels: [0 1 2]


In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50)  
scaler = StandardScaler()  
scaler.fit(X_train)
X_train = scaler.transform(X_train)  
X_test = scaler.transform(X_test) 

In [4]:
y_plus = y[y!=0]
X_plus = X[y!=0]
X_plus,y_plus

(array([[7. , 3.2, 4.7, 1.4],
        [6.4, 3.2, 4.5, 1.5],
        [6.9, 3.1, 4.9, 1.5],
        [5.5, 2.3, 4. , 1.3],
        [6.5, 2.8, 4.6, 1.5],
        [5.7, 2.8, 4.5, 1.3],
        [6.3, 3.3, 4.7, 1.6],
        [4.9, 2.4, 3.3, 1. ],
        [6.6, 2.9, 4.6, 1.3],
        [5.2, 2.7, 3.9, 1.4],
        [5. , 2. , 3.5, 1. ],
        [5.9, 3. , 4.2, 1.5],
        [6. , 2.2, 4. , 1. ],
        [6.1, 2.9, 4.7, 1.4],
        [5.6, 2.9, 3.6, 1.3],
        [6.7, 3.1, 4.4, 1.4],
        [5.6, 3. , 4.5, 1.5],
        [5.8, 2.7, 4.1, 1. ],
        [6.2, 2.2, 4.5, 1.5],
        [5.6, 2.5, 3.9, 1.1],
        [5.9, 3.2, 4.8, 1.8],
        [6.1, 2.8, 4. , 1.3],
        [6.3, 2.5, 4.9, 1.5],
        [6.1, 2.8, 4.7, 1.2],
        [6.4, 2.9, 4.3, 1.3],
        [6.6, 3. , 4.4, 1.4],
        [6.8, 2.8, 4.8, 1.4],
        [6.7, 3. , 5. , 1.7],
        [6. , 2.9, 4.5, 1.5],
        [5.7, 2.6, 3.5, 1. ],
        [5.5, 2.4, 3.8, 1.1],
        [5.5, 2.4, 3.7, 1. ],
        [5.8, 2.7, 3.9, 1.2],
        [6

In [5]:
y_plus[y_plus==2] = 0
y_plus

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

## Bagging Classifier
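The Bagging classifier is an ensemble meta-estimator that fits base classifiers, each on a random subset of the original dataset, and then aggregates their individual predictions (by voting or averaging) into a final prediction.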
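
The idea can be sketched by hand: draw bootstrap samples, fit one tree per sample, and take a majority vote. This is a minimal illustration on the Iris data, not how `BaggingClassifier` is implemented internally; the names `X_demo`, `Xtr`, `votes`, and `bagged_pred`, the split, and the seeds are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0, stratify=y_demo
)

rng = np.random.default_rng(0)
n_estimators = 25
votes = np.zeros((n_estimators, len(Xte)), dtype=int)

for i in range(n_estimators):
    # Bootstrap sample: draw len(Xtr) indices with replacement
    idx = rng.integers(0, len(Xtr), size=len(Xtr))
    tree = DecisionTreeClassifier(random_state=i).fit(Xtr[idx], ytr[idx])
    votes[i] = tree.predict(Xte)

# Majority vote over the trees for every test sample
n_classes = len(np.unique(y_demo))
bagged_pred = np.array(
    [np.bincount(votes[:, j], minlength=n_classes).argmax() for j in range(len(Xte))]
)
bagged_acc = (bagged_pred == yte).mean()
print('Hand-rolled bagging accuracy: {:.2f}'.format(bagged_acc))
```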


In [6]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
dt = DecisionTreeClassifier(random_state=1)
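# The base estimator is passed positionally because the keyword was renamed
# from `base_estimator` to `estimator` in scikit-learn 1.2
bc = BaggingClassifier(dt, 
        n_estimators=50, random_state=1)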

In [7]:
from sklearn.metrics import accuracy_score
bc.fit(X_train, y_train)
y_pred = bc.predict(X_test)
# Evaluate model accuracy (accuracy_score expects y_true first, then y_pred)
acc_test = accuracy_score(y_test, y_pred)
print('Test set accuracy of bc: {:.2f}'.format(acc_test))

Test set accuracy of bc: 0.95


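## Random Forest
Random forest uses decision trees as the weak classifiers and, on top of bagging's random sampling of training samples, adds random selection of features.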
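
In scikit-learn the two sources of randomness correspond to the `bootstrap` and `max_features` parameters of `RandomForestClassifier`; `max_features` limits how many features are considered at each split. The sketch below makes both explicit and prints the resulting feature importances. The variable names and the choice `max_features="sqrt"` are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris_demo = load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

rf_demo = RandomForestClassifier(
    n_estimators=30,
    bootstrap=True,        # sample rows with replacement for each tree
    max_features="sqrt",   # consider sqrt(n_features) candidates at each split
    random_state=0,
)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, normalized to sum to 1
for name, importance in zip(iris_demo.feature_names, rf_demo.feature_importances_):
    print('{}: {:.3f}'.format(name, importance))
```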

In [8]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=30)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Accuracy Score: ', 
      accuracy_score(y_test, y_pred))
# normalize=False returns the number of correctly classified samples
print('Correctly classified samples: ',
      accuracy_score(y_test, y_pred, normalize=False))

Accuracy Score:  0.9333333333333333
Correctly classified samples:  70


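### Additive Model
A generalized additive model (GAM) is a way to build non-monotonic response models within the framework of linear or logistic regression (or any other generalized linear model).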
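
In the usual textbook notation (the symbols here are assumptions, not taken from the notebook), a GAM replaces the linear terms of a generalized linear model with smooth functions of each feature:

$$
g\big(\mathbb{E}[y \mid x]\big) = \beta_0 + \sum_{j=1}^{p} f_j(x_j)
$$

Here $g$ is the link function, $\beta_0$ is the intercept, and each $f_j$ is a smooth function of a single feature $x_j$. For a logistic GAM, $g$ is the logit, $g(\mu) = \log\frac{\mu}{1-\mu}$; choosing $f_j(x_j) = \beta_j x_j$ recovers ordinary logistic regression.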

In [9]:
from pygam import LogisticGAM
 
# Train the model with default parameters
gam = LogisticGAM().fit(X_plus, y_plus)
print(gam.accuracy(X_plus, y_plus))
gam.summary()

0.98
LogisticGAM                                                                                               
Distribution:                      BinomialDist Effective DoF:                                      8.4684
Link Function:                        LogitLink Log Likelihood:                                    -3.5372
Number of Samples:                          100 AIC:                                               24.0111
                                                AICc:                                              26.2252
                                                UBRE:                                               2.3079
                                                Scale:                                                 1.0
                                                Pseudo R-Squared:                                    0.949
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code   
s(0)                            

### AdaBoost
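The AdaBoost iteration consists of three steps:

1. Initialize the weights of the training data. If there are N samples, every training sample starts with the same weight, 1/N.
2. Train the weak classifiers. During training, if a sample is classified correctly, its weight is decreased when the next training set is constructed; if it is misclassified, its weight is increased. The reweighted sample set is then used to train the next classifier, and training proceeds iteratively in this way.
3. Combine the trained weak classifiers into a strong classifier. Once training is finished, weak classifiers with a small classification error rate receive a larger weight, so they have more influence on the final decision function, while those with a large error rate receive a smaller weight and have less influence.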
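
One round of these steps can be traced numerically. The sketch below uses the textbook binary AdaBoost update with labels in {-1, +1}; it is not necessarily what `AdaBoostClassifier` does internally, and the arrays `y_true` and `h_pred` are made-up data standing in for a weak classifier that gets 2 of 10 samples wrong.

```python
import numpy as np

# Hypothetical labels and weak-classifier predictions (2 mistakes out of 10)
y_true = np.array([1, 1, 1, -1, -1, -1, 1, -1, 1, -1])
h_pred = np.array([1, 1, -1, -1, -1, 1, 1, -1, 1, -1])

n = len(y_true)
w = np.full(n, 1.0 / n)                  # step 1: uniform weights 1/N

miss = h_pred != y_true
err = w[miss].sum()                      # weighted error rate of the weak classifier
alpha = 0.5 * np.log((1 - err) / err)    # step 3: classifier weight, larger for smaller error

# Step 2: raise the weights of misclassified samples, lower the others, renormalize
w_new = w * np.exp(-alpha * y_true * h_pred)
w_new /= w_new.sum()

print('weighted error:', err)
print('classifier weight alpha:', alpha)
print('new weight of a misclassified sample:', w_new[miss][0])
print('new weight of a correctly classified sample:', w_new[~miss][0])
```

After the update the misclassified samples carry half of the total weight, so the next weak classifier is pushed to focus on them.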

In [10]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix
 
model = AdaBoostClassifier(n_estimators=100)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.50, random_state=None)
sss.get_n_splits(X_plus, y_plus)

cm_sum = np.zeros((2,2))

for train_index, test_index in sss.split(X_plus, y_plus):
    X_train, X_test = X_plus[train_index], X_plus[test_index]
    y_train, y_test = y_plus[train_index], y_plus[test_index]
#     print(y_test)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
#     print(y_pred)
#     print(len(y_pred),sum(y_pred))
    cm = confusion_matrix(y_test, y_pred)
    cm_sum = cm_sum + cm
    
print('\nAda Boost Algorithms')
print('\nConfusion Matrix')
print('_'*20)
print('     Predicted')
print('     pos neg')
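# sklearn's confusion_matrix has true labels on rows and predicted labels on
# columns, ordered [0, 1]: cm[1,1]=TP, cm[1,0]=FN, cm[0,1]=FP, cm[0,0]=TN
print('pos: %i %i' % (cm_sum[1,1], cm_sum[1,0]))
print('neg: %i %i' % (cm_sum[0,1], cm_sum[0,0]))


Ada Boost Algorithms

Confusion Matrix
____________________
     Predicted
     pos neg
pos: 116 9
neg: 19 106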


In [11]:
from sklearn.metrics import accuracy_score
 
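# y_test and y_pred here come from the last split of the loop above
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
# normalize=False returns the number of correctly classified samples
print('Correctly classified samples: ',accuracy_score(y_test, y_pred, normalize=False))

Accuracy Score:  0.88
Correctly classified samples:  44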


### GBDT
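Gradient boosting is one of the most powerful techniques for building predictive models and is a representative algorithm of the boosting family of ensemble methods.  
Boosting trees use an additive model together with the forward stagewise algorithm to optimize the learning objective. When the loss function is the squared error loss or the exponential loss, each optimization step is simple. For general loss functions, however, each step is often not so easy. To address this, Friedman proposed the gradient boosting algorithm.  
Gradient Boosting is a large class of Boosting algorithms. Its idea is borrowed from gradient descent: a new weak learner is trained on the negative gradient of the current model's loss function, and the trained weak learner is then added to the existing model.  
Gradient Boosting with decision trees as the weak learners is called GBDT, sometimes also MART (Multiple Additive Regression Trees). The decision trees used in GBDT are usually CART trees.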
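
For the squared error loss the negative gradient is simply the residual, so the procedure reduces to repeatedly fitting a small regression tree to the current residuals and adding it to the model. The sketch below shows this on synthetic data; it is a simplified illustration, not the implementation behind `GradientBoostingClassifier`, and the toy data, `learning_rate`, `n_rounds`, and tree depth are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem: noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

F = np.full_like(y_toy, y_toy.mean())   # initial model: a constant prediction
trees, mse_history = [], []

for _ in range(n_rounds):
    # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual
    residual = y_toy - F
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residual)
    # Additive update: shrink the new tree and add it to the current model
    F = F + learning_rate * tree.predict(X_toy)
    trees.append(tree)
    mse_history.append(np.mean((y_toy - F) ** 2))

print('training MSE after round 1: {:.4f}'.format(mse_history[0]))
print('training MSE after round {}: {:.4f}'.format(n_rounds, mse_history[-1]))
```

For other losses only the `residual` line changes: it is replaced by the negative gradient of that loss evaluated at the current predictions.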

In [12]:
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=100)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.50, random_state=None)
sss.get_n_splits(X_plus, y_plus)

cm_sum = np.zeros((2,2))

for train_index, test_index in sss.split(X_plus, y_plus):
    X_train, X_test = X_plus[train_index], X_plus[test_index]
    y_train, y_test = y_plus[train_index], y_plus[test_index]
#     print(y_test)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
#     print(y_pred)
#     print(len(y_pred),sum(y_pred))
    cm = confusion_matrix(y_test, y_pred)
    cm_sum = cm_sum + cm

print('\nGradient Boosting Algorithms')
print('\nConfusion Matrix')
print('_'*20)
print('     Predicted')
print('     pos neg')
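# sklearn's confusion_matrix has true labels on rows and predicted labels on
# columns, ordered [0, 1]: cm[1,1]=TP, cm[1,0]=FN, cm[0,1]=FP, cm[0,0]=TN
print('pos: %i %i' % (cm_sum[1,1], cm_sum[1,0]))
print('neg: %i %i' % (cm_sum[0,1], cm_sum[0,0]))


Gradient Boosting Algorithms

Confusion Matrix
____________________
     Predicted
     pos neg
pos: 113 12
neg: 11 114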


### XGBoost

In [13]:
from xgboost import XGBClassifier
# XGBoost classifier
xgb = XGBClassifier(max_depth=5, learning_rate=0.01, n_estimators=2000, colsample_bytree=0.1)
X_train, X_test, y_train, y_test = train_test_split(X_plus, y_plus, test_size=0.25)
xgb.fit(X_train,y_train)
y_pred = xgb.predict(X_test)

In [14]:
from sklearn.metrics import mean_squared_error
from sklearn import metrics
 
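# For 0/1 labels the RMSE equals the square root of the misclassification rate
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)
# XGBClassifier.score returns classification accuracy, not a regression score
print('XGBoost accuracy:', xgb.score(X_test, y_test))

The rmse of prediction is: 0.282842712474619
XGBoost accuracy: 0.92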


### LightGBM

In [None]:
import lightgbm as lgb  # needed here: the original import only appears in the next cell

params = {
    'learning_rate': 0.1,
    'lambda_l1': 0.1,
    'lambda_l2': 0.2,
    'max_depth': 4,
    'objective': 'binary',  # objective function
}
 
# Convert to the LightGBM Dataset format
train_data = lgb.Dataset(X_train, label=y_train)
validation_data = lgb.Dataset(X_test, label=y_test)
# Train the model
gbm = lgb.train(params, train_data, valid_sets=[validation_data])

In [16]:
# Install the LightGBM package
# pip install lightgbm
import lightgbm as lgb
from lightgbm import log_evaluation, early_stopping
from sklearn import metrics
X_train, X_test, y_train, y_test = train_test_split(X_plus, y_plus, test_size = 0.2, random_state = 0)
model = lgb.LGBMClassifier(num_leaves=31,
                        learning_rate=0.05,
                        n_estimators=20)

callbacks = [log_evaluation(period=100), early_stopping(stopping_rounds=30)]
model.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='l1',
        callbacks=callbacks)
 
y_pred = model.predict(X_test, num_iteration=model.best_iteration_)
from sklearn.metrics import mean_squared_error
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)
print('LightGBM Score:', model.score(X_test, y_test))

[LightGBM] [Info] Number of positive: 40, number of negative: 40
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000492 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 66
[LightGBM] [Info] Number of data points in the train set: 80, number of used features: 4
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
Training until validation scores don't improve for 30 rounds
Did not meet early stopping. Best iteration is:
[20]	valid_0's l1: 0.317309	valid_0's binary_logloss: 0.476428
The rmse of prediction is: 0.4472135954999579
LightGBM Score: 0.8


### CatBoost

In [17]:
import catboost as cb
from catboost import CatBoostClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split
# Hyperparameter tuning: use grid search to find the best parameters
# from sklearn.model_selection import GridSearchCV
# params = {'depth': [4, 7, 10],
#          'learning_rate': [0.03, 0.1, 0.15],
#          'l2_leaf_reg': [1, 4, 9],
#          'iterations': [300, 500]}
#cb = cb.CatBoostClassifier()
#cb_model = GridSearchCV(cb, params, scoring="roc_auc", cv=3)
#cb_model.fit(train, y_train)
# Show the best score
# print(cb_model.best_score_)  
# Show the best parameters
# print(cb_model.best_params_) 
 
X_train, X_test, y_train, y_test = train_test_split(X_plus, y_plus, test_size = 0.2, random_state = 0)
 
cb = CatBoostClassifier(iterations=5, learning_rate=0.1)
cb.fit(X_train, y_train)
# Code for the categorical-features option
# cat_features_index = [0, 1, 2]
#clf.fit(train, y_train, cat_features=cat_features_index)
y_pred = cb.predict(X_test)
print('Model is fitted: ' + str(cb.is_fitted()))
print('Model params:')
print(cb.get_params())
print('CatBoost Score:', cb.score(X_test, y_test))

0:	learn: 0.6447531	total: 142ms	remaining: 567ms
1:	learn: 0.5979172	total: 143ms	remaining: 214ms
2:	learn: 0.5607884	total: 143ms	remaining: 95.6ms
3:	learn: 0.5254190	total: 144ms	remaining: 36ms
4:	learn: 0.4928742	total: 144ms	remaining: 0us
Model is fitted: True
Model params:
{'iterations': 5, 'learning_rate': 0.1}
CatBoost Score: 0.85


PU learning