<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=default"></script>

# Model Ensemble Methods
This section covers three methods: bagging, boosting, and random forests.

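## Introduction
All three methods share one basic idea: combine several base models into a single composite classifier. What is the benefit? Suppose we have 5 classifiers that vote by majority. The composite classifier is wrong only when three or more of them (a majority) are wrong at the same time. When the base classifiers are trained independently of one another, as in bagging and random forests, each one can also be assigned to its own thread and trained in parallel. Boosting trains its classifiers one after another, so this does not apply to it in the same way.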

![](images/vote.png)
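The voting argument above can be checked with a little arithmetic. The sketch below is only an illustration under strong assumptions that the text does not state: a two-class problem, base classifiers whose errors are independent, and a hypothetical per-classifier error rate of 0.3. The function name `majority_vote_error` is made up for this example.

```python
from math import comb


def majority_vote_error(n_classifiers: int, base_error: float) -> float:
    """Probability that a majority of independent classifiers are wrong."""
    majority = n_classifiers // 2 + 1  # 3 out of 5
    return sum(
        comb(n_classifiers, k)
        * base_error**k
        * (1 - base_error) ** (n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )


# Hypothetical base error rate of 0.3 for each of the 5 classifiers.
ensemble_error = majority_vote_error(5, 0.3)
print(f"{ensemble_error:.5f}")  # 0.16308
```

Under these assumptions the composite error (about 0.163) is well below the base error of 0.3. Real base classifiers are usually correlated, so the gain in practice tends to be smaller.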

Why does this work?

The left plot shows a base model; the right plot shows the ensemble model.
![](images/ensmble_model.png)


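## Bagging

Suppose we have a training set and fit a single model to it, but cannot be sure that this model will perform well. Bagging draws m bootstrap samples from the training set (sampling with replacement) and trains one model on each. The m models then vote on the output for each sample, which usually gives a more stable and more accurate prediction than a single model.

The algorithm framework is as follows: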
![](images/Bagging.png)
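The procedure described above can be sketched by hand. This is a minimal illustration rather than a reference implementation: the choices `m = 10`, `RidgeClassifier` as the base model, and ties broken in favor of the lowest class label are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

rng = np.random.default_rng(0)
m = 10  # number of bootstrap samples / base models (assumed value)
models = []
for _ in range(m):
    # Bootstrap sample: draw len(X_train) row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    models.append(RidgeClassifier().fit(X_train[idx], y_train[idx]))

# votes[i, j] is the label that model i predicts for validation sample j.
votes = np.stack([model.predict(X_val) for model in models])

# Count the votes per class. The wine labels are 0, 1, 2, so the row index
# of vote_counts is also the class label; argmax breaks ties toward the
# lowest label.
n_classes = len(np.unique(y_train))
vote_counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
y_pred = vote_counts.argmax(axis=0)

bagging_accuracy = (y_pred == y_val).mean()
print(bagging_accuracy)
```

Each model in the loop is fitted independently of the others, which is why this step can be parallelized.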

Bagging with scikit-learn's `BaggingClassifier` on the wine dataset:

```python
from sklearn.model_selection import train_test_split
from sklearn import metrics, ensemble
from sklearn import datasets
from sklearn import linear_model

# Load the wine dataset and split it into features and labels.
data = datasets.load_wine()
label = data['target']
train = data['data']
del data
```

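```python
# Hold out 20% of the data for validation.
X_train, X_val, Y_train, Y_val = train_test_split(
    train, label, test_size=0.2, random_state=0, shuffle=True
)

# The base model is passed positionally: the keyword was renamed from
# `base_estimator` to `estimator` in newer scikit-learn releases.
bagging_model = ensemble.BaggingClassifier(
    linear_model.RidgeClassifier(), n_estimators=10, random_state=0
)
bagging_model.fit(X_train, Y_train)

# Macro-averaged precision on the validation set.
print(metrics.precision_score(Y_val, bagging_model.predict(X_val), average='macro'))
```

Output:

```
1.0
```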


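## Boosting and AdaBoost

Given a dataset $D$ of $d$ training tuples, AdaBoost first assigns every tuple the same weight $1/d$. To build $k$ classifiers, it runs $k$ rounds. After each round, the weights of misclassified tuples are increased and the weights of correctly classified tuples are decreased. The main idea is that the classifier built in the next round should pay more attention to the tuples misclassified so far, which raises accuracy on the "hard to classify" samples. The result is a series of classifiers that complement one another.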

![](images/adaboost.png)
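One round of the reweighting described above can be sketched as follows. This follows one common textbook formulation (weights of correctly classified tuples are multiplied by error / (1 - error), then all weights are renormalized); it is an assumption that the figure uses the same variant. The helper name `update_weights` and the toy misclassification mask are hypothetical.

```python
import numpy as np


def update_weights(weights, misclassified):
    """One AdaBoost-style reweighting round (textbook variant, assumed)."""
    # Weighted error: total weight of the misclassified tuples.
    error = weights[misclassified].sum()
    if not 0 < error < 0.5:
        raise ValueError("this sketch expects a weighted error in (0, 0.5)")
    new_weights = weights.copy()
    # Shrink the weights of correctly classified tuples ...
    new_weights[~misclassified] *= error / (1 - error)
    # ... and renormalize, which raises the misclassified ones in relative terms.
    new_weights /= new_weights.sum()
    # Vote weight of this round's classifier in the final ensemble.
    vote_weight = np.log((1 - error) / error)
    return new_weights, vote_weight


d = 10
weights = np.full(d, 1 / d)            # every tuple starts at 1/d
misclassified = np.zeros(d, dtype=bool)
misclassified[[2, 7]] = True           # toy example: tuples 2 and 7 were wrong

new_weights, vote_weight = update_weights(weights, misclassified)
print(new_weights)
print(vote_weight)
```

In this toy round the two misclassified tuples go from weight 0.1 to 0.25 each, while the eight correct ones drop to 0.0625 each, so the next classifier sees the hard tuples as half of the total weight.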

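**Difference between the two algorithms: because AdaBoost focuses on misclassified tuples, it risks overfitting the composite model to such data, whereas bagging is less susceptible to overfitting. Compared with a single model, both can improve accuracy significantly, and boosting tends to achieve somewhat higher accuracy.**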

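```python
# AdaBoost with the default base estimator and 100 boosting rounds.
ada_model = ensemble.AdaBoostClassifier(n_estimators=100, random_state=0)
ada_model.fit(X_train, Y_train)

# Macro-averaged precision on the validation set.
print(metrics.precision_score(Y_val, ada_model.predict(X_val), average='macro'))
```

Output:

```
0.912280701754386
```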


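## Random Forests

Random forests combine the bagging idea with tree models. Every tree in the forest is a decision tree, and the collection of decision trees is the forest. More precisely, each tree depends on an independently drawn random sample. At classification time, every tree votes and the class with the most votes is returned.

The accuracy of random forests is comparable to that of AdaBoost, and they are more robust to errors and outliers. As the number of trees grows, the generalization error of the forest converges, so adding more trees does not by itself lead to overfitting.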
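To get a feel for how the forest behaves as trees are added, one can fit forests of increasing size and compare their validation accuracy. This is only a rough sketch: the forest sizes are arbitrary choices, and a single small validation split is noisy, so the numbers should not be read as a demonstration of convergence.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scores = {}
for n_trees in (1, 5, 25, 100):  # arbitrary forest sizes for illustration
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_train, y_train)
    # score() returns plain accuracy on the validation set.
    scores[n_trees] = forest.score(X_val, y_val)
    print(n_trees, scores[n_trees])
```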

```python
# Random forest with 10 trees.
rf_model = ensemble.RandomForestClassifier(n_estimators=10, random_state=0)
rf_model.fit(X_train, Y_train)

# Macro-averaged precision on the validation set.
print(metrics.precision_score(Y_val, rf_model.predict(X_val), average='macro'))
```

Output:

```
0.9777777777777779
```


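## Improving classification accuracy on imbalanced data (classification problems only)

Suppose the original training set contains 100 positive and 1000 negative tuples.

**Oversampling**: resample (duplicate) the positive training tuples, the minority class, until the new training set contains 1000 positive and 1000 negative tuples.

**Undersampling**: randomly remove negative training tuples, the majority class, until the new training set contains 100 positive and 100 negative tuples.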
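Both strategies can be sketched with plain NumPy on the 100 positive / 1000 negative example. The feature matrix here is random placeholder data, and label 1 for positive and 0 for negative is an assumed encoding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: 100 positive (label 1) and 1000 negative (label 0) tuples.
X = rng.normal(size=(1100, 2))
y = np.array([1] * 100 + [0] * 1000)

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Random oversampling: duplicate positive tuples (drawn with replacement)
# until both classes have 1000 tuples.
extra_pos = rng.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)
over_idx = np.concatenate([pos_idx, extra_pos, neg_idx])
X_over, y_over = X[over_idx], y[over_idx]

# Random undersampling: keep a random subset of negative tuples
# (drawn without replacement) so both classes have 100 tuples.
kept_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
under_idx = np.concatenate([pos_idx, kept_neg])
X_under, y_under = X[under_idx], y[under_idx]

print(np.bincount(y_over))   # [1000 1000]
print(np.bincount(y_under))  # [100 100]
```

Resampling is normally applied to the training split only, so that the validation data keeps the original class distribution.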

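## [imbalanced-learn](https://imbalanced-learn.readthedocs.io/en/stable/api.html)

This site documents imbalanced-learn, a library for handling imbalanced data. The sections below list part of its API.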

## Prototype selection
| Class | Description |
| --- | --- |
| `under_sampling.CondensedNearestNeighbour([…])` | Class to perform under-sampling based on the condensed nearest neighbour method. |
| `under_sampling.EditedNearestNeighbours([…])` | Class to perform under-sampling based on the edited nearest neighbour method. |
| `under_sampling.RepeatedEditedNearestNeighbours([…])` | Class to perform under-sampling based on the repeated edited nearest neighbour method. |
| `under_sampling.AllKNN([sampling_strategy, …])` | Class to perform under-sampling based on the AllKNN method. |
| `under_sampling.InstanceHardnessThreshold([…])` | Class to perform under-sampling based on the instance hardness threshold. |
| `under_sampling.NearMiss([sampling_strategy, …])` | Class to perform under-sampling based on NearMiss methods. |
| `under_sampling.NeighbourhoodCleaningRule([…])` | Class performing under-sampling based on the neighbourhood cleaning rule. |
| `under_sampling.OneSidedSelection([…])` | Class to perform under-sampling based on one-sided selection method. |
| `under_sampling.RandomUnderSampler([…])` | Class to perform random under-sampling. |
| `under_sampling.TomekLinks([…])` | Class to perform under-sampling by removing Tomek's links. |

## Over-sampling methods
| Class | Description |
| --- | --- |
| `over_sampling.ADASYN([sampling_strategy, …])` | Perform over-sampling using Adaptive Synthetic (ADASYN) sampling approach for imbalanced datasets. |
| `over_sampling.BorderlineSMOTE([…])` | Over-sampling using Borderline SMOTE. |
| `over_sampling.KMeansSMOTE([…])` | Apply a KMeans clustering before to over-sample using SMOTE. |
| `over_sampling.RandomOverSampler([…])` | Class to perform random over-sampling. |
| `over_sampling.SMOTE([sampling_strategy, …])` | Class to perform over-sampling using SMOTE. |
| `over_sampling.SMOTENC(categorical_features)` | Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTE-NC). |
| `over_sampling.SVMSMOTE([sampling_strategy, …])` | Over-sampling using SVM-SMOTE. |

## Combination of over- and under-sampling methods
| Class | Description |
| --- | --- |
| `combine.SMOTEENN([sampling_strategy, …])` | Class to perform over-sampling using SMOTE and cleaning using ENN. |
| `combine.SMOTETomek([sampling_strategy, …])` | Class to perform over-sampling using SMOTE and cleaning using Tomek links. |

## Ensemble methods
| Class | Description |
| --- | --- |
| `ensemble.BalanceCascade(**kwargs)` | Create an ensemble of balanced sets by iteratively under-sampling the imbalanced dataset using an estimator. |
| `ensemble.BalancedBaggingClassifier([…])` | A Bagging classifier with additional balancing. |
| `ensemble.BalancedRandomForestClassifier([…])` | A balanced random forest classifier. |
| `ensemble.EasyEnsemble(**kwargs)` | Create an ensemble sets by iteratively applying random under-sampling. |
| `ensemble.EasyEnsembleClassifier([…])` | Bag of balanced boosted learners also known as EasyEnsemble. |
| `ensemble.RUSBoostClassifier([…])` | Random under-sampling integrating in the learning of an AdaBoost classifier. |