<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Voting" data-toc-modified-id="Voting-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Voting</a></span></li><li><span><a href="#Bagging" data-toc-modified-id="Bagging-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Bagging</a></span><ul class="toc-item"><li><span><a href="#RandomForest" data-toc-modified-id="RandomForest-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>RandomForest</a></span></li></ul></li><li><span><a href="#Stacking" data-toc-modified-id="Stacking-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Stacking</a></span></li><li><span><a href="#Blending" data-toc-modified-id="Blending-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Blending</a></span></li><li><span><a href="#Boosting" data-toc-modified-id="Boosting-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Boosting</a></span></li></ul></div>

# Model Ensembling in Practice

**Load the data**

In [1]:
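# Pima Indians diabetes dataset
# [1] Pregnancies: number of pregnancies
# [2] Glucose: glucose concentration
# [3] BloodPressure: blood pressure (mm Hg)
# [4] SkinThickness: skin thickness (mm)
# [5] Insulin: 2-hour serum insulin (mu U/ml)
# [6] BMI: body mass index, weight in kg / (height in m)^2
# [7] DiabetesPedigreeFunction: diabetes pedigree function
# [8] Age: age (years)
# [9] Outcome: class label (0 or 1)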

import pandas as pd
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
df = pd.read_csv('../data/pima/pima-indians-diabetes.csv', names=names)

In [2]:
df.head()

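| | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |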


In [3]:
# This is a binary classification problem
df['class'].unique()

array([1, 0], dtype=int64)

## Voting

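Ensembling with a voting classifier

The premise is that different models make different errors, and that a single model has a hard time keeping overfitting under control.
So we train several kinds of models and take a majority vote over their predictions.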
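To make the majority-vote rule concrete, here is a minimal sketch on hand-written predictions. The array `preds` is made-up illustration data, not output from the models in this notebook.

```python
import numpy as np

# Hypothetical 0/1 predictions from three models on four samples
# (rows = models, columns = samples); the values are invented for illustration.
preds = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
])

# Count the votes for class 1 per sample; a class wins with at least 2 of 3 votes.
votes_for_one = preds.sum(axis=0)
majority = (votes_for_one >= 2).astype(int)
print(majority)  # [1 0 1 1]
```

With an odd number of binary voters there can be no ties, which is one reason three sub-models are a convenient choice.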

In [12]:
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

In [13]:
import warnings
warnings.filterwarnings('ignore')

In [7]:
X = df.iloc[:,0:8]
Y = df['class']

In [8]:
X.head()

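| | preg | plas | pres | skin | test | mass | pedi | age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |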


In [10]:
help(model_selection.cross_val_score)

Help on function cross_val_score in module sklearn.model_selection._validation:

cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv='warn', n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score='raise-deprecating')
    Evaluate a score by cross-validation
    
    Read more in the :ref:`User Guide <cross_validation>`.
    
    Parameters
    ----------
    estimator : estimator object implementing 'fit'
        The object to use to fit the data.
    
    X : array-like
        The data to fit. Can be for example a list, or an array.
    
    y : array-like, optional, default: None
        The target variable to try to predict in the case of
        supervised learning.
    
    groups : array-like, with shape (n_samples,), optional
        Group labels for the samples used while splitting the dataset into
        train/test set.
    
    scoring : string, callable or None, optional, default: None
        A string (see model evaluation documentat

In [21]:
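# Set up k-fold cross-validation.
# Without shuffle=True, random_state has no effect, and recent scikit-learn
# versions raise an error if it is set, so it is left out here.
kfold = model_selection.KFold(n_splits=5)

# Create the sub-models for the voting classifier
estimators = []
model_1 = LogisticRegression()
estimators.append(('logistic', model_1))

model_2 = DecisionTreeClassifier()
estimators.append(('dt', model_2))

model_3 = SVC()
estimators.append(('svm', model_3))

# Build the voting ensemble (hard majority voting by default)
ensemble = VotingClassifier(estimators)

# Cross-validation scores
result = model_selection.cross_val_score(ensemble, X, Y, scoring='accuracy', cv=kfold)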


In [22]:
result

array([0.72077922, 0.66233766, 0.73376623, 0.82352941, 0.75163399])

In [16]:
print(result.mean())

0.7435956200662084


## Bagging

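1) Why does a model perform poorly?
Underfitting or overfitting.

2) How can overfitting be mitigated?
Give the model fewer questions, so that it cannot simply memorize the answers to all of them.
Ask several classmates to work the problems and combine their answers.

Bagging (bootstrap aggregating) works as follows:

1. Draw training sets from the original sample set. In each round, draw n training samples from the original set by bootstrap sampling, that is, sampling with replacement, so within one training set some samples may be drawn several times while others are never drawn. Repeat for k rounds to obtain k training sets, each drawn independently of the others.

2. Train one model on each training set, so that k training sets yield k models. (Note: bagging does not prescribe a particular classification or regression algorithm; depending on the problem we can use different ones, such as decision trees or perceptrons.)

3. For classification, combine the k models from the previous step by voting; for regression, take the mean of their predictions as the final result. (All models carry equal weight.)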
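The three steps above can be sketched by hand as follows. This is only an illustrative sketch: it uses synthetic data from `make_classification` instead of the diabetes CSV, and the names `X_demo`, `y_demo`, `k`, and `models` are assumptions introduced for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Pima dataset used in this notebook)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
k = 25            # number of bootstrap rounds (odd, so a binary vote cannot tie)
n = len(X_tr)     # each bootstrap sample has as many rows as the training set

models = []
for _ in range(k):
    # Step 1: draw n row indices with replacement (bootstrap sampling)
    idx = rng.integers(0, n, size=n)
    # Step 2: train one model per bootstrap sample
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_tr[idx], y_tr[idx])
    models.append(tree)

# Step 3: equal-weight majority vote over the k models
all_preds = np.array([m.predict(X_te) for m in models])   # shape (k, n_test)
bagged_pred = (all_preds.mean(axis=0) > 0.5).astype(int)
bagged_acc = (bagged_pred == y_te).mean()
print(bagged_acc)
```

`BaggingClassifier` in scikit-learn wraps the same idea and adds options such as feature subsampling.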


In [23]:
from sklearn.ensemble import BaggingClassifier

In [25]:
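dt = DecisionTreeClassifier()
num = 100
# Set up k-fold cross-validation (random_state is omitted because it has no
# effect without shuffle=True and newer scikit-learn versions reject it)
kfold = model_selection.KFold(n_splits=5)
# The base model is passed positionally: the keyword was renamed from
# base_estimator to estimator in newer scikit-learn releases
model = BaggingClassifier(dt, n_estimators=num, random_state=10)
result = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(result.mean())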

0.7695866225277991


###  RandomForest

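A random forest is also a bagging method; in addition to resampling the rows, it samples a random subset of the features when growing each tree.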

In [28]:
from sklearn.ensemble import RandomForestClassifier
num_trees = 100
max_feature_num = 5  # number of features considered at each split
# random_state is omitted: it has no effect without shuffle=True
# and newer scikit-learn versions raise an error if it is set
kfold = model_selection.KFold(n_splits=5)
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_feature_num)
result = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(result.mean())

0.7526610644257703


## Stacking

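Stacking uses the outputs of one layer of estimators as the input features of the next layer. It is a bit like a company where information first goes to the department managers, each manager makes a decision, and the decisions are then passed up to the big boss, who makes the final call based on what the different managers decided.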
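A minimal stacking sketch, assuming scikit-learn 0.22 or later, where `StackingClassifier` is available (the older version used elsewhere in this notebook does not provide it). The data is synthetic and the choice of base models and of `LogisticRegression` as the final estimator is only an illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Pima dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# First layer: the "department managers"
base_models = [
    ('dt', DecisionTreeClassifier(random_state=0)),
    ('svm', SVC(random_state=0)),
]

# Second layer: the "big boss", trained on out-of-fold outputs of the first layer
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
)

stack_scores = cross_val_score(stack, X_demo, y_demo, cv=5, scoring='accuracy')
print(stack_scores.mean())
```

The inner `cv=5` matters: the final estimator is fitted on predictions the base models made for samples they were not trained on, which limits leakage from the first layer into the second.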

## Blending

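Blending is a weaker form of stacking: the next layer does not train a model, which amounts to a linear weighting of, or a vote over, the first-layer outputs.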
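Blending in the sense described here can be sketched as a fixed weighted average of predicted probabilities. The data is synthetic, and the weights in `weights` are hypothetical hand-picked values, not tuned ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Pima dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3, random_state=0),
]
weights = np.array([0.6, 0.4])  # hypothetical weights, chosen by hand; they sum to 1

# Probability of class 1 from each fitted model, stacked to shape (n_models, n_test)
probas = np.array([m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in models])

# No second-layer model: just a linear combination of the first-layer outputs
blended_proba = weights @ probas
blended_pred = (blended_proba >= 0.5).astype(int)
blend_acc = (blended_pred == y_te).mean()
print(blend_acc)
```

In scikit-learn, `VotingClassifier(voting='soft', weights=...)` offers a comparable weighted average of class probabilities.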

## Boosting

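Why did the exam go badly?

1. Not enough effort yet: the practice problems need to be studied many times (repeated iteration and training).
2. Time should be allocated sensibly, with more practice on the problems answered wrongly before (each round gives higher weights to the misclassified samples).
3. I am not brilliant, but I am steady: by accumulating the simplest pieces of knowledge, I become an expert (a combination of the simplest classifiers).

Take AdaBoost as an example.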

In [30]:
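from sklearn.ensemble import AdaBoostClassifier
num_trees = 25
# random_state is omitted from KFold: it has no effect without shuffle=True
# and newer scikit-learn versions raise an error if it is set
kfold = model_selection.KFold(n_splits=5)
# By default the weak learner is a depth-1 decision tree (a decision stump)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=10)
result = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(result.mean())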

0.7513623631270689
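The reweighting idea behind AdaBoost can also be written out by hand. The sketch below follows the classic discrete AdaBoost update with decision stumps and labels in {-1, +1}. It uses synthetic data and is meant as an illustration, not as a reproduction of what `AdaBoostClassifier` does internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Pima dataset); labels mapped to {-1, +1}
X_demo, y01 = make_classification(n_samples=300, n_features=8, random_state=0)
y_demo = 2 * y01 - 1

n_rounds = 25
w = np.full(len(y_demo), 1.0 / len(y_demo))   # start with uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # The simplest classifier: a depth-1 tree fitted to the weighted samples
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X_demo, y_demo, sample_weight=w)
    pred = stump.predict(X_demo)

    # Weighted error, clipped to keep the logarithm finite
    err = np.clip(np.sum(w[pred != y_demo]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)      # this stump's say in the final vote

    # Raise the weights of misclassified samples, lower the others, renormalize
    w = w * np.exp(-alpha * y_demo * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of stump votes
scores = sum(a * s.predict(X_demo) for a, s in zip(alphas, stumps))
train_acc = np.mean(np.sign(scores) == y_demo)
print(train_acc)
```

Each round maps onto the list above: the loop is the repeated practice, the weight update puts more emphasis on earlier mistakes, and the final weighted vote combines many simple classifiers.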
