## Data preparation

In [1]:
import numpy as np

from sklearn.datasets import load_breast_cancer
data=load_breast_cancer()
X,Y=data.data,data.target
Y[Y==0]=-1    # relabel class 0 as -1 so that labels are in {-1, +1} and the final vote can use np.sign
del data

from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2)
print(X_train.shape,X_test.shape,Y_train.shape,Y_test.shape)

(455, 30) (114, 30) (455,) (114,)


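## Model basics
sklearn's AdaBoost implementation uses a decision stump (a decision tree of depth 1) as its default base model. Since the decision tree module in this library does not yet implement the ```max_depth``` parameter, let us first see how sklearn's decision stump performs on its own: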

In [2]:
from sklearn.tree import DecisionTreeClassifier
tree_stump=DecisionTreeClassifier(criterion='entropy',max_depth=1)
tree_stump.fit(X_train,Y_train)
Y_pred=tree_stump.predict(X_test)

print('stump acc:{}'.format(np.sum(Y_pred==Y_test)/len(Y_test)))

stump acc:0.8508771929824561


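Because ```max_depth``` is not implemented yet, we temporarily use sklearn's decision stump to build a simple boosting algorithm. The key to AdaBoost is the weight update: both the samples and the models carry weights, but only the model weights take part in the final prediction. The sample weights only steer how each successive model is trained.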
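
For reference, the update used in this notebook can be written out as below. This is a sketch in the AdaBoost.M1 form that mirrors ```update_w```; the symbols $G_k$ (the $k$-th base model) and $\mathrm{err}_k$ (its weighted training error) are notation introduced here and do not appear in the code.

$$
\mathrm{err}_k = \frac{\sum_i w_i\,\mathbb{1}[y_i \ne G_k(x_i)]}{\sum_i w_i},
\qquad
\alpha_k = \log\frac{1-\mathrm{err}_k}{\mathrm{err}_k}
$$

$$
w_i \leftarrow \frac{w_i \exp\big(\alpha_k\,\mathbb{1}[y_i \ne G_k(x_i)]\big)}{\sum_j w_j \exp\big(\alpha_k\,\mathbb{1}[y_j \ne G_k(x_j)]\big)},
\qquad
G(x) = \operatorname{sign}\Big(\sum_{k=1}^{K} \alpha_k\, G_k(x)\Big)
$$

Correctly classified samples keep their weight before normalization, while misclassified ones are multiplied by $(1-\mathrm{err}_k)/\mathrm{err}_k$, which is greater than 1 whenever the model beats random guessing.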

In [3]:
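def update_w(w, Y_train, Y_pred):
    miss = (Y_train != Y_pred)    # True where the current model is wrong
    weight_err = np.sum(w*miss)/np.sum(w)    # weighted training error
    weight_err = np.clip(weight_err, 1e-10, 1-1e-10)    # guard against division by zero and log(0)
    alpha = np.log(1/weight_err-1)    # model weight, log((1-err)/err)
    w = w*np.exp(alpha*miss)    # up-weight the misclassified samples
    w=w/np.sum(w)    # normalize so the weights sum to 1

    return w, alpha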


# w = np.array([1/3, 1/3, 1/3])
# Y_true = np.array([0, 0, 0])
# Y_pred = np.array([0, 0, 1])
# update_w(w, Y_true, Y_pred)
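
As a quick sanity check of the update rule, here is a small self-contained sketch. It re-defines the update locally under the hypothetical name ```update_w_demo``` (same formula as ```update_w```) and uses made-up labels in {-1, +1}; none of these values come from the dataset.

```python
import numpy as np

def update_w_demo(w, y_true, y_pred):
    """Illustrative copy of the AdaBoost weight update (hypothetical name)."""
    miss = (y_true != y_pred)                 # True where the model is wrong
    weight_err = np.sum(w * miss) / np.sum(w) # weighted training error
    alpha = np.log(1 / weight_err - 1)        # log((1 - err) / err)
    w = w * np.exp(alpha * miss)              # up-weight misclassified samples
    return w / np.sum(w), alpha               # normalized weights, model weight

w = np.array([1/3, 1/3, 1/3])      # three equally weighted samples
y_true = np.array([1, 1, 1])       # made-up labels
y_pred = np.array([1, 1, -1])      # the third sample is misclassified

new_w, alpha = update_w_demo(w, y_true, y_pred)
print(new_w, alpha)
```

With one of three equally weighted samples misclassified, the weighted error is 1/3, so alpha is log 2 (about 0.693). The misclassified sample's weight doubles before normalization, giving new weights of 0.25, 0.25 and 0.5.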

Now train the models sequentially, updating the weights after each one:

In [4]:
n_samples, n_features = X_train.shape

K = 5    # number of base models
w = np.array([1/n_samples]*n_samples)    # initial sample weights (uniform)
alphas = list()    # model weights, one alpha per base model
base_models = list()

for i in range(K):
    cur_model = DecisionTreeClassifier(max_depth=1)
    cur_model.fit(X_train, Y_train,sample_weight=w)    # the sample weights are used when the tree chooses its split
    cur_pred = cur_model.predict(X_train)

    w, alpha = update_w(w, Y_train, cur_pred)
    alphas.append(alpha)
    base_models.append(cur_model)
    
# print(w)
# print(alphas)

In [5]:
for i in range(K):
    if i==0:
        Y_pred=alphas[i]*base_models[i].predict(X_test)
    else:
        Y_pred=np.c_[Y_pred,alphas[i]*base_models[i].predict(X_test)]
        
Y_pred=np.sign(np.sum(Y_pred,axis=1))

In [6]:
print('ada acc:{}'.format(np.sum(Y_pred==Y_test)/len(Y_test)))

ada acc:0.9035087719298246
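
The column-stacking loop used for prediction is equivalent to a single weighted vote. The sketch below shows that form on made-up numbers; ```weighted_vote```, ```alphas_demo``` and ```preds_demo``` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def weighted_vote(alphas, preds):
    """Combine base-model predictions in {-1, +1} by an alpha-weighted vote.

    alphas: sequence of K model weights
    preds:  K rows, each holding one model's predictions for all samples
    """
    alphas = np.asarray(alphas, dtype=float)   # shape (K,)
    preds = np.asarray(preds, dtype=float)     # shape (K, n_samples)
    return np.sign(alphas @ preds)             # sign of the weighted sum per sample

alphas_demo = [0.9, 0.4, 0.3]                  # made-up model weights
preds_demo = [[ 1, -1,  1],                    # model 1
              [-1, -1,  1],                    # model 2
              [-1,  1,  1]]                    # model 3
print(weighted_vote(alphas_demo, preds_demo))
```

For the first sample, two of the three models vote -1, but the first model's larger weight (0.9 against 0.4 + 0.3) makes the ensemble predict +1. In the notebook, the same function could be called as ```weighted_vote(alphas, [m.predict(X_test) for m in base_models])```.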


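With sklearn's help, implementing AdaBoost is straightforward. However, the goal of this library is to reimplement sklearn's functionality using only Python and numpy. As the code above shows, the key is to implement the ```max_depth``` and ```sample_weight``` parameters in the tree algorithm: the former is what produces a decision stump, and the latter controls how much each sample counts when the tree chooses a split. A guide to that part is in ```../tree/add_param.ipynb```.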
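
To make the role of ```sample_weight``` concrete, here is a minimal numpy sketch of a weighted decision stump. It is not the implementation from ```../tree/add_param.ipynb```: it assumes labels in {-1, +1}, searches thresholds exhaustively, and minimizes weighted misclassification error rather than the impurity criteria (gini or entropy) that sklearn's trees use. The names ```fit_weighted_stump``` and ```predict_stump``` are hypothetical.

```python
import numpy as np

def fit_weighted_stump(X, y, sample_weight):
    """Find the (feature, threshold, polarity) with the lowest weighted error."""
    n_samples, n_features = X.shape
    best = {"err": np.inf}
    for j in range(n_features):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                # Predict `polarity` when the feature exceeds the threshold
                pred = np.where(X[:, j] > thr, polarity, -polarity)
                err = np.sum(sample_weight * (pred != y))
                if err < best["err"]:
                    best = {"err": err, "feature": j, "thr": thr, "polarity": polarity}
    return best

def predict_stump(stump, X):
    """Apply a fitted stump to the rows of X."""
    above = X[:, stump["feature"]] > stump["thr"]
    return np.where(above, stump["polarity"], -stump["polarity"])

# Made-up 1-D data that no single threshold can separate perfectly
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([-1, -1, 1, -1])

stump_uniform = fit_weighted_stump(X_demo, y_demo, np.full(4, 0.25))
stump_skewed = fit_weighted_stump(X_demo, y_demo, np.array([0.1, 0.1, 0.1, 0.7]))

print(predict_stump(stump_uniform, X_demo))
print(predict_stump(stump_skewed, X_demo))
```

With uniform weights the stump splits at 2.0 and misclassifies the last sample. Once that sample carries most of the weight, the search instead picks a stump that gets it right at the cost of the third sample. This shift is the behavior AdaBoost relies on when it passes updated weights to each new base model.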