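## Ensemble algorithms
- Ensemble learning builds multiple models and combines them through some strategy to complete a task
- The goal is better predictive performance than a single model gives
- Ensemble algorithms fall into three main families: Bagging, Boosting, and Stacking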


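### 1. Bagging
- Train multiple models and average their predictions
- Each model is trained on a random draw of the data: samples are drawn at random and features are drawn at random, using bootstrap sampling (sampling with replacement)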
![Bagging](https://img0.baidu.com/it/u=2098698089,3530436934&fm=253&fmt=auto&app=138&f=PNG?w=869&h=500)
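As a rough illustration of bootstrap sampling, the sketch below draws one bootstrap sample using only the standard library. The sample size of 1000 and the fixed seed are arbitrary choices for the demo, not values from these notes.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable (arbitrary choice)

n_samples = 1000  # hypothetical dataset size
indices = list(range(n_samples))

# Bootstrap sample: draw n_samples indices with replacement
bootstrap = random.choices(indices, k=n_samples)

# Indices that were never drawn are "out of bag" for this model
in_bag = set(bootstrap)
out_of_bag = [i for i in indices if i not in in_bag]

oob_fraction = len(out_of_bag) / n_samples
print(f"unique in-bag samples: {len(in_bag)}")
print(f"out-of-bag fraction:   {oob_fraction:.3f}")  # tends toward 1/e (about 0.368) for large n
```

Each model in a Bagging ensemble would be trained on its own such sample, so every model sees a slightly different view of the data.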


### 2. Boosting
- Examples: AdaBoost, XGBoost, GBDT
![Boosting](https://www.researchgate.net/publication/356698772/figure/fig2/AS:1096436418641951@1638422221975/The-architecture-of-Gradient-Boosting-Decision-Tree.png)
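A minimal sketch of two of these boosting algorithms, using scikit-learn's `AdaBoostClassifier` and `GradientBoostingClassifier` (a GBDT implementation). XGBoost is a separate library and is left out here. The iris dataset, the 70/30 split, and all hyperparameter values are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Boosting trains weak learners one after another; each new learner
# concentrates on the mistakes left by the ones before it.
boosters = {
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "GBDT": GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0),
}

boost_scores = {}
for name, model in boosters.items():
    model.fit(X_tr, y_tr)
    boost_scores[name] = model.score(X_te, y_te)  # accuracy on the held-out split
    print(name, boost_scores[name])
```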

### 3. Stacking
![stacking](https://miro.medium.com/v2/resize:fit:720/format:webp/1*GB8U0rAuCmsQi-26EOmgKw.png)
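Stacking trains several base models and then fits a final model on their predictions. The sketch below uses scikit-learn's `StackingClassifier`; the choice of base models (a small random forest and k-nearest neighbours), the logistic-regression final estimator, and the iris data are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    # The final estimator learns from the base models' cross-validated predictions
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)

stack_accuracy = stack.score(X_te, y_te)
print(stack_accuracy)
```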

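### Random Forest
- A random forest is a Bagging algorithm built from decision trees
- Forest: many trees
- Random: both samples and features are drawn at random (random sampling with replacement)
- For classification, every decision tree in the forest classifies the sample, and the forest outputs the class that receives the most votes
- For regression, the forest outputs the mean of all decision trees' predictions
- A random forest can compute the importance of each independent variable (feature)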
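The two combination rules above can be sketched in a few lines of plain Python. The function names and the per-tree outputs are made up for illustration. Note that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so this shows the idea, not the library's exact procedure.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_outputs):
    """Return the mean of the individual trees' predictions."""
    return mean(tree_outputs)


# Hypothetical outputs from five classification trees and four regression trees
votes = ["setosa", "virginica", "virginica", "versicolor", "virginica"]
values = [21.0, 23.5, 22.0, 24.5]

print(forest_classify(votes))  # virginica
print(forest_regress(values))  # 22.75
```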

In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
from sklearn import datasets
import math
import joblib
from sklearn import metrics

#### RFC

In [4]:
iris = datasets.load_iris()
X = iris.data
y = iris.target

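**Key parameters**
- n_estimators: number of trees in the forest
- max_features: number of features each decision tree considers when choosing a split
- max_depth: maximum depth of a tree
- min_samples_split: minimum number of samples a node must contain before it can be split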
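To see the four parameters in one place, here is a small sketch that sets all of them and scores the model with 5-fold cross-validation. The specific values (50 trees, `"sqrt"` features, depth 3, split size 4) are arbitrary starting points, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = load_iris(return_X_y=True)

clf = RandomForestClassifier(
    n_estimators=50,       # number of trees in the forest
    max_features="sqrt",   # features considered at each split: sqrt(n_features)
    max_depth=3,           # cap on tree depth, limits overfitting
    min_samples_split=4,   # a node needs at least 4 samples to be split
    random_state=0,
)

cv_scores = cross_val_score(clf, X_demo, y_demo, cv=5)  # accuracy per fold
print(cv_scores.mean())
```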

In [7]:
RFC = RandomForestClassifier(n_estimators=10, max_depth=3, oob_score=True)
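# oob means "out of bag": the samples not drawn in a given bootstrap sample. oob_score evaluates the model on that OOB data (accuracy for classification, R^2 for regression)
# With a Bagging ensemble you can skip train_test_split and use oob_score instead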

RFC.fit(X,y)
print(RFC.score(X,y))

RFC.oob_score_

0.9666666666666667




0.9133333333333333
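The gap between the two numbers above (training accuracy versus OOB accuracy) suggests that the training score is optimistic. One way to check how well the OOB score tracks unseen data is to compare it with a held-out test set, as in this sketch. The 70/30 split, 100 trees, and seed are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

rf = RandomForestClassifier(
    n_estimators=100, max_depth=3, oob_score=True, random_state=0
)
rf.fit(X_tr, y_tr)

train_acc = rf.score(X_tr, y_tr)  # evaluated on data the trees were fit to
oob_acc = rf.oob_score_           # each sample scored only by trees that never saw it
test_acc = rf.score(X_te, y_te)   # evaluated on the held-out split

print(f"train={train_acc:.3f}  oob={oob_acc:.3f}  test={test_acc:.3f}")
```

With very few trees some samples may never fall out of bag, which makes the OOB estimate noisy; using more trees than the 10 in the cell above is a common way to stabilise it.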

In [8]:
RFC.predict_proba(X[-10:])  # predicted class probabilities for the last ten samples

array([[0.        , 0.0952789 , 0.9047211 ],
       [0.        , 0.0952789 , 0.9047211 ],
       [0.        , 0.14552953, 0.85447047],
       [0.        , 0.10436981, 0.89563019],
       [0.        , 0.10436981, 0.89563019],
       [0.        , 0.0952789 , 0.9047211 ],
       [0.        , 0.12861224, 0.87138776],
       [0.        , 0.0952789 , 0.9047211 ],
       [0.        , 0.10436981, 0.89563019],
       [0.        , 0.15956462, 0.84043538]])

In [10]:
print(iris.feature_names)
RFC.feature_importances_

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


array([0.14061168, 0.01784783, 0.39413871, 0.44740178])
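Reading the importances against a separate list of names is error-prone. A possible convenience, sketched here, is to pair them in a pandas `Series` and sort. The model settings mirror the cell above, except for a fixed `random_state` added for repeatability, so the exact values will differ from the output shown.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris_demo = load_iris()
rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
rf.fit(iris_demo.data, iris_demo.target)

# Label each importance with its feature name and sort from most to least important
importances = pd.Series(
    rf.feature_importances_, index=iris_demo.feature_names
).sort_values(ascending=False)

print(importances)
```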

#### RFR

In [None]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston = datasets.load_boston()
X = boston.data
y = boston.target

In [None]:
RFR = RandomForestRegressor(n_estimators=14, max_depth=4, oob_score=True)
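# oob means "out of bag": the samples not drawn in a given bootstrap sample. oob_score evaluates the model on that OOB data (accuracy for classification, R^2 for regression)
# With a Bagging ensemble you can skip train_test_split and use oob_score instead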

RFR.fit(X,y)
print(RFR.score(X,y))


RFR.oob_prediction_

In [15]:
print(boston.feature_names)
RFR.feature_importances_

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


array([3.32954056e-02, 0.00000000e+00, 2.77678961e-03, 6.64188626e-05,
       1.81191451e-02, 4.47094498e-01, 5.37892519e-03, 5.43998102e-02,
       1.43576194e-03, 6.60369232e-03, 1.64497083e-02, 2.88151298e-03,
       4.11498332e-01])

## Part 1: RFR applied to ad effectiveness prediction

In [37]:
df = pd.read_csv("./data/ads_3.csv")

X = df[df.columns[:62]]
Y = df[df.columns[62:]]

In [None]:
MSE = []
RMSE = []
R_squared = []
R_squared_oob = []
feature_importance = []

import os

# joblib.dump does not create missing directories, so make sure the target exists
os.makedirs("model/RF_regression", exist_ok=True)

for i in range(12):  # one model per target column
    y = Y[Y.columns[i]]

    RF_regression = RandomForestRegressor(n_estimators=15, max_depth=4, oob_score=True, n_jobs=-1)
    RF_regression.fit(X, y)

    joblib.dump(RF_regression, "model/RF_regression/model{}.pkl".format(i+1))

    # Predict once and reuse the result; these are training-set metrics, so they are optimistic
    y_pred = RF_regression.predict(X)
    mse = metrics.mean_squared_error(y, y_pred)
    MSE.append(mse)
    RMSE.append(math.sqrt(mse))
    R_squared.append(metrics.r2_score(y, y_pred))
    R_squared_oob.append(RF_regression.oob_score_)  # R^2 on out-of-bag samples

    feature_importance.append(list(RF_regression.feature_importances_))

In [39]:
result_dic = {"MSE":MSE, "RMSE":RMSE, "R_squared":R_squared, "R_squared_oob":R_squared_oob}
result_df = pd.DataFrame(result_dic, index=Y.columns)
result_df.to_csv("result/RF_regression.csv")

feature_importance_df = pd.DataFrame(feature_importance, columns=X.columns, index=Y.columns)
feature_importance_df.to_csv("result/RFR_feature_importance.csv")
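The models saved with `joblib.dump` can be loaded back later with `joblib.load`. The sketch below shows the round trip on synthetic data in a temporary directory, since the ads dataset is not available here. The toy data, the file name, and the seed are all assumptions for the demo.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real features and one target column
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=15, max_depth=4, random_state=0)
model.fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model1.pkl")  # hypothetical file name
    joblib.dump(model, path)       # save the fitted model
    restored = joblib.load(path)   # load it back into a new object

# The restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Pickled models are generally only safe to load with the same scikit-learn version that wrote them, and only from sources you trust.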