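## Ensemble Algorithms
- An ensemble algorithm builds multiple models and combines them with some strategy to complete a task
- The goal is better predictive performance than a single model achieves
- Ensemble algorithms fall into three broad families: Bagging, Boosting, and Stacking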


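### 1. Bagging
- Train multiple models and average their predictions
- Each model is trained on a random sample: rows are drawn at random, features are drawn at random, and rows are drawn with the bootstrap sampling method (sampling with replacement)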
![Bagging](https://img0.baidu.com/it/u=2098698089,3530436934&fm=253&fmt=auto&app=138&f=PNG?w=869&h=500)
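
The Bagging idea above can be sketched by hand. This is a minimal illustration, not a reference implementation: the choice of 25 trees, `max_depth=3`, and the iris data are assumptions made for the example, and only rows (not features) are resampled here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_samples = X.shape[0]

trees = []
for _ in range(25):
    # Bootstrap sample: draw n_samples row indices with replacement
    idx = rng.integers(0, n_samples, size=n_samples)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Aggregate: average the per-tree class probabilities, then take the largest
avg_proba = np.mean([t.predict_proba(X) for t in trees], axis=0)
bagged_pred = avg_proba.argmax(axis=1)
bagged_accuracy = (bagged_pred == y).mean()
print("training accuracy:", bagged_accuracy)
```

In practice `sklearn.ensemble.BaggingClassifier` wraps the same resample-fit-aggregate loop.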


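### 2. Boosting
- Models are trained sequentially, each one concentrating on the errors left by the previous ones
- Common algorithms: AdaBoost, XGBoost, and GBDT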
![Boosting](https://www.researchgate.net/publication/356698772/figure/fig2/AS:1096436418641951@1638422221975/The-architecture-of-Gradient-Boosting-Decision-Tree.png)
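
As a rough comparison, two of the boosting algorithms listed above are available directly in scikit-learn. The sketch below assumes the iris data and arbitrary settings (50 estimators, 5-fold cross-validation); XGBoost is left out because it lives in a separate package.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

models = {
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "GBDT": GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(name, round(scores[name], 3))
```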

### 3. Stacking
![stacking](https://miro.medium.com/v2/resize:fit:720/format:webp/1*GB8U0rAuCmsQi-26EOmgKw.png)
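
Stacking trains several base models and then fits a final model on their predictions. A minimal sketch with scikit-learn's `StackingClassifier` follows; the base models (a random forest and k-nearest neighbours), the logistic-regression final estimator, and the iris data are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    # The final estimator learns from the base models' cross-validated predictions
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

stack_score = cross_val_score(stack, X, y, cv=5).mean()
print(round(stack_score, 3))
```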

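### Random Forest
- A random forest is a Bagging algorithm built from decision trees
- Forest: many trees
- Random: both samples and features are drawn at random (samples are drawn with replacement)
- For classification, every decision tree in the forest classifies the input, and the forest outputs the class with the most votes
- For regression, the forest outputs the mean of all the trees' predictions
- A random forest can compute the importance of each input variable (feature)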
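
The regression rule above can be checked directly: averaging the predictions of the individual trees should reproduce the forest's own output. The sketch below uses synthetic data from `make_regression`, and the names `X_demo`, `y_demo`, and `forest` are placeholders for this example only. For classification, note that scikit-learn averages the trees' class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, used only for this illustration
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=10, max_depth=4, random_state=0)
forest.fit(X_demo, y_demo)

# One row of predictions per fitted tree
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

matches = np.allclose(manual_mean, forest.predict(X_demo))
print(matches)
```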

In [19]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
from sklearn import datasets
import math
import joblib
from sklearn import metrics

#### RFC

In [4]:
iris = datasets.load_iris()
X = iris.data
y = iris.target

In [5]:
help(RandomForestClassifier)

Help on class RandomForestClassifier in module sklearn.ensemble._forest:

class RandomForestClassifier(ForestClassifier)
 |  RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
 |  
 |  A random forest classifier.
 |  
 |  A random forest is a meta estimator that fits a number of decision tree
 |  classifiers on various sub-samples of the dataset and uses averaging to
 |  improve the predictive accuracy and control over-fitting.
 |  The sub-sample size is controlled with the `max_samples` parameter if
 |  `bootstrap=True` (default), otherwise the whole dataset is used to build
 |  each tree.
 |  
 |  Read more in the :ref:`User Guide <forest>`.
 |  
 |  Parameters
 |  ----------


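**Key parameters**
- n_estimators: number of trees in the forest
- max_features: number of features each decision tree considers when choosing a split
- max_depth: maximum depth of each tree
- min_samples_split: minimum number of samples a node must contain before it can be split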
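
One common way to choose these four parameters is a small cross-validated grid search. The grid values below are arbitrary assumptions for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Candidate values are illustrative only
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [2, 4],
    "max_features": [1, 2],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```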

In [7]:
RFC = RandomForestClassifier(n_estimators=10, max_depth=3, oob_score=True)
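# OOB stands for "out of bag": the samples that a tree's bootstrap draw did not pick. oob_score is the score measured on those samples (accuracy for classification, R^2 for regression)
# With a Bagging ensemble you can skip train_test_split and use oob_score as the validation estimate instead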

RFC.fit(X,y)
print(RFC.score(X,y))

RFC.oob_score_

0.9666666666666667




0.9133333333333333
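
To see why out-of-bag samples exist at all, the following sketch simulates bootstrap draws and measures how many rows each draw misses. With n rows the expected miss rate is (1 - 1/n)^n, which is close to 1/e (about 0.37). The sample size of 150 matches the iris data; the trial count is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 150   # same size as the iris data
n_trials = 2000   # number of simulated bootstrap draws

oob_fractions = []
for _ in range(n_trials):
    drawn = rng.integers(0, n_samples, size=n_samples)
    # Rows that never appear in the draw are out of bag for that tree
    n_oob = n_samples - np.unique(drawn).size
    oob_fractions.append(n_oob / n_samples)

mean_oob = float(np.mean(oob_fractions))
expected_oob = (1 - 1 / n_samples) ** n_samples
print(round(mean_oob, 3), round(expected_oob, 3))
```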

In [8]:
RFC.predict_proba(X[-10:])  # predicted class probabilities for the last ten samples

array([[0.        , 0.0952789 , 0.9047211 ],
       [0.        , 0.0952789 , 0.9047211 ],
       [0.        , 0.14552953, 0.85447047],
       [0.        , 0.10436981, 0.89563019],
       [0.        , 0.10436981, 0.89563019],
       [0.        , 0.0952789 , 0.9047211 ],
       [0.        , 0.12861224, 0.87138776],
       [0.        , 0.0952789 , 0.9047211 ],
       [0.        , 0.10436981, 0.89563019],
       [0.        , 0.15956462, 0.84043538]])

In [10]:
print(iris.feature_names)
RFC.feature_importances_

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


array([0.14061168, 0.01784783, 0.39413871, 0.44740178])
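
The raw importance array is easier to read when paired with the feature names. A small sketch, assuming a freshly fitted forest (`clf` and `iris_demo` are names introduced for this example) rather than the `RFC` object above:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris_demo = load_iris()
clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
clf.fit(iris_demo.data, iris_demo.target)

# Label each importance with its feature name and sort from most to least important
importances = pd.Series(clf.feature_importances_, index=iris_demo.feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances are normalized, so they sum to 1 across the features.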

#### RFR

In [11]:
boston = datasets.load_boston()
X = boston.data
y = boston.target


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

In [24]:
help(RandomForestRegressor)

Help on class RandomForestRegressor in module sklearn.ensemble._forest:

class RandomForestRegressor(ForestRegressor)
 |  RandomForestRegressor(n_estimators=100, *, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None)
 |  
 |  A random forest regressor.
 |  
 |  A random forest is a meta estimator that fits a number of classifying
 |  decision trees on various sub-samples of the dataset and uses averaging
 |  to improve the predictive accuracy and control over-fitting.
 |  The sub-sample size is controlled with the `max_samples` parameter if
 |  `bootstrap=True` (default), otherwise the whole dataset is used to build
 |  each tree.
 |  
 |  Read more in the :ref:`User Guide <forest>`.
 |  
 |  Parameters
 |  ----------
 |  n_estimato

In [23]:
RFR = RandomForestRegressor(n_estimators=14, max_depth=4, oob_score=True)
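# OOB stands for "out of bag": the samples that a tree's bootstrap draw did not pick. oob_score is the score measured on those samples (accuracy for classification, R^2 for regression)
# With a Bagging ensemble you can skip train_test_split and use oob_score as the validation estimate instead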

RFR.fit(X,y)
print(RFR.score(X,y))


RFR.oob_prediction_

0.5013190861466653


array([0.33518929, 0.37380142, 0.35625337, 0.3690522 , 0.35377727,
       0.40355159, 0.37095873, 0.35372338, 0.38609043, 0.34907917,
       0.38218121, 0.50533333, 0.40093627, 0.33880952, 0.35895521,
       0.33919047, 0.34210825, 0.38607103, 0.34120495, 0.35272908,
       0.32813264, 0.38926357, 0.38618104, 0.36303635, 0.30300893,
       0.32111712, 0.33108734, 0.37788935, 0.35020489, 0.29613445,
       0.39830964, 0.33547036, 0.36795704, 0.42590685, 0.4037796 ,
       0.35994721, 0.3537378 , 0.39771783, 0.39120795, 0.35298984,
       0.32258376, 0.40784821, 0.31569216, 0.45848864, 0.37285103,
       0.38523405, 0.37441577, 0.34240764, 0.38623921, 0.33008617,
       0.31545455, 0.33169229, 0.35140625, 0.34761186, 0.34236493,
       0.34254164, 0.37406088, 0.36959773, 0.29911616, 0.32470288,
       0.25928231, 0.26607921, 0.34919715, 0.34043145, 0.3820021 ,
       0.35649726, 0.42117353, 0.41292929, 0.42625   , 0.3855507 ,
       0.42561701, 0.41517582, 0.51227273, 0.43676042, 0.38077

In [15]:
print(boston.feature_names)
RFR.feature_importances_

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


array([3.32954056e-02, 0.00000000e+00, 2.77678961e-03, 6.64188626e-05,
       1.81191451e-02, 4.47094498e-01, 5.37892519e-03, 5.43998102e-02,
       1.43576194e-03, 6.60369232e-03, 1.64497083e-02, 2.88151298e-03,
       4.11498332e-01])

## Part 1: RFR applied to ad effectiveness prediction

In [31]:
df = pd.read_csv("./data/ads_3.csv")

X = df[df.columns[:62]]
Y = df[df.columns[62:]]

In [32]:
MSE = []
RMSE = []
R_squared = []
R_squared_oob = []
feature_importance = []

for i in range(12):
    # Fit one regressor per target column
    y = Y[Y.columns[i]]

    RF_regression = RandomForestRegressor(n_estimators=14, max_depth=4, oob_score=True, n_jobs=-1)
    RF_regression.fit(X, y)

    # Persist the fitted model; the model/RF_regression directory must already exist
    joblib.dump(RF_regression, "model/RF_regression/model{}.pkl".format(i+1))

    # In-sample metrics, computed from a single predict call
    y_pred = RF_regression.predict(X)
    mse = metrics.mean_squared_error(y, y_pred)
    MSE.append(mse)
    RMSE.append(math.sqrt(mse))
    R_squared.append(metrics.r2_score(y, y_pred))
    # Out-of-bag R^2, estimated on the samples each tree did not see
    R_squared_oob.append(RF_regression.oob_score_)

    feature_importance.append(list(RF_regression.feature_importances_))



In [30]:
result_dic = {"MSE":MSE, "RMSE":RMSE, "R_squared":R_squared, "R_squared_oob":R_squared_oob}
result_df = pd.DataFrame(result_dic, index=Y.columns)
result_df.to_csv("result/RF_regression.csv")

feature_importance_df = pd.DataFrame(feature_importance, columns=X.columns, index=Y.columns)
feature_importance_df.to_csv("result/RFR_feature_importance.csv")
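
Models saved with `joblib.dump` can be restored later with `joblib.load`. The sketch below shows the round trip on synthetic data in a temporary directory; the data, the file name, and the variable names are placeholders for illustration and do not correspond to the ads dataset.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data, used only for this illustration
X_demo, y_demo = make_regression(n_samples=100, n_features=8, noise=5.0, random_state=0)
model = RandomForestRegressor(n_estimators=14, max_depth=4, random_state=0)
model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model1.pkl")
    joblib.dump(model, path)       # write the fitted model to disk
    restored = joblib.load(path)   # read it back as a new object

# The restored model should give the same predictions as the original
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickled models are generally only safe to load with the same scikit-learn version that wrote them.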