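# Random Forest
Ensemble methods include bagging, boosting, and stacking. A random forest is a bagging ensemble whose base learners are decision trees.

(In practice, "bagging" often refers to random forest, because it is the most typical and most widely used bagging method, although bagging can in principle be applied to other base learners.)

* Forest: a collection of many trees (decision trees).
* Random: each tree is trained on a bootstrap sample of the rows (drawn at random with replacement), and each split considers only a random subset of the features.

For classification, every tree in the forest classifies the sample and the most common class (the mode) is returned; scikit-learn's implementation averages the trees' predicted class probabilities and returns the most probable class. For regression, the prediction is the mean of the trees' outputs.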
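
To make the two sources of randomness and the voting step concrete, here is a minimal hand-rolled sketch of bagging decision trees. It is an illustration only, not how `RandomForestClassifier` is implemented internally. It assumes scikit-learn's bundled `load_iris` data rather than the CSV used later, and the names `X_demo`, `y_demo`, `trees`, and `majority` are made up for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled iris data stands in for the CSV used later.
X_demo, y_demo = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_samples = X_demo.shape[0]

trees = []
for i in range(25):
    # Row randomness: a bootstrap sample, drawn with replacement.
    idx = rng.integers(0, n_samples, size=n_samples)
    # Feature randomness: each split looks at a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Hard voting: collect every tree's prediction and take the mode per sample.
all_votes = np.stack([t.predict(X_demo) for t in trees])  # (n_trees, n_samples)
majority = np.array([np.bincount(col, minlength=3).argmax() for col in all_votes.T])
bagged_accuracy = float((majority == y_demo).mean())
print(bagged_accuracy)
```

For regression the same loop would use `DecisionTreeRegressor` and replace the vote with `all_votes.mean(axis=0)`.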

A random forest can also estimate how important each independent variable is, i.e. the feature importances.

In [4]:
from sklearn.ensemble import RandomForestClassifier
help(RandomForestClassifier)

Help on class RandomForestClassifier in module sklearn.ensemble._forest:

class RandomForestClassifier(ForestClassifier)
 |  RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
 |  
 |  A random forest classifier.
 |  
 |  A random forest is a meta estimator that fits a number of decision tree
 |  classifiers on various sub-samples of the dataset and uses averaging to
 |  improve the predictive accuracy and control over-fitting.
 |  The sub-sample size is controlled with the `max_samples` parameter if
 |  `bootstrap=True` (default), otherwise the whole dataset is used to build
 |  each tree.
 |  
 |  Read more in the :ref:`User Guide <forest>`.
 |  
 |  Parameters
 |  ----------


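**Key parameters**

* `n_estimators`: number of trees in the forest (default 100).
* `max_features`: number of features considered when looking for the best split at each node.
* `max_depth`: maximum depth of each tree.
* `min_samples_split`: minimum number of samples a node must contain before it can be split.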
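
Since `max_features` accepts several kinds of values, a small sketch may help show how each one translates into a feature count per split. `resolve_max_features` is a hypothetical helper written for this note, not a scikit-learn function; it follows the rules as I understand them from the scikit-learn documentation (an integer is used as is, a float is a fraction of the features, `"sqrt"` and `"log2"` apply that function, and `None` means all features).

```python
import math

def resolve_max_features(max_features, n_features):
    """Hypothetical helper: how many features are candidates at each split."""
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features
    # A float is read as a fraction of the available features.
    return max(1, int(max_features * n_features))

# 4 features with the classifier default "sqrt"
print(resolve_max_features("sqrt", 4))
# 13 features with 1.0, i.e. every feature is a candidate
print(resolve_max_features(1.0, 13))
```

With the classifier default of `"sqrt"` shown in the signature above, a dataset with 4 features would give 2 candidate features per split. In recent scikit-learn versions the regressor defaults to `1.0`, so it considers all features unless told otherwise.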

## Data preprocessing

In [1]:
import numpy as np
import pandas as pd
data = pd.read_csv('/data/Iris.csv')
data.head()

| | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|---|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [3]:
X = data.iloc[:, 1:-1]  # the four measurement columns (drops Id and Species)
y = data.iloc[:, -1]    # target: Species

## Random forest: classification

In [5]:
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(n_estimators = 10,
                            max_depth = 3,
                            oob_score = True)
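# OOB (out of bag): the samples a tree did not draw in its bootstrap sample.
# oob_score=True evaluates each sample using only the trees that did not see it.
# With bagging, this estimate can stand in for a separate train_test_split.
# Note: with only 10 trees some samples may never be out of bag, so the estimate is noisy.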
RFC.fit(X, y)
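
As a rough check on how much data ends up out of bag, the sketch below simulates bootstrap draws with the standard library only. The sample size of 150 matches the Iris data; `oob_fraction` and the number of repetitions are choices made for this illustration. The expected fraction of rows a single tree never sees is (1 - 1/n)^n, which approaches 1/e (about 0.37) as n grows.

```python
import random

def oob_fraction(n_samples, rng):
    """Fraction of rows NOT drawn in one bootstrap sample of size n_samples."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples

rng = random.Random(0)
n_samples = 150   # same size as the Iris data
n_repeats = 2000  # arbitrary number of simulated trees

simulated = sum(oob_fraction(n_samples, rng) for _ in range(n_repeats)) / n_repeats
expected = (1 - 1 / n_samples) ** n_samples

print(f"simulated OOB fraction: {simulated:.3f}")
print(f"expected  OOB fraction: {expected:.3f}")
```

So each tree leaves roughly a third of the rows unused, and those rows are what `oob_score=True` scores against.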

In [6]:
print(RFC.score(X, y))  # mean accuracy on the training data itself, so optimistic

0.9533333333333334


In [7]:
RFC.oob_score_

0.9333333333333333
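
The training accuracy is usually higher than the OOB score because the trees have already seen those rows. One way to sanity-check the OOB estimate is to compare it with a held-out test set. The sketch below does that under some assumptions: it uses scikit-learn's bundled `load_iris` data instead of the CSV, fixes `random_state` for reproducibility, and uses 100 trees so that every training row is very likely out of bag for at least one tree. The variable names are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: bundled iris data, 70/30 stratified split.
X_demo, y_demo = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

forest = RandomForestClassifier(
    n_estimators=100, max_depth=3, oob_score=True, random_state=0
)
forest.fit(X_train, y_train)

train_acc = forest.score(X_train, y_train)  # seen data, optimistic
oob_acc = forest.oob_score_                 # estimated from unseen-by-tree rows
test_acc = forest.score(X_test, y_test)     # truly held-out rows

print(f"train={train_acc:.3f}  oob={oob_acc:.3f}  test={test_acc:.3f}")
```

If the OOB and held-out accuracies land close to each other, that supports using `oob_score_` as a cheap substitute for a separate split.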

In [19]:
RFC.predict_proba(X[-10:])
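# The first row [0, 0, 1] means the probabilities of the 1st, 2nd, and 3rd class are 0, 0, and 1.
# The columns follow the order of RFC.classes_, which holds the class labels.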

array([[0.        , 0.        , 1.        ],
       [0.        , 0.        , 1.        ],
       [0.        , 0.02      , 0.98      ],
       [0.        , 0.        , 1.        ],
       [0.        , 0.        , 1.        ],
       [0.        , 0.        , 1.        ],
       [0.        , 0.025     , 0.975     ],
       [0.        , 0.        , 1.        ],
       [0.        , 0.02      , 0.98      ],
       [0.        , 0.02909091, 0.97090909]])
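
To read such a row with labels attached, the probabilities can be zipped with the class labels. The sketch below hard-codes the labels, assuming `RFC.classes_` holds the three Iris species in sorted order (the output above does not show it, so treat that as an assumption), and reuses the third row of the output.

```python
# Assumed value of RFC.classes_ (sorted species names).
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
# Third row of the predict_proba output shown above.
row = [0.0, 0.02, 0.98]

labelled = dict(zip(classes, row))
predicted = max(labelled, key=labelled.get)  # label with the highest probability

print(labelled)
print(predicted)
```

With a fitted model, `dict(zip(RFC.classes_, proba_row))` does the same without hard-coding anything.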

In [18]:
RFC.predict(X[-10:])

array(['Iris-virginica', 'Iris-virginica', 'Iris-virginica',
       'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
       'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
       'Iris-virginica'], dtype=object)

In [72]:
np.asarray(y[-10:])

array(['Iris-virginica', 'Iris-virginica', 'Iris-virginica',
       'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
       'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
       'Iris-virginica'], dtype=object)

In [26]:
X.columns

Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'], dtype='object')

In [25]:
RFC.feature_importances_

array([5.87508892e-02, 1.02105587e-04, 3.95135223e-01, 5.46011782e-01])
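
The importances line up with `X.columns` position by position, so pairing and sorting them makes the ranking easier to read. This sketch copies the column names and values from the two outputs above instead of needing the fitted model.

```python
# Values copied from the X.columns and feature_importances_ outputs above.
columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
importances = [5.87508892e-02, 1.02105587e-04, 3.95135223e-01, 5.46011782e-01]

# Pair each feature with its importance and sort from most to least important.
ranked = sorted(zip(columns, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name:<15}{value:.4f}")

# Impurity-based importances are normalised, so they sum to about 1.
total = sum(importances)
```

In this run the two petal measurements carry almost all of the importance. With the model at hand, `pd.Series(RFC.feature_importances_, index=X.columns).sort_values(ascending=False)` gives the same ranking.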

## Random forest: regression

In [28]:
data_ = pd.read_csv('/data/boston_housing.csv')
data_.head()

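| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |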


In [42]:
X_1 = data_.iloc[:, :-1]  # features: every column except the last
y_1 = data_.iloc[:, -1]   # target: medv
y_1.head()

0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: medv, dtype: float64

In [62]:
from sklearn.ensemble import RandomForestRegressor
RFR = RandomForestRegressor(n_estimators = 50,
                            max_depth = 5,
                            oob_score = True)

In [63]:
RFR.fit(X_1, y_1)

In [64]:
RFR.score(X_1, y_1)  # R^2 on the training data, so optimistic

0.9350605448891757

In [65]:
RFR.oob_score_  # R^2 estimated from out-of-bag samples

0.8547371911761887
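
For a regressor, `score` and `oob_score_` report the coefficient of determination R², not an accuracy. A minimal sketch of that formula, R² = 1 - SS_res / SS_tot, is shown below. `r2_score_manual` and the toy numbers are made up for this illustration and are unrelated to the housing data.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy targets, invented for this example.
y_true = [3.0, 5.0, 7.0, 9.0]

perfect = r2_score_manual(y_true, y_true)       # predictions match exactly
baseline = r2_score_manual(y_true, [6.0] * 4)   # always predict the mean

print(perfect, baseline)
```

A value of 1 means perfect predictions and 0 means no better than always predicting the mean, so the drop from about 0.94 (training) to about 0.85 (OOB) above indicates some overfitting to the training rows.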

In [66]:
RFR.predict(X_1[-10:])

array([18.87351831, 20.31996788, 20.56022146, 19.69044674, 20.32023983,
       23.67592372, 21.57452642, 28.05303872, 26.12277587, 21.1577987 ])

In [68]:
np.asarray(y_1[-10:])

array([19.7, 18.3, 21.2, 17.5, 16.8, 22.4, 20.6, 23.9, 22. , 11.9])

In [70]:
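# Root of the summed squared errors (Euclidean norm of the residuals) on the last 10 rows.
# This is not the RMSE: it sums rather than averages, and these rows were part of the training data.
np.sqrt(np.sum((RFR.predict(X_1[-10:]) - np.asarray(y_1[-10:]))**2))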

12.037123222379979
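
The value above grows with the number of rows, which makes it hard to compare across test sets. The root mean squared error (RMSE) divides by the row count first. The sketch below recomputes both quantities from the predictions and targets printed above, using only the standard library; the variable names are chosen for this example.

```python
import math

# Copied from the RFR.predict and y_1 outputs above (last 10 rows).
predicted = [18.87351831, 20.31996788, 20.56022146, 19.69044674, 20.32023983,
             23.67592372, 21.57452642, 28.05303872, 26.12277587, 21.1577987]
actual = [19.7, 18.3, 21.2, 17.5, 16.8, 22.4, 20.6, 23.9, 22.0, 11.9]

squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]

root_sum_sq = math.sqrt(sum(squared_errors))                 # what the cell above computes
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))  # averaged per row

print(f"root of summed squared errors: {root_sum_sq:.4f}")
print(f"RMSE: {rmse:.4f}")
```

The first number reproduces the 12.04 shown above, while the RMSE comes out at about 3.81, which is in the same units as `medv` and reads as a typical per-row error.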