<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#随机森林模型" data-toc-modified-id="随机森林模型-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Random Forest Model</a></span><ul class="toc-item"><li><span><a href="#导入需要的工具库" data-toc-modified-id="导入需要的工具库-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Import the Required Libraries</a></span></li><li><span><a href="#加载数据" data-toc-modified-id="加载数据-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Load the Data</a></span></li><li><span><a href="#构建模型" data-toc-modified-id="构建模型-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Build the Model</a></span></li></ul></li></ul></div>

## Random Forest Model

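1. A random forest uses CART trees as base learners and refines the Bagging approach.
2. Each base learner is trained on a Bootstrap sample of the rows, and each node considers only a randomly chosen subset of the features when splitting.
3. Bootstrap resampling reduces the influence of individual noisy points, and averaging over many trees lowers variance, which helps limit overfitting and improves generalization.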
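
The two sources of randomness in items 1 and 2 can be sketched with the standard library alone. This is an illustration, not scikit-learn's implementation: `bootstrap_indices` and `random_feature_subset` are hypothetical helper names, and the subset size of `sqrt(n_features)` is one common heuristic assumed here.

```python
import math
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def random_feature_subset(n_features, k, rng):
    """Pick k distinct candidate features, as would be done at each split."""
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)            # fixed seed so the sketch is repeatable
n_samples, n_features = 506, 13   # sizes of the dataset used in this notebook

sample = bootstrap_indices(n_samples, rng)
unique_fraction = len(set(sample)) / n_samples
# A bootstrap sample covers roughly 1 - 1/e (about 63%) of the rows;
# the remaining "out-of-bag" rows are never seen by that tree.
print(f"distinct rows in the bootstrap sample: {unique_fraction:.2f}")

k = max(1, int(math.sqrt(n_features)))   # assumed heuristic, k = 3 here
features = random_feature_subset(n_features, k, rng)
print("candidate features for one split:", features)
```

Because every tree gets a different row sample and every split a different feature subset, the trees make partly independent errors, which is what makes averaging them worthwhile.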


### Import the required libraries

In [1]:
import pandas as pd
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
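# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2;
# this notebook requires an older release.
from sklearn.datasets import load_boston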

### Load the data

In [2]:
boston_house = load_boston()

In [3]:
boston_feature_name = boston_house.feature_names
boston_features = boston_house.data
boston_target = boston_house.target

In [4]:
boston_feature_name

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

In [5]:
print(boston_house.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [6]:
boston_features[:5,:]

array([[6.3200e-03, 1.8000e+01, 2.3100e+00, 0.0000e+00, 5.3800e-01,
        6.5750e+00, 6.5200e+01, 4.0900e+00, 1.0000e+00, 2.9600e+02,
        1.5300e+01, 3.9690e+02, 4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01,
        6.4210e+00, 7.8900e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02,
        1.7800e+01, 3.9690e+02, 9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01,
        7.1850e+00, 6.1100e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02,
        1.7800e+01, 3.9283e+02, 4.0300e+00],
       [3.2370e-02, 0.0000e+00, 2.1800e+00, 0.0000e+00, 4.5800e-01,
        6.9980e+00, 4.5800e+01, 6.0622e+00, 3.0000e+00, 2.2200e+02,
        1.8700e+01, 3.9463e+02, 2.9400e+00],
       [6.9050e-02, 0.0000e+00, 2.1800e+00, 0.0000e+00, 4.5800e-01,
        7.1470e+00, 5.4200e+01, 6.0622e+00, 3.0000e+00, 2.2200e+02,
        1.8700e+01, 3.9690e+02, 5.3300e+00]])
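
The raw array is easier to read once each value is paired with its column name. The sketch below copies the feature names from `In [4]` and the first row from the output above, so it runs without scikit-learn; `first_row` and `labeled` are names introduced here for illustration.

```python
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
                 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
first_row = [6.3200e-03, 1.8000e+01, 2.3100e+00, 0.0000e+00, 5.3800e-01,
             6.5750e+00, 6.5200e+01, 4.0900e+00, 1.0000e+00, 2.9600e+02,
             1.5300e+01, 3.9690e+02, 4.9800e+00]

# Pair each column name with the matching value of the first sample.
labeled = dict(zip(feature_names, first_row))
for name, value in labeled.items():
    print(f"{name:<8}{value:>10.4f}")
```

With pandas already imported, `pd.DataFrame(boston_features, columns=boston_feature_name).head()` should give the same labeled view for all five rows.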

### Build the model

In [7]:
help(RandomForestRegressor)

Help on class RandomForestRegressor in module sklearn.ensemble.forest:

class RandomForestRegressor(ForestRegressor)
 |  A random forest regressor.
 |  
 |  A random forest is a meta estimator that fits a number of classifying
 |  decision trees on various sub-samples of the dataset and use averaging
 |  to improve the predictive accuracy and control over-fitting.
 |  The sub-sample size is always the same as the original
 |  input sample size but the samples are drawn with replacement if
 |  `bootstrap=True` (default).
 |  
 |  Read more in the :ref:`User Guide <forest>`.
 |  
 |  Parameters
 |  ----------
 |  n_estimators : integer, optional (default=10)
 |      The number of trees in the forest.
 |  
 |  criterion : string, optional (default="mse")
 |      The function to measure the quality of a split. Supported criteria
 |      are "mse" for the mean squared error, which is equal to variance
 |      reduction as feature selection criterion, and "mae" for the mean
 |      absolute 

In [8]:
rgs = RandomForestRegressor(n_estimators=15)
rgs = rgs.fit(boston_features, boston_target)

In [9]:
rgs

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=15, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
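
The repr shows `bootstrap=True` and `oob_score=False`. Since each tree is fitted on a bootstrap sample, the rows it never saw can act as a built-in validation set. The sketch below turns this on using synthetic data from `make_regression`, which is an assumption standing in for the Boston data; `n_estimators=100` and `random_state=0` are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in with the same shape as the Boston data (506 x 13).
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

train_r2 = forest.score(X, y)   # R^2 on the rows the trees were fitted on
oob_r2 = forest.oob_score_      # R^2 from out-of-bag predictions only
print(f"training R^2: {train_r2:.3f}, out-of-bag R^2: {oob_r2:.3f}")
```

The out-of-bag figure is usually the more honest of the two, because no tree contributes a prediction for a row it was trained on.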

In [10]:
rgs.predict(boston_features)

array([26.18666667, 22.06666667, 33.68666667, 34.83333333, 36.62666667,
       27.16      , 21.9       , 23.45333333, 16.32      , 19.52      ,
       18.58666667, 19.58666667, 21.36666667, 19.74666667, 19.01333333,
       19.72      , 22.48      , 17.30666667, 19.90666667, 19.01333333,
       14.01333333, 18.38666667, 15.22      , 14.70666667, 15.6       ,
       15.26666667, 16.89333333, 14.82      , 19.30666667, 21.82666667,
       13.89333333, 17.        , 14.82      , 13.54666667, 13.58      ,
       19.96666667, 20.58666667, 20.8       , 23.76      , 29.78      ,
       36.05333333, 28.34      , 25.20666667, 24.79333333, 21.77333333,
       19.52666667, 19.65333333, 18.19333333, 15.68666667, 19.65333333,
       19.91333333, 20.58666667, 25.18666667, 22.56      , 19.34666667,
       34.66666667, 23.72666667, 32.2       , 23.08      , 19.95333333,
       19.16666667, 18.38666667, 22.86      , 25.34666667, 33.33333333,
       24.02      , 19.62      , 21.77333333, 18.38666667, 21.29

In [15]:
rgs.score(boston_features,boston_target)

0.9743222749484213
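
For regressors, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The value above is computed on the same rows the forest was fitted on, so it measures fit rather than generalization. A minimal sketch of the formula follows; `r_squared` is a hypothetical helper, and the targets are toy values for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [24.0, 21.6, 34.7, 33.4, 36.2]   # toy targets, illustration only

perfect = r_squared(y_true, y_true)                    # every prediction exact
baseline = r_squared(y_true, [sum(y_true) / 5] * 5)    # always predict the mean
print(perfect, baseline)
```

Exact predictions give 1, and always predicting the mean gives 0, so a score near 0.97 means the forest explains most of the variance in the training targets.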

In [11]:
from sklearn import tree

In [12]:
rgs2 = tree.DecisionTreeRegressor()
rgs2.fit(boston_features, boston_target)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=None, splitter='best')

In [13]:
rgs2.predict(boston_features)

array([24. , 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15. ,
       18.9, 21.7, 20.4, 18.2, 19.9, 23.1, 17.5, 20.2, 18.2, 13.6, 19.6,
       15.2, 14.5, 15.6, 13.9, 16.6, 14.8, 18.4, 21. , 12.7, 14.5, 13.2,
       13.1, 13.5, 18.9, 20. , 21. , 24.7, 30.8, 34.9, 26.6, 25.3, 24.7,
       21.2, 19.3, 20. , 16.6, 14.4, 19.4, 19.7, 20.5, 25. , 23.4, 18.9,
       35.4, 24.7, 31.6, 23.3, 19.6, 18.7, 16. , 22.2, 25. , 33. , 23.5,
       19.4, 22. , 17.4, 20.9, 24.2, 21.7, 22.8, 23.4, 24.1, 21.4, 20. ,
       20.8, 21.2, 20.3, 28. , 23.9, 24.8, 22.9, 23.9, 26.6, 22.5, 22.2,
       23.6, 28.7, 22.6, 22. , 22.9, 25. , 20.6, 28.4, 21.4, 38.7, 43.8,
       33.2, 27.5, 26.5, 18.6, 19.3, 20.1, 19.5, 19.5, 20.4, 19.8, 19.4,
       21.7, 22.8, 18.8, 18.7, 18.5, 18.3, 21.2, 19.2, 20.4, 19.3, 22. ,
       20.3, 20.5, 17.3, 18.8, 21.4, 15.7, 16.2, 18. , 14.3, 19.2, 19.6,
       23. , 18.4, 15.6, 18.1, 17.4, 17.1, 13.3, 17.8, 14. , 14.4, 13.4,
       15.6, 11.8, 13.8, 15.6, 14.6, 17.8, 15.4, 21

In [14]:
rgs2.score(boston_features,boston_target)

1.0
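
A score of 1.0 means the unpruned tree reproduces every training target, which says nothing about unseen data. Holding out part of the data gives a fairer comparison between the two models. The sketch below assumes synthetic data from `make_regression` in place of the Boston data; the 25% test split and `random_state=0` are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in with the same shape as the Boston data (506 x 13).
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Store (training R^2, held-out R^2) for each model.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train R^2 = {scores[name][0]:.3f}, "
          f"test R^2 = {scores[name][1]:.3f}")
```

The gap between the training and test scores indicates how much each model overfits; the test column is the one to compare across models.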