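## [Key Points]
Make sure you understand what each hyperparameter of the random forest model means, and observe how tuning the hyperparameters affects the results.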

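## Assignment

1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of the regression model and the decision tree.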

In [1]:
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

In [2]:
# load the iris dataset
iris = datasets.load_iris()

# split into training / test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# build the model
clf = RandomForestClassifier(n_estimators=20,     # the number of trees
                             criterion = 'gini',  # The function to measure the quality of a split.
                             max_depth=4          # the maximum depth of the tree
                            )

# train the model
clf.fit(x_train, y_train)

# predict on the test set
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print(iris.feature_names)
print("Feature importance: ", clf.feature_importances_)


Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.09352897 0.03808263 0.41831698 0.45007141]


In [3]:
# establish random forest classification model
clf = RandomForestClassifier(n_estimators = 30,      # the number of trees
                             criterion = 'entropy',  # The function to measure the quality of a split.
                             max_depth = None        # the maximum depth of the tree
                            )

# train model on training set
clf.fit(x_train, y_train)

# predict test set
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc}")
print(f"Feature Importance:\n{clf.feature_importances_}")


Accuracy: 0.9736842105263158
Feature Importance:
[0.10089898 0.03038989 0.45202445 0.41668668]


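The iris dataset is a classification problem.  
With n_estimators=20, criterion = 'gini', max_depth=4, the accuracy is 0.9736842105263158.  
With n_estimators = 30, criterion = 'entropy', max_depth = None, the accuracy is 0.9736842105263158.  
The two accuracies are identical: each model classifies 37 of the 38 test samples correctly, so these hyperparameter changes make no measurable difference on this split.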
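
A single 38-sample test split is a coarse yardstick, and the forests above are built without a fixed `random_state`, so their scores can shift from run to run. As a rough sketch of a steadier comparison, the two settings can be scored with 5-fold cross-validation. The choices of `cv=5` and `random_state=0` are assumptions made for illustration and are not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()

# the two hyperparameter settings tried in the cells above
settings = {
    "gini, 20 trees, max_depth=4": dict(n_estimators=20, criterion="gini", max_depth=4),
    "entropy, 30 trees, max_depth=None": dict(n_estimators=30, criterion="entropy", max_depth=None),
}

cv_means = {}
for name, params in settings.items():
    # random_state=0 is an assumed seed so that repeated runs give the same forest
    clf = RandomForestClassifier(random_state=0, **params)
    # 5-fold cross-validation: every sample is used for testing exactly once
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.4f} (std = {scores.std():.4f})")
```

If the two mean accuracies differ by less than their standard deviations, the settings are probably indistinguishable on this dataset.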

# boston house-price dataset

In [4]:
# load boston house-price dataset, which is a regression problem
# (load_boston was removed in scikit-learn 1.2, so this cell needs an older version)
boston = datasets.load_boston()

print(f"boston.data.shape: {boston.data.shape}")
print(f"boston.data:\n{boston.data}")
print(f"boston.target\n{boston.target}")
print(f"boston.DESCR:\n{boston.DESCR}")

# split into training / test sets
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=4)

# establish random forest regression model
regressor = RandomForestRegressor(n_estimators = 30,   # the number of trees
                                  criterion = 'mse'    # mean squared error (renamed 'squared_error' in scikit-learn 1.0)
                                 )

# train model on training set
regressor.fit(x_train, y_train)

# predict test set
y_pred = regressor.predict(x_test)

acc = metrics.explained_variance_score(y_test, y_pred)    # Explained variance regression score function
print(f"Accuracy: {acc}")
print(f"Feature Importance:\n{regressor.feature_importances_}")


boston.data.shape: (506, 13)
boston.data:
[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00]
 [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00]
 [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00]
 ...
 [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00]
 [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00]
 [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]
boston.target
[24.  21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15.  18.9 21.7 20.4
 18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8
 18.4 21.  12.7 14.5 13.2 13.1 13.5 18.9 20.  21.  24.7 30.8 34.9 26.6
 25.3 24.7 21.2 19.3 20.  16.6 14.4 19.4 19.7 20.5 25.  23.4 18.9 35.4
 24.7 31.6 23.3 19.6 18.7 16.  22.2 25.  33.  23.5 19.4 22.  17.4 20.9
 24.2 21.7 22.8 23.4 24.1 21.4 20.  20.8 21.2 20.3 28.  23.9 24.8 22.9
 23.9 26.6 22.5 22.2 23.6 28.7 22.6 22.  22.9 25.  20.6 28.4 21.4 38.7
 43

In [5]:
# establish random forest regression model
regressor = RandomForestRegressor(n_estimators = 50,     # the number of trees
                                  criterion = 'mse',     # criterion is 'mse'
                                  max_depth = 10,        # maximum depth of trees
                                  min_samples_split = 2, # The minimum number of samples required to split an internal node
                                  min_samples_leaf = 1,  # The minimum number of samples required to be at a leaf node
                                 )

# train model on training set
regressor.fit(x_train, y_train)

# predict test set
y_pred = regressor.predict(x_test)

acc = metrics.explained_variance_score(y_test, y_pred)
print(f"Accuracy: {acc}")
print(f"Feature Importance:\n{regressor.feature_importances_}")


Accuracy: 0.8429151908934548
Feature Importance:
[0.05385649 0.00072104 0.00778785 0.00204491 0.01533325 0.46642843
 0.0121613  0.05415188 0.00270451 0.01981614 0.01873302 0.0102736
 0.33598758]


The Boston house-price dataset is a regression problem, so the value printed as "Accuracy" is the explained variance score, not a classification accuracy.  
With n_estimators = 30, criterion = 'mse', the score is <font color='blue'>0.8557933947731556</font>.  
With n_estimators = 50, criterion = 'mse', max_depth = 10, min_samples_split = 2, min_samples_leaf = 1, the score is <font color='blue'>0.8429151908934548</font>.  
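
The regression cells score the model with `metrics.explained_variance_score`. That metric ignores a constant offset in the predictions, whereas `r2_score` penalizes it. The toy arrays below are made up purely for illustration (they are not Boston data) and show the difference on predictions that are all off by exactly one unit.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

# hypothetical toy data: every prediction is too high by exactly 1
y_true = np.array([1.0, 2.0, 3.0])
y_pred = y_true + 1.0

# the residuals are constant, so their variance is 0 and the score is 1.0
ev = explained_variance_score(y_true, y_pred)

# R^2 = 1 - SS_res / SS_tot = 1 - 3 / 2 = -0.5, because the offset counts as error
r2 = r2_score(y_true, y_pred)

print(f"explained variance: {ev}")
print(f"R^2: {r2}")
```

Reporting `r2_score` or a mean squared error alongside the explained variance would expose any systematic bias that the explained variance score hides.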


# wine dataset

In [6]:
# load wine dataset, which is a classification problem
wine = datasets.load_wine()

print(f"wine.data.shape: {wine.data.shape}")
print(f"wine.data:\n{wine.data}")
print(f"wine.target:\n{wine.target}")
print(f"wine.target_names: {wine.target_names}")
print(f"wine.DESCR:\n{wine.DESCR}")

# split into training / test sets
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)

# establish random forest classification model
clf = RandomForestClassifier(n_estimators = 30,      # the number of trees
                             criterion = 'gini',     # The function to measure the quality of a split.
                             max_depth = 10,         # maximum depth of trees
                             min_samples_split = 2,  # The minimum number of samples required to split an internal node
                             min_samples_leaf = 1,   # The minimum number of samples required to be at a leaf node
                            )

# train model on training set
clf.fit(x_train, y_train)

# predict test set
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc}")
print(f"Feature Importance:\n{clf.feature_importances_}")


wine.data.shape: (178, 13)
wine.data:
[[1.423e+01 1.710e+00 2.430e+00 ... 1.040e+00 3.920e+00 1.065e+03]
 [1.320e+01 1.780e+00 2.140e+00 ... 1.050e+00 3.400e+00 1.050e+03]
 [1.316e+01 2.360e+00 2.670e+00 ... 1.030e+00 3.170e+00 1.185e+03]
 ...
 [1.327e+01 4.280e+00 2.260e+00 ... 5.900e-01 1.560e+00 8.350e+02]
 [1.317e+01 2.590e+00 2.370e+00 ... 6.000e-01 1.620e+00 8.400e+02]
 [1.413e+01 4.100e+00 2.740e+00 ... 6.100e-01 1.600e+00 5.600e+02]]
wine.target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
wine.target_names: ['class_0' 'class_1' 'class_2']
wine.DESCR:
.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of 

In [7]:
# establish random forest classification model
clf = RandomForestClassifier(n_estimators = 30,      # the number of trees
                             criterion = 'entropy',  # The function to measure the quality of a split.
                             max_depth = 10,         # maximum depth of trees
                             min_samples_split = 2,  # The minimum number of samples required to split an internal node
                             min_samples_leaf = 1,   # The minimum number of samples required to be at a leaf node
                            )

# train model on training set
clf.fit(x_train, y_train)

# predict test set
y_pred = clf.predict(x_test)

acc = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc}")
print(f"Feature Importance:\n{clf.feature_importances_}")


Accuracy: 1.0
Feature Importance:
[0.1106323  0.0390639  0.01613925 0.03083174 0.05031585 0.07215543
 0.2275862  0.01340556 0.01392851 0.1338617  0.03893294 0.12900403
 0.12414258]


The wine dataset is a classification problem.  
When n_estimators = 30,  
     criterion = <font color='blue'>'gini'</font>,  
     max_depth = 10,  
     min_samples_split = 2,  
     min_samples_leaf = 1,  
     Accuracy is <font color='blue'>0.9777777777777777</font>.  
  
When n_estimators = 30,  
     criterion = <font color='blue'>'entropy'</font>,  
     max_depth = 10,  
     min_samples_split = 2,  
     min_samples_leaf = 1,  
     Accuracy is <font color='blue'>1.0</font>.  
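
The two wine runs differ by a single test sample (44 versus 45 of the 45 test samples correct), so the gap between 'gini' and 'entropy' may be due to chance. The assignment also asks for a comparison with a decision tree. A minimal sketch of that comparison, reusing the same split, could look like the following. The `DecisionTreeClassifier` settings and `random_state=0` are assumptions chosen for illustration and do not come from the original notebook.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

# same split as the cells above
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

# random_state=0 is an assumed seed so that both models are reproducible
models = {
    "decision tree": DecisionTreeClassifier(max_depth=10, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=30, max_depth=10, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    # accuracy on the held-out 25% of the data
    results[name] = accuracy_score(y_test, model.predict(x_test))
    print(f"{name}: accuracy = {results[name]:.4f}")
```

With only 45 test samples, each misclassified sample moves the accuracy by about 0.022, so small differences between the two models should be read with caution.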
 