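## [Key points]
Make sure you understand what each hyperparameter of the random forest model means, and observe how tuning the hyperparameters affects the results.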

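## Assignment

1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with a regression model and a decision tree.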

In [1]:
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [2]:
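# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Build the model (20 trees, each with a maximum depth of 4)
# No random_state is set on the forest, so results can vary slightly between runs
clf = RandomForestClassifier(n_estimators=20, max_depth=4)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)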

In [3]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158
Feature importance:  [0.08206226 0.021341   0.4189045  0.47769225]
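The importance array is ordered like `iris.feature_names` (sepal length, sepal width, petal length, petal width), so in the output above the two petal measurements carry most of the weight. A minimal sketch that prints each importance next to its feature name follows; `random_state=0` on the forest is an assumption added here for repeatability and is not part of the original cell, so the exact numbers will differ from those printed above.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# random_state=0 is an assumption added for repeatability
clf = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0)
clf.fit(x_train, y_train)

# Pair each importance with its feature name, largest first
ranked = sorted(
    zip(iris.feature_names, clf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name:<20s} {importance:.3f}")
```

The importances are normalized, so they sum to 1 and can be read as relative shares.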


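### Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change
=> Raising n_estimators from 20 to 40 leaves the accuracy unchanged. With max_depth=2 the accuracy drops (0.947, i.e. 36 of 38 test samples correct instead of 37), while max_depth=10 gives the same accuracy as max_depth=4 (0.974).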

In [4]:
clf = RandomForestClassifier(n_estimators=40, max_depth=4)
# Train the model
clf.fit(x_train, y_train)
# Predict on the test set
y_pred = clf.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158
Feature importance:  [0.10242546 0.02200374 0.4527923  0.42277849]


In [6]:
clf = RandomForestClassifier(n_estimators=40, max_depth=2)
# Train the model
clf.fit(x_train, y_train)
# Predict on the test set
y_pred = clf.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9473684210526315
Feature importance:  [0.1146021  0.00838293 0.47349274 0.40352223]


In [7]:
clf = RandomForestClassifier(n_estimators=40, max_depth=10)
# Train the model
clf.fit(x_train, y_train)
# Predict on the test set
y_pred = clf.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158
Feature importance:  [0.1087875  0.0459779  0.43710357 0.40813104]
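The three cells above change one setting at a time and do not fix the forest's random seed, so small differences can come from randomness as well as from the hyperparameters. A hedged sketch of the same comparison as a single sweep is shown here; the grid values mirror the ones tried above, and `random_state=0` is an assumption added so that repeated runs give the same numbers.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Sweep the two hyperparameters explored above
results = {}
for n_estimators in (20, 40):
    for max_depth in (2, 4, 10):
        clf = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            random_state=0,  # assumption: fixed seed for repeatability
        )
        clf.fit(x_train, y_train)
        acc = metrics.accuracy_score(y_test, clf.predict(x_test))
        results[(n_estimators, max_depth)] = acc
        print(f"n_estimators={n_estimators:3d} max_depth={max_depth:2d} accuracy={acc:.4f}")
```

With only 38 test samples, each misclassified sample moves the accuracy by about 0.026, so small differences between settings should be read with caution.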


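### Switch to other datasets (boston, wine) and compare the results with a regression model and a decision tree
=> In the runs below the decision tree does no worse than the random forest. Note that the two scores are not directly comparable: the random forest is a classifier scored by accuracy on integer-cast prices (test 0.18), while the decision tree is a regressor scored by R^2 (test 0.73).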

In [17]:
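# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version
boston = datasets.load_boston()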
#print(boston.feature_names)
boston.data

array([[6.3200e-03, 1.8000e+01, 2.3100e+00, ..., 1.5300e+01, 3.9690e+02,
        4.9800e+00],
       [2.7310e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9690e+02,
        9.1400e+00],
       [2.7290e-02, 0.0000e+00, 7.0700e+00, ..., 1.7800e+01, 3.9283e+02,
        4.0300e+00],
       ...,
       [6.0760e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        5.6400e+00],
       [1.0959e-01, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9345e+02,
        6.4800e+00],
       [4.7410e-02, 0.0000e+00, 1.1930e+01, ..., 2.1000e+01, 3.9690e+02,
        7.8800e+00]])

In [39]:
boston = datasets.load_boston()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=4)

# boston.target is a continuous price; casting it to int makes every integer price its own class,
# so the scores below are classification accuracy, not R^2
clf = RandomForestClassifier(n_estimators=80, max_depth=10)

clf.fit(x_train, y_train.astype('int'))
print("Training score:%f" % clf.score(x_train, y_train.astype('int')))
print("Test score:%f" % clf.score(x_test, y_test.astype('int')))



Training score:0.994723
Test score:0.181102
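The gap between the training and test scores above comes largely from treating a continuous target as class labels: after `astype('int')` each integer price is a separate class, and the test score counts only exact matches. For a regression target the natural counterpart is `RandomForestRegressor`, whose `score()` returns R^2. The sketch below compares it with a decision tree and a linear regression. Because `load_boston` is no longer shipped with recent scikit-learn releases, it uses `load_diabetes` as a stand-in regression dataset; that substitution and the `random_state=0` seeds are assumptions, so the numbers are not comparable with the Boston results in this notebook.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression dataset (assumption: not the notebook's Boston data)
data = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=4
)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(
        n_estimators=80, max_depth=10, random_state=0
    ),
}

# For regressors, score() is the coefficient of determination R^2
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = (model.score(x_train, y_train), model.score(x_test, y_test))
    print(f"{name:<18s} train R^2={scores[name][0]:.3f} test R^2={scores[name][1]:.3f}")
```

Scoring all three models with the same metric on the same split is what makes the comparison meaningful.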


### boston with decision tree

In [31]:
from sklearn.tree import DecisionTreeRegressor
boston = datasets.load_boston()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=4)

# For a regressor, score() returns R^2
regr = DecisionTreeRegressor()
regr.fit(x_train, y_train)
print("Training score:%f" % regr.score(x_train, y_train))
print("Test score:%f" % regr.score(x_test, y_test))

Training score:1.000000
Test score:0.727582
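The assignment also names the wine dataset, which is a classification problem and therefore suits `RandomForestClassifier` directly. A minimal sketch of that comparison follows, using logistic regression as the "regression model" baseline. The choice of logistic regression, the scaling step, the hyperparameter values, and the `random_state=0` seeds are all assumptions made for illustration and do not come from the original notebook.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

models = {
    # Scaling helps logistic regression converge; the trees do not need it
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(
        n_estimators=40, max_depth=4, random_state=0
    ),
}

# For classifiers, score() is the mean accuracy
accuracies = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    accuracies[name] = model.score(x_test, y_test)
    print(f"{name:<20s} test accuracy={accuracies[name]:.4f}")
```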
