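## [Key Points]
Make sure you understand what each hyperparameter of the random forest model means, and observe how tuning the hyperparameters affects the results.

## Assignment

1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model and a decision tree.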
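Before tuning anything, it can help to look at the defaults of the hyperparameters this exercise touches. The snippet below is a minimal sketch that reads them from an unfitted `RandomForestClassifier` through `get_params()`. The short descriptions in the comments are a summary, not a substitute for the library documentation, and the printed defaults depend on the installed scikit-learn version.

```python
from sklearn.ensemble import RandomForestClassifier

# An unfitted estimator already carries its default hyperparameters
clf = RandomForestClassifier()
params = clf.get_params()

# The four hyperparameters adjusted in this notebook
descriptions = {
    "n_estimators": "number of trees in the forest",
    "max_depth": "maximum depth of each tree (None means grow until leaves are pure)",
    "min_samples_split": "minimum samples a node needs before it may be split",
    "min_samples_leaf": "minimum samples that must remain in each leaf",
}

for name, meaning in descriptions.items():
    print("%s = %r  (%s)" % (name, params[name], meaning))
```

Larger values of `min_samples_split` and `min_samples_leaf`, and smaller values of `max_depth`, all restrict how far each tree can grow.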

In [1]:
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

In [2]:
# Load the iris dataset
iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

In [3]:
clf = RandomForestClassifier(n_estimators=20,
                             max_depth=4)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print(iris.feature_names)
print("Feature importance: ", clf.feature_importances_)

Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.10598857 0.0157033  0.44318959 0.43511854]


In [4]:
clf2 = RandomForestClassifier(n_estimators=20,
                              max_depth=4,
                              min_samples_split=5,
                              min_samples_leaf=2)
clf2.fit(x_train, y_train)
y_pred2 = clf2.predict(x_test)
acc2 = metrics.accuracy_score(y_test, y_pred2)
print("Accuracy: ", acc2)
print(iris.feature_names)
print("Feature importance: ", clf2.feature_importances_)

Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.06471293 0.02356564 0.48430128 0.42742015]


In [5]:
clf3 = RandomForestClassifier(n_estimators=20,
                              max_depth=6,
                              min_samples_split=5,
                              min_samples_leaf=2)
clf3.fit(x_train, y_train)
y_pred3 = clf3.predict(x_test)  # predict with clf3, the model fitted in this cell
acc3 = metrics.accuracy_score(y_test, y_pred3)
print("Accuracy: ", acc3)
print(iris.feature_names)
print("Feature importance: ", clf3.feature_importances_)

Accuracy:  0.9736842105263158
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Feature importance:  [0.08485435 0.02690703 0.4661392  0.42209942]


1. The feature importances shift slightly between configurations, but petal length and petal width remain the two most important features.
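The cells above do not set `random_state`, so part of the variation between runs comes from the random bootstrapping and feature sampling inside the forest rather than from the hyperparameters. The following is a minimal sketch of one way to separate the two effects: fix the seed and score each configuration with 5-fold cross-validation. The seed value, the `settings` list, and the use of `cross_val_score` are assumptions added for illustration and are not part of the original notebook.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

# The three configurations tried above, collected in one (hypothetical) list
settings = [
    {"n_estimators": 20, "max_depth": 4},
    {"n_estimators": 20, "max_depth": 4, "min_samples_split": 5, "min_samples_leaf": 2},
    {"n_estimators": 20, "max_depth": 6, "min_samples_split": 5, "min_samples_leaf": 2},
]

results = []
for params in settings:
    # A fixed random_state makes every run of this loop reproducible
    clf = RandomForestClassifier(random_state=0, **params)
    # 5-fold cross-validation gives a less noisy estimate than a single split
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
    results.append((params, scores.mean(), scores.std()))
    print(params, "mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

If the mean accuracies differ by less than their standard deviations, the settings are probably indistinguishable on a dataset as small as iris.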

In [6]:
# Wine
wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.25, random_state=4)

In [7]:
# Random Forest
clf = RandomForestClassifier(n_estimators=20,
                             max_depth=4)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print(list(wine.feature_names))
print("Feature importance: ")
print(clf.feature_importances_)

Accuracy:  0.9777777777777777
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Feature importance: 
[0.11094466 0.00911519 0.00431736 0.04071617 0.0782967  0.04256974
 0.12221855 0.00172425 0.01840273 0.14884986 0.10947809 0.12356926
 0.18979745]


In [8]:
# Decision Tree
clf = DecisionTreeClassifier(max_depth=4)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)
print(list(wine.feature_names))
print("Feature importance: ")
print(clf.feature_importances_)

Accuracy:  0.8888888888888888
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Feature importance: 
[0.01407439 0.         0.         0.         0.04544912 0.
 0.01904535 0.         0.         0.34772309 0.         0.17272143
 0.40098662]


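2. The decision tree and the random forest weight the features differently. Both rank proline, color_intensity, and od280/od315_of_diluted_wines as the top three, but the decision tree assigns zero importance to 7 of the 13 features, while the random forest spreads importance across nearly all of them (flavanoids, alcohol, and hue each receive more than 0.10). On this test split the random forest is also more accurate (0.978 versus 0.889).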
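Raw importance arrays are hard to compare by eye, so it may be easier to sort them and pair each value with its feature name. The sketch below refits both models on the same wine split and prints the top five features of each side by side. The helper `ranked_importances` and the `random_state=0` arguments are hypothetical additions for illustration, so the exact numbers will not match the outputs recorded above.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4)


def ranked_importances(model, feature_names):
    """Return (name, importance) pairs sorted from most to least important."""
    pairs = zip(feature_names, model.feature_importances_)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# random_state=0 is an assumption added here for reproducibility
forest = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0)
forest.fit(x_train, y_train)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(x_train, y_train)

forest_rank = ranked_importances(forest, wine.feature_names)
tree_rank = ranked_importances(tree, wine.feature_names)

# Print the five most important features of each model side by side
print("%-38s %-38s" % ("Random forest", "Decision tree"))
for (f_name, f_imp), (t_name, t_imp) in zip(forest_rank[:5], tree_rank[:5]):
    print("%-30s %.3f   %-30s %.3f" % (f_name, f_imp, t_name, t_imp))
```

A single tree only credits the features it actually splits on, which is why many of its importances are exactly zero. The forest averages over trees built on different bootstrap samples and feature subsets, so correlated features tend to share the credit.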