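## [Assignment focus]
Make sure you understand what each hyperparameter of the random forest model means, and observe how tuning the hyperparameters affects the results.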

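## Assignment

1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the results with those of a regression model and a decision tree.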
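Before tuning anything, it can help to see which hyperparameters the model actually exposes. The sketch below lists a handful of them through `get_params()`; the selection of names in `keys_of_interest` is only an illustrative choice, not a complete list.

```python
from sklearn.ensemble import RandomForestClassifier

# Same configuration as the first experiment in this notebook
forest = RandomForestClassifier(n_estimators=20, max_depth=4)
params = forest.get_params()

# A few commonly tuned hyperparameters (illustrative selection):
#   n_estimators      - number of trees in the forest
#   max_depth         - maximum depth of each tree
#   max_features      - number of features considered at each split
#   min_samples_split - minimum samples required to split a node
#   min_samples_leaf  - minimum samples required in a leaf
#   bootstrap         - whether each tree is fit on a bootstrap sample
keys_of_interest = [
    "n_estimators", "max_depth", "max_features",
    "min_samples_split", "min_samples_leaf", "bootstrap",
]
for key in keys_of_interest:
    print(f"{key:18s} = {params[key]}")
```

Any of these names can be passed as a keyword argument to `RandomForestClassifier(...)`.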

In [49]:
from sklearn import datasets, metrics, linear_model
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor 
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score

## 1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change

### Original parameters

In [2]:
# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=4)

# Build the model (20 trees, each with a maximum depth of 4)
clf = RandomForestClassifier(n_estimators=20, max_depth=4)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

In [3]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.9736842105263158
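Because the forest above is built without a `random_state`, its accuracy can shift slightly from run to run. A minimal sketch for gauging that run-to-run spread, assuming the same split and hyperparameters as the cell above; the seed range `range(10)` is an arbitrary choice:

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

# Refit the same configuration with several forest seeds (arbitrary values)
seed_accuracies = []
for seed in range(10):
    model = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=seed)
    model.fit(x_train, y_train)
    seed_accuracies.append(accuracy_score(y_test, model.predict(x_test)))

print("min accuracy: ", min(seed_accuracies))
print("mean accuracy:", np.mean(seed_accuracies))
print("max accuracy: ", max(seed_accuracies))
```

If a change in hyperparameters moves accuracy by less than this spread, the difference may simply be noise.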


In [4]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [5]:
print("Feature importance: ", clf.feature_importances_)

Feature importance:  [0.09626413 0.01238866 0.38447699 0.50687021]
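The importance array is in the same order as `iris.feature_names`, so pairing and sorting the two makes the ranking easier to read. A small sketch, assuming the same split and hyperparameters as above; `random_state=0` is added here only so the sketch is repeatable, and the exact values will differ from the output recorded in this notebook:

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4
)

forest = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0)
forest.fit(x_train, y_train)

# Sort feature indices from most to least important
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]
ranked = [(iris.feature_names[i], float(importances[i])) for i in order]

for name, score in ranked:
    print(f"{name:20s} {score:.3f}")
```

The importances are normalized to sum to 1, so each value can be read as a share of the total impurity reduction.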


### Adjusted parameters

In [17]:
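# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets (note: test_size is 0.30 here, versus 0.25 in the original run)
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.30, random_state=4)

# Build the model (50 trees, each with a maximum depth of 8)
clf = RandomForestClassifier(n_estimators=50, max_depth=8)

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)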

In [18]:
acc = metrics.accuracy_score(y_test, y_pred)
print("Accuracy: ", acc)

Accuracy:  0.9777777777777777


In [19]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [20]:
print("Feature importance: ", clf.feature_importances_)

Feature importance:  [0.13536575 0.03414215 0.41663571 0.41385639]
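The two runs above differ in both `test_size` (0.25 vs 0.30) and the hyperparameters, so the change in accuracy cannot be attributed to the hyperparameters alone. One way to isolate their effect is to hold the data and seed fixed and vary only the hyperparameters. The sketch below does this with 5-fold cross-validation; the grid values and `random_state=0` are arbitrary choices for illustration.

```python
from itertools import product

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

# Illustrative grid; only the hyperparameters change between runs
n_estimators_grid = [5, 20, 50]
max_depth_grid = [1, 4, 8]

results = {}
for n_estimators, max_depth in product(n_estimators_grid, max_depth_grid):
    model = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    )
    # Mean accuracy over 5 stratified folds
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    results[(n_estimators, max_depth)] = scores.mean()
    print(f"n_estimators={n_estimators:3d} max_depth={max_depth}  "
          f"mean accuracy={scores.mean():.3f}")
```

On a dataset as small and easy as iris, most configurations are likely to score similarly, which is itself a useful observation about how sensitive the model is here.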


## 2. Switch to other datasets (boston, wine) and compare with the results of a regression model and a decision tree

In [23]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell requires an older release
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.25, random_state=4)

### Random forest model

In [29]:
# Build the model (50 trees, each with a maximum depth of 8)
clf = RandomForestRegressor(n_estimators=50, max_depth=8)

# Train the model
clf.fit(x_train, y_train)
# Predict on the test set
y_pred = clf.predict(x_test)

In [34]:
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred))

Mean squared error: 16.09


### Linear regression

In [42]:
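# Split the data (note: test_size is 0.1 here, versus 0.25 for the random forest above)
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.1, random_state=4)

# Build the model
regr = linear_model.LinearRegression()

# Fit on the training data
regr.fit(x_train, y_train)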

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [43]:
y_pred = regr.predict(x_test)
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, y_pred))

Mean squared error: 17.04


### Decision tree

In [45]:
x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2, random_state=4)

In [50]:
# Build the model
clf = DecisionTreeRegressor()

# Train the model
clf.fit(x_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

In [51]:
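# Score the decision tree (clf). The value recorded below was produced with
# regr.predict(x_test), i.e. the linear model on this split, not the tree.
y_pred = clf.predict(x_test)
mean_squared_error(y_test, y_pred)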

24.224586646487133
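The three regression cells above each use a different `test_size` (0.25, 0.1, 0.2), so their MSE values are measured on different test sets and are not directly comparable. A fairer comparison fits all three models on one shared split. The sketch below assumes `load_diabetes` as a stand-in dataset, since `load_boston` is not available in recent scikit-learn releases; the numbers it prints are therefore not comparable to the Boston results in this notebook. The `random_state=0` values are arbitrary.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in dataset (assumption): diabetes ships with scikit-learn
data = datasets.load_diabetes()
x_train, x_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=4
)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
    "random forest": RandomForestRegressor(
        n_estimators=50, max_depth=8, random_state=0
    ),
}

# Fit every model on the same training set and score on the same test set
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    scores[name] = (mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred))
    print(f"{name:18s} MSE={scores[name][0]:10.2f}  R2={scores[name][1]:.3f}")
```

Reporting R² alongside MSE gives a scale-free number, which helps when moving between datasets whose targets have different units.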

In [52]:
# Print the nonzero coefficients of the linear model (regr), one per feature
for k, w in zip(boston.feature_names, regr.coef_):
    if w != 0:
        print(k, w)

CRIM -0.12585665878406954
ZN 0.0484257396100201
INDUS 0.01840852809252633
CHAS 3.085095691516899
NOX -17.327701820564606
RM 3.6167471330861467
AGE 0.0021918185271774765
DIS -1.4936113225001264
RAD 0.3199792000272681
TAX -0.01272946486141267
PTRATIO -0.927469085924641
B 0.009509124683760478
LSTAT -0.5335924706228666
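The assignment also lists the wine dataset, which the cells above do not cover. Wine is a classification problem, so the counterpart of the regression comparison is logistic regression versus a decision tree versus a random forest. A minimal sketch under stated assumptions: one shared split, a `StandardScaler` in front of the logistic regression to help it converge, and arbitrary `random_state` values.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

wine = datasets.load_wine()
x_train, x_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.25, random_state=4
)

models = {
    # Scaling matters for logistic regression; the tree models do not need it
    "logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(
        n_estimators=50, max_depth=8, random_state=0
    ),
}

# Same training and test sets for every model
accuracies = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(x_test))
    print(f"{name:20s} accuracy={accuracies[name]:.3f}")
```

With only a few dozen test samples, a difference of one or two misclassified samples changes accuracy by several percentage points, so small gaps between models should be read cautiously.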
