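## [Assignment Focus]
Make sure you understand what each hyperparameter of the random forest model means, and observe how tuning the hyperparameters affects the results.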

## Assignment

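1. Try adjusting the parameters of RandomForestClassifier(...) and observe whether the results change.
2. Switch to other datasets (boston, wine) and compare the random forest with regression models and decision trees.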
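
As a starting point for item 1, the sketch below varies only `max_depth` and holds everything else fixed, so any change in score can be attributed to that one hyperparameter. The specific values (`n_estimators=50`, the three depths, `cv=5`, `random_state=0`) are assumptions chosen for a quick illustration, not settings taken from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Vary max_depth only; every other setting stays fixed.
# random_state is fixed so repeated runs build the same forests.
depth_scores = {}
for max_depth in [1, 3, None]:
    clf = RandomForestClassifier(
        n_estimators=50, max_depth=max_depth, random_state=0
    )
    # 5-fold cross-validated accuracy for this setting
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    depth_scores[max_depth] = scores.mean()
    print(f"max_depth={max_depth}: mean CV accuracy = {scores.mean():.5f}")
```

The same loop can be pointed at `n_estimators`, `min_samples_leaf`, or `max_features` to study those hyperparameters one at a time.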

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings

warnings.simplefilter('ignore')

# Datasets
from sklearn import datasets

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

# Model
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression

# Evaluation
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score

### Hyperparameter tuning

In [6]:
# Load the iris dataset
iris = datasets.load_iris()

# Convert to a DataFrame for easier inspection
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
display(iris_df.head())

# Features and target
X = iris_df  # X must be 2-D (samples x features)
y = iris.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [7]:
# Create a RandomForestClassifier model
rf_clf = RandomForestClassifier()

# Model training and hyper-parameters tuning
rf_clf_param_grid = {"max_depth": [3, 5, 10, None],
                     "min_samples_split": [2, 5, 10, 20],
                     "min_samples_leaf": [1, 2, 5],
                     "max_features": ["sqrt", None],
                     "bootstrap": [True],
                     "n_estimators": [100, 250],
                     "criterion": ['entropy','gini']}

gsrf_clf = GridSearchCV(rf_clf, param_grid=rf_clf_param_grid, cv=5, scoring="accuracy", n_jobs=-1, verbose=1)
gsrf_clf.fit(X_train, y_train)


# Best score
print(f"Best CV score of RandomForestClassifier: {(gsrf_clf.best_score_):.5f}")

# Best estimator (refit on the full training set with the best parameters found)
gsrf_clf_best = gsrf_clf.best_estimator_
print("Best parameters of RandomForestClassifier:\n", gsrf_clf_best)

# Predict by model
y_pred = gsrf_clf_best.predict(X_test)

# Accuracy on the held-out test set
print(f"Accuracy of best RandomForestClassifier: {accuracy_score(y_test, y_pred):.5f}")

Fitting 5 folds for each of 384 candidates, totalling 1920 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   21.3s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:   41.0s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done 1920 out of 1920 | elapsed:  4.3min finished


Best CV score of RandomForestClassifier: 0.96429
Best parameters of RandomForestClassifier:
 RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=3, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=5, min_samples_split=20,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
Accuracy of best RandomForestClassifier: 0.97368
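
The log line "Fitting 5 folds for each of 384 candidates, totalling 1920 fits" follows directly from the size of the grid: the number of candidates is the product of the list lengths, and each candidate is fitted once per fold. A minimal sketch that reproduces this count with `ParameterGrid`, without training anything (the variable names `n_candidates`, `n_folds`, and `n_fits` are introduced here for illustration):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["sqrt", None],
    "bootstrap": [True],
    "n_estimators": [100, 250],
    "criterion": ["entropy", "gini"],
}

# 4 * 4 * 3 * 2 * 1 * 2 * 2 combinations
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 384 1920
```

Checking the count this way before calling `fit` gives a rough idea of the cost: adding one more value to any list multiplies the number of fits accordingly. With the default `refit=True`, `GridSearchCV` additionally fits the best candidate once more on the whole training set.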


### Wine dataset

In [8]:
# Load the wine dataset (classification problem); wine is a dict-like Bunch object
wine = datasets.load_wine()

# Convert to a DataFrame for easier inspection
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
display(wine_df.head())

# Use all features in the dataset
X = wine_df  # X must be 2-D (samples x features)
y = wine.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# Create the models
log_reg = LogisticRegression()
dct_clf = DecisionTreeClassifier()
rf_clf = RandomForestClassifier()

# Train the models
log_reg.fit(X_train, y_train)
dct_clf.fit(X_train, y_train)
rf_clf.fit(X_train, y_train)

# Predict on the test set
y_pred_log = log_reg.predict(X_test)
y_pred_dct = dct_clf.predict(X_test)
y_pred_rf = rf_clf.predict(X_test)

# Classification is evaluated with accuracy
print(f"Accuracy of LogisticRegression: {accuracy_score(y_test, y_pred_log):.5f}")
print(f"Accuracy of DecisionTreeClassifier: {accuracy_score(y_test, y_pred_dct):.5f}")
print(f"Accuracy of RandomForestClassifier: {accuracy_score(y_test, y_pred_rf):.5f}")

| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |


Accuracy of LogisticRegression: 1.00000
Accuracy of DecisionTreeClassifier: 1.00000
Accuracy of RandomForestClassifier: 1.00000
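
All three models score 1.00000 here, but with `test_size=0.1` the test set holds only about 18 of the 178 wine samples, so a single split like this cannot tell the models apart. One possible way to get a finer comparison is k-fold cross-validation over the whole dataset. The sketch below is an illustration under assumptions that are not in the original cell: `cv=5`, `random_state=0` for the tree-based models, and `max_iter=10000` for `LogisticRegression` to give the solver room to converge on unscaled features.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Settings below are assumptions for this sketch, not the notebook's defaults
models = {
    "LogisticRegression": LogisticRegression(max_iter=10000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_means = {}
for name, model in models.items():
    # Every sample is used for testing exactly once across the 5 folds
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {scores.mean():.5f} (std {scores.std():.5f})")
```

The per-fold standard deviation printed alongside the mean gives a sense of how much the score moves from one split to another.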


In [9]:
# Load the Boston housing dataset (regression problem); boston is a dict-like Bunch object
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version
boston = datasets.load_boston()

# Convert to a DataFrame for easier inspection
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
display(boston_df.head())

# Use all features in the dataset
X = boston_df  # X must be 2-D (samples x features)
y = boston.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# Create the models
lin_reg = LinearRegression()
dct_reg = DecisionTreeRegressor()
rf_reg = RandomForestRegressor()

# Train the models
lin_reg.fit(X_train, y_train)
dct_reg.fit(X_train, y_train)
rf_reg.fit(X_train, y_train)

# Predict on the test set
y_pred_lin = lin_reg.predict(X_test)
y_pred_dct = dct_reg.predict(X_test)
y_pred_rf = rf_reg.predict(X_test)

# Regression is evaluated with MSE and R-squared
print(f"Mean squared error of LinearRegression: {mean_squared_error(y_test, y_pred_lin):.5f}")
print(f"R square of LinearRegression: {r2_score(y_test, y_pred_lin):.5f}")
print(f"Mean squared error of DecisionTreeRegressor: {mean_squared_error(y_test, y_pred_dct):.5f}")
print(f"R square of DecisionTreeRegressor: {r2_score(y_test, y_pred_dct):.5f}")
print(f"Mean squared error of RandomForestRegressor: {mean_squared_error(y_test, y_pred_rf):.5f}")
print(f"R square of RandomForestRegressor: {r2_score(y_test, y_pred_rf):.5f}")

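| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |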


Mean squared error of LinearRegression: 41.72458
R square of LinearRegression: 0.51497
Mean squared error of DecisionTreeRegressor: 31.62784
R square of DecisionTreeRegressor: 0.63234
Mean squared error of RandomForestRegressor: 24.60834
R square of RandomForestRegressor: 0.71394
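
The two metrics printed above are not independent: on a fixed test set, R-squared equals 1 - MSE / Var(y_test), where the variance is the population variance of the test targets. This is why the ranking by MSE (lower is better) and by R-squared (higher is better) agree across the three models, and each MSE/R-squared pair in the output implies the same test-target variance of roughly 86. The sketch below checks the identity on a small set of made-up values; the arrays are illustrative only and are not taken from the Boston data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values chosen for illustration (hypothetical, not from the dataset)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

mse = mean_squared_error(y_true, y_pred)

# R^2 = 1 - MSE / Var(y_true), using the population variance (ddof=0)
r2_manual = 1.0 - mse / np.var(y_true)
r2_sklearn = r2_score(y_true, y_pred)

print(f"MSE = {mse:.4f}")               # MSE = 0.4375
print(f"R^2 (manual)  = {r2_manual:.4f}")   # R^2 (manual)  = 0.9125
print(f"R^2 (sklearn) = {r2_sklearn:.4f}")  # R^2 (sklearn) = 0.9125
```

Because the denominator depends on the test set, R-squared values are comparable between models evaluated on the same split, but not directly between different splits or datasets.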
