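## [Assignment focus]
Learn how to use hyper-parameter search in Sklearn to find the best hyperparameters.

### Assignment
Pick a different dataset and apply hyper-parameter search to see whether you can find the best combination of hyperparameters.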

In [1]:
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, KFold, GridSearchCV

# Load the breast cancer dataset and hold out 25% of it as a test set
cancer = datasets.load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.25, random_state=1
)

# Base Random Forest Model for comparison

In [19]:
rf = RandomForestClassifier(random_state = 5)
rf.fit(x_train,y_train)
y_pred = rf.predict(x_test)
acc = metrics.accuracy_score(y_test,y_pred)
print(f'Base Random Forest Model accuracy score: {acc}')


Base Random Forest Model accuracy score: 0.951048951048951




# Hyperparameter Tuning process

Define the grid of hyperparameter values to search over.

In [22]:
n_estimators = [150,200,300,400,500]
max_depth = [3,5,10,20]
min_samples_leaf = [2,4,8]
min_samples_split = [5,10]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth, min_samples_leaf = min_samples_leaf, min_samples_split = min_samples_split )
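
To get a sense of the search cost before launching it, the grid can be enumerated by hand. The sketch below uses only the standard library and repeats the grid values from the cell above. `candidates`, `cv_folds`, and `total_fits` are illustrative names, not part of scikit-learn. It mirrors the Cartesian-product expansion that an exhaustive grid search performs, without calling scikit-learn itself.

```python
from itertools import product

# Same values as the grid defined above
param_grid = {
    "n_estimators": [150, 200, 300, 400, 500],
    "max_depth": [3, 5, 10, 20],
    "min_samples_leaf": [2, 4, 8],
    "min_samples_split": [5, 10],
}

# Every combination of one value per hyperparameter
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv_folds = 5  # each candidate is fitted once per cross-validation fold
total_fits = len(candidates) * cv_folds
print(f"{len(candidates)} candidates x {cv_folds} folds = {total_fits} fits")
# prints: 120 candidates x 5 folds = 600 fits
```

These are the same counts that GridSearchCV reports when `verbose=1` is set, so adding one more value to any list multiplies the total cost accordingly.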


Use GridSearchCV to find the best parameter combination. Here we use 5-fold cross-validation, with accuracy as the scoring metric.

In [23]:
grid_search = GridSearchCV(rf, param_grid, scoring="accuracy", n_jobs=-1, verbose=1,cv = 5)
grid_result = grid_search.fit(x_train, y_train)

Fitting 5 folds for each of 120 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:    6.6s
[Parallel(n_jobs=-1)]: Done 426 tasks      | elapsed:   16.3s
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed:   24.1s finished
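
As a rough illustration of what "5-fold" means, the sketch below splits sample indices into five contiguous folds, so that each sample is used for validation exactly once. It is a simplified, hypothetical helper (`k_fold_indices` is not a scikit-learn function): it does not shuffle or stratify, whereas GridSearchCV uses stratified folds for classifiers when `cv` is an integer. The sample count of 426 assumes the 569-sample breast cancer dataset with 25% held out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for k contiguous, unshuffled folds."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


folds = list(k_fold_indices(426, 5))
for i, (train, val) in enumerate(folds):
    print(f"fold {i}: {len(train)} training samples, {len(val)} validation samples")
```

Each candidate's cross-validation score is the mean of its five validation scores, which is the value reported as `best_score_` for the winning candidate.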


In [24]:
print(f' Best accuracy score : {grid_result.best_score_} by using {grid_result.best_params_}')

 Best accuracy score : 0.9647887323943662 by using {'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 400}


# Refit the Random Forest model with the best parameters found

In [25]:
# Unpack the best parameter combination found by the grid search
rf_best_parameter = RandomForestClassifier(**grid_result.best_params_, random_state = 5)

In [26]:
rf_best_parameter.fit(x_train,y_train)
y_pred_best = rf_best_parameter.predict(x_test)
best_acc = metrics.accuracy_score(y_test,y_pred_best)
print(f'Best Random Forest Model accuracy score: {best_acc}')


Best Random Forest Model accuracy score: 0.951048951048951
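
In this run the tuned model's test accuracy equals the baseline's (0.951...), so the gain seen in cross-validation did not carry over to this particular test split. Separately, the manual refit can usually be skipped: with the default `refit=True`, GridSearchCV retrains the best configuration on the full training set and exposes it as `best_estimator_`. The sketch below checks that both routes give the same predictions. It uses a deliberately tiny grid (`small_grid`, an assumption chosen for speed, not the grid used above), so its scores are not comparable to the results above.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

cancer = datasets.load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.25, random_state=1
)

# Tiny illustrative grid so the example runs quickly
small_grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}

search = GridSearchCV(
    RandomForestClassifier(random_state=5), small_grid, scoring="accuracy", cv=5
)
search.fit(x_train, y_train)

# With refit=True (the default), best_estimator_ is already trained on all of x_train
y_pred_auto = search.best_estimator_.predict(x_test)

# Manual refit with the same parameters and seed
manual = RandomForestClassifier(**search.best_params_, random_state=5)
manual.fit(x_train, y_train)
y_pred_manual = manual.predict(x_test)

same_predictions = bool((y_pred_auto == y_pred_manual).all())
test_acc = metrics.accuracy_score(y_test, y_pred_auto)
print(f"Same predictions: {same_predictions}, test accuracy: {test_acc:.3f}")
```

Because both models share the same parameters, training data, and `random_state`, the two sets of predictions are expected to match.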
