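## [Assignment Focus]
Learn how to use scikit-learn's hyper-parameter search to find the best hyperparameters.

### Assignment
Pick a different dataset and use hyper-parameter search to see whether you can find the best combination of hyperparameters.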

In [1]:
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score, GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

In [2]:
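# Load the iris dataset
iris = datasets.load_iris()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.25, random_state=18)

# Build and fit the model
clf = GradientBoostingClassifier().fit(x_train, y_train)

# Predict on the test set
y_pred = clf.predict(x_test)

# Results. Note: this cross-validation refits the model on folds of the
# test split (38 samples), not on the training split.
print(f'score: {cross_val_score(clf, x_test, y_test, cv=5).mean()}')
print(f'accuracy: {metrics.accuracy_score(y_test, y_pred)}')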

score: 0.9464285714285714
accuracy: 0.9736842105263158
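The cell above cross-validates on the small test split. A more common arrangement is to cross-validate on the training split and keep the test split for one final check. The following is a minimal sketch of that arrangement, not the notebook's own method. The `random_state=0` on the model and the names `model`, `cv_scores`, and `test_accuracy` are assumptions added here for repeatability and illustration.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=18
)

# random_state=0 is an assumption added here so repeated runs agree.
model = GradientBoostingClassifier(random_state=0)

# Cross-validate on the training split only; the test split stays untouched.
cv_scores = cross_val_score(model, x_train, y_train, cv=5)
print(f'mean CV accuracy on training split: {cv_scores.mean():.2%}')

# Fit once on the full training split, then score on the held-out test split.
test_accuracy = model.fit(x_train, y_train).score(x_test, y_test)
print(f'test accuracy: {test_accuracy:.2%}')
```

With this layout the cross-validation score estimates how the model generalizes, and the test accuracy is computed only once on data the model never saw during selection.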


In [4]:
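# Candidate values
n_estimators = [100,220,240]
max_depth = [2,3]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)

## Build the search object from the model and the parameter grid.
## n_jobs is not set here, so the search runs sequentially; passing n_jobs=-1 would use all CPU cores in parallel.
grid_search = GridSearchCV(clf, param_grid, verbose=1)

# Run the search for the best parameters
grid_result = grid_search.fit(x_train, y_train)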

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:    6.3s finished
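`RandomizedSearchCV` is imported at the top of the notebook but never used. As a hedged sketch of how it could stand in for the exhaustive grid, the block below samples a fixed number of combinations from a larger candidate set. The candidate values, `n_iter=5`, `cv=3`, and the `random_state` settings are illustrative assumptions, not values taken from the run above.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=18
)

# Illustrative candidate values (assumed), not the grid used above.
param_distributions = {
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [2, 3, 4],
    'learning_rate': [0.05, 0.1, 0.2],
}

random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions,
    n_iter=5,        # sample 5 of the 36 possible combinations
    cv=3,            # explicit fold count rather than the library default
    n_jobs=-1,       # use all available CPU cores
    random_state=0,  # make the sampled combinations repeatable
)
random_result = random_search.fit(x_train, y_train)
print(f'Best CV accuracy: {random_result.best_score_:.2%} using {random_result.best_params_}')
```

Random search trades exhaustiveness for speed: it fits 15 models here (5 candidates times 3 folds) instead of the 108 a full grid over the same values would need.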


In [7]:
# Retrieve the best mean cross-validated score and the parameters that produced it
print(f'Best Accuracy: {grid_result.best_score_:.2%} using {grid_result.best_params_}')

Best Accuracy: 94.64% using {'max_depth': 2, 'n_estimators': 100}


In [8]:
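clf = GradientBoostingClassifier(n_estimators=100, max_depth=2)
clf.fit(x_train, y_train)
pred = clf.predict(x_test)
rs = metrics.accuracy_score(y_test, pred)

# Question: why is this higher than the best score above?
# best_score_ is the mean accuracy over the cross-validation folds of the training set,
# while this is accuracy on the held-out test set, so the two are measured on different data.
print(f'Test accuracy with best parameters: {rs:.2%}')

Test accuracy with best parameters: 97.37%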
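Re-creating the classifier by hand with the winning parameters works, but `GridSearchCV` refits the best candidate on the whole training set by default (`refit=True`) and exposes it as `best_estimator_`. A minimal sketch of using that attribute follows. The explicit `cv=3` and `random_state=0` are assumptions added for repeatability, and `search`, `best_model`, and `test_accuracy` are illustrative names.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=18
)

param_grid = {'n_estimators': [100, 220, 240], 'max_depth': [2, 3]}

# cv=3 and random_state=0 are assumptions added here for repeatability.
search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=3)
search.fit(x_train, y_train)

# With the default refit=True, best_estimator_ is already fitted on all of x_train.
best_model = search.best_estimator_
test_accuracy = metrics.accuracy_score(y_test, best_model.predict(x_test))

print(f'mean CV accuracy (training folds): {search.best_score_:.2%}')
print(f'test accuracy (held-out split):    {test_accuracy:.2%}')
```

Printing both numbers side by side makes it explicit that they come from different data, which is why they need not match.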
