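## [Assignment Focus]
Learn how to use the hyper-parameter search tools in scikit-learn (sklearn) to find the best hyper-parameters.

### Assignment
Pick a different dataset and use hyper-parameter search to see whether you can find the best combination of hyper-parameters.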

In [3]:
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, r2_score

### Iris

In [46]:
# predict by default parameters

iris = datasets.load_iris()
print('feature names: ', iris.feature_names)
print('target: ', iris.target)
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=4)
clf = GradientBoostingClassifier()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))

feature names:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
target:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
Accuracy:  0.9666666666666667


In [51]:
### grid search ###

n_estimators = [10, 20, 30, 50]
max_depth = [2, 3, 5, 10]
param_grid = dict(n_estimators=n_estimators, max_depth=max_depth)

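# n_jobs=-1 runs the search in parallel on all CPU cores. cv is set to 10 here;
# its default is 3 in scikit-learn < 0.22 and 5 from 0.22 onward.
grid_search = GridSearchCV(clf, param_grid, scoring="accuracy", n_jobs=-1, verbose=1, cv=10)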
grid_result = grid_search.fit(x_train, y_train)

Fitting 10 folds for each of 16 candidates, totalling 160 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  76 tasks      | elapsed:    7.0s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:   15.5s finished


In [52]:
print("Best Accuracy: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# grid_result.cv_results_

Best Accuracy: 0.975000 using {'max_depth': 2, 'n_estimators': 10}
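The best score above is reported for the smallest configuration in the grid, which suggests that several candidates may share the same mean cross-validation accuracy. One way to check is to look at `cv_results_`, which the cell above leaves commented out. The following is a minimal sketch rather than the notebook's own code: it assumes `cv=5` and no `n_jobs` to keep the run short, and it adds `random_state=0` to the classifier (an assumption, not in the original) so the result is repeatable.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=4)

# Same grid as in the notebook cell above.
param_grid = {"n_estimators": [10, 20, 30, 50], "max_depth": [2, 3, 5, 10]}

# random_state=0 and cv=5 are assumptions made for this sketch only.
search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid,
                      scoring="accuracy", cv=5)
search.fit(x_train, y_train)

results = search.cv_results_
order = np.argsort(results["rank_test_score"])  # best-ranked candidates first

# Show the five best-ranked candidates with mean and spread across folds.
for i in order[:5]:
    print(results["params"][i],
          "mean=%.4f" % results["mean_test_score"][i],
          "std=%.4f" % results["std_test_score"][i])

# Candidates with rank 1 share the best mean score.
n_tied = int(np.sum(results["rank_test_score"] == 1))
print("candidates tied for rank 1:", n_tied)
```

If more than one candidate is tied for rank 1, `best_params_` simply reports one of them, so the printed "best" combination should not be read as clearly better than the others.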


In [54]:
clf_grid = GradientBoostingClassifier(n_estimators=grid_result.best_params_['n_estimators'],
                                      max_depth=grid_result.best_params_['max_depth'])
clf_grid.fit(x_train, y_train)
y_pred = clf_grid.predict(x_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))

Accuracy:  0.9666666666666667
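Rebuilding the classifier by hand from `best_params_` works, but it is not strictly required: with the default `refit=True`, `GridSearchCV` refits the best configuration on the whole training set and exposes it as `best_estimator_`. The sketch below compares the two routes. The small grid, `cv=5`, and `random_state=0` are assumptions chosen so the example runs quickly and repeatably; they are not the notebook's settings.

```python
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=4)

# A deliberately small grid (assumption) so the sketch runs quickly.
param_grid = {"n_estimators": [10, 30], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid,
                      scoring="accuracy", cv=5)  # refit=True is the default
search.fit(x_train, y_train)

# Route 1: best_estimator_ has already been refit on all of x_train.
best_clf = search.best_estimator_
acc_refit = accuracy_score(y_test, best_clf.predict(x_test))

# Route 2: manual rebuild; ** unpacks best_params_ into keyword arguments.
manual_clf = GradientBoostingClassifier(random_state=0, **search.best_params_)
manual_clf.fit(x_train, y_train)
acc_manual = accuracy_score(y_test, manual_clf.predict(x_test))

print("best_estimator_ accuracy:", acc_refit)
print("manual refit accuracy:   ", acc_manual)
```

Because both models use the same parameters, training data, and seed, the two accuracies should agree. In the notebook itself the classifier has no fixed `random_state`, so a manual refit there can differ slightly from run to run.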


In [55]:
### random search ###

param_dist = {
    'n_estimators': range(5, 100, 5),
    'max_depth': range(2, 30, 1),
}
random_search = RandomizedSearchCV(clf, param_dist, scoring='accuracy', cv=10, n_jobs=-1, verbose=1)
random_result = random_search.fit(x_train, y_train)
print('Best accuracy:', random_result.best_score_, 'using ', random_result.best_params_)

clf_random = GradientBoostingClassifier(n_estimators=random_result.best_params_['n_estimators'],
                                       max_depth=random_result.best_params_['max_depth'])
clf_random.fit(x_train, y_train)
y_pred = clf_random.predict(x_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))

Fitting 10 folds for each of 10 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   10.5s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   19.8s finished


Best accuracy: 0.975 using  {'n_estimators': 60, 'max_depth': 6}
Accuracy:  0.9666666666666667
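The random search above draws 10 candidates (the default `n_iter=10`) and has no fixed seed, so a rerun may report a different best combination. The sketch below shows one possible way to make the draw repeatable and to sample from distributions instead of fixed lists. The ranges, `n_iter=5`, `cv=3`, both seeds, and the helper `run_search` are assumptions made for illustration, not settings from the notebook.

```python
from scipy.stats import randint
from sklearn import datasets
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=4)

# randint(low, high) samples integers from low up to, but excluding, high.
param_dist = {"n_estimators": randint(5, 100), "max_depth": randint(2, 10)}


def run_search(seed):
    """Hypothetical helper: run a small random search with a fixed seed."""
    search = RandomizedSearchCV(
        GradientBoostingClassifier(random_state=0), param_dist,
        n_iter=5, scoring="accuracy", cv=3, random_state=seed)
    return search.fit(x_train, y_train)


first = run_search(seed=42)
second = run_search(seed=42)

print("sampled candidates:", first.cv_results_["params"])
print("same candidates on rerun:",
      first.cv_results_["params"] == second.cv_results_["params"])
print("best:", first.best_score_, first.best_params_)
```

Note that the test set here holds only 30 samples, so the accuracy of 0.9667 reported for the default model, the grid search, and the random search corresponds to 29 of 30 correct in each case. A split this small cannot distinguish between these configurations; the cross-validation scores are the more informative comparison.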
