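## [Assignment Focus]
Learn how to use scikit-learn's hyper-parameter search to find the best hyperparameters

### Assignment
Use a different dataset and apply hyper-parameter search to see whether you can find the best combination of hyperparameters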

In [1]:
# Load the packages and the dataset first

from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Using Random Forest:

In [25]:
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=4)

# Build the model
clf = RandomForestClassifier()

# Train the model
clf.fit(x_train, y_train)

# Predict on the test set
y_pred1 = clf.predict(x_test)

acc1 = metrics.accuracy_score(y_test, y_pred1)
print("Accuracy: ", acc1)

Accuracy:  0.9577777777777777




# Tuning RF with Grid Search:

Hyperparameter Tuning the Random Forest in Python:

https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74


In [26]:
from pprint import pprint

# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(clf.get_params())

Parameters currently in use:

{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}


In [27]:
# Random Hyperparameter Grid

import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

# Number of features to consider at every split
# (for classifiers 'auto' means the same as 'sqrt'; 'auto' was removed in scikit-learn 1.3)
max_features = ['auto', 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

pprint(random_grid)

{'bootstrap': [True, False],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'max_features': ['auto', 'sqrt'],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}



On each iteration, the algorithm chooses a different combination of these settings. Altogether, there are 2 * 12 * 2 * 3 * 3 * 10 = 4320 combinations!

However, the benefit of a random search is that we are not trying every combination, but selecting at random to sample a wide range of values.
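
To make that count concrete, here is a minimal sketch that uses only the standard library. It rebuilds the same candidate values, enumerates every combination, and draws 100 of them at random, which is roughly the budget a randomized search with `n_iter = 100` works with. The names `grid`, `all_settings`, and `sampled` are illustrative, and the sampling is a simplification of what scikit-learn does internally.

```python
import itertools
import random

# Same candidate values as random_grid above.
grid = {
    "n_estimators": list(range(200, 2001, 200)),
    "max_features": ["auto", "sqrt"],
    "max_depth": list(range(10, 111, 10)) + [None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Every combination an exhaustive grid search would have to fit.
keys = list(grid)
all_settings = [dict(zip(keys, values)) for values in itertools.product(*grid.values())]
print(len(all_settings))  # 4320

# A randomized search only fits a fixed budget of them.
rng = random.Random(42)
sampled = rng.sample(all_settings, 100)
print(f"{len(sampled) / len(all_settings):.1%} of the grid is explored")
```

With 3-fold cross-validation, the exhaustive version would need 12,960 fits, against 300 for the sampled one.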


In [28]:
# Random Search Training

# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()

# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf,
                               param_distributions = random_grid,
                               n_iter = 100,
                               cv = 3,
                               verbose=2,
                               random_state=42,
                               n_jobs = -1)


# Fit the random search model
rf_random.fit(x_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:   21.8s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:  3.1min finished


RandomizedSearchCV(cv=3, error_score='raise-deprecating',
                   estimator=RandomForestClassifier(bootstrap=True,
                                                    class_weight=None,
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features='auto',
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    n_estimators='warn',
                                                    n_jobs=None

In [29]:
rf_random.best_params_

{'n_estimators': 400,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': None,
 'bootstrap': False}

In [30]:
# Evaluate Random Search

base_model = RandomForestClassifier(n_estimators = 10, random_state = 42)
best_random = rf_random.best_estimator_

base_model.fit(x_train,y_train)
y_pred2 = base_model.predict(x_test)
acc2 = metrics.accuracy_score(y_test, y_pred2)
print("Base model's Accuracy: ", acc2)

print("===========================================")

# Note: best_estimator_ has already been refit on the whole training set;
# calling fit again simply retrains it with the same parameters
best_random.fit(x_train,y_train)
y_pred3 = best_random.predict(x_test)
acc3 = metrics.accuracy_score(y_test, y_pred3)
print("Best model's Accuracy: ", acc3)

print("===========================================")

print('Improvement of {:0.2f}%.'.format(100*(acc3-acc2)/acc2))


Base model's Accuracy:  0.9555555555555556
Best model's Accuracy:  0.9822222222222222
Improvement of 2.79%.
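
The improvement printed above is a relative change with respect to the base model, not a difference in percentage points. The short sketch below computes both from the two accuracies reported above; the helper name `relative_improvement` is an assumption made for illustration.

```python
def relative_improvement(baseline: float, tuned: float) -> float:
    """Return the percentage change of `tuned` relative to `baseline`."""
    return 100 * (tuned - baseline) / baseline


# Accuracies reported above for the base model and the random-search model.
acc_base = 0.9555555555555556
acc_tuned = 0.9822222222222222

print(f"Relative improvement: {relative_improvement(acc_base, acc_tuned):.2f}%")
print(f"Absolute improvement: {100 * (acc_tuned - acc_base):.2f} percentage points")
```

Reporting both numbers avoids ambiguity when the baseline accuracy is already high.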


In [31]:
#Grid Search with Cross Validation

from sklearn.model_selection import GridSearchCV

# Create the parameter grid for the exhaustive search
# (note: these ranges do not contain the best_params_ found by the random search above)
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}

# Create a base model
rf2 = RandomForestClassifier()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf2,
                           param_grid = param_grid, 
                           cv = 3,
                           n_jobs = -1,
                           verbose = 2)

# This will try out 1 * 4 * 2 * 3 * 3 * 4 = 288 combinations of settings.

In [32]:
grid_search.fit(x_train,y_train)
grid_search.best_params_

Fitting 3 folds for each of 288 candidates, totalling 864 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    2.9s
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed:   18.9s
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed:   46.6s
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 864 out of 864 | elapsed:  2.0min finished


{'bootstrap': True,
 'max_depth': 100,
 'max_features': 3,
 'min_samples_leaf': 3,
 'min_samples_split': 8,
 'n_estimators': 200}

In [34]:
best_grid = grid_search.best_estimator_

base_model.fit(x_train,y_train)
y_pred2 = base_model.predict(x_test)
acc2 = metrics.accuracy_score(y_test, y_pred2)
print("Base model's Accuracy: ", acc2)

print("===========================================")

best_grid.fit(x_train,y_train)
y_pred4 = best_grid.predict(x_test)
acc4 = metrics.accuracy_score(y_test, y_pred4)

print("Best grid's Accuracy: ", acc4)

print("===================================================")

print('Improvement of {:0.2f}%.'.format(100*(acc4-acc2)/acc2))

Base model's Accuracy:  0.9555555555555556
Best grid's Accuracy:  0.9622222222222222
Improvement of 0.70%.



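Randomized search appears to do better than plain grid search here (0.9822 vs. 0.9622 test accuracy). Note, however, that the grid above does not contain the values the randomized search selected (for example `bootstrap=False` and `min_samples_leaf=1`), so the gap mostly reflects which region of the parameter space was searched rather than the search method itself.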
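
One way to check whether the two searches were looking at the same region is to test each of the random search's best values for membership in the grid. The sketch below does that with plain dictionaries copied from the cells above; `values_outside_grid` is a hypothetical helper, and the check is a literal one (for instance, `max_depth=None` and a very large `max_depth` may behave the same in practice).

```python
# Best parameters reported by the randomized search above.
best_random_params = {
    "n_estimators": 400,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "max_features": "sqrt",
    "max_depth": None,
    "bootstrap": False,
}

# Grid passed to GridSearchCV above.
param_grid = {
    "bootstrap": [True],
    "max_depth": [80, 90, 100, 110],
    "max_features": [2, 3],
    "min_samples_leaf": [3, 4, 5],
    "min_samples_split": [8, 10, 12],
    "n_estimators": [100, 200, 300, 1000],
}


def values_outside_grid(params, grid):
    """Return the parameters whose value is not among the grid's candidates."""
    return {name: value for name, value in params.items() if value not in grid[name]}


outside = values_outside_grid(best_random_params, param_grid)
print(f"{len(outside)} of {len(best_random_params)} values fall outside the grid:")
print(sorted(outside))
```

A grid centered on the random search's best values would give a fairer comparison between the two methods.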


# Using GBM:

In [35]:
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=42)

# Build the model
gbm = GradientBoostingClassifier(random_state=7)

gbm.fit(x_train, y_train)

y_pred5 = gbm.predict(x_test)

acc5 = metrics.accuracy_score(y_test, y_pred5)
print("Accuracy: ", acc5)

Accuracy:  0.9688888888888889


In [21]:
gbm

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=7, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

# Tuning GBM with Grid Search:

In [36]:
from sklearn.model_selection import train_test_split, KFold, GridSearchCV

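# Define the hyperparameter combinations to try
n_estimators = [100, 200, 300]
max_depth = [1, 3, 5]
param_grid = dict(n_estimators=n_estimators,
                  max_depth=max_depth)

## Build the search object from the model and the parameter dictionary (n_jobs=-1 uses all CPU cores in parallel)
grid_search = GridSearchCV(gbm,
                           param_grid,
                           scoring="accuracy",
                           # https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
                           n_jobs=-1,
                           verbose=1)

# Start searching for the best parameters
grid_result = grid_search.fit(x_train, y_train)

# With this scikit-learn version the default is 3-fold cross-validation (the default became 5-fold in scikit-learn 0.22).
# There are 9 parameter combinations, so the model is trained 27 times in total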

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


Fitting 3 folds for each of 9 candidates, totalling 27 fits


[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:   24.1s finished
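
Because the default number of folds depends on the installed scikit-learn version, it can help to pass the splitter explicitly. The following is a small sketch under stated assumptions: it swaps in a `DecisionTreeClassifier` purely to keep the run short, and the grid values are made up for illustration rather than taken from the GBM search above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Stating the splitter explicitly keeps the fold count independent of the
# library default, which differs between scikit-learn versions.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

param_grid = {"max_depth": [3, 5, 10], "min_samples_leaf": [1, 2, 4]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),  # stand-in estimator, chosen for speed
    param_grid,
    scoring="accuracy",
    cv=cv,
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
print(n_candidates, "candidates x", search.n_splits_, "folds =",
      n_candidates * search.n_splits_, "fits")
```

The count excludes the final refit of the best candidate on the full training data, which happens once more when `refit=True` (the default).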


In [37]:
# Print the best score and the best parameters
print("Best Accuracy: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

Best Accuracy: 0.947290 using {'max_depth': 3, 'n_estimators': 200}
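
Note that `best_score_` is the mean cross-validated accuracy of the winning candidate on the training folds, so it is not directly comparable to an accuracy measured on the held-out test set. The sketch below illustrates the relationship; it assumes a `DecisionTreeClassifier` stand-in and an illustrative grid so that it runs in a few seconds.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),  # stand-in estimator, chosen for speed
    {"max_depth": [5, 10, 20]},
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)

# best_score_ is the mean validation accuracy of the winning candidate,
# computed on folds of the training set only.
cv_mean = cross_val_score(
    DecisionTreeClassifier(random_state=0, **search.best_params_),
    X_train, y_train, scoring="accuracy", cv=3,
).mean()
print("best_score_          :", search.best_score_)
print("cross_val_score mean :", cv_mean)

# The held-out test accuracy is a separate measurement.
test_accuracy = search.score(X_test, y_test)
print("test accuracy        :", test_accuracy)
```

The first two numbers should agree, while the test accuracy comes from a model refit on the entire training set and scored on data the search never saw.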


In [38]:
grid_result.best_params_

{'max_depth': 3, 'n_estimators': 200}

In [39]:
# Rebuild the model with the best parameters
# (random_state is not set here, unlike the baseline gbm above, so results can vary slightly between runs)
gbm_bestparam = GradientBoostingClassifier(max_depth=grid_result.best_params_['max_depth'],
                                           n_estimators=grid_result.best_params_['n_estimators'])

# Train the model
gbm_bestparam.fit(x_train, y_train)

# Predict on the test set
y_pred6 = gbm_bestparam.predict(x_test)

acc6 = metrics.accuracy_score(y_test, y_pred6)
print("Accuracy: ", acc6)

print("===================================================")

print('Improvement of {:0.2f}%.'.format(100*(acc6-acc5)/acc5))

Accuracy:  0.9711111111111111
Improvement of 0.23%.



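After tuning, accuracy rose from 0.9689 to 0.9711, which on this 450-sample test set is one additional correct prediction (437 vs. 436).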
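
To judge how much weight such a small change deserves, it helps to convert the accuracies back into counts and compare the gap with the sampling noise of the test set. The sketch below is a rough back-of-the-envelope check, assuming a simple binomial model for the accuracy estimate; the variable names are illustrative.

```python
import math

n_test = 450  # test_size=0.25 on the 1,797 digits samples
acc_before = 0.9688888888888889
acc_after = 0.9711111111111111

correct_before = round(acc_before * n_test)
correct_after = round(acc_after * n_test)
print(correct_before, correct_after)  # 436 437

# Rough binomial standard error of an accuracy estimated from n_test samples.
std_err = math.sqrt(acc_after * (1 - acc_after) / n_test)
print(f"difference = {acc_after - acc_before:.4f}, standard error ~ {std_err:.4f}")
```

Under this assumption the gap is smaller than one standard error, so a single train/test split is not enough to conclude that the tuned model is really better.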


# Using XGBoost:

In [6]:
import numpy as np
import pandas as pd
import xgboost as xgb

digits = datasets.load_digits()


# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=4)

# Build the model (named xgb_clf so that the xgb module alias is not overwritten)
xgb_clf = xgb.XGBClassifier()

# Train the model
xgb_clf.fit(x_train, y_train)

# Predict on the test set
y_pred7 = xgb_clf.predict(x_test)

acc7 = metrics.accuracy_score(y_test, y_pred7)
print("Accuracy: ", acc7)

Accuracy:  0.9711111111111111
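
A naming pitfall worth knowing when working with an aliased import such as `import xgboost as xgb`: assigning the model to the alias itself (`xgb = xgb.XGBClassifier()`) rebinds the name, so later calls through the alias no longer reach the module. The sketch below reproduces the effect with the standard-library `json` module as a stand-in, so it runs without XGBoost installed.

```python
import json

# Rebinding the alias replaces the module with whatever is assigned to it.
json = json.dumps({"a": 1})   # `json` is now a str, not the module

try:
    json.dumps({"b": 2})      # a str has no attribute `dumps`
except AttributeError as exc:
    error = str(exc)
    print("AttributeError:", error)
```

Giving the model its own name (for example `xgb_clf`) keeps the module alias usable in later cells without re-importing.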


# Tuning XGBoost with Grid Search:

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split, KFold, GridSearchCV

from sklearn import datasets, metrics
digits = datasets.load_digits()

# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.25, random_state=4)

# Build the model
xgb2 = xgb.XGBClassifier()

param_dist = {
        'n_estimators':range(80,200,4),
        'max_depth':range(2,15,1),
        'learning_rate':np.linspace(0.01,2,20),
        'subsample':np.linspace(0.7,0.9,20),
        'colsample_bytree':np.linspace(0.5,0.98,10),
        'min_child_weight':range(1,9,1)
        }  # Too large to search exhaustively: 12,480,000 candidates, i.e. 37,440,000 fits with 3-fold CV

param_grid = {
        'n_estimators':range(80,200,10),
        'max_depth':range(2,15,1),
        'learning_rate':np.linspace(0.01,0.5,10),
        #'subsample':np.linspace(0.7,0.9,10),
        #'colsample_bytree':np.linspace(0.5,0.98,10),
        #'min_child_weight':range(1,9,1)
        }

In [2]:
# Start searching for the best parameters
grid_result = GridSearchCV(xgb2,
                    param_grid,
                    scoring="accuracy",
                    cv = 3,
                    #n_iter=300,  # only RandomizedSearchCV accepts this parameter
                    n_jobs = -1,
                    verbose=1)


grid_result = grid_result.fit(x_train,y_train)

Fitting 3 folds for each of 1560 candidates, totalling 4680 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   22.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  2.2min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  6.6min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed: 10.9min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed: 16.0min
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed: 20.6min
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed: 25.2min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed: 29.8min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed: 34.7min
[Parallel(n_jobs=-1)]: Done 4680 out of 4680 | elapsed: 38.9min finished
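
The timing log above gives a way to estimate why the larger `param_dist` dictionary was impractical as an exhaustive grid. The sketch below counts the candidates in both dictionaries and extrapolates the run time from the 38.9 minutes measured for 4,680 fits. It assumes the average cost per fit stays the same, which is only a rough approximation since larger or deeper models take longer to train.

```python
import math

# Number of candidate values per parameter in param_dist and param_grid above.
param_dist_sizes = {
    "n_estimators": len(range(80, 200, 4)),
    "max_depth": len(range(2, 15, 1)),
    "learning_rate": 20,      # np.linspace(0.01, 2, 20)
    "subsample": 20,          # np.linspace(0.7, 0.9, 20)
    "colsample_bytree": 10,   # np.linspace(0.5, 0.98, 10)
    "min_child_weight": len(range(1, 9, 1)),
}
param_grid_sizes = {
    "n_estimators": len(range(80, 200, 10)),
    "max_depth": len(range(2, 15, 1)),
    "learning_rate": 10,      # np.linspace(0.01, 0.5, 10)
}

cv_folds = 3
dist_candidates = math.prod(param_dist_sizes.values())
grid_candidates = math.prod(param_grid_sizes.values())
print(f"param_dist: {dist_candidates:,} candidates, {dist_candidates * cv_folds:,} fits")
print(f"param_grid: {grid_candidates:,} candidates, {grid_candidates * cv_folds:,} fits")

# Extrapolate from the measured run: 4,680 fits in 38.9 minutes on 8 workers.
seconds_per_fit = 38.9 * 60 / (grid_candidates * cv_folds)
days = dist_candidates * cv_folds * seconds_per_fit / 86400
print(f"Estimated time for the full param_dist grid: about {days:.0f} days")
```

A search space of that size is a natural fit for `RandomizedSearchCV` with a fixed `n_iter` budget instead of an exhaustive grid.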


In [4]:
best_estimator = grid_result.best_estimator_
print(best_estimator)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.3911111111111111, max_delta_step=0, max_depth=2,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=110, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)


In [7]:
y_pred8 = grid_result.predict(x_test)

acc8 = metrics.accuracy_score(y_test, y_pred8)
print("Best model's Accuracy: ", acc8)

print("=======================================")

print('Improvement of {:0.2f}%.'.format(100*(acc8-acc7)/acc7))

Best model's Accuracy:  0.9822222222222222
Improvement of 1.14%.
