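# Machine Learning

## 3. Hyperparameter Tuning

Hyperparameters are values attached to each algorithm that control how the algorithm behaves.  
Adjusting them before training the model can reduce overfitting and improve accuracy.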
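As a small illustration, the sketch below lists the hyperparameters of a scikit-learn decision tree with `get_params()`. The value `max_depth=3` is chosen only for the example and is not a recommendation.

```python
from sklearn.tree import DecisionTreeClassifier

# Hyperparameters are fixed when the estimator is constructed, before any training.
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# get_params() returns every hyperparameter together with its current value.
params = model.get_params()
print(params['max_depth'])          # the value passed above
print(params['min_samples_split'])  # left at the library default
```

Anything that appears in this dictionary is set by the user, in contrast to the split thresholds inside the tree, which are learned by `fit`.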

Here we implement two approaches in turn: tuning hyperparameters by hand and grid search.

### 3.1 Tuning hyperparameters (manual)

In [65]:
import numpy as np
import pandas as pd

In [66]:
from sklearn.datasets import load_breast_cancer # breast cancer dataset

In [67]:
dataset = load_breast_cancer() # load the dataset

In [68]:
t = dataset.target # ground-truth labels
x = dataset.data # features

In [69]:
t.shape, x.shape # shapes of the data

((569,), (569, 30))

In [70]:
from sklearn.model_selection import train_test_split

First, split the data into train_val (training and validation data combined) and test.

In [71]:
x_train_val, x_test, t_train_val, t_test = train_test_split(x, t, test_size=0.2, random_state=1)

Then split train_val further into training data and validation data.

In [72]:
x_train, x_val, t_train, t_val = train_test_split(x_train_val, t_train_val, test_size=0.3, random_state=1)

From here on, we proceed as before.

In [73]:
x_train.shape, x_val.shape, x_test.shape # shapes of the data

((318, 30), (137, 30), (114, 30))
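The shapes above follow directly from the two split ratios: 20% of the 569 rows go to the test set, and 30% of the remaining rows go to validation. The sketch below reproduces only the sizes, using row indices in place of the real features; `indices` and `sizes` are names introduced for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 569  # number of rows in the breast cancer dataset
indices = np.arange(n_samples)

# First split: 80% train_val, 20% test.
train_val_idx, test_idx = train_test_split(indices, test_size=0.2, random_state=1)
# Second split: 70% of train_val for training, 30% for validation.
train_idx, val_idx = train_test_split(train_val_idx, test_size=0.3, random_state=1)

sizes = {'train': len(train_idx), 'val': len(val_idx), 'test': len(test_idx)}
print(sizes)  # {'train': 318, 'val': 137, 'test': 114}
```

Overall this is roughly 56% training, 24% validation, and 20% test.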

In [74]:
from sklearn.tree import DecisionTreeClassifier

In [75]:
dtree = DecisionTreeClassifier(random_state=0) # create a decision tree instance

In [76]:
dtree.fit(x_train, t_train)

In [77]:
print(f'training score: {dtree.score(x_train, t_train)}')
print(f'validation score: {dtree.score(x_val, t_val)}')

training score: 1.0
validation score: 0.927007299270073


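The training score is 1.0 while the validation score is about 0.927, so the model shows a slight tendency to overfit.  
We therefore adjust the hyperparameters by hand to reduce this overfitting.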
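One way to see the overfitting tendency is to track the gap between the training and validation scores as the tree is allowed to grow deeper. The following is a minimal sketch under the same splits as above; the list of depths and the `gaps` dictionary are choices made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

dataset = load_breast_cancer()
x, t = dataset.data, dataset.target
x_train_val, x_test, t_train_val, t_test = train_test_split(x, t, test_size=0.2, random_state=1)
x_train, x_val, t_train, t_val = train_test_split(x_train_val, t_train_val, test_size=0.3, random_state=1)

gaps = {}
for max_depth in [1, 2, 3, 5, None]:  # None means no depth limit (the default)
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    model.fit(x_train, t_train)
    train_score = model.score(x_train, t_train)
    val_score = model.score(x_val, t_val)
    # A large positive gap suggests the tree is memorizing the training data.
    gaps[max_depth] = train_score - val_score
    print(f'max_depth={max_depth}: train={train_score:.3f}, val={val_score:.3f}, gap={gaps[max_depth]:.3f}')
```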

In [78]:
dtree = DecisionTreeClassifier(max_depth=10, min_samples_split=30, random_state=0)

Here we adjust max_depth and min_samples_split.
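In scikit-learn, `max_depth` caps how deep the tree may grow, and `min_samples_split` is the minimum number of samples a node must contain before it may be split. The sketch below compares the size of an unrestricted tree with the restricted one; the names `default_tree` and `limited_tree` are introduced only for this comparison.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

dataset = load_breast_cancer()
x, t = dataset.data, dataset.target
x_train_val, x_test, t_train_val, t_test = train_test_split(x, t, test_size=0.2, random_state=1)
x_train, x_val, t_train, t_val = train_test_split(x_train_val, t_train_val, test_size=0.3, random_state=1)

default_tree = DecisionTreeClassifier(random_state=0).fit(x_train, t_train)
limited_tree = DecisionTreeClassifier(max_depth=10, min_samples_split=30, random_state=0).fit(x_train, t_train)

# get_depth() and get_n_leaves() describe the structure of the fitted tree.
for name, tree in [('default', default_tree), ('limited', limited_tree)]:
    print(f'{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}')
```

A smaller tree has less capacity to memorize individual training samples, which is why the training score drops while the validation score can improve.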

In [79]:
dtree.fit(x_train, t_train)

In [80]:
print(f'training score: {dtree.score(x_train, t_train)}')
print(f'validation score: {dtree.score(x_val, t_val)}')

training score: 0.9308176100628931
validation score: 0.9562043795620438


In [81]:
print(f'test score: {dtree.score(x_test, t_test)}')

test score: 0.9298245614035088


Hyperparameters found by hand in this way are not necessarily optimal.  
Next, instead of tuning by hand, we search for the best hyperparameters more efficiently.

### 3.2 Tuning hyperparameters (grid search)

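Advantage  
- It covers the specified range exhaustively, so the search is unlikely to miss a combination within that range.

Disadvantage  
- Depending on the grid, a large number of combinations must be evaluated, so training can take a long time.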
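The cost mentioned above can be estimated before running anything: the number of model fits is the number of parameter combinations times the number of cross-validation folds. The sketch below counts them with `ParameterGrid`, assuming the 3 x 3 grid and 5 folds used in this section.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [{
    'max_depth': [3, 20, 50],
    'min_samples_split': [3, 20, 30],
}]
cv = 5  # number of cross-validation folds

# ParameterGrid enumerates every combination in the grid.
combinations = list(ParameterGrid(param_grid))
n_fits = len(combinations) * cv

print(len(combinations))  # 9
print(n_fits)             # 45
```

With the default `refit=True`, `GridSearchCV` performs one more fit at the end to retrain the best combination on all of the data passed to `fit`. Adding a third parameter with three candidates would triple the count.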

In [82]:
from sklearn.model_selection import GridSearchCV # class that performs grid search

In [83]:
estimator = DecisionTreeClassifier(random_state=0) # estimator = the model to train

In [84]:
param_grid = [{
    'max_depth': [3, 20, 50], # candidate values for this parameter
    'min_samples_split': [3, 20, 30] # candidate values for this parameter
}]

In [85]:
cv = 5 # number of cross-validation folds
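When `cv` is an integer and the estimator is a classifier, `GridSearchCV` splits the data with stratified k-fold, so each fold keeps roughly the same class balance. The sketch below shows the resulting validation fold sizes for the train_val data; `splitter` and `fold_sizes` are names introduced for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, train_test_split

dataset = load_breast_cancer()
x_train_val, x_test, t_train_val, t_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=1)

# Five folds: each fold serves once as the validation part.
splitter = StratifiedKFold(n_splits=5)
fold_sizes = [len(val_idx) for _, val_idx in splitter.split(x_train_val, t_train_val)]
print(fold_sizes)
```

Each `splitN_test_score` column in `cv_results_` is the score on one of these folds.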

In [86]:
tuned_model = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=cv, return_train_score=False) # create the grid search instance

In [87]:
tuned_model.fit(x_train_val, t_train_val) # run the grid search

In [88]:
pd.DataFrame(tuned_model.cv_results_).T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| mean_fit_time | 0.004916 | 0.003806 | 0.004196 | 0.00553 | 0.004052 | 0.007857 | 0.006539 | 0.00694 | 0.004746 |
| std_fit_time | 0.001851 | 0.002802 | 0.002559 | 0.002411 | 0.003611 | 0.001282 | 0.003541 | 0.004064 | 0.004057 |
| mean_score_time | 0.001357 | 0.001611 | 0.001616 | 0.000112 | 0.000417 | 0.000111 | 0.0 | 0.000823 | 0.001629 |
| std_score_time | 0.001733 | 0.003221 | 0.003232 | 0.000224 | 0.000833 | 0.000221 | 0.0 | 0.001158 | 0.003257 |
| param_max_depth | 3 | 3 | 3 | 20 | 20 | 20 | 50 | 50 | 50 |
| param_min_samples_split | 3 | 20 | 30 | 3 | 20 | 30 | 3 | 20 | 30 |
| params | {'max_depth': 3, 'min_samples_split': 3} | {'max_depth': 3, 'min_samples_split': 20} | {'max_depth': 3, 'min_samples_split': 30} | {'max_depth': 20, 'min_samples_split': 3} | {'max_depth': 20, 'min_samples_split': 20} | {'max_depth': 20, 'min_samples_split': 30} | {'max_depth': 50, 'min_samples_split': 3} | {'max_depth': 50, 'min_samples_split': 20} | {'max_depth': 50, 'min_samples_split': 30} |
| split0_test_score | 0.923077 | 0.912088 | 0.912088 | 0.956044 | 0.912088 | 0.912088 | 0.956044 | 0.912088 | 0.912088 |
| split1_test_score | 0.901099 | 0.901099 | 0.901099 | 0.912088 | 0.901099 | 0.901099 | 0.912088 | 0.901099 | 0.901099 |
| split2_test_score | 0.934066 | 0.934066 | 0.934066 | 0.923077 | 0.934066 | 0.934066 | 0.923077 | 0.934066 | 0.934066 |


In the table above, look at mean_test_score to find the best-scoring combination.  
Here we adopt max_depth: 20 and min_samples_split: 3.
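Reading the full transposed table by eye is error-prone, so it can help to keep only the relevant columns and sort by rank. The sketch below is one possible way to do this with the same grid; the `columns` selection and the `summary` name are choices made for this illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

dataset = load_breast_cancer()
x_train_val, x_test, t_train_val, t_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=1)

param_grid = [{'max_depth': [3, 20, 50], 'min_samples_split': [3, 20, 30]}]
tuned_model = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
tuned_model.fit(x_train_val, t_train_val)

# Keep the parameter columns and the aggregated scores, best rank first.
results = pd.DataFrame(tuned_model.cv_results_)
columns = ['param_max_depth', 'param_min_samples_split',
           'mean_test_score', 'std_test_score', 'rank_test_score']
summary = results[columns].sort_values('rank_test_score').reset_index(drop=True)
print(summary)
```

The first row of `summary` corresponds to `tuned_model.best_params_`, up to ties in the mean score.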

In [89]:
param_grid = [{
    'max_depth': [5, 10, 15],
    'min_samples_split': [10, 12, 15],
}]

As shown above, we now examine values around the hyperparameters that scored well.

In [90]:
cv = 5

In [91]:
tuned_model = GridSearchCV(estimator=estimator, param_grid=param_grid, cv=cv, return_train_score=False)

In [92]:
tuned_model.fit(x_train_val, t_train_val)

In [93]:
pd.DataFrame(tuned_model.cv_results_).T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| mean_fit_time | 0.006184 | 0.006169 | 0.00319 | 0.006487 | 0.006148 | 0.006167 | 0.0061 | 0.005027 | 0.005524 |
| std_fit_time | 0.002344 | 0.003577 | 0.00317 | 0.004306 | 0.003718 | 0.00374 | 0.001885 | 0.002721 | 0.001334 |
| mean_score_time | 0.001019 | 0.0 | 0.0 | 0.000102 | 0.000417 | 0.000102 | 0.0 | 0.000232 | 0.001601 |
| std_score_time | 0.001069 | 0.0 | 0.0 | 0.000204 | 0.000833 | 0.000205 | 0.0 | 0.000464 | 0.003202 |
| param_max_depth | 5 | 5 | 5 | 10 | 10 | 10 | 15 | 15 | 15 |
| param_min_samples_split | 10 | 12 | 15 | 10 | 12 | 15 | 10 | 12 | 15 |
| params | {'max_depth': 5, 'min_samples_split': 10} | {'max_depth': 5, 'min_samples_split': 12} | {'max_depth': 5, 'min_samples_split': 15} | {'max_depth': 10, 'min_samples_split': 10} | {'max_depth': 10, 'min_samples_split': 12} | {'max_depth': 10, 'min_samples_split': 15} | {'max_depth': 15, 'min_samples_split': 10} | {'max_depth': 15, 'min_samples_split': 12} | {'max_depth': 15, 'min_samples_split': 15} |
| split0_test_score | 0.967033 | 0.923077 | 0.912088 | 0.967033 | 0.923077 | 0.912088 | 0.967033 | 0.923077 | 0.912088 |
| split1_test_score | 0.912088 | 0.901099 | 0.901099 | 0.912088 | 0.901099 | 0.901099 | 0.912088 | 0.901099 | 0.901099 |
| split2_test_score | 0.923077 | 0.934066 | 0.934066 | 0.923077 | 0.934066 | 0.934066 | 0.923077 | 0.934066 | 0.934066 |


We can confirm that the overall scores are higher than before.
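To compare the two searches with a single number each, one option is to look at `best_score_`, the mean cross-validated score of the best combination. The sketch below runs both grids side by side; `best_cv_result` is a hypothetical helper written for this comparison, not part of scikit-learn.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

dataset = load_breast_cancer()
x_train_val, x_test, t_train_val, t_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=1)

def best_cv_result(param_grid):
    """Hypothetical helper: run a 5-fold grid search and return (best params, best mean CV score)."""
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
    search.fit(x_train_val, t_train_val)
    return search.best_params_, search.best_score_

coarse_params, coarse_score = best_cv_result({'max_depth': [3, 20, 50], 'min_samples_split': [3, 20, 30]})
fine_params, fine_score = best_cv_result({'max_depth': [5, 10, 15], 'min_samples_split': [10, 12, 15]})

print(f'coarse grid: {coarse_params}, best CV score = {coarse_score:.4f}')
print(f'fine grid:   {fine_params}, best CV score = {fine_score:.4f}')
```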

In [94]:
tuned_model.best_params_

{'max_depth': 5, 'min_samples_split': 10}

In [95]:
best_model = tuned_model.best_estimator_ # get the best model

In [96]:
print(f'training score: {best_model.score(x_train_val, t_train_val)}')
print(f'test score: {best_model.score(x_test, t_test)}')

training score: 0.9934065934065934
test score: 0.956140350877193
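With the default `refit=True`, `GridSearchCV` retrains the best combination on all of the data passed to `fit`, which is why the training score above is measured on x_train_val. The fitted search object can also be used directly as a model, as the sketch below illustrates under the same grid; `search_score` and `model_score` are names introduced for this check.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

dataset = load_breast_cancer()
x_train_val, x_test, t_train_val, t_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=1)

param_grid = [{'max_depth': [5, 10, 15], 'min_samples_split': [10, 12, 15]}]
tuned_model = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
tuned_model.fit(x_train_val, t_train_val)

best_model = tuned_model.best_estimator_

# score() and predict() on the search object delegate to the refitted best estimator.
search_score = tuned_model.score(x_test, t_test)
model_score = best_model.score(x_test, t_test)
print(search_score == model_score)  # True
```

The test set is used only here, after the hyperparameters are fixed, so the test score remains an unbiased check of the chosen model.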
