
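If we keep evaluating against the test data while tuning, the test set that is supposed to give the final, unbiased check ends up influencing the model. So we split off a separate validation set from the training data and keep the test data for the very end.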

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

wine = pd.read_csv('https://bit.ly/wine_csv_data')

cols = wine.columns[:3]
print(cols)
data = wine[cols].to_numpy()
target = wine['class'].to_numpy()

train_input, test_input, train_target, test_target = train_test_split(data, target, test_size=0.2, random_state=42)
sub_input, val_input, sub_target, val_target = train_test_split(train_input, train_target, test_size=0.2, random_state=42)
print(sub_input.shape, val_input.shape, test_input.shape, sub_target.shape, val_target.shape, test_target.shape)

Index(['alcohol', 'sugar', 'pH'], dtype='object')
(4157, 3) (1040, 3) (1300, 3) (4157,) (1040,) (1300,)


The data is now split into training, validation, and test sets.

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree

dt = DecisionTreeClassifier(random_state=42)
dt.fit(sub_input, sub_target)
print(dt.score(sub_input, sub_target))  # accuracy on the data the tree was fit on
print(dt.score(val_input, val_target))  # accuracy on the held-out validation set

0.9971133028626413
0.864423076923077


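n-fold cross-validation

cross_validate splits the training set into n folds, trains on n-1 of them, validates on the remaining fold, and repeats this so that every fold serves as the validation set once.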
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
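
To make the mechanics concrete, here is a minimal sketch of the n-fold loop written by hand and compared with `cross_validate`. The data is a synthetic stand-in generated with `make_classification` (an assumption, so the snippet runs without downloading the wine CSV), and both paths use the same unshuffled `KFold(n_splits=5)` splitter so the scores can be compared directly.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption: 3 features, 2 classes)
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

model = DecisionTreeClassifier(random_state=42)
kfold = KFold(n_splits=5)  # no shuffling, so the folds are deterministic

# Manual loop: fit on n-1 folds, score on the remaining fold
manual_scores = []
for train_idx, val_idx in kfold.split(X):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

# Same splitter passed to cross_validate
library_scores = cross_validate(model, X, y, cv=kfold)['test_score']

print(manual_scores)
print(library_scores)
```

The two arrays should agree, because `cross_validate` also clones the estimator and fits it once per fold.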

In [None]:
from sklearn.model_selection import cross_validate
scores = cross_validate(dt, train_input, train_target)  # default is 5-fold (cv=None)
display(scores)

{'fit_time': array([0.01344228, 0.01111031, 0.0106256 , 0.01078057, 0.01042247]),
 'score_time': array([0.00136399, 0.0012691 , 0.00160384, 0.00157166, 0.00143671]),
 'test_score': array([0.86923077, 0.84615385, 0.87680462, 0.84889317, 0.83541867])}

In [None]:
import numpy as np

print(np.mean(scores['test_score']))

0.855300214703487


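StratifiedKFold keeps the class proportions in each fold close to those of the whole training set. For a classifier, cross_validate already uses it when cv is None or an integer, so passing StratifiedKFold() explicitly gives the same mean score as before.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html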

In [None]:
from sklearn.model_selection import StratifiedKFold

scores = cross_validate(dt, train_input, train_target, cv=StratifiedKFold())
print(np.mean(scores['test_score']))

0.855300214703487


In [None]:
splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(dt, train_input, train_target, cv=splitter)
print(np.mean(scores['test_score']))

0.8574181117533719
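
The effect of stratification is easiest to see on a small, deliberately imbalanced example. The sketch below uses hypothetical labels (80 samples of class 0 followed by 20 of class 1, not the wine data) and compares the share of class 1 in each validation fold for plain `KFold` and for `StratifiedKFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels sorted by class: 80 of class 0, then 20 of class 1
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # feature values do not matter for the split itself

def positive_ratio_per_fold(splitter):
    """Return the fraction of class-1 samples in each validation fold."""
    return [float(y[val_idx].mean()) for _, val_idx in splitter.split(X, y)]

plain = positive_ratio_per_fold(KFold(n_splits=5))
stratified = positive_ratio_per_fold(StratifiedKFold(n_splits=5))

print(plain)
print(stratified)
```

Without shuffling, plain `KFold` cuts the sorted labels into consecutive blocks, so four folds contain no class-1 samples and the last contains only class 1. The stratified folds each hold 20% class 1, matching the overall ratio.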


# **Hyperparameter tuning**
Grid Search

A grid can cover several parameters at once. GridSearchCV cross-validates every combination, so a grid like the following means on the order of a thousand candidate models:

    params = {'min_impurity_decrease': np.arange(0.0001, 0.001, 0.0001),
              'max_depth': range(5, 20, 1),
              'min_samples_split': range(2, 100, 10)}

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
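
As a rough illustration of a search over several parameters at once, the sketch below runs `GridSearchCV` on a deliberately coarse grid. Both the synthetic data from `make_classification` and the specific grid values are assumptions chosen so the example finishes quickly; they are not the values used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption)
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

# Coarse illustrative grid (assumption): 3 * 3 * 4 = 36 combinations
params = {
    'min_impurity_decrease': [0.0001, 0.0004, 0.0007],
    'max_depth': range(5, 20, 5),            # 5, 10, 15
    'min_samples_split': range(2, 100, 30),  # 2, 32, 62, 92
}

# Default 5-fold cross-validation is run for every combination
gs = GridSearchCV(DecisionTreeClassifier(random_state=42), params)
gs.fit(X, y)

n_candidates = len(gs.cv_results_['params'])
print(n_candidates)
print(gs.best_params_)
print(gs.best_score_)
```

The number of fits grows multiplicatively with each added parameter, which is why `n_jobs=-1` is worth setting on a real grid.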

In [None]:
from sklearn.model_selection import GridSearchCV

params = {'min_impurity_decrease': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005]}
gs = GridSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1)
gs.fit(train_input, train_target)

In [None]:
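# With the default refit=True, best_estimator_ has been retrained on the
# whole training set using the best parameters found by the search
dt = gs.best_estimator_
print(dt.score(train_input, train_target))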

0.9615162593804117


Best parameters

In [None]:
print(gs.best_params_)

{'min_impurity_decrease': 0.0001}


Cross-validation results for each parameter value

In [None]:
display(gs.cv_results_)
print(gs.cv_results_['mean_test_score'])

{'mean_fit_time': array([0.01416798, 0.0079021 , 0.00768294, 0.00754795, 0.00672822]),
 'std_fit_time': array([8.54627470e-03, 5.16367053e-04, 7.10281041e-04, 6.61374851e-04,
        8.75018143e-05]),
 'mean_score_time': array([0.00147314, 0.00123787, 0.00142651, 0.001436  , 0.00123935]),
 'std_score_time': array([1.76615836e-04, 8.44779976e-05, 5.36203319e-04, 2.34164091e-04,
        1.00368660e-04]),
 'param_min_impurity_decrease': masked_array(data=[0.0001, 0.0002, 0.0003, 0.0004, 0.0005],
              mask=[False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'min_impurity_decrease': 0.0001},
  {'min_impurity_decrease': 0.0002},
  {'min_impurity_decrease': 0.0003},
  {'min_impurity_decrease': 0.0004},
  {'min_impurity_decrease': 0.0005}],
 'split0_test_score': array([0.86923077, 0.87115385, 0.86923077, 0.86923077, 0.86538462]),
 'split1_test_score': array([0.86826923, 0.86346154, 0.85961538, 0.86346154, 0.86923077]),
 'split2_test_sc

[0.86819297 0.86453617 0.86492226 0.86780891 0.86761605]
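
The raw `cv_results_` dictionary is hard to scan. One common way to read it is to load it into a pandas DataFrame and keep only the columns of interest. The sketch below assumes a synthetic dataset in place of the wine data and reuses the single-parameter grid from this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption)
X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

params = {'min_impurity_decrease': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005]}
gs = GridSearchCV(DecisionTreeClassifier(random_state=42), params)
gs.fit(X, y)

# Each key of cv_results_ becomes a column, each candidate a row
results = pd.DataFrame(gs.cv_results_)
summary = results[['param_min_impurity_decrease', 'mean_test_score',
                   'std_test_score', 'rank_test_score']]
summary = summary.sort_values('rank_test_score')  # best candidate first

print(summary)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between candidates is larger than the fold-to-fold variation.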


In [None]:
# Recover the best parameters by hand from the mean validation scores
best_index = np.argmax(gs.cv_results_['mean_test_score'])
print(gs.cv_results_['params'][best_index])

{'min_impurity_decrease': 0.0001}
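
Once the search is finished, the test set that was held back at the start can be used for a single final check. A minimal sketch of that last step, assuming synthetic data in place of the wine CSV and the same single-parameter grid:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption)
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

# Hold out the test set before any tuning happens
train_input, test_input, train_target, test_target = train_test_split(
    X, y, test_size=0.2, random_state=42)

params = {'min_impurity_decrease': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005]}
gs = GridSearchCV(DecisionTreeClassifier(random_state=42), params)
gs.fit(train_input, train_target)  # the search only ever sees the training data

cv_score = gs.best_score_  # mean cross-validation accuracy of the best candidate
test_score = gs.best_estimator_.score(test_input, test_target)  # final check

print(cv_score)
print(test_score)
```

The test score is computed once, after all tuning decisions are made, so it is not used to choose between candidates.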
