
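If we repeatedly validate against the test data, the test set that is supposed to provide the final check ends up influencing the model. To avoid this, we carve out a separate validation set and hold the test data back until the very end.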

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

wine = pd.read_csv('https://bit.ly/wine_csv_data')

cols = wine.columns[:3]
print(cols)
data = wine[cols].to_numpy()
target = wine['class'].to_numpy()

train_input, test_input, train_target, test_target = train_test_split(data, target, test_size=0.2, random_state=42)
sub_input, val_input, sub_target, val_target = train_test_split(train_input, train_target, test_size=0.2, random_state=42)
print(sub_input.shape, val_input.shape, test_input.shape, sub_target.shape, val_target.shape, test_target.shape)

Index(['alcohol', 'sugar', 'pH'], dtype='object')
(4157, 3) (1040, 3) (1300, 3) (4157,) (1040,) (1300,)


The data is now split into training, validation, and test sets.
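The shapes printed above can be reproduced with a little arithmetic. The sketch below is not part of the original notebook. It assumes that `train_test_split` rounds the test portion up when `test_size` is a fraction, which is consistent with the printed shapes, and `split_sizes` is a hypothetical helper name.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test

n_total = 4157 + 1040 + 1300                 # total rows, taken from the printed shapes
n_train, n_test = split_sizes(n_total, 0.2)  # first split: train vs. test
n_sub, n_val = split_sizes(n_train, 0.2)     # second split: sub-train vs. validation
print(n_sub, n_val, n_test)                  # 4157 1040 1300
```

Note that the validation set is 20% of the training portion, so it is about 16% of the full dataset.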

In [2]:
from sklearn.tree import DecisionTreeClassifier, plot_tree

dt = DecisionTreeClassifier(random_state=42)
dt.fit(sub_input, sub_target)
print(dt.score(sub_input, sub_target))  # accuracy on the data the tree was trained on
print(dt.score(val_input, val_target))  # accuracy on the held-out validation set

0.9971133028626413
0.864423076923077


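n-fold cross-validation
cross_validate splits the training set into n folds, trains on n-1 of them, validates on the remaining fold, and repeats this so that each fold serves as the validation set exactly once.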
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html
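To make the fold mechanics concrete, here is a plain-Python sketch of how n-fold index splitting can work. It is an illustration only, not scikit-learn's implementation: `kfold_indices` is a hypothetical helper, and it assumes contiguous, unshuffled folds.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop

folds = list(kfold_indices(n_samples=10, n_splits=5))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx} val={val_idx}")
```

Every sample lands in a validation fold exactly once, so averaging the per-fold scores uses all of the training data for validation without ever scoring a model on samples it was fit on.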

In [3]:
from sklearn.model_selection import cross_validate
scores = cross_validate(dt, train_input, train_target)  # 5-fold by default (cv=None)
display(scores)

{'fit_time': array([0.01279521, 0.01019406, 0.00947976, 0.00983167, 0.00916195]),
 'score_time': array([0.00172114, 0.00181222, 0.00162077, 0.0018394 , 0.00179791]),
 'test_score': array([0.86923077, 0.84615385, 0.87680462, 0.84889317, 0.83541867])}

In [4]:
import numpy as np

print(np.mean(scores['test_score']))

0.855300214703487


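StratifiedKFold: splits the data so that each fold preserves the proportion of samples in each class. For a classifier, cross_validate already uses StratifiedKFold when cv is left at its default, which is why the mean score below is identical to the previous one.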
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
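The idea of stratification can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the algorithm scikit-learn uses: `stratified_folds` is a hypothetical helper that deals each class's samples round-robin across the folds, and the labels are made up.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits=5):
    """Assign sample indices to folds so class proportions stay roughly equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal this class's samples round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

labels = [1] * 15 + [0] * 5  # hypothetical imbalanced target: 75% class 1
folds = stratified_folds(labels, n_splits=5)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

Each fold ends up with three samples of class 1 and one of class 0, matching the 75/25 ratio of the full label list.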

In [5]:
from sklearn.model_selection import StratifiedKFold

scores = cross_validate(dt, train_input, train_target, cv=StratifiedKFold())
print(np.mean(scores['test_score']))

0.855300214703487


In [6]:
splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_validate(dt, train_input, train_target, cv=splitter)
print(np.mean(scores['test_score']))

0.8574181117533719


# **Hyperparameter Tuning**
**Grid Search**
Grid search explores the hyperparameters and runs cross-validation (CV) in a single step.
Pass a dictionary that lists the candidate values for each parameter, for example:
params = {
    'min_impurity_decrease': np.arange(0.0001, 0.001, 0.0001),
    'max_depth': range(5, 20, 1),
    'min_samples_split': range(2, 100, 10)
}
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
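It helps to see how quickly such a grid grows. The sketch below enumerates every combination with `itertools.product`, which is conceptually what a grid search iterates over. It is illustrative only: the `min_impurity_decrease` list is an assumption that the `np.arange` call yields the nine values 0.0001 through 0.0009, and the fold count assumes the default 5-fold CV.

```python
from itertools import product

param_grid = {
    # Assumed to mirror np.arange(0.0001, 0.001, 0.0001): 0.0001 ... 0.0009
    "min_impurity_decrease": [round(0.0001 * i, 4) for i in range(1, 10)],
    "max_depth": list(range(5, 20, 1)),
    "min_samples_split": list(range(2, 100, 10)),
}

# Build one dict per combination of candidate values.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5  # assumed default cv for the search
print(len(candidates))            # 9 * 15 * 10 = 1350 candidates
print(len(candidates) * n_folds)  # 6750 model fits during the search
print(candidates[0])
```

Because the cost is the product of the list lengths, adding one more parameter with ten values would multiply the number of fits by ten.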

In [7]:
from sklearn.model_selection import GridSearchCV

params = {'min_impurity_decrease': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005]}
gs = GridSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1)
gs.fit(train_input, train_target)

In [8]:
dt = gs.best_estimator_
print(dt.score(train_input, train_target))

0.9615162593804117


Best parameters

In [9]:
print(gs.best_params_)

{'min_impurity_decrease': 0.0001}


Cross-validation results for each parameter setting

In [10]:
display(gs.cv_results_)
print(gs.cv_results_['mean_test_score'])

{'mean_fit_time': array([0.02393479, 0.01412182, 0.00949225, 0.02048707, 0.01791973]),
 'std_fit_time': array([0.00794373, 0.00704493, 0.0027887 , 0.00384687, 0.00315368]),
 'mean_score_time': array([0.00929661, 0.0018486 , 0.00156894, 0.00311341, 0.00209012]),
 'std_score_time': array([0.00942412, 0.00052354, 0.00014209, 0.00101645, 0.00101622]),
 'param_min_impurity_decrease': masked_array(data=[0.0001, 0.0002, 0.0003, 0.0004, 0.0005],
              mask=[False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'min_impurity_decrease': 0.0001},
  {'min_impurity_decrease': 0.0002},
  {'min_impurity_decrease': 0.0003},
  {'min_impurity_decrease': 0.0004},
  {'min_impurity_decrease': 0.0005}],
 'split0_test_score': array([0.86923077, 0.87115385, 0.86923077, 0.86923077, 0.86538462]),
 'split1_test_score': array([0.86826923, 0.86346154, 0.85961538, 0.86346154, 0.86923077]),
 'split2_test_score': array([0.8825794 , 0.87680462, 0.87584216, 0.88161

[0.86819297 0.86453617 0.86492226 0.86780891 0.86761605]


In [11]:
best_index = np.argmax(gs.cv_results_['mean_test_score'])
print(gs.cv_results_['params'][best_index])

{'min_impurity_decrease': 0.0001}


In [14]:
params = {
          'min_impurity_decrease':np.arange(0.0001, 0.001, 0.0001),
          'max_depth':range(5, 20, 1),
          'min_samples_split': range(2, 100, 10)
        }

gs = GridSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1)
gs.fit(train_input, train_target)

In [15]:
print(gs.best_params_)

{'max_depth': 14, 'min_impurity_decrease': 0.0004, 'min_samples_split': 12}


In [17]:
print(np.max(gs.cv_results_['mean_test_score']))

0.8683865773302731


**Random Search**
Instead of passing lists of parameter values, pass probability distribution objects from which the parameter values can be sampled.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
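The sampling step of a random search can be sketched with the standard library alone. This is a rough illustration, not what `RandomizedSearchCV` does internally: `sample_candidate` is a hypothetical helper, and the ranges are assumed to mirror the `scipy.stats` distributions configured later in this notebook, where `uniform(loc, scale)` covers `[loc, loc + scale]` and `randint(low, high)` excludes `high`.

```python
import random

def sample_candidate(rng):
    """Draw one hyperparameter combination from the assumed distributions."""
    return {
        # Counterpart of scipy's uniform(0.0001, 0.001): range [0.0001, 0.0011]
        "min_impurity_decrease": rng.uniform(0.0001, 0.0001 + 0.001),
        # Counterparts of scipy's randint(low, high): upper bound excluded
        "max_depth": rng.randrange(20, 50),
        "min_samples_split": rng.randrange(2, 25),
        "min_samples_leaf": rng.randrange(1, 25),
    }

rng = random.Random(42)  # fixed seed so the draws are repeatable
candidates = [sample_candidate(rng) for _ in range(100)]  # plays the role of n_iter=100
print(candidates[0])
```

Unlike a grid, the number of candidates is fixed by the iteration count rather than by the size of the search space, so widening a range does not increase the cost.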

In [20]:
from scipy.stats import uniform, randint

rgen = randint(0, 10)
print(rgen)
rgen.rvs(10)

<scipy.stats._distn_infrastructure.rv_discrete_frozen object at 0x7fde3c8bcb80>


array([8, 9, 5, 5, 9, 4, 1, 4, 5, 2])

In [19]:
np.unique(rgen.rvs(1000), return_counts=True)

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
 array([ 78, 101, 112,  84, 110, 112, 103, 105, 102,  93]))

In [21]:
ugen = uniform(0, 1)
ugen.rvs(10)

array([0.30771302, 0.05615833, 0.45013511, 0.70870256, 0.93943297,
       0.85920809, 0.14284682, 0.12566815, 0.68554176, 0.66694881])

In [22]:
from sklearn.model_selection import RandomizedSearchCV
params = {
          'min_impurity_decrease': uniform(0.0001, 0.001),  # uniform(loc, scale): samples from [0.0001, 0.0011]
          'max_depth': randint(20, 50),  # integers 20..49 (upper bound excluded)
          'min_samples_split': randint(2, 25),
          'min_samples_leaf': randint(1, 25)
        }

rs = RandomizedSearchCV(DecisionTreeClassifier(random_state=42), params, n_jobs=-1, n_iter=100, random_state=42)
rs.fit(train_input, train_target)

In [25]:
print(rs.best_params_)

{'max_depth': 39, 'min_impurity_decrease': 0.00034102546602601173, 'min_samples_leaf': 7, 'min_samples_split': 13}


In [26]:
print(np.max(rs.cv_results_['mean_test_score']))

0.8695428296438884


In [27]:
dt = rs.best_estimator_
print(dt.score(test_input, test_target))

0.86
