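#### [Improving Model Performance - Tuning]

- scikit-learn provides classes for hyperparameter tuning
    * GridSearchCV
    * CV means cross-validation is performed as part of the search
    * It can take a long time!!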


[1] Module Loading and Data Preparation <hr>

In [23]:
## [1-1] Load modules
## Basic modules
import pandas as pd
import matplotlib.pyplot as plt
import koreanize_matplotlib

## ML-related modules
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV ## tuning
from sklearn.neighbors import KNeighborsClassifier                   ## learning algorithm
from sklearn.model_selection import train_test_split                 ## dataset splitting
from sklearn.preprocessing import StandardScaler                     ## feature scaler


In [None]:
## [1-2] Prepare data
data_file = '../Data/iris.csv'

irisDF = pd.read_csv(data_file)

In [25]:
## [1-3] Check basic data info
irisDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal.length  150 non-null    float64
 1   sepal.width   150 non-null    float64
 2   petal.length  150 non-null    float64
 3   petal.width   150 non-null    float64
 4   variety       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


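[2] Data Preprocessing <hr>
- Basic preprocessing: missing values, duplicates, per-feature distributions
- Learning-related analysis: handling required by the learning algorithm, feature-target relationships, feature-feature relationships
- Learning-related splitting: separate features and target, train/test split, train/validation/test split
- Training-data preprocessing: outliers, scaling, encoding ...
- Validation/test data: transformed into the form the learning algorithm expects
    * Domain knowledge about the data => ★ decide whether to keep outliers as-is or change them ★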
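
The basic checks in the first bullet (missing values, duplicates, per-feature distributions) could look like the minimal sketch below. It uses a tiny hand-made DataFrame, `demoDF`, which is a hypothetical stand-in rather than the iris file, so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
demoDF = pd.DataFrame({
    'sepal.length': [5.1, 4.9, 4.9, None],
    'sepal.width':  [3.5, 3.0, 3.0, 3.1],
    'variety':      ['Setosa', 'Setosa', 'Setosa', 'Versicolor'],
})

# Number of missing values in each column
missing_counts = demoDF.isna().sum()

# Number of fully duplicated rows (the first occurrence is not counted)
duplicate_count = int(demoDF.duplicated().sum())

# Summary statistics of the numeric features (a quick look at distributions)
summary = demoDF.describe()

print(missing_counts)
print("duplicates:", duplicate_count)
print(summary)
```

The same three calls can be applied to `irisDF` directly; what to do with any rows they flag remains a judgment call that depends on the data.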

In [26]:
## ========================================================
## [2-1] Separate features and target
## ========================================================
featureDF = irisDF.drop('variety', axis=1)
targetSR = irisDF['variety']

## Train / Test split (validation is handled by CV)
x_train, x_test, y_train, y_test = train_test_split(featureDF, targetSR,
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=targetSR
                                                    )
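
To see what `stratify` does in the split above, the sketch below repeats the same call on hypothetical toy data (`X`, `y` with three balanced classes of 50 rows, the same class balance as iris) and counts the labels on each side. The variable names and the data are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 3 balanced classes, 50 rows each
X = pd.DataFrame({'f1': range(150)})
y = pd.Series(['a'] * 50 + ['b'] * 50 + ['c'] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, each class keeps its share on both sides of the split
train_counts = y_tr.value_counts().to_dict()
test_counts = y_te.value_counts().to_dict()

print(train_counts)  # 40 rows per class
print(test_counts)   # 10 rows per class
```

Without `stratify`, the per-class counts in a small test set can drift away from the overall proportions, which makes accuracy on 30 rows harder to interpret.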

In [27]:
## ===================================================================
## [2-2] Preprocessing for the learning algorithm: scaling for a distance-based algorithm
## ===================================================================
# Create the scaler instance
scaler = StandardScaler()

# fit on Train only
x_train_scaled = scaler.fit_transform(x_train)

# Test gets transform only
x_test_scaled = scaler.transform(x_test)
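
One caveat when scaling is done before a cross-validated search: the scaler above is fit on all of `x_train`, so the rows that later serve as validation folds have already contributed to the scaling statistics. A common way around this is to wrap the scaler and the model in a `Pipeline`, so the scaler is re-fit on the training part of every fold. The sketch below is a suggestion and not part of the original notebook; it loads iris through `sklearn.datasets.load_iris` as an assumption (the CSV path is local), and the names `pipe`, `pipe_grid`, and `pipe_search` are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the built-in iris data stands in for the CSV file
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scaler and classifier chained into a single estimator
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
pipe_grid = {
    'knn__n_neighbors': [1, 3, 5, 7],
    'knn__weights': ['uniform', 'distance'],
}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5, scoring='accuracy')
pipe_search.fit(X_tr, y_tr)          # raw features: scaling happens inside each fold

test_accuracy = pipe_search.score(X_te, y_te)
print(pipe_search.best_params_)
print(test_accuracy)
```

On a dataset as small and clean as iris the difference is likely negligible, but the pipeline form also removes the need to remember a separate `transform` call for the test data.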

    

[3] Training, Validation, and Hyperparameter Search <hr>
- GridSearchCV: builds a model for every combination of the given parameter values and trains/validates each one
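
"Every combination" can be made concrete by enumerating a grid with `ParameterGrid`. The sketch below uses the same candidate values as the grid in this section, stored under the hypothetical name `demo_grid`, and multiplies by the number of folds to get the number of fits.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the grid used in this section
demo_grid = {
    'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15],
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
}

# ParameterGrid yields one dict per combination of values
candidates = list(ParameterGrid(demo_grid))
n_candidates = len(candidates)   # 8 * 2 * 2
n_fits = n_candidates * 5        # each candidate is fit once per fold (cv=5)

print(n_candidates, n_fits)      # 32 160
```

These numbers match the "32 candidates, totalling 160 fits" message printed by the search; with `refit=True` (the default) there is one more fit of the best candidate on the whole training set. Because the count is a product, each extra parameter multiplies the cost.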

In [31]:
# 1) Base model
knn = KNeighborsClassifier()

# 2) Hyperparameter grid for the grid search
#       key   -> name of a parameter (attribute) of the learning algorithm
#       value -> candidate values to try for that parameter
param_grid = {
              'n_neighbors' : [1,3,5,7,9,11,13,15],
              'weights' : ['uniform', 'distance'],
              'p' : [1,2] # 1: Manhattan distance, 2: Euclidean distance
              }

# 3) Configure GridSearchCV
grid_search = GridSearchCV(estimator = knn,
                           param_grid = param_grid,
                           cv = 5,
                           scoring='accuracy',
                           n_jobs=-1,
                           verbose=1
                           )

# 4) Fit ( ★ use Train data only)
grid_search.fit(x_train_scaled, y_train)

## Attributes available after fit() (names ending in an underscore)
print("Best parameters : ", grid_search.best_params_)
print("Best score : ",grid_search.best_score_)

resultDf = pd.DataFrame(grid_search.cv_results_)
print("Cross-validation results")
display(resultDf)

# 5) Evaluate the best model on the Test set
best_knn_grid = grid_search.best_estimator_
y_pred_grid = best_knn_grid.predict(x_test_scaled)

print('\n Test Accuracy (GridSearch best model) :', best_knn_grid.score(x_test_scaled, y_test))


Fitting 5 folds for each of 32 candidates, totalling 160 fits
Best parameters :  {'n_neighbors': 5, 'p': 2, 'weights': 'uniform'}
Best score :  0.9666666666666668
Cross-validation results


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_neighbors | param_p | param_weights | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000799 | 0.000399 | 0.002109 | 0.0004891662 | 1 | 1 | uniform | {'n_neighbors': 1, 'p': 1, 'weights': 'uniform'} | 0.875 | 0.958333 | 0.875 | 0.958333 | 1.0 | 0.933333 | 0.05 | 31 |
| 1 | 0.001005 | 0.000316 | 0.000805 | 0.0005117943 | 1 | 1 | distance | {'n_neighbors': 1, 'p': 1, 'weights': 'distance'} | 0.875 | 0.958333 | 0.875 | 0.958333 | 1.0 | 0.933333 | 0.05 | 31 |
| 2 | 0.0006 | 0.00049 | 0.001702 | 0.0004005736 | 1 | 2 | uniform | {'n_neighbors': 1, 'p': 2, 'weights': 'uniform'} | 0.916667 | 0.958333 | 0.875 | 0.958333 | 1.0 | 0.941667 | 0.042492 | 29 |
| 3 | 0.001201 | 0.000398 | 0.001871 | 0.0003521747 | 1 | 2 | distance | {'n_neighbors': 1, 'p': 2, 'weights': 'distance'} | 0.916667 | 0.958333 | 0.875 | 0.958333 | 1.0 | 0.941667 | 0.042492 | 29 |
| 4 | 0.001035 | 7.2e-05 | 0.001937 | 0.0008521223 | 3 | 1 | uniform | {'n_neighbors': 3, 'p': 1, 'weights': 'uniform'} | 0.916667 | 0.958333 | 0.916667 | 0.958333 | 1.0 | 0.95 | 0.03118 | 23 |
| 5 | 0.0013 | 0.0004 | 0.001007 | 0.0005528375 | 3 | 1 | distance | {'n_neighbors': 3, 'p': 1, 'weights': 'distance'} | 0.916667 | 0.958333 | 0.916667 | 0.958333 | 1.0 | 0.95 | 0.03118 | 23 |
| 6 | 0.001305 | 0.000606 | 0.001703 | 0.0006036028 | 3 | 2 | uniform | {'n_neighbors': 3, 'p': 2, 'weights': 'uniform'} | 0.916667 | 1.0 | 0.916667 | 0.958333 | 1.0 | 0.958333 | 0.037268 | 19 |
| 7 | 0.001205 | 0.000247 | 0.001307 | 0.0008772583 | 3 | 2 | distance | {'n_neighbors': 3, 'p': 2, 'weights': 'distance'} | 0.916667 | 1.0 | 0.916667 | 0.958333 | 1.0 | 0.958333 | 0.037268 | 19 |
| 8 | 0.000603 | 0.000493 | 0.001604 | 0.0005875809 | 5 | 1 | uniform | {'n_neighbors': 5, 'p': 1, 'weights': 'uniform'} | 0.916667 | 0.958333 | 0.958333 | 0.958333 | 1.0 | 0.958333 | 0.026352 | 6 |
| 9 | 0.000502 | 0.000635 | 0.001205 | 0.0002459324 | 5 | 1 | distance | {'n_neighbors': 5, 'p': 1, 'weights': 'distance'} | 0.916667 | 0.958333 | 0.958333 | 0.958333 | 1.0 | 0.958333 | 0.026352 | 6 |



 Test Accuracy (GridSearch best model) : 0.9333333333333333
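
The full `cv_results_` table is wide, so it often helps to keep a few columns and sort by rank. The sketch below shows one way to do that. To stay self-contained it fits a small search on the built-in iris data without scaling or a train/test split, which is an assumption made for brevity; `search`, `results`, and `top` are hypothetical names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Assumption: built-in iris data, unscaled, only to get a fitted search object
X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': [1, 5, 9], 'p': [1, 2]},
    cv=5,
    scoring='accuracy',
)
search.fit(X, y)

# Keep the columns that matter for comparing candidates, best rank first
results = pd.DataFrame(search.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
top = results[cols].sort_values('rank_test_score').head(3)

print(top)
```

`rank_test_score` equal to 1 marks the candidate(s) with the highest `mean_test_score`; `std_test_score` shows how much that score varies across folds, which is worth a look when several candidates are nearly tied.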


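- **RandomizedSearchCV: a tuning method that addresses GridSearchCV's main drawback, long run time, by evaluating only a fixed number (`n_iter`) of randomly sampled parameter combinations**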
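
What "randomly sampled" means can be inspected with `ParameterSampler`, which draws candidate settings from a dictionary of lists and distributions. The sketch below uses the hypothetical name `demo_dist` and draws 20 candidates from a space that would hold 30 * 2 * 2 = 120 combinations if searched exhaustively.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

demo_dist = {
    'n_neighbors': randint(1, 31),        # integers 1..30 (upper bound is exclusive)
    'weights': ['uniform', 'distance'],   # lists are sampled uniformly
    'p': [1, 2],
}

# Draw 20 candidate settings; random_state makes the draw reproducible
samples = list(ParameterSampler(demo_dist, n_iter=20, random_state=42))

print(len(samples))
for s in samples[:3]:
    print(s)
```

The cost is set by `n_iter` rather than by the size of the search space, so widening a range does not make the search slower; the trade-off is that the best combination may simply not be drawn.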


In [36]:
from scipy.stats import randint

# 1) Base model
knn = KNeighborsClassifier()

# 2) Distributions for the random search
param_dist = {'n_neighbors' : randint(1,31),   # integers 1..30 (upper bound excluded)
              'weights' : ['uniform', 'distance'],
              'p' : [1,2] # 1: Manhattan distance, 2: Euclidean distance
              }

# 3) Configure RandomizedSearchCV
random_search = RandomizedSearchCV(estimator = knn,
                           param_distributions=param_dist,
                           n_iter = 20,
                           cv = 5,
                           scoring='accuracy',
                           random_state=42,
                           n_jobs=-1,
                           verbose=2
                           )

# 4) Fit ( ★ use Train data only)
random_search.fit(x_train_scaled, y_train)

## Attributes available after fit() (names ending in an underscore)
print("RandomizedSearchCV best parameters : ", random_search.best_params_)
print("RandomizedSearchCV best score : ",random_search.best_score_)

# resultDf = pd.DataFrame(random_search.cv_results_)
# print("Cross-validation results")
# display(resultDf)

# 5) Evaluate the best model on the Test set
best_knn_rand = random_search.best_estimator_


print('\n Test Accuracy (RandomizedSearchCV best model) :', best_knn_rand.score(x_test_scaled, y_test))


Fitting 5 folds for each of 20 candidates, totalling 100 fits
RandomizedSearchCV best parameters :  {'n_neighbors': 23, 'p': 1, 'weights': 'distance'}
RandomizedSearchCV best score :  0.975

 Test Accuracy (RandomizedSearchCV best model) : 0.9333333333333333
