# Dataset source
* [Pima Indians Diabetes Database | Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database)
* https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html (note: this is scikit-learn's built-in diabetes regression dataset, not the Pima data used below)

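## Data description

* Pregnancies: number of pregnancies
* Glucose: plasma glucose concentration at 2 hours in an oral glucose tolerance test
* BloodPressure: diastolic blood pressure (mm Hg)
* SkinThickness: triceps skinfold thickness (mm), used to estimate body fat
* Insulin: 2-hour serum insulin (mu U/ml)
* BMI: body mass index (weight in kg / (height in m)^2)
* DiabetesPedigreeFunction: diabetes pedigree function, a score summarizing family history of diabetes
* Age: age (years)
* Outcome: class variable (0 or 1); 268 of the 768 rows are 1 and the rest are 0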
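
Because the models later in this notebook are scored by accuracy, it helps to see what the class counts above imply. The following is a minimal sketch that uses only the stated counts (268 positives out of 768 rows); the variable names are illustrative and do not come from the notebook.

```python
# Class balance implied by the counts stated in the data description.
n_total = 768
n_positive = 268
n_negative = n_total - n_positive

positive_ratio = n_positive / n_total
# Accuracy of a trivial model that always predicts the majority class (0).
majority_baseline = n_negative / n_total

print(f"positive ratio: {positive_ratio:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy reported on the full data is best read against this majority-class baseline rather than against 50%.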


# Load the required libraries

In [1]:
# Load pandas for data analysis, numpy for numerical computation,
# and seaborn and matplotlib.pyplot for visualization.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Needed only on older versions of Jupyter Notebook
# %matplotlib inline

# Load the dataset

In [2]:
df = pd.read_csv("../data/diabetes_feature.csv")
df.shape

(768, 16)

In [3]:
# Preview the dataset.

df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome,Pregnancies_high,Age_low,Age_middle,Age_high,Insulin_nan,Insulin_log,low_glu_insulin
0,6,148,72,35,0,33.6,0.627,50,1,False,False,True,False,169.5,5.138735,False
1,1,85,66,29,0,26.6,0.351,31,0,False,False,True,False,102.5,4.639572,True
2,8,183,64,0,0,23.3,0.672,32,1,True,False,True,False,169.5,5.138735,False
3,1,89,66,23,94,28.1,0.167,21,0,False,True,False,False,94.0,4.553877,True
4,0,137,40,35,168,43.1,2.288,33,1,False,False,True,False,168.0,5.129899,False


# Build the datasets for training and prediction

In [4]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome', 'Pregnancies_high',
       'Age_low', 'Age_middle', 'Age_high', 'Insulin_nan', 'Insulin_log',
       'low_glu_insulin'],
      dtype='object')

In [5]:
X = df[['Glucose', 'BloodPressure', 'SkinThickness',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Pregnancies_high',
        'Insulin_nan', 'low_glu_insulin']]
X.shape

(768, 9)

In [6]:
y = df['Outcome']
y.shape

(768,)

In [7]:
# Split the data with train_test_split from scikit-learn's model_selection module.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

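- test_size=0.2 -> 80% of the rows go to the train set and 20% to the test set
- X: features, y: labels
- random_state=42: fixes the seed used to shuffle the rows, so the same rows land in each set on every run. Without it, every run samples a different split.
- Without a fixed random_state, you cannot tell whether a score went up because of a hyperparameter change or merely because the split changed.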
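
The effect of random_state can be checked directly. The sketch below uses synthetic stand-in data rather than the diabetes dataset, and the names `X_demo`, `y_demo`, and the seed 7 are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the diabetes dataset): 100 rows, 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Two calls with the same random_state return identical splits.
_, X_test_a, _, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
same_split = (np.array_equal(X_test_a, X_test_b)
              and np.array_equal(y_test_a, y_test_b))

# A different seed shuffles the rows differently, so the test rows change.
_, X_test_c, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7)
changed_split = not np.array_equal(X_test_a, X_test_c)

print("same seed, same split:", same_split)
print("different seed, different split:", changed_split)
```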


In [8]:
# Check the number of rows in the train set's features and labels.

X_train.shape, y_train.shape

((614, 9), (614,))

In [9]:
# Check the number of rows in the test set's features and labels.

X_test.shape, y_test.shape

((154, 9), (154,))
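
The 614/154 split follows from test_size=0.2 applied to 768 rows. As a sketch of the arithmetic, assuming scikit-learn rounds a fractional test count up and gives the remaining rows to the train set (which matches the shapes printed above):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 768
test_size = 0.2

# Assumption: a fractional test_size is rounded up to a whole number of rows,
# and the train set receives whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# Cross-check by splitting a plain index array of the same length.
train_idx, test_idx = train_test_split(
    np.arange(n_samples), test_size=test_size, random_state=42)
print(len(train_idx), len(test_idx))
```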

# Comparing several algorithms

In [10]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier


estimators = [DecisionTreeClassifier(random_state=42),
             RandomForestClassifier(random_state=42),
             GradientBoostingClassifier(random_state=42)
            ]


In [11]:
results=[]

for estimator in estimators:
    result=[]
    result.append(estimator.__class__.__name__)
    results.append(result)

results

[['DecisionTreeClassifier'],
 ['RandomForestClassifier'],
 ['GradientBoostingClassifier']]

In [12]:
from sklearn.model_selection import RandomizedSearchCV


max_depth = np.random.randint(2, 20, 10)
max_features = np.random.uniform(0.3, 1.0, 10)

param_distributions = {"max_depth": max_depth, 
                       "max_features": max_features}

results=[]
for estimator in estimators:
    result=[]
    
    if estimator.__class__.__name__ != 'DecisionTreeClassifier':
        param_distributions["n_estimators"] = np.random.randint(100, 1000, 10)
        # A larger n_estimators builds more trees
        
    clf = RandomizedSearchCV(estimator, 
                       param_distributions, 
                       n_iter=100, 
                       scoring="accuracy", 
                       n_jobs=-1,
                       cv=5,
                       verbose=2
                      )
    clf.fit(X_train, y_train)
    result.append(estimator.__class__.__name__)
    result.append(clf.best_params_)
    result.append(clf.best_score_)  # mean cross-validated accuracy of the best candidate (not a train-set score)
    result.append(clf.score(X_test, y_test))  # accuracy on the held-out test set
    result.append(clf.cv_results_)
    results.append(result)


Fitting 5 folds for each of 100 candidates, totalling 500 fits
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Fitting 5 folds for each of 100 candidates, totalling 500 fits
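
The cell above samples ten fixed values per hyperparameter with np.random, so the candidate values change on every run. One alternative, sketched here rather than taken from the notebook, is to pass scipy.stats distributions and a random_state to RandomizedSearchCV. The data is synthetic, and `X_demo`, `y_demo`, and `search` are illustrative names.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, random_state=42)

# Each candidate draws fresh values from these distributions.
param_distributions = {
    "max_depth": randint(2, 20),        # integers from 2 to 19
    "max_features": uniform(0.3, 0.7),  # floats in [0.3, 1.0] (loc=0.3, scale=0.7)
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions,
    n_iter=10,
    scoring="accuracy",
    cv=5,
    random_state=42,  # makes the sampled candidates repeatable
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

With random_state fixed, rerunning the search draws the same candidates, which makes it easier to attribute a score change to a change in the search space.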


In [13]:
df = pd.DataFrame(results, columns=['estimator', 'best_params', 'train_score', 'test_score', 'cv_result'])
df

Unnamed: 0,estimator,best_params,train_score,test_score,cv_result
0,DecisionTreeClassifier,"{'max_features': 0.8176598710051641, 'max_dept...",0.868106,0.876623,"{'mean_fit_time': [0.007359361648559571, 0.010..."
1,RandomForestClassifier,"{'n_estimators': 668, 'max_features': 0.916353...",0.907197,0.850649,"{'mean_fit_time': [2.550285291671753, 1.676074..."
2,GradientBoostingClassifier,"{'n_estimators': 911, 'max_features': 0.514211...",0.905598,0.850649,"{'mean_fit_time': [1.0814850330352783, 1.78986..."


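- Rerun the search with different hyperparameter ranges to look for better parameters.
- Using more folds (cv) trains each candidate on a larger share of the data, at the cost of more fits. The gap between train_score and test_score may shrink, but more folds do not by themselves reduce overfitting.
- The train_score column holds best_score_, the mean cross-validated accuracy, not accuracy on the training rows. When it is clearly higher than test_score, as for RandomForestClassifier and GradientBoostingClassifier above, the tuned model may be overfitting.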
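
To compare accuracy on the training rows with accuracy on held-out folds, cross_validate can return both. This is a minimal sketch on synthetic data, not the notebook's own procedure; `X_demo`, `y_demo`, and `cv_out` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, flip_y=0.1, random_state=42)

cv_out = cross_validate(
    DecisionTreeClassifier(random_state=42),  # depth left unrestricted
    X_demo,
    y_demo,
    cv=5,
    scoring="accuracy",
    return_train_score=True,  # also score each model on its own training folds
)

train_mean = cv_out["train_score"].mean()
valid_mean = cv_out["test_score"].mean()
print(f"train: {train_mean:.3f}")
print(f"validation: {valid_mean:.3f}")
print(f"gap: {train_mean - valid_mean:.3f}")
```

A large gap between the two means is the usual sign of overfitting; limiting max_depth is one way to narrow it.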

In [14]:
pd.DataFrame(df.loc[1, "cv_result"]).sort_values(by="rank_test_score").head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_n_estimators,param_max_features,param_max_depth,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
99,2.179534,0.11306,0.040889,0.006371,704,0.916354,15,"{'n_estimators': 704, 'max_features': 0.916353...",0.886179,0.943089,0.869919,0.910569,0.92623,0.907197,0.026432,1
5,2.210255,0.038308,0.066114,0.005675,668,0.916354,15,"{'n_estimators': 668, 'max_features': 0.916353...",0.886179,0.943089,0.869919,0.910569,0.92623,0.907197,0.026432,1
61,1.903162,0.017004,0.053913,0.002148,537,0.916354,16,"{'n_estimators': 537, 'max_features': 0.916353...",0.894309,0.943089,0.869919,0.910569,0.918033,0.907184,0.024384,3
19,1.395127,0.050692,0.037925,0.002447,412,0.910063,14,"{'n_estimators': 412, 'max_features': 0.910062...",0.894309,0.943089,0.869919,0.910569,0.918033,0.907184,0.024384,3
22,1.907254,0.159293,0.056548,0.008315,537,0.916354,17,"{'n_estimators': 537, 'max_features': 0.916353...",0.894309,0.943089,0.869919,0.910569,0.918033,0.907184,0.024384,3
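
The full cv_results_ table is wide, so a narrower view is often easier to read. The sketch below keeps only the ranking columns; it runs on synthetic data with a small illustrative grid, and `summary_cols` and `summary` are hypothetical names. Candidates with equal mean scores share a rank, as rows 99 and 5 do in the table above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, random_state=42)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [2, 3, 5, 8, 13], "max_features": [0.3, 0.5, 0.7, 1.0]},
    n_iter=8,
    scoring="accuracy",
    cv=5,
    random_state=42,
)
search.fit(X_demo, y_demo)

# Keep only the columns needed to rank the candidates.
summary_cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
summary = (
    pd.DataFrame(search.cv_results_)[summary_cols]
    .sort_values(by="rank_test_score")
    .reset_index(drop=True)
)
print(summary.head())
```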


[CV] END .......max_depth=2, max_features=0.8176598710051641; total time=   0.0s
[CV] END .......max_depth=2, max_features=0.5098596009579262; total time=   0.0s
[CV] END .......max_depth=2, max_features=0.5142110141567688; total time=   0.0s
[CV] END .......max_depth=2, max_features=0.8442936992589842; total time=   0.0s
[CV] END .......max_depth=2, max_features=0.9982351344887936; total time=   0.0s
[CV] END .......max_depth=2, max_features=0.9163535404921896; total time=   0.0s
[CV] END .......max_depth=2, max_features=0.9100629909865654; total time=   0.0s
[CV] END .......max_depth=2, max_features=0.4345911956533568; total time=   0.0s
[CV] END .......max_depth=5, max_features=0.8176598710051641; total time=   0.0s
[CV] END .......max_depth=7, max_features=0.5098596009579262; total time=   0.0s
[CV] END .......max_depth=7, max_features=0.5098596009579262; total time=   0.0s
[CV] END .......max_depth=2, max_features=0.8176598710051641; total time=   0.0s