### Cross Validation Task

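### Drugs A, B, C, X, Y
##### Multiclass Classification
- As a medical researcher, you have collected data on a set of patients who all suffer from the same illness.
- During the course of treatment, each patient responded to one of five drugs: drug A, drug B, drug C, drug X, or drug Y.
- Build a model to find out which drug might be appropriate for a future patient with the same illness.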

##### feature
- Age: the patient's age
- Sex: the patient's sex
- BP: blood pressure
- Cholesterol: cholesterol level
- Na_to_K: sodium-to-potassium ratio

##### target
- Drug: the drug that worked for the patient

In [1]:
import pandas as pd

drugs_df = pd.read_csv('./datasets/drugs.csv')
drugs_df

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,drugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,drugY
...,...,...,...,...,...,...
195,56,F,LOW,HIGH,11.567,drugC
196,16,M,LOW,HIGH,12.006,drugC
197,52,M,NORMAL,HIGH,9.894,drugX
198,23,M,NORMAL,NORMAL,14.020,drugX


##### Label encoding of the categorical features and the target (Drug)

In [3]:
from sklearn.preprocessing import LabelEncoder
encoder_names = ["Sex","BP","Cholesterol","Drug"]
encoders = {}

for name in encoder_names:
    encoders[name] = LabelEncoder()
    targets = encoders[name].fit_transform(drugs_df[name])
    drugs_df[name] = targets

drugs_df

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,0,0,0,25.355,4
1,47,1,1,0,13.093,2
2,47,1,1,0,10.114,2
3,28,0,2,0,7.798,3
4,61,0,1,0,18.043,4
...,...,...,...,...,...,...
195,56,0,1,0,11.567,2
196,16,1,1,0,12.006,2
197,52,1,2,0,9.894,3
198,23,1,2,1,14.020,3
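The integer codes in the output above follow from the way `LabelEncoder` sorts the class labels before numbering them. The following is a minimal sketch of that mapping on a hypothetical toy column standing in for `drugs_df["BP"]`; the names `bp`, `mapping`, and `restored` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for drugs_df["BP"]
bp = pd.Series(["HIGH", "LOW", "NORMAL", "LOW", "HIGH"])

encoder = LabelEncoder()
codes = encoder.fit_transform(bp)

# classes_ is sorted, so each label's position in it is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Because the notebook keeps every fitted encoder in the `encoders` dictionary, the same `inverse_transform` call on `encoders["Drug"]` should turn predicted codes back into drug names.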


In [4]:
# Check for duplicate rows
drugs_df.duplicated().sum()

# Check for missing values
drugs_df.isna().sum()


Age            0
Sex            0
BP             0
Cholesterol    0
Na_to_K        0
Drug           0
dtype: int64
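A notebook cell displays only the value of its last expression, so the output above shows the missing-value counts but not the duplicate count. A small sketch of reporting both explicitly is given here, using a hypothetical `toy_df` rather than the real `drugs_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one repeated row and one missing value
toy_df = pd.DataFrame({
    "Age": [23, 47, 47, 28],
    "Na_to_K": [25.355, 13.093, 13.093, np.nan],
})

# Number of rows that repeat an earlier row exactly
duplicate_count = int(toy_df.duplicated().sum())

# Number of missing values per column
missing_counts = toy_df.isna().sum()

print(f"duplicate rows: {duplicate_count}")
print(missing_counts)
```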

In [5]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Create the DecisionTreeClassifier
decision_tree_classifier = DecisionTreeClassifier(random_state=124)

# Split the drug dataset into features (all columns but the last) and target (the last column)
drug_feature = drugs_df.iloc[:,:-1]
drug_target = drugs_df.iloc[:,-1]

# Split into a training set (80%) and a test set (20%)
X_train, X_test, Y_train, Y_test = \
train_test_split(drug_feature, drug_target, test_size=0.2, random_state=124)
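The split above is purely random, so with five drug classes of possibly unequal size the test set may not mirror the class proportions of the full data. One option, not used in the original notebook, is the `stratify` argument of `train_test_split`. The sketch below assumes a hypothetical imbalanced three-class target (`toy_target`) to show the effect.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows with an imbalanced 3-class target
rng = np.random.default_rng(124)
toy_feature = pd.DataFrame({"Age": rng.integers(15, 75, size=100)})
toy_target = pd.Series([0] * 60 + [1] * 30 + [2] * 10, name="Drug")

# stratify keeps the class proportions of toy_target in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_feature, toy_target, test_size=0.2, random_state=124, stratify=toy_target
)

print(y_tr.value_counts().sort_index())
print(y_te.value_counts().sort_index())
```

Whether stratification matters for the drug data depends on how unevenly the five classes are distributed, which the cells above do not show.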


### Finding the optimal parameters through model validation (model tuning)

In [6]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

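# The training and test sets were already split above; only the estimator and the parameter grid are defined here.

decision_tree_classifier = DecisionTreeClassifier()

# max_depth: maximum depth of the tree
# min_samples_split: minimum number of samples a node must contain to be split
parameters = {'max_depth': [1, 2, 3, 4, 5, 6, 7], 'min_samples_split': [6, 7, 8, 9]}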
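To get a feel for the size of this search, the grid can be expanded with `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` iterates over. The sketch assumes 5-fold cross-validation, as in the `cv=5` setting used in this notebook; `n_candidates` and `n_fits` are illustrative names.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'max_depth': [1, 2, 3, 4, 5, 6, 7], 'min_samples_split': [6, 7, 8, 9]}

# Every combination of the two parameter lists: 7 * 4 candidates
candidates = list(ParameterGrid(parameters))
n_candidates = len(candidates)

# Each candidate is trained once per fold (assuming cv=5)
n_fits = n_candidates * 5

print(n_candidates, n_fits)
print(candidates[0])
```

With `refit=True`, one more fit on the whole training set follows the cross-validated fits.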

In [12]:
grid_decision_tree_classifier = GridSearchCV(decision_tree_classifier
                                             , param_grid=parameters
                                             , cv=5
                                             , refit=True
                                             , return_train_score=True)

grid_decision_tree_classifier.fit(X_train, Y_train)
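After fitting, the search object exposes the winning combination through `best_params_`, its mean cross-validated score through `best_score_`, and, because `refit=True`, a model retrained on the full training set through `best_estimator_`. The sketch below shows one way to read these and score the held-out test set with `accuracy_score`. As an assumption made only so that the block runs on its own, the iris dataset and a smaller grid stand in for the drug data and the grid above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris replaces the drug dataset here
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=124, stratify=y
)

# Smaller illustrative grid than the one in the notebook
small_grid = {'max_depth': [1, 2, 3], 'min_samples_split': [6, 7]}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=124),
    param_grid=small_grid,
    cv=5,
    refit=True,
)
grid.fit(X_tr, y_tr)

# Best parameter combination and its mean cross-validated accuracy
print(grid.best_params_)
print(grid.best_score_)

# best_estimator_ was refit on all of X_tr, so it can predict directly
prediction = grid.best_estimator_.predict(X_te)
test_accuracy = accuracy_score(y_te, prediction)
print(test_accuracy)
```

The test-set accuracy is computed once, after tuning, so that the test data plays no part in choosing the parameters.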

In [13]:
import pandas as pd

scores_df = pd.DataFrame(grid_decision_tree_classifier.cv_results_)
scores_df[['params', 'mean_test_score', 'rank_test_score', 
           'split0_test_score', 'split1_test_score', 'split2_test_score']]

Unnamed: 0,params,mean_test_score,rank_test_score,split0_test_score,split1_test_score,split2_test_score
0,"{'max_depth': 1, 'min_samples_split': 6}",0.7125,25,0.71875,0.71875,0.71875
1,"{'max_depth': 1, 'min_samples_split': 7}",0.7125,25,0.71875,0.71875,0.71875
2,"{'max_depth': 1, 'min_samples_split': 8}",0.7125,25,0.71875,0.71875,0.71875
3,"{'max_depth': 1, 'min_samples_split': 9}",0.7125,25,0.71875,0.71875,0.71875
4,"{'max_depth': 2, 'min_samples_split': 6}",0.83125,21,0.84375,0.84375,0.84375
5,"{'max_depth': 2, 'min_samples_split': 7}",0.83125,21,0.84375,0.84375,0.84375
6,"{'max_depth': 2, 'min_samples_split': 8}",0.83125,21,0.84375,0.84375,0.84375
7,"{'max_depth': 2, 'min_samples_split': 9}",0.83125,21,0.84375,0.84375,0.84375
8,"{'max_depth': 3, 'min_samples_split': 6}",0.85,17,0.84375,0.8125,0.875
9,"{'max_depth': 3, 'min_samples_split': 7}",0.85,17,0.84375,0.8125,0.875
