## Classification Exercise Using the Decision Tree Algorithm

### Breast Cancer Prediction Model
- Dataset: the breast_cancer dataset
- Data split: 70% training set, 30% test (evaluation) set
- ML algorithm: decision tree
- Model evaluation: predictive performance, measured by accuracy

### Using the breast_cancer dataset from the sklearn package
- Task: breast cancer prediction
- Label 0 means malignant (positive for cancer); label 1 means benign (negative)
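
The label encoding is easy to get backwards, so it may be worth checking it against the dataset itself. The following is a minimal sketch, assuming scikit-learn is installed; the variable names `label_names` and `class_counts` are introduced here for illustration only.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# target_names[i] is the class name that corresponds to label i
label_names = [str(name) for name in cancer.target_names]
print(label_names)

# Count how many samples fall into each named class
class_counts = Counter(label_names[label] for label in cancer.target)
print(dict(class_counts))
```

The classes are not balanced, which is one reason accuracy alone can be a coarse summary of performance on this dataset.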

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity="all"

### (1) Preparing the dataset: breast cancer dataset

In [2]:
# Load the data
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
cancer

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

In [3]:
import pandas as pd
cancer_data = cancer.data
cancer_target = cancer.target

### (2) Splitting the dataset: training set / test set
- Test set: 30%

In [4]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(cancer_data, cancer_target, test_size=0.3, random_state=1)
# .size counts every element: rows x 30 features for X, rows only for y
print(X_train.size, X_test.size)
print(y_train.size, y_test.size)

11940 5130
398 171
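
The four numbers printed above follow from simple arithmetic on the dataset shape. The sketch below reproduces them without scikit-learn; it assumes that `train_test_split` rounds a fractional test share up and gives the remainder to the training set, which is consistent with the output above. The names `n_samples`, `n_features`, `n_test` and `n_train` are illustrative.

```python
import math

n_samples, n_features = 569, 30  # shape of cancer.data
test_size = 0.3

# Assumption: the test share is rounded up, the remainder is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

# Number of rows in each split
print(n_train, n_test)

# Number of elements, which is what X_train.size and X_test.size report
print(n_train * n_features, n_test * n_features)
```

To see the row and column counts directly, `X_train.shape` is usually more informative than `X_train.size`.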


### (3) Training the model: fit the ML algorithm on the training set

In [5]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=1)

In [6]:
model.fit(X_train, y_train)
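
Once the tree is fitted, it can help to look at how large it grew, since a tree without depth limits tends to fit the training data very closely. This is a minimal sketch, assuming a scikit-learn version that provides `get_depth()` and `get_n_leaves()`; it repeats the split from step (2) so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

model = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Size of the learned tree
tree_depth = model.get_depth()
n_leaves = model.get_n_leaves()

# Accuracy on the data the tree was trained on (not a measure of generalization)
train_accuracy = model.score(X_train, y_train)
print(tree_depth, n_leaves, train_accuracy)
```

A large gap between training accuracy and test accuracy is a common sign of overfitting, which is what limits such as `max_depth` and `min_samples_leaf` are meant to reduce.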

### (4) Prediction: classify the test data with the trained model

In [7]:
y_predict = model.predict(X_test)
# Row 0 holds the actual labels, row 1 the predicted labels
pd.DataFrame([y_test, y_predict])

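|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | 169 | 170 |
|---|---|---|---|---|---|---|---|---|---|---|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |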


### (5) Evaluation: prediction accuracy
- Compare the predicted labels with the actual labels of the test data to evaluate model performance
- Metric: prediction accuracy

In [8]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict)
# Accuracy: 92.98%

0.9298245614035088
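
Accuracy is simply the share of test samples whose predicted label matches the actual label. The sketch below computes it by hand in plain Python; the helper `accuracy` is a hypothetical stand-in for `accuracy_score`, and the ten labels are the first ten columns of the comparison table in step (4).

```python
def accuracy(y_true, y_pred):
    """Return the fraction of positions where prediction and truth agree."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# First ten actual labels and predictions from the table in step (4)
y_true_head = [1, 0, 1, 0, 0, 0, 0, 0, 1, 1]
y_pred_head = [1, 0, 1, 0, 1, 0, 1, 1, 1, 1]
head_accuracy = accuracy(y_true_head, y_pred_head)
print(head_accuracy)

# The full-test-set score above corresponds to 159 correct out of 171
full_accuracy = 159 / 171
print(full_accuracy)
```

The first ten samples happen to contain three of the twelve misclassifications, so accuracy on that small slice is much lower than on the full test set.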

### Building a DataFrame from the breast_cancer dataset

In [9]:
# Load the data
cancer = load_breast_cancer()
cancer_data = cancer.data
cancer_target = cancer.target

In [10]:
# Build the DataFrame
cancer_df = pd.DataFrame(data=cancer_data, columns=cancer.feature_names)
cancer_df['target'] = cancer_target
cancer_df.head()

|   | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


### Cross-validation

In [11]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

In [12]:
# Decision Tree Classifier: model
# shuffle=True without random_state, so the folds (and the scores) change from run to run
kfold = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(model, cancer_data, cancer_target, cv=kfold)

for count, accuracy in enumerate(scores):
    print('Fold {0} accuracy : {1:.4f}'.format(count+1, accuracy))
print('Mean accuracy : {0:.4f}'.format(scores.mean()))

Fold 1 accuracy : 0.8947
Fold 2 accuracy : 0.9035
Fold 3 accuracy : 0.9211
Fold 4 accuracy : 0.9386
Fold 5 accuracy : 0.9381
Mean accuracy : 0.9192
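
The per-fold scores above are ratios of correct predictions to fold size, and the fold sizes follow from dividing 569 samples into 5 parts. The sketch below works this out in plain Python. The helper `kfold_sizes` is hypothetical, and the `correct_counts` values are inferred from the printed scores rather than taken from the notebook.

```python
def kfold_sizes(n_samples, n_splits):
    """Split n_samples into n_splits folds; the first folds absorb the remainder."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


sizes = kfold_sizes(569, 5)
print(sizes)

# Assumption: correct-prediction counts that would produce the scores shown above
correct_counts = [102, 103, 105, 107, 106]
fold_scores = [c / n for c, n in zip(correct_counts, sizes)]

for i, score in enumerate(fold_scores, start=1):
    print('Fold {0} accuracy : {1:.4f}'.format(i, score))

mean_score = sum(fold_scores) / len(fold_scores)
print('Mean accuracy : {0:.4f}'.format(mean_score))
```

With folds of about 114 samples, a single extra misclassification moves a fold's accuracy by almost 0.9 percentage points, so differences between folds of this size should be read with caution.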


### Hyperparameter tuning

In [13]:
from sklearn.model_selection import GridSearchCV

In [15]:
parameters = {
    'max_depth' : [4, 5, 6],
    'min_samples_split' : [2, 3, 4],
    'min_samples_leaf' : [2, 3, 4, 5]
}

grid_model = GridSearchCV(model, param_grid=parameters, scoring='accuracy', cv=kfold)
grid_model.fit(X_train, y_train)

# best_score_ is the mean cross-validated accuracy on the training set
print('Best hyperparameters : ', grid_model.best_params_)
print('Best CV accuracy : ', grid_model.best_score_)
best_model = grid_model.best_estimator_

# Accuracy of the refitted best model on the held-out test set
y_pred = best_model.predict(X_test)
print('Test accuracy : ', accuracy_score(y_test, y_pred))

rlt = pd.DataFrame(grid_model.cv_results_['params'])
rlt['mean_score'] = grid_model.cv_results_['mean_test_score']
rlt.sort_values(by='mean_score', ascending=False).head(20)

Best hyperparameters :  {'max_depth': 4, 'min_samples_leaf': 4, 'min_samples_split': 2}
Best CV accuracy :  0.9395569620253165
Test accuracy :  0.9590643274853801


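|    | max_depth | min_samples_leaf | min_samples_split | mean_score |
|----|-----------|------------------|-------------------|------------|
| 18 | 5 | 4 | 2 | 0.939557 |
| 19 | 5 | 4 | 3 | 0.939557 |
| 32 | 6 | 4 | 4 | 0.939557 |
| 31 | 6 | 4 | 3 | 0.939557 |
| 6  | 4 | 4 | 2 | 0.939557 |
| 7  | 4 | 4 | 3 | 0.939557 |
| 8  | 4 | 4 | 4 | 0.939557 |
| 30 | 6 | 4 | 2 | 0.939557 |
| 20 | 5 | 4 | 4 | 0.939557 |
| 34 | 6 | 5 | 3 | 0.934525 |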
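
The row indices in the table above are positions in the grid of candidate settings. The sketch below enumerates that grid with `itertools.product` to show how many models the search fits. It assumes the candidates are ordered by sorted parameter name with the last name varying fastest, which matches the indices shown above but is not confirmed by the notebook itself; `names` and `combos` are illustrative names.

```python
from itertools import product

parameters = {
    'max_depth': [4, 5, 6],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [2, 3, 4, 5],
}

# Assumption: candidates are ordered by sorted parameter name, last name varying fastest
names = sorted(parameters)
combos = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

n_candidates = len(combos)
n_fits = n_candidates * 5  # one fit per candidate per fold with 5-fold CV
print(n_candidates, n_fits)

# Candidates at the positions of two of the tied best rows
print(combos[6])
print(combos[18])
```

Nine candidates tie for the best mean score, and the reported best setting is the one with the lowest index among them (row 6). Every tied row has `min_samples_leaf` equal to 4, which suggests that this parameter, rather than `max_depth` or `min_samples_split`, drives the result in this run.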
