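### Classification Models
: Used when the predicted value is one of a predefined set of categories
- Two categories: 0 and 1, 'negative' or 'positive'; Binary Classification (e.g. Logistic Regression)
- Three or more categories: grades 'A' to 'F'; Multiclass Classification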

<Classification model classes in sklearn>
* Decision Tree
* Logistic Regression
* Naive Bayes
* Support Vector Machine
* Nearest Neighbor (k-nearest neighbors algorithm)

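### Decision Tree
#### Also called a decision-making tree; a representative classification model that can also handle regression; works like a game of Twenty Questions
#### Shaped like an upside-down tree: root node --> rule nodes (internal nodes) --> leaf nodes (terminal nodes)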

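#### [1] Impurity: how mixed the different classes are within a node; also described as uncertainty or disorder
#### [2] Entropy: a numerical measure of impurity; higher entropy means higher impurity. It is 0 for a pure node and reaches its maximum of log2(k) when k classes are evenly mixed (1 for two classes)
#### [3] Information Gain: entropy of the parent node - weighted average entropy of the child nodes after the split
* A decision tree learns by choosing splits that increase the purity (decrease the impurity) of each resulting region; in information theory this reduction in entropy is called information gain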
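The entropy and information gain definitions above can be checked on a small example. The sketch below is only an illustration in plain Python: the helper names `entropy` and `information_gain` and the toy label lists are assumptions made here, not part of sklearn.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [0, 0, 0, 0, 1, 1, 1, 1]            # two classes, evenly mixed
left, right = [0, 0, 0, 0], [1, 1, 1, 1]      # a split that separates them perfectly
poor_left, poor_right = [0, 0, 0, 1], [0, 1, 1, 1]  # a split that barely helps

print(entropy(parent))                                    # 1.0
print(information_gain(parent, [left, right]))            # 1.0
print(information_gain(parent, [poor_left, poor_right]))  # about 0.189
print(entropy([0, 1, 2]))                                 # log2(3), about 1.585
```

The last line shows why 1 is the entropy ceiling only in the two-class case: with three evenly mixed classes the value exceeds 1.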

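#### [4] Gini Coefficient (Gini impurity): 1 - (sum of the squared class proportions); 0 is the minimum (a pure node, such as a leaf), and the maximum is 1 - 1/k for k classes (0.5 for two classes)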
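As a quick check of the Gini formula, here is a minimal sketch in plain Python. The function name `gini_impurity` and the sample label lists are illustrative assumptions.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

print(gini_impurity([0, 0, 0, 0]))  # 0.0, a pure node
print(gini_impurity([0, 0, 1, 1]))  # 0.5, the two-class maximum
print(gini_impurity([0, 1, 2]))     # about 0.667, i.e. 1 - 1/3
```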

### Predicting Iris Species

In [2]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [10]:
# Iris species: [0:'setosa',1:'versicolor',2:'virginica']
# Load the iris dataset
iris = load_iris()
type(iris)    # Bunch: a dictionary-like sklearn container, not a DataFrame
iris

# Extract only the x values (features)
iris_data = iris.data
print(iris_data.shape)  # (150,4), 2-D ndarray
print(type(iris.data))
print(iris.feature_names) # sepal and petal measurements
# ['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']

(150, 4)
<class 'numpy.ndarray'>
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [19]:
# Extract only the y values (labels)
iris_label = iris.target
print(iris_label.shape)   # (150,), 1-D
print(iris_label)
print(iris.target_names)  # ['setosa' 'versicolor' 'virginica']

iris_df = pd.DataFrame(data=iris_data,columns=iris.feature_names)
iris_df['label'] = iris_label
print(iris_df['label'].value_counts()) # 50, 50, 50 (150 in total)
iris_df  

(150,)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
['setosa' 'versicolor' 'virginica']
0    50
1    50
2    50
Name: label, dtype: int64


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),label
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2


In [24]:
# Split into train and test sets: 80%/20%, 120 samples (train), 30 samples (test)
X_train,X_test,y_train,y_test = train_test_split(iris_data,iris_label,
                                                 test_size=0.2,
                                                 random_state=11) # fix the random seed
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((120, 4), (30, 4), (120,), (30,))

In [28]:
# Train the model
dt_clf = DecisionTreeClassifier(random_state=11) # fix the random seed
dt_clf.fit(X_train,y_train)
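To see the root node, rule nodes, and leaf nodes that the fitted tree actually contains, the rules can be printed as text. The sketch below repeats the split and fit from this notebook so that it runs on its own, and uses `export_text` from `sklearn.tree`; the exact rules printed may differ between sklearn versions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=11)

dt_clf = DecisionTreeClassifier(random_state=11)
dt_clf.fit(X_train, y_train)

# Each indented "|---" line is a rule node; lines ending in "class: k" are leaves.
rules = export_text(dt_clf, feature_names=list(iris.feature_names))
print(rules)
print('depth:', dt_clf.get_depth(), 'leaves:', dt_clf.get_n_leaves())
```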

In [31]:
# Run prediction
pred = dt_clf.predict(X_test)
pred,y_test

(array([2, 2, 1, 1, 2, 0, 1, 0, 0, 1, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2, 1, 0,
        0, 1, 0, 0, 2, 1, 0, 1]),
 array([2, 2, 2, 1, 2, 0, 1, 0, 0, 1, 2, 1, 1, 2, 2, 0, 2, 1, 2, 2, 1, 0,
        0, 1, 0, 0, 2, 1, 0, 1]))

In [34]:
# Measure accuracy
from sklearn.metrics import accuracy_score ,classification_report
print('Accuracy:',round(accuracy_score(y_test,pred),4)) # 0.9333

cl_report = classification_report(y_test,pred)
print('Report:\n',cl_report)

Accuracy: 0.9333
Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00         9
           1       0.83      1.00      0.91        10
           2       1.00      0.82      0.90        11

    accuracy                           0.93        30
   macro avg       0.94      0.94      0.94        30
weighted avg       0.94      0.93      0.93        30
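The per-class rows of the report can be reproduced by hand from the `pred` and `y_test` arrays printed above. The sketch below copies those two arrays as plain lists; the helper name `per_class` is an assumption made for illustration.

```python
# Values copied from the prediction output shown above.
pred = [2, 2, 1, 1, 2, 0, 1, 0, 0, 1, 1, 1, 1, 2, 2, 0, 2, 1, 2, 2, 1, 0,
        0, 1, 0, 0, 2, 1, 0, 1]
y_test = [2, 2, 2, 1, 2, 0, 1, 0, 0, 1, 2, 1, 1, 2, 2, 0, 2, 1, 2, 2, 1, 0,
          0, 1, 0, 0, 2, 1, 0, 1]

def per_class(y_true, y_pred, label):
    """Return (precision, recall, support) treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)   # samples predicted as this class
    actual = sum(t == label for t in y_true)      # samples that really are this class
    return tp / predicted, tp / actual, actual

for label in (0, 1, 2):
    precision, recall, support = per_class(y_test, pred, label)
    print(label, round(precision, 2), round(recall, 2), support)
# 0 1.0 1.0 9
# 1 0.83 1.0 10
# 2 1.0 0.82 11
```

The two misclassified samples are both true class 2 predicted as class 1, which is why class 1 loses precision and class 2 loses recall.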



### Evaluation Metrics (Binary Classification)
* Accuracy
* Confusion Matrix
* Precision
* Recall
* F1 Score
* ROC AUC

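### Confusion Matrix
* True: the prediction is correct, False: the prediction is wrong
* Negative: 0, Positive: 1
* Rows are the actual classes and columns are the predicted classes (the sklearn convention)

[ [ TN, FP], True Negative, False Positive
<br>
  [ FN, TP] ] False Negative, True Positive
  
Accuracy = correct predictions / all predictions = (TN + TP)/(TN + FP + FN + TP)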
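The four cells can be counted directly from paired labels. The following is a minimal sketch in plain Python; the function name `confusion_counts` and the toy labels are hypothetical, chosen only to illustrate the layout.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print([[tn, fp], [fn, tp]])                 # [[3, 1], [1, 3]]
print((tn + tp) / (tn + fp + fn + tp))      # 0.75
```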

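#### Precision = TP/(FP + TP): of the samples predicted Positive, the fraction that are actually Positive; also called positive predictive value
#### Recall = TP/(FN + TP): of the samples that are actually Positive, the fraction predicted Positive; also called sensitivity or TPR (True Positive Rate)
#### F1 Score = 2/((1/Recall) + (1/Precision)) = 2*(Recall*Precision)/(Recall+Precision): the harmonic mean of precision and recall, combining the two into a single metric
#### ROC (Receiver Operating Characteristic) curve
* TNR (True Negative Rate, specificity): TN/(FP + TN)
* The x-axis is FPR (False Positive Rate): FP/(FP + TN) = 1 - TNR = 1 - specificity
* The y-axis is TPR (True Positive Rate, recall, sensitivity): TP/(FN + TP)

#### ROC AUC (Area Under the Curve): the area under the ROC curve; the closer to 1 the better, with 1 as the maximum (0.5 corresponds to random guessing)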
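The formulas above can be worked through with concrete counts. The sketch below uses the counts from the confusion matrix that the breast cancer example prints further down in this notebook (TN=45, FP=2, FN=9, TP=58); the variable names are illustrative.

```python
# Counts taken from the breast cancer confusion matrix shown later in this notebook.
tn, fp, fn, tp = 45, 2, 9, 58

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (fp + tp)
recall = tp / (fn + tp)                      # also TPR, sensitivity
f1 = 2 * precision * recall / (precision + recall)
tnr = tn / (fp + tn)                         # specificity
fpr = fp / (fp + tn)                         # 1 - specificity

print('accuracy :', round(accuracy, 4))      # 0.9035
print('precision:', round(precision, 4))     # 0.9667
print('recall   :', round(recall, 4))        # 0.8657
print('f1       :', round(f1, 4))            # 0.9134
print('tnr      :', round(tnr, 4))           # 0.9574
print('fpr      :', round(fpr, 4))           # 0.0426
```

These values agree with the accuracy, precision, recall, and F1 scores that sklearn reports for the same predictions.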

In [36]:
from sklearn.metrics import accuracy_score,confusion_matrix
from sklearn.metrics import precision_score,recall_score
from sklearn.metrics import f1_score,roc_auc_score

In [37]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Breast Cancer Wisconsin
dataset = load_breast_cancer()
type(dataset)  # Bunch
dataset.data.shape   # (569, 30)
dataset.target.shape # (569,), 0: malignant, 1: benign (so the positive class, 1, is benign)

x_features = dataset.data # X, features
y_label = dataset.target  # y, labels

cancer_df = pd.DataFrame(data=x_features,columns=dataset.feature_names)
cancer_df['target'] = y_label
cancer_df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0


In [38]:
print(dataset.target_names)
print(cancer_df['target'].value_counts())

# Split the data into train (80%) and test (20%)
X_train,X_test,y_train,y_test = train_test_split(x_features,y_label,
                                                test_size=0.2,
                                                random_state=0)
print(X_train.shape) # (455, 30)
print(X_test.shape)  # (114, 30)
# type(X_train)      # ndarray

['malignant' 'benign']
1    357
0    212
Name: target, dtype: int64
(455, 30)
(114, 30)


In [41]:
# Train
clf = DecisionTreeClassifier(random_state=11) # fix the random seed
clf.fit(X_train,y_train)

# Predict
pred = clf.predict(X_test)

# Measure accuracy
ac_score = accuracy_score(y_test,pred)
print('Accuracy:',ac_score)

Accuracy: 0.9035087719298246


In [43]:
# Precision
precision = precision_score(y_test,pred)
print('Precision:',precision)

Precision: 0.9666666666666667


In [44]:
# Recall
recall = recall_score(y_test,pred)
print('Recall:',recall)

Recall: 0.8656716417910447


In [45]:
# Confusion matrix
confusion = confusion_matrix(y_test,pred)
print('Confusion matrix:\n',confusion)

Confusion matrix:
 [[45  2]
 [ 9 58]]


In [46]:
# f1_score
f1 = f1_score(y_test,pred)
print('F1 score:',f1)

F1 score: 0.9133858267716535


In [47]:
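# roc_auc (computed here from hard 0/1 predictions; roc_auc_score is normally given scores such as predict_proba output)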
roc_auc = roc_auc_score(y_test,pred)
print('ROC_AUC:',roc_auc)

ROC_AUC: 0.9115592251508415
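ROC AUC is usually computed from continuous scores rather than hard 0/1 predictions, because the curve is traced by sweeping a decision threshold. The sketch below is one way to do that with `predict_proba` and `roc_curve`. Note that `max_depth=3` is an assumption introduced here, not a setting from this notebook: a fully grown tree typically has pure leaves and returns only 0.0 or 1.0 probabilities, so a shallower tree is used to get graded scores.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

dataset = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.2, random_state=0)

# max_depth=3 is an illustrative choice (not from the notebook) so leaves stay impure.
clf = DecisionTreeClassifier(max_depth=3, random_state=11)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]   # estimated probability of class 1 (benign)
pred = clf.predict(X_test)                # hard 0/1 predictions

# Points of the ROC curve: one (FPR, TPR) pair per threshold.
fpr, tpr, thresholds = roc_curve(y_test, proba)

auc_from_scores = roc_auc_score(y_test, proba)
auc_from_labels = roc_auc_score(y_test, pred)

# With hard labels the curve has a single interior point,
# so the area reduces to the mean of TPR and TNR.
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
mean_tpr_tnr = (tp / (tp + fn) + tn / (tn + fp)) / 2

print('AUC from probabilities:', auc_from_scores)
print('AUC from hard labels  :', auc_from_labels)
print('(TPR + TNR) / 2       :', mean_tpr_tnr)
```

The same identity explains the value printed above for the full tree: (58/67 + 45/47) / 2 is about 0.9116.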
