### Types of classification

- Binary classification
    - Sorts samples into two groups
    - True or False, group A or group B
    - Returns 1 (True, belongs to group A) when a sample is assigned to the group, and 0 (False) otherwise
- Multiclass classification
    - Three or more groups to classify into
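
As a rough illustration of the two cases, scikit-learn's `type_of_target` helper reports whether a label array is binary or multiclass. The label arrays below are made up for this sketch and do not come from the notebook's data.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical label arrays, for illustration only
binary_labels = [0, 1, 1, 0, 1]          # two groups
multiclass_labels = [0, 2, 1, 2, 0, 1]   # three groups

binary_kind = type_of_target(binary_labels)
multiclass_kind = type_of_target(multiclass_labels)
print(binary_kind, multiclass_kind)
```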

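### Pima Indians diabetes prediction

- https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database?select=diabetes.csv

- Pregnancies: number of pregnancies
- Glucose: glucose tolerance test result
- BloodPressure: blood pressure (mm Hg)
- SkinThickness: triceps skinfold thickness (mm)
- Insulin: serum insulin (mu U/ml)
- BMI: body mass index, weight (kg) / (height (m))^2
- DiabetesPedigreeFunction: diabetes pedigree function, a score weighting family history of diabetes
- Age: age in years
- Outcome: class label (0 or 1)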
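
Body mass index is weight in kilograms divided by the square of height in metres. A minimal sketch of that calculation follows; the `bmi` helper and the example person are hypothetical and not part of the dataset or the notebook.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kg divided by the square of height in m."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg, 1.75 m
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 1))
```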

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.metrics import f1_score, confusion_matrix, precision_recall_curve, roc_curve
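# plot_precision_recall_curve and plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes below are their replacements.
from sklearn.metrics import PrecisionRecallDisplay, RocCurveDisplay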
from sklearn.metrics import classification_report

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

In [3]:
diabetes = pd.read_csv(r"C:\Users\82109\OneDrive\바탕 화면\python study\diabetes.csv")
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [4]:
# Class imbalance in the target
diabetes['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64
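
The counts above can be turned into class shares to see how skewed the target is. The sketch below rebuilds a label column from those counts alone (it does not read `diabetes.csv`) and derives the accuracy a trivial always-predict-0 model would reach. Variable names are illustrative.

```python
import pandas as pd

# Rebuild a label column with the counts shown above: 500 zeros, 268 ones
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

# Share of each class
class_share = outcome.value_counts(normalize=True)
print(class_share)

# Accuracy of a trivial model that always predicts the majority class (0)
majority_baseline = class_share.max()
print(round(majority_baseline, 3))
```

Any accuracy score on this data should be read against that baseline, which is why precision, recall, and related metrics matter here.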

In [7]:
feature = diabetes.iloc[:, :-1]
target = diabetes['Outcome']

In [8]:
feature.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
dtypes: float64(2), int64(6)
memory usage: 48.1 KB


In [9]:
target

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

In [10]:
X_train, X_test, y_train, y_test = train_test_split(feature, target, random_state=156)

dt = DecisionTreeClassifier(random_state=156)
dt.fit(X_train, y_train)
pred = dt.predict(X_test)

accuracy_score(y_test, pred)

0.6875

In [11]:
# stratify=target keeps the 0/1 ratio the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(feature, target, shuffle=True, stratify=target, random_state=156)

dt = DecisionTreeClassifier(random_state=156)
dt.fit(X_train, y_train)
pred = dt.predict(X_test)

accuracy_score(y_test, pred)

0.6875
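
To see what `stratify` does, the sketch below splits a synthetic label array with the same class counts as the dataset (500 zeros, 268 ones) and compares the positive rate in the test set with the overall rate. The placeholder feature column and variable names are assumptions made for this illustration. It does not reproduce the accuracy figures above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the dataset: 500 zeros, 268 ones
y = np.array([0] * 500 + [1] * 268)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, values are irrelevant here

# Stratified split: each class is sampled in proportion to its share of y
_, _, _, y_test_strat = train_test_split(X, y, stratify=y, random_state=156)

overall_rate = y.mean()
test_rate = y_test_strat.mean()
print(round(overall_rate, 3), round(test_rate, 3))
```

Without `stratify`, the test-set positive rate can drift away from the overall rate depending on the random seed.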

In [13]:
# Build a confusion matrix
mycon = pd.DataFrame()
mycon['actual'] = y_test
mycon['pred'] = pred
mycon

     actual  pred
84        1     1
604       1     1
739       1     1
762       0     0
411       0     1
..      ...   ...
111       1     0
527       0     0
682       0     0
613       0     0


Each (actual, pred) pair maps to one cell of the confusion matrix:

(0,0) : TN, (0,1) : FP

(1,0) : FN, (1,1) : TP
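
The same four counts can be read off scikit-learn's `confusion_matrix`. This sketch uses only the nine rows visible in the preview above, so the numbers describe that sample and not the full test set.

```python
from sklearn.metrics import confusion_matrix

# The nine (actual, pred) rows visible in the preview above; not the full test set
actual = [1, 1, 1, 0, 0, 1, 0, 0, 0]
pred   = [1, 1, 1, 0, 1, 0, 0, 0, 0]

# For labels 0 and 1, rows are actual and columns are predicted:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(actual, pred).ravel()
print(tn, fp, fn, tp)

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
print(precision, recall)
```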

In [None]:
mycon['con'] = np.where(mycon['actual']==1, np.where(mycon['pred']==1, 'TP', 'FN'), np.where