# Summary

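Predict and evaluate disease from hospital clinical records.

In particular, this notebook builds and evaluates predictive models on data related to heart disease, which accounts for a large share of deaths in Korea.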

> <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcRVcNhoJeMapy_0AOZhDEXdipjMfrtWORcpiw&usqp=CAU" width="400px" height="300px">  
> Excluding malignant neoplasms (cancer), heart disease is cited as the leading cause of death.

### Workflow

- 1. Load the data (cite the source)
    - Kaggle ~ Heart Disease UCI(https://www.kaggle.com/ronitf/heart-disease-uci)
    
- 2. Choose the learning models
    - Compare results when using XGBoost and Adam, and expect ensembling to improve predictive performance
    - 1) Supervised Learning
        - Random Forest (Decision Tree Classifier)
        - SVM
        - KNN
    - 2) ~~Unsupervised Learning~~
        - ~~K-Means~~
    - 3) Using Neural Networks
        - RNN
    
- 3. Prediction / evaluation
    - Confusion Matrix
    - F1-score
    - k-fold cross validation
    
- 4. Further work
    - When ensembling, run everything inside a single pipeline rather than relying on hard voting
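
The evaluation and ensembling steps above can be sketched as follows. This is a minimal illustration, not the notebook's final method: it uses synthetic data from `make_classification` as a stand-in for `heart.csv`, and the choice of soft voting, the three base models, and `n_splits=5` are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart-disease data (assumption: 303 rows, 13 features).
X, y = make_classification(n_samples=303, n_features=13, random_state=0)

# Soft voting averages predicted class probabilities instead of counting votes.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),  # probability=True enables predict_proba
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="soft",
)

# Scaling and the ensemble live in one pipeline, so every CV fold refits both
# on its own training split and no test-fold statistics leak into the scaler.
pipeline = make_pipeline(StandardScaler(), ensemble)

# Stratified k-fold keeps the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="f1")

print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f}")
```

Swapping `VotingClassifier` for `StackingClassifier` would be another way to keep the ensemble inside one pipeline; which works better on the real data is an open question.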

# 1. Load Modules and Dataset

In [24]:
# for using DataFrame
import numpy as np
import pandas as pd

# for data split & model training
from sklearn.model_selection import train_test_split

# for the modelling
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier  # KNN classifier (NearestNeighbors only performs unsupervised neighbor search)


# for evaluation
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Visualization - plotting
from sklearn.metrics import PrecisionRecallDisplay  # replaces plot_precision_recall_curve, removed in scikit-learn 1.2
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
df = pd.read_csv("./dataset/heart.csv")

In [8]:
df.head()

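|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |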


In [21]:
print(f'Index: {df.index}')
print(f'Columns: {df.shape[1]}')

Index: RangeIndex(start=0, stop=303, step=1)
Columns: 14


## Column Definition

> - 1. age : age in years
> - 2. sex : (1 = male; 0 = female)
> - 3. cp : chest pain type
> - 4. trestbps : resting blood pressure (in mm Hg on admission to the hospital)
> - 5. chol : serum cholesterol in mg/dl
> - 6. fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
> - 7. restecg : resting electrocardiographic results
> - 8. thalach : maximum heart rate achieved
> - 9. exang : exercise induced angina (1 = yes; 0 = no)
> - 10. oldpeak : ST depression induced by exercise relative to rest
> - 11. slope : the slope of the peak exercise ST segment
> - 12. ca : number of major vessels (0-3) colored by fluoroscopy
> - 13. thal : 3 = normal; 6 = fixed defect; 7 = reversible defect
> - 14. target : 1 or 0
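
Given the column definitions above, one possible way to prepare features and labels is sketched below. The frame here is a hypothetical stand-in filled with random values under the same 14 column names, not the real data. Treating `cp`, `restecg`, `slope`, and `thal` as nominal codes to be one-hot encoded is an assumption, as are the 80/20 split and the stratification.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in frame: same column names, random values (not real data).
rng = np.random.default_rng(0)
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
demo = pd.DataFrame(rng.integers(0, 4, size=(100, len(columns))), columns=columns)
demo["target"] = rng.integers(0, 2, size=100)  # binary label, as in the definition

# Assumption: these columns are nominal codes, so they are one-hot encoded
# rather than fed to the models as ordered numbers.
categorical = ["cp", "restecg", "slope", "thal"]
X = pd.get_dummies(demo.drop(columns="target"), columns=categorical)
y = demo["target"]

# Stratify on the label so train and test keep a similar class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)
```

With the real notebook, `demo` would be replaced by the `df` loaded from `heart.csv`.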

In [None]:
classifiers = [
    SVC(kernel="linear", C=0.025),  # Linear SVM
    SVC(gamma=2, C=1),  # Non-linear SVM (RBF kernel by default)
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=8, max_features=1),
]
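
A list of classifiers like this one could be compared with a simple fit-and-score loop. The sketch below is only an illustration under stated assumptions: it runs on synthetic data rather than `heart.csv`, the `candidates` and `results` names are hypothetical, and the hyperparameters simply mirror those in the list.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart-disease data (assumption: 303 rows, 13 features).
X, y = make_classification(n_samples=303, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical name -> model mapping mirroring the classifier list.
candidates = {
    "Linear SVM": SVC(kernel="linear", C=0.025),
    "RBF SVM": SVC(gamma=2, C=1),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest": RandomForestClassifier(
        max_depth=5, n_estimators=8, max_features=1, random_state=0
    ),
}

results = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # zero_division=0 avoids a warning if a model predicts no positives.
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, f1={results[name]['f1']:.3f}")
    print(results[name]["confusion_matrix"])
```

A single train/test split on roughly 300 rows gives noisy scores, so k-fold cross-validation is likely the fairer comparison.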