<a href="https://colab.research.google.com/github/Redwoods/Py/blob/master/pdm2020/my-note/py-streamlit-21/st-mid-exam/heart_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EDA of Heart disease data from Cleveland clinic
- Exploratory Data Analysis
- > https://archive-beta.ics.uci.edu/ml/datasets/heart+disease
- > https://towardsdatascience.com/heart-disease-prediction-73468d630cfc
- > https://betterprogramming.pub/predicting-heart-disease-with-a-neural-network-a48d2ce59bc5
- > https://medium.com/analytics-vidhya/cleveland-eda-b73f0f62ebf8

## 1. Importing Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## 2. Data Collection

In [None]:
# Get the data from github
url = "https://github.com/Redwoods/Py/raw/master/pdm2020/my-note/py-pandas/data/cleveland.csv"
df = pd.read_csv(url, header = None)
# df = pd.read_csv('cleveland_raw.csv', header=None)

In [None]:
df.shape,df.head(10)

### Assign feature names to the columns

In [None]:
df.columns = ['age', 'sex', 'cp', 'trestbps', 'chol',
              'fbs', 'restecg', 'thalach', 'exang', 
              'oldpeak', 'slope', 'ca', 'thal', 'target']

df.head(15)

## Check & clean data
- Check for the presence of null or NaN values
- target: (0 ~ 4) => (0, 1); the values 1-4 are all assigned 1.

### Imputing data
- Check the NaN or missing values
- Impute the null data by the mean or median of the same feature
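
To make the mean-versus-median choice concrete, here is a minimal sketch on a hypothetical toy column (the values are made up for illustration and are not taken from the Cleveland data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
toy = pd.DataFrame({"ca": [0.0, 0.0, 1.0, np.nan, 3.0]})

# Count the missing values per column
n_missing = toy.isna().sum()

# Impute with the mean and with the median for comparison
mean_filled = toy["ca"].fillna(toy["ca"].mean())
median_filled = toy["ca"].fillna(toy["ca"].median())

print(n_missing["ca"])        # number of NaN values in 'ca'
print(mean_filled.iloc[3])    # value imputed with the mean
print(median_filled.iloc[3])  # value imputed with the median
```

The median is less sensitive to extreme values than the mean, which is one reason to prefer it for skewed features.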

## target: (0 ~ 4) => (0, 1); the values 1-4 are all assigned 1.

In [None]:
# https://jaaamj.tistory.com/112
df.target.unique(),df.target.nunique()

In [None]:
df.target.value_counts()

In [None]:
# target : (0 ~ 4) => (0,1), (1-4) was assigned as 1.
df['target'] = df.target.map({0: 0, 1: 1, 2: 1, 3: 1, 4: 1})
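
As an alternative to the dictionary mapping, the same binarization can be sketched with a boolean comparison. The toy Series below is hypothetical and only illustrates that both approaches agree when every value is in 0-4:

```python
import pandas as pd

# Hypothetical toy target column with the original 0-4 coding
toy_target = pd.Series([0, 2, 1, 0, 4, 3], name="target")

# Dictionary mapping, as in the cell above
mapped = toy_target.map({0: 0, 1: 1, 2: 1, 3: 1, 4: 1})

# Equivalent boolean comparison: any value greater than 0 becomes 1
binarized = (toy_target > 0).astype(int)

print(binarized.tolist())
```

One difference worth knowing: `map` turns any value missing from the dictionary into NaN, whereas the comparison silently assigns it 0 or 1.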

In [None]:
df.head(10)

## Check for null values & impute with the mean or median

In [None]:
# CHECK FOR NULL VALUES
df.isnull().values.any(), df.isna().sum()  # df.isnull().sum()

In [None]:
df[df.thal.isnull()]

In [None]:
df[df['ca'].isnull()]   # df[df.ca.isnull()]

In [None]:
# Imputing data using the median
df['thal'] = df.thal.fillna(df.thal.median())
df['ca'] = df.ca.fillna(df.ca.median())

In [None]:
# Re-check for null values after imputing
df.isnull().values.any(), df.isna().sum()  # df.isnull().sum()

In [None]:
df.head(10)

In [None]:
df.target.value_counts()

In [None]:
df.columns

## Features of heart disease data
- age: displays the age of the individual.
- sex: displays the gender of the individual using the following format :
    - 1 = male
    - 0 = female
- cp: Chest-pain type: displays the type of chest pain experienced by the individual using the following format:
    - 1 = typical angina (angina pectoris)
    - 2 = atypical angina
    - 3 = non-anginal pain
    - 4 = asymptomatic
- trestbps: Resting Blood Pressure: displays the resting blood pressure value of an individual in mmHg (unit)
- chol: Serum Cholesterol: displays the serum cholesterol in mg/dl (unit)
- fbs: Fasting Blood Sugar: compares the fasting blood sugar value of an individual with 120mg/dl.
    - If fasting blood sugar > 120mg/dl then : 1 (true)
    - else : 0 (false)
- restecg: Resting ECG : displays resting electrocardiographic results
    - 0 = normal
    - 1 = having ST-T wave abnormality
    - 2 = left ventricular hypertrophy
- thalach: Max heart rate achieved : displays the max heart rate achieved by an individual.
- exang: Exercise-induced angina (angina pectoris) :
    - 1 = yes
    - 0 = no
- oldpeak: ST depression induced by exercise relative to rest: displays the value which is an integer or float.
- slope: Peak exercise ST segment :
    - 1 = upsloping
    - 2 = flat
    - 3 = downsloping
- ca: Number of major vessels (0–3) colored by fluoroscopy : displays the value as integer or float.
- thal : displays the thalassemia (anemia) result :
    - 3 = normal
    - 6 = fixed defect
    - 7 = reversible defect
- target: Diagnosis of heart disease : Displays whether the individual is suffering from heart disease or not :
    - 0 = absence
    - 1 = present. (1,2,3,4 => 1)
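
The integer codes above can be turned into readable labels for plots and tables. The following is a minimal sketch on hypothetical toy rows. The names `CP_LABELS` and `SEX_LABELS` are illustrative and not part of the notebook.

```python
import pandas as pd

# Label dictionaries following the feature list above
CP_LABELS = {1: "typical angina", 2: "atypical angina",
             3: "non-anginal pain", 4: "asymptomatic"}
SEX_LABELS = {0: "female", 1: "male"}

# Hypothetical toy rows using the same integer coding
toy = pd.DataFrame({"sex": [1, 0, 1], "cp": [1, 4, 3]})

# Build a decoded copy so the numeric frame stays usable for correlations
decoded = toy.assign(
    sex=toy["sex"].map(SEX_LABELS),
    cp=toy["cp"].map(CP_LABELS),
)
print(decoded)
```

Decoding into a copy rather than in place keeps the original numeric columns available for `describe()` and `corr()`.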

---

## Check for duplicate data
- duplicated()
- https://pydole.tistory.com/entry/Python-pandas-%EC%A4%91%EB%B3%B5%EA%B0%92-%EC%B2%98%EB%A6%AC-duplicates-dropduplicates
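
Before running the check on the real data, it may help to see how `duplicated()` and `drop_duplicates()` behave on a small hypothetical frame (values made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame in which rows 0 and 2 are identical
toy = pd.DataFrame({"age": [63, 67, 63], "chol": [233, 286, 233]})

# duplicated() flags repeats after the first occurrence by default
dup_mask = toy.duplicated()

# keep=False flags every member of a duplicate group
dup_all = toy.duplicated(keep=False)

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates().reset_index(drop=True)

print(dup_mask.tolist(), dup_all.tolist(), len(deduped))
```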

In [None]:
df.duplicated().sum()

In [None]:
df[df.duplicated(keep=False)]

In [None]:
# Remove duplicate samples
df.drop_duplicates(subset=df.columns, inplace=True) # drop rows that are identical across all columns

# After removing duplicates, check the total number of samples.
print('Total number of samples:',len(df))

> ## Finalize and save the data

In [None]:
# df.to_csv('heart.csv', index=False)

## 3. Explore Data

In [None]:
df.describe()

In [None]:
df.describe().T

In [None]:
df.info()

### Check the balance of classes in the data through plot

In [None]:
classes=df.target
classes.value_counts(), type(classes)

In [None]:
classes.value_counts(ascending=True)

In [None]:
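# Check the class balance with a count plot
classes = df.target
ax = sns.countplot(x=classes)  # pass the Series by keyword; recent seaborn versions expect x=
ax.set_xticklabels(['noHD', 'HD'])
# Look up the counts by label so the result does not depend on the sort order
counts = classes.value_counts()
noHD, HD = counts.loc[0], counts.loc[1]
print('False: non-Heart Disease', noHD)
print('True: Heart Disease', HD)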
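
Besides the plot, class balance can be summarized numerically. This is a minimal sketch on a hypothetical toy target column. The name `imbalance_ratio` is illustrative.

```python
import pandas as pd

# Hypothetical toy target column (0 = no heart disease, 1 = heart disease)
toy_target = pd.Series([0, 0, 0, 1, 1], name="target")

# Absolute counts and relative frequencies per class
counts = toy_target.value_counts()
shares = toy_target.value_counts(normalize=True)

# Ratio of the majority class to the minority class
imbalance_ratio = counts.max() / counts.min()

print(shares.loc[0], shares.loc[1], imbalance_ratio)
```

A ratio close to 1 suggests the classes are roughly balanced.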

In [None]:
### 1 = male, 0 = female
# df['sex'] = df.sex.map({0: 'female', 1: 'male'})

In [None]:
df.tail()

In [None]:
# # barplot of age vs sex with hue = target
# sns.catplot(kind = 'bar', data = df, y = 'age', x = 'sex', hue = 'target', hue_order=[1,0])#, color='br') #order = df['target'].sort_values().unique())
# plt.title('Distribution of age vs sex with the target class')
# plt.show()

***

In [None]:
# plot histograms for each variable
df.hist(figsize = (12, 12))
plt.show()

## DIY: Investigate the correlations between features.

In [None]:
df.corr()

In [None]:
# correlation plot of df

plt.figure(figsize=(12,10))
g=sns.heatmap(df.corr(),annot=True,cmap='coolwarm', #cmap= "RdYlGn",
             vmin=-1, vmax=1)

In [None]:
# correlation plot of df.iloc[:,:-1]

plt.figure(figsize=(12,10))
g=sns.heatmap(df.iloc[:,:-1].corr(),annot=True,cmap='coolwarm', #cmap= "RdYlGn",
             vmin=-1, vmax=1)
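
The heatmaps show all pairwise correlations at once. To read off which features relate most strongly to `target`, one option is to sort that single column by absolute value. The sketch below uses hypothetical synthetic data with made-up column names (`x_pos`, `x_neg`, `noise`), not the heart data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'x_pos' rises with the target, 'x_neg' falls, 'noise' is unrelated
rng = np.random.default_rng(0)
target = np.repeat([0, 1], 50)
toy = pd.DataFrame({
    "x_pos": target + rng.normal(0, 0.3, 100),
    "x_neg": -target + rng.normal(0, 0.3, 100),
    "noise": rng.normal(0, 1, 100),
    "target": target,
})

# Pearson correlation of every feature with the target
corr_with_target = toy.corr()["target"].drop("target")

# Sort by absolute strength so strong negative correlations rank as high as positive ones
ranked = corr_with_target.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Sorting by absolute value matters because a strongly negative correlation is as informative as a strongly positive one.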

***

# Use df in a Streamlit web app.

---

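### Correlation analysis results

> ### **[DIY] Variables with strong positive or negative correlations need more detailed visualization.**

### [DIY] Add more detailed visualizations of strongly correlated/anti-correlated variables to the web app.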
![](https://github.com/Redwoods/Py/blob/master/pdm2020/my-note/py-pandas/data/heart_corr.png?raw=true)
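
One way to pick candidates for the more detailed visualizations is to list the feature pairs whose absolute correlation exceeds a cutoff. The helper `strong_pairs` and its default threshold of 0.4 are assumptions for illustration, and the toy data is hypothetical:

```python
from itertools import combinations

import pandas as pd

def strong_pairs(frame: pd.DataFrame, threshold: float = 0.4) -> pd.Series:
    """Return feature pairs whose |Pearson r| is at least `threshold`."""
    corr = frame.corr()
    # Visit each unordered pair of columns once (the diagonal is skipped)
    pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
    values = pd.Series(pairs, dtype=float)
    strong = values[values.abs() >= threshold]
    # Strongest relationships first, regardless of sign
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)

# Hypothetical toy data: b = -a exactly, c is only weakly related to both
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                    "b": [-1, -2, -3, -4, -5, -6],
                    "c": [1, -1, 1, -1, 1, -1]})
result = strong_pairs(toy)
print(result)
```

Each pair returned by such a helper could then be plotted individually, for example as a scatter plot or as a box plot grouped by `target`.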

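# Midterm practical exam file submission
0. Submission file name: heart_EDA_pdmnn.py   (nn is your id)
1. Submit to the py-streamlit/st-mid-exam folder in your 'pdmnn' GitHub repo.
2. After submitting, send the file once more to chaos21@gmail.com. (This determines the submission time.)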