<a href="https://colab.research.google.com/github/Ahmed-Ashraf-Marzouk/data-mining-algorithms/blob/main/Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # required by the plt.show() call in the plotting cell
import seaborn as sea

# %matplotlib

Using matplotlib backend: agg


# Loading the Data

In [None]:
!gdown "1uDSWhhUBPFjjUJmsh4-FLiRM2AakrC8f"
!unrar x "./dataset.rar" -idq

# Data Wrangling

## Column Description
1. **age** : Age of the patient
2. **Gender** : Sex of the patient
3. **exang**: exercise induced angina (1 = yes; 0 = no)
4. **ca**: number of major vessels (0-3)
5. **cp** : chest pain type
    - Value 1: typical angina
    - Value 2: atypical angina
    - Value 3: non-anginal pain
    - Value 4: asymptomatic

6. **trtbps** : resting blood pressure (in mm Hg)
7. **chol** : cholesterol in mg/dl fetched via BMI sensor
8. **fbs** : fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
9. **rest_ecg** : resting electrocardiographic results
    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

10. **thalach** : maximum heart rate achieved
11. **oldpeak** : ST depression induced by exercise relative to rest
12. **slp** : slope of the peak exercise ST segment
13. **thall** : Thal rate
14. **Output** : 0 = less chance of heart attack; 1 = more chance of heart attack
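
The integer codes described above are easier to read in tables and plots once they are mapped to their labels. The following is a minimal sketch, assuming the `Rest_ecg` codes listed in the column description; the `sample` frame and the `Rest_ecg_label` column are hypothetical and only stand in for the real dataset.

```python
import pandas as pd

# Labels taken from the column description for rest_ecg.
rest_ecg_labels = {
    0: "normal",
    1: "ST-T wave abnormality",
    2: "left ventricular hypertrophy",
}

# Hypothetical sample values, not rows from the real dataset.
sample = pd.DataFrame({"Rest_ecg": [0, 1, 2, 1]})

# Series.map replaces each code with its label; unknown codes become NaN.
sample["Rest_ecg_label"] = sample["Rest_ecg"].map(rest_ecg_labels)
print(sample)
```

Keeping the original integer column alongside the label column leaves the data usable for models that expect numeric input.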

In [None]:
col_names = ['Age','Gender','CP', 'Trtbps', 'Chol', 'FBS', 'Rest_ecg', 'Thalach', 'Exang', 'Old_peak', 'Slope', 'CA', 'Thall', 'Output']
df = pd.read_csv('dataset/heart.csv', header = 0, names = col_names)
rearranged_col_names = ['Age', 'Gender', 'Exang', 'CA', 'CP', 'Trtbps', 'Chol', 'FBS', 'Rest_ecg', 'Thalach', 'Old_peak', 'Slope','Thall', 'Output']
df = df[rearranged_col_names]
df.head()

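| | Age | Gender | Exang | CA | CP | Trtbps | Chol | FBS | Rest_ecg | Thalach | Old_peak | Slope | Thall | Output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 0 | 0 | 3 | 145 | 233 | 1 | 0 | 150 | 2.3 | 0 | 1 | 1 |
| 1 | 37 | 1 | 0 | 0 | 2 | 130 | 250 | 0 | 1 | 187 | 3.5 | 0 | 2 | 1 |
| 2 | 41 | 0 | 0 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 1.4 | 2 | 2 | 1 |
| 3 | 56 | 1 | 0 | 0 | 1 | 120 | 236 | 0 | 1 | 178 | 0.8 | 2 | 2 | 1 |
| 4 | 57 | 0 | 1 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 0.6 | 2 | 2 | 1 |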
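
The loading cell relies on two pandas behaviours: `header=0` combined with `names` discards the file's own header row and applies the supplied names, and indexing with a list of column names reorders the columns. The sketch below illustrates both on a tiny in-memory CSV; the CSV text and the `df_small` name are hypothetical.

```python
import io

import pandas as pd

# Hypothetical three-column CSV with its own lowercase header row.
csv_text = "age,sex,cp\n63,1,3\n37,1,2\n"

# header=0 marks the first line as a header; names= replaces it.
col_names = ["Age", "Gender", "CP"]
df_small = pd.read_csv(io.StringIO(csv_text), header=0, names=col_names)

# Selecting with a list of names returns the columns in that order.
df_small = df_small[["Age", "CP", "Gender"]]
print(df_small)
```

Without `header=0`, the original header line would be read as a data row, so the two arguments are best kept together.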




---



## Checking the Data

In [None]:
df.info()  # No null values in the dataset
# 303 examples
# 14 columns (13 features + 1 target)
# Shape: (303, 14)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Age       303 non-null    int64  
 1   Gender    303 non-null    int64  
 2   Exang     303 non-null    int64  
 3   CA        303 non-null    int64  
 4   CP        303 non-null    int64  
 5   Trtbps    303 non-null    int64  
 6   Chol      303 non-null    int64  
 7   FBS       303 non-null    int64  
 8   Rest_ecg  303 non-null    int64  
 9   Thalach   303 non-null    int64  
 10  Old_peak  303 non-null    float64
 11  Slope     303 non-null    int64  
 12  Thall     303 non-null    int64  
 13  Output    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
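
`df.info()` shows non-null counts, from which missing values have to be read off by eye. A more direct check counts them per column. This is a minimal sketch on a hypothetical `sample` frame with one deliberately missing value; `missing_report` is an illustrative helper, not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the heart dataset; values are invented.
sample = pd.DataFrame({
    "Age": [63, 37, 41, 56],
    "Chol": [233, np.nan, 204, 236],
    "Output": [1, 1, 0, 1],
})


def missing_report(frame):
    """Return the number of missing values per column, largest first."""
    return frame.isnull().sum().sort_values(ascending=False)


report = missing_report(sample)
print(report)
```

On the real dataset, a report of all zeros would confirm the "no null values" observation.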


In [None]:
df.describe()
# Age ranges between 29 and 77, which is an acceptable range.
# Number of major vessels (column CA) should be between 0 and 3, but a value of 4 appears.
# Resting blood pressure (column Trtbps) has a maximum value of 200 mm Hg; is this plausible?
# Maximum heart rate achieved (column Thalach) has a maximum value of 202; is this plausible?

| | Age | Gender | Exang | CA | CP | Trtbps | Chol | FBS | Rest_ecg | Thalach | Old_peak | Slope | Thall | Output |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.326733 | 0.729373 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 1.039604 | 1.39934 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 0.469794 | 1.022606 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 1.161075 | 0.616226 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 1.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 0.0 | 0.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.8 | 1.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 1.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.6 | 2.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 1.0 | 4.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 6.2 | 2.0 | 3.0 | 1.0 |
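
The summary statistics only reveal that a value outside the documented range exists, not how many rows are affected. The sketch below counts out-of-range values per column against inclusive bounds. The `sample` frame is hypothetical, `out_of_range_counts` is an illustrative helper, and the bounds are assumptions taken from the column description (`CA` in 0-3, `Exang` in 0-1).

```python
import pandas as pd

# Hypothetical sample rows; values are invented for illustration.
sample = pd.DataFrame({
    "CA": [0, 2, 4, 1, 3],
    "Exang": [0, 1, 0, 1, 0],
})

# Inclusive (low, high) bounds assumed from the column description.
expected_ranges = {"CA": (0, 3), "Exang": (0, 1)}


def out_of_range_counts(frame, ranges):
    """Count values falling outside the inclusive bounds for each column."""
    counts = {}
    for column, (low, high) in ranges.items():
        inside = frame[column].between(low, high)  # inclusive on both ends
        counts[column] = int((~inside).sum())
    return counts


violations = out_of_range_counts(sample, expected_ranges)
print(violations)  # {'CA': 1, 'Exang': 0}
```

Whether such rows should be dropped, corrected, or kept depends on how the dataset encodes missing or unknown values, which the column description does not state.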


In [None]:
df['Output'].value_counts(normalize=True)  # Classes are roughly balanced (about 54% vs. 46%)

1    0.544554
0    0.455446
Name: Output, dtype: float64
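
Class balance can also be summarized as a single number: the ratio of the largest class count to the smallest. A ratio near 1 indicates balanced classes. This is a minimal sketch on a hypothetical `labels` series; the name `imbalance_ratio` is illustrative.

```python
import pandas as pd

# Hypothetical labels, not the real Output column.
labels = pd.Series([1, 1, 1, 0, 0], name="Output")

# Absolute counts per class, most frequent first.
counts = labels.value_counts()

# Majority-to-minority ratio; 1.0 means perfectly balanced classes.
imbalance_ratio = counts.max() / counts.min()
print(counts)
print(imbalance_ratio)  # 1.5
```

For the proportions reported above (0.544554 and 0.455446), the same ratio is roughly 1.2.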

In [None]:
sea.boxplot(data=df, x='Output', y='Age')  # Age distribution for each Output class
plt.show()
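
A box plot of age per class can be complemented with the numbers it draws. The sketch below groups a hypothetical `sample` frame by `Output` and computes the median and the extremes of `Age` for each class; the values are invented and the `summary` name is illustrative.

```python
import pandas as pd

# Hypothetical sample; three rows per class with invented ages.
sample = pd.DataFrame({
    "Age": [40, 50, 60, 45, 55, 65],
    "Output": [0, 0, 0, 1, 1, 1],
})

# One row per class, one column per aggregate.
summary = sample.groupby("Output")["Age"].agg(["min", "median", "max"])
print(summary)
```

On the real data, `df.groupby('Output')['Age'].describe()` would additionally give the quartiles that form the boxes.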