# Project Title: Heart Disease Prediction
## Short Description:
- Heart disease is one of the leading causes of death globally. People who have heart disease, or who are at high risk of it, need early intervention to prevent adverse outcomes. In this project, we use a dataset with 11 features considered vital in identifying people with heart disease, and we evaluate four machine learning models on it (*KNN, Decision Tree, Random Forest, and Naive Bayes*) to predict the likelihood of a person having heart disease.

## About the dataset
### Information about the features
1. **Age:** *(years)*
   - patient's age
2. **Sex:** *(M: Male, F: Female)*
   - patient's sex
3. **ChestPainType:** *(TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic)*
   - type of patient's chest pain
4. **RestingBP:** *(mm Hg)*
   - resting blood pressure
5. **Cholesterol:** *(mg/dl)*
   - serum cholesterol
6. **FastingBS:** *(1: if FastingBS > 120 mg/dl, 0: otherwise)*
   - fasting blood sugar
7. **RestingECG:** *[Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]*
   - resting electrocardiogram results
8. **MaxHR:** *(Numeric value between 60 and 202)*
   - maximum heart rate achieved
9.  **ExerciseAngina:** *(Y: Yes, N: No)*
    - exercise-induced angina
10. **Oldpeak:** *(Numeric value, amount of ST depression)*
    - ST depression induced by exercise relative to rest
    - ST depression is an electrocardiogram finding in which the trace in the ST segment sits abnormally low, below the baseline
11. **ST_Slope:** *(Up: upsloping, Flat: flat, Down: downsloping)*
    - the slope of the peak exercise ST segment
12. **HeartDisease:** *(1: Heart disease, 0: Normal)*
    - target output
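The code lists above can double as a quick sanity check once the data is loaded. The following is a minimal sketch, not part of the original notebook: `ALLOWED_CODES` and `find_unexpected_codes` are hypothetical names, and the sample frame is made up for illustration.

```python
import pandas as pd

# Allowed codes per categorical feature, copied from the feature list above.
ALLOWED_CODES = {
    "Sex": {"M", "F"},
    "ChestPainType": {"TA", "ATA", "NAP", "ASY"},
    "RestingECG": {"Normal", "ST", "LVH"},
    "ExerciseAngina": {"Y", "N"},
    "ST_Slope": {"Up", "Flat", "Down"},
}

def find_unexpected_codes(frame, allowed=ALLOWED_CODES):
    """Return {column: set of values that are not in the documented codes}."""
    unexpected = {}
    for column, codes in allowed.items():
        if column not in frame.columns:
            continue  # skip columns the frame does not have
        extra = set(frame[column].dropna().unique()) - codes
        if extra:
            unexpected[column] = extra
    return unexpected

# Made-up sample (not taken from the dataset) with one undocumented code.
sample = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "XYZ"],
    "RestingECG": ["Normal", "ST", "LVH"],
    "ExerciseAngina": ["N", "Y", "N"],
    "ST_Slope": ["Up", "Flat", "Down"],
})
problems = find_unexpected_codes(sample)
print(problems)
```

An empty result would suggest that every categorical column only uses the documented codes.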

---

To learn more about the *source, citation,* and the *creators* of the dataset, click [here](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction).

---

## Step 1: Import the libraries

In [3]:
# Basic necessities
import pandas as pd
import numpy as np
import matplotlib.pyplot as pl
import seaborn as sn

# Renders the plots inline, directly below the cell that creates them
%matplotlib inline

# Models and preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

## Step 2: Exploring the Dataset
- Covers loading the dataset and summarizing its contents.

In [4]:
df = pd.read_csv('./dataset.csv')

# First 5 records of the dataset
df.head()

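|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |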


In [5]:
# Last 5 records of the dataset
df.tail()

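|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 913 | 45 | M | TA | 110 | 264 | 0 | Normal | 132 | N | 1.2 | Flat | 1 |
| 914 | 68 | M | ASY | 144 | 193 | 1 | Normal | 141 | N | 3.4 | Flat | 1 |
| 915 | 57 | M | ASY | 130 | 131 | 0 | Normal | 115 | Y | 1.2 | Flat | 1 |
| 916 | 57 | F | ATA | 130 | 236 | 0 | LVH | 174 | N | 0.0 | Flat | 1 |
| 917 | 38 | M | NAP | 138 | 175 | 0 | Normal | 173 | N | 0.0 | Up | 0 |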


In [6]:
# Number of records and columns the dataset contains
df.shape   # (records, columns)

(918, 12)

In [7]:
# Basic information about the dataset
# Shows the total number of rows and columns, each attribute's type, and the number of non-null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [8]:
# Summary statistics of the numeric attributes
df.describe()

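|   | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|---|-----|-----------|-------------|-----------|-------|---------|--------------|
| count | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 |
| mean | 53.510893 | 132.396514 | 198.799564 | 0.233115 | 136.809368 | 0.887364 | 0.553377 |
| std | 9.432617 | 18.514154 | 109.384145 | 0.423046 | 25.460334 | 1.06657 | 0.497414 |
| min | 28.0 | 0.0 | 0.0 | 0.0 | 60.0 | -2.6 | 0.0 |
| 25% | 47.0 | 120.0 | 173.25 | 0.0 | 120.0 | 0.0 | 0.0 |
| 50% | 54.0 | 130.0 | 223.0 | 0.0 | 138.0 | 0.6 | 1.0 |
| 75% | 60.0 | 140.0 | 267.0 | 0.0 | 156.0 | 1.5 | 1.0 |
| max | 77.0 | 200.0 | 603.0 | 1.0 | 202.0 | 6.2 | 1.0 |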


- Looking at the summary statistics of the numeric attributes, we noticed an unusual value: the minimum of RestingBP is 0. The minimum of Cholesterol is also 0, which looks similarly suspicious.
- We are not medical experts, but a resting blood pressure of 0 mm Hg is not a plausible measurement. We will deal with it later, in the data preprocessing step.
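A quick way to see how widespread such placeholder zeros are is to count them per column. The sketch below is illustrative only: `toy` and `SUSPECT_COLUMNS` are hypothetical, the values are made up, and which columns should never be 0 is an assumption, not a medical statement.

```python
import pandas as pd

# Made-up miniature of three numeric columns; the zeros are planted to mimic
# the 0 minimums shown in the summary statistics.
toy = pd.DataFrame({
    "RestingBP": [140, 0, 130, 120],
    "Cholesterol": [289, 0, 0, 214],
    "MaxHR": [172, 155, 98, 108],
})

# Assumption: in these columns a 0 is a placeholder, not a real measurement.
SUSPECT_COLUMNS = ["RestingBP", "Cholesterol", "MaxHR"]

# Comparing to 0 gives a boolean frame; summing it counts the zeros per column.
zero_counts = (toy[SUSPECT_COLUMNS] == 0).sum()
print(zero_counts)
```

Running the same comparison on `df` would show how many records each suspicious minimum actually affects.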

In [9]:
# Number of null values
df.isnull().sum()

Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64

In [10]:
# Number of duplicate records (rows that exactly repeat an earlier row)
df.duplicated().sum()

0
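For reference, `duplicated()` compares whole rows by default and marks only the repeats, not the first occurrence. A small sketch on made-up data (the `toy` frame is hypothetical) shows the behavior:

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [40, 49, 40, 40],
    "Sex": ["M", "F", "M", "M"],
    "RestingBP": [140, 160, 140, 150],
})

# Row 2 repeats row 0 exactly; row 3 differs in RestingBP, so it is not a duplicate.
duplicate_mask = toy.duplicated()
n_duplicates = int(duplicate_mask.sum())

# drop_duplicates() keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates()
print(n_duplicates, len(deduplicated))
```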

## Step 3: Data Preprocessing

In [21]:
# Separating the numeric and non-numeric attributes
cat_df = df.select_dtypes(include=object)
num_df = df.select_dtypes(exclude=object)

# Creating a dataframe of numeric attributes excluding the target output.
num_att_df = num_df.drop("HeartDisease", axis=1)

In [22]:
cat_df

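|   | Sex | ChestPainType | RestingECG | ExerciseAngina | ST_Slope |
|---|-----|---------------|------------|----------------|----------|
| 0 | M | ATA | Normal | N | Up |
| 1 | F | NAP | Normal | N | Flat |
| 2 | M | ATA | ST | N | Up |
| 3 | F | ASY | Normal | Y | Flat |
| 4 | M | NAP | Normal | N | Up |
| ... | ... | ... | ... | ... | ... |
| 913 | M | TA | Normal | N | Flat |
| 914 | M | ASY | Normal | N | Flat |
| 915 | M | ASY | Normal | Y | Flat |
| 916 | F | ATA | LVH | N | Flat |
| 917 | M | NAP | Normal | N | Up |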


In [23]:
num_df

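|   | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|---|-----|-----------|-------------|-----------|-------|---------|--------------|
| 0 | 40 | 140 | 289 | 0 | 172 | 0.0 | 0 |
| 1 | 49 | 160 | 180 | 0 | 156 | 1.0 | 1 |
| 2 | 37 | 130 | 283 | 0 | 98 | 0.0 | 0 |
| 3 | 48 | 138 | 214 | 0 | 108 | 1.5 | 1 |
| 4 | 54 | 150 | 195 | 0 | 122 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 913 | 45 | 110 | 264 | 0 | 132 | 1.2 | 1 |
| 914 | 68 | 144 | 193 | 1 | 141 | 3.4 | 1 |
| 915 | 57 | 130 | 131 | 0 | 115 | 1.2 | 1 |
| 916 | 57 | 130 | 236 | 0 | 174 | 0.0 | 1 |
| 917 | 38 | 138 | 175 | 0 | 173 | 0.0 | 0 |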


In [24]:
num_df.describe()

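|   | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|---|-----|-----------|-------------|-----------|-------|---------|--------------|
| count | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 | 918.0 |
| mean | 53.510893 | 132.396514 | 198.799564 | 0.233115 | 136.809368 | 0.887364 | 0.553377 |
| std | 9.432617 | 18.514154 | 109.384145 | 0.423046 | 25.460334 | 1.06657 | 0.497414 |
| min | 28.0 | 0.0 | 0.0 | 0.0 | 60.0 | -2.6 | 0.0 |
| 25% | 47.0 | 120.0 | 173.25 | 0.0 | 120.0 | 0.0 | 0.0 |
| 50% | 54.0 | 130.0 | 223.0 | 0.0 | 138.0 | 0.6 | 1.0 |
| 75% | 60.0 | 140.0 | 267.0 | 0.0 | 156.0 | 1.5 | 1.0 |
| max | 77.0 | 200.0 | 603.0 | 1.0 | 202.0 | 6.2 | 1.0 |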


- As noted earlier, RestingBP has an implausible minimum of 0. Let's look at the record(s) where RestingBP is 0.

In [25]:
num_df[num_df["RestingBP"]==0]

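|   | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|---|-----|-----------|-------------|-----------|-------|---------|--------------|
| 449 | 55 | 0 | 0 | 0 | 155 | 1.5 | 1 |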


- Fortunately, only one record (index 449) has a RestingBP of 0. Removing one record out of 918 should not have a significant impact on the performance of the models, so we simply drop it.

In [26]:
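# inplace=True modifies num_df directly instead of returning a new dataframe.
# Note: cat_df and num_att_df were created before this step, so they still contain the record.
num_df.drop(index=num_df[num_df["RestingBP"]==0].index, inplace=True)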

num_df.describe()

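|   | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|---|-----|-----------|-------------|-----------|-------|---------|--------------|
| count | 917.0 | 917.0 | 917.0 | 917.0 | 917.0 | 917.0 | 917.0 |
| mean | 53.509269 | 132.540894 | 199.016358 | 0.23337 | 136.789531 | 0.886696 | 0.55289 |
| std | 9.437636 | 17.999749 | 109.24633 | 0.423206 | 25.467129 | 1.06696 | 0.497466 |
| min | 28.0 | 80.0 | 0.0 | 0.0 | 60.0 | -2.6 | 0.0 |
| 25% | 47.0 | 120.0 | 174.0 | 0.0 | 120.0 | 0.0 | 0.0 |
| 50% | 54.0 | 130.0 | 223.0 | 0.0 | 138.0 | 0.6 | 1.0 |
| 75% | 60.0 | 140.0 | 267.0 | 0.0 | 156.0 | 1.5 | 1.0 |
| max | 77.0 | 200.0 | 603.0 | 1.0 | 202.0 | 6.2 | 1.0 |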
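Because the drop above is applied to `num_df` only, frames derived earlier keep the removed record and end up one row longer. One possible way to avoid that, shown here as a sketch on made-up data rather than as the notebook's method, is to filter the full frame first and split afterwards. The names `toy`, `clean`, `cat_part`, `num_part`, and `num_features` are hypothetical, and the `"number"` selector is swapped in for the `object` selector used above.

```python
import pandas as pd

# Made-up stand-in for df with one invalid RestingBP value.
toy = pd.DataFrame({
    "Age": [40, 55, 37],
    "Sex": ["M", "M", "F"],
    "RestingBP": [140, 0, 130],
    "HeartDisease": [0, 1, 0],
})

# Remove the invalid rows from the full frame before any split.
clean = toy[toy["RestingBP"] != 0].copy()

# Split afterwards, so every derived frame shares the same index.
num_part = clean.select_dtypes(include="number")
cat_part = clean.select_dtypes(exclude="number")
num_features = num_part.drop(columns="HeartDisease")

print(cat_part.index.equals(num_part.index))
```

With this order, the categorical and numeric parts can later be recombined without index mismatches.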
