# Heart Disease Classification

> In this notebook, we build a model that predicts whether a patient has heart disease.



## Importing the libraries

We import the data science and machine learning libraries used throughout the notebook:

* Pandas as **`pd`**
* Numpy as **`np`**
* Matplotlib as **`plt`**

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Import and analyze the data

In [2]:
# Read the csv file
df = pd.read_csv("heart_disease.csv")

# To confirm that the file was read, let's view the first 5 rows
df.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


In [3]:
# Check the size of the dataset
len(df)

1025

In [4]:
# Drop duplicate rows from the df; otherwise identical rows can land in both the
# training and test sets and inflate the test score without us noticing
df = df.drop_duplicates()

In [5]:
len(df)

302
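
The drop from 1025 to 302 rows means 723 rows were exact repeats. Before dropping anything, it can help to count the duplicates first. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from `heart_disease.csv`) to show how `duplicated()` and `drop_duplicates()` relate.

```python
import pandas as pd

# Tiny made-up frame standing in for the real data; the values are illustrative only.
toy = pd.DataFrame(
    {
        "age": [52, 53, 52, 70, 53],
        "chol": [212, 203, 212, 174, 203],
        "target": [0, 0, 0, 1, 0],
    }
)

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates()

print(n_duplicates, len(toy), len(deduped))
```

On the real data, `df.duplicated().sum()` run before the drop would be expected to report the number of rows that `drop_duplicates()` removes.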

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 302 entries, 0 to 878
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       302 non-null    int64  
 1   sex       302 non-null    int64  
 2   cp        302 non-null    int64  
 3   trestbps  302 non-null    int64  
 4   chol      302 non-null    int64  
 5   fbs       302 non-null    int64  
 6   restecg   302 non-null    int64  
 7   thalach   302 non-null    int64  
 8   exang     302 non-null    int64  
 9   oldpeak   302 non-null    float64
 10  slope     302 non-null    int64  
 11  ca        302 non-null    int64  
 12  thal      302 non-null    int64  
 13  target    302 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 35.4 KB


In [7]:
df.describe()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0,302.0
mean,54.42053,0.682119,0.963576,131.602649,246.5,0.149007,0.52649,149.569536,0.327815,1.043046,1.397351,0.718543,2.31457,0.543046
std,9.04797,0.466426,1.032044,17.563394,51.753489,0.356686,0.526027,22.903527,0.470196,1.161452,0.616274,1.006748,0.613026,0.49897
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,48.0,0.0,0.0,120.0,211.0,0.0,0.0,133.25,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.5,1.0,1.0,130.0,240.5,0.0,1.0,152.5,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.75,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [8]:
# Check how many positive and negative samples we have
df["target"].value_counts()

1    164
0    138
Name: target, dtype: int64
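
Raw counts are easier to judge as proportions. As a small sketch, the snippet below rebuilds a column with the same class counts as above (the `target_demo` series is a stand-in, not the real column) and uses `value_counts(normalize=True)` to get the fractions.

```python
import pandas as pd

# Stand-in column with the class counts reported above: 164 positive, 138 negative.
target_demo = pd.Series([1] * 164 + [0] * 138, name="target")

# normalize=True returns the fraction of rows in each class instead of raw counts.
proportions = target_demo.value_counts(normalize=True)

print(proportions.round(3))
```

Roughly 54% of the samples are positive, which matches the `target` mean shown by `df.describe()`, so the classes are close to balanced.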

## Preprocess the data

### Create the X and y variables


In [9]:
# Create the X variable - the df excluding the `target` column
X = df.drop("target", axis=1)

# Create the y variable - the `target` column
y = df["target"]

In [10]:
# Confirm that the X variable is what we are expecting
X.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2


In [11]:
# Confirm that the y variable is what we are expecting
y.head()

0    0
1    0
2    0
3    0
4    0
Name: target, dtype: int64

### Create the training and testing datasets

In [12]:
# Import the train_test_split() function from scikit-learn
from sklearn.model_selection import train_test_split

# Use the train_test_split() function to split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [13]:
len(X_train), len(y_train), len(X_test), len(y_test)


(241, 241, 61, 61)
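
The split above is random on every run, so the scores that follow will vary between executions. One option, shown here only as a sketch, is to pass `random_state` for reproducibility and `stratify` to keep the class ratio the same in both sets. Neither argument is used in the original cell, and the data below (`X_demo`, `y_demo`, the feature names) is synthetic, sized to match the 302-row dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 302 rows, three made-up features, same class counts as the notebook.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(302, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series([1] * 164 + [0] * 138, name="target")

# random_state fixes the shuffle; stratify keeps the positive/negative ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(len(X_tr), len(X_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

The split sizes stay the same as above; only the composition of each set becomes repeatable and class-balanced.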

## Prepare the Machine Learning Model

As our problem is a classification problem, we are going to use the `RandomForestClassifier()` model.



In [30]:
# Import the model from scikit-learn library
from sklearn.ensemble import RandomForestClassifier

# Make the model
model = RandomForestClassifier()

# Train the model
model.fit(X_train, y_train)

# Evaluate the model
model.score(X_test, y_test)

0.8032786885245902
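
For a classifier, `score()` returns the mean accuracy, so the value above corresponds to 49 of the 61 test samples being predicted correctly. The sketch below checks that equivalence on synthetic data from `make_classification` (the `*_demo` names and the generated data are assumptions for illustration, not the heart disease dataset).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data with the same shape as the notebook's X (302 x 13).
X_demo, y_demo = make_classification(n_samples=302, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# random_state makes the forest itself reproducible as well.
demo_model = RandomForestClassifier(random_state=0)
demo_model.fit(X_tr, y_tr)

# score() and accuracy_score() on the predictions give the same number.
score = demo_model.score(X_te, y_te)
manual_accuracy = accuracy_score(y_te, demo_model.predict(X_te))

print(score, manual_accuracy)
```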

### Is this a good score?

Generally speaking, a model that predicts with 80% accuracy is definitely picking up patterns in the data.

However, since the subject of this model, heart disease, is sensitive, 80% may not be good enough.

Accuracy is also arguably not the best measure of this model's competency. Our main aim is not the highest accuracy score but to make sure that no one who is actually in danger of heart disease is predicted to be free of disease, even if that means some people who are not in danger are classified as in danger. Those patients will eventually be examined by a doctor anyway.

Therefore, **`recall`** is a better evaluation metric for this problem.
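
Recall is the fraction of actual positive cases that the model catches: true positives divided by true positives plus false negatives. As a minimal sketch, the snippet below computes it with scikit-learn's `recall_score` and `confusion_matrix` on hand-made labels (`y_true_demo` and `y_pred_demo` are illustrative, not the model's predictions).

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hand-made labels: 1 = heart disease, 0 = no heart disease.
y_true_demo = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

# Recall = tp / (tp + fn): how many truly sick patients were flagged.
recall = recall_score(y_true_demo, y_pred_demo)
accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(tn, fp, fn, tp)     # 3 2 1 4
print(accuracy, recall)   # 0.7 0.8
```

In this toy case accuracy is 0.7 while recall is 0.8: the two false alarms lower accuracy but not recall, whereas the single missed patient is the only error recall penalizes. For the notebook's model, the same call would be `recall_score(y_test, model.predict(X_test))`.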