# Introduction

This data science project aims to predict heart disease from real patient data. Each row in the dataset represents a patient and each column represents a medical attribute. The task is binary classification: heart disease is either present or absent.

## Coronary Artery Disease

This dataset concerns Coronary Artery Disease (CAD), the gradual narrowing of the heart's arteries by plaque. The medical attributes in the dataset serve as indicators of the disease.

In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import joblib

from statsmodels.formula.api import glm
from statsmodels.formula.api import logit
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_fscore_support

from sklearn.model_selection import train_test_split, cross_val_score, KFold

# Data Cleaning and Preprocessing

In [7]:
df = pd.read_csv("data/Heart_Disease_Prediction.csv")
print("Total number of missing values:", df.isna().sum().sum())

df.columns = [col.replace(' ', '_') for col in df.columns]
df["Heart_Disease"] = df["Heart_Disease"].map({"Presence" : 1, "Absence" : 0})

FileNotFoundError: [Errno 2] No such file or directory: '.data/Heart_Disease_Prediction.csv'

There are no missing values, so no rows need to be removed.

In [None]:
df.head()

## Exploratory Data Analysis

In [None]:
print("Here is the information regarding the dataset")
print("Shape: ", df.shape)
print("Number of dimensions: ", df.ndim)

df.info()

## Heatmap of the medical attributes

In [None]:
correlation_mat = df.corr()

plt.figure(figsize = (12, 6))
plt.title("Correlation matrix of the medical attributes", fontweight = "bold", pad = 20,  fontsize = 15)
sns.heatmap(correlation_mat, annot = True, fmt = ".2f", center = 0, linecolor = "white")
plt.savefig('plots_and_model/correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

## Further plotting of the important relationships:

In [None]:
plt.figure(figsize = (10, 5))
plt.title("Point plot of heart disease against Thallium", fontweight = "bold", fontsize = 15)
sns.pointplot(x='Thallium', y ='Heart_Disease', data=df, color = "red",  markers='D', markersize=10, linewidth = 3)
plt.ylabel("Heart Disease Indicator")
plt.tight_layout()
sns.despine()
plt.show()

In [None]:
plt.figure(figsize = (10, 5))
sns.pointplot(data=df, x='Number_of_vessels_fluro', y='Heart_Disease', errorbar='ci', markers='D', markersize=10, linewidth=3,
              color='purple') 
plt.title('Number of Calcified Arteries vs Heart Disease', fontweight = "bold", fontsize = 15)
plt.xlabel("Calcified Arteries Under Fluoroscopy")
plt.ylabel("Heart Disease Indicator")
plt.tight_layout()
sns.despine()
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.pointplot(data=df, x='Exercise_angina', y='Heart_Disease', errorbar='ci', markers='D', markersize=12, linewidth=3, color='orange')
plt.xlabel('Exercise-Induced Angina (0 = No, 1 = Yes)', fontsize=12, fontweight='bold')
plt.ylabel('Heart Disease Indicator', fontsize=12, fontweight='bold')
plt.title('Exercise Angina vs Heart Disease', fontsize=15, fontweight='bold', pad=20)
sns.despine()
plt.tight_layout()
plt.show()

### Training-Test Split

In [None]:
X = df.drop(columns = "Heart_Disease")
y = df["Heart_Disease"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

df_train = pd.concat([X_train, y_train], axis = 1)
df_train.head()

In [None]:
X.head()

In [None]:
y.head()

# Logistic Regression

We fit a logistic regression model to classify heart disease, using the medical attributes that stood out in the exploratory data analysis as predictors.
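As a rough illustration of what the fitted model computes, the sketch below passes a linear combination of three predictors through the logistic (sigmoid) function to obtain a probability. The intercept, coefficients, and patient values are made up for illustration and are not the fitted values from this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and coefficients (NOT taken from model.summary())
intercept = -2.0
coefficients = np.array([0.5, 1.0, 0.8])

# Hypothetical values for the three predictors of a single patient
patient = np.array([3.0, 1.0, 0.0])

# Linear predictor (log-odds): -2.0 + 1.5 + 1.0 + 0.0 = 0.5
log_odds = intercept + coefficients @ patient
probability = sigmoid(log_odds)

print(f"log-odds = {log_odds:.2f}, probability = {probability:.3f}")
```

The real coefficients are the ones reported by `model.summary()`.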

In [None]:
model = logit("Heart_Disease ~ Thallium + Exercise_angina + Number_of_vessels_fluro", data = df_train).fit()

In [None]:
model.summary()

# Predictions:

We take a cut-off of 0.40, lowered from the default of 0.5, as the probability that separates the classes.
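To make the role of the cut-off concrete, here is a minimal sketch of how predicted probabilities become class labels at two different cut-offs. The probabilities and the `classify` helper are hypothetical, chosen only to show the effect.

```python
import numpy as np

# Hypothetical predicted probabilities, for illustration only
example_probs = np.array([0.10, 0.35, 0.45, 0.55, 0.90])

def classify(probs, cutoff):
    """Label a case 1 when its probability is strictly above the cutoff."""
    return np.where(probs > cutoff, 1, 0)

labels_default = classify(example_probs, 0.50)
labels_lowered = classify(example_probs, 0.40)

print(labels_default)  # [0 0 0 1 1]
print(labels_lowered)  # [0 0 1 1 1]
```

Lowering the cut-off moves the borderline case (0.45) into the positive class.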

In [None]:
cutoff = 0.40

In [None]:
predictions = model.predict(X_test)

df_testing = pd.concat([X_test, y_test], axis = 1)

df_testing = df_testing.assign(fitted = predictions)
df_testing["Predicted_class"] = np.where(df_testing["fitted"] > cutoff, 1, 0)
df_testing.head()

# Evaluating the performance of the model:

In [None]:
predictions = df_testing["Predicted_class"]
confusion_mat = confusion_matrix(df_testing["Heart_Disease"], predictions)

## Visualising the confusion matrix:

In [None]:
plt.figure(figsize = (12, 6))
sns.heatmap(confusion_mat, annot = True, fmt = "d", center = 0, linecolor = "white")  # fmt="d" shows counts as integers
plt.title("Confusion matrix of the fitted values", fontweight = "bold", fontsize = 15, pad = 20)
plt.xlabel("Predicted labels")
plt.ylabel("True Labels")
plt.show()

In [None]:
print(classification_report(df_testing["Heart_Disease"], predictions))

# ROC Curve:

In [None]:
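# Use the predicted probabilities, not the thresholded classes, so the curve spans all thresholds
fpr, tpr, thresholds = roc_curve(y_test, df_testing["fitted"])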

plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, df_testing["fitted"]):.3f}')
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC Curve and AUC score", fontweight = "bold", fontsize = 15, pad = 20)
plt.plot([0, 1], [0, 1], 'k--', label='Random')
idx = np.argmin(np.abs(thresholds - cutoff))
plt.scatter(fpr[idx], tpr[idx], c='red', s=100, label=f'Cutoff={cutoff:.2f}')
plt.legend(); plt.show()

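The AUC is computed from the predicted probabilities rather than the thresholded classes, so it is unaffected by the choice of cut-off and remains satisfactory.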
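A ROC curve is traced by sweeping over thresholds, so it needs continuous scores as input. The sketch below uses synthetic labels and scores (not this dataset) to compare `roc_curve` and `roc_auc_score` on probabilities against the same calls on labels already thresholded at 0.40.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic true labels and predicted probabilities, for illustration only
y_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores_demo = np.array([0.10, 0.30, 0.45, 0.60, 0.35, 0.55, 0.80, 0.90])

# Hard class labels obtained by thresholding the scores at 0.40
labels_demo = (scores_demo > 0.40).astype(int)

# Curve from continuous scores: one point per useful threshold
fpr_scores, tpr_scores, thr_scores = roc_curve(y_demo, scores_demo)
auc_scores = roc_auc_score(y_demo, scores_demo)

# Curve from hard labels: collapses to a single operating point plus the corners
fpr_labels, tpr_labels, thr_labels = roc_curve(y_demo, labels_demo)
auc_labels = roc_auc_score(y_demo, labels_demo)

print("points on curve (scores):", len(fpr_scores))
print("points on curve (labels):", len(fpr_labels))
print(f"AUC from scores: {auc_scores:.4f}")
print(f"AUC from labels: {auc_labels:.4f}")
```

On this toy data the label-based curve has only three points and a lower AUC, because thresholding discards the ranking information in the scores.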

Lowering the model threshold from 0.5 sacrificed some precision but yielded a greater recall for the positive class. This shift also reduces the false negative rate, which is vital: in a medical model it is more important to minimise the number of people incorrectly predicted not to have CAD than to avoid false alarms.
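The precision and recall trade-off can be sketched on a small synthetic example. The labels, probabilities, and the `summarise` helper below are made up for illustration; the notebook's own figures come from the classification report above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic true labels and predicted probabilities, for illustration only
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
probs_demo = np.array([0.10, 0.30, 0.42, 0.60, 0.45, 0.55, 0.80, 0.90])

def summarise(cutoff):
    """Return precision, recall, and the false negative count at a given cutoff."""
    preds = (probs_demo > cutoff).astype(int)
    # For binary labels, ravel() yields tn, fp, fn, tp in that order
    tn, fp, fn, tp = confusion_matrix(y_true_demo, preds).ravel()
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_negatives": int(fn),
    }

at_050 = summarise(0.50)
at_040 = summarise(0.40)

print("cutoff 0.50:", at_050)
print("cutoff 0.40:", at_040)
```

In this toy case the lower cut-off removes the one false negative at the cost of one extra false positive, so recall rises while precision falls.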

Ultimately, this adjustment should help the model catch more true cases of CAD, which is what matters most for patients. As a point of comparison, standard ECG sensitivity sits at about 60-70%, so a recall above that range on the test set would suggest the model compares favourably on this measure.


### Saving the model from Jupyter

In [None]:
joblib.dump(model, 'Heart_Disease_Model.pkl')
print("Model saved! Ready for Streamlit")
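For the consuming side, a minimal sketch of the save and reload round trip with `joblib` follows. It assumes a stand-in scikit-learn model trained on synthetic data and a temporary file path; neither is the notebook's actual model or file name.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model fitted on synthetic data (NOT the heart disease model)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
demo_model = LogisticRegression().fit(X_demo, y_demo)

# Save to a temporary location, then load it back as an app would
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_model, demo_path)
    reloaded_model = joblib.load(demo_path)

# The reloaded model should reproduce the original predictions
same_predictions = np.array_equal(
    demo_model.predict(X_demo), reloaded_model.predict(X_demo)
)
print("Predictions match after reload:", same_predictions)
```

The environment that loads the file needs the same libraries, ideally in the same versions, as the one that saved it.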

In [None]:
!pip install streamlit pandas joblib scikit-learn