# Heart Disease Prediction
* The dataset used in this project is available at https://www.kaggle.com/ronitf/heart-disease-uci .
* The goal of this project is to predict whether a patient is likely to have heart disease, given a set of input variables.
* We start by importing the dataset into a pandas DataFrame.

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

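# Scikit-Learn ≥0.20 is required
import sklearn
# Compare the numeric (major, minor) parts; comparing version strings directly is lexicographic
assert tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 20)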

# Common imports
import numpy as np
import pandas as pd
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Other imports
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "FinalProject"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [2]:
dataset = pd.read_csv("heart.csv")
dataset.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


* To make the dataset easier to read, we will rename the columns with more descriptive names.

In [3]:
#rename columns here
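
The renaming step could look something like the sketch below. The descriptive names in `DESCRIPTIVE_NAMES` are assumptions chosen for illustration, not the author's mapping, and the small `sample` frame is built from the first rows of the `dataset.head()` output so that the block runs on its own.

```python
import pandas as pd

# First rows of the dataset, copied from the dataset.head() output
sample = pd.DataFrame(
    [
        [63, 1, 3, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 1, 1],
        [37, 1, 2, 130, 250, 0, 1, 187, 0, 3.5, 0, 0, 2, 1],
        [41, 0, 1, 130, 204, 0, 0, 172, 0, 1.4, 2, 0, 2, 1],
    ],
    columns=["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
             "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"],
)

# Hypothetical descriptive names (an assumption, not the author's mapping)
DESCRIPTIVE_NAMES = {
    "cp": "chest_pain_type",
    "trestbps": "resting_blood_pressure",
    "chol": "cholesterol",
    "fbs": "fasting_blood_sugar",
    "restecg": "resting_ecg",
    "thalach": "max_heart_rate",
    "exang": "exercise_induced_angina",
    "ca": "num_major_vessels",
}

# rename() returns a new DataFrame; columns missing from the mapping keep their names
renamed = sample.rename(columns=DESCRIPTIVE_NAMES)
print(list(renamed.columns))
```

Because the later cells select features by position with `iloc`, renaming columns this way would not change which columns end up in `X` and `y`.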

## Approach 1: Logistic Regression
* To start, we will use logistic regression to predict whether a patient has heart disease given the feature variables.

In [3]:
# Split data into training and test datasets
X = dataset.iloc[:, 0:-1]
y = dataset.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)

In [4]:
# Train the model using the training data
log_reg = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42)
log_reg.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [5]:
# Predict whether a patient has heart disease
# The patient is a 45-year-old male with mild chest pain, a resting blood pressure of 175, cholesterol of 235,
# a fasting blood sugar flag of 1, a resting electrocardiographic result of 0, a max heart rate of 185,
# exercise-induced angina, an oldpeak score of 1.6, a slope score of 2, 0 major vessels colored by fluoroscopy,
# and a normal thal score.
log_reg.predict(np.array([[45, 1, 0, 175, 235, 1, 0, 185, 1, 1.6, 2, 0, 3]]))[0]

0

The logistic regression model predicted that our patient does not have heart disease.
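
The 0/1 label returned by `predict` comes from thresholding a probability: a binary logistic regression model computes a linear score, passes it through the sigmoid function, and predicts class 1 when the resulting probability exceeds 0.5. The plain-Python sketch below illustrates that rule; the weights, bias, and feature values are made up for illustration and are not the coefficients of the fitted `log_reg` model.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (label, probability of class 1) for a single observation."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    prob = sigmoid(score)
    return (1 if prob > threshold else 0), prob

# Hypothetical two-feature model; these numbers are illustrative only
weights = [0.8, -1.2]
bias = 0.1

label, prob = predict_label([1.0, 2.0], weights, bias)
print(label, round(prob, 3))
```

For a fitted scikit-learn model, `log_reg.predict_proba(...)` returns the class probabilities behind the label, which shows how close a patient is to the decision boundary.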

* Let's run some evaluation metrics to assess our model.

In [6]:
%%timeit
y_pred_log = log_reg.predict(x_test)

485 µs ± 36 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [7]:
# Evaluate the performance of the model
# Check the score of the model (accuracy)
print("Score")
print(log_reg.score(x_test, y_test))
y_pred_log = log_reg.predict(x_test)
cm_log = confusion_matrix(y_test, y_pred_log)
print("Confusion Matrix")
print(cm_log)
print("Classification Report")
print(classification_report(y_test, y_pred_log))

Score
0.8524590163934426
Confusion Matrix
[[25  4]
 [ 5 27]]
Classification Report
              precision    recall  f1-score   support

           0       0.83      0.86      0.85        29
           1       0.87      0.84      0.86        32

    accuracy                           0.85        61
   macro avg       0.85      0.85      0.85        61
weighted avg       0.85      0.85      0.85        61
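
The figures in the report can be reproduced by hand from the confusion matrix, which helps when reading it: rows are the true classes and columns are the predicted classes. The sketch below uses only the four counts printed above; the helper name `per_class_metrics` is introduced here for illustration.

```python
# Confusion matrix from the logistic regression output
# (rows: true class, columns: predicted class)
cm = [[25, 4],
      [5, 27]]

def per_class_metrics(cm, cls):
    """Precision, recall, and F1 for class index `cls` of a square confusion matrix."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)  # column sum: everything predicted as cls
    actual = sum(cm[cls])                    # row sum: everything that truly is cls
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total
print(f"accuracy = {accuracy:.4f}")

for cls in range(len(cm)):
    p, r, f = per_class_metrics(cm, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

For class 1 (heart disease), recall is 27 / 32: five of the 32 patients with heart disease in the test set were predicted as healthy.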



## Approach 2: K Nearest Neighbors
* Now, we will use the KNN algorithm to predict whether a patient has heart disease based on the feature variables.

In [8]:
# We must scale our variables because the KNN classifier predicts the class of a given test observation
# by finding the training observations nearest to it, so unscaled features with large ranges would dominate the distance
sc_X = StandardScaler()
X = sc_X.fit_transform(X)

In [9]:
# Split data into training and test datasets
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2)
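
One thing to be aware of in the two cells above: the scaler is fit on all rows before the split, so the test rows contribute to the means and standard deviations used for scaling. A common alternative is to split first and let a pipeline fit the scaler on the training rows only. The sketch below shows that pattern on synthetic data from `make_classification` (a stand-in, not the heart dataset); the `_demo` names and `knn_pipeline` are hypothetical, and the classifier uses its default settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the heart data: 303 rows, 13 features
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

# Split first, so the test rows are never seen while fitting the scaler
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# fit() fits StandardScaler on the training rows only, then fits KNN on the scaled rows
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_pipeline.fit(X_tr, y_tr)

# score() and predict() apply the training-set scaling to new data automatically
print(knn_pipeline.score(X_te, y_te))
```

With this structure, any new observation passed to `knn_pipeline.predict` is scaled with the training statistics before the neighbor search.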

We use the Euclidean distance metric and choose K as the square root of the number of samples in the dataset: $\sqrt{303} \approx 17$.
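
The square-root rule can be written as a small helper. This is a sketch of the heuristic only: the function name `sqrt_rule_k` is introduced here, and bumping even values to the next odd number (to avoid tied votes between two classes) is an added assumption, not something stated in the notebook.

```python
import math

def sqrt_rule_k(n_samples):
    """Round sqrt(n_samples) to the nearest integer, then bump even values to the next odd one."""
    k = max(1, round(math.sqrt(n_samples)))
    return k if k % 2 == 1 else k + 1

print(math.sqrt(303))    # about 17.4
print(sqrt_rule_k(303))
```

The rule gives a reasonable starting point; K is often tuned afterwards, for example with cross-validation.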

In [10]:
classifier = KNeighborsClassifier(n_neighbors=17, metric='euclidean')

In [11]:
# Fit Model
classifier.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=17, p=2,
                     weights='uniform')

In [12]:
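# Run the model with our previous patient
# Caution: the classifier was fit on standardized features, but this vector is passed in raw units;
# applying sc_X.transform to it first would put it on the same scale as the training data
classifier.predict(np.array([[45, 1, 0, 175, 235, 1, 0, 185, 1, 1.6, 2, 0, 3]]))[0]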

1

The KNN classifier predicted that our patient does have heart disease.
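
Since the classifier was trained on standardized features, a new observation is normally passed through the same fitted scaler before it is classified; otherwise its distances to the training points are measured on mismatched scales. The sketch below illustrates the pattern on synthetic two-feature data (not the heart dataset); all `_demo` names and the `query` values are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data with two features on very different scales (illustrative only)
X_demo = np.column_stack([rng.normal(50, 10, 200), rng.normal(0.5, 0.1, 200)])
y_demo = (X_demo[:, 1] > 0.5).astype(int)  # the label depends only on the small-scale feature

# Fit the scaler, then fit KNN on the standardized features
scaler_demo = StandardScaler().fit(X_demo)
knn_demo = KNeighborsClassifier(n_neighbors=17, metric="euclidean")
knn_demo.fit(scaler_demo.transform(X_demo), y_demo)

# A new observation in raw units, same column order as the training data
query = np.array([[50.0, 0.7]])

# Transform the query with the already-fitted scaler before predicting
scaled_query = scaler_demo.transform(query)
print("scaled query:", scaled_query.round(2))
print("prediction:", knn_demo.predict(scaled_query)[0])
```

Applied to this notebook, the equivalent call would wrap the patient array in `sc_X.transform(...)` before handing it to `classifier.predict`; the prediction for the patient may differ from the one obtained with the raw vector.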

* Let's run some evaluation metrics to assess our model.

In [13]:
%%timeit
y_pred_KNN = classifier.predict(x_test)

2.08 ms ± 41.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [14]:
# Evaluate the performance of the model
# Check the score of the model (accuracy)
print("Score")
print(classifier.score(x_test, y_test))
y_pred_KNN = classifier.predict(x_test)
cm_KNN = confusion_matrix(y_test, y_pred_KNN)
print("Confusion Matrix")
print(cm_KNN)
print("Classification Report")
print(classification_report(y_test, y_pred_KNN))

Score
0.8852459016393442
Confusion Matrix
[[25  4]
 [ 3 29]]
Classification Report
              precision    recall  f1-score   support

           0       0.89      0.86      0.88        29
           1       0.88      0.91      0.89        32

    accuracy                           0.89        61
   macro avg       0.89      0.88      0.88        61
weighted avg       0.89      0.89      0.89        61
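
Placing the two confusion matrices side by side makes the difference between the models concrete. The sketch below uses only the counts printed in the two evaluation outputs; the dictionary name `results` is introduced here for illustration.

```python
# Confusion matrices copied from the two evaluation outputs
# (rows: true class, columns: predicted class)
results = {
    "Logistic Regression": [[25, 4], [5, 27]],
    "KNN (K=17)": [[25, 4], [3, 29]],
}

for name, cm in results.items():
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # False negatives are patients with heart disease that the model predicted as healthy
    print(f"{name}: accuracy={accuracy:.3f}, false positives={fp}, false negatives={fn}")
```

On this split, KNN classifies 54 of 61 test patients correctly against 52 for logistic regression, and the gap comes entirely from two fewer false negatives. With only 61 test samples, a difference of this size may not hold on a different split.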

