Link to original data set: https://archive.ics.uci.edu/dataset/45/heart+disease

In [1]:
pip install -U altair ucimlrepo

Collecting altair
  Downloading altair-5.2.0-py3-none-any.whl (996 kB)
Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo, altair
  Attempting uninstall: altair
    Found existing installation: altair 4.2.2
    Uninstalling altair-4.2.2:
      Successfully uninstalled altair-4.2.2
Successfully installed altair-5.2.0 ucimlrepo-0.0.3
Note: you may need to restart the kernel to use updated packages.


#### Loading the Dataset and Some Important Background Work

In [2]:
# import statements and setting up framework 

import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

In [3]:
np.random.seed(2020)

In [4]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
heart_disease = fetch_ucirepo(id=45) 
  
# data (as pandas dataframes) 
X = heart_disease.data.features 
y = heart_disease.data.targets 

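# renaming columns and dropping unused columns
# (the "trestps" key below does not match the dataset's "trestbps" column, so that
# column keeps its original name, which is what the output and later cells use)
X = X.rename(columns = {
    "trestps" : "resting_blood_pressure", 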
    "chol" : "serum_cholestoral", 
    "fbs" : "fasting_blood_sugar_greater_than_120_mg/dl", 
    "thalach" : "maximum_heart_rate_achieved", 
    "exang" : "exercise_induced_angina", 
    "oldpeak" : "ST_depression_induced_by_exercise_relative_to_rest", 
    "ca" : "number_of_major_vessels"
}).drop(columns = ["cp", "restecg", "slope", "thal"])

X

Unnamed: 0,age,sex,trestbps,serum_cholestoral,fasting_blood_sugar_greater_than_120_mg/dl,maximum_heart_rate_achieved,exercise_induced_angina,ST_depression_induced_by_exercise_relative_to_rest,number_of_major_vessels
0,63,1,145,233,1,150,0,2.3,0.0
1,67,1,160,286,0,108,1,1.5,3.0
2,67,1,120,229,0,129,1,2.6,2.0
3,37,1,130,250,0,187,0,3.5,0.0
4,41,0,130,204,0,172,0,1.4,0.0
...,...,...,...,...,...,...,...,...,...
298,45,1,110,264,0,132,0,1.2,0.0
299,68,1,144,193,1,141,0,3.4,2.0
300,57,1,130,131,0,115,1,1.2,1.0
301,57,0,130,236,0,174,0,0.0,1.0


In [5]:
y

Unnamed: 0,num
0,0
1,2
2,1
3,0
4,0
...,...
298,1
299,2
300,3
301,1


#### Building the KNN Classifier Model 

In [6]:
# combine X and y 
heart = X.assign(presence_of_heart_disease = y) 
heart

Unnamed: 0,age,sex,trestbps,serum_cholestoral,fasting_blood_sugar_greater_than_120_mg/dl,maximum_heart_rate_achieved,exercise_induced_angina,ST_depression_induced_by_exercise_relative_to_rest,number_of_major_vessels,presence_of_heart_disease
0,63,1,145,233,1,150,0,2.3,0.0,0
1,67,1,160,286,0,108,1,1.5,3.0,2
2,67,1,120,229,0,129,1,2.6,2.0,1
3,37,1,130,250,0,187,0,3.5,0.0,0
4,41,0,130,204,0,172,0,1.4,0.0,0
...,...,...,...,...,...,...,...,...,...,...
298,45,1,110,264,0,132,0,1.2,0.0,1
299,68,1,144,193,1,141,0,3.4,2.0,2
300,57,1,130,131,0,115,1,1.2,1.0,3
301,57,0,130,236,0,174,0,0.0,1.0,1


In [7]:
heart["presence_of_heart_disease"].value_counts(normalize = True)

0    0.541254
1    0.181518
2    0.118812
3    0.115512
4    0.042904
Name: presence_of_heart_disease, dtype: float64

We use 75% of the data for training. This keeps the training set large enough for the model to be reasonably accurate (compared with smaller training sets), while leaving a test set of a reasonable size for estimating the accuracy and errors of our model.

In [8]:
# splitting the data into train and test sets 
# use 75% of data in training
heart_train, heart_test = train_test_split(heart, test_size = 0.25, random_state = 123) # set the random state to be 123

X_train = heart_train.drop("presence_of_heart_disease", axis=1)
y_train = heart_train["presence_of_heart_disease"]

In [9]:
heart_train["presence_of_heart_disease"].value_counts(normalize = True)

0    0.537445
1    0.171806
2    0.123348
3    0.118943
4    0.048458
Name: presence_of_heart_disease, dtype: float64

In [10]:
heart_test["presence_of_heart_disease"].value_counts(normalize = True)

0    0.552632
1    0.210526
2    0.105263
3    0.105263
4    0.026316
Name: presence_of_heart_disease, dtype: float64

We confirm that the distribution of the presence of heart disease in the training and testing set is relatively similar to that of the original data set. 
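
To put a number on "relatively similar", the sketch below compares the class proportions printed above and reports the largest gap per split. The helper name `max_abs_gap` is our own and not part of any library; the proportions are copied from the three `value_counts(normalize = True)` outputs.

```python
# Class proportions copied from the value_counts(normalize=True) outputs above.
full = {0: 0.541254, 1: 0.181518, 2: 0.118812, 3: 0.115512, 4: 0.042904}
train = {0: 0.537445, 1: 0.171806, 2: 0.123348, 3: 0.118943, 4: 0.048458}
test = {0: 0.552632, 1: 0.210526, 2: 0.105263, 3: 0.105263, 4: 0.026316}


def max_abs_gap(reference, other):
    """Largest absolute difference in class proportion between two splits."""
    return max(abs(reference[c] - other[c]) for c in reference)


train_gap = max_abs_gap(full, train)
test_gap = max_abs_gap(full, test)
print(f"train vs full: {train_gap:.3f}, test vs full: {test_gap:.3f}")
```

The largest gap is about one percentage point for the training set and about three for the test set (class 1). If closer agreement were needed, one option not used in this notebook would be the `stratify` argument of `train_test_split`.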

We will be using `age`, `trestbps` (resting blood pressure on admission to hospital), and `serum_cholestoral` to predict `presence_of_heart_disease`. 

In [11]:
# creating column transformer to standardize the data 
numeric_columns = ["age", "trestbps", "serum_cholestoral"]
drop_columns = list(set(X_train.columns) - set(numeric_columns))

heart_preprocessor = make_column_transformer(
    (StandardScaler(), numeric_columns),
    ("drop", drop_columns),
    verbose_feature_names_out=False
)

heart_preprocessor
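
KNN relies on distances between observations, so a feature measured on a larger scale can dominate the result. The sketch below illustrates this with the first five rows of the feature table shown earlier, using only the standard library. The means and standard deviations here come from those five rows alone, whereas the `StandardScaler` in the pipeline is fitted on the training set, so the numbers are illustrative only. The helper `squared_shares` is a hypothetical name.

```python
import statistics

# First five rows of the feature table above: (age, trestbps, serum_cholestoral).
rows = [
    (63, 145, 233),
    (67, 160, 286),
    (67, 120, 229),
    (37, 130, 250),
    (41, 130, 204),
]

# Column-wise mean and population standard deviation (ddof=0), which is the
# convention StandardScaler uses as well.
columns = list(zip(*rows))
means = [statistics.fmean(col) for col in columns]
stds = [statistics.pstdev(col) for col in columns]

# Standardize every value: subtract the column mean, divide by the column std.
scaled = [
    tuple((value - m) / s for value, m, s in zip(row, means, stds))
    for row in rows
]


def squared_shares(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    squares = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(squares)
    return [sq / total for sq in squares]


# Compare rows 0 and 1 before and after standardization.
raw_shares = squared_shares(rows[0], rows[1])
scaled_shares = squared_shares(scaled[0], scaled[1])
print("raw shares (age, trestbps, chol):   ", [round(s, 2) for s in raw_shares])
print("scaled shares (age, trestbps, chol):", [round(s, 2) for s in scaled_shares])
```

Before scaling, cholesterol accounts for most of the squared distance between the two patients simply because its values are larger; after scaling, the other features carry more weight.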

#### Tuning the Model

In [12]:
# performing cross validation to select the best K value 
knn = KNeighborsClassifier()
heart_tune_pipe = make_pipeline(heart_preprocessor, knn)
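
As a reminder of what is being tuned, here is a minimal pure-Python sketch of K-nearest-neighbors classification with Euclidean distance and an unweighted majority vote, which matches the default settings of `KNeighborsClassifier`. The function `knn_predict` and the toy points are hypothetical and not taken from the heart data; tie handling may differ from scikit-learn.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    nearest_labels = [train_labels[i] for i in order[:k]]
    # most_common(1) returns the label with the highest count among the k neighbors.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already standardized toy points (illustrative only).
points = [(-1.0, -1.0), (-0.8, -1.2), (-1.1, -0.7), (1.0, 1.0), (1.2, 0.9)]
labels = [0, 0, 0, 1, 1]

prediction_k1 = knn_predict(points, labels, (0.9, 0.8), k=1)
prediction_k5 = knn_predict(points, labels, (0.9, 0.8), k=5)
print(prediction_k1, prediction_k5)
```

With `k=1` the query takes the label of its closest neighbor, while with `k=5` every point votes and the majority class wins. This is why a large K tends to pull predictions toward the most frequent class.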

In [13]:
parameter_grid = {
    "kneighborsclassifier__n_neighbors": range(1, 40, 1),
}

heart_tune_grid = GridSearchCV(
    estimator=heart_tune_pipe, 
    param_grid=parameter_grid, 
    cv=10, 
    n_jobs=-1
)

In [14]:
accuracies_grid = pd.DataFrame(
    heart_tune_grid.fit(
        X_train,
        y_train
    ).cv_results_
)

In [15]:
accuracies_grid = (
    accuracies_grid[[
        "param_kneighborsclassifier__n_neighbors",
        "mean_test_score",
        "std_test_score"
    ]]
    .assign(sem_test_score=accuracies_grid["std_test_score"] / 10**(1/2))
    .rename(columns={"param_kneighborsclassifier__n_neighbors": "n_neighbors"})
    .drop(columns=["std_test_score"])
)

accuracies_grid

Unnamed: 0,n_neighbors,mean_test_score,sem_test_score
0,1,0.322134,0.020945
1,2,0.419368,0.027219
2,3,0.441107,0.020706
3,4,0.476877,0.025898
4,5,0.493281,0.020841
5,6,0.475692,0.019226
6,7,0.489328,0.015211
7,8,0.506719,0.007016
8,9,0.511462,0.009051
9,10,0.493478,0.008376
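
The `sem_test_score` column is the standard error of the mean accuracy: the standard deviation of the fold scores divided by the square root of the number of folds (10 here). A minimal sketch with hypothetical fold accuracies, assuming that `std_test_score` is a population standard deviation (`ddof=0`):

```python
import math
import statistics

# Hypothetical accuracies from 10 cross-validation folds (illustrative only).
fold_scores = [0.48, 0.52, 0.50, 0.55, 0.47, 0.51, 0.49, 0.53, 0.50, 0.52]

mean_score = statistics.fmean(fold_scores)
# Population standard deviation of the fold scores (assumed to match std_test_score).
std_score = statistics.pstdev(fold_scores)
# Standard error of the mean: std divided by the square root of the number of folds.
sem_score = std_score / math.sqrt(len(fold_scores))
print(f"mean={mean_score:.3f}, std={std_score:.4f}, sem={sem_score:.4f}")
```

A small standard error relative to the differences between neighboring values of K suggests that those differences are not just fold-to-fold noise.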


In [16]:
accuracy_vs_k = alt.Chart(accuracies_grid, title = "Estimated accuracy versus number of neighbors").mark_line(point = True).encode(
    x = alt.X("n_neighbors", title = "Neighbors"),
    y = alt.Y("mean_test_score", title = "Accuracy estimate")
)

accuracy_vs_k

In [20]:
best_n_neighbour = heart_tune_grid.best_params_["kneighborsclassifier__n_neighbors"]
best_n_neighbour

20

The grid search therefore selects `n_neighbors = 20`, the value with the highest mean cross-validated accuracy.

In [22]:
model = make_pipeline(
    heart_preprocessor,
    KNeighborsClassifier(n_neighbors=best_n_neighbour)
)

In [23]:
model.fit(X_train, y_train)

In [24]:
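# predict on the test features only (drop the true label before predicting)
heart_test = heart_test.assign(
    predicted = model.predict(heart_test.drop(columns=["presence_of_heart_disease"]))
)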

#### Evaluating Our Model

In [25]:
pd.crosstab(
    heart_test["presence_of_heart_disease"],
    heart_test["predicted"]
)

predicted,0,2
presence_of_heart_disease,Unnamed: 1_level_1,Unnamed: 2_level_1
0,41,1
1,14,2
2,8,0
3,8,0
4,2,0


If we consider only whether a patient "has a heart disease" (any class other than 0), the confusion matrix becomes:

In [26]:
pd.crosstab(
    heart_test.assign(presence_of_heart_disease = heart_test["presence_of_heart_disease"] != 0)["presence_of_heart_disease"],
    heart_test.assign(predicted = heart_test["predicted"] != 0)["predicted"]
)

predicted,False,True
presence_of_heart_disease,Unnamed: 1_level_1,Unnamed: 2_level_1
False,41,1
True,32,2


We can compute several metrics from it:

In [23]:
# rows of the crosstab are actual values, columns are predictions:
# TN = 41, FP = 1, FN = 32, TP = 2
accuracy = (41 + 2) / (41 + 1 + 32 + 2)
precision = (2) / (1 + 2)    # TP / (TP + FP)
recall = (2) / (32 + 2)      # TP / (TP + FN)
print('Accuracy: {:.2}, precision: {:.2}, recall: {:.2}'.format(accuracy, precision, recall)) 

Accuracy: 0.57, precision: 0.67, recall: 0.059


In [32]:
metrics=pd.DataFrame({'Accuracy':[accuracy],'Precision':[precision],'Recall':[recall]})
metrics

Unnamed: 0,Accuracy,Precision,Recall
0,0.565789,0.666667,0.058824


We can see here that the model does not perform well. The recall is very low and the accuracy is not great either. A dummy model that predicts class 0 every time would reach an accuracy of 42/76 ≈ 0.55, almost the same as our model.

The precision looks considerably higher than the recall, but it rests on only three positive predictions. The low recall means the model identifies just 2 of the 34 test patients who have heart disease, so it produces many false negatives, i.e., situations where the patient has a heart disease but is treated as though they don't. In this scenario, false negatives are a lot more catastrophic than false positives, so this model is at best a starting point: its recall would need to improve substantially before it could be used to answer this question.
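
For reference, the standard definitions of these metrics can be written as a small helper. This is a sketch using only plain Python; the function name `binary_metrics` is our own. It reads the binary crosstab above with actual values in the rows and predictions in the columns.

```python
def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision and recall from the four cells of a 2x2 confusion matrix."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    # Guard against division by zero when there are no positive predictions
    # or no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, precision, recall


# Cells read from the binary crosstab above (rows: actual, columns: predicted).
accuracy, precision, recall = binary_metrics(tn=41, fp=1, fn=32, tp=2)
print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```

Precision asks "of the patients flagged as sick, how many really are?", while recall asks "of the patients who are sick, how many did we flag?". The second question is the one that matters most when false negatives are costly.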