Link to original data set: https://archive.ics.uci.edu/dataset/45/heart+disease

In [1]:
pip install -U altair

Collecting altair
  Downloading altair-5.1.2-py3-none-any.whl (516 kB)
Installing collected packages: altair
  Attempting uninstall: altair
    Found existing installation: altair 4.2.2
    Uninstalling altair-4.2.2:
      Successfully uninstalled altair-4.2.2
Successfully installed altair-5.1.2
Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install ucimlrepo

Collecting ucimlrepo
  Using cached ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3
Note: you may need to restart the kernel to use updated packages.


In [7]:
# import statements and setting up framework 

import altair as alt
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

In [8]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
heart_disease = fetch_ucirepo(id=45) 
  
# data (as pandas dataframes) 
X = heart_disease.data.features 
y = heart_disease.data.targets 

# renaming columns and dropping unused columns
# ("trestbps", resting blood pressure, keeps its original name)
X = X.rename(columns = {
    "chol" : "serum_cholestoral", 
    "fbs" : "fasting_blood_sugar_greater_than_120_mg/dl", 
    "thalach" : "maximum_heart_rate_achieved", 
    "exang" : "exercise_induced_angina", 
    "oldpeak" : "ST_depression_induced_by_exercise_relative_to_rest", 
    "ca" : "number_of_major_vessels"
}).drop(columns = ["cp", "restecg", "slope", "thal"])

X

Unnamed: 0,age,sex,trestbps,serum_cholestoral,fasting_blood_sugar_greater_than_120_mg/dl,maximum_heart_rate_achieved,exercise_induced_angina,ST_depression_induced_by_exercise_relative_to_rest,number_of_major_vessels
0,63,1,145,233,1,150,0,2.3,0.0
1,67,1,160,286,0,108,1,1.5,3.0
2,67,1,120,229,0,129,1,2.6,2.0
3,37,1,130,250,0,187,0,3.5,0.0
4,41,0,130,204,0,172,0,1.4,0.0
...,...,...,...,...,...,...,...,...,...
298,45,1,110,264,0,132,0,1.2,0.0
299,68,1,144,193,1,141,0,3.4,2.0
300,57,1,130,131,0,115,1,1.2,1.0
301,57,0,130,236,0,174,0,0.0,1.0


In [9]:
y

Unnamed: 0,num
0,0
1,2
2,1
3,0
4,0
...,...
298,1
299,2
300,3
301,1


In [12]:
# combine X and y (y is a one-column dataframe; "num" holds the diagnosis label)
heart = X.assign(presence_of_heart_disease = y["num"])
heart

Unnamed: 0,age,sex,trestbps,serum_cholestoral,fasting_blood_sugar_greater_than_120_mg/dl,maximum_heart_rate_achieved,exercise_induced_angina,ST_depression_induced_by_exercise_relative_to_rest,number_of_major_vessels,presence_of_heart_disease
0,63,1,145,233,1,150,0,2.3,0.0,0
1,67,1,160,286,0,108,1,1.5,3.0,2
2,67,1,120,229,0,129,1,2.6,2.0,1
3,37,1,130,250,0,187,0,3.5,0.0,0
4,41,0,130,204,0,172,0,1.4,0.0,0
...,...,...,...,...,...,...,...,...,...,...
298,45,1,110,264,0,132,0,1.2,0.0,1
299,68,1,144,193,1,141,0,3.4,2.0,2
300,57,1,130,131,0,115,1,1.2,1.0,3
301,57,0,130,236,0,174,0,0.0,1.0,1


In [17]:
heart["presence_of_heart_disease"].value_counts(normalize = True)

0    0.541254
1    0.181518
2    0.118812
3    0.115512
4    0.042904
Name: presence_of_heart_disease, dtype: float64

We use 75% of the data for training (227 of the 303 rows) and hold out the remaining 25% (76 rows) for testing. This keeps the training set large enough for the model to learn from, while leaving a test set large enough to give a meaningful estimate of the model's accuracy and errors.

In [19]:
# split the data into training (75%) and test (25%) sets
# random_state fixes the shuffle so the split is reproducible
heart_train, heart_test = train_test_split(heart, 
                                           test_size = 0.25, 
                                           random_state = 123)

In [20]:
heart_train["presence_of_heart_disease"].value_counts(normalize = True)

0    0.537445
1    0.171806
2    0.123348
3    0.118943
4    0.048458
Name: presence_of_heart_disease, dtype: float64

In [21]:
heart_test["presence_of_heart_disease"].value_counts(normalize = True)

0    0.552632
1    0.210526
2    0.105263
3    0.105263
4    0.026316
Name: presence_of_heart_disease, dtype: float64

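The class proportions of `presence_of_heart_disease` in the training and test sets are close to those in the full data set. For example, class 0 makes up 54.1% of all rows, 53.7% of the training set, and 55.3% of the test set. The largest gap is for class 1: 18.2% of all rows versus 21.1% of the test set.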
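
One way to make this comparison at a glance is to put the three sets of proportions in a single table. The sketch below uses a small synthetic data set (`toy`, a hypothetical stand-in for `heart`, with a made-up `label` column) and a helper named `class_proportions` that is not part of the notebook. It also shows the `stratify` argument of `train_test_split`, which the split above does not use, as an option when closer agreement between the splits is wanted.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for `heart`: 200 rows with an imbalanced 3-class label
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature": rng.normal(size=200),
    "label": np.repeat([0, 1, 2], [120, 50, 30]),
})

def class_proportions(train, test, full, column):
    """Return one row per class with its share in each data set."""
    return pd.DataFrame({
        "full": full[column].value_counts(normalize=True),
        "train": train[column].value_counts(normalize=True),
        "test": test[column].value_counts(normalize=True),
    }).sort_index()

# Plain random split, with the same settings as the notebook's split
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=123)
plain = class_proportions(toy_train, toy_test, toy, "label")

# Stratified split: class shares are preserved as closely as the row counts allow
strat_train, strat_test = train_test_split(
    toy, test_size=0.25, random_state=123, stratify=toy["label"]
)
stratified = class_proportions(strat_train, strat_test, toy, "label")

print(plain.round(3))
print(stratified.round(3))
```

With the real data, the same helper could be called as `class_proportions(heart_train, heart_test, heart, "presence_of_heart_disease")`.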

We will use `age`, `trestbps` (resting blood pressure on admission to the hospital), and `serum_cholestoral` as predictors of `presence_of_heart_disease`. 
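
These three predictors are measured in different units (years, mm Hg, and mg/dl), which matters if the model compares observations by distance, as the `KNeighborsClassifier` and `euclidean_distances` imports suggest. The sketch below is an illustration under that assumption: the `patients` array holds made-up values, not rows from the data set, and shows how the nearest neighbour of a reference patient can change once the columns are standardized.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

# Hypothetical patients; columns are age, trestbps, serum_cholestoral
patients = np.array([
    [40.0, 120.0, 200.0],  # reference patient
    [70.0, 120.0, 205.0],  # 30 years older, cholesterol almost the same
    [41.0, 120.0, 240.0],  # nearly the same age, cholesterol 40 mg/dl higher
    [55.0, 150.0, 300.0],  # extra row so the scaler has some spread to estimate
])

# Distances from the reference patient to patients 1 and 2 on the raw scale
raw = euclidean_distances(patients[:1], patients[1:3])[0]

# The same distances after each column is rescaled to mean 0 and variance 1
scaled_patients = StandardScaler().fit_transform(patients)
scaled = euclidean_distances(scaled_patients[:1], scaled_patients[1:3])[0]

print("raw distances:   ", raw.round(2))
print("scaled distances:", scaled.round(2))
```

On the raw scale the 40 mg/dl cholesterol gap outweighs the 30-year age gap, so the much older patient counts as the closer one; after standardizing, the ordering reverses.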

In [22]:
# creating column transformer to standardize the data 
heart_preprocessor = make_column_transformer(
    (StandardScaler(), ["age", "trestbps", "serum_cholestoral"]),
    verbose_feature_names_out=False
)

heart_preprocessor