# REPORT: Predicting Risk of Heart Disease from Accessible Health Metrics

## Introduction:

According to the Public Health Agency of Canada, heart disease is the second leading cause of death in Canada, with approximately 1 in 12 Canadian adults over 20 living with a diagnosis. These figures highlight the importance of knowing the risk factors and having access to medical advice. However, a shortage of physicians in Canada is limiting access to health care (Flood et al., 2023). People without medical training often lack the means to evaluate their own symptoms properly. Our project therefore aims to help the general population make informed decisions about heart disease using measurements that can be self-monitored or are easily accessible.


We therefore ask: is it possible to classify individuals into levels of heart disease risk (low, moderate, or high) based on resting blood pressure, cholesterol, maximum heart rate, and chest pain type?


Our analysis uses the Heart Disease dataset from the Cleveland database for heart disease (Janosi et al., 1988). The dataset contains records for 303 patients without a history of heart disease who were admitted to the Cleveland Clinic between 1981 and 1984.



In [4]:
# Uncomment and run the following cell to install or upgrade altair if your installed version is out of date

In [5]:
# pip install -U altair

In [6]:
import altair as alt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler 
from sklearn.compose import make_column_transformer
from sklearn.utils import resample
from sklearn.pipeline import make_pipeline

In [7]:
# import dataset
heart_disease = pd.read_csv("https://archive.ics.uci.edu/static/public/45/data.csv")

# rename columns to descriptive names
heart_disease.rename(columns = {
                          "fbs" : "fasting_blood_sugar",
                          "chol" : "cholesterol", 
                          "cp":"type_chestpain",
                          "restecg" : "resting_ecg",
                          "thalach" : "max_heart_rate",
                          "exang" : "exercise_induced_angina",
                          "oldpeak" : "ST_depression", 
                          "slope" : "ST_segment_slope", 
                          "ca" : "num_major_vessels", 
                          "thal" : "thallium_stress_test", #not sure
                          "num" : "diagnosis",
                          "trestbps" : "resting_bp"
}, inplace = True)

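# keep only the predictors of interest and the target;
# .copy() avoids a SettingWithCopyWarning on the assignments below
heart_disease = heart_disease[["cholesterol","type_chestpain","max_heart_rate","resting_bp","diagnosis"]].copy()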


# A low-risk diagnosis is 0, 1
# A moderate-risk diagnosis is 2, 3
# A high-risk diagnosis is 4
heart_disease['diagnosis'] = heart_disease['diagnosis'].replace([0,1], "low-risk heart disease")
heart_disease['diagnosis'] = heart_disease['diagnosis'].replace([2,3], "moderate-risk heart disease")
heart_disease['diagnosis'] = heart_disease['diagnosis'].replace([4], "high-risk heart disease")

# chest pain type
heart_disease['type_chestpain'] = heart_disease['type_chestpain'].replace(
    [1,2,3,4],
    ["type1","type2","type3","type4"])

heart_disease['diagnosis'].value_counts()

diagnosis
low-risk heart disease         219
moderate-risk heart disease     71
high-risk heart disease         13
Name: count, dtype: int64

In [8]:
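# balance the classes by upsampling the high-risk and moderate-risk diagnoses
# to match the number of low-risk observations
# (no random_state is set, so the resampled rows differ between runs)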
rare_diagnosis_1 = heart_disease[heart_disease["diagnosis"] == "high-risk heart disease"]
rare_diagnosis_2 = heart_disease[heart_disease["diagnosis"] == "moderate-risk heart disease"]
low_risk_diagnosis = heart_disease[heart_disease["diagnosis"] == "low-risk heart disease"]

rare_diagnosis_upsample_1 = resample(rare_diagnosis_1, n_samples = low_risk_diagnosis.shape[0])
rare_diagnosis_upsample_2 = resample(rare_diagnosis_2, n_samples = low_risk_diagnosis.shape[0])

heart_disease = pd.concat((rare_diagnosis_upsample_1, rare_diagnosis_upsample_2, low_risk_diagnosis)).reset_index(drop = True)

heart_disease

,cholesterol,type_chestpain,max_heart_rate,resting_bp,diagnosis
0,243,type4,128,150,high-risk heart disease
1,225,type4,114,150,high-risk heart disease
2,289,type4,124,165,high-risk heart disease
3,174,type4,125,145,high-risk heart disease
4,318,type4,140,114,high-risk heart disease
...,...,...,...,...,...
652,157,type2,182,120,low-risk heart disease
653,241,type4,123,140,low-risk heart disease
654,264,type1,132,110,low-risk heart disease
655,236,type2,174,130,low-risk heart disease


In [9]:
# split data into training and test sets
heart_disease_train, heart_disease_test = train_test_split(heart_disease, train_size = 0.75, random_state = 0)
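
Because the split above happens after upsampling, copies of the same patient record can land in both the training and the test set, which tends to inflate the measured accuracy. One common alternative is to split first and upsample only the training portion. The following is a minimal sketch of that ordering on hypothetical toy data (the `toy` frame, its values, and the short class labels are assumptions for illustration, not the notebook's data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data standing in for the heart disease table.
# Each cholesterol value is unique so rows can be traced after resampling.
toy = pd.DataFrame({
    "cholesterol": range(200, 260),
    "diagnosis": ["low"] * 40 + ["moderate"] * 14 + ["high"] * 6,
})

# 1. Split first, stratifying so every class appears in both sets.
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["diagnosis"], random_state=0
)

# 2. Upsample only the minority classes, and only inside the training set.
majority_size = toy_train["diagnosis"].value_counts().max()
balanced_parts = []
for label, group in toy_train.groupby("diagnosis"):
    if len(group) < majority_size:
        group = resample(
            group, replace=True, n_samples=majority_size, random_state=0
        )
    balanced_parts.append(group)

toy_train_balanced = pd.concat(balanced_parts).reset_index(drop=True)

# The test set keeps its original class proportions and shares no rows
# with the balanced training set.
print(toy_train_balanced["diagnosis"].value_counts())
print(toy_test["diagnosis"].value_counts())
```

With this ordering the test set reflects the original class imbalance, so accuracy alone can be misleading there and per-class metrics become more informative.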

In [10]:
# Summary of the categorical variables
heart_disease_categorical = heart_disease_train.drop(columns = ["cholesterol","max_heart_rate","resting_bp"])
heart_disease_categorical.describe()

,type_chestpain,diagnosis
count,492,492
unique,4,3
top,type4,low-risk heart disease
freq,340,168


In [11]:
# Summary of the continuous variables
heart_disease_continuous = heart_disease_train.drop(columns = ["type_chestpain","diagnosis"])
heart_disease_continuous.describe()

,cholesterol,max_heart_rate,resting_bp
count,492.0,492.0,492.0
mean,254.936992,142.577236,134.485772
std,57.517996,22.202551,17.512573
min,131.0,71.0,94.0
25%,212.0,125.0,120.0
50%,244.0,142.0,132.0
75%,289.0,161.0,146.5
max,564.0,202.0,200.0


In [12]:
# resting blood pressure
bp_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("resting_bp:Q", bin = True).title("Resting blood pressure (mm Hg)"),
    y=alt.Y("count()").stack(False),
    color="diagnosis:N"
).properties(
    title = "Distribution of Resting Blood Pressure by Diagnosis"
)

bp_hist

In [13]:
# serum cholesterol
chol_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("cholesterol:Q", bin = True).title("Serum cholesterol (mg/dl)"),
    y=alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    title = "Distribution of Cholesterol by Diagnosis"
)

chol_hist

In [14]:
# chest pain type
cp_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("type_chestpain").title("Chest Pain Type"),
    y=alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    width=300,
    height=300,
    title = "Distribution of Chest Pain Type"
)

cp_hist

In [15]:
# maximum heart rate achieved
hr_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("max_heart_rate:Q", bin = True).title("Maximum heart rate (beats per minute)"),
    y=alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    title = "Distribution of Maximum Heart Rate by Diagnosis"
)

hr_hist

In [16]:
preprocessor=make_column_transformer(
    (StandardScaler(),['cholesterol','max_heart_rate','resting_bp']),
    remainder='passthrough',
    verbose_feature_names_out=False
)
preprocessor
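
K-nearest neighbours classifies by distance, so a feature measured on a larger scale (such as cholesterol) can dominate the others unless all features are standardized. `StandardScaler` rescales each column to z-scores, z = (x - mean) / std, using the population standard deviation. A minimal sketch of that arithmetic with the standard library, on hypothetical readings that are not taken from the dataset:

```python
from statistics import fmean, pstdev

# Hypothetical cholesterol readings (mg/dl), for illustration only.
cholesterol = [180.0, 210.0, 240.0, 270.0, 300.0]

mean = fmean(cholesterol)
std = pstdev(cholesterol)  # population standard deviation (divides by n)

# z-score for each reading: distance from the mean in units of std
z_scores = [(value - mean) / std for value in cholesterol]

for value, z in zip(cholesterol, z_scores):
    print(f"{value:6.1f} mg/dl -> z = {z:+.3f}")
```

After this transformation every standardized column has mean 0 and standard deviation 1, so each feature contributes on a comparable scale to the distance calculation.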

In [17]:
# create a pipeline
heart_disease_pipe = make_pipeline(preprocessor, KNeighborsClassifier())
heart_disease_pipe

In [18]:
np.random.seed(1234)
parameter_grid = {
    "kneighborsclassifier__n_neighbors" : range(1, 31)
}

grid_search = GridSearchCV(
    estimator = heart_disease_pipe,
    param_grid = parameter_grid,
    cv = 5,
)


grid_search

In [19]:
X_heart_train=heart_disease_train[['cholesterol','max_heart_rate','resting_bp']]
y_heart_train=heart_disease_train['diagnosis']

model_grid=grid_search.fit(X_heart_train,y_heart_train)
grid_results=pd.DataFrame(grid_search.cv_results_)
grid_results

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kneighborsclassifier__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.003064,0.000949,0.004267,0.000964,1,{'kneighborsclassifier__n_neighbors': 1},0.858586,0.848485,0.887755,0.806122,0.846939,0.849577,0.026211,1
1,0.002149,0.000651,0.002766,0.000542,2,{'kneighborsclassifier__n_neighbors': 2},0.767677,0.777778,0.795918,0.744898,0.795918,0.776438,0.019143,2
2,0.001922,0.000836,0.003235,0.00104,3,{'kneighborsclassifier__n_neighbors': 3},0.777778,0.757576,0.785714,0.714286,0.72449,0.751969,0.028325,3
3,0.002544,0.000959,0.003312,0.000432,4,{'kneighborsclassifier__n_neighbors': 4},0.747475,0.717172,0.734694,0.72449,0.704082,0.725582,0.014809,4
4,0.00172,0.00017,0.002489,0.000263,5,{'kneighborsclassifier__n_neighbors': 5},0.747475,0.686869,0.72449,0.704082,0.72449,0.717481,0.020565,5
5,0.001431,0.000269,0.002106,0.000172,6,{'kneighborsclassifier__n_neighbors': 6},0.747475,0.717172,0.663265,0.734694,0.683673,0.709256,0.031433,6
6,0.001322,0.000194,0.002139,0.000129,7,{'kneighborsclassifier__n_neighbors': 7},0.676768,0.656566,0.673469,0.72449,0.693878,0.685034,0.023011,7
7,0.00139,0.000186,0.002302,0.000281,8,{'kneighborsclassifier__n_neighbors': 8},0.646465,0.686869,0.683673,0.734694,0.622449,0.67483,0.038354,8
8,0.001315,0.000103,0.002063,7.4e-05,9,{'kneighborsclassifier__n_neighbors': 9},0.636364,0.686869,0.673469,0.72449,0.622449,0.668728,0.03647,9
9,0.001166,6.1e-05,0.001962,0.000108,10,{'kneighborsclassifier__n_neighbors': 10},0.656566,0.686869,0.653061,0.72449,0.571429,0.658483,0.050545,10


In [20]:
cross_val_plot = alt.Chart(grid_results).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors:Q").title("Number of neighbors (k)").scale(zero=True),
    y=alt.Y("mean_test_score:Q").title("Mean cross-validation accuracy").scale(zero=False)
)

cross_val_plot

In [21]:
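# cross-validation accuracy is highest at k = 1 (see grid_results above)
# note: this classifier is fit directly on the unscaled features, so it does not
# apply the StandardScaler step used by the pipeline in the grid search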
knn=KNeighborsClassifier(n_neighbors=1)
heart_fit=knn.fit(X_heart_train,y_heart_train)
heart_predictions = heart_disease_test.assign(predicted = heart_fit.predict(heart_disease_test[['cholesterol', 'max_heart_rate', 'resting_bp']]))
heart_predictions.head(50)

,cholesterol,type_chestpain,max_heart_rate,resting_bp,diagnosis,predicted
538,245,type3,166,125,low-risk heart disease,low-risk heart disease
493,197,type4,177,110,low-risk heart disease,low-risk heart disease
14,230,type3,165,112,high-risk heart disease,high-risk heart disease
247,205,type4,130,128,moderate-risk heart disease,high-risk heart disease
85,304,type4,162,125,high-risk heart disease,high-risk heart disease
127,243,type4,128,150,high-risk heart disease,high-risk heart disease
301,290,type4,153,112,moderate-risk heart disease,moderate-risk heart disease
532,227,type3,154,94,low-risk heart disease,moderate-risk heart disease
331,169,type4,144,120,moderate-risk heart disease,moderate-risk heart disease
484,177,type3,160,142,low-risk heart disease,moderate-risk heart disease
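
The predictions above come from a separate classifier rather than the tuned pipeline. Since `GridSearchCV` refits the best pipeline on the full training data by default, one option is to predict with the fitted search object itself, so that scaling is applied consistently at training and prediction time. A minimal sketch on hypothetical toy data (the `X_toy`, `y_toy`, `toy_pipe`, and `toy_search` names and values are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical toy features on very different scales, with two classes.
X_toy = np.column_stack([
    rng.normal(250, 50, 120),   # large-scale feature
    rng.normal(0.5, 0.1, 120),  # small-scale feature
])
y_toy = (X_toy[:, 1] > 0.5).astype(int)

X_fit, y_fit = X_toy[:90], y_toy[:90]
X_new, y_new = X_toy[90:], y_toy[90:]

toy_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
toy_search = GridSearchCV(
    estimator=toy_pipe,
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 11)},
    cv=5,
)
toy_search.fit(X_fit, y_fit)

# Best k found by cross-validation.
best_k = toy_search.best_params_["kneighborsclassifier__n_neighbors"]

# predict() and score() use the refit best pipeline, scaler included.
toy_predictions = toy_search.predict(X_new)
toy_accuracy = toy_search.score(X_new, y_new)
print(best_k, toy_accuracy)
```

In the notebook's terms, the analogous calls would be `grid_search.predict(...)` and `grid_search.score(...)` on the test features; the resulting numbers would likely differ from those reported here, which were produced without scaling.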


In [22]:
# compute the model's accuracy on the test set
heart_disease_correct = heart_predictions[
    heart_predictions['diagnosis'] == heart_predictions['predicted']
] 
heart_disease_acc = heart_disease_correct.shape[0] / heart_predictions.shape[0]
heart_disease_acc

0.9090909090909091

In [25]:
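# confusion matrix (named heart_confusion_matrix so that it does not shadow
# the confusion_matrix function imported from sklearn.metrics)
heart_confusion_matrix = pd.crosstab(
    heart_predictions['diagnosis'],
    heart_predictions['predicted'],
    rownames=['Actual'],
    colnames=['Predicted']
)

heart_confusion_matrix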

Actual \ Predicted,high-risk heart disease,low-risk heart disease,moderate-risk heart disease
high-risk heart disease,54,0,0
low-risk heart disease,2,39,10
moderate-risk heart disease,2,1,57
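
Overall accuracy hides how the errors are distributed across classes. As a rough sketch, per-class recall (the share of each actual class that was predicted correctly) and precision (the share of each predicted class that was correct) can be computed directly from the counts in the table above. The shortened labels are used here only for readability:

```python
labels = ["high-risk", "low-risk", "moderate-risk"]

# Rows are actual classes, columns are predicted classes,
# both in the same order as `labels` (counts copied from the table above).
counts = [
    [54, 0, 0],
    [2, 39, 10],
    [2, 1, 57],
]

n_classes = len(labels)
total = sum(sum(row) for row in counts)
correct = sum(counts[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall: correct predictions divided by the row total (actual class size).
recall = {
    labels[i]: counts[i][i] / sum(counts[i]) for i in range(n_classes)
}

# Precision: correct predictions divided by the column total (predicted class size).
precision = {
    labels[j]: counts[j][j] / sum(row[j] for row in counts)
    for j in range(n_classes)
}

print(f"accuracy = {accuracy:.3f}")
for label in labels:
    print(f"{label}: recall = {recall[label]:.3f}, precision = {precision[label]:.3f}")
```

The accuracy here is 150/165, matching the value computed earlier. Recall for the low-risk class is 39/51, noticeably lower than for the other two classes, because 10 low-risk records were predicted as moderate-risk.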
