# REPORT: Predicting Risk of Heart Disease from Accessible Health Metrics

## Introduction:

According to the Public Health Agency of Canada, heart disease is the second leading cause of death in Canada, and approximately 1 in 12 Canadian adults over 20 lives with a diagnosis. These figures highlight the importance of knowing the risk factors and having access to medical advice. However, a shortage of physicians in Canada is limiting access to health care (Flood et al., 2023). People without medical training do not have the means to properly evaluate their own symptoms. Our project therefore seeks to help the general population make informed decisions about heart disease using measurements that are self-monitored or easily accessible.


We therefore ask: is it possible to classify individuals into levels of heart disease risk (low risk, moderate risk, or high risk) based on blood pressure, cholesterol, heart rate, and chest pain?


Our analysis uses the Heart Disease dataset from the Cleveland database for heart disease (Andras et al., 1988). This dataset contains records for 303 patients without a history of heart disease who were admitted to the Cleveland Clinic between 1981 and 1984.



In [1]:
# Uncomment the following cell to install or upgrade altair if your installed version is not up to date

In [2]:
# pip install -U altair

In [3]:
import altair as alt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler 
from sklearn.compose import make_column_transformer
from sklearn.utils import resample
from sklearn.pipeline import make_pipeline

## Loading the Heart Disease Dataset

Here, the Heart Disease dataset from the UC Irvine Machine Learning Repository is loaded into the notebook. We also replace the coded values in the 'diagnosis' and 'chest pain' columns with more meaningful labels.  
> For the 'diagnosis' column, which indicates the risk level of a heart disease diagnosis:
<br>
> 0 and 1 are relabeled as "low-risk heart disease"
<br>
> 2 and 3 are relabeled as "moderate-risk heart disease"
<br>
> 4 is relabeled as "high-risk heart disease"

In [42]:
# import and load the data
heart_disease = pd.read_csv("https://archive.ics.uci.edu/static/public/45/data.csv")

#extract and rename the columns with required data
heart_disease = heart_disease[["chol","cp","thalach","trestbps","num"]]
heart_disease.rename(columns = {
                          "chol" : "cholesterol", 
                          "cp":"type_chestpain",
                          "thalach" : "max_heart_rate",
                          "num" : "diagnosis",
                          "trestbps" : "resting_bp"
}, inplace = True)

# renaming variables in the 'diagnosis' column with readable and explicit labels
heart_disease['diagnosis'] = heart_disease['diagnosis'].replace([0,1], "low-risk heart disease")
heart_disease['diagnosis'] = heart_disease['diagnosis'].replace([2,3], "moderate-risk heart disease")
heart_disease['diagnosis'] = heart_disease['diagnosis'].replace([4], "high-risk heart disease")

# renaming variables in the 'chest pain' column with readable and explicit labels
heart_disease['type_chestpain'] = heart_disease['type_chestpain'].replace(
    [1,2,3,4],
    ["type1","type2","type3","type4"])
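The same relabeling can also be written as a single dictionary lookup. The sketch below runs on a small hand-made Series rather than the real data, and the `risk_labels` dictionary is a hypothetical name introduced only for illustration.

```python
import pandas as pd

# Toy stand-in for the raw 'num' column (codes 0-4); not the real data.
raw_diagnosis = pd.Series([0, 1, 2, 3, 4, 0])

# Hypothetical lookup table: raw code -> risk label used in this report.
risk_labels = {
    0: "low-risk heart disease",
    1: "low-risk heart disease",
    2: "moderate-risk heart disease",
    3: "moderate-risk heart disease",
    4: "high-risk heart disease",
}

# map() returns NaN for any code missing from the dictionary,
# which makes unexpected values easy to spot.
labelled = raw_diagnosis.map(risk_labels)
label_counts = labelled.value_counts()
print(label_counts)
```

A dictionary keeps every code-to-label pair in one place, so a missing or mistyped code shows up as a missing value instead of silently staying numeric.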

## Wrangling the Data

Our preliminary analysis showed that the classes in the 'diagnosis' column are unbalanced. We resolve this by oversampling the rare classes, "moderate-risk heart disease" and "high-risk heart disease", until each has as many observations as the "low-risk heart disease" class.

In [43]:
# unbalanced diagnoses, must resample
heart_disease['diagnosis'].value_counts()

low-risk heart disease         219
moderate-risk heart disease     71
high-risk heart disease         13
Name: diagnosis, dtype: int64

In [45]:
# separate the classes out into their own data frames by filtering
rare_diagnosis_1 = heart_disease[heart_disease["diagnosis"] == "high-risk heart disease"]
rare_diagnosis_2 = heart_disease[heart_disease["diagnosis"] == "moderate-risk heart disease"]
low_risk_diagnosis = heart_disease[heart_disease["diagnosis"] == "low-risk heart disease"]

# increase the number of rare observations to the same as the number of non-rare observations
rare_diagnosis_upsample_1 = resample(rare_diagnosis_1, n_samples = low_risk_diagnosis.shape[0])
rare_diagnosis_upsample_2 = resample(rare_diagnosis_2, n_samples = low_risk_diagnosis.shape[0])

# combine the observations and show that the classes are now balanced
heart_disease = pd.concat((rare_diagnosis_upsample_1, rare_diagnosis_upsample_2, low_risk_diagnosis)).reset_index(drop = True)
heart_disease['diagnosis'].value_counts()

high-risk heart disease        219
moderate-risk heart disease    219
low-risk heart disease         219
Name: diagnosis, dtype: int64
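Because `resample` draws rows at random, the upsampled rows differ from run to run unless a seed is fixed. The following is a minimal sketch on a toy data frame, assuming a hypothetical `upsample_to` helper and an arbitrary seed of 0, of upsampling every rare class to the size of the largest class in a reproducible way.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data frame with an unbalanced label column (not the real data).
toy = pd.DataFrame({
    "x": range(10),
    "label": ["a"] * 6 + ["b"] * 3 + ["c"] * 1,
})

def upsample_to(df, label_col, seed=0):
    """Upsample every smaller class (with replacement) to the largest class size."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        if len(group) < target:
            # A fixed random_state makes the drawn rows repeatable.
            group = resample(group, replace=True, n_samples=target, random_state=seed)
        parts.append(group)
    return pd.concat(parts).reset_index(drop=True)

balanced = upsample_to(toy, "label")
print(balanced["label"].value_counts())
```

Calling `upsample_to(toy, "label")` twice returns identical data frames, which would not be guaranteed without the `random_state` argument.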

## Split into Training and Testing Data sets

We split the resampled data set so that 75% of its rows end up in the training set and the remaining 25% in the testing set. The call below does not pass the stratify argument, so the class proportions in the two subsets are only approximately equal (for example, 168 of the 492 training rows are low-risk).

In [46]:
# split data into training and testing sets
heart_disease_train, heart_disease_test = train_test_split(heart_disease, train_size = 0.75, random_state = 0)
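If exact class proportions in both subsets are wanted, `train_test_split` accepts a `stratify` argument. Separately, splitting before upsampling keeps duplicated rows out of the testing set. The sketch below illustrates both ideas on synthetic data; the column names, class sizes, and seed are assumptions, and this is not the procedure that produced the results in this report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, unbalanced data (not the real data): 40 low, 16 moderate, 8 high.
toy = pd.DataFrame({
    "feature": range(64),
    "diagnosis": ["low"] * 40 + ["moderate"] * 16 + ["high"] * 8,
})

# Stratified split: every class keeps a 75/25 ratio between the two subsets.
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["diagnosis"], random_state=0
)

# Upsample the rare classes in the training set only, so that no test row
# has a duplicate in the training set.
target = toy_train["diagnosis"].value_counts().max()
parts = []
for _, group in toy_train.groupby("diagnosis"):
    if len(group) < target:
        group = resample(group, replace=True, n_samples=target, random_state=0)
    parts.append(group)
toy_train_balanced = pd.concat(parts).reset_index(drop=True)

print(toy_train_balanced["diagnosis"].value_counts())
print(toy_test["diagnosis"].value_counts())
```

In this arrangement the testing set keeps the original class imbalance, so its accuracy reflects how the classifier would behave on unseen, unduplicated patients.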

## Summary of the Categorical and Continuous Variables

Here we show an overall summary of the variables: 
<br>
**For the categorical variables**: the number of observations (count), the unique labels, the most frequent value (top) and the frequency of the top value.
<br>
**For the continuous variables**: the number of observations (count), the mean and standard deviation, the min and max values, and the 25th, 50th, and 75th percentiles of each column in the training set.

In [48]:
# Summary of the categorical variables
heart_disease_categorical = heart_disease_train.drop(columns = ["cholesterol","max_heart_rate","resting_bp"])
heart_disease_categorical.describe()

| | type_chestpain | diagnosis |
|---|---|---|
| count | 492 | 492 |
| unique | 4 | 3 |
| top | type4 | low-risk heart disease |
| freq | 331 | 168 |


In [49]:
# Summary of the continuous variables
heart_disease_continuous = heart_disease_train.drop(columns = ["type_chestpain","diagnosis"])
heart_disease_continuous.describe()

| | cholesterol | max_heart_rate | resting_bp |
|---|---|---|---|
| count | 492.0 | 492.0 | 492.0 |
| mean | 251.611789 | 142.780488 | 133.792683 |
| std | 55.21299 | 22.841092 | 16.912308 |
| min | 131.0 | 71.0 | 94.0 |
| 25% | 212.0 | 125.0 | 120.0 |
| 50% | 240.0 | 143.5 | 131.0 |
| 75% | 289.0 | 162.0 | 145.0 |
| max | 564.0 | 202.0 | 200.0 |


## Distribution of the Variables in Training Set

Here we use histograms to visualize the distribution of each variable in the training set. The bars are colored by the diagnosis associated with each measurement.

In [61]:
# distribution of resting blood pressure measurements
bp_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("resting_bp:Q", bin = True).title("Resting Blood Pressure"),
    y=alt.Y("count()").stack(False),
    color="diagnosis:N"
).properties(
    title = "Distribution of Resting Blood Pressure"
)

# distribution of cholesterol measurements
chol_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("cholesterol:Q", bin = True).title("Cholesterol"),
    y=alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    title = "Distribution of Cholesterol"
)

# distribution of chest pain type measurements
cp_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("type_chestpain").title("Chest Pain Type"),
    y=alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    width=300,
    height=300,
    title = "Distribution of Chest Pain Type"
)

# distribution of maximum heart rate measurements
hr_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("max_heart_rate:Q", bin = True).title("Maximum Heart Rate"),
    y=alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    title = "Distribution of Maximum Heart Rate"
)


combined_charts = bp_hist & chol_hist | cp_hist & hr_hist
combined_charts

## Standardize the Data Set

We created a pipeline that combines the preprocessing step (standardizing the continuous variables) with the K-nearest neighbors classifier. We then defined a parameter grid covering the K values to tune (1 to 30) and used GridSearchCV with 5-fold cross-validation to estimate the classifier's accuracy for each value. Finally, we ran the grid search by passing the training data to its fit method and visualized the results with a line plot.

In [51]:
# create a preprocessor
preprocessor=make_column_transformer(
    (StandardScaler(),['cholesterol','max_heart_rate','resting_bp']),
    remainder='passthrough',
    verbose_feature_names_out=False
)
preprocessor

In [52]:
# create a pipeline
heart_disease_pipe = make_pipeline(preprocessor, KNeighborsClassifier())
heart_disease_pipe

In [None]:
np.random.seed(1234)

# create a parameter grid
parameter_grid = {
    "kneighborsclassifier__n_neighbors" : range(1, 31)
}

# estimate the classifier accuracy for the range of values
grid_search = GridSearchCV(
    estimator = heart_disease_pipe,
    param_grid = parameter_grid,
    cv = 5,
)

X_heart_train=heart_disease_train[['cholesterol','max_heart_rate','resting_bp']]
y_heart_train=heart_disease_train['diagnosis']

model_grid=grid_search.fit(X_heart_train,y_heart_train)
grid_results=pd.DataFrame(grid_search.cv_results_)

grid_results
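Once fitted, a `GridSearchCV` object exposes the selected value and its cross-validated score directly, alongside the full `cv_results_` table. The following is a minimal sketch on synthetic data from `make_classification` (an assumption standing in for the heart disease training set), with a smaller K range to keep it quick; all `toy_` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the training set.
X_toy, y_toy = make_classification(
    n_samples=150, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

toy_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
toy_search = GridSearchCV(
    estimator=toy_pipe,
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 11)},
    cv=5,
)
toy_search.fit(X_toy, y_toy)

# Keep only the columns needed to compare K values, best first.
summary = pd.DataFrame(toy_search.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors", "mean_test_score", "std_test_score"]
].sort_values("mean_test_score", ascending=False)

best_k = toy_search.best_params_["kneighborsclassifier__n_neighbors"]
best_score = toy_search.best_score_
print(best_k, round(best_score, 3))
print(summary.head())
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between the best K and its neighbors is larger than the fold-to-fold variation.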

In [56]:
# plot the accuracy of the K-values with a line-plot
cross_val_plot = alt.Chart(grid_results).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Values for K").scale(zero=True),
    y=alt.Y("mean_test_score").title("Accuracy of model").scale(zero=False)
)

cross_val_plot

## K-parameter

We choose K = 1 because it gives the highest cross-validation accuracy, it is small enough to keep prediction inexpensive, and moving to a nearby value of K changes the accuracy only slightly.

In [58]:
# create a new model object for the best parameter value
knn = KNeighborsClassifier(n_neighbors = 1)
heart_fit = knn.fit(X_heart_train,y_heart_train)

# add predictions to dataframe
heart_predictions = heart_disease_test.assign(predicted = heart_fit.predict(heart_disease_test[['cholesterol', 'max_heart_rate', 'resting_bp']]))
heart_predictions.head(50)

| | cholesterol | type_chestpain | max_heart_rate | resting_bp | diagnosis | predicted |
|---|---|---|---|---|---|---|
| 538 | 245 | type3 | 166 | 125 | low-risk heart disease | low-risk heart disease |
| 493 | 197 | type4 | 177 | 110 | low-risk heart disease | low-risk heart disease |
| 14 | 407 | type4 | 154 | 150 | high-risk heart disease | high-risk heart disease |
| 247 | 225 | type4 | 146 | 170 | moderate-risk heart disease | moderate-risk heart disease |
| 85 | 289 | type4 | 145 | 160 | high-risk heart disease | high-risk heart disease |
| 127 | 166 | type4 | 125 | 138 | high-risk heart disease | high-risk heart disease |
| 301 | 263 | type3 | 97 | 130 | moderate-risk heart disease | moderate-risk heart disease |
| 532 | 227 | type3 | 154 | 94 | low-risk heart disease | moderate-risk heart disease |
| 331 | 254 | type4 | 147 | 130 | moderate-risk heart disease | moderate-risk heart disease |
| 484 | 177 | type3 | 160 | 142 | low-risk heart disease | moderate-risk heart disease |
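A bare `KNeighborsClassifier` fitted on the raw columns does not apply the `StandardScaler` step that was used during tuning. One way to keep preprocessing and prediction consistent is to fit and predict through a pipeline (or through the grid search's `best_estimator_`). The sketch below shows the idea on synthetic data with hypothetical names; it is not the code that produced the table above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the heart disease features.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
# Exaggerate one feature's scale so that, unscaled, it would dominate distances.
X_toy[:, 0] *= 100

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.75, stratify=y_toy, random_state=0
)

# The scaler is fitted on the training data only and is re-applied to the
# test data automatically whenever the pipeline predicts.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
scaled_knn.fit(X_tr, y_tr)

toy_predictions = scaled_knn.predict(X_te)
toy_accuracy = scaled_knn.score(X_te, y_te)
print(toy_accuracy)
```

Because K-nearest neighbors relies on distances, a variable measured on a larger scale (such as cholesterol relative to heart rate) contributes more to the neighbor search unless the features are standardized first.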


In [19]:
# test model's accuracy 
heart_disease_correct = heart_predictions[
    heart_predictions['diagnosis'] == heart_predictions['predicted']
] 
heart_disease_acc = heart_disease_correct.shape[0] / heart_predictions.shape[0]
heart_disease_acc

0.9030303030303031

In [20]:
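# confusion matrix (stored as confusion_table so that it does not shadow
# the confusion_matrix function imported from sklearn.metrics)
confusion_table = pd.crosstab(
    heart_predictions['diagnosis'],
    heart_predictions['predicted'],
    rownames=['Actual'],
    colnames=['Predicted']
)

confusion_table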

| Actual \ Predicted | high-risk heart disease | low-risk heart disease | moderate-risk heart disease |
|---|---|---|---|
| high-risk heart disease | 54 | 0 | 0 |
| low-risk heart disease | 2 | 39 | 10 |
| moderate-risk heart disease | 0 | 4 | 56 |
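The overall accuracy and the per-class recall and precision can be read off the confusion matrix. The sketch below retypes the counts from the table above into a NumPy array (rows are actual classes, columns are predicted classes); the variable names are illustrative only.

```python
import numpy as np

labels = ["high-risk", "low-risk", "moderate-risk"]

# Rows are actual classes, columns are predicted classes,
# in the same order as the confusion matrix above.
cm = np.array([
    [54, 0, 0],
    [2, 39, 10],
    [0, 4, 56],
])

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()
# Recall: correct predictions over the number of actual cases per class (row sums).
recall = np.diag(cm) / cm.sum(axis=1)
# Precision: correct predictions over the number of predicted cases per class (column sums).
precision = np.diag(cm) / cm.sum(axis=0)

print(f"accuracy: {accuracy:.3f}")
for name, r, p in zip(labels, recall, precision):
    print(f"{name}: recall={r:.3f}, precision={p:.3f}")
```

With these counts, the accuracy is 149/165, matching the value computed earlier, and the low-risk class has the lowest recall (39 of 51) because 10 low-risk cases are predicted as moderate-risk and 2 as high-risk.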


In [86]:
# bar chart of the predicted diagnoses, faceted by the actual diagnosis
predictions_hist = alt.Chart(heart_predictions).mark_bar().encode(
    x = alt.X('predicted').title('Predicted Diagnosis'),
    y = alt.Y('count()').title('Count of Predicted Diagnosis'),
    column = alt.Column('diagnosis').title('Diagnosis'),
    color = alt.Color('predicted').title('Predicted Diagnosis'),
).properties(
    height = 350,
    width = 350,
).configure_axisX(labelAngle = -40, titleFontSize = 12).configure_axis(titleFontSize = 12, labelFontSize = 12, labelAlign = 'center', labelPadding = 40)

predictions_hist

In [None]:
# Isolate the predicted and actual diagnosis
heart_predictions = heart_predictions[['diagnosis', 'predicted']]

# create a dataframe for the bar chart
heart_disease_result = pd.DataFrame({
    'type': ['diagnosis',
             'predicted',
             'diagnosis',
             'predicted',
             'diagnosis',
             'predicted'],
    'risk': ['low-risk heart disease',
             'low-risk heart disease',
             'moderate-risk heart disease',
             'moderate-risk heart disease',
             'high-risk heart disease',
             'high-risk heart disease'],
    'count' : [heart_predictions[(heart_predictions['diagnosis'] == 'low-risk heart disease')].shape[0],
               heart_predictions[(heart_predictions['predicted'] == 'low-risk heart disease')].shape[0],
               heart_predictions[(heart_predictions['diagnosis'] == 'moderate-risk heart disease')].shape[0],
               heart_predictions[(heart_predictions['predicted'] == 'moderate-risk heart disease')].shape[0],
               heart_predictions[(heart_predictions['diagnosis'] == 'high-risk heart disease')].shape[0],
               heart_predictions[(heart_predictions['predicted'] == 'high-risk heart disease')].shape[0]]
})

# draw the bar chart
heart_disease_result_bar = alt.Chart(heart_disease_result).mark_bar().encode(
    x = alt.X('type', title = 'Actual (diagnosis) vs. Predicted'),
    y = alt.Y('count', title = 'Number of Observations'),
    color = alt.Color('risk', title = 'Risk Level'),
    column = alt.Column('risk', title = 'Risk Level')
).properties(
    height = 350,
    width = 350,
).configure_axisX(labelAngle = -40, titleFontSize = 12).configure_axis(titleFontSize = 12, labelFontSize = 12, labelAlign = 'center', labelPadding = 40)

heart_disease_result_bar
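The long-format count table used for this chart can also be derived from the two columns without typing each count by hand. The following is a minimal sketch on a toy predictions table (not the real results), using `melt` and `groupby`; the `toy_` names are hypothetical.

```python
import pandas as pd

# Toy predictions table (not the real results).
toy_predictions = pd.DataFrame({
    "diagnosis": ["low", "low", "moderate", "high", "high"],
    "predicted": ["low", "moderate", "moderate", "high", "high"],
})

# melt() stacks the two columns into long format:
# 'type' holds the source column name, 'risk' holds the label.
long_format = toy_predictions.melt(var_name="type", value_name="risk")

# Count the rows for each combination of source column and risk level.
toy_result = (
    long_format.groupby(["type", "risk"])
    .size()
    .reset_index(name="count")
)
print(toy_result)
```

One difference from the hand-built table is that a combination with zero rows is simply absent from the result instead of appearing with a count of 0.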
