# REPORT: Predicting Risk of Heart Disease from Accessible Health Metrics
_Created by : Tyler Ofreneo, Arthur Zhang, Kody Forsyth, Natalia Blanco_

## Introduction:

According to the Public Health Agency of Canada, heart disease is the second leading cause of death in Canada, with approximately 1 in 12 Canadian adults over 20 living with a diagnosis. These figures highlight the importance of knowing the risk factors and having access to medical advice. However, a shortage of physicians in Canada is limiting access to health care (Flood et al., 2023). People without medical training lack the means to properly evaluate their own symptoms, so our project seeks to help the general population make informed decisions about heart disease using metrics that are self-monitored or easily accessible.


Thus we ask, <b>is it possible to classify individuals into levels of heart disease risk (low risk, moderate risk, or high risk) based on blood pressure, cholesterol, heart rate and chest pain?</b>


Our analysis uses the Heart Disease dataset from the Cleveland database for heart disease (Janosi et al., 1988). This database consists of 303 patients without history of heart disease, who were admitted to the Cleveland Clinic between 1981 and 1984. 



In [1]:
# Uncomment the following cell to install or upgrade altair if your installed version is out of date

In [2]:
#pip install -U altair

Collecting altair
  Using cached altair-5.2.0-py3-none-any.whl (996 kB)
Installing collected packages: altair
  Attempting uninstall: altair
    Found existing installation: altair 4.2.2
    Uninstalling altair-4.2.2:
      Successfully uninstalled altair-4.2.2
Successfully installed altair-5.2.0
Note: you may need to restart the kernel to use updated packages.


In [3]:
import altair as alt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import set_config
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler 
from sklearn.compose import make_column_transformer
from sklearn.utils import resample
from sklearn.pipeline import make_pipeline

## Loading and Wrangling the Heart Disease Data Set

The Heart Disease dataset retrieved from the UC Irvine Machine Learning Repository is loaded into this notebook in the following cell. The data frame obtained was already tidy. The subset of columns containing the symptoms named in our research question was extracted (Table 1). To enhance clarity, the columns were renamed and the values in the 'diagnosis' and 'chest pain' columns were given more meaningful labels. An outline of the classes for heart disease diagnosis, each representing a level of risk, is given below.

> <ins> The 'diagnosis' column: </ins> 
<br>
> 0, 1 are relabelled as "low-risk heart disease"
<br>
> 2, 3 are relabelled as "moderate-risk heart disease"
<br>
> 4 is relabelled as "high-risk heart disease"

In [4]:
# import and load the data
heart_disease = pd.read_csv("https://archive.ics.uci.edu/static/public/45/data.csv")

#extract and rename the columns with required data
# (rename returns a new data frame, which avoids modifying a slice in place)
heart_disease = heart_disease[["chol","cp","thalach","trestbps","num"]].rename(columns = {
                          "chol" : "cholesterol", 
                          "cp":"type_chestpain",
                          "thalach" : "max_heart_rate",
                          "num" : "diagnosis",
                          "trestbps" : "resting_bp"
})

# renaming variables in the 'diagnosis' column with readable and explicit labels
heart_disease['diagnosis'] = heart_disease['diagnosis'].replace([0,1], "low-risk heart disease")
heart_disease['diagnosis'] = heart_disease['diagnosis'].replace([2,3], "moderate-risk heart disease")
heart_disease['diagnosis'] = heart_disease['diagnosis'].replace([4], "high-risk heart disease")

# renaming variables in the 'chest pain' column with readable and explicit labels
heart_disease['type_chestpain'] = heart_disease['type_chestpain'].replace(
    [1,2,3,4],
    ["type1","type2","type3","type4"])

print("Table 1:")
heart_disease.head()

Table 1:


| | cholesterol | type_chestpain | max_heart_rate | resting_bp | diagnosis |
|---|---|---|---|---|---|
| 0 | 233 | type1 | 150 | 145 | low-risk heart disease |
| 1 | 286 | type4 | 108 | 160 | moderate-risk heart disease |
| 2 | 229 | type4 | 129 | 120 | low-risk heart disease |
| 3 | 250 | type3 | 187 | 130 | low-risk heart disease |
| 4 | 204 | type2 | 172 | 130 | low-risk heart disease |


## Balancing

As shown in the next cells, the counts of each value in the 'diagnosis' column reveal a class imbalance: 'low-risk heart disease' observations (219) are far more common than those of the other classes (71 and 13). To reduce the risk of a classifier biased towards the majority class, the rare classes, "moderate-risk" and "high-risk heart disease", were oversampled. This was done by filtering each class into its own data frame and randomly resampling the rare classes with replacement until they matched the size of the majority class.

In [5]:
# unbalanced diagnoses, must resample
heart_disease['diagnosis'].value_counts()

low-risk heart disease         219
moderate-risk heart disease     71
high-risk heart disease         13
Name: diagnosis, dtype: int64

In [6]:
np.random.seed(1234)

# separate the classes out into their own data frames by filtering
rare_diagnosis_1 = heart_disease[heart_disease["diagnosis"] == "high-risk heart disease"]
rare_diagnosis_2 = heart_disease[heart_disease["diagnosis"] == "moderate-risk heart disease"]
low_risk_diagnosis = heart_disease[heart_disease["diagnosis"] == "low-risk heart disease"]

# increase the number of rare observations to the same amount as non-rare observations
rare_diagnosis_upsample_1 = resample(rare_diagnosis_1, n_samples = low_risk_diagnosis.shape[0])
rare_diagnosis_upsample_2 = resample(rare_diagnosis_2, n_samples = low_risk_diagnosis.shape[0])

# combined observations show that the classes are now balanced
heart_disease = pd.concat((rare_diagnosis_upsample_1, rare_diagnosis_upsample_2, low_risk_diagnosis)).reset_index(drop = True)
heart_disease['diagnosis'].value_counts()

high-risk heart disease        219
moderate-risk heart disease    219
low-risk heart disease         219
Name: diagnosis, dtype: int64

## Create Training and Testing Sets

The data set is split so that 75% forms the training set and 25% forms the testing set. The split is stratified by the categorical label variable ('diagnosis'). We will use the training set to build the classifier and then predict the diagnosis of observations in the test set to evaluate performance. 

In [7]:
# split data into training and testing sets
heart_disease_train, heart_disease_test = train_test_split(heart_disease, 
                                                           train_size = 0.75, 
                                                           stratify = heart_disease["diagnosis"],
                                                           random_state = 0)
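
Because the rare classes were oversampled with replacement before this split, identical copies of one patient record can land in both the training and testing sets, which tends to make test accuracy look better than it would on unseen patients, especially for a nearest-neighbour model. The sketch below is one possible alternative ordering, shown on a small synthetic data frame rather than the heart disease data: split first, then oversample only the training portion. The column names `x` and `label` and the helper `upsample_to_majority` are hypothetical and are not part of the analysis above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# synthetic, imbalanced data standing in for the real data frame (assumption)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": rng.normal(size=120),
    "label": ["low"] * 80 + ["moderate"] * 30 + ["high"] * 10,
})

# 1) split first, stratifying on the label
toy_train, toy_test = train_test_split(
    toy, train_size=0.75, stratify=toy["label"], random_state=0
)

def upsample_to_majority(df, label_col, random_state=0):
    """Resample every smaller class (with replacement) up to the largest class size."""
    n_max = df[label_col].value_counts().max()
    parts = [
        group if len(group) == n_max
        else resample(group, n_samples=n_max, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)

# 2) oversample the training set only; the test set keeps its original rows
toy_train_balanced = upsample_to_majority(toy_train, "label")

# resampled rows keep their original index, so overlap can be checked directly
shared_rows = set(toy_train_balanced.index) & set(toy_test.index)
print(toy_train_balanced["label"].value_counts())
print("rows shared between train and test:", len(shared_rows))
```

With this ordering the test set keeps the original class imbalance, so per-class precision and recall become more informative than overall accuracy.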

## Summary Statistics

The following cells are used to compute the summary statistics of the variables.
<br>
<br>
**Categorical variables (Table 2)**: 
> Number of observations ("count")<br>
> Number of unique labels ("unique")<br>
> Most frequent value ("top") <br>
> Frequency of the most frequent value ("freq")<br>

**Continuous variables (Table 3)**: 
> Number of observations ("count")<br>
> Mean, standard deviation, minimum and maximum of the observations ("mean", "std", "min", "max")<br>
> 25th, 50th, and 75th percentiles<br>

In [8]:
# Summary of the categorical variables
heart_disease_categorical = heart_disease_train.drop(columns = ["cholesterol","max_heart_rate","resting_bp"])
print("Table 2:")
heart_disease_categorical.describe()

Table 2:


| | type_chestpain | diagnosis |
|---|---|---|
| count | 492 | 492 |
| unique | 4 | 3 |
| top | type4 | high-risk heart disease |
| freq | 330 | 164 |


In [9]:
# Summary of the continuous variables
heart_disease_continuous = heart_disease_train.drop(columns = ["type_chestpain","diagnosis"])
print("Table 3:")
heart_disease_continuous.describe()

Table 3:


| | cholesterol | max_heart_rate | resting_bp |
|---|---|---|---|
| count | 492.0 | 492.0 | 492.0 |
| mean | 254.365854 | 143.424797 | 135.243902 |
| std | 57.031634 | 21.738036 | 18.804208 |
| min | 126.0 | 71.0 | 94.0 |
| 25% | 215.75 | 128.0 | 120.0 |
| 50% | 243.0 | 144.5 | 132.0 |
| 75% | 289.0 | 160.0 | 150.0 |
| max | 409.0 | 202.0 | 200.0 |


## Distribution of the Variables

The next cells visualize the distribution of each symptom variable as histograms (Figures 1-4), with the associated heart disease risk diagnosis shown in distinct colours. The variables are measured on different scales, so standardization is required before computing distances between observations. The visualizations suggest patterns in the dataset: specific measurement ranges within the variables may be linked with different levels of heart disease risk. The distinct peaks in the graphs show that particular ranges are especially common in the records, which is consistent with a relationship between these measurements and the diagnosis of heart disease. 

In [22]:
# distribution of blood pressure measurements
bp_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("resting_bp:Q", bin = True).title("Blood Pressure"),
    y=alt.Y("count()").stack(False),
    color="diagnosis:N"
).properties(
    title = "Figure 1: Distribution of Blood Pressure"
)

# distribution of cholesterol measurements
chol_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("cholesterol:Q", bin = True).title("Cholesterol"),
    y=alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    title = "Figure 2: Distribution of Cholesterol"
)

# distribution of chest pain type measurements
cp_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("type_chestpain").title("Chest Pain Type"),
    y=alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    width=300,
    height=300,
    title = "Figure 3: Distribution of Chest Pain Type"
)

# distribution of heart rate measurements
hr_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("max_heart_rate:Q", bin = True).title("Heart Rate"),
    y=alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    title = "Figure 4: Distribution of Heart Rate"
)


combined_charts = bp_hist & cp_hist | chol_hist & hr_hist
combined_charts

## Note: "Chest Pain" Variable

From here, our team decided that using the chest pain variable effectively (column: "type_chestpain", Figure 3) would require a larger and more comprehensive analysis, potentially involving separate classifiers for each chest pain type. Such an analysis exceeds the scope of this project, so it was not pursued. The work above nevertheless retains chest pain so that future analyses can build on this classification. 

## Finding the K Parameter 

A pipeline is created that combines the preprocessing step with a K-nearest neighbours classifier whose K is left unspecified. Candidate values of K from 1 to 30 are evaluated with 5-fold cross-validation on the training set, using the three symptom variables as predictors and the diagnosis as the label. The cross-validation accuracy estimate for each value of K is visualized with a line plot (Figure 5).

In [11]:
# create a preprocessor
preprocessor=make_column_transformer(
    (StandardScaler(),['cholesterol','max_heart_rate','resting_bp']),
    remainder='passthrough',
    verbose_feature_names_out=False
)

# create a pipeline
heart_disease_pipe = make_pipeline(preprocessor, KNeighborsClassifier())
heart_disease_pipe
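
K-nearest neighbours compares observations by distance, so a variable with a larger numeric range would otherwise dominate the comparison. As a minimal sketch of what the `StandardScaler` step does, the example below standardizes the three numeric columns of the first four rows of Table 1 and checks the result against the by-hand formula z = (x - mean) / std. The array name `X` and the variables `scaled` and `manual` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# cholesterol, max heart rate, resting blood pressure (first four rows of Table 1)
X = np.array([
    [233.0, 150.0, 145.0],
    [286.0, 108.0, 160.0],
    [229.0, 129.0, 120.0],
    [250.0, 187.0, 130.0],
])

scaled = StandardScaler().fit_transform(X)

# the same result computed by hand, column by column;
# StandardScaler uses the population standard deviation (ddof=0), as np.std does by default
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so each variable contributes on a comparable scale to the distance between two patients.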

In [12]:
# create a parameter grid
parameter_grid = {
    "kneighborsclassifier__n_neighbors" : range(1, 31)
}

# estimate the classifier accuracy for the range of values
grid_search = GridSearchCV(
    estimator = heart_disease_pipe,
    param_grid = parameter_grid,
    cv = 5,
)

X_heart_train=heart_disease_train[['cholesterol','max_heart_rate','resting_bp']]
y_heart_train=heart_disease_train['diagnosis']

# tuning process
model_grid=grid_search.fit(X_heart_train,y_heart_train)
grid_results=pd.DataFrame(grid_search.cv_results_)

In [13]:
# plot the accuracy of the K-values with a line-plot
cross_val_plot = alt.Chart(grid_results).mark_line(point=True).encode(
    x=alt.X("param_kneighborsclassifier__n_neighbors").title("Values for K").scale(zero=True),
    y=alt.Y("mean_test_score").title("Accuracy of model").scale(zero=False)
).properties(
    title = "Figure 5: Estimated Accuracy of K Neighbors"
)

cross_val_plot

## Finding the K Parameter: Result

The best value of K is 1. This value gives the highest estimated accuracy (Figure 5). It is also not so large that the model becomes prohibitively expensive, and changing it to a nearby value would not change the accuracy to a great extent. The following code creates a new classifier with this parameter, and the predictions of this model on the test data are added to the test set. The first few predicted ('predicted') and actual ('diagnosis') diagnoses are shown (Table 4). 
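
The chosen K can also be read from the fitted search object instead of the plot. The sketch below uses synthetic data from `make_classification` as a stand-in for the training set, and the names `X_toy`, `y_toy`, `toy_pipe` and `toy_search` are hypothetical. `best_params_`, `best_score_` and `best_estimator_` are attributes of a fitted `GridSearchCV`; with the default `refit=True`, `best_estimator_` is the whole pipeline refit on all the data passed to `fit`, so the same scaling used during tuning is applied at prediction time.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic three-class data standing in for the training set (assumption)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

toy_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
toy_search = GridSearchCV(
    estimator=toy_pipe,
    param_grid={"kneighborsclassifier__n_neighbors": range(1, 31)},
    cv=5,
).fit(X_toy, y_toy)

# the K with the highest mean cross-validation accuracy
best_k = toy_search.best_params_["kneighborsclassifier__n_neighbors"]
print("best K:", best_k)
print("mean cross-validation accuracy:", round(toy_search.best_score_, 3))

# best_estimator_ is the full pipeline (scaler + classifier) refit with best_k
toy_predictions = toy_search.best_estimator_.predict(X_toy[:5])
print(toy_predictions)
```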

In [14]:
# create a new model object for the best parameter value
# (this classifier is fit directly on the raw, unscaled features, not through heart_disease_pipe)
knn = KNeighborsClassifier(n_neighbors = 1)
heart_fit = knn.fit(X_heart_train,y_heart_train)

# add predictions to dataframe
heart_predictions = heart_disease_test.assign(predicted = heart_fit.predict(heart_disease_test[['cholesterol', 'max_heart_rate', 'resting_bp']]))
print('Table 4:')
heart_predictions.head()

Table 4:


| | cholesterol | type_chestpain | max_heart_rate | resting_bp | diagnosis | predicted |
|---|---|---|---|---|---|---|
| 399 | 164 | type4 | 145 | 160 | moderate-risk heart disease | moderate-risk heart disease |
| 142 | 304 | type4 | 162 | 125 | high-risk heart disease | high-risk heart disease |
| 518 | 309 | type2 | 156 | 108 | low-risk heart disease | low-risk heart disease |
| 529 | 220 | type2 | 170 | 120 | low-risk heart disease | low-risk heart disease |
| 447 | 263 | type2 | 173 | 120 | low-risk heart disease | low-risk heart disease |


## Evaluation Results

The next cells show that the model has high accuracy on the test set, at about 90% (148 of 165 observations).

The confusion matrix (Table 5) shows that all 55 observations of high-risk heart disease, 40 of 55 observations of low-risk heart disease and 53 of 55 observations of moderate-risk heart disease were correctly predicted. Most errors were low-risk cases predicted as moderate-risk (12) or high-risk (3).

Based on the confusion matrix, here are the calculations for precision and recall:
<br>
<br>
_high-risk heart disease_

>precision = 55/59 × 100 = 93.22% ≈ 93% <br>
>recall = 55/55 × 100 = 100%

_low-risk heart disease_

>precision = 40/41 × 100 = 97.56% ≈ 98% <br>
>recall = 40/55 × 100 = 72.73% ≈ 73%

_moderate-risk heart disease_

>precision = 53/65 × 100 = 81.54% ≈ 82% <br>
>recall = 53/55 × 100 = 96.36% ≈ 96%
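
The same figures can be reproduced from the confusion matrix with a few array operations. This is a minimal sketch that hard-codes the counts from Table 5; the names `table_5`, `labels`, `precision`, `recall` and `accuracy` are illustrative and do not appear in the notebook cells.

```python
import numpy as np

labels = ["high-risk", "low-risk", "moderate-risk"]

# counts copied from Table 5: rows are actual classes, columns are predicted classes
table_5 = np.array([
    [55,  0,  0],
    [ 3, 40, 12],
    [ 1,  1, 53],
])

correct = np.diag(table_5)
precision = correct / table_5.sum(axis=0)  # correct / everything predicted as that class
recall = correct / table_5.sum(axis=1)     # correct / everything actually in that class
accuracy = correct.sum() / table_5.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision = {p:.2%}, recall = {r:.2%}")
print(f"accuracy = {accuracy:.2%}")
```

Precision divides by column totals (how many predictions of a class were right), while recall divides by row totals (how many true cases of a class were found).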

In [15]:
# test model's accuracy 
heart_disease_correct = heart_predictions[
    heart_predictions['diagnosis'] == heart_predictions['predicted']
] 
heart_disease_acc = heart_disease_correct.shape[0] / heart_predictions.shape[0]
heart_disease_acc

0.896969696969697

In [16]:
# confusion matrix
# (named confusion_table so it does not shadow sklearn's confusion_matrix import)
confusion_table = pd.crosstab(
    heart_predictions['diagnosis'],
    heart_predictions['predicted'],
    rownames=['Actual'],
    colnames=['Predicted']
)

print('Table 5:')
confusion_table

Table 5:


| Actual \ Predicted | high-risk heart disease | low-risk heart disease | moderate-risk heart disease |
|---|---|---|---|
| high-risk heart disease | 55 | 0 | 0 |
| low-risk heart disease | 3 | 40 | 12 |
| moderate-risk heart disease | 1 | 1 | 53 |


## Visualization of Analysis

Here, the visualization depicts the results of this classification analysis: the predictions made by the classifier for each heart disease risk label are shown in bar plots so the results can be compared easily. The first plot (Figure 6) shows, for each actual diagnosis, how the predictions are distributed across correct and incorrect labels, which visualizes recall. The second plot (Figure 7) compares the number of actual diagnoses with the number of predictions for each risk level, which gives a sense of how often each label is over- or under-predicted.

In [17]:
predictions_hist = alt.Chart(heart_predictions).mark_bar().encode(
    x = alt.X('predicted').title('Predicted Diagnosis'),
    y = alt.Y('count()').title('Count of Predicted Diagnosis'),
    column = alt.Column('diagnosis').title('Figure 6: Diagnosis'),
    color = alt.Color('predicted').title('Predicted Diagnosis'),
).properties(
    height = 350,
    width = 350,
).configure_axisX(labelAngle = -40, titleFontSize = 12).configure_axis(titleFontSize = 12, labelFontSize = 12, labelAlign = 'center', labelPadding = 40)

predictions_hist

In [18]:
# Isolate the predicted and actual diagnosis
heart_predictions = heart_predictions[['diagnosis', 'predicted']]

# create a long-format dataframe for the bar chart:
# one row per (risk level, actual-or-predicted) pair with its count
heart_disease_result = (
    heart_predictions[['diagnosis', 'predicted']]
    .melt(var_name = 'type', value_name = 'risk')
    .groupby(['risk', 'type'])
    .size()
    .reset_index(name = 'count')
)

# draw the bar chart
heart_disease_result_bar = alt.Chart(heart_disease_result).mark_bar().encode(
    x = alt.X('type', title = 'Actual (diagnosis) vs. Predicted'),
    y = alt.Y('count', title = 'Count of Diagnosis'),
    color = alt.Color('risk', title = 'Risk Type'),
    column = alt.Column('risk', title = 'Figure 7: Risk Type')
).properties(
    height = 350,
    width = 350,
).configure_axisX(labelAngle = 0, titleFontSize = 12).configure_axis(titleFontSize = 12, labelFontSize = 12, labelAlign = 'center', labelPadding = 40)

heart_disease_result_bar

## Discussion

This K-nearest neighbours classification analysis of a heart disease dataset suggests a link between three health metrics (blood pressure, cholesterol and heart rate) and the diagnosis of heart disease. The classifier reached an overall accuracy of about 90% on the test set and performed best on high-risk cases, with lower performance on low-risk and moderate-risk heart disease (Figures 6 and 7). The confusion matrix (Table 5) shows a precision of 93% and a recall of 100% for high-risk heart disease. Low-risk and moderate-risk heart disease had 98% and 82% precision and 73% and 96% recall, respectively. The lower recall for low-risk heart disease means the model misses more of these cases, mostly by assigning them to a higher risk level, and those same misassigned cases account for the lower precision of the moderate-risk label. However, the model excels at identifying high-risk heart disease, which matters most because of the severity of this category: strong performance at this risk level is crucial for early treatment and better outcomes for critical cases. 

Based on previous research, our team expected a strong relationship between heart disease diagnosis and blood pressure, cholesterol and heart rate (Fuchs & Whelton, 2020; Hjalmarson, 2007; Arnold et al., 2008). In particular, the literature indicated correlations between higher-risk heart disease diagnosis and high blood pressure, high serum cholesterol, and increased heart rate. The model's effectiveness in predicting diagnoses from these symptoms is in line with these expectations.

This analysis developed a K-nearest neighbours classification model that assigns heart disease risk levels. The model uses health metrics that are readily accessible to individuals in Canada, so that self-monitored blood pressure, cholesterol and heart rate can be used to gauge heart disease risk. By recognizing heart disease risk from common symptoms, this algorithm can give individuals more context on how severe their symptoms are and whether they call for urgent health care consultation. Furthermore, this project opens the way for future studies to ask whether genetic, environmental or demographic factors can refine these predictions.


## References

Williamson, Laura. “Undiagnosed Heart Disease May Be Common in People with Heart Attacks Not Caused by Clots.” Www.Heart.Org, American Heart Association News, 24 Jan. 2023, www.heart.org/en/news/2022/03/28/undiagnosed-heart-disease-may-be-common-in-people-with-heart-attacks-not-caused-by-clots

Fuchs FD, Whelton PK. High Blood Pressure and Cardiovascular Disease. Hypertension. 2020 Feb;75(2):285-292. doi: 10.1161/HYPERTENSIONAHA.119.14240. Epub 2019 Dec 23. PMID: 31865786; PMCID: PMC10243231.

Åke Hjalmarson, Heart rate: an independent risk factor in cardiovascular disease, European Heart Journal Supplements, Volume 9, Issue suppl_F, September 2007, Pages F3–F7, https://doi.org/10.1093/eurheartj/sum030

Arnold JM, Fitchett DH, Howlett JG, Lonn EM, Tardif JC. Resting heart rate: a modifiable prognostic indicator of cardiovascular risk and outcomes? Can J Cardiol. 2008 May;24 Suppl A(Suppl A):3A-8A. doi: 10.1016/s0828-282x(08)71019-5. PMID: 18437251; PMCID: PMC2787005.

Haasenritter J, Stanze D, Widera G, Wilimzig C, Abu Hani M, Sonnichsen AC, Bosner S, Rochon J, Donner-Banzhoff N. Does the patient with chest pain have a coronary heart disease? Diagnostic value of single symptoms and signs--a meta-analysis. Croat Med J. 2012 Oct;53(5):432-41. doi: 10.3325/cmj.2012.53.432. PMID: 23100205; PMCID: PMC3490454.

Janosi, Andras, Steinbrunn, William, Pfisterer, Matthias, and Detrano, Robert. (1988). Heart Disease. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X.