# Predicting Risk of Heart Disease from Accessible Health Metrics

## Introduction

Tens of thousands of Canadians die from heart attacks each year. Among people whose heart attacks were not caused by a blood clot, two-thirds had undiagnosed heart disease. The failure to diagnose underlying heart disease has contributed to the high death rate among heart attack patients. Patients with a diagnosis can be monitored and treated when heart problems arise, which may prevent an early heart attack. There is therefore a clear need to evaluate a patient's risk of heart disease effectively and accurately, and this motivates exploring whether individuals can be classified by their risk of heart disease.

We therefore ask: is it possible to classify individuals into different levels of heart disease risk (low, moderate, or high) based on blood pressure readings, cholesterol, and other clinical features such as heart rate, ST depression, and thallium stress test results?

To train a classifier to answer this question, we use the Cleveland heart disease database (Janosi et al., 1988). It contains records for 303 patients who were admitted to the Cleveland Clinic between 1981 and 1984. These patients had no history of heart disease. Each underwent various clinical tests, and their medical history, lifestyle, and family medical history were documented in detail.




In [2]:
# Uncomment the following cell to install or upgrade altair if your installed version is out of date

In [3]:
# pip install -U altair

In [4]:
import altair as alt
import pandas as pd

from sklearn.model_selection import train_test_split

In [43]:
# import dataset
heart_disease = pd.read_csv("https://archive.ics.uci.edu/static/public/45/data.csv")

# rename columns to descriptive names
heart_disease.rename(columns = {
                          "fbs" : "fasting_blood_sugar",
                          "chol" : "cholesterol",
                          "cp" : "type_chestpain",
                          "restecg" : "resting_ecg",
                          "thalach" : "max_heart_rate",
                          "exang" : "exercise_induced_angina",
                          "oldpeak" : "ST_depression",
                          "slope" : "ST_segment_slope",
                          "ca" : "num_major_vessels",
                          "thal" : "thallium_stress_test",  # assumed to be the thallium stress test result
                          "num" : "diagnosis",
                          "trestbps" : "resting_bp"
}, inplace = True)

# keep only the predictors and the label used in this analysis;
# .copy() avoids pandas' SettingWithCopyWarning when columns are reassigned below
heart_disease = heart_disease[["cholesterol","type_chestpain","max_heart_rate","resting_bp","diagnosis"]].copy()


# A low-risk diagnosis is 0, 1
# A moderate-risk diagnosis is 2, 3
# A high-risk diagnosis is 4
heart_disease['diagnosis'] = heart_disease['diagnosis'].replace([0,1], "low-risk heart disease")
heart_disease['diagnosis'] = heart_disease['diagnosis'].replace([2,3], "moderate-risk heart disease")
heart_disease['diagnosis'] = heart_disease['diagnosis'].replace([4], "high-risk heart disease")

# chest pain type
heart_disease['type_chestpain'] = heart_disease['type_chestpain'].replace(
    [1,2,3,4],
    ["type1","type2","type3","type4"])


heart_disease

| | cholesterol | type_chestpain | max_heart_rate | resting_bp | diagnosis |
|---|---|---|---|---|---|
| count | 303.0 | 303 | 303.0 | 303.0 | 303 |
| unique | | 4 | | | 3 |
| top | | type4 | | | low-risk heart disease |
| freq | | 144 | | | 219 |
| mean | 246.693069 | | 149.607261 | 131.689769 | |
| std | 51.776918 | | 22.875003 | 17.599748 | |
| min | 126.0 | | 71.0 | 94.0 | |
| 25% | 211.0 | | 133.5 | 120.0 | |
| 50% | 241.0 | | 153.0 | 130.0 | |
| 75% | 275.0 | | 166.0 | 140.0 | |


In [9]:
# split data into training and test sets
heart_disease_train, heart_disease_test = train_test_split(heart_disease, train_size = 0.75, random_state = 0)

| | variables | missing_values | type | mode | mean | median | std_deviation |
|---|---|---|---|---|---|---|---|
| 0 | cholesterol | 0 | int64 | 212.0 | 245.810573 | 240.0 | 49.162043 |
| 1 | max_heart_rate | 0 | int64 | 162.0 | 150.286344 | 152.0 | 21.961187 |
| 2 | resting_bp | 0 | int64 | 140.0 | 132.277533 | 130.0 | 16.659197 |
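
The cell that produced the per-variable summary above is not shown in this notebook. As a rough sketch of how such a table could be built with pandas, the example below uses a hypothetical helper, `summarize_numeric`, and a small toy data frame (`toy_train`) standing in for `heart_disease_train`. Both names and all values are illustrative assumptions, not the original code or data.

```python
import pandas as pd

# Toy stand-in for heart_disease_train; values are illustrative, not real records.
toy_train = pd.DataFrame({
    "cholesterol": [212, 240, 212, 275, 301],
    "max_heart_rate": [162, 150, 162, 133, 171],
    "resting_bp": [140, 130, 120, 140, 110],
})


def summarize_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per numeric column with basic summary statistics."""
    return pd.DataFrame({
        "variables": list(df.columns),
        "missing_values": df.isna().sum().to_numpy(),
        "type": df.dtypes.astype(str).to_numpy(),
        # mode() may return several values per column; keep the first (smallest)
        "mode": df.mode().iloc[0].to_numpy(),
        "mean": df.mean().to_numpy(),
        "median": df.median().to_numpy(),
        "std_deviation": df.std().to_numpy(),  # sample standard deviation (ddof=1)
    })


numeric_summary = summarize_numeric(toy_train)
print(numeric_summary)
```

In the notebook, the same helper could be applied to the three numeric columns of `heart_disease_train` in place of the toy frame.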


In [44]:
# Summary of the categorical variables
heart_disease_categorical = heart_disease_train.drop(columns = ["cholesterol","max_heart_rate","resting_bp"])
heart_disease_categorical.describe()

| | type_chestpain | diagnosis |
|---|---|---|
| count | 227 | 227 |
| unique | 4 | 3 |
| top | type4 | low-risk heart disease |
| freq | 108 | 163 |


In [46]:
# Summary of the continuous variables
heart_disease_continuous = heart_disease_train.drop(columns = ["type_chestpain","diagnosis"])
heart_disease_continuous.describe()

| | cholesterol | max_heart_rate | resting_bp |
|---|---|---|---|
| count | 227.0 | 227.0 | 227.0 |
| mean | 245.810573 | 150.286344 | 132.277533 |
| std | 49.162043 | 21.961187 | 16.659197 |
| min | 126.0 | 96.0 | 94.0 |
| 25% | 212.0 | 133.5 | 120.0 |
| 50% | 240.0 | 152.0 | 130.0 |
| 75% | 273.5 | 167.5 | 140.0 |
| max | 417.0 | 202.0 | 180.0 |


In [10]:
# resting blood pressure
bp_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("resting_bp:Q", bin = True).title("Resting Blood Pressure (mmHg)"),
    y=alt.Y("count()").stack(False),
    color="diagnosis:N"
).properties(
    title = "Distribution of Resting Blood Pressure"
)

bp_hist

In [11]:
# serum cholesterol
chol_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("cholesterol:Q", bin = True).title("Serum Cholesterol (mg/dl)"),
    y=alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    title = "Distribution of Serum Cholesterol"
)

chol_hist

In [18]:
# chest pain type
cp_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("type_chestpain").title("Chest Pain Type"),
    y=alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    width=300,
    height=300,
    title = "Distribution of Chest Pain Type"
)

cp_hist

In [13]:
# maximum heart rate
hr_hist = alt.Chart(heart_disease_train).mark_bar().encode(
    x=alt.X("max_heart_rate:Q", bin = True).title("Maximum Heart Rate"),
    y=alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    title = "Distribution of Maximum Heart Rate"
)

hr_hist

## Methods

The analysis will use K-nearest neighbors classification from the scikit-learn Python package. The predictors will be resting blood pressure (mmHg), serum cholesterol (mg/dl), maximum heart rate, and chest pain type. We will center and scale the numeric predictors so that each contributes comparably to the distance calculation. The diagnosis label is imbalanced: most observations are low-risk heart disease (163 of 227 in the training set). Because this imbalance can bias the classifier toward the majority class, our preprocessing will include oversampling of the rarer classes.
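
A minimal sketch of the preprocessing and classifier described above, assuming scikit-learn's `StandardScaler`, `OneHotEncoder`, and `KNeighborsClassifier`. Several choices here are assumptions rather than decisions stated in this proposal: one-hot encoding for chest pain type, `n_neighbors=5`, and random oversampling with replacement through a hypothetical `oversample` helper. The data are synthetic stand-ins, not the Cleveland records.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.utils import resample

# Synthetic stand-in for heart_disease_train (random values, NOT the real data).
rng = np.random.default_rng(0)
n = 60
toy_train = pd.DataFrame({
    "cholesterol": rng.normal(246, 50, n).round(),
    "max_heart_rate": rng.normal(150, 22, n).round(),
    "resting_bp": rng.normal(132, 17, n).round(),
    "type_chestpain": rng.choice(["type1", "type2", "type3", "type4"], n),
    "diagnosis": ["low-risk heart disease"] * 42
                 + ["moderate-risk heart disease"] * 12
                 + ["high-risk heart disease"] * 6,
})


def oversample(df, label_col, random_state=0):
    """Upsample each minority class (with replacement) to the majority class size."""
    target = df[label_col].value_counts().max()
    parts = []
    for _, group in df.groupby(label_col):
        if len(group) < target:
            group = resample(group, replace=True, n_samples=target,
                             random_state=random_state)
        parts.append(group)
    return pd.concat(parts, ignore_index=True)


# Oversample the training data only, so the test set keeps its original class mix.
balanced = oversample(toy_train, "diagnosis")

# Center and scale numeric predictors; one-hot encode chest pain type (assumption).
preprocessor = make_column_transformer(
    (StandardScaler(), ["cholesterol", "max_heart_rate", "resting_bp"]),
    (OneHotEncoder(handle_unknown="ignore"), ["type_chestpain"]),
)

# n_neighbors=5 is a placeholder; in practice K would be tuned by cross-validation.
knn_pipeline = make_pipeline(preprocessor, KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(balanced.drop(columns="diagnosis"), balanced["diagnosis"])

print(balanced["diagnosis"].value_counts())
```

Oversampling before the split, or before cross-validation folds are formed, would place copies of the same observation in both training and validation data, so the sketch applies it to the training portion only.
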
We will also create a scatter plot of standardized serum cholesterol (measured in mg/dl) against standardized resting blood pressure (measured in mmHg). Studies show that high cholesterol may cause arterial plaque buildup, which narrows vessels and leads to increased blood pressure. Because these variables are thought to be related, we will plot them together to see whether the data are consistent with previous research. The plot will use color to distinguish the risk classes and point shape to distinguish chest pain types.


## Expected Outcomes and Significance

Research studies have found that high blood pressure, chest pain, and high cholesterol levels are linked to cardiovascular conditions. Notably, high blood pressure has been identified as the leading risk factor for cardiovascular disease (Fuchs & Whelton, 2020). Increased heart rate is also a strong independent indicator of cardiovascular events (Hjalmarson, 2007; Arnold et al., 2008). Based on the literature, we expect higher-risk heart disease diagnoses to be positively associated with high blood pressure, high serum cholesterol, increased heart rate, and severe chest pain.

By investigating the relationship between heart disease diagnosis and self-monitored health metrics or symptoms, our results may help inform the general population of their susceptibility to heart disease. We will provide a model for recognizing heart disease risk, which in turn gives more context for which symptoms may warrant more urgent health care consultation. This project also opens questions for future studies, such as whether genetic, environmental, or demographic factors refine these results.


## References

1. Williamson, Laura. “Undiagnosed Heart Disease May Be Common in People with Heart Attacks Not Caused by Clots.” www.heart.org, American Heart Association News, 24 Jan. 2023, www.heart.org/en/news/2022/03/28/undiagnosed-heart-disease-may-be-common-in-people-with-heart-attacks-not-caused-by-clots

2. Fuchs FD, Whelton PK. High Blood Pressure and Cardiovascular Disease. Hypertension. 2020 Feb;75(2):285-292. doi: 10.1161/HYPERTENSIONAHA.119.14240. Epub 2019 Dec 23. PMID: 31865786; PMCID: PMC10243231.

3. Hjalmarson Å. Heart rate: an independent risk factor in cardiovascular disease. European Heart Journal Supplements. 2007 Sep;9(suppl_F):F3-F7. https://doi.org/10.1093/eurheartj/sum030

4. Arnold JM, Fitchett DH, Howlett JG, Lonn EM, Tardif JC. Resting heart rate: a modifiable prognostic indicator of cardiovascular risk and outcomes? Can J Cardiol. 2008 May;24 Suppl A(Suppl A):3A-8A. doi: 10.1016/s0828-282x(08)71019-5. PMID: 18437251; PMCID: PMC2787005.

5. Haasenritter J, Stanze D, Widera G, Wilimzig C, Abu Hani M, Sonnichsen AC, Bosner S, Rochon J, Donner-Banzhoff N. Does the patient with chest pain have a coronary heart disease? Diagnostic value of single symptoms and signs--a meta-analysis. Croat Med J. 2012 Oct;53(5):432-41. doi: 10.3325/cmj.2012.53.432. PMID: 23100205; PMCID: PMC3490454.

6. Janosi, Andras, Steinbrunn, William, Pfisterer, Matthias, and Detrano, Robert. (1988). Heart Disease. UCI Machine Learning Repository. https://doi.org/10.24432/C52P4X.
