Tens of thousands of Canadians die from heart attacks each year. Two-thirds of those whose heart attacks were not caused by a blood clot had undiagnosed heart disease. Failure to diagnose underlying heart disease contributes to the high death rate among heart attack patients. Patients with diagnosed heart disease can be monitored and treated when heart problems arise, which may prevent an early heart attack. There is therefore a clear need to evaluate a patient's risk of heart disease accurately, which motivates exploring whether individuals can be classified by their risk of heart disease.

Thus, we ask the question: Is it possible to classify individuals into different levels of heart disease risk (low risk, moderate risk, or high risk) based on blood pressure readings, cholesterol, and other clinical features such as heart rate, ST depression, and thallium stress test results?

To train an algorithm that could answer this question, we use the Cleveland heart disease database. It contains records for 303 patients admitted to the Cleveland Clinic between 1981 and 1984. These patients had no history of heart disease. Various clinical measurements were taken for each patient, and their medical history, lifestyle, and family medical history were documented in detail.




In [1]:
# Run the following cell to install or update altair if your installed version is out of date

In [2]:
pip install -U altair

Note: you may need to restart the kernel to use updated packages.


In [3]:
# Uncomment the following cell to install ucimlrepo if it is missing

In [4]:
#pip install ucimlrepo

In [5]:
import altair as alt
import pandas as pd
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split

In [45]:
# import dataset
heart_disease_dataset = fetch_ucirepo(name = 'Heart Disease')

# rename columns to descriptive names
heart_disease = heart_disease_dataset.data.original.rename(columns = {
    "fbs" : "fasting_blood_sugar",
    "chol" : "cholesterol",
    "cp" : "type_chestpain",
    "restecg" : "resting_ecg",
    "thalach" : "max_heart_rate",
    "exang" : "exercise_induced_angina",
    "oldpeak" : "ST_depression",
    "slope" : "ST_segment_slope",
    "ca" : "num_major_vessels",
    "thal" : "thallium_stress_test",
    "num" : "diagnosis",
    "trestbps" : "resting_bp"
})

# keep only the predictors used in this analysis and the diagnosis label
heart_disease = heart_disease[["cholesterol","type_chestpain","max_heart_rate","resting_bp","diagnosis"]]

heart_disease

index,cholesterol,type_chestpain,max_heart_rate,resting_bp,diagnosis
0,233,1,150,145,0
1,286,4,108,160,2
2,229,4,129,120,1
3,250,3,187,130,0
4,204,2,172,130,0
...,...,...,...,...,...
298,264,1,132,110,1
299,193,4,141,144,2
300,131,4,115,130,3
301,236,2,174,130,1


In [46]:
# split data
heart_disease_train, heart_disease_test = train_test_split(heart_disease, train_size = 0.75)

# summary table for each column:
# number of missing values, data type,
# mode, mean, median, and std deviation

# total number of observations in the training set
num_observations = len(heart_disease_train.index)


predictor_modes = heart_disease_train.apply("mode").iloc[[0]].melt()

table = pd.DataFrame({
    'variables' : heart_disease_train.columns,
    'missing_values' : heart_disease_train.isnull().sum().values,
    'type' : heart_disease_train.dtypes.values,
    'mode' : predictor_modes["value"],
    'mean' : heart_disease_train.apply("mean").values,
    'median' : heart_disease_train.apply("median").values,
    'std_deviation' : heart_disease_train.apply("std").values
})

table

index,variables,missing_values,type,mode,mean,median,std_deviation
0,cholesterol,0,int64,204.0,242.991189,236.0,49.287668
1,type_chestpain,0,int64,4.0,3.189427,3.0,0.923764
2,max_heart_rate,0,int64,163.0,149.885463,152.0,22.571913
3,resting_bp,0,int64,130.0,132.374449,130.0,16.270824
4,diagnosis,0,int64,0.0,0.894273,0.0,1.181462
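
The summary table above does not show how many observations fall into each diagnosis class. The following is a minimal sketch of one way to count them. It uses a small made-up frame (`toy_train`, a hypothetical stand-in for `heart_disease_train`), so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for heart_disease_train; the values are made up.
toy_train = pd.DataFrame({
    "cholesterol": [233, 286, 229, 250, 204, 264],
    "diagnosis": [0, 2, 1, 0, 0, 1],
})

# Count observations per diagnosis class, ordered by class label.
class_counts = (
    toy_train["diagnosis"]
    .value_counts()
    .sort_index()
    .rename_axis("diagnosis")
    .reset_index(name="count")
)

# Add each class's share of the total number of observations.
class_counts["proportion"] = class_counts["count"] / class_counts["count"].sum()

print(class_counts)
```

Running the same steps on the real training set would show how strongly the classes are imbalanced before any resampling.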


In [17]:
# To plot distribution of predictor variables
# we need to replace the diagnosis values with their corresponding names
heart_disease_to_plot = heart_disease_train.copy()

# diagnosis values 0 and 1 are labelled low risk,
# 2 and 3 moderate risk, and 4 high risk
heart_disease_to_plot['diagnosis'] = heart_disease_to_plot['diagnosis'].replace([0,1], "low-risk heart disease")
heart_disease_to_plot['diagnosis'] = heart_disease_to_plot['diagnosis'].replace([2,3], "moderate-risk heart disease")
heart_disease_to_plot['diagnosis'] = heart_disease_to_plot['diagnosis'].replace([4], "high-risk heart disease")

heart_disease_to_plot

index,age,sex,type_chestpain,resting_bp,cholesterol,fasting_blood_sugar,resting_ecg,max_heart_rate,exercise_induced_angina,ST_depression,ST_segment_slope,num_major_vessels,thallium_stress_test,diagnosis
192,43,1,4,132,247,1,2,143,1,0.1,2,,7.0,low-risk heart disease
227,67,0,3,152,277,0,0,172,0,0.0,1,1.0,3.0,low-risk heart disease
201,64,0,4,180,325,0,0,154,1,0.0,1,0.0,3.0,low-risk heart disease
280,57,1,4,110,335,0,0,143,1,3.0,2,1.0,7.0,moderate-risk heart disease
43,59,1,3,150,212,1,0,157,0,1.6,1,0.0,3.0,low-risk heart disease
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279,58,0,4,130,197,0,0,131,0,0.6,2,0.0,3.0,low-risk heart disease
243,61,1,1,134,234,0,0,145,0,2.6,2,2.0,3.0,moderate-risk heart disease
73,65,1,4,110,248,0,2,158,0,0.6,1,2.0,6.0,low-risk heart disease
165,57,1,4,132,207,0,0,168,1,0.0,1,0.0,7.0,low-risk heart disease


In [18]:
# blood pressure
bp_hist = alt.Chart(heart_disease_to_plot).mark_bar().encode(
    x = alt.X("resting_bp:Q", bin = True).title("Resting Blood Pressure (mmHg)"),
    y = alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    title = "Distribution of Blood Pressure"
)

bp_hist

In [19]:
# cholesterol
chol_hist = alt.Chart(heart_disease_to_plot).mark_bar().encode(
    x=alt.X("cholesterol:Q", bin = True).title("Cholesterol"),
    y=alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    title = "Distribution of Cholesterol"
)

chol_hist

In [20]:
# chest pain type is a categorical code (1-4),
# so it is encoded as ordinal rather than binned as a quantity
cp_hist = alt.Chart(heart_disease_to_plot).mark_bar().encode(
    x=alt.X("type_chestpain:O").title("Chest Pain Type"),
    y=alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    title = "Distribution of Chest Pain Type"
)

cp_hist

In [21]:
# maximum heart rate
hr_hist = alt.Chart(heart_disease_to_plot).mark_bar().encode(
    x=alt.X("max_heart_rate:Q", bin = True).title("Maximum Heart Rate"),
    y=alt.Y("count()").stack(False),
    color = "diagnosis:N"
).properties(
    title = "Distribution of Maximum Heart Rate"
)

hr_hist

Methods:

The analysis will use K-nearest neighbors classification from the scikit-learn Python package. The predictors will be resting blood pressure (mmHg), serum cholesterol (mg/dl), maximum heart rate, and chest pain type. We will center and scale the predictors so that each contributes comparably to the distance calculation. The diagnosis classes are imbalanced: most observations are low-risk heart disease. Because this imbalance could bias the classifier toward the majority class, our preprocessing will include oversampling the rare classes.
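
A minimal sketch of this workflow is given below, under several assumptions that are not part of the proposal: the data are randomly generated (`toy_train` is a hypothetical stand-in for the training set), oversampling is done with `sklearn.utils.resample` by drawing each class with replacement up to the size of the largest class, the chest pain code is treated as a numeric predictor, and `n_neighbors=5` is an arbitrary placeholder that would normally be chosen by cross-validation.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic stand-in for the training set; all values are randomly generated.
rng = np.random.default_rng(0)
n = 120
toy_train = pd.DataFrame({
    "resting_bp": rng.normal(132, 16, n),
    "cholesterol": rng.normal(243, 49, n),
    "max_heart_rate": rng.normal(150, 22, n),
    "type_chestpain": rng.integers(1, 5, n),
    "risk": rng.choice(["low", "moderate", "high"], size=n, p=[0.7, 0.2, 0.1]),
})

# Oversample every class (with replacement) up to the size of the largest class.
largest = toy_train["risk"].value_counts().max()
balanced = pd.concat([
    resample(group, replace=True, n_samples=largest, random_state=0)
    for _, group in toy_train.groupby("risk")
])

predictors = ["resting_bp", "cholesterol", "max_heart_rate", "type_chestpain"]

# Center and scale the predictors, then classify with K-nearest neighbors.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipeline.fit(balanced[predictors], balanced["risk"])

# Predict the risk class for a few observations.
predictions = knn_pipeline.predict(toy_train[predictors].head())
print(predictions)
```

Oversampling is applied to the training data only. The test set should keep its original class proportions so that the evaluation reflects the data the model would actually encounter.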
We will use a scatter plot to visualize the relationship between standardized serum cholesterol and standardized resting blood pressure. Studies show that high cholesterol may cause arterial plaque buildup, which narrows vessels and raises blood pressure. Because these variables are expected to be related, we will analyze them together to see whether the data support previous research. The plot will use color to distinguish the risk classes and point shape to distinguish chest pain types.


Expected outcomes and significance:

Research has linked high blood pressure, chest pain, and high cholesterol to cardiovascular conditions. Notably, high blood pressure has been identified as the leading risk factor for cardiovascular disease (Fuchs, 2020). Increased heart rate is also a strong independent predictor of cardiovascular events (Hjalmarson, 2007; Arnold et al., 2008). Based on this literature, we expect higher-risk heart disease diagnoses to be strongly and positively associated with high blood pressure, high serum cholesterol, increased heart rate, and severe chest pain.

By investigating the relationship between heart disease diagnosis and health metrics or symptoms that can be self-monitored, we may help inform the general population of their susceptibility to heart disease. We will provide a model for recognizing heart disease risk, which adds context on how severe symptoms must be to warrant more urgent health care consultation. This project also opens the way for future studies to ask whether genetic, environmental, or demographic factors refine these results.
