Tens of thousands of Canadians die from heart attacks every year. Among those whose heart attacks were not caused by a blood clot, two-thirds had undiagnosed heart disease. This failure to diagnose underlying heart disease contributes to the high death rate among heart attack patients. Patients with diagnosed heart disease can be monitored and treated when heart problems arise, potentially preventing a heart attack. There is therefore a clear need to evaluate a patient's risk of heart disease accurately, which motivates exploring whether individuals can be classified by their risk of heart disease.

This leads to our question: Is it possible to classify individuals into different levels of heart disease risk (low risk, moderate risk, or high risk) based on blood pressure readings, cholesterol, and other clinical features such as heart rate, ST depression, and thallium stress test results?

To train a classifier that could answer this question, we use the Cleveland heart disease database. It contains records for 303 patients who were admitted to the Cleveland Clinic between 1981 and 1984. These patients had no history of heart disease. Each underwent a range of clinical measurements, and their medical history, lifestyle, and family medical history were documented in detail.




In [12]:
# !pip3 install -U ucimlrepo 
import altair as alt
import pandas as pd
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split

In [13]:
# import dataset
heart_disease_dataset = fetch_ucirepo(name = 'Heart Disease')

# use the original table and rename its columns to descriptive names
heart_disease = heart_disease_dataset.data.original
heart_disease.rename(columns = {
                          "fbs" : "fasting_blood_sugar",
                          "chol" : "cholesterol", 
                          "cp":"type_chestpain",
                          "restecg" : "resting_ecg",
                          "thalach" : "max_heart_rate",
                          "exang" : "exercise_induced_angina",
                          "oldpeak" : "ST_depression", 
                          "slope" : "ST_segment_slope", 
                          "ca" : "num_major_vessels", 
                          "thal" : "thallium_stress_test", # thallium scan result: 3 = normal, 6 = fixed defect, 7 = reversible defect
                          "num" : "diagnosis",
                          "trestbps" : "resting_bp"
}, inplace = True)

heart_disease

index,age,sex,type_chestpain,resting_bp,cholesterol,fasting_blood_sugar,resting_ecg,max_heart_rate,exercise_induced_angina,ST_depression,ST_segment_slope,num_major_vessels,thallium_stress_test,diagnosis
0,63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
2,67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
3,37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,45,1,1,110,264,0,0,132,0,1.2,2,0.0,7.0,1
299,68,1,4,144,193,1,0,141,0,3.4,2,2.0,7.0,2
300,57,1,4,130,131,0,0,115,1,1.2,2,1.0,7.0,3
301,57,0,2,130,236,0,2,174,0,0.0,2,1.0,3.0,1
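
The preview shows `num_major_vessels` and `thallium_stress_test` as floats, which in pandas often signals missing entries. Below is a minimal sketch of how missing values could be counted before any modelling. The `sample` frame is a hypothetical stand-in built from three rows of this dataset (row 287 has no recorded `num_major_vessels`) and keeps only four of the 14 columns, so it illustrates the calls rather than the real counts.

```python
import numpy as np
import pandas as pd

# Stand-in for `heart_disease`: three rows copied from the notebook's printed
# output, restricted to a few columns (assumption: the real frame behaves the same).
sample = pd.DataFrame(
    {
        "age": [63, 67, 58],
        "cholesterol": [233, 286, 220],
        "num_major_vessels": [0.0, 3.0, np.nan],  # row 287 has no value here
        "thallium_stress_test": [6.0, 3.0, 7.0],
    },
    index=[0, 1, 287],
)

# Missing entries per column
missing_per_column = sample.isna().sum()

# Number of rows with at least one missing entry
rows_with_missing = int(sample.isna().any(axis=1).sum())

print(missing_per_column)
print("rows with missing data:", rows_with_missing)
```

Running the same two expressions on the full `heart_disease` frame would give the per-column and per-row counts for the whole dataset.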


In [14]:
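# split data into training (75%) and test (25%) sets
# note: no random_state is set, so the split changes on every run
heart_disease_train, heart_disease_test = train_test_split(heart_disease, train_size = 0.75)

# summary table of the training data including:
# the mode, mean, median, and std deviation of each variable,
# and whether each variable has missing values (taken from the dataset metadata)

# total number of observations in the training set
num_observations = len(heart_disease_train.index)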


predictor_modes = heart_disease_train.apply("mode").iloc[[0]].melt()

table = pd.DataFrame({
    'variables' : heart_disease_dataset.variables.name,
    'missing_values' : heart_disease_dataset.variables.missing_values,
    'type' : heart_disease_dataset.variables.type,
    'role' : heart_disease_dataset.variables.role,
    'mode' : predictor_modes["value"], 
    'mean' : heart_disease_train.apply("mean").values,
    'median' : heart_disease_train.apply("median").values,
    'std_deviation' : heart_disease_train.apply("std").values
})

table

index,variables,missing_values,type,role,mode,mean,median,std_deviation
0,age,no,Integer,Feature,58.0,53.757709,54.0,9.157828
1,sex,no,Categorical,Feature,1.0,0.687225,1.0,0.464648
2,cp,no,Categorical,Feature,4.0,3.180617,3.0,0.949137
3,trestbps,no,Integer,Feature,130.0,130.590308,130.0,16.966508
4,chol,no,Integer,Feature,197.0,246.590308,240.0,52.891645
5,fbs,no,Categorical,Feature,0.0,0.136564,0.0,0.344145
6,restecg,no,Categorical,Feature,0.0,0.991189,1.0,0.995526
7,thalach,no,Integer,Feature,162.0,150.251101,152.0,21.902104
8,exang,no,Categorical,Feature,0.0,0.303965,0.0,0.460984
9,oldpeak,no,Integer,Feature,0.0,0.986784,0.6,1.08052
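
Because the split in this cell is random and unseeded, the class balance of the training set can differ between runs. The sketch below shows one way to count observations per class and to make a seeded 75/25 split that samples within each class. It uses pandas' `groupby(...).sample(...)` as a stand-in for the notebook's `train_test_split` call (which itself accepts `random_state` and `stratify` arguments). The `preview` frame, the `has_disease` column, and the `SEED` value are assumptions for illustration; the nine rows come from the preview of `heart_disease` shown earlier.

```python
import pandas as pd

SEED = 123  # arbitrary value chosen for this sketch

# Nine rows taken from the preview of `heart_disease` (age and diagnosis only).
preview = pd.DataFrame(
    {
        "age": [63, 67, 67, 37, 41, 45, 68, 57, 57],
        "diagnosis": [0, 2, 1, 0, 0, 1, 2, 3, 1],
    },
    index=[0, 1, 2, 3, 4, 298, 299, 300, 301],
)

# Binary label: any diagnosis above 0 counts as heart disease.
preview["has_disease"] = preview["diagnosis"] > 0

# Number of observations in each class
class_counts = preview["has_disease"].value_counts()

# Stratified split: draw 75% of the rows within each class for training
# and keep the remaining rows for testing.
train = preview.groupby("has_disease").sample(frac=0.75, random_state=SEED)
test = preview.drop(train.index)

print(class_counts)
print(len(train), "training rows,", len(test), "test rows")
```

Fixing the seed makes the summary statistics and plots reproducible, and sampling within each class keeps both classes represented in the training set.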


In [40]:
# To plot distribution of predictor variables
# we need to replace the diagnosis values with their corresponding names
heart_disease_to_plot = heart_disease_train.copy()

# any value larger than 0 will be classified as heart disease
# any value equal to 0 will be classified as no heart disease
heart_disease_to_plot['diagnosis'] = heart_disease_to_plot['diagnosis'].replace([1,2,3,4], "heart disease")
heart_disease_to_plot['diagnosis'] = heart_disease_to_plot['diagnosis'].replace([0], "no heart disease")

heart_disease_to_plot

index,age,sex,type_chestpain,resting_bp,cholesterol,fasting_blood_sugar,resting_ecg,max_heart_rate,exercise_induced_angina,ST_depression,ST_segment_slope,num_major_vessels,thallium_stress_test,diagnosis
172,59,0,4,174,249,0,0,143,1,0.0,2,0.0,3.0,heart disease
46,51,1,3,110,175,0,0,123,0,0.6,1,0.0,3.0,no heart disease
287,58,1,2,125,220,0,0,144,0,0.4,2,,7.0,no heart disease
43,59,1,3,150,212,1,0,157,0,1.6,1,0.0,3.0,no heart disease
101,34,1,1,118,182,0,2,174,0,0.0,1,0.0,3.0,no heart disease
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72,62,1,4,120,267,0,0,99,1,1.8,2,2.0,7.0,heart disease
38,55,1,4,132,353,0,0,132,1,1.2,2,1.0,7.0,heart disease
61,46,0,3,142,177,0,2,160,1,1.4,3,0.0,3.0,no heart disease
299,68,1,4,144,193,1,0,141,0,3.4,2,2.0,7.0,heart disease
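
The recoding in this cell collapses `diagnosis` into two classes, whereas the research question refers to three risk levels. The sketch below shows one way the 0 to 4 `diagnosis` values might be grouped into three levels with `pd.cut`. The cut-offs (0 as low, 1 to 2 as moderate, 3 to 4 as high) are an assumption made purely for illustration, since neither this notebook nor the data shown defines risk levels, and the `risk_level` name is hypothetical.

```python
import pandas as pd

# Diagnosis values as they appear in the dataset preview
# (0 = no heart disease, 1-4 = heart disease).
diagnosis = pd.Series([0, 2, 1, 0, 0, 1, 2, 3, 1], name="diagnosis")

# ASSUMED grouping, not taken from the dataset documentation:
# 0 -> low risk, 1-2 -> moderate risk, 3-4 -> high risk
risk_level = pd.cut(
    diagnosis,
    bins=[-1, 0, 2, 4],  # right-closed intervals (-1, 0], (0, 2], (2, 4]
    labels=["low risk", "moderate risk", "high risk"],
).rename("risk_level")

print(pd.concat([diagnosis, risk_level], axis=1))
```

Any different choice of cut-offs only requires changing `bins`; the labels stay ordered because `pd.cut` returns an ordered categorical.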


In [42]:
# resting blood pressure (mm Hg)
bp_hist = alt.Chart(heart_disease_to_plot).mark_bar().encode(
    alt.X("resting_bp:Q", bin = True).title("Resting Blood Pressure (mm Hg)"),
    y = "count()"
).properties(
    title = "Distribution of Resting Blood Pressure"
).facet(
    column = "diagnosis:N"
)

bp_hist

In [43]:
# serum cholesterol (mg/dl)
chol_hist = alt.Chart(heart_disease_to_plot).mark_bar().encode(
    alt.X("cholesterol:Q", bin = True).title("Serum Cholesterol (mg/dl)"),
    y = "count()"
).properties(
    title = "Distribution of Serum Cholesterol"
).facet(
    column = "diagnosis:N"
)

chol_hist

In [44]:
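# chest pain type is a categorical code (1 = typical angina, 2 = atypical angina,
# 3 = non-anginal pain, 4 = asymptomatic), so plot it as discrete categories, not bins
cp_hist = alt.Chart(heart_disease_to_plot).mark_bar().encode(
    alt.X("type_chestpain:O").title("Chest Pain Type"),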
    y = "count()"
).properties(
    title = "Distribution of Chest Pain Type"
).facet(
    column = "diagnosis:N"
)

cp_hist
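
Since chest pain type is a categorical code rather than a measurement, a count table is a useful cross-check on the chart. The sketch below uses `pd.crosstab` on nine training rows copied from the `heart_disease_to_plot` output shown earlier. The `rows` and `counts` names are hypothetical, and nine rows are far too few to say anything about the full training set.

```python
import pandas as pd

# Nine rows copied from the printed `heart_disease_to_plot` output
# (chest pain type and recoded diagnosis only).
rows = pd.DataFrame(
    {
        "type_chestpain": [4, 3, 2, 3, 1, 4, 4, 3, 4],
        "diagnosis": [
            "heart disease", "no heart disease", "no heart disease",
            "no heart disease", "no heart disease", "heart disease",
            "heart disease", "no heart disease", "heart disease",
        ],
    },
    index=[172, 46, 287, 43, 101, 72, 38, 61, 299],
)

# Count of observations for each chest pain type within each diagnosis group
counts = pd.crosstab(rows["type_chestpain"], rows["diagnosis"])

print(counts)
```

On the real data, the same call with `heart_disease_to_plot` in place of `rows` would tabulate the counts that the faceted chart displays.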

In [46]:
# maximum heart rate achieved (beats per minute)
hr_hist = alt.Chart(heart_disease_to_plot).mark_bar().encode(
    alt.X("max_heart_rate:Q", bin = True).title("Maximum Heart Rate (bpm)"),
    y = "count()"
).properties(
    title = "Distribution of Maximum Heart Rate"
).facet(
    column = "diagnosis:N"
)

hr_hist
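
The histograms compare the two diagnosis groups visually; a per-group numeric summary can complement them. The sketch below computes the mean and median of the three continuous predictors plotted above, grouped by diagnosis. It uses nine training rows copied from the `heart_disease_to_plot` output shown earlier, so the numbers illustrate the method only and are not results for the full training set. The `rows` and `summary` names are hypothetical.

```python
import pandas as pd

# Nine rows copied from the printed `heart_disease_to_plot` output.
rows = pd.DataFrame(
    {
        "resting_bp": [174, 110, 125, 150, 118, 120, 132, 142, 144],
        "cholesterol": [249, 175, 220, 212, 182, 267, 353, 177, 193],
        "max_heart_rate": [143, 123, 144, 157, 174, 99, 132, 160, 141],
        "diagnosis": [
            "heart disease", "no heart disease", "no heart disease",
            "no heart disease", "no heart disease", "heart disease",
            "heart disease", "no heart disease", "heart disease",
        ],
    },
    index=[172, 46, 287, 43, 101, 72, 38, 61, 299],
)

predictors = ["resting_bp", "cholesterol", "max_heart_rate"]

# Mean and median of each predictor within each diagnosis group
summary = rows.groupby("diagnosis")[predictors].agg(["mean", "median"])

print(summary)
```

Replacing `rows` with `heart_disease_to_plot` would produce the same summary for the actual training data.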