In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

Data set: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

In [None]:
# Load the dataset
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
# Missing values in this file are coded as "?" (they occur in ca and thal)
data_raw <- read_csv(url, col_names = names, na = "?")

In [None]:
# Wrangle data
data <- data_raw %>%
    select(age, cp, trestbps, thalach, chol, num, fbs) %>% # select only columns that are used
    mutate(cp = as_factor(cp), # convert categorical variables to factors
          num = as_factor(num),
          fbs = as_factor(fbs)) %>%
    rename(heart_disease = num, # rename some columns with more intuitive names
          rest_bp = trestbps,
          max_heart_rate = thalach)

In [None]:
# Currently, there are 5 levels of heart disease, 0, 1, 2, 3, 4.
# Classify heart_disease to either 0 (no visible heart disease) or 1-4 (mild heart disease to very severe)
data <- data %>%
    mutate(heart_disease = as_factor(ifelse(heart_disease == 0, "No", "Yes")))
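# cp is coded 1-4 in this data set (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic),
# so it could similarly be collapsed to angina (1-2) or no angina (3-4)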
data

In [None]:
# Split the data set into training data and testing data
set.seed(1) # set the seed

heart_split <- initial_split(data, prop = 0.75, strata = heart_disease)
heart_training <- training(heart_split)
heart_testing <- testing(heart_split)
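
As a quick sanity check on the stratified split, one could compare the class proportions in the training and testing sets. The sketch below uses a small simulated tibble (`toy`, a made-up stand-in with the notebook's column names, not the Cleveland data) so that it runs on its own; in the notebook the same steps would apply to `heart_training` and `heart_testing`.

```r
library(tidymodels)

# Simulated stand-in for the notebook's `data` (values are made up)
set.seed(1)
toy <- tibble(
    age = round(rnorm(200, mean = 55, sd = 9)),
    heart_disease = factor(sample(c("No", "Yes"), 200, replace = TRUE, prob = c(0.55, 0.45)))
)

toy_split <- initial_split(toy, prop = 0.75, strata = heart_disease)
toy_training <- training(toy_split)
toy_testing <- testing(toy_split)

# Class counts and proportions in each piece of the split
class_props <- bind_rows(
    training = count(toy_training, heart_disease),
    testing = count(toy_testing, heart_disease),
    .id = "set"
) %>%
    group_by(set) %>%
    mutate(prop = n / sum(n)) %>%
    ungroup()
class_props
```

With `strata = heart_disease`, the "Yes" proportion should be nearly the same in both sets.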

## Preliminary Exploratory Data Analysis

Below are histograms for the four candidate predictor variables: age, blood cholesterol, maximum heart rate, and resting blood pressure. Each graph is split into two groups: people who have heart disease (in blue) and people who don't (in red).

In [None]:
# Histograms for numerical predictor variables (training data only)
# Number of bins, by Sturges' formula, based on the rows actually plotted
bins_num <- ceiling(log2(nrow(heart_training))) + 1

# Colour and facet by the heart_disease column itself, so that ggplot2
# looks the variable up inside heart_training
ggplot(heart_training, aes(x = age, fill = heart_disease)) +
    geom_histogram(bins = bins_num) +
    facet_grid(rows = vars(heart_disease)) +
    labs(x = "Age (years)", fill = "Presence of Heart Disease")
ggplot(heart_training, aes(x = rest_bp, fill = heart_disease)) +
    geom_histogram(bins = bins_num) +
    facet_grid(rows = vars(heart_disease)) +
    labs(x = "Resting Blood Pressure (mmHg)", fill = "Presence of Heart Disease")
ggplot(heart_training, aes(x = max_heart_rate, fill = heart_disease)) +
    geom_histogram(bins = bins_num) +
    facet_grid(rows = vars(heart_disease)) +
    labs(x = "Maximum Heart Rate Achieved (beats per minute)", fill = "Presence of Heart Disease")
ggplot(heart_training, aes(x = chol, fill = heart_disease)) +
    geom_histogram(bins = bins_num) +
    facet_grid(rows = vars(heart_disease)) +
    labs(x = "Blood Cholesterol (mg/dL)", fill = "Presence of Heart Disease")

Then, we can roughly compare the distributions to see if any variable individually seems to be associated with the presence of heart disease. From these plots, it seems that age is positively associated with heart disease, and maximum heart rate is negatively associated with heart disease. The other two variables are not clearly associated in either direction.
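
A numeric summary can back up the visual comparison. The sketch below computes the mean of each numeric predictor within each class; it uses a tiny made-up tibble (`toy_training`, with column names borrowed from the notebook) so that it runs on its own, and the same pipeline could be applied to `heart_training`.

```r
library(tidyverse)

# Made-up stand-in for heart_training (not the real data)
toy_training <- tibble(
    age = c(40, 45, 50, 60, 62, 65),
    max_heart_rate = c(170, 165, 160, 140, 135, 130),
    heart_disease = factor(c("No", "No", "No", "Yes", "Yes", "Yes"))
)

# Mean of every numeric column, plus the group size, per class
class_summary <- toy_training %>%
    group_by(heart_disease) %>%
    summarize(across(where(is.numeric), mean), n = n())
class_summary
```

A higher mean age and a lower mean maximum heart rate in the "Yes" group would agree with the directions suggested by the histograms.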

The bar plot below shows the number of individuals with and without heart disease in the data set.

In [None]:
# Bar plots for target variable (heart disease)
ggplot(data, aes(x = heart_disease)) +
    geom_bar() +
    xlab("Presence of Heart Disease") +
    ylab("Number of People")

Notice that the two categories contain roughly equal numbers of people.

## Methods

We will use K-nearest neighbors classification, with age, resting blood pressure, maximum heart rate, and blood cholesterol as candidate predictor variables and the diagnosis of heart disease as the target variable. We will fit the model on the training set only, and further split the training set into folds to perform cross-validation. We will try the analysis with various combinations of predictor variables, and use the combination that produces the highest cross-validation accuracy.
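
The plan above could be sketched with tidymodels roughly as follows. This is a minimal illustration under several assumptions, not the final analysis: the data are simulated (`toy_training` is a made-up stand-in for `heart_training`), only two of the candidate predictors are used, the number of folds (5) and the grid of K values are arbitrary choices, and the `kknn` package is assumed to be installed as the engine.

```r
library(tidymodels)

# Simulated stand-in for heart_training (values are made up)
set.seed(1)
n <- 150
toy_training <- tibble(
    heart_disease = factor(rep(c("No", "Yes"), each = n / 2)),
    age = rnorm(n, mean = ifelse(heart_disease == "Yes", 58, 52), sd = 8),
    max_heart_rate = rnorm(n, mean = ifelse(heart_disease == "Yes", 140, 158), sd = 20)
)

# Scale and centre the predictors so that no variable dominates the distance
knn_recipe <- recipe(heart_disease ~ age + max_heart_rate, data = toy_training) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors())

# K-nearest neighbors model with the number of neighbors left to tune
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
    set_engine("kknn") %>%
    set_mode("classification")

# 5-fold cross-validation on the training data, stratified by class
folds <- vfold_cv(toy_training, v = 5, strata = heart_disease)
k_grid <- tibble(neighbors = seq(1, 21, by = 4))

cv_results <- workflow() %>%
    add_recipe(knn_recipe) %>%
    add_model(knn_spec) %>%
    tune_grid(resamples = folds, grid = k_grid) %>%
    collect_metrics() %>%
    filter(.metric == "accuracy")

# K with the highest mean cross-validation accuracy
best_k <- cv_results %>%
    slice_max(mean, n = 1, with_ties = FALSE) %>%
    pull(neighbors)
best_k
```

Repeating this with a different formula in `recipe()` for each combination of predictors would allow the cross-validation accuracies to be compared across combinations.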

To visualize the results, we could use a confusion matrix in order to identify the proportion of false positives and false negatives. If we end up using only 2 (or 3) predictors, we could use a scatterplot (or 3D scatterplot, respectively).
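
As an illustration of the confusion matrix idea, the sketch below builds one from a small hand-made table of true and predicted classes (`toy_predictions` is hypothetical; in the analysis it would come from predicting on `heart_testing`). It uses `conf_mat()` and `accuracy()` from yardstick, which is loaded with tidymodels.

```r
library(tidymodels)

# Hypothetical predictions: true class and predicted class (made up)
toy_predictions <- tibble(
    heart_disease = factor(c("No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"),
                           levels = c("No", "Yes")),
    .pred_class = factor(c("No", "Yes", "No", "Yes", "No", "Yes", "Yes", "No"),
                         levels = c("No", "Yes"))
)

# Rows of the table are predictions, columns are the true classes
cm <- conf_mat(toy_predictions, truth = heart_disease, estimate = .pred_class)
cm

# Overall accuracy of the same predictions
acc <- accuracy(toy_predictions, truth = heart_disease, estimate = .pred_class)
acc
```

In the resulting table, the cell with prediction "Yes" and truth "No" counts false positives, and the cell with prediction "No" and truth "Yes" counts false negatives.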

In [None]:
# TODO: 3D scatterplot like the textbook's, created with Plotly (available in R through the plotly package)

## Expected Outcomes