# Heart Disease Level Classification Using Age, Cholesterol Level and Blood Pressure

# Introduction

Heart disease is a general term for conditions that affect the heart. This project describes heart disease on five levels, 0 to 4:

- Level 0: people with no heart disease.
- Level 1: people who are at risk for heart failure but do not yet have symptoms or structural or functional heart disease.
- Level 2: people without current or previous symptoms of heart failure but with structural heart disease, increased filling pressures in the heart, or other risk factors.
- Level 3: people with current or previous symptoms of heart failure.
- Level 4: people with heart failure symptoms that interfere with daily life functions or lead to repeated hospitalizations.

Question:
What heart disease level do age, cholesterol level, and blood pressure predict for a new observation?

## Libraries

In [1]:
library(repr)
library(tidyverse)
library(tidymodels)
library(reshape2)


## Loading Data and Data Wrangling

In [5]:
# The raw file has no header row, so read it without column names and assign them afterwards
cleveland_data <- read_csv("data/heart_disease/processed.cleveland.data", col_names = FALSE)
colnames(cleveland_data) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# Collapse num to two classes (0 = no heart disease, 1 = any level from 1 to 4)
# and drop ca and thal, the two columns that were read as character
heart_disease <- cleveland_data |>
    mutate(num = as_factor(ifelse(num == 0, 0, 1))) |>
    select(age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, num)

heart_disease |> head(6)

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


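| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | num |
|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 |
| 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 1 |
| 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 1 |
| 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 |
| 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 |
| 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0 |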
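
To make the effect of the `mutate()` step on `num` concrete, here is a minimal base-R sketch on a made-up vector. The name `num_raw` and its values are hypothetical and do not come from the Cleveland file. Every non-zero level is collapsed to 1, so the resulting label separates "no heart disease" from "any heart disease" rather than keeping five levels.

```r
# Hypothetical raw labels on the 0-4 scale (illustrative values only)
num_raw <- c(0, 2, 1, 0, 4, 3)

# Collapse every non-zero level to 1, then store the result as a factor
num_binary <- factor(ifelse(num_raw == 0, 0, 1))

# Cross-tabulate the original levels against the collapsed label
table(num_raw, num_binary)
```

The cross-tabulation shows which original levels end up in each of the two classes.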


## Building Model

Splitting the data into training and testing sets

I split the data into training and testing sets so that the K-nearest neighbors classifier is built using only the training data, and its accuracy is then evaluated on the testing data, which the classifier has not seen.
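
As a rough illustration of what a stratified 75/25 split does, the base-R sketch below draws 75% of the row indices within each class. It uses a made-up data frame `toy` (hypothetical, not the heart data) and is not meant to reproduce how `initial_split()` works internally; it only shows the idea of keeping class proportions similar in both parts.

```r
set.seed(199)

# Hypothetical data: 40 observations, 28 in class 0 and 12 in class 1
toy <- data.frame(id = 1:40,
                  num = factor(rep(c(0, 1), times = c(28, 12))))

# Within each class, draw 75% of the row indices for the training set
train_idx <- unlist(
  lapply(split(seq_len(nrow(toy)), toy$num), function(idx) {
    idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
  }),
  use.names = FALSE
)

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions in each part
prop.table(table(toy_train$num))
prop.table(table(toy_test$num))
```

Sampling within each class is what keeps a rare class from being left out of the testing set by chance.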

In [3]:
set.seed(199)

heart_split <- initial_split(heart_disease, prop = 0.75, strata = num)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

## Summarizing the Training Data

I summarize only the training data because the model is built using only the training data.

In [4]:
heart_num <- heart_train |>
    select(age, trestbps, chol) |>
    pivot_longer(cols = age:chol,
                names_to = "Variable",
                values_to = "Stat") |>
    group_by(Variable) |>
    summarize(Minimum = min(Stat), 
              Maximum = max(Stat), 
              Mean = mean(Stat), 
              Median = median(Stat), 
              # Most frequent value as a number; which.max() keeps the first (smallest) value when there is a tie
              Mode = as.numeric(names(which.max(table(Stat)))), 
              Standard_deviation = sd(Stat))

heart_level_summary <- heart_train |>
    select(num) |>
    pivot_longer(cols = num,
                names_to = "Variable",
                values_to = "Stat") |>
    group_by(Variable) |>
    table() |>
    as.data.frame.matrix()

num_observations <- nrow(heart_train)

heart_level_summary
heart_num
num_observations

ERROR: Error in `pivot_longer()`:
! Can't combine `age` <double> and `trestbps` <character>.


## Visualizing Data

In [None]:
# Density of age for each heart disease class
age_dist <- ggplot(heart_train, aes(x = age, color = num)) +
                geom_density() +
                labs(x = "Age", 
                     y = "Density", 
                     color = "Heart Disease Level", 
                     title = "Density plot of Age by Heart Disease Level", 
                     subtitle = "Figure 1") +
                theme(text = element_text(size = 20))

# Cholesterol level against age, colored by heart disease class
age_chol_plot <- ggplot(heart_train, aes(x = age, y = chol, color = num)) +
                    geom_point() +
                    labs(x = "Age", 
                         y = "Cholesterol Level (mg/dl)", 
                         color = "Heart Disease Level", 
                         title = "Scatter plot of Cholesterol Level vs Age by Heart Disease Level", 
                         subtitle = "Figure 2") +
                    theme(text = element_text(size = 20))

# Resting blood pressure against age, colored by heart disease class
age_blood_plot <- ggplot(heart_train, aes(x = age, y = trestbps, color = num)) +
                    geom_point() +
                    labs(x = "Age", 
                         y = "Blood Pressure", 
                         color = "Heart Disease Level", 
                         title = "Scatter plot of Blood Pressure vs Age by Heart Disease Level", 
                         subtitle = "Figure 3") +
                    theme(text = element_text(size = 20))

# Plot size for the notebook; the repr option is repr.plot.height, not repr.plot.length
options(repr.plot.height = 10, repr.plot.width = 15)

In [None]:
age_dist

In [None]:
age_chol_plot

In [None]:
age_blood_plot

## Best K

Steps to find the best k:
1. Create 5 cross-validation folds from the training data
2. Create the model specification and a standardizing recipe for tuning the classifier
3. Choose 15 candidate values for k (1 to 15)
4. Create the workflow to get the mean accuracy for each value of k
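
The steps above can be sketched without tidymodels to show what the tuning loop does. This is a simplified illustration under several assumptions: it uses the built-in `iris` data as a stand-in for the heart data, `class::knn` (from the `class` package that normally ships with R) instead of the `kknn` engine, and plain random folds without stratification. The names `toy`, `folds`, `cv_accuracy`, and `k_results` are hypothetical.

```r
library(class)  # provides knn(); assumed to be installed, as in standard R distributions

set.seed(199)

# Stand-in data: two numeric predictors and a two-class label
toy <- iris[iris$Species != "setosa", c("Sepal.Length", "Sepal.Width", "Species")]
toy$Species <- droplevels(toy$Species)

# Step 1: assign every row to one of 5 folds at random
folds <- sample(rep(1:5, length.out = nrow(toy)))

# Steps 2 and 4: for one value of k, standardize, fit, and score on each fold
cv_accuracy <- function(k) {
  fold_acc <- sapply(1:5, function(f) {
    train <- toy[folds != f, ]
    valid <- toy[folds == f, ]
    # Standardize with statistics from the training folds only
    mu  <- colMeans(train[, 1:2])
    sdv <- apply(train[, 1:2], 2, sd)
    train_x <- scale(train[, 1:2], center = mu, scale = sdv)
    valid_x <- scale(valid[, 1:2], center = mu, scale = sdv)
    pred <- knn(train_x, valid_x, cl = train$Species, k = k)
    mean(pred == valid$Species)
  })
  mean(fold_acc)
}

# Step 3: evaluate k = 1 to 15 and look at the value with the highest mean accuracy
k_results <- data.frame(neighbors = 1:15,
                        mean_accuracy = sapply(1:15, cv_accuracy))
k_results[which.max(k_results$mean_accuracy), ]
```

Standardizing inside each fold mirrors what a recipe does during tuning: the validation fold is scaled with statistics learned from the training folds, so no information from the held-out rows leaks into preprocessing.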

In [None]:
colnames(heart_disease)

In [None]:
set.seed(199) 

heart_vfold <- vfold_cv(heart_train, v = 5, strata = num)

heart_spec <- nearest_neighbor(weight_fun = "rectangular", neighbors = tune()) |>
            set_engine("kknn") |>
            set_mode("classification")

# Use only the three predictors named in the question, then standardize them
heart_recipe <- recipe(num ~ age + trestbps + chol, data = heart_train) |>
                step_scale(all_predictors()) |>
                step_center(all_predictors())  

k_vals <- tibble(neighbors = seq(from = 1, to = 15, by = 1))

k_data <- workflow() |>
        add_recipe(heart_recipe) |>
        add_model(heart_spec) |>
        tune_grid(resamples = heart_vfold, grid = k_vals) |>
        collect_metrics() |>
        filter(.metric == "accuracy")

k_data

In [None]:
k_plot <- ggplot(k_data, aes(x = neighbors, y = mean)) +
            geom_line() +
            geom_point() +
            labs(x = "Neighbors", 
                 y = "Mean accuracy", 
                 title = "Mean cross-validation accuracy for each number of neighbors", 
                 subtitle = "Figure 4") +
            theme(text = element_text(size = 18))

k_plot

In [None]:
# Despite its name, k_min holds the k with the highest mean accuracy
k_min <- k_data |>
            arrange(desc(mean)) |>
            slice(1) |>
            pull(neighbors)

k_min

## Building the model using the best k

In [None]:
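# Use the k selected by cross-validation instead of a hard-coded value
heart_spec_known <- nearest_neighbor(weight_fun = "rectangular", neighbors = k_min) |>
            set_engine("kknn") |>
            set_mode("classification")

# Fit with the finalized spec; heart_spec still contains the tune() placeholder
heart_fit <- workflow() |>
        add_recipe(heart_recipe) |>
        add_model(heart_spec_known) |>
        fit(heart_train)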


## Testing the model

In [None]:
set.seed(199)
predictions <- predict(heart_fit, heart_test) |>
                bind_cols(heart_test)

print("Table 5")
predictions

In [None]:
heart_metrics <- predictions |>
                metrics(truth = num, estimate = .pred_class)

heart_conf_mat <- predictions |>
                conf_mat(truth = num, estimate = .pred_class)


heart_metrics

heart_conf_mat
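
For readers who want to see how accuracy relates to the confusion matrix, here is a minimal base-R sketch. The vectors `truth` and `estimate` are hypothetical values chosen for illustration and are not results from this analysis.

```r
# Hypothetical true labels and predictions (not results from this analysis)
truth    <- factor(c(0, 0, 1, 1, 0, 1, 0, 1), levels = c(0, 1))
estimate <- factor(c(0, 1, 1, 1, 0, 0, 0, 1), levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are the true labels
cm <- table(Prediction = estimate, Truth = truth)

# Accuracy: correct predictions (the diagonal) divided by all predictions
accuracy <- sum(diag(cm)) / sum(cm)

cm
accuracy
```

The off-diagonal cells separate the two kinds of error: predicting disease for a healthy person and missing disease in a person who has it. In a medical setting these may deserve different weight, which accuracy alone does not show.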