# Group Project Proposal
## Classification of The Presence of Heart Disease
* Jennifer Wang
* Hagan Cheung
* Justine Song


### EXPLORING A DATASET
Before we can conduct data analysis, we must outline our goals and understand the information available. We do this in three steps:
* Describing the variables in the heart disease data set.
* Loading the heart disease data.
* Preprocessing the heart disease data for comparison.

### Introduction

Coronary heart disease (or coronary artery disease, CAD) is a chronic heart condition characterized by obstruction of blood flow to the heart due to the buildup of cholesterol and degenerative tissue plaque (atherosclerosis) in the coronary arteries. This deprives the heart of oxygen and nutrients, leading to chest pain (angina pectoris) and heart attacks.

As this is an irreversible and incurable condition, it is important to diagnose the disease as early as possible in order to slow its progression and prolong the patient's life expectancy. Our chosen dataset records a number of variables for each patient (such as age and serum cholesterol content) which, according to our hypothesis, may play a role in diagnosing the disease and its progression.

We are curious to see how these variables help us accurately diagnose CAD, and we aim to create a model that predicts CAD presence/progression in patients with high accuracy based on trends observed in the dataset. Additionally, we hope to identify a threshold or litmus test in the variables to help quantify the likelihood of CAD presence/progression. We hypothesize that there is a positive relationship between age and the other variables in the dataset, and that this relationship contributes to a higher likelihood of CAD presence.

### Method
Our objective is to conduct a multiclass classification (more than two categories) to answer a predictive question: can we use the patient attributes available to us to predict the presence of heart disease in a patient? Our class label (response variable) will be num, the diagnosis of heart disease (angiographic disease status), ranging from 0 to 4:
* presence, in increasing severity (values 1, 2, 3, 4), and
* absence (value 0).
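
As a minimal sketch of how the 0 to 4 coding described above could be handled in R, the block below turns num into a factor and derives a two-level presence/absence label. The vector `num_example` and the labels "absent"/"present" are hypothetical, chosen only for illustration.

```r
# Hypothetical example values of num (not taken from the dataset)
num_example <- c(0, 2, 0, 4, 1, 3, 0)

# Keep all five levels for a multiclass label
num_multiclass <- factor(num_example, levels = 0:4)

# Collapse values 1-4 into a single "present" level
num_binary <- factor(ifelse(num_example == 0, "absent", "present"),
                     levels = c("absent", "present"))

table(num_multiclass)
table(num_binary)
```

Keeping both versions side by side makes it easy to compare a five-class model with a simpler presence/absence model later on.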

The dataset provides the following attributes:
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
   * Value 1: typical angina
   * Value 2: atypical angina
   * Value 3: non-anginal pain
   * Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
    * Value 0: normal
    * Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
    * Value 1: upsloping
    * Value 2: flat
    * Value 3: downsloping
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. num: diagnosis of heart disease (angiographic disease status)
    * Value 0: < 50% diameter narrowing
    * Value 1: > 50% diameter narrowing
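
The coded categorical attributes above can be made more readable by attaching their descriptions as factor labels. A minimal sketch for cp, using hypothetical example codes rather than values from the dataset:

```r
# Hypothetical example codes for the cp attribute
cp_codes <- c(4, 1, 3, 2, 4)

# Attach the descriptions from the attribute list as factor labels
cp_labelled <- factor(cp_codes,
                      levels = 1:4,
                      labels = c("typical angina", "atypical angina",
                                 "non-anginal pain", "asymptomatic"))
cp_labelled
```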

We plan to conduct data analysis to build a classifier using three different predictors (variables/columns):
* age: age in years.* chol: serum cholesterol in mg/d.
* thalach: maximum heart rate achieved.

To summarize:
Our data analysis will be conducted using a classification model and visualized with scatter plots. Our class label will be the “num” column, which classifies the diagnosis of heart disease progression from 0 to 4 (0 being no presence, 4 being highly progressed). The predictors that we will use have been narrowed down to four at present: age, serum cholesterol content (chol), maximum heart rate (thalach), and the number of major vessels visible by fluoroscopy (ca). We initially chose these variables because they have numerical values and thus provide a good foundation for drafting a classification model; furthermore, they are highly relevant factors in CAD and are expected to play major roles in achieving our objective of accurately diagnosing CAD. We hope to increase the number of predictors as we explore this project further.

### Expected outcomes and significanceWe expect to see a general trend in the data where an increase or positive deviation in most variables (e.g. age, serum cholesterol, etc.) is positively correlated with progression of CAD and with each other. For example, we expect to see that serum cholesterol content  is positively correlated with maximum heart rate, and that positive deviation in both these variables is correlated with the presence/progression of CAD. We also expect that these positive correlations are in turn increasingly positively  correlated with the presence/progression of CAD. The impact of these findings would be significant with regards to identifying a definite threshold or litmus in these variables, which could be extremely helpful in outlining a more concrete and reliable method of diagnosing CAD.
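
One way to check the expected correlations is a pairwise correlation matrix. The sketch below uses base R's `cor()` on a small, made-up data frame; the values are hypothetical and only show the mechanics, not any result from the heart disease data.

```r
# Hypothetical values for illustration only
toy <- data.frame(
  age     = c(40, 45, 50, 55, 60, 65),
  chol    = c(200, 215, 210, 240, 250, 265),
  thalach = c(180, 172, 170, 160, 155, 150)
)

# Pearson correlation between every pair of columns
cor_matrix <- cor(toy)
round(cor_matrix, 2)
```

Each off-diagonal entry lies between -1 and 1; the sign gives the direction of the linear relationship and the magnitude gives its strength.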


# Load Data From The Original Source On The Web 

In [2]:
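#install any packages that are not yet available, then load the necessary libraries
#(a package must be installed before library() can load it)
install.packages("tidyverse")
install.packages("themis")
library(tidyverse)
library(repr)
library(tidymodels)
library(themis)
options(repr.matrix.max.rows = 6)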

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done

“installation of package ‘themis’ had non-zero exit status”

ERROR: Error in library(themis): there is no package called ‘themis’
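
The error above occurs because themis failed to install. As a hedged sketch, the block below checks which packages are available before trying to load them; `missing_packages` is a hypothetical helper name, not part of any library.

```r
# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "repr", "tidymodels", "themis")
to_install <- missing_packages(needed)

if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```

Running a check like this first makes an installation failure visible before a later cell fails on a missing function.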


In [3]:
#loads the data
cleveland <- read_csv("https://raw.githubusercontent.com/JennWan/Group_Project/main/new%20data/newcleveland_data.csv", col_names = FALSE)

Rows: 303 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): X2, X3, X6, X7, X9, X11, X12, X13, X14, X15
dbl  (5): X1, X4, X5, X8, X10

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


# Wrangling and Cleaning The Data

In [4]:
#renaming variables for readability 
colnames(cleveland) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", 
                         "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

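#selecting predictors of interest and converting each column to a suitable type
#note: as.integer(ca) turns non-numeric entries into NA (with a warning);
#rows with a missing ca are removed by the filter() at the end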
cleveland_tidy <- cleveland |>
                select(age, cp, trestbps, chol, fbs, thalach, exang, ca, thal, num) |>
                mutate(age = as.integer(age), 
                       trestbps = as.integer(trestbps), 
                       chol = as.integer(chol), 
                       thalach = as.integer(thalach), 
                       cp = as_factor(cp), 
                       fbs = as_factor(fbs), 
                       exang = as_factor(exang), 
                       ca = as.integer(ca), 
                       ca = as_factor(ca), 
                       thal = as_factor(thal), 
                       num = as_factor(num)) |>
                filter(!is.na(ca), !is.na(thal))

cleveland_tidy

[1m[22m[36mℹ[39m In argument: `ca = as.integer(ca)`.
[33m![39m NAs introduced by coercion”


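| age | cp | trestbps | chol | fbs | thalach | exang | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<fct>` | `<int>` | `<int>` | `<fct>` | `<int>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` |
| 63 | angina | 145 | 233 | true | 150 | fal | 0 | fix | buff |
| 67 | asympt | 160 | 286 | fal | 108 | true | 3 | norm | sick |
| 67 | asympt | 120 | 229 | fal | 129 | true | 2 | rev | sick |
| 37 | notang | 130 | 250 | fal | 187 | fal | 0 | norm | buff |
| 41 | abnang | 130 | 204 | fal | 172 | fal | 0 | norm | buff |
| 56 | abnang | 120 | 236 | fal | 178 | fal | 0 | norm | buff |
| 62 | asympt | 140 | 268 | fal | 160 | fal | 2 | norm | sick |
| 57 | asympt | 120 | 354 | fal | 163 | true | 0 | norm | buff |
| 63 | asympt | 130 | 254 | fal | 147 | fal | 1 | rev | sick |
| 53 | asympt | 140 | 203 | true | 155 | true | 0 | rev | sick |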
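
The coercion warning above arises because some entries in ca are not numeric. The sketch below reproduces the effect on a small inline example. It assumes the missing values are encoded as "?", which is an assumption about the raw file rather than something shown in this notebook.

```r
# Hypothetical raw values; "?" as the missing-value marker is an assumption
ca_raw <- c("0", "3", "?", "2", "?")

# as.integer() turns anything non-numeric into NA (normally with a warning)
ca_int <- suppressWarnings(as.integer(ca_raw))
ca_int

# Alternative: declare the marker as NA while parsing, so no coercion warning arises
toy <- read.csv(text = "age,ca\n63,0\n67,3\n58,?\n", na.strings = "?")
toy

# Drop the rows with a missing ca, as the filter() step above does
toy_complete <- toy[!is.na(toy$ca), ]
nrow(toy_complete)
```

Declaring the missing-value marker at read time keeps the column numeric from the start and avoids relying on a warning to signal missing data.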


In [5]:
# find the number and percentage of observations in each heart disease class
# to check for class imbalance

num_obs <- nrow(cleveland_tidy)
cleveland_tidy |>
    group_by(num) |>
    summarize(
        count = n(),
        percentage = n() / num_obs * 100)

| num | count | percentage |
|---|---|---|
| `<fct>` | `<int>` | `<dbl>` |
| buff | 161 | 54.02685 |
| sick | 137 | 45.97315 |
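
The percentages above can be cross-checked from the counts alone. A minimal base R sketch that rebuilds the class column from the counts reported in the table (161 buff, 137 sick):

```r
# Rebuild a class vector from the counts reported in the table above
num <- factor(rep(c("buff", "sick"), times = c(161, 137)))

counts <- table(num)
percentages <- prop.table(counts) * 100

data.frame(num = names(counts),
           count = as.integer(counts),
           percentage = as.numeric(percentages))
```

With roughly 54% and 46% of the observations in the two classes, the imbalance is mild.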


In [6]:
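# center, scale and balance the heart disease data
# only numeric predictors can be scaled and centered, so the factor columns are left as they are
# step_upsample() comes from the themis package, which must be installed and loaded
cleveland_balance_recipe <- recipe(num ~ ., data = cleveland_tidy) |>
    step_scale(all_numeric_predictors()) |>
    step_center(all_numeric_predictors()) |>
    step_upsample(num, over_ratio = 1, skip = FALSE) |>
    prep()

preprocessed_cleveland <- bake(cleveland_balance_recipe, cleveland_tidy)
preprocessed_cleveland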

ERROR: Error in step_upsample(step_center(step_scale(recipe(num ~ ., data = cleveland_tidy), : could not find function "step_upsample"
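
For intuition about what the recipe steps are meant to do, here is a rough base R sketch of standardization and upsampling on made-up data. It is not the themis implementation; `upsample_minority` is a hypothetical helper that simply resamples rows of the smaller classes with replacement until every class is as large as the largest one.

```r
set.seed(2000)

# Hypothetical, deliberately imbalanced toy data (6 buff, 2 sick)
toy <- data.frame(
  age = c(63, 67, 37, 41, 56, 62, 57, 53),
  num = factor(c("buff", "buff", "buff", "buff", "buff", "sick", "sick", "buff"))
)

# Standardize: subtract the mean and divide by the standard deviation
toy$age <- as.numeric(scale(toy$age))

# Hypothetical helper: resample smaller classes (with replacement) up to the largest class size
upsample_minority <- function(data, class_col) {
  groups <- split(data, data[[class_col]])
  target <- max(vapply(groups, nrow, integer(1)))
  resampled <- lapply(groups, function(g) {
    extra <- sample(seq_len(nrow(g)), target - nrow(g), replace = TRUE)
    rbind(g, g[extra, , drop = FALSE])
  })
  do.call(rbind, resampled)
}

balanced <- upsample_minority(toy, "num")
table(balanced$num)
```

Upsampling duplicates existing rows rather than creating new information, so copies of the same observation can end up on both sides of a later train/test split if the balancing is done before splitting.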


In [None]:
# find the number and percentage of observations in each heart disease class
# to double check the class balance after preprocessing

num_obs <- nrow(preprocessed_cleveland)
preprocessed_cleveland |>
    group_by(num) |>
    summarize(
        count = n(),
        percentage = n() / num_obs * 100)

In [None]:
# create the TRAIN SET and TEST SET
set.seed(2000)

cleveland_split <- initial_split(preprocessed_cleveland, prop = 0.75, strata = num)
cleveland_train <- training(cleveland_split) 
cleveland_test <- testing(cleveland_split)
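
Conceptually, a stratified split samples the same proportion of rows from each class. A rough base R sketch of the idea on made-up data follows; this is not how rsample implements `initial_split()`, and `stratified_split` is a hypothetical helper.

```r
set.seed(2000)

# Hypothetical toy data: 40 "buff" rows and 20 "sick" rows
toy <- data.frame(
  id  = 1:60,
  num = factor(rep(c("buff", "sick"), times = c(40, 20)))
)

# Hypothetical helper: pick `prop` of the rows within each class for training
stratified_split <- function(data, class_col, prop) {
  idx_by_class <- split(seq_len(nrow(data)), data[[class_col]])
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = floor(prop * length(idx)))]
  }))
  list(train = data[train_idx, ], test = data[-train_idx, ])
}

parts <- stratified_split(toy, "num", prop = 0.75)
table(parts$train$num)
table(parts$test$num)
```

Because the sampling happens within each class, the training and test sets keep the same class proportions as the full data.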

## Preliminary Exploratory Data Analysis

In [None]:
# create tbl to compare the average_age for each stage of heart disease presence and arrange by average_age
exploration_tbl1 <- cleveland_train |>
    group_by(num) |>
    summarize(average_age = mean(age)) |>
    arrange(average_age)

exploration_tbl1

In [None]:
# create tbl to compare average_cholesterol for each stage of heart disease presence and arrange by average_cholesterol
exploration_tbl2 <- cleveland_train |>
    group_by(num) |>
    summarize(average_cholesterol = mean(chol)) |>
    arrange(average_cholesterol)

exploration_tbl2

In [None]:
# create tbl to compare average_max_heartrate for each stage of heart disease presence and arrange by average_max_heartrate
exploration_tbl3 <- cleveland_train |>
    group_by(num) |>
    summarize(average_max_heartrate = mean(thalach)) |>
    arrange(average_max_heartrate)

exploration_tbl3

In [None]:
# draw a scatter plot to visualize the relationship between the age and chol (serum cholesterol in mg/dl) predictors/variables

options(repr.plot.height = 6, repr.plot.width = 8)
exploration_plot1 <- cleveland_train |>
  ggplot(aes(x = age, y = chol, color = num)) +
  geom_point(alpha = 0.6) +
  labs(x = "Age (standardized)",
       y = "Serum Cholesterol (standardized)",
       color = "Presence of Heart Disease", 
       title = "Visualization of Serum Cholesterol vs Age") + 
  theme(text = element_text(size = 16))

exploration_plot1

In [None]:
# draw a scatter plot to visualize the relationship between the age and thalach (maximum heart rate achieved) predictors/variables

options(repr.plot.height = 6, repr.plot.width = 8)
exploration_plot2 <- cleveland_train |>
  ggplot(aes(x = age, y = thalach, color = num)) +
  geom_point(alpha = 0.6) +
  labs(x = "Age (standardized)",
       y = "Maximum Heart Rate Achieved (standardized)",
       color = "Presence of Heart Disease", 
       title = "Visualization of Maximum Heart Rate Achieved vs Age") + 
  theme(text = element_text(size = 16))

exploration_plot2

In [None]:
# draw a scatter plot to visualize the relationship between chol (serum cholesterol in mg/dl) and thalach (maximum heart rate achieved) predictors/variables

options(repr.plot.height = 6, repr.plot.width = 8)
exploration_plot3 <- cleveland_train |>
  ggplot(aes(x = chol, y = thalach, color = num)) +
  geom_point(alpha = 0.6) +
  labs(x = "Serum Cholesterol (standardized)",
       y = "Maximum Heart Rate Achieved (standardized)",
       color = "Presence of Heart Disease", 
       title = "Visualization of Maximum Heart Rate Achieved vs Serum Cholesterol") + 
  theme(text = element_text(size = 16))

exploration_plot3

## Exploration Graph Analysis
Our tables support the analysis needed to answer our question, as they show the possible relationship between each predictor and our class. One table in particular clearly shows a positive relationship between the predictor age and num, the presence of heart disease.

Our plots visually portray the relationships among the predictors; however, no clear relationship is observed in them.

To conclude, the tables and plots above do not show any obvious relationship other than the one between age and the presence of heart disease (num): a moderately positive linear relationship, in which an increase in age corresponds to an increase in the presence of heart disease.

# Data Analysis

In [None]:
# find the best k neighbour value to use with V-fold cross validation
cleveland_vfold <- vfold_cv(cleveland_train, v = 5, strata = num)

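# scale and center only the numeric predictors; factor columns cannot be standardized
cleveland_recipe <- recipe(num ~ ., data = cleveland_train) |>
                    step_scale(all_numeric_predictors()) |>
                    step_center(all_numeric_predictors())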

cleveland_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
            set_engine("kknn") |>
            set_mode("classification")

gridvals <- tibble(neighbors = seq(from = 1, to = 80, by = 5))

set.seed(2000)

cleveland_results <- workflow() |>
            add_recipe(cleveland_recipe) |>
            add_model(cleveland_spec) |>
            tune_grid(resamples = cleveland_vfold, grid = gridvals) |>
            collect_metrics() |>
            filter(.metric == "accuracy")
cleveland_results

# plot k values against their respective accuracies and choose optimal k value
cross_val_plot <- cleveland_results |> 
    ggplot(aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate") +
    theme(text = element_text(size = 20))
cross_val_plot
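
To make the tuning step less of a black box, the sketch below hand-rolls K-nearest neighbors and 5-fold cross-validation in base R on made-up, well-separated data. It illustrates the idea only and is not how kknn or tune work internally; `knn_predict` and `cv_accuracy` are hypothetical helper names.

```r
# Hypothetical toy data: two well-separated groups on a single predictor
toy <- data.frame(
  x   = c((1:10) / 10, 10 + (1:10) / 10),
  num = factor(rep(c("buff", "sick"), each = 10))
)

# Hypothetical helper: majority vote among the k nearest training points
knn_predict <- function(train_x, train_y, new_x, k) {
  vapply(new_x, function(x0) {
    nearest <- order(abs(train_x - x0))[seq_len(k)]
    names(which.max(table(train_y[nearest])))
  }, character(1))
}

# Hypothetical helper: mean accuracy of k-NN over v folds
# (folds are assigned in a fixed round-robin order here; in practice they would be shuffled)
cv_accuracy <- function(data, k, v = 5) {
  folds <- rep(seq_len(v), length.out = nrow(data))
  fold_acc <- vapply(seq_len(v), function(f) {
    train <- data[folds != f, ]
    valid <- data[folds == f, ]
    pred <- knn_predict(train$x, train$num, valid$x, k)
    mean(pred == as.character(valid$num))
  }, numeric(1))
  mean(fold_acc)
}

# Estimate accuracy for a few candidate values of k and pick the best
k_values <- c(1, 3, 5)
accuracies <- vapply(k_values, function(k) cv_accuracy(toy, k), numeric(1))
best_k <- k_values[which.max(accuracies)]
data.frame(neighbors = k_values, mean_accuracy = accuracies)
```

Note that `which.max()` returns the first maximum, so ties are resolved in favor of the smallest k; on real data it is worth looking at the whole accuracy curve rather than only the single best value.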

In [None]:
# compare the accuracy of predictions to the true values in the test set
cleveland_best_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 1) |>
  set_engine("kknn") |>
  set_mode("classification")

cleveland_fit <- workflow() |>
  add_recipe(cleveland_recipe) |>
  add_model(cleveland_best_spec) |>
  fit(data = cleveland_train)

cleveland_predictions <- predict(cleveland_fit, cleveland_test) |> 
    bind_cols(cleveland_test)

cleveland_acc <- cleveland_predictions |> 
    metrics(truth = num, estimate = .pred_class) |> 
    filter(.metric == "accuracy") |> 
    select(.metric, .estimate)

cleveland_acc
cleveland_cm <- cleveland_predictions |> 
    conf_mat(truth = num, estimate = .pred_class)
cleveland_cm
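
For reference, accuracy and the confusion matrix can also be computed directly with base R. A minimal sketch on made-up predictions (the vectors below are hypothetical, not results from our model):

```r
# Hypothetical true labels and predictions
truth    <- factor(c("buff", "buff", "sick", "sick", "buff", "sick"),
                   levels = c("buff", "sick"))
estimate <- factor(c("buff", "sick", "sick", "sick", "buff", "buff"),
                   levels = c("buff", "sick"))

# Rows are predictions, columns are the true classes
confusion <- table(Prediction = estimate, Truth = truth)
confusion

# Accuracy is the share of observations on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```

The off-diagonal cells show which kind of mistake the classifier makes, which matters here because missing a sick patient is usually more costly than a false alarm.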

### Works Cited
Marateb HR, Goudarzi S. A noninvasive method for coronary artery diseases diagnosis using a clinically-interpretable fuzzy rule-based system. J Res Med Sci. 2015 Mar;20(3):214-23. PMID: 26109965; PMCID: PMC4468223.