# Project Report

Quang Duy Do, Jingjing Li, Wendy Li, Lucia Lu

## Introduction

The term “heart disease” refers to several types of heart conditions. The most common type in the United States is coronary artery disease (CAD), which affects the blood flow to the heart (Heart Disease Resources | Cdc.gov, 2023). Decreased blood flow can cause a heart attack.
Many factors contribute to heart disease, including blood pressure and cholesterol levels (Know Your Risk for Heart Disease | cdc.gov, 2023). Age may also influence the risk of heart disease. Heart disease is often noticed only once symptoms occur, but everyday warning signs, such as angina during exercise, can help detect it early (Professional, n.d.).

Too much LDL cholesterol in the blood causes plaque to build up in the arteries. The buildup restricts blood flow and leads to heart and blood vessel conditions. The LDL cholesterol level should be less than 130 mg/dL (3.4 mmol/L) (Blood Tests for Heart Disease, 2023).

Angina pain happens when the heart muscle does not get as much oxygen-rich blood as it needs. An angina event does not cause permanent damage to the heart. However, angina may turn into a heart attack if the cells in the heart go without enough oxygen for too long and start to die (Causes and Risk Factors | NHLBI, NIH, 2023).

This project aims to predict a patient's likelihood of developing heart disease using factors such as age, cholesterol levels, and whether or not angina occurs with exercise.

We use the `processed.cleveland.data` file from the Heart Disease Database (originally collected at the Cleveland Clinic Foundation) to predict whether a patient from Cleveland has heart disease. The columns are as follows:

1. **age**: age in years
2. **sex**: sex (1 = male, 0 = female)
3. **cp**: chest pain type
4. **trestbps**: resting blood pressure in mmHg
5. **chol**: serum cholesterol in mg/dl
6. **fbs**: fasting blood sugar > 120 mg/dl (1 = True, 0 = False)
7. **restecg**: resting electrocardiographic results
8. **thalach**: maximum heart rate achieved
9. **exang**: exercise-induced angina (1 = True, 0 = False)
10. **oldpeak**: ST depression induced by exercise, relative to rest
11. **slope**: the slope of the peak exercise ST segment (1 = upslope, 2 = flat, 3 = downslope)
12. **ca**: number of major vessels (0-3) coloured by fluoroscopy
13. **thal**: 3 = normal, 6 = fixed defect, 7 = reversible defect
14. **num**: diagnosis of heart disease (1, 2, 3, 4 = presence, 0 = no presence)

Based on the list above, we will use `age` and `chol` as predictors to classify whether or not a patient has heart disease.

## Method and Results

### Loading in dataset and Wrangling

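We load the required libraries, read the processed Cleveland data directly from the UCI Machine Learning Repository, assign the column names listed above (the raw file has no header row), and convert `sex` to a factor.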

#### Importing Libraries and Setting Graph Format

In [6]:
# importing libraries
library(tidyverse)
library(tidymodels)
library(repr)
library(RColorBrewer)

# formatting graphs
options(repr.plot.width = 12, repr.plot.height = 6)

#### Importing dataset

In [7]:
cleveland_data <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
                          col_names = FALSE)

head(cleveland_data)

nrow(cleveland_data)

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


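| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |
| 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0.0 | 3.0 | 0 |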
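
The `chr` type that `read_csv` reports for `X12` and `X13` suggests that these columns contain non-numeric entries; in the UCI file, missing values are coded as `?`. The following is a minimal sketch, on two illustrative rows rather than the real file, of how the `na` argument of `read_csv` could be used to parse such entries as `NA` so that the columns come back numeric. The inline rows and the names `toy_text` and `toy_data` are assumptions made for this example only.

```r
library(readr)

# two illustrative rows in the same 14-column layout; "?" marks a missing value
toy_text <- "63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0\n50,0,2,120,210,0,0,160,0,1.0,1,?,3.0,0\n"

# I() tells read_csv to treat the string as literal data, not as a file path
toy_data <- read_csv(I(toy_text), col_names = FALSE, na = "?",
                     show_col_types = FALSE)

# X12 is now parsed as a number, with the "?" entry turned into NA
is.numeric(toy_data$X12)
sum(is.na(toy_data$X12))
```

Since this report only uses `age` and `chol`, the missing entries in columns 12 and 13 do not affect the analysis that follows.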


In [8]:
cleveland_clean <- cleveland_data

# Adding column names:
# 1. age
# 2. sex
# 3. cp
# 4. trestbps
# 5. chol
# 6. fbs
# 7. restecg
# 8. thalach
# 9. exang
# 10. oldpeak
# 11. slope
# 12. ca
# 13. thal
# 14. num

colnames(cleveland_clean) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", 
                               "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

cleveland_clean <- cleveland_clean |>
    mutate(sex = as.factor(sex)) # sex is a dummy variable: male (1) or female (0)

head(cleveland_clean)

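| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |
| 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0.0 | 3.0 | 0 |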


### Class imbalance and Upsampling

The response variable `num` is imbalanced: 54% of the observations have the value 0 (no heart disease), and the remaining 46% are spread across the levels 1 to 4. We therefore upsample the less frequent classes of `num` so that the classifier is not biased toward the majority class.
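
To make the idea concrete, here is a minimal sketch on made-up data of how class proportions can be checked and what upsampling with `over_ratio = 1` does to them. The tibble `toy` and its 60/40 class split are hypothetical and are not taken from the Cleveland data; the sketch assumes the `themis` package is already installed.

```r
library(dplyr)
library(recipes)
library(themis)

set.seed(1)

# hypothetical toy data: 60 observations of class 0 and 40 of class 1
toy <- tibble(
    num  = factor(c(rep(0, 60), rep(1, 40))),
    age  = round(rnorm(100, mean = 55, sd = 9)),
    chol = round(rnorm(100, mean = 245, sd = 50))
)

# class proportions before upsampling
toy |> count(num) |> mutate(prop = n / sum(n))

# resample the minority class with replacement until it matches the majority
toy_upsampled <- recipe(num ~ age + chol, data = toy) |>
    step_upsample(num, over_ratio = 1) |>
    prep() |>
    bake(new_data = NULL)

# class counts after upsampling
toy_counts <- toy_upsampled |> count(num)
toy_counts
```

After upsampling, both classes in the toy data have the same number of rows, because rows of the smaller class are duplicated at random.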

In [18]:
# select only the variables that we are interested in
cleveland_select <- cleveland_clean |>
    select(num, age, chol) |>
    mutate(num = as.factor(num))

# set the seed to 3456 to make our report reproducible
set.seed(3456)

# load the `themis` R package, installing it first only if it is missing
if (!requireNamespace("themis", quietly = TRUE)) install.packages("themis")
library(themis)
# construct a recipe that upsamples our num variable
recipe_upsample <- recipe(num ~ age + chol, data = cleveland_select) |>
    step_upsample(num, over_ratio = 1, skip = FALSE) |>
    prep()

recipe_upsample

── Recipe ──────────────────────────────────────────────────────────────────────

── Inputs 

Number of variables by role

outcome:   1
predictor: 2

── Training information 

Training data contained 303 data points and no incomplete rows.

── Operations 

• Up-sampling based on: num | Trained



### Splitting the data

In [None]:
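# hold out 25% of the data for testing; stratify on num so that the
# training and testing sets keep similar class proportions
cleveland_split <- initial_split(cleveland_select, prop = 0.75, strata = num)
cleveland_train <- training(cleveland_split)
cleveland_test <- testing(cleveland_split)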
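
As a rough illustration of what `strata` does, the sketch below splits a made-up data set and compares the class proportions in the two parts. The tibble `toy` and its 60/40 class split are hypothetical; only `initial_split`, `training`, and `testing` are the same functions used above.

```r
library(rsample)
library(dplyr)

set.seed(1)

# hypothetical toy data with a 60/40 class split
toy <- tibble(
    num = factor(c(rep(0, 120), rep(1, 80))),
    age = round(rnorm(200, mean = 55, sd = 9))
)

# 75% training, 25% testing, sampled separately within each class
toy_split <- initial_split(toy, prop = 0.75, strata = num)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# class proportions in each part should both be close to 60/40
toy_train |> count(num) |> mutate(prop = n / sum(n))
toy_test  |> count(num) |> mutate(prop = n / sum(n))
```

Without `strata`, a small or unlucky split could leave one part with noticeably fewer cases of a class than the other.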

### Preprocess the data

In [None]:
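# extend the upsampling recipe so that both predictors are standardized
# (scaled to unit standard deviation, then centered at zero)
cleveland_recipe <- recipe_upsample |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())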
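
Scaling and centering matter when predictors sit on very different scales, as `age` and `chol` do. The sketch below, on six made-up rows, applies the same two steps and checks the result; the tibble `toy` is hypothetical and the recipe here is built from scratch rather than from `recipe_upsample`.

```r
library(dplyr)
library(recipes)

# hypothetical toy data
toy <- tibble(
    num  = factor(c(0, 1, 0, 1, 0, 1)),
    age  = c(41, 63, 37, 67, 56, 58),
    chol = c(204, 233, 250, 286, 236, 229)
)

# divide each predictor by its standard deviation, then subtract its mean
toy_scaled <- recipe(num ~ age + chol, data = toy) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors()) |>
    prep() |>
    bake(new_data = NULL)

# each predictor should now have mean ~0 and standard deviation 1
toy_scaled |> summarise(across(c(age, chol), list(mean = mean, sd = sd)))
```

With both predictors on a common scale, neither one dominates a model merely because of its units.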

### Train the classifier