# Title: Project Proposal

## Introduction:

### Background Information:
Cardiovascular diseases remain the leading cause of death globally. They encompass many different heart conditions, each characterized by several variables, which makes it difficult to determine whether an individual with a particular set of conditions has heart disease. For this reason, data sets and classification algorithms are valuable tools for physicians, researchers, and other healthcare workers when diagnosing heart disease.

### Research Question:
We attempt to answer the following question using a heart disease data set from Cleveland, Ohio: given a patient’s age, resting blood pressure, and serum cholesterol, will the heart disease diagnosis be classified as absent (FALSE) or present (TRUE)?

### Data set description:
The data set dates from the 1980s and describes 14 variables for patients who may have heart disease. We will use the patient's age, resting blood pressure, and serum cholesterol level to build a classifier that answers the question posed above. We consider these variables to be among the most informative for determining the presence or absence of heart disease in a patient.

In [None]:
library(tidyverse)
library(repr)
library(scales)
library(tidymodels)

In [None]:
# The file has no header row, so column names are supplied here; "?" marks a missing value
URL <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

processed_cleveland <- read_csv(URL,
    col_names = c("Age", "Sex", "Chest_Pain_Type", "Resting_Blood_Pressure", "Serum_Cholestrol", "Fasting_Blood_Sugar", "Resting_Electrocadriographic_Results", "Maximum_Heart_Rate_Achieved", "Excercise_Induced_Angina", "ST_Depression_Induced", "Slope_of_Peak_Exercise_ST_Segment", "#_of_Major_Vessels", "Defects", "Diagnosis_of_Heart_Disease"),
    na = "?")
processed_cleveland

In [None]:
#select the variables that we need to make a model
cleveland_heart <- processed_cleveland |>
    select("Age","Resting_Blood_Pressure","Serum_Cholestrol","Diagnosis_of_Heart_Disease")
head(cleveland_heart)

In [None]:
# Recode Diagnosis_of_Heart_Disease (0 to 4) as a logical: FALSE for 0 (absent), TRUE otherwise (present)
scaled_cleveland_heart <- cleveland_heart |>
    mutate(Diagnosis = Diagnosis_of_Heart_Disease != 0)

scaled_cleveland_heart

In [None]:
# Select the variables needed to build the model and convert Diagnosis to a factor
cleveland_heart_cleaned <- scaled_cleveland_heart |>
    select(Age, Resting_Blood_Pressure, Serum_Cholestrol, Diagnosis) |>
    mutate(Diagnosis = as_factor(Diagnosis))

head(cleveland_heart_cleaned)
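
The order of the factor levels matters later, because legend labels in a plot are assigned to levels in order. The sketch below uses made-up values and base R's `factor()` to show that a logical vector gets the levels `FALSE`, then `TRUE`; it assumes `as_factor()` orders a logical column the same way, which is worth confirming with `levels()` on the real data.

```r
# Hypothetical logical diagnosis values, for illustration only
diagnosis <- c(TRUE, FALSE, FALSE, TRUE, TRUE)

# factor() orders the levels of a logical vector as FALSE, then TRUE
diagnosis_factor <- factor(diagnosis)
levels(diagnosis_factor)

# Labels supplied in level order: FALSE is "Absent", TRUE is "Present"
legend_labels <- setNames(c("Absent", "Present"), levels(diagnosis_factor))
legend_labels
```

Any vector of legend labels therefore needs to list the label for `FALSE` (absent) first.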

In [None]:
# Visualize whether the predictors show any relationship with the diagnosis
perim_heart <- cleveland_heart_cleaned |>
  ggplot(aes(x = Resting_Blood_Pressure, y = Serum_Cholestrol, color = Diagnosis)) +
  geom_point(alpha = 0.5) +
  labs(color = "Diagnosis") +
  # Factor levels are ordered FALSE, TRUE, so the labels are Absent, Present
  scale_color_manual(labels = c("Absent", "Present"), 
                     values = c("orange2", "steelblue2")) + 
  theme(text = element_text(size = 20))

perim_heart

In [None]:
perim_heart2 <- cleveland_heart_cleaned |>
  ggplot(aes(x = Resting_Blood_Pressure, y = Age, color = Diagnosis)) +
  geom_point(alpha = 0.5) +
  labs(color = "Diagnosis") +
  # Factor levels are ordered FALSE, TRUE, so the labels are Absent, Present
  scale_color_manual(labels = c("Absent", "Present"), 
                     values = c("orange2", "steelblue2")) + 
  theme(text = element_text(size = 20))

perim_heart2

In [None]:
perim_heart3 <- cleveland_heart_cleaned |>
  ggplot(aes(x = Serum_Cholestrol, y = Age, color = Diagnosis)) +
  geom_point(alpha = 0.5) +
  labs(color = "Diagnosis") +
  # Factor levels are ordered FALSE, TRUE, so the labels are Absent, Present
  scale_color_manual(labels = c("Absent", "Present"), 
                     values = c("orange2", "steelblue2")) + 
  theme(text = element_text(size = 20))

perim_heart3

In [None]:
# Count and percentage of patients in each diagnosis class
sample_proportions <- cleveland_heart_cleaned |>
                      group_by(Diagnosis) |>
                      summarize(n = n()) |>
                      mutate(percent = 100*n/nrow(cleveland_heart_cleaned))

sample_proportions
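
Class proportions like these give a useful reference point: a classifier that always predicts the larger class is correct exactly as often as that class occurs. The sketch below works through the arithmetic with hypothetical counts (they are not taken from the Cleveland data).

```r
# Hypothetical class counts, for illustration only (not taken from the data set)
class_counts <- c(Absent = 60, Present = 40)

# Percentage of patients in each class, mirroring 100 * n / nrow(...)
class_percent <- 100 * class_counts / sum(class_counts)
class_percent

# A classifier that always predicts the larger class is right this often
majority_class <- names(class_counts)[which.max(class_counts)]
baseline_accuracy <- max(class_counts) / sum(class_counts)
baseline_accuracy
```

A fitted model is only informative if its accuracy on unseen data clearly exceeds this baseline.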


In [None]:
# Mean resting blood pressure and serum cholesterol at each age, to get a sense of any visible trends
cleveland_heart_mean <- cleveland_heart_cleaned |>
    group_by(Age) |>
    summarize(Resting_Blood_Pressure = mean(Resting_Blood_Pressure),
             Serum_Cholestrol = mean(Serum_Cholestrol))
head(cleveland_heart_mean)

In [None]:
set.seed(2023)

heart_split <- initial_split(cleveland_heart_cleaned, prop = 0.75, strata = Diagnosis)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

glimpse(heart_train)
glimpse(heart_test)
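
To illustrate what stratifying by `Diagnosis` is meant to achieve, here is a minimal base R sketch on made-up data. It samples 75% of the rows within each class separately, so the training set keeps the class proportions of the full data. This is a hand-rolled illustration under stated assumptions (the `toy` data frame and its columns are hypothetical), not a description of how `initial_split()` works internally.

```r
set.seed(2023)

# Hypothetical toy data: 60 patients without and 40 with heart disease
toy <- data.frame(
  id = 1:100,
  Diagnosis = factor(rep(c("FALSE", "TRUE"), times = c(60, 40)))
)

# Sample 75% of the row indices within each diagnosis class separately
rows_by_class <- split(seq_len(nrow(toy)), toy$Diagnosis)
train_rows <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), size = round(0.75 * length(rows)))]
}), use.names = FALSE)

toy_train <- toy[train_rows, ]
toy_test <- toy[-train_rows, ]

# Class proportions in the full data and in the training set
prop.table(table(toy$Diagnosis))
prop.table(table(toy_train$Diagnosis))
```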


In [None]:
training_set_proportions <- heart_train |>
                      group_by(Diagnosis) |>
                      summarize(n = n()) |>
                      mutate(percent = 100*n/nrow(heart_train))

training_set_proportions

In [None]:
k_vals <- tibble(neighbors = seq(from = 1, to = 100, by = 5))
options(repr.plot.height = 5, repr.plot.width = 6)

knn_tune <- nearest_neighbor(weight_func = "rectangular", 
                            neighbors = tune()) |>
                            set_engine("kknn") |>
                            set_mode("classification")

diagnosis_recipe <- recipe(Diagnosis ~ ., data = heart_train) |>
                                step_scale(all_predictors()) |>
                                step_center(all_predictors())

diagnosis_vfold <- vfold_cv(heart_train, v = 5, strata = Diagnosis)

diagnosis_analysis <- workflow() |>
                    add_recipe(diagnosis_recipe) |>
                    add_model(knn_tune) |>
                    tune_grid(resamples = diagnosis_vfold, grid = k_vals)

accuracies <- diagnosis_analysis |>
                collect_metrics() |>
                filter(.metric == "accuracy")

cross_val_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
                    geom_point() +
                    geom_line() +
                    labs(x = "Neighbors", y = "Accuracy Estimate") + 
                    theme(text = element_text(size = 12))

cross_val_plot
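
The recipe scales and then centers every predictor so that no variable dominates the distance calculation merely because of its units. The sketch below reproduces the arithmetic by hand on hypothetical blood pressure values, and checks that scaling followed by centering gives the same result as the usual z-score.

```r
# Hypothetical resting blood pressure values (mm Hg), for illustration only
blood_pressure <- c(120, 130, 140, 150, 160)

# Scale: divide by the standard deviation; then center: subtract the mean
scaled <- blood_pressure / sd(blood_pressure)
standardized <- scaled - mean(scaled)

# The standardized values have mean 0 and standard deviation 1
mean(standardized)
sd(standardized)

# Same result as the usual z-score formula
z_score <- (blood_pressure - mean(blood_pressure)) / sd(blood_pressure)
all.equal(standardized, z_score)
```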

In [None]:
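# Use the k with the highest cross-validated accuracy from the tuning grid
best_k <- accuracies |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
  set_engine("kknn") |>
  set_mode("classification")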

knn_fit <- workflow() |>
  add_recipe(diagnosis_recipe) |>
  add_model(knn_spec) |>
  fit(data = heart_train)

diagnosis_test_predictions <- predict(knn_fit, heart_test) |>
  bind_cols(heart_test)

head(diagnosis_test_predictions)

In [None]:
diagnosis_test_predictions |>
  metrics(truth = Diagnosis, estimate = .pred_class) |>
  filter(.metric == "accuracy")

confusion <- diagnosis_test_predictions |>
             conf_mat(truth = Diagnosis, estimate = .pred_class)
confusion
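
A confusion matrix supports more than accuracy. The sketch below uses hypothetical counts, laid out with predictions in rows and true values in columns, to compute accuracy, precision, and recall by hand. It treats `TRUE` (disease present) as the positive class; that is an assumption of this example, since yardstick's metric functions treat the first factor level as the event by default.

```r
# Hypothetical counts, for illustration only: rows are predictions, columns are truth
cm <- matrix(
  c(30, 11,
    12, 23),
  nrow = 2, byrow = TRUE,
  dimnames = list(Prediction = c("FALSE", "TRUE"), Truth = c("FALSE", "TRUE"))
)

# Treat TRUE (disease present) as the positive class
true_pos  <- cm["TRUE", "TRUE"]
false_pos <- cm["TRUE", "FALSE"]
false_neg <- cm["FALSE", "TRUE"]
true_neg  <- cm["FALSE", "FALSE"]

accuracy  <- (true_pos + true_neg) / sum(cm)     # share of all predictions that are correct
precision <- true_pos / (true_pos + false_pos)   # share of predicted positives that are real
recall    <- true_pos / (true_pos + false_neg)   # share of real positives that are found

c(accuracy = accuracy, precision = precision, recall = recall)
```

In a diagnostic setting, recall is often of particular interest, because a missed case of heart disease is usually more costly than a false alarm.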

### Methods:
We will use age, resting blood pressure, and serum cholesterol as predictors and compare each of them against the heart disease diagnosis, in order to determine whether they may be contributing factors. One way to visualize this would be to plot a histogram of each variable, split by diagnosis, to see whether increasing blood pressure, for example, is associated with a positive diagnosis.

1. Split the data set into training and testing sets
2. Determine which k to use: estimate the accuracy for different k values with cross-validation on the training set, and plot accuracy against k
3. With the best value of k, fit a K-nearest neighbours model (model specification plus recipe) on the training set
4. Use the model to predict the diagnosis for the patients in the testing set
5. Test the model using the testing data
6. Compare the predicted diagnoses with the true diagnoses in the testing set
7. Fit a multivariable linear regression to see whether we are able to predict the diagnosis based on the variables
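
The core of steps 3 and 4 can be sketched by hand in base R: standardize the predictors, measure the distance from a new patient to every training patient, and take a majority vote among the k nearest. All values below are made up for illustration, and the sketch uses only two predictors and k = 3; it is not the kknn implementation.

```r
# Hypothetical training patients, for illustration only
train <- data.frame(
  Age = c(45, 50, 62, 58, 39, 66),
  Resting_Blood_Pressure = c(120, 130, 150, 140, 118, 160),
  Diagnosis = factor(c("FALSE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE"))
)
new_patient <- c(Age = 60, Resting_Blood_Pressure = 145)
predictors <- c("Age", "Resting_Blood_Pressure")

# Standardize using the training means and standard deviations
centers <- colMeans(train[predictors])
spreads <- sapply(train[predictors], sd)
train_z <- scale(train[predictors], center = centers, scale = spreads)
new_z <- (new_patient[predictors] - centers) / spreads

# Euclidean distance from the new patient to every training patient
distances <- sqrt(rowSums(sweep(train_z, 2, new_z)^2))

# Majority vote among the k nearest neighbours
k <- 3
nearest <- order(distances)[seq_len(k)]
votes <- table(train$Diagnosis[nearest])
prediction <- names(votes)[which.max(votes)]
prediction
```

An odd k avoids tied votes when there are only two classes.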

## Expected outcomes and significance:
#### What do you expect to find?
- We expect age, resting blood pressure, and serum cholesterol level to each be positively correlated with the risk of heart disease.
#### What impact could such findings have?
- If, for example, the presence of heart disease rises sharply after a certain age, then people near that age could be advised to monitor their heart health closely and receive regular checkups.
#### What future questions could this lead to?
- If a significant correlation appears between a variable and the presence of heart disease, then the next step would be further research to examine the underlying biological cause of this observed trend.

### References:
- National Center for Chronic Disease Prevention and Health Promotion, Division for Heart Disease and Stroke Prevention. (2023, Feb. 24). Heart Disease. cdc.gov. https://www.cdc.gov/heartdisease/index.htm
- World Health Organization. (2021, June 11). Cardiovascular diseases (CVDs). World Health Organization. https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)