# Considering serum cholesterol levels and resting blood pressure, how likely is an individual to experience exercise-induced angina?

Angina is a common symptom of cardiovascular disease. It occurs when parts of the heart are not supplied with sufficient oxygen-rich blood as a result of plaque build-up in the arteries, known as atherosclerosis (NHLBI, 2023). Atherosclerosis itself is usually caused by high levels of cholesterol that accumulate in the artery walls and trigger an immune response: white blood cells target these deposits but get stuck in the areas they attempt to clear, forming a growing lump that stays permanently in place (Harvard Health, 2021). As a result, fewer oxygen-carrying red blood cells can pass through the artery at any one time. Exercise can aggravate this because the heart rate rises to supply more oxygen to the cells that need it, especially cardiac cells, and the resulting chest pain is known as exercise-induced angina. The combination of higher oxygen demand and restricted supply through the narrowed artery leads to the death of cardiac cells that cannot receive the oxygen they need (Heart and Stroke Foundation of Canada, 2023). We therefore aim to build a model that predicts the presence of exercise-induced angina in an individual from their resting blood pressure, measured in mm Hg, and their serum cholesterol level, measured in mg/dL.

Our research question is as follows: **Based on resting blood pressure and serum cholesterol levels, is the patient likely to experience exercise induced angina? How accurate is this prediction?**

We are using a 1988 heart disease dataset that contains 17 variables related to cardiovascular health, including patients' age, sex, chest pain, blood pressure, cholesterol, blood sugar, ECG results, and more. The dataset has 1025 rows of data drawn from Cleveland, Hungary, Switzerland, and Long Beach V. It offers a wide range of variables to compare and a large number of samples.

## Preliminary Exploratory Data Analysis

In [None]:
# First, import data into R.
library(tidyverse)
library(tidymodels)
set.seed(1)
heart_data <- read_csv("https://raw.githubusercontent.com/L-Wendi/dsci-100-project/main/heart.csv")

Our dataset comes from Kaggle (https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset) and is based on https://archive.ics.uci.edu/dataset/45/heart+disease, pre-processed for easier analysis. The author of the Kaggle dataset mutated several fields to improve readability. We further mutate the data frame to turn meaningless categorical numbers into meaningful string factors. Definitions of the original column names can be found at https://archive.ics.uci.edu/dataset/45/heart+disease, and the columns of the Kaggle dataset are described on the Kaggle page.

To answer our predictive question, "**Based on resting blood pressure and serum cholesterol levels, is the patient likely to experience exercise induced angina? How accurate is this prediction?**", we will use the columns `trestbps` (resting blood pressure, in mm Hg) and `chol` (serum cholesterol, in mg/dL) as predictors and the column `exang` (exercise-induced angina) as the class.

In [None]:
heart_data <- heart_data |>
    mutate(sex = fct_recode(as_factor(sex), "Male" = "1", "Female" = "0")) |>
    mutate(exang = fct_recode(as_factor(exang), "Yes" = "1", "No" = "0")) |>
    mutate(target = fct_recode(as_factor(target), "No Disease" = "0", "Disease" = "1"))

For the three columns of interest (`trestbps`, `chol`, and `exang`), we summarize the predictors by class using only the training data:

In [None]:
# Split into training (75%) and testing (25%) sets, stratified by exang.
heart_split <- initial_split(heart_data, prop = 0.75, strata = exang)
heart_training <- training(heart_split)
heart_testing <- testing(heart_split)

# Summarize each predictor by class, using the training data only.
heart_predictors_info <- heart_training |>
    select(exang, trestbps, chol) |>
    group_by(exang) |>
    summarize(count = n(),
              mean_trestbps = mean(trestbps), max_trestbps = max(trestbps), min_trestbps = min(trestbps),
              mean_chol = mean(chol), max_chol = max(chol), min_chol = min(chol))
heart_predictors_info

In the preliminary analysis, we also draw a scatter plot of resting blood pressure against serum cholesterol, colored by whether exercise-induced angina is present. The plot is based solely on the training data.

In [None]:
scatter_plot <- heart_training |>
    ggplot(aes(x = trestbps, y = chol, color = exang)) +
    geom_point() +
    labs(x = "Resting blood pressure (mm Hg)", y = "Serum cholesterol (mg/dL)", color = "Presence of Angina") +
    scale_color_manual(values = c("red", "blue"))
scatter_plot

## Methods

To answer our predictive question, we will use K-nearest neighbors to build a classifier that predicts whether a patient experiences exercise-induced angina. The predictors will be resting blood pressure (column `trestbps`) and serum cholesterol level (column `chol`). The response variable will be the categorical variable for exercise-induced angina (column `exang`), originally coded as 1 for yes and 0 for no and recoded above to `Yes` and `No`.
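One possible way to set up this classifier with `tidymodels` is sketched below. It is a sketch under assumptions, not a final analysis: the helper name `tune_knn`, the grid of candidate K values, and the number of cross-validation folds are illustrative choices that do not come from the proposal, and the `kknn` package is assumed to be installed as the model engine. The predictors are scaled and centered because K-nearest neighbors relies on distances, so variables on different scales would otherwise contribute unequally.

```r
library(tidymodels)

# Hypothetical helper: cross-validated accuracy of a K-NN classifier
# for several candidate values of K. `k_values` and `folds` are
# placeholder defaults, not values taken from the proposal.
tune_knn <- function(train_data, k_values = seq(1, 51, by = 5), folds = 5) {
    # Standardize both predictors so neither dominates the distance.
    knn_recipe <- recipe(exang ~ trestbps + chol, data = train_data) |>
        step_scale(all_predictors()) |>
        step_center(all_predictors())

    # K-NN specification with the number of neighbors left to tune.
    knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
        set_engine("kknn") |>
        set_mode("classification")

    # Estimate accuracy for each K with stratified cross-validation.
    workflow() |>
        add_recipe(knn_recipe) |>
        add_model(knn_spec) |>
        tune_grid(resamples = vfold_cv(train_data, v = folds, strata = exang),
                  grid = tibble(neighbors = k_values),
                  metrics = metric_set(accuracy)) |>
        collect_metrics() |>
        select(neighbors, mean, std_err) |>
        arrange(desc(mean))
}
```

Calling `tune_knn(heart_training)` would return one row per candidate K, sorted by estimated accuracy, from which a value of K could be chosen before touching the testing set.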

To visualize the results, we can draw a scatter plot of standardized resting blood pressure against standardized serum cholesterol and color the points by whether the patient has exercise-induced angina. We can mark the predicted values differently (for example, with a circle or a background color) and compare them against the true classification. Besides graphs, we will also build a confusion matrix to compare the results and calculate accuracy, precision, and recall.
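The evaluation step could look roughly like the following sketch, which covers only the confusion matrix and the three metrics, not the plot. The helper name `evaluate_knn` and its arguments are hypothetical, the `kknn` engine is assumed to be installed, and the code assumes that `Yes` is the second level of `exang`, so that precision and recall are reported for the angina class.

```r
library(tidymodels)

# Hypothetical helper: fit K-NN with a fixed K on the training data,
# predict on the testing data, and summarize the predictions.
evaluate_knn <- function(train_data, test_data, k) {
    knn_recipe <- recipe(exang ~ trestbps + chol, data = train_data) |>
        step_scale(all_predictors()) |>
        step_center(all_predictors())

    knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k) |>
        set_engine("kknn") |>
        set_mode("classification")

    knn_fit <- workflow() |>
        add_recipe(knn_recipe) |>
        add_model(knn_spec) |>
        fit(data = train_data)

    # Attach the predicted class (.pred_class) to the true labels.
    predictions <- predict(knn_fit, test_data) |>
        bind_cols(test_data)

    list(
        confusion = conf_mat(predictions, truth = exang, estimate = .pred_class),
        metrics = bind_rows(
            accuracy(predictions, truth = exang, estimate = .pred_class),
            # Assumption: "Yes" is the second factor level of exang.
            precision(predictions, truth = exang, estimate = .pred_class,
                      event_level = "second"),
            recall(predictions, truth = exang, estimate = .pred_class,
                   event_level = "second")
        )
    )
}
```

For example, `evaluate_knn(heart_training, heart_testing, k = 5)` would return both pieces, where `k = 5` is only a placeholder for a value chosen by cross-validation on the training data.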

## Expected Outcomes and Significance

Based on our knowledge of the cardiovascular system, there is reason to believe that resting blood pressure and cholesterol levels will have an overall positive relationship, as the build-up of cholesterol is often a cause of higher blood pressure. Because angina occurs as a result of atherosclerosis, we expect cases of angina to become more frequent as either variable increases.
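These expectations can be checked descriptively on the training data before any model is fit. The sketch below is one way to do so; the helper names `predictor_correlation` and `angina_rate_by_quartile` are hypothetical, and splitting a predictor into quartiles is an arbitrary illustrative choice rather than part of the proposed method.

```r
library(tidyverse)

# Pearson correlation between the two predictors (between -1 and 1);
# a positive value would be consistent with the expected relationship.
predictor_correlation <- function(data) {
    cor(data$trestbps, data$chol)
}

# Share of patients with exercise-induced angina within each quartile
# of the chosen predictor (quartile 1 holds the lowest values).
angina_rate_by_quartile <- function(data, predictor) {
    data |>
        mutate(quartile = ntile({{ predictor }}, 4)) |>
        group_by(quartile) |>
        summarize(n = n(), angina_rate = mean(exang == "Yes"))
}
```

For instance, `predictor_correlation(heart_training)` and `angina_rate_by_quartile(heart_training, chol)` would show whether the two predictors move together and whether the angina rate rises across cholesterol quartiles.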

Being able to predict the likelihood of a patient developing angina is valuable in the medical field, because catching heart disease early is the best way to prevent life-threatening problems that only worsen the longer the disease goes untreated.

This model could raise many further questions about the causes and preventability of heart disease as a whole; adding other variables to the model, such as age, maximum heart rate, and resting heart rate, could give more insight into how this type of heart disease occurs.

## Bibliography

*Angina*, Heart and Stroke Foundation of Canada, 2023, www.heartandstroke.ca/heart-disease/conditions/angina.

*Angina: Symptoms, Diagnosis and Treatments*, Harvard Health Publishing, 21 Sept. 2021, www.health.harvard.edu/heart-health/angina-symptoms-diagnosis-and-treatments. 

*What Is Angina?*, National Heart, Lung, and Blood Institute, U.S. Department of Health and Human Services, 23 June 2023, www.nhlbi.nih.gov/health/angina.