# Considering serum cholesterol levels and resting blood pressure, how likely is an individual to experience exercise-induced angina?

By: Yuuta L., Wendi L., Ella T., Toma V.

## Introduction

Angina is a common symptom of cardiovascular disease. It occurs when parts of the heart are not sufficiently supplied with oxygenated blood as a result of atherosclerosis (NHLBI, 2023). Atherosclerosis is the build-up of plaque in the arteries, usually caused by high levels of cholesterol that accumulate on artery walls and trigger an immune response. White blood cells attempt to clear the build-up but cling to the walls, forming a growing lump that stays permanently in place (Harvard Health, 2021). As a result, fewer oxygen-carrying red blood cells can pass through the artery at once. Exercise-induced angina occurs when heart rate increases during exercise to supply more oxygen to the cells in need but the narrowed artery restricts blood flow, inducing chest pain. This combination of higher oxygen demand and restricted supply can lead to cardiac cell death (Heart and Stroke Foundation of Canada, 2023). Our model aims to predict the presence of exercise-induced angina in patients from their resting blood pressure and serum cholesterol levels.

## Preliminary Exploratory Data Analysis

Our research question is as follows: **Based on resting blood pressure and serum cholesterol levels, is a patient likely to experience exercise-induced angina? How accurate is this prediction?**

In [None]:
# First, import data into R.
library(tidyverse)
library(tidymodels)
set.seed(1)
heart_data <- read_csv("https://raw.githubusercontent.com/L-Wendi/dsci-100-project/main/heart.csv")

We are using a heart disease dataset dating from 1988 that contains 14 variables covering cardiovascular health and basic patient information. It has 1025 rows and enough variables to support comparisons beyond the two predictors we chose. The data come from Kaggle (https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset), which hosts a pre-processed, more readable version of the original dataset at https://archive.ics.uci.edu/dataset/45/heart+disease. Definitions of the column names for both versions can be found on their respective websites.

To answer our predictive question, we will use the columns `trestbps` (resting blood pressure, in mm Hg) and `chol` (serum cholesterol, in mg / dL) to predict `exang` (exercise-induced angina).

In [None]:
heart_data <- heart_data |>
    mutate(sex = fct_recode(as_factor(sex), "Male" = "1", "Female" = "0")) |>
    mutate(exang = fct_recode(as_factor(exang), "Yes" = "1", "No" = "0")) |>
    mutate(target = fct_recode(as_factor(target), "No Disease" = "0", "Disease" = "1"))

We summarized the training data by class, using the three columns of interest (`trestbps`, `chol`, and `exang`):

In [None]:
# Split into training (75%) and testing (25%) sets, stratified by exang
heart_split <- initial_split(heart_data, prop = 0.75, strata = exang)
heart_training <- training(heart_split)
heart_testing <- testing(heart_split)

# Summarize each predictor within each class of exang (training data only)
heart_predictors_info <- heart_training |>
    select(exang, trestbps, chol) |>
    group_by(exang) |>
    summarize(count = n(),
              mean_trestbps = mean(trestbps), max_trestbps = max(trestbps), min_trestbps = min(trestbps),
              mean_chol = mean(chol), max_chol = max(chol), min_chol = min(chol))
heart_predictors_info

In the preliminary analysis, we also drew a scatter plot based on training data to visualize the relationship between resting blood pressure and serum cholesterol, along with the presence of angina.

In [None]:
# Plot the two predictors against each other, colored by angina status
scatter_plot <- heart_training |> ggplot(aes(x = trestbps, y = chol, color = exang)) +
    geom_point() +
    labs(x = "Resting blood pressure (mm Hg)", y = "Serum cholesterol (mg / dL)", color = "Presence of Angina") +
    scale_color_manual(values = c("red", "blue"))
scatter_plot

## Methods

To answer our predictive question, we will use K-nearest neighbors to build a classifier that predicts whether a patient experiences exercise-induced angina. The predictors are resting blood pressure (`trestbps`) and serum cholesterol level (`chol`). The response is the categorical variable for exercise-induced angina (`exang`), where 1 has been renamed to `Yes` and 0 to `No`.
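The classifier described above could be set up roughly as follows. This is a minimal sketch, not a final implementation: the helper name `fit_knn_exang` is hypothetical, `k = 5` is a placeholder rather than a tuned value, and the `kknn` package is assumed to be installed as the engine behind `nearest_neighbor()`.

```r
library(tidymodels)  # the kknn package must also be installed for the "kknn" engine

# Hypothetical helper: fit a K-nearest neighbors classifier that predicts
# exang from trestbps and chol. k = 5 is a placeholder, not a tuned value.
fit_knn_exang <- function(training_data, k = 5) {
    # Center and scale both predictors so that neither dominates the distance
    knn_recipe <- recipe(exang ~ trestbps + chol, data = training_data) |>
        step_center(all_predictors()) |>
        step_scale(all_predictors())

    # Unweighted ("rectangular") K-nearest neighbors for classification
    knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k) |>
        set_engine("kknn") |>
        set_mode("classification")

    # Bundle preprocessing and model, then fit on the training data
    workflow() |>
        add_recipe(knn_recipe) |>
        add_model(knn_spec) |>
        fit(data = training_data)
}

# Example use with the objects defined earlier in this notebook:
# knn_fit <- fit_knn_exang(heart_training)
# predictions <- predict(knn_fit, heart_testing) |> bind_cols(heart_testing)
```

Scaling matters here because `chol` spans a wider numeric range than `trestbps`, so an unscaled distance would be driven mostly by cholesterol.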

We can visualize the results with a scatter plot of standardized resting blood pressure against standardized serum cholesterol, coloring the points by the presence of exercise-induced angina. We can mark the predicted values differently (e.g., with a circle or a distinct color) and compare them against the true classification. Besides graphs, we will also build a confusion matrix to compare the results and calculate accuracy, precision, and recall.
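To make the three metrics concrete, here is a small base R sketch of how they follow from the counts in a confusion matrix. The helper `classification_metrics` is hypothetical, the labels are made up for illustration (they are not results from the heart data), and treating `Yes` as the positive class is our assumption.

```r
# Hypothetical helper: accuracy, precision, and recall from true and predicted labels.
# "Yes" (angina present) is treated as the positive class; that choice is an assumption.
classification_metrics <- function(truth, predicted, positive = "Yes") {
    true_pos  <- sum(predicted == positive & truth == positive)
    false_pos <- sum(predicted == positive & truth != positive)
    false_neg <- sum(predicted != positive & truth == positive)
    true_neg  <- sum(predicted != positive & truth != positive)

    c(accuracy  = (true_pos + true_neg) / length(truth),
      precision = true_pos / (true_pos + false_pos),
      recall    = true_pos / (true_pos + false_neg))
}

# Made-up labels for illustration only (not results from the heart data)
truth     <- factor(c("Yes", "Yes", "Yes", "No", "No", "No", "No", "No"),
                    levels = c("No", "Yes"))
predicted <- factor(c("Yes", "Yes", "No", "Yes", "No", "No", "No", "No"),
                    levels = c("No", "Yes"))

# Confusion matrix: rows are predictions, columns are true labels
table(predicted = predicted, truth = truth)

example_metrics <- classification_metrics(truth, predicted)
example_metrics
```

In the actual analysis, the `yardstick` functions `conf_mat()`, `accuracy()`, `precision()`, and `recall()` would likely be more convenient. One thing to watch: `yardstick` treats the first factor level as the positive class by default, so with levels `No`, `Yes` the argument `event_level = "second"` would probably be needed for precision and recall to refer to `Yes`.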

## Expected Outcomes and Significance

Because angina occurs as a result of atherosclerosis, we expect cases of exercise-induced angina to become more frequent as either variable increases. Predicting the likelihood that a patient will experience angina matters in medicine because catching heart disease early is one of the best ways to prevent life-threatening problems later. This model could also raise further questions about the causes of heart disease. Testing other variables in the model, such as age, maximum heart rate, and resting heart rate, could give more insight into how this type of heart disease occurs.

## Bibliography

*Angina*, Heart and Stroke Foundation of Canada, 2023, www.heartandstroke.ca/heart-disease/conditions/angina.

*Angina: Symptoms, Diagnosis and Treatments*, Harvard Health Publishing, 21 Sept. 2021, www.health.harvard.edu/heart-health/angina-symptoms-diagnosis-and-treatments. 

*What Is Angina?*, National Heart, Lung, and Blood Institute (NHLBI), U.S. Department of Health and Human Services, 23 June 2023, www.nhlbi.nih.gov/health/angina.