# DSCI100 2023W1 Group Project Proposal - 27

## Considering serum cholesterol levels and resting blood pressure, how likely is an individual to experience exercise-induced angina?

Angina is a common symptom of cardiovascular disease. It occurs when parts of the heart are not supplied with enough oxygen-rich blood because of plaque build-up in the arteries, known as atherosclerosis (NHLBI, 2023). Atherosclerosis is usually caused by high levels of cholesterol that accumulate in the artery walls and trigger an immune response: white blood cells target the deposits but become trapped in the areas they attempt to clear, forming a growing plaque that stays in place permanently (Harvard Health, 2023). As a result, fewer oxygen-carrying red blood cells can pass through the artery at any one time. Exercise can aggravate this, because the heart rate rises to supply more oxygen to the cells that need it, especially cardiac cells. The resulting chest pain is known as exercise-induced angina. Higher oxygen demand, combined with a supply restricted by the narrowed artery, leads to the death of cardiac cells that do not receive the oxygen they need (Heart and Stroke Canada).

We therefore aim to build a model that predicts whether a given individual will experience exercise-induced angina, based on their resting blood pressure (measured in mm Hg) and serum cholesterol level (measured in mg/dL).

Our research question is as follows: **Based on resting blood pressure and serum cholesterol levels, is a patient likely to experience exercise-induced angina? How accurate is this prediction?**

We are using a 1988 heart disease dataset that contains 14 variables related to cardiovascular health, including patients' age, sex, chest pain type, blood pressure, cholesterol, blood sugar, and ECG results. The dataset has 1025 rows of data from Cleveland, Hungary, Switzerland, and Long Beach V. We chose it because it offers a wide range of variables to compare and a large number of samples. The dataset comes from Kaggle and is a compilation of the original dataset at https://archive.ics.uci.edu/dataset/45/heart+disease, pre-processed to be ready for data analysis.

* https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset
* https://www.nhlbi.nih.gov/health/angina
* https://www.heartandstroke.ca/heart-disease/conditions/angina
* https://www.health.harvard.edu/heart-health/angina-symptoms-diagnosis-and-treatments


## Preliminary Exploratory Data Analysis

### Import data into R

In [None]:
library(tidyverse)
library(tidymodels)
set.seed(1)

In [None]:
# The data originally comes from https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset
# and is read here from a copy hosted on GitHub.
heart_data <- read_csv("https://raw.githubusercontent.com/L-Wendi/dsci-100-project/main/heart.csv")
heart_data |> slice(1:3)

### Clean Data

Our dataset comes from Kaggle (https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset) and is based on https://archive.ics.uci.edu/dataset/45/heart+disease, pre-processed for easier analysis. The author of the Kaggle dataset recoded several fields to improve readability. The following table lists the columns in both the original dataset and ours, together with the further changes we made to tidy the data:

| Column   | Meaning                                            | Type         | Original Dataset                                                                       | Kaggle Dataset                                                                       | Our Modifications  |
|----------|----------------------------------------------------|--------------|----------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------|--------------------|
| age      | Age of the patient (Years)                         | Quantitative | Number                                                                                 | No change                                                                            | No change          |
| sex      | Assigned sex of the patient                        | Categorical  | Number (1: Male, 0: Female)                                                            | No change                                                                            | Change into factor |
| cp       | Chest pain type                                    | Categorical  | Number (1: Typical Angina, 2: Atypical Angina, 3: Non-anginal Pain, 4: Asymptomatic)   | Number (0: Asymptomatic, 1: Atypical Angina, 2: Non-anginal Pain, 3: Typical Angina) | Change into factor |
| trestbps | Resting blood pressure (mm Hg)                     | Quantitative | Number                                                                                 | No change                                                                            | No change          |
| chol     | Serum cholesterol (mg / dL)                        | Quantitative | Number                                                                                 | No change                                                                            | No change          |
| fbs      | Fasting blood sugar > 120mg / dL                   | Categorical  | Number (1: True, 0: False)                                                             | No change                                                                            | Change into factor |
| restecg  | Resting electrocardiographic results               | Categorical  | Number (0: Normal, 1: ST-T Wave Abnormal, 2: Left Ventricular Hypertrophy)             | No change                                                                            | Change into factor |
| thalach  | Max heart rate achieved                            | Quantitative | Number                                                                                 | No change                                                                            | No change          |
| exang    | Exercise-induced angina                            | Categorical  | Number (1: Yes, 0: No)                                                                 | No change                                                                            | Change into factor |
| oldpeak  | ST depression induced by exercise relative to rest | Quantitative | Number                                                                                 | No change                                                                            | No change          |
| slope    | The slope of the peak exercise ST segment          | Categorical  | Number (1: Upsloping, 2: Flat, 3: Downsloping)                                         | Number (0: Downsloping, 1: Flat, 2: Upsloping)                                       | Change into factor |
| ca       | Number of major vessels colored by fluoroscopy     | Quantitative | Number (0 to 3)                                                                        | No change                                                                            | No change          |
| thal     | (Not defined in the original dataset)              | Categorical  | Number (3: Normal, 6: Fixed Defect, 7: Reversible Defect)                              | Number (1: Normal, 2: Fixed Defect, 3: Reversible Defect)                            | Change into factor |
| num      | Diagnosis of heart disease                         | Categorical  | Number (0: No Disease (<50% diameter narrowing), 1: Disease (>50% diameter narrowing)) | Renamed to target                                                                    | Change into factor |

The column definitions for the original dataset can be found at https://archive.ics.uci.edu/dataset/45/heart+disease, and those for the Kaggle dataset are listed on the Kaggle website.

In [None]:
# Recode the numerically coded categorical columns as labelled factors
heart_data <- heart_data |>
    mutate(sex = fct_recode(as_factor(sex), "Male" = "1", "Female" = "0"),
           cp = fct_recode(as_factor(cp), "Asymptomatic" = "0", "Atypical Angina" = "1", "Non-anginal Pain" = "2", "Typical Angina" = "3"),
           fbs = fct_recode(as_factor(fbs), "True" = "1", "False" = "0"),
           restecg = fct_recode(as_factor(restecg), "Normal" = "0", "ST-T Wave Abnormal" = "1", "Left Ventricular Hypertrophy" = "2"),
           exang = fct_recode(as_factor(exang), "Yes" = "1", "No" = "0"),
           slope = fct_recode(as_factor(slope), "Downsloping" = "0", "Flat" = "1", "Upsloping" = "2"),
           thal = fct_recode(as_factor(thal), "Normal" = "1", "Fixed Defect" = "2", "Reversible Defect" = "3"),
           target = fct_recode(as_factor(target), "No Disease" = "0", "Disease" = "1"))

heart_data |> slice(1:5)
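
Before moving on, it can be useful to confirm that the cleaned table has no missing values or repeated rows, since either would affect the summaries that follow. The block below is a minimal sketch on a small made-up table: `toy_heart` and its values are assumptions for illustration only and are not part of our dataset. The same two checks could be run on `heart_data`.

```r
library(dplyr)
library(tibble)

# Toy stand-in for heart_data (made-up values, for illustration only)
toy_heart <- tibble(
    trestbps = c(120, 130, 130, NA),
    chol     = c(200, 250, 250, 180),
    exang    = factor(c("No", "Yes", "Yes", "No"))
)

# Count the missing values in every column
missing_counts <- toy_heart |>
    summarize(across(everything(), ~ sum(is.na(.x))))

# Count the rows that exactly repeat an earlier row
n_duplicates <- nrow(toy_heart) - nrow(distinct(toy_heart))

missing_counts
n_duplicates
```

If repeated rows turn up in the real data, they would need to be considered before splitting, because identical observations could otherwise land in both the training and testing sets.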

### Exploratory Analysis

To answer our predictive question "**Based on resting blood pressure and serum cholesterol levels, is the patient likely to experience exercise induced angina? How accurate is this prediction?**", we are going to use the columns `trestbps` (resting blood pressure) and `chol` (serum cholesterol) as predictors and the column `exang` (exercise-induced angina) as the class to predict.

In [None]:
# Split the data into training (75%) and testing (25%) sets, stratified by exang.
heart_split <- initial_split(heart_data, prop = 0.75, strata = exang)
heart_training <- training(heart_split)
heart_testing <- testing(heart_split)
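
Because the split is stratified by `exang`, both sets should contain the two classes in similar proportions. The sketch below shows one way to check this. It runs on a made-up table (`toy_heart`, with an assumed 30% "Yes" rate), and `class_proportions` is a hypothetical helper introduced here for illustration; the same helper could be applied to `heart_training` and `heart_testing`.

```r
library(dplyr)
library(tibble)
library(rsample)

set.seed(1)

# Toy stand-in for heart_data: 100 made-up rows, 30 "Yes" and 70 "No"
toy_heart <- tibble(
    trestbps = rnorm(100, mean = 130, sd = 15),
    exang    = factor(rep(c("Yes", "No"), times = c(30, 70)))
)

# Same kind of stratified split as used for the real data
toy_split <- initial_split(toy_heart, prop = 0.75, strata = exang)
toy_training <- training(toy_split)
toy_testing <- testing(toy_split)

# Hypothetical helper: number and share of rows in each exang class
class_proportions <- function(data) {
    data |>
        count(exang) |>
        mutate(proportion = n / sum(n))
}

class_proportions(toy_training)
class_proportions(toy_testing)
```

The share of the larger class is also the accuracy of always predicting that class, which gives a simple baseline for judging how accurate any later prediction is.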

Next, we summarize the predictors for each class, using the training data only:

In [None]:
heart_predictors_info <- heart_training |>
    select(exang, trestbps, chol) |>
    group_by(exang) |>
    summarize(count = n(),
              mean_trestbps = mean(trestbps),
              max_trestbps = max(trestbps),
              min_trestbps = min(trestbps),
              mean_chol = mean(chol),
              max_chol = max(chol),
              min_chol = min(chol))

heart_predictors_info

As part of the preliminary analysis, we also draw a scatter plot of serum cholesterol against resting blood pressure, colored by whether exercise-induced angina is present. The plot is based solely on the training data.

In [None]:
scatter_plot <- heart_training |>
    ggplot(aes(x = trestbps, y = chol, color = exang)) +
    # Semi-transparent points make overlapping observations easier to see
    geom_point(alpha = 0.5) +
    labs(x = "Resting blood pressure (mm Hg)",
         y = "Serum cholesterol (mg / dL)",
         color = "Exercise-induced angina") +
    scale_color_manual(values = c("red", "blue"))

scatter_plot