# Heart Disease Statistics Group Final Report 

## Authors: Nicholas Tam, Hanxi Chen, Levi Zeng, Xinyang Deng

## Group: 4

## 1. Introduction
### 1.1. Background Information
There are numerous studies that have indicated strong associations of coronary heart disease with a wide variety of factors, including age and sex (Lloyd-Jones, Larson, Beiser, Levy, 1999), blood pressure (Lawes, Bennett, Lewington, Rodgers, 2003), and serum cholesterol level (Law, Wald, Thompson, 1994). However, given that the majority of the given factors have significant associations with one another, such as cholesterol level with age and sex (Beckett, N., Nunes, M., & Bulpitt, C., 2000), along with the sheer quantity of potential risk factors (Hajar, 2017), it is unclear how these factors could be combined to model and predict the diagnosis of coronary heart disease.

### 1.2. Dataset and Project Question
For our research project, we have selected datasets containing processed angiography data on patients in various clinics in 1988, applying a probability model derived from test results of 303 patients at the Cleveland Clinic in Cleveland, Ohio to generate and estimate results for the diagnosis of coronary heart disease (Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R., 1989). The datasets include the following patients undergoing angiography: 
- 303 patients at the Cleveland Clinic in Cleveland, Ohio (Original dataset for model) 
- 425 patients at the Hungarian Institute of Cardiology in Budapest, Hungary
- 200 patients at the Veterans Administration Medical Center in Long Beach, California 
- 143 patients from the University Hospitals in Zurich and Basel, Switzerland

These datasets were retrieved from the [Heart Disease](https://archive.ics.uci.edu/dataset/45/heart+disease) dataset from UCI machine learning repository, and converted from .data files to CSV files with Excel. The dataset obtained contains the following 14 attributes out of 76 attributes from the initial dataset for each patient: 

In [1]:
myTable <- data.frame(
  Variable = c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"),
  Definition = c("Age", "Sex", "Chest pain type", "Resting blood pressure on admission to hospital", "Serum cholesterol", "Presence of high blood sugar", "Resting electrocardiographic results", "Maximum heart rate achieved", "Exercise induced angina", "ST depression induced by exercise relative to rest", "Slope of the peak exercise ST segment", "Number of major vessels coloured by fluoroscopy", "Presence of defect", "Diagnosis of heart disease"),
  Type = c("Numerical", "Categorical", "Categorical", "Numerical", "Numerical", "Categorical", "Categorical", "Numerical", "Categorical", "Numerical", "Categorical", "Numerical", "Categorical", "Categorical"),
  Unit = c("Years", "N/A", "N/A", "mmHg", "mg/dl", "N/A", "N/A", "BPM", "N/A", "N/A", "N/A", "N/A", "N/A", "N/A"),
  Categories = c("N/A", "0: Female; 1: Male", "1: Typical angina; 2: Atypical angina; 3: Non-anginal pain; 4: Asymptomatic", "N/A", "N/A", "0: False; 1: True", "0: Normal; 1: Having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); 2: Showing probable or definite left ventricular hypertrophy by Estes' criteria", "N/A", "0: No; 1: Yes", "N/A", "1: Upsloping; 2: Flat; 3: Downsloping", "Range from 1-3", "3: Normal; 6: Fixed defect; 7: Reversible defect", "0: < 50% diameter narrowing; 1+: > 50% diameter narrowing"),
  AnyMissingValues = c("No", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No")
)
myTable

Variable,Definition,Type,Unit,Categories,AnyMissingValues
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
age,Age,Numerical,Years,,No
sex,Sex,Categorical,,0: Female; 1: Male,No
cp,Chest pain type,Categorical,,1: Typical angina; 2: Atypical angina; 3: Non-anginal pain; 4: Asymptomatic,No
trestbps,Resting blood pressure on admission to hospital,Numerical,mmHg,,Yes
chol,Serum cholesterol,Numerical,mg/dl,,Yes
fbs,Presence of high blood sugar,Categorical,,0: False; 1: True,Yes
restecg,Resting electrocardiographic results,Categorical,,0: Normal; 1: Having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); 2: Showing probable or definite left ventricular hypertrophy by Estes' criteria,Yes
thalach,Maximum heart rate achieved,Numerical,BPM,,Yes
exang,Exercise induced angina,Categorical,,0: No; 1: Yes,Yes
oldpeak,ST depression induced by exercise relative to rest,Numerical,,,Yes


Our project question is:

#### “Given the sample data for angiography patients, what model would be most effective in predicting each patient’s diagnosis?”

Our analysis will involve the development of a predictive model to estimate the likelihood of angiographic coronary disease based on these variables. Additionally, we will explore regional variations and demographic influences on heart disease risk. This research question is primarily focused on both predictions, as we seek to generate a predictive model given the provided data to estimate diagnoses of new observations. Inference will also be required to a lesser extent, as we aim to gain insights into the factors influencing the likelihood of coronary disease diagnosis in different locations and demographic groups. 

## 2. Preliminary Results

### 2.1. Loading relevant libraries

In [2]:
# Imports

# install.packages("remotes")
# remotes::install_github("tidymodels/infer")
# install.packages("infer") # Install infer package for use

library(dplyr) # Data manipulation operations
library(gridExtra) # Extensions for grid system
library(tidyverse) # Better presentation of data
library(repr) # String and binary representations of objects for several formats / mime types 
library(lubridate) # Easier date organisation
library(infer) # Bootstrap distribution, confidence interval
library(broom) # Reorganises outputs into tidy tibbles
# library(ggplot2) # Provides commands to create complex plots
library(GGally) # Provides correlation between variables
library(tidymodels) # Modelling with training and testing
library(car) # Applied regression tools, including VIF
library(leaps) # Exhaustive search for the best subsets of the variables in x for predicting y in linear regression
library(glmnet) # Regularised regression models
library(mltools) # Regression metrics
library(caret) # Streamline the process for creating predictive models
library(boot) # Allows easy generattion of bootstrap samples of virtually any statistic that they can calculate in R
library(pROC) # Display and analyse ROC curves 
library(MASS) # Support functions and datasets for Venables and Ripley's MASS # WARNING: select() MAY HAVE ISSUES WITH USE IF THIS IS LOADED


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



Attaching package: ‘gridExtra’


The following object is masked from ‘package:dplyr’:

    combine


── [1mAttaching core tidyverse packages[22m ──────────────────────── tidyverse 2.0.0 ──
[32m✔[39m [34mforcats  [39m 1.0.0     [32m✔[39m [34mreadr    [39m 2.1.4
[32m✔[39m [34mggplot2  [39m 3.4.3     [32m✔[39m [34mstringr  [39m 1.5.0
[32m✔[39m [34mlubridate[39m 1.9.3     [32m✔[39m [34mtibble   [39m 3.2.1
[32m✔[39m [34mpurrr    [39m 1.0.2     [32m✔[39m [34mtidyr    [39m 1.3.0
── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mgridExtra[39m::[32mcombine()[39m masks [34mdplyr[39m::combine()
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m      masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39

### 2.2. Uploading and merging relevant tables

- Each data set is read separately, then merged into a single dataframe called `total_heart`, as shown in Table 1. 
- Before merging, the `location` column is created, to indicate the differing clinics that the probability model was applied to; the reference level is set to "Cleveland", the clinic the initial model was derived from.
- The `age`, `sex`, `cp`, `trestbps`, `chol`, `thalach`, and `num` columns require a change in data type to be used as intended.
- Due to several datasets either missing most or all data for `fbs`, `oldpeak`, `slope`, `ca` and `thal` columns, the columns have been removed under the assumption that they are irrelevant or are results from the initial model.
- Any patients with "?" for any variables, `trestbps == 0` or `chol == 0` are assumed to be invalid and have been removed.
- The Switzerland dataset has `chol == 0` for all data and as such has been removed.
- Any values of `num >= 1` provide the same result, and thus have been converted to 1.

In [3]:
Cleveland_heart <- read.csv("https://raw.githubusercontent.com/Nick-2003/STAT-301-Group_4-Project_Final/main/heart%2Bdisease%2BModified/processed_Cleveland.csv") %>% 
    mutate(location = "Cleveland")
Hungary_heart <- read.csv("https://raw.githubusercontent.com/Nick-2003/STAT-301-Group_4-Project_Final/main/heart%2Bdisease%2BModified/processed_Hungarian.csv") %>% 
    mutate(location = "Hungary")
Switzerland_heart <- read.csv("https://raw.githubusercontent.com/Nick-2003/STAT-301-Group_4-Project_Final/main/heart%2Bdisease%2BModified/processed_Switzerland.csv") %>% 
    mutate(location = "Switzerland")
California_heart <- read.csv("https://raw.githubusercontent.com/Nick-2003/STAT-301-Group_4-Project/main/heart%2Bdisease%2BModified/processed_VA.csv") %>% 
    mutate(location = "California")
# head(California_heart)
total_heart <- rbind(Cleveland_heart, Hungary_heart, Switzerland_heart, California_heart) %>% 
    dplyr::select(location, age, sex, cp, trestbps, chol, restecg, thalach, exang, num) %>% 
    filter(!(location == '?' | age == '?' | sex == '?' | cp == '?' | trestbps == '?' | trestbps == '0' | chol == '?' | chol == '0' | restecg == '?' | thalach == '?' | num == '?')) %>% 
    mutate(num = ifelse(num >= 1, 1, num))  %>% 
    transform(sex = as.character(as.factor(sex)), cp = as.character(as.factor(cp)), trestbps = as.double(as.factor(trestbps)), chol = as.double(as.factor(chol)), thalach = as.double(as.factor(thalach))) 
# %>% 
total_heart$location <- factor(total_heart$location) %>% 
    relevel(total_heart$location, ref = "Cleveland")

head(total_heart)
tail(total_heart)

Unnamed: 0_level_0,location,age,sex,cp,trestbps,chol,restecg,thalach,exang,num
Unnamed: 0_level_1,<fct>,<int>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<dbl>
1,Cleveland,63,1,1,34,81,2,48,0,0
2,Cleveland,67,1,4,43,134,2,6,1,1
3,Cleveland,67,1,4,16,77,2,27,1,1
4,Cleveland,37,1,3,25,98,0,84,0,0
5,Cleveland,41,0,2,25,52,2,70,0,0
6,Cleveland,56,1,2,16,84,0,76,0,0


Unnamed: 0_level_0,location,age,sex,cp,trestbps,chol,restecg,thalach,exang,num
Unnamed: 0_level_1,<fct>,<int>,<chr>,<chr>,<dbl>,<dbl>,<chr>,<dbl>,<chr>,<dbl>
669,California,53,1,4,33,147,1,26,1,1
670,California,62,1,4,42,21,1,36,1,1
671,California,46,1,4,27,156,0,24,0,1
672,California,54,0,4,22,173,1,52,0,1
673,California,55,1,4,17,71,1,1,0,1
674,California,62,1,2,16,102,2,103,1,1


### 2.3. Exploratory Data Analysis

## 3. Methods and Plan

The dataset contains measurements for coronary heart disease diagnosis in separate locations, and the VIF values for all explanatory variables are relatively low, allowing for variations in diagnosis due to potential confounding variables to be accounted for.

### 3.1. 

## 4. Discussion 
### 4.1. 

## References
- Lloyd-Jones, D. M., Larson, M. G., Beiser, A., & Levy, D. (1999, February 19). Lifetime risk of developing coronary heart disease. The Lancet. https://www.sciencedirect.com/science/article/pii/S0140673698102799?via%3Dihub 
- Lawes, C. M. M., Bennett, D. A., Lewington, S., & Rodgers, A. (2003, January 22). Blood pressure and coronary heart disease: A review of the evidence. Seminars in Vascular Medicine. https://www.thieme-connect.com/products/ejournals/html/10.1055/s-2002-36765 
- Law, M. R., Wald, N. J., & Thompson, S. G. (1994, February 5). By how much and how quickly does reduction in serum cholesterol concentration lower risk of ischaemic heart disease?. The BMJ. https://www.bmj.com/content/308/6925/367.full 
- Beckett, N., Nunes, M., & Bulpitt, C. (2000). Is it advantageous to lower cholesterol in the elderly hypertensive?. Cardiovascular drugs and therapy, 14(4), 397–405. https://doi.org/10.1023/a:1007812232328 
- Hajar R. (2017). Risk Factors for Coronary Artery Disease: Historical Perspectives. Heart views : the official journal of the Gulf Heart Association, 18(3), 109–114. https://doi.org/10.4103/HEARTVIEWS.HEARTVIEWS_106_17 