# Heart Disease Dataset Analysis

Laura Flannigan, Kaitlyn Harvie, Hannah Morstead, & Tony Tang 

**Introduction**

According to the World Health Organization (2021), cardiovascular disease remains the leading cause of death worldwide. However, these diseases are preventable, and research has identified several risk factors for their development. Yusuf et al. (2001) suggested that smoking, high blood pressure, and elevated LDL cholesterol levels are causally linked to cardiovascular disease. It is also widely accepted that a family history of heart disease (Kardia et al., 2003) and diabetes increase risk, although the risk associated with diabetes differs between men and women (Wilson, 1998).

Using the Heart Disease Dataset from the Cleveland Clinic Foundation, we will determine how accurately a classification model can predict the presence of heart disease in a new observation. The predictor variables are age, sex, chest pain type, resting blood pressure, blood cholesterol level, smoker status, family history of diabetes, family history of cardiac disease, and presence of exercise-induced angina.


**Preliminary Exploratory Data Analysis**

In [5]:
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 6)

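url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
# create the data directory first so download.file() has somewhere to write
dir.create("data", showWarnings = FALSE)
download.file(url, "data/processed_cleveland.csv")
# the file has no header row, so readr assigns the names X1 to X14
heart_disease_data <- read_csv("data/processed_cleveland.csv", col_names = FALSE)
# columns 12 and 13 are parsed as character because missing values are coded as "?"
heart_disease_data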

Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_double(),
  X7 = col_double(),
  X8 = col_double(),
  X9 = col_double(),
  X10 = col_double(),
  X11 = col_double(),
  X12 = col_character(),
  X13 = col_character(),
  X14 = col_double()
)



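| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 57 | 1 | 4 | 130 | 131 | 0 | 0 | 115 | 1 | 1.2 | 2 | 1.0 | 7.0 | 3 |
| 57 | 0 | 2 | 130 | 236 | 0 | 2 | 174 | 0 | 0.0 | 2 | 1.0 | 3.0 | 1 |
| 38 | 1 | 3 | 138 | 175 | 0 | 0 | 173 | 0 | 0.0 | 1 | ? | 3.0 | 0 |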
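
The "?" entries and the generic X1 to X14 names could be handled while reading the file. The following is a minimal sketch in base R, not the notebook's own code: it names the 14 columns after the UCI documentation for the processed Cleveland file and treats "?" as missing. The objects `heart_cols`, `sample_rows`, and `heart_named` are illustrative names, and three rows copied from the output above stand in for the full file. With readr, passing `col_names = heart_cols` and `na = "?"` to `read_csv()` should have the same effect.

```r
# Column names as given in the UCI documentation for processed.cleveland.data
heart_cols <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# Three rows copied from the output above, standing in for the full file
sample_rows <- "63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
38,1,3,138,175,0,0,173,0,0.0,1,?,3.0,0"

# na.strings = "?" turns the placeholder into NA, so ca and thal parse as numbers
heart_named <- read.csv(text = sample_rows, header = FALSE,
                        col.names = heart_cols, na.strings = "?")

# Inspect the repaired column and count the missing values per column
str(heart_named$ca)
colSums(is.na(heart_named))
```

Reading "?" as NA at import keeps columns 12 and 13 numeric, so rows with missing values can be dropped or imputed explicitly later.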


**Methods**

How will we conduct our data analysis?
We will use classification because the target is categorical: an integer from 0 to 4, where 0 indicates no heart disease and 1 through 4 indicate increasing severity. The model's prediction of this value tells us whether a patient has heart disease.
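
As a rough illustration of treating this value as a class label, the sketch below converts the diagnosis column (X14) of the six rows shown earlier into a factor. The binary "absent"/"present" recoding is an assumption added here for illustration and is not stated in the proposal; `diagnosis` and `has_disease` are hypothetical names.

```r
# Diagnosis values (column X14) of the six rows shown in the earlier output
num <- c(0, 2, 1, 3, 1, 0)

# Treat the 0-4 score as a categorical label rather than a number
diagnosis <- factor(num, levels = 0:4)

# Assumed binary alternative: 0 = "absent", 1 to 4 = "present"
has_disease <- factor(ifelse(num == 0, "absent", "present"),
                      levels = c("absent", "present"))

# Count the observations in each class
table(diagnosis)
table(has_disease)
```

Declaring all five levels up front keeps level 4 in the factor even when no sampled row has that value.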

Which variables/columns will we use?
Based on some brief research, we determined that the most relevant indicators of heart disease include, but are not limited to, age, sex, diabetes, family history, blood pressure, obesity, and smoking history. These are among the most common and useful factors that physicians consider when diagnosing heart disease. We therefore chose the corresponding columns available in our dataset, listed here by their attribute numbers in the dataset's documentation:
- #3 age
- #4 sex
- #9 cp: chest pain type
- #10 trestbps: resting blood pressure
- #12 chol: serum cholesterol level
- #19 restecg: resting electrocardiographic results
- #38 exang: exercise-induced angina
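
The attribute numbers above come from the original 76-attribute documentation, while the processed file holds only 14 columns. The sketch below, written under that assumption, looks up where each chosen predictor sits in the processed file and keeps those columns along with the diagnosis. The names `heart_cols`, `predictors`, `positions`, and `heart_subset` are illustrative, and two rows from the earlier output stand in for the full data.

```r
# Column order of the processed file, per the UCI documentation
heart_cols <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# The seven predictors chosen above
predictors <- c("age", "sex", "cp", "trestbps", "chol", "restecg", "exang")

# Position of each predictor within the 14-column processed file
positions <- match(predictors, heart_cols)

# Two rows copied from the earlier output, standing in for the full file
rows <- read.csv(text = "63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2", header = FALSE)

# Keep the predictors plus the diagnosis (column 14) and give them readable names
heart_subset <- rows[, c(positions, 14)]
names(heart_subset) <- c(predictors, "num")
heart_subset
```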

Visualization:
To visualize the results, we will use a scatter plot of standardized data, plotting the predicted attribute (the patient's diagnosis) against one of the seven predictors listed above. We will choose this predictor by how accurately it predicts the diagnosis: we will create a plot for each predictor and compare their accuracy. Based on our research, we anticipate that cholesterol will be the strongest predictor in this dataset.
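
One possible shape for such a plot is sketched below in base R. It is an illustration under several assumptions rather than the planned analysis: it uses only the six rows shown in the earlier output, picks age and cholesterol as the two axes, and colors points by an assumed binary diagnosis (0 = absent, 1 to 4 = present). The names `six` and `scaled` are hypothetical.

```r
# Six rows shown in the earlier output: age (X1), chol (X5), diagnosis (X14)
six <- data.frame(
  age  = c(63, 67, 67, 57, 57, 38),
  chol = c(233, 286, 229, 131, 236, 175),
  num  = c(0, 2, 1, 3, 1, 0)
)

# Standardize each predictor to mean 0 and standard deviation 1
scaled <- as.data.frame(scale(six[, c("age", "chol")]))

# Assumed binary label: 0 = absent, 1 to 4 = present
scaled$diagnosis <- factor(ifelse(six$num == 0, "absent", "present"))

# Scatter plot of the standardized predictors, colored by diagnosis
plot(scaled$age, scaled$chol,
     col = c("steelblue", "firebrick")[scaled$diagnosis],
     pch = 19,
     xlab = "Age (standardized)",
     ylab = "Serum cholesterol (standardized)")
legend("bottomright", legend = levels(scaled$diagnosis),
       col = c("steelblue", "firebrick"), pch = 19)
```

Standardizing puts the predictors on a common scale, which matters for distance-based classifiers and makes the axes directly comparable.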


**Expected Outcomes and Significance**

**References**

World Health Organization. (2021). Cardiovascular Diseases (CVDs). Retrieved from https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)

Yusuf, S., Reddy, S., Ounpuu, S., & Anand, S. (2001). Global Burden of Cardiovascular Diseases: Part I: General Considerations, the Epidemiologic Transition, Risk Factors, and Impact of Urbanization. Circulation, 104, 2746-2753. https://doi.org/10.1161/hc4601.099487

Kardia, S. L. R., Modell, S. M., & Peyser, P. A. (2003). Family-Centered Approaches to Understanding and Preventing Coronary Heart Disease. American Journal of Preventive Medicine, 24, 143-151. https://doi.org/10.1016/S0749-3797(02)00587-1

Wilson, P. W. (1998). Diabetes mellitus and coronary heart disease. American Journal of Kidney Diseases, 32, 89-100. https://doi.org/10.1053/ajkd.1998.v32.pm9820468


US National Library of Medicine, National Institutes of Health: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6097244/ 

Johns Hopkins Medicine: https://www.hopkinsmedicine.org/health/wellness-and-prevention/abcs-of-knowing-your-heart-risk 

EveryDay Health: https://www.everydayhealth.com/heart-health-pictures/the-single-best-predictor-of-a-heart-attack.aspx 

Nature.com: https://www.nature.com/articles/s41598-020-62133-5 

Centers for Disease Control and Prevention: https://www.cdc.gov/heartdisease/facts.htm 

Choosing Wisely Canada: https://choosingwiselycanada.org/ecg-electrocardiogram/ 

