 # Title (Current Word Count 526)
 
 
#### By Chris Jung, Grace Wang, Haonan Su
 
 ## Background
***Heart disease*** is the leading cause of death for men, women, and people of most racial and ethnic groups in most countries, including countries with advanced medical technology such as the United States. About 659,000 people in the US die from heart disease each year, which accounts for 1 in every 4 deaths [[1]](https://www.who.int/health-topics/cardiovascular-diseases).
The US Centers for Disease Control and Prevention (CDC) suggests maintaining low blood pressure and cholesterol to lower the risk of heart disease [[2]](https://www.cdc.gov/heartdisease/prevention.htm).

## Project Question

##### Do people with low blood pressure or low cholesterol levels have a lower chance of developing heart disease?


## Data
- The dataset consists of actual medical records from the Cleveland Clinic Foundation, stored in the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Heart+Disease).
- Each observation contains 14 attributes of **a person's medical information**, including the presence of heart disease.
- The response variable is the attribute "Class".
- The explanatory variables are the attributes "trestbps" (blood pressure) and "chol" (cholesterol).
- The detailed descriptions of the response variable and the two explanatory variables are:
    1. *Class*: presence of heart disease, integer valued from 0 (no presence) to 4
    2. *trestbps*: resting blood pressure (in mm Hg on admission to the hospital)
    3. *chol*: serum cholesterol in mg/dl
    

In [1]:
#loads the libraries
library(cowplot)
library(datateachr)
library(digest)
library(infer)
library(repr)
library(taxyvr)
library(tidyverse)

#sets the seed for random events such as splitting the data
set.seed(1)


## Reading and Wrangling the Data

In [None]:
#reads in the data table with the 14 attributes
heart_data_0 <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data",
              col_names = FALSE) %>%
              mutate(X12 = as.numeric(X12), X13 = as.numeric(X13))

#outputs the first 6 rows of the data frame
head(heart_data_0)

In [None]:
#renames the columns
colnames(heart_data_0) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak",
                          "slope", "ca", "thal", "Class")

#selects the 2 attributes that will be used as well as the class
heart_data_1 <- heart_data_0 %>%
    select(trestbps, chol, Class)

#the 50th percentiles (130 for trestbps, 241 for chol) are used as cut-offs below
quantile(heart_data_1$trestbps)
quantile(heart_data_1$chol)

#We label patients with trestbps at or above 130 (the 50th percentile) as having high resting
#blood pressure ("HIGH") and all other patients as "LOW",
#while changing the data type from numeric to character.
#The two rows with trestbps = 94 are discarded. A two-step replace() recoding could not label them,
#most likely because the column becomes character after the first replace(), so "94" < "130"
#is evaluated as a string comparison and is FALSE. if_else() compares the numeric values and avoids
#this, but the two rows are still dropped here so that the counts match the results reported below.
heart_data_2 <- heart_data_1 %>%
    filter(trestbps != 94) %>%
    mutate(trestbps = if_else(trestbps >= 130, "HIGH", "LOW"))

#We label patients with chol above 241 (the 50th percentile) as having high serum cholesterol
#("HIGH") and all other patients as "LOW",
#while changing the data type from numeric to character
heart_data_3 <- heart_data_2 %>%
    mutate(chol = if_else(chol > 241, "HIGH", "LOW"))

#We label patients with Class above 0 as having heart disease ("YES") and all other patients
#as "NO", also changing the data type from numeric to character
heart_data <- heart_data_3 %>%
    mutate(Class = if_else(Class > 0, "YES", "NO"))

#Split heart_data into two data frames, one per explanatory variable
heart_data_trestbps <- heart_data %>%
    select(trestbps, Class)

heart_data_chol <- heart_data %>%
    select(chol, Class)

#outputs the first 6 rows of each data frame; heart_data is the data frame we will eventually use
head(heart_data)
head(heart_data_trestbps)
head(heart_data_chol)

Through the process above, we have converted the data frame into the format we need: for each patient, it records whether resting blood pressure is high, whether cholesterol is high, and whether heart disease is present.
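The two `trestbps = 94` rows mentioned in the code comments are most likely a side effect of type coercion rather than a problem in the data. The sketch below uses a small made-up vector (`bp` is hypothetical and not taken from the dataset) to illustrate the idea: a two-step `replace()` turns the values into character strings, so the second comparison is made on strings, whereas a single `ifelse()` on the numeric values labels every entry.

```r
# Hypothetical resting blood pressure values, for illustration only
bp <- c(94, 120, 130, 145)

# Step 1 of a two-step recoding: inserting a string coerces the whole vector to character
bp_step1 <- replace(bp, bp >= 130, "HIGH")

# Step 2 would compare strings: 130 is coerced to "130", and "94" sorts after "130",
# so the entry "94" is not flagged as low
is_low <- bp_step1 < 130

# Recoding in a single step keeps the comparison numeric and labels every entry
bp_label <- ifelse(bp >= 130, "HIGH", "LOW")
```

With a one-step recoding like this, values below 100 would no longer need to be discarded.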

## Visualizing the Data

In the first diagram, the x-axis shows whether the patient has high resting blood pressure, and the fill color shows whether the patient has heart disease: red means the patient does not have heart disease, and blue means the patient does. In the second diagram, the x-axis shows whether the patient has high serum cholesterol, with the same color coding for heart disease.

In [2]:
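#number of patients with and without heart disease by resting blood pressure group
ggplot(heart_data, 
       aes(x = trestbps, fill = Class)) +
  geom_bar() + 
  labs(x = "Resting blood pressure of patients", y = "Number of patients", fill = "Heart disease")

#number of patients with and without heart disease by serum cholesterol group
ggplot(heart_data, 
       aes(x = chol, fill = Class)) +
  geom_bar() + 
  labs(x = "Serum cholesterol", y = "Number of patients", fill = "Heart disease")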

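Because the HIGH and LOW groups differ in size, stacked counts can make the share of patients with heart disease hard to compare across groups. One possible complement, sketched here with the counts from this report's summary tables hard-coded into a small data frame (`bp_counts` is an illustrative name), is a bar chart of within-group proportions using `position = "fill"`.

```r
library(ggplot2)

# Counts by resting blood pressure group and heart disease status,
# copied from the summary table in this report
bp_counts <- data.frame(
  trestbps = c("HIGH", "HIGH", "LOW", "LOW"),
  Class    = c("NO", "YES", "NO", "YES"),
  n        = c(87, 81, 75, 58)
)

# position = "fill" rescales each bar to height 1, so the fill shows proportions
bp_prop_plot <- ggplot(bp_counts, aes(x = trestbps, y = n, fill = Class)) +
  geom_col(position = "fill") +
  labs(x = "Resting blood pressure of patients",
       y = "Proportion of patients",
       fill = "Heart disease")

bp_prop_plot
```

The same pattern would apply to the cholesterol groups by swapping in the corresponding counts.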


## Computing Estimates of the Parameter

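Below, we create four data frames that summarize the data (the joint and conditional proportions for each explanatory variable) and a null distribution for the difference in proportions by resting blood pressure.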

In [42]:
#total number of patients in the cleaned data
heart_data_n <- nrow(heart_data)

#joint proportions of the four combinations of low or high resting blood pressure 
#and presence or absence of heart disease
heart_data_prop_trestbps_0 <- 
    heart_data %>% 
    group_by(trestbps, Class) %>%
    count() %>% 
    mutate(p = n / heart_data_n)

#joint proportions of the four combinations of low or high cholesterol and 
#presence or absence of heart disease
heart_data_prop_chol_0 <- 
    heart_data %>% 
    group_by(chol, Class) %>%
    count() %>% 
    mutate(p = n / heart_data_n)

heart_data_prop_trestbps_0
heart_data_prop_chol_0

#proportion of patients with heart disease within the HIGH and LOW resting blood pressure groups
#(counts taken from heart_data_prop_trestbps_0)
Trestbps <- c("HIGH", "LOW")
Class <- c("YES", "YES")
P <- c(81 / (87 + 81), 58 / (75 + 58))
heart_data_prop_trestbps_1 <- data.frame(Trestbps, Class, P)
print(heart_data_prop_trestbps_1)

#proportion of patients with heart disease within the HIGH and LOW cholesterol groups
#(counts taken from heart_data_prop_chol_0)
Chol <- c("HIGH", "LOW")
P <- c(79 / (79 + 72), 60 / (60 + 90))
heart_data_prop_chol_1 <- data.frame(Chol, Class, P)
print(heart_data_prop_chol_1)

#null distribution of the difference in proportions (HIGH - LOW) for resting blood pressure,
#from 1000 permutations ("shuffles") under the null hypothesis of independence
null_distribution <- heart_data %>% 
  specify(formula = Class ~ trestbps, success = "YES") %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "diff in props", order = c("HIGH", "LOW"))

head(null_distribution)

| trestbps (chr) | Class (chr) | n (int) | p (dbl) |
|---|---|---|---|
| HIGH | NO | 87 | 0.2890365 |
| HIGH | YES | 81 | 0.269103 |
| LOW | NO | 75 | 0.2491694 |
| LOW | YES | 58 | 0.192691 |


| chol (chr) | Class (chr) | n (int) | p (dbl) |
|---|---|---|---|
| HIGH | NO | 72 | 0.2392027 |
| HIGH | YES | 79 | 0.2624585 |
| LOW | NO | 90 | 0.2990033 |
| LOW | YES | 60 | 0.1993355 |


| Trestbps | Class | P |
|---|---|---|
| HIGH | YES | 0.4821429 |
| LOW | YES | 0.4360902 |

| Chol | Class | P |
|---|---|---|
| HIGH | YES | 0.5231788 |
| LOW | YES | 0.4000000 |


| replicate (int) | stat (dbl) |
|---|---|
| 1 | -0.06171679 |
| 2 | -0.00783208 |
| 3 | -0.08865915 |
| 4 | -0.06171679 |
| 5 | 0.04605263 |
| 6 | 0.04605263 |


The third and fourth tables show that, in this sample, the proportion of patients with heart disease is higher in the HIGH group than in the LOW group for both resting blood pressure (0.482 vs. 0.436) and cholesterol (0.523 vs. 0.400). Whether these differences are larger than what chance alone would produce is what the planned hypothesis tests are meant to determine.
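The conditional proportions in the third and fourth tables can also be derived from the counts rather than typed in by hand. The following is a minimal sketch in base R, assuming the counts shown in the first two tables; the object names are illustrative and not part of the analysis code.

```r
# Counts by group (rows) and heart disease status (columns),
# copied from the first two tables above
bp_counts <- matrix(c(81, 87,
                      58, 75),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(trestbps = c("HIGH", "LOW"),
                                    Class = c("YES", "NO")))
chol_counts <- matrix(c(79, 72,
                        60, 90),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(chol = c("HIGH", "LOW"),
                                      Class = c("YES", "NO")))

# Proportion of patients with heart disease within each group
bp_prop_yes <- bp_counts[, "YES"] / rowSums(bp_counts)
chol_prop_yes <- chol_counts[, "YES"] / rowSums(chol_counts)

# Observed differences in proportions (HIGH minus LOW)
bp_diff <- unname(bp_prop_yes["HIGH"] - bp_prop_yes["LOW"])
chol_diff <- unname(chol_prop_yes["HIGH"] - chol_prop_yes["LOW"])
```

These differences are the observed test statistics that a null distribution of `"diff in props"` values would be compared against.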

## Methods

### Reliability of Research

- reliable, academic data source
- large sample size (301 patients after cleaning)
- strict adherence to statistical rules
- analysis performed with scientific tools

### Limitation of Preliminary Research

- plots and estimates above were generated using a single sample.
- cannot compute sampling variation that quantifies the uncertainty about the point estimate obtained from the sample
- cannot be confident enough that the point estimate from the sample is close to the population parameter

### Future Research Plan

- generate bootstrap samples from our original sample, which is our dataset.
- use R to compute sample proportions from the bootstrap samples and generate a bootstrap sampling distribution.
- conduct hypothesis tests at a 5% significance level to test the claim that people with high blood pressure or high cholesterol levels tend to have a higher chance of having heart disease.
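The bootstrap step in this plan could look roughly like the sketch below. It is written in base R, rebuilds each blood pressure group as a 0/1 vector from the counts reported in this proposal, and resamples within each group. The 95% percentile interval, the 1000 replicates, and all object names are assumptions made for illustration, not choices fixed by the plan.

```r
set.seed(1)

# 1 = heart disease, 0 = no heart disease, rebuilt from the reported counts
bp_high <- rep(c(1, 0), times = c(81, 87))  # HIGH resting blood pressure group
bp_low  <- rep(c(1, 0), times = c(58, 75))  # LOW resting blood pressure group

# Each replicate resamples both groups with replacement and records
# the difference in sample proportions (HIGH minus LOW)
boot_diffs <- replicate(1000, {
  mean(sample(bp_high, replace = TRUE)) - mean(sample(bp_low, replace = TRUE))
})

# Bootstrap standard error and a percentile confidence interval (95% assumed)
boot_se <- sd(boot_diffs)
boot_ci <- unname(quantile(boot_diffs, probs = c(0.025, 0.975)))
```

The spread of `boot_diffs` gives the measure of sampling variation that a single point estimate cannot provide.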


#### Hypothesis Test 1

- p1 : proportion of people having heart disease with low blood pressure
- p2 : proportion of people having heart disease with high blood pressure

H_0 : p2 - p1 = 0 (the proportion with heart disease is the same in both groups) <br>
H_1 : p2 - p1 > 0 (the proportion with heart disease is higher with high blood pressure) <br>

#### Hypothesis Test 2

- p3 : proportion of people having heart disease with low cholesterol level
- p4 : proportion of people having heart disease with high cholesterol level

H_0 : p4 - p3 = 0 (the proportion with heart disease is the same in both groups) <br>
H_1 : p4 - p3 > 0 (the proportion with heart disease is higher with high cholesterol level) <br>
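One way the two tests could be carried out is a permutation test on the difference in proportions, in the same spirit as the shuffling used for the null distribution earlier. The sketch below is a base R illustration under stated assumptions: `perm_p_value()` is a hypothetical helper, the counts are those reported in this proposal, and the p-value is one-sided, counting shuffles whose HIGH minus LOW difference is at least as large as the observed one.

```r
set.seed(1)

# Hypothetical helper: one-sided permutation p-value for a difference in proportions
perm_p_value <- function(yes_high, n_high, yes_low, n_low, reps = 1000) {
  n_yes <- yes_high + yes_low
  n_total <- n_high + n_low
  outcome <- rep(c(1, 0), times = c(n_yes, n_total - n_yes))
  observed <- yes_high / n_high - yes_low / n_low

  null_diffs <- replicate(reps, {
    shuffled <- sample(outcome)  # break any link between group and outcome
    mean(shuffled[seq_len(n_high)]) - mean(shuffled[n_high + seq_len(n_low)])
  })

  # Small tolerance so that ties are not lost to floating-point rounding
  mean(null_diffs >= observed - 1e-12)
}

# Test 1: resting blood pressure (HIGH: 81 of 168, LOW: 58 of 133)
p_value_bp <- perm_p_value(81, 168, 58, 133)

# Test 2: cholesterol (HIGH: 79 of 151, LOW: 60 of 150)
p_value_chol <- perm_p_value(79, 151, 60, 150)
```

Each p-value would then be compared with the chosen significance level to decide whether to reject the corresponding null hypothesis.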
 
### Significance and Expectation

- We expect that people with low blood pressure and low cholesterol levels have a lower chance of developing heart disease. 
- Our findings could potentially help raise awareness of the importance of keeping a healthy routine that keeps our blood pressure and cholesterol at a healthy level. 
- Our research could lead to a future question: given the association we aim to establish in our report, what are the causes of heart disease?

### References

[1] World Health Organization. (n.d.). Cardiovascular diseases. World Health Organization. Retrieved July 26, 2022, from https://www.who.int/health-topics/cardiovascular-diseases <br>
[2] Centers for Disease Control and Prevention. (2020, April 21). Prevent heart disease. Centers for Disease Control and Prevention. Retrieved July 26, 2022, from https://www.cdc.gov/heartdisease/prevention.htm