In [1]:
library(tidyverse) |> suppressMessages()
library(repr) |> suppressMessages()
library(tidymodels) |> suppressMessages()
library(cowplot) |> suppressMessages()
options(repr.matrix.max.rows = 6)
#source('tests.R')
#source("cleanup.R")

# Predicting the risk of heart attack

## INTRODUCTION

## METHODS AND RESULTS

#### Reading the data

First, we read the data from our GitHub repository using the read_delim function. Beforehand, we downloaded the original data from the UCI Machine Learning Repository and copied it into our GitHub repository so that it is easy to access. The file has no header row, so we read it with col_names=FALSE.

In [9]:
heart_attack_data_raw = read_delim("https://raw.githubusercontent.com/RichardAdhika22/group115/main/processed.hungarian%20(1).data",
                                   delim=",", col_names=FALSE) |> suppressMessages()

heart_attack_data_raw

X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<chr>,<chr>,<chr>,<dbl>
28,1,2,130,132,0,2,185,0,0,?,?,?,0
29,1,2,120,243,0,0,160,0,0,?,?,?,0
29,1,2,140,?,0,0,170,0,0,?,?,?,0
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
56,1,4,155,342,1,0,150,1,3,2,?,?,1
58,0,2,180,393,0,0,110,1,1,2,?,7,1
65,1,4,130,275,0,1,115,1,1,2,?,?,1
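
Several columns in the output above were read as character because the file marks missing values with "?". As a minimal sketch, assuming readr 2.0 or later, the marker could instead be declared at read time so that those columns are parsed as numbers with NA. The three-row sample and the shortened column list here are hypothetical and only mimic the layout of the raw file.

```r
library(tidyverse)

# Hypothetical sample in the same comma-separated layout as the raw file (not the real data)
sample_text = "28,1,2,130,132,0\n29,1,2,140,?,0\n58,0,2,180,393,1\n"

sample_tidy = read_csv(
    I(sample_text),                                   # I() marks the string as literal data
    col_names = c("age", "sex", "cp", "trestbps", "chol", "num"),
    na = "?",                                         # treat "?" as a missing value
    show_col_types = FALSE
) |>
    drop_na() |>                                      # drop rows with any missing value
    mutate(across(c(sex, num), as.factor))            # categorical columns become factors

sample_tidy
```

With this approach no character-to-numeric conversion is needed afterwards, because "?" never reaches the parsed columns.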


#### Cleaning and tidying the data

The dataframe above has no column names, so we add them according to the attribute descriptions in the UCI Machine Learning Repository. There are also missing values, marked by "?" in the table. We drop the slope, ca, and thal columns, then remove every remaining row that contains a "?". Finally, we convert the columns that were read as character to numeric and change "sex" and "num" from double to factor, so that we can do classification on the variable "num" (tidymodels classification models require the outcome variable to be a factor).

In [10]:
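# Name the columns following the attribute list in the UCI repository
colnames(heart_attack_data_raw) = c('age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num')
# Drop slope, ca and thal, then remove rows with a "?" in any remaining column
heart_attack_data_tidy = select(heart_attack_data_raw,-slope,-ca,-thal)
heart_attack_data_tidy = filter_all(heart_attack_data_tidy,all_vars(.!="?"))
# Convert character columns to numeric, then make num and sex factors
heart_attack_data_tidy = mutate_if(heart_attack_data_tidy, is.character,as.numeric) |> mutate(num = as.factor(num)) |> mutate(sex = as.factor(sex))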

heart_attack_data_tidy


age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,num
<dbl>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
28,1,2,130,132,0,2,185,0,0,0
29,1,2,120,243,0,0,160,0,0,0
30,0,1,170,237,0,1,170,0,0,0
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
56,1,4,155,342,1,0,150,1,3,1
58,0,2,180,393,0,0,110,1,1,1
65,1,4,130,275,0,1,115,1,1,1
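
The filter_all and mutate_if helpers used in the cleaning step are superseded in current dplyr. As a sketch only, assuming dplyr 1.0.4 or later, the same row filter and type conversion could be written with if_all and across. The toy table below is hypothetical and stands in for the raw data.

```r
library(tidyverse)

# Hypothetical toy table: a character column uses "?" to mark a missing value
toy_raw = tibble(
    age  = c(28, 29, 58),
    chol = c("132", "?", "393"),
    num  = c(0, 0, 1)
)

toy_tidy = toy_raw |>
    filter(if_all(everything(), ~ .x != "?")) |>        # keep rows with no "?" in any column
    mutate(across(where(is.character), as.numeric)) |>  # character columns to numeric
    mutate(num = as.factor(num))                        # outcome as a factor

toy_tidy
```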


#### Splitting the data into training and testing set

As a rule of thumb, we should always split the data into a training set and a testing set before building a predictive model. Here we put 75% of the rows in the training set, stratified by num so that both sets keep a similar class balance. All of the exploratory analysis and the model below are built from the training data only. The testing data is used only to evaluate the accuracy of our predictions and summarize how well the model performs.

In [12]:
heart_attack_split = initial_split(heart_attack_data_tidy, prop = 0.75, strata = num)
heart_attack_training = training(heart_attack_split)
heart_attack_testing = testing(heart_attack_split)
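
Because initial_split draws rows at random, rerunning the cell above can produce a different split each time. A minimal sketch of how the split could be made repeatable and how the stratification could be checked is shown here. The seed value and the toy data are assumptions for illustration and are not part of the original analysis.

```r
library(tidymodels)

# The seed value is an arbitrary assumption; any fixed integer makes the split repeatable
set.seed(2024)

# Hypothetical toy data with a 60/40 class balance (not the heart data)
toy = tibble(
    x   = rnorm(200),
    num = factor(rep(c(0, 1), times = c(120, 80)))
)

toy_split = initial_split(toy, prop = 0.75, strata = num)
toy_training = training(toy_split)
toy_testing = testing(toy_split)

# Class proportions in each piece should stay close to the original 60/40
toy_training |> count(num) |> mutate(prop = n / sum(n))
toy_testing |> count(num) |> mutate(prop = n / sum(n))
```

Setting the seed once near the top of the notebook, rather than before every random step, is usually enough to make the whole analysis reproducible.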

#### Selecting only the data used in the prediction

Here, we choose five variables as our predictors: age, trestbps, chol, thalach, and oldpeak. We also keep num, which is the class we want to predict. We chose these five because they are true quantitative measurements. Several other variables also appear as double type in the table above, but they are categorical codes that should really be factors, so treating their values as numbers would be misleading.

In [7]:
heart_attack_data_selected = heart_attack_training |>
    select(age,trestbps,chol,thalach,oldpeak,num)

heart_attack_data_selected

age,trestbps,chol,thalach,oldpeak,num
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
29,120,243,160,0,0
30,170,237,170,0,0
31,100,219,150,0,0
⋮,⋮,⋮,⋮,⋮,⋮
52,160,331,94,2.5,1
56,155,342,150,3.0,1
65,130,275,115,1.0,1


In [8]:
# Total number of rows in the training data
total_amount = heart_attack_data_selected |>
    summarize("total"=n())|>
    pull()

# Count and percentage of training observations in each class of num
table_num_count = heart_attack_data_selected |>
    group_by(num) |>
    summarize("total_number_of_num" = n())|>
    mutate("percentage" = total_number_of_num/total_amount*100)

table_num_count 

num,total_number_of_num,percentage
<fct>,<int>,<dbl>
0,122,62.5641
1,73,37.4359
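
The table above reports how many training observations fall in each class of num and what percentage of the training set each class represents. As a sketch only, the same class-balance summary could be computed in a single pipeline with count. The five labels below are hypothetical and serve only to illustrate the calculation.

```r
library(tidyverse)

# Hypothetical labels only, to illustrate the calculation (not the heart data)
toy_labels = tibble(num = factor(c(0, 0, 0, 1, 1)))

toy_counts = toy_labels |>
    count(num, name = "total_number_of_num") |>                                  # rows per class
    mutate(percentage = total_number_of_num / sum(total_number_of_num) * 100)   # share of all rows

toy_counts
```

Dividing by the sum of the counts removes the need for a separate total variable.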
