Heart Failure Prediction

The main focus of this project is to determine which factors contribute significantly to heart failure. The data we obtained includes factors such as diabetes, high blood pressure, age, sex, smoking status, and so on. Our goal is to use these factors as predictors of whether someone should receive medical attention immediately. If a prediction indicates death, doctors should prioritize that case to prevent it; if it indicates survival, precautions can be taken to keep the patient from falling into the categories that might lead to death.

We obtained this dataset from Kaggle, where it was released by user LARXEL in 2020.

In [None]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
#install kknn only if it is not already available
if (!requireNamespace("kknn", quietly = TRUE)) install.packages("kknn")

In [None]:
#Use read_csv to load the dataset into R
data <- read_csv("data/heart_failure_clinical_records_dataset.csv.xls")
data

In [None]:
#Find number of rows and columns of dataset.
data_nrow <- nrow(data)
data_nrow

data_ncol <- ncol(data)
data_ncol

This dataset consists of 299 rows and 13 columns. The DEATH_EVENT variable takes the values 0 (survived) and 1 (died). As stated above, we will use the records of patients who survived or died to determine whether someone needs to seek medical attention immediately. We will use some of these columns to explore our data. If the data turns out to be insufficient, we will gather more from other sources such as ourworldindata.org.

In [None]:
#wrangle the data by selecting the predictors we will work with,
#converting the response variable DEATH_EVENT to a factor named survived (0 = "Yes", 1 = "No"),
#and converting the 0/1 predictor variables to logical so they show as TRUE or FALSE

data_wrangled <- data |> 
            select(age, diabetes, ejection_fraction, high_blood_pressure, smoking, DEATH_EVENT) |>
            mutate(DEATH_EVENT = as_factor(DEATH_EVENT)) |>
            mutate(DEATH_EVENT = fct_recode(DEATH_EVENT, "Yes" = "0", "No" = "1")) |>
            rename("survived" = "DEATH_EVENT") |>
            mutate(diabetes = as.logical(diabetes)) |>
            mutate(high_blood_pressure = as.logical(high_blood_pressure)) |>
            mutate(smoking = as.logical(smoking))    

data_wrangled

For now, we will focus on these 5 predictors (age, diabetes, ejection fraction, high blood pressure, and smoking) to see whether they are associated with heart failure outcomes.

In [None]:
#set the random seed in R
set.seed(1)

#split the data into training and testing set
heart_split <- initial_split(data_wrangled, prop = 0.75, strata = survived)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

heart_train
heart_test

In [None]:
#the number of observations and the number of missing values in the training set
train_nrow <- nrow(heart_train)
any_missing_data <- sum(is.na(heart_train))
tibble(train_nrow, any_missing_data)

Our training data consists of 224 rows and has no missing values.

In [None]:
#count the patients who survived (Yes) and who died (No) in the training data set
survive_or_death_summarize <- heart_train |> 
                           group_by(survived) |>
                           summarise(count = n())
survive_or_death_summarize

Out of the 224 rows of training data, 152 are people who survived and 72 are people who died.
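
To see what these counts mean for the accuracy figures we will compare later, the short sketch below converts them into class proportions. The counts are taken from the summary above; the variable names are our own choice for illustration. It suggests that a classifier that always predicts "survived" would already be correct for roughly two thirds of the training rows, so any model should be judged against that baseline.

```r
# Class counts reported for the training set above
class_counts <- c(survived = 152, died = 72)

# Share of each class, in percent
class_percent <- round(100 * class_counts / sum(class_counts), 1)
class_percent

# Accuracy of a naive classifier that always predicts the majority class
majority_baseline <- max(class_counts) / sum(class_counts)
majority_baseline
```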

In [None]:
#summarize variables that are double
summarize_age <- heart_train |>
                 summarize(variable = "age", 
                           min = min(age), 
                           max = max(age), 
                           mean = mean(age))
summarize_ejection_fraction <- heart_train |>
                               summarize(variable = "ejection_fraction", 
                                         min = min(ejection_fraction), 
                                         max = max(ejection_fraction),
                                         mean = mean(ejection_fraction))

train_summarized <- rbind(summarize_age, summarize_ejection_fraction)
train_summarized

We summarized the numeric (double) variables to see their min, max, and mean. We did not do this for the logical variables, since each of their observations is just TRUE or FALSE.
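
Although min and max are not informative for logical variables, their mean is: it equals the proportion of TRUE values. The sketch below shows one possible way to compute it. The `toy_train` tibble and its values are made up so the cell runs on its own; in this notebook one would pass `heart_train` instead.

```r
library(dplyr)

# Toy stand-in with the same logical columns as heart_train (values are made up)
toy_train <- tibble(
  diabetes = c(TRUE, FALSE, FALSE, TRUE),
  high_blood_pressure = c(FALSE, FALSE, TRUE, FALSE),
  smoking = c(TRUE, TRUE, FALSE, TRUE)
)

# The mean of a logical column is the proportion of TRUE values
logical_summary <- toy_train |>
  summarize(across(where(is.logical), mean))
logical_summary
```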

Now, using our training data, we will check whether any two predictors together show a relationship that indicates heart failure (2 predictors), or whether any single predictor indicates it on its own (1 predictor). Here are some plots:

In [None]:
#set our graph to a proper size for visualization
options(repr.plot.width = 8, repr.plot.height = 6)

#visualize the data with a scatter plot
train_graph_point <- heart_train |>
            ggplot(aes(x = age, y = ejection_fraction, color = survived)) +
            geom_point() + 
            labs(x = "Age", 
                 y = "Percentage of Blood Leaving the Heart at Each Contraction (%)",
                 color = "Survived?") +
            theme(text = element_text(size = 12)) +
            ggtitle("Is There A Relationship Between Percentage Of Blood Leaving \n The Heart At Each Contraction and Age?") 
train_graph_point

We hypothesized that age and the percentage of blood leaving the heart at each contraction (the ejection fraction) are related, so we plotted them against each other. The relationship is fairly weak, but we can see that those who survived are generally younger and tend to have an average or above-average ejection fraction.

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6)
train_graph <- heart_train |>
            ggplot(aes(x = age, fill = survived)) +
            geom_histogram() + 
            labs(x = "Age",
                 fill = "Survived?") +
            theme(text = element_text(size = 14)) +
            ggtitle("Distribution of Died or Survived according to Age")
train_graph

This histogram does not show a clear relationship either, but note that at older ages those who died outnumber those who survived (for most bars), even though the number of people who died is almost half the number who survived (recall: 152 survived, 72 died). We can tell that age could be a factor, but there are clearly other factors that influence heart failure too.
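
Because the two classes differ in size, raw bar heights can be hard to compare. One possible way to account for this is to compute the share of patients who died within each age band. The sketch below assumes a hypothetical helper, `death_share_by_age()`, and a 10-year band width, neither of which is part of the analysis above; the `toy` tibble holds made-up values so the cell runs on its own and says nothing about the real data.

```r
library(dplyr)

# Hypothetical helper: share of patients who died within each age band
death_share_by_age <- function(df, width = 10) {
  df |>
    mutate(age_band = floor(age / width) * width) |>
    group_by(age_band) |>
    summarize(n = n(),
              share_died = mean(survived == "No"),
              .groups = "drop")
}

# Toy stand-in for heart_train (values are made up for illustration)
toy <- tibble(
  age = c(45, 48, 52, 58, 63, 67, 72, 78),
  survived = factor(c("Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"),
                    levels = c("Yes", "No"))
)

age_band_summary <- death_share_by_age(toy)
age_band_summary
```

With the real data, calling `death_share_by_age(heart_train)` would give one row per age band, which can be read alongside the histogram.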

In [None]:
options(repr.plot.width = 8, repr.plot.height = 6)
train_graph_2 <- heart_train |>
            ggplot(aes(x = ejection_fraction, fill = survived)) +
            geom_histogram() + 
            labs(x = "Percentage of Blood Leaving the Heart (%)",
                 fill = "Survived?") +
            theme(text = element_text(size = 14)) +
            ggtitle("Distribution of Died or Survived according to Percentage of \n Blood Leaving the Heart")
train_graph_2

This histogram does not show a clear relationship either, but overall we can see that a lower percentage of blood leaving the heart is associated with more deaths.

Since this is just the exploratory phase, we chose to focus mainly on these two predictor variables to get on the right track; we will work through the remaining three predictor variables in the next few weeks.

We will conduct our data analysis with all 5 of our predictor variables, find the most suitable k-value for each model we fit, and see which one yields the best accuracy. We will then use the most accurate model to predict which class a patient belongs to and determine whether they need immediate medical attention.
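
The plan above could look roughly like the sketch below. It assumes a K-nearest neighbors classifier (suggested by the kknn package installed at the top of this notebook), 5-fold cross-validation, and a grid of odd k-values from 1 to 15; these choices are ours for illustration and are not fixed by the proposal. It uses only the two numeric predictors and a simulated `toy_train` tibble so that the cell runs on its own. The simulated labels are random, so the resulting accuracies carry no meaning for the real data.

```r
library(tidymodels)

set.seed(1)

# Simulated stand-in for heart_train (random values, for illustration only)
toy_train <- tibble(
  age = runif(120, 40, 95),
  ejection_fraction = runif(120, 15, 80),
  survived = factor(sample(c("Yes", "No"), 120, replace = TRUE),
                    levels = c("Yes", "No"))
)

# Standardize the predictors so that neither dominates the distance calculation
knn_recipe <- recipe(survived ~ age + ejection_fraction, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K-nearest neighbors model with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation, stratified by the response variable
folds <- vfold_cv(toy_train, v = 5, strata = survived)

# Candidate k-values (assumed grid)
k_grid <- tibble(neighbors = seq(1, 15, by = 2))

# Estimate the cross-validated accuracy for each candidate k
knn_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# Pick the k with the highest mean accuracy
best_k <- knn_results |>
  arrange(desc(mean)) |>
  slice(1) |>
  pull(neighbors)
best_k
```

In the actual analysis, `toy_train` would be replaced by `heart_train`, and the logical predictors would need to be converted to numeric before scaling.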