# Heart Failure Project

## Introduction:

"Cardiovascular diseases (CVDs) are the number 1 cause of death globally" (LARXEL, Kaggle). The toll is deeply concerning: families lose loved ones, and many patients spend much of their lives in hospital. The goal of this project is to identify factors associated with death from heart failure and to build a model that predicts which patients are most at risk, so that they can get the medical attention they require. Our background research found that "[heart failure] is predominantly a disorder of aging, with prevalence rates increasing exponentially from less than 1% in the population under age 50 to about 10% in individuals over the age of 80" (Rich, The Journals of Gerontology: Series A), and that a "large meta-analysis that included community-based studies and trials observed lower mortality in HFpEF compared with HFrEF" (Bourlag and Colucci, UpToDate), where "HFrEF [means the heart's ejection fraction is] less than or equal to 40% [and] HFpEF [is when ejection fraction] is greater than or equal to 50%" (Hajouli and Ludhwani, National Library of Medicine).

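The ejection fraction thresholds quoted above can be written as a small helper. This is a minimal sketch: `classify_ef()` is a hypothetical name introduced here for illustration, and the label for values strictly between 40% and 50% is our own placeholder, since the quoted definition does not cover that range.

```r
# Thresholds come from the Hajouli and Ludhwani quote above.
# classify_ef() is a hypothetical helper, not part of any package.
classify_ef <- function(ejection_fraction) {
  ifelse(
    ejection_fraction <= 40, "HFrEF",
    # Values strictly between 40 and 50 are not defined in the quote,
    # so they get a neutral placeholder label (an assumption).
    ifelse(ejection_fraction >= 50, "HFpEF", "between 40% and 50%")
  )
}

# Example ejection fraction values (in percent)
classify_ef(c(20, 38, 45, 60))
```

Because `ifelse()` is vectorized, the same helper could be applied to a whole ejection fraction column at once.
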
This leads to our question:

**To what degree do age and ejection fraction contribute to death from heart failure? Are these two factors correlated, or are they independent of one another?**

The analysis below addresses these questions.

We obtained a dataset from Kaggle, released by the user LARXEL in 2020. It includes variables such as age, sex, ejection fraction, smoking status, diabetes, and high blood pressure. We focus on age and ejection fraction in this project because our background research suggests that both may be related to mortality.
The dataset also records whether each patient (observation) died or survived. We will perform K-nearest neighbors (KNN) classification with this outcome as the class label (response variable) and with age and ejection fraction as the predictors. The aim is to predict whether a patient needs immediate medical attention. A prediction of death suggests that doctors should prioritize the case to try to prevent it; a prediction of survival suggests preventive care to keep the patient out of the categories associated with a higher risk of death.

In [1]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
install.packages("kknn")

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2

In [2]:
read_data_from_link <- read_csv("https://raw.githubusercontent.com/KristenisHuaiyi/Data_Science_Project/main/data/heart_failure_clinical_records_dataset.csv.xls")
read_data_from_link

Rows: 299 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (13): age, anaemia, creatinine_phosphokinase, diabetes, ejection_fractio...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


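| age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 75 | 0 | 582 | 0 | 20 | 1 | 265000 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 55 | 0 | 7861 | 0 | 38 | 0 | 263358 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 65 | 0 | 146 | 0 | 20 | 0 | 162000 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 45 | 0 | 2060 | 1 | 60 | 0 | 742000 | 0.8 | 138 | 0 | 0 | 278 | 0 |
| 45 | 0 | 2413 | 0 | 38 | 0 | 140000 | 1.4 | 140 | 1 | 1 | 280 | 0 |
| 50 | 0 | 196 | 0 | 45 | 0 | 395000 | 1.6 | 136 | 1 | 1 | 285 | 0 |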
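
To illustrate how the second half of our question (whether age and ejection fraction are correlated) could be checked, here is a minimal sketch using base R's `cor()`. It uses only the six rows visible in the preview above, so the value it produces says nothing reliable about the full 299-row dataset; the data frame name `preview` is introduced here for illustration only.

```r
# The six rows visible in the printed preview above
# (an illustration only, not the full 299-row dataset).
preview <- data.frame(
  age = c(75, 55, 65, 45, 45, 50),
  ejection_fraction = c(20, 38, 20, 60, 38, 45)
)

# Pearson correlation coefficient: values near 0 suggest no linear
# relationship, values near -1 or 1 suggest a strong linear relationship.
r <- cor(preview$age, preview$ejection_fraction, method = "pearson")
r
```

On the full data, the same call could presumably be applied to the `age` and `ejection_fraction` columns of `read_data_from_link`.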


In [3]:
# Wrangle the data:
# - keep the two predictors (age, ejection_fraction) and the response (DEATH_EVENT)
# - convert DEATH_EVENT to a factor and relabel it: 0 (no death event) -> "Yes", 1 (death event) -> "No"
# - rename the relabeled response to "survived"

data_wrangled <- read_data_from_link |>
            select(age, ejection_fraction, DEATH_EVENT) |>
            mutate(DEATH_EVENT = as_factor(DEATH_EVENT)) |>
            mutate(DEATH_EVENT = fct_recode(DEATH_EVENT, "Yes" = "0", "No" = "1")) |>
            rename(survived = DEATH_EVENT)

data_wrangled

| age | ejection_fraction | survived |
|---|---|---|
| `<dbl>` | `<dbl>` | `<fct>` |
| 75 | 20 | No |
| 55 | 38 | No |
| 65 | 20 | No |
| ⋮ | ⋮ | ⋮ |
| 45 | 60 | Yes |
| 45 | 38 | Yes |
| 50 | 45 | Yes |
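
The K-nearest neighbors step described in the introduction can be sketched on the six wrangled rows visible above. This is an illustration under stated assumptions, not the model this project fits: it uses `class::knn()` from R's recommended packages rather than the `kknn` engine installed earlier, the new patient's values are made up, and `k = 3` is an arbitrary choice.

```r
library(class)  # provides knn(); ships with standard R installations

# The six rows visible in the wrangled output above (not the full dataset).
sample_rows <- data.frame(
  age = c(75, 55, 65, 45, 45, 50),
  ejection_fraction = c(20, 38, 20, 60, 38, 45),
  survived = factor(c("No", "No", "No", "Yes", "Yes", "Yes"))
)

# Hypothetical new patient (made-up values for illustration).
new_patient <- data.frame(age = 70, ejection_fraction = 25)

# Standardize both predictors using the training means and standard
# deviations, so that age (years) and ejection fraction (percent)
# contribute to the distance on a comparable scale.
predictors <- c("age", "ejection_fraction")
centers <- colMeans(sample_rows[, predictors])
scales <- apply(sample_rows[, predictors], 2, sd)
train_scaled <- scale(sample_rows[, predictors], center = centers, scale = scales)
new_scaled <- scale(new_patient[, predictors], center = centers, scale = scales)

# Classify the new patient by majority vote among the k = 3 nearest rows.
prediction <- knn(
  train = train_scaled,
  test = new_scaled,
  cl = sample_rows$survived,
  k = 3
)
prediction
```

Standardizing matters here because KNN relies on distances: without it, whichever predictor has the larger numeric spread would dominate the choice of neighbors.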
