TITLE: Predicting whether an individual's fasting blood sugar level is higher than 120 mg/dL using age and serum cholesterol levels

My topic is predicting whether a person's fasting blood sugar is higher than 120 mg/dL from two variables: age and serum cholesterol. My question is therefore: "Can age and serum cholesterol predict whether an individual has a fasting blood sugar level higher than 120 mg/dL, using the Cleveland database?"

In [2]:
library(tidyverse)
library(cowplot)
library(scales)
library(readr)
library(repr)
library(tidymodels)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.4
✔ lubridate 1.9.2     ✔ stringr   1.5.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard()    masks scales::discard()
✖ dplyr::filter()     masks stats::filter()
✖ stringr::fixed()    masks recipes::fixed()
✖ dplyr::lag()        masks stats::lag()
✖ readr::spec()       masks yardstick::spec()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

First, I load the data using a relative path. The values in the Cleveland data file are separated by commas, so read_csv is the appropriate read_* function. The file has no header row, so I set col_names = FALSE.

In [3]:
training_data <- read_csv("data/heart_disease/processed.cleveland.data", col_names = FALSE)
head(training_data)

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>
63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0
56,1,2,120,236,0,0,178,0,0.8,1,0.0,3.0,0
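
The column specification above reports X12 and X13 as character columns even though their values look numeric. A likely cause is that the file marks missing values with "?". As a minimal sketch (not what this notebook does), the columns could be named and the "?" markers converted to NA at read time. The third row below is made up purely to show the missing-value handling, and the column names are my assumption based on the UCI attribute list for this dataset.

```r
library(readr)

# Two rows copied from the output above, plus one MADE-UP row (the third)
# containing "?" purely to show how missing values are handled.
raw_text <- paste(
  "63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0",
  "67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2",
  "50,0,3,120,200,0,0,150,0,1.0,1,?,3.0,0",
  sep = "\n"
)

# Assumed attribute names, following the UCI documentation for this dataset.
cleveland_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                     "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

demo_data <- read_csv(
  I(raw_text),                  # I() marks the string as literal data, not a path
  col_names = cleveland_names,  # name columns at read time instead of X1..X14
  na = "?",                     # treat "?" as a missing value
  show_col_types = FALSE        # silence the column specification message
)

demo_data
```

With the real file, the same col_names and na arguments would be passed to the read_csv call on "data/heart_disease/processed.cleveland.data". This analysis only uses X1, X5, and X6, so the character columns do not affect the steps that follow.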


Second, I rename the columns X1, X5, and X6 to "age", "cholestoral", and "fasting_blood_sugar". Then I select the three renamed columns and filter for ages over 55, because I only want to use individuals older than 55 (the condition age > 55 excludes 55-year-olds). Finally, I use mutate to convert the fasting_blood_sugar column to a factor with as_factor().

In [4]:
selected_data <- training_data |>
    rename(age = X1, cholestoral = X5, fasting_blood_sugar = X6) |>
    select(fasting_blood_sugar, cholestoral, age) |>
    filter(age > 55) |>  # keeps ages 56 and up; age 55 itself is excluded
    mutate(fasting_blood_sugar = as_factor(fasting_blood_sugar))  # outcome as a factor
head(selected_data)

fasting_blood_sugar,cholestoral,age
<fct>,<dbl>,<dbl>
1,233,63
0,286,67
0,229,67
0,236,56
0,268,62
0,354,57
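
The age cutoff depends on the comparison operator used in filter. The sketch below uses a few hypothetical ages (not taken from the Cleveland data) to show the difference between age > 55, which the cell above uses, and age >= 55, which would also keep 55-year-olds.

```r
library(dplyr)

# Hypothetical ages chosen to straddle the cutoff.
ages <- tibble(age = c(54, 55, 56, 63))

older_than_55 <- ages |> filter(age > 55)    # same condition as the cell above
aged_55_and_up <- ages |> filter(age >= 55)  # alternative that keeps age 55

older_than_55$age
aged_55_and_up$age
```

Only the second version retains the row with age 55, so the choice of operator should match the population the analysis is meant to describe.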


I then recode the levels of the fasting_blood_sugar column to make them clearer. Inside mutate, I use recode to change "1" to "over", indicating a fasting blood sugar level over 120 mg/dL, and "0" to "below", indicating a level that is not over 120 mg/dL.

In [5]:
mutated_data <- selected_data |>
    mutate(fasting_blood_sugar = recode(fasting_blood_sugar, `1` = "over", `0` = "below"))
head(mutated_data)

fasting_blood_sugar,cholestoral,age
<fct>,<dbl>,<dbl>
over,233,63
below,286,67
below,229,67
below,236,56
below,268,62
below,354,57
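
As an alternative sketch of the same relabelling, forcats::fct_recode can rename factor levels directly (as far as I know, recode is marked as superseded in recent dplyr releases). The toy tibble below is a hypothetical stand-in for selected_data. Counting the rows per class afterwards is a quick way to see how balanced the outcome is.

```r
library(dplyr)
library(forcats)

# Hypothetical stand-in for selected_data, with the 0/1 codes stored as a factor.
toy <- tibble(fasting_blood_sugar = factor(c(1, 0, 0, 0)))

# fct_recode takes new_name = "old_name" pairs.
toy_labelled <- toy |>
  mutate(fasting_blood_sugar = fct_recode(fasting_blood_sugar,
                                          over = "1", below = "0"))

# Count rows per class to see how balanced the outcome is.
class_counts <- toy_labelled |> count(fasting_blood_sugar)
class_counts
```

Running the same count on mutated_data would show how many individuals fall into each class before the data is split.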


In [6]:
# Remove any rows that contain missing values
new_data <- mutated_data |> drop_na()
head(new_data)

fasting_blood_sugar,cholestoral,age
<fct>,<dbl>,<dbl>
over,233,63
below,286,67
below,229,67
below,236,56
below,268,62
below,354,57


In [7]:

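# Fix the seed so the random split is reproducible
set.seed(3456)
# 75% training / 25% testing, stratified by the outcome class
cleveland_split <- initial_split(new_data, prop = 0.75, strata = fasting_blood_sugar)
cleveland_train <- training(cleveland_split)
cleveland_test <- testing(cleveland_split)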

head(cleveland_train)


fasting_blood_sugar,cholestoral,age
<fct>,<dbl>,<dbl>
below,229,67
below,268,62
below,354,57
below,254,63
below,192,57
below,294,56
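
The split above is stratified by fasting_blood_sugar, which should keep the share of "over" and "below" cases similar in the training and testing sets. The sketch below checks this on hypothetical data (80 "below" rows and 20 "over" rows, not the Cleveland values). The toy_* names are illustrative only.

```r
library(rsample)
library(dplyr)

set.seed(3456)

# Hypothetical data: 80 "below" and 20 "over" rows, not the Cleveland values.
toy <- tibble(
  fasting_blood_sugar = factor(rep(c("below", "over"), times = c(80, 20))),
  age = sample(56:77, 100, replace = TRUE)
)

toy_split <- initial_split(toy, prop = 0.75, strata = fasting_blood_sugar)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Class proportions in each piece; stratification should keep them close.
train_props <- toy_train |> count(fasting_blood_sugar) |> mutate(prop = n / sum(n))
test_props <- toy_test |> count(fasting_blood_sugar) |> mutate(prop = n / sum(n))

train_props
test_props
```

The same count and mutate steps could be applied to cleveland_train and cleveland_test to confirm that both sets contain a comparable share of "over" cases.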
