---
## Title: "Predicting the Severity of Heart Disease Using Cleveland Heart Disease Dataset"
- Author: Eric Wang
- Date: 2024-06-08
---
## Introduction
Heart disease is a major health concern worldwide, accounting for millions of deaths each year. Early detection and prevention are key to managing and reducing the risk of heart disease. Machine learning models can support this by estimating the likelihood of heart disease in individuals, which can help doctors diagnose it more quickly and accurately and intervene sooner.

In this project, I aim to develop a predictive model using the Cleveland Heart Disease dataset to predict the severity of an individual's heart disease from a small set of medical attributes. The primary question this project seeks to answer is: **"Is it possible to predict the severity of heart disease based on chest pain type, resting blood pressure, and thalassemia across all age groups and genders?"**


---
## Variables Used In Prediction
- **age_group**: The age of the patient categorized into different groups
    - Children: Age ≤ 17 years
    - Young Adults: Age 18 to 34 years
    - Adults: Age 35 to 49 years
    - Middle-aged Adults: Age 50 to 64 years
    - Seniors: Age ≥ 65 years
- **sex**: The biological sex of the patient
    - Male
    - Female
- **cp**: chest pain type
  - Value 1: typical angina
  - Value 2: atypical angina
  - Value 3: non-anginal pain
  - Value 4: asymptomatic
- **trestbps**: resting blood pressure (in mm Hg on admission to the hospital)
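- **thal**: thalassemia status
  - Value 3: normal
  - Value 6: fixed defect
  - Value 7: reversible defect
- **num**: diagnosis of heart disease (angiographic disease status), stored in the processed Cleveland file as an integer from 0 to 4
  - Value 0: < 50% diameter narrowing (no disease present)
  - Values 1 to 4: > 50% diameter narrowing (in any major vessel: attributes 59 through 68 of the full database are vessels), indicating that disease is present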
---

In [42]:
#load necessary libraries
library(tidyverse)
library(tidymodels)
library(gridExtra)

── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.5     ✔ rsample      1.2.0
✔ dials        1.2.0     ✔ tune         1.1.2
✔ infer        1.0.5     ✔ workflows    1.1.3
✔ modeldata    1.2.0     ✔ workflowsets 1.0.1
✔ parsnip      1.1.1     ✔ yardstick    1.2.0
✔ recipes      1.0.8

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()

In [39]:
#Read the data and assign column names
#(ca and thal are parsed as character because the file marks missing values with "?")
cleveland <- read_csv("data/heart_disease/processed.cleveland.data",
                      col_names = c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                                    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"))

cleveland <- cleveland |>
  #Define age groups
  mutate(age_group = case_when(
    age <= 17 ~ "Children",
    age >= 18 & age <= 34 ~ "Young Adults",
    age >= 35 & age <= 49 ~ "Adults",
    age >= 50 & age <= 64 ~ "Middle-aged Adults",
    age >= 65 ~ "Seniors")) |>
  mutate(age_group = as.factor(age_group)) |>

  #Convert sex to a factor and rename values
  #(the data frame is already supplied by the pipe, so it is not passed again)
  mutate(sex = as.factor(sex)) |>
  mutate(sex = fct_recode(sex, "Male" = "1", "Female" = "0")) |>

  #Convert num to a factor
  mutate(num = as.factor(num)) |>
  #Convert cp to a factor
  mutate(cp = as.factor(cp)) |>

  #Select specific columns for further analysis
  select(age_group, sex, cp, trestbps, thal, num)

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): ca, thal
dbl (12): age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpea...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [40]:
#Print the first few rows to view the data
head(cleveland)

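| age_group | sex | cp | trestbps | thal | num |
|---|---|---|---|---|---|
| `<fct>` | `<fct>` | `<fct>` | `<dbl>` | `<chr>` | `<fct>` |
| Middle-aged Adults | Male | 1 | 145 | 6.0 | 0 |
| Seniors | Male | 4 | 160 | 3.0 | 2 |
| Seniors | Male | 4 | 120 | 7.0 | 1 |
| Adults | Male | 3 | 130 | 3.0 | 0 |
| Adults | Female | 2 | 130 | 3.0 | 0 |
| Middle-aged Adults | Male | 2 | 120 | 3.0 | 0 |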
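
The preview shows `thal` stored as character (`"6.0"`, `"3.0"`, ...) rather than as a number or factor. One possible way to tidy it is sketched below, assuming the cause is the `"?"` placeholder that the processed Cleveland file uses for missing values. The `toy` tibble and its values are hypothetical stand-ins, not rows from the dataset, and the factor labels simply reuse the wording from the variable list.

```r
library(dplyr)

# Hypothetical stand-in rows (not taken from the dataset)
toy <- tibble(
  trestbps = c(145, 160, 120, 130),
  thal     = c("6.0", "3.0", "?", "7.0")
)

toy_clean <- toy |>
  mutate(
    # Turn the "?" placeholder into a proper NA
    thal = na_if(thal, "?"),
    # Convert the codes to a labelled factor (3 = normal, 6 = fixed, 7 = reversible)
    thal = factor(as.numeric(thal),
                  levels = c(3, 6, 7),
                  labels = c("normal", "fixed defect", "reversible defect"))
  )

# Count how many rows lack a thal value
n_missing_thal <- sum(is.na(toy_clean$thal))
```

Whether rows with a missing `thal` should then be dropped or imputed is a separate modelling decision that this sketch does not make.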


In [45]:
#Split the data into training and testing sets with stratification
cleveland_split<-initial_split(cleveland,prop=0.75,strata=num)
#Training Set
cl_training<-training(cleveland_split)
#Testing Set
cl_testing<-testing(cleveland_split)
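
A quick way to check what stratification does is to compare class shares in the two sets. The sketch below is illustrative only: it uses a hypothetical `toy` tibble rather than the Cleveland data, and the seed value is arbitrary. Setting a seed before `initial_split()` is an addition here, made so that the same rows land in each set on every run.

```r
library(dplyr)
library(rsample)

set.seed(2024)  # arbitrary seed, chosen only to make the split reproducible

# Hypothetical data: 120 rows of class "0" and 80 rows of class "1"
toy <- tibble(
  num      = factor(rep(c("0", "1"), times = c(120, 80))),
  trestbps = rnorm(200, mean = 130, sd = 15)
)

toy_split <- initial_split(toy, prop = 0.75, strata = num)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of each class within a data frame
class_share <- function(df) {
  df |>
    count(num) |>
    mutate(share = n / sum(n))
}

train_share <- class_share(toy_train)
test_share  <- class_share(toy_test)
```

If stratification worked as intended, `train_share` and `test_share` should report similar proportions for each level of `num`.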

## Preliminary exploratory data analysis:
- Demonstrate that the dataset can be read from the web into R 
- Clean and wrangle your data into a tidy format
- Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis, and how many rows have missing data.
- Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.
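
The summary table and plot described in the list above could look roughly like the following sketch. It assumes a training set shaped like `cl_training`; the `toy_training` tibble and its values are hypothetical stand-ins, and counting `"?"` as the missing-value marker in `thal` is an assumption about how the file encodes missing data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the training set (values are made up)
toy_training <- tibble(
  num      = factor(c("0", "0", "1", "2", "0", "1")),
  trestbps = c(145, 130, 120, 160, 130, 140),
  thal     = c("6.0", "3.0", "7.0", "3.0", "?", "3.0")
)

# One row per class: observation count, mean of a numeric predictor,
# and number of rows with a missing thal value
summary_table <- toy_training |>
  group_by(num) |>
  summarise(
    n_obs          = n(),
    mean_trestbps  = mean(trestbps, na.rm = TRUE),
    n_missing_thal = sum(thal == "?"),
    .groups = "drop"
  )

# Distribution of resting blood pressure, colored by diagnosis
trestbps_plot <- ggplot(toy_training, aes(x = trestbps, fill = num)) +
  geom_histogram(bins = 10, alpha = 0.6, position = "identity") +
  labs(x = "Resting blood pressure (mm Hg)",
       y = "Count",
       fill = "Diagnosis (num)")
```

Printing `summary_table` and `trestbps_plot` in separate cells would give one table and one plot built from training data only.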
