**Data Science Section 002 Group 7 Project Proposal**

**Predicting the Presence or Absence of Heart Disease based on Specific Variables.**

Predictive data analysis plays an important role in the field of medical diagnostics. As cardiovascular disease continues to be a leading cause of global mortality, the development of efficient predictive models becomes crucial for detection and intervention. In this analysis, we will be predicting the presence or absence of heart disease in patients using the key parameters of age, cholesterol levels, and resting blood pressure.

For this analysis, we will use the Heart Disease dataset from the UCI Machine Learning Repository, focusing on the Cleveland database. The original database contains 76 attributes, but we will use the processed version, which contains a subset of only 14 of them.

By the end of this analysis, we expect to gain insight into the relationships between our chosen variables and the presence of heart disease in patients. Such findings could contribute to medical research on cardiovascular disease by revealing trends in the population from which the data were collected. This insight could also help healthcare systems identify and focus on individuals at higher risk of heart disease, enabling earlier intervention and improving patient outcomes. It could further contribute to reduced healthcare costs, as early detection and intervention could reduce the need for costly medical procedures associated with later stages of heart disease.

The results could also raise further questions, such as whether the accuracy of the model could be improved using other information. For example, one could ask how lifestyle and behavioral factors such as smoking habits, exercise, and diet interact with cholesterol, age, and blood pressure to influence the risk of heart disease. We could also ask whether the model and its findings generalize to different populations, and discuss how to make the model applicable and effective for more diverse populations.



To gain access to useful functions for reading and analysing our data, the following libraries must be loaded:

In [None]:
library(tidyverse)
library(tidymodels)

We can now read our dataset into R and assign it to an object called cleveland_data.

In [None]:
# Demonstration that the dataset can be read into R (here from a local copy under data/):

cleveland_data <- read_delim("data/processed.cleveland.data", delim = ",", col_names = FALSE)
cleveland_data

In [None]:
#Renaming columns:

cleveland_data <- rename(cleveland_data,
       age = X1,
       sex = X2,
       chest_pain_type = X3,
       trestbps = X4,
       chol = X5,
       fbs = X6,    
       restecg = X7,
       thalach = X8,
       exang = X9,
       oldpeak = X10,
       slope = X11,
       ca = X12,
       thal = X13,
       diagnosis = X14)

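# Skipping rows with missing values. Missing entries in this file are assumed to be
# coded as "?" (in ca and thal), which drop_na() alone would not catch.
cleveland_data <- cleveland_data |>
  filter(ca != "?", thal != "?") |>
  drop_na()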

glimpse(cleveland_data)

In [None]:
# Recoding categorical variables as labelled factors (not all of them are
# used as predictors, but readable labels make the data easier to inspect):

# Recoding sex values
cleveland_data <- cleveland_data |>
  mutate(sex = as_factor(sex)) |>
  mutate(sex = fct_recode(sex, "female" = "0",
                               "male" = "1"))
# Recoding chest pain values
cleveland_data <- cleveland_data |>
  mutate(chest_pain_type = as_factor(chest_pain_type)) |>
  mutate(chest_pain_type = fct_recode(chest_pain_type,
                                      "typical angina" = "1",
                                      "atypical angina" = "2",
                                      "non-anginal pain" = "3",
                                      "asymptomatic" = "4"))
# Recoding fbs values
cleveland_data <- cleveland_data |>
  mutate(fbs = as_factor(fbs)) |>
  mutate(fbs = fct_recode(fbs, "false" = "0",
                               "true" = "1"))

# Recoding exang values
cleveland_data <- cleveland_data |>
  mutate(exang = as_factor(exang)) |>
  mutate(exang = fct_recode(exang, "no" = "0",
                                   "yes" = "1"))
# Recoding slope values (recode slope itself, not thal)
cleveland_data <- cleveland_data |>
  mutate(slope = as_factor(slope)) |>
  mutate(slope = fct_recode(slope, "upsloping" = "1",
                                   "flat" = "2",
                                   "downsloping" = "3"))
# Recoding thal values
cleveland_data <- cleveland_data |>
  mutate(thal = as_factor(thal)) |>
  mutate(thal = fct_recode(thal, "reversible defect" = "7.0",
                                 "fixed defect" = "6.0",
                                 "normal" = "3.0"))
# Recoding diagnosis values: 0 means absent, 1 to 4 all mean present
cleveland_data <- cleveland_data |>
  mutate(diagnosis = as_factor(diagnosis)) |>
  mutate(diagnosis = fct_recode(diagnosis,
                                "absent" = "0",
                                "present" = "1",
                                "present" = "2",
                                "present" = "3",
                                "present" = "4"))

glimpse(cleveland_data)
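
Since diagnosis is the variable we want to predict, it may be worth checking that the five original codes really collapse into two classes. The sketch below illustrates this on a small hypothetical tibble called toy (made-up values, not rows from the Cleveland data). It uses fct_collapse() from forcats as a compact alternative to the fct_recode() call above; this is only one possible way to write it.

```r
library(tidyverse)

# Hypothetical toy values standing in for the raw diagnosis column (codes 0-4);
# these are NOT taken from the Cleveland data.
toy <- tibble(diagnosis = c(0, 2, 0, 1, 4, 3, 0))

toy_recoded <- toy |>
  mutate(diagnosis = as_factor(diagnosis),
         # Collapse codes 1-4 into a single "present" level
         diagnosis = fct_collapse(diagnosis,
                                  absent = "0",
                                  present = c("1", "2", "3", "4")))

# Count how many observations fall into each class
diagnosis_counts <- toy_recoded |>
  count(diagnosis)

diagnosis_counts
```

Running the same count(diagnosis) step on cleveland_data would show how balanced the two classes are before the data is split.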

In [None]:
# Splitting data into training and testing sets
set.seed(1) # arbitrary seed so that the random split is reproducible
cleveland_data_split <- initial_split(cleveland_data, prop = 0.75, strata = diagnosis)
training_set <- training(cleveland_data_split)
testing_set <- testing(cleveland_data_split)
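
To illustrate what the strata argument does, here is a minimal sketch that splits a synthetic data set and compares the class proportions in the two resulting sets. Everything in it is an assumption made for illustration: toy_data holds randomly generated values (not the Cleveland data), and the names toy_split, toy_train, toy_test and class_balance, as well as the seed, are hypothetical.

```r
library(tidyverse)
library(tidymodels)

set.seed(2024) # arbitrary seed, only so that the sketch is reproducible

# Synthetic stand-in data (NOT the Cleveland data): 200 randomly generated rows
toy_data <- tibble(
  age = round(runif(200, min = 30, max = 75)),
  chol = round(rnorm(200, mean = 240, sd = 50)),
  trestbps = round(rnorm(200, mean = 130, sd = 18)),
  diagnosis = factor(sample(c("absent", "present"), 200,
                            replace = TRUE, prob = c(0.55, 0.45)))
)

# Same kind of split as above: 75% training, stratified by diagnosis
toy_split <- initial_split(toy_data, prop = 0.75, strata = diagnosis)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Class proportions within each set; with stratification they should be similar
class_balance <- bind_rows(
  train = count(toy_train, diagnosis),
  test = count(toy_test, diagnosis),
  .id = "set"
) |>
  group_by(set) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

class_balance
```

Applying the same comparison to the real training and testing sets would confirm that neither set over-represents one diagnosis class.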