**Data Science Section 002 Group 7 Project Proposal**

**Predicting the Presence or Absence of Heart Disease based on Specific Variables.**

Predictive data analysis plays an important role in the field of medical diagnostics. As cardiovascular disease continues to be a leading cause of global mortality, the development of efficient predictive models becomes crucial for detection and intervention. In this analysis, we will be predicting the presence or absence of heart disease in patients using the key parameters of age, cholesterol levels, and resting blood pressure. For this analysis, we will be using the Heart Disease dataset from the UCI Machine Learning Repository, and focusing on the Cleveland database. The original database contains 76 attributes, but a version containing a subset of only 14 of those attributes will be used for this analysis.

By the end of this analysis, we expect to gain insight into the complex relationships between our chosen variables and the presence of heart disease in patients. Such findings would contribute to medical research regarding cardiovascular disease by providing insight into trends in the broader population of where the data was collected. Furthermore, this insight would allow healthcare systems to identify and focus on individuals at higher risk of heart diseases, and enable earlier intervention, improving patient outcomes. This insight could also contribute to reduced healthcare costs, as early detection and intervention could reduce the need for costly medical procedures associated with later stages of heart disease. A variety of questions could arise from the results of this analysis as well, such as whether the accuracy of the model could be improved using other information. For example, one could inquire about how lifestyle and behavioral factors such as smoking habits, exercise and diet interact with cholesterol, age, and blood pressure to influence the risk of heart disease in patients. We could also question whether the model and its findings generalize to different populations, and discuss how to make the model applicable and effective for more diverse populations. 



to further understand our data, some preliminary exploratory data analysis is conducted below. To gain access to useful functions for reading and analysing our data, the following libraries must be installed:

In [8]:
library(tidyverse)
library(tidymodels)

We can now read our dataset into R and assign it to an object called cleveland_data.

In [9]:
#Demostration that the dataset can be read from the web into R:

cleveland_data<- read_delim("data/processed.cleveland.data", delim=",", col_names = FALSE)
cleveland_data

[1mRows: [22m[34m303[39m [1mColumns: [22m[34m14[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m  (2): X12, X13
[32mdbl[39m (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>
63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0
56,1,2,120,236,0,0,178,0,0.8,1,0.0,3.0,0
62,0,4,140,268,0,2,160,0,3.6,3,2.0,3.0,3
57,0,4,120,354,0,0,163,1,0.6,1,0.0,3.0,0
63,1,4,130,254,0,2,147,0,1.4,2,1.0,7.0,2
53,1,4,140,203,1,2,155,1,3.1,3,0.0,7.0,1


Since our data contains no column names, we rename each columns with the appropriate variable name below. We also replace the numerial values under our categorical variables with what they are supposed to represent, to better understand what each variable reveals to us.

In [10]:
#Renaming columns:

cleveland_data <- rename(cleveland_data,
       age = X1,
       sex = X2,
       chest_pain_type = X3,
       trestbps = X4,
       chol = X5,
       fbs = X6,    
       restecg = X7,
       thalach = X8,
       exang = X9,
       oldpeak = X10,
       slope = X11,
       ca = X12,
       thal = X13,
       diagnosis = X14)

#skipping rows with missing values
cleveland_data<- cleveland_data |> drop_na()

glimpse(cleveland_data)

Rows: 303
Columns: 14
$ age             [3m[90m<dbl>[39m[23m 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44…
$ sex             [3m[90m<dbl>[39m[23m 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, …
$ chest_pain_type [3m[90m<dbl>[39m[23m 1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3, 2, 4, …
$ trestbps        [3m[90m<dbl>[39m[23m 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140,…
$ chol            [3m[90m<dbl>[39m[23m 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192,…
$ fbs             [3m[90m<dbl>[39m[23m 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, …
$ restecg         [3m[90m<dbl>[39m[23m 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, …
$ thalach         [3m[90m<dbl>[39m[23m 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148,…
$ exang           [3m[90m<dbl>[39m[23m 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, …
$ oldpeak         [3m[90m<dbl>[39m[23m 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4,

In [11]:
#renaming categorical variables:

#renaming sex values
cleveland_data <- cleveland_data |>
mutate(sex = as_factor(sex))|>
mutate(sex = fct_recode(sex, "female" = "0", 
                            "male" = "1"))
#renaming chest pain values
cleveland_data <- cleveland_data |>
mutate(chest_pain_type = as_factor(chest_pain_type))|>
mutate(chest_pain_type = fct_recode(chest_pain_type, 
                            "typical angina" = "1", 
                            "atypical angina" = "2", 
                            "non-anginal_pain" = "3", 
                            "asymptomatic" = "4"))
#renaming fbs values
cleveland_data <- cleveland_data |>
mutate(fbs = as_factor(fbs))|>
mutate(fbs = fct_recode(fbs, "false" = "0", 
                            "true" = "1"))

#renaming exang values
cleveland_data <- cleveland_data |>
mutate(exang = as_factor(exang))|>
mutate(exang = fct_recode(exang, "no" = "0", 
                            "yes" = "1"))
#renaming slope values
cleveland_data <- cleveland_data |>
mutate(slope = as_factor(slope))|>
mutate(slope = fct_recode(thal,"upsloping" = "1", 
                            "flat" = "2", 
                            "downsloping" = "3"))
#renaming thal values
cleveland_data <- cleveland_data |>
mutate(thal = as_factor(thal))|>
mutate(thal = fct_recode(thal, "reversible defect" = "7.0", 
                            "fixed defect" = "6.0", 
                            "normal" = "3.0"))
#renaming diagnosis values
cleveland_data <- cleveland_data |>
mutate(diagnosis = as_factor(diagnosis))|>
mutate(diagnosis = fct_recode(diagnosis, 
                            "absent" = "0",
                              "present" = "1", 
                            "present" = "2", 
                            "present" = "3", 
                            "present" = "4"))

glimpse(cleveland_data)

[1m[22m[36mℹ[39m In argument: `slope = fct_recode(thal, upsloping = "1", flat = "2",
  downsloping = "3")`.
[33m![39m Unknown levels in `f`: 1, 2, 3”


Rows: 303
Columns: 14
$ age             [3m[90m<dbl>[39m[23m 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44…
$ sex             [3m[90m<fct>[39m[23m male, male, male, male, female, male, female, female, …
$ chest_pain_type [3m[90m<fct>[39m[23m typical angina, asymptomatic, asymptomatic, non-angina…
$ trestbps        [3m[90m<dbl>[39m[23m 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140,…
$ chol            [3m[90m<dbl>[39m[23m 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192,…
$ fbs             [3m[90m<fct>[39m[23m true, false, false, false, false, false, false, false,…
$ restecg         [3m[90m<dbl>[39m[23m 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, …
$ thalach         [3m[90m<dbl>[39m[23m 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148,…
$ exang           [3m[90m<fct>[39m[23m no, yes, yes, no, no, no, no, yes, no, yes, no, no, ye…
$ oldpeak         [3m[90m<dbl>[39m[23m 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4,

Now that our data is tidy, we can split it into a training set to construct our model, and a testing set to test our model's effectiveness.

In [12]:
#splitting data into training and testing sets

cleveland_data_split <- initial_split(cleveland_data, prop = 0.75, strata = diagnosis)
training_set <- training(cleveland_data_split)
testing_test <- testing(cleveland_data_split)

training_set
testing_test

age,sex,chest_pain_type,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,diagnosis
<dbl>,<fct>,<fct>,<dbl>,<dbl>,<fct>,<dbl>,<dbl>,<fct>,<dbl>,<fct>,<chr>,<fct>,<fct>
63,male,typical angina,145,233,true,2,150,no,2.3,6.0,0.0,fixed defect,absent
37,male,non-anginal_pain,130,250,false,0,187,no,3.5,3.0,0.0,normal,absent
41,female,atypical angina,130,204,false,2,172,no,1.4,3.0,0.0,normal,absent
56,male,atypical angina,120,236,false,0,178,no,0.8,3.0,0.0,normal,absent
57,male,asymptomatic,140,192,false,0,148,no,0.4,6.0,0.0,fixed defect,absent
56,female,atypical angina,140,294,false,2,153,no,1.3,3.0,0.0,normal,absent
44,male,atypical angina,120,263,false,0,173,no,0.0,7.0,0.0,reversible defect,absent
57,male,non-anginal_pain,150,168,false,0,174,no,1.6,3.0,0.0,normal,absent
54,male,asymptomatic,140,239,false,0,160,no,1.2,3.0,0.0,normal,absent
48,female,non-anginal_pain,130,275,false,0,139,no,0.2,3.0,0.0,normal,absent


age,sex,chest_pain_type,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,diagnosis
<dbl>,<fct>,<fct>,<dbl>,<dbl>,<fct>,<dbl>,<dbl>,<fct>,<dbl>,<fct>,<chr>,<fct>,<fct>
67,male,asymptomatic,120,229,false,2,129,yes,2.6,7.0,2.0,reversible defect,present
57,female,asymptomatic,120,354,false,0,163,yes,0.6,3.0,0.0,normal,absent
56,male,non-anginal_pain,130,256,true,2,142,yes,0.6,6.0,1.0,fixed defect,present
52,male,non-anginal_pain,172,199,true,0,162,no,0.5,7.0,0.0,reversible defect,absent
58,female,typical angina,150,283,true,2,162,no,1.0,3.0,0.0,normal,absent
58,male,atypical angina,120,284,false,2,160,no,1.8,3.0,0.0,normal,present
43,male,asymptomatic,150,247,false,0,171,no,1.5,3.0,0.0,normal,absent
59,male,asymptomatic,135,234,false,0,161,no,0.5,7.0,0.0,reversible defect,absent
57,male,asymptomatic,150,276,false,2,112,yes,0.6,6.0,1.0,fixed defect,present
40,male,typical angina,140,199,false,0,178,yes,1.4,7.0,0.0,reversible defect,absent


Below are two tables to conduct exploratory analysis on our training data. The first is a summary table that tells us the number of observations in each class (whether there is a presence/absence of heart disease), and the second is a summary table telling us the means of our predictor variables.

In [None]:
#insert table code here