# Predict Heart Disease Status Based on Quantifiable Variables

# Introduction:

Cardiovascular diseases (CVDs) are a class of diseases that involve the heart or blood vessels. They are the number one cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidemia or already established disease) need early detection and management, and this is where a machine learning model can be of great help.

Our question is: can we determine heart disease status based on quantifiable variables? The dataset we will be using is a tabular dataset stored as comma-separated values (CSV). It has 12 variables, but we will only use 8 of them to answer our question: age, sex, chest pain type, cholesterol level, old peak, resting blood pressure, maximum heart rate, and heart disease status.

We chose these variables because, based on our research, they are more representative of heart disease.

### Attribute Information
 1. Age: years
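 2. Sex: M = male, F = female (stored as characters in the raw data; converted to numeric codes in a later step)
 3. ChestPainType: TA = typical angina, ATA = atypical angina, NAP = non-anginal pain, ASY = asymptomatic (stored as characters in the raw data; converted to numeric codes in a later step)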
 4. Cholesterol: serum cholesterol (mg/dl)
 5. Oldpeak: ST depression (numeric value)
 6. RestingBP: resting blood pressure (mm Hg)
 7. MaxHR: maximum heart rate achieved (numeric value between 60 and 202)
 8. HeartDisease: (1 = heart disease, 0 = normal)

In [None]:
library(repr)
library(tidyverse)
library(tidymodels)
library(dplyr)
library(RColorBrewer)

#### Reading files

In [None]:
heart_data <- read_csv("heart.csv") %>%
              mutate(HeartDisease = as_factor(HeartDisease))

head(heart_data)

#### Split data into training and test sets

In [None]:
set.seed(1) # arbitrary seed so that the random split is reproducible
heart_split <- initial_split(heart_data, prop = 0.75, strata = HeartDisease)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

#### Summarize dataset

In [None]:
num_obs <- nrow(heart_train)
heart_sum <- heart_train %>%
             glimpse() %>%
             group_by(HeartDisease) %>%
             summarize(count = n(), percentage = n()/ num_obs* 100)
heart_sum 

checking_for_na <- sum(is.na(heart_train))
checking_for_na

summary(heart_train)

#### Observations from Summary
1. Resting BP and Cholesterol both have a minimum of zero, which is physiologically implausible.
2. Cholesterol and Resting BP may contain outliers or missing values recorded as zero.
3. The training set has 689 rows and 12 columns.
4. Percentage of people with heart disease: 55.30%
5. Percentage of people without heart disease: 44.70%
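
A check with `is.na()` only finds values stored as `NA`, so zeros that stand in for missing measurements go uncounted. The sketch below runs on a made-up data frame (`toy` and its values are hypothetical, not taken from `heart.csv`) and shows one possible way to count such zeros per column and recode them as `NA`.

```r
# Made-up stand-in for heart_train (values are for illustration only)
toy <- data.frame(
  RestingBP   = c(140, 0, 130, 120),
  Cholesterol = c(289, 180, 0, 0)
)

# The usual NA check finds nothing, because the gaps are stored as zeros
na_before <- sum(is.na(toy))

# Count how many zeros each column contains
zero_counts <- colSums(toy == 0)
zero_counts

# Recode zeros as NA so that NA-aware tools can see them
toy_na <- toy
toy_na[toy_na == 0] <- NA
na_counts <- colSums(is.na(toy_na))
na_counts
```

Applied to the real training data, the same per-column count would show how many rows the zero filter is about to remove.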

#### Fixing zeros in Resting BP and Cholesterol

In [None]:
# Drop rows where a zero stands in for a missing measurement
heart_train <- heart_train %>%
               filter(RestingBP != 0, Cholesterol != 0)
count(heart_train)
summary(heart_train)
head(heart_train)

#### Visualizations of data

Heart Disease with Age

In [None]:
HeartDisease_Age_plot <- heart_train %>%
                         ggplot(aes(x = Age, fill = Age)) +
                         geom_bar() +
                         facet_grid(~HeartDisease) +
                         labs(title = "Heart Disease with Age", x = "Age", y = "Count")
options(repr.plot.width = 11, repr.plot.height = 8)

HeartDisease_Age_plot

Heart Disease with Sex

In [None]:
HeartDisease_Sex_plot <- heart_train %>%
                         ggplot(aes(x = Sex, fill = Sex)) +
                         geom_bar() +
                         facet_grid(~HeartDisease) +
                         geom_text(aes(label = after_stat(count)), stat = "count",  vjust = 2, colour = "black") +
                         labs(title = "Heart Disease with Sex", x = "Sex", y = "Count", fill = "Sex")
options(repr.plot.width = 8, repr.plot.height = 8)
HeartDisease_Sex_plot

Heart Disease with Chest Pain Type (TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic)

In [None]:
HeartDisease_ChestPainType_plot <- heart_train %>%
                                   ggplot(aes(x = ChestPainType, fill = ChestPainType)) +
                                   geom_bar() +
                                   facet_grid(~HeartDisease) +
                                   geom_text(aes(label = after_stat(count)), stat = "count",  vjust = 2, colour = "black") +
                                   labs(title = "Heart Disease with Chest Pain Type", x = "Chest Pain Type", y = "Count", 
                                        fill = "Chest Pain Type")
HeartDisease_ChestPainType_plot

Heart Disease with Resting Blood Pressure

In [None]:
HeartDisease_RestingBP_plot <- heart_train %>%
                               ggplot(aes(x = RestingBP, fill = RestingBP)) +
                               geom_bar() +
                               facet_grid(~HeartDisease) +
                               labs(title = "Heart Disease with Resting Blood Pressure", x = "Resting Blood Pressure", y = "Count") 
options(repr.plot.width = 8, repr.plot.height = 8)         

HeartDisease_RestingBP_plot

Heart Disease with Serum Cholesterol

In [None]:
HeartDisease_Cholesterol_plot <- heart_train %>%
                                 ggplot(aes(x = Cholesterol, fill = Cholesterol)) +
                                 facet_grid(~HeartDisease) +
                                 geom_bar() +
                                 labs(title = "Heart Disease with Serum Cholesterol", x = "Serum Cholesterol", y = "Count")
options(repr.plot.width = 8, repr.plot.height = 8)         

HeartDisease_Cholesterol_plot

Heart Disease with Fasting Blood Sugar

In [None]:
HeartDisease_FastingBS_plot <- heart_train %>%
                               ggplot(aes(x = FastingBS, fill = as.character(FastingBS))) +
                               geom_bar() +
                               facet_grid(~HeartDisease) +
                               geom_text(aes(label = after_stat(count)), stat = "count",  vjust = 2, colour = "white") +
                               labs(title = "Heart Disease with Fasting Blood Sugar", 
                                    x = "Fasting Blood Sugar", y = "Count", fill = "Fasting Blood Sugar")
HeartDisease_FastingBS_plot

Heart Disease with Resting Electrocardiogram Results (Normal: Normal, ST: having ST-T wave abnormality, LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria)

In [None]:
HeartDisease_RestingECG_plot <- heart_train %>%
                                ggplot(aes(x = RestingECG, fill = RestingECG)) +
                                geom_bar() +
                                facet_grid(~HeartDisease) +
                                geom_text(aes(label = after_stat(count)), stat = "count",  vjust = 2, colour = "black") +
                                labs(title = "Heart Disease with Resting Electrocardiogram Results", 
                                     x = "Resting Electrocardiogram Results", y = "Count", 
                                     fill = "Resting Electrocardiogram Results")
HeartDisease_RestingECG_plot

Heart Disease with Maximum Heart Rate

In [None]:
# Use a dedicated name so this plot does not overwrite the Resting BP plot
HeartDisease_MaxHR_plot <- heart_train %>%
                           ggplot(aes(x = MaxHR, fill = MaxHR)) +
                           facet_grid(~HeartDisease) +
                           geom_bar() +
                           labs(title = "Heart Disease with Maximum Heart Rate", x = "Maximum Heart Rate", y = "Count")
options(repr.plot.width = 8, repr.plot.height = 8)

HeartDisease_MaxHR_plot

Heart Disease with Exercise-Induced Angina (Y: Yes, N: No)

In [None]:
HeartDisease_ExerciseAngina_plot <- heart_train %>%
                                    ggplot(aes(x = ExerciseAngina, fill = ExerciseAngina)) +
                                    geom_bar() +
                                    facet_grid(~HeartDisease) +
                                    geom_text(aes(label = after_stat(count)), stat = "count",  vjust = 2, colour = "black") +
                                    labs(title = "Heart Disease with Exercise-Induced Angina", 
                                         x = "Exercise-Induced Angina", y = "Count", 
                                         fill = "Exercise-Induced Angina")
HeartDisease_ExerciseAngina_plot 

Heart Disease with Old Peak (ST depression, numeric value)

In [None]:
# Use a dedicated name so this plot does not overwrite the Resting BP plot
HeartDisease_Oldpeak_plot <- heart_train %>%
                             ggplot(aes(x = Oldpeak, fill = Oldpeak)) +
                             facet_grid(~HeartDisease) +
                             geom_bar() +
                             labs(title = "Heart Disease with Old Peak", x = "Old Peak", y = "Count")
options(repr.plot.width = 8, repr.plot.height = 8)

HeartDisease_Oldpeak_plot

Heart Disease with ST Slope (the slope of the peak exercise ST segment; Up: upsloping, Flat: flat, Down: downsloping)

In [None]:
HeartDisease_ST_Slope_plot <- heart_train %>%
                              ggplot(aes(x = ST_Slope, fill = HeartDisease)) +
                              geom_bar() +
                              facet_grid(~HeartDisease) +
                              geom_text(aes(label = after_stat(count)), stat = "count",  vjust = 2, colour = "black") +
                              labs(title = "Heart Disease with ST Slope", x = "ST Slope", y = "Count", fill = "Heart Disease")

HeartDisease_ST_Slope_plot

#### Convert Characters to Numerics

In [None]:
# Convert the character columns to numeric codes.
# as_factor() orders levels by first appearance, so the codes depend on row order.
heart_train <- heart_train %>%
                 mutate(Sex = as.numeric(as_factor(Sex)),
                        ChestPainType = as.numeric(as_factor(ChestPainType)))

head(heart_train)
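
Because the numeric codes come from the order in which each category first appears, the same category can receive a different code if the rows are shuffled or if the test set is converted separately. The sketch below illustrates this on made-up vectors and shows one possible alternative, fixing the levels explicitly. It uses base R `factor(x, levels = unique(x))` to mimic first-appearance ordering; the names `code_by_appearance` and `sex_levels`, and the choice F = 1, M = 2, are assumptions for illustration only.

```r
# Made-up values; the second vector holds the same data in a different row order
sex           <- c("M", "F", "M", "F")
sex_reordered <- c("F", "M", "M", "F")

# Codes assigned by order of first appearance
code_by_appearance <- function(x) as.integer(factor(x, levels = unique(x)))

code_by_appearance(sex)            # here M is seen first, so M = 1, F = 2
code_by_appearance(sex_reordered)  # here F is seen first, so F = 1, M = 2

# Fixing the levels makes the coding independent of row order
sex_levels  <- c("F", "M")         # assumed coding: F = 1, M = 2
fixed_codes <- as.integer(factor(sex, levels = sex_levels))
fixed_codes
```

Using one fixed set of levels for both the training and the test data keeps the codes comparable between the two.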

#### Finding the Best K

In [None]:
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>% 
            set_engine("kknn") %>%
            set_mode("classification")

heart_recipe <- recipe(HeartDisease ~ Age + Sex + ChestPainType + MaxHR, data = heart_train) %>%
                step_scale(all_predictors()) %>%
                step_center(all_predictors())

preprocessed_data <- heart_recipe %>% 
                     prep() %>%
                     bake(heart_train)

heart_vfold <- vfold_cv(heart_train, v = 5, strata = HeartDisease)

heart_workflow <- workflow() %>% 
                  add_recipe(heart_recipe) %>%
                  add_model(knn_tune)


gridvals <- tibble(neighbors = seq(from = 1, to = 200))


#heart_results <- heart_workflow %>%
       #          tune_grid(resamples = heart_vfold, grid = gridvals) %>% 
      #           collect_metrics()

knn_tune
heart_recipe
head(preprocessed_data)


#heart_results
# Note: numeric codes after conversion are Sex: M = 2, F = 1
# ChestPainType: ATA = 2, ASY = 1, NAP = 3
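
The commented-out tuning step could be completed along the following lines. This is a minimal sketch rather than the project's final code: it runs on the built-in `mtcars` data (with `am` as a stand-in class label) and a small grid of K values, both chosen only to keep the example quick, and names such as `toy_data`, `toy_accuracy` and `best_k` are hypothetical. It assumes the `tidymodels` and `kknn` packages are installed.

```r
library(tidymodels)

set.seed(1) # arbitrary seed so that the folds are reproducible

# Stand-in data: built-in mtcars, with transmission type (am) as the class
toy_data <- mtcars %>%
  mutate(am = as.factor(am))

# Scale and center the predictors, as in heart_recipe
toy_recipe <- recipe(am ~ mpg + wt, data = toy_data) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

toy_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

toy_vfold <- vfold_cv(toy_data, v = 5, strata = am)

# Cross-validate every candidate K and keep only the accuracy rows
toy_accuracy <- workflow() %>%
  add_recipe(toy_recipe) %>%
  add_model(toy_spec) %>%
  tune_grid(resamples = toy_vfold, grid = tibble(neighbors = 1:10)) %>%
  collect_metrics() %>%
  filter(.metric == "accuracy")

# K with the highest mean cross-validated accuracy
best_k <- toy_accuracy %>%
  arrange(desc(mean)) %>%
  slice(1) %>%
  pull(neighbors)
best_k
```

Plotting `mean` against `neighbors` from `toy_accuracy` is a common way to check that the chosen K sits on a stable part of the curve rather than on a single noisy peak.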

# Methodology:
We will start by analyzing all of the graphs above and looking at each of the 7 variables to see its relationship with heart disease. The variables with the strongest correlation with heart disease will be used as our predictor variables:
1. Age vs. Heart Disease: People between the ages of 55 and 65 seem to have the most heart disease.
2. Sex vs. Heart Disease: Males are much more likely to have heart disease.
3. Chest Pain Type vs. Heart Disease: People with ASY (asymptomatic) chest pain are more likely to have heart disease.
4. Resting Blood Pressure vs. Heart Disease: There is a weak positive relationship, but it is not strong enough to make it a predictor variable.
5. Cholesterol vs. Heart Disease: There doesn't seem to be a correlation between cholesterol and heart disease.
6. Maximum Heart Rate vs. Heart Disease: People with a maximum heart rate in the range of 100-150 are more likely to have heart disease.
7. Old Peak vs. Heart Disease: There is no apparent correlation between old peak and heart disease.
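
Statements such as "males are much more likely to have heart disease" can be backed by within-group proportions instead of raw bar heights, since the groups differ in size. The sketch below shows one way to compute them in base R on made-up data; `toy`, `counts` and `row_props` are hypothetical names and the values are not taken from `heart.csv`.

```r
# Made-up data: 4 males and 4 females with a 0/1 heart disease label
toy <- data.frame(
  Sex          = c("M", "M", "M", "M", "F", "F", "F", "F"),
  HeartDisease = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Cross-tabulate sex against heart disease status
counts <- table(Sex = toy$Sex, HeartDisease = toy$HeartDisease)

# Proportion of each sex with and without heart disease (each row sums to 1)
row_props <- prop.table(counts, margin = 1)
row_props
```

In this toy example 75% of the males and 25% of the females have heart disease; the same table on the training data would put a number on each categorical comparison in the list.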

Based on this information, we will use age, sex, chest pain type, and maximum heart rate as the predictor variables. So far in this course we have used at most 2 predictor variables. Later, in the final report, we plan to use multivariable linear regression, which takes all of the predictor variables into account, to predict whether a person has heart disease. We can also change the background color of a plot to show what the classifier would likely predict, as we saw in the reading for week 7.

# Expected outcomes and significance:
We expect to find that the 4 predictor variables (age, sex, chest pain type, and maximum heart rate) will help us tell whether a person has heart disease, because each of them appears to be strongly related to heart disease. These findings could have a significant impact. People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidemia or already established disease) need early detection and management, and this is where a machine learning model can be of great help. In the future, we should try to improve the accuracy of the overall algorithm by adding more predictor variables and using a larger sample. The same technique could then be applied to detect other diseases, such as pneumonia. An algorithm can reduce human error, and the more accurate it is, the lower the chance that it misdiagnoses someone.
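
Overall accuracy does not show which kind of mistake a classifier makes, and for a screening task a missed case (a false negative) usually matters most. The sketch below, on made-up labels and predictions (all names and values are hypothetical), shows one way to read accuracy, false negatives and recall from a confusion matrix in base R.

```r
# Made-up true labels and predictions (1 = heart disease, 0 = normal)
truth <- factor(c(1, 1, 1, 0, 0, 0, 1, 0), levels = c(0, 1))
pred  <- factor(c(1, 0, 1, 0, 0, 1, 1, 0), levels = c(0, 1))

# Rows are the true class, columns are the predicted class
conf_mat <- table(truth = truth, prediction = pred)
conf_mat

# Share of all predictions that are correct
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Patients with heart disease that the classifier missed
false_negatives <- conf_mat["1", "0"]

# Share of true heart disease cases that were detected
recall <- conf_mat["1", "1"] / sum(conf_mat["1", ])
```

Here the accuracy is 0.75, yet one of the four patients with heart disease is still missed, which is why recall is worth reporting next to accuracy.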