# Predicting Heart Disease

## Summary

### Introduction

Heart disease is an umbrella term for several conditions that affect the health of the heart. Common heart diseases include blood vessel disease, arrhythmia (irregular beating of the heart), disease of the heart valves and muscle, infection of the heart, and heart defects present from birth (“Heart Disease.”). The symptoms of heart disease depend heavily on the type of disease; however, many forms can be prevented with healthy lifestyle choices. Most heart diseases (with the exception of serious defects at birth) are only diagnosed after a heart attack, heart failure, or stroke (“Heart Disease.”). Heart attacks, heart failure, and strokes are traumatic events and are often deadly (“Heart Disease and Stroke.”). It is therefore important to be able to predict whether an individual is at increased risk of heart disease so that they can receive preventive care. In this project we want to determine whether we can predict if someone is at risk of heart disease based on the variables listed below.

We used data from the UCI Machine Learning Repository (https://archive-beta.ics.uci.edu/ml/datasets/heart+disease). The Cleveland Heart Disease dataset consists of 13 explanatory variables and 1 target variable. The variables, their types, and a brief description of each are listed below.

| Variable | Variable Type | Description |
| :-: | :-: | :-: |
| Age | Quantitative | Age of patient in years |
| Sex | Categorical | Sex of patient where 0 = female and 1 = male |
| Chest Pain | Categorical | Type of chest pain the patient has where 1 = typical angina, 2 = atypical angina, 3 = non-anginal, 4 = asymptomatic |
| Resting Blood Pressure | Quantitative | Resting blood pressure of the patient measured at admission to the hospital in mm Hg |
| Serum Cholesterol | Quantitative | Total amount of cholesterol in the patient's blood, measured in mg/dl |
| Fasting Blood Sugar | Categorical | The blood sugar level of the patient after fasting where 1 = blood sugar greater than 120 mg/dl and 0 = otherwise |
| Resting ECG | Categorical | The results of the patient's resting electrocardiogram where 0 = normal results, 1 = S-T wave abnormality, 2 = left ventricular hypertrophy |
| Max Heart Rate | Quantitative | The maximum heart rate that the patient had in bpm |
| Exercise Induced Angina | Categorical | Whether the patient has exercise-induced angina (chest pain caused by reduced blood flow) where 0 = no and 1 = yes |
| Oldpeak | Quantitative | The ST depression (ECG measurement of the heart) induced by exercise relative to rest, in mm |
| Slope | Categorical | The slope of peak exercise ST segment of patient where 1 = upsloping, 2 = flat, 3 = downsloping |
| Number of Major Blood Vessels | Quantitative | The number of major blood vessels (0 to 3) colored by fluoroscopy |
| Thalassemia | Categorical | Occurrence of thalassemia in the patient (blood disorder that causes patient to have reduced amounts of hemoglobin) where 3 = normal, 6 = fixed defect, and 7 = reversible defect |
| Diagnosis | Categorical | Presence of heart disease in the patient from 0 to 4 where 0 indicates no presence of heart disease. This is the target column |


# Methods & Results

We started by loading the heart disease data set and adding column headers. The data set has 14 columns (13 explanatory variables and the target). For this analysis, we focus on the Diagnosis column. Based on the data description, any value above 0 indicates the presence of heart disease, and 0 indicates its absence.

We grouped the data by diagnosis and then cleaned it: we checked the data type of each column, looked for incorrectly encoded characters, and handled the NA values.

## Packages

In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.1.6     ✔ dplyr   1.0.7
✔ tidyr   1.1.4     ✔ stringr 1.4.0
✔ readr   2.1.1     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [2]:
heart_disease <- read.csv("data/processed-cleveland.csv", header = FALSE)

Before any data analysis can be done, the data needs to be cleaned. The downloaded data does not have any column names, so first we need to assign each column its variable name.

In [3]:
colnames(heart_disease) <- c("age",
                             "sex",
                             "chest_pain",
                             "resting_blood_pressure",
                             "serum_cholesterol",
                             "fasting_blood_sugar",
                             "resting_ecg",
                             "max_heart_rate",
                             "exercise_induced_angina",
                             "oldpeak",
                             "slope",
                             "num_of_major_vessels",
                             "thalassemia",
                             "diagnosis")

In [4]:
head(heart_disease, 10)

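| | age | sex | chest_pain | resting_blood_pressure | serum_cholesterol | fasting_blood_sugar | resting_ecg | max_heart_rate | exercise_induced_angina | oldpeak | slope | num_of_major_vessels | thalassemia | diagnosis |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<int>` |
| 1 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 2 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 3 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 4 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 5 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |
| 6 | 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0.0 | 3.0 | 0 |
| 7 | 62 | 0 | 4 | 140 | 268 | 0 | 2 | 160 | 0 | 3.6 | 3 | 2.0 | 3.0 | 3 |
| 8 | 57 | 0 | 4 | 120 | 354 | 0 | 0 | 163 | 1 | 0.6 | 1 | 0.0 | 3.0 | 0 |
| 9 | 63 | 1 | 4 | 130 | 254 | 0 | 2 | 147 | 0 | 1.4 | 2 | 1.0 | 7.0 | 2 |
| 10 | 53 | 1 | 4 | 140 | 203 | 1 | 2 | 155 | 1 | 3.1 | 3 | 0.0 | 7.0 | 1 |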


The next step in cleaning up the data is to check for NAs. It is important to handle these before any data analysis so that the functions we use will run on the data properly. Applying the `colMeans` function to `is.na(heart_disease)` gives the proportion of NAs in each column.

In [5]:
colMeans(is.na(heart_disease))

As we can see above, it looks like we have no NAs in our data. However, the UCI ML website states that the data has missing values.
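To make the behaviour of this check concrete, here is a minimal sketch on a made-up data frame (the name `toy` and its values are illustrative assumptions, not taken from the dataset). `is.na()` returns a logical matrix and `colMeans()` averages each column, which gives the proportion of NAs; a `"?"` placeholder is an ordinary string, so it is not counted as missing.

```r
# Toy data frame: column `a` has a real NA, column `b` hides a missing
# value behind a "?" placeholder (values are made up for illustration).
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("0.0", "?", "3.0", "2.0")
)

# Proportion of NA entries per column.
na_share <- colMeans(is.na(toy))
print(na_share)
# `a` is 0.25 (one of four values is NA); `b` is 0 because "?" is not NA.
```

This would explain how a column can report a proportion of 0 while still containing missing entries.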

In [6]:
unique(heart_disease)

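| | age | sex | chest_pain | resting_blood_pressure | serum_cholesterol | fasting_blood_sugar | resting_ecg | max_heart_rate | exercise_induced_angina | oldpeak | slope | num_of_major_vessels | thalassemia | diagnosis |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<int>` |
| 1 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 2 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 3 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 4 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 5 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |
| 6 | 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0.0 | 3.0 | 0 |
| 7 | 62 | 0 | 4 | 140 | 268 | 0 | 2 | 160 | 0 | 3.6 | 3 | 2.0 | 3.0 | 3 |
| 8 | 57 | 0 | 4 | 120 | 354 | 0 | 0 | 163 | 1 | 0.6 | 1 | 0.0 | 3.0 | 0 |
| 9 | 63 | 1 | 4 | 130 | 254 | 0 | 2 | 147 | 0 | 1.4 | 2 | 1.0 | 7.0 | 2 |
| 10 | 53 | 1 | 4 | 140 | 203 | 1 | 2 | 155 | 1 | 3.1 | 3 | 0.0 | 7.0 | 1 |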


The missing values in this dataset are encoded as question marks ("?"), which R does not recognize as missing. We need to convert them to actual NA values that R will recognize.
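One way to locate such placeholders is to count them column by column. The sketch below uses a small made-up data frame (`toy` and its values are assumptions for illustration, not rows from the dataset); the same calls could be pointed at `heart_disease`.

```r
# Made-up sample: one numeric column and two character columns in which
# "?" marks a missing value.
toy <- data.frame(
  age = c(63, 67, 53),
  num_of_major_vessels = c("0.0", "3.0", "?"),
  thalassemia = c("6.0", "?", "?")
)

# Count the "?" placeholders in each column.
question_marks <- sapply(toy, function(column) sum(column == "?", na.rm = TRUE))
print(question_marks)

# Distinct values of the columns that contain at least one "?".
print(lapply(toy[question_marks > 0], unique))
```

Counting per column tends to be easier to read than scanning every row of the data frame, since the placeholder may only appear far down the table.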

In [7]:
# Replace every "?" placeholder with a real NA
heart_disease[heart_disease == "?"] <- NA
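As an alternative sketch (not the approach used in this notebook), the placeholder can be declared when the file is read, using the `na.strings` and `col.names` arguments of `read.csv`. To stay self-contained, the example reads from an inline string instead of `data/processed-cleveland.csv`; the first two rows are copied from the `head()` output above and the third row is a made-up illustration containing "?".

```r
# Column names as assigned earlier in this notebook.
column_names <- c("age", "sex", "chest_pain", "resting_blood_pressure",
                  "serum_cholesterol", "fasting_blood_sugar", "resting_ecg",
                  "max_heart_rate", "exercise_induced_angina", "oldpeak",
                  "slope", "num_of_major_vessels", "thalassemia", "diagnosis")

# Inline stand-in for the real file (third row is hypothetical).
sample_csv <- "63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
50,0,3,120,220,0,0,160,0,0.0,1,?,?,0"

# na.strings = "?" turns the placeholder into NA while reading.
sample_data <- read.csv(text = sample_csv, header = FALSE,
                        col.names = column_names, na.strings = "?")

# Proportion of NAs in the two affected columns.
print(colMeans(is.na(sample_data))[c("num_of_major_vessels", "thalassemia")])

# With "?" handled at read time, the column is parsed as numeric.
str(sample_data$num_of_major_vessels)
```

With this approach the affected columns are parsed as numbers directly, so a later `as.numeric` conversion would not be needed for them.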

Now if we run our `colMeans` check again, we will see that we do have missing values in num_of_major_vessels and thalassemia. We now need to remove the rows that contain these missing values.

In [None]:
colMeans(is.na(heart_disease))

In [None]:
heart_disease_clean <- na.omit(heart_disease)

In [None]:
colMeans(is.na(heart_disease_clean))

Since the proportion of NAs in each column is 0, we can tell that there are no more missing values in the cleaned dataset.
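Because `na.omit` drops every row that contains at least one NA, it can be worth checking how many rows are lost. A minimal sketch on made-up data (the `toy` data frame is an assumption for illustration):

```r
# Made-up sample with NAs in two different rows.
toy <- data.frame(
  age = c(63, 67, 53, 41),
  num_of_major_vessels = c(0, NA, 1, 2),
  thalassemia = c("6.0", "3.0", NA, "3.0")
)

# Drop every row that has at least one NA.
toy_clean <- na.omit(toy)

# Compare row counts before and after.
rows_dropped <- nrow(toy) - nrow(toy_clean)
cat("rows before:", nrow(toy), "after:", nrow(toy_clean),
    "dropped:", rows_dropped, "\n")

# complete.cases() identifies which rows were incomplete.
incomplete_rows <- which(!complete.cases(toy))
print(incomplete_rows)
```

Running the same comparison on `heart_disease` and `heart_disease_clean` would show how much of the dataset the cleaning step removes.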

## Exploring Dataset

Next we need to get to know our data. We will do this by performing exploratory analysis where we summarize our data and graph a few relevant figures. The aim of doing this is to get a better understanding of what our final classification analysis will look like.

The `glimpse` function allows us to easily tell what type of data we have (i.e. whether variables are stored as numbers or characters). This is especially important for our dataset since it contains categorical data encoded with numerical values; for example, a variable with three categories may label them 1 to 3.

In [None]:
glimpse(heart_disease_clean)

We know that sex, chest_pain, fasting_blood_sugar, resting_ecg, exercise_induced_angina and slope are all categorical variables, but we can see that R is interpreting them as quantitative variables. It is important for us to keep this in mind when we process our data. Additionally, we can see that in this dataset num_of_major_vessels and thalassemia are stored as characters. Since num_of_major_vessels is a quantitative variable, we will convert it to a numeric type. We will also convert all the categorical variables into factors.

In [None]:
# Convert num_of_major_vessels from character to numeric
# (entries previously set to NA stay NA)
heart_disease <- heart_disease |>
  mutate(num_of_major_vessels = as.numeric(num_of_major_vessels))
glimpse(heart_disease)
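The cell above converts only num_of_major_vessels. The factor conversion described in the text could be sketched as follows, assuming dplyr's `across()` and `all_of()` helpers; the `toy` tibble and the `categorical_columns` vector are hypothetical names, with values taken from three rows of the `head()` output.

```r
library(dplyr)

# Small sample using the dataset's column names and coding.
toy <- tibble(
  age = c(63, 67, 37),
  sex = c(1, 1, 1),
  chest_pain = c(1, 4, 3),
  fasting_blood_sugar = c(1, 0, 0),
  resting_ecg = c(2, 2, 0),
  exercise_induced_angina = c(0, 1, 0),
  slope = c(3, 2, 3),
  thalassemia = c("6.0", "3.0", "3.0")
)

# Columns that the variable table describes as categorical.
categorical_columns <- c("sex", "chest_pain", "fasting_blood_sugar",
                         "resting_ecg", "exercise_induced_angina",
                         "slope", "thalassemia")

# Convert each categorical column to a factor; other columns are untouched.
toy_factors <- toy |>
  mutate(across(all_of(categorical_columns), as.factor))

glimpse(toy_factors)
```

Listing the columns explicitly keeps quantitative variables such as age numeric while the coded categories become factors.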

In [None]:
colMeans(is.na(heart_disease))

In [None]:
summary(heart_disease) # summary statistics for every column

We grouped the data by diagnosis and plotted the counts to see whether the dataset is balanced.

In [None]:
# Count the number of patients at each diagnosis level
diagnosis_heart_disease <- heart_disease |>
  group_by(diagnosis) |>
  count()
diagnosis_heart_disease

As we can see, the data seems quite unbalanced across the five diagnosis levels. However, 0 indicates the absence of heart disease and 1-4 indicate its presence, so we will turn the diagnosis into a binary variable, 0 or 1 (absent or present). We will also perform data cleaning to deal with NA values.
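A minimal sketch of that binary recoding, assuming a new column called `diagnosis_binary` (a hypothetical name) and using the diagnosis values of the ten rows shown in the `head()` output as sample data:

```r
library(dplyr)

# Diagnosis values of the first ten rows of the dataset.
toy <- tibble(diagnosis = c(0, 2, 1, 0, 0, 0, 3, 0, 2, 1))

# Collapse levels 1-4 into a single "present" class (1); 0 stays "absent".
toy_binary <- toy |>
  mutate(diagnosis_binary = as.integer(diagnosis > 0))

# Class counts after the recoding.
binary_counts <- count(toy_binary, diagnosis_binary)
binary_counts
```

Recounting after the recoding shows whether the two classes are closer to balanced than the original five levels.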

In [None]:
# Bar chart of patient counts per diagnosis level (0 = absent, 1-4 = present)
heart_bar <- ggplot(diagnosis_heart_disease,
                    aes(x = diagnosis,
                        y = n,
                        fill = diagnosis)) +
  geom_bar(stat = "identity") +
  ggtitle("Number of patients per diagnosis") +
  scale_fill_gradient(low = "yellow", high = "red", na.value = NA)

heart_bar