# Heart Disease Prediction Using Machine Learning

## Introduction

Heart disease is a major health issue worldwide, and it is difficult to detect before symptoms emerge. The goal of this project is to use machine learning techniques to predict whether or not a person has heart disease. We use a dataset derived from the Cleveland Heart Disease Database, consisting of 14 attributes selected from a total of 76. It covers a range of demographic and clinical variables, such as age, sex, and cholesterol. It also categorizes individuals by the absence (value 0) or presence (values 1-4) of heart disease, providing a clear framework for analyzing the predictive power of medical test results. The columns are as follows:

1. **age**: age
2. **sex**: sex (1 = male, 0 = female)
3. **cp**: chest pain type
4. **trestbps**: resting blood pressure in mmHg
5. **chol**: serum cholesterol in mg/dl
6. **fbs**: fasting blood sugar > 120 mg/dl? (1 = True, 0 = False)
7. **restecg**: resting electrocardiographic results
8. **thalach**: maximum heart rate achieved
9. **exang**: whether exercise induced angina (1 = True, 0 = False)
10. **oldpeak**: ST depression induced by exercise, relative to rest
11. **slope**: the slope of the peak exercise ST segment (1 = upslope, 2 = flat, 3 = downslope)
12. **ca**: number of major vessels (0-3) coloured by fluoroscopy
13. **thal**: (3 = normal, 6 = fixed defect, 7 = reversible defect)
14. **num**: diagnosis of heart disease (1,2,3,4 = presence, 0 = no presence)

### Methods

#### Preprocessing and exploratory data analysis

1) Imported libraries and `processed.cleveland.data` dataset from the internet.

2) Cleaned and tidied data to make it usable, by assigning column types and adding a new column, `diag`.

3) Split the data into training and testing sets, working **only** with the training set until the very end.

4) Summarized the training set to make predictions about how we want our classifier to work.

5) Visualized the relationship between `thalach` and `chol` to get a deeper understanding of how the data is distributed.


#### Finding the best $k$ value

Our goal is to find the value of $k$ for $k$-nearest neighbours that gives the highest prediction accuracy. In the code below, we create a classifier and use cross-validation, which splits the training data, trains the model on one part and evaluates it on the other, because we cannot use the testing data for tuning. Our next steps are:

1) Use the `recipe` function to center and scale the data.

2) Perform cross-validation with ten folds, using `vfold_cv`, on the training data. We use ten folds because if we only split the data once, the results strongly depend on which observations end up in the validation set; averaging over more folds gives a more reliable accuracy estimate.

3) Create a $k$-nearest neighbours model with `neighbors = tune()` instead of a fixed value, so that the best value of $k$ can be found by tuning.

4) Add the recipe and model to a workflow, using `tune_grid` to fit. This workflow runs cross-validation over the range of $k$ values specified in `gridvals`.

5) Find the best $k$ value by filtering for accuracy and plotting a line plot with the accuracy estimate on the y-axis and $k$ (neighbours) on the x-axis.

6) Ensure that the model does not underfit or overfit, and is more accurate than a majority classifier, using our new $k$.
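
The tuning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data frame `toy_train` is simulated, the seed and the range of $k$ in `gridvals` are arbitrary assumptions, and the `kknn` package is assumed to be installed as the engine.

```r
library(tidymodels)  # the kknn package must also be installed

set.seed(1)  # arbitrary seed, used only so the sketch is reproducible

# Simulated stand-in for the training set (not the Cleveland data):
# two numeric predictors and a two-class outcome.
n <- 200
toy_train <- tibble(
  diagnosis   = factor(rep(c("FALSE", "TRUE"), each = n / 2)),
  heart_rate  = rnorm(n, mean = ifelse(diagnosis == "TRUE", 140, 160), sd = 15),
  cholesterol = rnorm(n, mean = 245, sd = 45)
)

# 1) Recipe: scale and center both predictors.
toy_recipe <- recipe(diagnosis ~ heart_rate + cholesterol, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# 2) Ten-fold cross-validation, stratified by the outcome.
toy_vfold <- vfold_cv(toy_train, v = 10, strata = diagnosis)

# 3) K-nearest neighbours specification with k left to be tuned.
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 4) Candidate k values (this range is an assumption) and the tuning run.
gridvals <- tibble(neighbors = seq(from = 1, to = 31, by = 2))

knn_results <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_tune) |>
  tune_grid(resamples = toy_vfold, grid = gridvals) |>
  collect_metrics()

# 5) Keep the accuracy rows and pick the k with the highest mean accuracy.
accuracies <- knn_results |>
  filter(.metric == "accuracy")

best_k <- accuracies |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Accuracy estimate against k.
accuracy_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Neighbours (k)", y = "Accuracy estimate")

print(best_k)
```

Picking the single highest-accuracy $k$ is the simplest rule; when several neighbouring values perform about equally well, the plot helps in choosing one from a stable region rather than an isolated peak.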

#### Visualizing our results

1) To visualize our results, we plotted max heart rate on the x-axis and cholesterol levels on the y-axis, using diagnosis to colour the points.

2) To check for over/underfitting, we coloured the background of the graph based on what prediction would be made at every possible point. This also allowed us to quickly identify how the model classified patients, and where the boundaries were.
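
One way to produce such a coloured background is to predict on a dense grid of points and draw those predictions faintly behind the observations. The sketch below assumes simulated data (`toy_train`), an arbitrary seed, and an arbitrary `neighbors = 15`; none of these come from the project itself.

```r
library(tidymodels)  # the kknn package must also be installed

set.seed(1)  # arbitrary seed

# Simulated stand-in for the training set (not the Cleveland data).
n <- 200
toy_train <- tibble(
  diagnosis   = factor(rep(c("FALSE", "TRUE"), each = n / 2)),
  heart_rate  = rnorm(n, mean = ifelse(diagnosis == "TRUE", 140, 160), sd = 15),
  cholesterol = rnorm(n, mean = 245, sd = 45)
)

toy_recipe <- recipe(diagnosis ~ heart_rate + cholesterol, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# k = 15 is a placeholder, not a tuned value.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 15) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_train)

# Grid of synthetic points covering the observed range of both predictors.
bg_grid <- expand_grid(
  heart_rate  = seq(min(toy_train$heart_rate), max(toy_train$heart_rate),
                    length.out = 100),
  cholesterol = seq(min(toy_train$cholesterol), max(toy_train$cholesterol),
                    length.out = 100)
)

# Predicted class at every grid point.
bg_pred <- bind_cols(bg_grid, predict(knn_fit, bg_grid))

# Faint grid predictions as the background, observed points on top.
boundary_plot <- ggplot() +
  geom_point(data = bg_pred,
             aes(x = heart_rate, y = cholesterol, colour = .pred_class),
             alpha = 0.05, size = 4) +
  geom_point(data = toy_train,
             aes(x = heart_rate, y = cholesterol, colour = diagnosis)) +
  labs(x = "Maximum heart rate", y = "Cholesterol (mg/dl)", colour = "Diagnosis")
```

A very jagged background with many small islands suggests overfitting, while a background that is almost a single colour suggests underfitting.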

#### Testing our classifier

1) Made a new model specification for the best $k$ value chosen, combined with the recipe made earlier in a workflow, and fit the classifier to our training set.

2) Used `predict` on the testing set to evaluate the classifier's prediction accuracy on data it hadn't seen before.

3) Produced a confusion matrix to get a sense of which diagnoses the classifier was more accurate at giving, and what effects that has on real world application.

4) Tested the accuracy of our classifier when given data from Hungary.
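
As a rough sketch of steps 1 to 3, the following fits a classifier with a fixed $k$, predicts on a held-out set, and summarizes the result. The data are simulated, and the seed and `neighbors = 15` are placeholder assumptions rather than the values used in this project.

```r
library(tidymodels)  # the kknn package must also be installed

set.seed(1)  # arbitrary seed

# Simulated stand-in for the cleaned data (not the Cleveland data).
n <- 200
toy <- tibble(
  diagnosis   = factor(rep(c("FALSE", "TRUE"), each = n / 2)),
  heart_rate  = rnorm(n, mean = ifelse(diagnosis == "TRUE", 140, 160), sd = 15),
  cholesterol = rnorm(n, mean = 245, sd = 45)
)

toy_split <- initial_split(toy, prop = 0.75, strata = diagnosis)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

toy_recipe <- recipe(diagnosis ~ heart_rate + cholesterol, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# Model specification with the chosen k (15 is a placeholder).
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 15) |>
  set_engine("kknn") |>
  set_mode("classification")

# Fit on the training set only.
knn_fit <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_train)

# Predict on the unseen testing set and attach the true labels.
toy_preds <- predict(knn_fit, toy_test) |>
  bind_cols(toy_test)

# Overall accuracy and the confusion matrix.
toy_accuracy <- toy_preds |>
  accuracy(truth = diagnosis, estimate = .pred_class)

toy_conf_mat <- toy_preds |>
  conf_mat(truth = diagnosis, estimate = .pred_class)

print(toy_accuracy)
print(toy_conf_mat)
```

The confusion matrix separates false negatives (missed disease) from false positives, which matters more for a medical application than the overall accuracy alone.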

## Preliminary Exploratory Data Analysis

In [2]:
# importing libraries
library(tidyverse)
library(tidymodels)
library(repr)
library(RColorBrewer)

# formatting graphs
options(repr.plot.width = 12, repr.plot.height = 6)

## Importing the dataset

We use `read_csv` to import the `processed.cleveland.data` dataset, which is stored here as `data/processed_cleveland.csv`.

In [25]:
cleveland <- read_csv("data/processed_cleveland.csv")

head(cleveland)

nrow(cleveland)

Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): ca, thal
dbl (12): age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpea...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


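| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |
| 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0.0 | 3.0 | 0 |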


## Cleaning and Tidying the Data

- Some columns are read as `chr` instead of `dbl` because they contain "?" for unknown values. We convert the observations with "?" to NA so that these columns can be assigned the types we need.

- We then rename the columns to make them more readable.
- We also recode the **num** column, renamed to **diagnosis**, so that it holds only TRUE or FALSE values, letting us predict whether or not a person has heart disease. A value of 0, which indicates absence of disease, becomes FALSE; values 1, 2, 3 and 4 become TRUE, indicating disease regardless of the severity the numbers encode.
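
An alternative way to handle the "?" markers, shown here only as a sketch, is to declare them as missing when reading the file through the `na` argument of `read_csv`, so the affected columns parse as numbers directly. The three rows in `toy_csv` are invented for illustration and are not taken from the dataset.

```r
library(readr)
library(dplyr)

# Hypothetical three-row extract; "?" marks an unknown value.
toy_csv <- I("age,ca,thal,num\n63,0.0,6.0,0\n52,?,3.0,2\n38,1.0,?,4\n")

# Treat "?" (and empty strings) as NA while reading.
toy <- read_csv(toy_csv, na = c("", "?"), show_col_types = FALSE) |>
  # TRUE for any severity (1-4), FALSE for 0.
  mutate(diagnosis = as.factor(num > 0))

print(toy)
```

With this approach no column is read as `chr`, and no coercion warning is raised later when the types are assigned.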


In [26]:
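# convert categorical columns to factors, then give the columns readable names
cleveland_data <- cleveland |>
    mutate(sex = as_factor(sex),
           cp = as_factor(cp),
           fbs = as_factor(fbs),
           restecg = as_factor(restecg),
           exang = as_factor(exang),
           slope = as_factor(slope),
           # replace "?" with NA first so it does not become a factor level
           thal = as_factor(na_if(thal, "?")),
           # TRUE for any disease severity (1-4), FALSE for 0
           num = as_factor(ifelse(is.na(num), NA, (num > 0))),
           # as.integer() turns "?" into NA and emits a coercion warning
           ca = as.integer(ca)) |>
    rename(chest_pain = cp,
           blood_pressure = trestbps,
           cholesterol = chol,
           blood_sugar = fbs,
           rest_ecg = restecg,
           heart_rate = thalach,
           angina = exang,
           st_depression = oldpeak,
           num_vessels = ca,
           diagnosis = num)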


head(cleveland_data)

nrow(cleveland_data)

ℹ In argument: `ca = as.integer(ca)`.
! NAs introduced by coercion


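| age | sex | chest_pain | blood_pressure | cholesterol | blood_sugar | rest_ecg | heart_rate | angina | st_depression | slope | num_vessels | thal | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<fct>` | `<dbl>` | `<fct>` | `<int>` | `<fct>` | `<fct>` |
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | 6.0 | False |
| 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | 3.0 | True |
| 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | 7.0 | True |
| 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | 3.0 | False |
| 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | 3.0 | False |
| 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0 | 3.0 | False |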


## Selecting our predictor variables

## Splitting our data into training and testing sets


- We split our data into training and testing sets, stratifying on the `diagnosis` column because it is the variable we want to predict.
- We use the `initial_split` function to split our dataframe into 75% training and 25% testing.
- The 75-25 split allows us to train our model on as many data points as possible while also keeping enough data for effective testing later.
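
To illustrate what stratifying does, the sketch below splits a simulated data frame and compares class proportions in the two resulting sets. The data, the 60/40 class balance, and the seed are assumptions made for this example only.

```r
library(rsample)

set.seed(2024)  # arbitrary seed, used only so the sketch is reproducible

# Simulated data: 200 patients, 40% with a positive diagnosis.
toy <- data.frame(
  heart_rate = rnorm(200, mean = 150, sd = 20),
  diagnosis  = factor(rep(c("TRUE", "FALSE"), times = c(80, 120)))
)

# 75% training, 25% testing, sampled within each diagnosis class.
toy_split <- initial_split(toy, prop = 0.75, strata = diagnosis)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Class proportions in each set; both should be close to 0.6 / 0.4.
train_props <- prop.table(table(toy_train$diagnosis))
test_props  <- prop.table(table(toy_test$diagnosis))

print(train_props)
print(test_props)
```

Setting a seed before the split also makes the training and testing sets, and therefore every later result, reproducible between runs.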

In [27]:
cleveland_split <- initial_split(cleveland_data, prop = 0.75, strata = diagnosis)
cleveland_train <- training(cleveland_split)
cleveland_test <- testing(cleveland_split)

head(cleveland_train)

nrow(cleveland_train)
nrow(cleveland_test)

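| age | sex | chest_pain | blood_pressure | cholesterol | blood_sugar | rest_ecg | heart_rate | angina | st_depression | slope | num_vessels | thal | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` | `<dbl>` | `<fct>` | `<dbl>` | `<fct>` | `<int>` | `<fct>` | `<fct>` |
| 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | 6.0 | False |
| 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | 3.0 | False |
| 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | 3.0 | False |
| 57 | 0 | 4 | 120 | 354 | 0 | 0 | 163 | 1 | 0.6 | 1 | 0 | 3.0 | False |
| 57 | 1 | 4 | 140 | 192 | 0 | 0 | 148 | 0 | 0.4 | 2 | 0 | 6.0 | False |
| 56 | 0 | 2 | 140 | 294 | 0 | 2 | 153 | 0 | 1.3 | 2 | 0 | 3.0 | False |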


## Summary and Visualization of the Training Data

- Before we make our model, we need to check that the two classes actually have different averages in serum cholesterol and maximum heart rate achieved.

- To do this, we will use `group_by` and `summarize` to create a table with the minimum, maximum and mean of each of our predictors.