# Assessing the Likelihood of Contracting Heart Disease Based on 6 Health Factors

## 1. Introduction

### 1.1 Inference Question:
How is the likelihood of contracting heart disease affected by age, sex, max heart rate, cholesterol level, resting blood pressure, and chest pain type?

### 1.2 Background
For nearly a century, heart disease has claimed the lives of many men and women alike, and men are more likely to develop it than women (Fallon, 2019). In addition, as the population ages, an increasing number of people are suffering from heart disease (Kodali et al., 2018). We therefore expect the risk of heart disease to be related to sex and age, and possibly to other factors as well.

### 1.3 The Heart Disease Dataset
This project uses a dataset on heart disease and the factors that contribute to it. The link to the dataset is included below; because we read the data directly from the web, there is no need to save the file locally.

https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/

From the above link, we use the Cleveland dataset. The raw data contains 76 attributes, but the processed version contains only 14. We further filter this data down to the 6 factors that we will be examining, which are listed below.

#### The Health Factors:
- age 
- sex 
- pain.type -> type of chest pain experienced by the patient (typical angina, atypical angina, non-anginal pain, or asymptomatic)
- restbps -> resting blood pressure of the patient
- cholesterol
- maxbpm -> maximum heart rate of the patient

We chose these specific factors because we expect them to have the greatest influence on whether or not a patient has heart disease. We would also like to investigate whether heart disease affects men and women to different degrees, and have therefore included the sex variable.

The remaining factors were omitted because they were redundant or uncertain in terms of measurement units and missing data. One predictor, resting ECG, which measures the electrical activity of the heart, was considered; however, we judged that resting blood pressure would be a better indicator of heart activity.
______________________________________________________________________________________________________________________________________________________

## 2. Wrangling the Data

### 2.1 Loading Tidyverse
First, we load the `tidyverse` library so that we can wrangle the data and select only the factors used in this project. We also load `repr` to control how tables are displayed and `tidymodels` for the classification steps later on.

In [13]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)


### 2.2 Reading and Selecting the Data
We can use the `read_csv` function to read the data from the web. Since the file has no header row, we set `col_names = FALSE` and then supply a vector with the desired column names.

The dataset classifies sick patients into four different categories (values 1 to 4). For this analysis we use a binary outcome and treat all sick patients as one category, so we set the `health` column's values 2, 3, and 4 to 1, which signifies heart disease, while 0 signifies a healthy patient. Finally, we convert the `health` column to a factor rather than a double, as this is needed when classifying data.

In [None]:
# Read the processed Cleveland data; with no header row the columns arrive as X1..X14.
heart_data <- read_csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"), col_names = FALSE) |>
    select(X1, X2, X3, X4, X5, X8, X14)
colnames(heart_data) <- c("age", "sex", "pain.type", "restbps", "cholesterol", "maxbpm", "health")

# Collapse disease categories 2, 3 and 4 into the single "sick" class (1); 0 stays "healthy".
heart_data$health[heart_data$health %in% c(2, 3, 4)] <- 1
heart_data$health <- as_factor(heart_data$health)

heart_data
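
The recoding step can be illustrated on its own with a small base R sketch. The vector `raw_health` below is made up for illustration and is not taken from the Cleveland file; it only mimics the 0 to 4 coding described above.

```r
# Hypothetical diagnosis codes in the style of the raw `health` column (0 to 4).
raw_health <- c(0, 2, 1, 0, 4, 3, 0, 1)

# Any value above 0 indicates disease, so collapse 1 to 4 into a single class.
binary_health <- factor(ifelse(raw_health > 0, 1, 0),
                        levels = c(0, 1),
                        labels = c("Healthy", "Sick"))

# Count how many toy patients fall into each class.
table(binary_health)
```

Storing the result as a factor with explicit levels keeps the class order fixed, which is convenient when labelling plots and confusion matrices later.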

We also check the distribution of healthy and sick patients to make sure the classes are reasonably balanced, and provide a summary of the factors we will be addressing.

In [None]:
health_freq <- table(heart_data$health)
rownames(health_freq) = c("Healthy", "Sick")
"# of Patients with and without Heart Disease"
health_freq

In [None]:
# Numeric predictors: five-number summary and mean.
summary(heart_data$age)
summary(heart_data$restbps)
summary(heart_data$cholesterol)
summary(heart_data$maxbpm)
# Categorical predictors: counts per category.
table(heart_data$sex)
table(heart_data$pain.type)

______________________________________________________________________________________________________________________________________________________

## 3. Analysis of Data

### 3.1 Splitting the Dataset into Training and Testing Data
We place 75% of the data in the training set, while the other 25% is kept as testing data for later on. We also set a seed so that the results are reproducible in the future.

In [None]:
set.seed(29) # do not change [use this seed throughout]

heart_data_split <- initial_split(heart_data, prop = 0.75, strata = health)
heart_data_training <- training(heart_data_split)
heart_data_testing <- testing(heart_data_split)
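
As a rough illustration of what a stratified 75/25 split does, the base R sketch below samples 75% of the rows within each class of a made-up data frame. The data frame `toy` and its class sizes are assumptions for illustration only, and `initial_split()` may allocate rows somewhat differently in practice.

```r
set.seed(29)

# Toy outcome column with made-up class sizes (not the Cleveland counts).
toy <- data.frame(id = 1:180,
                  health = factor(rep(c(0, 1), times = c(100, 80))))

# Sample 75% of the row ids within each class, then combine them.
train_idx <- unlist(lapply(split(toy$id, toy$health), function(ids) {
    sample(ids, size = floor(0.75 * length(ids)))
}))

toy_training <- toy[toy$id %in% train_idx, ]
toy_testing  <- toy[!toy$id %in% train_idx, ]

# The class proportions are the same in both pieces.
prop.table(table(toy_training$health))
prop.table(table(toy_testing$health))
```

Stratifying on `health` means the testing set has the same mix of healthy and sick patients as the training set, so the accuracy measured later is not distorted by an unlucky split.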

### 3.2 Age and Heart Disease
To get an initial grasp of the dataset, we plot a histogram of patient ages, coloured by health status, to see at what age range a patient is most likely to develop heart disease. This also gives a good idea of where the majority of the patients lie.

In [None]:
ages_and_heart_disease_plot <- heart_data_training |>
    ggplot(aes(x = age, fill = as_factor(health))) +
    geom_histogram() +
    labs(title = "Plot of Ages and Heart Disease", x = "Age", fill = "Health") +
    scale_fill_discrete(labels = c("Healthy", "Sick")) +
    theme(text = element_text(size = 15), plot.title = element_text(hjust = 0.5))
ages_and_heart_disease_plot

The number of sick patients peaks at around 60 years old, which suggests that this is the most at-risk age range. Note, however, that although this is where the number of sick patients peaks, a larger percentage of the patients at this age is healthy.

### 3.3 Sex and Heart Disease
We would also like to explore whether there is any immediate relationship between the sex of the patient and the likelihood that they contract heart disease. To visualise this, we use a histogram with the `facet_grid` function to split the male and female patients into separate panels. We can also `filter` the data and calculate the proportion of sick male and female patients.

In [None]:
sex_and_heart_disease_plot <- heart_data_training |>
    ggplot(aes(x = age, fill = as_factor(health))) +
    geom_histogram() +
    facet_grid(rows = vars(sex)) +
    labs(title = "Plot of The Relationship Between the Sex of \n a Patient and Likelihood \n of Contracting Heart Disease", x = "Age", fill = "Health") +
    scale_fill_discrete(labels = c("Healthy", "Sick")) +
    theme(text = element_text(size = 15), plot.title = element_text(hjust = 0.5))

sex_and_heart_disease_plot

In [None]:
male_total <- heart_data_training |>
    filter(sex == 1) |>
    nrow()

male_sick <- heart_data_training |>
    filter(sex == 1) |>
    filter(health == 1) |>
    nrow()

male_ratio <- (male_sick / male_total)

female_total <- heart_data_training |>
    filter(sex == 0) |>
    nrow()

female_sick <- heart_data_training |>
    filter(sex == 0) |>
    filter(health == 1) |>
    nrow()

female_ratio <- (female_sick / female_total)


print('Proportion of males that are sick: ')
print(male_ratio)
print('Proportion of females that are sick: ')
print(female_ratio)

From the plot above we can see that the male population [1] is far larger than the female population [0]. Approximately 56% of the male patients have heart disease, while the proportion among female patients is less than half of that, at around 26%.
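
The same per-group proportions can be computed more compactly, because the mean of a TRUE/FALSE indicator is the proportion of TRUE values. The sketch below uses a made-up data frame `toy`, so its numbers are not the results reported above.

```r
# Made-up records: sex (1 = male, 0 = female) and health (1 = sick, 0 = healthy).
toy <- data.frame(sex    = c(1, 1, 1, 1, 0, 0, 0, 0, 1, 0),
                  health = c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0))

# Proportion of sick patients within each sex.
sick_by_sex <- tapply(toy$health == 1, toy$sex, mean)
sick_by_sex
```

Here the toy males have a sick proportion of 0.6 and the toy females 0.2; applied to `heart_data_training`, the same call should reproduce `male_ratio` and `female_ratio`.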

### 3.4 Chest Pain and Heart Disease
Next, we explore the relationship between chest pain and heart disease. To visualise this, we use a histogram with a facet grid covering the 4 types of chest pain recorded in this dataset, which essentially splits the age histogram into 4 panels. As a reminder, these chest pain types are typical angina [1], atypical angina [2], non-anginal pain [3], and asymptomatic [4].

In [None]:
likelihood_of_heart_disease_plot <- heart_data_training |>
    ggplot(aes(x = age, fill = as_factor(health))) +
    geom_histogram() +
    facet_grid(rows = vars(pain.type)) +
    labs(title = "Plot of What Chest Pain Shows \n Likelihood of Developing Heart Disease, \n and How it Relates to Age", x = "Age", fill = "Health") +
    scale_fill_discrete(labels = c("Healthy", "Sick")) +
    theme(text = element_text(size = 15), plot.title = element_text(hjust = 0.5))

likelihood_of_heart_disease_plot

From the visualisation above, the largest group of patients is in category 4, which is asymptomatic. This is also where the largest share of the sick patients lies; however, the other chest pain types contain sick patients as well.

### 3.5 Max Heart Rate and Heart Disease
The max heart rate of a patient generally decreases as they get older. We chose to investigate this variable because a lower max heart rate means that less exertion is needed before the heart is overworked, possibly resulting in heart injury or even disease. To visualise the relationship, we plot the data on a scatterplot with age on the x-axis and max heart rate on the y-axis, and colour the points by the patient's health.

In [None]:
age_and_max_heart_rate_plot <- heart_data_training |>
    ggplot(aes(x = age, y = maxbpm, colour = as_factor(health))) +
    geom_point() +
    labs(title = "Plot of Age and Max Heart Rate", x = "Age", y = "Max Heart Rate (bpm)", colour = "Health") +
    scale_colour_discrete(labels = c("Healthy", "Sick")) +
    theme(text = element_text(size = 15), plot.title = element_text(hjust = 0.5))
age_and_max_heart_rate_plot

From the diagram above, the expected downward trend of the points is seen, as with increasing age, the peak heart rate will decrease. We can also see that the bulk of the healthy patients are ones with a higher max heart rate, while the sick patients seem to have a lower average max heart rate suggesting a correlation between heart rate and heart disease.

### 3.6 Cholesterol and Heart Disease
Higher cholesterol levels lead to fatty deposits building up in the arteries, which make it difficult for blood to flow through them. When these fatty deposits break apart, they can cause a heart attack (Pruthi, 2023). To plot this relationship we use a scatterplot, with age once again on the x-axis and cholesterol on the y-axis.

In [None]:
age_and_cholesterol_plot <- heart_data_training |>
    ggplot(aes(x = age, y = cholesterol, colour = as_factor(health))) +
    geom_point() +
    labs(title = "Plot of Age and Cholesterol", x = "Age", y = "Cholesterol", colour = "Health") +
    scale_colour_discrete(labels = c("Healthy", "Sick")) +
    theme(text = element_text(size = 15), plot.title = element_text(hjust = 0.5))
age_and_cholesterol_plot

From this plot, there is a very slight upward trend in cholesterol level with age, and the majority of sick patients have slightly higher cholesterol levels. However, it is typical for cholesterol levels to increase slightly with age anyway, and there are some outliers.
______________________________________________________________________________________________________________________________________________________

## 4. KNN Classification

### 4.1 Creating the Recipe and Conducting Cross Validation
The first step in KNN classification is to create the recipe that will be used by the model we build to classify the testing data. This recipe determines which variables are the predictors and which variable is being predicted. In this step, the predictor variables are also scaled and centred, so that the Euclidean distance between points isn't dominated by the varying units of measurement.

In [None]:
heart_recipe <- recipe(health ~ ., data = heart_data_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
heart_recipe
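
To see why scaling and centring matter for Euclidean distance, consider the base R sketch below. The three patients are invented for illustration, and `scale()` is used here as a stand-in for the recipe's `step_scale()` and `step_center()`.

```r
# Three hypothetical patients measured in very different units.
patients <- data.frame(age = c(40, 60, 41),
                       cholesterol = c(200, 205, 300))

# Raw distances: the cholesterol gap dominates because its numbers are larger.
raw_dist <- as.matrix(dist(patients))
raw_dist[1, 2]   # patients 1 and 2 differ by 20 years of age
raw_dist[1, 3]   # patients 1 and 3 differ by 100 units of cholesterol

# Standardised distances: each variable has mean 0 and standard deviation 1.
std_dist <- as.matrix(dist(scale(patients)))
std_dist[1, 2]
std_dist[1, 3]
```

On the raw scale, patient 3 looks about five times further from patient 1 than patient 2 does. After standardising, the two distances are equal in this toy example, so neither variable dominates the choice of nearest neighbours.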

Next, we conduct cross-validation with the `vfold_cv` function, using 5 folds. We chose this number because, with 227 points in the training data, 5 roughly even splits leave about 45 points in each fold to act as validation data. More folds would leave too few points in each validation set, while fewer folds would give fewer accuracy estimates to average over.

In [None]:
set.seed(29) # do not change [use this seed throughout]

heart_vfold <- vfold_cv(heart_data_training, v = 5, strata = health)
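
The fold sizes implied by this choice can be checked with a short calculation. This is a sketch that assumes the 227 training points stated above and ignores stratification, so the folds that `vfold_cv()` actually builds may differ by a point or two.

```r
n_training <- 227   # training-set size stated in the text
v <- 5              # number of folds

# Assign each row to a fold as evenly as possible.
fold_id <- rep(seq_len(v), length.out = n_training)
fold_sizes <- table(fold_id)
fold_sizes
```

Each fold holds 45 or 46 points, so every model is trained on roughly 181 points and validated on the remaining fold.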

### 4.2 Tuning the Model
To begin the KNN analysis, we create a model specification that uses the `tune()` function as a placeholder for the number of neighbours, so that we can find the best $k$-value for classifying our data. We set the engine to `"kknn"` and the mode to `"classification"`.

With this, we can then create a workflow, to tune the grid using the cross-validation from before, and then use `collect_metrics()` to collect the accuracy data and be able to visualise the different $k$-values that were tested.

In [None]:
set.seed(29) # do not change [use this seed throughout]

knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")
knn_tune

knn_results <- workflow() |>
    add_recipe(heart_recipe) |>
    add_model(knn_tune) |>
    tune_grid(resamples = heart_vfold, grid = 10) |>
    collect_metrics()
knn_results

### 4.3 Selecting the Optimal $k$-value
We can now filter the data frame to keep only the rows where the `.metric` column is `"accuracy"`, and then produce a line plot of the estimated accuracy against the number of neighbours. This plot shows where the accuracy peaks, allowing us to select the best $k$-value.

In [None]:
accuracies <- knn_results |>
    filter(.metric == "accuracy")

accuracy_versus_k <- ggplot(accuracies, aes(x = neighbors, y = mean))+
    geom_point() +
    geom_line() +
    labs(x = "Neighbors", y = "Accuracy Estimate") + 
    ggtitle("Plot of Neighbours and their Accuracy") +
    scale_x_continuous(breaks = seq(2, 14, by = 1)) +
    theme(text = element_text(size = 15))
accuracy_versus_k

From the graph above, there is a distinct peak at `k = 7` or `k = 8` neighbours, so we choose one of these values of $k$ to give the model the best estimated accuracy.
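
The best $k$ can also be read off programmatically rather than from the plot. The sketch below uses a made-up table, `accuracies_toy`, in the same shape as the filtered `collect_metrics()` output; the accuracy values are invented, and breaking ties in favour of the smaller $k$ is our assumption.

```r
# Made-up cross-validation results (not the real tuning output).
accuracies_toy <- data.frame(
    neighbors = c(2, 4, 6, 7, 8, 10, 12, 14),
    mean      = c(0.70, 0.74, 0.77, 0.80, 0.80, 0.78, 0.77, 0.76)
)

# Keep the rows that reach the highest mean accuracy.
best <- accuracies_toy[accuracies_toy$mean == max(accuracies_toy$mean), ]
best

# Among tied candidates, prefer the smaller number of neighbours.
best_k <- min(best$neighbors)
best_k
```

With these toy values the tie between 7 and 8 neighbours resolves to 7.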

### 4.4 Building the KNN Model
Now that we have identified the best candidate values of $k$, we create a new model specification named `knn_tuned_spec` that uses 7 neighbours. We then build and fit a workflow, named `heart_tuned_fit`, which uses the `knn_tuned_spec` model.

In [None]:
set.seed(29) # do not change [use this seed throughout]

knn_tuned_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 7) |>
    set_engine("kknn") |>
    set_mode("classification")

heart_tuned_fit <- workflow() |>
    add_recipe(heart_recipe) |>
    add_model(knn_tuned_spec) |>
    fit(data = heart_data_training)
heart_tuned_fit
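
To make the classification rule concrete, here is a minimal base R sketch of a KNN majority vote with equal ("rectangular") weights. The training points, the helper `knn_predict()`, and the choice of `k = 3` are all hypothetical and chosen to keep the example small; this is not the `kknn` implementation.

```r
# Hypothetical standardised training points with two predictors.
train_x <- matrix(c(-1.0, -0.8, -0.5, 0.2, 0.6, 0.9, 1.1, 1.3,    # predictor 1
                    -0.9, -1.1, -0.2, 0.1, 0.4, 1.0, 0.8, 1.2),   # predictor 2
                  ncol = 2)
train_y <- factor(c(0, 0, 0, 0, 1, 1, 1, 1))

knn_predict <- function(new_x, k) {
    # Euclidean distance from the new point to every training point.
    d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
    # Labels of the k closest points, each casting one equal vote.
    votes <- table(train_y[order(d)[seq_len(k)]])
    names(votes)[which.max(votes)]
}

knn_predict(c(1.0, 1.0), k = 3)     # close to the class-1 points
knn_predict(c(-0.9, -1.0), k = 3)   # close to the class-0 points
```

The first new point is classified as `"1"` and the second as `"0"`, since each is surrounded by training points of that class.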

### 4.5 Testing the Model and Calculating the Accuracy
Finally, we are able to test the model that we have created with the testing data that we split during the initial stages of this analysis. First, we need to create a variable called `heart_predictions` that binds the predicted value column to the testing data, so that we can compare the `.pred_class` and `health` values.

The `metrics()` function takes in the true values, compares them to the predicted values, and produces an accuracy statistic. Additionally, we can create a confusion matrix with `conf_mat()`, which shows how many patients in each `health` class the model predicted correctly and incorrectly.

In [None]:
heart_predictions <- predict(heart_tuned_fit, heart_data_testing) |>
    bind_cols(heart_data_testing)

heart_metrics <- heart_predictions |>
    metrics(truth = health, estimate = .pred_class)

heart_conf_mat <- heart_predictions |>
    conf_mat(truth = health, estimate = .pred_class)

heart_predictions
heart_metrics
heart_conf_mat

From the tables and data frames above, the model does a reasonable job, with an accuracy of 80.2%. Although it is not very inaccurate, this level of accuracy is not high enough for the model to be relied upon to predict whether a new patient is likely to have heart disease.
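
For reference, accuracy is simply the share of predictions on the diagonal of the confusion matrix. The counts in the sketch below are hypothetical and are not the model's actual confusion matrix.

```r
# Hypothetical confusion matrix: rows are predictions, columns are the truth.
conf <- matrix(c(40, 5, 10, 45), nrow = 2,
               dimnames = list(Prediction = c("0", "1"), Truth = c("0", "1")))
conf

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy <- sum(diag(conf)) / sum(conf)
accuracy

# Proportion of truly sick patients that the model flags as sick.
sick_detected <- conf["1", "1"] / sum(conf[, "1"])
sick_detected
```

In this toy matrix the accuracy is 0.85, yet 10 of the 55 sick patients are missed, which illustrates why the confusion matrix is worth inspecting alongside the single accuracy figure.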
______________________________________________________________________________________________________________________________________________________

## 5. Discussion

Our initial analysis of this dataset suggests that men are more prone to developing heart disease than women. However, this could be affected by the fact that the majority of the patients in this dataset were male, so the sample was not ideal for this comparison. We also found a relationship between heart disease and some of the other factors, such as max heart rate and cholesterol, while for chest pain the largest group of patients was asymptomatic.
 
Our investigation found that the six factors we chose could predict whether a person has heart disease with an accuracy of 80.2%. We believe that if the dataset contained more observations, the model would be able to predict with a higher degree of accuracy. This analysis provides a good basis for creating a model to help people detect the risk of heart disease at an early stage. It could potentially help prevent heart disease from developing at all if, for instance, a person finds that their resting blood pressure is in a "danger" zone for heart disease.

As a future question, if these six factors prove to predict heart disease well, we hope to be able to predict which specific type of heart disease a patient is likely to develop.

## 6. References

Fallon, C. K. (2019). Husbands' hearts and women's health: Gender, age, and heart disease in twentieth-century America. *Bulletin of the History of Medicine, 93*(4), 577-609. https://doi.org/10.1353/bhm.2019.0073

Kodali, S. K., Velagapudi, P., Hahn, R. T., Abbott, D., & Leon, M. B. (2018). Valvular heart disease in patients ≥80 years of age. *Journal of the American College of Cardiology, 71*(18), 2058-2072. https://doi.org/10.1016/j.jacc.2018.03.459

Pruthi, S. (2023, January 11). *High cholesterol*. Mayo Clinic. Retrieved April 11, 2023, from https://www.mayoclinic.org/diseases-conditions/high-blood-cholesterol/symptoms-causes/syc-20350800