# EVALUATING ACCURACY OF PREDETERMINED VARIABLES TO PREDICT SEVERITY OF HEART DISEASE IN PATIENTS

**Adapted by:** Daniel Lee, Eric Leung, & Sam Thorne

**Original Authors:** Emerson Crick, Allie Janowicz, Ziva Subelj, & Sam Thorne

```{contents}
:local:
```

## Introduction

Heart disease refers to a number of different cardiovascular conditions, including coronary artery disease, arrhythmia, and heart failure. Together these conditions are the leading cause of death in the United States, killing over 659,000 people and costing the country about $363 billion each year. Heart disease is associated with a number of factors, such as unhealthy blood pressure and cholesterol levels, among others {cite}`disease_control`.


**We are asking: are age (`age`), maximum heart rate (`max_heart_rate`), and resting blood pressure (`rest_bp`) good at predicting the severity of heart disease in a patient? Based on literature research, we predict that age, maximum heart rate, and resting blood pressure will contribute to the different degrees of heart disease (ranging from 0, healthy, to 4, the most severe cases).**


The data we will use come from "processed.switzerland.data", "processed.cleveland.data", "processed.hungarian.data", and "processed.va.data", provided in the Heart Disease dataset {cite}`dataset`. They were collected from the clinical and noninvasive test results of 143 patients undergoing angiography at the University Hospitals in Zurich and Basel, 303 patients undergoing angiography at the Cleveland Clinic in Cleveland, Ohio, 425 patients undergoing angiography at the Hungarian Institute of Cardiology in Budapest, and 200 patients undergoing angiography at the Veterans Administration Medical Center in Long Beach, California {cite}`diagnosis_algorithm`.

Below are the packages we loaded to complete our data analysis:

In [1]:
source('../tests/tests.R')


In [None]:
# STEP IN 01 R SCRIPT

library(tidyverse)
library(repr)
library(tidymodels)
library(cowplot)
library(ggplot2)
library(caret)
library(e1071)
options(repr.matrix.max.rows = 6)
set.seed(1)
source("../R/selection_forward_function.R")
source("../R/majority_classifier_function.R")

# Reading data off internet to make csv files
source("../R/place_data.R")
source("../R/joining_data.R")

## Preliminary Exploratory Data Analysis

We combined the datasets "processed.switzerland.data", "processed.va.data", "processed.cleveland.data", and "processed.hungarian.data" to generate a larger and more complete dataset. Our aggregated data contains 13 different columns, plus an additional 8 that repeat some of the variables as factors instead of numeric types. Based on an initial viewing of the data, we decided to eliminate the `ST_dep` and `slope` columns because they were missing too many values.

In [None]:
# STEP IN 02 R SCRIPT


heart_data <- join_csv()
heart_data

<span style="color:gray">***Table 1.*** *All columns with all data from the four heart disease data sets.*</span>

### Training and Testing Set

Before creating our classification model, we partition `heart_data` into a training (75%) and a testing (25%) set using the `tidymodels` package. In data analysis it is important to split the data right away to ensure that the classifier we build never sees the testing data. We will train the model using only the training set and evaluate it on the testing set once we have built the $K$-nearest neighbors classifier. We will use the variable `diagnosis_f` as our class label, as this is what we seek to predict.

We set the seed with the `set.seed` function in order to make the randomized processes throughout our analysis reproducible. A seed is a numerical starting value, which determines the sequence of random numbers R will generate. Throughout the analysis the `set.seed` function will be at the top of each cell that completes a randomizing action. 
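As a small illustration of what the seed does, the sketch below draws the same random sample twice. The variable names are ours and are not used elsewhere in the analysis.

```r
# Resetting the seed before each draw reproduces the same "random" numbers
set.seed(1)
first_draw <- sample(1:100, 5)

set.seed(1)
second_draw <- sample(1:100, 5)

# TRUE: both draws start from the same seed, so they are identical
identical(first_draw, second_draw)
```

Without the second `set.seed(1)` call, the two draws would in general differ, which is why the seed is reset at the top of each cell that does something random.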

In [None]:
# STEP IN 03 R SCRIPT


# Splitting data into training and testing sets
set.seed(1)
heart_split <- initial_split(heart_data, prop = 0.75, strata = diagnosis_f)
heart_training <- training(heart_split)
heart_testing <- testing(heart_split)

## Literature Research on Predictors

To begin choosing which predictors to use for our classifier, we conducted literature research on the columns in our data set to determine their relationship with heart disease. The literature research below helped us build our initial research question, and we later checked it using forward selection.

### Resting Blood Pressure

High blood pressure has long been associated with heart attack and heart disease. High blood pressure can damage the arteries by forcing them to stay in a taut position. This decreases the vessels' ability to carry blood around the body, which can lead to heart disease {cite}`disease_control`. We also found indications that higher blood pressure in individuals was linked to coronary heart disease {cite}`blood_pressure`. According to this research, we expect those with higher resting blood pressure to have more severe heart disease.

### Age

Aging can cause changes in the heart and blood vessels that may increase a person's risk of developing cardiovascular disease. Aging of the heart is associated with a reduction in its function. Some of the age-associated changes in the heart might be (partially) reversed by exercise or specific drugs; however, it remains unclear whether this results in any definite advantages for the individual {cite}`disease_review`. According to this research, we expect older individuals to have heart disease of higher severity.

### Maximum Heart Rate Achieved

A low maximum heart rate can be associated with an increased risk of cardiac disease. Older hearts have a harder time than younger hearts pumping blood throughout the body because the heart gets tired and weaker with age. This can cause a decrease in daily physical movement because the heart is not able to perform as it used to. It is suspected that with a lower rate of blood pumping through the heart, there is a greater chance of other factors decreasing heart health and leading to heart disease {cite}`maximal_heartrate`. According to this research, we expect maximum heart rate to decrease as the severity of heart disease increases.

## Initial Visualizations of Predictors

In [None]:
# SCRIPTED IN 02.1-INITIAL_VISUALIZATION.R

# Boxplot visualizations for each of our predictors and heart diagnosis
heart_data <- read_csv('../data/processed/heart_data.csv')

    
boxplot_age <- grid_boxplot(heart_data, age, "Age (years)", "A. Boxplot of degree of heart \n disease in relation to patient's \nage")
boxplot_rest_bp <- grid_boxplot(heart_data, rest_bp, "Resting blood pressure (mmHg)", "B. Boxplot of degree of heart \ndisease in relation to patient's \nresting blood pressure")
boxplot_max_heart_rate <- grid_boxplot(heart_data, max_heart_rate, "Maximum heart rate (BPM)", "C. Boxplot of degree of heart \ndisease in relation to patient's \nmaximum heart rate")
options(repr.plot.width = 20, repr.plot.height = 10)
boxplots <- plot_grid(boxplot_age, boxplot_rest_bp, boxplot_max_heart_rate, ncol=3)
show(boxplots)
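# Pass the plot explicitly so the saved file does not depend on the last plot shown
ggsave("../figures/boxplot.png", plot = boxplots)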

<span style="color:gray">***Figure 1.***</span>

The boxplots in Figure 1 are our initial visualizations to see whether there really is a relationship between severity of heart disease and our predictors. Across all 3 predictors, `age` (Figure 1A), `rest_bp` (Figure 1B), and `max_heart_rate` (Figure 1C), the relationship with the diagnosed severity of heart disease (`diagnosis_f`) ranges from weakly positive, through none, to weakly negative.

Figure 1A. shows there is a slight positive relation between age and heart disease severity, meaning the older a patient is, the more severe their heart disease would be.

Figure 1B. shows there is no clear relationship between resting blood pressure and how severe a patient's heart disease is. The medians at the different levels of severity are all roughly level.

Figure 1C. shows there is a slight negative relationship between maximum heart rate and the severity of heart disease. This means patients with no heart disease (0) or low severity (1) are able to achieve a higher maximum heart rate. This is likely because their hearts are healthier and able to pump blood more efficiently than the heart of a patient with high severity (4) heart disease.

These visualizations are taken into consideration when making our classification model. We will see whether these predictors, used together, do in fact give an accurate diagnosis for someone being examined for heart disease, based on the given dataset. We chose to proceed with these predictors because they are supported by our literature research. In practice, these predictors are conditions that doctors use to give an **initial diagnosis** of heart disease. Our research question, whether these predictors can accurately predict the severity, is analysed in the remainder of the project report.

## Forward Selection

To assess the effectiveness of our chosen predictors on our data set, we used forward selection. We added this step to check that our chosen predictors work with this data, as Figure 1 failed to show any strong relationship.

Forward selection builds up a set of predictors one at a time, at each step adding the predictor that gives the highest estimated accuracy for the resulting classifier. Based on our literature research detailed above, we will be using forward selection on the variables age, resting blood pressure, and maximum heart rate. We made sure to use our training data set in the forward selection process so that the testing data is never seen by the classifier.

Additionally we chose to run forward selection on chest pain type (`chest_pain`) and sex (`sex`) to get an idea of what our classifier would look like using more variables. 
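The general idea of forward selection can be sketched as follows. This is only an illustration and not the `forwardSelection` function sourced earlier: `forward_selection_sketch` is a hypothetical helper of ours, it estimates accuracy with leave-one-out cross-validation from the `class` package rather than with `tidymodels`, it fixes the number of neighbours at an arbitrary `k = 5`, and it is demonstrated on R's built-in `iris` data as a stand-in for the heart data.

```r
library(class)  # provides knn.cv(): leave-one-out cross-validated k-NN

# Hypothetical helper for illustration; not the authors' forwardSelection()
forward_selection_sketch <- function(data, label, k = 5) {
  candidates <- setdiff(names(data), label)
  selected <- character(0)
  results <- data.frame(size = integer(0),
                        predictors = character(0),
                        accuracy = numeric(0))

  while (length(candidates) > 0) {
    # Estimate accuracy for each remaining candidate added to the current set
    scores <- sapply(candidates, function(candidate) {
      x <- scale(data[, c(selected, candidate), drop = FALSE])
      predicted <- knn.cv(train = x, cl = data[[label]], k = k)
      mean(as.character(predicted) == as.character(data[[label]]))
    })

    # Keep the candidate that gives the highest estimated accuracy
    best <- names(which.max(scores))
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)

    results <- rbind(results,
                     data.frame(size = length(selected),
                                predictors = paste(selected, collapse = " + "),
                                accuracy = max(scores)))
  }
  results
}

set.seed(1)
# iris is a stand-in data set with a factor label column called Species
selection_demo <- forward_selection_sketch(iris, "Species", k = 5)
selection_demo
```

Each row of the result corresponds to one more predictor being added, which is the same shape as a table of accuracies by number of predictors.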

In [None]:
# Creating data subsets for forward selection model using training data
heart_data_subset<-heart_training%>%
    select(diagnosis_f, age, rest_bp, max_heart_rate, chest_pain, sex) %>%
    na.omit()

# heart_data_subset
write_csv(heart_data_subset, '../data/modelling/forward_selection_subset.csv')

Now that we have a subset of our data to work with, we can run forward selection to produce a table of accuracies based on the number of predictors. The forward selection code imitates the model that will be built further down in the report. A seed is set to ensure reproducible results.

Due to the iterative nature of forward selection and the usage of 5 predictors this cell will take a while to run. 

In [None]:
accuracies <- forwardSelection(heart_data_subset)
accuracies

<span style="color:gray">***Table 2.*** *Forward selection results*</span>

From Table 2. above, we can see that the accuracy increases with every added predictor. To make this result more clear, a visualization of accuracy compared to the number of predictors is included below.

In [None]:
# visualization of number of predictors and accuracy based on forward selection
options(repr.plot.width = 7, repr.plot.height = 7)
forward_visualization <- ggplot(accuracies, aes(x = size, y = accuracy)) +
    geom_line() +
    geom_point() +
    labs(x = 'Number of predictors used',
         y = 'Estimated accuracy using forward selection',
         title = 'Number of different predictors compared \nto the accuracy of classifier model') +
    theme(text = element_text(size = 20)) +
    ylim(c(0,1))

forward_visualization

<span style="color:gray">***Figure 2.***</span>

Figure 2. shows us that the estimated accuracy of our classifier rises as predictors are added. However, we can also see that the accuracy plateaus at the fourth and fifth predictors (`chest_pain` and `sex`).

According to this visualization, our pre-chosen predictors should be accurate predictors for the severity of heart disease. It also suggests that adding more predictors would bring little benefit to our classifier, while using fewer would give markedly lower accuracy.

To conclude, we chose resting blood pressure, age and maximum heart rate as our predictors for the severity of heart disease as seen in row 3 of Table 2. 

## Completing Data Tidying

Now that our indicators `age`, `max_heart_rate` and `rest_bp` to predict `diagnosis_f` have been chosen, we can further tidy our data by eliminating the unused columns. Additionally, all the rows containing NA were removed. Since our data has already been split into training and testing sets we will tidy both of the subsets.

In [None]:
# Selecting chosen predictors within training and testing data
heart_training <- heart_training %>%
    select(rest_bp, age, max_heart_rate, diagnosis_f) %>%
    na.omit()
heart_training

# Adding training set to directory
write_csv(heart_training, '../data/modelling/training_split.csv')

heart_testing <- heart_testing %>%
    select(rest_bp, age, max_heart_rate, diagnosis_f) %>%
    na.omit()

# Adding testing set to directory
write_csv(heart_testing, '../data/modelling/testing_split.csv')

<span style="color:gray">***Table 3.*** *Tidied data set containing our selected predictors we will use to determine the severity of heart disease in an individual*</span>

## Majority Classifier

After selecting our predictors we examined the distribution of our outcome variable. Below, we show the majority classifier data frame which displays the proportion of each outcome found in our training data as well as a visualization to accompany it.

In [None]:
# majority classifier and visualization
set.seed(1)
total_rows <- nrow(heart_training)

# Compute the percentage and the count in one summarize() so that each count
# stays on the same row as its diagnosis after sorting (binding a separately
# computed, unsorted count column by position would misalign the rows)
majority_classifier <- heart_training %>%
    group_by(diagnosis_f) %>%
    summarize(percent_outcomes = n() / total_rows * 100,
              number = n()) %>%
    arrange(desc(percent_outcomes))
majority_classifier

# write csv to data/modelling
write_csv(majority_classifier, '../data/modelling/majority_classifier.csv')

<span style="color:gray">***Table 4.*** *Majority classifier showing the number of people with each severity level of heart disease in the training set*</span>

In [None]:
majority_classifier_vis_function(majority_classifier)

<span style="color:gray">***Figure 3.***</span>

Table 4. and Figure 3. give insight into the proportion of patients from the training set with each level of heart disease. The most common diagnosis in our data set is severity 1 with 31%. This value sets a baseline accuracy that our model should exceed to be deemed an acceptable classifier. In other words, if our model can predict more accurately than simply picking the most likely outcome every time, then we are on the right track in terms of accuracy.
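The baseline idea can be sketched in a few lines of base R. The helper `majority_baseline` and the labels in `toy_labels` are made up for illustration; they are not the training data.

```r
# Hypothetical helper: the accuracy obtained by always predicting the most
# common label
majority_baseline <- function(labels) {
  counts <- table(labels)
  list(label = names(counts)[which.max(counts)],
       accuracy = max(counts) / length(labels))
}

# Made-up severity labels, for illustration only
toy_labels <- c(1, 1, 1, 0, 0, 2, 3, 4, 2, 1)
baseline <- majority_baseline(toy_labels)
baseline$label     # "1": the most frequent label in the toy data
baseline$accuracy  # 0.4: four of the ten toy labels are 1
```

A classifier is only worth keeping if its accuracy on unseen data is clearly above this value.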

Table 4. also shows us that none of the percent outcomes are over 50%, and the outcomes for severity 1, 2, 3, and 0 are within 15 percentage points of each other. This means our data set is distributed evenly enough to build a decent classifier model. Knowing the frequency of each outcome also enables us to set an upper bound on our number of neighbors, which we discuss in further detail in subsequent cells.

## Classification Model

**The following steps can all be found in the function `classifier`**.

Because $K$-nearest neighbors is sensitive to the scale of the predictors, we preprocess the data to standardize them. We use only our training data to create a recipe for our classifier. The recipe computes the shift (centering) and scale values for each variable from the data we pass to it. We pass only the training data because this is part of the training procedure, and we want to ensure that our test data does not influence any aspect of model training.
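The standardization step can be illustrated with base R. This is a sketch of the idea rather than the recipe inside `classifier`: the two small data frames are made up, and the point is that the centering and scaling values come from the training rows only and are then reused for new rows.

```r
# Made-up values for illustration only
toy_train <- data.frame(age = c(40, 50, 60), rest_bp = c(120, 130, 140))
toy_test  <- data.frame(age = 55, rest_bp = 150)

# Shift and scale values are computed from the training data alone
centers <- colMeans(toy_train)
scales  <- apply(toy_train, 2, sd)

# The same values are then applied to both sets
train_scaled <- scale(toy_train, center = centers, scale = scales)
test_scaled  <- scale(toy_test,  center = centers, scale = scales)

test_scaled  # age: (55 - 50) / 10 = 0.5, rest_bp: (150 - 130) / 10 = 2
```

After this step each predictor in the training set has mean 0 and standard deviation 1, so no single predictor dominates the distance calculation simply because of its units.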

Next we made a model specification for our classifier, again being sure to use only the training dataset.
First we use `tune()` in place of a fixed number of neighbors, so that we can find the optimal $K$ for our classifier. With `weight_func = "rectangular"` we specify that each of the $K$ nearest neighbors gets an equal vote; with the default settings, the distance between a training observation and our new point is the *straight-line* distance. Finally, we must specify that we are conducting classification.
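For reference, the straight-line (Euclidean) distance between a new observation $\mathbf{x}$ and a training observation $\mathbf{z}$ over our three standardized predictors can be written as below. The symbols $\mathbf{x}$, $\mathbf{z}$, and $d$ are our own notation, introduced only for this illustration.

$$
d(\mathbf{x}, \mathbf{z}) = \sqrt{(x_{\text{age}} - z_{\text{age}})^2 + (x_{\text{rest\_bp}} - z_{\text{rest\_bp}})^2 + (x_{\text{max\_heart\_rate}} - z_{\text{max\_heart\_rate}})^2}
$$

The $K$ training observations with the smallest $d$ are the neighbors that vote on the predicted diagnosis.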

In order to find the optimal $K$ value for our classifier, we must now separate our training data into several pieces to test each $K$ we are considering. Five-fold cross-validation splits the training data into 5 pieces (folds) of roughly equal size. In each of 5 rounds, one of the pieces imitates a testing set while the classifier is trained on the other four, and the resulting accuracies are combined to give an estimated accuracy for each $K$.

We stratified the data by `diagnosis_f` to make sure the new splits contain similar proportions of each of the 5 diagnoses we are trying to predict.
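To show what stratifying the folds means, here is a minimal base R sketch. `stratified_folds` is a hypothetical helper of ours, not the code inside `classifier`, and the labels are made up; the analysis itself relies on `tidymodels` for this step.

```r
# Hypothetical helper: assign each observation to one of v folds so that every
# class is spread as evenly as possible across the folds
stratified_folds <- function(labels, v = 5) {
  folds <- integer(length(labels))
  for (cls in unique(labels)) {
    idx <- which(labels == cls)
    # Shuffle the fold numbers within this class, then hand them out
    folds[idx] <- sample(rep_len(seq_len(v), length(idx)))
  }
  folds
}

set.seed(1)
# Made-up labels: 20 observations of class "0" and 10 of class "1"
toy_labels <- rep(c("0", "1"), times = c(20, 10))
toy_folds <- stratified_folds(toy_labels, v = 5)

# Every fold holds 4 observations of class "0" and 2 of class "1"
fold_table <- table(fold = toy_folds, diagnosis = toy_labels)
fold_table
```

Because each fold mirrors the class proportions of the whole training set, the accuracy measured on any one fold is less likely to be distorted by a rare diagnosis being missing from it.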

We will now run a `workflow()` to determine the accuracies found using different $K$ values. This workflow does the cross-validation work described above using our now split training data, and it includes the model and recipe that were formed above. We tested $K$ values from one to twenty-one because twenty-one is the size of the least frequent diagnosis (category 4 appeared 21 times in the training set). If we were to use more than twenty-one neighbors, the majority voting used in $K$-nearest neighbors classification would be skewed against diagnosis 4, since that class could never supply more than 21 of the votes.

In [None]:
# Cross validation to find optimal K value
source('../R/classification_model.R')

heart_data_accuracies <- classifier(heart_training)
heart_data_accuracies

<span style="color:gray">***Table 5.*** *Parameter values that help select a k value*</span>

To make this table easier to understand, we made a visualization of the mean column and the number of neighbours. The mean represents an accuracy estimate of the model when different $K$ values are used. The optimal $K$ value has the highest accuracy and has neighbouring $K$ values of similar accuracy.

In [None]:
# K visualization
source('../R/model_visualization.R')
knn_visualization(heart_data_accuracies)

<span style="color:gray">***Figure 4.***</span>

Setting the number of neighbours to $K$ = 2 gives the highest accuracy, as seen in Figure 4. The shape of Figure 4 is not what we expected to see because of the large drop in accuracy after $K$ = 2. This means that there is a high risk of losing model accuracy beyond $K$ = 2. Choosing $K$ = 1 would also work in terms of accuracy; however, with so few neighbours the classifier could overfit.

Figure 4. shows that choosing a $K$ higher than two results in a drastically lower estimated accuracy. We experimented with testing the accuracy of models with higher $K$ values, but they decreased the accuracy even further. For example, when we used three neighbours instead of two, the accuracy decreased by about 15%, and it continued to decrease as $K$ increased. We also checked for spikes at much higher $K$ values, but the accuracy levelled off around 0.40.

After our initial look at Figure 4, we decided to continue our analysis despite its suspicious shape. The next step pulls the optimal $K$ value based on the above cross-validation to choose the appropriate number of neighbors.

In [None]:
# Pulling optimal K value based on above cross validation
set.seed(1)
best_k<-heart_data_accuracies%>%
    arrange(desc(mean))%>%
    slice(2)%>%
    pull(neighbors)
# best_k

Now that we have our optimal $K$ value we continued to build our classifier and started by building a new model using `best_k`. 

In [None]:
# New classifier model using optimal K values
set.seed(1)
heart_data_spec_final<- nearest_neighbor(weight_func="rectangular", neighbors=best_k)%>%
    set_engine("kknn")%>%
    set_mode("classification")

To finalize the classifier we now plug the recipe and model into a workflow so it can be used on other data in the future. This step fits the classifier to our training data, which is what gives it its predictive ability.

In [None]:
# Final workflow for classifier using new model.
set.seed(1)
heart_data_final_fit<-workflow()%>%
    add_recipe(heart_data_recipe)%>%
    add_model(heart_data_spec_final)%>%
    fit(data=heart_training)
    
#heart_data_final_fit

### Testing the Model

We pass the test set to our workflow to test the accuracy of the model. 

In [None]:
# Testing our classifier using the testing set
set.seed(1)

# Predict once, then reuse the predictions for both the metrics and the saved file
heart_data_predict <- heart_data_final_fit %>%
    predict(heart_testing) %>%
    bind_cols(heart_testing)

heart_data_summary <- heart_data_predict %>%
    metrics(truth = diagnosis_f, estimate = .pred_class) %>%
    filter(.metric == 'accuracy')
heart_data_summary

write_csv(heart_data_predict, '../data/modelling/predict_data.csv')

<span style="color:gray">***Table 6.*** *Model accuracy with testing set*</span>

The accuracy in Table 6 is reasonable: even though our predictors produce a strange $K$-nearest neighbour accuracy graph, we still built a model that diagnoses patients in the testing set with ~80% accuracy.

We know this is a reasonable accuracy for our classifier from looking at Figure 3. As discussed in the Majority Classifier section, we are looking for a classifier with higher accuracy than that of the majority classifier. In our case, the majority label made up ~30% of our dataset, so the accuracy of ~80% is more than double the majority baseline. Therefore, our classifier is reasonably good.

To examine the accuracy of the model we built in more detail, we also provide a confusion matrix, a table of predicted and correct labels, using the `conf_mat` function. This enables us to easily see the false positives and false negatives among the diagnoses, as well as what was predicted accurately.

A heatmap based on the confusion matrix was made so that the results are easier to read.
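As a rough illustration of how a confusion matrix is read for ordered severity levels, the base R sketch below builds one from made-up labels. It is not the `confusion_matrix` function sourced in the next cell, and the vectors `toy_truth` and `toy_estimate` are invented for this example.

```r
# Made-up labels for illustration only; not the report's predictions
severity_levels <- 0:4
toy_truth    <- factor(c(0, 0, 1, 1, 2, 3, 4, 4), levels = severity_levels)
toy_estimate <- factor(c(0, 1, 1, 1, 1, 3, 4, 2), levels = severity_levels)

# Rows are predictions, columns are the true labels
toy_confusion <- table(Prediction = toy_estimate, Truth = toy_truth)

# Diagonal cells: predicted severity equals true severity
correct <- sum(diag(toy_confusion))
# Below the diagonal: predicted severity is higher than the truth
over_predicted <- sum(toy_confusion[lower.tri(toy_confusion)])
# Above the diagonal: predicted severity is lower than the truth
under_predicted <- sum(toy_confusion[upper.tri(toy_confusion)])

accuracy <- correct / sum(toy_confusion)  # 5 of 8 toy labels match: 0.625
```

In this report's terms, the over-predicted cells play the role of false positives and the under-predicted cells the role of false negatives.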

In [None]:
# confusion matrix heat map visualization
source('../R/confusion_matrix.R')

confusion_matrix(heart_data_predict)

<span style="color:gray">***Figure 5.***</span>

The confusion matrix (Figure 5.) shows that 22 individuals were correctly diagnosed with no presence of heart disease, 39 were correctly diagnosed with low severity of heart disease, 22 were correctly diagnosed with medium severity of heart disease, 20 were correctly diagnosed with high severity of heart disease, and 4 were correctly diagnosed with extreme severity of heart disease. Therefore, the classifier labeled $22 + 39 + 21 + 20 + 4 = 106$ diagnoses correctly. This is a rather good result as the proportions of each heart disease class diagnosed correctly mirror the proportions seen in the dataset. Our classifier is therefore not favouring one diagnosis over the other.

Unfortunately, the classifier also made errors and classified a total of 18 patients with a false positive, meaning the patient's heart disease was predicted to be worse than it actually was. Moreover, the classifier labelled six patients with false negatives, meaning the patients' heart disease was actually worse than what was predicted. The false positive and false negative results are discussed further in the Conclusion.
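As a rough consistency check, we can take the quoted counts at face value (106 correct diagnoses, 18 false positives, and 6 false negatives) and assume that every patient in the testing set falls into exactly one of these three groups. Under that assumption the implied accuracy is:

$$
\text{accuracy} = \frac{106}{106 + 18 + 6} = \frac{106}{130} \approx 0.815
$$

This agrees with the accuracy of ~80% reported for the testing set.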

## Conclusion

A classifier built on the predictors `age`, `max_heart_rate`, and `rest_bp` achieves high accuracy on our testing set, which suggests these predictors have a good chance of accurately predicting a patient's heart disease severity. However, some results indicate that these predictors may not be the best in actual practice. For instance, Figure 1 shows that these predictors do not have strong relationships with the diagnosed severity of heart disease. Additionally, there was a sharp drop-off after two neighbours in our $K$ graph, suggesting that our accuracy results could be due to luck. However, our confusion matrix reveals promising results.

### Is this what you expected to find?

We expected to find that the higher a patient's age and resting blood pressure, and the lower their maximum heart rate, the greater their risk of more severe heart disease. We actually found that these predictors have only a weak relationship with the outcome variable. However, our model still predicts heart disease severity accurately.

From our confusion matrix in Figure 5, our classifier was more likely to produce a false positive than a false negative. This is preferable because, for a condition like heart disease that ranges in severity, it is better to receive a false positive and be treated as if the disease were worse than it is than to receive a false negative and be treated at a lower level of care than needed, or not at all. One reason we see 18 false positives could be the predictors we chose for our model. We saw in Figure 1 that these predictors are not ideal because of the weak relationships they show with the severity of heart disease. The same may explain our 6 false negatives. However, considering that the accuracy of our model is good, this is a reasonable number of false negatives. These counts also relate to our research question and to our conclusion that our selected predictors are mainly useful for a preliminary estimate of the severity of a patient's heart disease; a diagnosis cannot be concluded from these factors alone. False negatives and false positives do occur in real-world medicine, but research, and examples such as our classification results, suggest that a patient is more likely to receive a correct diagnosis than a false positive or false negative.

### What impact could such findings have?

These findings could suggest to medical professionals that these predictors are good initial tests for checking the severity of someone's heart disease, but that further testing should be done to reach a more accurate diagnosis. The severity levels of heart disease (0 to 4) are hard to distinguish using our selected predictors alone, but the predictors can be good indicators of a person having heart disease. This information could also be given to government health care facilities so that they can raise public awareness and send letters encouraging older citizens to check the status of their heart health with a doctor. In addition, individuals who want to watch their heart health could compare their resting blood pressure, age, and maximum heart rate with this model to estimate whether they have heart disease and to what severity.

### What future questions could this lead to?

What further research needs to be done on the predictors (maximum heart rate, age and resting blood pressure) to see if they are actually good predictors of heart disease? How could heart disease be lessened in the future, or how can heart disease be prevented? Patients at what age and with what symptoms should be encouraged to take precautions and see a doctor?

## References:

```{bibliography}
```