# Predicting Heart Disease with KNN Classification in Cleveland: <br> Determining the Effects of Age, Sex, Heart Rate and Cholesterol

## Introduction

Heart disease, also known as cardiovascular disease, is the leading cause of death worldwide, according to the [WHO](https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death). The term covers several conditions that affect the heart; the most common one causes blood vessels to narrow, restricting blood flow and potentially leading to a heart attack.

The predictive question we wish to answer is: <br>
***“What factors contribute the most to the presence of heart disease, and do they change with respect to age, sex, maximum heart rate, or cholesterol?”***

We will analyze the heart disease data set from the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/45/heart+disease), collected on June 30, 1988. This data set includes observations from patients in Cleveland, Hungary, Switzerland and the VA Long Beach. We will focus on the **Cleveland data set** to answer our question.


## Preliminary exploratory data analysis
Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.

In [81]:
# Libraries
library(tidyverse)
library(tidymodels)
set.seed(29)


In [82]:
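# Reading the data from the web

# The raw file has no header row, so read_csv() names the columns X1 to X14;
# we rename them to the attribute names used in the data set documentation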
heart_data <- read_csv("https://raw.githubusercontent.com/Mr-Slope/DSCI-100_Group_Project/main/processed.cleveland.data",
                      col_names=FALSE) |>
    rename(age = X1,
          sex = X2,
          cp = X3,
          trestbps = X4,
          chol = X5,
          fbs = X6,
          restecg = X7,
          thalach = X8,
          exang = X9,
          oldpeak = X10,
          slope = X11,
          ca = X12,
          thal = X13,
          num = X14)

# Drop rows with missing values (coded as "?"), then fix the column types
heart_tidy <- heart_data |>
    filter(ca != "?", thal != "?") |>
    mutate(across(c(ca, thal), as.numeric)) |>
    mutate(num = as_factor(num), # convert to factor to predict
           sex = as_factor(sex)) |>
    mutate(num = fct_recode(num, "1" = "2", "1" = "3", "1" = "4"), # in the data files, 1, 2, 3 and 4 all mean sick
           sex = fct_recode(sex, "male" = "1", "female" = "0"))

head(heart_tidy)


Rows: 303 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): X12, X13
dbl (12): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


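| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 63 | male | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | 6 | 0 |
| 67 | male | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | 3 | 1 |
| 67 | male | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | 7 | 1 |
| 37 | male | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | 3 | 0 |
| 41 | female | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | 3 | 0 |
| 56 | male | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0 | 3 | 0 |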


In [85]:
# Splitting the data
heart_split <- initial_split(heart_tidy, prop = 0.75, strata = num)
heart_training <- training(heart_split)
heart_testing <- testing(heart_split)
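
One way to check that a stratified split behaves as intended is to compare the class shares in the two subsets. The sketch below does this on a small made-up table so that it runs on its own; `toy_data`, `toy_split` and `class_shares` are hypothetical names, and the 120/80 class counts are invented for illustration, not taken from the Cleveland data.

```r
library(tidyverse)
library(tidymodels)
set.seed(29)

# Toy stand-in for heart_tidy: 200 made-up rows, 60% class "0" and 40% class "1"
toy_data <- tibble(
    id = 1:200,
    num = factor(rep(c("0", "1"), times = c(120, 80)))
)

# Same call pattern as the split above: 75% training, stratified by num
toy_split <- initial_split(toy_data, prop = 0.75, strata = num)
toy_training <- training(toy_split)
toy_testing <- testing(toy_split)

# Share of each class within each subset; stratification should keep them close
class_shares <- bind_rows(
    training = count(toy_training, num),
    testing = count(toy_testing, num),
    .id = "subset"
) |>
    group_by(subset) |>
    mutate(share = n / sum(n)) |>
    ungroup()

class_shares
```

Replacing `toy_data` with `heart_tidy` would apply the same check to the real split.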


In [89]:
# Summary Statistics
heart_mean <- heart_training |>
    select(-sex, -num) |>
    map_df(mean)

heart_summary_diagnosed <- heart_training |>
    group_by(num) |>
    summarize(count = n())

heart_summary_sex <- heart_training |>
    group_by(sex) |>
    summarize(count = n())

heart_mean
heart_summary_diagnosed
heart_summary_sex

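| age | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 54.3964 | 3.175676 | 132.009 | 244.8018 | 0.1576577 | 0.981982 | 149.5856 | 0.3423423 | 1.020721 | 1.59009 | 0.6801802 | 4.842342 |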


| num | count |
|---|---|
| `<fct>` | `<int>` |
| 0 | 120 |
| 1 | 102 |


| sex | count |
|---|---|
| `<fct>` | `<int>` |
| female | 66 |
| male | 156 |
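
The class counts above also give a simple reference point for any classifier: the accuracy of always predicting the most common class. A minimal sketch in base R, using the training counts reported above (120 patients with `num = 0`, 102 with `num = 1`); the names `class_counts`, `majority_class` and `baseline_accuracy` are ours, for illustration only.

```r
# Class counts copied from the training-set summary above
class_counts <- c("0" = 120, "1" = 102)

# The majority class is the label with the largest count
majority_class <- names(which.max(class_counts))

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy <- max(class_counts) / sum(class_counts)

majority_class
round(baseline_accuracy, 3) # 120 / 222, about 0.541
```

A tuned classifier would need to beat roughly 54% accuracy on comparable data to be doing better than this trivial rule.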


In [144]:
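# Visualizations (training data only) comparing the distributions of the
# variables we plan to use in our analysis, coloured by diagnosis (num)

# Distribution of age by diagnosis
age_histogram <- heart_training |>
    ggplot(aes(x = age, fill = num)) +
    geom_histogram(binwidth = 5, position = "identity", alpha = 0.6) +
    labs(x = "Age (years)", y = "Number of patients", fill = "Diagnosis")
age_histogram

# Age vs. serum cholesterol
age_chol <- heart_training |>
    ggplot(aes(x = age, y = chol, color = num)) +
    geom_point(alpha = 0.6) +
    labs(x = "Age (years)", y = "Serum cholesterol (mg/dl)", color = "Diagnosis")
age_chol

# Age vs. maximum heart rate achieved
age_thalach <- heart_training |>
    ggplot(aes(x = age, y = thalach, color = num)) +
    geom_point(alpha = 0.6) +
    labs(x = "Age (years)", y = "Maximum heart rate achieved", color = "Diagnosis")
age_thalach

# Diagnosis counts by sex
sex_bar <- heart_training |>
    ggplot(aes(x = num, fill = sex)) +
    geom_bar() +
    labs(x = "Diagnosis (0 = no heart disease, 1 = heart disease)", y = "Number of patients", fill = "Sex")
sex_bar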

## Methods

Data analysis & variables/columns: is this a useful variable for prediction?

Filter by:
- age: age in years
- sex: sex, coded as 0 (female) and 1 (male)
- thalach: maximum heart rate achieved
- chol: serum cholesterol (mg/dl)

Predictors:
- cp: type of chest pain
- trestbps: resting blood pressure in mmHg
- fbs: fasting blood sugar > 120 mg/dl (1 = True, 0 = False)
- restecg: resting electrocardiographic results
- exang: whether exercise induced angina (1 = True, 0 = False)
- oldpeak: ST depression induced by exercise, relative to rest
- slope: the slope of the peak exercise ST segment (1 = upslope, 2 = flat, 3 = downslope)
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal: (3 = normal, 6 = fixed defect, 7 = reversible defect)

To visualise relationships in our data, we will generate scatter plots of the numerical (non-factor) variables against each other. This will help us identify the best predictors to use and check whether they align with our predictions and expectations. We will create four plots that each compare one variable to another (age vs. sex, age vs. chol, age vs. thalach and sex vs. chol).


While exploring the data, we will use scatter plots and histograms to investigate the relationships between the numerical variables. First, the plot of age and sex sheds light on potential age disparities between the sexes, coded as 0 for females and 1 for males. The plot of age against serum cholesterol (mg/dl) shows potential correlations between age and lipid metabolism. Meanwhile, the plot of age against thalach (maximum heart rate achieved) lets us discern any age-related patterns in cardiovascular performance. Finally, the plot of sex against serum cholesterol helps identify sex-specific differences in cholesterol levels.

We will predict num from (insert variables here):
- num = 0 means that the patient does not have heart disease
- num = 1 means that the patient has heart disease
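
A KNN classifier for `num` could be tuned along the following lines. This is only a sketch under stated assumptions: it runs on simulated data (`toy_training`, not the Cleveland data), the two predictors `thalach` and `oldpeak` are placeholders because the final predictor list is still open, and the grid of `neighbors` values and the 5-fold cross-validation are illustrative choices. It also assumes the `kknn` package is installed.

```r
library(tidyverse)
library(tidymodels)
set.seed(29)

# Simulated stand-in for heart_training (made-up values, NOT the real data)
toy_training <- tibble(
    thalach = rnorm(200, mean = 150, sd = 22),
    oldpeak = runif(200, min = 0, max = 4)
) |>
    mutate(
        score = oldpeak * 20 - (thalach - 150) + rnorm(200, sd = 15),
        num = factor(if_else(score > 40, "1", "0"), levels = c("0", "1"))
    ) |>
    select(-score)

# Scale and center the predictors so that each contributes equally to distances
knn_recipe <- recipe(num ~ thalach + oldpeak, data = toy_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# KNN model specification with the number of neighbours left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# 5-fold cross-validation on the training data, stratified by the class
toy_vfold <- vfold_cv(toy_training, v = 5, strata = num)
k_grid <- tibble(neighbors = seq(from = 1, to = 15, by = 2))

# Estimate cross-validated accuracy for each candidate number of neighbours
knn_accuracies <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = toy_vfold, grid = k_grid,
              metrics = metric_set(accuracy)) |>
    collect_metrics()

# Pick the number of neighbours with the highest mean accuracy
best_k <- knn_accuracies |>
    slice_max(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

best_k
```

Tuning uses only the training data, so the test set stays untouched for a final accuracy estimate once the number of neighbours is fixed.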

## Expected outcomes and significance:

What do you expect to find? <br>
Although we will be filtering our data set with respect to multiple variables, we expect [resting electrocardiographic results](https://www.ncbi.nlm.nih.gov/books/NBK367910/#.) (restecg), exercise-induced angina (exang) and ST depression induced by exercise (oldpeak) to be among the most likely indicators of heart disease.

What impact could such findings have? <br>
These findings can help medical professionals identify patients who are potentially at risk of heart disease and treat them accordingly. Furthermore, by identifying the relationship between age, sex and risk factors (maximum heart rate achieved and serum cholesterol [mg/dl]), medical professionals can consider the appropriate predictors when running tests. This will support accurate identification and swift action when a patient is suspected of having an underlying heart disease. Moreover, understanding how the predictors' impact varies by age, sex, and risk factors can potentially debunk some myths associated with heart disease. For example, it can address concerns about the prevalence of heart disease in men and whether it is a matter of worry for the young.


What future questions could this lead to? <br>
Such findings could lead to questions being asked about the relationship between the aforementioned predictors, as well as those that our study is not considering, and the specific demographics of people. This is a critical question to ask as different demographics lead distinct lifestyles. For example, the diet of someone in Asia differs significantly from the diet of someone in North America. After all, there is a possibility that diet could be a significant predictor of heart disease. What specific age group show the hig
