# Classification of Heart Disease Using K-Nearest Neighbor

## Introduction
- Literature Review: background info on heart disease
- Research Questions
- About the Dataset used

Many people die of cardiovascular diseases without knowing they have a problem with their heart. Unexpected deaths as such can be prevented with early diagnoses of cardiovascular issues and proper medication. How can the presence of heart diseases be detected in patients? The heart disease data set we are using contains 14 attributes collected from Cleveland and is still used my ML researchers to this day.  

## Preliminary Exploratory Data Analysis
#### 1. Download and Read the dataset from the web (use URL)

In [1]:
# Call packages that will be used.
library(tidyverse)
library(tidymodels)

── [1mAttaching packages[22m ─────────────────────────────────────── tidyverse 1.3.0 ──

[32m✔[39m [34mggplot2[39m 3.3.2     [32m✔[39m [34mpurrr  [39m 0.3.4
[32m✔[39m [34mtibble [39m 3.0.3     [32m✔[39m [34mdplyr  [39m 1.0.2
[32m✔[39m [34mtidyr  [39m 1.1.2     [32m✔[39m [34mstringr[39m 1.4.0
[32m✔[39m [34mreadr  [39m 1.3.1     [32m✔[39m [34mforcats[39m 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()

“package ‘tidymodels’ was built under R version 4.0.2”
── [1mAttaching packages[22m ────────────────────────────────────── tidymodels 0.1.1 ──

[32m✔

In [2]:
# Set the value of seed to ensure reproducibility
set.seed(200)

In [8]:
# To download the dataset from the web
url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data'
download.file(url, ,destfile='HeartDisease_Cleveland.csv')

In [9]:
# Read the 'HeartDisease_Cleveland.csv' file into a dataframe
heart_cleve <- read_csv('HeartDisease_Cleveland.csv',col_names=FALSE)
head(heart_cleve)

Parsed with column specification:
cols(
  X1 = [32mcol_double()[39m,
  X2 = [32mcol_double()[39m,
  X3 = [32mcol_double()[39m,
  X4 = [32mcol_double()[39m,
  X5 = [32mcol_double()[39m,
  X6 = [32mcol_double()[39m,
  X7 = [32mcol_double()[39m,
  X8 = [32mcol_double()[39m,
  X9 = [32mcol_double()[39m,
  X10 = [32mcol_double()[39m,
  X11 = [32mcol_double()[39m,
  X12 = [31mcol_character()[39m,
  X13 = [31mcol_character()[39m,
  X14 = [32mcol_double()[39m
)



X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>
63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0
56,1,2,120,236,0,0,178,0,0.8,1,0.0,3.0,0


#### 2. Clean and Wrangle data into tidy format

In [10]:
# Clean and Wrangle the data into tidy format
heart_cleve <- rename(heart_cleve, age = X1,
               sex = X2,
               cp = X3,
               trestbps = X4,
               chol = X5,
               fbs = X6,
               restecg = X7,
               thalach = X8,
               exang = X9,
               oldpeak = X10,
               slope = X11,
               ca = X12,
               thal = X13,
               num = X14) %>% 
            mutate(num = as_factor(num),ca = as.numeric(ca),thal = as.numeric(ca))
head(heart_cleve)

“Problem with `mutate()` input `ca`.
[34mℹ[39m NAs introduced by coercion
[34mℹ[39m Input `ca` is `as.numeric(ca)`.”
“NAs introduced by coercion”


age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
63,1,1,145,233,1,2,150,0,2.3,3,0,0,0
67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
67,1,4,120,229,0,2,129,1,2.6,2,2,2,1
37,1,3,130,250,0,0,187,0,3.5,3,0,0,0
41,0,2,130,204,0,2,172,0,1.4,1,0,0,0
56,1,2,120,236,0,0,178,0,0.8,1,0,0,0


#### 4. Summary statistics about the training data (use tables)

In [11]:
# Split the dataset into training and test sets

heart_split <- initial_split(heart_cleve,prop=0.75,strata=num)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

In [12]:
# Summary Statistics about the training data

# number of observations in each class
heart_sum_class <- heart_train %>% 
                group_by(num) %>%
                summarise(n=n())
heart_sum

`summarise()` ungrouping output (override with `.groups` argument)



ERROR: Error in eval(expr, envir, enclos): object 'heart_sum' not found


In [13]:
# means of predictor variables used in analysis
heart_mean <- heart_train%>% 
                    select(-num) %>%
                    map_df(mean,na.rm = TRUE)
heart_min <- heart_train %>%
                    select(-num) %>%
                    map_df(min,na.rm=TRUE)
heart_max <- heart_train %>%
                    select(-num) %>%
                    map_df(max,na.rm=TRUE)
heart_median <- heart_train %>%
                    select(-num) %>%
                    map_df(median,na.rm=TRUE)

heart_sum_pred <- rbind(heart_min,heart_median,heart_mean,heart_max) %>%
                    mutate(Statistics=c('Min','Median','Mean','Max'),.before=age)
heart_sum_pred

Statistics,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Min,29.0,0.0,1.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,1.0,0.0,0.0
Median,55.0,1.0,3.0,130.0,239.0,0.0,1.0,153.0,0.0,0.8,2.0,0.0,0.0
Mean,54.36245,0.6768559,3.19214,132.0917,245.7424,0.1528384,1.0131,149.4017,0.30131,1.046725,1.598253,0.6371681,0.6371681
Max,77.0,1.0,4.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,3.0,3.0,3.0


In [14]:
# number of rows with missing data
any(is.na(heart_train[12]))

heart_missing <-0
for (n in 1: nrow(heart_train)) {
    if (any(is.na(slice(heart_train,n)))==TRUE) {
        heart_missing <- heart_missing +1 }
        }

heart_missing

In [15]:
summary(heart_train)

      age             sex               cp           trestbps    
 Min.   :29.00   Min.   :0.0000   Min.   :1.000   Min.   : 94.0  
 1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:120.0  
 Median :55.00   Median :1.0000   Median :3.000   Median :130.0  
 Mean   :54.36   Mean   :0.6769   Mean   :3.192   Mean   :132.1  
 3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:140.0  
 Max.   :77.00   Max.   :1.0000   Max.   :4.000   Max.   :200.0  
                                                                 
      chol            fbs            restecg         thalach     
 Min.   :126.0   Min.   :0.0000   Min.   :0.000   Min.   : 71.0  
 1st Qu.:209.0   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:133.0  
 Median :239.0   Median :0.0000   Median :1.000   Median :153.0  
 Mean   :245.7   Mean   :0.1528   Mean   :1.013   Mean   :149.4  
 3rd Qu.:274.0   3rd Qu.:0.0000   3rd Qu.:2.000   3rd Qu.:166.0  
 Max.   :564.0   Max.   :1.0000   Max.   :2.000   Max.   :202.0  
          

#### 4. Visualize the training data (use plots)
    - compare the distributions of each of the predictor variables used in the analysis

In [16]:
op
heart_plot <- heart_train %>%
ggplot(aes(x=age, y=trestbps, colour = thal)) +
geom_point() +
labs(x = "Age", y = "Resting Blood Pressure Upon Admission to Hospital (mm Hg)", colour = "Type of Heart Defect") 

heart_plot

ERROR: Error in eval(expr, envir, enclos): object 'op' not found


## Methods
- Explain how to conduct the data analysis
- Explain which variables you will use
- Describe at least one way you will visualize the results

## Expected Outcomes and Significance
- What do you expect to find?
- What impact could such findings have?
- What future questions could this lead to?