# Classification of Heart Disease Using K-Nearest Neighbor

## Introduction
- Literature Review: background info on heart disease
- Research Questions
- About the Dataset used

Many people die of cardiovascular disease without knowing they have a heart problem. Such unexpected deaths can often be prevented through early diagnosis of cardiovascular issues and appropriate medication. This leads to our research question: how can the presence of heart disease be detected in patients? The heart disease data set we use contains 14 attributes collected in Cleveland and is still used by ML researchers to this day.

## Preliminary Exploratory Data Analysis
#### 1. Download and read the dataset from the web (using its URL)

In [25]:
# Call packages that will be used.
library(tidyverse)
library(tidymodels)
library(GGally)

In [26]:
# Set the value of seed to ensure reproducibility
set.seed(200)

In [27]:
# To download the dataset from the web
url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data'
download.file(url,destfile='HeartDisease_Cleveland.csv')

In [28]:
# Read the 'HeartDisease_Cleveland.csv' file into a dataframe
heart_cleve <- read_csv('HeartDisease_Cleveland.csv',col_names=FALSE)
head(heart_cleve)

Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_double(),
  X7 = col_double(),
  X8 = col_double(),
  X9 = col_double(),
  X10 = col_double(),
  X11 = col_double(),
  X12 = col_character(),
  X13 = col_character(),
  X14 = col_double()
)



X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>
63,1,1,145,233,1,2,150,0,2.3,3,0.0,6.0,0
67,1,4,160,286,0,2,108,1,1.5,2,3.0,3.0,2
67,1,4,120,229,0,2,129,1,2.6,2,2.0,7.0,1
37,1,3,130,250,0,0,187,0,3.5,3,0.0,3.0,0
41,0,2,130,204,0,2,172,0,1.4,1,0.0,3.0,0
56,1,2,120,236,0,0,178,0,0.8,1,0.0,3.0,0
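
The `col_character()` guesses for `X12` and `X13` suggest that those two columns contain some non-numeric entries. Assuming these entries are `?` placeholders for unknown values (an assumption; inspect the raw file to confirm), passing `na = "?"` to `read_csv()` would let both columns parse as numeric at read time, and the column names could be supplied in the same call. The sketch below uses a temporary file with two rows rather than the real download; the second row is hypothetical.

```r
library(readr)

# Two rows in the same layout as the downloaded file.
# The second row is hypothetical and uses "?" for an unknown `ca` value.
tmp <- tempfile(fileext = ".csv")
writeLines(c("63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0",
             "52.0,1.0,3.0,138.0,223.0,0.0,0.0,169.0,0.0,0.0,1.0,?,3.0,0"),
           tmp)

# Column names used later in this notebook
heart_cols <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "diagnosis")

# Treat "?" as missing so every column can be parsed as a number
heart_example <- read_csv(tmp, col_names = heart_cols, na = "?")
heart_example
```

With this approach the later `as.numeric()` conversions, and the coercion warnings they produce, would no longer be needed.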


#### 2. Clean and Wrangle data into tidy format

In [29]:
# Rename the column names and format the column types
heart_cleve <- rename(heart_cleve, age = X1,
               sex = X2,
               cp = X3,
               trestbps = X4,
               chol = X5,
               fbs = X6,
               restecg = X7,
               thalach = X8,
               exang = X9,
               oldpeak = X10,
               slope = X11,
               ca = X12,
               thal = X13,
               diagnosis = X14) %>% 
            mutate(diagnosis = as_factor(diagnosis),ca = as.numeric(ca),thal = as.numeric(thal))
head(heart_cleve)

“Problem with `mutate()` input `ca`.
ℹ NAs introduced by coercion
ℹ Input `ca` is `as.numeric(ca)`.”
“NAs introduced by coercion”
“Problem with `mutate()` input `thal`.
ℹ NAs introduced by coercion
ℹ Input `thal` is `as.numeric(thal)`.”
“NAs introduced by coercion”


age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,diagnosis
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
67,1,4,160,286,0,2,108,1,1.5,2,3,3,2
67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
41,0,2,130,204,0,2,172,0,1.4,1,0,3,0
56,1,2,120,236,0,0,178,0,0.8,1,0,3,0


In [31]:
# Recode the `diagnosis` column so that 0 indicates no heart disease and 1 indicates a diagnosis of heart disease (original values 1 to 4).

heart_cleve$diagnosis[heart_cleve$diagnosis==2] <- 1
heart_cleve$diagnosis[heart_cleve$diagnosis==3] <- 1
heart_cleve$diagnosis[heart_cleve$diagnosis==4] <- 1
head(heart_cleve)

# Source: https://www.geeksforgeeks.org/how-to-replace-specific-values-in-column-in-r-dataframe/

age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,diagnosis
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
67,1,4,160,286,0,2,108,1,1.5,2,3,3,1
67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
41,0,2,130,204,0,2,172,0,1.4,1,0,3,0
56,1,2,120,236,0,0,178,0,0.8,1,0,3,0
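
The recoding above can also be written as a single `mutate()` step that builds a two-level factor directly, which avoids carrying the unused levels 2, 3, and 4 into later modelling steps. This is a minimal sketch on hypothetical values; the tibble `example` is not part of the project data.

```r
library(dplyr)

# Hypothetical diagnosis codes in the range 0 to 4
example <- tibble(diagnosis = c(0, 2, 1, 0, 3, 4))

# Any value above 0 becomes 1; the factor is created with exactly two levels
example_binary <- example %>%
  mutate(diagnosis = factor(if_else(diagnosis > 0, 1, 0), levels = c(0, 1)))

levels(example_binary$diagnosis)
```

In the notebook this step would have to run on the numeric column, before the conversion with `as_factor()`.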


#### 3. Summary statistics about the training data

In [22]:
# Split the dataset into training and test sets

heart_split <- initial_split(heart_cleve,prop=0.75,strata=diagnosis)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)

In [23]:
# Summary Statistics about the training data

# number of observations in each class
heart_sum_class <- heart_train %>% 
                group_by(diagnosis) %>%
                summarise(n=n())
heart_sum_class

`summarise()` ungrouping output (override with `.groups` argument)

diagnosis,n
<fct>,<int>
0,123
1,105
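
The class counts above are easier to compare as percentages. The sketch below re-enters the two counts from the table by hand; the tibble `class_counts` is built here only for illustration.

```r
library(dplyr)

# Counts copied from the table above
class_counts <- tibble(diagnosis = factor(c(0, 1)), n = c(123, 105))

# Share of each class in the training set, in percent
class_props <- class_counts %>%
  mutate(percent = 100 * n / sum(n))
class_props
```

With these counts the training set is roughly 54% class 0 and 46% class 1, so the two classes are fairly balanced.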


In [24]:
# summary statistics of predictor variables used in analysis
options(digits=2)

# keep the predictors only (drop the class label)
heart_predictors <- heart_train %>%
                    select(-diagnosis)

# quantiles of every predictor, one row per variable
heart_percentile <- heart_predictors %>%
                    map_df(quantile,na.rm=TRUE)

heart_1stQ <- heart_percentile %>%
                select('25%') %>%
                t() %>%
                as.vector()

heart_3rdQ <- heart_percentile %>%
                select('75%') %>%
                t() %>%
                as.vector()

heart_mean <- heart_predictors %>%
                map_df(mean,na.rm=TRUE)

heart_min <- heart_predictors %>%
                map_df(min,na.rm=TRUE)

heart_max <- heart_predictors %>%
                map_df(max,na.rm=TRUE)

heart_median <- heart_predictors %>%
                map_df(median,na.rm=TRUE)

heart_range <- heart_max - heart_min

# number of missing values in each predictor
heart_missing <- colSums(is.na(heart_predictors))
# Source: https://stackoverflow.com/questions/26273663/r-how-to-total-the-number-of-na-in-each-col-of-data-frame

# combine all statistics into one table, one row per statistic
heart_sum_pred <- rbind(heart_min, heart_1stQ, heart_median, heart_mean, heart_3rdQ, heart_max, heart_range, heart_missing) %>%
                    mutate(Statistics=c('Min','First Quartile','Median','Mean','Third Quartile','Max','Range','# of Missing Value'),.before=age)
heart_sum_pred

Statistics,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Min,29,0.0,1.0,94,126,0.0,0,71,0.0,0.0,1.0,0.0,3.0
First Quartile,48,0.0,3.0,120,210,0.0,0,139,0.0,0.0,1.0,0.0,3.0
Median,56,1.0,3.5,130,238,0.0,1,154,0.0,0.75,2.0,0.0,3.0
Mean,55,0.67,3.2,132,245,0.14,1,150,0.33,1.01,1.6,0.64,4.7
Third Quartile,61,1.0,4.0,140,274,0.0,2,168,1.0,1.6,2.0,1.0,7.0
Max,77,1.0,4.0,200,564,1.0,2,202,1.0,6.2,3.0,3.0,7.0
Range,48,1.0,3.0,106,438,1.0,2,131,1.0,6.2,2.0,3.0,4.0
# of Missing Value,0,0.0,0.0,0,0,0.0,0,0,0.0,0.0,0.0,3.0,2.0
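
The same statistics can be computed in one pipeline by reshaping the predictors to long format and summarising per variable. This is a minimal sketch, not the code that produced the table above. The tibble `toy_train` is a hypothetical stand-in with three predictors, and its `NA` in `ca` was inserted for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the training data
toy_train <- tibble(
  age = c(63, 67, 67, 37, 41, 56),
  chol = c(233, 286, 229, 250, 204, 236),
  ca = c(0, 3, 2, 0, NA, 0),
  diagnosis = factor(c(0, 1, 1, 0, 0, 0))
)

toy_summary <- toy_train %>%
  select(-diagnosis) %>%
  # one row per (variable, value) pair
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  group_by(variable) %>%
  summarise(
    min = min(value, na.rm = TRUE),
    q1 = quantile(value, 0.25, na.rm = TRUE),
    median = median(value, na.rm = TRUE),
    mean = mean(value, na.rm = TRUE),
    q3 = quantile(value, 0.75, na.rm = TRUE),
    max = max(value, na.rm = TRUE),
    n_missing = sum(is.na(value))
  ) %>%
  mutate(range = max - min)
toy_summary
```

The result has one row per variable instead of one row per statistic, which keeps every column numeric and avoids combining separately computed objects with `rbind()`.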


In [21]:
# summary(heart_train)

#### 4. Visualize the training data
- Compare the distributions of each of the predictor variables used in the analysis

In [20]:
# ggpairs(heart_train)
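
One possible way to compare predictor distributions is a set of faceted histograms coloured by class. This is a sketch under stated assumptions: `toy_train` is a hypothetical stand-in for the training data, and the three predictors and the bin count are arbitrary choices for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for the training data
toy_train <- tibble(
  age = c(63, 67, 67, 37, 41, 56),
  chol = c(233, 286, 229, 250, 204, 236),
  thalach = c(150, 108, 129, 187, 172, 178),
  diagnosis = factor(c(0, 1, 1, 0, 0, 0))
)

# Long format: one row per (observation, predictor) pair
toy_long <- toy_train %>%
  pivot_longer(-diagnosis, names_to = "predictor", values_to = "value")

# One histogram panel per predictor, with overlaid classes
dist_plot <- ggplot(toy_long, aes(x = value, fill = diagnosis)) +
  geom_histogram(bins = 10, alpha = 0.6, position = "identity") +
  facet_wrap(~ predictor, scales = "free") +
  labs(x = "Predictor value", y = "Count", fill = "Diagnosis")
dist_plot
```

Free scales per panel matter here because the predictors are measured in very different units.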

## Methods
- Explain how to conduct the data analysis
- Explain which variables you will use
- Describe at least one way you will visualize the results
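
As a starting point for the points above, a K-nearest neighbor classifier could be fitted with tidymodels roughly as follows. This is a minimal sketch under several assumptions that do not come from the project itself: the data are synthetic, only two predictors are used, `neighbors = 5` is an arbitrary choice rather than a tuned value, and the `kknn` package is assumed to be installed.

```r
library(tidymodels)

set.seed(200)

# Synthetic stand-in data (hypothetical values, not the Cleveland data)
toy <- tibble(
  age = c(rnorm(50, 50, 8), rnorm(50, 60, 8)),
  thalach = c(rnorm(50, 160, 15), rnorm(50, 135, 15)),
  diagnosis = factor(rep(c(0, 1), each = 50))
)

# Stratified split, as in the exploratory analysis
toy_split <- initial_split(toy, prop = 0.75, strata = diagnosis)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# KNN is distance based, so predictors are scaled and centred first
knn_recipe <- recipe(diagnosis ~ age + thalach, data = toy_train) %>%
  step_scale(all_predictors()) %>%
  step_center(all_predictors())

# Unweighted KNN classifier with an arbitrary k
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_fit <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_spec) %>%
  fit(data = toy_train)

# Predict on the held-out rows and compute accuracy
knn_preds <- predict(knn_fit, toy_test) %>%
  bind_cols(toy_test)

knn_accuracy <- knn_preds %>%
  metrics(truth = diagnosis, estimate = .pred_class) %>%
  filter(.metric == "accuracy")
knn_accuracy
```

In the actual analysis the number of neighbors would normally be chosen by cross-validation on the training set rather than fixed in advance.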

## Expected Outcomes and Significance
- What do you expect to find?
- What impact could such findings have?
- What future questions could this lead to?