# Project Proposal — Heart Disease Prediction

## 1 - Introduction

Effective diagnosis of heart disease is a global health challenge. In this proposal, we apply machine learning techniques to detect the presence of heart disease in patients, using a heart disease dataset obtained from Kaggle. The recorded diagnosis ranges from 0 (no presence) to 4. We collapse it into two classes and focus on distinguishing between the presence and absence of heart disease.

The original dataset consists of 16 variables, but we will focus on the following six: a patient identifier, four numerical predictors, and the diagnosis.

1. **Patient ID (id)**: Unique identifier for each patient.
2. **Age (age)**: Age of the patient in years.
3. **Resting Blood Pressure (trestbps)**: Resting blood pressure in mm Hg.
4. **Serum Cholesterol (chol)**: Serum cholesterol level in mg/dL.
5. **Maximum Heart Rate (thalch)**: Maximum heart rate achieved.
6. **Heart Disease (presence)**: Diagnosis representing the degree of blockage or narrowing in major vessels.
   - P (present): > 50% diameter narrowing
   - N (not present): < 50% diameter narrowing

We aim to answer the following question:

#### Can a K-nearest neighbors (KNN) classifier effectively diagnose heart disease using age, maximum heart rate, resting blood pressure, and serum cholesterol?


## 2 - Preliminary exploratory data analysis

### 2.a. Data Cleaning, Wrangling and Table Summary

In [1]:
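library(tidyverse)   # also attaches dplyr, so a separate library(dplyr) is not needed
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

# Helper scripts: source them only when present, because source() stops with
# "cannot open the connection" if the file is missing from the working directory
if (file.exists("tests.R")) source("tests.R")
if (file.exists("cleanup.R")) source("cleanup.R")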



In [None]:

# Load the heart disease data. As suggested by the professor, we downloaded the
# CSV file from Kaggle. It is read here from the project's GitHub repository.

url <- "https://raw.githubusercontent.com/MAmouzouvi/dsci-100-2022w1-group-10/main/data/heart_disease_uci.csv"

heart_disease_uci <- read_csv(url)

# Rename 'num' to 'presence', keep only the columns we need,
# and recode presence as a two-level factor (N = absent, P = present)
heart_disease_data <- heart_disease_uci |> 
         rename(presence = num) |>
         select(id, age, trestbps, chol, thalch, presence) |>
         mutate(presence = ifelse(presence == 0, "N", "P"),
                presence = as.factor(presence))

# Preview the modified dataset
heart_disease_data

set.seed(3456)

# Split the heart disease data into training and testing sets 
heart_disease_split <- initial_split(heart_disease_data, prop = 0.75, strata = presence)  
heart_disease_train <- training(heart_disease_split)   
heart_disease_test <- testing(heart_disease_split)

# Number of non-missing observations for each predictor
observations_per_variable_tbl <- heart_disease_train |>
  summarize(across(age:thalch, ~ sum(!is.na(.)))) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "number_of_observations")

# Number of observations in each class of presence (N vs P)
presence_observations_tbl <- heart_disease_train |>
    group_by(presence) |>
    summarize(count = n())

# Mean of each predictor variable, ignoring missing values
mean_of_predictors_tbl <- heart_disease_train |>
  summarize(across(age:thalch, ~ mean(., na.rm = TRUE))) |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Mean")

# Number of rows with at least one missing value
rows_with_missing_data <- sum(!complete.cases(heart_disease_train))

observations_per_variable_tbl
presence_observations_tbl
mean_of_predictors_tbl
rows_with_missing_data

### 2.b. Visualization

In [None]:


# standardizing the data
heart_disease_train_recipe  <- recipe(presence ~ age + trestbps + chol + thalch ,data = heart_disease_train) |>
                       step_center(all_predictors()) |>
                       step_scale(all_predictors())

heart_disease_train_scaled <- heart_disease_train_recipe |>  
                           prep() |> 
                           bake(heart_disease_train)



# Scatter plot of age vs. resting blood pressure (trestbps),
# colored by presence of heart disease (both axes are standardized)

options(repr.plot.width = 8, repr.plot.height = 6)

ggplot(heart_disease_train_scaled, aes(x = age, y = trestbps, color = presence)) +
  geom_point(size = 1.2) +  # Adjust the size value as needed
  labs(title = "Age vs Resting Blood Pressure by Heart Disease Presence",
       x = "Age (standardized)",
       y = "Resting Blood Pressure (standardized)",
       color = "Presence")

In [None]:
# Scatter plot of serum cholesterol vs. maximum heart rate (thalch),
# colored by presence of heart disease (both axes are standardized)

ggplot(heart_disease_train_scaled, aes(x = chol, y = thalch, color = presence)) +
  geom_point(size = 1.2) +  # Adjust the size value as needed
  labs(title = "Serum Cholesterol vs Maximum Heart Rate by Heart Disease Presence",
       x = "Serum Cholesterol (standardized)",
       y = "Maximum Heart Rate (standardized)",
       color = "Presence")

In [None]:
# Scatter plot of resting blood pressure vs. maximum heart rate (thalch),
# colored by presence of heart disease (both axes are standardized)

ggplot(heart_disease_train_scaled, aes(x = trestbps, y = thalch, color = presence)) +
  geom_point(size = 1.2) +  # Adjust the size value as needed
  labs(title = "Resting Blood Pressure vs Maximum Heart Rate by Heart Disease Presence",
       x = "Resting Blood Pressure (standardized)",
       y = "Maximum Heart Rate (standardized)",
       color = "Presence")

## 3 - Method


First, we will import the dataset and preprocess it. This includes handling any missing values, addressing outliers, and standardizing our four numerical predictors: age (age), maximum heart rate achieved (thalch), resting blood pressure (trestbps), and serum cholesterol (chol).
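The preprocessing described here can be illustrated with a minimal base R sketch. It assumes that incomplete rows are simply dropped (imputation would be an alternative) and that the centering and scaling constants are estimated from the training rows only. The data frames `toy_train` and `toy_test` are made-up values for illustration, not rows from the Kaggle file.

```r
# Toy training and testing frames (hypothetical values, for illustration only)
toy_train <- data.frame(age = c(40, 50, 60, NA), chol = c(180, 220, 260, 240))
toy_test  <- data.frame(age = c(55, 45), chol = c(200, 300))

# One simple option for missing values: keep only complete rows
toy_train <- toy_train[complete.cases(toy_train), ]

# Estimate the mean and standard deviation of each predictor from the training rows
train_means <- colMeans(toy_train)
train_sds   <- apply(toy_train, 2, sd)

# Apply the same constants to both sets so the test set does not influence the scaling
train_scaled <- scale(toy_train, center = train_means, scale = train_sds)
test_scaled  <- scale(toy_test,  center = train_means, scale = train_sds)

train_scaled
test_scaled
```

Standardizing matters for KNN because distances are otherwise dominated by the predictor with the largest numeric range, such as cholesterol in mg/dL compared with age in years.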

Next, we will split the data into training and testing sets to develop and train our classification model. We will use the K-nearest neighbors algorithm, choosing the value of K through cross-validation on the training set.
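A rough sketch of how K might be chosen with tidymodels is given here. It is not the final analysis: `toy_train` is a synthetic stand-in for the real training set, and the five folds and the grid of odd K values from 1 to 21 are assumptions made for illustration. It also assumes the `kknn` package is installed.

```r
library(tidymodels)

set.seed(3456)

# Synthetic stand-in for the training data (hypothetical values, NOT the real dataset)
n_obs <- 200
toy_train <- tibble(
  age      = rnorm(n_obs, mean = 54, sd = 9),
  trestbps = rnorm(n_obs, mean = 132, sd = 18),
  chol     = rnorm(n_obs, mean = 200, sd = 50),
  thalch   = rnorm(n_obs, mean = 138, sd = 25)
)
noise <- rnorm(n_obs, sd = 8)
toy_train$presence <- factor(ifelse(toy_train$age + noise > 54, "P", "N"))

# Standardize inside the recipe so each fold is scaled using its own training part
knn_recipe <- recipe(presence ~ age + trestbps + chol + thalch, data = toy_train) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

# KNN specification with the number of neighbors left open for tuning
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 5-fold cross-validation, stratified by the class label
folds  <- vfold_cv(toy_train, v = 5, strata = presence)
k_grid <- tibble(neighbors = seq(1, 21, by = 2))

cv_results <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  tune_grid(resamples = folds, grid = k_grid) |>
  collect_metrics() |>
  filter(.metric == "accuracy")

# K with the highest mean cross-validated accuracy
best_k <- cv_results |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)
best_k
```

Plotting `mean` against `neighbors` from `cv_results` is a common way to check that the chosen K sits on a stable part of the accuracy curve and is not an isolated peak.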

After standardizing our predictors, we will create scatterplots or colour-coded bar charts showing the presence (P) or absence (N) of heart disease. Through visualization, we aim to discern any patterns or correlations between pairs of variables (such as age and serum cholesterol), which will help us understand how each relates to a heart disease diagnosis.

Subsequently, we will fit the K-nearest neighbors model to the training data and assess its performance on the testing data, using metrics such as accuracy and recall summarized in a confusion matrix. These tools will enable us to evaluate how accurately the classifier predicts the presence of heart disease.
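To make the evaluation step concrete, the following base R sketch computes accuracy, precision, and recall from a confusion matrix. The `truth` and `predicted` vectors are hypothetical labels for ten test patients, invented for illustration. They are not model output.

```r
# Hypothetical true and predicted labels for ten test patients (illustrative only)
truth     <- factor(c("P", "P", "P", "P", "P", "N", "N", "N", "N", "N"),
                    levels = c("N", "P"))
predicted <- factor(c("P", "P", "P", "P", "N", "N", "N", "N", "P", "P"),
                    levels = c("N", "P"))

# Confusion matrix: rows are predictions, columns are the true classes
conf_mat <- table(Predicted = predicted, Truth = truth)

tp <- conf_mat["P", "P"]  # disease present and predicted present
fp <- conf_mat["P", "N"]  # disease absent but predicted present
fn <- conf_mat["N", "P"]  # disease present but predicted absent
tn <- conf_mat["N", "N"]  # disease absent and predicted absent

accuracy  <- (tp + tn) / sum(conf_mat)  # share of all predictions that are correct
precision <- tp / (tp + fp)             # share of predicted P that are truly P
recall    <- tp / (tp + fn)             # share of true P that are detected (sensitivity)

conf_mat
c(accuracy = accuracy, precision = precision, recall = recall)
```

In a diagnostic setting, recall is often the metric to watch most closely, since a false negative means a patient with heart disease goes undetected.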


## 4 - Expected Outcomes and Significance

Our group expects to determine whether K-nearest neighbors classification is an effective method for diagnosing heart disease based on our predictors. We also expect to evaluate the model's performance in terms of accuracy, precision, and recall (sensitivity).

Impact of the Project Findings:
If KNN proves to be effective, it could contribute to quicker and more accurate diagnosis of heart disease and help professionals make better-informed treatment decisions. This could in turn reduce healthcare expenses by decreasing unnecessary procedures.

Future Questions:
- Are the variables considered in our proposal the most informative for heart disease diagnosis, or are there other key features that should be included?
- How does KNN performance compare to other machine learning algorithms or traditional diagnostic methods in diagnosing heart disease?


