# Project Proposal — Heart Disease Prediction

## Introduction

Cardiovascular diseases, particularly heart disease, pose a significant global health challenge, and accurate diagnosis is crucial for effective patient care. In this proposal, we ask whether machine learning techniques can help diagnose heart disease. Specifically, we will attempt to detect the presence of heart disease in patients using a heart disease dataset obtained from Kaggle. In this dataset, the diagnosis variable (num) ranges from 0 (no disease) to 4. We focus on distinguishing between the absence (value 0) and the presence (values 1, 2, 3, 4) of heart disease.

The dataset consists of 16 variables. For this project, we focus on a subset: a patient identifier, five numeric predictors, and the diagnosis. The selected variables are described below:

1. **Id (id)**: Patient identifier.
2. **Age (age)**: Age of the patient.
3. **Resting Blood Pressure (trestbps)**: Measured in mm Hg upon hospital admission.
4. **Serum Cholesterol (chol)**: Serum cholesterol level in mg/dL.
5. **Maximum Heart Rate (thalch)**: Maximum heart rate achieved.
6. **Number of Major Vessels (ca)**: Number of major vessels (0-3) colored by fluoroscopy.
7. **Diagnosis of Heart Disease (presence)**: Angiographic disease status, representing the degree of blockage or narrowing in major vessels.
   - Value N (not present): < 50% diameter narrowing
   - Value P (present): > 50% diameter narrowing (in any major vessel; attributes 59 through 68 represent vessels)


We aim to answer the following question:

#### Can K-nearest neighbors (KNN) classification accurately diagnose heart disease using only age, resting blood pressure, serum cholesterol, maximum heart rate, and number of major vessels?
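
To make the question concrete, here is a minimal sketch of how a KNN classifier could be fit and scored with tidymodels. It is an illustration under assumptions, not our final analysis: the `toy` data frame is synthetic (randomly generated, not the Kaggle data), only two predictors are used, `neighbors = 5` is an arbitrary placeholder rather than a tuned value, and the `kknn` package is assumed to be installed.

```r
library(tidymodels)

set.seed(1)

# Synthetic stand-in data (hypothetical values, NOT the Kaggle dataset)
toy <- tibble(
  trestbps = c(rnorm(40, 125, 10), rnorm(40, 140, 10)),
  chol     = c(rnorm(40, 200, 30), rnorm(40, 250, 30)),
  presence = factor(rep(c("N", "P"), each = 40))
)

# Stratified train/test split, mirroring the 75/25 split used in this proposal
toy_split <- initial_split(toy, prop = 0.75, strata = presence)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Standardize predictors so that no single variable dominates the distance
knn_recipe <- recipe(presence ~ trestbps + chol, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# KNN model specification; neighbors = 5 is an assumed placeholder
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

# Fit the recipe and model together on the training set
knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_train)

# Predict on the held-out rows and compute accuracy
toy_predictions <- predict(knn_fit, toy_test) |>
  bind_cols(toy_test)

toy_accuracy <- toy_predictions |>
  metrics(truth = presence, estimate = .pred_class) |>
  filter(.metric == "accuracy")

toy_accuracy
```

In the real analysis, the number of neighbors would more likely be chosen by cross-validation on the training set than fixed in advance.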


### Preliminary exploratory data analysis

In [1]:
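# Load the packages used throughout the analysis
library(tidyverse)
library(repr)
library(tidymodels)

# Limit data frame previews to 6 rows
options(repr.matrix.max.rows = 6)

# Helper scripts; both files must be in the working directory,
# otherwise source() stops with "cannot open the connection"
source('tests.R')
source('cleanup.R')

# Optional plotting helpers (currently unused)
# install.packages("cowplot")
# library(cowplot)
# library(scales)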

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.2     ✔ rsample      1.1.1
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.4     ✔ workflows    1.1.2

ERROR: Error in file(filename, "r", encoding = encoding): cannot open the connection


In [None]:
# install.packages("readr")
# library(readr)

In [39]:

# We use the heart disease data. As suggested by the professor, we downloaded
# the CSV file from Kaggle and uploaded it to our notebook.
file_path <- "data/heart_disease_uci.csv"

heart_disease_uci <- read_csv(file_path)

# Keep only the columns we need and recode the diagnosis as a two-level factor:
# "N" (num == 0, disease absent) and "P" (num 1-4, disease present)
heart_disease_data <- heart_disease_uci |>
    rename(presence = num) |>
    select(id, age, sex, trestbps, chol, thalch, ca, presence) |>
    mutate(presence = ifelse(presence == 0, "N", "P"),
           presence = as.factor(presence))

# Preview the modified dataset
heart_disease_data

set.seed(3456)

# Split the data into training (75%) and testing (25%) sets, stratified by presence
heart_disease_split <- initial_split(heart_disease_data, prop = 0.75, strata = presence)
heart_disease_train <- training(heart_disease_split)
heart_disease_test <- testing(heart_disease_split)

# Number of training observations in each class
presence_observations_table <- heart_disease_train |>
    group_by(presence) |>
    summarize(count = n())

# Mean of each numeric predictor in the training set (missing values ignored)
predictors_mean_table <- heart_disease_train |>
    select(age, trestbps, chol, thalch, ca) |>
    map_df(mean, na.rm = TRUE)

predictors_mean_table

# Number of training rows with at least one missing value
missing_number_table <- sum(rowSums(is.na(heart_disease_train)) > 0)

missing_number_table

Rows: 920 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): sex, dataset, cp, restecg, slope, thal
dbl (8): id, age, trestbps, chol, thalch, oldpeak, ca, num
lgl (2): fbs, exang

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


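| id | age | sex | trestbps | chol | thalch | ca | presence |
|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 63 | Male | 145 | 233 | 150 | 0 | N |
| 2 | 67 | Male | 160 | 286 | 108 | 3 | P |
| 3 | 67 | Male | 120 | 229 | 129 | 2 | P |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 918 | 55 | Male | 122 | 223 | 100 | NA | P |
| 919 | 58 | Male | NA | 385 | NA | NA | N |
| 920 | 62 | Male | 120 | 254 | 93 | NA | P |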
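
The preview shows that some patients have no recorded value for `trestbps`, `thalch`, or `ca`, and distance-based methods such as KNN need complete predictor values. As a minimal sketch, assuming base R only and using just the six preview rows shown above (not the full dataset), the gaps could be located per column before deciding whether to drop or impute them. The names `preview`, `na_per_column`, `rows_with_na`, and `complete_rows` are hypothetical.

```r
# The six rows visible in the preview (ids 1-3 and 918-920); the sex column
# is left out to keep the sketch short, and NA marks a missing value
preview <- data.frame(
  id       = c(1, 2, 3, 918, 919, 920),
  age      = c(63, 67, 67, 55, 58, 62),
  trestbps = c(145, 160, 120, 122, NA, 120),
  chol     = c(233, 286, 229, 223, 385, 254),
  thalch   = c(150, 108, 129, 100, NA, 93),
  ca       = c(0, 3, 2, NA, NA, NA),
  presence = factor(c("N", "P", "P", "P", "N", "P"))
)

# Number of missing values in each column
na_per_column <- colSums(is.na(preview))
na_per_column

# Number of rows with at least one missing value
rows_with_na <- sum(!complete.cases(preview))
rows_with_na

# One simple option: keep only fully observed rows
complete_rows <- preview[complete.cases(preview), ]
complete_rows
```

Dropping incomplete rows is the simplest choice but discards data; whether it is acceptable depends on how many rows are affected in the full training set.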


| age | trestbps | chol | thalch | ca |
|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 53.5312 | 131.9267 | 197.5571 | 136.938 | 0.678733 |
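
These means sit on very different scales (`chol` is close to 198 while `ca` is below 1). Because KNN classifies by distance, a predictor with a large numeric range can dominate the result. The sketch below illustrates this, assuming Euclidean distance and using only the three complete preview rows (ids 1, 2, 3), so it is an illustration rather than a result from the full data. The helper `distance_share` is hypothetical.

```r
# Three complete rows copied from the data preview (ids 1, 2, 3)
patients <- data.frame(
  age      = c(63, 67, 67),
  trestbps = c(145, 160, 120),
  chol     = c(233, 286, 229),
  thalch   = c(150, 108, 129),
  ca       = c(0, 3, 2)
)

# Share of the squared Euclidean distance between the first two rows
# that each predictor contributes
distance_share <- function(x) {
  squared_diff <- (x[1, ] - x[2, ])^2
  squared_diff / sum(squared_diff)
}

# Shares on the raw scale and after standardizing each column
# (scale() centers to mean 0 and rescales to standard deviation 1)
raw_share    <- distance_share(as.matrix(patients))
scaled_share <- distance_share(scale(patients))

round(rbind(raw = raw_share, standardized = scaled_share), 3)
```

On the raw scale, `chol` accounts for most of the distance and `ca` for almost none; after standardizing, the contributions are far more even. This is why centering and scaling the predictors before fitting KNN is worth considering.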
