# Project Proposal — Heart Disease Prediction

## Introduction

Cardiovascular diseases, particularly heart disease, are a major global health challenge, and accurate diagnosis is crucial for effective patient care. In this proposal, we ask whether machine learning techniques can diagnose heart disease. Specifically, we will attempt to detect the presence of heart disease in patients using a heart disease dataset obtained from Kaggle. The diagnosis variable (num) ranges from 0 (no disease present) to 4. Our focus, however, is on distinguishing presence (values 1, 2, 3, 4) from absence (value 0) of heart disease.

The dataset consists of 16 variables, but for this project we focus on a subset of 9. The selected variables are described below:

1. **Age (age)**: Age of the patient.
2. **Sex (sex)**: Sex of the patient (1 = male, 0 = female).
3. **Chest Pain Type (cp)**:
   - Value 1: Typical angina
   - Value 2: Atypical angina
   - Value 3: Non-anginal pain
   - Value 4: Asymptomatic
4. **Resting Blood Pressure (trestbps)**: Measured in mm Hg upon hospital admission.
5. **Serum Cholesterol (chol)**: Serum cholesterol level in mg/dL.
6. **Fasting Blood Sugar (fbs)**:
   - 1 = Fasting blood sugar > 120 mg/dL
   - 0 = Fasting blood sugar <= 120 mg/dL
7. **Resting Electrocardiographic Results (restecg)**:
   - Value 0: Normal
   - Value 1: ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
   - Value 2: Probable or definite left ventricular hypertrophy by Estes' criteria
8. **Diagnosis of Heart Disease (num)**: Angiographic disease status, representing the degree of blockage or narrowing in the major vessels. Recorded values run from 0 to 4; we treat values 1 through 4 as presence of heart disease.
   - Value 0: < 50% diameter narrowing
   - Value 1: > 50% diameter narrowing (in any major vessel)
9. **Identity (id)**: ID number of the patient.

We aim to answer the following question:

#### Can K-nearest neighbors (KNN) classification accurately diagnose heart disease using patient demographics and medical histories?
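
As a rough illustration of what answering this question could involve, the sketch below fits a KNN classifier with tidymodels. It is not our final analysis. The data are simulated stand-ins for `age`, `trestbps` and `chol`, the binary `diagnosis` column is a hypothetical recoding of `num`, and both the 75/25 split and K = 5 are placeholder assumptions.

```r
library(tidymodels)

set.seed(2024)

# Simulated stand-in data (NOT the Kaggle data): three numeric predictors and
# a hypothetical binary outcome derived from a noisy risk score.
n <- 200
toy_data <- tibble(
  age      = round(runif(n, 30, 75)),
  trestbps = round(rnorm(n, 130, 15)),
  chol     = round(rnorm(n, 240, 40))
) |>
  mutate(
    risk      = 0.05 * age + 0.01 * trestbps + rnorm(n),
    diagnosis = factor(if_else(risk > median(risk), "present", "absent"))
  ) |>
  select(-risk)

# Hold out 25% of the rows for testing, keeping class proportions similar
toy_split <- initial_split(toy_data, prop = 0.75, strata = diagnosis)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# KNN is distance based, so put all predictors on a common scale first
knn_recipe <- recipe(diagnosis ~ age + trestbps + chol, data = toy_train) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# K = 5 is a placeholder; in practice K would be tuned by cross-validation
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("classification")

knn_fit <- workflow() |>
  add_recipe(knn_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_train)

# Predict on the held-out rows and compute accuracy
knn_predictions <- predict(knn_fit, toy_test) |>
  bind_cols(toy_test)

knn_accuracy <- knn_predictions |>
  accuracy(truth = diagnosis, estimate = .pred_class)

knn_accuracy
```

Accuracy on simulated data says nothing about the real dataset; the sketch only shows the shape of the workflow (split, preprocess, fit, evaluate) that the question implies.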

### Preliminary exploratory data analysis

In [17]:
library(tidyverse)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)
# Source the helper scripts only if they exist, so a missing file
# does not stop the notebook with "cannot open the connection"
if (file.exists("tests.R")) source("tests.R")
if (file.exists("cleanup.R")) source("cleanup.R")
# install.packages("cowplot")
# library(cowplot)
# library(scales)


In [2]:
# install.packages("readr")
# library(readr)




In [None]:
# We use the heart disease data. As suggested by the professor, we downloaded
# the CSV file from Kaggle and uploaded it to our notebook.

file_path <- "data/heart_disease_uci.csv"

heart_disease_uci <- read_csv(file_path)

# Select only the columns we need and convert 'num' to a factor
heart_disease_data <- heart_disease_uci |>
  select(id, age, sex, cp, trestbps, chol, fbs, restecg, num) |>
  mutate(num = as.factor(num))

# Alternative (not run): label 'num' directly.
# Caution: with levels = c(0, 1), the values 2, 3 and 4 would become NA.
# heart_disease_data$num <- factor(heart_disease_data$num, levels = c(0, 1), labels = c("No Heart Disease", "Heart Disease"))

# Optional (not run): recode the categorical predictors with readable labels
# heart_disease_data <- heart_disease_data |>
#   mutate(
#     sex = factor(sex, levels = c(0, 1), labels = c("Female", "Male")),  # Recode 'sex'
#     cp = factor(cp, levels = c(1, 2, 3, 4), labels = c("Typical Angina", "Atypical Angina", "Non-Anginal Pain", "Asymptomatic")),  # Recode 'cp'
#     fbs = factor(fbs, levels = c(0, 1), labels = c("<= 120 mg/dL", "> 120 mg/dL")),  # Recode 'fbs'
#     restecg = factor(restecg, levels = c(0, 1, 2), labels = c("Normal", "ST-T Wave Abnormality", "Left Ventricular Hypertrophy"))  # Recode 'restecg'
#   )

# Print the first few rows of the modified dataset
head(heart_disease_data)
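
The introduction frames the task as presence (values 1 to 4) versus absence (value 0), while the cell above keeps every recorded level of `num`. One possible way to collapse them is sketched here. The toy values, the column name `diagnosis` and its labels are assumptions for illustration, not part of the dataset.

```r
library(dplyr)

# Toy values standing in for the 'num' column (NOT taken from the real data)
toy_num <- tibble(id = 1:6, num = c(0, 2, 1, 0, 4, 3))

# Collapse 0 vs. 1-4 into a two-level factor; 'diagnosis' is a hypothetical name
toy_binary <- toy_num |>
  mutate(
    diagnosis = factor(
      if_else(num > 0, "Heart Disease", "No Heart Disease"),
      levels = c("No Heart Disease", "Heart Disease")
    )
  )

# Class counts show how balanced the two groups are
class_counts <- toy_binary |>
  count(diagnosis)

class_counts
```

Applied to the real data, this step would need to run before `num` is converted to a factor, because the comparison `num > 0` requires a numeric column.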
