# **Project Report**

# **SERUM CHOLESTEROL AND MAXIMUM HEART RATE ACHIEVED TO DIAGNOSE HEART DISEASE PATIENTS FROM HUNGARY**

Aryan Jain, Vibhav 

## INTRODUCTION

Cardiovascular disease encompasses a spectrum of cardiac conditions originating from malfunctions within the cardiac and vascular systems. Among these, coronary artery disease (CAD) manifests when the arteries responsible for supplying blood to the heart undergo a narrowing process. Numerous risk factors contribute to the predisposition for this malady, including elevated cholesterol levels and the maximum heart rate attained during physiological exertion.

Elevated cholesterol levels precipitate the accumulation of lipid deposits within the vasculature, impeding the smooth flow of blood through the arteries. The rupture of these deposits may culminate in the formation of a thrombus, which can trigger severe cardiovascular events such as myocardial infarction or stroke. Individuals with heart disease may also experience a marked reduction in their maximum heart rate (WebMD, 2002).

In light of these considerations, the question arises: can the likelihood that an individual has heart disease be determined from their serum cholesterol level and the maximum heart rate they achieve? To address this question, we propose using a k-nearest neighbors (KNN) classifier, an algorithm with demonstrated efficacy in pattern recognition and classification tasks. With it, we aim to identify patterns relating these two physiological parameters to the presence of heart disease, and to use those patterns to classify a new patient.
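To make the k-nearest neighbors idea concrete, the following is a minimal base-R sketch of the vote that such a classifier performs for one new patient. The function name `knn_predict`, the six toy patients, and the new patient's values are all made up for illustration; the report itself uses the tidymodels implementation rather than this hand-written version.

```r
# Minimal k-nearest-neighbours classifier in base R (illustrative only).
# All values below are made up; they are not taken from the Hungarian data.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Standardize with the training means and standard deviations so that
  # cholesterol and heart rate contribute to the distance on the same scale.
  mu <- colMeans(train_x)
  sigma <- apply(train_x, 2, sd)
  train_z <- scale(train_x, center = mu, scale = sigma)
  new_z <- (new_x - mu) / sigma

  # Euclidean distance from the new patient to every training patient.
  dists <- sqrt(rowSums(sweep(train_z, 2, new_z)^2))

  # Majority vote among the k closest training patients.
  nearest <- order(dists)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

toy_x <- cbind(
  chol    = c(180, 195, 210, 290, 310, 330),
  thalach = c(175, 168, 180, 120, 115, 125)
)
toy_y <- factor(c("FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "TRUE"))

new_patient <- c(chol = 300, thalach = 118)
prediction <- knn_predict(toy_x, toy_y, new_patient, k = 3)
print(prediction)
```

The new patient is labelled with whichever class is most common among its three closest neighbours after standardization, which is the same principle the tuned classifier applies later in the report.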

Our study uses the "processed.hungarian.data" dataset from the Heart Disease Database to predict the presence of heart disease in patients from Hungary. The dataset comprises several variables, and we focus on "chol" (serum cholesterol level) and "thalach" (maximum heart rate achieved) as predictive features.

The specific columns within the dataset are defined as follows:

1. **age**: Age of the patient
2. **sex**: Gender of the patient (1 = male, 0 = female)
3. **cp**: Chest pain type
4. **trestbps**: Resting blood pressure in mmHg
5. **chol**: Serum cholesterol level in mg/dl
6. **fbs**: Fasting blood sugar > 120 mg/dl? (1 = True, 0 = False)
7. **restecg**: Resting electrocardiographic results
8. **thalach**: Maximum heart rate achieved
9. **exang**: Whether exercise induced angina (1 = True, 0 = False)
10. **oldpeak**: ST depression induced by exercise, relative to rest
11. **slope**: The slope of the peak exercise ST segment (1 = upslope, 2 = flat, 3 = downslope)
12. **ca**: Number of major vessels (0-3) colored by fluoroscopy
13. **thal**: Thalassemia classification (3 = normal, 6 = fixed defect, 7 = reversible defect)
14. **num**: Diagnosis of heart disease (1, 2, 3, 4 = presence, 0 = no presence)

For our analysis, we aim to employ the "chol" and "thalach" variables as predictors to discern the presence or absence of heart disease in patients. This predictive task aligns with the broader objective of leveraging relevant clinical data to enhance diagnostic capabilities and contribute to the advancement of cardiovascular health assessment methodologies.

### Methodology

#### Data Preprocessing and Exploratory Data Analysis

We began by importing the relevant libraries and reading the "processed.hungarian.data" dataset from the UCI Machine Learning Repository. We then cleaned and tidied the data so that it was ready for analysis. This involved assigning appropriate column types and adding a new column, "diag", to make the data easier to interpret.

To facilitate subsequent analytical procedures, we judiciously partitioned the dataset into distinct training and testing sets. It is noteworthy that our analytical focus remained exclusively on the training set until the final stages of the investigation.

A comprehensive summary of the training set was generated, laying the groundwork for subsequent predictive modeling. This involved the extraction of key insights and patterns from the training data to inform the desired behavior and performance criteria of our classifier.

Visualization was an integral part of our exploratory analysis. Specifically, we plotted the relationship between the variables "thalach" (maximum heart rate achieved) and "chol" (serum cholesterol level). This helped us understand how the two predictors are distributed in the dataset and informed the analytical choices that followed.

### Determining Optimal k for K-Nearest Neighbors Classifier

The objective of this phase in our investigation is to ascertain the optimal value for the parameter 'k' in the k-nearest neighbors (KNN) algorithm, thereby maximizing the accuracy of our predictive model. The subsequent methodology encapsulates a systematic approach towards achieving this goal.

1. **Data Preprocessing and Scaling:**
   We begin by using the recipe function to center and scale the training data, so that each predictor has a mean of 0 and a standard deviation of 1. Because KNN is distance-based, this prevents the predictor with the larger numeric range from dominating the distance calculation.

2. **Cross-Validation Technique:**
   Cross-validation, an integral aspect of our methodological framework, is executed with ten folds on the training dataset. This deliberate choice of employing ten folds serves to mitigate the influence of the specific observations in the validation set, thus enhancing the robustness and generalizability of our model.

3. **K-Nearest Neighbors Model Initialization:**
   The KNN model is specified with the 'neighbors' argument set to 'tune()', which marks 'k' as a parameter to be chosen during tuning rather than fixed in advance.

4. **Workflow Integration:**
   The recipe and the KNN model are seamlessly integrated into a workflow, with the 'tune_grid' function employed to systematically explore a range of 'k' values specified in 'gridvals' during cross-validation.

5. **Determining Optimal k:**
   The optimal 'k' value is discerned by filtering for accuracy and visualizing the accuracy estimate against the 'k' values through a line plot. This graphical representation serves to elucidate the relationship between 'k' and accuracy, guiding the selection of the most advantageous 'k' value.

6. **Model Evaluation and Validation:**
   Rigorous evaluation ensues to ensure that the selected 'k' value averts both underfitting and overfitting. Furthermore, a comparative analysis against a majority classifier is conducted to validate the efficacy of our model, affirming its superiority in predictive accuracy.

This methodological framework adheres to rigorous standards, leveraging cross-validation and systematic exploration to identify the optimal 'k' for the KNN algorithm, thus enhancing the robustness and reliability of our predictive model.
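The six steps above can be sketched with tidymodels roughly as follows. This is a sketch under stated assumptions, not the report's exact code: the data frame `toy_training` is simulated, the candidate range of odd k values from 1 to 21 is an arbitrary choice, and the "kknn" engine must be installed for the model to run.

```r
library(tidymodels)

set.seed(1)

# Simulated stand-in for the training set (NOT the real data):
# two numeric predictors and a two-level factor outcome.
n <- 200
toy_training <- tibble(
  diag = factor(rep(c("FALSE", "TRUE"), each = n / 2)),
  chol = c(rnorm(n / 2, mean = 230, sd = 40), rnorm(n / 2, mean = 265, sd = 40)),
  thalach = c(rnorm(n / 2, mean = 150, sd = 20), rnorm(n / 2, mean = 125, sd = 20))
)

# 1. Recipe: center and scale both predictors.
toy_recipe <- recipe(diag ~ chol + thalach, data = toy_training) |>
  step_center(all_predictors()) |>
  step_scale(all_predictors())

# 2. Ten-fold cross-validation, stratified by the outcome.
toy_vfold <- vfold_cv(toy_training, v = 10, strata = diag)

# 3. KNN specification with the number of neighbors left to be tuned.
knn_tune <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

# 4. Workflow plus grid search over candidate k values (assumed range).
gridvals <- tibble(neighbors = seq(from = 1, to = 21, by = 2))

knn_results <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(knn_tune) |>
  tune_grid(resamples = toy_vfold, grid = gridvals) |>
  collect_metrics()

# 5. Keep the accuracy rows and pick the k with the highest mean accuracy.
accuracies <- knn_results |>
  filter(.metric == "accuracy")

best_k <- accuracies |>
  slice_max(mean, n = 1, with_ties = FALSE) |>
  pull(neighbors)

# Line plot of the accuracy estimate against k.
accuracy_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
  geom_point() +
  geom_line() +
  labs(x = "Number of neighbors (k)", y = "Cross-validated accuracy estimate")

best_k
```

Picking the single highest mean accuracy is the simplest rule. In practice it is worth checking on the plot that neighbouring k values give similar accuracy, so that the choice is not driven by noise in one fold.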

### Visualizing our results

To visualize our results, we plotted max heart rate on the x-axis and cholesterol levels on the y-axis, using diagnosis to colour the points.

To check for over/underfitting, we coloured the background of the graph based on what prediction would be made at every possible point. This also allowed us to quickly identify how the model classified patients, and where the boundaries were.
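One way to colour the background by prediction is to classify every point on a regular grid and draw those grid points faintly behind the data. The sketch below uses `class::knn` on simulated values purely to keep the example small; the data frame `toy`, the 60 x 60 grid resolution, and k = 7 are assumptions for illustration and do not reproduce the report's fitted workflow.

```r
library(class)    # knn(); ships with standard R installations
library(ggplot2)

set.seed(1)

# Simulated stand-in for the training data (NOT the real patients).
toy <- data.frame(
  thalach = c(rnorm(50, 150, 20), rnorm(50, 125, 20)),
  chol    = c(rnorm(50, 230, 40), rnorm(50, 265, 40)),
  diag    = factor(rep(c("FALSE", "TRUE"), each = 50))
)

# Regular grid covering the observed range of both predictors.
grid <- expand.grid(
  thalach = seq(min(toy$thalach), max(toy$thalach), length.out = 60),
  chol    = seq(min(toy$chol), max(toy$chol), length.out = 60)
)

# Standardize training data and grid with the training means and SDs.
predictors <- c("thalach", "chol")
mu <- colMeans(toy[predictors])
sigma <- sapply(toy[predictors], sd)
train_z <- scale(toy[predictors], center = mu, scale = sigma)
grid_z <- scale(grid[predictors], center = mu, scale = sigma)

# Predict a class at every grid point (k = 7 is an arbitrary choice here).
grid$diag <- knn(train = train_z, test = grid_z, cl = toy$diag, k = 7)

# Faint grid points form the background; training points sit on top.
region_plot <- ggplot() +
  geom_point(data = grid, aes(x = thalach, y = chol, colour = diag),
             alpha = 0.1, size = 3) +
  geom_point(data = toy, aes(x = thalach, y = chol, colour = diag)) +
  labs(x = "Maximum heart rate achieved",
       y = "Serum cholesterol (mg/dl)", colour = "Heart disease")
```

A very jagged boundary with many small islands suggests that k is too small (overfitting), while a boundary that ignores clear clusters suggests that k is too large (underfitting).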

### Testing our classifier

We made a new model specification using the best k value chosen, combined it with the recipe made earlier in a workflow, and fit the classifier to our training set.

We used predict on the testing set to evaluate the classifier's prediction accuracy on data it had not seen before.

We produced a confusion matrix to get a sense of which diagnoses the classifier gave more accurately, and what effect that has on real-world application.

We tested the accuracy of our classifier when given data from Hungary.
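The evaluation step can be sketched with yardstick as shown below. The tibble `toy_predictions` is hypothetical: in the actual analysis it would come from calling `predict()` on the fitted workflow with the testing set and binding the result to the true `diag` column. The eight rows here are invented solely to show how accuracy and the confusion matrix are read.

```r
library(tidymodels)

# Hypothetical test-set results (NOT the report's actual predictions):
# the true diagnosis next to the class predicted by the classifier.
toy_predictions <- tibble(
  diag = factor(
    c("FALSE", "FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "TRUE", "TRUE"),
    levels = c("FALSE", "TRUE")
  ),
  .pred_class = factor(
    c("FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "TRUE", "FALSE", "FALSE"),
    levels = c("FALSE", "TRUE")
  )
)

# Overall accuracy: share of patients whose prediction matches the truth.
test_accuracy <- toy_predictions |>
  accuracy(truth = diag, estimate = .pred_class)

# Confusion matrix: rows are predictions, columns are true diagnoses.
test_conf_mat <- toy_predictions |>
  conf_mat(truth = diag, estimate = .pred_class)

test_accuracy
test_conf_mat
```

In this toy example two patients with heart disease are predicted as healthy. In a clinical setting such false negatives are usually the more costly error, which is why the confusion matrix is worth inspecting alongside the single accuracy figure.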

## Preprocessing and exploratory data analysis


In [2]:
# importing libraries
library(tidyverse)
library(tidymodels)
library(repr)
library(RColorBrewer)

# formatting graphs
options(repr.plot.width = 12, repr.plot.height = 6)


In [3]:
hungarian_data <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data",
                          col_names = FALSE)

head(hungarian_data)

nrow(hungarian_data)

Rows: 294 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): X4, X5, X6, X7, X8, X9, X11, X12, X13
dbl (5): X1, X2, X3, X10, X14

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


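| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 | X13 | X14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` |
| 28 | 1 | 2 | 130 | 132 | 0 | 2 | 185 | 0 | 0 | ? | ? | ? | 0 |
| 29 | 1 | 2 | 120 | 243 | 0 | 0 | 160 | 0 | 0 | ? | ? | ? | 0 |
| 29 | 1 | 2 | 140 | ? | 0 | 0 | 170 | 0 | 0 | ? | ? | ? | 0 |
| 30 | 0 | 1 | 170 | 237 | 0 | 1 | 170 | 0 | 0 | ? | ? | 6 | 0 |
| 31 | 0 | 2 | 100 | 219 | 0 | 1 | 150 | 0 | 0 | ? | ? | ? | 0 |
| 32 | 0 | 2 | 105 | 198 | 0 | 0 | 165 | 0 | 0 | ? | ? | ? | 0 |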


figure 1

As you can see above, the dataframe does not come with column names, so those must be added. Some factor columns are also being read as <dbl> or <chr>, so those need to be changed as well.

The publisher tells us that each column is numeric-valued and there are 294 rows, with missing data represented as the string "?".

## Data Cleaning and Structuring

The presence of "<chr>" data types in certain columns is attributed to the inclusion of "?" as placeholders for unknown values. In order to facilitate appropriate data type assignment, we undertake a meticulous data cleaning process wherein these "?" entries are systematically replaced with NA values.
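As an alternative to replacing "?" after loading, readr can treat it as missing while parsing. The sketch below is an illustration of that option, not the approach used in this report: it parses the first three rows shown in figure 1 from an inline string, so it runs without a network connection.

```r
library(readr)

# Three rows copied from the head() output shown in figure 1.
raw_text <- I("28,1,2,130,132,0,2,185,0,0,?,?,?,0
29,1,2,120,243,0,0,160,0,0,?,?,?,0
29,1,2,140,?,0,0,170,0,0,?,?,?,0
")

col_names <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
               "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")

# na = "?" tells read_csv to treat "?" as missing while parsing, so a
# column such as chol is guessed as numeric rather than character.
sample_rows <- read_csv(raw_text, col_names = col_names, na = "?",
                        show_col_types = FALSE)

sapply(sample_rows, class)
```

With this option the numeric columns arrive as numbers and only the categorical columns still need converting to factors. Columns that are entirely missing in the sample (slope, ca, thal in these three rows) are guessed as logical, so explicit column types may still be preferable on the full file.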

Furthermore, to enhance the clinical relevance of our analysis, a binary diagnostic column, denoted as "diag," is introduced. While the existing variable "num" categorizes heart disease by severity, with 0 indicating the absence of the condition, the "diag" variable transcends severity levels. It serves the purpose of classifying patients into two categories—those with or without heart disease. This binary classification, irrespective of disease severity, is imperative in practical healthcare scenarios, as it prompts medical attention and potential treatment recommendations for any manifestation of heart disease. The introduction of the "diag" column underscores the translational utility of our analysis in real-world healthcare contexts.

In [4]:
set.seed(1)
# assigning col names
hungarian_clean <- hungarian_data

colnames(hungarian_clean) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", 
                               "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
                           
# changing "?" into NA
hungarian_clean[ hungarian_clean == "?" ] <- NA

# adding diag column, setting col types
# as.integer is being used to get rid of decimal points when switching to factor
hungarian_clean <- hungarian_clean |>
                    mutate(diag = as.factor(ifelse(is.na(num), NA, (num > 0)))) |>
                    mutate(sex = as.factor(as.integer(sex)), cp = as.factor(as.integer(cp)), 
                            fbs = as.factor(as.integer(fbs)), restecg = as.factor(as.integer(restecg)),
                            exang = as.factor(as.integer(exang)), thal = as.factor(as.integer(thal)),
                            ca = as.factor(as.integer(ca)), slope = as.factor(as.integer(slope))) |>
                     mutate(num = as_factor(num))

head(hungarian_clean)

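| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num | diag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<fct>` | `<fct>` | `<chr>` | `<chr>` | `<fct>` | `<fct>` | `<chr>` | `<fct>` | `<dbl>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` |
| 28 | 1 | 2 | 130 | 132.0 | 0 | 2 | 185 | 0 | 0 | NA | NA | NA | 0 | FALSE |
| 29 | 1 | 2 | 120 | 243.0 | 0 | 0 | 160 | 0 | 0 | NA | NA | NA | 0 | FALSE |
| 29 | 1 | 2 | 140 | NA | 0 | 0 | 170 | 0 | 0 | NA | NA | NA | 0 | FALSE |
| 30 | 0 | 1 | 170 | 237.0 | 0 | 1 | 170 | 0 | 0 | NA | NA | 6.0 | 0 | FALSE |
| 31 | 0 | 2 | 100 | 219.0 | 0 | 1 | 150 | 0 | 0 | NA | NA | NA | 0 | FALSE |
| 32 | 0 | 2 | 105 | 198.0 | 0 | 0 | 165 | 0 | 0 | NA | NA | NA | 0 | FALSE |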


figure 2

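The columns now have names, "?" has been replaced with NA, and the categorical columns are factors. Note, however, that trestbps, chol and thalach are still stored as character columns in the output above, so they need to be converted with as.numeric before any arithmetic is done on them.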

Since num uses integers to distinguish presence (1,2,3,4) from absence (0), and we want to determine whether or not a patient has heart disease, a new boolean column diag has been appended to narrow diagnoses down to TRUE or FALSE. To be able to stratify by it, we made it a factor column.

#### Splitting our data into training and testing sets

Before working on our model, we need to split our data into training and testing sets. Since we want to predict the new column diag, we will be stratifying by it.

We will use initial_split to split our dataframe into 75% training and 25% testing, since it shuffles our data for us and, when stratified, keeps the proportion of each class roughly constant in both sets. The 75-25 split allows us to train our model on as many data points as possible while also keeping enough data for effective testing later.

In [5]:
#splitting dataframe into training, testing datasets
hungarian_split <- initial_split(hungarian_clean, prop = 3/4, strata = diag)

hungarian_training <- training(hungarian_split)
hungarian_testing <- testing(hungarian_split)

head(hungarian_training)

nrow(hungarian_training)
nrow(hungarian_testing)

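| age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num | diag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<fct>` | `<fct>` | `<chr>` | `<chr>` | `<fct>` | `<fct>` | `<chr>` | `<fct>` | `<dbl>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` |
| 28 | 1 | 2 | 130 | 132 | 0 | 2 | 185 | 0 | 0 | NA | NA | NA | 0 | FALSE |
| 29 | 1 | 2 | 120 | 243 | 0 | 0 | 160 | 0 | 0 | NA | NA | NA | 0 | FALSE |
| 32 | 1 | 2 | 110 | 225 | 0 | 0 | 184 | 0 | 0 | NA | NA | NA | 0 | FALSE |
| 34 | 0 | 2 | 130 | 161 | 0 | 0 | 190 | 0 | 0 | NA | NA | NA | 0 | FALSE |
| 35 | 0 | 1 | 120 | 160 | 0 | 1 | 185 | 0 | 0 | NA | NA | NA | 0 | FALSE |
| 35 | 0 | 4 | 140 | 167 | 0 | 0 | 150 | 0 | 0 | NA | NA | NA | 0 | FALSE |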


figure 3

In the above code, we split the data into a training set to build our model on and a testing set to evaluate it. Using initial_split allowed us to shuffle the data before splitting (so that any ordering in the file does not bias the split) and to stratify by diag so that each class appears in roughly the same proportion in both sets.

There are 220 rows (75%) in the training set and 74 rows (25%) in the testing set. This leaves a reasonable amount of data to train the classifier on while holding back enough to estimate its accuracy later.
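The effect of stratifying can be checked directly by comparing class proportions in the two pieces. The sketch below uses a made-up data frame `toy` with 294 rows and class counts chosen for illustration; it is not the report's data.

```r
library(rsample)

set.seed(1)

# Made-up data: 294 rows with an illustrative class imbalance.
toy <- data.frame(
  id = seq_len(294),
  diag = factor(rep(c("FALSE", "TRUE"), times = c(188, 106)))
)

# 75/25 split, stratified by the outcome column.
toy_split <- initial_split(toy, prop = 3/4, strata = diag)
toy_training <- training(toy_split)
toy_testing <- testing(toy_split)

# Share of TRUE diagnoses in each piece; stratification keeps them close.
p_train <- mean(toy_training$diag == "TRUE")
p_test <- mean(toy_testing$diag == "TRUE")

c(training = p_train, testing = p_test)
```

Without `strata`, a random split of a dataset this small could by chance leave the testing set with noticeably more or fewer positive cases than the training set, which would make the accuracy estimate harder to interpret.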

Moving forward, we will only use the training set until the very end.

### Summarizing the data

Before we build the model, we want to check whether the two classes actually differ in their average serum cholesterol and maximum heart rate achieved.

To do this, we will use group_by and summarize to create a table with the mean of each of our predictors and the number of patients in each class.

In [6]:
# summarizing to get the mean of each predictor + total no. of rows per class
hungarian_summary <- hungarian_training %>%
                    group_by(diag) %>%
                    summarize(mean_chol = mean(chol), 
                              mean_thalach = mean(thalach),
                              n_of_patients = n())

hungarian_summary

ℹ In argument: `mean_chol = mean(chol)`.
ℹ In group 1: `diag = FALSE`.
! argument is not numeric or logical: returning NA


| diag | mean_chol | mean_thalach | n_of_patients |
|---|---|---|---|
| `<fct>` | `<dbl>` | `<dbl>` | `<int>` |
| FALSE | NA | NA | 141 |
| TRUE | NA | NA | 79 |


figure 4

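To summarize our data, we grouped by diag and computed the mean of chol and thalach, along with the number of patients in each class.

We expect patients with heart disease to have higher cholesterol and lower maximum heart rates, but the table above cannot confirm this: both means are NA because chol and thalach are still stored as character columns, as the warning indicates. The columns must be converted with as.numeric, and the means computed with na.rm = TRUE, before the comparison is meaningful. The class counts are also not balanced: there are 141 FALSE and 79 TRUE diagnoses in the training set, so a majority classifier that always predicts FALSE would already be about 64% accurate. This is the baseline our classifier must beat.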
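A version of the summary that avoids the "argument is not numeric or logical" warning could look like the sketch below. The data frame `toy_training` and its values are invented to mirror the shape of the training set, with chol and thalach held as character columns and a couple of missing entries; the resulting means say nothing about the real patients.

```r
library(dplyr)

# Toy rows in the same shape as the training set (values are made up).
# chol and thalach are character columns, as in the output above.
toy_training <- data.frame(
  diag    = factor(c("FALSE", "FALSE", "FALSE", "TRUE", "TRUE", "TRUE")),
  chol    = c("200", "220", NA, "260", "280", "300"),
  thalach = c("170", "160", "150", "130", NA, "110")
)

toy_summary <- toy_training |>
  # Convert the predictors to numeric before any arithmetic.
  mutate(across(c(chol, thalach), as.numeric)) |>
  group_by(diag) |>
  # na.rm = TRUE skips missing values instead of returning NA.
  summarize(mean_chol = mean(chol, na.rm = TRUE),
            mean_thalach = mean(thalach, na.rm = TRUE),
            n_of_patients = n())

# Accuracy of always predicting the most common class (majority baseline).
majority_baseline <- max(toy_summary$n_of_patients) / sum(toy_summary$n_of_patients)

toy_summary
majority_baseline
```

Applied to the real training set, the same pattern would give the class means needed to check the expected trend, and the baseline line would reproduce the roughly 64% figure implied by the 141 and 79 patient counts.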