# Heart Disease Detection Model Proposal

## Introduction:

In this proposal, we address the problem of heart disease classification. Heart disease is the second leading cause of death in Canada, affecting over 1.2 million citizens (1). As a result, much research has been directed toward the treatment and prevention of heart disease, but there is still a major need to accurately detect its presence in the population. The purpose of this proposal is to explore how measured levels of 4 patient attributes can predict the presence of heart disease (on a scale of 0-4, where 0 indicates absence). To explore this question, we will use the “processed.cleveland.data” file from the heart disease dataset on UC Irvine’s Machine Learning Repository (2).

## Method: 
Among the 14 columns of the dataset, we considered 5 variables as candidate predictors: 1) resting blood pressure (trestbps, in mm Hg), 2) serum cholesterol (chol, in mg/dl), 3) maximum heart rate achieved (thalach), 4) ST depression induced by exercise relative to rest (oldpeak), and 5) number of major vessels (0-3) coloured by fluoroscopy (ca). We ultimately decided to use only quantitative variables and discarded the 5th variable (ca), which we treat as categorical.

For model training, we will use 75% of the dataset as a training set for our classifier. We will standardize our predictors and balance the diagnosis classes (0-4). Using a parameter grid covering a range of k (number of neighbours) values, we will tune our model with k-fold cross-validation. After fitting the grid search on our predictors and target column, we will plot the number of neighbours (x-axis) against the mean cross-validation score (y-axis) to determine the k that gives the highest accuracy. Finally, we will evaluate our model on the test set by comparing the true diagnoses with our predictions to calculate accuracy.

As part of tuning the classifier, and to visualize how accurate our model is, we will use a line plot of the number of neighbours (k) against the mean cross-validation accuracy estimate.

## Expected Results:
We anticipate finding a relationship between our 4 predictor variables and the presence or absence of heart disease. Such a classifier could support the early detection of heart disease, allowing for earlier treatment and better prevention, potentially saving lives. In addition, this project could provide a basis for further research into more accurate classification models of heart disease, e.g. by further investigating relevant predictor variables.


## Preliminary exploratory data analysis:

Package dependencies below:

In [1]:
!pip install pandas==1.5.3
!pip install scikit-learn==1.2.0
!pip install altair==4.2.2



In [59]:
import pandas as pd
import altair as alt
from sklearn import set_config
from sklearn.compose import make_column_transformer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simplify working with large datasets in Altair
alt.data_transformers.disable_max_rows()

# Output dataframes instead of arrays
set_config(transform_output="pandas")

## Breakdown of Analysis (0 - 4 Prediction):

In [None]:
# Isolate our predictors
# 1) resting blood pressure (trestbps/unit: mm Hg)
# 2) serum cholesterol (chol/unit: mg/dl)
# 3) maximum heart rate achieved(thalach)
# 4) ST depression induced by exercise relative to rest(oldpeak)

# Remove empty value rows
# Balance the dataset 
# The absence of heart disease (0) is disproportionate to the presence of it (1-4) and our predictions will likely be biased towards non-presence.
# Convert presence (1-4) to 1 ?



# Scale (standardize) the predictors
# Use the pairwise plots function from the worksheet/tutorial to visualize the 4 predictors against each other, labelled by absence and presence
# Split training predictors from the predicted column (heart_disease_presence): predictors X and target y
# Train model with a range of k neighbors
# Create visualization of number of neighbors vs accuracy
# Select highest accuracy k
# Table showing the accuracy and statistics of the model

# Overlayed histograms of target 0 or 1 ?
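
Two of the planned steps above, collapsing presence levels 1-4 into a single class and balancing the classes, could be sketched as follows. This is only an illustrative sketch on a small hypothetical dataframe (the values and the `present` column name are made up for the example), and it assumes upsampling with `sklearn.utils.resample`; other balancing strategies are equally possible. In the real analysis, balancing would normally be applied to the training set only.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy data standing in for the real training set
toy = pd.DataFrame({
    "oldpeak": [0.0, 1.8, 0.3, 0.0, 0.1, 2.5, 0.0, 0.6],
    "heart_disease_presence": [0, 1, 0, 0, 0, 3, 0, 0],
})

# Collapse severity levels 1-4 into a single "present" class (1)
toy["present"] = (toy["heart_disease_presence"] > 0).astype(int)

# Identify the larger and smaller class
counts = toy["present"].value_counts()
majority_label = counts.idxmax()
majority = toy[toy["present"] == majority_label]
minority = toy[toy["present"] != majority_label]

# Upsample the smaller class (sampling with replacement) to match the larger one
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=330
)
balanced = pd.concat([majority, minority_upsampled])

print(balanced["present"].value_counts())
```

After upsampling, both classes have the same number of rows, so a majority-vote classifier such as K-nearest neighbours is less likely to default to the absent class.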

### Reading in dataset from the web

Using pd.read_csv, we read the dataset from GitHub and keep only our four predictors and the target column. The split into training and testing sets is performed further below.

In [23]:
col_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'heart_disease_presence']
url = 'https://github.com/CCWebb14/DSCI100_Group_Project/blob/main/data/processed.cleveland.data?raw=true'

# Isolate our predictors and target
# 1) resting blood pressure (trestbps/unit: mm Hg)
# 2) serum cholesterol (chol/unit: mg/dl)
# 3) maximum heart rate achieved(thalach)
# 4) ST depression induced by exercise relative to rest(oldpeak)

# Read in the data and keep only the predictors and the target

cleveland = pd.read_csv(url, names=col_names)[['trestbps', 'chol', 'thalach', 'oldpeak', 'heart_disease_presence']]

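| | trestbps | chol | thalach | oldpeak | heart_disease_presence |
|---|---|---|---|---|---|
| 238 | 134.0 | 271.0 | 162.0 | 0.0 | 0 |
| 205 | 142.0 | 309.0 | 147.0 | 0.0 | 3 |
| 143 | 125.0 | 309.0 | 131.0 | 1.8 | 1 |
| 295 | 120.0 | 157.0 | 182.0 | 0.0 | 0 |
| 288 | 130.0 | 221.0 | 163.0 | 0.0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 11 | 140.0 | 294.0 | 153.0 | 1.3 | 0 |
| 226 | 112.0 | 204.0 | 143.0 | 0.1 | 0 |
| 204 | 110.0 | 211.0 | 161.0 | 0.0 | 0 |
| 256 | 106.0 | 223.0 | 142.0 | 0.3 | 0 |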
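
The planned "remove empty value rows" step could look roughly like the sketch below. It assumes that missing entries in the file are encoded as `?` (to our knowledge this is how the UCI file marks them, but this should be checked against the raw data). The rows and the shortened column list here are hypothetical and only illustrate the `na_values` and `dropna` calls.

```python
import io
import pandas as pd

# Hypothetical rows in a comma-separated layout; "?" marks a missing value
raw = io.StringIO(
    "63.0,1.0,145.0,233.0,0.0\n"
    "67.0,1.0,160.0,286.0,?\n"
    "41.0,0.0,130.0,204.0,0.0\n"
)
toy_cols = ["age", "sex", "trestbps", "chol", "ca"]

# Treat "?" as NaN while reading, then count missing values per column
toy = pd.read_csv(raw, names=toy_cols, na_values="?")
print(toy.isna().sum())

# Drop every row that has at least one missing value
toy_clean = toy.dropna()
print(len(toy_clean))
```

If the four selected predictors turn out to contain no missing values, `dropna` leaves the dataframe unchanged, so the call is harmless either way.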


Next, we create a preprocessor to standardize (i.e., center and scale) our four predictors. Centering ensures that every predictor has a mean of 0, and scaling ensures that every predictor has a standard deviation of 1. We use StandardScaler in the preprocessor and then call fit_transform so that we can examine the output.

In [66]:
cleveland_preprocessor = make_column_transformer(
    (StandardScaler(), ['trestbps', 'chol', 'thalach', 'oldpeak']),
    remainder = 'passthrough',
    verbose_feature_names_out=False
)

cleveland_scaled = cleveland_preprocessor.fit_transform(cleveland)
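
As a quick sanity check of what standardization does, the sketch below fits a StandardScaler on hypothetical training rows only and confirms that the scaled training columns have mean 0 and standard deviation 1. The toy values are made up. Note that the cell above fits the scaler on the full dataset before splitting; fitting on the training rows alone, as in this sketch, is one way to keep test-set statistics out of the preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values for two of the predictors
toy = pd.DataFrame({
    "trestbps": [120.0, 130.0, 140.0, 150.0, 110.0, 160.0],
    "chol": [200.0, 220.0, 260.0, 320.0, 180.0, 240.0],
})
toy_train, toy_test = toy.iloc[:4], toy.iloc[4:]

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(toy_train)

# Apply the same transformation to both subsets
train_scaled = np.asarray(scaler.transform(toy_train))
test_scaled = np.asarray(scaler.transform(toy_test))

# Training columns are centered at 0 with (population) standard deviation 1
print(train_scaled.mean(axis=0))
print(train_scaled.std(axis=0))
```

The test rows are transformed with the training mean and standard deviation, so their scaled columns will generally not have mean exactly 0 or standard deviation exactly 1.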

Splitting into training (75%) and testing (25%) sets

In [67]:
cleveland_train, cleveland_test = train_test_split(cleveland_scaled, test_size=0.25, random_state=330)
cleveland_train

| | trestbps | chol | thalach | oldpeak | heart_disease_presence |
|---|---|---|---|---|---|
| 238 | 0.131482 | 0.470232 | 0.542655 | -0.896862 | 0 |
| 205 | 0.586786 | 1.205363 | -0.114167 | -0.896862 | 3 |
| 143 | -0.380735 | 1.205363 | -0.814778 | 0.655990 | 1 |
| 295 | -0.665300 | -1.735164 | 1.418418 | -0.896862 | 0 |
| 288 | -0.096170 | -0.497047 | 0.586443 | -0.896862 | 0 |
| ... | ... | ... | ... | ... | ... |
| 11 | 0.472960 | 0.915180 | 0.148562 | 0.224643 | 0 |
| 226 | -1.120604 | -0.825922 | -0.289320 | -0.810592 | 0 |
| 204 | -1.234430 | -0.690503 | 0.498867 | -0.896862 | 0 |
| 256 | -1.462082 | -0.458356 | -0.333108 | -0.638053 | 0 |


In [72]:
# Separate the predictors (X) from the target (y)
X_train = cleveland_train.drop(columns=['heart_disease_presence'])
y_train = cleveland_train['heart_disease_presence']

Exploring a range of n_neighbors with GridSearchCV

Performing 4-fold cross-validation



In [90]:
param_grid = {
    "kneighborsclassifier__n_neighbors": range(2, 20, 1),
}
cleveland_tune_pipe = make_pipeline(cleveland_preprocessor, KNeighborsClassifier())

In [91]:
knn_tune_grid = GridSearchCV(
    cleveland_tune_pipe, param_grid, cv=4,
)

In [92]:
knn_model_grid = knn_tune_grid.fit(X_train, y_train)

accuracies_grid = pd.DataFrame(knn_model_grid.cv_results_) 
accuracies_grid

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_kneighborsclassifier__n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.011398 | 0.001547 | 0.009168 | 0.000224 | 2 | {'kneighborsclassifier__n_neighbors': 2} | 0.508772 | 0.596491 | 0.45614 | 0.607143 | 0.542137 | 0.062627 | 9 |
| 1 | 0.009539 | 8.9e-05 | 0.009821 | 0.000201 | 3 | {'kneighborsclassifier__n_neighbors': 3} | 0.578947 | 0.561404 | 0.45614 | 0.535714 | 0.533051 | 0.046991 | 12 |
| 2 | 0.010054 | 0.00038 | 0.009912 | 0.000185 | 4 | {'kneighborsclassifier__n_neighbors': 4} | 0.561404 | 0.578947 | 0.473684 | 0.517857 | 0.532973 | 0.040822 | 13 |
| 3 | 0.01349 | 0.003882 | 0.015922 | 0.000974 | 5 | {'kneighborsclassifier__n_neighbors': 5} | 0.526316 | 0.508772 | 0.45614 | 0.553571 | 0.5112 | 0.035571 | 16 |
| 4 | 0.010306 | 0.001713 | 0.010586 | 0.001628 | 6 | {'kneighborsclassifier__n_neighbors': 6} | 0.526316 | 0.561404 | 0.473684 | 0.517857 | 0.519815 | 0.03124 | 15 |
| 5 | 0.011166 | 0.0026 | 0.011388 | 0.003854 | 7 | {'kneighborsclassifier__n_neighbors': 7} | 0.54386 | 0.508772 | 0.438596 | 0.535714 | 0.506736 | 0.041428 | 18 |
| 6 | 0.009322 | 0.000724 | 0.009044 | 0.000112 | 8 | {'kneighborsclassifier__n_neighbors': 8} | 0.526316 | 0.54386 | 0.473684 | 0.5 | 0.510965 | 0.026588 | 17 |
| 7 | 0.009564 | 0.000422 | 0.010569 | 0.002486 | 9 | {'kneighborsclassifier__n_neighbors': 9} | 0.54386 | 0.578947 | 0.473684 | 0.5 | 0.524123 | 0.040377 | 14 |
| 8 | 0.012948 | 0.002189 | 0.011976 | 0.00238 | 10 | {'kneighborsclassifier__n_neighbors': 10} | 0.54386 | 0.596491 | 0.491228 | 0.517857 | 0.537359 | 0.038882 | 11 |
| 9 | 0.015016 | 0.009934 | 0.013007 | 0.005361 | 11 | {'kneighborsclassifier__n_neighbors': 11} | 0.54386 | 0.578947 | 0.491228 | 0.553571 | 0.541902 | 0.031938 | 10 |


In [93]:
# Plot mean cross-validation accuracy against the number of neighbors
accuracy_versus_k_grid = alt.Chart(accuracies_grid).mark_line(point=True).encode(
    x = alt.X('param_kneighborsclassifier__n_neighbors:Q', title='Number of Neighbors', scale=alt.Scale(zero=False)),
    y = alt.Y('mean_test_score', title='Mean Test Score', scale=alt.Scale(zero=False))
)


accuracy_versus_k_grid

In [None]:
# Selecting n=17
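
Rather than reading the best k off the plot, it can also be pulled from the fitted grid search directly. The sketch below is a minimal illustration on synthetic data from `make_classification` (not the Cleveland data), using the same grid and 4-fold setup as above; the `toy_` names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 4 numeric predictors
X_toy, y_toy = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=330
)

# Same tuning setup as above: scale, then KNN, over k = 2..19 with 4 folds
toy_grid = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    {"kneighborsclassifier__n_neighbors": range(2, 20)},
    cv=4,
).fit(X_toy, y_toy)

# Option 1: ask the grid search for its best parameter and score
best_k = toy_grid.best_params_["kneighborsclassifier__n_neighbors"]
best_score = toy_grid.best_score_

# Option 2: locate the row with the highest mean cross-validation score
results = pd.DataFrame(toy_grid.cv_results_)
best_row = results.loc[results["mean_test_score"].idxmax()]

print(best_k, best_score)
```

Both options point at the highest mean cross-validation score. When several values of k score almost equally, as the flat accuracy curve here suggests, the choice between them is largely arbitrary.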

In [108]:
knn_spec = KNeighborsClassifier(n_neighbors=17)
cleveland_fit = make_pipeline(cleveland_preprocessor, knn_spec).fit(X_train, y_train)

In [109]:
cleveland_test_predictions = cleveland_test.assign(
    # Predict from the predictor columns only, not the target
    predicted = cleveland_fit.predict(cleveland_test.drop(columns=['heart_disease_presence']))
)
cleveland_test_predictions

| | trestbps | chol | thalach | oldpeak | heart_disease_presence | predicted |
|---|---|---|---|---|---|---|
| 0 | 0.757525 | -0.264900 | 0.017197 | 1.087338 | 0 | 1 |
| 164 | -0.437648 | 0.160702 | 1.111901 | -0.896862 | 0 | 0 |
| 158 | 0.472960 | 0.895834 | 0.892960 | 0.138373 | 2 | 0 |
| 41 | 0.472960 | -0.922650 | 1.243266 | 0.310912 | 0 | 0 |
| 30 | 0.472960 | -0.148827 | 0.060985 | 0.655990 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 247 | -1.234430 | 0.547614 | -1.384024 | -0.034166 | 1 | 0 |
| 250 | -1.234430 | -0.883959 | -1.033718 | 0.397182 | 0 | 0 |
| 116 | 0.472960 | -0.690503 | 0.674020 | -0.896862 | 0 | 0 |
| 59 | -0.380735 | -0.651812 | -1.077507 | 0.310912 | 0 | 0 |


Determining accuracy

In [110]:
# Separate the test predictors (X) from the test target (y)
X_test = cleveland_test.drop(columns=['heart_disease_presence'])
y_test = cleveland_test['heart_disease_presence']

cleveland_prediction_accuracy = cleveland_fit.score(X_test, y_test)
cleveland_prediction_accuracy

0.5526315789473685
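
A single accuracy number hides which classes the model gets wrong. One way to look closer is a confusion matrix, sketched below with `sklearn.metrics.confusion_matrix`. The label lists are hypothetical and only illustrate the call; in the notebook they would be replaced by the true and predicted test labels.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels on the 0-4 scale
y_true_toy = [0, 0, 0, 0, 1, 2, 3, 0]
y_pred_toy = [0, 0, 1, 0, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes, ordered 0..4
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2, 3, 4])
acc = accuracy_score(y_true_toy, y_pred_toy)

print(cm)
print(acc)
```

In this toy case every row with true class 1-3 is predicted as 0, which is the kind of bias towards the absent class that an imbalanced training set can produce.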

#### Disease Presence Counts

The table below reports the count of each heart disease presence level in the training set (heart_disease_presence column: 0 (absent) to 4). It demonstrates why we need to balance the dataset: absence of heart disease (0) is far more common than any individual presence level (1-4), so our predictions will likely be biased towards absence.

In [113]:
disease_presence_counts = pd.DataFrame(cleveland_train['heart_disease_presence'].value_counts()).reset_index().rename(
    columns={'index': 'heart_disease_presence', 'heart_disease_presence':'count'})

disease_presence_counts

| | heart_disease_presence | count |
|---|---|---|
| 0 | 0 | 124 |
| 1 | 1 | 40 |
| 2 | 3 | 29 |
| 3 | 2 | 25 |
| 4 | 4 | 9 |
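
These counts also give a useful reference point for the test accuracy reported earlier. The sketch below computes the share of the majority class in the training set, which is roughly the accuracy a classifier would get by always predicting 0, assuming the test set has a similar class mix (an assumption, since the test counts are not shown here).

```python
# Training-set class counts, copied from the table above
train_counts = {0: 124, 1: 40, 3: 29, 2: 25, 4: 9}

total = sum(train_counts.values())
majority_share = max(train_counts.values()) / total

# Test accuracy reported earlier in this notebook
test_accuracy = 0.5526315789473685

print(f"training rows: {total}")
print(f"majority-class share: {majority_share:.3f}")
print(f"test accuracy: {test_accuracy:.3f}")
```

If the model's accuracy is close to the majority-class share, it is doing little better than always predicting absence, which is another reason to balance the classes or to simplify the target to absent versus present.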


## Methods for Classification:

## Results with Visualization:

## Discussion:

## References:

1. https://www.canada.ca/en/public-health/services/publications/diseases-conditions/heart-disease-canada.html 
2. https://archive.ics.uci.edu/dataset/45/heart+disease 