# Heart Disease Detection Model Proposal

## Introduction:

This proposal addresses the classification of heart disease. Heart disease is the second leading cause of death in Canada, affecting over 1.2 million citizens (1). As a result, considerable research has been directed toward its treatment and prevention, but there is still a major need to detect the presence of heart disease accurately in the population. The purpose of this proposal is to explore how well the measured levels of 4 attributes from a patient can predict the presence of heart disease (on a scale of 0-4, where 0 indicates no presence). To explore this question, we will use the processed Cleveland data (“processed.cleveland.data”) from the Heart Disease dataset in UC Irvine’s Machine Learning Repository (2).

## Method: 
Of the 14 columns in the dataset, we considered 5 as candidate predictor variables: 1) resting blood pressure (trestbps, in mm Hg), 2) serum cholesterol (chol, in mg/dl), 3) maximum heart rate achieved (thalach), 4) ST depression induced by exercise relative to rest (oldpeak), and 5) number of major vessels (0-3) coloured by fluoroscopy (ca). We ultimately decided to use only quantitative variables and discarded the 5th variable (ca) because it takes only a few discrete values (0-3) and we treat it as categorical.

For model training, we will use 75% of the dataset as a training set and hold out the remaining 25% as a test set. We will standardize our predictors and balance the diagnosis classes (0-4) in the training set. Using a parameter grid over a range of k (number of neighbours) values, we will tune a K-nearest neighbours classifier with k-fold cross-validation. After fitting the predictors and target, we will plot k (x-axis) against the mean cross-validation accuracy (y-axis, reported by scikit-learn as the mean test score) to determine the value of k that gives the highest estimated accuracy. Finally, we will evaluate the model on the test set by comparing its predictions with the true diagnoses to calculate accuracy.

As part of tuning the classifier, and to visualize how accuracy depends on the choice of k, we will use a line plot showing the relationship between the range of k values and the accuracy estimates from cross-validation.
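
The tuning procedure described above could be sketched roughly as follows. This is a minimal illustration rather than the final analysis: it uses a synthetic dataset from `make_classification` as a stand-in for the four Cleveland predictors, and the grid of k values (odd numbers from 1 to 29), the 5 folds, and all variable names are assumptions chosen for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four quantitative predictors (assumption:
# the real analysis would use trestbps, chol, thalach and oldpeak).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    random_state=330,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=330
)

# Standardizing inside the pipeline means the scaler is fitted only on
# the training portion of each cross-validation split.
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Hypothetical grid of k values; the proposal does not fix a range.
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31, 2))}

knn_tuned = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring="accuracy")
knn_tuned.fit(X_train, y_train)

# Tidy table of k against mean cross-validation accuracy, ready for a line plot.
accuracy_by_k = pd.DataFrame(knn_tuned.cv_results_)[
    ["param_kneighborsclassifier__n_neighbors", "mean_test_score"]
].rename(columns={
    "param_kneighborsclassifier__n_neighbors": "k",
    "mean_test_score": "mean_cv_accuracy",
})

best_k = knn_tuned.best_params_["kneighborsclassifier__n_neighbors"]

# Accuracy of the refitted best model on the held-out test set.
test_accuracy = knn_tuned.score(X_test, y_test)
```

Note that scikit-learn's `mean_test_score` is the average over the validation folds, not a score on the held-out test set, which is used only once at the end.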

## Expected Results:
We anticipate finding a relationship between our 4 predictor variables and the presence or absence of heart disease. Such a classifier could support the early detection of heart disease, allowing for earlier treatment and better prevention, and potentially saving lives. In addition, this project could provide a basis for further research into more accurate classification models of heart disease, for example by investigating further relevant predictor variables.


## Preliminary exploratory data analysis:

Package dependencies below:

In [None]:
!pip install pandas==1.5.3
!pip install scikit-learn==1.2.0
!pip install altair==4.2.2
!pip install ucimlrepo==0.0.3

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
import altair as alt
import ucimlrepo

### Reading in dataset from the web

Using the ucimlrepo package, we can pull in the dataset as shown below and then split it into training and testing sets.

In [4]:
heart_disease = ucimlrepo.fetch_ucirepo(id=45) 

In [5]:
predictors = heart_disease.data.features 
target = heart_disease.data.targets.rename(columns = {'num': 'heart_disease_presence'})
cleveland = pd.concat([predictors, target], axis=1)

cleveland_train, cleveland_test = train_test_split(cleveland, test_size=0.25, random_state=330)
cleveland_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heart_disease_presence
238,49,0,2,134,271,0,0,162,0,0.0,2,0.0,3.0,0
205,45,1,4,142,309,0,2,147,1,0.0,2,3.0,7.0,3
143,64,1,3,125,309,0,0,131,1,1.8,2,0.0,7.0,1
295,41,1,2,120,157,0,0,182,0,0.0,1,0.0,3.0,0
288,56,1,2,130,221,0,2,163,0,0.0,1,0.0,7.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11,56,0,2,140,294,0,2,153,0,1.3,2,0.0,3.0,0
226,47,1,4,112,204,0,0,143,0,0.1,1,0.0,3.0,0
204,43,1,4,110,211,0,0,161,0,0.0,1,0.0,7.0,0
256,67,0,4,106,223,0,0,142,0,0.3,1,2.0,3.0,0


### Data Summary

#### Empty Values Count
The first table shows how many cells in each column contain missing data. Only ca and thal have missing values; if these columns are used as predictors, we can drop the affected rows so they do not interfere with our model.
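
One possible way to drop the affected rows is sketched below on a toy frame. The values are illustrative only and are not taken from the Cleveland data, and `toy` and `toy_complete` are hypothetical names introduced for the example.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the two columns with missing entries
# (illustrative values, not rows from the real dataset).
toy = pd.DataFrame({
    "chol": [271, 309, 157, 221],
    "ca": [0.0, np.nan, 0.0, 2.0],
    "thal": [3.0, 7.0, np.nan, 3.0],
})

# Count missing cells per column, as in the summary table.
missing_per_column = toy.isnull().sum()

# Drop only the rows that are missing a value in the columns we plan to use.
toy_complete = toy.dropna(subset=["ca", "thal"])
```

Restricting `subset` to the columns actually used as predictors avoids discarding rows because of gaps in columns the model never sees.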

In [6]:
empty_values_count = pd.DataFrame((cleveland_train.isnull().sum())).T
empty_values_count

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,heart_disease_presence
0,0,0,0,0,0,0,0,0,0,0,0,4,1,0


#### Disease Presence Counts

The second table reports the number of training observations at each level of heart disease presence (heart_disease_presence column: 0 (no presence) to 4). It shows why we need to balance the dataset: the absence of heart disease (0) is far more common than any single level of presence (1-4), so without balancing our predictions will likely be biased towards non-presence.
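
The balancing method is not fixed in this proposal. One common option is to upsample the smaller classes with replacement until each matches the largest class. The sketch below shows this on a toy frame with illustrative values; the names `toy_train` and `toy_balanced` are hypothetical.

```python
import pandas as pd

# Toy training frame with an imbalanced label (illustrative values only).
toy_train = pd.DataFrame({
    "chol": [271, 221, 157, 204, 211, 223, 309, 294],
    "heart_disease_presence": [0, 0, 0, 0, 0, 0, 1, 3],
})

# Size of the largest class; smaller classes are upsampled to match it.
largest = toy_train["heart_disease_presence"].value_counts().max()

balanced_parts = []
for _, group in toy_train.groupby("heart_disease_presence"):
    if len(group) < largest:
        # Sample with replacement so a small class can reach the target size.
        group = group.sample(n=largest, replace=True, random_state=330)
    balanced_parts.append(group)

toy_balanced = pd.concat(balanced_parts).reset_index(drop=True)
balanced_counts = toy_balanced["heart_disease_presence"].value_counts()
```

Upsampling would be applied to the training data only, so that the test set keeps the class proportions the model would face in practice.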

In [7]:
disease_presence_counts = pd.DataFrame(cleveland_train['heart_disease_presence'].value_counts()).reset_index().rename(
    columns={'index': 'heart_disease_presence', 'heart_disease_presence':'count'})

disease_presence_counts

Unnamed: 0,heart_disease_presence,count
0,0,124
1,1,40
2,3,29
3,2,25
4,4,9


### Data Visualization

To visualize our predictors and rule out categorical ones, we created pairwise scatterplots of the candidate predictors, coloured by heart disease presence.

In [8]:
trestbps_chol_plot = alt.Chart(cleveland_train).mark_point().encode(
    x = alt.X("trestbps", title = "resting blood pressure"),
    y = alt.Y('chol', title = "serum cholesterol"),
    color = alt.Color("heart_disease_presence", title = "Heart Disease Presence")
)
trestbps_chol_plot

In [9]:
trestbps_thalach_plot = alt.Chart(cleveland_train).mark_point().encode(
    x = alt.X("trestbps", title = "resting blood pressure" ),
    y = alt.Y('thalach', title = "maximum heart rate achieved"),
    color = alt.Color("heart_disease_presence", title = "Heart Disease Presence")
)
trestbps_thalach_plot

In [10]:
trestbps_oldpeak_plot = alt.Chart(cleveland_train).mark_point().encode(
    x = alt.X("trestbps", title = "resting blood pressure"),
    y = alt.Y('oldpeak', title = "ST depression induced by exercise"),
    color = alt.Color("heart_disease_presence", title = "Heart Disease Presence")
)
trestbps_oldpeak_plot

In [11]:
trestbps_ca_plot = alt.Chart(cleveland_train).mark_point().encode(
    x = alt.X("trestbps", title = "resting blood pressure"),
    y = alt.Y('ca', title = "Major vessels (0-3) coloured by fluoroscopy"),
    color = alt.Color("heart_disease_presence", title = "Heart Disease Presence")
)
trestbps_ca_plot

In [12]:
chol_thalach_plot = alt.Chart(cleveland_train).mark_point().encode(
    x = alt.X("chol", title = "serum cholesterol"),
    y = alt.Y('thalach', title = "maximum heart rate achieved"),
    color = alt.Color("heart_disease_presence", title = "Heart Disease Presence")
)
chol_thalach_plot

In [13]:
chol_oldpeak_plot = alt.Chart(cleveland_train).mark_point().encode(
    x = alt.X("chol", title = "serum cholesterol"),
    y = alt.Y('oldpeak', title = "ST depression induced by exercise"),
    color = alt.Color("heart_disease_presence", title = "Heart Disease Presence")
)
chol_oldpeak_plot

In [14]:
chol_ca_plot = alt.Chart(cleveland_train).mark_point().encode(
    x = alt.X("chol", title = "serum cholesterol"),
    y = alt.Y('ca', title = "Major vessels (0-3) coloured by fluoroscopy"),
    color = alt.Color("heart_disease_presence", title = "Heart Disease Presence")
)
chol_ca_plot

In [15]:
oldpeak_thalach_plot = alt.Chart(cleveland_train).mark_point().encode(
    x = alt.X("oldpeak", title = "ST depression induced by exercise"),
    y = alt.Y('thalach', title = "maximum heart rate achieved"),
    color = alt.Color("heart_disease_presence", title = "Heart Disease Presence")
)
oldpeak_thalach_plot

In [16]:
oldpeak_ca_plot = alt.Chart(cleveland_train).mark_point().encode(
    x = alt.X("oldpeak", title = "ST depression induced by exercise"),
    y = alt.Y('ca', title = "Major vessels (0-3) coloured by fluoroscopy"),
    color = alt.Color("heart_disease_presence", title = "Heart Disease Presence")
)
oldpeak_ca_plot

## References:

1. https://www.canada.ca/en/public-health/services/publications/diseases-conditions/heart-disease-canada.html 
2. https://archive.ics.uci.edu/dataset/45/heart+disease 