# COGS 118A - Final Project

# Heart Disease Model Search

## Group members

- Will Sumerfield
- Miguel Monares
- Abdalla Atalla
- Ritik Raina
- Matilda Michel

### Important Files

- common/model_training
*The template code we used to train each of our models individually*
- common/test_model_training
*A data playground we used to test and learn about our models*

# Abstract 

The goal of our project is to learn about the strengths, weaknesses, and applications of several popular supervised
machine learning models. We will hypothesize how well each model predicts heart disease on a large dataset of
health-related metrics, train each model on the same data, and analyze their performance.

# Background

Heart disease is the leading cause of death in the United States, accounting for more than 696,000 deaths in 2020 alone
[1]. It is a disease that, if addressed early and proactively, can be treated or alleviated. Hence, to best address the
prevalence of heart disease in the US, we must use leading technologies to help understand, predict, and detect
indicators of heart disease in patients. Machine learning is already widely used in the field of medicine, including
the domain of disease prediction and classification.

#### Prior Work:

1. Using patient information such as MRI scans, biomarkers, and numerical data about the patient, researchers have
been able to develop a random forest classifier that predicts Alzheimer's Disease in its early stages with up to 85%
accuracy [2]. This research makes proactive care possible for those who are predicted to develop Alzheimer's
disease.
2. Using feature selection and data cleaning techniques, researchers have been able to apply different
machine learning models to multiple sources of data (meteorological, epidemic, media data, etc.) to analyze, predict,
and prevent the spread of infectious diseases [3].
3. Finally, within the space of heart disease classification, researchers have already developed machine learning
and deep learning models to predict heart disease. Using a Heart Disease dataset from UCI, researchers have been able
to leverage deep learning techniques to achieve 94% accuracy in predicting heart disease [4].

As the previous research in this field demonstrates, many different models and data science techniques have been
employed for disease classification. It's important that we understand how we can use machine learning to best predict
and classify health conditions, like heart disease, in order to promote healthy habits and preventative measures for
those who are at risk. This need for understanding motivates our project.

- [1] [Leading Causes of Death](https://www.cdc.gov/nchs/fastats/leading-causes-of-death.htm)
- [2] [Alzheimer's Prediction using ML](https://www.frontiersin.org/articles/10.3389/fpubh.2022.853294/full)
- [3] [Using ML to Limit Disease Spread](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8219638/)
- [4] [Predicting Heart Disease using ML](https://www.hindawi.com/journals/cin/2021/8387680/)

# Problem Statement

In this project, we are trying to find the strengths, weaknesses, and applications of the following supervised machine learning models: Gaussian Process Classification, Support Vector Classification, Decision Trees, K-Nearest Neighbors, MLP Classification, and Polynomial Classification.

Each of us has picked one of these models. For that model, each of us will make a hypothesis about its performance on our dataset, train it, and then analyze its performance.

To test our models, we are using a dataset containing health-related features for over 300,000 subjects, along with a column specifying whether each subject has heart disease. Our models will be tasked with predicting heart disease from the other features, and their performance will be used to judge the accuracy of our hypotheses.

Before we work with the data, we will perform exploratory data analysis (EDA) to look for oddities and important information in our dataset. Although we expect that all features in our dataset are somewhat relevant, there is a chance that some features are not worth including in the dataset. We will use a mixture of automated feature selection and common sense to choose which features (if any) are excluded. Additionally, we may choose to perform feature extraction on the data to create more features.

To hypothesize the performance of a model, we will take into account the size of the dataset, the difficulty of the prediction task (predicting heart disease from various health-related metrics), the number and types of features, and the shape and distribution of the data.

Then, we will train our model on the data. Each of us will start from the same preprocessed data so that the models are easier to compare, although some models may apply additional preprocessing.

Next, we will measure the performance of each model using F1, Precision, and Recall scores on the training and testing datasets. Using those scores, we will try to work out how and why each model's performance deviated from our expectations.


# Data

[Dataset Link](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)

### Imports

In [9]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import warnings
from sklearn.preprocessing import OneHotEncoder
from matplotlib.patches import Patch
import re
pd.options.display.max_columns = 999
warnings.filterwarnings('ignore')

### Data General Knowledge

As we can see below, our dataset is a collection of information related to heart health. The first column is a *True*
or *False* value which tells us whether that row's person has some form of heart disease. This is the variable we will
be trying to predict with our models, based on the information the other features provide.

We can see below that we have 17 different features with which to predict heart disease, and over 300,000 data points!

In [10]:
# Import the data as a dataframe
data = pd.read_csv("data/raw_data.csv")

# Create a list of each type of feature
nominal_features = ['Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking', 'Sex', 'Race', 'Diabetic',
                  'PhysicalActivity', 'Asthma', 'KidneyDisease', 'SkinCancer']
ordinal_features = ['AgeCategory', 'GenHealth']
continuous_features = ['BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']


# region Convert the nominal features into one-hot encodings

# Create a One-Hot Encoder
encoder = OneHotEncoder()

# For each nominal feature...
for feature in nominal_features:

    # Get an encoded version
    encoded_feature = pd.DataFrame(encoder.fit_transform(data[[feature]]).toarray(),
                                   columns=[f'{feature}_{f_class}' for f_class in data[feature].unique()])

    # Remove the old feature from the data
    data = data.drop(feature, axis=1)

    # Add the encoded feature to the data
    data = data.join(encoded_feature)

# endregion Convert the nominal features into one-hot encodings

# region Convert the ordinal features into labels

# For each ordinal feature...
for feature in ordinal_features:

    # Replace the old feature with an encoded feature
    data[feature] = data[feature].astype('category').cat.codes

# endregion Convert the ordinal features into labels

# Convert the output column to be numerical
data['HeartDisease'] = data['HeartDisease'].astype('category').cat.codes

# Display the number of columns in the dataframe
print(f"Number of Raw Features: {len(data.columns)}")
print()
print(f"Number of Datapoints: {data.shape[0]}")
print()

# Display the head of the dataframe
data.head()

Number of Raw Features: 35

Number of Datapoints: 319795



Unnamed: 0,HeartDisease,BMI,PhysicalHealth,MentalHealth,AgeCategory,GenHealth,SleepTime,Smoking_Yes,Smoking_No,AlcoholDrinking_No,AlcoholDrinking_Yes,Stroke_No,Stroke_Yes,DiffWalking_No,DiffWalking_Yes,Sex_Female,Sex_Male,Race_White,Race_Black,Race_Asian,Race_American Indian/Alaskan Native,Race_Other,Race_Hispanic,Diabetic_Yes,Diabetic_No,"Diabetic_No, borderline diabetes",Diabetic_Yes (during pregnancy),PhysicalActivity_Yes,PhysicalActivity_No,Asthma_Yes,Asthma_No,KidneyDisease_No,KidneyDisease_Yes,SkinCancer_Yes,SkinCancer_No
0,0,16.6,3.0,30.0,7,4,5.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0
1,0,20.34,0.0,0.0,12,4,7.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0
2,0,26.58,20.0,30.0,9,1,8.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0
3,0,24.21,0.0,0.0,11,2,6.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,0,23.71,28.0,0.0,4,4,8.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0


### Data Preprocessing

To make the dataset trainable, we used one-hot encoding on each of the nominal features. Additionally, we changed each
ordinal feature to be represented by ordered numbers.
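One detail worth double-checking in this kind of preprocessing is category order. In pandas, `astype('category')` on string values assigns codes in sorted (alphabetical) order, and `unique()` returns values in order of first appearance, so neither is guaranteed to match the intended ordinal ranking or the column order of an encoder. The sketch below is a minimal illustration on made-up rows, not the code used in this project; the `GenHealth` ordering from worst to best is an assumption about how the levels should be ranked.

```python
import pandas as pd

# Toy rows standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "GenHealth": ["Very good", "Poor", "Excellent", "Fair", "Good"],
    "Smoking": ["Yes", "No", "No", "Yes", "No"],
})

# Assumed ranking of the GenHealth levels, from worst to best
gen_health_order = ["Poor", "Fair", "Good", "Very good", "Excellent"]

# An ordered categorical gives codes that follow the stated ranking
# instead of the alphabetical order of the level names
toy["GenHealth"] = pd.Categorical(
    toy["GenHealth"], categories=gen_health_order, ordered=True
).codes

# get_dummies names each indicator column after the category it encodes
toy = pd.get_dummies(toy, columns=["Smoking"], dtype=float)

print(toy)
```

With an explicit ordering, a larger `GenHealth` code always means better reported health, and each one-hot column name is tied directly to the category it represents.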

### Heart Disease

We are trying to predict heart disease with our dataset. However, the classes are heavily imbalanced: only around 10% of
people in the dataset have heart disease. A classifier that guesses at random would therefore be right about only 10%
of the people it flags as positive, while one that always predicts "no heart disease" would reach roughly 90% accuracy
without finding a single case, which is why accuracy alone is a poor measure here.
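To make the effect of the imbalance concrete, here is a small sketch of the expected positive-class scores for a classifier that flags people at random. The function name and the 50% guess rate are illustrative assumptions; the 10% positive rate is the approximate figure quoted above.

```python
def random_guess_metrics(positive_rate, guess_rate=0.5):
    """Expected scores for a classifier that predicts 'heart disease'
    independently at random with probability guess_rate."""
    tp = positive_rate * guess_rate              # flagged and truly positive
    fp = (1 - positive_rate) * guess_rate        # flagged but actually negative
    fn = positive_rate * (1 - guess_rate)        # missed positives
    tn = (1 - positive_rate) * (1 - guess_rate)  # correctly left unflagged
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": tp + tn}


coin_flip = random_guess_metrics(positive_rate=0.10)
for name, value in coin_flip.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions a coin flip has a precision equal to the positive rate, a recall equal to the guess rate, and an F1 score well below either accuracy figure, which gives a rough floor to compare trained models against.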

# Evaluation Metrics

We will use 3 different metrics to assess the performance of our models on our task of Heart Disease
Prediction/Classification.

The first metric we will use is **Recall/Sensitivity** (formula shown below). We will use this metric to assess
performance of our task because recall measures the model's ability to classify the true positives in its predictions.
This is relevant to the problem we are solving because we want our model to miss as few truly heart disease-prone
individuals as possible. It is very expensive to miss a true positive (miss a heart disease prone patient) in this
context.

The second metric we will consider is **Precision** (formula shown below). We will use this metric to measure the
performance of our model because it tells us how many predicted positives are actually positive. In the context of
our problem, this metric answers the question: How many patients predicted to have heart disease actually had heart
disease? Recall can be inflated simply by predicting that everyone is positive, so we use Precision to ensure
that our model is still making balanced and accurate predictions.

![Precision and Recall](assets/PrecisionRecall_formula.png)

The last metric we will consider is **the F1 score**, as it is the harmonic mean of Precision and Recall. We use F1 to
make sure that our model's predictions balance Precision and Recall, so that we can assess whether our model is biased
towards predicting heart disease or no heart disease.

![F1](assets/F1_formula.png)

In the evaluation of our models, we will primarily be looking at Recall/Sensitivity, as it is the most relevant metric
to the problem we are trying to address, which is the prediction/classification of heart disease. However, we will
take the other two metrics into account, as they provide insight into the general accuracy of our model.
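As a quick illustration of how the three metrics relate, the sketch below computes them from confusion-matrix counts for the positive class. The helper name and the counts are hypothetical and chosen only for the example.

```python
def positive_class_metrics(tp, fp, fn):
    """Precision, recall and F1 for the positive (heart disease) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: a cautious model that flags few patients
precision, recall, f1 = positive_class_metrics(tp=25, fp=65, fn=131)
print(f"cautious:     precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Hypothetical counts: a model that flags every patient as positive
all_positive = positive_class_metrics(tp=100, fp=900, fn=0)
print("flag everyone: precision={:.3f} recall={:.3f} f1={:.3f}".format(*all_positive))
```

The second case shows why recall cannot be read on its own: flagging everyone gives perfect recall, but precision collapses to the positive rate and F1 stays low.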

# Model Testing

To experiment with our models and understand how to use them before applying them to our real dataset, we created a
model playground. With it, each of us trained our model on each of the playground's datasets and learned a lot about
what our model did well and where it struggled.

![GPC Example](assets/GPC_testing.png)

# Hypotheses


### Gaussian Process Classifier
*By Will Sumerfield*

Given that GPC models perform very well on datasets with a good spread over the data space, and that our dataset is very
large, I predict that the Gaussian Process Classifier will perform very well if I can find a good kernel function for
the data. However, I expect that finding one will be very difficult, given how many columns our data has.

Additionally, I may need to train the GPC on smaller subsamples of the data, given that exact GPCs need $O(n^2)$ memory
and $O(n^3)$ training time in the number of training points $n$. I expect this model to be among the best we employ, if
not the best.

### Support Vector Classifier
*By Miguel Monares*

Support Vector Machines (SVMs) are effective in high-dimensional spaces. Given that our data has a decent number of
features, we can expect this to work in the SVM's favor. An SVM works well when the data is mostly separable, but
performs poorly when the classes overlap. Hence, we can expect that the SVM's heart disease classification performance
will depend heavily on whether the classes can be effectively separated by the features. It will be important that we
use an appropriate kernel in order to get the best performance from this model.

### Decision Tree Classifier
*By Miguel Monares*

Decision Trees may be a useful model for our task of heart disease classification because they offer interpretability
and visualization that may allow us to discover insights and connections between the features that are indicative of
heart disease. However, in order to maximize our performance, we need to make sure that the decision tree doesn't
overfit to the data, which we can influence by controlling pruning and max-depth. I expect this model will be able to
find underlying connections between the features in our data, but will not be among the best.

### Random Forest Classification
*By Abdalla Atalla*

As the number of trees increases, the forest becomes less prone to overfitting, which will be a benefit. We will also
aim for low bias and low correlation between trees, which should lead to better accuracy. I expect that the RF will do
well because of its accuracy, its simplicity, and how computationally inexpensive it is to work with many features.
Another positive of RF is that it handles large amounts of data well and isn't too sensitive to outliers in the data.

### Neural Networks
*By Ritik Raina*

Technically, we are working with a binary classification problem: a person either has or does not have heart disease.
Here, neural networks will be helpful because they learn weights and biases for each of the input features during
training. Using neural network architectures also opens up the opportunity to work with activation functions, which
strongly influence how the features and their values contribute to the prediction.

### Logistic Regression
*By Matilda Michel*

Given the high dimensionality of the data, it’s hard to tell if the data will be linearly separable and if logistic
regression will be able to accurately predict the data.

By preprocessing the data into polynomial features of varying degrees, we can see how the model performs at various
higher orders and try to best fit the relationships in the data. Given the nature of our dataset, I expect a model with
a higher-degree polynomial will perform the best, though I'll need to be careful about overfitting. Another issue I
might run into is having enough memory to run higher degrees of polynomial features, as the number of generated
features grows combinatorially with the degree and the number of original features.

To conserve time, we can also use a polynomial kernel trick for logistic regression, and see how that fits the data on
higher degrees. I’m expecting that this could perform just as well or better given that it's less computationally
expensive so I might be able to tune the hyperparameters more freely.
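To get a feel for the memory concern, the sketch below counts how many polynomial terms a full expansion produces. It assumes 34 input columns (the 35 encoded columns minus the `HeartDisease` target) and a dense float64 matrix; the helper names are hypothetical, and the count includes the constant term.

```python
from math import comb


def n_polynomial_features(n_features, degree):
    """Number of monomials of total degree <= degree in n_features variables,
    counting the constant (bias) term."""
    return comb(n_features + degree, degree)


def dense_gigabytes(n_rows, n_columns, bytes_per_value=8):
    """Approximate size of a dense float64 matrix in gigabytes."""
    return n_rows * n_columns * bytes_per_value / 1e9


n_inputs = 34     # assumption: 35 encoded columns minus the target
n_rows = 319795   # number of datapoints reported above

counts = {degree: n_polynomial_features(n_inputs, degree) for degree in range(1, 5)}
for degree, n_terms in counts.items():
    print(f"degree {degree}: {n_terms} features, "
          f"about {dense_gigabytes(n_rows, n_terms):.1f} GB if stored densely")
```

Under these assumptions the expansion goes from hundreds of columns at degree 2 to tens of thousands at degree 4, which is far more than fits in memory for the full dataset.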



# Results

We chose 7 different supervised machine learning models to explore our dataset with and to see how their performances
compare. We individually explored these models and stated our own hypotheses on how well we think they will perform
and why, given what we know about our data.

Each metric here is based on predicting the positive case, since with a measure like heart disease, we care much more
about finding unhealthy individuals than finding healthy ones.

### Gaussian Process Classifier
*By Will Sumerfield*

| Measure | Score |
|---|---|
| Train Precision | 0.14285714285213123 |
| Train Recall | 0.10223642172523961 |
| Train F1 Score | 0.14541233333221455 |
| Test Precision | 0.08712890539028329 |
| Test Recall | 0.11923922938209103 |
| Test F1 Score | 0.09239812829483894 |

### Support Vector Classifier
*By Miguel Monares*

| Measure | Score |
|---|---|
| Train Precision | 0.16666666666666666 |
| Train Recall | 0.10223642172523961 |
| Train F1 Score | 0.12673267326732673 |
| Test Precision | 0.2777777777777778 |
| Test Recall | 0.16025641025641027 |
| Test F1 Score | 0.20325203252032523 |

### Decision Tree Classifier
*By Miguel Monares*

| Measure | Score |
|---|---|
| Train Precision | 1.0 |
| Train Recall | 0.9936102236421726 |
| Train F1 Score | 0.9967948717948718 |
| Test Precision | 0.18137254901960784 |
| Test Recall | 0.23717948717948717 |
| Test F1 Score | 0.20555555555555557 |

### Random Forest Classification
*By Abdalla Atalla*

| Measure | Score |
|---|---|
| Train Precision | 0.9330985915492958 |
| Train Recall | 0.8466453674121406 |
| Train F1 Score | 0.8877721943048575 |
| Test Precision | 0.35 |
| Test Recall | 0.1794871794871795 |
| Test F1 Score | 0.23728813559322035 |

### Neural Networks
*By Ritik Raina*

| Measure | Score |
|---|---|
| Train Precision | 0.5845070422535211 |
| Train Recall | 0.08186749958901858 |
| Train F1 Score | 0.1436193222782985 |
| Test Precision | 0.5670347003154574 |
| Test Recall | 0.0788031565103025 |
| Test F1 Score | 0.13837567359507313 |

### Logistic Regression
*By Matilda Michel*

| Measure | Score |
|---|---|
| Train Precision | 0.23917748917748918 |
| Train Recall | 0.7060702875399361 |
| Train F1 Score | 0.35731608730800324 |
| Test Precision | 0.2314410480349345 |
| Test Recall | 0.6794871794871795 |
| Test F1 Score | 0.3452768729641694 |


### Summary
On the test set, the highest recall belongs to Logistic Regression with Polynomial Features at .679, the highest precision belongs to Neural Networks at .567, and the highest F1 score goes to Logistic Regression at .345.

Overall, we can see that the logistic regression performed the best! This goes to show that simpler models can often
perform better than complex ones, and that their faster training times and easier setup are often worth it.

# Analysis

Below, we first analyze the results from each trained model individually, looking at how it performed on its own (especially its recall) and how it performed relative to the other models.


### Gaussian Process Classifier
*By Will Sumerfield*

I learned a lot from training my model on this real dataset. First, I learned that Gaussian Processes
are very difficult to train. They take up huge amounts of memory and a large amount of time to train.

I attempted to follow a guide on GPCs, specifically on how to create custom kernels for one-hot-encoded data.
However, the SKLearn library made this difficult to do, and while I believe I had a working version, the training
time when using a kernel for each feature becomes very large. I suspect that libraries made more specifically for GPCs
would work better for this.

I was correct in my hypothesis that GPCs require a lot of knowledge about the data, and that their hyperparameters
(kernel functions) are really what make or break them - they're definitely not 'plug and chug' models.

In comparison to other models, this model performed relatively poorly. However, I do still believe that more 
processing power and time to craft the kernels would allow this model to be one of the better ones. The question now
remains - is it worth it?

The kernel I found that worked best was the DotProduct()*RBF() kernel, which is the product of the two kernels
(multiplying kernels acts like a logical 'and').
Notably, adding the White Noise Kernel decreased performance, suggesting that our data was not very noisy, and that the
extra noise term actually caused the model to underfit.
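For readers who want to try this kernel, here is a minimal scikit-learn sketch of fitting a GPC with a `DotProduct() * RBF()` kernel. It uses a small synthetic, imbalanced dataset from `make_classification` as a stand-in for the heart disease data, so the sample size, feature count, and class weights are assumptions rather than the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import DotProduct, RBF
from sklearn.model_selection import train_test_split

# Small synthetic stand-in with roughly 10% positives (not the real dataset)
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Product kernel: both the linear (DotProduct) and the local (RBF)
# similarity must be high for two points to be considered similar
kernel = DotProduct() * RBF()

gpc = GaussianProcessClassifier(kernel=kernel, random_state=0)
gpc.fit(X_train, y_train)

# Class probabilities for the held-out points, one column per class
proba = gpc.predict_proba(X_test)
print(gpc.kernel_)
print(proba[:3])
```

Keeping the training set to a few hundred rows keeps the fit fast; the cost grows quickly with the number of training points, which is the bottleneck described above.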

### Support Vector Classifier
*By Miguel Monares*

The Support Vector Classification model performed decently compared to the other models. Its Precision and F1 scores beat our baseline (.15), which indicates that this model predicts a patient's heart disease status better than the baseline does. Its recall is decent compared to the other models, so it is one we may consider using in practice. Of the kernel functions we tried, sigmoid performed the best. This is probably because sigmoid is well suited to binary tasks and the data is not inherently linearly separable, so the other kernels don't perform as well.
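A kernel comparison like the one described can be sketched with a cross-validated grid search scored on recall. This is an illustrative setup on synthetic data, not the search used in this project; the scaling step, the three folds, and the list of kernels are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic stand-in with roughly 10% positives (not the real dataset)
X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# Scale the features first, since SVMs are sensitive to feature scale
pipeline = make_pipeline(StandardScaler(), SVC())

# Try each kernel and rank them by recall on the positive class
param_grid = {"svc__kernel": ["linear", "poly", "rbf", "sigmoid"]}
search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=3)
search.fit(X, y)

best_kernel = search.best_params_["svc__kernel"]
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params["svc__kernel"], round(score, 3))
print("best kernel by recall:", best_kernel)
```

Which kernel wins depends on the data, so the ranking printed here says nothing about the heart disease dataset itself.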

### Decision Tree Classifier
*By Miguel Monares*

The decision tree classification model had an F1 score of .222, which beats the baseline model's F1 score (.15). It also has a higher precision score than the baseline. Its F1 performance is close to that of the support vector classifier, but the decision tree classifier has a higher recall (.256 vs .160). Since we place more importance on recall than on precision or F1, we consider the decision tree model better suited to this task of predicting heart disease. Of the criterion functions we considered for measuring the quality of splits, gini outperformed entropy and log_loss, which may be because gini favors larger partitions. Gini is also faster to compute, so we would prefer this criterion in the field.
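The gap between training and test scores for a decision tree is easy to reproduce on toy data. The sketch below compares a fully grown tree with a depth-limited one for two split criteria; the synthetic data and the depth limit of 4 are assumptions chosen for illustration, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small synthetic stand-in with roughly 10% positives (not the real dataset)
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

results = {}
for criterion in ("gini", "entropy"):
    for max_depth in (None, 4):  # None lets the tree grow until leaves are pure
        tree = DecisionTreeClassifier(criterion=criterion,
                                      max_depth=max_depth, random_state=0)
        tree.fit(X_train, y_train)
        results[(criterion, max_depth)] = (
            recall_score(y_train, tree.predict(X_train)),  # train recall
            recall_score(y_test, tree.predict(X_test)),    # test recall
        )

for (criterion, max_depth), (train_recall, test_recall) in results.items():
    print(f"{criterion:8s} max_depth={max_depth}: "
          f"train recall={train_recall:.3f}, test recall={test_recall:.3f}")
```

A fully grown tree memorizes the training set, so a large drop from training recall to test recall is a sign that pruning or a depth limit is worth tuning.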

### Random Forest Classification
*By Abdalla Atalla*

For the Random Forest classification model, we see that it did relatively well compared to our baseline model, with an F1 score of .237. The best value for n_estimators was close to 5, which produced a good precision score of .350. The recall score of .179 tells us that this is a very picky classifier: it misses a lot of true positives, trading recall for higher precision. This model has potential due to its relatively good precision and F1 scores, but it will be interesting to see how its recall compares with the other models.

### Neural Networks
*By Ritik Raina*

### Logistic Regression
*By Matilda Michel*

## Limitations
A major limitation a few of us had was that our models were very computationally expensive, in terms of memory and
processing time. This included Logistic Regression with Polynomial Features, Gaussian Process Classification, and SVCs.
We had to create a smaller random subset of the data to be able to perform some of the tasks in a reasonable amount of
time. Since our dataset was so large, some processing was severely limited by the amount of memory needed.
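When subsampling an imbalanced dataset, one option is to sample each class separately so the subset keeps the original share of positives. The sketch below shows that idea with the standard library only; it is a suggestion under that assumption, not a description of how the subset in this project was drawn, and the function name is hypothetical.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted row indices of a random subset that keeps
    each class's share of the data (at least one row per class)."""
    rng = random.Random(seed)
    rows_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        rows_by_class[label].append(index)

    chosen = []
    for rows in rows_by_class.values():
        n_keep = max(1, round(len(rows) * fraction))
        chosen.extend(rng.sample(rows, n_keep))
    return sorted(chosen)


# Toy labels with 10% positives, mirroring the rough class balance above
labels = [1] * 100 + [0] * 900
subset = stratified_subsample(labels, fraction=0.1)

n_positive = sum(labels[i] for i in subset)
print(len(subset), "rows kept,", n_positive, "of them positive")
```

A plain random sample of a small fraction can end up with very few positive rows by chance, which makes recall estimates noisy; sampling per class avoids that.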

When exploring different degrees of polynomial features, the highest degree the model could run on
without crashing was 4, and even then it could only compare multiple hyperparameters in a grid search for up to degree 3.
The GPC also had serious issues with runtime, and Will was only able to test a very limited number of kernels, which is
a serious limitation when kernel choice is a very experimental and dynamic process.

The accuracy of our models may also have been restricted by the lack of positive instances of the predicted class
compared to negatives, possibly making it harder for them to find the relationships within the data.


## Conclusion
After comparing the performance of several popular machine learning algorithms on our task of heart disease prediction, we find that logistic regression with polynomial features performed the best on our Heart Disease dataset when predicting the positive class and focusing on recall and F1 scores. With a degree of 2, that model reached a max F1 score of 0.345, while a degree of 4 reached a max recall score of 0.853. The other models reached max recall scores of about 0.10 to 0.23 and max F1 scores generally in the 0.20 range.

This work could help other researchers understand what kinds of models to consider when building practical heart disease classifiers in the field. Additionally, our findings could give insight into the features that are most helpful for predicting heart disease, or for classifying other medical conditions and diseases, which may enable researchers to build more accurate models, perhaps by combining the successful models into a stronger ensemble or a highly tuned model.

For future work, we may reexamine the models that performed better on our dataset and try to optimize the parameters and hyperparameters that we use in the model. Additionally, we might try to explore how an ensemble model, using our findings above, may improve our performance on the task further.


# Ethics and Privacy

When developing machine learning applications, it's important to be aware of ethics and privacy concerns and to address
them proactively. The dataset we use does not include Personally Identifiable Information (PII), which helps ensure
that the privacy of patients is kept.

Ethical concerns can arise in the results processing and analysis stage. It's important that the results are validated
to make sure they are reasonable, so that spurious correlations in the underlying data are not pursued or perceived to
be true. For example, if a correlation between race and diabetes is found, it's important to not jump to conclusions and
declare causality/truth behind the correlation.

Additionally, we need to make sure that the data we use to train our model is representative of the general population,
so that we can best fit our model to classify heart disease in patients with different types of backgrounds.
In the field of medicine, it's important that our product is tested and verified in the interest of liability. While it
is important that our model misses as few true heart disease patients as possible, we need to make sure that it doesn't
over-classify patients who are not prone to heart disease as positive, as this would lead to resources and facilities
being directed toward those who don't need them.

In the field, it is possible that this product may produce some unintended consequences relating to ethics or privacy.
If this product is released into the field, we will make sure to address any unintended outcomes breaching ethical or
privacy concerns by working to understand why such breaches occurred and being proactive about alleviating such
concerns.