# Heart Disease Report

The dataset used in this report is the processed Cleveland dataset, which can be found in the data folder here: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

The objective of this study is to apply a classifier to the dataset to predict whether an individual has heart disease.

In [None]:
import seaborn as sns
from heart_disease import DataLoader, RandomForest

## Ingest and pre-process the data

The data is ingested and pre-processed with the following steps:
* Missing data is replaced. There are two missing values in the Thal column, which are replaced with the label for normal. There are also four missing values in the Number of Major Vessels column, which are replaced with 0, the most common value in that column.
* Categorical features are converted into one-hot encoded features.
* All values are min-max normalised to be between 0 and 1.
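
The three steps above can be sketched with pandas as follows. This is a minimal illustration on a toy frame, not the `DataLoader` implementation: the `preprocess` helper, the category labels in the Thal column, and the example rows are all assumptions made for the sketch.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw data; rows and category labels are
# illustrative assumptions, not values taken from the Cleveland file.
raw = pd.DataFrame({
    "Age": [63, 67, 41, 56],
    "Number of Major Vessels": [0.0, 3.0, np.nan, 0.0],
    "Thal": ["normal", "reversible", None, "fixed"],
})

def preprocess(df):
    """Hypothetical helper: impute, one-hot encode, then min-max scale."""
    out = df.copy()
    # 1. Replace missing values with fixed fill values.
    out["Thal"] = out["Thal"].fillna("normal")
    out["Number of Major Vessels"] = out["Number of Major Vessels"].fillna(0)
    # 2. One-hot encode the categorical column.
    out = pd.get_dummies(out, columns=["Thal"], dtype=float)
    # 3. Min-max scale every column to the range [0, 1].
    col_min, col_max = out.min(), out.max()
    span = (col_max - col_min).replace(0, 1)  # guard against constant columns
    return (out - col_min) / span

processed = preprocess(raw)
```

Scaling after one-hot encoding leaves the indicator columns unchanged, since they already take only the values 0 and 1.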

Ideas for future improvement:
* Further test coverage
* Method to version the dataset

In [None]:
Loader = DataLoader()
dataset = Loader.dataset

In [None]:
dataset.describe(include="all")

## Explore the data

Here a simple pairplot is used to visualise pairs of features in two dimensions. Only a few features are included in the plot below to reduce run time. From the plot we can see that some features, such as Number of Major Vessels, show a noticeable split between heart disease and no heart disease.

Ideas for future improvement:
* Visualise dataset after applying PCA.
* Apply t-SNE to see whether there is a clear split in a lower-dimensional latent space.
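
As a rough sketch of the PCA idea, the snippet below projects a feature matrix onto its first two principal components with scikit-learn. The data here is synthetic and stands in for the pre-processed features, so the variable names and shapes are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 100 samples with 10 features already scaled to [0, 1].
features = rng.random((100, 10))
labels = rng.integers(0, 2, size=100)

pca = PCA(n_components=2)
projected = pca.fit_transform(features)  # shape (100, 2)

# Fraction of the total variance retained by the two components.
explained = pca.explained_variance_ratio_.sum()

# The two columns of `projected` can then be plotted against each other and
# coloured by `labels`, for example with sns.scatterplot.
```

Swapping `PCA` for `sklearn.manifold.TSNE` gives the t-SNE variant, since both expose `fit_transform`.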

In [None]:
sns.pairplot(data=dataset, hue="Heart Disease", vars=["Age", "Sex", "Resting Blood Pressure", "Number of Major Vessels", "Chest Pain Asymptomatic"])

## Train Model

A random forest was chosen to train on this dataset because it is a well-established choice for binary classification on tabular data. Furthermore, bagging averages the predictions of many trees, so the model typically has lower variance and is less likely to overfit than a single decision tree. Feature importance is also easy to extract by looking at which features decrease the Gini impurity the most.
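
To make the Gini impurity criterion concrete, here is a small pure-Python sketch. The impurity of a node is $1 - \sum_k p_k^2$, where $p_k$ is the fraction of samples in class $k$, and a split is scored by how much it lowers the weighted impurity of the children. The function names are illustrative and not part of the `heart_disease` package.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k ** 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Drop in weighted Gini impurity when `parent` is split into two children."""
    n = len(parent)
    weighted_children = (
        (len(left) / n) * gini_impurity(left)
        + (len(right) / n) * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children

parent = [0, 0, 0, 0, 1, 1, 1, 1]
# A split that separates the classes perfectly removes all impurity (0.5).
perfect = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])
# A split that leaves both children as mixed as the parent removes none (0.0).
useless = impurity_decrease(parent, [0, 0, 1, 1], [0, 0, 1, 1])
```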

In the cells below we train a random forest model on a single train/test split. Afterwards, another train/test split is created, and its training set is used to perform 10-fold cross-validation with a parameter grid search. To reduce run time, the number of hyperparameters searched over and the granularity of the grid are kept small.

Ideas for future improvement:
* Replace k-fold cross-validation with leave-one-out (LOO). Because the dataset is small, LOO would maximise the training set size in each fold and give a more reliable estimate. LOO was not implemented here to reduce run time.
* Expose more parameters of the random forest model and perform a larger grid search.
* Better test coverage
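
The leave-one-out idea could look roughly like the sketch below, which uses scikit-learn directly on synthetic data rather than the `RandomForest` wrapper from this notebook. Accuracy is used as the metric because each LOO fold holds a single sample, so a per-fold AUC is undefined.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the training set: 30 samples, 5 features.
X = rng.random((30, 5))
y = (X[:, 0] > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=20, random_state=0)
# One fit per sample: each fold trains on 29 rows and tests on the remaining one.
scores = cross_val_score(clf, X, y, cv=LeaveOneOut(), scoring="accuracy")
loo_accuracy = scores.mean()
```

The cost grows linearly with the number of samples, which is the run-time trade-off mentioned above.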

In [None]:
train_features, train_labels, test_features, test_labels = Loader.split_dataset(test_size=0.2, balance=True)

In [None]:
rf = RandomForest()
model, score = rf.train(train_features, train_labels, test_features, test_labels)

Perform hyperparameter tuning using k-fold cross-validation. To keep the run time of this notebook low, the parameter search here is deliberately minimal.
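
For comparison, a grid search with 10-fold cross-validation can be sketched directly in scikit-learn as shown here. This is not how `perform_k_fold_cv` is implemented (its internals are not shown in this report); the synthetic data, the reduced grid, and the choice of `roc_auc` as the scoring metric are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(0)
# Synthetic stand-in for the training set: 120 samples, 8 features.
X = rng.random((120, 8))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Small illustrative grid: 2 x 2 = 4 parameter combinations.
param_grid = {"n_estimators": [10, 25], "max_depth": [3, None]}
# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=cv,
    scoring="roc_auc",
)
search.fit(X, y)
best_params = search.best_params_
```

Note that recent scikit-learn releases no longer accept `max_features="auto"` for `RandomForestClassifier`; `"sqrt"` is the equivalent setting for classifiers.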

In [None]:
train, test = Loader.split_dataset(test_size=0.1, balance=False, split_labels=False)
params = {'n_estimators': [100, 250], 'max_depth': [3, None], 'max_features': [7, "auto"]}
param_scores = rf.perform_k_fold_cv(params, train, folds=10)

## Evaluate Model

The area under the receiver operating characteristic curve (AUC) is used to evaluate the model. Furthermore, F-beta scores are calculated for a range of beta values. From the plots it can be seen that the random forest produces a very high AUC score. We can also see that by altering the classification threshold we can influence whether the model prioritises recall or precision.
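
The two metrics can be illustrated on a handful of made-up predictions. The F-beta score is $(1 + \beta^2) \cdot P \cdot R / (\beta^2 \cdot P + R)$ for precision $P$ and recall $R$, so $\beta > 1$ weights recall more heavily and $\beta < 1$ weights precision more heavily. The labels, probabilities, and the `fbeta_at` helper below are invented for the sketch and are not output from the model in this report.

```python
import numpy as np
from sklearn.metrics import fbeta_score, roc_auc_score

# Made-up ground truth and predicted probabilities for six individuals.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.4, 0.6, 0.3, 0.7, 0.9])

# AUC is threshold-free: it only depends on how the probabilities rank.
auc = roc_auc_score(y_true, y_prob)

def fbeta_at(threshold, beta):
    """F-beta score after turning probabilities into labels at `threshold`."""
    y_pred = (y_prob >= threshold).astype(int)
    return fbeta_score(y_true, y_pred, beta=beta)

# A low threshold catches every positive (high recall, lower precision),
# which a recall-weighted score (beta=2) rewards.
recall_weighted_low = fbeta_at(0.2, beta=2)
recall_weighted_mid = fbeta_at(0.5, beta=2)
# A precision-weighted score (beta=0.5) prefers the stricter threshold.
precision_weighted_low = fbeta_at(0.2, beta=0.5)
precision_weighted_mid = fbeta_at(0.5, beta=0.5)
```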

Ideas for future improvement:
* Compare the performance to a simple benchmark model (e.g. logistic regression).
* Add regression tests.
* Add further directional change tests: if we perturb the input space, we expect the result to increase or decrease. For example, if we increase the age, we expect the predicted probability to increase. We want coverage across all our features.
* Add invariance tests: a set of perturbations to the input space that we expect won't change the model's output.
* Add pre-train model tests, e.g. the model shape aligns with the classes and the ranges are within our expectation.
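
The directional and invariance tests in the list above could be written as small helpers that take any probability-returning function. Everything in this sketch is hypothetical: the helper names are made up, and `toy_predict` is a hand-written stand-in for the trained model so that the example is deterministic.

```python
import numpy as np

def check_directional(predict, x, feature_idx, delta, expect_increase=True):
    """True if shifting one feature never moves predictions the wrong way."""
    perturbed = x.copy()
    perturbed[:, feature_idx] += delta
    diff = predict(perturbed) - predict(x)
    if expect_increase:
        return bool(np.all(diff >= 0))
    return bool(np.all(diff <= 0))

def check_invariance(predict, x, feature_idx, delta, tol=1e-6):
    """True if shifting one feature leaves predictions unchanged within `tol`."""
    perturbed = x.copy()
    perturbed[:, feature_idx] += delta
    return bool(np.allclose(predict(perturbed), predict(x), atol=tol))

def toy_predict(x):
    """Stand-in model: probability rises with feature 0 and ignores feature 1."""
    return 1.0 / (1.0 + np.exp(-(4.0 * x[:, 0] - 2.0)))

samples = np.array([[0.2, 0.5], [0.6, 0.1]])

directional_ok = check_directional(toy_predict, samples, feature_idx=0, delta=0.1)
invariant_ok = check_invariance(toy_predict, samples, feature_idx=1, delta=0.3)
```

For a scikit-learn style classifier, `predict` could be something like `lambda x: model.predict_proba(x)[:, 1]`, assuming the fitted model exposes `predict_proba`.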

In [None]:
train_labels, train_features, feature_list = Loader.features_and_labels_to_numpy(train)
test_labels, test_features, _ = Loader.features_and_labels_to_numpy(test)

model, score = rf.train(train_features, train_labels, test_features, test_labels, n_estimators=250, max_depth=3, max_features='auto')
auc_score = rf.evaluate_model(model, test_features, test_labels, betas=[0.1, 0.5, 1, 2, 5])

## Feature Importance

Below, all the features from the input space are plotted with their respective importance as evaluated by the random forest. It can be seen that the number of major vessels, asymptomatic chest pain and normal thal are the most decisive features when making a prediction.
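
As a point of reference, impurity-based importances can be read from a scikit-learn forest through its `feature_importances_` attribute. The sketch below uses synthetic data in which only one feature carries signal; whether `plot_feature_importance` reads the same attribute is an assumption, since its implementation is not shown here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Synthetic data: four features, of which only "f2" determines the label.
X = rng.random((300, 4))
y = (X[:, 2] > 0.5).astype(int)
names = ["f0", "f1", "f2", "f3"]

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances are normalised to sum to 1; sort from most to least important.
ranked = sorted(
    zip(names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
```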

In [None]:
rf.plot_feature_importance(model, feature_list)