# Gradient boosted forest model
Author: Roddy Jaques <br>
*NHS Blood and Transplant*
***
## Assessing a gradient boosted forest model

In this notebook, models for the DBD and DCD cohorts are fitted using a gradient boosted forest classifier.

First, as usual, load in the data and create the training and test datasets...

In [7]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
import sklearn.metrics as mets
import time
%matplotlib inline

# Function to print confusion matrix, balanced accuracy and accuracy for a set of actual and predicted labels
def show_metrics(actual,predict):
    """ Prints the confusion matrix, balanced accuracy and accuracy given datasets of actual and predicted labels
    
    Arguments:
        actual - Dataset of actual labels
        predict - Dataset of predicted labels
     """
    cm = mets.confusion_matrix(actual, predict)
    
    print("********* MODEL METRIC REPORT *********\n\nConfusion matrix:\n")

    print("TP  FN\nFP  TN\n") #this is a reminder of what each part of the confusion matrix means e.g. TP = True Positive
    
    # print the confusion matrix
    print(str(int(cm[0,0])) + "    " + str(int(cm[0,1])))
    print(str(int(cm[1,0])) + "    " + str(int(cm[1,1])) + "\n") 

    # classification report (per-class precision, recall and F1) for the supplied labels
    print("Classification report:\n")
    print(mets.classification_report(actual, predict))

    print("Balanced accuracy: " + str(round(mets.balanced_accuracy_score(actual, predict),2)))

    print("Accuracy: " + str(round(mets.accuracy_score(actual, predict),2)))
    
    # Predicted vs actual consent rates
    cons_rate = int(100 * len(actual[actual=="Consent"]) / len(actual) )
    print("\nActual consent rate: " + str(cons_rate))
    
    pred_rate = int(100 * len(predict[predict=="Consent"]) / len(predict) )
    print("Predicted consent rate: " + str(pred_rate))
    
    pass
 
# Function to format consent column from integer code to text
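def format_consent(x):
    """ Maps the integer consent code to a text label (2 = Consent, 1 = Non-consent).

    Any other code returns None.
    """
    if x == 2:
        return "Consent"
    if x == 1:
        return "Non-consent"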

In [8]:
#Read in datasets 
dbd_model_data = pd.read_csv("Data/dbd_model_data.csv")
dcd_model_data = pd.read_csv("Data/dcd_model_data.csv")

# Columns used to create DBD model
dbd_cols = ["wish", "FORMAL_APR_WHEN", "donation_mentioned", "app_nature", "eth_grp", "religion_grp", "GENDER", "FAMILY_WITNESS_BSDT", "DTC_PRESENT_BSD_CONV", 
            "acorn_new", "adult","FAMILY_CONSENT"]

dbd_model_data2 = pd.get_dummies(data=dbd_model_data,columns=dbd_cols[:-1],drop_first=True)

dbd_features = dbd_model_data2.drop("FAMILY_CONSENT",axis=1)
dbd_consents = dbd_model_data2["FAMILY_CONSENT"].apply(format_consent)

# Columns used to create DCD model in paper
dcd_cols = ["wish", "donation_mentioned", 
            "app_nature", "eth_grp", "religion_grp", "GENDER", "DTC_WD_TRTMENT_PRESENT", 
            "acorn_new", "adult","cod_neuro","FAMILY_CONSENT"]

dcd_model_data2 = pd.get_dummies(data=dcd_model_data,columns=dcd_cols[:-1],drop_first=True)

dcd_features = dcd_model_data2.drop("FAMILY_CONSENT",axis=1)
dcd_consents = dcd_model_data2["FAMILY_CONSENT"].apply(format_consent)

# creating a train and testing dataset for DBD and DCD approaches
DBD_X_train, DBD_X_test, DBD_y_train, DBD_y_test = train_test_split(dbd_features,dbd_consents, test_size=0.33, random_state=10)

DCD_X_train, DCD_X_test, DCD_y_train, DCD_y_test = train_test_split(dcd_features,dcd_consents, test_size=0.33, random_state=10)

<br>
Next, fit a gradient boosted forest classifier with default hyperparameters to the DBD data and then to the DCD data.
<br><br>

In [13]:
# fitting tree to training data 
boost_model = GradientBoostingClassifier()

# fit to DBD training data 
DBD_boost = boost_model.fit(DBD_X_train,DBD_y_train)

# predict and evaluate test data
DBD_preds = DBD_boost.predict(DBD_X_test)

In [14]:
# show metrics for DBD model
show_metrics(DBD_y_test, DBD_preds)

********* MODEL METRIC REPORT *********

Confusion matrix:

TP  FN
FP  TN

1278    104
384    234

Classification report:

              precision    recall  f1-score   support

     Consent       0.77      0.92      0.84      1382
 Non-consent       0.69      0.38      0.49       618

    accuracy                           0.76      2000
   macro avg       0.73      0.65      0.66      2000
weighted avg       0.75      0.76      0.73      2000

Balanced accuracy: 0.65
Accuracy: 0.76

Actual consent rate: 69
Predicted consent rate: 83


In [15]:
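# fit a separate classifier to the DCD training data
# (refitting boost_model would also overwrite DBD_boost, which refers to the same object)
DCD_boost = GradientBoostingClassifier().fit(DCD_X_train,DCD_y_train)

# predict DCD test data
DCD_preds = DCD_boost.predict(DCD_X_test)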

In [16]:
# print metrics for DCD model
show_metrics(DCD_y_test, DCD_preds)

********* MODEL METRIC REPORT *********

Confusion matrix:

TP  FN
FP  TN

1477    388
449    790

Classification report:

              precision    recall  f1-score   support

     Consent       0.77      0.79      0.78      1865
 Non-consent       0.67      0.64      0.65      1239

    accuracy                           0.73      3104
   macro avg       0.72      0.71      0.72      3104
weighted avg       0.73      0.73      0.73      3104

Balanced accuracy: 0.71
Accuracy: 0.73

Actual consent rate: 60
Predicted consent rate: 62


***
#### DBD model
The DBD gradient boosted model has a recall of 0.38 for non-consents, lower than the logistic regression model's non-consent recall of 0.44. The balanced accuracy is 0.65, also lower than the logistic regression model's balanced accuracy of 0.67. 

The random forest model, even untuned, outperforms the boosted model. 
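
Balanced accuracy is the unweighted mean of the per-class recalls, so it is not flattered by a dominant class in the way plain accuracy is. As a minimal sketch, the figures in the DBD report above can be reproduced from the confusion matrix counts. The helper names `balanced_accuracy_from_counts` and `accuracy_from_counts` are illustrative assumptions, not part of the notebook, and the "always Consent" baseline is hypothetical.

```python
def balanced_accuracy_from_counts(tp, fn, fp, tn):
    """Mean of the positive-class recall and the negative-class recall."""
    recall_pos = tp / (tp + fn)  # share of actual consents predicted as consent
    recall_neg = tn / (tn + fp)  # share of actual non-consents predicted as non-consent
    return (recall_pos + recall_neg) / 2


def accuracy_from_counts(tp, fn, fp, tn):
    """Share of all cases that were predicted correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


# Counts copied from the untuned DBD confusion matrix (Consent is the positive class)
dbd_bal_acc = balanced_accuracy_from_counts(tp=1278, fn=104, fp=384, tn=234)
dbd_acc = accuracy_from_counts(tp=1278, fn=104, fp=384, tn=234)

# Hypothetical baseline that predicts "Consent" for all 2000 test cases
base_bal_acc = balanced_accuracy_from_counts(tp=1382, fn=0, fp=618, tn=0)
base_acc = accuracy_from_counts(tp=1382, fn=0, fp=618, tn=0)

print(round(dbd_bal_acc, 2), round(dbd_acc, 2))    # 0.65 0.76
print(round(base_bal_acc, 2), round(base_acc, 2))  # 0.5 0.69
```

The baseline reaches an accuracy of 0.69 without identifying a single non-consent, while its balanced accuracy is only 0.5. This is why balanced accuracy and non-consent recall are the more informative metrics for these cohorts.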

#### DCD model
The DCD model has a balanced accuracy of 0.71, higher than the logistic regression model's balanced accuracy but lower than the random forest. 

In this model the consent and non-consent recalls are more balanced than in the random forest model, and the predicted consent rate (62%) is closer to the actual consent rate (60%) than it was for the random forest model.

***

## Hyperparameter tuning

The hyperparameters are tuned with a cross-validated grid search, optimising on balanced accuracy rather than accuracy so that the score isn't inflated by a high recall in the consent class.<br>
For both models the hyperparameters and the ranges of values explored are: <br>
* max_depth - the maximum tree depth. From 1 to 176 in increments of 25.
* n_estimators - the number of boosting stages. From 40 to 280 in increments of 30.
* learning_rate - the contribution of each tree to the ensemble. From 0.05 to 0.55 in increments of 0.1.
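
`np.arange` excludes its stop value, so it is worth listing the candidate values explicitly. The sketch below uses the same `np.arange` calls as the grid search cell and counts how many models the search has to fit; the variable names are illustrative.

```python
import numpy as np

# Same ranges as the params dictionary used in the grid search
max_depths = np.arange(1, 200, step=25)
n_estimators = np.arange(40, 300, step=30)
learning_rates = np.arange(0.05, 0.6, step=0.1)

print(max_depths.tolist())                   # 1, 26, ..., 176
print(n_estimators.tolist())                 # 40, 70, ..., 280
print(np.round(learning_rates, 2).tolist())  # 0.05, 0.15, ..., 0.55

# Every combination is fitted once per cross-validation fold
n_candidates = len(max_depths) * len(n_estimators) * len(learning_rates)
n_fits = n_candidates * 5  # cv=5
print(n_candidates, n_fits)                  # 432 2160
```

With 432 candidate combinations and 5 folds, each search performs 2160 fits, plus one final refit of the best candidate on the full training set. Many of those fits use very deep trees, which is a plausible contributor to the long runtimes reported below.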

In [17]:
# use a 5 fold cross validated grid search to find optimal hyperparameters
cv_boost_model = GradientBoostingClassifier(random_state=66)

# hyperparameters to test
params = {'max_depth':np.arange(1,200,step=25),'n_estimators':np.arange(40,300,step=30),'learning_rate':np.arange(0.05,0.6,step=0.1)}

start_time = time.time()

# train model for highest balanced accuracy
dbd_gs_boost_model = GridSearchCV(cv_boost_model, param_grid=params, scoring="balanced_accuracy",cv=5,n_jobs=3)

dbd_gs_boost_model.fit(DBD_X_train,DBD_y_train)

runtime = time.time() - start_time
print("Runtime = {}minutes".format(round(runtime/60,1)))

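# balanced accuracy of the refit best model on the training data
# (not displayed, as it is not the last expression in the cell)
dbd_gs_boost_model.score(DBD_X_train,DBD_y_train)

# best hyperparameters and their mean cross-validated balanced accuracy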
print(dbd_gs_boost_model.best_params_)
print(dbd_gs_boost_model.best_score_)

Runtime = 35.8minutes
{'learning_rate': 0.25000000000000006, 'max_depth': 26, 'n_estimators': 130}
0.6457502156474684


In [18]:
# predict DBD test data consent
DBD_preds = dbd_gs_boost_model.predict(DBD_X_test)

# print metrics
show_metrics(DBD_y_test,DBD_preds)

********* MODEL METRIC REPORT *********

Confusion matrix:

TP  FN
FP  TN

1082    300
316    302

Classification report:

              precision    recall  f1-score   support

     Consent       0.77      0.78      0.78      1382
 Non-consent       0.50      0.49      0.50       618

    accuracy                           0.69      2000
   macro avg       0.64      0.64      0.64      2000
weighted avg       0.69      0.69      0.69      2000

Balanced accuracy: 0.64
Accuracy: 0.69

Actual consent rate: 69
Predicted consent rate: 69


In [19]:
# fit DCD model

start_time = time.time()

dcd_gs_boost_model = GridSearchCV(cv_boost_model, param_grid=params, scoring="balanced_accuracy",cv=5,n_jobs=3)

dcd_gs_boost_model.fit(DCD_X_train,DCD_y_train)

runtime = time.time() - start_time
print("Runtime = {}minutes".format(round(runtime/60,1)))

# balanced accuracy of the refit best model on the training data (not displayed)
dcd_gs_boost_model.score(DCD_X_train,DCD_y_train)

print(dcd_gs_boost_model.best_params_)
print(dcd_gs_boost_model.best_score_)

Runtime = 4283.1minutes
{'learning_rate': 0.25000000000000006, 'max_depth': 1, 'n_estimators': 250}
0.7012021321179738


In [20]:
DCD_preds = dcd_gs_boost_model.predict(DCD_X_test)

# print metrics
show_metrics(DCD_y_test,DCD_preds)

********* MODEL METRIC REPORT *********

Confusion matrix:

TP  FN
FP  TN

1471    394
461    778

Classification report:

              precision    recall  f1-score   support

     Consent       0.76      0.79      0.77      1865
 Non-consent       0.66      0.63      0.65      1239

    accuracy                           0.72      3104
   macro avg       0.71      0.71      0.71      3104
weighted avg       0.72      0.72      0.72      3104

Balanced accuracy: 0.71
Accuracy: 0.72

Actual consent rate: 60
Predicted consent rate: 62


***
#### Tuned DBD model
Hyperparameter tuning the DBD boosted tree model has not increased the balanced accuracy; it has actually decreased slightly, from 0.65 to 0.64. Tuning has, however, increased the recall for non-consents from 0.38 in the untuned model to 0.49.  

Despite this improvement in non-consent recall over the untuned boosted forest model, the random forest model still performs better. The random forest is also much faster to run, so it could be scaled to bigger datasets more easily. 

#### Tuned DCD model
The performance of the tuned DCD model is similar to that of the logistic regression model. The non-consent recall is 0.63, which is 0.01 higher than the logistic regression model's, and the balanced accuracy is 0.71, equal to the logistic regression model's.  

***

### Conclusions

For predicting family consent for organ donation in the DBD cohort, the random forest model was deemed to have performed better than the logistic regression model: it improved the balanced accuracy and the recall for non-consents while only slightly reducing the recall for consents. Despite this improvement, the model was still not deemed accurate enough to provide clinically useful results.  

For predicting family consent in the DCD cohort, no model performed better than the original logistic regression model.

Overall, machine learning models could not improve on the previous logistic regression models enough to provide reliable predictions. This highlights the importance and benefits of improving data collection. 

The project did, however, demonstrate the usefulness of machine learning models in this context: the models have different characteristics, and they can be tuned and optimised for different metrics. 

Further work to build on this project could examine the importance of different variables in the machine learning models. It could also compare the models when trained on more recent PDA datasets, which include additional variables that could improve performance.  
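
As a starting point for the variable importance work, a fitted `GradientBoostingClassifier` exposes impurity-based importances through its `feature_importances_` attribute. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` and hypothetical column names rather than the PDA variables, so the printed ranking says nothing about the consent models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data; the column names are hypothetical, not PDA variables
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
feature_names = ["feature_{}".format(i) for i in range(X.shape[1])]

model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalised to sum to 1
importances = model.feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print("{}: {:.3f}".format(name, score))
```

For the real models the one-hot encoded columns would each receive their own importance, so the dummy columns belonging to one original variable would need to be summed to rank the variables themselves.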

