# Decision tree model
Author: Roddy Jaques <br>
*NHS Blood and Transplant*
***
## Assessing the predictive ability of a decision tree model
A decision tree model was chosen as the next model to compare against the benchmark logistic regression model. Decision trees handle categorical data well and can be displayed graphically, which makes the model explainable and transparent. Explainability and transparency are important for any model used in a medical or clinical setting.
<br><br>
First the data is imported and split into training and testing sets...

In [13]:
# import relevant libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree
import sklearn.metrics as mets
%matplotlib inline

# Function to print confusion matrix, balanced accuracy and accuracy for a set of actual and predicted labels
def show_metrics(actual,predict):
    """ Prints the confusion matrix, balanced accuracy and accuracy given datasets of actual and predicted labels
    
    Arguments:
        actual - Dataset of actual labels
        predict - Dataset of predicted labels
     """
    cm = mets.confusion_matrix(actual, predict)
    
    print("********* MODEL METRIC REPORT *********\n\nConfusion matrix:\n")

    print("TP  FN\nFP  TN\n") #this is a reminder of what each part of the confusion matrix means e.g. TP = True Positive
    
    # print the confusion matrix
    print(str(int(cm[0,0])) + "    " + str(int(cm[0,1])))
    print(str(int(cm[1,0])) + "    " + str(int(cm[1,1])) + "\n") 

    # classification report (precision, recall and F1 score for each class)
    print("Classification report:\n")
    print(mets.classification_report(actual, predict))

    print("Balanced accuracy: " + str(round(mets.balanced_accuracy_score(actual, predict),2)))

    print("Accuracy: " + str(round(mets.accuracy_score(actual, predict),2)))
    
    # Predicted vs actual consent rates
    cons_rate = int(100 * len(actual[actual=="Consent"]) / len(actual) )
    print("\nActual consent rate: " + str(cons_rate))
    
    pred_rate = int(100 * len(predict[predict=="Consent"]) / len(predict) )
    print("Predicted consent rate: " + str(pred_rate))
    
    pass
 
# Function to format consent column from integer code to text
def format_consent(x):
    """ Maps the integer consent code to a text label (2 = Consent, 1 = Non-consent) """
    if x == 2:
        return "Consent"
    if x == 1:
        return "Non-consent"

In [14]:
# Read in datasets 
dbd_model_data = pd.read_csv("Data/dbd_model_data.csv")
dcd_model_data = pd.read_csv("Data/dcd_model_data.csv")

# Columns used to create DBD model
dbd_cols = ["wish", "FORMAL_APR_WHEN", "donation_mentioned", "app_nature", "eth_grp", "religion_grp", "GENDER", "FAMILY_WITNESS_BSDT", "DTC_PRESENT_BSD_CONV", 
            "acorn_new", "adult","FAMILY_CONSENT"]

dbd_model_data2 = pd.get_dummies(data=dbd_model_data,columns=dbd_cols[:-1],drop_first=True)

dbd_features = dbd_model_data2.drop("FAMILY_CONSENT",axis=1)
dbd_consents = dbd_model_data2["FAMILY_CONSENT"].apply(format_consent)

# Columns used to create DCD model in paper
dcd_cols = ["wish", "donation_mentioned", 
            "app_nature", "eth_grp", "religion_grp", "GENDER", "DTC_WD_TRTMENT_PRESENT", 
            "acorn_new", "adult","cod_neuro","FAMILY_CONSENT"]

dcd_model_data2 = pd.get_dummies(data=dcd_model_data,columns=dcd_cols[:-1],drop_first=True)

dcd_features = dcd_model_data2.drop("FAMILY_CONSENT",axis=1)
dcd_consents = dcd_model_data2["FAMILY_CONSENT"].apply(format_consent)

# creating a train and testing dataset for DBD and DCD approaches
DBD_X_train, DBD_X_test, DBD_y_train, DBD_y_test = train_test_split(dbd_features,dbd_consents, test_size=0.33, random_state=10)

DCD_X_train, DCD_X_test, DCD_y_train, DCD_y_test = train_test_split(dcd_features,dcd_consents, test_size=0.33, random_state=10)
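
To make the encoding and splitting steps concrete, the sketch below applies the same `pd.get_dummies(..., drop_first=True)` and `train_test_split` calls to a small made-up frame. The category values (`yes`, `no`, `unknown`, `M`, `F`) and the `toy_*` names are assumptions for illustration only and are not taken from the real datasets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: the category values are invented for illustration only
toy = pd.DataFrame({
    "wish": ["yes", "no", "unknown", "yes", "no", "yes"],
    "GENDER": ["M", "F", "F", "M", "M", "F"],
    "FAMILY_CONSENT": [2, 1, 2, 2, 1, 2],
})

# One-hot encode the predictors; drop_first=True drops the first (alphabetical)
# level of each column, so that level acts as the reference category
toy_encoded = pd.get_dummies(data=toy, columns=["wish", "GENDER"], drop_first=True)
encoded_columns = list(toy_encoded.columns)
print(encoded_columns)

# Separate the features from the label column, as in the cell above
toy_features = toy_encoded.drop("FAMILY_CONSENT", axis=1)
toy_labels = toy_encoded["FAMILY_CONSENT"]

# Same split settings as the notebook: roughly a third of rows held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_labels, test_size=0.33, random_state=10
)
print(len(X_train), len(X_test))
```

Each encoded column is an indicator for one non-reference level, so a row with all indicators for a variable set to zero belongs to that variable's dropped level.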

A decision tree model is fitted to the DBD and DCD data with default hyperparameters.

In [15]:
# create a tree model with default hyperparameters
tree_model = DecisionTreeClassifier()

In [16]:
# fit tree using dbd training data
DBD_tree = tree_model.fit(DBD_X_train,DBD_y_train)

DBD_preds = DBD_tree.predict(DBD_X_test)

show_metrics(DBD_y_test,DBD_preds)

********* MODEL METRIC REPORT *********

Confusion matrix:

TP  FN
FP  TN

1091    291
343    275

Classification report:

              precision    recall  f1-score   support

     Consent       0.76      0.79      0.77      1382
 Non-consent       0.49      0.44      0.46       618

    accuracy                           0.68      2000
   macro avg       0.62      0.62      0.62      2000
weighted avg       0.68      0.68      0.68      2000

Balanced accuracy: 0.62
Accuracy: 0.68

Actual consent rate: 69
Predicted consent rate: 71


In [17]:
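# repeat above for DCD data, using a fresh tree so the fitted DBD tree is not overwritten
# (fit() returns the same estimator object, so refitting tree_model would also change DBD_tree)
DCD_tree = DecisionTreeClassifier().fit(DCD_X_train,DCD_y_train)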

DCD_preds = DCD_tree.predict(DCD_X_test)

show_metrics(DCD_y_test,DCD_preds)

********* MODEL METRIC REPORT *********

Confusion matrix:

TP  FN
FP  TN

1432    433
525    714

Classification report:

              precision    recall  f1-score   support

     Consent       0.73      0.77      0.75      1865
 Non-consent       0.62      0.58      0.60      1239

    accuracy                           0.69      3104
   macro avg       0.68      0.67      0.67      3104
weighted avg       0.69      0.69      0.69      3104

Balanced accuracy: 0.67
Accuracy: 0.69

Actual consent rate: 60
Predicted consent rate: 63


***
#### DBD model
The DBD model has a balanced accuracy 5% lower than that of the logistic regression model. The recall for the non-consent class is 1% higher than in the logistic regression model, so the loss in accuracy comes from the recall for consents being 11% lower.
<br>
The predicted and actual consent rates are closer in this model than in the logistic regression model, but this is because more consents are incorrectly classified as non-consents rather than because of any increase in accuracy.

#### DCD model
The DCD model has a balanced accuracy 4% lower than the logistic regression model. The recall and precision for both classes are slightly lower in this model too. 
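
As a check on how the reported metrics relate to the confusion matrix, the sketch below recomputes the balanced accuracy and accuracy of the untuned DBD tree by hand. The four counts are copied from the DBD metric report above; the variable names are illustrative.

```python
# Confusion matrix for the untuned DBD tree, copied from the metric report
# (rows = actual, columns = predicted, "Consent" treated as the positive class)
tp, fn = 1091, 291
fp, tn = 343, 275

recall_consent = tp / (tp + fn)        # share of actual consents predicted correctly
recall_non_consent = tn / (tn + fp)    # share of actual non-consents predicted correctly

# Balanced accuracy is the unweighted mean of the per-class recalls
balanced_accuracy = (recall_consent + recall_non_consent) / 2

# Plain accuracy is the share of all predictions that are correct
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(round(recall_consent, 2), round(recall_non_consent, 2))
print(round(balanced_accuracy, 2), round(accuracy, 2))
```

Because balanced accuracy weights both classes equally, the weak recall on the smaller non-consent class pulls it below the plain accuracy.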

***
<br>

As a decision tree with default hyperparameters had a lower balanced accuracy than the logistic regression model for both the DBD and DCD data, the model hyperparameters will be tuned to see if this can improve model performance.

## Tuning decision tree

A 5-fold cross-validated grid search will be used to tune the hyperparameters. Cross-validation scores each candidate on several held-out folds, which reduces the risk of tuning the model to fit a single split of the data.<br>
The CV grid search will optimise for balanced accuracy, as this avoids the model being optimised to overestimate the number of consents.<br>
The hyperparameters and ranges of values to be explored are: <br>
* max_depth - the maximum tree depth. From 10 to 34 in increments of 1 for the DBD model, and from 10 to 90 in increments of 10 for the DCD model.
* min_samples_split - the minimum number of samples a node must contain before it can be split. From 75 to 225 in increments of 25.
* class_weight - the weight given to members of the non-consent class, relative to a weight of 1 for consents, so that the model does not overpredict consents. From 1 to 3.75 in increments of 0.25.
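
The size of the search can be worked out directly from the `np.arange` calls used in the grid search cells. The sketch below is only a counting exercise; the variable names are illustrative, and the point to note is that `np.arange` excludes its stop value.

```python
import numpy as np

# np.arange excludes the stop value, so each range ends one step before it
dbd_depths = np.arange(10, 35, step=1)       # 10, 11, ..., 34
dcd_depths = np.arange(10, 100, step=10)     # 10, 20, ..., 90
min_splits = np.arange(75, 250, step=25)     # 75, 100, ..., 225
nc_weights = np.arange(1, 4, step=0.25)      # 1.0, 1.25, ..., 3.75

n_folds = 5

# Every combination of the three hyperparameters is one candidate model
dbd_candidates = len(dbd_depths) * len(min_splits) * len(nc_weights)
dcd_candidates = len(dcd_depths) * len(min_splits) * len(nc_weights)

# Each candidate is fitted once per cross-validation fold
print("DBD grid:", dbd_candidates, "candidates,", dbd_candidates * n_folds, "CV fits")
print("DCD grid:", dcd_candidates, "candidates,", dcd_candidates * n_folds, "CV fits")
```

The DBD grid is almost three times the size of the DCD grid because it steps through tree depth one level at a time.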

In [21]:
# create tree model 
tree_model = DecisionTreeClassifier(random_state=66)

# create list of dictionaries with non-consent group weights
weights = []
for w in np.arange(1,4,step=0.25):
    w_dic = {"Non-consent":w,"Consent":1}
    weights.append(w_dic)

# use dictionary of parameters in a CV grid search to find tree which optimises balanced accuracy 
params = {'max_depth':np.arange(10,35,step=1),'min_samples_split':np.arange(75,250,step=25),'class_weight':weights}

gs_tree_model = GridSearchCV(tree_model, param_grid=params, scoring="balanced_accuracy",cv=5)

gs_tree_model.fit(DBD_X_train,DBD_y_train)

gs_tree_model.score(DBD_X_train,DBD_y_train)


# print the hyperparameters which optimise balanced accuracy and the best mean cross-validated balanced accuracy
print(gs_tree_model.best_params_)

print(gs_tree_model.best_score_)

{'class_weight': {'Non-consent': 2.25, 'Consent': 1}, 'max_depth': 17, 'min_samples_split': 225}
0.7454541438516957


In [22]:
DBD_preds = gs_tree_model.predict(DBD_X_test)

show_metrics(DBD_y_test,DBD_preds)

********* MODEL METRIC REPORT *********

Confusion matrix:

TP  FN
FP  TN

902    480
95    523

Classification report:

              precision    recall  f1-score   support

     Consent       0.90      0.65      0.76      1382
 Non-consent       0.52      0.85      0.65       618

    accuracy                           0.71      2000
   macro avg       0.71      0.75      0.70      2000
weighted avg       0.79      0.71      0.72      2000

Balanced accuracy: 0.75
Accuracy: 0.71

Actual consent rate: 69
Predicted consent rate: 49


In [26]:
# create tree model 
tree_model = DecisionTreeClassifier(random_state=66)

# create list of dictionaries with non-consent group weights
weights = []
for w in np.arange(1,4,step=0.25):
    w_dic = {"Non-consent":w,"Consent":1}
    weights.append(w_dic)

# use dictionary of parameters in a CV grid search to find tree which optimises balanced accuracy 
params = {'max_depth':np.arange(10,100,step=10),'min_samples_split':np.arange(75,250,step=25),'class_weight':weights}

gs_tree_model = GridSearchCV(tree_model, param_grid=params, scoring="balanced_accuracy",cv=5)

gs_tree_model.fit(DCD_X_train,DCD_y_train)

gs_tree_model.score(DCD_X_train,DCD_y_train)


# print the hyperparameters which optimise balanced accuracy and the best mean cross-validated balanced accuracy
print(gs_tree_model.best_params_)

print(gs_tree_model.best_score_)

{'class_weight': {'Non-consent': 1.75, 'Consent': 1}, 'max_depth': 10, 'min_samples_split': 125}
0.7293181937241343


In [27]:
DCD_preds = gs_tree_model.predict(DCD_X_test)

show_metrics(DCD_y_test,DCD_preds)

********* MODEL METRIC REPORT *********

Confusion matrix:

TP  FN
FP  TN

1019    846
133    1106

Classification report:

              precision    recall  f1-score   support

     Consent       0.88      0.55      0.68      1865
 Non-consent       0.57      0.89      0.69      1239

    accuracy                           0.68      3104
   macro avg       0.73      0.72      0.68      3104
weighted avg       0.76      0.68      0.68      3104

Balanced accuracy: 0.72
Accuracy: 0.68

Actual consent rate: 60
Predicted consent rate: 37


***
#### Tuned DBD model
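Hyperparameter tuning the decision tree model has increased the balanced accuracy on the test data from 62% to 75%. It has increased the recall for non-consents to 85% from 44% in the untuned model, at the cost of the recall for consents falling from 79% to 65%. As a result the tuned model underestimates the consent rate, predicting 49% against an actual rate of 69%.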

Despite the improvements over the untuned boosted forest model, the random forest model still performs better than the boosted forest model. The random forest is also much faster to run, so it could be scaled to bigger datasets more easily.

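#### Tuned DCD model

Hyperparameter tuning has increased the balanced accuracy of the DCD model on the test data from 67% to 72%. The recall for non-consents has increased to 89% from 58% in the untuned model, while the recall for consents has fallen from 77% to 55%. The tuned model underestimates the consent rate, predicting 37% against an actual rate of 60%.
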
***