# Recreating a logistic regression model
Author: Roddy Jaques <br>
*NHS Blood and Transplant*
***

## Recreating the logistic regression using Scikit-learn

### Aims and summary of the project
The aim of this project is to explore how Scikit-learn and machine learning models perform compared with statistical models created in SAS for predicting family consent for organ donation using Potential Donor Audit data.<br><br>
The approach and problem will be familiar to any statistician or data scientist: using classification models to predict a binary target variable. I'll use a logistic regression model from a previous analysis as a baseline to compare other models against. The previous analysis, conducted by __[Curtis et al.](https://doi.org/10.1111/anae.15485)__, fitted logistic regression models using a dataset of all family approaches for organ donation between April 2014 and March 2019 from the national Potential Donor Audit (PDA) data held by NHS Blood and Transplant. I will use the same dataset to first recreate the logistic regression with Scikit-learn, then fit other Scikit-learn models and compare their ability to predict whether or not a family consented to organ donation.<br>

In the previous analysis one model was fitted using data from potential Donation after Brainstem Death (DBD) donors, and another using data from potential Donation after Circulatory Death (DCD) donors. I will start by fitting logistic regression models to the DBD and DCD cohorts using the same variables as the previous analysis. When I fit subsequent models I will use the same cohorts and variables: both were chosen with a lot of clinician input, so improvements are unlikely to come from using new variables or a different cohort.

### Fitting the logistic regression

First, import the data and remove missing and unknown data to recreate the cohort in the original analysis...

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
%matplotlib inline

In [5]:
# import dataset 
df = pd.read_sas("Data/alldata3.sas7bdat")

# 6931 DBD family approaches
dbd_apps = df[(df["eli_DBD"]==1)&(df["FAMILY_APPROACHED"]==2)]

# remove missing and unknown data to get 6060 DBD approaches matching the cohort in the paper
dbd_apps = dbd_apps[(dbd_apps["eth_grp"]!=5)&(dbd_apps["FORMAL_APR_WHEN"]!=4)&(dbd_apps["donation_mentioned"]!=-1)
                    &(dbd_apps["FAMILY_WITNESS_BSDT"]!=9)&(dbd_apps["GENDER"]!=9)]

# Columns used to create DBD model in paper
dbd_cols = ["wish", "FORMAL_APR_WHEN", "donation_mentioned", "app_nature", "eth_grp", "religion_grp", "GENDER", "FAMILY_WITNESS_BSDT", "DTC_PRESENT_BSD_CONV", 
            "acorn_new", "adult","FAMILY_CONSENT"]

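# NB: the result is not assigned, so this line only checks that every value can be cast to int;
# the columns keep their float dtype (hence the ".0" suffix in the dummy column names below)
dbd_apps[dbd_cols].astype(int)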

dbd_model_data = dbd_apps[dbd_cols]

# export to csv to use in other models
dbd_model_data.to_csv("Data/dbd_model_data.csv",index=False)
     
# 9965 DCD approaches
dcd_apps = df[(df["eli_DCD"]==1)&(df["FAMILY_APPROACHED"]==2)]

# remove missing and unknown data to get 9405 DCD apps matching the cohort in the paper
dcd_apps = dcd_apps[(dcd_apps["GENDER"]!=9)&(dcd_apps["cod_neuro"].notna())&(dcd_apps["eth_grp"]!=5)&(dcd_apps["donation_mentioned"]!=-1)&
                    (~dcd_apps["DTC_WD_TRTMENT_PRESENT"].isin([8,9]))]

# Columns used to create DCD model in paper
dcd_cols = ["wish", "donation_mentioned", 
            "app_nature", "eth_grp", "religion_grp", "GENDER", "DTC_WD_TRTMENT_PRESENT", 
            "acorn_new", "adult","cod_neuro","FAMILY_CONSENT"]

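# NB: the result is not assigned, so this line only checks that every value can be cast to int
# (uses dcd_cols, the DCD column list, rather than dbd_cols)
dcd_apps[dcd_cols].astype(int)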

dcd_model_data = dcd_apps[dcd_cols]

# export to csv to use in other models
dcd_model_data.to_csv("Data/dcd_model_data.csv",index=False)
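
The cohort filters above chain several "not equal to the unknown code" conditions. As a rough sketch of the same idea, the helper below (a hypothetical `drop_unknown`, not part of the original analysis) removes rows holding an unknown code and reports how many rows each column would exclude, which can help when trying to reconcile cohort sizes with a paper. The column names are taken from the notebook; the toy rows are invented purely for illustration.

```python
import pandas as pd

# Toy stand-in for the PDA extract: column names come from the notebook,
# the values are made up for illustration only.
toy = pd.DataFrame({
    "GENDER": [1, 2, 9, 1, 2],
    "eth_grp": [1, 5, 1, 2, 1],
    "donation_mentioned": [1, 2, 3, -1, 4],
})

# Codes treated as "unknown" in each column, mirroring the filters used above.
unknown_codes = {"GENDER": [9], "eth_grp": [5], "donation_mentioned": [-1]}


def drop_unknown(df, codes):
    """Return df without rows holding an unknown code, plus per-column drop counts."""
    keep = pd.Series(True, index=df.index)
    dropped = {}
    for col, bad_values in codes.items():
        is_bad = df[col].isin(bad_values)
        dropped[col] = int(is_bad.sum())  # rows this column alone would exclude
        keep &= ~is_bad
    return df[keep], dropped


cohort, dropped = drop_unknown(toy, unknown_codes)
print(len(toy), "->", len(cohort), "rows;", dropped)
```

Note that the per-column counts can overlap when a row is unknown in more than one column, so they need not sum to the total number of rows removed.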

All variables are categorical, so I can use one-hot encoding. This lets me calculate an odds ratio for each level of each variable, relative to the reference level dropped by `drop_first=True`, and compare them with the previously fitted model to verify that I've recreated it.<br>
I'm using a logistic regression model with the 'newton-cg' solver and no penalisation; these hyperparameters are similar to the methods used in SAS to fit the original model.
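
To illustrate why exponentiated coefficients can be read as odds ratios, here is a minimal sketch on invented counts for a single three-level variable. With one categorical predictor and effectively no penalty, the fitted odds ratios should agree closely with those computed directly from the contingency table. The counts are made up, and `C=1e10` is an assumption used here to approximate an unpenalised fit in a way that does not depend on the scikit-learn version; the notebook itself uses `penalty='none'`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Invented counts per level of a toy "wish" variable: (consents, refusals).
counts = {1: (20, 20), 2: (30, 10), 3: (10, 30)}

rows = []
for level, (yes, no) in counts.items():
    rows += [(level, 1)] * yes + [(level, 0)] * no
toy = pd.DataFrame(rows, columns=["wish", "FAMILY_CONSENT"])

# One-hot encode and drop the first level, which becomes the reference level.
X = pd.get_dummies(toy, columns=["wish"], drop_first=True)
X = X.drop("FAMILY_CONSENT", axis=1).astype(float)
y = toy["FAMILY_CONSENT"]

# Assumption: a very large C makes the default L2 penalty negligible.
model = LogisticRegression(C=1e10, solver="newton-cg", tol=1e-8, max_iter=1000)
model.fit(X, y)
fitted_or = dict(zip(X.columns, np.exp(model.coef_[0])))

# Odds ratios computed by hand: odds of each level divided by odds of level 1.
odds = {level: yes / no for level, (yes, no) in counts.items()}
table_or = {f"wish_{level}": odds[level] / odds[1] for level in (2, 3)}

for name in X.columns:
    print(name, "| fitted:", round(fitted_or[name], 3), "| from table:", round(table_or[name], 3))
```

In the real models several variables are fitted together, so each odds ratio is adjusted for the others and will not generally equal the raw table value.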

In [17]:
# use one-hot encoding so it's possible to calculate odds ratios for each value of each variable
dbd_model_data2 = pd.get_dummies(data=dbd_model_data,columns=dbd_cols[:-1],drop_first=True)

dbd_features = dbd_model_data2.drop("FAMILY_CONSENT",axis=1)
dbd_consents = dbd_model_data2["FAMILY_CONSENT"]

dbd_feature_names = dbd_features.columns.tolist()

In [18]:
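# no penalisation; note that scikit-learn 1.2 deprecated the string 'none' in favour of penalty=None
LR_model = LogisticRegression(penalty='none',solver='newton-cg')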

DBD_LR = LR_model.fit(dbd_features,dbd_consents)

odds_ratios_dbd = np.exp(DBD_LR.coef_) 

for i in range(odds_ratios_dbd.shape[1]):
    print(dbd_feature_names[i],"|",round(odds_ratios_dbd[0][i],2))

wish_2.0 | 23.81
wish_3.0 | 7.58
wish_4.0 | 18.48
wish_5.0 | 1.55
FORMAL_APR_WHEN_2.0 | 0.43
FORMAL_APR_WHEN_3.0 | 0.4
donation_mentioned_2.0 | 1.26
donation_mentioned_3.0 | 1.78
donation_mentioned_4.0 | 2.01
app_nature_2.0 | 0.88
app_nature_3.0 | 0.27
eth_grp_2.0 | 0.5
eth_grp_3.0 | 0.29
eth_grp_4.0 | 0.84
religion_grp_2.0 | 0.17
religion_grp_3.0 | 1.31
religion_grp_4.0 | 1.02
religion_grp_5.0 | 0.69
religion_grp_9.0 | 0.78
GENDER_2.0 | 0.79
FAMILY_WITNESS_BSDT_2.0 | 0.78
DTC_PRESENT_BSD_CONV_2.0 | 1.39
acorn_new_2.0 | 0.93
acorn_new_3.0 | 0.93
acorn_new_4.0 | 0.71
acorn_new_5.0 | 0.75
acorn_new_6.0 | 0.74
adult_1.0 | 0.66


In [19]:
# use one-hot encoding so it's possible to calculate odds ratios for each value of each variable
dcd_model_data2 = pd.get_dummies(data=dcd_model_data,columns=dcd_cols[:-1],drop_first=True)

dcd_features = dcd_model_data2.drop("FAMILY_CONSENT",axis=1)
dcd_consents = dcd_model_data2["FAMILY_CONSENT"]

dcd_feature_names = dcd_features.columns.tolist()

In [20]:
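# fit a fresh estimator: calling LR_model.fit again would refit the same object in place,
# so DBD_LR and DCD_LR would both end up referring to the DCD model
DCD_LR = LogisticRegression(penalty='none',solver='newton-cg').fit(dcd_features,dcd_consents)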

odds_ratios_dcd = np.exp(DCD_LR.coef_) 

for i in range(odds_ratios_dcd.shape[1]):
    print(dcd_feature_names[i],"|",round(odds_ratios_dcd[0][i],2))

wish_2.0 | 10.15
wish_3.0 | 5.62
wish_4.0 | 17.98
wish_5.0 | 1.47
donation_mentioned_2.0 | 1.56
donation_mentioned_3.0 | 2.46
donation_mentioned_4.0 | 2.53
app_nature_2.0 | 1.0
app_nature_3.0 | 0.26
eth_grp_2.0 | 0.79
eth_grp_3.0 | 0.47
eth_grp_4.0 | 1.08
religion_grp_2.0 | 0.13
religion_grp_3.0 | 0.65
religion_grp_4.0 | 1.3
religion_grp_5.0 | 0.68
religion_grp_9.0 | 0.67
GENDER_2.0 | 0.86
DTC_WD_TRTMENT_PRESENT_2.0 | 1.43
acorn_new_2.0 | 0.98
acorn_new_3.0 | 0.92
acorn_new_4.0 | 0.95
acorn_new_5.0 | 0.82
acorn_new_6.0 | 0.85
adult_1.0 | 1.26
cod_neuro_1.0 | 1.08
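
The DBD and DCD cells repeat the same loop to print odds ratios. One possible tidy-up, sketched below, is a small helper that returns a labelled pandas Series. The name `odds_ratio_table` is hypothetical, and the `SimpleNamespace` object is only a stand-in for a fitted binary `LogisticRegression` (the helper needs nothing but its `coef_` attribute); the coefficient values are invented.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def odds_ratio_table(model, feature_names, decimals=2):
    """Return exp(coefficients) of a binary logistic model as a labelled Series."""
    ratios = np.exp(np.asarray(model.coef_)[0])  # coef_ has shape (1, n_features)
    return pd.Series(ratios, index=feature_names, name="odds_ratio").round(decimals)


# Stand-in for a fitted model, with invented coefficients log(2) and log(0.5).
fake_model = SimpleNamespace(coef_=np.log(np.array([[2.0, 0.5]])))
table = odds_ratio_table(fake_model, ["wish_2.0", "GENDER_2.0"])
print(table)
```

With the real models this would be called as `odds_ratio_table(DBD_LR, dbd_feature_names)`, and the resulting Series can be sorted, filtered, or written to CSV.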


***

For both models the odds ratios match those calculated from the SAS model, so the performance metrics from a model fitted using this method will be a fair baseline to compare other models against.<br>
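
A comparison like this can also be made programmatically rather than by eye. The sketch below assumes the reference odds ratios are available as a dictionary keyed by the dummy column names. The helper `compare_odds_ratios` and the tolerance are hypothetical, and the reference values shown are placeholders rather than the published SAS estimates: one is deliberately perturbed to show how a mismatch would be flagged.

```python
import pandas as pd


def compare_odds_ratios(fitted, reference, tol=0.005):
    """Tabulate fitted vs reference odds ratios and flag those within tol."""
    table = pd.DataFrame({
        "fitted": pd.Series(fitted, dtype=float),
        "reference": pd.Series(reference, dtype=float),
    })
    table["abs_diff"] = (table["fitted"] - table["reference"]).abs()
    # A feature missing from either side gives NaN, which is flagged as no match.
    table["match"] = table["abs_diff"] <= tol
    return table


# Two fitted values copied from the DBD output above.
fitted = {"wish_2.0": 23.81, "GENDER_2.0": 0.79}
# Placeholder reference values, NOT the published ones (second one perturbed).
reference = {"wish_2.0": 23.81, "GENDER_2.0": 0.80}

comparison = compare_odds_ratios(fitted, reference)
print(comparison)
```

Because both sets of odds ratios are rounded to two decimal places, a tolerance of about half a rounding unit seems a reasonable starting point, but that choice is an assumption.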

In the next Notebook I will use a training and test set to fit and assess a logistic regression model.

***