# Tree-based models 

## Overview 

This notebook is an initial exploration of tree-based regression models for predicting monthly ED demand.

The variables population, people, places and lives vary only annually, so they are excluded from the models: including them would leak information between the training and test sets.

For all models, variables used include:

- Service capacity (111, GP, Ambulance)
- Service utility (111, Ambulance)
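
One way to read the leakage concern above: an annual variable is constant across the monthly rows of a given area and year, so after a random row split almost every test row shares its exact value with some training row. The sketch below illustrates this on a made-up panel. The `year` and `month` columns, the panel size and the 80/20 split are assumptions for illustration, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: 5 areas x 2 years x 12 months (not the real dataset)
rows = [(ccg, year, month)
        for ccg in range(5) for year in (2018, 2019) for month in range(1, 13)]
toy = pd.DataFrame(rows, columns=['ccg', 'year', 'month'])

# Random row-level split, roughly 80% train / 20% test
rng = np.random.RandomState(0)
is_test = rng.rand(len(toy)) < 0.2
train, test = toy[~is_test], toy[is_test]

# An annual variable takes one value per (ccg, year) pair, so a test row is
# exposed to leakage if its pair also appears in the training rows
train_pairs = set(zip(train['ccg'], train['year']))
leaked = [pair in train_pairs for pair in zip(test['ccg'], test['year'])]
leak_fraction = float(np.mean(leaked))

print(f"Share of test rows whose (ccg, year) pair is also in train: {leak_fraction:.2f}")
```

With twelve monthly rows per pair, it is very unlikely that all of them fall on the test side, which is why row-level splitting gives little protection here.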

In [1]:
# Turn warnings off to keep the notebook tidy
import warnings
warnings.filterwarnings('ignore')

## Import libraries 

In [2]:
import os
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedKFold

## Import data 

In [3]:
dta = pd.read_csv('https://raw.githubusercontent.com/CharlotteJames/ed-forecast/main/data/master_scaled.csv',
                  index_col=0)

In [4]:
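# Flatten column names containing '/': keep the first and last parts,
# joined by an underscore (e.g. 'a/b/c' becomes 'a_c')
dta.columns = ['_'.join([c.split('/')[0],c.split('/')[-1]]) 
               if '/' in c else c for c in dta.columns]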

In [5]:
dta.ccg.unique().shape

(73,)

## Add random feature

In [6]:
# Add a column of uniform random values in [0, 1), generated with a fixed seed

rng = np.random.RandomState(0)
rand_var = rng.rand(dta.shape[0])
dta['rand1'] = rand_var

In [7]:
dta.shape

(1465, 14)
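
The feature lists used later in this notebook do not include `rand1`. A common use for such a column is as a noise baseline for feature importances: any real feature that scores no higher than pure noise is unlikely to be informative. The following is a minimal sketch of that idea on synthetic data. The `signal` column, the target and the forest settings are assumptions for illustration, not the notebook's own analysis.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
n = 300

# Synthetic stand-in data (not the ED dataset): one informative feature
# plus a pure-noise column playing the role of `rand1`
toy = pd.DataFrame({'signal': rng.rand(n), 'rand1': rng.rand(n)})
y = 3.0 * toy['signal'] + 0.1 * rng.randn(n)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(toy, y)

importances = pd.Series(forest.feature_importances_, index=toy.columns)

# Features scoring no higher than the noise column are candidates to drop
baseline = importances['rand1']
above_baseline = importances[importances > baseline].index.tolist()

print(importances)
print(above_baseline)
```

To apply the same check to the models below, `'rand1'` would need to be appended to each `features` list before calling `fit_model`.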

## Fitting function 

In [8]:
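def fit_model(dta, model, features):
    """Cross-validate `model` on the selected features and return the results."""

    y = dta['ae_attendances_attendances']
    X = dta[features]

    # Repeated 5-fold cross-validation (25 fits in total) to estimate the
    # spread of the scores and of the feature importances
    cv_model = cross_validate(model, X, y,
                              cv=RepeatedKFold(n_splits=5, n_repeats=5,
                                               random_state=0),
                              return_estimator=True,
                              return_train_score=True, n_jobs=2)

    # Also fit the model that was passed in on the full dataset
    model.fit(X, y)

    return cv_model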
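
The dictionary returned by `cross_validate` holds one entry per fold under `test_score`, `train_score` and `estimator`. The sections below unpack it by hand for each model. The sketch here wraps that step in a small helper, `summarise_cv`, which is a hypothetical name and not part of the original notebook. It runs on synthetic data with fewer folds and trees so that it finishes quickly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RepeatedKFold, cross_validate


def summarise_cv(cv_results, features):
    """Hypothetical helper: split cross_validate output into two tables."""
    scores = pd.DataFrame({'test_score': cv_results['test_score'],
                           'train_score': cv_results['train_score']})
    # One row of feature importances per fitted estimator
    importances = pd.DataFrame(
        [est.feature_importances_ for est in cv_results['estimator']],
        columns=features)
    return scores, importances


# Synthetic stand-in data, not the ED dataset
rng = np.random.RandomState(0)
features = ['a', 'b', 'c']
X = pd.DataFrame(rng.rand(120, 3), columns=features)
y = 2.0 * X['a'] + X['b'] + 0.1 * rng.randn(120)

cv_results = cross_validate(
    GradientBoostingRegressor(n_estimators=20, random_state=0), X, y,
    cv=RepeatedKFold(n_splits=3, n_repeats=2, random_state=0),
    return_estimator=True, return_train_score=True)

scores, importances = summarise_cv(cv_results, features)
print(scores.describe())
print(importances.describe())
```

Using a separate loop variable (`est`) inside the helper also avoids reusing the name `model` for both the unfitted estimator and the fitted fold estimators.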

## Model Comparison

### Random Forest 

In [9]:
model = RandomForestRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [10]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| | test_score | train_score |
|---|---|---|
| count | 25.0 | 25.0 |
| mean | 0.324762 | 0.907468 |
| std | 0.064132 | 0.002994 |
| min | 0.208251 | 0.901499 |
| 25% | 0.291511 | 0.905286 |
| 50% | 0.303804 | 0.90771 |
| 75% | 0.361086 | 0.910103 |
| max | 0.432379 | 0.911869 |


#### Feature importances

In [11]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
|---|---|---|---|---|---|
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.365773 | 0.17094 | 0.200011 | 0.088689 | 0.174587 |
| std | 0.008972 | 0.023524 | 0.01888 | 0.004248 | 0.021818 |
| min | 0.347411 | 0.121029 | 0.16641 | 0.081457 | 0.142955 |
| 25% | 0.362229 | 0.156753 | 0.192181 | 0.085454 | 0.157728 |
| 50% | 0.36485 | 0.175761 | 0.203854 | 0.088883 | 0.171979 |
| 75% | 0.370063 | 0.186688 | 0.211012 | 0.09169 | 0.184947 |
| max | 0.383784 | 0.20997 | 0.236253 | 0.095648 | 0.226292 |


### Extra Trees

In [12]:
model = ExtraTreesRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [13]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| | test_score | train_score |
|---|---|---|
| count | 25.0 | 25.0 |
| mean | 0.18127 | 1.0 |
| std | 0.089126 | 0.0 |
| min | -0.016106 | 1.0 |
| 25% | 0.13457 | 1.0 |
| 50% | 0.182712 | 1.0 |
| 75% | 0.25623 | 1.0 |
| max | 0.323334 | 1.0 |


#### Feature importances

In [14]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
|---|---|---|---|---|---|
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.400862 | 0.110527 | 0.203609 | 0.085571 | 0.199431 |
| std | 0.010486 | 0.005662 | 0.008414 | 0.003544 | 0.00941 |
| min | 0.382725 | 0.100678 | 0.184311 | 0.079812 | 0.176018 |
| 25% | 0.395243 | 0.107284 | 0.200119 | 0.08304 | 0.192893 |
| 50% | 0.401709 | 0.109121 | 0.203551 | 0.085468 | 0.199367 |
| 75% | 0.409566 | 0.114 | 0.208703 | 0.086824 | 0.205095 |
| max | 0.417902 | 0.123473 | 0.219799 | 0.095966 | 0.219684 |


### Gradient Boosted Trees

In [15]:
model = GradientBoostingRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [16]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| | test_score | train_score |
|---|---|---|
| count | 25.0 | 25.0 |
| mean | 0.416127 | 0.575241 |
| std | 0.04062 | 0.009842 |
| min | 0.349148 | 0.552126 |
| 25% | 0.387715 | 0.569903 |
| 50% | 0.416008 | 0.577257 |
| 75% | 0.435916 | 0.581159 |
| max | 0.492091 | 0.594571 |


#### Feature importances

In [17]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
|---|---|---|---|---|---|
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.194433 | 0.245585 | 0.3361 | 0.050868 | 0.173014 |
| std | 0.016621 | 0.063267 | 0.037359 | 0.007163 | 0.057482 |
| min | 0.152058 | 0.120599 | 0.284277 | 0.037708 | 0.095619 |
| 25% | 0.18208 | 0.190266 | 0.304731 | 0.043793 | 0.137058 |
| 50% | 0.195305 | 0.259788 | 0.333779 | 0.051726 | 0.160047 |
| 75% | 0.210149 | 0.304632 | 0.355069 | 0.056263 | 0.188793 |
| max | 0.221817 | 0.31906 | 0.424729 | 0.064012 | 0.34455 |


### AdaBoost

In [18]:
model = AdaBoostRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [19]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| | test_score | train_score |
|---|---|---|
| count | 25.0 | 25.0 |
| mean | 0.355176 | 0.394621 |
| std | 0.046545 | 0.019368 |
| min | 0.251157 | 0.349177 |
| 25% | 0.326613 | 0.384017 |
| 50% | 0.348698 | 0.396547 |
| 75% | 0.390514 | 0.406161 |
| max | 0.439704 | 0.427985 |


#### Feature importances

In [20]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
|---|---|---|---|---|---|
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.136654 | 0.182902 | 0.441398 | 0.032996 | 0.20605 |
| std | 0.019698 | 0.043528 | 0.10385 | 0.018875 | 0.095447 |
| min | 0.101992 | 0.085822 | 0.2164 | 0.007941 | 0.064226 |
| 25% | 0.118742 | 0.15249 | 0.389911 | 0.018885 | 0.147064 |
| 50% | 0.136631 | 0.195218 | 0.4663 | 0.031307 | 0.199013 |
| 75% | 0.152005 | 0.20735 | 0.516622 | 0.046199 | 0.2503 |
| max | 0.170367 | 0.252095 | 0.624084 | 0.086178 | 0.493023 |


## Summary 

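- Extra Trees does not perform well: it fits the training data perfectly (mean train score 1.0) but has the lowest mean test score (0.18)
- Random Forest with default parameters overfits the training data (mean train score 0.91 against a mean test score of 0.32)
- AdaBoost has the smallest gap between mean train and test scores (0.39 against 0.36)
- Gradient Boosted Trees performs best, with the highest mean test score (0.42)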
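
The comparison above can be checked by tabulating the mean scores side by side. The sketch below copies the mean train and test scores from the `describe()` outputs in this notebook and adds a train-test gap column. The `gap` column and the table layout are additions for illustration.

```python
import pandas as pd

# Mean scores copied from the `describe()` outputs in this notebook
scores = pd.DataFrame(
    {
        'test_mean': [0.324762, 0.181270, 0.416127, 0.355176],
        'train_mean': [0.907468, 1.000000, 0.575241, 0.394621],
    },
    index=['Random Forest', 'Extra Trees', 'Gradient Boosted Trees', 'AdaBoost'],
)

# A large gap between train and test scores suggests overfitting
scores['gap'] = scores['train_mean'] - scores['test_mean']

print(scores.sort_values('test_mean', ascending=False))
```

Sorting by `test_mean` ranks the models by held-out performance, while `gap` gives a rough indication of how much each one overfits.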