# Tree-based models 

## Overview 

This notebook contains an initial exploration of tree-based regression models for predicting monthly emergency department (ED) demand.

The variables population, people, places and lives vary only annually, so they cannot be included in the model: every month of a given year shares the same value, which would leak information between the training and test sets.
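
The leakage can be illustrated with a minimal sketch on synthetic data. The frame `toy`, its CCG labels and the `population` values below are hypothetical stand-ins, not the notebook's data. The sketch assumes a random K-fold split and counts how many test rows carry an annual value that the model has already seen in training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic stand-in for the real data: 3 CCGs x 2 years x 12 months
rows = [(ccg, year, month)
        for ccg in ['A', 'B', 'C']
        for year in [2018, 2019]
        for month in range(1, 13)]
toy = pd.DataFrame(rows, columns=['ccg', 'year', 'month'])

# Hypothetical annual variable: one random value per (ccg, year),
# repeated across all 12 months of that year
rng = np.random.RandomState(0)
group_id = toy.groupby(['ccg', 'year']).ngroup().to_numpy()
toy['population'] = rng.rand(group_id.max() + 1)[group_id]

# Random split: what fraction of test rows carry an annual value
# that also appears in the training rows?
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
train_idx, test_idx = next(splitter.split(toy))
seen_in_training = set(toy.loc[train_idx, 'population'])
leaked = toy.loc[test_idx, 'population'].isin(seen_in_training).mean()
print(f'{leaked:.0%} of test rows share an annual value with a training row')
```

Because each annual value effectively labels a CCG-year pair, a tree can split on it to memorise that pair's demand level rather than learn a relationship that generalises.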

All models use the following variables:

- Service capacity (111, GP, Ambulance)
- Service utility (111, Ambulance)

In [1]:
#turn warnings off to keep notebook tidy
import warnings
warnings.filterwarnings('ignore')

## Import libraries 

In [2]:
import os
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedKFold

## Import data 

In [3]:
dta = pd.read_csv('https://raw.githubusercontent.com/CharlotteJames/ed-forecast/main/data/master_scaled.csv',
                  index_col=0)

In [4]:
dta.columns = ['_'.join([c.split('/')[0],c.split('/')[-1]]) 
               if '/' in c else c for c in dta.columns]
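
As a quick illustration of what the renaming does, the sketch below applies the same expression to a few made-up column names. These names are assumptions for demonstration only; the real file's headers may differ.

```python
# Hypothetical raw column names; the real file's names may differ
raw_columns = ['ccg', 'month', '111/111_offered', 'amb_sys/made', 'a/b/c']

# Keep the first and last '/'-separated parts and join them with '_';
# names without a '/' are left unchanged
renamed = ['_'.join([c.split('/')[0], c.split('/')[-1]])
           if '/' in c else c for c in raw_columns]
print(renamed)
# ['ccg', 'month', '111_111_offered', 'amb_sys_made', 'a_c']
```

Note that any middle parts of a name with more than one `/` are dropped.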

In [5]:
dta.ccg.unique().shape

(73,)

## Add random feature

In [6]:
# Adding random features

rng = np.random.RandomState(0)
rand_var = rng.rand(dta.shape[0])
dta['rand1'] = rand_var

In [7]:
dta.shape

(1465, 14)
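
One common use of a random column such as `rand1` is as a noise baseline: a feature whose importance is no higher than that of pure noise is unlikely to be informative. The notebook does not state this purpose, so the sketch below is an assumption about how the column could be used. It runs on synthetic data with a hypothetical `signal` feature.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)

# Synthetic data: the target depends on 'signal' only
toy = pd.DataFrame({'signal': rng.rand(300)})
toy['target'] = 3 * toy['signal'] + rng.normal(scale=0.1, size=300)

# Random column, generated the same way as rand1 in the notebook
toy['rand1'] = rng.rand(300)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(toy[['signal', 'rand1']], toy['target'])

# Impurity-based importances sum to 1 across features
importances = pd.Series(forest.feature_importances_,
                        index=['signal', 'rand1'])
print(importances.sort_values(ascending=False))
```

To apply the same idea to the models below, `'rand1'` would need to be appended to the `features` list, which the cells in this notebook do not do.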

## Fitting function 

In [8]:
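def fit_model(dta, model, features):
    """Cross-validate `model` on `features` and return the cross_validate results."""

    y = dta['ae_attendances_attendances']
    X = dta[features]

    # Repeated 5-fold cross-validation (25 fits) to estimate the spread
    # of the scores and of the feature importances
    cv_model = cross_validate(model, X, y,
                              cv=RepeatedKFold(n_splits=5, n_repeats=5,
                                               random_state=0),
                              return_estimator=True,
                              return_train_score=True, n_jobs=2)

    # Also fit the model passed in on the full data set
    model.fit(X, y)

    return cv_model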
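
To make the return value concrete, the sketch below runs the same `cross_validate` call on synthetic data from `make_regression`. The data and the small `n_estimators` value are assumptions, chosen only to keep the example quick. With no `scoring` argument, the scores come from the estimator's own `score` method, which is R² for scikit-learn regressors.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RepeatedKFold, cross_validate

# Synthetic regression problem standing in for the notebook's data
X_toy, y_toy = make_regression(n_samples=200, n_features=5,
                               noise=10.0, random_state=0)

cv = cross_validate(GradientBoostingRegressor(n_estimators=20, random_state=0),
                    X_toy, y_toy,
                    cv=RepeatedKFold(n_splits=5, n_repeats=5, random_state=0),
                    return_estimator=True, return_train_score=True)

# The result is a dict; each entry holds one value per fit
print(sorted(cv))
print(len(cv['test_score']))   # 5 splits x 5 repeats
print(len(cv['estimator']))    # one fitted model per split
```

The 25 fitted estimators are what the later cells iterate over to collect `feature_importances_`.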

## Model Comparison

### Random Forest 

In [9]:
model = RandomForestRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [10]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| | test_score | train_score |
| --- | --- | --- |
| count | 25.0 | 25.0 |
| mean | 0.326184 | 0.90764 |
| std | 0.068479 | 0.002856 |
| min | 0.194189 | 0.900626 |
| 25% | 0.295525 | 0.906455 |
| 50% | 0.31523 | 0.907682 |
| 75% | 0.376244 | 0.909432 |
| max | 0.43352 | 0.912927 |
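
A train score far above the test score, as in the Random Forest results above, is the usual sign of overfitting. One common remedy is to limit tree complexity. The sketch below is an illustration on synthetic data, not a tuned configuration for this data set; `min_samples_leaf=20` and the helper `train_test_gap` are hypothetical choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic noisy data: only the first of five features carries signal
rng = np.random.RandomState(0)
X_toy = rng.rand(400, 5)
y_toy = 2 * X_toy[:, 0] + rng.normal(scale=0.5, size=400)


def train_test_gap(model):
    """Mean train R^2 minus mean test R^2 over 5-fold cross-validation."""
    cv = cross_validate(model, X_toy, y_toy,
                        cv=KFold(n_splits=5, shuffle=True, random_state=0),
                        return_train_score=True)
    return cv['train_score'].mean() - cv['test_score'].mean()


# Default forest: trees are grown until the leaves are (nearly) pure
gap_default = train_test_gap(
    RandomForestRegressor(n_estimators=50, random_state=0))

# Constrained forest: every leaf must contain at least 20 samples
gap_limited = train_test_gap(
    RandomForestRegressor(n_estimators=50, min_samples_leaf=20,
                          random_state=0))

print(f'default gap: {gap_default:.2f}, limited gap: {gap_limited:.2f}')
```

Whether a constraint like this also raises the test score on the ED data would need to be checked with the same repeated cross-validation used in `fit_model`.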


#### Feature importances

In [11]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
| --- | --- | --- | --- | --- | --- |
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.36487 | 0.169619 | 0.198314 | 0.089594 | 0.177603 |
| std | 0.009283 | 0.025945 | 0.018218 | 0.003523 | 0.020872 |
| min | 0.343782 | 0.115615 | 0.168171 | 0.082301 | 0.137226 |
| 25% | 0.361617 | 0.154869 | 0.185078 | 0.086145 | 0.166208 |
| 50% | 0.363937 | 0.170291 | 0.202741 | 0.08972 | 0.174057 |
| 75% | 0.371125 | 0.190763 | 0.209951 | 0.092489 | 0.181188 |
| max | 0.382143 | 0.21095 | 0.238635 | 0.094948 | 0.22201 |


### Extra Trees

In [12]:
model = ExtraTreesRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [13]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| | test_score | train_score |
| --- | --- | --- |
| count | 25.0 | 25.0 |
| mean | 0.182416 | 1.0 |
| std | 0.089688 | 0.0 |
| min | -0.021004 | 1.0 |
| 25% | 0.135715 | 1.0 |
| 50% | 0.183127 | 1.0 |
| 75% | 0.253164 | 1.0 |
| max | 0.326425 | 1.0 |


#### Feature importances

In [14]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
| --- | --- | --- | --- | --- | --- |
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.401046 | 0.110987 | 0.20161 | 0.084824 | 0.201533 |
| std | 0.01085 | 0.00691 | 0.007992 | 0.002776 | 0.007109 |
| min | 0.381822 | 0.101835 | 0.178142 | 0.081146 | 0.188806 |
| 25% | 0.391383 | 0.10643 | 0.198509 | 0.082401 | 0.196518 |
| 50% | 0.40002 | 0.10884 | 0.20216 | 0.084573 | 0.201068 |
| 75% | 0.406894 | 0.11375 | 0.204329 | 0.086932 | 0.207399 |
| max | 0.418465 | 0.130209 | 0.218215 | 0.091477 | 0.217875 |


### Gradient Boosted Trees

In [15]:
model = GradientBoostingRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [16]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| | test_score | train_score |
| --- | --- | --- |
| count | 25.0 | 25.0 |
| mean | 0.416243 | 0.575241 |
| std | 0.04036 | 0.009842 |
| min | 0.349189 | 0.552126 |
| 25% | 0.389002 | 0.569903 |
| 50% | 0.415058 | 0.577257 |
| 75% | 0.435761 | 0.581159 |
| max | 0.492157 | 0.594571 |


#### Feature importances

In [17]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
| --- | --- | --- | --- | --- | --- |
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.194463 | 0.244419 | 0.337852 | 0.051804 | 0.171462 |
| std | 0.016576 | 0.063612 | 0.036182 | 0.009274 | 0.061353 |
| min | 0.151994 | 0.121871 | 0.252115 | 0.035434 | 0.092215 |
| 25% | 0.18203 | 0.189238 | 0.313259 | 0.045647 | 0.140473 |
| 50% | 0.195153 | 0.261649 | 0.341908 | 0.052066 | 0.152639 |
| 75% | 0.209755 | 0.30134 | 0.358086 | 0.056076 | 0.191089 |
| max | 0.22152 | 0.318146 | 0.408796 | 0.067777 | 0.377489 |


### AdaBoost

In [18]:
model = AdaBoostRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [19]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| | test_score | train_score |
| --- | --- | --- |
| count | 25.0 | 25.0 |
| mean | 0.357844 | 0.391452 |
| std | 0.051182 | 0.017144 |
| min | 0.238568 | 0.364805 |
| 25% | 0.322158 | 0.383087 |
| 50% | 0.365117 | 0.389659 |
| 75% | 0.39948 | 0.399603 |
| max | 0.443821 | 0.428363 |


#### Feature importances

In [20]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
| --- | --- | --- | --- | --- | --- |
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.143346 | 0.175411 | 0.425632 | 0.034593 | 0.221018 |
| std | 0.018561 | 0.047252 | 0.094007 | 0.017305 | 0.09383 |
| min | 0.09415 | 0.06099 | 0.233218 | 0.007284 | 0.030267 |
| 25% | 0.132855 | 0.145584 | 0.38341 | 0.024297 | 0.184113 |
| 50% | 0.143525 | 0.184479 | 0.407756 | 0.033898 | 0.214753 |
| 75% | 0.153296 | 0.205222 | 0.467581 | 0.040821 | 0.255783 |
| max | 0.183349 | 0.284722 | 0.636627 | 0.067474 | 0.397699 |


## Summary 

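- Extra Trees does not perform well: it fits the training data perfectly (train R² of 1.0) but has the lowest mean test R² (0.18)
- Random Forest with default parameters overfits the training data (mean train R² of 0.91 against a mean test R² of 0.33)
- Gradient Boosted Trees performs best, with the highest mean test R² (0.42)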
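
The comparison can be gathered into a single table. The sketch below hard-codes the mean scores reported in the `res.describe()` outputs of this notebook rather than re-running the models. The `gap` column is an added convenience, not something computed in the original cells.

```python
import pandas as pd

# Mean scores copied from the res.describe() outputs in this notebook
summary = pd.DataFrame(
    {'test_score': [0.326184, 0.182416, 0.416243, 0.357844],
     'train_score': [0.90764, 1.0, 0.575241, 0.391452]},
    index=['Random Forest', 'Extra Trees',
           'Gradient Boosted Trees', 'AdaBoost'])

# Train-test gap: a large gap indicates overfitting
summary['gap'] = summary['train_score'] - summary['test_score']

print(summary.sort_values('test_score', ascending=False))
```

Sorted this way, Gradient Boosted Trees ranks first on test score, while Extra Trees shows the largest train-test gap.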