# Tree-based models 

## Overview 

This notebook contains an initial exploration of tree-based regression models to predict monthly emergency department (ED) demand.

The variables population, people, places and lives vary only annually, so they are excluded from the models: including them would leak information between the training and test sets.
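The leakage concern can be illustrated on toy data. The sketch below assumes a hypothetical frame with `ccg`, `year` and `month` columns and a stand-in `annual_var` column; none of these values come from the project data. It counts how many annual values in a shuffled K-fold test split also occur in the training split.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy data (hypothetical): 3 CCGs x 2 years x 12 monthly rows
rows = [(ccg, year, month)
        for ccg in ['A', 'B', 'C']
        for year in [2018, 2019]
        for month in range(1, 13)]
toy = pd.DataFrame(rows, columns=['ccg', 'year', 'month'])

# Stand-in annual variable: one distinct value per (ccg, year)
toy['annual_var'] = toy.groupby(['ccg', 'year']).ngroup()

# One shuffled 5-fold split, as a random row-level split would produce
kf = KFold(n_splits=5, shuffle=True, random_state=0)
train_idx, test_idx = next(kf.split(toy))

train_vals = set(toy.iloc[train_idx]['annual_var'])
test_vals = set(toy.iloc[test_idx]['annual_var'])
shared = train_vals & test_vals

print(f"{len(shared)} of {len(test_vals)} annual values in the test split "
      "also occur in the training split")
```

Because each annual value is repeated across twelve monthly rows, a row-level split will usually place the same value on both sides, which lets a tree identify the CCG-year rather than learn a general relationship.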

For all models, variables used include:

- Service capacity (111, GP, Ambulance)
- Service utility (111, Ambulance)

In [1]:
# turn warnings off to keep the notebook tidy
import warnings
warnings.filterwarnings('ignore')

## Import libraries 

In [2]:
import os
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedKFold

## Import data 

In [3]:
dta = pd.read_csv('../data/master_scaled.csv', index_col=0)

In [4]:
dta.columns = ['_'.join([c.split('/')[0],c.split('/')[-1]]) 
               if '/' in c else c for c in dta.columns]
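As a small illustration of what the renaming cell does, the sketch below applies the same expression to a few column names. The raw names here are hypothetical and the real file's headers may differ. Note that only the first and last `/`-separated segments are kept.

```python
# Hypothetical raw column names; the real file's headers may differ
raw_cols = ['ccg', 'ae_attendances/attendances', 'a/b/c']

# Same expression as the cell above: join first and last segments with '_'
clean_cols = ['_'.join([c.split('/')[0], c.split('/')[-1]])
              if '/' in c else c for c in raw_cols]

print(clean_cols)
# ['ccg', 'ae_attendances_attendances', 'a_c']
```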

In [5]:
dta.ccg.unique().shape

(73,)

## Add random feature

In [6]:
# Add one column of uniform random values in [0, 1), seeded for reproducibility

rng = np.random.RandomState(0)
rand_var = rng.rand(dta.shape[0])
dta['rand1'] = rand_var

In [7]:
dta.shape

(1465, 14)
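The notebook does not say how `rand1` is used, and it does not appear in the feature lists below. One common use for such a column, assumed here, is as a noise baseline: a feature whose importance is no higher than that of pure noise is probably not informative. A minimal sketch on synthetic data (not the project data):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
n = 500

# Synthetic data: the target depends on x1 only; rand1 is pure noise
toy = pd.DataFrame({'x1': rng.rand(n), 'rand1': rng.rand(n)})
y = 3 * toy['x1'] + rng.normal(scale=0.1, size=n)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(toy, y)

# Impurity-based importances sum to 1 across features
importances = pd.Series(forest.feature_importances_, index=toy.columns)
print(importances)
```

To apply the same idea here, `'rand1'` would need to be added to the `features` list passed to the fitting function.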

## Fitting function 

In [8]:
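def fit_model(dta, model, features):
    """Cross-validate `model` on `features` to predict ED attendances.

    Returns the dict produced by `cross_validate`, including the fitted
    estimator and the train and test scores for each of the 25 folds.
    """
    y = dta['ae_attendances_attendances']
    X = dta[features]

    # repeated k-fold (5 splits x 5 repeats = 25 fits) gives a spread
    # of scores and feature importances rather than a single estimate
    cv_model = cross_validate(model, X, y,
                              cv=RepeatedKFold(n_splits=5, n_repeats=5,
                                               random_state=0),
                              return_estimator=True,
                              return_train_score=True, n_jobs=2)

    return cv_model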
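To show what the fitting function returns, the sketch below runs the same `cross_validate` and `RepeatedKFold` configuration on synthetic data from `make_regression`, used here as a stand-in for the project data. The estimator settings are arbitrary choices to keep the example quick.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate, RepeatedKFold

# Synthetic stand-in for the features and target
X, y = make_regression(n_samples=200, n_features=5, noise=10.0,
                       random_state=0)

# 5 splits x 5 repeats = 25 train/test partitions
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
out = cross_validate(GradientBoostingRegressor(n_estimators=20,
                                               random_state=0),
                     X, y, cv=cv,
                     return_estimator=True, return_train_score=True)

# One score and one fitted estimator per partition
print(len(out['test_score']), len(out['train_score']), len(out['estimator']))

# For regressors the default score is R^2 (the estimator's own .score)
print(out['test_score'].mean(), out['train_score'].mean())
```

Each entry in `out['estimator']` is a separately fitted model, which is why the later cells can collect one set of `feature_importances_` per fold.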

## Model Comparison

### Random Forest 

In [9]:
model = RandomForestRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [10]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

|  | test_score | train_score |
| --- | --- | --- |
| count | 25.0 | 25.0 |
| mean | 0.32213 | 0.906824 |
| std | 0.069696 | 0.003302 |
| min | 0.194441 | 0.901193 |
| 25% | 0.276673 | 0.904595 |
| 50% | 0.306864 | 0.906745 |
| 75% | 0.363652 | 0.909827 |
| max | 0.430799 | 0.912196 |


#### Feature importances

In [11]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

|  | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
| --- | --- | --- | --- | --- | --- |
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.365801 | 0.170317 | 0.197221 | 0.089667 | 0.176993 |
| std | 0.008997 | 0.027994 | 0.020464 | 0.004874 | 0.026938 |
| min | 0.346552 | 0.107991 | 0.160373 | 0.079188 | 0.133565 |
| 25% | 0.360525 | 0.151561 | 0.180527 | 0.086168 | 0.158953 |
| 50% | 0.364814 | 0.177633 | 0.199804 | 0.091123 | 0.174052 |
| 75% | 0.37358 | 0.192863 | 0.209531 | 0.092345 | 0.18955 |
| max | 0.380157 | 0.204971 | 0.237596 | 0.099505 | 0.256822 |


### Extra Trees

In [12]:
model = ExtraTreesRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [13]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

|  | test_score | train_score |
| --- | --- | --- |
| count | 25.0 | 25.0 |
| mean | 0.18058 | 1.0 |
| std | 0.086218 | 0.0 |
| min | 0.017548 | 1.0 |
| 25% | 0.121401 | 1.0 |
| 50% | 0.178484 | 1.0 |
| 75% | 0.233753 | 1.0 |
| max | 0.325149 | 1.0 |
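
A train score of exactly 1.0 is what fully grown trees produce: by default `ExtraTreesRegressor` does not bootstrap and places no limit on tree depth, so each tree reproduces every training row. The sketch below shows this on synthetic data, along with how a constraint such as `min_samples_leaf` pulls the train score below 1. The value 10 is an arbitrary choice for illustration, not something the notebook tunes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic noisy regression problem (not the project data)
X, y = make_regression(n_samples=300, n_features=5, noise=20.0,
                       random_state=0)

# Default settings: unbounded depth, no bootstrap
default_model = ExtraTreesRegressor(n_estimators=50, random_state=0)
default_model.fit(X, y)

# Constrained: every leaf must hold at least 10 training rows
constrained = ExtraTreesRegressor(n_estimators=50, min_samples_leaf=10,
                                  random_state=0)
constrained.fit(X, y)

default_train = default_model.score(X, y)
constrained_train = constrained.score(X, y)
print(default_train, constrained_train)
```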


#### Feature importances

In [14]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

|  | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
| --- | --- | --- | --- | --- | --- |
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.401077 | 0.111543 | 0.204442 | 0.084349 | 0.198588 |
| std | 0.010982 | 0.005765 | 0.008996 | 0.003703 | 0.00838 |
| min | 0.376985 | 0.102404 | 0.183928 | 0.078453 | 0.178813 |
| 25% | 0.393297 | 0.108543 | 0.199357 | 0.081903 | 0.193795 |
| 50% | 0.401494 | 0.110739 | 0.204724 | 0.084302 | 0.198238 |
| 75% | 0.409686 | 0.115256 | 0.208711 | 0.085843 | 0.202454 |
| max | 0.419369 | 0.128184 | 0.224908 | 0.095649 | 0.221311 |


### Gradient Boosted Trees

In [15]:
model = GradientBoostingRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [16]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

|  | test_score | train_score |
| --- | --- | --- |
| count | 25.0 | 25.0 |
| mean | 0.416063 | 0.575241 |
| std | 0.04078 | 0.009842 |
| min | 0.349796 | 0.552126 |
| 25% | 0.3882 | 0.569903 |
| 50% | 0.41562 | 0.577257 |
| 75% | 0.435762 | 0.581159 |
| max | 0.49227 | 0.594571 |


#### Feature importances

In [17]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

|  | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
| --- | --- | --- | --- | --- | --- |
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.194661 | 0.246862 | 0.338653 | 0.049311 | 0.170514 |
| std | 0.01668 | 0.066562 | 0.032743 | 0.008849 | 0.054057 |
| min | 0.151387 | 0.115634 | 0.27938 | 0.036155 | 0.110274 |
| 25% | 0.182196 | 0.185143 | 0.315717 | 0.043269 | 0.133716 |
| 50% | 0.195259 | 0.259116 | 0.340024 | 0.048502 | 0.15413 |
| 75% | 0.210448 | 0.30126 | 0.361358 | 0.053114 | 0.184405 |
| max | 0.221528 | 0.328437 | 0.399541 | 0.069876 | 0.333344 |


### AdaBoost

In [18]:
model = AdaBoostRegressor()

features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made']

results = fit_model(dta,model,features)

#### Performance 

In [19]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

|  | test_score | train_score |
| --- | --- | --- |
| count | 25.0 | 25.0 |
| mean | 0.355121 | 0.396363 |
| std | 0.052863 | 0.015194 |
| min | 0.242488 | 0.36821 |
| 25% | 0.319202 | 0.388169 |
| 50% | 0.363843 | 0.396308 |
| 75% | 0.399936 | 0.40046 |
| max | 0.450772 | 0.426606 |


#### Feature importances

In [20]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

|  | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made |
| --- | --- | --- | --- | --- | --- |
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.134264 | 0.18106 | 0.446667 | 0.038971 | 0.199039 |
| std | 0.013269 | 0.058352 | 0.108582 | 0.023754 | 0.102021 |
| min | 0.111301 | 0.039253 | 0.282089 | 0.0112 | 0.025824 |
| 25% | 0.123832 | 0.142141 | 0.354968 | 0.022755 | 0.151555 |
| 50% | 0.133015 | 0.188871 | 0.413903 | 0.033434 | 0.208024 |
| 75% | 0.144088 | 0.21948 | 0.515295 | 0.045226 | 0.271256 |
| max | 0.154627 | 0.306544 | 0.685616 | 0.10509 | 0.463545 |


## Summary 

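- Extra Trees does not perform well: its mean test score is 0.18 against a train score of 1.0
- Random forest with default parameters is overfitting to the training data (mean train score 0.91, mean test score 0.32)
- Gradient boosted trees performs best, with a mean test score of 0.42, followed by AdaBoost at 0.36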
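
The per-model comparison can also be gathered into one table, which makes the train/test gap easier to read. The sketch below is a minimal version on synthetic data from `make_regression`. The `n_estimators` values, the seeds and the reduced number of repeats are assumptions made to keep it quick, so its numbers will not match those reported above.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              GradientBoostingRegressor, AdaBoostRegressor)
from sklearn.model_selection import cross_validate, RepeatedKFold

# Synthetic stand-in for the project data
X, y = make_regression(n_samples=200, n_features=5, noise=20.0,
                       random_state=0)

models = {
    'random_forest': RandomForestRegressor(n_estimators=30, random_state=0),
    'extra_trees': ExtraTreesRegressor(n_estimators=30, random_state=0),
    'gradient_boosting': GradientBoostingRegressor(n_estimators=30,
                                                   random_state=0),
    'ada_boost': AdaBoostRegressor(n_estimators=30, random_state=0),
}

cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)

rows = {}
for name, model in models.items():
    out = cross_validate(model, X, y, cv=cv, return_train_score=True)
    rows[name] = {'test_score': out['test_score'].mean(),
                  'train_score': out['train_score'].mean()}

summary = pd.DataFrame.from_dict(rows, orient='index')
# A large positive gap indicates overfitting to the training folds
summary['gap'] = summary['train_score'] - summary['test_score']
print(summary.sort_values('test_score', ascending=False))
```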