# Tree-based models 

## Overview 

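This notebook contains an initial exploration of tree-based regression models for predicting monthly emergency department (ED) demand. 

The variables population, people, places and lives vary only annually, so the same values repeat across the monthly rows within a year. With randomly split folds, these repeated values would appear in both the training and test sets and leak information between them, so they are excluded from the models.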
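The leakage concern can be illustrated with a minimal sketch. It assumes a hypothetical layout of 6 areas with 4 monthly rows each, and compares a plain `KFold` with scikit-learn's `GroupKFold`, which keeps all rows of an area in the same fold. Grouped splitting is not what this notebook uses (it relies on `RepeatedKFold` and simply drops the annual variables); it is shown only to make the problem concrete.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Hypothetical layout: 6 areas with 4 monthly rows each.
groups = np.repeat(np.arange(6), 4)
X = np.zeros((groups.size, 1))  # placeholder features; only the split matters


def folds_sharing_groups(splits):
    """Count folds in which at least one area is in both train and test."""
    shared = 0
    for train_idx, test_idx in splits:
        if set(groups[train_idx]) & set(groups[test_idx]):
            shared += 1
    return shared


# Plain K-fold ignores the area labels, so areas can straddle the split.
plain = folds_sharing_groups(KFold(n_splits=5).split(X))
# Grouped K-fold holds out whole areas at a time.
grouped = folds_sharing_groups(GroupKFold(n_splits=3).split(X, groups=groups))
print(plain, grouped)
```

Any variable that is constant within an area and year effectively labels those rows, so a fold that shares areas between train and test lets a tree memorise the label rather than learn a general relationship.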

For all models, variables used include:

- Service capacity (111, GP, Ambulance)
- Service utility (111, Ambulance)

In [1]:
#turn warnings off to keep notebook tidy
import warnings
warnings.filterwarnings('ignore')

## Import libraries 

In [2]:
import os
import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import cross_validate
from sklearn.model_selection import RepeatedKFold

## Import data 

In [3]:
dta = pd.read_csv('../data/master_scaled.csv', index_col=0)

In [4]:
dta.columns = ['_'.join([c.split('/')[0],c.split('/')[-1]]) 
               if '/' in c else c for c in dta.columns]
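As a small worked example of the renaming step, the sketch below applies the same expression to a few hypothetical column names (the real names in `master_scaled.csv` may differ). It keeps the text before the first `/` and after the last `/`, joined with an underscore, and leaves names without a `/` untouched.

```python
# Hypothetical raw column names, for illustration only.
raw_columns = ['ccg', 'month', '111/111_offered',
               'amb_sys/answered', 'gp_appt/x/available']

# Same expression as in the cell above: first and last '/'-separated parts.
renamed = ['_'.join([c.split('/')[0], c.split('/')[-1]])
           if '/' in c else c for c in raw_columns]

for old, new in zip(raw_columns, renamed):
    print(f'{old!r} -> {new!r}')
```

Note that any middle segments (the `x` in the last hypothetical name) are discarded by this rule.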

In [5]:
dta.ccg.unique().shape

(71,)

## Add random feature

In [6]:
# Add a uniform random feature as a baseline for judging feature importances

rng = np.random.RandomState(0)
rand_var = rng.rand(dta.shape[0])
dta['rand1'] = rand_var

In [7]:
dta.shape

(1425, 13)

## Fitting function 

In [8]:
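def fit_model(dta, model, features):
    """Cross-validate `model` on the given features and return the results."""
    y = dta['ae_attendances_attendances']
    X = dta[features]

    # 5-fold CV repeated 5 times gives 25 train/test scores and 25 fitted
    # estimators, used to estimate the spread in performance and in
    # feature importances
    cv_model = cross_validate(model, X, y,
                              cv=RepeatedKFold(n_splits=5, n_repeats=5,
                                               random_state=0),
                              return_estimator=True,
                              return_train_score=True, n_jobs=2)

    # also fit the passed-in model on the full data set (fitted in place)
    model.fit(X, y)

    return cv_model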
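To show where the `count` of 25 in the result tables comes from, here is a minimal sketch on synthetic data (the arrays and the small forest are assumptions for illustration, not the notebook's data). `RepeatedKFold` with 5 splits and 5 repeats yields 25 train/test partitions, and `cross_validate` returns one score and one fitted estimator per partition.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_validate

# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
out = cross_validate(RandomForestRegressor(n_estimators=10, random_state=0),
                     X, y, cv=cv,
                     return_estimator=True, return_train_score=True)

print(cv.get_n_splits())          # number of train/test partitions
print(sorted(out))                # keys of the returned dictionary
print(len(out['test_score']), len(out['estimator']))
```

With no `scoring` argument, the scores are those of the estimator's own `score` method, which is R² for scikit-learn regressors.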

In [9]:
features = ['gp_appt_available',
            '111_111_offered', 'amb_sys_answered',
            '111_111_answered', 'amb_sys_made', 'rand1']

## Model Comparison

### Random Forest 

In [10]:
model = RandomForestRegressor()

results = fit_model(dta,model,features)

#### Performance 

In [11]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| | test_score | train_score |
| --- | --- | --- |
| count | 25.0 | 25.0 |
| mean | 0.382357 | 0.914871 |
| std | 0.05879 | 0.002631 |
| min | 0.286246 | 0.909122 |
| 25% | 0.347832 | 0.913992 |
| 50% | 0.376833 | 0.915086 |
| 75% | 0.40876 | 0.915795 |
| max | 0.506254 | 0.920028 |
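The large gap between the train and test scores suggests that the default forest, which grows its trees until the leaves are nearly pure, is memorising noise. One common way to probe this is to constrain tree growth, for example with `min_samples_leaf`. The sketch below uses synthetic noisy data and an arbitrary value of 30; both are assumptions for illustration and have not been tuned for, or run on, the ED data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic data with a weak signal and substantial noise (illustration only).
rng = np.random.RandomState(0)
X = rng.rand(300, 5)
y = X[:, 0] + rng.normal(scale=0.5, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=0)


def mean_scores(model):
    """Return the mean train and test R^2 over the CV folds."""
    out = cross_validate(model, X, y, cv=cv, return_train_score=True)
    return out['train_score'].mean(), out['test_score'].mean()


default_train, default_test = mean_scores(
    RandomForestRegressor(n_estimators=50, random_state=0))
constrained_train, constrained_test = mean_scores(
    RandomForestRegressor(n_estimators=50, min_samples_leaf=30,
                          random_state=0))

print(f'default:     train={default_train:.2f} test={default_test:.2f}')
print(f'constrained: train={constrained_train:.2f} test={constrained_test:.2f}')
```

A smaller train/test gap does not by itself mean better predictions; the test score is still the figure to compare between settings.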


#### Feature importances 

In [12]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made | rand1 |
| --- | --- | --- | --- | --- | --- | --- |
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.229226 | 0.161094 | 0.197181 | 0.083633 | 0.165713 | 0.163152 |
| std | 0.007867 | 0.020913 | 0.015954 | 0.00418 | 0.023321 | 0.006777 |
| min | 0.210016 | 0.128189 | 0.165972 | 0.076386 | 0.134136 | 0.151106 |
| 25% | 0.222853 | 0.146058 | 0.184449 | 0.080381 | 0.14873 | 0.160056 |
| 50% | 0.23103 | 0.161281 | 0.198673 | 0.083692 | 0.162465 | 0.162262 |
| 75% | 0.233805 | 0.176525 | 0.208388 | 0.086722 | 0.173919 | 0.167679 |
| max | 0.242194 | 0.196061 | 0.226277 | 0.093131 | 0.230124 | 0.180141 |
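The values in `feature_importances_` are impurity-based and computed on the training data, so a forest that overfits can credit a pure-noise column such as `rand1` with a sizeable share. A complementary check is permutation importance on held-out data, available as `sklearn.inspection.permutation_importance`. The sketch below uses synthetic data with one informative column and one noise column; it is an illustration under those assumptions, not a result for the ED data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: one informative feature and one pure-noise feature.
rng = np.random.RandomState(0)
n = 400
signal = rng.rand(n)
noise_feature = rng.rand(n)  # plays the role of rand1
X = np.column_stack([signal, noise_feature])
y = 3.0 * signal + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

# Drop in held-out R^2 when each column is shuffled in turn.
perm = permutation_importance(forest, X_test, y_test,
                              n_repeats=10, random_state=0)

print('impurity-based:', forest.feature_importances_)
print('permutation:   ', perm.importances_mean)
```

If the noise column scores close to zero under permutation but not under the impurity-based measure, that points to the impurity-based ranking reflecting overfitting rather than predictive value.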


### Extra Trees

In [13]:
model = ExtraTreesRegressor()

results = fit_model(dta,model,features)

#### Performance 

In [14]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| | test_score | train_score |
| --- | --- | --- |
| count | 25.0 | 25.0 |
| mean | 0.32266 | 1.0 |
| std | 0.072149 | 0.0 |
| min | 0.211726 | 1.0 |
| 25% | 0.276523 | 1.0 |
| 50% | 0.316278 | 1.0 |
| 75% | 0.341709 | 1.0 |
| max | 0.482223 | 1.0 |


#### Feature importances 

In [15]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made | rand1 |
| --- | --- | --- | --- | --- | --- | --- |
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.231448 | 0.111849 | 0.205763 | 0.088088 | 0.199496 | 0.163358 |
| std | 0.006699 | 0.005243 | 0.009131 | 0.002423 | 0.006833 | 0.00824 |
| min | 0.21793 | 0.10239 | 0.191325 | 0.08238 | 0.185961 | 0.148748 |
| 25% | 0.228373 | 0.108096 | 0.199274 | 0.086553 | 0.195099 | 0.15685 |
| 50% | 0.231647 | 0.111696 | 0.20358 | 0.088463 | 0.198387 | 0.162585 |
| 75% | 0.236662 | 0.116064 | 0.211349 | 0.089799 | 0.206081 | 0.168268 |
| max | 0.244781 | 0.121084 | 0.224264 | 0.093409 | 0.208329 | 0.183804 |


### Gradient Boosted Trees

In [16]:
model = GradientBoostingRegressor()

results = fit_model(dta,model,features)

#### Performance 

In [17]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| | test_score | train_score |
| --- | --- | --- |
| count | 25.0 | 25.0 |
| mean | 0.408948 | 0.590549 |
| std | 0.046895 | 0.012163 |
| min | 0.314857 | 0.566091 |
| 25% | 0.374924 | 0.579843 |
| 50% | 0.41876 | 0.592954 |
| 75% | 0.439431 | 0.595079 |
| max | 0.490287 | 0.615878 |


#### Feature importances 

In [18]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made | rand1 |
| --- | --- | --- | --- | --- | --- | --- |
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.163692 | 0.228876 | 0.340909 | 0.054557 | 0.15714 | 0.054827 |
| std | 0.017815 | 0.05276 | 0.044908 | 0.009138 | 0.054748 | 0.007393 |
| min | 0.132901 | 0.123586 | 0.261415 | 0.038832 | 0.07946 | 0.036534 |
| 25% | 0.149851 | 0.193014 | 0.310455 | 0.049679 | 0.124496 | 0.050282 |
| 50% | 0.165998 | 0.236635 | 0.334982 | 0.053867 | 0.14665 | 0.054605 |
| 75% | 0.175683 | 0.275334 | 0.354419 | 0.058632 | 0.182419 | 0.061017 |
| max | 0.199535 | 0.295692 | 0.432266 | 0.071955 | 0.320638 | 0.066965 |


### AdaBoost

In [19]:
model = AdaBoostRegressor()

results = fit_model(dta,model,features)

#### Performance 

In [20]:
res=pd.DataFrame()
res['test_score'] = results['test_score']
res['train_score'] = results['train_score']

res.describe()

| | test_score | train_score |
| --- | --- | --- |
| count | 25.0 | 25.0 |
| mean | 0.351155 | 0.39305 |
| std | 0.0497 | 0.01942 |
| min | 0.254119 | 0.351694 |
| 25% | 0.315523 | 0.382836 |
| 50% | 0.34619 | 0.393276 |
| 75% | 0.389131 | 0.401918 |
| max | 0.439471 | 0.423659 |


#### Feature importances 

In [21]:
coefs = pd.DataFrame(
   [model.feature_importances_
    for model in results['estimator']],
   columns=features
)

coefs.describe()

| | gp_appt_available | 111_111_offered | amb_sys_answered | 111_111_answered | amb_sys_made | rand1 |
| --- | --- | --- | --- | --- | --- | --- |
| count | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 | 25.0 |
| mean | 0.133832 | 0.181929 | 0.461732 | 0.028768 | 0.189131 | 0.004608 |
| std | 0.019764 | 0.043069 | 0.090431 | 0.018895 | 0.090541 | 0.010053 |
| min | 0.097008 | 0.077897 | 0.326745 | 0.005394 | 0.024271 | 0.0 |
| 25% | 0.118517 | 0.161966 | 0.398627 | 0.015254 | 0.163187 | 4e-06 |
| 50% | 0.139376 | 0.189782 | 0.455015 | 0.02608 | 0.195561 | 2.2e-05 |
| 75% | 0.146256 | 0.205902 | 0.491024 | 0.039033 | 0.248007 | 0.004891 |
| max | 0.169171 | 0.265341 | 0.736419 | 0.092627 | 0.358966 | 0.048035 |


## Summary 

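Extra Trees does not perform well: it fits the training data perfectly (train R² of 1.0) but has the lowest mean test R² (0.32).

Random Forest with default parameters overfits the training data (mean train R² of 0.91 against a mean test R² of 0.38).

Gradient Boosted Trees performs best, with the highest mean test R² (0.41).

The random variable is comparatively important in Random Forest and Extra Trees (mean importance of about 0.16, above two of the real features). It matters much less in Gradient Boosted Trees (about 0.05, level with the least important real feature) and is negligible in AdaBoost (about 0.005).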
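The comparison can be made explicit by tabulating the mean scores reported in the performance tables of each model section. The sketch below simply re-enters those means by hand (it does not rerun the models) and adds the train/test gap as a rough indicator of overfitting; the `gap` column name is an illustrative choice.

```python
import pandas as pd

# Mean cross-validated R^2 values copied from the performance tables.
summary = pd.DataFrame(
    {'test_score': [0.382357, 0.32266, 0.408948, 0.351155],
     'train_score': [0.914871, 1.0, 0.590549, 0.39305]},
    index=['Random Forest', 'Extra Trees',
           'Gradient Boosted Trees', 'AdaBoost'])

# Train/test gap: larger values suggest more overfitting.
summary['gap'] = summary['train_score'] - summary['test_score']

print(summary.sort_values('test_score', ascending=False).round(3))
```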