# Hyperparameter search

Here we will use cross-validation with a grid search to optimise the hyperparameters of Random Forests and Gradient Boosted Decision Trees.

You will optimise the models on the `train_validate` data, and test them on the `test` data.

Tasks:

1. Investigate the documentation for `GridSearchCV` in sklearn


2. Optimise the hyperparameters for the `RandomForestClassifier` on the dataset
    * Use 5-fold cross-validation (`cv=5`)
    * Use `neg_log_loss` as the `scoring` parameter
    * Search values of `max_depth` between 2 and 15
    * Search values of `n_estimators` between 10 and 1000
    * Make sure to store your scores for each hyper-parameter combination
    
    
3. Fit the model to all of the `train_validate` data using the optimal hyperparameters


4. Test the fitted model on the `test` data.
    * How does the log-likelihood score differ for the test data compared to the cross-validation score?
    * What are the possible reasons for this discrepancy? (We will discuss this in detail next week!)
    
    
5. Investigate the feature importances of the optimised model and compare them to those obtained in notebook 1 using the default parameters.


### Extension exercise

For the extension exercise, you will optimise the `xgboost` classifier using the `early_stopping_rounds` parameter.


* You will need to use the `xgboost.cv` method from the xgboost learning api, documented [here](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training). Again, use 5-fold CV for the search. Set `metrics='mlogloss'` to evaluate the model on the log-likelihood loss.


* Set `num_boost_round=1000` and `early_stopping_rounds=10`, so that the number of boosting rounds is capped at 1000, but training stops earlier if the cross-validated loss does not improve for 10 rounds.


* Fix the learning rate (called `eta` in the Learning API) to 0.05, and investigate values of `max_depth` between 2 and 15.


* Test the optimised model on the `test` data, and comment on the results. 


* Remember to use Google/Stack Overflow for help! This has been attempted many times before. 
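
The early-stopping rule described in the bullets above can be sketched in plain Python. This is only an illustration of the idea, not the `xgboost` implementation: the function name `best_round_with_early_stopping` and the toy loss values are made up for the example.

```python
def best_round_with_early_stopping(losses, patience):
    """Return the 0-based index of the best round under a patience rule.

    Scans the per-round validation losses and stops as soon as `patience`
    rounds have passed without improving on the best loss seen so far.
    """
    best_idx = 0
    for i, loss in enumerate(losses):
        if loss < losses[best_idx]:
            best_idx = i              # new best round
        elif i - best_idx >= patience:
            break                     # no improvement for `patience` rounds
    return best_idx


# Hypothetical validation losses, one per boosting round
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stop_at = best_round_with_early_stopping(losses, patience=3)
print(stop_at, losses[stop_at])
```

With `patience=3` the scan stops before it reaches the last value, so the lower loss in the final round is never seen. This is the trade-off of early stopping: a small patience saves time but can stop in a plateau.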

# GridSearchCV and RF

In this part, we will use the `GridSearchCV` and the `RandomForestClassifier` from `sklearn` together to find the best parameters for the RF.

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import log_loss
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

import time
import xgboost as xgb

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

import matplotlib.pyplot as plt
import matplotlib

# For the Python notebook
%matplotlib inline
%reload_ext autoreload
%autoreload 2

We do the usual data preprocessing.

In [2]:
df_train_validate = pd.read_csv('data/train_validate.csv', index_col='trip_id')

target = ['travel_mode']
id_context = ['trip_id', 
              'household_id', 
              'person_n', 
              'trip_n',
              'survey_year',
              'travel_year'
             ]
features = [c for c in df_train_validate.columns 
            if c not in (target + id_context)]

y_train = df_train_validate[target].travel_mode.values
X_train = df_train_validate[features]

First, we create the model with the parameters that won't change.

In [3]:
rf = RandomForestClassifier(criterion='entropy', 
                            n_jobs=-1)

Now, we can prepare the grid of parameter values to search over.

In [4]:
parameters = {'n_estimators': [10, 20, 50, 100, 200, 500, 1000],
              'max_depth': list(range(2,16))}
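
As a quick check on the cost of the search, the number of candidates in this grid can be counted before fitting anything. The sketch below uses `ParameterGrid` from `sklearn.model_selection`; the variable names `n_candidates` and `n_fits` are only for illustration.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'n_estimators': [10, 20, 50, 100, 200, 500, 1000],
              'max_depth': list(range(2, 16))}

# Every combination of the listed values is one candidate
grid = ParameterGrid(parameters)
n_candidates = len(grid)      # 7 values of n_estimators x 14 values of max_depth
n_fits = n_candidates * 5     # one fit per candidate and per fold (cv=5)

print(n_candidates, n_fits)
```

The grid size grows multiplicatively with every extra parameter, which is why the full search takes a long time.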

We can now create the grid search.

In [5]:
clf = GridSearchCV(rf, parameters, scoring='neg_log_loss', cv=5, verbose=5)

We can finally fit the `GridSearchCV`. This takes quite some time.

In [6]:
clf.fit(X_train, y_train)

Fitting 5 folds for each of 98 candidates, totalling 490 fits
[CV] max_depth=2, n_estimators=10 ....................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ....... max_depth=2, n_estimators=10, score=-0.863, total=   1.5s
[CV] max_depth=2, n_estimators=10 ....................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.5s remaining:    0.0s


[CV] ....... max_depth=2, n_estimators=10, score=-0.906, total=   0.2s
[CV] max_depth=2, n_estimators=10 ....................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.7s remaining:    0.0s


[CV] ....... max_depth=2, n_estimators=10, score=-0.875, total=   0.2s
[CV] max_depth=2, n_estimators=10 ....................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    2.0s remaining:    0.0s


[CV] ....... max_depth=2, n_estimators=10, score=-0.887, total=   0.2s
[CV] max_depth=2, n_estimators=10 ....................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    2.2s remaining:    0.0s


[CV] ....... max_depth=2, n_estimators=10, score=-0.873, total=   0.2s
[CV] max_depth=2, n_estimators=20 ....................................
[CV] ....... max_depth=2, n_estimators=20, score=-0.893, total=   0.3s
[CV] max_depth=2, n_estimators=20 ....................................
[CV] ....... max_depth=2, n_estimators=20, score=-0.895, total=   0.3s
[CV] max_depth=2, n_estimators=20 ....................................
[CV] ....... max_depth=2, n_estimators=20, score=-0.854, total=   0.4s
[CV] max_depth=2, n_estimators=20 ....................................
[CV] ....... max_depth=2, n_estimators=20, score=-0.893, total=   0.5s
[CV] max_depth=2, n_estimators=20 ....................................
[CV] ....... max_depth=2, n_estimators=20, score=-0.878, total=   0.3s
[CV] max_depth=2, n_estimators=50 ....................................
[CV] ....... max_depth=2, n_estimators=50, score=-0.878, total=   0.5s
[CV] max_depth=2, n_estimators=50 ....................................
[CV] .

[CV] ...... max_depth=3, n_estimators=500, score=-0.798, total=   3.1s
[CV] max_depth=3, n_estimators=500 ...................................
[CV] ...... max_depth=3, n_estimators=500, score=-0.822, total=   3.2s
[CV] max_depth=3, n_estimators=500 ...................................
[CV] ...... max_depth=3, n_estimators=500, score=-0.809, total=   3.5s
[CV] max_depth=3, n_estimators=1000 ..................................
[CV] ..... max_depth=3, n_estimators=1000, score=-0.806, total=   6.9s
[CV] max_depth=3, n_estimators=1000 ..................................
[CV] ..... max_depth=3, n_estimators=1000, score=-0.804, total=   6.8s
[CV] max_depth=3, n_estimators=1000 ..................................
[CV] ..... max_depth=3, n_estimators=1000, score=-0.797, total=   6.8s
[CV] max_depth=3, n_estimators=1000 ..................................
[CV] ..... max_depth=3, n_estimators=1000, score=-0.820, total=   6.8s
[CV] max_depth=3, n_estimators=1000 ..................................
[CV] .

[CV] ...... max_depth=5, n_estimators=100, score=-0.736, total=   1.1s
[CV] max_depth=5, n_estimators=100 ...................................
[CV] ...... max_depth=5, n_estimators=100, score=-0.737, total=   1.1s
[CV] max_depth=5, n_estimators=100 ...................................
[CV] ...... max_depth=5, n_estimators=100, score=-0.730, total=   1.1s
[CV] max_depth=5, n_estimators=100 ...................................
[CV] ...... max_depth=5, n_estimators=100, score=-0.759, total=   1.1s
[CV] max_depth=5, n_estimators=100 ...................................
[CV] ...... max_depth=5, n_estimators=100, score=-0.741, total=   1.1s
[CV] max_depth=5, n_estimators=200 ...................................
[CV] ...... max_depth=5, n_estimators=200, score=-0.734, total=   2.0s
[CV] max_depth=5, n_estimators=200 ...................................
[CV] ...... max_depth=5, n_estimators=200, score=-0.736, total=   2.0s
[CV] max_depth=5, n_estimators=200 ...................................
[CV] .

[CV] ....... max_depth=7, n_estimators=10, score=-0.730, total=   0.3s
[CV] max_depth=7, n_estimators=10 ....................................
[CV] ....... max_depth=7, n_estimators=10, score=-0.706, total=   0.3s
[CV] max_depth=7, n_estimators=20 ....................................
[CV] ....... max_depth=7, n_estimators=20, score=-0.700, total=   0.5s
[CV] max_depth=7, n_estimators=20 ....................................
[CV] ....... max_depth=7, n_estimators=20, score=-0.701, total=   0.4s
[CV] max_depth=7, n_estimators=20 ....................................
[CV] ....... max_depth=7, n_estimators=20, score=-0.694, total=   0.4s
[CV] max_depth=7, n_estimators=20 ....................................
[CV] ....... max_depth=7, n_estimators=20, score=-0.723, total=   0.5s
[CV] max_depth=7, n_estimators=20 ....................................
[CV] ....... max_depth=7, n_estimators=20, score=-0.709, total=   0.4s
[CV] max_depth=7, n_estimators=50 ....................................
[CV] .

[CV] ...... max_depth=8, n_estimators=500, score=-0.686, total=   9.7s
[CV] max_depth=8, n_estimators=500 ...................................
[CV] ...... max_depth=8, n_estimators=500, score=-0.681, total=   8.7s
[CV] max_depth=8, n_estimators=500 ...................................
[CV] ...... max_depth=8, n_estimators=500, score=-0.712, total=   9.3s
[CV] max_depth=8, n_estimators=500 ...................................
[CV] ...... max_depth=8, n_estimators=500, score=-0.689, total=   8.3s
[CV] max_depth=8, n_estimators=1000 ..................................
[CV] ..... max_depth=8, n_estimators=1000, score=-0.684, total=  18.2s
[CV] max_depth=8, n_estimators=1000 ..................................
[CV] ..... max_depth=8, n_estimators=1000, score=-0.685, total=  18.0s
[CV] max_depth=8, n_estimators=1000 ..................................
[CV] ..... max_depth=8, n_estimators=1000, score=-0.680, total=  15.9s
[CV] max_depth=8, n_estimators=1000 ..................................
[CV] .

[CV] ...... max_depth=10, n_estimators=50, score=-0.675, total=   1.0s
[CV] max_depth=10, n_estimators=100 ..................................
[CV] ..... max_depth=10, n_estimators=100, score=-0.666, total=   2.3s
[CV] max_depth=10, n_estimators=100 ..................................
[CV] ..... max_depth=10, n_estimators=100, score=-0.673, total=   2.4s
[CV] max_depth=10, n_estimators=100 ..................................
[CV] ..... max_depth=10, n_estimators=100, score=-0.665, total=   2.0s
[CV] max_depth=10, n_estimators=100 ..................................
[CV] ..... max_depth=10, n_estimators=100, score=-0.696, total=   1.9s
[CV] max_depth=10, n_estimators=100 ..................................
[CV] ..... max_depth=10, n_estimators=100, score=-0.671, total=   1.9s
[CV] max_depth=10, n_estimators=200 ..................................
[CV] ..... max_depth=10, n_estimators=200, score=-0.667, total=   3.7s
[CV] max_depth=10, n_estimators=200 ..................................
[CV] .

[CV] ...... max_depth=12, n_estimators=10, score=-0.725, total=   0.4s
[CV] max_depth=12, n_estimators=10 ...................................
[CV] ...... max_depth=12, n_estimators=10, score=-0.737, total=   0.5s
[CV] max_depth=12, n_estimators=10 ...................................
[CV] ...... max_depth=12, n_estimators=10, score=-0.749, total=   0.5s
[CV] max_depth=12, n_estimators=20 ...................................
[CV] ...... max_depth=12, n_estimators=20, score=-0.674, total=   0.6s
[CV] max_depth=12, n_estimators=20 ...................................
[CV] ...... max_depth=12, n_estimators=20, score=-0.695, total=   0.7s
[CV] max_depth=12, n_estimators=20 ...................................
[CV] ...... max_depth=12, n_estimators=20, score=-0.683, total=   0.6s
[CV] max_depth=12, n_estimators=20 ...................................
[CV] ...... max_depth=12, n_estimators=20, score=-0.716, total=   0.7s
[CV] max_depth=12, n_estimators=20 ...................................
[CV] .

[CV] ..... max_depth=13, n_estimators=500, score=-0.656, total=  13.0s
[CV] max_depth=13, n_estimators=500 ..................................
[CV] ..... max_depth=13, n_estimators=500, score=-0.662, total=  13.6s
[CV] max_depth=13, n_estimators=500 ..................................
[CV] ..... max_depth=13, n_estimators=500, score=-0.656, total=  11.1s
[CV] max_depth=13, n_estimators=500 ..................................
[CV] ..... max_depth=13, n_estimators=500, score=-0.684, total=  11.7s
[CV] max_depth=13, n_estimators=500 ..................................
[CV] ..... max_depth=13, n_estimators=500, score=-0.661, total=  13.2s
[CV] max_depth=13, n_estimators=1000 .................................
[CV] .... max_depth=13, n_estimators=1000, score=-0.654, total=  31.1s
[CV] max_depth=13, n_estimators=1000 .................................
[CV] .... max_depth=13, n_estimators=1000, score=-0.662, total=  21.7s
[CV] max_depth=13, n_estimators=1000 .................................
[CV] .

[CV] ...... max_depth=15, n_estimators=50, score=-0.711, total=   1.7s
[CV] max_depth=15, n_estimators=50 ...................................
[CV] ...... max_depth=15, n_estimators=50, score=-0.698, total=   1.7s
[CV] max_depth=15, n_estimators=100 ..................................
[CV] ..... max_depth=15, n_estimators=100, score=-0.659, total=   3.1s
[CV] max_depth=15, n_estimators=100 ..................................
[CV] ..... max_depth=15, n_estimators=100, score=-0.684, total=   2.7s
[CV] max_depth=15, n_estimators=100 ..................................
[CV] ..... max_depth=15, n_estimators=100, score=-0.665, total=   2.7s
[CV] max_depth=15, n_estimators=100 ..................................
[CV] ..... max_depth=15, n_estimators=100, score=-0.692, total=   3.7s
[CV] max_depth=15, n_estimators=100 ..................................
[CV] ..... max_depth=15, n_estimators=100, score=-0.677, total=   4.4s
[CV] max_depth=15, n_estimators=200 ..................................
[CV] .

[Parallel(n_jobs=1)]: Done 490 out of 490 | elapsed: 38.3min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='entropy',
                                              max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=-1,
                                              oob_score=False,
    

Let's print the best score and the set of parameters that achieved this best score.

In [7]:
print("Best score: {:.3f}".format(clf.best_score_))

Best score: -0.662
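
The score is negative because `sklearn` scorers follow a "greater is better" convention, so `neg_log_loss` reports the log loss with its sign flipped. A minimal sketch on synthetic data (the dataset and the `LogisticRegression` model are assumptions chosen only to keep the example fast):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer, log_loss

# Small synthetic 3-class problem, for illustration only
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# The scorer used by GridSearchCV when scoring='neg_log_loss'
scorer = get_scorer('neg_log_loss')
score = scorer(model, X, y)

# The plain log loss computed from the predicted probabilities
loss = log_loss(y, model.predict_proba(X))

print(score, loss)
```

The two numbers have the same magnitude and opposite signs, so a best score of -0.662 corresponds to a log loss of 0.662.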


In [8]:
print("Best parameter set:")
best_parameters=clf.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Best parameter set:
	max_depth: 14
	n_estimators: 1000
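
The scores for every hyperparameter combination are kept in the `cv_results_` attribute of the fitted search. The sketch below shows one way to turn them into a table; it runs on a small synthetic dataset with a reduced grid (both are assumptions for the sake of speed), but the same lines should work on the `clf` fitted above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic 3-class problem and a tiny grid, for illustration only
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {'n_estimators': [10, 20], 'max_depth': [2, 4]},
                      scoring='neg_log_loss', cv=3)
search.fit(X, y)

# One row per candidate, sorted from best to worst
results = pd.DataFrame(search.cv_results_)
scores = results[['param_max_depth', 'param_n_estimators',
                  'mean_test_score', 'std_test_score',
                  'rank_test_score']].sort_values('rank_test_score')
print(scores)
```

Looking at `std_test_score` next to `mean_test_score` helps to judge whether the difference between two candidates is larger than the variation across folds.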


We can now retrain our model on all the train/validate data.

In [10]:
rf = RandomForestClassifier(criterion='entropy', 
                            max_depth = 14,
                            n_estimators = 1000,
                            n_jobs=-1)

rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='entropy', max_depth=14, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)
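
Note that `GridSearchCV` uses `refit=True` by default, so after `fit` its `best_estimator_` has already been retrained on all the data passed to the search. The sketch below illustrates this on synthetic data (an assumption, to keep it fast) and shows how `best_params_` can be unpacked instead of typing the values by hand.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic 3-class problem and a tiny grid, for illustration only
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
search = GridSearchCV(RandomForestClassifier(criterion='entropy', random_state=0),
                      {'n_estimators': [10, 20], 'max_depth': [2, 4]},
                      scoring='neg_log_loss', cv=3)
search.fit(X, y)

# Option 1: the search already holds a model refit on all of X, y
refit_model = search.best_estimator_

# Option 2: rebuild the model explicitly from the best parameters
manual_model = RandomForestClassifier(criterion='entropy', random_state=0,
                                      **search.best_params_)
manual_model.fit(X, y)

print(search.best_params_)
```

Retraining by hand, as done above, is still useful when the final model needs settings that differ from those used in the search.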

We load the test data and prepare it in the same way as the training data.

In [11]:
df_test = pd.read_csv('data/test.csv', index_col='trip_id')

target = ['travel_mode']
id_context = ['trip_id', 
              'household_id', 
              'person_n', 
              'trip_n',
              'survey_year',
              'travel_year'
             ]
features = [c for c in df_test.columns 
            if c not in (target + id_context)]

y_test = df_test[target].travel_mode.values
X_test = df_test[features]

We then predict the class probabilities and compute the cross-entropy loss.

In [12]:
pred_proba = rf.predict_proba(X_test)

In [13]:
log_loss(y_test, pred_proba)

0.6776728179738136
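
To make clear what this number measures, the log loss is the average negative log of the probability that the model assigned to the true class. A minimal sketch with made-up labels and probabilities (not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical true classes and predicted probabilities for 4 samples
y_true = np.array([0, 2, 1, 2])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.3, 0.4]])

# Probability assigned to the true class of each sample
p_true = proba[np.arange(len(y_true)), y_true]

# Mean negative log-probability of the true class
manual = -np.mean(np.log(p_true))

print(manual, log_loss(y_true, proba))
```

Confident wrong predictions are penalised heavily, since the log of a small probability is a large negative number.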

We can see that cross-validation slightly over-estimates model performance: the mean cross-validated log loss was 0.662 (the `best_score_` of -0.662 is its negative), while the log loss on the test data is 0.678.

## <span style="color:red"> Discussion </span>

# XGBoost and CV

Let's do the same thing with `xgboost`. `GridSearchCV` does not give us the early stopping provided by `xgboost.cv`, so we loop over the candidate values ourselves, as shown multiple times in the previous notebooks.

To use the `cv` function from `xgboost`, we have to convert our data into a `DMatrix`. So, we do that first.

In [14]:
xgtrain = xgb.DMatrix(X_train, label=y_train)

In [15]:
max_depths = list(range(2, 16))

ll = []
ne = []

for d in max_depths:
    
    print("CV for max_depth = {}".format(d))
    
    clf = xgb.XGBClassifier(objective='multi:softprob', 
                            max_depth=d,
                            n_jobs=8)
    
    xgb_param = clf.get_xgb_params()
    
    xgb_param['num_class'] = 4

        
    cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=1000, nfold=5,
                      metrics=['mlogloss'], early_stopping_rounds=10, 
                      verbose_eval=False, seed=42)
    
    ll.append(min(cvresult['test-mlogloss-mean']))
    ne.append(cvresult['test-mlogloss-mean'].idxmin())
    

CV for max_depth = 2
CV for max_depth = 3
CV for max_depth = 4
CV for max_depth = 5
CV for max_depth = 6
CV for max_depth = 7
CV for max_depth = 8
CV for max_depth = 9
CV for max_depth = 10
CV for max_depth = 11
CV for max_depth = 12
CV for max_depth = 13
CV for max_depth = 14
CV for max_depth = 15


In [16]:
print("Best log loss found: {}".format(np.min(ll)))

Best log loss found: 0.5065784


In [17]:
depth_idx = np.argmin(ll)
best_depth = max_depths[depth_idx]
best_ne = ne[depth_idx]

print("Best value for max_depth = {}".format(best_depth))
print("{} rounds of boosting needed".format(best_ne))

Best value for max_depth = 9
125 rounds of boosting needed


We can now fit the model on all of the `train_validate` data with the best `max_depth` and the corresponding number of boosting rounds, and test it on the test data as we did with the previous model.

In [18]:
clf = xgb.XGBClassifier(objective='multi:softprob', 
                        n_estimators = best_ne,
                        max_depth=best_depth,
                        n_jobs=8)

clf.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster=None, colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints=None,
              learning_rate=0.300000012, max_delta_step=0, max_depth=9,
              min_child_weight=1, missing=nan, monotone_constraints=None,
              n_estimators=125, n_jobs=8, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method=None, validate_parameters=False, verbosity=None)

In [19]:
df_test = pd.read_csv('data/test.csv', index_col='trip_id')

target = ['travel_mode']
id_context = ['trip_id', 
              'household_id', 
              'person_n', 
              'trip_n',
              'survey_year',
              'travel_year'
             ]
features = [c for c in df_test.columns 
            if c not in (target + id_context)]

y_test = df_test[target].travel_mode.values
X_test = df_test[features]

In [20]:
pred_proba = clf.predict_proba(X_test)

In [21]:
log_loss(y_test, pred_proba)

0.761481447508331