## Fine-Tuning Random Forest Regressor

This notebook assumes the data has been split into a training and a test set. If not, run get_data.ipynb first.

In [1]:
import pandas as pd

TRAINING_FILEPATH = 'data/training_set.csv'
TEST_FILEPATH = 'data/test_set.csv'

training_set = pd.read_csv(TRAINING_FILEPATH, index_col='index')
test_set = pd.read_csv(TEST_FILEPATH, index_col='index')

### Fine-Tuning Random Forest Regressor

In [20]:
from preprocessing_utils import FeaturePreprocessor, separate_features_targets

train_X, train_y = separate_features_targets(training_set)

# preprocess training features (add combinations, power transform)
preprocessor = FeaturePreprocessor(add_combinations=True, powertransform_num=True, onehot_type=True)
train_X = preprocessor.fit_transform(train_X)

In [3]:
from train_utils import cross_val_rmse
from sklearn.ensemble import RandomForestRegressor

# random forest regressor model with default hyperparameters
base_forest = RandomForestRegressor(n_estimators=100)
base_forest_errors = cross_val_rmse(base_forest, train_X, train_y, cv=5,
                                    random_state=42, verbose=True)
display(base_forest_errors)

Starting 5-fold cross validation
fit 0	train RMSE: 0.299	 val RMSE: 0.815	 train time: 5.72 s
fit 1	train RMSE: 0.299	 val RMSE: 0.797	 train time: 5.69 s
fit 2	train RMSE: 0.298	 val RMSE: 0.811	 train time: 5.62 s
fit 3	train RMSE: 0.300	 val RMSE: 0.815	 train time: 5.64 s
fit 4	train RMSE: 0.300	 val RMSE: 0.786	 train time: 5.62 s


| | fold 0 | fold 1 | fold 2 | fold 3 | fold 4 | mean | std |
|---|---|---|---|---|---|---|---|
| train | 0.299417 | 0.299162 | 0.298058 | 0.300439 | 0.300183 | 0.299452 | 0.00094 |
| val | 0.814788 | 0.797459 | 0.811265 | 0.814552 | 0.785646 | 0.804742 | 0.01282 |


The base model overfits heavily: the mean training RMSE (0.299) is far below the mean validation RMSE (0.805).
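
The train/validation comparison above can be reproduced with plain scikit-learn. The sketch below is only a guess at what `cross_val_rmse` does internally (its source is not shown here); `kfold_rmse` and the synthetic data are hypothetical stand-ins.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def kfold_rmse(model, X, y, n_splits=5, random_state=42):
    """Hypothetical stand-in for cross_val_rmse: per-fold train and validation RMSE."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    train_scores, val_scores = [], []
    for train_idx, val_idx in kfold.split(X):
        # Refit on the training part of the fold, then score both parts.
        model.fit(X[train_idx], y[train_idx])
        train_pred = model.predict(X[train_idx])
        val_pred = model.predict(X[val_idx])
        train_scores.append(np.sqrt(mean_squared_error(y[train_idx], train_pred)))
        val_scores.append(np.sqrt(mean_squared_error(y[val_idx], val_pred)))
    return np.array(train_scores), np.array(val_scores)


# Synthetic data stands in for the preprocessed training set.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
demo_forest = RandomForestRegressor(n_estimators=20, random_state=0)
train_rmse, val_rmse = kfold_rmse(demo_forest, X_demo, y_demo)
print(f"mean train RMSE: {train_rmse.mean():.3f}, mean val RMSE: {val_rmse.mean():.3f}")
```

A training RMSE well below the validation RMSE, as in the table above, is the usual sign that fully grown trees are memorizing the training folds.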

In [4]:
from train_utils import save_model, load_model

save_model(base_forest, "base_forest.pkl")
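
`save_model` and `load_model` come from the project's `train_utils` module, which is not shown here. As a rough sketch, assuming they are thin wrappers around `pickle`, they might look like the hypothetical `save_model_sketch` and `load_model_sketch` below. A pickled estimator is only useful for prediction if it was fitted before saving.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def save_model_sketch(model, filepath):
    """Hypothetical save helper: serialize the estimator with pickle."""
    with open(filepath, "wb") as f:
        pickle.dump(model, f)


def load_model_sketch(filepath):
    """Hypothetical load helper: restore a pickled estimator."""
    with open(filepath, "rb") as f:
        return pickle.load(f)


# Fit a small forest on synthetic data, then round-trip it through disk.
X_demo, y_demo = make_regression(n_samples=100, n_features=5, random_state=0)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_forest.pkl")
    save_model_sketch(forest, path)
    restored = load_model_sketch(path)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(forest.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```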

Random Forest Regressor Hyperparameters:
- n_estimators: number of trees in the forest
- max_features: max number of features considered for splitting a node
- max_depth: max number of levels in each decision tree
- min_samples_split: min number of data points a node must contain before it can be split
- min_samples_leaf: min number of data points allowed in a leaf node
- bootstrap: sample points with replacement for each tree (True) or use the whole dataset for each tree (False)
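
To make the `bootstrap` option concrete, the sketch below draws one bootstrap sample with NumPy and then sets the hyperparameters listed above explicitly. The values are arbitrary illustrations, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples = 10_000

# A bootstrap sample draws n_samples row indices WITH replacement,
# so some rows repeat and others are left out for a given tree.
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print(f"fraction of distinct rows in one bootstrap sample: {unique_fraction:.3f}")

# Illustrative (arbitrary) settings for the hyperparameters listed above.
forest = RandomForestRegressor(
    n_estimators=200,
    max_features="sqrt",
    max_depth=None,
    min_samples_split=10,
    min_samples_leaf=5,
    bootstrap=True,
)
params = forest.get_params()
print({name: params[name] for name in ("n_estimators", "max_features", "bootstrap")})
```

On average about 63% of the rows (1 − 1/e) appear in any one bootstrap sample, which is part of what decorrelates the trees.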

In [14]:
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# run a randomized search on hyperparameters
forest_random_params = {
    'n_estimators': [int(x) for x in np.linspace(200, 2000, 10)],
    'max_features': ['auto', 'sqrt'],
    'max_depth': [int(x) for x in np.linspace(10, 110, 11)] + [None],
    'min_samples_split': [2, 10, 100, 1000],
    'min_samples_leaf': [1, 5, 50, 500],
    'bootstrap': [False, True]
}

forest_random_search = RandomizedSearchCV(RandomForestRegressor(), forest_random_params, n_iter=100,
                                          scoring='neg_mean_squared_error', cv=5, n_jobs=-1, verbose=50)
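
To see how much of the search space 100 iterations cover, the sketch below counts the combinations with `ParameterGrid` and draws candidates with `ParameterSampler`, the utility `RandomizedSearchCV` uses to sample. One assumption is labeled in the code: scikit-learn 1.3 and later no longer accept `max_features='auto'`, and for regressors `1.0` (all features) is the equivalent value, so the sketch uses `1.0` instead.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

search_space = {
    "n_estimators": [int(x) for x in np.linspace(200, 2000, 10)],
    # Assumption: running on scikit-learn >= 1.3, where 'auto' was removed;
    # 1.0 (all features) is what 'auto' meant for regressors.
    "max_features": [1.0, "sqrt"],
    "max_depth": [int(x) for x in np.linspace(10, 110, 11)] + [None],
    "min_samples_split": [2, 10, 100, 1000],
    "min_samples_leaf": [1, 5, 50, 500],
    "bootstrap": [False, True],
}

# Total number of distinct hyperparameter combinations in the space.
n_combinations = len(ParameterGrid(search_space))

# Draw 100 candidates, as n_iter=100 does; random_state here is arbitrary.
candidates = list(ParameterSampler(search_space, n_iter=100, random_state=0))

# Each candidate is fitted once per fold with cv=5.
n_fits = len(candidates) * 5
print(n_combinations, len(candidates), n_fits)
```

With 7680 possible combinations, 100 iterations sample roughly 1.3% of the space, at a cost of 500 model fits.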

In [15]:
forest_random_search.fit(train_X, train_y)
print(forest_random_search.best_params_)
save_model(forest_random_search.best_estimator_, "rndsearch_forest.pkl")

[Parallel(n_jobs=-1)]: Done 168 tasks      | elapsed: 16.6min
[Parallel(n_jobs=-1)]: Done 169 tasks      | elapsed: 16.6min
[Parallel(n_jobs=-1)]: Done 170 tasks      | elapsed: 16.6min
[Parallel(n_jobs=-1)]: Done 171 tasks      | elapsed: 16.6min
[Parallel(n_jobs=-1)]: Done 172 tasks      | elapsed: 16.7min
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed: 16.7min
[Parallel(n_jobs=-1)]: Done 174 tasks      | elapsed: 16.7min
[Parallel(n_jobs=-1)]: Done 175 tasks      | elapsed: 16.8min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed: 16.8min
[Parallel(n_jobs=-1)]: Done 177 tasks      | elapsed: 16.8min
[Parallel(n_jobs=-1)]: Done 178 tasks      | elapsed: 16.9min
[Parallel(n_jobs=-1)]: Done 179 tasks      | elapsed: 17.0min
[Parallel(n_jobs=-1)]: Done 180 tasks      | elapsed: 17.0min
[Parallel(n_jobs=-1)]: Done 181 tasks      | elapsed: 17.1min
[Parallel(n_jobs=-1)]: Done 182 tasks      | elapsed: 17.1min
[Parallel(n_jobs=-1)]: Done 183 tasks      | elapsed: 17.1min

In [157]:
from train_utils import save_cv_results

save_cv_results(forest_random_search.cv_results_, 'forest_rndsearch_results.csv')

In [212]:
from train_utils import summarize_cv_results

forest_rndsearch_summary = summarize_cv_results(forest_random_search.cv_results_)
forest_rndsearch_summary.head()

| | bootstrap | max_depth | max_features | min_samples_leaf | min_samples_split | n_estimators | train_error_mean | val_error_mean | error_diff_mean | train_error_std | val_error_std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | | auto | 500 | 100 | 200 | 0.957593 | 0.96362 | 0.006027 | 0.001671 | 0.007002 |
| 1 | True | 20.0 | auto | 5 | 10 | 400 | 0.556216 | 0.80119 | 0.244974 | 0.002654 | 0.00621 |
| 2 | True | 110.0 | sqrt | 50 | 1000 | 1800 | 0.963081 | 0.968507 | 0.005427 | 0.001884 | 0.009572 |
| 3 | True | 100.0 | auto | 50 | 2 | 2000 | 0.81311 | 0.851608 | 0.038498 | 0.002626 | 0.004816 |
| 4 | False | 60.0 | sqrt | 1 | 10 | 1400 | 0.38707 | 0.803025 | 0.415955 | 0.001367 | 0.008934 |
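
`summarize_cv_results` is a project helper whose source is not shown. The sketch below is one plausible way to build a similar table from `cv_results_`, assuming the reported errors are RMSE values derived from the negated-MSE scores. `summarize_cv_results_sketch` is a hypothetical name. On recent scikit-learn versions the search has to be created with `return_train_score=True` for the train-score columns to exist.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


def summarize_cv_results_sketch(cv_results):
    """Hypothetical summary: per-candidate mean/std RMSE from neg-MSE fold scores."""
    results = pd.DataFrame(cv_results)
    train_cols = [c for c in results.columns
                  if c.startswith("split") and c.endswith("_train_score")]
    val_cols = [c for c in results.columns
                if c.startswith("split") and c.endswith("_test_score")]
    # Scores are negated MSE, so flip the sign and take the root per fold.
    train_rmse = np.sqrt(-results[train_cols])
    val_rmse = np.sqrt(-results[val_cols])

    summary = pd.DataFrame(results["params"].tolist())
    summary["train_error_mean"] = train_rmse.mean(axis=1)
    summary["val_error_mean"] = val_rmse.mean(axis=1)
    summary["error_diff_mean"] = summary["val_error_mean"] - summary["train_error_mean"]
    summary["train_error_std"] = train_rmse.std(axis=1)
    summary["val_error_std"] = val_rmse.std(axis=1)
    return summary


# Tiny demo search on synthetic data.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
demo_search = GridSearchCV(
    RandomForestRegressor(n_estimators=10, random_state=0),
    {"min_samples_leaf": [1, 20]},
    scoring="neg_mean_squared_error",
    cv=3,
    return_train_score=True,
)
demo_search.fit(X_demo, y_demo)
demo_summary = summarize_cv_results_sketch(demo_search.cv_results_)
print(demo_summary)
```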


In [222]:
forest_rndsearch_summary.loc[[forest_random_search.best_index_]]

| | bootstrap | max_depth | max_features | min_samples_leaf | min_samples_split | n_estimators | train_error_mean | val_error_mean | error_diff_mean | train_error_std | val_error_std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 81 | True | | auto | 1 | 10 | 600 | 0.453795 | 0.799826 | 0.346031 | 0.001457 | 0.007141 |


The best model from the random search overfits less than the base model (train error 0.454 vs. validation error 0.800), but the gap is still too large.

In [224]:
# find models with a low validation error that do not overfit too much
condition = (forest_rndsearch_summary['val_error_mean'] < 0.9) & (forest_rndsearch_summary['error_diff_mean'] < 0.1)
forest_rndsearch_summary[condition].sort_values(by='val_error_mean').head(10)

| | bootstrap | max_depth | max_features | min_samples_leaf | min_samples_split | n_estimators | train_error_mean | val_error_mean | error_diff_mean | train_error_std | val_error_std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 47 | True | | auto | 1 | 100 | 800 | 0.746378 | 0.829199 | 0.082821 | 0.002351 | 0.004911 |
| 29 | True | 60.0 | auto | 5 | 100 | 600 | 0.759906 | 0.829323 | 0.069418 | 0.00231 | 0.004571 |
| 19 | True | 100.0 | auto | 1 | 100 | 1200 | 0.746162 | 0.829471 | 0.083309 | 0.002391 | 0.004434 |
| 17 | True | 60.0 | auto | 5 | 100 | 2000 | 0.759973 | 0.8295 | 0.069527 | 0.00227 | 0.004565 |
| 58 | True | 40.0 | auto | 5 | 100 | 1600 | 0.76001 | 0.829541 | 0.069532 | 0.002218 | 0.00462 |
| 33 | False | 40.0 | sqrt | 5 | 100 | 2000 | 0.782186 | 0.849682 | 0.067496 | 0.001351 | 0.007261 |
| 66 | False | 80.0 | sqrt | 5 | 100 | 1200 | 0.782246 | 0.849708 | 0.067462 | 0.001288 | 0.00763 |
| 22 | False | 30.0 | sqrt | 5 | 100 | 1600 | 0.782232 | 0.850119 | 0.067888 | 0.001468 | 0.007159 |
| 65 | True | 20.0 | auto | 50 | 100 | 2000 | 0.813129 | 0.851497 | 0.038368 | 0.002635 | 0.004493 |
| 3 | True | 100.0 | auto | 50 | 2 | 2000 | 0.81311 | 0.851608 | 0.038498 | 0.002626 | 0.004816 |


In [225]:
# Best parameters so far:
# - bootstrap = True
# - max_depth = [None, 60, 100]
# - max_features = 'auto'
# - min_samples_leaf = [1, 5]
# - min_samples_split = 100
# - n_estimators = [600, 800, 1200, 2000]

In [141]:
from sklearn.model_selection import GridSearchCV

# run a grid search on a narrow range of hyperparameters
forest_grid_params = {
    'n_estimators': [1200, 1600, 2000, 2400],
    'max_features': ['auto'],
    'max_depth': [60, 100, 150, None],
    'min_samples_split': [80, 100, 140, 200],
    'min_samples_leaf': [1, 5, 10],
    'bootstrap': [True]
}

forest_grid_search = GridSearchCV(RandomForestRegressor(), forest_grid_params,
                                  scoring='neg_mean_squared_error', cv=5, n_jobs=-1, verbose=50)
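
Unlike the randomized search, a grid search fits every combination, so it helps to count the work before launching it. The sketch below does that for the grid defined above; `grid_params` is just a local copy for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Local copy of the grid defined above. ParameterGrid only enumerates
# combinations, it does not validate the values against the estimator.
grid_params = {
    "n_estimators": [1200, 1600, 2000, 2400],
    "max_features": ["auto"],
    "max_depth": [60, 100, 150, None],
    "min_samples_split": [80, 100, 140, 200],
    "min_samples_leaf": [1, 5, 10],
    "bootstrap": [True],
}

n_candidates = len(ParameterGrid(grid_params))  # every combination is tried
n_fits = n_candidates * 5                       # one fit per fold with cv=5
print(n_candidates, n_fits)
```

That is 192 candidates and 960 fits of forests with 1200 to 2400 trees each, which explains a run time measured in hours rather than minutes.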

In [142]:
forest_grid_search.fit(train_X, train_y)

| elapsed: 147.4min
[Parallel(n_jobs=-1)]: Done 630 tasks      | elapsed: 147.5min
[Parallel(n_jobs=-1)]: Done 631 tasks      | elapsed: 148.0min
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed: 148.1min
[Parallel(n_jobs=-1)]: Done 633 tasks      | elapsed: 148.1min
[Parallel(n_jobs=-1)]: Done 634 tasks      | elapsed: 148.6min
[Parallel(n_jobs=-1)]: Done 635 tasks      | elapsed: 148.8min
[Parallel(n_jobs=-1)]: Done 636 tasks      | elapsed: 149.2min
[Parallel(n_jobs=-1)]: Done 637 tasks      | elapsed: 149.3min
[Parallel(n_jobs=-1)]: Done 638 tasks      | elapsed: 149.5min
[Parallel(n_jobs=-1)]: Done 639 tasks      | elapsed: 149.5min
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed: 149.8min
[Parallel(n_jobs=-1)]: Done 641 tasks      | elapsed: 149.9min
[Parallel(n_jobs=-1)]: Done 642 tasks      | elapsed: 150.0min
[Parallel(n_jobs=-1)]: Done 643 tasks      | elapsed: 150.0min
[Parallel(n_jobs=-1)]: Done 644 tasks      | elapsed: 150.4min
[Parallel(n_jobs=-1)]: Done 645 tas

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'n_estimators': [1200, 1600, 2000, 2400], 'max_features': ['auto'], 'max_depth': [60, 100, 150, None], 'min_samples_split': [80, 100, 140, 200], 'min_samples_leaf': [1, 5, 10], 'bootstrap': [True]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=50)

In [143]:
print(forest_grid_search.best_params_)
save_model(forest_grid_search.best_estimator_, "gridsearch_forest.pkl")

{'bootstrap': True, 'max_depth': None, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 80, 'n_estimators': 1200}


In [158]:
save_cv_results(forest_grid_search.cv_results_, "forest_gridsearch_results.csv")

In [231]:
forest_gridsearch_summary = summarize_cv_results(forest_grid_search.cv_results_)
forest_gridsearch_summary.head()

| | bootstrap | max_depth | max_features | min_samples_leaf | min_samples_split | n_estimators | train_error_mean | val_error_mean | error_diff_mean | train_error_std | val_error_std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | 60.0 | auto | 1 | 80 | 1200 | 0.725895 | 0.824152 | 0.098257 | 0.002671 | 0.00494 |
| 1 | True | 60.0 | auto | 1 | 80 | 1600 | 0.725903 | 0.823978 | 0.098076 | 0.002743 | 0.00496 |
| 2 | True | 60.0 | auto | 1 | 80 | 2000 | 0.725848 | 0.823915 | 0.098067 | 0.002682 | 0.005157 |
| 3 | True | 60.0 | auto | 1 | 80 | 2400 | 0.725925 | 0.824183 | 0.098258 | 0.00263 | 0.00487 |
| 4 | True | 60.0 | auto | 1 | 100 | 1200 | 0.746206 | 0.829122 | 0.082916 | 0.002466 | 0.005022 |


In [232]:
forest_gridsearch_summary.loc[[forest_grid_search.best_index_]]

| | bootstrap | max_depth | max_features | min_samples_leaf | min_samples_split | n_estimators | train_error_mean | val_error_mean | error_diff_mean | train_error_std | val_error_std |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 144 | True | | auto | 1 | 80 | 1200 | 0.725771 | 0.823648 | 0.097876 | 0.002576 | 0.004816 |


- The best model from grid search overfits far less (train error 0.726 vs. validation error 0.824), at the cost of a validation error slightly higher than the base model's (0.805)
- The validation error is not decreasing sharply across candidates, so more fine-tuning probably won't yield a much better model
- This is the best random forest regressor so far
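
The trade-off in these conclusions is easier to see with the three models side by side. The sketch below collects the mean errors reported earlier in this notebook, assuming the base model's RMSE and the search summaries' errors are the same metric, and recomputes the train/validation gap.

```python
import pandas as pd

# Mean errors copied from the outputs above (base CV table and the two best_index_ rows).
comparison = pd.DataFrame(
    {
        "train_error_mean": [0.299452, 0.453795, 0.725771],
        "val_error_mean": [0.804742, 0.799826, 0.823648],
    },
    index=["base forest", "random search best", "grid search best"],
)

# Gap between validation and training error: a rough measure of overfitting.
comparison["error_diff_mean"] = comparison["val_error_mean"] - comparison["train_error_mean"]

# Relative change in validation error compared with the base model, in percent.
base_val = comparison.loc["base forest", "val_error_mean"]
comparison["val_change_vs_base_pct"] = 100 * (comparison["val_error_mean"] / base_val - 1)

print(comparison.round(3))
```

The grid-search model gives up roughly 2% in validation error relative to the base model while shrinking the train/validation gap from about 0.51 to about 0.10.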

In [233]:
save_model(forest_grid_search.best_estimator_, 'best_forest.pkl')