## 5. Fine-Tune Your Model

A Random Forest has many hyperparameters that can be tuned to improve performance. For example, the forest could have 100 or 1,000 trees, and each split could consider 10 or 50 randomly selected features. Which values should we pass to the model for training?

We can use Scikit-learn’s `GridSearchCV` for this: we tell it which hyperparameters to explore and which values to try, and it evaluates every combination of those values using cross-validation.

In [50]:
from sklearn.model_selection import GridSearchCV

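param_grid = [
    # First sub-grid: 4 x 7 = 28 combinations, with the default bootstrap=True
    {'n_estimators': [10, 50, 100, 150], 'max_features': [10, 20, 30, 40, 50, 100, 150]},
    # Second sub-grid: the same 28 combinations, with bootstrapping disabled
    {'bootstrap': [False], 'n_estimators': [10, 50, 100, 150], 'max_features': [10, 20, 30, 40, 50, 100, 150]},
]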

forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid=[{'n_estimators': [10, 50, 100, 150], 'max_features': [10, 20, 30, 40, 50, 100, 150]}, {'bootstrap': [False], 'n_estimators': [10, 50, 100, 150], 'max_features': [10, 20, 30, 40, 50, 100, 150]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='neg_mean_squared_error', verbose=0)
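To get a feel for the cost of this search, the number of candidate combinations can be counted by hand. The sketch below uses only the standard library and a copy of the grid defined above; the helper name `count_candidates` is hypothetical and not part of Scikit-learn.

```python
from itertools import product

# Copy of the grid passed to GridSearchCV above.
param_grid = [
    {'n_estimators': [10, 50, 100, 150],
     'max_features': [10, 20, 30, 40, 50, 100, 150]},
    {'bootstrap': [False],
     'n_estimators': [10, 50, 100, 150],
     'max_features': [10, 20, 30, 40, 50, 100, 150]},
]

def count_candidates(grid):
    """Count hyperparameter combinations across all sub-grids (hypothetical helper)."""
    return sum(len(list(product(*sub.values()))) for sub in grid)

n_candidates = count_candidates(param_grid)
cv_folds = 5
n_fits = n_candidates * cv_folds  # one fit per combination per fold
print(n_candidates, n_fits)
```

With this grid that is 56 combinations and 280 fits. Because `refit=True` appears in the output above, one more fit is run on the full training set with the best combination.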

We can use `best_params_` to inspect the best values found for the hyperparameters, and `best_estimator_` to get the fine-tuned model:

In [51]:
grid_search.best_params_

{'bootstrap': False, 'max_features': 50, 'n_estimators': 150}

In [52]:
grid_search.best_estimator_

RandomForestRegressor(bootstrap=False, criterion='mse', max_depth=None,
           max_features=50, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=150, n_jobs=None, oob_score=False,
           random_state=None, verbose=0, warm_start=False)

In [53]:
cvres = grid_search.cv_results_
# Scores are negated MSE values, so negate them back before taking the square root
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

0.16045887516202725 {'max_features': 10, 'n_estimators': 10}
0.1463990948564421 {'max_features': 10, 'n_estimators': 50}
0.1412026947487663 {'max_features': 10, 'n_estimators': 100}
0.14246253054769992 {'max_features': 10, 'n_estimators': 150}
0.14691746270767267 {'max_features': 20, 'n_estimators': 10}
0.13597115151245137 {'max_features': 20, 'n_estimators': 50}
0.1339317384657923 {'max_features': 20, 'n_estimators': 100}
0.13586344839124506 {'max_features': 20, 'n_estimators': 150}
0.14151866730008977 {'max_features': 30, 'n_estimators': 10}
0.1340031945390629 {'max_features': 30, 'n_estimators': 50}
0.13223525140249814 {'max_features': 30, 'n_estimators': 100}
0.13125997327589092 {'max_features': 30, 'n_estimators': 150}
0.13940136029754568 {'max_features': 40, 'n_estimators': 10}
0.1328361901145508 {'max_features': 40, 'n_estimators': 50}
0.1300245720445149 {'max_features': 40, 'n_estimators': 100}
0.1298365980534873 {'max_features': 40, 'n_estimators': 150}
0.14116266819925172 {'m
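Because the scoring function is `neg_mean_squared_error`, the stored scores are negated MSE values, which is why the loop negates them before taking the square root. Here is a minimal sketch of the same conversion, followed by a ranking step, on a hand-built dictionary that mimics the layout of `cv_results_` for three of the rows printed above. The `toy_cvres` name and the squared-back scores are illustrative assumptions, not output from the original run.

```python
import math

# RMSE values copied (rounded) from three rows of the output above.
printed_rmse = [
    (0.1605, {'max_features': 10, 'n_estimators': 10}),
    (0.1322, {'max_features': 30, 'n_estimators': 100}),
    (0.1298, {'max_features': 40, 'n_estimators': 150}),
]

# Rebuild what the search would store: the negated mean squared error.
toy_cvres = {
    "mean_test_score": [-(rmse ** 2) for rmse, _ in printed_rmse],
    "params": [params for _, params in printed_rmse],
}

# Convert back to RMSE and sort so the lowest error comes first.
ranked = sorted(
    ((math.sqrt(-score), params)
     for score, params in zip(toy_cvres["mean_test_score"], toy_cvres["params"])),
    key=lambda pair: pair[0],
)

for rmse, params in ranked:
    print(f"{rmse:.4f} {params}")

best_rmse, best_params = ranked[0]
```

This ranks only the three sampled rows; the best combination over the full grid is the one reported by `best_params_`.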

Now that we know the best hyperparameter values found by the search (`bootstrap=False`, `max_features=50`, `n_estimators=150`), let’s plug them in and see whether our Random Forest has improved over the vanilla Random Forest we trained earlier alongside the other models:

In [54]:
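rf_model_final = RandomForestRegressor(bootstrap=False, max_features=50, n_estimators=150, random_state=5)

rf_model_final.fit(X_train, y_train)
rf_final_val_predictions = rf_model_final.predict(X_test)
# mean_squared_error expects (y_true, y_pred); inv_y (defined earlier) undoes the target transformation.
# Despite its name, this variable holds the MSE; the square root below turns it into the RMSE.
rf_final_val_rmse = mean_squared_error(inv_y(y_test), inv_y(rf_final_val_predictions))
np.sqrt(rf_final_val_rmse)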

28801.819287572634
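The error above is computed after mapping both the targets and the predictions back through `inv_y`, so it is expressed in the original units of the target. A minimal sketch of that idea in plain Python follows. It assumes the target was log-transformed, so the `inv_y` defined here is a hypothetical stand-in for the notebook's own function, and the numbers are made up for illustration.

```python
import math

def inv_y(values):
    """Hypothetical stand-in for the notebook's inv_y: undo a log transform."""
    return [math.exp(v) for v in values]

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up log-scale targets and predictions, for illustration only.
y_log_true = [math.log(200_000), math.log(150_000), math.log(320_000)]
y_log_pred = [math.log(210_000), math.log(140_000), math.log(300_000)]

# Invert the transform first, then measure the error in original units.
error = rmse(inv_y(y_log_true), inv_y(y_log_pred))
print(round(error, 2))
```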

In [55]:
rf_model_final.score(X_test, y_test)*100

87.81704897477596
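For a regressor, `score` returns the coefficient of determination R², so the value above reads as roughly 87.8% of the variance in `y_test` being explained by the model. A minimal sketch of that formula on made-up numbers (the helper name `r2_score_manual` is hypothetical):

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, computed in plain Python (hypothetical helper)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - ss_res / ss_tot

# Made-up values, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

r2 = r2_score_manual(y_true, y_pred)
print(r2 * 100)
```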

In [56]:
feature_importances = grid_search.best_estimator_.feature_importances_

The `feature_importances_` attribute of the best estimator holds one importance score per input feature. Let’s display these scores next to their corresponding attribute names:

In [57]:
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

[(0.13587910160757175, 'OverallQual'),
 (0.11448897858060943, 'GrLivArea'),
 (0.09581655010849015, 'YearBuilt'),
 (0.07582896748463265, 'GarageCars'),
 (0.046056980286277464, 'TotalBsmtSF'),
 (0.04487549401201833, '1stFlrSF'),
 (0.041393174102971615, 'FullBath'),
 (0.023880560131905645, 'LotArea'),
 (0.021393985092654423, 'YearRemodAdd'),
 (0.01724007555749671, '2ndFlrSF'),
 (0.015760931004710026, 'BsmtFinSF1'),
 (0.015735215913831282, 'LotFrontage'),
 (0.015286111385595777, 'Fireplaces'),
 (0.01068633452178631, 'TotRmsAbvGrd'),
 (0.008698353502217744, 'OverallCond'),
 (0.0061489973736860975, 'BsmtUnfSF'),
 (0.005958047979369089, 'OpenPorchSF'),
 (0.005208876302112425, 'MasVnrArea'),
 (0.0036112351500639174, 'WoodDeckSF'),
 (0.0034948654242381166, 'BedroomAbvGr'),
 (0.0031363147175676787, 'HalfBath'),
 (0.002912991933145176, 'RM'),
 (0.002525673214705328, 'RL'),
 (0.0017716519205522537, 'BsmtFullBath'),
 (0.0015500110819655765, 'EnclosedPorch'),
 (0.0011829609950252812, 'C (all)'),
 (0
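Since the importances of a fitted forest are normalized to sum to 1, a running total shows how much of the overall importance the top few features carry. A minimal sketch using the first five rows of the output above, with values rounded (the `top_features` name is illustrative):

```python
from itertools import accumulate

# First five (importance, name) pairs from the sorted output above, rounded.
top_features = [
    (0.1359, 'OverallQual'),
    (0.1145, 'GrLivArea'),
    (0.0958, 'YearBuilt'),
    (0.0758, 'GarageCars'),
    (0.0461, 'TotalBsmtSF'),
]

# Running sum of the importance scores, in ranked order.
running_totals = list(accumulate(score for score, _ in top_features))

for (score, name), total in zip(top_features, running_totals):
    print(f"{name:<12} {score:.4f} cumulative={total:.4f}")
```

Here the five strongest features account for a little under half of the total importance.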