# Machine Learning with H2O - Tutorial 3b: Regression Models (Grid Search)

<hr>

**Objective**:

- This tutorial explains how to fine-tune regression models for better out-of-sample performance, measured on a held-out test set.

<hr>

**Wine Quality Dataset:**

- Source: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
- CSV (https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv)

<hr>
    
**Steps**:

1. GBM with default settings
2. GBM with manual settings
3. GBM with manual settings & cross-validation
4. GBM with manual settings, cross-validation and early stopping
5. GBM with cross-validation, early stopping and full grid search
6. GBM with cross-validation, early stopping and random grid search
7. Model stacking (combining different GLM, DRF, GBM and DNN models)


<hr>

**Full Technical Reference:**

- http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html

<br>


In [None]:
# Start and connect to a local H2O cluster
import h2o
h2o.init(nthreads = -1)

<br>

In [None]:
# Import wine quality data from a local CSV file
wine = h2o.import_file("winequality-white.csv")
wine.head(5)

In [None]:
# Define features (or predictors)
features = list(wine.columns) # we want to use all the information
features.remove('quality')    # we need to exclude the target 'quality' (otherwise there is nothing to predict)
features

In [None]:
# Split the H2O data frame into training/test sets
# so we can evaluate performance on data the model has not seen
wine_split = wine.split_frame(ratios = [0.8], seed = 1234)

wine_train = wine_split[0] # roughly 80% of the rows for training
wine_test = wine_split[1]  # the remaining rows (about 20%) for held-out evaluation

In [None]:
wine_train.shape

In [None]:
wine_test.shape

<br>
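`split_frame` assigns rows to splits probabilistically, so the shapes above will typically be close to, but not exactly, an 80/20 split. The following is a minimal pure-Python sketch of that idea, not H2O's implementation. The helper name `probabilistic_split` is hypothetical, and 4,898 is the row count of the white wine file.

```python
import random

def probabilistic_split(n_rows, ratio, seed):
    """Assign each row index to train with probability `ratio`, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in range(n_rows):
        # Each row is routed independently, so the final sizes are only approximate
        (train if rng.random() < ratio else test).append(row)
    return train, test

train_idx, test_idx = probabilistic_split(n_rows=4898, ratio=0.8, seed=1234)
print(len(train_idx), len(test_idx))  # close to, but rarely exactly, 80/20
```

Fixing the seed makes the split reproducible, which is why the tutorial passes `seed = 1234` to `split_frame`.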

## Step 1 - Gradient Boosting Machines (GBM) with Default Settings

In [None]:
# Build a Gradient Boosting Machine (GBM) model with default settings

# Import the function for GBM
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Set up GBM for regression
# Add a seed for reproducibility
gbm_default = H2OGradientBoostingEstimator(model_id = 'gbm_default', 
                                           seed = 1234)

# Use .train() to build the model
gbm_default.train(x = features, 
                  y = 'quality', 
                  training_frame = wine_train)

In [None]:
# Check the model performance on test dataset
gbm_default.model_performance(wine_test)

<br>
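The performance report above includes MSE and RMSE, the metrics used for comparison throughout this tutorial. As a rough illustration of what they measure, here is a minimal sketch that computes both by hand. The function name `mse` and the toy values are hypothetical and are not taken from the wine data.

```python
import math

def mse(actual, predicted):
    """Mean squared error: the average of the squared residuals."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [5, 6, 7]            # toy quality scores (hypothetical)
predicted = [5.5, 6.0, 6.0]   # toy model outputs (hypothetical)

error = mse(actual, predicted)
print("MSE :", error)
print("RMSE:", math.sqrt(error))  # RMSE is in the same units as the target
```

Lower values are better for both metrics. RMSE is often easier to interpret because it is expressed in quality points rather than squared quality points.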

## Step 2 - GBM with Manual Settings

In [None]:
# Build a GBM with manual settings

# Set up GBM for regression
# Add a seed for reproducibility
gbm_manual = H2OGradientBoostingEstimator(model_id = 'gbm_manual', 
                                          seed = 1234,
                                          ntrees = 100,
                                          sample_rate = 0.9,
                                          col_sample_rate = 0.9)

# Use .train() to build the model
gbm_manual.train(x = features, 
                 y = 'quality', 
                 training_frame = wine_train)

In [None]:
# Check the model performance on test dataset
gbm_manual.model_performance(wine_test)

<br>
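The `sample_rate` and `col_sample_rate` settings above make the GBM work with a random subset of rows and columns rather than all of them, which tends to reduce overfitting. The sketch below only illustrates drawing 90% of the rows and columns once. It is an assumption-laden simplification: H2O decides internally when and how often the sampling happens, and the helper `subsample` is hypothetical.

```python
import random

def subsample(items, rate, rng):
    """Draw round(rate * len(items)) items without replacement."""
    k = round(rate * len(items))
    return sorted(rng.sample(items, k))

rng = random.Random(1234)
rows = list(range(100))     # hypothetical row indices
columns = list(range(11))   # the white wine data has 11 predictors

rows_used = subsample(rows, 0.9, rng)
cols_used = subsample(columns, 0.9, rng)
print(len(rows_used), "rows and", len(cols_used), "columns in this draw")
```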

## Step 3 - GBM with Manual Settings & Cross-Validation (CV)

In [None]:
# Build a GBM with manual settings & cross-validation

# Set up GBM for regression
# Add a seed for reproducibility
gbm_manual_cv = H2OGradientBoostingEstimator(model_id = 'gbm_manual_cv', 
                                             seed = 1234,
                                             ntrees = 100,
                                             sample_rate = 0.9,
                                             col_sample_rate = 0.9,
                                             nfolds = 5)
                                            
# Use .train() to build the model
gbm_manual_cv.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)

In [None]:
# Check the cross-validation model performance
gbm_manual_cv

In [None]:
# Check the model performance on test dataset
gbm_manual_cv.model_performance(wine_test)
# It should match gbm_manual above: cross-validation adds fold models for validation,
# but the final model is still trained on the full training frame with the same parameters

<br>
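With `nfolds = 5`, H2O builds five additional models, each trained on four fifths of the training frame and scored on the remaining fifth, alongside the main model trained on all rows. The sketch below shows the partitioning idea using a simple modulo assignment. This is a simplification: H2O's default fold assignment may differ, and `kfold_indices` is a hypothetical helper.

```python
def kfold_indices(n_rows, n_folds):
    """Yield (train_rows, holdout_rows) pairs, one per fold."""
    folds = [[] for _ in range(n_folds)]
    for row in range(n_rows):
        folds[row % n_folds].append(row)  # modulo assignment, for illustration
    for k in range(n_folds):
        holdout = folds[k]
        train = [r for j, fold in enumerate(folds) if j != k for r in fold]
        yield train, holdout

splits = list(kfold_indices(n_rows=10, n_folds=5))
for train, holdout in splits:
    print("holdout:", holdout, "| training rows:", len(train))
```

Every row is held out exactly once, so the cross-validation metrics summarize performance on data that each fold model did not see during training.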

## Step 4 - GBM with Manual Settings, CV and Early Stopping

In [None]:
# Build a GBM with manual settings, CV and early stopping

# Set up GBM for regression
# Add a seed for reproducibility
gbm_manual_cv_es = H2OGradientBoostingEstimator(model_id = 'gbm_manual_cv_es', 
                                                seed = 1234,
                                                ntrees = 10000,   # increase the number of trees 
                                                sample_rate = 0.9,
                                                col_sample_rate = 0.9,
                                                nfolds = 5,
                                                stopping_metric = 'mse', # let early stopping feature determine
                                                stopping_rounds = 15,     # the optimal number of trees
                                                score_tree_interval = 1) # by looking at the MSE metric
# Use .train() to build the model
gbm_manual_cv_es.train(x = features, 
                       y = 'quality', 
                       training_frame = wine_train)

In [None]:
# Check the model summary
gbm_manual_cv_es.summary()

In [None]:
# Check the cross-validation model performance
gbm_manual_cv_es

In [None]:
# Check the model performance on test dataset
gbm_manual_cv_es.model_performance(wine_test)

<br>
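Early stopping lets the model stop adding trees once the chosen metric stops improving, instead of building all 10,000. The sketch below uses a simplified patience rule: stop when the best score has not improved for `patience` consecutive scoring rounds. It is not H2O's exact rule, which as we understand it compares moving averages and applies a tolerance. The function name and the score series are hypothetical.

```python
def early_stopping_round(scores, patience):
    """Return (round_stopped, best_round) for a lower-is-better metric."""
    best = float("inf")
    best_round = 0
    for i, score in enumerate(scores, start=1):
        if score < best:
            best, best_round = score, i
        elif i - best_round >= patience:
            return i, best_round   # no improvement for `patience` rounds
    return len(scores), best_round  # ran out of rounds without triggering

# Hypothetical validation MSE after each scoring round
scores = [1.0, 0.8, 0.7, 0.65, 0.66, 0.67, 0.66, 0.68]
print(early_stopping_round(scores, patience=3))  # (7, 4)
```

In this toy series the best score occurs at round 4, and training stops at round 7 after three rounds without improvement.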

## Step 5 - GBM with CV, Early Stopping and Full Grid Search
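A Cartesian (full) grid search trains one model for every combination of the listed hyper-parameter values. As a minimal sketch of what that enumeration looks like, the snippet below expands the same two value lists used for `hyper_params` in this step with `itertools.product`. It only counts and lists the combinations and does not call H2O.

```python
from itertools import product

hyper_params = {'sample_rate': [0.7, 0.8, 0.9],
                'col_sample_rate': [0.7, 0.8, 0.9]}

names = list(hyper_params)
# One dictionary per combination of values
combinations = [dict(zip(names, values))
                for values in product(*(hyper_params[n] for n in names))]

print(len(combinations), "models in the full grid")
for combo in combinations[:3]:
    print(combo)
```

Because the number of combinations is the product of the list lengths, a full grid grows quickly as more hyper-parameters are added.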

In [None]:
# import Grid Search
from h2o.grid.grid_search import H2OGridSearch

In [None]:
# define the criteria for full grid search
search_criteria = {'strategy': "Cartesian"}

In [None]:
# define the range of hyper-parameters for grid search
hyper_params = {'sample_rate': [0.7, 0.8, 0.9],
                'col_sample_rate': [0.7, 0.8, 0.9]}

In [None]:
# Set up GBM grid search
# Add a seed for reproducibility
gbm_full_grid = H2OGridSearch(
                    H2OGradientBoostingEstimator(
                        model_id = 'gbm_full_grid', 
                        seed = 1234,
                        ntrees = 10000,   
                        nfolds = 5,
                        stopping_metric = 'mse', 
                        stopping_rounds = 15,     
                        score_tree_interval = 1),
                    search_criteria = search_criteria, # full grid search
                    hyper_params = hyper_params)

In [None]:
# Use .train() to start the grid search
gbm_full_grid.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)

In [None]:
# Sort and show the grid search results
gbm_full_grid_sorted = gbm_full_grid.get_grid(sort_by='mse', decreasing=False)
print(gbm_full_grid_sorted)

In [None]:
# Extract the best model from full grid search
best_model_id = gbm_full_grid_sorted.model_ids[0]
best_gbm_from_full_grid = h2o.get_model(best_model_id)
best_gbm_from_full_grid.summary()

In [None]:
# Check the model performance on test dataset
best_gbm_from_full_grid.model_performance(wine_test)

## Step 6 - GBM with CV, Early Stopping and Random Grid Search
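A random discrete search trains only a sample of the full grid, which keeps the cost bounded when the grid is large. The sketch below draws 9 of the 27 combinations used in this step without replacement. It illustrates the idea only: the combinations H2O actually selects for a given seed will differ from the ones Python's `random` module picks here.

```python
import random
from itertools import product

hyper_params = {'sample_rate': [0.7, 0.8, 0.9],
                'col_sample_rate': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}

names = list(hyper_params)
full_grid = [dict(zip(names, values))
             for values in product(*(hyper_params[n] for n in names))]

rng = random.Random(1234)               # fixed seed for a repeatable draw
sampled = rng.sample(full_grid, k=9)    # analogous to max_models = 9

print(len(full_grid), "combinations in total,", len(sampled), "sampled")
```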

In [None]:
# define the criteria for random grid search
search_criteria = {'strategy': "RandomDiscrete", 
                   'max_models': 9,
                   'seed': 1234}

In [None]:
# define the range of hyper-parameters for grid search
# 27 combinations in total; max_models = 9 limits the search to 9 of them
hyper_params = {'sample_rate': [0.7, 0.8, 0.9],
                'col_sample_rate': [0.7, 0.8, 0.9],
                'max_depth': [3, 5, 7]}

In [None]:
# Set up GBM grid search
# Add a seed for reproducibility
gbm_rand_grid = H2OGridSearch(
                    H2OGradientBoostingEstimator(
                        model_id = 'gbm_rand_grid', 
                        seed = 1234,
                        ntrees = 10000,   
                        nfolds = 5,
                        stopping_metric = 'mse', 
                        stopping_rounds = 15,     
                        score_tree_interval = 1),
                    search_criteria = search_criteria, # random grid search
                    hyper_params = hyper_params)

In [None]:
# Use .train() to start the grid search
gbm_rand_grid.train(x = features, 
                    y = 'quality', 
                    training_frame = wine_train)

In [None]:
# Sort and show the grid search results
gbm_rand_grid_sorted = gbm_rand_grid.get_grid(sort_by='mse', decreasing=False)
print(gbm_rand_grid_sorted)

In [None]:
# Extract the best model from random grid search
best_model_id = gbm_rand_grid_sorted.model_ids[0]
best_gbm_from_rand_grid = h2o.get_model(best_model_id)
best_gbm_from_rand_grid.summary()

In [None]:
# Check the model performance on test dataset
best_gbm_from_rand_grid.model_performance(wine_test)

<br>

## Comparison of Model Performance on Test Data

In [None]:
print('GBM with Default Settings                        :', gbm_default.model_performance(wine_test).mse())
print('GBM with Manual Settings                         :', gbm_manual.model_performance(wine_test).mse())
print('GBM with Manual Settings & CV                    :', gbm_manual_cv.model_performance(wine_test).mse())
print('GBM with Manual Settings, CV & Early Stopping    :', gbm_manual_cv_es.model_performance(wine_test).mse())
print('GBM with CV, Early Stopping & Full Grid Search   :', 
          best_gbm_from_full_grid.model_performance(wine_test).mse())
print('GBM with CV, Early Stopping & Random Grid Search :', 
          best_gbm_from_rand_grid.model_performance(wine_test).mse())

<br>