# (4) Automated Model Selection & Analysis

Now that we have preprocessed our data and hopefully addressed most of the initial problems discussed in the first three notebooks, we can focus on **modelling** in this final notebook. The quote below, from `h2o`'s documentation, describes what its AutoML interface does:

*We have designed an easy-to-use interface which automates the process of training a large selection of candidate models. H2O’s AutoML can also be a helpful tool for the advanced user, by providing a simple wrapper function that performs a large number of modeling-related tasks that would typically require many lines of code, and by freeing up their time to focus on other aspects of the data science pipeline tasks such as data-preprocessing, feature engineering and model deployment.*

In [None]:
import pandas as pd
import h2o
from h2o.automl import H2OAutoML
h2o.init()
SEED = 42

## Specify Training & Testing Filepaths
To experiment with how models perform on earlier, less processed versions of the data, simply change the `training_filepath` and `testing_filepath` variables to point at the other data folders, which start with `(0)`, `(1)`, or `(2)`.

In [None]:
training_filepath = '(3)data_trimmed/label_encoded/train_users.csv'
testing_filepath = '(3)data_trimmed/label_encoded/test_users.csv'

## Read Training Data as `H2OFrame`

In [None]:
# Import the training set as an H2OFrame
X_train = h2o.import_file(training_filepath)
X_train.head()

### Remove Unrelated Columns for Training
The response variable 'country_destination' is never a training column, so it must always be separated from the predictors. For `(0)data`, make sure to remove the columns 'id' and 'date_first_booking' as well. For `(1)data_manual_ops`, make sure to remove the column 'id'. For the rest of the data filepaths, what we have below is sufficient.

In [None]:
# Select/discard variables (columns) to base models on training set
train_variables, response_variable = X_train.columns, 'country_destination'
unrelated_variables = [response_variable]  # REMEMBER: 'id', 'date_first_booking'
for variable in unrelated_variables:
    train_variables.remove(variable)

X_train[response_variable] = X_train[response_variable].asfactor()
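
The per-folder guidance above can also be written as a small helper. This is only a sketch: `select_predictors` and the `EXTRA_UNRELATED` mapping are hypothetical names introduced here, and the toy column list is made up for illustration.

```python
# Hypothetical mapping from data-folder prefix to the extra columns to drop,
# following the guidance above. The response column is always excluded.
RESPONSE = 'country_destination'
EXTRA_UNRELATED = {
    '(0)': ['id', 'date_first_booking'],
    '(1)': ['id'],
}

def select_predictors(columns, filepath, response=RESPONSE):
    """Return the columns to train on for the given data filepath."""
    unrelated = {response}
    for prefix, extras in EXTRA_UNRELATED.items():
        if filepath.startswith(prefix):
            unrelated.update(extras)
    # Keep the original column order; skip anything marked unrelated.
    return [c for c in columns if c not in unrelated]

# Toy column list, for illustration only.
cols = ['id', 'age', 'gender', 'date_first_booking', 'country_destination']
print(select_predictors(cols, '(0)data/train_users.csv'))
print(select_predictors(cols, '(1)data_manual_ops/train_users.csv'))
```

The returned list could then be passed as `x` when training, in place of the manually edited `train_variables`.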

## `h2o`'s Best Utility: `H2OAutoML`

You can find detailed documentation for the `H2OAutoML` module, which does most of the magic, [here](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html). Below, we outline a few important parameters and explain what `H2OAutoML` does as well as what it doesn't do.

You can stop *automated* model training with two parameters:
* `@max_runtime_secs`: How long the AutoML will run before starting the training of final Stacked Ensemble models. Defaults to 3600 seconds (1 hour).
* `@max_models`: Maximum number of models to build in an AutoML run excluding the Stacked Ensemble models. Defaults to None.

You can enable either *downsampling* or *upsampling* with two parameters:
* `@balance_classes`: Specify whether to oversample the minority classes to balance the class distribution. This option is not enabled by default and can increase the data frame size. Majority classes can be undersampled to satisfy the max_after_balance_size parameter.
* `@max_after_balance_size`: Specify the maximum relative size of the training data after balancing class counts (balance_classes must be enabled). Defaults to 5.0. (The value can be less than 1.0).
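
Before turning on `balance_classes`, it may help to check how imbalanced the response actually is. The sketch below uses plain `pandas` on toy labels; `class_counts` is a hypothetical helper, and the toy values stand in for the real 'country_destination' column.

```python
import pandas as pd

def class_counts(labels):
    """Return the number of rows per class, largest class first."""
    return pd.Series(labels).value_counts()

# Toy labels only; in the notebook these would come from 'country_destination'.
toy_labels = ['NDF'] * 6 + ['US'] * 3 + ['FR'] * 1
counts = class_counts(toy_labels)

# Ratio between the largest and smallest class; a large ratio suggests
# that balancing may be worth trying.
imbalance_ratio = counts.max() / counts.min()
print(counts)
print(imbalance_ratio)
```

A ratio close to 1 suggests the classes are already roughly balanced, in which case the extra training cost of oversampling is unlikely to pay off.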

By default, this module trains & validates the following model architectures automatically:
* **DRF**
* **GLM**
* **XGBoost (XGBoost GBM)**
* **GBM (H2O GBM)**
* **DeepLearning** (*Fully-connected multi-layer artificial neural network*)
* **StackedEnsemble**

You can specify which models to *include* or *exclude* with two parameters:
* `@exclude_algos`: A list/vector of character strings naming the algorithms to skip during the model-building phase.
* `@include_algos`: A list/vector of character strings naming the algorithms to include during the model-building phase. 
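
In the H2O documentation these two options are mutually exclusive, so only one of them should be set per run. A minimal sketch of guarding against that, where `automl_algo_kwargs` is a hypothetical helper and not part of `h2o`:

```python
def automl_algo_kwargs(include_algos=None, exclude_algos=None):
    """Build the algorithm-selection keyword arguments for an AutoML run."""
    if include_algos is not None and exclude_algos is not None:
        raise ValueError('Set include_algos or exclude_algos, not both.')
    kwargs = {}
    if include_algos is not None:
        kwargs['include_algos'] = list(include_algos)
    if exclude_algos is not None:
        kwargs['exclude_algos'] = list(exclude_algos)
    return kwargs

# Skip the two slowest-looking families from the list above (an arbitrary choice).
algo_kwargs = automl_algo_kwargs(exclude_algos=['DeepLearning', 'StackedEnsemble'])
print(algo_kwargs)
```

The resulting dictionary could be unpacked into the constructor, for example `H2OAutoML(nfolds=5, seed=SEED, **algo_kwargs)`.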

`H2OAutoML` performs *hyperparameter search* based on **Random Grid Search** over a variety of algorithms in order to deliver the best model. In `H2OAutoML`, the following hyperparameters are fully supported:
* **GLM Hyperparameters**:  *alpha*, *missing_values_handling*
* **XGBoost Hyperparameters**: *ntrees*, *max_depth*, *min_rows*, *min_sum_hessian_in_leaf*, *sample_rate*, *col_sample_rate*, *col_sample_rate_per_tree*, *booster*, *reg_lambda*, *reg_alpha*
* **GBM Hyperparameters**: *histogram_type*, *ntrees*, *max_depth*, *min_rows*, *learn_rate*, *sample_rate*, *col_sample_rate*, *col_sample_rate_per_tree*, *min_split_improvement*
* **Deep Learning Hyperparameters**: *epochs*, *adaptive_rate*, *activation*, *rho*, *epsilon*, *input_dropout_ratio*, *hidden*, *hidden_dropout_ratios*
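
To make the idea of a random grid search concrete, here is a sketch in plain Python. It is not `H2OAutoML`'s implementation: `random_grid_search` is a hypothetical function, and the grid values are illustrative rather than the ranges H2O actually searches.

```python
import itertools
import random

def random_grid_search(grid, n_samples, seed=42):
    """Sample up to n_samples distinct hyperparameter combinations at random."""
    names = sorted(grid)
    # Enumerate the full Cartesian grid, then draw without replacement.
    combos = list(itertools.product(*(grid[name] for name in names)))
    rng = random.Random(seed)
    chosen = rng.sample(combos, min(n_samples, len(combos)))
    return [dict(zip(names, combo)) for combo in chosen]

# Hypothetical GBM-style grid; the values are illustrative, not H2O's defaults.
grid = {
    'max_depth': [3, 5, 7, 9],
    'learn_rate': [0.01, 0.05, 0.1],
    'sample_rate': [0.6, 0.8, 1.0],
}
candidates = random_grid_search(grid, n_samples=5)
for params in candidates:
    print(params)
```

Each sampled dictionary would correspond to one candidate model. A full grid search would train all 36 combinations here, whereas the random variant trains only the sampled few.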

In [None]:
aml = H2OAutoML(nfolds=5,
                #balance_classes=True,
                #max_after_balance_size=1.0 if downsampling else 10000.0,
                max_runtime_secs=10000,
                max_models=None,
                stopping_metric='AUTO',  # defaults to logloss for classification
                sort_metric='AUTO',      # defaults to mean_per_class_error for multinomial classification (AUC for binary)
                seed=SEED)

## Perform Training on `H2OFrame`

In [None]:
aml.train(x=train_variables,
          y=response_variable,
          training_frame=X_train)

## View the AutoML Models Leaderboard

In [None]:
lb = aml.leaderboard
print(lb.head(rows=lb.nrows))

## Save Best (Leader) `h2o` Model

In [None]:
model_path = h2o.save_model(model=aml.leader,
                            path='saved_models/',
                            force=True)

## Read Test Data as `H2OFrame`

In [None]:
X_test = h2o.import_file(testing_filepath)
X_test.head()

## Load Best (Leader) `h2o` Model

In [None]:
model = h2o.load_model(model_path)

## Get Predictions

In [None]:
predictions = model.predict(X_test)

## Save as Submission File & Submit [Here](https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/submit)

**IMPORTANT**: Notice that we assign the 'id' column in sorted order, because `featuretools` automatically sorted our rows.

In [None]:
answers = pd.DataFrame()
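# Reset the index after sorting so that ids and predictions are paired by position
answers['id'] = pd.read_csv('(0)data/test_users.csv').sort_values('id')['id'].reset_index(drop=True)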
answers['country'] = predictions.as_data_frame()['predict']
answers.set_index('id', inplace=True)
answers.to_csv('answers.csv')
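
Because `pandas` matches rows by index label when a Series is assigned to a column, it is worth checking that the sorted ids and the predictions are paired by position. The sketch below uses made-up ids and predictions to show the difference.

```python
import pandas as pd

# Toy data: ids in unsorted file order, predictions in sorted-id order.
raw = pd.DataFrame({'id': ['c', 'a', 'b']})
preds = pd.Series(['FR', 'US', 'NDF'], name='predict')  # meant for a, b, c

sorted_ids = raw.sort_values('id')['id']  # keeps the old index labels 1, 2, 0

# Assigning with the old labels pairs each prediction by label, not position.
by_label = pd.DataFrame({'id': sorted_ids})
by_label['country'] = preds

# Resetting the index first pairs row i with prediction i.
by_position = pd.DataFrame({'id': sorted_ids.reset_index(drop=True)})
by_position['country'] = preds

print(by_label)
print(by_position)
```

Only `by_position` gives each id the prediction that was made for it, which is why the index should be reset after sorting.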