In [None]:
!apt install -y openjdk-8-jdk && python3 -m pip install h2o

# Machine Learning Crash Course: Model Families and Hyperparameters

In this second notebook, we'll begin to explore some of the internals that H2O's AutoML handled for us. We won't work through prediction in full as we did in the first part; the goal is to drop down just one layer of abstraction to understand some of what occurred automatically, not to immerse ourselves in every detail.

### Start H2O

Import the **h2o** Python module and `H2OAutoML` class and initialize a local H2O cluster. If you ran part 1's notebook, this should detect the still-running H2O cluster that was started earlier.

In [None]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()

### Load Data

Just as we did when letting H2O figure out which models to build for us, we're going to load a dataset. This time, we'll use a subset of a publicly available [Single-family Loan dataset from Freddie Mac](http://www.freddiemac.com/fmac-resources/research/pdf/user_guide.pdf). Here we'll predict whether or not a loan holder will default on their loan.

In [None]:
df = h2o.import_file("https://s3.amazonaws.com/data.h2o.ai/DAI-Tutorials/loan_level_500k.csv")

Let's again inspect the dataset we've loaded to see what columns it contains, and compare them with the user guide linked above.

In [None]:
df.head()

As before, we can also inspect the distributions and types of the columns.

In [None]:
df.describe()

The target we'll predict will be the `DELINQUENT` column, which indicates whether the loan holder in a given row was delinquent on their loan. Let's look at the distribution of its values.

In [None]:
df["DELINQUENT"].table()

We can see that we have many more non-delinquent loan holders than delinquent ones. It's important to notice a highly skewed or imbalanced target variable like this one, because there are techniques for rebalancing the classes that can improve the quality of a trained model.
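
One common rebalancing idea is to weight each row inversely to how often its class appears, so that the rare class counts as much in total as the common one. The sketch below is a minimal pure-Python illustration of that idea. The 95/5 label counts are made up and are not the actual distribution of `DELINQUENT`, and `inverse_frequency_weights` is a hypothetical helper, not an H2O function.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Return a per-class weight so that every class has the same total weight."""
    counts = Counter(labels)
    n_rows, n_classes = len(labels), len(counts)
    return {cls: n_rows / (n_classes * count) for cls, count in counts.items()}

# Made-up imbalanced target: 95 non-delinquent rows, 5 delinquent rows
labels = ["FALSE"] * 95 + ["TRUE"] * 5
weights = inverse_frequency_weights(labels)
print(weights)
```

With these weights, each class contributes the same total weight even though the row counts differ. Some H2O estimators also expose a `balance_classes` option that resamples the training data instead, which may be more convenient in practice.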

The first "new" thing we'll do, which H2O previously did for us, is to split our dataset into training, validation, and test sets.

In [None]:
# 70% training, 15% validation; the remaining ~15% becomes the test frame
training, validation, test = df.split_frame([0.7, 0.15])
print(
    """
    training rows: %d
    validation rows: %d
    test rows: %d
    """ % (training.nrows, validation.nrows, test.nrows)
)
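
The printed row counts are only approximately 70/15/15 because `split_frame` assigns rows probabilistically instead of cutting at exact row counts. The sketch below illustrates that idea in pure Python under stated assumptions: `split_rows` is a hypothetical helper, the 10,000-row size is arbitrary, and this is not H2O's actual implementation.

```python
import random

def split_rows(n_rows, ratios, seed=0):
    """Assign each row index to a part by drawing a uniform random number."""
    rng = random.Random(seed)
    # Cumulative cut points, e.g. roughly [0.7, 0.85] for ratios [0.7, 0.15]
    cuts, total = [], 0.0
    for ratio in ratios:
        total += ratio
        cuts.append(total)
    # One part per ratio, plus a final part for whatever is left over
    parts = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        part = sum(u >= cut for cut in cuts)  # number of cut points passed
        parts[part].append(row)
    return parts

train_idx, valid_idx, test_idx = split_rows(10_000, [0.7, 0.15])
print(len(train_idx), len(valid_idx), len(test_idx))
```

Every row lands in exactly one part, but the part sizes vary a little from seed to seed.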

As before, we specify the columns that will be used as features, along with our target column.

In [None]:
y = "DELINQUENT"

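# Columns excluded from the feature set (the target itself plus columns we choose not to use)
ignore = ["DELINQUENT", "PREPAID", "PREPAYMENT_PENALTY_MORTGAGE_FLAG", "PRODUCT_TYPE"]

# Keep the frame's original column order so the feature list is the same on every run
x = [name for name in training.names if name not in ignore]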

## A Specific Model Family: Boosted Trees

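We're going to build a model of a specific type: a **gradient boosted tree** model, often called a gradient boosting machine (GBM), which combines many small decision trees trained one after another.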

In [None]:
import h2o.estimators.gbm
gbm = h2o.estimators.gbm.H2OGradientBoostingEstimator()
gbm.train(x=x, y=y, training_frame=training, validation_frame=validation)
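
For intuition about what boosting does, here is a minimal pure-Python sketch. It makes several simplifying assumptions that differ from H2O's GBM: a single numeric feature, squared-error loss instead of a classification loss, and one-split trees ("stumps"). The names `fit_stump` and `fit_boosted_stumps` and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def fit_boosted_stumps(xs, ys, ntrees=50, learn_rate=0.1):
    """Each new stump is fit to the errors left by the stumps before it."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(ntrees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take only a small step toward the residuals on each round
        preds = [p + learn_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learn_rate * sum(s(x) for s in stumps)

# Toy data: the target jumps from 0 to 1 between x=4 and x=5
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_boosted_stumps(xs, ys)

baseline = sum(ys) / len(ys)
mse_baseline = sum((y - baseline) ** 2 for y in ys) / len(ys)
mse_boosted = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_baseline, mse_boosted)
```

The number of rounds, the step size, and the size of each tree are all choices made before training starts. In H2O's GBM they correspond to settings such as `ntrees`, `learn_rate`, and `max_depth`.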

We can investigate the specific GBM model that we created:

In [None]:
gbm

And as before, we can run predictions, here on our validation set:

In [None]:
gbm.predict(validation)

## Hyperparameters

Many model families have what are called "hyperparameters".

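Loosely speaking, hyperparameters are settings chosen before training that control how the model is built, such as the number of trees (`ntrees`) or the maximum depth of each tree (`max_depth`). They differ from the parameters that the model learns from the data during training. Below, we try several values of `max_depth` and train one model for each.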

In [None]:
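import h2o.grid  # explicit import of the grid-search submodule

# Candidate tree depths; the grid trains one GBM per value
hyper_params = {"max_depth": [1, 3, 5, 6, 7, 8, 9, 10, 12, 13, 15]}

# Base estimator: settings shared by every model in the grid
gbm = h2o.estimators.gbm.H2OGradientBoostingEstimator(ntrees=150)
gbm_grid = h2o.grid.H2OGridSearch(
    gbm,
    hyper_params,
    search_criteria={"strategy": "Cartesian"},
)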

gbm_grid.train(
    x=x, y=y, training_frame=training, validation_frame=validation,
)
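
A Cartesian search trains one model for every combination of the listed values. With `max_depth` alone that means 11 models, but the count multiplies quickly as hyperparameters are added. The sketch below enumerates the combinations in pure Python. The `learn_rate` values are a hypothetical addition for illustration and are not part of the grid trained above.

```python
from itertools import product

hyper_params_sketch = {
    "max_depth": [1, 3, 5, 6, 7, 8, 9, 10, 12, 13, 15],
    "learn_rate": [0.01, 0.05, 0.1],  # hypothetical values, for illustration only
}

# One dict per model that a Cartesian search would train
names = list(hyper_params_sketch)
combos = [dict(zip(names, values))
          for values in product(*hyper_params_sketch.values())]

print(len(combos))  # 11 depths x 3 learning rates
print(combos[0])
```

When the full product becomes too large to train, H2O's grid search also accepts a `"RandomDiscrete"` strategy that samples combinations instead of enumerating all of them.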

In [None]:
# Rank the grid's models by AUC, highest first
sorted_gbm_depth = gbm_grid.get_grid(sort_by='auc', decreasing=True)
sorted_gbm_depth
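
The grid is ranked by AUC, the area under the ROC curve. One way to read it is as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row, with ties counting as half. The sketch below computes it directly from that definition; `pairwise_auc` is a hypothetical helper and the labels and scores are made up.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Made-up labels (1 = delinquent) and predicted scores
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(labels, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect one. Because the validation set was used to choose `max_depth`, a natural next step (not shown in this notebook) would be to score the top model on the held-out `test` frame, for example with the model's `model_performance` method.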

## Exercise

Use the AutoML techniques we learned in part 1 to build an AutoML model for this new dataset. How does it compare to the model we built manually?

In [None]:
from h2o.automl import H2OAutoML

The contents of this notebook were lightly adapted from a tutorial in the [official H2O AutoML documentation](https://h2oai.github.io/tutorials/introduction-to-machine-learning-with-h2o-part-1/). There are many more interesting tutorials to work through there. Explore them!