# Machine Learning with H2O - Tutorial 4a: Classification Models (Basics)

<hr>

**Objective**:

- This tutorial explains how to build classification models with four different H2O algorithms.

<hr>

**Titanic Dataset:**

- Source: https://www.kaggle.com/c/titanic/data

<hr>
    
**Algorithms**:

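1. Generalized Linear Model (GLM)
2. Distributed Random Forest (DRF)
3. Gradient Boosting Machine (GBM)
4. Deep Neural Networks (DNN, H2O Deep Learning)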


<hr>

**Full Technical Reference:**

- http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html

<br>


In [None]:
# Start and connect to a local H2O cluster
import h2o
h2o.init(nthreads = -1)

<br>

In [None]:
# Import Titanic data (local CSV)
titanic = h2o.import_file("kaggle_titanic.csv")
titanic.head(5)

In [None]:
# Convert 'Survived' and 'Pclass' to categorical (factor) columns
# so that H2O treats them as classes rather than as numbers
titanic['Survived'] = titanic['Survived'].asfactor()
titanic['Pclass'] = titanic['Pclass'].asfactor()

In [None]:
titanic['Survived'].table()

In [None]:
titanic['Pclass'].table()

In [None]:
titanic['Sex'].table()

In [None]:
titanic['Age'].hist()

In [None]:
titanic['SibSp'].hist()

In [None]:
titanic['Parch'].hist()

In [None]:
titanic['Fare'].hist()

In [None]:
titanic['Embarked'].table()

In [None]:
# Define features (or predictors) manually
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
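
The same feature list can also be derived by excluding columns instead of typing it out. This is a minimal sketch that assumes the file has the standard Kaggle Titanic training columns; the names `excluded` and `features_auto` are illustrative only. With a live H2O frame, the column names would come from `titanic.columns`.

```python
# Column names of the standard Kaggle Titanic training file (assumed here;
# with H2O you would read them from `titanic.columns`)
columns = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
           'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']

# The response and the columns not used as predictors in this tutorial
response = 'Survived'
excluded = ['PassengerId', 'Name', 'Ticket', 'Cabin']

# Keep every remaining column, preserving the original column order
features_auto = [c for c in columns if c != response and c not in excluded]
print(features_auto)
```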

In [None]:
# Split the H2O data frame into training/test sets
# so we can evaluate performance on held-out (out-of-sample) data
titanic_split = titanic.split_frame(ratios = [0.8], seed = 1234)

titanic_train = titanic_split[0] # about 80% of the rows, used for training
titanic_test = titanic_split[1]  # the remaining rows (about 20%), used for out-of-sample evaluation

In [None]:
titanic_train.shape

In [None]:
titanic_test.shape
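
The two shapes are worth checking because `split_frame` is documented as an approximate, probabilistic split rather than an exact 80/20 cut. The idea can be sketched in plain Python as below. This is an illustration of the principle, not H2O's actual implementation; the helper `probabilistic_split` and the row count of 891 (the size of the standard Kaggle training file) are assumptions.

```python
import random

def probabilistic_split(n_rows, ratio, seed):
    """Assign each row to train with probability `ratio`, otherwise to test."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        if rng.random() < ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx

# 891 rows is assumed here; the resulting sizes are close to, but usually
# not exactly, 80% and 20% of the rows
train_idx, test_idx = probabilistic_split(n_rows=891, ratio=0.8, seed=1234)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the assignment repeatable, which is the role `seed = 1234` plays in the `split_frame` call.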

<br>

## Generalized Linear Model

In [None]:
# Build a Generalized Linear Model (GLM) with default settings

# Import the function for GLM
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

# Set up GLM for binary classification
glm_default = H2OGeneralizedLinearEstimator(family = 'binomial', model_id = 'glm_default')

# Use .train() to build the model
glm_default.train(x = features, 
                  y = 'Survived', 
                  training_frame = titanic_train)
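
With `family = 'binomial'`, the GLM models the log-odds of survival as a linear function of the predictors and maps it to a probability with the logistic function. The sketch below shows that mapping for a single passenger. The coefficient values and the names `coefs` and `row` are hypothetical and chosen for illustration; they are not the fitted values of `glm_default`.

```python
import math

def sigmoid(z):
    """Logistic function: maps a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, for illustration only (not fitted values)
intercept = 1.0
coefs = {'Sex.male': -2.5, 'Age': -0.03}

# One passenger: male, 30 years old
row = {'Sex.male': 1.0, 'Age': 30.0}

# Linear predictor (log-odds), then the implied survival probability
linear_predictor = intercept + sum(coefs[k] * row[k] for k in coefs)
p_survived = sigmoid(linear_predictor)
print(round(p_survived, 3))
```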

In [None]:
# Check the model summary and its performance on the training dataset
glm_default

In [None]:
# Check the model performance on the test dataset
glm_default.model_performance(titanic_test)
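
The performance report for a binomial model includes, among other metrics, logloss and AUC. To make these two numbers less opaque, here is a minimal pure-Python sketch of how they can be computed from true labels and predicted probabilities. The toy data and the function names are made up for illustration, and this is not how H2O computes the metrics internally.

```python
import math

def logloss(y_true, p1):
    """Mean negative log-likelihood of the true labels (lower is better)."""
    eps = 1e-15  # clip probabilities to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, p1):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def auc(y_true, p1):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [p for y, p in zip(y_true, p1) if y == 1]
    neg = [p for y, p in zip(y_true, p1) if y == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities of class 1 (illustrative only)
y_true = [1, 0, 1, 0, 1]
p1 = [0.9, 0.2, 0.6, 0.7, 0.8]
print(logloss(y_true, p1), auc(y_true, p1))
```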

<br>

## Distributed Random Forest

In [None]:
# Build a Distributed Random Forest (DRF) model with default settings

# Import the function for DRF
from h2o.estimators.random_forest import H2ORandomForestEstimator

# Set up DRF for binary classification ('Survived' is a factor)
# Add a seed for reproducibility
drf_default = H2ORandomForestEstimator(model_id = 'drf_default', seed = 1234)

# Use .train() to build the model
drf_default.train(x = features, 
                  y = 'Survived', 
                  training_frame = titanic_train)

In [None]:
# Check the DRF model summary
drf_default

In [None]:
# Check the model performance on the test dataset
drf_default.model_performance(titanic_test)

<br>

## Gradient Boosting Machine

In [None]:
# Build a Gradient Boosting Machine (GBM) model with default settings

# Import the function for GBM
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Set up GBM for binary classification ('Survived' is a factor)
# Add a seed for reproducibility
gbm_default = H2OGradientBoostingEstimator(model_id = 'gbm_default', seed = 1234)

# Use .train() to build the model
gbm_default.train(x = features, 
                  y = 'Survived', 
                  training_frame = titanic_train)

In [None]:
# Check the GBM model summary
gbm_default

In [None]:
# Check the model performance on the test dataset
gbm_default.model_performance(titanic_test)

<br>

## H2O Deep Learning

In [None]:
# Build a Deep Learning (Deep Neural Networks, DNN) model with default settings

# Import the function for DNN
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# Set up DNN for binary classification ('Survived' is a factor)
dnn_default = H2ODeepLearningEstimator(model_id = 'dnn_default')

# (not run) For reproducible results, set 'reproducible' to True and fix a seed
# The model will then be built using a single thread (could be very slow)
# dnn_default = H2ODeepLearningEstimator(model_id = 'dnn_default', reproducible = True, seed = 1234)

# Use .train() to build the model
dnn_default.train(x = features, 
                  y = 'Survived', 
                  training_frame = titanic_train)

In [None]:
# Check the DNN model summary
dnn_default

In [None]:
# Check the model performance on the test dataset
dnn_default.model_performance(titanic_test)
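
With all four models trained, their test-set reports can be lined up in one small table. The helper below is a sketch, and `compare_on_test` is a name made up for this example. It relies only on each model's `model_performance()` method and on the `auc()` and `logloss()` accessors of the returned binomial metrics object.

```python
def compare_on_test(models, test_frame):
    """Return (model name, AUC, logloss) tuples sorted by AUC, best first."""
    rows = []
    for name, model in models.items():
        perf = model.model_performance(test_frame)
        rows.append((name, perf.auc(), perf.logloss()))
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Usage in this notebook (requires the four models trained in earlier cells):
# compare_on_test({'GLM': glm_default, 'DRF': drf_default,
#                  'GBM': gbm_default, 'DNN': dnn_default}, titanic_test)
```

Because the test set here is small, differences between the default models should be read with caution.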

<br>

## Making Predictions

In [None]:
# Use GLM model to make predictions
yhat_test_glm = glm_default.predict(titanic_test)
yhat_test_glm.head(5)

In [None]:
# Use DRF model to make predictions
yhat_test_drf = drf_default.predict(titanic_test)
yhat_test_drf.head(5)

In [None]:
# Use GBM model to make predictions
yhat_test_gbm = gbm_default.predict(titanic_test)
yhat_test_gbm.head(5)

In [None]:
# Use DNN model to make predictions
yhat_test_dnn = dnn_default.predict(titanic_test)
yhat_test_dnn.head(5)
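
For a binomial model, the prediction frame contains a `predict` column with the class label plus the class probabilities `p0` and `p1`. The label comes from comparing `p1` with a threshold that H2O picks from the model's metrics (by default the one that maximizes F1, as far as the documentation describes), so it is not necessarily 0.5. A minimal sketch of that last step, using a hypothetical threshold and made-up probabilities:

```python
def label_from_p1(p1, threshold):
    """Turn a predicted probability of class 1 into a 0/1 label."""
    return 1 if p1 >= threshold else 0

# Hypothetical threshold and probabilities, for illustration only
threshold = 0.4
p1_values = [0.12, 0.38, 0.45, 0.91]

labels = [label_from_p1(p, threshold) for p in p1_values]
print(labels)
```

Working from `p1` directly lets you apply a different threshold when the cost of a false positive and a false negative differ.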

<br>