Data Science Solutions with Python by Tshepo Chris Nokeri, Apress, 2021

# CHAPTER 10: AUTOMATING THE MACHINE LEARNING PROCESS WITH H2O

This short chapter concludes the book by presenting a straightforward approach to automating the machine learning process with H2O, a widely used machine learning framework.

In [1]:
import warnings
warnings.filterwarnings("ignore")

## Data Preprocessing

In [2]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
df = pd.read_csv(r"C:\Users\i5 lenov\Desktop\Source Code-20210822T014112Z-001\Source Code\Chapter_3_Parametric_Methods_Linear_Regression_Analysis\WA_Fn-UseC_-Marketing_Customer_Value_Analysis.csv")
# Drop the first and seventh columns by position.
drop_column_names = df.columns[[0, 6]]
initial_data = df.drop(drop_column_names, axis="columns")
# Column positions that are passed through pd.get_dummies.
encoded_positions = [0, 2, 3, 4, 5, 6, 7, 8, 9, 15, 16, 17, 18, 20, 21]
for position in encoded_positions:
    initial_data.iloc[:, position] = pd.get_dummies(initial_data.iloc[:, position])
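
The cell above writes the result of `pd.get_dummies` back into a single column position for each encoded field. As a minimal sketch of a commonly used alternative, the categorical columns can instead be expanded into one indicator column per category in a single call. The toy frame and its column names (`State`, `Coverage`, `Income`) are hypothetical stand-ins, not the book's data.

```python
import pandas as pd

# Hypothetical toy data standing in for the marketing dataset.
toy = pd.DataFrame({
    "State": ["Arizona", "Nevada", "Arizona", "Oregon"],
    "Coverage": ["Basic", "Premium", "Basic", "Extended"],
    "Income": [56274, 0, 48767, 43836],
})

# Columns holding text categories in this toy frame.
categorical_columns = ["State", "Coverage"]

# One indicator column per category; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=categorical_columns, dtype=int)

print(encoded.columns.tolist())
```

Each category keeps its own 0/1 column this way, at the cost of a wider frame.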

In [3]:
import h2o as initialize_h2o
initialize_h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; OpenJDK 64-Bit Server VM (build 11.0.6+8-b765.1, mixed mode)
  Starting server from C:\Users\i5 lenov\AppData\Roaming\Python\Python37\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\I5LENO~1\AppData\Local\Temp\tmp_fwqbteg
  JVM stdout: C:\Users\I5LENO~1\AppData\Local\Temp\tmp_fwqbteg\h2o_i5_lenov_started_from_python.out
  JVM stderr: C:\Users\I5LENO~1\AppData\Local\Temp\tmp_fwqbteg\h2o_i5_lenov_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime: 04 secs
H2O_cluster_timezone: Africa/Harare
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.30.0.7
H2O_cluster_version_age: 1 year, 1 month and 2 days !!!
H2O_cluster_name: H2O_from_python_i5_lenov_v8qjih
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 2.975 Gb
H2O_cluster_total_cores: 4
H2O_cluster_allowed_cores: 4


In [4]:
h2o_data = initialize_h2o.H2OFrame(initial_data)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [5]:
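# Feature frames built with pandas (column positions 0-18 and 19-20).
int_x = initial_data.iloc[:, 0:19]
fin_x = initial_data.iloc[:, 19:21]
x_combined = pd.concat([int_x, fin_x], axis=1)
x_list = list(x_combined.columns)
# The response is the column at position 19.
y_list = initial_data.columns[19]
y = y_list
# Predictors passed to H2O: every column name except the response.
x = h2o_data.col_names
x.remove(y_list)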

In [6]:
# Split into roughly 80% training, 10% validation, and the remainder for testing.
h2o_training_data, h2o_validation_data, h2o_test_data = h2o_data.split_frame(ratios=[0.8, 0.1])
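
`split_frame` returns one more frame than the number of ratios given, and H2O's documentation notes that the split is probabilistic, so the frame sizes are approximate rather than exact. The sketch below illustrates that idea in plain Python. The helper `split_rows` is hypothetical and is not H2O's implementation.

```python
import random


def split_rows(rows, ratios, seed=0):
    """Assign each row to one of len(ratios) + 1 buckets at random.

    Hypothetical helper for illustration; bucket sizes are approximate.
    """
    rng = random.Random(seed)
    # Cumulative thresholds, e.g. [0.8, 0.1] -> [0.8, 0.9].
    thresholds = []
    total = 0.0
    for ratio in ratios:
        total += ratio
        thresholds.append(total)
    buckets = [[] for _ in range(len(ratios) + 1)]
    for row in rows:
        draw = rng.random()
        # First bucket whose threshold exceeds the draw, else the last bucket.
        index = next((i for i, t in enumerate(thresholds) if draw < t), len(ratios))
        buckets[index].append(row)
    return buckets


train, valid, test = split_rows(range(10_000), ratios=[0.8, 0.1], seed=42)
print(len(train), len(valid), len(test))
```

In H2O itself, `split_frame` also accepts a `seed` argument, which makes the split repeatable between runs.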

## Develop the AutoML Model

In [7]:
from h2o.automl import H2OAutoML
h2o_automatic_ml = H2OAutoML(max_runtime_secs=240)
h2o_automatic_ml.train(x=x, y=y, training_frame=h2o_training_data, validation_frame=h2o_validation_data)

AutoML progress: |
11:57:21.755: User specified a validation frame with cross-validation still enabled. Please note that the models will still be validated using cross-validation only, the validation frame will be used to provide purely informative validation metrics on the trained models.
11:57:21.787: AutoML: XGBoost is not available; skipping it.

████████████████████████████████████████████████████████| 100%


## Leaderboard and Leader

In [8]:
h2o_method_ranking = h2o_automatic_ml.leaderboard
h2o_method_ranking

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| StackedEnsemble_AllModels_AutoML_20210824_115721 | 18371.3 | 135.541 | 18371.3 | 86.6257 | 0.482043 |
| StackedEnsemble_BestOfFamily_AutoML_20210824_115721 | 18566.8 | 136.26 | 18566.8 | 87.7959 | 0.477828 |
| GBM_1_AutoML_20210824_115721 | 19392.8 | 139.258 | 19392.8 | 90.0652 | 0.490441 |
| GBM_3_AutoML_20210824_115721 | 19397.8 | 139.276 | 19397.8 | 90.3467 | 0.490855 |
| GBM_2_AutoML_20210824_115721 | 19467.3 | 139.525 | 19467.3 | 89.9908 | 0.491887 |
| GBM_grid__1_AutoML_20210824_115721_model_4 | 19607.9 | 140.028 | 19607.9 | 91.3311 | 0.495784 |
| GBM_4_AutoML_20210824_115721 | 19633.4 | 140.119 | 19633.4 | 89.9553 | 0.499693 |
| XRT_1_AutoML_20210824_115721 | 19717.6 | 140.42 | 19717.6 | 89.7474 | 0.485144 |
| GBM_grid__1_AutoML_20210824_115721_model_2 | 19805.8 | 140.733 | 19805.8 | 91.7092 | 0.494005 |
| GBM_grid__1_AutoML_20210824_115721_model_9 | 19858.6 | 140.921 | 19858.6 | 90.0839 | 0.521174 |
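
The leaderboard columns are related to one another: in the output above, `mean_residual_deviance` equals `mse` for every row, and `rmse` is the square root of `mse`. The short check below, a sketch using values copied from the first three rows, confirms the relationship and picks out the row with the lowest error.

```python
import math

# (model_id, mse, rmse) copied from the leaderboard output.
rows = [
    ("StackedEnsemble_AllModels_AutoML_20210824_115721", 18371.3, 135.541),
    ("StackedEnsemble_BestOfFamily_AutoML_20210824_115721", 18566.8, 136.26),
    ("GBM_1_AutoML_20210824_115721", 19392.8, 139.258),
]

for model_id, mse, rmse in rows:
    # RMSE is the square root of MSE; the tolerance covers display rounding.
    assert math.isclose(math.sqrt(mse), rmse, abs_tol=0.01), model_id

# The rows appear in ascending order of mse, so the lowest-mse row is the leader.
leader_id = min(rows, key=lambda row: row[1])[0]
print(leader_id)
```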




In [9]:
highest_ranking_method = h2o_automatic_ml.leader
highest_ranking_method

Model Details
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_AllModels_AutoML_20210824_115721

No model summary for this model

ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **

MSE: 4900.131929292816
RMSE: 70.00094234574857
MAE: 45.791336342731775
RMSLE: 0.3107457400431741
R^2: 0.9421217261180531
Mean Residual Deviance: 4900.131929292816
Null degrees of freedom: 7288
Residual degrees of freedom: 7276
Null deviance: 617106545.1167994
Residual deviance: 35717061.632615335
AIC: 82648.0458245655

ModelMetricsRegressionGLM: stackedensemble
** Reported on validation data. **

MSE: 17967.946605287365
RMSE: 134.04456947331872
MAE: 83.66742154264892
RMSLE: 0.43883929226519075
R^2: 0.7718029372846945
Mean Residual Deviance: 17967.946605287365
Null degrees of freedom: 930
Residual degrees of freedom: 918
Null deviance: 73564698.35450743
Residual deviance: 16728158.289522538
AIC: 11790.460469478105
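
The reported figures can be cross-checked against one another. The sketch below assumes that the number of rows equals the null degrees of freedom plus one, and that MSE is the residual deviance divided by that row count. Both assumptions are consistent with the numbers printed above, but they are inferred from the output rather than stated by it.

```python
import math

# Values copied from the training and validation metric blocks.
reports = {
    "train": {"mse": 4900.131929292816, "rmse": 70.00094234574857,
              "null_dof": 7288, "residual_deviance": 35717061.632615335},
    "validation": {"mse": 17967.946605287365, "rmse": 134.04456947331872,
                   "null_dof": 930, "residual_deviance": 16728158.289522538},
}

for name, report in reports.items():
    # Assumption: row count = null degrees of freedom + 1.
    n_rows = report["null_dof"] + 1
    mse_from_deviance = report["residual_deviance"] / n_rows
    assert math.isclose(mse_from_deviance, report["mse"], rel_tol=1e-8), name
    assert math.isclose(math.sqrt(report["mse"]), report["rmse"], rel_tol=1e-8), name

# Ratio of validation RMSE to training RMSE, a rough indicator of overfitting.
rmse_ratio = reports["validation"]["rmse"] / reports["train"]["rmse"]
print(f"validation RMSE is {rmse_ratio:.2f}x the training RMSE")
```

A validation RMSE close to twice the training RMSE suggests that the ensemble fits the training rows more closely than rows it has not seen, which is worth keeping in mind when reading the training R^2.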

ModelMetricsRegressionGLM: stackedensemble
**

