In DRF, the `checkpoint` parameter lets you resume training from a previously built model, adding more trees either on the same dataset or on new data.

**Note:** The following parameters cannot be modified during checkpointing:


*   build_tree_one_node
*   max_depth
*   min_rows
*   nbins
*   nbins_cats
*   nbins_top_level
*   sample_rate
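As a rough illustration of the note above, the following sketch compares a proposed parameter set against the original one and reports any fixed parameter whose value would change. The helper name `changed_frozen_params` and the parameter values are hypothetical and chosen only for illustration; H2O performs its own validation when a checkpointed model is built.

```python
# Parameters listed in the note above as fixed once a DRF checkpoint exists.
FROZEN_DRF_PARAMS = frozenset({
    "build_tree_one_node", "max_depth", "min_rows", "nbins",
    "nbins_cats", "nbins_top_level", "sample_rate",
})


def changed_frozen_params(original, proposed):
    """Return the sorted names of fixed parameters whose values differ."""
    return sorted(
        name for name in FROZEN_DRF_PARAMS
        if name in original and name in proposed
        and original[name] != proposed[name]
    )


# Illustrative values only (hypothetical, not taken from the example below).
original = {"max_depth": 20, "ntrees": 1, "sample_rate": 0.632}
proposed = {"max_depth": 10, "ntrees": 9, "sample_rate": 0.632}

conflicts = changed_frozen_params(original, proposed)
print(conflicts)  # ['max_depth']
```

`ntrees` is not in the list, which is why it can be raised when training continues.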

The following example **demonstrates how to build a distributed random forest model that will later be used for checkpointing**, and then how to continue training that model on new data. It uses the Titanic dataset and, as a regression task, predicts a passenger's fare (`Fare`) from age, passenger class, port of embarkation, and sex.

In [1]:
import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
; Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.12+8-LTS-237, mixed mode)
  Starting server from C:\ProgramData\Anaconda3\Lib\site-packages\h2o\backend\bin\h2o.jar
  Ice root: C:\Users\Asus\AppData\Local\Temp\tmpqjoe1gql
  JVM stdout: C:\Users\Asus\AppData\Local\Temp\tmpqjoe1gql\h2o_Asus_started_from_python.out
  JVM stderr: C:\Users\Asus\AppData\Local\Temp\tmpqjoe1gql\h2o_Asus_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| Property | Value |
|---|---|
| H2O_cluster_uptime | 09 secs |
| H2O_cluster_timezone | Asia/Colombo |
| H2O_data_parsing_timezone | UTC |
| H2O_cluster_version | 3.36.0.2 |
| H2O_cluster_version_age | 28 days, 18 hours and 33 minutes |
| H2O_cluster_name | H2O_from_python_Asus_v4ez02 |
| H2O_cluster_total_nodes | 1 |
| H2O_cluster_free_memory | 1.973 Gb |
| H2O_cluster_total_cores | 8 |
| H2O_cluster_allowed_cores | 8 |


## Data Processing

In [2]:
import pandas as pd

In [3]:
train = pd.read_csv("D:\\DC Universe\\Ucsc\\Third Year\\ENH 3201 Industrial Placements\\H20 Applications\\H20 ML Notebooks\\H20Csv\\Titanic\\titanic_train.csv")
test = pd.read_csv("D:\\DC Universe\\Ucsc\\Third Year\\ENH 3201 Industrial Placements\\H20 Applications\\H20 ML Notebooks\\H20Csv\\Titanic\\titanic_test.csv")
subs = pd.read_csv('D:\\DC Universe\\Ucsc\\Third Year\\ENH 3201 Industrial Placements\\H20 Applications\\H20 ML Notebooks\\H20Csv\\Titanic\\gender_submission.csv')

drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp','Parch']
train = train.drop(drop_elements, axis = 1)
test = test.drop(drop_elements, axis = 1)

def checkNull_fillData(df):
    # Fill missing values in place: column mean for numeric columns,
    # most frequent value (mode) for all other columns.
    for col in df.columns:
        if df[col].isnull().any():
            if df[col].dtype == "float64" or df[col].dtype == "int64":
                df.loc[df[col].isnull(), col] = df[col].mean()
            else:
                df.loc[df[col].isnull(), col] = df[col].mode()[0]

checkNull_fillData(train)
checkNull_fillData(test)

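# Separate string-valued columns (to be one-hot encoded) from numeric ones.
# The type is inferred from the value in the second row of each column.
str_list = []
num_list = []
for colname, colvalue in train.items():  # iteritems() was removed in pandas 2.0
    if isinstance(colvalue.iloc[1], str):
        str_list.append(colname)
    else:
        num_list.append(colname)
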
train = pd.get_dummies(train, columns=str_list)
test = pd.get_dummies(test, columns=str_list)

train = h2o.H2OFrame(train)
test = h2o.H2OFrame(test)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
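Because `pd.get_dummies` is applied to `train` and `test` separately, a category missing from one frame produces no column there, and a checkpointed model can only continue training on frames that contain the same predictor columns. The following is a minimal sketch of one way to align the two frames, assuming pandas is available; the toy frames and the names `train_raw`, `test_raw`, and `test_aligned` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy frames (hypothetical data): the test frame has no "C" or "Q" embarkations.
train_raw = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Embarked": ["S", "C", "Q"]})
test_raw = pd.DataFrame({"Age": [34.0, 47.0], "Embarked": ["S", "S"]})

train_enc = pd.get_dummies(train_raw, columns=["Embarked"])
test_enc = pd.get_dummies(test_raw, columns=["Embarked"])

# Give the test frame exactly the training columns, filling absent dummies with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))      # columns before alignment
print(list(test_aligned.columns))  # same columns and order as train_enc
```

In the notebook itself both Titanic files happen to contain every category used by the predictors, so this step is a safeguard rather than a fix.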


In [4]:
# Split the data into training and validation sets, and take a piece off
# to demonstrate adding new data with checkpointing.
# In a real-world scenario, however, you would not have your
# new data at this point.
train1, valid1, new_data1 = train.split_frame(ratios = [.7, .15], seed = 1434)
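The `ratios` list passed to `split_frame` has one entry fewer than the number of frames returned, because the last frame receives whatever fraction is left over. The sketch below only illustrates that arithmetic with a hypothetical helper, `split_fractions`; H2O assigns rows to splits probabilistically, so actual row counts match these fractions only approximately.

```python
def split_fractions(ratios):
    """Expected fraction of rows in each frame for split_frame-style ratios."""
    remainder = 1.0 - sum(ratios)
    if remainder <= 0:
        raise ValueError("ratios must sum to less than 1")
    # One frame per ratio, plus a final frame holding the remaining rows.
    return list(ratios) + [remainder]


fractions = split_fractions([0.7, 0.15])
print([round(f, 2) for f in fractions])  # [0.7, 0.15, 0.15]
```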

In [5]:
# List the columns available after one-hot encoding
train.columns

['Survived',
 'Pclass',
 'Age',
 'Fare',
 'Sex_female',
 'Sex_male',
 'Embarked_C',
 'Embarked_Q',
 'Embarked_S']

In [6]:
predictors = ["Age","Embarked_C","Pclass","Embarked_Q","Sex_male"]
response = "Fare"

In [7]:
titanic = H2ORandomForestEstimator(model_id="titanic", ntrees = 1, seed = 1234)
titanic.train(x = predictors, y = response, training_frame = train1, validation_frame = valid1)

drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  titanic


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 1.0 | 1.0 | 2176.0 | 16.0 | 16.0 | 16.0 | 168.0 | 168.0 | 168.0 |




ModelMetricsRegression: drf
** Reported on train data. **

MSE: 2896.4105250593734
RMSE: 53.81831031404993
MAE: 23.955113317212486
RMSLE: 0.7857759836364192
Mean Residual Deviance: 2896.4105250593734

ModelMetricsRegression: drf
** Reported on validation data. **

MSE: 1486.891125675782
RMSE: 38.56022725135035
MAE: 21.415570370852947
RMSLE: 0.8204757093224551
Mean Residual Deviance: 1486.891125675782

Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | validation_rmse | validation_mae | validation_deviance |
|---|---|---|---|---|---|---|---|---|
| 2022-02-23 14:50:14 | 0.048 sec | 0.0 | | | | | | |
| 2022-02-23 14:50:14 | 0.225 sec | 1.0 | 53.81831 | 23.955113 | 2896.410525 | 38.560227 | 21.41557 | 1486.891126 |



Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| Pclass | 624871.1875 | 1.0 | 0.714952 |
| Age | 150517.390625 | 0.240877 | 0.172216 |
| Sex_male | 57220.835938 | 0.091572 | 0.06547 |
| Embarked_C | 40736.011719 | 0.065191 | 0.046608 |
| Embarked_Q | 658.883423 | 0.001054 | 0.000754 |




In [11]:
print('Validation MSE for DRF:', titanic.mse(valid=True))

Validation MSE for DRF: 1486.891125675782


## Model Training Iteratively

In [12]:
train2, valid2, new_data2 = test.split_frame(ratios = [.7, .15], seed = 1434)

In [13]:
# Checkpoint on a split of the test dataset. With a checkpoint, ntrees is the
# total number of trees, so ntrees = 9 adds 8 trees on top of the first 1.
titanic_continued = H2ORandomForestEstimator(model_id = 'titanic_new',
                                         checkpoint = titanic,
                                         ntrees = 9,
                                         seed = 1234)
titanic_continued.train(x = predictors, y = response, training_frame = train2, validation_frame = valid2)

drf Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  titanic_new


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 9.0 | 9.0 | 12298.0 | 13.0 | 17.0 | 12.888889 | 54.0 | 168.0 | 85.44444 |




ModelMetricsRegression: drf
** Reported on train data. **

MSE: 2755.4993583134956
RMSE: 52.49285054475033
MAE: 24.397458545397814
RMSLE: 0.7584161744236164
Mean Residual Deviance: 2755.4993583134956

ModelMetricsRegression: drf
** Reported on validation data. **

MSE: 822.6941719965641
RMSE: 28.682645833265873
MAE: 13.717499333984998
RMSLE: 0.5177062781196887
Mean Residual Deviance: 822.6941719965641

Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | validation_rmse | validation_mae | validation_deviance |
|---|---|---|---|---|---|---|---|---|
| 2022-02-23 14:50:14 | -8 min -30.-630 sec | 0.0 | | | | | | |
| 2022-02-23 14:50:14 | -8 min -30.-453 sec | 1.0 | 38.572675 | 21.935504 | 1487.851279 | 24.886392 | 13.824573 | 619.332498 |
| 2022-02-23 14:58:45 | 0.048 sec | 2.0 | 63.287964 | 29.231809 | 4005.366382 | 21.84094 | 11.855786 | 477.026656 |
| 2022-02-23 14:58:45 | 0.082 sec | 3.0 | 54.68191 | 24.907853 | 2990.111258 | 24.738467 | 12.760479 | 611.991757 |
| 2022-02-23 14:58:45 | 0.111 sec | 4.0 | 53.427153 | 23.902505 | 2854.460632 | 29.70237 | 14.344334 | 882.230785 |
| 2022-02-23 14:58:45 | 0.125 sec | 5.0 | 51.629499 | 23.555156 | 2665.605167 | 29.484918 | 14.160346 | 869.360417 |
| 2022-02-23 14:58:45 | 0.136 sec | 6.0 | 45.627356 | 22.82181 | 2081.85563 | 27.610229 | 13.413434 | 762.324767 |
| 2022-02-23 14:58:45 | 0.146 sec | 7.0 | 52.336204 | 24.153845 | 2739.078227 | 26.228975 | 12.812463 | 687.959146 |
| 2022-02-23 14:58:45 | 0.154 sec | 8.0 | 50.96083 | 23.780387 | 2597.006203 | 28.313916 | 13.474149 | 801.67783 |
| 2022-02-23 14:58:45 | 0.164 sec | 9.0 | 52.492851 | 24.397459 | 2755.499358 | 28.682646 | 13.717499 | 822.694172 |



Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| Age | 1600387.0 | 1.0 | 0.438148 |
| Pclass | 1343119.0 | 0.839246 | 0.367714 |
| Embarked_C | 350263.3 | 0.218862 | 0.095894 |
| Sex_male | 293922.3 | 0.183657 | 0.080469 |
| Embarked_Q | 64926.91 | 0.04057 | 0.017775 |
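The model summary above reports 9 trees in total, which suggests that `ntrees` on a checkpointed model is the final tree count rather than the number of trees to add. The sketch below makes that arithmetic explicit. `total_ntrees` and `continue_drf` are hypothetical helpers, and passing `prev_model.model_id` as the checkpoint is an assumption based on common H2O usage; the notebook passes the model object itself.

```python
def total_ntrees(trees_in_checkpoint, additional_trees):
    """ntrees to request so that `additional_trees` are added to a checkpoint."""
    if additional_trees < 1:
        raise ValueError("a continued model must add at least one tree")
    return trees_in_checkpoint + additional_trees


def continue_drf(prev_model, trees_in_checkpoint, additional_trees,
                 x, y, training_frame, validation_frame=None, seed=1234):
    """Hypothetical wrapper: add `additional_trees` trees on top of `prev_model`."""
    # Imported lazily so the arithmetic helper works without an H2O install.
    from h2o.estimators.random_forest import H2ORandomForestEstimator

    model = H2ORandomForestEstimator(
        checkpoint=prev_model.model_id,  # assumption: checkpoint given by model ID
        ntrees=total_ntrees(trees_in_checkpoint, additional_trees),
        seed=seed,
    )
    model.train(x=x, y=y, training_frame=training_frame,
                validation_frame=validation_frame)
    return model


print(total_ntrees(1, 8))  # 9: the setting used above adds 8 trees
print(total_ntrees(1, 9))  # 10: what adding 9 trees would require
```

With a running cluster, a call such as `continue_drf(titanic, 1, 9, predictors, response, train2, valid2)` would then request 10 trees in total.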




In [22]:
print('Validation RMSE for DRF:', titanic_continued.rmse(valid=True))

Validation RMSE for DRF: 28.682645833265873


In [23]:
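# Difference in validation RMSE between the first and the continued model.
# Note: the two values are scored on different frames (valid1 and valid2).
rmse_improvement = titanic.rmse(valid=True) - titanic_continued.rmse(valid=True)

In [24]:
print("Improvement in validation RMSE:", rmse_improvement)

Improvement in validation RMSE: 9.877581418084475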
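To make the relationship between the reported metrics explicit, the following sketch recomputes the comparison from the validation numbers printed in the two model reports above. It only checks that each RMSE is the square root of the corresponding MSE and expresses the difference as a fraction of the first model's RMSE. Since the two models were validated on different frames (`valid1` and `valid2`), the difference is an indication rather than a strict like-for-like comparison. The variable names are illustrative.

```python
import math

# Validation metrics copied from the two model reports above.
first_mse, first_rmse = 1486.891125675782, 38.56022725135035
continued_mse, continued_rmse = 822.6941719965641, 28.682645833265873

# RMSE is the square root of MSE, so the reported pairs should agree.
assert math.isclose(math.sqrt(first_mse), first_rmse, rel_tol=1e-9)
assert math.isclose(math.sqrt(continued_mse), continued_rmse, rel_tol=1e-9)

rmse_drop = first_rmse - continued_rmse
relative_drop = rmse_drop / first_rmse
print(f"RMSE drop: {rmse_drop:.4f} ({relative_drop:.1%} of the first model's RMSE)")
```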
