# Checkpointing Models
In real-world scenarios, data can change. For example, you may have a model in production that was built using **1 million records**, and later receive several hundred thousand more. Rather than building a new model from scratch, you can use the `checkpoint` option to create a new model based on the existing model.

The `checkpoint` option is available for the DRF, GBM, XGBoost, and Deep Learning algorithms. It takes the model key of a previously trained model and builds the new model as a continuation of that model. If `checkpoint` is not specified, the algorithm trains a new model from scratch instead of continuing a previous one.
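As a rough illustration of the tree count when continuing a tree-based model: `ntrees` on the continued model is the total number of trees, not the number of trees to add, so it has to be larger than the tree count of the checkpointed model. The helper below is a hypothetical plain-Python sketch (`additional_trees` is not part of the H2O API) that only performs this arithmetic.

```python
def additional_trees(checkpoint_ntrees: int, requested_ntrees: int) -> int:
    """Return how many new trees a continued model would add.

    Hypothetical helper, not an H2O function. It assumes `requested_ntrees`
    is the total tree count of the continued model.
    """
    if requested_ntrees <= checkpoint_ntrees:
        raise ValueError(
            "ntrees must be greater than the number of trees in the checkpoint"
        )
    return requested_ntrees - checkpoint_ntrees


# A 5-tree model continued with ntrees=8 gains 3 new trees.
print(additional_trees(5, 8))
```

With the values used later in this notebook, a 5-tree checkpoint and `ntrees = 8`, the continued model adds 3 trees.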

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [21]:
!pip install h2o



In [22]:
import h2o
from h2o.automl import H2OAutoML
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O_cluster_uptime: 10 mins 11 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.36.0.3
H2O_cluster_version_age: 2 days
H2O_cluster_name: H2O_from_python_unknownUser_z4ftwb
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.172 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2


In [23]:
import pandas as pd

In [25]:
# Note: both frames are read from the same file, so `train` and `test` hold identical data.
train = pd.read_csv('/content/titanic_test.csv')
test = pd.read_csv('/content/titanic_test.csv')
#subs = pd.read_csv('D:\\DC Universe\\Ucsc\\Third Year\\ENH 3201 Industrial Placements\\H20 Applications\\H20 ML Notebooks\\H20Csv\\Titanic\\gender_submission.csv')

drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp','Parch']
train = train.drop(drop_elements, axis = 1)
test = test.drop(drop_elements, axis = 1)

def checkNull_fillData(df):
    """Fill missing values in place: column mean for numeric columns, mode otherwise."""
    for col in df.columns:
        mask = df[col].isnull()
        if mask.any():
            if df[col].dtype == "float64" or df[col].dtype == "int64":
                df.loc[mask, col] = df[col].mean()
            else:
                df.loc[mask, col] = df[col].mode()[0]
                
checkNull_fillData(train)
checkNull_fillData(test)

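# Separate string columns (to be one-hot encoded) from numeric columns.
str_list = []
num_list = []
for colname, colvalue in train.items():  # DataFrame.iteritems() was removed in pandas 2.0
    if isinstance(colvalue.iloc[1], str):
        str_list.append(colname)
    else:
        num_list.append(colname)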
        
train = pd.get_dummies(train, columns=str_list)
test = pd.get_dummies(test, columns=str_list)

train = h2o.H2OFrame(train)
test = h2o.H2OFrame(test)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [26]:
train.describe()

Rows:418
Cols:8




| | Pclass | Age | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|
| type | int | real | real | int | int | int | int | int |
| mins | 1.0 | 0.17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mean | 2.2655502392344493 | 30.27259036144579 | 35.627188489208606 | 0.36363636363636365 | 0.6363636363636364 | 0.24401913875598086 | 0.11004784688995216 | 0.645933014354067 |
| maxs | 3.0 | 76.0 | 512.3292 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| sigma | 0.8418375519640504 | 12.634534168325061 | 55.840500479541056 | 0.4816221409322308 | 0.4816221409322308 | 0.4300188157360399 | 0.3133244005170708 | 0.4788026786626084 |
| zeros | 0 | 0 | 2 | 266 | 152 | 316 | 372 | 148 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 3.0 | 34.5 | 7.8292 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 3.0 | 47.0 | 7.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 2.0 | 62.0 | 9.6875 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |


In [27]:
test.describe()


Rows:418
Cols:8




| | Pclass | Age | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|
| type | int | real | real | int | int | int | int | int |
| mins | 1.0 | 0.17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mean | 2.2655502392344493 | 30.27259036144579 | 35.627188489208606 | 0.36363636363636365 | 0.6363636363636364 | 0.24401913875598086 | 0.11004784688995216 | 0.645933014354067 |
| maxs | 3.0 | 76.0 | 512.3292 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| sigma | 0.8418375519640504 | 12.634534168325061 | 55.840500479541056 | 0.4816221409322308 | 0.4816221409322308 | 0.4300188157360399 | 0.3133244005170708 | 0.4788026786626084 |
| zeros | 0 | 0 | 2 | 266 | 152 | 316 | 372 | 148 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 3.0 | 34.5 | 7.8292 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 3.0 | 47.0 | 7.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 2.0 | 62.0 | 9.6875 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |


In [28]:
train1, valid1, new_data1 = train.split_frame(ratios = [.7, .15], seed = 1234)
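Passing two ratios to `split_frame` yields three frames: the third frame receives whatever fraction is left over. The sketch below only does that arithmetic in plain Python; `implied_fractions` is a hypothetical helper, not an H2O function. H2O assigns rows to splits probabilistically, so the real frame sizes are approximate and may differ from the rounded row counts computed here.

```python
def implied_fractions(ratios):
    """Return the given ratios plus the leftover fraction for the final split.

    Hypothetical helper for illustration; it does not call H2O.
    """
    remainder = 1.0 - sum(ratios)
    if any(r <= 0 for r in ratios) or remainder <= 0:
        raise ValueError("ratios must be positive and sum to less than 1")
    return list(ratios) + [remainder]


fractions = implied_fractions([0.7, 0.15])

# Approximate row counts for the 418-row frame used in this notebook.
expected_rows = [round(418 * f) for f in fractions]
print(fractions, expected_rows)
```

Here `train1` gets roughly 70% of the rows, while `valid1` and `new_data1` each get roughly 15%.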

In [29]:
#train.columns
test.columns

['Pclass',
 'Age',
 'Fare',
 'Sex_female',
 'Sex_male',
 'Embarked_C',
 'Embarked_Q',
 'Embarked_S']

In [30]:
predictors = ["Age","Embarked_C","Pclass","Embarked_Q","Sex_male"]
response = "Fare"

In [31]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator


In [36]:
titanic = H2OGradientBoostingEstimator(model_id="titanic", ntrees = 5, seed = 1234)
titanic.train(x = predictors, y = response, training_frame = train1, validation_frame = valid1)

gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  titanic


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 5.0 | 5.0 | 1278.0 | 5.0 | 5.0 | 5.0 | 15.0 | 16.0 | 15.8 |




ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 2194.9895108045826
RMSE: 46.85071515787761
MAE: 25.069911384582518
RMSLE: 0.8685462902193672
Mean Residual Deviance: 2194.9895108045826

ModelMetricsRegression: gbm
** Reported on validation data. **

MSE: 1612.506422357488
RMSE: 40.156025978145394
MAE: 24.73471845608069
RMSLE: 1.0001335954963682
Mean Residual Deviance: 1612.506422357488

Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | validation_rmse | validation_mae | validation_deviance |
|---|---|---|---|---|---|---|---|---|
| 2022-02-19 15:50:13 | 0.009 sec | 0.0 | 57.725725 | 33.647224 | 3332.259312 | 51.34511 | 32.65433 | 2636.320304 |
| 2022-02-19 15:50:13 | 0.039 sec | 1.0 | 54.777215 | 31.293965 | 3000.543259 | 48.271843 | 30.363186 | 2330.170778 |
| 2022-02-19 15:50:13 | 0.058 sec | 2.0 | 52.267133 | 29.199146 | 2731.853205 | 45.667576 | 28.301156 | 2085.52748 |
| 2022-02-19 15:50:13 | 0.080 sec | 3.0 | 50.141942 | 27.416527 | 2514.214307 | 43.478391 | 26.694602 | 1890.370457 |
| 2022-02-19 15:50:13 | 0.096 sec | 4.0 | 48.352026 | 26.076528 | 2337.918417 | 41.656904 | 25.649103 | 1735.297619 |
| 2022-02-19 15:50:13 | 0.107 sec | 5.0 | 46.850715 | 25.069911 | 2194.989511 | 40.156026 | 24.734718 | 1612.506422 |



Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| Pclass | 1336927.0 | 1.0 | 0.732315 |
| Age | 270011.7 | 0.201964 | 0.147902 |
| Sex_male | 129827.6 | 0.097109 | 0.071114 |
| Embarked_C | 87647.04 | 0.065559 | 0.04801 |
| Embarked_Q | 1203.815 | 0.0009 | 0.000659 |
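The `scaled_importance` and `percentage` columns above appear to be derived from `relative_importance`: each value divided by the largest value, and each value divided by the column total. The plain-Python sketch below recomputes both from the rounded numbers printed in the table, so the last digits may differ slightly from what H2O reports.

```python
# relative_importance values copied from the Variable Importances output above.
relative_importance = {
    "Pclass": 1336927.0,
    "Age": 270011.7,
    "Sex_male": 129827.6,
    "Embarked_C": 87647.04,
    "Embarked_Q": 1203.815,
}

largest = max(relative_importance.values())
total = sum(relative_importance.values())

# Scale by the most important variable, and by the sum over all variables.
scaled = {name: value / largest for name, value in relative_importance.items()}
percentage = {name: value / total for name, value in relative_importance.items()}

for name in relative_importance:
    print(f"{name:<11} {scaled[name]:.6f} {percentage[name]:.6f}")
```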




In [33]:
# Load a saved model:
# saved_model = h2o.load_model(model_path)

# Download the model built above to your local machine:
# my_local_model = h2o.download_model(model, path="/Users/UserName/Desktop")

# Upload the model that you just downloaded to the H2O cluster:
# uploaded_model = h2o.upload_model(my_local_model)

In [37]:
model_path = h2o.save_model(model=titanic, path="/content/", force=True)
print(model_path)

/content/titanic


In [38]:
titanic = h2o.load_model("/content/titanic")

In [39]:
titanic

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  titanic


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 5.0 | 5.0 | 1278.0 | 5.0 | 5.0 | 5.0 | 15.0 | 16.0 | 15.8 |




ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 2194.9895108045826
RMSE: 46.85071515787761
MAE: 25.069911384582518
RMSLE: 0.8685462902193672
Mean Residual Deviance: 2194.9895108045826

ModelMetricsRegression: gbm
** Reported on validation data. **

MSE: 1612.506422357488
RMSE: 40.156025978145394
MAE: 24.73471845608069
RMSLE: 1.0001335954963682
Mean Residual Deviance: 1612.506422357488

Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | validation_rmse | validation_mae | validation_deviance |
|---|---|---|---|---|---|---|---|---|
| 2022-02-19 15:50:13 | 0.009 sec | 0.0 | 57.725725 | 33.647224 | 3332.259312 | 51.34511 | 32.65433 | 2636.320304 |
| 2022-02-19 15:50:13 | 0.039 sec | 1.0 | 54.777215 | 31.293965 | 3000.543259 | 48.271843 | 30.363186 | 2330.170778 |
| 2022-02-19 15:50:13 | 0.058 sec | 2.0 | 52.267133 | 29.199146 | 2731.853205 | 45.667576 | 28.301156 | 2085.52748 |
| 2022-02-19 15:50:13 | 0.080 sec | 3.0 | 50.141942 | 27.416527 | 2514.214307 | 43.478391 | 26.694602 | 1890.370457 |
| 2022-02-19 15:50:13 | 0.096 sec | 4.0 | 48.352026 | 26.076528 | 2337.918417 | 41.656904 | 25.649103 | 1735.297619 |
| 2022-02-19 15:50:13 | 0.107 sec | 5.0 | 46.850715 | 25.069911 | 2194.989511 | 40.156026 | 24.734718 | 1612.506422 |



Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| Pclass | 1336927.0 | 1.0 | 0.732315 |
| Age | 270011.7 | 0.201964 | 0.147902 |
| Sex_male | 129827.6 | 0.097109 | 0.071114 |
| Embarked_C | 87647.04 | 0.065559 | 0.04801 |
| Embarked_Q | 1203.815 | 0.0009 | 0.000659 |
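The reloaded model reports the same metrics as the model before saving. As a quick sanity check on those numbers, RMSE is the square root of MSE, so the two reported values should agree. The sketch below verifies this in plain Python using the figures copied from the output above; it does not call H2O.

```python
import math

# MSE and RMSE values copied from the model output above.
reported = {
    "train": {"MSE": 2194.9895108045826, "RMSE": 46.85071515787761},
    "validation": {"MSE": 1612.506422357488, "RMSE": 40.156025978145394},
}

# Recompute RMSE from MSE and compare it with the reported value.
consistent = {}
for split, metrics in reported.items():
    recomputed = math.sqrt(metrics["MSE"])
    consistent[split] = math.isclose(recomputed, metrics["RMSE"], rel_tol=1e-9)
    print(split, recomputed, consistent[split])
```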




In [40]:
train2, valid2, new_data2 = test.split_frame(ratios = [.7, .15], seed = 1234)

In [43]:
# Checkpoint on the same dataset. This shows how to train an additional
# 3 trees on top of the first 5. To do this, set ntrees equal to 8.
titanic_continued = H2OGradientBoostingEstimator(model_id = 'titanic_new',
                                         checkpoint = titanic,
                                         ntrees = 8,
                                         seed = 1234)
titanic_continued.train(x = predictors, y = response, training_frame = train2, validation_frame = valid2)

gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  titanic_new


Model Summary: 


| number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|
| 8.0 | 8.0 | 2028.0 | 5.0 | 5.0 | 1.875 | 15.0 | 16.0 | 5.75 |




ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 1908.3015934682485
RMSE: 43.68411145334478
MAE: 22.744000509918713
RMSLE: 0.7740222048347649
Mean Residual Deviance: 1908.3015934682485

ModelMetricsRegression: gbm
** Reported on validation data. **

MSE: 1377.8362951902113
RMSE: 37.11921732997897
MAE: 23.018652381144108
RMSLE: 0.937037516419498
Mean Residual Deviance: 1377.8362951902113

Scoring History: 


| timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | validation_rmse | validation_mae | validation_deviance |
|---|---|---|---|---|---|---|---|---|
| 2022-02-19 15:50:13 | -3 min -25.-457 sec | 0.0 | 57.725725 | 33.647224 | 3332.259312 | 51.34511 | 32.65433 | 2636.320304 |
| 2022-02-19 15:50:13 | -3 min -25.-427 sec | 1.0 | 54.777215 | 31.293965 | 3000.543259 | 48.271843 | 30.363186 | 2330.170778 |
| 2022-02-19 15:50:13 | -3 min -25.-408 sec | 2.0 | 52.267133 | 29.199146 | 2731.853205 | 45.667576 | 28.301156 | 2085.52748 |
| 2022-02-19 15:50:13 | -3 min -25.-386 sec | 3.0 | 50.141942 | 27.416527 | 2514.214307 | 43.478391 | 26.694602 | 1890.370457 |
| 2022-02-19 15:50:13 | -3 min -25.-370 sec | 4.0 | 48.352026 | 26.076528 | 2337.918417 | 41.656904 | 25.649103 | 1735.297619 |
| 2022-02-19 15:50:13 | -3 min -25.-359 sec | 5.0 | 46.850715 | 25.069911 | 2194.989511 | 40.156026 | 24.734718 | 1612.506422 |
| 2022-02-19 15:53:39 | 0.051 sec | 6.0 | 45.597448 | 24.184156 | 2079.127228 | 38.930358 | 23.96176 | 1515.572803 |
| 2022-02-19 15:53:39 | 0.064 sec | 7.0 | 44.550875 | 23.423662 | 1984.780502 | 37.908211 | 23.390732 | 1437.032461 |
| 2022-02-19 15:53:39 | 0.083 sec | 8.0 | 43.684111 | 22.744001 | 1908.301593 | 37.119217 | 23.018652 | 1377.836295 |



Variable Importances: 


| variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|
| Pclass | 334878.09375 | 1.0 | 0.727664 |
| Age | 73266.90625 | 0.218787 | 0.159203 |
| Sex_male | 33110.59375 | 0.098874 | 0.071947 |
| Embarked_C | 18615.921875 | 0.05559 | 0.040451 |
| Embarked_Q | 337.994507 | 0.001009 | 0.000734 |
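The scoring history of `titanic_new` shows the effect of the checkpoint: the rows for trees 0 through 5 repeat the history of the original `titanic` model, and only trees 6 through 8 are new. The plain-Python sketch below checks this using the `validation_rmse` values copied from the two scoring histories in this notebook; it does not call H2O.

```python
# validation_rmse per scoring row (rows cover trees 0..N), copied from the outputs.
original_validation_rmse = [
    51.34511, 48.271843, 45.667576, 43.478391, 41.656904, 40.156026,
]
continued_validation_rmse = [
    51.34511, 48.271843, 45.667576, 43.478391, 41.656904, 40.156026,
    38.930358, 37.908211, 37.119217,
]

n = len(original_validation_rmse)

# The continued model should start with exactly the checkpointed history.
prefix_matches = continued_validation_rmse[:n] == original_validation_rmse

# Rows added by the continued training run (trees 6, 7 and 8).
new_rows = continued_validation_rmse[n:]

# Drop in validation RMSE between the 5-tree and 8-tree models.
improvement = original_validation_rmse[-1] - continued_validation_rmse[-1]

print(prefix_matches, len(new_rows), round(improvement, 6))
```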


