# Checkpoint with GBM
In GBM, `checkpoint` can be used to continue training on a previously generated model rather than rebuilding the model from scratch. For example, you may train a model with 50 trees and wonder what the model would look like if you trained 10 more.
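
The arithmetic behind that example can be sketched in a few lines. This is a minimal illustration, assuming that `ntrees` on the continued model is the total tree count rather than the increment; `additional_trees` is a hypothetical helper and not part of the H2O API.

```python
def additional_trees(checkpoint_ntrees: int, new_ntrees: int) -> int:
    """Return how many trees a continued run would add.

    Assumption: `ntrees` on the continued model is the total number of
    trees, so it has to be larger than the checkpoint's tree count.
    """
    if new_ntrees <= checkpoint_ntrees:
        raise ValueError(
            "ntrees for the continued model must exceed the checkpoint's ntrees"
        )
    return new_ntrees - checkpoint_ntrees


# The example from the text: 50 trees already built, 10 more wanted,
# so the continued model would be given ntrees = 60.
print(additional_trees(50, 60))
```

Under that assumption, passing `ntrees = 10` to continue a 50-tree model would be rejected rather than adding 10 trees.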

**Note:** The following parameters cannot be modified during checkpointing:


*   build_tree_one_node
*   max_depth
*   min_rows
*   nbins
*   nbins_cats
*   nbins_top_level
*   sample_rate
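
One way to catch an accidental change before training is to compare the two parameter sets yourself. The sketch below is plain Python and does not call H2O; `LOCKED_GBM_PARAMS` and `changed_locked_params` are hypothetical names introduced for illustration, and the parameter dictionaries are made-up examples.

```python
# Parameters the note above lists as fixed once a checkpoint exists.
LOCKED_GBM_PARAMS = {
    "build_tree_one_node",
    "max_depth",
    "min_rows",
    "nbins",
    "nbins_cats",
    "nbins_top_level",
    "sample_rate",
}


def changed_locked_params(checkpoint_params: dict, new_params: dict) -> list:
    """Return the locked parameter names whose values differ.

    Only names present in both dictionaries are compared; a parameter
    left unset in `new_params` is treated as unchanged.
    """
    return sorted(
        name
        for name in LOCKED_GBM_PARAMS
        if name in checkpoint_params
        and name in new_params
        and checkpoint_params[name] != new_params[name]
    )


# Hypothetical settings: ntrees may change, max_depth may not.
original = {"ntrees": 2, "max_depth": 5, "sample_rate": 1.0}
continued = {"ntrees": 20, "max_depth": 7, "sample_rate": 1.0}
print(changed_locked_params(original, continued))
```

Here only `max_depth` is reported, because `ntrees` is not in the locked set and `sample_rate` is unchanged.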

In [1]:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


| Property | Value |
|---|---|
| H2O_cluster_uptime: | 31 mins 54 secs |
| H2O_cluster_timezone: | Asia/Colombo |
| H2O_data_parsing_timezone: | UTC |
| H2O_cluster_version: | 3.36.0.2 |
| H2O_cluster_version_age: | 28 days, 19 hours and 5 minutes |
| H2O_cluster_name: | H2O_from_python_Asus_v4ez02 |
| H2O_cluster_total_nodes: | 1 |
| H2O_cluster_free_memory: | 1.972 Gb |
| H2O_cluster_total_cores: | 8 |
| H2O_cluster_allowed_cores: | 8 |


In [2]:
import pandas as pd

In [3]:
train = pd.read_csv("D:\\DC Universe\\Ucsc\\Third Year\\ENH 3201 Industrial Placements\\H20 Applications\\H20 ML Notebooks\\H20Csv\\Titanic\\titanic_train.csv")
test = pd.read_csv("D:\\DC Universe\\Ucsc\\Third Year\\ENH 3201 Industrial Placements\\H20 Applications\\H20 ML Notebooks\\H20Csv\\Titanic\\titanic_test.csv")
subs = pd.read_csv('D:\\DC Universe\\Ucsc\\Third Year\\ENH 3201 Industrial Placements\\H20 Applications\\H20 ML Notebooks\\H20Csv\\Titanic\\gender_submission.csv')

drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp','Parch']
train = train.drop(drop_elements, axis = 1)
test = test.drop(drop_elements, axis = 1)

def checkNull_fillData(df):
    # Fill missing values in place: column mean for numeric columns,
    # most frequent value (mode) for everything else.
    for col in df.columns:
        missing = df[col].isnull()
        if missing.any():
            if df[col].dtype == "float64" or df[col].dtype == "int64":
                df.loc[missing, col] = df[col].mean()
            else:
                df.loc[missing, col] = df[col].mode()[0]
                
checkNull_fillData(train)
checkNull_fillData(test)

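# Separate string-valued columns (to be one-hot encoded) from numeric ones.
str_list = []
num_list = []
# DataFrame.iteritems() was removed in pandas 2.0; items() is the replacement.
for colname, colvalue in train.items():
    if isinstance(colvalue.iloc[1], str):
        str_list.append(colname)
    else:
        num_list.append(colname)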
        
train = pd.get_dummies(train, columns=str_list)
test = pd.get_dummies(test, columns=str_list)

train = h2o.H2OFrame(train)
test = h2o.H2OFrame(test)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [4]:
train.describe()

Rows:891
Cols:9




| | Survived | Pclass | Age | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|
| type | int | int | real | real | int | int | int | int | int |
| mins | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mean | 0.3838383838383838 | 2.3086419753086447 | 29.699117647058795 | 32.20420796857465 | 0.35241301907968575 | 0.6475869809203143 | 0.18855218855218855 | 0.08641975308641975 | 0.7250280583613917 |
| maxs | 1.0 | 3.0 | 80.0 | 512.3292 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| sigma | 0.4865924542648575 | 0.8360712409770491 | 13.002015226002891 | 49.69342859718089 | 0.4779900708960982 | 0.4779900708960982 | 0.3913721645054733 | 0.28114069214170423 | 0.44675091003414663 |
| zeros | 549 | 0 | 0 | 15 | 577 | 314 | 723 | 814 | 245 |
| missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0.0 | 3.0 | 22.0 | 7.25 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 1.0 | 38.0 | 71.2833 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 3.0 | 26.0 | 7.925 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [5]:
train1, valid1, new_data1 = train.split_frame(ratios = [.75, .15], seed = 1534)
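
The two ratios above produce three frames because the remainder (here 0.10) goes to the last one. The sketch below works out the expected sizes for the 891-row training frame. It is plain arithmetic, not an H2O call; `expected_split_sizes` is a hypothetical helper, and since H2O documents `split_frame` as an approximate, probabilistic split, the real row counts will usually differ slightly from these figures.

```python
def expected_split_sizes(n_rows: int, ratios: list) -> list:
    """Return the expected row count of each split, remainder last."""
    remainder = 1.0 - sum(ratios)
    if remainder <= 0:
        raise ValueError("ratios must sum to less than 1")
    return [n_rows * r for r in list(ratios) + [remainder]]


# 891 rows split with ratios [.75, .15]; the last frame gets the rest.
sizes = expected_split_sizes(891, [0.75, 0.15])
for name, size in zip(["train1", "valid1", "new_data1"], sizes):
    print(name, round(size, 2))
```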

In [6]:
predictors = ["Age","Embarked_C","Pclass","Embarked_Q","Sex_male"]
response = "Fare"

In [7]:
titanic = H2OGradientBoostingEstimator(model_id="titanic", ntrees = 2, seed = 1234)
titanic.train(x = predictors, y = response, training_frame = train1, validation_frame = valid1)

gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  titanic


Model Summary: 


| | number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 2.0 | 668.0 | 5.0 | 5.0 | 5.0 | 22.0 | 22.0 | 22.0 |




ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 2395.7041054462316
RMSE: 48.94593042783262
MAE: 26.266039417175477
RMSLE: 1.0478910018098566
Mean Residual Deviance: 2395.7041054462316

ModelMetricsRegression: gbm
** Reported on validation data. **

MSE: 926.7994009418393
RMSE: 30.44338024828779
MAE: 22.56089052157988
RMSLE: 0.9509495122354346
Mean Residual Deviance: 926.7994009418393

Scoring History: 


| | timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | validation_rmse | validation_mae | validation_deviance |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-02-23 15:21:08 | 0.006 sec | 0.0 | 53.860245 | 29.999316 | 2900.926039 | 34.444175 | 26.134937 | 1186.401175 |
| 1 | 2022-02-23 15:21:08 | 0.045 sec | 1.0 | 51.202264 | 27.9796 | 2621.671863 | 32.159143 | 24.232488 | 1034.210474 |
| 2 | 2022-02-23 15:21:08 | 0.054 sec | 2.0 | 48.94593 | 26.266039 | 2395.704105 | 30.44338 | 22.560891 | 926.799401 |



Variable Importances: 


| | variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|---|
| 0 | Pclass | 1234623.0 | 1.0 | 0.695071 |
| 1 | Age | 351694.9 | 0.28486 | 0.197998 |
| 2 | Embarked_C | 113560.4 | 0.09198 | 0.063933 |
| 3 | Sex_male | 76375.74 | 0.061862 | 0.042998 |
| 4 | Embarked_Q | 0.0 | 0.0 | 0.0 |




In [18]:
train2, valid2, new_data2 = test.split_frame(ratios = [.75, .15], seed = 1234)

In [25]:
# Continue training from the `titanic` checkpoint, this time on a split of `test`.
# ntrees is the total tree count: ntrees = 20 adds 18 trees to the 2 already built.
titanic_continued = H2OGradientBoostingEstimator(model_id = 'titanic_new',
                                         checkpoint = titanic,
                                         ntrees = 20,
                                         seed = 1534)
titanic_continued.train(x = predictors, y = response, training_frame = train2, validation_frame = valid2)

gbm Model Build progress: |██████████████████████████████████████████████████████| (done) 100%
Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  titanic_new


Model Summary: 


| | number_of_trees | number_of_internal_trees | model_size_in_bytes | min_depth | max_depth | mean_depth | min_leaves | max_leaves | mean_leaves |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 20.0 | 20.0 | 5046.0 | 5.0 | 5.0 | 4.5 | 13.0 | 22.0 | 13.35 |




ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 1468.077880561137
RMSE: 38.31550444090664
MAE: 18.69618952311692
RMSLE: 0.588909451342704
Mean Residual Deviance: 1468.077880561137

ModelMetricsRegression: gbm
** Reported on validation data. **

MSE: 1061.578490231611
RMSE: 32.58187364519743
MAE: 19.588472111035518
RMSLE: 0.8143775847879972
Mean Residual Deviance: 1061.578490231611

Scoring History: 


| | timestamp | duration | number_of_trees | training_rmse | training_mae | training_deviance | validation_rmse | validation_mae | validation_deviance |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-02-23 15:21:08 | -11 min -11.-643 sec | 0.0 | 53.860245 | 29.999316 | 2900.926039 | 34.444175 | 26.134937 | 1186.401175 |
| 1 | 2022-02-23 15:21:08 | -11 min -11.-604 sec | 1.0 | 51.202264 | 27.9796 | 2621.671863 | 32.159143 | 24.232488 | 1034.210474 |
| 2 | 2022-02-23 15:21:08 | -11 min -11.-595 sec | 2.0 | 53.944245 | 30.224006 | 2909.981518 | 37.270265 | 24.809498 | 1389.072689 |
| 3 | 2022-02-23 15:32:20 | 0.007 sec | 3.0 | 51.476595 | 28.279923 | 2649.839881 | 36.008513 | 23.267156 | 1296.612978 |
| 4 | 2022-02-23 15:32:20 | 0.009 sec | 4.0 | 49.389175 | 26.640837 | 2439.290611 | 34.834444 | 22.064384 | 1213.438464 |
| 5 | 2022-02-23 15:32:20 | 0.013 sec | 5.0 | 47.631225 | 25.391836 | 2268.733593 | 33.95882 | 21.161864 | 1153.201484 |
| 6 | 2022-02-23 15:32:20 | 0.016 sec | 6.0 | 46.157873 | 24.461956 | 2130.549277 | 33.340005 | 20.414277 | 1111.555916 |
| 7 | 2022-02-23 15:32:20 | 0.017 sec | 7.0 | 44.922097 | 23.618827 | 2017.994765 | 32.931499 | 19.895775 | 1084.483601 |
| 8 | 2022-02-23 15:32:20 | 0.021 sec | 8.0 | 43.896604 | 22.936098 | 1926.911815 | 32.655892 | 19.784505 | 1066.407254 |
| 9 | 2022-02-23 15:32:20 | 0.023 sec | 9.0 | 42.922288 | 22.275181 | 1842.322767 | 32.177843 | 19.711021 | 1035.413557 |



See the whole table with table.as_data_frame()

Variable Importances: 


| | variable | relative_importance | scaled_importance | percentage |
|---|---|---|---|---|
| 0 | Pclass | 1480635.0 | 1.0 | 0.61161 |
| 1 | Age | 639403.9 | 0.431844 | 0.26412 |
| 2 | Sex_male | 167620.5 | 0.113209 | 0.069239 |
| 3 | Embarked_C | 130438.5 | 0.088096 | 0.053881 |
| 4 | Embarked_Q | 2782.278 | 0.001879 | 0.001149 |




In [26]:
print('Validation RMSE for GBM:', titanic_continued.rmse(valid=True))

Validation RMSE for GBM: 32.58187364519743


In [27]:
print('Validation RMSE for GBM:', titanic.rmse(valid=True))

Validation RMSE for GBM: 30.44338024828779


In [28]:
rmse_diff = titanic.rmse(valid=True) - titanic_continued.rmse(valid=True)
print("Change in validation RMSE (positive means improvement):", rmse_diff)

Change in validation RMSE (positive means improvement): -2.1384933969096416
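
To make the sign of that difference explicit, the sketch below recomputes it from the two validation figures printed above. `validation_rmse_change` is a hypothetical helper, not an H2O function. Note that the two values were measured on different validation frames (`valid1` for the original model, `valid2` for the continued one), so the comparison is indicative at best.

```python
def validation_rmse_change(before: float, after: float) -> float:
    """Return before - after; a positive value means RMSE dropped."""
    return before - after


# Validation RMSE values copied from the notebook output above.
base_rmse = 30.44338024828779        # original 2-tree model, scored on valid1
continued_rmse = 32.58187364519743   # continued 20-tree model, scored on valid2

change = validation_rmse_change(base_rmse, continued_rmse)
verdict = "improved" if change > 0 else "worsened"
print(f"Validation RMSE {verdict} by {abs(change):.4f}")
```

A fairer comparison would score both models on the same held-out frame before drawing a conclusion.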
