# Overview
A GBM is an ensemble of either regression or classification tree models. Both
are forward-learning ensemble methods that obtain predictive results using
gradually improved estimations.
Boosting is a flexible nonlinear regression procedure that helps improve the
accuracy of trees. Weak classification algorithms are sequentially applied to the
incrementally changed data to create a series of decision trees, producing an
ensemble of weak prediction models.
While boosting trees increases their accuracy, it also decreases speed and user
interpretability. The gradient boosting method generalizes tree boosting to
minimize these drawbacks.
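The sequential idea described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not H2O's implementation: it uses squared-error loss, one numeric feature, and depth-one trees (stumps). The names `fit_stump`, `boost`, and `mse` are hypothetical helpers introduced only for this example.

```python
def fit_stump(xs, residuals):
    """Fit a depth-one tree: pick the threshold that minimizes squared error."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=20, learn_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean response
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        # For squared-error loss the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learn_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learn_rate * sum(s(x) for s in stumps)

    return predict


def mse(model, xs, ys):
    """Mean squared error of a fitted model on (xs, ys)."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data (assumed for illustration): y = x^2 on a small grid.
xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]

for n in (0, 5, 50):
    print(n, "trees, training MSE:", mse(boost(xs, ys, n_trees=n), xs, ys))
```

Each weak stump only corrects part of the remaining error, and the learning rate shrinks its contribution, so training error falls gradually as trees are added.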
# Summary of Features
H2O’s GBM functionalities include:
- supervised learning for regression and classification tasks
- distributed and parallelized computation on either a single node or a multi-node cluster
- fast and memory-efficient Java implementations of the algorithms
- the ability to run H2O from R, Python, Scala, or the intuitive web UI (Flow)
- automatic early stopping based on convergence of user-specified metrics to a user-specified relative tolerance
- stochastic gradient boosting with column and row sampling (per split and per tree) for better generalization
- support for exponential-family distributions (Poisson, Gamma, Tweedie) and additional loss functions such as quantile regression (including Laplace), in addition to the binomial (Bernoulli), Gaussian, and multinomial distributions
- grid search for hyperparameter optimization and model selection
- model export in plain Java code for deployment in production environments
- additional parameters for model tuning (for a complete listing of parameters, refer to the Model Parameters section)
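The early-stopping feature listed above can be pictured with a small plain-Python sketch. It is a simplified illustration, not H2O's exact rule: it assumes a positive, lower-is-better metric such as logloss, and the function names `should_stop` and `first_stop_index` are hypothetical. In H2O itself, the related estimator parameters are `stopping_rounds`, `stopping_metric`, and `stopping_tolerance`.

```python
def should_stop(history, rounds=3, tolerance=1e-3):
    """Return True when the last `rounds` scores did not beat the earlier
    best score by at least `tolerance` (relative). Lower scores are better."""
    if len(history) <= rounds:
        return False                      # not enough scoring events yet
    best_before = min(history[:-rounds])
    best_recent = min(history[-rounds:])
    return best_recent > best_before * (1.0 - tolerance)


def first_stop_index(scores, rounds=3, tolerance=1e-3):
    """Index of the scoring event at which training would stop, or None."""
    for i in range(1, len(scores) + 1):
        if should_stop(scores[:i], rounds, tolerance):
            return i - 1
    return None


# Made-up metric histories, for illustration only.
plateau = [1.0, 0.5, 0.3, 0.2999, 0.2999, 0.2999]
improving = [1.0, 0.5, 0.3, 0.2, 0.1, 0.05]

print(first_stop_index(plateau))    # stops once the metric has flattened
print(first_stop_index(improving))  # never triggers on this history
```

A relative tolerance keeps the rule independent of the metric's scale, which is why the same setting can be reused across different datasets.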
# NOTE
Gradient Boosting Machines (also known as gradient boosted models) sequentially
fit new models to provide a more accurate estimate of a response variable in
supervised learning tasks such as regression and classification. Although GBM
is known to be difficult to distribute and parallelize, H2O provides an easily
distributable and parallelizable version of GBM in its framework, as well as an
effortless environment for model tuning and selection.

In [4]:
import h2o
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


H2O cluster uptime:,3 hours 20 mins
H2O cluster timezone:,America/Chicago
H2O data parsing timezone:,UTC
H2O cluster version:,3.24.0.4
H2O cluster version age:,20 days
H2O cluster name:,H2O_from_python_malbalawi_gx3o5a
H2O cluster total nodes:,1
H2O cluster free memory:,1.488 Gb
H2O cluster total cores:,4
H2O cluster allowed cores:,4


In [5]:
url = "http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv"
iris = h2o.import_file(url)

Parse progress: |█████████████████████████████████████████████████████████| 100%


In [6]:
iris

sepal_len,sepal_wid,petal_len,petal_wid,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa




In [7]:
train, test = iris.split_frame(ratios=[0.8])

In [8]:
train.summary()

,sepal_len,sepal_wid,petal_len,petal_wid,class
type,real,real,real,real,enum
mins,4.3,2.0,1.0,0.1,
mean,5.814516129032256,3.0258064516129037,3.759677419354839,1.203225806451613,
maxs,7.9,4.1,6.7,2.5,
sigma,0.8091643333939824,0.4172833097650143,1.720650253093205,0.7460394596977923,
zeros,0,0,0,0,
missing,0,0,0,0,0
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa


In [9]:
train.nrows

124

In [10]:
test.nrows

26
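The split above yields 124 and 26 rows rather than exactly 120 and 30. H2O's documentation describes `split_frame` as a probabilistic split that does not guarantee exact ratios. The sketch below is a plain-Python illustration of that idea under the assumption of one independent random draw per row; it is not H2O's implementation, and `probabilistic_split` is a hypothetical helper.

```python
import random


def probabilistic_split(n_rows, train_ratio, seed):
    """Assign each row index to train or test by an independent random draw."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in range(n_rows):
        if rng.random() < train_ratio:
            train_rows.append(row)
        else:
            test_rows.append(row)
    return train_rows, test_rows


# Split 150 rows (the size of the iris data) with several seeds.
sizes = []
for seed in range(5):
    train_rows, test_rows = probabilistic_split(150, 0.8, seed)
    sizes.append((len(train_rows), len(test_rows)))

print(sizes)  # the sizes hover around (120, 30) but are not fixed
```

Because the row counts depend on the random draws, passing a `seed` to `split_frame` is the usual way to make a split reproducible.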

In [11]:
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Create an instance of it called mGBM

In [14]:
mGBM = H2OGradientBoostingEstimator()
mGBM.train(x=["sepal_len", "sepal_wid", "petal_len", "petal_wid"], y="class", training_frame=train)

gbm Model Build progress: |███████████████████████████████████████████████| 100%


In [15]:
mGBM

Model Details
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  GBM_model_python_1560877108989_3


ModelMetricsMultinomial: gbm
** Reported on train data. **

MSE: 0.0013149603019611852
RMSE: 0.036262381360870184
LogLoss: 0.014744943552764619
Mean Per-Class Error: 0.0
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



Iris-setosa,Iris-versicolor,Iris-virginica,Error,Rate
40.0,0.0,0.0,0.0,0 / 40
0.0,44.0,0.0,0.0,0 / 44
0.0,0.0,40.0,0.0,0 / 40
40.0,44.0,40.0,0.0,0 / 124


Top-3 Hit Ratios: 


k,hit_ratio
1,1.0
2,1.0
3,1.0


Scoring History: 


,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error
,2019-06-18 15:22:10,0.077 sec,0.0,0.6666667,1.0986123,0.6451613
,2019-06-18 15:22:10,0.152 sec,1.0,0.6043823,0.9277902,0.0403226
,2019-06-18 15:22:10,0.175 sec,2.0,0.5472035,0.7933126,0.0403226
,2019-06-18 15:22:10,0.196 sec,3.0,0.4956213,0.6856126,0.0403226
,2019-06-18 15:22:10,0.215 sec,4.0,0.4494924,0.5978299,0.0403226
---,---,---,---,---,---,---
,2019-06-18 15:22:11,1.241 sec,46.0,0.0424421,0.0181564,0.0
,2019-06-18 15:22:11,1.254 sec,47.0,0.0406510,0.0171904,0.0
,2019-06-18 15:22:11,1.264 sec,48.0,0.0388314,0.0162465,0.0



See the whole table with table.as_data_frame()
Variable Importances: 


variable,relative_importance,scaled_importance,percentage
petal_wid,265.2432251,1.0,0.7000812
petal_len,107.4221802,0.4049950,0.2835294
sepal_wid,3.7755017,0.0142341,0.0099650
sepal_len,2.4340215,0.0091766,0.0064243
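Two of the tables printed above can be cross-checked by hand. The sketch below recomputes the `scaled_importance` and `percentage` columns from `relative_importance`, assuming they are the value divided by the largest value and by the column total respectively. It also checks that the zero-tree row of the scoring history is consistent with predicting a uniform probability of 1/3 for every class. Both readings are inferred from the printed numbers, not taken from H2O's source.

```python
import math

# relative_importance values copied from the Variable Importances table.
relative_importance = {
    "petal_wid": 265.2432251,
    "petal_len": 107.4221802,
    "sepal_wid": 3.7755017,
    "sepal_len": 2.4340215,
}

top = max(relative_importance.values())
total = sum(relative_importance.values())

# Assumed definitions: scale by the largest value, and by the column total.
scaled = {name: value / top for name, value in relative_importance.items()}
percentage = {name: value / total for name, value in relative_importance.items()}

for name in relative_importance:
    print(name, round(scaled[name], 7), round(percentage[name], 7))

# Zero-tree row of the scoring history: rmse 0.6666667, logloss 1.0986123,
# classification error 0.6451613.
n_classes = 3
uniform_logloss = -math.log(1.0 / n_classes)   # ln(3)
uniform_rmse = 1.0 - 1.0 / n_classes           # error of 2/3 on every row

# Training class counts from the confusion matrix totals.
class_counts = {"Iris-setosa": 40, "Iris-versicolor": 44, "Iris-virginica": 40}
largest_class_error = 1.0 - max(class_counts.values()) / sum(class_counts.values())

print(uniform_logloss, uniform_rmse, largest_class_error)
```

The two petal measurements account for roughly 98% of the total importance, which matches the common observation that petal size separates the iris species far better than sepal size.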




# GBM overfits very easily, so you have to be careful. This model, however, was built with only a few trees (50, the H2O default).

In [16]:
p = mGBM.predict(test)

gbm prediction progress: |████████████████████████████████████████████████| 100%


In [18]:
mGBM.model_performance(test)


ModelMetricsMultinomial: gbm
** Reported on test data. **

MSE: 0.033135281284852186
RMSE: 0.18203098990241245
LogLoss: 0.10376292136248161
Mean Per-Class Error: 0.03333333333333333
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class



Iris-setosa,Iris-versicolor,Iris-virginica,Error,Rate
10.0,0.0,0.0,0.0,0 / 10
0.0,6.0,0.0,0.0,0 / 6
0.0,1.0,9.0,0.1,1 / 10
10.0,7.0,9.0,0.0384615,1 / 26


Top-3 Hit Ratios: 


k,hit_ratio
1,0.9615384
2,1.0
3,1.0




# One error in the test data out of 26 samples, which is about 96% accuracy
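The figures in the test report can be reproduced directly from the confusion matrix. The sketch below is plain Python with the counts copied from the table above; the variable names are chosen only for this example.

```python
# Rows are actual classes, columns are predicted classes (test data).
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
confusion = [
    [10, 0, 0],   # actual Iris-setosa
    [0, 6, 0],    # actual Iris-versicolor
    [0, 1, 9],    # actual Iris-virginica
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(classes)))

accuracy = correct / total
overall_error = 1.0 - accuracy

# Error per actual class: share of that row that falls off the diagonal.
per_class_error = [
    1.0 - confusion[i][i] / sum(confusion[i]) for i in range(len(classes))
]
mean_per_class_error = sum(per_class_error) / len(per_class_error)

print("accuracy:", accuracy)
print("overall error:", overall_error)
print("mean per-class error:", mean_per_class_error)
```

The mean per-class error (about 0.033) differs from the overall error (about 0.038) because it averages the three class error rates equally, regardless of how many test rows each class has.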

In [22]:
help(h2o.estimators.gbm.H2OGradientBoostingEstimator)

Help on class H2OGradientBoostingEstimator in module h2o.estimators.gbm:

class H2OGradientBoostingEstimator(h2o.estimators.estimator_base.H2OEstimator)
 |  H2OGradientBoostingEstimator(**kwargs)
 |  
 |  Gradient Boosting Machine
 |  
 |  Builds gradient boosted trees on a parsed data set, for regression or classification.
 |  The default distribution function will guess the model type based on the response column type.
 |  Otherwise, the response column must be an enum for "bernoulli" or "multinomial", and numeric
 |  for all other distributions.
 |  
 |  Method resolution order:
 |      H2OGradientBoostingEstimator
 |      h2o.estimators.estimator_base.H2OEstimator
 |      h2o.model.model_base.ModelBase
 |      h2o.utils.backward_compatibility.BackwardsCompatibleBase
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, **kwargs)
 |      Construct a new model instance.
 |  
 |  ----------------------------------------------------------------------
 |  Data d


