## Histogram Gradient Boosting Regression example

This is a demonstration of the **histogram-based gradient boosting regression tree estimator**, available in scikit-learn as [sklearn.ensemble.HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html). The estimator was <font color='purple'>still experimental</font> when this notebook was written; it has been stable since scikit-learn 1.0.

As input I use the data produced by the excellent notebook ["INGV Volcanic Eruption Prediction - LGBM Baseline"](https://www.kaggle.com/ajcostarino/ingv-volcanic-eruption-prediction-lgbm-baseline), written by [Adam James](https://www.kaggle.com/ajcostarino). The training dataset consists of 4431 rows and 444 columns, and occupies around 23 MB.
For the estimator I simply use the default parameters (see the sklearn page for details).

In [1]:
import numpy  as np
import pandas as pd

# On scikit-learn < 1.0 this estimator is experimental and must be enabled explicitly.
# From version 1.0 onward the next import is a no-op and can be removed:
from sklearn.experimental    import enable_hist_gradient_boosting
from sklearn.ensemble        import HistGradientBoostingRegressor

from sklearn.model_selection import KFold
from sklearn.metrics         import mean_absolute_error

Read in the datasets:

In [2]:
train  = pd.read_csv('../input/the-volcano-and-the-regularized-greedy-forest/volcano_train.csv')
test   = pd.read_csv('../input/the-volcano-and-the-regularized-greedy-forest/volcano_test.csv')
sample = pd.read_csv('../input/predict-volcanic-eruptions-ingv-oe/sample_submission.csv')

# features: every column except the segment identifier and the target
X      = train.drop(["segment_id", "time_to_eruption"], axis=1).to_numpy()
y      = train["time_to_eruption"].to_numpy()
X_test = test.drop("segment_id", axis=1).to_numpy()

Perform the regression, here with 10-fold cross-validation:

In [3]:
%%time

kf = KFold(n_splits=10, random_state=42, shuffle=True)

predictions_array = []   # test-set predictions, one array per fold
CV_score_array    = []   # validation MAE, one value per fold

for train_index, valid_index in kf.split(X):

    X_train, X_valid = X[train_index], X[valid_index]
    y_train, y_valid = y[train_index], y[valid_index]

    # fit a fresh estimator (default parameters) on the training part of this fold
    regressor = HistGradientBoostingRegressor()
    regressor.fit(X_train, y_train)

    predictions_array.append(regressor.predict(X_test))
    CV_score_array.append(mean_absolute_error(y_valid, regressor.predict(X_valid)))

# final prediction: the average over the ten per-fold models
predictions = np.mean(predictions_array, axis=0)

CPU times: user 3min 53s, sys: 30.7 s, total: 4min 23s
Wall time: 1min 10s


In [4]:
print("The average CV mean absolute error is %d" % np.mean(CV_score_array,axis=0))

The average CV mean absolute error is 4128933


Now write out a `submission.csv` file:

In [5]:
sample.iloc[:,1:] = predictions
sample.to_csv('submission.csv',index=False)
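Before writing the file it can be worth checking that the predictions line up with the submission template. Below is a small sketch of such a check on made-up data: `sample_demo` and `predictions_demo` are hypothetical stand-ins for `sample` and `predictions`, and the column name `time_to_eruption` in the template is an assumption based on the name of the training target.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for `sample` and `predictions` (values are made up)
sample_demo = pd.DataFrame({"segment_id": [101, 102, 103],
                            "time_to_eruption": [0, 0, 0]})
predictions_demo = np.array([1.5e7, 2.0e7, 2.5e7])

# one prediction per row of the template, and no NaN or inf values
assert len(predictions_demo) == len(sample_demo)
assert np.isfinite(predictions_demo).all()

# assign by column name rather than by position
submission_demo = sample_demo.copy()
submission_demo["time_to_eruption"] = predictions_demo
print(submission_demo)
```

Assigning by column name makes the intent explicit and does not depend on the column order of the template.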

### See also:
* [Histogram Gradient Boosting Classifier example](https://www.kaggle.com/carlmcbrideellis/histogram-gradient-boosting-classifier-example) performed on the *Santander Customer Satisfaction* dataset.

## Related reading

* [Aleksei Guryanov "Histogram-Based Algorithm for Building Gradient Boosting Ensembles of Piecewise Linear Decision Trees", In: van der Aalst W. et al. (eds) Analysis of Images, Social Networks and Texts. AIST 2019. Lecture Notes in Computer Science, vol 11832. Springer (2019)](https://link.springer.com/chapter/10.1007%2F978-3-030-37334-4_4)