## The simplest usage example of py_boost

### Installation (if needed)

**Note**: replace cupy-cuda110 with the cupy package that matches your CUDA version!

In [1]:
# !pip install cupy-cuda110 py-boost

### Imports

In [2]:
import os
# Optional: set the device to run
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

os.makedirs('../data', exist_ok=True)

import joblib
from sklearn.datasets import make_regression
import numpy as np

# simple case - just one class is used
from py_boost import GradientBoosting
from py_boost.cv import CrossValidation

### Generation of dummy regression data

In [3]:
%%time
X, y = make_regression(150000, 100, n_targets=10, random_state=42)
X_test, y_test = X[:50000], y[:50000]
X, y = X[-50000:], y[-50000:]

CPU times: user 2.43 s, sys: 1.48 s, total: 3.91 s
Wall time: 811 ms


### Training a GBDT model

The only argument required here is a loss function. It, together with the input target shape, determines the task type. The loss function can be passed as a Loss instance or using a string alias:

* ***'mse'*** for the regression/multitask regression
* ***'msle'*** for the regression/multitask regression
* ***'bce'*** for the binary/multilabel classification
* ***'crossentropy'*** for the multiclassification
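As a rough illustration of how the target shape pairs with the loss alias, the sketch below builds toy targets with NumPy for each task type listed above. The arrays and variable names are invented for illustration; the exact shapes and dtypes that py_boost accepts for each loss are an assumption here and should be checked against the library.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 8

# 'mse' with a 1-D target: single-output regression
y_regression = rng.normal(size=n_samples)

# 'mse' with a 2-D target: multitask regression (3 outputs in this toy case)
y_multitask = rng.normal(size=(n_samples, 3))

# 'msle' compares log(1 + y) values, so the target should be greater than -1
y_positive = np.abs(rng.normal(size=n_samples))

# 'bce' with a 1-D 0/1 target: binary classification
y_binary = rng.integers(0, 2, size=n_samples)

# 'bce' with a 2-D 0/1 target: multilabel classification (4 labels here)
y_multilabel = rng.integers(0, 2, size=(n_samples, 4))

# 'crossentropy' with integer class ids: multiclass classification (5 classes here)
y_multiclass = rng.integers(0, 5, size=n_samples)

for name, arr in [('regression', y_regression), ('multitask', y_multitask),
                  ('msle target', y_positive), ('binary', y_binary),
                  ('multilabel', y_multilabel), ('multiclass', y_multiclass)]:
    print(f'{name:12s} shape={arr.shape} dtype={arr.dtype}')
```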

Training is done by calling the .fit method. The possible arguments are the following:

* ***'X'*** 
* ***'y'*** 
* ***'sample_weight'*** 
* ***'eval_sets'***  
Validation sets are passed as a list of dicts, one dict per set, with possible keys ['X', 'y', 'sample_weight']. Note: if multiple validation sets are passed, the best model is selected using the last one.
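The structure described above can be sketched with plain NumPy arrays. The data below is random stand-in data and the variable names are hypothetical; the point is only the list-of-dicts layout and the ordering, with the set used for model selection placed last.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for two validation sets: 40 rows each, 5 features
X_valid_a, y_valid_a = rng.normal(size=(40, 5)), rng.normal(size=40)
X_valid_b, y_valid_b = rng.normal(size=(40, 5)), rng.normal(size=40)

# Optional per-row weights for the second validation set
w_valid_b = np.ones(40, dtype=np.float32)

eval_sets = [
    {'X': X_valid_a, 'y': y_valid_a},
    # The last set is the one used to select the best model
    {'X': X_valid_b, 'y': y_valid_b, 'sample_weight': w_valid_b},
]

# Basic consistency checks before handing the list to .fit
allowed_keys = {'X', 'y', 'sample_weight'}
for i, es in enumerate(eval_sets):
    assert set(es) <= allowed_keys, f'unexpected key in eval set {i}'
    assert es['X'].shape[0] == es['y'].shape[0], f'row mismatch in eval set {i}'
    if 'sample_weight' in es:
        assert es['sample_weight'].shape[0] == es['X'].shape[0]

# The list would then be passed as: model.fit(X, y, eval_sets=eval_sets)
```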

#### The example below illustrates how to train a model for a simple regression task.

In [4]:
%%time
model = GradientBoosting('mse')

model.fit(X, y[:, 0], eval_sets=[{'X': X_test, 'y': y_test[:, 0]},])

[20:52:49] Stdout logging level is INFO.
[20:52:49] GDBT train starts. Max iter 100, early stopping rounds 100
[20:52:49] Iter 0; Sample 0, rmse = 173.68515683069276; 
[20:52:49] Iter 10; Sample 0, rmse = 133.23295041730694; 
[20:52:49] Iter 20; Sample 0, rmse = 107.90963543511216; 
[20:52:50] Iter 30; Sample 0, rmse = 90.08342819554207; 
[20:52:50] Iter 40; Sample 0, rmse = 76.43017533279323; 
[20:52:50] Iter 50; Sample 0, rmse = 65.5577889119952; 
[20:52:50] Iter 60; Sample 0, rmse = 56.76787553118689; 
[20:52:50] Iter 70; Sample 0, rmse = 49.564956655108595; 
[20:52:50] Iter 80; Sample 0, rmse = 43.58867561316726; 
[20:52:50] Iter 90; Sample 0, rmse = 38.67175787149395; 
[20:52:50] Iter 99; Sample 0, rmse = 34.99754081347279; 
CPU times: user 7.27 s, sys: 962 ms, total: 8.23 s
Wall time: 6.34 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f486e3eba90>

### Training a GBDT model in a multiregression case

Each built-in loss function has its own default metric, so specifying a metric is optional. 
If you need to specify the evaluation metric, you can pass a Metric instance or use a string alias.

#### Default metrics:

* ***'rmse'*** is the default for the ***'mse'*** loss
* ***'rmsle'*** is the default for the  ***'msle'*** loss
* ***'bce'*** is the default for the ***'bce'*** loss
* ***'crossentropy'*** is the default for the ***'crossentropy'*** loss

#### Non-default metrics:

* ***'r2'*** for the regression/multitask regression
* ***'auc'*** for the binary classification
* ***'accuracy'*** for any classification task
* ***'precision'*** for any classification task
* ***'recall'*** for any classification task
* ***'f1'*** for any classification task
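For reference, the two regression metrics named above can be written out in a few lines of NumPy. This is a minimal sketch of the textbook definitions, not py_boost's own implementation; in particular, averaging R2 uniformly over the outputs is an assumption made here for the multioutput case.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error over all samples and outputs."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination, averaged uniformly over output columns."""
    ss_res = np.sum((y_true - y_pred) ** 2, axis=0)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))

y_true = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])

# A perfect prediction gives rmse = 0 and r2 = 1
perfect = y_true.copy()
print(rmse(y_true, perfect), r2(y_true, perfect))

# Predicting the column means everywhere gives r2 = 0
mean_pred = np.broadcast_to(y_true.mean(axis=0), y_true.shape)
print(rmse(y_true, mean_pred), r2(y_true, mean_pred))
```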

It is possible to specify other common GBDT hyperparameters as shown below.

#### The following example demonstrates how to train a model for a multioutput regression task. No extra configuration is needed to switch to the multioutput task: you just need to pass a multidimensional target.

In [5]:
%%time
model = GradientBoosting('mse', 'r2_score',
                         ntrees=1000, lr=.01, verbose=100, es=200, lambda_l2=1,
                         subsample=.8, colsample=.8, min_data_in_leaf=10, min_gain_to_split=0, 
                         max_bin=256, max_depth=6)

model.fit(X, y, eval_sets=[{'X': X_test, 'y': y_test},])

[20:52:50] Stdout logging level is INFO.
[20:52:50] GDBT train starts. Max iter 1000, early stopping rounds 200
[20:52:50] Iter 0; Sample 0, R2_score = 0.008394434412401175; 
[20:52:52] Iter 100; Sample 0, R2_score = 0.5168091229427232; 
[20:52:54] Iter 200; Sample 0, R2_score = 0.7243334810252653; 
[20:52:56] Iter 300; Sample 0, R2_score = 0.8326970487914259; 
[20:52:58] Iter 400; Sample 0, R2_score = 0.8950369225819286; 
[20:53:00] Iter 500; Sample 0, R2_score = 0.9321446308026127; 
[20:53:02] Iter 600; Sample 0, R2_score = 0.9547326078219325; 
[20:53:04] Iter 700; Sample 0, R2_score = 0.9687759168879175; 
[20:53:06] Iter 800; Sample 0, R2_score = 0.9776385294523188; 
[20:53:08] Iter 900; Sample 0, R2_score = 0.9833225210195948; 
[20:53:10] Iter 999; Sample 0, R2_score = 0.9870099055456798; 
CPU times: user 19.6 s, sys: 2.7 s, total: 22.3 s
Wall time: 20.3 s


<py_boost.gpu.boosting.GradientBoosting at 0x7f486e3eb8e0>
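The `es` argument above sets the number of early stopping rounds reported in the log. The sketch below shows the usual early stopping rule on a list of per-iteration validation scores. It is a generic illustration in plain Python with a hypothetical helper name, and py_boost's internal bookkeeping may differ.

```python
def best_iteration_with_early_stopping(scores, es, greater_is_better=False):
    """Return (best_iter, stop_iter) for per-iteration validation scores.

    Training stops once `es` consecutive iterations pass without improvement.
    """
    best_iter, best_score = 0, scores[0]
    for i, score in enumerate(scores):
        improved = score > best_score if greater_is_better else score < best_score
        if i == 0 or improved:
            best_iter, best_score = i, score
        elif i - best_iter >= es:
            # No improvement for `es` rounds: stop here and keep the best iteration
            return best_iter, i
    # Ran through all iterations without triggering the rule
    return best_iter, len(scores) - 1

# Toy validation curve for an error metric (lower is better)
scores = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 7.3]

best_iter, stop_iter = best_iteration_with_early_stopping(scores, es=3)
print(best_iter, stop_iter)
```

For a metric such as R2, where larger values are better, the same helper would be called with `greater_is_better=True`.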

## Inference

#### Prediction can be done via calling the .predict method

In [6]:
%%time
preds = model.predict(X_test)

preds.shape

CPU times: user 1.24 s, sys: 505 ms, total: 1.74 s
Wall time: 1.75 s


(50000, 10)

In [7]:
preds

array([[-230.07994  , -139.01242  , -271.89752  , ..., -132.4745   ,
        -209.56622  , -227.33429  ],
       [-105.2129   , -105.53808  ,  -51.97523  , ..., -121.95376  ,
        -110.97196  ,  -13.6838045],
       [ -39.22138  ,  -58.721336 ,  142.582    , ...,   17.447527 ,
         -23.655943 , -213.7487   ],
       ...,
       [ -81.388824 ,  130.64673  ,   79.12572  , ...,  222.39725  ,
          31.501627 ,    8.980256 ],
       [  -3.883288 ,  139.77042  ,  247.42499  , ...,  150.47414  ,
         175.14754  ,  207.02196  ],
       [  -8.103888 ,   40.226532 ,  169.67625  , ...,   95.37619  ,
          27.459566 ,   11.250004 ]], dtype=float32)

#### Prediction for certain iterations can be done by calling the .predict_staged method

In [8]:
%%time
preds = model.predict_staged(X_test, iterations=[100, 300, 500])

preds.shape

CPU times: user 342 ms, sys: 247 ms, total: 589 ms
Wall time: 596 ms


(3, 50000, 10)
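An array with the layout shown above, (iterations, samples, outputs), makes it easy to score every requested stage at once. The sketch below uses a random stand-in array rather than real model output, so the names and numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_stages, n_samples, n_outputs = 3, 1000, 10

y_true = rng.normal(size=(n_samples, n_outputs))

# Stand-in for staged predictions: the error shrinks at later stages
noise_scale = np.array([1.0, 0.5, 0.25]).reshape(n_stages, 1, 1)
staged = y_true[None, :, :] + noise_scale * rng.normal(size=(n_stages, n_samples, n_outputs))

# Broadcast the target over the leading stage axis, then reduce over samples and outputs
stage_rmse = np.sqrt(((staged - y_true[None, :, :]) ** 2).mean(axis=(1, 2)))

for stage, value in enumerate(stage_rmse):
    print(f'stage {stage}: rmse = {value:.3f}')
```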

#### Tree leaf indices for certain iterations can be predicted by calling the .predict_leaves method

In [9]:
%%time
preds = model.predict_leaves(X_test, iterations=[100, 300, 500])

preds.shape

CPU times: user 17.2 ms, sys: 276 µs, total: 17.4 ms
Wall time: 16.2 ms


(3, 50000, 1)

In [10]:
preds.T[0]

array([[14, 20,  9],
       [50, 43, 23],
       [32, 43, 55],
       ...,
       [54, 50,  9],
       [30, 43, 19],
       [60, 43, 23]], dtype=int32)
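The `preds.T[0]` expression above turns the (iterations, samples, 1) array into one row per sample and one column per requested iteration. The sketch below reproduces that reshaping on a small random stand-in array and counts how many samples fall into each leaf of the first requested tree; the leaf ids are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_iterations, n_samples = 3, 6

# Stand-in for leaf indices with the same layout as above: (iterations, samples, 1)
leaves = rng.integers(0, 8, size=(n_iterations, n_samples, 1)).astype(np.int32)

# Full transpose reverses the axes to (1, samples, iterations); [0] drops the first axis
per_sample = leaves.T[0]

# An equivalent and more explicit spelling of the same operation
per_sample_explicit = leaves[:, :, 0].T

print(per_sample.shape)
print(per_sample)

# Number of samples that landed in each leaf of the first requested tree
leaf_counts = np.bincount(per_sample[:, 0])
print(leaf_counts)
```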

#### Feature importances

In [11]:
model.get_feature_importance()

array([  37.,   35.,   35.,   49.,   59.,   49., 5586.,   50.,   56.,
         55.,   46.,   52.,   50.,   45.,   46., 5947., 5505.,   40.,
         54., 5439.,   37.,   43.,   47.,   83.,   31.,   48.,   40.,
         47.,   50.,   49.,   58.,   63.,   52.,   59.,   51.,   46.,
       6010.,   38.,   45.,   47.,   69.,   47.,   63.,   37.,   43.,
         46.,   43.,   33.,   41.,   44.,   36.,   55., 5914.,   52.,
         48.,   53.,   50.,   40.,   47.,   39.,   55.,   52.,   49.,
         52.,   57.,   36.,   53.,   50.,   36.,   44.,   45.,   37.,
         37.,   50.,   52.,   46.,   55.,   34.,   47.,   41.,   57.,
         45.,   31.,   49.,   45.,   27., 5487., 3544.,   47., 5797.,
         40., 6244.,   34.,   47.,   63.,   41.,   36.,   51.,   45.,
         52.], dtype=float32)
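In the output above, ten features have importances in the thousands while all the others stay below 100, which matches the default of 10 informative features in `make_regression`. A ranking like that can be pulled out with a couple of NumPy calls; the sketch below uses a short made-up importance vector and an arbitrary threshold.

```python
import numpy as np

# Short stand-in for an importance vector (one value per feature)
importance = np.array([37., 35., 5586., 49., 5947., 40., 3544., 52.], dtype=np.float32)

# Feature indices sorted from most to least important
order = np.argsort(importance)[::-1]
top_k = order[:3]
print(top_k, importance[top_k])

# Alternatively, keep every feature above a chosen threshold
threshold = 1000.0
selected = np.flatnonzero(importance > threshold)
print(selected)
```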

#### The trained model can be saved as a pickle file for inference

In [12]:
joblib.dump(model, '../data/temp_model.pkl')

new_model = joblib.load('../data/temp_model.pkl')
new_model.predict(X_test)

array([[-230.07994  , -139.01242  , -271.89752  , ..., -132.4745   ,
        -209.56622  , -227.33429  ],
       [-105.2129   , -105.53808  ,  -51.97523  , ..., -121.95376  ,
        -110.97196  ,  -13.6838045],
       [ -39.22138  ,  -58.721336 ,  142.582    , ...,   17.447527 ,
         -23.655943 , -213.7487   ],
       ...,
       [ -81.388824 ,  130.64673  ,   79.12572  , ...,  222.39725  ,
          31.501627 ,    8.980256 ],
       [  -3.883288 ,  139.77042  ,  247.42499  , ...,  150.47414  ,
         175.14754  ,  207.02196  ],
       [  -8.103888 ,   40.226532 ,  169.67625  , ...,   95.37619  ,
          27.459566 ,   11.250004 ]], dtype=float32)

### Cross Validation

py_boost also provides a built-in cross-validation wrapper that produces out-of-fold predictions.

In [13]:
%%time
model = GradientBoosting('mse')
cv = CrossValidation(model)

oof_pred = cv.fit_predict(X, y, cv=5)

pred = cv.predict(X_test)
((pred - y_test) ** 2).mean() ** .5

[20:53:16] Stdout logging level is INFO.
[20:53:16] GDBT train starts. Max iter 100, early stopping rounds 100
[20:53:16] Iter 0; Sample 0, rmse = 176.17283883265787; 
[20:53:16] Iter 10; Sample 0, rmse = 144.64857862576474; 
[20:53:16] Iter 20; Sample 0, rmse = 122.80526576817687; 
[20:53:16] Iter 30; Sample 0, rmse = 106.20694134133124; 
[20:53:17] Iter 40; Sample 0, rmse = 93.03256160556448; 
[20:53:17] Iter 50; Sample 0, rmse = 82.3056986784575; 
[20:53:17] Iter 60; Sample 0, rmse = 73.12733773653729; 
[20:53:17] Iter 70; Sample 0, rmse = 65.30923174734228; 
[20:53:17] Iter 80; Sample 0, rmse = 58.71652095411406; 
[20:53:18] Iter 90; Sample 0, rmse = 53.02306242586308; 
[20:53:18] Iter 99; Sample 0, rmse = 48.52981207400637; 
[20:53:18] Stdout logging level is INFO.
[20:53:18] GDBT train starts. Max iter 100, early stopping rounds 100
[20:53:18] Iter 0; Sample 0, rmse = 176.1975859432932; 
[20:53:18] Iter 10; Sample 0, rmse = 144.97434053387218; 
[20:53:18] Iter 20; Sample 0, rmse 

47.2994203961322
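To make the idea of out-of-fold prediction concrete, the sketch below implements it by hand with NumPy, using an ordinary least-squares fit as a stand-in for the boosting model. The helper name and data are hypothetical, and averaging the fold models at test time is an assumption about how such a wrapper typically behaves, not a statement about the `CrossValidation` internals.

```python
import numpy as np

def oof_predict(X, y, n_folds=5, seed=0):
    """Fit one model per fold and predict each row with the model that never saw it."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    oof = np.zeros_like(y, dtype=float)
    models = []
    for k, valid_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Stand-in model: least-squares coefficients fitted on the training folds
        coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        oof[valid_idx] = X[valid_idx] @ coef
        models.append(coef)
    return oof, models

# Toy multioutput regression data
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
true_coef = rng.normal(size=(5, 2))
y = X @ true_coef + 0.01 * rng.normal(size=(200, 2))

oof, models = oof_predict(X, y)
oof_rmse = float(np.sqrt(((oof - y) ** 2).mean()))
print(f'out-of-fold rmse: {oof_rmse:.4f}')

# Test-time prediction: average the predictions of the fold models
X_new = rng.normal(size=(10, 5))
pred = np.mean([X_new @ coef for coef in models], axis=0)
print(pred.shape)
```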