# Gradient Boosting Reinforcement Learning (GBRL)
GBRL is a Python-based GBT library designed and optimized for reinforcement learning (RL).
GBRL is designed to be integrated in popular RL python libraries as part of the standard RL training loop.  

## GBRL Design
The standard GBT supervised learning training procedure for K boosting iterations on a given set on inputs x and targets y is as follows:  
***For K boosting iterations***   
1. Generate predictions using current GBT ensemble.
2. Calculate loss L(y, predictions).
3. Calculate gradients of the loss function w.r.t predictions  
4. Fit a binary decision tree on the gradients and add it to the ensemble.
5. Repeat from step 1.

The training procedure is typically done E2E within a GBT framework and is optimized for pre-defined loss functions. GBRL modifies this procedure to seemingly integrate within RL loops by:
- Outsourcing gradient calculation to autograd frameworks.
- Incremental learning by performing a single boosting iteration on a given data batch containing pairs of states/observations and gradients.  


## Get Started with GBRL
This is a quick tutorial demonstrating usage examples



## Basic Training Procedure 
***Note: standard training procedure is based on PyTorch***



In [1]:
import numpy as np
import torch as th
import gymnasium as gym 

from sklearn import datasets
from torch.nn.functional import mse_loss 
from torch.distributions import Categorical

from gbrl import GradientBoostingTrees, cuda_available, ParametricActor

ModuleNotFoundError: No module named 'gbrl.gbrl_cpp'

In [2]:
# incremental learning dataset
X_numpy, y_numpy = datasets.load_diabetes(return_X_y=True, as_frame=False, scaled=False)
out_dim = 1 if len(y_numpy.shape) == 1  else  y_numpy.shape[1]
if out_dim == 1:
    y_numpy = y_numpy[:, np.newaxis]

X, y = th.tensor(X_numpy, dtype=th.float32), th.tensor(y_numpy, dtype=th.float32)
# CUDA is not deterministic
device = 'cuda' if cuda_available else 'cpu'

# initializing model parameters
tree_struct = {'max_depth': 4, 
               'n_bins': 256,
               'min_data_in_leaf': 0,
               'par_th': 2,
               'grow_policy': 'oblivious'
        }

optimizer = { 'algo': 'SGD',
              'lr': 1.0,
            }
gbrl_params = {
               "split_score_func": "Cosine",
               "generator_type": "Quantile"
                }

In [3]:
# setting up model
gbt_model = GradientBoostingTrees(
                    output_dim=out_dim,
                    tree_struct=tree_struct,
                    optimizer=optimizer,
                    gbrl_params=gbrl_params,
                    verbose=0,
                    device=device)
gbt_model.set_bias(y)

Setting GBRL device to cuda
Setting policy optimizer indices: 0->1


In [4]:
# training for 10 epochs
n_epochs = 10
for _ in range(n_epochs):
    # forward pass - setting requires_grad=True is mandatory for training
    # y_pred is a torch tensor
    y_pred = gbt_model(X, requires_grad=True)
    # calculate loss - we must scale pytorch's mse loss function by 0.5 to get the correct MSE gradient
    loss = 0.5*mse_loss(y_pred, y) 
    loss.backward()
    # perform a boosting step
    gbt_model.step(X)
    print(f"Boosting iteration: {gbt_model.get_iteration()} RMSE loss: {loss.sqrt()}")
    

Boosting iteration: 1 RMSE loss: 54.45128631591797
Boosting iteration: 2 RMSE loss: 45.48917007446289


Boosting iteration: 3 RMSE loss: 40.836509704589844
Boosting iteration: 4 RMSE loss: 37.73506546020508
Boosting iteration: 5 RMSE loss: 36.77689743041992
Boosting iteration: 6 RMSE loss: 35.08983612060547
Boosting iteration: 7 RMSE loss: 33.8794059753418
Boosting iteration: 8 RMSE loss: 32.981075286865234
Boosting iteration: 9 RMSE loss: 32.36515426635742
Boosting iteration: 10 RMSE loss: 31.6733341217041


GBT work with per-sample gradients but pytorch typically calculates the expected loss. GBRL internally multiplies the gradients with the number of samples when calling the step function. Therefore, when working with pytorch losses and multi-output targets one should take this into consideration.  
For example:
1. When using a summation reduction

In [5]:
gbt_model = GradientBoostingTrees(
                    output_dim=out_dim,
                    tree_struct=tree_struct,
                    optimizer=optimizer,
                    gbrl_params=gbrl_params,
                    verbose=0,
                    device=device)
gbt_model.set_bias(y)
# continuing training 10  epochs using a sum reduction
n_epochs = 10
for _ in range(n_epochs):
    y_pred = gbt_model(X, requires_grad=True)
    # we divide the loss by the number of samples to compensate for GBRL's built-in multiplication by the same value   
    loss = 0.5*mse_loss(y_pred, y, reduction='sum') / len(y_pred) 
    loss.backward()
    # perform a boosting step
    gbt_model.step(X)
    print(f"Boosting iteration: {gbt_model.get_iteration()} RMSE loss: {loss.sqrt()}")
    

Setting GBRL device to cuda
Setting policy optimizer indices: 0->1
Boosting iteration: 1 RMSE loss: 54.45128631591797
Boosting iteration: 2 RMSE loss: 45.48917007446289
Boosting iteration: 3 RMSE loss: 40.836509704589844
Boosting iteration: 4 RMSE loss: 37.73506546020508


Boosting iteration: 5 RMSE loss: 36.77689743041992
Boosting iteration: 6 RMSE loss: 35.08983612060547
Boosting iteration: 7 RMSE loss: 33.87940216064453
Boosting iteration: 8 RMSE loss: 32.981075286865234
Boosting iteration: 9 RMSE loss: 32.36515426635742
Boosting iteration: 10 RMSE loss: 31.6733341217041


2. When working with multi-dimensional outputs

In [6]:
y_multi = th.concat([y, y], dim=1)
out_dim = y_multi.shape[1]
gbt_model = GradientBoostingTrees(
                    output_dim=out_dim,
                    tree_struct=tree_struct,
                    optimizer=optimizer,
                    gbrl_params=gbrl_params,
                    verbose=0,
                    device=device)
gbt_model.set_bias(y_multi)
# continuing training 10  epochs using a sum reduction
n_epochs = 10
for _ in range(n_epochs):
    y_pred = gbt_model(X, requires_grad=True)
    # we multiply the loss by the output dimension to compensate for pytorch's mean reduction for MSE loss that averages across all dimensions.
    # this step is necessary to get the correct loss gradient - however the loss value itself is correct
    loss = 0.5*mse_loss(y_pred, y_multi) * out_dim
    loss.backward()
    # perform a boosting step
    gbt_model.step(X)
    print(f"Boosting iteration: {gbt_model.get_iteration()} RMSE loss: {(loss / out_dim).sqrt()}")
    

Setting GBRL device to cuda
Setting policy optimizer indices: 0->2
Boosting iteration: 1 RMSE loss: 54.45128631591797
Boosting iteration: 2 RMSE loss: 45.48917007446289
Boosting iteration: 3 RMSE loss: 40.836509704589844
Boosting iteration: 4 RMSE loss: 37.73506546020508


Boosting iteration: 5 RMSE loss: 36.77689743041992
Boosting iteration: 6 RMSE loss: 35.08983612060547
Boosting iteration: 7 RMSE loss: 33.87940216064453
Boosting iteration: 8 RMSE loss: 32.981075286865234
Boosting iteration: 9 RMSE loss: 32.365150451660156
Boosting iteration: 10 RMSE loss: 31.6733341217041


## RL using GBRL
Now that we have seen how GBRL is trained using incremental learning and PyTorch we can use it within an RL training loop

Let's start by training a simple Reinforce Algorithm

In [7]:
def calculate_returns(rewards, gamma):
    returns = []
    running_g = 0.0
    for reward in rewards[::-1]:
        running_g = reward + gamma * running_g
        returns.insert(0, running_g)
    return returns

In [8]:
env = gym.make("CartPole-v1")
wrapped_env = gym.wrappers.RecordEpisodeStatistics(env, 50)  # Records episode-reward
num_episodes = 1000
gamma = 0.997
optimizer = { 'algo': 'SGD',
              'lr': 0.01,
            }

bias = np.zeros(env.action_space.n, dtype=np.single)
agent = ParametricActor(
                    output_dim=env.action_space.n,
                    tree_struct=tree_struct,
                    policy_optimizer=optimizer,
                    gbrl_params=gbrl_params,
                    verbose=0,
                    bias=bias, 
                    device='cpu')


update_every = 10

rollout_buffer = {'actions': [], 'obs': [], 'returns': []}
for episode in range(num_episodes):
    # gymnasium v26 requires users to set seed while resetting the environment
    obs, info = wrapped_env.reset(seed=0)
    rollout_buffer['rewards'] = []

    done = False
    while not done:
        action_logits = agent(obs)
        action = Categorical(logits=action_logits).sample()
        action_numpy = action.cpu().numpy()
        
        obs, reward, terminated, truncated, info = wrapped_env.step(action_numpy.squeeze())
        rollout_buffer['rewards'].append(reward)
        rollout_buffer['obs'].append(obs)
        rollout_buffer['actions'].append(action)

        done = terminated or truncated
    
    rollout_buffer['returns'].extend(calculate_returns(rollout_buffer['rewards'], gamma))


    if episode % update_every == 0 and episode > 0:
        returns = th.tensor(rollout_buffer['returns'])
        actions = th.cat(rollout_buffer['actions'])
        # input to model can be either a torch tensor or a numpy ndarray
        observations = np.stack(rollout_buffer['obs'])
        # model update
        action_logits = agent(observations, requires_grad=True)
        dist = Categorical(logits=action_logits)
        log_probs = dist.log_prob(actions)
        # calculate reinforce loss with subtracted baseline
        loss = -(log_probs*(returns - returns.mean())).mean()
        loss.backward()
        grads = agent.step(observations)
        rollout_buffer = {'actions': [], 'obs': [], 'returns': []}

    if episode % 100 == 0:
        print(f"Episode {episode} - boosting iteration: {agent.get_iteration()} episodic return: {np.mean(wrapped_env.return_queue)}")
        

Setting GBRL device to cpu
Setting policy optimizer indices: 0->2
Episode 0 - boosting iteration: 0 episodic return: 9.0


Episode 100 - boosting iteration: 10 episodic return: 22.280000686645508
Episode 200 - boosting iteration: 20 episodic return: 29.299999237060547
Episode 300 - boosting iteration: 30 episodic return: 32.63999938964844
Episode 400 - boosting iteration: 40 episodic return: 42.720001220703125
Episode 500 - boosting iteration: 50 episodic return: 51.540000915527344
Episode 600 - boosting iteration: 60 episodic return: 72.80000305175781
Episode 700 - boosting iteration: 70 episodic return: 78.9000015258789
Episode 800 - boosting iteration: 80 episodic return: 112.83999633789062
Episode 900 - boosting iteration: 90 episodic return: 129.72000122070312


### Using Manually Calculated Gradients
Alternatively GBRL can use manually calculated gradients.  Calling the `predict` method instead of the `__call__` method, returns a numpy array instead of a PyTorch tensor. Autograd libraries or manual calculations can be used to calculate gradients.  
Fitting manually calculated gradients is done using the `_model.step` method that receives numpy arrays. 


In [9]:
# initializing model parameters
tree_struct = {'max_depth': 4, 
               'n_bins': 256,
               'min_data_in_leaf': 0,
               'par_th': 2,
               'grow_policy': 'oblivious'
        }

optimizer = { 'algo': 'SGD',
              'lr': 1.0,
            }
gbrl_params = {
               "split_score_func": "Cosine",
               "generator_type": "Quantile"
                }
# setting up model
gbt_model = GradientBoostingTrees(
                    output_dim=1,
                    tree_struct=tree_struct,
                    optimizer=optimizer,
                    gbrl_params=gbrl_params,
                    verbose=0,
                    device=device)
# works with numpy arrays as well as PyTorch tensors
gbt_model.set_bias(y_numpy)

# training for 10 epochs
n_epochs = 10
for _ in range(n_epochs):
    # y_pred is a numpy array
    y_pred = gbt_model.predict(X_numpy)
    loss = np.sqrt(0.5*((y_pred - y_numpy)**2).mean()) 
    grads = y_pred - y_numpy
    # perform a boosting step
    gbt_model._model.step(X_numpy, grads)
    print(f"Boosting iteration: {gbt_model.get_iteration()} RMSE loss: {loss}")

Setting GBRL device to cuda
Setting policy optimizer indices: 0->1
Boosting iteration: 1 RMSE loss: 54.451285094616374
Boosting iteration: 2 RMSE loss: 45.48916999877324


Boosting iteration: 3 RMSE loss: 40.83651082662459
Boosting iteration: 4 RMSE loss: 37.73506439069844
Boosting iteration: 5 RMSE loss: 36.77689669772262
Boosting iteration: 6 RMSE loss: 35.089837631524226
Boosting iteration: 7 RMSE loss: 33.87940389697403
Boosting iteration: 8 RMSE loss: 32.98107514282689
Boosting iteration: 9 RMSE loss: 32.365154094608144
Boosting iteration: 10 RMSE loss: 31.67333523835015


## Supervised Learning
GBRL supports training multiple boosting iterations with targets similar to other GBT libraries. This is done using the `fit` method.  
***Note: only the RMSE loss function is supported for the `fit` method***

In [11]:
gbt_model = GradientBoostingTrees(
                    output_dim=1,
                    tree_struct=tree_struct,
                    optimizer=optimizer,
                    gbrl_params=gbrl_params,
                    verbose=1,
                    device=device)
final_loss = gbt_model.fit(X_numpy, y_numpy, iterations=10)


Setting GBRL device to cuda
Setting policy optimizer indices: 0->1
0 - MultiRMSE: 45.4892
1 - MultiRMSE: 40.8918


2 - MultiRMSE: 38.3409
3 - MultiRMSE: 36.839
4 - MultiRMSE: 35.6598
5 - MultiRMSE: 34.7947
6 - MultiRMSE: 33.7887
7 - MultiRMSE: 33.0885
8 - MultiRMSE: 32.3866
9 - MultiRMSE: 31.6777
