### LightGBM with Focal Loss for Multiclass classification problems

Let me show how to adapt the Focal Loss implementation for binary classification to a multiclass classification problem.

The idea is to approach the problem using the Binary Cross Entropy With Logits loss (borrowing the `PyTorch` name, `BCEWithLogitsLoss`):

$$
\text{loss} = -\left[ y_{\text{true}} \cdot \log\sigma(x) + (1-y_{\text{true}}) \cdot \log(1-\sigma(x)) \right]
$$

where $\sigma$ is the sigmoid function and $x$ is the raw score (logit).

For example, let's assume we have a problem with 10 classes and two samples/observations:

In [1]:
import numpy as np

y_true = np.random.choice(10, (1,2))
# values in [-2, 2] to illustrate that the preds coming from LightGBM when using custom losses are raw scores, NOT probabilities
y_pred = np.random.uniform(low=-2, high=2, size=(2, 10))

In [2]:
# labels
y_true

array([[0, 0]])

In [3]:
# predictions
y_pred

array([[-0.62900913,  0.92265852, -1.33477174, -1.89011705,  1.85566209,
         0.76361995,  0.1983925 , -0.5764042 , -0.84919259,  0.92979002],
       [ 1.99012283,  0.43470132,  0.35491818,  1.48850368, -0.20095172,
         1.96445624,  1.25049923, -0.86754563, -1.6867512 ,  1.098587  ]])

In [4]:
def sigmoid(x): 
    return 1./(1. +  np.exp(-x))

def softmax(x):
    # subtract the row-wise max for numerical stability; each row then sums to 1
    exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

In [5]:
# labels one-hot encoded
y_true_oh = np.eye(10)[y_true][0]

In [6]:
y_true_oh

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [7]:
# BCEWithLogitsLoss
( -( y_true_oh * np.log(sigmoid(y_pred)) + (1-y_true_oh) * np.log(1-sigmoid(y_pred)) ) ).mean()

0.9219779528703436
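The expression above takes `np.log` of a sigmoid, which can hit `log(0)` for very large scores. As a minimal sketch (the helper name `bce_with_logits` is my own, not a library function), the same loss can be written as $\log(1+e^{x}) - y_{\text{true}} \cdot x$ and evaluated with `np.logaddexp`, which should stay finite for large $|x|$:

```python
import numpy as np

def bce_with_logits(x, t):
    # log(1 + exp(x)) - t*x, algebraically equal to the BCE-with-logits formula
    return np.logaddexp(0.0, x) - t * x

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(2, 10))   # raw scores, as in the example above
t = np.eye(10)[[0, 0]]                 # one-hot labels: two samples of class 0

# naive version, same expression as in the cell above
p = 1.0 / (1.0 + np.exp(-x))
naive = -(t * np.log(p) + (1 - t) * np.log(1 - p))

print(np.allclose(naive, bce_with_logits(x, t)))

# extreme scores that would make the naive version take log(0)
extreme = bce_with_logits(np.array([1000.0, -1000.0]), np.array([1.0, 0.0]))
print(np.isfinite(extreme).all())
```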

### Multiclass Focal Loss

Before we jump to the Focal Loss code, let's focus for a second on a sentence in the LightGBM [documentation](https://lightgbm.readthedocs.io/en/latest/index.html): *"For multi-class task, the preds is group by class_id first, then group by row_id. If you want to get i-th row preds in j-th class, the access way is score[j $\times$ num_data + i] and you should group grad and hess in this way as well."*

Let's assume we have 100 rows and 4 classes:

In [8]:
preds = np.random.rand(100*4)

To access the prediction for class `1` for the row with index `20`, we need the index 1 $\times$ 100 + 20 = 120.

We will compute the Focal Loss using the `BCEWithLogitsLoss` which requires that we have an array of predictions of shape (num_data, num_class). 

Therefore, to reshape the predictions (scores) coming from LightGBM into that format, we need to use Fortran-style (column-major) order.

In [10]:
preds[120]

0.8213133926994343

In [11]:
preds.reshape(-1 , 4, order='F')[20, 1]

0.8213133926994343

And in general, the first 100 entries of the flat array are the scores for class `0`:

In [12]:
np.all(preds[:100] == preds.reshape(-1 , 4, order='F')[:100,0])

True
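The same layout matters on the way back, since the gradient and hessian must be returned in the flat class-major order. A small sketch of the round trip, using `np.arange` as a stand-in for the scores so that each value equals its own flat index:

```python
import numpy as np

num_data, num_class = 100, 4
# stand-in for the flat scores: value == position in the flat array
flat = np.arange(num_data * num_class, dtype=float)

# (num_data, num_class) view of the class-major flat array
scores = flat.reshape(-1, num_class, order='F')

i, j = 20, 1
# row i, class j lives at flat index j * num_data + i
print(scores[i, j] == flat[j * num_data + i])

# flattening in Fortran order restores the original layout
restored = scores.flatten('F')
print(np.array_equal(restored, flat))
```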

So, without further ado:

In [13]:
def focal_loss_lgb(y_pred, dtrain, alpha, gamma, num_class):
    """
    Focal Loss objective for lightgbm (returns the flattened grad and hess)

    Parameters:
    -----------
    y_pred: numpy.ndarray
        array with the predictions
    dtrain: lightgbm.Dataset
    alpha, gamma: float
        See original paper https://arxiv.org/pdf/1708.02002.pdf
    num_class: int
        number of classes
    """
    a,g = alpha, gamma
    y_true = dtrain.label
    # N observations x num_class arrays
    y_true = np.eye(num_class)[y_true.astype('int')]
    y_pred = y_pred.reshape(-1,num_class, order='F')
    # alpha and gamma multiplicative factors with BCEWithLogitsLoss
    def fl(x,t):
        p = 1/(1+np.exp(-x))
        return -( a*t + (1-a)*(1-t) ) * (( 1 - ( t*p + (1-t)*(1-p)) )**g) * ( t*np.log(p)+(1-t)*np.log(1-p) )
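    partial_fl = lambda x: fl(x, y_true)
    # numerical first and second derivatives of the loss w.r.t. the raw scores;
    # `derivative` is scipy.misc.derivative (imported in the example section below)
    grad = derivative(partial_fl, y_pred, n=1, dx=1e-6)
    hess = derivative(partial_fl, y_pred, n=2, dx=1e-6)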
    # flatten in column-major (Fortran-style) order
    return grad.flatten('F'), hess.flatten('F')
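A note on `derivative`: `scipy.misc.derivative` was deprecated in SciPy 1.10 and removed in SciPy 1.12, so the function above may not run on a recent installation. As a sketch under that assumption, a small central-difference helper (the name `central_diff` is my own, not a library function) should reproduce the default three-point behaviour for `n=1` and `n=2`:

```python
import numpy as np

def central_diff(func, x0, n=1, dx=1e-6):
    """Elementwise three-point central finite difference of order n (1 or 2)."""
    if n == 1:
        return (func(x0 + dx) - func(x0 - dx)) / (2.0 * dx)
    if n == 2:
        return (func(x0 + dx) - 2.0 * func(x0) + func(x0 - dx)) / dx**2
    raise ValueError("only n=1 and n=2 are supported in this sketch")

# quick check against a function with known derivatives
x = np.array([0.1, 0.5, 1.0])
d1 = central_diff(np.sin, x, n=1, dx=1e-6)   # should be close to cos(x)
d2 = central_diff(np.sin, x, n=2, dx=1e-4)   # should be close to -sin(x)
print(np.allclose(d1, np.cos(x), atol=1e-6))
print(np.allclose(d2, -np.sin(x), atol=1e-5))
```

Note that the second difference divides by `dx**2`, so a very small `dx` amplifies floating-point rounding; the check above uses a larger step for `n=2` for that reason.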

And that's it really. Now one would want/need the corresponding evaluation function.

In [14]:
def focal_loss_lgb_eval_error(y_pred, dtrain, alpha, gamma, num_class):
    """
    Focal Loss evaluation metric for lightgbm

    Parameters:
    -----------
    y_pred: numpy.ndarray
        array with the predictions
    dtrain: lightgbm.Dataset
    alpha, gamma: float
        See original paper https://arxiv.org/pdf/1708.02002.pdf
    num_class: int
        number of classes
    """
    a,g = alpha, gamma
    y_true = dtrain.label
    y_true = np.eye(num_class)[y_true.astype('int')]
    y_pred = y_pred.reshape(-1, num_class, order='F')
    p = 1/(1+np.exp(-y_pred))
    loss = -( a*y_true + (1-a)*(1-y_true) ) * (( 1 - ( y_true*p + (1-y_true)*(1-p)) )**g) * ( y_true*np.log(p)+(1-y_true)*np.log(1-p) )
    # a variant can be np.sum(loss)/num_class
    return 'focal_loss', np.mean(loss), False
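A quick sanity check of the loss expression, as a sketch: with `gamma=0` the modulating factor becomes 1, and with `alpha=0.5` both classes are weighted by 0.5, so the focal loss should reduce to half of the plain `BCEWithLogitsLoss`. The helper name `focal_loss_elementwise` is hypothetical and only wraps the same formula used above.

```python
import numpy as np

def focal_loss_elementwise(x, t, a, g):
    # same elementwise expression as in the objective and the eval function
    p = 1.0 / (1.0 + np.exp(-x))
    weight = a * t + (1 - a) * (1 - t)                  # alpha weighting
    modulating = (1 - (t * p + (1 - t) * (1 - p))) ** g  # (1 - p_t)^gamma
    log_term = t * np.log(p) + (1 - t) * np.log(1 - p)
    return -weight * modulating * log_term

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=(5, 3))            # raw scores
t = np.eye(3)[rng.integers(0, 3, size=5)]      # one-hot labels

p = 1.0 / (1.0 + np.exp(-x))
bce = -(t * np.log(p) + (1 - t) * np.log(1 - p))

fl_plain = focal_loss_elementwise(x, t, 0.5, 0.0)
fl_focused = focal_loss_elementwise(x, t, 0.5, 2.0)

print(np.allclose(fl_plain, 0.5 * bce))   # gamma=0, alpha=0.5 -> half of BCE
print(np.all(fl_focused <= fl_plain))     # gamma>0 only ever down-weights
```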

### EXAMPLE

In [15]:
import numpy as np
import lightgbm as lgb

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import  accuracy_score
from scipy.misc import derivative

# a poor showcase for Focal Loss, as the classes are roughly balanced, but enough to illustrate the mechanics
digits = datasets.load_digits()
X = digits.data
y = digits.target

In [16]:
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=1)
lgbtrain = lgb.Dataset(X_tr, y_tr, free_raw_data=True)
lgbeval = lgb.Dataset(X_val, y_val)

In [17]:
focal_loss = lambda x,y: focal_loss_lgb(x, y, 0.25, 2., 10)
eval_error = lambda x,y: focal_loss_lgb_eval_error(x, y, 0.25, 2., 10)
params  = {'learning_rate':0.1, 'num_boost_round':10, 'num_class':10}
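# model = lgb.train(params, lgbtrain, fobj=focal_loss)
# note: the `fobj` argument was removed in LightGBM 4.0; on newer versions pass the callable as params['objective']
model = lgb.train(params, lgbtrain, valid_sets=[lgbeval], fobj=focal_loss, feval=eval_error)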

[1]	valid_0's focal_loss: 0.107288
[2]	valid_0's focal_loss: 0.0951971
[3]	valid_0's focal_loss: 0.0846662
[4]	valid_0's focal_loss: 0.0755319
[5]	valid_0's focal_loss: 0.0675866
[6]	valid_0's focal_loss: 0.0605897
[7]	valid_0's focal_loss: 0.0544604
[8]	valid_0's focal_loss: 0.0490753
[9]	valid_0's focal_loss: 0.0442874
[10]	valid_0's focal_loss: 0.0400507




In [18]:
accuracy_score(y_val, np.argmax(softmax(model.predict(X_val)), axis=1))

0.9083333333333333
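Since the model was trained with a custom loss, `model.predict` returns raw scores rather than probabilities. For the accuracy this does not matter: softmax and sigmoid are both monotonic within a row, so the `argmax` is the same with or without them. A small sketch with random numbers standing in for the raw scores:

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.uniform(-2, 2, size=(6, 10))   # stand-in for the raw scores of 6 rows, 10 classes

# sigmoid, the link that matches the loss used for training
sig = 1.0 / (1.0 + np.exp(-raw))

# row-wise softmax
sm = np.exp(raw - raw.max(axis=1, keepdims=True))
sm /= sm.sum(axis=1, keepdims=True)

pred_raw = raw.argmax(axis=1)
print(np.array_equal(pred_raw, sig.argmax(axis=1)))
print(np.array_equal(pred_raw, sm.argmax(axis=1)))
```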