# Loss Benchmark for CatBoost MultiClass Classification

This notebook aims to identify the best loss function for training the model on our multiclass classification problem.

## I. Dataset loading

In [20]:
import pandas as pd
from catboost import CatBoostClassifier, Pool
import sklearn.metrics as skl
import matplotlib.pyplot as plt
import os
import numpy as np

os.chdir("C:/Users/thoma/OneDrive - CentraleSupelec/NOPLP/code/ML")

In [1]:
# Load the dataset
print("Dataset loading...")
df = pd.read_csv("data/lossBenchmarkData.csv", sep=";")

# Data Cleaning
print("Data cleaning...")
df = df.drop(columns=['Unnamed: 0', 'id', 'Chanson_id'])
reversed_cat = {'50': 1, '40': 2, '30': 3,
                '20': 4, '10': 5, 'MC': 6, '20k': 7, None: 8}
df = df.replace({'categorie': reversed_cat})

# Split train / test: first 80% of the rows for training, the rest for testing
# Note: the +1 below leaves the row at the split position out of both sets
print("Splitting into train and test...")
train = df[0:int(len(df)*0.8)]
print("Train set size: " + str(len(train)))
test = df[int(len(df)*0.8)+1:]
print("Test set size: " + str(len(test)))
# Separate the target column from the features
train_labels = train['categorie']
train = train.drop(columns=['categorie'])
train_data = train
test_labels = test['categorie']
test = test.drop(columns=['categorie'])
test_data = test
test_pool = Pool(test_data,
                 test_labels,
                 cat_features=['titre', 'artiste', 'clusterid'])

Dataset loading...
Data cleaning...
Splitting into train and test...
Train set size: 480160
Test set size: 120040


## II. Standard multiclass loss

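Let $N$ be the sample size, $t_i$ the label of the $i$-th row, and $a_i = \mathrm{model}(\mathrm{row}_i)$ the vector of scores for that row, with $a_{ij}$ its component for class $j$. <br/>
We have $8$ classes: 50, 40, 30, 20, 10, MC, 20k, PP. <br/>
$$Loss = \sum_{i=1}^{N} \log\left(\frac{\exp(a_{it_i})}{\sum_{j=1}^{8} \exp(a_{ij})}\right)$$
Each term is the logarithm of a probability, so it is never positive, and values closer to $0$ mean a better fit. Let's train the model with this loss function and compute the loss for a few rows.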
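As a minimal sketch of the formula above, the snippet below evaluates the loss on a toy score matrix with NumPy. The names `multiclass_log_loss`, `scores` and `labels` are illustrative and do not come from the notebook. Labels are assumed to be 0-based here, whereas the notebook encodes the classes as 1 to 8.

```python
import numpy as np


def multiclass_log_loss(scores, labels):
    """Sum over rows of log(softmax(scores)[i, labels[i]]).

    scores: array of shape (N, K), one score per row and class.
    labels: array of shape (N,), 0-based class index of each row (assumption).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Subtract the row maximum so that exp() cannot overflow
    shifted = scores - scores.max(axis=1, keepdims=True)
    # log of the softmax denominator, computed on the shifted scores
    log_norm = np.log(np.exp(shifted).sum(axis=1))
    log_softmax = shifted - log_norm[:, None]
    # Pick, for each row, the log-softmax of its true class and sum
    return log_softmax[np.arange(len(labels)), labels].sum()


# Toy data: 2 rows, 3 classes (hypothetical values)
scores = np.array([[2.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])
labels = np.array([0, 2])
print(multiclass_log_loss(scores, labels))
```

With all scores equal, every class gets probability $1/K$, so the second row contributes $\log(1/3)$ to the sum.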

In [2]:
# Training model
print("Training the CatBoost model...")
model = CatBoostClassifier(iterations=10,
                           depth=10,
                           learning_rate=1,
                           loss_function='MultiClass',
                           verbose=True)
model.fit(train_data, train_labels, cat_features=[
          'titre', 'artiste', 'clusterid'])

Training the CatBoost model...
0:	learn: 0.0828217	total: 3.97s	remaining: 35.7s
1:	learn: 0.8353704	total: 5.6s	remaining: 22.4s
2:	learn: 0.1182897	total: 8.08s	remaining: 18.9s
3:	learn: 30.5449169	total: 10.4s	remaining: 15.6s
4:	learn: 42.1782888	total: 12.2s	remaining: 12.2s
5:	learn: 40.5164525	total: 15.9s	remaining: 10.6s
6:	learn: 40.5652163	total: 19.8s	remaining: 8.46s
7:	learn: 40.2072280	total: 24.3s	remaining: 6.08s
8:	learn: 39.7849596	total: 29s	remaining: 3.22s
9:	learn: 39.7424531	total: 33.6s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x27728d9f448>

In [3]:
# Prediction on test set
print("Predicting on the test set...")
preds_class = model.predict(test_pool)
preds_proba = model.predict_proba(test_pool)
preds = model.predict_log_proba(test_pool)

# Add the predictions to the test dataset
# Columns "1" to "8" hold the per-class log-probabilities from predict_log_proba
test_data["pred"] = preds_class
test_data["labels"] = test_labels
test_data[["1", "2", "3", "4", "5", "6", "7", "8"]] = preds

Predicting on the test set...


In [4]:
test_data.sample(5)

Unnamed: 0,titre,année,decennie,artiste,clusterid,deltadate,deltadatemc,deltadatemcma,deltadate20k,deltadate20kma,...,pred,labels,1,2,3,4,5,6,7,8
598210,Le frunkp,2003,2000,Brown Alphonse,3,118.0,,,254.0,254.0,...,8,8,-2547.158986,-2540.429685,-2411.020251,-2508.033605,-1819.368149,-1750.569905,-1299.028236,0.0
571372,Tellement je t'aime,1997,1990,Faudel,3,97.0,,,,168.0,...,8,8,-6.649022,-6.316126,-5.889848,-6.031455,-8.740103,-9.444474,-6.389573,-0.010242
564760,Ville de lumière,1986,1980,Gold,3,60.0,,254.0,60.0,60.0,...,8,8,-9074.794466,-12692.537402,-12716.1121,-9712.065906,-9243.829527,-11316.211732,-9176.468312,0.0
547273,L'un pour l'autre,1998,1990,Maurane,1,70.0,,,,34.0,...,8,8,-8.982866,-7.252601,-7.399211,-7.038224,-9.035636,-11.395058,-6.601115,-0.00382
492765,Ca ira mon amour,2011,2010,1789 Les Amants de la Bastille,4,59.0,,,314.0,314.0,...,8,8,-2740.825652,-2732.763018,-2605.353585,-2707.033605,-1993.034815,-1925.236572,-1531.361569,0.0


Unpivot (melt) the class columns to extract $a_{it_i}$, the value for the true class of each row, then merge it back into the test dataset.

In [16]:
# Keep the row index as an explicit column so it can serve as the merge key
test_data['index'] = test_data.index
# Unpivot: one row per (test row, class column) pair
dfwork = pd.melt(test_data, id_vars=["index", "labels"],
                 value_vars=["1", "2", "3", "4", "5", "6", "7", "8"])
dfwork['variable'] = dfwork['variable'].astype('int64')
# Keep only the value of the true class for each test row
dfwork = dfwork[dfwork['labels'] == dfwork['variable']]
# Merge back: 'labels' is duplicated as 'labels_x' / 'labels_y'
test_data = pd.merge(test_data, dfwork, on='index')
test_data[['titre', 'pred', 'labels_x', "1", "2", "3", "4", "5", "6", "7", "8", 'value']].head()

Unnamed: 0,titre,pred,labels_x,1,2,3,4,5,6,7,8,value
0,Simple et funky,8,8,-6.675643,-6.308689,-5.72158,-6.109035,-9.205689,-9.005077,-6.998044,-0.009763,-0.009763
1,Carmen,8,8,-2547.158986,-2540.429685,-2411.020251,-2508.033605,-1819.368149,-1750.569905,-1299.028236,0.0,0.0
2,Un garçon pas comme les autres (Ziggy),8,8,-18.798847,-15.878712,-15.410705,-14.897523,-14.334896,-14.878623,-21.5036,-2e-06,-2e-06
3,Les comédiens,8,8,-5.090653,-5.258897,-6.056197,-6.671024,-8.666012,-7.710688,-7.978065,-0.016057,-0.016057
4,L'épervier,8,8,-7.856923,-7.257187,-7.145057,-7.172741,-8.701174,-9.786883,-7.219582,-0.003609,-0.003609
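The melt, filter and merge steps above select, for each row, the class column whose name matches the true label. The same lookup can be sketched more directly with NumPy indexing. This is an alternative for illustration, not what the notebook runs. The toy frame and the name `true_class_value` are hypothetical, and the sketch assumes integer labels starting at 1 and class columns listed in label order, as in the notebook.

```python
import numpy as np
import pandas as pd


def true_class_value(df, class_cols, label_col="labels"):
    """Return, for each row, the value in the column of its true class.

    Assumes labels are 1-based integers and class_cols[k - 1] is the
    column that belongs to label k.
    """
    values = df[class_cols].to_numpy()
    # Labels are 1-based, so subtract 1 to get positions in class_cols
    col_pos = df[label_col].to_numpy().astype(int) - 1
    picked = values[np.arange(len(df)), col_pos]
    return pd.Series(picked, index=df.index, name="value")


# Toy frame with two classes (hypothetical values)
toy = pd.DataFrame({"1": [-0.1, -3.0],
                    "2": [-2.5, -0.05],
                    "labels": [1, 2]})
toy["value"] = true_class_value(toy, ["1", "2"])
print(toy)
```

This version keeps the original index and column names, so no `labels_x` / `labels_y` duplicates appear.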


Loss calculation on a few rows

In [19]:
test_data[["titre", "artiste", "labels_x", "pred", "1", "2", "3", "4", "5", "6", "7", "8", "value"]].sample(5)

Unnamed: 0,titre,artiste,labels_x,pred,1,2,3,4,5,6,7,8,value
83483,J'en rêve encore,De Palmas Gérald,8,8,-4.990878,-5.703808,-6.540736,-7.368997,-9.427697,-6.055568,-7.733336,-0.015185,-0.015185
59608,Le petit pain au chocolat,Dassin Joe,8,8,-6.082804,-5.278978,-5.799498,-6.536931,-7.83604,-5.265152,-6.243683,-0.019554,-0.019554
38858,La chanson de Ziggy,Starmania,8,8,-9.009382,-7.279364,-7.425162,-7.067436,-9.057242,-11.416644,-6.623597,-0.003723,-0.003723
57894,On est bien comme ça,Vianney,8,8,-8.996004,-8.08145,-7.692811,-7.249675,-8.047237,-9.950513,-7.888769,-0.002345,-0.002345
54169,Ces gens-là,Brel Jacques,8,8,-9.641819,-7.927059,-7.535501,-6.956364,-8.936638,-11.486475,-7.955635,-0.002407,-0.002407


In [50]:
# Per-row loss: log of (exp of the true-class value / sum of exp over the 8 class columns)
class_cols = ["1", "2", "3", "4", "5", "6", "7", "8"]
test_data['loss'] = np.log(np.exp(test_data['value']) / np.exp(test_data[class_cols]).sum(axis=1))
test_data[(test_data['pred'] != 8)][["titre", "artiste", "labels_x", "pred", 'loss']].sample(5)

  result = getattr(ufunc, method)(*inputs, **kwargs)


Unnamed: 0,titre,artiste,labels_x,pred,loss
54505,La vie en rose,Piaf Édith,8,3,-38.154971
11263,La vie en rose,Piaf Édith,8,3,-47.282362
114565,Il est mort le soleil,Nicoletta,8,1,-3.595413
72042,Faut rigoler,Salvador Henri,8,7,-226.097364
28510,L'homme à la moto,Piaf Édith,8,7,-251.748772


In [51]:
test_data[test_data['labels_x'] != 8][["titre", "artiste", "labels_x", "pred", 'loss']].sample(5)

Unnamed: 0,titre,artiste,labels_x,pred,loss
37030,On a tous le droit,Foly Liane,1,8,-6.330596
116526,Banana split,Lio,3,8,-inf
24913,"Damn, dis-moi",Christine and the Queens (Chris),4,8,-6.031455
96112,Demain sera parfait,Aubert Jean-Louis,7,8,-5.407803
46537,Mister Hyde,Chatel Philippe,3,8,-7.279961
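The `-inf` in the table above most likely comes from `np.exp` underflowing to 0 for a very negative true-class value (rows sampled earlier contain entries below -1000), after which the logarithm of 0 is taken. A minimal sketch of an equivalent computation that stays in log space is given below. `stable_loss` and the toy frame are hypothetical names, and this is a suggested alternative rather than the notebook's code.

```python
import numpy as np
import pandas as pd


def stable_loss(df, class_cols, value_col="value"):
    """Per-row value - log(sum_j exp(class columns)), computed in log space."""
    scores = df[class_cols].to_numpy(dtype=float)
    # Log-sum-exp trick: factor out the row maximum before exponentiating
    row_max = scores.max(axis=1)
    log_sum_exp = row_max + np.log(np.exp(scores - row_max[:, None]).sum(axis=1))
    # log(exp(v) / S) == v - log(S), so exp(v) is never evaluated
    return df[value_col].to_numpy(dtype=float) - log_sum_exp


# Toy frame (hypothetical values): the first row would give -inf with the naive formula
toy = pd.DataFrame({"1": [-2000.0, -0.5],
                    "2": [0.0, -1.0],
                    "value": [-2000.0, -0.5]})
toy["loss"] = stable_loss(toy, ["1", "2"])
print(toy)
```

Because the true-class value is subtracted instead of exponentiated, the result stays finite even when that value is far below the underflow threshold of `np.exp`.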


In [52]:
test_data[(test_data['labels_x'] != 8) & (test_data['pred'] != 8)][["titre", "artiste", "labels_x", "pred",'loss']].sample(5)

Unnamed: 0,titre,artiste,labels_x,pred,loss
82509,Salade de fruits,Bourvil,4,7,-3.214063
97617,Maritie et Gilbert Carpentier,Bénabar,7,1,-69.96097
73809,Jolie môme,Ferré Léo,2,7,-35.829248
106616,La vie en rose,Piaf Édith,1,3,-19.983209
54476,L'homme à la moto,Piaf Édith,1,7,-89.941867
