# Loss Benchmark for CatBoost MultiClass Classification

This notebook aims to determine which loss function is best suited to training the model for our multiclass classification problem.

## I. Dataset loading

In [2]:
import pandas as pd
from catboost import CatBoostClassifier, Pool
import sklearn.metrics as skl
import matplotlib.pyplot as plt
import os

os.chdir("C:/Users/thoma/OneDrive - CentraleSupelec/NOPLP/code/ML")

# Load the dataset
print("Dataset loading...")
df = pd.read_csv("data/lossBenchmarkData.csv", sep=";")

# Data Cleaning
print("Data cleaning...")
df = df.drop(columns=['Unnamed: 0', 'id', 'Chanson_id'])
reversed_cat = {'50': 1, '40': 2, '30': 3,
                '20': 4, '10': 5, 'MC': 6, '20k': 7, None: 8}
df = df.replace({'categorie': reversed_cat})

# Split train / test (first 80% of the rows for training, the rest for testing)
print("Splitting into train and test...")
train = df[0:int(len(df)*0.8)]
print("Train dataset size: " + str(len(train)))
# Note: the "+1" leaves the row at index int(len(df)*0.8) out of both sets
test = df[int(len(df)*0.8)+1:]
print("Test dataset size: " + str(len(test)))
train_labels = train['categorie']
train = train.drop(columns=['categorie'])
train_data = train
test_labels = test['categorie']
test = test.drop(columns=['categorie'])
test_data = test
test_pool = Pool(test_data,
                 test_labels,
                 cat_features=['titre', 'artiste'])

Dataset loading...
Data cleaning...
Splitting into train and test...
Train dataset size: 206130
Test dataset size: 51532
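
The split above slices the test set from `int(len(df)*0.8)+1`, so the row sitting exactly at the cut index ends up in neither set. The sizes printed above (206130 + 51532) are consistent with one row being left out. A minimal sketch on a plain Python list, where `rows`, `cut`, `train_part`, `test_gap` and `test_full` are hypothetical names used only for illustration:

```python
rows = list(range(10))          # stand-in for a toy dataset of 10 rows
cut = int(len(rows) * 0.8)      # index where the split happens

train_part = rows[0:cut]        # first 80% of the rows
test_gap = rows[cut + 1:]       # same slicing as the cell above: row `cut` is skipped
test_full = rows[cut:]          # contiguous alternative: no row is dropped

print(len(train_part), len(test_gap), len(test_full))  # prints: 8 1 2
```

Slicing the test set from `cut` instead of `cut + 1` would keep every row, at the cost of changing the test size reported above by one.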


## II. Standard multiclass loss

Let $N$ be the size of the sample, $t_i$ the label of the $i$-th row, and $a_i = \text{model}(row_i)$ the vector of scores for that row, with $a_{ij}$ its component for class $j$. <br/>
There are $8$ classes: 50, 40, 30, 20, 10, MC, 20k, PP. <br/>
$$Loss = \sum_{i=1}^{N} \log\left(\frac{\exp(a_{it_i})}{\sum_{j=1}^{8} \exp(a_{ij})}\right)$$
Let's train the model with this loss function and compute the loss for a few rows.
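
As a sanity check on the formula above, the sum can be evaluated by hand on a few made-up scores. The sketch below is not CatBoost's implementation: `log_softmax`, `multiclass_log_likelihood`, `toy_scores` and `toy_labels` are hypothetical names, the example uses 3 classes instead of 8, and labels are 0-indexed for convenience.

```python
import math


def log_softmax(scores):
    """Return log(exp(s_k) / sum_j exp(s_j)) for every class k."""
    m = max(scores)  # subtract the max before exponentiating, for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_norm for s in scores]


def multiclass_log_likelihood(all_scores, labels):
    """Sum over rows of the log-softmax value of the true class."""
    return sum(log_softmax(a)[t] for a, t in zip(all_scores, labels))


# Hypothetical raw scores a_i for 2 rows and 3 classes, with true labels t_i
toy_scores = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
toy_labels = [0, 2]

toy_loss = multiclass_log_likelihood(toy_scores, toy_labels)
print(toy_loss)
```

Each term is the logarithm of a probability, so the sum as written is never positive; it gets closer to zero as the model puts more probability on the true class.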

In [3]:
# Training model
print("Training the CatBoost model...")
model = CatBoostClassifier(iterations=10,
                           depth=10,
                           learning_rate=1,
                           loss_function='MultiClass',
                           verbose=True)
model.fit(train_data, train_labels, cat_features=[
          'titre', 'artiste'])

Training the CatBoost model...
0:	learn: 0.0835630	total: 1.48s	remaining: 13.3s
1:	learn: 0.5917520	total: 2.96s	remaining: 11.8s
2:	learn: 1.1319395	total: 4.19s	remaining: 9.79s
3:	learn: 1.1233489	total: 5.42s	remaining: 8.12s
4:	learn: 1.1151886	total: 6.76s	remaining: 6.76s
5:	learn: 1.0755618	total: 7.98s	remaining: 5.32s
6:	learn: 1.0442403	total: 9.24s	remaining: 3.96s
7:	learn: 1.0639025	total: 10.5s	remaining: 2.63s
8:	learn: 1.0210985	total: 12s	remaining: 1.34s
9:	learn: 1.0189583	total: 13.4s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x22fa7bac188>

In [18]:
# Prediction on test set
print("Predicting on the test set...")
preds_class = model.predict(test_pool)        # predicted class for each row
preds_proba = model.predict_proba(test_pool)  # per-class probabilities
preds = model.predict_log_proba(test_pool)    # per-class log-probabilities

# Add the predictions, the true labels and the 8 log-probability columns to the test dataset
test_data["pred"] = preds_class
test_data["labels"] = test_labels
test_data[["1", "2", "3", "4", "5", "6", "7", "8"]] = preds

Predicting on the test set...


Unpivoting (melt) to obtain $a_{it_i}$, then pivoting back

In [27]:
test_df = pd.melt(test_data, id_vars = ["labels"], value_vars = ["1", "2", "3", "4", "5", "6", "7", "8"])
test_df.sample(5)

Unnamed: 0,labels,variable,value
17379,8,1,-1159.449307
408560,8,8,-0.002942
2246,8,1,-3.55532
411164,8,8,-0.009653
391310,8,8,-0.004699
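
The notebook does not show how the melted frame is reduced to the value of the true class, so the following is only one possible sketch, under that assumption: keep the melted rows whose `variable` equals `labels`, then index the result by row. The frame `toy`, the helper column `row` and the names `long`, `matched` and `true_class_logp` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: true label plus one log-probability column per class
toy = pd.DataFrame({
    "labels": [1, 3, 2],
    "1": [-0.1, -2.0, -1.5],
    "2": [-2.5, -1.8, -0.3],
    "3": [-3.0, -0.4, -2.2],
})

# Keep the row identity before melting so the result can be mapped back
toy = toy.reset_index().rename(columns={"index": "row"})
long = pd.melt(toy, id_vars=["row", "labels"], value_vars=["1", "2", "3"])

# Keep only the entries whose class column matches the true label
matched = long[long["variable"].astype(int) == long["labels"]]
true_class_logp = matched.set_index("row")["value"].sort_index()

print(true_class_logp.tolist())  # one value per original row
print(true_class_logp.sum())     # sum of the true-class log-probabilities
```

Carrying a `row` identifier through the melt is what makes it possible to join the selected values back onto the wide table afterwards.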


Loss calculation on a few rows

In [28]:
#test_data[["titre", "artiste", "labels", "1", "2", "3", "4", "5", "6", "7", "8"]].sample(5)
test_data.sample(5)

Unnamed: 0,titre,année,decennie,artiste,clusterid,deltadate,deltadatemc,deltadatemcma,deltadate20k,deltadate20kma,...,20k,PP,1,2,3,4,5,6,7,8
225621,J'traîne des pieds,2005,2000,Ruiz Olivia,4,44.0,,35.0,432.0,127.0,...,-786.853478,0.0,-1164.899586,-1118.420728,-816.396575,-786.364496,-789.356976,-802.697736,-786.853478,0.0
255115,L'amour du risque,1982,1980,Générique TV,3,364.0,,,,182.0,...,-7.121584,-0.010281,-6.417172,-6.111952,-6.458136,-6.065715,-6.630894,-7.918763,-7.121584,-0.010281
253130,Prends ma main,2021,2020,Vitaa,3,,,,,217.0,...,-10.203808,-0.002748,-8.229376,-7.71344,-7.168377,-6.770982,-9.946117,-10.444587,-10.203808,-0.002748
251835,C'est une belle journée,2002,2000,Farmer Mylène,2,57.0,,63.0,57.0,57.0,...,-5.387539,-0.007118,-8.177555,-7.3326,-7.032111,-7.349471,-10.25736,-10.667861,-5.387539,-0.007118
251887,Ton visage,2015,2010,Fréro Delavega,2,73.0,,,157.0,63.0,...,-5.785979,-0.004979,-7.729832,-7.855922,-7.591559,-7.617511,-9.700976,-11.387655,-5.785979,-0.004979
