In [1]:
# Importing libraries for classic Python operations
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import libraries for data pre-processing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Importing libraries for model selection
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

# Importing libraries for results analysis
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
import seaborn

from xgboost import plot_importance
from xgboost import plot_tree
from matplotlib.pylab import rcParams

# Importing the library for the model under test
from xgboost import XGBClassifier

import pickle

# XGBoost 

## I) Theory

XGBoost is a gradient boosting ensemble Machine Learning technique based on decision trees.
We've already seen what the terms decision tree, bagging and random forest mean in the Random Forest section, so let's take a closer look at what's new in XGBoost.

### 1) What does boosting mean?

Boosting is an ensemble strategy in which each new model is added to correct the errors made by the models already in the ensemble. A first model is fitted to the training data. A second model is then fitted to correct the errors of the first, and further models are added sequentially until no further improvement is possible. In this way, boosting builds a strong classifier out of a sequence of 'weak' classifiers.

### 2) What does gradient mean?

Gradient boosting is so named because it employs a gradient descent approach to minimize loss when adding new models.
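
As a rough illustration of this idea, the sketch below builds a tiny gradient boosting regressor by hand. It assumes a squared-error loss, for which the negative gradient is simply the residual, and uses depth-1 trees (stumps) on a single feature. The helper names (`fit_stump`, `predict_stump`, `boost`) and the toy data are illustrative assumptions and are not part of XGBoost, which relies on a more elaborate, regularized version of this procedure.

```python
import numpy as np

def fit_stump(x, residuals):
    """Fit a depth-1 regression tree (a stump) on a 1-D feature by least squares."""
    best = None
    for threshold in np.unique(x)[:-1]:          # candidate split points
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.3):
    """Gradient boosting for squared error: each stump fits the current residuals."""
    prediction = np.full_like(y, y.mean(), dtype=float)   # initial constant model
    losses = []
    for _ in range(n_rounds):
        residuals = y - prediction       # negative gradient of 0.5 * (y - prediction)^2
        stump = fit_stump(x, residuals)
        # Take a small step in the direction suggested by the new model
        prediction = prediction + learning_rate * predict_stump(stump, x)
        losses.append(float(np.mean((y - prediction) ** 2)))
    return prediction, losses

# Toy data: a noisy sine wave
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

prediction, losses = boost(x, y)
print(f"MSE after 1 round: {losses[0]:.4f}, after {len(losses)} rounds: {losses[-1]:.4f}")
```

Each round plays the role of one gradient descent step: the new stump points in the direction that reduces the loss, and `learning_rate` controls the size of the step.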

### 3) What does eXtreme mean?

XGBoost (eXtreme Gradient Boosting) is an optimized implementation of the gradient boosting algorithm, designed for speed and with built-in regularization.

## II) Advantages and Drawbacks

Advantages

- Execution Speed:
XGBoost is generally fast compared to other gradient boosting implementations.

- Performance: XGBoost has a strong track record of producing high-quality results in various machine learning tasks, especially in Kaggle competitions, where it has been a popular choice for winning solutions.

- Regularization:
Unlike a standard GBM implementation, XGBoost includes regularization terms in its objective, which helps to reduce overfitting. For this reason, XGBoost is also known as a 'regularized boosting' technique.

- High Flexibility:
XGBoost allows users to define custom optimization objectives and evaluation criteria, which makes it possible to adapt the model to problem-specific losses and metrics.

- Handling Missing Values:
XGBoost has a built-in routine to handle missing values. The user indicates which value encodes a missing entry through the `missing` parameter (`np.nan` by default). At each split, XGBoost tries sending the missing values down each branch and learns a default direction, which is then used whenever a missing value is encountered at that node.

- Interpretability: Unlike some machine learning algorithms that can be difficult to interpret, XGBoost provides feature importances, allowing for a better understanding of which variables are most important in making predictions.
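
The default-direction idea mentioned under "Handling Missing Values" can be sketched as follows. This is a simplified illustration that assumes a single feature, a single fixed threshold and the usual leaf score G² / (H + λ) built from sums of gradients and hessians. The function names and the toy values are hypothetical; XGBoost's actual routine is more involved.

```python
import numpy as np

def leaf_score(gradients, hessians, reg_lambda=1.0):
    """Score of a leaf: G^2 / (H + lambda), with G and H the gradient and hessian sums."""
    return gradients.sum() ** 2 / (hessians.sum() + reg_lambda)

def default_direction(feature, gradients, hessians, threshold):
    """Try sending missing values left, then right, and keep the better option."""
    missing = np.isnan(feature)
    goes_left = np.zeros(feature.shape, dtype=bool)
    goes_left[~missing] = feature[~missing] <= threshold   # only compare observed values
    scores = {}
    for direction in ("left", "right"):
        left = (goes_left | missing) if direction == "left" else goes_left
        scores[direction] = (leaf_score(gradients[left], hessians[left])
                             + leaf_score(gradients[~left], hessians[~left]))
    return max(scores, key=scores.get), scores

# Toy example: the two missing rows have gradients similar to the "large value" rows
feature = np.array([1.0, 2.0, np.nan, 8.0, 9.0, np.nan])
gradients = np.array([-1.0, -1.0, 1.0, 1.0, 1.0, 1.0])
hessians = np.ones_like(gradients)

direction, scores = default_direction(feature, gradients, hessians, threshold=5.0)
print(direction, scores)
```

In this toy case the missing rows behave like the rows above the threshold, so sending them to the right child yields the higher total score.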

Drawbacks

- Computational Complexity:
XGBoost can be computationally intensive, especially when training large models, making it less suitable for resource-constrained systems.

- Overfitting: XGBoost can be prone to overfitting, especially when trained on small datasets or when too many trees are used in the model.

- Hyperparameter Tuning: XGBoost has many hyperparameters that can be adjusted, making it important to properly tune the parameters to optimize performance. However, finding the optimal set of parameters can be time-consuming and requires expertise.

- Memory Requirements: XGBoost can be memory-intensive, especially when working with large datasets, making it less suitable for systems with limited memory resources.

## III) Python Implementation

### 1) Hyperparameter tuning

#### General Parameters

The general parameters control the overall behaviour of the model.

- booster: selects the type of model to run at each iteration. Two of the available options are:
  - gbtree: tree-based models
  - gblinear: linear models

In this project we use tree-based models because of their performance.

#### Booster Parameters for Tree Booster

There are two sets of booster parameters, one for the linear booster and one for the tree booster. We only consider the tree booster here.

1. eta
Also known as the learning rate. It shrinks the contribution of each new tree; lower values make the model more robust to overfitting but require more boosting rounds.

2. max_depth
The maximum depth of each tree, as already seen for the random forest.

3. gamma
The minimum loss reduction required to split a node: a node is split only when the split reduces the loss function by more than gamma. Larger values make the algorithm more conservative. Suitable values depend on the loss function and should be tuned.

4. subsample
The fraction of training observations randomly sampled to grow each tree. Lower values make the algorithm more conservative and help prevent overfitting, but values that are too low may lead to underfitting.

5. colsample_bytree
The fraction of features randomly sampled to grow each tree. It plays a role similar to max_features for the random forest.
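
To make the role of gamma more concrete, the sketch below evaluates the split gain used in regularized tree boosting: gain = 1/2 [G_L² / (H_L + λ) + G_R² / (H_R + λ) - (G_L + G_R)² / (H_L + H_R + λ)] - γ, where G and H are the sums of first and second derivatives of the loss in each child. The numbers are made up for illustration, and `split_gain` is a hypothetical helper, not an XGBoost function.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction obtained by splitting a node, minus the gamma penalty."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    gain = 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))
    return gain - gamma

# Same candidate split, evaluated for increasing values of gamma
for gamma in [0, 0.01, 0.1, 1, 10]:
    gain = split_gain(-4.0, 10.0, 3.0, 8.0, gamma=gamma)
    print(f"gamma={gamma}: gain={gain:.3f}, split kept: {gain > 0}")
```

The same split is accepted for small values of gamma and rejected once gamma exceeds the raw loss reduction, which is why larger values lead to smaller, more conservative trees.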


#### Learning Task Parameters

1. objective
This defines the loss function to be minimized. We will use "multi:softmax", which performs multiclass classification with the softmax objective and returns the predicted class.

2. eval_metric
The metric used to evaluate the model on validation data. For multiclass problems:
- merror: multiclass classification error rate
- mlogloss: multiclass log loss
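
As a small worked example of these notions, the sketch below computes a softmax, the predicted class, and the two metrics by hand with NumPy. The raw scores and labels are invented for illustration, and the functions `softmax`, `merror` and `mlogloss` are written here only to mirror the definitions above; they are not calls into XGBoost.

```python
import numpy as np

def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1 on each row."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def merror(y_true, proba):
    """Multiclass error rate: share of rows whose most likely class is wrong."""
    return float(np.mean(proba.argmax(axis=1) != y_true))

def mlogloss(y_true, proba, eps=1e-15):
    """Multiclass log loss: mean negative log-probability of the true class."""
    picked = proba[np.arange(len(y_true)), y_true]
    return float(-np.mean(np.log(np.clip(picked, eps, 1.0))))

# Three observations, three classes (made-up raw scores)
scores = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.5, 0.3],
                   [0.1, 0.2, 0.1]])
y_true = np.array([0, 1, 2])

proba = softmax(scores)
predicted_class = proba.argmax(axis=1)   # what a softmax objective returns as the class
print(predicted_class, merror(y_true, proba), mlogloss(y_true, proba))
```

The error rate only looks at the predicted class, whereas the log loss also penalizes predictions that are correct but made with low confidence.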


In [2]:
df = pd.read_csv(r'C:\Users\Thomas Aujoux\Documents\GitHub\food-classification\Data_Preprocessing\data_clean\clean.csv')
df = df.drop('Unnamed: 0', axis=1)
df = df.drop(labels=22426, axis=0)

# Target as a 1-D Series, which is what LabelEncoder expects
y = df["Secteur"]
df_features = df.drop(["Code_produit", "Secteur", "Famille"], axis=1)
le = LabelEncoder()
y = le.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(df_features, y, test_size=0.2, shuffle=True, random_state=42)
print(f'The training dataset has {len(X_train)} records.')
print(f'The testing dataset has {len(X_test)} records.')

df


The training dataset has 52637 records.
The testing dataset has 13160 records.


Unnamed: 0,Code_produit,Secteur,Famille,abricot,abricots,acidifiant,acidifiant acidifiant,acidifiant antioxydant,acidifiant arome,acidifiant arome naturel,...,vitamine pp,vitamine vitamine,vitamine vitamine b,vitamine vitamine vitamine,vitamines,vitamines vitamine,vitamines vitamine vitamine,volaille,yaourt,yaourt brass
0,450.0,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125377,0.0
1,453.0,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.245026,0.0
2,455.0,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.241945,0.0
3,456.0,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.242882,0.0
4,460.0,Produits laitiers et desserts frais,Fromages frais nature non sucres gourmands,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65793,101536.0,Sirops et boissons concentrees a diluer,Sirops,0.0,0.0,0.137469,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
65794,101537.0,Sirops et boissons concentrees a diluer,Sirops,0.0,0.0,0.058943,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
65795,101540.0,Sirops et boissons concentrees a diluer,Sirops,0.0,0.0,0.110652,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
65796,101542.0,Sirops et boissons concentrees a diluer,Sirops,0.0,0.0,0.119906,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0


### 2) Study of the various hyperparameters

Next, we'll look at each of the hyperparameters to understand the consequences of changing them on the model.
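
Since the same fit, predict and score loop is repeated for every hyperparameter in this part, it could be factored into a small helper. The sketch below is one possible way to do so. `sweep_parameter`, `macro_f1` and the `make_model` argument are hypothetical names introduced here, and the metrics are written with NumPy only to keep the sketch self-contained; they are meant to mirror `accuracy_score` and `f1_score(..., average='macro')`.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in np.union1d(y_true, y_pred):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        scores.append(2 * tp / (2 * tp + fp + fn))
    return float(np.mean(scores))

def sweep_parameter(make_model, values, X_train, y_train, X_test, y_test):
    """Fit one model per candidate value and collect (value, accuracy, macro F1)."""
    y_test = np.asarray(y_test)
    results = []
    for value in values:
        model = make_model(value)            # e.g. a classifier built with this value
        model.fit(X_train, y_train)
        y_pred = np.asarray(model.predict(X_test))
        accuracy = float(np.mean(y_pred == y_test))
        results.append((value, accuracy, macro_f1(y_test, y_pred)))
    return results
```

With the objects defined in this notebook, a call could look like `sweep_parameter(lambda d: XGBClassifier(max_depth=d), [3, 6, 10], X_train, y_train, X_test, y_test)`, and the returned list can then be plotted.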

#### Maximum Depth

In [3]:
opt_table_estimators_accuracy=list()
opt_table_estimators_f1=list()
list_para = [3, 6, 10, 15, 20, 50, 100, 200, 300, 500, 900]

for i in list_para:
    print(i)
    model = XGBClassifier(max_depth = i, objective='multi:softmax',num_class=31)
    model.fit(X_train,y_train)
    output=model.predict(X_test)
    opt_table_estimators_accuracy.append(accuracy_score(y_test, output))
    opt_table_estimators_f1.append(f1_score(y_test, output, average='macro'))
    print(accuracy_score(y_test, output),f1_score(y_test, output, average='macro'))
plt.plot(list_para, opt_table_estimators_accuracy, label='Accuracy')
plt.plot(list_para, opt_table_estimators_f1, label='Macro F1')
plt.xlabel('Maximum depth')
plt.ylabel('Score')
plt.title('XGBoost score vs maximum depth')
plt.legend()
plt.show()

3


0.937082066869301 0.9443062295916467
6
0.9532674772036475 0.9557990170179018
10
0.9569908814589666 0.9588462181510232
15
0.9583586626139817 0.9596482247974817
20
0.9567629179331307 0.9576973363042032
50
0.9579787234042553 0.959043603006687
100


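We can analyze the maximum depth as follows: below a depth of 10 the model underfits, as the test scores are still rising (accuracy goes from 0.937 at depth 3 to 0.957 at depth 10). Beyond 10, the scores level off at around 0.957 to 0.958, so deeper trees no longer improve the results and only add training cost and a higher risk of overfitting. We therefore retain 10 as a satisfactory value for this parameter.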

#### Learning Rate

In [4]:
opt_table_estimators_accuracy=list()
opt_table_estimators_f1=list()
list_para = [0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8]

for i in list_para:
    print(i)
    model = XGBClassifier(learning_rate = i, tree_method="gpu_hist", objective='multi:softmax',num_class=31)
    model.fit(X_train,y_train)
    output=model.predict(X_test)
    opt_table_estimators_accuracy.append(accuracy_score(y_test, output))
    opt_table_estimators_f1.append(f1_score(y_test, output, average='macro'))
    print(accuracy_score(y_test, output),f1_score(y_test, output, average='macro'))
plt.plot(list_para, opt_table_estimators_accuracy, label='Accuracy')
plt.plot(list_para, opt_table_estimators_f1, label='Macro F1')
plt.xlabel('Learning rate')
plt.ylabel('Score')
plt.title('XGBoost score vs learning rate')
plt.legend()
plt.show()

0.001


XGBoostError: [12:01:31] C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0fdc6d574b9c0d168-1\xgboost\xgboost-ci-windows\src\tree\updater_gpu_hist.cu:802: Exception in gpu_hist: [12:01:31] C:\buildkite-agent\builds\buildkite-windows-cpu-autoscaling-group-i-0fdc6d574b9c0d168-1\xgboost\xgboost-ci-windows\src\data\../common/device_helpers.cuh:431: Memory allocation error on worker 0: bad allocation: cudaErrorMemoryAllocation: out of memory
- Free memory: 214305178
- Requested memory: 1073741824



The same analysis applies to the learning rate: the optimal value found is 0.6.

#### Gamma

In [5]:
opt_table_estimators_accuracy=list()
opt_table_estimators_f1=list()
list_para = [0, 0.01, 0.1, 1, 10]

for i in list_para:
    model = XGBClassifier(gamma = i, objective='multi:softmax',num_class=31)
    model.fit(X_train,y_train)
    output=model.predict(X_test)
    opt_table_estimators_accuracy.append(accuracy_score(y_test, output))
    opt_table_estimators_f1.append(f1_score(y_test, output, average='macro'))
plt.plot(list_para, opt_table_estimators_accuracy, label='Accuracy')
plt.plot(list_para, opt_table_estimators_f1, label='Macro F1')
plt.xlabel('Gamma')
plt.ylabel('Score')
plt.title('XGBoost score vs gamma')
plt.legend()
plt.show()

KeyboardInterrupt: 

For the gamma parameter, the optimal value is 1.

#### Subsample

In [None]:
opt_table_estimators_accuracy=list()
opt_table_estimators_f1=list()
list_para = np.arange(0.1, 1.0, 0.1)

for i in list_para:
    # num_class set to 31, as in the previous cells on this dataset
    model = XGBClassifier(subsample = i, objective='multi:softmax',num_class=31)
    model.fit(X_train,y_train)
    output=model.predict(X_test)
    opt_table_estimators_accuracy.append(accuracy_score(y_test, output))
    opt_table_estimators_f1.append(f1_score(y_test, output, average='macro'))
plt.plot(list_para, opt_table_estimators_accuracy, label='Accuracy')
plt.plot(list_para, opt_table_estimators_f1, label='Macro F1')
plt.xlabel('Subsample')
plt.ylabel('Score')
plt.title('XGBoost score vs subsample')
plt.legend()
plt.show()

The optimal value for the subsample parameter is 0.8.

#### Colsample by Tree

In [None]:
opt_table_estimators_accuracy=list()
opt_table_estimators_f1=list()
list_para = np.arange(0.1, 1.0, 0.1)

for i in list_para:
    # num_class set to 31, as in the previous cells on this dataset
    model = XGBClassifier(colsample_bytree = i, objective='multi:softmax',num_class=31)
    model.fit(X_train,y_train)
    output=model.predict(X_test)
    opt_table_estimators_accuracy.append(accuracy_score(y_test, output))
    opt_table_estimators_f1.append(f1_score(y_test, output, average='macro'))
plt.plot(list_para, opt_table_estimators_accuracy, label='Accuracy')
plt.plot(list_para, opt_table_estimators_f1, label='Macro F1')
plt.xlabel('Colsample by tree')
plt.ylabel('Score')
plt.title('XGBoost score vs colsample_bytree')
plt.legend()
plt.show()

Apparently, for the colsample_bytree parameter, the model does not manage to overfit, so we cannot yet identify an ideal value. We therefore introduce Grid Search.

### 3) Grid Search automation

In [None]:
# Each list holds a single value, so this grid contains one combination
params = {'max_depth': [10],
            'learning_rate': [0.3],
            'gamma': [0.001],
            'subsample': [1],
            'colsample_bytree': [1],
            'eval_metric': ['mlogloss']
            }

model = XGBClassifier


def model_best_param(X_train, X_test, y_train, y_test, model, params):

    # Cross-validated search over the parameter grid
    estimator = model(num_class=31, objective='multi:softmax')
    grid_search = GridSearchCV(estimator=estimator, param_grid=params, verbose=2, cv=5, n_jobs=-1)
    grid_search.fit(X_train, y_train)

    # Predict with the best estimator, then map encoded labels back to sector names
    output = le.inverse_transform(grid_search.predict(X_test))
    y_test_labels = le.inverse_transform(y_test)

    # Use the classes learned by the LabelEncoder as display labels
    conf_mat = confusion_matrix(y_test_labels, output, labels=le.classes_)
    cm_display = ConfusionMatrixDisplay(confusion_matrix=conf_mat, display_labels=le.classes_)
    cm_display.plot(xticks_rotation='vertical')
    plt.show()

    print(classification_report(y_test_labels, output))

    return grid_search.best_params_

model_best_param(X_train, X_test, y_train, y_test, model, params)
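
A grid search evaluates every combination of the listed values, so its cost grows multiplicatively with the number of candidates per parameter. The sketch below enumerates such a grid by hand with `itertools.product` to show how quickly the number of fits grows. The candidate values and the `expand_grid` helper are hypothetical, chosen only for illustration, and this is not how `GridSearchCV` is implemented internally.

```python
from itertools import product

# Hypothetical candidate values, for illustration only
param_grid = {
    'max_depth': [6, 10, 15],
    'learning_rate': [0.1, 0.3, 0.6],
    'gamma': [0.001, 1],
}

def expand_grid(param_grid):
    """Return one dict per combination of parameter values."""
    names = list(param_grid)
    value_lists = (param_grid[name] for name in names)
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

combinations = expand_grid(param_grid)
cv = 5   # number of cross-validation folds
print(f"{len(combinations)} combinations x {cv} folds = {len(combinations) * cv} fits")
```

With `cv=5`, each combination is fitted five times, plus one final refit of the best combination on the full training set when `refit=True` (the default in `GridSearchCV`).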

### 4) Analysis of the best model

In [None]:
model = XGBClassifier(
    objective='multi:softprob',
    max_depth = 10,
    learning_rate = 0.3,
    gamma = 0.001,
    subsample = 1,
    colsample_bytree = 1,
    eval_metric = 'mlogloss'
)

# Fit on the training split defined earlier
model.fit(X_train, y_train)

In [None]:
fig, ax = plt.subplots(figsize=(9,5))
plot_importance(model, ax=ax)
plt.show()