In [2]:
import pandas as pd
import numpy as np

# Standardize the data
from sklearn.preprocessing import StandardScaler

# Model and performance evaluation
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from xgboost import XGBClassifier
from sklearn.metrics import precision_recall_fscore_support as score

# Hyperparameter tuning
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV, RandomizedSearchCV
from hyperopt import tpe, STATUS_OK, Trials, hp, fmin, space_eval

# XGBoost Multi-class Classification

XGBoost is a gradient boosting ensemble Machine Learning technique based on decision trees.
We've already seen what the terms decision tree, bagging and random forest mean in the Random Forest section, so let's take a closer look at what's new in XGBoost.

## What does boosting mean?

Boosting is an ensemble strategy that adds new models to correct the errors made by the existing ones. A first model is fitted to the training data. A second model is then fitted to correct the errors of the first, and models keep being added sequentially until the predictions stop improving or a maximum number of models is reached. In this way the ensemble combines many 'weak' learners into one strong classifier.

## What does gradient mean?

Gradient boosting is so named because it uses gradient descent to minimize the loss when adding new models: each new model is fitted to the negative gradient of the loss with respect to the current predictions.
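
To make the idea concrete, here is a minimal sketch of gradient boosting for the squared loss, where the negative gradient is simply the residual. It uses one-split "stumps" written in NumPy on made-up one-dimensional data. The helper names (`fit_stump`, `predict_stump`, `gradient_boost`) are illustrative assumptions, and this is not how XGBoost itself is implemented.

```python
import numpy as np

def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in np.unique(x)[:-1]:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, learning_rate=0.3):
    """Boosting for squared loss: each stump is fitted to the current residuals."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    losses = []
    for _ in range(n_rounds):
        # For L = 0.5 * (y - prediction)^2 the negative gradient is the residual.
        residuals = y - prediction
        stump = fit_stump(x, residuals)
        # Take a small step in the direction suggested by the new weak learner.
        prediction = prediction + learning_rate * predict_stump(stump, x)
        losses.append(float(np.mean((y - prediction) ** 2)))
    return prediction, losses

# Synthetic data, only for illustration.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 80)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

prediction, losses = gradient_boost(x, y)
print(f"MSE after 1 round: {losses[0]:.4f}, after 50 rounds: {losses[-1]:.4f}")
```

The training error never increases from one round to the next, which is the "repair the previous models" behaviour described above. The `learning_rate` factor plays the same role as eta in XGBoost.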

## What does eXtreme mean?

XGBoost (eXtreme Gradient Boosting) is an optimized implementation of the gradient boosting algorithm, engineered for speed and scalability.

## Advantages

- Execution speed: XGBoost is generally fast compared to other gradient boosting implementations.

- Performance: XGBoost has a strong track record of producing high-quality results in various machine learning tasks, especially in Kaggle competitions, where it has been a popular choice for winning solutions.

- Regularization: Unlike a standard GBM implementation, XGBoost includes regularization terms in its objective, which helps to reduce overfitting. For this reason XGBoost is also known as a 'regularized boosting' technique.

- High flexibility: XGBoost allows users to define custom optimization objectives and evaluation metrics, so the model can be adapted to a wide range of problems.

- Handling missing values: XGBoost has a built-in routine to handle missing values. The user can indicate which value marks a missing observation by passing it through the `missing` parameter. At each node XGBoost tries sending the missing values down each branch and learns which path to take for them in the future.

- Interpretability: Unlike some machine learning algorithms that can be difficult to interpret, XGBoost provides feature importances, allowing for a better understanding of which variables are most important in making predictions.

## Drawbacks

- Computational complexity: XGBoost can be computationally intensive, especially when training large models, making it less suitable for resource-constrained systems.

- Overfitting: XGBoost can be prone to overfitting, especially when trained on small datasets or when too many trees are used in the model.

- Hyperparameter Tuning: XGBoost has many hyperparameters that can be adjusted, making it important to properly tune the parameters to optimize performance. However, finding the optimal set of parameters can be time-consuming and requires expertise.

- Memory Requirements: XGBoost can be memory-intensive, especially when working with large datasets, making it less suitable for systems with limited memory resources.

## Tune Hyperparameters

### General Parameters

The general parameters control the overall behaviour of the model.

- booster: selects the type of model to run at each iteration. The two options considered here are:
  - gbtree: tree-based models
  - gblinear: linear models

In this project we use tree-based models because of their performance.

### Booster Parameters for Tree Booster

There are 2 types of booster parameters, one for linear and another for tree but we will only consider tree booster here.

1. eta: also known as the learning rate (`learning_rate` in the scikit-learn wrapper). It shrinks the contribution of each new tree, so lower values make the model more robust to overfitting but require more boosting rounds.

2. max_depth: the maximum depth of a tree, as we saw for the random forest.

3. gamma: a node is split only when the split yields a positive reduction in the loss function. Gamma specifies the minimum loss reduction required to make a split, so larger values make the algorithm more conservative. Suitable values depend on the loss function and should be tuned.

4. subsample: the fraction of training observations randomly sampled for each tree. Lower values make the algorithm more conservative and help prevent overfitting, but values that are too low may result in underfitting.

5. colsample_bytree: the fraction of columns randomly sampled for each tree. It plays a role similar to max_features in the random forest.
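
The effect of gamma can be sketched with the split gain formula given in the XGBoost documentation, where $G$ and $H$ are the sums of first and second derivatives of the loss over the samples in a child node and $\lambda$ is the L2 regularization term:

$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$

The function below is a small illustration, not XGBoost's own code, and the gradient sums passed to it are made-up numbers.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of a candidate split; the split is only worth making if this is positive."""
    def score(g, h):
        # Structure score of a leaf with gradient sum g and hessian sum h.
        return g * g / (h + reg_lambda)

    gain = 0.5 * (score(g_left, h_left)
                  + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right))
    # gamma is a fixed penalty paid for every additional split.
    return gain - gamma

# Hypothetical gradient and hessian sums for the two children of one split.
stats = dict(g_left=-4.0, h_left=5.0, g_right=3.0, h_right=4.0)

for gamma in (0.0, 0.1, 3.0):
    gain = split_gain(**stats, gamma=gamma)
    decision = "split" if gain > 0 else "do not split"
    print(f"gamma={gamma}: gain={gain:.4f} -> {decision}")
```

With these numbers the split is kept for small values of gamma and rejected once gamma exceeds the raw gain, which is what "more conservative" means in practice.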


### Learning Task Parameters

1. objective: defines the loss function to be minimized. We will use "multi:softmax", the softmax objective for multiclass classification. It returns the predicted class and is used together with num_class, the number of classes.

2. eval_metric: the metric used to evaluate the validation data.
   - merror: multiclass classification error rate
   - mlogloss: multiclass log loss
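
As a rough illustration of these definitions, the sketch below turns raw per-class scores into probabilities with a softmax and then computes both metrics by hand. The scores and labels are made-up values, and the function names are our own, not part of the XGBoost API.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def merror(y_true, proba):
    """Share of samples whose most probable class is not the true class."""
    return float(np.mean(proba.argmax(axis=1) != y_true))

def mlogloss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    picked = np.clip(proba[np.arange(len(y_true)), y_true], eps, 1.0)
    return float(-np.mean(np.log(picked)))

# Hypothetical raw scores for 4 samples and 3 classes.
scores = np.array([[2.0, 0.5, 0.1],
                   [0.2, 1.5, 0.3],
                   [1.0, 0.9, 0.8],
                   [0.1, 0.2, 3.0]])
y_true = np.array([0, 1, 2, 2])

proba = softmax(scores)
predicted = proba.argmax(axis=1)   # the kind of output "multi:softmax" returns
print("predicted classes:", predicted)
print("merror:", merror(y_true, proba))   # 0.25, one of four samples is wrong
print("mlogloss:", round(mlogloss(y_true, proba), 4))
```

merror only counts mistakes, while mlogloss also penalizes correct predictions made with low confidence, so the two metrics can rank models differently.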


In [3]:
df = pd.read_csv(r'C:\Users\Thomas Aujoux\Documents\GitHub\food-classification\Data_Preprocessing\data_clean\clean.csv')
#df = df.drop('Unnamed: 3', axis=1)
df = df.drop('Unnamed: 0', axis=1)
df = df.drop(labels=22426, axis=0)
df

Unnamed: 0,Code_produit,Secteur,Famille,abricot,abricots,acidifiant,acidifiant acidifiant,acidifiant antioxydant,acidifiant arome,acidifiant arome naturel,...,vitamine pp,vitamine vitamine,vitamine vitamine b,vitamine vitamine vitamine,vitamines,vitamines vitamine,vitamines vitamine vitamine,volaille,yaourt,yaourt brass
0,450.0,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125377,0.0
1,453.0,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.245026,0.0
2,455.0,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.241945,0.0
3,456.0,Produits laitiers et desserts frais,Yaourts et laits fermentes sucres classiques,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.242882,0.0
4,460.0,Produits laitiers et desserts frais,Fromages frais nature non sucres gourmands,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65793,101536.0,Sirops et boissons concentrees a diluer,Sirops,0.0,0.0,0.137469,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
65794,101537.0,Sirops et boissons concentrees a diluer,Sirops,0.0,0.0,0.058943,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
65795,101540.0,Sirops et boissons concentrees a diluer,Sirops,0.0,0.0,0.110652,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
65796,101542.0,Sirops et boissons concentrees a diluer,Sirops,0.0,0.0,0.119906,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0


In [4]:
y = df[["Secteur"]]
df_features = df.drop(["Code_produit", "Secteur", "Famille"], axis=1)
# Import train_test_split
from sklearn.model_selection import train_test_split

# Split into train, test

X_train, X_test, y_train, y_test = train_test_split(df_features, y, test_size=0.2, shuffle=True, random_state=42)

In [5]:
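from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Flatten the single-column DataFrames to 1-D arrays, as LabelEncoder expects
y_train = le.fit_transform(y_train.values.ravel())
# Reuse the mapping learned on the training labels instead of refitting on the test set
y_test = le.transform(y_test.values.ravel())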
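
The encoder should be fitted once, on the training labels, and then reused for the test labels. The sketch below shows why, using a tiny stand-in for `LabelEncoder` that, like the real one, assigns integer codes in sorted label order. The label names are shortened, hypothetical examples.

```python
def fit_encoder(labels):
    """Map each distinct label to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}

# Hypothetical labels: the test set happens to miss one class.
train_labels = ["Sirops", "Yaourts", "Fromages", "Sirops"]
test_labels = ["Yaourts", "Sirops"]

train_encoder = fit_encoder(train_labels)
refit_encoder = fit_encoder(test_labels)   # the mistake: refitting on the test set

correct_codes = [train_encoder[label] for label in test_labels]
refit_codes = [refit_encoder[label] for label in test_labels]

print("mapping learned on train:", train_encoder)
print("test encoded with train mapping:", correct_codes)
print("test encoded after refitting:  ", refit_codes)
```

After refitting, the same label receives a different integer than the one the model was trained on, so every metric computed on the test set would compare mismatched classes.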


In [None]:
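from sklearn.utils import class_weight

# Weight each sample inversely to its class frequency to offset class imbalance
classes_weights = class_weight.compute_sample_weight(
    class_weight='balanced',
    y=y_train  # y_train is already a 1-D array of encoded labels
)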
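
For reference, the 'balanced' heuristic gives each class the weight `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula with NumPy on a made-up label vector. It is an illustration of the formula, not scikit-learn's implementation.

```python
import numpy as np

def balanced_sample_weight(y):
    """One weight per sample: n_samples / (n_classes * count of the sample's class)."""
    classes, counts = np.unique(y, return_counts=True)
    per_class = len(y) / (len(classes) * counts)
    weight_of = dict(zip(classes.tolist(), per_class.tolist()))
    return np.array([weight_of[label] for label in y.tolist()])

# Hypothetical imbalanced labels: class 0 is three times more frequent.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
weights = balanced_sample_weight(y)

for label in np.unique(y):
    mask = y == label
    print(f"class {label}: weight {weights[mask][0]:.4f}, "
          f"total weight {weights[mask].sum():.4f}")
```

Every class ends up with the same total weight, so rare classes contribute as much to the loss as frequent ones.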

In [8]:
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, make_scorer
from sklearn.utils import class_weight

# Balanced sample weights; y_train is a 1-D array of encoded labels
classes_weights = class_weight.compute_sample_weight(
    class_weight='balanced',
    y=y_train
)

# Each parameter has a single value, so the search evaluates one configuration
params = {
    'subsample': [0.6],
    'max_depth': [15],
    'learning_rate': [0.4],
    'gamma': [0.1],
    'eval_metric': ['mlogloss'],
    'colsample_bytree': [0.5],
}
xgbclf = XGBClassifier(objective='multi:softmax', num_class=31)
clf = RandomizedSearchCV(estimator=xgbclf,
                         param_distributions=params,
                         n_iter=1,
                         cv=5,
                         verbose=1)

clf.fit(X_train, y_train, sample_weight=classes_weights)
y_pred = clf.predict(X_test)
# cm = confusion_matrix(y_test, y_pred)
# sns.heatmap(cm, annot=True, square=True, cbar=False, fmt='g')
f1_score(y_test, y_pred, average='macro')



Fitting 5 folds for each of 1 candidates, totalling 5 fits


KeyboardInterrupt: 

In [8]:
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, make_scorer
from sklearn.utils import class_weight

# Balanced sample weights computed from the encoded training labels
classes_weights = class_weight.compute_sample_weight(
    class_weight='balanced',
    y=y_train
)

# Single configuration using XGBoost's default values, except gamma
params = {
    'subsample': [1],
    'max_depth': [6],
    'learning_rate': [0.3],
    'gamma': [0.1],
    'eval_metric': ['mlogloss'],
    'colsample_bytree': [1],
}
xgbclf = XGBClassifier(objective='multi:softmax', num_class=31)
clf = RandomizedSearchCV(estimator=xgbclf,
                         param_distributions=params,
                         n_iter=1,
                         cv=5,
                         verbose=1)

clf.fit(X_train, y_train, sample_weight=classes_weights)
y_pred = clf.predict(X_test)
# cm = confusion_matrix(y_test, y_pred)
# sns.heatmap(cm, annot=True, square=True, cbar=False, fmt='g')
f1_score(y_test, y_pred, average='macro')



Fitting 5 folds for each of 1 candidates, totalling 5 fits


KeyboardInterrupt: 

In [7]:
from sklearn.metrics import accuracy_score

model = XGBClassifier(eval_metric='mlogloss', num_class=31, objective='multi:softmax')
model.fit(X_train, y_train)
# predict() already returns integer class labels, so no rounding is needed
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy: %.2f%%" % (accuracy * 100.0))

In [6]:
y_train

array([24,  8,  3, ...,  3, 23,  5])

In [8]:
from sklearn.metrics import f1_score, make_scorer

# Search space explored by the randomized search
params = {'max_depth': [3, 6, 10, 15, 20, 50, 100],
          'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.6, 0.8],
          'gamma': [0, 0.01, 0.1, 1, 10],
          'subsample': np.arange(0.1, 1.0, 0.1),
          'colsample_bytree': np.arange(0.1, 1.0, 0.1),
          #'colsample_bylevel': np.arange(0.5, 1.0, 0.1),
          #'n_estimators': [100, 250, 500, 750, 1000, 1500],
          'eval_metric': ['merror', 'mlogloss']
          }
# The objective name must not contain a space
xgbclf = XGBClassifier(objective='multi:softprob')
clf = RandomizedSearchCV(estimator=xgbclf,
                             param_distributions=params,
                             scoring="f1_micro",
                             n_iter=25,
                             verbose=1)

clf.fit(X_train, y_train)

best_combination = clf.best_params_

Fitting 5 folds for each of 25 candidates, totalling 125 fits
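
When every entry of the search space is a list, a randomized search draws distinct combinations from the full grid rather than trying them all. The sketch below imitates that behaviour in plain Python on a smaller, hypothetical grid. `sample_candidates` is our own helper and only approximates what `RandomizedSearchCV` does internally.

```python
import math
import random

# Smaller, hypothetical grid used only for this illustration.
param_grid = {
    "max_depth": [3, 6, 10],
    "learning_rate": [0.01, 0.1, 0.3],
    "gamma": [0, 0.1, 1],
    "subsample": [0.5, 0.7, 0.9],
}

def sample_candidates(grid, n_iter, seed=0):
    """Draw up to n_iter distinct parameter combinations from the grid."""
    keys = sorted(grid)
    sizes = [len(grid[key]) for key in keys]
    total = math.prod(sizes)
    rng = random.Random(seed)
    # Sample flat indices without replacement, capped at the grid size.
    chosen = rng.sample(range(total), min(n_iter, total))
    candidates = []
    for flat_index in chosen:
        candidate = {}
        # Decode the flat index into one position per parameter.
        for key, size in zip(keys, sizes):
            flat_index, position = divmod(flat_index, size)
            candidate[key] = grid[key][position]
        candidates.append(candidate)
    return candidates, total

candidates, total = sample_candidates(param_grid, n_iter=25)
print(f"{len(candidates)} candidates drawn from {total} possible combinations")
print("first candidate:", candidates[0])
```

With 5-fold cross-validation, 25 candidates mean 125 model fits, which matches the log line above. It also explains why a search over a grid with a single combination can only ever evaluate one candidate, whatever `n_iter` is set to.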


In [None]:
best_combination

In [None]:
import seaborn as sns
from sklearn.metrics import confusion_matrix, accuracy_score

# Single configuration, so the search evaluates one candidate
params = {
    'subsample': [0.6],
    'max_depth': [15],
    'learning_rate': [0.4],
    'gamma': [0.1],
    'eval_metric': ['mlogloss'],
    'colsample_bytree': [0.5],
}
xgbclf = XGBClassifier()
clf = RandomizedSearchCV(estimator=xgbclf,
                         param_distributions=params,
                         scoring="f1_micro",
                         n_iter=1,
                         verbose=1)

clf.fit(X_train, y_train)
# Predict with the fitted search object (refit on the best parameters)
y_pred = clf.predict(X_test)
# cm = confusion_matrix(y_test, y_pred)
# sns.heatmap(cm, annot=True, square=True, cbar=False, fmt='g')
accuracy_score(y_test, y_pred)