# Simple Machine Learning Algorithms Benchmarks
In this notebook we evaluate several basic machine learning algorithms and choose the one with the best results.

In [None]:
#Loading the packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import ConfusionMatrixDisplay, f1_score


In [None]:
#Loading the models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm


## Loading the Data
To begin, we load the data and split it into a training set (`trainset`) and a development set (`devset`), which we use to train and evaluate the algorithms.

In [None]:
data = pd.read_csv('datasets/incidents_train.csv', index_col=0)
trainset, devset = train_test_split(data, test_size=0.2, random_state=2024)
trainset.head()


Unnamed: 0,year,month,day,country,title,text,hazard-category,product-category,hazard,product
1062,2014,7,30,au,Marvellous Creations Jelly Popping Candy Beani...,Mondelez Australia Pty Ltd has recalled Marvel...,foreign bodies,"cocoa and cocoa preparations, coffee and tea",plastic fragment,chocolate
1969,2016,11,17,us,"Request Foods, Inc. Issues Allergy Alert On Un...","Holland, MI - Request Foods, Inc. is recalling...",allergens,other food product / mixed,eggs and products thereof,pasta products
1053,2014,7,17,uk,"VBites Foods recalls 'Wot, No Dairy?' desserts","VBites Foods is recalling two 'Wot, No Dairy?'...",allergens,ices and desserts,milk and products thereof,desserts
2200,2017,5,1,ca,Toppits brand Battered Blue Cod Fillet recalle...,Food Recall Warning (Allergen) - Toppits brand...,allergens,seafood,milk and products thereof,cod fillets
276,2006,10,6,us,Oct 6_ 2006_ Iowa_ Firm Recalls Ground Beef___,"WASHINGTON, October 6, 2006 - Jims Market and...",biological,"meat, egg and dairy products",escherichia coli,frozen beef patties


### Training models with the title column

In [5]:
#Defining a function to train the model and print the f1 scores
def title_train(clf):
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(label.upper())
        clf.fit(trainset.title, trainset[label])

        # get development scores:
        devset['predictions-' + label] = clf.predict(devset.title)
        print(f'  macro: {f1_score(devset[label], devset["predictions-" + label], zero_division=0, average="macro"):.2f}')
        print(f'  micro: {f1_score(devset[label], devset["predictions-" + label], zero_division=0, average="micro"):.2f}')


## Logistic Regression
The first model we try is logistic regression.

In [None]:
title_clf_lr = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2,5), max_df=0.5, min_df=5)),
     ('clf', LogisticRegression(max_iter=1000)),
    ])


title_train(title_clf_lr)

HAZARD-CATEGORY
  macro: 0.46
  micro: 0.81
PRODUCT-CATEGORY
  macro: 0.39
  micro: 0.66
HAZARD
  macro: 0.14
  micro: 0.54
PRODUCT
  macro: 0.07
  micro: 0.27
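
The gap between the macro and micro scores above is typical for imbalanced labels: macro-F1 averages the per-class F1 scores, so rare classes that are never predicted pull it down, while micro-F1 is dominated by the frequent classes. A minimal sketch with invented labels (not taken from the dataset) illustrates the effect:

```python
from sklearn.metrics import f1_score

# Invented labels for illustration: class "a" is frequent, class "b" is rare.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10  # the classifier never predicts the rare class

# Macro averages the per-class F1 scores (0.889 for "a", 0.0 for "b").
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
# Micro pools all decisions, so the frequent class dominates.
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)

print(f"macro: {macro:.2f}")  # macro: 0.44
print(f"micro: {micro:.2f}")  # micro: 0.80
```
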


## KNN Model
The next model we test is the k-nearest neighbors (KNN) model.

In [None]:
title_clf_knn = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2,5), max_df=0.5, min_df=5)),
     ('clf', KNeighborsClassifier()),
    ])


title_train(title_clf_knn)

HAZARD-CATEGORY
  macro: 0.57
  micro: 0.78
PRODUCT-CATEGORY
  macro: 0.39
  micro: 0.59
HAZARD
  macro: 0.19
  micro: 0.51
PRODUCT
  macro: 0.11
  micro: 0.26


### Comparing results
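We can see that the KNN model has better macro-F1 scores than the Logistic Regression model on three of the four labels (they tie on product-category), while its micro-F1 scores are slightly lower on every label.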

## SVM Model
Next we run a support vector machine (SVM) model with a linear kernel.


In [None]:
title_clf_svm = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2,5), max_df=0.5, min_df=5)),
     ('clf', svm.SVC(kernel='linear')),
    ])


title_train(title_clf_svm)

HAZARD-CATEGORY
  macro: 0.65
  micro: 0.83
PRODUCT-CATEGORY
  macro: 0.51
  micro: 0.71
HAZARD
  macro: 0.24
  micro: 0.59
PRODUCT
  macro: 0.16
  micro: 0.37


### Comparing results
Since the SVM model has better scores than both the Logistic Regression and KNN models on every label, we will tune its parameters to try to achieve even better scores.

### Adding more parameters
Here we set the `C` parameter of the model explicitly, which controls the regularization trade-off (`C=1.0` is also the default value), and the `class_weight` parameter, to take the imbalanced classes of the data set into account.

In [None]:

title_clf_svm = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2,5), max_df=0.5, min_df=5)),
     ('clf', svm.SVC(kernel='linear', C=1.0, class_weight='balanced')),
    ])


title_train(title_clf_svm)

HAZARD-CATEGORY
  macro: 0.65
  micro: 0.83
PRODUCT-CATEGORY
  macro: 0.53
  micro: 0.68
HAZARD
  macro: 0.33
  micro: 0.53
PRODUCT
  macro: 0.17
  micro: 0.26
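
To make the effect of `class_weight='balanced'` more concrete, the sketch below shows how scikit-learn derives the weights: each class gets `n_samples / (n_classes * class_count)`, so rare classes receive larger weights. The label counts here are invented for illustration and are not the real class distribution.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Invented, imbalanced label vector (not the real class distribution).
y = np.array(["allergens"] * 6 + ["biological"] * 3 + ["foreign bodies"] * 1)
classes = np.unique(y)

# Weights as scikit-learn computes them for class_weight='balanced'.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same weights computed by hand: n_samples / (n_classes * class_count).
counts = np.array([(y == c).sum() for c in classes])
manual = len(y) / (len(classes) * counts)

for c, w in zip(classes, weights):
    print(f"{c}: {w:.3f}")
```

The rarest class gets the largest weight, which is consistent with the macro score for `hazard` rising while the micro score drops in the run above.
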


### GridSearching
We will use `GridSearchCV` to try different parameters for both the vectorizer and the SVM model and find the combination with the best macro-F1 score for the hazard-category label under 5-fold cross-validation.

In [41]:
from sklearn.model_selection import GridSearchCV
# Define the pipeline
pipeline = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(2, 5))),
    ('clf', svm.SVC(class_weight='balanced'))
])

# Define the parameter grid
param_grid = {
    'vect__ngram_range': [(2, 5), (3, 6)],  # Experiment with n-grams
    'vect__max_df': [0.5, 0.7],             # Test different max document frequency thresholds
    'vect__min_df': [3, 5],                 # Test different min document frequency thresholds
    'clf__C': [0.1, 1, 10],                 # Regularization parameter for SVM
    'clf__kernel': ['linear', 'rbf'],       # Linear or RBF kernel
}

# Perform Grid Search
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    scoring='f1_macro',  # Use F1-macro to optimize for class imbalance
    cv=5,                # 5-fold cross-validation
    verbose=2,           # Show progress
    n_jobs=-1            # Use all available CPUs
)

# Fit the grid search on one label (e.g., hazard-category)
grid_search.fit(trainset.title, trainset['hazard-category'])

# Best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best F1-macro score:", grid_search.best_score_)

Fitting 5 folds for each of 48 candidates, totalling 240 fits




Best parameters: {'clf__C': 10, 'clf__kernel': 'linear', 'vect__max_df': 0.5, 'vect__min_df': 3, 'vect__ngram_range': (3, 6)}
Best F1-macro score: 0.6448460423434812
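
As a sanity check on the search size reported above, the number of candidates can be counted with `ParameterGrid` before running the (slow) search. This sketch only reuses the grid defined in the cell above; nothing is fitted.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the GridSearchCV cell above.
param_grid = {
    'vect__ngram_range': [(2, 5), (3, 6)],
    'vect__max_df': [0.5, 0.7],
    'vect__min_df': [3, 5],
    'clf__C': [0.1, 1, 10],
    'clf__kernel': ['linear', 'rbf'],
}

n_candidates = len(ParameterGrid(param_grid))  # 2 * 2 * 2 * 3 * 2
n_fits = n_candidates * 5                      # 5-fold cross-validation

print(n_candidates, n_fits)  # 48 240
```
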


### Running the model with the optimal parameters

In [None]:
title_clf_svm = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode', analyzer='char', ngram_range=(3,6), max_df=0.5, min_df=3)),
     ('clf', svm.SVC(kernel='linear', C=10, class_weight='balanced')),
    ])


title_train(title_clf_svm)

HAZARD-CATEGORY
  macro: 0.66
  micro: 0.83
PRODUCT-CATEGORY
  macro: 0.54
  micro: 0.71
HAZARD
  macro: 0.33
  micro: 0.63
PRODUCT
  macro: 0.21
  micro: 0.40


### Computing the scores for the subtasks from the devset with the optimized SVM model

In [14]:
def compute_score(hazards_true, products_true, hazards_pred, products_pred):
  # compute f1 for hazards:
  f1_hazards = f1_score(
    hazards_true,
    hazards_pred,
    average='macro'
  )

  # compute f1 for products:
  f1_products = f1_score(
    products_true[hazards_pred == hazards_true],
    products_pred[hazards_pred == hazards_true],
    average='macro'
  )

  return (f1_hazards + f1_products) / 2.
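
Note that `compute_score` evaluates the product predictions only on the rows where the hazard prediction is correct. The sketch below repeats the function so that it runs on its own and applies it to four invented rows; the label values are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import f1_score

def compute_score(hazards_true, products_true, hazards_pred, products_pred):
    # Same logic as the function defined above.
    f1_hazards = f1_score(hazards_true, hazards_pred, average='macro')
    correct = hazards_pred == hazards_true
    f1_products = f1_score(products_true[correct], products_pred[correct], average='macro')
    return (f1_hazards + f1_products) / 2.

# Invented example: the hazard is wrong only in the last row.
hazards_true = pd.Series(["allergens", "allergens", "biological", "biological"])
hazards_pred = pd.Series(["allergens", "allergens", "biological", "allergens"])
products_true = pd.Series(["seafood", "dairy", "meat", "meat"])
products_pred = pd.Series(["seafood", "dairy", "meat", "seafood"])

score = compute_score(hazards_true, products_true, hazards_pred, products_pred)
print(f"{score:.3f}")  # 0.867
```

The wrong product in the last row does not lower the product F1, because that row is already excluded by its wrong hazard prediction; it is penalized only through the hazard F1.
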

In [None]:
print(f"Score Sub-Task 1: {compute_score(devset['hazard-category'], devset['product-category'], devset['predictions-hazard-category'], devset['predictions-product-category']):.3f}")
print(f"Score Sub-Task 2: {compute_score(devset['hazard'], devset['product'], devset['predictions-hazard'], devset['predictions-product']):.3f}")

Score Sub-Task 1: 0.449
Score Sub-Task 2: 0.121


### Training models with the text column
Now we will follow the same path but with the text column instead of title to see if we can achieve better results.
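We have changed the vectorizer parameters to suit the longer text column: it now uses word-level n-grams (`analyzer='word'`, `ngram_range=(1,2)`) instead of the character n-grams used for the titles.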
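
To illustrate the difference between the character n-grams used for the titles and the word n-grams used for the text, the sketch below prints the features each analyzer extracts from a short invented string. `build_analyzer` is used only for inspection; the pipelines in this notebook do not call it directly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Character n-grams, as in the title pipelines (ngram_range=(2, 5)).
char_analyzer = TfidfVectorizer(
    strip_accents='unicode', analyzer='char', ngram_range=(2, 5)
).build_analyzer()

# Word unigrams and bigrams, as in the text pipelines.
word_analyzer = TfidfVectorizer(
    strip_accents='unicode', analyzer='word', ngram_range=(1, 2)
).build_analyzer()

char_features = char_analyzer("Egg")
word_features = word_analyzer("Egg recall notice")

print(char_features)  # ['eg', 'gg', 'egg']
print(word_features)  # ['egg', 'recall', 'notice', 'egg recall', 'recall notice']
```
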


In [8]:
#Defining a function to train the model with text and print the f1 scores
def text_train(clf):
    for label in ('hazard-category', 'product-category', 'hazard', 'product'):
        print(label.upper())
        clf.fit(trainset.text, trainset[label])

        # get development scores:
        devset['predictions-' + label] = clf.predict(devset.text)
        print(f'  macro: {f1_score(devset[label], devset["predictions-" + label], zero_division=0, average="macro"):.2f}')
        print(f'  micro: {f1_score(devset[label], devset["predictions-" + label], zero_division=0, average="micro"):.2f}')
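
Since the title and text training functions differ only in the column they read, they could be merged into one helper that takes the column name as an argument. The following is a sketch of that idea under stated assumptions: the function name `train_and_score`, its signature, and the toy data are all hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

def train_and_score(clf, train, dev, column, labels):
    """Fit clf on train[column] for each label; return {label: (macro, micro)}."""
    scores = {}
    for label in labels:
        clf.fit(train[column], train[label])
        pred = clf.predict(dev[column])
        scores[label] = (
            f1_score(dev[label], pred, average="macro", zero_division=0),
            f1_score(dev[label], pred, average="micro", zero_division=0),
        )
    return scores

# Toy data, invented for illustration only.
toy_train = pd.DataFrame({
    "title": ["milk allergen recall", "undeclared milk in cookies",
              "plastic pieces in bread", "metal fragment in cereal"],
    "hazard-category": ["allergens", "allergens", "foreign bodies", "foreign bodies"],
})
toy_dev = pd.DataFrame({
    "title": ["undeclared milk recall", "plastic fragment in bread"],
    "hazard-category": ["allergens", "foreign bodies"],
})

clf = Pipeline([("vect", TfidfVectorizer()), ("clf", LogisticRegression())])
result = train_and_score(clf, toy_train, toy_dev, "title", ["hazard-category"])
print(result)
```

Returning the scores instead of printing them also makes it easier to collect the results of several models into one table for comparison.
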


## Logistic Regression

In [None]:
text_clf_lr = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,2), max_df=0.5, min_df=5)),
     ('clf', LogisticRegression(max_iter=1000)),
    ])


text_train(text_clf_lr)

HAZARD-CATEGORY
  macro: 0.54
  micro: 0.85
PRODUCT-CATEGORY
  macro: 0.33
  micro: 0.59
HAZARD
  macro: 0.17
  micro: 0.64
PRODUCT
  macro: 0.04
  micro: 0.20


## KNN

In [9]:
text_clf_knn = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,2), max_df=0.5, min_df=5)),
     ('clf', KNeighborsClassifier()),
    ])


text_train(text_clf_knn)

HAZARD-CATEGORY
  macro: 0.53
  micro: 0.79
PRODUCT-CATEGORY
  macro: 0.33
  micro: 0.48
HAZARD
  macro: 0.23
  micro: 0.53
PRODUCT
  macro: 0.07
  micro: 0.20


## SVM

In [12]:
text_clf_svm = Pipeline([
    ('vect', TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1,2), max_df=0.5, min_df=5)),
     ('clf', svm.SVC(kernel='linear', C=10, class_weight='balanced')),
    ])


text_train(text_clf_svm)

HAZARD-CATEGORY
  macro: 0.74
  micro: 0.90
PRODUCT-CATEGORY
  macro: 0.55
  micro: 0.68
HAZARD
  macro: 0.37
  micro: 0.73
PRODUCT
  macro: 0.15
  micro: 0.31


### Evaluating the scores of each subtask with the SVM text-trained classifier

In [15]:
print(f"Score Sub-Task 1: {compute_score(devset['hazard-category'], devset['product-category'], devset['predictions-hazard-category'], devset['predictions-product-category']):.3f}")
print(f"Score Sub-Task 2: {compute_score(devset['hazard'], devset['product'], devset['predictions-hazard'], devset['predictions-product']):.3f}")

Score Sub-Task 1: 0.648
Score Sub-Task 2: 0.278


### Comparing the title-trained and the text-trained models

With the title-trained model we achieved the scores ST-1: 0.449 and ST-2: 0.121,\
while with the text-trained model we achieved the scores ST-1: 0.648 and ST-2: 0.278\
for the specific train and dev sets we chose.

## 5-Fold Cross Validation
We now run a 5-fold cross-validation to determine which of the two models achieves better macro-F1 scores for the hazard-category classification.

In [24]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
score_title = cross_val_score(title_clf_svm, data.title, data['hazard-category'], cv=5, scoring='f1_macro')
score_text = cross_val_score(text_clf_svm, data.text, data['hazard-category'], cv=5, scoring='f1_macro')
print(f"Title SVM average F1-score: {np.mean(score_title):.3f} ± {np.std(score_title):.3f}")
print(f"Text SVM average F1-score: {np.mean(score_text):.3f} ± {np.std(score_text):.3f}")



Title SVM average F1-score: 0.594 ± 0.144
Text SVM average F1-score: 0.579 ± 0.060
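
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling, so the folds follow the row order of the data. The sketch below checks this on synthetic data (generated with `make_classification`, unrelated to the incidents dataset) and shows a shuffled splitter as one possible alternative; whether shuffling would change the results above is an assumption that has not been tested here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class data, used only to keep the sketch fast and self-contained.
X, y = make_classification(n_samples=200, n_classes=3, n_informative=5, random_state=0)
clf = LogisticRegression(max_iter=1000)

# cv=5 and an explicit unshuffled StratifiedKFold produce the same folds.
scores_int = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
scores_explicit = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=5), scoring="f1_macro")

# Alternative: shuffle the rows before splitting, with a fixed seed.
shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2024)
scores_shuffled = cross_val_score(clf, X, y, cv=shuffled_cv, scoring="f1_macro")

print(f"unshuffled: {np.mean(scores_int):.3f} ± {np.std(scores_int):.3f}")
print(f"shuffled:   {np.mean(scores_shuffled):.3f} ± {np.std(scores_shuffled):.3f}")
```
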


### Choosing the final model
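After the 5-fold cross-validation we can see that the mean scores of the two models are close (0.594 for the title-trained model and 0.579 for the text-trained model), and the difference is well within one standard deviation.\
The text-trained model has the smaller standard deviation (0.060 versus 0.144), so we choose the text-trained SVM model.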