# Assignment 3
* This assignment presents a possible approach for the [Food Hazard Detection 2025 competition](https://github.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io).  
* It explores data preprocessing, data augmentation, feature extraction, classification models, and hyperparameter tuning.

> Giorgos Papoutsakis 8200137

## Import necessary libraries

* The first essential step needed for the model but also the whole notebook to run is importing all the necessary libraries.

In [1]:
import pandas as pd
import re
import os
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

* For text preprocessing, I will use the `nltk` library.  

In [2]:
import nltk

* NLTK requires downloading some additional packages.
* These packages will be stored in the `nltk_data` folder.

In [3]:
os.environ["NLTK_DATA"] = os.path.join(os.getcwd(), "data/nltk_data")
nltk.data.path.append(os.environ["NLTK_DATA"])

nltk.download('punkt', download_dir=os.environ["NLTK_DATA"])
nltk.download('punkt_tab', download_dir=os.environ["NLTK_DATA"])
nltk.download('wordnet', download_dir=os.environ["NLTK_DATA"])
nltk.download('stopwords', download_dir=os.environ["NLTK_DATA"])

[nltk_data] Downloading package punkt to C:\Users\User\Desktop\Food-
[nltk_data]     hazard-text-classification\data/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\User\Desktop\Food-hazard-text-
[nltk_data]     classification\data/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\User\Desktop\Food-
[nltk_data]     hazard-text-classification\data/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\Desktop\Food-hazard-text-
[nltk_data]     classification\data/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.corpus import stopwords

* In SageMaker, I had to manually install Optuna using the following command, so I kept it in the final notebook.  

In [5]:
!pip install optuna




[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [6]:
import optuna

## Datasets and Text cleaning

* The first actual step of the assignment was to download the datasets from the [GitHub repository](https://github.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/tree/main/code) and read them as DataFrames.

In [7]:
train_set = pd.read_csv('data/incidents_train.csv', index_col=0)
validation_set = pd.read_csv('data/incidents_valid.csv', index_col=0)
test_set = pd.read_csv('data/incidents_test.csv', index_col=0)

* Lets take a look in the train set.

In [8]:
train_set.head()

Unnamed: 0,year,month,day,country,title,text,hazard-category,product-category,hazard,product
0,1994,1,7,us,Recall Notification: FSIS-024-94,Case Number: 024-94 \n Date Opene...,biological,"meat, egg and dairy products",listeria monocytogenes,smoked sausage
1,1994,3,10,us,Recall Notification: FSIS-033-94,Case Number: 033-94 \n Date Opene...,biological,"meat, egg and dairy products",listeria spp,sausage
2,1994,3,28,us,Recall Notification: FSIS-014-94,Case Number: 014-94 \n Date Opene...,biological,"meat, egg and dairy products",listeria monocytogenes,ham slices
3,1994,4,3,us,Recall Notification: FSIS-009-94,Case Number: 009-94 \n Date Opene...,foreign bodies,"meat, egg and dairy products",plastic fragment,thermal processed pork meat
4,1994,7,1,us,Recall Notification: FSIS-001-94,Case Number: 001-94 \n Date Opene...,foreign bodies,"meat, egg and dairy products",plastic fragment,chicken breast


* Also, let's take a closer look at the title and text of the first row.  

In [9]:
train_set['title'].iloc[0]

'Recall Notification: FSIS-024-94'

In [10]:
train_set['text'].iloc[0]

"Case Number: 024-94   \n            Date Opened: 07/01/1994   \n            Date Closed: 09/22/1994 \n    \n            Recall Class:  1   \n            Press Release (Y/N):  Y  \n    \n            Domestic Est. Number:  05893  P   \n              Name:  GERHARD'S NAPA VALLEY SAUSAGE\n    \n            Imported Product (Y/N):  N       \n            Foreign Estab. Number:  N/A\n    \n            City:  NAPA    \n            State:  CA   \n            Country:  USA\n    \n            Product:  SMOKED CHICKEN SAUSAGE\n    \n            Problem:  BACTERIA   \n            Description: LISTERIA\n    \n            Total Pounds Recalled:  2,894   \n            Pounds Recovered:  2,894"

* It is clear that both titles and texts need to be cleaned of all non-useful words and characters.  
* I will create some functions to handle the cleaning.  

* The first function will:
    * Convert all characters to lowercase.
    * Remove all characters except letters.
    * Remove all words with two or fewer characters.
    * Remove all extra spaces.

In [11]:
def text_cleaning(text):
    text = text.lower()
    text = re.sub(r'http\S+|www.\S+', ' ', text) #Remove URLs
    text = re.sub(r'[^\w\s]|_', ' ', text) #Remove all characters including _ exept letters and numbers
    text = re.sub(r'\d+', ' ', text) #Remove digits
    text = re.sub(r'\b\w{1,2}\b\s*', '', text) #Remove all words with 2 or less characters
    text = re.sub(r'\s+', ' ', text) #Remove all white spaces(\n, tabs, spaces, etc)
    text = text.strip()

    return text

* The second function will lemmatize words.

In [12]:
lemmatizer = WordNetLemmatizer()

def text_lemmatization(text):
    tokens = word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(token, pos=wordnet.VERB) for token in tokens] #works better if it considers it as verb
    clean_text = " ".join(lemmatized_tokens)
    
    return clean_text 

* The third function will remove stopwords.

In [13]:
STOP_WORDS = set(stopwords.words('english'))

def stopwords_removal(text):
    tokens = word_tokenize(text)
    
    filtered_tokens = [word for word in tokens if word not in STOP_WORDS]
    clean_text = " ".join(filtered_tokens)
    return clean_text

* I will combine all functions into a single one.

In [14]:
def custom_standarization(text):
    text = text_cleaning(text)
    text = text_lemmatization(text)
    text = stopwords_removal(text)
    return text

* Finally, I will apply the function to all datasets' titles and texts.

In [15]:
train_set['title'] = train_set['title'].apply(custom_standarization)
train_set['text'] = train_set['text'].apply(custom_standarization)

In [16]:
validation_set['title'] = validation_set['title'].apply(custom_standarization)
validation_set['text'] = validation_set['text'].apply(custom_standarization)

In [17]:
test_set['title'] = test_set['title'].apply(custom_standarization)
test_set['text'] = test_set['text'].apply(custom_standarization)

# Deal with Class Imbalancing

* As it stated in the problem instructions, the classes for each label are highly imbalanced.
* Let's take a look.

In [18]:
print("HAZARD CATEGORIES")
print(train_set['hazard-category'].value_counts())
print("\nPRODUCT CATEGORIES")
print(train_set['product-category'].value_counts())

HAZARD CATEGORIES
allergens                         1854
biological                        1741
foreign bodies                     561
fraud                              371
chemical                           287
other hazard                       134
packaging defect                    54
organoleptic aspects                53
food additives and flavourings      24
migration                            3
Name: hazard-category, dtype: int64

PRODUCT CATEGORIES
meat, egg and dairy products                         1434
cereals and bakery products                           671
fruits and vegetables                                 535
prepared dishes and snacks                            469
seafood                                               268
soups, broths, sauces and condiments                  264
nuts, nut products and seeds                          262
ices and desserts                                     222
cocoa and cocoa preparations, coffee and tea          210
confectionery 

In [19]:
print("HAZARD")
print(train_set['hazard'].value_counts())
print("\nPRODUCT")
print(train_set['product'].value_counts())

HAZARD
listeria monocytogenes                        665
salmonella                                    621
milk and products thereof                     588
escherichia coli                              237
peanuts and products thereof                  211
                                             ... 
dioxins                                         3
staphylococcal enterotoxin                      3
dairy products                                  3
sulfamethazine unauthorised                     3
paralytic shellfish poisoning (psp) toxins      3
Name: hazard, Length: 128, dtype: int64

PRODUCT
ice cream                                  185
chicken based products                     138
cakes                                       93
ready to eat - cook meals                   79
cookies                                     78
                                          ... 
breakfast cereals and products therefor      1
dried lilies                                 1
chilled pork ribs 

* This is a serious problem especially since the evaluation metric is the F1 score.
* If class imbalance is not addressed, the evalution score will be low.
* To handle this, I will use data augmentation with synonyms.
* The data augmentation idea will be:
    * Repeatedly select rows from the least represented categories until each category reaches a certain threshold.
    * Replace a number of words in both the title and text with synonyms.
    * Add the new row in the training dataset.
* As a threshold, I will use the category's median and replace 1 word from the title and 5 words from the text.

* There is a special case for the product label.  
* Its median is 2, which is really low.  
* Later, when using cross-validation, I will get a warning that the class members are fewer than the k-folds.  
* For these reasons, I will increase the threshold slightly. It will be 3 instead of 2.

In [20]:
print(train_set['product'].value_counts().median())

2.0


In [21]:
def get_synonym(word):
    synonyms = []
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonyms.append(lemma.name())

    return random.choice(synonyms) if synonyms else word

In [22]:
def data_augmentation_for_label(data, label):

    title_changes = 1
    text_changes = 5
        
    #Category for augmentation and its median
    category_counts = data[label].value_counts()
    threshold = category_counts.median()

    # Case specific for label 'product'
    if threshold < 3:
        threshold = 3
        
    rare_categories = category_counts[category_counts <= threshold].index.tolist()
    augmented_rows = []
    
    for category in rare_categories:
        #Get threshold
        current_count = category_counts[category]
        needed_count = int(threshold - current_count)
        category_data = data[data[label] == category]

        random_seed = 0
        while needed_count > 0:

            random.seed(random_seed)
            row = category_data.sample(n=1, random_state = random_seed).squeeze()
            augmentation_details = []

            #Title
            title = row['title']
            words_title = title.split()

            words_to_change_title = title_changes
            if len(words_title) <= words_to_change_title:
                words_to_change_title = len(words_title)
                
            random_word_title = random.choice(words_title)
            synonym_word_title = get_synonym(random_word_title)
            augmentation_details.append((random_word_title, synonym_word_title))
            
            augmented_title = " ".join([synonym_word_title if word == random_word_title else word for word in words_title])


            #Text
            text = row['text']
            words_text = text.split()

            words_to_change_text = text_changes
            if len(words_text) <= words_to_change_text:
                words_to_change_text = len(words_text)
                
            random_words_text = random.sample(words_text, words_to_change_text)
            augmentation_details.extend([(word, get_synonym(word)) for word in random_words_text])

            augmented_text = " ".join([get_synonym(word) if word in random_words_text else word for word in words_text])

            #Create the new columns
            new_row = row.copy()
            new_row['title'] = augmented_title
            new_row['text'] = augmented_text
            new_row["augmentation_details"] = augmentation_details
            
            augmented_rows.append(new_row)
            needed_count -= 1
            random_seed += 1

    #Add the new columns
    augmented_data = pd.DataFrame(augmented_rows)
    data = pd.concat([data, augmented_data], ignore_index=True)

    return data

* Apply the data augmentation function to all labels.
* I will create a new DataFrame.

In [23]:
train_set_augmented = data_augmentation_for_label(
    data=train_set,
    label='hazard-category'
)
train_set_augmented = data_augmentation_for_label(
    data=train_set_augmented,
    label='product-category'
)
train_set_augmented = data_augmentation_for_label(
    data=train_set_augmented,
    label='hazard'
)
train_set_augmented = data_augmentation_for_label(
    data=train_set_augmented,
    label='product'
)

* Lets see the final results.

In [24]:
print("HAZARD CATEGORIES")
print(train_set_augmented['hazard-category'].value_counts())
print("\nPRODUCT CATEGORIES")
print(train_set_augmented['product-category'].value_counts())

HAZARD CATEGORIES
allergens                         2516
biological                        2410
chemical                           946
foreign bodies                     915
fraud                              733
other hazard                       412
organoleptic aspects               329
packaging defect                   279
food additives and flavourings     221
migration                          210
Name: hazard-category, dtype: int64

PRODUCT CATEGORIES
meat, egg and dairy products                         2021
cereals and bakery products                           815
fruits and vegetables                                 782
prepared dishes and snacks                            599
seafood                                               397
soups, broths, sauces and condiments                  372
nuts, nut products and seeds                          351
confectionery                                         340
dietetic foods, food supplements, fortified foods     278
non-alcoholic 

In [25]:
print("HAZARD")
print(train_set_augmented['hazard'].value_counts())
print("\nPRODUCT")
print(train_set_augmented['product'].value_counts())

HAZARD
salmonella                     902
milk and products thereof      824
listeria monocytogenes         818
other                          600
plastic fragment               325
                              ... 
unauthorised import             17
tampering                       17
improper packaging              17
celery and products thereof     17
poor hygienic state             17
Name: hazard, Length: 128, dtype: int64

PRODUCT
plastics             225
pet feed             216
ice cream            193
feed materials       178
honey                173
                    ... 
semi-hard cheeses      3
whole chicken          3
white cheese           3
meat broth             3
mung bean sprouts      3
Name: product, Length: 1022, dtype: int64


* We can see that there is still some class imbalance, but it is much better than before.  
* Lastly, I will clean the new texts and titles once again.

In [26]:
train_set_augmented['title'] = train_set_augmented['title'].apply(custom_standarization)
train_set_augmented['text'] = train_set_augmented['text'].apply(custom_standarization)

## Classification using k-fold validation

* The approach I will follow for this multi-class text classification problem will be in two phases:
    * The first phase is vectorizing the input title and text.
    * The second phase is classifying the vectors.
* Finally, I will evaluate the classification using the given function below.

In [27]:
def compute_score(hazards_true, products_true, hazards_pred, products_pred):
  # compute f1 for hazards:
  f1_hazards = f1_score(
    hazards_true,
    hazards_pred,
    average='macro'
  )

  # compute f1 for products:
  f1_products = f1_score(
    products_true[hazards_pred == hazards_true],
    products_pred[hazards_pred == hazards_true],
    average='macro'
  )

  return (f1_hazards + f1_products) / 2.

* I will save the important columns from the training set into variables to use easily throughout the notebook.

In [28]:
train_set_input = train_set_augmented[['title','text']]

train_labels_hazard_category = train_set_augmented['hazard-category']
train_labels_product_category = train_set_augmented['product-category']
train_labels_hazard = train_set_augmented['hazard']
train_labels_product = train_set_augmented['product']

* I will create two functions, one for each subtask, to try different vectorizers and classifiers.  
* The function will use k-fold validation on the training set.  
* The number of folds will be:
  * 5 for the labels `hazard-category`, `product-category`, and `hazard`.  
  * 3 for the label `product`.  
* The reason for this difference is that the least frequent product categories have a minimum of 3 members due to the way data augmentation was implemented.

* The function summarizes the model's basic logic.  
* For each label, I will use a `Pipeline` that includes a `ColumnTransformer` and a traditional vectorizer.  
* The `ColumnTransformer` applies two vectorizers: one for the title and one for the text, then combines the outputs.  
* The classifier then takes this combined output as input and classifies it based on the training labels.  
* I will use `cross_val_predict` to make predictions during cross validation.  
* I will take the predictions as pairs and evaluate them using the given evaluation function.

In [29]:
def test_vectorizers_classifiers_subtask1(vectorizerTitle, vectorizerText, classifierHazard_category, classifierProduct_category):
    vectorizer_title = vectorizerTitle
    vectorizer_text = vectorizerText
    classifier_hazard_category = classifierHazard_category
    classifier_product_category = classifierProduct_category
    
    pipeline_hazard_category = Pipeline([
        ("vectorizer_hazard_category", ColumnTransformer([
            ("title_vectorizer", vectorizer_title, "title"),
            ("text_vectorizer", vectorizer_text, "text")
        ])),
        ("classifier_hazard_category", classifier_hazard_category)
    ])
    
    pipeline_product_category = Pipeline([
        ("vectorizer_product_category", ColumnTransformer([
            ("title_vectorizer", vectorizer_title, "title"),
            ("text_vectorizer", vectorizer_text, "text")
        ])),
        ("classifier_product_category", classifier_product_category)
    ])
    
    # cross-validation
    pred_hazard_category = cross_val_predict(pipeline_hazard_category, train_set_input, train_labels_hazard_category, cv=5, n_jobs=-1)
    pred_product_category = cross_val_predict(pipeline_product_category, train_set_input, train_labels_product_category, cv=5, n_jobs=-1)
    
    return compute_score(train_labels_hazard_category, train_labels_product_category, pred_hazard_category, pred_product_category)

In [30]:
def test_vectorizers_classifiers_subtask2(vectorizerTitle, vectorizerText, classifierHazard, classifierProduct):
    vectorizer_title = vectorizerTitle
    vectorizer_text = vectorizerText
    classifier_hazard = classifierHazard
    classifier_product = classifierProduct
    
    pipeline_hazard = Pipeline([
        ("vectorizer_hazard", ColumnTransformer([
            ("title_vectorizer", vectorizer_title, "title"),
            ("text_vectorizer", vectorizer_text, "text")
        ])),
        ("classifier_hazard", classifier_hazard)
    ])
    
    pipeline_product = Pipeline([
        ("vectorizer_product", ColumnTransformer([
            ("title_vectorizer", vectorizer_title, "title"),
            ("text_vectorizer", vectorizer_text, "text")
        ])),
        ("classifier_product", classifier_product)
    ])
    
    # cross-validation
    pred_hazard = cross_val_predict(pipeline_hazard, train_set_input, train_labels_hazard, cv=5, n_jobs=-1)
    pred_product = cross_val_predict(pipeline_product, train_set_input, train_labels_product, cv=3, n_jobs=-1)
    
    return compute_score(train_labels_hazard, train_labels_product, pred_hazard, pred_product)

* I will select the vectorizers and classifiers combination that achieves the best result for each task.
* I tried a variatey of vectorizers and classifiers. Some of thems are:
* Vectorizers:
  * `CountVectorizer`
  * `TfidfVectorizer`
* Classifiers:
  * `LogisticRegression`
  * `RandomForestClassifier`
  * `LinearSVC`
  * `SGDClassifier`
  * `DesicionTreeClassifier`

* I also tried vectorizing the text and title using pre-trained embeddings from the `gensim` library.  
* The idea was to use the embedding of each word and compute the average for each title or text before feeding it into the classifier.  
* Unfortunately, the results were very disappointing, so I abandoned the embeddings approach.

* The best result was achieved using `TfidfVectorizer` and `LinearSVC` for both subtasks. 

* For `LinearSVC`, I didn't modify any hyperparameters and used the default settings.  
* For `TfidfVectorizer`, I selected some hyperparameters based on experimentation and logical reasoning:

  * **`ngram_range=(1,2)`**: Uses both unigrams and bigrams.  
  * **`max_df=0.7`**: Ignores terms that appear in more than 70% of the documents.  
  * **`min_df=2`**: Ignores terms that appear in fewer than 2 documents.  
  * **`max_features`**: Limits the vocabulary size.  

In [31]:
title_vectorizer = TfidfVectorizer(ngram_range=(1,2), max_df=0.7, min_df=2, max_features = 10000)
text_vectorizer = TfidfVectorizer(ngram_range=(1,2), max_df=0.7, min_df=2, max_features = 30000)

classifier_SVC = LinearSVC(random_state=1234)

* SubTask1

In [32]:
test_vectorizers_classifiers_subtask1(title_vectorizer, text_vectorizer, classifier_SVC, classifier_SVC)

0.9337122264785322

* SubTask2

In [33]:
test_vectorizers_classifiers_subtask2(title_vectorizer, text_vectorizer, classifier_SVC, classifier_SVC)

0.8213583741619951

## TfIdf & Linear SVC on Validation Set

* Now that I have selected the vectorizer and classifiers, I will train them on the entire training dataset instead of a single fold each time.  
* Then, I will evaluate them by making predictions on the validation set.
* I will save the important columns from the validation set into variables to use easily throughout the notebook.

In [34]:
validation_set_input = validation_set[['title','text']]

validation_labels_hazard_category = validation_set['hazard-category']
validation_labels_product_category = validation_set['product-category']
validation_labels_hazard = validation_set['hazard']
validation_labels_product = validation_set['product']

* Create vectorizer and classifiers.

In [35]:
vectorizer = ColumnTransformer([
        ("title_tfidf", TfidfVectorizer(ngram_range=(1,2), max_df=0.7, min_df=2, max_features = 10000), "title"),
        ("text_tfidf", TfidfVectorizer(ngram_range=(1,2), max_df=0.7, min_df=2, max_features = 30000), "text")
    ])

classifier_hazard_category = LinearSVC(random_state=1234)
classifier_product_category = LinearSVC(random_state=1234)
classifier_hazard = LinearSVC(random_state=1234)
classifier_product = LinearSVC(random_state=1234)

* Fit them.

In [36]:
vectorized_train_set_input = vectorizer.fit_transform(train_set_input)

classifier_hazard_category.fit(vectorized_train_set_input, train_labels_hazard_category)
classifier_product_category.fit(vectorized_train_set_input, train_labels_product_category)
classifier_hazard.fit(vectorized_train_set_input, train_labels_hazard)
classifier_product.fit(vectorized_train_set_input, train_labels_product)

* Predictions on Validation Set.

In [37]:
vectorized_valid_set_input = vectorizer.transform(validation_set_input)

validation_predictions_hazard_category = classifier_hazard_category.predict(vectorized_valid_set_input)
validation_predictions_product_category = classifier_product_category.predict(vectorized_valid_set_input)
validation_predictions_hazard = classifier_hazard.predict(vectorized_valid_set_input)
validation_predictions_product = classifier_product.predict(vectorized_valid_set_input)

* Evaluation for each subtask.

In [38]:
print(f"SCORE Sub-Task 1: {compute_score(validation_labels_hazard_category, validation_labels_product_category, validation_predictions_hazard_category, validation_predictions_product_category):.3f}")
print(f"SCORE Sub-Task 2: {compute_score(validation_labels_hazard, validation_labels_product, validation_predictions_hazard, validation_predictions_product):.3f}")

SCORE Sub-Task 1: 0.771
SCORE Sub-Task 2: 0.454


* The scores are lower compared to cross-validation on the training dataset.  
* That means there is some overfitting.  
* However, this was expected to some degree due to the way data augmentation was performed.  
* Many rows were duplicated with only a few words changed.  
* Nevertheless, the scores aren't too bad compared to the leaderboard. Let's see if we can improve them before the final evaluation on the test set.  

## Hyper parameters search

* In this section, I will try to improve the scores for sub-task 1 and sub-task 2 by tuning or adding hyperparameters to the vectorizers and classifiers.  
* However, we must keep in mind that there is already some overfitting.  
* This raises a concern that tuning the hyperparameters might lead to even more overfitting.  
* Ideally, I would like to try new hyperparameter values by evaluating them on the test set.  
* However, I think this would introduce data leakage since the test set shouldn't be used until the final evaluation.  
* Otherwise, I would be peeking at the results and making decisions based on them.  
* For these reasons, each time I test a new hyperparameter value, I will evaluate it on the validation set.  
* If it improves performance, I will keep it in the model. Otherwise, I will discard it, as it would only contribute to further overfitting.

* I will use Optuna for hyperparameter optimization and follow a three-step process:  
  1. **Tune hyperparameters for the vectorizers**  
  2. **Tune hyperparameters for the classifiers in subtask 1** 
  3. **Tune hyperparameters for the classifiers in subtask 2**  
* After each step, I will evaluate the new hyperparameters on the validation set to ensure they improve performance and do not lead to excessive overfitting.  
* The Optuna studies are not replicable. This means that rerunning the notebook will result in different Optuna results each time.

* The first function will tune the hyperparameters for both the title and text vectorizers.  
* The metric used will be the combined score for both subtask 1 and subtask 2.  

In [39]:
def objective_vectorizer_tuning(trial):

    max_features1 = trial.suggest_int("max_features1", 5000, 13000, step=1000)
    max_features2 = trial.suggest_int("max_features2", 20000, 60000, step=5000)
    
    ngram_choice = trial.suggest_categorical("ngram_range", [1, 2, 3])
    ngram_range = (1, ngram_choice)

    max_df = trial.suggest_float("max_df", 0.3, 0.8, step=0.1)
    min_df = trial.suggest_int("min_df", 1, 10)
    
    vectorizer_title = TfidfVectorizer(max_features=max_features1, ngram_range=ngram_range, max_df=max_df, min_df=min_df)
    vectorizer_text = TfidfVectorizer(max_features=max_features2, ngram_range=ngram_range, max_df=max_df, min_df=min_df)

    clf = LinearSVC(random_state=1234)

    subtask1 = test_vectorizers_classifiers_subtask1(vectorizer_title, vectorizer_text, clf, clf)
    subtask2 = test_vectorizers_classifiers_subtask1(vectorizer_title, vectorizer_text, clf, clf)

    return subtask1 + subtask2

In [40]:
study1 = optuna.create_study(direction="maximize")
study1.optimize(objective_vectorizer_tuning, n_trials=50)

print("Best hyperparameters:", study1.best_params)

[I 2025-02-13 15:56:31,019] A new study created in memory with name: no-name-d0be39c5-02fb-48d5-849d-1e76f69884ba
[I 2025-02-13 15:57:02,357] Trial 0 finished with value: 1.8625156942891388 and parameters: {'max_features1': 5000, 'max_features2': 55000, 'ngram_range': 2, 'max_df': 0.7, 'min_df': 5}. Best is trial 0 with value: 1.8625156942891388.
[I 2025-02-13 15:57:37,976] Trial 1 finished with value: 1.8710014440703078 and parameters: {'max_features1': 13000, 'max_features2': 60000, 'ngram_range': 2, 'max_df': 0.5, 'min_df': 2}. Best is trial 1 with value: 1.8710014440703078.
[I 2025-02-13 15:57:51,349] Trial 2 finished with value: 1.8677102642825436 and parameters: {'max_features1': 12000, 'max_features2': 30000, 'ngram_range': 1, 'max_df': 0.5, 'min_df': 2}. Best is trial 1 with value: 1.8710014440703078.
[I 2025-02-13 15:58:22,324] Trial 3 finished with value: 1.863200453937012 and parameters: {'max_features1': 9000, 'max_features2': 50000, 'ngram_range': 2, 'max_df': 0.6000000000

Best hyperparameters: {'max_features1': 10000, 'max_features2': 45000, 'ngram_range': 2, 'max_df': 0.5, 'min_df': 1}


* I will now use the tuned hyperparameters for evaluation on the validation set.

In [41]:
tuned_vectorizer = ColumnTransformer([
        ("title_tfidf", TfidfVectorizer(max_features = 10000, ngram_range=(1,2), max_df=0.5, min_df=1), "title"),
        ("text_tfidf", TfidfVectorizer(max_features = 45000, ngram_range=(1,2), max_df=0.5, min_df=1), "text")
    ])

tuned_vectorized_train_set_input = tuned_vectorizer.fit_transform(train_set_input)
tuned_vectorized_valid_set_input = tuned_vectorizer.transform(validation_set_input)

In [42]:
#Create
classifier_hazard_category = LinearSVC(random_state=1234)
classifier_product_category = LinearSVC(random_state=1234)
classifier_hazard = LinearSVC(random_state=1234)
classifier_product = LinearSVC(random_state=1234)

#Fit
classifier_hazard_category.fit(tuned_vectorized_train_set_input, train_labels_hazard_category)
classifier_product_category.fit(tuned_vectorized_train_set_input, train_labels_product_category)
classifier_hazard.fit(tuned_vectorized_train_set_input, train_labels_hazard)
classifier_product.fit(tuned_vectorized_train_set_input, train_labels_product)

#Predict
validation_predictions_hazard_category = classifier_hazard_category.predict(tuned_vectorized_valid_set_input)
validation_predictions_product_category = classifier_product_category.predict(tuned_vectorized_valid_set_input)
validation_predictions_hazard = classifier_hazard.predict(tuned_vectorized_valid_set_input)
validation_predictions_product = classifier_product.predict(tuned_vectorized_valid_set_input)

#Evaluate
print(f"SCORE Sub-Task 1: {compute_score(validation_labels_hazard_category, validation_labels_product_category, validation_predictions_hazard_category, validation_predictions_product_category):.3f}")
print(f"SCORE Sub-Task 2: {compute_score(validation_labels_hazard, validation_labels_product, validation_predictions_hazard, validation_predictions_product):.3f}")

SCORE Sub-Task 1: 0.775
SCORE Sub-Task 2: 0.443


* There is a slight increase in the score for sub-task 1 and a slight decrease in the score for sub-task 2.  
* Since the decrease is larger than the increase, I will choose not to use the tuned vectorizer parameters and will keep the untuned ones.

* This function will tune the classifier's hyperparameters for subtask 1.  
* Hyperparameters to tune: 
  * **C**: Controls the regularization strength.  
  * **loss**: Defines the loss function used by the model.  
* The vectorizers will use the untuned parameters.  

In [43]:
def objective_classifiers_tuning_task1(trial):

    vectorizer_title = TfidfVectorizer(ngram_range=(1,2), max_df=0.7, min_df=2, max_features = 10000)
    vectorizer_text = TfidfVectorizer(ngram_range=(1,2), max_df=0.7, min_df=2, max_features = 30000)

    C1 = trial.suggest_float("C1", 0.01, 10.0, log=True)
    C2 = trial.suggest_float("C2", 0.01, 10.0, log=True)
    
    loss1 = trial.suggest_categorical("loss1", ["hinge", "squared_hinge"])
    loss2 = trial.suggest_categorical("loss2", ["hinge", "squared_hinge"])

    # high max iters in order not to fail to converge
    clf1 = LinearSVC(C=C1, loss=loss1, max_iter=10000, random_state=1234)
    clf2 = LinearSVC(C=C2, loss=loss2, max_iter=10000, random_state=1234)

    return test_vectorizers_classifiers_subtask1(vectorizer_title, vectorizer_text, clf1, clf2)

In [44]:
study2 = optuna.create_study(direction="maximize")
study2.optimize(objective_classifiers_tuning_task1, n_trials=50)

print("Best hyperparameters:", study2.best_params)

[I 2025-02-13 16:25:55,763] A new study created in memory with name: no-name-6ec0ff7e-0dfb-44ef-923a-463ad4cff7dc
[I 2025-02-13 16:26:11,802] Trial 0 finished with value: 0.9343754612868326 and parameters: {'C1': 2.0977590496642193, 'C2': 0.5385824296479859, 'loss1': 'squared_hinge', 'loss2': 'squared_hinge'}. Best is trial 0 with value: 0.9343754612868326.
[I 2025-02-13 16:26:27,241] Trial 1 finished with value: 0.8819138483284339 and parameters: {'C1': 2.4427638180692663, 'C2': 0.020995705524877985, 'loss1': 'squared_hinge', 'loss2': 'squared_hinge'}. Best is trial 0 with value: 0.9343754612868326.
[I 2025-02-13 16:26:45,173] Trial 2 finished with value: 0.8772865993715859 and parameters: {'C1': 0.860562568566405, 'C2': 0.010535471788045191, 'loss1': 'squared_hinge', 'loss2': 'hinge'}. Best is trial 0 with value: 0.9343754612868326.
[I 2025-02-13 16:27:01,433] Trial 3 finished with value: 0.8578348470383523 and parameters: {'C1': 0.01138757894393367, 'C2': 0.8987699820921236, 'loss1'

Best hyperparameters: {'C1': 2.361742992708202, 'C2': 0.6397958094930469, 'loss1': 'squared_hinge', 'loss2': 'squared_hinge'}


* I will now use the tuned hyperparameters for evaluation on the validation set.
* The `vectorized_train_set_input` and `vectorized_valid_set_input` are results from the untuned vectorizer.

In [47]:
#Create
tuned_classifier_hazard_category = LinearSVC(C=2.3617, loss= 'squared_hinge', random_state=1234, max_iter=10000)
tuned_classifier_product_category = LinearSVC(C=0.6397, loss= 'squared_hinge', random_state=1234, max_iter=10000)

#Fit
tuned_classifier_hazard_category.fit(vectorized_train_set_input, train_labels_hazard_category)
tuned_classifier_product_category.fit(vectorized_train_set_input, train_labels_product_category)

#Predict
validation_predictions_hazard_category = tuned_classifier_hazard_category.predict(vectorized_valid_set_input)
validation_predictions_product_category = tuned_classifier_product_category.predict(vectorized_valid_set_input)

#Evaluate
print(f"SCORE Sub-Task 1: {compute_score(validation_labels_hazard_category, validation_labels_product_category, validation_predictions_hazard_category, validation_predictions_product_category):.3f}")

SCORE Sub-Task 1: 0.769


* Once again, the results got slightly worse, indicating more overfitting.  
* So, I will not use the tuned hyperparameters in the classifiers.  

* The third function will tune the classifier's hyperparameters for subtask 3.  
* I will use the untuned parameters for the vectorizers.  

In [48]:
def objective_classifiers_tuning_task2(trial):

    vectorizer_title = TfidfVectorizer(ngram_range=(1,2), max_df=0.7, min_df=2, max_features = 10000)
    vectorizer_text = TfidfVectorizer(ngram_range=(1,2), max_df=0.7, min_df=2, max_features = 30000)

    C3 = trial.suggest_float("C3", 0.01, 10.0, log=True)
    C4 = trial.suggest_float("C4", 0.01, 10.0, log=True)
    
    loss3 = trial.suggest_categorical("loss3", ["hinge", "squared_hinge"])
    loss4 = trial.suggest_categorical("loss4", ["hinge", "squared_hinge"])

    # high max iters in order not to fail to converge
    clf3 = LinearSVC(C=C3, loss=loss3, max_iter=10000, random_state=1234)
    clf4 = LinearSVC(C=C4, loss=loss4, max_iter=10000, random_state=1234)

    return test_vectorizers_classifiers_subtask2(vectorizer_title, vectorizer_text, clf3, clf4)

* This study was significantly slower compared to the other two, so I ran 30 trials instead of 50.

In [49]:
study3 = optuna.create_study(direction="maximize")
study3.optimize(objective_classifiers_tuning_task2, n_trials=30)

print("Best hyperparameters:", study3.best_params)

[I 2025-02-11 20:32:07,939] A new study created in memory with name: no-name-c70ad19e-5e02-4c68-b3d3-392860a57cad
[I 2025-02-11 20:34:08,161] Trial 0 finished with value: 0.8199202410932087 and parameters: {'C3': 1.3248506587681272, 'C4': 0.9146762733538936, 'loss3': 'hinge', 'loss4': 'squared_hinge'}. Best is trial 0 with value: 0.8199202410932087.
[I 2025-02-11 20:37:03,115] Trial 1 finished with value: 0.8067905428950513 and parameters: {'C3': 6.605941546175378, 'C4': 0.11282682041251339, 'loss3': 'hinge', 'loss4': 'hinge'}. Best is trial 0 with value: 0.8199202410932087.
[I 2025-02-11 20:38:30,474] Trial 2 finished with value: 0.7147911538976564 and parameters: {'C3': 1.2402529659862864, 'C4': 0.012850149527030022, 'loss3': 'squared_hinge', 'loss4': 'squared_hinge'}. Best is trial 0 with value: 0.8199202410932087.
[I 2025-02-11 20:39:58,231] Trial 3 finished with value: 0.8069535936239078 and parameters: {'C3': 9.811828544166803, 'C4': 0.08002787746125997, 'loss3': 'squared_hinge',

Best hyperparameters: {'C3': 9.551176444912237, 'C4': 4.597265950166175, 'loss3': 'hinge', 'loss4': 'squared_hinge'}


In [50]:
#Create
tuned_classifier_hazard = LinearSVC(C=9.5511, loss= 'hinge', random_state=1234, max_iter=10000)
tuned_classifier_product = LinearSVC(C=4.5972, loss= 'squared_hinge', random_state=1234, max_iter=10000)

#Fit
tuned_classifier_hazard.fit(vectorized_train_set_input, train_labels_hazard)
tuned_classifier_product.fit(vectorized_train_set_input, train_labels_product)

#Predict
validation_predictions_hazard = tuned_classifier_hazard.predict(vectorized_valid_set_input)
validation_predictions_product = tuned_classifier_product.predict(vectorized_valid_set_input)

#Evaluate
print(f"SCORE Sub-Task 2: {compute_score(validation_labels_hazard, validation_labels_product, validation_predictions_hazard, validation_predictions_product):.3f}")

SCORE Sub-Task 2: 0.458


* Finally, there is a clear improvement in the validation set.  
* Therefore, I will keep the tuned hyperparameters for the hazard and product classifiers.

## Final Evaluation

* It's time for the final evaluation.  
* I will evaluate the entire model on both the validation and test sets.  

In [51]:
test_set_input = test_set[['title','text']]

test_labels_hazard_category = test_set['hazard-category']
test_labels_product_category = test_set['product-category']
test_labels_hazard = test_set['hazard']
test_labels_product = test_set['product']

* Create the ColumnTransformer and classifiers.

In [52]:
final_vectorizer = ColumnTransformer([
        ("title_tfidf", TfidfVectorizer(max_features = 10000, ngram_range=(1,2), max_df=0.7, min_df=2), "title"),
        ("text_tfidf", TfidfVectorizer(max_features = 30000, ngram_range=(1,2), max_df=0.7, min_df=2), "text")
    ])

final_classifier_hazard_category = LinearSVC(random_state=1234)
final_classifier_product_category = LinearSVC(random_state=1234)
final_classifier_hazard = LinearSVC(C=9.5511, loss= 'hinge', random_state=1234, max_iter=10000)
final_classifier_product = LinearSVC(C=4.5972, loss= 'squared_hinge', random_state=1234, max_iter=10000)

* Fit them with the entire training dataset.

In [53]:
final_vectorized_train_set_input = final_vectorizer.fit_transform(train_set_input)

final_classifier_hazard_category.fit(final_vectorized_train_set_input, train_labels_hazard_category)
final_classifier_product_category.fit(final_vectorized_train_set_input, train_labels_product_category)
final_classifier_hazard.fit(final_vectorized_train_set_input, train_labels_hazard)
final_classifier_product.fit(final_vectorized_train_set_input, train_labels_product)

* Predictions and Evaluation on the validation set.

In [54]:
final_vectorized_validation_set_input = final_vectorizer.transform(validation_set_input)

validation_predictions_hazard_category = final_classifier_hazard_category.predict(final_vectorized_validation_set_input)
validation_predictions_product_category = final_classifier_product_category.predict(final_vectorized_validation_set_input)
validation_predictions_hazard = final_classifier_hazard.predict(final_vectorized_validation_set_input)
validation_predictions_product = final_classifier_product.predict(final_vectorized_validation_set_input)

print(f"SCORE Sub-Task 1: {compute_score(validation_labels_hazard_category, validation_labels_product_category, validation_predictions_hazard_category, validation_predictions_product_category):.3f}")
print(f"SCORE Sub-Task 2: {compute_score(validation_labels_hazard, validation_labels_product, validation_predictions_hazard, validation_predictions_product):.3f}")

SCORE Sub-Task 1: 0.771
SCORE Sub-Task 2: 0.458


* Predictions and Evaluation on the unseen testing set.

In [55]:
final_vectorized_test_set_input = final_vectorizer.transform(test_set_input)

test_predictions_hazard_category = final_classifier_hazard_category.predict(final_vectorized_test_set_input)
test_predictions_product_category = final_classifier_product_category.predict(final_vectorized_test_set_input)
test_predictions_hazard = final_classifier_hazard.predict(final_vectorized_test_set_input)
test_predictions_product = final_classifier_product.predict(final_vectorized_test_set_input)

print(f"SCORE Sub-Task 1: {compute_score(test_labels_hazard_category, test_labels_product_category, test_predictions_hazard_category, test_predictions_product_category):.3f}")
print(f"SCORE Sub-Task 2: {compute_score(test_labels_hazard, test_labels_product, test_predictions_hazard, test_predictions_product):.3f}")

SCORE Sub-Task 1: 0.727
SCORE Sub-Task 2: 0.419


## Save predictions to csv

* Finally, I will save all the predictions from the test set into a CSV file.

In [56]:
submission_set = pd.DataFrame({
    "hazard-category": test_predictions_hazard_category,
    "product-category": test_predictions_product_category,
    "hazard": test_predictions_hazard,
    "product": test_predictions_product
})

submission_set

Unnamed: 0,hazard-category,product-category,hazard,product
0,biological,"meat, egg and dairy products",listeria monocytogenes,ham slices
1,biological,"meat, egg and dairy products",listeria monocytogenes,thermal processed pork meat
2,biological,"meat, egg and dairy products",listeria monocytogenes,hot dogs
3,biological,"meat, egg and dairy products",listeria monocytogenes,sliced ham
4,foreign bodies,ices and desserts,metal fragment,ice cream
...,...,...,...,...
992,allergens,"meat, egg and dairy products",bone fragment,chicken based products
993,biological,fruits and vegetables,salmonella,dried elder berries
994,biological,seafood,listeria monocytogenes,fish and fish products
995,chemical,confectionery,alkaloids,tortilla chips


In [57]:
submission_set.to_csv("submission.csv")