# SemEval 2025 Task 9: Food Hazard Detection
Student: **Maria Schoinaki**

## **Introduction**
The **SemEval 2025 Task 9: Food Hazard Detection Challenge** focuses on classifying food incident reports by predicting the **type of hazard** and **product** mentioned in each report.

This task aims to support automated food safety monitoring by analyzing **titles** and **full texts** of recall reports. The challenge is divided into **two sub-tasks**:

- **ST1: Food Hazard Classification**  
   - Predicts **hazard-category** (e.g., biological, chemical)  
   - Predicts **product-category** (e.g., dairy, seafood, beverages)  

- **ST2: Food Hazard and Product Vector Detection**  
   - Predicts **exact hazard** (e.g., Salmonella, Listeria)  
   - Predicts **exact product** (e.g., Ice Cream, Chicken, Cake)  

Since food safety risks are critical, **explainability** is an important factor in this challenge. The results will help **food agencies and regulators** track hazardous food items efficiently.

---

## Installing Dependencies
To proceed with running the code, we need to install the following dependencies:
1. transformers[torch]
2. imbalanced-learn (imported as `imblearn`)
3. shap

In [1]:
%%capture

# install dependencies:
!pip install transformers[torch]
!pip install imbalanced-learn
!pip install shap

## Downloading the Dataset
To participate in this challenge, we need three datasets:
1. **Training Data (Labeled)**
2. **Validation Data (Unlabeled)**
3. **Testing Data (Unlabeled)**

We download the datasets using `wget`:

In [2]:
# download training data (labeled):
!wget https://raw.githubusercontent.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/refs/heads/main/data/incidents_train.csv

# download validation data (unlabeled):
!wget https://raw.githubusercontent.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/refs/heads/main/data/incidents_valid.csv

# download testing data (unlabeled):
!wget https://raw.githubusercontent.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/refs/heads/main/data/incidents_test.csv

--2025-02-13 20:28:16--  https://raw.githubusercontent.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/refs/heads/main/data/incidents_train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12866710 (12M) [text/plain]
Saving to: ‘incidents_train.csv’


2025-02-13 20:28:16 (290 MB/s) - ‘incidents_train.csv’ saved [12866710/12866710]

--2025-02-13 20:28:16--  https://raw.githubusercontent.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/refs/heads/main/data/incidents_valid.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connect

## Imports

To build a **food hazard detection system**, we require various Python libraries for **data processing, model training, evaluation, and explainability**.

---

In [3]:
import pandas as pd
import numpy as np
import torch
import random
import warnings
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from transformers import EarlyStoppingCallback
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE
from sklearn.utils.class_weight import compute_class_weight
from sklearn.feature_extraction.text import TfidfVectorizer
from torch.utils.data import Dataset
import shap

2025-02-13 20:28:30.818706: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:479] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-02-13 20:28:30.847405: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:10575] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-02-13 20:28:30.847453: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1442] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-13 20:28:30.865169: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Transformers is only compatible with Keras 2, but yo

## Loading the Dataset

To train and evaluate the **food hazard detection model**, we need to **load, extract, and structure** the dataset properly. This section explains how we load the dataset and separate it into **features (X)** and **labels (y)**.

---

### Dataset Overview
The dataset consists of **food hazard reports**, where each record contains:
- A **title** (short description of the incident)
- A **text** (detailed description of the incident)
- Four labels for classification:
  1. `hazard-category` (Broad category of hazard: Biological, Chemical, etc.)
  2. `product-category` (General food type: Dairy, Seafood, etc.)
  3. `hazard` (Specific hazard: Salmonella, E. Coli, etc.)
  4. `product` (Specific product: Milk, Ice Cream, Chicken, etc.)

---

### Loading the Dataset
We load three separate CSV files into **Pandas DataFrames**.

---

In [4]:
# Load Dataset
train_data = pd.read_csv("incidents_train.csv", index_col=0)
val_data = pd.read_csv("incidents_valid.csv", index_col=0)
test_data = pd.read_csv("incidents_test.csv", index_col=0)

# Extract Inputs and Labels
# The label frames are copied so that the in-place encoding below does not write into a slice of the source frames.
X_train, y_train = train_data[["title", "text"]], train_data[["hazard-category", "product-category", "hazard", "product"]].copy()
X_val, y_val = val_data[["title", "text"]], val_data[["hazard-category", "product-category", "hazard", "product"]].copy()
X_test, y_test = test_data[["title", "text"]], test_data[["hazard-category", "product-category", "hazard", "product"]].copy()

## Ensuring Reproducibility

To make results **as consistent as possible** across multiple training runs, we **set a fixed seed** for all libraries (some GPU operations can still be non-deterministic). Additionally, we check if a **GPU (CUDA)** is available to accelerate training.

---

In [5]:
warnings.simplefilter("ignore")

# Ensure Reproducibility
SEED = 42
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)  # seeds every GPU; does nothing when CUDA is unavailable
random.seed(SEED)
np.random.seed(SEED)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Text Preprocessing

Before feeding text data into a **deep learning model**, we must **preprocess** it to improve model performance. This involves **cleaning** and **standardizing** the text.

---

### Preprocessing Steps
We define a **simple yet effective preprocessing function** that performs:
1. **Converting to lowercase** → Ensures uniform text representation.
2. **Stripping extra spaces** → Removes unnecessary whitespace.


In [6]:
# Preprocess Text
def preprocess_text(text):
    return str(text).lower().strip()

X_train = X_train.applymap(preprocess_text)
X_val = X_val.applymap(preprocess_text)
X_test = X_test.applymap(preprocess_text)
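
One detail of this function is worth checking: `str(text)` turns missing values into literal strings. The sketch below reproduces `preprocess_text` and compares it with a hypothetical `preprocess_text_safe` variant (not part of the notebook) that maps missing values to an empty string instead.

```python
import math

def preprocess_text(text):
    # Same function as in the notebook.
    return str(text).lower().strip()

def preprocess_text_safe(text):
    """Hypothetical variant: treat None and NaN as empty text."""
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return ""
    return str(text).lower().strip()

samples = ["  Recall of ICE CREAM  ", None, float("nan")]

cleaned = [preprocess_text(s) for s in samples]
cleaned_safe = [preprocess_text_safe(s) for s in samples]

print(cleaned)       # ['recall of ice cream', 'none', 'nan']
print(cleaned_safe)  # ['recall of ice cream', '', '']
```

Whether the literal string `"nan"` matters depends on how many reports have an empty title or text, which is not checked here.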

## Label Encoding

In order to train a **machine learning model**, categorical labels must be **converted into numerical values**. This step ensures that the model can process and understand class labels effectively.

---

### Why is Label Encoding Important?
- **Converts categorical labels into numbers** → Machine learning models require numerical inputs.
- **Ensures consistency across datasets** → Labels are encoded the same way in training, validation, and test sets.
- **Handles unseen labels in the test set** → Prevents errors due to labels not seen during training.

---

### Label Encoding Steps
We use **`sklearn.preprocessing.LabelEncoder`** to **convert class labels into numerical values**.

---

In [7]:
# Encode Labels
label_encoders = {}
for col in ["hazard-category", "product-category", "hazard", "product"]:
    label_encoders[col] = LabelEncoder()
    all_labels = pd.concat([y_train[col], y_val[col]], axis=0).astype(str)
    label_encoders[col].fit(all_labels)

    y_train[col] = label_encoders[col].transform(y_train[col].astype(str))
    y_val[col] = label_encoders[col].transform(y_val[col].astype(str))

    # Encode y_test; labels the encoder has never seen fall back to the most frequent training class
    most_frequent_label = y_train[col].value_counts().idxmax()
    y_test[col] = y_test[col].apply(
        lambda x: label_encoders[col].transform([str(x)])[0] if str(x) in label_encoders[col].classes_ else most_frequent_label
    )
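
The fallback logic can be illustrated without scikit-learn. The following is a minimal, dependency-free sketch of the idea, assuming made-up label values; `fit_label_mapping` and `encode_with_fallback` are hypothetical helper names, and the sorted-order mapping mirrors how `LabelEncoder` orders its classes.

```python
from collections import Counter

def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted({str(label) for label in labels})
    return {label: index for index, label in enumerate(classes)}

def encode_with_fallback(labels, mapping, fallback):
    """Encode labels; anything missing from the mapping gets the fallback id."""
    return [mapping.get(str(label), fallback) for label in labels]

# Made-up labels for illustration only.
train_labels = ["biological", "allergens", "biological", "chemical"]
test_labels = ["chemical", "migration"]  # "migration" was never seen

mapping = fit_label_mapping(train_labels)
most_frequent = Counter(train_labels).most_common(1)[0][0]
fallback_id = mapping[most_frequent]

encoded_test = encode_with_fallback(test_labels, mapping, fallback_id)
print(mapping)       # {'allergens': 0, 'biological': 1, 'chemical': 2}
print(encoded_test)  # [2, 1]
```

Note the trade-off: a test row whose true label is unseen is scored as if its true label were the most frequent class, so the fallback avoids errors but can distort the test F1.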

## Compute Class Weights

To handle class imbalance, we compute class weights for each classification task. 
Weights like these reduce the bias towards frequent classes only when they are passed to the loss function; the default `Trainer` used later in this notebook does not use them.

In [8]:
# Compute Class Weights
class_weights = {}
for col in ["hazard-category", "product-category", "hazard", "product"]:
    weights = compute_class_weight('balanced', classes=np.unique(y_train[col]), y=y_train[col])
    class_weights[col] = torch.tensor(weights, dtype=torch.float).to(device)
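
For reference, the `'balanced'` heuristic assigns each class the weight `n_samples / (n_classes * count_of_class)`. The sketch below recomputes it in plain Python on a made-up label list; `balanced_class_weights` is a hypothetical helper used only for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * number of samples in that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}

# Three samples of class 0 and one of class 1.
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)  # class 0 -> 4 / (2 * 3) = 0.667, class 1 -> 4 / (2 * 1) = 2.0
```

The rare class receives the larger weight, so its errors count more once the weights are used in a weighted loss.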

## Apply SMOTE for Balancing

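To handle class imbalance, we attempt to apply **Synthetic Minority Over-sampling Technique (SMOTE)** 
on the product classification task to generate synthetic samples for underrepresented classes. As the output below shows, SMOTE fails on this dataset because at least one product class has a single sample, so training proceeds on the original data.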

In [9]:
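# Apply SMOTE for Balancing
# SMOTE interpolates numeric feature vectors, so it runs on TF-IDF features rather than raw text.
vectorizer = TfidfVectorizer(max_features=7000, stop_words="english")
X_train_tfidf = vectorizer.fit_transform(X_train["text"])

try:
    if len(set(y_train["product"])) > 1:
        # k_neighbors=2 requires at least 3 samples in every class that is oversampled.
        smote = SMOTE(k_neighbors=2, random_state=SEED)
        X_train_resampled, y_train_product_resampled = smote.fit_resample(X_train_tfidf, y_train["product"])
        # The resampled labels align with X_train_resampled, not with the tokenized text,
        # so they are kept separate instead of overwriting y_train["product"].
except ValueError as e:
    print(f"SMOTE failed: {e}. Proceeding without SMOTE.")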

SMOTE failed: Expected n_neighbors <= n_samples_fit, but n_neighbors = 3, n_samples_fit = 1, n_samples = 1. Proceeding without SMOTE.
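
The error message is consistent with the neighbour search asking for `k_neighbors + 1` points (each sample counts as its own neighbour) in a class that has only one sample. A quick pre-check along these lines can list the classes that are too small; this is a sketch with made-up labels, and `classes_too_small_for_smote` is a hypothetical helper, not part of `imblearn`.

```python
from collections import Counter

def classes_too_small_for_smote(labels, k_neighbors):
    """Return classes with at most k_neighbors samples, i.e. fewer than k_neighbors + 1."""
    counts = Counter(labels)
    return sorted(cls for cls, n in counts.items() if n <= k_neighbors)

# Made-up product labels for illustration only.
labels = ["milk"] * 5 + ["cake"] * 3 + ["ice cream"] * 1 + ["chicken"] * 2

too_small = classes_too_small_for_smote(labels, k_neighbors=2)
print(too_small)  # ['chicken', 'ice cream']
```

Classes flagged this way would need to be merged, dropped from oversampling, or handled with a different technique (for example plain random oversampling) before SMOTE can run.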


## Load Tokenizer

To tokenize our text data, we use the **RoBERTa tokenizer** from Hugging Face's `transformers` library.

In [10]:
# Load Tokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

## Tokenization Function

To prepare our text data for the **RoBERTa** model, we define a tokenization function that:
- **Tokenizes both the title and text fields**
- **Truncates** long sequences to a maximum of 256 tokens
- **Pads** shorter sequences to ensure uniform input size
- Returns **PyTorch tensors** for model compatibility

In [11]:
# Tokenization Function
def tokenize_data(texts):
    return tokenizer(
        texts["title"].tolist(),
        texts["text"].tolist(),
        truncation=True,
        padding="max_length",
        max_length=256,
        return_tensors="pt"
    )

tokenized_train = tokenize_data(X_train)
tokenized_val = tokenize_data(X_val)
tokenized_test = tokenize_data(X_test)
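
To make the effect of `truncation` and `padding="max_length"` concrete, here is a simplified, dependency-free illustration on plain lists of token ids. It is a conceptual sketch only: the real tokenizer also adds special tokens and decides how to split the truncation between title and text, which this hypothetical `pad_or_truncate` helper ignores. The `pad_id` value is an assumption for the example.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Cut a sequence to max_length, or right-pad it, and build the attention mask."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5, pad_id=1)
long_ids, long_mask = pad_or_truncate([5, 6, 7, 8, 9, 10], max_length=5, pad_id=1)

print(short_ids, short_mask)  # [5, 6, 7, 1, 1] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [5, 6, 7, 8, 9] [1, 1, 1, 1, 1]
```

Every row ends up with the same length, which is what lets the tokenized data be stacked into a single tensor; the attention mask tells the model which positions to ignore.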

## Convert to PyTorch Dataset

To facilitate model training, we define a custom PyTorch `Dataset` class that allows efficient data loading.

### **Class: `FoodHazardDataset`**
This class prepares the tokenized inputs and their corresponding labels for training and evaluation.

In [12]:
# Convert to PyTorch Dataset
class FoodHazardDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item


## **Prepare Datasets**

We structure the data into **PyTorch datasets** for training, validation, and testing. Each dataset contains **tokenized inputs** and **encoded labels** for the four prediction targets:

1. **hazard-category** (ST1)
2. **product-category** (ST1)
3. **hazard** (ST2)
4. **product** (ST2)

In [13]:
# Prepare Datasets
datasets = {}
for col in ["hazard-category", "product-category", "hazard", "product"]:
    datasets[col] = {
        "train": FoodHazardDataset(tokenized_train, y_train[col].tolist()),
        "val": FoodHazardDataset(tokenized_val, y_val[col].tolist()),
        "test": FoodHazardDataset(tokenized_test, y_test[col].tolist())
    }

## Define RoBERTa Models

We define separate **RoBERTa-based classification models** for each of the four classification tasks:
- **Hazard Category**
- **Product Category**
- **Hazard**
- **Product**

Each model is initialized with **RoBERTa-Base** and adjusted for the number of unique labels in each classification task.

In [14]:
# Define Models
models = {}
for col in ["hazard-category", "product-category", "hazard", "product"]:
    num_classes = len(label_encoders[col].classes_)
    models[col] = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=num_classes).to(device)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN th

## Fine-Tuned Training Arguments

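We configure the **training parameters** shared by all four models: a per-device batch size of 16 with 2 gradient accumulation steps, a learning rate of `2e-5`, up to 10 epochs, mixed precision (`fp16`), a `reduce_lr_on_plateau` scheduler, and selection of the best checkpoint by validation loss.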

In [15]:
# Fine-Tuned Training Arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    learning_rate=2e-5,
    weight_decay=0.03,  
    logging_dir="./logs",
    logging_steps=100,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    report_to="none",
    fp16=True,
    gradient_accumulation_steps=2,
    lr_scheduler_type="reduce_lr_on_plateau",
    max_grad_norm=1.0,
)
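
With gradient accumulation, the optimizer updates once every few batches, so the number of samples behind each update is larger than the per-device batch size. A small worked example, assuming a single device (`num_devices` is an assumption; the notebook does not state how many GPUs were used):

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Samples that contribute to one optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

# Values from the TrainingArguments above: batch size 16, 2 accumulation steps.
effective = effective_batch_size(16, 2)
print(effective)  # 32
```

This gives the gradient statistics of a batch of 32 while only 16 samples need to fit in GPU memory at a time.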

## Model Training

We train separate **RoBERTa** models for each classification task:
- **Hazard Category**
- **Product Category**
- **Hazard**
- **Product**

Each model is fine-tuned using **gradient accumulation**, **learning rate scheduling**, and **early stopping** to prevent overfitting.

In [16]:
# Train Models
trainers = {}
for col in ["hazard-category", "product-category", "hazard", "product"]:
    trainers[col] = Trainer(
        model=models[col],
        args=training_args,
        train_dataset=datasets[col]["train"],
        eval_dataset=datasets[col]["val"],
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
    )
    print(f"Training model for: {col}")
    trainers[col].train()

Training model for: hazard-category


Epoch,Training Loss,Validation Loss
1,1.01,0.307579
2,0.2393,0.227359
3,0.1847,0.259463
4,0.1516,0.268308


Training model for: product-category


Epoch,Training Loss,Validation Loss
1,2.3682,1.289274
2,0.9739,0.926525
3,0.7424,0.824088
4,0.5132,0.836142
5,0.4677,0.912111


Training model for: hazard


Epoch,Training Loss,Validation Loss
1,3.4795,1.658876
2,1.3974,1.19198
3,1.1899,1.04322
4,0.9338,0.911452
5,0.7694,0.84413
6,0.6377,0.815853
7,0.5239,0.7518
8,0.4722,0.75173
9,0.3997,0.768021
10,0.3512,0.748307


Training model for: product


Epoch,Training Loss,Validation Loss
1,6.6969,6.306212
2,5.908,5.679848
3,5.4679,5.158534
4,4.8709,4.805787
5,4.6129,4.505324
6,4.1196,4.264974
7,3.7529,4.093976
8,3.5389,3.935555
9,3.2511,3.820834
10,3.0759,3.729155
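
The logs above show where early stopping ended each run. The sketch below replays the patience rule on the validation losses copied from those logs; `simulate_early_stopping` is a hypothetical helper that mimics the rule "stop after `patience` consecutive epochs without improvement", not the callback's actual implementation.

```python
def simulate_early_stopping(val_losses, patience=2):
    """Return (best_epoch, last_epoch_run) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Validation losses copied from the training logs above.
hazard_category = [0.307579, 0.227359, 0.259463, 0.268308]
product_category = [1.289274, 0.926525, 0.824088, 0.836142, 0.912111]
hazard = [1.658876, 1.19198, 1.04322, 0.911452, 0.84413,
          0.815853, 0.7518, 0.75173, 0.768021, 0.748307]

print(simulate_early_stopping(hazard_category))   # (2, 4)
print(simulate_early_stopping(product_category))  # (3, 5)
print(simulate_early_stopping(hazard))            # (10, 10)
```

Because `load_best_model_at_end=True`, the checkpoint from the best epoch (2 for `hazard-category`, 3 for `product-category`) is the one used afterwards. The `hazard` and `product` runs were still improving at epoch 10, which suggests they may benefit from more epochs.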


## Generating Predictions & Computing Scores

Once the model is trained, we need to **evaluate its performance** by generating predictions and computing **F1 scores**.  
The evaluation is based on the **hazard category** and **product category** predictions for ST1 and **hazard** and **product** for ST2.

---

### Generating Predictions
We define a function to **obtain predictions** from the trained models.

---

In [18]:
# Generate Predictions & Compute Scores
def get_predictions(trainer, dataset):
    predictions = trainer.predict(dataset)
    return np.argmax(predictions.predictions, axis=1)

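def compute_score(hazards_true, products_true, hazards_pred, products_pred):
    # Convert to NumPy arrays so the comparison and the masking below are positional.
    hazards_true, hazards_pred = np.asarray(hazards_true), np.asarray(hazards_pred)
    # Macro F1 over the hazard labels.
    f1_hazards = f1_score(hazards_true, hazards_pred, average='macro')
    # Product F1 is computed only on the rows where the hazard prediction is correct.
    correct_hazard_mask = hazards_pred == hazards_true
    f1_products = f1_score(
        np.array(products_true)[correct_hazard_mask], 
        np.array(products_pred)[correct_hazard_mask], 
        average='macro'
    )
    return (f1_hazards + f1_products) / 2.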
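
To see how this score behaves, the following dependency-free sketch re-implements macro F1 and the masking step on a four-row toy example. The notebook itself uses `sklearn.metrics.f1_score`; `macro_f1` and `toy_score` are hypothetical stand-ins written only for this illustration, and the labels are made up.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

def toy_score(hazards_true, products_true, hazards_pred, products_pred):
    f1_hazards = macro_f1(hazards_true, hazards_pred)
    # Keep only the rows where the hazard prediction is correct.
    kept = [i for i, (t, p) in enumerate(zip(hazards_true, hazards_pred)) if t == p]
    f1_products = macro_f1([products_true[i] for i in kept],
                           [products_pred[i] for i in kept])
    return (f1_hazards + f1_products) / 2

score = toy_score(
    hazards_true=[0, 0, 1, 1], products_true=[2, 3, 2, 3],
    hazards_pred=[0, 0, 1, 0], products_pred=[2, 3, 3, 3],
)
print(round(score, 4))  # 0.7
```

Here the hazard macro F1 is (0.8 + 0.667) / 2 = 0.733. The fourth row has a wrong hazard prediction, so it is excluded from the product F1, which is 0.667 on the remaining three rows. The final score is the mean of the two, 0.7.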

## Computing Validation & Test Scores

Once the model has been trained and predictions have been generated, we evaluate its performance on both the **validation set** and the **test set**.

---

### Computing Scores for Validation & Test Sets
We use the **`compute_score()`** function to calculate F1 scores for:
- **ST1 (hazard-category & product-category)**
- **ST2 (hazard & product)**
  
---

In [19]:
# Compute Scores
def score_split(hazard_col, product_col, split, y_true):
    """Score one sub-task on one split ("val" or "test")."""
    hazards_pred = get_predictions(trainers[hazard_col], datasets[hazard_col][split])
    products_pred = get_predictions(trainers[product_col], datasets[product_col][split])
    return compute_score(y_true[hazard_col], y_true[product_col], hazards_pred, products_pred)

st1_val_score = score_split("hazard-category", "product-category", "val", y_val)
st2_val_score = score_split("hazard", "product", "val", y_val)

st1_test_score = score_split("hazard-category", "product-category", "test", y_test)
st2_test_score = score_split("hazard", "product", "test", y_test)

## Print Results

After evaluating the models on both the **validation** and **test** datasets, we print the **final scores** for **ST1 (hazard and product category classification)** and **ST2 (hazard and product vector detection).**

In [20]:
# Print Results
print(f" ST1 Validation Score: {st1_val_score}")
print(f" ST2 Validation Score: {st2_val_score}")
print(f" ST1 Test Score: {st1_test_score}")
print(f" ST2 Test Score: {st2_test_score}")

 ST1 Validation Score: 0.7215756440614738
 ST2 Validation Score: 0.4284686649864853
 ST1 Test Score: 0.6598065789347851
 ST2 Test Score: 0.4013321429030964


## Save Submission File

After generating predictions for both the **test set** and **validation set**, we save them as CSV files for submission.

In [27]:
import zipfile

# Save Predictions to CSV with Proper Labels
def save_predictions(trainers, datasets, label_encoders, dataset_type, filename_prefix):
    """
    Generates predictions, converts them back to original labels, and saves them in CSV format.
    dataset_type: "test" or "val" - controls which dataset is used.
    """
    submission_data = {"ID": np.arange(1, len(datasets["hazard-category"][dataset_type]) + 1)}  # Add ID column

    # Generate predictions for each category and convert back to text labels
    for col in ["hazard-category", "product-category", "hazard", "product"]:
        preds = get_predictions(trainers[col], datasets[col][dataset_type])  
        submission_data[col] = label_encoders[col].inverse_transform(preds)  # Convert numbers back to labels

    # Create submission DataFrame
    submission_df = pd.DataFrame(submission_data)

    # Save CSV file
    filename = f"{filename_prefix}_{dataset_type}.csv"
    submission_df.to_csv(filename, index=False)
    print(f"Saved: {filename}")

# Generate and Save Predictions Separately
save_predictions(trainers, datasets, label_encoders, "test", "submission")
save_predictions(trainers, datasets, label_encoders, "val", "submission")

# Zip Submission Files
with zipfile.ZipFile("submission.zip", "w") as zipf:
    zipf.write("submission_test.csv")
    zipf.write("submission_val.csv")

print("Submission files created and zipped: `submission.zip`")

Saved: submission_test.csv


Saved: submission_val.csv
Submission files created and zipped: `submission.zip`


# **Final Thoughts**

I explored multiple **Machine Learning (ML) and Deep Learning (DL) approaches**, fine-tuning models to **maximize F1 scores while avoiding overfitting**.  

---

## **Models and Approaches Explored**  
Throughout the project, I experimented with several models and techniques:

### **Traditional Machine Learning Approaches**  
1. **Support Vector Machines (SVM)**
   - Initial baseline model
   - Required feature engineering (TF-IDF)
   - Poor performance on imbalanced classes

2. **Logistic Regression**
   - Fast and interpretable
   - Could not capture complex linguistic patterns

3. **Random Forest & XGBoost**
   - Performed better than linear models
   - Struggled with class imbalance and lacked explainability

---

### **Deep Learning Approaches**  
1. **LSTM & Bi-LSTM**
   - Tested with Word2Vec & FastText embeddings
   - Moderate performance but slow training

2. **BERT Variants**  
   - `bert-base-uncased` (Baseline Transformer)
   - `bert-large-uncased` (Better generalization)
   - Did not outperform RoBERTa on validation/test

3. **RoBERTa (Final Model)**
   - `roberta-base` provided the best balance of speed and accuracy
   - Fine-tuned on hazard and product categories
   - Explored **class weighting, data augmentation, and SMOTE** to handle class imbalance (in the run shown above, SMOTE failed on single-sample classes and the computed class weights were not passed to the `Trainer`)
   - Experimented with **learning rate scheduling (cosine with restarts)**; the run shown above uses `reduce_lr_on_plateau`
   - Applied **gradient accumulation and mixed-precision training (fp16)**
   - **Scores of the run shown above**:  
     - **ST1:** 0.7216 (validation), 0.6598 (test)  
     - **ST2:** 0.4285 (validation), 0.4013 (test)

---

## **Challenges and Optimizations**  
- **Data Imbalance**:  
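  - Computed class weights with `compute_class_weight`
  - Attempted **SMOTE** oversampling for the `product` labels (it failed in the run shown above because at least one class has a single sample)  
  - Experimented with class-weighted loss functions to reduce bias towards frequent classes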

- **Memory Constraints**:  
  - Limited **GPU memory** prevented using larger models (`roberta-large`, `deberta-v3-large`)  
  - **Batch size reduced** to **prevent out-of-memory (OOM) errors**  
  - Gradient accumulation steps **increased up to 8** during tuning for **stabilized updates** (the run shown above uses 2)
  - **Mixed-precision (fp16)** reduced memory load

- **Hyperparameter Tuning**:  
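  - Adjusted **learning rate (`1e-5`)**
  - Increased **weight decay (`0.07`)**
  - Used **cosine learning rate scheduler with warmup**
  - Applied **gradient clipping (`max_grad_norm=0.8`)**  
  - These values were among those tried; the run shown above uses `learning_rate=2e-5`, `weight_decay=0.03`, `lr_scheduler_type="reduce_lr_on_plateau"`, and `max_grad_norm=1.0`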

---

## **Key Insights & Limitations**  
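- **The final model (RoBERTa) achieved the best validation performance among the approaches I tried**  
- **The fine-grained `hazard` and `product` models were still improving at epoch 10, so ST2 may benefit from longer training**  
- **With more GPU memory, larger models could be tried for further improvements**    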

---

## **Reference Paper: CICLe - Conformal In-Context Learning for Large-scale Multi-Class Food Risk Classification**  
I based some of my ideas on the **CICLe** paper:  
*Korbinian Randl, John Pavlopoulos, Aron Henriksson, Tony Lindgren, Stockholm University, 2024.*  
Citation: https://arxiv.org/abs/2403.11904
