# 3rd Assignment: SemEval 2025 Task 9 - The Food Hazard Detection Challenge
> Markella Englezou, 8210039 <br />
> Department of Management Science and Technology <br />
> Athens University of Economics and Business <br />
> t8210039@aueb.gr

*The Food Hazard Detection task evaluates explainable classification systems for titles of food-incident reports collected from the web. These algorithms may help automated crawlers find and extract food issues from web sources like social media in the future.*

*The SemEval-Task combines two sub-tasks:*
* *(ST1) Text classification for food hazard prediction, predicting the type of hazard and product.*
* *(ST2) Food hazard and product "vector" detection, predicting the exact hazard and product.*

*The task focuses on detecting the hazard and uses a two-step scoring metric based on the macro F1 score, focusing on the hazard label per sub-task.*

*Explainability in food risk classification based on texts is currently underexplored although it may help humans quickly assess prediction validity and can be used for meta-learning approaches like clustering or pre-sorting examples. However, explanations can be diverse and task/model-dependent. Current literature includes both model-specific ([Assael et al., 2022](https://www.nature.com/articles/s41586-022-04448-z); [Pavlopoulos et al., 2022](https://aclanthology.org/2022.acl-long.259/)) and model agnostic ([Ribeiro et al., 2016](https://aclanthology.org/N16-3020/)) approaches. We aim to study mechanisms to explain decisions on food safety risks by asking participants to submit precise "vector"-labels (ST2) as explanations for their ST1 predictions.*

*The complete challenge can be found [here](https://food-hazard-detection-semeval-2025.github.io/).*

# Prepare data and create necessary functions and classes.

* Install necessary libraries (uncomment if you need any).

In [1]:
# !pip install pandas numpy torch scikit-learn "transformers[torch]"

* Import necessary libraries.

In [2]:
import pandas as pd
import numpy as np
import torch
import re
import os

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import f1_score, classification_report
from transformers import RobertaTokenizer, RobertaForSequenceClassification, Trainer, TrainingArguments
from torch.utils.data import Dataset, DataLoader
from shutil import make_archive

* Download training, evaluation and testing data.

In [3]:
!wget https://raw.githubusercontent.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/refs/heads/main/data/incidents_train.csv
!wget https://raw.githubusercontent.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/refs/heads/main/data/incidents_valid.csv
!wget https://raw.githubusercontent.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/refs/heads/main/data/incidents_test.csv

--2025-02-13 10:10:39--  https://raw.githubusercontent.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/refs/heads/main/data/incidents_train.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12866710 (12M) [text/plain]
Saving to: ‘incidents_train.csv’


2025-02-13 10:10:39 (257 MB/s) - ‘incidents_train.csv’ saved [12866710/12866710]

--2025-02-13 10:10:39--  https://raw.githubusercontent.com/food-hazard-detection-semeval-2025/food-hazard-detection-semeval-2025.github.io/refs/heads/main/data/incidents_valid.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... c

* Read training, evaluation and testing data.

In [4]:
train_set = pd.read_csv('incidents_train.csv', index_col=0)
dev_set = pd.read_csv('incidents_valid.csv', index_col=0)
test_set = pd.read_csv('incidents_test.csv', index_col=0)

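* Add the function that computes the final score. It takes the macro F1 over the hazard labels and averages it with the macro F1 over the product labels, where the product F1 is computed only on the rows whose hazard prediction is correct. The annotated labels (hazards_true & products_true) serve as ground truth for the participants’ predicted labels (hazards_pred & products_pred).
* This function is given by the task organizers.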

In [5]:
def compute_score(hazards_true, products_true, hazards_pred, products_pred):
  # Compute f1 for hazards:
  f1_hazards = f1_score(
    hazards_true,
    hazards_pred,
    average='macro'
  )

  # Compute f1 for products:
  f1_products = f1_score(
    products_true[hazards_pred == hazards_true],
    products_pred[hazards_pred == hazards_true],
    average='macro'
  )

  return (f1_hazards + f1_products) / 2.
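
To make the two-step scoring concrete, here is a small worked example on made-up labels. It is only a sketch: `macro_f1` and `two_step_score` are hypothetical plain-Python helpers written for illustration (they are meant to mirror what `f1_score(..., average='macro')` does when no explicit label list is passed), and the toy labels are invented.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:  # nothing to score (e.g. no hazard was predicted correctly)
        return 0.0
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def two_step_score(hazards_true, products_true, hazards_pred, products_pred):
    # Step 1: macro F1 on all hazard predictions
    f1_hazards = macro_f1(hazards_true, hazards_pred)
    # Step 2: macro F1 on products, restricted to rows with a correct hazard
    keep = [i for i, (t, p) in enumerate(zip(hazards_true, hazards_pred)) if t == p]
    f1_products = macro_f1([products_true[i] for i in keep],
                           [products_pred[i] for i in keep])
    return (f1_hazards + f1_products) / 2


# Invented toy labels: the hazard of the second row is predicted wrongly,
# so its (correct) product prediction does not count towards the product F1.
hazards_true = ["bio", "bio", "chem", "allergen"]
hazards_pred = ["bio", "chem", "chem", "allergen"]
products_true = ["meat", "fish", "milk", "nuts"]
products_pred = ["meat", "fish", "nuts", "nuts"]

score = two_step_score(hazards_true, products_true, hazards_pred, products_pred)
print(round(score, 3))  # 0.667  (hazard F1 = 7/9, product F1 = 5/9)
```

The example shows why the hazard label dominates the metric: a wrong hazard lowers the hazard F1 and also removes that row from the product evaluation.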

* Preprocess text to train, validate and test the models.

In [6]:
# Combine title and text
train_set['combined_text'] = train_set['title'] + " " + train_set['text']
dev_set['combined_text'] = dev_set['title'] + " " + dev_set['text']
test_set['combined_text'] = test_set['title'] + " " + test_set['text']

# Preprocess text
def preprocess_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\b[A-Z]+-\d{3}-\d{2}\b', '', text)  # Remove patterns like FSIS-017-94
    text = re.sub(r'\b\d{3}-\d{2}\b', '', text)  # Remove patterns like 017-94
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Keep only ASCII letters and whitespace (also drops digits and punctuation)
    text = text.lower()  # Convert to lowercase
    return text

# Preprocess train, dev and test sets
train_set['cleaned_text'] = train_set['combined_text'].apply(preprocess_text)
dev_set['cleaned_text'] = dev_set['combined_text'].apply(preprocess_text)
test_set['cleaned_text'] = test_set['combined_text'].apply(preprocess_text)
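
As a quick illustration of what the cleaning steps do, the sketch below repeats `preprocess_text` and applies it to an invented sentence (the sample text and URL are made up). The removals can leave double spaces behind, so the result is shown as a list of tokens.

```python
import re


def preprocess_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'\b[A-Z]+-\d{3}-\d{2}\b', '', text)  # Remove patterns like FSIS-017-94
    text = re.sub(r'\b\d{3}-\d{2}\b', '', text)  # Remove patterns like 017-94
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Keep only letters and whitespace
    text = text.lower()  # Convert to lowercase
    return text


# Invented example sentence, not taken from the dataset
sample = "Recall FSIS-017-94: Listeria found in 2 lots, see http://example.com/x now!"
cleaned = preprocess_text(sample)
print(cleaned.split())
# ['recall', 'listeria', 'found', 'in', 'lots', 'see', 'now']
```

Note that the digit "2" disappears as well, since the last substitution keeps letters and whitespace only.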

* Define a custom dataset class that tokenizes text and prepares it for model training.
* Parameters:
    - texts: List of input texts.
    - labels: Corresponding labels for each text.
    - tokenizer: Pretrained tokenizer for processing text.
    - max_len: Maximum sequence length for tokenization.

In [7]:
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        inputs = self.tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=self.max_len,
            return_tensors='pt'
        )
        input_ids = inputs['input_ids'].squeeze()
        attention_mask = inputs['attention_mask'].squeeze()
        labels = torch.tensor(self.labels[idx], dtype=torch.float32 if isinstance(self.labels[idx], list) else torch.long)
        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels
        }
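
The effect of `padding='max_length'` together with `truncation=True` can be pictured with a toy helper. This is a simplified sketch, not the Hugging Face implementation: `pad_or_truncate` is a hypothetical function, the token ids are invented, and the pad id of 1 is an assumption made for the example. The real tokenizer also keeps the closing special token when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_len, pad_id=1):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = token_ids[:max_len]                 # truncation: cut off extra tokens
    mask = [1] * len(ids)                     # 1 marks a real token
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


# A short sequence is padded on the right ...
print(pad_or_truncate([0, 713, 2], max_len=5))
# ([0, 713, 2, 1, 1], [1, 1, 1, 0, 0])

# ... and a long one is cut to max_len.
print(pad_or_truncate(list(range(10)), max_len=4))
# ([0, 1, 2, 3], [1, 1, 1, 1])
```

Because every item has the same length, the default collation can stack the tensors into a batch without extra work.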

* Initialize RoBERTa tokenizer.

In [8]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-large')

* Create a training function for the RoBERTa models for sequence classification.
* Parameters:
    - texts: List of training input texts.
    - labels: Corresponding labels for training texts.
    - num_labels: Number of classification labels.
    - val_texts: (Optional) List of validation input texts.
    - val_labels: (Optional) Corresponding labels for validation texts.
    - epochs: Number of training epochs (default: 5).
    - batch_size: Training batch size (default: 16).
* Returns:
    - Trained RoBERTa model.

In [9]:
def train_roberta_model(texts, labels, num_labels, val_texts=None, val_labels=None, epochs=5, batch_size=16):
    train_dataset = TextDataset(texts, labels, tokenizer, max_len=128)
    val_dataset = TextDataset(val_texts, val_labels, tokenizer, max_len=128) if val_texts else None

    model = RobertaForSequenceClassification.from_pretrained('roberta-large', num_labels=num_labels)
    
    # Specify training arguments
    training_args = TrainingArguments(
        output_dir="./tmp",
        num_train_epochs=epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        logging_steps=10,
        eval_strategy="epoch" if val_dataset is not None else "no",  # evaluating every epoch needs a validation set
        save_strategy="no",
        save_total_limit=0,
        weight_decay=0.02,
        learning_rate=3e-5,
        warmup_steps=500,
        fp16=True
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset
    )
    
    trainer.train()
    return model
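
It can help to check how `warmup_steps=500` relates to the total number of optimizer steps, since both depend on the dataset size, the batch size and the number of epochs. The sketch below assumes the Trainer's default linear schedule with warmup, a single device and no gradient accumulation. The helper names and the training-set size of 5,000 are hypothetical and chosen only for illustration.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_warmup_lr(step, total_steps, warmup_steps=500, peak_lr=3e-5):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)


n = 5000  # hypothetical number of training examples
steps = total_optimizer_steps(n, batch_size=16, epochs=3)
print(steps)                                   # 939
print(round(500 / steps, 2))                   # 0.53 -> over half of training is warmup
print(linear_warmup_lr(250, steps))            # 1.5e-05, halfway through the ramp
```

Under these assumptions a short run spends a large share of its steps in warmup, which is worth keeping in mind when changing `epochs` or `batch_size`.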

* Create a function that evaluates a trained model on a given dataloader.
* Parameters:
    - model: Trained model to be evaluated.
    - dataloader: DataLoader containing the evaluation dataset.
    - label_encoder: Encoder to transform numeric labels back to original string labels.
* Returns:
    - predicted_labels: List of predicted labels in their original form.
    - gold_labels: List of ground truth labels in their original form.

In [10]:
def evaluate_model_with_dataloader(model, dataloader, label_encoder):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    total_predictions = []
    total_labels = []

    with torch.no_grad():
        for batch in dataloader:
            # Move batch to the same device as the model
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits

            # Get predictions
            predictions = torch.argmax(logits, dim=-1)
            total_predictions.extend(predictions.cpu().numpy())
            total_labels.extend(labels.cpu().numpy())

    # Convert predictions and labels back to original string labels
    predicted_labels = label_encoder.inverse_transform(total_predictions)
    gold_labels = label_encoder.inverse_transform(total_labels)

    # Print classification report
    print(classification_report(gold_labels, predicted_labels, zero_division=0))

    return predicted_labels, gold_labels

* Create a function that makes predictions using a trained model on a list of input texts.
* Parameters:
    - texts: List of input texts to classify.
    - model: Trained model for prediction.
    - label_encoder: Encoder to transform numerical predictions back to string labels.
    - tokenizer: Pretrained tokenizer to process the input texts.
* Returns:
    - List of predicted labels in their original string format.

In [11]:
def predict(texts, model, label_encoder, tokenizer):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()

    # Tokenize the input texts
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    inputs = {key: value.to(device) for key, value in inputs.items()} 

    # Make predictions
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)

    # Decode predictions back to string labels
    return label_encoder.inverse_transform(predictions.cpu().numpy().tolist())
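
`predict` tokenizes all texts and runs them through the model in a single forward pass, which may need a lot of memory for long lists. One possible alternative, sketched here and not part of the notebook, is to call it on chunks. `iter_batches` and `predict_in_batches` are hypothetical helpers, and `predict_fn` stands for any function that maps a list of texts to a list of labels.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(texts, predict_fn, batch_size=32):
    """Apply predict_fn chunk by chunk and concatenate the results in order."""
    labels = []
    for batch in iter_batches(texts, batch_size):
        labels.extend(predict_fn(batch))
    return labels


# Stand-in for a real model call, used only to show that order is preserved
stub_predict = lambda batch: [len(text) for text in batch]

print(list(iter_batches([1, 2, 3, 4, 5], 2)))                # [[1, 2], [3, 4], [5]]
print(predict_in_batches(["a", "bb", "ccc"], stub_predict, 2))  # [1, 2, 3]
```

With the notebook's function, `predict_fn` could be something like `lambda batch: predict(batch, model, label_encoder, tokenizer)`.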

# ST1: Text classification for food hazard prediction, predicting the type of hazard and product

## Encode labels and train the models

* Encode hazard category and product category labels.

In [12]:
# Encode labels
label_encoder_hazard_category = LabelEncoder()
label_encoder_product_category = LabelEncoder()

# Fit Transform train set
train_set['hazard_category_encoded'] = label_encoder_hazard_category.fit_transform(train_set['hazard-category'])
train_set['product_category_encoded'] = label_encoder_product_category.fit_transform(train_set['product-category'])

# Transform dev and test sets
dev_set['hazard_category_encoded'] = label_encoder_hazard_category.transform(dev_set['hazard-category'])
dev_set['product_category_encoded'] = label_encoder_product_category.transform(dev_set['product-category'])

test_set['hazard_category_encoded'] = label_encoder_hazard_category.transform(test_set['hazard-category'])
test_set['product_category_encoded'] = label_encoder_product_category.transform(test_set['product-category'])

* Train a model to predict the hazard categories.
* Arguments:
    - Training input texts.
    - Corresponding encoded labels.
    - Number of unique hazard categories.
    - Validation input texts.
    - Corresponding validation labels.
    - Number of epochs (3).

In [13]:
hazard_category_model = train_roberta_model(
    train_set['cleaned_text'].tolist(),
    train_set['hazard_category_encoded'].tolist(),
    num_labels=len(label_encoder_hazard_category.classes_),
    val_texts=dev_set['cleaned_text'].tolist(),
    val_labels=dev_set['hazard_category_encoded'].tolist(),
    epochs=3
)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.2025 | 0.342965 |
| 2 | 0.3448 | 0.279363 |
| 3 | 0.1394 | 0.274214 |


* Train a model to predict the product categories.
* Arguments:
    - Training input texts.
    - Corresponding encoded labels.
    - Number of unique product categories.
    - Validation input texts.
    - Corresponding validation labels.
    - Number of epochs (4).

In [14]:
product_category_model = train_roberta_model(
    train_set['cleaned_text'].tolist(),
    train_set['product_category_encoded'].tolist(),
    num_labels=len(label_encoder_product_category.classes_),
    val_texts=dev_set['cleaned_text'].tolist(),
    val_labels=dev_set['product_category_encoded'].tolist(),
    epochs=4
)



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.7796 | 1.046289 |
| 2 | 0.7386 | 0.896073 |
| 3 | 0.5256 | 0.806444 |
| 4 | 0.3024 | 0.793801 |


## Evaluate the hazard category model

* Evaluate the model based on the validation data and add predictions to the dev_set.

In [15]:
# Define a DataLoader for the development set
dev_dataset_hazard_category = TextDataset(
    dev_set['cleaned_text'].tolist(),
    dev_set['hazard_category_encoded'].tolist(),
    tokenizer,
    max_len=128
)

dev_dataloader_hazard_category = DataLoader(dev_dataset_hazard_category, batch_size=8, shuffle=False)

# Evaluate the hazard model
predicted_labels, gold_labels = evaluate_model_with_dataloader(
    hazard_category_model, dev_dataloader_hazard_category, label_encoder_hazard_category
)

# Store predictions in the dev_set DataFrame
dev_set['predictions-hazard-category'] = predicted_labels

                                precision    recall  f1-score   support

                     allergens       0.94      0.98      0.96       207
                    biological       0.96      0.99      0.97       194
                      chemical       0.88      0.82      0.85        28
food additives and flavourings       1.00      0.50      0.67         2
                foreign bodies       0.98      0.94      0.96        63
                         fraud       0.91      0.71      0.79        41
          organoleptic aspects       0.70      0.88      0.78         8
                  other hazard       0.82      0.64      0.72        14
              packaging defect       0.62      0.62      0.62         8

                      accuracy                           0.94       565
                     macro avg       0.87      0.79      0.81       565
                  weighted avg       0.94      0.94      0.93       565



* Evaluate the model based on the test data and add predictions to the test_set.

In [16]:
# Define a DataLoader for the test set
test_dataset_hazard_category = TextDataset(
    test_set['cleaned_text'].tolist(),
    test_set['hazard_category_encoded'].tolist(),
    tokenizer,
    max_len=128
)

test_dataloader_hazard_category = DataLoader(test_dataset_hazard_category, batch_size=8, shuffle=False)

# Evaluate the hazard model
predicted_labels, gold_labels = evaluate_model_with_dataloader(
    hazard_category_model, test_dataloader_hazard_category, label_encoder_hazard_category
)

# Store predictions in the test_set DataFrame
test_set['predictions-hazard-category'] = predicted_labels

                                precision    recall  f1-score   support

                     allergens       0.96      0.96      0.96       365
                    biological       0.98      0.97      0.98       343
                      chemical       0.98      0.90      0.94        52
food additives and flavourings       1.00      0.50      0.67         4
                foreign bodies       0.96      0.97      0.97       111
                         fraud       0.77      0.79      0.78        75
                     migration       0.00      0.00      0.00         1
          organoleptic aspects       0.83      1.00      0.91        10
                  other hazard       0.66      0.81      0.72        26
              packaging defect       0.82      0.90      0.86        10

                      accuracy                           0.94       997
                     macro avg       0.80      0.78      0.78       997
                  weighted avg       0.94      0.94      0.94 

## Evaluate the product category model

* Evaluate the model based on the validation data and add predictions to the dev_set.

In [17]:
# Define a DataLoader for the development set
dev_dataset_product_category = TextDataset(
    dev_set['cleaned_text'].tolist(),
    dev_set['product_category_encoded'].tolist(),
    tokenizer,
    max_len=128
)

dev_dataloader_product_category = DataLoader(dev_dataset_product_category, batch_size=8, shuffle=False)

# Evaluate the product model
predicted_labels, gold_labels = evaluate_model_with_dataloader(
    product_category_model, dev_dataloader_product_category, label_encoder_product_category
)

# Store predictions in the dev_set DataFrame
dev_set['predictions-product-category'] = predicted_labels

                                                   precision    recall  f1-score   support

                              alcoholic beverages       0.88      1.00      0.93         7
                      cereals and bakery products       0.84      0.76      0.80        75
     cocoa and cocoa preparations, coffee and tea       0.50      0.73      0.59        15
                                    confectionery       0.72      0.50      0.59        26
dietetic foods, food supplements, fortified foods       0.58      0.79      0.67        14
                                    fats and oils       1.00      0.75      0.86         4
                                   feed materials       0.00      0.00      0.00         1
                            fruits and vegetables       0.73      0.79      0.76        52
                                 herbs and spices       0.73      0.84      0.78        19
                                ices and desserts       0.88      0.88      0.88        2

* Evaluate the model based on the test data and add predictions to the test_set.

In [18]:
# Define a DataLoader for the test set
test_dataset_product_category = TextDataset(
    test_set['cleaned_text'].tolist(),
    test_set['product_category_encoded'].tolist(),
    tokenizer,
    max_len=128
)

test_dataloader_product_category = DataLoader(test_dataset_product_category, batch_size=8, shuffle=False)

# Evaluate the product model
predicted_labels, gold_labels = evaluate_model_with_dataloader(
    product_category_model, test_dataloader_product_category, label_encoder_product_category
)

# Store predictions in the test_set DataFrame
test_set['predictions-product-category'] = predicted_labels

                                                   precision    recall  f1-score   support

                              alcoholic beverages       1.00      0.94      0.97        16
                      cereals and bakery products       0.84      0.78      0.81       121
     cocoa and cocoa preparations, coffee and tea       0.85      0.93      0.89        42
                                    confectionery       0.72      0.79      0.75        33
dietetic foods, food supplements, fortified foods       0.69      0.77      0.73        26
                                    fats and oils       1.00      0.83      0.91         6
                   food additives and flavourings       1.00      0.25      0.40         4
                            fruits and vegetables       0.82      0.82      0.82       103
                                 herbs and spices       0.64      0.80      0.71        20
                            honey and royal jelly       1.00      0.50      0.67         

# ST2: Food hazard and product “vector” detection, predicting the exact hazard and product

## Encode labels and train the models

* Encode hazard and product labels.

In [19]:
# Encode labels
label_encoder_hazard = LabelEncoder()
label_encoder_product = LabelEncoder()

# Fit Transform train set
train_set['hazard_encoded'] = label_encoder_hazard.fit_transform(train_set['hazard'])
train_set['product_encoded'] = label_encoder_product.fit_transform(train_set['product'])

# Add an "unknown" class to the product label encoder only.
# The hazard encoder has no such class, so a hazard unseen in training would raise a ValueError below.
label_encoder_product.classes_ = np.append(label_encoder_product.classes_, 'unknown')

# Map labels that were not seen during training to the "unknown" class
def encode_with_unknown(encoder, column):
    return [encoder.transform([x])[0] if x in encoder.classes_ else encoder.transform(['unknown'])[0] for x in column]

# Transform dev and test sets with handling for unseen categories
dev_set['hazard_encoded'] = encode_with_unknown(label_encoder_hazard, dev_set['hazard'])
dev_set['product_encoded'] = encode_with_unknown(label_encoder_product, dev_set['product'])

test_set['hazard_encoded'] = encode_with_unknown(label_encoder_hazard, test_set['hazard'])
test_set['product_encoded'] = encode_with_unknown(label_encoder_product, test_set['product'])
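
The idea behind the "unknown" fallback can be shown with a plain-Python analogue. This is a sketch that does not use `LabelEncoder`: `build_index` and `encode_with_fallback` are hypothetical helpers and the product names are invented. A dictionary lookup also avoids one `transform` call per row.

```python
def build_index(train_labels, unknown="unknown"):
    """Map each training label to an integer and reserve one id for unseen labels."""
    classes = sorted(set(train_labels))
    if unknown not in classes:
        classes.append(unknown)  # the fallback class gets the last id
    return {label: i for i, label in enumerate(classes)}


def encode_with_fallback(index, labels, unknown="unknown"):
    """Encode labels, sending anything missing from the index to the fallback id."""
    return [index.get(label, index[unknown]) for label in labels]


index = build_index(["milk", "almonds", "milk"])
print(index)                                           # {'almonds': 0, 'milk': 1, 'unknown': 2}
print(encode_with_fallback(index, ["milk", "tofu"]))   # [1, 2]
```

Here "tofu" never appeared in the training labels, so it is encoded with the id reserved for "unknown".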

* Train a model to predict the hazards.
* Arguments:
    - Training input texts.
    - Corresponding encoded labels.
    - Number of unique hazards.
    - Validation input texts.
    - Corresponding validation labels.
    - Number of epochs (7).
    - Batch size (32).

In [20]:
hazard_model = train_roberta_model(
    train_set['cleaned_text'].tolist(),
    train_set['hazard_encoded'].tolist(),
    num_labels=len(label_encoder_hazard.classes_),
    val_texts=dev_set['cleaned_text'].tolist(),
    val_labels=dev_set['hazard_encoded'].tolist(),
    epochs=7,
    batch_size=32
)



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 3.2285 | 2.582979 |
| 2 | 1.198 | 1.11255 |
| 3 | 0.8332 | 0.885747 |
| 4 | 0.6366 | 0.732694 |
| 5 | 0.6717 | 0.696913 |
| 6 | 0.3267 | 0.653181 |
| 7 | 0.3215 | 0.62965 |


* Train a model to predict the products.
* Arguments:
    - Training input texts.
    - Corresponding encoded labels.
    - Number of unique products.
    - Validation input texts.
    - Corresponding validation labels.
    - Number of epochs (10).
    - Batch size (32).

In [21]:
product_model = train_roberta_model(
    train_set['cleaned_text'].tolist(),
    train_set['product_encoded'].tolist(),
    num_labels=len(label_encoder_product.classes_),
    val_texts=dev_set['cleaned_text'].tolist(),
    val_labels=dev_set['product_encoded'].tolist(),
    epochs=10,
    batch_size=32
)



| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 6.6808 | 6.544946 |
| 2 | 5.6455 | 5.47571 |
| 3 | 4.7634 | 4.476463 |
| 4 | 3.9504 | 3.983412 |
| 5 | 3.4142 | 3.710415 |
| 6 | 3.1687 | 3.497015 |
| 7 | 2.5473 | 3.40929 |
| 8 | 2.3377 | 3.286588 |
| 9 | 2.0291 | 3.255643 |
| 10 | 1.9272 | 3.241041 |


## Evaluate the hazard model

* Evaluate the model based on the validation data and add predictions to the dev_set.

In [22]:
# Define a DataLoader for the development set
dev_dataset_hazard = TextDataset(
    dev_set['cleaned_text'].tolist(),
    dev_set['hazard_encoded'].tolist(),
    tokenizer,
    max_len=128
)

dev_dataloader_hazard = DataLoader(dev_dataset_hazard, batch_size=8, shuffle=False)

# Evaluate the hazard model
predicted_labels, gold_labels = evaluate_model_with_dataloader(
    hazard_model, dev_dataloader_hazard, label_encoder_hazard
)

# Store predictions in the dev_set DataFrame
dev_set['predictions-hazard'] = predicted_labels

                                                 precision    recall  f1-score   support

                                      Aflatoxin       1.00      1.00      1.00         1
                                 abnormal smell       0.00      0.00      0.00         1
                                      allergens       0.00      0.00      0.00         1
                                         almond       1.00      0.86      0.92         7
                         antibiotics, vet drugs       0.00      0.00      0.00         1
                                  bacillus spp.       1.00      1.00      1.00         2
                           bad smell / off odor       0.50      1.00      0.67         1
                                  bone fragment       1.00      1.00      1.00         1
                                     brazil nut       0.00      0.00      0.00         1
                              bulging packaging       1.00      0.67      0.80         3
                    

* Evaluate the model based on the test data and add predictions to the test_set.

In [23]:
# Define a DataLoader for the test set
test_dataset_hazard = TextDataset(
    test_set['cleaned_text'].tolist(),
    test_set['hazard_encoded'].tolist(),
    tokenizer,
    max_len=128
)

test_dataloader_hazard = DataLoader(test_dataset_hazard, batch_size=8, shuffle=False)

# Evaluate the hazard model
predicted_labels, gold_labels = evaluate_model_with_dataloader(
    hazard_model, test_dataloader_hazard, label_encoder_hazard
)

# Store predictions in the test_set DataFrame
test_set['predictions-hazard'] = predicted_labels

                                                 precision    recall  f1-score   support

                                      Aflatoxin       1.00      1.00      1.00         2
                                 abnormal smell       0.00      0.00      0.00         1
                                alcohol content       0.00      0.00      0.00         1
                                      alkaloids       1.00      1.00      1.00         1
                                      allergens       0.00      0.00      0.00         3
                                         almond       0.92      0.92      0.92        13
                                      amygdalin       0.00      0.00      0.00         1
                         antibiotics, vet drugs       0.00      0.00      0.00         1
                                  bacillus spp.       1.00      1.00      1.00         3
                           bad smell / off odor       1.00      1.00      1.00         1
                    

## Evaluate the product model

* Evaluate the model based on the validation data and add predictions to the dev_set.

In [24]:
# Define a DataLoader for the development set
dev_dataset_product = TextDataset(
    dev_set['cleaned_text'].tolist(),
    dev_set['product_encoded'].tolist(),
    tokenizer,
    max_len=128
)

dev_dataloader_product = DataLoader(dev_dataset_product, batch_size=8, shuffle=False)

# Evaluate the product model
predicted_labels, gold_labels = evaluate_model_with_dataloader(
    product_model, dev_dataloader_product, label_encoder_product
)

# Store predictions in the dev_set DataFrame
dev_set['predictions-product'] = predicted_labels

                                                      precision    recall  f1-score   support

                              Catfishes (freshwater)       0.67      1.00      0.80         2
                                     Dried pork meat       0.00      0.00      0.00         1
                               Fishes not identified       0.50      0.60      0.55         5
                 Precooked cooked pork meat products       0.00      0.00      0.00         1
                                       Veggie Burger       0.00      0.00      0.00         0
                                     alfalfa sprouts       0.50      1.00      0.67         1
                                               algae       0.00      0.00      0.00         0
                                         almond milk       1.00      1.00      1.00         1
                                     almond products       0.00      0.00      0.00         0
                                             almonds       

* Evaluate the model based on the test data and add predictions to the test_set.

In [25]:
# Define a DataLoader for the test set
test_dataset_product = TextDataset(
    test_set['cleaned_text'].tolist(),
    test_set['product_encoded'].tolist(),
    tokenizer,
    max_len=128
)

test_dataloader_product = DataLoader(test_dataset_product, batch_size=8, shuffle=False)

# Evaluate the product model
predicted_labels, gold_labels = evaluate_model_with_dataloader(
    product_model, test_dataloader_product, label_encoder_product
)

# Store predictions in the test_set DataFrame
test_set['predictions-product'] = predicted_labels

                                                    precision    recall  f1-score   support

                            Catfishes (freshwater)       0.50      1.00      0.67         3
                                   Dried pork meat       0.00      0.00      0.00         1
                             Fishes not identified       0.44      0.88      0.58         8
                          Not classified pork meat       0.00      0.00      0.00         3
               Precooked cooked pork meat products       0.00      0.00      0.00         1
                                     Veggie Burger       0.25      1.00      0.40         1
                                   alfalfa sprouts       0.00      0.00      0.00         0
                                             algae       1.00      1.00      1.00         1
                                    almond kernels       0.00      0.00      0.00         1
                                   almond products       1.00      1.00      1.

# Compute final scores

* Compute and print the final score based on the validation data.

In [26]:
score_st1 = compute_score(
    dev_set['hazard-category'], dev_set['product-category'],
    dev_set['predictions-hazard-category'], dev_set['predictions-product-category']
)
print(f"Score Sub-Task 1: {score_st1:.3f}")

score_st2 = compute_score(
    dev_set['hazard'], dev_set['product'],
    dev_set['predictions-hazard'], dev_set['predictions-product'])
print(f"Score Sub-Task 2: {score_st2:.3f}")

Score Sub-Task 1: 0.783
Score Sub-Task 2: 0.459


* Compute and print the final score based on the test data.

In [27]:
score_st1 = compute_score(
    test_set['hazard-category'], test_set['product-category'],
    test_set['predictions-hazard-category'], test_set['predictions-product-category']
)
print(f"Score Sub-Task 1: {score_st1:.3f}")

score_st2 = compute_score(
    test_set['hazard'], test_set['product'],
    test_set['predictions-hazard'], test_set['predictions-product'])
print(f"Score Sub-Task 2: {score_st2:.3f}")

Score Sub-Task 1: 0.759
Score Sub-Task 2: 0.426


# Predict Test Set

* Create a df to store the predictions.

In [28]:
predictions = pd.DataFrame()

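* Predict the ST1 labels for the test set from the raw report titles and store them in the predictions df. Note that the models were trained on the cleaned title-plus-text input, so these title-only predictions can differ from the ones evaluated above.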

In [29]:
predictions['hazard-category'] = predict(test_set.title.to_list(), hazard_category_model, label_encoder_hazard_category, tokenizer)
predictions['product-category'] = predict(test_set.title.to_list(), product_category_model, label_encoder_product_category, tokenizer)

* Predict the ST2 labels for the test set from the raw report titles and store them in the predictions df.

In [30]:
predictions['hazard'] = predict(test_set.title.to_list(), hazard_model, label_encoder_hazard, tokenizer)
predictions['product'] = predict(test_set.title.to_list(), product_model, label_encoder_product, tokenizer)

* Create a new folder named 'submission' if it doesn't already exist.
* Save predictions to 'submission.csv' inside the 'submission' folder and zip it.

In [31]:
# Save predictions to a new folder
os.makedirs('./submission/', exist_ok=True)
predictions.to_csv('./submission/submission.csv')

# Zip the folder
make_archive('./submission', 'zip', './submission')

'/home/ec2-user/SageMaker/submission.zip'
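
Before uploading, it may be worth checking that the archive holds `submission.csv` at its top level. The sketch below reproduces the folder-then-zip steps in a temporary directory and lists the archive contents. The one-line CSV is only a stand-in for the real predictions.

```python
import os
import tempfile
import zipfile
from shutil import make_archive

with tempfile.TemporaryDirectory() as workdir:
    # Recreate the layout used above: a 'submission' folder holding one CSV
    folder = os.path.join(workdir, "submission")
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, "submission.csv"), "w") as f:
        f.write(",hazard-category,product-category,hazard,product\n")  # stand-in content

    # Zip the folder contents; the archive is written next to the folder
    archive_path = make_archive(os.path.join(workdir, "submission"), "zip", folder)

    # List what ended up inside the archive
    with zipfile.ZipFile(archive_path) as zf:
        archived_files = zf.namelist()

print(archived_files)  # ['submission.csv']
```

Because the folder itself is passed as the root directory, the file is stored without a `submission/` prefix inside the zip.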