# NLP 2025 Track A -- multilabel text classification

authors:
- Anne Marschner
- Arshia Orangkhadivi
- Jafar Zohourian Moftakharahmadi
- Kirill Kuznetsov
- Moritz Groß

---

### Task description (from Milestone 2)

Given a target text snippet, we aim to predict the
perceived emotion(s) of the speaker. Specifically, we
decide whether each of the following emotions
applies to the text: anger, fear, joy, sadness, surprise.
This is a multi-label classification task, since
multiple non-exclusive labels may be assigned to
each instance of text.
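
To make the multi-label setup concrete, the following minimal sketch encodes a set of emotions as a 0/1 vector with one position per emotion. The helper names `to_multi_hot` and `from_multi_hot` are illustrative assumptions and are not part of the project code.

```python
# The five emotions from the task description, in a fixed order.
LABELS = ["anger", "fear", "joy", "sadness", "surprise"]

def to_multi_hot(emotions):
    """Encode a set of emotion names as a 0/1 vector aligned with LABELS."""
    return [1 if label in emotions else 0 for label in LABELS]

def from_multi_hot(vector):
    """Decode a 0/1 vector back into the list of emotion names."""
    return [label for label, flag in zip(LABELS, vector) if flag]

# Hypothetical annotations: one text may carry several emotions, or none at all.
print(to_multi_hot({"fear", "surprise"}))   # [0, 1, 0, 0, 1]
print(to_multi_hot(set()))                  # [0, 0, 0, 0, 0]
print(from_multi_hot([0, 0, 1, 0, 1]))      # ['joy', 'surprise']
```

Because the labels are independent, each position is a separate yes/no decision rather than one choice among five classes.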

Install the Python packages. The `transformers` library by HuggingFace provides access to popular pretrained models. We use the model `distilbert-base-uncased`. DistilBERT is a 67-million-parameter model distilled from the BERT model (2018); it retains nearly all of BERT's performance while being reasonably fast to run.

The `datasets` library, also made by HuggingFace, provides clean and easy data handling for our models and integrates well with `transformers`.

In [57]:
!pip -q install -U transformers datasets


In [58]:
import torch, sklearn, numpy as np, pandas as pd
import transformers, datasets   # both from HuggingFace

transformers.set_seed(42)       # seed the Python, NumPy and PyTorch RNGs for reproducibility

In [59]:
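print(torch.backends.mps.is_available())  # check that PyTorch can use the Apple-silicon GPU (MPS backend)
# note: `model` is only defined in the training cell further down, so run this line after that cell
print(next(model.parameters()).device)    # should print 'mps:0'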

True
mps:0


## Data Wrangling

The provided CSV file is loaded and processed for *training*, *validation* and *testing*.


In [60]:
csv_path = "track-a.csv"

In [61]:
label_cols = ["anger", "fear", "joy", "sadness", "surprise"]

raw_ds = datasets.load_dataset("csv", data_files=csv_path)["train"] # all rows

def add_labels(ex):
    ex["labels"] = [ex[c] for c in label_cols]
    return ex

raw_ds = raw_ds.map(add_labels, remove_columns=label_cols + ["id"])

# split 80 / 10 / 10 into train / eval / test
train_tmp  = raw_ds.train_test_split(test_size=0.20, seed=42)
eval_test  = train_tmp["test"].train_test_split(test_size=0.50, seed=42)
ds = {"train": train_tmp["train"],
      "eval" : eval_test["train"],
      "test" : eval_test["test"]}

for split in ds:
    ds[split] = ds[split].cast_column("labels",
                                      datasets.Sequence(datasets.Value("float32"))
                                      )
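
The two-stage split above holds out 20% of the rows and then halves that part, which gives roughly 80% / 10% / 10%. The sketch below reproduces only this arithmetic in plain Python. The function `three_way_split` and the row count of 1000 are made up for illustration, and the shuffling does not match the row order that `datasets` produces.

```python
import random

def three_way_split(rows, holdout=0.20, seed=42):
    """Shuffle rows, hold out `holdout` of them, and halve the held-out part."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_holdout = round(len(rows) * holdout)
    held_out, train = rows[:n_holdout], rows[n_holdout:]
    n_eval = len(held_out) // 2
    return {"train": train, "eval": held_out[:n_eval], "test": held_out[n_eval:]}

splits = three_way_split(range(1000))  # 1000 is a made-up row count
print({name: len(part) for name, part in splits.items()})
# {'train': 800, 'eval': 100, 'test': 100}
```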

### Data imbalance

As mentioned in the project description, the labels are imbalanced: for most labels, only a minority of the sentences are positive, most notably anger at just 12 percent. Fear is the only label that is positive in more than half of the sentences (58 percent).

In [62]:
pd.read_csv(csv_path)[label_cols].mean()

anger       0.120303
fear        0.582009
joy         0.243497
sadness     0.317197
surprise    0.303107
dtype: float64
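
The notebook trains without any correction for this imbalance. One common option, not used here, is to up-weight the positive examples of rare labels by the ratio of negatives to positives. The sketch below computes such weights from the rates printed above. Passing them on, for example via the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`, would require a custom loss and is only an assumption about how one might proceed.

```python
# Positive rates copied from the output above.
positive_rate = {
    "anger": 0.120303,
    "fear": 0.582009,
    "joy": 0.243497,
    "sadness": 0.317197,
    "surprise": 0.303107,
}

# Ratio of negatives to positives per label: rarer labels get larger weights.
pos_weight = {label: (1.0 - p) / p for label, p in positive_rate.items()}

for label, weight in pos_weight.items():
    print(f"{label:<8}: {weight:.2f}")
```

With these rates, anger receives by far the largest weight, while fear, being the majority case, receives a weight below 1.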

## Tokenization

In [63]:
checkpoint = "distilbert-base-uncased"
tok = transformers.AutoTokenizer.from_pretrained(checkpoint)

def tok_fn(x):
    # no padding here; the Trainer pads each batch dynamically via its default collator
    return tok(x["text"])

tok_ds  = {k: v.map(tok_fn, batched=True, remove_columns=["text"])
           for k, v in ds.items()}

## AI Training

In [64]:
model = transformers.AutoModelForSequenceClassification.from_pretrained(
            checkpoint,
            num_labels=len(label_cols),
            problem_type="multi_label_classification")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs  = torch.sigmoid(torch.tensor(logits)).numpy()
    preds  = (probs > 0.5).astype(int)
    micro_f1 = sklearn.metrics.f1_score(labels, preds, average="micro")
    return {"micro_f1": micro_f1}

args = transformers.TrainingArguments(
    output_dir="output_dir",
    eval_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    report_to="none",
    logging_first_step=True,
    learning_rate=1e-5, # small data -> small lr; default is 5e-5
    weight_decay=0.5, # regularization
)

trainer = transformers.Trainer(model, args,
                  train_dataset   =tok_ds["train"],
                  eval_dataset    =tok_ds["eval"],
                  processing_class=tok,  # `tokenizer=` is deprecated in recent transformers releases
                  compute_metrics =compute_metrics)

trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


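| Epoch | Training Loss | Validation Loss | Micro F1 |
|-------|---------------|-----------------|----------|
| 1     | 0.5684        | 0.505985        | 0.464968 |
| 2     | 0.4713        | 0.449421        | 0.639676 |
| 3     | 0.4126        | 0.436499        | 0.659898 |
| 4     | 0.3796        | 0.42577         | 0.667494 |
| 5     | 0.3622        | 0.422834        | 0.667513 |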




TrainOutput(global_step=695, training_loss=0.43899146627179153, metrics={'train_runtime': 489.129, 'train_samples_per_second': 22.632, 'train_steps_per_second': 1.421, 'total_flos': 146822611020420.0, 'train_loss': 0.43899146627179153, 'epoch': 5.0})

In [65]:
test_out    = trainer.predict(tok_ds["test"])
test_scores = compute_metrics((test_out.predictions, test_out.label_ids))
print(f"Test micro-F1: {test_scores['micro_f1']:.3f}")



Test micro-F1: 0.627
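
The micro-F1 used throughout pools true positives, false positives and false negatives over every (text, emotion) cell before computing a single F1 value. The following sketch spells out this computation in plain Python, using the same sigmoid and 0.5 threshold as `compute_metrics`. The logits and labels are made up for illustration, and `micro_f1` is a hypothetical helper, not the scikit-learn implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def micro_f1(logits, labels, threshold=0.5):
    """Micro-F1: pool TP, FP and FN over all (row, label) cells, then compute F1 once."""
    tp = fp = fn = 0
    for row_logits, row_labels in zip(logits, labels):
        for z, y in zip(row_logits, row_labels):
            pred = 1 if sigmoid(z) > threshold else 0
            tp += pred == 1 and y == 1
            fp += pred == 1 and y == 0
            fn += pred == 0 and y == 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up logits and gold labels for two texts and five emotions.
logits = [[-2.0, 1.5, -0.3, 0.8, -1.0],
          [-1.2, -0.4, 2.1, -0.9, 0.6]]
labels = [[0, 1, 0, 0, 0],
          [0, 0, 1, 0, 0]]
print(f"{micro_f1(logits, labels):.3f}")  # 0.667
```

Because all cells are pooled, frequent labels such as fear influence micro-F1 more than rare ones such as anger.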


In [75]:
# -- predictions -> 0/1 --------------------------------------------------------
probs = torch.sigmoid(torch.tensor(test_out.predictions)).numpy()
preds = (probs > 0.5).astype(int)
true  = test_out.label_ids

# -- 1. standard per-label P/R/F1/support --------------------------------------
print(sklearn.metrics.classification_report(true, preds,
                            target_names=label_cols, zero_division=0))

# -- 2. per-label accuracy (binary accuracy for each emotion) ------------------
print("\nPer-label accuracy:")
for lbl, acc in zip(label_cols, (preds == true).mean(axis=0)):
    print(f"  {lbl:<8}: {acc:.4f}") # align prints nicely

# -- 3. per-row accuracies -----------------------------------------------------
print("\nGlobal Accuracies")
print(f"exactly correct rows        : {(preds == true).all(axis=1).mean():.3f}")
print(f"accuracy across all fields  : {(preds == true).mean():.3f}")

              precision    recall  f1-score   support

       anger       0.00      0.00      0.00        37
        fear       0.67      0.83      0.74       143
         joy       0.75      0.53      0.62        79
     sadness       0.61      0.55      0.58        84
    surprise       0.74      0.49      0.59        79

   micro avg       0.68      0.58      0.63       422
   macro avg       0.55      0.48      0.51       422
weighted avg       0.63      0.58      0.59       422
 samples avg       0.58      0.55      0.54       422


Per-label accuracy:
  anger   : 0.8664
  fear    : 0.7004
  joy     : 0.8159
  sadness : 0.7581
  surprise: 0.8051

Global Accuracies
exactly correct rows        : 0.343
accuracy across all fields  : 0.789
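
The aggregate numbers above can be recovered from the per-label ones. As a small check, the sketch below averages the printed per-label F1 scores and accuracies. The values are copied from the output above, and the variable names are chosen for illustration only.

```python
# Per-label F1 and accuracy copied from the report above.
f1 = {"anger": 0.00, "fear": 0.74, "joy": 0.62, "sadness": 0.58, "surprise": 0.59}
acc = {"anger": 0.8664, "fear": 0.7004, "joy": 0.8159, "sadness": 0.7581, "surprise": 0.8051}

macro_f1 = sum(f1.values()) / len(f1)        # unweighted mean over labels
overall_acc = sum(acc.values()) / len(acc)   # mean over all (row, label) cells

print(f"macro F1         : {macro_f1:.2f}")     # 0.51
print(f"overall accuracy : {overall_acc:.3f}")  # 0.789
```

Note that anger has the highest per-label accuracy and an F1 of 0 at the same time: because the label is rare, accuracy stays high even though no anger instance is recovered. This is why F1 is the more informative metric here.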


## Inference script

Try out the functions required by the project description. See `main.py` for details.

In [79]:
from main import predict_single_text, predict

print(predict_single_text("this is fantastic!"))

print("\n\n\n")

print(predict("track-a-head.csv"))

['joy', 'surprise']




[['fear', 'surprise'], [], ['fear', 'sadness'], ['joy'], ['fear', 'surprise'], ['fear', 'surprise'], ['fear'], ['fear'], ['fear', 'sadness']]
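
The contents of `main.py` are not shown here. As a rough sketch of the last step such a function needs, the code below turns per-emotion probabilities into a list of label names with the same 0.5 threshold used during evaluation. The helper `decode_labels` and the probabilities are hypothetical and do not reflect the actual implementation.

```python
LABELS = ["anger", "fear", "joy", "sadness", "surprise"]

def decode_labels(probs, threshold=0.5):
    """Return the names of all labels whose probability exceeds the threshold."""
    return [label for label, p in zip(LABELS, probs) if p > threshold]

# Made-up probabilities for two texts.
print(decode_labels([0.03, 0.10, 0.91, 0.05, 0.62]))  # ['joy', 'surprise']
print(decode_labels([0.20, 0.40, 0.30, 0.10, 0.45]))  # []
```

An empty list, as in the second example, means that no emotion passed the threshold, which matches the empty prediction in the output above.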
