<a href="https://colab.research.google.com/github/CcTheresa/Notia/blob/main/emotion_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook fine-tunes the DistilBERT model from Hugging Face to classify emotions in text.

I used the GoEmotions dataset because its multi-label annotations suit journal entries, which often reflect several emotions at once.

In [1]:
# Install the necessary libraries
!pip install -q transformers datasets scikit-learn

# Import the necessary dependencies
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding, TrainingArguments, Trainer #tokenizer & model, training args and trainer
from datasets import load_dataset, DatasetDict  #accessing the dataset
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import numpy as np

#Load dataset
ds = load_dataset("google-research-datasets/go_emotions", "simplified")

# Load tokenizer and model (the model is re-created for multi-label training in the training cell)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=27)


Generating train split:   0%|          | 0/43410 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/5426 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5427 [00:00<?, ? examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [2]:

# Tokenize the dataset so the model receives a consistent input format
# The tokenizer converts raw text into token IDs
# padding="max_length" pads every sequence to the tokenizer's maximum length (no max_length is set here)
# truncation=True cuts off longer sequences so they fit the model
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

# Apply the tokenizer across the entire dataset
# batched=True = process multiple examples at once for speed
tokenized_dataset = ds.map(tokenize, batched=True)

# Check one sample after tokenization
print(tokenized_dataset["train"][0])


{'text': "My favourite food is anything I didn't have to cook myself.", 'labels': [27], 'id': 'eebbqej', 'input_ids': [101, 2026, 8837, 2833, 2003, 2505, 1045, 2134, 1005, 1056, 2031, 2000, 5660, 2870, 1012, 102, 0, 0, 0, ... [output truncated in the source; the visible remainder is zero padding]
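
The sample above has 16 real tokens followed by a long run of zeros. With `padding="max_length"` and no explicit `max_length`, every sequence is padded to the tokenizer's maximum (512 for `distilbert-base-uncased`, as far as I know), which is why the training cell later sets `MAX_LEN = 128`. Here is a minimal pure-Python sketch of what padding and truncation do to one token-ID list. `pad_or_truncate` is a hypothetical helper written for illustration, not part of `transformers`, and the pad ID of 0 is taken from the zeros in the printed output.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Hypothetical helper: mimic max_length padding/truncation on one sequence.

    Unlike the real tokenizer, this plain slice does not preserve the
    closing [SEP] token when it truncates.
    """
    ids = list(input_ids)[:max_length]          # truncate if too long
    attention_mask = [1] * len(ids)             # 1 marks a real token
    n_pad = max_length - len(ids)               # how many pad slots remain
    return ids + [pad_id] * n_pad, attention_mask + [0] * n_pad

# Token IDs copied from the printed sample above (16 tokens incl. [CLS]/[SEP])
sample_ids = [101, 2026, 8837, 2833, 2003, 2505, 1045, 2134,
              1005, 1056, 2031, 2000, 5660, 2870, 1012, 102]

ids, mask = pad_or_truncate(sample_ids, max_length=24)
print(len(ids), sum(mask))  # 24 16
```

The attention mask is what lets the model ignore the padded positions, so padding changes compute cost rather than the prediction.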

In [3]:

import numpy as np
THRESHOLD = 0.2

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compute_metrics(eval_pred):
    logits, labels = eval_pred  # model output vs. true labels
    logits = np.asarray(logits)
    labels = np.asarray(labels)

    # Convert logits -> probabilities unless already [0,1]
    if (logits.min() >= 0.0) and (logits.max() <= 1.0):
        probs = logits
    else:
        probs = sigmoid(logits)

    preds = (probs >= THRESHOLD).astype(int)

    return {
        # Micro: aggregates TP/FP/FN across all labels
        "precision_micro": precision_score(labels, preds, average="micro",  zero_division=0),
        "recall_micro":    recall_score(labels,  preds, average="micro",    zero_division=0),
        "f1_micro":        f1_score(labels,      preds, average="micro",    zero_division=0),

        # Macro: unweighted mean over labels (reveals rare-class weakness)
        "precision_macro": precision_score(labels, preds, average="macro",  zero_division=0),
        "recall_macro":    recall_score(labels,  preds, average="macro",    zero_division=0),
        "f1_macro":        f1_score(labels,      preds, average="macro",    zero_division=0),

        # Optional extras you can keep or drop:
        # "f1_samples":      f1_score(labels,      preds, average="samples",  zero_division=0),
        # "subset_accuracy": accuracy_score(labels, preds),
    }
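
To make the micro/macro distinction concrete, here is a small hand-rolled sketch on toy data (the logits and labels are invented for illustration, and the helper names are my own, not scikit-learn's). It applies the same sigmoid-then-threshold step as `compute_metrics`, then computes both averages from per-label TP/FP/FN counts.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def f1(tp, fp, fn):
    # F1 from raw counts; 0.0 when there is nothing to score (like zero_division=0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    # Micro: pool the counts over all labels, then compute one F1
    micro = f1(sum(c[0] for c in counts), sum(c[1] for c in counts), sum(c[2] for c in counts))
    # Macro: compute F1 per label, then take the unweighted mean
    macro = sum(f1(*c) for c in counts) / n_labels
    return micro, macro

THRESHOLD = 0.2
# Toy data: label 0 is frequent and well predicted, labels 1 and 2 are rare and missed
toy_logits = [[2.0, -3.0, -3.0],
              [1.5, -3.0, -3.0],
              [2.5, -3.0, -2.0],
              [-3.0, -3.0, -3.0]]
toy_labels = [[1, 0, 0],
              [1, 0, 0],
              [1, 0, 1],
              [0, 1, 0]]

toy_preds = [[int(sigmoid(z) >= THRESHOLD) for z in row] for row in toy_logits]
micro, macro = micro_macro_f1(toy_labels, toy_preds)
print(round(micro, 3), round(macro, 3))  # 0.75 0.333
```

The frequent label carries the micro score to 0.75, while the two missed rare labels pull the macro score down to 0.333. That gap is the "rare-class weakness" the macro metrics are meant to expose.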


In [4]:
# TRAINING BLOCK
# Assumes: `ds` (datasets.DatasetDict) and `tokenizer` are already in the notebook workspace.

# Imports
from transformers import (
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer
)
from datasets import Sequence, Value
import numpy as np
import torch


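# Constants
# NOTE: the "simplified" config also uses label index 27 (the first sample above has 'labels': [27]).
# With NUM_LABELS = 27, the range check in to_multilabel_flat drops that index,
# so examples carrying only label 27 become all-zero vectors.
NUM_LABELS = 27
MAX_LEN = 128

# Tokenize texts to a fixed length of MAX_LEN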
def tokenize_fn(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=MAX_LEN)

# Dataset mapping for multilabel flattening
def to_multilabel_flat(batch):

    out_labels = []
    for labs in batch["labels"]:
        # handle numpy / torch / nested single-element wrappers
        if isinstance(labs, (np.ndarray, torch.Tensor)):
            labs = labs.tolist()

        # if nested wrapper like [[...]] -> unwrap
        if isinstance(labs, list) and len(labs) == 1 and isinstance(labs[0], (list, np.ndarray, torch.Tensor)):
            labs = labs[0]
            if isinstance(labs, (np.ndarray, torch.Tensor)):
                labs = labs.tolist()

        # Case 1: list of integer indices -> build multi-hot vector
        if isinstance(labs, list) and all(isinstance(x, int) for x in labs):
            row = [0.0] * NUM_LABELS
            for idx in labs:
                if 0 <= idx < NUM_LABELS:
                    row[idx] = 1.0
            out_labels.append(row)
            continue

        # Case 2: already a sequence of 0/1 values or floats (maybe wrong length) -> normalize
        if isinstance(labs, list) and all(isinstance(x, (int, float)) for x in labs):
            # convert all to float
            arr = [float(x) for x in labs]
            # pad or truncate to NUM_LABELS
            if len(arr) < NUM_LABELS:
                arr = arr + [0.0] * (NUM_LABELS - len(arr))
            elif len(arr) > NUM_LABELS:
                arr = arr[:NUM_LABELS]
            out_labels.append(arr)
            continue

        # Fallback: unexpected shape/types -> create zero vector (safe)
        out_labels.append([0.0] * NUM_LABELS)

    batch["labels"] = out_labels
    return batch

tokenized = ds.map(tokenize_fn, batched=True)
tokenized = tokenized.map(to_multilabel_flat, batched=True)

# Force the Arrow schema of labels to float32 vectors of fixed length
for split in tokenized.keys():
    tokenized[split] = tokenized[split].cast_column(
        "labels", Sequence(Value("float32"), length=NUM_LABELS)
    )

# Set the dataset format so the Trainer receives torch tensors.
# Passing dtype as a dict here raised a TypeError earlier, so it is omitted; the float32 cast above already yields torch.float32 labels.
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])


# Sanity checks (run these and confirm)
example = tokenized["train"][0]
print("keys:", example.keys())
print("input_ids dtype:", example["input_ids"].dtype, "shape:", example["input_ids"].shape)
print("attention_mask dtype:", example["attention_mask"].dtype, "shape:", example["attention_mask"].shape)
print("labels dtype:", example["labels"].dtype, "shape:", example["labels"].shape)  # should be torch.float32 and (27,)
print("labels sample:", example["labels"])


keys: dict_keys(['labels', 'input_ids', 'attention_mask'])
input_ids dtype: torch.int64 shape: torch.Size([128])
attention_mask dtype: torch.int64 shape: torch.Size([128])
labels dtype: torch.float32 shape: torch.Size([27])
labels sample: tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0.])
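
The all-zero vector printed above follows from `NUM_LABELS = 27`: the first training example carries label index 27 (see `'labels': [27]` in the tokenized sample), and the `0 <= idx < NUM_LABELS` check silently skips it. As far as I know, index 27 is `neutral` in the simplified config; `ds["train"].features["labels"].feature.names` should confirm that. The sketch below mirrors Case 1 of `to_multilabel_flat` with a hypothetical helper, `to_multi_hot`, to show the effect of 27 versus 28 slots.

```python
def to_multi_hot(label_ids, num_labels):
    """Hypothetical helper mirroring Case 1: indices -> multi-hot float vector."""
    row = [0.0] * num_labels
    for idx in label_ids:
        if 0 <= idx < num_labels:   # out-of-range indices are dropped silently
            row[idx] = 1.0
    return row

first_example_labels = [27]  # taken from the printed sample

# With 27 slots, index 27 is out of range and the target is empty
print(sum(to_multi_hot(first_example_labels, 27)))  # 0.0
# With 28 slots, the label survives
print(sum(to_multi_hot(first_example_labels, 28)))  # 1.0
```

Whether dropping that class is desirable depends on the goal. If it is kept, `NUM_LABELS` would need to be 28 everywhere it is used, including the model head.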


In [5]:
import transformers, inspect
print("transformers version:", transformers.__version__)
from transformers import TrainingArguments
print("TrainingArguments class:", TrainingArguments)
print("Init signature:", inspect.signature(TrainingArguments.__init__))


transformers version: 4.56.2
TrainingArguments class: <class 'transformers.training_args.TrainingArguments'>


In [6]:
ex = tokenized["train"][0]
assert ex["labels"].dtype == torch.float32
assert ex["labels"].shape[-1] == NUM_LABELS
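
The float32 check matters because the multi-label setup is trained with a binary cross-entropy loss on raw logits (the training cell refers to it as `BCEWithLogitsLoss`), which treats every label as an independent 0/1 target and expects float targets. Below is a rough pure-Python sketch of that loss for a single example. `bce_with_logits` is my own illustrative function, not the PyTorch implementation, and the logits are made up.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of
        # -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

targets = [1.0, 0.0, 0.0]                              # multi-hot float target
loss_good = bce_with_logits([3.0, -3.0, -3.0], targets)  # confident and right
loss_bad = bce_with_logits([-3.0, 3.0, -3.0], targets)   # confident and wrong
print(loss_good < loss_bad)  # True
```

Because each label contributes its own term, the loss never forces the labels to compete the way a softmax would, which is what lets one journal entry score high on several emotions at once.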


In [7]:
# --- Hard-disable W&B ---
import os
os.environ["WANDB_DISABLED"] = "true"
os.environ["WANDB_MODE"] = "disabled"
os.environ["WANDB_SILENT"] = "true"

# --- Imports ---
import numpy as np
import torch
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from transformers import (
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)

# Assumes you already have: tokenizer, tokenized (DatasetDict), NUM_LABELS, MAX_LEN


train_subset = tokenized["train"].shuffle(seed=42).select(range(min(5000, len(tokenized["train"]))))
eval_subset  = tokenized["validation"].shuffle(seed=42).select(range(min(1000, len(tokenized["validation"]))))

# --- Model (multi-label) ---
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    problem_type="multi_label_classification",
    num_labels=NUM_LABELS,
)
model.config.problem_type = "multi_label_classification"
model.config.num_labels   = NUM_LABELS

# --- Optional collator that forces labels -> float32 (so BCEWithLogitsLoss works) ---
# Not passed to the Trainer below; the labels column is already float32 after cast_column.
base_collator = DataCollatorWithPadding(tokenizer=tokenizer)

def float_label_collator(features):
    # Separate labels first (they may be int64 in the dataset)
    labels = [f["labels"] for f in features]
    # Use base collator to pad input_ids / attention_mask
    batch = base_collator([{k: v for k, v in f.items() if k != "labels"} for f in features])
    # Cast each label vector to float32 and stack into (batch, NUM_LABELS);
    # torch.stack also works when the labels are already tensors
    batch["labels"] = torch.stack([torch.as_tensor(l, dtype=torch.float32) for l in labels])
    return batch

# --- Metrics (scikit-learn) ---
THRESHOLD = 0.2  # tune later or per-class

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    logits = np.asarray(logits); labels = np.asarray(labels)
    # logits -> probabilities via sigmoid (unless already in [0,1])
    if (logits.min() >= 0.0) and (logits.max() <= 1.0):
        probs = logits
    else:
        probs = 1.0 / (1.0 + np.exp(-logits))
    preds = (probs >= THRESHOLD).astype(int)

    return {
        "precision_micro": precision_score(labels, preds, average="micro",  zero_division=0),
        "recall_micro":    recall_score(labels,  preds, average="micro",    zero_division=0),
        "f1_micro":        f1_score(labels,      preds, average="micro",    zero_division=0),
        "precision_macro": precision_score(labels, preds, average="macro",  zero_division=0),
        "recall_macro":    recall_score(labels,  preds, average="macro",    zero_division=0),
        "f1_macro":        f1_score(labels,      preds, average="macro",    zero_division=0),
        "f1_samples":      f1_score(labels,      preds, average="samples",  zero_division=0),
        "subset_accuracy": accuracy_score(labels, preds),
    }

# --- TrainingArguments (HF 4.56.2 uses eval_strategy) ---
training_args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=3,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_micro",
    greater_is_better=True,
    logging_steps=50,
    report_to="none",            # no wandb/tensorboard
    run_name=None,
    dataloader_pin_memory=False, # silence CPU-only warning
)

# Plain padding collator; labels are already float32 from the cast_column step,
# so no extra cast is needed here.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# --- Trainer ---
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_subset,
    eval_dataset=eval_subset,
    compute_metrics=compute_metrics,
    data_collator=data_collator,   # pads inputs; does not cast labels
)

# --- Train & Evaluate ---
trainer.train()
eval_metrics = trainer.evaluate()
print(eval_metrics)

# --- Save ---
trainer.save_model("./finetuned_distilbert_goemotions")
tokenizer.save_pretrained("./finetuned_distilbert_goemotions")
print("Training finished. Model saved to ./finetuned_distilbert_goemotions")


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Epoch,Training Loss,Validation Loss,Precision Micro,Recall Micro,F1 Micro,Precision Macro,Recall Macro,F1 Macro,F1 Samples,Subset Accuracy
1,0.1348,0.134375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.292
2,0.1297,0.129435,0.611765,0.060465,0.110053,0.034324,0.035145,0.029634,0.0439,0.324
3,0.1239,0.123107,0.547445,0.087209,0.150451,0.03991,0.045197,0.041678,0.0624,0.331


{'eval_loss': 0.12310708314180374, 'eval_precision_micro': 0.5474452554744526, 'eval_recall_micro': 0.0872093023255814, 'eval_f1_micro': 0.15045135406218657, 'eval_precision_macro': 0.03990978157644825, 'eval_recall_macro': 0.04519653753370228, 'eval_f1_macro': 0.041678269748445185, 'eval_f1_samples': 0.06239999999999999, 'eval_subset_accuracy': 0.331, 'eval_runtime': 132.8261, 'eval_samples_per_second': 7.529, 'eval_steps_per_second': 0.241, 'epoch': 3.0}
Training finished. Model saved to ./finetuned_distilbert_goemotions
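
Micro recall ends at about 0.09 with `THRESHOLD = 0.2`, which suggests (though I have not verified it) that most sigmoid outputs still sit below the cutoff after three epochs on 5,000 examples. One cheap follow-up is to sweep the threshold on validation probabilities instead of fixing it in advance. The sketch below does that for a single global threshold on invented toy data. `micro_f1` and `best_threshold` are hypothetical helpers, and the candidate grid is arbitrary.

```python
def micro_f1(y_true, y_pred):
    # Pool TP/FP/FN over every (example, label) pair
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(probs, y_true, candidates):
    """Return (threshold, micro-F1) for the best-scoring candidate."""
    scored = []
    for thr in candidates:
        preds = [[int(p >= thr) for p in row] for row in probs]
        scored.append((micro_f1(y_true, preds), thr))
    best_f1, best_thr = max(scored)  # ties go to the higher threshold
    return best_thr, best_f1

# Toy validation probabilities: an under-confident model, nothing reaches 0.2
toy_probs = [[0.15, 0.05, 0.02],
             [0.04, 0.12, 0.03],
             [0.11, 0.02, 0.09],
             [0.03, 0.04, 0.02]]
toy_labels = [[1, 0, 0],
              [0, 1, 0],
              [1, 0, 1],
              [0, 0, 0]]

thr, score = best_threshold(toy_probs, toy_labels, candidates=[0.05, 0.1, 0.2])
print(thr, round(score, 3))  # 0.05 0.889
```

A threshold picked this way is fitted to the validation split, so the final number should be reported on the test split to avoid an optimistic estimate.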
