# Multi-label text classification with HuggingFace Transformers

This notebook demonstrates how to use the HuggingFace
`transformers` library to perform multi-label text
classification.

## The toxicity dataset

The dataset comes from Kaggle's
[Toxic Comment Classification Challenge](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/overview).
It consists of comments from Wikipedia talk page edits, where
each comment is labeled for different types of toxicity,
including:

* threats
* obscenity
* insults
* identity-based hate

This is a *multi-label* dataset: each comment can carry several
toxicity labels at once, or none at all.
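
To make the multi-label idea concrete, here is a minimal sketch of how one comment's labels can be written as a multi-hot vector. The label names follow the label columns of the competition's `train.csv`; the `to_multi_hot` helper and the example comments are hypothetical and serve only as illustration.

```python
# Label names, in the column order used by the competition's train.csv.
LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


def to_multi_hot(active_labels):
    """Return a 0/1 vector with a 1 for every label that applies."""
    active = set(active_labels)
    unknown = active - set(LABELS)
    if unknown:
        raise ValueError(f'unknown labels: {sorted(unknown)}')
    return [1 if name in active else 0 for name in LABELS]


# A hypothetical comment that is both obscene and insulting.
print(to_multi_hot(['obscene', 'insult']))  # [0, 0, 1, 0, 1, 0]
# A clean comment has no active labels at all.
print(to_multi_hot([]))                     # [0, 0, 0, 0, 0, 0]
```

In a single-label problem exactly one position would be set; here any subset of the six positions may be.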

## Libraries used

We'll train our multi-label classification model with HuggingFace
`transformers`, using PyTorch as the deep learning framework.

The `datasets` library wraps the data for training, and pandas
handles the preprocessing.

In [1]:
import numpy as np
import pandas as pd
import torch

from datasets import Dataset, load_metric

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)

## Preprocessing the data

In [2]:
df = pd.read_csv('data/train.csv')
df.head()

,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [3]:
df.set_index('id', inplace=True)
df.head()

id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [4]:
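# Collapse the six label columns (every column after comment_text)
# into a single list per comment.
label = df[df.columns[1:]].apply(lambda x: x.to_list(), axis=1)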
datadf = pd.DataFrame(data={
    'comment_text': df.comment_text,
    'label': label,
})
datadf.head()

id,comment_text,label
0000997932d777bf,Explanation\nWhy the edits made under my usern...,"[0, 0, 0, 0, 0, 0]"
000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,"[0, 0, 0, 0, 0, 0]"
000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...","[0, 0, 0, 0, 0, 0]"
0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...","[0, 0, 0, 0, 0, 0]"
0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...","[0, 0, 0, 0, 0, 0]"


In [5]:
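# Randomly sample 80% of the comments for training.
train_dataset = datadf.sample(frac=0.8)

# The remaining 20% form the test set; drop the `id` index on both splits.
test_dataset = datadf.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)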

In [6]:
train_dataset = Dataset.from_pandas(train_dataset)
test_dataset = Dataset.from_pandas(test_dataset)

In [7]:
tokenizer = AutoTokenizer.from_pretrained(
    'distilbert-base-uncased',
    use_fast=True,
)

In [8]:
def preprocess(samples):
    # Tokenize each comment, truncating to the model's maximum input length.
    return tokenizer(samples['comment_text'], truncation=True)

In [9]:
encoded_train_dataset = train_dataset.map(preprocess, batched=True)
encoded_test_dataset = test_dataset.map(preprocess, batched=True)

In [10]:
train_dataset[0]

{'comment_text': "I collect old Pears Cyclopaedias. I'm just pulling a few editions of the shelf as we speak. The 1898 edition shows the county town of Sussex as Chicester, as does the 1927 edition. .... okay the county town is shown as Chichester up until 1935/1936  the next edition I have is 1939 and the county town is shown as Lewes. I don't have any further details.",
 'label': [0, 0, 0, 0, 0, 0]}

In [11]:
encoded_train_dataset[0]

{'comment_text': "I collect old Pears Cyclopaedias. I'm just pulling a few editions of the shelf as we speak. The 1898 edition shows the county town of Sussex as Chicester, as does the 1927 edition. .... okay the county town is shown as Chichester up until 1935/1936  the next edition I have is 1939 and the county town is shown as Lewes. I don't have any further details.",
 'label': [0, 0, 0, 0, 0, 0],
 'input_ids': [101, 1045, 8145, 2214, 28253, 2015, 22330, 20464, 29477, 2098, 7951, 1012,
  1045, 1005, 1049, 2074, 4815, 1037, 2261, 6572, 1997, 1996, 11142, 2004,
  2057, 3713, 1012, 1996, 6068, 3179, 3065, 1996, 2221, 2237, 1997, 9503,
  2004, 9610, 9623, 3334, 1010, 2004, 2515, 1996, 4764, 3179, 1012, 1012,
  1012, 1012, 1012, 3100, 1996, 2221, 2237, 2003, 3491, 2004, 23406, 2039,
  2127, 4437, 1013, 4266, 1996, 2279, 3179, 1045, 2031, 2003, 3912, 1998,

## Creating the classification model

In [12]:
model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=6,
)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifi

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

## Training the model

In [21]:
args = TrainingArguments(
    'distilbert-base-uncased-finetuned-toxicity',
    evaluation_strategy='epoch',
    save_strategy='epoch',
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
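
As a quick sanity check on these arguments, the number of optimization steps can be worked out by hand. This is a sketch that assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; the example count is the one reported in the training log further down.

```python
import math

num_examples = 127657  # "Num examples" in the training log
batch_size = 4         # per_device_train_batch_size, assuming one device
num_epochs = 5         # num_train_epochs

# The last batch of an epoch may be smaller than 4, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)  # 31915 159575
```

The result matches the "Total optimization steps" line of the log, which suggests the assumptions hold for this run.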


In [25]:
metric = load_metric('accuracy')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    print(f'{predictions = }, {labels = }')
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
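
Note that `np.argmax` picks exactly one class per comment, which suits single-label classification. For multi-label output, a common alternative is to pass each logit through a sigmoid and threshold it independently. The sketch below is one way to do that with NumPy alone; `multi_label_metrics`, the metric names, and the 0.5 threshold are illustrative assumptions, not part of `transformers` or `datasets`.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def multi_label_metrics(logits, labels, threshold=0.5):
    """Score each label independently instead of taking an argmax."""
    probs = sigmoid(np.asarray(logits, dtype=float))
    preds = (probs >= threshold).astype(int)
    labels = np.asarray(labels).astype(int)
    return {
        # Fraction of individual (comment, label) decisions that are right.
        'label_accuracy': float((preds == labels).mean()),
        # Fraction of comments whose whole label vector is right.
        'exact_match': float((preds == labels).all(axis=1).mean()),
    }


# Toy logits for two comments and three labels.
logits = np.array([[2.0, -1.0, 3.0], [-2.0, -3.0, 0.5]])
labels = np.array([[1, 0, 1], [0, 0, 0]])
print(multi_label_metrics(logits, labels))
```

A `compute_metrics` function for `Trainer` could unpack `eval_pred` into `(predictions, labels)` and return the result of such a helper.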

In [26]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model,
    args,
    train_dataset=encoded_train_dataset,
    eval_dataset=encoded_test_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
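
The tokenizer was called with `truncation=True` but without padding, so the encoded comments have different lengths. `DataCollatorWithPadding` pads each batch to the length of its longest sequence when the batch is assembled. The pure-Python sketch below mimics that idea; `pad_batch` is a hypothetical helper, the token ids are toy values, and the pad id of 0 is assumed from `padding_idx=0` in the model printout above. The real collator also returns tensors rather than lists.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad every sequence to the longest one in this batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {'input_ids': input_ids, 'attention_mask': attention_mask}


batch = pad_batch([[101, 1045, 8145, 102], [101, 102]])
print(batch['input_ids'])       # [[101, 1045, 8145, 102], [101, 102, 0, 0]]
print(batch['attention_mask'])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Padding per batch rather than to a global maximum keeps short batches small, which matters with a batch size of 4 and comments of very different lengths.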

In [27]:
trainer.train()

The following columns in the training set  don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: comment_text. If comment_text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 127657
  Num Epochs = 5
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 159575


ValueError: Expected input batch_size (4) to match target batch_size (24).
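
This `ValueError` is consistent with the model applying a single-label loss: with integer labels and `num_labels=6`, the labels of a batch, of shape `(4, 6)`, are flattened into 24 targets while the logits still describe 4 examples. A common remedy, offered here as an assumption rather than a fix tested on this notebook, is to store the labels as floats and load the model with `problem_type='multi_label_classification'`, so that a per-label binary cross-entropy is used instead. The NumPy sketch below only illustrates the shape arithmetic and that loss; `bce_with_logits` is a hypothetical helper, not a `transformers` function.

```python
import numpy as np

batch_size, num_labels = 4, 6
logits = np.zeros((batch_size, num_labels))             # one score per label
labels = np.zeros((batch_size, num_labels), dtype=int)  # multi-hot targets
labels[0, [2, 4]] = 1                                   # e.g. obscene + insult

# A single-label loss flattens the labels: 24 targets for 4 rows of logits.
print(logits.shape[0], labels.reshape(-1).shape[0])     # 4 24


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over every (example, label) pair."""
    x, y = logits, targets.astype(float)
    # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(x).
    return float(np.mean(np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))))


# Logits and targets keep the same (4, 6) shape, so nothing is flattened.
print(bce_with_logits(logits, labels))  # log(2), about 0.693, for all-zero logits
```

With a multi-label loss the argmax-based accuracy in `compute_metrics` would also need to change, since each comment no longer has a single correct class.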