# Multi-label text classification with HuggingFace Transformers

This notebook demonstrates how to use the HuggingFace
`transformers` library to perform multi-label text
classification.

## The toxicity dataset

We'll use the dataset from Kaggle's
[Toxic Comment Classification Challenge](https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/overview).
The data are comments from Wikipedia talk page edits, where each
comment is labeled for different types of toxicity, including:

* threats
* obscenity
* insults
* identity-based hate

This is a *multi-label* dataset, meaning each comment can be
tagged with several types of toxicity at once, or with none.
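
As a rough illustration of what multi-label targets look like, the sketch below encodes a comment's labels as a 0/1 vector with one position per label. The label names are the columns of `train.csv` shown later in this notebook; the helper `to_multi_hot` is hypothetical and only serves this example.

```python
# Label columns, in the order they appear in train.csv.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def to_multi_hot(active):
    """Return a 0/1 vector with a 1 for every label named in `active`."""
    unknown = set(active) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in active else 0 for name in LABELS]


# A clean comment has no active labels; a toxic one may have several.
clean = to_multi_hot([])
obscene_insult = to_multi_hot(["obscene", "insult"])
print(clean)
print(obscene_insult)
```

Unlike a multi-class target, the vector can contain any number of ones, including zero.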

## Libraries used

We'll train our multi-label classification model using HuggingFace
transformers with TensorFlow as our deep learning framework.

For preprocessing data we'll use Pandas.

In [1]:
from typing import Dict, Tuple

import pandas as pd
import tensorflow as tf
import tensorflow_text as text

from transformers import (
    TFDistilBertModel,
    DistilBertTokenizerFast,
)

## Preprocessing the data

In [2]:
df = pd.read_csv('data/train.csv')
df.head()

,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [3]:
df.set_index('id', inplace=True)
df.head()

id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [4]:
label = df[df.columns[1:]].apply(lambda x: x.to_list(), axis=1)
datadf = pd.DataFrame(data={
    'comment_text': df.comment_text,
    'label': label,
})
datadf.head()

id,comment_text,label
0000997932d777bf,Explanation\nWhy the edits made under my usern...,"[0, 0, 0, 0, 0, 0]"
000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,"[0, 0, 0, 0, 0, 0]"
000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...","[0, 0, 0, 0, 0, 0]"
0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...","[0, 0, 0, 0, 0, 0]"
0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...","[0, 0, 0, 0, 0, 0]"


In [5]:
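# Randomly sample 80% of the comments for training.
traindf = datadf.sample(frac=0.8)

# The comments that were not sampled form the test set.
testdf = datadf.drop(traindf.index).reset_index(drop=True)
traindf = traindf.reset_index(drop=True)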

In [6]:
train_dataset = tf.data.Dataset.from_tensor_slices({
    'text': tf.constant(traindf.comment_text),
    'labels': tf.constant(traindf.label.to_list()),
})
test_dataset = tf.data.Dataset.from_tensor_slices({
    'text': tf.constant(testdf.comment_text),
    'labels': tf.constant(testdf.label.to_list()),
})


In [7]:
hf_tokenizer = DistilBertTokenizerFast.from_pretrained(
    'distilbert-base-uncased',
)

# get_vocab() maps each token to its id. The dict order is not guaranteed
# to follow id order, so pair every token with its own id explicitly.
vocab = hf_tokenizer.get_vocab()
vocab_size = len(vocab)

lookup_table = tf.lookup.StaticVocabularyTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=[key.encode() for key in vocab.keys()],
        values=tf.constant(list(vocab.values()), dtype=tf.int64),
    ),
    num_oov_buckets=vocab_size // 2 ** 4
)

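def preprocess(
    inputs: Dict[str, tf.Tensor],
) -> Tuple[Tuple[tf.Tensor, tf.Tensor], tf.Tensor]:
    # distilbert-base-uncased expects lower-cased input.
    tokenizer = text.BertTokenizer(
        lookup_table, token_out_type=tf.int64, lower_case=True,
    )
    # Leave room for the [CLS] and [SEP] tokens added below.
    trimmer = text.RoundRobinTrimmer(max_seq_length=510)

    text_ = inputs['text']
    # Flatten (words, wordpieces) into one token sequence per comment.
    tokenized_text = tokenizer.tokenize(text_).merge_dims(-2, -1)
    trimmed_text = trimmer.trim([tokenized_text])
    # Wrap each sequence as [CLS] tokens [SEP], as BERT-style models expect.
    combined_text, _ = text.combine_segments(
        trimmed_text,
        start_of_sequence_id=hf_tokenizer.cls_token_id,
        end_of_segment_id=hf_tokenizer.sep_token_id,
    )
    input_ids, attention_mask = text.pad_model_inputs(
        combined_text,
        max_seq_length=512,
    )

    return (input_ids, attention_mask), inputs['labels']


In [8]:
# Batch before mapping so that `preprocess` tokenizes whole batches of comments.
train_pp_dataset = train_dataset.shuffle(512).batch(8).map(preprocess)
# The test data does not need to be shuffled.
test_pp_dataset = test_dataset.batch(8).map(preprocess)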
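
The preprocessing step above trims every token sequence to a fixed length and pads shorter ones, returning an attention mask alongside the token ids. The plain-Python sketch below shows that idea on a single sequence. The helper `trim_and_pad` is hypothetical, the token ids are arbitrary, and a padding id of 0 is assumed.

```python
def trim_and_pad(token_ids, max_seq_length, pad_id=0):
    """Trim to max_seq_length, right-pad with pad_id, and build the mask."""
    trimmed = token_ids[:max_seq_length]
    n_pad = max_seq_length - len(trimmed)
    input_ids = trimmed + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(trimmed) + [0] * n_pad
    return input_ids, attention_mask


# A short sequence is padded; a long one is cut off at max_seq_length.
short_ids, short_mask = trim_and_pad([11, 12, 13], max_seq_length=5)
long_ids, long_mask = trim_and_pad([1, 2, 3, 4, 5, 6, 7], max_seq_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every output has the same length, which is what lets the comments be stacked into fixed-shape batches.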

## Creating the classification model

In [None]:
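distilbert_model = TFDistilBertModel.from_pretrained(
    'distilbert-base-uncased',
)
# Freeze the pretrained encoder; only the classification head is trained.
for layer in distilbert_model.layers:
    layer.trainable = False

# TFDistilBertModel returns an output object rather than a single tensor,
# so wire it up with the functional API instead of tf.keras.Sequential.
input_ids = tf.keras.Input(shape=(512,), dtype=tf.int64, name='input_ids')
attention_mask = tf.keras.Input(shape=(512,), dtype=tf.int64, name='attention_mask')
hidden_states = distilbert_model(
    input_ids, attention_mask=attention_mask,
).last_hidden_state
# Use the first token's hidden state as the comment representation.
# One sigmoid unit per label gives six independent probabilities.
outputs = tf.keras.layers.Dense(6, activation='sigmoid')(hidden_states[:, 0, :])
model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)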

## Training the model

In [None]:
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
)

In [None]:
# Train on the tokenized dataset, not the raw text dataset.
model.fit(train_pp_dataset, epochs=1)
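
The model above ends in a sigmoid layer with six units and is trained with binary cross-entropy, which treats each label as its own yes/no decision. The plain-Python sketch below works through that computation for one example. The logits are made up for illustration, and the 0.5 decision threshold is an assumption rather than something set in this notebook.

```python
import math


def sigmoid(z):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the labels of one example."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


# Hypothetical logits for one comment, one per label.
logits = [-3.0, -4.0, 2.0, -5.0, 1.5, -2.0]
y_true = [0, 0, 1, 0, 1, 0]

probs = [sigmoid(z) for z in logits]
# Each label is thresholded on its own, so several can be active at once.
predicted = [int(p >= 0.5) for p in probs]
loss = binary_crossentropy(y_true, probs)
print(predicted, loss)
```

Because each probability comes from its own sigmoid, the six values do not need to sum to one, unlike the output of a softmax layer.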