# Introduction
Transformer architectures, especially BERT (Bidirectional Encoder Representations from Transformers) and its variants, have set new benchmarks across a wide range of natural language processing (NLP) tasks. One such variant, DistilBERT, is a smaller and faster model that gives up only a little of the performance of its larger counterpart.

This notebook builds a multi-label text classifier on top of DistilBERT. Deep learning pipelines for NLP involve many moving parts, so it helps to rely on tools that streamline the process and keep it readable. To this end, we use PyTorch Lightning, a lightweight PyTorch wrapper that handles the training and evaluation loops, allowing us to focus on the model architecture and logic rather than boilerplate.

Using pretrained models has become standard practice in modern NLP. It lets researchers and practitioners reuse what a model has already learned from training on large-scale datasets instead of starting from scratch. The transformers library by Hugging Face provides a repository of such pretrained models, including DistilBERT, and makes it straightforward to integrate them into custom applications.

In this notebook, we walk through data preprocessing, model loading, training, evaluation, and inference. Along the way, it shows how PyTorch Lightning and the transformers library reduce the amount of code needed to fine-tune a state-of-the-art model.

In [3]:
# Imports
from typing import List, Dict
import torch
from torch.utils.data import Dataset, DataLoader
import json
from sklearn.model_selection import train_test_split # scikit-learn
import lightning.pytorch as pl
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
import torchmetrics

In [4]:
# Setup & Configurations (constants, seeds, and devices)
RANDOM_SEED = 69
DATASET_FILENAME = '../data/clean/customer_support_twitter_full.json'
MODEL_NAME = 'distilbert-base-uncased'
MAX_EPOCHS = 5
BATCH_SIZE = 16

torch.manual_seed(RANDOM_SEED)

<torch._C.Generator at 0x1b085132c50>

In [5]:
# Load the data
with open(DATASET_FILENAME) as file:
    conversations = json.load(file)

# Extract Statistics
n_messages = 0
intent_counts = {}
for conversation in conversations:
    for message in conversation:
        n_messages += 1
        # Default to an empty list so a message without intents does not raise
        for intent in message.get('intents', []):
            intent_counts[intent] = intent_counts.get(intent, 0) + 1
# Sort labels from most to least frequent
ordered_counts = sorted(intent_counts.items(), key=lambda item: item[1], reverse=True)
ordered_counts_text = "\n".join([f"* {k:<25}: {v:5,}" for k, v in ordered_counts])

print(f'Conversation Count: {len(conversations)}')
print(f'Message Count: {n_messages}')
print(f'Label Counts:\n{ordered_counts_text}')
print(f'Sample conversation:{json.dumps(conversations[0], indent=2)}')

Conversation Count: 1001
Message Count: 2619
Label Counts:
* Question                 : 1,180
* URL Share                : 1,132
* Direct to DM             :   741
* Check Version/Details    :   561
* Provide Information      :   514
* Acknowledgement          :   355
* Report Problem           :   318
* Troubleshooting          :   217
* Call Center Inquiry      :    18
Sample conversation:[
  {
    "id": 698,
    "text": "@AppleSupport  URL",
    "authored": false,
    "intents": [
      "URL Share"
    ]
  },
  {
    "id": 696,
    "text": "USERNAME We're here for you. Which version of the iOS are you running? Check from Settings > General > About.",
    "authored": true,
    "intents": [
      "Question",
      "Provide Information",
      "Check Version/Details",
      "Acknowledgement"
    ]
  },
  {
    "id": 697,
    "text": "@AppleSupport The newest update. I made sure to download it yesterday.",
    "authored": false,
    "intents": [
      "Provide Information"
    ]
  },
  

In [6]:
# Define PyTorch Dataset
class ClassifierDataset(Dataset):

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# Define PyTorch Lightning DataModule
class ClassifierDataModule(pl.LightningDataModule):

    def __init__(self, conversations: List[List[Dict]], batch_size=32):
        super().__init__()
        self.conversations = conversations
        self.batch_size = batch_size
        # TODO move to prepare data (it requires saving and loading from HDD)
        labels = set()
        tokenizer = DistilBertTokenizerFast.from_pretrained(MODEL_NAME)
        for conversation in self.conversations:
            for message in conversation:
                for intent in message.get('intents'):
                    labels.add(intent)
        self.labels = sorted(list(labels))
        print(f'Found {len(self.labels)} labels')
        data = []
        for conversation in self.conversations:
            for message in conversation:
                inputs = tokenizer(
                    message.get('text'), 
                    padding='max_length',
                    truncation=True,
                    return_tensors='pt',
                )
                data.append({
                    'input_ids': inputs['input_ids'].squeeze(),
                    'attention_mask': inputs['attention_mask'].squeeze(),
                    'labels': self._get_label(message.get('intents')),
                })

        # Split the data into 80% train, 10% validation, and 10% test
        train_data, temp_data = train_test_split(data, test_size=0.2, random_state=RANDOM_SEED)
        val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=RANDOM_SEED)

        # Setup datasets
        self.train_dataset = ClassifierDataset(train_data)
        self.val_dataset = ClassifierDataset(val_data)
        self.test_dataset = ClassifierDataset(test_data)
        

    def _get_label(self, intents: List[str]) -> torch.Tensor:
        label = torch.zeros(len(self.labels))
        for intent in intents:
            label[self.labels.index(intent)] = 1
        return label

    @property
    def n_labels(self):
        return len(self.labels)

    def prepare_data(self):
        # TODO
        pass

    def setup(self, stage: str):
        # TODO
        pass
    
    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.batch_size)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.batch_size)

dm = ClassifierDataModule(conversations, BATCH_SIZE)

Found 9 labels
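
To make the label encoding concrete, here is a small dependency-free sketch of the multi-hot scheme that `_get_label` implements, using the nine intents from the label counts above. The helper name `encode_intents` and the use of plain lists instead of a `torch.Tensor` are assumptions made for illustration; the dictionary lookup stands in for `list.index` only to avoid a linear scan per intent.

```python
from typing import Dict, List

# The nine intents reported in the dataset statistics, sorted as in the DataModule.
LABELS: List[str] = sorted([
    "Question", "URL Share", "Direct to DM", "Check Version/Details",
    "Provide Information", "Acknowledgement", "Report Problem",
    "Troubleshooting", "Call Center Inquiry",
])

# Hypothetical helper table: map each label name to its column index once.
LABEL_TO_INDEX: Dict[str, int] = {name: i for i, name in enumerate(LABELS)}


def encode_intents(intents: List[str]) -> List[float]:
    """Return a multi-hot vector with 1.0 at the index of every intent present."""
    vector = [0.0] * len(LABELS)
    for intent in intents:
        vector[LABEL_TO_INDEX[intent]] = 1.0
    return vector


# Second message of the sample conversation, which carries four intents.
example = encode_intents([
    "Question", "Provide Information", "Check Version/Details", "Acknowledgement",
])
print(example)
```

Because a message can carry several intents at once, more than one position can be set to 1.0, which is what distinguishes this target from a one-hot vector used in single-label classification.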


In [34]:
# Model Loading & Configuration
class ClassifierModel(pl.LightningModule):
    
    def __init__(self, n_labels):
        super().__init__()
        self.model = DistilBertForSequenceClassification.from_pretrained(
            MODEL_NAME,
            num_labels=n_labels,
            problem_type='multi_label_classification',
            return_dict=True,
        )
        self.accuracy = torchmetrics.Accuracy(
            task='multilabel',
            num_labels=n_labels,
        )
        self.f1_score = torchmetrics.F1Score(
            task='multilabel',
            num_labels=n_labels,
            average='macro',
        )
        self.hamming_loss = torchmetrics.HammingDistance(
            task='multilabel',
            num_labels=n_labels,
        )

    def forward(self, inputs):
        return self.model(**inputs)

    def _common_step(self, batch, batch_idx):
        outputs = self(batch)
        return outputs.logits, batch['labels'], outputs.loss

    def _common_log(self, outputs, labels, loss, stage: str):
        self.log_dict({
            f'{stage}_loss': loss,
            f'{stage}_acc': self.accuracy(outputs, labels),
            f'{stage}_f1': self.f1_score(outputs, labels),
        }, prog_bar=True)

    def training_step(self, batch, batch_idx):
        outputs, labels, loss = self._common_step(batch, batch_idx)
        self._common_log(outputs, labels, loss, 'train')
        return loss

    def validation_step(self, batch, batch_idx):
        outputs, labels, loss = self._common_step(batch, batch_idx)
        self._common_log(outputs, labels, loss, 'val')
        self.log_dict({
            'val_hloss': self.hamming_loss(outputs, labels),
        }, prog_bar=True)
        # if batch_idx == 0:
        #     self.outputs = outputs
        #     self.labels = labels
        # else:
        #     self.outputs = torch.cat((self.outputs, outputs))
        #     self.labels = torch.cat((self.labels, labels))
        # self.logger.experiment.add_figure("Confusion matrix", fig_, self.current_epoch)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-5)

model = ClassifierModel(dm.n_labels)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier
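
With `problem_type='multi_label_classification'`, the transformers sequence-classification head computes a binary cross-entropy loss on the raw logits, treating each label as an independent yes/no decision. The sketch below reproduces that calculation for a single message in plain Python. The function name `bce_with_logits` and the example numbers are illustrative assumptions, not values taken from this notebook.

```python
import math
from typing import List


def bce_with_logits(logits: List[float], targets: List[float]) -> float:
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# Illustrative logits for three labels and their multi-hot targets.
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]
print(round(bce_with_logits(logits, targets), 4))
```

A logit of 0 corresponds to a probability of 0.5 and contributes log(2) to the loss regardless of the target, while a confident correct logit contributes almost nothing.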

In [35]:
# Training Loop
logger = pl.loggers.TensorBoardLogger("runs", name="classifier")
trainer = pl.Trainer(
    max_epochs=MAX_EPOCHS,
    callbacks=[pl.callbacks.RichProgressBar(leave=True)],
    logger=logger,
)
trainer.fit(model, datamodule=dm)
trainer.validate(model, datamodule=dm)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


`Trainer.fit` stopped: `max_epochs=5` reached.


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


[{'val_loss': 0.1228979155421257,
  'val_acc': 0.9749788045883179,
  'val_f1': 0.6996081471443176,
  'val_hloss': 0.025021204724907875}]
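
The reported `val_acc` and `val_hloss` sum to 1 up to rounding (0.97498 + 0.02502), which is what one would expect if both are computed per label decision: the Hamming distance is the fraction of label decisions that are wrong, and the accuracy is the fraction that are right. The macro F1 averages per-label F1 scores, so a rare label such as `Call Center Inquiry` (18 examples) can pull it well below the accuracy. The plain-Python sketch below illustrates these definitions on a tiny hand-made example. It is not the torchmetrics implementation, and the function names and data are assumptions.

```python
from typing import List

Matrix = List[List[int]]  # rows = messages, columns = labels


def hamming_distance(preds: Matrix, targets: Matrix) -> float:
    """Fraction of individual label decisions that are wrong."""
    wrong = sum(p != t for pr, tr in zip(preds, targets) for p, t in zip(pr, tr))
    return wrong / (len(preds) * len(preds[0]))


def macro_f1(preds: Matrix, targets: Matrix) -> float:
    """Unweighted mean of per-label F1 (taken as 0 when a label has no positives)."""
    scores = []
    for j in range(len(preds[0])):
        tp = sum(p[j] == 1 and t[j] == 1 for p, t in zip(preds, targets))
        fp = sum(p[j] == 1 and t[j] == 0 for p, t in zip(preds, targets))
        fn = sum(p[j] == 0 and t[j] == 1 for p, t in zip(preds, targets))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Four messages, three labels; the third label is rare and is missed by the model.
targets = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 1]]
preds = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [1, 0, 0]]

h = hamming_distance(preds, targets)
accuracy = 1.0 - h  # per-decision accuracy is the complement of the Hamming distance
print(f"hamming={h:.3f} accuracy={accuracy:.3f} macro_f1={macro_f1(preds, targets):.3f}")
```

In this toy case only one of twelve label decisions is wrong, yet the macro F1 drops to about two thirds because the missed rare label counts as much as the two frequent ones. This is why the macro F1 is usually the more informative number for imbalanced multi-label data.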

In [None]:
# Evaluation

In [36]:
model.eval()

ClassifierModel(
  (model): DistilBertForSequenceClassification(
    (distilbert): DistilBertModel(
      (embeddings): Embeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (transformer): Transformer(
        (layer): ModuleList(
          (0): TransformerBlock(
            (attention): MultiHeadSelfAttention(
              (dropout): Dropout(p=0.1, inplace=False)
              (q_lin): Linear(in_features=768, out_features=768, bias=True)
              (k_lin): Linear(in_features=768, out_features=768, bias=True)
              (v_lin): Linear(in_features=768, out_features=768, bias=True)
              (out_lin): Linear(in_features=768, out_features=768, bias=True)
            )
            (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (ffn):

In [7]:
preds.detach()

tensor([[0.1100, 0.2200, 0.8400],
        [0.7300, 0.3300, 0.9200]])
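
The tensor above has values between 0 and 1, so it reads like per-label probabilities, that is, logits after a sigmoid. A common way to turn such scores into predicted labels is to keep every label whose probability exceeds a threshold. The sketch below applies that rule to the two rows shown. The 0.5 threshold, the function names, and the plain-list representation are assumptions, and the three columns are reported as indices because the notebook does not say which labels they correspond to.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predicted_indices(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Return the indices of all labels whose probability exceeds the threshold."""
    return [i for i, p in enumerate(probs) if p > threshold]


# The two rows printed by the cell above, copied as plain Python lists.
preds = [
    [0.11, 0.22, 0.84],
    [0.73, 0.33, 0.92],
]
for row in preds:
    print(predicted_indices(row))

# Raw logits would first go through the sigmoid; a logit of 0 maps to 0.5.
print(sigmoid(0.0))
```

A single global threshold is the simplest choice. With labels as imbalanced as the ones in this dataset, per-label thresholds tuned on the validation set may work better, but that is outside what this notebook covers.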