# Introduction
In natural language processing (NLP), transformer architectures, especially BERT (Bidirectional Encoder Representations from Transformers) and its variants, have set new benchmarks across a wide range of tasks. One such variant, DistilBERT, is smaller and faster than BERT while retaining most of its accuracy.

The objective of this notebook is to build a text classifier on top of DistilBERT. To keep the training code short and readable, we use PyTorch Lightning, a lightweight PyTorch wrapper that handles the training and evaluation loops, so we can focus on the model architecture and logic rather than on boilerplate.

Pretrained models have become a staple of modern NLP because they let researchers and practitioners reuse what a model has already learned from training on large-scale datasets. The transformers library by Hugging Face provides a repository of such pretrained models, including DistilBERT, and makes it straightforward to integrate them into custom applications.

This notebook walks through data preprocessing, model loading, training, evaluation, and inference. Along the way, it shows how PyTorch Lightning and the transformers library reduce the amount of code each of these stages requires.

In [57]:
# Imports
import torch
import json
from sklearn.model_selection import train_test_split
import lightning.pytorch as pl

In [42]:
# Setup & Configurations (constants, seeds, and devices)
RANDOM_SEED = 69
DATASET_FILENAME = '../data/clean/customer_support_twitter_full.json'

torch.manual_seed(RANDOM_SEED)

<torch._C.Generator at 0x1047e446f90>

In [43]:
# Load the data
with open(DATASET_FILENAME) as file:
    conversations = json.load(file)

# Extract Statistics
n_messages = 0
intent_counts = dict()
for conversation in conversations:
    for message in conversation:
        n_messages += 1
        for intent in message.get('intents', []):  # default to [] so messages without intents don't raise
            intent_counts[intent] = intent_counts.get(intent, 0) + 1
ordered_counts = sorted(intent_counts.items(), key=lambda item: item[1], reverse=True)
ordered_counts_text = "\n".join([f"* {k:<25}: {v:5,}" for k, v in ordered_counts])

print(f'Conversation Count: {len(conversations)}')
print(f'Message Count: {n_messages}')
print(f'Label Counts:\n{ordered_counts_text}')
print(f'Sample conversation:{json.dumps(conversations[0], indent=2)}')


Conversation Count: 1001
Message Count: 2619
Label Counts:
* Question                 : 1,180
* URL Share                : 1,132
* Direct to DM             :   741
* Check Version/Details    :   561
* Provide Information      :   514
* Acknowledgement          :   355
* Report Problem           :   318
* Troubleshooting          :   217
* Call Center Inquiry      :    18
Sample conversation:[
  {
    "id": 698,
    "text": "@AppleSupport  URL",
    "authored": false,
    "intents": [
      "URL Share"
    ]
  },
  {
    "id": 696,
    "text": "USERNAME We're here for you. Which version of the iOS are you running? Check from Settings > General > About.",
    "authored": true,
    "intents": [
      "Question",
      "Provide Information",
      "Check Version/Details",
      "Acknowledgement"
    ]
  },
  {
    "id": 697,
    "text": "@AppleSupport The newest update. I made sure to download it yesterday.",
    "authored": false,
    "intents": [
      "Provide Information"
    ]
  },
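
The sample above shows that a single message can carry several intents at once, so the classifier's targets are naturally multi-label. One common way to represent such targets is a multi-hot vector with one slot per intent. The sketch below is a minimal illustration under that assumption, not the notebook's own preprocessing: the names `INTENTS`, `INTENT_TO_INDEX`, and `encode_intents` are hypothetical, and the label order is simply copied from the label counts printed above.

```python
# Hypothetical helper: turn a message's intent list into a multi-hot vector.
# Assumption: the label set and its order are taken from the printed label counts.
INTENTS = [
    "Question",
    "URL Share",
    "Direct to DM",
    "Check Version/Details",
    "Provide Information",
    "Acknowledgement",
    "Report Problem",
    "Troubleshooting",
    "Call Center Inquiry",
]
INTENT_TO_INDEX = {name: i for i, name in enumerate(INTENTS)}


def encode_intents(intents):
    """Return a list of 0.0/1.0 floats, one slot per known intent."""
    vector = [0.0] * len(INTENTS)
    for intent in intents or []:  # tolerate a missing or None intent list
        vector[INTENT_TO_INDEX[intent]] = 1.0
    return vector


# Intents of the second message in the sample conversation shown above
example = encode_intents(
    ["Question", "Provide Information", "Check Version/Details", "Acknowledgement"]
)
print(example)
```

For the second sample message this prints `[1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]`. Floats are used rather than integers because multi-label losses typically expect floating-point targets; that choice is also an assumption about the eventual model.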
  

In [58]:
# Split the data into 80% train, 10% validation, and 10% test
train_data, temp_data = train_test_split(conversations, test_size=0.2, random_state=RANDOM_SEED)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=RANDOM_SEED)
print(f'Train: {len(train_data)}; Val: {len(val_data)}; Test: {len(test_data)}')

# Define the PyTorch Lightning DataModule (placeholder, not yet implemented)
class ClassifierDataModule(pl.LightningDataModule):
    pass

Train: 800; Val: 100; Test: 101
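
The split above operates on whole conversations, so all messages from one conversation land in the same subset. If the classifier is trained on individual messages, each subset still has to be flattened into (text, intents) pairs. The following is a minimal sketch of that step, assuming the message structure shown in the sample conversation; `flatten_conversations` is a hypothetical helper, and the input here is just the three sample messages printed earlier.

```python
def flatten_conversations(conversations):
    """Flatten a list of conversations into (text, intents) pairs, one per message."""
    examples = []
    for conversation in conversations:
        for message in conversation:
            # Messages without an 'intents' key get an empty label list
            examples.append((message["text"], message.get("intents", [])))
    return examples


# The three messages of the sample conversation printed earlier
sample_conversation = [
    {"id": 698, "text": "@AppleSupport  URL", "authored": False,
     "intents": ["URL Share"]},
    {"id": 696,
     "text": "USERNAME We're here for you. Which version of the iOS are you "
             "running? Check from Settings > General > About.",
     "authored": True,
     "intents": ["Question", "Provide Information",
                 "Check Version/Details", "Acknowledgement"]},
    {"id": 697,
     "text": "@AppleSupport The newest update. I made sure to download it yesterday.",
     "authored": False,
     "intents": ["Provide Information"]},
]

examples = flatten_conversations([sample_conversation])
for text, intents in examples:
    print(f"{intents} <- {text[:40]}")
```

In the notebook, the same helper could be applied to each subset, for example `flatten_conversations(train_data)`. Flattening after the split, rather than before it, keeps messages from one conversation from being spread across the train, validation, and test sets.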


In [60]:
# Model Loading & Configuration
class ClassifierModel(pl.LightningModule):
    
    def __init__(self):
        super().__init__()