In [4]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

**DistilBERT** is a smaller, faster, and more efficient version of the original BERT (Base BERT) model, specifically designed to maintain much of BERT's performance while reducing its size and computational cost. Here are the key advantages and architectural differences:

**Advantages of DistilBERT over Base BERT:**

**1.Reduced Size and Complexity:** DistilBERT has about 40% fewer parameters than BERT-base, which makes it smaller and more efficient. This reduction in size makes it faster to train and infer.

**2.Preserved Performance:** Despite the reduction in size, DistilBERT retains around 97% of the performance of BERT-base on benchmark datasets. This makes it an attractive choice when computational resources or model size are a concern, but without a substantial sacrifice in performance.

**Architectural Differences:**

**1.Number of Layers:** DistilBERT has 6 transformer layers compared to the 12 in BERT-base. This reduction in layers is the primary reason for its smaller size and faster performance.

**2.Distillation Process**: DistilBERT is trained using a technique called knowledge distillation, where the smaller model is trained to reproduce the behavior of the larger, more complex model (BERT). This process allows DistilBERT to retain much of the original model's capabilities.

**3.No Token-Type Embeddings:** DistilBERT does not use token-type embeddings which are present in BERT. In BERT, these embeddings are used to distinguish between different sequences for tasks like question-answering, but they were found to be less critical for overall performance.

**4.Reduced Hidden Size and Attention Heads:** The hidden size and the number of attention heads in DistilBERT are the same as in BERT-base, which helps in maintaining performance.

In [5]:
from transformers import Trainer, TrainingArguments, DistilBertForSequenceClassification, DistilBertTokenizerFast, \
     DataCollatorWithPadding, pipeline
from datasets import load_metric, Dataset,load_dataset

**The Fast tokenizers** in Hugging Face are implemented in Rust for speed, offering fast tokenization.

**DataCollatorWithPadding:**
This is a utility that helps in creating batches of data. It automatically pads the inputs to the maximum length in a batch, making it easier to process variable-length inputs. Padding ensures that all sequences in a batch are of the same length, which is a requirement for most deep learning models.


In [6]:
# Load the SNIPS dataset
dataset = load_dataset("benayas/snips")

# Explore the dataset structure
print(dataset)

Downloading readme:   0%|          | 0.00/426 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/370k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/45.0k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/13084 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1400 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'category'],
        num_rows: 13084
    })
    test: Dataset({
        features: ['text', 'category'],
        num_rows: 1400
    })
})


In [7]:
# Example: Accessing the training set
train_set = dataset["train"]
print(train_set)

test_set = dataset["test"]
print(test_set)

# Example: Print a few samples from the training set
for i in range(20):
    print(f"Sample {i}: {train_set[i]}")

Dataset({
    features: ['text', 'category'],
    num_rows: 13084
})
Dataset({
    features: ['text', 'category'],
    num_rows: 1400
})
Sample 0: {'text': 'Add Don and Sherri to my Meditate to Sounds of Nature playlist', 'category': 'AddToPlaylist'}
Sample 1: {'text': 'put United Abominations onto my rare groove playlist', 'category': 'AddToPlaylist'}
Sample 2: {'text': 'add the tune by misato watanabe to the Trapeo playlist', 'category': 'AddToPlaylist'}
Sample 3: {'text': 'add this artist to my this is miguel bosé playlist', 'category': 'AddToPlaylist'}
Sample 4: {'text': 'add heresy and the hotel choir to the evening acoustic playlist', 'category': 'AddToPlaylist'}
Sample 5: {'text': 'Please add Jency Anthony to my playlist This Is Mozart', 'category': 'AddToPlaylist'}
Sample 6: {'text': 'Add an album to my list La Mejor Música Dance 2017', 'category': 'AddToPlaylist'}
Sample 7: {'text': 'Add shame on you to my masters of metal playlist', 'category': 'AddToPlaylist'}
Sample 8: {'te

In [8]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')

tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/465 [00:00<?, ?B/s]

In [9]:
# Create a mapping for the labels
label_to_id = {label: i for i, label in enumerate(set(train_set['category']))}

In [10]:
from torch.utils.data import Dataset
import torch

class SNIPSDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Prepare inputs
train_encodings = tokenizer(list(train_set['text']), truncation=True, padding=True, max_length=512)
test_encodings = tokenizer(list(test_set['text']), truncation=True, padding=True, max_length=512)

# Convert labels
train_labels = [label_to_id[label] for label in train_set['category']]
test_labels = [label_to_id[label] for label in test_set['category']]

# Create datasets
train_dataset = SNIPSDataset(train_encodings, train_labels)
test_dataset = SNIPSDataset(test_encodings, test_labels)


In [11]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=len(label_to_id))

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [12]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory for model checkpoints
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=16,  # batch size for training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

In [13]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

In [14]:
trainer.evaluate()

{'eval_loss': 1.953208565711975,
 'eval_runtime': 2.2184,
 'eval_samples_per_second': 631.073,
 'eval_steps_per_second': 9.917}

In [15]:
# Train the model
trainer.train()

Step,Training Loss
500,0.8853
1000,0.1511
1500,0.0768
2000,0.0397


TrainOutput(global_step=2454, training_loss=0.23975588814845003, metrics={'train_runtime': 206.2227, 'train_samples_per_second': 190.338, 'train_steps_per_second': 11.9, 'total_flos': 436724962356888.0, 'train_loss': 0.23975588814845003, 'epoch': 3.0})

In [16]:
trainer.evaluate()

{'eval_loss': 0.08386124670505524,
 'eval_runtime': 1.4493,
 'eval_samples_per_second': 965.956,
 'eval_steps_per_second': 15.179,
 'epoch': 3.0}

In [17]:
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [18]:
# Create the pipeline
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0 if device == "cuda" else -1)

In [19]:
results = pipe('Please add Here We Go by Dispatch to my road trip playlist')
print(results)

[{'label': 'LABEL_2', 'score': 0.9998158812522888}]


Create Inverse Mapping

In [20]:
id_to_label = {id: label for label, id in label_to_id.items()}

In [21]:
# Assuming your result is something like [{'label': 'LABEL_4', 'score': 0.999452531337738}]
prediction = results[0]['label']  # This gets
prediction_id = int(prediction.split('_')[-1])  # Extract the number part
true_label = id_to_label[prediction_id]  # Map back to the original label

print("Predicted Label:", true_label)


Predicted Label: AddToPlaylist


In [22]:
trainer.save_model()