Blog: https://blog.futuresmart.ai/fine-tuning-hugging-face-transformers-model        
YouTube: https://www.youtube.com/watch?v=9he4XKqqzvE


We fine-tune the DistilBERT model to classify comments as toxic or non-toxic, a binary text-classification task closely related to sentiment analysis, using the following Toxic Comment dataset.

Dataset Link: https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data

- Install the transformers library using `!pip install transformers -U` and import the required libraries:

In [2]:
#!pip install transformers -U

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
import torch
from transformers import TrainingArguments, Trainer
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

- Load the dataset containing toxic comments using `pd.read_csv` and select only the relevant columns:

In [3]:
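# on_bad_lines="skip" replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in 2.0
data = pd.read_csv("toxic-comments.csv", on_bad_lines="skip", engine="python")
# Keep only the comment text and the binary `toxic` label
data = data[['comment_text', 'toxic']]
# Work with the first 1000 rows only
data = data[0:1000]
data.head()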





index,comment_text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


- Split the dataset into training and validation sets using `train_test_split` from sklearn. 80% of the data is used for training and 20% for validation, and `stratify=y` keeps the proportion of toxic comments the same in both splits:

In [4]:
X = list(data["comment_text"])
y = list(data["toxic"])
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,stratify=y)
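
Toxic comments are usually a minority class, so it can be worth confirming that both splits ended up with a similar share of positives. The sketch below is only an illustration of that check: `positive_rate` is a hypothetical helper, and the two short label lists are made-up stand-ins for `y_train` and `y_val`, not the real data.

```python
from collections import Counter

def positive_rate(labels):
    """Return the fraction of labels equal to 1 (hypothetical helper)."""
    counts = Counter(labels)
    return counts[1] / len(labels)

# Toy stand-ins for y_train and y_val (assumed values, not the real data)
toy_train = [0] * 72 + [1] * 8
toy_val = [0] * 18 + [1] * 2

print(f"train positive rate: {positive_rate(toy_train):.2f}")
print(f"val positive rate:   {positive_rate(toy_val):.2f}")
```

In the notebook itself, the same helper could be called on `y_train` and `y_val` directly; with `stratify=y` the two rates should be nearly identical.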


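- Tokenize the input sequences using the `DistilBertTokenizer` from the transformers library. `padding=True` pads every sequence to the longest one in the same call, while `truncation=True` with `max_length=512` cuts longer sequences down to DistilBERT's 512-token limit, so all sequences within each tokenized set have the same length: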

In [5]:
# Load the tokenizer that matches the pretrained checkpoint (the class was imported above)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
X_val_tokenized = tokenizer(X_val, padding=True, truncation=True, max_length=512)
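
To make the effect of padding and truncation concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: `pad_and_truncate` is a hypothetical function, the token ids are made up, and `max_length=4` stands in for 512 to keep the example small. Unlike this sketch, the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Toy illustration of padding=True with truncation=True (not the tokenizer's code)."""
    # Truncation: cut every sequence to at most max_length tokens
    truncated = [ids[:max_length] for ids in batch]
    # Padding: extend every sequence to the longest one remaining in this batch
    target = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # Attention mask: 1 for real tokens, 0 for padding positions
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for three "sentences" of different lengths
batch = [[101, 7, 102], [101, 7, 8, 9, 10, 102], [101, 102]]
encoded = pad_and_truncate(batch, max_length=4)
print(encoded["input_ids"])       # every row now has 4 entries
print(encoded["attention_mask"])  # zeros mark the padded positions
```

Because the padded length depends on the longest sequence in each call, `X_train_tokenized` and `X_val_tokenized` may end up with different widths. Each set is still internally consistent.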


- Create a custom PyTorch dataset using the tokenized input sequences and corresponding labels:

In [6]:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

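    def __getitem__(self, idx):
        # Build one sample: a tensor per encoding field (e.g. input_ids, attention_mask)
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Compare against None so labels passed as a NumPy array or tensor do not raise
        # an ambiguous truth-value error
        if self.labels is not None:
            item["labels"] = torch.tensor(self.labels[idx])
        return item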

    def __len__(self):
        return len(self.encodings["input_ids"])

train_dataset = Dataset(X_train_tokenized, y_train)
val_dataset = Dataset(X_val_tokenized, y_val)


In this snippet, we define a custom dataset class `Dataset` that takes the tokenized encodings and, optionally, the labels as input. The `__getitem__` method returns a dictionary of tensors for one input sample, and `__len__` returns the number of samples. We then create `train_dataset` and `val_dataset` from the tokenized texts and their labels.

- Define a function to compute evaluation metrics (accuracy, recall, precision, and F1-score) for the model:

In [7]:
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred)
    precision = precision_score(y_true=labels, y_pred=pred)
    f1 = f1_score(y_true=labels, y_pred=pred)

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
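
As a worked example of what these four numbers mean, the sketch below recomputes them by hand from confusion counts for a handful of made-up logits. It is a plain-Python illustration under the assumption of binary labels with 1 as the positive (toxic) class. `binary_metrics` is a hypothetical helper rather than the sklearn implementation, although for this kind of input the values should agree with the sklearn functions used above.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # toxic, predicted toxic
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-toxic, predicted toxic
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # toxic, predicted non-toxic
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-toxic, predicted non-toxic
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up logits for five comments: each row is [score_class_0, score_class_1]
logits = [[2.0, -1.0], [0.5, 1.5], [1.0, 0.2], [-0.3, 0.9], [3.0, 0.1]]
labels = [0, 1, 1, 1, 0]

# Pick the highest-scoring class per row (same role as np.argmax(pred, axis=1))
preds = [row.index(max(row)) for row in logits]
metrics = binary_metrics(labels, preds)
print(preds)
print(metrics)
```

Here one toxic comment is missed and no clean comment is flagged, so precision is perfect while recall drops to two thirds. On an imbalanced dataset like this one, that gap is often more informative than accuracy alone.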


- Define the training arguments for the Trainer using `TrainingArguments`:

In [9]:
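args = TrainingArguments(
    # Directory where checkpoints are written; it shares its name with the Hub model id,
    # so a distinct name may be safer when re-running the notebook
    output_dir="distilbert-base-uncased",
    num_train_epochs=1,              # one pass over the training set
    per_device_train_batch_size=8,   # samples per batch on each device
)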


- Define the Trainer using the `Trainer` class from the transformers library, passing in the model, the training arguments, the training and validation datasets, and the evaluation metrics function. Then, we train the model using `trainer.train()`:

In [10]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
model = model.to(device)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

trainer.train()


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier



TrainOutput(global_step=100, training_loss=0.24767805099487306, metrics={'train_runtime': 797.0641, 'train_samples_per_second': 1.004, 'train_steps_per_second': 0.125, 'total_flos': 105973918924800.0, 'train_loss': 0.24767805099487306, 'epoch': 1.0})
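
The `global_step=100` in this output can be reproduced with a little arithmetic. The sketch below assumes that all 1000 rows were loaded, that training ran on a single device, and that gradient accumulation was left at its default of 1. The variable names are illustrative only.

```python
import math

num_rows = 1000     # rows kept by data[0:1000]
test_size = 0.2     # fraction held out by train_test_split
batch_size = 8      # per_device_train_batch_size
num_epochs = 1      # num_train_epochs

# Validation rows: 20% of 1000 (an exact integer for these numbers)
num_val = round(num_rows * test_size)
num_train = num_rows - num_val
# Each optimizer step consumes one batch; the last batch may be partial
steps_per_epoch = math.ceil(num_train / batch_size)
total_steps = steps_per_epoch * num_epochs
print(num_train, steps_per_epoch, total_steps)
```

The 800 training samples are also consistent with the reported throughput: roughly 1.004 samples per second over about 797 seconds.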

- Evaluate the model on the validation set using `trainer.evaluate()`, which reports the metrics defined in `compute_metrics`, and then make a prediction on a new input:

In [11]:
trainer.evaluate()

text = "That was good point"
inputs = tokenizer(text, padding=True, truncation=True, return_tensors='pt').to(device)
outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predictions = predictions.cpu().detach().numpy()
predictions

array([[0.978746  , 0.02125401]], dtype=float32)

The model assigns a probability of about 0.98 to class 0 (non-toxic), indicating that the text is not a toxic comment.
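
The two numbers in the array are class probabilities in label order, so index 0 corresponds to `toxic = 0` and index 1 to `toxic = 1`. The following plain-Python sketch shows how logits become probabilities and how a readable label could be attached. The `id2label` mapping and the logits are assumptions made for illustration, not output from the model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed label names, following the dataset's `toxic` column (0 or 1)
id2label = {0: "non-toxic", 1: "toxic"}

toy_logits = [1.9, -1.9]  # made-up logits, not real model output
probs = softmax(toy_logits)
pred_id = probs.index(max(probs))  # index of the most probable class
print(id2label[pred_id], round(probs[pred_id], 3))
```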

We save the fine-tuned model using `trainer.save_model()` to a directory called `distilbert-base-uncased-finetuned-toxic-comments`. Later, we load the saved model using `DistilBertForSequenceClassification.from_pretrained()` and move it to the available device (GPU if present, otherwise CPU):

In [12]:
trainer.save_model('distilbert-base-uncased-finetuned-toxic-comments')


We can now load the saved model and make predictions on new inputs. We tokenize an example text, pass the tokenized inputs through the loaded model (`model_2`), and apply the `softmax` function to the logits to obtain class probabilities. The predictions are then converted to a NumPy array for further processing:

In [13]:
model_2 = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-toxic-comments")
model_2.to(device)

text = "go to hell"
inputs = tokenizer(text, padding=True, truncation=True, return_tensors='pt').to(device)
outputs = model_2(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predictions = predictions.cpu().detach().numpy()
predictions

array([[0.19095181, 0.8090482 ]], dtype=float32)

Here the model assigns a probability of about 0.81 to class 1 (toxic), indicating that the text is a toxic comment.

This code demonstrates how to fine-tune a pre-trained DistilBERT model on a custom dataset using the Hugging Face Transformers library. The model is trained to classify toxic comments, and evaluation metrics such as accuracy, recall, precision, and F1-score are computed to assess the model's performance. Finally, the trained model is saved and can be loaded later to make predictions on new inputs.