The Transformer is the latest advance in Deep Learning architectures that has driven most state-of-the-art progress in NLP since it was first presented in ['Attention is All You Need'](https://arxiv.org/abs/1706.03762). Since then, ever larger models are being made, with parameters running into the billions. 

> Side-note: I think we're inflection point in ML with OpenAI's release of their API - everyone now has easy access to these state-of-the-art language models, we're gonna see an explosion of use-cases + value creation


There's a lot of greats resources with visualisations to help understand the architecture which I'll come back to. First, a brief introduction to what makes Transformers so powerful:

*   *Self-attention*: a mechanism allowing us to learn contextual relationships between different elements in our input sequence, replacing the need for sequential structure (from RNN/LSTM cells).
*   *Multi-headed attention*: multiple heads of the model carry out self-attnetion, attending to information jointly at different parts of the sequence from different subspaces. This allows us to learn a variety of features of language + means the model can scale efficiently with large datasets + unsupervised learning.
* *Transfer learning*: Transformers use the knowledge extracted from a prior setting (usually in the form a language model), which can be unsupervised, then apply or *transfer* to a specific domain, where labelled data is available. This allows a large rich corpus of text to be used in the first pre-training stage, before the model is fine-tuned on custom data. 

*insert pre training photo*

In this post, we'll look at how to fine tune a pre-trained model for the task fo sentiment analysis using Hugging Face's [Transformer](https://huggingface.co/transformers/pretrained_models.html) library, that gives simple access to many of the top transformed-based models (*BERT*, *GPT-2*, *XLNet* etc).  We'll use *DistilBert* here, a lightweight version of the famous *BERT* model with 66 million parameters that's slightly easier to run on a single Colab GPU.

BERT stands for Bidirectional Encoder Representations from Transformers. It uses a *masked* language model where 15% of a sequence's tokens are randomly masked, then the model learns to predict, given a token, what came before *or* after it (the bi-dircectional part). In addition, it has a next sentence prediction objective (did this sentence come after a previous one). BERT differs from a more standard *casual* language model, that predicts the most likely next token in the sequence in a left-to-right direction.


## Setup

In [33]:
!pip install transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)




In [None]:
#from google.colab import drive # import drive from google colab

In [None]:
#ROOT = "/content/drive"     # default location for the drive
#print(ROOT)                 # print content of ROOT (Optional)

#drive.mount(ROOT)           # we mount the google drive at /content/drive

/content/drive

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [34]:
import transformers
from transformers import DistilBertModel, DistilBertTokenizer, AdamW, get_linear_schedule_with_warmup
# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM


from torch import nn, optim
from torch.utils.data import Dataset, DataLoader

from os import path
import requests
import gzip
import zipfile
import numpy as np
from collections import defaultdict

RANDOM_SEED = 0
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [35]:
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("distilbert/distilbert-base-multilingual-cased")
import torch

  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)


tokenizer_config.json:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)


config.json:   0%|          | 0.00/466 [00:00<?, ?B/s]

  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)


vocab.txt:   0%|          | 0.00/996k [00:00<?, ?B/s]

  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)


tokenizer.json:   0%|          | 0.00/1.96M [00:00<?, ?B/s]

  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)
  self.comm = Comm(**args)


model.safetensors:   0%|          | 0.00/542M [00:00<?, ?B/s]

## Loading our Data

For the task of sentiment analysis our model takes a sentence as input and outputs one of five classes representing sentiments (very negative, negative, neutral, positive, very positive). The Stanford Sentiment Treebank (SST-5) is the best-known dataset for this, composed of 11855 such sentences with labels 1-5 already split into train, validation and test sets (of sizes 8544, 1101 and 2210). 

Let's download the dataset, then split into train/val/test sets.

In [36]:
import pandas as pd

# Load the data from the CSV file
data = pd.read_csv("/kaggle/input/senti-distil/merged_file_cleaned.csv")

# Separate features (X) and labels (y)
X = data["text"].tolist()
y = data["target"].tolist()

# Split the data into train, validation, and test sets
from sklearn.model_selection import train_test_split

# Split into train and temporary set (to be further split into validation and test)
X_train_temp, X_test, y_train_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Further split the temporary set into validation and train
X_train, X_val, y_train, y_val = train_test_split(X_train_temp, y_train_temp, test_size=0.25, random_state=42)

# Now you have X_train, y_train, X_val, y_val, X_test, y_test ready for training and evaluation


We need to turn each sequence of words into tokens that serve as inputs into our model. The `DistilBertTokenizer` object does just that. We can see what the tokenizer does to the first sentence in our training set.


In [42]:
PRE_TRAINED_MODEL_NAME = 'distilbert/distilbert-base-multilingual-cased'

#tokenizer = DistilBertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

In [43]:
sample_txt = str(X_train[0])
tokens = tokenizer.tokenize(sample_txt)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f' Sentence: {sample_txt}')
print(f'   Tokens: {tokens}')
print(f'Token IDs: {token_ids}')

 Sentence: నరసింహా(ఆనంద్ బచ్చు) రవి (రాజ్ బాలా) దుర్గ(లౌక్య) లిల్లీ(రాధికా) నలుగురు కలిసి ‘వైట్ టైగర్స్ సొసైటీ’ అనే ఓ సంస్థను స్థాపించి సిటీలో అమ్మాయిలను కిడ్నాప్ చేసి లైంగికంగా వేధించే ఆకతాయలను పట్టుకుని చిత్ర విచిత్రమైన శిక్షలు విధిస్తూ వుంటారు రాత్రి వేళల్లో ఉద్యోగాలు చేసే మహిళలను టార్గెట్ గా చేస్తున్న కొంత మంది కిడ్నాపర్లను పట్టుకొని తమదైనశైలిలో శిక్షలు విధిస్తోంటోంది ఈ నలుగురి బృందం అయితే ఓ కరుడు గట్టిన టీమ్ మాత్రం వీరి కంట పడకుండా తప్పించుకుని తిరుగుతూ మహిళలను కిడ్నాప్ చేస్తూ వుంటుంది వారిని పట్టుకోవడానికి ఈ టీమ్ శత విధాలా ట్రై చేస్తూ వుంటుంది రాత్రి ఏడు గంటల నుంచి తెల్లవారు జామున 4 వరకు జరిగే ఈ స్టోరీలో ఆ కిడ్నాపర్లు ఎవరికోసం అమ్మాయిలను కిడ్నాప్ చేస్తున్నారు ఆ కిడ్నాపర్ల ముఠాను వైట్ టైగర్స్ సొసైటీ పట్టుకుందా చివరకు వారిని ఏం చేశారనేదే మిగతా కథ   కథనం-విశ్లేషణ   మెట్రో పాలిటన్ నగరాల్లో విమెన్ ట్రాఫికింగ్ ఎలా వుంటుందో నిత్యం మన చుట్టూ జరుగుతున్న సంఘటనలే నిదర్శనం లేట్ నైట్ ఆఫీసుకెళ్లొచ్చే అమ్మాయిలకు తగిన రక్షణ అనేదే కరువు అలాంటి అమ్మాయిలను రక్షించడానికి ప్రభుత్వాలు పోలీసు వ్యవస్థ ఎ

The model needs to account for a few special tokens, namely the start + end of a sentence, unknown words and lastly for padding (each sentence has a different length, not well suited to feed into batches for a deep learning model so we set a suitable max length, then pad shorter sentences up to that length with a padding token.)  All this word is done for us using the `encode_plus` method, which we use to build our `Dataset` object.

In [44]:
class SST_Dataset(Dataset):
    def __init__(self, ys, Xs, tokenizer, max_len):
        self.targets = ys
        self.reviews = Xs
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, idx):
        review = str(self.reviews[idx])
        target = self.targets[idx]
        encoding = self.tokenizer.encode_plus(
          review,
          add_special_tokens=True,
          max_length=self.max_len,
          return_token_type_ids=False,
          pad_to_max_length=True,
          return_attention_mask=True,
          return_tensors='pt',
          truncation=True
        )
        return {
          'review_text': review,
          'input_ids': encoding['input_ids'].flatten(),
          'attention_mask': encoding['attention_mask'].flatten(),
          'targets': torch.tensor(target, dtype=torch.long)
        }

Next we create our `Dataloader` objects for training, validation and testing. For each item in the dataset we need the encoded input tokens, masks for where the sentence is not padded and the target value.

In [45]:
def create_data_loader(ys, Xs, tokenizer, max_len, batch_size):
    ds = SST_Dataset(ys, Xs, tokenizer, max_len)
    return DataLoader(ds, batch_size=batch_size)

BATCH_SIZE = 16
MAX_LEN = 128

train_data_loader = create_data_loader(y_train, X_train, tokenizer, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(y_val, X_val, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(y_test, X_test, tokenizer, MAX_LEN, BATCH_SIZE)

## Constructing our model

Now we'ready to build our simple sentiment classification model: we use the output of the `DistilBertModel` - of size 768 - as input into a single fully-connected layer. Dropout is important here for a model with so many parameters (discussed below). (Hugging Face also provide some inbuilt models for downstream tasks that we could have used such as `BertForSequenceClassification` or `BertForQuestionAnswering`)


In [46]:
class SentimentClassifier(nn.Module):
  def __init__(self, n_classes=5):
    super(SentimentClassifier, self).__init__()
    self.bert = DistilBertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
    self.drop = nn.Dropout(p=0.3)
    self.fc = nn.Linear(self.bert.config.hidden_size, n_classes)

  def forward(self, input_ids, attention_mask):
    output = self.bert(input_ids, attention_mask)
    output= output[0][:,0]
    output = self.drop(output)
    return self.fc(output)

The BERT authors had some recommendations for hyperparameters when it comes to fine-tuning:

*   *Batch size*: 16, 32
*   *Learning rate (Adam)*: 5e-5, 3e-5, 2e-5
*   *Number of epochs*: 2, 3, 4

We'll largely stick with these - note that the number of epochs is a lot lower than you might expect for a Deep Learning model. This is since we can easily overfit to the training set with many parameters. We'll check for this by calculating both the training and validation accuracy at each epoch. You can find out more about the Hugging Face's optimisers [here](https://huggingface.co/transformers/main_classes/optimizer_schedules.html).

In [47]:
# initialise model
model = SentimentClassifier()

EPOCHS = 5
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader)*EPOCHS
scheduler = get_linear_schedule_with_warmup(
  optimizer,
  num_warmup_steps=50,
  num_training_steps=total_steps
)
loss_fn = nn.CrossEntropyLoss().to(device)



Let’s continue with writing our helper functions for training our model. 

In [48]:
def evalModel(model, data_loader, loss_fn, N):
    """Evaluate loss and accuracy of model on data_loader"""
    # set model to evaluation mode
    model = model.eval()
    total_loss = 0
    correct = 0

    with torch.no_grad():
        for d in data_loader:
            # get inputs and target 
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)

            # pass through model + make prediction
            outputs = model(input_ids, attention_mask)
            _, pred = torch.max(outputs, dim=1)

            # update counters
            loss = loss_fn(outputs, targets)
            correct += (pred == targets).sum().item()
            total_loss += loss.item()*len(targets)

    # normalise
    return 100*correct/N, total_loss/N

In [49]:
def trainModel(model, trainDataLoader, valDataLoader, loss_fn, optimizer, scheduler, verbose=True):
    """Train sentiment classifier"""
    # structure to store progress of the model at each epoch
    history = defaultdict(list)
    
    # move the model to the gpu
    model = model.to(device)

    for ep in range(EPOCHS):
        total_loss = 0
        correct = 0
        # set model to train mode so dropout and batch normalisation layers work as expected
        model.train()

        for d in trainDataLoader:
            # get inputs for batch
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)

            # calculate output + loss
            model.zero_grad()
            outputs = model(input_ids, attention_mask)
            loss = loss_fn(outputs.squeeze(), targets.long())

            # take gradient step
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            scheduler.step()

            # update losses
            _, pred = torch.max(outputs, dim=1)
            correct += (pred == targets).sum().item()
            total_loss += loss.item()*len(targets)

        #after each epoch, collect statistics
        history['train_acc'].append(100*correct/len(X_train))
        history['train_loss'].append(total_loss/len(X_train))

        # statistics about the validation set
        val_acc, val_loss = evalModel(model, valDataLoader, loss_fn, len(X_val))
        history['val_acc'].append(val_acc)
        history['vall_loss'].append(val_loss)

        #if validation improved, save new best model
        if history['val_acc'][-1] == max(history['val_acc']):
            print ("=> Saving a new best at epoch:", ep)
            torch.save(model.state_dict(), 'best_model_state.bin')
        
        if verbose:
            print('Epoch {}/{}'.format(ep+1, EPOCHS))
            print('-' * 10)
            print('Train loss {} accuracy {}'.format(history['train_loss'][-1], history['train_acc'][-1]))
            print('Val loss {} accuracy {}'.format(val_loss, val_acc))

    #clean up
    model = model.to(torch.device("cpu"))
    del input_ids, attention_mask, targets, outputs, _, pred

    return model, history

Let's train our model and see how it does on our test set!

In [50]:
%%time
best_model, histories = trainModel(model, train_data_loader, val_data_loader, loss_fn, optimizer, scheduler, verbose=True)



=> Saving a new best at epoch: 0
Epoch 1/5
----------
Train loss 1.1004840837584602 accuracy 51.358024691358025
Val loss 0.8337118223861412 accuracy 52.592592592592595
=> Saving a new best at epoch: 1
Epoch 2/5
----------
Train loss 0.7696111040350831 accuracy 49.629629629629626
Val loss 0.7275256037712097 accuracy 52.592592592592595
=> Saving a new best at epoch: 2
Epoch 3/5
----------
Train loss 0.6791808421964999 accuracy 57.77777777777778
Val loss 0.6310601609724539 accuracy 63.7037037037037
=> Saving a new best at epoch: 3
Epoch 4/5
----------
Train loss 0.4871045817747528 accuracy 78.27160493827161
Val loss 0.6164150012864007 accuracy 68.88888888888889
=> Saving a new best at epoch: 4
Epoch 5/5
----------
Train loss 0.343588593546991 accuracy 85.4320987654321
Val loss 0.6035547534624736 accuracy 73.33333333333333
CPU times: user 37.6 s, sys: 6.99 s, total: 44.6 s
Wall time: 45.1 s


In [51]:
test_acc, test_loss = evalModel(best_model.to(device), test_data_loader, loss_fn, len(y_test))

In [52]:
print(test_acc, test_loss)

71.32352941176471 0.5530685817494112


In [57]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Evaluate the model on the test set
def evaluate_model(model, test_data_loader, loss_fn):
    model.eval()
    total_loss = 0
    all_predictions = []
    all_targets = []

    with torch.no_grad():
        for batch in test_data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            targets = batch['targets'].to(device)

            outputs = model(input_ids, attention_mask)
            loss = loss_fn(outputs, targets)

            total_loss += loss.item() * len(targets)
            _, predictions = torch.max(outputs, dim=1)

            all_predictions.extend(predictions.cpu().tolist())
            all_targets.extend(targets.cpu().tolist())

    avg_loss = total_loss / len(test_data_loader.dataset)
    return all_predictions, all_targets, avg_loss

# Evaluate the model
test_predictions, test_targets, test_loss = evaluate_model(best_model, test_data_loader, loss_fn)

# Convert to numpy arrays
test_predictions = np.array(test_predictions)
test_targets = np.array(test_targets)

# Calculate precision, recall, and F1 score
precision = precision_score(test_targets, test_predictions, average='weighted')
recall = recall_score(test_targets, test_predictions, average='weighted')
f1 = f1_score(test_targets, test_predictions, average='weighted')

print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)


Precision: 0.7285945545326884
Recall: 0.7132352941176471
F1 Score: 0.7120709092472461


There we have it! We've fine-tuned DistilBert for the task of sentiment classification to over 40% test accuracy in only 5 epochs. We can see that the pre-training step of this Tranformer model produces versatile, useful and high-quality features representing different semantics of language.

However we note that this doesn't get us close to [state-of-the-art](https://paperswithcode.com/sota/sentiment-analysis-on-sst-5-fine-grained) on this dataset (55%) - the important lesson here is that we haven't tuned any hyperparameters so finding the best optimizer, learning-rate, droupout amount, adding hidden-layers + number of epochs is what will improve our model. We use the validation set to see what hyperparameters get the best accuracy on that - this estimates how our model will generalise to the unseen test set (see your favourite Learning Theory textbooka as to why this works).

Remember that during training we're trying to find the optima a (> 66,000,000 dimension) hypersurface - there's going to many minima so finding the best one requires some searching. Hyperparameter tuning is an important part of solving any problem with Machine Learning, one you just can't avoid.

As a final bit of fun, let's see what our model predicts on some raw text - we need to tokenise our custom input then pass it through our trained classifier. Though not a 5 we see the model can correctly identify the review as positive!

In [53]:
review_text = "నాకు చిత్రం బాగా నచ్చింది ."

In [55]:
encoded_review = tokenizer.encode_plus(
  review_text,
  max_length=MAX_LEN,
  add_special_tokens=True,
  return_token_type_ids=False,
  pad_to_max_length=True,
  return_attention_mask=True,
  return_tensors='pt',
  truncation=True
)

In [56]:
input_ids = encoded_review['input_ids'].to(device)
attention_mask = encoded_review['attention_mask'].to(device)
output = model(input_ids, attention_mask)
_, prediction = torch.max(output, dim=1)
print(f'Review text: {review_text}')
print(f'Sentiment  : {int(prediction.cpu().detach().numpy())}')

Review text: నాకు చిత్రం బాగా నచ్చింది .
Sentiment  : 0


  print(f'Sentiment  : {int(prediction.cpu().detach().numpy())}')


## References & Helpful resources

*   [Visualising Attention](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/) post by Jay Allamar and its follow up [The Illustrated Tranformer](http://jalammar.github.io/illustrated-transformer/)
*   [State of transfer learning in NLP](https://ruder.io/state-of-transfer-learning-in-nlp/)
* [Lecture](https://www.youtube.com/watch?v=5vcj8kSwBCY) at Stanford, also found [this](https://youtu.be/S27pHKBEp30) video helpful
* The Hugging face Transformer library [docs](https://huggingface.co/transformers/)

