<a href="https://colab.research.google.com/github/MJMortensonWarwick/ADA2425/blob/main/9_1_Transformer_Backbone_and_Finetuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers: Backbone and Fine-tuning
In the lecture we have seen the overall architecture and approach of transformers. We have also seen that, in general, transformers have allowed us to build very large-scale models with billions of parameters (maybe even trillions).

How should we interact with such models? It is obviously difficult for us to train our own transformer of this scale, and if we train a much smaller model, can it compete?

Instead, we may want to build on top of a transformer trained at web-scale. This Notebook will explore a couple of methods of doing this.

First, let's install some packages and load some data:

In [None]:
!pip install transformers datasets

import torch
from transformers import AutoTokenizer, AutoModel
from datasets import load_dataset
from torch.utils.data import DataLoader
import torch.nn as nn
import torch.optim as optim

# Load the dataset (the tokeniser is loaded in the next cell)
dataset = load_dataset("imdb") # use the in-built IMDB review dataset

We have used some standard libraries, most of which we have seen. We have also loaded the IMDB dataset which, as the name suggests, is a set of movie reviews, each labelled with its sentiment.

We will now start to use some transformer technology! First of all we need to convert our text into tokens that the model can read. We'll do so with the tokeniser of the popular [BERT](https://arxiv.org/abs/1810.04805) model from Google:

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    # we want the data to be the same size each time but some reviews are shorter
    # if any are shorter than max_length (by default the model maximum, 512 tokens
    # for BERT), padding tokens (id 0) are added to the end until it is max_length
    # long (padding="max_length"). If any are longer than max_length, we truncate
    # them - i.e. delete any tokens after max_length is reached (truncation=True).
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# tokenise the data (convert to token ids)
tokenized_datasets = dataset.map(tokenize_function, batched=True)

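Here we have tokenised our data using the BERT tokeniser ... i.e. converted each text document into a fixed-length sequence of integer token ids, plus an attention mask that marks which positions are real tokens and which are padding. The vectors that represent the semantic meaning of the text (the embeddings) are produced when we pass these ids through the BERT model itself.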
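To make padding and truncation concrete, here is a minimal pure-Python sketch of what happens to a single sequence of token ids. The helper `pad_or_truncate`, the short `max_length` and the example ids are all made up for illustration; the real tokeniser also adds special tokens and works on whole batches.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy illustration only (not the Hugging Face implementation)."""
    kept = token_ids[:max_length]                    # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)                   # number of padding slots needed
    input_ids = kept + [pad_id] * n_pad              # padding: fill the rest with the pad id
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


short_review = [101, 2307, 3185, 102]   # hypothetical token ids
long_review = list(range(1, 11))        # ten hypothetical token ids

print(pad_or_truncate(short_review, max_length=6))
print(pad_or_truncate(long_review, max_length=6))
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed alongside the ids.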

We can use BERT with this data in a variety of ways, two of which we will explore. The first treats BERT's output as basically a feature-engineered dataset, and trains a simple prediction "head" on top of it. I.e. we will not attempt to train the transformer model at all - we will use it with the weights learned through the training regime Google used.

Instead we will just use that as our backbone, with a simple, dense neural network on top that we will train. First, let's build that as a model:

In [None]:
# Create Embeddings-Only Model with Custom Head

class CustomHeadModel(nn.Module):
    def __init__(self, embedding_dim, num_labels):
        super(CustomHeadModel, self).__init__()
        self.embedding_dim = embedding_dim # input
        self.linear1 = nn.Linear(embedding_dim, 64) # linear layer
        self.relu = nn.ReLU() # ReLU activation
        self.linear2 = nn.Linear(64, num_labels) # linear layer - binary output

    def forward(self, embeddings):
        x = self.linear1(embeddings)
        x = self.relu(x)
        return self.linear2(x)

# create a model from the pretrained embeddings
model_embeddings = AutoModel.from_pretrained("bert-base-uncased")

# get the size of the embeddings (number of dimensions)
embedding_dim = model_embeddings.config.hidden_size

num_labels = 2 # Binary classification (IMDB labels: negative = 0, positive = 1)

# setup the custom head to take embeddings size as input and labels as output
# note this time we are doing binary classification with two neurons to keep
# num_labels flexible - something we could change to a higher number of labels.
# This works the same way as output=1. It's slightly less efficient but fine
custom_head = CustomHeadModel(embedding_dim, num_labels)

Here we have a similar codebase to the one we have seen before. The difference is essentially at the input layer. Rather than feeding in raw data, we feed in the embeddings from BERT.

In our model the input layer is based on the shape of the embeddings produced by BERT, which we load into our architecture via [AutoModel](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html). We then add a two-layer DNN: a first layer with 64 neurons and ReLU, followed by an output layer with 2 neurons, one for a "negative" review and one for "positive".

Note, we have previously used only one neuron for binary classification, which would be absolutely fine to do here. Two neurons without softmax achieve the same thing - somewhat less efficiently - as we can just pick whichever outputs the highest number (if the first neuron we say the review is negative; if the second neuron we say positive).
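The equivalence between one output neuron and two can be checked numerically. The sketch below uses made-up logits (not output from our model) and plain Python: a softmax over two logits gives the same "positive" probability as a sigmoid applied to their difference, and the argmax picks the same class with or without the softmax.

```python
import math


def softmax(logits):
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


logits = [-0.4, 1.3]   # made-up outputs of the two neurons: [negative, positive]

probs = softmax(logits)
p_positive_one_neuron = sigmoid(logits[1] - logits[0])   # what a single neuron would model

predicted_from_logits = logits.index(max(logits))   # argmax on raw logits
predicted_from_probs = probs.index(max(probs))      # argmax after softmax

print(probs[1], p_positive_one_neuron)
print(predicted_from_logits, predicted_from_probs)
```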

In [None]:
# Embeddings-only prediction (with training of the classification head)

# Freeze the embeddings
for param in model_embeddings.parameters():
    param.requires_grad = False

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_embeddings.to(device) # add embedding model to GPU
custom_head.to(device) # add custom head to GPU

optimizer = optim.AdamW(custom_head.parameters(), lr=5e-5, weight_decay=0.0005) # Only optimise the custom head
criterion = nn.CrossEntropyLoss()

# use a subset of 1,000 random records for demonstration
small_train_dataset = tokenized_datasets["train"].shuffle().select(range(1000))

# Each record is a dictionary including:
# 1. input_ids (the ids of the tokens)
# 2. attention_mask
# 3. label

# Custom collate function to handle lists which DataLoader doesn't do automatically
def custom_collate(batch):
    input_ids = torch.tensor([item['input_ids'] for item in batch])
    attention_mask = torch.tensor([item['attention_mask'] for item in batch])
    labels = torch.tensor([item['label'] for item in batch])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'label': labels}


train_loader = DataLoader(small_train_dataset, shuffle=True, batch_size=16, collate_fn=custom_collate)

Now we have the model we can set up for training. At the very start of the code, we have made the embeddings part of the model non-trainable (by setting these parameters as "requires\_grad = False"). This means during training only the weights in the custom head will be updated by backprop.
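The effect of freezing can be seen on a toy example. In this sketch two small `nn.Linear` layers stand in for the pretrained model and the custom head (they are hypothetical stand-ins, as are the random data and the `count_trainable` helper); after one optimiser step only the head's weights have moved.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

backbone = nn.Linear(8, 4)   # hypothetical stand-in for the pretrained model
head = nn.Linear(4, 2)       # hypothetical stand-in for the custom head

for param in backbone.parameters():
    param.requires_grad = False   # freeze the backbone


def count_trainable(module):
    """Number of parameters that backprop is allowed to update."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


backbone_before = backbone.weight.clone()
head_before = head.weight.clone()

optimizer = optim.AdamW(head.parameters(), lr=0.01)   # only the head is optimised
x = torch.randn(16, 8)               # random inputs, for illustration only
y = torch.randint(0, 2, (16,))       # random binary labels

loss = nn.CrossEntropyLoss()(head(backbone(x)), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(count_trainable(backbone), count_trainable(head))
print(torch.equal(backbone.weight, backbone_before))   # backbone unchanged?
print(torch.equal(head.weight, head_before))           # head unchanged?
```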

We specify a loss function (note this is now CrossEntropy rather than BinaryCrossEntropy because we are using two neurons).
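For two classes the two losses give the same number, which a short plain-Python check can confirm. The logits below are made up, and the functions are hand-written illustrations of the formulas rather than the PyTorch implementations (`nn.CrossEntropyLoss` takes the raw logits and applies the log-softmax internally).

```python
import math


def cross_entropy(logits, label):
    # -log(softmax(logits)[label]), computed stably via log-sum-exp
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def binary_cross_entropy(p_positive, label):
    # standard binary cross-entropy for a single probability
    return -(label * math.log(p_positive) + (1 - label) * math.log(1 - p_positive))


logits = [0.2, 1.1]   # made-up outputs of the two neurons: [negative, positive]
p_positive = 1.0 / (1.0 + math.exp(-(logits[1] - logits[0])))   # one-neuron view

for label in (0, 1):
    ce = cross_entropy(logits, label)
    bce = binary_cross_entropy(p_positive, label)
    print(label, round(ce, 6), round(bce, 6))
```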

We create a DataLoader as before. The only difference is we have a custom collate function, because each record from the tokenised dataset is effectively a set of lists including ids, attention masks and labels. In the code above this, we load in a subset of the data (1,000 rows) to reduce training time.

In [None]:
num_epochs = 10

custom_head.train()
for epoch in range(num_epochs):
    epoch_loss = 0 # reset the running loss at the start of each epoch
    for batch in train_loader:

        # add data to GPU
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        with torch.no_grad(): # Freeze the bert/embeddings model
            embeddings = model_embeddings(input_ids=input_ids, attention_mask=attention_mask).pooler_output

        optimizer.zero_grad()
        logits = custom_head(embeddings) # get logits for custom head
        loss = criterion(logits, labels) # get loss for custom head
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()

    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {round(epoch_loss/len(train_loader), 4)}")

Training is as before. From the very first epoch the loss stays at about the same level through each of the epochs. What does this tell you?

Now we can make predictions:

In [None]:
example_text = "This is a great movie!"
encoded_input = tokenizer(example_text, return_tensors="pt").to(device)
with torch.no_grad():
    embeddings = model_embeddings(**encoded_input).pooler_output
    logits = custom_head(embeddings)

    # select the neuron with the highest logit as the prediction
    predictions = torch.argmax(logits, dim=1) # dim 0 is the batch, dim 1 holds the logit for each class

    # print result (IMDB labels: 0 = negative, 1 = positive)
    print_label = "positive" if predictions.item() == 1 else "negative"
    print(f"Prediction for '{example_text}': {print_label}")

We predict a very basic example: "This is a great movie!". A review this clear-cut should be easy for our model to predict correctly.
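When turning the winning neuron into a label name, it helps to keep the mapping in one place. The sketch below assumes the label convention of the Hugging Face `imdb` dataset (0 = negative, 1 = positive); the helper `describe_prediction` and the logits are made up for illustration.

```python
import math

# Assumed label convention of the Hugging Face "imdb" dataset
ID2LABEL = {0: "negative", 1: "positive"}


def describe_prediction(logits):
    """Hypothetical helper: return the label name and its softmax probability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    class_id = probs.index(max(probs))   # index of the winning neuron
    return ID2LABEL[class_id], probs[class_id]


label, confidence = describe_prediction([-1.2, 2.3])   # made-up logits
print(f"{label} ({confidence:.2f})")
```

Reporting the probability alongside the label also shows how confident the model is, which a bare argmax hides.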

Now let's try another approach - fine-tuning. In this we will train the whole model:

In [None]:
# Fine-tuning the Model with the Custom Head

# Reset the custom head so it is untrained
custom_head = CustomHeadModel(embedding_dim, num_labels)

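# unfreeze the embeddings model (we froze it above) so that fine-tuning updates it
for param in model_embeddings.parameters():
    param.requires_grad = True

# combine the embeddings model and custom head into one model
model_fine_tune = nn.Sequential(model_embeddings, custom_head)
model_fine_tune.to(device) # add the combined model to GPU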

# this time optimise all parameters
optimizer = optim.AdamW(model_fine_tune.parameters(), lr=5e-5, weight_decay=0.0005)
criterion = nn.CrossEntropyLoss()

# use a subset of 1,000 records for demonstration
small_train_dataset = tokenized_datasets["train"].shuffle().select(range(1000))

# Custom collate function to handle lists which DataLoader doesn't do automatically
def custom_collate(batch):
    input_ids = torch.tensor([item['input_ids'] for item in batch])
    attention_mask = torch.tensor([item['attention_mask'] for item in batch])
    labels = torch.tensor([item['label'] for item in batch])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'label': labels}

train_loader = DataLoader(small_train_dataset, shuffle=True, batch_size=16, collate_fn=custom_collate)

Here we have reset the model - replaced the custom head with a fresh one, so its trained weights are discarded and it starts again from random initial values.

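Now we build a new model, "model\_fine\_tune", with our embeddings and our reset, custom head. Note, for fine-tuning no part of the model should be "frozen" (i.e. "requires\_grad = False") ... we will be training all of it. As we froze the BERT parameters earlier in the Notebook, they must be set back to "requires\_grad = True" before training.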
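Some rough arithmetic shows how much more there is to train once the backbone is included. The head sizes come from `CustomHeadModel` with the 768-dimensional hidden size of `bert-base-uncased`; the BERT figure is the approximate 110 million parameters reported for BERT-base, used here as a round number rather than an exact count.

```python
hidden_size = 768    # hidden size of bert-base-uncased
head_hidden = 64     # neurons in the first layer of the custom head
num_labels = 2

# a linear layer has (inputs x outputs) weights plus one bias per output
linear1_params = hidden_size * head_hidden + head_hidden
linear2_params = head_hidden * num_labels + num_labels
head_params = linear1_params + linear2_params

bert_params_approx = 110_000_000   # approximate, as reported for BERT-base

print(f"custom head: {head_params:,} parameters")
print(f"BERT-base:   ~{bert_params_approx:,} parameters")
print(f"ratio:       ~{round(bert_params_approx / head_params):,}x")
```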

The last parts of the code just rebuild the DataLoader. If you are running the whole Notebook you don't actually need to do this, I've included it just in case you want to copy out for your own project.

Now we can train as before:

In [None]:
model_fine_tune.train()
for epoch in range(num_epochs): # train for num_epochs epochs
  epoch_loss = 0 # reset the running loss at the start of each epoch
  for batch in train_loader:
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['label'].to(device) # already a tensor from custom_collate

    optimizer.zero_grad()

    # Pass input_ids and attention_mask to the first module in the sequence (model_embeddings)
    # and then pass its output to the next module (custom_head)
    outputs = model_fine_tune[1](model_fine_tune[0](input_ids=input_ids, attention_mask=attention_mask).pooler_output)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    epoch_loss += loss.item()

  print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {round(epoch_loss/len(train_loader), 4)}")

Everything here is code as we've previously seen. Compare the training outcomes with those from the frozen model (and note the much slower training process, as we are now also training the embedding model). We can also try a prediction:

In [None]:
example_text = "This is a terrible movie!"
encoded_input = tokenizer(example_text, return_tensors="pt").to(device) # convert example to tokens

model_fine_tune.eval() # switch off dropout for prediction
with torch.no_grad():
    # Pass input_ids and attention_mask to the first module in the sequence (model_embeddings)
    # and then pass its output to the next module (custom_head)
    # **encoded_input unpacks the dictionary of tensors into keyword arguments
    outputs = model_fine_tune[1](model_fine_tune[0](**encoded_input).pooler_output)

    # select the neuron with the highest logit as the prediction
    predictions = torch.argmax(outputs, dim=1)

    # print result (IMDB labels: 0 = negative, 1 = positive)
    print_label = "positive" if predictions.item() == 1 else "negative"
    print(f"Prediction for '{example_text}': {print_label}")

Is this a good prediction? And, either way, what should we conclude?