- Improving RNNs with an attention mechanism;
- Introducing the stand-alone self-attention mechanism;
- Understanding the original transformer architecture;
- Comparing transformer-based large-scale language models;
- Fine-tuning BERT for sentiment classification.

# Adding an attention mechanism to RNNs

Attention mechanisms help predictive models focus on certain parts of the input sequence more than others.

Let's consider the traditional RNN model for a seq2seq task such as language translation, which parses the entire input
sequence before producing the translation.

The RNN parses the whole input sentence before producing the first output, since translating a sentence word by word
would likely result in grammatical errors. However, one limitation of this seq2seq approach is that the RNN tries to
remember the entire input sequence via a single hidden unit before translating it, which will probably result in a loss
of information, especially for long sequences.

Thus, similar to how humans translate sentences, it may be beneficial to have access to the whole input sequence at each
time step. An attention mechanism lets the RNN access all input elements at each given time step. To avoid overwhelming
the model with all input sequence elements at each time step, attention mechanisms assign a different attention weight
to each input element, so that the model focuses on the most relevant elements.

## The original attention mechanism for RNNs

Given an input sequence, $x$, the attention mechanism assigns a weight to each element $x^{(i)}$ and helps the model
identify which part of the input it should focus on.

The first RNN (RNN#1) is a bidirectional RNN that generates context vectors, $c_i$. We can think of a context vector as
an augmented version of the input vector $x^{(i)}$: it also incorporates information from all other input elements via
an attention mechanism.

Then a second RNN uses this context vector, prepared by the first RNN, to generate the outputs.

### Preprocessing the inputs using a bidirectional RNN

RNN#1 processes the input sequence $x$ in both the forward and the backward direction. The backward pass captures
additional information, since the current input may depend on sequence elements that come either before or after it in
a sentence.

Consequently, we have two hidden states for each input sequence element. For instance, for the second input sequence
element $x^{(2)}$, we obtain the hidden state $h_F^{(2)}$ from the forward pass and the hidden state $h_B^{(2)}$ from
the backward pass. These two hidden states are then concatenated into the hidden state $h^{(2)}$. More generally, we
consider the concatenated hidden state $h^{(j)}$ as the "annotation" of the $j$th source word, since it contains
information about that word from both directions.
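
As a rough illustration of these annotations, the sketch below runs a bidirectional GRU over a random sequence. The
choice of a GRU cell, the toy sizes, and the variable names are assumptions made for the example; the notes do not fix
any of them.

```python
import torch

torch.manual_seed(123)

T, embed_dim, hidden_dim = 5, 16, 8   # assumed toy sizes

# A bidirectional GRU stands in for RNN#1 (the cell type is an assumption)
rnn1 = torch.nn.GRU(input_size=embed_dim, hidden_size=hidden_dim,
                    bidirectional=True, batch_first=True)

x = torch.randn(1, T, embed_dim)      # one sequence of T embedded tokens
annotations, _ = rnn1(x)              # shape: (1, T, 2 * hidden_dim)

# For every position, the last dimension is [forward state ; backward state]
h_forward = annotations[0, :, :hidden_dim]
h_backward = annotations[0, :, hidden_dim:]

print(annotations.shape)   # torch.Size([1, 5, 16])
```

Each row of `annotations[0]` plays the role of one concatenated hidden state $h^{(j)}$.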

### Generating outputs from context vectors

The RNN#2 is the main RNN that is generating the outputs. In addition to the hidden states, it receives so-called 
context vectors as input. Think about the context vector as a weighted version of the hidden states obtained from RNN#1:
$$c_i = \sum_{j=1}^T\alpha_{ij}h^{(j)}$$
Here $\alpha_{ij}$ represents the attention weights over the input sequence, $j=1,\dots,T$, in the context of the $i$th
output position. **Note** that each output position $i$ has its own set of attention weights.
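
To make the weighted sum concrete, here is a tiny numeric sketch. The annotation values and attention weights are made
up for illustration.

```python
import torch

# Three annotations h^(1), h^(2), h^(3), each of dimension 2 (toy values)
h = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])

# Attention weights alpha_{i1}, alpha_{i2}, alpha_{i3} for one step i (they sum to 1)
alpha_i = torch.tensor([0.5, 0.25, 0.25])

# c_i = sum_j alpha_ij * h^(j)
c_i = alpha_i @ h
print(c_i)   # tensor([0.7500, 0.5000])
```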

Now, RNN#2 receives the aforementioned context vector $c_i$ at each time step $i$ as input. The hidden state $s^{(i)}$
depends on the previous hidden state $s^{(i-1)}$, the previous target word $y^{(i-1)}$, and the context vector $c_i$,
which are used to generate the predicted output $o^{(i)}$ for the target word $y^{(i)}$ at time $i$.
**Notice** that RNN#2 doesn't directly use the input sequence $x$.

### Computing the attention weights

Each attention weight has two subscripts: $j$ refers to the index position of the input and $i$ corresponds to the
output index position. The attention weight $\alpha_{ij}$ is a normalized version of the alignment score $e_{ij}$, where
the alignment score evaluates how well the input around position $j$ matches the output at position $i$.

The attention weight is computed by normalizing the alignment scores ($e$ here is the score, not Euler's number) as
follows:
$$\alpha_{ij}=\frac{\exp(e_{ij})}{\sum_{k=1}^T\exp(e_{ik})}$$

This is simply the softmax function, so the attention weights for a given $i$ sum up to 1.

> The alignment score can be calculated in various ways, from a plain dot product of the two vectors to dot products
> of the vectors transformed by learnable matrices.
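
The two scoring variants mentioned in the note can be sketched as follows. The sizes, the random tensors, and the names
`s_prev` and `W` are assumptions for illustration; this is not the scoring function of any particular paper.

```python
import torch

torch.manual_seed(123)

T, d = 4, 6                       # assumed toy sizes
h = torch.randn(T, d)             # annotations h^(1..T) from RNN#1
s_prev = torch.randn(d)           # previous decoder hidden state s^(i-1)

# Variant 1: plain dot product, e_ij = s^(i-1) . h^(j)
e_dot = h @ s_prev                # shape: (T,)

# Variant 2: dot product through a learnable matrix W, e_ij = s^(i-1)^T W h^(j)
W = torch.nn.Parameter(torch.randn(d, d) * 0.1)
e_general = h @ (W.T @ s_prev)    # shape: (T,)

# The softmax turns a score vector into attention weights alpha_ij
alpha_dot = torch.softmax(e_dot, dim=0)
alpha_manual = torch.exp(e_dot) / torch.exp(e_dot).sum()

print(torch.allclose(alpha_dot, alpha_manual))   # True
```

In the second variant, `W` would be trained together with the rest of the model.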

# Introducing the self-attention mechanism

We can think of the previously discussed attention mechanism as an operation that connects two different modules, that
is, the encoder and decoder of the RNN.
Self-attention focuses only on the input and captures only dependencies between the input elements, without connecting
two modules.

## Starting with a basic form of self-attention

Let's assume that we have an input sequence $x$ and an output sequence $z$ ($o$ this time won't be the output of the 
self-attention mechanism, but will be the output of the model that will use this mechanism).

For a seq2seq task, the goal of self-attention is to model the dependencies of the current input element on all other
input elements. To achieve this, self-attention mechanisms are composed of three stages:
1. We derive importance weights based on the similarity between the current element and all other elements in the
sequence;
2. We normalize the weights, which usually involves using the softmax function;
3. We use the weights in combination with the corresponding sequence elements to compute the attention value.

More formally, the output of self-attention, $z^{(i)}$, is the weighted sum of all $T$ input sequence elements,
$x^{(j)}$. For instance, for the $i$th input element, the corresponding output value is computed as follows:
$$z^{(i)}=\sum_{j=1}^T\alpha_{ij}x^{(j)}$$

Hence, we can think of $z^{(i)}$ as a context-aware embedding vector for the input vector $x^{(i)}$ that involves all
other input sequence elements weighted by their respective attention weights. Here, the attention weights,
$\alpha_{ij}$, are computed based on the similarity between the current input element, $x^{(i)}$, and all other elements
in the input sequence.

### Calculating similarity
We compute the dot product between the current input element $x^{(i)}$, and another element in the input sequence
$x^{(j)}$:
$$\omega_{ij}=x^{(i)T}x^{(j)}$$

In [1]:
import torch
sentence = torch.tensor([0,7,1,2,5,6,4,3])
sentence

tensor([0, 7, 1, 2, 5, 6, 4, 3])

In [2]:
# Let's assume that we already encoded this sentence into a real-number vector representation via an embedding layer
torch.manual_seed(123)
embed = torch.nn.Embedding(10,16)
embedded_sentence = embed(sentence).detach()
embedded_sentence.shape

torch.Size([8, 16])

In [3]:
# We can now compute omega_ij as the dot product between the ith and the jth word embeddings
omega = embedded_sentence @ embedded_sentence.T

In [4]:
# Now we can obtain the attention weights normalizing the omegas
attention_weights = omega.softmax(dim=1)
attention_weights.shape

torch.Size([8, 8])

Note that each element represents an attention weight. If we are processing the $i$th input word, the $i$th row of this
matrix contains the corresponding attention weights for all words in the sentence.

In [5]:
# Notice how they all add up to one
attention_weights.sum(dim=1)

tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])

In [6]:
# Let's compute the context vectors as the attention-weighted sum of the inputs
context_v = attention_weights @ embedded_sentence
context_v.shape

torch.Size([8, 16])

## Parameterizing the self-attention mechanism: scaled dot-product attention

To make the self-attention mechanism more flexible and amenable to model optimization, we will introduce three
additional weight matrices that can be fit as model parameters during model training. We denote these three weight
matrices as $U_q$, $U_k$ and $U_v$. They are used to project the inputs into query, key and value sequence elements as
follows:
- Query sequence $q^{(i)}=U_qx^{(i)}$
- Key sequence $k^{(i)}=U_kx^{(i)}$
- Value sequence $v^{(i)}=U_vx^{(i)}$

> The terms query, key and value that were used in the original transformer paper are inspired by information retrieval
> systems and databases. For example, if we enter a query, it is matched against the keys, and the values associated
> with the matching keys are retrieved.

In [9]:
torch.manual_seed(123)
d = embedded_sentence.shape[1] # The embedding dimensionality
U_query = torch.rand(d,d)
U_key = torch.rand(d,d)
U_value = torch.rand(d,d)

In [18]:
# Using the query projection matrix, we can compute the query sentence
query = embedded_sentence @ U_query

# In similar fashion we can compute the key and value sentences
key = embedded_sentence @ U_key
value = embedded_sentence @ U_value
print(query.shape)
print(key.shape)
print(value.shape)

torch.Size([8, 16])
torch.Size([8, 16])
torch.Size([8, 16])


In [20]:
# Now let's compute the unnormalized attention weight matrix
unn_att_weights = query @ key.T

# Now we can normalize it by scaling with 1/sqrt(d) and applying the softmax
att_weights = (unn_att_weights/d**0.5).softmax(dim=-1)
print(att_weights.shape)
print(att_weights.sum(dim=1))

torch.Size([8, 8])
tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])


Note that scaling by $\frac{1}{\sqrt{d}}$ ensures that the Euclidean length of the weight vectors will be
approximately in the same range.
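
One way to see the effect of this scaling is to look at the spread of raw dot products for different dimensionalities.
The sketch below assumes query and key components that are independent with zero mean and unit variance, which will not
hold exactly for trained projections; the sample size and the values of `d` are arbitrary.

```python
import torch

torch.manual_seed(123)

n_samples = 2000
stds = {}
for d in (16, 64, 1024):
    q = torch.randn(n_samples, d)      # assumed unit-variance query components
    k = torch.randn(n_samples, d)      # assumed unit-variance key components
    raw = (q * k).sum(dim=1)           # n_samples independent dot products
    scaled = raw / d**0.5              # the scaling used in scaled dot-product attention
    stds[d] = (raw.std().item(), scaled.std().item())
    print(f"d={d:5d} | std of raw scores: {stds[d][0]:7.2f} | std after scaling: {stds[d][1]:.2f}")
```

Under these assumptions, the standard deviation of the raw scores grows roughly like $\sqrt{d}$, while the scaled scores
stay close to 1 regardless of $d$, which keeps the softmax from saturating for large embedding sizes.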

In [21]:
# Finally we can compute the output by doing the weighted sum of the values (using the attention weights)
out = att_weights @ value
out.shape

torch.Size([8, 16])

# Attention is all we need: introducing the original transformer architecture

Originally, the intention behind using an attention mechanism was to improve the text generation capabilities of RNNs
when working with long sentences. However, researchers found that an attention-based language model was even more
powerful when the recurrent layers were removed. This led to the transformer architecture.

<img src="./attention_research_1.webp" width=300>

## Encoding context embeddings via multi-head attention

The overall goal of the encoder block is to take in a sequential input $X$ and map it into a continuous representation
$Z$ that is then passed on to the decoder.

The encoder is a stack of six identical layers (six being the hyperparameter value that gave the best performance in
the original paper). Inside each layer there are two sublayers: one computes multi-head self-attention, and the other
is a fully connected layer.

Multi-head self-attention is a simple modification of the scaled dot-product attention covered earlier. In the context
of multi-head attention, we can think of the set of query, key and value matrices as one attention head. As indicated by
the name, in multi-head attention, we now have multiple of such heads, similar to how convolutional neural networks can
have multiple kernels.

In [31]:
# To demonstrate the concept with code
torch.manual_seed(123)
h = 8
d = embedded_sentence.shape[1]
multi_U_query = torch.rand(h,d,d)
multi_U_key = torch.rand(h,d,d)
multi_U_value = torch.rand(h,d,d)

# Projecting the inputs into per-head queries, keys, and values
multi_query = embedded_sentence @ multi_U_query
multi_key = embedded_sentence @ multi_U_key
multi_value = embedded_sentence @ multi_U_value

print("Scaling:")
print(multi_query.shape)
print(multi_key.shape)
print(multi_value.shape)

# Computing attention
unn_att_weights = multi_query @ multi_key.permute((0,2,1))
att_weights = (unn_att_weights / d**0.5).softmax(dim=-1)

print("\nAttention weights:")
print(att_weights.shape)
print(att_weights.sum(dim=-1))

# Weighted sum with attention
out = att_weights @ multi_value
print("\nWeighted sum:")
print(out.shape)

Scaling:
torch.Size([8, 8, 16])
torch.Size([8, 8, 16])
torch.Size([8, 8, 16])

Attention weights:
torch.Size([8, 8, 8])
tensor([[1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000]])

Weighted sum:
torch.Size([8, 8, 16])


In [32]:
# Now we just have to concatenate attention heads
out = out.permute(1,0,2).reshape(embedded_sentence.shape[0],h*d)
print(out.shape)

torch.Size([8, 128])


In [33]:
# After concatenation we have to map back the output of the multi head attention mechanism to a single vector of
# dimensionality d for each element of the sequence
linear = torch.nn.Linear(h*d,d)
context_vector = linear(out)
print(context_vector.shape)

torch.Size([8, 16])


Even if multi-head attention sounds computationally expensive, notice that the computation can all be done in parallel
because there are no dependencies between the multiple heads.
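
For comparison, PyTorch ships a ready-made layer, `torch.nn.MultiheadAttention`. The sketch below applies it to a random
stand-in for the embedded sentence. Note one difference from the manual code above: this layer splits the embedding
dimension across the heads (here $16/8=2$ dimensions per head) instead of giving every head its own $d \times d$
projection.

```python
import torch

torch.manual_seed(123)

T, d, h = 8, 16, 8
x = torch.randn(1, T, d)   # (batch, sequence, embedding); random stand-in for the embeddings

mha = torch.nn.MultiheadAttention(embed_dim=d, num_heads=h, batch_first=True)

# Self-attention: the same tensor serves as query, key, and value
out, weights = mha(x, x, x)

print(out.shape)       # torch.Size([1, 8, 16])
print(weights.shape)   # torch.Size([1, 8, 8]); averaged over the heads by default
```

The layer already includes the final linear projection back to dimensionality `d`, so no separate concatenation step is
needed.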

## Learning a language model: decoder and masked multi-head attention

Similar to the encoder, the decoder also contains several repeated layers. Besides the two sublayers that we have
already introduced in the previous encoder section, each repeated layer also contains a masked multi-head attention
sublayer.

Masked attention is a variation of the original attention mechanism, where masked attention only passes a limited input
sequence into the model by "masking" out a certain number of words. For example, if we are building a language
translation model with a labeled dataset, at sequence position $i$ during the training procedure, we only feed in the
correct output words from positions $1,...,i-1$. All other words are hidden from the model to prevent the model from
cheating.
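
A common way to implement this masking is to set the attention scores of future positions to $-\infty$ before the
softmax. The sketch below uses random queries and keys as stand-ins and assumes that the decoder input is shifted right
by one position, so that letting row $i$ attend to columns $1,...,i$ corresponds to seeing only the output words before
position $i$.

```python
import torch

torch.manual_seed(123)

T, d = 5, 16
q = torch.randn(T, d)   # random stand-ins for the decoder queries
k = torch.randn(T, d)   # random stand-ins for the decoder keys

scores = q @ k.T / d**0.5

# True above the diagonal marks the "future" positions j > i
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# Scores set to -inf receive exactly zero weight after the softmax
masked_weights = scores.masked_fill(future, float("-inf")).softmax(dim=-1)

print(masked_weights)
```

The printed matrix is lower triangular: every row still sums to 1, but all weight is placed on the current and earlier
positions.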

The decoder works like this:
1. It receives the output embeddings and passes them to the first masked multi-head attention layer;
2. Then the second layer receives both encoded inputs from the encoder block and the output of the masked multi-head
attention layer;
3. Finally, we pass the multi-head attention outputs into a fully connected layer that generates the overall model
output (a probability vector corresponding to the output words).

Comparing the decoder with the encoder block, the main difference is the range of sequence elements that the model can
attend to. In the encoder, for each given word, the attention is calculated across all the words in a sentence, which
can be considered as a form of bidirectional input parsing. The decoder also receives the bidirectionally parsed inputs
from the encoder. However, when it comes to the output sequence, the decoder only considers those elements that are
preceding the current input position, which can be interpreted as a form of unidirectional input parsing.

## Implementation details: positional encodings and layer normalization

Positional encodings help with capturing information about the input sequence ordering and are a crucial part of
transformers because both scaled dot-product attention layers and fully connected layers are permutation-invariant
(without positional encoding, the order of words is ignored and does not make any difference to the attention-based
encodings).

Transformers enable the same words at different positions to have slightly different encodings by adding a vector of
small values to the input embedding at the beginning of the encoder and decoder blocks. In the context of the original
transformer paper they used the so-called sinusoidal encoding:
$$
PE_{(i,2k)} = \sin\left(i/10000^{2k/d_{model}}\right) \\
PE_{(i,2k+1)} = \cos\left(i/10000^{2k/d_{model}}\right)
$$

Here $i$ is the position of the word, $k$ indexes the sine/cosine pairs of dimensions in the encoding vector, and
$d_{model}$ is the length of the encoding vector. We choose $d_{model}$ to match the dimension of the input word
embedding so that the positional encoding and word embeddings can be added together.
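
A minimal sketch of the sinusoidal encoding follows. The function name `sinusoidal_encoding` and the toy sizes are
hypothetical, and the code assumes an even encoding length.

```python
import torch

def sinusoidal_encoding(num_positions, d_model):
    """Return a (num_positions, d_model) matrix of sinusoidal encodings (d_model must be even)."""
    positions = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # (P, 1)
    two_k = torch.arange(0, d_model, 2, dtype=torch.float32)                   # 0, 2, 4, ...
    angles = positions / (10000 ** (two_k / d_model))                          # (P, d_model / 2)

    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions 2k
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions 2k+1
    return pe

pe = sinusoidal_encoding(num_positions=8, d_model=16)
print(pe.shape)    # torch.Size([8, 16])
print(pe[0, :4])   # position 0 alternates sin(0) = 0 and cos(0) = 1
```

Because `pe` has the same shape as an embedded sentence of 8 tokens with 16 dimensions, the two can simply be added
elementwise.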

In general, there are two types of positional encodings, an absolute one (as shown in the previous formula) and a
relative one. Absolute positional encodings are fixed vectors for each given position, while relative encodings only
maintain the relative position of words and are invariant to sentence shift.

Next, let's look at the layer normalization mechanism. While batch normalization is a popular choice in computer vision
contexts, layer normalization is the preferred choice in NLP contexts, where sentence lengths can vary.

While batch normalization computes its statistics for each feature independently across all the training examples in a
minibatch (the focus is on each feature for all the training examples), the layer normalization used in transformers
computes the normalization statistics across all feature values, independently for each training example (the focus is
on one training example for all its features).

Since layer normalization computes mean and standard deviation for each training example, it relaxes minibatch size
constraints or dependencies. In contrast to batch normalization, layer normalization is thus capable of learning from
data with small minibatch sizes and varying lengths.
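
The per-example nature of layer normalization can be checked with a few lines of PyTorch. The tensor below is random toy
data, and the manual computation is only a sketch of what `torch.nn.LayerNorm` does with its default, freshly
initialized scale and shift.

```python
import torch

torch.manual_seed(123)

x = torch.randn(4, 16)   # 4 training examples with 16 features each (toy data)

layer_norm = torch.nn.LayerNorm(16)   # learnable scale and shift start at 1 and 0
y = layer_norm(x)

# Manual version: statistics over the feature dimension, one pair per example
mean = x.mean(dim=1, keepdim=True)
var = x.var(dim=1, unbiased=False, keepdim=True)
y_manual = (x - mean) / torch.sqrt(var + layer_norm.eps)

print(torch.allclose(y, y_manual, atol=1e-5))                # True
# Normalizing one example on its own matches its result inside the batch
print(torch.allclose(layer_norm(x[:1]), y[:1], atol=1e-5))   # True
```

The second check shows why the minibatch size does not matter: no statistic is shared between examples.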

However, note that the transformer architecture does not have varying-length inputs (sequences are padded when needed),
and there is no recurrence in the model. Transformers are usually trained on very large text corpora, which requires
parallel computation. This can be challenging to achieve with batch normalization, which has a dependency between
training examples (layer normalization has no such dependency).

# Building large-scale language models by leveraging unlabeled data

One common thing about large-scale transformers is that they are pre-trained on very large, unlabeled datasets and then
fine-tuned for their respective target tasks.

## Pre-training and fine-tuning transformer models

Language translation is a supervised task and requires a labeled dataset, which can be very expensive to obtain. The
lack of large, labeled datasets is a long-lasting problem in deep learning, especially for models like the transformer,
which are even more data hungry than other deep learning architectures. However, given that large amounts of text are
generated every day, an interesting question is how we can use such unlabeled data for improving the model training.

We can generate "labels" for supervised learning from plain text itself, for example doing a next-word prediction task.
This enables the model to learn the probability distribution of words and can form a strong basis for becoming a
powerful language model. This technique is called unsupervised pre-training.
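
As a small illustration of how plain text supplies its own labels, the sketch below builds next-word prediction pairs
from a toy sentence. The sentence, the whitespace tokenization, and the vocabulary construction are simplifying
assumptions; real models use subword tokenizers.

```python
# Hypothetical toy corpus and whitespace tokenization, for illustration only
text = "the quick brown fox jumps over the lazy dog"
tokens = text.split()

vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
ids = [vocab[word] for word in tokens]

# Inputs are all tokens but the last; targets are the same sequence shifted by one
inputs, targets = ids[:-1], ids[1:]

for context_end in range(1, 4):
    context = tokens[:context_end]
    label = tokens[context_end]
    print(context, "->", label)
# ['the'] -> quick
# ['the', 'quick'] -> brown
# ['the', 'quick', 'brown'] -> fox
```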

The main idea of pre-training is to make use of plain text and then transfer and fine-tune the model to perform some
specific tasks for which a (smaller) labeled dataset is available. Now, there are many pre-training techniques (the next
word prediction technique can be thought of as a unidirectional pre-training approach).

With the representations that can be obtained from the pre-trained model, there are mainly two strategies for
transferring and adapting a model to a specific task:
1. Feature-based approach: it uses pre-trained representations as additional features to a labeled dataset. This
requires us to learn how to extract sentence features from the pre-trained model. In other words, we can think of the
feature-based approach as a model-based feature extraction technique similar to principal component analysis;
2. Fine-tuning approach: updates the pre-trained model parameters in a regular supervised fashion via backpropagation.
Unlike the feature-based method, we usually also add another fully connected layer to the pre-trained model, to
accomplish certain tasks such as classification, and then update the whole model based on the prediction performance on
the labeled training set.
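
In PyTorch terms, the practical difference between the two strategies is which parameters receive gradient updates. The
sketch below uses a tiny, randomly initialized network as a hypothetical stand-in for a pre-trained encoder; the layer
sizes and the helper `trainable_parameters` are assumptions for illustration.

```python
import torch

torch.manual_seed(123)

# Hypothetical stand-ins: a small "pre-trained" encoder and a new task-specific head
encoder = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU())
classifier = torch.nn.Linear(32, 2)

def trainable_parameters(*modules):
    """Count the parameters that would receive gradient updates."""
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)

# Fine-tuning: encoder and classifier are both updated
n_finetune = trainable_parameters(encoder, classifier)

# Feature-based: freeze the encoder and train only the classifier on its outputs
for p in encoder.parameters():
    p.requires_grad = False
n_feature_based = trainable_parameters(encoder, classifier)

print(n_finetune, n_feature_based)   # 610 66
```

With the encoder frozen, its outputs act as fixed features and only the 66 classifier parameters are learned.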

## Using GPT-2 to generate new text

In [34]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation',model='gpt2')

Device set to use mps:0


In [41]:
set_seed(12)
generator("The best stock to have in your portfolio is",max_length=50,truncation=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "The best stock to have in your portfolio is that you want to put a good amount of the price right into that asset, so that's where you want the best performance.\n\nRounding out the list above are investments based primarily on short term"}]

## Bidirectional pre-training with BERT

BERT (Bidirectional Encoder Representations from Transformers) has a transformer-encoder-based model structure that
utilizes a bidirectional training procedure (it reads in all input elements at once). This kind of pre-training
disables BERT's ability to generate a sentence word by word but provides input encodings of higher quality for other
tasks, such as classification, since the model can now process information in both directions.

Recall that in a transformer's encoder, token encoding is a summation of positional encodings and token embeddings. In
the BERT encoder, there is an additional segment embedding indicating which segment this token belongs to. Why do we
need this additional segment information in BERT? The need for this segment information originated from the special
pre-training task of BERT called *next-sentence prediction*. In this pre-training task, each training example includes
two sentences and thus requires special segment notation to denote whether a token belongs to the first or second
sentence.

The pre-training of BERT includes two unsupervised tasks: masked language modeling and next-sentence prediction.

In the masked language model, tokens are randomly replaced by so-called mask tokens, and the model is required to
predict these hidden words. With this technique BERT is more akin to "filling in the blanks" because the model can
attend to all tokens in the sentence. To compensate for the fact that mask tokens don't usually appear in regular texts,
there are further modifications to the words that are selected for masking: 15% of the words are randomly selected for
masking, and each selected word is then treated as follows:
1. Keep the word unchanged 10% of the time;
2. Replace the original word token with a random word 10% of the time;
3. Replace the original word token with a mask token 80% of the time.

(This distribution of probabilities was the best among all the tested ones in the original paper)

Besides partially solving the aforementioned problem, these modifications also have other benefits. Firstly, keeping
some words unchanged preserves the information of the original token (otherwise the model can only learn from the
context and nothing from the masked words). Secondly, the 10% random words prevent the model from becoming lazy (i.e.,
learning nothing but returning what it is given).
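
The selection and replacement rules can be sketched in plain Python. The function `mask_tokens`, the whitespace
tokenization, and the toy vocabulary are assumptions for illustration; BERT itself operates on WordPiece tokens and a
much larger vocabulary.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Apply the 80/10/10 masking rule to a list of tokens (a toy sketch)."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            corrupted.append(token)      # not selected: nothing to predict here
            targets.append(None)
            continue
        targets.append(token)            # selected: the model must predict this token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append("[MASK]")            # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random word
        else:
            corrupted.append(token)               # 10%: keep the word unchanged
    return corrupted, targets

tokens = "the movie was surprisingly good and the ending was even better".split()
vocab = sorted(set(tokens))
corrupted, targets = mask_tokens(tokens, vocab, mask_prob=0.15, seed=1)
print(corrupted)
print(targets)
```

Positions with a `None` target do not contribute to the masked language modeling loss.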

In the next-sentence prediction task, the model is given two sentences, A and B, in the following format:
$$[CLS] A [SEP] B [SEP]$$
Here $[CLS]$ is a classification token, which serves as a placeholder for the predicted label in the decoder output, as
well as a token denoting the beginning of the sentences. The $[SEP]$ token, on the other hand, is attached to denote the
end of each sentence. The model is then required to classify whether B is the next sentence of A or not.

To provide the model with a balanced dataset, 50% of the samples are labeled as "IsNext" while the remaining samples are
labeled as "NotNext".

The training objective of BERT is to minimize the combined loss function of both tasks.

> NOTE: each input example needs to match a certain format. For example, it should begin with a $[CLS]$ token and be
> separated using $[SEP]$ tokens if it consists of more than one sentence.
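
The input format and the segment information can be sketched as follows. The helper `build_pair_input` is hypothetical
and uses lowercase whitespace tokenization for readability; an actual BERT tokenizer would produce WordPiece tokens.

```python
def build_pair_input(sentence_a, sentence_b):
    """Build the token list and segment ids for a sentence pair (toy whitespace tokenizer)."""
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and its [SEP]; segment 1 covers the rest
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input("The cat sat", "It purred")
print(tokens)        # ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'purred', '[SEP]']
print(segment_ids)   # [0, 0, 0, 0, 0, 1, 1, 1]
```

The segment ids are what the segment embedding layer looks up, one id per token.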

BERT can then be fine-tuned on four categories of tasks:
1. Sentence pair classification;
2. Single sentence classification;
3. Question answering;
4. Single-sentence tagging.

The first two only require an additional softmax layer to be added to the output representation of the $[CLS]$ token.
The last two, however, are token-level classification tasks, which means that the model passes output representations of
all related tokens to the softmax layer to predict a class label for each individual token.

# The best of both worlds: BART

The Bidirectional and Auto-Regressive Transformer (BART) can be viewed as a generalization of both GPT and BERT. As the
title of this section suggests, BART is able to accomplish both tasks, generating and classifying text. The reason
resides in the model having both a bidirectional encoder as well as a left-to-right autoregressive decoder.

One of the more interesting changes is that BART works with different model inputs. In BART the input format was
generalized such that it only uses the source sequence as input. Upon receiving a training example as plain text, the
input will first be "corrupted" and then encoded by the encoder. These input encodings will then be passed to the
decoder, along with the generated tokens. The cross-entropy loss between the decoder output and the original text will
be calculated and then optimized through the learning process.

# Fine-tuning a BERT model in PyTorch

In [43]:
import gzip
import shutil
import time
import pandas as pd
import requests
import torch
import torch.nn.functional as F
import torchtext
import transformers
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification # (Smaller distilled version of BERT)

RANDOM_SEED = 123
torch.manual_seed(RANDOM_SEED)
DEVICE = torch.device("mps")
NUM_EPOCHS = 3

In [44]:
df = pd.read_csv("../Sentiment analysis/movie_data.csv")
df.head(3)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0


In [45]:
train_texts = df.iloc[:35000]['review'].values
train_labels = df.iloc[:35000]['sentiment'].values

valid_texts = df.iloc[35000:40000]['review'].values
valid_labels = df.iloc[35000:40000]['sentiment'].values

test_texts = df.iloc[40000:]['review'].values
test_labels = df.iloc[40000:]['sentiment'].values

In [46]:
# Let's use the tokenizer used for the training of the BERT model in order to tokenize the new text
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(list(train_texts),truncation=True,padding=True)
valid_encodings = tokenizer(list(valid_texts),truncation=True,padding=True)
test_encodings = tokenizer(list(test_texts),truncation=True,padding=True)

In [47]:
from torch.utils.data import DataLoader
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self,encodings, labels):
        self.encodings = encodings
        self.labels = labels
        
    def __getitem__(self,idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
valid_dataset = IMDbDataset(valid_encodings, valid_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

train_loader = DataLoader(train_dataset,batch_size=16,shuffle=True)
valid_loader = DataLoader(valid_dataset,batch_size=16,shuffle=False)
test_loader = DataLoader(test_dataset,batch_size=16,shuffle=False)

In [48]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(DEVICE)
model.train()

optim = torch.optim.Adam(model.parameters(),lr=5e-5)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [49]:
def compute_accuracy(model, data_loader, device):
    with torch.no_grad():
        correct_pred, num_examples = 0, 0
        for batch_idx, batch in enumerate(data_loader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids,attention_mask=attention_mask)
            logits = outputs['logits']
            predicted_labels = torch.argmax(logits,1)
            num_examples+=labels.size(0)
            correct_pred += (predicted_labels == labels).sum()
    return correct_pred.float()/num_examples*100

In [None]:
start_time = time.time()

for epoch in range(NUM_EPOCHS):
    model.train()

    for batch_idx, batch in enumerate(train_loader):
        # Preparing data
        input_ids = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels = batch['labels'].to(DEVICE)

        # Forward pass
        outputs = model(input_ids,attention_mask=attention_mask,labels=labels)
        loss, logits = outputs['loss'], outputs['logits']

        # Backward pass
        optim.zero_grad()
        loss.backward()
        optim.step()

        # Logging
        if not batch_idx % 250:
            print(
                f'Epoch: {epoch+1:04d}/{NUM_EPOCHS:04d}'
                f' | Batch {batch_idx:04d}/{len(train_loader):04d}'
                f' | Loss: {loss:.4f}'
            )

    # Evaluate once per epoch, outside the batch loop
    model.eval()
    with torch.set_grad_enabled(False):
        print(
            f'Training accuracy: {compute_accuracy(model, train_loader, DEVICE):.2f}%'
            f' | Valid accuracy: {compute_accuracy(model, valid_loader, DEVICE):.2f}%'
        )
    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min')

print(f"Total Training Time: {(time.time() - start_time)/60:.2f} min")
print(f'Test accuracy: {compute_accuracy(model, test_loader,DEVICE):.2f}%')

Another, more convenient way to train the model is through the `Trainer` API.

The preceding code (and the code in the rest of the section) wasn't executed because I don't have a GPU available for
the fine-tuning.