# Training Transformers from Scratch

In this lab, we will discuss how to train a transformer model using PyTorch. Training transformers is not very different from training other types of models. However, we do require some additional helper functions, such as positional encoding, to construct the model. Additionally, we'll need to modify the preprocessing steps to suit our new model.

While creating this tutorial, we referred to some of the official PyTorch tutorial pages. You can find these references in the "References" section at the end of this tutorial.

In [1]:
import time
import torch
import torchtext
from torch import nn
import torchdata
import math
%matplotlib inline
print('version of the torch:' + torch.__version__)
print('version of the torchtext:' + torchtext.__version__)
print('version of the torchdata:' + torchdata.__version__)

version of the torch:2.0.1
version of the torchtext:0.15.2
version of the torchdata:0.6.1


## nn.Transformer

Before we begin, let's discuss the tools that PyTorch provides for training a transformer model. One of the standout features of PyTorch is its modularity, and this extends to the nn.Transformer module, which offers a highly customizable model architecture.

In [2]:
print(nn.Transformer())

Transformer(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, o

As we can see above, instantiating `nn.Transformer` with no arguments automatically sets up both the encoder and the decoder stacks with their default sizes. These layers can be changed by adjusting the constructor's input parameters.
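As a minimal sketch of adjusting these parameters, the snippet below builds a much smaller `nn.Transformer` and passes random tensors through it. The sizes (`d_model=32`, the sequence lengths, the batch size) are arbitrary values chosen for illustration and are not used elsewhere in this lab.

```python
import torch
from torch import nn

# A deliberately small transformer; all sizes here are illustrative assumptions.
small_transformer = nn.Transformer(
    d_model=32,
    nhead=4,
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=64,
    dropout=0.0,
)

# With the default batch_first=False, inputs are [seq_len, batch_size, d_model].
src = torch.rand(10, 3, 32)  # source: 10 positions, batch of 3
tgt = torch.rand(7, 3, 32)   # target: 7 positions, batch of 3

out = small_transformer(src, tgt)
print(out.shape)  # the output has the same shape as tgt
```

Note that `nn.Transformer` works on already-embedded inputs: it contains no token embedding and no positional encoding of its own.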

In [3]:
print(nn.Transformer().__doc__)

A transformer model. User is able to modify the attributes as needed. The architecture
    is based on the paper "Attention Is All You Need". Ashish Vaswani, Noam Shazeer,
    Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and
    Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information
    Processing Systems, pages 6000-6010.

    Args:
        d_model: the number of expected features in the encoder/decoder inputs (default=512).
        nhead: the number of heads in the multiheadattention models (default=8).
        num_encoder_layers: the number of sub-encoder-layers in the encoder (default=6).
        num_decoder_layers: the number of sub-decoder-layers in the decoder (default=6).
        dim_feedforward: the dimension of the feedforward network model (default=2048).
        dropout: the dropout value (default=0.1).
        activation: the activation function of encoder/decoder intermediate layer, can be a string
            ("re

Let's say you want to create a custom model that replaces the default transformer decoder with an LSTM. This can be accomplished through the `custom_decoder` parameter by supplying an LSTM-based module, provided that the module accepts the arguments `nn.Transformer` passes to its decoder (the target, the encoder memory, and the masks). If you're not looking for such an extreme customization, you can make more targeted adjustments with the built-in `TransformerDecoderLayer` and `TransformerDecoder` modules.
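To make the LSTM idea concrete, here is a rough sketch under stated assumptions. `LSTMDecoder` is a hypothetical class written for this illustration and is not part of PyTorch. It assumes that `nn.Transformer` calls its decoder with the target and the encoder memory as the first two arguments and passes the masks as keyword arguments, which this toy decoder ignores. Initializing the LSTM state from the mean of the encoder memory is one arbitrary design choice among many.

```python
import torch
from torch import nn


class LSTMDecoder(nn.Module):
    """Hypothetical LSTM-based stand-in for the transformer decoder."""

    def __init__(self, d_model: int):
        super().__init__()
        self.lstm = nn.LSTM(input_size=d_model, hidden_size=d_model)

    def forward(self, tgt, memory, **unused_masks):
        # Summarize the encoder output into an initial hidden state
        # of shape [num_layers=1, batch_size, d_model].
        h0 = memory.mean(dim=0, keepdim=True)
        c0 = torch.zeros_like(h0)
        output, _ = self.lstm(tgt, (h0, c0))
        return output  # [tgt_len, batch_size, d_model]


d_model = 32  # illustrative size
model = nn.Transformer(
    d_model=d_model,
    nhead=4,
    num_encoder_layers=1,
    dim_feedforward=64,
    custom_decoder=LSTMDecoder(d_model),
)

src = torch.rand(10, 3, d_model)
tgt = torch.rand(7, 3, d_model)
out = model(src, tgt)
print(out.shape)
```

Because the masks are discarded here, this decoder is causal only by virtue of the LSTM processing the target left to right; a real model would also need to handle padding.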

In [4]:
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer,num_layers=3)
print(decoder_layer)
print('\n\n')
print(decoder)

TransformerDecoderLayer(
  (self_attn): MultiheadAttention(
    (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
  )
  (multihead_attn): MultiheadAttention(
    (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
  )
  (linear1): Linear(in_features=512, out_features=2048, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (linear2): Linear(in_features=2048, out_features=512, bias=True)
  (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  (norm3): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  (dropout1): Dropout(p=0.1, inplace=False)
  (dropout2): Dropout(p=0.1, inplace=False)
  (dropout3): Dropout(p=0.1, inplace=False)
)



TransformerDecoder(
  (layers): ModuleList(
    (0-2): 3 x TransformerDecoderLayer(
      (self_attn): MultiheadAttention(
        (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out

After creating the new decoder, we can simply pass it to `Transformer` through the `custom_decoder` parameter to create a new model. Similarly, the encoder can be customized with the built-in `TransformerEncoderLayer` and `TransformerEncoder` modules.

In [5]:
nn.Transformer(custom_decoder=decoder)

Transformer(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (layers): ModuleList(
      (0-2): 3 x TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, o
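The encoder can be customized in the same way. The sketch below, with illustrative sizes that are not taken from this lab, builds a two-layer encoder and hands it to `nn.Transformer` through the `custom_encoder` parameter.

```python
from torch import nn

D_MODEL = 64  # illustrative size; the custom encoder and the Transformer must agree on it

encoder_layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, dim_feedforward=128)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

model = nn.Transformer(d_model=D_MODEL, nhead=4, custom_encoder=encoder)

print(len(model.encoder.layers))  # layers in the custom encoder
print(len(model.decoder.layers))  # the decoder keeps its default depth
```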

## Model

Now that we understand how the Transformer works, we can start building our model. For those of you with keen eyes, you may have noticed that one component is missing from the transformer architecture. Can you guess what it is?

### @Spoiler

Yes, it is the positional encoding.
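For reference, the sinusoidal encoding from "Attention Is All You Need" that the next cell implements can be written as follows, where `pos` is the position in the sequence, `i` indexes pairs of embedding dimensions, and `d` is the embedding size:

```latex
PE_{(pos,\,2i)}   = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)
```

The code computes the factor `10000^(-2i/d)` as `exp(-2i * log(10000) / d)`, which is the `divider` term; the even indices `2i` come from `torch.arange(0, embedding_size, 2)`.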

In [6]:
class PositionalEncoding(nn.Module):
    def __init__(self,
                 embedding_size: int,
                 dropout: float= 0.1,
                 maximum_length: int = 5000):
        super(PositionalEncoding, self).__init__()

        divider = torch.exp(- torch.arange(0, embedding_size, 2)* math.log(10000) / embedding_size)
        position = torch.arange(0, maximum_length).unsqueeze(1)

        positionalembedding = torch.zeros((maximum_length,1, embedding_size))
        positionalembedding[:,0, 0::2] = torch.sin(position * divider)
        positionalembedding[:,0, 1::2] = torch.cos(position * divider)

        self.dropout = nn.Dropout(dropout)
        self.register_buffer('positionalembedding', positionalembedding)

    def forward(self, input_token: torch.Tensor):
        embedding = input_token + self.positionalembedding[:input_token.size(0), :]
        return self.dropout(embedding)


dumposition = PositionalEncoding(4,0)
dumembedding = nn.Embedding(10, 4)
# Normally the embeddings are scaled by the square root of the embedding size; for this demonstration we simply multiply them by 2.
embed = dumembedding(torch.tensor(([1,1,1,1,1,1],[1,1,1,1,1,1]))).transpose(0,1) * 2
embedpos = dumposition(embed)
print(embed)
print(embedpos)

tensor([[[-1.5964, -0.8773,  0.1465,  0.7073],
         [-1.5964, -0.8773,  0.1465,  0.7073]],

        [[-1.5964, -0.8773,  0.1465,  0.7073],
         [-1.5964, -0.8773,  0.1465,  0.7073]],

        [[-1.5964, -0.8773,  0.1465,  0.7073],
         [-1.5964, -0.8773,  0.1465,  0.7073]],

        [[-1.5964, -0.8773,  0.1465,  0.7073],
         [-1.5964, -0.8773,  0.1465,  0.7073]],

        [[-1.5964, -0.8773,  0.1465,  0.7073],
         [-1.5964, -0.8773,  0.1465,  0.7073]],

        [[-1.5964, -0.8773,  0.1465,  0.7073],
         [-1.5964, -0.8773,  0.1465,  0.7073]]], grad_fn=<MulBackward0>)
tensor([[[-1.5964,  0.1227,  0.1465,  1.7073],
         [-1.5964,  0.1227,  0.1465,  1.7073]],

        [[-0.7549, -0.3370,  0.1565,  1.7072],
         [-0.7549, -0.3370,  0.1565,  1.7072]],

        [[-0.6871, -1.2934,  0.1665,  1.7071],
         [-0.6871, -1.2934,  0.1665,  1.7071]],

        [[-1.4553, -1.8673,  0.1765,  1.7068],
         [-1.4553, -1.8673,  0.1765,  1.7068]],

        [[-2.353

In the output, the first tensor holds the same embedding at every position, while in the second tensor the values change from one position to the next. This difference comes from the positional encoding.
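To see where these numbers come from, the following sketch recomputes the encoding for the first few positions in plain Python, using the same formula as the `PositionalEncoding` class and an embedding size of 4. `sinusoidal_row` is a helper name introduced only for this illustration.

```python
import math


def sinusoidal_row(position: int, embedding_size: int) -> list:
    """Return the positional encoding vector for a single position."""
    row = []
    for i in range(0, embedding_size, 2):
        # Same term as `divider` in PositionalEncoding.
        angle = position * math.exp(-i * math.log(10000) / embedding_size)
        row.extend([math.sin(angle), math.cos(angle)])
    return row[:embedding_size]


for pos in range(3):
    print(pos, [round(value, 4) for value in sinusoidal_row(pos, 4)])
# Position 0 gives [0.0, 1.0, 0.0, 1.0]: sin(0) = 0 and cos(0) = 1.
```

This explains the first row of the second tensor above: at position 0 only the second and fourth components change, each by exactly 1.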

### Model cont.


Now that we've learned how the model works, let's proceed to create it. Last time, we familiarized ourselves with the AG_News dataset, and we're going to use it again. For this purpose, we'll build a classification model using only the transformer encoder layer. Let's go ahead and fill in the empty spaces within the model architecture.

In [7]:
class TransformerModel(nn.Module):

    def __init__(self, ntoken: int, embedding_size: int, nhead: int, d_hid: int,
                 nlayers: int, nclass: int, dropout: float = 0.5):
        super().__init__()
        self.embedding_size = embedding_size

        self.embedding = nn.Embedding(ntoken, embedding_size)

        self.PositionalEncoding = PositionalEncoding(embedding_size, dropout)
        encoder_layers = nn.TransformerEncoderLayer(embedding_size, nhead, d_hid, dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, nlayers)


        self.linear = nn.Linear(embedding_size, nclass)

        self.init_weights()

    def init_weights(self) -> None:
        initrange = 0.1
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.linear.bias.data.zero_()
        self.linear.weight.data.uniform_(-initrange, initrange)

    def forward(self, src: torch.Tensor, src_mask: torch.Tensor = None) -> torch.Tensor:
        """
        Arguments:
            src: Tensor, shape ``[seq_len, batch_size]``
            src_mask: optional Tensor, shape ``[seq_len, seq_len]``

        Returns:
            output Tensor of shape ``[seq_len, batch_size, nclass]``
        """
        src = self.embedding(src) * math.sqrt(self.embedding_size)
        src = self.PositionalEncoding(src)
        output = self.transformer_encoder(src, src_mask)
        output = self.linear(output)
        return output

## Data processing

Data processing for this tutorial will be quite similar to our previous approach, with only minor modifications needed to accommodate our new model. These changes may include using regular embeddings instead of an embedding bag, or shifting from a one-to-one model to a many-to-one model, among other adjustments.
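The difference between the two embedding styles can be sketched with toy inputs. The vocabulary size, embedding size, and token ids below are made up for illustration. `nn.EmbeddingBag` reduces each text to a single vector, whereas `nn.Embedding` keeps one vector per token, which is what the transformer encoder consumes.

```python
import torch
from torch import nn

vocab_size, emb_size = 10, 4  # toy sizes

# EmbeddingBag: all texts are concatenated into one flat tensor,
# and offsets mark where each text starts.
bag = nn.EmbeddingBag(vocab_size, emb_size, mode="mean")
flat_tokens = torch.tensor([2, 3, 4, 5, 6])
offsets = torch.tensor([0, 3])  # text 1 = tokens[0:3], text 2 = tokens[3:5]
bag_out = bag(flat_tokens, offsets)
print(bag_out.shape)  # one vector per text: [2, 4]

# Embedding: a padded batch of shape [seq_len, batch_size].
embedding = nn.Embedding(vocab_size, emb_size)
padded_tokens = torch.tensor([[2, 5], [3, 6], [4, 1]])  # 1 plays the role of <pad>
emb_out = embedding(padded_tokens)
print(emb_out.shape)  # one vector per token: [3, 2, 4]
```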

In [8]:
from torchtext.datasets import AG_NEWS
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")
train_iter = AG_NEWS(split="train")


def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)


vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>","<pad>"])  # Creates a dictionary for tokens
vocab.set_default_index(vocab["<unk>"])

In [9]:
text_pipeline = lambda x: vocab(tokenizer(x))   # Pipelines for conversion
label_pipeline = lambda x: int(x) - 1

In [10]:
vocab(["<pad>"])

[1]

In [11]:
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

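def collate_batch(batch):
    """Convert a list of (label, text) pairs into a label tensor and a padded text tensor."""
    label_list, text_list = [], []
    for _label, _text in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text))
        text_list.append(processed_text)
    # pad_sequence stacks the texts into a tensor of shape [max_seq_len, batch_size]
    padded_text = pad_sequence(text_list, padding_value=vocab(["<pad>"])[0])
    return torch.tensor(label_list).to(device), padded_text.to(device)

# For comparison, the EmbeddingBag-style version with offsets:
# def collate_batch(batch):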
#     label_list, text_list, offsets = [], [], [0]
#     for _label, _text in batch:
#         label_list.append(label_pipeline(_label))
#         processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
#         text_list.append(processed_text)
#         offsets.append(processed_text.size(0))
#     label_list = torch.tensor(label_list, dtype=torch.int64)
#     offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
#     text_list = torch.cat(text_list)
#     return label_list.to(device), text_list.to(device), offsets.to(device)


train_iter = AG_NEWS(split="train")
dataloader = DataLoader(
    train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch
)
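The padding step inside `collate_batch` can be seen in isolation with a few toy sequences. The token ids below are invented; the padding value 1 mirrors the `<pad>` index shown earlier.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_INDEX = 1  # same index that vocab(["<pad>"]) returned above

sequences = [
    torch.tensor([5, 6, 7]),  # length 3
    torch.tensor([8]),        # length 1
    torch.tensor([9, 10]),    # length 2
]

# By default (batch_first=False) the result has shape [max_seq_len, batch_size],
# and shorter sequences are padded at the end.
padded = pad_sequence(sequences, padding_value=PAD_INDEX)
print(padded.shape)
print(padded)
# tensor([[ 5,  8,  9],
#         [ 6,  1, 10],
#         [ 7,  1,  1]])
```

Each column is one text, which matches the `[seq_len, batch_size]` layout that the model's `forward` method expects.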

In [12]:
ntokens = len(vocab)  # size of vocabulary
emsize = 16  # embedding dimension
d_hid = 8  # dimension of the feedforward network model in ``nn.TransformerEncoder``
nlayers = 2  # number of ``nn.TransformerEncoderLayer`` in ``nn.TransformerEncoder``
nhead = 2  # number of heads in ``nn.MultiheadAttention``
dropout = 0.2  # dropout probability
nclasses = 4
model = TransformerModel(ntokens, emsize, nhead, d_hid, nlayers,nclasses, dropout).to(device)

In [13]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

BATCH_SIZE = 64
EPOCHS = 4
LR = 5


criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None
train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = random_split(
    train_dataset, [num_train, len(train_dataset) - num_train]
)

train_dataloader = DataLoader(
    split_train_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch,drop_last= True
)
valid_dataloader = DataLoader(
    split_valid_, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch,drop_last= True
)
test_dataloader = DataLoader(
    test_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch,drop_last= True
)
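As a small illustration of what the scheduler configured above does, the sketch below applies `StepLR` with `gamma=0.1` to a single dummy parameter. The dummy parameter is an assumption made to keep the example self-contained; with a step size of 1, each call to `scheduler.step()` multiplies the learning rate by 0.1.

```python
import torch

dummy_param = torch.nn.Parameter(torch.zeros(1))  # stand-in for model.parameters()
optimizer = torch.optim.SGD([dummy_param], lr=5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

learning_rates = [optimizer.param_groups[0]["lr"]]
for _ in range(2):
    optimizer.step()   # step the optimizer before the scheduler to avoid a warning
    scheduler.step()
    learning_rates.append(optimizer.param_groups[0]["lr"])

print(learning_rates)  # roughly 5, 0.5, 0.05
```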


## Training


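Once again, we will use the training loop that we constructed in the previous lab. The only difference is that the model now returns one output per sequence position, so we take the output at the last position (`out[-1]`) as the prediction for the whole sequence. Note that because shorter texts in a batch are padded at the end, this last position can correspond to a `<pad>` token.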
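The shapes involved can be checked with a small stand-in for the model. The sketch below uses a bare `nn.TransformerEncoder` followed by a linear layer, with random inputs in place of embedded tokens; all sizes are illustrative.

```python
import torch
from torch import nn

seq_len, batch_size, emb_size, nclass = 12, 5, 16, 4  # illustrative sizes

encoder_layer = nn.TransformerEncoderLayer(
    d_model=emb_size, nhead=2, dim_feedforward=8, dropout=0.0
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
classifier = nn.Linear(emb_size, nclass)

# Random tensor standing in for embedded, position-encoded tokens.
embedded = torch.rand(seq_len, batch_size, emb_size)

out = classifier(encoder(embedded))
last_step = out[-1]  # logits at the last sequence position

print(out.shape)        # [seq_len, batch_size, nclass]
print(last_step.shape)  # [batch_size, nclass], the shape CrossEntropyLoss expects
```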

In [14]:
import time


def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text) in enumerate(dataloader):
        optimizer.zero_grad()
        out = model(text)
        predicted_label = out[-1]  # logits at the last sequence position
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)  # to prevent exploding gradients
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print(
                "| epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f}".format(
                    epoch, idx, len(dataloader), total_acc / total_count
                )
            )
            total_acc, total_count = 0, 0
            start_time = time.time()


def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text) in enumerate(dataloader):
            out = model(text)
            predicted_label = out[-1]
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc / total_count
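The gradient clipping call in `train` can be illustrated on its own. In this sketch a single parameter is given a hand-written gradient (an assumption that replaces a real `backward()` call), and `clip_grad_norm_` rescales it so that its norm does not exceed `max_norm`.

```python
import torch

weights = torch.nn.Parameter(torch.tensor([3.0, 4.0]))
# Pretend that backward() produced a large gradient with norm 50.
weights.grad = torch.tensor([30.0, 40.0])

# clip_grad_norm_ returns the total norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_([weights], max_norm=0.1)
norm_after = weights.grad.norm()

print(float(norm_before))  # about 50
print(float(norm_after))   # about 0.1
```

The gradient keeps its direction and only its length changes, which limits the size of each update, an important safeguard with a learning rate as large as the one used here.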

In [15]:
for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print("-" * 59)
    print(
        "| end of epoch {:3d} | time: {:5.2f}s | "
        "valid accuracy {:8.3f} ".format(
            epoch, time.time() - epoch_start_time, accu_val
        )
    )
    print("-" * 59)

| epoch   1 |   500/ 1781 batches | accuracy    0.350
| epoch   1 |  1000/ 1781 batches | accuracy    0.703
| epoch   1 |  1500/ 1781 batches | accuracy    0.824
-----------------------------------------------------------
| end of epoch   1 | time: 14.80s | valid accuracy    0.875 
-----------------------------------------------------------
| epoch   2 |   500/ 1781 batches | accuracy    0.862
| epoch   2 |  1000/ 1781 batches | accuracy    0.874
| epoch   2 |  1500/ 1781 batches | accuracy    0.884
-----------------------------------------------------------
| end of epoch   2 | time: 14.38s | valid accuracy    0.900 
-----------------------------------------------------------
| epoch   3 |   500/ 1781 batches | accuracy    0.901
| epoch   3 |  1000/ 1781 batches | accuracy    0.900
| epoch   3 |  1500/ 1781 batches | accuracy    0.902
-----------------------------------------------------------
| end of epoch   3 | time: 14.28s | valid accuracy    0.906 
-------------------------------

In [16]:
print("Checking the results of test dataset.")
accu_test = evaluate(test_dataloader)
print("test accuracy {:8.3f}".format(accu_test))

Checking the results of test dataset.
test accuracy    0.916


In [17]:
ag_news_label = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tec"}


def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text)).unsqueeze(1)
        output = model(text)
        return output[-1].argmax(1).item() + 1


ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

model = model.to('cpu')

print("This is a %s news" % ag_news_label[predict(ex_text_str, text_pipeline)])

This is a Sports news


## Another Example

In the previous example, we used only half of the transformer model: the encoder. In this section, we set up a sequence-to-sequence model that uses both the encoder and the decoder components of the transformer.

In [19]:
class Seq2SeqTransformer(nn.Module):
    def __init__(self,
                 num_encoder_layers: int,
                 num_decoder_layers: int,
                 embedding_size: int,
                 nhead: int,
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 d_hid: int = 512,
                 dropout: float = 0.1):
        super(Seq2SeqTransformer, self).__init__()
        self.transformer = nn.Transformer(d_model=embedding_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=d_hid,
                                       dropout=dropout)
        self.generator = nn.Linear(embedding_size, tgt_vocab_size)
        self.src_tok_emb = nn.Embedding(src_vocab_size, embedding_size)
        self.tgt_tok_emb = nn.Embedding(tgt_vocab_size, embedding_size)
        self.positional_encoding = PositionalEncoding(
            embedding_size, dropout=dropout)

    def forward(self,
                src: torch.Tensor,
                trg: torch.Tensor,
                src_mask: torch.Tensor,
                tgt_mask: torch.Tensor,
                src_padding_mask: torch.Tensor,
                tgt_padding_mask: torch.Tensor,
                memory_key_padding_mask: torch.Tensor):
        src_emb = self.positional_encoding(self.src_tok_emb(src))
        tgt_emb = self.positional_encoding(self.tgt_tok_emb(trg))
        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src: torch.Tensor, src_mask: torch.Tensor):
        return self.transformer.encoder(self.positional_encoding(
                            self.src_tok_emb(src)), src_mask)

    def decode(self, tgt: torch.Tensor, memory: torch.Tensor, tgt_mask: torch.Tensor):
        return self.transformer.decoder(self.positional_encoding(
                          self.tgt_tok_emb(tgt)), memory,
                          tgt_mask)
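The `forward` method above expects several masks. One possible way to build them is sketched below. `create_masks` and `PAD_IDX` are names assumed for this illustration, and this version uses boolean masks throughout, where `True` marks a position that must not be attended to. A tiny `nn.Transformer` fed with random "embeddings" shows where each mask goes.

```python
import torch
from torch import nn

PAD_IDX = 1  # assumed padding index, matching the "<pad>" index used earlier


def create_masks(src, tgt, pad_idx=PAD_IDX):
    """Build attention and padding masks for [seq_len, batch_size] token tensors."""
    src_len, tgt_len = src.size(0), tgt.size(0)
    # The encoder may attend everywhere, so nothing is masked.
    src_mask = torch.zeros((src_len, src_len), dtype=torch.bool)
    # Causal mask: position i must not attend to any position j > i.
    tgt_mask = torch.triu(torch.ones(tgt_len, tgt_len), diagonal=1).bool()
    # Padding masks have shape [batch_size, seq_len].
    src_padding_mask = (src == pad_idx).transpose(0, 1)
    tgt_padding_mask = (tgt == pad_idx).transpose(0, 1)
    return src_mask, tgt_mask, src_padding_mask, tgt_padding_mask


# Toy token ids: the second sequence in each batch ends with padding.
src = torch.tensor([[5, 6], [7, 8], [9, PAD_IDX]])          # [3, 2]
tgt = torch.tensor([[2, 2], [4, 5], [6, 3], [3, PAD_IDX]])  # [4, 2]
src_mask, tgt_mask, src_pad, tgt_pad = create_masks(src, tgt)
print(tgt_mask)

model = nn.Transformer(d_model=8, nhead=2, num_encoder_layers=1,
                       num_decoder_layers=1, dim_feedforward=16, dropout=0.0)
out = model(
    torch.rand(3, 2, 8),  # stands in for embedded source tokens
    torch.rand(4, 2, 8),  # stands in for embedded target tokens
    src_mask=src_mask,
    tgt_mask=tgt_mask,
    src_key_padding_mask=src_pad,
    tgt_key_padding_mask=tgt_pad,
    memory_key_padding_mask=src_pad,
)
print(out.shape)  # same leading dimensions as the target
```

The source padding mask is reused as `memory_key_padding_mask` because the decoder attends over the encoder output, which has the same length and padding as the source.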

In [21]:
SRC_VOCAB_SIZE = 10000 #len(vocab_src)
TGT_VOCAB_SIZE = 10000 #len(vocab_tgt)
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
BATCH_SIZE = 64
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM).to(device)




print(transformer)

Seq2SeqTransformer(
  (transformer): Transformer(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-2): 3 x TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
          )
          (linear1): Linear(in_features=512, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=512, out_features=512, bias=True)
          (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
      (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    )
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0-2): 3 x TransformerDecoderLayer(
          (self_attn): MultiheadAttent

## References


* [Language Translation with nn.Transformer and torchtext](https://pytorch.org/tutorials/beginner/translation_transformer.html)
* [Language Modeling with nn.Transformer and torchtext](https://pytorch.org/tutorials/beginner/transformer_tutorial.html)
