# Implementation of a Transformer model from scratch

In this project we implement the Transformer model from the paper "Attention Is All You Need", available at https://arxiv.org/pdf/1706.03762.pdf. This paper has been highly influential in NLP and deep learning. Its authors proposed a novel architecture, the Transformer, that is based entirely on attention mechanisms and outperformed the previous state-of-the-art models, which relied on recurrent (RNN) architectures, on various NLP tasks. The architecture is summarized in the following image.



<img src='https://drive.google.com/uc?id=1pbQcTBnbJQPBwoLp3OKiI7s3Iq_mNUn8'>

## The Transformer

We will describe the components of the Transformer, analyze their operation and build a simple model that we will apply to a small-scale Neural Machine Translation problem. After basic imports we start with the discussion of the self attention mechanism.

In [None]:
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


### Multihead attention

Every input vector $X$ is used in three different ways in the self-attention mechanism: as the query $Q$, the key $K$ and the value $V$, which are computed with learnable weight matrices as $Q = W^{(Q)}X$, $K = W^{(K)}X$, $V = W^{(V)}X$. The attention scores measure how much focus a word at a given position places on each of the other words in the input sequence. The formula for the computation is

$\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$

where $d_k$ is the dimension of the queries and keys. We can run several such attention heads in parallel, which allows a 'word' in the input sequence to attend to several words or parts of that sequence at once. This is similar to having several convolutional filters in CNN models that focus on different features of the image.
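
To make the formula concrete, here is a minimal sketch of scaled dot-product attention for a single head in plain Python, without PyTorch. The helper names `softmax` and `attention` and the toy matrices are assumptions made for illustration; vectors are stored as rows, so each row of `Q` is one query.

```python
import math

def softmax(row):
    # subtract the maximum for numerical stability
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    # scores[i][j] = <Q[i], K[j]> / sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    weights = [softmax(row) for row in scores]
    # each output row is a weighted average of the rows of V
    d_v = len(V[0])
    out = [[sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(d_v)]
           for w_row in weights]
    return out, weights

# toy example: 2 queries, 3 keys/values, d_k = d_v = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(Q, K, V)
```

Each row of `weights` sums to one, so every output row is a convex combination of the value vectors. The first query matches the first and third keys equally well and therefore weights them equally.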

In [None]:
class MultiHeadAttention(nn.Module):
  def __init__(self, d_k, d_model, n_heads, max_len, causal=False):
    super().__init__()

    # assume d_v = d_k
    self.d_k = d_k
    self.n_heads = n_heads

    self.key = nn.Linear(d_model, d_k * n_heads)
    self.query = nn.Linear(d_model, d_k * n_heads)
    self.value = nn.Linear(d_model, d_k * n_heads)

    # final linear layer
    self.fc = nn.Linear(d_k * n_heads, d_model)

    # causal mask
    self.causal = causal
    if causal:
      cm = torch.tril(torch.ones(max_len, max_len))
      self.register_buffer(
          "causal_mask",
          cm.view(1, 1, max_len, max_len)
      )

  def forward(self, q, k, v, pad_mask=None):
    q = self.query(q) # shape N x T x (n_heads * d_k)
    k = self.key(k) # shape N x T x (n_heads * d_k)
    v = self.value(v) # shape N x T x (n_heads * d_v)

    N = q.shape[0]
    T_output = q.shape[1]
    T_input = k.shape[1]

    # change shape (N,T,n_heads,d_k) to (N,n_heads,T,d_k) so matrix multiplication works
    q = q.view(N, T_output, self.n_heads, self.d_k).transpose(1, 2)
    k = k.view(N, T_input, self.n_heads, self.d_k).transpose(1, 2)
    v = v.view(N, T_input, self.n_heads, self.d_k).transpose(1, 2)

    # compute attention weights
    # (N, n_heads, T_output, d_k) x (N, n_heads, d_k, T_input) --> (N, n_heads, T_output, T_input)
    attn_scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
    if pad_mask is not None:
      attn_scores = attn_scores.masked_fill(pad_mask[:,None,None,:] == 0, float('-inf'))
    if self.causal:
      attn_scores = attn_scores.masked_fill(
          self.causal_mask[:, :, :T_output, :T_input] == 0, float('-inf')
      )
    attn_weights = F.softmax(attn_scores, dim=-1)

    # compute attention weighted values
    # (N, n_heads, T_output, T_input) x (N, n_heads, T_input, d_k) --> (N, n_heads, T_output, d_k)
    A = attn_weights @ v

    # reshape it back before the final layer
    A = A.transpose(1, 2) # (N, T, n_heads, d_k)
    A = A.contiguous().view(N, T_output, self.d_k * self.n_heads) # (N, T, (n_heads * d_k))

    # projection
    return self.fc(A)

The transformer consists of an Encoder and a Decoder that we will build separately. We start with some building blocks.

### Encoder block

The encoder block consists of a multi-head attention layer followed by a fully connected feed-forward network. There is a residual connection around each of these two sublayers, followed by layer normalization.
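
The "residual connection plus layer normalization" pattern can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions: it works on a single feature vector, omits the learnable scale and shift of `nn.LayerNorm`, and `toy_sublayer` is a hypothetical stand-in for the attention or feed-forward sublayer.

```python
import math

def layer_norm(x, eps=1e-5):
    # normalize one feature vector to zero mean and (approximately) unit variance
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    # residual connection followed by layer normalization: LN(x + sublayer(x))
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def toy_sublayer(x):
    # hypothetical sublayer used only for this illustration
    return [0.5 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
out = add_and_norm(x, toy_sublayer)
```

The residual term keeps the original signal available to later layers, while the normalization keeps the scale of the activations stable from block to block.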

In [None]:
class EncoderBlock(nn.Module):
  def __init__(self, d_k, d_model, n_heads, max_len, dropout_prob=0.1):
    super().__init__()

    self.ln1 = nn.LayerNorm(d_model)
    self.ln2 = nn.LayerNorm(d_model)
    self.mha = MultiHeadAttention(d_k, d_model, n_heads, max_len, causal=False)
    self.ann = nn.Sequential(
        nn.Linear(d_model, d_model * 4),
        nn.GELU(),
        nn.Linear(d_model * 4, d_model),
        nn.Dropout(dropout_prob)
    )
    self.dropout = nn.Dropout(p=dropout_prob)

  def forward(self, x, pad_mask=None):
    x = self.ln1(x + self.mha(x,x,x,pad_mask))
    x = self.ln2(x + self.ann(x))
    x = self.dropout(x)
    return x

### Decoder block

The decoder block contains two attention layers. The first is a masked (causal) multi-head self-attention layer on the decoder input; the mask prevents each position from attending to subsequent positions. The second performs multi-head attention over the output of the encoder: the key and value vectors come from the encoder output, while the queries come from the previous attention layer. This allows every position in the decoder to attend over all positions in the input sequence.
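
The effect of the causal mask can be illustrated with a small sketch in plain Python. The helper names `causal_mask` and `masked_softmax` are assumptions made for this example; the scores are set to zero everywhere so that only the mask shapes the result.

```python
import math

def causal_mask(T):
    # lower-triangular matrix: position i may attend to positions j <= i
    return [[1 if j <= i else 0 for j in range(T)] for i in range(T)]

def masked_softmax(scores, mask):
    out = []
    for s_row, m_row in zip(scores, mask):
        # masked positions get a score of -inf, so their weight is exp(-inf) = 0
        masked = [s if m else float('-inf') for s, m in zip(s_row, m_row)]
        mx = max(masked)
        exps = [math.exp(s - mx) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

T = 4
scores = [[0.0] * T for _ in range(T)]  # uniform scores for illustration
weights = masked_softmax(scores, causal_mask(T))
for row in weights:
    print([round(w, 2) for w in row])
```

Row `i` spreads its weight evenly over positions `0..i` and assigns exactly zero weight to all later positions, which is what prevents the decoder from looking ahead.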

In [None]:
class DecoderBlock(nn.Module):
  def __init__(self, d_k, d_model, n_heads, max_len, dropout_prob=0.1):
    super().__init__()

    self.ln1 = nn.LayerNorm(d_model)
    self.ln2 = nn.LayerNorm(d_model)
    self.ln3 = nn.LayerNorm(d_model)
    self.mha1 = MultiHeadAttention(d_k, d_model, n_heads, max_len, causal=True)
    self.mha2 = MultiHeadAttention(d_k, d_model, n_heads, max_len, causal=False)
    self.ann = nn.Sequential(
        nn.Linear(d_model, d_model * 4),
        nn.GELU(),
        nn.Linear(d_model * 4, d_model),
        nn.Dropout(dropout_prob)
    )
    self.dropout = nn.Dropout(p=dropout_prob)

  def forward(self, enc_output, dec_input, enc_mask=None, dec_mask=None):
    # self-attention on decoder input
    x = self.ln1(dec_input + self.mha1(dec_input, dec_input, dec_input, dec_mask))

    # multihead attention including encoder output
    x = self.ln2(x + self.mha2(x, enc_output, enc_output, enc_mask))

    x = self.ln3(x + self.ann(x))
    x = self.dropout(x)
    return x

### Positional encoding

Self-attention on its own does not take the order of the words in the input sentence into account: permuting the input words simply permutes the outputs in the same way. To account for word order we create a representation of each word's position in the sentence and add it to the word embeddings at the bottom of the encoder and decoder stacks. In the paper this is done using trigonometric functions as follows.

$PE_{(pos, 2i)} = \sin(pos/10000^{2i/d_{model}})$

$PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d_{model}})$
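
As a quick check of these formulas, the following sketch evaluates them directly in plain Python for a tiny table. The function name `positional_encoding` and the sizes are assumptions for illustration; the loop variable `i` runs over the even indices, so it plays the role of $2i$ in the formulas.

```python
import math

def positional_encoding(max_len, d_model):
    # pe[pos][2i] = sin(pos / 10000^(2i/d_model)), pe[pos][2i+1] = cos(same angle)
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=4, d_model=4)
```

At position 0 all sine entries are 0 and all cosine entries are 1. Higher dimensions use longer wavelengths, so with `d_model=4` the second sine/cosine pair rotates 100 times more slowly than the first.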

In [None]:
class PositionalEncoding(nn.Module):
  def __init__(self, d_model, max_len=2048, dropout_prob = 0.1):
    super().__init__()
    self.dropout = nn.Dropout(p=dropout_prob)

    position = torch.arange(max_len).unsqueeze(1)
    exp_term = torch.arange(0, d_model, 2)
    div_term = torch.exp(exp_term * (-math.log(10000.0) / d_model))
    pe = torch.zeros(1, max_len, d_model)
    pe[0, :, 0::2] = torch.sin(position * div_term)
    pe[0, :, 1::2] = torch.cos(position * div_term)
    self.register_buffer('pe', pe)

  def forward(self, x):
    # x.shape N x T x D
    x = x + self.pe[:, :x.size(1), :]
    return self.dropout(x)

### Encoder

The encoder consists of an embedding layer, a positional encoding layer, and `n_layers` encoder blocks, followed by a final layer normalization.

In [None]:
class Encoder(nn.Module):
  def __init__(self,
               vocab_size,
               max_len,
               d_k,
               d_model,
               n_heads,
               n_layers,
               dropout_prob):
    super().__init__()

    self.embedding = nn.Embedding(vocab_size, d_model)
    self.pos_encoding = PositionalEncoding(d_model, max_len, dropout_prob)
    transformer_blocks = [
        EncoderBlock(d_k, d_model, n_heads, max_len, dropout_prob) for _ in range(n_layers)
    ]
    self.transformer_blocks = nn.Sequential(*transformer_blocks)
    self.ln = nn.LayerNorm(d_model)

  def forward(self, x, pad_mask=None):
    x = self.embedding(x)
    x = self.pos_encoding(x)
    for block in self.transformer_blocks:
      x = block(x, pad_mask)

    x = self.ln(x)
    return x

### Decoder

The decoder consists of an embedding layer, a positional encoding layer, `n_layers` decoder blocks, a final layer normalization and a fully connected layer that maps each position to scores over the target vocabulary.

In [None]:
class Decoder(nn.Module):
  def __init__(self,
               vocab_size,
               max_len,
               d_k,
               d_model,
               n_heads,
               n_layers,
               dropout_prob):
    super().__init__()

    self.embedding = nn.Embedding(vocab_size, d_model)
    self.pos_encoding = PositionalEncoding(d_model, max_len, dropout_prob)
    transformer_blocks = [
        DecoderBlock(d_k, d_model, n_heads, max_len, dropout_prob) for _ in range(n_layers)
    ]
    self.transformer_blocks = nn.Sequential(*transformer_blocks)
    self.ln = nn.LayerNorm(d_model)
    self.fc = nn.Linear(d_model, vocab_size)

  def forward(self, enc_output, dec_input, enc_mask=None, dec_mask=None):
    x = self.embedding(dec_input)
    x = self.pos_encoding(x)
    for block in self.transformer_blocks:
      x = block(enc_output, x, enc_mask, dec_mask)
    x = self.ln(x)
    x = self.fc(x)
    return x

### Transformer

We join the Encoder and Decoder to obtain our Transformer.

In [None]:
class Transformer(nn.Module):
  def __init__(self, encoder, decoder):
    super().__init__()
    self.encoder = encoder
    self.decoder = decoder

  def forward(self, enc_input, dec_input, enc_mask, dec_mask):
    enc_output = self.encoder(enc_input, enc_mask)
    dec_output = self.decoder(enc_output, dec_input, enc_mask, dec_mask)
    return dec_output

### Test our Transformer with dummy data

In [None]:
encoder = Encoder(vocab_size=20000,
                  max_len = 512,
                  d_k = 16,
                  d_model = 64,
                  n_heads = 4,
                  n_layers = 2,
                  dropout_prob = 0.1)
decoder = Decoder(vocab_size = 10000,
                  max_len = 512,
                  d_k = 16,
                  d_model = 64,
                  n_heads = 4,
                  n_layers = 2,
                  dropout_prob = 0.1)
transformer = Transformer(encoder, decoder)

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(device)
encoder.to(device)
decoder.to(device)

cuda:0


Decoder(
  (embedding): Embedding(10000, 64)
  (pos_encoding): PositionalEncoding(
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer_blocks): Sequential(
    (0): DecoderBlock(
      (ln1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (ln2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (ln3): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (mha1): MultiHeadAttention(
        (key): Linear(in_features=64, out_features=64, bias=True)
        (query): Linear(in_features=64, out_features=64, bias=True)
        (value): Linear(in_features=64, out_features=64, bias=True)
        (fc): Linear(in_features=64, out_features=64, bias=True)
      )
      (mha2): MultiHeadAttention(
        (key): Linear(in_features=64, out_features=64, bias=True)
        (query): Linear(in_features=64, out_features=64, bias=True)
        (value): Linear(in_features=64, out_features=64, bias=True)
        (fc): Linear(in_features=64, out_features=64, bias=True)
 

We make up some random data for the encoder and decoder inputs, convert them to tensors and move them to the selected device. We also make up padding masks that cover the second half of each input. We then try out our transformer on this data.

In [None]:
xe = np.random.randint(0, 20000, size=(8, 512))
xe_t = torch.tensor(xe).to(device)

xd = np.random.randint(0, 10000, size=(8, 256))
xd_t = torch.tensor(xd).to(device)

maske = np.ones((8, 512))
maske[:, 256:] = 0
maske_t = torch.tensor(maske).to(device)

maskd = np.ones((8, 256))
maskd[:, 128:] = 0
maskd_t = torch.tensor(maskd).to(device)

out = transformer(xe_t, xd_t, maske_t, maskd_t)
out.shape

torch.Size([8, 256, 10000])

## Train the Transformer to translate text from English to French

### Load and prepare data.

In [None]:
# load the original dataset
df_original = pd.read_csv('/content/drive/MyDrive/Transformers/fra.txt', sep='\t', header=None)
df_original.head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Go. | Va ! | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 1 | Go. | Marche. | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 2 | Go. | En route ! | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 3 | Go. | Bouge ! | CC-BY 2.0 (France) Attribution: tatoeba.org #2... |
| 4 | Hi. | Salut ! | CC-BY 2.0 (France) Attribution: tatoeba.org #5... |


In [None]:
# need only first two columns
df = df_original.iloc[:, 0:2]
df.head()

|   | 0 | 1 |
|---|---|---|
| 0 | Go. | Va ! |
| 1 | Go. | Marche. |
| 2 | Go. | En route ! |
| 3 | Go. | Bouge ! |
| 4 | Hi. | Salut ! |


In [None]:
df.shape

(217975, 2)

In [None]:
# take the first 30000 samples, otherwise it takes too long to train the model
df = df.iloc[:30000]

In [None]:
# name the columns by language code; they become the dataset features used below
df.columns = ['en', 'fr']
df.to_csv('fra.csv', index=False)

In [None]:
!head fra.csv

en,fr
Go.,Va !
Go.,Marche.
Go.,En route !
Go.,Bouge !
Hi.,Salut !
Hi.,Salut.
Run!,Cours !
Run!,Courez !
Run!,Prenez vos jambes à vos cous !


### Tokenize and pad data.

The next step is to tokenize the English and French sentences. We will use a pretrained tokenizer from Hugging Face, which requires installing a few packages first.

In [None]:
!pip install transformers datasets sentencepiece



In [None]:
!pip install sacremoses



In [None]:
from datasets import load_dataset
raw_dataset = load_dataset('csv', data_files='fra.csv')


In [None]:
raw_dataset

DatasetDict({
    train: Dataset({
        features: ['en', 'fr'],
        num_rows: 30000
    })
})

Split the data into the train and test sets.

In [None]:
split = raw_dataset['train'].train_test_split(test_size=0.3, seed=42)
split

DatasetDict({
    train: Dataset({
        features: ['en', 'fr'],
        num_rows: 21000
    })
    test: Dataset({
        features: ['en', 'fr'],
        num_rows: 9000
    })
})

We now import our tokenizer from Hugging Face.

In [None]:
# use Hugging Face tokenizer
from transformers import AutoTokenizer

model_checkpoint = 'Helsinki-NLP/opus-mt-en-fr'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

We explore how this tokenizer works.

In [None]:
# test the tokenizer
en_sentence = split['train'][0]['en']
fr_sentence = split['train'][0]['fr']

inputs = tokenizer(en_sentence)
targets = tokenizer(text_target=fr_sentence)

In [None]:
inputs

{'input_ids': [30705, 61, 3, 0], 'attention_mask': [1, 1, 1, 1]}

In [None]:
tokenizer.convert_ids_to_tokens(inputs['input_ids'])

['▁Replace', '▁it', '.', '</s>']

In [None]:
en_sentence

'Replace it.'

In [None]:
targets

{'input_ids': [31190, 8103, 21, 774, 3, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}

In [None]:
tokenizer.convert_ids_to_tokens(targets['input_ids'])

['▁Rem', 'place', '-', 'la', '.', '</s>']

In [None]:
fr_sentence

'Remplace-la.'

Define a preprocessing function that tokenizes our inputs and targets.

In [None]:
max_input_length = 128
max_target_length = 128

def preprocess_function(batch):
  model_inputs = tokenizer(
      batch['en'], max_length=max_input_length, truncation=True
  )

  # tokenizer for targets
  labels = tokenizer(
      text_target=batch['fr'], max_length=max_target_length, truncation=True
  )

  model_inputs['labels'] = labels['input_ids']
  return model_inputs


In [None]:
# apply preprocess function to our datasets
tokenized_datasets = split.map(
    preprocess_function,
    batched=True,
    remove_columns=split['train'].column_names
)


In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 21000
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 9000
    })
})

We will use a data collator to pad our sequences.

In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer)

Test how the data collator works on the first 5 samples.

In [None]:
# test the data_collator on the first 5 samples
batch = data_collator([tokenized_datasets['train'][i] for i in range(0,5)])
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels'])

In [None]:
batch['input_ids']

tensor([[30705,    61,     3,     0, 59513, 59513, 59513],
        [   47,   324,  6149,    15,  2086,     3,     0],
        [  560,     6,     9,   170,  6761,     3,     0],
        [12432,   357, 15002,  6149,     3,     0, 59513],
        [ 6149, 11726,  3964,     3,     0, 59513, 59513]])

It appears that 0 is used as the end-of-sentence token and 59513 as the padding token.

In [None]:
batch['attention_mask']

tensor([[1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 0, 0]])

In [None]:
batch['labels']

tensor([[31190,  8103,    21,   774,     3,     0,  -100,  -100,  -100,  -100,
          -100],
        [  234,     6, 27939,   924,  6149,     3,     0,  -100,  -100,  -100,
          -100],
        [  104,    81,     1,   133,    15,    53,    20,     1,   259,     3,
             0],
        [35698,    17,  4002,  1588,  6149,     3,     0,  -100,  -100,  -100,
          -100],
        [ 6149,   423,    14,     6,  1800,  5915,     3,     0,  -100,  -100,
          -100]])

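The targets are padded with -100, which is also the default `ignore_index` of `nn.CrossEntropyLoss`, so padded target positions do not contribute to the loss.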
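
The padding behaviour observed above can be mimicked with a short plain-Python sketch. This is not the implementation of `DataCollatorForSeq2Seq`; `pad_batch` and `attention_mask` are hypothetical helpers, and the token ids are copied from the first two rows of the batch shown above.

```python
PAD_ID = 59513      # pad token id observed in the batch above
LABEL_PAD = -100    # padding value observed in the labels above

def pad_batch(sequences, pad_value):
    # right-pad every sequence to the length of the longest one
    max_len = max(len(s) for s in sequences)
    return [s + [pad_value] * (max_len - len(s)) for s in sequences]

def attention_mask(sequences):
    # 1 for real tokens, 0 for padding positions
    max_len = max(len(s) for s in sequences)
    return [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]

# first two input and label sequences from the batch shown above
input_ids = [[30705, 61, 3, 0], [47, 324, 6149, 15, 2086, 3, 0]]
labels = [[31190, 8103, 21, 774, 3, 0], [234, 6, 27939, 924, 6149, 3, 0]]

padded_inputs = pad_batch(input_ids, PAD_ID)
mask = attention_mask(input_ids)
padded_labels = pad_batch(labels, LABEL_PAD)
```

Because this toy batch holds only two samples, the labels are padded to length 7 rather than 11 as in the five-sample batch above. The padding length always follows the longest sequence in the batch.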

In [None]:
# check which special ids exist in our tokenizer
tokenizer.all_special_ids

[0, 1, 59513]

In [None]:
tokenizer.all_special_tokens

['</s>', '<unk>', '<pad>']

In [None]:
# tokenizer adds end of sentence token to everything
tokenizer('<pad>')

{'input_ids': [59513, 0], 'attention_mask': [1, 1]}

### Loading data into the Transformer.

In [None]:
from torch.utils.data import DataLoader

train_loader = DataLoader(
    tokenized_datasets['train'],
    shuffle=True,
    batch_size=32,
    collate_fn=data_collator
)

valid_loader = DataLoader(
    tokenized_datasets['test'],
    batch_size=32,
    collate_fn=data_collator
)

In [None]:
# check how the data_loader works
for batch in train_loader:
  for k, v in batch.items():
    print("k:", k, "v.shape:", v.shape)
  break

k: input_ids v.shape: torch.Size([32, 8])
k: attention_mask v.shape: torch.Size([32, 8])
k: labels v.shape: torch.Size([32, 10])


We will need a start of sentence token to offset the decoder inputs.
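
The offset works as follows: the decoder input is the target sequence shifted right by one position, with the start token in front, so that the model predicts token `t` from tokens `0..t-1`. The sketch below shows this on plain Python lists. The helper names are hypothetical, and the constant `START_ID = 59514` is an assumed value for the start token id in this notebook.

```python
START_ID = 59514   # assumed id of the start-of-sentence token
PAD_ID = 59513     # pad token id
LABEL_PAD = -100   # padding value used in the labels

def make_decoder_input(labels):
    # shift right by one position and put the start token in front
    shifted = [START_ID] + labels[:-1]
    # -100 is not a valid embedding index, so map label padding to the pad token
    return [PAD_ID if t == LABEL_PAD else t for t in shifted]

def make_decoder_mask(dec_input):
    # 1 for real tokens, 0 for padding
    return [0 if t == PAD_ID else 1 for t in dec_input]

labels = [31190, 8103, 21, 774, 3, 0, -100, -100]
dec_input = make_decoder_input(labels)
dec_mask = make_decoder_mask(dec_input)
```

Here `dec_input` becomes `[59514, 31190, 8103, 21, 774, 3, 0, 59513]`, while the loss is still computed against the unshifted `labels`.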

In [None]:
tokenizer.vocab_size

59514

In [None]:
# add start of sentence token manually
tokenizer.add_special_tokens({'cls_token': '<s>'})

1

In [None]:
tokenizer('<s>')

{'input_ids': [59514, 0], 'attention_mask': [1, 1]}

In [None]:
# vocab_size reports the base vocabulary only, so the added start token does not change it
tokenizer.vocab_size

59514

### Build the model.

In [None]:
encoder = Encoder(vocab_size = tokenizer.vocab_size + 1,
                  max_len = 512,
                  d_k = 16,
                  d_model = 64,
                  n_heads = 4,
                  n_layers = 2,
                  dropout_prob = 0.1)

decoder = Decoder(vocab_size = tokenizer.vocab_size + 1,
                  max_len = 512,
                  d_k = 16,
                  d_model = 64,
                  n_heads = 4,
                  n_layers = 2,
                  dropout_prob = 0.1)

transformer = Transformer(encoder, decoder)

Move the model to the GPU if one is available.

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
encoder.to(device)
decoder.to(device)

cuda:0


Decoder(
  (embedding): Embedding(59515, 64)
  (pos_encoding): PositionalEncoding(
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer_blocks): Sequential(
    (0): DecoderBlock(
      (ln1): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (ln2): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (ln3): LayerNorm((64,), eps=1e-05, elementwise_affine=True)
      (mha1): MultiHeadAttention(
        (key): Linear(in_features=64, out_features=64, bias=True)
        (query): Linear(in_features=64, out_features=64, bias=True)
        (value): Linear(in_features=64, out_features=64, bias=True)
        (fc): Linear(in_features=64, out_features=64, bias=True)
      )
      (mha2): MultiHeadAttention(
        (key): Linear(in_features=64, out_features=64, bias=True)
        (query): Linear(in_features=64, out_features=64, bias=True)
        (value): Linear(in_features=64, out_features=64, bias=True)
        (fc): Linear(in_features=64, out_features=64, bias=True)
 

Define the loss function and the optimizer for our model.
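
To show what `ignore_index=-100` does, here is a plain-Python sketch of a mean cross-entropy that skips ignored targets. It is an illustration under simplifying assumptions (one list of scores per position, no batching), not PyTorch's implementation; the function name `cross_entropy` and the toy numbers are made up for the example.

```python
import math

def cross_entropy(logits, targets, ignore_index=-100):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        m = max(scores)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_sum_exp - scores[target]
        count += 1
    return total / count  # assumes at least one non-ignored target

logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [5.0, 5.0, 5.0]]
targets = [0, 1, -100]
loss = cross_entropy(logits, targets)
```

The third position is padding, so the loss is the same as if only the first two positions had been passed in.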

In [None]:
# Loss and optimizer
criterion = nn.CrossEntropyLoss(ignore_index=-100)
optimizer = torch.optim.Adam(transformer.parameters())

Define the training loop function.

In [None]:
from datetime import datetime
# the training loop
def train(model, criterion, optimizer, train_loader, valid_loader, epochs):
  train_losses = np.zeros(epochs)
  test_losses = np.zeros(epochs)

  for it in range(epochs):
    model.train()
    t0 = datetime.now()
    train_loss = []
    for batch in train_loader:
      batch = {k: v.to(device) for k, v in batch.items()}

      optimizer.zero_grad()

      enc_input = batch['input_ids']
      enc_mask = batch['attention_mask']
      targets = batch['labels']

      # shift targets forward to get decoder input
      dec_input = targets.clone().detach()
      dec_input = torch.roll(dec_input, shifts=1, dims=1)
      dec_input[:, 0] = tokenizer.cls_token_id  # id of the added '<s>' token

      # convert all -100 to pad token id
      dec_input = dec_input.masked_fill(dec_input == -100, tokenizer.pad_token_id)

      # make the decoder input mask
      dec_mask = torch.ones_like(dec_input)
      dec_mask = dec_mask.masked_fill(dec_input == tokenizer.pad_token_id, 0)

      # forward pass
      outputs = model(enc_input, dec_input, enc_mask, dec_mask)
      loss = criterion(outputs.transpose(2,1), targets)

      # backward and optimize
      loss.backward()
      optimizer.step()
      train_loss.append(loss.item())

    # get train loss and test loss
    train_loss = np.mean(train_loss)

    model.eval()
    test_loss = []
    # gradients are not needed for evaluation
    with torch.no_grad():
      for batch in valid_loader:
        batch = {k: v.to(device) for k, v in batch.items()}

        enc_input = batch['input_ids']
        enc_mask = batch['attention_mask']
        targets = batch['labels']

        # shift targets forward to get decoder input
        dec_input = targets.clone().detach()
        dec_input = torch.roll(dec_input, shifts=1, dims=1)
        dec_input[:, 0] = tokenizer.cls_token_id  # id of the added '<s>' token

        # change -100 to regular padding
        dec_input = dec_input.masked_fill(dec_input == -100, tokenizer.pad_token_id)

        # make decoder input mask
        dec_mask = torch.ones_like(dec_input)
        dec_mask = dec_mask.masked_fill(dec_input == tokenizer.pad_token_id, 0)

        outputs = model(enc_input, dec_input, enc_mask, dec_mask)
        loss = criterion(outputs.transpose(2,1), targets)
        test_loss.append(loss.item())

    test_loss = np.mean(test_loss)

    # save losses
    train_losses[it] = train_loss
    test_losses[it] = test_loss

    dt = datetime.now() - t0
    print(f'Epoch {it+1}/{epochs}, Train loss: {train_loss:.4f}, Test loss: {test_loss:.4f}, Duration: {dt} ')

  return train_losses, test_losses

### Train the model.

In [None]:
train_losses, test_losses = train(transformer, criterion, optimizer, train_loader, valid_loader, epochs=15)

Epoch 1/15, Train loss: 4.2763, Test loss: 3.1911, Duration: 0:00:21.843785 
Epoch 2/15, Train loss: 2.9318, Test loss: 2.6960, Duration: 0:00:17.250656 
Epoch 3/15, Train loss: 2.5081, Test loss: 2.4333, Duration: 0:00:17.266946 
Epoch 4/15, Train loss: 2.2147, Test loss: 2.2270, Duration: 0:00:18.134786 
Epoch 5/15, Train loss: 1.9805, Test loss: 2.0937, Duration: 0:00:17.904731 
Epoch 6/15, Train loss: 1.7976, Test loss: 1.9829, Duration: 0:00:17.372011 
Epoch 7/15, Train loss: 1.6418, Test loss: 1.9066, Duration: 0:00:17.766868 
Epoch 8/15, Train loss: 1.5181, Test loss: 1.8601, Duration: 0:00:17.901532 
Epoch 9/15, Train loss: 1.4134, Test loss: 1.8138, Duration: 0:00:18.169078 
Epoch 10/15, Train loss: 1.3331, Test loss: 1.7862, Duration: 0:00:17.427760 
Epoch 11/15, Train loss: 1.2572, Test loss: 1.7745, Duration: 0:00:17.488900 
Epoch 12/15, Train loss: 1.2017, Test loss: 1.7553, Duration: 0:00:17.321237 
Epoch 13/15, Train loss: 1.1481, Test loss: 1.7515, Duration: 0:00:18.287

### Apply the model to translate English sentences to French

We define the function to do translation.
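
The translation function uses greedy decoding: at each step the highest-scoring next token is appended to the sequence, until the end-of-sentence token appears or a step limit is reached. The idea can be sketched in plain Python with a stand-in for the decoder. Everything here is hypothetical: `toy_next_token_scores` is a made-up scoring function over a five-token vocabulary, and the ids `START_ID` and `EOS_ID` are toy values.

```python
START_ID, EOS_ID = 1, 0   # hypothetical toy ids

def toy_next_token_scores(prefix):
    # hypothetical stand-in for the decoder: scores over a 5-token vocabulary;
    # it favours token len(prefix) + 1 and falls back to EOS when that id is out of range
    vocab_size = 5
    scores = [0.0] * vocab_size
    nxt = len(prefix) + 1
    scores[nxt if nxt < vocab_size else EOS_ID] = 1.0
    return scores

def greedy_decode(next_token_scores, max_steps=32):
    tokens = [START_ID]
    for _ in range(max_steps):
        scores = next_token_scores(tokens)
        # pick the highest-scoring token (argmax)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        # stop once the end-of-sentence token is produced
        if next_id == EOS_ID:
            break
    return tokens[1:]  # drop the start token

decoded = greedy_decode(toy_next_token_scores)
```

Greedy decoding is simple and fast, but it commits to the locally best token at every step and may miss a better overall sequence.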

In [None]:
@torch.no_grad()  # inference only, no gradients needed
def translate(input_sentence):
  # make sure dropout is disabled
  encoder.eval()
  decoder.eval()

  # get encoder output
  enc_input = tokenizer(input_sentence, return_tensors='pt').to(device)
  enc_output = encoder(enc_input['input_ids'], enc_input['attention_mask'])

  # set initial decoder input to be the start of sentence token
  dec_input_ids = torch.tensor([[tokenizer.cls_token_id]], device=device)
  dec_attn_mask = torch.ones_like(dec_input_ids)

  # decoder loop (we predict next token and append it to decoder input ids
  # for next iteration of the loop)
  for _ in range(32):
    dec_output = decoder(
        enc_output,
        dec_input_ids,
        enc_input['attention_mask'],
        dec_attn_mask
    )

    prediction_id = torch.argmax(dec_output[:, -1, :], dim=-1)

    dec_input_ids = torch.hstack((dec_input_ids, prediction_id.view(1,1)))

    dec_attn_mask = torch.ones_like(dec_input_ids)

    # stop when the end of sentence token is predicted
    if prediction_id.item() == tokenizer.eos_token_id:
      break

  translation = tokenizer.decode(dec_input_ids[0, 1:])
  print(translation)

In [None]:
translate('How are you?')

Comment tu es-tu?</s>


In [None]:
translate('The sky is blue.')

Le ciel est bleue.</s>
