# Lab 7: Transformer in Practice


In this lab, we train an ``nn.TransformerEncoder`` model on a
language modeling task. Language modeling assigns a
probability to a given word (or sequence of words) following a sequence of words.

A sequence of tokens is first passed to the embedding
layer, followed by a positional encoding layer to account for the order
of the words (see the rest of the lab for more details). The
``nn.TransformerEncoder`` consists of multiple layers of
[``nn.TransformerEncoderLayer``](https://pytorch.org/docs/master/nn.html?highlight=transformerencoderlayer#torch.nn.TransformerEncoderLayer).

Along with the input sequence, a square
attention mask is required because the self-attention layers in
``nn.TransformerEncoder`` are only allowed to attend to earlier positions in
the sequence. For the language modeling task, any tokens at future
positions must be masked.
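
As a minimal sketch of what such a mask looks like (the size `sz = 4` is an arbitrary choice for illustration), the snippet below builds an additive mask in which entry `(i, j)` is `-inf` when position `j` lies in the future of position `i`, and `0` otherwise. Adding `-inf` to an attention score before the softmax drives the corresponding attention weight to zero.

```python
import torch

sz = 4  # illustrative sequence length (assumption, not a value from the lab)

# Entries strictly above the diagonal (future positions) are -inf; the rest are 0.
mask = torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)
print(mask)

# With uniform scores, row i of softmax(scores + mask) spreads its weight
# over positions 0..i only and puts zero weight on positions j > i.
scores = torch.zeros(sz, sz)
weights = torch.softmax(scores + mask, dim=-1)
print(weights)
```

The first row of `weights` attends only to position 0, while the last row attends uniformly to all four positions.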

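To obtain a probability distribution over the words of the vocabulary, the output
of the ``nn.TransformerEncoder`` model is sent to a final Linear
layer, which is followed by a log-softmax function. In the code of this lab, the
log-softmax is not applied inside the model: it is computed by ``nn.CrossEntropyLoss``.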
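
A minimal sketch of how the log-softmax relates to the loss used later in this lab. The tensors are random and their sizes (5 positions, a toy vocabulary of 7 tokens) are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 7)           # output of a linear layer: 5 positions, 7 tokens
targets = torch.randint(0, 7, (5,))  # one target token index per position

# nn.CrossEntropyLoss applies a log-softmax, then the negative log-likelihood loss.
ce = nn.CrossEntropyLoss()(logits, targets)
log_probs = F.log_softmax(logits, dim=-1)
nll = F.nll_loss(log_probs, targets)

print(ce.item(), nll.item())  # the two values agree
```

This is why a model can return raw linear outputs when it is trained with `nn.CrossEntropyLoss`.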




In [52]:
%matplotlib inline

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

You may need to install torchtext on Google Colab or on your personal laptop.

In [53]:
# On Google Colab
#!pip install torchtext
# or with Anaconda
#conda install -c pytorch torchtext

In [54]:
import torchtext
print(torchtext.__version__)

0.11.0


## Define the model
----------------




#### First steps with the transformer layers

"TransformerEncoderLayer" is made up of a self-attention block and a feedforward network. This standard encoder layer is based on the paper “Attention Is All You Need”.

"dim_feedforward" is the dimension of the hidden layer in the feedforward neural network.

More details [here](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html#torch.nn.TransformerEncoderLayer)
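
To make the role of "dim_feedforward" concrete, here is a hedged sketch of a position-wise feedforward block: two linear maps with a ReLU in between. It only illustrates the shapes. It omits the dropout, residual connection and layer normalization that the real layer also applies, and the variable names are ours, not PyTorch's.

```python
import torch
import torch.nn as nn

d_model, dim_feedforward = 512, 1000

# Expand each position from d_model to dim_feedforward, then project back.
feed_forward = nn.Sequential(
    nn.Linear(d_model, dim_feedforward),
    nn.ReLU(),
    nn.Linear(dim_feedforward, d_model),
)

x = torch.rand(10, 32, d_model)  # (sequence length, batch size, d_model)
hidden = feed_forward[0](x)      # hidden representation of size dim_feedforward
out = feed_forward(x)            # back to d_model
print(hidden.shape, out.shape)
```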

In [55]:
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=1000)
src = torch.rand(10, 32, 512)  # (sequence length, batch size, d_model)
out = encoder_layer(src)
print(src.shape)
print(out.shape)

torch.Size([10, 32, 512])
torch.Size([10, 32, 512])


Create a transformer as a stack of encoder layers. "TransformerEncoder" is a stack of "num_layers" encoder layers.

More details on them [here](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoder.html#torch.nn.TransformerEncoder)

In [56]:
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
src = torch.rand(9, 32, 512)  # (sequence length, batch size, d_model)
out = transformer_encoder(src)
print(src.shape)
print(out.shape)

torch.Size([9, 32, 512])
torch.Size([9, 32, 512])


### Question: what is the meaning of parameters in "__init__"?

#### Response:



### Question: explain how the "forward" function works? What is the role of the mask "src_mask"?

#### Response:




In [57]:
class TransformerModel(nn.Module):

    def __init__(self, ntoken, ninp, nhead, nhid, nlayers, dropout=0.5):
        super(TransformerModel, self).__init__()
        from torch.nn import TransformerEncoder, TransformerEncoderLayer
        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(ninp, dropout)
        encoder_layers = TransformerEncoderLayer(ninp, nhead, nhid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.encoder = nn.Embedding(ntoken, ninp)
        self.ninp = ninp
        self.decoder = nn.Linear(ninp, ntoken)

        self.init_weights()

    # Generate a square causal mask: entries above the diagonal (future positions) are -inf,
    # entries on and below the diagonal are 0.0
    def generate_square_subsequent_mask(self, sz):
        # torch.triu returns a copy of a matrix with the elements below the k-th diagonal zeroed.
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src, src_mask):
        src = self.encoder(src) * math.sqrt(self.ninp)
        src = self.pos_encoder(src) # add the positional encoding (pe) to src: src <- src + pe
        output = self.transformer_encoder(src, src_mask)
        output = self.decoder(output)
        return output

### Positional Encoding

The ``PositionalEncoding`` module injects some information about the
relative or absolute position of the tokens in the sequence. The
positional encodings have the same dimension as the embeddings so that
the two can be summed. Here, we use ``sine`` and ``cosine`` functions of
different frequencies.
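
As a sketch of the sinusoidal formula from “Attention Is All You Need”, dimension $2i$ of position $pos$ is $\sin(pos / 10000^{2i/d_{model}})$ and dimension $2i+1$ is the matching cosine. The pure-Python helper below (`positional_encoding` is our own illustrative name, and `d_model = 8` is an arbitrary small value) computes one row of the encoding table.

```python
import math

def positional_encoding(pos, d_model):
    """Return the sinusoidal encoding of one position as a list of d_model floats."""
    pe = []
    for k in range(d_model):
        # Dimensions 2i and 2i+1 share the frequency 1 / 10000**(2i / d_model).
        two_i = k - (k % 2)
        angle = pos / (10000 ** (two_i / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

d_model = 8  # small illustrative dimension (assumption)
pe0 = positional_encoding(0, d_model)
pe1 = positional_encoding(1, d_model)
print(pe0)                           # position 0: sines are 0, cosines are 1
print([round(v, 4) for v in pe1])
```

Low dimensions oscillate quickly with the position and high dimensions slowly, so each position receives a distinct pattern.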




### Question: what is the role of "Positional Encoding"?

#### Response: 



In [58]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

### Question: What is the role of the mask "src_mask" in the class "TransformerModel"?

#### Response:



Load and batch data
-------------------




This lab uses ``torchtext`` to load the Wikitext-2 dataset. The
vocab object is built from the training set and is used to convert
tokens into integer indices stored in tensors. Starting from sequential data, the ``batchify()``
function arranges the dataset into columns, trimming off any tokens remaining
after the data has been divided into ``batch_size`` columns.
For instance, with the alphabet as the sequence (total length of 26)
and a batch size of 4, we would divide the alphabet into 4 sequences of
length 6:
$$
\begin{align}\begin{bmatrix}
  \text{A} & \text{B} & \text{C} & \ldots & \text{X} & \text{Y} & \text{Z}
  \end{bmatrix}
  \Rightarrow
  \begin{bmatrix}
  \begin{bmatrix}\text{A} \\ \text{B} \\ \text{C} \\ \text{D} \\ \text{E} \\ \text{F}\end{bmatrix} &
  \begin{bmatrix}\text{G} \\ \text{H} \\ \text{I} \\ \text{J} \\ \text{K} \\ \text{L}\end{bmatrix} &
  \begin{bmatrix}\text{M} \\ \text{N} \\ \text{O} \\ \text{P} \\ \text{Q} \\ \text{R}\end{bmatrix} &
  \begin{bmatrix}\text{S} \\ \text{T} \\ \text{U} \\ \text{V} \\ \text{W} \\ \text{X}\end{bmatrix}
  \end{bmatrix}\end{align}
$$
These columns are treated as independent by the model, which means that
the dependence of ``G`` on ``F`` cannot be learned, but this allows more
efficient batch processing.
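
The alphabet example can be reproduced with plain Python strings. This is only a sketch of the layout, not the tensor code used in the lab; the variable names are ours.

```python
import string

letters = string.ascii_uppercase  # 26 tokens: 'A' ... 'Z'
bsz = 4                           # number of columns (the batch size)

n_rows = len(letters) // bsz      # 6 tokens per column
trimmed = letters[:n_rows * bsz]  # drops the leftover 'Y' and 'Z'

# Column k holds a contiguous slice of the sequence.
columns = [trimmed[k * n_rows:(k + 1) * n_rows] for k in range(bsz)]
print(columns)

# Row r of the batched layout takes the r-th token of every column.
rows = ["".join(col[r] for col in columns) for r in range(n_rows)]
print(rows)
```

Reading down a column gives consecutive tokens of the text, whereas a row mixes tokens taken from four distant places in the text.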




#### What is TorchText?

TorchText is a PyTorch package that contains data processing methods as well as popular NLP datasets. According to the official PyTorch documentation, torchtext has four main components: data, datasets, vocab, and utils. Data is mainly used to create custom dataset classes, batch samples, etc. Datasets contains various NLP datasets, from sentiment analysis to question answering. Vocab covers different methods of processing text, and utils contains additional helper functions.

#### Warning: path to access the dataset

The following variables "from_path" and "to_path" are necessary to store the dataset.

The next cell gives values for Google Colab (commented out) and for a personal Python installation (active).

Adapt them to the environment you are using.

In [59]:
# Google Colab
#from_path = '/content/sample_data/wikitext-2-v1.zip'
#to_path = '/content/sample_data/'

# Personal laptop
from_path = './wikitext-2-v1.zip'
to_path = './'

Download and preprocess the dataset.

In [60]:
import io
import torch
from torchtext.utils import download_from_url, extract_archive
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

url = 'https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip'
download_from_url(url, from_path)
test_filepath, valid_filepath, train_filepath = extract_archive(from_path, to_path)



Print the file paths

In [61]:
print(test_filepath)
print(valid_filepath)
print(train_filepath)

./wikitext-2/wiki.test.tokens
./wikitext-2/wiki.valid.tokens
./wikitext-2/wiki.train.tokens


### Question: what is the role of "tokenizer"?



#### Response: 


### Question: what is the role of "vocab"?

In [62]:
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(map(tokenizer,
                                      iter(io.open(train_filepath,
                                                   encoding="utf8"))))

#### Response: 


### Question: what is the role of the function "data_process"?

#### Response: 


In [63]:
def data_process(raw_text_iter):
  data = [torch.tensor([vocab[token] for token in tokenizer(item)],
                       dtype=torch.long) for item in raw_text_iter]
  return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

train_data = data_process(iter(io.open(train_filepath, encoding="utf8")))
val_data = data_process(iter(io.open(valid_filepath, encoding="utf8")))
test_data = data_process(iter(io.open(test_filepath, encoding="utf8")))



In [64]:
print(type(train_data))
# The length of train_data is the number of tokens that have been encoded.
print(train_data.shape)
# Print the first 10 items as numbers
print(train_data[0:10])

<class 'torch.Tensor'>
torch.Size([2049990])
tensor([    9,  3849,  3869,   881,     9, 20000,    83,  3849,    88,     4])


We keep only the first 10,000 tokens of each split to get results faster.

In [65]:
print(train_data.shape)
print(val_data.shape)
print(test_data.shape)
train_data = train_data[0:10000]
val_data = val_data[0:10000]
test_data = test_data[0:10000]

torch.Size([2049990])
torch.Size([214417])
torch.Size([241859])


Use a GPU if possible

In [66]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Create the batches

### Question: what is the physical meaning (with respect to the raw text) of a batch?

#### Response:



In [67]:
def batchify(data, bsz):
    # Split the 1-D token tensor into bsz columns of equal length.
    nbatch = data.size(0) // bsz # number of tokens per column
    # Trim off any extra elements that wouldn't cleanly fit (remainders).
    data = data.narrow(0, 0, nbatch * bsz)
    # Reshape to (bsz, nbatch), then transpose so that each column is a contiguous chunk of the text.
    # "contiguous()" returns a tensor with the same data laid out contiguously in memory.
    data = data.view(bsz, -1).t().contiguous()
    return data.to(device)

batch_size = 20 # number of columns (parallel sequences) in the training data
eval_batch_size = 10

import copy 
train_data_initial = copy.deepcopy(train_data)

train_data = batchify(train_data, batch_size)
val_data = batchify(val_data, eval_batch_size)
test_data = batchify(test_data, eval_batch_size)

## Functions to generate input and target sequence
--------------------------------------------------

The ``get_batch()`` function generates the input and target sequences for
the transformer model. It subdivides the source data into chunks of
length ``bptt``. For the language modeling task, the target of each
input token is the token that follows it. For example, with a ``bptt`` value of 2,
we’d get the following two tensors for ``i`` = 0:

![](https://pytorch.org/tutorials/_images/transformer_input_target.png)

It means that from each column of "Input" (a column is a sequence), we want to predict the column of "Target".
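
The slicing can be sketched on a toy tensor. The data below (6 time steps, 2 columns, filled with consecutive integers) is an assumption made for illustration. The slicing logic mirrors the ``get_batch()`` cell of this lab.

```python
import torch

# Toy batched data: 6 rows (time steps) and 2 columns (independent sequences).
# Column 0 contains 0..5 and column 1 contains 6..11.
source = torch.arange(12).view(2, 6).t().contiguous()

bptt = 2  # chunk length
i = 0     # index of the first row of the chunk
seq_len = min(bptt, len(source) - 1 - i)
data = source[i:i + seq_len]                        # shape (seq_len, 2)
target = source[i + 1:i + 1 + seq_len].reshape(-1)  # shifted by one row, then flattened

print(data)    # rows 0 and 1
print(target)  # rows 1 and 2, flattened row by row
```

Each entry of `target` is the token that follows the corresponding entry of `data` in the same column.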





### Question: what is returned by "get_batch"?

#### Response: 



In [68]:
bptt = 35
def get_batch(source, i):
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i+seq_len]
    target = source[i+1:i+1+seq_len].reshape(-1)
    return data, target

Initiate an instance
--------------------




The model is set up with the hyperparameters below. The vocab size is
equal to the length of the vocab object.




In [69]:
ntokens = len(vocab.get_stoi()) # the size of vocabulary
emsize = 200 # embedding dimension (it corresponds to "ninp")
nhid = 200 # the dimension of the feedforward network model in nn.TransformerEncoder
nlayers = 2 # the number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead = 2 # the number of heads in the multiheadattention models
dropout = 0.2 # the dropout value
model = TransformerModel(ntokens, emsize, nhead, nhid, nlayers, dropout).to(device)

Run the model
-------------




[``CrossEntropyLoss``](https://pytorch.org/docs/master/nn.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss) is used as the loss function and
[``SGD``](https://pytorch.org/docs/master/optim.html?highlight=sgd#torch.optim.SGD)
implements the stochastic gradient descent method as the optimizer.

The initial learning rate is set to 5.0. [``StepLR``](https://pytorch.org/docs/master/optim.html?highlight=steplr#torch.optim.lr_scheduler.StepLR) is
applied to adjust the learning rate through the epochs.
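
With a step size of one epoch and `gamma=0.95` (the values passed to the scheduler in this lab), the learning rate is multiplied by 0.95 after every epoch. The arithmetic can be sketched without PyTorch:

```python
initial_lr = 5.0
gamma = 0.95

# Learning rate in effect during epoch k (counting from 0): initial_lr * gamma ** k
schedule = [initial_lr * gamma ** k for k in range(4)]
print([round(lr, 4) for lr in schedule])
```

The decay is geometric: after 10 epochs the learning rate would be about 60% of its initial value.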

During training, we use the
[``nn.utils.clip_grad_norm_``](https://pytorch.org/docs/master/nn.html?highlight=nn%20utils%20clip_grad_norm#torch.nn.utils.clip_grad_norm_)
function to rescale all the gradients together so that they do not explode.
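
A minimal sketch of the effect of gradient-norm clipping, using a single parameter whose gradient is set by hand (the values are chosen for illustration so that the norm is exactly 5):

```python
import torch

# One parameter with a hand-set gradient [3, 4], whose Euclidean norm is 5.
w = torch.nn.Parameter(torch.zeros(2))
w.grad = torch.tensor([3.0, 4.0])

max_norm = 0.5
# The function rescales the gradients in place and returns the norm before clipping.
total_norm = torch.nn.utils.clip_grad_norm_([w], max_norm)

print(total_norm.item())              # norm before clipping
print(w.grad, w.grad.norm().item())   # same direction, norm close to max_norm
```

The direction of the gradient is preserved; only its length is reduced. Gradients whose norm is already below `max_norm` are left unchanged.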




### Question: what is the role of "StepLR"?

#### Response:



In [70]:
criterion = nn.CrossEntropyLoss()
lr = 5.0 # learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.95)

Perplexity is a standard evaluation metric for language models.

Perplexity, called ppl in the next cell, is the exponential of the average cross-entropy over a corpus (Mikolov et al., 2011); the lower it is, the better the model predicts the text.
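
A small worked example of this definition, in plain Python. The probabilities are hypothetical values, not outputs of the model trained in this lab.

```python
import math

# Hypothetical probabilities assigned to the correct next token at 4 positions.
probs = [0.25, 0.5, 0.125, 0.25]

# The average cross-entropy (in nats) is the mean negative log-probability.
avg_cross_entropy = -sum(math.log(p) for p in probs) / len(probs)
ppl = math.exp(avg_cross_entropy)
print(avg_cross_entropy, ppl)

# A model that is uniform over V tokens has a perplexity of V.
V = 1000
uniform_ppl = math.exp(-math.log(1.0 / V))
print(uniform_ppl)
```

Here the perplexity is 4 (up to floating-point rounding): on average, the model is as uncertain as if it were choosing uniformly among 4 tokens.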

### Question: what is the role of "torch.nn.utils.clip_grad_norm_"?

#### Response: 


In [71]:
import time

log_interval = 20

def train():
    model.train() # Turn on the train mode
    total_loss = 0.
    start_time = time.time()
    src_mask = model.generate_square_subsequent_mask(bptt).to(device)
    # Loop over the training batches
    for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):
        data, targets = get_batch(train_data, i)
        optimizer.zero_grad()
        if data.size(0) != bptt:
            src_mask = model.generate_square_subsequent_mask(data.size(0)).to(device)
        output = model(data, src_mask)
        loss = criterion(output.view(-1, ntokens), targets)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        total_loss += loss.item()
        
        # Print a summary each log_interval iterations
        # The summary is focused on the loss 
        if batch % log_interval == 0 and batch > 0:
            cur_loss = total_loss / log_interval # compute the current loss over the log_interval
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | '
                  'lr {:02.2f} | ms/batch {:5.2f} | '
                  'loss {:5.2f} | ppl {:8.2f}'.format(
                    epoch, batch, len(train_data) // bptt, scheduler.get_last_lr()[0],
                    elapsed * 1000 / log_interval,
                    cur_loss, math.exp(cur_loss)))
            total_loss = 0
            start_time = time.time()

The next function evaluates a trained model on a dataset and returns the average loss per token.

In [72]:
def evaluate(eval_model, data_source):
    eval_model.eval() # Turn on the evaluation mode
    total_loss = 0.
    # Build the mask with the model being evaluated (not the global "model")
    src_mask = eval_model.generate_square_subsequent_mask(bptt).to(device)
    with torch.no_grad():
        for i in range(0, data_source.size(0) - 1, bptt):
            data, targets = get_batch(data_source, i)
            if data.size(0) != bptt:
                src_mask = eval_model.generate_square_subsequent_mask(data.size(0)).to(device)
            output = eval_model(data, src_mask)
            output_flat = output.view(-1, ntokens)
            total_loss += len(data) * criterion(output_flat, targets).item()
    return total_loss / (len(data_source) - 1)

Loop over epochs. Save the model if the validation loss is the best
we've seen so far. Adjust the learning rate after each epoch.



In [73]:
best_val_loss = float("inf")
epochs = 3 # The number of epochs
best_model = None

for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    train() # train a model for one epoch
    val_loss = evaluate(model, val_data) # evaluate the performance of the trained model
    # Print the performance of the trained model on the evaluation dataset
    print('-' * 89)
    print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
          'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
                                     val_loss, math.exp(val_loss)))
    print('-' * 89)

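    # Keep a copy of the model if the validation loss decreases
    # (a plain assignment would only store a reference to the model still being trained)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_model = copy.deepcopy(model)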

    scheduler.step() # Update the scheduler step with "StepLR" at each epoch

| epoch   1 |    20/ 2928 batches | lr 5.00 | ms/batch 22.25 | loss 10.73 | ppl 45787.60
| epoch   1 |    40/ 2928 batches | lr 5.00 | ms/batch 17.40 | loss  9.36 | ppl 11601.78
| epoch   1 |    60/ 2928 batches | lr 5.00 | ms/batch 16.85 | loss  8.61 | ppl  5462.68
| epoch   1 |    80/ 2928 batches | lr 5.00 | ms/batch 16.70 | loss  8.06 | ppl  3179.35
| epoch   1 |   100/ 2928 batches | lr 5.00 | ms/batch 17.02 | loss  7.88 | ppl  2636.28
| epoch   1 |   120/ 2928 batches | lr 5.00 | ms/batch 17.00 | loss  7.66 | ppl  2112.96
| epoch   1 |   140/ 2928 batches | lr 5.00 | ms/batch 16.80 | loss  7.61 | ppl  2025.87
| epoch   1 |   160/ 2928 batches | lr 5.00 | ms/batch 16.55 | loss  7.36 | ppl  1572.95
| epoch   1 |   180/ 2928 batches | lr 5.00 | ms/batch 16.80 | loss  7.30 | ppl  1479.28
| epoch   1 |   200/ 2928 batches | lr 5.00 | ms/batch 16.70 | loss  7.20 | ppl  1334.86
| epoch   1 |   220/ 2928 batches | lr 5.00 | ms/batch 16.75 | loss  7.11 | ppl  1226.22
| epoch   1 |   240/ 

Evaluate the model with the test dataset
-------------------------------------

Apply the best model to check the result with the test dataset.



In [74]:
test_loss = evaluate(best_model, test_data)
print('=' * 89)
print('| End of training | test loss {:5.2f} | test ppl {:8.2f}'.format(
    test_loss, math.exp(test_loss)))
print('=' * 89)

| End of training | test loss  5.47 | test ppl   236.89
