Lab3 assignment
===================================

To train the translator, we will do the following steps:

1. Load the Multi30k dataset using ``torchtext``
2. Define a Transformer network and a loss function
3. Train the network on the training data
4. Test the network on the test data

In [None]:
%matplotlib inline
!python -m spacy download en
!python -m spacy download de
!pip install torchtext==0.6.0

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

import numpy as np

import random
import math
import time

#We'll set the random seeds for deterministic results.
SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')
Collecting de_core_news_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.2.5/de_core_news_sm-2.2.5.tar.gz (14.9MB)
     |████████████████████████████████| 14.9MB 400kB/s
Building wheels for collected packages: de-core-news-sm
  Building wheel for de-core-news-sm (setup.py) ... done
  Created wheel for de-core-news-sm: filename=de_core_news_sm-2.2.5-cp36-none-any.whl size=14907056 sha256=e1ceca3424cf5f01a7a3135e3e8bab8628b247b1432ab4d07a7b115714bf10dc
  Stored in directory: /tmp/pip-ephem-wheel-cache-vusgxces/wheels/ba/3f/ed/d4aa8e45e7191b7f32db4bfad565e7da1edbf05c916ca7a1ca
Successfully built de-core-news-sm
Inst

## 1. Preparing the Data

``torchtext`` has utilities for creating datasets that can be easily
iterated through for the purposes of creating a language translation
model. One key class is
[`Field`](https://github.com/pytorch/text/blob/master/torchtext/data/field.py#L64),
which specifies how each sentence should be preprocessed. Another is
``TranslationDataset``; ``torchtext`` ships several such datasets, and
in this exercise we'll use the
[`Multi30k dataset`](https://github.com/multi30k/dataset), which contains about
30,000 sentences (averaging about 13 words in length) in both English and German.

Tokenization in this exercise is done with [`spaCy`](https://spacy.io).
We use spaCy because it provides strong support for tokenization in languages
other than English. The following code tokenizes each of the sentences
in the ``TranslationDataset`` with the tokenizer defined in its ``Field``.



In [None]:
SRC = Field(tokenize = "spacy",
            tokenizer_language="de",
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)

TRG = Field(tokenize = "spacy",
            tokenizer_language="en",
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)

#Split the data into 29,000 training, 1,014 validation and 1,000 testing samples.
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),
                                                    fields = (SRC, TRG))

downloading training.tar.gz
training.tar.gz: 100%|██████████| 1.21M/1.21M [00:00<00:00, 5.04MB/s]
downloading validation.tar.gz
validation.tar.gz: 100%|██████████| 46.3k/46.3k [00:00<00:00, 1.44MB/s]
downloading mmt_task1_test2016.tar.gz
mmt_task1_test2016.tar.gz: 100%|██████████| 66.2k/66.2k [00:00<00:00, 1.40MB/s]


Now that we've defined ``train_data``, we can use another useful
feature of ``torchtext``'s ``Field``: the ``build_vocab`` method,
which creates the vocabulary associated with each language.

Once ``build_vocab`` has been run, ``SRC.vocab.stoi`` will be a
dictionary with the tokens in the vocabulary as keys and their
corresponding indices as values; ``SRC.vocab.itos`` is the inverse
mapping, a list whose ``i``-th entry is the token with index ``i``.
With ``min_freq = 2``, tokens that occur fewer than twice in the
training data are left out of the vocabulary and map to ``<unk>``.
We use these mappings later to look up the padding index and to turn
predicted indices back into words.
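
To make the two mappings concrete, here is a minimal plain-Python sketch of a vocabulary builder with a frequency cutoff. It is not the ``torchtext`` implementation: the helper names `build_vocab` and `numericalize`, the order of the special tokens, and the alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import Counter

def build_vocab(sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Toy vocabulary builder: returns (stoi, itos) for tokenized sentences."""
    counts = Counter(tok for sent in sentences for tok in sent)
    itos = list(specials)
    # Most frequent tokens first; ties are broken alphabetically so the
    # result is deterministic.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in itos:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def numericalize(sentence, stoi):
    """Map tokens to indices; tokens outside the vocabulary become <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in sentence]

toy_corpus = [["ein", "mann", "läuft"],
              ["ein", "hund", "läuft"],
              ["eine", "frau"]]
stoi, itos = build_vocab(toy_corpus)
ids = numericalize(["ein", "hund", "läuft"], stoi)
print(itos)
print(ids)
```

In this toy corpus only `ein` and `läuft` occur twice, so every other word is replaced by the `<unk>` index when a sentence is numericalized.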



In [None]:
print(len(train_data)) #number of training samples
print(len(valid_data)) #number of validation samples
print(len(test_data)) #number of testing samples

29000
1014
1000


In [None]:
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)


The last ``torchtext``-specific feature we'll use is the ``BucketIterator``,
which is easy to use since it takes a ``TranslationDataset`` as its
first argument. Specifically, as the docs say:

> Defines an iterator that batches examples of similar lengths together.
> Minimizes amount of padding needed while producing freshly shuffled
> batches for each new epoch. See pool for the bucketing procedure used.
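
The effect of bucketing on padding can be sketched in plain Python. This is a simplified illustration, not the actual ``BucketIterator`` procedure: the helpers `bucket_batches` and `padding_needed` are hypothetical, and the sketch sorts the whole dataset by length, whereas the real iterator works on smaller pools of examples.

```python
import random

def bucket_batches(examples, batch_size, seed=0):
    """Group token lists of similar length into batches, then shuffle the batches."""
    rng = random.Random(seed)
    ordered = sorted(examples, key=len)
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    rng.shuffle(batches)  # batch order changes, batch contents do not
    return batches

def padding_needed(batches):
    """Count the pad tokens required to make every batch rectangular."""
    return sum(max(len(s) for s in batch) * len(batch) - sum(len(s) for s in batch)
               for batch in batches)

rng = random.Random(1)
examples = [["tok"] * rng.randint(3, 30) for _ in range(64)]
naive = [examples[i:i + 8] for i in range(0, len(examples), 8)]
bucketed = bucket_batches(examples, batch_size=8)
print("padding without bucketing:", padding_needed(naive))
print("padding with bucketing:   ", padding_needed(bucketed))
```

Because each batch is padded to its longest sentence, grouping sentences of similar length keeps the amount of padding low.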



In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 1024

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    device = device)

## 2. Build the Transformer Model

In [None]:
from torch.nn import Transformer
class TransformerModel(nn.Module):

    def __init__(self, ntoken_in, ntoken_out, ninp, nhead, npf_dim, nlayers, src_pad_idx, trg_pad_idx, dropout=0.5):
        super(TransformerModel, self).__init__()

        # --------------- param -----------------
        # ntoken_in: size of the source vocabulary (number of input token ids)
        # ntoken_out: size of the target vocabulary (number of output token ids)
        # ninp: the number of expected features in the encoder/decoder inputs (d_model)
        # nhead: the number of heads in multi-head attention
        # npf_dim: the dimension of the feedforward layer
        # nlayers: the number of encoder layers and of decoder layers
        # src_pad_idx: the index of the padding token in the source vocabulary
        # trg_pad_idx: the index of the padding token in the target vocabulary
        # ----------------------------------------

        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(ninp, dropout)
        self.transformer = Transformer(d_model=ninp, nhead=nhead, num_encoder_layers=nlayers, num_decoder_layers=nlayers,
                                       dim_feedforward=npf_dim, dropout=dropout, activation='relu')
      
        self.encoder_en = nn.Embedding(ntoken_in, ninp)  # tok_embedding for input 
        self.encoder_de = nn.Embedding(ntoken_out, ninp) # tok_embedding for output 
        self.ninp = ninp
        self.decoder = nn.Linear(ninp, ntoken_out)

        self.src_pad_idx = src_pad_idx
        self.tgt_pad_idx = trg_pad_idx

        self.init_weights()

    def _generate_src_key_mask(self, src):
        # Key padding mask for the transformer: positions with the value True
        # are ignored by attention, positions with the value False are left
        # unchanged. We mask all padding tokens.
        # src is s*b, so the transposed mask is b*s.
        src_mask = (src == self.src_pad_idx)
        return src_mask.T

    def _generate_tgt_mask(self, tgt, sz):
        # Besides the key padding mask, the decoder input (the teacher-forced
        # target) needs a causal mask, so that no position can get information
        # from the future words the model is going to predict.
        tgt_key_mask = tgt == self.tgt_pad_idx

        # We provide a FloatTensor attn_mask. It is added to the attention
        # scores: 0.0 keeps a position, -inf removes it after the softmax.
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        attn_mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0)).to(tgt.device)
        return attn_mask, tgt_key_mask.T

    def init_weights(self):
        initrange = 0.1
        self.encoder_en.weight.data.uniform_(-initrange, initrange)
        self.encoder_de.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src, tgt):
        # src
        src_key_mask = self._generate_src_key_mask(src)
        # a learned embedding maps the stoi indices into feature space: s*b --> s*b*e
        src = self.encoder_en(src) * math.sqrt(self.ninp)
        src = self.pos_encoder(src)  # add the positional encoding to the embeddings

        # tgt
        tgt_mask, tgt_key_mask = self._generate_tgt_mask(tgt, tgt.size(0))
        tgt = self.encoder_de(tgt) * math.sqrt(self.ninp)
        tgt = self.pos_encoder(tgt)

        output = self.transformer(src, tgt, tgt_mask=tgt_mask, 
                                  src_key_padding_mask = src_key_mask, 
                                  tgt_key_padding_mask = tgt_key_mask)
        output = self.decoder(output)
        return output

class PositionalEncoding(nn.Module):
    # The positional encoding as described in the paper 
    # https://arxiv.org/pdf/1706.03762.pdf
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

# Here we initialize our model
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
print(INPUT_DIM, OUTPUT_DIM)

HID_DIM = 256
N_LAYERS = 3
N_HEADS = 8
N_PF_DIM = 512
DROPOUT = 0.1

SRC_PAD_IDX = SRC.vocab.stoi[SRC.pad_token]
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

model = TransformerModel(ntoken_in=INPUT_DIM, ntoken_out=OUTPUT_DIM, ninp=HID_DIM,
                         nhead=N_HEADS, npf_dim=N_PF_DIM, nlayers=N_LAYERS,
                         src_pad_idx=SRC_PAD_IDX, trg_pad_idx=TRG_PAD_IDX, dropout=DROPOUT).to(device)

def count_parameters(model: nn.Module):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

def initialize_weights(m):
    if hasattr(m, 'weight') and m.weight.dim() > 1:
        nn.init.xavier_uniform_(m.weight.data)

model.apply(initialize_weights)

7855 5893
The model has 8,988,677 trainable parameters


TransformerModel(
  (pos_encoder): PositionalEncoding(
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0): TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=512, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=512, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
        (1): TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): _LinearWithBias(in_features=256, out_features=256, bias=True

In [None]:
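if device.type == 'cuda':
    # Let cuDNN auto-tune its kernels for the input shapes it sees.
    torch.backends.cudnn.benchmark = True
    # nn.DataParallel is not applied here: it splits inputs along dim 0,
    # which is the sequence dimension in this notebook, and the inference
    # code in section 4 calls methods on the unwrapped model.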
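
The two kinds of mask built by `_generate_src_key_mask` and `_generate_tgt_mask` can be illustrated without tensors. The sketch below uses nested Python lists and hypothetical helper names (`causal_mask`, `key_padding_mask`); it assumes batch-first token ids and a padding index of 1, and only mirrors the shape and values of the masks, not the PyTorch code itself.

```python
NEG_INF = float("-inf")

def causal_mask(size):
    """Additive mask: 0.0 where position i may attend to j (j <= i), -inf otherwise."""
    return [[0.0 if j <= i else NEG_INF for j in range(size)]
            for i in range(size)]

def key_padding_mask(batch, pad_idx):
    """Boolean mask per sequence: True marks padding positions to ignore."""
    return [[tok == pad_idx for tok in seq] for seq in batch]

mask = causal_mask(4)
for row in mask:
    print(row)

# Two sequences of token ids (batch-first); the second ends with one pad token.
pad = key_padding_mask([[2, 7, 9, 3], [2, 5, 3, 1]], pad_idx=1)
print(pad)
```

Row `i` of the causal mask is added to the attention scores of target position `i`, so every later position receives `-inf` and contributes nothing after the softmax.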

## 3. Training the Model

We define our optimizer, which we use to update our parameters in the training loop. Here, we'll use Adam.

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.00085)

Next, we define our loss function. 

Note: when scoring the performance of a language translation model in
particular, we have to tell the ``nn.CrossEntropyLoss`` function to
ignore the indices where the target is simply padding.
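
What ignoring padding means for the loss can be shown with a small plain-Python sketch. The function `masked_cross_entropy` is a hypothetical stand-in written for illustration, not the ``nn.CrossEntropyLoss`` implementation; the toy scores and the padding index of 1 are assumptions.

```python
import math

def masked_cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding position: contributes nothing to the loss
        # Numerically stable log-sum-exp of the scores.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
        count += 1
    # This sketch returns 0.0 when every position is ignored.
    return total / count if count else 0.0

PAD = 1
logits = [[2.0, 0.5, 0.1],   # position 0, target class 0
          [0.2, 0.3, 1.5],   # position 1, target class 2
          [9.0, 0.0, 0.0]]   # position 2 is padding
targets = [0, 2, PAD]

with_pad = masked_cross_entropy(logits, targets, ignore_index=PAD)
without_pad = masked_cross_entropy(logits[:2], targets[:2], ignore_index=PAD)
print(with_pad, without_pad)
```

The two printed values agree, because the padded position is skipped both in the sum and in the count used for the mean.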



In [None]:
PAD_IDX = TRG.vocab.stoi['<pad>']

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

Finally, we can train and evaluate this model:


In [None]:
def train(model: nn.Module,
          iterator: BucketIterator,
          optimizer: optim.Optimizer,
          criterion: nn.Module,
          clip: float):

    model.train()

    epoch_loss = 0

    for _, batch in enumerate(iterator):

        src = batch.src
        trg = batch.trg
        
        #drop the last position of the target so that the decoder input at step t
        #is the token that precedes the one it must predict (teacher forcing).
        trg_temp = trg[:-1]

        optimizer.zero_grad()

        output = model(src, trg_temp)

        output = output.view(-1, output.shape[-1])
        trg = trg[1:].view(-1)

        loss = criterion(output, trg)

        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)


def evaluate(model: nn.Module,
             iterator: BucketIterator,
             criterion: nn.Module):

    #Evaluation mode
    model.eval()

    epoch_loss = 0

    with torch.no_grad():

        for _, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg
            trg_temp = trg[:-1]
            
            output = model(src, trg_temp)

            output = output.view(-1, output.shape[-1])
            trg = trg[1:].view(-1)

            loss = criterion(output, trg)
            epoch_loss += loss.item()

    return epoch_loss / len(iterator)


def epoch_time(start_time: float,
               end_time: float):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs


N_EPOCHS = 20
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

Epoch: 01 | Time: 0m 12s
	Train Loss: 6.112 | Train PPL: 451.392
	 Val. Loss: 5.000 |  Val. PPL: 148.352
Epoch: 02 | Time: 0m 12s
	Train Loss: 4.632 | Train PPL: 102.767
	 Val. Loss: 4.030 |  Val. PPL:  56.288
Epoch: 03 | Time: 0m 11s
	Train Loss: 3.882 | Train PPL:  48.529
	 Val. Loss: 3.607 |  Val. PPL:  36.838
Epoch: 04 | Time: 0m 11s
	Train Loss: 3.550 | Train PPL:  34.812
	 Val. Loss: 3.383 |  Val. PPL:  29.468
Epoch: 05 | Time: 0m 11s
	Train Loss: 3.284 | Train PPL:  26.671
	 Val. Loss: 3.120 |  Val. PPL:  22.655
Epoch: 06 | Time: 0m 11s
	Train Loss: 2.993 | Train PPL:  19.935
	 Val. Loss: 2.813 |  Val. PPL:  16.657
Epoch: 07 | Time: 0m 12s
	Train Loss: 2.619 | Train PPL:  13.716
	 Val. Loss: 2.464 |  Val. PPL:  11.749
Epoch: 08 | Time: 0m 11s
	Train Loss: 2.254 | Train PPL:   9.526
	 Val. Loss: 2.161 |  Val. PPL:   8.676
Epoch: 09 | Time: 0m 11s
	Train Loss: 1.954 | Train PPL:   7.058
	 Val. Loss: 1.973 |  Val. PPL:   7.193
Epoch: 10 | Time: 0m 11s
	Train Loss: 1.725 | Train PPL
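
Two conventions used by the loop above are easy to check in isolation: the one-position shift between decoder input (`trg[:-1]`) and prediction target (`trg[1:]`), and the perplexity reported as `math.exp(loss)`. The following is a minimal sketch on a single unpadded toy sentence; the helper names `teacher_forcing_pair` and `perplexity` are hypothetical.

```python
import math

def teacher_forcing_pair(trg_tokens):
    """Split one target sequence into decoder input and prediction target."""
    decoder_input = trg_tokens[:-1]   # everything except the last position
    decoder_target = trg_tokens[1:]   # everything except the first position
    return decoder_input, decoder_target

def perplexity(mean_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_loss)

sentence = ["<sos>", "a", "dog", "runs", "<eos>"]
decoder_input, decoder_target = teacher_forcing_pair(sentence)
for seen, predicted in zip(decoder_input, decoder_target):
    print(f"input {seen!r:8} -> target {predicted!r}")

print(f"PPL at loss 0.0: {perplexity(0.0):.1f}")
```

A perplexity of 1 corresponds to a model that assigns probability 1 to every correct token, so lower values are better.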

## 4. Inference and BLEU Score

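We now test the network. `translate_sentence` decodes greedily: starting from `<sos>`, it repeatedly feeds the tokens generated so far to the decoder and appends the most probable next token, until it produces `<eos>` or reaches `max_len`.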

In [None]:
import spacy

def translate_sentence(sentence, src_field, trg_field, model, device, max_len = 50):

    #Evaluation mode
    model.eval()
        
    if isinstance(sentence, str):
        nlp = spacy.load('de')
        tokens = [token.text.lower() for token in nlp(sentence)]
    else:
        tokens = [token.lower() for token in sentence]

    tokens = [src_field.init_token] + tokens + [src_field.eos_token]

    src_indexes = [src_field.vocab.stoi[token] for token in tokens]
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(1).to(device)

    src_key_mask = model._generate_src_key_mask(src_tensor)
    src_tensor = model.encoder_en(src_tensor) * math.sqrt(model.ninp)
    src_tensor = model.pos_encoder(src_tensor)

    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]


    for i in range(max_len):

        trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(1).to(device)

        with torch.no_grad():
            tgt_mask, tgt_key_mask = model._generate_tgt_mask(trg_tensor, trg_tensor.size(0))
            trg_tensor = model.encoder_de(trg_tensor) * math.sqrt(model.ninp)
            trg_tensor = model.pos_encoder(trg_tensor)

            output = model.transformer(src_tensor, trg_tensor, tgt_mask=tgt_mask,
                                       src_key_padding_mask = src_key_mask,
                                       tgt_key_padding_mask = tgt_key_mask)
            #project to vocabulary scores; no gradients are needed at inference time.
            output = model.decoder(output)

        output = output[-1].view(-1, output.shape[-1])
        #pick the index with the highest probability as output.
        pred_token = output.argmax(1).item()
        trg_indexes.append(pred_token)

        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break
    
    trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]
    
    return trg_tokens[1:]
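
The control flow of `translate_sentence` is a greedy decoding loop. Stripped of the model, it can be sketched as follows; the lookup table `TABLE` and the function `toy_scores` are hypothetical stand-ins for the Transformer, which in the real function scores the next token given the whole prefix and the source sentence.

```python
def greedy_decode(next_token_scores, sos, eos, max_len=50):
    """Repeatedly append the highest-scoring next token until eos or max_len."""
    tokens = [sos]
    for _ in range(max_len):
        scores = next_token_scores(tokens)   # dict: candidate token -> score
        best = max(scores, key=scores.get)   # greedy choice (argmax)
        tokens.append(best)
        if best == eos:
            break
    return tokens[1:]                        # drop the leading sos token

# Toy "model": the score of the next token depends only on the last token.
TABLE = {
    "<sos>": {"a": 0.9, "the": 0.1},
    "a": {"dog": 0.6, "cat": 0.4},
    "dog": {"runs": 0.7, "<eos>": 0.3},
    "runs": {"<eos>": 0.8, "a": 0.2},
}

def toy_scores(prefix):
    return TABLE[prefix[-1]]

print(greedy_decode(toy_scores, "<sos>", "<eos>"))
```

Greedy decoding commits to the single best token at every step, so it can miss a better overall sentence; it is used here because it is simple and fast.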

Now, we'll grab some translations from our training set and see how well our model did. 

In [None]:
example_idx = 18

src = vars(train_data.examples[example_idx])['src']
trg = vars(train_data.examples[example_idx])['trg']

print(f'src = {src}')
print(f'trg = {trg}')

translation = translate_sentence(src, SRC, TRG, model, device)

print(f'predicted trg = {translation}')

src = ['fünf', 'personen', 'sitzen', 'mit', 'instrumenten', 'im', 'kreis', '.']
trg = ['five', 'people', 'are', 'sitting', 'in', 'a', 'circle', 'with', 'instruments', '.']
predicted trg = ['five', 'people', 'are', 'sitting', 'in', 'a', 'circle', 'with', 'instruments', 'in', 'a', 'circle', '.', '<eos>']


Next, let's get an example from the test set.

In [None]:
example_idx = 10

src = vars(test_data.examples[example_idx])['src']
trg = vars(test_data.examples[example_idx])['trg']

print(f'src = {src}')
print(f'trg = {trg}')

translation = translate_sentence(src, SRC, TRG, model, device)

print(f'predicted trg = {translation}')

src = ['eine', 'mutter', 'und', 'ihr', 'kleiner', 'sohn', 'genießen', 'einen', 'schönen', 'tag', 'im', 'freien', '.']
trg = ['a', 'mother', 'and', 'her', 'young', 'song', 'enjoying', 'a', 'beautiful', 'day', 'outside', '.']
predicted trg = ['a', 'mother', 'and', 'her', 'son', 'enjoying', 'a', 'nice', 'day', 'outside', 'in', 'the', 'beautiful', 'day', '.', '<eos>']


We define a `calculate_bleu` function, which calculates the BLEU score over a provided TorchText dataset. For each source sentence it collects the reference translation and the model's predicted translation (with the trailing `<eos>` removed), and then computes the BLEU score over the whole corpus.

In [None]:
from torchtext.data.metrics import bleu_score

def calculate_bleu(data, src_field, trg_field, model, device, max_len = 50):
    
    #Evaluation mode
    model.eval()
    
    trgs = []
    pred_trgs = []
    
    for datum in data:
        
        src = vars(datum)['src']
        trg = vars(datum)['trg']
        
        pred_trg = translate_sentence(src, src_field, trg_field, model, device, max_len)
        
        #cut off <eos> token
        pred_trg = pred_trg[:-1]
        
        pred_trgs.append(pred_trg)
        trgs.append([trg])
        
    return bleu_score(pred_trgs, trgs)

#Store the result under a new name so that the imported bleu_score function is not overwritten.
test_bleu = calculate_bleu(test_data, SRC, TRG, model, device)

#Target score is 35
print(f'BLEU score = {test_bleu*100:.2f}')

BLEU score = 35.21
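
For intuition about what the score measures, here is a simplified plain-Python sketch of corpus-level BLEU: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is not the ``torchtext`` implementation; the function `corpus_bleu` is hypothetical, and it assumes exactly one reference per candidate and uniform n-gram weights.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with uniform weights and one reference per candidate."""
    matches = [0] * max_n
    totals = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        cand_len += len(cand)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            cand_counts = ngrams(cand, n)
            ref_counts = ngrams(ref, n)
            # Clipped counts: a candidate n-gram is credited at most as
            # often as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            totals[n - 1] += sum(cand_counts.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize candidates that are shorter than their references.
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * math.exp(log_precision)

reference = "five people are sitting in a circle with instruments .".split()
prediction = "five people are sitting in a circle with instruments in a circle .".split()

exact = corpus_bleu([reference], [reference])
partial = corpus_bleu([prediction], [reference])
print(exact, partial)
```

The example reuses the training-set translation shown earlier: repeating "in a circle" adds n-grams that the reference does not contain, which lowers the precisions and therefore the score.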
