## Introduction

In this notebook we will be adding a few improvements - packed padded sequences and masking - to the model from the previous notebook. Packed padded sequences are used to tell our RNN to skip over padding tokens in our encoder. Masking explicitly forces the model to ignore certain values, such as attention over padded elements. Both of these techniques are commonly used in NLP. 

We will also look at how to use our model for inference: we give it a sentence, see how it translates it, and see exactly where it pays attention when translating each word.

Finally, we'll use the BLEU metric to measure the quality of our translations.

# Task III CopyNet Mechanism

## Preparing Data

First, we'll import all the modules as before, with the addition of the `matplotlib` modules used for viewing the attention.

In [1]:
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse
from google.colab import auth
auth.authenticate_user()
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}

E: Package 'python-software-properties' has no installation candidate
Selecting previously unselected package google-drive-ocamlfuse.
(Reading database ... 160690 files and directories currently installed.)
Preparing to unpack .../google-drive-ocamlfuse_0.7.26-0ubuntu1~ubuntu18.04.1_amd64.deb ...
Unpacking google-drive-ocamlfuse (0.7.26-0ubuntu1~ubuntu18.04.1) ...
Setting up google-drive-ocamlfuse (0.7.26-0ubuntu1~ubuntu18.04.1) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...
Please, open the following URL in a web browser: https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive&response_type=code&access_type=offline&approval_prompt=force
··········
Please, open the following URL in a web browser: https://accounts.google.com/o/oauth2/auth?client_id=32555940559.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope

In [2]:
!mkdir -p drive 
!google-drive-ocamlfuse drive
import os
os.chdir("drive/assignment1") 
!ls

 assets    README.md			'Task III.ipynb'   tut4-model-copy.pt
 data	   starter_code_copy_net.ipynb	 test.json	   tut4-model.pt
 LICENSE   starter_code_new.ipynb	 train.json	   valid.json


In [3]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

from torchtext.legacy.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIterator, TabularDataset

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import spacy
import numpy as np

import random
import math
import time

import spacy.cli
spacy.cli.download("de_core_news_sm")
spacy.cli.download("en_core_web_sm")

✔ Download and installation successful
You can now load the model via spacy.load('de_core_news_sm')
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


Next, we'll set the random seed for reproducibility.

In [4]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

As before, we'll load the spaCy models and define the German and English tokenizers.

In [5]:
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

In [6]:
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings
    """
    return [tok.text for tok in spacy_de.tokenizer(text)]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

When using packed padded sequences, we need to tell PyTorch how long the actual (non-padded) sequences are. Luckily for us, TorchText's `Field` objects accept an `include_lengths` argument, which causes `batch.src` to be a tuple. The first element of the tuple is the same as before, a batch of numericalized source sentences as a tensor, and the second element holds the non-padded length of each source sentence within the batch.
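
To make the shape of that tuple concrete, here is a minimal plain-Python sketch of the two pieces of information involved: a padded batch and the original lengths. It does not use TorchText (which returns tensors rather than lists); the helper `pad_with_lengths` and the toy index values are hypothetical and only for illustration.

```python
def pad_with_lengths(sentences, pad_idx):
    """Pad each index list to the longest one and record the true lengths."""
    lengths = [len(s) for s in sentences]
    max_len = max(lengths)
    padded = [s + [pad_idx] * (max_len - len(s)) for s in sentences]
    return padded, lengths

# toy numericalized sentences; 1 stands in for the <pad> index (an assumption)
batch = [[2, 15, 8, 3], [2, 9, 3]]
padded, lengths = pad_with_lengths(batch, pad_idx=1)
print(padded)   # [[2, 15, 8, 3], [2, 9, 3, 1]]
print(lengths)  # [4, 3]
```

The lengths are what lets the RNN stop at the last real token of each sentence instead of running over the padding.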

In [7]:
SRC = Field(tokenize = tokenize_de, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True, 
            include_lengths = True)

TRG = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

We then load the data.

In [8]:
"""
train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'), 
                                                    fields = (SRC, TRG))
"""
fields = {'src': ('src', SRC), 'trg': ('trg', TRG)}
train_data, valid_data, test_data = TabularDataset.splits(
    path = '.',
    train = 'train.json',
    validation = 'valid.json',
    test = 'test.json',
    format = 'json',
    fields = fields
)

And build the vocabulary.

In [9]:
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

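Then, we need to change the gold target so that it can refer to copied words. When a target word is out of vocabulary (index 0, `<unk>`) but appears in the source sentence, we replace its index with `len(TRG.vocab)` plus the word's position in the source, offset by 1 for the `<sos>` token. `original_trg_idx` keeps the unmodified indices, which are what the decoder receives as input.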

In [10]:
def change_gold_sent(src_field, trg_field, src_sent, trg_sent):
    src_idx = [src_field.vocab.stoi[word] for word in ['<sos>'] + src_sent + ['<eos>']]
    trg_idx = []
    original_trg_idx = []
    for word in ['<sos>'] + trg_sent + ['<eos>']:
        idx = trg_field.vocab.stoi[word]
        original_trg_idx.append(idx)
        if idx != 0:
            trg_idx.append(idx)
        else:
            if word in src_sent:
                new_idx = src_sent.index(word) + 1
                trg_idx.append(len(trg_field.vocab) + new_idx)
            else:
                trg_idx.append(idx)
    return src_idx, trg_idx, original_trg_idx

def change_gold_data(src_field, trg_field, data):
    src_data, trg_data, original_trg_data = [], [], []
    for sent in data:
        src_sent = sent.src
        trg_sent = sent.trg
        src_idx, trg_idx, original_idx = change_gold_sent(src_field, trg_field, src_sent, trg_sent)
        src_data.append(src_idx)
        trg_data.append(trg_idx)
        original_trg_data.append(original_idx)
    return src_data, trg_data, original_trg_data
    
train_src, train_trg, train_original_trg = change_gold_data(SRC, TRG, train_data)
valid_src, valid_trg, valid_original_trg = change_gold_data(SRC, TRG, valid_data)
test_src, test_trg, test_original_trg = change_gold_data(SRC, TRG, test_data)
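
As a small illustration of the remapping, the following sketch applies the same rule to a toy vocabulary. The names `toy_stoi` and `remap_target` are hypothetical, the `<sos>`/`<eos>` wrapping is left out for brevity, and index 0 is assumed to be `<unk>`, matching the `idx != 0` check above.

```python
def remap_target(trg_stoi, src_sent, trg_sent, unk_idx=0):
    """Replace OOV target words that occur in the source by extended indices."""
    vocab_size = len(trg_stoi)
    remapped = []
    for word in trg_sent:
        idx = trg_stoi.get(word, unk_idx)
        if idx == unk_idx and word in src_sent:
            # +1 because position 0 of the numericalized source is <sos>
            idx = vocab_size + src_sent.index(word) + 1
        remapped.append(idx)
    return remapped

# toy target vocabulary of size 6 (hypothetical)
toy_stoi = {'<unk>': 0, '<pad>': 1, '<sos>': 2, '<eos>': 3, 'a': 4, 'girl': 5}
src = ['ein', 'chevrolet']
trg = ['a', 'chevrolet', 'girl', 'zebra']
remapped = remap_target(toy_stoi, src, trg)
print(remapped)  # [4, 8, 5, 0]
```

Here `chevrolet` is unknown to the target vocabulary but sits at source position 1, so it becomes 6 + 1 + 1 = 8, while `zebra` appears nowhere in the source and stays `<unk>`.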
    

After changing the gold target, we need to batch the data.

In [11]:
BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def process_batch(src_field, trg_field, data, device):
    src_data = []
    src_lens = []
    trg_data = []
    trg_lens = []
    original_trg_data = []
    for src_sent, src_len, trg_sent, original_trg_sent in data:
        src_data.append(src_sent)
        src_lens.append(src_len)
        trg_data.append(trg_sent)
        trg_lens.append(len(trg_sent))
        original_trg_data.append(original_trg_sent)
    
    src_max_length = max(src_lens)
    trg_max_length = max(trg_lens)
    
    
    src_pad_idx = src_field.vocab.stoi[src_field.pad_token]
    trg_pad_idx = trg_field.vocab.stoi[trg_field.pad_token]
    new_src_data = [src_sent + [src_pad_idx] * (src_max_length - src_len) for src_sent, src_len in zip(src_data, src_lens)]
    new_trg_data = [trg_sent + [trg_pad_idx] * (trg_max_length - trg_len) for trg_sent, trg_len in zip(trg_data, trg_lens)]
    new_original_trg_data = [trg_sent + [trg_pad_idx] * (trg_max_length - trg_len) for trg_sent, trg_len in zip(original_trg_data, trg_lens)]
    return (torch.tensor(new_src_data, device=device).permute(1,0), torch.tensor(src_lens, device=device)), (torch.tensor(new_trg_data, device = device).permute(1,0), torch.tensor(new_original_trg_data, device = device).permute(1,0)) 
    

def generate_batch(src_field, trg_field, src_data, trg_data, original_trg_data, batch_size, device):
    data_iterator = []
    src_lens = [len(src_sent) for src_sent in src_data]
    all_data = [(src_sent, src_len, trg_sent, original_trg_sent) for src_sent, src_len, trg_sent, original_trg_sent in zip(src_data, src_lens, trg_data, original_trg_data)]
    all_data.sort(key=lambda x: -x[1])
    
    n = len(all_data)
    max_n = n 
    for i in range(0, max_n, batch_size):
        src, trg = process_batch(src_field, trg_field, all_data[i:i+batch_size], device)
        data_iterator.append([src, trg])
    
    return data_iterator

train_iterator = generate_batch(SRC, TRG, train_src, train_trg, train_original_trg, BATCH_SIZE, device)
valid_iterator = generate_batch(SRC, TRG, valid_src, valid_trg, valid_original_trg, BATCH_SIZE, device)
test_iterator = generate_batch(SRC, TRG, test_src, test_trg, test_original_trg, BATCH_SIZE, device)
random.shuffle(train_iterator)
valid_iterator.reverse()
test_iterator.reverse()
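
The batching above sorts all examples by descending source length before cutting them into batches. This keeps sentences of similar length together, and it also matches the default expectation of `pack_padded_sequence` that each batch is ordered from longest to shortest. The plain-Python sketch below, with hypothetical helpers `make_batches` and `padding_needed`, shows how much padding the sorting can save on toy data.

```python
def make_batches(sentences, batch_size):
    """Sort by descending length, then cut into consecutive batches."""
    ordered = sorted(sentences, key=len, reverse=True)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padding_needed(batch):
    """Number of pad tokens required to make the batch rectangular."""
    longest = max(len(s) for s in batch)
    return sum(longest - len(s) for s in batch)

# toy sentences of lengths 2, 9, 3, 8, 2, 9
data = [[1] * n for n in (2, 9, 3, 8, 2, 9)]

batches = make_batches(data, batch_size=2)
unsorted_batches = [data[i:i + 2] for i in range(0, len(data), 2)]

sorted_padding = sum(padding_needed(b) for b in batches)
unsorted_padding = sum(padding_needed(b) for b in unsorted_batches)
print([[len(s) for s in b] for b in batches])  # [[9, 9], [8, 3], [2, 2]]
print(sorted_padding, unsorted_padding)        # 5 19
```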


## Building the Model

### Encoder

The encoder is the same as before.

In [12]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)
        
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src, src_len):
        
        #src = [src len, batch size]
        #src_len = [batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
                
        #need to explicitly put lengths on cpu!
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, src_len.to('cpu'))
                
        packed_outputs, hidden = self.rnn(packed_embedded)
                                 
        #packed_outputs is a packed sequence containing all hidden states
        #hidden is now from the final non-padded element in the batch
            
        outputs, _ = nn.utils.rnn.pad_packed_sequence(packed_outputs) 
            
        #outputs is now a non-packed sequence, all hidden states obtained
        #  when the input is a pad token are all zeros
            
        #outputs = [src len, batch size, hid dim * num directions]
        #hidden = [n layers * num directions, batch size, hid dim]
        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards RNN 
        #hidden [-1, :, : ] is the last of the backwards RNN
        
        #initial decoder hidden is final hidden state of the forwards and backwards 
        #  encoder RNNs fed through a linear layer
        hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
        
        #outputs = [src len, batch size, enc hid dim * 2]
        #hidden = [batch size, dec hid dim]
        
        return outputs, hidden

### Attention

The attention is the same as before.

In [13]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs, mask):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
  
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        
        #energy = [batch size, src len, dec hid dim]

        attention = self.v(energy).squeeze(2)
        
        #attention = [batch size, src len]
        
        attention = attention.masked_fill(mask == 0, -1e10)
        
        return F.softmax(attention, dim = 1)
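
The effect of `masked_fill(mask == 0, -1e10)` followed by a softmax can be checked by hand. The sketch below reimplements that step in plain Python for a single sentence; `masked_softmax` is a hypothetical helper, not part of the model.

```python
import math

def masked_softmax(scores, mask, fill=-1e10):
    """Softmax over scores after replacing masked-out positions by a large negative value."""
    filled = [s if keep else fill for s, keep in zip(scores, mask)]
    top = max(filled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.5, 1.5, 0.2, 0.9]
mask = [1, 1, 0, 0]  # the last two positions are <pad>
weights = masked_softmax(scores, mask)
print(weights)
```

The padded positions end up with zero weight, so the attention is distributed only over the real source tokens.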

### Decoder

We add a copy output layer to the decoder, which now computes two sets of scores: a generate score over the target vocabulary and a copy score over the source positions.

In [14]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
        super().__init__()

        self.output_dim = output_dim
        self.attention = attention
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.GRU((enc_hid_dim * 4) + emb_dim, dec_hid_dim)
        
        #self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
        self.generate_output = nn.Linear(((enc_hid_dim * 4) + dec_hid_dim + emb_dim), output_dim)
        
        self.copy_output = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        
        
        
    def forward(self, input, hidden, encoder_outputs, mask, selective_weight):
             
        #input = [batch size]
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        #mask = [batch size, src len]
        #selective_weight = [batch size, src len]
        
        #copy_score = [batch size, src len]
        copy_projection = self.copy_output(encoder_outputs.permute(1,0,2))
        copy_score = torch.tanh(copy_projection).bmm(hidden.unsqueeze(-1)).squeeze(-1) 
        copy_score = copy_score.masked_fill(mask==0, -1e10)
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
        
        a = self.attention(hidden, encoder_outputs, mask)
                
        #a = [batch size, src len]
        
        a = a.unsqueeze(1)
        
        #a = [batch size, 1, src len]
        
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        weighted = torch.bmm(a, encoder_outputs)
        
        #weighted = [batch size, 1, enc hid dim * 2]
        
        weighted = weighted.permute(1, 0, 2)
        
        #weighted = [1, batch size, enc hid dim * 2]
        
        
        selective_weighted = torch.bmm(selective_weight.unsqueeze(1), encoder_outputs)
        selective_weighted = selective_weighted.permute(1, 0, 2)
        
        rnn_input = torch.cat((embedded, weighted, selective_weighted), dim = 2)
        
        #rnn_input = [1, batch size, (enc hid dim * 4) + emb dim]
            
        output, hidden = self.rnn(rnn_input, hidden.unsqueeze(0))
        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden = [n layers * n directions, batch size, dec hid dim]
        
        #seq len, n layers and n directions will always be 1 in this decoder, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        #this also means that output == hidden
        assert (output == hidden).all()
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        selective_weighted = selective_weighted.squeeze(0)
        
        generate_score = self.generate_output(torch.cat((output, weighted, selective_weighted, embedded), dim = 1))
        
        #prediction = torch.cat((generate_score, copy_score), dim = -1)
        
        
        selective_weight = F.softmax(copy_score, dim=-1)
        #prediction = [batch size, output dim + src len]
        
        return generate_score, copy_score, hidden.squeeze(0), a.squeeze(1), selective_weight
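
The copy score in the decoder is a dot product between the decoder hidden state and a `tanh`-squashed projection of each encoder state. A minimal plain-Python sketch of that computation for one sentence follows. It is a simplification under stated assumptions: the bias of the linear layer is omitted, and the function `copy_scores` and all toy values are hypothetical.

```python
import math

def copy_scores(encoder_states, W, decoder_state):
    """score_j = tanh(W @ h_j) . s for every source position j (bias omitted)."""
    scores = []
    for h in encoder_states:
        projected = [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]
        scores.append(sum(p * s for p, s in zip(projected, decoder_state)))
    return scores

# toy values: 3 source positions, encoder size 2, decoder size 2
encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, for readability only
decoder_state = [1.0, 0.0]

scores = copy_scores(encoder_states, W, decoder_state)
best = scores.index(max(scores))
print(best)  # 0
```

The source position whose projected state aligns best with the decoder state gets the highest copy score; a softmax over these scores gives the selective weights passed to the next step.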

### Seq2Seq

We modify the `Seq2Seq` model so that it returns the copy scores alongside the generate scores.

In [15]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, src_pad_idx, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad_idx = src_pad_idx
        self.device = device
        
    def create_mask(self, src):
        mask = (src != self.src_pad_idx).permute(1, 0)
        return mask
        
    def forward(self, src, src_len, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #src_len = [batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
                    
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        copy_outputs = torch.zeros(trg_len, batch_size, src.shape[0]).to(self.device)
        
        #encoder_outputs is all hidden states of the input sequence, back and forwards
        #hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src, src_len)
                
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        mask = self.create_mask(src)

        #mask = [batch size, src len]
        
        selective_weight = torch.zeros(mask.shape).to(self.device)
        selective_weight = selective_weight.masked_fill(mask==0, -1e10)
        selective_weight = F.softmax(selective_weight, dim=-1)
                
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden state, all encoder hidden states 
            #  and mask
            #receive output tensor (predictions) and new hidden state
            output, copy_ret, hidden, _, selective_weight = self.decoder(input, hidden, encoder_outputs, mask, selective_weight)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #copy_top1 = copy_ret.argmax(1).unsqueeze(-1)
            #copy_outputs[t] = torch.gather(src.permute(1,0), 1, copy_top1).squeeze(-1)
            copy_outputs[t] = copy_ret
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
            
        return outputs, copy_outputs
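
The teacher forcing decision in the loop above is a single random draw per time-step. The following sketch isolates that choice; `choose_next_input` is a hypothetical helper and the string tokens are placeholders.

```python
import random

def choose_next_input(gold_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Feed the gold token with probability teacher_forcing_ratio, else the prediction."""
    teacher_force = rng.random() < teacher_forcing_ratio
    return gold_token if teacher_force else predicted_token

rng = random.Random(1234)
picks = [choose_next_input('gold', 'pred', 0.5, rng) for _ in range(1000)]
fraction_gold = picks.count('gold') / len(picks)
print(fraction_gold)  # close to 0.5
```

A ratio of 0 always feeds the model its own prediction, which is how the evaluation loop turns teacher forcing off.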

## Training the Seq2Seq Model

Next up, initializing the model and placing it on the GPU.

In [16]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
SRC_PAD_IDX = SRC.vocab.stoi[SRC.pad_token]

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

model = Seq2Seq(enc, dec, SRC_PAD_IDX, device).to(device)

Then, we initialize the model parameters.

In [17]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7853, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (embedding): Embedding(5893, 256)
    (rnn): GRU(2304, 512)
    (dropout): Dropout(p=0.5, inplace=False)
    (generate_output): Linear(in_features=2816, out_features=5893, bias=True)
    (copy_output): Linear(in_features=1024, out_features=512, bias=True)
  )
)

We'll print out the number of trainable parameters in the model. The count is higher than for the model without the copy mechanism, because the decoder adds the `copy_output` layer and feeds the selective read into both its GRU and its output layer.

In [18]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 28,650,501 trainable parameters


Then we define our optimizer and criterion. 

The `ignore_index` for the criterion needs to be the index of the pad token for the target language, not the source language.

In [19]:
optimizer = optim.Adam(model.parameters())

In [20]:
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

criterion = nn.NLLLoss(ignore_index = TRG_PAD_IDX)

Next, we'll define our training and evaluation loops.

Each batch from our custom iterator is a pair of tuples, `((src, src_len), (target_trg, trg))`: the numericalized source tensor with the length of each sentence, followed by the copy-aware gold target and the original target.

Our model returns two tensors for each decoding time-step: the generate scores over the target vocabulary and the copy scores over the source positions. We apply a log-softmax to each, concatenate them, and score the result against the copy-aware target with `NLLLoss`.
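
The loss used in the training loop below can be sketched in plain Python for a single time-step. The helpers `log_softmax` and `copy_aware_nll` are hypothetical, and the pad index of 1 is an assumption. As in the notebook, the two log-softmaxes are taken separately and then concatenated, so a target index at or beyond the vocabulary size selects a copy log-probability.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    top = max(scores)
    log_total = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_total for s in scores]

def copy_aware_nll(generate_scores, copy_scores, target, pad_idx=1):
    """Negative log-likelihood of one target index over [vocab ; source positions]."""
    if target == pad_idx:
        return None  # padded positions are skipped, like ignore_index
    all_log_probs = log_softmax(generate_scores) + log_softmax(copy_scores)
    return -all_log_probs[target]

generate_scores = [0.1, 0.0, 2.0, 0.3]  # toy vocabulary of size 4
copy_scores = [0.0, 3.0, 0.0]           # three source positions
# target 5 = vocab size 4 + source position 1, i.e. a copied word
loss = copy_aware_nll(generate_scores, copy_scores, target=5)
print(loss)
```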

In [21]:
def train(model, iterator, optimizer, criterion, clip):
    random.shuffle(iterator)
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        (src, src_len), (target_trg, trg) = batch
        #src, src_len = batch.src
        #trg = batch.trg
        
        optimizer.zero_grad()
        
        output, copy_output = model(src, src_len, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        #copy_output[trg len, batch size, src_len]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        output = F.log_softmax(output, dim=-1)
        
        target_trg = target_trg[1:].contiguous().view(-1)
        #trg = trg[1:].contiguous().view(-1)
        
        src_shape = copy_output.shape[-1]
        copy_output = copy_output[1:].view(-1, src_shape)
        copy_output = F.log_softmax(copy_output, dim=-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        #copy_output = [(trg len - 1) * batch size, src_len]
        
        all_output = torch.cat((output, copy_output), dim=-1)
        loss = criterion(all_output, target_trg)
        #loss = criterion(all_output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [22]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            (src, src_len), (target_trg, trg) = batch
            #src, src_len = batch.src
            #trg = batch.trg

            output, copy_output = model(src, src_len, trg, 0) #turn off teacher forcing
            
            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            output = F.log_softmax(output, dim=-1)
            
            target_trg = target_trg[1:].contiguous().view(-1)
            #trg = trg[1:].contiguous().view(-1)
            
            src_shape = copy_output.shape[-1]
            copy_output = copy_output[1:].view(-1, src_shape)
            copy_output = F.log_softmax(copy_output, dim=-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]
            
            all_output = torch.cat((output, copy_output), dim=-1)
            loss = criterion(all_output, target_trg)
            #loss = criterion(all_output, trg)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

Then, we'll define a useful function for timing how long epochs take.

In [23]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

The penultimate step is to train our model. Notice how it takes almost half the time of the model without the improvements added in this notebook.

In [None]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut4-model-copy.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Train Loss now is:  5.358170032501221
Train Loss now is:  4.814002990722656
Train Loss now is:  5.306338787078857
Train Loss now is:  4.540647983551025
Epoch: 01 | Time: 1m 8s
	Train Loss: 5.060 | Train PPL: 157.527
	 Val. Loss: 4.853 |  Val. PPL: 128.156
Train Loss now is:  4.185345649719238
Train Loss now is:  3.770214796066284
Train Loss now is:  3.5870072841644287
Train Loss now is:  3.692592144012451
Epoch: 02 | Time: 1m 8s
	Train Loss: 3.990 | Train PPL:  54.077
	 Val. Loss: 3.961 |  Val. PPL:  52.501
Train Loss now is:  3.3772716522216797
Train Loss now is:  2.7973074913024902
Train Loss now is:  3.510956048965454
Train Loss now is:  2.7950360774993896
Epoch: 03 | Time: 1m 8s
	Train Loss: 3.264 | Train PPL:  26.167
	 Val. Loss: 3.582 |  Val. PPL:  35.958
Train Loss now is:  2.6166372299194336
Train Loss now is:  3.5918972492218018
Train Loss now is:  2.606139659881592
Train Loss now is:  2.388012647628784
Epoch: 04 | Time: 1m 8s
	Train Loss: 2.787 | Train PPL:  16.231
	 Val. Los

Finally, we load the parameters from our best validation loss and get our results on the test set.

We get an improved test perplexity while being almost twice as fast!

In [24]:
model.load_state_dict(torch.load('tut4-model-copy.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 3.230 | Test PPL:  25.270 |
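
The perplexity reported next to each loss is the exponential of the mean per-token loss. A minimal sketch, with a hypothetical helper `perplexity`:

```python
import math

def perplexity(mean_nll):
    """Perplexity is the exponential of the mean per-token negative log-likelihood."""
    return math.exp(mean_nll)

ppl = perplexity(3.230)
print(f'{ppl:.2f}')
```

Exponentiating the rounded loss can differ slightly from the printed perplexity, since the notebook exponentiates the unrounded value.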


## Inference

The `translate_sentence` function is changed to output copied words.

In [25]:
def translate_sentence(sentence, src_field, trg_field, model, device, max_len = 50):

    model.eval()
        
    if isinstance(sentence, str):
        nlp = spacy.load('de_core_news_sm')
        tokens = [token.text.lower() for token in nlp(sentence)]
    else:
        tokens = [token.lower() for token in sentence]

    tokens = [src_field.init_token] + tokens + [src_field.eos_token]
        
    src_indexes = [src_field.vocab.stoi[token] for token in tokens]
    
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(1).to(device)

    src_len = torch.LongTensor([len(src_indexes)])
    
    with torch.no_grad():
        encoder_outputs, hidden = model.encoder(src_tensor, src_len)

    mask = model.create_mask(src_tensor)
        
    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]
    trg_tokens = []

    attentions = torch.zeros(max_len, 1, len(src_indexes)).to(device)
    selective_weight = torch.zeros(mask.shape).to(device)
    selective_weight = selective_weight.masked_fill(mask==0, -1e10)
    selective_weight = F.softmax(selective_weight, dim=-1)
    
    for i in range(max_len):

        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)
                
        with torch.no_grad():
            output, copy_output, hidden, attention, selective_weight = model.decoder(trg_tensor, hidden, encoder_outputs, mask, selective_weight)

        attentions[i] = attention
            
        pred_token = output.argmax(1).item()
        copy_token = copy_output.argmax(1).item()
        
        trg_indexes.append(pred_token)
        if pred_token != 0:
            trg_tokens.append(trg_field.vocab.itos[pred_token])
        else:
            trg_tokens.append(tokens[copy_token])

        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break
            
    return trg_tokens, attentions[:len(trg_tokens)-1]
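
The per-step decision inside `translate_sentence` can be isolated as follows: take the best vocabulary word, and if it is `<unk>` (index 0), fall back to the source token with the highest copy score. The helper `pick_token` and the toy vocabulary are hypothetical.

```python
def pick_token(generate_scores, copy_scores, itos, src_tokens, unk_idx=0):
    """Emit the best vocabulary word, or copy a source token when that word is <unk>."""
    pred = max(range(len(generate_scores)), key=generate_scores.__getitem__)
    if pred != unk_idx:
        return itos[pred]
    copy_pos = max(range(len(copy_scores)), key=copy_scores.__getitem__)
    return src_tokens[copy_pos]

itos = ['<unk>', '<pad>', '<sos>', '<eos>', 'a', 'girl']
src_tokens = ['<sos>', 'ein', 'chevrolet', '<eos>']

generated = pick_token([0.1, 0, 0, 0, 2.0, 0.3], [0, 0, 5.0, 0], itos, src_tokens)
copied = pick_token([3.0, 0, 0, 0, 2.0, 0.3], [0, 0, 5.0, 0], itos, src_tokens)
print(generated)  # a
print(copied)     # chevrolet
```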

In [28]:
model.load_state_dict(torch.load('tut4-model-copy.pt'))
src = ['ein', 'chevrolet', ',', 'der', 'auf', 'einer', 'messe', 'ausgestellt', 'ist']
translation, attention = translate_sentence(src, SRC, TRG, model, device)
print(" ".join(src))
print(" ".join(translation))

ein chevrolet , der auf einer messe ausgestellt ist
a chevrolet patron at a convention convention . <eos>


In [30]:
src = ['ein', 'mädchen', 'in', 'einer', 'burka', 'lernt', 'in', 'einem', 'klassenraum', '<unk>', '.']
translation, attention = translate_sentence(src, SRC, TRG, model, device)
print(" ".join(src))
print(" ".join(translation))

ein mädchen in einer burka lernt in einem klassenraum <unk> .
a girl in a burka is learning in a classroom . <eos>
