# The NMT model based on PyTorch Seq2Seq Tutorial

## Introduction

In this notebook we will be adding a few improvements - packed padded sequences and masking - to the model from the previous notebook. Packed padded sequences are used to tell our RNN to skip over padding tokens in our encoder. Masking explicitly forces the model to ignore certain values, such as attention over padded elements. Both of these techniques are commonly used in NLP. 

We will also look at how to use our model for inference: we give it a sentence, see how it translates it, and see exactly where it pays attention when translating each word.

Finally, we'll use the BLEU metric to measure the quality of our translations.

## Preparing Data

First, we'll import all the modules as before, with the addition of the `matplotlib` modules used for viewing the attention.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

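# Field, BucketIterator and TabularDataset belong to the legacy TorchText API:
# they live under torchtext.legacy in torchtext 0.9-0.11 and were removed in 0.12.
from torchtext.datasets import TranslationDataset, Multi30k
from torchtext.data import Field, BucketIterator, TabularDataset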

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import numpy as np

import random
import math
import time
import pandas as pd

Next, we'll set the random seed for reproducibility.

In [2]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

When using packed padded sequences, we need to tell PyTorch how long the actual (non-padded) sequences are. Luckily for us, TorchText's `Field` objects accept an `include_lengths` argument, which causes our `batch.src` to be a tuple. The first element of the tuple is the same as before, a batch of numericalized source sentences as a tensor, and the second element is the non-padded length of each source sentence within the batch.
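
To make the layout of that tuple concrete, here is a minimal sketch that builds the same kind of pair by hand with plain PyTorch, without TorchText. The token indices and the padding index `PAD_IDX = 1` are assumptions made up for illustration; in the notebook the real values come from the vocabulary.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1  # assumed padding index, for illustration only

# three already-numericalized "sentences" of different lengths (made-up indices)
sentences = [
    torch.tensor([2, 5, 6, 7, 3]),
    torch.tensor([2, 8, 9, 3]),
    torch.tensor([2, 4, 3]),
]

# lengths of the sentences before padding
src_len = torch.tensor([len(s) for s in sentences])

# pad to the longest sentence; the result is [src len, batch size], as in this notebook
src = pad_sequence(sentences, padding_value=PAD_IDX)

# same layout as batch.src when include_lengths=True
batch_src = (src, src_len)

print(src.shape)   # torch.Size([5, 3])
print(src_len)     # tensor([5, 4, 3])
print(src[:, 2])   # tensor([2, 4, 3, 1, 1]), the shortest sentence ends in two pads
```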

#### Load the data

In [3]:
data_path = 'dict_data'  # news_data or dict_data

def tokenize(word): # create a tokenizer function
    word = word.replace('\n', '')
    return word.split(' ')

# <sos>: start of a sequence; <eos>: end of a sequence.
SRC = Field(tokenize=tokenize, 
            init_token='<sos>', 
            eos_token='<eos>', 
            lower=True,
            include_lengths = True)

TRG = Field(tokenize=tokenize, 
            init_token='<sos>', 
            eos_token='<eos>', 
            lower=False,
            include_lengths = True)

PINYIN_STR = Field(tokenize=tokenize, 
                   init_token='<sos>', 
                   eos_token='<eos>', 
                   lower=True,
                   include_lengths = True)

PINYIN_CHAR = Field(tokenize=tokenize, 
                    init_token='<sos>', 
                    eos_token='<eos>', 
                    lower=True,
                    include_lengths = True)

train_data = TabularDataset(
           path = data_path + "/train.tsv", 
           format='tsv',
           skip_header=True, 
           fields=([("src", SRC), ("trg", TRG), ("pinyin_str", PINYIN_STR), ("pinyin_char", PINYIN_CHAR)]))

valid_data = TabularDataset(
           path = data_path + "/valid.tsv", 
           format='tsv',
           skip_header=True, 
           fields=([("src", SRC), ("trg", TRG), ("pinyin_str", PINYIN_STR), ("pinyin_char", PINYIN_CHAR)]))

test_data = TabularDataset(
           path = data_path + "/test.tsv",   #/dev.tsv for news_data
           format='tsv',
           skip_header=True, 
           fields=([("src", SRC), ("trg", TRG), ("pinyin_str", PINYIN_STR), ("pinyin_char", PINYIN_CHAR)]))

In [4]:
SRC.build_vocab(train_data, min_freq=1)
TRG.build_vocab(train_data, min_freq=1)  # first hyperparameter, the min_freq for vocab
PINYIN_STR.build_vocab(train_data, min_freq=1)
PINYIN_CHAR.build_vocab(train_data, min_freq=1)

Next, we handle the iterators.

One quirk of packed padded sequences is that, by default, all elements in the batch need to be sorted by their non-padded lengths in descending order, i.e. the first sentence in the batch needs to be the longest. We use two arguments of the iterator to handle this: `sort_within_batch`, which tells the iterator that the contents of the batch need to be sorted, and `sort_key`, a function which tells the iterator how to sort the elements in the batch. Here, we sort by the length of the `src` sentence.
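
The sorting requirement can be illustrated without the iterator. The sketch below sorts a small hand-made batch by length and packs it; the tensors are toy values, not data from this notebook. It also shows `enforce_sorted=False`, an option of `pack_padded_sequence` that lets PyTorch do the sorting internally, as an alternative to what the iterator does here.

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# a padded batch [src len, batch size] whose columns are NOT sorted by length
src = torch.tensor([[2, 2, 2],
                    [4, 5, 8],
                    [3, 6, 9],
                    [1, 7, 3],
                    [1, 3, 1]])
src_len = torch.tensor([3, 5, 4])

# sort the lengths in descending order and reorder the batch columns to match
sorted_len, order = src_len.sort(descending=True)
sorted_src = src[:, order]

print(sorted_len)        # tensor([5, 4, 3])
print(sorted_src[:, 0])  # tensor([2, 5, 6, 7, 3]), the longest sentence comes first

# add a feature dimension so the input looks like an embedded batch
features = sorted_src.unsqueeze(-1).float()

# sorted input is accepted with the default enforce_sorted=True
packed = pack_padded_sequence(features, sorted_len)

# alternatively, enforce_sorted=False accepts the unsorted batch directly
packed_unsorted = pack_padded_sequence(src.unsqueeze(-1).float(), src_len,
                                       enforce_sorted=False)

print(packed.batch_sizes)  # tensor([3, 3, 3, 2, 1])
```

`batch_sizes` records how many sequences are still active at each time-step, which is how the RNN knows to skip the padded positions.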

In [5]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
     batch_size = BATCH_SIZE,
     sort_within_batch = True,
     sort_key = lambda x : len(x.src),
     device = device)

In [6]:
# evaluate the test data considering multiple answers (for news data)
import json
with open('news_data/dev.json', 'r') as f:
    dev_json = json.load(f)

## Building the Model

### Encoder

Next up, we define the encoder.

The changes here are all within the `forward` method. It now accepts the lengths of the source sentences as well as the sentences themselves.

After the source sentence (padded automatically within the iterator) has been embedded, we can use `pack_padded_sequence` on it together with the lengths of the sentences. `packed_embedded` is then our packed padded sequence. It can be fed to our RNN as normal, which returns `packed_outputs`, a packed tensor containing all of the hidden states from the sequence, and `hidden`, which is simply the final hidden state from our sequence. `hidden` is a standard tensor and not packed in any way; the only difference is that, because the input was a packed sequence, this tensor is from the final **non-padded element** in the sequence.

We then unpack our `packed_outputs` using `pad_packed_sequence` which returns the `outputs` and the lengths of each, which we don't need. 

The first dimension of `outputs` is the padded sequence length. However, because we used a packed padded sequence, the values of `outputs` at positions where the input was a padding token are all zeros.
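
The pack, run, unpack round trip described above can be checked on toy tensors. This is a minimal sketch, not the encoder itself: it assumes a single-layer, unidirectional `nn.GRU` and made-up sizes so that the two claims (zeros at padded positions, `hidden` taken from the last non-padded step) are easy to verify.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

EMB_DIM, HID_DIM = 4, 3                 # toy sizes, chosen only for illustration
src_len = torch.tensor([5, 3])          # true lengths, sorted in descending order
embedded = torch.randn(5, 2, EMB_DIM)   # [src len, batch size, emb dim]

rnn = nn.GRU(EMB_DIM, HID_DIM)          # unidirectional, single layer

packed_embedded = pack_padded_sequence(embedded, src_len)
packed_outputs, hidden = rnn(packed_embedded)
outputs, out_len = pad_packed_sequence(packed_outputs)

print(outputs.shape)   # torch.Size([5, 2, 3])
print(out_len)         # tensor([5, 3])

# positions after the true end of the second sentence are all zeros
print(outputs[3:, 1])

# hidden comes from the last non-padded step of each sentence
same_first = torch.allclose(hidden[0, 0], outputs[4, 0])   # first sentence, step 5
same_second = torch.allclose(hidden[0, 1], outputs[2, 1])  # second sentence, step 3
print(same_first, same_second)
```

In the bidirectional encoder below the same holds per direction, with the backward direction's final state corresponding to the first position of each sentence.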

In [7]:
class LayerNormalisedGRUCell(nn.Module):
    
    def __init__(self, emb_dim, hid_dim, dropout, num_layers):
        super().__init__()
        # input of shape (batch, input_size): tensor containing input features
        # hidden of shape (batch, hidden_size): tensor containing the initial hidden state for each element in the batch. Defaults to zero if not provided.
        
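        # nn.GRUCell is a single-step, single-layer, unidirectional cell: it only takes
        # (input_size, hidden_size), so num_layers, dropout and bidirectional do not apply here
        self.GRUCell = nn.GRUCell(emb_dim, hid_dim)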
        self.layerNorm = nn.LayerNorm(hid_dim)  # normalised along the hidden dimensions
        
    def forward(self, cur_embed, pre_hidden):
        cur_hidden = self.GRUCell(cur_embed, pre_hidden)
        return self.layerNorm(cur_hidden)

In [8]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dropout, num_layers, 
                 layer_normalised=False, rnn_type='GRU', rnn_dropout=0.2):
        super().__init__()
        
        # input_dim is the vocab size, i.e. the dimension of a one-hot encoding of a token
        self.embedding = nn.Embedding(input_dim, emb_dim) 
        # self.layerNorm = nn.LayerNorm(enc_hid_dim * 2)
        
        self.rnn_type = rnn_type
        if rnn_type == 'GRU':
            self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True, num_layers=num_layers, dropout=rnn_dropout)
        elif rnn_type == 'LSTM':
            self.rnn = nn.LSTM(emb_dim, enc_hid_dim, bidirectional = True, num_layers=num_layers, dropout=rnn_dropout)
        
        self.enc_hid_dim = enc_hid_dim
        self.dropout = nn.Dropout(dropout)
        
        
    def forward(self, src, src_len):
        
        # src = [src len, batch size]
        # src_len = [batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        # embedded = [src len, batch size, emb dim]
                
        # recent PyTorch versions expect the lengths to be on the CPU
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, src_len.cpu())
        
        packed_outputs, hidden = self.rnn(packed_embedded)
        if self.rnn_type == 'LSTM':
            hidden = hidden[0]  # (h_n, c_n)
                                 
        # packed_outputs is a packed sequence containing all hidden states
        # hidden is now from the final non-padded element of each sequence in the batch
            
        outputs, _ = nn.utils.rnn.pad_packed_sequence(packed_outputs) 
            
        # outputs is now a non-packed sequence, all hidden states obtained
        #  when the input is a pad token are all zeros
            
        # outputs = [src len, batch size, hid dim * num directions]
        # hidden = [n layers * num directions, batch size, hid dim]
        
        # hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        # outputs are always from the last layer
        
        # hidden [-2, :, : ] is the last of the forwards RNN 
        # hidden [-1, :, : ] is the last of the backwards RNN
        
        # concatenate the final hidden states of the forwards and backwards encoder RNNs;
        #  Seq2Seq later feeds this through a linear layer to get the initial decoder hidden state
        hidden = torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)
        # hidden = self.layerNorm(hidden)
        
        #outputs = [src len, batch size, enc hid dim * 2]
        #hidden = [batch size, enc hid dim * 2]
        
        return outputs, hidden

### Attention

The attention module is where we calculate the attention values over the source sentence. 

Previously, we allowed this module to "pay attention" to padding tokens within the source sentence. However, using *masking*, we can force the attention to only be over non-padding elements.

The `forward` method now takes a `mask` input. This is a **[batch size, source sentence length]** tensor that is 1 when the source sentence token is not a padding token, and 0 when it is a padding token. For example, if the source sentence is: ["hello", "how", "are", "you", "?", `<pad>`, `<pad>`], then the mask would be [1, 1, 1, 1, 1, 0, 0].

We apply the mask after the attention has been calculated, but before it has been normalized by the `softmax` function. It is applied using `masked_fill`. This fills the tensor at each element where the first argument (`mask == 0`) is true, with the value given by the second argument (`-1e10`). In other words, it will take the un-normalized attention values, and change the attention values over padded elements to be `-1e10`. As these numbers are extremely negative compared to the other values, they become zero when passed through the `softmax` layer, ensuring no attention is paid to padding tokens in the source sentence.
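
The effect of `masked_fill` followed by `softmax` can be seen on a single sentence, using the example mask from the paragraph above. The un-normalized attention scores are made-up numbers for illustration.

```python
import torch
import torch.nn.functional as F

# un-normalized attention scores for one sentence: [batch size, src len]
attention = torch.tensor([[0.5, 1.0, -0.3, 2.0, 0.1, 0.7, 0.7]])
# 1 for real tokens, 0 for <pad>
mask = torch.tensor([[1, 1, 1, 1, 1, 0, 0]])

# replace the scores over padded positions with a very negative number
masked = attention.masked_fill(mask == 0, -1e10)
weights = F.softmax(masked, dim=1)

print(weights)
print(weights[0, 5:])      # weights over the two <pad> positions
print(weights.sum(dim=1))  # the remaining weights still sum to 1

# the result matches a softmax computed over the real tokens only
real_only = F.softmax(attention[0, :5], dim=0)
print(torch.allclose(weights[0, :5], real_only))
```

Masking before the `softmax`, rather than zeroing the weights afterwards, is what keeps the remaining weights a proper probability distribution.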

In [9]:
class Attention(nn.Module):
    def __init__(self, enc_hid_dim, dec_hid_dim):
        super().__init__()
        
        self.attn = nn.Linear((enc_hid_dim * 2) + dec_hid_dim, dec_hid_dim)
        self.v = nn.Linear(dec_hid_dim, 1, bias = False)
        
    def forward(self, hidden, encoder_outputs, mask):
        
        #hidden = [batch size, dec hid dim]
        #encoder_outputs = [src len, batch size, enc hid dim * 2]
        
        batch_size = encoder_outputs.shape[1]
        src_len = encoder_outputs.shape[0]
        
        #repeat decoder hidden state src_len times
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
  
        encoder_outputs = encoder_outputs.permute(1, 0, 2)
        
        #hidden = [batch size, src len, dec hid dim]
        #encoder_outputs = [batch size, src len, enc hid dim * 2]
        
        energy = torch.tanh(self.attn(torch.cat((hidden, encoder_outputs), dim = 2))) 
        
        #energy = [batch size, src len, dec hid dim]

        attention = self.v(energy).squeeze(2)
        
        #attention = [batch size, src len]
        
        attention = attention.masked_fill(mask == 0, -1e10)
        
        return F.softmax(attention, dim = 1)

### Decoder

The decoder only needs a few small changes. It needs to accept a mask over the source sentence and pass this to the attention module. As we want to view the values of attention during inference, we also return the attention tensor.
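
As a rough illustration of what the decoder does with the masked attention weights, the sketch below reproduces its context-vector step (`torch.bmm` between the weights and the encoder outputs) on toy tensors. All sizes and values are made up; `enc_out_dim` stands in for `enc hid dim * 2`.

```python
import torch

torch.manual_seed(0)

batch_size, src_len, enc_out_dim = 2, 4, 6  # toy sizes

# attention weights, one distribution over source positions per batch element
a = torch.tensor([[0.70, 0.20, 0.10, 0.00],
                  [0.25, 0.25, 0.25, 0.25]])                      # [batch size, src len]
encoder_outputs = torch.randn(src_len, batch_size, enc_out_dim)   # [src len, batch size, enc_out_dim]

# same steps as in the decoder: add a dimension, put the batch first, batch-multiply
weighted = torch.bmm(a.unsqueeze(1), encoder_outputs.permute(1, 0, 2))  # [batch size, 1, enc_out_dim]
weighted = weighted.permute(1, 0, 2)                                     # [1, batch size, enc_out_dim]

print(weighted.shape)  # torch.Size([1, 2, 6])

# for the second example the weights are uniform, so the context is the plain mean
expected = encoder_outputs[:, 1].mean(dim=0)
print(torch.allclose(weighted[0, 1], expected, atol=1e-6))
```

A position with weight zero, such as a masked `<pad>`, contributes nothing to the context vector.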

In [10]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, 
                 attention, num_layers, apply_transition=True, rnn_type='GRU', rnn_dropout=0.2):
        super().__init__()

        self.output_dim = output_dim
        self.dec_hid_dim = dec_hid_dim
        self.num_layers = num_layers
        self.attention = attention
        self.layerNorm = nn.LayerNorm(dec_hid_dim)
        self.apply_transition = apply_transition
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        if rnn_type == 'GRU':
            if apply_transition:
                self.trans_rnn1 = nn.GRU(emb_dim, dec_hid_dim)  # input: prev_embed; h_0: prev_rnn_hidden
                self.trans_rnn2 = nn.GRU(enc_hid_dim * 2, dec_hid_dim)  # input: attention; h_0: trans_rnn1_hidden
                self.rnn = nn.GRU(emb_dim + dec_hid_dim, dec_hid_dim, num_layers=num_layers, dropout=rnn_dropout)  # [prev_embed + trans_rnn2_hidden]
            else:
                self.rnn = nn.GRU(emb_dim + 2 * enc_hid_dim, dec_hid_dim, num_layers=num_layers, dropout=rnn_dropout)
        elif rnn_type == 'LSTM':
            if apply_transition:
                self.trans_rnn1 = nn.LSTM(emb_dim, dec_hid_dim)  # input: prev_embed; h_0: prev_rnn_hidden
                self.trans_rnn2 = nn.LSTM(enc_hid_dim * 2, dec_hid_dim)  # input: attention; h_0: trans_rnn1_hidden
                self.rnn = nn.LSTM(emb_dim + dec_hid_dim, dec_hid_dim, num_layers=num_layers, dropout=rnn_dropout)  # [prev_embed + trans_rnn2_hidden]
            else:
                self.rnn = nn.LSTM(emb_dim + 2 * enc_hid_dim, dec_hid_dim, num_layers=num_layers, dropout=rnn_dropout)
            
        
        self.fc_out = nn.Linear((enc_hid_dim * 2) + dec_hid_dim + emb_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
        
    def forward(self, input, hidden, encoder_outputs, mask):
             
        # input = [batch size]
        # hidden = [batch size, dec hid dim]
        # encoder_outputs = [src len, batch size, enc hid dim * 2]
        # mask = [batch size, src len]
        
        input = input.unsqueeze(0)  # input = [1, batch size]
        embedded = self.dropout(self.embedding(input))  #  embedded = [1, batch size, emb dim]
        
        # transition rnn 1
        if self.apply_transition:
            output_1, hidden_1 = self.trans_rnn1(embedded, hidden.unsqueeze(0))  # input, h_0
            assert (output_1 == hidden_1).all()
            # hidden_1 = [1, batch size, dec dim]
        
        # compute the attention score and the weighted sum (context vector)
        if self.apply_transition:
            a = self.attention(hidden_1.squeeze(0), encoder_outputs, mask)  # a = [batch size, src len]
        else:
            a = self.attention(hidden, encoder_outputs, mask)
        a = a.unsqueeze(1)  # a = [batch size, 1, src len]
        encoder_outputs = encoder_outputs.permute(1, 0, 2) # encoder_outputs = [batch size, src len, enc hid dim * 2]      
        weighted = torch.bmm(a, encoder_outputs) # weighted = [batch size, 1, enc hid dim * 2]  
        weighted = weighted.permute(1, 0, 2) # weighted = [1, batch size, enc hid dim * 2]
        
        # transition rnn 2
        if self.apply_transition:
            output_2, hidden_2 = self.trans_rnn2(weighted, hidden_1)
            assert (output_2 == hidden_2).all()
            # hidden_2 = self.layerNorm(hidden_2)
        
        # the rest of rnn   
        if self.apply_transition:
            output, hidden = self.rnn(torch.cat((embedded, hidden_2), dim=2))  # add skip connection
        else:
            output, hidden = self.rnn(torch.cat((embedded, weighted), dim=2), hidden.unsqueeze(0).repeat(self.num_layers, 1, 1))
        hidden = hidden.view(self.num_layers, 1, embedded.size(1), self.dec_hid_dim)[-1]  # last layer hidden extracted
        
        #output = [seq len, batch size, dec hid dim * n directions]
        #hidden (as returned by the RNN) = [n layers * n directions, batch size, dec hid dim]
        
        #seq len and n directions are always 1 in this decoder, and the line above
        #keeps only the last layer of hidden, therefore:
        #output = [1, batch size, dec hid dim]
        #hidden = [1, batch size, dec hid dim]
        #for a GRU this also means that output == hidden
        assert (output == hidden).all()
        hidden = self.layerNorm(hidden)
        
        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted = weighted.squeeze(0)
        
        prediction = self.fc_out(torch.cat((output, weighted, embedded), dim = 1))  #  prediction = [batch size, output dim]
        
        # go through log-softmax according to openNMT
        prediction = F.log_softmax(prediction, dim = 1)
        
        return prediction, hidden.squeeze(0), a.squeeze(1)

### Seq2Seq

The overarching seq2seq model also needs a few changes for packed padded sequences, masking and inference. 

We need to tell it the index of the pad token and also pass the source sentence lengths as input to the `forward` method.

We use the pad token index to create the masks, by creating a mask tensor that is 1 wherever the source sentence is not equal to the pad token. This is all done within the `create_mask` function.
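
The mask construction can be tried on its own. In this sketch the pad index `SRC_PAD_IDX = 1` and the token indices are assumptions for illustration; the expression itself is the one used in `create_mask`.

```python
import torch

SRC_PAD_IDX = 1  # assumed pad index, for illustration only

# [src len, batch size]; the second sentence ends in two <pad> tokens
src = torch.tensor([[2, 2],
                    [5, 8],
                    [6, 3],
                    [7, 1],
                    [3, 1]])

# True where the token is not padding, then put the batch dimension first
mask = (src != SRC_PAD_IDX).permute(1, 0)  # [batch size, src len]

print(mask.shape)  # torch.Size([2, 5])
print(mask)
# tensor([[ True,  True,  True,  True,  True],
#         [ True,  True,  True, False, False]])
```

The mask is a boolean tensor, so the `mask == 0` test in the attention module selects exactly the `False` (padding) positions.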

The sequence lengths are needed so that the encoder can use packed padded sequences.

The decoder also returns the attention at each time-step; the `forward` method below discards it (it is assigned to `_`), as it is only needed for inference.

In [11]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, src_pad_idx, device, teacher_forcing_ratio = 0.5, decoder_pinyin = None):
        super().__init__()
        self.encoder = encoder
        self.fc = nn.Linear(encoder.enc_hid_dim * 2, decoder.dec_hid_dim)
        self.decoder = decoder
        self.decoder_pinyin = decoder_pinyin
        if decoder_pinyin:
            self.fc_pinyin = nn.Linear(encoder.enc_hid_dim * 2, decoder_pinyin.dec_hid_dim)
        self.src_pad_idx = src_pad_idx
        self.device = device
        self.teacher_forcing_ratio = teacher_forcing_ratio 
        
    def create_mask(self, src):
        mask = (src != self.src_pad_idx).permute(1, 0)
        return mask
        
    def forward(self, src, src_len, trg, pinyin = None):
        
        # src = [src len, batch size]
        # src_len = [batch size]
        # trg = [trg len, batch size]
        # teacher_forcing_ratio is probability to use teacher forcing
        # e.g. if teacher_forcing_ratio is 0.75 we use teacher forcing 75% of the time
                    
        batch_size = src.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        if pinyin is not None:
            pinyin_len = pinyin.shape[0]
            pinyin_vocab_size = self.decoder_pinyin.output_dim
        
        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        if pinyin is not None:
            outputs_pinyin = torch.zeros(pinyin_len, batch_size, pinyin_vocab_size).to(self.device)
        
        # encoder_outputs is all hidden states of the input sequence, back and forwards
        # hidden is the final forward and backward hidden states, passed through a linear layer
        encoder_outputs, hidden = self.encoder(src, src_len)
        hidden_trg = torch.tanh(self.fc(hidden))
        if pinyin is not None:
            hidden_pinyin = torch.tanh(self.fc_pinyin(hidden))
                
        # first input to the decoder is the <sos> tokens
        input_trg = trg[0, :]
        if pinyin is not None:
            input_pinyin = pinyin[0, :]
        
        mask = self.create_mask(src)

        #mask = [batch size, src len]
                
        for t in range(1, trg_len): 
            
            #insert input token embedding, previous hidden state, all encoder hidden states 
            #  and mask
            #receive output tensor (predictions) and new hidden state
            output, hidden_trg, _ = self.decoder(input_trg, hidden_trg, encoder_outputs, mask)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < self.teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input_trg = trg[t] if teacher_force else top1
            
        # For the dual decoder only
        if pinyin is not None:
            for t in range(1, pinyin_len): 

                #insert input token embedding, previous hidden state, all encoder hidden states 
                #  and mask
                #receive output tensor (predictions) and new hidden state
                output, hidden_pinyin, _ = self.decoder_pinyin(input_pinyin, hidden_pinyin, encoder_outputs, mask)

                #place predictions in a tensor holding predictions for each token
                outputs_pinyin[t] = output

                #decide if we are going to use teacher forcing or not
                teacher_force = random.random() < self.teacher_forcing_ratio

                #get the highest predicted token from our predictions
                top1 = output.argmax(1) 

                #if teacher forcing, use actual next token as next input
                #if not, use predicted token
                input_pinyin = pinyin[t] if teacher_force else top1
            
            return outputs, outputs_pinyin
        
        return outputs
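
The teacher-forcing choice made inside the decoding loops above can be isolated in a few lines of plain Python. The helper `next_input` and the string tokens are hypothetical, introduced only to make the behaviour of `teacher_forcing_ratio` visible.

```python
import random

random.seed(0)

def next_input(target_token, predicted_token, teacher_forcing_ratio):
    """Pick the decoder's next input the same way the loop above does."""
    teacher_force = random.random() < teacher_forcing_ratio
    return target_token if teacher_force else predicted_token

# with a ratio of 0.5, roughly half of the inputs are ground-truth tokens
choices = [next_input("gold", "pred", 0.5) for _ in range(10000)]
gold_fraction = choices.count("gold") / len(choices)
print(gold_fraction)

# the two extremes are deterministic, because random.random() lies in [0, 1)
always_gold = next_input("gold", "pred", 1.0)
always_pred = next_input("gold", "pred", 0.0)
print(always_gold, always_pred)  # gold pred
```

This is why the evaluation code sets `teacher_forcing_ratio` to 0: the model is then always fed its own predictions.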

## Training the Seq2Seq Model

Next up, we initialize the model and place it on the GPU (or on the CPU if no GPU is available).

In [12]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
OUTPUT_DIM_str = len(PINYIN_STR.vocab)
OUTPUT_DIM_char = len(PINYIN_CHAR.vocab)
ENC_EMB_DIM = 256  # 512 in paper, 256 best
DEC_EMB_DIM = 256
DEC_EMB_DIM2 = 128
ENC_HID_DIM = 512  # 1024 in paper, 512 best
DEC_HID_DIM = 512
DEC_HID_DIM2 = 256
ENC_DROPOUT = 0.1  # 0.1 in paper, for embedding
DEC_DROPOUT = 0.1  # 0.1 in paper, for embedding
RNN_DROPOUT = 0.2
SRC_PAD_IDX = SRC.vocab.stoi[SRC.pad_token]
# RNN_TYPE = 'LSTM'

encoder_num_layer = 2
decoder_num_layer = 2

attn = Attention(ENC_HID_DIM, DEC_HID_DIM)
attn2 = Attention(ENC_HID_DIM, DEC_HID_DIM2)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, ENC_DROPOUT, encoder_num_layer, rnn_dropout=RNN_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, 
              attn, decoder_num_layer, apply_transition=True, rnn_dropout=RNN_DROPOUT)
dec2 = Decoder(OUTPUT_DIM_str, DEC_EMB_DIM2, ENC_HID_DIM, DEC_HID_DIM2, DEC_DROPOUT, 
               attn2, decoder_num_layer, apply_transition=True, rnn_dropout=RNN_DROPOUT)

model = Seq2Seq(enc, dec, SRC_PAD_IDX, device, decoder_pinyin=dec2).to(device)
#model = Seq2Seq(enc, dec, SRC_PAD_IDX, device).to(device)

Then, we initialize the model parameters.

In [13]:
def init_weights(m):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(param.data, mean=0, std=0.01)
        else:
            nn.init.constant_(param.data, 0)
            
show_params = model.apply(init_weights)
show_params

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(30, 256)
    (rnn): GRU(256, 512, num_layers=2, dropout=0.2, bidirectional=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (fc): Linear(in_features=1024, out_features=512, bias=True)
  (decoder): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
    (layerNorm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
    (embedding): Embedding(437, 256)
    (trans_rnn1): GRU(256, 512)
    (trans_rnn2): GRU(1024, 512)
    (rnn): GRU(768, 512, num_layers=2, dropout=0.2)
    (fc_out): Linear(in_features=1792, out_features=437, bias=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (decoder_pinyin): Decoder(
    (attention): Attention(
      (attn): Linear(in_features=1280, out_features=256, bias=True)
      (v): Linear(in_features=256, out_features=1, bias=False)
    )
    (layerNorm): LayerNorm(

We'll print out the number of trainable parameters in the model. Packed padded sequences and masking add no parameters of their own; the count below also includes the transition RNNs, the layer normalisation and the second (pinyin) decoder defined above.

In [14]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 19,571,651 trainable parameters


Then we define our optimizer and criterion. 

The `ignore_index` for the criterion needs to be the index of the pad token for the target language, not the source language.
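
A small check of what `ignore_index` does, on random toy logits. The pad index `TRG_PAD_IDX = 1` is an assumption for illustration. The sketch also looks at a detail of this notebook: the decoder already returns `log_softmax` outputs, and `nn.CrossEntropyLoss` applies `log_softmax` again, which leaves log-probabilities unchanged, so the loss is the same as `nn.NLLLoss` on the decoder output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

TRG_PAD_IDX = 1  # assumed pad index, for illustration only

logits = torch.randn(4, 6)  # [(trg len - 1) * batch size, output dim]
trg = torch.tensor([3, 5, TRG_PAD_IDX, TRG_PAD_IDX])

criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)

# only the two non-pad positions contribute to the (mean) loss
loss_all = criterion(logits, trg)
loss_real = nn.CrossEntropyLoss()(logits[:2], trg[:2])
same_as_unpadded = torch.allclose(loss_all, loss_real)

# applying the criterion to log-probabilities gives the same value,
# and matches NLLLoss on those log-probabilities
log_probs = F.log_softmax(logits, dim=1)
loss_on_log_probs = criterion(log_probs, trg)
loss_nll = nn.NLLLoss(ignore_index=TRG_PAD_IDX)(log_probs, trg)
same_on_log_probs = torch.allclose(loss_on_log_probs, loss_all, atol=1e-6)
same_as_nll = torch.allclose(loss_on_log_probs, loss_nll, atol=1e-6)

print(same_as_unpadded, same_on_log_probs, same_as_nll)
```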

In [15]:
learning_rate = 0.003  # 0.003 in paper
patience = 0
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# optimizer = optim.SGD(model.parameters(), lr=learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
                optimizer=optimizer,
                mode='max', factor=0.9, # 0.9 in paper
                patience=patience)
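
To see how the scheduler above behaves with `mode='max'`, `factor=0.9` and `patience=0`, the sketch below steps it with a few made-up validation accuracies. The dummy parameter exists only so that the optimizer has something to hold.

```python
import torch
import torch.optim as optim

param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter for the optimizer
optimizer = optim.Adam([param], lr=0.003)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.9, patience=0)

lrs = []
for valid_acc in [0.50, 0.40, 0.60, 0.60]:  # made-up validation accuracies
    scheduler.step(valid_acc)
    lrs.append(optimizer.param_groups[0]['lr'])
    print(f'valid_acc={valid_acc:.2f} -> lr={lrs[-1]:.6f}')
```

With `patience=0`, the learning rate is multiplied by 0.9 every time the monitored accuracy fails to improve on its best value, including when it merely ties it.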

In [16]:
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

Next, we'll define our training and evaluation loops.

As we are using `include_lengths = True` for our source field, `batch.src` is now a tuple with the first element being the numericalized tensor representing the sentence and the second element being the lengths of each sentence within the batch.

The decoder also returns the attention vectors over the batch of source sentences for each decoding time-step. We won't use these during training/evaluation, but we will later for inference.

In [17]:
def update(epoch, valid_loss, valid_acc, 
           best_valid_loss, best_valid_acc, acc_valid_loss,
           update_type='acc'):
    global best_valid_epoch, early_stop_patience, full_patience, best_train_step, train_steps
    print("\n---------------------------------------")
    print("[Epoch: {}][Validating...]".format(epoch))
    if valid_loss < best_valid_loss:
        print('\t\t Better Valid Loss!')
        best_valid_loss = valid_loss
        if update_type == 'loss':
            torch.save(model.state_dict(), 'loss-model.pt')
        early_stop_patience = full_patience  # restore full patience if obtain new minimum of the loss
    else:
        if early_stop_patience > 0:
            early_stop_patience += -1
    
    if valid_acc > best_valid_acc or (valid_acc == best_valid_acc and valid_loss < acc_valid_loss):
        print('\t\t Better Valid Acc!')
        best_valid_acc = valid_acc
        acc_valid_loss = valid_loss
        best_valid_epoch = epoch
        best_train_step = train_steps
        if update_type == 'acc':
            torch.save(model.state_dict(), 'acc-model.pt')
    print(f'\t patience: {early_stop_patience}/{full_patience}')
    print(f'\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc:.3f} | Val. PPL: {math.exp(valid_loss):7.3f}')
    print(f'\t BEST. Val. Loss: {best_valid_loss:.3f} | BEST. Val. Acc: {best_valid_acc:.3f} | Val. Loss: {acc_valid_loss:.3f} | BEST. Val. Epoch: {best_valid_epoch} | BEST. Val. Step: {best_train_step}')
    print("---------------------------------------\n")
    return best_valid_loss, best_valid_acc, acc_valid_loss

In [18]:
n_examples = len(train_data.examples)

def train(model, iterator, 
          optimizer, criterion, 
          clip, epoch,
          scheduler, valid_iterator = None, 
          teacher_forcing_ratio = 0.5,
          multi_task = False, multi_task_ratio=0.5):
    
    model.train()
    model.teacher_forcing_ratio = teacher_forcing_ratio
    print("Current Teacher Forcing Ratio: {:.3f}".format(model.teacher_forcing_ratio))
    
    epoch_loss = 0
    running_loss = 0
    global best_valid_loss, acc_valid_loss, best_valid_acc, best_valid_epoch, train_steps, report_steps
    
    for i, batch in enumerate(iterator):
        
        src, src_len = batch.src
        trg, trg_len = batch.trg
        pinyin, pinyin_len = batch.pinyin_str
        
        optimizer.zero_grad()
        
        if multi_task:
            output, output_pinyin = model(src, src_len, trg, pinyin=pinyin)
        else:
            output = model(src, src_len, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        if multi_task:
            output_pinyin_dim = output_pinyin.shape[-1]
            output_pinyin = output_pinyin[1:].view(-1, output_pinyin_dim)
            pinyin = pinyin[1:].view(-1)
            loss = (multi_task_ratio * criterion(output, trg)) + ((1 - multi_task_ratio) * criterion(output_pinyin, pinyin))
        else:
            loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        running_loss = epoch_loss / (i + 1)
        
        # print every report_steps batches (steps)
        if i % report_steps == report_steps - 1:
            train_steps += report_steps  # by doing so, the last batch is neglected
            for param_group in optimizer.param_groups:
                lr = param_group['lr']
            print('[Epoch: {}][#examples: {}/{}][#steps: {}]'.format(epoch, (i+1) * BATCH_SIZE, n_examples, train_steps))
            print(f'\tTrain Loss: {running_loss:.3f} | Train PPL: {math.exp(running_loss):7.3f} | lr: {lr:.3e}')
            
            # eval the validation set for every * steps
            if (train_steps % (10 * report_steps)) == 0:
                valid_loss, valid_acc = evaluate(model, valid_iterator, criterion, scheduler, multi_task = multi_task)
                test_loss, test_acc = evaluate(model, test_iterator, criterion, scheduler, is_test=True, multi_task=multi_task)
                best_valid_loss, best_valid_acc, acc_valid_loss = update(epoch, valid_loss, valid_acc, 
                                                         best_valid_loss, best_valid_acc, acc_valid_loss,
                                                         update_type='acc')
                scheduler.step(valid_acc)  # must be placed here otherwise the test acc messes up
                model.train()
                
            
    return epoch_loss / len(iterator)

In [19]:
def evaluate(model, iterator, criterion, scheduler, is_test=False, multi_task=False, multi_task_ratio=0.5):
    
    model.eval()
    model.teacher_forcing_ratio = 0 #  turn off teacher forcing
    
    epoch_loss = 0
    correct = 0
    correct_pinyin = 0
    
    global valid_data
    global test_data
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src, src_len = batch.src
            trg, trg_len = batch.trg
            pinyin, pinyin_len = batch.pinyin_str

            if multi_task:
                output, output_pinyin = model(src, src_len, trg, pinyin=pinyin)
            else:
                output = model(src, src_len, trg)
            
            # ---------compute acc START----------
            pred = output[1:].argmax(2).permute(1, 0) # [batch_size, trg_len]
            ref = trg[1:].permute(1, 0)
            # consider the last batch as well
            size = pred.shape[0]
            for j in range(size):
                
                pred_j = pred[j, :]
                pred_j_toks = []
                for t in pred_j:
                    tok = TRG.vocab.itos[t]
                    if tok == '<eos>':
                        break
                    else:
                        pred_j_toks.append(tok)
                pred_j = ''.join(pred_j_toks)
                
                ref_j = ref[j, :]
                ref_j_toks = []
                for t in ref_j:
                    tok = TRG.vocab.itos[t]
                    if tok == '<eos>':
                        break
                    else:
                        ref_j_toks.append(tok)
                ref_j = ''.join(ref_j_toks)
                
                if pred_j == ref_j:
                    correct += 1
            # ---------compute acc END----------
            
            # ---------compute pinyin acc START----------
            if multi_task:
                pred_pinyin = output_pinyin[1:].argmax(2).permute(1, 0) # [batch_size, pinyin_len]
                ref_pinyin = pinyin[1:].permute(1, 0)
                # consider the last batch as well
                size = pred_pinyin.shape[0]
                for j in range(size):

                    pred_j = pred_pinyin[j, :]
                    pred_j_toks = []
                    for t in pred_j:
                        tok = PINYIN_STR.vocab.itos[t]
                        if tok == '<eos>':
                            break
                        else:
                            pred_j_toks.append(tok)
                    pred_j = ''.join(pred_j_toks)

                    ref_j = ref_pinyin[j, :]
                    ref_j_toks = []
                    for t in ref_j:
                        tok = PINYIN_STR.vocab.itos[t]
                        if tok == '<eos>':
                            break
                        else:
                            ref_j_toks.append(tok)
                    ref_j = ''.join(ref_j_toks)
                    
                    if pred_j == ref_j:
                        correct_pinyin += 1
                # ---------compute pinyin acc END----------
            
            
            # trg = [trg len, batch size]
            # output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            if multi_task:
                output_pinyin_dim = output_pinyin.shape[-1]
                output_pinyin = output_pinyin[1:].view(-1, output_pinyin_dim)
                pinyin = pinyin[1:].view(-1)
                loss = (multi_task_ratio * criterion(output, trg)) + ((1 - multi_task_ratio) * criterion(output_pinyin, pinyin))
            else:
                loss = criterion(output, trg)

            epoch_loss += loss.item()
        
        # compute loss and acc
        epoch_loss = epoch_loss / len(iterator)
        # the scheduler is stepped on accuracy, so compute it here
        if not is_test:
            acc = correct / len(valid_data.examples)
            if multi_task:
                acc_pinyin = correct_pinyin / len(valid_data.examples)
                print('The number of correct pinyin predictions: {}'.format(correct_pinyin))
                print('Val Acc on Pinyin: {:.3f}'.format(acc_pinyin))
        else:
            # test split; note that the pinyin label printed below still reads 'Val Acc'
            acc = correct / len(test_data.examples)
            if multi_task:
                acc_pinyin = correct_pinyin / len(test_data.examples)
                print('The number of correct pinyin predictions: {}'.format(correct_pinyin))
                print('Val Acc on Pinyin: {:.3f}'.format(acc_pinyin))
        
        print('The number of correct predictions: {}'.format(correct))
        
    return epoch_loss, acc
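
The accuracy computed in `evaluate` is an exact-match accuracy: a prediction counts as correct only if, after cutting both sequences at the first `<eos>`, the predicted string equals the reference string. The same decode-and-compare logic is repeated for the target and the pinyin outputs, so it could be factored out. A minimal sketch, assuming plain lists of token ids and a toy vocabulary (the helper names `ids_to_string` and `count_exact_matches` are hypothetical and not part of the notebook):

```python
def ids_to_string(ids, itos, eos_token='<eos>'):
    """Join the tokens for `ids` into one string, stopping before the first <eos>."""
    toks = []
    for i in ids:
        tok = itos[int(i)]
        if tok == eos_token:
            break
        toks.append(tok)
    return ''.join(toks)


def count_exact_matches(pred_rows, ref_rows, itos, eos_token='<eos>'):
    """Count rows whose decoded prediction equals the decoded reference."""
    return sum(
        ids_to_string(p, itos, eos_token) == ids_to_string(r, itos, eos_token)
        for p, r in zip(pred_rows, ref_rows)
    )


# Toy vocabulary standing in for TRG.vocab.itos (an assumption for this example).
itos = ['<unk>', '<pad>', '<sos>', '<eos>', 'a', 'b', 'c']

pred = [[4, 5, 3, 6],   # "ab", tokens after <eos> are ignored
        [4, 6, 3, 1]]   # "ac"
ref = [[4, 5, 3, 1],    # "ab"
       [4, 5, 3, 1]]    # "ab"

n_correct = count_exact_matches(pred, ref, itos)
print(n_correct)  # 1
```

Because each id is passed through `int(...)`, the same helpers should also accept the rows of a `[batch_size, trg_len]` tensor such as `pred` and `ref` in `evaluate`.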

Then, we'll define a useful function for timing how long epochs take.

In [20]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
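
As a quick check of how `epoch_time` splits a duration, here is a small worked example. The function body is repeated so the snippet runs on its own, and the timestamps are made-up values rather than measurements:

```python
def epoch_time(start_time, end_time):
    # Same logic as the cell above: whole minutes, then the leftover whole seconds.
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs


mins, secs = epoch_time(0.0, 153.7)
print(f'{mins}m {secs}s')  # 2m 33s
```

Fractions of a second are truncated rather than rounded, so 153.7 seconds is reported as 2m 33s.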

The penultimate step is to train our model. Notice that training takes almost half as long as it did for our model without the improvements added in this notebook.
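
The training cell that follows also anneals the teacher forcing ratio: it starts at 0.8, drops by 0.03 per epoch, and is clipped at a floor of 0.2. A minimal sketch of that schedule on its own, using the same formula as the training loop (the function name `teacher_forcing_schedule` and the `floor` parameter are hypothetical, added only for illustration):

```python
def teacher_forcing_schedule(epoch, floor=0.2):
    """Linearly decayed teacher forcing ratio, clipped from below at `floor`."""
    return max(1 - (10 + epoch * 1.5) / 50, floor)


for epoch in (0, 15, 18, 21, 24):
    print(f'epoch {epoch:2d}: tfr = {teacher_forcing_schedule(epoch):.3f}')
# epoch  0: tfr = 0.800
# epoch 15: tfr = 0.350
# epoch 18: tfr = 0.260
# epoch 21: tfr = 0.200
# epoch 24: tfr = 0.200
```

These values agree with the `Current Teacher Forcing Ratio` lines in the training log. From epoch 20 onward the ratio stays at the floor, so the decoder is fed its own predictions about 80% of the time.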

In [24]:
N_EPOCHS = 100
CLIP = 1

best_valid_loss = float('inf')
acc_valid_loss = float('inf')
best_valid_acc = float(-1)
best_valid_epoch = -1
best_train_step = -1
multi_task = True
full_patience = 20
early_stop_patience = full_patience
train_steps = 0
report_steps = 50
multi_task_ratio = 0.8

try:
    for epoch in range(N_EPOCHS):
        
        if epoch <= 15:
            early_stop_patience = full_patience
        
        if early_stop_patience == 0:
            print("Early Stopping!")
            # break
            # early stopping is deliberately disabled (the message is printed but training continues),
            # because we found the best epoch in a long run
            
        start_time = time.time()
        
        tfr = max(1 - (float(10 + epoch * 1.5) / 50), 0.2) 
            
        train_loss = train(model, train_iterator, optimizer, criterion, CLIP, epoch, scheduler, 
                           valid_iterator, teacher_forcing_ratio = tfr, 
                           multi_task=multi_task, multi_task_ratio=multi_task_ratio)
        
        valid_loss, valid_acc = evaluate(model, valid_iterator, criterion, scheduler, is_test=False, 
                                         multi_task=multi_task, multi_task_ratio=multi_task_ratio)
        #test_loss, test_acc = evaluate(model, test_iterator, criterion, scheduler, is_test=True, multi_task=multi_task)

        end_time = time.time()

        epoch_mins, epoch_secs = epoch_time(start_time, end_time)

        best_valid_loss, best_valid_acc, acc_valid_loss = update(epoch, valid_loss, valid_acc, 
                                                 best_valid_loss, best_valid_acc, acc_valid_loss, update_type='acc')

        print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
        print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
        print(f'\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc:.3f} | Val. PPL: {math.exp(valid_loss):7.3f}')
        # print(f'\t Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} | Test ACC: {test_acc:.3f}')
except KeyboardInterrupt:
    print("Exiting loop")

Current Teacher Forcing Ratio: 0.800
[Epoch: 0][#examples: 3200/46620][#steps: 50]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 0][#examples: 6400/46620][#steps: 100]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 0][#examples: 9600/46620][#steps: 150]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 0][#examples: 12800/46620][#steps: 200]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 0][#examples: 16000/46620][#steps: 250]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 0][#examples: 19200/46620][#steps: 300]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 0][#examples: 22400/46620][#steps: 350]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 0][#examples: 25600/46620][#steps: 400]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 0][#examples: 28800/46620][#steps: 450]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 0][#examples: 32000/466

[Epoch: 3][#examples: 3200/46620][#steps: 2150]
	Train Loss: 0.001 | Train PPL:   1.001 | lr: 9.838e-08
[Epoch: 3][#examples: 6400/46620][#steps: 2200]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 3][#examples: 9600/46620][#steps: 2250]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 3][#examples: 12800/46620][#steps: 2300]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 3][#examples: 16000/46620][#steps: 2350]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 3][#examples: 19200/46620][#steps: 2400]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 3][#examples: 22400/46620][#steps: 2450]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 3][#examples: 25600/46620][#steps: 2500]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4241
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141
The number of correct pinyin predictions: 4289
Val 

[Epoch: 6][#examples: 3200/46620][#steps: 4250]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 6][#examples: 6400/46620][#steps: 4300]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 6][#examples: 9600/46620][#steps: 4350]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 6][#examples: 12800/46620][#steps: 4400]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 6][#examples: 16000/46620][#steps: 4450]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 6][#examples: 19200/46620][#steps: 4500]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4241
Val Acc on Pinyin: 0.728
The number of correct predictions: 4142
The number of correct pinyin predictions: 4288
Val Acc on Pinyin: 0.736
The number of correct predictions: 4164

---------------------------------------
[Epoch: 6][Validatiing...]
		 Better Valid Acc!
	 patience: 19/20
	 Val. Loss: 0.765 | Val. Acc: 0.711 | Val

[Epoch: 9][#examples: 3200/46620][#steps: 6350]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 9][#examples: 6400/46620][#steps: 6400]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 9][#examples: 9600/46620][#steps: 6450]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 9][#examples: 12800/46620][#steps: 6500]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4241
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141
The number of correct pinyin predictions: 4288
Val Acc on Pinyin: 0.736
The number of correct predictions: 4164

---------------------------------------
[Epoch: 9][Validatiing...]
	 patience: 19/20
	 Val. Loss: 0.765 | Val. Acc: 0.711 | Val. PPL:   2.149
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

[Epoch: 9][#examples: 16000/46620][#steps: 6550]
	Train Loss:

[Epoch: 12][#examples: 3200/46620][#steps: 8450]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 12][#examples: 6400/46620][#steps: 8500]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4241
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141
The number of correct pinyin predictions: 4289
Val Acc on Pinyin: 0.736
The number of correct predictions: 4164

---------------------------------------
[Epoch: 12][Validatiing...]
	 patience: 19/20
	 Val. Loss: 0.765 | Val. Acc: 0.711 | Val. PPL:   2.149
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

[Epoch: 12][#examples: 9600/46620][#steps: 8550]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 12][#examples: 12800/46620][#steps: 8600]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 12][#examples: 16000/46620][#steps: 8650]
	Train

The number of correct pinyin predictions: 4241
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141

---------------------------------------
[Epoch: 14][Validatiing...]
	 patience: 17/20
	 Val. Loss: 0.792 | Val. Acc: 0.711 | Val. PPL:   2.208
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

Epoch: 15 | Time: 2m 33s
	Train Loss: 0.002 | Train PPL:   1.002
	 Val. Loss: 0.792 | Val. Acc: 0.711 | Val. PPL:   2.208
Current Teacher Forcing Ratio: 0.350
[Epoch: 15][#examples: 3200/46620][#steps: 10550]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 15][#examples: 6400/46620][#steps: 10600]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 15][#examples: 9600/46620][#steps: 10650]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 15][#examples: 12800/46620][#steps: 10700]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoc

The number of correct pinyin predictions: 4241
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141

---------------------------------------
[Epoch: 17][Validatiing...]
	 patience: 13/20
	 Val. Loss: 0.792 | Val. Acc: 0.711 | Val. PPL:   2.208
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

Epoch: 18 | Time: 2m 35s
	Train Loss: 0.002 | Train PPL:   1.002
	 Val. Loss: 0.792 | Val. Acc: 0.711 | Val. PPL:   2.208
Current Teacher Forcing Ratio: 0.260
[Epoch: 18][#examples: 3200/46620][#steps: 12650]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 18][#examples: 6400/46620][#steps: 12700]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 18][#examples: 9600/46620][#steps: 12750]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 18][#examples: 12800/46620][#steps: 12800]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoc

The number of correct pinyin predictions: 4241
Val Acc on Pinyin: 0.728
The number of correct predictions: 4142

---------------------------------------
[Epoch: 20][Validatiing...]
	 patience: 6/20
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

Epoch: 21 | Time: 2m 16s
	Train Loss: 0.002 | Train PPL:   1.002
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
Current Teacher Forcing Ratio: 0.200
[Epoch: 21][#examples: 3200/46620][#steps: 14750]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 21][#examples: 6400/46620][#steps: 14800]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 21][#examples: 9600/46620][#steps: 14850]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 21][#examples: 12800/46620][#steps: 14900]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch

The number of correct pinyin predictions: 4241
Val Acc on Pinyin: 0.728
The number of correct predictions: 4142

---------------------------------------
[Epoch: 23][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

Epoch: 24 | Time: 2m 17s
	Train Loss: 0.002 | Train PPL:   1.002
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
Early Stopping!
Current Teacher Forcing Ratio: 0.200
[Epoch: 24][#examples: 3200/46620][#steps: 16850]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 24][#examples: 6400/46620][#steps: 16900]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 24][#examples: 9600/46620][#steps: 16950]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 24][#examples: 12800/46620][#steps: 17000]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 

[Epoch: 26][#examples: 44800/46620][#steps: 18900]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4241
Val Acc on Pinyin: 0.728
The number of correct predictions: 4142

---------------------------------------
[Epoch: 26][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

Epoch: 27 | Time: 2m 16s
	Train Loss: 0.002 | Train PPL:   1.002
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
Early Stopping!
Current Teacher Forcing Ratio: 0.200
[Epoch: 27][#examples: 3200/46620][#steps: 18950]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 27][#examples: 6400/46620][#steps: 19000]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4241
Val Acc on Pinyin: 0.728
The number of correct pr

[Epoch: 29][#examples: 41600/46620][#steps: 20950]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 29][#examples: 44800/46620][#steps: 21000]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4241
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141
The number of correct pinyin predictions: 4288
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

---------------------------------------
[Epoch: 29][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.766 | Val. Acc: 0.711 | Val. PPL:   2.150
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

The number of correct pinyin predictions: 4241
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141

---------------------------------------
[Epoch: 29][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
	 BEST. V

[Epoch: 32][#examples: 38400/46620][#steps: 23000]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141
The number of correct pinyin predictions: 4288
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

---------------------------------------
[Epoch: 32][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.766 | Val. Acc: 0.711 | Val. PPL:   2.150
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

[Epoch: 32][#examples: 41600/46620][#steps: 23050]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 32][#examples: 44800/46620][#steps: 23100]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141

---------------------------------------
[Epo

The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141
The number of correct pinyin predictions: 4288
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

---------------------------------------
[Epoch: 35][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.766 | Val. Acc: 0.711 | Val. PPL:   2.150
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

[Epoch: 35][#examples: 35200/46620][#steps: 25050]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 35][#examples: 38400/46620][#steps: 25100]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 35][#examples: 41600/46620][#steps: 25150]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 35][#examples: 44800/46620][#steps: 25200]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val

The number of correct pinyin predictions: 4288
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

---------------------------------------
[Epoch: 38][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.766 | Val. Acc: 0.711 | Val. PPL:   2.150
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

[Epoch: 38][#examples: 28800/46620][#steps: 27050]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 38][#examples: 32000/46620][#steps: 27100]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 38][#examples: 35200/46620][#steps: 27150]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 38][#examples: 38400/46620][#steps: 27200]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 38][#examples: 41600/46620][#steps: 27250]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 38][#examples: 44800/46620][#steps: 27300]
	Tra

[Epoch: 41][#examples: 22400/46620][#steps: 29050]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 41][#examples: 25600/46620][#steps: 29100]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 41][#examples: 28800/46620][#steps: 29150]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 41][#examples: 32000/46620][#steps: 29200]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 41][#examples: 35200/46620][#steps: 29250]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 41][#examples: 38400/46620][#steps: 29300]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 41][#examples: 41600/46620][#steps: 29350]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 41][#examples: 44800/46620][#steps: 29400]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141

-------------------------------

[Epoch: 44][#examples: 19200/46620][#steps: 31100]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 44][#examples: 22400/46620][#steps: 31150]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 44][#examples: 25600/46620][#steps: 31200]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 44][#examples: 28800/46620][#steps: 31250]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 44][#examples: 32000/46620][#steps: 31300]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 44][#examples: 35200/46620][#steps: 31350]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 44][#examples: 38400/46620][#steps: 31400]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 44][#examples: 41600/46620][#steps: 31450]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 44][#examples: 44800/46620][#steps: 31500]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predicti

[Epoch: 47][#examples: 16000/46620][#steps: 33150]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 47][#examples: 19200/46620][#steps: 33200]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 47][#examples: 22400/46620][#steps: 33250]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 47][#examples: 25600/46620][#steps: 33300]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 47][#examples: 28800/46620][#steps: 33350]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 47][#examples: 32000/46620][#steps: 33400]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 47][#examples: 35200/46620][#steps: 33450]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 47][#examples: 38400/46620][#steps: 33500]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4140
The number of correct pinyin pre

[Epoch: 50][#examples: 12800/46620][#steps: 35200]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 50][#examples: 16000/46620][#steps: 35250]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 50][#examples: 19200/46620][#steps: 35300]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 50][#examples: 22400/46620][#steps: 35350]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 50][#examples: 25600/46620][#steps: 35400]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 50][#examples: 28800/46620][#steps: 35450]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 50][#examples: 32000/46620][#steps: 35500]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141
The number of correct pinyin predictions: 4289
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

--------------------------

[Epoch: 53][#examples: 9600/46620][#steps: 37250]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 53][#examples: 12800/46620][#steps: 37300]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 53][#examples: 16000/46620][#steps: 37350]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 53][#examples: 19200/46620][#steps: 37400]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 53][#examples: 22400/46620][#steps: 37450]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 53][#examples: 25600/46620][#steps: 37500]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4140
The number of correct pinyin predictions: 4289
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

---------------------------------------
[Epoch: 53][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.766 | Val. Acc: 0.710 | Val. PPL: 

[Epoch: 56][#examples: 6400/46620][#steps: 39300]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 56][#examples: 9600/46620][#steps: 39350]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 56][#examples: 12800/46620][#steps: 39400]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 56][#examples: 16000/46620][#steps: 39450]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 56][#examples: 19200/46620][#steps: 39500]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141
The number of correct pinyin predictions: 4289
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

---------------------------------------
[Epoch: 56][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.766 | Val. Acc: 0.711 | Val. PPL:   2.150
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Va

[Epoch: 59][#examples: 3200/46620][#steps: 41350]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 59][#examples: 6400/46620][#steps: 41400]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 59][#examples: 9600/46620][#steps: 41450]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 59][#examples: 12800/46620][#steps: 41500]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141
The number of correct pinyin predictions: 4289
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

---------------------------------------
[Epoch: 59][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.766 | Val. Acc: 0.711 | Val. PPL:   2.150
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

[Epoch: 59][#examples: 16000/46620][#steps: 41550]
	T

[Epoch: 62][#examples: 3200/46620][#steps: 43450]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 62][#examples: 6400/46620][#steps: 43500]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4239
Val Acc on Pinyin: 0.727
The number of correct predictions: 4141
The number of correct pinyin predictions: 4289
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

---------------------------------------
[Epoch: 62][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.766 | Val. Acc: 0.711 | Val. PPL:   2.150
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

[Epoch: 62][#examples: 9600/46620][#steps: 43550]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 62][#examples: 12800/46620][#steps: 43600]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 62][#examples: 16000/46620][#steps: 43650]
	T

The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141

---------------------------------------
[Epoch: 64][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

Epoch: 65 | Time: 1m 50s
	Train Loss: 0.002 | Train PPL:   1.002
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
Early Stopping!
Current Teacher Forcing Ratio: 0.200
[Epoch: 65][#examples: 3200/46620][#steps: 45550]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 65][#examples: 6400/46620][#steps: 45600]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 65][#examples: 9600/46620][#steps: 45650]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 65][#examples: 12800/46620][#steps: 45700]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 

[Epoch: 67][#examples: 44800/46620][#steps: 47600]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141

---------------------------------------
[Epoch: 67][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

Epoch: 68 | Time: 1m 49s
	Train Loss: 0.002 | Train PPL:   1.002
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
Early Stopping!
Current Teacher Forcing Ratio: 0.200
[Epoch: 68][#examples: 3200/46620][#steps: 47650]
	Train Loss: 0.001 | Train PPL:   1.001 | lr: 9.838e-08
[Epoch: 68][#examples: 6400/46620][#steps: 47700]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 68][#examples: 9600/46620][#steps: 47750]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 

[Epoch: 70][#examples: 41600/46620][#steps: 49650]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 70][#examples: 44800/46620][#steps: 49700]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4142

---------------------------------------
[Epoch: 70][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

Epoch: 71 | Time: 1m 38s
	Train Loss: 0.002 | Train PPL:   1.002
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
Early Stopping!
Current Teacher Forcing Ratio: 0.200
[Epoch: 71][#examples: 3200/46620][#steps: 49750]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 71][#examples: 6400/46620][#steps: 49800]
	Train Loss: 0.002 | Train PPL:   1.002 | lr:

[Epoch: 73][#examples: 38400/46620][#steps: 51700]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 73][#examples: 41600/46620][#steps: 51750]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 73][#examples: 44800/46620][#steps: 51800]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4142

---------------------------------------
[Epoch: 73][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.765 | BEST. Val. Epoch: 6 | BEST. Val. Step: 4500
---------------------------------------

Epoch: 74 | Time: 1m 37s
	Train Loss: 0.002 | Train PPL:   1.002
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
Early Stopping!
Current Teacher Forcing Ratio: 0.200
[Epoch: 74][#examples: 3200/46620][#steps: 51850]
	Train Loss: 0.002 | Train PPL:   1.002 | lr

[Epoch: 76][#examples: 35200/46620][#steps: 53750]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 76][#examples: 38400/46620][#steps: 53800]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 76][#examples: 41600/46620][#steps: 53850]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 76][#examples: 44800/46620][#steps: 53900]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4143

---------------------------------------
[Epoch: 76][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.766 | BEST. Val. Epoch: 75 | BEST. Val. Step: 53000
---------------------------------------

Epoch: 77 | Time: 1m 38s
	Train Loss: 0.002 | Train PPL:   1.002
	 Val. Loss: 0.793 | Val. Acc: 0.711 | Val. PPL:   2.210
Early Stopping!
Current Teacher Forcin

[Epoch: 79][#examples: 32000/46620][#steps: 55800]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 79][#examples: 35200/46620][#steps: 55850]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 79][#examples: 38400/46620][#steps: 55900]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 79][#examples: 41600/46620][#steps: 55950]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 79][#examples: 44800/46620][#steps: 56000]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4142
The number of correct pinyin predictions: 4289
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

---------------------------------------
[Epoch: 79][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.766 | Val. Acc: 0.711 | Val. PPL:   2.151
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.766 | BEST. Val. Epoch: 75 | BEST.

[Epoch: 82][#examples: 28800/46620][#steps: 57850]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 82][#examples: 32000/46620][#steps: 57900]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 82][#examples: 35200/46620][#steps: 57950]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 82][#examples: 38400/46620][#steps: 58000]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4142
The number of correct pinyin predictions: 4289
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

---------------------------------------
[Epoch: 82][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.766 | Val. Acc: 0.711 | Val. PPL:   2.151
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.766 | BEST. Val. Epoch: 75 | BEST. Val. Step: 53000
---------------------------------------

[Epoch: 82][#examples: 41600/46620][#steps: 5805

[Epoch: 85][#examples: 25600/46620][#steps: 59900]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 85][#examples: 28800/46620][#steps: 59950]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 85][#examples: 32000/46620][#steps: 60000]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4142
The number of correct pinyin predictions: 4290
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

---------------------------------------
[Epoch: 85][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.766 | Val. Acc: 0.711 | Val. PPL:   2.151
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.766 | BEST. Val. Epoch: 75 | BEST. Val. Step: 53000
---------------------------------------

[Epoch: 85][#examples: 35200/46620][#steps: 60050]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 85][#examples: 38400/46620][#steps: 6010

[Epoch: 88][#examples: 22400/46620][#steps: 61950]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 88][#examples: 25600/46620][#steps: 62000]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141
The number of correct pinyin predictions: 4290
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

---------------------------------------
[Epoch: 88][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.766 | Val. Acc: 0.711 | Val. PPL:   2.151
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.766 | BEST. Val. Epoch: 75 | BEST. Val. Step: 53000
---------------------------------------

[Epoch: 88][#examples: 28800/46620][#steps: 62050]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 88][#examples: 32000/46620][#steps: 62100]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 88][#examples: 35200/46620][#steps: 6215

[Epoch: 91][#examples: 19200/46620][#steps: 64000]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141
The number of correct pinyin predictions: 4290
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

---------------------------------------
[Epoch: 91][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.766 | Val. Acc: 0.711 | Val. PPL:   2.151
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.766 | BEST. Val. Epoch: 75 | BEST. Val. Step: 53000
---------------------------------------

[Epoch: 91][#examples: 22400/46620][#steps: 64050]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 91][#examples: 25600/46620][#steps: 64100]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 91][#examples: 28800/46620][#steps: 64150]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 91][#examples: 32000/46620][#steps: 6420

The number of correct pinyin predictions: 4240
Val Acc on Pinyin: 0.728
The number of correct predictions: 4141
The number of correct pinyin predictions: 4290
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

---------------------------------------
[Epoch: 94][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.766 | Val. Acc: 0.711 | Val. PPL:   2.151
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.766 | BEST. Val. Epoch: 75 | BEST. Val. Step: 53000
---------------------------------------

[Epoch: 94][#examples: 16000/46620][#steps: 66050]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 94][#examples: 19200/46620][#steps: 66100]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 94][#examples: 22400/46620][#steps: 66150]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 94][#examples: 25600/46620][#steps: 66200]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 94][#examples: 28800/46620][#steps: 6625

The number of correct pinyin predictions: 4290
Val Acc on Pinyin: 0.736
The number of correct predictions: 4163

---------------------------------------
[Epoch: 97][Validatiing...]
	 patience: 0/20
	 Val. Loss: 0.766 | Val. Acc: 0.711 | Val. PPL:   2.151
	 BEST. Val. Loss: 0.765 | BEST. Val. Acc: 0.711 | Val. Loss: 0.766 | BEST. Val. Epoch: 75 | BEST. Val. Step: 53000
---------------------------------------

[Epoch: 97][#examples: 9600/46620][#steps: 68050]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 97][#examples: 12800/46620][#steps: 68100]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 97][#examples: 16000/46620][#steps: 68150]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 97][#examples: 19200/46620][#steps: 68200]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 97][#examples: 22400/46620][#steps: 68250]
	Train Loss: 0.002 | Train PPL:   1.002 | lr: 9.838e-08
[Epoch: 97][#examples: 25600/46620][#steps: 68300]
	Tr

Finally, we load the saved parameters from `acc-model.pt` and get our results on the validation and test sets.

We get the improved test perplexity whilst almost being twice as fast!

In [25]:
model.load_state_dict(torch.load('acc-model.pt'))

valid_loss, valid_acc = evaluate(model, valid_iterator, criterion, scheduler, is_test=False)
test_loss, test_acc = evaluate(model, test_iterator, criterion, scheduler, is_test=True)

# Note that the final translation accs might differ from below because of floating point error.
# But they should be the same in most of the cases.
print(f'| Valid Loss: {valid_loss:.3f} | Valid PPL: {math.exp(valid_loss):7.3f} | Valid ACC: {valid_acc:.3f}')
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} | Test ACC: {test_acc:.3f}')

The number of correct predictions: 4143
The number of correct predictions: 4163
| Valid Loss: 0.811 | Valid PPL:   2.251 | Valid ACC: 0.711
| Test Loss: 0.760 | Test PPL:   2.139 | Test ACC: 0.714
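
The perplexity printed above is the exponential of the loss, which is what `math.exp(test_loss)` computes. As a quick sanity check, here is a minimal sketch that recomputes it from the rounded test loss in the output above. It assumes the loss is an average cross-entropy measured in nats; the helper name `perplexity` is ours and not part of the notebook.

```python
import math

def perplexity(loss):
    # perplexity = exp(average cross-entropy loss, in nats)
    return math.exp(loss)

# 0.760 is the rounded test loss reported in the output above
test_ppl = perplexity(0.760)
print(f'Test PPL: {test_ppl:7.3f}')
```

Because the logged loss is rounded to three decimals, the recomputed value can differ from the logged perplexity in the last digit.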


## Inference

Now we can use our trained model to generate translations.

**Note:** these translations will be poor compared to the examples shown in the paper, which uses hidden dimension sizes of 1000 and trains for 4 days! The paper's examples were cherry-picked to show what attention should look like on a sufficiently sized model.

Our `translate_sentence` will do the following:
- ensure our model is in evaluation mode, which it should always be for inference
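- lower-case each token of the source sentence (the sentence is expected to be a sequence of tokens already; the code below does not tokenize a raw string)
- add the `<sos>` and `<eos>` tokens
- numericalize the source sentence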
- convert it to a tensor and add a batch dimension
- get the length of the source sentence and convert to a tensor
- feed the source sentence into the encoder
- create the mask for the source sentence
- create a list to hold the output sentence, initialized with an `<sos>` token
- create a tensor to hold the attention values
- while we have not hit a maximum length
  - get the input tensor, which should be either `<sos>` or the last predicted token
  - feed the input, all encoder outputs, hidden state and mask into the decoder
  - store attention values
  - get the predicted next token
  - add prediction to current output sentence prediction
  - break if the prediction was an `<eos>` token
- convert the output sentence from indexes to tokens
- return the output sentence (with the `<sos>` token removed) and the attention values over the sequence

In [None]:
def translate_sentence(sentence, src_field, trg_field, model, device, max_len = 50):

    model.eval()
        
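    # lower-case every token; `sentence` is expected to be a sequence of tokens
    tokens = [token.lower() for token in sentence]

    # add the <sos> and <eos> tokens
    tokens = [src_field.init_token] + tokens + [src_field.eos_token]
    
    # numericalize: map each token to its vocabulary index
    src_indexes = [src_field.vocab.stoi[token] for token in tokens]
    
    # shape [src len, 1]: unsqueeze(1) adds the batch dimension
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(1).to(device)

    # true (unpadded) length of the single source sentence
    src_len = torch.LongTensor([len(src_indexes)]).to(device)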
    
    with torch.no_grad():
        encoder_outputs, hidden = model.encoder(src_tensor, src_len)
        hidden = torch.tanh(model.fc(hidden))

    mask = model.create_mask(src_tensor)
        
    trg_indexes = [trg_field.vocab.stoi[trg_field.init_token]]

    attentions = torch.zeros(max_len, 1, len(src_indexes)).to(device)
    
    for i in range(max_len):

        trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(device)
                
        with torch.no_grad():
            output, hidden, attention = model.decoder(trg_tensor, hidden, encoder_outputs, mask)

        attentions[i] = attention
            
        pred_token = output.argmax(1).item()
        
        trg_indexes.append(pred_token)

        if pred_token == trg_field.vocab.stoi[trg_field.eos_token]:
            break
    
    trg_tokens = [trg_field.vocab.itos[i] for i in trg_indexes]
    
    return trg_tokens[1:], attentions[:len(trg_tokens)-1]

Next, we'll make a function that displays the model's attention over the source sentence for each target token generated.

In [None]:
def display_attention(sentence, translation, attention):
    
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111)
    
    attention = attention.squeeze(1).cpu().detach().numpy()
    
    cax = ax.matshow(attention, cmap='bone')
   
    ax.tick_params(labelsize=15)
    ax.set_xticklabels(['']+['<sos>']+[t.lower() for t in sentence]+['<eos>'], 
                       rotation=45)
    ax.set_yticklabels(['']+translation)

    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()
    plt.close()

Now, we'll grab some translations from our dataset and see how well our model did. Note, we're going to cherry pick examples here so it gives us something interesting to look at, but feel free to change the `example_idx` value to look at different examples.

First, we'll get a source and target from our dataset.

In [None]:
example_idx = 1001

src = vars(train_data.examples[example_idx])['src']
trg = vars(train_data.examples[example_idx])['trg']

print(f'src = {src}')
print(f'trg = {trg}')

Then we'll use our `translate_sentence` function to get our predicted translation and attention. We show this graphically by having the source sentence on the x-axis and the predicted translation on the y-axis. The lighter the square at the intersection between two words, the more attention the model gave to that source word when translating that target word.

Below is an example the model attempted to translate. It gets the translation correct, except that it changes *are fighting* to just *fighting*.

In [None]:
translation, attention = translate_sentence(src, SRC, TRG, model, device)

print(f'predicted trg = {translation}')

In [None]:
display_attention(src, translation, attention)
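
The same information can be read off numerically instead of from the heat map. The sketch below is illustrative only: the helper `most_attended`, the toy tokens and the attention weights are all made up, and it assumes the attention has been converted to a nested list with one row per generated target token and one column per source token.

```python
def most_attended(src_tokens, trg_tokens, attention):
    # attention[i][j] = weight on source token j while producing target token i
    pairs = []
    for trg_token, row in zip(trg_tokens, attention):
        # index of the largest weight in this row
        j = max(range(len(row)), key=lambda k: row[k])
        pairs.append((trg_token, src_tokens[j], row[j]))
    return pairs

# Made-up tokens and weights for illustration only; each row sums to 1.
src_tokens = ['<sos>', 'a', 'b', '<eos>']
trg_tokens = ['x', 'y', '<eos>']
attention = [
    [0.05, 0.80, 0.10, 0.05],
    [0.05, 0.15, 0.70, 0.10],
    [0.10, 0.05, 0.15, 0.70],
]

for trg_token, src_token, weight in most_attended(src_tokens, trg_tokens, attention):
    print(f'{trg_token!r} attends most to {src_token!r} ({weight:.2f})')
```

With the tensor returned by `translate_sentence`, `attention.squeeze(1).tolist()` should give nested lists of this form, and the source token list needs the `<sos>` and `<eos>` tokens added so that its length matches the number of columns.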

Translations from the training set could simply be memorized by the model, so it's only fair that we look at translations from the validation and test sets too.

Starting with the validation set, let's get an example.

---------------
### Evaluation on the validation set (Customised)

In [None]:
def evaluate_trans(data, mode='TRG'):
    # Translate every example in `data` and report exact-match accuracy.
    # mode='PINYIN' decodes with PINYIN_STR and compares against `pinyin_str`;
    # any other value decodes with TRG and compares against `trg`.
    if mode == 'PINYIN':
        trg_field, ref_key = PINYIN_STR, 'pinyin_str'
    else:
        trg_field, ref_key = TRG, 'trg'

    # collect rows in a list and build the DataFrame once, rather than growing it row by row
    rows = []
    for exp in data.examples:
        ex = vars(exp)
        # [0] is the predicted token list; [:-1] drops its last token (normally <eos>)
        pred = translate_sentence(ex['src'], SRC, trg_field, model, device)[0][:-1]
        rows.append(["".join(ex['src']), "".join(pred), "".join(ex[ref_key])])

    result = pd.DataFrame(rows, columns=["SRC", "PRED", "TRG"])
    # a prediction counts as correct only if it matches the reference string exactly
    acc = np.sum(result['PRED'] == result['TRG']) / len(data.examples)
    return result, acc

In [None]:
result_valid, acc_valid = evaluate_trans(valid_data)
result_test, acc_test = evaluate_trans(test_data)
acc_valid, acc_test

In [None]:
count = 0
for i, dp in result_test.iterrows():
    if dp['PRED'] in dev_json[dp['SRC']]:
        count += 1
acc_test_multi = count / len(test_data.examples)
acc_test_multi

In [None]:
num = 4
result_valid.to_excel('experiments/' + 'exp' + str(num) + '/valid_result' + '.xlsx', index=False)
result_test.to_excel('experiments/' + 'exp' + str(num) + '/test_result' + '.xlsx', index=False)

In [None]:
with open('experiments/' + 'exp' + str(num) + '/acc.txt', 'w+', encoding='utf-8') as f:
    f.write('Valid Accuracy: {}\n'.format(acc_valid))
    f.write('Test Accuracy: {}\n'.format(acc_test))
    # f.write('Test Accuracy (Multi): {}\n'.format(acc_test_multi))
    f.write('Test Loss: {}\n'.format(test_loss))

with open('experiments/' + 'exp' + str(num) + '/setting.txt', 'w+', encoding='utf-8') as f:
    f.write('Params #: {}\n{}'.format(count_parameters(model), show_params))

In [None]:
with open('experiments/' + 'exp' + str(num) + '/valid_pred.txt', 'w+', encoding='utf-8') as f:
    for p in result_valid['PRED']:
        f.write(' '.join(p) + '\n')
        
with open('experiments/' + 'exp' + str(num) + '/valid_ref.txt', 'w+', encoding='utf-8') as f:
    for p in result_valid['TRG']:
        f.write(' '.join(p) + '\n')
        
with open('experiments/' + 'exp' + str(num) + '/test_pred.txt', 'w+', encoding='utf-8') as f:
    for p in result_test['PRED']:
        f.write(' '.join(p) + '\n')
        
with open('experiments/' + 'exp' + str(num) + '/test_ref.txt', 'w+', encoding='utf-8') as f:
    for p in result_test['TRG']:
        f.write(' '.join(p) + '\n')

-------------------

In [None]:
example_idx = 14

src = vars(valid_data.examples[example_idx])['src']
trg = vars(valid_data.examples[example_idx])['trg']

print(f'src = {src}')
print(f'trg = {trg}')

Then let's generate our translation and view the attention.

Here, we can see the translation is the same except for swapping *female* with *woman*.

In [None]:
translation, attention = translate_sentence(src, SRC, TRG, model, device)

print(f'predicted trg = {translation}')

display_attention(src, translation, attention)

Finally, let's get an example from the test set.

In [None]:
example_idx = 18

src = vars(test_data.examples[example_idx])['src']
trg = vars(test_data.examples[example_idx])['trg']

print(f'src = {src}')
print(f'trg = {trg}')

Again, it produces a slightly different translation from the target, a more literal version of the source sentence. It swaps *mountain climbing* for *climbing on a mountain*.

In [None]:
translation, attention = translate_sentence(src, SRC, TRG, model, device)

print(f'predicted trg = {translation}')

display_attention(src, translation, attention)

## BLEU

Previously we have only cared about the loss/perplexity of the model. However, there are metrics that are specifically designed for measuring the quality of a translation, the most popular being *BLEU*. Without going into too much detail, BLEU looks at the overlap between the predicted and actual target sequences in terms of their n-grams. It gives us a number between 0 and 1, where 1 means there is perfect overlap, i.e. a perfect translation, although it is usually shown between 0 and 100. BLEU was designed to support multiple reference translations per source sequence; in this dataset, however, we only have one reference per source.
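
To make the n-gram overlap concrete, here is a minimal sketch of the clipped n-gram precision that BLEU is built on. It is not the full metric: BLEU additionally combines the precisions for several n-gram sizes with a geometric mean and applies a brevity penalty. The helper names and the toy sentences are ours, chosen for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    # all contiguous n-token windows of the sequence
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    # each candidate n-gram is credited at most as often as it occurs in the reference
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total > 0 else 0.0

# toy predicted and reference token sequences (single reference)
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']
reference = ['the', 'cat', 'is', 'on', 'the', 'mat']

p1 = clipped_precision(candidate, reference, 1)  # 5 of 6 unigrams match
p2 = clipped_precision(candidate, reference, 2)  # 3 of 5 bigrams match
print(f'unigram precision = {p1:.3f}, bigram precision = {p2:.3f}')
```

A single wrong word costs one unigram but two bigrams, which is why longer n-grams reward fluent, correctly ordered output.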

We define a `calculate_bleu` function which calculates the BLEU score over a provided TorchText dataset. This function creates a corpus of the actual and predicted translation for each source sentence and then calculates the BLEU score.

In [None]:
from torchtext.data.metrics import bleu_score

def calculate_bleu(data, src_field, trg_field, model, device, max_len = 50):
    
    trgs = []
    pred_trgs = []
    
    for datum in data:
        
        src = vars(datum)['src']
        trg = vars(datum)['trg']
        
        pred_trg, _ = translate_sentence(src, src_field, trg_field, model, device, max_len)
        
        #cut off <eos> token
        pred_trg = pred_trg[:-1]
        
        pred_trgs.append(pred_trg)
        trgs.append([trg])
        
    return bleu_score(pred_trgs, trgs)

We get a BLEU of around 29. The paper that the attention model is attempting to replicate reports a BLEU score of 26.75. This is similar to our score; however, they use a completely different dataset and a much larger model (1000 hidden dimensions, which takes 4 days to train!), so we cannot really compare against it.

This number isn't really interpretable on its own, so we can't say much about it. The most useful part of a BLEU score is that it can be used to compare different models on the same dataset, where the one with the **higher** BLEU score is "better".

In [None]:
# use a name other than `bleu_score` so the imported function is not overwritten
bleu = calculate_bleu(test_data, SRC, TRG, model, device)

print(f'BLEU score = {bleu*100:.2f}')

In the next tutorials we will be moving away from using recurrent neural networks and start looking at other ways to construct sequence-to-sequence models. Specifically, in the next tutorial we will be using convolutional neural networks.