Machine translation: a Seq2Seq model for English-to-German translation.



1.   Uses an attention mechanism to improve translation accuracy.
2.   Adds two tricks that apply to any RNN architecture: packed padded sequences and masking. Packed padded sequences let the encoder RNN skip padding tokens, and masking forces the attention to ignore them. Together they reduce training time.



In [1]:
import torch
import spacy
from torchtext.datasets import Multi30k
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

import random
import math


In [None]:
!python -m spacy download en
!python -m spacy download de

In [3]:
spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

In [4]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [5]:
def tokenize_en(text):
  return [tok.text for tok in spacy_en.tokenizer(text)]

def tokenize_de(text):
  return [tok.text for tok in spacy_de.tokenizer(text)]


In [6]:
from torchtext.data import Field


SRC = Field(tokenize = tokenize_en,
      # sequential = True,
      init_token = '<sos>',
      eos_token = '<eos>',
      lower = True,
      include_lengths =True
      )


TGT = Field(tokenize=tokenize_de,
      # sequential = True,
      init_token = '<sos>',
      eos_token = '<eos>',
      lower = True,
      include_lengths =True
      )

In [7]:
train, valid, test = Multi30k.splits(exts = ('.en','.de'), fields = (SRC,TGT))

In [8]:
SRC.build_vocab(train, min_freq=2)
TGT.build_vocab(train, min_freq=2)

In [9]:
len(train.examples[0].src)

11

In [10]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Augment data here if required

In [11]:
from torchtext.data import BucketIterator

BATCH_SIZE = 32
train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train, valid, test),
                                                               batch_size = BATCH_SIZE,
                                                               sort_key = lambda x: len(x.src),
                                                               sort_within_batch = True,
                                                               device=device)


In [30]:
list(iter(train_iterator))[123].src

(tensor([[   2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
             2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
             2,    2,    2,    2,    2,    2,    2,    2],
         [   4,  795,  480,    4,    4,    7,    4,    4,    4,   48,  682,    4,
            48,    4,    4,    4,    4,   16,    4,   16,    4,   21,   16,    4,
             4,    4,    7,    4,   16,    4,  196,    4],
         [  26,   73,    6,    9,   70,  134,    9,   34,   55,   63,    9,   14,
           905,   24,   70,   14,    9,  161,   14,   30,  132,  115,   30,   14,
             9,    9,   14,    0,   30,  870,   17,    0],
         [ 145,    6,    4,    6,   55,   10,   11,   10,   91,   17,    6,    6,
            22, 2252,   55,   91,  165, 1588,    6,  403,  491,    9, 1623,    6,
            10,   15,   10,   14,  249,  436,    6,   14],
         [  14,   26,   31,  217, 2281,   12,   27, 4377,    8,   36,    4,    4,
          1003,   33,    6

## Building the Seq2Seq Model

### Encoder

First, we'll build the encoder. Similar to the previous model, we only use a single-layer GRU, but we now use a *bidirectional RNN*. With a bidirectional RNN, we have two RNNs in each layer: a *forward RNN* going over the embedded sentence from left to right (shown below in green), and a *backward RNN* going over the embedded sentence from right to left (teal). All we need to do in code is set `bidirectional = True` and then pass the embedded sentence to the RNN as before.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq8.png?raw=1)

We now have:

$$\begin{align*}
h_t^\rightarrow &= \text{EncoderGRU}^\rightarrow(e(x_t^\rightarrow),h_{t-1}^\rightarrow)\\
h_t^\leftarrow &= \text{EncoderGRU}^\leftarrow(e(x_t^\leftarrow),h_{t-1}^\leftarrow)
\end{align*}$$

Where $x_0^\rightarrow = \text{<sos>}, x_1^\rightarrow = \text{guten}$ and $x_0^\leftarrow = \text{<eos>}, x_1^\leftarrow = \text{morgen}$.

As before, we only pass an input (`embedded`) to the RNN, which tells PyTorch to initialize both the forward and backward initial hidden states ($h_0^\rightarrow$ and $h_0^\leftarrow$, respectively) to a tensor of all zeros. We'll also get two context vectors, one from the forward RNN after it has seen the final word in the sentence, $z^\rightarrow=h_T^\rightarrow$, and one from the backward RNN after it has seen the first word in the sentence, $z^\leftarrow=h_T^\leftarrow$.

The RNN returns `outputs` and `hidden`. 

`outputs` is of size **[src len, batch size, hid dim * num directions]** where the first `hid_dim` elements in the third axis are the hidden states from the top layer forward RNN, and the last `hid_dim` elements are the hidden states from the top layer backward RNN. We can think of the third axis as the forward and backward hidden states concatenated together, i.e. $h_1 = [h_1^\rightarrow; h_{T}^\leftarrow]$, $h_2 = [h_2^\rightarrow; h_{T-1}^\leftarrow]$, and we can denote all encoder hidden states (forward and backward concatenated together) as $H=\{ h_1, h_2, ..., h_T\}$.

`hidden` is of size **[n layers * num directions, batch size, hid dim]**, where **[-2, :, :]** gives the top layer forward RNN hidden state after the final time-step (i.e. after it has seen the last word in the sentence) and **[-1, :, :]** gives the top layer backward RNN hidden state after the final time-step (i.e. after it has seen the first word in the sentence).

As the decoder is not bidirectional, it only needs a single context vector, $z$, to use as its initial hidden state, $s_0$, and we currently have two, a forward and a backward one ($z^\rightarrow=h_T^\rightarrow$ and $z^\leftarrow=h_T^\leftarrow$, respectively). We solve this by concatenating the two context vectors together, passing them through a linear layer, $g$, and applying the $\tanh$ activation function. 

$$z=\tanh(g(h_T^\rightarrow, h_T^\leftarrow)) = \tanh(g(z^\rightarrow, z^\leftarrow)) = s_0$$

**Note**: this is actually a deviation from the paper. Instead, they feed only the first backward RNN hidden state through a linear layer to get the context vector/decoder initial hidden state. This doesn't seem to make sense to me, so we have changed it.

As we want our model to look back over the whole of the source sentence we return `outputs`, the stacked forward and backward hidden states for every token in the source sentence. We also return `hidden`, which acts as our initial hidden state in the decoder.
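
The shapes described above can be checked on a toy bidirectional GRU. This is a minimal sketch on random data; the small dimensions and the names `toy_rnn`, `embedded` and `combined` are assumptions made for illustration and are not part of the notebook's model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
src_len, batch_size, emb_dim, hid_dim = 5, 3, 8, 16  # toy sizes (assumed)

toy_rnn = nn.GRU(emb_dim, hid_dim, bidirectional=True)
embedded = torch.randn(src_len, batch_size, emb_dim)

# No initial hidden state is passed, so PyTorch starts both directions at zero.
outputs, hidden = toy_rnn(embedded)

# outputs = [src len, batch size, hid dim * 2]
# hidden  = [n layers * 2, batch size, hid dim]
print(outputs.shape, hidden.shape)

# Forward direction: its final state is the first hid_dim features of the last time-step.
forward_matches = torch.allclose(outputs[-1, :, :hid_dim], hidden[-2])
# Backward direction: its final state is the last hid_dim features of the first time-step.
backward_matches = torch.allclose(outputs[0, :, hid_dim:], hidden[-1])

# Concatenating both final states gives the input to the linear layer g.
combined = torch.cat((hidden[-2], hidden[-1]), dim=1)  # [batch size, hid dim * 2]
```

Without padding, the backward state in `hidden[-1]` is the one computed after the backward RNN has read the first token, which is why it lines up with `outputs[0]` rather than `outputs[-1]`.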

In [13]:
class Encoder(nn.Module):
  def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
    super().__init__()
    self.embedding = nn.Embedding(input_dim, emb_dim)
    self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
    self.fc = nn.Linear(2*enc_hid_dim, dec_hid_dim)
    self.dropout = nn.Dropout(dropout)


  def forward(self, src, src_len):
    # src is the (tokens, lengths) tuple produced by a Field with include_lengths=True
    # src[0] = [seq_len, batch_size]
    # src_len = [batch_size], lengths on the CPU as pack_padded_sequence expects
    embedding = self.dropout(self.embedding(src[0]))
    # embedding = [seq_len, batch_size, emb_dim]
    packed_embedded = nn.utils.rnn.pack_padded_sequence(embedding, src_len)
    packed_outputs, hidden = self.rnn(packed_embedded)
    # packed_outputs is a packed sequence containing all hidden states
    # hidden is taken from the final non-padded element of each sequence in the batch
    outputs,_ = nn.utils.rnn.pad_packed_sequence(packed_outputs)
    # outputs is now a non-packed sequence; hidden states at pad positions are all zeros
    # outputs = [seq_len, batch_size, directions * hid_dim]
    # hidden = [directions * layers, batch_size, hid_dim]
    # Combine the final forward and backward hidden states into one context vector
    hidden = torch.tanh(self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1)))
    # hidden = [batch_size, dec_hid_dim]
    return outputs, hidden
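
To see what packing changes, here is a minimal sketch on a toy GRU with one full-length and one padded sequence. The sizes and the names `toy_rnn`, `x` and `lengths` are illustrative assumptions; the lengths are sorted in descending order, which is what `sort_within_batch = True` provides in this notebook.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
toy_rnn = nn.GRU(input_size=4, hidden_size=6)

x = torch.randn(5, 2, 4)        # [seq_len, batch_size, emb_dim]
lengths = torch.tensor([5, 3])  # true lengths, descending, kept on the CPU

packed = pack_padded_sequence(x, lengths)
packed_out, hidden = toy_rnn(packed)
out, out_lengths = pad_packed_sequence(packed_out)

# Positions after the true length of the second sequence come back as zeros.
pad_positions_are_zero = bool(torch.all(out[3:, 1] == 0))

# The returned hidden state of the short sequence is its state at the last
# real token (index 2), not at the last padded position.
hidden_is_last_real = torch.allclose(hidden[0, 1], out[2, 1])
```

Without packing, the RNN would keep updating its state over the pad tokens, so the final hidden state of short sequences would depend on padding.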


In [14]:
# test Encoder

# enc = Encoder(emb_dim=256,
#              input_dim = 32,
#              enc_hid_dim = 512,
#              dec_him_dim= 512,
#              dropout=0.5)
# o1,o2 = enc.forward(torch.zeros(12,32).to(torch.int64))

In [15]:
# s_hidden = o2
# print(s_hidden.unsqueeze(1).shape,o1.shape)
# print(s_hidden.unsqueeze(1).repeat(1,12,1).shape)
# print(o1.shape,o1.permute(1,0,2).shape)

# torch.cat((o1.permute(1,0,2),s_hidden.unsqueeze(1).repeat(1,12,1)),dim=2).shape

### Attention

Next up is the attention layer. This will take in the previous hidden state of the decoder, $s_{t-1}$, and all of the stacked forward and backward hidden states from the encoder, $H$. The layer will output an attention vector, $a_t$, whose length is the length of the source sentence. Each element is between 0 and 1, and the entire vector sums to 1.

Intuitively, this layer takes what we have decoded so far, $s_{t-1}$, and all of what we have encoded, $H$, to produce a vector, $a_t$, that represents which words in the source sentence we should pay the most attention to in order to correctly predict the next word to decode, $\hat{y}_{t+1}$. 

First, we calculate the *energy* between the previous decoder hidden state and the encoder hidden states. As our encoder hidden states are a sequence of $T$ tensors, and our previous decoder hidden state is a single tensor, the first thing we do is `repeat` the previous decoder hidden state $T$ times. We then calculate the energy, $E_t$, between them by concatenating them together and passing them through a linear layer (`attn`) and a $\tanh$ activation function. 

$$E_t = \tanh(\text{attn}(s_{t-1}, H))$$ 

This can be thought of as calculating how well each encoder hidden state "matches" the previous decoder hidden state.

We currently have a **[dec hid dim, src len]** tensor for each example in the batch. We want this to be **[src len]** for each example in the batch as the attention should be over the length of the source sentence. This is achieved by multiplying the `energy` by a **[1, dec hid dim]** tensor, $v$.

$$\hat{a}_t = v E_t$$

We can think of $v$ as the weights for a weighted sum of the energy across all encoder hidden states. These weights tell us how much we should attend to each token in the source sequence. The parameters of $v$ are initialized randomly, but learned with the rest of the model via backpropagation. Note how $v$ is not dependent on time, and the same $v$ is used for each time-step of the decoding. We implement $v$ as a linear layer without a bias.

Finally, we ensure the attention vector fits the constraints of having all elements between 0 and 1 and the vector summing to 1 by passing it through a $\text{softmax}$ layer.

$$a_t = \text{softmax}(\hat{a}_t)$$

This gives us the attention over the source sentence!

Graphically, this looks something like below. This is for calculating the very first attention vector, where $s_{t-1} = s_0 = z$. The green/teal blocks represent the hidden states from both the forward and backward RNNs, and the attention computation is all done within the pink block.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq9.png?raw=1)
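
The three equations above can be traced step by step on random tensors. This is a sketch under stated assumptions: `W` and `v` are random stand-ins for the weights of the `attn` and `v` layers (biases omitted), and the toy sizes are chosen only to keep the shapes easy to read.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, src_len, enc_hid_dim, dec_hid_dim = 2, 4, 3, 5  # toy sizes (assumed)

s_prev = torch.randn(batch_size, dec_hid_dim)           # previous decoder state s_{t-1}
H = torch.randn(batch_size, src_len, 2 * enc_hid_dim)   # encoder hidden states

W = torch.randn(2 * enc_hid_dim + dec_hid_dim, dec_hid_dim)  # stand-in for attn
v = torch.randn(dec_hid_dim)                                 # stand-in for v

# Repeat s_{t-1} once per source position so it can be concatenated with H.
s_rep = s_prev.unsqueeze(1).repeat(1, src_len, 1)        # [batch, src len, dec hid dim]

# E_t = tanh(attn(s_{t-1}, H))
energy = torch.tanh(torch.cat((s_rep, H), dim=2) @ W)    # [batch, src len, dec hid dim]

# \hat{a}_t = v E_t collapses the hidden dimension to one score per source token.
scores = energy @ v                                      # [batch, src len]

# a_t = softmax(\hat{a}_t)
a = F.softmax(scores, dim=1)
row_sums = a.sum(dim=1)
```

Each row of `a` is a probability distribution over the source positions of one example.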

In [16]:
class Attention(nn.Module):
  def __init__(self, enc_hid_dim, dec_hid_dim):
    super().__init__()
    # the energy has dec_hid_dim features, matching the [1, dec hid dim] tensor v in the text
    self.attn = nn.Linear((enc_hid_dim*2) + dec_hid_dim, dec_hid_dim)
    self.v = nn.Linear(dec_hid_dim, 1, bias = False)

  def forward(self, hidden, encoder_outputs, mask):
    src_len = encoder_outputs.shape[0]
    # hidden = [batch_size, dec_hid_dim]
    hidden = hidden.unsqueeze(1).repeat(1, src_len, 1) # hidden = [batch_size, src_len, dec_hid_dim]
    # encoder_outputs = [src_len, batch_size, enc_hid_dim*2]
    encoder_outputs = encoder_outputs.permute(1,0,2)
    # encoder_outputs = [batch_size, src_len, enc_hid_dim*2]
    energy = torch.tanh(self.attn(torch.cat((hidden,encoder_outputs), dim = 2)))
    # energy = [batch_size, src_len, dec_hid_dim]
    attention = self.v(energy).squeeze(2)
    # attention = [batch_size, src_len]
    # mask = [batch_size, src_len], False at pad positions
    # a large negative score makes the softmax weight of pad tokens effectively zero
    attention = attention.masked_fill(mask == 0, -1e10)
    return F.softmax(attention, dim = 1)
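
The effect of the mask is easiest to see on a toy batch. In this sketch the padding index `pad_idx = 1` and the token ids are assumptions chosen for illustration, and all raw scores are set to zero so that the result depends only on the mask.

```python
import torch
import torch.nn.functional as F

pad_idx = 1  # assumed padding index for this toy example

# [src len = 5, batch size = 2]; the first sentence ends with two pad tokens.
src = torch.tensor([[2, 2],
                    [5, 7],
                    [3, 3],
                    [1, 9],
                    [1, 3]])

# Same construction as create_mask: True for real tokens, False for padding.
mask = (src != pad_idx).permute(1, 0)          # [batch size, src len]

scores = torch.zeros(2, 5)                     # identical raw scores
masked = scores.masked_fill(mask == 0, -1e10)  # pads get a very negative score
attention = F.softmax(masked, dim=1)
print(attention)
```

The first row spreads its weight evenly over the three real tokens and assigns none to the pads, while the second row, which has no padding, is uniform over all five positions.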




### Decoder

Next up is the decoder. 

The decoder contains the attention layer, `attention`, which takes the previous hidden state, $s_{t-1}$, all of the encoder hidden states, $H$, and returns the attention vector, $a_t$.

We then use this attention vector to create a weighted source vector, $w_t$, denoted by `weighted`, which is a weighted sum of the encoder hidden states, $H$, using $a_t$ as the weights.

$$w_t = a_t H$$
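
As a small sketch of this weighted sum, the batched matrix product used in the decoder can be compared against an explicit sum over source positions. The tensors here are random and the sizes are assumed for illustration.

```python
import torch

torch.manual_seed(0)
batch_size, src_len, feat = 2, 4, 6  # feat plays the role of enc hid dim * 2

a = torch.softmax(torch.randn(batch_size, src_len), dim=1)  # attention weights a_t
H = torch.randn(batch_size, src_len, feat)                  # encoder hidden states

# Batched form: [batch, 1, src len] x [batch, src len, feat] -> [batch, 1, feat]
weighted = torch.bmm(a.unsqueeze(1), H)

# Explicit form: scale each hidden state by its weight, then sum over positions.
manual = (a.unsqueeze(2) * H).sum(dim=1)                    # [batch, feat]

same = torch.allclose(weighted.squeeze(1), manual, atol=1e-6)
```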

The embedded input word, $d(y_t)$, the weighted source vector, $w_t$, and the previous decoder hidden state, $s_{t-1}$, are then all passed into the decoder RNN, with $d(y_t)$ and $w_t$ being concatenated together.

$$s_t = \text{DecoderGRU}(d(y_t), w_t, s_{t-1})$$

We then pass $d(y_t)$, $w_t$ and $s_t$ through the linear layer, $f$, to make a prediction of the next word in the target sentence, $\hat{y}_{t+1}$. This is done by concatenating them all together.

$$\hat{y}_{t+1} = f(d(y_t), w_t, s_t)$$

The image below shows decoding the first word in an example translation.

![](https://github.com/bentrevett/pytorch-seq2seq/blob/master/assets/seq2seq10.png?raw=1)

The green/teal blocks show the forward/backward encoder RNNs which output $H$, the red block shows the context vector, $z = h_T = \tanh(g(h^\rightarrow_T,h^\leftarrow_T)) = \tanh(g(z^\rightarrow, z^\leftarrow)) = s_0$, the blue block shows the decoder RNN which outputs $s_t$, the purple block shows the linear layer, $f$, which outputs $\hat{y}_{t+1}$ and the orange block shows the calculation of the weighted sum over $H$ by $a_t$ and outputs $w_t$. Not shown is the calculation of $a_t$.

In [17]:
class Decoder(nn.Module):
  def __init__(self,output_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout, attention):
    super().__init__()
    self.embedding = nn.Embedding(output_dim, emb_dim) 
    self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
    self.fc_out = nn.Linear((enc_hid_dim*2) + dec_hid_dim + emb_dim, output_dim)
    self.dropout = nn.Dropout(dropout)
    self.attention = attention
    self.emb_dim = emb_dim
    self.output_dim = output_dim
    self.enc_hid_dim = enc_hid_dim
    self.dec_hid_dim = dec_hid_dim


  def forward(self, input, hidden, encoder_outputs, mask):
    # input = [batch_size]
    input = input.unsqueeze(0) # input = [1, batch_size]
    embedding = self.dropout(self.embedding(input))
    # embedding = [1, batch_size, emb_dim]
    a = self.attention(hidden, encoder_outputs, mask)
    a = a.unsqueeze(1)
    # a=[batch_size, 1, src_len]
    encoder_outputs = encoder_outputs.permute(1,0,2)
    # encoder_outputs =[batch_size, src_len, enc_hid_dim * 2]
    weighted = torch.bmm(a, encoder_outputs)
    # weighted = [batch_size, 1, enc_hid_dim * 2]
    weighted = weighted.permute(1, 0, 2)
    # weighted = [1, batch_size, enc_hid_dim * 2]
    outputs, hidden = self.rnn(torch.cat((embedding, weighted), dim =2), hidden.unsqueeze(0))

    #outputs = [seq len, batch size, dec hid dim * n directions]
    #hidden = [n layers * n directions, batch size, dec hid dim]
    
    #seq len, n layers and n directions will always be 1 in this decoder, therefore:
    #outputs = [1, batch size, dec hid dim]
    #hidden = [1, batch size, dec hid dim]
    embedding = embedding.squeeze(0)
    outputs = outputs.squeeze(0)
    weighted = weighted.squeeze(0)
    prediction = self.fc_out(torch.cat((outputs, weighted, embedding), dim = 1))
    
    return prediction, hidden.squeeze(0)


In [18]:
# o1.shape, o2.shape,fc.shape

### Seq2Seq

This is the first model where the encoder RNN and decoder RNN do not need to have the same hidden dimensions. The encoder does have to be bidirectional, but this requirement can be removed by changing all occurrences of `enc_hid_dim * 2` to `enc_hid_dim * 2 if encoder_is_bidirectional else enc_hid_dim`.

This seq2seq encapsulator is similar to the last two. The only difference is that the `encoder` returns both the final hidden state (which is the final hidden state from both the forward and backward encoder RNNs passed through a linear layer) to be used as the initial hidden state for the decoder, as well as every hidden state (which are the forward and backward hidden states stacked on top of each other). We also need to ensure that `hidden` and `encoder_outputs` are passed to the decoder. 

Briefly going over all of the steps:
- the `outputs` tensor is created to hold all predictions, $\hat{Y}$
- the source sequence, $X$, is fed into the encoder to receive $z$ and $H$
- the initial decoder hidden state is set to be the `context` vector, $s_0 = z = h_T$
- we use a batch of `<sos>` tokens as the first `input`, $y_1$
- we then decode within a loop:
  - inserting the input token $y_t$, previous hidden state, $s_{t-1}$, and all encoder outputs, $H$, into the decoder
  - receiving a prediction, $\hat{y}_{t+1}$, and a new hidden state, $s_t$
  - we then decide if we are going to teacher force or not, setting the next input as appropriate
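
The teacher forcing decision in the last step can be sketched in isolation. The helper `choose_next_input` and the example tokens are hypothetical and use plain strings instead of tensors; the comparison against `teacher_forcing_ratio` is the same one the model's loop performs.

```python
import random


def choose_next_input(gold_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Return the gold token with probability teacher_forcing_ratio, else the prediction."""
    teacher_force = rng.random() < teacher_forcing_ratio
    return gold_token if teacher_force else predicted_token


rng = random.Random(0)
gold = ["ein", "hund", "<eos>"]    # ground-truth next tokens (toy example)
pred = ["eine", "katze", "<eos>"]  # model predictions (toy example)

# A ratio of 1.0 always feeds the ground truth, a ratio of 0.0 never does.
always_gold = [choose_next_input(g, p, 1.0, rng) for g, p in zip(gold, pred)]
never_gold = [choose_next_input(g, p, 0.0, rng) for g, p in zip(gold, pred)]
```

A ratio of 0 corresponds to evaluation, where the decoder must rely on its own predictions.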

In [19]:
class Seq2Seq(nn.Module):
  def __init__(self,encoder,decoder, device,src_pad_idx):
    super().__init__()
    self.encoder = encoder
    self.decoder = decoder
    self.device = device
    self.src_pad_idx = src_pad_idx
    
  def create_mask(self, src):
    # print('in mask src shape',src.shape)
    mask = (src != self.src_pad_idx).permute(1, 0)
    return mask

  def forward(self, src, tgt, teacher_forcing_ratio= 0.5):
    # src and tgt are (tokens, lengths) tuples because both fields use include_lengths=True
    # src[0] = [src_len, batch_size], src[1] = [batch_size]
    # tgt[0] = [tgt_len, batch_size]
    batch_size = src[0].shape[1]
    trg_len = tgt[0].shape[0]
    trg_vocab_size = self.decoder.output_dim
    #tensor to store decoder outputs
    outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
    #encoder_outputs holds every encoder hidden state (forward and backward)
    #hidden is the context: the final forward and backward states passed through a linear layer
    #the lengths are moved to the CPU because pack_padded_sequence expects them there
    encoder_outputs, hidden = self.encoder(src, src[1].cpu())

    #first input to the decoder is the <sos> tokens
    input = tgt[0][0,:]
    #the mask depends only on src, so it is built once: mask = [batch_size, src_len]
    mask = self.create_mask(src[0])
    for t in range(1, trg_len):
        #insert input token, previous hidden state, all encoder outputs and the mask
        #receive output tensor (predictions) and new hidden state
        output, hidden = self.decoder(input, hidden, encoder_outputs, mask)
        
        #place predictions in a tensor holding predictions for each token
        outputs[t] = output
        
        #decide if we are going to use teacher forcing or not
        teacher_force = random.random() < teacher_forcing_ratio
        
        #get the highest predicted token from our predictions
        top1 = output.argmax(1) 
        
        #if teacher forcing, use actual next token as next input
        #if not, use predicted token
        input = tgt[0][t] if teacher_force else top1
    return outputs

In [20]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TGT.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
ENC_HID_DIM = 512
DEC_HID_DIM = 512
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5
src_pad_idx = SRC.vocab.stoi[SRC.pad_token]

attn = Attention(ENC_HID_DIM,DEC_HID_DIM)
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)


model = Seq2Seq(enc, dec, device, src_pad_idx).to(device)

In [21]:
def init_weights(m):
    for name, param in m.named_parameters():
      if 'weight' in name:
        nn.init.normal_(param.data, mean=0, std=0.01)
      else:
        nn.init.constant_(param.data, 0)
        
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(5893, 256)
    (rnn): GRU(256, 512, bidirectional=True)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(7855, 256)
    (rnn): GRU(1280, 512)
    (fc_out): Linear(in_features=1792, out_features=7855, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
    (attention): Attention(
      (attn): Linear(in_features=1536, out_features=512, bias=True)
      (v): Linear(in_features=512, out_features=1, bias=False)
    )
  )
)

In [22]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 24,036,783 trainable parameters
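
The reported total can be reproduced by hand from the layer sizes in the model printout above. This is a sketch in plain Python; `linear_params` and `gru_params` are hypothetical helpers, and the GRU formula assumes PyTorch's layout of three gates, each with input weights, hidden weights and two bias vectors.

```python
def linear_params(n_in, n_out, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def gru_params(n_in, n_hid, bidirectional=False):
    """Single-layer GRU: 3 gates, each with input weights, hidden weights and two biases."""
    per_direction = 3 * (n_in * n_hid + n_hid * n_hid + 2 * n_hid)
    return per_direction * (2 if bidirectional else 1)


src_vocab, tgt_vocab = 5893, 7855      # vocabulary sizes from the printout
emb, enc_hid, dec_hid = 256, 512, 512  # embedding and hidden sizes

total = (
    src_vocab * emb                                         # encoder embedding
    + gru_params(emb, enc_hid, bidirectional=True)          # encoder GRU
    + linear_params(2 * enc_hid, dec_hid)                   # encoder fc
    + tgt_vocab * emb                                       # decoder embedding
    + gru_params(2 * enc_hid + emb, dec_hid)                # decoder GRU (input 1280)
    + linear_params(2 * enc_hid + dec_hid + emb, tgt_vocab) # fc_out (input 1792)
    + linear_params(2 * enc_hid + dec_hid, dec_hid)         # attention attn
    + linear_params(dec_hid, 1, bias=False)                 # attention v
)
print(f"{total:,}")  # 24,036,783
```

More than half of the parameters sit in `fc_out`, because its output size is the full target vocabulary.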


In [23]:
from torch.optim import Adam

optimizer = Adam(model.parameters())

In [24]:
TRG_PAD_IDX = TGT.vocab.stoi[TGT.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)
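
A minimal sketch of what `ignore_index` does: targets equal to the padding index are left out of the average. The padding index `pad_idx = 1` and the random logits are assumptions for this toy example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pad_idx = 1  # assumed padding index

logits = torch.randn(4, 5)                        # 4 target positions, 5 classes
targets = torch.tensor([2, 3, pad_idx, pad_idx])  # last two positions are padding

# Loss over all four positions, skipping the padded ones.
loss_ignoring_pads = nn.CrossEntropyLoss(ignore_index=pad_idx)(logits, targets)

# Loss over only the two real positions, with no ignore_index.
loss_real_only = nn.CrossEntropyLoss()(logits[:2], targets[:2])

same = torch.allclose(loss_ignoring_pads, loss_real_only)
```

This is why the reported loss is an average over real target tokens only, regardless of how much padding a batch contains.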

In [25]:
def train(model, iterator, optimizer, criterion, clip):
  model.train()
  epoch_loss = 0
  for i, batch in enumerate(iterator):
    # if i ==0:
    #   print(batch.src[0].shape,batch.src[1].shape)
    src, trg = batch.src,batch.trg
    optimizer.zero_grad()
    output = model(src, trg)
    #trg is a (tokens, lengths) tuple; trg[0] = [trg len, batch size]
    #output = [trg len, batch size, output dim]
    output_dim = output.shape[-1]
    output = output[1:].view(-1, output_dim)
    trg = trg[0][1:].view(-1)
    #trg = [(trg len - 1) * batch size]
    #output = [(trg len - 1) * batch size, output dim]
    loss = criterion(output, trg)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
    optimizer.step()
    epoch_loss += loss.item()
    # break
  return epoch_loss/ len(iterator)
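
The call to `clip_grad_norm_` in the training loop rescales the gradients when their overall norm exceeds `clip`. A minimal sketch on a single hand-made gradient, where the parameter `w` and its gradient values are assumptions chosen so that the norm is easy to compute:

```python
import torch

w = torch.nn.Parameter(torch.zeros(3))
w.grad = torch.tensor([3.0, 4.0, 0.0])  # gradient with L2 norm 5

# Returns the total norm measured before clipping and rescales w.grad in place.
norm_before = torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)
norm_after = w.grad.norm()

print(norm_before.item(), norm_after.item())
```

The direction of the gradient is preserved; only its length is reduced to approximately `max_norm`.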


In [26]:
def evaluate(model, iterator, criterion):
  model.eval()
  epoch_loss = 0
  with torch.no_grad():
    for i, batch in enumerate(iterator):
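      src, trg = batch.src,batch.trg
      #no optimizer call is needed here: evaluation computes no gradients
      #teacher forcing is turned off by passing a ratio of 0
      output = model(src, trg, 0)
      #trg is a (tokens, lengths) tuple; trg[0] = [trg len, batch size]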
      #output = [trg len, batch size, output dim]
      output_dim = output.shape[-1]
      output = output[1:].view(-1, output_dim)
      trg = trg[0][1:].view(-1)
      #trg = [(trg len - 1) * batch size]
      #output = [(trg len - 1) * batch size, output dim]
      loss = criterion(output, trg)
      epoch_loss += loss.item()
  return epoch_loss/ len(iterator)


In [27]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [28]:
import time

In [29]:
N_EPOCHS= 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
  start_time = time.time()
  train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
  valid_loss = evaluate(model, valid_iterator, criterion)
  end_time = time.time()
  epoch_mins, epoch_secs = epoch_time(start_time, end_time)
  
  if valid_loss < best_valid_loss:
      best_valid_loss = valid_loss
      torch.save(model.state_dict(), 'eng2german-model.pt')
  
  print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
  print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
  print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')


Epoch: 01 | Time: 1m 28s
	Train Loss: 4.320 | Train PPL:  75.196
	 Val. Loss: 3.636 |  Val. PPL:  37.935
Epoch: 02 | Time: 1m 29s
	Train Loss: 2.841 | Train PPL:  17.141
	 Val. Loss: 3.165 |  Val. PPL:  23.692
Epoch: 03 | Time: 1m 29s
	Train Loss: 2.237 | Train PPL:   9.366
	 Val. Loss: 2.998 |  Val. PPL:  20.044
Epoch: 04 | Time: 1m 29s
	Train Loss: 1.834 | Train PPL:   6.259
	 Val. Loss: 3.047 |  Val. PPL:  21.062
Epoch: 05 | Time: 1m 29s
	Train Loss: 1.549 | Train PPL:   4.707
	 Val. Loss: 3.160 |  Val. PPL:  23.574
Epoch: 06 | Time: 1m 29s
	Train Loss: 1.394 | Train PPL:   4.032
	 Val. Loss: 3.179 |  Val. PPL:  24.021
Epoch: 07 | Time: 1m 29s
	Train Loss: 1.264 | Train PPL:   3.538
	 Val. Loss: 3.287 |  Val. PPL:  26.749
Epoch: 08 | Time: 1m 29s
	Train Loss: 1.139 | Train PPL:   3.123
	 Val. Loss: 3.426 |  Val. PPL:  30.764
Epoch: 09 | Time: 1m 29s
	Train Loss: 1.051 | Train PPL:   2.859
	 Val. Loss: 3.548 |  Val. PPL:  34.742
Epoch: 10 | Time: 1m 29s
	Train Loss: 0.968 | Train PPL

In [31]:
model.load_state_dict(torch.load('eng2german-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 2.952 | Test PPL:  19.153 |
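
The perplexity reported next to each loss is the exponential of the average cross-entropy. As a small sketch, recomputing it from the rounded test loss printed above:

```python
import math

test_loss = 2.952  # rounded value taken from the output above
test_ppl = math.exp(test_loss)
print(f"{test_ppl:.3f}")
```

The result is close to, but not exactly, the 19.153 printed above, because the notebook exponentiates the unrounded loss while this sketch starts from the value rounded to three decimals.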
