# <span style="color:#0b486b">  FIT3181/5215: Deep Learning (2025)</span>
***
*CE/Lecturer (Clayton):*  **Dr Trung Le** | trunglm@monash.edu <br/>
*Lecturer (Clayton):* **A/Prof Zongyuan Ge** | zongyuan.ge@monash.edu <br/>
*Lecturer (Malaysia):*  **Dr Arghya Pal** | arghya.pal@monash.edu <br/>
 <br/>
*Head Tutor 3181:*  **Ms Ruda Nie H** |  \[RudaNie.H@monash.edu \] <br/>
*Head Tutor 5215:*  **Ms Leila Mahmoodi** |  \[leila.mahmoodi@monash.edu \]

<br/> <br/>
Faculty of Information Technology, Monash University, Australia
***

# <span style="color:#0b486b">Tutorial 09b: Seq2seq for Machine Translation</span><span style="color:red">****</span>

**This tutorial shows you a well-known application of seq2seq models: machine translation. We build a seq2seq model that translates English to French, so the source sentences are English and the target sentences are French. More specifically, we explore:**
- Seq2seq machine translation without an attention mechanism.
- Seq2seq machine translation with an attention mechanism.

**References and additional reading and resources**
- A tutorial on an image captioning application [link](https://www.tensorflow.org/tutorials/text/image_captioning).
- A blog post that explains the BLEU score and how to compute it [link](https://machinelearningmastery.com/calculate-bleu-score-for-text-python/).
- A reference for exploring BERT, a SOTA deep learning model for sequential data with a self-attention mechanism [link](http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/).

*Acknowledgment: this tutorial was developed based on the Chapter 8 materials from the book `Deep Learning with TensorFlow 2 and Keras (TF 2.x edition)`.*

### <span style="color:#0b486b"> 0. Set up</span> <span style="color:red"></span>
You need to download relevant files to run this notebook on Google Colab.

In [None]:
!gdown https://drive.google.com/uc?id=1J6ldngMZ-84Et5IPHG8S-mYE2BBJMz48

Downloading...
From (original): https://drive.google.com/uc?id=1J6ldngMZ-84Et5IPHG8S-mYE2BBJMz48
From (redirected): https://drive.google.com/uc?id=1J6ldngMZ-84Et5IPHG8S-mYE2BBJMz48&confirm=t&uuid=fbfb5507-125e-4051-b6ad-d44cf27ff734
To: /content/Tut11_data.zip
100% 223M/223M [00:02<00:00, 90.6MB/s]


In [None]:
!unzip -q Tut11_data.zip

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as data
import math
import copy
import nltk
import numpy as np
import re
import shutil
import os
import unicodedata
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

It is time to set the random seeds for PyTorch and NumPy.

In [None]:
torch.manual_seed(6789)
np.random.seed(6789)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## <span style="color:#0b486b">I. Introduction of Dataset</span> ##

We use the French-English bilingual dataset from the Tatoeba Project (1997-2019). The dataset contains approximately $167,000$ sentence pairs. To make our training go faster, we will only consider the first $1,000$ sentence pairs for our training.

## <span style="color:#0b486b">II. Neural Machine Translation Without Attention Mechanism</span> <span style="color:red">****</span> ##

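The following function preprocesses an input sentence: it strips accents, separates the punctuation marks `!`, `.`, and `?` from the words, replaces every other non-letter character with a space, and lowercases the result. We use Unicode normalization and regular expressions for this purpose.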

In [None]:
def preprocess_sentence(sent):
    sent = "".join([c for c in unicodedata.normalize("NFD", sent) if unicodedata.category(c) != "Mn"])
    sent = re.sub(r"([!.?])", r" \1", sent)
    sent = re.sub(r"[^a-zA-Z!.?]+", r" ", sent)
    sent = re.sub(r"\s+", " ", sent)
    sent = sent.lower()
    return sent
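
To see what each step does, the sketch below repeats `preprocess_sentence` so that it runs on its own and applies it to two sample strings. The strings are our own illustrations and are not taken from the dataset.

```python
import re
import unicodedata

def preprocess_sentence(sent):
    # same steps as the function above, repeated so this snippet is self-contained
    sent = "".join(c for c in unicodedata.normalize("NFD", sent)
                   if unicodedata.category(c) != "Mn")   # drop accent marks
    sent = re.sub(r"([!.?])", r" \1", sent)               # put a space before ! . ?
    sent = re.sub(r"[^a-zA-Z!.?]+", " ", sent)            # any other character -> space
    sent = re.sub(r"\s+", " ", sent)                      # collapse repeated whitespace
    return sent.lower()

print(preprocess_sentence("Je suis prêt!"))   # je suis pret !
print(preprocess_sentence("I'm fast."))       # i m fast .
```

The apostrophe in `I'm` becomes a space, which is why tokens such as `m` and `s` appear as separate words in the English vocabulary.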

The following function reads sentences from a file on disk. It returns three lists of sentences:
- `en_sents` contains the list of English input sentences.
- `fr_sents_in` contains the list of French sentences starting with the special symbol **BOS**, meaning `Beginning Of Sentence`. To form a sentence in `fr_sents_in`, we take a French sentence and insert BOS at the beginning.
- `fr_sents_out` contains the list of French sentences ending with the special symbol **EOS**, meaning `End Of Sentence`. To form a sentence in `fr_sents_out`, we take a French sentence and append EOS at the end.

In [None]:
def read_data(num_sent_pairs=20000):
    en_sents, fr_sents_in, fr_sents_out = [], [], []
    local_file = os.path.join("datasets", "fra.txt")
    # read as UTF-8 so accented French characters are decoded correctly
    with open(local_file, "r", encoding="utf-8") as fin:
        for i, line in enumerate(fin):
            en_sent, fr_sent, _ = line.strip().split('\t')
            en_sent = [w for w in preprocess_sentence(en_sent).split()]
            fr_sent = preprocess_sentence(fr_sent)
            fr_sent_in = [w for w in ("BOS " + fr_sent).split()]
            fr_sent_out = [w for w in (fr_sent + " EOS").split()]
            en_sents.append(en_sent)
            fr_sents_in.append(fr_sent_in)
            fr_sents_out.append(fr_sent_out)
            if i >= num_sent_pairs - 1:
                break
    return en_sents, fr_sents_in, fr_sents_out

We set `NUM_SENT_PAIRS = 1000` to read the first $1,000$ sentence pairs from the dataset.

In [None]:
NUM_SENT_PAIRS = 1000
sents_en, sents_fr_in, sents_fr_out = read_data(NUM_SENT_PAIRS)

We print the first five sentences in `sents_fr_in`.

In [None]:
print(sents_fr_in[0:5])

[['BOS', 'va', '!'], ['BOS', 'salut', '!'], ['BOS', 'salut', '.'], ['BOS', 'cours', '!'], ['BOS', 'courez', '!']]


We print the first five sentences in `sents_fr_out`.

In [None]:
print(sents_fr_out[0:5])

[['va', '!', 'EOS'], ['salut', '!', 'EOS'], ['salut', '.', 'EOS'], ['cours', '!', 'EOS'], ['courez', '!', 'EOS']]
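
The two French lists are the same sentence shifted by one position, which is what lets the decoder learn next-word prediction. A minimal sketch with a made-up sentence (the variable names here are illustrative):

```python
# Hypothetical example sentence; the tutorial builds these lists inside read_data
fr_tokens = "je suis rapide .".split()
fr_in = ["BOS"] + fr_tokens      # what the decoder reads
fr_out = fr_tokens + ["EOS"]     # what the decoder should predict

# At every position t, the label fr_out[t] is the token that follows fr_in[t]
for t, (inp, label) in enumerate(zip(fr_in, fr_out)):
    print(t, inp, "->", label)
```

Both lists have the same length, so they can be padded to the same `seq_len` later on.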


We now create vocabularies, dictionaries, and numeric datasets from the three lists of sentences. Specifically, we obtain
- `data_en`, which contains sequences of indices for `sents_en`.
- `data_fr_in`, which contains sequences of indices for `sents_fr_in`.
- `data_fr_out`, which contains sequences of indices for `sents_fr_out`.

We start by building the dictionaries and vocabularies.

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from collections import Counter
from nltk.tokenize import word_tokenize

The `build_vocab_from_iterator` function creates a vocabulary from a sequence of token lists. It counts the frequency of each token with the `Counter` class, sorts the tokens by frequency (most common first), and places the special tokens (if provided) at the beginning of the list. It then returns a dictionary that maps each token to its position in this list, so special tokens receive the smallest indices.

In [None]:
# build vocabulary using Counter and add special tokens
def build_vocab_from_iterator(iterator, specials=None):
    # treat a missing specials argument as "no special tokens"
    specials = list(specials) if specials is not None else []
    word_freq = Counter()
    for tokens in iterator:
        word_freq.update(tokens)

    # special tokens come first, followed by words sorted by frequency (most common first)
    sorted_vocab = specials + [word for word, freq in word_freq.most_common()]
    vocab = {word: idx for idx, word in enumerate(sorted_vocab)}

    return vocab
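
As a quick illustration of the resulting index order, here is a small sketch on toy data. `toy_build_vocab` is a hypothetical stand-in that mirrors the function above so the snippet runs on its own, and the sentences are made up.

```python
from collections import Counter

def toy_build_vocab(token_lists, specials=()):
    # hypothetical stand-in mirroring build_vocab_from_iterator
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    ordered = list(specials) + [w for w, _ in counts.most_common()]
    return {w: i for i, w in enumerate(ordered)}

toy_sents = [["i", "m", "fast", "."], ["i", "m", "ok", "."], ["go", "."]]
toy_vocab = toy_build_vocab(toy_sents, specials=["<pad>"])
print(toy_vocab)
```

The special token `<pad>` gets index 0, and the most frequent token (here `.`) gets index 1, which matches the pattern seen in the English vocabulary printed below.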

In [None]:
from itertools import islice
# build the vocabulary
en_vocab = build_vocab_from_iterator(sents_en, specials=["<pad>"])

# print the vocabulary and default index
print("English Vocabulary:", dict(islice(en_vocab.items(),10)))

word2idx_en = en_vocab
idx2word_en = {idx: word for word, idx in en_vocab.items()}

English Vocabulary: {'<pad>': 0, '.': 1, 'i': 2, '!': 3, 'm': 4, 'it': 5, 's': 6, 'go': 7, 'tom': 8, '?': 9}


In [None]:
# build the vocabulary
fr_vocab = build_vocab_from_iterator(sents_fr_in, specials=["<pad>","EOS"])

# print the vocabulary and default index
print("French Vocabulary:", dict(islice(fr_vocab.items(),10)))

word2idx_fr = fr_vocab
idx2word_fr = {idx: word for word, idx in fr_vocab.items()}

French Vocabulary: {'<pad>': 0, 'EOS': 1, 'BOS': 2, '.': 3, '!': 4, 'je': 5, 'suis': 6, 'est': 7, 'j': 8, 'ai': 9}


In [None]:
seq_lengths = np.array([len(s) for s in sents_en])
print([(p, np.percentile(seq_lengths, p)) for p in [75, 80, 90, 95, 99, 100]])

[(75, 4.0), (80, 4.0), (90, 4.0), (95, 4.0), (99, 4.0), (100, 5.0)]


In [None]:
seq_lengths = np.array([len(s) for s in sents_fr_in])
print([(p, np.percentile(seq_lengths, p)) for p in [75, 80, 90, 95, 99, 100]])

[(75, 5.0), (80, 5.0), (90, 6.0), (95, 7.0), (99, 9.0), (100, 10.0)]



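The percentiles above show that the longest English sentence has 5 tokens and the longest French sentence has 10 tokens, so we use these values as the sequence lengths and no sentence is truncated. We now transform texts to sequences of indices.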

In [None]:
en_seq_len = 5
fr_seq_len = 10

The `create_pad_sequences` function converts sentences into sequences of indices, pads or truncates them to a specified length, and returns them as a tensor. It first maps each token to its index using the vocabulary dictionary. Each sequence is then truncated if it is longer than `seq_len`, or padded with zeros (the index of `<pad>`) if it is shorter. Finally, the function converts the list of sequences into a PyTorch tensor of type `long`.

In [None]:
from torch.utils.data import TensorDataset, DataLoader

def create_pad_sequences(sents, vocab, seq_len):
  # transform sentences to sequences of indices (list of lists of indices)
  sentences_as_ints = [[vocab[token] for token in tokens] for tokens in sents]
  # pad and truncate sequences
  padded_sequences = []
  for seq in sentences_as_ints:
    if len(seq) > seq_len:
      seq = seq[: seq_len]
    # pad sequence to max_sequence_length
    padding_length = seq_len - len(seq)
    if padding_length > 0:
        seq.extend([0] * padding_length)  # extend with padding values
    padded_sequences.append(seq)
  # convert list of lists to tensor
  truncated_padded_sequences = torch.tensor(padded_sequences, dtype=torch.long)
  return truncated_padded_sequences
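
The per-sequence logic can be checked on plain lists. The helper below is a hypothetical, list-only sketch of the same pad-or-truncate rule; the tutorial's function additionally maps tokens to indices and wraps the result in a tensor.

```python
def pad_or_truncate(seq, seq_len, pad_idx=0):
    # hypothetical helper: cut to seq_len, then fill the remainder with pad_idx
    seq = seq[:seq_len]
    return seq + [pad_idx] * (seq_len - len(seq))

print(pad_or_truncate([7, 3], 5))                # [7, 3, 0, 0, 0]
print(pad_or_truncate([5, 6, 7, 8, 9, 10], 5))   # [5, 6, 7, 8, 9]
```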

We create a PyTorch dataset with three components, `data_en`, `data_fr_in`, and `data_fr_out`, and a `train_loader` for this dataset.

In [None]:
# create dataset
en_padded_sequences = create_pad_sequences(sents_en, en_vocab, en_seq_len)
fr_in_padded_sequences = create_pad_sequences(sents_fr_in, fr_vocab, fr_seq_len)
fr_out_padded_sequences = create_pad_sequences(sents_fr_out, fr_vocab, fr_seq_len)
dataset = TensorDataset(en_padded_sequences, fr_in_padded_sequences, fr_out_padded_sequences)

In [None]:
from torch.utils.data import DataLoader
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

We now declare `EncoderRNN`, the first component of a seq2seq model, whose aim is to encode an input sentence into a fixed-length encoding or context vector. Please pay attention to the code of the `forward` method.
- We first embed the input (a batch of index sequences) using an embedding layer. The embedding layer takes a 2D tensor with the shape $[batch\_size, seq\_len]$ and returns a 3D tensor with the shape $[batch\_size, seq\_len, embed\_size]$. Dropout is applied to the embeddings during training.
- Next, we feed the output of the embedding layer to a GRU recurrent layer (with `batch_first=True`), which returns two tensors: `all_hiddens`, a 3D tensor with the shape $[batch\_size, seq\_len, encoder\_dim]$ holding the hidden state at every time step, and `last_hidden`, a 3D tensor with the shape $[1, batch\_size, encoder\_dim]$ holding the last hidden state.
- The last hidden state can be regarded as the encoding of the entire input sentence. For this single-layer, unidirectional GRU, it is identical to the last time step of `all_hiddens`.

In [None]:
class EncoderRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, encoder_dim, dropout_p=0.1):
        super(EncoderRNN, self).__init__()
        self.encoder_dim = encoder_dim

        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.gru = nn.GRU(embed_size, encoder_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, input):
        embedded = self.dropout(self.embedding(input))
        all_hiddens, last_hidden = self.gru(embedded)
        return all_hiddens, last_hidden
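
The shapes described above can be verified with a small stand-alone sketch. The sizes below are toy values chosen for illustration, not the ones used later in the tutorial.

```python
import torch
import torch.nn as nn

# Toy sizes chosen for illustration only
batch_size, seq_len, vocab_size, embed_size, hidden_size = 4, 5, 20, 8, 16

embedding = nn.Embedding(vocab_size, embed_size)
gru = nn.GRU(embed_size, hidden_size, batch_first=True)

x = torch.randint(0, vocab_size, (batch_size, seq_len))  # batch of index sequences
all_hiddens, last_hidden = gru(embedding(x))

print(all_hiddens.shape)   # torch.Size([4, 5, 16])
print(last_hidden.shape)   # torch.Size([1, 4, 16])
# For a single-layer, unidirectional GRU the last time step equals the last hidden state
print(torch.allclose(all_hiddens[:, -1, :], last_hidden[0]))
```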

The second component of a seq2seq model is a decoder. Our `DecoderRNN` takes the encoder outputs, the last hidden state of the encoder, and optionally a batch of target French sentences with the shape $[batch\_size, seq\_len]$. It generates the output one time step at a time, starting from the **BOS** token.
- At each step, the current input token (shape $[batch\_size, 1]$) is passed to an embedding layer, which outputs a 3D tensor with the shape $[batch\_size, 1, embed\_size]$.
- This 3D tensor is then fed to a GRU recurrent layer. The statement `output, hidden = self.gru(embeded, hidden)` in `forward_step` means that the GRU starts from the given hidden state (at the first step, this is the last hidden state of the encoder) and returns its output together with the updated hidden state.
- Finally, on top of the GRU output, a dense layer with $vocab\_size$ outputs predicts the next word in the French sentence.
- If `target_tensor` is provided, the true target token is fed as the next input (teacher forcing); otherwise, the decoder feeds back its own most likely prediction.

In [None]:
class DecoderRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, decoding_dim):
        super(DecoderRNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.gru = nn.GRU(embed_size, decoding_dim, batch_first=True)
        self.fc = nn.Linear(decoding_dim, vocab_size)

    def forward(self, encoder_outputs, encoder_state, target_tensor=None):
        batch_size = encoder_outputs.size(0)
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(fr_vocab['BOS'])
        decoder_state = encoder_state
        decoder_outputs = []

        for i in range(fr_seq_len):
            decoder_output, decoder_state  = self.forward_step(decoder_input, decoder_state)
            decoder_outputs.append(decoder_output)

            if target_tensor is not None:
                # Teacher forcing: Feed the target as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1) # Teacher forcing
            else:
                # Without teacher forcing: use its own predictions as the next input
                _, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze(-1).detach()  # detach from history as input

        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        return decoder_outputs, decoder_state

    def forward_step(self, input, hidden):
        embeded = self.embedding(input)
        output, hidden = self.gru(embeded, hidden)
        output = self.fc(output)
        return output, hidden
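
The difference between the two branches of the decoding loop is only in which token is fed at the next step. The sketch below lists those inputs for a made-up target and a made-up sequence of model predictions. `decoder_inputs` is a hypothetical helper, and the predictions are fixed for illustration, whereas a real free-running decoder would produce them step by step.

```python
def decoder_inputs(targets, predictions, teacher_forcing):
    # hypothetical helper: lists the token fed to the decoder at each step
    inputs = ["BOS"]
    for t in range(len(targets) - 1):
        inputs.append(targets[t] if teacher_forcing else predictions[t])
    return inputs

targets = ["je", "suis", "rapide", ".", "EOS"]
predictions = ["je", "est", "rapide", ".", "EOS"]   # made-up model outputs

print(decoder_inputs(targets, predictions, teacher_forcing=True))
# ['BOS', 'je', 'suis', 'rapide', '.']
print(decoder_inputs(targets, predictions, teacher_forcing=False))
# ['BOS', 'je', 'est', 'rapide', '.']
```

With teacher forcing, a wrong prediction at one step (`est`) does not affect the input at the next step, which usually makes training more stable.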

We now declare `encoder` as an `EncoderRNN` and `decoder` as a `DecoderRNN`.

In [None]:
embed_size = 256
encoder_dim, decoder_dim = 512, 512
encoder = EncoderRNN(len(en_vocab), embed_size, encoder_dim).to(device)
decoder = DecoderRNN(len(fr_vocab), embed_size, decoder_dim).to(device)

This code shows the flow of feeding a batch of English sentences to the encoder and passing the result to the decoder. Note that when we run `encoder(en_data)`, the `forward` method is invoked on the input batch; the initial hidden state of the GRU defaults to zeros.

In addition, when we run `decoder(encoder_out, encoder_state, fr_out_data)`, the `forward` method is invoked with the encoder outputs, the last hidden state of the encoder, and a batch of French target sentences used for teacher forcing. Note that the first hidden state of the decoder is initialized with the last hidden state of the encoder.

In [None]:
for en_data, fr_in_data, fr_out_data  in train_loader:
    en_data, fr_in_data, fr_out_data = en_data.to(device), fr_in_data.to(device), fr_out_data.to(device)
    break
encoder_out, encoder_state = encoder(en_data)
decoder_pred, decoder_state = decoder(encoder_out, encoder_state, fr_out_data)

print("encoder input :", en_data.shape)
print("encoder output :", encoder_out.shape, "state:", encoder_state.shape)
print("decoder output (logits):", decoder_pred.shape, "state:", decoder_state.shape)
print("decoder target (labels):", fr_out_data.shape)

encoder input : torch.Size([32, 5])
encoder output : torch.Size([32, 5, 512]) state: torch.Size([1, 32, 512])
decoder output (logits): torch.Size([32, 10, 710]) state: torch.Size([1, 32, 512])
decoder target (labels): torch.Size([32, 10])


In [None]:
criterion = nn.CrossEntropyLoss(ignore_index=0)
encoder_optimizer = optim.Adam(encoder.parameters(), lr=0.001, betas=(0.9, 0.98), eps=1e-9)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=0.001, betas=(0.9, 0.98), eps=1e-9)
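
A useful reference point when reading cross-entropy values is the loss of a model that assigns the same probability to every word. For a vocabulary of size $V$ this is $\ln V$. The sketch below computes it for $V = 710$, the French vocabulary size that appears as the last dimension of the decoder output printed above.

```python
import math

vocab_size = 710   # last dimension of the decoder output printed above
uniform_loss = -math.log(1.0 / vocab_size)
print(round(uniform_loss, 2))   # about 6.57
```

A training loss that stays close to this value means the model is doing no better than guessing uniformly.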

In [None]:
import torch.nn.functional as F

def predict(encoder, decoder, en_vocab, sents_en, sents_fr_out, word2idx_fr, idx2word_fr, en_seq_len, fr_seq_len):
    random_id = np.random.choice(len(sents_en))
    print("input : ", " ".join(sents_en[random_id]))
    print("label : ", " ".join(sents_fr_out[random_id]))

    # Prepare encoder input: map tokens to indices, then truncate or pad to en_seq_len
    seq_ints = [en_vocab[token] for token in sents_en[random_id]][:en_seq_len]
    seq_ints.extend([0] * (en_seq_len - len(seq_ints)))
    encoder_in = torch.tensor(seq_ints, dtype=torch.long).unsqueeze(0).to(device)

    # Evaluation mode disables dropout; no gradients are needed for prediction
    encoder.eval()
    decoder.eval()
    with torch.no_grad():
        # Forward pass through the encoder
        encoder_out, encoder_state = encoder(encoder_in)
        # Without a target tensor, the decoder starts from BOS and feeds back
        # its own most likely word at each of the fr_seq_len steps (greedy decoding)
        decoder_pred, _ = decoder(encoder_out, encoder_state)

    # Get the word index with the highest score at each step
    pred_ids = torch.argmax(decoder_pred[0], dim=-1).tolist()

    pred_sent_fr = []
    for idx in pred_ids:
        # Convert the predicted index to a word
        pred_word = idx2word_fr[idx]
        pred_sent_fr.append(pred_word)
        # If EOS is predicted, stop decoding
        if pred_word == "EOS":
            break

    print("predicted: ", " ".join(pred_sent_fr))

# Example usage with PyTorch models (encoder, decoder)
# Note: `encoder` and `decoder` should be instances of your PyTorch model classes.

In [None]:
class BaseTrainer:
    def __init__(self, encoder, decoder, criterion, enc_optimizer, dec_optimizer, train_loader):
        self.encoder = encoder
        self.decoder = decoder
        self.criterion = criterion  #the loss function
        self.enc_optimizer = enc_optimizer  #the encoder optimizer
        self.dec_optimizer = dec_optimizer  #the decoder optimizer
        self.train_loader = train_loader  #the train loader

    #the function to train the model in many epochs
    def fit(self, num_epochs):
        self.num_batches = len(self.train_loader)

        for epoch in range(num_epochs):
            print(f'Epoch {epoch + 1}/{num_epochs}')

            train_loss = self.train_one_epoch()
            print(
                f'{self.num_batches}/{self.num_batches} - loss: {train_loss:.4f} '
            )
            # translate a random training sentence with the models owned by this trainer
            predict(encoder = self.encoder, decoder = self.decoder, en_vocab = en_vocab,
                    sents_en = sents_en, sents_fr_out = sents_fr_out, word2idx_fr=word2idx_fr,
                    idx2word_fr=idx2word_fr, en_seq_len=en_seq_len, fr_seq_len=fr_seq_len)

    #train in one epoch
    def train_one_epoch(self):
        self.encoder.train()
        self.decoder.train()
        running_loss  = 0.0

        for en_data, fr_in_data, fr_out_data  in self.train_loader:
          en_data, fr_in_data, fr_out_data = en_data.to(device), fr_in_data.to(device), fr_out_data.to(device)
          # clear gradients left over from the previous batch
          self.enc_optimizer.zero_grad()
          self.dec_optimizer.zero_grad()
          encoder_out, encoder_state = self.encoder(en_data)
          decoder_pred, decoder_state = self.decoder(encoder_out, encoder_state, fr_out_data)
          # flatten to [batch_size * seq_len, vocab_size] for the loss
          output = decoder_pred.contiguous().view(-1, decoder_pred.size(-1))
          loss = self.criterion(output, fr_out_data.contiguous().view(-1))
          # compute gradients, then update the parameters
          loss.backward()
          self.enc_optimizer.step()
          self.dec_optimizer.step()
          running_loss += loss.item()

        train_loss = running_loss / self.num_batches
        return train_loss
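
The order of `zero_grad`, `backward`, and `step` in a training loop matters. The sketch below uses a single toy parameter with plain SGD (our choice for illustration; the tutorial uses Adam) to show that calling `zero_grad` between `backward` and `step` discards the gradient, so the parameter never changes.

```python
import torch

def one_update(zero_grad_before_step):
    # single parameter w = 1 with loss w^2, so the gradient is 2
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = torch.optim.SGD([w], lr=0.1)
    if zero_grad_before_step:
        (w ** 2).sum().backward()
        opt.zero_grad()      # wipes the gradient that backward() just computed
        opt.step()           # nothing left to apply
    else:
        opt.zero_grad()      # clear stale gradients first
        (w ** 2).sum().backward()
        opt.step()           # applies w <- w - lr * grad
    return w.item()

print(one_update(zero_grad_before_step=True))    # parameter stays at 1.0
print(one_update(zero_grad_before_step=False))   # parameter moves to about 0.8
```

A loss that stays flat across epochs is a typical symptom of the first ordering.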

We now train our seq2seq model and observe the outputs.

In [None]:
trainer = BaseTrainer(encoder, decoder, criterion, encoder_optimizer, decoder_optimizer, train_loader)
trainer.fit(20)

Epoch 1/20
32/32 - loss: 6.5536 
input :  use this .
label :  utilisez ceci . EOS
predicted:  plierai pris tele parlerai rattrape attendez pris tele les sien
Epoch 2/20
32/32 - loss: 6.5531 
input :  he s a dj .
label :  il est dj . EOS
predicted:  detendu revoila touche chaude tele refuse court fini tele refuse
Epoch 3/20
32/32 - loss: 6.5527 
input :  i dozed .
label :  je me suis assoupie . EOS
predicted:  pouvons soul pige tele parlerai rattrape attendez pris tele les
Epoch 4/20
32/32 - loss: 6.5529 
input :  i m fast .
label :  je suis rapide . EOS
predicted:  tele parlerai rattrape attendez pris tele les sien venue pris
Epoch 5/20
32/32 - loss: 6.5525 
input :  it s ours .
label :  c est a nous . EOS
predicted:  occupe tele parlerai rattrape attendez pris tele les sien venue
Epoch 6/20
32/32 - loss: 6.5522 
input :  take this .
label :  prenez ca . EOS
predicted:  plierai gagne pris chanceux tele parlerai rattrape attendez pris tele
Epoch 7/20
32/32 - loss: 6.5534 
input :  oh pl

**<span style="color:red">Exercise 1</span>**: Swap the input and target languages to build up a seq2seq model allowing us to translate from French to English.

## <span style="color:#0b486b">III. Neural Machine Translation With Attention Mechanism </span> <span style="color:red">****</span> ##

In [None]:
class BahdanauAttention(nn.Module):
    def __init__(self, hidden_size):
        super(BahdanauAttention, self).__init__()
        self.Wa = nn.Linear(hidden_size, hidden_size)
        self.Ua = nn.Linear(hidden_size, hidden_size)
        self.Va = nn.Linear(hidden_size, 1)

    # d_state (the query) is the current decoder hidden state, while e_states (the keys) are the encoder hidden states
    def forward(self, d_state, e_states):
        scores = self.Va(torch.tanh(self.Wa(d_state) + self.Ua(e_states)))
        scores = scores.squeeze(2).unsqueeze(1)

        weights = F.softmax(scores, dim=-1)
        context = torch.bmm(weights, e_states)

        return context, weights
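
The computation in `forward` corresponds to additive (Bahdanau-style) attention. Written out, with $s_t$ standing for `d_state`, $h_i$ for the $i$-th time step of `e_states`, and $T$ for the number of encoder time steps (this notation is ours, and the bias terms of the `nn.Linear` layers are omitted for brevity):

$$
\begin{aligned}
e_{t,i} &= v_a^{\top} \tanh\left(W_a s_t + U_a h_i\right), \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}, \\
c_t &= \sum_{i=1}^{T} \alpha_{t,i}\, h_i .
\end{aligned}
$$

Here $e_{t,i}$ matches `scores`, $\alpha_{t,i}$ matches `weights` (the softmax over the encoder time steps), and $c_t$ matches `context`, computed with `torch.bmm` as a weighted sum of the encoder hidden states.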

We now test our `BahdanauAttention`.

In [None]:
batch_size = 64
num_timesteps = 100
num_units = 200
d_state = torch.rand(batch_size, 1, num_units, dtype=torch.float32)
e_states = torch.rand(batch_size, num_timesteps, num_units, dtype=torch.float32)
# check out dimensions for Bahdanau attention
b_attn = BahdanauAttention(num_units)
context, attention_weight = b_attn(d_state, e_states)
print("Bahdanau: context.shape:", context.shape, "attention_weight.shape:", attention_weight.shape)

Bahdanau: context.shape: torch.Size([64, 1, 200]) attention_weight.shape: torch.Size([64, 1, 100])


In [None]:
class AttnDecoderRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, decoder_dim, dropout_p=0.1):
        super(AttnDecoderRNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.attention = BahdanauAttention(decoder_dim)
        self.gru = nn.GRU(decoder_dim + embed_size, decoder_dim, batch_first=True)
        self.fc = nn.Linear(decoder_dim, vocab_size)
        self.dropout = nn.Dropout(dropout_p)

    def forward(self, encoder_outputs, encoder_state, target_tensor=None):
        batch_size = encoder_outputs.size(0)
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long, device=device).fill_(fr_vocab['BOS'])
        decoder_state = encoder_state
        decoder_outputs = []
        attentions = []

        for i in range(fr_seq_len):
            decoder_output, decoder_state = self.forward_step(
                decoder_input, decoder_state, encoder_outputs)
            decoder_outputs.append(decoder_output)
            #attentions.append(attn_weights)

            if target_tensor is not None:
                # Teacher forcing: Feed the target as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1) # Teacher forcing
            else:
                # Without teacher forcing: use its own predictions as the next input
                _, topi = decoder_output.topk(1)
                decoder_input = topi.squeeze(-1).detach()  # detach from history as input

        decoder_outputs = torch.cat(decoder_outputs, dim=1)
        decoder_outputs = F.log_softmax(decoder_outputs, dim=-1)
        #attentions = torch.cat(attentions, dim=1)

        return decoder_outputs, decoder_state

    def forward_step(self, decoder_input, decoder_state, encoder_outputs):
        embedded =  self.dropout(self.embedding(decoder_input))
        query = decoder_state.permute(1, 0, 2)
        context, attn_weights = self.attention(query, encoder_outputs)
        input_gru = torch.cat((embedded, context), dim=2)
        output, decoder_state = self.gru(input_gru, decoder_state)
        output = self.fc(output)
        return output, decoder_state

We now declare a new encoder and the attention decoder.

In [None]:
embed_size = 256
encoder_dim, decoder_dim = 512, 512
encoder = EncoderRNN(len(en_vocab), embed_size, encoder_dim).to(device)
attn_decoder = AttnDecoderRNN(len(fr_vocab), embed_size, decoder_dim).to(device)

In [None]:
for en_data, fr_in_data, fr_out_data  in train_loader:
    en_data, fr_in_data, fr_out_data = en_data.to(device), fr_in_data.to(device), fr_out_data.to(device)
    break
encoder_out, encoder_state = encoder(en_data)
decoder_pred, decoder_state = attn_decoder(encoder_out, encoder_state, fr_out_data)

print("encoder input :", en_data.shape)
print("encoder output :", encoder_out.shape, ", state:", encoder_state.shape)
print("decoder output (logits):", decoder_pred.shape, ", state:", decoder_state.shape)
print("decoder target (labels):", fr_out_data.shape)

encoder input : torch.Size([32, 5])
encoder output : torch.Size([32, 5, 512]) , state: torch.Size([1, 32, 512])
decoder output (logits): torch.Size([32, 10, 710]) , state: torch.Size([1, 32, 512])
decoder target (labels): torch.Size([32, 10])


We now train our seq2seq model with `BahdanauAttention`.

In [None]:

criterion = nn.CrossEntropyLoss(ignore_index=0)
encoder_optimizer = optim.Adam(encoder.parameters(), lr=0.001, betas=(0.9, 0.98), eps=1e-9)
decoder_optimizer = optim.Adam(attn_decoder.parameters(), lr=0.001, betas=(0.9, 0.98), eps=1e-9)

In [None]:
trainer = BaseTrainer(encoder, attn_decoder, criterion, encoder_optimizer, decoder_optimizer, train_loader)
trainer.fit(20)

Epoch 1/20
32/32 - loss: 6.5555 
input :  call me .
label :  appelez moi ! EOS
predicted:  refuse court fini tele refuse court fini tele refuse court
Epoch 2/20
32/32 - loss: 6.5566 
input :  let s see .
label :  voyons voir ! EOS
predicted:  parle cours fini tele parlerai rattrape attendez pris tele les
Epoch 3/20
32/32 - loss: 6.5559 
input :  it stinks .
label :  ca pue . EOS
predicted:  magnez trop detendu revoila touche chaude tele refuse court fini
Epoch 4/20
32/32 - loss: 6.5581 
input :  open up .
label :  ouvre moi ! EOS
predicted:  fauche bizarre restez pris chanceux tele parlerai rattrape attendez pris
Epoch 5/20
32/32 - loss: 6.5579 
input :  i m right .
label :  j ai raison . EOS
predicted:  fauche bizarre restez pris chanceux tele parlerai rattrape attendez pris
Epoch 6/20
32/32 - loss: 6.5560 
input :  they fell .
label :  ils sont tombes . EOS
predicted:  refuse court fini tele refuse court fini tele refuse court
Epoch 7/20
32/32 - loss: 6.5564 
input :  see you !
label

---
### <span style="color:#0b486b"> <div  style="text-align:center">**THE END**</div> </span>