# Neural Machine Translation with Attention Using PyTorch
In this notebook we perform machine translation with a deep learning approach that uses an attention mechanism. All code is written in PyTorch and was adapted from the tutorial in the official [TensorFlow](https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb) documentation.

Specifically, we are going to train a sequence to sequence model for French-to-English translation.

## Import libraries

In [1]:
import torch
import torch.nn.functional as F
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

import pandas as pd
import numpy as np
import unicodedata
import re
import nltk
from sklearn.model_selection import train_test_split

In [2]:
!pip install gdown


In [3]:
!gdown 1Bhpi9gD_3UHZRFmcn7Czkb73jHN8CUAL

Downloading...
From: https://drive.google.com/uc?id=1Bhpi9gD_3UHZRFmcn7Czkb73jHN8CUAL
To: /home/ivan/Downloads/Dataset-fr-eng.txt
100%|██████████████████████████████████████| 28.7M/28.7M [00:09<00:00, 2.98MB/s]


In [4]:
# read the file inside a context manager so that it is closed afterwards
with open('Dataset-fr-eng.txt', encoding='UTF-8') as fh:
    f = fh.read().strip().split('\n')

In [5]:
lines = f
# sample size (smaller sample size to reduce computation)
num_examples = 30000 
# creates lists containing each pair
original_word_pairs = [[w for w in l.split('\t')] for l in lines[:num_examples]]
data = pd.DataFrame(original_word_pairs, columns=["eng", "fr","whatever"])
data = data[['eng','fr']]

In [6]:
data.head(50)

| | eng | fr |
|---|---|---|
| 0 | Go. | Va ! |
| 1 | Go. | Marche. |
| 2 | Go. | Bouge ! |
| 3 | Hi. | Salut ! |
| 4 | Hi. | Salut. |
| 5 | Run! | Cours ! |
| 6 | Run! | Courez ! |
| 7 | Run! | Prenez vos jambes à vos cous ! |
| 8 | Run! | File ! |
| 9 | Run! | Filez ! |


In [7]:
# Converts a unicode string to plain ASCII-like text by stripping accents
def unicode_to_ascii(s):
    """
    Decomposes accented characters (NFD) and drops the combining marks
    (category 'Mn'), e.g. "é" -> "e".
    """
    return ''.join(c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn')

def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    # creating a space between a word and the punctuation following it
    # eg: "he is a boy." => "he is a boy ." 
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    w = re.sub(r'[" "]+', " ", w)
    # replacing everything with space except (a-z, A-Z, ".", "?", "!", ",", "¿")
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)
    w = w.strip()
    # adding a start and an end token to the sentence
    # so that the model knows when to start and stop predicting.
    w = '<start> ' + w + ' <end>'
    return w
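
As a quick sanity check, the sketch below re-implements the same steps in a compact form and runs them on two sentences from the dataset. The names `strip_accents` and `preprocess` are ours, chosen only to keep the example self-contained; it is a variant of the cell above, not a replacement for it.

```python
import re
import unicodedata

def strip_accents(s):
    # NFD splits "à" into "a" plus a combining mark; dropping category 'Mn' removes the mark
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def preprocess(w):
    w = strip_accents(w.lower().strip())
    w = re.sub(r"([?.!,¿])", r" \1 ", w)      # put spaces around punctuation
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)    # any other run of characters becomes one space
    return '<start> ' + w.strip() + ' <end>'

example = preprocess("Prenez vos jambes à vos cous !")
print(example)  # <start> prenez vos jambes a vos cous ! <end>
```

Note that the apostrophe in a sentence such as "It's cold today." is replaced by a space, which is why tokens like `it s` show up in the preprocessed data.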

## Data Exploration
Let's apply the preprocessing to both columns and look at a random sample of the result.

In [8]:
# Now we do the preprocessing using pandas and lambdas
data["eng"] = data.eng.apply(lambda w: preprocess_sentence(w))
data["fr"] = data.fr.apply(lambda w: preprocess_sentence(w))
data.sample(10)


| | eng | fr |
|---|---|---|
| 14 | `<start> run . <end>` | `<start> courez ! <end>` |
| 14653 | `<start> it s cold today . <end>` | `<start> il fait froid aujourd hui . <end>` |
| 6581 | `<start> do you see it ? <end>` | `<start> est ce que vous le voyez ? <end>` |
| 13421 | `<start> how big you are ! <end>` | `<start> comme tu es grand ! <end>` |
| 1803 | `<start> he is lazy . <end>` | `<start> il est faineant . <end>` |
| 12676 | `<start> are you dressed ? <end>` | `<start> etes vous habille ? <end>` |
| 8533 | `<start> tom has a job . <end>` | `<start> tom travaille . <end>` |
| 5153 | `<start> i m pregnant . <end>` | `<start> je suis enceinte . <end>` |
| 19385 | `<start> my cat is hungry . <end>` | `<start> mon chat a faim . <end>` |
| 26226 | `<start> we can build that . <end>` | `<start> nous pouvons construire cela . <end>` |


#### Building Vocabulary Index


In [9]:
# This class creates a word -> index mapping (e.g., "dad" -> 5) and vice versa
# (e.g., 5 -> "dad") for one language.
class LanguageIndex():
    def __init__(self, lang):
        """ lang is the list of phrases in one language"""
        self.lang = lang
        self.word2idx = {}
        self.idx2word = {}
        self.vocab = set()
        self.create_index()
        
    def create_index(self):
        for phrase in self.lang:
            # update with individual tokens
            self.vocab.update(phrase.split(' '))
        # sort the vocab
        self.vocab = sorted(self.vocab)
        # add a padding token with index 0
        self.word2idx['<pad>'] = 0
        # word to index mapping
        for index, word in enumerate(self.vocab):
            self.word2idx[word] = index + 1 # +1 because of pad token
        # index to word mapping
        for word, index in self.word2idx.items():
            self.idx2word[index] = word      


# index language using the class above
inp_lang = LanguageIndex(data["fr"].values.tolist())
targ_lang = LanguageIndex(data["eng"].values.tolist())
# Vectorize the input and target languages
input_tensor = [[inp_lang.word2idx[s] for s in fr.split(' ')]  for fr in data["fr"].values.tolist()]
target_tensor = [[targ_lang.word2idx[s] for s in eng.split(' ')]  for eng in data["eng"].values.tolist()]
input_tensor[:10]

[[5, 7355, 1, 4],
 [5, 4357, 3, 4],
 [5, 905, 1, 4],
 [5, 6432, 1, 4],
 [5, 6432, 3, 4],
 [5, 1663, 1, 4],
 [5, 1655, 1, 4],
 [5, 5559, 7581, 3985, 7, 7581, 1677, 1, 4],
 [5, 3125, 1, 4],
 [5, 3127, 1, 4]]

In [10]:
target_tensor[:10]

[[5, 1588, 3, 4],
 [5, 1588, 3, 4],
 [5, 1588, 3, 4],
 [5, 1774, 3, 4],
 [5, 1774, 3, 4],
 [5, 3188, 1, 4],
 [5, 3188, 1, 4],
 [5, 3188, 1, 4],
 [5, 3188, 1, 4],
 [5, 3188, 1, 4]]
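
To make the mapping concrete, here is a minimal sketch of the same indexing scheme on two toy phrases. The helper `build_index` is a hypothetical stand-in for `LanguageIndex`, written as a plain function so the example runs on its own.

```python
def build_index(phrases):
    """Return (word2idx, idx2word) with index 0 reserved for '<pad>'."""
    vocab = sorted({tok for phrase in phrases for tok in phrase.split(' ')})
    word2idx = {'<pad>': 0}
    for i, word in enumerate(vocab):
        word2idx[word] = i + 1          # +1 because of the pad token
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

phrases = ["<start> go . <end>", "<start> run ! <end>"]
word2idx, idx2word = build_index(phrases)

encoded = [word2idx[tok] for tok in phrases[0].split(' ')]
decoded = [idx2word[i] for i in encoded]
print(word2idx)   # {'<pad>': 0, '!': 1, '.': 2, '<end>': 3, '<start>': 4, 'go': 5, 'run': 6}
print(encoded)    # [4, 5, 2, 3]
```

Because the vocabulary is sorted before indexing, punctuation and the special tokens receive the lowest indices, which matches the pattern in the outputs above, where every sequence starts with 5 (`<start>`) and ends with 4 (`<end>`).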

In [11]:
def max_length(tensor):
    return max(len(t) for t in tensor)

# calculate the max_length of input and output tensor
max_length_inp, max_length_tar = max_length(input_tensor), max_length(target_tensor)

def pad_sequences(x, max_len):
    padded = np.zeros((max_len), dtype=np.int64)
    if len(x) > max_len: padded[:] = x[:max_len]
    else: padded[:len(x)] = x
    return padded

In [12]:
# pad every sequence to the maximum length (this creates new arrays)
input_tensor = [pad_sequences(x, max_length_inp) for x in input_tensor]
target_tensor = [pad_sequences(x, max_length_tar) for x in target_tensor]
len(target_tensor)

30000
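
The effect of padding, and how the true length can be recovered from a padded sequence, can be seen in the dependency-free sketch below. The helpers `pad` and `true_length` are illustrative names of ours and use plain lists instead of NumPy arrays.

```python
PAD = 0  # index reserved for '<pad>'

def pad(seq, max_len):
    # truncate if too long, otherwise right-pad with PAD
    return list(seq[:max_len]) + [PAD] * max(0, max_len - len(seq))

def true_length(padded):
    # number of non-padding positions
    return sum(1 for idx in padded if idx != PAD)

batch = [[5, 7355, 1, 4], [5, 5559, 7581, 3985, 7, 7581, 1677, 1, 4]]
max_len = max(len(s) for s in batch)

padded = [pad(s, max_len) for s in batch]
lengths = [true_length(p) for p in padded]
print(padded[0])  # [5, 7355, 1, 4, 0, 0, 0, 0, 0]
print(lengths)    # [4, 9]
```

Counting non-zero entries works only because index 0 is never assigned to a real word.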

In [13]:
# Creating training and validation sets using an 80-20 split
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.2)
# Show length
len(input_tensor_train), len(target_tensor_train), len(input_tensor_val), len(target_tensor_val)

(24000, 24000, 6000, 6000)

## Load data into DataLoader for Batching
This is just preparing the dataset so that it can be efficiently fed into the model through batches.

In [14]:
from torch.utils.data import Dataset, DataLoader

In [15]:
class MyData(Dataset):
    def __init__(self, X, y):
        self.data = X
        self.target = y
        self.length = [ np.sum(1 - np.equal(x, 0)) for x in X]
        
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        x_len = self.length[index]
        return x,y,x_len
    
    def __len__(self):
        return len(self.data)

## Parameters
Let's define the hyperparameters and other things we need for training our NMT model.

In [16]:
BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
N_BATCH = BUFFER_SIZE//BATCH_SIZE
embedding_dim = 256
units = 1024
vocab_inp_size = len(inp_lang.word2idx)
vocab_tar_size = len(targ_lang.word2idx)

train_dataset = MyData(input_tensor_train, target_tensor_train)
val_dataset = MyData(input_tensor_val, target_tensor_val)

dataset = DataLoader(train_dataset, batch_size = BATCH_SIZE, 
                     drop_last=True,
                     shuffle=True)

val_loader = DataLoader(val_dataset, batch_size = BATCH_SIZE, 
                     drop_last=True,
                     shuffle=True)
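
To see what a loader yields, the sketch below builds one over random toy tensors and collects the batch shapes. It uses `TensorDataset` instead of `MyData` purely to stay self-contained, and all sizes are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy stand-ins: 10 padded source sequences of length 6 and targets of length 5
X = torch.randint(1, 20, (10, 6))
y = torch.randint(1, 20, (10, 5))
lengths = torch.full((10,), 6)

loader = DataLoader(TensorDataset(X, y, lengths), batch_size=4,
                    drop_last=True, shuffle=True)

shapes = [(xb.shape, yb.shape, lb.shape) for xb, yb, lb in loader]
print(len(loader))   # 2 (the incomplete third batch of 2 examples is dropped)
print(shapes[0])
```

`drop_last=True` matters for this notebook because the encoder builds its initial hidden state with a fixed `batch_sz`, so a smaller final batch would not match that shape.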

In [17]:
class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)
        self.gru = nn.GRU(self.embedding_dim, self.enc_units)
        
    def forward(self, x, device):
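        # x: max_length, batch_size (the training loop transposes each batch,
        # because nn.GRU expects sequence-first input by default)
        
        # x: max_length, batch_size, embedding_dim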
        x = self.embedding(x)
    
        self.hidden = self.initialize_hidden_state(device)
        
        # output: max_length, batch_size, enc_units
        # self.hidden: 1, batch_size, enc_units
        output, self.hidden = self.gru(x, self.hidden)
               
        return output, self.hidden

    def initialize_hidden_state(self, device):
        return torch.zeros((1, self.batch_sz, self.enc_units)).to(device)
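
The shape conventions of `nn.GRU` are easy to get wrong, so here is a small standalone check with toy sizes (the numbers are arbitrary and not the ones used in this notebook).

```python
import torch
import torch.nn as nn

max_length, batch_size, embedding_dim, enc_units = 7, 4, 8, 16  # toy sizes

gru = nn.GRU(embedding_dim, enc_units)             # batch_first defaults to False
x = torch.randn(max_length, batch_size, embedding_dim)
h0 = torch.zeros(1, batch_size, enc_units)

output, hidden = gru(x, h0)
print(output.shape)   # torch.Size([7, 4, 16])
print(hidden.shape)   # torch.Size([1, 4, 16])

# for a single-layer, unidirectional GRU the last output step is the final hidden state
same = torch.allclose(output[-1], hidden[0])
print(same)
```

This is why the batches are transposed to *(max_length, batch_size)* before being passed to the encoder.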


### Decoder

Here, we'll implement an encoder-decoder model with attention which you can read about in the TensorFlow [Neural Machine Translation (seq2seq) tutorial](https://github.com/tensorflow/nmt). This notebook implements the [attention equations](https://github.com/tensorflow/nmt#background-on-the-attention-mechanism) from the seq2seq tutorial. The following diagram shows that each input word is assigned a weight by the attention mechanism which is then used by the decoder to predict the next word in the sentence.

<img src="https://www.tensorflow.org/images/seq2seq/attention_mechanism.jpg" width="500" alt="attention mechanism">

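The input is put through an encoder model. In this PyTorch implementation the encoder returns its output with shape *(max_length, batch_size, hidden_size)* and its hidden state with shape *(1, batch_size, hidden_size)*; the decoder permutes them to *(batch_size, max_length, hidden_size)* and *(batch_size, 1, hidden_size)* before computing attention.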

Here are the equations that are implemented:

<img src="https://www.tensorflow.org/images/seq2seq/attention_equation_0.jpg" alt="attention equation 0" width="800">
<img src="https://www.tensorflow.org/images/seq2seq/attention_equation_1.jpg" alt="attention equation 1" width="800">
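
In case the images do not load, the equations can be written out as follows. This is our transcription of the linked tutorial's notation, so treat the exact symbols as an assumption: $h_t$ is the decoder (target) hidden state, $\bar{h}_s$ is the encoder (source) hidden state at position $s$, and $S$ is the source length. The second image also lists Luong's multiplicative score, which is not used here.

```latex
\begin{align}
\alpha_{ts} &= \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}
                    {\sum_{s'=1}^{S} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)}
  && \text{(attention weights)} \\
c_t &= \sum_{s=1}^{S} \alpha_{ts}\, \bar{h}_s
  && \text{(context vector)} \\
a_t &= f(c_t, h_t) = \tanh\big(W_c\,[c_t; h_t]\big)
  && \text{(attention vector)} \\
\mathrm{score}(h_t, \bar{h}_s) &= v_a^{\top} \tanh\big(W_1 h_t + W_2 \bar{h}_s\big)
  && \text{(Bahdanau's additive style)}
\end{align}
```

The decoder in this notebook computes the weights $\alpha_{ts}$ and the context vector $c_t$, then concatenates $c_t$ with the embedded decoder input instead of forming the attention vector $a_t$.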

We're using *Bahdanau attention*. Let's decide on notation before writing the simplified form:

* FC = Fully connected (dense) layer
* EO = Encoder output
* H = hidden state
* X = input to the decoder

And the pseudo-code:

* `score = FC(tanh(FC(EO) + FC(H)))`
* `attention weights = softmax(score, axis = 1)`. The shape of score is *(batch_size, max_length, 1)*, where `max_length` is the length of our input. Since we are trying to assign a weight to each input position, softmax is applied over axis 1 rather than over the last axis.
* `context vector = sum(attention weights * EO, axis = 1)`. Same reason as above for choosing axis as 1.
* `embedding output` = The input to the decoder X is passed through an embedding layer.
* `merged vector = concat(embedding output, context vector)`
* This merged vector is then given to the GRU
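
The pseudo-code above can be sketched on random tensors as follows. The sizes are toy values and the three `nn.Linear` layers are freshly initialised, so only the shapes and the normalisation are meaningful here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, max_length, hidden_size = 2, 5, 8   # toy sizes

EO = torch.randn(batch_size, max_length, hidden_size)   # encoder output
H = torch.randn(batch_size, 1, hidden_size)             # hidden state with a time axis

W1 = nn.Linear(hidden_size, hidden_size)
W2 = nn.Linear(hidden_size, hidden_size)
V = nn.Linear(hidden_size, 1)

# W2(H) broadcasts over the max_length axis when added to W1(EO)
score = V(torch.tanh(W1(EO) + W2(H)))                       # (batch_size, max_length, 1)
attention_weights = F.softmax(score, dim=1)                 # normalised over input positions
context_vector = torch.sum(attention_weights * EO, dim=1)   # (batch_size, hidden_size)

print(score.shape, attention_weights.shape, context_vector.shape)
print(attention_weights.sum(dim=1).squeeze(-1))             # each entry is 1 up to rounding
```

Applying softmax over `dim=1` is what makes the weights of one sentence sum to 1 across its input positions.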
  
The shapes of all the vectors at each step have been specified in the comments in the code:

In [18]:
import torch.nn.functional as F

class Decoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, dec_units, enc_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.dec_units = dec_units
        self.enc_units = enc_units
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.embedding = nn.Embedding(self.vocab_size, self.embedding_dim)
        self.gru = nn.GRU(self.embedding_dim + self.enc_units, 
                          self.dec_units,
                          batch_first=True)
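        # the GRU output has dec_units features
        self.fc = nn.Linear(self.dec_units, self.vocab_size)
        
        # used for attention
        # W1 acts on the encoder output, W2 on the hidden state, V on the tanh output
        self.W1 = nn.Linear(self.enc_units, self.dec_units)
        self.W2 = nn.Linear(self.dec_units, self.dec_units)
        self.V = nn.Linear(self.dec_units, 1)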
    
    def forward(self, x, hidden, enc_output):
        # enc_output converted == (batch_size, max_length, hidden_size)
        enc_output = enc_output.permute(1,0,2)
      
        # hidden shape == (1, batch_size, hidden_size); we permute it to (batch_size, 1, hidden_size)
        hidden_with_time_axis = hidden.permute(1, 0, 2)
        
        # score shape == (batch_size, max_length, hidden_size) (Bahdanau-style additive attention)
        # It doesn't matter which FC we pick for each of the inputs
        score = torch.tanh(self.W1(enc_output) + self.W2(hidden_with_time_axis))
        
        # calculate attention weights using softmax
        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = F.softmax(self.V(score), dim=1)
        
        # calculate context_vector
        # context_vector shape after sum == (batch_size, hidden_size)
        context_vector = torch.sum(attention_weights * enc_output, dim=1)

        # pass the decoder input x through the embedding layer
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)

        ##  concatenate the context vector and x
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = torch.cat((context_vector.unsqueeze(1), x), -1)
        
        # passing the concatenated vector to the GRU
        # note: no initial state is passed, so `hidden` affects this step only through attention
        # output: (batch_size, 1, hidden_size)
        output, state = self.gru(x)

        # output shape == (batch_size * 1, hidden_size)
        output =  output.view(-1, output.size(2))
        
        # output shape == (batch_size * 1, vocab)
        x = self.fc(output)
        return x, state, attention_weights

    def initialize_hidden_state(self):
        return torch.zeros((1, self.batch_sz, self.dec_units))

In [19]:
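# reduction='none' keeps one loss value per target so that the mask can be applied
criterion = nn.CrossEntropyLoss(reduction='none')
def loss_function(real, pred):
    """ Only consider non-padding targets (index > 0) in the loss; mask needed """
    # .float() keeps the mask on the same device as `real` (works on CPU and GPU)
    mask = real.ge(1).float()
    loss_ = criterion(pred, real) * mask 
    return torch.mean(loss_)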

In [20]:
# Device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)
decoder = Decoder(vocab_tar_size, embedding_dim, units, units, BATCH_SIZE)

encoder.to(device)
decoder.to(device)
optimizer = optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), 
                       lr=0.001)

## Training
Now we start the training. We train for only 10 epochs, but you can increase this to keep training the model for longer. Note that we use teacher forcing during training. A more detailed explanation is available in the official TensorFlow [implementation](https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb) of this notebook.

- Pass the input through the encoder, which returns the encoder output and the encoder hidden state.
- The encoder output, the encoder hidden state and the decoder input (which is the start token) are passed to the decoder.
- The decoder returns the predictions and the decoder hidden state.
- The decoder hidden state is then passed back into the model and the predictions are used to calculate the loss.
- Use teacher forcing to decide the next input to the decoder.
- Teacher forcing is the technique where the target word is passed as the next input to the decoder.
- The final step is to backpropagate the loss to compute the gradients and let the optimizer apply them.

# Why do we need a mask in the loss function?

### *Your answer:* 
The mask is needed because every target sequence is padded with zeros up to the same length, and the decoder produces a prediction at every position, including the padded ones. Padding carries no information, so those positions should not contribute to the loss; otherwise the model would be rewarded simply for predicting `<pad>`. The mask zeroes out the loss wherever the target index is 0.
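
A minimal sketch of a masked loss on toy logits is shown below. It normalises by the number of real tokens, which is one common choice and an assumption of this example rather than what the notebook's `loss_function` does; the comparison with `ignore_index=0` is there only to check the result against PyTorch's built-in way of skipping a class.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 6
pred = torch.randn(4, vocab_size)          # logits for 4 target positions
real = torch.tensor([3, 1, 0, 0])          # the last two positions are padding (index 0)

per_token = nn.CrossEntropyLoss(reduction='none')(pred, real)   # one loss per position
mask = real.ge(1).float()                                        # 1 for real tokens, 0 for padding

masked_mean = (per_token * mask).sum() / mask.sum()              # average over real tokens only
builtin = nn.CrossEntropyLoss(ignore_index=0)(pred, real)        # built-in equivalent

print(mask)                                   # tensor([1., 1., 0., 0.])
print(torch.allclose(masked_mean, builtin))   # True
```

The key detail is `reduction='none'`: with the default reduction the criterion returns a single averaged number, and multiplying that by a mask no longer removes the padded positions.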

# When do we apply teacher forcing - during training and/or testing?

### *Your answer:* 
Teacher forcing is used only during training, not during testing. During training, the ground-truth target token is fed to the decoder at each time step instead of the decoder's own previous prediction. This keeps an early mistake from propagating to later steps, which makes training faster and more stable. During testing no ground truth is available, so the decoder's own output from the previous time step is used as the input for predicting the next word.
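
The difference can be illustrated with a toy example. The lookup table `TOY_MODEL` below is a hypothetical stand-in for a trained decoder step (it maps the previous token to a predicted next token) and deliberately contains one mistake.

```python
# Hypothetical "decoder": previous token -> predicted next token.
# It wrongly predicts "like" after "i" instead of "love".
TOY_MODEL = {"<start>": "i", "i": "like", "love": "flowers", "like": "eggs",
             "flowers": ".", "eggs": ".", ".": "<end>"}

def decode(target, teacher_forcing):
    dec_input = target[0]                      # always begin with <start>
    predictions = []
    for t in range(1, len(target)):
        pred = TOY_MODEL[dec_input]
        predictions.append(pred)
        # teacher forcing feeds the ground truth; otherwise feed the model's own output
        dec_input = target[t] if teacher_forcing else pred
    return predictions

target = ["<start>", "i", "love", "flowers", ".", "<end>"]
forced = decode(target, teacher_forcing=True)
free = decode(target, teacher_forcing=False)
print(forced)  # ['i', 'like', 'flowers', '.', '<end>']
print(free)    # ['i', 'like', 'eggs', '.', '<end>']
```

With teacher forcing the single mistake stays local, because the next input is the correct word. Without it, the wrong word is fed back in and causes a second error.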

In [21]:
def flip_batch(X, y):
    return X.transpose(0,1), y # transpose (batch x seq) to (seq x batch)

In [22]:
EPOCHS = 10

encoder.batch_sz = 64
encoder.initialize_hidden_state(device)
decoder.batch_sz = 64
decoder.initialize_hidden_state()

for epoch in range(EPOCHS):    
    encoder.train()
    decoder.train()
    total_loss = 0
    
    for (batch, (inp, targ, inp_len)) in enumerate(dataset):
        loss = 0
        xs, ys = flip_batch(inp, targ)
        enc_output, enc_hidden = encoder(xs.to(device), device)
        dec_hidden = enc_hidden
        dec_input = torch.tensor([[targ_lang.word2idx['<start>']]] * BATCH_SIZE)
        for t in range(1, ys.size(1)):
            predictions, dec_hidden, _ = decoder(dec_input.to(device), 
                                         dec_hidden.to(device), 
                                         enc_output.to(device))
            
            loss += loss_function(ys[:, t].long().to(device), predictions.to(device))
            dec_input = ys[:, t].unsqueeze(1)

        batch_loss = (loss / int(ys.size(1)))
        # .item() stores a plain float, so the computation graph is not kept alive
        total_loss += batch_loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(epoch + 1,
                                                         batch,
                                                         batch_loss.detach().item()))    

Epoch 1 Batch 0 Loss 4.6236
Epoch 1 Batch 100 Loss 1.5164
Epoch 1 Batch 200 Loss 1.3514
Epoch 1 Batch 300 Loss 1.1261
Epoch 2 Batch 0 Loss 0.8623
Epoch 2 Batch 100 Loss 0.8067
Epoch 2 Batch 200 Loss 0.6598
Epoch 2 Batch 300 Loss 0.6423
Epoch 3 Batch 0 Loss 0.3595
Epoch 3 Batch 100 Loss 0.3352
Epoch 3 Batch 200 Loss 0.3628
Epoch 3 Batch 300 Loss 0.3891
Epoch 4 Batch 0 Loss 0.2362
Epoch 4 Batch 100 Loss 0.1997
Epoch 4 Batch 200 Loss 0.2005
Epoch 4 Batch 300 Loss 0.2358
Epoch 5 Batch 0 Loss 0.1134
Epoch 5 Batch 100 Loss 0.1354
Epoch 5 Batch 200 Loss 0.1193
Epoch 5 Batch 300 Loss 0.0957
Epoch 6 Batch 0 Loss 0.0774
Epoch 6 Batch 100 Loss 0.0759
Epoch 6 Batch 200 Loss 0.0892
Epoch 6 Batch 300 Loss 0.0985
Epoch 7 Batch 0 Loss 0.0447
Epoch 7 Batch 100 Loss 0.0990
Epoch 7 Batch 200 Loss 0.0479
Epoch 7 Batch 300 Loss 0.0995
Epoch 8 Batch 0 Loss 0.0593
Epoch 8 Batch 100 Loss 0.0409
Epoch 8 Batch 200 Loss 0.0910
Epoch 8 Batch 300 Loss 0.1094
Epoch 9 Batch 0 Loss 0.0797
Epoch 9 Batch 100 Loss 0.070

In [23]:
def translate_sentence(encoder, decoder, sentence, max_length=120):
    encoder.eval()
    decoder.eval()
    total_loss = 0
    sentence = torch.unsqueeze(sentence, dim=1)
    with torch.no_grad():
        enc_output, enc_hidden = encoder(sentence.to(device), device)
        dec_hidden = enc_hidden
        dec_input = torch.tensor([[targ_lang.word2idx['<start>']]] * 1)
        out_sentence = []
        for t in range(1, sentence.size(0)):  
            
            # why is there a loop?
            # answer: decoding is autoregressive; each step's prediction
            # becomes the decoder input for the next step

            predictions, dec_hidden, _ = decoder(dec_input.to(device), 
                                        dec_hidden.to(device), 
                                        enc_output.to(device))
            dec_input = predictions.argmax(dim=1).unsqueeze(1)
            out_sentence.append(targ_lang.idx2word[predictions.squeeze().argmax().item()])

    return out_sentence

encoder.batch_sz = 1
encoder.initialize_hidden_state(device)
decoder.batch_sz = 1
decoder.initialize_hidden_state()

test_sentence = "<start> j adore les fleurs . <end>"
test_sentence = [inp_lang.word2idx[s] for s in test_sentence.split(' ')]
test_sentence = pad_sequences(test_sentence, max_length_inp)
ret = translate_sentence(encoder, decoder, torch.tensor(test_sentence), max_length=120)
ret

['i',
 'love',
 'egg',
 '.',
 '<end>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>',
 '<pad>']
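
The output above continues past `<end>` because the decoding loop runs for the full padded input length. One simple remedy, sketched here as a post-processing step with a hypothetical helper `trim_translation`, is to keep only the tokens before the first `<end>`.

```python
def trim_translation(tokens, end_token='<end>'):
    """Keep the tokens that come before the first end_token."""
    if end_token in tokens:
        return tokens[:tokens.index(end_token)]
    return tokens

raw = ['i', 'love', 'egg', '.', '<end>', '<pad>', '<pad>', '<pad>']
clean = ' '.join(trim_translation(raw))
print(clean)  # i love egg .
```

An alternative is to `break` out of the loop in `translate_sentence` as soon as the predicted token is `<end>`, which also saves the unnecessary decoder steps.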

### References



1.   http://www.adeveloperdiary.com/data-science/deep-learning/nlp/machine-translation-using-attention-with-pytorch/
2.   https://medium.com/dair-ai/neural-machine-translation-with-attention-using-pytorch-a66523f1669f


