# A Convolutional Encoder Model for Neural Machine Translation

In this tutorial we will demonstrate how to implement a state of the art convolutional encoder sequential decoder (conv2seq) architecture (Published recently at ACL'17. [Link To Paper](http://www.aclweb.org/anthology/P/P17/P17-1012.pdf)) for sequence to sequence modeling using Pytorch. While the aim of the tutorial is to make the audience comfortable with pytorch using this tutorial (with a Conv2Seq implementation as an add on), some familiarity with pytorch (or any other deep learning framework) would definitely be a plus. The agenda of this tutorial is as follows:

1. Getting Ready with the data 
2. Network Definition. This includes
    * A Convolution Encoder with residual connections
    * An attention based RNN decoder 
3. Training subroutines
4. Model testing and Visualizations

This tutorial draws its content/design heavily from [this](http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) pytorch tutorial for attention based sequence to sequence model for translation. We reuse their data selection/filtering methodology. This helps in focussing more on explaining model architecture and it's translation from formulae to code. 

## Data Preparation

While the paper uses the official WMT data, we stick to a relatively smaller dataset for English to French translation released as part of the Tatoeba project \[3\] which is present in the "data" directory of this project. We will later apply more filtering to restrict our focus on certain type of short sentences. 

Some examples of English-French pairs available in the data are:

    La prochaine fois, je gagnerai la partie. ==> I will win the game next time.

    Fouillez la maison ! ==>  Search the house!

    Ne vous faites pas de souci ! Je vous couvre. ==> Don't worry. I've got you covered.

    Ma famille n'est pas aussi grande que ça. ==> My family is not that large.

    Ça va être serré. ==> It's going to be close.


  To get started we first import the necessary libraries


In [4]:
from __future__ import unicode_literals, print_function, division
from io import open
from collections import namedtuple
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch.autograd import Variable
from torch import optim
import torch.nn.functional as F

import numpy as np
import pandas as pd

We now define some constants and variables that we will use later

In [5]:
use_cuda = torch.cuda.is_available() # To check if GPU is available
MAX_LENGTH = 10 # We restrict our experiments to sentences of length 10 or less
embedding_size = 256
hidden_size_gru = 256
attn_units = 256
conv_units = 256
num_iterations = 750
print_every = 100
batch_size = 1
sample_size = 1000
dropout = 0.2
encoder_layers = 3
SOS_TOKEN = "<sos>"
EOS_TOKEN = "<eos>"

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)

Note that while the paper uses different model parameters (longer sentences, larger embedding dimensions etc.) we choose smaller values to make the architecture shallow enough for the small data that we are using.

Next, we will define (or rather copy from [2]) some helper functions that will prove to be useful later

In [6]:
# Function to convert unicdoe string to plain ASCII
# Thanks to http://stackoverflow.com/a/518232/2809427

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

In [7]:
# Takes all unicode characters, converts them to ascii
# Replaces full stop with space full stop (so that Fire!
# becomes Fire !)
# Removes everything apart from alphabet characters and
# stop characters.

def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s

In [8]:
# Returns the cuda tensor type of a variable if CUDA is available
def check_and_convert_to_cuda(var):
    return var.cuda() if use_cuda else var

We now read the dataset and normalize them. To be able to observe the effects of training on our small dataset quickly, we will restrict our dataset to simpler sentences, which begin with phrases like "i am", "he is", "she is" etc. These prefixes which we will be using to filter our dataset have been defined as the variable `eng_prefixes`.

In [9]:
data = pd.read_csv('data/eng-fra.txt', sep='\t', names=['english', 'french'])
data = data[data.english.str.lower().str.startswith(eng_prefixes)].iloc[:sample_size]

data['english'] = data.apply(lambda row: normalizeString(row.english), axis=1)
data['french'] = data.apply(lambda row: normalizeString(row.french), axis=1)

We now have a list of sentences which are space separated words. Now, we want to convert these individual words to unique numerical ID's so that each unique word in the vocabulary is represented by a particular integer ID. To do this, we first create a function that does this mapping for us

In [10]:
Vocabulary = namedtuple('Vocabulary', ['word2id', 'id2word']) # A Named tuple representing the vocabulary of a particular language

In [11]:
def construct_vocab(sentences):
    word2id = dict()
    id2word = dict()
    word2id[SOS_TOKEN] = 0
    word2id[EOS_TOKEN] = 1
    id2word[0] = SOS_TOKEN
    id2word[1] = EOS_TOKEN
    for sentence in sentences:
        for word in sentence.strip().split(' '):
            if word not in word2id:
                word2id[word] = len(word2id)
                id2word[len(word2id)-1] = word
    return Vocabulary(word2id, id2word)

Now, generating the vocabulary for source/target language is as simple as 

In [12]:
english_vocab = construct_vocab(data.english)
french_vocab = construct_vocab(data.french)

The next task is to convert each sentence to a list of ID's from the corresponding vocabulary mapping. We create another helper function for it. Note that we also add a special End of Sentence `<eos>` token to mark the end of sentence. ( At decoding time, we keep generating words until the `<eos>` token has been generated )

In [13]:
def sent_to_word_id(sentences, vocab, eos=True):
    data = []
    for sent in sentences:
        if eos:
            end = [vocab.word2id[EOS_TOKEN]]
        else:
            end = []
        words = sent.strip().split(' ')
        
        if len(words) < MAX_LENGTH:
            data.append([vocab.word2id[w] for w in words] + end)
    return data


And finally use this function to generate sentences with token ID's

In [14]:
english_data = sent_to_word_id(data.english, english_vocab)
french_data = sent_to_word_id(data.french, french_vocab)

What we have generated now are python lists where each item in itself is a list of ID's. However, Pytorch expects a Tensor object and so we also perform that required transformation

In [15]:
input_dataset = [Variable(torch.LongTensor(sent)) for sent in french_data]
output_dataset = [Variable(torch.LongTensor(sent)) for sent in english_data]

if use_cuda: # And if cuda is available use the cuda tensor types
    input_dataset = [i.cuda() for i in input_dataset]
    output_dataset = [i.cuda() for i in output_dataset]

We are now done with the required data preprocessing that is compatible with the requirements of our Encoder - Decoder architecture.

# Encoder - Decoder Architecture

At it's core, an encoder-decoder model uses two neural networks. The first takes in the sentence token by token and produces a sentence representation (A vector of given size, say 512). Once we have this representation (presumably containing the entire semantics of the source sentence), we then use this to generate the corresponding sentence in the target language , word by word. Conventionally, recurrent neural networks have been used for both encoder and decoder [4]  as shown in the figure below ([Image Source](http://colah.github.io/posts/2015-01-Visualizing-Representations/))


<img src="https://colah.github.io/posts/2015-01-Visualizing-Representations/img/Translation2-RepArrow.png" width="600" height="400" />

This however burdens the encoder by asking it to encode the entire representation of the sentence in a single vector. [5] propose a neural attention mechanism that at decoding time, apart from using this sentence representation, also tries peek at the input to get additional help in performing that decoding step.

General architecture of how attention is used to selectively focus on a particular part of input to perform decoding

<img src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/12/Screen-Shot-2015-12-30-at-1.16.08-PM.png" width="200" height="200" />

(Image Source: WildML Blog)

This paper replaces the recurrent encoder in the above architecture and replaces it with a Convolutional Encoder. A big benefit from this is that CNN's are highly parallelizable and thus make the training faster (and as shown in the results of the paper with equal or better performance). We now focus on the implementation of the Convolution Encoder (We recommend everyone to also try the previously mentioned RNN Encoder-Decoder tutorial on PyTorch's official tutorial page). As you will see later, the attention formulation is modified from the above architecture when using Convolutional Encoder

# Convolutional Encoder

The main components of the convolution encoder are

* A multi-layer convolution neural network with one fixed size filter that gathers information for the given context window.

* Residual connections that combine the input of a convolution layer to its output followed by a non-linear transformation. (Note that having no pooling layer after convolution layers is essentiall for incorporating residual connections. Moreover, each input to a convolution layer has to be appropriately padded so that the output-size after applying the convolution operation can remain the same as the original input size and can thus be passed to successive convolution layers whithout reducing the input size )

* The architecture has two such Convolution networks
  * Conv-a - The output of this encoder is used for creating the attention matrix that is used at decoding time.
  * Conv-c - The output of this encoder is attended to (the exact formulation is discussed later) using Conv-a and is then passed to the Decoder
  
  
We will now explain the architecture of in convolution encoder in general (without explicitly referring to Conv-a or Conv-c as they are structurally similar)

The input to a convolution encoder is combination (addition in this case) of individual word embeddings and their position embeddings which in the paper is given by $e_j = w_j + l_j $. Both these embeddings are learnt during training. Thus for a sentence *"La prochaine fois je gagnerai la partie"* the input to the encoder is

| Word         | Position  | Representation  |
| -------------|:---------:| -----:|
|    La        |  1        |  WordEmbeddingFor(*La*) + PositionEmbeddingFor(*1*)|
|    prochaine |  2        |  WordEmbeddingFor(*prochaine*) + PositionEmbeddingFor(*2*)  |
| fois         |  3        |  WordEmbeddingFor(*fois*) + PositionEmbeddingFor(*3*)   |
|   je         |  4        |  WordEmbeddingFor(*je*) + PositionEmbeddingFor(*4*)|
|   gagnerai   |  5        |  WordEmbeddingFor(*gagnerai*) + PositionEmbeddingFor(*5*)|
|   la         |  6        |  WordEmbeddingFor(*la*) + PositionEmbeddingFor(*6*)|
|partie        |  7        |  WordEmbeddingFor(*partie*) + PositionEmbeddingFor(*7*)|


We finally begin with our encoder implementation by defining a barebone architecture

In [16]:
%%script false
class ConvEncoder(nn.Module):
    def __init__(self, vocab_size, embedding_size, dropout=0.2,
                 num_channels_attn=512, num_channels_conv=512, max_len=MAX_LENGTH,
                 kernel_size=3, num_layers=5):
      pass
    def forward(self, position_ids, sentence_as_wordids):
      # position_ids refer to position of individual words in the sentence 
      # represented by sentence_as_wordids. 
      pass

Here we have the constructor with the necessary model parameters. The forward() defines the forward pass of your computational graph. Pytorch handles the backward pass of calculating gradients and updating weights on its own. We now incrementally build our encoder in the following steps. (A point worth mentioning is that while the *position_ids* can be dynamically generated on the fly as part of the computation graph, we pass them as input to make the model code minimal)

In [14]:
%%script false
class ConvEncoder(nn.Module):
    def __init__(self, vocab_size, embedding_size, dropout=0.2,
                 num_channels_attn=512, num_channels_conv=512, max_len=MAX_LENGTH,
                 kernel_size=3, num_layers=5):
      super(ConvEncoder, self).__init__()
      # Here we define the required layers that would be used in the forward pass
      self.position_embedding = nn.Embedding(max_len, embedding_size)
      self.word_embedding = nn.Embedding(vocab_size, embedding_size)
      self.num_layers = num_layers
      self.dropout = dropout
      
      # Convolution Layers
      self.conv = nn.ModuleList([nn.Conv1d(num_channels_conv, num_channels_conv, kernel_size,
                                      padding=kernel_size // 2) for _ in range(num_layers)])
      
    def forward(self, position_ids, sentence_as_wordids):
      # position_ids refer to position of individual words in the sentence 
      # represented by sentence_as_wordids. 
      pass


The reason why we explicily use *nn.ModuleList* and not traditional Python Lists is to allow these modules to be visible to other Pytorch modules if GPU is used.

We now define the computational graph of the encoder 

In [15]:
class ConvEncoder(nn.Module):
    def __init__(self, vocab_size, embedding_size, dropout=0.2,
                 num_channels_attn=512, num_channels_conv=512, max_len=MAX_LENGTH,
                 kernel_size=3, num_layers=5):
        super(ConvEncoder, self).__init__()
        self.position_embedding = nn.Embedding(max_len, embedding_size)
        self.word_embedding = nn.Embedding(vocab_size, embedding_size)
        self.num_layers = num_layers
        self.dropout = dropout

        self.conv = nn.ModuleList([nn.Conv1d(num_channels_conv, num_channels_conv, kernel_size,
                                      padding=kernel_size // 2) for _ in range(num_layers)])

    def forward(self, position_ids, sentence_as_wordids):
        # Retrieving position and word embeddings 
        position_embedding = self.position_embedding(position_ids)
        word_embedding = self.word_embedding(sentence_as_wordids)
        
        # Applying dropout to the sum of position + word embeddings
        embedded = F.dropout(position_embedding + word_embedding, self.dropout, self.training)
        
        # Transform the input to be compatible for Conv1d as follows
        # Length * Channel ==> Num Batches * Channel * Length
        embedded = torch.unsqueeze(embedded.transpose(0, 1), 0)
        
        # Successive application of convolution layers followed by residual connection
        # and non-linearity
        
        cnn = embedded
        for i, layer in enumerate(self.conv):
          # layer(cnn) is the convolution operation on the input cnn after which
          # we add the original input creating a residual connection
          cnn = F.tanh(layer(cnn)+cnn)        

        return cnn

The only difference between Conv-a and Conv-c is that of embedding size and number of convolution layers which can be easily adjusted for each using the given constructor.

# Decoder
We will now see how to build the decoder component of the translation model. To understand the decoder module, let's focus on the section 2 of the paper. The first paragraph indicates that we are getting a sequence of states, **z** defined as:
$$\mathbf{z} = (z_1, z_2 \ldots z_m)$$

The next paragraph describes how a typical recurrent neural network works. This part is not necessary in understanding the implementation, since PyTorch and other Deep Learning frameworks provide methods for easy construction of these recurrent networks. What is important to understand, however, is the notations they are using for describing the inputs and outputs, since they will be useful in understanding the equations later. The paper uses LSTM as the neural network, however, for this tutorial we are using a GRU instead. Since GRU and LSTM only differ in the internal mechanism of generating hidden states and outputs, this won't affect the implementation a lot.

* $h_i$ represents the hidden state/output of the LSTM.
* $c_i$ is the input context to the LSTM
* $g_i$ is the embedding of the previous output of the LSTM. This gets concatenated with $c_i$ as input to the LSTM

The outputs of the encoder module are $cnn_a$ (used in generating attention matrix) and $cnn_c$ (encoded sentence).

The next word, $y_{i+1}$ is generated as:
$$p(y_{i+1}|y_1, \ldots, y_i, \mathbf{x}) = \text{softmax}(W_oh_{i+1} + b_o)$$

For the `softmax` part, we can use PyTorch's `functional` module. For the linear transformation within the `softmax`, we can use `nn.Linear`. The input to this linear transformation is the GRU's hidden state, therefore of the same. The output will be a distribution over the entire output vocabulary, therefore equal to the output vocabulary size.

So far, our decoder should like something like this:



In [16]:
%%script false

class AttnDecoder(nn.Module):
  def __init__(self, output_vocab_size, hidden_size_gru, embedding_size,
               n_layers_gru):
    
    # This will generate the embedding g_i of previous output y_i
    self.embedding = nn.Embedding(output_size, embedding_size)
    
    # A GRU 
    self.gru = nn.GRU(hidden_size_gru+embedding_size, hidden_size, n_layers_gru)
    
    # Dense layer for output transformation
    self.dense_o = nn.Linear(hidden_size_gru, output_vocab_size)
    
  def forward(self, y_i, h_i, cnn_a, cnn_c):
    
    # generates the embedding of previous output
    g_i = self.embedding(y_i)
    
    gru_output, gru_hidden = self.gru(torch.concat(g_i, input_context), h_i)
    # gru_output: contains the output at each time step from the last layer of gru
    # gru_hidden: contains hidden state of every layer of gru at the end
    
    # We want to compute a softmax over the last output of the last layer
    output = F.log_softmax(self.dense_o(gru_hidden[-1]))
    
    # We return the softmax-ed output. We also need to collect the hidden state of the GRU
    # to be used as h_i in the next forward pass
    
    return output, gru_hidden

In the code snippet above, we haven't included the generation of `input_context` $c_i$. The paper describes the generation of $c_i$ as follows (same section):

We first transform the hidden state of GRU $h_i$ to match the size of $g_i$, the embedding of previous output and then add the embedding $g_i$.

$$d_i = W_dh_i + b_d + g_i$$

Corresponding code (rough - don't copy paste!) will be:


---


```python
def __init__():
  self.transform_gru_hidden = nn.Linear(gru_hidden_size, embedding_size)

def forward():
  d_i = self.transform_gru_hidden(gru_hidden[-1]) + g_i
```


---


Next, we generate the attention matrix $A$ as follows:
$$a_{ij} = \frac{exp \left(d_i^Tz_j\right)}{\sum^m_{t=1}exp\left(d_i^Tz_t\right)}$$

$z_j$ (described in the next page) is the $j^{th}$ column of $cnn\_a$.

Instead of generating $a_{ij}$ individually, we can generate the entire $\mathbf{a_i}$ in one go, by modifying the equation slightly:

$$\mathbf{a_i} = \text{softmax}(d_i^T\mathbf{z})$$ where $\mathbf{z}$ now corresponds to the entire $cnn\_a$. This simplifies our implementation, since we can now quickly multiply matrices instead of iterating over individual vectors and computing the dot products.

Pythonically, we can write this roughly as:

---

```python
  a_i = F.softmax(torch.bmm(d_i, cnn_a).view(1, -1))
```
---

Finally, we generate $c_i$ as
$$c_i = \sum_{j=1}^{m}a_{ij}\left (cnn\_c(\mathbf{e})_j \right )$$

$cnn-c(\mathbf{e})_j$ corresponds to `cnn_c` that we receive as the encoder input. As before, we can transform the equation a bit so that it becomes easier to implement it:
$$c_i = \mathbf{a}_i \left (cnn\_c \right)$$ 

---

```python
  c_i = torch.bmm(a_i.view(1, 1, -1), cnn_c.transpose(1, 2))
```
---

A bit of implementation trick here - `a_i` has dimension `(sequence_length,)`. `cnn_c` has dimension `embedding_size x sequence_length`. Now, for `a_i` and `cnn_c` to be multiplied together, we need to make them compatible for multiplication. Therefore, we transpose `cnn_c` to make it `sequence_length x embedding_size` and reshape `a_i` to `1 x sequence_length`

We are almost done here with the decoder component. Few things we need to do to complete the implementation:
* Make sure `__init__` and `forward` funcitons have all the arguments which are needed.
* Add dropouts for embedding and decoder output $h_i$ (section 4.3, last line)
* Add a function to initialize the hidden units of the GRU to zero after every sentence. (section 4.2, second line)

Putting everything together, our decoder module now looks like this:

In [17]:
class AttnDecoder(nn.Module):
  
  def __init__(self, output_vocab_size, dropout = 0.2, hidden_size_gru = 128,
               cnn_size = 128, attn_size = 128, n_layers_gru=1,
               embedding_size = 128, max_sentece_len = MAX_LENGTH):

    super(AttnDecoder, self).__init__()
    
    self.n_gru_layers = n_layers_gru
    self.hidden_size_gru = hidden_size_gru
    self.output_vocab_size = output_vocab_size
    self.dropout = dropout
    
    self.embedding = nn.Embedding(output_vocab_size, hidden_size_gru)
    self.gru = nn.GRU(hidden_size_gru + embedding_size, hidden_size_gru,
                      n_layers_gru)
    self.transform_gru_hidden = nn.Linear(hidden_size_gru, embedding_size)
    self.dense_o = nn.Linear(hidden_size_gru, output_vocab_size)

    self.n_layers_gru = n_layers_gru
    
  def forward(self, y_i, h_i, cnn_a, cnn_c):
    
    g_i = self.embedding(y_i)
    g_i = F.dropout(g_i, self.dropout, self.training)
    
    d_i = self.transform_gru_hidden(h_i) + g_i
    a_i = F.softmax(torch.bmm(d_i, cnn_a).view(1, -1))
  
    c_i = torch.bmm(a_i.view(1, 1, -1), cnn_c.transpose(1, 2))
    gru_output, gru_hidden = self.gru(torch.cat((g_i, c_i), dim=-1), h_i)
    
    gru_hidden = F.dropout(gru_hidden, self.dropout, self.training)
    softmax_output = F.log_softmax(self.dense_o(gru_hidden[-1]))
    
    return softmax_output, gru_hidden


  # function to initialize the hidden layer of GRU. 
  def initHidden(self):
    result = Variable(torch.zeros(self.n_layers_gru, 1, self.hidden_size_gru))
    if use_cuda:
        return result.cuda()
    else:
        return result

# Training the Model

We now describe the process for training the network on the parallel dataset. In general, the steps involved in training a PyTorch model may be outlined as follows:
1. Initialize the network weights
2. Define and initialize the optimizers
3. Define and initialize the loss criterion
4. Repeat till convergence:
   * Make a forward pass through the network
   * Use the loss criterion to compute loss
   * Use the optimizer to compute the gradients
   * Backpropogate
   
We will describe (and implement) each of the steps described above.

## Initialize the network weights

This is easy. We create the objects corresponding to the `ConvEncoder` and `AttnDecoder` classes we have created above. Then we initialize the weights for different parts of the network as follows (section 4.2):
* Convolution Layers: Samples from uniform distribution in range $(-kd^{-0.5}, kd^{0.5})$
* Others: Samples from uniform distribution in range $(-0.05, 0.05)$

In [18]:
def init_weights(m):
  
    if not hasattr(m, 'weight'):
        return
    if type(m) == nn.Conv1d:
        width = m.weight.data.shape[-1]/(m.weight.data.shape[0]**0.5)
    else:
        width = 0.05
        
    m.weight.data.uniform_(-width, width)


encoder_a = ConvEncoder(len(french_vocab.word2id), embedding_size, dropout=dropout,
                        num_channels_attn=attn_units, num_channels_conv=conv_units,
                        num_layers=encoder_layers)
encoder_c = ConvEncoder(len(french_vocab.word2id), embedding_size, dropout=dropout,
                        num_channels_attn=attn_units, num_channels_conv=conv_units,
                        num_layers=encoder_layers)
decoder = AttnDecoder(len(english_vocab.word2id), dropout = dropout,
                       hidden_size_gru = hidden_size_gru, embedding_size = embedding_size,
                       attn_size = attn_units, cnn_size = conv_units)

if use_cuda:
    encoder_a = encoder_a.cuda()
    encoder_c = encoder_c.cuda()
    decoder = decoder.cuda()

encoder_a.apply(init_weights)
encoder_c.apply(init_weights)
decoder.apply(init_weights)

encoder_a.training = True
encoder_c.training = True
decoder.training = True

## Define and Initialize the Optimizers
We will use Adam optimzer `torch.optim.Adam` with a learning rate of $10^{-4}$. Here's how we can do it:

---

```python
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)

encoder_optimizer.zero_grad()
decoder_optimizer.zero_grad()
```
---


## Define and Initialize Loss Criterion
We will be using Negative Log Likelihood (also known as Cross-entropy loss) as the loss criterion for our network.

---
```python
criterion = nn.NLLLoss()
```
---


## Training Steps

We will define two functions:
* `train`: This corresponds to one step of training. It will make a forward pass for one batch, compute loss, compute and backpropagate the gradients.
* `trainIters`: This will sample a batch and call the train function, in a loop.

Although the paper suggests using [beam search](https://en.wikipedia.org/wiki/Beam_search) while generating the output sentence, we will use greedy decoding instead.

In [19]:
def trainIters(encoder_a, encoder_c, decoder, n_iters, batch_size=32, learning_rate=1e-4, print_every=100):
  
    encoder_a_optimizer = optim.Adam(encoder_a.parameters(), lr=learning_rate)
    encoder_c_optimizer = optim.Adam(encoder_c.parameters(), lr=learning_rate)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)
    
    # Sample a training pair
    training_pairs = list(zip(*(input_dataset, output_dataset)))
    
    criterion = nn.NLLLoss()
    
    
    print_loss_total = 0
    
    # The important part of the code is the 3rd line, which performs one training
    # step on the batch. We are using a variable `print_loss_total` to monitor
    # the loss value as the training progresses
    
    for itr in range(1, n_iters + 1):
        training_pair = random.sample(training_pairs, k=batch_size)
        input_variable, target_variable = list(zip(*training_pair))
        
        loss = train(input_variable, target_variable, encoder_a, encoder_c,
                     decoder, encoder_a_optimizer, encoder_c_optimizer, decoder_optimizer,
                     criterion, batch_size=batch_size)
        
        print_loss_total += loss

        if itr % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print(print_loss_avg)
            print_loss_total=0
    print("Training Completed")

In [20]:
def train(input_variables, output_variables, encoder_a, encoder_c, decoder,
          encoder_a_optimizer, encoder_c_optimizer, decoder_optimizer, criterion, 
          max_length=MAX_LENGTH, batch_size=32):
    
  # Initialize the gradients to zero
  encoder_a_optimizer.zero_grad()
  encoder_c_optimizer.zero_grad()
  decoder_optimizer.zero_grad()

  for count in range(batch_size):
    # Length of input and output sentences
    input_variable = input_variables[count]
    output_variable = output_variables[count]

    input_length = input_variable.size()[0]
    output_length = output_variable.size()[0]

    loss = 0

    # Encoder outputs: We use this variable to collect the outputs
    # from encoder after each time step. This will be sent to the decoder.
    position_ids = Variable(torch.LongTensor(range(0, input_length)))
    position_ids = position_ids.cuda() if use_cuda else position_ids
    cnn_a = encoder_a(position_ids, input_variable)
    cnn_c = encoder_c(position_ids, input_variable)
    
    cnn_a = cnn_a.cuda() if use_cuda else cnn_a
    cnn_c = cnn_c.cuda() if use_cuda else cnn_c

    prev_word = Variable(torch.LongTensor([[0]])) #SOS
    prev_word = prev_word.cuda() if use_cuda else prev_word

    decoder_hidden = decoder.initHidden()

    for i in range(output_length):
      decoder_output, decoder_hidden = \
          decoder(prev_word, decoder_hidden, cnn_a, cnn_c)
      topv, topi = decoder_output.data.topk(1)
      ni = topi[0][0]
      prev_word = Variable(torch.LongTensor([[ni]]))
      prev_word = prev_word.cuda() if use_cuda else prev_word
      loss += criterion(decoder_output,output_variable[i])

      if ni==1: #EOS
        break

  # Backpropagation
  loss.backward()
  encoder_a_optimizer.step()
  decoder_optimizer.step()

  return loss.data[0]/output_length



To finally start the training, we simply call the trainIters method. (Be patient as the training will take time depending upon your machine)

In [21]:
trainIters(encoder_a,encoder_c, decoder, num_iterations, print_every=print_every, batch_size=batch_size)

5.367075214340574
3.677882065591358
3.588345387708575
3.5734350045976204
3.4276830834888266
3.3303319127900273
3.222051881880987
Training Completed


# Evaluation

The evaluation function will be very similar to train function minus the backpropagation part.

In [22]:
def evaluate(sent_pair, encoder_a, encoder_c, decoder, source_vocab, target_vocab, max_length=MAX_LENGTH):
    source_sent = sent_to_word_id(np.array([sent_pair[0]]), source_vocab)
    if(len(source_sent) == 0):
        return
    source_sent = source_sent[0]
    input_variable = Variable(torch.LongTensor(source_sent))
    
    if use_cuda:
        input_variable = input_variable.cuda()
        
    input_length = input_variable.size()[0]
    position_ids = Variable(torch.LongTensor(range(0, input_length)))
    position_ids = position_ids.cuda() if use_cuda else position_ids
    cnn_a = encoder_a(position_ids, input_variable)
    cnn_c = encoder_c(position_ids, input_variable)
    cnn_a = cnn_a.cuda() if use_cuda else cnn_a
    cnn_c = cnn_c.cuda() if use_cuda else cnn_c
    
    prev_word = Variable(torch.LongTensor([[0]])) #SOS
    prev_word = prev_word.cuda() if use_cuda else prev_word

    decoder_hidden = decoder.initHidden()
    target_sent = []
    ni = 0
    out_length = 0
    while not ni==1 and out_length < 10:
        decoder_output, decoder_hidden = \
            decoder(prev_word, decoder_hidden, cnn_a, cnn_c)

        topv, topi = decoder_output.data.topk(1)
        ni = topi[0][0]
        target_sent.append(target_vocab.id2word[ni])
        prev_word = Variable(torch.LongTensor([[ni]]))
        prev_word = prev_word.cuda() if use_cuda else prev_word
        out_length += 1
        
    print("Source: " + sent_pair[0])
    print("Translated: "+' '.join(target_sent))
    print("Expected: "+sent_pair[1])
    print("")


To evaluate the learned model, simply execute the following

In [23]:
encoder_a.training = False
encoder_c.training = False
decoder.training = False
samples = data.sample(n=100)
for (i, row) in samples.iterrows():
    evaluate((row.french, row.english), encoder_a, encoder_c, decoder, french_vocab, english_vocab)

Source: elle est dure avec eux .
Translated: he is is . . <eos>
Expected: she is hard on them .

Source: tu n es pas japonais .
Translated: he is . . . <eos>
Expected: you are not japanese .

Source: je suis fiance avec elle .
Translated: he is . . . <eos>
Expected: i am engaged to her .

Source: tu es sur mon chemin .
Translated: he is . . . . <eos>
Expected: you are in my way .

Source: il m instruit .
Translated: he is . . <eos>
Expected: he is teaching me .

Source: il n est pas la en ce moment .
Translated: he is . . . . <eos>
Expected: he isn t here now .

Source: je suis divorcee .
Translated: he is . . . <eos>
Expected: i am divorced .

Source: tu es enseignante .
Translated: he is . . <eos>
Expected: you are a teacher .

Source: elles sont bonnes toutes les deux .
Translated: he is . . . <eos>
Expected: they are both good .

Source: vous etes matinale .
Translated: he is . . <eos>
Expected: you are early .

Source: elle n est jamais a l heure .
Translated: he is is . . <eos>
E

Source: il passe a la radio .
Translated: he is . . . <eos>
Expected: he is on the radio .



## References

1) Jonas  Gehring,  Michael  Auli,  David  Grangier,  and Yann  Dauphin.  2017. ** A  Convolutional  Encoder Model for Neural Machine Translation.**   In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: LongPapers). Association for Computational Linguistics,Vancouver,  Canada,  pages  123–135
[http://www.aclweb.org/anthology/P17-1012](http://www.aclweb.org/anthology/P17-1012)

2) [** Translation with a Sequence to Sequence Network and Attention **](http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html) - Official PyTorch Tutorial 

3) [**Tatoeba Project**](https://tatoeba.org/eng) (Downloaded from [http://www.manythings.org/anki/](http://www.manythings.org/anki/))

4) Sutskever, I., Vinyals, O., and Le, Q. (2014). **Sequence to sequence learning with neural networks**. In Advances in Neural Information Processing Systems (NIPS 2014)

5) Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. **Neural machine translation by jointly learning to align and translate** arXiv:1409.0473 [cs.CL], September 2014