# Neural Machine Translation by Jointly Learning to Align and Translate

Link: https://arxiv.org/abs/1409.0473

Authors: Dzmitry Bahdanau, KyungHyun Cho, Yoshua Bengio

Institution: Jacobs University Bremen, Université de Montréal

Publication: ICLR 2015

Date: 19 May 2016 (as of v7)




## Background Materials

- Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation https://arxiv.org/abs/1406.1078


## Papers citing this paper

- Tacotron: Towards End-to-End Speech Synthesis https://arxiv.org/abs/1703.10135
- and more

## What is this paper about?

An encoder-decoder architecture for neural machine translation that is not constrained to a fixed-length vector representation of the source sentence.


## What is the motivation of this research?

Cho et al., 2014a proposed an encoder-decoder architecture for machine translation. The encoder encodes the source sentence into a fixed-length vector, and the decoder outputs a translation from that encoded vector.

Cho et al., 2014b showed that the performance of an encoder-decoder model deteriorates as the length of the input sequence increases.

The authors conjecture that the use of a fixed-length vector is a bottleneck for improving performance.


## What makes this paper different from previous research?

They extended the encoder-decoder model so that it can soft-search for a set of positions in the source sentence where the most relevant information is concentrated.

The proposed model encodes an input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation.


## How does this paper achieve it?

### Decoder

The decoder is often trained to predict the next word $y_t$ given the context vector $c$ and all the previously predicted words $\{y_1, ..., y_{t-1}\}$. 

$p(\boldsymbol{y}) = p(\{y_1,...,y_T\}) = \prod_{t=1}^Tp(y_t \lvert \{y_1,...,y_{t-1}\}, c)$

With an RNN, each conditional probability is modeled as 

$p(y_t \lvert \{y_1, ..., y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$

where $g$ is a nonlinear, potentially multi-layered function that outputs the probability of $y_t$, and $s_t$ is the hidden state of the RNN.
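The factorization above means that the probability of a whole translation is the product of the per-step conditional probabilities, which in practice is usually accumulated as a sum of logs. The following is a minimal sketch of that arithmetic only; the function name `sequence_log_prob` and the per-step probabilities are made up for illustration and stand in for the outputs of $g$.

```python
import numpy as np

def sequence_log_prob(step_probs):
    """Return log p(y) = sum_t log p(y_t | y_1..y_{t-1}, c)."""
    step_probs = np.asarray(step_probs, dtype=float)
    return float(np.sum(np.log(step_probs)))

# Hypothetical probabilities that g assigned to the chosen word at each step.
step_probs = [0.5, 0.25, 0.8]

log_p = sequence_log_prob(step_probs)
p = float(np.exp(log_p))  # same value as the product 0.5 * 0.25 * 0.8
```

Working in log space avoids numerical underflow when the product runs over many time steps.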

In the proposed model, each conditional probability is instead defined as 

$p(y_i\lvert y_1,...,y_{i-1}, \boldsymbol{x}) = g(y_{i-1}, s_i, c_i)$

where $s_i$ is an RNN hidden state for time $i$ computed by

$s_i = f(s_{i-1}, y_{i-1}, c_i)$

Unlike the single vector $c$ above, there is a distinct context vector $c_i$ for each target word $y_i$. It depends on a sequence of *annotations* $(h_1,...,h_T)$ to which an encoder maps the input sentence. The Encoder section explains how the annotations are computed.

The context vector $c_i$ is then computed as a weighted sum of the annotations.

$c_i = \sum_{j=1}^T\alpha_{ij}h_j$

The weight $\alpha_{ij}$ of each annotation $h_j$ is computed by

$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^T\exp(e_{ik})}$

where 

$e_{ij} = a(s_{i-1}, h_j)$

is an *alignment model* which scores how well the inputs around position $j$ and the output at position $i$ match.
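The three equations above (score, softmax, weighted sum) can be sketched for a single decoder step as follows. This is an illustrative NumPy sketch, not the authors' code. It assumes the additive form $a(s_{i-1}, h_j) = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$ for the alignment model; the function name, the dimensions, and the random parameters are placeholders chosen for the example.

```python
import numpy as np

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Return (context c_i, weights alpha_i) for one decoder step.

    s_prev: (n,)    previous decoder state s_{i-1}
    H:      (T, m)  annotations h_1..h_T, one per row
    W_a: (d, n), U_a: (d, m), v_a: (d,)  alignment-model parameters (assumed form)
    """
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), computed for all j at once
    e = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a   # shape (T,)
    # softmax over source positions; subtracting the max is for stability only
    e = e - e.max()
    alpha = np.exp(e) / np.exp(e).sum()            # shape (T,)
    # c_i = sum_j alpha_ij h_j
    c = alpha @ H                                  # shape (m,)
    return c, alpha

# Toy example with random placeholder values
rng = np.random.default_rng(0)
T, n, m, d = 5, 4, 6, 3
c, alpha = additive_attention(
    rng.normal(size=n),
    rng.normal(size=(T, m)),
    rng.normal(size=(d, n)),
    rng.normal(size=(d, m)),
    rng.normal(size=d),
)
```

Because the weights are a softmax output, they are positive and sum to one, so $c_i$ is a convex combination of the annotations rather than a hard selection of one source position.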


### Encoder

So that each annotation summarizes not only the preceding words but also the following words, a bidirectional RNN is used.

The forward RNN reads the input sequence in its original order, from $x_1$ to $x_T$, and the backward RNN reads it in reverse order.

An annotation for each word $x_j$ is obtained by concatenating the forward hidden state $\overrightarrow{h_j}$ and the backward one $\overleftarrow{h_j}$,

$h_j = [\overrightarrow{h}_j^\top ; \overleftarrow{h}_j^\top]^\top$

<img src="img/Neural_Machine_Translation_by_Jointly_Learning_to_Align_and_Translate_Figure1.png" width="200">
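The concatenation of forward and backward states can be sketched as below. As a simplifying assumption, this example uses a plain tanh RNN cell in place of a gated unit, and all names, sizes, and weights are hypothetical. The point is only to show how the backward states are re-aligned to the source positions before concatenation.

```python
import numpy as np

def rnn_states(X, W, U):
    """Run a plain tanh RNN over the rows of X and return every hidden state."""
    h = np.zeros(U.shape[0])
    states = []
    for x in X:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return np.stack(states)                      # shape (T, n)

def bidirectional_annotations(X, fwd, bwd):
    """Return annotations h_j = [forward h_j ; backward h_j], one per row."""
    H_fwd = rnn_states(X, *fwd)                  # reads x_1 .. x_T
    H_bwd = rnn_states(X[::-1], *bwd)[::-1]      # reads x_T .. x_1, flipped back to index by j
    return np.concatenate([H_fwd, H_bwd], axis=1)  # shape (T, 2n)

# Toy example with random placeholder embeddings and weights
rng = np.random.default_rng(0)
T, k, n = 4, 3, 2                                # sentence length, embedding size, state size
X = rng.normal(size=(T, k))
fwd = (rng.normal(size=(n, k)), rng.normal(size=(n, n)))
bwd = (rng.normal(size=(n, k)), rng.normal(size=(n, n)))
H = bidirectional_annotations(X, fwd, bwd)
```

Each row of `H` then contains information from both sides of word $x_j$: the first `n` entries summarize $x_1, ..., x_j$ and the last `n` entries summarize $x_j, ..., x_T$.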


### Results

<img src="img/Neural_Machine_Translation_by_Jointly_Learning_to_Align_and_Translate_Figure2.png" width="400">

## Dataset used in this study

- WMT '14: English-French parallel corpora


## Implementations

- tf.contrib.seq2seq.BahdanauAttention


## Further Readings


