# Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation

Link: https://arxiv.org/abs/1406.1078

Authors: Kyunghyun Cho, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio

Institution: Université de Montréal, Jacobs University, Université du Maine

Publication: arXiv

Date: 3 Sep. 2014




## Background Materials




## What is this paper about?

A new neural network model, the RNN Encoder-Decoder, is applied to statistical machine translation.

In RNN Encoder-Decoder, one RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols.

## What is the motivation of this research?

Previous approaches are limited in that:

- the input sequence must have a fixed length
- word order is ignored, because the input is represented as a bag of words
- the input is an n-gram


## What makes this paper different from previous research?


The RNN Encoder-Decoder takes a raw symbol sequence of variable length as input.

## How does this paper achieve it?

### RNN Encoder–Decoder

The encoder reads each symbol of an input sequence $x$ sequentially, and the hidden state of the RNN is updated as follows:

$\boldsymbol{h}_{<t>} = f(\boldsymbol{h}_{<t-1>}, x_t)$

where $f$ is a non-linear activation function.

After the encoder has read the end of the sequence, its hidden state is a summary $\boldsymbol{c}$ of the whole input sequence.
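The encoder recurrence can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a simple $\tanh$ update stands in for $f$, symbols are one-hot vectors, and the weight names `W_h` and `W_x`, the sizes, and the random initialization are all made up for the example.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def encoder_step(h_prev, x_t, W_h, W_x):
    """One step h_t = f(h_{t-1}, x_t), with f = tanh of a linear map (an assumption)."""
    a = matvec(W_h, h_prev)
    b = matvec(W_x, x_t)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]

def encode(xs, W_h, W_x, hidden_size):
    """Read the sequence left to right; the final hidden state is the summary c."""
    h = [0.0] * hidden_size
    for x_t in xs:
        h = encoder_step(h, x_t, W_h, W_x)
    return h

def one_hot(i, size):
    """One-hot vector for symbol index i."""
    return [1.0 if j == i else 0.0 for j in range(size)]

rng = random.Random(0)
hidden_size, input_size = 4, 3
# Hypothetical, randomly initialized weights (untrained).
W_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
W_x = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]

# Sequences of different lengths map to summaries of the same fixed size.
c_short = encode([one_hot(0, input_size), one_hot(2, input_size)], W_h, W_x, hidden_size)
c_long = encode([one_hot(i, input_size) for i in (1, 0, 2, 1, 0)], W_h, W_x, hidden_size)
print(len(c_short), len(c_long))
```

The point of the sketch is the shape of the computation: however long the input is, the loop ends with a single vector of length `hidden_size`.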


The decoder is another RNN which predicts the next symbol $y_t$ given the hidden state $\boldsymbol{h}_{<t>}$.

In the decoder RNN, both $y_t$ and $\boldsymbol{h}_{<t>}$ are conditioned on $y_{t-1}$ and on the summary $\boldsymbol{c}$. Hence, the hidden state of the decoder RNN at time $t$ is

$\boldsymbol{h}_{<t>} = f(\boldsymbol{h}_{<t-1>}, y_{t-1}, \boldsymbol{c})$

Similarly, the conditional distribution of the next symbol is 

$P(y_t \lvert y_{t-1}, y_{t-2}, ..., y_1, \boldsymbol{c}) = g(\boldsymbol{h}_{<t>}, y_{t-1}, \boldsymbol{c})$

where $g$ is a function that produces valid probabilities, such as a softmax.
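A single decoder step, following the two equations above, might look like the sketch below. As before, this is an assumption-laden illustration: $f$ is taken to be $\tanh$ of a linear map, $g$ is a softmax over a linear map, and every parameter name (`W_h`, `W_y`, `W_c`, `V_h`, `V_y`, `V_c`), size, and value is hypothetical.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(parts) for parts in zip(*vectors)]

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def decoder_step(h_prev, y_prev, c, p):
    """h_t = f(h_{t-1}, y_{t-1}, c) and P(y_t | ...) = g(h_t, y_{t-1}, c)."""
    pre = add(matvec(p["W_h"], h_prev), matvec(p["W_y"], y_prev), matvec(p["W_c"], c))
    h_t = [math.tanh(a) for a in pre]
    logits = add(matvec(p["V_h"], h_t), matvec(p["V_y"], y_prev), matvec(p["V_c"], c))
    return h_t, softmax(logits)

def rand_matrix(rng, rows, cols):
    """Hypothetical random initialization (untrained weights)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
hidden, vocab, summary = 4, 3, 4
params = {
    "W_h": rand_matrix(rng, hidden, hidden),
    "W_y": rand_matrix(rng, hidden, vocab),
    "W_c": rand_matrix(rng, hidden, summary),
    "V_h": rand_matrix(rng, vocab, hidden),
    "V_y": rand_matrix(rng, vocab, vocab),
    "V_c": rand_matrix(rng, vocab, summary),
}

c = [0.1, -0.2, 0.3, 0.0]    # made-up summary vector from an encoder
h0 = [0.0] * hidden          # initial decoder state
y_prev = [1.0, 0.0, 0.0]     # previous output symbol, one-hot
h1, probs = decoder_step(h0, y_prev, c, params)
print(probs)                 # a distribution over the `vocab` symbols
```

Note that $\boldsymbol{c}$ enters both the state update and the output distribution at every step, which is what distinguishes this decoder from a plain RNN language model.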

<img src="img/Learning Phrase Representations using RNN Encoder–Decoder_for_Statistical_Machine_Translation_figure1.png" width="300">


The two RNNs are jointly trained to maximize the conditional log-likelihood

$\max_\theta \frac{1}{N}\sum_{n=1}^N \log p_\theta(\boldsymbol{y}_n \lvert \boldsymbol{x}_n)$

where $\theta$ is the set of the model parameters.
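To make the objective concrete, here is a minimal sketch of how it could be evaluated. It assumes the sequence probability factorizes by the chain rule into the per-step probabilities the decoder assigns to the correct symbols; the function names and the toy probabilities are invented for the example.

```python
import math

def sequence_log_prob(step_probs):
    """log p(y | x) as the sum of log-probabilities of each correct symbol."""
    return sum(math.log(p) for p in step_probs)

def mean_log_likelihood(batch):
    """(1/N) * sum_n log p(y_n | x_n), the quantity maximized over theta."""
    return sum(sequence_log_prob(steps) for steps in batch) / len(batch)

# Toy batch of N = 2 target sequences; each inner list holds the probability
# assigned to the correct symbol at each step (made-up numbers).
batch = [[0.5, 0.5], [0.25]]
objective = mean_log_likelihood(batch)
print(objective)  # equals log(0.25), since both sequences have probability 0.25
```

In training, a gradient-based optimizer would adjust $\theta$ to increase this value; the sketch only evaluates it.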



### Statistical Machine Translation

A common statistical machine translation system aims to find a translation $\boldsymbol{f}$ given source sentence $\boldsymbol{e}$ which maximizes

$p(\boldsymbol{f} \lvert \boldsymbol{e}) \propto p(\boldsymbol{e} \lvert \boldsymbol{f}) p(\boldsymbol{f})$ (from Bayes' theorem)

where $p(\boldsymbol{e} \lvert \boldsymbol{f})$ is called the translation model and $p(\boldsymbol{f})$ the language model.

In practice, however, most SMT systems model $\log p(\boldsymbol{f} \lvert \boldsymbol{e})$ as a log-linear model with additional features and corresponding weights:

$\log p(\boldsymbol{f} \lvert \boldsymbol{e}) = \sum_{n=1}^N w_n f_n(\boldsymbol{f}, \boldsymbol{e}) + \log Z(\boldsymbol{e})$

where $f_n$ and $w_n$ are the $n$-th feature and weight, respectively. $Z(\boldsymbol{e})$ is a normalization constant that does not depend on the weights. The weights are often optimized to maximize the BLEU score on a development set.
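As a rough illustration of the log-linear model, the sketch below ranks two candidate translations of the same source sentence by their weighted feature sums. Because $\log Z(\boldsymbol{e})$ is the same for every candidate of a given $\boldsymbol{e}$, it can be ignored when picking the best one. The feature names, weights, and values are all hypothetical.

```python
def loglinear_score(features, weights):
    """Weighted feature sum: sum_n w_n * f_n(f, e), leaving out log Z(e)."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical weights for two features: translation-model and language-model log scores.
weights = {"log_tm": 1.0, "log_lm": 0.5}

# Made-up feature values for two candidate translations of one source sentence.
candidates = {
    "translation A": {"log_tm": -2.0, "log_lm": -3.0},
    "translation B": {"log_tm": -2.5, "log_lm": -1.0},
}

scores = {name: loglinear_score(feats, weights) for name, feats in candidates.items()}
best = max(scores, key=scores.get)
print(scores)  # A: -3.5, B: -3.0
print(best)    # translation B
```

Changing the weights changes the ranking, which is why they are tuned against a translation-quality metric rather than fixed by hand.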

The RNN Encoder-Decoder, trained on a table of phrase pairs, can be used to score a given pair of input and output sequences. The score is simply the probability $p_\theta(\boldsymbol{y} \lvert \boldsymbol{x})$.

This score is used as an additional feature in the log-linear model when tuning the SMT decoder. Once the RNN Encoder-Decoder is trained, a new score for each phrase pair is added to the existing phrase table.
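The rescoring step can be pictured as appending one more column to the phrase table. The sketch below assumes a phrase table stored as a list of dictionaries and uses a lookup table as a stand-in for the trained model; the table layout, the phrases, the function names, and all numbers are invented for illustration and do not reflect any particular SMT toolkit's format.

```python
def add_model_score(phrase_table, score_fn):
    """Return a copy of the phrase table with one extra score per phrase pair."""
    return [
        {**entry, "scores": entry["scores"] + [score_fn(entry["source"], entry["target"])]}
        for entry in phrase_table
    ]

# Toy phrase table with two existing feature scores per pair (made-up values).
phrase_table = [
    {"source": "the cat", "target": "le chat", "scores": [0.6, 0.5]},
    {"source": "the cat", "target": "la chat", "scores": [0.2, 0.1]},
]

# Hypothetical stand-in for a trained model's p_theta(y | x).
toy_probabilities = {
    ("the cat", "le chat"): 0.7,
    ("the cat", "la chat"): 0.05,
}

def toy_score(source, target):
    """Look up the made-up model probability for a phrase pair."""
    return toy_probabilities[(source, target)]

extended = add_model_score(phrase_table, toy_score)
for entry in extended:
    print(entry["source"], "|||", entry["target"], "|||", entry["scores"])
```

The existing scores are left untouched, so the new feature simply receives its own weight when the log-linear model is tuned.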


## Dataset used in this study


- WMT’14 translation task

## Implementations




## Further Readings
