# Encoder Decoder

## What you will learn in this course 🧐🧐

This course is dedicated to encoder-decoder models. This type of architecture was introduced in 2014 and was later adopted by Google for the neural machine translation system behind Google Translate.

Although this type of architecture has been very successful for machine translation (often called NMT, for Neural Machine Translation), it has also paved the way for other applications, such as image captioning.

The key benefit of this method is the ability to process inputs and outputs of arbitrary length. Length will of course affect the quality of the results, but the model can still process such sequences, contrary to the approaches we studied so far.

The outline of the course is the following:

### The encoder decoder architecture 

Ilya Sutskever and his co-authors introduced this encoder-decoder architecture in 2014. It marked an important moment: it was presented as the first purely neural machine translation model to outperform a phrase-based statistical machine translation baseline on a large translation task.

This model was applied to an English-to-French translation problem.

The first data-processing trick was to add an `<EOS>` (End-of-Sequence) token at the end of each sentence so that the model learns when to stop making predictions for the target sequence. This is what gives it the ability to output sequences of arbitrary length!
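As a minimal sketch of this preprocessing step, the snippet below appends the `<EOS>` token to each sentence. The whitespace tokenization and the helper name `add_eos` are assumptions made for illustration, not the tokenizer used in the original work.

```python
EOS = "<EOS>"

def add_eos(sentences):
    """Split each sentence on whitespace and append the end-of-sequence token."""
    return [sentence.split() + [EOS] for sentence in sentences]

tokenized = add_eos(["I am happy", "thank you"])
print(tokenized)
```

Every sequence now ends with the same marker, whatever its length, which gives the decoder a single, learnable stopping signal.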

The second trick for succeeding at machine translation is to simplify the problem. Limiting the vocabulary size in the tokenizer helps the model focus on the most common words, instead of including even the rarest terms, which may end up adding noise to the data. Out-of-vocabulary words that are not referenced in the tokenizer are simply replaced by a special unknown token, `<UNK>`.
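The vocabulary limit can be illustrated with a few lines of plain Python. This is only a sketch: the helper names `build_vocab` and `replace_oov`, the toy corpus, and the vocabulary size of 2 are all hypothetical choices for the example.

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(tokenized_sentences, max_size):
    """Keep only the max_size most frequent tokens of the corpus."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}

def replace_oov(tokens, vocab):
    """Replace every out-of-vocabulary token by the unknown token."""
    return [tok if tok in vocab else UNK for tok in tokens]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "axolotl", "slept"]]
vocab = build_vocab(corpus, max_size=2)   # keeps "the" and "sat"
print(replace_oov(["the", "axolotl", "sat"], vocab))
# → ['the', '<UNK>', 'sat']
```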



#### Simplified general principle

A simplified representation of the encoder decoder architecture is presented below:

![encoder_decoder_simplify](https://full-stack-assets.s3.eu-west-3.amazonaws.com/models/M08_Deep_learning/encoder_decoder/encoder_decoder_simplify.png)

The general idea is the following:

* The encoder model (usually an RNN) processes the input and produces the encoder output.
* The encoder output is then fed to the decoder model (usually also an RNN). It is only the first input the decoder receives: the decoder is then applied as many times as needed to produce the full output sequence, stopping at the `<EOS>` token.

Running the decoder in a loop to produce the successive elements of the output sequence is central to the encoder-decoder architecture.
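The loop can be sketched as follows. The encoder and decoder here are toy stand-ins written for this example (a real model would use RNN layers), and `generate`, `toy_encode`, `toy_decode_step`, the `<start>` seed token and the `max_len` safety limit are all assumptions of the sketch.

```python
START, EOS = "<start>", "<EOS>"

def toy_encode(source_tokens):
    # Stand-in for an RNN encoder: the "state" is simply the reversed source.
    return list(reversed(source_tokens))

def toy_decode_step(state, previous_token):
    # Stand-in for one RNN decoder step: emit the next stored token
    # in upper case, or <EOS> once the state is exhausted.
    if not state:
        return EOS, state
    return state[0].upper(), state[1:]

def generate(source_tokens, encode, decode_step, max_len=50):
    state = encode(source_tokens)        # the encoder runs once
    token, output = START, []
    for _ in range(max_len):             # the decoder runs in a loop
        token, state = decode_step(state, token)
        if token == EOS:                 # stop as soon as <EOS> is produced
            break
        output.append(token)
    return output

print(generate(["a", "b", "c"], toy_encode, toy_decode_step))
# → ['C', 'B', 'A']
```

The `max_len` cap is a practical guard: an untrained model may never emit `<EOS>`, and the loop would otherwise not terminate.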

Let's study this in a little more detail to make things clearer!



#### Encoder Decoder in detail

To really understand the principle of the encoder-decoder model, we need to go a little deeper. The following figure will help:

![encoder_decoder_detail](https://full-stack-assets.s3.eu-west-3.amazonaws.com/models/M08_Deep_learning/encoder_decoder/Encoder_decoder_detail.png)

This figure illustrates the behavior of an encoder decoder model when applied to a sequence to sequence problem (in this case translation). Let's detail the process:

* The input sequence passes through an LSTM layer (or any RNN structure) in the encoder part of the model.
* The encoder output is then fed to the RNN layer of the decoder, along with a `<start>` token, and the decoder output gives the prediction of the first word of the target sequence.
* The following steps rely on **teacher forcing**: the hidden state produced by the RNN layer(s) AND the true previous element of the target sequence are used to produce the next prediction. At inference time, when the target is unknown, the model's own previous prediction takes its place.
* The process ends when the `<EOS>` token is reached.

This process has two main advantages over what happens in classic RNN architectures. First, teacher forcing helps the model not to get "lost". In classic RNN architectures, predictions are made using only the previous hidden state. If the RNN layer fails to extract useful information at some step, the way it analyses the following elements of the sequence is weighed down by that error, which then propagates to all subsequent predictions. Teacher forcing lets the model know what it was supposed to predict after each element in the sequence, so even if the first predicted element is completely off, the next steps still have a chance of getting back on the right path.
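In practice, teacher forcing amounts to shifting the target sequence by one position: at each step the decoder receives the true previous token and must predict the next one. The helper `teacher_forcing_pairs` below is a hypothetical name for this sketch, and the French sentence is just an example.

```python
START, EOS = "<start>", "<EOS>"

def teacher_forcing_pairs(target_tokens):
    """Return (decoder_inputs, expected_outputs) for one target sentence."""
    decoder_inputs = [START] + target_tokens   # what the decoder is given
    expected = target_tokens + [EOS]           # what it must predict
    return decoder_inputs, expected

inputs, expected = teacher_forcing_pairs(["je", "suis", "content"])
for given, to_predict in zip(inputs, expected):
    print(f"{given:>8} -> {to_predict}")
```

Because the inputs come from the ground truth rather than from the model's own output, a wrong prediction at one step does not corrupt the input of the next step during training.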

Secondly, because the decoder is applied in a loop until the `<EOS>` token occurs, this type of architecture can be trained on, and can infer, sequences of arbitrary length!

You should understand by now that such an architecture cannot simply be trained with TensorFlow's standard `.fit` method, and instead calls for a custom training loop.
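To give an idea of what a custom training step has to compute, here is a framework-free sketch of the teacher-forced loss for one sentence. It is not TensorFlow code: the function `teacher_forced_loss`, the `decode_step` interface (returning a probability dictionary and a new state) and the uniform toy decoder are assumptions for illustration, and the gradient computation and weight update (done with `tf.GradientTape` in a TensorFlow loop) are left out.

```python
import math

START, EOS = "<start>", "<EOS>"

def teacher_forced_loss(encoder_state, target_tokens, decode_step):
    """Average negative log-likelihood of the target under teacher forcing."""
    state, previous, total = encoder_state, START, 0.0
    expected = target_tokens + [EOS]
    for true_token in expected:
        # One decoder step: distribution over the next token, plus new state.
        probs, state = decode_step(state, previous)
        total += -math.log(probs.get(true_token, 1e-12))
        previous = true_token   # feed the true token, not the prediction
    return total / len(expected)

def uniform_decode_step(state, previous_token):
    # Hypothetical untrained decoder: uniform over a 4-token vocabulary.
    vocab = ["je", "suis", "content", EOS]
    return {tok: 1 / len(vocab) for tok in vocab}, state

loss = teacher_forced_loss(None, ["je", "suis", "content"], uniform_decode_step)
print(round(loss, 4))   # log(4) for a uniform guess over 4 tokens
```

The explicit loop over time steps, with the previous true token fed back in, is the part a single high-level training call does not express directly.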

The main drawback of this architecture is that the encoder has to output a fixed-dimensional vector to feed the decoder. For longer sequences, more information has to be packed into that same vector, resulting in information loss and degraded performance.
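This bottleneck can be written down compactly. The notation is an assumption introduced here: $x_1, \dots, x_T$ is the input sequence, $y_1, \dots, y_{T'}$ the target sequence, $h_t$ the encoder hidden state at step $t$, $f$ the recurrent cell, and $c$ the fixed-size vector passed to the decoder.

$$
h_t = f(x_t, h_{t-1}), \qquad c = h_T, \qquad
p(y_1, \dots, y_{T'} \mid x_1, \dots, x_T) = \prod_{t=1}^{T'} p(y_t \mid c, y_1, \dots, y_{t-1})
$$

The size of $c$ does not depend on $T$: whether the input has 5 words or 50, everything the decoder knows about it must fit in the same number of dimensions.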

### Conclusion

The encoder-decoder architecture really helped the field of deep learning for NLP leap forward, and it also paved the way for more sophisticated techniques such as attention, which we will study in the next course.

Now let's get to practice!