# Sequence to Sequence (Seq2Seq)

Hello! For this next section, I'd like to introduce you to Jay Alammar. Jay has done some great work in interactive explorations of neural networks. If you haven't already, make sure you check out [his blog](http://jalammar.github.io/).

Jay will be teaching you about a particular RNN architecture called "sequence to sequence". In this case, you feed in a sequence of data and the network will output another sequence. This is typically used in problems such as machine translation, where you'd feed in a sentence in English and get out a sentence in Arabic.

Sequence to sequence will prepare you for the next section, **Deep Learning Attention**, also taught by Jay.

## Lesson Outline
1. Applications
2. Architectures
3. Architectures in More Depth

# <a id='0'>0: Introduction</a>

We know that we can do simple sentiment analysis using ordinary feedforward neural networks: the network learns how positive or negative each word is and can then judge whether a sequence as a whole has positive or negative things to say about its subject. However, we start running into issues when we want to build more advanced models that deal with language and sequential data. Let's look at an example from the video.

<img src="assets/images/05/img_001.png" width=700 align='center'>

Suppose we want to find the year in each of these two sentences:
> I went to Nepal in **2009**<br>
> In **2009**, I went to Nepal

If we used a typical feedforward network, it would have to have separate parameters for each input feature (i.e., every word). Technically, it would have to learn all the rules of language separately at each position in the input sentence.

Recurrent nets are a powerful class of neural nets that deal with sequential data. They are especially suited for language and translation tasks because they can extend to sequences of any length. More importantly, they share parameters across different time steps. When they do learn a language model, they do it more efficiently than a traditional feed-forward network would.
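To make the parameter-sharing point concrete, here is a rough sketch that counts weights for the two approaches. The layer sizes (`embed_dim=100`, `hidden_dim=128`) and both helper functions are illustrative assumptions, not values from the lesson; the feedforward count assumes a single dense layer applied to the concatenated word vectors, and the RNN count assumes a vanilla (non-LSTM) cell.

```python
def feedforward_param_count(seq_len, embed_dim, hidden_dim):
    """Weights + biases of one dense layer over the concatenated sequence.

    Every position in the sentence gets its own slice of the weight matrix,
    so the count grows with the sequence length.
    """
    return seq_len * embed_dim * hidden_dim + hidden_dim


def rnn_param_count(embed_dim, hidden_dim):
    """Weights + biases of a vanilla RNN cell.

    Input-to-hidden weights, hidden-to-hidden weights, and one bias vector.
    The same parameters are reused at every time step.
    """
    return embed_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim


for seq_len in (5, 50, 500):
    ff = feedforward_param_count(seq_len, embed_dim=100, hidden_dim=128)
    rnn = rnn_param_count(embed_dim=100, hidden_dim=128)
    print(f"length {seq_len:>3}: feedforward={ff:>9,}  rnn={rnn:,}")
```

The feedforward count grows with the sentence length, while the RNN count stays the same because its weights do not depend on the position in the sequence.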

<img src="assets/images/05/img_002.png" width=700 align='center'>

[Source](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

**Sequential data** can refer to the input to the model, the output of the model, or both. The above image demonstrates different kinds of RNNs that are suited for different types of tasks.

* many-to-one: reads a **sequence** of words and outputs a single value<br>
> input: sequential<br>
> output: singular<br>
* many-to-many: reads a sequence of words and outputs a sequence of words<br>
> input: sequential<br>
> output: sequential<br>

In the second *many-to-many* option, we use a single RNN, so we are forced to output at most as many vectors as we input. That wouldn't work well for a chatbot, where the response should not be limited by the length of the input. We also want the model to take in the entire input before it starts generating a response.
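The single-RNN case can be sketched with a toy scalar RNN. This is only an illustration: `run_rnn` and its fixed weights `w_x` and `w_h` are hypothetical, and a real network would learn vector-valued weights. The point is that one RNN emits exactly one output per input step, so the output can never be longer than the input.

```python
import math


def run_rnn(inputs, w_x=0.5, w_h=0.5):
    """Toy scalar RNN (hypothetical fixed weights) that emits one output per input."""
    hidden = 0.0
    outputs = []
    for x in inputs:
        # The same weights are applied at every time step.
        hidden = math.tanh(w_x * x + w_h * hidden)
        outputs.append(hidden)
    return outputs


inputs = [0.2, -0.1, 0.4]
outputs = run_rnn(inputs)

many_to_one = outputs[-1]  # keep only the final output
many_to_many = outputs     # keep one output per input step

print(len(inputs), len(many_to_many))
```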

The first *many-to-many* option is composed of two RNNs and can map a sequence of any length to another sequence of any length. The basic premise is that we use an input network (the encoder) and an output network (the decoder): the first reads the input sequence and the second generates the output sequence. The encoder makes this possible by handing what it learned to the decoder.

<img src="assets/images/05/img_003.png" width=700 align='center'>



# 1: Applications

The term "sequence to sequence" may seem abstract and may not make clear what can be done with this kind of architecture. So here are some rudimentary examples of what it means to take in any sequence and produce another sequence (if it can be represented as a vector, it can be used in a seq2seq model):

1. **Translation Model**
> input: English phrases<br>
> target: French phrases<br>
2. **Summarization Model**
> input: news articles<br>
> target: summary<br>
3. **Question Answering Model**
> input: questions<br>
> target: answers<br>
4. **Chatbot**
> input: transmission dialog<br>
> target: response dialog<br>

But inputs don't have to be words; they can also be images or audio. In those cases, the RNNs are often used alongside convolutional nets.

<img src="assets/images/05/img_004.png" width=700 align='center'>

There are many options. The biggest challenge is to find the right data set to build what you are looking for.

# 2: Architectures

Recall the basic two-RNN architecture of encoder and decoder:
<img src="assets/images/05/img_005.png" width=700 align='center'>

The first RNN, the encoder, reads the input sequence and hands over what it has understood to the second RNN, the decoder, which generates the output sequence.

The "understanding" that the encoder "hands over" is a fixed-size tensor that is called a "state" or "context". No matter how long the inputs and outputs are, the context keeps the size that was set when you built the model.
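As a minimal sketch of the fixed-size context, the toy encoder below runs a vanilla RNN over inputs of two different lengths and returns the final hidden state each time. Everything here is an assumption for illustration: the sizes `HIDDEN_SIZE` and `EMBED_SIZE`, the random (untrained) weights, and the random vectors standing in for word embeddings.

```python
import math
import random

HIDDEN_SIZE = 4  # hypothetical context size, fixed when the model is built
EMBED_SIZE = 3   # hypothetical embedding size

rng = random.Random(0)


def random_matrix(rows, cols):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


W_xh = random_matrix(HIDDEN_SIZE, EMBED_SIZE)   # input-to-hidden weights
W_hh = random_matrix(HIDDEN_SIZE, HIDDEN_SIZE)  # hidden-to-hidden weights


def encode(embedded_tokens):
    """Run a vanilla RNN over the inputs and return the final hidden state."""
    hidden = [0.0] * HIDDEN_SIZE
    for x in embedded_tokens:
        # The new state is built from the current input and the previous state.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(W_xh[i], x))
                + sum(w * hj for w, hj in zip(W_hh[i], hidden))
            )
            for i in range(HIDDEN_SIZE)
        ]
    return hidden


short_input = [[rng.uniform(-1, 1) for _ in range(EMBED_SIZE)] for _ in range(2)]
long_input = [[rng.uniform(-1, 1) for _ in range(EMBED_SIZE)] for _ in range(20)]

# Both contexts have HIDDEN_SIZE elements, whatever the input length.
print(len(encode(short_input)), len(encode(long_input)))
```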

At a high level, inference works as follows: the **inputs** are handed to the **encoder**, the encoder summarizes what it understood into a state (or context) variable and hands it over to the decoder, and the decoder then generates the output sequence.

Going a level deeper:
Since the encoder and decoder are both RNNs, they have **loops** that allow them to process these sequences of inputs and outputs.

<img src="assets/images/05/img_006.png" width=700 align='center'>

As an example, think of a chatbot:

We want to ask it *How are you?*. First we tokenize the input into four elements, $\left[\text{How},\ \text{are},\ \text{you},\ \text{?}\right]$. Because there are 4 elements, it will take the RNN 4 time steps (loops) to read in the entire sequence. At each loop, it reads the input, applies a transformation to the hidden state, and passes the hidden state on to the next time step. In the diagram, the clock symbol indicates we are moving from one time step to the next.

<img src="assets/images/05/img_007.png" width=700 align='center'>
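The four time steps can be sketched in a few lines. This is a toy under stated assumptions: the regular-expression tokenizer, the `vocab` mapping, and the scalar `rnn_step` with fixed weights are all hypothetical stand-ins for a real tokenizer, embedding layer, and trained cell.

```python
import math
import re


def tokenize(text):
    """Split into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def rnn_step(token_value, hidden, w_x=0.1, w_h=0.5):
    """One toy time step on a scalar hidden state (hypothetical fixed weights)."""
    return math.tanh(w_x * token_value + w_h * hidden)


tokens = tokenize("How are you?")
vocab = {token: index + 1 for index, token in enumerate(tokens)}  # toy vocabulary

hidden = 0.0
steps = 0
for token in tokens:
    # Read one token, transform the hidden state, pass it to the next step.
    hidden = rnn_step(vocab[token], hidden)
    steps += 1
    print(f"time step {steps}: read {token!r}, hidden state = {hidden:.3f}")
```

The loop body runs once per token, so a four-token input takes four time steps.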

If we think about the RNN as a loop, we can also "unroll" it and show the steps laid out sequentially in a line, with blocks representing the hidden state. Note that the blocks are actually only ONE block, which is updated to a new "state" at each time step by a transformation of the new input and the state from the previous time step:

<img src="assets/images/05/img_008.png" width=700 align='center'>

So, what is the **hidden state**? The hidden state is a vector whose length is the number of hidden units inside the cell. In practice, it's most likely to be the hidden state inside an LSTM (long short-term memory) cell. This size is another hyperparameter that can be set. Generally, the bigger the hidden state, and thus the bigger the network, the more capacity the model has to observe and learn patterns. However, the larger the network, the more resources are needed to train and deploy it.

<img src="assets/images/05/img_009.png" width=700 align='center'>
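To get a feel for how the hidden size drives model size, the sketch below counts the parameters of a single LSTM layer. It assumes the standard formulation with four gates and one bias vector per gate; `lstm_param_count` is a hypothetical helper, and `input_size=100` is an arbitrary illustrative value. Some libraries store an extra bias vector per gate, so their reported counts can differ slightly.

```python
def lstm_param_count(input_size, hidden_size):
    """Parameters in one LSTM layer under the standard formulation.

    Four gates (input, forget, cell candidate, output), each with input
    weights, recurrent weights, and one bias vector.
    """
    per_gate = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return 4 * per_gate


for hidden_size in (128, 256, 512):
    count = lstm_param_count(input_size=100, hidden_size=hidden_size)
    print(f"hidden size {hidden_size}: {count:,} parameters")
```

Because of the `hidden_size * hidden_size` term, doubling the hidden size more than doubles the parameter count.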

Similar things happen with the decoder. We begin by "feeding" the decoder the state (or context) generated by the encoder, then it generates the output value(s) element by element.

<img src="assets/images/05/img_010.png" width=700 align='center'>

If we unroll it like we did the encoder, we see that we feed back into it every element that it outputs. This helps to create coherent outputs, as each previous time step informs the succeeding output. It's as though the current time step *remembers* what the previous time step has emitted.

<img src="assets/images/05/img_011.png" width=700 align='center'>
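The feedback loop in the decoder can be sketched as follows. This is a toy, not a trained model: `TOY_TRANSITIONS` is a hypothetical lookup table standing in for a learned decoder cell, the `<start>` and `<end>` markers are assumed conventions, and the context values are made up. What it does show accurately is the control flow, where each generated token becomes the input to the next time step.

```python
START, END = "<start>", "<end>"

# Hypothetical stand-in for a trained decoder cell: previous token -> next token.
TOY_TRANSITIONS = {START: "I", "I": "am", "am": "fine", "fine": END}


def decoder_step(previous_token, state):
    """Return (next_token, new_state).

    A real cell would compute both from learned weights; this toy only
    looks up the next token and counts the steps taken so far.
    """
    context, step_count = state
    next_token = TOY_TRANSITIONS.get(previous_token, END)
    return next_token, (context, step_count + 1)


def greedy_decode(context, max_len=10):
    """Generate tokens one at a time, feeding each output back in as the next input."""
    state = (context, 0)  # the decoder starts from the encoder's context
    token = START
    output = []
    for _ in range(max_len):
        token, state = decoder_step(token, state)
        if token == END:
            break
        output.append(token)
    return output


print(greedy_decode(context=[0.1, -0.3, 0.7]))  # prints ['I', 'am', 'fine']
```

The `max_len` cap is a common safeguard so that generation stops even if the model never emits the end marker.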

Connecting all of that together, we get something like this:

<img src="assets/images/05/img_012.png" width=700 align='center'>

# 3: Architectures in More Depth