# ___Encoder-Decoder Sequence to Sequence___

## ___What is Seq2Seq?___

_Seq2Seq (Sequence to Sequence) is a method that generates one sequence from another given sequence (long sentences, paragraphs, extracted image features, audio signals, etc.). It was first proposed in 2014, when two papers described its main idea: "[Sequence to Sequence Learning with Neural Networks](https://arxiv.org/pdf/1409.3215.pdf)" from the Google Brain team and "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" from Yoshua Bengio's team. The two papers arrived at similar solutions at about the same time, and Seq2Seq was born._

_As a simple example, in machine translation: input (Hello) ---> output (hello). As another example, in human-machine dialogue, we ask the machine "Who are you?" and the machine returns the answer "I am XX"._

_The core idea of the Seq2Seq model is to convert an input sequence into an output sequence through a deep neural network. This conversion consists of two steps: __Encoding and Decoding__. In the classic implementation, the encoder and decoder are each __composed of a recurrent neural network (a plain RNN, LSTM, or GRU)__. In Seq2Seq, the two recurrent neural networks are trained together._

<img src='https://www.guru99.com/images/1/111318_0848_seq2seqSequ4.png'/>

### ___Applications of Seq2Seq___
_As computing, artificial intelligence, and algorithm research have advanced, Seq2Seq has found applications in many fields:_

* _Machine translation (Google's neural translation system was built on Seq2Seq with an attention mechanism)_
* _Chatbots (Microsoft Xiaobing, also known as XiaoIce, uses Seq2Seq technology)_
* _Automatic text summarization (used by the news app Toutiao, "Today's Headlines")_
* _Automatic image captioning_
* _Machine-written poetry, code completion, commit message generation, story style rewriting, etc._

## ___What is Encoder-Decoder?___

_The Encoder-Decoder model is mainly a concept in the NLP field. It does not refer to one specific algorithm; it is a general term for a class of algorithms. Encoder-Decoder can be regarded as a general framework under which different algorithms can be used to solve different tasks._

<img src='https://miro.medium.com/proxy/1*3lj8AGqfwEE5KCTJ-dXTvg.png' width = 400/>

_The model consists of 3 parts: __encoder__, __intermediate (encoder) vector__ and __decoder__._

<img src='https://miro.medium.com/max/3972/1*1JcHGUU7rFgtXC_mydUA_Q.jpeg' width = 600/>

___Encoder___
* _A stack of several recurrent units (LSTM or GRU cells for better performance) where each accepts a single element of the input sequence, collects information for that element and propagates it forward._
* _In a question-answering problem, the input sequence is a collection of all words from the question. Each word is represented as x_i where i is the position of that word._
* _The hidden states h_i are computed using the formula:_
<img src='https://miro.medium.com/max/700/1*sKqGIDJm3P8DeSwl0WHGkg.png'/>

* _This simple formula represents the result of an ordinary recurrent neural network. As you can see, we just apply the appropriate weights to the previous hidden state h_(t-1) and the input vector x_t._

* _Example: Consider the input sequence “I am a Student” to be encoded. There are 4 time steps in total (4 tokens) for the encoder. At each time step, the hidden state h is updated using the previous hidden state and the current input._

* _At the first time step t1, the previous hidden state h0 is set to zero or chosen randomly. The first RNN cell computes the current hidden state from the first input and h0. Each cell outputs two things: the updated hidden state and an output for that step. The encoder's per-step outputs are discarded and only the hidden states are propagated to the next time step._

* _At the second time step t2, the hidden state h1 and the second input X[2] are given as input, and the cell produces the updated hidden state h2 from both. This repeats for all four steps of the example._

<img src='https://miro.medium.com/max/700/1*J0tt1Xncos1kficN80DjuQ.png' width = 400/>
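
_The encoder update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the weight names `W_hh` and `W_hx`, the choice of tanh for the activation, the sizes, and the random embeddings are all assumptions made for the example._

```python
import numpy as np

def encode(inputs, W_hh, W_hx, h0=None):
    """Run a plain RNN over `inputs` and return the hidden state of every step."""
    h = np.zeros(W_hh.shape[0]) if h0 is None else h0   # h0 defaults to zeros
    states = []
    for x_t in inputs:                        # one time step per token
        h = np.tanh(W_hh @ h + W_hx @ x_t)    # combine previous state and current input
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
tokens = ["I", "am", "a", "Student"]
embed_size, hidden_size = 3, 5

# Hypothetical random embeddings and weights, for illustration only.
inputs = rng.normal(size=(len(tokens), embed_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_hx = rng.normal(scale=0.5, size=(hidden_size, embed_size))

states = encode(inputs, W_hh, W_hx)
encoder_vector = states[-1]    # final hidden state, handed to the decoder
print(states.shape)            # (4, 5): four time steps, five hidden units
```

_Only `encoder_vector` leaves the encoder in the basic model; the intermediate states are kept here just to make the four steps visible._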

___Encoder Vector___
* _This is the final hidden state produced from the encoder part of the model. It is calculated using the formula above._
* _This vector aims to encapsulate the information for all input elements in order to help the decoder make accurate predictions._
* _It acts as the initial hidden state of the decoder part of the model._

___Decoder___
* _A stack of several recurrent units where each predicts an output y_t at a time step t._
* _Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state._
* _In the question-answering problem, the output sequence is a collection of all words from the answer. Each word is represented as y_i where i is the order of that word._
* _Any hidden state h_i is computed using the formula:_
<img src='https://miro.medium.com/max/700/1*sdxvcjeV7NOUsR_VQ_nrUQ.png'/>
* _As you can see, we are just using the previous hidden state to compute the next one._
* _The output y_t at time step t is computed using the formula:_

<img src='https://miro.medium.com/max/700/1*y5T2-J2mrCRZp5M9Q4METw.png'/>

<img src='https://miro.medium.com/max/700/1*cKAWEO4GJ6crmmCkNvNEPQ.png' width = 400/>

_We calculate the outputs using the hidden state at the current time step together with the respective weight W(S). Softmax is used to create a probability vector which will help us determine the final output (e.g. word in the question-answering problem)._
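
_A rough sketch of the decoder side, under the same kind of assumptions as before: the names `W_hh` and `W_s`, the tanh activation, the sizes, and the random weights are illustrative choices, not part of the original description. Each step updates the hidden state from the previous one and turns it into a probability vector with softmax._

```python
import numpy as np

def softmax(z):
    """Turn a score vector into probabilities."""
    e = np.exp(z - np.max(z))    # shift by the max for numerical stability
    return e / e.sum()

def decode(h0, W_hh, W_s, steps):
    """Unroll the decoder for `steps` time steps, starting from the encoder vector."""
    h = h0
    outputs = []
    for _ in range(steps):
        h = np.tanh(W_hh @ h)            # next hidden state from the previous one
        outputs.append(softmax(W_s @ h)) # probability vector over the vocabulary
    return np.stack(outputs)

rng = np.random.default_rng(1)
hidden_size, vocab_size = 5, 6

# Hypothetical encoder vector and weights, for illustration only.
encoder_vector = rng.normal(size=hidden_size)
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_s = rng.normal(scale=0.5, size=(vocab_size, hidden_size))

probs = decode(encoder_vector, W_hh, W_s, steps=3)
predicted_ids = probs.argmax(axis=1)   # most likely word id at each step
print(probs.shape)                     # (3, 6)
```

_Note that the number of decoding steps is chosen independently of the input length._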

_The power of this model lies in the fact that it can map sequences of different lengths to each other. The inputs and outputs are not aligned position by position, and their lengths can differ. This opens up a whole new range of problems that can be solved with such an architecture._

<img src='https://cdn-images-1.medium.com/max/600/1*8OZYn5yMAl4hfdfTwSwKyQ.png'/>

## ___Core Idea of the Seq2Seq Model___

<img src='https://www.guru99.com/images/1/111318_0848_seq2seqSequ2.png'/>

_The Seq2Seq model is mainly used to convert one sequence into another, such as in French-English translation. It consists of two deep neural networks, each of which can be a recurrent network such as an RNN (Recurrent Neural Network) or an LSTM (Long Short-Term Memory). The first network maps the input sequence to a fixed-dimensional vector, which is the encoding process; the second network then maps this vector to the target sequence, which is the decoding process. The model structure of Seq2Seq is shown in the figure above: the model takes the sentence "ABC" as input and generates "WXYZ" as the output sentence._

### ___Seq2Seq Model___

<img src='https://docs.chainer.org/en/stable/_images/seq2seq.png' width = 600/>

* ___Encoder Embedding Layer___

    _The first layer, the encoder embedding layer, converts each word in the input sentence to an embedding vector. When processing the i-th word in the input sentence, the input and output of the layer are the following:_

    _The input is $x_i$: the one-hot vector which represents the i-th word._

    _The output is $\bar{x}_i$: the embedding vector which represents the i-th word._
    

* ___Encoder Recurrent Layer___

    _The encoder recurrent layer generates hidden vectors from the embedding vectors. When processing the i-th embedding vector, the input and output of the layer are the following:_

    _The input is $\bar{x}_i$: the embedding vector which represents the i-th word._

    _The output is $h^{(s)}_i$: the hidden vector of the i-th position._

  
  
* ___Decoder Embedding Layer___

    _The decoder embedding layer converts each word in the output sentence to an embedding vector. When processing the j-th word in the output sentence, the input and output of the layer are the following:_

    _The input is $y_{j-1}$: the one-hot vector which represents the (j - 1)-th word generated by the decoder output layer._

    _The output is $\bar{y}_j$: the embedding vector which represents the (j - 1)-th word._
    
    
    
* ___Decoder Recurrent layer___

    _The decoder recurrent layer generates the hidden vectors from the embedding vectors. When processing the j-th embedding vector, the input and output of the layer are the following:_

    _The input is $\bar{y}_j$: the embedding vector._

    _The output is $h^{(t)}_j$: the hidden vector of the j-th position._
    
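    _For example, when using a uni-directional RNN of one layer, the process can be represented as the following function $\Psi^{(t)}$, where $W^{(t)}$ and $b^{(t)}$ denote the layer's weight matrix and bias:_

    $$h^{(t)}_j = \Psi^{(t)}(\bar{y}_j, h^{(t)}_{j-1}) = \tanh\left(W^{(t)} \begin{bmatrix} h^{(t)}_{j-1} \\ \bar{y}_j \end{bmatrix} + b^{(t)}\right)$$

    _In this case we used tanh as the activation function. We must also use the encoder's hidden vector of the last position as the decoder's hidden vector of the first position, as follows, where $I$ is the length of the input sentence:_

    $$h^{(t)}_0 = z = h^{(s)}_I$$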


* ___Decoder Output Layer___

    _The decoder output layer generates the probability of the j-th word of the output sentence from the hidden vector. When processing the j-th hidden vector, the input and output of the layer are the following:_

    _The input is $h^{(t)}_j$: the hidden vector of the j-th position._

    _The output is $p_j$: the probability of generating the j-th word of the output sentence._
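
_The five layers above can be chained into one forward pass. The sketch below is only illustrative: all parameter names (`E_s`, `E_t`, `W_s`, `W_t`, `W_o` and the biases), the vocabulary sizes, the start symbol `BOS`, the fixed output length, and the greedy choice of the next word are assumptions, and the weights are random rather than trained._

```python
import numpy as np

rng = np.random.default_rng(0)
src_vocab, tgt_vocab, embed, hidden = 7, 8, 4, 5

# Hypothetical parameters; a trained model would learn these.
E_s = rng.normal(size=(embed, src_vocab))                    # encoder embedding matrix
E_t = rng.normal(size=(embed, tgt_vocab))                    # decoder embedding matrix
W_s = rng.normal(scale=0.5, size=(hidden, hidden + embed))   # encoder recurrent weights
W_t = rng.normal(scale=0.5, size=(hidden, hidden + embed))   # decoder recurrent weights
b_s, b_t = np.zeros(hidden), np.zeros(hidden)
W_o = rng.normal(scale=0.5, size=(tgt_vocab, hidden))        # decoder output weights
b_o = np.zeros(tgt_vocab)

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def rnn_step(W, b, h_prev, x_bar):
    """One tanh RNN step on the concatenation [h_prev; x_bar]."""
    return np.tanh(W @ np.concatenate([h_prev, x_bar]) + b)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

source_ids = [1, 4, 2]     # hypothetical word ids for "A B C"
BOS, max_len = 0, 4        # assumed start symbol and fixed output length

# Encoder embedding layer + encoder recurrent layer.
h = np.zeros(hidden)
for i in source_ids:
    x_bar = E_s @ one_hot(i, src_vocab)    # embedding of the i-th word
    h = rnn_step(W_s, b_s, h, x_bar)       # hidden vector of the i-th position

# The decoder's first hidden vector is the encoder's last one.
z = h
h_t, prev_id, generated = z, BOS, []
for _ in range(max_len):
    y_bar = E_t @ one_hot(prev_id, tgt_vocab)   # decoder embedding layer
    h_t = rnn_step(W_t, b_t, h_t, y_bar)        # decoder recurrent layer
    p = softmax(W_o @ h_t + b_o)                # decoder output layer
    prev_id = int(p.argmax())                   # greedy pick of the next word
    generated.append(prev_id)

print(generated)    # four word ids; meaningless here because the weights are random
```

_Multiplying an embedding matrix by a one-hot vector simply selects one column, which is why real implementations use a table lookup instead._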


## ___Cons of Encoder-Decoder Architecture___

<img src='https://miro.medium.com/max/624/0*PaGt4fcpHGUUM-NA.png' width = 400/>

### ___Con 1___
_The above architecture is the basic Seq2Seq model, so it falls short on complex applications: a single hidden vector cannot encapsulate all the information from a sequence._

####  ___Reversing the Sequence___
* _RNNs learn much better when the input sequences are reversed._
* _When you concatenate the source and target sequences, you can see that the words in the source sentence are far from the corresponding words in the target sentence._
* _By reversing the source sentence, the average distance between a word in the source sentence and the corresponding word in the target sentence is unchanged, but the first few words of the source are now very close to the first few words of the target._
* _These short-range dependencies give backpropagation an easier starting point for linking the source to the target._
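
_The distance argument can be checked with a toy calculation. The helpers `reverse_source` and `alignment_distances` below are hypothetical, and the calculation assumes the simplest case: source and target have the same length and the k-th source word corresponds to the k-th target word._

```python
def reverse_source(pairs):
    """Reverse the token order of each source sentence; targets are left untouched."""
    return [(list(reversed(src)), tgt) for src, tgt in pairs]

def alignment_distances(n, reverse=False):
    """Distance from source word k to target word k in the concatenated sequence."""
    src_pos = list(range(n))
    if reverse:
        src_pos = src_pos[::-1]     # after reversal, word k sits at position n-1-k
    return [(n + k) - src_pos[k] for k in range(n)]   # target word k sits at n+k

pairs = [(["A", "B", "C"], ["W", "X", "Y", "Z"])]
print(reverse_source(pairs))                   # [(['C', 'B', 'A'], ['W', 'X', 'Y', 'Z'])]
print(alignment_distances(3))                  # [3, 3, 3]
print(alignment_distances(3, reverse=True))    # [1, 3, 5]
```

_Both lists average to 3, but after reversal the first source word is only one step away from its target word._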


### ___Con 2___
_The neural network needs to compress all the information of the source sentence into a fixed-length vector. This becomes a problem at test time, when the network has to cope with long sentences, especially ones longer than those in the training corpus. The performance of the Encoder-Decoder decreases as the sentences get longer._

####  ___Using Attention Mechanism___
* _In this model, the importance of an individual word can get lost, because the network cannot focus on the important words. This can be solved using an attention mechanism._
* _The attention mechanism stores the outputs of the previous RNN steps._
* _At each step, it ranks all of these outputs by relevance._
* _The word with the highest score is treated as the word to focus on at the current step._
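
_The ranking step can be sketched as follows. The bullets do not say how relevance is scored, so this example assumes a simple dot product between the current decoder state and each stored encoder output, followed by softmax; the function name `attention_weights` and the random vectors are hypothetical._

```python
import numpy as np

def attention_weights(decoder_state, encoder_states):
    """Score every stored encoder output against the decoder state and normalize."""
    scores = encoder_states @ decoder_state    # one dot-product score per source position
    e = np.exp(scores - scores.max())          # softmax, shifted for stability
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(4, 5))   # stored outputs for 4 source words
decoder_state = rng.normal(size=5)         # decoder state at the current step

weights = attention_weights(decoder_state, encoder_states)
ranking = np.argsort(-weights)     # source positions, most relevant first
focus = int(ranking[0])            # position the decoder attends to most
print(ranking, focus)
```

_In practice the weights are used softly, so every source position contributes in proportion to its weight instead of only the top-ranked one._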


### ___Con 3___
_While creating hidden vectors for longer sentences, the model does not capture the full grammatical context. Example: while predicting the output for the n-th word, it considers only the words that come before that word in the sequence. Grammatically, however, the meaning of a word depends on the words both before and after it._

####  ___Bi-directional LSTM___
* _It lets the encoder use the context of both past and future words when creating the encoded context vector, i.e. the output of the encoder._
* _At the first layer, the inputs are fed left-to-right, and at the second layer, the inputs are fed right-to-left._
* _The outputs from both layers are concatenated and given as input to the third layer._

<img src='https://miro.medium.com/max/700/1*D9NsJKvsOarjSHiKKLZdmQ.png'  width =400/>
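
_A minimal sketch of the two directions and the concatenation. To keep it short, plain tanh RNN cells stand in for LSTM cells, and the helper names, sizes, and random weights are assumptions; the third layer that consumes the concatenated output is not shown._

```python
import numpy as np

def run_rnn(inputs, W_hh, W_hx):
    """Plain tanh RNN (stand-in for an LSTM); returns the hidden state of every step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_hh @ h + W_hx @ x_t)
        states.append(h)
    return np.stack(states)

def bidirectional_encode(inputs, fwd, bwd):
    forward = run_rnn(inputs, *fwd)                 # reads the sentence left-to-right
    backward = run_rnn(inputs[::-1], *bwd)[::-1]    # reads right-to-left, then re-aligned
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
embed_size, hidden_size = 3, 5
inputs = rng.normal(size=(4, embed_size))    # 4 hypothetical word embeddings

def init_weights():
    return (rng.normal(scale=0.5, size=(hidden_size, hidden_size)),
            rng.normal(scale=0.5, size=(hidden_size, embed_size)))

fwd, bwd = init_weights(), init_weights()
states = bidirectional_encode(inputs, fwd, bwd)
print(states.shape)    # (4, 10): each position sees past and future context
```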

### ___Con 4___
_Simply stacking more LSTM layers helps only up to a certain depth; beyond that, the network becomes less effective and slower to train._

####  ___Adding Residual Connections___
* _This can be solved using residual connections. The input of the current layer is added element-wise to the output of the current layer, and the sum is given as the input to the next layer. These residual connections give information a shortcut path through the stack, which makes deeper models easier to train._

<img src='https://miro.medium.com/max/700/1*m1o6nUHHTIkHV21y9mVbYg.png' width =400/>
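
_The element-wise addition can be illustrated with a simplified stack. Each layer here is a single tanh transformation applied at one time step, used as a stand-in for an LSTM layer; the function names and weights are hypothetical._

```python
import numpy as np

def layer(x, W):
    """Stand-in for one recurrent layer at a single time step."""
    return np.tanh(W @ x)

def plain_stack(x, weights):
    for W in weights:
        x = layer(x, W)            # each layer replaces its input
    return x

def residual_stack(x, weights):
    for W in weights:
        x = x + layer(x, W)        # layer input added element-wise to layer output
    return x

rng = np.random.default_rng(0)
size, depth = 4, 8
x = rng.normal(size=size)
weights = [rng.normal(scale=0.1, size=(size, size)) for _ in range(depth)]

print(np.linalg.norm(plain_stack(x, weights)))      # signal after 8 plain layers
print(np.linalg.norm(residual_stack(x, weights)))   # signal after 8 residual layers
```

_With small weights the plain stack shrinks the signal layer after layer, while the residual stack keeps the original input on a direct path to the top._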

## ___Attention Model___

_Attention models, or attention mechanisms, are input processing techniques for neural networks that allow the network to focus on specific aspects of a complex input, one at a time, until the entire dataset is categorized. The goal is to break down complicated tasks into smaller areas of attention that are processed sequentially, similar to how the human mind solves a new problem by dividing it into simpler tasks and solving them one by one._

_Attention models require continuous reinforcement or backpropagation training to be effective._

### ___Why Attention Model?___

_The attention model was born to help memorize long source sentences in neural machine translation._

_Instead of building a single context vector out of the encoder’s last hidden state, this new architecture keeps the encoder's hidden state for every input word. For example, if a source sentence has N input words, the decoder has N encoder states to draw on rather than one, and it builds a new context vector from them at each output step. The advantage is that the encoded information can be decoded much more accurately by the model._

<img src='https://lilianweng.github.io/lil-log/assets/images/encoder-decoder-attention.png' width = 600/>

_Here, the encoder generates <h1,h2,h….hm> from the source sentence <X1,X2,X3…Xm>. So far, the architecture is the same as in the Seq2Seq model, with the source sentence as input and the target sentence as output. The difference lies inside the model: the context vector c_{i} that is computed for each decoding step._
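
_One common way to write the context vector, following the additive attention formulation used in neural machine translation, is sketched below. The symbols $s_{i-1}$ (the decoder's previous hidden state), $e_{ij}$, $\alpha_{ij}$, and the generic scoring function are assumptions introduced here for illustration; the text above only names $h_1, \dots, h_m$ and $c_i$._

$$e_{ij} = \mathrm{score}(s_{i-1}, h_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{m} \exp(e_{ik})}, \qquad c_i = \sum_{j=1}^{m} \alpha_{ij} h_j$$

_Each weight $\alpha_{ij}$ says how much the j-th encoder state contributes when the decoder produces its i-th output, and the weights for a given $i$ sum to 1._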