# Motivation

Our main motivations for exploring more complex neural architectures are:

1. Scaling up to "multiple levels" of texts in one model (native modeling of compositionality)
2. Solving sequence in, sequence out scenarios.

For this, we again capitalize on the representation learning abilities of RNNs.

# Skip thought vectors

One of the methods for directly learning representations of bigger units of text is the "skip thought vectors" approach, which borrows some elements from **word2vec** and combines them with **LSTM**s.

In practice, we try to **re-generate the preceding and following sentences from the dense representation of the sentence in the middle**. (This can also be tried with paragraphs.)

[Original paper](https://arxiv.org/abs/1506.06726)
[Elaboration](https://sanyam5.github.io/my-thoughts-on-skip-thoughts/)


## Structure

<a href="https://sanyam5.github.io/images/skip-thoughts/skip-overview.png"><img src="https://drive.google.com/uc?export=view&id=1lDdso_MgVRZaTkblmERh2VZIspD8V3g2" width=65%></a>

"Skip-Thoughts model has three parts:

**Encoder Network:** Takes the sentence x(i) at index i and generates a fixed length representation z(i). This is a recurrent network (generally GRU or LSTM) that takes the words in a sentence sequentially.

**Previous Decoder Network:** Takes the embedding z(i) and “tries” to generate the sentence x(i-1). This also is a recurrent network (generally GRU or LSTM) that generates the sentence sequentially.

**Next Decoder Network:** Takes the embedding z(i) and “tries” to generate the sentence x(i+1). Again a recurrent network similar to the Previous Decoder Network."
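
The data side of this setup can be illustrated with a minimal sketch: every sentence that has both neighbours becomes the encoder input, and its neighbours become the two decoder targets. The function name `skip_thought_triples` and the toy corpus are assumptions made for illustration; they do not come from the original paper's code.

```python
def skip_thought_triples(sentences):
    """Yield (previous, current, next) triples for each sentence with two neighbours."""
    for i in range(1, len(sentences) - 1):
        yield sentences[i - 1], sentences[i], sentences[i + 1]


# Hypothetical toy corpus of consecutive sentences.
corpus = [
    "I got back home.",
    "The cat was sitting on the steps.",
    "This was strange.",
    "It usually sleeps inside.",
]

for prev_sent, current, next_sent in skip_thought_triples(corpus):
    # The encoder reads `current`; the two decoders try to generate the neighbours.
    print(f"encode {current!r} -> decode {prev_sent!r} and {next_sent!r}")
```

The first and last sentences of a document only appear as decoder targets here, since they lack one of the two neighbours.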


**Main takeaway:**

-------------------
<font color=red>
The inner representations of memory models at the end of the sequence are good dense representations of the full sequence.
</font>

-------------------




 
# Generalized seq2seq architecture

As mentioned before, the inner states of LSTMs represent an arbitrarily long sequence of inputs as a fixed-length hidden state vector, so LSTMs can be regarded as sequence encoders.

The produced representations can be used, for example, for:

- classification (e.g. sentiment analysis in NLP)
- measuring the similarity of sequences, and thus __search__
- **sequence to sequence transformations**, where we generate a new sequence from the hidden representation just as in the case of language models, for example by applying "beam search".
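
The beam search mentioned in the last item can be sketched as follows. This is a toy illustration under stated assumptions: `TOY_MODEL` is a hypothetical lookup table standing in for a trained decoder, and it conditions only on the previous token, whereas a real seq2seq decoder would also condition on the encoder's hidden representation.

```python
import math

# Hypothetical toy "decoder": log-probability of the next token given the previous one.
TOY_MODEL = {
    "<s>": {"the": math.log(0.6), "a": math.log(0.4)},
    "the": {"cat": math.log(0.5), "dog": math.log(0.4), "</s>": math.log(0.1)},
    "a": {"cat": math.log(0.9), "</s>": math.log(0.1)},
    "cat": {"</s>": math.log(1.0)},
    "dog": {"</s>": math.log(1.0)},
}


def beam_search(model, beam_width=2, max_len=5):
    """Keep the `beam_width` best partial sequences at every decoding step."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for token, log_p in model[tokens[-1]].items():
                candidates.append((score + log_p, tokens + [token]))
        # Prune: only the best `beam_width` continuations survive.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_width]:
            if tokens[-1] == "</s>":
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[0], reverse=True)


for log_prob, tokens in beam_search(TOY_MODEL, beam_width=2):
    print(" ".join(tokens), round(math.exp(log_prob), 2))
```

With `beam_width=1` the procedure degenerates to greedy decoding, which in this toy table commits to "the" in the first step and so misses the more probable sequence starting with "a".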

<a href="http://suriyadeepan.github.io/img/seq2seq/seq2seq2.png"><img src="https://drive.google.com/uc?export=view&id=1slnyfW87l_HqBLXiIHUKzuMD91CN143_"></a>

LSTM-based sequence-to-sequence transformations are used in:

- **Neural machine translation**
- Summarization
- Question answering
- Dialogue systems

## Visualization of a seq2seq model

Source: [Visualizing A Neural Machine Translation Model](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/)


```python
from IPython.display import HTML

HTML(data='''
<video width="100%" height="auto" loop="" controls="">
  <source src="https://jalammar.github.io/images/seq2seq_6.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>''')
```


## Some libraries: Google seq2seq

Encoder-decoder based seq2seq RNN architectures have become so important, and at the same time so complex, in recent years that dedicated frameworks have appeared for building these kinds of models. One of the best known is Google's [seq2seq](https://github.com/google/seq2seq), based on TensorFlow, with which we can define complex seq2seq architectures with the help of simple `yml` description files (naturally, more atypical architectures are also possible). An example of a simple seq2seq model in `yml`:



```yaml
model: BasicSeq2Seq
model_params:
  bridge.class: seq2seq.models.bridges.InitialStateBridge
  embedding.dim: 128
  encoder.class: seq2seq.encoders.BidirectionalRNNEncoder
  encoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 128
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 1
  decoder.class: seq2seq.decoders.BasicDecoder
  decoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 128
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 1
  optimizer.name: Adam
  optimizer.params:
    epsilon: 0.0000008
  optimizer.learning_rate: 0.0001
  source.max_seq_len: 50
  source.reverse: false
  target.max_seq_len: 50
```

There are multiple other solutions available.


# Main use cases

## Machine translation

Machine translation is in itself one of the foundational concerns of AI research: it was explicitly mentioned in the Dartmouth Manifesto and has a long [history of its own](https://en.wikipedia.org/wiki/History_of_machine_translation). This makes it all the more remarkable that the advent of seq2seq machine translation marked a breakthrough, so when Google decided to deploy such models in Translate, the move received considerable media attention.

The performance of NMT models is still progressing rapidly.

[source](http://nlpprogress.com/english/machine_translation.html)

**Models are evaluated on the English-German dataset of the Ninth Workshop on Statistical Machine Translation (WMT 2014) based on BLEU**

|Model|BLEU|Paper / Source|
|------|------|------|
|ConvS2S (Gehring et al., 2017)|25.16|[Convolutional Sequence to Sequence Learning](https://arxiv.org/abs/1705.03122)|
|MoE (Shazeer et al., 2017)|26.03|[Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer](https://arxiv.org/abs/1701.06538)|
|Transformer Base (Vaswani et al., 2017)|27.3|[Attention Is All You Need](https://arxiv.org/abs/1706.03762)|
|Transformer Big (Vaswani et al., 2017)|28.4|[Attention Is All You Need](https://arxiv.org/abs/1706.03762)|
|RNMT+ (Chen et al., 2018)|28.5*|[The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation](https://arxiv.org/abs/1804.09849)|
|Transformer Big (Ott et al., 2018)|29.3|[Scaling Neural Machine Translation](https://arxiv.org/abs/1806.00187)|
|DeepL|33.3|[DeepL Press release](https://www.deepl.com/press.html)|
|Transformer Big + BT (Edunov et al., 2018)|35.0|[Understanding Back-Translation at Scale](https://arxiv.org/pdf/1808.09381.pdf)|


More information on the BLEU metric can be found [here](https://machinelearningmastery.com/calculate-bleu-score-for-text-python/).
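
To give a feeling for what the metric measures, here is a simplified sketch of a sentence-level BLEU score: the geometric mean of modified n-gram precisions multiplied by a brevity penalty. It assumes a single reference, whitespace tokenization and no smoothing, and the function names are chosen for illustration. The scores in the table above are corpus-level values reported on a 0 to 100 scale, so this sketch is not meant to reproduce them.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length `n` in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: single reference, no smoothing, result in [0, 1]."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        # Modified precision: each candidate n-gram is credited at most
        # as often as it occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
print(sentence_bleu("the cat sat on a mat".split(), reference))
```

Because no smoothing is applied, a candidate without a single matching 4-gram scores zero, which is one reason BLEU is usually computed over a whole test corpus rather than per sentence.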

## Chat agents

The other very promising area of application for seq2seq models is the field of end-to-end dialog modeling. The final goal is to use the available corpora of domain-specific dialogs to build useful chat agents in an unsupervised (or, more precisely, self-supervised) manner.

As pointed out elsewhere, though the learning ability of such models in "language modeling like" scenarios is remarkable, the **semantic control of output production is rather problematic** and remains an unsolved area of research. This hinders the rollout of end-to-end learned chatbot solutions.

## Just for fun

Sequence to sequence "translation" can mean different things, and a surprisingly large body of problems can be cast into this category. Besides serious applications like chemical reaction modeling (see e.g. [here](https://pubs.acs.org/doi/full/10.1021/acscentsci.7b00303)), there are fun projects one can attempt with this approach, like the one below, where the author tries to solve written probability exercises with the help of an external symbolic calculator.

<a href="https://reiinakano.github.io/images/sp/calcnet3.gif"><img src="https://drive.google.com/uc?export=view&id=1ZSn6kP8Q1hnTtWh8j9LHrZEJTw8WbHk4" width=50%></a>

# Attention!

Many seq2seq tasks behave in a "non-holistic" way: while generating the output, not all of the prior input information is equally important at every step. It is well worth "attending to" certain elements of the input at some times, while at other times the same elements can be regarded as completely unnecessary. Despite this, the plain encoder-decoder model is constrained to a single summarized representation and cannot access the relevant parts of earlier hidden states. Early on, some tricks were applied to mitigate this effect, such as entering the input twice or in reverse order, but the real solution proved to be the so-called **"attention mechanism"** (coming from ConvNets).

<a href="https://cdn-images-1.medium.com/max/1600/0*SY3nv8-J6qX1GUxk.png"><img src="https://drive.google.com/uc?export=view&id=19Fckva14TW5FpNKkdIVVGGVPqkmuAlXB"></a>

At each step, the decoder receives its prior hidden state and output, as well as a _weighted sum_ of all hidden states of the encoder as context.

The context in step $i$ of the decoder is:

$$ c_i = \sum_{j=1}^{T}\alpha_{ij}h_j$$

where $T$ is the number of encoder states. For every encoder hidden state $h_j$, a score is generated by a trained feedforward network $A$:

$$e_{ij} = A(h_j, s_{i-1})$$

(whose inputs are the encoder state $h_j$ and $s_{i-1}$, the prior hidden state of the decoder), and the weights $\alpha_{ij}$ are obtained by applying a softmax to these scores:

$$\alpha_{ij} = \frac{\exp e_{ij}}{\sum_{k=1}^{T}\exp e_{ik}}$$
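
The three formulas above can be sketched in plain Python. The text only says that $A$ is a trained feedforward network; the concrete form used here, $v^\top \tanh(W_h h + W_s s)$, is an assumption in the spirit of additive attention, and the weight values are arbitrary untrained numbers chosen for illustration.

```python
import math


def additive_score(h, s, W_h, W_s, v):
    """Scoring network A(h, s) = v . tanh(W_h h + W_s s) (assumed form)."""
    hidden = [
        math.tanh(
            sum(w * x for w, x in zip(row_h, h))
            + sum(w * x for w, x in zip(row_s, s))
        )
        for row_h, row_s in zip(W_h, W_s)
    ]
    return sum(v_k * a_k for v_k, a_k in zip(v, hidden))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(encoder_states, prev_decoder_state, W_h, W_s, v):
    """Return the weights alpha_i and the context vector c_i for one decoder step."""
    scores = [additive_score(h, prev_decoder_state, W_h, W_s, v) for h in encoder_states]
    alphas = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(a * h[d] for a, h in zip(alphas, encoder_states)) for d in range(dim)]
    return alphas, context


# Arbitrary toy values: three encoder states h_1..h_3 and one decoder state s_{i-1}.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prev_decoder_state = [0.5, -0.5]
W_h = [[0.1, 0.2], [0.3, -0.1]]
W_s = [[0.2, 0.0], [0.0, 0.2]]
v = [1.0, -1.0]

alphas, context = attention_context(encoder_states, prev_decoder_state, W_h, W_s, v)
print("weights:", alphas)
print("context:", context)
```

The weights always sum to one, so the context vector is a convex combination of the encoder states. In a trained model the parameters of $A$ are learned jointly with the encoder and decoder.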


The classic paper about attention mechanisms is: [Bahdanau et al: "Neural machine translation by jointly learning to align and translate." (2014).](https://arxiv.org/pdf/1409.0473.pdf)


## Visualization of attention mechanism


```python
from IPython.display import HTML

HTML(data='''
<video width="100%" height="auto" loop="" controls="">
  <source src="https://jalammar.github.io/images/seq2seq_7.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>''')
```


```python
from IPython.display import HTML

HTML(data='''
<video width="100%" height="auto" loop="" controls="">
  <source src="https://jalammar.github.io/images/attention_process.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>''')
```


```python
from IPython.display import HTML

HTML(data='''
<video width="100%" height="auto" loop="" controls="">
  <source src="https://jalammar.github.io/images/attention_tensor_dance.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>''')
```


## What does the model attend to?

<a href="https://jalammar.github.io/images/attention_sentence.png"><img src="https://drive.google.com/uc?export=view&id=1vOPMVqpsqmY6S21RC0PtRDiaZN1HfU9o" width=45%></a>

It is noteworthy in the picture above that the word order of the English and French text is not the same, and the model learns to pay attention to the relevant positions even when the order differs, thus effectively learning syntactic rules of the two languages as well as their mapping.




# Maybe that is all we need?

As some of the paper titles mentioned above imply, the idea arose that the attention mechanism itself is the crucial part in the success of seq2seq models, so some brave experiments were made to get rid of RNNs (and even CNNs) altogether and focus on purely attention-based models.

It turned out that: **"Attention is all you need!"**


We will talk about a [generalized attention model](https://arxiv.org/pdf/1902.02181.pdf) next, which shows that the idea of attention started a "life of its own" and became a major modeling approach in itself, without any kind of recurrent or convolutional elements.