<a id="necessary"></a>

# Are LSTMs really necessary for time series?

LSTMs, combined with dropout and other techniques, became **hugely successful** and were considered the workhorse of NLP and time series applications, as well as of sequence-to-sequence problems. They also serve as a basis for memory network architectures. They were, and still are, dominant in these fields.

Nonetheless, in 2017-18 multiple findings emerged that question the necessity of LSTMs in many fields. The leading field in this regard was neural machine translation, where Facebook Research developed its [ConvNet-based machine translation](https://code.fb.com/ml-applications/a-novel-approach-to-neural-machine-translation/) and Google published the [transformer architecture](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html). Both approaches were motivated by the fact that LSTM computation is inherently sequential and hard to parallelize, whereas ConvNets and transformers parallelize well, so training can be scaled up rapidly.

Building on these results, multiple analyses examined whether the use of LSTMs is really justified. They found that although LSTMs have, in theory, unbounded memory, in practice a limited memory is good enough, and this can be modeled by ConvNets, especially with 1D and dilated convolutions.

For more information see [The fall of RNN / LSTM](https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0), [this Off the Convex Path post on approximating recurrent models](http://www.offconvex.org/2018/07/27/approximating-recurrent/), [these ACL 2018 highlights](http://blog.aylien.com/acl-2018-highlights-understanding-representations-and-evaluation-in-more-challenging-settings/) and [this arXiv paper](https://arxiv.org/abs/1803.01271).

Interesting snippets or remarks from the above sources:

- “LSTMs work in practice, but can they work in theory?”
- “According to Chomsky, sequential recency is not the right bias for learning human language. RNNs thus don’t seem to have the right bias for modeling language, which in practice can lead to statistical inefficiency and poor generalization behaviour.”


Alternative approaches such as ["recursive neural networks"](https://en.wikipedia.org/wiki/Recursive_neural_network) and ["recurrent neural network grammars"](https://arxiv.org/abs/1602.07776) also exist, but they are not as widespread.


## Convolutions for time series

We use convolution operators, but slide them along a single dimension of the data: the time axis.
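To make the idea concrete, here is a minimal pure-Python sketch of a 1D convolution over a series (strictly speaking a cross-correlation, which is what deep learning layers usually compute). The function name `conv1d` and the example kernels are illustrative assumptions; in a real network the kernel weights are learned rather than fixed by hand.

```python
def conv1d(series, kernel, stride=1):
    """'Valid' 1D convolution (no padding): slide the kernel along the time axis.

    Hypothetical helper for illustration, not a library function.
    """
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, series[start:start + k]))
        for start in range(0, len(series) - k + 1, stride)
    ]


series = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]

# A kernel of equal weights is a moving average over a window of 3.
moving_average = conv1d(series, [1 / 3, 1 / 3, 1 / 3])

# The kernel [-1, 1] computes first differences.
first_difference = conv1d(series, [-1.0, 1.0])  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Hand-crafted window features such as moving averages and differences are thus special cases of what a convolutional layer can learn.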

<img src="http://mblogthumb2.phinf.naver.net/MjAxNjEyMTBfMjMx/MDAxNDgxMjk1ODk2NDAz.kn9JN93v9X2Xn9vJloqupV5c5GB09YNYwPrvDB8yKU8g.Hh1wT30ySu0JFWNqj2qoSTiX-pRnrjH2VWhMI2EAo30g.PNG.atelierjpro/%EC%8A%AC%EB%9D%BC%EC%9D%B4%EB%93%9C2.PNG?type=w2" width=600 heigth=600>

Look, look, we have reinvented sliding windows! :-(

<img src="https://qph.fs.quoracdn.net/main-qimg-523434af0d21bb0b59454aa9563cc90b-c" width=600 heigth=600>

Though if we [calculate the receptive fields](https://medium.com/mlreview/a-guide-to-receptive-field-arithmetic-for-convolutional-neural-networks-e0f514068807) of these models, with enough depth they can be formidable!
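As a rough sketch of this arithmetic for 1D layers: each layer extends the receptive field by `(kernel_size - 1)` steps, multiplied by the product of the strides of all layers below it. The helper `receptive_field` and its `(kernel_size, stride)` layer description are assumptions made for this example only.

```python
def receptive_field(layers):
    """Receptive field (in input time steps) of one output unit.

    `layers` is a list of (kernel_size, stride) tuples, input side first.
    Hypothetical helper for illustration.
    """
    rf = 1    # a single input step sees only itself
    jump = 1  # distance, in input steps, between neighbouring units of the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


plain = receptive_field([(3, 1)] * 4)    # 1 + 4 * 2 = 9
strided = receptive_field([(3, 2)] * 4)  # 1 + 2 * (1 + 2 + 4 + 8) = 31
```

With stride 1 the receptive field grows only linearly with depth; striding makes it grow much faster, at the price of a lower output resolution.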

### Dilated convolutions

Dilated convolutions are used to radically increase the total receptive field of a network.

<img src="http://sergeiturukin.com/assets/2017-02-23-155956_803x294_scrot.png" width=600 heigth=600>


<img src="https://mlblr.com/images/dilated.gif" width=400 heigth=400>

Original paper [here](https://arxiv.org/abs/1511.07122)

[WaveNet](http://sergeiturukin.com/2017/03/02/wavenet.html) and other successful models (especially in speech processing) use this approach effectively.
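The effect of dilation can be checked with a small pure-Python sketch. It assumes causal convolutions (each output only looks at the present and the past, as in WaveNet), zero padding on the left, and fixed all-ones kernels of size 2; the function names are hypothetical. Pushing a single impulse through four layers with dilations 1, 2, 4, 8 shows exactly how far back the top layer can see.

```python
def dilated_causal_conv1d(series, kernel, dilation):
    """out[t] = sum_i kernel[i] * series[t - i * dilation], zero-padded on the left."""
    out = []
    for t in range(len(series)):
        total = 0.0
        for i, w in enumerate(kernel):
            idx = t - i * dilation
            if idx >= 0:
                total += w * series[idx]
        out.append(total)
    return out


def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 dilated layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


dilations = (1, 2, 4, 8)

# An impulse at t = 0, followed by 16 zeros.
signal = [1.0] + [0.0] * 16
for d in dilations:
    signal = dilated_causal_conv1d(signal, [1.0, 1.0], d)

# The impulse is visible in the outputs at t = 0..15, but no longer at t = 16.
reach = sum(1 for value in signal if value != 0.0)
```

Here `reach` equals `stacked_receptive_field(2, dilations)`, which is 16: doubling the dilation at each layer makes the receptive field grow exponentially with depth, while four undilated layers of the same size would only cover 5 steps.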

### Convolutional filters and wavelets

It is also interesting to note that convolutions over a time series can be understood as generalizations of "wavelet / shapelet" approaches, with the added benefit that choosing the "mother wavelet" is much less of a problem here, since the filters are learned from the data.

<img src="https://www.researchgate.net/profile/Zoltan_German-Sallo/publication/266056525/figure/fig1/AS:295770143117318@1447528505764/The-Continuous-Wavelet-Transform-as-a-convolution-between-data-signal-and-scaled-and.png" width=45%>


# Neural models for time series similarity

Some very successful neural models, e.g. for time series similarity (like [this one](https://arxiv.org/pdf/1812.08306.pdf)), also capitalize on the flexibility of neural models for pattern recognition in time series, and thus offer effective competition to the tried and true Fourier spectral methods and [Dynamic Time Warping](https://en.wikipedia.org/wiki/Dynamic_time_warping).

<img src="http://drive.google.com/uc?export=view&id=1WrrsD273HFeWQkKnTDGnv-PxP4cWU6Ph" widht=75%>
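For reference, the classical Dynamic Time Warping baseline that such neural models compete with can be sketched in a few lines. This is the textbook dynamic programming formulation with absolute difference as local cost and no warping window; the function name `dtw_distance` is an assumption of this example, not an API of any particular library.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences, O(len(a) * len(b))."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: cheapest alignment of the first i points of a with the first j points of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b[j - 1] is reused
                cost[i][j - 1],      # b advances, a[i - 1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


bump = [0, 1, 2, 1, 0]
delayed_bump = [0, 0, 1, 2, 1, 0]  # same shape, starts one step later

same_shape = dtw_distance(bump, delayed_bump)  # 0.0: the delay is warped away
different = dtw_distance([0, 1, 2], [0, 1, 3])  # 1.0
```

The quadratic cost of this alignment is one reason why learned, fixed-length representations are attractive for similarity search over large collections.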

<a id="seq"></a>
# What if outputs are also sequences? - Seq2seq

Up to this point we have always assumed that, although the inputs of the modeling problem are sequences, the outputs are single categorical values or scalars representing either a one-step prediction or a similarity score. But what if we would like to move on to problems where **the input and the output are both sequences**? (For example parallel texts to translate, questions and answers, ...)

Luckily, it turns out that full sequence models are quite flexible in this regard as well:

<img src="https://i.stack.imgur.com/WSOie.png" width=65%>

**We can use the sequence models to capture ("encode") some inputs and generate ("decode") the desired outputs for us (utilizing their hidden states).**

## Sidenote: more heroes

A noteworthy "hero" of DL is [Ilya Sutskever](https://en.wikipedia.org/wiki/Ilya_Sutskever), who was instrumental in the development of sequence-to-sequence learning methods.

(Not surprisingly, also a student of Hinton.)

<img src="http://r.com.pk/wp-content/uploads/2018/04/ilya-sutskever.jpg" width=400 heigth=400>

He went on to co-found the OpenAI research group.

## The inner states of LSTMs as dense vector representations of series

As mentioned before, the inner states of LSTMs represent an arbitrarily long sequence of inputs as a fixed-length hidden state vector, thus LSTMs can be regarded as sequence encoders.

The resulting representations can be used, for example, for:

- classification (e.g. sentiment analysis in NLP)
- measuring the similarity of series, and thus __search__
- sequence-to-sequence transformations, where we generate a new series from the hidden representation, just as in the case of language models, for example by applying "beam search".
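Beam search itself is independent of the network that produces the probabilities, so it can be sketched with a toy stand-in for the decoder. In the sketch below, `TABLE` and `toy_step` are made-up placeholders for a trained decoder that returns log-probabilities of the next token given the tokens generated so far. The variant shown ranks hypotheses by their raw cumulative log-probability, without length normalization.

```python
import math


def beam_search(step_fn, start, end, beam_width=2, max_len=10):
    """Return (log-probability, tokens) of the best hypothesis found."""
    beams = [(0.0, [start])]  # unfinished hypotheses: (cumulative log-prob, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for token, logp in step_fn(tokens).items():
                candidates.append((score + logp, tokens + [token]))
        # Keep only the `beam_width` most probable extensions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_width]:
            if tokens[-1] == end:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return max(finished, key=lambda c: c[0])


# Toy "decoder": next-token probabilities for each prefix (made-up numbers).
TABLE = {
    ("<s>",): {"a": 0.6, "b": 0.4},
    ("<s>", "a"): {"x": 0.5, "y": 0.5},
    ("<s>", "b"): {"z": 0.9, "x": 0.1},
}


def toy_step(tokens):
    probs = TABLE.get(tuple(tokens), {"</s>": 1.0})  # unknown prefix: end the sequence
    return {token: math.log(p) for token, p in probs.items()}


greedy_score, greedy_tokens = beam_search(toy_step, "<s>", "</s>", beam_width=1)
beam_score, beam_tokens = beam_search(toy_step, "<s>", "</s>", beam_width=2)
```

With a beam width of 1 (greedy decoding) the search commits to `a` and ends with a sequence of probability 0.3, while a beam width of 2 keeps `b` alive and finds `<s> b z </s>` with probability 0.36.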

<img src="http://suriyadeepan.github.io/img/seq2seq/seq2seq2.png">

LSTM-based sequence-to-sequence transformations are used in:

- **Neural machine translation**
- Summarization
- Question answering
- Dialogue systems

## Architectural improvements

### Bi-directional RNN layer

The big deficiency of the models seen so far is that they only take into account information coming from the "left context" of a datapoint (in the form of the passed-on hidden state). From a language processing perspective this is not entirely plausible, since humans also read in a "back-and-forth" manner.

This problem is mitigated by introducing both a forward-looking and a backward-looking LSTM layer:

<img src="http://opennmt.net/OpenNMT/img/gnmt-encoder.png" width="500px">

The two layers each produce a representation of every sequence element independently, and these are then combined into the final output, most frequently by simple concatenation.
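The mechanics can be sketched with a deliberately tiny recurrent cell. Note the assumptions: a scalar, Elman-style RNN with hand-picked weights stands in for the LSTM, and the names `run_rnn` and `bidirectional` are hypothetical. The backward layer simply reads the reversed input, and its states are flipped back before being paired with the forward states.

```python
import math


def run_rnn(inputs, w_x, w_h):
    """Scalar simple RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1}); returns all hidden states."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def bidirectional(inputs, fwd_params, bwd_params):
    """Pair ("concatenate") forward and backward states for every time step."""
    forward = run_rnn(inputs, *fwd_params)
    backward = run_rnn(inputs[::-1], *bwd_params)[::-1]  # re-align with the input order
    return list(zip(forward, backward))


# A single "event" in the middle of an otherwise empty series.
outputs = bidirectional([0.0, 1.0, 0.0], fwd_params=(1.0, 0.5), bwd_params=(1.0, 0.5))
```

At the first position the forward state is still zero, because the event lies in its future, while the backward state already carries information about it. At the last position the roles are swapped, so the concatenated pair gives every position access to both its left and its right context.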

Naturally, we can stack BiLSTMs on top of each other:

<img src="http://brightliao.me/attaches/2016/2016-12-11-dl-workshop-rnn-and-lstm-1/deep-bidirectional-rnn.png" width="400px">


### Two more complex LSTM NLP architectures

#### Tree-LSTM

It is capable of processing tree-structured input instead of traditional sequence-based input. (It was developed for parse-tree processing, but can also be used on molecular graphs, though alternative approaches such as ["recursive neural networks"](https://en.wikipedia.org/wiki/Recursive_neural_network) are also present in that field.) In the simple case, the input of a higher-level node is formed by summing the outputs of its child nodes. In the more complex case, with a bounded branching factor, each child position gets its own parameters, including a separate forget gate.

<img src="https://adeshpande3.github.io/assets/NLP28.png" width="400px">

Original paper, where it is used for sentiment analysis: [Tai et al. (2015): Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks](https://arxiv.org/pdf/1503.00075.pdf)

#### Hierarchical models

These capitalize on the hierarchical structure of textual data: the encoder first generates representations of the sentences, then a paragraph is modelled as a sequence of dense sentence vectors.

<img src="https://ai2-s2-public.s3.amazonaws.com/figures/2017-08-08/538991962b2e6e832183225a5b555d55a630dcef/3-Figure3-1.png" width="600px">

Usage: abstract generation from text ([Zou 2016 A Hierarchical model for text autosummarization](https://pdfs.semanticscholar.org/5389/91962b2e6e832183225a5b555d55a630dcef.pdf)), and also as a general-purpose autoencoder ([Li et al 2015: A Hierarchical Neural Autoencoder for Paragraphs and Documents](https://arxiv.org/pdf/1506.01057.pdf)).

### A sidenote: Google Seq2Seq

Encoder-decoder based seq2seq RNN architectures have become so important, and so complex, in recent years that dedicated frameworks for building these kinds of models have appeared. One of the best known is Google's [seq2seq](https://github.com/google/seq2seq), based on TensorFlow, with which we can define complex seq2seq architectures using simple `yml` description files (naturally, more atypical architectures are also possible). An example of a simple seq2seq model in `yml`:

```yaml
model: BasicSeq2Seq
model_params:
  bridge.class: seq2seq.models.bridges.InitialStateBridge
  embedding.dim: 128
  encoder.class: seq2seq.encoders.BidirectionalRNNEncoder
  encoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 128
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 1
  decoder.class: seq2seq.decoders.BasicDecoder
  decoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 128
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 1
  optimizer.name: Adam
  optimizer.params:
    epsilon: 0.0000008
  optimizer.learning_rate: 0.0001
  source.max_seq_len: 50
  source.reverse: false
  target.max_seq_len: 50
```

Training and prediction can be done in this case with simple shell commands.

Several competing seq2seq frameworks exist, like [this PyTorch-based one](https://github.com/eladhoffer/seq2seq.pytorch).