## Ok, but what is an RNN?

We would like to define a neural network architecture that, for an input data point $x_{t}$, can take into account the effect of prior data points without receiving them as explicit inputs (as in the feedforward case). Instead, it carries them as **(hidden) state**, using the activations of the NN itself.

Up to this point we only used the activations for backpropagation, but now we'll also use them for something else.

**Basic idea:**

<a href="https://cdn-images-1.medium.com/max/1400/1*lQ4izz9ZbhKYD8NClZpsmQ.png"><img src="https://drive.google.com/uc?export=view&id=197zCScRaBJypUb0giZzBh_0plHAOzlYJ"></a>

As a naive solution, we could store the activations $a_{t-1}$ computed at $x_{t-1}$ for all neurons and multiply them by a common "dampening factor" $\lambda$ before adding them to the activations arising at $x_{t}$:

$$ f(x_t) = \sigma(w x_t +\lambda a_{t-1} + b)$$ 

This is not a complete recurrent network, but a step in that direction (it's worth trying whether it works :-)
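To try the naive idea out, here is a minimal NumPy sketch of the formula above. The function name `naive_recurrent_forward`, the layer sizes, the zero starting activation and the value $\lambda = 0.5$ are all assumptions made for illustration, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def naive_recurrent_forward(xs, w, b, lam):
    """Apply f(x_t) = sigmoid(w x_t + lam * a_{t-1} + b) along a sequence."""
    a_prev = np.zeros_like(b)  # assumption: no stored activation before the first step
    activations = []
    for x_t in xs:
        # the previous activation is dampened by lam and added to the pre-activation
        a_prev = sigmoid(w @ x_t + lam * a_prev + b)
        activations.append(a_prev)
    return np.stack(activations)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))  # 5 time steps, 3 input features (illustrative sizes)
w = rng.normal(size=(4, 3))   # a layer of 4 neurons
b = np.zeros(4)

out = naive_recurrent_forward(xs, w, b, lam=0.5)
print(out.shape)  # (5, 4)
```

With `lam=0.0` the loop reduces to an ordinary feedforward layer applied to each time step independently, which is a handy sanity check.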

But if we have a complete neural architecture with weight matrices, why not do this in a more clever way?

### Elman network

The first real recurrent network:

[Elman 1990](https://crl.ucsd.edu/~elman/Papers/fsit.pdf): Jeffrey L. Elman: Finding structure in time, Cognitive Science 14, pp. 179-211

<a href="https://cdn-images-1.medium.com/max/1400/1*E6OMkLY8vbPdJ7b5R27FQA.jpeg"><img src="https://drive.google.com/uc?export=view&id=17s60w9qBrwdrazopSAIPHjQ6ttiBgtTn"></a>

**The general notation for recurrent networks:**

**$$h_t=\sigma(Wx_t+Uh_{t-1})$$**

"Legend":

|Symbol| Meaning|
|:---|:---|
|$h_t$| "hidden state" at time $t$ |
|$\sigma$| activation function (typically sigmoid or tanh)|
|$x_t$| input at time $t$ |
|$W$| input-to-hidden weight matrix|
|$U$| hidden-to-hidden weight matrix|
|$h_{t-1}$| "hidden state" at time $t-1$ |


**In this model we learn two weight matrices, $W$ and $U$.**
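The recurrence $h_t=\sigma(Wx_t+Uh_{t-1})$ can be sketched in a few lines of NumPy. This is only an illustration under assumptions: `elman_forward` is a hypothetical helper name, $\sigma$ is taken to be tanh, the initial hidden state is zero, and the sizes and random weights are arbitrary (nothing is trained here).

```python
import numpy as np

def elman_forward(xs, W, U, h0=None):
    """Compute h_t = tanh(W x_t + U h_{t-1}) for every step of a sequence."""
    h = np.zeros(W.shape[0]) if h0 is None else h0  # assumption: h_0 = 0
    states = []
    for x_t in xs:
        # W maps the current input, U maps the previous hidden state
        h = np.tanh(W @ x_t + U @ h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
n_in, n_hidden, n_steps = 3, 5, 6              # illustrative sizes
W = rng.normal(scale=0.5, size=(n_hidden, n_in))
U = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
xs = rng.normal(size=(n_steps, n_in))

states = elman_forward(xs, W, U)
print(states.shape)  # (6, 5)
```

Note that the same $W$ and $U$ are reused at every time step; only the hidden state changes as the sequence is consumed.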

Summary  [here](https://medium.com/lingvo-masino/introduction-to-recurrent-neural-network-d77a3fe2c56c).

## Long short-term memory network (LSTM)
Get ready to be Schmidhubered! :-)

<a href="https://www.xing.com/img/custom/content/klartext/asset_images/images/000/157/804/x137/image.png?1453288186"><img src="https://drive.google.com/uc?export=view&id=1Ea5cmpxJnHZmZTPNO2DwVr7rfxMAw_wi"></a>

Or, more famously, [Hochreiter and Schmidhuber 1997](http://www.bioinf.jku.at/publications/older/2604.pdf), in which our heroes, strongly inspired by human short-term memory, set out to create a network that can store signals for an arbitrary but learned length of time, enabling it to "memorize" them effectively indefinitely.

(The joke term "to be Schmidhubered" comes from the fact that Prof. Schmidhuber has a long-running feud with the Hinton group: he tries to diminish the importance of their findings, claiming that others, himself among them, invented key methods before the "deep learning conspiracy". Disturbingly, he is sometimes even right. See [this](http://people.idsia.ch/~juergen/deep-learning-conspiracy.html) :-)

## Architecture

**Gold standard explanation: [Colah's Blog](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)**, the personal blog of the researcher Chris Olah.

<a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png"><img src="https://drive.google.com/uc?export=view&id=1UOtmNeaimbcn3H7CBSeIUfOcjRaubXgY" heigth=300 width=600></a>

### Preliminaries

The cell state "travels" through the whole process.

<a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-C-line.png"><img src="https://drive.google.com/uc?export=view&id=1IQEqkYz9ThTT1p_9EnO1NJ8xVLQwRsDv"></a>

"Gates" are to be understood as combinations of non-linearities (sigmoid or tanh) and pointwise operations (addition or multiplication).

<a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-gate.png"><img src="https://drive.google.com/uc?export=view&id=1tUFbRRllulugrK5OBmPGBVidOnqUIMWF"></a>

**Some help: if a vector is multiplied pointwise by values between 0 and 1, this amounts to "deleting" (0) or "leaving intact" (1) parts of the data, with values in between scaling them down.**
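A tiny NumPy illustration of this pointwise gating, with made-up numbers (the variable names `data`, `gate` and `gated` are just for this example):

```python
import numpy as np

data = np.array([2.0, 1.0, -4.0, 3.0])
gate = np.array([1.0, 0.0, 0.5, 1.0])  # values in [0, 1], e.g. the output of a sigmoid

# pointwise multiplication: 1 keeps an entry, 0 deletes it, 0.5 halves it
gated = gate * data
print(gated)  # values: 2.0, 0.0, -2.0, 3.0
```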

### Step 1: What would we like to "forget"?

<a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-f.png"><img src="https://drive.google.com/uc?export=view&id=1J1_LbATvWLASY4OJsJ95txk9SCWjeYHR"></a>
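In the notation of the linked Colah post (the symbols are not defined elsewhere in these notes), the forget gate in the figure computes

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

where $[h_{t-1}, x_t]$ is the previous hidden state concatenated with the current input, $\sigma$ is the sigmoid, and $W_f$, $b_f$ are the learned weights and bias of this gate. Each entry of $f_t$ lies between 0 and 1 and tells how much of the corresponding entry of the old cell state to keep.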

### Step 2: What and where would we like to store?

<a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-i.png"><img src="https://drive.google.com/uc?export=view&id=1ZpIjdziGKlIacF4M7GpO9GRvxVPxGK7j"></a>
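Again in the notation of the linked Colah post, this step has two parts: an input gate $i_t$ that decides *where* to write, and a vector of candidate values $\tilde{C}_t$ that decides *what* to write:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

Here $[h_{t-1}, x_t]$ is the previous hidden state concatenated with the current input, and $W_i$, $b_i$, $W_C$, $b_C$ are learned parameters.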

### Step 3: Update the cell state!

<a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-C.png"><img src="https://drive.google.com/uc?export=view&id=1jjM4MSIqgqNIz69K4jQUZtVjzTskezF_"></a>
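Written out in the notation of the linked Colah post, the update combines the results of the two previous steps:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

where $f_t$ is the output of the forget gate (step 1), $i_t$ is the output of the input gate and $\tilde{C}_t$ the candidate values (step 2), and $\odot$ denotes pointwise multiplication. The first term keeps the selected parts of the old cell state $C_{t-1}$; the second adds the selected parts of the new candidates.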

### Step 4: Choose and produce the output!

<a href="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-o.png"><img src="https://drive.google.com/uc?export=view&id=1_gjSEe6CW1yc_QTCPhGEqM5Bst3D0SId"></a>
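In the notation of the linked Colah post, the output gate and the new hidden state are

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \qquad h_t = o_t \odot \tanh(C_t)$$

Putting the four steps together, a single LSTM step can be sketched in NumPy as below. This is a minimal illustration, not a reference implementation: the function `lstm_step`, the `params` dictionary layout, the sizes and the random untrained weights are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params holds W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # step 1: what to forget
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])       # step 2: where to store
    c_tilde = np.tanh(params["W_C"] @ z + params["b_C"])   # step 2: what to store
    c_t = f_t * c_prev + i_t * c_tilde                     # step 3: update cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # step 4: what to output
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

hidden, n_in = 4, 3  # illustrative sizes
rng = np.random.default_rng(0)
params = {}
for name in ("f", "i", "C", "o"):
    params[f"W_{name}"] = rng.normal(scale=0.1, size=(hidden, hidden + n_in))
    params[f"b_{name}"] = np.zeros(hidden)

# run a short random sequence through the cell, starting from zero states
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)  # (4,) (4,)
```

Each weight matrix has shape `(hidden, hidden + n_in)` because it acts on the concatenation of the previous hidden state and the current input.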

#### Alternative explanation [here](http://adventuresinmachinelearning.com/recurrent-neural-networks-lstm-tutorial-tensorflow/)

This may be worth reading through, since it shows the dimensionality of each vector in detail.