# LSTMs

The content for this course (including many of the images) comes from the excellent [colah's blog](https://colah.github.io) - specifically the post [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/).

## Sigmoid versus Tanh 

Tanh
- [-1, 1]
- naturally suited to create output distributions similar to a normal distribution
- are good at capturing statistical features like "this one is average", "this one is in the 10th percentile"

Sigmoid
- [0, 1]
- naturally suited for probabilities or relative amounts - a relative amount to signify how strongly the incoming values should be considered
- the outputs don't add up to 100% - that's what softmax is for
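
The ranges above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the `sigmoid` and `softmax` helpers and the example `scores` are written here for illustration and are not part of the original notes.

```python
from math import exp, tanh


def sigmoid(x):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + exp(-x))


def softmax(xs):
    # normalises a list of scores so that the outputs sum to 1
    m = max(xs)  # subtract the max for numerical stability
    exps = [exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


scores = [-2.0, 0.0, 3.0]  # illustrative inputs
tanh_out = [tanh(x) for x in scores]
sigmoid_out = [sigmoid(x) for x in scores]
softmax_out = softmax(scores)

print(tanh_out)     # between -1 and 1, centred on 0
print(sigmoid_out)  # between 0 and 1, need not sum to 1
print(softmax_out)  # between 0 and 1, sums to 1
```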

## Challenges with recurrent neural networks

Two historical challenges
- how to train (vanishing & exploding gradients)
- how to remember long term (the long term dependency problem)

Why does the RNN architecture cause these two challenges?
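
One way to build intuition: backpropagation through time multiplies the gradient by a recurrent factor once per timestep. The sketch below assumes a scalar linear recurrence `h_t = w * h_{t-1}`, which is a deliberate simplification (real RNNs use weight matrices and nonlinearities). `gradient_through_time` is a hypothetical helper written for this example.

```python
def gradient_through_time(w, steps):
    # d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1}
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one factor of w per timestep
    return grad


vanishing = gradient_through_time(0.9, 50)
exploding = gradient_through_time(1.1, 50)

print(vanishing)  # shrinks towards 0 as the number of steps grows
print(exploding)  # grows rapidly as the number of steps grows
```

The shrinking case also hints at the second challenge: if the gradient from a distant timestep is close to zero, the network gets almost no signal for learning long term dependencies.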

LSTMs specifically address both of these problems.  They were introduced in 1997 (see [Hochreiter & Schmidhuber (1997) Long Short-Term Memory](http://www.bioinf.jku.at/publications/older/2604.pdf)), and work well on a wide range of problems.

## The LSTM

The **Long Short Term Memory (LSTM)** network is composed of memory cells.  Each cell has an internal state controlled by three gates:
- forget gate = what to discard from the state (reset)
- input gate = update the state (write)
- output gate = what to output (read)

The input & forget gates update the internal state.  The input gate protects the cell from irrelevant inputs; the output gate protects other units from irrelevant cell contents.

The rest of the neural network communicates with the cells via these gates.  The gates ensure a stable error signal (known as a *constant error carrousel*).

These gates not only help with keeping gradients consistent, they also help with longer term memory.
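
A rough way to see how the gates help: along the direct cell state path, the gradient from one timestep to the previous one is scaled by the forget gate activation. The sketch below assumes the gate activations are fixed numbers and ignores every other gradient path, so it is an illustration rather than a full derivation. `cell_state_gradient` is a hypothetical helper.

```python
def cell_state_gradient(forget_values):
    # product of forget gate activations along the direct cell state path
    grad = 1.0
    for f in forget_values:
        grad *= f
    return grad


mostly_remember = cell_state_gradient([0.99] * 50)  # gate close to 1
mostly_forget = cell_state_gradient([0.5] * 50)     # gate far from 1

print(mostly_remember)  # still a usable gradient after 50 steps
print(mostly_forget)    # effectively zero
```

When the forget gate stays close to 1, the error signal survives many timesteps, which is also what lets the cell hold information for longer.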

## Limitations of LSTMs

They are not a solution to everything!

LSTMs may offer little benefit on time series problems where the relevant information is within a small time window, or on problems that are well solved by traditional autoregressive models (ARIMA etc).

## Future of LSTMs

**Attention (2013)** - let every step of an RNN pick information to look at from some larger collection of information
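
A minimal sketch of the idea, assuming dot-product scoring followed by a softmax. This is one common choice, not necessarily the mechanism of any particular paper. The names `attend`, `memory` and `query` are hypothetical.

```python
from math import exp


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    # score each memory vector by its dot product with the query
    scores = [sum(q * m for q, m in zip(query, vec)) for vec in memory]
    weights = softmax(scores)  # attention weights sum to 1
    # weighted average of the memory vectors
    dim = len(memory[0])
    context = [sum(w * vec[d] for w, vec in zip(weights, memory)) for d in range(dim)]
    return weights, context


memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # the larger collection of information
query = [2.0, 0.0]                             # what the current step is looking for
weights, context = attend(query, memory)

print(weights)  # how much each memory item is looked at
print(context)  # the information picked out for this step
```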

## Recurrent recap

Information from each timestep is fed to the next timestep via the hidden state

<img src="assets/rec.png" alt="" width="800"/>
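
The recurrence can be sketched as a loop that carries the hidden state forward. This assumes a single hidden unit with a tanh activation; `rnn_step` and its weights are illustrative values, not trained parameters.

```python
from math import tanh


def rnn_step(h_prev, x, w_h=0.5, w_x=1.0, b=0.0):
    # new hidden state depends on the previous hidden state and the current input
    return tanh(w_h * h_prev + w_x * x + b)


inputs = [1.0, 0.0, 0.0, 0.0]  # a single pulse followed by zeros
h = 0.0                        # initial hidden state
hidden_states = []
for x in inputs:
    h = rnn_step(h, x)  # h is carried forward to the next timestep
    hidden_states.append(h)

print(hidden_states)  # the pulse keeps influencing later steps, but fades
```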

## LSTM architecture

In the LSTM we have more complexity in the cell

![](assets/lstm.png)

## Three gates

### 1 - Forget gate

How much of the cell state to forget based on the next observation in the sequence ($x$) and hidden state ($h$)

<img src="assets/forget.png" alt="" width="300"/>

Output of sigmoid (between 0 and 1) operates on the cell state
- 0 = forget everything
- 1 = remember everything

$$ f_{t} = \text{sigmoid}(W_{f} \cdot [h_{t-1}, x_t] + B_{f}) $$
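
The forget gate equation can be sketched in plain Python. This assumes 2 hidden units and 1 input feature; the weights, biases and `C_prev` are made-up values for illustration, and `gate` is a hypothetical helper.

```python
from math import exp


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def gate(W, h_prev, x, B):
    # sigmoid(W . [h_prev, x] + B), one output per row of W
    concat = h_prev + x  # list concatenation gives [h_{t-1}, x_t]
    return [sigmoid(sum(w * v for w, v in zip(row, concat)) + b) for row, b in zip(W, B)]


h_prev = [0.1, -0.2]      # previous hidden state (2 units)
x = [1.0]                 # current observation (1 feature)
W_f = [[0.5, -0.3, 0.8],  # illustrative weights, shape (2, 3)
       [0.1, 0.4, -0.6]]
B_f = [0.0, 0.0]

f_t = gate(W_f, h_prev, x, B_f)
C_prev = [2.0, -1.0]                         # previous cell state
kept = [f * c for f, c in zip(f_t, C_prev)]  # elementwise f_t * C_{t-1}

print(f_t)   # one value between 0 and 1 per cell state element
print(kept)  # the cell state after forgetting
```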

### 2 - Input gate

How to update the cell state based on the next observation in the sequence ($x$) and hidden state ($h$)

<img src="assets/input.png" alt="" width="300"/>

Decide what to update:

$$ i_{t} = \text{sigmoid}(W_{i} \cdot [h_{t-1}, x_t] + B_{i}) $$

What the new values should be:

$$ \overset{\sim}{C}_{t} = \tanh(W_{C} \cdot [h_{t-1}, x_t] + B_{C}) $$

We can now put together the results of our forget and input gates:

$$ C_{t} = f_{t} * {C}_{t-1} + i_{t} * \overset{\sim}{C}_{t} $$
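
The cell state update can be sketched as follows, assuming 2 hidden units, 1 input feature and a forget gate output `f_t` that has already been computed. All weights and states are made-up values, and `affine` is a hypothetical helper.

```python
from math import exp, tanh


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def affine(W, v, B):
    # W . v + B, one output per row of W
    return [sum(w * vi for w, vi in zip(row, v)) + b for row, b in zip(W, B)]


h_prev, x = [0.1, -0.2], [1.0]
concat = h_prev + x  # [h_{t-1}, x_t]

# illustrative weights, shape (2, 3), with zero biases
W_i = [[0.2, 0.1, 0.9], [-0.4, 0.3, 0.5]]
W_C = [[0.7, -0.2, 1.0], [0.3, 0.6, -1.0]]
B_i = [0.0, 0.0]
B_C = [0.0, 0.0]

i_t = [sigmoid(a) for a in affine(W_i, concat, B_i)]   # what to update
C_tilde = [tanh(a) for a in affine(W_C, concat, B_C)]  # candidate new values

f_t = [0.9, 0.1]      # forget gate output, assumed already computed
C_prev = [2.0, -1.0]  # previous cell state

# C_t = f_t * C_{t-1} + i_t * C_tilde_t, with all products elementwise
C_t = [f * c + i * ct for f, c, i, ct in zip(f_t, C_prev, i_t, C_tilde)]

print(C_t)
```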

### 3 - Output gate

<img src="assets/output.png" alt="" width="300"/>

Which parts of the cell state to output based on the next observation in the sequence ($x$) and hidden state ($h$)

What to output from the cell state:

$$ o_{t} = \text{sigmoid}(W_{o} \cdot [h_{t-1}, x_{t}] + B_{o}) $$

What the output should be:

$$ h_{t} = o_{t} * \tanh(C_{t}) $$
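
Putting the three gates together gives one full LSTM step. The sketch below assumes 1 hidden unit and 1 input feature; `lstm_step` and the `params` dictionary are hypothetical names with made-up values, not a library API.

```python
from math import exp, tanh


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def affine(W, v, B):
    # W . v + B, one output per row of W
    return [sum(w * vi for w, vi in zip(row, v)) + b for row, b in zip(W, B)]


def lstm_step(x, h_prev, C_prev, params):
    concat = h_prev + x  # [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["W_f"], concat, params["B_f"])]  # forget
    i_t = [sigmoid(a) for a in affine(params["W_i"], concat, params["B_i"])]  # input
    C_tilde = [tanh(a) for a in affine(params["W_C"], concat, params["B_C"])]
    o_t = [sigmoid(a) for a in affine(params["W_o"], concat, params["B_o"])]  # output
    C_t = [f * c + i * ct for f, c, i, ct in zip(f_t, C_prev, i_t, C_tilde)]
    h_t = [o * tanh(c) for o, c in zip(o_t, C_t)]
    return h_t, C_t


# hypothetical parameters: each weight matrix has shape (1, 2)
params = {
    "W_f": [[0.5, 0.5]], "B_f": [1.0],
    "W_i": [[0.4, 0.9]], "B_i": [0.0],
    "W_C": [[0.3, 1.2]], "B_C": [0.0],
    "W_o": [[0.6, 0.7]], "B_o": [0.0],
}

h, C = [0.0], [0.0]  # initial hidden state and cell state
for x_t in [0.5, 0.6, 0.7]:
    h, C = lstm_step([x_t], h, C, params)

print(h, C)
```

Both `h` and `C` are passed from one step to the next, which is the extra complexity compared with a plain recurrent cell.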

## Gated Recurrent Units (GRUs)

Simpler than LSTMs & popular

Forget & input gates combined into an update gate

Cell state & hidden state merged together

[Cho, et al. (2014) Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation](https://arxiv.org/pdf/1406.1078v3.pdf)

![](assets/gru.png)
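
A minimal sketch of one GRU step, assuming the commonly presented formulation with an update gate `z_t` and a reset gate `r_t`, and with biases omitted. The equations are not given in these notes, so treat the exact form as an assumption. `gru_step` and the weights are hypothetical.

```python
from math import exp, tanh


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def matvec(W, v):
    # W . v, one output per row of W
    return [sum(w * vi for w, vi in zip(row, v)) for row in W]


def gru_step(x, h_prev, W_z, W_r, W_h):
    concat = h_prev + x  # [h_{t-1}, x_t]
    z_t = [sigmoid(a) for a in matvec(W_z, concat)]  # update gate
    r_t = [sigmoid(a) for a in matvec(W_r, concat)]  # reset gate
    reset_concat = [r * h for r, h in zip(r_t, h_prev)] + x
    h_tilde = [tanh(a) for a in matvec(W_h, reset_concat)]  # candidate state
    # single state: interpolate between the old state and the candidate
    return [(1 - z) * h + z * ht for z, h, ht in zip(z_t, h_prev, h_tilde)]


# illustrative weights for 1 hidden unit and 1 input feature
W_z, W_r, W_h = [[0.5, 0.5]], [[0.3, -0.3]], [[0.8, 1.0]]

h = [0.0]
for x_t in [0.5, 0.6, 0.7]:
    h = gru_step([x_t], h, W_z, W_r, W_h)

print(h)
```

Note that only `h` is carried between steps, and the single gate `z_t` decides both how much of the old state to keep and how much of the candidate to write.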

See [Greff et al. (2017) LSTM: A Search Space Odyssey](https://arxiv.org/pdf/1503.04069.pdf) for a comparison of different LSTM variants.

## Practical

Predict a damped sine wave using an LSTM:

```python
from math import sin, pi, exp

import matplotlib.pyplot as plt

# Jupyter magic: render plots inline in the notebook
%matplotlib inline


def generate_sequence(length, period, decay):
    # generate a damped sine wave scaled into [0, 1]
    return [0.5 + 0.5 * sin(2 * pi * i / period) * exp(-decay * i) for i in range(length)]


# plot 102 timesteps with a period of 20 and a decay rate of 0.05
fig, ax = plt.subplots()
_ = ax.plot(generate_sequence(102, 20, 0.05))
```
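
Before an LSTM can be trained on this sequence, it needs to be cut into input windows and targets. The sketch below is one possible framing, not a prescribed one: `make_windows` is a hypothetical helper, and the window sizes (50 inputs, 5 outputs) are arbitrary choices for illustration.

```python
from math import sin, pi, exp


def generate_sequence(length, period, decay):
    # same damped sine wave as above, scaled into [0, 1]
    return [0.5 + 0.5 * sin(2 * pi * i / period) * exp(-decay * i) for i in range(length)]


def make_windows(sequence, n_in, n_out):
    # slide over the sequence, pairing n_in inputs with the next n_out targets
    X, y = [], []
    for start in range(len(sequence) - n_in - n_out + 1):
        X.append(sequence[start:start + n_in])
        y.append(sequence[start + n_in:start + n_in + n_out])
    return X, y


sequence = generate_sequence(102, 20, 0.05)
X, y = make_windows(sequence, n_in=50, n_out=5)

print(len(X), len(X[0]), len(y[0]))  # 48 50 5
```

Each row of `X` is one training example of 50 timesteps, and the matching row of `y` holds the 5 values that follow it.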