
# Source 1 : [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

## Recurrent Neural Networks (RNNs)

Traditional feedforward ANNs process each input independently, so they cannot use the sequential context of data points. That context is fundamental to understanding sequences such as text or speech.

RNNs address this by including loops where data can persist. For input $x_t$, the network outputs some $h_t$. On the next input, $x_{t+1}$, the RNN receives persisting data from earlier inputs.

![RNN unrolled](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png)

RNNs, with this sequential architecture, have been successfully applied to numerous problems including speech recognition, translation, and image captioning. That said, LSTMs have been crucial for these successes.

## RNNs: Bad at Long-Term Dependencies

Based on what we've seen, RNNs should be able to connect previously seen information to their present task. As an example, say the network needs to predict the next word in a sentence based on previous words. If it's given "the clouds are in the ____," it's pretty obvious that the next word will be "sky." The immediate context enables the RNN to use the past information.

Consider a much longer paragraph beginning, "I grew up in France..." and ending with "I speak fluent ____." Recent context might inform the RNN that the next word will be a language. Unfortunately, the context of France is a much more distant memory.

As this gap grows, RNNs become unable to learn to connect the distant information. In theory, they're capable of handling these "long-term dependencies," but in practice gradient descent struggles to find parameters that do so. To quote the title of the [1994 paper](http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf) by Bengio, Simard, and Frasconi on the topic, "learning long-term dependencies with gradient descent is difficult."
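One common explanation for this difficulty is that gradients shrink as they are propagated back through many time steps. The sketch below is only an illustration under stated assumptions: it uses a $\text{tanh}$ RNN whose recurrent matrix `W` is 0.5 times an orthogonal matrix (a choice made here so the shrinkage is guaranteed, not something taken from the source), and it tracks the norm of $\partial s_t / \partial s_0$.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 8, 20

# Assumption: W is 0.5 times an orthogonal matrix, so every singular value is 0.5.
Q, _ = np.linalg.qr(rng.normal(size=(hidden_size, hidden_size)))
W = 0.5 * Q

s = np.zeros(hidden_size)
jacobian = np.eye(hidden_size)          # accumulates d s_t / d s_0
norms = []
for _ in range(steps):
    x_t = rng.normal(size=hidden_size)  # stand-in for the input contribution U @ x_t
    s = np.tanh(W @ s + x_t)
    # For a tanh RNN, d s_t / d s_{t-1} = diag(1 - s_t^2) @ W
    jacobian = np.diag(1.0 - s**2) @ W @ jacobian
    norms.append(np.linalg.norm(jacobian, 2))

print(f"after 1 step: {norms[0]:.3e}, after {steps} steps: {norms[-1]:.3e}")
```

Because each per-step Jacobian has spectral norm of at most 0.5 in this setup, the influence of $s_0$ on $s_t$ decays at least geometrically, which is why the "France" context is hard to learn from.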

## LSTMs: Good at Long-Term Dependencies

Long Short-Term Memory networks, shortened to "LSTMs," are a special kind of RNN capable of learning long-term dependencies. They were introduced in 1997 by Hochreiter and Schmidhuber in [this paper](http://www.bioinf.jku.at/publications/older/2604.pdf). LSTMs were designed to solve the long-term dependency problem, so remembering information for a long time is basically their default behavior.

Consider the structure of a typical RNN:

![RNN Chain](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png)

In a standard RNN, each repeating module is quite simple: the previous module's output $h_{t-1}$ is combined with the current input $x_t$, and the result is passed through a single $\text{tanh}$ layer.

For clarity, consider another RNN illustration:

![RNN Alternative View](http://www.wildml.com/wp-content/uploads/2015/09/rnn.jpg)

Notation from [another source](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/):
* $x_t$ is input at time $t$.
* $s_t$ is the hidden state at time $t$. It's the memory of the RNN and is calculated as $s_t=f(Ux_t+Ws_{t-1})$. $f$ is a nonlinearity like $\text{tanh}$ or $\text{ReLU}$.
* $o_t$ is the output at time $t$. As an example, it might be $\text{softmax}(Vs_t)$ if we wanted a probability distribution over the options in the output vector, from which we could pick the most probable one.
* $U$, $V$, and $W$ are shared across every time step, as the folded diagram on the left shows.
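
The notation above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the layer sizes, the random initialization, and the helper names `softmax` and `rnn_step` are assumptions made here, and $f$ is taken to be $\text{tanh}$.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_step(x_t, s_prev, U, W, V):
    """One vanilla RNN step: returns (s_t, o_t)."""
    s_t = np.tanh(U @ x_t + W @ s_prev)  # s_t = f(U x_t + W s_{t-1})
    o_t = softmax(V @ s_t)               # o_t = softmax(V s_t)
    return s_t, o_t

# Hypothetical sizes, chosen only for illustration.
input_size, hidden_size, output_size = 4, 3, 5
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(hidden_size, input_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(output_size, hidden_size))

s = np.zeros(hidden_size)                # initial hidden state
xs = rng.normal(size=(6, input_size))    # a sequence of 6 inputs
for x_t in xs:
    s, o = rnn_step(x_t, s, U, W, V)     # the same U, W, V are reused at every step
```

The loop makes the weight sharing explicit: only the hidden state `s` changes from step to step.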

Now consider the structure of an LSTM:

![LSTM Chain](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png)

LSTMs have the same chain structure, but each repeating module contains four interacting neural network layers instead of one. The notation used is as follows:

![RNN Notation](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM2-notation.png)

We'll go into the specifics of the LSTM structure later.

## The Core Idea of LSTMs

The centerpiece of the LSTM is the cell state, seen as the top line in each module. It runs straight down the entire chain with only a few minor interactions along the way, so information can easily flow along it unchanged. The LSTM adds or removes information from the cell state through structures called gates.

Gates optionally let information through. They're composed of a sigmoid layer and a pointwise multiplication operation. The sigmoid (by definition) outputs numbers between zero and one, where a zero means "let nothing through" and a one means "let everything through!"
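
As a small sketch of the gating mechanism, the example below applies a sigmoid to some made-up pre-activation values and multiplies the result pointwise with a made-up cell state. All numbers are hypothetical and chosen only to show the three regimes: block, half pass, and pass.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

cell_state = np.array([2.0, -1.0, 0.5])
gate_logits = np.array([-20.0, 0.0, 20.0])  # hypothetical pre-activation values

gate = sigmoid(gate_logits)   # close to 0, exactly 0.5, close to 1
gated = gate * cell_state     # pointwise multiplication
print(gated)
```

The first entry is almost entirely blocked, the second is halved, and the third passes through almost unchanged.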

LSTMs have three of these gates:
1. Forget gate
2. Input gate
3. Output gate

## Step-by-step Walkthrough

Step 1: Decide what information in the cell state should be thrown away. This decision is made by the "forget gate layer." It looks at $h_{t-1}$ and $x_t$ and outputs a number between 0 and 1 for each number in the cell state $C_{t-1}$. A 1 says to keep the value completely and a 0 says to remove it completely.  
![Forget Gate Graphic](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-f.png)
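
A minimal sketch of Step 1, using the form $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ from the source post. The sizes, the random weights, and the function name `forget_gate` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """f_t = sigmoid(W_f @ [h_prev, x_t] + b_f): one value in (0, 1) per cell-state entry."""
    concat = np.concatenate([h_prev, x_t])
    return sigmoid(W_f @ concat + b_f)

hidden_size, input_size = 3, 2  # hypothetical sizes
rng = np.random.default_rng(0)
W_f = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

f_t = forget_gate(np.zeros(hidden_size), rng.normal(size=input_size), W_f, b_f)
```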

Step 2: Decide what new information we should store in the cell state. This has two subprocesses: a $\text{sigmoid}$ layer called the "input gate layer" decides which values to update, and a $\text{tanh}$ layer creates a vector of candidate values, $\tilde{C}_t$, that could be added to the cell state.  
![Input Gate Layer Graphic](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-i.png)
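
Step 2 can be sketched the same way, with $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ and $\tilde{C}_t = \text{tanh}(W_C \cdot [h_{t-1}, x_t] + b_C)$ as in the source post. The sizes, weights, and the function name `input_gate_and_candidates` are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_gate_and_candidates(h_prev, x_t, W_i, b_i, W_C, b_C):
    """Returns (i_t, C_tilde): which entries to update, and the candidate values."""
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ concat + b_i)      # in (0, 1): how much to update each entry
    C_tilde = np.tanh(W_C @ concat + b_C)  # in (-1, 1): candidate values
    return i_t, C_tilde

hidden_size, input_size = 3, 2  # hypothetical sizes
rng = np.random.default_rng(0)
shape = (hidden_size, hidden_size + input_size)
W_i, W_C = rng.normal(scale=0.1, size=shape), rng.normal(scale=0.1, size=shape)
b_i, b_C = np.zeros(hidden_size), np.zeros(hidden_size)

i_t, C_tilde = input_gate_and_candidates(
    np.zeros(hidden_size), rng.normal(size=input_size), W_i, b_i, W_C, b_C
)
```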

Step 3: Update the old cell state $C_{t-1}$ into new cell state $C_t$. We already have *what* to do based on the previous steps, but we now *do* it:  
1. Multiply $C_{t-1}$ by $f_t$, the forget gate's output. This "forgets" what we no longer need.
2. Add $i_t*\tilde{C}_t$ to this result. These are the new candidate values, each scaled by how much we decided to update that entry of the state.  
![Update Cell State Graphic](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-C.png)
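
The update in Step 3 is $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$. A worked example with made-up gate values (all numbers below are hypothetical) shows the three typical cases: keep, overwrite, and blend.

```python
import numpy as np

C_prev = np.array([1.0, -2.0, 0.5])   # old cell state
f_t = np.array([1.0, 0.0, 0.5])       # forget gate: keep, drop, keep half
i_t = np.array([0.0, 1.0, 0.5])       # input gate: ignore, accept, accept half
C_tilde = np.array([0.9, 0.3, -0.4])  # candidate values

C_t = f_t * C_prev + i_t * C_tilde    # pointwise forget, then pointwise add
print(C_t)
```

The first entry is carried over untouched, the second is replaced by its candidate, and the third mixes half of the old value with half of the candidate.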

Step 4: Decide what we'd like to output. This will be a filtered version of our cell state:
1. We run a sigmoid layer over $h_{t-1}$ and $x_t$ to determine which parts of the cell state to output.
2. We put the cell state through $\text{tanh}$ to push the values between -1 and 1.
3. We multiply the $\text{tanh}$ result by the sigmoid's output, so that we output only the parts we decided on.  
![Output Layers Graphic](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-o.png)
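
The four steps above can be combined into a single forward step. This is a sketch under assumptions, not a reference implementation: the parameter dictionary layout, the helper `init_params`, the sizes, and the random initialization are all made up for illustration. The output step uses $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ and $h_t = o_t * \text{tanh}(C_t)$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step following the four steps above. Returns (h_t, C_t)."""
    concat = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # step 1: forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # step 2: input gate
    C_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])  # step 2: candidates
    C_t = f_t * C_prev + i_t * C_tilde                         # step 3: new cell state
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # step 4: output gate
    h_t = o_t * np.tanh(C_t)                                   # step 4: filtered state
    return h_t, C_t

def init_params(input_size, hidden_size, rng):
    """Hypothetical helper: small random weights and zero biases for the four layers."""
    params = {}
    for name in ("f", "i", "C", "o"):
        params[f"W_{name}"] = rng.normal(
            scale=0.1, size=(hidden_size, hidden_size + input_size)
        )
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params

rng = np.random.default_rng(0)
input_size, hidden_size = 2, 3
params = init_params(input_size, hidden_size, rng)

h = np.zeros(hidden_size)
C = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # a sequence of 5 inputs
    h, C = lstm_step(x_t, h, C, params)
```

Note that `C` is only ever scaled and added to, while `h` is recomputed from it at every step.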

## Variants on LSTMs

We've thus far described "a pretty normal LSTM," but in reality almost every paper using LSTMs uses a slightly different version.

**Peephole Connections** - Introduced in 2000 by [Gers & Schmidhuber](ftp://ftp.idsia.ch/pub/juergen/TimeCount-IJCNN2000.pdf). These additional connections allow the gate layers to look at the cell state. Many papers only give some gates peepholes and not others.  
![Peephole Connections Graphic](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-peepholes.png)
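
As a rough sketch of a peephole on the forget gate, the cell state is appended to the gate's input, following the notation of the source post: $f_t = \sigma(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f)$. The sizes, weights, and function name are assumptions, and other formulations of peephole weights exist.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_forget_gate(C_prev, h_prev, x_t, W_f, b_f):
    """Forget gate that also 'peeks' at the previous cell state."""
    concat = np.concatenate([C_prev, h_prev, x_t])  # [C_{t-1}, h_{t-1}, x_t]
    return sigmoid(W_f @ concat + b_f)

hidden_size, input_size = 3, 2  # hypothetical sizes
rng = np.random.default_rng(0)
W_f = rng.normal(scale=0.1, size=(hidden_size, 2 * hidden_size + input_size))
b_f = np.zeros(hidden_size)

f_t = peephole_forget_gate(
    rng.normal(size=hidden_size), np.zeros(hidden_size),
    rng.normal(size=input_size), W_f, b_f,
)
```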

**Coupled Forget and Input Gates** - Instead of separately deciding what to forget and what new information to add, we make those decisions together. We only forget something when we're going to input something in its place, and we only input new values when we forget something older.  
![Coupled Forget/Input Gates Graphic](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-tied.png)
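
With coupled gates, the input gate is replaced by $1 - f_t$, so the update becomes $C_t = f_t * C_{t-1} + (1 - f_t) * \tilde{C}_t$. A small worked example with hypothetical values:

```python
import numpy as np

C_prev = np.array([2.0, 2.0, 2.0])      # old cell state
C_tilde = np.array([-1.0, -1.0, -1.0])  # candidate values
f_t = np.array([1.0, 0.0, 0.25])        # forget gate output

# Whatever fraction is forgotten is exactly the fraction that gets written.
C_t = f_t * C_prev + (1.0 - f_t) * C_tilde
print(C_t)
```

Each entry of the new state is a weighted average of the old value and the candidate.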

**GRU** - Gated Recurrent Units, or GRUs, were introduced by [Cho, et al.](https://arxiv.org/pdf/1406.1078v3.pdf) in 2014. These modules combine the forget and input gates into a single "update gate." They also merge the cell and hidden states, among other changes. The result is simpler than standard LSTMs and was growing in popularity as of 2015.  
![GRU Graphic](http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-GRU.png)
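
A GRU step can be sketched from the equations in the source post's diagram: $z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$, $r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$, $\tilde{h}_t = \text{tanh}(W \cdot [r_t * h_{t-1}, x_t])$, and $h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$. The sizes, the random weights, and the name `gru_step` are assumptions, and biases are omitted as in that diagram.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step; there is no separate cell state, only h."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat)                 # update gate
    r_t = sigmoid(W_r @ concat)                 # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde  # blend old state and candidate

hidden_size, input_size = 3, 2  # hypothetical sizes
rng = np.random.default_rng(0)
shape = (hidden_size, hidden_size + input_size)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=shape) for _ in range(3))

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):  # a sequence of 5 inputs
    h = gru_step(x_t, h, W_z, W_r, W_h)
```

The single update gate `z_t` plays the role of the coupled forget and input gates.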

**Additional Variants** - There are numerous other LSTM variants, including [Depth Gated RNNs](https://arxiv.org/pdf/1508.03790v2.pdf) and [Clockwork RNNs](https://arxiv.org/pdf/1402.3511v1.pdf). In general, most variants achieve pretty similar results, although [a few can do better than LSTMs](http://proceedings.mlr.press/v37/jozefowicz15.pdf) on certain tasks.

## Conclusion

LSTMs are responsible for many of the achievements being made with RNNs. The model may be complex, but hopefully it's a bit more approachable now. The next step, in many researchers' opinions, is the use of attention in ML/DL models.

# Future Things To Read

On this:
* https://skymind.ai/wiki/lstm

Future:
* https://medium.com/mlreview/understanding-lstm-and-its-diagrams-37e2f46f1714
* https://towardsdatascience.com/understanding-lstm-and-its-quick-implementation-in-keras-for-sentiment-analysis-af410fd85b47
* https://towardsdatascience.com/recurrent-neural-networks-and-lstm-4b601dd822a5
* https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21