# Recurrent Neural Networks

Recurrent neural networks have persistent memory in the form of loops that pass information from one step of the network to the next. Think of an unrolled RNN as multiple copies of the same network, each passing a message to its successor.
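
To make the "copies of the same network" picture concrete, here is a minimal NumPy sketch of a plain RNN unrolled over a toy sequence. The sizes, the weight names (`W_xh`, `W_hh`, `b_h`) and the random inputs are assumptions for illustration only; the point is that one set of weights is reused at every step while `h` carries the message forward.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, steps = 3, 4, 5   # hypothetical sizes

# One set of weights, reused at every timestep (the "copies" share parameters).
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

xs = rng.normal(size=(steps, input_size))  # toy input sequence
h = np.zeros(hidden_size)                  # initial hidden state

hidden_states = []
for x_t in xs:
    # The previous hidden state h is the message passed to the next step.
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    hidden_states.append(h)

hidden_states = np.stack(hidden_states)    # one hidden vector per timestep
print(hidden_states.shape)
```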

#### Long Short-Term Memory (LSTM)

LSTMs are designed to avoid the classic long-term dependency problem by being able to remember information for long periods of time. A classic RNN has just a single tanh layer in its repeating module, whereas an LSTM has a different structure composed of four interacting network layers.

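This type of network helps with the vanishing (and, to some degree, exploding) gradients that arise when errors are backpropagated through many timesteps. In a plain RNN the gradient is rescaled multiplicatively at every step, so during backpropagation the weights receive disproportionately small (or large) updates from the earliest timesteps.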
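
The vanishing case can be illustrated with a toy calculation. This is a sketch under strong assumptions: a scalar recurrence $h_t = \tanh(w \cdot h_{t-1})$ with a hypothetical weight `w = 0.9` and no inputs. It only shows how the chain-rule product shrinks as the number of steps grows; it is not a model of a real network.

```python
import math

def gradient_through_time(w, steps, h0=0.5):
    """Return d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - tanh^2(w * h_prev))
        grad *= w * (1.0 - h * h)
    return grad

# The gradient with respect to the initial state shrinks as the chain gets longer.
for steps in (1, 10, 50):
    print(steps, gradient_through_time(w=0.9, steps=steps))
```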

###### Architecture
Cell State: runs down the entire chain with minor linear interactions. Information can be removed or added to the Cell State by the use of structures called Gates. 

###### Gates
Gates are a way to optionally let information through. Each gate is composed of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between 0 and 1 describing how much of each component should be let through: 0 means none and 1 means all. LSTMs have three of these gates.
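
As a small illustration of the sigmoid-plus-multiplication idea, the sketch below applies a gate to a three-element vector. The pre-activation values in `gate_logits` are made up; in an LSTM they would come from a learned layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

cell_state = np.array([2.0, -1.0, 0.5])
gate_logits = np.array([10.0, -10.0, 0.0])  # hypothetical pre-activation values

gate = sigmoid(gate_logits)    # roughly [1, 0, 0.5]
filtered = gate * cell_state   # pointwise multiplication

# First entry passes almost unchanged, second is almost blocked, third is halved.
print(gate)
print(filtered)
```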

#### Process

- $\sigma$: Sigmoid Function
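- $W_f$: Weights of the forget gate layer (likewise $W_i$, $W_C$, $W_o$ for the other layers)
- $b_f$: Biases of the forget gate layer (likewise $b_i$, $b_C$, $b_o$)
- $i_t$: Input gate output (which values get updated)
- $\tilde C_t$: Candidate values that can be added to the state
- $C_{t - 1}$: Old Cell State
- $C_t$: New Cell State
- $f_t$: The Forget Gate output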

<br />
<img src=".\images\rnn_2.PNG" />
<br />

The first step is to decide what information will be thrown away from the cell state. This is determined by a sigmoid layer called the Forget Gate Layer. It looks at $h_{t - 1}$ and $x_t$, and outputs a number between 0 and 1 for each number in the Cell State $C_{t - 1}$. $h_{t - 1}$ is the output from the previous timestep and $x_t$ is the current input. So the operation at this first Forget Gate Layer is:
$$f_t = \sigma(W_f \cdot [h_{t - 1}, x_t] + b_f)$$
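
The forget gate equation can be sketched in NumPy as follows. The sizes and random values are assumptions; the part worth noticing is that $[h_{t - 1}, x_t]$ is a concatenation, so $W_f$ needs one column per entry of both vectors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3          # hypothetical sizes
rng = np.random.default_rng(0)

h_prev = rng.normal(size=hidden_size)   # h_{t-1}
x_t = rng.normal(size=input_size)       # x_t

# W_f acts on the concatenation [h_{t-1}, x_t], so it has
# hidden_size rows and hidden_size + input_size columns.
W_f = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

concat = np.concatenate([h_prev, x_t])
f_t = sigmoid(W_f @ concat + b_f)       # one value in (0, 1) per cell-state entry
print(f_t)
```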

The next step is to decide what new information will be stored in the cell state; this step can be broken into two parts:
- A sigmoid layer called the Input Gate Layer decides which values will be updated.
$$i_t = \sigma(W_i \cdot [h_{t - 1}, x_t] + b_i)$$

- A tanh layer creates a vector of new candidate values $\tilde C_t$ that can be added to the state.
$$\tilde C_t = \tanh(W_C \cdot [h_{t - 1}, x_t] + b_C)$$

Next we multiply the old state $C_{t - 1}$ by $f_t$, which forgets the values we decided to drop from the Cell State. Then we add $i_t \odot \tilde C_t$, which represents the new candidate values scaled by how much we decided to update each state value. Here $\odot$ denotes element-wise multiplication. So:
$$C_t = (f_t \odot C_{t - 1}) + (i_t \odot \tilde C_t)$$

In other words, the new Cell State is the previous Cell State after forgetting the values selected by the forget gate, plus the new candidate values scaled by the input gate.
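
A worked example with hand-picked numbers may help here. All values below are made up, and the multiplications are element-wise; real gate values would rarely be exactly 0 or 1.

```python
import numpy as np

C_prev = np.array([1.0, -2.0, 0.5])    # old cell state C_{t-1}
f_t = np.array([1.0, 0.0, 0.5])        # keep all, forget all, keep half
i_t = np.array([0.0, 1.0, 0.5])        # write nothing, write fully, write half
C_tilde = np.array([0.9, 0.3, -0.4])   # candidate values from the tanh layer

C_t = f_t * C_prev + i_t * C_tilde     # element-wise update of the cell state

# Entry 0 is kept as-is, entry 1 is replaced by its candidate,
# and entry 2 is a half-and-half blend: approximately [1.0, 0.3, 0.05].
print(C_t)
```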

Finally we decide what will be output, which is a filtered version of the Cell State. We run a sigmoid layer to decide what parts of the Cell State will be output, then put the Cell State through a tanh (pushing values to be between -1 and 1) and multiply it element-wise by the output of the sigmoid gate.
$$o_t = \sigma(W_o \cdot [h_{t - 1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$

The output gate therefore decides which parts of the Cell State are passed on as the output $h_t$.
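
Putting the steps of this section together, one LSTM timestep can be sketched in NumPy as below. This is an illustrative sketch, not how any particular library implements it: the `params` dictionary, the `init_params` helper, the sizes and the random initialization are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step following the f_t, i_t, C~_t, C_t, o_t, h_t equations."""
    z = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])  # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                    # new cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate
    h_t = o_t * np.tanh(C_t)                              # new hidden state
    return h_t, C_t

def init_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights and zero biases for the four layers."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("f", "i", "C", "o"):
        params[f"W_{name}"] = rng.normal(
            scale=0.1, size=(hidden_size, hidden_size + input_size)
        )
        params[f"b_{name}"] = np.zeros(hidden_size)
    return params

input_size, hidden_size = 3, 4
params = init_params(input_size, hidden_size)
h, C = np.zeros(hidden_size), np.zeros(hidden_size)

# Run the same step over a toy sequence of five inputs.
for x_t in np.random.default_rng(1).normal(size=(5, input_size)):
    h, C = lstm_step(x_t, h, C, params)

print(h)
print(C)
```

The four weight matrices correspond to the four network layers mentioned earlier: three sigmoid gates and one tanh layer.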

#### Variants on Long Short Term Memory

###### Gers & Schmidhuber - Peephole Connections
This allows the Gate Layer to look at the Cell State:
- $f_t = \sigma(W_f \cdot [C_{t - 1}, h_{t - 1}, x_t] + b_f)$
- $i_t = \sigma(W_i \cdot [C_{t - 1}, h_{t - 1}, x_t] + b_i)$
- $o_t = \sigma(W_o \cdot [C_t, h_{t - 1}, x_t] + b_o)$
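
A minimal sketch of this variant, following the concatenation form of the three equations above. The parameter dictionary `p`, the sizes and the random weights are assumptions; the candidate layer is left without a peephole, and the output gate looks at the updated $C_t$ rather than $C_{t - 1}$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step where the gates also see the cell state (concatenation form)."""
    hx = np.concatenate([h_prev, x_t])
    peek_prev = np.concatenate([C_prev, h_prev, x_t])  # [C_{t-1}, h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ peek_prev + p["b_f"])
    i_t = sigmoid(p["W_i"] @ peek_prev + p["b_i"])
    C_tilde = np.tanh(p["W_C"] @ hx + p["b_C"])        # candidate has no peephole
    C_t = f_t * C_prev + i_t * C_tilde
    peek_new = np.concatenate([C_t, h_prev, x_t])      # [C_t, h_{t-1}, x_t]
    o_t = sigmoid(p["W_o"] @ peek_new + p["b_o"])      # output gate sees the updated state
    return o_t * np.tanh(C_t), C_t

input_size, hidden_size = 3, 4                         # hypothetical sizes
rng = np.random.default_rng(0)
gate_cols = 2 * hidden_size + input_size               # gates take C, h and x
p = {f"W_{g}": rng.normal(scale=0.1, size=(hidden_size, gate_cols)) for g in "fio"}
p["W_C"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
p.update({f"b_{g}": np.zeros(hidden_size) for g in "fiCo"})

h_t, C_t = peephole_lstm_step(
    rng.normal(size=input_size), np.zeros(hidden_size), np.ones(hidden_size), p
)
print(h_t, C_t)
```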

###### Use Coupled Forget and Input Gates
The decisions to forget and to add information are made together: we only forget when we are going to input something in its place. With $\odot$ denoting element-wise multiplication:
$$C_t = f_t \odot C_{t - 1} + (1 - f_t) \odot \tilde C_t$$
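
With the gates coupled, each entry of the new state is an interpolation between the old value and its candidate. The numbers below are made up purely to show that behaviour.

```python
import numpy as np

C_prev = np.array([1.0, -2.0, 0.5])    # old cell state
C_tilde = np.array([0.9, 0.3, -0.4])   # candidate values
f_t = np.array([1.0, 0.0, 0.5])        # the only gate that has to be computed

# Coupled variant: the input gate is tied to the forget gate as (1 - f_t).
C_t = f_t * C_prev + (1.0 - f_t) * C_tilde

# Every entry lies between its old value and its candidate value.
lower = np.minimum(C_prev, C_tilde)
upper = np.maximum(C_prev, C_tilde)
print(C_t, np.all((C_t >= lower) & (C_t <= upper)))
```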

###### Cho - Gated Recurrent Unit (GRU)
Combines the forget and input gates into a single Update Gate and merges the Cell State and the Hidden State.
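
A rough NumPy sketch of one GRU step, for comparison with the LSTM equations. The reset gate `r_t` and the weight names `W_z`, `W_r`, `W_h` are not described in these notes; they follow the usual GRU formulation and should be read as assumptions, and biases are omitted to keep the sketch short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step; biases are omitted to keep the sketch short."""
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ z_in)   # update gate
    r_t = sigmoid(W_r @ z_in)   # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    # A single gate both forgets (1 - z_t) and writes (z_t);
    # there is no separate cell state, only h.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

input_size, hidden_size = 3, 4   # hypothetical sizes
rng = np.random.default_rng(0)
W_z, W_r, W_h = (
    rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    for _ in range(3)
)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h = gru_step(x_t, h, W_z, W_r, W_h)
print(h)
```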

#### Diagram Explained

<br />
<img src=".\images\rnn_3.PNG" />
<br />

- $x_t$: input of the current step
- $h_{t - 1}$: output from the previous LSTM unit
- $C_{t - 1}$: memory of the previous unit
- $h_t$: output of the current unit
- $C_t$: memory of the current unit

From the top pipe for $C_{t - 1}$, the $X$ represents the Forget Gate and the $+$ represents the merging of old memory not forgotten and the new memory. From here the output is the new memory $C_t$.

The first layer, the Forget Gate, is given a bias vector $b_o$ (the pink circle) along with the other inputs.

The second layer, the New Memory Layer, controls how much the new memory should influence the old memory. The new memory itself is generated by another layer that uses tanh as the activation function.

The third layer, the Output Layer, dictates how much of the new memory should be output to the next LSTM unit.

## Keras RNN

In [7]:
from keras import Sequential
from keras.layers import Dense, LSTM

In [2]:
model = Sequential()

#### LSTM Documentation

###### Arguments
- units: dimensionality of the output
- activation: activation function to use; default is tanh
- recurrent_activation: activation to use for the recurrent step; the default is hard sigmoid in older Keras releases and sigmoid in more recent ones
- use_bias: bool, whether or not to use a bias vector
- kernel_initializer: initializer for the kernel weights matrix, used for the linear transformation of the inputs
- recurrent_initializer: initializer for the recurrent_kernel weights matrix, used for the linear transformation of the recurrent state
- bias_initializer: initializes the bias vector
- unit_forget_bias: if True, add 1 to the bias of the forget gate at initialization. Setting it to True also forces bias_initializer = 'zeros'. This is recommended by Jozefowicz et al.
- kernel_regularizer: regularizer function applied to the kernel weights matrix
- recurrent_regularizer: regularizer function applied to the recurrent_kernel weights matrix
- bias_regularizer: regularizer function applied to the bias vector
- activity_regularizer: regularizer function applied to the output of the layer
- dropout: float between 0.0 and 1.0; the fraction of units to drop for the linear transformation of the inputs
- recurrent_dropout: float between 0.0 and 1.0; the fraction of units to drop for the linear transformation of the recurrent state
- stateful: default is False; if True, the last state for each sample at index i in a batch will be used as the initial state for the sample of index i in the following batch

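There is also a CuDNNLSTM layer for GPU-enabled machines. In more recent TensorFlow-based Keras versions, the regular LSTM layer can use the cuDNN kernel automatically when its arguments allow it.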
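
As a rough usage sketch, several of the arguments above can be passed to an `LSTM` layer like this. The input shape (10 timesteps, 8 features), the 16 units and the single-output `Dense` layer are arbitrary choices for illustration. Recent Keras versions accept an `Input` object as the first entry of a `Sequential` model; some older releases expect `input_shape=` on the first layer instead.

```python
from keras.models import Sequential
from keras.layers import Dense, Input, LSTM

timesteps, features = 10, 8   # hypothetical input shape

model = Sequential()
model.add(Input(shape=(timesteps, features)))
model.add(LSTM(
    units=16,                        # dimensionality of the output
    activation="tanh",
    recurrent_activation="sigmoid",
    use_bias=True,
    unit_forget_bias=True,           # start the forget gate bias at 1
    dropout=0.2,                     # dropout on the input transformation
    recurrent_dropout=0.2,           # dropout on the recurrent transformation
    stateful=False,
))
model.add(Dense(1))

# The LSTM layer holds 4 * (16 * (16 + 8) + 16) = 1600 weights:
# one kernel, recurrent kernel and bias slice for each of its four internal layers.
model.summary()
```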