# Long Short-Term Memory (LSTM)

Architecture: a memory cell that maintains its state over time, and non-linear gating units that regulate the flow of information into and out of the cell.

#### Forward Pass

- $x^t$ is the input vector at time $t$
- $N$ is the number of LSTM blocks
- $M$ is the number of inputs

###### LSTM Layer Weights
- $W_z, W_i, W_f, W_o \in \mathbb{R}^{N \times M}$: Input Weights
- $R_z, R_i, R_f, R_o \in \mathbb{R}^{N \times N}$: Recurrent Weights
- $p_i, p_f, p_o \in \mathbb{R}^N$: Peephole Weights
- $b_z, b_i, b_f, b_o \in \mathbb{R}^N$: Bias Weights
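
To make the shapes above concrete, here is a minimal NumPy sketch that allocates one LSTM layer's parameters. The function name `init_lstm_params`, the flat dictionary layout, and the small Gaussian initialisation are illustrative assumptions, not something these notes prescribe.

```python
import numpy as np

def init_lstm_params(N, M, seed=0, scale=0.1):
    """Allocate LSTM parameters for N blocks and M inputs (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in "zifo":  # block input, input gate, forget gate, output gate
        params[f"W_{gate}"] = rng.normal(0.0, scale, size=(N, M))  # input weights
        params[f"R_{gate}"] = rng.normal(0.0, scale, size=(N, N))  # recurrent weights
        params[f"b_{gate}"] = np.zeros(N)                          # biases
    for gate in "ifo":   # peepholes exist only for the three gates, not for z
        params[f"p_{gate}"] = rng.normal(0.0, scale, size=N)
    return params

params = init_lstm_params(N=4, M=3)
n_params = sum(v.size for v in params.values())  # 4*N*M + 4*N*N + 7*N
```

Counting entries gives $4NM + 4N^2 + 7N$ trainable parameters per layer: four input matrices, four recurrent matrices, three peephole vectors and four bias vectors.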

###### Vector Formulas in Forward Pass
- $\bar z^t = W_z x^t + R_z y^{t - 1} + b_z$
- $z^t = g(\bar z^t)$: Block Input
- $\bar i^t = W_i x^t + R_i y^{t - 1} + p_i \odot c^{t - 1} + b_i$
- $i^t = \sigma(\bar i^t)$: Input Gate
- $\bar f^t = W_f x^t + R_f y^{t - 1} + p_f \odot c^{t - 1} + b_f$
- $f^t = \sigma(\bar f^t)$: Forget Gate
- $c^t = z^t \odot i^t + c^{t - 1} \odot f^t$: Cell State
- $\bar o^t = W_o x^t + R_o y^{t - 1} + p_o \odot c^t + b_o$
- $o^t = \sigma(\bar o^t)$: Output Gate
- $y^t = h(c^t) \odot o^t$: Block Output

where $\sigma, g, h$ are point-wise non-linear activation functions and $\odot$ denotes point-wise multiplication of two vectors.
- Logistic Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$ is used as the gate activation function.
- Hyperbolic Tangent: $g(x) = h(x) = \tanh(x)$ is usually used as the block input and output activation function.
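
The forward-pass formulas above translate almost line by line into NumPy. The sketch below is one possible rendering, assuming $g = h = \tanh$ and weights stored in dictionaries keyed by `'z'`, `'i'`, `'f'`, `'o'`. The name `lstm_step` and the dictionary layout are illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, y_prev, c_prev, W, R, p, b):
    """One forward step of a peephole LSTM layer (hypothetical helper).

    W, R, b are dicts keyed by 'z', 'i', 'f', 'o'; p is keyed by 'i', 'f', 'o'.
    """
    z = np.tanh(W["z"] @ x + R["z"] @ y_prev + b["z"])                    # block input
    i = sigmoid(W["i"] @ x + R["i"] @ y_prev + p["i"] * c_prev + b["i"])  # input gate
    f = sigmoid(W["f"] @ x + R["f"] @ y_prev + p["f"] * c_prev + b["f"])  # forget gate
    c = z * i + c_prev * f                                                # new cell state
    o = sigmoid(W["o"] @ x + R["o"] @ y_prev + p["o"] * c + b["o"])       # output gate peeks at c^t
    y = np.tanh(c) * o                                                    # block output
    return y, c

# Example with N = 2 blocks and M = 3 inputs; all weights are zero for easy checking.
N, M = 2, 3
W = {k: np.zeros((N, M)) for k in "zifo"}
R = {k: np.zeros((N, N)) for k in "zifo"}
b = {k: np.zeros(N) for k in "zifo"}
p = {k: np.zeros(N) for k in "ifo"}
y, c = lstm_step(np.ones(M), np.zeros(N), np.ones(N), W, R, p, b)
# With zero weights every gate equals sigma(0) = 0.5 and z = tanh(0) = 0,
# so c = 0.5 * c_prev and y = 0.5 * tanh(c).
```

Note that the input and forget gates peek at the previous cell state $c^{t-1}$, while the output gate peeks at the freshly updated $c^t$, which is why `o` is computed after `c`.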

#### Backpropagation Through Time

- $\delta y^t = \Delta^t + R^T_z \delta \bar z^{t + 1} + R^T_i \delta \bar i^{t + 1} + R^T_f \delta \bar f^{t + 1} + R^T_o \delta \bar o^{t + 1}$
- $\delta \bar o^t = \delta y^t \odot h(c^t) \odot \sigma'(\bar o^t)$
- $\delta c^t = \delta y^t \odot o^t \odot h'(c^t) + p_o \odot \delta \bar o^t + p_i \odot \delta \bar i^{t + 1} + p_f \odot \delta \bar f^{t + 1} + \delta c^{t + 1} \odot f^{t + 1}$
- $\delta \bar f^t = \delta c^t \odot c^{t - 1} \odot \sigma'(\bar f^t)$
- $\delta \bar i^t = \delta c^t \odot z^t \odot \sigma'(\bar i^t)$
- $\delta \bar z^t = \delta c^t \odot i^t \odot g'(\bar z^t)$

where $\Delta^t$ is the vector of deltas passed down from the layer above. The deltas for the inputs, $\delta x^t$, are only needed if there is a layer below that needs training, and can be computed as follows:
$$\delta x^t = W^T_z \delta \bar z^t + W^T_i \delta \bar i^t + W^T_f \delta \bar f^t + W^T_o \delta \bar o^t$$
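
One way to sanity-check the delta equations above is a finite-difference test on a single time step. The sketch below assumes $g = h = \tanh$, uses the identities $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ and $\tanh'(x) = 1 - \tanh^2(x)$, and sets all deltas from time $t + 1$ to zero, as at the last step of a sequence. The names `forward_step`, `backward_step` and `nxt` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_step(x, y_prev, c_prev, W, R, p, b):
    """Forward step that also returns the values the backward pass needs."""
    z = np.tanh(W["z"] @ x + R["z"] @ y_prev + b["z"])
    i = sigmoid(W["i"] @ x + R["i"] @ y_prev + p["i"] * c_prev + b["i"])
    f = sigmoid(W["f"] @ x + R["f"] @ y_prev + p["f"] * c_prev + b["f"])
    c = z * i + c_prev * f
    o = sigmoid(W["o"] @ x + R["o"] @ y_prev + p["o"] * c + b["o"])
    y = np.tanh(c) * o
    return y, dict(z=z, i=i, f=f, o=o, c=c, c_prev=c_prev)

def backward_step(Delta, cache, nxt, W, R, p):
    """Deltas at time t; `nxt` holds d_z, d_i, d_f, d_o, d_c and f from time t + 1."""
    z, i, f, o, c, c_prev = (cache[k] for k in ("z", "i", "f", "o", "c", "c_prev"))
    h = np.tanh(c)
    d_y = Delta + sum(R[k].T @ nxt["d_" + k] for k in "zifo")
    d_o = d_y * h * o * (1 - o)                    # sigma'(o_bar) = o * (1 - o)
    d_c = (d_y * o * (1 - h ** 2) + p["o"] * d_o
           + p["i"] * nxt["d_i"] + p["f"] * nxt["d_f"] + nxt["d_c"] * nxt["f"])
    d_f = d_c * c_prev * f * (1 - f)
    d_i = d_c * z * i * (1 - i)
    d_z = d_c * i * (1 - z ** 2)                   # g'(z_bar) = 1 - tanh(z_bar)^2
    d_x = sum(W[k].T @ d for k, d in zip("zifo", (d_z, d_i, d_f, d_o)))
    return dict(d_z=d_z, d_i=d_i, d_f=d_f, d_o=d_o, d_c=d_c, f=f, d_x=d_x)

rng = np.random.default_rng(0)
N, M = 3, 2
W = {k: rng.normal(size=(N, M)) for k in "zifo"}
R = {k: rng.normal(size=(N, N)) for k in "zifo"}
b = {k: rng.normal(size=N) for k in "zifo"}
p = {k: rng.normal(size=N) for k in "ifo"}
x, y_prev = rng.normal(size=M), rng.normal(size=N)
c_prev, Delta = rng.normal(size=N), rng.normal(size=N)

y, cache = forward_step(x, y_prev, c_prev, W, R, p, b)
zeros = {k: np.zeros(N) for k in ("d_z", "d_i", "d_f", "d_o", "d_c", "f")}
grads = backward_step(Delta, cache, zeros, W, R, p)

# Central finite differences of the scalar loss L = Delta . y with respect to x.
eps = 1e-6
numeric = np.zeros(M)
for j in range(M):
    e = np.zeros(M)
    e[j] = eps
    y_plus = forward_step(x + e, y_prev, c_prev, W, R, p, b)[0]
    y_minus = forward_step(x - e, y_prev, c_prev, W, R, p, b)[0]
    numeric[j] = (y_plus - y_minus) @ Delta / (2 * eps)
max_err = np.max(np.abs(numeric - grads["d_x"]))
```

If the analytic deltas are right, `max_err` should be tiny compared with the gradient entries themselves. A full backward pass would call `backward_step` from the last time step to the first, feeding each result in as `nxt` for the step before it.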

The gradients of the weights can then be calculated as follows:

<br />
<img src=".\images\rnn_1.PNG" />
<br />
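
As a rough companion to the weight-gradient formulas, the sketch below accumulates gradients over a sequence by applying the chain rule to the forward-pass equations: each input-weight gradient is a sum of outer products of a pre-activation delta with $x^t$, each recurrent-weight gradient pairs the delta at $t + 1$ with $y^t$, and peephole gradients pair deltas with the cell state they peeked at. It assumes zero initial state ($y^{-1} = c^{-1} = 0$), and the function name and argument layout are hypothetical. The notation may differ from the figure.

```python
import numpy as np

def weight_gradients(xs, ys, cs, deltas):
    """Accumulate weight gradients over time steps t = 0..T (hypothetical helper).

    xs[t], ys[t], cs[t]: input, block output and cell state at time t.
    deltas[k][t]: delta of the pre-activation of k in 'z', 'i', 'f', 'o' at time t.
    """
    T = len(xs) - 1
    N, M = len(ys[0]), len(xs[0])
    grads = {}
    for k in "zifo":
        d = deltas[k]
        grads["W_" + k] = sum((np.outer(d[t], xs[t]) for t in range(T + 1)), np.zeros((N, M)))
        grads["R_" + k] = sum((np.outer(d[t + 1], ys[t]) for t in range(T)), np.zeros((N, N)))
        grads["b_" + k] = sum((d[t] for t in range(T + 1)), np.zeros(N))
    # Input and forget gates peek at c^{t-1}, so their deltas at t + 1 pair with c^t.
    grads["p_i"] = sum((cs[t] * deltas["i"][t + 1] for t in range(T)), np.zeros(N))
    grads["p_f"] = sum((cs[t] * deltas["f"][t + 1] for t in range(T)), np.zeros(N))
    # The output gate peeks at c^t, so delta and cell state share the same time index.
    grads["p_o"] = sum((cs[t] * deltas["o"][t] for t in range(T + 1)), np.zeros(N))
    return grads

# Tiny hand-checkable example: two time steps, N = 2 blocks, M = 1 input.
xs = [np.array([1.0]), np.array([2.0])]
ys = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
cs = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
deltas = {k: [np.ones(2), 2 * np.ones(2)] for k in "zifo"}
grads = weight_gradients(xs, ys, cs, deltas)
```

The deltas passed in here are made-up numbers chosen so the sums can be verified by hand. In practice they would come from the backpropagation-through-time recursion.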