Note: In the following equations, $tanh$ is just used as a default non-linear activation function.

## RNNs (Recurrent Neural Networks)
A neural network that maintains a hidden state carrying information from previous time steps. The hidden state is the core of the RNN: at each step it is updated from the current input and the previous hidden state. The output is then a function of the hidden state, and it need not be a single matrix multiplication with a bias added.

Evolution of the RNN:
\begin{array}{rl}
h_t&=tanh(W_{xh}x_t+W_{hh}h_{t-1}+b_h)\\
y_t&=W_{hy}h_t+b_y \text{ (Toy example equation for output)}
\end{array}
where 
- $y_t$ is the output at time $t$
- $h_t$ is the hidden state at time $t$
- $x_t$ is the data at time $t$
- $b_{...}$ is a bias vector
- $W_{...}$ is a weight matrix
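
The recurrence above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the dimensions, the random initialisation, and the function names `rnn_step` and `rnn_forward` are assumptions made for the example.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One hidden-state update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def rnn_forward(xs, h0, W_xh, W_hh, b_h, W_hy, b_y):
    """Run the recurrence over xs of shape (seq_len, input_dim)."""
    h = h0
    hs, ys = [], []
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
        hs.append(h)
        ys.append(W_hy @ h + b_y)  # toy linear readout y_t
    return np.stack(hs), np.stack(ys)

# Illustrative sizes and random parameters (assumptions, not from the notes).
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, seq_len = 3, 4, 2, 5
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

xs = rng.normal(size=(seq_len, input_dim))
hs, ys = rnn_forward(xs, np.zeros(hidden_dim), W_xh, W_hh, b_h, W_hy, b_y)
print(hs.shape, ys.shape)  # (5, 4) (5, 2)
```

The same weights are reused at every time step; only the hidden state `h` changes as the loop advances.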

## GRUs
An RNN that uses a gating mechanism. This mechanism helps mitigate two issues: vanishing gradients and the bias towards short-term memory (recent inputs dominating the hidden state). The gates control gradient flow and allow the hidden state to be updated only when the new input carries sufficient information.

Evolution of the GRU:
\begin{array}{rl}
z_t &= \sigma(W_zx_t+U_zh_{t-1}+b_z) \\
r_t &= \sigma(W_rx_t+U_rh_{t-1}+b_r) \\
\tilde{h}_t &= tanh(Wx_t + U(r_t\odot h_{t-1}) + b) \\
h_t &= (1-z_t)\odot h_{t-1}+z_t\odot \tilde{h}_t
\end{array}
where
- $z_t$ is the update gate with values from $0$ to $1$
- $r_t$ is the reset gate with values from $0$ to $1$
- $\tilde{h}_t$ is the candidate hidden state
- $h_t$ is the final hidden state
- $\sigma(\cdot)$ is the sigmoid activation function
- $W_{...}$ and $U_{...}$ are weight matrices
- $b_{...}$ are bias vectors
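
A single GRU step might look like the NumPy sketch below. The parameter dictionary `params`, its key names, and the sizes are hypothetical choices for illustration. The final line interpolates between the previous hidden state and the candidate state $\tilde{h}_t$, weighted by the update gate.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update; p maps parameter names to arrays (hypothetical layout)."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])       # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])       # reset gate
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev) + p["b"])  # candidate
    return (1.0 - z) * h_prev + z * h_tilde                           # blend

# Illustrative sizes and random parameters (assumptions, not from the notes).
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4

def init(*shape):
    return rng.normal(scale=0.1, size=shape)

params = {
    "W_z": init(hidden_dim, input_dim), "U_z": init(hidden_dim, hidden_dim),
    "b_z": np.zeros(hidden_dim),
    "W_r": init(hidden_dim, input_dim), "U_r": init(hidden_dim, hidden_dim),
    "b_r": np.zeros(hidden_dim),
    "W": init(hidden_dim, input_dim), "U": init(hidden_dim, hidden_dim),
    "b": np.zeros(hidden_dim),
}

x_t = rng.normal(size=input_dim)
h_prev = rng.uniform(-1.0, 1.0, size=hidden_dim)
h_t = gru_step(x_t, h_prev, params)
```

When $z_t$ is close to $0$ the step returns something close to `h_prev`, which is how the gate lets the state pass through unchanged when an input is uninformative.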

## LSTMs
An RNN that uses a memory cell as well as a gating mechanism. These additions address the difficulty of learning long-term dependencies. The LSTM has more gates than the GRU and keeps the hidden state and the memory cell separate, rather than merging them as the GRU does.

Evolution of the LSTM:
\begin{array}{rl}
    f_t &= \sigma(W_fx_t+U_fh_{t-1}+b_f)\\
    i_t &= \sigma(W_ix_t+U_ih_{t-1}+b_i)\\
    o_t &= \sigma(W_ox_t+U_oh_{t-1}+b_o)\\
    \tilde{c}_t &= tanh(W_cx_t+U_ch_{t-1}+b_c)\\
    c_t &= f_t\odot c_{t-1} + i_t\odot \tilde{c}_t\\
    h_t &= o_t\odot tanh(c_t)
\end{array}
where 
- $f_t$ is the forget gate with values $0$ to $1$
- $i_t$ is the input gate with values $0$ to $1$
- $o_t$ is the output gate with values $0$ to $1$
- $\tilde{c}_t$ is the candidate memory cell vector
- $c_t$ is the updated memory cell vector
- $h_t$ is the hidden state
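
The LSTM update can be sketched the same way. As before, the `params` dictionary, its key names, and the sizes are assumptions for the example. Note that the step carries two pieces of state, the hidden state `h` and the memory cell `c`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update; returns the new hidden state and memory cell."""
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])        # forget gate
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])        # input gate
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])        # output gate
    c_tilde = np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])  # candidate
    c = f * c_prev + i * c_tilde   # keep part of the old cell, add new content
    h = o * np.tanh(c)             # expose a gated view of the cell
    return h, c

# Illustrative sizes and random parameters (assumptions, not from the notes).
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4

params = {}
for gate in ("f", "i", "o", "c"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
    params[f"U_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
    params[f"b_{gate}"] = np.zeros(hidden_dim)

x_t = rng.normal(size=input_dim)
h_prev = np.zeros(hidden_dim)
c_prev = rng.normal(size=hidden_dim)
h_t, c_t = lstm_step(x_t, h_prev, c_prev, params)
```

With the forget gate near $1$ and the input gate near $0$, `c` is copied forward almost unchanged, which is the path that lets information persist over many steps.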


## Bi-RNNs
An RNN that processes a sequence both forwards and backwards. It is used when the entire sequence is available up front, e.g. for text sentiment classification.

\begin{array}{rl} 
\overrightarrow{h}_t&=tanh(\overrightarrow{W}_{xh}x_t+\overrightarrow{W}_{hh}\overrightarrow{h}_{t-1}+\overrightarrow{b})\\
\overleftarrow{h}_t&=tanh(\overleftarrow{W}_{xh}x_t+\overleftarrow{W}_{hh}\overleftarrow{h}_{t+1}+\overleftarrow{b})\\
h_t &= concat(\overrightarrow{h}_t,\overleftarrow{h}_t)
\end{array}
where 
- $\overrightarrow{h}_t$ is the forward hidden state
- $\overleftarrow{h}_t$ is the backward hidden state

Note: The same kind of equations (a forward model and a backward model) appear for the **BiLSTM** and **BiGRU**.
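
A bidirectional pass can be sketched by running one plain RNN left to right and a second, separately parameterised RNN right to left, then concatenating the two states at each position. The helper names `run_direction` and `birnn_forward`, the zero initial states, and the sizes are assumptions for this example.

```python
import numpy as np

def run_direction(xs, W_xh, W_hh, b):
    """Run a tanh RNN over xs in the order given, starting from a zero state."""
    h = np.zeros(b.shape)
    hs = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b)
        hs.append(h)
    return np.stack(hs)

def birnn_forward(xs, fwd, bwd):
    """fwd and bwd are (W_xh, W_hh, b) tuples for the two directions."""
    h_fwd = run_direction(xs, *fwd)
    # Feed the reversed sequence, then flip back so row t aligns with x_t.
    h_bwd = run_direction(xs[::-1], *bwd)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)

# Illustrative sizes and random parameters (assumptions, not from the notes).
rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 4, 5

def init_direction():
    return (
        rng.normal(scale=0.1, size=(hidden_dim, input_dim)),
        rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)),
        np.zeros(hidden_dim),
    )

fwd, bwd = init_direction(), init_direction()
xs = rng.normal(size=(seq_len, input_dim))
hs = birnn_forward(xs, fwd, bwd)
print(hs.shape)  # (5, 8)
```

Each row of `hs` therefore combines a summary of the inputs up to position $t$ (first half) with a summary of the inputs from position $t$ onwards (second half).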