# NLP Concepts #5
## The World of Sequences - Recurrent Neural Networks II

### Simple neural network

Let's start with a simple shallow neural network (a single dense layer). We can define it as:



$$\Large \hat{y} = a(W^Tx + b)$$

where:

* $a$ is some activation function
* $\hat{y}$ is the output (scalar or vector, depending on the shapes of $W$ and $b$)
* $W$ is a weight matrix
* $x$ is the input vector
* $b$ is a bias vector
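
The formula above can be sketched in a few lines of NumPy. This is only an illustration: the sigmoid activation, the dimensions `D_x` and `D_y`, and the helper name `dense_forward` are assumptions made for the example, not part of the original definition.

```python
import numpy as np

def sigmoid(v):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-v))

def dense_forward(x, W, b, a=sigmoid):
    """Compute y_hat = a(W^T x + b) for a single input vector x."""
    return a(W.T @ x + b)

rng = np.random.default_rng(0)
D_x, D_y = 4, 3                      # assumed input and output sizes
x = rng.normal(size=D_x)             # input vector
W = rng.normal(size=(D_x, D_y))      # weight matrix, so W^T x has D_y entries
b = np.zeros(D_y)                    # bias vector

y_hat = dense_forward(x, W, b)
print(y_hat.shape)  # (3,)
```

Storing $W$ with shape `(D_x, D_y)` keeps the code aligned with the $W^Tx$ convention used in the formula.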

### Simple RNN

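A simple **RNN** adds a hidden state $h_t$, which is updated at every time step $t$ from the current input and the previous hidden state: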

$$\Large h_t = a(W_{xh}^T x_t + W_{hh}^T h_{t-1} + b_h)$$

where:

* $a$ is some activation function
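* $x_t$ is the input vector at time step $t$
* $h_{t-1}$ is the hidden state from the previous time step
* $b_h$ is a hidden layer bias vector
* $W_{xh}$ is a weight matrix mapping from a feature vector $x_t \in \mathbb{R}^{D_x}$ to a hidden state $h_t \in \mathbb{R}^{D_h}$
* $W_{hh}$ is a weight matrix mapping from the previous hidden state $h_{t-1} \in \mathbb{R}^{D_h}$ to the new hidden state $h_t \in \mathbb{R}^{D_h}$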
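
To make the recurrence concrete, here is a minimal NumPy sketch that applies the hidden state update over a short random sequence. The $\tanh$ activation, the zero initial state, the dimensions, and the helper name `rnn_step` are assumptions chosen for illustration.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h, a=np.tanh):
    """One step of h_t = a(W_xh^T x_t + W_hh^T h_{t-1} + b_h)."""
    return a(W_xh.T @ x_t + W_hh.T @ h_prev + b_h)

rng = np.random.default_rng(0)
D_x, D_h, T = 4, 5, 6                          # assumed sizes and sequence length
W_xh = rng.normal(scale=0.1, size=(D_x, D_h))  # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(D_h, D_h))  # hidden-to-hidden weights
b_h = np.zeros(D_h)

xs = rng.normal(size=(T, D_x))  # a toy sequence of T input vectors
h = np.zeros(D_h)               # assumed initial hidden state h_0
for x_t in xs:
    # The same weights are reused at every time step.
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)

print(h.shape)  # (5,)
```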

Then we can compute the output(s):

$$\Large \hat{y} = a(W_o^Th_t + b_o)$$

where:

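* $a$ is some activation function (not necessarily the same one as in the hidden state update)
* $\hat{y}$ is the output (scalar or vector, depending on the shapes of $W_o$ and $b_o$)
* $h_t$ is the hidden state computed above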
* $b_o$ is an output layer bias vector
* $W_o$ is an output layer weight matrix
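
As a hedged example, the output layer could look like this for a classification task. The softmax activation, the number of classes `D_y`, and the helper name `output_layer` are assumptions for this sketch; the text leaves the choice of $a$ open.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

def output_layer(h_t, W_o, b_o, a=softmax):
    """Compute y_hat = a(W_o^T h_t + b_o) from a hidden state h_t."""
    return a(W_o.T @ h_t + b_o)

rng = np.random.default_rng(1)
D_h, D_y = 5, 3                                # assumed hidden size and class count
h_t = np.tanh(rng.normal(size=D_h))            # stand-in for a hidden state
W_o = rng.normal(scale=0.1, size=(D_h, D_y))   # output layer weights
b_o = np.zeros(D_y)                            # output layer bias

y_hat = output_layer(h_t, W_o, b_o)
print(y_hat.shape)  # (3,)
```

Applying `output_layer` only to the last hidden state gives one prediction per sequence; applying it at every time step gives one prediction per input.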

### GRU

**GRU** introduces two new concepts: **update gate** $z_t$ and **reset gate** $r_t$.

Let's define them:

$$\Large z_t = \sigma(W_{xz}^T x_t + W_{hz}^T h_{t-1} + b_z)$$


$$\Large r_t = \sigma(W_{xr}^T x_t + W_{hr}^T h_{t-1} + b_r)$$

where:

* $\sigma$ is a sigmoid activation function
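* $x_t$ is the input vector at time step $t$ and $h_{t-1}$ is the previous hidden state
* $b_z$, $b_r$ are the respective bias vectors
* $W_{xz}$, $W_{hz}$, $W_{xr}$, $W_{hr}$ are the respective weight matrices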
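
The two gate equations can be sketched as follows. The parameter dictionary `p`, its key names, and the helper name `gru_gates` are hypothetical conveniences for this example; only the formulas come from the text above.

```python
import numpy as np

def sigmoid(v):
    """Element-wise logistic function; squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def gru_gates(x_t, h_prev, p):
    """Return the update gate z_t and the reset gate r_t."""
    z_t = sigmoid(p["W_xz"].T @ x_t + p["W_hz"].T @ h_prev + p["b_z"])
    r_t = sigmoid(p["W_xr"].T @ x_t + p["W_hr"].T @ h_prev + p["b_r"])
    return z_t, r_t

rng = np.random.default_rng(0)
D_x, D_h = 4, 5  # assumed input and hidden sizes
p = {}
for g in ("z", "r"):
    p[f"W_x{g}"] = rng.normal(scale=0.1, size=(D_x, D_h))
    p[f"W_h{g}"] = rng.normal(scale=0.1, size=(D_h, D_h))
    p[f"b_{g}"] = np.zeros(D_h)

x_t = rng.normal(size=D_x)
h_prev = np.zeros(D_h)
z_t, r_t = gru_gates(x_t, h_prev, p)
print(z_t.shape, r_t.shape)  # (5,) (5,)
```

Both gates have one entry per hidden unit, each strictly between 0 and 1, which is what lets them act as soft switches in the element-wise products that follow.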


Once $z_t$ and $r_t$ are computed, we can compute the **hidden state**:


$$\Large h_t = (1-z_t)\odot h_{t-1} + z_t \odot a(W_{xh}^T x_t + W_{hh}^T (r_t \odot h_{t-1}) + b_h)$$

where:

* $\odot$ is element-wise multiplication
* $a$ is an activation function (typically $\tanh$)
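
Putting the gates and the hidden state update together, a single GRU step might be sketched like this. The parameter dictionary, the initialisation scheme, and the helper names `gru_step` and `init_gru_params` are assumptions for illustration. The last lines check one consequence of the formula: when the update gate is pushed towards 0 (here via a large negative $b_z$), the previous hidden state is copied through almost unchanged.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, p, a=np.tanh):
    """One GRU step following the equations above."""
    z_t = sigmoid(p["W_xz"].T @ x_t + p["W_hz"].T @ h_prev + p["b_z"])  # update gate
    r_t = sigmoid(p["W_xr"].T @ x_t + p["W_hr"].T @ h_prev + p["b_r"])  # reset gate
    # Candidate state: the reset gate scales how much of h_prev is used.
    h_cand = a(p["W_xh"].T @ x_t + p["W_hh"].T @ (r_t * h_prev) + p["b_h"])
    # Interpolate between the old state and the candidate.
    return (1.0 - z_t) * h_prev + z_t * h_cand

def init_gru_params(D_x, D_h, rng):
    """Small random weights and zero biases (an arbitrary choice for the demo)."""
    p = {}
    for g in ("z", "r", "h"):
        p[f"W_x{g}"] = rng.normal(scale=0.1, size=(D_x, D_h))
        p[f"W_h{g}"] = rng.normal(scale=0.1, size=(D_h, D_h))
        p[f"b_{g}"] = np.zeros(D_h)
    return p

rng = np.random.default_rng(0)
D_x, D_h = 4, 5
params = init_gru_params(D_x, D_h, rng)
x_t = rng.normal(size=D_x)
h_prev = rng.uniform(-1, 1, size=D_h)

h_t = gru_step(x_t, h_prev, params)

# With z_t close to 0, h_t is approximately h_{t-1}.
closed = dict(params, b_z=np.full(D_h, -50.0))
h_copy = gru_step(x_t, h_prev, closed)
print(np.allclose(h_copy, h_prev))  # True
```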


Finally, we can compute the output(s):

$$\Large \hat{y} = a(W_o^Th_t + b_o)$$

### LSTM

**LSTM** is the most complex of the three recurrent architectures covered here.

It uses several additional concepts, some of them very similar to the gates used in **GRU**.

To build an **LSTM**, we'll need:

* forget gate $f_t$
* input gate (AKA update gate) $i_t$
* output gate $o_t$
* cell state (AKA extra hidden state) $c_t$
* regular hidden state $h_t$

Let's define them:

$$\Large f_t = \sigma(W_{xf}^T x_t + W_{hf}^T h_{t-1} + b_f)$$

$$\Large i_t = \sigma(W_{xi}^T x_t + W_{hi}^T h_{t-1} + b_i)$$

$$\Large o_t = \sigma(W_{xo}^T x_t + W_{ho}^T h_{t-1} + b_o)$$

$$\Large c_t = f_t \odot c_{t-1} + i_t \odot a(W_{xc}^T x_t + W_{hc}^T h_{t-1} + b_c)$$

$$\Large h_t = o_t \odot a(c_t)$$

where:

* $\odot$ is element-wise multiplication
* $a$ is an activation function (typically $\tanh$)
* $\sigma$ is a sigmoid activation function
* $x_t$ is the input vector at time step $t$
* $b$ are respective bias vectors
* $W$ are respective weight matrices
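
The five equations above translate almost line by line into NumPy. The sketch below assumes $\tanh$ for $a$, zero initial states, and a parameter dictionary keyed by gate name; the helper name `lstm_step` and the dimensions are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x_t, h_prev, c_prev, p, a=np.tanh):
    """One LSTM step; returns the new hidden state h_t and cell state c_t."""
    def pre(g):
        # Shared pre-activation pattern: W_xg^T x_t + W_hg^T h_{t-1} + b_g
        return p[f"W_x{g}"].T @ x_t + p[f"W_h{g}"].T @ h_prev + p[f"b_{g}"]

    f_t = sigmoid(pre("f"))                  # forget gate
    i_t = sigmoid(pre("i"))                  # input gate
    o_t = sigmoid(pre("o"))                  # output gate
    c_t = f_t * c_prev + i_t * a(pre("c"))   # cell state
    h_t = o_t * a(c_t)                       # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
D_x, D_h, T = 4, 5, 6  # assumed sizes and sequence length
params = {}
for g in ("f", "i", "o", "c"):
    params[f"W_x{g}"] = rng.normal(scale=0.1, size=(D_x, D_h))
    params[f"W_h{g}"] = rng.normal(scale=0.1, size=(D_h, D_h))
    params[f"b_{g}"] = np.zeros(D_h)

xs = rng.normal(size=(T, D_x))  # toy input sequence
h = np.zeros(D_h)               # assumed initial hidden state
c = np.zeros(D_h)               # assumed initial cell state
for x_t in xs:
    h, c = lstm_step(x_t, h, c, params)

print(h.shape, c.shape)  # (5,) (5,)
```

Note that the cell state $c_t$ is carried between steps alongside $h_t$, but only $h_t$ feeds into the gates and the output layer.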


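We compute the output(s) as usual. To avoid a clash with the output gate bias $b_o$ defined above, the output layer parameters are written $W_y$ and $b_y$ here:

$$\Large \hat{y} = a(W_y^Th_t + b_y)$$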