##  **Recurrent Neural Network (RNN) - Introduction**

**All sources are from:**
- [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [Supervised Sequence Labelling with Recurrent Neural Networks by Alex Graves](http://www.cs.toronto.edu/~graves/preprint.pdf)
- [RECURRENT NEURAL NETWORKS TUTORIAL](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/)

**Covers:**
- Structure of RNN
- Structure of LSTM
- Tips and tricks
- A few more examples
- Coding: Python example (character prediction in RNN by Karpathy), TensorFlow example

**Notation:** the notation in this tutorial is a bit messy, because every figure uses a different notation.

### **Structure of RNN**

Here, Karpathy describes RNNs in terms of their input/output relations:

<img width="800" src="http://karpathy.github.io/assets/rnn/diags.jpeg"/>
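**(1)** Vanilla mode without an RNN (fixed-size input to fixed-size output, e.g. image classification) **(2)** Sequence output (e.g. image captioning) **(3)** Sequence input (e.g. sentiment analysis) **(4)** Sequence input and sequence output (e.g. machine translation) **(5)** Synced sequence input and output (e.g. labelling each frame of a video)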


**Example of character prediction**
<img width="400" src="http://karpathy.github.io/assets/rnn/charseq.jpeg"/>


**Recurrent Neural Networks**

<img width="100" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-rolled.png"/>

- RNNs make use of sequential information
- Another way to think about RNNs is that they have a “memory” which captures information about what has been computed so far (through the hidden state $s_t$)


**Unrolled Recurrent Neural Networks**

<img width="500" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png"/>


Each chunk of neural network $A$ takes an input $x_i$ and produces an output $h_i$.


### **Parameters**

The only parameters are $U, W, V$ (also written $W_1, W_2, W_3$), and they are the same at every time step.
<img width="500" src="unfold_rnn.png"/>

The standard RNN equations are as follows (note that the notation in the figure differs from the notation used here):

<img width="800" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png"/>

$$s_t = \tanh(U x_{t} + W s_{t-1})$$ 
$$\hat{y}_t = \text{softmax}(V s_{t}) $$
$$E(y, \hat{y}) = -\sum_{t}  y_t \log \hat{y}_t$$

where
- $x_t$ is the input at time step $t$
- $s_t$ is the hidden state at time step $t$ (memory)
- $\hat{y}_t$ is the output at step $t$
- $E(y, \hat{y})$ is the cross-entropy loss function, summed over all time steps

**note that** 
- we can re-write equation as $s_t = \tanh([U, W] \cdot [x_t, s_{t-1}])$
- RNN shares the same parameters across all steps
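
The rewritten form can be checked numerically. The snippet below is only a sketch: the sizes and the random values are hypothetical, and $[U, W]$ is read as horizontal stacking of the two matrices while $[x_t, s_{t-1}]$ is read as vector concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4                   # hypothetical toy sizes
U = rng.normal(size=(hidden_size, input_size))
W = rng.normal(size=(hidden_size, hidden_size))
x_t = rng.normal(size=input_size)
s_prev = rng.normal(size=hidden_size)

# Original form: s_t = tanh(U x_t + W s_{t-1})
separate = np.tanh(U.dot(x_t) + W.dot(s_prev))

# Stacked form: s_t = tanh([U, W] . [x_t, s_{t-1}])
UW = np.hstack([U, W])                           # shape (hidden, input + hidden)
stacked = np.tanh(UW.dot(np.concatenate([x_t, s_prev])))

print(np.allclose(separate, stacked))
```

Both forms compute the same hidden state; the stacked form simply uses one larger matrix multiplication.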

**Pseudo-code for RNN - Forward propagation**

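```python
import numpy as np

class RNN:
    def __init__(self, U, W, V):  # assumed constructor: weights are passed in
        self.U, self.W, self.V, self.s = U, W, V, np.zeros(W.shape[0])

    def step(self, x):
        self.s = np.tanh(np.dot(self.W, self.s) + np.dot(self.U, x))  # s_t
        y_hat = np.dot(self.V, self.s)  # raw scores (softmax not applied here)
        return y_hat
```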
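
As a slightly fuller sketch, the three equations above (hidden state, softmax output, cross-entropy loss) can be run over a whole sequence. Everything concrete here is an assumption made for illustration: the toy sizes, the random initial weights, the helper names `softmax`, `rnn_forward` and `cross_entropy`, and the encoding of the "hell" to "ello" character example as indices into the vocabulary `h, e, l, o`.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(U, W, V, xs, s0):
    """Run the vanilla RNN over a sequence of input indices (one-hot encoded)."""
    s, states, outputs = s0, [], []
    for x_idx in xs:
        x = np.zeros(U.shape[1])
        x[x_idx] = 1.0                      # one-hot encoding of x_t
        s = np.tanh(U.dot(x) + W.dot(s))    # s_t = tanh(U x_t + W s_{t-1})
        states.append(s)
        outputs.append(softmax(V.dot(s)))   # y_hat_t = softmax(V s_t)
    return np.array(states), np.array(outputs)

def cross_entropy(outputs, targets):
    # E = -sum_t log y_hat_t[target_t], which equals -sum_t y_t log y_hat_t for one-hot y_t
    return -sum(np.log(outputs[t][y]) for t, y in enumerate(targets))

rng = np.random.default_rng(0)
vocab_size, hidden_size = 4, 3              # hypothetical toy sizes
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(vocab_size, hidden_size))

xs, ys = [0, 1, 2, 2], [1, 2, 2, 3]         # "hell" -> "ello" with h=0, e=1, l=2, o=3
states, outputs = rnn_forward(U, W, V, xs, np.zeros(hidden_size))
loss = cross_entropy(outputs, ys)
print(outputs.shape, loss)
```

Note that the same `U`, `W` and `V` are reused at every step of the loop, which is what parameter sharing means in practice. With small random weights the predictions are close to uniform, so the loss should be near $4 \log 4$.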

**Different types of activation functions**

Last lecture, we only covered the sigmoid. Here is the hyperbolic tangent ($\tanh$), which is also widely used; it is a rescaled and shifted sigmoid.

$$\tanh(x) = \cfrac{e^{2x} - 1}{ e^{2x} + 1}$$
$$\sigma(x) = \cfrac{1}{1 + e^{-x}} $$
$$\tanh(x) = 2 \sigma(2x) - 1 $$

**Back propagation**

- Backpropagation works as in a traditional neural network. However, since the parameters are shared across time steps, the gradient depends not only on the current time step but also on the **previous time steps**
- We need to backpropagate through time (e.g. for $t=4$ we have to backpropagate 3 steps) and sum up the gradients
- **Note:** RNNs trained with BPTT have difficulty learning long-term dependencies (the vanishing gradient problem); this can be mitigated by certain types of RNN such as LSTMs
- We'll come back to this later; for now we only do forward propagation
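
To make the vanishing-gradient remark a little more concrete, here is a small sketch that only uses forward propagation. It accumulates the Jacobian $\partial s_t / \partial s_0$ as a product of per-step Jacobians $\text{diag}(1 - s_t^2)\, W$ and records its norm. The sizes, the weight scale of 0.1 and the random inputs are assumptions chosen for illustration; with larger weights the norm can grow instead (exploding gradients).

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 10, 30                      # hypothetical toy sizes
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
U = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # input size = hidden size here

s = np.zeros(hidden_size)
J = np.eye(hidden_size)                          # accumulates d s_t / d s_0
norms = []
for t in range(steps):
    x = rng.normal(size=hidden_size)
    s = np.tanh(U.dot(x) + W.dot(s))
    # Jacobian of one step: d s_t / d s_{t-1} = diag(1 - s_t^2) W
    J = np.diag(1.0 - s ** 2).dot(W).dot(J)
    norms.append(np.linalg.norm(J, 2))           # spectral norm

print(norms[0], norms[-1])
```

In this setting the norm shrinks quickly as the number of steps grows, so early time steps contribute almost nothing to the gradient.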


Checking the identity $\tanh(x) = 2 \sigma(2x) - 1$ numerically:

```python
import numpy as np

def sigmoid(x):
    return 1/(1 + np.exp(-x))

def sigmoid_grad(z):
    g = sigmoid(z)*(1 - sigmoid(z))
    return g

def tanh(x):
    return np.tanh(x)

def tanh_grad(z):
    g = 1 - (np.tanh(z)**2)
    return g

x = 1.0
print('If x = %s, 2*sigmoid(2x)-1 and tanh(x) are %.12f and %.12f respectively' % (x, 2*sigmoid(2*x)-1, tanh(x)))
```

```
If x = 1.0, 2*sigmoid(2x)-1 and tanh(x) are 0.761594155956 and 0.761594155956 respectively
```


## **RNN Extensions**


- **Bidirectional RNNs:** the output at time $t$ depends on both the previous and the future elements of the sequence (e.g. predicting a missing word)
<img width="400" src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/09/bidirectional-rnn.png"/>

- **Deep (Bidirectional) RNNs**
<img width="400" src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/09/Screen-Shot-2015-09-16-at-2.21.51-PM.png"/>

- **LSTM network** LSTMs don’t have a fundamentally different architecture from RNNs, but they use a different function to compute the hidden state
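
A bidirectional RNN can be sketched as two independent vanilla RNNs, one reading the sequence left to right and one reading it right to left, whose hidden states are combined at each step. The code below is only an illustration under assumptions: the helper `run_rnn`, the toy sizes, the random weights, and the choice to combine the two states by concatenation are not taken from the sources.

```python
import numpy as np

def run_rnn(U, W, xs):
    """Return the hidden state at every step of a vanilla RNN."""
    s, states = np.zeros(W.shape[0]), []
    for x in xs:
        s = np.tanh(U.dot(x) + W.dot(s))
        states.append(s)
    return states

rng = np.random.default_rng(0)
input_size, hidden_size, output_size, T = 3, 4, 2, 5   # hypothetical toy sizes
xs = [rng.normal(size=input_size) for _ in range(T)]

# Separate parameters for the forward and the backward direction.
U_f = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_f = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
U_b = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_b = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.5, size=(output_size, 2 * hidden_size))

forward = run_rnn(U_f, W_f, xs)                  # reads x_0 ... x_{T-1}
backward = run_rnn(U_b, W_b, xs[::-1])[::-1]     # reads x_{T-1} ... x_0, then re-aligned

# The output at step t sees both past (forward) and future (backward) context.
outputs = [V.dot(np.concatenate([f, b])) for f, b in zip(forward, backward)]
print(len(outputs), outputs[0].shape)
```

Because `backward[0]` has already read the whole sequence, changing the last input changes the output at the first step, which a unidirectional RNN cannot do.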

## **Back-propagation through time (BPTT)**

Recall the forward propagation equations from before:

$$s_t = \tanh(U x_{t} + W s_{t-1})$$ 
$$\hat{y}_t = \text{softmax}(V s_{t}) $$

and the cross-entropy loss,
$$E_t(y_t, \hat{y}_t) = - y_t \log \hat{y}_t$$
$$E = \sum_{t} E_t(y_t, \hat{y}_t) = -\sum_{t}  y_t \log \hat{y}_t$$

where $y_t$ is the correct output and $\hat{y}_t$ is our prediction.


<img width="400" src="http://www.wildml.com/wp-content/uploads/2015/10/rnn-bptt1.png"/>


**Goal:** calculate the gradients of the error with respect to our parameters $U, V, W$

We'll use $E_3$ as an example (example from blog post)

$$\cfrac{\partial E_3}{\partial V} = \cfrac{\partial E_3}{\partial \hat{y}_3} \cfrac{\partial \hat{y}_3}{\partial V} = \cfrac{\partial E_3}{\partial \hat{y}_3} \cfrac{\partial \hat{y}_3}{\partial z_3} \cfrac{\partial z_3}{\partial V} = (\hat{y}_3 - y_3) \otimes s_3 $$

where $z_3 = V s_3$ and $\otimes$ is the outer product of two vectors. We can see that $\cfrac{\partial E_3}{\partial V}$ only depends on the values at the current time step.
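
The closed form for $\cfrac{\partial E_3}{\partial V}$ can be sanity-checked against finite differences. This is a minimal sketch under assumptions: the sizes, the random `V`, the stand-in hidden state `s3` and the target index are all made up for the check, and the gradient is written as an outer product so that it has the same shape as $V$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss_at_step(V, s, y_idx):
    # E = -log y_hat[y_idx] with y_hat = softmax(V s)
    return -np.log(softmax(V.dot(s))[y_idx])

rng = np.random.default_rng(0)
hidden_size, vocab_size = 3, 4                 # hypothetical toy sizes
V = rng.normal(size=(vocab_size, hidden_size))
s3 = np.tanh(rng.normal(size=hidden_size))     # stand-in for the hidden state s_3
y_idx = 2                                      # index of the correct class
y3 = np.eye(vocab_size)[y_idx]                 # one-hot target y_3

# Analytic gradient: (y_hat_3 - y_3) outer s_3
y_hat3 = softmax(V.dot(s3))
dV = np.outer(y_hat3 - y3, s3)

# Central finite differences, one entry of V at a time
eps, dV_num = 1e-5, np.zeros_like(V)
for i in range(vocab_size):
    for j in range(hidden_size):
        Vp, Vm = V.copy(), V.copy()
        Vp[i, j] += eps
        Vm[i, j] -= eps
        dV_num[i, j] = (loss_at_step(Vp, s3, y_idx) - loss_at_step(Vm, s3, y_idx)) / (2 * eps)

print(np.max(np.abs(dV - dV_num)))
```

The printed maximum difference should be tiny, which supports the formula; only `s3`, `y_hat3` and `y3` from the current step are needed.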


However, for $\cfrac{\partial E_3}{\partial W}$, the story is different:

$$\cfrac{\partial E_3}{\partial W} = \cfrac{\partial E_3}{\partial \hat{y}_3} \cfrac{\partial \hat{y}_3}{\partial s_3} \cfrac{\partial s_3}{\partial W} $$

Now, $s_3 = \tanh(U x_3 + W s_2)$ which depends on $s_2$ (and yes, $s_1$ also)

$$\cfrac{\partial E_3}{\partial W} = \sum_{k=0}^3 \cfrac{\partial E_3}{\partial \hat{y}_3} \cfrac{\partial \hat{y}_3}{\partial s_3} \cfrac{\partial s_3}{\partial s_k} \cfrac{\partial s_k}{\partial W}  $$


<img width="400" src="http://d3kbpzbmcynnmx.cloudfront.net/wp-content/uploads/2015/10/rnn-bptt-with-gradients.png"/>
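
The sum over $k$ for $\cfrac{\partial E_3}{\partial W}$ can be written as a short backward loop and compared with finite differences. This is a sketch under assumptions rather than the tutorial's implementation: the sizes, the random parameters and inputs, the zero initial state, and the helper names `forward` and `loss_E3` are all hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(U, W, xs):
    """Hidden states s_0 .. s_{T-1}, starting from a zero state."""
    s_prev, states = np.zeros(W.shape[0]), []
    for x in xs:
        s_prev = np.tanh(U.dot(x) + W.dot(s_prev))
        states.append(s_prev)
    return states

def loss_E3(U, W, V, xs, y_idx):
    s3 = forward(U, W, xs)[3]
    return -np.log(softmax(V.dot(s3))[y_idx])

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 5                    # hypothetical toy sizes
U = rng.normal(scale=0.5, size=(n_hid, n_in))
W = rng.normal(scale=0.5, size=(n_hid, n_hid))
V = rng.normal(scale=0.5, size=(n_out, n_hid))
xs = [rng.normal(size=n_in) for _ in range(4)]  # x_0 .. x_3
y_idx = 1                                       # correct class at t = 3

states = forward(U, W, xs)
y_hat3 = softmax(V.dot(states[3]))
dz = y_hat3 - np.eye(n_out)[y_idx]              # dE_3 / d(V s_3)

# delta = dE_3 / d(pre-activation of s_k); start at k = 3 and walk back to k = 0
delta = V.T.dot(dz) * (1.0 - states[3] ** 2)
dW = np.zeros_like(W)
for k in range(3, -1, -1):
    s_before = states[k - 1] if k > 0 else np.zeros(n_hid)
    dW += np.outer(delta, s_before)             # contribution of step k to the sum
    delta = W.T.dot(delta) * (1.0 - s_before ** 2)

# Central finite differences on every entry of W
eps, dW_num = 1e-5, np.zeros_like(W)
for i in range(n_hid):
    for j in range(n_hid):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW_num[i, j] = (loss_E3(U, Wp, V, xs, y_idx) - loss_E3(U, Wm, V, xs, y_idx)) / (2 * eps)

print(np.max(np.abs(dW - dW_num)))
```

Each pass through the backward loop adds one term of the sum, and the repeated multiplication by `W.T` and `(1 - s**2)` is where gradients can vanish or explode over long sequences.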


## **Long short-term memory networks (LSTM)**

LSTMs are a special kind of RNN, capable of learning long-term dependencies.


<img width="800" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png"/>

**Forget gate layer** decides what information we'll throw away from the cell state
<img width="800" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-f.png"/>

**Input gate layer** decides which values we’ll update 
<img width="800" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-i.png"/>

**The new candidate values** that could be added to the cell state
<img width="800" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-C.png"/>

**Output gate** decides what we’re going to output
<img width="800" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-o.png"/>
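
The four pieces above can be put together into a single LSTM step. The sketch below follows the equations in the *Understanding LSTM Networks* post that the figures come from; the function name `lstm_step`, the parameter dictionary with keys such as `W_f` and `b_f`, the toy sizes and the random initial values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, C_prev, params):
    """One LSTM step: returns the new hidden state h_t and cell state C_t."""
    concat = np.concatenate([h_prev, x])                          # [h_{t-1}, x_t]
    f = sigmoid(params["W_f"].dot(concat) + params["b_f"])        # forget gate
    i = sigmoid(params["W_i"].dot(concat) + params["b_i"])        # input gate
    C_tilde = np.tanh(params["W_C"].dot(concat) + params["b_C"])  # candidate values
    C = f * C_prev + i * C_tilde                                  # forget old, add new
    o = sigmoid(params["W_o"].dot(concat) + params["b_o"])        # output gate
    h = o * np.tanh(C)                                            # filtered cell state
    return h, C

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4                                    # hypothetical toy sizes
params = {}
for gate in ("f", "i", "C", "o"):
    params["W_" + gate] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params["b_" + gate] = np.zeros(hidden_size)

h, C = np.zeros(hidden_size), np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):                        # a sequence of 5 inputs
    h, C = lstm_step(x, h, C, params)
print(h, C)
```

Compared with the vanilla RNN step, the cell state `C` is updated additively (`f * C_prev + i * C_tilde`), which is what lets information survive over many steps.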


## **Additional tricks for RNN training**

**Tips:** some tips and tricks from Karpathy for training RNNs

- **RMSProp/Adam/Adagrad:** prefer an adaptive optimizer (vanilla SGD is much more sensitive to its settings, e.g. the learning rate)
- **Clip gradients:** 5.0 is a common value to use, suggested by Mikolov. This prevents the exploding gradient problem (see also the vanishing gradient problem)
- **Initialize forget gates** with a high bias to encourage remembering at the start of training
- **Regularization:** L2 regularization is not very common. Dropout works well along depth, *NOT* along time
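
Two of these tips are easy to sketch in NumPy. The notes do not say whether the value 5.0 applies to each gradient entry or to the overall gradient norm, so both variants are shown; the function names, the example gradients and the forget-gate bias value of 1.0 are assumptions for illustration.

```python
import numpy as np

def clip_elementwise(grads, limit=5.0):
    """Clamp every gradient entry to [-limit, limit]."""
    return [np.clip(g, -limit, limit) for g in grads]

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

# Made-up gradients with a few very large entries
grads = [np.array([[10.0, -0.5], [3.0, -20.0]]), np.array([0.1, 7.0])]
clipped = clip_elementwise(grads)
rescaled = clip_by_global_norm(grads)

# Forget-gate bias initialised to a positive value, so the gate starts mostly open
hidden_size = 4                                  # hypothetical size
b_f = np.ones(hidden_size)                       # 1.0 is an assumed value, not from the source

print(clipped[0], clipped[1])
```

Elementwise clipping changes the direction of the gradient, while rescaling by the global norm keeps the direction and only shortens the step.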

<img width="400" src="vanishing.png"/>

### **More examples**

**Image captioning**

<img width="400" src="image_captioning.png"/>


The RNN is conditioned on the image information at the first time step. START and END are special tokens. [See full paper here](http://cs.stanford.edu/people/karpathy/cvpr2015.pdf)

## **Coding session**

We will follow the example from https://github.com/dennybritz/rnn-tutorial-rnnlm, which is described in a blog post [here](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-2-implementing-a-language-model-rnn-with-python-numpy-and-theano/).