##  Recurrent Neural Network (RNN)

**All sources are from:**
- [Understanding LSTM Networks](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)
- [Supervised Sequence Labelling with Recurrent Neural Networks by Alex Graves](http://www.cs.toronto.edu/~graves/preprint.pdf)

**Covers:**
- Structure of RNN
- Structure of LSTM
- Tips and tricks
- A few more examples
- Coding: Python example (character prediction in RNN by Karpathy), TensorFlow example

### Structure of RNN

Here, Karpathy describes RNNs in terms of their input/output relations:

<img width="800" src="http://karpathy.github.io/assets/rnn/diags.jpeg"/>
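**(1)** Vanilla mode (fixed-size input to fixed-size output, e.g. image classification) **(2)** Sequence output (e.g. image captioning) **(3)** Sequence input (e.g. sentiment analysis) **(4)** Sequence input and sequence output (e.g. machine translation) **(5)** Synced sequence input and output (e.g. labelling each frame of a video)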


**Example of character prediction**
<img width="400" src="http://karpathy.github.io/assets/rnn/charseq.jpeg"/>
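The setup in the figure can be sketched in a few lines, assuming the four-letter vocabulary `h, e, l, o` and the training string "hello" used in Karpathy's post. Each input character is one-hot encoded, and the target at every step is simply the next character. The helper name `one_hot` is made up for this illustration.

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']                      # vocabulary from the figure
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    """Return a 1-of-k vector for a single character (illustrative helper)."""
    v = np.zeros(len(vocab))
    v[char_to_ix[ch]] = 1.0
    return v

text = 'hello'
inputs = [one_hot(ch) for ch in text[:-1]]        # 'h', 'e', 'l', 'l'
targets = [char_to_ix[ch] for ch in text[1:]]     # 'e', 'l', 'l', 'o'

for x, t in zip(inputs, targets):
    print(x, '->', vocab[t])
```

Note that the two `l` inputs have different targets (`l` and then `o`), so the network cannot rely on the current character alone and has to use its hidden state.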


**Recurrent Neural Networks**

<img width="100" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-rolled.png"/>

- RNNs make use of sequential information
- Another way to think about RNNs is that they have a “memory” (the hidden state $h_t$) which captures information about what has been seen so far


**Unrolled Recurrent Neural Networks**

<img width="500" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-unrolled.png"/>


At each time step $t$, a chunk of neural network $A$ takes the input $x_t$ and produces the output $h_t$, which is also passed on to the next step.


### Parameters

The only parameters are $W^{(hh)}$, $W^{(hx)}$ and $W^{(s)}$, and the same matrices are shared across all time steps.
<img width="500" src="unfold_rnn.png"/>

Standard RNN equations are as follows:

<img width="800" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png"/>

$$h_t = \sigma(W^{(hh)} h_{t-1} + W^{(hx)} x_{t})$$ 
$$\hat{y}_t = \text{softmax}(W^{(s)} h_{t}) $$
$$J = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t, j}$$
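The three equations above can be sketched directly in NumPy. This is a minimal illustration, not a training loop: the sizes, the random weights, the zero initial state $h_0 = 0$ and the function name `rnn_forward_loss` are all assumptions made for the example.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward_loss(xs, ys, W_hh, W_hx, W_s):
    """xs: list of input vectors, ys: list of target class indices."""
    h = np.zeros(W_hh.shape[0])              # assumption: h_0 = 0
    J = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(W_hh @ h + W_hx @ x)     # hidden state update
        y_hat = softmax(W_s @ h)             # distribution over the vocabulary
        J -= np.log(y_hat[y])                # one-hot y_t keeps a single term of the inner sum
    return J / len(xs)                       # average over the T time steps

rng = np.random.default_rng(0)
V, H, T = 4, 3, 5                            # illustrative sizes
W_hh = rng.normal(scale=0.1, size=(H, H))
W_hx = rng.normal(scale=0.1, size=(H, V))
W_s = rng.normal(scale=0.1, size=(V, H))
xs = [np.eye(V)[i] for i in rng.integers(V, size=T)]
ys = list(rng.integers(V, size=T))
loss = rnn_forward_loss(xs, ys, W_hh, W_hx, W_s)
print(loss)
```

With small random weights the predicted distribution is close to uniform, so the loss should land near $\log |V|$.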


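Note that the first equation can also be written as $h_t = \sigma(W \cdot [h_{t-1}, x_t])$, where $W = [W^{(hh)}, W^{(hx)}]$ places the two matrices side by side and $[h_{t-1}, x_t]$ is the concatenation of the two vectors.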
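A quick numerical check of the stacked form, using random matrices of illustrative sizes (the variable names are chosen for this example only):

```python
import numpy as np

rng = np.random.default_rng(0)
H, D = 3, 4                                  # illustrative hidden and input sizes
W_hh = rng.normal(size=(H, H))
W_hx = rng.normal(size=(H, D))
h_prev = rng.normal(size=H)
x_t = rng.normal(size=D)

W = np.hstack([W_hh, W_hx])                  # shape (H, H + D)
hx = np.concatenate([h_prev, x_t])           # shape (H + D,)

separate = W_hh @ h_prev + W_hx @ x_t        # two matrix-vector products
stacked = W @ hx                             # one product with the stacked matrix
print(np.allclose(separate, stacked))        # True
```

The stacked form is the one used in the LSTM diagrams, where every gate has a single weight matrix acting on $[h_{t-1}, x_t]$.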

**Pseudo-code for RNN**

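```python
import numpy as np

class RNN:
    def __init__(self, W_hh, W_xh, W_hy):
        # Assumed constructor: weights are passed in, hidden state starts at zero
        self.W_hh, self.W_xh, self.W_hy = W_hh, W_xh, W_hy
        self.h = np.zeros(W_hh.shape[0])

    def step(self, x):
        # New hidden state from the previous state and the current input
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        y = np.dot(self.W_hy, self.h)  # unnormalized output scores
        return y
```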

**Forward propagation**

The last lecture covered only the sigmoid. The hyperbolic tangent (tanh) is another widely used activation function; as the last identity below shows, it is a rescaled and shifted sigmoid.

$$\tanh(x) = \cfrac{e^{2x} - 1}{ e^{2x} + 1}$$
$$\sigma(x) = \cfrac{1}{1 + e^{-x}} $$
$$\tanh(x) = 2 \sigma(2x) - 1 $$

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(z):
    # Derivative of the sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

def tanh(x):
    return np.tanh(x)

def tanh_grad(z):
    # Derivative of tanh: 1 - tanh(z)^2
    return 1 - np.tanh(z) ** 2

x = 1.0
# Check the identity tanh(x) = 2 * sigmoid(2x) - 1 numerically
print('If x = %s, 2*sigmoid(2x) - 1 and tanh(x) are %.12f, %.12f respectively' % (x, 2 * sigmoid(2 * x) - 1, tanh(x)))
```

```
If x = 1.0, 2*sigmoid(2x) - 1 and tanh(x) are 0.761594155956, 0.761594155956 respectively
```
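The derivative formulas behind `sigmoid_grad` and `tanh_grad` can be sanity-checked against a central finite difference. The block below redefines the functions so it runs on its own; `numerical_grad` and the step size `eps` are choices made for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def numerical_grad(f, z, eps=1e-5):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

z = np.linspace(-3.0, 3.0, 7)
sigmoid_ok = np.allclose(sigmoid_grad(z), numerical_grad(sigmoid, z))
tanh_ok = np.allclose(tanh_grad(z), numerical_grad(np.tanh, z))
print(sigmoid_ok, tanh_ok)
```

The same kind of check is useful later for the full backward pass of an RNN, where analytic gradients are easy to get subtly wrong.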


### Long short-term memory networks (LSTM)

LSTMs are a special kind of RNN, capable of learning long-term dependencies.


<img width="800" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png"/>

**Forget gate layer** decides what information to throw away from the cell state
<img width="800" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-f.png"/>

**Input gate layer** decides which values we’ll update 
<img width="800" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-i.png"/>

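**The new candidate values** are scaled by the input gate and added to the old cell state, which is first scaled by the forget gate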
<img width="800" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-C.png"/>

**Output gate** decides what we’re going to output
<img width="800" src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-o.png"/>
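The four figures above can be put together into a single time step. The sketch below follows the equations in "Understanding LSTM Networks", where each gate has one weight matrix acting on the concatenation $[h_{t-1}, x_t]$. The function name `lstm_step`, the `params` dictionary layout and all sizes are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step; params maps names to weight matrices and biases."""
    hx = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params['W_f'] @ hx + params['b_f'])       # forget gate
    i_t = sigmoid(params['W_i'] @ hx + params['b_i'])       # input gate
    C_tilde = np.tanh(params['W_C'] @ hx + params['b_C'])   # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                      # new cell state
    o_t = sigmoid(params['W_o'] @ hx + params['b_o'])       # output gate
    h_t = o_t * np.tanh(C_t)                                # new hidden state
    return h_t, C_t

rng = np.random.default_rng(0)
H, D = 3, 4                                                 # illustrative sizes
params = {}
for gate in 'fiCo':
    params['W_' + gate] = rng.normal(scale=0.1, size=(H, H + D))
    params['b_' + gate] = np.zeros(H)

h, C = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):                         # a sequence of 5 inputs
    h, C = lstm_step(x_t, h, C, params)
print(h.shape, C.shape)
```

The cell state `C_t` is only ever rescaled and added to, never squashed through a weight matrix, which is what lets information and gradients travel across many time steps.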


### Additional tricks for RNN training

**Tips:** Here are some tips and tricks from Karpathy for training RNNs.

- **RMSProp/Adam/Adagrad:** prefer an adaptive optimizer (plain SGD is more sensitive to hyperparameters such as the learning rate)
- **Clip gradients:** 5.0 is a common threshold, suggested by Mikolov. Clipping prevents the exploding gradient problem (see also the vanishing gradient problem)
- **Initialize forget gates** with a high bias to encourage remembering at the start of training
- **Regularization:** L2 regularization is not very common. Dropout is good along depth (between layers), *NOT* along time
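As a rough sketch of the gradient clipping tip, here are two common variants with the threshold 5.0 from the list above. Which variant a given codebase uses is not stated in these notes, so both are shown; the function names are made up for this example.

```python
import numpy as np

def clip_elementwise(grads, limit=5.0):
    """Clamp every gradient entry to [-limit, limit]."""
    return [np.clip(g, -limit, limit) for g in grads]

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients together if their joint L2 norm exceeds max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

# Toy gradients for two parameter arrays (values chosen for illustration)
grads = [np.array([[12.0, -0.5], [3.0, -9.0]]), np.array([0.1, 20.0])]
clipped_elem = clip_elementwise(grads)
clipped_norm = clip_by_global_norm(grads)
print(clipped_elem[0])
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped_norm)))
```

Elementwise clipping can change the direction of the update, while norm clipping keeps the direction and only shortens the step.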

<img width="400" src="vanishing.png"/>
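The vanishing gradient effect can be reproduced in a small experiment. The sketch below backpropagates a gradient through a tanh RNN whose recurrent matrix has been rescaled to spectral norm 0.5; the sizes, the omitted inputs and the starting gradient of ones are assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 10, 30                                    # illustrative sizes
W_hh = rng.normal(size=(H, H))
W_hh *= 0.5 / np.linalg.norm(W_hh, 2)            # rescale to spectral norm 0.5

# Forward pass, keeping the hidden states (inputs omitted for brevity)
hs = [rng.normal(size=H)]
for _ in range(T):
    hs.append(np.tanh(W_hh @ hs[-1]))

# Backward pass: propagate a gradient from the last step back to the first
grad = np.ones(H)
norms = []
for h in reversed(hs[1:]):
    grad = W_hh.T @ (grad * (1.0 - h ** 2))      # chain rule through tanh, then W_hh
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])
```

Each backward step multiplies the gradient by a factor of norm at most 0.5, so after 30 steps almost nothing is left. With a recurrent matrix whose largest singular value is well above 1 the same product can grow instead, which is the exploding case that gradient clipping guards against.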

### More examples

**Image captioning**

<img width="400" src="image_captioning.png"/>


The RNN is conditioned on the image information at the first time step. START and END are special tokens. [See full paper here](http://cs.stanford.edu/people/karpathy/cvpr2015.pdf)
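One way to read "conditioned on the image information at the first time step" is that a projection of the image features is added to the hidden pre-activation only at the first step. The sketch below illustrates that reading; the function name `caption_rnn_step`, the matrix `W_hi`, the tanh nonlinearity and the random stand-ins for CNN features and word embeddings are all assumptions, not details taken from the paper.

```python
import numpy as np

def caption_rnn_step(x_t, h_prev, t, image_feat, W_hx, W_hh, W_hi, b_h):
    """One step of an RNN that sees the image only at the first time step."""
    a = W_hx @ x_t + W_hh @ h_prev + b_h
    if t == 0:
        a = a + W_hi @ image_feat            # image information enters once
    return np.tanh(a)

rng = np.random.default_rng(0)
D, H, F = 5, 4, 6                            # illustrative sizes
W_hx = rng.normal(scale=0.1, size=(H, D))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_hi = rng.normal(scale=0.1, size=(H, F))
b_h = np.zeros(H)

image_feat = rng.normal(size=F)              # stands in for CNN features
words = rng.normal(size=(3, D))              # stands in for START and word embeddings

h = np.zeros(H)
for t, x_t in enumerate(words):
    h = caption_rnn_step(x_t, h, t, image_feat, W_hx, W_hh, W_hi, b_h)
print(h)
```

After the first step the image influences later words only through the hidden state, so the recurrent connection has to carry what the network needs to remember about the picture.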