# Recurrent Neural Network on Tensorflow
[Recurrent Neural Network](https://en.wikipedia.org/wiki/Recurrent_neural_network) (RNN) is one of the architectures in the Neural Network. I see RNN as the 'Time Series' version of Neural Network, since the input is sequential data. Actually, I've been working with RNN in many projects before, but I've never created it from scratch before. This is the right time to create one. I will create the network based on this [paper](https://arxiv.org/pdf/1506.00019.pdf).

In [1]:
import numpy as np
import tensorflow as tf

## Preparing the Data
I prepare simple data for my the experiment. The data is constructed using this rule: $x_t = 1.1 \times x_{t-1} + 0.3$. I generate 1000 sample sequences and use 80:20 rule for train and test. I will train the RNN to detect this rule.

In [2]:
np.random.seed(1)
SEQ_LENGTH = 10
random_seq = np.random.normal(size=[1000, SEQ_LENGTH])
for i in range(1, SEQ_LENGTH):
    random_seq[:, i] = random_seq[:, i-1]*1.1 + 0.3
random_seq = random_seq.reshape(-1, SEQ_LENGTH, 1)
train_seq = random_seq[:800,:,:]
test_seq = random_seq[800:,:,:]

## Basic RNN
Let's just start by creating 'vanilla' RNN. RNN is essentially similar to the Neural Network with 2 layers. Just like below image.
<img src='NN.png'>
However, since it deals with sequences, somehow it needs a mechanism to consider the data in the previous time step. Mathematically, we write $$\textbf{h}^{(t)}=\sigma(W^{hx}\textbf{x}^{(t)}+W^{hh}\textbf{h}^{(t-1)}+\textbf{b}_h).$$
where $\textbf{h}^{(t)}$ and $\textbf{x}^{(t)}$ are the hidden and the input nodes at time $t$, $W^{**}$ is the weight from $*$ and to $*$, and $\sigma$ is an activation function. From the equation, it's clear that hidden state is affected by hidden state at previous time step. At first, we can just set $\textbf{h}_0=\textbf{0}$.

Let's put this knowledge into code! I use 4 nodes in the hidden layer and 1 node in the output layer (since the output is scalar).

In [3]:
tf.reset_default_graph()
tf.set_random_seed(1)

HIDDEN_DIM = 4
LEARNING_RATE = 0.1

I create the input placeholder and variables below.

In [4]:
with tf.variable_scope('input'):
    seq_input = tf.placeholder(dtype=tf.float32,
                        shape=[None,SEQ_LENGTH,1], name='seq_input')

In [5]:
with tf.variable_scope('weights'):
    W_hx = tf.get_variable(name='W_hx', shape=[1,HIDDEN_DIM],
                           dtype=tf.float32)
    W_hh = tf.get_variable(name='W_hh', shape=[HIDDEN_DIM,HIDDEN_DIM],
                           dtype=tf.float32)
    b_h = tf.get_variable(name='b_h', shape=[HIDDEN_DIM],
                          dtype=tf.float32)

    W_yh = tf.get_variable(name='W_yh', shape=[HIDDEN_DIM,1],
                           dtype=tf.float32)
    b_y = tf.get_variable(name='b_y', shape=[1], dtype=tf.float32)

As stated above, initially the hidden nodes have value 0. To get all the hidden values at every time step, the hidden values are appended in the list and stacked together.

In [6]:
with tf.variable_scope('rnn'):
    hidden_0 = tf.zeros(shape=[tf.shape(seq_input)[0], HIDDEN_DIM],
                      dtype=tf.float32, name='hidden_0')
    hidden = hidden_0
    hiddens = list()
    for j in range(SEQ_LENGTH-1):
        hidden = tf.matmul(seq_input[:,j,:], W_hx) + \
                 tf.nn.xw_plus_b(hidden, W_hh, b_h)
        hidden = tf.sigmoid(hidden)
        hiddens.append(hidden)
    hiddens = tf.stack(hiddens, axis=1, name='hiddens')

I prepare output tensor.

In [7]:
with tf.variable_scope('output'):
    outputs = tf.reshape(hiddens, [-1,HIDDEN_DIM])
    outputs = tf.nn.xw_plus_b(outputs, W_yh, b_y)
    outputs = tf.reshape(outputs, [-1,SEQ_LENGTH-1,1], name='outputs')

I simply use mean squared error as the loss function.

In [8]:
with tf.variable_scope('learning'):
    predict = tf.identity(outputs[:,-1,:], 'predict')
    loss = tf.losses.mean_squared_error(labels=seq_input[:,-1,:],
                                        predictions=predict)
    train_step = tf.train.AdamOptimizer(LEARNING_RATE).minimize(loss)

Then, I train the network.

In [9]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())

In [10]:
for step in range(1,101):
    _ = sess.run(train_step, {seq_input: train_seq})
    curr_loss = sess.run(loss, {seq_input: train_seq})
    print 'step:', step, 'loss:', curr_loss

step: 1 loss: 24.8303
step: 2 loss: 21.7922
step: 3 loss: 19.3631
step: 4 loss: 17.4194
step: 5 loss: 15.8064
step: 6 loss: 14.4164
step: 7 loss: 13.188
step: 8 loss: 12.0878
step: 9 loss: 11.0972
step: 10 loss: 10.2044
step: 11 loss: 9.40108
step: 12 loss: 8.67933
step: 13 loss: 8.0307
step: 14 loss: 7.44491
step: 15 loss: 6.90991
step: 16 loss: 6.41455
step: 17 loss: 5.95542
step: 18 loss: 5.54593
step: 19 loss: 5.21866
step: 20 loss: 5.00694
step: 21 loss: 4.90687
step: 22 loss: 4.87362
step: 23 loss: 4.86081
step: 24 loss: 4.83629
step: 25 loss: 4.77895
step: 26 loss: 4.67703
step: 27 loss: 4.52724
step: 28 loss: 4.33285
step: 29 loss: 4.10137
step: 30 loss: 3.84233
step: 31 loss: 3.56532
step: 32 loss: 3.27842
step: 33 loss: 2.9873
step: 34 loss: 2.69531
step: 35 loss: 2.40529
step: 36 loss: 2.12352
step: 37 loss: 1.86426
step: 38 loss: 1.65704
step: 39 loss: 1.56637
step: 40 loss: 1.65935
step: 41 loss: 1.74606
step: 42 loss: 1.63108
step: 43 loss: 1.39447
step: 44 loss: 1.17073


Well, not so bad result in the test data.

In [11]:
prediction = sess.run(predict, {seq_input: test_seq})
test_loss = sess.run(loss, {seq_input: test_seq})
print 'test loss:', test_loss

test loss: 0.206641


## Long Short Term Memory.
Next, I create LSTM, which is an extension of the basic RNN. Let's see the illustration.
<img src='LSTM.png'>
When you search LSTM in Google, most likely you will encounter this kind of image. It might be daunting at first, but don't worry. I'll explain. First, it has a kind of 'memory', called **state**. Next, the state is also accompanied with 3 'gates' that govern how the information will be written to the state and the hidden node itself. The first two gates are called the **input** ($i$) and **forget** ($f$) gate. The input gate adjusts how much of new information should be written in the state, whereas the forget gate adjusts how much of old information should be forgotten. These two gates are working together to modify the state value. The last gate is called **output** ($o$) gate. It governs how much of the state value should be output to the hidden node. Since all of the gates have sigmoid activation functions, they range between 0 and 1, which represents 'how much' effect the gates are posing. In mathematical way, the equations are as follows:
$$\textbf{i}^{(t)}=\sigma(W^{ix}\textbf{x}^{(t)}+W^{ih}\textbf{h}^{(t-1)}+\textbf{b}_i),$$
$$\textbf{f}^{(t)}=\sigma(W^{fx}\textbf{x}^{(t)}+W^{fh}\textbf{h}^{(t-1)}+\textbf{b}_f),$$
$$\textbf{o}^{(t)}=\sigma(W^{ox}\textbf{x}^{(t)}+W^{oh}\textbf{h}^{(t-1)}+\textbf{b}_o),$$
$$\textbf{g}^{(t)}=\phi(W^{gx}\textbf{x}^{(t)}+W^{gh}\textbf{h}^{(t-1)}+\textbf{b}_g),$$
$$\textbf{s}^{(t)}=\textbf{g}^{(t)} \odot \textbf{i}^{(t)} + \textbf{s}^{(t-1)} \odot \textbf{f}^{(t)},$$
$$\textbf{h}^{(t)}= \phi(\textbf{s}^{(t)}) \odot \textbf{o}^{(t)},$$
where $\textbf{s}^{(t)}$ represents the state, $\textbf{g}^{(t)}$ the input to the state, $\odot$ the scalar multiplication, and $\phi$ the tanh function. Next, let's just put them into code using the same data as previous RNN.

In [12]:
tf.reset_default_graph()
tf.set_random_seed(1)

HIDDEN_DIM = 4
LEARNING_RATE = 0.1

I initialize the input and the weights.

In [13]:
with tf.variable_scope('input'):
    seq_input = tf.placeholder(dtype=tf.float32,
                        shape=[None,SEQ_LENGTH,1], name='seq_input')

In [14]:
with tf.variable_scope('weights'):
    W_gx = tf.get_variable(name='W_gx', shape=[1,HIDDEN_DIM],
                           dtype=tf.float32)
    W_gh = tf.get_variable(name='W_gh', shape=[HIDDEN_DIM,HIDDEN_DIM],
                           dtype=tf.float32)
    W_ix = tf.get_variable(name='W_ix', shape=[1,HIDDEN_DIM],
                           dtype=tf.float32)
    W_ih = tf.get_variable(name='W_ih', shape=[HIDDEN_DIM,HIDDEN_DIM],
                           dtype=tf.float32)
    W_fx = tf.get_variable(name='W_fx', shape=[1,HIDDEN_DIM],
                           dtype=tf.float32)
    W_fh = tf.get_variable(name='W_fh', shape=[HIDDEN_DIM,HIDDEN_DIM],
                           dtype=tf.float32)
    W_ox = tf.get_variable(name='W_ox', shape=[1,HIDDEN_DIM],
                           dtype=tf.float32)
    W_oh = tf.get_variable(name='W_oh', shape=[HIDDEN_DIM,HIDDEN_DIM],
                           dtype=tf.float32)
    b_g = tf.get_variable(name='b_g', shape=[HIDDEN_DIM],
                          dtype=tf.float32)
    b_i = tf.get_variable(name='b_i', shape=[HIDDEN_DIM],
                          dtype=tf.float32)
    b_f = tf.get_variable(name='b_f', shape=[HIDDEN_DIM],
                          dtype=tf.float32)
    b_o = tf.get_variable(name='b_o', shape=[HIDDEN_DIM],
                          dtype=tf.float32)

    W_yh = tf.get_variable(name='W_yh', shape=[HIDDEN_DIM,1],
                           dtype=tf.float32)
    b_y = tf.get_variable(name='b_y', shape=[1], dtype=tf.float32)

The mechanism is essentially similar with the basic RNN.

In [15]:
with tf.variable_scope('rnn'):
    state_0 = tf.zeros(shape=[tf.shape(seq_input)[0], HIDDEN_DIM],
                       dtype=tf.float32, name='state_0')
    state = state_0
    states = list()
    hidden_0 = tf.zeros(shape=[tf.shape(seq_input)[0], HIDDEN_DIM],
                        dtype=tf.float32, name='hidden_0')
    hidden = hidden_0
    hiddens = list()
    for j in range(SEQ_LENGTH-1):
        g = tf.matmul(seq_input[:,j,:], W_gx) + \
                 tf.nn.xw_plus_b(hidden, W_gh, b_g)
        g = tf.tanh(g, name='g')
        i = tf.matmul(seq_input[:,j,:], W_ix) + \
                 tf.nn.xw_plus_b(hidden, W_ih, b_i)
        i = tf.sigmoid(i, name='i')
        f = tf.matmul(seq_input[:,j,:], W_fx) + \
                 tf.nn.xw_plus_b(hidden, W_fh, b_f)
        f = tf.sigmoid(g, name='f')
        o = tf.matmul(seq_input[:,j,:], W_ox) + \
                 tf.nn.xw_plus_b(hidden, W_oh, b_o)
        o = tf.sigmoid(o, name='o')
        state = g*i + f*state
        states.append(state)
        hidden = o*tf.tanh(state)
        hiddens.append(hidden)
    states = tf.stack(states, axis=1, name='states')
    hiddens = tf.stack(hiddens, axis=1, name='hiddens')

The output and learning process are similar with the previous RNN.

In [16]:
with tf.variable_scope('output'):
    outputs = tf.reshape(hiddens, [-1,HIDDEN_DIM])
    outputs = tf.nn.xw_plus_b(outputs, W_yh, b_y)
    outputs = tf.reshape(outputs, [-1,SEQ_LENGTH-1,1], name='outputs')

In [17]:
with tf.variable_scope('learning'):
    predict = tf.identity(outputs[:,-1,:], 'predict')
    loss = tf.losses.mean_squared_error(labels=seq_input[:,-1,:],
                                        predictions=predict)
    train_step = tf.train.AdamOptimizer(LEARNING_RATE).minimize(loss)

Then, I train the network.

In [18]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())

In [19]:
for step in range(1,101):
    _ = sess.run(train_step, {seq_input: train_seq})
    curr_loss = sess.run(loss, {seq_input: train_seq})
    print 'step:', step, 'loss:', curr_loss

step: 1 loss: 14.5623
step: 2 loss: 12.4639
step: 3 loss: 10.8879
step: 4 loss: 9.56814
step: 5 loss: 8.27149
step: 6 loss: 7.01537
step: 7 loss: 5.99699
step: 8 loss: 5.29935
step: 9 loss: 4.78822
step: 10 loss: 4.35377
step: 11 loss: 3.95908
step: 12 loss: 3.59298
step: 13 loss: 3.25045
step: 14 loss: 2.92614
step: 15 loss: 2.61298
step: 16 loss: 2.30848
step: 17 loss: 2.02133
step: 18 loss: 1.76062
step: 19 loss: 1.52389
step: 20 loss: 1.31662
step: 21 loss: 1.14633
step: 22 loss: 1.01484
step: 23 loss: 0.919205
step: 24 loss: 0.854123
step: 25 loss: 0.80886
step: 26 loss: 0.771471
step: 27 loss: 0.732364
step: 28 loss: 0.685605
step: 29 loss: 0.629251
step: 30 loss: 0.565037
step: 31 loss: 0.497537
step: 32 loss: 0.431543
step: 33 loss: 0.368814
step: 34 loss: 0.313661
step: 35 loss: 0.277237
step: 36 loss: 0.260737
step: 37 loss: 0.251952
step: 38 loss: 0.244557
step: 39 loss: 0.229023
step: 40 loss: 0.208485
step: 41 loss: 0.185181
step: 42 loss: 0.166462
step: 43 loss: 0.160733


Using the same number of iteration, LSTM has better result compared to the vanilla RNN.

In [20]:
prediction = sess.run(predict, {seq_input: test_seq})
test_loss = sess.run(loss, {seq_input: test_seq})
print 'test loss:', test_loss

test loss: 0.118942


Yeah, that's all. I hope it helps you understanding more about RNN.