# Vanilla RNN #

**A simple RNN trained with the standard backprop algorithm, implemented in NumPy**

In [32]:
# code environment
%load_ext watermark
%watermark -p numpy -v -m -u -d

last updated: 2018-02-11 

CPython 3.5.4
IPython 6.2.1

numpy 1.13.3

compiler   : MSC v.1900 64 bit (AMD64)
system     : Windows
release    : 7
machine    : AMD64
processor  : Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
CPU cores  : 4
interpreter: 64bit


### Model Description ###
A standard RNN can be mathematically represented by:

$h(t) = \tanh(W h(t-1) + U X + bh)$

$f(t) = V h(t) + bf$

$p(t) = \mathrm{softmax}(f(t))$

* Here, we use $\tanh$ for the hidden layer and $softmax$ for the output layer.

* In the following, every symbol **without an explicit time index** refers to time $t$.

We denote the hidden size by $H$ and both the input size and the output size by $K$, so the model is a $K$-class classifier.
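As a minimal sketch of the three model equations, one forward step could be written in NumPy as below. The function name `rnn_forward_step`, the random seed, and the sizes $H = 3$, $K = 4$ are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def rnn_forward_step(x, h_prev, W, U, V, bh, bf):
    """One forward step of the vanilla RNN (hypothetical helper).

    Returns the hidden state h (H x 1), the scores f (K x 1)
    and the class probabilities p (K x 1).
    """
    h = np.tanh(np.dot(W, h_prev) + np.dot(U, x) + bh)
    f = np.dot(V, h) + bf
    e = np.exp(f - np.max(f))  # shift by the max for numerical stability
    p = e / np.sum(e)
    return h, f, p

H, K = 3, 4                          # assumed sizes for the example
rng = np.random.RandomState(0)       # assumed seed
W = rng.randn(H, H) * 0.01
U = rng.randn(H, K) * 0.01
V = rng.randn(K, H) * 0.01
bh = np.zeros((H, 1))
bf = np.zeros((K, 1))

x = np.zeros((K, 1))
x[2] = 1.0                           # one-hot input for class 2
h, f, p = rnn_forward_step(x, np.zeros((H, 1)), W, U, V, bh, bf)
print(h.shape, f.shape, p.shape)
```

The shapes of `h`, `f`, and `p` follow from the parameter sizes: $H \times 1$, $K \times 1$, and $K \times 1$.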

### Notations & Sizes ###
parameters: $W: H \times H \quad U: H \times K \quad bh: H \times 1 \quad V: K \times H \quad bf: K \times 1$

intermediate variables: $p: K \times 1 \quad f: K \times 1 \quad h: H \times 1$

input: $X: K \times 1$

gradient of the loss with respect to $h$: $dh: H \times 1$

gradient of the loss with respect to $f$: $df: K \times 1$

### Gradient Derivation ###
We use **cross-entropy** as the loss function, and the label of the $m$-th sample is $y_m$, so $L_m = -\log p_{y_m}$.

* output layer $\Longrightarrow$ hidden layer

$$ 
\begin{aligned}
&\frac{\partial L_m}{\partial f_k} = p_k - I(y_m = k) \overset{def}{=} df_k \\
&\frac{\partial L_m}{\partial bf_k} = \frac{\partial L_m}{\partial f_k} \cdot \frac{\partial f_k}{\partial bf_k} = df_k \\
&\frac{\partial L_m}{\partial v_{ki}} = \frac{\partial L_m}{\partial f_k} \cdot \frac{\partial f_k}{\partial v_{ki}} = df_k \cdot h_i
\end{aligned}
$$

**Matrix formulation**:

$$ 
\begin{aligned}
&\frac{\partial L_m}{\partial bf} = df ,\\
&\frac{\partial L_m}{\partial V} = df \cdot h(t)^T 
\end{aligned}
$$
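The output-layer formulas above can be checked with a short sketch: the outer product $df \cdot h^T$ should agree entry by entry with $df_k \cdot h_i$. The values of `h`, `V`, and the target `y` below are made-up inputs for illustration only.

```python
import numpy as np

K, H = 4, 3
rng = np.random.RandomState(0)        # assumed seed
h = np.tanh(rng.randn(H, 1))          # some hidden state, H x 1
V = rng.randn(K, H) * 0.01
bf = np.zeros((K, 1))
y = 3                                 # hypothetical target class

# forward: scores and softmax probabilities
f = np.dot(V, h) + bf
p = np.exp(f - np.max(f))
p /= np.sum(p)

# backward: df = p - onehot(y), then the matrix formulas
df = p.copy()
df[y] -= 1.0
dbf = df                              # dL/dbf = df
dV = np.dot(df, h.T)                  # (K x 1)(1 x H) -> K x H

# element-wise version of dL/dv_ki = df_k * h_i
dV_loop = np.zeros_like(V)
for k in range(K):
    for i in range(H):
        dV_loop[k, i] = df[k, 0] * h[i, 0]

print(np.allclose(dV, dV_loop))
```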

* hidden layer $\Longrightarrow$ input layer

$$ 
\begin{aligned}
\frac{\partial L_m}{\partial bh_i} &=\frac{\partial L_m}{\partial h_i} \cdot \frac{\partial h_i}{\partial bh_i} \\
&=\sum_{j=1}^{K}\left(\frac{\partial L_m}{\partial f_j} \cdot \frac{\partial f_j}{\partial h_i}\right) \cdot \frac{\partial h_i}{\partial bh_i} \\
&=(1-h_i^2) \cdot \sum_{j=1}^{K}\left(\frac{\partial L_m}{\partial f_j} \cdot \frac{\partial f_j}{\partial h_i}\right) \\
&=(1-h_i^2) \cdot \sum_{j=1}^{K}(df_j \cdot v_{ji})
\end{aligned}
$$

Image description:
![image.png](img/dbh.png)

Here, we can regard $L_m$ as a function of $f_1, f_2, ..., f_K$, and each $f_j$ as a function of $h_i$:

$$L_m = F(f_1,f_2,...,f_K) $$

$$f_1 = G_1(h_i) \quad  f_2 = G_2(h_i) \quad ... \quad f_K = G_K(h_i)$$

According to the chain rule, we can get the result above.

Learn more about backprop: http://colah.github.io/posts/2015-08-Backprop

Go on!

$$
\begin{aligned}
\frac{\partial L_m}{\partial u_{ij}} &=\frac{\partial L_m}{\partial h_i} \cdot \frac{\partial h_i}{\partial u_{ij}} \\
&=\sum_{k=1}^{K}(df_k \cdot v_{ki}) \cdot \frac{\partial h_i}{\partial u_{ij}} \\
&=X_j \cdot (1 - h_i^2) \cdot \sum_{k=1}^{K}(df_k \cdot v_{ki}) \\
&=X_j \cdot \frac{\partial L_m}{\partial bh_i}
\end{aligned}
$$

Similarly, since $w_{ni}$ connects $h(t-1)_i$ to $h_n$:

$$
\begin{aligned}
\frac{\partial L_m}{\partial w_{ni}} = h(t-1)_i \cdot \frac{\partial L_m}{\partial bh_n}
\end{aligned}
$$

**Notice!** Here we regard $h(t-1)$ as a **constant**, although it is actually a function of $W$. If we unfold $h(t-1)$, we get the BPTT (Backpropagation Through Time) algorithm.

**Matrix formulation**:

$$ 
\begin{aligned}
&\frac{\partial L_m}{\partial bh} = [1 - h(t)\otimes h(t)]\otimes (V^T \cdot df) ,\\
&\frac{\partial L_m}{\partial U} = \frac{\partial L_m}{\partial bh} \cdot X^T,\\
&\frac{\partial L_m}{\partial W} = \frac{\partial L_m}{\partial bh} \cdot h(t-1)^T,\\
\end{aligned}
$$
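One way to gain confidence in these matrix formulas is a finite-difference check. The sketch below treats `h_prev` as a constant, as in the derivation, and compares the analytic gradients with central differences. The helper names `loss_fn` and `numeric_grad`, the seed, and all numeric values are assumptions made for this example.

```python
import numpy as np

def loss_fn(W, U, V, bh, bf, x, h_prev, y):
    """Cross-entropy loss of one RNN step; also returns h and p."""
    h = np.tanh(np.dot(W, h_prev) + np.dot(U, x) + bh)
    f = np.dot(V, h) + bf
    p = np.exp(f - np.max(f))
    p /= np.sum(p)
    return -np.log(p[y, 0]), h, p

rng = np.random.RandomState(1)        # assumed seed
H, K, y = 3, 4, 3                     # assumed sizes and target class
W = rng.randn(H, H) * 0.5
U = rng.randn(H, K) * 0.5
V = rng.randn(K, H) * 0.5
bh = np.zeros((H, 1))
bf = np.zeros((K, 1))
x = np.zeros((K, 1))
x[2] = 1.0
h_prev = np.tanh(rng.randn(H, 1))     # treated as a constant here

# analytic gradients from the matrix formulation
_, h, p = loss_fn(W, U, V, bh, bf, x, h_prev, y)
df = p.copy()
df[y] -= 1.0
dbh = (1.0 - h * h) * np.dot(V.T, df)
dU = np.dot(dbh, x.T)
dW = np.dot(dbh, h_prev.T)

def numeric_grad(param, eps=1e-5):
    """Central-difference gradient; perturbs `param` in place and restores it."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(*param.shape):
        old = param[idx]
        param[idx] = old + eps
        loss_plus = loss_fn(W, U, V, bh, bf, x, h_prev, y)[0]
        param[idx] = old - eps
        loss_minus = loss_fn(W, U, V, bh, bf, x, h_prev, y)[0]
        param[idx] = old
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

dbh_num, dU_num, dW_num = numeric_grad(bh), numeric_grad(U), numeric_grad(W)
print(np.max(np.abs(dbh - dbh_num)),
      np.max(np.abs(dU - dU_num)),
      np.max(np.abs(dW - dW_num)))
```

All three printed differences should be tiny compared with the gradient entries themselves.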

Now for the gradient of $h(t-1)$; here we regard $h(t-1)$ as a variable:

$$ 
\begin{aligned}
\frac{\partial L_m}{\partial h(t-1)_i} &= \sum_{n=1}^{H}\left(\frac{\partial L_m}{\partial h_n} \cdot \frac{\partial h_n}{\partial h(t-1)_i}\right) \\
&=\sum_{n=1}^{H}\left[ \left(\sum_{j=1}^{K}\frac{\partial L_m}{\partial f_j} \cdot \frac{\partial f_j}{\partial h_n}\right) \cdot (1-h_n^2)\cdot w_{ni} \right] \\
&=\sum_{n=1}^{H}\left(w_{ni} \cdot \frac{\partial L_m}{\partial bh_n}\right)
\end{aligned}
$$

**Matrix formulation:**
$$ 
\begin{aligned}
\frac{\partial L_m}{\partial h(t-1)} = W^T \cdot \frac{\partial L_m}{\partial bh}
\end{aligned}
$$
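The same kind of numerical check can be applied to $\partial L_m / \partial h(t-1) = W^T \cdot \partial L_m / \partial bh$. This is a sketch under assumed values; the biases are left at zero and omitted for brevity, and `loss_of` is a hypothetical helper.

```python
import numpy as np

rng = np.random.RandomState(2)        # assumed seed
H, K, y = 3, 4, 1                     # assumed sizes and target class
W = rng.randn(H, H) * 0.5
U = rng.randn(H, K) * 0.5
V = rng.randn(K, H) * 0.5
x = np.zeros((K, 1))
x[0] = 1.0
h_prev = np.tanh(rng.randn(H, 1))

def loss_of(h_prev_value):
    """Loss of one step as a function of h(t-1); biases are zero here."""
    h = np.tanh(np.dot(W, h_prev_value) + np.dot(U, x))
    f = np.dot(V, h)
    p = np.exp(f - np.max(f))
    p /= np.sum(p)
    return -np.log(p[y, 0]), h, p

# analytic gradient: dL/dh(t-1) = W^T . dL/dbh
_, h, p = loss_of(h_prev)
df = p.copy()
df[y] -= 1.0
dbh = (1.0 - h * h) * np.dot(V.T, df)
dh_prev = np.dot(W.T, dbh)

# central finite differences, one component of h(t-1) at a time
eps = 1e-5
dh_prev_num = np.zeros_like(h_prev)
for i in range(H):
    step = np.zeros_like(h_prev)
    step[i] = eps
    dh_prev_num[i] = (loss_of(h_prev + step)[0] - loss_of(h_prev - step)[0]) / (2 * eps)

print(np.max(np.abs(dh_prev - dh_prev_num)))
```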

Image description:
![image.png](img/dh1.png)

Tip: as the image shows, the indices of the derivative associated with an edge are the variable indices of the edge's head and tail.

### Code ###

Hyperparameters

In [25]:
import numpy as np

In [26]:
hidden_size = 3
vocab_size = 4
inputs = [2]
targets = [3]

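We initialize the model parameters from a Gaussian distribution. In the code, `Wxh`, `Whh`, and `Whf` correspond to $U$, $W$, and $V$ in the derivation above.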

In [27]:
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Whf = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
bf = np.zeros((vocab_size, 1)) # output bias

In [4]:
Wxh

array([[ 0.01322516, -0.01297705, -0.01016583,  0.00896709],
       [-0.00734309, -0.00930984,  0.01919849, -0.00213576],
       [-0.00716901, -0.00230758,  0.01232394, -0.00354846]])

In [5]:
Whh

array([[-0.00138893,  0.00209951,  0.00817658],
       [ 0.00373148, -0.00764717, -0.00496181],
       [ 0.00775307,  0.00230728, -0.01485889]])

In [6]:
Whf

array([[ 0.00036743,  0.01961332, -0.00445935],
       [ 0.00347142,  0.01447653,  0.01592319],
       [-0.00196813, -0.0012944 , -0.00408246],
       [-0.01682938, -0.00894972, -0.0010431 ]])

In [7]:
# per-time-step caches: inputs, hidden states, scores, probabilities
xs, hs, fs, ps = {}, {}, {}, {}
hprev = np.zeros((hidden_size,1))
hs[-1] = np.copy(hprev)  # initial hidden state h(-1)
loss = 0

In [8]:
xs[0] = np.zeros((vocab_size, 1))
xs[0][inputs[0]] = 1
xs[0]

array([[ 0.],
       [ 0.],
       [ 1.],
       [ 0.]])

In [9]:
hs[0] = np.tanh(np.dot(Wxh, xs[0]) + np.dot(Whh, hs[0-1]) + bh)
hs[0]

array([[-0.01016548],
       [ 0.01919613],
       [ 0.01232331]])

In [10]:
fs[0] = np.dot(Whf, hs[0]) + bf
fs[0]

array([[  3.17810854e-04],
       [  4.38831154e-04],
       [ -5.51498967e-05],
       [ -1.35758787e-05]])

In [11]:
ps[0] = np.exp(fs[0]) / np.sum(np.exp(fs[0]))
ps[0]

array([[ 0.25003646],
       [ 0.25006672],
       [ 0.24994322],
       [ 0.24995361]])
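The direct `np.exp(fs[0]) / np.sum(np.exp(fs[0]))` form works here because the scores are tiny. For larger scores, a common variant subtracts the maximum score first, which leaves the result unchanged mathematically but avoids overflow in `np.exp`. The `softmax` helper below is an illustrative sketch, not part of the original notebook; `small` reuses the `fs[0]` values printed above.

```python
import numpy as np

def softmax(f):
    """Softmax over a column vector, shifted by the max to avoid overflow."""
    shifted = f - np.max(f)
    e = np.exp(shifted)
    return e / np.sum(e)

# the scores printed for fs[0] above
small = np.array([[3.17810854e-04],
                  [4.38831154e-04],
                  [-5.51498967e-05],
                  [-1.35758787e-05]])
# the same scores shifted by a large constant; np.exp(1000.0) alone would overflow
large = small + 1000.0

print(softmax(small).ravel())
print(np.allclose(softmax(small), softmax(large)))
```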

In [12]:
print(ps[0][targets[0], 0])
loss += -np.log(ps[0][targets[0], 0])
print(loss)

0.249953609992
1.38647993837


In [13]:
dWxh, dWhh, dWhf = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Whf)
dbh, dbf = np.zeros_like(bh), np.zeros_like(bf)
dhnext = np.zeros_like(hs[0])

In [14]:
df = np.copy(ps[0])
df[targets[0]] -= 1  # df = p - onehot(target)
df

array([[ 0.25003646],
       [ 0.25006672],
       [ 0.24994322],
       [-0.75004639]])

In [15]:
dWhf += np.dot(df, hs[0].T)
dWhf

array([[-0.00254174,  0.00479973,  0.00308128],
       [-0.00254205,  0.00480031,  0.00308165],
       [-0.00254079,  0.00479794,  0.00308013],
       [ 0.00762458, -0.01439799, -0.00924306]])

In [16]:
dbf += df
dbf

array([[ 0.25003646],
       [ 0.25006672],
       [ 0.24994322],
       [-0.75004639]])

In [17]:
dh = np.dot(Whf.T, df)  # gradient w.r.t. h(t): V^T . df
dh

array([[ 0.01309085],
       [ 0.01491332],
       [ 0.00262885]])

In [18]:
hs[0]

array([[-0.01016548],
       [ 0.01919613],
       [ 0.01232331]])

In [19]:
1 - hs[0] * hs[0]

array([[ 0.99989666],
       [ 0.99963151],
       [ 0.99984814]])

In [20]:
dhraw = (1 - hs[0] * hs[0]) * dh  # backprop through tanh; equals dL/dbh
print(dhraw)
dbh += dhraw
print(dbh)
dhnext = np.dot(Whh.T, dhraw)  # gradient passed back to h(t-1)
print(dhnext)

[[ 0.0130895 ]
 [ 0.01490783]
 [ 0.00262845]]
[[ 0.0130895 ]
 [ 0.01490783]
 [ 0.00262845]]
[[  5.78264788e-05]
 [ -8.04564745e-05]
 [ -5.99831497e-06]]


In [21]:
dWxh += np.dot(dhraw, xs[0].T)
print(dWxh)
dWhh += np.dot(dhraw, hs[0-1].T)
print(dWhh)

[[ 0.          0.          0.0130895   0.        ]
 [ 0.          0.          0.01490783  0.        ]
 [ 0.          0.          0.00262845  0.        ]]
[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]
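The walk-through above covers a single time step, so `dWhh` is zero (the initial hidden state is zero) and `dhnext` is never used. As the notice in the derivation says, unfolding $h(t-1)$ gives BPTT. A minimal sketch of that idea, in the spirit of reference 2, adds `dhnext` to `dh` while looping backwards over time. The three-step `inputs`/`targets`, the seed, and the larger weight scale are made-up values for illustration, and only `dWhh` is accumulated and checked.

```python
import numpy as np

rng = np.random.RandomState(3)                # assumed seed
hidden_size, vocab_size = 3, 4
inputs, targets = [2, 0, 1], [3, 2, 0]        # hypothetical 3-step sequence
Wxh = rng.randn(hidden_size, vocab_size) * 0.5
Whh = rng.randn(hidden_size, hidden_size) * 0.5
Whf = rng.randn(vocab_size, hidden_size) * 0.5
bh = np.zeros((hidden_size, 1))
bf = np.zeros((vocab_size, 1))

def forward(Whh_value):
    """Run all time steps; return the total loss and the cached states."""
    xs, hs, ps = {}, {-1: np.zeros((hidden_size, 1))}, {}
    loss = 0.0
    for t in range(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh_value, hs[t - 1]) + bh)
        f = np.dot(Whf, hs[t]) + bf
        e = np.exp(f - np.max(f))
        ps[t] = e / np.sum(e)
        loss += -np.log(ps[t][targets[t], 0])
    return loss, xs, hs, ps

loss, xs, hs, ps = forward(Whh)

# backward pass through time, accumulating the gradient of Whh
dWhh = np.zeros_like(Whh)
dhnext = np.zeros((hidden_size, 1))
for t in reversed(range(len(inputs))):
    df = np.copy(ps[t])
    df[targets[t]] -= 1
    dh = np.dot(Whf.T, df) + dhnext           # from the output at t plus from step t+1
    dhraw = (1 - hs[t] * hs[t]) * dh          # backprop through tanh
    dWhh += np.dot(dhraw, hs[t - 1].T)
    dhnext = np.dot(Whh.T, dhraw)             # passed back to step t-1

# central finite-difference check of dWhh
eps = 1e-5
dWhh_num = np.zeros_like(Whh)
for idx in np.ndindex(*Whh.shape):
    W_plus = Whh.copy()
    W_plus[idx] += eps
    W_minus = Whh.copy()
    W_minus[idx] -= eps
    dWhh_num[idx] = (forward(W_plus)[0] - forward(W_minus)[0]) / (2 * eps)

print(np.max(np.abs(dWhh - dWhh_num)))
```

Dropping the `+ dhnext` term corresponds to treating $h(t-1)$ as a constant at every step, which in general no longer matches the finite-difference gradient.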


## References ##
1. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
2. https://gist.github.com/karpathy/d4dee566867f8291f086
3. http://cs231n.github.io/neural-networks-case-study/#grad
4. https://www.zhihu.com/question/27239198?rf=24827633
5. http://colah.github.io/posts/2015-08-Backprop/
6. http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/