# Vanilla RNN #

**A simple RNN trained with the standard backprop algorithm, implemented in NumPy**

In [32]:
# code environment
%load_ext watermark
%watermark -p numpy -v -m -u -d

last updated: 2018-02-11 

CPython 3.5.4
IPython 6.2.1

numpy 1.13.3

compiler   : MSC v.1900 64 bit (AMD64)
system     : Windows
release    : 7
machine    : AMD64
processor  : Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
CPU cores  : 4
interpreter: 64bit


### Model Description ###
A standard RNN can be mathematically represented by:

$h(t) = \tanh(Wh(t-1) + UX + bh)$

$f(t) = Vh(t) + bf$

$p(t) = \mathrm{softmax}(f(t))$

* Here, we use $\tanh$ and $\mathrm{softmax}$ for the hidden and output layers respectively.

* In the following, every symbol **without an explicit time index** refers to time $t$.

We denote hidden_size by $H$ and both input_size and output_size by $K$; that is, the model is a $K$-class classifier.
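
The three model equations can be sketched as one NumPy function. The name `rnn_step`, the fixed seed, and the toy sizes are illustrative assumptions rather than part of this notebook. The softmax subtracts the largest score before exponentiating, which does not change the result but avoids overflow.

```python
import numpy as np

def rnn_step(x, h_prev, W, U, V, bh, bf):
    """One forward step of the vanilla RNN (hypothetical helper)."""
    h = np.tanh(np.dot(W, h_prev) + np.dot(U, x) + bh)  # hidden state, H x 1
    f = np.dot(V, h) + bf                                # output scores, K x 1
    e = np.exp(f - np.max(f))                            # shifted for stability
    p = e / np.sum(e)                                    # softmax, K x 1
    return h, f, p

H, K = 3, 4                      # toy hidden size and number of classes
rng = np.random.RandomState(0)   # fixed seed, chosen only for reproducibility
W = rng.randn(H, H) * 0.01
U = rng.randn(H, K) * 0.01
V = rng.randn(K, H) * 0.01
bh, bf = np.zeros((H, 1)), np.zeros((K, 1))

x = np.zeros((K, 1))
x[2] = 1                         # one-hot input for class 2
h, f, p = rnn_step(x, np.zeros((H, 1)), W, U, V, bh, bf)
print(h.shape, f.shape, p.shape)  # H x 1, K x 1, K x 1
```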

### Notations & Sizes ###
parameters: $W: H \times H \quad U:H \times K \quad bh: H \times 1 \quad V: K \times H \quad bf: K \times 1$

intermediate variables: $p: K \times 1 \quad  f: K \times 1 \quad h: H \times 1$

input: $X: K \times 1$

gradient of the loss with respect to h: $dh: H \times 1 $

gradient of the loss with respect to f: $df: K \times 1 $

### Gradient Derivation ###
We use **cross-entropy** as the loss function, and the label of the $m$-th sample is $y_m$.

* output layer $\Longrightarrow$ hidden layer

$$ 
\begin{aligned}
&\frac{\partial L_m}{\partial f_k} = p_k - I(y_m = k) \overset{def}{=} df_k \\
&\frac{\partial L_m}{\partial bf_k} = \frac{\partial L_m}{\partial f_k} \cdot \frac{\partial f_k}{\partial bf_k} = df_k \\
&\frac{\partial L_m}{\partial v_{ki}} = \frac{\partial L_m}{\partial f_k} \cdot \frac{\partial f_k}{\partial v_{ki}} = df_k \cdot h_i
\end{aligned}
$$

**Matrix formulation**:

$$ 
\begin{aligned}
&\frac{\partial L_m}{\partial bf} = df ,\\
&\frac{\partial L_m}{\partial V} = df \cdot h(t)^T 
\end{aligned}
$$
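
A finite-difference check is one way to confirm the output-layer formulas. The sketch below compares $df \cdot h^T$ and $df$ against central differences of the loss. The helper names, the random test values, and the step size `eps` are assumptions made for illustration only.

```python
import numpy as np

def softmax(f):
    e = np.exp(f - np.max(f))
    return e / np.sum(e)

def loss_fn(V, bf, h, y):
    """Cross-entropy loss of one sample as a function of the output layer."""
    return -np.log(softmax(np.dot(V, h) + bf)[y, 0])

rng = np.random.RandomState(0)
H, K, y = 3, 4, 3                      # toy sizes and an arbitrary label
V, bf = rng.randn(K, H), rng.randn(K, 1)
h = np.tanh(rng.randn(H, 1))           # stand-in hidden state

# Analytic gradients from the formulas above
df = softmax(np.dot(V, h) + bf)
df[y] -= 1                             # p_k - I(y = k)
dV = np.dot(df, h.T)
dbf = df.copy()

# Central-difference gradients
eps = 1e-6
dV_num, dbf_num = np.zeros_like(V), np.zeros_like(bf)
for k in range(K):
    bp, bm = bf.copy(), bf.copy()
    bp[k] += eps
    bm[k] -= eps
    dbf_num[k] = (loss_fn(V, bp, h, y) - loss_fn(V, bm, h, y)) / (2 * eps)
    for i in range(H):
        Vp, Vm = V.copy(), V.copy()
        Vp[k, i] += eps
        Vm[k, i] -= eps
        dV_num[k, i] = (loss_fn(Vp, bf, h, y) - loss_fn(Vm, bf, h, y)) / (2 * eps)

max_err_V = np.max(np.abs(dV - dV_num))
max_err_bf = np.max(np.abs(dbf - dbf_num))
print(max_err_V, max_err_bf)           # both should be close to zero
```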

* hidden layer $\Longrightarrow$ input layer

$$ 
\begin{aligned}
\frac{\partial L_m}{\partial bh_i} &=\frac{\partial L_m}{\partial h_i} \cdot \frac{\partial h_i}{\partial bh_i} \\
&=\sum_{j=1}^{K}\left(\frac{\partial L_m}{\partial f_j} \cdot \frac{\partial f_j}{\partial h_i}\right) \cdot \frac{\partial h_i}{\partial bh_i} \\
&=(1-h_i^2) \cdot \sum_{j=1}^{K}\left(\frac{\partial L_m}{\partial f_j} \cdot \frac{\partial f_j}{\partial h_i}\right) \\
&=(1-h_i^2) \cdot \sum_{j=1}^{K}(df_j \cdot v_{ji})
\end{aligned}
$$

Illustration:
![image.png](img/dbh.png)

Here, we can regard $L_m$ as a function of $f_1, f_2, ..., f_K$, and each $f_j$ as a function of $h_i$:

$$L_m = F(f_1,f_2,...,f_K) $$

$$f_1 = G_1(h_i) \quad  f_2 = G_2(h_i) \quad ... \quad f_K = G_K(h_i)$$

Applying the multivariable chain rule to this composition gives the result above.

Learn more about backprop: http://colah.github.io/posts/2015-08-Backprop

Next, the gradient with respect to $U$:

$$
\begin{aligned}
\frac{\partial L_m}{\partial u_{ij}} &=\frac{\partial L_m}{\partial h_i} \cdot \frac{\partial h_i}{\partial u_{ij}} \\
&=\sum_{k=1}^{K}(df_k \cdot v_{ki}) \cdot \frac{\partial h_i}{\partial u_{ij}} \\
&=X_j \cdot (1 - h_i^2) \cdot \sum_{k=1}^{K}(df_k \cdot v_{ki}) \\
&=X_j \cdot \frac{\partial L_m}{\partial bh_i}
\end{aligned}
$$

Similarly, for the entry $w_{ni}$ that connects $h(t-1)_i$ to $h_n$:

$$
\begin{aligned}
\frac{\partial L_m}{\partial w_{ni}} = h(t-1)_i \cdot \frac{\partial L_m}{\partial bh_n}
\end{aligned}
$$
**Notice!** Here we treat $h(t-1)$ as a **constant**, although it is actually a function of $W$. Unfolding $h(t-1)$ through the earlier time steps gives the BPTT (Backpropagation Through Time) algorithm.
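
To make the unfolding concrete, here is a minimal sketch of BPTT over a short sequence, written in this section's notation and structured like the loop in reference 2. The function name `bptt`, the toy sequence, and the weight scale are assumptions. The only change from the single-step case is that the gradient flowing into $h(t)$ also receives `dh_next`, the contribution passed back from time $t+1$.

```python
import numpy as np

def bptt(inputs, targets, h0, W, U, V, bh, bf):
    """Forward and backward pass over a whole sequence (illustrative sketch)."""
    K = U.shape[1]
    xs, hs, ps = {}, {-1: h0}, {}
    loss = 0.0
    # Forward pass: store every hidden state and output distribution
    for t in range(len(inputs)):
        xs[t] = np.zeros((K, 1))
        xs[t][inputs[t]] = 1                      # one-hot input
        hs[t] = np.tanh(np.dot(W, hs[t - 1]) + np.dot(U, xs[t]) + bh)
        f = np.dot(V, hs[t]) + bf
        e = np.exp(f - np.max(f))
        ps[t] = e / np.sum(e)
        loss += -np.log(ps[t][targets[t], 0])
    # Backward pass: walk the time steps in reverse
    dW, dU, dV = np.zeros_like(W), np.zeros_like(U), np.zeros_like(V)
    dbh, dbf = np.zeros_like(bh), np.zeros_like(bf)
    dh_next = np.zeros_like(h0)
    for t in reversed(range(len(inputs))):
        df = ps[t].copy()
        df[targets[t]] -= 1
        dV += np.dot(df, hs[t].T)
        dbf += df
        dh = np.dot(V.T, df) + dh_next            # output path + future path
        dh_raw = (1 - hs[t] * hs[t]) * dh         # back through tanh
        dbh += dh_raw
        dU += np.dot(dh_raw, xs[t].T)
        dW += np.dot(dh_raw, hs[t - 1].T)
        dh_next = np.dot(W.T, dh_raw)             # gradient for h(t-1)
    return loss, dW, dU, dV, dbh, dbf

rng = np.random.RandomState(0)
H, K = 3, 4
W = rng.randn(H, H) * 0.5
U = rng.randn(H, K) * 0.5
V = rng.randn(K, H) * 0.5
bh, bf = np.zeros((H, 1)), np.zeros((K, 1))
inputs, targets = [0, 1, 2], [1, 2, 3]            # toy sequence
h0 = np.zeros((H, 1))

loss, dW, dU, dV, dbh, dbf = bptt(inputs, targets, h0, W, U, V, bh, bf)
print(loss, dW.shape, dU.shape, dV.shape)
```

With a single time step and `dh_next` equal to zero, this reduces to the constant-$h(t-1)$ gradients derived in this section.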

**Matrix formulation**:

$$ 
\begin{aligned}
&\frac{\partial L_m}{\partial bh} = [1 - h(t)\otimes h(t)]\otimes (V^T \cdot df) ,\\
&\frac{\partial L_m}{\partial U} = \frac{\partial L_m}{\partial bh} \cdot X^T,\\
&\frac{\partial L_m}{\partial W} = \frac{\partial L_m}{\partial bh} \cdot h(t-1)^T
\end{aligned}
$$
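
The scalar derivation and the matrix form should agree entry by entry. The sketch below evaluates the per-element sums with explicit loops and compares them with the vectorized expressions, reading $\otimes$ as the elementwise product. All values are random placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(1)
H, K = 3, 4
V = rng.randn(K, H)
h = np.tanh(rng.randn(H, 1))      # stand-in for h(t)
df = rng.randn(K, 1)              # stand-in for p - onehot(y)
X = rng.randn(K, 1)               # stand-in input vector

# Scalar form: dbh_i = (1 - h_i^2) * sum_j df_j * v_ji
dbh_loop = np.zeros((H, 1))
for i in range(H):
    s = 0.0
    for j in range(K):
        s += df[j, 0] * V[j, i]
    dbh_loop[i, 0] = (1 - h[i, 0] ** 2) * s

# Scalar form: dU_ij = X_j * dbh_i
dU_loop = np.zeros((H, K))
for i in range(H):
    for j in range(K):
        dU_loop[i, j] = X[j, 0] * dbh_loop[i, 0]

# Matrix form
dbh = (1 - h * h) * np.dot(V.T, df)
dU = np.dot(dbh, X.T)

print(np.allclose(dbh, dbh_loop), np.allclose(dU, dU_loop))
```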

Now consider the gradient with respect to $h(t-1)$, this time treating $h(t-1)$ as a variable:

$$ 
\begin{aligned}
\frac{\partial L_m}{\partial h(t-1)_i} &= \sum_{n=1}^{H}\left(\frac{\partial L_m}{\partial h_n} \cdot \frac{\partial h_n}{\partial h(t-1)_i}\right) \\
&=\sum_{n=1}^{H}\left[ \left(\sum_{j=1}^{K}\frac{\partial L_m}{\partial f_j} \cdot \frac{\partial f_j}{\partial h_n}\right) \cdot (1-h_n^2)\cdot w_{ni} \right] \\
&=\sum_{n=1}^{H}\left(w_{ni} \cdot \frac{\partial L_m}{\partial bh_n}\right)
\end{aligned}
$$

**Matrix formulation:**
$$ 
\begin{aligned}
\frac{\partial L_m}{\partial h(t-1)} = W^T \cdot \frac{\partial L_m}{\partial bh}
\end{aligned}
$$
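
As a quick sanity check of $W^T \cdot \frac{\partial L_m}{\partial bh}$, the sketch below perturbs each entry of $h(t-1)$ and compares the resulting change in the loss with the analytic value. The helper `step_loss`, the random parameters, and the step size are assumptions used only for this check.

```python
import numpy as np

def step_loss(h_prev, x, y, W, U, V, bh, bf):
    """Loss of a single time step as a function of h(t-1) (hypothetical helper)."""
    h = np.tanh(np.dot(W, h_prev) + np.dot(U, x) + bh)
    f = np.dot(V, h) + bf
    e = np.exp(f - np.max(f))
    p = e / np.sum(e)
    return -np.log(p[y, 0]), h, p

rng = np.random.RandomState(2)
H, K, y = 3, 4, 3
W = rng.randn(H, H) * 0.5
U = rng.randn(H, K) * 0.5
V = rng.randn(K, H) * 0.5
bh, bf = rng.randn(H, 1) * 0.5, rng.randn(K, 1) * 0.5
x = np.zeros((K, 1))
x[2] = 1
h_prev = np.tanh(rng.randn(H, 1))   # a nonzero previous hidden state

# Analytic gradient: dh(t-1) = W^T * dbh
_, h, p = step_loss(h_prev, x, y, W, U, V, bh, bf)
df = p.copy()
df[y] -= 1
dbh = (1 - h * h) * np.dot(V.T, df)
dh_prev = np.dot(W.T, dbh)

# Central-difference gradient
eps = 1e-6
dh_prev_num = np.zeros_like(h_prev)
for i in range(H):
    hp, hm = h_prev.copy(), h_prev.copy()
    hp[i] += eps
    hm[i] -= eps
    lp = step_loss(hp, x, y, W, U, V, bh, bf)[0]
    lm = step_loss(hm, x, y, W, U, V, bh, bf)[0]
    dh_prev_num[i] = (lp - lm) / (2 * eps)

max_err = np.max(np.abs(dh_prev - dh_prev_num))
print(max_err)                      # should be close to zero
```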

Illustration:
![image.png](img/dh1.png)

Tip: as the image shows, the indices of the derivative associated with an edge are the variable indices of the edge's head and tail.

### Code ###

Hyperparameters

In [51]:
import numpy as np

In [52]:
hidden_size = 3
input_size = 4
inputs = [2]
targets = [3]

hidden_size $\Longrightarrow H$ 

input_size $\Longrightarrow K$

We initialize the weight matrices from a Gaussian distribution scaled by 0.01 and set the biases to zero.

In [53]:
Wxh = np.random.randn(hidden_size, input_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Whf = np.random.randn(input_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
bf = np.zeros((input_size, 1)) # output bias

Wxh $\Longrightarrow U$ 

Whh $\Longrightarrow W$

Whf $\Longrightarrow V$

In [54]:
Wxh

array([[ 0.00498441,  0.01500635, -0.01202295, -0.00355362],
       [-0.00267535,  0.00416772, -0.00698555, -0.00545377],
       [-0.00705934, -0.01143761, -0.0090071 , -0.00409386]])

In [55]:
Whh

array([[-0.0074661 , -0.00860082,  0.00430556],
       [ 0.01304914, -0.00546712,  0.02150136],
       [-0.0096824 , -0.0068539 ,  0.01933451]])

In [56]:
Whf

array([[-0.00649287,  0.00182899,  0.00344997],
       [ 0.00669601,  0.00148885, -0.00129829],
       [-0.00653117,  0.01307569, -0.00403534],
       [-0.00280876, -0.01465321, -0.0089798 ]])

Variable dicts

In [57]:
xs, hs, fs, ps = {}, {}, {}, {} # all values; key: time t, value: vector at time t
hprev = np.zeros((hidden_size,1)) # previous hidden layer output
hs[-1] = np.copy(hprev) # hidden layer output before the first time step
loss = 0

In [58]:
t = 0 # example time t = 0

In [59]:
xs[t] = np.zeros((input_size, 1))
xs[t][inputs[t]] = 1
xs[t]

array([[ 0.],
       [ 0.],
       [ 1.],
       [ 0.]])

In [60]:
hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh)
hs[t]

array([[-0.01202237],
       [-0.00698543],
       [-0.00900686]])

In [61]:
fs[t] = np.dot(Whf, hs[t]) + bf
fs[t]

array([[  3.42101089e-05],
       [ -7.92087153e-05],
       [  2.35265595e-05],
       [  2.17006700e-04]])

In [62]:
ps[t] = np.exp(fs[t]) / np.sum(np.exp(fs[t]))
ps[t]

array([[ 0.24999633],
       [ 0.24996798],
       [ 0.24999366],
       [ 0.25004203]])
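
The softmax cell above exponentiates the scores directly, which is fine for the tiny values here. For large scores `np.exp` can overflow, and a common remedy is to subtract the maximum score first, which leaves the probabilities unchanged. The sketch below uses a hypothetical helper `stable_softmax`; the large scores are made up to show the effect.

```python
import numpy as np

def stable_softmax(f):
    """Softmax with the maximum score subtracted before exponentiation."""
    e = np.exp(f - np.max(f))
    return e / np.sum(e)

# Scores copied from the fs[t] output above
f_small = np.array([[3.42101089e-05],
                    [-7.92087153e-05],
                    [2.35265595e-05],
                    [2.17006700e-04]])
naive = np.exp(f_small) / np.sum(np.exp(f_small))
same = np.allclose(naive, stable_softmax(f_small))
print(same)                          # both variants agree on small scores

# Made-up large scores: exp(1000) would overflow without the shift
f_large = np.array([[1000.0], [1001.0], [1002.0]])
p_large = stable_softmax(f_large)
print(p_large.ravel())
```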

In [63]:
print(ps[t][targets[t], 0])
loss += -np.log(ps[t][targets[t], 0])
print(loss)

0.250042032869
1.38612624377
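
A useful sanity check: with small random weights the predicted distribution is nearly uniform, so the initial loss should be close to $-\log(1/K) = \log K$. The short sketch below compares that baseline with the loss recomputed from the probability printed above; the variable names are illustrative.

```python
import numpy as np

K = 4                                 # number of classes (input_size above)
uniform_loss = np.log(K)              # loss of a perfectly uniform prediction
p_target = 0.250042032869             # probability of the target class, from the output above
observed_loss = -np.log(p_target)

print(uniform_loss, observed_loss)
print(abs(uniform_loss - observed_loss))   # small gap: the model starts near uniform
```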


In [64]:
dWxh, dWhh, dWhf = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Whf)
dbh, dbf = np.zeros_like(bh), np.zeros_like(bf)
dhnext = np.zeros_like(hs[0])

In [65]:
df = np.copy(ps[t])
df[targets[t]] -= 1
df

array([[ 0.24999633],
       [ 0.24996798],
       [ 0.24999366],
       [-0.74995797]])

In [66]:
dWhf += np.dot(df, hs[t].T)
dWhf

array([[-0.00300555, -0.00174633, -0.00225168],
       [-0.00300521, -0.00174613, -0.00225143],
       [-0.00300552, -0.00174631, -0.00225166],
       [ 0.00901627,  0.00523878,  0.00675476]])

In [67]:
dbf += df
dbf

array([[ 0.24999633],
       [ 0.24996798],
       [ 0.24999366],
       [-0.74995797]])

In [68]:
dh = np.dot(Whf.T, df)
dh

array([[ 0.00052429],
       [ 0.01508754],
       [ 0.00626361]])

In [69]:
hs[t]

array([[-0.01202237],
       [-0.00698543],
       [-0.00900686]])

In [70]:
1 - hs[t] * hs[t]

array([[ 0.99985546],
       [ 0.9999512 ],
       [ 0.99991888]])

In [71]:
dhraw = (1 - hs[t] * hs[t]) * dh
print(dhraw)
dbh += dhraw
print(dbh)
dhnext = np.dot(Whh.T, dhraw)
print(dhnext)

[[ 0.00052422]
 [ 0.0150868 ]
 [ 0.0062631 ]]
[[ 0.00052422]
 [ 0.0150868 ]
 [ 0.0062631 ]]
[[ 0.00013231]
 [-0.00012992]
 [ 0.00044774]]


In [72]:
dWxh += np.dot(dhraw, xs[t].T)
print(dWxh)
dWhh += np.dot(dhraw, hs[t-1].T)
print(dWhh)

[[ 0.          0.          0.00052422  0.        ]
 [ 0.          0.          0.0150868   0.        ]
 [ 0.          0.          0.0062631   0.        ]]
[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]


## References ##
1. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
2. https://gist.github.com/karpathy/d4dee566867f8291f086
3. http://cs231n.github.io/neural-networks-case-study/#grad
4. https://www.zhihu.com/question/27239198?rf=24827633
5. http://colah.github.io/posts/2015-08-Backprop/
6. http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/