# Vanilla RNN #

**A simple RNN trained with standard backpropagation, implemented in NumPy**

In [32]:
# code environment
%load_ext watermark
%watermark -p numpy -v -m -u -d

last updated: 2018-02-11 

CPython 3.5.4
IPython 6.2.1

numpy 1.13.3

compiler   : MSC v.1900 64 bit (AMD64)
system     : Windows
release    : 7
machine    : AMD64
processor  : Intel64 Family 6 Model 61 Stepping 4, GenuineIntel
CPU cores  : 4
interpreter: 64bit


### Model Description ###
A standard RNN can be written as:

$h(t) = \tanh(W h(t-1) + U X + bh)$

$f(t) = V h(t) + bf$

$p(t) = \mathrm{softmax}(f(t))$

* Here, we use $\tanh$ and $\mathrm{softmax}$ as the activations of the hidden and output layers, respectively.

* In the following, unless stated otherwise, every symbol refers to time step $t$.

We denote hidden_size by $H$ and both input_size and output_size by $K$; the model is therefore a $K$-class classifier.
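
The three equations above can be sketched as a single forward step. This is a minimal illustration, not part of the original notebook: the function name `rnn_step`, the random seed, and the sizes `H = 3`, `K = 4` are assumptions chosen only for the example.

```python
import numpy as np

def rnn_step(x, h_prev, W, U, V, bh, bf):
    """One forward step of the vanilla RNN: returns h(t), f(t), p(t)."""
    h = np.tanh(np.dot(W, h_prev) + np.dot(U, x) + bh)  # hidden state, H x 1
    f = np.dot(V, h) + bf                               # unnormalized scores, K x 1
    p = np.exp(f) / np.sum(np.exp(f))                   # softmax probabilities, K x 1
    return h, f, p

H, K = 3, 4                       # illustrative sizes (assumption)
np.random.seed(0)                 # fixed seed so the example is repeatable
W = np.random.randn(H, H) * 0.01  # hidden to hidden
U = np.random.randn(H, K) * 0.01  # input to hidden
V = np.random.randn(K, H) * 0.01  # hidden to output
bh = np.zeros((H, 1))
bf = np.zeros((K, 1))

x = np.zeros((K, 1))
x[2] = 1                          # one-hot input for class index 2
h, f, p = rnn_step(x, np.zeros((H, 1)), W, U, V, bh, bf)
print(h.shape, f.shape, p.shape)
```

The shapes of `h`, `f` and `p` should match the sizes listed in the next section.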

### Notations & Sizes ###
parameters: $W: H \times H \quad U:H \times K \quad bh: H \times 1 \quad V: K \times H \quad bf: K \times 1$

intermediate variables: $p: K \times 1 \quad  f: K \times 1 \quad h: H \times 1$

input: $X: K \times 1$

gradient of the loss with respect to $h$: $dh: H \times 1$

gradient of the loss with respect to $f$: $df: K \times 1$

### Gradient Derivation ###
We use **cross-entropy** as the loss function, and the label of the $m$-th sample is $y_m$.

* output layer $\Longrightarrow$ hidden layer

$$ 
\begin{aligned}
&\frac{\partial L_m}{\partial f_k} = p_k - I(y_m = k) \overset{def}{=} df_k \\
&\frac{\partial L_m}{\partial bf_k} = \frac{\partial L_m}{\partial f_k} \cdot \frac{\partial f_k}{\partial bf_k} = df_k \\
&\frac{\partial L_m}{\partial v_{ki}} = \frac{\partial L_m}{\partial f_k} \cdot \frac{\partial f_k}{\partial v_{ki}} = df_k \cdot h_i
\end{aligned}
$$

Thus,

$$ 
\begin{aligned}
&\frac{\partial L_m}{\partial bf} = df ,\\
&\frac{\partial L_m}{\partial V} = df \cdot h(t)^T 
\end{aligned}
$$
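
One way to gain confidence in the output-layer formulas is to compare them with finite differences. The sketch below is an illustration under assumptions: the helper `loss_from_scores`, the random test values, and the step size `eps` are not from the notebook.

```python
import numpy as np

def loss_from_scores(V, bf, h, y):
    """Cross-entropy loss of softmax(V h + bf) against the integer label y."""
    f = np.dot(V, h) + bf
    e = np.exp(f - np.max(f))      # shift scores for numerical stability
    p = e / np.sum(e)
    return -np.log(p[y, 0]), p

np.random.seed(1)
H, K, y = 3, 4, 3                  # illustrative sizes and label (assumption)
V = np.random.randn(K, H)
bf = np.random.randn(K, 1)
h = np.tanh(np.random.randn(H, 1))

loss, p = loss_from_scores(V, bf, h, y)
df = p.copy()
df[y] -= 1                         # df_k = p_k - I(y = k)
dbf = df                           # dL/dbf = df
dV = np.dot(df, h.T)               # dL/dV = df . h^T

# central finite differences on every entry of V
eps = 1e-5
dV_num = np.zeros_like(V)
for k in range(K):
    for i in range(H):
        V_plus, V_minus = V.copy(), V.copy()
        V_plus[k, i] += eps
        V_minus[k, i] -= eps
        dV_num[k, i] = (loss_from_scores(V_plus, bf, h, y)[0]
                        - loss_from_scores(V_minus, bf, h, y)[0]) / (2 * eps)

max_err = np.max(np.abs(dV - dV_num))
print(max_err)
```

If the analytic formula is right, `max_err` should be many orders of magnitude smaller than the gradient entries themselves.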

* hidden layer $\Longrightarrow$ input layer

$$ 
\begin{aligned}
\frac{\partial L_m}{\partial bh_i} &=\frac{\partial L_m}{\partial h_i} \cdot \frac{\partial h_i}{\partial bh_i} \\
&=\left(\sum_{j=1}^{K}\frac{\partial L_m}{\partial f_j} \cdot \frac{\partial f_j}{\partial h_i}\right) \cdot \frac{\partial h_i}{\partial bh_i}
\end{aligned}
$$

Diagram:
![image.png](img/dbh.png)
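
The chain above can be carried to a closed form for a single time step, assuming no gradient arrives from later time steps (as in the one-step example in the Code section). Since $f = Vh + bf$, we have $\partial f_j / \partial h_i = v_{ji}$, and the $\tanh$ derivative gives $\partial h_i / \partial bh_i = 1 - h_i^2$. A sketch of the resulting expressions, writing $dhraw$ for the gradient at the pre-activation as the code does:

$$
\begin{aligned}
&\frac{\partial L_m}{\partial bh_i} = \Big(\sum_{j=1}^{K} df_j \cdot v_{ji}\Big) \cdot (1 - h_i^2) \\
&dh = V^T \cdot df, \qquad dhraw = (1 - h \odot h) \odot dh \\
&\frac{\partial L_m}{\partial bh} = dhraw, \qquad \frac{\partial L_m}{\partial U} = dhraw \cdot X^T, \qquad \frac{\partial L_m}{\partial W} = dhraw \cdot h(t-1)^T
\end{aligned}
$$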



### Code ###

We initialize the weight matrices from a Gaussian distribution scaled by 0.01 and set the biases to zero.

In [36]:
import numpy as np

In [37]:
hidden_size = 3
vocab_size = 4
inputs = [2]
targets = [3]

In [38]:
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Whf = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
bf = np.zeros((vocab_size, 1)) # output bias

In [39]:
Wxh

array([[-0.00750633,  0.00537044, -0.00212228,  0.00870559],
       [-0.00813658,  0.00619978,  0.00339144, -0.0032588 ],
       [-0.01374781,  0.01228833,  0.00256443,  0.0088881 ]])

In [5]:
Whh

array([[ 0.00011224,  0.00507858, -0.0011623 ],
       [ 0.0100956 , -0.00462209, -0.01085725],
       [ 0.00723287,  0.00323559, -0.01131211]])

In [6]:
Whf

array([[ 0.01421196,  0.00707065, -0.00224458],
       [-0.00169425,  0.009578  , -0.00485544],
       [ 0.00252839, -0.0034816 , -0.00401216],
       [ 0.01535335, -0.00223433, -0.01630785]])

In [7]:
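xs, hs, fs, ps = {}, {}, {}, {} # per-step inputs, hidden states, scores, probabilities
hprev = np.zeros((hidden_size,1)) # initial hidden state
hs[-1] = np.copy(hprev) # hs[-1] holds the state before the first step
loss = 0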

In [8]:
xs[0] = np.zeros((vocab_size, 1)) # one-hot encoding of the input index
xs[0][inputs[0]] = 1
xs[0]

array([[ 0.],
       [ 0.],
       [ 1.],
       [ 0.]])

In [9]:
hs[0] = np.tanh(np.dot(Wxh, xs[0]) + np.dot(Whh, hs[0-1]) + bh) # hidden state h(t)
hs[0]

array([[ 0.00969675],
       [ 0.00374772],
       [ 0.01119771]])

In [10]:
fs[0] = np.dot(Whf, hs[0]) + bf # unnormalized class scores f(t)
fs[0]

array([[  1.39174478e-04],
       [ -3.49028567e-05],
       [ -3.34578758e-05],
       [ -4.21064945e-05]])

In [11]:
ps[0] = np.exp(fs[0]) / np.sum(np.exp(fs[0])) # softmax probabilities p(t)
ps[0]

array([[ 0.250033  ],
       [ 0.24998948],
       [ 0.24998984],
       [ 0.24998768]])
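
The softmax above works because the scores are tiny, but `np.exp` overflows for large scores. A common remedy is to subtract the maximum score before exponentiating, which leaves the result unchanged. The sketch below is an assumption-labeled alternative, not what the notebook runs; the function name `softmax` and the `large` test vector are made up for illustration, while `small` copies the scores printed for `fs[0]`.

```python
import numpy as np

def softmax(f):
    """Column-wise softmax that subtracts the max score before exponentiating."""
    shifted = f - np.max(f, axis=0, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=0, keepdims=True)

# scores printed for fs[0] in the notebook
small = np.array([[1.39174478e-04],
                  [-3.49028567e-05],
                  [-3.34578758e-05],
                  [-4.21064945e-05]])
# hypothetical large scores, where exp() without the shift would overflow
large = np.array([[1000.0], [1001.0], [1002.0], [1003.0]])

print(softmax(small).ravel())
print(softmax(large).ravel())
```

For the small scores, the shifted version should agree with the direct formula used in the cell above.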

In [12]:
print(ps[0][targets[0], 0]) # probability assigned to the target class
loss += -np.log(ps[0][targets[0], 0]) # cross-entropy loss
print(loss)

0.249987678749
1.38634364734
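
As a sanity check, a freshly initialized network predicts a nearly uniform distribution, so the loss should be close to the cross-entropy of a uniform guess over `vocab_size` classes. A minimal sketch (the variable name `uniform_loss` is an assumption):

```python
import numpy as np

vocab_size = 4
uniform_loss = -np.log(1.0 / vocab_size)  # cross-entropy of a uniform prediction, i.e. ln(vocab_size)
print(uniform_loss)
```

The printed value should differ from the loss above only in the fourth or fifth decimal place.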


In [13]:
dWxh, dWhh, dWhf = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Whf) # gradient accumulators
dbh, dbf = np.zeros_like(bh), np.zeros_like(bf)
dhnext = np.zeros_like(hs[0]) # gradient flowing into h from the next time step (none yet)

In [14]:
df = np.copy(ps[0])
df[targets[0]] -= 1 # df = p - one_hot(y), the gradient of the loss w.r.t. f
df

array([[ 0.250033  ],
       [ 0.24998948],
       [ 0.24998984],
       [-0.75001232]])

In [15]:
dWhf += np.dot(df, hs[0].T) # dL/dV = df . h^T
dWhf

array([[ 0.00242451,  0.00093705,  0.0027998 ],
       [ 0.00242409,  0.00093689,  0.00279931],
       [ 0.00242409,  0.00093689,  0.00279931],
       [-0.00727268, -0.00281084, -0.00839842]])

In [16]:
dbf += df # dL/dbf = df
dbf

array([[ 0.250033  ],
       [ 0.24998948],
       [ 0.24998984],
       [-0.75001232]])

In [17]:
dh = np.dot(Whf.T, df) # backprop into the hidden state: dh = V^T . df
dh

array([[-0.00775322],
       [ 0.0049677 ],
       [ 0.00945306]])

In [18]:
hs[0]

array([[ 0.00969675],
       [ 0.00374772],
       [ 0.01119771]])

In [19]:
1 - hs[0] * hs[0]

array([[ 0.99990597],
       [ 0.99998595],
       [ 0.99987461]])

In [20]:
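dhraw = (1 - hs[0] * hs[0]) * dh # backprop through tanh: tanh'(a) = 1 - tanh(a)^2
print(dhraw)
dbh += dhraw # gradient of the hidden bias
print(dbh)
dhnext = np.dot(Whh.T, dhraw) # gradient passed back to the previous hidden state
print(dhnext)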

[[-0.00775249]
 [ 0.00496763]
 [ 0.00945187]]
[[-0.00775249]
 [ 0.00496763]
 [ 0.00945187]]
[[  1.17645275e-04]
 [ -3.17500938e-05]
 [ -1.51844736e-04]]


In [21]:
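dWxh += np.dot(dhraw, xs[0].T) # only the column of the active input (index 2) is nonzero
print(dWxh)
dWhh += np.dot(dhraw, hs[0-1].T) # all zeros here because the initial hidden state hs[-1] is zero
print(dWhh)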

[[ 0.          0.         -0.00775249  0.        ]
 [ 0.          0.          0.00496763  0.        ]
 [ 0.          0.          0.00945187  0.        ]]
[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]
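
The cells above handle a single time step. A sketch of how the same steps could be wrapped into one function over a longer sequence, with `dhnext` carried backwards between steps, is given below. It is an extension assumed for illustration: the function name `loss_and_grads`, the toy sequence, the 0.1 weight scale, and the finite-difference check are not part of the notebook.

```python
import numpy as np

def loss_and_grads(inputs, targets, hprev, Wxh, Whh, Whf, bh, bf):
    """Forward and backward pass over a sequence of class indices."""
    vocab_size = Wxh.shape[1]
    xs, hs, ps = {}, {}, {}
    hs[-1] = np.copy(hprev)
    loss = 0.0
    # forward pass: the single-step cells, repeated per time step
    for t in range(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t - 1]) + bh)
        f = np.dot(Whf, hs[t]) + bf
        ps[t] = np.exp(f) / np.sum(np.exp(f))
        loss += -np.log(ps[t][targets[t], 0])
    # backward pass: walk the sequence in reverse, carrying dhnext
    dWxh, dWhh, dWhf = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Whf)
    dbh, dbf = np.zeros_like(bh), np.zeros_like(bf)
    dhnext = np.zeros_like(hprev)
    for t in reversed(range(len(inputs))):
        df = np.copy(ps[t])
        df[targets[t]] -= 1
        dWhf += np.dot(df, hs[t].T)
        dbf += df
        dh = np.dot(Whf.T, df) + dhnext    # from the output and from step t+1
        dhraw = (1 - hs[t] * hs[t]) * dh   # back through tanh
        dbh += dhraw
        dWxh += np.dot(dhraw, xs[t].T)
        dWhh += np.dot(dhraw, hs[t - 1].T)
        dhnext = np.dot(Whh.T, dhraw)
    return loss, (dWxh, dWhh, dWhf, dbh, dbf)

np.random.seed(0)
hidden_size, vocab_size = 3, 4
params = [np.random.randn(hidden_size, vocab_size) * 0.1,   # Wxh
          np.random.randn(hidden_size, hidden_size) * 0.1,  # Whh
          np.random.randn(vocab_size, hidden_size) * 0.1,   # Whf
          np.zeros((hidden_size, 1)),                       # bh
          np.zeros((vocab_size, 1))]                        # bf
inputs, targets = [2, 0, 1], [3, 1, 2]   # made-up toy sequence
hprev = np.zeros((hidden_size, 1))

loss, grads = loss_and_grads(inputs, targets, hprev, *params)

# finite-difference check on every parameter entry
eps, max_err = 1e-5, 0.0
for param, grad in zip(params, grads):
    for idx in np.ndindex(*param.shape):
        old = param[idx]
        param[idx] = old + eps
        loss_plus = loss_and_grads(inputs, targets, hprev, *params)[0]
        param[idx] = old - eps
        loss_minus = loss_and_grads(inputs, targets, hprev, *params)[0]
        param[idx] = old
        max_err = max(max_err, abs((loss_plus - loss_minus) / (2 * eps) - grad[idx]))
print(loss, max_err)
```

With a one-element sequence and a zero `hprev`, this reduces to the step-by-step computation in the cells above.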


In the code, the variable names map to the symbols of the model description as follows:

Whh ==> $W$

Wxh ==> $U$

Whf ==> $V$

## References ##
1. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
2. https://gist.github.com/karpathy/d4dee566867f8291f086
3. http://cs231n.github.io/neural-networks-case-study/#grad
4. https://www.zhihu.com/question/27239198?rf=24827633
5. http://colah.github.io/posts/2015-08-Backprop/
6. http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/