In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

#### Representation of Time Sequence
<font size = 2>
    
The image below illustrates how a batch of text is fed to the network:
    
3 sentences of 100 words each, where every word is represented by a 50-dimensional vector.
    
This input can be represented as a tensor of shape:
    
    [word_num_in_single_sentence, sentence_num, word_vector] = [100,3,50]
    
This layout reflects reading several sentences simultaneously: each sentence is read one word at a time, but all sentences are processed in parallel as a batch. Since exactly one word per sentence is read at each time step, **word_num_in_single_sentence** plays the role of **time** here.
    
<div>
<img src = "TimeSequence.png" style = "zoom:30%"/>
</div>
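
The layout described above can be sketched in a few lines. This is a minimal illustration that assumes random values stand in for real word vectors; the variable names `seq_len`, `batch` and `feature_len` are chosen here for readability.

```python
import torch

seq_len, batch, feature_len = 100, 3, 50

# random stand-in for 3 sentences of 100 words, each word a 50-dim vector
x = torch.randn(seq_len, batch, feature_len)

# indexing the first dimension gives the words read at one time step,
# one word per sentence
x_t = x[0]

print(x.shape)    # torch.Size([100, 3, 50])
print(x_t.shape)  # torch.Size([3, 50])
```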

#### RNN Structure
<font size = 2>
    
The **weights** and **biases** of an RNN are shared across all time steps: the input weights $W_{xh}$ and the memory (hidden-state) weights $W_{hh}$ (biases are omitted here).

The initial memory $h_{0}$ is set to 0. At each time step $t$ the memory $h_{t}$ is updated as:
    
$$
\begin{equation}
\begin{aligned}
h_{t} &= f_{w} (x_{t}@W_{xh} + h_{t-1}@W_{hh}) \\
&= tanh(x_{t}@W_{xh} + h_{t-1}@W_{hh})
\end{aligned}
\end{equation}
$$
    
where $tanh()$ is the activation function used in the RNN. The updated $h_{t}$ then serves as the incoming memory for the next time step.

<div>
<img src = 'RNN_1.png' style = 'zoom:40%'/>
</div>
    
    
Each time step $t$ also produces its own output $y_{t}$ through the output weights $W_{hy}$:
    
$$ y_{t} = h_{t}@W_{hy} $$
    
<div>
<img src = 'RNN_2.png' style = 'zoom:40%'/>
</div>
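
The two update rules above can be unrolled by hand. The following is only a sketch under stated assumptions: the weights are small random matrices, biases are omitted as in the text, and `out_len` (the size of $y_{t}$) is a name introduced here for illustration.

```python
import torch

torch.manual_seed(0)
seq_len, batch, feature_len, hidden_len, out_len = 5, 3, 20, 10, 4

x = torch.randn(seq_len, batch, feature_len)
W_xh = torch.randn(feature_len, hidden_len) * 0.1  # input weights
W_hh = torch.randn(hidden_len, hidden_len) * 0.1   # memory weights
W_hy = torch.randn(hidden_len, out_len) * 0.1      # output weights

h = torch.zeros(batch, hidden_len)  # h_0 = 0
ys = []
for x_t in x:
    # the same W_xh and W_hh are reused at every time step
    h = torch.tanh(x_t @ W_xh + h @ W_hh)
    ys.append(h @ W_hy)
y = torch.stack(ys)

print(h.shape)  # torch.Size([3, 10])
print(y.shape)  # torch.Size([5, 3, 4])
```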

#### Gradients of RNN
<font size = 2>
    
How do we obtain the gradient of the error with respect to the memory weights $W_{hh}$?
    
$$
\begin{equation}
\begin{aligned}
grad(W_{hh}) &= \frac{\partial{E_{t}}}{\partial{W_{hh}}} \\
&= \frac{\partial{E_{t}}}{\partial{W^{0}_{hh}}} + \frac{\partial{E_{t}}}{\partial{W^{1}_{hh}}} + ... + \frac{\partial{E_{t}}}{\partial{W^{t}_{hh}}} \\
&= \sum^{t}_{i=0} \frac{\partial{E_{t}}}{\partial{W^{i}_{hh}}}
\end{aligned}
\end{equation}
$$
    
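Because $W_{hh}$ is shared, its use at each time step $i$ is treated as a separate copy $W^{i}_{hh}$; the gradient is computed with respect to each copy and the results are summed. $E_{t}$ is the error function at time $t$.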
    
<div>
<img src = 'RNN_3.png' style = 'zoom:40%'>
</div>
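
The "sum over copies" idea can be checked numerically with autograd. This is a small sketch, not a general proof: it assumes the error is a squared error taken directly on the last memory (so $W_{hy}$ is dropped), and the helper `final_error` is a hypothetical name introduced here.

```python
import torch

torch.manual_seed(0)
T, batch, feature_len, hidden_len = 4, 3, 6, 5
x = torch.randn(T, batch, feature_len)
target = torch.randn(batch, hidden_len)
W_xh = torch.randn(feature_len, hidden_len) * 0.3
W_hh_value = torch.randn(hidden_len, hidden_len) * 0.3

def final_error(W_hh_per_step):
    """Unroll the RNN with the given W_hh at each step; return the last-step error."""
    h = torch.zeros(batch, hidden_len)
    for x_t, W in zip(x, W_hh_per_step):
        h = torch.tanh(x_t @ W_xh + h @ W)
    return 0.5 * ((h - target) ** 2).sum()

# (a) one shared weight matrix, as in a real RNN
W_shared = W_hh_value.clone().requires_grad_(True)
final_error([W_shared] * T).backward()

# (b) one independent copy per time step, all holding the same values
W_copies = [W_hh_value.clone().requires_grad_(True) for _ in range(T)]
final_error(W_copies).backward()
grad_sum = sum(W.grad for W in W_copies)

# the shared gradient equals the sum of the per-copy gradients
print(torch.allclose(W_shared.grad, grad_sum, atol=1e-6))
```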
    
Recall the formulas for the current memory $h_{t}$ and the output $y_{t}$ from the previous section:
    
$$
\begin{equation}
\begin{aligned}
h_{t} &= f_{w} (x_{t}@W_{xh} + h_{t-1}@W_{hh}) \\
&= tanh(x_{t}@W_{xh} + h_{t-1}@W_{hh})
\end{aligned}
\end{equation}
$$
    
$$ y_{t} = h_{t}@W_{hy} $$

Applying the chain rule to these, we get:
    
$$ \frac{\partial{E_{t}}}{\partial{W_{hh}}} = \sum^{t}_{i=0} \frac{\partial{E_{t}}}{\partial{y_{t}}} \frac{\partial{y_{t}}}{\partial{h_{t}}} \frac{\partial{h_{t}}}{\partial{h_{i}}} \frac{\partial{h_{i}}}{\partial{W_{hh}}} $$
  
Each factor in the formula above can be obtained separately:

1) The gradient of the error function $E_{t}$ at time step $t$ w.r.t. the current output $y_{t}$:
    
(For example, taking the squared error as the error function)
    
$$ \frac{\partial{E_{t}}}{\partial{y_{t}}} =  \frac{\partial{\frac{1}{2} (y_{t} - target)^{2}}}{\partial{y_{t}}} = y_{t} - target $$
    
2) The gradient of current output $y_{t}$ w.r.t current memory $h_{t}$:
    
$$ \frac{\partial{y_{t}}}{\partial{h_{t}}} = W_{hy} $$
    
3) The gradient of the current memory $h_{t}$ w.r.t. an earlier memory $h_{i}$ at time step $i$:
    
$$
\begin{equation}
\begin{aligned}
\frac{\partial{h_{t}}}{\partial{h_{i}}} &= \frac{\partial{h_{t}}}{\partial{h_{t-1}}} \frac{\partial{h_{t-1}}}{\partial{h_{t-2}}} ... \frac{\partial{h_{i+1}}}{\partial{h_{i}}} \\
&= \prod^{t-1}_{k=i} \frac{\partial{h_{k+1}}}{\partial{h_{k}}} \\
&= \prod^{t-1}_{k=i} \frac{\partial{f_{w} (x_{k+1}@W_{xh} + h_{k}@W_{hh})}}{\partial{h_{k}}} \\
&= \prod^{t-1}_{k=i} \frac{\partial{f_{w} (x_{k+1}@W_{xh} + h_{k}@W_{hh})}}{\partial{(x_{k+1}@W_{xh} + h_{k}@W_{hh})}} \frac{\partial{(x_{k+1}@W_{xh} + h_{k}@W_{hh})}}{\partial{h_{k}}} \\
&= \prod^{t-1}_{k=i} diag(f^{\prime}_{w} (x_{k+1}@W_{xh} + h_{k}@W_{hh})) W_{hh}
\end{aligned}
\end{equation}
$$
    
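Here $diag()$ places the element-wise derivative of the activation on the diagonal of a matrix; for $tanh$ this derivative is $1 - tanh^{2}()$. Whether $W_{hh}$ or its transpose appears in the product depends on the row- or column-vector convention used for the Jacobian.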
    
4) The gradient of the earlier memory $h_{i}$ at time step $i$ w.r.t. the memory weights $W_{hh}$ (with $h_{i-1}$ held fixed):
    
$$
\begin{equation}
\begin{aligned}
\frac{\partial{h_{i}}}{\partial{W_{hh}}} &= \frac{\partial{f_{w} (x_{i}@W_{xh} + h_{i-1}@W_{hh})}}{\partial{W_{hh}}} \\
&= \frac{\partial{f_{w} (x_{i}@W_{xh} + h_{i-1}@W_{hh})}}{\partial{(x_{i}@W_{xh} + h_{i-1}@W_{hh})}} \frac{\partial{(x_{i}@W_{xh} + h_{i-1}@W_{hh})}}{\partial{W_{hh}}} \\
&= f^{\prime}_{w} (x_{i}@W_{xh} + h_{i-1}@W_{hh}) h_{i-1}
\end{aligned}
\end{equation}
$$
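
The one-step factor in item 3 can be compared against autograd. This is a hedged sketch for a single, unbatched memory vector with random weights. Because the code uses the row-vector form `h @ W_hh`, the Jacobian indexed as `J[a, b] = d h_t[a] / d h_{t-1}[b]` comes out as `diag(1 - h_t**2) @ W_hh.T`; the transpose is a consequence of that indexing choice.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
feature_len, hidden_len = 6, 5
x_t = torch.randn(feature_len)
h_prev = torch.randn(hidden_len)
W_xh = torch.randn(feature_len, hidden_len) * 0.3
W_hh = torch.randn(hidden_len, hidden_len) * 0.3

def step(h):
    # one RNN update without bias, as in the text
    return torch.tanh(x_t @ W_xh + h @ W_hh)

# Jacobian computed by autograd: J[a, b] = d h_t[a] / d h_{t-1}[b]
J_autograd = jacobian(step, h_prev)

# closed form: derivative of tanh is 1 - tanh^2, placed on a diagonal
h_t = step(h_prev)
J_manual = torch.diag(1 - h_t ** 2) @ W_hh.T

print(torch.allclose(J_autograd, J_manual, atol=1e-6))
```

Multiplying such Jacobians over many time steps gives $\frac{\partial{h_{t}}}{\partial{h_{i}}}$ as in the product formula above.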

#### Shape of Instances
<font size = 2>
    
According to section **Representation of Time Sequence**, the input $x$ contains information about:
    
    [word_num_in_single_sentence, sentence_num, word_vector] = [Seq_len, batch, feature_len]
    
Slicing out the input $x_{t}$ at time step $t$, i.e. the words read at that step across all sentences in parallel, gives the shape:
    
    [sentence_num, word_vector] = [batch, feature_len]
    
According to section **RNN Structure**, the relationship between the current memory $h_{t}$ and the previous memory $h_{t-1}$ is:
    
$$ h_{t} = tanh(x_{t}@W_{xh} + h_{t-1}@W_{hh}) $$
    
The initial memory $h_{0}$ is set to the zero vector **0** with shape
    
    [batch, hidden_len]
    
which is also the shape of every later memory. To maintain this, the terms above must have the following shapes (PyTorch stores the weight matrices as [hidden_len, feature_len] and [hidden_len, hidden_len], hence the transposes):
    
$x_{t}@W_{xh}$:
    
    [batch, feature_len] @ [hidden_len, feature_len].T = [batch, hidden_len]
    
$h_{t-1}@W_{hh}$:
    
    [batch, hidden_len] @ [hidden_len, hidden_len].T = [batch, hidden_len]
    
$h_{t} = tanh(x_{t}@W_{xh} + h_{t-1}@W_{hh})$:
    
    [batch, hidden_len] + [batch, hidden_len] = [batch, hidden_len]
    
The element-wise $tanh$ does not change the shape, so the result has the same shape as the initial memory $h_{0}$.
    
<div>
<img src = 'RNN_4.png' style = 'zoom:40%'/>
</div>
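
The shape bookkeeping above can be confirmed with plain tensors. This sketch assumes weights stored in the transposed layout `[hidden_len, feature_len]` and `[hidden_len, hidden_len]`, with random values; the intermediate names `a` and `b` are introduced here.

```python
import torch

batch, feature_len, hidden_len = 3, 20, 10
x_t = torch.randn(batch, feature_len)
h_prev = torch.zeros(batch, hidden_len)       # h_0
W_xh = torch.randn(hidden_len, feature_len)   # stored as [hidden_len, feature_len]
W_hh = torch.randn(hidden_len, hidden_len)    # stored as [hidden_len, hidden_len]

a = x_t @ W_xh.T        # [batch, hidden_len]
b = h_prev @ W_hh.T     # [batch, hidden_len]
h_t = torch.tanh(a + b) # tanh is element-wise, shape unchanged

print(a.shape, b.shape, h_t.shape)
```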

#### Single RNN Layer

In [2]:
#create RNN instance
#para: input_size  -> feature_len
#para: hidden_size -> hidden_len
#para: num_layers  -> number of stacked RNN layers, default = 1
rnn = nn.RNN(input_size = 20,hidden_size = 10, num_layers = 1)
#check the parameters in RNN
paras = rnn._parameters.keys()
print(paras)
print()
#RNN can have multiple layers; the suffix l0 denotes layer 0
#weight_ih_l0: weights for input W_ih on layer l_0 -> [hidden_len, feature_len]
print(rnn.weight_ih_l0.shape)
#weight_hh_l0: weights for memory W_hh on layer l_0 -> [hidden_len, hidden_len]
print(rnn.weight_hh_l0.shape)
#bias_ih_l0: bias for input b_ih on layer l_0 -> [hidden_len]
print(rnn.bias_ih_l0.shape)
#bias_hh_l0: bias for memory b_hh on layer l_0 -> [hidden_len]
print(rnn.bias_hh_l0.shape)

odict_keys(['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0'])

torch.Size([10, 20])
torch.Size([10, 10])
torch.Size([10])
torch.Size([10])
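
To see how these four parameters are used, the forward pass of a single-layer `nn.RNN` can be reproduced by hand. This is a sketch that assumes the default `tanh` nonlinearity and a zero initial memory; the input `x` is random and its sizes are chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=20, hidden_size=10, num_layers=1)
x = torch.randn(5, 3, 20)  # [seq_len, batch, feature_len]

with torch.no_grad():
    out, h_n = rnn(x)

    # manual unrolling with the layer's own weights and biases
    h = torch.zeros(3, 10)
    for x_t in x:
        h = torch.tanh(x_t @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
                       + h @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)

# the hand-computed last memory matches the one returned by nn.RNN
print(torch.allclose(h, h_n[0], atol=1e-5))
```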


#### Multi-Layer RNN

In [3]:
#create an RNN instance with num_layers = 2
#stack 2 RNN layers together
#2nd RNN layer take the output of 1st RNN layer as input, and compute the result
rnn2 = nn.RNN(input_size = 20,hidden_size = 10, num_layers = 2)
paras = rnn2._parameters.keys()
print(paras)
print()
#weight_ih_l0: weights for input W_ih on layer l_0 -> [hidden_len, feature_len]
print(rnn2.weight_ih_l0.shape)
#weight_hh_l0: weights for memory W_hh on layer l_0 -> [hidden_len, hidden_len]
print(rnn2.weight_hh_l0.shape)
#bias_ih_l0: bias for input b_ih on layer l_0 -> [hidden_len]
print(rnn2.bias_ih_l0.shape)
#bias_hh_l0: bias for memory b_hh on layer l_0 -> [hidden_len]
print(rnn2.bias_hh_l0.shape)
print()
#weight_ih_l1: weights for input W_ih on layer l_1 -> [hidden_len, hidden_len]
#due to 2nd layer's input = 1st layer's output
#the feature_len of 2nd RNN is length of 1st layer's output
print(rnn2.weight_ih_l1.shape)
#weight_hh_l1: weights for memory W_hh on layer l_1 -> [hidden_len, hidden_len]
print(rnn2.weight_hh_l1.shape)
#bias_ih_l1: bias for input b_ih on layer l_1 -> [hidden_len]
print(rnn2.bias_ih_l1.shape)
#bias_hh_l1: bias for memory b_hh on layer l_1 -> [hidden_len]
print(rnn2.bias_hh_l1.shape)

odict_keys(['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0', 'weight_ih_l1', 'weight_hh_l1', 'bias_ih_l1', 'bias_hh_l1'])

torch.Size([10, 20])
torch.Size([10, 10])
torch.Size([10])
torch.Size([10])

torch.Size([10, 10])
torch.Size([10, 10])
torch.Size([10])
torch.Size([10])


#### Output of RNN

In [4]:
#input: [word_num_in_single_sentence, sentence_num, word_vector]
#       [seq_len, b, feature_len]
#i.e. 3 sentences, each sentence has 5 words, each word is represented by a 20-dimensional vector
x = torch.randn(5,3,20)
#out: hidden states of the last layer at every time step
#h:   hidden state of every layer at the last time step
#rnn() takes the input x and an optional initial memory h_0; h_0 defaults to zeros when omitted
out_rnn1, h_rnn1 = rnn(x)
#out: [seq_len, b, hidden_len] -> stack[h1,h2,...,ht] of the last layer
print(out_rnn1.shape)
#h: [num_layers, b, hidden_len]
print(h_rnn1.shape)
print()

out_rnn2, h_rnn2 = rnn2(x)
print(out_rnn2.shape)
print(h_rnn2.shape)

torch.Size([5, 3, 10])
torch.Size([1, 3, 10])

torch.Size([5, 3, 10])
torch.Size([2, 3, 10])
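
The relationship between the two return values can be checked directly. This is a short sketch using a fresh two-layer `nn.RNN` and random input: the last time step of `out` should coincide with the last layer's entry in `h_n`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn2 = nn.RNN(input_size=20, hidden_size=10, num_layers=2)
x = torch.randn(5, 3, 20)  # [seq_len, batch, feature_len]
out, h_n = rnn2(x)

# out: top layer's memory at every time step   -> [seq_len, batch, hidden_len]
# h_n: every layer's memory at the last step   -> [num_layers, batch, hidden_len]
print(out.shape, h_n.shape)
print(torch.allclose(out[-1], h_n[-1]))
```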


#### nn.RNNCell
<font size = 2>
    
The class **nn.RNN()** expects the whole sequence as a single input with shape:
    
    [word_num_in_single_sentence, sentence_num, word_vector]
    =
    [Seq_len, batch, feature_len]
    
However, we can also feed the input manually, one time step at a time, i.e. pass an input of shape:
    
    [sentence_num, word_vector]
    
**word_num_in_single_sentence** times.
    
This replaces the internal recurrence loop of the RNN with an explicit loop written by the user, as illustrated below:
    
<div>
<img src = 'RNN_5.png' style = 'zoom:40%'/>
</div>

#### Single RNNCell Layer

In [8]:
#the initialization of RNNCell is the same as RNN's, except that there is no num_layers argument
#as a result, building a multi-layer RNN from RNNCell requires instantiating one RNNCell per layer
#para: input_size -> feature_len
#para: hidden_size -> hidden_len
cell1 = nn.RNNCell(input_size = 100, hidden_size = 30)
cell_paras1 = cell1._parameters.keys()
print(cell_paras1)
#create input:
#5 sentences, 10 words in each sentence, each word is represented by a 100-dimensional vector (feature_len)
cell_input_x = torch.randn(10,5,100)
#initialize memory
tmp_h_1 = torch.randn(5,30)
#iterate over the 0th dimension: 'Seq_len' (time step) in [Seq_len, batch, feature_len]
for x_t in cell_input_x:
    #RNNCell must be fed manually, one time step at a time
    #the output of each call is the memory after that single time step
    tmp_h_1 = cell1(x_t, tmp_h_1)
print(tmp_h_1.shape)

odict_keys(['weight_ih', 'weight_hh', 'bias_ih', 'bias_hh'])
torch.Size([5, 30])
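
One way to confirm that the manual loop is equivalent to `nn.RNN` is to copy the parameters of a single-layer RNN into an `nn.RNNCell` and compare the results. This is a sketch with random input and a zero initial memory; stacking the per-step memories with `torch.stack` is one possible way to rebuild the `out` tensor.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=100, hidden_size=30, num_layers=1)
cell = nn.RNNCell(input_size=100, hidden_size=30)

# copy the layer-0 parameters of the RNN into the cell
with torch.no_grad():
    cell.weight_ih.copy_(rnn.weight_ih_l0)
    cell.weight_hh.copy_(rnn.weight_hh_l0)
    cell.bias_ih.copy_(rnn.bias_ih_l0)
    cell.bias_hh.copy_(rnn.bias_hh_l0)

x = torch.randn(10, 5, 100)  # [seq_len, batch, feature_len]
h = torch.zeros(5, 30)       # zero initial memory, as nn.RNN uses by default
outs = []
with torch.no_grad():
    out_rnn, h_rnn = rnn(x)
    for x_t in x:
        h = cell(x_t, h)
        outs.append(h)
out_cell = torch.stack(outs)  # [seq_len, batch, hidden_len]

print(torch.allclose(out_cell, out_rnn, atol=1e-5))
```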


#### Multiple RNNCell Layers

In [9]:
#manually instantiate 2 RNN layers
cell1 = nn.RNNCell(input_size = 100, hidden_size = 30)
#pay attention to the shape of 2nd layer
#because the output of 1st layer will be input of 2nd layer
cell2 = nn.RNNCell(input_size = 30, hidden_size = 20)
cell_paras1 = cell1._parameters.keys()
cell_paras2 = cell2._parameters.keys()
print(cell_paras1)
print(cell_paras2)
#create input:
#5 sentences, 10 words in each sentence, each word is represented by a 100-dimensional vector (feature_len)
cell_input_x = torch.randn(10,5,100)
#initialize memory
#each shape must match the hidden_size of its own layer
tmp_h_1 = torch.randn(5,30)
tmp_h_2 = torch.randn(5,20)
for x_t in cell_input_x:
    tmp_h_1 = cell1(x_t,tmp_h_1)
    #output of previous layer is input of current layer
    tmp_h_2 = cell2(tmp_h_1, tmp_h_2)
print(tmp_h_1.shape)
print(tmp_h_2.shape)

odict_keys(['weight_ih', 'weight_hh', 'bias_ih', 'bias_hh'])
odict_keys(['weight_ih', 'weight_hh', 'bias_ih', 'bias_hh'])
torch.Size([5, 30])
torch.Size([5, 20])
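
The two-cell loop can also be packaged as a module. The class below is a hypothetical wrapper written for illustration (the name `TwoLayerCellRNN` and its arguments are not part of PyTorch); it assumes zero initial memories and returns the stacked top-layer memories together with the final memory of each layer.

```python
import torch
import torch.nn as nn

class TwoLayerCellRNN(nn.Module):
    """Hypothetical wrapper: two stacked RNNCells unrolled over time."""

    def __init__(self, feature_len, hidden_len_1, hidden_len_2):
        super().__init__()
        self.cell1 = nn.RNNCell(feature_len, hidden_len_1)
        # the 2nd layer's input size is the 1st layer's hidden size
        self.cell2 = nn.RNNCell(hidden_len_1, hidden_len_2)

    def forward(self, x):
        # x: [seq_len, batch, feature_len]
        batch = x.shape[1]
        h1 = x.new_zeros(batch, self.cell1.hidden_size)
        h2 = x.new_zeros(batch, self.cell2.hidden_size)
        outs = []
        for x_t in x:
            h1 = self.cell1(x_t, h1)
            h2 = self.cell2(h1, h2)
            outs.append(h2)
        # stacked top-layer memories, plus the last memory of each layer
        return torch.stack(outs), (h1, h2)

model = TwoLayerCellRNN(100, 30, 20)
out, (h1, h2) = model(torch.randn(10, 5, 100))
print(out.shape, h1.shape, h2.shape)
```

Unlike `nn.RNN` with `num_layers = 2`, this arrangement allows each layer to have a different hidden size, as in the cell above.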


In [26]:
import numpy as np
start = np.random.randint(3, size = 1)[0]
print(start)

2


In [31]:
import torch
b = torch.zeros(1,1,4)
x = b.shape[-1]
print(x, type(x))

4 <class 'int'>


In [35]:
c = np.arange(0,10,1).reshape(2,5)
c = c.ravel()
print(c)

[0 1 2 3 4 5 6 7 8 9]


In [43]:
d = torch.randn(1,1,1).numpy()
print(d)
d = d.ravel()
print(d)

[[[-0.54509634]]]
[-0.54509634]
