In [1]:
import numpy as np
import torch
import torch.nn as nn
torch.__version__

'1.5.0'

## Recurrent Neural Networks

#### Sequence Learning

* Input not independent
    - e.g. auto completion
* Input may not be of fixed size

* Processing a sequence requires the ability to "remember" previous events in the sequence

* RNNs unfold in time to process events that occur in a sequence

![](RNN.png)
$$\text{Figure 1. RNN unfolding in time}$$

#### Output

$$y_i = f(x_i,s_{i-1},W,U,V,b,c)$$

i - time step  
$x_i$ - input  
$s_i$ - hidden states  
U,W,V - Weight Matrices  
b,c - bias  

#### Activation 

<div style="font-size: 125%;">
$$s_i = \sigma(U x_i + W s_{i-1} + b)\text{ or } \tanh(U x_i + W s_{i-1} + b)$$
$$y_i = \text{Softmax}(V s_i + c)$$
</div>
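
The activation and output equations can be sketched as a single time step in NumPy. This is a minimal illustration, not the notebook's model: it reads $U x_i$, $W s_{i-1}$ and $V s_i$ as matrix-vector products, and the sizes `n`, `d`, `k` and the random weights are assumptions made up for the example.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_i, s_prev, U, W, V, b, c):
    """One RNN time step: new hidden state and output distribution."""
    s_i = np.tanh(U @ x_i + W @ s_prev + b)   # hidden state update
    y_i = softmax(V @ s_i + c)                # output probabilities
    return s_i, y_i

# Hypothetical sizes: n inputs, d hidden units, k output classes
n, d, k = 3, 5, 2
rng = np.random.default_rng(0)
U, W, V = rng.normal(size=(d, n)), rng.normal(size=(d, d)), rng.normal(size=(k, d))
b, c = np.zeros(d), np.zeros(k)

s = np.zeros(d)                               # initial hidden state
for x in rng.normal(size=(4, n)):             # a sequence of 4 inputs
    s, y = rnn_step(x, s, U, W, V, b, c)
print(s.shape, y.shape, y.sum())
```

The same `rnn_step` is called at every position of the sequence; only the hidden state `s` changes from one step to the next.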


    
#### Why Recurrent?  

* RNNs perform the same function for every element of a sequence

* A loop allows information to be passed from one time step of the network to the next.

* The output of a unit is dependent on its input and on the previous computations
    - RNNs have a memory for the computations performed on earlier time steps


#### RNN Architectures & Applications

![](RNN_Configs.png)
$$\text{Figure 2. RNN Architectures}$$

* One-to-many: Picture captions
    - Input: Picture
    - Output: Words in captions
* Many-to-one: Sentiment Analysis
    - Input: Sentence(s)
    - Output: Sentiment score (e.g. 90% Positive)
* Many-to-many: Translation
    - Input: Source language sentence
    - Output: Target language sentence
* Many-to-many (synced): Video Classification
    - Input: Movie frame
    - Output: Label

#### RNN layer

https://pytorch.org/docs/stable/generated/torch.nn.RNN.html

In [3]:

# N = number of samples
# T = sequence length
# D = number of input features
# M = number of hidden units
# K = number of output units


N = 1
T = 10
D = 3
M = 5
K = 2
X = np.random.randn(N, T, D)
X.shape

(1, 10, 3)

In [4]:
class SimpleRNN(nn.Module):
  def __init__(self, n_inputs, n_hidden, n_outputs):
    super(SimpleRNN, self).__init__()
    self.D = n_inputs
    self.M = n_hidden
    self.K = n_outputs
    self.rnn = nn.RNN(
        input_size=self.D,
        hidden_size=self.M,
        nonlinearity='tanh',
        batch_first=True)
    self.fc = nn.Linear(self.M, self.K)
  
  def forward(self, X):
    # initial hidden states
    h0 = torch.zeros(1, X.size(0), self.M)

    # get RNN unit output
    out, _ = self.rnn(X, h0)
    out = self.fc(out)
    return out

In [5]:
model = SimpleRNN(n_inputs=D, n_hidden=M, n_outputs=K)

In [6]:
# Run Forward
inputs = torch.from_numpy(X.astype(np.float32))
out = model(inputs)
out

tensor([[[0.3739, 0.6823],
         [0.2691, 0.2821],
         [0.3581, 0.5325],
         [0.2421, 0.5692],
         [0.3927, 0.7252],
         [0.2474, 0.5856],
         [0.1892, 0.6481],
         [0.3925, 0.5913],
         [0.1869, 0.2363],
         [0.3411, 0.2840]]], grad_fn=<AddBackward0>)

In [7]:
out.shape

torch.Size([1, 10, 2])
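
The model above returns one output per time step (the synced many-to-many case). A many-to-one variant, as used for sentiment analysis, could keep only the last time step before the linear layer. The sketch below is an assumption about how one might write it; the class name `ManyToOneRNN` and the batch of 4 random sequences are hypothetical.

```python
import torch
import torch.nn as nn

class ManyToOneRNN(nn.Module):
    def __init__(self, n_inputs, n_hidden, n_outputs):
        super().__init__()
        self.rnn = nn.RNN(input_size=n_inputs, hidden_size=n_hidden,
                          nonlinearity='tanh', batch_first=True)
        self.fc = nn.Linear(n_hidden, n_outputs)

    def forward(self, X):
        out, _ = self.rnn(X)          # out: (N, T, M); h0 defaults to zeros
        return self.fc(out[:, -1])    # keep only the last time step: (N, K)

many_to_one = ManyToOneRNN(n_inputs=3, n_hidden=5, n_outputs=2)
scores = many_to_one(torch.randn(4, 10, 3))   # 4 sequences of length 10
print(scores.shape)
```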

In [8]:
Yhats_torch = out.detach().numpy()

In [11]:
W_xh, W_hh, b_xh, b_hh = model.rnn.parameters()
W_xh.shape, W_hh.shape, b_xh.shape, b_hh.shape

(torch.Size([5, 3]), torch.Size([5, 5]), torch.Size([5]), torch.Size([5]))

In [10]:
W_xh.shape

torch.Size([5, 3])

In [13]:
print(type(W_xh))
W_xh

<class 'torch.nn.parameter.Parameter'>


Parameter containing:
tensor([[ 0.2622, -0.1990, -0.0104],
        [ 0.1048, -0.2302,  0.3343],
        [ 0.4376,  0.1810, -0.0932],
        [ 0.1215, -0.1830,  0.4030],
        [ 0.1614,  0.2495, -0.2871]], requires_grad=True)

In [14]:
W_xh = W_xh.data.numpy()
W_xh

array([[ 0.26219094, -0.19896993, -0.01044697],
       [ 0.10477024, -0.23018256,  0.3342694 ],
       [ 0.4376182 ,  0.18097007, -0.093178  ],
       [ 0.1214689 , -0.18295664,  0.40295953],
       [ 0.1613673 ,  0.24948424, -0.2871253 ]], dtype=float32)

In [16]:
b_xh = b_xh.data.numpy()
W_hh = W_hh.data.numpy()
b_hh = b_hh.data.numpy()

W_xh.shape, b_xh.shape, W_hh.shape, b_hh.shape

((5, 3), (5,), (5, 5), (5,))

In [17]:

Wo, bo = model.fc.parameters() # FC layer weights

In [18]:
Wo = Wo.data.numpy()
bo = bo.data.numpy()
Wo.shape, bo.shape

((2, 5), (2,))

In [29]:
# Replicate the output
h_last = np.zeros(M) # initial hidden state
x = X[0] 
Yhats = np.zeros((T, K)) 

for t in range(T):
  h = np.tanh(x[t].dot(W_xh.T) + b_xh + h_last.dot(W_hh.T) + b_hh)
  y = h.dot(Wo.T) + bo 
  Yhats[t] = y
  h_last = h

Yhats

[[ 0.20528887 -0.79961925]
 [-0.15380444 -0.44783627]
 [ 0.3585217  -0.71043858]
 [-0.13168303 -0.52131308]
 [ 0.31756821 -0.24142816]
 [ 0.03747482 -0.67827091]
 [ 0.07696692 -0.52061347]
 [ 0.18046382 -0.27259799]
 [ 0.37582838 -0.95786155]
 [-0.02171963 -0.37523845]]


In [30]:
np.allclose(Yhats, Yhats_torch)

True

In [34]:
h.shape,Wo.T.shape,bo.shape

((5,), (5, 2), (2,))

### Back Propagation Through Time (BPTT)

#### Examples using autocompletion

![](RNN2.png)
![](BPTT1.png)

$$\text{Figure 3. Autocompletion examples}$$

#### Parameter $\theta$ dimensions  

Inputs: $x_i \in R^n$  
States: $s_i \in R^d$  
Classes: $y_i \in R^k$  
Input weights: $U \in R^{d \times n}$  
Hidden weights: $W \in R^{d \times d}$  
Output weights: $V \in R^{k \times d}$  

#### Loss function: L

<div style="font-size: 110%;">
$$ L(\theta) = \sum_{t = 1}^T {L_t(\theta)}$$
</div>
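
As a small illustration of the summed loss, the sketch below assumes that each $L_t$ is a cross-entropy loss on the softmax output at time step $t$ (the notebook does not fix the per-step loss). The logits and targets are made-up values.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with max subtraction for numerical stability
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T, K = 4, 3
rng = np.random.default_rng(0)
logits = rng.normal(size=(T, K))       # hypothetical pre-softmax outputs per step
targets = np.array([0, 2, 1, 1])       # hypothetical true class at each step

probs = softmax(logits)
step_losses = -np.log(probs[np.arange(T), targets])   # L_t for t = 1..T
total_loss = step_losses.sum()                         # L = sum_t L_t
print(step_losses, total_loss)
```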


#### Compute gradient of $L(\theta)$ wrt $\theta$ = W,U,V

* Derivative of the loss wrt the weights in the input and output layers:

$$\frac{\partial L(\theta)}{\partial U} =  \sum_{t = 1}^T \frac{\partial L_t(\theta)}{\partial U}$$

$$\frac{\partial L(\theta)}{\partial V} =  \sum_{t = 1}^T \frac{\partial L_t(\theta)}{\partial V}$$

* Derivative of the loss wrt the hidden-layer weights, which accumulates contributions from every previous time step

![](BPTT3.png)

<div style="font-size: 110%;">
$$\frac{\partial L(\theta)}{\partial W} =  \sum_{t = 1}^T \frac{\partial L_t(\theta)}{\partial W}$$

$$\frac{\partial L_t(\theta)}{\partial W} = \frac{\partial L_t(\theta)}{\partial s_t} \sum_{k=1}^t\frac{\partial s_t}{\partial s_k}\frac{\partial^+s_k}{\partial W}$$

</div>
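
The decomposition of the gradient into per-time-step terms can be checked numerically with autograd. This is a sketch under assumptions: a small `nn.RNN` with random inputs, made-up targets, and a cross-entropy loss at each step. It compares the gradient of the total loss with the sum of the per-step gradients with respect to the hidden-to-hidden weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
T, D, M, K = 6, 3, 5, 2
rnn = nn.RNN(input_size=D, hidden_size=M, batch_first=True)
fc = nn.Linear(M, K)

x = torch.randn(1, T, D)                  # one hypothetical sequence
targets = torch.randint(0, K, (T,))       # hypothetical label per time step

out, _ = rnn(x)                           # (1, T, M)
logits = fc(out)[0]                       # (T, K)
step_losses = F.cross_entropy(logits, targets, reduction='none')  # L_t, shape (T,)
total_loss = step_losses.sum()            # L = sum_t L_t

W = rnn.weight_hh_l0                      # hidden-to-hidden weights
grad_total = torch.autograd.grad(total_loss, W, retain_graph=True)[0]
grad_sum = sum(torch.autograd.grad(L_t, W, retain_graph=True)[0]
               for L_t in step_losses)

print(torch.allclose(grad_total, grad_sum, atol=1e-5))
```

Autograd unrolls the recurrence internally, so each per-step gradient already includes the contributions of all earlier hidden states.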

### Vanishing and Exploding Gradient

* The gradient can vanish (i.e. go to zero) or explode, because it contains a product of many Jacobians:

<div style="font-size: 115%;">
$$\frac{\partial s_t}{\partial s_k} = \prod_{j = k}^{t - 1}\frac{\partial s_{j+1}}{\partial s_j}$$
</div>
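
The effect of the repeated product can be illustrated numerically. The sketch below makes several simplifying assumptions that are not from the notebook: there are no inputs or biases, the recurrent matrix is a scaled orthogonal matrix, and the exploding case uses an identity activation so that tanh saturation does not mask the growth. The helper `jacobian_product_norms` is hypothetical.

```python
import numpy as np

def jacobian_product_norms(scale, T=30, d=5, use_tanh=True, seed=0):
    """Spectral norm of prod_j ds_{j+1}/ds_j after each of T steps."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # orthogonal matrix
    W = scale * Q                                  # all singular values = scale
    s = rng.normal(size=d)
    prod = np.eye(d)
    norms = []
    for _ in range(T):
        pre = W @ s
        s = np.tanh(pre) if use_tanh else pre
        # Derivative of the activation: 1 - tanh^2 for tanh, 1 for identity
        local = (1.0 - s ** 2) if use_tanh else np.ones(d)
        J = np.diag(local) @ W                     # ds_{j+1}/ds_j
        prod = J @ prod
        norms.append(np.linalg.norm(prod, 2))
    return np.array(norms)

vanishing = jacobian_product_norms(scale=0.5)                   # tanh, small weights
exploding = jacobian_product_norms(scale=1.5, use_tanh=False)   # linear, large weights
print(vanishing[-1], exploding[-1])
```

With `scale=0.5` each Jacobian has norm at most 0.5, so the product shrinks geometrically; with `scale=1.5` in the linear case it grows geometrically.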

### Long Short Term Memory (LSTM)

* Sepp Hochreiter & Jürgen Schmidhuber (1997)

* Long-range dependencies
    - Language Processing
    
* Designed to mitigate the vanishing gradient problem

![](LSTM1.png)


#### Cell State (C) and Gates

* The cell state is a vector that runs along the entire length of the sequence
* Information is added to or removed from the cell state through **gates**
* **Gates** are the circles in the figure: a sigmoid activation followed by an elementwise multiplication (x)
    - The sigmoid outputs a value between 0 and 1 which controls how much information to let through
    - There are 3 gates in the figure
        - Input 
        - Forget
        - Output
    
![](LSTM2.png)

    
### LSTM Operation

#### Forget gate layer step  

<div style="font-size: 125%;">
    
$$f_t = \sigma(W_f\centerdot [h_{t-1},x_t] + b_f)$$

</div>

* Takes the output of the previous time step $h_{t-1}$ and the current input $x_t$
    - Their concatenation is multiplied by a weight matrix and passed to the sigmoid
* The sigmoid outputs a vector (matching the cell state), with values ranging from 0 to 1.
    - Each value is the fraction of the corresponding cell state value to keep
* Modifies the cell state vector by elementwise multiplication.
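
A minimal NumPy sketch of the forget gate follows. The sizes, the random weights, and the variable names (`W_f`, `h_prev`, `C_prev`) are assumptions chosen for illustration, and $[h_{t-1}, x_t]$ is taken to be the concatenation of the two vectors.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hid = 3, 4                         # hypothetical sizes
rng = np.random.default_rng(0)
W_f = rng.normal(size=(n_hid, n_hid + n_in))
b_f = np.zeros(n_hid)

h_prev = rng.normal(size=n_hid)            # h_{t-1}
x_t = rng.normal(size=n_in)                # current input
C_prev = rng.normal(size=n_hid)            # C_{t-1}

f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)   # values in (0, 1)
kept = f_t * C_prev                        # elementwise scaling of the old cell state
print(f_t, kept)
```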

#### Determine what to store in cell state step (3 steps)  

#### 1) input gate layer

* Regulates (filters) what values from $h_{t-1}$ and $x_t$ need to be added to the cell state. 

<div style="font-size: 125%;">
$$i_t = \sigma(W_i\centerdot[h_{t-1},x_t] + b_i)$$
</div>

#### 2) Create candidate update $\hat{C}_t$

* Creates a vector containing the possible values that can be added to the cell state by the tanh function, which outputs values from -1 to +1.  

<div style="font-size: 125%;">
$$\hat{C}_t = tanh(W_C\centerdot[h_{t-1},x_t] + b_C)$$
</div>

#### 3) Update the Cell State step

* Multiplies, elementwise, the filter (i.e. the sigmoid input gate $i_t$) and the candidate vector (i.e. the tanh output $\hat{C}_t$)
* Adds this information to the cell state by addition.

<div style="font-size: 125%;">
$$C_t = f_t * C_{t-1} + i_t * \hat{C}_t$$
</div>

* Multiply old state by $f_t$ (to  forget)
* Add new states weighted by how much to update each state

#### Output Gate (3 step process)

* Not all information in the cell state is "fit" for being output at this time step

1) Sigmoid layer decides what part of cell state to output. It filters $h_{t-1}$ and $x_t$ via a sigmoid function to regulate the values that need to be output from the vector created below in step 2. 

<div style="font-size: 125%;">
$$ o_t = \sigma(W_o\centerdot[h_{t-1},x_t] + b_o)$$
</div>

2) Scales the values in the cell state with a tanh function creating a vector $tanh(C_t)$.
 
3) Multiplies the vectors created in steps 1 and 2 and sends it as an output and to the hidden state of the next cell.
<div style="font-size: 125%;">
$$ h_t = o_t * tanh(C_t)$$
</div>
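
The LSTM steps above can be put together in one NumPy function and compared against `nn.LSTMCell`. This is a sketch, not the notebook's code: the helper `lstm_step` and the dictionary layout of the weights are hypothetical. PyTorch stores the four gates stacked in the order input, forget, cell candidate, output, with separate input and hidden weight matrices, so the sketch slices them and concatenates each pair to act on $[h_{t-1}, x_t]$.

```python
import numpy as np
import torch
import torch.nn as nn

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step; each W[g] has shape (n_hid, n_hid + n_in)."""
    hx = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f_t = sigmoid(W['f'] @ hx + b['f'])        # forget gate
    i_t = sigmoid(W['i'] @ hx + b['i'])        # input gate
    C_hat = np.tanh(W['C'] @ hx + b['C'])      # candidate update
    C_t = f_t * C_prev + i_t * C_hat           # new cell state
    o_t = sigmoid(W['o'] @ hx + b['o'])        # output gate
    h_t = o_t * np.tanh(C_t)                   # new hidden state
    return h_t, C_t

torch.manual_seed(0)
n_in, n_hid = 3, 4
cell = nn.LSTMCell(n_in, n_hid)

w_ih = cell.weight_ih.detach().numpy()         # (4*n_hid, n_in)
w_hh = cell.weight_hh.detach().numpy()         # (4*n_hid, n_hid)
bias = (cell.bias_ih + cell.bias_hh).detach().numpy()
W, b = {}, {}
for idx, name in enumerate(['i', 'f', 'C', 'o']):   # PyTorch gate order
    rows = slice(idx * n_hid, (idx + 1) * n_hid)
    W[name] = np.concatenate([w_hh[rows], w_ih[rows]], axis=1)
    b[name] = bias[rows]

x, h0, c0 = torch.randn(1, n_in), torch.randn(1, n_hid), torch.randn(1, n_hid)
h_ref, c_ref = cell(x, (h0, c0))
h_np, c_np = lstm_step(x[0].numpy(), h0[0].numpy(), c0[0].numpy(), W, b)

print(np.allclose(h_np, h_ref.detach().numpy()[0], atol=1e-5),
      np.allclose(c_np, c_ref.detach().numpy()[0], atol=1e-5))
```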



### Gated Recurrent Units (GRUs)

* Kyunghyun Cho et al.: Gated Recurrent Unit (GRU) (2014)
* Simpler version of LSTM
    - No cell state
    - 2 gates
* Performance comparable to LSTM

![](GRU.png)

#### Update Gate

* The update gate determines how much of the information from previous time steps needs to be passed along.

* Add weighted input x and previous h and pass through a sigmoid to update hidden state

<div style="font-size: 125%;">
$$z_t = \sigma(W^{(z)}x_t + U^{(z)}h_{t-1})$$
</div>

#### Reset Gate

* Determines how much of the previous information to forget

<div style="font-size: 125%;">    
$$r_t = \sigma(W^{(r)}x_t + U^{(r)}h_{t-1})$$
</div>

#### Current Memory Content

<div style="font-size: 125%;">
$$ \hat{h}_t = \tanh(Wx_t + r_t \odot{Uh_{t-1}})$$
</div>

* Weighted input + elementwise multiplication of reset gate with weighted previous hidden state
* Determines what to remove from previous time steps

#### Final Memory Content

<div style="font-size: 125%;">
$$h_t = z_t\odot{h_{t-1}}+(1-z_t)\odot{\hat{h}_t}$$
</div>

* Information for current output and to be passed to next time step
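
The four GRU equations can be sketched as one NumPy step. As in the equations, no bias terms are used. The parameter names in the dictionary `p`, the sizes, and the random weights are assumptions for illustration; the current memory content is called `h_cand` in the code to keep it distinct from the final hidden state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step following the update/reset/current/final memory equations."""
    z_t = sigmoid(p['W_z'] @ x_t + p['U_z'] @ h_prev)         # update gate
    r_t = sigmoid(p['W_r'] @ x_t + p['U_r'] @ h_prev)         # reset gate
    h_cand = np.tanh(p['W'] @ x_t + r_t * (p['U'] @ h_prev))  # current memory content
    return z_t * h_prev + (1.0 - z_t) * h_cand                # final memory content

n_in, n_hid = 3, 4                              # hypothetical sizes
rng = np.random.default_rng(0)
p = {name: rng.normal(size=(n_hid, n_in)) for name in ('W_z', 'W_r', 'W')}
p.update({name: rng.normal(size=(n_hid, n_hid)) for name in ('U_z', 'U_r', 'U')})

h = np.zeros(n_hid)                             # initial hidden state
for x in rng.normal(size=(6, n_in)):            # a sequence of 6 inputs
    h = gru_step(x, h, p)
print(h)
```

Because the final hidden state is a convex combination of the previous state and a tanh output, it stays within (-1, 1) when started from zeros.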

### References

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be