In [1]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

#### LSTM: Long Short-Term Memory
<font size = 2>
    
The memory unit in a plain RNN is **short-term**: it mainly retains information from nearby time steps. When a prediction depends on information from much earlier in the sequence, a plain RNN performs poorly. **LSTM**, i.e. **Long Short-Term Memory**, is introduced to keep memory over longer spans. Its basic internal structure is shown below:
    
<div>
<img src = 'LSTM_1.png' style = 'zoom:40%'/>
</div>
    
The **LSTM** contains 4 main elements: **forget-gate**, **update-gate**, **updating-state-cell-part** and **output-gate**.
    
$C_{t-1}$
    
    previous state cell, carrying the memory from the previous time step
    
$C_{t}$
    
    state cell after current LSTM unit
    
$\sigma$ and $tanh$
    
    activation functions used inside the gates, which inspect and filter the information passing through
    
$x_{t}$
    
    current input
    
$h_{t-1}$
    
    output memory from the previous time step
    
$h_{t}$
    
    output memory after current LSTM unit

#### Elements of LSTM
<font size = 2>
    
1) Forget Gate:
  
$f_{t}$:
    
    values between 0 and 1, produced by the sigmoid activation function, that decide how much of the previous state cell is remembered (close to 1) or forgotten (close to 0)
    
$h_{t-1}$:
    
    output memory from the previous time step
    
$x_{t}$:
    
    current input
    
<div>
<img src = 'LSTM_ForgetGate.png' style = 'zoom:67%'/>
</div>
    
$$f_{t} = \sigma (W_{f} \cdot [h_{t-1}, x_{t}] + b_{f})$$
    
where $W_{f}$ and $b_{f}$ are the weights and bias of the **forget gate**.

2) Update Gate:
    
$i_{t}$: input gate
    
    decides which values are updated
    
$\tilde{C}_{t}$: candidate updating
    
    creates a vector of new candidate values that could be added to the state, i.e. the content from the current input and the previous output memory that may be written into the state cell
    
<div>
<img src = 'LSTM_UpdateGate.png' style = 'zoom:70%'/>
</div>
 
$$i_{t} = \sigma (W_{i} \cdot [h_{t-1}, x_{t}] + b_{i})$$
    
$$\tilde{C}_{t} = tanh (W_{C} \cdot [h_{t-1}, x_{t}] + b_{C})$$
    
where $W_{i}$ and $b_{i}$ are the weights and bias of the **input gate**, and $W_{C}$ and $b_{C}$ are those of the **candidate update**.
    
3) Updating State Cell:
    
<div>
<img src = 'LSTM_UpdatingStateCell.png' style = 'zoom:70%'/>
</div>
    
$$C_{t} = f_{t} \cdot C_{t-1} + i_{t} \cdot \tilde{C}_{t}$$
    
$f_{t} \cdot C_{t-1}$:
    
    the content we decided to forget/remember from previous state cell
    
$i_{t} \cdot \tilde{C}_{t}$:
    
    the content we want to update/add to state cell from input and previous output memory
    
4) Output Gate:
    
$o_{t}$:
    
    values between 0 and 1 that decide which parts of the state cell are output
    
$tanh(C_{t})$:
    
    the content of the state cell, squashed into (-1, 1), that is available for output
    
<div>
<img src = 'LSTM_OutputGate.png' style = 'zoom:71%'/>
</div>

$$o_{t} = \sigma (W_{o} \cdot [h_{t-1}, x_{t}] + b_{o})$$
    
$$h_{t} = o_{t} \cdot tanh(C_{t})$$
    
(ps:Details are presented in ML-L.20-RNNs II)
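
The four elements above can be put together into a single time step. The following is a minimal sketch, not the implementation used by any library: the sizes, the helper `make_params` and the function `lstm_step` are hypothetical names chosen for illustration, and the weights are random. Each weight matrix acts on the concatenation $[h_{t-1}, x_{t}]$, as in the equations above.

```python
import torch

torch.manual_seed(0)
batch, feature_len, hidden_len = 3, 4, 5   # illustrative sizes (assumption)

def make_params():
    # One (W, b) pair per gate; W acts on the concatenation [h_{t-1}, x_t].
    W = torch.randn(hidden_len, hidden_len + feature_len) * 0.1
    b = torch.zeros(hidden_len)
    return W, b

(W_f, b_f), (W_i, b_i), (W_C, b_C), (W_o, b_o) = [make_params() for _ in range(4)]

def lstm_step(x_t, h_prev, C_prev):
    hx = torch.cat([h_prev, x_t], dim=1)       # [batch, hidden_len + feature_len]
    f_t = torch.sigmoid(hx @ W_f.T + b_f)      # 1) forget gate
    i_t = torch.sigmoid(hx @ W_i.T + b_i)      # 2) input gate
    C_tilde = torch.tanh(hx @ W_C.T + b_C)     # 2) candidate update
    C_t = f_t * C_prev + i_t * C_tilde         # 3) updated state cell
    o_t = torch.sigmoid(hx @ W_o.T + b_o)      # 4) output gate
    h_t = o_t * torch.tanh(C_t)                # 4) output memory
    return h_t, C_t

x_t = torch.randn(batch, feature_len)
h_prev = torch.zeros(batch, hidden_len)
C_prev = torch.zeros(batch, hidden_len)
h_t, C_t = lstm_step(x_t, h_prev, C_prev)
print(h_t.shape, C_t.shape)
```

Because $o_{t}$ lies in (0, 1) and $tanh(C_{t})$ lies in (-1, 1), every entry of `h_t` stays strictly inside (-1, 1), while `C_t` itself is not bounded in that way.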

#### How LSTM Improves Gradient Vanishing(suspended)
<font size = 2>
    
In an RNN, gradient vanishing and gradient exploding come from the partial derivative between the short-term memories $h_{t}$ and $h_{i}$ at different time steps (see the notes in RNN.ipynb):
    
$$ \frac{\partial{E_{t}}}{\partial{W_{hh}}} = \sum^{t}_{i} \frac{\partial{E_{t}}}{\partial{y_{t}}} \frac{\partial{y_{t}}}{\partial{h_{t}}} \frac{\partial{h_{t}}}{\partial{h_{i}}} \frac{\partial{h_{i}}}{\partial{W_{hh}}} $$
    
$$\frac{\partial{h_{t}}}{\partial{h_{i}}} = \prod^{t-1}_{k=i} diag(f'_{w} (x_{k+1}@W_{xh} + h_{k}@W_{hh})) W_{hh}$$
    
The accumulated product of the memory weights $W_{hh}$ results in gradient vanishing or exploding.
    
In short, gradient vanishing and exploding arise from the partial derivatives between adjacent memories. In LSTM the corresponding memories are the **state cells** $C_{t}$ and $C_{t-1}$, so the reason LSTM mitigates gradient vanishing lies in the following part of its gradients (gradient exploding can be handled separately by **gradient clipping**, see RNN_sin(x)_GradientExploding_solved.py):
    
$$\frac{\partial C_{t}}{\partial C_{t-1}}$$
    
The current state cell $C_{t}$ is calculated as:

$$C_{t} = f_{t} \cdot C_{t-1} + i_{t} \cdot \tilde{C}_{t}$$
    
So the gradient of $C_{t}$ with respect to $C_{t-1}$ is made up of 4 parts:
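
A sketch of these four parts, assuming the standard LSTM described above (no peephole connections), in which $f_{t}$, $i_{t}$ and $\tilde{C}_{t}$ depend on $C_{t-1}$ only through $h_{t-1} = o_{t-1} \cdot tanh(C_{t-1})$. Applying the product rule to both terms of the update gives:

$$\frac{\partial C_{t}}{\partial C_{t-1}} = \frac{\partial f_{t}}{\partial C_{t-1}} \cdot C_{t-1} + f_{t} + \frac{\partial i_{t}}{\partial C_{t-1}} \cdot \tilde{C}_{t} + i_{t} \cdot \frac{\partial \tilde{C}_{t}}{\partial C_{t-1}}$$

One common reading of this expression is that the second part, $f_{t}$, enters additively and is not multiplied by a weight matrix. When the forget gate stays close to 1, the product of these factors over many time steps need not shrink toward zero the way the repeated $W_{hh}$ product can in a plain RNN.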
    


#### nn.LSTM( )
<font size = 2>
    
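nn.LSTM() uses the same gates as the LSTM described above, with some adjustments for implementation: the input $x_{t}$ and the previous hidden state $h_{t-1}$ each get their own weight matrix and bias, instead of one weight applied to the concatenation $[h_{t-1}, x_{t}]$: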
    
$$
\begin{equation}
\begin{aligned}
i_{t} &= \sigma (W_{ii} x_{t} + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \\
f_{t} &= \sigma (W_{if} x_{t} + b_{if} + W_{hf} h_{t-1} + b_{hf}) \\
g_{t} &= tanh (W_{ig} x_{t} + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \\
o_{t} &= \sigma (W_{io} x_{t} + b_{io} + W_{ho} h_{t-1} + b_{ho}) \\
c_{t} &= f_{t} \odot c_{t-1} + i_{t} \odot g_{t} \\
h_{t} &= o_{t} \odot tanh(c_{t})
\end{aligned}
\end{equation}
$$
    
where $h_{t}$ is the hidden state at time t, $c_{t}$ is the cell state at time t, $x_{t}$ is the input at time t, $h_{t-1}$ is the hidden state of the layer at time t-1 or the initial hidden state at time 0, and $i_{t}$, $f_{t}$, $g_{t}$, $o_{t}$ are the input, forget, cell, and output gates, respectively. $\sigma$ is the sigmoid function, and $\odot$ is the Hadamard product.
    
**The parameters in LSTM:**
    
    LSTM.weight_ih_l:  (W_ii|W_if|W_ig|W_io)   shape: [4*hidden_len, feature_len]
    LSTM.weight_hh_l:  (W_hi|W_hf|W_hg|W_ho)   shape: [4*hidden_len, hidden_len]
    LSTM.bias_ih_l:    (b_ii|b_if|b_ig|b_io)   shape: [4*hidden_len]
    LSTM.bias_hh_l:    (b_hi|b_hf|b_hg|b_ho)   shape: [4*hidden_len]
    
(ps: 'l' is followed by the layer index, e.g. weight_ih_l0 for the first layer; the index runs from 0 to num_layers - 1.)
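
To see how the stacked parameters map onto the equations, the sketch below splits each parameter into its four gate blocks in the order (i, f, g, o) and recomputes one time step by hand. The sizes are arbitrary illustration values and the random seed is only there for repeatability; the comparison at the end checks the hand-computed result against what nn.LSTM returns.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=4, num_layers=1)

x = torch.randn(1, 2, 8)     # [seq_len=1, batch, feature_len]
h0 = torch.randn(1, 2, 4)    # [num_layers, batch, hidden_len]
c0 = torch.randn(1, 2, 4)

with torch.no_grad():
    # Split the stacked parameters into the four gate blocks (i, f, g, o).
    W_ii, W_if, W_ig, W_io = lstm.weight_ih_l0.chunk(4, dim=0)   # each [4, 8]
    W_hi, W_hf, W_hg, W_ho = lstm.weight_hh_l0.chunk(4, dim=0)   # each [4, 4]
    b_ii, b_if, b_ig, b_io = lstm.bias_ih_l0.chunk(4)
    b_hi, b_hf, b_hg, b_ho = lstm.bias_hh_l0.chunk(4)

    x_t, h_prev, c_prev = x[0], h0[0], c0[0]
    i_t = torch.sigmoid(x_t @ W_ii.T + b_ii + h_prev @ W_hi.T + b_hi)
    f_t = torch.sigmoid(x_t @ W_if.T + b_if + h_prev @ W_hf.T + b_hf)
    g_t = torch.tanh(x_t @ W_ig.T + b_ig + h_prev @ W_hg.T + b_hg)
    o_t = torch.sigmoid(x_t @ W_io.T + b_io + h_prev @ W_ho.T + b_ho)
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * torch.tanh(c_t)

    # Reference result from the library for the same input and state.
    out, (h1, c1) = lstm(x, (h0, c0))

h_matches = torch.allclose(h_t, h1[0], atol=1e-5)
c_matches = torch.allclose(c_t, c1[0], atol=1e-5)
print(h_matches, c_matches)
```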

In [25]:
'''Single Layer nn.LSTM'''
#create a network:
#para: input_size -> feature_len
#para: hidden_size -> hidden_len
#para: num_layers
lstm = nn.LSTM(input_size = 100, hidden_size = 50, num_layers = 1)
para = lstm._parameters.keys()
print(para)
print()
#weight_ih_l0:  [4*hidden_len,feature_len]
print('weight_ih_l0:',lstm.weight_ih_l0.shape)
#weight_hh_l0:  [4*hidden_len,hidden_len]
print('weight_hh_l0:',lstm.weight_hh_l0.shape)
#bias_ih_l0:    [4*hidden_len]
print('bias_ih_l0:', lstm.bias_ih_l0.shape)
#bias_hh_l0:    [4*hidden_len]
print('bias_hh_l0',lstm.bias_hh_l0.shape)
print()

#x: [seq_len,batch,feature_len]
x = torch.randn(10,3,100)
#h: [num_layers,batch,hidden_len]
h = torch.rand(1,3,50)
#c: [num_layers,batch,hidden_len]
c = torch.rand(1,3,50)

#h and c are combined as a tuple to input
out, (h,c) = lstm(x,(h,c))
#out is stacked result of h_t
#out: [seq_len,batch,hidden_len]
print(out.size())
#h is memory of the last time t
#h: [num_layers,batch,hidden_len]
print(h.size())
#c is state cell of the last time t
#c: [num_layers,batch,hidden_len]
print(c.size())

odict_keys(['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0'])

weight_ih_l0: torch.Size([200, 100])
weight_hh_l0: torch.Size([200, 50])
bias_ih_l0: torch.Size([200])
bias_hh_l0 torch.Size([200])

torch.Size([10, 3, 50])
torch.Size([1, 3, 50])
torch.Size([1, 3, 50])
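
The relation between `out` and `h` can be checked directly. A small sketch, assuming the same sizes as the cell above with a single unidirectional layer: the last time step of `out` should hold the same values as the final hidden state `h[0]`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=100, hidden_size=50, num_layers=1)
x = torch.randn(10, 3, 100)          # [seq_len, batch, feature_len]

with torch.no_grad():
    out, (h, c) = lstm(x)            # initial h and c default to zeros

# out stacks h_t for every time step; h keeps only the last one.
last_step_matches = torch.allclose(out[-1], h[0])
print(out[-1].shape, h[0].shape, last_step_matches)
```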


In [27]:
'''Multi-Layer nn.LSTM'''
#create a network:
#para: input_size -> feature_len
#para: hidden_size -> hidden_len
#para: num_layers
#each later layer takes the output of the previous layer as its input
multi_lstm = nn.LSTM(input_size = 100, hidden_size = 50, num_layers = 3)
multi_para = multi_lstm._parameters.keys()
print(multi_para)
print()

#multi_x: [seq_len,batch,feature_len]
#multi_h and multi_c default to zeros if not provided
multi_x = torch.randn(10,3,100)

out, (h,c) = multi_lstm(multi_x)
#out is stacked result of h_t
#out: [seq_len,batch,hidden_len]
print(out.size())
#h is memory of the last time t
#h: [num_layers,batch,hidden_len]
print(h.size())
#c is state cell of the last time t
#c: [num_layers,batch,hidden_len]
print(c.size())

odict_keys(['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0', 'weight_ih_l1', 'weight_hh_l1', 'bias_ih_l1', 'bias_hh_l1', 'weight_ih_l2', 'weight_hh_l2', 'bias_ih_l2', 'bias_hh_l2'])

torch.Size([10, 3, 50])
torch.Size([3, 3, 50])
torch.Size([3, 3, 50])
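
Since each later layer consumes the hidden states of the layer below it, its input-to-hidden weight has `hidden_len` columns instead of `feature_len`. The sketch below, using the same sizes as the cell above, lists the parameter shapes per layer and checks that `out` corresponds to the top layer only, while `h` holds the final hidden state of every layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
multi_lstm = nn.LSTM(input_size=100, hidden_size=50, num_layers=3)

# Layer 0 reads the input features; layers 1 and 2 read hidden states.
for name, p in multi_lstm.named_parameters():
    print(name, tuple(p.shape))

with torch.no_grad():
    out, (h, c) = multi_lstm(torch.randn(10, 3, 100))

# out comes from the top layer, so its last step matches h[-1].
top_matches = torch.allclose(out[-1], h[-1])
print(top_matches)
```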


#### nn.LSTMCell( )

In [9]:
'''Single Layer nn.LSTMCell'''
#nn.LSTMCell() is similar to nn.RNNCell()
#it operates on a single time step, so the input must be fed in manually step by step
#its output is the current output memory h_t and the current state cell c_t
#create a network:
#para: input_size -> feature_len
#para: hidden_size -> hidden_len
#there is no 'num_layers' parameter; passing one raises an error
lstm_cell = nn.LSTMCell(input_size = 100, hidden_size = 20)
#c_t:  [batch,hidden_len]
c_t = torch.randn(3,20)
#h_t:  [batch,hidden_len]
h_t = torch.randn(3,20)
#x:    [seq_len,batch,feature_len]
x = torch.randn(10,3,100)
#manually offer input
for x_t in x:
    #x_t:  [batch,feature_len]
    #feed the updated state back in, so memory is carried across time steps
    (h_t,c_t) = lstm_cell(x_t,(h_t,c_t))
print(h_t.shape)
print(c_t.shape)

torch.Size([3, 20])
torch.Size([3, 20])
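
A step-by-step loop over nn.LSTMCell, with `(h_t, c_t)` fed back in at every step, computes the same thing as a single-layer nn.LSTM. The sketch below illustrates this by copying the layer-0 parameters of an nn.LSTM into an nn.LSTMCell and comparing the results; the sizes and the zero initial state are illustration choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=100, hidden_size=20, num_layers=1)
cell = nn.LSTMCell(input_size=100, hidden_size=20)

# Copy the layer-0 parameters so both modules compute the same function.
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(10, 3, 100)      # [seq_len, batch, feature_len]
h_t = torch.zeros(3, 20)         # [batch, hidden_len]
c_t = torch.zeros(3, 20)

outputs = []
with torch.no_grad():
    for x_t in x:
        # Carry the state forward from one time step to the next.
        h_t, c_t = cell(x_t, (h_t, c_t))
        outputs.append(h_t)
    out_cell = torch.stack(outputs)       # [seq_len, batch, hidden_len]
    out_lstm, (h_n, c_n) = lstm(x)        # initial state defaults to zeros

outputs_match = torch.allclose(out_cell, out_lstm, atol=1e-5)
print(out_cell.shape, outputs_match)
```

Collecting `h_t` at every step and stacking the list reproduces the `out` tensor that nn.LSTM returns.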


In [14]:
'''Multi-Layer nn.LSTMCell'''
#multi-layer nn.LSTMCell needs to be composed by user
#latter layer takes the output of former layer as input
#pay attention to dimension
#create a network:
#para: input_size -> feature_len
#para: hidden_size -> hidden_len
#there is no 'num_layers' parameter; passing one raises an error

#input of lstm_cell_1:   [batch, feature_len]   -> [b,100]
#output of lstm_cell_1:  [batch, hidden_len_1]  -> [b,50]
lstm_cell_1 = nn.LSTMCell(input_size = 100, hidden_size = 50)
#lstm_cell_2 takes the output of lstm_cell_1 as input
#input of lstm_cell_2:   [batch, hidden_len_1]  -> [b,50]
#output of lstm_cell_2:  [batch, hidden_len_2]  -> [b,20]
lstm_cell_2 = nn.LSTMCell(input_size = 50, hidden_size = 20)

#c_t:  [batch,hidden_len]
c_t_1 = torch.randn(3,50)
c_t_2 = torch.randn(3,20)
#h_t:  [batch,hidden_len]
h_t_1 = torch.randn(3,50)
h_t_2 = torch.randn(3,20)
#x:    [seq_len,batch,feature_len]
x = torch.randn(10,3,100)

#manually offer input
for x_t in x:
    #x_t:  [batch,feature_len]
    (h_t_1,c_t_1) = lstm_cell_1(x_t,(h_t_1,c_t_1))
    (h_t_2,c_t_2) = lstm_cell_2(h_t_1,(h_t_2,c_t_2))
print(h_t_2.shape)
print(c_t_2.shape)

torch.Size([3, 20])
torch.Size([3, 20])
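
The manual stacking can also be wrapped in a module so the per-layer states are managed in one place. This is only a sketch: the class name `StackedLSTMCells` and its constructor arguments are hypothetical, and the initial states are assumed to be zeros.

```python
import torch
import torch.nn as nn

class StackedLSTMCells(nn.Module):
    """Hypothetical wrapper: a stack of nn.LSTMCell layers run step by step."""

    def __init__(self, feature_len, hidden_lens):
        super().__init__()
        sizes = [feature_len] + list(hidden_lens)
        # Layer k takes the hidden state of layer k-1 as its input.
        self.cells = nn.ModuleList(
            [nn.LSTMCell(sizes[k], sizes[k + 1]) for k in range(len(hidden_lens))]
        )

    def forward(self, x):
        # x: [seq_len, batch, feature_len]
        batch = x.size(1)
        # One (h, c) pair per layer, initialised to zeros (assumption).
        states = [
            (x.new_zeros(batch, cell.hidden_size), x.new_zeros(batch, cell.hidden_size))
            for cell in self.cells
        ]
        outputs = []
        for x_t in x:
            inp = x_t
            for k, cell in enumerate(self.cells):
                states[k] = cell(inp, states[k])
                inp = states[k][0]        # h of this layer feeds the next layer
            outputs.append(inp)           # keep the top layer's h_t
        return torch.stack(outputs), states

model = StackedLSTMCells(feature_len=100, hidden_lens=[50, 20])
out, states = model(torch.randn(10, 3, 100))
print(out.shape)
print([tuple(h.shape) for h, _ in states])
```

Unlike nn.LSTM with num_layers > 1, this layout allows a different hidden size per layer, which is what the two-cell example above does with 50 and 20.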
