# Part 6

## Lecture 6
<img src="./image/types_of_sequential_data.png" height="200" />

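- An _FFN_ is a poor fit for sequences: it has separate weights for every input position, so it learns patterns tied to exact word locations.
- An _RNN_ alleviates this by using **shared weights** across time steps.
    - It is recurrent, so the input is processed in sequence order.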
    - concatenate: $A \cdot B + C \cdot D = \begin{bmatrix} A & C \end{bmatrix} \cdot \begin{bmatrix} B \\ D \end{bmatrix}$ (A and C concatenated side by side, B and D stacked on top of each other)
    - Cannot be trained in parallel across time steps, since it is sequential.
    - **Teacher forcing** can be used to train in *parallel*
        - cuts the dependency on the previous time step
        - **not** as powerful as RNN
    - For long sequences (1000+ steps, an arbitrary big number) the gradient is a product of one factor per time step
        - factors \> 1: it explodes
        - factors < 1: it vanishes
        - This is a problem for text, which has long-range relations/references that span many time steps.
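
The exploding/vanishing behaviour can be made concrete with a small sketch. It assumes, as a simplification, that the gradient through each time step scales by one constant scalar; in a real RNN each step contributes a Jacobian matrix. The function name `gradient_scale` and the factors 1.01 and 0.99 are illustrative choices, not values from the lecture.

```python
def gradient_scale(per_step_factor: float, n_steps: int) -> float:
    """Product of n_steps identical per-step gradient factors."""
    scale = 1.0
    for _ in range(n_steps):
        scale *= per_step_factor
    return scale


exploding = gradient_scale(1.01, 1000)  # factor slightly above 1
vanishing = gradient_scale(0.99, 1000)  # factor slightly below 1
print(f"above 1: {exploding:.3e}   below 1: {vanishing:.3e}")
```

Even a factor that is only 1% away from 1 changes the gradient by several orders of magnitude over 1000 steps.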

<img src="./image/Simple_RNN.png" height="200" />

## [Assignment 6: RNN, GRU and LSTM](https://colab.research.google.com/drive/1Y71ysVaEHOw0lPRqDGOL-OmTLHB0X-HI)
- Deal with variable-length sequential data

**RNN**
- Recursively updates the hidden state during the forward pass.
- The number of recurrence steps depends on the length of the sequence.
- The size of the hidden state is preserved, regardless of the sequence length.
    - (Hidden) state: $H_t = \phi(X_tW_{xh} + H_{t-1}W_{hh} + b_h)$
- $X_t$ has the same size at every time step
- Tanh is often chosen as the activation function $\phi$ for an RNN
- Also called Vanilla or Elman RNN
    - Difficult to deal with **long-term dependencies** compared to GRU and LSTM
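
As a rough check of the hidden-state formula, the sketch below runs the recurrence by hand and compares it with `nn.RNN`. It assumes PyTorch's default tanh activation and a zero initial state; the names `W_cat`, `h_cat`, and `manual` are illustrative. It also evaluates the concatenated form from the lecture notes, $[X_t \mid H_{t-1}]$ multiplied by the stacked weights, which should give the same state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# batch_size comes first because batch_first=True
batch_size, n_timesteps, input_size, hidden_size = 2, 3, 4, 10

rnn = nn.RNN(input_size, hidden_size, batch_first=True)  # tanh by default
x = torch.randn(batch_size, n_timesteps, input_size)

with torch.no_grad():
    expected, _ = rnn(x)  # (batch_size, n_timesteps, hidden_size)

    # PyTorch stores weights as (out_features, in_features); transpose to match the notes.
    W_xh = rnn.weight_ih_l0.T               # (input_size, hidden_size)
    W_hh = rnn.weight_hh_l0.T               # (hidden_size, hidden_size)
    b_h = rnn.bias_ih_l0 + rnn.bias_hh_l0   # the two biases act as a single b_h

    # Stacked weights for the concatenated form [X_t | H_{t-1}] @ [W_xh ; W_hh]
    W_cat = torch.cat([W_xh, W_hh], dim=0)  # (input_size + hidden_size, hidden_size)

    h = torch.zeros(batch_size, hidden_size)      # H_0
    h_cat = torch.zeros(batch_size, hidden_size)  # H_0 for the concatenated form
    states = []
    for t in range(n_timesteps):
        x_t = x[:, t]
        h = torch.tanh(x_t @ W_xh + h @ W_hh + b_h)
        h_cat = torch.tanh(torch.cat([x_t, h_cat], dim=1) @ W_cat + b_h)
        states.append(h)
    manual = torch.stack(states, dim=1)

rnn_matches = torch.allclose(manual, expected, atol=1e-5)
concat_matches = torch.allclose(h, h_cat, atol=1e-5)
print(rnn_matches, concat_matches)
```

The same weights are reused at every step of the loop, which is the weight sharing that keeps the parameter count independent of the sequence length.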

<img src="./image/Example_language_model.png" height="350" />

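- **in** = (seq_length, n_timesteps, input_size)
    - With `batch_first=True`, the first dimension is the batch (the number of sequences, called `seq_length` in the code) and the second is the number of time steps.
- RNN layer = (input_size, hidden_size)
    - Weight_xh = (input_size, hidden_size)
    - Weight_hh = (hidden_size, hidden_size)
    - Bias_xh = (hidden_size)
    - Bias_hh = (hidden_size)
    - PyTorch stores the weights transposed, e.g. `weight_ih_l0` has shape (hidden_size, input_size).
- **out** = (seq_length, n_timesteps, hidden_size)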
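
The shapes listed above can be inspected directly on an `nn.RNN` instance. This is a minimal sketch using the same sizes as the assignment; the dictionary name `param_shapes` is illustrative, and `batch_size` stands in for the first input dimension.

```python
import torch
import torch.nn as nn

batch_size, n_timesteps, input_size, hidden_size = 2, 3, 4, 10
rnn = nn.RNN(input_size, hidden_size, num_layers=1, batch_first=True)

# Parameter shapes as PyTorch stores them (weights are transposed relative to the notes).
param_shapes = {name: tuple(p.shape) for name, p in rnn.named_parameters()}
for name, shape in param_shapes.items():
    print(name, shape)

# out holds the hidden state for every time step, h_n only the last one.
out, h_n = rnn(torch.zeros(batch_size, n_timesteps, input_size))
print(tuple(out.shape), tuple(h_n.shape))
```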


In [None]:
# Vanilla RNN
import torch
import torch.nn as nn
from torchinfo import summary

seq_length, n_timesteps, input_size, hidden_size = 2, 3, 4, 10
bias = True
bidirectional = False

# total_params = 1 * (input_size + hidden_size + 2) * hidden_size   (+ 2 only if bias=True; * 2 if bidirectional)
# here: 1 * (4 + 10 + 2) * 10 = 160
RNN = nn.RNN(
    input_size=input_size, 
    hidden_size=hidden_size, 
    num_layers=1, 
    bias=bias, 
    batch_first=True, 
    bidirectional=bidirectional
    )

model_output = summary(
    RNN,
    (seq_length, n_timesteps, input_size),
    verbose=2,
    col_width=16,
    col_names=["input_size", "output_size", "kernel_size", "num_params", "mult_adds"],
)

Layer (type:depth-idx)                   Input Shape      Output Shape     Kernel Shape     Param #          Mult-Adds
└─RNN: 0-1                               [2, 3, 4]        [2, 3, 10]       --               160              140
├─weight_ih_l0                                                             [10, 4]
├─weight_hh_l0                                                             [10, 10]
Total params: 160
Trainable params: 160
Non-trainable params: 0
Total mult-adds (M): 0.00
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00


**GRU**
- Gated Recurrent Unit
    - Gating allows long-term information to pass through unchanged.
- ($R_t$) **Reset gate**: learns how much of the previous hidden state is forgotten or kept when computing the candidate state
    - Reset for different logical chunks in the input
    - Ex. *chapters in a book*
- ($Z_t$) **Update gate**: learns how much new info is added to the current hidden state
    - Don't update for uninformative input
    - Ex. *HTML encodings*
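
To see how the two gates interact, here is a hedged sketch of a single GRU step written out by hand and compared with `nn.GRUCell`. It assumes PyTorch's gate ordering (reset, update, candidate) inside the stacked weight matrices and uses `bias=False` to keep the formulas short; the variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, input_size, hidden_size = 2, 4, 10

cell = nn.GRUCell(input_size, hidden_size, bias=False)
x_t = torch.randn(batch_size, input_size)
h_prev = torch.randn(batch_size, hidden_size)

with torch.no_grad():
    # Stacked weights are split into reset, update and candidate parts.
    W_ir, W_iz, W_in = cell.weight_ih.chunk(3, dim=0)
    W_hr, W_hz, W_hn = cell.weight_hh.chunk(3, dim=0)

    r = torch.sigmoid(x_t @ W_ir.T + h_prev @ W_hr.T)     # reset gate R_t
    z = torch.sigmoid(x_t @ W_iz.T + h_prev @ W_hz.T)     # update gate Z_t
    n = torch.tanh(x_t @ W_in.T + r * (h_prev @ W_hn.T))  # candidate state
    # z close to 1 keeps the old state, z close to 0 takes the candidate.
    h_manual = (1 - z) * n + z * h_prev

    h_torch = cell(x_t, h_prev)

gru_matches = torch.allclose(h_manual, h_torch, atol=1e-5)
print(gru_matches)
```

When the update gate saturates at 1 the previous hidden state is copied through untouched, which is how long-term information can pass unchanged.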

<img src="./image/GRU.png" height="300" />

In [None]:
# GRU
import torch
import torch.nn as nn
from torchinfo import summary

seq_length, n_timesteps, input_size, hidden_size = 2, 3, 4, 10
bias = False
bidirectional = False

# total_params = 3 * (input_size + hidden_size + 2) * hidden_size   (+ 2 only if bias=True; * 2 if bidirectional)
# here (bias=False): 3 * (4 + 10) * 10 = 420
GRU = nn.GRU(
    input_size=input_size, 
    hidden_size=hidden_size, 
    num_layers=1, 
    bias=bias, 
    batch_first=True, 
    bidirectional=bidirectional
    )

model_output = summary(
    GRU,
    (seq_length, n_timesteps, input_size),
    verbose=2,
    col_width=16,
    col_names=["input_size", "output_size", "kernel_size", "num_params", "mult_adds"],
)

Layer (type:depth-idx)                   Input Shape      Output Shape     Kernel Shape     Param #          Mult-Adds
└─GRU: 0-1                               [2, 3, 4]        [2, 3, 10]       --               420              420
├─weight_ih_l0                                                             [30, 4]
├─weight_hh_l0                                                             [30, 10]
Total params: 420
Trainable params: 420
Non-trainable params: 0
Total mult-adds (M): 0.00
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00


**LSTM**
- Long Short-Term Memory
- ($I_t$) **Input gate**: learns how much of the new candidate information is written to the cell state
- ($F_t$) **Forget gate**: learns what needs to be forgotten or kept from the previous cell state
- ($O_t$) **Output gate**: learns how much of the cell state is exposed as the hidden state
- ($C_t$) **Cell state**: keeps track of the state of the cell: what old info should be forgotten, what is pushed through, and what new info is added.
    - Combine forget and update -> updated cell state
- Sigmoid is used to calculate which values to remember and forget (\[0, 1])
- Memory = cell state
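
The gate descriptions above can be sketched as a single hand-written LSTM step and compared with `nn.LSTMCell`. The sketch assumes PyTorch's gate ordering (input, forget, candidate, output) inside the stacked weights and uses `bias=False` for brevity; the variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, input_size, hidden_size = 2, 4, 10

cell = nn.LSTMCell(input_size, hidden_size, bias=False)
x_t = torch.randn(batch_size, input_size)
h_prev = torch.randn(batch_size, hidden_size)
c_prev = torch.randn(batch_size, hidden_size)

with torch.no_grad():
    # Stacked weights are split into input, forget, candidate and output parts.
    W_ii, W_if, W_ig, W_io = cell.weight_ih.chunk(4, dim=0)
    W_hi, W_hf, W_hg, W_ho = cell.weight_hh.chunk(4, dim=0)

    i = torch.sigmoid(x_t @ W_ii.T + h_prev @ W_hi.T)  # input gate I_t
    f = torch.sigmoid(x_t @ W_if.T + h_prev @ W_hf.T)  # forget gate F_t
    g = torch.tanh(x_t @ W_ig.T + h_prev @ W_hg.T)     # candidate cell content
    o = torch.sigmoid(x_t @ W_io.T + h_prev @ W_ho.T)  # output gate O_t

    c_manual = f * c_prev + i * g          # forget old info, add new info
    h_manual = o * torch.tanh(c_manual)    # expose part of the cell state

    h_torch, c_torch = cell(x_t, (h_prev, c_prev))

lstm_matches = (
    torch.allclose(h_manual, h_torch, atol=1e-5)
    and torch.allclose(c_manual, c_torch, atol=1e-5)
)
print(lstm_matches)
```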

<img src="./image/LSTM.png" height="300" />

In [None]:
# LSTM
import torch
import torch.nn as nn
from torchinfo import summary

seq_length, n_timesteps, input_size, hidden_size = 2, 3, 4, 10
bias = False
bidirectional = False

# total_params = 4 * (input_size + hidden_size + 2) * hidden_size   (+ 2 only if bias=True; * 2 if bidirectional)
# here (bias=False): 4 * (4 + 10) * 10 = 560
LSTM = nn.LSTM(
    input_size=input_size, 
    hidden_size=hidden_size, 
    num_layers=1, 
    bias=bias, 
    batch_first=True, 
    bidirectional=bidirectional
    )

model_output = summary(
    LSTM,
    (seq_length, n_timesteps, input_size),
    verbose=2,
    col_width=16,
    col_names=["input_size", "output_size", "kernel_size", "num_params", "mult_adds"],
)

Layer (type:depth-idx)                   Input Shape      Output Shape     Kernel Shape     Param #          Mult-Adds
└─LSTM: 0-1                              [2, 3, 4]        [2, 3, 10]       --               560              3,360
├─weight_ih_l0                                                             [40, 4]
├─weight_hh_l0                                                             [40, 10]
Total params: 560
Trainable params: 560
Non-trainable params: 0
Total mult-adds (M): 0.00
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
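
The parameter-count formula from the code comments can be checked against the three summaries with a few lines of plain Python. This is a sketch for a single-layer network only; the helper name `recurrent_param_count` and its `n_gates` argument are illustrative (1 for RNN, 3 for GRU, 4 for LSTM).

```python
def recurrent_param_count(input_size, hidden_size, n_gates, bias=True, bidirectional=False):
    """Parameter count of a single-layer RNN (n_gates=1), GRU (3) or LSTM (4)."""
    # Each gate has W_xh, W_hh and, if bias=True, two bias vectors.
    per_gate = (input_size + hidden_size + (2 if bias else 0)) * hidden_size
    total = n_gates * per_gate
    return total * 2 if bidirectional else total


rnn_params = recurrent_param_count(4, 10, n_gates=1, bias=True)
gru_params = recurrent_param_count(4, 10, n_gates=3, bias=False)
lstm_params = recurrent_param_count(4, 10, n_gates=4, bias=False)
print(rnn_params, gru_params, lstm_params)  # 160 420 560
```

These match the `Total params` lines reported by `torchinfo` for the three cells.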

