$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
$$

# CS236605: Deep Learning
# Tutorial 5: Recurrent Neural Networks

## Introduction

In this tutorial, we will cover:

TODO

In [245]:
# Setup
%matplotlib inline
import os
import sys
import torch
import matplotlib.pyplot as plt

plt.rcParams['font.size'] = 20
data_dir = os.path.expanduser('~/.pytorch-datasets')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

## Theory Reminders

Thus far, our models have been composed of fully connected (linear) layers or convolutional layers.

- Fully connected layers
    - Each layer $l$ operates on the output of the previous layer ($\vec{y}_{l-1}$) and calculates,
        $$
        \vec{y}_l = \varphi\left( \mat{W}_l \vec{y}_{l-1} + \vec{b}_l \right),~
        \mat{W}_l\in\set{R}^{n_{l}\times n_{l-1}},~ \vec{b}_l\in\set{R}^{n_l}.
        $$
    - FC's have completely pre-fixed input and output dimensions.
    
    <img src="img/mlp.png" />

- Convolutional layers
    - Each layer operates on an input tensor $\vec{x}$ containing $M$ feature maps. The $k$-th feature map of the output tensor $\vec{y}$ is:
        $$
        \vec{y}^k = \sum_{m=1}^{M} \vec{w}^{km}\ast\vec{x}^m+b^k,\ k\in[1,K]
        $$
      Where $\ast$ denotes convolution, and $K$ is the number of output feature maps.
      
      <img src="img/cnn_filters.png" width="500"/>
    - This time the weight dimensions are not dependent on the input dimensions.
    - Weights are shared across the spatial dimensions of the input.
    - Output dimension changes based on input dimension.


However,
- Models based on these types of layers lack **persistent state**. 
- The current output is not affected by **previous inputs** (or outputs).

How can we model a dynamical system?
E.g., a linear system such as
$$\vec{y}_t = a_0 + a_1 \vec{y}_{t-1}+\dots+a_P \vec{y}_{t-P} + b_0 \vec{x}_t+\dots+b_{t-Q}\vec{x}_{t-Q}$$

Many use cases and examples: text translation, sentiment classification, scene analysis in video, etc.

## Recurrent layers

An RNN layer is similar to a regular FC layer, but it has two inputs:
- Current sample, $\vec{x}_t \in\set{R}^{d_{i}}$.
- Previous **state**, $\vec{h}_{t-1}\in\set{r}^{d_{h}}$.

and it produces two outputs which depend on both:
- Current layer output, $\vec{y}_t\in\set{R}^{d_o}$.
- Current **state**, $\vec{h}_{t}\in\set{r}^{d_{h}}$.

<img src="img/rnn_cell.png" width="300"/>

Crucially,
- The function $\varphi(\cdot)$ itself is not time-dependent (but is parametrized).
- The same layer (function) is applied at successive time steps, propagating the hidden state.

A basic RNN can be defined as follows.

$$
\begin{align}
\forall t \geq 0:\\
\vec{h}_t &= \varphi_h\left( \mat{W}_{hh} \vec{h}_{t-1} + \mat{W}_{xh} \vec{x}_t + \vec{b}_h\right) \\
\vec{y}_t &= \varphi_y\left(\mat{W}_{hy}\vec{h}_t + \vec{b}_y \right)
\end{align}
$$

where,
- $\vec{x}_t \in\set{R}^{d_{i}}$ is the input at time $t$.
- $\vec{h}_{t-1}\in\set{R}^{d_{h}}$ is the **hidden state** of a fixed dimension.
- $\vec{y}_t\in\set{R}^{d_o}$ is the output at time $t$.
- $\mat{W}_{hh}\in\set{R}^{d_h\times d_h}$, $\mat{W}_{xh}\in\set{R}^{d_h\times d_i}$, $\mat{W}_{hy}\in\set{R}^{d_o\times d_h}$, $\vec{b}_h\in\set{R}^{d_h}$ 
- $\varphi_h$ and $\varphi_y$ are some non-linear functions. In many cases $\varphi_y$ is not used.

and $\vec{b}_y\in\set{R}^{d_o}$ are the model weights and biases.

### Modeling time-dependence

If we imagine **unrolling** a single RNN layer through time,
<img src="img/rnn_unrolled.png" width="800" />

We can see how late outputs can now be influenced by early inputs, through the hidden state.

How would **backpropagation** work, though?

RNN models are very flexible in terms of input and output meaning.

Common applications include image captioning, sentiment analysis, machine translation and more. 

<img src="img/rnn_use_cases.jpeg" width="800"/>


### Layered RNN

RNNs layers can be stacked to build a deep RNN model.

<img src="img/rnn_layered.png" width="800"/>

- As with MLPs, adding depth allows us to model intricate hierarchical features.
- However, now we also have a time dimension which makes the representation time-dependent.

## RNN Implementation

Based on the above equaitions, let's create a simple layer RNN with PyTorch.

In [80]:
import torch.nn as nn

class RNNLayer(nn.Module):
    def __init__(self, in_dim, h_dim, out_dim, phi_h=torch.tanh, phi_y=torch.sigmoid):
        super().__init__()
        self.phi_h, self.phi_y = phi_h, phi_y
        
        self.fc_xh = nn.Linear(in_dim, h_dim, bias=False)
        self.fc_hh = nn.Linear(h_dim, h_dim, bias=True)
        self.fc_hy = nn.Linear(h_dim, out_dim, bias=True)
        
    def forward(self, xt, h_prev=None):
        if h_prev is None:
            h_prev = torch.zeros(xt.shape[0], self.fc_hh.in_features)
        
        ht = self.phi_h(self.fc_xh(xt) + self.fc_hh(h_prev))
        
        yt = self.fc_hy(ht)
        
        if self.phi_y is not None:
            yt = self.phi_y(yt)
        
        return yt, ht
        

In [85]:
# Instantiate our model

N = 3 # batch size
in_dim, h_dim, out_dim = 1024, 128, 1

rnn = RNNLayer(in_dim, h_dim, out_dim)
rnn

RNNLayer(
  (fc_xh): Linear(in_features=1024, out_features=128, bias=False)
  (fc_hh): Linear(in_features=128, out_features=128, bias=True)
  (fc_hy): Linear(in_features=128, out_features=1, bias=True)
)

In [86]:
# Manually "run" a few time steps

# t=1
x1 = torch.randn(N, in_dim)
y1, h1 = rnn(x1)
print(f'y1: {y1}')

# t=2
x2 = torch.randn(N, in_dim)
y2, h2 = rnn(x2, h1)
print(f'y2: {y2}')

# t=3
x3 = torch.randn(N, in_dim)
y3, h3 = rnn(x3, h2)
print(f'y3: {y3}')

y1: tensor([[0.4339],
        [0.4022],
        [0.5431]], grad_fn=<SigmoidBackward>)
y2: tensor([[0.4161],
        [0.4258],
        [0.4816]], grad_fn=<SigmoidBackward>)
y3: tensor([[0.5188],
        [0.5181],
        [0.5371]], grad_fn=<SigmoidBackward>)


In [5]:
print(y3.shape, h3.shape)

torch.Size([3, 1]) torch.Size([3, 128])


## Application example: Sentiment analysis for movie reviews

The task: Given a review about a movie written by some user, decide whether it's **positive**, **negative** or **neutral**.

<img src="img/sentiment_analysis.png" width="300" />


Classically this is considered a challenging task if approached based on keywords alone.

Consider:

     "This movie was actually neither that funny, nor super witty."
     
To comprehend such a sentence, it's intuitive to see that some "state" must be kept when reading it.

### Dataset

We'll use the [`torchtext`](https://github.com/pytorch/text) package, which provides useful tools for working ith textual data, and also includes some built-in datasets and dataloaders (similar to `torchvision`).

Out dataset will be the [Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/treebank.html) (SST) dataset, which contains ~10,000 labeled movie reviews.


#### Loading and tokenizing text samples

The `torchtext.data.Field` class takes care of splitting text into unique "tokens"
(~words) and converting it a numerical representation as a sequence of numbers representing
the tokens in the text.

In [228]:
from torchtext import data
from torchtext import datasets

# This Field object will be used for tokenizing the movie reviews text
TEXT = data.Field(tokenize='spacy', sequential=True, use_vocab=True, )

# This Field object converts the labels into tokens
LABEL = data.LabelField()

# Load SST, tokenize the samples and populate our Field objects
# (ds_X are Dataset objects)
ds_train, ds_valid, ds_test = datasets.SST.splits(TEXT, LABEL, root=data_dir)

n_train = len(ds_train)
print(f'Number of training samples: {n_train}')

Number of training samples: 8544


Lets print some examples from our training data:

In [229]:
for i in ([111, 7777]):
    print(f'sample#{i}: [{ds_train[i].label}] {str.join(" ", ds_train[i].text)}')

sample#111: [positive] The film aims to be funny , uplifting and moving , sometimes all at once .
sample#7777: [negative] An ugly , revolting movie .


#### Building a vocabulary

The `Field` object can build a **vocabulary** for us,
which is simply a bi-directional mapping between a unique index to the word.

We'll only include words from the training set in our vocabulary.

In [241]:
TEXT.build_vocab(ds_train)
LABEL.build_vocab(ds_train)

print(f"Number of tokens in training samples: {len(TEXT.vocab)}")
print(f"Number of tokens in training labels: {len(LABEL.vocab)}")

Number of tokens in training samples: 17200
Number of tokens in training labels: 3


In [242]:
print(f'first 20 tokens:\n', TEXT.vocab.itos[:20], end='\n\n')
print(f'index of "film":', TEXT.vocab.stoi['film'])

first 20 tokens:
 ['<unk>', '<pad>', '.', ',', 'the', 'and', 'of', 'a', 'to', '-', "'s", 'is', 'that', 'in', 'it', 'The', 'as', 'film', 'but', 'with']

index of "film": 17


Note the **special tokens**, `<unk>` and `<pad>` at index 0 and 1. These were automatically created by the tokenizer.

In [248]:
print(f'labels vocab:\n', dict(LABEL.vocab.stoi))

labels vocab:
 {'positive': 0, 'negative': 1, 'neutral': 2}


#### Data loaders (iterators)

The `torchtext` package comes with `Iterator`s, similar to the `DataLoaders` we previously worked with.

A key issue when working with text sequences is that each sample is of a different length.

So, how can we work with **batches** of data?

In [250]:
BATCH_SIZE = 4

# BucketIterator is supposed to created batches with samples of similar length
# to minimize the number of <pad> tokens in the batch.
dl_train, dl_valid, dl_test = data.BucketIterator.splits(
    (ds_train, ds_valid, ds_test), 
    batch_size = BATCH_SIZE,
    shuffle=True,
    device = device)

Lets look at a single batch.

In [260]:
batch = next(iter(dl_train))

X, y = batch.text, batch.label
print('X = \n', X, X.shape, end='\n\n')
print('y = \n', y, y.shape)

X = 
 tensor([[  489,   336,    23,    69],
        [   45,     4, 15958,    11],
        [ 3795,   912, 11368,   828],
        [    2,  1025,    21,     3],
        [    1,   284,  4603,   950],
        [    1,   133,  3225,  1704],
        [    1,    70,    12,     2],
        [    1,   790,    92,     1],
        [    1,     3,   197,     1],
        [    1,    18,  5277,     1],
        [    1,    82,     4,     1],
        [    1,   207,  3605,     1],
        [    1,   265,     6,     1],
        [    1,    14,    22,     1],
        [    1,     7,  2909,     1],
        [    1,  6912,     2,     1],
        [    1,     2,     1,     1]]) torch.Size([17, 4])

y = 
 tensor([2, 0, 1, 0]) torch.Size([4])


What are we looking at?

Our sample tensor `X` is of shape `(sentence_length, batch_size)`.

Note that `sentence_length` changes every batch!

### Model

We'll now create our sentiment analysis model based on the simple `RNNLayer` we've implemented above.

The model will:
- Take an input batch of tokenized sentences.
- Compute a dense word-embedding of each token.
- Process the sentence **sequentially** through the RNN layer.
- Produce a `(N, 3)` tensor for each sentence which we'll interpret as class probabilities.

In [262]:
class SentimentRNN(nn.Module):
    def __init__(self, in_dim, embedding_dim, h_dim, out_dim):
        super().__init__()
        
        # nn.Embedding converts from token index to dense tensor
        self.embedding = nn.Embedding(in_dim, embedding_dim)
        
        # Our custom RNN layer without phi_y outputs a class score
        self.rnn = RNNLayer(embedding_dim, h_dim, out_dim, phi_y=None)
        
        # To convert class scores to log-probability we'll apply log-softmax
        self.log_softmax = nn.LogSoftmax(dim=0)
        
    def forward(self, X):
        # X shape: (S, B)
        
        embedded = self.embedding(X)
        # embedded shape: (S, B, E)
        
        # Loop over (batch of) tokens in the sentence(s)
        ht = None
        for xt in embedded:
            yt, ht = self.rnn(xt, ht)
        
        # Class scores to log-probability
        yt_log_proba = self.log_softmax(yt)
        
        return yt_log_proba

In this model, what should the `input_dim` be?

In [263]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = 3

model = SentimentRNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)
model

SentimentRNN(
  (embedding): Embedding(17200, 100)
  (rnn): RNNLayer(
    (fc_xh): Linear(in_features=100, out_features=128, bias=False)
    (fc_hh): Linear(in_features=128, out_features=128, bias=True)
    (fc_hy): Linear(in_features=128, out_features=3, bias=True)
  )
  (log_softmax): LogSoftmax()
)

Test a manual forward pass:

In [265]:
print(f'model(X) = \n', model(X), model(X).shape)
print(f'labels = ', y)

model(X) = 
 tensor([[-1.3864, -1.2864, -1.3838],
        [-1.4032, -1.9383, -1.3257],
        [-1.3695, -1.1922, -1.4561],
        [-1.3863, -1.2865, -1.3838]], grad_fn=<LogSoftmaxBackward>) torch.Size([4, 3])
labels =  tensor([2, 0, 1, 0])


How big is our model?

In [266]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable weights.')

The model has 1,749,699 trainable weights.


Why so many? We used only one RNN layer.

Where are most of the weights?

### Training

Let's complete the example by showing the regular pytorch-style train loop with this model.

We'll run only a few epochs on a small subset just to test that it works.

In [269]:
import torch.optim as optim

model = SentimentRNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM).to(device)

optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, nesterov=True)

# Recall: LogSoftmax + NLL is equiv to CrossEntropy on the class scores
loss_fn = nn.NLLLoss()

In [270]:
for epoch_idx in range(3):
    total_loss = 0
    num_correct = 0
    max_batches = 200
    
    for batch_idx, batch in enumerate(dl_train):
        X, y = batch.text, batch.label
        
        # Forward pass
        y_pred_log_proba = model(X)
        
        # Backward pass
        optimizer.zero_grad()
        loss = loss_fn(y_pred_log_proba, y)
        loss.backward()
        
        # Weight updates
        optimizer.step()
        
        # Calculate accuracy
        total_loss += loss.item()
        y_pred = torch.argmax(y_pred_log_proba, dim=1)
        num_correct += torch.sum(y_pred == y).float().item()
        
        if batch_idx == max_batches-1:
            break
    print(f"Epoch #{epoch_idx}, loss={total_loss /(max_batches)}, accuracy={num_correct /(max_batches*BATCH_SIZE)}")

Epoch #0, loss=1.3987759047746657, accuracy=0.30625
Epoch #1, loss=1.3964005160331725, accuracy=0.2925
Epoch #2, loss=1.3910330873727799, accuracy=0.28875


#### Limitations

As usual this is a very naïve model, just for demonstration.
It lacks many tricks of the NLP trade, such was pre-trained embeddings,
gated RNN units, deep or bi-directional models, dropout, etc.

Don't expect SotA results :)

## Attention

Intuitively, some parts of the input may be more important than others.

An **Attention** mechanism, allows the model to "focus" on, i.e. give a *greater weight* to
different parts of the input or some pther intermetiate part of the model.

Example from an image captioning [paper](https://arxiv.org/pdf/1502.03044.pdf) (K. Xu et al. 2015):

<img src="img/attn_ic1.png" width="800"/>

<img src="img/attn_ic2.png" width="700"/>


### Input soft attention

One place to apply attention is to the **input features**.

In the context of our RNN model, we can change it's hidden state update to:


$$
\begin{align}
\vec{a}_t &= \sigma\left( \mat{W}_{ha} \vec{h}_{t-1} + \mat{W}_{xa} \vec{x}_t+ \vec{b}_a\right) \\
\vec{g}_t &= \mathrm{softmax}(\alpha \vec{a}_t) \\
\vec{h}_t &= \varphi_h\left( \mat{W}_{hh} \vec{h}_{t-1} + \mat{W}_{xh} (\vec{x}_t \odot \vec{g}_t)+ \vec{b}_h\right) \\
\end{align}
$$


In [283]:
import torch.nn as nn

class RNNLayerInputAttn(nn.Module):
    def __init__(self, in_dim, h_dim, out_dim, phi_h=torch.tanh, phi_y=torch.sigmoid):
        super().__init__()
        self.phi_h, self.phi_y = phi_h, phi_y
        
        # Attention parameters
        self.fc_xa = nn.Linear(in_dim, in_dim, bias=False)
        self.fc_ha = nn.Linear(h_dim, in_dim, bias=True)
        
        # Regular RNN parameters
        self.fc_xh = nn.Linear(in_dim, h_dim, bias=False)
        self.fc_hh = nn.Linear(h_dim, h_dim, bias=True)
        self.fc_hy = nn.Linear(h_dim, out_dim, bias=True)
        
    def forward(self, xt, h_prev=None):
        if h_prev is None:
            h_prev = torch.zeros(xt.shape[0], self.fc_hh.in_features)
            
        # Calculate the attention gating gt: a weight for each feature of x
        at = torch.sigmoid(self.fc_xa(xt) + self.fc_ha(h_prev))
        gt = torch.softmax(at, dim=1)
        
        # Apply regular RNN with gated input
        ht = self.phi_h(self.fc_xh(xt * gt) + self.fc_hh(h_prev))
        
        yt = self.fc_hy(ht)
        
        if self.phi_y is not None:
            yt = self.phi_y(yt)
        
        return yt, ht
        

We can interpret this as a soft (differentiable) gating of the input.

This makes sense for image captioning, where we want to emphasize image regions based on their feature maps.

What about our sentiment analysis task?

**Image credits**

Some images in this tutorial were taken and/or adapted from:

- Fundamentals of Deep Learning, Nikhil Buduma, Oreilly 2017
- Andrej Karpathy, http://karpathy.github.io
- K. Xu et al. 2015, https://arxiv.org/abs/1502.03044