# 11.4.1 Design your RNN (4 points)

**Note: Use the SIC Cluster for this task**


Please create a ```solution.py``` file where you define the following:


1. A ```function``` that uses PyTorch's ```Dataset``` and ```DataLoader``` classes and returns the desired split of the dataset. The function should take ```split``` as one of its arguments, and the call to the ```Dataset``` class should respect this argument. You will need to download the dataset manually first. The function should do the following:
    - Use the ```Large Movie Review Dataset``` dataset. [Link](https://ai.stanford.edu/~amaas/data/sentiment)
    - Create a Dataset object for the requested split
    - Computers don't work with natural language directly, so the text has to be converted to numbers. One option is to use GloVe embeddings for the conversion. Depending on how you choose to do this, you might also have to take care of padding. **Note:** We encourage using the 300d GloVe embeddings.
    - Return the DataLoader object for the specified split
    - **(Optional)** Try one-hot encoding in place of GloVe to see how much GloVe improves the embedding space. There are other (possible but not recommended) ways to build embeddings, such as getting POS tags for each word or using a dictionary to define the polarity of each word.
    
2. Multiple ```class``` definitions for your networks that do the following:
    - Define an RNN class with appropriate layers and hyperparameters
    - Define an LSTM class with appropriate layers and hyperparameters
    - **(Optional)** Implement Bi-LSTM, Bi-RNN, and Bi-GRU and compare them with their unidirectional counterparts
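
One possible shape for the data-loading function in item 1 is sketched below. It is only an illustration under several assumptions: `ReviewDataset`, `pad_collate` and `get_dataloader` are hypothetical names, the reviews are assumed to be already tokenized and mapped to integer indices (with index 0 reserved for padding), and the toy in-memory data stands in for the files you would read from the downloaded dataset.

```python
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence


class ReviewDataset(Dataset):
    """Hypothetical dataset holding (token_ids, label) pairs for one split."""

    def __init__(self, samples):
        # samples: list of (list[int], int); loading from disk is omitted here
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        token_ids, label = self.samples[idx]
        return torch.tensor(token_ids, dtype=torch.long), label


def pad_collate(batch):
    # Pad every review in the batch to the length of the longest one (pad id 0)
    seqs, labels = zip(*batch)
    lengths = torch.tensor([len(s) for s in seqs])
    padded = pad_sequence(list(seqs), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(labels)


def get_dataloader(data_by_split, split="train", batch_size=2):
    # The split argument selects which part of the data the Dataset wraps
    dataset = ReviewDataset(data_by_split[split])
    return DataLoader(dataset, batch_size=batch_size,
                      shuffle=(split == "train"), collate_fn=pad_collate)


# Toy stand-in data: two "train" reviews and one "test" review
toy_data = {
    "train": [([4, 7, 2], 1), ([5, 3], 0)],
    "test": [([9, 1, 1, 6], 1)],
}
padded, lengths, labels = next(iter(get_dataloader(toy_data, split="test")))
```

Returning the original lengths next to the padded batch is a design choice that lets a recurrent model ignore the padded positions later, for example via packed sequences.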


# 11.4.2 (Bonus) Transformers


**Note**: This exercise is mostly devoted to the Transformer model, which will be described during the lecture on the 30th of January.

In this exercise you will use Multi-Head Attention to solve a toy sequence modeling task. The concept of Multi-Head Attention is taken from the well-known paper ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762), which introduced the Transformer model. Please read the paper carefully and answer the questions below. Understanding the concepts described in this paper will help you understand many modern neural network models, and it is also necessary if you choose an NLP project.

If you have trouble understanding the paper, you can read [this blog post](https://jalammar.github.io/illustrated-transformer/) first.

1. The biggest benefit of using Transformers instead of RNN- and convolution-based models is the possibility to parallelize computations during training. Why is parallelization not possible with RNN- and convolution-based models for sequence processing, but possible with Transformers? *Note*: parallelization can be applied only to the Encoder part of the Transformer. (0.5 points)
2. In explaining the concept of self-attention, the paper mentions three matrices `Q`, `K` and `V`, which serve as input to the self-attention sublayer. Explain how these matrices are computed in the encoder and in the decoder. What role does each of these matrices play? (1 point)  
3. How is Multi-Head Attention better than Single-Head Attention? (0.5 points)

### Task description
Given an input sequence `XY[0-5]+` (two digits X and Y followed by a sequence of digits in the range from 0 to 5 inclusive), the task is to count the number of occurrences of X and Y in the remaining substring and then calculate the difference #X - #Y.

Example:  
Input: `1214211`  
Output: `2`  
Explanation: there are 3 `1`'s and 1 `2` in the sequence `14211`, `3-1=2`  
  
The model must learn this relationship between the symbols of the sequence and predict the output. This task can be solved with a multi-head attention network.
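
To make the target concrete, the rule can be written as a few lines of plain Python operating on a string. This is only an illustration of the task definition; `count_difference` is a hypothetical helper name and is not part of the notebook template, which works on tensors instead.

```python
def count_difference(sequence: str) -> int:
    """Return #X - #Y for an input of the form XY[0-5]+."""
    # The first two symbols are X and Y, the rest is the substring to search
    x, y, rest = sequence[0], sequence[1], sequence[2:]
    return rest.count(x) - rest.count(y)


print(count_difference("1214211"))  # 2
```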

In [1]:
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

from IPython.display import Image
from IPython.core.display import HTML 

torch.manual_seed(0)

<torch._C.Generator at 0x11773aa10>

In [2]:
SEQ_LEN = 5
VOCAB_SIZE = 6
NUM_TRAINING_STEPS = 25000
BATCH_SIZE = 64

#### Data generation function
Fill in the code to calculate the ground truth output for the random sequence and store it in `gts`.    

Why are we offsetting the ground truth? In other words, why do we need the ground truth to be non-negative?

In [3]:
# This function generates data samples as described at the beginning of the
# script
def get_data_sample(batch_size=1):
    # torch.randint excludes `high`, so this draws digits 0..VOCAB_SIZE - 1
    random_seq = torch.randint(low=0, high=VOCAB_SIZE,
                               size=[batch_size, SEQ_LEN + 2])
    
    ############################################################################
    # TODO: Calculate the ground truth output for the random sequence and store
    # it in 'gts'.
    ############################################################################
    gts = gts.squeeze()

    # Ensure that GT is non-negative
    ############################################################################
    # TODO: Why is this needed?
    ############################################################################
    gts += SEQ_LEN
    return random_seq, gts

In [None]:
get_data_sample(batch_size=2)

#### Scaled Dot-Product Attention
Implement a naive version of the attention mechanism in the following class. Please do not deviate from the given structure. If you have ideas about how to optimize the implementation, you can note them in a comment or provide an additional implementation.  
In your implementation refer to Section 3.2.1 and Figure 2 (left) in the paper. Keep the parameters to the forward pass trainable.
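
For reference, Section 3.2.1 of the paper defines scaled dot-product attention as follows, where $d_k$ is the dimensionality of the keys and the softmax is taken over the key positions:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$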

In [4]:
class Attention(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, q, k, v):
        # q, k, and v are batch-first
        # TODO: implement
        pass

#### Multi-Head Attention
Implement the Multi-Head Attention mechanism on top of the Single-Head Attention mechanism in the following class. Please do not deviate from the given structure. If you have ideas about how to optimize the implementation, you can note them in a comment or provide an additional implementation.  
In your implementation refer to Section 3.2.2 and Figure 2 (right) in the paper. Keep the parameters to the forward pass trainable.
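
A step that often causes trouble in multi-head attention is the reshaping that splits the embedding dimension across heads and merges it back afterwards. The sketch below shows only that bookkeeping on a random tensor, not the attention computation itself; `split_heads` and `merge_heads` are hypothetical helper names, and the sizes are arbitrary example values.

```python
import torch


def split_heads(x, num_heads):
    # [bsize, seq_len, embed_dim] -> [bsize, num_heads, seq_len, dim_r]
    bsize, seq_len, embed_dim = x.shape
    dim_r = embed_dim // num_heads
    return x.view(bsize, seq_len, num_heads, dim_r).transpose(1, 2)


def merge_heads(x):
    # [bsize, num_heads, seq_len, dim_r] -> [bsize, seq_len, embed_dim]
    bsize, num_heads, seq_len, dim_r = x.shape
    return x.transpose(1, 2).reshape(bsize, seq_len, num_heads * dim_r)


x = torch.randn(2, 7, 64)            # example: batch of 2, 7 tokens, 64 features
heads = split_heads(x, num_heads=4)  # shape [2, 4, 7, 16]
restored = merge_heads(heads)        # shape [2, 7, 64], identical to x
```

Splitting and merging are exact inverses, so checking that the round trip reproduces the input is a cheap sanity test before wiring in the attention itself.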

In [5]:
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()

        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.dim_r = self.embed_dim // self.num_heads   # to evenly split q, k, and v across heads.
        self.dotatt = Attention()

        self.q_linear_proj = nn.Linear(self.embed_dim, self.embed_dim)
        self.k_linear_proj = nn.Linear(self.embed_dim, self.embed_dim)
        self.v_linear_proj = nn.Linear(self.embed_dim, self.embed_dim)
        self.final_linear_proj = nn.Linear(self.embed_dim, self.embed_dim)
        
        # xavier initialization for linear layer weights
        nn.init.xavier_uniform_(self.q_linear_proj.weight)
        nn.init.xavier_uniform_(self.k_linear_proj.weight)
        nn.init.xavier_uniform_(self.v_linear_proj.weight)
        nn.init.xavier_uniform_(self.final_linear_proj.weight)

    def forward(self, q, k, v):
        # q, k, and v are batch-first

        ########################################################################
        # TODO: Implement multi-head attention as described in Section 3.2.2
        # of the paper.
        ########################################################################
        # shapes of q, k, v are [bsize, SEQ_LEN + 2, hidden_dim]
        bsize = k.shape[0]

        pass

#### Encoding Layer
Implement the Encoding Layer of the network.  
Refer to the following figure from the paper for the architecture of the Encoding layer (left part of the figure). 

In [6]:
Image(url='https://i.stack.imgur.com/eAKQu.png')

In [7]:
class EncodingLayer(nn.Module):
    def __init__(self, num_hidden, num_heads):
        super().__init__()

        self.att = MultiHeadAttention(embed_dim=num_hidden, num_heads=num_heads)
        # TODO: add necessary member variables
    def forward(self, x):
        x = self.att(x, x, x)
        pass

#### Network definition
Implement the forward pass of the complete network.
The network must do the following:
1. calculate embeddings of the input (with the size equal to `num_hidden`)
2. perform positional encoding
3. perform forward pass of a single Encoding layer
4. perform forward pass of a single Decoder layer
5. apply fully connected layer on the output

Because the task is quite simple, the whole Decoder layer can be replaced with a single MultiHeadAttention block. Since our task is not sequence-to-sequence but rather the classification of a sequence, the query (`Q` matrix) for the MultiHeadAttention block can be another learnable parameter (`nn.Parameter`) instead of a processed output embedding.
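
To see how a single learnable query turns a whole sequence into one vector, the sketch below uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in for the class you implement above. The variable names and sizes are illustrative assumptions and are not part of the exercise template.

```python
import torch
import torch.nn as nn

num_hidden, num_heads, bsize, seq_len = 64, 4, 3, 7

# Built-in attention used only as a stand-in for your own MultiHeadAttention
mha = nn.MultiheadAttention(num_hidden, num_heads, batch_first=True)

# One learnable query vector, shared by all samples in the batch
q = nn.Parameter(torch.randn(1, num_hidden))

encoded = torch.randn(bsize, seq_len, num_hidden)    # pretend encoder output
query = q.unsqueeze(0).expand(bsize, -1, -1)         # [bsize, 1, num_hidden]
pooled, attn_weights = mha(query, encoded, encoded)  # keys = values = encoded

print(pooled.shape)        # torch.Size([3, 1, 64])
print(attn_weights.shape)  # torch.Size([3, 1, 7])
```

Because there is exactly one query per sample, the output has sequence length 1 and can be squeezed into a `[bsize, num_hidden]` tensor before the final fully connected layer.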

In the forward pass we must add a (trainable) positional encoding to our input embedding. Why is this needed? Can you think of another similar task where the positional encoding would not be necessary?  
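
A trainable positional encoding can be as simple as one learned vector per position that is added to the token embeddings. The following is a minimal sketch under that assumption; `PositionalEmbedding` is a hypothetical name, the initialization scale is an arbitrary choice, and the sizes mirror the constants used in this notebook.

```python
import torch
import torch.nn as nn


class PositionalEmbedding(nn.Module):
    """Token embedding plus a learned vector for each position (a sketch)."""

    def __init__(self, vocab_size, max_len, num_hidden):
        super().__init__()
        self.tokens = nn.Embedding(vocab_size, num_hidden)
        # One trainable vector per position, broadcast over the batch
        self.positions = nn.Parameter(torch.zeros(1, max_len, num_hidden))
        nn.init.normal_(self.positions, std=0.02)

    def forward(self, x):
        # x: [bsize, seq_len] of token ids -> [bsize, seq_len, num_hidden]
        return self.tokens(x) + self.positions[:, :x.shape[1]]


emb = PositionalEmbedding(vocab_size=6, max_len=7, num_hidden=64)
same_tokens = torch.tensor([[3, 3, 3, 3, 3, 3, 3]])
out = emb(same_tokens)  # identical tokens now differ by position
```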

In [8]:
# Network definition
class Net(nn.Module):
    def __init__(self, num_encoding_layers=1, num_hidden=64, num_heads=4):
        super().__init__()
        
        q = torch.empty([1, num_hidden])
        nn.init.normal_(q)
        self.q = nn.Parameter(q, requires_grad=True)
        
        # TODO: implement

    def forward(self, x):
        # TODO: implement
        pass

#### Training
Don't edit the following 2 cells. They must run without errors if you implemented the model correctly.  
The model should converge to nearly 100% accuracy after ~4.5k steps.

In [None]:
# Instantiate network, loss function and optimizer
net = Net()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.005, momentum=0.9)

In [None]:
# Train the network
for i in range(NUM_TRAINING_STEPS):
    inputs, labels = get_data_sample(BATCH_SIZE)

    optimizer.zero_grad()
    outputs = net(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    accuracy = (torch.argmax(outputs, axis=-1) == labels).float().mean()

    if i % 100 == 0:
        print('[%d/%d] loss: %.3f, accuracy: %.3f' %
              (i , NUM_TRAINING_STEPS - 1, loss.item(), accuracy.item()))
    if i == NUM_TRAINING_STEPS - 1:
        print('Final accuracy: %.3f, expected %.3f' %
              (accuracy.item(), 1.0))

Briefly analyze the results you get. Does the model learn the underlying pattern in all the sequences? How can we improve the results / speed up the training process?