The most recent version of this notebook is available at https://github.com/nadiinchi/dl_labs/blob/master/lab_attention.ipynb

This notebook is a practical introduction to attention mechanisms.
It uses a toy problem and a small architecture, so everything can be implemented and computed on a CPU.

The task at hand is the multiplication (composition) of permutations.
The length of the permutations is fixed and equals perm_len.
Input is an integer vector of length 2 x perm_len which contains two concatenated permutations p1 and p2.
The product of p1 and p2 is a permutation p3 for which p3[i] = p1[p2[i]].
The output of the network is also an integer vector of length 2 x perm_len, in which the first perm_len elements are zero and the last perm_len elements are the permutation p3.

Example for perm_len = 5:
```
Input sequence:  3 4 2 1 0 1 3 0 2 4
Output sequence: 0 0 0 0 0 4 1 3 2 0
Clarification:  p1 = 3 4 2 1 0,    p2 = 1 3 0 2 4   =>    p3 = 4 1 3 2 0
```
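
The example above can be checked with a short NumPy sketch. It relies on integer-array indexing, so `p1[p2]` picks `p1[p2[i]]` for every `i`. The variable names `input_seq` and `output_seq` are illustrative and do not come from the notebook.

```python
import numpy as np

p1 = np.array([3, 4, 2, 1, 0])
p2 = np.array([1, 3, 0, 2, 4])

# Integer-array indexing computes p3[i] = p1[p2[i]] for every i at once.
p3 = p1[p2]
print(p3)  # [4 1 3 2 0]

# The network input is [p1, p2]; the target is perm_len zeros followed by p3.
input_seq = np.concatenate([p1, p2])
output_seq = np.concatenate([np.zeros_like(p3), p3])
```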

Theoretically, such a problem can be solved by an ordinary LSTM, which first memorizes the permutation p1 in its hidden state and then, while passing through the permutation p2, outputs the corresponding elements of p1.
In practice, however, such a model works noticeably worse than a model with attention. A model with attention learns explicitly, while going through the permutation p2, to attend to the required element of the permutation p1 and to output it.

The task is to implement and compare several types of attention used in real-life problems.
You should also implement and use the positional encoding described in the Transformer paper.
These layers and the model are described in more detail below.

In [None]:
import torch
from torch import nn
from torch import optim
import numpy as np
import math

Below, you are asked to implement several attention models described in different papers.
In general, there are $K$ objects that can be attended to.
Each object is characterized by a key $k_i$ and a value $v_i$.
The attention layer processes queries.
For a query $q$, the layer returns a weighted sum of the values of the objects, where the weights are a softmax over how well each key matches the query:
$$w_i = \frac{\exp(score(q, k_i))}{\sum_{j=1}^K\exp(score(q, k_j))}$$
$$a = \sum_{i=1}^K w_i v_i$$
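
As a rough illustration of the two formulas above, here is a NumPy sketch for a single query. The dot product is used as a stand-in for $score$ (this is an assumption; the specific score functions are introduced below), and the helper name `attend_single` is hypothetical.

```python
import numpy as np

def attend_single(q, keys, values):
    """q: [d], keys: [K, d], values: [K, d_v]. Returns (weights [K], a [d_v])."""
    scores = keys @ q                 # assumed score(q, k_i) = q^T k_i
    scores = scores - scores.max()    # shifting all scores does not change the softmax
    w = np.exp(scores)
    w = w / w.sum()                   # w_i = exp(score_i) / sum_j exp(score_j)
    a = w @ values                    # a = sum_i w_i v_i
    return w, a

keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0], [20.0], [30.0]])

# A zero query matches every key equally, so the weights are uniform
# and the result is the plain average of the values.
w, a = attend_single(np.array([0.0, 0.0]), keys, values)
```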

Almost always, queries, keys, and values are real vectors of some fixed dimensions.
In this assignment you are asked to implement three types of attention:
+ (optional!) Additive Attention.
Defined by the function $score(q, k) = w_3^T \tanh(W_1 q + W_2 k)$, where $W_1, W_2, w_3$ are the trainable parameters of the attention layer.
For such a function, the query and key may have different dimensions.
Matrices $W_1$ and $W_2$ map the query and key into a common hidden space, whose dimension coincides with the dimension of the vector $w_3$.
The dimension of the hidden space can be chosen arbitrarily and is a hyperparameter of the layer.
For more information see the paper Bahdanau et al. "Neural Machine Translation by Jointly Learning to Align and Translate", 2014.
+ Multiplicative Attention.
Defined by the function $score(q, k) = q^Tk$.
This type of attention requires the query and the key to have the same dimension.
For more information see the paper Luong et al. "Effective approaches to attention-based neural machine translation", 2015.
+ Scaled Dot Product Attention.
Defined by function $score(q, k) = \frac{q^Tk}{\sqrt{dim(k)}}$, where $dim(k)$ is the dimensionality of the key (which also equals the dimensionality of the query).
With learnable queries or keys, such attention is equivalent to the multiplicative attention described above, since the constant factor $1 / \sqrt{dim(k)}$ can be absorbed into the learned parameters.
However, immediately after initialization, the scaling encourages smoother weights, which alleviates the problem of small gradients for a saturated SoftMax.
For more information see the paper Vaswani et al. "Attention Is All You Need", 2017.
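
The smoothing effect of the scaling can be seen in a small NumPy experiment. This is only a sketch under stated assumptions: random Gaussian vectors stand in for freshly initialized queries and keys, and `dim = 64` and `num_keys = 10` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_keys = 64, 10
q = rng.standard_normal(dim)
keys = rng.standard_normal((num_keys, dim))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

raw_scores = keys @ q                      # multiplicative attention: q^T k
scaled_scores = raw_scores / np.sqrt(dim)  # scaled dot-product attention

w_raw = softmax(raw_scores)
w_scaled = softmax(scaled_scores)

# Dividing the scores by a constant greater than 1 flattens the softmax,
# so the largest weight can only decrease.
print(w_raw.max(), w_scaled.max())
```

Without scaling, the dot product of two such vectors has a standard deviation of about $\sqrt{dim}$, so a single key tends to take almost all of the weight.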

In practice, an architecture in which the keys and values of the objects are the same vectors is used.
In the prototypes below, you are asked to implement such an architecture.
Since keys and values are the same, they are passed to the function only once and are called the objects' features
(i.e. $f_i := k_i = v_i$).

Also, for flexibility of the interface and to speed up training, all attention layers below receive several queries for each object of the batch.

The Attention class is the parent of the AdditiveAttention, MultiplicativeAttention, and ScaledDotProductAttention classes.
It is an abstract class, so it is never used as a layer itself; only its subclasses are used as layers.

In the Attention class, you need to implement the function attend, which accepts a set of object features and a set of queries for each batch object.
The function attend calls get_scores, which is implemented in every subclass, and then uses the obtained $score(q, f)$ values to compute $w$ and $a$.

Mask is an attention mask that shows, for each query, which objects it cannot attend to.
It was proposed in the paper Vaswani et al. "Attention Is All You Need", 2017.
It is used to ensure that the model remains autoregressive, that is, that the layer output for the $i$-th position does not depend on the input values of subsequent positions.
In the function get_autoregressive_mask, you need to construct such a square mask of a given size.
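
One possible layout of such a mask is sketched below in NumPy. It assumes that `True` marks a forbidden position and that position $i$ is allowed to attend to itself; both conventions are assumptions, and the function name `causal_mask_sketch` is hypothetical.

```python
import numpy as np

def causal_mask_sketch(size):
    # mask[i, j] is True when query i must NOT attend to object j, i.e. when j > i.
    return np.triu(np.ones((size, size), dtype=bool), k=1)

mask = causal_mask_sketch(4)
print(mask.astype(int))
# [[0 1 1 1]
#  [0 0 1 1]
#  [0 0 0 1]
#  [0 0 0 0]]
```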

The most numerically stable way to apply the mask in attend is to set the corresponding score values to -float('inf') before applying SoftMax.
An alternative is to zero the weights $w$ according to the mask and renormalize them, but this method is less numerically stable (think about why).
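
One situation in which the two approaches differ is sketched below in NumPy. The scores are deliberately extreme and chosen only for illustration: the forbidden object has a much higher score than the allowed ones.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([0.0, 0.0, 1000.0])
forbidden = np.array([False, False, True])

# Option 1: set forbidden scores to -inf, then apply softmax.
w_masked_first = softmax(np.where(forbidden, -np.inf, scores))

# Option 2: apply softmax, zero the forbidden weights, renormalize.
w = softmax(scores)                  # the allowed weights underflow to exactly 0
w = np.where(forbidden, 0.0, w)
with np.errstate(invalid="ignore"):
    w_renormalized = w / w.sum()     # 0 / 0 gives nan

print(w_masked_first)   # [0.5 0.5 0. ]
print(w_renormalized)   # [nan nan nan]
```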

In each of the classes AdditiveAttention, MultiplicativeAttention, and ScaledDotProductAttention, you need to implement the function get_scores, which, for each element of the batch and each of its queries, returns the query's similarity to every object of that same batch element.
The get_scores code should be equivalent to the following:
```
res = torch.zeros(batch_size, num_queries, num_objects)
for i in range(batch_size):
    for j in range(num_queries):
        for k in range(num_objects):
            res[i, j, k] = score(queries[i, j], features[i, k])
```
Naturally, the above code is only an illustration explaining the dimensions of the arguments and the output of the get_scores function.
Your implementation must be efficiently vectorized.
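
To make the idea of vectorization concrete, the NumPy sketch below compares the triple loop with a single batched matrix product for the dot-product score. It is an illustration only, with arbitrary small sizes. In PyTorch, torch.bmm applied to the queries and the transposed features plays the same role.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_queries, num_objects, dim = 2, 3, 4, 5
queries = rng.standard_normal((batch_size, num_queries, dim))
features = rng.standard_normal((batch_size, num_objects, dim))

# Reference: the triple loop with score(q, f) = q^T f.
res = np.zeros((batch_size, num_queries, num_objects))
for i in range(batch_size):
    for j in range(num_queries):
        for k in range(num_objects):
            res[i, j, k] = queries[i, j] @ features[i, k]

# Vectorized: [B x Q x D] times [B x D x N] gives [B x Q x N] in one call.
vectorized = queries @ features.transpose(0, 2, 1)
print(np.allclose(res, vectorized))  # True
```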

Implementation hints:

+ In class AdditiveAttention, you need the learnable parameters $W_1$, $W_2$, and $w_3$.
  * For those who want to practice writing their own learnable layers: remember that a plain tensor stored as an attribute of an nn.Module subclass is not listed in .parameters() for an object of this class, so gradient descent will not affect this tensor.
In order for the tensor to appear in .parameters(), you need to wrap it in nn.Parameter().
Before wrapping, the tensor should be initialized using one of the standard PyTorch initializations.
Note that initializations in PyTorch are in-place, that is, you first need to create a tensor and then pass it to the initialization function.
  * For those who do not wish to practice writing their own learnable layers: recall that multiplying by a matrix is equivalent to applying nn.Linear(.., bias=False).
Thus, the score function can be implemented easily using three linear layers corresponding to $W_1$, $W_2$, and $w_3$.
+ Take a look at the function torch.bmm; it can be useful in many of the attention layers below.
+ To visualize the attention map at the end of the notebook, you need to save weights.detach() in self.last_weights in the attend function. .detach() is used so that the stored weights are not part of the computational graph. Do not forget to call .detach() for debug output as well, so that the computational graph does not consume more RAM than required.
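
The hint about nn.Parameter can be illustrated with a minimal sketch. The class names `PlainTensorLayer` and `ParameterLayer` and the tensor shape are hypothetical, and Xavier initialization is just one of the standard choices.

```python
import torch
from torch import nn

class PlainTensorLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.W = torch.empty(4, 3)      # plain tensor: invisible to .parameters()

class ParameterLayer(nn.Module):
    def __init__(self):
        super().__init__()
        W = torch.empty(4, 3)           # first create the tensor
        nn.init.xavier_uniform_(W)      # then initialize it in-place
        self.W = nn.Parameter(W)        # wrapping registers it as a learnable parameter

print(len(list(PlainTensorLayer().parameters())))  # 0
print(len(list(ParameterLayer().parameters())))    # 1
```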

In [None]:
def get_autoregressive_mask(size):
    """
    Returns attention mask of given size for autoregressive model.
    """
    # your code here
    return res

In [None]:
class Attention(nn.Module):
    def __init__(self):
        super().__init__()

    def get_scores(self, features, queries):
        """
        features: [batch_size x num_objects x obj_feature_dim]
        queries:  [batch_size x num_queries x query_feature_dim]
        Returns matrix of scores with shape [batch_size x num_queries x num_objects].
        """
        raise NotImplementedError()                

    def attend(self, features, queries, mask=None):
        """
        features:        [batch_size x num_objects x obj_feature_dim]
        queries:         [batch_size x num_queries x query_feature_dim]
        mask, optional:  [num_queries x num_objects]
        Returns matrix of features for queries with shape [batch_size x num_queries x obj_feature_dim].
        If mask is not None, sets the weights of the masked positions to zero.
        Saves detached weights as self.last_weights for further visualization.
        """
        # your code here
        return result

In [None]:
class AdditiveAttention(Attention):
    """
    Bahdanau et al. "Neural Machine Translation by Jointly Learning to Align and Translate", 2014.
    """
    def __init__(self, obj_feature_dim, query_feature_dim, hidden_dim):
        """
        obj_feature_dim   - dimensionality of attention object features vector
        query_feature_dim - dimensionality of attention query vector
        hidden_dim        - dimensionality of latent vectors of attention 
        """
        super().__init__()
        # your code here

    def get_scores(self, features, queries):
        """
        features: [batch_size x num_objects x obj_feature_dim]
        queries:  [batch_size x num_queries x query_feature_dim]
        Returns matrix of scores with shape [batch_size x num_queries x num_objects].
        """
        # your code here
        return result

In [None]:
class MultiplicativeAttention(Attention):
    """
    Luong et al. "Effective approaches to attention-based neural machine translation", 2015.
    """
    def __init__(self):
        super().__init__()

    def get_scores(self, features, queries):
        """
        features: [batch_size x num_objects x feature_dim]
        queries:  [batch_size x num_queries x feature_dim]
        Returns matrix of scores with shape [batch_size x num_queries x num_objects].
        """
        # your code here
        return result

In [None]:
class ScaledDotProductAttention(Attention):
    """
    Vaswani et al. "Attention Is All You Need", 2017.
    """
    def __init__(self):
        super().__init__()

    def get_scores(self, features, queries):
        """
        features: [batch_size x num_objects x feature_dim]
        queries:  [batch_size x num_queries x feature_dim]
        Returns matrix of scores with shape [batch_size x num_queries x num_objects].
        """
        # your code here
        return result

In [None]:
# time to check that your attention works
# your code here

The perm_generator function generates a batch of objects of a given size for training or testing.
For each object in the batch, permutations p1 and p2 of length perm_size are sampled uniformly at random.
They form the input sequence [p1, p2] and the correct answer [0, p3] for it (see the example above).
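
The data layout described above can be sketched in NumPy as follows. This is an illustration rather than the required implementation: the function name `sketch_perm_batch` and the explicit `rng` argument are assumptions, and the model itself will need PyTorch tensors.

```python
import numpy as np

def sketch_perm_batch(batch_size, perm_size, rng):
    # Each call to rng.permutation draws a uniformly random permutation.
    p1 = np.stack([rng.permutation(perm_size) for _ in range(batch_size)])
    p2 = np.stack([rng.permutation(perm_size) for _ in range(batch_size)])
    # Row-wise composition: p3[b, i] = p1[b, p2[b, i]].
    p3 = np.take_along_axis(p1, p2, axis=1)
    objects = np.concatenate([p1, p2], axis=1)                  # [p1, p2]
    answers = np.concatenate([np.zeros_like(p3), p3], axis=1)   # [0, p3]
    return objects, answers

objects, answers = sketch_perm_batch(4, 5, np.random.default_rng(0))
print(objects.shape, answers.shape)  # (4, 10) (4, 10)
```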

In [None]:
def perm_generator(batch_size, perm_size):
    """
    Generates batch of batch_size objects.
    Each object consists of two random permutations with length perm_size.
    The target for the object is the product of its two permutations.
    """
    # your code here
    return objects, correct_answers

In [None]:
# time to check your generator
# your code here

PositionalEncoder is the layer described in Vaswani et al. "Attention Is All You Need", 2017.
It adds position embeddings to the output of the previous layer.
To avoid recomputing position embeddings on every call, its constructor receives the max_len parameter and precomputes embeddings for positions from 0 to max_len - 1 inclusive.
The add flag indicates whether position embeddings are added to the output of the previous layer (the default, add = True, as in the original paper) or concatenated with it (add = False).
For the selected embedding dimension, you should visualize the embeddings (plot each component of the embeddings) and select an appropriate scale parameter.
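
For reference, the sinusoidal embeddings from the paper can be precomputed as in the NumPy sketch below. It assumes that `dim` is even and that `scale` plays the role of the constant 10000 in the paper's formula; the function name `sinusoidal_embeddings` is hypothetical.

```python
import numpy as np

def sinusoidal_embeddings(max_len, dim, scale=10000.0):
    """Returns an array of shape [max_len, dim]; dim is assumed to be even."""
    positions = np.arange(max_len)[:, None]           # [max_len, 1]
    freqs = scale ** (-np.arange(0, dim, 2) / dim)    # [dim / 2]
    pe = np.zeros((max_len, dim))
    pe[:, 0::2] = np.sin(positions * freqs)           # even components
    pe[:, 1::2] = np.cos(positions * freqs)           # odd components
    return pe

pe = sinusoidal_embeddings(max_len=20, dim=8)
print(pe.shape)  # (20, 8)
```

A smaller `scale` makes the slowest components oscillate faster, which may matter when sequences are as short as in this notebook.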

In [None]:
class PositionalEncoder(nn.Module):
    def __init__(self, dim, max_len=50, scale=10000.0, add=True):
        """
        Transforms input as described by Vaswani et al. in "Attention Is All You Need", 2017.
        dim     - dimension of positional embeddings.
        max_len - maximal length of sequence, for precomputing
        scale   - scale factor for frequency for positional embeddings
        add     - boolean, if add is False, concatenate positional embeddings with input instead of adding
        """
        super().__init__()
        
        self.dim = dim
        self.add = add
        if add:
            self.extra_output_shape = 0
        else:
            self.extra_output_shape = dim

        # your code here
               
    def forward(self, input):
        """
        input - [batch_size x sequence_len x features_dim]
        If self.add is True, self.dim = features_dim.
        Returns input with added or concatenated positional embeddings (depending on self.add).
        """
        # your code here
        return result

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

# time to draw positional encoder
# your code here

The model consists of the following layers:
+ Embedding elements of the input sequence.
+ Positional embedding or None, meaning the absence of this layer.
+ LSTM network.
To compute the LSTM network input dimension, you can use the dimension of the first embedding and extra_output_shape for embedding of the positions.
+ A layer of attention. It receives the outputs of the LSTM network as queries and sequence embeddings as objects.
When the autoregressive flag is True, this layer uses the attention mask to ignore sequence elements from the future.
+ Logistic regression with perm_len classes, which outputs an integer for each position of the answer.

Please note that this model is neither a transformer nor a traditional network using LSTM with attention.
The Transformer uses multi-head (K-head) attention and elementwise transformations of embeddings.
In traditional networks with LSTM, the query issued by the network at the previous time step affects the network input at the next time step, so the entire sequence cannot be processed in parallel.
Also, most networks use the encoder-decoder architecture, where the encoder first reads the entire input sequence and forms its hidden representation, and then the decoder produces the output sequence using this hidden representation and the attention mechanism.

In [None]:
class PermMultiplier(nn.Module):
    def __init__(self, perm_len, embedding_dim, hidden_dim, attention, pos_enc, autoregressive):
        """
        perm_len       - permutation length (the input is twice longer)
        embedding_dim  - dimensionality of integer embeddings
        hidden_dim     - dimensionality of LSTM output
        attention      - Attention object
        pos_enc        - PositionalEncoder object or None
        autoregressive - boolean, if True, then model must use autoregressive mask for attention
        """
        super().__init__()
        self.autoregressive = autoregressive
        self.perm_len = perm_len
        # your code here

    def forward(self, input):
        """
        Perform forward pass through layers:
        + get embeddings from input sequence (using both embeddings
          and positional embeddings if pos_enc is not None)
        + run LSTM on embeddings
        + use the output of LSTM as attention queries
        + attend on the embedded sequence using queries (note autoregressive flag)
        + make final linear transformation to obtain logits
        """
        # your code here

There is a time to write models and a time to train them.
Now is the time for the latter.

Complete the training code for the model below.
Find a suitable architecture and set of hyperparameters for the model and the optimization method.
For some architectures and hyperparameters, the model can be trained on a CPU in a short time.

In [None]:
perm_len = 10

In [None]:
# time to set up a model
# you can check that the model doesn't work without pos_enc
# a non-autoregressive model can be learned easily, but it is less useful
# try to learn an autoregressive model if possible
pos_enc = PositionalEncoder(?, perm_len * 2, ?, ?)
attention = ?
model = PermMultiplier(perm_len, ?, ?, attention, pos_enc, ?)
if torch.cuda.is_available():
    model = model.cuda()

In [None]:
# set up optimizer
gd = optim.Adam(model.parameters(), lr=?)

In [None]:
# do optimization
avg_loss = None
forget = 0.99
batch_size = 64
iterator = range(?)
for i in iterator:
    gd.zero_grad()
    batch = perm_generator(batch_size, perm_len)
    if torch.cuda.is_available():
        batch = batch[0].cuda(), batch[1].cuda()
    # compute batch loss
    # your code here
    loss.backward()
    if avg_loss is None:
        avg_loss = float(loss)
    else:
        avg_loss = forget * avg_loss + (1 - forget) * float(loss)
    descr_str = 'Iteration %05d, loss %.5f.' % (i, avg_loss)
    print('\r', descr_str, end='')
    gd.step()

Great, we have some model.
Let's check how it multiplies two random permutations.

In [None]:
# time to check your model
batch = perm_generator(batch_size, perm_len)
if torch.cuda.is_available():
    batch = batch[0].cuda(), batch[1].cuda()
print('Input:\n', batch[0][:5])
print('Output:\n', ?)
print('Correct:\n', batch[1][:5])

One property of attention that is important in applications is that it shows what the model attends to.
For this, so-called attention maps are used.
Use the last_weights field of the Attention layer to visualize which positions the trained model attended to at each time step for the permutations from the batch in the cell above.
The expected behavior is that for each element of the permutation p2, the model attends to the corresponding element of the permutation p1.

In [None]:
# visualize attention map for some object
# your code here

In [None]:
# play with model and learn something new about attention!