# Marked exercises after Lecture 5
This notebook contains the marked exercises with instructions and explanations.

Work through the cells below in sequential order, executing each cell as you progress. Throughout the notebook, you will encounter instructions marked with the words **YOUR CODE HERE** followed by **raise NotImplementedError()**. Replace *raise NotImplementedError()* with your own code.
Follow the instructions and write the code to complete the tasks.

Along the way, you may also find questions. Try to reflect on the questions before/after running the code.

You will implement a `MultiHeadAttention` module.

You have 2 exercises to complete. In total, you can get **20 points** out of 60 points for Submission 1 for completing all marked exercises related to lecture 5.

This notebook was developed at the [Idiap Research Institute](https://www.idiap.ch) by [Alina Elena Baia](mailto:alina.baia.idiap.ch), [Darya Baranouskaya](mailto:darya.baranouskaya.idiap.ch) and [Olena Hrynenko](mailto:olena.hrynenko.idiap.ch) (equal contribution).


Read the paper ['Attention is all you need'](https://arxiv.org/pdf/1706.03762.pdf)
and implement scaled dot-product attention and multi-head attention.


You are NOT ALLOWED to use toolboxes that automatically solve the main tasks of the assignment, such as (but not limited to) nn.MultiheadAttention.

In [1]:
import torch
import torch.nn as nn
# you are not allowed to use any other libraries (like numpy) in this assignment

##### 1.5.1 Scaled dot-product attention

'An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.

The input consists of queries and keys of dimension $d_k$, and values of dimension $d_v$.'



The input of the non-batched scaled dot-product attention consists of three matrices: queries $Q\in\mathbb{R}^{L\times d_k}$, keys $K\in\mathbb{R}^{S\times d_k}$ and values $V\in\mathbb{R}^{S\times d_v}$, where $L$ and $S$ are the sequence lengths (for example, the number of tokens in the query and key sequences), and $d_k$, $d_k$ and $d_v$ are the dimensions of the queries, keys and values, respectively.

In other words, the query is a sequence of $L$ token embeddings of dimension $d_k$ each, the key is a sequence of $S$ token embeddings of dimension $d_k$ each, and the value is a sequence of $S$ token embeddings of dimension $d_v$ each.

Scaled dot-product attention is computed as:
$$ \text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Note that your implementation should work for batched inputs.
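
As a rough illustration of what "batched" means here, the sketch below checks that `torch.matmul` and `transpose(-2, -1)` act only on the last two dimensions, so every element of the leading dimensions is processed independently. The tensor sizes and variable names are arbitrary choices made for this illustration and are not part of the assignment.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes chosen only for this illustration.
B, L, S, d_k, d_v = 4, 2, 3, 5, 7
Q = torch.rand(B, L, d_k)
K = torch.rand(B, S, d_k)
V = torch.rand(B, S, d_v)

# matmul multiplies the last two dimensions and broadcasts over the leading ones.
scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)  # (B, L, S)
weights = torch.softmax(scores, dim=-1)  # normalised over the S keys
out = torch.matmul(weights, V)  # (B, L, d_v)

# The same computation on one batch element alone, for comparison.
scores_0 = Q[0] @ K[0].T / (d_k ** 0.5)
out_0 = torch.softmax(scores_0, dim=-1) @ V[0]

print(out.shape, weights.shape)
print(torch.allclose(out[0], out_0, atol=1e-5))
```

Because the softmax is taken over the last dimension, the weights of each query sum to 1 over the $S$ keys, whatever the number of leading dimensions.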


In [2]:
def scaled_dot_product(query, key, value):
    """
    Args:
        query: torch.Tensor (..., L, d_k)
        key: torch.Tensor (..., S, d_k)
        value: torch.Tensor (..., S, d_v)


    Returns:
        attn: torch.Tensor (..., L, d_v), output of the scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V
        attn_weights: torch.Tensor (..., L, S), attention weights, softmax(Q K^T / sqrt(d_k))

    L is the length of query sequence
    S is the length of key and value sequences, d_k and d_v are the embeddings dimensions

    '...' is a placeholder for any leading dimensions.
    The scaled dot product is computed over the last two dimensions, and every
    element of the leading dimensions is processed independently (torch.matmul
    broadcasts over them). For example, if '...' is a batch size B, the query has
    size (B, L, d_k) and the output has size (B, L, d_v), with every batch element
    processed independently of the others.
    """
    #TODO implement scaled dot product
    # YOUR CODE HERE
    # raise NotImplementedError()
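    d_k = query.size(-1)
    # Similarity scores between every query and every key, scaled by sqrt(d_k)
    scores = torch.matmul(query, key.transpose(-2, -1)) / (d_k ** 0.5)
    # Normalise over the key dimension so that each query's weights sum to 1
    attn_weights = torch.softmax(scores, dim=-1)
    attn = torch.matmul(attn_weights, value)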

    return attn, attn_weights


In [3]:
# Check your implementation
#epsilon to check your results
epsilon = 1e-3
epsilon_2 = 1e-6
#example 1
query  = torch.Tensor([[1, 2, 3],
                       [3, 2, 1]])
key  = torch.Tensor([[3, 2, 1],
                     [1, 2, 3],
                     [1, 2, 3]])
value  = torch.Tensor([[1, 1],
                       [1, 1],
                       [1, 1]])

answer, attn_weights = scaled_dot_product(query, key, value)
answer
correct = torch.Tensor([[1.0000, 1.0000],
        [1.0000, 1.0000]])

assert (torch.all(answer + epsilon >= correct)) and (torch.all(answer - epsilon <= correct))
assert (torch.all(attn_weights.sum(dim=1) + epsilon_2 >= torch.ones(2))) and (torch.all(attn_weights.sum(dim=1) - epsilon_2 <= torch.ones(2)))

#example 2
#change values and see the result
value  = torch.Tensor([[1, 2],
                       [2, 1],
                       [1, 1]])

answer, attn_weights = scaled_dot_product(query, key, value)

correct = torch.Tensor([[1.4763, 1.0473],
        [1.0829, 1.8343]])
assert (torch.all(answer + epsilon >= correct)) and (torch.all(answer - epsilon <= correct))
assert (torch.all(attn_weights.sum(dim=1) + epsilon_2 >= torch.ones(2))) and (torch.all(attn_weights.sum(dim=1) - epsilon_2 <= torch.ones(2)))


#example 3
#change values and see the result
value  = torch.Tensor([[1, 2],
                       [1, 1],
                       [1, 2]])

answer, attn_weights = scaled_dot_product(query, key, value)
answer
correct = torch.Tensor([[1.0000, 1.5237],
        [1.0000, 1.9171]])
assert (torch.all(answer + epsilon >= correct)) and (torch.all(answer - epsilon <= correct))
assert (torch.all(attn_weights.sum(dim=1) + epsilon_2 >= torch.ones(2))) and (torch.all(attn_weights.sum(dim=1) - epsilon_2 <= torch.ones(2)))


#example 4
query  = torch.Tensor([[1, 2, 3],
                       [3, 2, 1]])
key  = torch.Tensor([[3, 2, 1],
                     [1, 2, 3]])
value  = torch.Tensor([[1, 1],
                       [1, 1]])

answer, attn_weights = scaled_dot_product(query, key, value)
answer
correct = torch.Tensor([[1.0000, 1.0000],
        [1.0000, 1.0000]])

assert (torch.all(answer + epsilon >= correct)) and (torch.all(answer - epsilon <= correct))
assert (torch.all(attn_weights.sum(dim=1) + epsilon_2 >= torch.ones(2))) and (torch.all(attn_weights.sum(dim=1) - epsilon_2 <= torch.ones(2)))


##### 1.5.2 Multi-Head attention

For multi-head attention, the queries, keys and values all have dimension $d_{\text{model}}$.

'Instead of performing a single attention function with $d_{\text{model}}$-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values $h$ times with different, learned linear projections to $d_k$, $d_k$ and $d_v$ dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding $d_v$-dimensional output values. These are concatenated and once again projected, resulting in the final values.

$$
\begin{split}
    \text{Multihead}(Q,K,V) & = \text{Concat}(\text{head}_1,...,\text{head}_h)W^{O}\\
    \text{where } \text{head}_i & = \text{Attention}(QW_i^Q,KW_i^K, VW_i^V)
\end{split}
$$


Where the projections are parameter matrices $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$.

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.'
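
One way to read the formula above without looping over heads: stacking the $h$ per-head matrices $W_i$ side by side gives a single projection of width $h \cdot d_k$, which can then be split into heads by a reshape. The sketch below checks this on random data; the sizes, the matrix `W` and the explicit loop are assumptions made purely for illustration (the loop serves only as a reference and is not something to use in the exercise itself).

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes for illustration only.
b, L, h, d_k, d_model = 2, 3, 4, 5, 8

x = torch.rand(b, L, d_model)
W = torch.rand(d_model, h * d_k)  # the h per-head matrices W_i, stacked side by side

# One large projection, then split the last dimension into (h, d_k) and move h forward.
proj = x @ W  # (b, L, h * d_k)
heads = proj.view(b, L, h, d_k).transpose(1, 2)  # (b, h, L, d_k)

# Reference: project with each head's own slice of W in an explicit loop.
heads_loop = torch.stack(
    [x @ W[:, i * d_k:(i + 1) * d_k] for i in range(h)], dim=1
)  # (b, h, L, d_k)

# Concatenating the heads is the inverse: move h back and merge the last two dimensions.
concat = heads.transpose(1, 2).contiguous().view(b, L, h * d_k)

print(torch.allclose(heads, heads_loop, atol=1e-5))
print(torch.equal(concat, proj))
```

The `.contiguous()` call is needed because `transpose` returns a non-contiguous view, and `view` requires a compatible memory layout.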

In [4]:
class MultiHeadAttention(nn.Module):
    ''' Multi-Head Attention module '''

    def __init__(self, h, d_model, d_k, d_v):
        '''
        d_model: dimensionality of the input and output embeddings
        h: number of heads
        d_k: dimensionality of each head's query and key projections
        d_v: dimensionality of each head's value projection
        '''
        super().__init__()
        assert d_model % h == 0
        assert d_model // d_v == h #we want the output to have the same dimensionality d_model as the inputs

        # Note: no bias is needed when linear projections are performed
        self.num_heads = h
        self.d_k = d_k
        self.d_v = d_v
        self.d_model = d_model

        self.q_proj = nn.Linear(d_model, h * d_k, bias=False) # - linear projection of q
        self.k_proj = nn.Linear(d_model, h * d_k, bias=False) # - linear projection of k
        self.v_proj = nn.Linear(d_model, h * d_v, bias=False) # - linear projection of v
        self.o_proj = nn.Linear(h * d_v, d_model, bias=False) # - linear projection after concatenation



    def forward(self, q, k, v):
        '''
        Args:
            q: torch.Tensor (Batch_size, L, d_model), queries
            k: torch.Tensor (Batch_size, S, d_model), keys
            v: torch.Tensor (Batch_size, S, d_model), values
        Returns:
            attn: torch.Tensor (Batch_size, L, d_model), output of the multi-head attention
            attn_weights: torch.Tensor (Batch_size, h, L, S), attention weights per head
        '''

        # you are allowed to use previously implemented scaled_dot_product


        b, L, S = q.shape[0], q.shape[1], k.shape[1]

        # you are not allowed to use a for loop to iterate through different heads
        # instead you should:
        #   1) get linear projections for q, k, and v (for example, get q_proj of size (b, L, h * d_k) from q)
        #   2) reshape the projections: split the channel dimension (h * d_k (or h * d_v)) into 2 dimensions (h and d_k (or h and d_v))
        # and then transpose reshaped q_proj_reshaped, k_proj_reshaped and v_proj_reshaped vectors to prepare the input for the scaled dot product.
        # The sizes of the reshaped and transposed vectors should be (..., L, d_k), (..., S, d_k), (..., S, d_v), respectively.
        # For example, q_proj of size (b, L, h * d_k) should become a vector q_proj_reshaped_transposed of the size (b, h, L, d_k).
        #   3) compute the scaled dot product, using reshaped transposed vectors q_proj_reshaped_transposed, 
        # k_proj_reshaped_transposed, v_proj_reshaped_transposed as inputs
        #   4) transpose and reshape the output of the scaled dot-product attention 
        # (concatenate head outputs to get a h*d_v dimension) to get an output of size (b, L, d_model)

        # YOUR CODE HERE
        # raise NotImplementedError()
        q_proj = self.q_proj(q)
        k_proj = self.k_proj(k)
        v_proj = self.v_proj(v)

        q_proj_reshaped = q_proj.view(b, L, self.num_heads, self.d_k).transpose(1, 2)
        k_proj_reshaped = k_proj.view(b, S, self.num_heads, self.d_k).transpose(1, 2)
        v_proj_reshaped = v_proj.view(b, S, self.num_heads, self.d_v).transpose(1, 2)

        attn, attn_weights = scaled_dot_product(q_proj_reshaped, k_proj_reshaped, v_proj_reshaped)
        attn = self.o_proj(attn.transpose(1, 2).contiguous().view(b, L, self.num_heads * self.d_v))

        return attn, attn_weights

In [5]:
b = 16 #batch size
h = 8 # number of heads
L = 10 #length of the query sequence
S = 15 #length of the key and value sequences
d_model = 512
d_k = 50
d_v = 64

multihead_attn = MultiHeadAttention(h, d_model, d_k, d_v)

q, k, v = torch.rand((b, L, d_model)), torch.rand((b, S, d_model)), torch.rand((b, S, d_model))

attn, attn_weights = multihead_attn(q, k, v)

assert list(attn_weights.shape) == [b, h, L, S]
assert list(attn.shape) == [b, L, d_model]
assert torch.all(attn_weights.sum(dim=-1) + 1e-3 >= torch.ones((b, h, L))) and torch.all(attn_weights.sum(dim=-1) - 1e-3 <= torch.ones((b, h, L)))
attn.shape, attn_weights.shape

(torch.Size([16, 10, 512]), torch.Size([16, 8, 10, 15]))

In [6]:
#check the documentation of pytorch MultiheadAttention: https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html

# Which parameters should you pass to the torch multi-head attention
# so that the output has the same format as in your implementation above?
# The inner dimensionality of the q, k, v projections can differ from your implementation, as the torch implementation differs.

multihead_attn_torch = nn.MultiheadAttention(
                                             
                                            #add other parameters if needed
                                            # YOUR CODE HERE
                                            # raise NotImplementedError()
                                             embed_dim=d_model,
                                             num_heads=h,
                                             kdim=d_model,
                                             vdim=d_model,
                                             batch_first=True,
                                             
                                            )

attn_output_torch, attn_output_weights_torch = multihead_attn_torch(q, k, v,
                                                                  # add other parameters if needed
                                                                  # YOUR CODE HERE
                                                                  # raise NotImplementedError()
                                                                    need_weights=True,
                                                                    average_attn_weights=False
                                                                   )
print(attn_output_torch.shape, attn_output_weights_torch.shape)

assert attn.shape == attn_output_torch.shape
assert attn_weights.shape == attn_output_weights_torch.shape

torch.Size([16, 10, 512]) torch.Size([16, 8, 10, 15])
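
To see why `average_attn_weights=False` is passed above, the sketch below compares the two settings on a small, hypothetical configuration (the sizes are arbitrary and unrelated to the assignment values). With the default setting, `nn.MultiheadAttention` returns weights averaged over the heads, so the head dimension is missing from the output shape.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small sizes for illustration only.
b, h, L, S, d_model = 2, 4, 3, 5, 16
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=h, batch_first=True)
mha.eval()

q = torch.rand(b, L, d_model)
k = torch.rand(b, S, d_model)
v = torch.rand(b, S, d_model)

with torch.no_grad():
    # Per-head weights: shape (b, h, L, S)
    _, w_per_head = mha(q, k, v, need_weights=True, average_attn_weights=False)
    # Default behaviour: weights averaged over heads, shape (b, L, S)
    _, w_avg = mha(q, k, v, need_weights=True)

print(w_per_head.shape, w_avg.shape)
print(torch.allclose(w_per_head.mean(dim=1), w_avg, atol=1e-6))
```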
