# MultiHeaded Attention

Multi-headed attention is the parallelized version of self-attention: the model computes self-attention for several heads at the same time, with each head working on its own slice of the query, key, and value projections.

In [3]:
import numpy as np
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

### Multi Headed Params
- **Sequence Length** : The number of words (tokens) in the input.
- **Input Dimensions** : The size of each input word vector. Usually 512.
- **Model Dimensions** : The size of the vectors the attention layer works with and outputs (`d_model`). 512 in the original Transformer.


_Our case_ : "My name is Shreyas"
- **Sequence Length** = 4   as we have only 4 words
- **Input Dimensions** = 512 as we will encode each word into [1x512]
- **Model Dimensions** = 512 as the query, key, and value vectors for each word will each have 512 dimensions in total (split across heads later)

We will create a tensor of this shape (batch_size, sequence_length, input_dim)

In [4]:
sequence_length = 4
batch_size = 1
input_dim = 512
d_model = 512

x = torch.randn((batch_size, sequence_length, input_dim))

print(f"Size of the vector :{x.size()}")

Size of the vector :torch.Size([1, 4, 512])


### Calculating QKV vectors

We project the input three times, once each for the queries, keys, and values, using three different weight matrices of size [512x512]. These weights are learnable parameters, so we pack all three into a single linear layer with 3 * 512 = 1536 outputs.

The output is a combined QKV tensor of shape [1, 4, 1536].

In [5]:
qkv_layer = nn.Linear(input_dim , 3 * d_model)
qkv = qkv_layer(x)
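
To see why one linear layer with `3 * d_model` outputs can stand in for three separate projections, the sketch below slices the fused weight matrix into three [512x512] blocks and checks that applying them one by one gives the same result. The names `w_a`, `w_b`, `w_c` are illustrative only; which slice ends up acting as a query, key, or value depends on how the output is split later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
input_dim, d_model = 512, 512
x = torch.randn(1, 4, input_dim)

fused = nn.Linear(input_dim, 3 * d_model)
fused_out = fused(x)                              # [1, 4, 1536]

# nn.Linear stores its weight as [out_features, in_features],
# so the three projections are stacked along dim 0.
w_a, w_b, w_c = fused.weight.chunk(3, dim=0)      # each [512, 512]
b_a, b_b, b_c = fused.bias.chunk(3, dim=0)        # each [512]

# Apply the three blocks separately and concatenate the results.
separate_out = torch.cat(
    [F.linear(x, w_a, b_a), F.linear(x, w_b, b_b), F.linear(x, w_c, b_c)],
    dim=-1,
)

print(fused_out.shape)
print(torch.allclose(fused_out, separate_out, atol=1e-5))
```

Fusing the projections is mainly a convenience: one matrix multiply instead of three, with the same learnable parameters.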

Optional (Visualize the QKV distributions)

In [6]:
# import matplotlib.pyplot as plt
# y_val = torch.histc(qkv, bins=200, min=-3, max=3)
# x_val = np.arange(-1, 1, 0.01) * 3
# plt.bar(x_val, y_val, align='center')
# plt.title('qkv distribution')

### Heads
The original Transformer paper used 8 attention heads, so we will stick with that.

Each head works on an equal slice of the model dimension, so the per-head dimension also needs to be calculated and stored.
`head_dim` = (model dimensions) / (number of heads) = 512 / 8 = 64

In [7]:
num_heads = 8
head_dim = d_model // num_heads
qkv = qkv.reshape(batch_size, sequence_length, num_heads, 3 * head_dim)
print(head_dim)
print(f"The combined shape of the matrix will be : {qkv.shape}")

64
The combined shape of the matrix will be : torch.Size([1, 4, 8, 192])


In [10]:
qkv = qkv.permute(0, 2, 1, 3) # [batch_size, num_heads, sequence_length, 3*head_dim]
qkv.shape

torch.Size([1, 8, 4, 192])

In [8]:
# Split into Q, K, V along the last dimension.
# Each head gets its own 64-dim q, k, and v vector for every word.
q, k, v = qkv.chunk(3, dim=-1)
q.shape, k.shape, v.shape

(torch.Size([1, 8, 4, 64]),
 torch.Size([1, 8, 4, 64]),
 torch.Size([1, 8, 4, 64]))
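
The reshape, permute, and chunk steps decide which columns of the 1536-wide projection end up in which head. The sketch below, which uses a random tensor `flat` as a stand-in for the linear layer output, checks that head `h` receives a contiguous block of 192 columns, of which the first 64 become its query, the next 64 its key, and the last 64 its value.

```python
import torch

torch.manual_seed(0)
batch_size, sequence_length, num_heads, head_dim = 1, 4, 8, 64

# Stand-in for the output of the QKV linear layer: [1, 4, 1536]
flat = torch.randn(batch_size, sequence_length, 3 * num_heads * head_dim)

qkv = flat.reshape(batch_size, sequence_length, num_heads, 3 * head_dim)
qkv = qkv.permute(0, 2, 1, 3)           # [batch, heads, seq, 3 * head_dim]
q, k, v = qkv.chunk(3, dim=-1)          # each [batch, heads, seq, head_dim]

h = 2                                   # any head index in [0, num_heads)
start = h * 3 * head_dim                # first column belonging to head h
q_cols = flat[:, :, start : start + head_dim]
k_cols = flat[:, :, start + head_dim : start + 2 * head_dim]
v_cols = flat[:, :, start + 2 * head_dim : start + 3 * head_dim]

print(torch.equal(q[:, h], q_cols))
print(torch.equal(k[:, h], k_cols))
print(torch.equal(v[:, h], v_cols))
```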

##### The Self Attention Equation is given by

$$\text{Self-Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V$$


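The `Q.K^T` operation can be thought of as a similarity measure between the query and key vectors. The dot product of two vectors grows when they point in the same direction and turns negative when they point in opposite directions. It is closely related to **cosine similarity**, but unlike cosine similarity, which ranges from -1 (opposite directions) to 1 (same direction), the dot product also scales with the lengths of the vectors, so it is not bounded.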
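
A small numeric check can make the relationship between the dot product and cosine similarity concrete. In the sketch below, the two hand-picked vectors point in the same direction but have different lengths, so the cosine similarity is 1 while the dot product is much larger; the two only coincide for unit-length vectors.

```python
import torch
import torch.nn.functional as F

a = torch.tensor([3.0, 4.0])        # length 5
b = torch.tensor([6.0, 8.0])        # same direction as a, length 10

dot = torch.dot(a, b)                       # 3*6 + 4*8 = 50
cos = F.cosine_similarity(a, b, dim=0)      # 50 / (5 * 10) = 1

# Normalizing both vectors first makes the dot product equal the cosine.
unit_dot = torch.dot(a / a.norm(), b / b.norm())

print(dot.item(), cos.item(), unit_dot.item())
```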

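The $\sqrt{d_k}$ term is included in the equation to reduce the variance of the dot products calculated above. If the components of the query and key vectors have roughly unit variance, their dot product has a variance of about $d_k$; dividing by $\sqrt{d_k}$ brings it back to about 1, which keeps the softmax from saturating.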
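
The effect of the scaling can be checked empirically. The sketch below assumes query and key components drawn from a standard normal distribution (an assumption for illustration, not something a trained model guarantees) and compares the variance of the raw and scaled dot products.

```python
import torch

torch.manual_seed(0)
d_k = 64
num_samples = 10000

# Hypothetical queries and keys with unit-variance components.
q = torch.randn(num_samples, d_k)
k = torch.randn(num_samples, d_k)

dots = (q * k).sum(dim=-1)          # one dot product per (q, k) pair
scaled = dots / d_k ** 0.5

raw_var = dots.var().item()         # expected to be roughly d_k
scaled_var = scaled.var().item()    # expected to be roughly 1
print(raw_var, scaled_var)
```

Without the scaling, scores with a variance near 64 would push the softmax toward near one-hot outputs, where gradients are very small.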

In [12]:
d_k = q.size()[-1]
print(k.shape)
print(k.transpose(-2, -1).shape)
# similarity scores between every pair of words, per head
scaled = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
print(scaled.shape)

torch.Size([1, 8, 4, 64])
torch.Size([1, 8, 64, 4])
torch.Size([1, 8, 4, 4])


### Masking

Masking is generally required in the decoder part of the Transformer architecture. It ensures that a word does not get context from words that come later in the sequence, which have not been generated yet at inference time. This causal mask is not used in encoders.

The generated mask is an upper triangular mask. Continuing the last example, if the 1st row is the vector for My,
then My is only allowed to look at My and no other word.
Similarly, Name, which is represented by the second row, is only allowed to view My and Name
when predicting the next word in the sequence.

In [None]:
mask = torch.full(scaled.size() , float('-inf'))
mask = torch.triu(mask, diagonal=1)
mask[0][1] # mask for input to a single head

In [2]:
# the entire score tensor after masking
# Use ```(scaled + mask)[0][0]``` to check the masked scores for the 1st head
(scaled + mask)

In [16]:
scaled += mask
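
The mask is easier to read on a single 4x4 matrix. The sketch below builds the same kind of upper triangular mask for one head and applies it to hypothetical scores that are all equal, so the effect of the mask on the softmax is visible on its own.

```python
import torch
import torch.nn.functional as F

seq_len = 4

# 0 on and below the diagonal (allowed), -inf above it (future words, blocked)
mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
print(mask)

# Hypothetical scores: every word is equally similar to every other word.
scores = torch.zeros(seq_len, seq_len)
weights = F.softmax(scores + mask, dim=-1)
print(weights)
```

Because `exp(-inf)` is 0, blocked positions get exactly zero weight, and each row spreads its probability evenly over the words it is allowed to see: the first row puts everything on the first word, the second row splits it in half, and so on.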

### SoftMax

Softmax converts a vector of scores into a probability distribution: every output lies between 0 and 1, and the outputs sum to 1. Here it is applied to each row of the masked scores, so each word gets a set of attention weights over the words it is allowed to see.

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$
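
The formula can be written out by hand and compared against `F.softmax`. The sketch below uses an arbitrary three-element vector; subtracting the maximum before exponentiating is a common numerical-stability trick that leaves the result unchanged.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([2.0, 1.0, 0.1])   # arbitrary example scores

# e^{x_i} / sum_j e^{x_j}, shifted by max(x) to avoid overflow in exp
exp = torch.exp(x - x.max())
manual = exp / exp.sum()

builtin = F.softmax(x, dim=-1)

print(manual)
print(builtin)
print(manual.sum().item())
```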

In [17]:
attention = F.softmax(scaled, dim=-1)
attention.shape

torch.Size([1, 8, 4, 4])

In [18]:
# attention weights of the 1st head after masking
# Use ```attention``` to check the attention matrices of all heads
attention[0][0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000],
        [0.3573, 0.6427, 0.0000, 0.0000],
        [0.3879, 0.3296, 0.2825, 0.0000],
        [0.2939, 0.3040, 0.2177, 0.1844]], grad_fn=<SelectBackward0>)

Now, by multiplying the attention matrix with the Value matrix, we generate a new matrix in which each row is a weighted average of the value vectors. The weights tell the model how much each input word contributes to each output vector.

In [19]:
#outputs 
values = torch.matmul(attention, v)
values.shape

torch.Size([1, 8, 4, 64])
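
A tiny hand-made example shows what this multiplication does. The attention weights and value vectors below are made up for illustration: three words, two-dimensional values, and a causal pattern in the weights.

```python
import torch

# Each row holds the attention weights of one word and sums to 1.
attention = torch.tensor([[1.0, 0.0, 0.0],
                          [0.5, 0.5, 0.0],
                          [0.2, 0.3, 0.5]])

# One 2-dim value vector per word.
v = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [2.0, 2.0]])

out = attention @ v
print(out)
# row 0: copies v[0]                          -> [1.0, 0.0]
# row 1: average of v[0] and v[1]             -> [0.5, 0.5]
# row 2: 0.2 * v[0] + 0.3 * v[1] + 0.5 * v[2] -> [1.2, 1.3]
```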

# MultiHeaded Attention Class

Now let's combine everything we did into a single Python class.

In [20]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

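def scaled_dot_product(q, k, v, mask=None):
    # q, k, v: [batch_size, num_heads, sequence_length, head_dim]
    d_k = q.size()[-1]
    # similarity scores, scaled to keep their variance near 1
    scaled = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k)
    if mask is not None:
        # additive mask: 0 keeps a position, -inf blocks it
        scaled = scaled + mask
    attention = F.softmax(scaled, dim=-1)  # each row sums to 1
    values = torch.matmul(attention, v)    # weighted average of the value vectors
    return values, attention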

class MultiheadAttention(nn.Module):

    def __init__(self, input_dim, d_model, num_heads):
        super().__init__()
        self.input_dim = input_dim
        self.d_model = d_model
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.qkv_layer = nn.Linear(input_dim , 3 * d_model)
        self.linear_layer = nn.Linear(d_model, d_model)
    
    def forward(self, x, mask=None):
        batch_size, sequence_length, input_dim = x.size()
        print(f"x.size(): {x.size()}")
        qkv = self.qkv_layer(x)
        print(f"qkv.size(): {qkv.size()}")
        qkv = qkv.reshape(batch_size, sequence_length, self.num_heads, 3 * self.head_dim)
        print(f"qkv.size(): {qkv.size()}")
        qkv = qkv.permute(0, 2, 1, 3)
        print(f"qkv.size(): {qkv.size()}")
        q, k, v = qkv.chunk(3, dim=-1)
        print(f"q size: {q.size()}, k size: {k.size()}, v size: {v.size()}, ")
        values, attention = scaled_dot_product(q, k, v, mask)
        print(f"values.size(): {values.size()}, attention.size:{ attention.size()} ")
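        # move the head axis next to head_dim before merging: [batch, seq, heads, head_dim]
        values = values.permute(0, 2, 1, 3).reshape(batch_size, sequence_length, self.num_heads * self.head_dim)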
        print(f"values.size(): {values.size()}")
        out = self.linear_layer(values)
        print(f"out.size(): {out.size()}")
        return out
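
After attention, the per-head outputs have shape [batch_size, num_heads, sequence_length, head_dim], and the order of the axes matters when the heads are merged back together. One way to check a merge is to tag every value with its token index and confirm that each merged row still belongs to a single token. The toy sizes below are illustrative; the sketch contrasts moving the head axis next to `head_dim` before reshaping with a plain reshape.

```python
import torch

batch_size, num_heads, sequence_length, head_dim = 1, 2, 3, 4

# values[b, h, s, :] is filled with the token index s
values = torch.arange(sequence_length, dtype=torch.float32).view(1, 1, sequence_length, 1)
values = values.expand(batch_size, num_heads, sequence_length, head_dim)

# Permute to [batch, seq, heads, head_dim] first, then flatten the last two axes.
merged = values.permute(0, 2, 1, 3).reshape(batch_size, sequence_length, num_heads * head_dim)

# Plain reshape on [batch, heads, seq, head_dim] without permuting.
naive = values.reshape(batch_size, sequence_length, num_heads * head_dim)

print(merged[0])   # row s contains only the value s
print(naive[0])    # rows contain values from different tokens
```

Both results have the same shape, so a shape check alone does not catch the difference; only the permuted version keeps each token's head outputs together in one row.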

In [21]:
input_dim = 1024
d_model = 512
num_heads = 8

batch_size = 30
sequence_length = 5
x = torch.randn( (batch_size, sequence_length, input_dim) )

model = MultiheadAttention(input_dim, d_model, num_heads)
out = model(x)  # calling the module runs forward()

x.size(): torch.Size([30, 5, 1024])
qkv.size(): torch.Size([30, 5, 1536])
qkv.size(): torch.Size([30, 5, 8, 192])
qkv.size(): torch.Size([30, 8, 5, 192])
q size: torch.Size([30, 8, 5, 64]), k size: torch.Size([30, 8, 5, 64]), v size: torch.Size([30, 8, 5, 64]), 
values.size(): torch.Size([30, 8, 5, 64]), attention.size:torch.Size([30, 8, 5, 5]) 
values.size(): torch.Size([30, 5, 512])
out.size(): torch.Size([30, 5, 512])
