### Transformers

Self-attention operates only on the input sequence and captures the dependencies among its
elements.

#### Basic self-attention

Let’s assume we have an input sequence of length *T*, x<sup>(1)</sup>, ..., x<sup>(T)</sup>, as well
as an output sequence, z<sup>(1)</sup>, z<sup>(2)</sup>, ..., z<sup>(T)</sup>. We use **o** for the output of the whole transformer and **z** for the output of the self-attention layer (an intermediate step in the model).
Each *i*th element of these sequences is a vector of size *d* that represents the feature information for the input at position *i*.

For a seq2seq task, the goal of self-attention is to model the dependencies of the current input element on all other input elements. To achieve this, the self-attention mechanism is composed of three stages. First, we derive importance weights based on the similarity between the current element and all other elements in the sequence. Second, we normalize the weights, usually with the already familiar softmax function. Third, we use these weights in combination with the corresponding sequence elements to compute the attention value.
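The three stages can be sketched in a few lines before walking through them one by one. This is a minimal illustration, not the notebook's own code: the function name `basic_self_attention` and the random toy input are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def basic_self_attention(x):
    """Unparameterized self-attention over a (T, d) tensor (hypothetical helper)."""
    # Stage 1: similarity of every element with every other element (dot products)
    omega = x @ x.T                    # shape (T, T)
    # Stage 2: normalize each row so the weights for one position sum to 1
    alpha = F.softmax(omega, dim=1)    # shape (T, T)
    # Stage 3: weighted sum of the inputs gives one context vector per position
    return alpha @ x                   # shape (T, d)

torch.manual_seed(0)
x = torch.randn(5, 4)   # toy input (assumption): T=5 elements of size d=4
z = basic_self_attention(x)
print(z.shape)          # torch.Size([5, 4])
```

The output has the same shape as the input: each position gets a context vector of size *d*.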


##### Calculating similarity values ω<sub>ij</sub>

In [1]:
import torch

In [21]:
# sentence mapped to an integer representation via a dictionary
sentence = torch.tensor(
    [0, # can
     7, # you
     1, # help
     2, # me
     5, # to
     6, # translate
     4, # this
     3] # sentence
)

# produce embeddings
torch.manual_seed(123)
embed = torch.nn.Embedding(10, 16)  # dict size, emb dim
embed_sentence = embed(sentence).detach()
embed_sentence.shape

torch.Size([8, 16])

In [29]:
# compute ω
omega = torch.empty(8, 8)
for i, x_i in enumerate(embed_sentence):
    for j, x_j in enumerate(embed_sentence):
        omega[i, j] = torch.dot(x_i, x_j)

# equivalent to matrix multiplication of input sequence:
omega_mat = embed_sentence.matmul(embed_sentence.T)

# check if the same
print(torch.allclose(omega_mat, omega))
omega.shape

True


torch.Size([8, 8])

##### Calculating attention weights α<sub>ij</sub>

In [30]:
import torch.nn.functional as F

attention_weights = F.softmax(omega, dim=1)
# check if sum in rows is 1
#attention_weights.sum(dim=1)

##### Calculating context vectors z<sup>(i)</sup>

In [34]:
x_2 = embed_sentence[1,:]
context_vec_2 = torch.zeros(x_2.shape)
# weighted sum of all inputs, using the attention weights of the second element
for j in range(8):
    x_j = embed_sentence[j, :]
    context_vec_2 += attention_weights[1, j] * x_j
context_vec_2.shape


torch.Size([16])

In [39]:
# matrix mul
print(attention_weights.shape, embed_sentence.shape)

context_vec = torch.matmul(attention_weights, embed_sentence)

torch.Size([8, 8]) torch.Size([8, 16])
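
As a sanity check, the loop result for the second element should match row 1 of the matrix product. The sketch below rebuilds the embeddings with the same seed so that it runs on its own; the `atol` tolerance is an assumption chosen to absorb float32 rounding differences between the two computations.

```python
import torch
import torch.nn.functional as F

# Rebuild the embeddings from the cells above so this block is self-contained
torch.manual_seed(123)
sentence = torch.tensor([0, 7, 1, 2, 5, 6, 4, 3])
embed_sentence = torch.nn.Embedding(10, 16)(sentence).detach()

attention_weights = F.softmax(embed_sentence @ embed_sentence.T, dim=1)

# Loop version for the second element (index 1)
context_vec_2 = torch.zeros(16)
for j in range(8):
    context_vec_2 += attention_weights[1, j] * embed_sentence[j]

# Matrix version for all elements at once
context_vec = attention_weights @ embed_sentence

# Compare the loop result with the corresponding row of the matrix result
print(torch.allclose(context_vec_2, context_vec[1], atol=1e-6))
```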


#### Parameterizing the self-attention mechanism: scaled dot-product attention

In [40]:
torch.manual_seed(123)
d = embed_sentence.shape[1]
U_q = torch.rand(d, d)
U_k = torch.rand(d, d)
U_v = torch.rand(d, d)

In [42]:
# Example for x^(2)
x_2 = embed_sentence[1]

# query for x^(2)
query_2 = U_q.matmul(x_2)
# key for x^(2), projected with U_k
key_2 = U_k.matmul(x_2)
# value for x^(2), projected with U_v
value_2 = U_v.matmul(x_2)

# for all input elements
keys = U_k.matmul(embed_sentence.T).T
values = U_v.matmul(embed_sentence.T).T

In [43]:
# here omega is calculated as dot product of query and key
omega_23 = query_2.dot(keys[2])
omega_23

tensor(14.3667)

In [47]:
# similarity of the query for x^(2) with every key: (16,) @ (16, 8) -> (8,)
omega_2 = query_2.matmul(keys.T)
omega_2.shape

torch.Size([8])

In [48]:
attention_weights_2 = F.softmax(omega_2 / d**0.5, dim=0)
# output
context_vector_2 = attention_weights_2.matmul(values)
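
The same computation can be carried out for all positions at once. The following is a sketch under the setup used above (same seed, same embedding size, random projection matrices); the names `queries`, `omega_all`, `attention_weights_all`, and `context_vectors` are introduced here for illustration and do not appear in the cells above.

```python
import torch
import torch.nn.functional as F

# Rebuild the inputs and projection matrices so this block runs on its own
torch.manual_seed(123)
sentence = torch.tensor([0, 7, 1, 2, 5, 6, 4, 3])
embed_sentence = torch.nn.Embedding(10, 16)(sentence).detach()
d = embed_sentence.shape[1]

torch.manual_seed(123)
U_q = torch.rand(d, d)
U_k = torch.rand(d, d)
U_v = torch.rand(d, d)

# Project every input at once; row i holds the projection of x^(i)
queries = embed_sentence @ U_q.T   # (8, 16)
keys = embed_sentence @ U_k.T      # (8, 16)
values = embed_sentence @ U_v.T    # (8, 16)

# Scaled dot-product attention for the whole sequence
omega_all = queries @ keys.T                                   # (8, 8)
attention_weights_all = F.softmax(omega_all / d**0.5, dim=1)   # rows sum to 1
context_vectors = attention_weights_all @ values               # (8, 16)
print(context_vectors.shape)  # torch.Size([8, 16])
```

Row 1 of `context_vectors` corresponds to `context_vector_2` from the single-element computation. Note that the softmax runs over `dim=1` here because each row holds the scores for one query.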