<a href="https://www.kaggle.com/code/aisuko/coding-the-self-attention-mechanism?scriptVersionId=160346033" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

**Note: The images come from the source listed in the Credit section.**

Please check the [Encoder In Transformers architecture](https://www.kaggle.com/code/aisuko/encoder-in-transformers-architecture) and [Decoder In Transformers architecture](https://www.kaggle.com/code/aisuko/decoder-in-transformers-architecture) notebooks to become familiar with **self-attention** in the transformer architecture.

In this notebook, we focus on the **scaled dot-product attention mechanism (referred to as self-attention)**, which remains the most popular and most widely used attention mechanism in practice. Other types of attention mechanisms exist; see the reviews [2020 Efficient Transformers: A Survey](https://arxiv.org/abs/2009.06732) and [2023 A Survey on Efficient Training of Transformers](https://arxiv.org/abs/2302.01107), and the [FlashAttention](https://arxiv.org/abs/2205.14135) paper.

# Embedding an Input Sentence

From the "Encoder in Transformers architecture" notebook, we know the first steps are tokenization, normalization, and embedding the tokens. Here we do not need to normalize the inputs.

In [1]:
inputs="According to the news, it it hard to say Melbourne is safe now"

input_ids={s:i for i,s in enumerate(sorted(inputs.replace(',','').split()))}
input_ids

{'According': 0,
 'Melbourne': 1,
 'hard': 2,
 'is': 3,
 'it': 5,
 'news': 6,
 'now': 7,
 'safe': 8,
 'say': 9,
 'the': 10,
 'to': 12}
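Note that the indices 4 and 11 are missing from the mapping above: `it` and `to` each appear twice in the sentence, so the later index overwrites the earlier one in the dictionary. This is harmless here, but if contiguous ids are preferred, a minimal sketch could deduplicate the words first (the names `words` and `vocab` are illustrative and not part of the original notebook):

```python
# Sketch: build a vocabulary with contiguous ids by deduplicating first.
inputs = "According to the news, it it hard to say Melbourne is safe now"

words = inputs.replace(",", "").split()
vocab = {word: idx for idx, word in enumerate(sorted(set(words)))}

print(len(words))   # 13 tokens in the sentence
print(len(vocab))   # 11 unique words
print(vocab["to"])  # 10, the largest id
```

With this variant the embedding table would only need 11 rows, and the numbers printed in the rest of the notebook would differ, so the cells below keep the original mapping.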

Let's convert the sentence to tokens (assigning an integer index to each word).

In [2]:
import torch

input_tokens=torch.tensor([input_ids[s] for s in inputs.replace(',','').split()])
input_tokens

tensor([ 0, 12, 10,  6,  5,  5,  2, 12,  9,  1,  3,  8,  7])

Now, using the integer-vector representation of the input sentence, we can use an embedding layer to **encode the inputs** into a real-valued vector embedding. Here, we will use a 16-dimensional embedding such that each input word is represented by a 16-dimensional vector. Since the sentence consists of 13 words, this will result in a 13x16-dimensional embedding.

In [3]:
# using the same seed to keep the same result
torch.manual_seed(123)
embed=torch.nn.Embedding(13,16)
embedded_sentence=embed(input_tokens).detach()
embedded_sentence

tensor([[ 3.3737e-01, -1.7778e-01, -3.0353e-01, -5.8801e-01,  3.4861e-01,
          6.6034e-01, -2.1964e-01, -3.7917e-01,  7.6711e-01, -1.1925e+00,
          6.9835e-01, -1.4097e+00,  1.7938e-01,  1.8951e+00,  4.9545e-01,
          2.6920e-01],
        [-9.7969e-01, -2.1126e+00, -2.7214e-01, -3.5100e-01,  1.1152e+00,
         -6.1722e-01, -2.2708e+00, -1.3819e+00,  1.1721e+00, -4.3716e-01,
         -4.0527e-01,  7.0864e-01,  9.5331e-01, -1.3035e-02, -1.3009e-01,
         -8.7660e-02],
        [ 6.8508e-01,  2.0024e+00, -5.4688e-01,  1.6014e+00, -2.2577e+00,
         -1.8009e+00,  7.0147e-01,  5.7028e-01, -1.1766e+00, -2.0524e+00,
          1.1318e-01,  1.4353e+00,  8.8307e-02, -1.2037e+00,  1.0964e+00,
          2.4210e+00],
        [-2.2150e+00, -1.3193e+00, -2.0915e+00,  9.6285e-01, -3.1861e-02,
         -4.7896e-01,  7.6681e-01,  2.7468e-02,  1.9929e+00,  1.3708e+00,
         -5.0087e-01, -2.7928e-01, -2.0628e+00,  6.3745e-03, -9.8955e-01,
          7.0161e-01],
        [ 2.5529e-01

In [4]:
embedded_sentence.shape

torch.Size([13, 16])
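An embedding layer is essentially a trainable lookup table: each token id selects one row of its weight matrix. The following sketch re-creates the layer and the token ids from the cells above so that it runs on its own, and checks that indexing `embed.weight` directly gives the same result:

```python
import torch

torch.manual_seed(123)
embed = torch.nn.Embedding(13, 16)  # 13 rows (ids 0..12), 16 features each

# Token ids of the example sentence, copied from the output above.
input_tokens = torch.tensor([0, 12, 10, 6, 5, 5, 2, 12, 9, 1, 3, 8, 7])

looked_up = embed(input_tokens).detach()
indexed = embed.weight[input_tokens].detach()

print(torch.equal(looked_up, indexed))          # True: the layer is a row lookup
print(torch.equal(looked_up[4], looked_up[5]))  # True: both positions hold "it"
```

Because the lookup depends only on the token id, repeated words get identical embeddings regardless of their position in the sentence.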

# Defining the Weight Matrices

The self-attention mechanism (scaled dot-product attention) uses three weight matrices, referred to as $W_{q}$, $W_{k}$ and $W_{v}$, which are adjusted as model parameters during training. These matrices serve to project the inputs into query, key, and value components of the sequence, respectively.

The respective query, key and value sequences are obtained via matrix multiplication between the weight matrices $W$ and the **embedded inputs x**:

**Query sequence**

$$q^{(i)}=W_{q}x^{(i)} \quad \text{for } i \in [1,T]$$

**Key sequence**

$$k^{(i)}=W_{k}x^{(i)} \quad \text{for } i \in [1,T]$$

**Value sequence**

$$v^{(i)}=W_{v}x^{(i)} \quad \text{for } i \in [1,T]$$


The index **i** refers to the token index position in the input sequence, which has length T.

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/814/464/564/081/956/original/9c952305a2b6d5a8.png" width="20%" heigh="20%" alt="scaled dot-product attention"></div>

Here, both $q^{(i)}$ and $k^{(i)}$ are vectors of dimension $d_{k}$. The projection matrices $W_{q}$ and $W_{k}$ have shape $d_{k} \times d$, while $W_{v}$ has shape $d_{v} \times d$. Note that $d$ is the size of each word vector $x$.

As shown in the [dot-product illustration](https://www.kaggle.com/code/aisuko/encoder-in-transformers-architecture), we compute the dot product between the query and key vectors, so these two vectors must contain the same number of elements ($d_{q}=d_{k}$). However, the number of elements in the value vector $v^{(i)}$, which determines the size of the resulting context vector, is arbitrary.

For example, we set $d_{q}=d_{k}=24$ and $d_{v}=28$, initializing the projection matrices as follows:

In [5]:
torch.manual_seed(123)

d=embedded_sentence.shape[1]
d

16

In [6]:
d_q, d_k, d_v=24,24,28

W_query=torch.nn.Parameter(torch.rand(d_q, d))
W_key=torch.nn.Parameter(torch.rand(d_k, d))
W_value=torch.nn.Parameter(torch.rand(d_v, d))
W_value

Parameter containing:
tensor([[2.2383e-01, 3.0465e-01, 3.0185e-01, 7.7719e-01, 4.9632e-01, 3.8457e-01,
         9.4751e-02, 5.4099e-01, 8.0899e-01, 8.1570e-01, 5.4314e-01, 9.5954e-01,
         3.7638e-01, 8.8847e-01, 7.7945e-01, 9.4166e-01],
        [7.5758e-01, 4.9898e-02, 7.4476e-01, 1.3877e-01, 1.6512e-01, 1.4907e-01,
         2.6847e-01, 5.0905e-02, 9.2707e-01, 2.8936e-01, 8.2721e-01, 9.4828e-01,
         8.1707e-01, 8.7183e-01, 5.1264e-01, 8.6063e-03],
        [8.0527e-01, 7.8735e-02, 6.2932e-01, 2.9138e-01, 8.2026e-01, 8.3362e-01,
         4.7395e-01, 3.2585e-01, 8.8695e-01, 3.4264e-01, 1.1503e-01, 1.7675e-01,
         2.1455e-02, 8.6990e-01, 8.7559e-01, 3.7270e-01],
        [7.2059e-01, 7.8469e-01, 2.9878e-01, 5.8486e-01, 4.1490e-01, 2.5936e-01,
         1.8493e-01, 2.5396e-01, 4.6260e-01, 4.3994e-01, 1.2095e-01, 4.5656e-02,
         4.3196e-01, 6.9407e-01, 6.6612e-01, 1.4987e-01],
        [7.6967e-01, 1.5432e-01, 2.5701e-01, 9.0780e-01, 6.2522e-01, 6.8266e-01,
         1.8458e-

# Computing the Unnormalized Attention Weights

Now, let's suppose we are interested in computing the attention vector for the second input element, which acts as the query here:

In [7]:
x_2=embedded_sentence[1]
query_2=W_query.matmul(x_2)
key_2=W_key.matmul(x_2)
value_2=W_value.matmul(x_2)
value_2

tensor([-0.6497,  0.3101, -1.5242, -2.4824, -0.5965, -2.2526, -4.5416, -4.1824,
        -1.8672, -1.1036, -2.9178, -2.4902, -3.6235, -3.8396, -2.7322, -0.9615,
        -0.3936, -2.3660, -1.2402, -4.7051, -2.8151, -1.9909, -3.8078, -1.4460,
        -2.3606, -2.4327, -1.7750, -2.9069], grad_fn=<MvBackward0>)

In [8]:
value_2.shape

torch.Size([28])

We can then generalize this to compute the key and value elements for all inputs, since we will need them in the next step when we compute the unnormalized attention weights $w$:

In [9]:
keys=W_key.matmul(embedded_sentence.T).T
values=W_value.matmul(embedded_sentence.T).T
print(keys.shape)
print(values.shape)

torch.Size([13, 24])
torch.Size([13, 28])
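The transposes above come from treating each embedding as a column vector ($Wx$). An equivalent and common way to write the same projection treats each row of the input as one token. The sketch below uses randomly generated stand-ins for the embeddings and weights (the names `x`, `W_q`, `W_k` and `W_v` are placeholders, not the notebook's tensors) and also projects the queries for all tokens at once:

```python
import torch

torch.manual_seed(0)
T, d, d_k, d_v = 13, 16, 24, 28

x = torch.randn(T, d)      # stand-in for embedded_sentence
W_q = torch.rand(d_k, d)   # stand-ins for the projection matrices
W_k = torch.rand(d_k, d)
W_v = torch.rand(d_v, d)

# Column-vector form used in the notebook: (W @ x.T).T
keys_col = W_k.matmul(x.T).T
# Row-vector form: x @ W.T
keys_row = x @ W_k.T

queries = x @ W_q.T
values = x @ W_v.T

print(torch.allclose(keys_col, keys_row, atol=1e-4))
print(queries.shape, keys_row.shape, values.shape)
```

Both forms produce one row per token, so the choice between them is a matter of convention.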


Let's compute the unnormalized attention weights $w$, which are illustrated in the figure below:

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/814/578/611/132/625/original/a961fee9d7ca0131.png" width="40%" heigh="40%" alt="computing the unnormalized attention weights w"></div>


As illustrated in the figure above, we compute $w_{ij}$ as the dot product between the query and key sequences $w_{ij}={q^{(i)}}^{T}k^{(j)}$.

For example, we can compute the unnormalized attention weight between the query and the 5th input element (corresponding to index position 4) as follows:

In [10]:
w_2_4=query_2.dot(keys[4])
w_2_4

tensor(155.9239, grad_fn=<DotBackward0>)

In [11]:
# Here we compute the w values for all input tokens as illustrated in the previous figure
w_2=query_2.matmul(keys.T)
w_2

tensor([ -24.6096,  151.2782,  -44.1470,  110.7908,  155.9239,  155.9239,
          70.0803,  151.2782,   71.0386,   69.2800, -144.1026,  185.6768,
          41.0362], grad_fn=<SqueezeBackward4>)
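The same idea extends from one query to all of them: stacking the queries as rows gives a $T \times T$ matrix whose entry $(i, j)$ is $w_{ij}$. A minimal sketch with random stand-in tensors (`queries` and `keys` here are placeholders with the notebook's shapes, not its actual values):

```python
import torch

torch.manual_seed(0)
T, d_k = 13, 24

queries = torch.randn(T, d_k)  # stand-in: one query per token
keys = torch.randn(T, d_k)     # stand-in: one key per token

# Entry (i, j) is the dot product between query i and key j.
w = queries @ keys.T

print(w.shape)  # torch.Size([13, 13])
# Row 1 matches the single-query computation for the second token.
print(torch.allclose(w[1], queries[1].matmul(keys.T), atol=1e-4))
# Entry (1, 4) matches the single dot product.
print(torch.allclose(w[1, 4], queries[1].dot(keys[4]), atol=1e-4))
```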

# Computing the Attention Scores

The subsequent step in self-attention is to normalize the unnormalized attention weights $w$ to obtain the normalized attention weights $\alpha$ by applying the softmax function. Additionally, $w$ is scaled by $1/\sqrt{d_{k}}$ before it is normalized through the softmax function, as shown below:

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/814/687/652/091/481/original/41644ce020deec4c.png" width="80%" heigh="80%" alt="computing attention scores"></div>

Scaling by $1/\sqrt{d_{k}}$ ensures that the Euclidean length of the weight vectors stays at approximately the same magnitude. This helps prevent the attention weights from becoming too small or too large, which could lead to numerical instability or affect the model's ability to converge during training.
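One way to see why dividing by $\sqrt{d_{k}}$ helps: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_{k}$, so its typical magnitude grows like $\sqrt{d_{k}}$. The sketch below checks this empirically under that assumption (standard normal vectors, chosen for illustration and not taken from the notebook's data):

```python
import torch

torch.manual_seed(0)
n_samples = 10_000
stds = {}  # d_k -> (std of raw dot products, std after scaling)

for d_k in (4, 24, 256):
    q = torch.randn(n_samples, d_k)
    k = torch.randn(n_samples, d_k)
    dots = (q * k).sum(dim=1)    # one dot product per sample
    scaled = dots / d_k ** 0.5   # divide by sqrt(d_k)
    stds[d_k] = (dots.std().item(), scaled.std().item())
    print(d_k, round(stds[d_k][0], 2), round(stds[d_k][1], 2))
```

The raw standard deviation should track $\sqrt{d_{k}}$ while the scaled one stays close to 1, which keeps the softmax inputs in a comparable range as $d_{k}$ changes.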

We can implement the computation of the attention weights as follows:

In [12]:
import torch.nn.functional as F

attention_weights_2=F.softmax(w_2/d_k**0.5, dim=0)
attention_weights_2

tensor([2.2665e-19, 8.8675e-04, 4.2010e-21, 2.2835e-07, 2.2890e-03, 2.2890e-03,
        5.6183e-11, 8.8675e-04, 6.8322e-11, 4.7716e-11, 5.7849e-30, 9.9365e-01,
        1.4957e-13], grad_fn=<SoftmaxBackward0>)
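`F.softmax` exponentiates each scaled score and divides by the sum, so the weights are positive and sum to 1. A hand-written version, shown here as a sketch using the scores printed above (copied as rounded values, so results only match the notebook approximately), usually subtracts the maximum first for numerical stability:

```python
import torch
import torch.nn.functional as F

# Unnormalized scores copied (rounded) from the printed output of w_2 above.
w_2 = torch.tensor([-24.6096, 151.2782, -44.1470, 110.7908, 155.9239, 155.9239,
                    70.0803, 151.2782, 71.0386, 69.2800, -144.1026, 185.6768,
                    41.0362])
d_k = 24

scaled = w_2 / d_k ** 0.5
# Subtracting the maximum leaves the result unchanged but avoids overflow in exp.
exp_scores = torch.exp(scaled - scaled.max())
manual = exp_scores / exp_scores.sum()

print(torch.allclose(manual, F.softmax(scaled, dim=0)))
print(manual.sum())
print(manual.argmax())
```

Nearly all of the weight falls on index position 11. This is likely a consequence of the untrained, randomly initialized projection matrices, which produce scores of large magnitude even after scaling.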

Finally, the last step is to compute the context vector $z^{(2)}$, which is an attention-weighted version of our original query input $x^{(2)}$, including all the other input elements as its context via the attention weights:

<div style="text-align: center"><img src="https://files.mastodon.social/media_attachments/files/111/814/731/512/658/906/original/651e3e40601e50fa.png" width="80%" heigh="80%" alt="computing context vector"></div>

In [13]:
context_vector_2=attention_weights_2.matmul(values)
context_vector_2

tensor([-2.9182, -2.0006, -3.9933, -4.1344, -3.2336, -3.3511, -2.9606, -3.6264,
        -2.5876, -3.9000, -2.7759, -3.8449, -4.1974, -2.1862, -3.4551, -2.5073,
        -3.4832, -2.2261, -3.4518, -3.9524, -4.4011, -4.7407, -4.1783, -2.8100,
        -4.1595, -3.3601, -3.0404, -4.5382], grad_fn=<SqueezeBackward4>)

Note that this output vector has more dimensions ($d_{v}=28$) than the original input vector ($d=16$) since we specified $d_{v}>d$ earlier; however, the choice of the output size $d_{v}$ is arbitrary.

In [14]:
context_vector_2.shape

torch.Size([28])
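The steps above (projection, scaled dot products, softmax, weighted sum) can be collected into one function that handles every token at once. This is a minimal sketch under the notebook's conventions, not a definitive implementation; the function name `self_attention` and the random stand-in tensors are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F


def self_attention(x, W_query, W_key, W_value):
    """Scaled dot-product self-attention for a single sequence.

    x: (T, d) token embeddings; W_query, W_key: (d_k, d); W_value: (d_v, d).
    Returns the (T, d_v) context vectors and the (T, T) attention weights.
    """
    queries = x @ W_query.T
    keys = x @ W_key.T
    values = x @ W_value.T
    d_k = keys.shape[-1]
    scores = queries @ keys.T / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)  # normalize over keys for each query
    return weights @ values, weights


torch.manual_seed(0)
x = torch.randn(13, 16)  # stand-in embeddings with the notebook's shape
W_q, W_k, W_v = torch.rand(24, 16), torch.rand(24, 16), torch.rand(28, 16)

context, weights = self_attention(x, W_q, W_k, W_v)
print(context.shape)         # torch.Size([13, 28])
print(weights.sum(dim=-1))   # each row sums to 1 (up to rounding)
```

Row `i` of `context` corresponds to the context vector $z^{(i)}$, so `context[1]` plays the role of `context_vector_2` for these stand-in inputs.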

# Credit

* https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html