# Coding a Self-Attention Class in PyTorch to Understand the Math Behind Self-Attention

---
Transformer models rely on the **Attention** mechanism to predict the next token, and today's **Large Language Models** rely on it to produce useful outputs. **Self-Attention** is one type of Attention. In it, the encoding of each word in a sentence builds **contextual awareness** through **Attention scores**, which capture the **associations** between the words in that sentence. Each word is compared not only with the **other words** but also with **itself**, which is why this type of Attention is called **Self-Attention**.

**Content:**
- **[Self-Attention Class](#SAttention)** The Self-Attention class helps a transformer form relationships among words and tokens.

- **[Method to calculate Self-Attention value](#attention_values)** Using the Self-Attention class to calculate Self-Attention scores for sample data.

- **[Test with more examples](#validate)** Verifying the class.

In [4]:
# Imports
import torch
import torch.nn as nn # torch.nn provides the Module and Linear classes
import torch.nn.functional as F # For SoftMax()

## Class Self-Attention

We are going to code the `SelfAttention` class, which outputs Attention Scores based on the following equation: $Attention(Q, K, V) = SoftMax\left(\frac{QK^{T}}{\sqrt{d_{K}}}\right)V$
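
To make the shapes in this equation concrete, here is a short sketch. It assumes $n$ tokens with $d$ embedding values each and weight matrices $W_Q$, $W_K$, $W_V$ of size $d \times d$, so that $d_{K} = d$. The symbols $n$, $d$, $X$, $W_Q$, $W_K$ and $W_V$ are introduced here for illustration only.

$$
X \in \mathbb{R}^{n \times d}, \qquad Q = XW_Q, \quad K = XW_K, \quad V = XW_V, \qquad Q, K, V \in \mathbb{R}^{n \times d}
$$

$$
QK^{T} \in \mathbb{R}^{n \times n}, \qquad SoftMax\left(\frac{QK^{T}}{\sqrt{d_{K}}}\right) \in \mathbb{R}^{n \times n}, \qquad Attention(Q, K, V) \in \mathbb{R}^{n \times d}
$$

Row $i$ of $QK^{T}$ holds the similarity of token $i$'s query with every token's key. SoftMax is applied to each row separately, so every row of the $n \times n$ matrix sums to 1.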

In [5]:
# This class inherits from Module
class SelfAttention(nn.Module):

    # d_model: the number of embedding values per token
    # [NOTE: in "Attention Is All You Need", d_model was set to 512]
    # row_dim: index of the dimension that holds the rows (tokens)
    # col_dim: index of the dimension that holds the columns (embedding values)
    def __init__(self, d_model=2,
                 row_dim = 0,
                 col_dim = 1):
        
        # Initialize the parent nn.Module class
        super().__init__()

        # Initializing the weight matrices for Query, Value and Key respectively.
        # The same weights are applied to every token.
        # [Note: Although a lot of implementations include bias terms when creating
        # queries, keys, and values, the original "Attention Is All You Need" did not use them.]

        self.W_q = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
        self.W_v = nn.Linear(in_features=d_model, out_features=d_model, bias=False)
        self.W_k = nn.Linear(in_features=d_model, out_features=d_model, bias=False) 

        self.row_dim = row_dim
        self.col_dim = col_dim

    def forward(self, token_encoding):
        # creating query, key and value using encoding numbers
        # associated with each token (token encodings)

        q = self.W_q(token_encoding)
        k = self.W_k(token_encoding)
        v = self.W_v(token_encoding)

        # computing similarity score (q * K^T) i.e. dot product
        sims = torch.matmul(q, k.transpose(dim0 = self.row_dim, dim1 = self.col_dim))

        # scaling the similarity scores by dividing by sqrt(d_k),
        # where d_k = k.size(col_dim) is the number of columns in k
        scaled_sims = sims / torch.tensor(k.size(self.col_dim)**0.5)

        # performing SoftMax on the scaled similarity scores
        # to determine what percentage of each token's value to
        # use in the final attention values.
        attention_percents = F.softmax(scaled_sims, dim=self.col_dim)

        # scale the values by their associated percentages and add them up
        attention_scores = torch.matmul(attention_percents, v)

        return attention_scores

## Calculating Self-Attention using the above class

In [6]:
# Matrix of token encodings
# To keep the example simple, each token encoding
# has only two values (d_model = 2)
encoding_matrix = torch.tensor([[1.16, 0.23],
                                [0.57, 1.36],
                                [4.41, -2.16]])

# set the seed for the random number generator
torch.manual_seed(42)

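# create a basic self-attention object
# (note: this rebinds the name SelfAttention from the class to the
# instance, so re-running this cell will raise an error)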
SelfAttention = SelfAttention(d_model = 2, 
                              row_dim = 0,
                              col_dim = 1)

# Attention score for the token embeddings supplied
SelfAttention(encoding_matrix)

tensor([[-0.7802, -1.8837],
        [-0.9534, -2.3194],
        [-0.4130, -0.9592]], grad_fn=<MmBackward0>)
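
The `row_dim` and `col_dim` arguments decide which axes are treated as tokens and as embedding values. The sketch below shows how the same math could be applied to a batch of sequences by passing `row_dim=1, col_dim=2`. It assumes a hypothetical helper, `attend`, that repeats the steps of `forward()` as a plain function. The helper is not part of the class above, and the seed and inputs are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(x, W_q, W_k, W_v, row_dim, col_dim):
    """Hypothetical helper: the same steps as forward(), as a plain function."""
    q, k, v = W_q(x), W_k(x), W_v(x)
    sims = torch.matmul(q, k.transpose(row_dim, col_dim))
    scaled_sims = sims / (k.size(col_dim) ** 0.5)
    attention_percents = F.softmax(scaled_sims, dim=col_dim)
    return torch.matmul(attention_percents, v)

torch.manual_seed(0)  # arbitrary seed, only for repeatability
d_model = 2
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

single = torch.randn(3, d_model)           # one sequence: (tokens, d_model)
batch = torch.stack([single, 2 * single])  # two sequences: (batch, tokens, d_model)

# 2-D input: rows are dim 0, columns are dim 1
out_single = attend(single, W_q, W_k, W_v, row_dim=0, col_dim=1)
# 3-D input: dim 0 is the batch, so rows are dim 1 and columns are dim 2
out_batch = attend(batch, W_q, W_k, W_v, row_dim=1, col_dim=2)

print(out_batch.shape)  # torch.Size([2, 3, 2])
print(torch.allclose(out_batch[0], out_single, atol=1e-5))  # True
```

The first sequence in the batch gets the same output as when it is processed alone, because attention is computed independently for each sequence.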

### Verifying the weights and the overall calculations

In [9]:
# Weight matrix for queries
print(f'Query Weight Matrix: {SelfAttention.W_q.weight.transpose(0, 1)}')

# Weight matrix for keys
print(f'Key Weight Matrix: {SelfAttention.W_k.weight.transpose(0, 1)}')

# Weight matrix for values
print(f'Value Weight Matrix: {SelfAttention.W_v.weight.transpose(0, 1)}')

Query Weight Matrix: tensor([[ 0.5406, -0.1657],
        [ 0.5869,  0.6496]], grad_fn=<TransposeBackward0>)
Key Weight Matrix: tensor([[ 0.6233,  0.6146],
        [-0.5188,  0.1323]], grad_fn=<TransposeBackward0>)
Value Weight Matrix: tensor([[-0.1549, -0.3443],
        [ 0.1427,  0.4153]], grad_fn=<TransposeBackward0>)
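
The weights are transposed before printing because `nn.Linear` stores its weight with shape `(out_features, in_features)` and computes `x @ weight.T`. The small check below illustrates this with a layer whose input and output sizes differ, so the layout is visible. The layer sizes, seed and input are arbitrary values chosen for this sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

# Different in/out sizes make the storage layout easy to see
linear = nn.Linear(in_features=2, out_features=3, bias=False)
x = torch.randn(4, 2)  # 4 tokens with 2 values each

print(linear.weight.shape)  # torch.Size([3, 2]), i.e. (out_features, in_features)

# Without a bias term, the layer is just a matrix product with the transposed weight
manual = x @ linear.weight.T
print(torch.allclose(linear(x), manual, atol=1e-6))  # True
```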


In [11]:
# Calculating the queries
print(f'Query Matrix: {SelfAttention.W_q(encoding_matrix)}')

# Calculating the keys
print(f'Key Matrix: {SelfAttention.W_k(encoding_matrix)}')

# Calculating the values
print(f'Value Matrix: {SelfAttention.W_v(encoding_matrix)}')

Query Matrix: tensor([[ 0.7621, -0.0428],
        [ 1.1063,  0.7890],
        [ 1.1164, -2.1336]], grad_fn=<MmBackward0>)
Key Matrix: tensor([[ 0.6038,  0.7434],
        [-0.3502,  0.5303],
        [ 3.8695,  2.4246]], grad_fn=<MmBackward0>)
Value Matrix: tensor([[-0.1469, -0.3038],
        [ 0.1057,  0.3685],
        [-0.9914, -2.4152]], grad_fn=<MmBackward0>)


In [13]:
q = SelfAttention.W_q(encoding_matrix)
q

tensor([[ 0.7621, -0.0428],
        [ 1.1063,  0.7890],
        [ 1.1164, -2.1336]], grad_fn=<MmBackward0>)

In [14]:
k = SelfAttention.W_k(encoding_matrix)
k

tensor([[ 0.6038,  0.7434],
        [-0.3502,  0.5303],
        [ 3.8695,  2.4246]], grad_fn=<MmBackward0>)

In [None]:
# printing the unscaled similarity
sims = torch.matmul(q, k.transpose(dim0=0, dim1=1))
sims

tensor([[ 0.4283, -0.2896,  2.8452],
        [ 1.2545,  0.0310,  6.1939],
        [-0.9121, -1.5224, -0.8533]], grad_fn=<MmBackward0>)
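
Each entry `sims[i][j]` is the dot product of token `i`'s query with token `j`'s key. The sketch below re-creates this using the rounded `q` and `k` values printed above, so it is only expected to agree with the notebook output up to rounding.

```python
import torch

# Rounded query and key matrices, copied from the printed outputs
q = torch.tensor([[0.7621, -0.0428],
                  [1.1063,  0.7890],
                  [1.1164, -2.1336]])
k = torch.tensor([[ 0.6038, 0.7434],
                  [-0.3502, 0.5303],
                  [ 3.8695, 2.4246]])

# Similarity between token 0's query and token 2's key, as a single dot product
single = torch.dot(q[0], k[2])

# All pairwise similarities at once
full = q @ k.T

printed_sims = torch.tensor([[ 0.4283, -0.2896,  2.8452],
                             [ 1.2545,  0.0310,  6.1939],
                             [-0.9121, -1.5224, -0.8533]])

print(torch.allclose(single, full[0, 2]))             # True
print(torch.allclose(full, printed_sims, atol=1e-3))  # True
```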

In [16]:
# printing the scaled similarity
scaled_sims = sims / (torch.tensor(2)**0.5)
scaled_sims

tensor([[ 0.3029, -0.2048,  2.0119],
        [ 0.8871,  0.0219,  4.3797],
        [-0.6449, -1.0765, -0.6034]], grad_fn=<DivBackward0>)

In [17]:
# printing the attention percentages
attention_percents = F.softmax(scaled_sims, dim=1)
attention_percents

tensor([[0.1403, 0.0845, 0.7752],
        [0.0292, 0.0123, 0.9586],
        [0.3715, 0.2413, 0.3872]], grad_fn=<SoftmaxBackward0>)
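
Because SoftMax is applied along `dim=1`, each row of the attention percentages should sum to 1. A quick check, using the rounded scaled similarities printed above:

```python
import torch
import torch.nn.functional as F

# Rounded scaled similarities, copied from the printed output
scaled_sims = torch.tensor([[ 0.3029, -0.2048,  2.0119],
                            [ 0.8871,  0.0219,  4.3797],
                            [-0.6449, -1.0765, -0.6034]])

# dim=1 normalizes across the columns of each row (one row per query token)
attention_percents = F.softmax(scaled_sims, dim=1)
row_sums = attention_percents.sum(dim=1)

printed_percents = torch.tensor([[0.1403, 0.0845, 0.7752],
                                 [0.0292, 0.0123, 0.9586],
                                 [0.3715, 0.2413, 0.3872]])

print(torch.allclose(row_sums, torch.ones(3)))                          # True
print(torch.allclose(attention_percents, printed_percents, atol=1e-3))  # True
```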

In [19]:
# Final Attention Scores
torch.matmul(attention_percents, SelfAttention.W_v(encoding_matrix))

tensor([[-0.7802, -1.8837],
        [-0.9534, -2.3194],
        [-0.4130, -0.9592]], grad_fn=<MmBackward0>)
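
As a final cross-check, the whole calculation can be sketched with plain tensors and no `nn.Module`. This version copies the rounded weight matrices printed earlier (already transposed to `(in, out)` layout), so the result is only expected to match the class output to within rounding error.

```python
import torch
import torch.nn.functional as F

# Token encodings from the example above
X = torch.tensor([[1.16,  0.23],
                  [0.57,  1.36],
                  [4.41, -2.16]])

# Rounded weight matrices, copied from the printed (transposed) outputs
W_q = torch.tensor([[ 0.5406, -0.1657], [ 0.5869, 0.6496]])
W_k = torch.tensor([[ 0.6233,  0.6146], [-0.5188, 0.1323]])
W_v = torch.tensor([[-0.1549, -0.3443], [ 0.1427, 0.4153]])

# Queries, keys and values
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# SoftMax(Q K^T / sqrt(d_K)) V
d_k = K.size(1)
attention = F.softmax(Q @ K.T / d_k**0.5, dim=1) @ V

expected = torch.tensor([[-0.7802, -1.8837],
                         [-0.9534, -2.3194],
                         [-0.4130, -0.9592]])

# Loose tolerance because the weights above are rounded to four decimals
print(torch.allclose(attention, expected, atol=1e-2))  # True
```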