## 1.1 Brief Introduction

* We will discuss three attention variants here: self-attention, causal attention, and multi-head attention.
* These variants build on each other. The goal is to arrive at a compact and efficient implementation of multi-head attention that we can plug into the LLM architecture.

## 1.2 Self-attention with trainable weights

* In self-attention, our goal is to calculate a context vector for each element in the input sequence.
* Let's use the following input sequence, where each token is represented by a 3-dimensional embedding vector.


In [2]:
import torch
inputs = torch.tensor(
    [
        [0.34, 0.55, 0.66],  # Attention (x1)
        [0.99, 0.87, 0.56],  # is        (x2)
        [0.67, 0.12, 0.65],  # all       (x3)
        [0.99, 0.89, 0.53],  # you       (x4)
        [0.77, 0.67, 0.77],  # need      (x5)
    ]
)

* Next we initialize the query, key, and value weight matrices, which project the input embeddings into their respective query, key, and value matrices.


In [7]:
torch.manual_seed(123)
w_query = torch.rand(3,5)
w_key = torch.rand(3,5)
w_value = torch.rand(3,5)

In [13]:
# Compute the query, key, and value matrices
query = torch.matmul(inputs, w_query)
key = torch.matmul(inputs, w_key)
value = torch.matmul(inputs, w_value)
print(f"Query:{query}")
print(f"Key:{key}")
print(f"Value:{value}")

Query:tensor([[0.7853, 0.7042, 0.1919, 0.4651, 0.6335],
        [1.2236, 1.0150, 0.3807, 0.9519, 0.8824],
        [0.5073, 0.8091, 0.2301, 0.6112, 0.3424],
        [1.2314, 0.9971, 0.3804, 0.9497, 0.8875],
        [1.0513, 1.0183, 0.3207, 0.8049, 0.7873]])
Key:tensor([[1.0406, 0.5497, 1.1973, 1.2639, 0.4013],
        [1.5090, 0.7888, 1.8532, 1.7368, 0.9145],
        [0.7995, 0.3310, 1.1880, 0.9585, 0.5008],
        [1.5063, 0.7924, 1.8402, 1.7294, 0.9189],
        [1.3882, 0.7024, 1.7238, 1.6450, 0.7225]])
Value:tensor([[0.8222, 0.4466, 0.7402, 0.9640, 0.9085],
        [1.5228, 0.9283, 1.2754, 1.3727, 1.6282],
        [0.9374, 0.6549, 0.8192, 1.1236, 0.9768],
        [1.5193, 0.9229, 1.2689, 1.3464, 1.6250],
        [1.3175, 0.7990, 1.1416, 1.3923, 1.4168]])
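
* The projection step above can be sketched on its own to make the shapes explicit. This is a minimal illustration, not the notebook's exact cell: the names `x`, `w`, and `projected` and the seed are assumptions chosen for the sketch, so the values differ from the output above.

```python
import torch

torch.manual_seed(0)   # arbitrary seed, used only for this sketch

x = torch.rand(5, 3)   # 5 tokens, each a 3-dimensional embedding (d_in = 3)
w = torch.rand(3, 5)   # weight matrix projecting d_in = 3 to d_out = 5

projected = x @ w      # equivalent to torch.matmul(x, w)

# Row i of the result is token i's embedding multiplied by the weight matrix.
row_0 = x[0] @ w

print(projected.shape)                       # one d_out-dimensional row per token
print(torch.allclose(projected[0], row_0))   # row 0 matches the single-token product
```

* The same pattern applies to all three projections: only the weight matrix changes, so the query, key, and value matrices all have one row per token and `d_out` columns.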


* Next we calculate the attention scores as the dot product of every query vector with every key vector.

In [14]:
attn_scores = torch.matmul(query,key.T)
print(f"Attention Scores:{attn_scores}")

Attention Scores:tensor([[2.2762, 3.4834, 1.8520, 3.4806, 3.1385],
        [3.8442, 5.8128, 3.1209, 5.8050, 5.2712],
        [2.1581, 3.2049, 1.7041, 3.2004, 2.9221],
        [3.8416, 5.8109, 3.1214, 5.8031, 5.2692],
        [3.3711, 5.1021, 2.7245, 5.0962, 4.6205]])
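
* To see what a single entry of the score matrix means, here is a small sketch with hand-picked numbers (the tensors below are illustrative assumptions, not the notebook's data). Entry `[i, j]` is the dot product of query `i` with key `j`.

```python
import torch

# Tiny hand-picked query and key matrices: 2 tokens, 2 dimensions each.
query = torch.tensor([[1.0, 0.0],
                      [0.0, 2.0]])
key = torch.tensor([[1.0, 1.0],
                    [3.0, 0.5]])

# All pairwise scores at once.
attn_scores = query @ key.T

# The same score for one pair, computed directly:
# query[1] . key[0] = 0*1 + 2*1 = 2
single = torch.dot(query[1], key[0])

print(attn_scores)   # tensor([[1., 3.], [2., 1.]])
print(single)        # tensor(2.)
```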


* Next we compute the attention weights by scaling the attention scores and applying softmax along each row:

$\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$

* Here $Q$ is the query matrix, $K$ is the key matrix, and $d_k$ is the dimension of the key vectors (the output dimension of the projection, 5 in this example).

In [15]:
d_k = key.shape[-1]
attn_weights = torch.softmax(attn_scores/d_k**0.5,dim=-1)
print(f"Attention Weights:{attn_weights}")

Attention Weights:tensor([[0.1486, 0.2551, 0.1230, 0.2547, 0.2186],
        [0.1186, 0.2860, 0.0858, 0.2850, 0.2245],
        [0.1559, 0.2490, 0.1273, 0.2485, 0.2194],
        [0.1186, 0.2860, 0.0859, 0.2850, 0.2245],
        [0.1277, 0.2770, 0.0957, 0.2763, 0.2233]])
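
* The effect of dividing by $\sqrt{d_k}$ can be seen in a small sketch. The scores and the value `d_k = 4` below are assumptions picked for easy arithmetic, not values from the notebook.

```python
import torch

scores = torch.tensor([[2.0, 4.0, 6.0]])  # one row of illustrative attention scores
d_k = 4                                   # assumed key dimension, so sqrt(d_k) = 2

unscaled = torch.softmax(scores, dim=-1)
scaled = torch.softmax(scores / d_k**0.5, dim=-1)

# Softmax written out by hand, to show what torch.softmax computes per row.
exp_scores = torch.exp(scores / d_k**0.5)
manual = exp_scores / exp_scores.sum(dim=-1, keepdim=True)

print(unscaled)            # more peaked: most weight on the largest score
print(scaled)              # flatter distribution after scaling
print(scaled.sum(dim=-1))  # each row of attention weights sums to 1
```

* Scaling shrinks the differences between scores before softmax, so the weights are less concentrated on a single token. Each row still sums to 1.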


* To get the context vectors, we multiply the attention weights matrix by the value matrix, so each context vector is a weighted sum of the value vectors.

In [16]:
context_vector = torch.matmul(attn_weights,value)
print(f"Context Vector:{context_vector}")

Context Vector:tensor([[1.3009, 0.7934, 1.1089, 1.2789, 1.3941],
        [1.3424, 0.8171, 1.1409, 1.2998, 1.4386],
        [1.2932, 0.7887, 1.1030, 1.2750, 1.3859],
        [1.3424, 0.8171, 1.1409, 1.2998, 1.4385],
        [1.3305, 0.8103, 1.1317, 1.2938, 1.4259]])
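
* The weighted-sum view of the context vector can be checked with a small sketch. The weights and values below are hand-picked assumptions for illustration, with each row of weights summing to 1.

```python
import torch

# Illustrative attention weights for 2 queries over 3 tokens (rows sum to 1).
attn_weights = torch.tensor([[0.5, 0.5, 0.0],
                             [0.1, 0.2, 0.7]])
# Illustrative value vectors for the 3 tokens.
value = torch.tensor([[1.0, 0.0],
                      [0.0, 1.0],
                      [2.0, 2.0]])

# Matrix form: all context vectors at once.
context = attn_weights @ value

# Explicit form for query 1: sum over tokens of weight * value vector.
# 0.1*[1, 0] + 0.2*[0, 1] + 0.7*[2, 2] = [1.5, 1.6]
manual_row_1 = sum(attn_weights[1, j] * value[j] for j in range(value.shape[0]))

print(context)
print(manual_row_1)
```

* Because the third weight in row 0 is zero, the first context vector ignores the third value vector entirely, which is how attention weights control how much each token contributes.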


In [None]:
import torch.nn as nn
class SelfAttention_v1(nn.Module):
  def __init__(self,d_in,d_out):
    super().__init__()
    self.w_query = nn.Parameter(torch.rand(d_in,d_out))
    self.w_key = nn.Parameter(torch.rand(d_in,d_out))
    self.w_value = nn.Parameter(torch.rand(d_in,d_out))