## IMPLEMENTING SELF ATTENTION WITH TRAINABLE WEIGHTS

In [None]:
import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89, 0.17, 0.23, 0.19, 0.38, 0.44],  # The       (x^1)
   [0.55, 0.87, 0.66, 0.51, 0.49, 0.3, 0.2, 0.1],     # next      (x^2)
   [0.57, 0.85, 0.64, 0.8, 0.1, 0.4, 0.21, 0.39],     # day       (x^3)
   [0.22, 0.58, 0.33, 0.4, 0.4, 0.4, 0.1, 0.3],       # is        (x^4)
   [0.77, 0.25, 0.10, 0.1, 0.9, 0.3, 0.3, 0.2]]       # bright    (x^5)
)

<div class="alert alert-block alert-success">

Let's begin by defining a few variables:

</div>

<div class="alert alert-block alert-info">
    
A: The second input element

B: The input embedding size, d_in=8

C: The output embedding size, d_out=4

</div>

In [2]:
x_2 = inputs[1] # A
# What is x_2?
# Input embedding for the "next" word

d_in = inputs.shape[1] # B
d_out = 4 # C

In [3]:
print(x_2)
print(d_in)
print(d_out)

tensor([0.5500, 0.8700, 0.6600, 0.5100, 0.4900, 0.3000, 0.2000, 0.1000])
8
4


<div class="alert alert-block alert-info">
    
Note that in GPT-like models, the input and output dimensions are usually the same.

But for illustration purposes, to better follow the computation, we choose different input (d_in=8)
and output (d_out=4) dimensions here.

</div>

<div class="alert alert-block alert-success">

Next, we initialize the three weight matrices Wq, Wk and Wv

</div>

In [5]:
torch.manual_seed(123)

W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

In [6]:
print(W_query)

Parameter containing:
tensor([[0.2961, 0.5166, 0.2517, 0.6886],
        [0.0740, 0.8665, 0.1366, 0.1025],
        [0.1841, 0.7264, 0.3153, 0.6871],
        [0.0756, 0.1966, 0.3164, 0.4017],
        [0.1186, 0.8274, 0.3821, 0.6605],
        [0.8536, 0.5932, 0.6367, 0.9826],
        [0.2745, 0.6584, 0.2775, 0.8573],
        [0.8993, 0.0390, 0.9268, 0.7388]])


In [7]:
print(W_key)

Parameter containing:
tensor([[0.7179, 0.7058, 0.9156, 0.4340],
        [0.0772, 0.3565, 0.1479, 0.5331],
        [0.4066, 0.2318, 0.4545, 0.9737],
        [0.4606, 0.5159, 0.4220, 0.5786],
        [0.9455, 0.8057, 0.6775, 0.6087],
        [0.6179, 0.6932, 0.4354, 0.0353],
        [0.1908, 0.9268, 0.5299, 0.0950],
        [0.5789, 0.9131, 0.0275, 0.1634]])


In [8]:
print(W_value)

Parameter containing:
tensor([[0.3009, 0.5201, 0.3834, 0.4451],
        [0.0126, 0.7341, 0.9389, 0.8056],
        [0.1459, 0.0969, 0.7076, 0.5112],
        [0.7050, 0.0114, 0.4702, 0.8526],
        [0.7320, 0.5183, 0.5983, 0.4527],
        [0.2251, 0.3111, 0.1955, 0.9153],
        [0.7751, 0.6749, 0.1166, 0.8858],
        [0.6568, 0.8459, 0.3033, 0.6060]])


<div class="alert alert-block alert-info">
    
Note that we are setting requires_grad=False to reduce clutter in the outputs for
illustration purposes.

If we were to use the weight matrices for model training, we
would set requires_grad=True to update these matrices during model training.

</div>
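
As a minimal sketch of the trainable case described above: `nn.Parameter` tracks gradients by default, so calling `backward()` on a scalar fills in `.grad`. The names `W_demo` and `x_demo` and the toy "loss" are assumptions made for illustration; they are not part of the notebook.

```python
import torch

torch.manual_seed(123)
d_in, d_out = 8, 4

# nn.Parameter sets requires_grad=True by default, so gradients are tracked
W_demo = torch.nn.Parameter(torch.rand(d_in, d_out))
x_demo = torch.rand(d_in)        # stand-in input embedding (hypothetical)

loss = (x_demo @ W_demo).sum()   # toy scalar "loss", for illustration only
loss.backward()                  # populates W_demo.grad

print(W_demo.requires_grad)      # True
print(W_demo.grad.shape)         # torch.Size([8, 4])
```

An optimizer would use `W_demo.grad` to update the matrix during training.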

<div class="alert alert-block alert-success">

Next, we compute the query, key, and value vectors as shown earlier
</div>

In [9]:
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value
print(query_2)

tensor([0.8463, 2.3367, 1.1531, 1.9900])


<div class="alert alert-block alert-info">
    
As we can see based on the output for the query, this results in a 4-dimensional vector.

This is because we set the number of columns of the corresponding weight matrix to 4 via d_out.

</div>

<div class="alert alert-block alert-success">

Even though our temporary goal is to compute only the one context vector z(2), we still
require the key and value vectors for all input elements.

This is because they are involved in computing the attention weights with respect to the query q(2).
</div>

<div class="alert alert-block alert-success">

We can obtain all keys and values via matrix multiplication:
</div>

In [10]:
keys = inputs @ W_key
values = inputs @ W_value
queries = inputs @ W_query

print("keys.shape:", keys.shape)

print("values.shape:", values.shape)

print("queries.shape:", queries.shape)

keys.shape: torch.Size([5, 4])
values.shape: torch.Size([5, 4])
queries.shape: torch.Size([5, 4])


<div class="alert alert-block alert-info">
    
As we can tell from the outputs, we successfully projected the 5 input tokens from an 8D
onto a 4D embedding space.

</div>
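
The batched projection is simply the per-token projection applied to every row at once. The sketch below checks this with random stand-in data; `demo_inputs` and `W_demo` are hypothetical names, not the notebook's tensors.

```python
import torch

torch.manual_seed(0)
demo_inputs = torch.rand(5, 8)   # 5 tokens, 8-dimensional embeddings (random stand-ins)
W_demo = torch.rand(8, 4)        # hypothetical projection matrix, d_in=8 -> d_out=4

projected = demo_inputs @ W_demo        # all tokens at once: shape (5, 4)
second_row = demo_inputs[1] @ W_demo    # second token alone: shape (4,)

print(projected.shape)                                 # torch.Size([5, 4])
print(torch.allclose(projected[1], second_row))        # row 2 matches the single-token result
```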

<div class="alert alert-block alert-success">

First, let's compute the attention score ω22</div>

In [11]:
keys_2 = keys[1] # Key vector of the second input element
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)

tensor(12.0370)


<div class="alert alert-block alert-success">
Again, we can generalize this computation to all attention scores via matrix multiplication:</div>

In [12]:
attn_scores_2 = query_2 @ keys.T # All attention scores for given query
print(attn_scores_2)

tensor([ 9.7351, 12.0370, 12.2923,  8.7149, 10.9628])


In [13]:
attn_scores = queries @ keys.T # omega
print(attn_scores)

tensor([[ 8.7252, 10.8803, 11.0007,  7.7678,  9.7598],
        [ 9.7351, 12.0370, 12.2923,  8.7149, 10.9628],
        [10.4691, 12.9987, 13.1878,  9.3438, 11.8256],
        [ 7.7531,  9.6199,  9.7608,  6.9217,  8.7864],
        [ 8.8185, 10.9612, 11.1314,  7.8699,  9.8633]])
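
Each entry (i, j) of the score matrix is the dot product of query i with key j. The following sketch spells that out with explicit loops on random stand-in tensors (`demo_queries` and `demo_keys` are assumptions for illustration) and compares the result with the matrix product.

```python
import torch

torch.manual_seed(0)
demo_queries = torch.rand(5, 4)   # hypothetical queries, one row per token
demo_keys = torch.rand(5, 4)      # hypothetical keys, one row per token

# Matrix form: all pairwise dot products in one step
scores = demo_queries @ demo_keys.T

# Loop form: the same numbers, computed one pair at a time
loop_scores = torch.empty(5, 5)
for i, q in enumerate(demo_queries):
    for j, k in enumerate(demo_keys):
        loop_scores[i, j] = torch.dot(q, k)

print(torch.allclose(scores, loop_scores))
```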


<div class="alert alert-block alert-success">
    
We compute the attention weights by scaling the
attention scores and using the softmax function we used earlier.

The difference from earlier is
that we now scale the attention scores by dividing them by the square root of the
embedding dimension of the keys.

Note that taking the square root is mathematically the
same as exponentiating by 0.5:</div>

In [14]:
d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)

print(attn_weights_2)
print(d_k)

tensor([0.0980, 0.3099, 0.3521, 0.0589, 0.1811])
4
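
To make the scaling step explicit, here is a small sketch that recomputes the weights by hand. It copies the rounded attention scores printed earlier for the second query, so the result agrees with the output above only up to rounding. The manual softmax subtracts the maximum before exponentiating, which leaves the result unchanged and is a common way to avoid overflow.

```python
import math
import torch

# Rounded attention scores for the second query, copied from the output above
scores = torch.tensor([9.7351, 12.0370, 12.2923, 8.7149, 10.9628])
d_k = 4

# math.sqrt(d_k) and d_k ** 0.5 are the same number
scaled = scores / math.sqrt(d_k)

# Softmax written out: exponentiate, then normalize so the weights sum to 1
exp_scaled = torch.exp(scaled - scaled.max())
manual_weights = exp_scaled / exp_scaled.sum()

builtin_weights = torch.softmax(scores / d_k**0.5, dim=-1)
print(torch.allclose(manual_weights, builtin_weights))
```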


In [15]:
attn_weights_final = torch.softmax(attn_scores / d_k**0.5, dim=-1)
print(attn_weights_final)

row_sums = attn_weights_final.sum(dim=1)
print("\nSum of Each Row:")
print(row_sums)


tensor([[0.1069, 0.3140, 0.3335, 0.0662, 0.1793],
        [0.0980, 0.3099, 0.3521, 0.0589, 0.1811],
        [0.0911, 0.3227, 0.3547, 0.0519, 0.1795],
        [0.1162, 0.2954, 0.3170, 0.0767, 0.1947],
        [0.1063, 0.3103, 0.3379, 0.0662, 0.1793]])

Sum of Each Row:
tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000])


## WHY DIVIDE BY SQRT (DIMENSION)

<div class="alert alert-block alert-warning">

Reason 1: For stability in learning

The softmax function is sensitive to the magnitudes of its inputs. When the inputs are large, the differences between the exponential values of each input become much more pronounced. This causes the softmax output to become "peaky," where the highest value receives almost all the probability mass, and the rest receive very little.

In attention mechanisms, particularly in transformers, if the dot products between query and key vectors become too large (like multiplying by 8 in this example), the attention scores can become very large. This results in a very sharp softmax distribution, making the model overly confident in one particular "key." Such sharp distributions can make learning unstable, because the gradients of a saturated softmax become very small.
    
</div>

In [16]:
import torch

# Define the tensor
tensor = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])

# Apply softmax without scaling
softmax_result = torch.softmax(tensor, dim=-1)
print("Softmax without scaling:", softmax_result)

# Multiply the tensor by 8 and then apply softmax
scaled_tensor = tensor * 8
softmax_scaled_result = torch.softmax(scaled_tensor, dim=-1)
print("Softmax after scaling (tensor * 8):", softmax_scaled_result)

Softmax without scaling: tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])
Softmax after scaling (tensor * 8): tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])


## BUT WHY SQRT?

<div class="alert alert-block alert-warning">

Reason 2: To make the variance of the dot product stable

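The dot product of Q and K is a sum of d element-wise products, where d is the embedding dimension of the keys. If the components of Q and K are independent with mean 0 and variance 1, each product contributes a variance of 1.

The variance of the dot product therefore grows linearly with the dimension d.

Dividing by sqrt(d) keeps the variance close to 1, regardless of the dimension.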
    
</div>
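
The variance argument can be written out as a short derivation. It rests on an assumption not spelled out in the notebook: the components $q_i$ and $k_i$ are independent with mean 0 and variance 1, which is also how the NumPy simulation later in this section draws them.

$$
\mathrm{Var}(q \cdot k)
= \mathrm{Var}\Big(\sum_{i=1}^{d} q_i k_i\Big)
= \sum_{i=1}^{d} \mathrm{Var}(q_i k_i)
= \sum_{i=1}^{d} \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2]
= d
$$

$$
\mathrm{Var}\Big(\frac{q \cdot k}{\sqrt{d}}\Big)
= \frac{1}{d}\,\mathrm{Var}(q \cdot k)
= 1
$$

Dividing by $d$ instead would shrink the variance to $1/d$, which is why the square root is the scaling that keeps it at 1.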

Imagine you’re rolling dice. Consider two cases:

Case 1: Rolling one standard die (1–6):

The average (mean) is 3.5.
The variance is relatively small (≈2.9).
You have predictable outcomes.

Case 2: Rolling and summing 100 dice:

The mean is 100 × 3.5 = 350.
The variance significantly grows (100 × 2.9 = 290).
Now, outcomes fluctuate widely (e.g., you might get sums like 320, 350, or 380). The distribution spreads out drastically. Outcomes become unpredictable.


Dot Product without normalization:

Think of dimensions as "dice." Increasing the number of dimensions is like rolling more dice and summing results.
Each dimension (die) contributes some variance. As dimensions grow, variance accumulates.
Result: Dot products (before softmax) can become very large in magnitude, positive or negative, which makes the softmax output peaky and the attention weights unstable.

Dot Product with normalization (dividing by sqrt(d)):

This scales the variance back down to roughly 1, so the summed results stay in a stable range.
It’s like rescaling the sum of the dice so that its spread stays the same no matter how many dice you roll.
Result: Attention weights become more stable, predictable, and informative, enabling the model to learn effectively.
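
The dice analogy can be checked with a quick simulation. This is a sketch under assumptions chosen for illustration (fair six-sided dice, 20,000 trials, a fixed seed); the variance of a single die is 35/12 ≈ 2.92, and the variance of a sum of n dice is n times that.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen only for reproducibility
num_trials = 20_000
results = {}

for num_dice in (1, 100):
    # Each row is one trial; integers(1, 7) draws faces 1..6
    rolls = rng.integers(1, 7, size=(num_trials, num_dice))
    sums = rolls.sum(axis=1)

    var_sum = sums.var()                             # grows like num_dice * 35/12
    var_scaled = (sums / np.sqrt(num_dice)).var()    # stays near 35/12
    results[num_dice] = (var_sum, var_scaled)

    print(f"{num_dice:>3} dice: var(sum)={var_sum:.1f}, "
          f"var(sum/sqrt(n))={var_scaled:.2f}")
```

Dividing the sum by the square root of the number of dice brings the variance back to that of a single die, which mirrors what dividing by sqrt(d) does for the dot product.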


In [17]:
import numpy as np

# Function to compute variance before and after scaling
def compute_variance(dim, num_trials=1000):
    dot_products = []
    scaled_dot_products = []

    # Generate multiple random vectors and compute dot products
    for _ in range(num_trials):
        q = np.random.randn(dim)
        k = np.random.randn(dim)

        # Compute dot product
        dot_product = np.dot(q, k)
        dot_products.append(dot_product)

        # Scale the dot product by sqrt(dim)
        scaled_dot_product = dot_product / np.sqrt(dim)
        scaled_dot_products.append(scaled_dot_product)

    # Calculate variance of the dot products
    variance_before_scaling = np.var(dot_products)
    variance_after_scaling = np.var(scaled_dot_products)

    return variance_before_scaling, variance_after_scaling

# For dimension 5
variance_before_5, variance_after_5 = compute_variance(5)
print(f"Variance before scaling (dim=5): {variance_before_5}")
print(f"Variance after scaling (dim=5): {variance_after_5}")

# For dimension 100
variance_before_100, variance_after_100 = compute_variance(100)
print(f"Variance before scaling (dim=100): {variance_before_100}")
print(f"Variance after scaling (dim=100): {variance_after_100}")



Variance before scaling (dim=5): 4.840529816574412
Variance after scaling (dim=5): 0.9681059633148823
Variance before scaling (dim=100): 104.52172990736773
Variance after scaling (dim=100): 1.0452172990736772


<div class="alert alert-block alert-success">
    
We now compute the context vector as a weighted sum over the value
vectors.

Here, the attention weights serve as a weighting factor that weighs the respective
importance of each value vector.

We can use matrix multiplication to
obtain the output in one step:</div>

In [18]:
context_vec_2 = attn_weights_2 @ values
print(context_vec_2)

tensor([1.3301, 1.5304, 1.8753, 2.3433])


In [19]:
context_vec = attn_weights_final @ values
print(context_vec)

tensor([[1.3246, 1.5236, 1.8652, 2.3285],
        [1.3301, 1.5304, 1.8753, 2.3433],
        [1.3325, 1.5353, 1.8866, 2.3537],
        [1.3211, 1.5153, 1.8390, 2.3002],
        [1.3253, 1.5242, 1.8657, 2.3304]])
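
The matrix product above is shorthand for a weighted sum of value vectors. The sketch below writes the sum out as a loop on random stand-in data (`demo_weights` and `demo_values` are hypothetical names) and compares it with the one-step matrix form.

```python
import torch

torch.manual_seed(0)
demo_weights = torch.softmax(torch.rand(5), dim=-1)   # hypothetical attention weights, sum to 1
demo_values = torch.rand(5, 4)                        # hypothetical value vectors, one per token

# Weighted sum written out: add each value vector scaled by its attention weight
context_loop = torch.zeros(4)
for weight, value in zip(demo_weights, demo_values):
    context_loop += weight * value

# Same computation as a single matrix product
context_matmul = demo_weights @ demo_values

print(torch.allclose(context_loop, context_matmul))
```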


<div class="alert alert-block alert-success">
    
So far, we computed the context vector z(2) step by step and then obtained all context vectors via matrix multiplication.

In the next section, we will organize this code into a class that computes all context vectors in the input sequence, z(1) to z(T).</div>

## IMPLEMENTING A COMPACT SELF ATTENTION PYTHON CLASS

<div class="alert alert-block alert-success">
    
In the previous sections, we have gone through a lot of steps to compute the self-attention
outputs.

This was mainly done for illustration purposes so we could go through one step at
a time.

In practice, with the LLM implementation in mind, it is helpful to
organize this code into a Python class as follows:
    
</div>

In [20]:
import torch.nn as nn

class SelfAttention_v1(nn.Module):

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        # Project the inputs into key, query, and value vectors
        # (plain self-attention: no causal mask, single head)
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value

        attn_scores = queries @ keys.T # omega
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )

        context_vec = attn_weights @ values
        return context_vec

<div class="alert alert-block alert-warning">

In this PyTorch code, SelfAttention_v1 is a class derived from nn.Module, a
fundamental building block of PyTorch models that provides the necessary functionality for
creating and managing model layers.
</div>
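
One concrete piece of that management is parameter registration: any `nn.Parameter` assigned as an attribute of an `nn.Module` is picked up automatically. The sketch below uses a hypothetical `TinyModule`, introduced only to illustrate this; it is not part of the notebook.

```python
import torch
import torch.nn as nn

class TinyModule(nn.Module):
    """Hypothetical module with a single trainable matrix."""

    def __init__(self, d_in, d_out):
        super().__init__()
        # Assigning an nn.Parameter registers it with the module
        self.W = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        return x @ self.W

tiny = TinyModule(8, 4)

# named_parameters() lists everything an optimizer would update
param_names = [name for name, _ in tiny.named_parameters()]
num_params = sum(p.numel() for p in tiny.parameters())

print(param_names)   # ['W']
print(num_params)    # 32
```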

<div class="alert alert-block alert-warning">

The __init__ method initializes trainable weight matrices (W_query, W_key, and
W_value) for queries, keys, and values, each transforming the input dimension d_in to an
output dimension d_out.

</div>

<div class="alert alert-block alert-warning">

During the forward pass, using the forward method, we compute the attention scores
(attn_scores) by multiplying queries and keys, then scale these scores and normalize them using softmax.

</div>

<div class="alert alert-block alert-success">
    
Finally, we create a context vector by weighting the values with these normalized attention
scores.
    
</div>

In [21]:
torch.manual_seed(123)

sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))

tensor([[1.3246, 1.5236, 1.8652, 2.3285],
        [1.3301, 1.5304, 1.8753, 2.3433],
        [1.3325, 1.5353, 1.8866, 2.3537],
        [1.3211, 1.5153, 1.8390, 2.3002],
        [1.3253, 1.5242, 1.8657, 2.3304]], grad_fn=<MmBackward0>)


<div class="alert alert-block alert-info">

Since inputs contains five embedding vectors, we get a matrix storing the five
context vectors, as shown in the above result.
</div>

<div class="alert alert-block alert-warning">

We can improve the SelfAttention_v1 implementation further by utilizing PyTorch's
nn.Linear layers, which effectively perform matrix multiplication when the bias units are
disabled.

</div>

<div class="alert alert-block alert-warning">

Additionally, a significant advantage of using nn.Linear instead of manually
implementing nn.Parameter(torch.rand(...)) is that nn.Linear has an optimized weight
initialization scheme, contributing to more stable and effective model training.

</div>
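
The sketch below illustrates both points on stand-in data (`x_demo` is a hypothetical input, not the notebook's tensor). With `bias=False`, an `nn.Linear` layer computes `x @ weight.T`; note that it stores its weight with shape `(d_out, d_in)`, the transpose of the `nn.Parameter` version. The initialization bound checked here, `1 / sqrt(d_in)`, follows the PyTorch documentation for `nn.Linear` and should be treated as an assumption about the installed version.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(789)
d_in, d_out = 8, 4

layer = nn.Linear(d_in, d_out, bias=False)
x_demo = torch.rand(5, d_in)   # hypothetical inputs: 5 tokens, 8 dimensions

# nn.Linear stores the weight transposed: (d_out, d_in)
print(layer.weight.shape)      # torch.Size([4, 8])

# Without a bias, the layer is a plain matrix multiplication
same_result = torch.allclose(layer(x_demo), x_demo @ layer.weight.T, atol=1e-6)
print(same_result)

# Default init draws from a uniform range that shrinks with d_in,
# unlike torch.rand, which always draws from [0, 1)
bound = 1 / math.sqrt(d_in)
within_bound = layer.weight.abs().max().item() <= bound
print(within_bound)
```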

In [22]:
class SelfAttention_v2(nn.Module):

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)

        context_vec = attn_weights @ values
        return context_vec

<div class="alert alert-block alert-success">

You can use SelfAttention_v2 in the same way as SelfAttention_v1:
    
</div>

In [23]:
torch.manual_seed(789)

inputs = torch.tensor(
  [[0.43, 0.15, 0.89, 0.17, 0.23, 0.19, 0.38, 0.44],  # The       (x^1)
   [0.55, 0.87, 0.66, 0.51, 0.49, 0.3, 0.2, 0.1],     # next      (x^2)
   [0.57, 0.85, 0.64, 0.8, 0.1, 0.4, 0.21, 0.39],     # day       (x^3)
   [0.22, 0.58, 0.33, 0.4, 0.4, 0.4, 0.1, 0.3],       # is        (x^4)
   [0.77, 0.25, 0.10, 0.1, 0.9, 0.3, 0.3, 0.2]]       # bright    (x^5)
)

d_in = 8
d_out = 4

sa_v2 = SelfAttention_v2(d_in, d_out)
print(sa_v2(inputs))

tensor([[ 0.0174,  0.0553, -0.1093,  0.1026],
        [ 0.0175,  0.0556, -0.1089,  0.1024],
        [ 0.0175,  0.0559, -0.1087,  0.1022],
        [ 0.0179,  0.0544, -0.1091,  0.1028],
        [ 0.0172,  0.0543, -0.1105,  0.1032]], grad_fn=<MmBackward0>)


<div class="alert alert-block alert-info">

Note that SelfAttention_v1 and SelfAttention_v2 give different outputs because they
use different initial weights for the weight matrices since nn.Linear uses a more
sophisticated weight initialization scheme.
    
</div>
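
To check that initialization is the only source of the difference, the sketch below runs the same attention computation twice: once through `nn.Linear` layers and once through plain matrix products that reuse those layers' weights (transposed). The helper `attention` and the input `x_demo` are hypothetical, written for this illustration rather than taken from the notebook.

```python
import torch
import torch.nn as nn

def attention(queries, keys, values):
    """Scaled dot-product attention, as in the forward methods above."""
    scores = queries @ keys.T
    weights = torch.softmax(scores / keys.shape[-1] ** 0.5, dim=-1)
    return weights @ values

torch.manual_seed(789)
d_in, d_out = 8, 4
x_demo = torch.rand(5, d_in)   # hypothetical inputs: 5 tokens, 8 dimensions

# v2-style projections
W_q = nn.Linear(d_in, d_out, bias=False)
W_k = nn.Linear(d_in, d_out, bias=False)
W_v = nn.Linear(d_in, d_out, bias=False)
out_linear = attention(W_q(x_demo), W_k(x_demo), W_v(x_demo))

# v1-style projections using the same weights, transposed to (d_in, d_out)
out_matmul = attention(
    x_demo @ W_q.weight.T, x_demo @ W_k.weight.T, x_demo @ W_v.weight.T
)

print(torch.allclose(out_linear, out_matmul, atol=1e-6))
```

With identical weights the two formulations agree, so the differing outputs above come from how the weights are initialized and not from the attention computation itself.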