# Chapter 2: Building a Production-Ready Attention Module

<div class="alert alert-block alert-success">
Set up our environment with the necessary imports.
</div>

In [37]:
import torch
import numpy as np
import torch.nn as nn

## 2.1 Introducing Trainable Weights (Wq, Wk, Wv)

<div class="alert alert-block alert-success">
Let's use the sample sentence we used in the previous chapter.
</div>

In [3]:
# Our sample input sentence as embedding vectors
inputs = torch.tensor(
    [[ 0.8938,  0.9003,  0.8978], # Your
     [ 0.7165,  0.3428,  0.2553], # journey
     [ 0.1042,  0.5163,  0.3753], # starts
     [ 0.0445,  0.3091,  0.9763], # with
     [ 0.1554,  0.1614,  0.2700], # one
     [ 0.8089,  0.9435,  0.5480]] # step
)

# Corresponding words
words = ['Your', 'journey', 'starts', 'with', 'one', 'step']

<div class="alert alert-block alert-success">

To make our attention mechanism more powerful and production-ready, we now introduce three dedicated, trainable **weight matrices**:

* **`W_query` (Wq)**
* **`W_key` (Wk)**
* **`W_value` (Wv)**

The purpose of these matrices is to **project** our input embeddings into three separate, specialized vectors. For each input token `x`, we will now calculate:

1.  A **query vector `q`** (calculated as `x @ W_query`): This vector is optimized for asking the right "question" to find relevant keys.
2.  A **key vector `k`** (calculated as `x @ W_key`): This vector is optimized to be effectively "found" by relevant queries.
3.  A **value vector `v`** (calculated as `x @ W_value`): This vector contains the rich information that the token will contribute to the final output.

Crucially, these matrices are **trainable parameters**. The model will learn the optimal values for these matrices during the training process, allowing it to master the complex art of understanding context in language.
</div>

<div class="alert alert-block alert-info">
    
To see how this projection works in practice, let's focus on a single input token and define the dimensions for our weight matrices. For this hands-on example, we will:

1.  Select the second input token ("journey") to be the **query** we analyze.
2.  Get its embedding dimension from the input tensor (`d_in`).
3.  Define a smaller output dimension (`d_out`) for the resulting query, key, and value vectors.
</div>

In [7]:
x_2 = inputs[1]
d_in = inputs.shape[1]
d_out = 2

<div class="alert alert-block alert-info">
    
Note that in GPT-like models, the input and output dimensions are usually the same.

For illustration purposes, however, we use a smaller output dimension here to make the matrix operations easier to follow.
</div>

<div class="alert alert-block alert-success">
Next, we initialize the three weight matrices Wq, Wk, and Wv.
</div>

In [14]:
torch.manual_seed(100)
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

In [13]:
print(W_query)

Parameter containing:
tensor([[0.1117, 0.8158],
        [0.2626, 0.4839],
        [0.6765, 0.7539]])


In [10]:
print(W_key)

Parameter containing:
tensor([[0.2627, 0.0428],
        [0.2080, 0.1180],
        [0.1217, 0.7356]])


In [11]:
print(W_value)

Parameter containing:
tensor([[0.7118, 0.7876],
        [0.4183, 0.9014],
        [0.9969, 0.7565]])


<div class="alert alert-block alert-info">
    
Note that we set `requires_grad=False` here only to reduce clutter in the printed outputs.

If we were to use these weight matrices for model training, we would set `requires_grad=True` so that they are updated during training.
</div>
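
As a minimal sketch of what `requires_grad=True` changes, the snippet below runs a toy loss through one trainable and one frozen weight matrix and inspects the resulting gradients. The names `W_trainable` and `W_frozen` and the loss itself are illustrative assumptions, not part of the chapter's code.

```python
import torch

torch.manual_seed(0)

# Hypothetical 3x2 weight matrices: one trainable, one frozen
W_trainable = torch.nn.Parameter(torch.rand(3, 2), requires_grad=True)
W_frozen = torch.nn.Parameter(torch.rand(3, 2), requires_grad=False)

x = torch.tensor([0.7165, 0.3428, 0.2553])  # embedding for "journey"

# Toy scalar loss: the sum of both projections
loss = (x @ W_trainable).sum() + (x @ W_frozen).sum()
loss.backward()

print(W_trainable.grad)  # a 3x2 gradient tensor
print(W_frozen.grad)     # None, because no gradient is tracked
```

Only the matrix created with `requires_grad=True` receives a gradient, which is what an optimizer would need to update it.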

<div class="alert alert-block alert-success">
Next, we compute the query, key, and value vectors as shown earlier.
</div>

In [17]:
query_2 = x_2 @ W_query
key_2 = x_2 @ W_key
value_2 = x_2 @ W_value

print(query_2)

tensor([0.3427, 0.9429])


<div class="alert alert-block alert-info">
    
As the output for the query shows, the result is a 2-dimensional vector.

This is because we set the number of columns of the corresponding weight matrix, via `d_out`, to 2.
</div>

<div class="alert alert-block alert-success">

Even though our temporary goal is to compute only the one context vector z(2), we still require the key and value vectors for all input elements.

This is because the keys are needed to compute the attention weights with respect to the query q(2), and the values are needed for the weighted sum that forms z(2).
</div>

<div class="alert alert-block alert-success">
We can obtain all keys and values via matrix multiplication:
</div>

In [20]:
keys = inputs @ W_key
values = inputs @ W_value
queries = inputs @ W_query

print("keys.shape:", keys.shape)

print("values.shape:", values.shape)

print("queries.shape:", queries.shape)

keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])
queries.shape: torch.Size([6, 2])


<div class="alert alert-block alert-info">
As the output shapes show, we successfully projected the 6 input tokens from a 3-dimensional onto a 2-dimensional embedding space.
</div>
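
The batched projection is just the per-token projection applied to every row at once. The sketch below checks this on random stand-in tensors; `inputs_demo` and `W_key_demo` are hypothetical names with the same shapes as `inputs` and `W_key`.

```python
import torch

torch.manual_seed(0)

# Illustrative stand-ins: 6 tokens with 3-dimensional embeddings, projected to 2 dimensions
inputs_demo = torch.rand(6, 3)
W_key_demo = torch.rand(3, 2)

# Batched projection: one matrix multiplication for all tokens
keys_batched = inputs_demo @ W_key_demo

# Per-token projection: one vector-matrix product per row
keys_looped = torch.stack([x_i @ W_key_demo for x_i in inputs_demo])

print(keys_batched.shape)                         # torch.Size([6, 2])
print(torch.allclose(keys_batched, keys_looped))  # True
```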

## 2.2 Scaling Attention Scores to create Attention Weights and Context Vectors

<div class="alert alert-block alert-success">
First, let's compute the attention score ω22.
</div>

In [22]:
keys_2 = keys[1]
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)

tensor(0.3438)


<div class="alert alert-block alert-success">
Again, we can generalize this computation to all attention scores via matrix multiplication:
</div>

In [23]:
attn_scores_2 = query_2 @ keys.T # All attention scores for given query
print(attn_scores_2)

tensor([0.9411, 0.3438, 0.3838, 0.7801, 0.2483, 0.6807])
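
To make the generalization concrete, the following sketch confirms that `query @ keys.T` produces the same values as computing one dot product per key. The tensors `query_demo` and `keys_demo` are random placeholders, assumed only for illustration.

```python
import torch

torch.manual_seed(0)

query_demo = torch.rand(2)    # one 2-dimensional query
keys_demo = torch.rand(6, 2)  # six 2-dimensional keys, one per token

# All scores at once via matrix multiplication
scores_matmul = query_demo @ keys_demo.T

# The same scores, one dot product per key
scores_looped = torch.stack([query_demo.dot(k_i) for k_i in keys_demo])

print(torch.allclose(scores_matmul, scores_looped))  # True
```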


<div class="alert alert-block alert-success">
    
We compute the attention weights by scaling the attention scores and applying the softmax function we used earlier.

The difference from earlier is that we now scale the attention scores by dividing them by the square root of the embedding dimension of the keys.
</div>
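
Written out in symbols, the scaling step for the second query looks as follows. This is a sketch in the chapter's notation, with $d_k$ denoting the key dimension and $\alpha_{2j}$ introduced here as an assumed symbol for the attention weights.

```latex
\omega_{2j} = q^{(2)} \cdot k^{(j)}, \qquad
\alpha_{2j} = \frac{\exp\left(\omega_{2j} / \sqrt{d_k}\right)}
                   {\sum_{t=1}^{T} \exp\left(\omega_{2t} / \sqrt{d_k}\right)}, \qquad
\sum_{j=1}^{T} \alpha_{2j} = 1
```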

In [26]:
d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print("Attention weights for the second input:", attn_weights_2)
print("Embedding dimension for the keys:", d_k)

Attention weights for the second input: tensor([0.2143, 0.1405, 0.1445, 0.1912, 0.1313, 0.1782])
Embedding dimension for the keys: 2


### Why divide by the square root of the embedding dimension?

<div class="alert alert-block alert-warning">

<b>Reason 1: For stability in learning</b>

The softmax function is sensitive to the magnitudes of its inputs. When the inputs are large, the differences between the exponential values of each input become much more pronounced. This causes the softmax output to become "peaky," where the highest value receives almost all the probability mass, and the rest receive very little.
</div>

In [28]:
# Define the tensor
tensor = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])

# Apply softmax without scaling
softmax_result = torch.softmax(tensor, dim=-1)
print("Softmax without scaling:", softmax_result)

# Multiply the tensor by 8 and then apply softmax
scaled_tensor = 8 * tensor
softmax_scaled_result = torch.softmax(scaled_tensor, dim=-1)
print("Softmax after scaling by 8:", softmax_scaled_result)

Softmax without scaling: tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])
Softmax after scaling by 8: tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])


<div class="alert alert-block alert-warning">
In attention mechanisms, particularly in transformers, the dot products between query and key vectors can become large (comparable to multiplying by 8 in this example). This results in a very sharp softmax distribution, making the model overly confident in one particular "key." Such sharp distributions can make learning unstable.
</div>
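
A quick way to see this trend is to sweep over several scale factors and track the largest softmax weight. The sketch below reuses the same five example scores; the list of scale factors is an arbitrary choice for illustration.

```python
import torch

scores_demo = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])

max_probs = []
for scale in [1, 2, 4, 8, 16]:
    probs = torch.softmax(scale * scores_demo, dim=-1)
    max_probs.append(probs.max().item())
    # The largest weight moves toward 1 as the inputs grow
    print(f"scale={scale:>2}: largest weight = {probs.max().item():.4f}")
```

The largest weight rises steadily with the scale factor, so dividing the scores by a suitable constant pulls the distribution back toward a softer shape.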

### But why the square root?

<div class="alert alert-block alert-warning">
    
<b>Reason 2: To make the variance of the dot product stable</b>

The dot product of q and k is a sum of products of random numbers, one product per dimension, and each product adds to the variance of the sum.

The variance therefore grows with the dimension.

Dividing by sqrt(dimension) keeps the variance close to 1.
    
</div>
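
The argument can be made precise under a simplifying assumption: the components $q_i$ and $k_i$ are independent, with mean 0 and variance 1 (as with standard normal draws). Real query and key vectors only approximate this, so the derivation is a sketch rather than a guarantee.

```latex
\mathbb{E}[q_i k_i] = \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0, \qquad
\operatorname{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1

\operatorname{Var}(q \cdot k)
  = \operatorname{Var}\!\left(\sum_{i=1}^{d} q_i k_i\right)
  = \sum_{i=1}^{d} \operatorname{Var}(q_i k_i) = d

\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d}}\right) = \frac{d}{d} = 1
```

Dividing by $d$ instead would shrink the variance to $1/d$, which is why the square root is the factor that keeps the scale constant across dimensions.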

In [31]:
# Function to compute variance before and after scaling
def compute_variances(dim, num_trials=1000):
    dot_products = []
    scaled_dot_products = []

    # Generate multiple random vectors and compute the products
    for _ in range(num_trials):
        q = np.random.randn(dim)
        k = np.random.randn(dim)

        # Compute dot product
        dot_product = np.dot(q, k)
        dot_products.append(dot_product)

        # Scale the dot product by sqrt(dim)
        scaled_dot_product = dot_product / np.sqrt(dim)
        scaled_dot_products.append(scaled_dot_product)

    # Calculate the variance of the dot products
    variance_before_scaling = np.var(dot_products)
    variance_after_scaling = np.var(scaled_dot_products)

    return variance_before_scaling, variance_after_scaling

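# Note: np.random is not seeded here (torch.manual_seed would not affect it),
# so the exact variances printed below differ slightly from run to run.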

# For dimension 5:
variance_before_scaling_5, variance_after_scaling_5 = compute_variances(dim=5)
print(f"Variance before scaling (dim=5): {variance_before_scaling_5}")
print(f"Variance after scaling (dim=5): {variance_after_scaling_5}")

# For dimension 20:
variance_before_scaling_20, variance_after_scaling_20 = compute_variances(dim=20)
print(f"Variance before scaling (dim=20): {variance_before_scaling_20}")
print(f"Variance after scaling (dim=20): {variance_after_scaling_20}")

Variance before scaling (dim=5): 5.045255667289148
Variance after scaling (dim=5): 1.0090511334578296
Variance before scaling (dim=20): 22.512670196518346
Variance after scaling (dim=20): 1.125633509825917


<div class="alert alert-block alert-success">
    
We now compute the context vector as a weighted sum over the value vectors. 

Here, the attention weights serve as a weighting factor that weighs the respective importance of each value vector. 

We can use matrix multiplication to obtain the output in one step:
</div>

In [34]:
context_vec_2 = attn_weights_2 @ values
print("Context vector for the second input:", context_vec_2)

Context vector for the second input: tensor([1.1783, 1.3425])
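
The matrix product is shorthand for an explicit weighted sum over the value vectors. The sketch below compares the two on small hand-picked tensors; `weights_demo` and `values_demo` are hypothetical values chosen so the arithmetic is easy to check by hand.

```python
import torch

# Hypothetical attention weights for one query over three tokens (they sum to 1)
weights_demo = torch.tensor([0.2, 0.5, 0.3])

# Hypothetical 2-dimensional value vectors, one row per token
values_demo = torch.tensor([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])

# Explicit weighted sum: add each value vector scaled by its weight
context_looped = torch.zeros(2)
for w_j, v_j in zip(weights_demo, values_demo):
    context_looped += w_j * v_j

# Same result in one step via matrix multiplication
context_matmul = weights_demo @ values_demo

print(context_matmul)  # tensor([0.5000, 0.8000])
```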


<div class="alert alert-block alert-success">
    
So far, we have computed only a single context vector, z(2).

In the next section, we will generalize the code to compute all context vectors in the input sequence, z(1) to z(T).
</div>

## 2.3 Implementing a Compact Self-Attention Python Class

<div class="alert alert-block alert-success">
    
In the previous sections, we have gone through a lot of steps to compute the self-attention outputs. 

This was mainly done for illustration purposes so we could go through one step at a time. 

In practice, with the LLM implementation in the next chapter in mind, it is helpful to organize this code into a Python class as follows:
</div>

In [38]:
class SelfAttention_v1(nn.Module):

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key   = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        queries = x @ self.W_query
        keys = x @ self.W_key
        values = x @ self.W_value

        attn_scores = queries @ keys.T  # omega
        # Scale by the key dimension read from the tensor, not a global d_out
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1]**0.5, dim=-1
        )
        context_vec = attn_weights @ values
        return context_vec
    

<div class="alert alert-block alert-warning">
In this PyTorch code, `SelfAttention_v1` is a class derived from `nn.Module`, a fundamental building block of PyTorch models that provides the functionality needed to create and manage model layers.
</div>
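
One piece of that management is parameter registration: any `nn.Parameter` assigned as an attribute is tracked by the module automatically. The sketch below illustrates this with `TinyProjection`, a hypothetical stripped-down module that is not part of the chapter's code.

```python
import torch
import torch.nn as nn

class TinyProjection(nn.Module):  # hypothetical module for illustration
    def __init__(self, d_in, d_out):
        super().__init__()
        # Assigning nn.Parameter attributes registers them with the module
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        return x @ self.W_query, x @ self.W_key

layer = TinyProjection(d_in=3, d_out=2)
param_names = [name for name, _ in layer.named_parameters()]
num_params = sum(p.numel() for p in layer.parameters())

print(param_names)  # ['W_query', 'W_key']
print(num_params)   # 12
```

Because the parameters are registered, an optimizer can be given `layer.parameters()` directly without listing each matrix by hand.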

<div class="alert alert-block alert-warning">

The `__init__` method initializes trainable weight matrices (`W_query`, `W_key`, and `W_value`) for queries, keys, and values, each transforming the input dimension `d_in` to an output dimension `d_out`.

During the forward pass, using the `forward` method, we compute the attention scores (`attn_scores`) by multiplying queries and keys, then scale these scores and normalize them with softmax.

Finally, we create a context vector by weighting the values with these normalized attention scores.
</div>
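
To follow the tensor shapes through these steps, here is a minimal sketch of the same computation outside a class. It uses random stand-in tensors (`x_demo`, `W_q`, `W_k`, `W_v` are assumed names) with 6 tokens, `d_in = 3`, and `d_out = 2`.

```python
import torch

torch.manual_seed(0)

x_demo = torch.rand(6, 3)                              # 6 tokens, d_in = 3
W_q, W_k, W_v = (torch.rand(3, 2) for _ in range(3))   # d_out = 2

queries = x_demo @ W_q   # (6, 2)
keys = x_demo @ W_k      # (6, 2)
values = x_demo @ W_v    # (6, 2)

attn_scores = queries @ keys.T  # (6, 6): one score per query-key pair
attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
context = attn_weights @ values  # (6, 2): one context vector per token

print(attn_scores.shape, context.shape)
print(attn_weights.sum(dim=-1))  # each row of weights sums to 1
```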

In [44]:
torch.manual_seed(100)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))

tensor([[1.2705, 1.4457],
        [1.1783, 1.3425],
        [1.1593, 1.3236],
        [1.1985, 1.3688],
        [1.1366, 1.2980],
        [1.2373, 1.4083]], grad_fn=<MmBackward0>)
