# Self Attention From First Principles: Single Headed


## Introduction

- Let's code up the self-attention operation for transformers using NumPy/PyTorch. We will do this for both single-head and multi-head attention; for multi-head, the computations remain largely the same.
- The attention operation relies on three matrices: Query (Q), Key (K), and Value (V).
- These contain a $d_k$-dimensional vector for each token in the sequence and are used to compute attention scores using the scaled dot-product attention formula.

## A. Single Headed Attention

### 0) How to go from Input of Length S to Attention Matrix:

1. Given a string with N words, "Hi my name is Royston" (N=5).
2. The tokenizer splits this into S tokens and represents each token by a token ID. We can think of each token ID as a one-hot encoded vector of size vocab_size.
3. Stacking these one-hot vectors gives the tokenized input matrix, with dimensions S x vocab_size.
4. The embedding layer takes each token and generates an embedding of size d_model. (The embedding layer is essentially a lookup table that maps a token ID to its embedding.)
5. Embedding layer weights (W_emb) = (vocab_size x d_model)
6. Therefore, going from the tokenized input to X (the input embedding matrix) is a matrix multiplication with W_emb: (S x vocab_size) x (vocab_size x d_model) --> (S x d_model)
7. Now, to create Q, K, and V, we compute Q = XWq; K = XWk; V = XWv, where Wq, Wk, and Wv are learned projection matrices.
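
The steps above can be sketched in a few lines of NumPy. This is only an illustration: the token IDs, the toy `vocab_size`, and the randomly initialised weights `W_emb`, `W_q`, `W_k`, `W_v` are all assumptions made up for the example, not values from a real tokenizer or a trained model. It also shows that multiplying a one-hot matrix by `W_emb` gives the same result as a plain row lookup.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, d_model = 10, 16            # toy sizes, chosen only for illustration
token_ids = np.array([3, 1, 7, 2, 5])   # hypothetical token IDs, so S = 5
S = len(token_ids)

# One-hot encode the token IDs: shape (S, vocab_size)
one_hot = np.eye(vocab_size)[token_ids]

# Randomly initialised embedding table: shape (vocab_size, d_model)
W_emb = rng.standard_normal((vocab_size, d_model))

# Matrix-multiply view and lookup-table view produce the same X: shape (S, d_model)
X = one_hot @ W_emb
X_lookup = W_emb[token_ids]

# Random projection weights; with a single head, d_k = d_model
d_k = d_model
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(X.shape, Q.shape, K.shape, V.shape)
```

In practice the lookup form is what embedding layers use, since it avoids materialising the S x vocab_size one-hot matrix.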

### 1) Create Attention Matrices Directly

In [None]:
import torch
import numpy as np

In [None]:
# np.random.randn(d0, d1, ..., dn) returns a sample (or samples) from the "standard normal" distribution.
# If positive int_like arguments are provided, randn generates an array of shape (d0, d1, ..., dn), filled with random floats sampled from a univariate "normal" (Gaussian) distribution of mean 0 and variance 1.
# A single float randomly sampled from the distribution is returned if no argument is provided.

seq_len, d_model = 4, 16
num_heads = 1
d_k = d_model//num_heads

# Generates a Matrix of seq_len x d_k. Here d_k = d_model
Q = np.random.randn(seq_len, d_k)

# Do the same thing for K & V values too
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)

In [None]:
print("Query Matrix:Shape S x d_k\n", Q)
print("\nKey Matrix: S x d_k\n", K)
print("\nValue Matrix: S x d_k\n", V)

Query Matrix:Shape S x d_k
 [[-0.15951622  0.98872276 -1.3099051   0.08709159  0.24916526  0.07197099
  -2.81090795  0.52460321  0.71404504  0.99795676  1.0005279  -0.10157494
   0.3856277  -0.06759229  0.1713508  -1.33010782]
 [ 0.27011345 -0.5443109   0.52075346 -0.4250707   0.96225342 -0.42046436
   0.24899346  0.9761435  -0.37189321 -0.03013046 -0.18577981  0.24505931
  -0.25772134 -1.24400364 -2.00156289  0.05747019]
 [ 0.3886474  -1.85040809 -0.58805571  2.22719094  0.76479855 -0.32956112
   1.90235074  0.36685287  0.49176445  0.74700256  0.25997698 -0.37644622
  -0.03636313  1.0521244  -1.72450375  0.06511079]
 [ 0.36060969 -0.07505919  0.04791978  1.2800908  -1.60290889  0.85907832
   0.8660755  -1.13681212  1.06663053 -0.0890357   0.72101036  0.21743189
  -0.19827866  1.0805751  -2.92718364 -1.94171346]]

Key Matrix: S x d_k
 [[-0.88037719 -1.01550339  0.03704356 -0.23753597 -1.09210438 -0.33763308
   0.73684182  1.08469059 -0.91820012  0.98528465  0.16655124  0.91075384
   0.

In [None]:
# Without parentheses this displays the method object itself; Q.var() returns the variance
Q.var

<function ndarray.var>

In [None]:
# Printing some stats on the distribution of values in the respective matrices
print(f"Q Stats --> Mean = {Q.mean()}, Var = {Q.var()}, Std_Dev = {Q.std()}")
print(f"K Stats --> Mean = {K.mean()}, Var = {K.var()}, Std_Dev = {K.std()}")
print(f"V Stats --> Mean = {V.mean()}, Var = {V.var()}, Std_Dev = {V.std()}")

Q Stats --> Mean = -0.014059075471320182, Var = 1.0161102167626466, Std_Dev = 1.0080229247207857
K Stats --> Mean = 0.12302292938779634, Var = 1.3184921068673563, Std_Dev = 1.1482561155366673
V Stats --> Mean = 0.09465508751880708, Var = 1.3269897497216636, Std_Dev = 1.151950411138285


### 2) Implement Attention Formula
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
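
A quick way to see why the formula divides by $\sqrt{d_k}$: if the entries of a query and a key are independent with mean 0 and variance 1 (as with the `randn` matrices above), their dot product has variance of roughly $d_k$, and dividing by $\sqrt{d_k}$ brings it back to roughly 1. The sketch below checks this empirically; the sample count and the value of `d_k` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64             # illustrative key dimension
n_samples = 20_000   # number of (q, k) pairs to sample

q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

raw = np.sum(q * k, axis=1)     # unscaled dot products, one per (q, k) pair
scaled = raw / np.sqrt(d_k)     # scaled as in the attention formula

print(f"Var(raw)    = {raw.var():.2f}  (expected to be close to d_k = {d_k})")
print(f"Var(scaled) = {scaled.var():.2f}  (expected to be close to 1)")
```

Keeping the scores at unit scale stops the softmax from saturating on a single token as `d_k` grows.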

In [None]:
qk_mat_scaled = np.matmul(Q,K.T)/np.sqrt(d_k)
qk_mat_scaled

array([[-1.34277989,  0.18069606,  0.67286823, -0.66086783],
       [ 1.12348479,  0.57402436,  0.43500528, -0.66116886],
       [ 0.88573775,  0.6847196 , -1.01119679,  1.39995509],
       [-0.47955637,  0.82402816, -1.03747975,  1.61700606]])

In [None]:
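# Hand-summing one column as a sanity check for axis = 0.
# Note: these literals do not appear in the matrix printed above; they appear to come from a different random draw.
col_sum= 0.02759229 -0.00779077 +0.92597297 -0.52447276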
print(col_sum)

0.42130173000000004


In [None]:
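# Hand-summing one row as a sanity check for axis = 1.
# Note: these literals do not appear in the matrix printed above; they appear to come from a different random draw.
row_sum = 0.02759229+ 1.3389972-0.96834916-0.17945724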
print(row_sum)

0.21878309000000015


In [None]:
qk_mat_scaled.shape
print(np.sum(qk_mat_scaled, axis = 0)) # Axis = 0, gives us col sums
print(np.sum(qk_mat_scaled, axis = 1)) # Axis = 1, gives us row sums

[ 0.18688627  2.26346818 -0.94080303  1.69492447]
[-1.15008343  1.47134557  1.95921564  0.9239981 ]


#### Mask (If Needed): Decoder Only

In [None]:
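# np.tril keeps the lower triangle (including the diagonal): 1 = position may be attended to.
allowed = np.tril(np.ones(shape=(seq_len, seq_len)))

# Build an additive mask: 0 where attention is allowed, -inf for future positions.
mask = np.where(allowed == 1, 0.0, -np.inf)

mask

array([[  0., -inf, -inf, -inf],
       [  0.,   0., -inf, -inf],
       [  0.,   0.,   0., -inf],
       [  0.,   0.,   0.,   0.]])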

In [None]:
def softmax(x):
  # Exponentiate every score
  exp_x = np.exp(x)
  # Normalize each row so that it sums to 1
  row_sums = np.sum(exp_x, axis = 1, keepdims=True)
  soft = exp_x/row_sums

  return soft
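
The softmax above exponentiates the raw scores directly, which is fine for the small values in this notebook but can overflow for large ones. A common remedy is to subtract each row's maximum before exponentiating; the result is mathematically unchanged because softmax is invariant to adding a constant to a row. The sketch below compares the two; `softmax_naive` and `softmax_stable` are names chosen here for illustration.

```python
import numpy as np

def softmax_naive(x):
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def softmax_stable(x):
    # Subtracting the row max leaves the result unchanged but keeps exp() bounded by 1
    shifted = x - np.max(x, axis=1, keepdims=True)
    exp_x = np.exp(shifted)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

small = np.array([[1.0, 2.0, 3.0]])
large = np.array([[1000.0, 1001.0, 1002.0]])  # same gaps between entries, shifted up

# Silence the overflow warnings so the nan result is easy to see
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)  # exp(1000) overflows to inf, and inf/inf is nan

print("naive on large :", naive_large)
print("stable on large:", softmax_stable(large))
print("stable on small:", softmax_stable(small))
```

One caveat: if an entire row is `-inf` (every position masked), the row max is also `-inf` and the stable version returns `nan` for that row. A causal mask never produces such a row, since each token can always attend to itself.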

In [None]:
# Running Attention Computation: Encoder Block
attention_scores = softmax(qk_mat_scaled)
print("Computed Attention Matrix:\n",attention_scores)
print("Validating Softmax operation. Row Sums= ", np.sum(attention_scores, axis = 1))

# Attention Output
attention_output = np.matmul(attention_scores, V)
print("\n\nAttention Output:\n", attention_output)

Computed Attention Matrix:
 [[0.06635087 0.30442748 0.49800248 0.13121917]
 [0.44494759 0.25685098 0.22351466 0.07468676]
 [0.27470607 0.22468143 0.04121355 0.45939896]
 [0.07466578 0.27495473 0.04273843 0.60764106]]
Validating Softmax operation. Row Sums=  [1. 1. 1. 1.]


Attention Output:
 [[-0.44333561  0.35745853 -0.1357813   0.47607709  0.65104129  0.55171797
  -1.3339197   0.67528231 -0.14134809  0.78857944  0.05522086 -0.80642508
  -0.3168178  -0.73605936 -0.10839245  0.8242018 ]
 [-0.10582103  0.33326035 -0.38016166  0.60875208  0.21610571  0.87606121
  -1.16414312  1.03591902 -0.19089832  0.52883401  0.28738281 -0.3055862
  -0.3830236  -0.04841962 -0.41378356 -0.40869053]
 [ 0.70342766  0.54793132  0.51715043  0.2643419   0.45036158  0.78027984
   0.29042886 -0.02952878  0.18457965  1.18447241 -0.11856502 -0.06120534
  -0.48033725  0.15388861 -0.35211578 -0.66780141]
 [ 0.90944541  0.59428003  0.87379705  0.15699441  0.59635156  0.66073211
   0.78959652 -0.45281084  0.53022222

In [None]:
# Apply mask if needed (for decoder)
masked_qk_mat_scaled = qk_mat_scaled + mask
masked_attention_weights = softmax(masked_qk_mat_scaled)
print("\nMasked softmax attention weights (for decoder):\n", masked_attention_weights)
print("Sum of each row after masked softmax:", np.sum(masked_attention_weights, axis=1))

# Attention Output
attention_output_decoder = np.matmul(masked_attention_weights, V)
print("\n\nAttention Output:\n", attention_output_decoder)


Masked softmax attention weights (for decoder):
 [[1.         0.         0.         0.        ]
 [0.6340104  0.3659896  0.         0.        ]
 [0.50814934 0.41561412 0.07623654 0.        ]
 [0.07466578 0.27495473 0.04273843 0.60764106]]
Sum of each row after masked softmax: [1. 1. 1. 1.]


Attention Output:
 [[ 1.16623198e-01  4.48318222e-01 -5.81880779e-01  5.50993163e-01
  -1.12075754e-03  1.19920417e+00 -1.24745337e+00  1.26222304e+00
  -1.13790285e+00 -6.00016089e-02  1.02396324e+00  5.31236972e-01
  -4.40120096e-01  4.12892696e-01 -1.18238525e+00 -2.12421852e+00]
 [ 1.37915743e-01  1.95957947e-01 -6.97944860e-01  8.78390003e-01
  -2.90419724e-01  1.13559395e+00 -1.02849683e+00  1.52379572e+00
   3.36710915e-01  4.78763205e-01  1.69131453e-01 -1.78439429e-01
  -4.28703393e-01  6.07326527e-01 -3.40216627e-01 -9.94100842e-01]
 [ 2.45350475e-02  1.69007318e-01 -6.81011443e-01  8.90176039e-01
  -2.17836221e-01  1.05075745e+00 -1.10377495e+00  1.49816372e+00
   4.81904171e-01  5.73678
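
One way to convince yourself that the mask does what it should: with a causal mask, changing a later token must not affect the outputs at earlier positions, and the first token can only attend to itself, so its output equals its own value vector. The sketch below checks both on random inputs; `causal_attention` is a helper written just for this check, not part of the notebook's code.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=1, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)

def causal_attention(Q, K, V):
    d_k = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Additive mask: 0 on and below the diagonal, -inf above it
    allowed = np.tril(np.ones(scores.shape, dtype=bool))
    mask = np.where(allowed, 0.0, -np.inf)
    return softmax(scores + mask) @ V

rng = np.random.default_rng(0)
seq_len, d_k = 4, 16
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))

out = causal_attention(Q, K, V)

# Perturb only the last token's key and value
K2, V2 = K.copy(), V.copy()
K2[-1] += 1.0
V2[-1] += 1.0
out2 = causal_attention(Q, K2, V2)

print("Earlier positions unchanged:", np.allclose(out[:-1], out2[:-1]))
print("First output equals V[0]:   ", np.allclose(out[0], V[0]))
```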

### 3) Modularizing: Creating Final Attention Functions/Class

#### Base Version: Creating Independent Functions

In [None]:
def softmax(x):
  # Exponentiate every score
  exp_x = np.exp(x)
  # Normalize each row so that it sums to 1
  row_sums = np.sum(exp_x, axis = 1, keepdims=True)
  soft = exp_x/row_sums

  return soft


def scaled_dot_product_attention(Q, K, V, d_k, mask=None):
  # Scaled similarity scores: shape (seq_len x seq_len)
  qk_final= np.matmul(Q, K.T)/np.sqrt(d_k)

  # Compare against None: the truth value of a multi-element NumPy array is ambiguous.
  # The mask is additive: 0 where attention is allowed, -inf where it is blocked.
  if mask is not None:
    qk_final = qk_final + mask

  # Attention Scores
  attention_scores = softmax(qk_final)

  # Return Output of Computation
  output = np.matmul(attention_scores, V)

  return output

#### Intermediate Level: Creating Class + Methods with Type Hinting

In [None]:
from typing import Optional
import numpy as np

class ScaledDotProductAttentions:
  def __init__(self, Q: np.ndarray, K: np.ndarray,V: np.ndarray, d_k: int) -> None:
    """
        Initializes the ScaledDotProductAttentions class.

        Args:
            Q (np.ndarray): The Query matrix.
            K (np.ndarray): The Key matrix.
            V (np.ndarray): The Value matrix.
            d_k (int): The dimension of the key vectors (used for scaling).
    """
    self.Q = Q
    self.K = K
    self.V = V
    self.d_k = d_k
    self.attention_scores = None
    self.attention_output = None

  def softmax(self, x: np.ndarray) -> np.ndarray:
    """
        Computes the softmax function along the rows of a matrix.

        Args:
            x (np.ndarray): The input matrix.

        Returns:
            np.ndarray: The matrix with softmax applied along each row.
        """
    # Create The exponent matrix
    exp_x = np.exp(x)
    # For large values of x, exp(x) can overflow. A more robust pattern will:
    # Subtract the maximum value from each row for numerical stability
    # exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))

    # Compute Row Sums of Exponent Matrix
    row_sums = np.sum(exp_x, axis = 1, keepdims=True)

    # Compute Softmax Matrix
    soft = exp_x/row_sums

    return soft


  def compute_single_head_attention(self, mask: Optional[np.ndarray] = None) -> np.ndarray:
    """
        Computes the scaled dot-product attention.

        Args:
            mask (Optional[np.ndarray]): An optional mask matrix to apply before softmax.
                                         Typically used for masking future tokens in decoders.

        Returns:
            np.ndarray: The attention output matrix.
    """
    qk_final = np.matmul(self.Q, self.K.T)/np.sqrt(self.d_k)

    # Compare against None: the truth value of a multi-element NumPy array is ambiguous.
    # The mask is additive: 0 where attention is allowed, -inf where it is blocked.
    if mask is not None:
      qk_final = qk_final + mask

    # Attention Scores
    self.attention_scores = self.softmax(qk_final)

    # Attention Output
    self.attention_output = np.matmul(self.attention_scores, self.V)

    return self.attention_output

#### Advanced Class: Pydantic + DataClass

In [None]:
import torch
import numpy as np
from typing import Optional
from pydantic import BaseModel, Field, ConfigDict, model_validator
# Import dataclasses and NpNDArray (if you want pydantic_numpy validation)
from dataclasses import dataclass
# from pydantic_numpy import NpNDArray


# Install pydantic, pydantic-numpy, if not installed
# pip install pydantic "pydantic-numpy>=1.2.0"

# 1. Define the data structure using a dataclass
@dataclass
class AttentionInputData:
    """
    Dataclass defining the structure and basic types of attention inputs.
    """
    Q: np.ndarray
    K: np.ndarray
    V: np.ndarray
    d_k: int

# 2. Define a Pydantic model that uses the dataclass
class AttentionInputs(BaseModel):
    """
    Pydantic model for validating the AttentionInputData dataclass.
    """
    # Pydantic can validate dataclasses directly
    data: AttentionInputData

    # Optional configuration for the model
    model_config = ConfigDict(arbitrary_types_allowed=True) # Still potentially needed for NumPy arrays within dataclass

    # You can still add validators that operate on the nested dataclass data
    @model_validator(mode='after')
    def check_matrix_shapes(self) -> 'AttentionInputs':
        """
        Validator to check that the shapes of Q, K, and V within the dataclass are compatible.
        """
        q_shape = self.data.Q.shape
        k_shape = self.data.K.shape
        v_shape = self.data.V.shape
        d_k_val = self.data.d_k # Access d_k from the dataclass

        # Basic shape checks for attention (seq_len, d_k)
        if q_shape[1] != d_k_val or k_shape[1] != d_k_val:
            # Use a more informative error message
            raise ValueError(f"Last dimension of Q ({q_shape[1]}) or K ({k_shape[1]}) must match d_k ({d_k_val}). Q shape: {q_shape}, K shape: {k_shape}")

        if q_shape[0] != k_shape[0] or q_shape[0] != v_shape[0]:
            raise ValueError(f"Sequence lengths of Q ({q_shape[0]}), K ({k_shape[0]}), and V ({v_shape[0]}) must match.")

        # You could add more checks, e.g., d_k > 0 here, or as a separate field validator
        if d_k_val <= 0:
            raise ValueError(f"d_k must be positive, but got {d_k_val}")

        return self

# You could also potentially use pydantic_numpy types directly in the dataclass
# and then just validate the dataclass itself if you want stronger type checking
# at the dataclass level and Pydantic to pick that up.

# @dataclass
# class AttentionInputDataStrict:
#      Q: NpNDArray[np.float64]
#      K: NpNDArray[np.float64]
#      V: NpNDArray[np.float64]
#      d_k: float

# class AttentionInputsStrict(BaseModel):
#      data: AttentionInputDataStrict
#      model_config = ConfigDict(arbitrary_types_allowed=True)


class ScaledDotProductAttentions:
    def __init__(self, Q: np.ndarray, K: np.ndarray, V: np.ndarray, d_k: float) -> None:
        """
        Initializes the ScaledDotProductAttentions class with Pydantic validation
        using a nested dataclass.

        Args:
            Q (np.ndarray): The Query matrix.
            K (np.ndarray): The Key matrix.
            V (np.ndarray): The Value matrix.
            d_k (float): The dimension of the key vectors (used for scaling).
        """
        # Create the dataclass instance first
        input_data = AttentionInputData(Q=Q, K=K, V=V, d_k=d_k)

        # Validate the dataclass instance using the Pydantic model
        validated_inputs = AttentionInputs(data=input_data)

        # Assign data from the validated dataclass instance
        self.Q = validated_inputs.data.Q
        self.K = validated_inputs.data.K
        self.V = validated_inputs.data.V
        self.d_k = validated_inputs.data.d_k

    def softmax(self, x: np.ndarray) -> np.ndarray:
        """
        Computes the softmax function along the rows of a matrix.
        """
        # Add numerical stability
        x = x - np.max(x, axis=1, keepdims=True)
        exp_x = np.exp(x)
        row_sums = np.sum(exp_x, axis=1, keepdims=True)
        soft = exp_x / row_sums
        return soft

    def compute(self, mask: Optional[np.ndarray] = None) -> np.ndarray:
        """
        Computes the scaled dot-product attention.
        """
        qk_mat_scaled = np.matmul(self.Q, self.K.T) / np.sqrt(self.d_k)

        if mask is not None:
            if mask.shape != qk_mat_scaled.shape:
                print("Warning: Provided mask shape does not match QK^T shape. Generating default tril mask.")
                seq_len = self.Q.shape[0]
                # Additive causal mask: 0 on and below the diagonal, -inf above it
                allowed = np.tril(np.ones(shape=(seq_len, seq_len)))
                mask = np.where(allowed == 1, 0.0, -np.inf)

            qk_mat_scaled = qk_mat_scaled + mask

        attention_mat = self.softmax(qk_mat_scaled)
        attention_output = np.matmul(attention_mat, self.V)
        return attention_output

# Example usage:

seq_len, d_model = 4, 6
num_heads = 1
d_k_val = float(d_model // num_heads)

Q_val = np.random.randn(seq_len, int(d_k_val)).astype(np.float64)
K_val = np.random.randn(seq_len, int(d_k_val)).astype(np.float64)
V_val = np.random.randn(seq_len, d_model).astype(np.float64) # V dimension can be different

try:
    # Pass the raw data to the class constructor
    attention_module = ScaledDotProductAttentions(Q_val, K_val, V_val, d_k_val)
    print("Attention module created successfully with valid inputs using dataclass.")
    attention_output = attention_module.compute()
    print("Attention Output:\n", attention_output)
except Exception as e:
    print(f"Error creating attention module: {e}")

# Example with invalid inputs (uncomment to test)
# try:
#     # Mismatched d_k
#     invalid_Q = np.random.randn(seq_len, 5).astype(np.float64)
#     invalid_K = np.random.randn(seq_len, 5).astype(np.float64)
#     invalid_V = np.random.randn(seq_len, d_model).astype(np.float64)
#     # Pydantic validation will catch this shape mismatch via the model_validator
#     attention_module_invalid_dk = ScaledDotProductAttentions(invalid_Q, invalid_K, invalid_V, d_k_val)
# except Exception as e:
#      print(f"\nSuccessfully caught error with invalid d_k via dataclass/Pydantic: {e}")

# try:
#     # Mismatched sequence length
#     invalid_Q_len = np.random.randn(3, int(d_k_val)).astype(np.float64)
#     invalid_K_len = np.random.randn(4, int(d_k_val)).astype(np.float64)
#     invalid_V_len = np.random.randn(4, d_model).astype(np.float64)
#     # Pydantic validation will catch this shape mismatch via the model_validator
#     attention_module_invalid_len = ScaledDotProductAttentions(invalid_Q_len, invalid_K_len, invalid_V_len, d_k_val)
# except Exception as e:
#      print(f"\nSuccessfully caught error with invalid sequence length via dataclass/Pydantic: {e}")