# Self-Attention
Attention is a mechanism that lets an LLM focus on different parts of the input sequence when processing each token. In a transformer, self-attention computes, for each token, a weighted average of the value vectors of all tokens in the sequence (including the token itself), letting the model "attend" to the most relevant parts.
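
In matrix form, the single-head computation walked through in this section can be summarized as follows. Here $X$ is the matrix of input token embeddings, $W_q$, $W_k$, $W_v$ are the projection matrices, and $d$ is the embedding dimension; the softmax is applied row by row. This is a compact restatement of the standard scaled dot-product formulation, without masking or multiple heads.

$$
Q = X W_q, \qquad K = X W_k, \qquad V = X W_v
$$

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
$$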

In [1]:
import torch
import torch.nn.functional as F

d = 4  # embedding dimension

x = torch.rand(10, d)  # input sequence: 10 tokens, shape (tokens, embedding_dim)
print('Sample tokens:\n', x)

# Linear projections for queries, keys, values
# These weight matrices learn to transform input embeddings into Q, K, V representations
W_q = torch.rand(d, d)  # Query projection - "what am I looking for?"
W_k = torch.rand(d, d)  # Key projection - "what do I contain?"
W_v = torch.rand(d, d)  # Value projection - "what information do I actually provide?"

Q = x @ W_q   # Queries: each token asks "what should I attend to?"
K = x @ W_k   # Keys: each token says "here's what I represent"
V = x @ W_v   # Values: each token says "here's the info I contribute"

# Compute attention scores - how much should each token attend to every other token?
# Dividing by sqrt(d) keeps the dot products from growing with the dimension
scores = Q @ K.T / (d ** 0.5)  # (10, 10), scaled dot-product scores

# Apply softmax to get attention weights - normalize scores into probabilities
# Each row sums to 1, representing the attention distribution for that token
weights = F.softmax(scores, dim=-1)  # (10, 10)

# Weighted sum of values - final attended representation
# For each token: mix all value vectors, weighted by that token's attention weights
attended = weights @ V  # (10, d)
# This is the core of self-attention: tokens can look at and incorporate
# information from every token in the sequence, including themselves

print("\nAttention output:\n", attended)

Sample tokens:
 tensor([[0.4049, 0.4418, 0.0380, 0.5829],
        [0.3825, 0.6892, 0.9744, 0.7100],
        [0.8483, 0.2261, 0.2593, 0.4272],
        [0.7662, 0.3433, 0.2613, 0.9817],
        [0.0059, 0.3611, 0.2559, 0.2645],
        [0.9035, 0.3111, 0.7183, 0.4819],
        [0.5543, 0.2404, 0.0581, 0.7029],
        [0.2210, 0.2754, 0.0144, 0.5870],
        [0.4300, 0.4448, 0.2364, 0.0915],
        [0.3387, 0.7015, 0.5927, 0.7368]])

Attention output:
 tensor([[0.9638, 1.2040, 1.1158, 1.1066],
        [1.0288, 1.3384, 1.2095, 1.1759],
        [1.0000, 1.2749, 1.1644, 1.1442],
        [1.0281, 1.3275, 1.2006, 1.1725],
        [0.9073, 1.1115, 1.0516, 1.0514],
        [1.0310, 1.3432, 1.2114, 1.1779],
        [0.9816, 1.2357, 1.1377, 1.1244],
        [0.9375, 1.1579, 1.0838, 1.0802],
        [0.9414, 1.1701, 1.0922, 1.0855],
        [1.0097, 1.2944, 1.1792, 1.1547]])
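
As a sanity check on the steps above, the same computation can be wrapped in a small function and compared against PyTorch's built-in `F.scaled_dot_product_attention`. This is only a sketch: the helper name `self_attention` and the fixed seed are illustrative choices, and the built-in call assumes PyTorch 2.0 or newer. Because of the seed, the tensors will not match the output printed above.

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (no mask, no batch dim).

    Hypothetical helper for illustration; returns (output, attention weights).
    """
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / (Q.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

torch.manual_seed(0)  # assumption: fixed seed so repeated runs give the same tensors
d = 4
x = torch.rand(10, d)
W_q, W_k, W_v = torch.rand(d, d), torch.rand(d, d), torch.rand(d, d)

attended, weights = self_attention(x, W_q, W_k, W_v)

# Every row of the weight matrix should be a probability distribution over the 10 tokens
rows_ok = torch.allclose(weights.sum(dim=-1), torch.ones(10), atol=1e-6)

# Compare with the built-in (assumes PyTorch >= 2.0); [None, None] adds batch and head dims
ref = F.scaled_dot_product_attention(
    (x @ W_q)[None, None], (x @ W_k)[None, None], (x @ W_v)[None, None]
)[0, 0]
matches_builtin = torch.allclose(attended, ref, atol=1e-5)

print("rows sum to 1:", rows_ok)
print("matches built-in:", matches_builtin)
```

The built-in uses the same default scaling of `1 / sqrt(d)`, which is why the two results should agree up to floating-point tolerance.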


# Cross-Attention
Cross-attention is used in encoder-decoder models such as the original Transformer, where the decoder attends to a different sequence (e.g., the encoder outputs) rather than to its own tokens. The queries come from the decoder side, while the keys and values come from the other sequence. This allows the decoder to focus on relevant parts of the input sequence while generating output.
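
A minimal sketch of cross-attention, mirroring the single-head self-attention example: the only change is that queries are projected from one sequence while keys and values are projected from another. The names `enc_out` and `dec_in`, the sequence lengths, the fixed seed, and the random weights are illustrative assumptions, not part of any particular model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)  # assumption: fixed seed for repeatable tensors
d = 4  # embedding dimension

enc_out = torch.rand(6, d)  # (source_tokens, d) - stand-in for encoder outputs
dec_in = torch.rand(3, d)   # (target_tokens, d) - stand-in for decoder states

W_q = torch.rand(d, d)  # Query projection, applied to the decoder side
W_k = torch.rand(d, d)  # Key projection, applied to the encoder side
W_v = torch.rand(d, d)  # Value projection, applied to the encoder side

Q = dec_in @ W_q    # (3, d): one query per target token
K = enc_out @ W_k   # (6, d): one key per source token
V = enc_out @ W_v   # (6, d): one value per source token

# Each target token scores every source token
scores = Q @ K.T / (d ** 0.5)        # (3, 6)
weights = F.softmax(scores, dim=-1)  # (3, 6), each row sums to 1

# Each target token receives a weighted mix of source value vectors
attended = weights @ V  # (3, d)

print("weights shape:", tuple(weights.shape))
print("output shape:", tuple(attended.shape))
```

Note that the weight matrix is no longer square: it has one row per target token and one column per source token, so the two sequences may have different lengths.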