# Self-Attention

Self-attention is the core mechanism that lets a language model relate each word in a text to the other words around it.

1. Each word in a sentence is converted to a vector (an embedding)
2. For every pair of words, the model computes a score for how much attention one word should pay to the other
3. These scores form an "attention matrix" with one row per word
4. Each word's vector is updated as a weighted combination of the words it attends to

For example, in "The cat sat on the mat", when processing "sat", the model can attend strongly to "cat" to work out who is sitting. These connections are learned during training rather than written by hand.
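The four steps above can be sketched in plain Python. The two-dimensional embeddings are made-up numbers chosen for illustration, and the sketch leaves out the learned key, query, and value projections, so it shows only the shape of the computation, not a trained model.

```python
import math

# Hypothetical 2-d embeddings for three tokens (made-up numbers).
tokens = ["the", "cat", "sat"]
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Steps 2 and 3: score every pair of tokens, then normalise each row.
attention = [softmax([dot(q, k) for k in emb]) for q in emb]

# Step 4: each token becomes a weighted average of all token vectors.
updated = [
    [sum(w * vec[d] for w, vec in zip(row, emb)) for d in range(2)]
    for row in attention
]

for tok, row in zip(tokens, attention):
    print(tok, [round(w, 3) for w in row])
```

Each printed row is one row of the attention matrix: the weights one token places on every token in the sentence, itself included.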


In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [3]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16  # hyperparameter

# every single token emits two vectors, one for a key and one for a query
key = nn.Linear(C, head_size, bias=False)    # key: what do I contain?
query = nn.Linear(C, head_size, bias=False)  # query: what am I looking for?
# each token's query is dot-producted with the keys of all tokens;
# if a key and a query are aligned, their dot product is a very high value
value = nn.Linear(C, head_size, bias=False)  # value: what do I pass on when attended to?


k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)

# transpose the last two dimensions of k: (B, T, 16) @ (B, 16, T) ---> (B, T, T)
# for each of the B batch elements we get a T x T matrix of affinities
wei = q @ k.transpose(-2, -1)

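# causal mask: a token may attend only to itself and to earlier tokens
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))  # -inf becomes weight 0 after softmax
wei = F.softmax(wei, dim=-1)  # each row becomes weights that sum to 1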

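# v is what a token communicates when it is attended to; aggregating v instead of
# the raw x lets the head learn which information to pass on
v = value(x)   # (B, T, 16)
out = wei @ v  # (B, T, T) @ (B, T, 16) ---> (B, T, 16)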

out.shape

torch.Size([4, 8, 16])
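As a quick sanity check on the cell above, the sketch below rebuilds the masked weights and confirms two properties: every row sums to 1, and no token puts any weight on a later position. It uses random scores as a stand-in for `q @ k.transpose(-2, -1)`, an assumption made only to keep the example short.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 8
scores = torch.randn(T, T)  # stand-in for q @ k.transpose(-2, -1)

tril = torch.tril(torch.ones(T, T))
wei = scores.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

# Each row is a set of weights over positions 0..t that sums to 1.
rows_sum_to_one = torch.allclose(wei.sum(dim=-1), torch.ones(T))
# Total weight placed above the diagonal, i.e. on future positions.
future_weight = torch.triu(wei, diagonal=1).abs().sum().item()

print(rows_sum_to_one, future_weight)
```

The first row of `wei` always puts all of its weight on position 0, since the first token has only itself to attend to.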