# Day 03: Self-Attention and Causal Masking

## Why this day matters
If Day 1 showed how language models predict tokens, and Day 2 showed why recurrence fails,
Day 3 introduces the core breakthrough behind modern LLMs: **self-attention**.

This notebook implements self-attention from scratch and demonstrates why **causal masking**
is essential for autoregressive language modeling.

## What is implemented
- Single-head self-attention from first principles
- Attention-based language model (no RNNs)
- Visualization-ready attention weights
- Causal masking to prevent information leakage
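
As a rough illustration of the last item, the sketch below applies a lower-triangular mask to a small matrix of random scores. The sequence length `T = 4` and the random `scores` are assumptions made for the example; in the notebook the scores come from the query and key projections.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

T = 4                                   # toy sequence length (assumption)
scores = torch.randn(T, T)              # stand-in for the real attention scores

# Lower-triangular mask: position t may attend to positions 0..t only.
mask = torch.tril(torch.ones(T, T))
masked = scores.masked_fill(mask == 0, float("-inf"))

# Softmax maps -inf to a weight of exactly 0, so future tokens get no attention.
weights = F.softmax(masked, dim=-1)
print(weights)
```

Each row still sums to 1, but everything above the diagonal is zero, so the first token can only attend to itself.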

## Key question
Why should a model remember everything sequentially when it can directly attend to relevant tokens?



In [13]:
# build the raw dataset
text = """ In the beginning the universe was created. This has made a lot of people very angry and been widely regarded as bad move"""
print(text)

 In the beginning the universe was created. This has made a lot of people very angry and been widely regarded as bad move


In [14]:
# character-level tokenization: the vocabulary is the set of unique characters
chars = sorted(list(set(text)))
vocab_size = len(chars)

print("Characters:", chars)
print("Vocab Size:", vocab_size)



Characters: [' ', '.', 'I', 'T', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'y']
Vocab Size: 25


In [15]:
# build mapping
stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}

encode = lambda s:[stoi[c] for c in s]
decode = lambda l:"".join([itos[i] for i in l])

print(encode("the"))
print(decode(encode("the")))


[20, 11, 8]
the


In [16]:
import torch
data = torch.tensor(encode(text),dtype=torch.long)
print(data[:20])
print("Total tokens:",len(data))

block_size = 8  # context length
batch_size = 4


tensor([ 0,  2, 15,  0, 20, 11,  8,  0,  5,  8, 10, 12, 15, 15, 12, 15, 10,  0,
        20, 11])
Total tokens: 121


In [17]:
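def get_batch():
    # Sample batch_size random start offsets, leaving room for the shifted target
    ix = torch.randint(len(data) - block_size, (batch_size,))
    # Inputs: block_size consecutive tokens starting at each offset
    x = torch.stack([data[i:i + block_size] for i in ix])
    # Targets: the same windows shifted one token to the right
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y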

x,y = get_batch()
print(x)
print(y)

tensor([[10,  4, 18,  7,  8,  7,  0,  4],
        [ 4,  7,  8,  0,  4,  0, 13, 16],
        [ 7,  1,  0,  3, 11, 12, 19,  0],
        [18, 24,  0,  4, 15, 10, 18, 24]])
tensor([[ 4, 18,  7,  8,  7,  0,  4, 19],
        [ 7,  8,  0,  4,  0, 13, 16, 20],
        [ 1,  0,  3, 11, 12, 19,  0, 11],
        [24,  0,  4, 15, 10, 18, 24,  0]])
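
The relationship between `x` and `y` above is easier to see on a toy sequence. The sketch below is not part of the original run: `toy_data`, `toy_block_size`, and the offset `i = 5` are assumptions chosen so that the token ids are easy to read.

```python
import torch

# Toy stand-in for the encoded text: token ids 0..19 (assumption, not the real data).
toy_data = torch.arange(20)
toy_block_size = 8

i = 5                                           # arbitrary start offset
x = toy_data[i : i + toy_block_size]            # inputs
y = toy_data[i + 1 : i + toy_block_size + 1]    # targets, shifted by one

# Each prefix x[:t+1] is a context whose target is y[t].
for t in range(toy_block_size):
    print(f"context {x[: t + 1].tolist()} -> target {y[t].item()}")
```

One window of length 8 therefore yields 8 training examples, one per prefix, which is why the model needs a causal mask to keep each prefix from seeing its own target.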


In [18]:
import torch
import torch.nn as nn
import torch.nn.functional as F


In [19]:
# Single-head self-attention (from scratch)

class SelfAttention(nn.Module):
    def __init__(self, embed_size, block_size):
        super().__init__()
        self.key = nn.Linear(embed_size, embed_size, bias=False)
        self.query = nn.Linear(embed_size, embed_size, bias=False)
        self.value = nn.Linear(embed_size, embed_size, bias=False)

        self.register_buffer(
            "tril", torch.tril(torch.ones(block_size, block_size))
        )

    def forward(self, x):
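        B, T, C = x.shape  # batch, sequence length, embedding size

        K = self.key(x)    # (B, T, C)
        Q = self.query(x)  # (B, T, C)
        V = self.value(x)  # (B, T, C)

        # Scaled dot-product scores between every pair of positions: (B, T, T)
        weights = Q @ K.transpose(-2, -1) / (C ** 0.5)
        # Causal mask: block attention to future positions (upper triangle)
        weights = weights.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        # Normalize each row into a distribution over the visible positions
        weights = F.softmax(weights, dim=-1)

        out = weights @ V  # weighted sum of values: (B, T, C)
        return out, weights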
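
The division by `C ** 0.5` in `forward` keeps the scores in a range where softmax does not saturate. The sketch below illustrates the idea under a simplifying assumption: the components of `q` and `k` are independent with unit variance, which the learned projections only approximate. The sample count of 10,000 is arbitrary.

```python
import torch

torch.manual_seed(0)

C = 64                                  # matches the embed_size default used in this notebook
q = torch.randn(10_000, C)              # toy queries (assumption: unit-variance entries)
k = torch.randn(10_000, C)              # toy keys

raw = (q * k).sum(dim=-1)               # unscaled dot products
scaled = raw / (C ** 0.5)               # same scaling as in forward()

# The unscaled spread grows like sqrt(C); the scaled spread stays near 1.
print(raw.std().item(), scaled.std().item())
```

Without the scaling, larger embedding sizes would push the scores further apart and make the softmax output close to one-hot, which tends to shrink gradients.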



In [20]:
class AttentionLanguageModel(nn.Module):
    def __init__(self, vocab_size, embed_size=64, block_size=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.attn = SelfAttention(embed_size, block_size)
        self.fc = nn.Linear(embed_size, vocab_size)

    def forward(self, x, targets=None):
        x = self.embed(x)
        attn_out, weights = self.attn(x)
        logits = self.fc(attn_out)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss


In [21]:
model = AttentionLanguageModel(vocab_size, block_size=block_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(5000):
    xb, yb = get_batch()
    logits, loss = model(xb, yb)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step % 500 == 0:
        print(f"Step {step} | Loss {loss.item():.4f}")


Step 0 | Loss 3.2437
Step 500 | Loss 1.2366
Step 1000 | Loss 1.0398
Step 1500 | Loss 0.9294
Step 2000 | Loss 0.8060
Step 2500 | Loss 0.9295
Step 3000 | Loss 0.8717
Step 3500 | Loss 1.0302
Step 4000 | Loss 0.5897
Step 4500 | Loss 0.6894
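
A quick sanity check on the starting loss: a model that guesses uniformly over the 25 characters has a cross-entropy of ln(25), which is close to the 3.2437 logged at step 0. The sketch below computes that baseline in two ways; the all-zero `logits` and the `target` index are placeholders chosen for the example.

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 25                         # from the tokenization cell above
uniform_loss = math.log(vocab_size)     # cross-entropy of a uniform guess

# Same number via F.cross_entropy: all-zero logits give a uniform softmax.
logits = torch.zeros(1, vocab_size)
target = torch.tensor([0])              # any class index gives the same loss here
print(uniform_loss, F.cross_entropy(logits, target).item())
```

Both values come out near 3.22, so the untrained model starts at roughly chance level, and the later losses reflect structure it has actually learned.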


In [23]:
def generate_attn(model, start_char, max_new_tokens=100):
    idx = torch.tensor([[stoi[start_char]]])

    for _ in range(max_new_tokens):
        # Crop idx to the last block_size tokens if its length exceeds block_size
        # This ensures that the input to the model does not exceed the context window
        block_size_attn = model.attn.tril.size(0) # Get the block_size from the attention module
        idx_cond = idx if idx.size(1) <= block_size_attn else idx[:, -block_size_attn:]

        logits, _ = model(idx_cond)
        logits = logits[:, -1, :]
        probs = F.softmax(logits, dim=-1)
        next_idx = torch.multinomial(probs, 1)
        idx = torch.cat([idx, next_idx], dim=1)

    return decode(idx[0].tolist())

print(generate_attn(model, 'I'))

Ininng t.t.T his hade ve ungrarded as a was was creas bangd win t.Thide haden begind win this wis mad


## Observations

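- Attention produces more stable training than the recurrent model from Day 2; here the loss falls from 3.24 at step 0 to between 0.59 and 1.03 over steps 3000 to 4500, with batch-to-batch noise
- Every position can attend directly to all earlier positions in the context window in one step, rather than passing information through a recurrent state
- Without causal masking, each position can see the token it is asked to predict, so training loss is misleadingly low and generation degenerates into repetition
- With masking, training matches the left-to-right setting used at generation time, and text quality improves significantly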
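
The leakage behind the masking observations can be measured directly. The sketch below is a simplified stand-in for the notebook's attention layer: `toy_attention` is a hypothetical helper that uses identity projections (Q = K = V = x), and the tensor sizes are arbitrary. It changes only the tokens after position 0 and measures how far the output at position 0 moves.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def toy_attention(x, causal):
    # Hypothetical helper: single-head attention with Q = K = V = x.
    T, C = x.shape
    scores = x @ x.T / (C ** 0.5)
    if causal:
        mask = torch.tril(torch.ones(T, T))
        scores = scores.masked_fill(mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1) @ x

x1 = torch.randn(6, 8)
x2 = x1.clone()
x2[1:] = torch.randn(5, 8)              # rewrite every token after position 0

# How much does the output at position 0 move when only the future changes?
leak_unmasked = (toy_attention(x1, False)[0] - toy_attention(x2, False)[0]).abs().max().item()
leak_masked = (toy_attention(x1, True)[0] - toy_attention(x2, True)[0]).abs().max().item()
print(f"unmasked: {leak_unmasked:.4f}  masked: {leak_masked:.4f}")
```

With the mask, the output at position 0 does not depend on later tokens at all. Without it, the output shifts whenever the future changes, which is the path by which a target token can leak into its own prediction.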

## Key Insight (Day 3)
Self-attention replaces memory with **dynamic relevance**.
Instead of remembering, the model looks back and decides what matters.
