### Exercise 1 — Tokenization & Vocabulary
- Create a simple tokenizer and vocabulary.
- Convert a sentence into integer token IDs.


In [None]:
# Fill in the blanks

PAD, UNK, CLS = "<pad>", "<unk>", "<cls>"
sentence = "I love artificial intelligence"

# Step 1: Split and lowercase
tokens = [CLS] + sentence.________().split()

# Step 2: Build a small vocabulary
vocab = sorted(set(tokens + [PAD, UNK]))
stoi = {w: i for i, w in enumerate(________)}
itos = {i: w for w, i in stoi.items()}

# Step 3: Encode the sentence
ids = [stoi.get(tok, stoi[________]) for tok in tokens]

print("Tokens:", tokens)
print("Token IDs:", ids)


**Reflect:**  
- Why do we add a `<cls>` token?  
- How would you handle unknown words in new sentences?
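Once sentences are encoded, they usually need a shared length before they can be stacked into a batch, which is what the `<pad>` token is for. The sketch below assumes a tiny hand-written vocabulary and two sentences that are already encoded; `toy_stoi`, `toy_itos`, `pad_batch`, and `decode` are hypothetical names used only for illustration.

```python
PAD, UNK, CLS = "<pad>", "<unk>", "<cls>"

# Hypothetical toy vocabulary (hand-written for illustration).
toy_stoi = {PAD: 0, UNK: 1, CLS: 2, "i": 3, "love": 4, "pizza": 5}
toy_itos = {i: w for w, i in toy_stoi.items()}

# Two already-encoded sentences; ID 1 marks words missing from the vocabulary.
batch = [
    [2, 3, 4, 5],        # "<cls> i love pizza"
    [2, 3, 1, 4, 1, 5],  # "<cls> i really love cold pizza"
]

def pad_batch(batch, pad_id):
    """Right-pad every ID list to the length of the longest one."""
    max_len = max(len(ids) for ids in batch)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch]

def decode(ids, itos):
    """Map IDs back to token strings."""
    return [itos[i] for i in ids]

padded = pad_batch(batch, toy_stoi[PAD])
for row in padded:
    print(row, decode(row, toy_itos))
```

Decoding the second row shows that the original words "really" and "cold" cannot be recovered: everything outside the vocabulary collapses into the same ID.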

---

### Exercise 2 — Positional Encoding
Create sinusoidal positional encodings to inject order information into token embeddings.


In [None]:
import math, torch

# Fill in the blanks

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / _______))
    pe[:, 0::2] = torch.sin(________ * div_term)
    pe[:, 1::2] = torch.cos(________ * div_term)
    return pe

pe = positional_encoding(10, 8)
print(pe[:2])


**Reflect:**  
- Why do we alternate sine and cosine values?  
- What benefit do sinusoidal encodings have compared to learned positional embeddings?
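One property worth checking by hand is that the dot product between two sinusoidal encodings depends only on how far apart the positions are, not on where they sit in the sequence. The dependency-free sketch below assumes the standard sine/cosine formulation with base 10000 and an even `d_model`; `sinusoidal_row` and `dot` are hypothetical helper names.

```python
import math

def sinusoidal_row(pos, d_model):
    """Encoding for one position: sine in even slots, cosine in odd slots."""
    row = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000.0 ** (i / d_model))
        row.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return row

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

d_model = 8
# Positions (2, 5) and (40, 43) are both 3 steps apart.
near = dot(sinusoidal_row(2, d_model), sinusoidal_row(5, d_model))
far = dot(sinusoidal_row(40, d_model), sinusoidal_row(43, d_model))
print(near, far)  # the two values agree up to floating-point error
```

This follows from the identity sin(a)sin(b) + cos(a)cos(b) = cos(a − b), applied to each sine/cosine pair.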

---


### Exercise 3 — Scaled Dot-Product Attention
Implement the attention mechanism manually: `softmax(QKᵀ / √dₖ) V`


In [None]:
import torch, math
torch.manual_seed(0)

# Fill in the blanks

Q = torch.randn(1, 3, 4)   # [batch, tokens, d_model]
K = torch.randn(1, 3, 4)
V = torch.randn(1, 3, 4)

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(______)
attn = torch.softmax(scores, dim=-1)
out = torch.matmul(attn, ________)

print("Attention weights:\n", attn)
print("Output:\n", out)


**Reflect:**  
- What would happen if we didn’t scale by √dₖ?  
- Which rows in the attention matrix correspond to which tokens?
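To get a feel for the scaling question, it can help to measure how large raw dot products become as the key dimension grows. The sketch below is a dependency-free illustration, assuming query and key entries drawn independently from a standard normal distribution; `softmax` and `std` are hand-rolled helpers, not library calls.

```python
import math
import random

random.seed(0)

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def std(xs):
    """Population standard deviation."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

d_k = 64
# Dot products of random unit-variance vectors: their spread grows like sqrt(d_k).
dots = [
    sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(d_k))
    for _ in range(2000)
]
print("std of raw scores:   ", round(std(dots), 2))
print("std of scaled scores:", round(std([s / math.sqrt(d_k) for s in dots]), 2))

# The same three scores, before and after dividing by sqrt(d_k) = 8.
raw = [8.0, 0.0, -8.0]
print(softmax(raw))
print(softmax([s / math.sqrt(d_k) for s in raw]))
```

With unscaled scores, the softmax puts nearly all its weight on one position, which tends to leave very small gradients for the others.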

---


### Exercise 4 — Multi-Head Attention
Simulate 2 attention heads by splitting Q, K, and V manually.


In [None]:
# Fill in the blanks

import torch, math
torch.manual_seed(0)

d_model = 8
num_heads = 2
d_head = d_model // num_heads

x = torch.randn(1, 3, d_model)

# Create projection matrices
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Split into heads
Q = Q.view(1, 3, num_heads, d_head).transpose(1, 2)
K = K.view(1, 3, num_heads, d_head).transpose(1, 2)
V = V.view(1, 3, num_heads, d_head).transpose(1, 2)

# Compute attention per head
attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(________)
attn = torch.softmax(attn_scores, dim=-1)
out = torch.matmul(attn, V)

print(out.shape)


**Reflect:**  
- Why do we use multiple heads instead of one?  
- What different relationships could each head learn?
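The `view(...).transpose(1, 2)` step is easy to misread, so here is a plain-Python sketch of what it does to a single sequence: each head receives a contiguous slice of every token's feature vector. `split_heads` and `merge_heads` are hypothetical helpers written with nested lists instead of tensors, and the batch dimension is left out for brevity.

```python
def split_heads(tokens, num_heads):
    """[seq_len][d_model] -> [num_heads][seq_len][d_head], slicing contiguous chunks."""
    d_model = len(tokens[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    return [
        [tok[h * d_head:(h + 1) * d_head] for tok in tokens]
        for h in range(num_heads)
    ]

def merge_heads(heads):
    """Inverse of split_heads: concatenate each token's chunks back together."""
    seq_len = len(heads[0])
    return [
        [value for head in heads for value in head[t]]
        for t in range(seq_len)
    ]

# Three tokens with d_model = 8, filled with distinct numbers so chunks are easy to trace.
tokens = [[10 * t + j for j in range(8)] for t in range(3)]
heads = split_heads(tokens, num_heads=2)
print(heads[0])  # first 4 features of every token
print(heads[1])  # last 4 features of every token
print(merge_heads(heads) == tokens)
```

In a full multi-head layer, the per-head outputs are merged back in this way before a final output projection; the exercise above stops at the per-head outputs.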

---


### Exercise 5 — Transformer Encoder Block
Implement a minimal Transformer encoder block using:  
1. Multi-Head Attention  
2. Feed-Forward Network  
3. Residual + Layer Normalization


In [None]:
import torch
import torch.nn as nn

# Fill in the blanks

class MiniEncoder(nn.Module):
    def __init__(self, d_model=8, d_ff=16):
        super().__init__()
        self.mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=2, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.mha(x, x, x)
        x = self.ln1(x + ________)
        ff_out = self.ff(x)
        return self.ln2(x + ________)

# Test block
x = torch.randn(1, 4, 8)
enc = MiniEncoder()
y = enc(x)
print(y.shape)


**Reflect:**  
- Why do we add residual connections?  
- What happens if LayerNorm is removed from the network?
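As a starting point for the LayerNorm question, the sketch below stacks six residual steps with and without normalization and compares the resulting magnitudes. It is a toy illustration in plain Python: `layer_norm` omits the learnable scale and shift (an assumption for brevity), and `sublayer` is a hypothetical stand-in for attention or the feed-forward network, not a real layer.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and near-unit variance.
    Learnable scale and shift are omitted (an assumption for brevity)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def sublayer(vec):
    """Hypothetical stand-in for attention or feed-forward: scales features unevenly."""
    return [3.0 * v if i % 2 == 0 else -0.5 * v for i, v in enumerate(vec)]

plain = normed = [1.0, 2.0, 3.0, 4.0]
for _ in range(6):
    # Residual step only.
    plain = [a + b for a, b in zip(plain, sublayer(plain))]
    # Residual step followed by normalization, as in the encoder block.
    normed = layer_norm([a + b for a, b in zip(normed, sublayer(normed))])

print("without LayerNorm:", plain)
print("with LayerNorm:   ", normed)
```

The unnormalized activations grow by orders of magnitude across layers, while the normalized ones stay in a fixed range. Real networks are less extreme than this toy sublayer, but the same drift in scale is one reason training tends to be less stable without normalization.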
