### 🧪 Exercise 3.1 — Comparing `SelfAttention_v1` and `SelfAttention_v2`

In this exercise, we compare two implementations of self-attention: `SelfAttention_v1`, which defines its weight matrices as raw `nn.Parameter` tensors, and `SelfAttention_v2`, which uses `nn.Linear` layers. `nn.Linear` stores its weight with shape `[out_features, in_features]` and computes `x @ weight.T`, so the weights must be transposed when they are transferred for both modules to produce identical outputs.

**Steps:**
- Create instances of both attention modules.
- Copy the weights from `SelfAttention_v2` to `SelfAttention_v1`, accounting for the transpose.
- Verify that both implementations produce the same output for the same input.
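
Before running the full comparison, the transpose convention can be checked on a single layer. The following is a minimal sketch, assuming the same sizes used in this exercise (`d_in = 4`, `d_k = 2`). The fixed seed and the variable names `linear`, `manual`, and `layer` are illustrative choices, not part of the exercise.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed only so the sketch is repeatable

d_in, d_k = 4, 2
linear = nn.Linear(d_in, d_k, bias=False)

# nn.Linear keeps its weight as [out_features, in_features]
weight_shape = tuple(linear.weight.shape)

x = torch.rand(3, d_in)

# The layer computes x @ weight.T, so a raw [d_in, d_k] parameter
# must hold the transposed weight to reproduce the same projection.
manual = x @ linear.weight.T
layer = linear(x)
same_projection = torch.allclose(manual, layer, atol=1e-6)

print("weight shape:", weight_shape)
print("manual projection matches layer:", same_projection)
```

This is why the copy step takes `.weight.T` rather than `.weight`: without the transpose the shapes `[d_k, d_in]` and `[d_in, d_k]` would not even match.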



In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# SelfAttention_v1 with manually defined weights
class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_k):
        super().__init__()
        self.query = nn.Parameter(torch.rand(d_in, d_k))
        self.key = nn.Parameter(torch.rand(d_in, d_k))
        self.value = nn.Parameter(torch.rand(d_in, d_k))

    def forward(self, x):
        Q = x @ self.query
        K = x @ self.key
        V = x @ self.value
        attn_scores = Q @ K.T / K.shape[-1]**0.5
        attn_weights = F.softmax(attn_scores, dim=-1)
        return attn_weights @ V

# SelfAttention_v2 using nn.Linear layers
class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_k):
        super().__init__()
        self.query = nn.Linear(d_in, d_k, bias=False)
        self.key = nn.Linear(d_in, d_k, bias=False)
        self.value = nn.Linear(d_in, d_k, bias=False)

    def forward(self, x):
        Q = self.query(x)
        K = self.key(x)
        V = self.value(x)
        attn_scores = Q @ K.T / K.shape[-1]**0.5
        attn_weights = F.softmax(attn_scores, dim=-1)
        return attn_weights @ V

# Set input and head size
d_in, d_k = 4, 2

# Instantiate both versions
sa2 = SelfAttention_v2(d_in, d_k)
sa1 = SelfAttention_v1(d_in, d_k)

# Copy weights (transpose because Linear stores weights as [out, in])
with torch.no_grad():
    sa1.query.copy_(sa2.query.weight.T)
    sa1.key.copy_(sa2.key.weight.T)
    sa1.value.copy_(sa2.value.weight.T)

# Check if outputs match
x = torch.rand(3, d_in)
out1 = sa1(x)
out2 = sa2(x)

print("Are outputs equal?", torch.allclose(out1, out2, atol=1e-6))



Are outputs equal? True


### 🧪 Exercise 3.2 — Returning 2-dimensional Embedding Vectors

This task requires configuring the `MultiHeadAttentionWrapper` so that its output embedding vectors are 2-dimensional. In the wrapper used here, `d_out` is the total output width and is split evenly across the heads. With the number of heads fixed at `num_heads = 2`, setting `d_out = 2` gives each head an output width of 1.

**Steps:**
- Use `SelfAttention_v2` for the heads.
- Set `d_out = 2` and `num_heads = 2` so each head outputs one value per token.
- Verify that the final output shape is `[num_tokens, 2]`.
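
The head split in the steps above can be sketched on its own. This is only an illustration of the shape arithmetic: the per-head outputs are random stand-in tensors (hypothetical, not real attention outputs), and `head_dim` is a name introduced here for `d_out // num_heads`.

```python
import torch

num_tokens, d_out, num_heads = 3, 2, 2

# Each head receives an equal share of the total output width
head_dim = d_out // num_heads

# Stand-ins for per-head attention outputs (random values, illustration only)
head_outputs = [torch.rand(num_tokens, head_dim) for _ in range(num_heads)]

# Concatenating along the last dimension restores the full width d_out
concat = torch.cat(head_outputs, dim=-1)
concat_shape = tuple(concat.shape)

print("per-head width:", head_dim)
print("concatenated shape:", concat_shape)
```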


In [2]:
class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, num_heads):
        super().__init__()
        assert d_out % num_heads == 0
        self.heads = nn.ModuleList([
            SelfAttention_v2(d_in, d_out // num_heads)
            for _ in range(num_heads)
        ])
        self.proj = nn.Linear(d_out, d_out)

    def forward(self, x):
        head_outputs = [head(x) for head in self.heads]
        concat = torch.cat(head_outputs, dim=-1)
        return self.proj(concat)

# Change d_out from 4 to 2 to make final output 2-dimensional
mha = MultiHeadAttentionWrapper(d_in=4, d_out=2, num_heads=2)

x = torch.rand(3, 4)
output = mha(x)

print("Output shape:", output.shape)  # Should be [3, 2]


Output shape: torch.Size([3, 2])


### 🧪 Exercise 3.3 — Initializing GPT-2 Size Attention Modules

In this exercise, we initialize a `MultiHeadAttentionWrapper` with the same configuration as the smallest GPT-2 model:

- Embedding size: 768
- Number of heads: 12

**Steps:**
- Set both `d_in` and `d_out` to 768.
- Set `num_heads = 12` (each head will handle 64 dimensions).
- Test the module with a dummy input and confirm the output shape is `[num_tokens, 768]`.
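
The per-head width and the resulting parameter count can be worked out with plain arithmetic. The sketch below assumes the wrapper exactly as defined in this notebook: bias-free query, key, and value projections in each head, plus one output projection with a bias. It describes this notebook's module only and is not a statement about the parameter count of GPT-2's actual attention layers.

```python
d_in = d_out = 768
num_heads = 12

# Each head handles an equal slice of the output width
head_dim = d_out // num_heads

# Query, key, and value projections per head (no bias terms)
qkv_params_per_head = 3 * d_in * head_dim
qkv_params = num_heads * qkv_params_per_head

# Output projection: weight matrix plus bias vector
proj_params = d_out * d_out + d_out

total_params = qkv_params + proj_params

print("head_dim:", head_dim)
print("Q/K/V parameters:", qkv_params)
print("projection parameters:", proj_params)
print("total parameters:", total_params)
```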


In [3]:
# Create a MultiHeadAttentionWrapper with the smallest GPT-2 configuration
gpt2_attention = MultiHeadAttentionWrapper(
    d_in=768,
    d_out=768,
    num_heads=12
)

# Test it with a dummy input
dummy_input = torch.rand(1, 768)  # One token, 768-dimensional
output = gpt2_attention(dummy_input)

print("GPT-2 attention output shape:", output.shape)  # Should be [1, 768]


GPT-2 attention output shape: torch.Size([1, 768])
