
You will produce:

A custom PyTorch module implementing multi-head attention + feedforward encoder blocks.
A fine-tuned transformer baseline on the same dataset.
Visualizations of attention weights for selected samples.
A reflection comparing both approaches and documenting insights about attention behavior.


Dataset
Use the Natural Language Inference dataset provided in this link.

Task
Single-Head Attention Implementation

Objective: Implement the building block before extending to multiple heads.
Instructions:
Using PyTorch, implement an Attention module with linear projections for Q/K/V.

Validate shapes with dummy tensors (batch, seq_len, hidden_dim).
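
For reference, the scaled dot-product attention computed by the module below can be written as follows, where $Q$, $K$, $V$ are the projected queries, keys, and values and $d_k$ is the head dimension (`head_dim` in the code):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

With inputs of shape (batch, seq_len, hidden_dim), $Q K^{\top}$ has shape (batch, seq_len, seq_len) and the output has shape (batch, seq_len, head_dim).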

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.embed_dim = embed_dim
        self.head_dim = head_dim

        self.q_linear = nn.Linear(embed_dim, head_dim, bias=False)
        self.k_linear = nn.Linear(embed_dim, head_dim, bias=False)
        self.v_linear = nn.Linear(embed_dim, head_dim, bias=False)

    def forward(self, query, key, value, mask=None):
        # Apply linear transformations
        q = self.q_linear(query)
        k = self.k_linear(key)
        v = self.v_linear(value)

        # Compute attention scores (scaled dot-product)
        attention_scores = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)

        # Apply mask (if provided)
        if mask is not None:
            attention_scores = attention_scores.masked_fill(mask == 0, float('-inf'))

        # Compute attention weights
        attention_weights = F.softmax(attention_scores, dim=-1)

        # Apply attention to values
        output = torch.matmul(attention_weights, v)

        return output, attention_weights

# Validate shapes with dummy tensors
batch_size = 2
seq_len = 10
embed_dim = 512
head_dim = 64  # in multi-head attention, head_dim = embed_dim // num_heads (e.g. 512 // 8)

single_head_attention = SingleHeadAttention(embed_dim, head_dim)

# Create dummy tensors
query = torch.randn(batch_size, seq_len, embed_dim)
key = torch.randn(batch_size, seq_len, embed_dim)
value = torch.randn(batch_size, seq_len, embed_dim)

# Forward pass
output, attention_weights = single_head_attention(query, key, value)

# Print shapes
print("Query shape:", query.shape)
print("Key shape:", key.shape)
print("Value shape:", value.shape)
print("Output shape:", output.shape)
print("Attention weights shape:", attention_weights.shape)
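
The `mask` argument is not exercised by the dummy tensors above. As a minimal sketch, assume a padding mask of shape (batch, seq_len) with 1 for real tokens and 0 for padding (a common convention, not something this notebook specifies). Such a mask needs an extra dimension so that it broadcasts over the query axis of the (batch, seq_len, seq_len) scores. The helper name `masked_attention_weights` is hypothetical.

```python
import torch
import torch.nn.functional as F

def masked_attention_weights(scores, padding_mask):
    """scores: [B, L, L]; padding_mask: [B, L] with 1 = real token, 0 = padding."""
    # Broadcast the key mask over the query dimension: [B, L] -> [B, 1, L]
    key_mask = padding_mask.unsqueeze(1)
    scores = scores.masked_fill(key_mask == 0, float("-inf"))
    return F.softmax(scores, dim=-1)

torch.manual_seed(0)
scores = torch.randn(2, 4, 4)
padding_mask = torch.tensor([[1, 1, 1, 0],
                             [1, 1, 0, 0]])
weights = masked_attention_weights(scores, padding_mask)

print(weights[0])  # last column is all zeros
print(weights[1])  # last two columns are all zeros
```

Padded key positions receive exactly zero weight and each row still sums to 1. A query row whose keys are all masked would produce NaN after the softmax, so at least one key per sequence should stay unmasked.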

Log the attention weights for inspection.

In [None]:
# Quick logging after a forward pass (for debugging)

# If you just want to inspect the weights numerically or save them:
output, attn_weights = single_head_attention(query, key, value)

print("Attention weights shape:", attn_weights.shape)
print("Sample attention weights [batch 0, token 0]:")
print(attn_weights[0, 0])
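
Printing raw rows gets hard to read as sequences grow. One optional summary, sketched here as an assumption rather than part of the original exercise, is the entropy of each query's attention distribution: values near 0 mean the query focuses on a single key, values near log(seq_len) mean attention is spread almost uniformly. The function name `attention_entropy` is hypothetical.

```python
import math
import torch

def attention_entropy(weights, eps=1e-12):
    """Entropy (in nats) of each attention row; weights has shape [..., L_query, L_key]."""
    return -(weights * (weights + eps).log()).sum(dim=-1)

uniform = torch.full((1, 4, 4), 0.25)   # every query attends equally to 4 keys
peaked = torch.eye(4).unsqueeze(0)      # every query attends to exactly one key

print(attention_entropy(uniform))  # each entry is close to log(4)
print(attention_entropy(peaked))   # each entry is close to 0
print(math.log(4))
```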


In [None]:
# Save to file (for later analysis)
import numpy as np

# Save as a .npy file
np.save("attention_weights.npy", attn_weights.detach().cpu().numpy())

# Or save a single attention row (batch 0, query token 0) as text
np.savetxt("attention_sample.txt", attn_weights[0, 0].detach().cpu().numpy())


In [None]:
# Graphical view (the clearest)
import matplotlib.pyplot as plt
import seaborn as sns

# Example: attention for the first example in the batch
attn = attn_weights[0].detach().cpu().numpy()

plt.figure(figsize=(8, 6))
sns.heatmap(attn, cmap="viridis")
plt.title("Attention Weights Heatmap (Sample 0)")
plt.xlabel("Keys")
plt.ylabel("Queries")
plt.show()


Multi-Head Attention Module

Objective: Extend the single-head block to multi-head functionality.
Instructions:
Implement MultiHeadAttention that splits embeddings into num_heads, applies attention per head, and concatenates.
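
The split and concatenate steps are pure reshapes. The sketch below, using small hypothetical sizes, shows that splitting a [B, L, D] tensor into [B, H, L, d] heads and merging it back recovers the original tensor, so no information is lost between the two layouts.

```python
import torch

B, L, D, H = 2, 5, 12, 3   # hypothetical small sizes
d = D // H                 # per-head dimension

x = torch.arange(B * L * D, dtype=torch.float32).view(B, L, D)

# Split: [B, L, D] -> [B, L, H, d] -> [B, H, L, d]
heads = x.view(B, L, H, d).transpose(1, 2)

# Merge: [B, H, L, d] -> [B, L, H, d] -> [B, L, D]
merged = heads.transpose(1, 2).contiguous().view(B, L, D)

print(heads.shape)             # torch.Size([2, 3, 5, 4])
print(torch.equal(merged, x))  # True
# Head h holds feature columns h*d to (h+1)*d of every token
print(torch.equal(heads[:, 1], x[:, :, d:2 * d]))  # True
```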

In [None]:
# Multi-Head Attention Module (PyTorch)
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5

        # Linear projections for Q, K, V (one matrix each; its output is split across heads)
        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=False)

        # Final linear layer to combine heads
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, x, mask=None):
        B, L, D = x.shape
        H, d = self.num_heads, self.head_dim

        # 1️⃣ Linear projections
        q = self.q_proj(x)  # [B, L, D]
        k = self.k_proj(x)
        v = self.v_proj(x)

        # 2️⃣ Reshape → separate heads
        # becomes [B, H, L, d]
        q = q.view(B, L, H, d).transpose(1, 2)
        k = k.view(B, L, H, d).transpose(1, 2)
        v = v.view(B, L, H, d).transpose(1, 2)

        # 3️⃣ Scaled dot-product attention per head
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale  # [B, H, L, L]
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn = F.softmax(scores, dim=-1)             # attention weights [B, H, L, L]
        context = torch.matmul(attn, v)              # weighted values [B, H, L, d]

        # 4️⃣ Concatenate heads → [B, L, D]
        context = context.transpose(1, 2).contiguous().view(B, L, D)

        # 5️⃣ Final linear projection
        out = self.out_proj(context)                 # [B, L, D]
        return out, attn


In [None]:
# Test with dummy tensors
# Dummy input
batch_size, seq_len, embed_dim, num_heads = 2, 10, 512, 8
x = torch.randn(batch_size, seq_len, embed_dim)

mha = MultiHeadAttention(embed_dim, num_heads)
out, attn_w = mha(x)

print("Input shape:", x.shape)
print("Output shape:", out.shape)           # → [2, 10, 512]
print("Attention weights shape:", attn_w.shape)  # → [2, 8, 10, 10]
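
The returned weights keep one [L, L] map per head. For a quick overview it is common to either inspect a single head or average across heads. The sketch below uses random stand-in weights (an assumption, so that it runs without the module above) to show both reductions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in for attn_w with shape [B, H, L, L]
attn_w = F.softmax(torch.randn(2, 8, 10, 10), dim=-1)

head_3 = attn_w[:, 3]          # one head:        [B, L, L]
head_avg = attn_w.mean(dim=1)  # mean over heads: [B, L, L]

print(head_3.shape, head_avg.shape)
# Averaging probability rows gives rows that still sum to 1
print(torch.allclose(head_avg.sum(dim=-1), torch.ones(2, 10), atol=1e-5))
```

Averaging can hide heads with distinct patterns, so per-head plots remain useful when comparing heads.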


In [None]:
# Extended version: adds dropout, a residual connection, and LayerNorm

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, dropout=0.1):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"

        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5

        # Linear projections for Q, K, V and output
        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=False)

        # Regularization
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x, mask=None):
        # Save residual for skip connection
        residual = x

        B, L, D = x.shape
        H = self.num_heads
        d = self.head_dim

        # 1️⃣ Linear projections
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)

        # 2️⃣ Reshape into heads [B, H, L, d]
        q = q.view(B, L, H, d).transpose(1, 2)
        k = k.view(B, L, H, d).transpose(1, 2)
        v = v.view(B, L, H, d).transpose(1, 2)

        # 3️⃣ Scaled dot-product attention
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        attn = self.dropout(attn)  # dropout on attention map

        context = torch.matmul(attn, v)  # [B, H, L, d]

        # 4️⃣ Merge heads
        context = context.transpose(1, 2).contiguous().view(B, L, D)
        out = self.out_proj(context)
        out = self.dropout(out)

        # 5️⃣ Residual + LayerNorm
        out = self.norm(out + residual)

        return out, attn
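
Because dropout is applied to the attention map, the weights returned in training mode contain zeros and rescaled entries. The sketch below isolates that effect with a bare `nn.Dropout` on a uniform stand-in map (the 0.5 rate is chosen only to make the effect obvious). Calling `.eval()` on the module before logging or plotting avoids it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
attn = torch.full((1, 4, 4), 0.25)   # uniform stand-in attention map

dropout.train()
train_attn = dropout(attn)           # entries are either 0 or 0.25 / (1 - p) = 0.5

dropout.eval()
eval_attn = dropout(attn)            # identity in eval mode

print(train_attn[0])
print(torch.equal(eval_attn, attn))  # True
```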


In [None]:
# Add a position-wise feed-forward network (FFN)
class FeedForward(nn.Module):
    def __init__(self, embed_dim, ff_dim, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(ff_dim, embed_dim),
            nn.Dropout(dropout)
        )
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        residual = x
        x = self.net(x)
        return self.norm(x + residual)
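
"Position-wise" means the same two-layer network is applied to every token independently, with no mixing across positions. A small sketch with hypothetical sizes: permuting the tokens before the network gives the same result as permuting its outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ffn = nn.Sequential(          # same Linear-GELU-Linear layout as FeedForward.net, without dropout
    nn.Linear(16, 64),
    nn.GELU(),
    nn.Linear(64, 16),
)
ffn.eval()

x = torch.randn(2, 5, 16)             # [B, L, D]
perm = torch.tensor([4, 2, 0, 3, 1])  # reorder the 5 token positions

with torch.no_grad():
    out_then_perm = ffn(x)[:, perm]
    perm_then_out = ffn(x[:, perm])

print(torch.allclose(out_then_perm, perm_then_out, atol=1e-5))  # True
```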


In [None]:
# Full encoder block (attention + FFN)
class EncoderBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.mha = MultiHeadAttention(embed_dim, num_heads, dropout)
        self.ffn = FeedForward(embed_dim, ff_dim, dropout)

    def forward(self, x, mask=None):
        x, attn_w = self.mha(x, mask)
        x = self.ffn(x)
        return x, attn_w
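
For a sanity check of shapes, PyTorch ships a comparable block, `nn.TransformerEncoderLayer`. This is the library's layer, not the custom `EncoderBlock` above, and its internals differ (for example, its projections use biases). The sketch only shows that it follows the same shape contract; the sizes are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(
    d_model=64,            # embed_dim
    nhead=4,               # num_heads
    dim_feedforward=128,   # ff_dim
    dropout=0.1,
    activation="gelu",
    batch_first=True,      # inputs are [B, L, D], as in this notebook
)
layer.eval()

x = torch.randn(2, 10, 64)
with torch.no_grad():
    out = layer(x)

print(out.shape)  # torch.Size([2, 10, 64])
```

Unlike `EncoderBlock`, the built-in layer returns only the output tensor, not the attention weights.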


In [None]:
# Quick shape test of the encoder block
B, L, D, H = 2, 10, 512, 8
x = torch.randn(B, L, D)
block = EncoderBlock(embed_dim=D, num_heads=H, ff_dim=2048, dropout=0.1)

out, attn_w = block(x)
print("Output shape:", out.shape)        # → [2, 10, 512]
print("Attention weights:", attn_w.shape)  # → [2, 8, 10, 10]
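
The size of the block follows directly from the layer definitions above. Here is a back-of-the-envelope count in plain Python, assuming bias-free attention projections, biased feed-forward layers, and two LayerNorms with a weight and bias each (which is how the classes above are declared). The function name `encoder_block_params` is hypothetical.

```python
def encoder_block_params(embed_dim, ff_dim):
    """Count trainable parameters of one encoder block as declared in this notebook."""
    attn_proj = 4 * embed_dim * embed_dim   # q, k, v, out projections, no bias
    layer_norms = 2 * (2 * embed_dim)       # weight + bias for the attention and FFN norms
    ffn = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
    return attn_proj + layer_norms + ffn

print(encoder_block_params(512, 2048))  # 3150336
```

The number of heads does not appear in the count: splitting into heads reshapes the projections without adding parameters.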


Summary

* Dropout → regularizes attention and feedforward layers
* Residuals → preserve information flow and help gradient stability
* LayerNorm → stabilizes activations after each residual connection
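
To make the LayerNorm point concrete, here is a small sketch with hypothetical sizes: after the residual sum, each token vector is normalized to roughly zero mean and unit variance across its features. This holds this closely only at initialization, before the learned scale and shift move away from 1 and 0.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
norm = nn.LayerNorm(64)

residual = torch.randn(2, 10, 64) * 5 + 3   # deliberately off-center, large-scale input
sublayer_out = torch.randn(2, 10, 64)

y = norm(sublayer_out + residual)

print(y.mean(dim=-1).abs().max())            # close to 0
print(y.var(dim=-1, unbiased=False).mean())  # close to 1
```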

Provide a forward example showing input/output shapes.

In [None]:
import torch

# Dummy batch
batch_size = 2
seq_len = 10
embed_dim = 512
num_heads = 8
ff_dim = 2048
dropout = 0.1

# Create random input (e.g., embedded tokens)
x = torch.randn(batch_size, seq_len, embed_dim)

# Initialize encoder block
encoder_block = EncoderBlock(embed_dim, num_heads, ff_dim, dropout)

# Forward pass
output, attn_weights = encoder_block(x)

# Print tensor shapes
print(f"Input shape:             {x.shape}")          # [2, 10, 512]
print(f"Output shape:            {output.shape}")     # [2, 10, 512]
print(f"Attention weights shape: {attn_weights.shape}") # [2, 8, 10, 10]


| Step                   | Tensor         | Shape                      | Description                        |
| ---------------------- | -------------- | -------------------------- | ---------------------------------- |
| Input                  | `x`            | `[B, L, D] = [2, 10, 512]` | Sequence of embeddings             |
| Q/K/V projection       | `q, k, v`      | `[2, 8, 10, 64]`           | Split into 8 heads of size 64      |
| Attention weights      | `attn_weights` | `[2, 8, 10, 10]`           | Relation between each token pair   |
| Weighted sum (context) | `context`      | `[2, 10, 512]`             | Heads concatenated back together   |
| FFN output             | `output`       | `[2, 10, 512]`             | Processed representation per token |


Result

The encoder block preserves the sequence length (10 → 10) and embedding dimension (512 → 512) while enriching each token with contextual information from all others through multi-head attention.
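
One way to see this mixing is sketched below with PyTorch's built-in `nn.MultiheadAttention` as a stand-in for the custom module (an assumption made to keep the example self-contained): changing a single input token changes the output at other positions too.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
mha.eval()

x = torch.randn(1, 6, 32)
x_changed = x.clone()
x_changed[:, 3] += 1.0   # perturb only token 3

with torch.no_grad():
    out, weights = mha(x, x, x)   # weights are averaged over heads: [1, 6, 6]
    out_changed, _ = mha(x_changed, x_changed, x_changed)

# Output at token 0 differs even though input token 0 is unchanged
print((out[:, 0] - out_changed[:, 0]).abs().max())
print(weights.shape)  # torch.Size([1, 6, 6])
```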