# Task 2 | Transformer Architectural Changes (40 Points)

---

In the second task, you modify your original GPT-2 model architecture and assess whether each change improves the model. Here is what you need to do:

- **Rotary Positional Embedding:** Replace the original positional embeddings in the GPT-2 model with Rotary embeddings. You may refer to [Su et al. RoFormer](https://arxiv.org/pdf/2104.09864.pdf).
- **Group Query Attention:** Equip your model with the Group Query Attention mechanism described in [Ainslie et al. GQA: Training Generalized Multi-Query Transformer](https://arxiv.org/pdf/2305.13245v2.pdf). Analyze how this mechanism changes the model's operation compared to the standard attention mechanism.
- **Sliding Window Attention:** Integrate the Sliding Window Attention mechanism into your model and observe its effects on model performance. Refer to the work by [Beltagy et al. Longformer](https://arxiv.org/pdf/2004.05150v2.pdf) for details on its implementation and advantages.

**Deliverable:** Python code with any one, two or all three changes. Comment on the model size and capabilities, potential pitfalls and/or any improvement after each change. Points will be awarded for any combination of successful implementations.

**Evaluation Scheme:** Each feature implementation will account for:

- Rotary Positional Embedding: 15 points
- Group Query Attention: 10 points
- Sliding Window Attention: 15 points

In [14]:
import math
import torch
from torch import nn
import torch.nn.functional as F

**Solution Task 2**

**1. Rotary Positional Embedding:**


First, create a RotaryEmbedding module, then use it to replace the original positional embeddings in the GPT-2 model.


In [15]:
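class RotaryEmbedding(nn.Module):
    def __init__(self, embed_dim, base=10000.0):
        super().__init__()
        assert embed_dim % 2 == 0, "rotary embeddings need an even dimension"
        self.embed_dim = embed_dim

        # One fixed rotation frequency per pair of channels (nothing is learned)
        inv_freq = 1.0 / (base ** (torch.arange(0, embed_dim, 2).float() / embed_dim))
        self.register_buffer("inv_freq", inv_freq)

    def forward(self, x, position):
        # x: (..., seq_len, embed_dim), position: (seq_len,)
        # RoFormer applies this rotation to the queries and keys inside each attention layer
        angles = position.to(x.device).float().unsqueeze(-1) * self.inv_freq  # (seq_len, embed_dim / 2)
        sin, cos = angles.sin(), angles.cos()

        # Rotate each (even, odd) channel pair by its position-dependent angle
        x_even, x_odd = x[..., 0::2], x[..., 1::2]
        rotated = torch.stack([x_even * cos - x_odd * sin, x_even * sin + x_odd * cos], dim=-1)
        return rotated.flatten(-2)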
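
The property that makes rotary embeddings useful can be checked without PyTorch. The sketch below is a standalone illustration, not part of the model: it stores each channel pair as one Python complex number and rotates it by `position * theta`. The helper names `rotate` and `score` and the sample vectors are assumptions made for this example. It checks that the query-key score depends only on the offset between the two positions, which is the relative-position behaviour described in RoFormer.

```python
import cmath

def rotate(pairs, position, base=10000.0):
    """Rotate each complex channel pair by position * theta_i (RoPE-style)."""
    dim = 2 * len(pairs)
    out = []
    for i, z in enumerate(pairs):
        theta = base ** (-2 * i / dim)  # frequency for the i-th pair
        out.append(z * cmath.exp(1j * position * theta))
    return out

def score(q_pairs, k_pairs):
    """Real dot product of two vectors stored as complex pairs."""
    return sum((q * k.conjugate()).real for q, k in zip(q_pairs, k_pairs))

# Illustrative 4-dimensional query and key, stored as 2 complex pairs each
q = [complex(0.5, -1.0), complex(2.0, 0.3)]
k = [complex(1.5, 0.2), complex(-0.7, 1.1)]

# Same relative offset (3) at two different absolute positions
s_a = score(rotate(q, 5), rotate(k, 2))
s_b = score(rotate(q, 105), rotate(k, 102))
print(abs(s_a - s_b) < 1e-9)
```

Shifting both positions by the same amount leaves the score unchanged, while changing the offset between them changes it.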


Now, integrate RotaryEmbedding into the GPT-2 model (created in Task 1):


In [16]:
class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.embed_dim
        self.max_len = config.max_len
        self.tok_embed = nn.Embedding(
            config.vocab_size, embed_dim
        )

        # Replace positional embedding with Rotary embedding
        self.pos_embed = RotaryEmbedding(embed_dim)

        self.dropout = nn.Dropout(config.embed_dropout)
        self.blocks = nn.Sequential(
            *[Block(config) for _ in range(config.num_blocks)]
        )
        self.ln = nn.LayerNorm(embed_dim)
        self.fc = nn.Linear(embed_dim, config.vocab_size)
    
    def forward(self, x, target=None):
        seq_len = x.size(1)
        assert seq_len <= self.max_len, "sequence longer than model capacity"
        
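        tok_embedding = self.tok_embed(x)
        positions = torch.arange(seq_len, device=x.device)
        # pos_embed already returns the position-encoded embeddings, so they are
        # not added to tok_embedding a second time
        x = self.dropout(self.pos_embed(tok_embedding, positions))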
        x = self.blocks(x)
        x = self.ln(x)
        x = self.fc(x)

        return x
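
On model size: a rough accounting, assuming the Task 1 model used a learned positional table such as `nn.Embedding(max_len, embed_dim)`, as GPT-2 does. Rotary embeddings as described in RoFormer use fixed frequencies and learn nothing, so swapping them in removes that table. The function names below are illustrative, and the sizes are those of GPT-2 small.

```python
def learned_position_params(max_len, embed_dim):
    """Size of a learned positional table such as nn.Embedding(max_len, embed_dim)."""
    return max_len * embed_dim

def rotary_position_params():
    """Rotary frequencies are fixed functions of the channel index, so nothing is learned."""
    return 0

# GPT-2 small sizes: context length 1024, embedding width 768
saved = learned_position_params(1024, 768) - rotary_position_params()
print(saved)  # 786432
```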


**2. Group Query Attention:**

Creating a GroupQueryAttention module:

In [17]:
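class GroupQueryAttention(nn.Module):
    def __init__(self, embed_dim, num_heads, num_kv_heads=1):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        assert num_heads % num_kv_heads == 0, "num_heads must be divisible by num_kv_heads"
        self.num_heads = num_heads
        # num_kv_heads == num_heads is standard multi-head attention;
        # num_kv_heads == 1 (the default assumed here) is multi-query attention
        self.num_kv_heads = num_kv_heads
        self.head_dim = embed_dim // num_heads

        # Keys and values are projected to fewer heads than the queries
        self.key = nn.Linear(embed_dim, num_kv_heads * self.head_dim)
        self.value = nn.Linear(embed_dim, num_kv_heads * self.head_dim)
        self.query = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        batch_size, seq_len, _ = x.size()
        group_size = self.num_heads // self.num_kv_heads

        q = self.query(x).reshape(batch_size, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.key(x).reshape(batch_size, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.value(x).reshape(batch_size, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)

        # Each key/value head is shared by group_size consecutive query heads
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)

        attn = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        # Causal mask: a position may only attend to itself and earlier positions
        causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device))
        attn = attn.masked_fill(~causal, float("-inf"))
        attn = F.softmax(attn, dim=-1)

        y = torch.matmul(attn, v)
        y = y.transpose(1, 2).reshape(batch_size, seq_len, -1)
        return y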
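
To compare grouped-query attention with standard attention in terms of size, the arithmetic below counts the query, key and value projection parameters and the key/value cache for one layer. It is a back-of-the-envelope sketch: the function names are illustrative, the sizes are those of GPT-2 small (width 768, 12 heads, context 1024), and it assumes queries keep all `num_heads` heads while keys and values use `num_kv_heads`.

```python
def attention_projection_params(embed_dim, num_heads, num_kv_heads, bias=True):
    """Parameters in the query, key and value projections of one attention layer."""
    head_dim = embed_dim // num_heads
    kv_dim = num_kv_heads * head_dim
    q_params = embed_dim * embed_dim + (embed_dim if bias else 0)
    kv_params = 2 * (embed_dim * kv_dim + (kv_dim if bias else 0))
    return q_params + kv_params

def kv_cache_elements(seq_len, num_heads, num_kv_heads, embed_dim):
    """Keys plus values cached per layer for one sequence during generation."""
    head_dim = embed_dim // num_heads
    return 2 * seq_len * num_kv_heads * head_dim

# Illustrative GPT-2 small sizes: width 768, 12 query heads
for kv_heads in (12, 4, 1):  # multi-head, grouped-query, multi-query
    params = attention_projection_params(768, 12, kv_heads)
    cache = kv_cache_elements(1024, 12, kv_heads, 768)
    print(kv_heads, params, cache)
```

With these sizes the key/value cache shrinks from 1,572,864 elements per layer with 12 key/value heads to 524,288 with 4 and 131,072 with 1. The potential pitfall is that fewer key/value heads also means less capacity, so quality should be checked after each reduction.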



Now, integrate GroupQueryAttention into Block (created in task 1):

In [19]:
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.embed_dim

        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)

        # Replace MultiheadAttention with GroupQueryAttention
        self.attn = GroupQueryAttention(embed_dim, config.num_heads)

        self.ff = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 4),
            nn.GELU(),
            nn.Linear(embed_dim * 4, embed_dim),
            nn.Dropout(config.ff_dropout),
        )

    def forward(self, x):
        # First sublayer: layer norm, grouped-query attention, residual connection
        x = x + self.attn(self.ln1(x))

        # Second sublayer: layer norm, feedforward, residual connection
        x = x + self.ff(self.ln2(x))

        return x

**3. Sliding Window Attention:**

Implement the Sliding Window Attention mechanism. Create a SlidingWindowAttention module:

In [None]:
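class SlidingWindowAttention(nn.Module):
    def __init__(self, embed_dim, window_size):
        super().__init__()
        self.window_size = window_size
        # Single-head for brevity; one projection produces queries, keys and values
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)

    def forward(self, x):
        seq_len = x.size(1)
        q, k, v = self.qkv(x).chunk(3, dim=-1)  # each (batch_size, seq_len, embed_dim)
        attn = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))

        # Causal band mask: position i attends only to positions i - window_size + 1 .. i
        # (the full score matrix is still computed, so this saves no memory)
        idx = torch.arange(seq_len, device=x.device)
        offset = idx.unsqueeze(1) - idx.unsqueeze(0)  # offset[i, j] = i - j
        allowed = (offset >= 0) & (offset < self.window_size)
        attn = attn.masked_fill(~allowed, float("-inf"))

        return torch.matmul(F.softmax(attn, dim=-1), v)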
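
For a GPT-style decoder, a sliding window restricts each position to itself and the previous `window_size - 1` positions. Note that Longformer's window is symmetric around each token; the causal variant here is an assumption made to suit autoregressive generation. The pure-Python sketch below builds such a mask with an illustrative helper, `sliding_window_mask`, prints it, and counts how many query-key pairs remain.

```python
def sliding_window_mask(seq_len, window_size):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [0 <= i - j < window_size for j in range(seq_len)]
        for i in range(seq_len)
    ]

mask = sliding_window_mask(seq_len=6, window_size=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Number of query-key pairs that are actually scored
windowed_pairs = sum(sum(row) for row in mask)
causal_pairs = 6 * (6 + 1) // 2
print(windowed_pairs, causal_pairs)
```

For `seq_len=6` and `window_size=3` the window keeps 15 query-key pairs instead of the 21 a full causal mask allows. The gap widens with sequence length, because the windowed count grows linearly rather than quadratically.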


Now, integrate SlidingWindowAttention into Block (created in Task 1):



In [20]:
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        embed_dim = config.embed_dim

        # Layer normalization before the attention sublayer
        self.ln1 = nn.LayerNorm(embed_dim)

        # Sliding window attention replaces full multihead attention
        self.attn = SlidingWindowAttention(embed_dim, config.window_size)

        # Layer normalization before the feedforward sublayer
        self.ln2 = nn.LayerNorm(embed_dim)

        # Feedforward layer
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 4),
            nn.GELU(),
            nn.Linear(embed_dim * 4, embed_dim),
            nn.Dropout(config.ff_dropout),
        )

    def forward(self, x):
        # First sublayer: layer norm, sliding window attention, residual connection
        x = x + self.attn(self.ln1(x))

        # Second sublayer: layer norm, feedforward, residual connection
        x = x + self.ff(self.ln2(x))

        return x
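
A single windowed layer sees only `window_size` positions, but stacking blocks widens the effective context, an effect the Longformer paper describes for windowed attention. The sketch below is a minimal illustration, assuming a causal window and a sequence long enough that the window is never clipped at the start. The name `receptive_field` is illustrative.

```python
def receptive_field(num_layers, window_size):
    """How many positions (including itself) can influence one output position."""
    reachable = {0}  # offsets back from the output position
    for _ in range(num_layers):
        # Each layer lets every reachable position look back up to window_size - 1 more steps
        reachable = {off + step for off in reachable for step in range(window_size)}
    return len(reachable)

for layers in (1, 2, 4):
    print(layers, receptive_field(layers, window_size=3))
```

Under these assumptions the count follows `num_layers * (window_size - 1) + 1`. A deep model can therefore still pass information across long spans, although it has to do so indirectly through intermediate positions.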


All three changes are implemented above:
- Rotary Positional Embedding
- Group Query Attention
- Sliding Window Attention

To use them, add these modules to the GPT-2 code developed in Task 1.
