In [1]:
import torch
import math
from torch import nn
import torch.nn.functional as F

### **Explanation of Transformer Encoding**

The code below provides a modular implementation of the Transformer encoder. Its key components and their roles are:

#### **1. Scaled Dot-Product Attention**
- **Code:** `scaled_dot_product(q, k, v, mask=None)`
- This function computes attention scores by performing:
  1. **Dot product** between query (\(q\)) and key (\(k\)) vectors, scaled by the square root of their dimensionality (\(d_k\)).
  2. **Masking (optional):** Masks certain positions (e.g., padding or future tokens).
  3. **Softmax:** Converts the scores into probabilities for attention weights.
  4. **Weighted sum:** Applies these weights to the value (\(v\)) vectors to compute the final output.
- The attention mechanism allows the model to focus on relevant parts of the input sequence.
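
The four steps above can be written compactly as a single expression. Here \(M\) is an additive mask, an assumption that matches the `scaled += mask` line in the code: \(0\) where attention is allowed and \(-\infty\) where it is blocked.

\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V
\]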

#### **2. Multi-Head Attention**
- **Code:** `MultiHeadAttention(nn.Module)`
- Breaks the input into multiple "heads," each attending to different parts of the sequence. 
- Key steps:
  1. **Projection:** Maps input features (\(x\)) into query, key, and value matrices using a shared linear layer.
  2. **Reshape and Permutation:** Prepares the tensors for multi-head processing by reshaping and reordering dimensions.
  3. **Attention Calculation:** Computes scaled dot-product attention for each head.
  4. **Aggregation:** Combines the outputs of all heads and applies a final linear transformation.
- Multi-head attention enhances the model's ability to capture diverse relationships in the data.
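
One detail of the aggregation step is easy to get wrong: the per-head outputs are laid out as `[batch, heads, seq, head_dim]`, so the head axis has to be moved next to `head_dim` before flattening. The following is a minimal sketch with small, made-up dimensions (the tensor `values` and its sizes are illustrative, not taken from the model above).

```python
import torch

batch, heads, seq, head_dim = 2, 4, 3, 5
# Hypothetical per-head attention outputs: [batch, heads, seq, head_dim].
values = torch.randn(batch, heads, seq, head_dim)

# Move the head axis next to head_dim, then flatten the last two axes.
merged = values.permute(0, 2, 1, 3).reshape(batch, seq, heads * head_dim)

# Reference: for every token, concatenate its per-head vectors.
expected = torch.cat([values[:, h] for h in range(heads)], dim=-1)
same = torch.equal(merged, expected)

# Reshaping without the permute gives the same shape but mixes tokens and heads.
naive = values.reshape(batch, seq, heads * head_dim)
naive_same = torch.equal(naive, expected)

print(merged.shape, same, naive_same)
```

Both results have shape `[batch, seq, heads * head_dim]`, which is why a missing permute does not show up in a shape check.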

#### **3. Layer Normalization**
- **Code:** `LayerNormalization(nn.Module)`
- Applied after attention and feedforward layers to stabilize training and improve convergence. 
- Normalizes activations across the feature dimension using:
  \[
  \text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
  \]
- Maintains learnable parameters (\(\gamma\), \(\beta\)) for scaling and shifting.
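
As a sanity check, the formula above can be compared against PyTorch's built-in `torch.nn.functional.layer_norm`. This is a small sketch with illustrative tensor sizes, using \(\gamma = 1\) and \(\beta = 0\) (the initial values of the learnable parameters).

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 8)   # [batch, seq, features], sizes chosen for illustration
eps = 1e-5

mean = x.mean(dim=-1, keepdim=True)
var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)   # biased variance over features
manual = (x - mean) / torch.sqrt(var + eps)          # gamma = 1, beta = 0

reference = F.layer_norm(x, normalized_shape=(8,), eps=eps)
max_diff = (manual - reference).abs().max().item()
print(f"largest absolute difference: {max_diff:.2e}")
```

Each token is normalized independently over its feature vector, so the statistics have shape `[batch, seq, 1]`.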

#### **4. Position-Wise Feedforward Network**
- **Code:** `PositionwiseFeedForward(nn.Module)`
- Consists of two fully connected layers with a ReLU non-linearity and dropout in between.
- Applies transformations independently at each sequence position to introduce non-linearity and enrich the representation.
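
"Independently at each sequence position" can be checked directly: running the network on a whole sequence and on a single position should give the same result for that position. The sketch below uses a small stand-in network built with `nn.Sequential` and omits dropout so the comparison is deterministic; the dimensions are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_model, hidden = 8, 32   # small sizes for illustration
ffn = nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(), nn.Linear(hidden, d_model))

x = torch.randn(2, 5, d_model)   # [batch, seq, d_model]
full = ffn(x)                    # applied to the whole sequence
single = ffn(x[:, 3, :])         # applied to position 3 only

position_wise = torch.allclose(full[:, 3, :], single, atol=1e-5)
print(position_wise)
```

No information flows between positions in this sub-layer; mixing across positions happens only in attention.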

#### **5. Encoder Layer**
- **Code:** `EncoderLayer(nn.Module)`
- Combines all components into a single layer:
  1. Multi-head attention.
  2. Residual connection and LayerNorm.
  3. Feedforward network.
  4. Another residual connection and LayerNorm.
- Adds dropout for regularization at appropriate stages.
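
Written as equations, and assuming the ordering used in the `EncoderLayer.forward` method of this notebook (dropout on each sub-layer output, then the residual addition, then normalization), one layer computes:

\[
\begin{aligned}
y &= \text{LayerNorm}\big(x + \text{Dropout}(\text{MultiHeadAttention}(x))\big) \\
z &= \text{LayerNorm}\big(y + \text{Dropout}(\text{FFN}(y))\big)
\end{aligned}
\]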

#### **6. Encoder**
- **Code:** `Encoder(nn.Module)`
- Stacks multiple encoder layers sequentially to form the full Transformer Encoder.
- Each layer refines the representation, allowing the model to build a deeper understanding of the input.

### **Workflow**
1. Input data (\(x\)) passes through multiple layers of the encoder.
2. Each layer applies multi-head attention, feedforward networks, and normalization to process the sequence.
3. The output is a refined representation capturing both local and global dependencies.

This implementation modularizes the Transformer Encoder, providing flexibility for modification or debugging through its well-structured components.

In [2]:
def scaled_dot_product(q, k, v, mask=None):
    d_k = q.size()[-1]
    scaled = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(d_k) # 30 x 8 x 200 x 200
    print(f"scaled.size() : {scaled.size()}")
    if mask is not None:
        print(f"-- ADDING MASK of shape {mask.size()} --") 
        # Broadcasting add. So just the last N dimensions need to match
        scaled += mask
    attention = F.softmax(scaled, dim=-1)
    values = torch.matmul(attention, v)
    return values, attention

class MultiHeadAttention(nn.Module):

    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model # 512
        self.num_heads = num_heads # 8
        self.head_dim = d_model // num_heads # 64 = 512/8
        self.qkv_layer = nn.Linear(d_model , 3 * d_model) # 512*1536
        self.linear_layer = nn.Linear(d_model, d_model) # 512 *512
    
    def forward(self, x, mask=None):
        batch_size, max_sequence_length, d_model = x.size() # 30 x 200 x 512 
        print(f"x.size(): {x.size()}")
        qkv = self.qkv_layer(x) # 30 x 200 x 1536
        print(f"qkv.size(): {qkv.size()}")
        qkv = qkv.reshape(batch_size, max_sequence_length, self.num_heads, 3 * self.head_dim) # 30 x 200 x 8 x 192
        print(f"qkv.size(): {qkv.size()}")
        qkv = qkv.permute(0, 2, 1, 3) # 30 x 8 x 200 x 192
        print(f"qkv.size(): {qkv.size()}")
        q, k, v = qkv.chunk(3, dim=-1) # each are 30 x 8 x 200 x 64 
        print(f"q size: {q.size()}, k size: {k.size()}, v size: {v.size()}, ")
        values, attention = scaled_dot_product(q, k, v, mask)
        print(f"values.size(): {values.size()}, attention.size:{ attention.size()} ")
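        values = values.permute(0, 2, 1, 3) # 30 x 200 x 8 x 64, so each token's heads are adjacent
        values = values.reshape(batch_size, max_sequence_length, self.num_heads * self.head_dim) # 30 x 200 x 512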
        print(f"values.size(): {values.size()}")
        out = self.linear_layer(values)
        print(f"out.size(): {out.size()}")
        return out


class LayerNormalization(nn.Module):
    def __init__(self, parameters_shape, eps=1e-5):
        super().__init__()
        self.parameters_shape=parameters_shape # 512
        self.eps=eps
        self.gamma = nn.Parameter(torch.ones(parameters_shape)) # 512
        self.beta =  nn.Parameter(torch.zeros(parameters_shape)) # 512

    def forward(self, inputs): # 30 x 200 x 512
        dims = [-(i + 1) for i in range(len(self.parameters_shape))] # -1
        mean = inputs.mean(dim=dims, keepdim=True) # 30 x 200 x 1
        print(f"Mean ({mean.size()})")
        var = ((inputs - mean) ** 2).mean(dim=dims, keepdim=True) # 30 x 200 x 1
        std = (var + self.eps).sqrt() # 30 x 200 x 1
        print(f"Standard Deviation  ({std.size()})")
        y = (inputs - mean) / std # 30 x 200 x 512
        print(f"y: {y.size()}")
        out = self.gamma * y  + self.beta # 30 x 200 x 512 for both
        print(f"self.gamma: {self.gamma.size()}, self.beta: {self.beta.size()}")
        print(f"out: {out.size()}")
        return out

  
class PositionwiseFeedForward(nn.Module):

    def __init__(self, d_model, hidden, drop_prob=0.1):
        super(PositionwiseFeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, hidden) # 512 x 2048
        self.linear2 = nn.Linear(hidden, d_model) # 2048 x 512
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=drop_prob)

    def forward(self, x): # 30 x 200 x 512
        x = self.linear1(x) # 30 x 200 x 2048
        print(f"x after first linear layer: {x.size()}")
        x = self.relu(x) # 30 x 200 x 2048
        print(f"x after activation: {x.size()}")
        x = self.dropout(x) # 30 x 200 x 2048
        print(f"x after dropout: {x.size()}")
        x = self.linear2(x) # 30 x 200 x 512
        print(f"x after 2nd linear layer: {x.size()}")
        return x


class EncoderLayer(nn.Module):

    def __init__(self, d_model, ffn_hidden, num_heads, drop_prob):
        super(EncoderLayer, self).__init__()
        self.attention = MultiHeadAttention(d_model=d_model, num_heads=num_heads)
        self.norm1 = LayerNormalization(parameters_shape=[d_model])
        self.dropout1 = nn.Dropout(p=drop_prob)
        self.ffn = PositionwiseFeedForward(d_model=d_model, hidden=ffn_hidden, drop_prob=drop_prob)
        self.norm2 = LayerNormalization(parameters_shape=[d_model])
        self.dropout2 = nn.Dropout(p=drop_prob)

    def forward(self, x):
        residual_x = x # 30 x 200 x 512
        print("------- ATTENTION 1 ------")
        x = self.attention(x, mask=None) # 30 x 200 x 512
        print("------- DROPOUT 1 ------")
        x = self.dropout1(x) # 30 x 200 x 512
        print("------- ADD AND LAYER NORMALIZATION 1 ------")
        x = self.norm1(x + residual_x) # 30 x 200 x 512
        residual_x = x # 30 x 200 x 512
        # The label below says "ATTENTION 2", but this step is the position-wise feedforward network.
        print("------- ATTENTION 2 ------")
        x = self.ffn(x) # 30 x 200 x 512
        print("------- DROPOUT 2 ------")
        x = self.dropout2(x) # 30 x 200 x 512
        print("------- ADD AND LAYER NORMALIZATION 2 ------")
        x = self.norm2(x + residual_x) # 30 x 200 x 512
        return x

class Encoder(nn.Module):
    def __init__(self, d_model, ffn_hidden, num_heads, drop_prob, num_layers):
        super().__init__()
        # Create a sequential stack of 'num_layers' EncoderLayer modules with the specified parameters.
        self.layers = nn.Sequential(*[EncoderLayer(d_model, ffn_hidden, num_heads, drop_prob)
                                     for _ in range(num_layers)])
        
    # Pass the input through the stacked encoder layers sequentially and return the output. 
    def forward(self, x):
        x = self.layers(x)
        return x

In [3]:
d_model = 512
num_heads = 8
drop_prob = 0.1
batch_size = 30
max_sequence_length = 200
ffn_hidden = 2048 # Feedforward network hidden dimension
num_layers = 5 

encoder = Encoder(d_model, ffn_hidden, num_heads, drop_prob, num_layers)

1. **`d_model`**: The dimensionality of the input and output features in the model.  
2. **`num_heads`**: The number of attention heads in the multi-head attention mechanism.  
3. **`drop_prob`**: The probability of dropping elements during dropout for regularization.  
4. **`batch_size`**: The number of sequences processed together in a single forward and backward pass.  
5. **`max_sequence_length`**: The maximum number of tokens in each input sequence.  
6. **`ffn_hidden`**: The number of hidden units in the feedforward network within each encoder layer.  
7. **`num_layers`**: The number of encoder layers stacked in the Transformer encoder.
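
To get a feel for how these settings translate into model size, the following plain-Python sketch counts trainable parameters from the layer definitions in this notebook (weights plus biases of each `nn.Linear`, and `gamma`/`beta` of each normalization). The helper name `encoder_layer_params` is introduced here for illustration.

```python
def encoder_layer_params(d_model, ffn_hidden):
    """Count trainable parameters of one encoder layer as defined above."""
    qkv = d_model * 3 * d_model + 3 * d_model   # qkv_layer: weights + biases
    out = d_model * d_model + d_model           # linear_layer after attention
    ffn = (d_model * ffn_hidden + ffn_hidden) + (ffn_hidden * d_model + d_model)
    norms = 2 * (2 * d_model)                   # gamma and beta for two norms
    return qkv + out + ffn + norms

per_layer = encoder_layer_params(d_model=512, ffn_hidden=2048)
total = 5 * per_layer                           # num_layers = 5
print(per_layer, total)                         # prints: 3152384 15761920
```

Note that `num_heads` does not appear in the count: splitting `d_model` into heads changes how the projections are used, not how many parameters they have. `drop_prob`, `batch_size`, and `max_sequence_length` do not add parameters either.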

In [4]:
x = torch.randn( (batch_size, max_sequence_length, d_model) ) # random stand-in for embeddings with positional encoding already added
out = encoder(x)

------- ATTENTION 1 ------
x.size(): torch.Size([30, 200, 512])
qkv.size(): torch.Size([30, 200, 1536])
qkv.size(): torch.Size([30, 200, 8, 192])
qkv.size(): torch.Size([30, 8, 200, 192])
q size: torch.Size([30, 8, 200, 64]), k size: torch.Size([30, 8, 200, 64]), v size: torch.Size([30, 8, 200, 64]), 
scaled.size() : torch.Size([30, 8, 200, 200])
values.size(): torch.Size([30, 8, 200, 64]), attention.size:torch.Size([30, 8, 200, 200]) 
values.size(): torch.Size([30, 200, 512])
out.size(): torch.Size([30, 200, 512])
------- DROPOUT 1 ------
------- ADD AND LAYER NORMALIZATION 1 ------
Mean (torch.Size([30, 200, 1]))
Standard Deviation  (torch.Size([30, 200, 1]))
y: torch.Size([30, 200, 512])
self.gamma: torch.Size([512]), self.beta: torch.Size([512])
out: torch.Size([30, 200, 512])
------- ATTENTION 2 ------
x after first linear layer: torch.Size([30, 200, 2048])
x after activation: torch.Size([30, 200, 2048])
x after dropout: torch.Size([30, 200, 2048])
x after 2nd linear layer: torch.

### **Regularization in Transformer Neural Networks**

Regularization is a set of techniques used in neural networks, including Transformers, to prevent overfitting and improve generalization to unseen data. In Transformers, regularization is crucial due to their large number of parameters and the potential for overfitting on limited data. Here are the key regularization techniques commonly employed in Transformers:

---

### **1. Dropout**
- **What it is:** A simple and widely used technique in which a fraction of a layer's activations are randomly "dropped" (set to zero) during training. This prevents the model from relying too heavily on specific neurons.
- **Where it's applied in Transformers:**
  - After the attention sub-layer, on its output; many implementations also apply it to the attention weights themselves.
  - In the feedforward network to reduce over-reliance on specific feature transformations.
  - Before or after residual connections to prevent co-adaptation of layers.
- **How it helps:** Dropout reduces the risk of overfitting by encouraging the model to learn more robust features.
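
The sketch below illustrates how `nn.Dropout` behaves in the two module modes. It uses a vector of ones so the effect is easy to read; the size and probability are illustrative.

```python
import torch
from torch import nn

p = 0.1
dropout = nn.Dropout(p=p)
x = torch.ones(1000)

dropout.train()
y_train = dropout(x)   # each entry is either 0 or 1 / (1 - p)

dropout.eval()
y_eval = dropout(x)    # identity at inference time

kept = y_train[y_train != 0]
print(f"fraction dropped: {(y_train == 0).float().mean().item():.3f}")
```

Surviving activations are scaled by `1 / (1 - p)` during training so that the expected value matches what the next layer sees at inference time. This is also why `model.eval()` should be called before evaluating the encoder.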

---

### **2. Attention Masking**
- **What it is:** While not a traditional regularization method, masking (e.g., padding or future masking) controls the range of attention.
- **Purpose:**
  - In **autoregressive decoders** (for example in sequence-to-sequence models and language modeling), causal masks prevent each position from attending to future tokens.
  - Padding masks ensure the model ignores padded parts of the sequence.
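
Both kinds of mask can be built as additive tensors compatible with the `scaled_dot_product` function earlier in this notebook, which adds the mask to the scores before the softmax. The sketch below is a minimal example for a single sequence of length 4; the scores are random and the padding pattern (`is_pad`) is made up for illustration.

```python
import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(seq_len, seq_len)   # hypothetical attention scores

# Causal mask: -inf above the diagonal (future positions), 0 elsewhere.
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

# Padding mask: assume the last position is padding and block it as a key.
is_pad = torch.tensor([False, False, False, True])
padding_mask = torch.zeros(seq_len, seq_len).masked_fill(is_pad.unsqueeze(0), float("-inf"))

causal_weights = F.softmax(scores + causal_mask, dim=-1)
padding_weights = F.softmax(scores + padding_mask, dim=-1)
print(causal_weights)
print(padding_weights)
```

Because `exp(-inf)` is exactly zero, masked positions receive zero weight and each row still sums to one. A row in which every key is masked would produce `NaN`, so padding is blocked on the key axis only.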

---

### **3. Weight Decay (L2 Regularization)**
- **What it is:** Adds a penalty to the loss function based on the magnitude of the model's weights.
  
- **How it helps:** Encourages smaller weights, leading to simpler models that are less prone to overfitting.
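
With plain SGD, PyTorch's `weight_decay` argument adds `weight_decay * w` to each gradient, which is the gradient of the penalty \(\tfrac{\lambda}{2}\lVert w \rVert^2\). The sketch below isolates that effect by using a loss whose own gradient is zero; the values of `lr` and `weight_decay` are arbitrary choices for illustration.

```python
import torch

w = torch.nn.Parameter(torch.ones(3))
optimizer = torch.optim.SGD([w], lr=0.1, weight_decay=0.5)

loss = (w * 0.0).sum()   # task gradient is zero, so only the decay term acts
loss.backward()
optimizer.step()

# Update rule: w <- w - lr * (grad + weight_decay * w) = 1 - 0.1 * 0.5 = 0.95
print(w.detach())
```

Adaptive optimizers handle this differently: `torch.optim.AdamW` applies the decay directly to the weights rather than through the gradient, so the two formulations are not interchangeable there.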

---

### **4. Layer Normalization**
- **What it is:** Normalizes the activations across the features in a layer to stabilize training and prevent exploding or vanishing gradients.
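- **In Transformers:** In the post-norm arrangement used in the code above, it is applied after the residual addition that follows each attention and feedforward sub-layer.
- **How it helps:** By keeping the scale of each token's activations consistent from layer to layer, LayerNorm makes the model more robust to variations in input distributions. This is often described as reducing internal covariate shift.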

---

### **5. Label Smoothing**
- **What it is:** During training, instead of using one-hot encoded labels (where the correct label has a probability of 1), a small probability is assigned to all incorrect labels.

- **How it helps:** Reduces the model's confidence in predictions, preventing it from becoming overly confident and overfitting.
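
A minimal plain-Python sketch of the smoothed target distribution follows. It assumes the common convention in which a total mass `epsilon` is spread uniformly over all classes, including the correct one; this appears to be the convention behind the `label_smoothing` argument of PyTorch's `F.cross_entropy`. The function name `smooth_labels` is introduced here for illustration.

```python
def smooth_labels(target_index, num_classes, epsilon=0.1):
    """Return a smoothed target distribution as a list of probabilities."""
    uniform = epsilon / num_classes
    probs = [uniform] * num_classes
    probs[target_index] += 1.0 - epsilon   # correct class keeps most of the mass
    return probs

smoothed = smooth_labels(target_index=2, num_classes=4, epsilon=0.1)
print(smoothed)   # roughly 0.025 for wrong classes and 0.925 for the correct one
```

The loss is then the cross-entropy between the model's predicted distribution and this smoothed target instead of the one-hot vector.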

---

### **6. Early Stopping**
- **What it is:** Stops training when the model's performance on validation data stops improving.
- **How it helps:** Prevents overfitting to the training data by halting training before the model starts to memorize the data.
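
The stopping rule is usually implemented with a "patience" counter. The sketch below is one simple variant in plain Python, run on a made-up list of validation losses; `patience` and `min_delta` are assumed hyperparameter names, not something defined elsewhere in this notebook.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss          # new best: reset the counter
            bad_epochs = 0
        else:
            bad_epochs += 1      # no improvement this epoch
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1   # never triggered: ran to the end

stop = early_stopping_epoch([0.90, 0.70, 0.65, 0.66, 0.67, 0.60], patience=2)
print(stop)   # prints: 4
```

In this example training stops at epoch 4 and never reaches the later improvement to 0.60, which shows the trade-off behind the choice of `patience`. In practice the weights from the best epoch are usually saved and restored.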

---

### **7. Data Augmentation**
- **What it is:** Modifies the training data to simulate diverse scenarios, such as adding noise, shuffling, or masking tokens (e.g., in BERT).
- **In Transformers:** Techniques like token masking, token swapping, or span masking are used to encourage the model to generalize better.
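
As an illustration of token masking, the sketch below replaces each token with a mask symbol with a fixed probability. It is a simplified version: BERT's actual procedure also sometimes substitutes a random token or leaves the selected token unchanged. The function name `mask_tokens` and the example sentence are made up for this sketch.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Independently replace each token with mask_token with probability mask_prob."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < mask_prob else tok for tok in tokens]

tokens = "the encoder refines the representation layer by layer".split()
masked = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
```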

---

### **8. Gradient Clipping**
- **What it is:** Restricts the maximum norm of the gradients during backpropagation.
- **How it helps:** Prevents exploding gradients, especially in attention-heavy architectures where gradients can become very large.
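
In PyTorch this is typically done with `torch.nn.utils.clip_grad_norm_`, called after `backward()` and before the optimizer step. The sketch below uses a small stand-in model and deliberately large inputs to provoke large gradients; the model, data, and `max_norm` value are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(16, 4)                 # stand-in for a larger network
x = torch.randn(8, 16) * 100.0           # large inputs to provoke large gradients
loss = model(x).pow(2).mean()
loss.backward()

max_norm = 1.0
# Rescales all gradients in place; returns the total norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
print(f"before: {norm_before.item():.1f}, after: {norm_after.item():.4f}")
```

The gradients keep their direction; only their overall length is scaled down to at most `max_norm`.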

---

### **9. Knowledge Distillation**
- **What it is:** A pre-trained "teacher" model guides a smaller "student" model by transferring knowledge.
- **How it helps:** Enables the student model to generalize better by mimicking the teacher's predictions.
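
One common way to "mimic the teacher's predictions" is to minimize the KL divergence between temperature-softened teacher and student distributions. The sketch below shows that loss under stated assumptions: the `temperature` parameter, the scaling by its square, and the function name `distillation_loss` are choices made for this example, and the logits are random placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kl * temperature ** 2   # keeps gradient scale comparable across temperatures

torch.manual_seed(0)
teacher_logits = torch.randn(4, 10)   # placeholder teacher outputs
student_logits = torch.randn(4, 10)   # placeholder student outputs

loss = distillation_loss(student_logits, teacher_logits)
loss_same = distillation_loss(teacher_logits, teacher_logits)
print(loss.item(), loss_same.item())
```

The loss is zero (up to rounding) when the student reproduces the teacher exactly. In practice it is usually combined with the ordinary cross-entropy on the true labels.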

---

### **10. Regularization in Residual Connections**
- Transformers rely heavily on residual connections. Applying dropout to each sub-layer's output before it is added to the residual, as `EncoderLayer` does with `dropout1` and `dropout2`, regularizes these paths.

---

### **Conclusion**
Regularization in Transformers combines multiple techniques, such as dropout, weight decay, label smoothing, and data augmentation, to combat overfitting and stabilize training. These methods are essential for scaling Transformers to large datasets and complex tasks while maintaining their generalization ability.