# **Dropout in Transformers and Deep Learning**

- `Dropout` is a `regularization technique` where a fraction of `neurons` (activations) are randomly “dropped” (set to zero) during training.

- In Transformers and other deep learning models, dropout helps prevent `overfitting`, improves generalization, and stabilizes training.

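- At inference, dropout is disabled and the full network is used. With the inverted-dropout scaling described below, the rescaling already happens during training, so no extra weight scaling is needed at inference.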


### **Intuition**

- Deep networks often `co-adapt`: neurons come to rely on the presence of specific other neurons instead of learning features that are useful on their own.

- Dropout forces the network to `not depend on any single neuron`, making it more `robust`.

- Think of it as a `“neural committee”`: each forward pass trains a slightly different subnet, and at inference the full network approximates the average prediction of that ensemble.
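
The committee picture can be checked numerically in the simplest case, a single linear layer fed by dropout: averaging the outputs of many randomly thinned passes should land close to the one deterministic pass used at inference. This is only a sketch. The layer sizes, sample count, and seed are arbitrary assumptions, and for nonlinear networks the match is approximate rather than exact.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy layer: dropout followed by a linear map.
layer = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(16, 4))
x = torch.randn(1, 16)

with torch.no_grad():
    # Training mode: every row of the batch gets its own random mask,
    # so each row is the output of a different "sub-network".
    layer.train()
    num_samples = 50_000
    subnet_outputs = layer(x.repeat(num_samples, 1))
    ensemble_mean = subnet_outputs.mean(dim=0)

    # Inference mode: dropout is the identity, one deterministic pass.
    layer.eval()
    inference_output = layer(x).squeeze(0)

max_gap = (ensemble_mean - inference_output).abs().max().item()
print(f"Largest gap between ensemble mean and inference output: {max_gap:.4f}")
```

The gap shrinks as `num_samples` grows, which is the sense in which the single inference pass stands in for the whole committee.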

### **Mathematical Representation**

Let $h = f(x, W)$ be the output of a hidden layer.
Dropout introduces a random binary mask $m \sim \mathrm{Bernoulli}(p)$, sampled independently for each element of $h$, where $p$ is the keep probability (one minus the dropout rate).

$$h' = \frac{m \odot h}{p}$$


Where:

- $\odot$ = elementwise multiplication

- $p$ = probability of keeping a neuron (e.g., $p=0.9$ for 10% dropout)

- Scaling by $1/p$ keeps the expectation unchanged: since $E[m] = p$, we get $E[h'] = h$
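
The formula above can be written out by hand in a few lines. The function name `inverted_dropout` and the constant test vector are assumptions made for illustration; in practice `nn.Dropout` does this for you.

```python
import torch

torch.manual_seed(0)

def inverted_dropout(h: torch.Tensor, keep_prob: float) -> torch.Tensor:
    """Apply h' = (m * h) / p with m ~ Bernoulli(p), p = keep probability."""
    mask = torch.bernoulli(torch.full_like(h, keep_prob))
    return mask * h / keep_prob

# Constant activations make the effect on the mean easy to read.
h = torch.full((100_000,), 2.0)
h_prime = inverted_dropout(h, keep_prob=0.9)

kept_fraction = (h_prime != 0).float().mean().item()
print(f"kept fraction: {kept_fraction:.3f}")          # close to the keep probability
print(f"mean of h':    {h_prime.mean().item():.3f}")  # close to the original value 2.0
```

Each surviving element becomes $2.0 / 0.9$, and the zeros pull the average back down to roughly the original value.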

### **Dropout in Transformers**

Transformers use dropout in several places:

**1. Attention Weights**

- After computing attention scores (softmax), dropout is applied to randomly drop attention links.

$\mathrm{Attention}(Q,K,V) = \mathrm{Dropout}(\mathrm{Softmax}(QK^T / \sqrt{d_k})) V$


**2. Feedforward Layers**

- Dropout is applied after non-linear activations to prevent overfitting.


**3. Residual Connections**

- Dropout is applied to each sublayer's output before it is added to the residual stream, so the skip path itself is never dropped.


**4. Embedding Layers**

- Sometimes applied to word/token embeddings for additional regularization.
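
Two of these placements, attention weights and embeddings, are easy to miss because they are usually hidden inside library modules. The sketch below spells them out. `attention_with_dropout` and `EmbeddingWithDropout` are hypothetical helpers written for this note, and the tensor sizes are arbitrary.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_with_dropout(q, k, v, dropout_p=0.1, training=True):
    """Scaled dot-product attention with dropout on the softmax weights."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)
    # Dropped links are zeroed and survivors rescaled, so during training
    # a row of weights no longer sums to exactly 1.
    weights = F.dropout(weights, p=dropout_p, training=training)
    return weights @ v, weights

class EmbeddingWithDropout(nn.Module):
    """Token embedding followed by dropout (hypothetical helper)."""
    def __init__(self, vocab_size, embed_dim, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, token_ids):
        return self.dropout(self.embedding(token_ids))

torch.manual_seed(0)
q = k = v = torch.randn(2, 5, 8)  # (batch_size, seq_length, d_k)
out, weights = attention_with_dropout(q, k, v, training=False)
print(out.shape, weights.shape)

embed = EmbeddingWithDropout(vocab_size=100, embed_dim=8)
tokens = torch.randint(0, 100, (2, 5))
embedded = embed(tokens)
print(embedded.shape)
```

With `training=False` the attention rows are ordinary softmax distributions; with `training=True` some links are removed for that pass only.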

### **Key Properties**


- `Regularization:` prevents overfitting on small/medium datasets.


- `Stochastic Training:` injects noise into the activations, making the model robust to perturbations of its hidden features.


- `Improved Generalization:` encourages distributed representations.


- `Scalable:` effective in both small MLPs and very deep models like Transformers.




### **Use Cases**

- `Deep Learning:` CNNs, RNNs, MLPs to combat overfitting.


- `Transformers:` GPT, BERT, Vision Transformers (ViT) use dropout for regularization.


- `Large-Scale Models:` often combined with other regularizers (weight decay, label smoothing).

- **Basic Dropout in PyTorch**

In [3]:
import torch
import torch.nn as nn

x = torch.randn(5, 10)
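dropout = nn.Dropout(p=0.3)  # p is the drop probability: ~30% of entries are zeroed

dropout.train()
print("Training:", dropout(x))  # Some entries are zeroed; survivors are scaled by 1/(1 - 0.3)

dropout.eval()
print("Inference:", dropout(x))  # No dropout applied: the input is returned unchanged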

Training: tensor([[ 1.1765, -0.6996,  1.4068,  1.8871, -2.0013,  0.2409, -1.1512, -0.0000,
         -0.5173, -0.8496],
        [-0.0000,  1.2326,  1.2722, -2.1774, -0.9650,  3.0767,  0.0000, -0.0000,
          0.7171, -1.8683],
        [ 0.0000,  0.7751, -0.0000,  0.5976, -0.0000,  0.4695, -0.0000, -0.0000,
         -1.4724,  0.2547],
        [ 0.0000, -0.1214, -0.3874, -1.2169,  0.0000,  0.7657,  0.5046,  0.0000,
         -0.0315, -2.2355],
        [ 0.3853, -1.8736,  0.0000, -1.1200, -0.0000, -0.0423,  0.0000, -0.3152,
         -0.0000, -0.1550]])
Inference: tensor([[ 0.8236, -0.4897,  0.9848,  1.3209, -1.4009,  0.1687, -0.8058, -0.3217,
         -0.3621, -0.5947],
        [-1.2429,  0.8628,  0.8905, -1.5242, -0.6755,  2.1537,  1.8196, -0.1857,
          0.5020, -1.3078],
        [ 0.6190,  0.5426, -0.2903,  0.4183, -0.5684,  0.3287, -0.5043, -1.4682,
         -1.0307,  0.1783],
        [ 0.7345, -0.0850, -0.2712, -0.8519,  2.1344,  0.5360,  0.3533,  0.7857,
         -0.0221, -1.5649

In [4]:
x

tensor([[ 0.8236, -0.4897,  0.9848,  1.3209, -1.4009,  0.1687, -0.8058, -0.3217,
         -0.3621, -0.5947],
        [-1.2429,  0.8628,  0.8905, -1.5242, -0.6755,  2.1537,  1.8196, -0.1857,
          0.5020, -1.3078],
        [ 0.6190,  0.5426, -0.2903,  0.4183, -0.5684,  0.3287, -0.5043, -1.4682,
         -1.0307,  0.1783],
        [ 0.7345, -0.0850, -0.2712, -0.8519,  2.1344,  0.5360,  0.3533,  0.7857,
         -0.0221, -1.5649],
        [ 0.2697, -1.3115,  0.6802, -0.7840, -0.0035, -0.0296,  0.2051, -0.2207,
         -1.8340, -0.1085]])
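
One detail worth checking against the outputs above: the `p` passed to `nn.Dropout` is the drop probability, i.e. one minus the keep probability used in the formula earlier. The surviving training values are therefore the inputs divided by `1 - p` (for instance `0.8236 / 0.7 ≈ 1.1765` in the first entry). A small check, with an arbitrary seed and shape as assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop_prob = 0.3
dropout = nn.Dropout(p=drop_prob)  # p is the probability of zeroing an element
x = torch.randn(5, 10)

dropout.train()
y_train = dropout(x)
kept = y_train != 0
# Surviving entries are scaled by 1 / (1 - drop_prob), i.e. 1 / keep probability.
scaled_ok = torch.allclose(y_train[kept], x[kept] / (1 - drop_prob))

dropout.eval()
y_eval = dropout(x)
identity_ok = torch.equal(y_eval, x)  # eval mode returns the input unchanged

print(f"Survivors scaled by 1/(1-p): {scaled_ok}")
print(f"Eval mode is the identity:   {identity_ok}")
```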

- **Dropout in a Transformer Block**

In [17]:
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Dropout(dropout), # Dropout in feedforward
            nn.Linear(ff_dim, embed_dim),
        )
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        
    def forward(self, x:torch.Tensor):
        # Self-attention + dropout + residual
        attn_out, _ = self.attn(x, x, x)
        x = x + self.dropout1(attn_out)
        x = self.norm1(x)
        
        # Feedforward + dropout + residual
        ff_out = self.ff(x)
        x = x + self.dropout2(ff_out)
        x = self.norm2(x)
        
        return x

In [20]:
# Example 1: Basic Usage
def basic_usage_example():
    # Hyperparameters
    batch_size = 4
    seq_length = 10
    embed_dim = 512
    num_heads = 8
    ff_dim = 2048
    
    # create transformer block
    transformer_block = TransformerBlock(embed_dim, num_heads, ff_dim)
    
    # Create sample input (batch_size, seq_length, embed_dim)
    x = torch.randn(batch_size, seq_length, embed_dim)
    
    # Forward pass
    output = transformer_block(x)
    
    print(f"Input shape: {x.shape}")
    print(f"Output shape: {output.shape}")
    print(f"Input and output shapes match: {x.shape == output.shape}")
    
    return output

print("===Basic Usage ===")
output1 = basic_usage_example()

===Basic Usage ===
Input shape: torch.Size([4, 10, 512])
Output shape: torch.Size([4, 10, 512])
Input and output shapes match: True


In [23]:
output1

tensor([[[-0.8910, -0.5180, -0.8523,  ...,  0.2387,  0.5434,  1.1064],
         [ 0.3783, -1.9540,  0.4625,  ..., -0.5516, -1.7609, -0.1673],
         [-2.9832,  0.4322,  1.1139,  ...,  0.0755,  1.3698, -0.6721],
         ...,
         [ 0.3106,  0.2099, -0.4223,  ..., -0.5154,  0.2615,  0.5391],
         [ 0.9403,  1.2124, -1.3579,  ..., -1.0018,  0.0729,  0.2915],
         [-0.6229, -0.4428, -0.7226,  ...,  0.3517, -0.7761,  0.2363]],

        [[ 0.1852,  1.2087,  0.0623,  ...,  0.0879, -0.4299, -0.6647],
         [ 0.0890, -1.5246, -0.2792,  ...,  0.8312,  0.2728, -0.8954],
         [-1.5291,  0.4701, -1.2672,  ...,  0.3608,  0.0942,  0.4916],
         ...,
         [ 0.2830, -1.4619, -1.5875,  ..., -1.0349, -1.3666, -1.3967],
         [ 0.4464, -0.4948, -0.3675,  ...,  0.0912, -0.9250,  1.1598],
         [-0.7500,  0.1630,  0.8844,  ..., -0.7052,  0.5838,  1.4363]],

        [[-1.0349, -1.9986,  0.0330,  ..., -1.1214, -0.0614, -0.7261],
         [ 2.0492,  0.3298, -1.0137,  ..., -0

In [21]:
# Example 2: Stacking Multiple Transformer Blocks
def multi_block_example():
    embed_dim = 256
    num_heads = 4
    ff_dim = 1024
    num_blocks = 3
    
    # Create a small transformer with multiple blocks
    transformer_blocks = nn.Sequential(
        *[TransformerBlock(embed_dim, num_heads, ff_dim) for _ in range(num_blocks)]
    )
    
    # Sample input
    x = torch.randn(2, 8, embed_dim) # (batch_size, seq_length, embed_dim)
    
    # Process through all blocks
    output = transformer_blocks(x)
    
    print(f"Processed through {num_blocks} transformer blocks")
    print(f"Output shape: {output.shape}")
    
    return output

print("=== Example 2: Multiple Blocks ===")
output2 = multi_block_example()

=== Example 2: Multiple Blocks ===
Processed through 3 transformer blocks
Output shape: torch.Size([2, 8, 256])


In [22]:
output2

tensor([[[-0.7984,  0.3157,  1.9364,  ..., -0.3314, -0.5696, -0.9131],
         [-0.3177,  1.5079, -0.5233,  ..., -0.1358,  0.5519, -0.6968],
         [-0.5598, -1.8233,  0.7427,  ...,  0.5237,  0.5401, -1.5678],
         ...,
         [ 0.6121,  0.8328, -0.8884,  ...,  1.8705,  0.5036, -0.9058],
         [-0.6972,  1.1989,  0.5254,  ...,  0.7133, -0.5159, -1.3623],
         [-0.3520, -0.1714,  0.9600,  ..., -0.2660, -0.5049, -1.3907]],

        [[ 0.6494,  0.8556, -0.0379,  ...,  0.3544, -0.1506, -0.3147],
         [ 0.2344,  1.0797,  0.4952,  ...,  1.6897, -1.2753, -0.5030],
         [ 0.7656,  0.9892,  0.9193,  ..., -0.4112,  0.2634,  1.4715],
         ...,
         [-1.0516, -0.5733,  1.2113,  ...,  0.8083, -0.7357, -0.0677],
         [ 0.9197, -0.0543, -1.1508,  ...,  1.4371,  1.8451,  0.9517],
         [-0.4894, -0.2065,  1.1197,  ...,  0.1748,  0.3715, -0.6791]]],
       grad_fn=<NativeLayerNormBackward0>)

In [38]:
import torch.optim as optim

# Example 3: Training a simple model
class SimpleTransformerClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, ff_dim, num_classes, num_blocks=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.transformer_blocks = nn.Sequential(
            *[TransformerBlock(embed_dim, num_heads, ff_dim) for _ in range(num_blocks)]
        )
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor):
        x = self.embedding(x)  # (batch_size, seq_length, embed_dim)
        x = self.transformer_blocks(x)
        # Use the output of the first token for classification
        x = x[:, 0, :]
        return self.classifier(x)


def training_example():
    # Model parameters
    vocab_size = 1000
    embed_dim = 128
    num_heads = 4
    ff_dim = 512
    num_classes = 5
    seq_length = 20
    learning_rate = 1e-3
    
    # create model
    model = SimpleTransformerClassifier(vocab_size, 
                                        embed_dim, 
                                        num_heads, 
                                        ff_dim, 
                                        num_classes)
    
    # Sample data
    batch_size = 8
    input_seq = torch.randint(0, vocab_size, (batch_size, seq_length))
    labels = torch.randint(0, num_classes, (batch_size,))
    
    # Forward pass
    outputs = model(input_seq)
    print(f"Model output shape: {outputs.shape}")
    print(f"Labels shape: {labels.shape}")
    
    # Training setup (simplified)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    
    # Training step
    optimizer.zero_grad()
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    
    print(f"Training loss: {loss.item():.4f}")
    
    return model, loss
    

print("\n=== Example 3: Training Example ===")
model, loss = training_example()


=== Example 3: Training Example ===
Model output shape: torch.Size([8, 5])
Labels shape: torch.Size([8])
Training loss: 1.7553


In [39]:
print(model)

SimpleTransformerClassifier(
  (embedding): Embedding(1000, 128)
  (transformer_blocks): Sequential(
    (0): TransformerBlock(
      (attn): MultiheadAttention(
        (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
      )
      (ff): Sequential(
        (0): Linear(in_features=128, out_features=512, bias=True)
        (1): ReLU()
        (2): Dropout(p=0.1, inplace=False)
        (3): Linear(in_features=512, out_features=128, bias=True)
      )
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
      (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
    )
    (1): TransformerBlock(
      (attn): MultiheadAttention(
        (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
      )
      (ff): Sequential(
        (0): Linear(in_features=128, out_features=512, bias=True)
        

In [40]:
# Masked Self-Attention (for language modeling)
class MaskedTransformerBlock(nn.Module):
    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, 
                                          num_heads, 
                                          dropout=dropout, 
                                          batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(ff_dim, embed_dim),
        )
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)
        
    def forward(self, x: torch.Tensor, mask: torch.Tensor = None):
        # Masked self-attention
        attn_out, _ = self.attn(x, x, x, attn_mask = mask)
        x = x + self.dropout1(attn_out) # dropout + residual
        x = self.norm1(x)
        
        # Feedforward
        ff_out = self.ff(x)
        x = x + self.dropout2(ff_out)  # dropout + residual
        x = self.norm2(x)
        
        return x


def masked_attention_example():
    embed_dim = 64
    num_heads = 2
    ff_dim = 128
    seq_length = 6
    
    # Create a causal mask for autoregressive generation:
    # -inf above the diagonal blocks attention to future positions
    causal_mask = torch.triu(torch.ones(seq_length, seq_length) * float('-inf'), diagonal=1)
    
    block = MaskedTransformerBlock(embed_dim, num_heads, ff_dim)
    x = torch.randn(2, seq_length, 64) # (batch_size, seq_length, embed_dim)
    
    output = block(x, causal_mask)
    print(f"Causal mask: {causal_mask}")
    print(f"Masked output shape: {output.shape}")
    
    return output


print("\n=== Example 4: Masked Attention ===")
output4 = masked_attention_example()


=== Example 4: Masked Attention ===
Causal mask: tensor([[0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0.]])
Masked output shape: torch.Size([2, 6, 64])
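
A quick way to convince yourself that a causal mask works is to perturb the last token and confirm that earlier positions do not change. The sketch below uses the boolean form of the mask, where `True` marks positions that may not be attended to, as an alternative to the `-inf` float mask above. It runs in eval mode so dropout does not interfere; the sizes and the size of the perturbation are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_length, embed_dim = 6, 64
attn = nn.MultiheadAttention(embed_dim, num_heads=2, dropout=0.1, batch_first=True)
attn.eval()  # disable attention dropout so the comparison is deterministic

# Boolean causal mask: True above the diagonal means "not allowed to attend".
causal_mask = torch.triu(
    torch.ones(seq_length, seq_length, dtype=torch.bool), diagonal=1
)

x = torch.randn(2, seq_length, embed_dim)
x_perturbed = x.clone()
x_perturbed[:, -1, :] += 1.0  # change only the final token

with torch.no_grad():
    out, _ = attn(x, x, x, attn_mask=causal_mask)
    out_perturbed, _ = attn(x_perturbed, x_perturbed, x_perturbed, attn_mask=causal_mask)

earlier_unchanged = torch.allclose(out[:, :-1], out_perturbed[:, :-1], atol=1e-5)
last_changed = not torch.allclose(out[:, -1], out_perturbed[:, -1], atol=1e-5)

print(f"Earlier positions unaffected by the future token: {earlier_unchanged}")
print(f"Last position reflects the change:                {last_changed}")
```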


**Key Points to Note:**

- `Input/Output Shapes:` The `TransformerBlock` maintains the same shape `(batch_size, seq_length, embed_dim)`.

- `Residual Connections:` The `x + dropout(layer(x))` pattern preserves information flow.

- `Layer Normalization:` Applied after the residual connection (post-norm architecture).

- `Batch First:` Note the use of `batch_first=True` in `MultiheadAttention` for consistent tensor shapes.

- `Flexibility:` This block can be stacked to create deeper transformer models.
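
The residual, dropout, and post-norm steps listed above repeat for every sublayer, so they can be factored into one small wrapper. `PostNormResidual` is a hypothetical helper written for this note, not part of PyTorch, and the dropout rate of 0.5 is exaggerated so the train/eval difference is obvious.

```python
import torch
import torch.nn as nn

class PostNormResidual(nn.Module):
    """Hypothetical helper computing LayerNorm(x + Dropout(sublayer(x)))."""
    def __init__(self, embed_dim, sublayer, dropout=0.1):
        super().__init__()
        self.sublayer = sublayer
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        return self.norm(x + self.dropout(self.sublayer(x)))

torch.manual_seed(0)
embed_dim = 32
feedforward = nn.Sequential(
    nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim)
)
layer = PostNormResidual(embed_dim, feedforward, dropout=0.5)
x = torch.randn(4, 10, embed_dim)

with torch.no_grad():
    layer.train()
    train_differs = not torch.allclose(layer(x), layer(x))  # new mask each pass

    layer.eval()
    eval_matches = torch.allclose(layer(x), layer(x))  # dropout disabled
    out = layer(x)

print(out.shape, train_differs, eval_matches)
```

Remember to call `model.eval()` before validation or inference; otherwise every dropout layer in the stack keeps injecting noise.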

- **Dropout in Hugging Face Transformer**

In [42]:
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print(model.config.hidden_dropout_prob)  # Dropout rate (e.g., 0.1)
print(model.config.attention_probs_dropout_prob)  # Dropout in attention

0.1
0.1
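
These two config fields can also be set when building a model from scratch. The sketch below assumes the `transformers` package is installed and uses a deliberately tiny, randomly initialised `BertConfig` (the sizes are arbitrary) so that nothing is downloaded. It then collects the rates of the `nn.Dropout` modules the model actually contains.

```python
import torch.nn as nn
from transformers import BertConfig, BertModel

# Tiny, randomly initialised model: illustrative sizes, no pretrained weights.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    hidden_dropout_prob=0.2,           # dropout on embeddings and sublayer outputs
    attention_probs_dropout_prob=0.2,  # dropout on attention probabilities
)
model = BertModel(config)

# Distinct dropout rates found among the model's nn.Dropout modules.
dropout_rates = sorted({m.p for m in model.modules() if isinstance(m, nn.Dropout)})
print(dropout_rates)
```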


**Practical Notes**

- Typical dropout rates in Transformers: `0.1 – 0.3`.

- For very large models `(GPT-3, LLaMA)` trained on huge datasets, dropout is sometimes `reduced` or `removed` since data scale itself regularizes training.

- Dropout works best when combined with other techniques: `weight decay`, `early stopping`, `label smoothing`.
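
As a minimal sketch of combining these techniques in PyTorch, the step below pairs a dropout layer with weight decay (via `AdamW`) and label smoothing (via `CrossEntropyLoss`). The toy model, data, and hyperparameter values are assumptions chosen only for illustration; early stopping is omitted because it needs a validation loop.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy classifier with dropout between the two linear layers.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.1),
    nn.Linear(64, 5),
)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # weight decay

# Random stand-in data: 8 examples, 20 features, 5 classes.
inputs = torch.randn(8, 20)
labels = torch.randint(0, 5, (8,))

model.train()  # enable dropout for the training step
optimizer.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()
optimizer.step()

print(f"Training loss: {loss.item():.4f}")
```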

**Common Use Cases:**

- `Text Classification:` Stack multiple blocks + classification head

- `Language Modeling:` Use with causal masking

- `Sequence-to-Sequence:` Use as `encoder/decoder` blocks

- `Feature Extraction:` Extract representations for downstream tasks

### **Summary**

- `Dropout` = stochastic regularization to improve `generalization`.

- In Transformers, it is `woven into attention`, `feedforward`, `residual`, and `embedding layers`.

- Crucial for smaller models and moderate datasets; less critical, and sometimes reduced or removed, in very large LLMs trained on huge datasets.