# **Gaussian Error Linear Unit (GELU) Activation**

- The `Gaussian Error Linear Unit (GELU)` is a smooth, non-monotonic activation function introduced by Hendrycks & Gimpel (2016). It is defined as $x \cdot \Phi(x)$, where $\Phi(x)$ is the cumulative distribution function (CDF) of the standard normal distribution.

- Unlike `ReLU`, which gates inputs by sign (keeping positives and zeroing negatives), `GELU` weights each input by $\Phi(x)$, the probability that a standard normal random variable is less than or equal to $x$.

- It is widely used in `Transformers (BERT, GPT, ViT, etc.)` because it pairs `ReLU`-like suppression of negative inputs with a smooth, sigmoid-shaped gate.


### **Intuition**

- Instead of `hard-thresholding` like ReLU ($x \mapsto \max(0, x)$), GELU makes a `soft decision` by scaling each input by $\Phi(x)$:

    - Large negative values are suppressed toward zero, since $\Phi(x) \approx 0$.
    - Large positive values pass through nearly unchanged, since $\Phi(x) \approx 1$.
    - Near-zero values are roughly halved, since $\Phi(0) = 0.5$.

- This makes GELU `smooth` and `differentiable everywhere`, which suits gradient-based optimization in deep models.
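
The three regimes above can be made concrete with a small pure-Python sketch. It relies only on the standard library's `math.erf`; the helper names `phi_cdf` and `gelu` and the sample inputs are chosen here for illustration.

```python
import math

def phi_cdf(x: float) -> float:
    """Standard normal CDF, written via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x: float) -> float:
    """Exact GELU: the input scaled by its gate value phi_cdf(x)."""
    return x * phi_cdf(x)

# Illustrative inputs: far negative, just below/above zero, far positive.
for x in (-3.0, -0.1, 0.1, 3.0):
    print(f"x={x:+.1f}  gate={phi_cdf(x):.4f}  gelu={gelu(x):+.4f}")
```

The gate is close to 0 at $x = -3$, close to 0.5 around zero, and close to 1 at $x = 3$, which is the soft decision described above.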

### **Mathematical Representation**

**Exact Definition**

$\mathrm{GELU}(x) = x \cdot \Phi(x)$


Where:

- $\Phi(x) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$

- $\mathrm{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2}\,dt$ is the Gaussian error function.


So:

- $\mathrm{GELU}(x) = \frac{1}{2}x \left[1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$
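
As a quick numeric check of the formula above, here is a worked evaluation at $x = 1$ and $x = -1$. The rounded value $\mathrm{erf}(1/\sqrt{2}) \approx 0.6827$ is supplied here for illustration:

$\Phi(1) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{1}{\sqrt{2}}\right)\right] \approx \frac{1}{2}\left[1 + 0.6827\right] \approx 0.8413$

$\mathrm{GELU}(1) = 1 \cdot \Phi(1) \approx 0.8413$

$\mathrm{GELU}(-1) = -1 \cdot \Phi(-1) = -1 \cdot \left(1 - \Phi(1)\right) \approx -0.1587$

So an input of $1$ keeps about 84% of its magnitude, while an input of $-1$ is shrunk to about 16% of its magnitude and stays negative.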


**Approximation (fast version used in practice)**

$\mathrm{GELU}(x) \approx 0.5x \left[1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right)\right]$

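This avoids evaluating $\mathrm{erf}$ directly, which can be slower than $\tanh$ on some hardware and in some libraries. The two forms agree closely but not exactly.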
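
To get a feel for how close the approximation is, the sketch below compares the two formulas on a grid using only the Python standard library. The function names, the interval $[-5, 5]$, and the step of 0.01 are assumptions made for this illustration.

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU, using the constants from the formula above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Evaluate both forms on an illustrative grid over [-5, 5] with step 0.01.
grid = [i / 100.0 for i in range(-500, 501)]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"max |exact - tanh| on [-5, 5]: {max_err:.2e}")
```

On this grid the largest gap stays below $10^{-3}$, which is why the two forms are often treated as interchangeable. Outputs still differ in the fourth decimal place, so it is worth matching the variant a pretrained checkpoint was trained with.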

### **Key Properties**

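- `Smooth and differentiable everywhere` (unlike ReLU, which has a kink at $x = 0$).

- `Probabilistic interpretation:` each input is scaled by $\Phi(x)$, the probability that a standard normal variable falls at or below it.

- `Combines linear & non-linear behavior:` approaches the identity for large positive $x$ and approaches zero for large negative $x$, like ReLU.

- `Empirical success:` BERT, the GPT family, and Vision Transformers use GELU (exact or tanh-approximated) in their feedforward layers.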
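
The smoothness property can be checked numerically. Applying the product rule to $x \cdot \Phi(x)$ gives the derivative $\Phi(x) + x\,\varphi(x)$, where $\varphi$ is the standard normal density. The sketch below compares that expression with a central finite difference; the helper names and the step size `h` are assumptions made for this illustration.

```python
import math

def phi_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x: float) -> float:
    return x * phi_cdf(x)

def gelu_grad(x: float) -> float:
    """Analytic derivative from the product rule: Phi(x) + x * phi(x)."""
    return phi_cdf(x) + x * phi_pdf(x)

def central_diff(f, x: float, h: float = 1e-5) -> float:
    """Numerical derivative used only as a cross-check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  analytic={gelu_grad(x):+.6f}  numeric={central_diff(gelu, x):+.6f}")
```

The derivative is well defined at $x = 0$ (it equals 0.5), whereas ReLU has no derivative there. It is also slightly negative at $x = -1$, which reflects the non-monotonic dip of GELU for moderately negative inputs.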




### **Use Cases**

- `Transformers:` BERT, GPT-family, ViT.

- `NLP tasks:` feedforward sublayers of encoder and decoder blocks.

- `Vision:` ResMLPs, MLP-Mixers, ViT.

- `General deep learning:` MLPs where smooth activation helps.

**GELU from scratch**

```python
import math

import torch

def gelu(x):
    """GELU activation (tanh approximation)."""
    # sqrt(2 / pi) is a plain Python constant; no tensor is needed for it
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3))
    return 0.5 * x * (1.0 + torch.tanh(inner))

x = torch.linspace(-3, 3, 10)
gelu(x)
```

```
tensor([-0.0036, -0.0225, -0.0798, -0.1588, -0.1232,  0.2102,  0.8412,  1.5869,
         2.3108,  2.9964])
```

**GELU in PyTorch**

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.linspace(-3, 3, 10)

# Module API: nn.GELU() uses the exact erf-based form by default
gelu_layer = nn.GELU()
print("GELU:", gelu_layer(x))

# Functional API with the tanh approximation
print("Approx GELU:", F.gelu(x, approximate="tanh"))
```

```
GELU: tensor([-0.0040, -0.0229, -0.0797, -0.1587, -0.1231,  0.2102,  0.8413,  1.5870,
         2.3104,  2.9960])
Approx GELU: tensor([-0.0036, -0.0225, -0.0798, -0.1588, -0.1232,  0.2102,  0.8412,  1.5869,
         2.3108,  2.9964])
```


**GELU in a Transformer Feedforward Block**

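```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feedforward block: Linear -> GELU -> Dropout -> Linear."""

    def __init__(self, embed_dim, ff_dim, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(embed_dim, ff_dim)  # expand to the hidden width
        self.act = nn.GELU()
        self.fc2 = nn.Linear(ff_dim, embed_dim)  # project back to embed_dim
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.fc2(self.dropout(self.act(self.fc1(x))))
```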

**Hugging Face Transformer Config (BERT example)**

```python
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print("Activation function:", model.config.hidden_act)  # 'gelu'
```

```
Activation function: gelu
```


### **Visual Intuition**

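Compared to ReLU and Sigmoid:

- `ReLU:` sharp kink at 0; maps every negative input to exactly 0.

- `Sigmoid:` squashes every input into $(0, 1)$, so the scale of the input is lost.

- `GELU:` passes large positives almost unchanged, smoothly shrinks negatives toward 0, and dips slightly below 0 for moderately negative inputs (a minimum of about $-0.17$ near $x \approx -0.75$).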
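
In place of a plot, the comparison can be read off a small table. The sketch below evaluates the three activations side by side with the Python standard library; the function names and the sample inputs are chosen here for illustration.

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# Print one row per illustrative input so the three curves can be compared.
print(f"{'x':>6} {'ReLU':>8} {'Sigmoid':>8} {'GELU':>8}")
for x in (-4.0, -2.0, -0.5, 0.0, 0.5, 2.0, 4.0):
    print(f"{x:>6.1f} {relu(x):>8.4f} {sigmoid(x):>8.4f} {gelu(x):>8.4f}")
```

Reading down the columns, ReLU is exactly 0 for every negative input, the sigmoid never leaves $(0, 1)$, and GELU tracks ReLU at the extremes while taking small negative values in between.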

### **Summary**

`GELU` is a common default activation in Transformer models such as BERT, GPT, and ViT because it is `smooth`, has a `probabilistic` interpretation, and was reported by Hendrycks & Gimpel (2016) to match or outperform ReLU on their `NLP/vision` benchmarks. Its definition uses the Gaussian CDF, and a fast $\tanh$ approximation is widely used in practice.