***The Gaussian Error Linear Unit*** (GELU) activation function is used in BERT, RoBERTa, ALBERT, and other leading NLP models. Its design is motivated by combining properties of dropout, zoneout, and ReLUs.

# Introduction

GELU, introduced by Dan Hendrycks and Kevin Gimpel in their 2016 paper “Gaussian Error Linear Units (GELUs),” has become a common activation choice in Transformer-based networks. Unlike ReLU, which gates its input by sign, GELU weights its input by the cumulative distribution function (CDF) of the standard normal distribution. It is the input multiplied by the standard normal CDF at that point:

$$f(x) = x \cdot \text{CDF}(x) = x \cdot \frac{1}{2} \left(1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$$

where $\text{erf}$ is the error function.
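
As a minimal sketch of the definition above, the exact GELU can be evaluated with the error function from Python's standard library. The function name `gelu_exact` is an illustrative choice, not something defined in the paper.

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU: the input times the standard normal CDF at that input."""
    # Standard normal CDF written in terms of the error function.
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * cdf

# Evaluate at a few sample points.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"gelu({x:+.1f}) = {gelu_exact(x):+.6f}")
```

For large positive inputs the CDF is close to 1, so the output is close to the input; for large negative inputs the CDF is close to 0, so the output is close to 0.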

![image.png](attachment:image.png)

In its exact form, GELU requires evaluating the error function, which is more expensive than the simple threshold used by ReLU.

The original paper therefore also gives two approximations that avoid $\text{erf}$ and are cheaper to compute:

- With Tanh

$$f_{tanh}(x) = 0.5 \cdot x \cdot \left(1 + \text{tanh}\left(\sqrt{\frac{2}{\pi}} \cdot \left(x + 0.044715x^3\right)\right)\right)$$

Where:
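- The factor $0.5x$ scales the input, so the output approaches $x$ for large positive inputs and $0$ for large negative inputs.
- The $\text{tanh}$ term, shifted and rescaled to the range $(0, 1)$, stands in for the normal CDF and gives a smooth transition between those two regimes.
- The scaling factor $\sqrt{\frac{2}{\pi}}$ matches the slope of $\text{erf}(x/\sqrt{2})$ at the origin, and the cubic term $0.044715x^3$ is a fitted correction that improves the match away from zero.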
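
The tanh formula can be checked against the exact definition with a short, hedged sketch. The helper names and the evaluation grid on $[-5, 5]$ are assumptions made for illustration only.

```python
import math

SQRT_2_OVER_PI = math.sqrt(2.0 / math.pi)

def gelu_exact(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = SQRT_2_OVER_PI * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Illustrative grid: -5.00, -4.99, ..., 5.00
grid = [i / 100.0 for i in range(-500, 501)]

# Largest absolute gap between the approximation and the exact function.
max_err_tanh = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in grid)
print(f"max |tanh approx - exact| on [-5, 5]: {max_err_tanh:.2e}")
```

On this grid the two curves should be nearly indistinguishable, which is why the tanh form is often treated as a drop-in replacement.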

- With Sigmoid

$$f_{sigmoid}(x) = x \cdot \text{sigmoid}(1.702x)$$
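
A similar sketch covers the sigmoid variant, using the form $x \cdot \text{sigmoid}(1.702x)$ given in the Hendrycks and Gimpel paper. The helper names and the grid are illustrative assumptions.

```python
import math

def gelu_exact(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def gelu_sigmoid(x: float) -> float:
    """Sigmoid approximation of GELU: the logistic curve replaces the normal CDF."""
    return x * sigmoid(1.702 * x)

# Illustrative grid: -5.00, -4.99, ..., 5.00
grid = [i / 100.0 for i in range(-500, 501)]

max_err_sigmoid = max(abs(gelu_sigmoid(x) - gelu_exact(x)) for x in grid)
print(f"max |sigmoid approx - exact| on [-5, 5]: {max_err_sigmoid:.2e}")
```

The sigmoid form is cheaper still but noticeably coarser than the tanh form, so it trades accuracy for speed.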

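Advantages of GELU:
- Mitigates the dying-ReLU problem: the gradient is nonzero for most negative inputs, so fewer neurons become permanently inactive
- Smooth and differentiable across the full range of input values
- Non-monotonic behavior (a small dip for negative inputs) may help capture more complex patterns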
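
The gradient and non-monotonicity claims can be illustrated with a short sketch. It uses the analytic derivative of the exact GELU, $\text{CDF}(x) + x \cdot \text{PDF}(x)$, which follows from the product rule; the function names are illustrative.

```python
import math

def gelu(x: float) -> float:
    """Exact GELU via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x: float) -> float:
    """Derivative of exact GELU: CDF(x) + x * PDF(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

def relu_grad(x: float) -> float:
    """Derivative of ReLU (taken as 0 at x = 0)."""
    return 1.0 if x > 0 else 0.0

# Compare values and gradients at a few sample points.
for x in (-3.0, -1.0, -0.5, 0.0, 1.0):
    print(f"x={x:+.1f}  gelu={gelu(x):+.4f}  "
          f"gelu'={gelu_grad(x):+.4f}  relu'={relu_grad(x):.1f}")
```

In the printed table, `gelu(-1.0)` is lower than both `gelu(-3.0)` and `gelu(0.0)`, which is the non-monotonic dip, and the GELU gradient at `-1.0` is negative rather than exactly zero as it is for ReLU. The gradient does still shrink toward zero for very negative inputs.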

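Limitations of GELU:
- Higher computational cost than ReLU, since it evaluates $\text{erf}$, $\text{tanh}$, or a sigmoid instead of a simple threshold
- Performance gains can be task-specific
- Reduced interpretability due to the more complex functional form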