# Exploring Activation Functions: Understanding Gated Linear Units and Beyond

Activation functions shape the behavior and expressiveness of neural networks. Gated Linear Units (GLU), introduced by Dauphin et al. in 2016, compute the component-wise product of two linear projections of the input, one of which is first passed through a sigmoid. Many variants of GLU exist; they replace the sigmoid with a different nonlinear function, or even with a linear one.
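
For reference, the component-wise product described above can be written out explicitly. The notation follows the GEGLU formula used later in this notebook: $W$ and $V$ are weight matrices, $b$ and $c$ are biases, $\sigma$ is the logistic sigmoid, and $\otimes$ is the component-wise product. Because the product is commutative, which branch is labelled $W$ or $V$ is only a convention.

$$
\mathrm{GLU}(x, W, V, b, c) = \sigma(xW + b) \otimes (xV + c)
$$

Swapping $\sigma$ for another function gives the variants covered here:

$$
\begin{aligned}
\mathrm{Bilinear}(x, W, V, b, c) &= (xW + b) \otimes (xV + c) \\
\mathrm{GEGLU}(x, W, V, b, c) &= \mathrm{GELU}(xW + b) \otimes (xV + c) \\
\mathrm{SwiGLU}(x, W, V, b, c, \beta) &= \mathrm{Swish}_\beta(xW + b) \otimes (xV + c)
\end{aligned}
$$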

In this notebook, we implement these activation functions one by one. We look at how different GLU variants fare in the feedforward sublayers of the Transformer, the sequence-to-sequence architecture introduced by Vaswani et al. in 2017. The goal is to understand not only their performance but also the trade-offs that come with each choice.

As neural networks take on increasingly complex tasks, the choice of activation function and its implications matter more.

# Why It Matters

Activation functions serve as the nonlinear "gatekeepers" of neural networks, enabling models to capture complex patterns and relationships in data. By exploring different activation functions such as GLU and its variants, we gain a deeper understanding of how neural networks operate and how we can tailor them to specific tasks. This exploration not only fuels advancements in model performance but also contributes to the broader quest for unlocking the full potential of artificial intelligence.

# References
References are listed individually in each section.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# ReLU Activation Function (Rectified Linear Unit)
The Rectified Linear Unit (ReLU) is one of the most widely used activation functions in deep learning. It introduces non-linearity by outputting the input directly if it is positive and zero otherwise: f(x) = max(0, x).

* Simple and computationally efficient: ReLU replaces negative values with zeros, resulting in sparse activation patterns.
* Mitigates the vanishing gradient problem: the derivative is exactly 1 for positive inputs, so gradients are not shrunk as they pass back through active units, which often leads to faster convergence during training.
* Commonly used in various neural network architectures, including convolutional neural networks (CNNs) and fully connected networks.
* Despite its simplicity, ReLU has been used in models that achieve strong results in image recognition, natural language processing, and reinforcement learning.
* ReLU may suffer from the "dying ReLU" problem, where a neuron outputs zero for all inputs; because the gradient is also zero there, the neuron stops learning.
* Variants of ReLU, such as Leaky ReLU, Parametric ReLU (PReLU), and Exponential Linear Unit (ELU), have been proposed to address the limitations of traditional ReLU.

References:

In [None]:
class ReLU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # max(0, x), applied elementwise; equivalent to torch.relu(x)
        return torch.clamp(x, min=0)
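
A small sketch of the sparsity and gradient behaviour listed above. It uses PyTorch's built-in `torch.relu` rather than the class in this notebook so that the snippet runs on its own; the input values are arbitrary.

```python
import torch

# Inputs on both sides of zero; requires_grad lets us inspect d(relu)/dx.
x = torch.tensor([-2.0, -0.5, 0.0, 1.5], requires_grad=True)
y = torch.relu(x)

# Summing gives a scalar, so backward() yields the elementwise derivative.
y.sum().backward()

relu_out = y.detach().tolist()
relu_grad = x.grad.tolist()
print("output:  ", relu_out)
print("gradient:", relu_grad)
```

Negative inputs produce both a zero output and a zero gradient, which is the mechanism behind the "dying ReLU" problem.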

# SiLU Activation Function (Sigmoid-weighted Linear Unit)
The SiLU activation function, short for Sigmoid-weighted Linear Unit, is a smooth and non-monotonic activation function proposed as an alternative to traditional activation functions like ReLU and sigmoid. SiLU applies a sigmoid function to its input, effectively squashing it between 0 and 1, and then scales the input by this sigmoid output.

* SiLU is defined by the function f(x) = x * sigmoid(x), where sigmoid is the logistic sigmoid function.
* Behaves like ReLU far from zero (it approaches x for large positive inputs and 0 for large negative inputs) while remaining smooth like the sigmoid.
* SiLU is continuous and differentiable everywhere, facilitating better gradient flow during training and potentially accelerating convergence.
* Demonstrates effectiveness in various deep learning tasks, including image classification, object detection, and natural language processing.
* Unlike ReLU, SiLU has a nonzero gradient for negative inputs, so units are less prone to becoming permanently inactive, which can help generalization and performance.
* Implementation of SiLU is computationally efficient and supported in popular deep learning frameworks like PyTorch and TensorFlow.

References:

[SiLU](https://paperswithcode.com/method/silu)

In [None]:
class Silu(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x * torch.sigmoid(x)
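
The non-monotonic shape mentioned above can be checked numerically. This is a minimal sketch that evaluates the formula on an arbitrary grid and compares it with PyTorch's built-in `F.silu`; the grid range and resolution are assumptions chosen for illustration.

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-5.0, 5.0, steps=1001)
manual = x * torch.sigmoid(x)

# The built-in F.silu should agree with the explicit formula.
matches_builtin = torch.allclose(manual, F.silu(x), atol=1e-6)

# Non-monotonic: the curve dips below zero and then rises again.
min_value, min_index = manual.min(dim=0)
x_at_min = x[min_index].item()
min_value = min_value.item()
print(matches_builtin, round(x_at_min, 2), round(min_value, 3))
```

The minimum lies at a negative input, so small negative values are damped rather than clipped to exactly zero.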

# Swish Activation Function
The Swish activation function is a novel activation function proposed by researchers at Google in 2017. It is designed to combine the simplicity of ReLU with the smoothness of sigmoid-based activation functions.

* Swish is defined as f(x) = x * sigmoid(βx), where sigmoid is the logistic sigmoid function and β is either a constant or a trainable parameter. With β = 1, Swish is identical to SiLU.
* The function exhibits non-monotonicity and smoothness, allowing for more robust gradient flow during backpropagation compared to ReLU.
* Swish activation has been shown to improve model performance across various tasks, including image classification, natural language processing, and recommendation systems.
* It tends to produce more informative gradients than ReLU, potentially leading to faster convergence and better generalization.
* Swish is computationally efficient and can be easily implemented using standard neural network libraries.
* While Swish has demonstrated promising results, its effectiveness may vary depending on the specific architecture and dataset.

References:

[Swish](https://paperswithcode.com/method/swish)

In [None]:
class Swish(nn.Module):
    def __init__(self, beta=1.0):
        super().__init__()
        self.beta = beta  # fixed scalar; beta = 1.0 gives SiLU

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)
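
To see what β does, the sketch below evaluates Swish at three settings. The helper `swish` is a hypothetical standalone function written for this example (with a fixed rather than trainable β), and the values 0, 1, and 50 are arbitrary choices.

```python
import torch
import torch.nn.functional as F

def swish(x, beta=1.0):
    """Swish with a fixed (non-trainable) beta: x * sigmoid(beta * x)."""
    return x * torch.sigmoid(beta * x)

x = torch.linspace(-4.0, 4.0, steps=801)

# beta = 1 recovers SiLU.
same_as_silu = torch.allclose(swish(x, 1.0), F.silu(x), atol=1e-6)
# beta = 0 makes sigmoid(0) = 0.5, so the function is the linear map x / 2.
linear_at_zero = torch.allclose(swish(x, 0.0), 0.5 * x)
# A large beta sharpens the sigmoid into a step, so Swish approaches ReLU.
gap_to_relu = (swish(x, 50.0) - torch.relu(x)).abs().max().item()
print(same_as_silu, linear_at_zero, gap_to_relu)
```

In other words, β interpolates between a linear function and ReLU, with SiLU as the β = 1 case.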

# GLU Activation Function (Gated Linear Unit)
The Gated Linear Unit (GLU) computes two linear projections of its input and multiplies one of them, component-wise, by the sigmoid of the other. It was introduced as part of the gated convolutional network architecture by Dauphin et al. in 2016.

* The sigmoid branch acts as a gate with values between 0 and 1, controlling how much of the linear branch passes through.
* It facilitates the selective filtering of information, allowing the model to focus on relevant features while suppressing noise.
* GLU has been shown to be effective in various deep learning architectures, particularly in tasks involving sequential data processing, such as natural language processing and time-series prediction.
* Because one branch of the product is linear, gradients can flow through it without being scaled by an activation derivative, which helps mitigate the vanishing gradient problem and leads to more stable training dynamics.
* While GLU offers benefits in certain scenarios, its performance may vary depending on the specific task and dataset.
* Implementation of GLU is straightforward and can be easily integrated into existing neural network architectures.

![image.png](attachment:image.png)

References:

[GLU: Gated Linear Unit implementation](https://medium.com/deeplearningmadeeasy/glu-gated-linear-unit-21e71cd52081)

In [None]:
class GLU(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.projection = nn.Linear(input_size, input_size)  # W
        self.gate = nn.Linear(input_size, input_size)  # V

    def forward(self, x):
        # Linear branch: xW + b (nn.Linear includes a bias by default)
        projection_output = self.projection(x)
        # Gate branch: sigmoid(xV + c)
        gate_output = torch.sigmoid(self.gate(x))
        # GLU(x, W, V, b, c) = (xW + b) ⊗ sigmoid(xV + c)
        return projection_output * gate_output
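
As a sanity check on the gating formula, the sketch below compares the explicit product with PyTorch's built-in `F.glu`. The layer names, batch size, and feature size are assumptions for illustration; the snippet defines its own layers so it runs without the class above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
projection = nn.Linear(d, d)  # plays the role of W, b
gate = nn.Linear(d, d)        # plays the role of V, c
x = torch.randn(4, d)

# Explicit form: linear branch times sigmoid of the gate branch.
manual = projection(x) * torch.sigmoid(gate(x))

# F.glu splits its input in half along `dim` and computes a * sigmoid(b),
# so concatenating the two branches reproduces the same result.
builtin = F.glu(torch.cat([projection(x), gate(x)], dim=-1), dim=-1)

glu_match = torch.allclose(manual, builtin, atol=1e-6)
print(manual.shape, glu_match)
```

Note that `F.glu` itself has no weights: it expects the two projections to have been computed already and halves the size of the chosen dimension.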

# GELU Activation Function (Gaussian Error Linear Unit)
The GELU activation function, short for Gaussian Error Linear Unit, weights its input by the standard Gaussian cumulative distribution function: GELU(x) = x * Φ(x). Introduced by Hendrycks and Gimpel in 2016, it can be viewed as a smooth alternative to ReLU.

* Smooth, ReLU-like shape, aiding in better gradient flow during training.
* Defined as x * Φ(x), where Φ is the standard normal CDF; in practice it is often approximated with a tanh-based or sigmoid-based expression.
* Demonstrates effectiveness in various neural network architectures.
* Useful for tasks where smoothness in the activation function is desired.
* Implementation is computationally efficient and widely supported in deep learning frameworks.
* The GELU activation function provides a valuable alternative to ReLU and other activation functions, offering improved performance in certain scenarios while maintaining simplicity and ease of use.

References:

[Gaussian Error Linear Units](https://paperswithcode.com/method/gelu)

In [None]:
import math

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        # Tanh approximation of GELU(x) = x * Phi(x).
        # Alternatives: the cruder x * torch.sigmoid(1.702 * x), or F.gelu(x),
        # which evaluates the exact erf-based form by default.
        inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3))
        return 0.5 * x * (1.0 + torch.tanh(inner))
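
The two approximations mentioned in the code comment can be compared against the exact definition. This is a rough sketch on an arbitrary input grid; the exact form is written with `torch.erf`, and the error figures it prints depend on the grid and on floating-point precision.

```python
import math
import torch
import torch.nn.functional as F

x = torch.linspace(-6.0, 6.0, steps=1201)

# Exact GELU: x * Phi(x), with Phi written via the error function.
exact = 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))
# Tanh approximation.
tanh_approx = 0.5 * x * (1.0 + torch.tanh(
    math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
# Sigmoid approximation.
sigmoid_approx = x * torch.sigmoid(1.702 * x)

# F.gelu with default arguments evaluates the exact form.
builtin_is_exact = torch.allclose(F.gelu(x), exact, atol=1e-5)
tanh_err = (tanh_approx - exact).abs().max().item()
sigmoid_err = (sigmoid_approx - exact).abs().max().item()
print(builtin_is_exact, tanh_err, sigmoid_err)
```

The tanh form tracks the exact function more closely than the sigmoid form, which is why it is the more common approximation.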

# SwiGLU Activation Function (Swish-Gated Linear Unit)
The SwiGLU activation function, or Swish-Gated Linear Unit, is a GLU variant that uses Swish in place of the sigmoid.

* Combines Swish with a gating mechanism: SwiGLU(x, W, V, b, c) = Swish(xW + b) ⊗ (xV + c), so one linear projection is passed through Swish and then multiplied component-wise by a second linear projection.
* Smooth and non-monotonic: Similar to Swish, SwiGLU is smooth and non-monotonic, facilitating gradient flow during training.
* Can outperform ReLU, Swish, and GELU: In the Transformer feedforward experiments reported in "GLU Variants Improve Transformer", gated variants such as SwiGLU and GeGLU reached better perplexity than ungated ReLU, GELU, and Swish feedforward layers.
* Effectiveness varies based on architecture and dataset: The effectiveness of SwiGLU can depend on the specific neural network architecture and dataset, so it may not always be the best choice for every application.

![image.png](attachment:image.png)

References:

[Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm](https://www.youtube.com/watch?v=oM4VmoabDAI&ab_channel=UmarJamil)

[LLaMA](https://vinija.ai/models/LLaMA/)

[SwiGLU](https://paperswithcode.com/method/swiglu)

In [None]:
class SwiGLU(nn.Module):
    def __init__(self, in_size):
        super().__init__()
        self.projection = nn.Linear(in_size, in_size)
        self.gate = nn.Linear(in_size, in_size)

    def forward(self, x):
        # Swish(xW + b); F.silu is Swish with beta = 1
        projection_output = F.silu(self.projection(x))
        # xV + c
        gate = self.gate(x)
        # SwiGLU(x, W, V, b, c) = Swish(xW + b) ⊗ (xV + c)
        return gate * projection_output
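
In a Transformer feedforward sublayer, the gated unit is followed by an output projection. The sketch below follows the bias-free form described in "GLU Variants Improve Transformer", where the hidden width is reduced by a factor of 2/3 so that the three weight matrices hold as many parameters as the usual two. The class name `SwiGLUFeedForward`, the helper `count_params`, and the sizes are hypothetical choices for illustration, not the layout of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Hypothetical gated feedforward sublayer: (Swish(xW) * xV) W2, no biases."""

    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w = nn.Linear(d_model, d_hidden, bias=False)   # Swish branch
        self.v = nn.Linear(d_model, d_hidden, bias=False)   # linear branch
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w(x)) * self.v(x))

def count_params(module):
    return sum(p.numel() for p in module.parameters())

d_model, d_ff = 48, 192
# Ungated baseline with two weight matrices: d_model -> d_ff -> d_model.
baseline = nn.Sequential(
    nn.Linear(d_model, d_ff, bias=False),
    nn.ReLU(),
    nn.Linear(d_ff, d_model, bias=False),
)
# Three weight matrices instead of two, so shrink the hidden width by 2/3.
gated = SwiGLUFeedForward(d_model, d_hidden=2 * d_ff // 3)

out = gated(torch.randn(2, 5, d_model))
print(out.shape, count_params(baseline), count_params(gated))
```

Unlike the `SwiGLU` class above, this sketch changes the feature size inside the layer and maps back to `d_model` at the end, which is what lets it replace a standard feedforward block.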

# GeGLU Activation Function (GELU-Gated Linear Unit)
The GeGLU activation function is a GLU variant that uses GELU in place of the sigmoid. It multiplies the GELU activation of one linear projection of the input, component-wise, by a second linear projection.

* GeGLU is defined by the function GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c), where GELU represents the Gaussian Error Linear Unit activation function.
* The activation function incorporates a gating mechanism that controls the flow of information through the network, allowing for selective feature extraction and noise suppression.
* GeGLU has demonstrated effectiveness in various deep learning tasks, including image classification, language modeling, and sequence generation.
* By leveraging both the smoothness of GELU and the gating mechanism, GeGLU enhances gradient flow during training and promotes faster convergence.
* Implementation of GeGLU is feasible using standard neural network libraries, and its computational efficiency enables seamless integration into existing architectures.
* While GeGLU offers promising results, its performance may vary depending on the specific characteristics of the dataset and the complexity of the task at hand.

![image.png](attachment:image.png)

References:

[Gemma: Open Models Based on Gemini Research and Technology](https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf)

[GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)

[GeGLU](https://paperswithcode.com/method/geglu)

In [None]:
class GeGLU(nn.Module):
    def __init__(self, input_size):
        super().__init__()
        self.projection = nn.Linear(input_size, input_size)  # W
        self.gate = nn.Linear(input_size, input_size)  # V

    def forward(self, x):
        # GELU(xW + b); F.gelu evaluates the exact erf-based form by default
        gelu_activation = F.gelu(self.projection(x))
        # xV + c
        gate_output = self.gate(x)
        # GeGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c)
        return gelu_activation * gate_output
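
Since GLU, GeGLU, and SwiGLU differ only in the function applied to one branch, they can be expressed with a single module that takes the activation as an argument. The class `GatedUnit` below is a hypothetical helper written for this sketch; it is one possible way to organize the variants, not something taken from the referenced papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedUnit(nn.Module):
    """Hypothetical generic gated unit: activation(xW + b) * (xV + c)."""

    def __init__(self, input_size, activation):
        super().__init__()
        self.projection = nn.Linear(input_size, input_size)  # W, b
        self.gate = nn.Linear(input_size, input_size)        # V, c
        self.activation = activation

    def forward(self, x):
        return self.activation(self.projection(x)) * self.gate(x)

variants = {
    "GLU": torch.sigmoid,
    "Bilinear": lambda t: t,  # linear function in place of the sigmoid
    "GeGLU": F.gelu,
    "SwiGLU": F.silu,
}

torch.manual_seed(0)
x = torch.randn(3, 16)
shapes = {}
for name, activation in variants.items():
    unit = GatedUnit(16, activation)
    shapes[name] = tuple(unit(x).shape)
print(shapes)

# Spot check: the GeGLU variant matches the formula written out by hand.
geglu = GatedUnit(16, F.gelu)
expected = F.gelu(geglu.projection(x)) * geglu.gate(x)
geglu_ok = torch.allclose(geglu(x), expected)
```

Each variant has the same parameter count and output shape, so they can be swapped in an existing architecture without other changes.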