Let’s break down the computation in a feedforward neural network (FNN) using a standard activation function and then see how that differs when using a Gated Linear Unit (GLU).

---

## 1. Standard FNN Calculation with an Activation Function

In a typical FNN layer, you perform a linear transformation followed by a non-linear activation. Here’s the step-by-step calculation:

1. **Linear Transformation:**

   Given an input vector \(\mathbf{x} \in \mathbb{R}^n\), a weight matrix \(\mathbf{W} \in \mathbb{R}^{m \times n}\), and a bias vector \(\mathbf{b} \in \mathbb{R}^{m}\), the linear part computes:
   \[
   \mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}.
   \]

2. **Non-linear Activation:**

   An activation function \(f\) (such as ReLU, tanh, or GELU) is then applied element-wise:
   \[
   \mathbf{a} = f(\mathbf{z}).
   \]
   For instance, with ReLU:
   \[
   \text{ReLU}(z) = \max(0, z).
   \]

This two-step process allows the network to model complex, non-linear relationships.
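
The two steps above can be sketched in a few lines of NumPy. This is only an illustration: the function name `dense_relu` and the numeric values are made up for the example, with \(n = 3\) inputs and \(m = 2\) outputs.

```python
import numpy as np

def dense_relu(x, W, b):
    """One feedforward layer: a = ReLU(W x + b)."""
    z = W @ x + b               # linear transformation, shape (m,)
    return np.maximum(0.0, z)   # element-wise ReLU

# Illustrative values (not from any real model): n = 3, m = 2.
x = np.array([1.0, -2.0, 0.5])
W = np.array([[0.5, -0.25, 1.0],
              [-1.0, 0.5, 2.0]])
b = np.array([0.1, -0.2])

a = dense_relu(x, W, b)
print(a)  # first unit is positive and passes through, second is clipped to 0
```

Here the pre-activations are \(1.6\) and \(-1.2\), so ReLU keeps the first and zeroes the second.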

---

## 2. FNN Calculation with a Gated Linear Unit (GLU)

The Gated Linear Unit introduces a gating mechanism that modulates the output of a linear transformation. Instead of applying a single activation function directly to the linear output, GLU splits the linear output into two parts and uses one part to control the flow of the other.

### **Calculation Steps for GLU**

1. **Linear Transformation with Doubling of Channels:**

   Instead of computing one set of outputs, the network computes a combined output that is twice as large. Suppose the input is \(\mathbf{x} \in \mathbb{R}^n\) and you want an output of dimension \(m\). Using the same column-vector convention as in Section 1, you compute:
   \[
   \mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b},
   \]
   where now \(\mathbf{W} \in \mathbb{R}^{2m \times n}\) and \(\mathbf{b} \in \mathbb{R}^{2m}\).

2. **Splitting the Output:**

   The result \(\mathbf{z}\) is split into two vectors:
   \[
   \mathbf{z} = \left[ \mathbf{z}_A \; \mathbf{z}_B \right],
   \]
   where \(\mathbf{z}_A, \mathbf{z}_B \in \mathbb{R}^{m}\).

3. **Gating Mechanism:**

   A gating function, typically the sigmoid \(\sigma\), is applied to \(\mathbf{z}_B\) to produce a gate vector:
   \[
   \mathbf{g} = \sigma(\mathbf{z}_B),
   \]
   where
   \[
   \sigma(z) = \frac{1}{1 + e^{-z}}.
   \]

4. **Element-wise Multiplication (Gated Output):**

   The final output of the GLU is the element-wise multiplication of \(\mathbf{z}_A\) and the gate \(\mathbf{g}\):
   \[
   \text{GLU}(\mathbf{x}) = \mathbf{z}_A \odot \mathbf{g}.
   \]

This mechanism lets the network learn to “gate” or control how much of the linear transformation’s output should pass through, providing a more flexible and dynamic non-linearity.
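
The four steps can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: it assumes the column-vector convention \(\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}\) with a weight matrix of shape \((2m, n)\), and the names `glu_layer` and `sigmoid` as well as the numbers are chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glu_layer(x, W, b):
    """GLU layer: split z = W x + b into halves and gate the first with the second."""
    z = W @ x + b                # step 1: linear map to 2m features
    z_a, z_b = np.split(z, 2)    # step 2: two halves of size m
    gate = sigmoid(z_b)          # step 3: gate values in (0, 1)
    return z_a * gate            # step 4: element-wise product

# Illustrative values: n = 2 inputs, m = 2 outputs, so W has 2m = 4 rows.
x = np.array([1.0, 2.0])
W = np.array([[1.0, 0.0],    # rows 0-1 produce z_A
              [0.0, 1.0],
              [0.0, 0.0],    # rows 2-3 produce z_B
              [10.0, 0.0]])
b = np.zeros(4)

out = glu_layer(x, W, b)
print(out)
```

With these values \(\mathbf{z}_A = (1, 2)\) and \(\mathbf{z}_B = (0, 10)\): the first gate is exactly \(0.5\), so half of the first feature passes, while the second gate is close to 1 and lets the second feature through almost unchanged.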

---

## 3. Comparison and Pros & Cons

### **Standard Activation in FNNs**

- **Pros:**
  - **Simplicity:** The computation is straightforward: a linear transformation followed by an element-wise activation.
  - **Efficiency:** Fewer parameters and lower computational overhead.
  - **Effective in Many Settings:** Works well for many tasks when the network is not extremely deep or when simple non-linearity is sufficient.

- **Cons:**
  - **Limited Expressiveness:** A single fixed non-linearity might not capture complex interactions as flexibly as gating mechanisms.

### **GLU-Based FNNs**

- **Pros:**
  - **Adaptive Gating:** The gate can learn to control the flow of information, potentially filtering out noise or emphasizing important features.
  - **Enhanced Expressiveness:** By modulating the linear output, GLU can model more complex interactions and dependencies.
  - **Empirical Success:** GLUs have been used effectively in several architectures (e.g., gated convolutional networks in language modeling) to improve performance.

- **Cons:**
  - **Increased Parameter Count:** For the same output dimension \(m\), the linear transformation must produce \(2m\) features, so the layer has \(2mn + 2m\) parameters instead of \(mn + m\), which is double.
  - **Higher Computational Cost:** The matrix multiplication is twice as large, and the sigmoid and element-wise product add a small amount of work on top.
  - **Complexity in Tuning:** The dynamics of the gating function might require careful tuning in certain architectures or applications.
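
To make the parameter overhead concrete, the counts can be worked out directly. The helper names and the layer sizes below are hypothetical, chosen only for illustration.

```python
def dense_params(n, m):
    """Weights plus biases for a standard layer mapping R^n -> R^m."""
    return m * n + m

def glu_params(n, m):
    """A GLU layer with output size m needs a linear map R^n -> R^(2m)."""
    return 2 * m * n + 2 * m

n, m = 512, 2048  # hypothetical input and output sizes
standard = dense_params(n, m)
gated = glu_params(n, m)
print(standard, gated, gated / standard)
```

For equal output size the GLU layer has exactly twice the parameters of the standard layer. If a fixed parameter budget matters, one option is to shrink \(m\) in the gated layer to compensate.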

---

## Summary Equation Comparison

- **Standard FNN Layer:**
  \[
  \mathbf{a} = f\bigl(\mathbf{W}\mathbf{x} + \mathbf{b}\bigr).
  \]

- **GLU FNN Layer:**
  \[
  \begin{aligned}
  \mathbf{z} &= \mathbf{W}\mathbf{x} + \mathbf{b} \quad \text{(with } \mathbf{W} \in \mathbb{R}^{2m \times n} \text{)}, \\
  \mathbf{z} &= [\mathbf{z}_A, \mathbf{z}_B], \\
  \text{GLU}(\mathbf{x}) &= \mathbf{z}_A \odot \sigma(\mathbf{z}_B).
  \end{aligned}
  \]

In summary, while standard activation functions provide a simple and effective non-linearity for feedforward neural networks, GLUs offer an additional level of control through a gating mechanism, which can enhance model expressiveness at the cost of increased complexity and computational overhead.

The activation \(f\) in a standard layer can be chosen from many functions. Below are nine commonly used activation functions along with their mathematical formulations, advantages, and disadvantages.

---

### 1. Sigmoid

- **Math:**  
  \[
  \sigma(x) = \frac{1}{1 + e^{-x}}
  \]

- **Pros:**  
  - Smooth and differentiable.  
  - Outputs in the range (0, 1), which is useful for probabilistic interpretations (e.g., binary classification).

- **Cons:**  
  - Prone to saturation for very high or low values, leading to vanishing gradients.  
  - Not zero-centered, which can slow down convergence in some networks.
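
The saturation behaviour can be checked numerically. Below is a small sketch in plain Python that uses the identity \(\sigma'(x) = \sigma(x)\,(1 - \sigma(x))\); the function names are illustrative.

```python
import math

def sigmoid(x):
    # Fine for moderate x; very negative x would overflow math.exp(-x).
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Gradient at the origin versus in the saturated region.
grads = {x: sigmoid_grad(x) for x in (0.0, 5.0, 10.0)}
for x, g in grads.items():
    print(f"x = {x:5.1f}  gradient = {g:.6f}")
```

The gradient peaks at \(0.25\) for \(x = 0\) and shrinks rapidly as \(|x|\) grows. Multiplying many such small factors across layers is what produces vanishing gradients.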

---

### 2. Tanh (Hyperbolic Tangent)

- **Math:**  
  \[
  \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
  \]

- **Pros:**  
  - Zero-centered output, which can aid optimization.  
  - Steeper gradient than sigmoid around 0 (slope 1 at the origin, versus 0.25 for sigmoid).

- **Cons:**  
  - Also saturates for large positive or negative values, causing vanishing gradients in deep networks.
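
As a quick check of these properties, the sketch below (plain Python, illustrative names) verifies that tanh is a rescaled and shifted sigmoid, \(\tanh(x) = 2\sigma(2x) - 1\), and evaluates its gradient \(1 - \tanh^2(x)\) at the origin and in the saturated region.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

x = 0.7  # arbitrary test point
identity_gap = abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0))

grad_at_zero = tanh_grad(0.0)   # slope at the origin
grad_at_five = tanh_grad(5.0)   # slope in the saturated region
print(identity_gap, grad_at_zero, grad_at_five)
```

The identity holds up to floating-point error, the slope at 0 is 1, and the slope at \(x = 5\) is already close to zero.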

---

### 3. ReLU (Rectified Linear Unit)

- **Math:**  
  \[
  f(x) = \max(0, x)
  \]

- **Pros:**  
  - Simple and computationally efficient.  
  - Does not saturate in the positive region, which helps mitigate vanishing gradients.

- **Cons:**  
  - Can lead to “dying ReLU”: a neuron whose pre-activation is negative for every input outputs zero, receives zero gradient, and so stops learning.  
  - Not differentiable at \(x = 0\) (though this is typically not a problem in practice).
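
A tiny sketch of the dying-ReLU situation, in plain Python. The weight, bias, and inputs are hypothetical, and setting the gradient to 0 at \(x = 0\) is one common convention rather than the only choice.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Convention assumed here: the gradient at exactly x = 0 is taken to be 0.
    return 1.0 if x > 0 else 0.0

# A single neuron with a large negative bias (hypothetical values).
w, b = 1.0, -5.0
inputs = [0.5, 1.0, 2.0]

outputs = [relu(w * x + b) for x in inputs]
grads = [relu_grad(w * x + b) for x in inputs]
print(outputs, grads)
```

Every pre-activation is negative, so both the outputs and the gradients are zero for all inputs. Gradient descent then has no signal with which to move `w` or `b` back into the active region.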

---

### 4. Leaky ReLU

- **Math:**  
  \[
  f(x) = \begin{cases}
  x, & x \ge 0 \\
  \alpha x, & x < 0
  \end{cases}
  \]
  where \(\alpha\) is a small constant (e.g., 0.01).

- **Pros:**  
  - Mitigates the dying ReLU problem by allowing a small, non-zero gradient for \(x < 0\).

- **Cons:**  
  - Introduces an extra hyperparameter (\(\alpha\)) that needs tuning.  
  - A single fixed slope for negative inputs is not necessarily the best choice for every layer or task.
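
For comparison with plain ReLU, here is a minimal sketch of Leaky ReLU and its gradient in plain Python. The function names and the test input are illustrative; \(\alpha = 0.01\) is the example value mentioned above.

```python
def leaky_relu(x, alpha=0.01):
    return x if x >= 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x >= 0 else alpha

pre_activation = -2.0  # a negative input, where plain ReLU would give zero gradient
value = leaky_relu(pre_activation)
grad = leaky_relu_grad(pre_activation)
print(value, grad)
```

The output is small but non-zero, and the gradient equals \(\alpha\) rather than 0, so the neuron can still be updated.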

---

### 5. PReLU (Parametric ReLU)

- **Math:**  
  \[
  f(x) = \begin{cases}
  x, & x \ge 0 \\
  \alpha x, & x < 0
  \end{cases}
  \]
  Here, \(\alpha\) is learned during training rather than fixed.

- **Pros:**  
  - Adaptively learns the slope for negative inputs, potentially improving performance.  
  - Can alleviate the dying ReLU problem more flexibly than Leaky ReLU.

- **Cons:**  
  - Increases the number of parameters, potentially leading to overfitting if not regularized properly.  
  - Adds extra computational complexity.
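
What "learned" means here can be shown with a single gradient-descent step on \(\alpha\). This is a sketch under simple assumptions: one scalar input, a squared-error loss, and hypothetical values for the starting slope and learning rate. It uses the fact that \(\partial f / \partial \alpha = x\) for \(x < 0\) and 0 otherwise.

```python
def prelu(x, alpha):
    """PReLU forward pass for a scalar input."""
    return x if x >= 0 else alpha * x

def prelu_grad_alpha(x):
    """Derivative of prelu(x, alpha) with respect to alpha."""
    return 0.0 if x >= 0 else x

# One step on the loss 0.5 * (y - target)^2 (all values hypothetical).
x, target = -2.0, -1.0
alpha, lr = 0.25, 0.1

y = prelu(x, alpha)                          # current output
grad = (y - target) * prelu_grad_alpha(x)    # chain rule through the loss
alpha_new = alpha - lr * grad                # updated slope
y_new = prelu(x, alpha_new)
print(alpha_new, y_new)
```

After the step the slope has increased and the output has moved closer to the target. In a real network the same update is applied alongside the weight updates, typically with one \(\alpha\) per channel or per layer.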

---

### 6. ELU (Exponential Linear Unit)

- **Math:**  
  \[
  f(x) = \begin{cases}
  x, & x \ge 0 \\
  \alpha (e^x - 1), & x < 0
  \end{cases}
  \]
  where \(\alpha\) is a hyperparameter (commonly set to 1).

- **Pros:**  
  - Provides a smooth transition for negative inputs, which can improve learning.  
  - Has been reported to converge faster and reach higher classification accuracy than ReLU on some tasks.

- **Cons:**  
  - Computationally more expensive than ReLU due to the exponential function.  
  - The hyperparameter \(\alpha\) must be set appropriately.
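
A minimal sketch of ELU and its gradient in plain Python, with illustrative function names and \(\alpha = 1\) as the default. `math.expm1` is used because it computes \(e^x - 1\) accurately for \(x\) near 0.

```python
import math

def elu(x, alpha=1.0):
    return x if x >= 0 else alpha * math.expm1(x)

def elu_grad(x, alpha=1.0):
    return 1.0 if x >= 0 else alpha * math.exp(x)

print(elu(2.0))         # positive inputs pass through unchanged
print(elu(-50.0))       # very negative inputs saturate near -alpha
print(elu_grad(-1e-9))  # slope just left of 0 is close to 1 when alpha = 1
```

With \(\alpha = 1\) the slope approaches 1 from the left at \(x = 0\), which matches the slope on the positive side. That is the "smooth transition" referred to above.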

---

### 7. SELU (Scaled Exponential Linear Unit)

- **Math:**  
  \[
  f(x) = \lambda \begin{cases}
  x, & x \ge 0 \\
  \alpha (e^x - 1), & x < 0
  \end{cases}
  \]
  with fixed parameters, typically \(\alpha \approx 1.67326\) and \(\lambda \approx 1.0507\).

- **Pros:**  
  - Designed for self-normalizing neural networks, helping to keep activations within a desired range automatically.  
  - Can improve convergence speed in deep networks.

- **Cons:**  
  - The self-normalizing property depends on specific architecture choices, such as LeCun-normal weight initialization and alpha dropout in place of standard dropout.  
  - Not as broadly adopted, so fewer “out-of-the-box” guidelines exist compared to ReLU.
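
The self-normalizing idea can be illustrated empirically: if the inputs are standard normal, the SELU outputs should again have mean close to 0 and variance close to 1. The sketch below uses plain Python with a fixed random seed. The constants are the full-precision versions of the values quoted above, and the sample size is an arbitrary choice.

```python
import math
import random

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    return LAMBDA * (x if x >= 0 else ALPHA * math.expm1(x))

random.seed(0)  # fixed seed so the run is repeatable
samples = [selu(random.gauss(0.0, 1.0)) for _ in range(200_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean = {mean:.3f}, variance = {var:.3f}")
```

This only checks a single activation applied to ideal inputs. In a full network the property also relies on how the weights are initialized.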

---

### 8. GELU (Gaussian Error Linear Unit)

- **Math:**  
  A common approximation is:
  \[
  \text{GELU}(x) \approx 0.5x \left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715x^3\right)\right]\right)
  \]
  The exact formulation is:
  \[
  \text{GELU}(x) = x \cdot \Phi(x)
  \]
  where \(\Phi(x)\) is the cumulative distribution function of the standard normal distribution.

- **Pros:**  
  - Provides a smooth, non-linear activation that often improves performance, especially in transformer-based architectures.  
  - Weights each input by its value rather than hard-gating by sign as ReLU does, so small negative inputs are attenuated instead of zeroed.

- **Cons:**  
  - More computationally expensive than simpler activations like ReLU.  
  - The formulation is more complex and may require careful implementation for efficiency.
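
Both formulations can be compared numerically. The sketch below is plain Python with illustrative function names. It writes the exact form using \(\Phi(x) = \tfrac{1}{2}\bigl(1 + \operatorname{erf}(x/\sqrt{2})\bigr)\) and measures the largest gap to the tanh approximation on a grid over \([-5, 5]\).

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi expressed through the error function
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

xs = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"largest absolute difference on [-5, 5]: {max_gap:.2e}")
```

The two versions stay close over this range, which is why the approximation is often treated as interchangeable with the exact form when `erf` is slow or unavailable.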

---

### 9. Swish

- **Math:**  
  \[
  f(x) = x \cdot \sigma(\beta x)
  \]
  where \(\sigma(x)\) is the sigmoid function and \(\beta\) is either a constant (often set to 1) or a learnable parameter.

- **Pros:**  
  - Smooth and non-monotonic, which can lead to better performance in some deep networks.  
  - Has been shown empirically to outperform ReLU on certain tasks.

- **Cons:**  
  - Computationally more complex due to the additional sigmoid computation.  
  - With \(\beta = 1\) it is the same function as SiLU, which major frameworks provide; the learnable-\(\beta\) variant is less commonly built in and may need a custom implementation.
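
The non-monotonic shape is easy to see numerically. Below is a minimal sketch in plain Python with an illustrative function name and \(\beta = 1\) as the default. The test points are arbitrary, except that \(x \approx -1.278\) is near the minimum of the \(\beta = 1\) curve.

```python
import math

def swish(x, beta=1.0):
    # x * sigmoid(beta * x), written as a single division
    return x / (1.0 + math.exp(-beta * x))

far_left = swish(-5.0)     # small negative value
near_min = swish(-1.278)   # close to the lowest point of the curve
at_zero = swish(0.0)
print(far_left, near_min, at_zero)
```

Moving from \(x = -5\) towards 0, the output first decreases and then rises again to 0, so the function is not monotonic on the negative side. ReLU and ELU, by contrast, never decrease.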

---

Each of these activations has its own niche. Choosing the right one often depends on your network’s depth, the nature of your data, and empirical performance on your specific task.