
### Activation Functions in Neural Networks



#### 1. Sigmoid Activation Function
- **Formula:**
  \[
  \sigma(x) = \frac{1}{1 + e^{-x}}
  \]
- **Range:** (0, 1)
- **Properties:**
  - **Non-linear:** This allows the model to capture non-linear relationships.
  - **Differentiable:** Can be used with backpropagation.
  - **Smooth Gradient:** The function is smooth everywhere, so small changes in the input produce small changes in the output rather than abrupt jumps.
  - **Vanishing Gradient Problem:** For very high or very low input values, the gradient is almost zero, which can slow down or stall training.
- **Graph:**

![Sigmoid Graph](https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/1200px-Logistic-curve.svg.png)
*(Source: Wikipedia)*
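
The sigmoid formula above can be sketched in plain Python. This is a minimal illustration, not a library implementation; the function name `sigmoid` and the two-branch layout are assumptions made here to avoid overflow in `math.exp` for large negative inputs.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, e^(-x) can overflow, so use the equivalent form
    # e^x / (1 + e^x), where e^x is at most 1.
    z = math.exp(x)
    return z / (1.0 + z)


for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"sigmoid({x:6.1f}) = {sigmoid(x):.6f}")
```

The printed values stay between 0 and 1 and flatten out at both ends, which is the saturation behind the vanishing gradient problem.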



#### 2. ReLU (Rectified Linear Unit) Activation Function
- **Formula:**
  \[
  \text{ReLU}(x) = \max(0, x)
  \]
- **Range:** [0, ∞)
- **Properties:**
  - **Non-linear:** Allows the model to capture complex relationships.
  - **Differentiable:** The function is differentiable at all points except at \(x = 0\). For practical purposes, we can define the gradient as 0 or 1 at this point.
  - **Computational Efficiency:** Simpler to compute than sigmoid and tanh.
  - **Sparsity:** Outputs exactly 0 for every negative input, so only a subset of neurons is active for a given input, which gives sparse activations.
  - **Dying ReLU Problem:** Neurons can become inactive and stop learning if they consistently output 0.
- **Graph:**

![ReLU Graph](https://upload.wikimedia.org/wikipedia/commons/thumb/6/6c/Rectifier_and_softplus_functions.svg/1200px-Rectifier_and_softplus_functions.svg.png)
*(Source: Wikipedia)*
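
A short sketch of ReLU and the sparsity it produces. The list `pre_activations` holds made-up values chosen only for illustration.

```python
def relu(x: float) -> float:
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


# Hypothetical pre-activation values for five neurons.
pre_activations = [-2.0, -0.5, 0.0, 0.3, 1.7]
activations = [relu(v) for v in pre_activations]

# Fraction of neurons whose output is exactly zero.
zero_fraction = sum(a == 0.0 for a in activations) / len(activations)

print(activations)    # [0.0, 0.0, 0.0, 0.3, 1.7]
print(zero_fraction)  # 0.6
```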





### Comparison and Applications

- **Usage in Layers:**
  - **Sigmoid:** Often used in the output layer for binary classification problems.
  - **ReLU:** Commonly used in hidden layers of deep neural networks.

- **Advantages and Disadvantages:**
  - **Sigmoid:**
    - **Advantages:** Outputs values between 0 and 1, which can be interpreted as probabilities.
    - **Disadvantages:** Susceptible to the vanishing gradient problem.
  - **ReLU:**
    - **Advantages:** Reduces likelihood of vanishing gradients, promotes sparsity, and is computationally efficient.
    - **Disadvantages:** Risk of dying ReLUs.
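
To make the vanishing gradient comparison concrete, the sketch below multiplies the local activation derivatives along a chain of ten layers. It is a deliberate simplification: weights are ignored, every layer is assumed to see the same pre-activation value, and the helper name `chained_gradient` is hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x: float) -> float:
    # Convention used here: gradient 0 at x == 0.
    return 1.0 if x > 0 else 0.0


def chained_gradient(local_grad, pre_activations):
    """Product of local derivatives along a chain of layers (weights ignored)."""
    return math.prod(local_grad(z) for z in pre_activations)


pre_activations = [0.5] * 10  # hypothetical: ten layers, same pre-activation

print(chained_gradient(sigmoid_grad, pre_activations))  # shrinks geometrically
print(chained_gradient(relu_grad, pre_activations))     # stays at 1.0 while units are active
```

Because the sigmoid derivative never exceeds 0.25, the sigmoid product is bounded by 0.25 to the power of the depth, while active ReLU units pass the gradient through unchanged. A single inactive ReLU unit in the chain zeroes the product, which is the dying ReLU risk.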

### Mathematical Insight

#### Derivative of Sigmoid
- **Formula:**
  \[
  \sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))
  \]
- **Explanation:** The gradient peaks at \(\sigma'(0) = 0.25\) and decays toward zero as \(|x|\) grows, which leads to the vanishing gradient problem for large-magnitude inputs.
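
The derivative formula can be checked numerically. The sketch below compares the closed form with a central finite difference; `numeric_grad` and its step size `h` are illustrative choices, not part of the original notes.

```python
import math


def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_grad(x: float) -> float:
    """Closed form: sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numeric_grad(f, x: float, h: float = 1e-5) -> float:
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  closed form={sigmoid_grad(x):.8f}  "
          f"finite difference={numeric_grad(sigmoid, x):.8f}")
```

The two columns should agree closely, and both shrink rapidly as `x` moves away from 0.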

#### Derivative of ReLU
- **Formula:**
  \[
  \text{ReLU}'(x) = \begin{cases}
  0 & \text{if } x < 0 \\
  1 & \text{if } x > 0
  \end{cases}
  \]
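- **Explanation:** The gradient is exactly 1 for \(x > 0\) and exactly 0 for \(x < 0\), so it is cheap to compute and does not shrink as it passes backward through active units. The derivative is undefined at \(x = 0\); implementations pick 0 or 1 there by convention.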
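
A minimal sketch of the piecewise derivative. The parameter `grad_at_zero` is a hypothetical knob added here to make the convention at \(x = 0\) explicit.

```python
def relu_grad(x: float, grad_at_zero: float = 0.0) -> float:
    """Derivative of ReLU; the value at x == 0 is a convention, not a true derivative."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return grad_at_zero


for x in (-3.0, 0.0, 3.0):
    print(x, relu_grad(x))
```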

### Practical Considerations

- **Choice of Activation Function:**
  - Depends on the specific problem and architecture of the neural network.
  - ReLU is generally preferred for hidden layers due to its efficiency and ability to mitigate the vanishing gradient problem.
  - Sigmoid is suitable for output layers in binary classification tasks due to its probabilistic interpretation.
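
The pairing described above (ReLU in the hidden layer, sigmoid at the output) might look like the following forward pass for a tiny binary classifier. All weights, biases, and the function name `forward` are made up for illustration; a real model would learn its parameters during training.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer with ReLU, then a single sigmoid output unit."""
    hidden = [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    logit = sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    return sigmoid(logit)  # interpreted as the probability of the positive class


# Hypothetical parameters for a 2-input, 2-hidden-unit network.
probability = forward(
    inputs=[1.0, -2.0],
    hidden_weights=[[0.5, -0.25], [-1.0, 0.5]],
    hidden_biases=[0.0, 0.1],
    output_weights=[2.0, -3.0],
    output_bias=-2.0,
)
print(probability)  # 0.5
```

With these particular numbers the second hidden unit is inactive and the output logit is 0, so the predicted probability is 0.5.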



#### 3. Tanh (Hyperbolic Tangent) Activation Function
- **Formula:**
  \[
  \text{tanh}(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
  \]
- **Range:** (-1, 1)
- **Properties:**
  - **Non-linear:** Captures complex relationships.
  - **Zero-centered:** Outputs are centered around zero, which tends to make optimization easier than with sigmoid, whose outputs are always positive.
  - **Vanishing Gradient Problem:** Like sigmoid, gradients can become very small for large input values.
- **Graph:**

![image.png](attachment:45aca67b-e51e-4f13-b1e5-da00f48f01e9.png)
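
Tanh is a rescaled sigmoid: \(\tanh(x) = 2\sigma(2x) - 1\). The sketch below checks this identity against `math.tanh` and evaluates the standard derivative \(1 - \tanh^2(x)\), which is not stated in the notes above but shows the same saturation as sigmoid. Function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh_via_sigmoid(x: float) -> float:
    """tanh written as a shifted and scaled sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t


for x in (-5.0, -0.7, 0.0, 0.7, 5.0):
    print(f"x={x:5.1f}  tanh={math.tanh(x):+.6f}  "
          f"via sigmoid={tanh_via_sigmoid(x):+.6f}  grad={tanh_grad(x):.6f}")
```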


#### 4. Leaky ReLU Activation Function
- **Formula:**
  \[
  \text{Leaky ReLU}(x) = \begin{cases}
  x & \text{if } x > 0 \\
  \alpha x & \text{if } x \le 0
  \end{cases}
  \]
  where \(\alpha\) is a small constant (usually 0.01).
- **Range:** (-∞, ∞)
- **Properties:**
  - **Non-linear:** Captures complex relationships.
  - **Mitigates the Dying ReLU Problem:** The gradient is \(\alpha\) rather than 0 when \(x \le 0\), so inactive neurons can still receive updates.
- **Graph:**

![image.png](attachment:1fc1d267-4779-4013-928e-eb328e85dc72.png)
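
A minimal sketch of Leaky ReLU and its gradient, using the default \(\alpha = 0.01\) mentioned above. Treating \(x = 0\) as part of the negative branch follows the piecewise formula in this section; it is a convention.

```python
def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """x for positive inputs, alpha * x otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    """Gradient is 1 for positive inputs and alpha otherwise (never exactly 0 for alpha > 0)."""
    return 1.0 if x > 0 else alpha


for x in (-100.0, -1.0, 0.0, 1.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```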


#### 5. Parametric ReLU (PReLU) Activation Function
- **Formula:**
  \[
  \text{PReLU}(x) = \begin{cases}
  x & \text{if } x > 0 \\
  \alpha x & \text{if } x \le 0
  \end{cases}
  \]
  where \(\alpha\) is a learnable parameter.
- **Range:** (-∞, ∞)
- **Properties:**
  - **Non-linear:** Captures complex relationships.
  - **Adaptable:** The slope for \(x \le 0\) is learned during training.
- **Graph:**

![image.png](attachment:b0b9d4fd-6faa-499c-86fb-57316f8683d1.png)
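
What "learnable" means for \(\alpha\) can be shown with a toy example: the gradient of PReLU with respect to \(\alpha\) is \(x\) when \(x \le 0\) and 0 otherwise, so \(\alpha\) can be updated by gradient descent like any weight. The input, target, learning rate, and initial \(\alpha\) below are all hypothetical values chosen for illustration.

```python
def prelu(x: float, alpha: float) -> float:
    return x if x > 0 else alpha * x


def prelu_grad_alpha(x: float) -> float:
    """d PReLU / d alpha: x on the negative branch, 0 on the positive branch."""
    return 0.0 if x > 0 else x


alpha = 0.25           # hypothetical initial slope
learning_rate = 0.1    # hypothetical step size
x, target = -2.0, -1.0 # single made-up training example

for _ in range(50):
    y = prelu(x, alpha)
    dloss_dy = 2.0 * (y - target)  # gradient of the squared error (y - target)^2
    alpha -= learning_rate * dloss_dy * prelu_grad_alpha(x)

print(alpha)  # converges toward 0.5, since 0.5 * (-2.0) == -1.0
```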



#### 6. Exponential Linear Unit (ELU) Activation Function
- **Formula:**
  \[
  \text{ELU}(x) = \begin{cases}
  x & \text{if } x > 0 \\
  \alpha (e^x - 1) & \text{if } x \le 0
  \end{cases}
  \]
  where \(\alpha\) is a positive constant (commonly 1.0).
- **Range:** (-α, ∞)
- **Properties:**
  - **Non-linear:** Captures complex relationships.
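  - **Smooth:** Continuous everywhere, and with \(\alpha = 1\) the derivative is also continuous at \(x = 0\), unlike ReLU.
  - **Negative Saturation:** For large negative inputs the output approaches \(-\alpha\) and the gradient approaches zero; the negative outputs push mean activations closer to zero.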
- **Graph:**

![image.png](attachment:50dc9ec9-5077-4765-a431-6243938b8097.png)
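
A minimal sketch of ELU and its gradient, assuming \(\alpha = 1.0\) as the default. `math.expm1(x)` computes \(e^x - 1\) accurately for small \(x\).

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """x for positive inputs, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)


def elu_grad(x: float, alpha: float = 1.0) -> float:
    """Gradient is 1 for positive inputs and alpha * e^x otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)


for x in (-50.0, -2.0, 0.0, 2.0):
    print(f"x={x:6.1f}  elu={elu(x):+.6f}  grad={elu_grad(x):.6f}")
```

For very negative inputs the output levels off near \(-\alpha\) and the gradient is close to zero, so ELU saturates on the negative side even though it never outputs a hard zero there.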



#### 7. Swish Activation Function
- **Formula:**
  \[
  \text{Swish}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}
  \]
- **Range:** approximately [-0.278, ∞); the minimum is reached near \(x \approx -1.278\)
- **Properties:**
  - **Non-linear:** Captures complex relationships.
  - **Smooth Gradient:** Helps with gradient-based optimization.
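  - **Self-Gated:** The input is multiplied by its own sigmoid, so it behaves like ReLU for large positive inputs, approaches 0 for large negative inputs, and dips slightly below zero in between (it is non-monotonic).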
- **Graph:**

![image.png](attachment:5fe39f3e-6eea-4c03-8ab3-5e16d836b4b1.png)
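
Swish is bounded below, which a coarse grid search can confirm. The grid limits and step size below are arbitrary choices for illustration, so the located minimum is only approximate.

```python
import math


def sigmoid(x: float) -> float:
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swish(x: float) -> float:
    """Swish: the input gated by its own sigmoid."""
    return x * sigmoid(x)


# Scan [-5, 5] in steps of 0.001 and pick the grid point with the lowest output.
grid = [i / 1000.0 for i in range(-5000, 5001)]
x_min = min(grid, key=swish)

print(f"approximate minimum at x = {x_min:.3f}, swish(x) = {swish(x_min):.4f}")
print(f"swish(-20) = {swish(-20.0):.2e}")  # tiny negative value, close to 0
print(f"swish(20)  = {swish(20.0):.6f}")   # close to 20, like ReLU
```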



#### 8. Softmax Activation Function
- **Formula:**
  \[
  \text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}
  \]
  for each \(x_i\) in the input vector.
- **Range:** (0, 1), the sum of outputs is 1.
- **Properties:**
  - **Used in Multi-class Classification:** Outputs can be interpreted as probabilities.
  - **Differentiable:** Suitable for backpropagation.
- **Graph:**

Softmax is typically visualized as part of the overall network output, not in isolation.

![image.png](attachment:9becc77a-9537-42a1-abc1-75ab529e1f0d.png)
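
A minimal softmax sketch in plain Python. Subtracting the largest input before exponentiating is a common stability trick added here; it leaves the result unchanged because softmax is invariant to adding a constant to every input. The example values are made up.

```python
import math


def softmax(logits):
    """Softmax over a list of floats, shifted by the max to avoid overflow."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
print(probs)
print(sum(probs))

# Large inputs would overflow math.exp without the shift.
print(softmax([1000.0, 1001.0, 1002.0]))
```

Each output lies between 0 and 1, the outputs sum to 1 up to rounding, and the largest input receives the largest probability.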



### Summary

Each activation function has unique properties making it suitable for specific tasks:

- **Sigmoid and Tanh:** Sigmoid suits the output layer in binary classification; tanh suits hidden layers in shallower networks, where saturation is less of a problem.
- **ReLU and its Variants (Leaky ReLU, PReLU):** Preferred in deep networks due to computational efficiency and ability to mitigate vanishing gradients.
- **ELU and Swish:** ReLU alternatives that allow small negative outputs and have smoother gradients near zero, which can improve training dynamics.
- **Softmax:** Essential for multi-class classification tasks, providing probabilistic outputs.

The choice of activation function can significantly impact the performance and training efficiency of neural networks. Understanding these functions' characteristics and behaviors is crucial for designing effective deep learning models.

![image.png](attachment:4a21fc30-eafc-42db-8e14-e1d45ad950aa.png)