<h2 style="text-align:center;color:#0F4C81;">
Deep Dive into Activation Functions
</h2>

## Introduction
Activation functions play a crucial role in deep learning models by introducing non-linearity, enabling networks to learn complex patterns. In this tutorial, we will explore various activation functions, their properties, and when to use them.

---

## 1. Why Do We Need Activation Functions?
Neural networks consist of multiple layers, where each neuron computes a weighted sum of its inputs and passes it through an activation function. Without activation functions, a neural network would behave like a single linear model regardless of the number of layers, because a composition of linear (affine) maps is itself a linear (affine) map. Activation functions introduce non-linearity, allowing networks to learn complex features.
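
The collapse of stacked linear layers can be checked numerically. The following is a minimal sketch in plain Python; the weights `W1`, `W2`, the input `x`, and the helpers `matvec`, `matmul`, and `relu` are hypothetical names chosen for illustration, and biases are omitted for brevity.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Element-wise ReLU, used here only as an example non-linearity."""
    return [max(0.0, vi) for vi in v]

# Hypothetical weights for two stacked layers (illustrative values only).
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 1.0], [3.0, 0.0]]
x = [2.0, -3.0]

two_layers = matvec(W2, matvec(W1, x))        # layer 2 applied to layer 1
one_layer = matvec(matmul(W2, W1), x)         # one layer with W = W2 * W1
with_relu = matvec(W2, relu(matvec(W1, x)))   # non-linearity in between

print(two_layers)  # [1.0, -12.0]
print(one_layer)   # [1.0, -12.0]
print(with_relu)   # [3.0, 0.0]
```

Without an activation, the two-layer network and the single merged layer give the same output; inserting a non-linearity between the layers breaks that equivalence.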

---

## 2. Types of Activation Functions

### 2.1 Step Function
- **Definition:** Outputs either 0 or 1 based on a threshold.
- **Formula:** $$ f(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases} $$
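- **Usage:** Used in early perceptron models, but not suitable for gradient-based training of deep networks: its derivative is zero everywhere except at the threshold, where it is undefined, so it provides no gradient information.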
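
As a rough illustration of the missing gradient, the sketch below evaluates the step function and a central-difference estimate of its derivative. The `threshold` parameter and the `numerical_grad` helper are assumptions added for the example, not part of the definition above.

```python
def step(x, threshold=0.0):
    """Step activation: 1 at or above the threshold, 0 below it."""
    return 1.0 if x >= threshold else 0.0

def numerical_grad(f, x, h=1e-6):
    """Central-difference estimate of df/dx (illustrative helper)."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-2.0, -0.5, 0.5, 2.0):
    # Away from the threshold the estimated slope is exactly zero.
    print(x, step(x), numerical_grad(step, x))
```

Because the slope is zero wherever it is defined, backpropagation has nothing to propagate through a step unit.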

### 2.2 Sigmoid (Logistic) Function
- **Formula:** $$ f(x) = \frac{1}{1 + e^{-x}} $$
- **Pros:** Smooth, differentiable, and maps output to (0,1).
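- **Cons:** Saturates for large \(|x|\), where the gradient approaches zero (its maximum is 0.25 at \(x = 0\)); these vanishing gradients slow learning in deep networks. The output is also not zero-centered.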
- **Usage:** Commonly used in the output layer of binary classification problems.
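
The sketch below implements the sigmoid and its derivative \(f'(x) = f(x)(1 - f(x))\) in plain Python. The branch on the sign of `x` is an implementation choice (an assumption of this example) to avoid overflow in `math.exp`; the function names are illustrative.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative of the sigmoid: f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_grad(0.0))  # 0.5 0.25
print(sigmoid_grad(10.0))               # tiny: the unit is saturated

# Upper bound on the activation-derivative factor after n sigmoid layers
# (ignoring the weights): it can never exceed 0.25 ** n.
for n in (1, 5, 10):
    print(n, 0.25 ** n)
```

Since each sigmoid layer multiplies the backpropagated signal by at most 0.25, the bound shrinks geometrically with depth, which is one way to see the vanishing gradient problem.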

### 2.3 Tanh (Hyperbolic Tangent) Function
- **Formula:** $$ f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$
- **Pros:** Zero-centered output in the range (-1,1), improving optimization over sigmoid.
- **Cons:** Also saturates, so gradients vanish for inputs of large magnitude.
- **Usage:** Used in earlier deep networks before ReLU gained popularity.
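
A short sketch of tanh and its derivative \(f'(x) = 1 - f(x)^2\) follows, together with a check of the identity \(\tanh(x) = 2\sigma(2x) - 1\) that links it to the sigmoid. The helper names are illustrative.

```python
import math

def sigmoid(x):
    """Plain logistic sigmoid (adequate for the moderate inputs used here)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

x = 0.7
# tanh is a rescaled, shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(math.isclose(math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True

print(tanh_grad(0.0))  # 1.0, four times the sigmoid's peak slope
print(tanh_grad(5.0))  # close to zero: saturated
```

The peak gradient of 1.0 at the origin is larger than the sigmoid's 0.25, but the saturation for large \(|x|\) remains.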

### 2.4 ReLU (Rectified Linear Unit)
- **Formula:** $$ f(x) = \max(0, x) $$
- **Pros:** Simple, computationally efficient, and helps mitigate the vanishing gradient problem.
- **Cons:** Can suffer from the "dying ReLU" problem, where some neurons become inactive (output always 0).
- **Usage:** Most widely used activation function in hidden layers of deep networks.
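
The following sketch shows ReLU, its gradient, and a toy "dead" neuron. The weight `w`, bias `b`, and input values are hypothetical, and using gradient 0 at exactly \(x = 0\) is a convention assumed for this example.

```python
def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU; the value at x == 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0

# Hypothetical neuron whose bias has drifted strongly negative.
w, b = 0.5, -10.0
inputs = [-3.0, -1.0, 0.0, 2.0, 4.0]

outputs = [relu(w * x + b) for x in inputs]
grads = [relu_grad(w * x + b) for x in inputs]

print(outputs)  # [0.0, 0.0, 0.0, 0.0, 0.0]
print(grads)    # [0.0, 0.0, 0.0, 0.0, 0.0]
```

Every pre-activation is negative, so both the output and the gradient are zero for all inputs; with no gradient reaching `w` and `b`, the neuron cannot recover through gradient descent on these inputs.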

### 2.5 Leaky ReLU
- **Formula:** $$ f(x) = \begin{cases} x, & x \geq 0 \\ \alpha x, & x < 0 \end{cases} $$
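- **Pros:** Mitigates the dying ReLU problem: the small fixed slope \(\alpha\) (commonly 0.01) keeps a nonzero gradient for negative inputs.
- **Usage:** A common drop-in alternative to standard ReLU when inactive units are a concern.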
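
A minimal sketch of Leaky ReLU and its gradient follows. The default `alpha=0.01` is a commonly used value assumed for this example, and the function names are illustrative.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for x >= 0, small slope alpha for x < 0."""
    return x if x >= 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient of Leaky ReLU: 1 for x >= 0, alpha for x < 0."""
    return 1.0 if x >= 0 else alpha

for x in (-5.0, -1.0, 0.0, 2.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Unlike standard ReLU, the gradient for negative inputs is `alpha` rather than zero, so a unit with a negative pre-activation still receives a (small) learning signal.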

### 2.6 Parametric ReLU (PReLU)
- **Formula:** $$ f(x) = \begin{cases} x, & x \geq 0 \\ \alpha x, & x < 0 \end{cases} $$
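- **Pros:** Same form as Leaky ReLU, but the slope \(\alpha\) is learned during training rather than fixed in advance.
- **Usage:** Introduced for ImageNet classification (He et al., 2015) and used mainly in computer vision models.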
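
Because \(\alpha\) is trainable, it needs its own gradient: \(\partial f / \partial \alpha = x\) for \(x < 0\) and 0 otherwise. The sketch below performs one hand-written gradient-descent step on \(\alpha\) for a toy squared-error loss; the starting value, learning rate, input, and target are all hypothetical values chosen for illustration.

```python
def prelu(x, alpha):
    """PReLU: identity for x >= 0, learnable slope alpha for x < 0."""
    return x if x >= 0 else alpha * x

def prelu_grad_alpha(x):
    """Partial derivative of PReLU with respect to alpha."""
    return x if x < 0 else 0.0

# Toy setup (assumed values): one negative input and a desired output.
alpha, lr = 0.25, 0.1
x, target = -2.0, -1.0

y = prelu(x, alpha)                                # -0.5
loss_before = (y - target) ** 2
dloss_dalpha = 2.0 * (y - target) * prelu_grad_alpha(x)
alpha = alpha - lr * dloss_dalpha                  # gradient-descent update

loss_after = (prelu(x, alpha) - target) ** 2
print(alpha, loss_before, loss_after)
```

After the update, `alpha` moves from 0.25 to roughly 0.45 and the loss decreases, which is the behavior a framework's optimizer would automate for every PReLU parameter.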

### 2.7 ELU (Exponential Linear Unit)
- **Formula:** $$ f(x) = \begin{cases} x, & x \geq 0 \\ \alpha (e^x - 1), & x < 0 \end{cases} $$
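- **Pros:** Negative outputs pull the mean activation toward zero, which reduces bias shift and can speed up learning; the function saturates smoothly to \(-\alpha\) for large negative inputs.
- **Usage:** An alternative to ReLU when the extra cost of the exponential is acceptable.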
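
The sketch below implements ELU and compares the mean activation of ReLU and ELU on a small symmetric set of inputs. The default `alpha=1.0`, the sample inputs, and the use of `math.expm1` for accuracy near zero are assumptions of this example.

```python
import math

def elu(x, alpha=1.0):
    """ELU: identity for x >= 0, alpha * (e^x - 1) for x < 0."""
    return x if x >= 0 else alpha * math.expm1(x)

def elu_grad(x, alpha=1.0):
    """Gradient of ELU: 1 for x >= 0, alpha * e^x for x < 0."""
    return 1.0 if x >= 0 else alpha * math.exp(x)

def relu(x):
    return max(0.0, x)

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]  # symmetric around zero
relu_mean = sum(relu(x) for x in inputs) / len(inputs)
elu_mean = sum(elu(x) for x in inputs) / len(inputs)

print(relu_mean)  # ReLU discards negatives, so the mean shifts upward
print(elu_mean)   # ELU keeps negative outputs, so the mean is nearer zero
```

On these inputs the ELU mean is about half the ReLU mean, a small-scale picture of the reduced bias shift mentioned above.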

### 2.8 Swish
- **Formula:** $$ f(x) = x \cdot \sigma(x) = x \cdot \frac{1}{1 + e^{-x}} $$
- **Pros:** Smooth and non-monotonic; reported to match or outperform ReLU on several deep models (Ramachandran et al., 2017).
- **Usage:** Used in Google’s EfficientNet models.
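
A minimal sketch of Swish built on a sigmoid helper follows. The overflow-safe sigmoid and the sample inputs are choices made for this example.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def swish(x):
    """Swish: x * sigmoid(x)."""
    return x * sigmoid(x)

for x in (-5.0, -1.0, 0.0, 1.0, 10.0):
    print(x, swish(x))
```

The output dips below zero for moderately negative inputs and returns toward zero as \(x \to -\infty\), so Swish is not monotonic; for large positive inputs it approaches the identity, as ReLU does.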

---

## 3. Choosing the Right Activation Function
| Activation Function | Pros | Cons | Best Used In |
|----------------------|------|------|--------------|
| Sigmoid | Probabilistic interpretation | Vanishing gradients | Output layer of binary classifiers |
| Tanh | Zero-centered | Vanishing gradients | Some RNNs |
| ReLU | Simple, efficient | Dying neurons | Most deep networks |
| Leaky ReLU | Mitigates dying neurons | Slope \(\alpha\) is an extra hyperparameter | Alternative to ReLU |
| PReLU | Learnable slope | More parameters to train | Computer vision models |
| ELU | Reduces bias shift | Slightly slower than ReLU (exponential) | Alternative to ReLU when mean activations drift |
| Swish | Smoother than ReLU | More computationally expensive | EfficientNet models |

---

## 4. Conclusion
Activation functions are essential for deep learning, enabling networks to learn complex patterns. While ReLU remains the most popular choice, alternatives like Swish and ELU are gaining traction. Choosing the right activation function depends on the problem, architecture, and computational constraints.

---

## References
1. McCulloch, W. S., & Pitts, W. (1943). *A Logical Calculus of the Ideas Immanent in Nervous Activity*.
2. Rosenblatt, F. (1958). *The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain*.
3. Nair, V., & Hinton, G. E. (2010). *Rectified Linear Units Improve Restricted Boltzmann Machines*.
4. He, K., Zhang, X., Ren, S., & Sun, J. (2015). *Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification*.
5. Ramachandran, P., Zoph, B., & Le, Q. V. (2017). *Searching for Activation Functions*.