# Activation Functions

## Softmax

* [Neural Networks Part 5: ArgMax and SoftMax](https://www.youtube.com/watch?v=KpKog-L9veg)
* [The SoftMax Derivative, Step-by-Step!!!](https://www.youtube.com/watch?v=M59JElEPgIg)

Softmax converts a vector of raw scores (logits) into a **probability distribution**. Each output is between 0 and 1, and all outputs sum to 1.

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Where $K$ is the number of classes.

**Use case:** Multi-class classification where exactly one class applies (e.g., MNIST with 10 digit classes). Each output is the predicted probability of the corresponding class.

**Example:** Input `[2.0, 1.0, 0.1]` → Output `[0.66, 0.24, 0.10]` (probabilities summing to 1)
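The example above can be reproduced with a short sketch. The function name `softmax` and the max-subtraction step are illustrative choices, not something these notes prescribe; subtracting the largest logit does not change the result but keeps `exp()` from overflowing on large scores.

```python
import math

def softmax(logits):
    """Return softmax probabilities for a list of raw scores."""
    # Shifting by the max leaves the ratios unchanged and avoids overflow.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 2) for p in probs])  # [0.66, 0.24, 0.1]
print(sum(probs))                    # very close to 1.0
```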

---

## Sigmoid

Sigmoid squashes any input value into the range **(0, 1)**.

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

**Use case:** Binary classification or independent multi-label classification (each output is treated independently).

**Key difference from Softmax:**
- **Sigmoid:** Each output is independent (can have multiple outputs > 0.5)
- **Softmax:** Outputs are coupled (they sum to 1, so raising one lowers the others); classes are treated as mutually exclusive
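
To make the difference concrete, here is a minimal sketch that applies sigmoid element-wise to the same logits used in the softmax example. The helper name `sigmoid` and the two-branch form (used only to avoid overflow for very negative inputs) are assumptions for illustration.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written so exp() never sees a large positive argument."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

logits = [2.0, 1.0, 0.1]
scores = [sigmoid(z) for z in logits]
print([round(s, 2) for s in scores])  # [0.88, 0.73, 0.52]
print(sum(scores) > 1.0)              # True: the outputs are not a distribution
```

All three scores exceed 0.5 at once, which cannot happen with softmax outputs.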

---

## ReLU (Rectified Linear Unit)

A common default activation for hidden layers. Outputs the input if it is positive, otherwise 0.

$$\text{ReLU}(z) = \max(0, z)$$

**Pros:** Cheap to compute; the gradient is exactly 1 for positive inputs, so it does not vanish there  
**Cons:** "Dying ReLU": a neuron whose input stays negative outputs 0, receives zero gradient, and stops learning
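
A small sketch of ReLU and its gradient shows why a neuron can die. The names `relu` and `relu_grad` are illustrative, and using a gradient of 0 at exactly `z == 0` is a convention assumed here (the derivative is undefined at that point).

```python
def relu(z):
    """ReLU: pass positive inputs through, clamp the rest to 0."""
    return max(0.0, z)

def relu_grad(z):
    """Gradient of ReLU; 0 is used at z == 0 by convention (an assumption)."""
    return 1.0 if z > 0 else 0.0

pre_activations = [-3.0, -0.5, 0.0, 2.0]
print([relu(z) for z in pre_activations])       # [0.0, 0.0, 0.0, 2.0]
print([relu_grad(z) for z in pre_activations])  # [0.0, 0.0, 0.0, 1.0]

# If every input a neuron sees is negative, every gradient is 0,
# so gradient descent never updates its weights.
dead_inputs = [-3.0, -0.5, -1.2]
print(sum(relu_grad(z) for z in dead_inputs))   # 0.0
```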

---

## Leaky ReLU

Mitigates dying ReLU by keeping a small, non-zero gradient for negative inputs.

$$\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}$$

Where $\alpha$ is a small positive constant, typically 0.01.
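
As a rough sketch, the piecewise definition translates directly into code. The function names are hypothetical; the default `alpha=0.01` follows the typical value mentioned above.

```python
def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: identity for positive inputs, slope alpha otherwise."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Gradient is 1 for z > 0 and alpha otherwise, so it is never exactly 0."""
    return 1.0 if z > 0 else alpha

for z in [-3.0, -0.5, 2.0]:
    print(z, leaky_relu(z), leaky_relu_grad(z))
```

Unlike plain ReLU, a neuron with only negative inputs still receives a gradient of `alpha`, so its weights can keep moving.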

---

## Tanh (Hyperbolic Tangent)

Squashes input to the range **(-1, 1)**. Zero-centered, unlike sigmoid.

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

**Use case:** Hidden layers when zero-centered output is preferred (e.g., RNNs).
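
One way to see what "zero-centered" means is to average the outputs over inputs that are symmetric around 0. This is only an illustrative check with made-up inputs; the local `sigmoid` helper is defined here so the snippet runs on its own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]  # symmetric around 0

tanh_mean = sum(math.tanh(z) for z in inputs) / len(inputs)
sigmoid_mean = sum(sigmoid(z) for z in inputs) / len(inputs)

print(round(abs(tanh_mean), 6))  # 0.0: tanh outputs balance around zero
print(round(sigmoid_mean, 6))    # 0.5: sigmoid outputs are shifted positive
```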

---

## ELU (Exponential Linear Unit)

Smooth alternative to ReLU whose negative outputs help push mean activations toward zero.

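$$\text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha(e^z - 1) & \text{if } z \leq 0 \end{cases}$$

Where $\alpha > 0$ sets the value the output saturates to for very negative inputs ($-\alpha$); it is typically 1.0.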
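
A minimal sketch of the formula follows. The function name `elu` and the default `alpha=1.0` are assumptions made for the example (1.0 is a common choice); `math.expm1(z)` computes `e^z - 1` with better precision near 0.

```python
import math

def elu(z, alpha=1.0):
    """ELU: identity for positive inputs, alpha * (e^z - 1) otherwise."""
    return z if z > 0 else alpha * math.expm1(z)

for z in [-50.0, -1.0, 0.0, 2.0]:
    print(z, elu(z))
# Very negative inputs saturate near -alpha instead of growing without bound.
```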

---

## GELU (Gaussian Error Linear Unit)

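Used in Transformers (BERT, GPT). Weights each input by the probability that a standard normal variable falls below it, so large positive values pass almost unchanged and large negative values are pushed toward 0.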

$$\text{GELU}(z) = z \cdot \Phi(z)$$

Where $\Phi(z)$ is the CDF of the standard normal distribution.
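
The exact form can be sketched with the error function, using the identity $\Phi(z) = \tfrac{1}{2}\left(1 + \text{erf}(z / \sqrt{2})\right)$. The function name `gelu` is illustrative; deep learning libraries often offer faster approximations that are not shown here.

```python
import math

def gelu(z):
    """Exact GELU: z times the standard normal CDF evaluated at z."""
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z * phi

for z in [-3.0, -0.75, 0.0, 1.0, 3.0]:
    print(z, round(gelu(z), 4))
# The output dips to roughly -0.17 near z = -0.75, then returns toward 0.
```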

---

## Swish / SiLU

Self-gated activation: the input is scaled by its own sigmoid. Reported to match or outperform ReLU in some deep networks.

$$\text{Swish}(z) = z \cdot \sigma(z) = \frac{z}{1 + e^{-z}}$$
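
A short sketch of the formula, assuming a numerically safe `sigmoid` helper (an illustrative choice so that very negative inputs do not overflow `exp()`):

```python
import math

def sigmoid(z):
    """Logistic sigmoid that avoids calling exp() on large positive values."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def swish(z):
    """Swish / SiLU: the input gated by its own sigmoid."""
    return z * sigmoid(z)

for z in [-5.0, -1.28, 0.0, 1.0, 5.0]:
    print(z, round(swish(z), 4))
# The output dips to roughly -0.28 near z = -1.28, then returns toward 0.
```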

---

## Summary Table

| Function | Output Range | Zero-Centered | Best For |
|----------|-------------|---------------|----------|
| Sigmoid | (0, 1) | No | Output (binary, multi-label) |
| Softmax | (0, 1) | No | Output (multi-class) |
| ReLU | [0, ∞) | No | Hidden layers (default) |
| Leaky ReLU | (-∞, ∞) | No | Hidden layers |
| Tanh | (-1, 1) | Yes | RNNs, hidden layers |
| ELU | (-α, ∞) | ~Yes | Hidden layers |
| GELU | [≈-0.17, ∞) | ~Yes | Transformers |
| Swish | [≈-0.28, ∞) | ~Yes | Deep networks |
