## Activation Functions and When to Use Which

---

### 1. Theoretical Intuition
- Activation functions introduce **non-linearity** into neural networks.  
- Without them, multiple layers collapse into a single linear transformation.  
- Each neuron applies the activation to its weighted input sum, which determines **how strongly it activates**.  
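
The collapse of stacked linear layers can be checked numerically. The sketch below uses small hand-picked weights (`W1`, `b1`, `W2`, `b2` and `x` are illustrative values, not from any real model) to show that two layers without an activation equal one linear layer, while putting a ReLU between them changes the result.

```python
import numpy as np

# Illustrative weights, biases, and input (hypothetical values)
W1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
b1 = np.array([0.0, 0.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.5])
x = np.array([1.0, 0.0])

# Two stacked layers with no activation in between
two_layer_out = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer_out = W_combined @ x + b_combined

# The same two layers with a ReLU between them
relu_out = W2 @ np.maximum(0, W1 @ x + b1) + b2

print(two_layer_out, one_layer_out)  # [0.5] [0.5]
print(relu_out)                      # [1.5]
```

With the ReLU in place, the negative pre-activation is clipped to 0, so the output no longer matches any single linear map of these weights.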

---

### 2. Key Pointers
- **Sigmoid**: Output between 0 and 1, saturates at extremes.  
- **Tanh**: Output between -1 and 1, zero-centered.  
- **ReLU**: Output = max(0, x), avoids vanishing gradient for positive inputs.  
- **Leaky ReLU**: Applies a small non-zero slope (e.g., 0.01) to negative inputs, so gradients keep flowing and neurons are less likely to die.  
- **Softmax**: Converts vector into probability distribution for multi-class classification.  
- **Choosing activation** depends on **layer type** (hidden/output), **problem type**, and **training stability**.  
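
To make the "dead neuron" point concrete, here is a minimal sketch comparing the gradients of ReLU and Leaky ReLU when a neuron's pre-activation is negative for every sample. The helper names `relu_grad` and `leaky_relu_grad`, the sample values in `z`, and the slope `alpha=0.01` are assumptions chosen for illustration.

```python
import numpy as np

def relu_grad(z):
    # Derivative of max(0, z): 1 for positive inputs, 0 otherwise
    return (z > 0).astype(float)

def leaky_relu_grad(z, alpha=0.01):
    # Derivative of Leaky ReLU: 1 for positive inputs, alpha otherwise
    return np.where(z > 0, 1.0, alpha)

# Hypothetical pre-activations of one neuron over three samples, all negative
z = np.array([-3.0, -1.5, -0.2])

print(relu_grad(z))        # [0. 0. 0.]
print(leaky_relu_grad(z))  # [0.01 0.01 0.01]
```

With ReLU no gradient reaches the neuron's weights, so it cannot recover; with Leaky ReLU a small gradient still flows.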

---

### 3. Use Cases / When to Use

| Activation Function | Typical Use Case / Notes |
|--------------------|-------------------------|
| Sigmoid | Output layer for **binary classification** (probability 0–1); avoid in deep hidden layers due to vanishing gradient. |
| Tanh | Hidden layers when zero-centered output is desired; small networks; reduces bias shift. |
| ReLU | Most hidden layers in **deep networks**; fast computation; mitigates vanishing gradient for positive inputs. |
| Leaky ReLU | Hidden layers, to prevent the “dead ReLU” problem where neurons output 0 for every input and stop learning. |
| Softmax | Output layer for **multi-class classification**; provides probability distribution across classes. |
| Linear (Identity) | Output layer for **regression tasks**; no non-linearity needed. |

---

### 4. Mathematical Formulas
- **Sigmoid:** \( \sigma(x) = \frac{1}{1 + e^{-x}} \)  
- **Tanh:** \( \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \)  
- **ReLU:** \( f(x) = \max(0, x) \)  
- **Leaky ReLU:** \( f(x) = \max(0.01x, x) \)  
- **Softmax:** \( \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \)  
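
As a worked companion to the formulas above, these are the derivatives that backpropagation multiplies through each layer. They are standard results rather than something stated in these notes; ReLU and Leaky ReLU are not differentiable at \( x = 0 \), where frameworks pick a value by convention.

- **Sigmoid:** \( \sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \), with maximum \( 0.25 \) at \( x = 0 \)  
- **Tanh:** \( \tanh'(x) = 1 - \tanh^2(x) \), with maximum \( 1 \) at \( x = 0 \)  
- **ReLU:** \( f'(x) = 1 \) for \( x > 0 \), and \( 0 \) for \( x < 0 \)  
- **Leaky ReLU:** \( f'(x) = 1 \) for \( x > 0 \), and \( 0.01 \) for \( x < 0 \)  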

---

### 5. Interview Q&A

| Question | Answer |
|----------|--------|
| Why do we need activation functions? | To introduce non-linearity, enabling networks to model complex patterns. |
| When is Sigmoid used? | Binary classification output layer. |
| Why avoid Sigmoid in hidden layers? | Its derivative is at most 0.25 and approaches 0 when saturated, so gradients vanish across many layers and slow training. |
| When is Tanh preferred over Sigmoid? | When zero-centered output is desired; reduces bias shift. |
| Why is ReLU popular in hidden layers? | Simple, computationally fast, mitigates vanishing gradient for positive inputs. |
| What problem does Leaky ReLU solve? | Prevents dead neurons that never activate. |
| When is Softmax used? | Multi-class classification output layer; gives probability distribution. |
| Which activation is used in regression output? | Linear (identity) activation. |
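
The vanishing gradient answer can be illustrated with a rough calculation. The sketch below is a deliberate simplification: it ignores the weights and assumes every layer sits at the sigmoid's best case (`x = 0`, where the derivative is largest), then multiplies that derivative across a few hypothetical depths.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

# Best-case per-layer factor (x = 0) raised to the number of layers
factors = {depth: sigmoid_grad(0.0) ** depth for depth in (1, 5, 10, 20)}

for depth, factor in factors.items():
    print(f"{depth:2d} layers -> gradient factor {factor:.3e}")
```

Even in this best case the factor drops below one in a million after 10 layers, which is why the gradient reaching early layers becomes negligible.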

---

### 6. Code Demo: Plot Activation Functions Step-by-Step

```python
import numpy as np
import matplotlib.pyplot as plt

# Input values
x = np.linspace(-10, 10, 200)

# Sigmoid
sigmoid = 1 / (1 + np.exp(-x))
plt.plot(x, sigmoid, label='Sigmoid', color='blue')

# Tanh
tanh = np.tanh(x)
plt.plot(x, tanh, label='Tanh', color='red')

# ReLU
relu = np.maximum(0, x)
plt.plot(x, relu, label='ReLU', color='green')

# Leaky ReLU
leaky_relu = np.where(x > 0, x, 0.01 * x)
plt.plot(x, leaky_relu, label='Leaky ReLU', color='purple')

plt.title("Activation Functions")
plt.xlabel("Input")
plt.ylabel("Output")
plt.grid(True)
plt.legend()
plt.show()

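# Softmax example (subtracting the max keeps np.exp from overflowing on large inputs
# and does not change the result)
softmax_input = np.array([2.0, 1.0, 0.1])
exp_vals = np.exp(softmax_input - np.max(softmax_input))
softmax_output = exp_vals / np.sum(exp_vals)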

print("Softmax Input:", softmax_input)
print("Softmax Probabilities:", softmax_output)
```

---

## Activation Functions Reference Table

| Activation Function | Pros | Cons | Ideal Usage / Notes |
|--------------------|------|------|-------------------|
| **Sigmoid** | Smooth gradient, outputs 0–1 (probabilities) | Vanishing gradient for large positive/negative inputs; not zero-centered | Output layer for **binary classification**; avoid in deep hidden layers |
| **Tanh** | Zero-centered, smooth gradient | Vanishing gradient for large inputs | Hidden layers in small networks; when zero-centered output is needed |
| **ReLU** | Computationally efficient, mitigates vanishing gradient for positives, sparse activation | Dead neurons: if the pre-activation stays negative for all inputs, the gradient is 0 and the neuron stops updating | Hidden layers in **deep networks**; most commonly used |
| **Leaky ReLU** | Fixes dead neuron problem by allowing small negative slope | Slightly more computation than ReLU | Hidden layers when some neurons might die with ReLU |
| **Parametric ReLU (PReLU)** | Learns negative slope during training | Extra parameters increase complexity | Deep hidden layers where flexibility is needed |
| **ELU (Exponential Linear Unit)** | Smooth output for negatives, reduces bias shift | Slightly slower computation | Hidden layers in deep networks; helps faster learning |
| **Softmax** | Converts outputs to probability distribution; differentiable | Can saturate if input differences are large | Output layer for **multi-class classification** |
| **Linear / Identity** | Simple; no non-linearity | Cannot model complex patterns | Output layer for **regression tasks** |
| **Swish** | Smooth, non-monotonic, better gradient flow | Slightly slower than ReLU | Deep networks; sometimes improves accuracy over ReLU |
| **GELU (Gaussian Error Linear Unit)** | Smooth, non-linear, used in Transformers | More computation than ReLU | Hidden layers in **Transformer-based networks** like BERT |
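
The newer functions in the table can be sketched in NumPy alongside the earlier plotting demo. The function names below are illustrative, `alpha=1.0` for ELU is an assumed default, Swish is shown in its simplest form \( x \cdot \sigma(x) \), and `gelu_tanh` uses the common tanh approximation rather than the exact Gaussian CDF form.

```python
import math
import numpy as np

def elu(x, alpha=1.0):
    # x for positive inputs, alpha * (exp(x) - 1) for the rest;
    # np.minimum keeps expm1 from overflowing on large positive x
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def swish(x):
    # x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def gelu_tanh(x):
    # tanh approximation of GELU
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + np.tanh(inner))

x = np.array([-2.0, 0.0, 2.0])
print("ELU:  ", elu(x))
print("Swish:", swish(x))
print("GELU: ", gelu_tanh(x))
```

All three pass through 0 at `x = 0` and stay close to the identity for large positive inputs; Swish and GELU dip slightly below 0 for small negative inputs, which is the non-monotonic behaviour the table mentions.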

---

### Tips for Choosing Activation Functions
- Hidden layers: usually **ReLU or variants** (Leaky ReLU, ELU, GELU)  
- Output layer depends on task:  
  - Binary classification → **Sigmoid**  
  - Multi-class classification → **Softmax**  
  - Regression → **Linear**  
- Avoid Sigmoid/Tanh in the hidden layers of very deep networks to reduce the vanishing gradient problem  
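
The output-layer rule of thumb above can be written as a small lookup. This is only a sketch: the task labels (`"binary"`, `"multiclass"`, `"regression"`) and the helper `apply_output_activation` are hypothetical names invented for this example, not part of any library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    exp_vals = np.exp(z - np.max(z))
    return exp_vals / np.sum(exp_vals)

def identity(z):
    return z

# Hypothetical task labels mapped to the output activations from the tips above
OUTPUT_ACTIVATIONS = {
    "binary": sigmoid,
    "multiclass": softmax,
    "regression": identity,
}

def apply_output_activation(logits, task):
    if task not in OUTPUT_ACTIVATIONS:
        raise ValueError(f"Unknown task type: {task!r}")
    return OUTPUT_ACTIVATIONS[task](np.asarray(logits, dtype=float))

print(apply_output_activation([0.0], "binary"))                # [0.5]
print(apply_output_activation([2.0, 1.0, 0.1], "multiclass"))  # three probabilities summing to 1
print(apply_output_activation([3.7], "regression"))            # [3.7]
```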

