# Activation Functions Tutorial

This notebook will teach you about four important activation functions used in neural networks:
1. **ReLU (Rectified Linear Unit)**
2. **Sigmoid**
3. **Tanh (Hyperbolic Tangent)**
4. **Leaky ReLU**

We'll implement each function, visualize them, and understand their properties.


## What are Activation Functions?

Activation functions are mathematical functions applied to the output of neurons in neural networks. They introduce **non-linearity** into the network, allowing it to learn complex patterns. Without activation functions, a neural network would collapse to a single linear (affine) transformation of its inputs, no matter how many layers it has.

Key properties to consider:
- **Range**: What values can the function output?
- **Differentiability**: Can we compute gradients for backpropagation?
- **Saturation**: Does the function saturate (flatten out) at extreme values?
- **Computational efficiency**: How fast is it to compute?


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Set up plotting style
plt.style.use('seaborn-v0_8')
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

# Create input range
x = np.linspace(-5, 5, 1000)


## 1. ReLU (Rectified Linear Unit)

### Formula:
$$\text{ReLU}(x) = \max(0, x) = \begin{cases} 
x & \text{if } x > 0 \\
0 & \text{if } x \leq 0
\end{cases}$$

### Properties:
- **Range**: [0, ∞)
- **Advantages**: 
  - Simple and computationally efficient
  - Mitigates the vanishing gradient problem: the gradient is exactly 1 for positive inputs, so it does not shrink as it passes through the unit
  - Most commonly used in hidden layers
- **Disadvantages**: 
  - "Dying ReLU" problem: neurons can become inactive (output 0) and never recover
  - Not differentiable at x = 0 (though this is rarely a problem in practice)
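
To make the "dying ReLU" point concrete, here is a minimal sketch of a single neuron. The inputs, weights, and bias are hypothetical values chosen so that every pre-activation is negative; `relu_grad` is a helper name introduced only for this illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    # Convention: treat the derivative at z = 0 as 0
    return np.where(z > 0, 1.0, 0.0)

# Hypothetical neuron: a large negative bias makes it inactive on this batch
inputs = np.array([[0.5, 1.0], [1.5, 0.2], [0.3, 0.8]])  # 3 samples, 2 features
w = np.array([0.4, -0.6])
b = -5.0

z = inputs @ w + b            # pre-activations, all negative here
a = relu(z)                   # outputs, all zero
upstream = np.ones_like(z)    # pretend dLoss/da = 1 for every sample

# Chain rule: dLoss/dw = sum over samples of upstream * relu'(z) * input
grad_w = (upstream * relu_grad(z)) @ inputs

print("pre-activations:", z)
print("outputs:", a)
print("weight gradient:", grad_w)
```

Because the weight gradient is zero for every sample, gradient descent leaves `w` and `b` unchanged, and the neuron stays inactive for as long as its inputs look like this batch.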

### Use Cases:
- Hidden layers in deep neural networks
- Convolutional Neural Networks (CNNs)
- Most modern deep learning architectures


In [None]:
def relu(x):
    """ReLU activation function"""
    return np.maximum(0, x)

# Compute ReLU
y_relu = relu(x)

# Plot
axes[0].plot(x, y_relu, 'b-', linewidth=2, label='ReLU')
axes[0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[0].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[0].set_xlabel('Input (x)', fontsize=12)
axes[0].set_ylabel('Output f(x)', fontsize=12)
axes[0].set_title('ReLU Activation Function', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[0].legend(fontsize=11)
axes[0].set_xlim(-5, 5)
axes[0].set_ylim(-1, 5)

print("ReLU Examples:")
print(f"  ReLU(-3) = {relu(-3)}")
print(f"  ReLU(0) = {relu(0)}")
print(f"  ReLU(2.5) = {relu(2.5)}")


## 2. Sigmoid

### Formula:
$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1}$$
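
The two equivalent forms above are useful numerically. A direct `1 / (1 + np.exp(-x))` can overflow inside `np.exp` for very negative `x`. The sketch below, with the hypothetical name `stable_sigmoid`, picks whichever form only exponentiates a non-positive number. It assumes array input and is one possible approach, not the only one.

```python
import numpy as np

def stable_sigmoid(values):
    """Sigmoid for array inputs that only ever exponentiates non-positive numbers."""
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    pos = values >= 0
    # For x >= 0 use 1 / (1 + e^{-x}); e^{-x} <= 1, so no overflow
    out[pos] = 1.0 / (1.0 + np.exp(-values[pos]))
    # For x < 0 use e^{x} / (e^{x} + 1); e^{x} < 1, so no overflow
    ex = np.exp(values[~pos])
    out[~pos] = ex / (ex + 1.0)
    return out

vals = stable_sigmoid(np.array([-1000.0, -3.0, 0.0, 3.0, 1000.0]))
print(vals)
```

For the plotting range used in this notebook (-5 to 5) the simple form is fine; the split only matters for inputs of large magnitude.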

### Properties:
- **Range**: (0, 1)
- **Advantages**: 
  - Smooth and differentiable everywhere
  - Outputs are always between 0 and 1, making it interpretable as probabilities
  - Historically important (used in early neural networks)
- **Disadvantages**: 
  - **Vanishing gradient problem**: Gradients become very small for extreme values
  - Outputs are not zero-centered (always positive)
  - Computationally more expensive than ReLU
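
The vanishing gradient point can be checked with a few numbers. The sigmoid derivative is σ(x)(1 − σ(x)), which peaks at 0.25 when x = 0. The sketch below multiplies that best-case value across several layers; it ignores the weights entirely, which is a simplifying assumption made only to isolate the effect of the activation.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

# The derivative peaks at z = 0 and decays quickly away from it
peak = sigmoid_derivative(0.0)
at_five = sigmoid_derivative(5.0)
print(f"sigmoid'(0) = {peak:.4f}, sigmoid'(5) = {at_five:.4f}")

# Best case for a chain of n sigmoids: every unit sits exactly at its peak
depths = [1, 5, 10]
best_case = [peak ** n for n in depths]

for n, g in zip(depths, best_case):
    print(f"{n:2d} layers: activation gradient factor <= {g:.2e}")
```

Even in this best case, ten stacked sigmoids scale the gradient by less than one millionth, which is why early layers in deep sigmoid networks can learn very slowly.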

### Use Cases:
- Output layer for binary classification (probability output)
- When you need probabilities between 0 and 1
- Less common in hidden layers now (ReLU is preferred)


In [None]:
def sigmoid(x):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-x))

# Compute Sigmoid
y_sigmoid = sigmoid(x)

# Plot
axes[1].plot(x, y_sigmoid, 'r-', linewidth=2, label='Sigmoid')
axes[1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[1].axhline(y=1, color='k', linestyle='--', alpha=0.3)
axes[1].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[1].set_xlabel('Input (x)', fontsize=12)
axes[1].set_ylabel('Output f(x)', fontsize=12)
axes[1].set_title('Sigmoid Activation Function', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
axes[1].legend(fontsize=11)
axes[1].set_xlim(-5, 5)
axes[1].set_ylim(-0.1, 1.1)

print("Sigmoid Examples:")
print(f"  Sigmoid(-3) = {sigmoid(-3):.4f}")
print(f"  Sigmoid(0) = {sigmoid(0):.4f}")
print(f"  Sigmoid(3) = {sigmoid(3):.4f}")
print(f"\nNote: Sigmoid(0) = 0.5 (midpoint)")


## 3. Tanh (Hyperbolic Tangent)

### Formula:
$$\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1} = \frac{\sinh(x)}{\cosh(x)}$$

### Properties:
- **Range**: (-1, 1)
- **Advantages**: 
  - Zero-centered output (unlike sigmoid)
  - Smooth and differentiable
  - Stronger gradients than sigmoid in the center region
- **Disadvantages**: 
  - Still suffers from vanishing gradient problem at extremes
  - Computationally more expensive than ReLU
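
Tanh is a rescaled and shifted sigmoid: tanh(x) = 2σ(2x) − 1. The short check below verifies this identity numerically and compares the slopes at the origin, which is where the "stronger gradients" claim comes from. The variable names are chosen for this sketch only.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

xs = np.linspace(-5, 5, 101)

# Identity: tanh(x) = 2 * sigmoid(2x) - 1
tanh_from_sigmoid = 2 * sigmoid(2 * xs) - 1
max_abs_diff = np.max(np.abs(tanh_from_sigmoid - np.tanh(xs)))

# Slopes at the origin
tanh_slope_0 = 1 - np.tanh(0.0) ** 2
sigmoid_slope_0 = sigmoid(0.0) * (1 - sigmoid(0.0))

print(f"max |2*sigmoid(2x) - 1 - tanh(x)| = {max_abs_diff:.2e}")
print(f"tanh'(0) = {tanh_slope_0}, sigmoid'(0) = {sigmoid_slope_0}")
```

The slope of tanh at zero is four times that of sigmoid, but both derivatives still decay toward zero for large |x|.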

### Use Cases:
- Hidden layers (especially in RNNs/LSTMs)
- When you want zero-centered outputs
- Less common in modern CNNs (ReLU is preferred)


In [None]:
def tanh(x):
    """Tanh activation function"""
    return np.tanh(x)

# Compute Tanh
y_tanh = tanh(x)

# Plot
axes[2].plot(x, y_tanh, 'g-', linewidth=2, label='Tanh')
axes[2].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[2].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[2].set_xlabel('Input (x)', fontsize=12)
axes[2].set_ylabel('Output f(x)', fontsize=12)
axes[2].set_title('Tanh Activation Function', fontsize=14, fontweight='bold')
axes[2].grid(True, alpha=0.3)
axes[2].legend(fontsize=11)
axes[2].set_xlim(-5, 5)
axes[2].set_ylim(-1.1, 1.1)

print("Tanh Examples:")
print(f"  Tanh(-3) = {tanh(-3):.4f}")
print(f"  Tanh(0) = {tanh(0):.4f}")
print(f"  Tanh(3) = {tanh(3):.4f}")
print(f"\nNote: Tanh(0) = 0 (zero-centered)")


## 4. Leaky ReLU

### Formula:
$$\text{LeakyReLU}(x) = \begin{cases} 
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}$$

where $\alpha$ is a small positive constant (typically 0.01)

### Properties:
- **Range**: (-∞, ∞)
- **Advantages**: 
  - Mitigates the "Dying ReLU" problem by keeping a small non-zero gradient (α) for negative inputs
  - Still computationally efficient
  - Prevents neurons from becoming completely inactive
- **Disadvantages**: 
  - Requires tuning the $\alpha$ parameter (though 0.01 is a common default)
  - Not differentiable at x = 0 (rarely an issue in practice)
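
A small experiment illustrates the difference. The sketch below trains one neuron on a single example using plain gradient descent, starting from a hypothetical state where the pre-activation is negative. The function `train_neuron`, the starting values, the learning rate, and the step count are all assumptions made for this illustration.

```python
def train_neuron(activation, derivative, steps=5000, lr=0.1):
    """Gradient descent on one neuron with loss 0.5 * (a - target)^2."""
    w, b = 0.5, -5.0           # hypothetical start: pre-activation is -4.5
    inp, target = 1.0, 1.0     # a single training example
    for _ in range(steps):
        z = w * inp + b
        a = activation(z)
        dz = (a - target) * derivative(z)   # chain rule through the activation
        w -= lr * dz * inp
        b -= lr * dz
    return activation(w * inp + b)

alpha = 0.01
relu_result = train_neuron(lambda z: max(0.0, z),
                           lambda z: 1.0 if z > 0 else 0.0)
leaky_result = train_neuron(lambda z: z if z > 0 else alpha * z,
                            lambda z: 1.0 if z > 0 else alpha)

print(f"ReLU neuron output after training:       {relu_result:.4f}")
print(f"Leaky ReLU neuron output after training: {leaky_result:.4f}")
```

The ReLU neuron receives a zero gradient at every step and never moves, while the Leaky ReLU neuron slowly climbs out of the negative region and then fits the target.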

### Use Cases:
- Alternative to ReLU when you want to avoid dead neurons
- Used in some GAN architectures
- When training is unstable with standard ReLU


In [None]:
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU activation function"""
    return np.where(x > 0, x, alpha * x)

# Compute Leaky ReLU
y_leaky_relu = leaky_relu(x, alpha=0.01)

# Plot
axes[3].plot(x, y_leaky_relu, 'm-', linewidth=2, label='Leaky ReLU (α=0.01)')
axes[3].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[3].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[3].set_xlabel('Input (x)', fontsize=12)
axes[3].set_ylabel('Output f(x)', fontsize=12)
axes[3].set_title('Leaky ReLU Activation Function', fontsize=14, fontweight='bold')
axes[3].grid(True, alpha=0.3)
axes[3].legend(fontsize=11)
axes[3].set_xlim(-5, 5)
axes[3].set_ylim(-0.5, 5)

print("Leaky ReLU Examples (α=0.01):")
print(f"  LeakyReLU(-3) = {leaky_relu(-3):.4f}")
print(f"  LeakyReLU(0) = {leaky_relu(0):.4f}")
print(f"  LeakyReLU(2.5) = {leaky_relu(2.5):.4f}")
print(f"\nNote: Negative values are multiplied by α instead of being set to 0")


In [None]:
# Display all plots together
plt.tight_layout()
plt.show()


## Comparison: All Functions Together

Let's visualize all four functions on the same plot to compare them directly:


In [None]:
plt.figure(figsize=(12, 6))

plt.plot(x, y_relu, 'b-', linewidth=2, label='ReLU', alpha=0.8)
plt.plot(x, y_sigmoid, 'r-', linewidth=2, label='Sigmoid', alpha=0.8)
plt.plot(x, y_tanh, 'g-', linewidth=2, label='Tanh', alpha=0.8)
plt.plot(x, y_leaky_relu, 'm-', linewidth=2, label='Leaky ReLU (α=0.01)', alpha=0.8)

plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.xlabel('Input (x)', fontsize=12)
plt.ylabel('Output f(x)', fontsize=12)
plt.title('Comparison of Activation Functions', fontsize=16, fontweight='bold')
plt.legend(fontsize=11, loc='best')
plt.grid(True, alpha=0.3)
plt.xlim(-5, 5)
plt.ylim(-1.5, 5)
plt.tight_layout()
plt.show()


## Why Do We Need Derivatives of Activation Functions?

### The Backpropagation Algorithm

**Derivatives are essential for training neural networks** through a process called **backpropagation** (backward propagation of errors). Here's why:

1. **How Neural Networks Learn:**
   - During training, the network makes predictions
   - It compares predictions to actual values (calculates error/loss)
   - It needs to adjust the weights to reduce this error
   - **Derivatives tell us which direction to adjust each weight** and by how much

2. **The Chain Rule:**
   - Neural networks are composed of layers: `Input → Layer1 → Layer2 → ... → Output`
   - Each layer applies: `output = activation_function(weighted_sum)`
   - To update weights in early layers, we need to propagate the error backward
   - This requires computing: `∂Error/∂Weight = ∂Error/∂Output × ∂Output/∂Activation × ∂Activation/∂Input × ∂Input/∂Weight`
   - **The `∂Activation/∂Input` part is the derivative of the activation function!**

3. **What the Derivative Tells Us:**
   - **Large derivative** = Strong signal to update weights (learning happens quickly)
   - **Small derivative** = Weak signal (learning is slow)
   - **Zero derivative** = No learning (weights don't update). When many small derivatives are multiplied across layers, the product shrinks toward zero: this is the "vanishing gradient" problem

4. **Example:**
   ```
   If ReLU(x) = max(0, x):
   - For x > 0: derivative = 1 → strong gradient, weights update normally
   - For x ≤ 0: derivative = 0 → no gradient, weights don't update (dead neuron)
   
   This is why Leaky ReLU helps: it has derivative = α (small but non-zero) for x ≤ 0
   ```
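
The chain rule described above can be worked through for a single sigmoid neuron. This is a sketch under stated assumptions: the weight, bias, input, target, and the squared-error loss are hypothetical choices, and the finite-difference check is included only to confirm the analytic result.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical neuron: a = sigmoid(w * inp + b), loss = 0.5 * (a - target)^2
w, b = 0.8, -0.2
inp, target = 1.5, 1.0

def loss(w_value):
    a_value = sigmoid(w_value * inp + b)
    return 0.5 * (a_value - target) ** 2

# Analytic gradient: one factor per link in the chain
z = w * inp + b
a = sigmoid(z)
dloss_da = a - target     # dLoss/dOutput
da_dz = a * (1 - a)       # derivative of the activation function
dz_dw = inp               # dWeightedSum/dWeight
grad_analytic = dloss_da * da_dz * dz_dw

# Numerical gradient by central differences, as a sanity check
eps = 1e-6
grad_numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(f"analytic: {grad_analytic:.8f}")
print(f"numeric:  {grad_numeric:.8f}")
```

The middle factor `da_dz` is the activation derivative. If it is close to zero, the whole product is close to zero regardless of how large the error term is.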

### Why Do We Need Non-Linear Activation Functions?

**Without non-linear functions, a deep network is no more expressive than a single linear layer!** Here's why:

1. **The Problem with Linear Functions:**
   - If all activation functions were linear (like `f(x) = x`), then:
   - `Layer1(x) = W₁x + b₁`
   - `Layer2(Layer1(x)) = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂)`
   - **No matter how many layers you stack, you still get a linear function!**
   - A single layer can do the same job → deep networks become pointless

2. **What Non-Linearity Enables:**
   - **Complex Decision Boundaries:** Non-linear functions allow networks to learn curved, complex boundaries between classes
   - **Feature Hierarchies:** Each layer can learn increasingly abstract features
   - **Universal Approximation:** With non-linear activations, a neural network can approximate any continuous function on a bounded input range to arbitrary accuracy (given enough neurons)

3. **Real-World Analogy:**
   - **Linear:** Can only draw straight lines to separate data
   - **Non-Linear:** Can draw curves, circles, and complex shapes to separate data
   - Most real-world problems require complex, non-linear decision boundaries

4. **Visual Example:**
   ```
   Linear: Can separate this?  ❌
   [●]  [○]  [●]
   [○]  [●]  [○]
   
   Non-Linear: Can separate this?  ✅
   [●]  [○]  [●]
   [○]  [●]  [○]
   (XOR problem - requires non-linearity!)
   ```

5. **Why Not Just Use Linear Functions?**
   - Linear functions can only model linear relationships
   - Real-world data is rarely linear (images, speech, text all have complex patterns)
   - Stacking more linear layers never helps: the composition is still linear, so no amount of depth can approximate a non-linear function
   - Non-linear activations make each layer more powerful

**Summary:**
- **Derivatives** → Enable backpropagation (how networks learn)
- **Non-linearity** → Enables learning complex patterns (what networks can learn)
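
The XOR case can be checked directly. The sketch below first fits the best possible linear model by least squares, then evaluates a tiny two-unit ReLU network. The network weights are picked by hand for illustration, not learned, and all variable names are assumptions of this example.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# The four XOR inputs and their labels
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0], dtype=float)

# Best linear model: least squares with a bias column
design = np.hstack([points, np.ones((4, 1))])
coef, *_ = np.linalg.lstsq(design, labels, rcond=None)
linear_pred = design @ coef

# Hand-picked ReLU network: both hidden units see x1 + x2, with different biases
hidden_w = np.array([[1.0, 1.0], [1.0, 1.0]])
hidden_b = np.array([0.0, -1.0])
out_w = np.array([1.0, -2.0])

hidden = relu(points @ hidden_w.T + hidden_b)
relu_pred = hidden @ out_w

print("labels:           ", labels)
print("linear prediction:", np.round(linear_pred, 3))
print("ReLU prediction:  ", relu_pred)
```

The linear model can do no better than predicting the same value for all four points, while two ReLU units are enough to reproduce the labels exactly.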


In [None]:
# Demonstration: Why Non-Linearity Matters
# Let's show what happens when you stack linear vs non-linear layers

print("=" * 60)
print("DEMONSTRATION: Linear vs Non-Linear Functions")
print("=" * 60)

# Simulate a simple 3-layer network
# The input is named x_in so it does not overwrite the plotting grid `x`
# that later cells still rely on.
x_in = np.array([1.0, 2.0, 3.0])
W1 = np.array([[0.5, 0.3, 0.2], [0.1, 0.4, 0.6]])
W2 = np.array([[0.7, 0.5], [0.3, 0.9]])
W3 = np.array([[0.2, 0.8]])

print("\n1. LINEAR ACTIVATION (f(x) = x):")
print("-" * 60)
# Linear: just pass through
layer1_linear = W1 @ x_in  # No activation
layer2_linear = W2 @ layer1_linear  # No activation
output_linear = W3 @ layer2_linear  # No activation
print(f"Input: {x_in}")
print(f"Output: {output_linear[0]:.4f}")

# This is equivalent to a single layer!
W_equivalent = W3 @ W2 @ W1
output_equivalent = W_equivalent @ x_in
print(f"\nEquivalent single layer output: {output_equivalent[0]:.4f}")
print("→ Same result! Multiple layers add no value with linear activations!")

print("\n2. NON-LINEAR ACTIVATION (ReLU):")
print("-" * 60)
# Non-linear: apply ReLU
layer1_nonlinear = relu(W1 @ x_in)
layer2_nonlinear = relu(W2 @ layer1_nonlinear)
output_nonlinear = W3 @ layer2_nonlinear
print(f"Input: {x_in}")
print(f"Layer 1 (after ReLU): {layer1_nonlinear}")
print(f"Layer 2 (after ReLU): {layer2_nonlinear}")
print(f"Output: {output_nonlinear[0]:.4f}")

# In general this CANNOT be reduced to a single layer: ReLU zeroes any negative
# pre-activation. Note that every weight and input in this demo is positive, so
# ReLU never clips here and this output matches the linear one.
print("\n→ Cannot be reduced to a single layer in general!")
print("→ (All values here are positive, so ReLU happens to leave them unchanged.)")
print("→ Each layer applies its own non-linear transformation!")
print("→ This is why deep networks with non-linear activations are powerful!")

print("\n" + "=" * 60)
print("KEY INSIGHT:")
print("Linear layers = Can be collapsed into one layer (no added expressive power)")
print("Non-linear layers = Each layer adds new capabilities (powerful)")
print("=" * 60)
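
In the demonstration above every weight and input is positive, so ReLU never clips anything and the ReLU output coincides with the linear one. The difference only shows up once some pre-activations are negative. The sketch below repeats the comparison with hypothetical mixed-sign weights (`A1`, `A2`, `A3` and the input `v` are made up for this example).

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hypothetical mixed-sign weights so that some pre-activations are negative
v = np.array([1.0, 2.0, 3.0])
A1 = np.array([[0.5, -0.3, 0.2], [-0.1, 0.4, -0.6]])
A2 = np.array([[0.7, -0.5], [0.3, 0.9]])
A3 = np.array([[0.2, 0.8]])

# Without activations the three layers collapse into one matrix
collapsed = A3 @ A2 @ A1
linear_out = (collapsed @ v)[0]

# With ReLU the negative pre-activation is clipped, so the outputs diverge
h1 = relu(A1 @ v)
h2 = relu(A2 @ h1)
nonlinear_out = (A3 @ h2)[0]

print(f"Layer 1 pre-activations: {A1 @ v}")
print(f"Collapsed linear output: {linear_out:.4f}")
print(f"ReLU network output:     {nonlinear_out:.4f}")
```

Here the second hidden unit of the first layer has a negative pre-activation, ReLU sets it to zero, and the final output no longer equals what the collapsed matrix produces.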


## Understanding Derivatives/Gradients

The derivative of an activation function is crucial for backpropagation. Let's visualize the derivatives:


In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

# Derivatives
def relu_derivative(x):
    return np.where(x > 0, 1, 0)

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

def leaky_relu_derivative(x, alpha=0.01):
    return np.where(x > 0, 1, alpha)

# Compute derivatives
d_relu = relu_derivative(x)
d_sigmoid = sigmoid_derivative(x)
d_tanh = tanh_derivative(x)
d_leaky_relu = leaky_relu_derivative(x)

# Plot derivatives
axes[0].plot(x, d_relu, 'b-', linewidth=2, label="ReLU'")
axes[0].set_title('ReLU Derivative', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Input (x)', fontsize=12)
axes[0].set_ylabel("f'(x)", fontsize=12)
axes[0].grid(True, alpha=0.3)
axes[0].set_xlim(-5, 5)
axes[0].set_ylim(-0.1, 1.1)

axes[1].plot(x, d_sigmoid, 'r-', linewidth=2, label="Sigmoid'")
axes[1].set_title('Sigmoid Derivative', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Input (x)', fontsize=12)
axes[1].set_ylabel("f'(x)", fontsize=12)
axes[1].grid(True, alpha=0.3)
axes[1].set_xlim(-5, 5)
axes[1].set_ylim(-0.1, 0.3)

axes[2].plot(x, d_tanh, 'g-', linewidth=2, label="Tanh'")
axes[2].set_title('Tanh Derivative', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Input (x)', fontsize=12)
axes[2].set_ylabel("f'(x)", fontsize=12)
axes[2].grid(True, alpha=0.3)
axes[2].set_xlim(-5, 5)
axes[2].set_ylim(-0.1, 1.1)

axes[3].plot(x, d_leaky_relu, 'm-', linewidth=2, label="Leaky ReLU'")
axes[3].set_title('Leaky ReLU Derivative', fontsize=14, fontweight='bold')
axes[3].set_xlabel('Input (x)', fontsize=12)
axes[3].set_ylabel("f'(x)", fontsize=12)
axes[3].grid(True, alpha=0.3)
axes[3].set_xlim(-5, 5)
axes[3].set_ylim(-0.01, 1.1)

plt.tight_layout()
plt.show()

print("Key Observations:")
print("1. ReLU: Gradient is 1 for x > 0, 0 for x ≤ 0 (zero gradient for negatives: the dying ReLU problem)")
print("2. Sigmoid: Gradient is largest near x=0, very small at extremes (vanishing gradient problem)")
print("3. Tanh: Similar to sigmoid but zero-centered, stronger gradients in center")
print("4. Leaky ReLU: Gradient is 1 for x > 0, α for x ≤ 0 (solves dying ReLU problem)")
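
A common way to sanity-check hand-written derivatives is to compare them with central finite differences. The sketch below does this for the four functions, re-declaring them locally so it runs on its own. The sample points, the step size `eps`, and the dictionary layout are assumptions of this example.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

# (function, analytic derivative) pairs, mirroring the definitions in this notebook
pairs = {
    "relu": (lambda z: np.maximum(0, z), lambda z: np.where(z > 0, 1.0, 0.0)),
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1 - sigmoid(z))),
    "tanh": (np.tanh, lambda z: 1 - np.tanh(z) ** 2),
    "leaky_relu": (leaky_relu, lambda z: np.where(z > 0, 1.0, 0.01)),
}

# Sample points that avoid z = 0, where ReLU and Leaky ReLU have a kink
zs = np.array([-3.0, -1.0, -0.5, 0.5, 1.0, 3.0])
eps = 1e-6

max_errors = {}
for name, (f, df) in pairs.items():
    numeric = (f(zs + eps) - f(zs - eps)) / (2 * eps)
    max_errors[name] = float(np.max(np.abs(numeric - df(zs))))
    print(f"{name:11s} max |numeric - analytic| = {max_errors[name]:.2e}")
```

The check skips z = 0 on purpose: ReLU and Leaky ReLU are not differentiable there, so a finite difference across the kink would not match either one-sided slope.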


## Practical Examples

Let's see how these functions behave with different input values:


In [None]:
# Test with various inputs
test_inputs = np.array([-5, -2, -1, 0, 1, 2, 5])

print("Input\tReLU\t\tSigmoid\t\tTanh\t\tLeaky ReLU")
print("-" * 70)
for val in test_inputs:
    print(f"{val:5.1f}\t{relu(val):8.4f}\t{sigmoid(val):8.4f}\t{tanh(val):8.4f}\t{leaky_relu(val):8.4f}")


## Summary Table

| Function | Range | Zero-Centered | Vanishing Gradient | Common Use Case |
|----------|-------|---------------|-------------------|-----------------|
| **ReLU** | [0, ∞) | No | No for x > 0; zero gradient for x ≤ 0 (dying ReLU) | Hidden layers in CNNs/DNNs |
| **Sigmoid** | (0, 1) | No | Yes (at extremes) | Output layer (binary classification) |
| **Tanh** | (-1, 1) | Yes | Yes (at extremes) | Hidden layers (RNNs/LSTMs) |
| **Leaky ReLU** | (-∞, ∞) | No | No | Alternative to ReLU |

## Key Takeaways

1. **ReLU** is the most popular choice for hidden layers due to its simplicity and effectiveness.
2. **Sigmoid** is best for output layers when you need probability outputs (0 to 1).
3. **Tanh** is zero-centered, making it sometimes better than sigmoid for hidden layers.
4. **Leaky ReLU** mitigates the "dying ReLU" problem by keeping a small non-zero gradient (α) for negative inputs.

## When to Use Which?

- **Hidden layers**: ReLU or Leaky ReLU (most common)
- **Output layer (binary classification)**: Sigmoid
- **Output layer (multi-class classification)**: Softmax (not covered here)
- **RNNs/LSTMs**: Tanh or ReLU variants

## Experiment!

Try modifying the Leaky ReLU alpha parameter or test these functions with your own data!


In [None]:
# Experiment: Try different alpha values for Leaky ReLU
alphas = [0.01, 0.1, 0.3]
plt.figure(figsize=(10, 6))

for alpha in alphas:
    y = leaky_relu(x, alpha=alpha)
    plt.plot(x, y, linewidth=2, label=f'Leaky ReLU (α={alpha})')

plt.plot(x, y_relu, 'b--', linewidth=2, label='Standard ReLU', alpha=0.5)
plt.axhline(y=0, color='k', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='k', linestyle='--', alpha=0.3)
plt.xlabel('Input (x)', fontsize=12)
plt.ylabel('Output f(x)', fontsize=12)
plt.title('Leaky ReLU with Different Alpha Values', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xlim(-5, 5)
plt.ylim(-1.5, 5)
plt.tight_layout()
plt.show()

print("Notice how larger alpha values allow more negative output, making it closer to a linear function.")
