# Task 4.2: Activation Function Study

**Module:** 4 - Neural Network Fundamentals  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Implement 6 common activation functions from scratch
- [ ] Understand why each activation exists and when to use it
- [ ] Visualize the vanishing gradient problem
- [ ] Compare training dynamics with different activations
- [ ] Choose the right activation for your task

---

## 📚 Prerequisites

- Completed: Notebook 01 (NumPy Neural Network)
- Knowledge of: Derivatives, chain rule

---

## 🌍 Real-World Context

**Why do activation functions matter?**

The choice of activation function has evolved over decades:
- **1980s-90s:** Sigmoid and Tanh dominated
- **2010s:** ReLU revolutionized deep learning
- **2020s:** GELU and SiLU power modern transformers

The right activation can mean the difference between a network that trains in minutes vs. one that never converges!

---

## 🧒 ELI5: What Are Activation Functions?

> **Imagine you're building a LEGO house with special bricks.**
>
> Without activation functions, all your bricks would be straight lines - you could only build straight walls. That's boring!
>
> **Activation functions are like magic bricks that can bend and curve.** They let you build arches, curves, and complex shapes.
>
> Different activations = different bending abilities:
> - **Sigmoid**: Smoothly curves between 0 and 1 (like a gentle slide)
> - **ReLU**: Simple bend at zero (like a hockey stick)
> - **GELU**: Smooth bend with a probabilistic feel (like a sophisticated curve)
>
> Without these "bending" abilities, neural networks would just be fancy linear regression!
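
The "fancy linear regression" claim can be checked numerically. The sketch below uses arbitrary random weights (the shapes and the seed are illustrative assumptions, not part of this notebook's model) to show that two stacked linear layers equal one linear layer, and that a ReLU in between breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers with no activation in between.
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

x = rng.normal(size=(5, 4))
two_layers = (x @ W1 + b1) @ W2 + b2

# The same computation folded into a single linear layer.
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = x @ W_combined + b_combined

collapses = np.allclose(two_layers, one_layer)

# Inserting a ReLU between the layers breaks the equivalence.
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
relu_differs = not np.allclose(with_relu, one_layer)

print(f"linear stack collapses: {collapses}, ReLU version differs: {relu_differs}")
```

However many linear layers are stacked, the same folding applies, which is why depth only adds expressive power once a nonlinearity sits between the layers.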

---

## Setup

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from typing import Tuple, Callable, Dict, List
import time
import sys
import os
from pathlib import Path

# Add scripts directory to path (robust approach)
notebook_dir = Path().resolve()
if notebook_dir.name == 'notebooks':
    scripts_dir = notebook_dir.parent / 'scripts'
else:
    scripts_dir = notebook_dir / 'scripts'
    if not scripts_dir.exists():
        scripts_dir = notebook_dir.parent / 'scripts'

if scripts_dir.exists():
    sys.path.insert(0, str(scripts_dir))

np.random.seed(42)
plt.style.use('default')
%matplotlib inline

print("Setup complete!")

---

## Part 1: Implementing All 6 Activation Functions

Let's implement each activation with its forward pass and gradient.

In [None]:
class Sigmoid:
    """
    Sigmoid activation: σ(x) = 1 / (1 + e^(-x))
    
    Output range: (0, 1)
    
    ELI5: Sigmoid squashes any number into a probability between 0 and 1.
    Very negative numbers become almost 0, very positive become almost 1.
    
    Use cases:
    - Binary classification output (probability of class 1)
    - Gates in LSTMs and GRUs
    
    Problems:
    - Vanishing gradient (gradient max is 0.25!)
    - Output not zero-centered
    """
    
    def __init__(self):
        self.cache = {}
        self.name = 'Sigmoid'
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        x = np.clip(x, -500, 500)  # Prevent overflow
        out = 1.0 / (1.0 + np.exp(-x))
        self.cache['out'] = out
        return out
    
    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        out = self.cache['out']
        # Derivative: σ(x) * (1 - σ(x))
        grad = out * (1 - out)
        return grad_output * grad
    
    def gradient(self, x: np.ndarray) -> np.ndarray:
        """Compute gradient directly (for visualization)."""
        self.forward(x)
        return self.backward(np.ones_like(x))
    
    def __call__(self, x):
        return self.forward(x)
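
Clipping the input is one way to avoid overflow in `exp`. Another common approach, sketched here as an alternative rather than as what the `Sigmoid` class above does, is to branch on the sign of `x` so that `exp` is only ever called on non-positive numbers. The function name `stable_sigmoid` is a hypothetical helper.

```python
import numpy as np

def stable_sigmoid(x: np.ndarray) -> np.ndarray:
    """Sigmoid that never exponentiates a positive number."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) <= 1, so this form cannot overflow.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, rewrite as e^x / (1 + e^x); exp(x) < 1 here.
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)
```

Both approaches agree to floating-point precision on ordinary inputs; the branching version simply needs no clipping threshold.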

In [None]:
class Tanh:
    """
    Hyperbolic tangent: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
    
    Output range: (-1, 1)
    
    ELI5: Like sigmoid, but centered around zero.
    Zero-centered outputs help with learning!
    
    Use cases:
    - Hidden layers in older networks
    - Output layer when you need values in [-1, 1]
    
    Problems:
    - Still has vanishing gradient problem
    - Gradient reaches 1.0 only at x=0 and shrinks quickly away from it
    """
    
    def __init__(self):
        self.cache = {}
        self.name = 'Tanh'
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        out = np.tanh(x)
        self.cache['out'] = out
        return out
    
    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        out = self.cache['out']
        # Derivative: 1 - tanh(x)^2
        grad = 1 - out ** 2
        return grad_output * grad
    
    def gradient(self, x: np.ndarray) -> np.ndarray:
        self.forward(x)
        return self.backward(np.ones_like(x))
    
    def __call__(self, x):
        return self.forward(x)

In [None]:
class ReLU:
    """
    Rectified Linear Unit: ReLU(x) = max(0, x)
    
    Output range: [0, ∞)
    
    ELI5: A simple on/off switch. Positive values pass through,
    negative values are blocked. Simple but surprisingly effective!
    
    Use cases:
    - Default choice for hidden layers
    - CNNs, MLPs, most modern architectures
    
    Problems:
    - "Dead ReLU": Neurons that always output 0 can never recover
    - Not zero-centered
    
    Why it works:
    - No vanishing gradient for positive inputs
    - Computationally very fast
    - Sparse activations (many zeros) = efficient
    """
    
    def __init__(self):
        self.cache = {}
        self.name = 'ReLU'
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        self.cache['x'] = x
        return np.maximum(0, x)
    
    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        x = self.cache['x']
        # Derivative: 1 if x > 0, else 0
        grad = (x > 0).astype(float)
        return grad_output * grad
    
    def gradient(self, x: np.ndarray) -> np.ndarray:
        self.forward(x)
        return self.backward(np.ones_like(x))
    
    def __call__(self, x):
        return self.forward(x)
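
The "dead ReLU" problem from the docstring can be reproduced with a single neuron. In this sketch the bias of `-100.0` is a hypothetical value standing in for a bad update that pushed the neuron far into the negative region; the data is random.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))   # a batch of inputs
w = rng.normal(size=10)
b = -100.0                       # hypothetical bias after a bad update

z = X @ w + b                    # pre-activations, all strongly negative

relu_grad = (z > 0).astype(float)          # ReLU local gradient
leaky_grad = np.where(z > 0, 1.0, 0.01)    # Leaky ReLU, alpha = 0.01

# Gradient of the neuron's output with respect to w, averaged over the batch.
relu_dw = (relu_grad[:, None] * X).mean(axis=0)
leaky_dw = (leaky_grad[:, None] * X).mean(axis=0)

relu_is_dead = bool(np.all(relu_dw == 0))
leaky_still_learns = bool(np.any(leaky_dw != 0))

print(f"ReLU weight gradient all zero: {relu_is_dead}")
print(f"Leaky ReLU weight gradient non-zero: {leaky_still_learns}")
```

Because every example in the batch yields a zero local gradient, no weight update can move the ReLU neuron back into the active region, while the leaky variant still receives a small signal.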

In [None]:
class LeakyReLU:
    """
    Leaky ReLU: f(x) = x if x > 0, else alpha * x
    
    Output range: (-∞, ∞)
    
    ELI5: Like ReLU, but the valve isn't fully closed for negative values.
    A tiny trickle can still get through, preventing "dead neurons".
    
    Use cases:
    - When you're worried about dead ReLUs
    - GANs often use Leaky ReLU
    
    Typical alpha: 0.01 or 0.1
    """
    
    def __init__(self, alpha: float = 0.01):
        self.alpha = alpha
        self.cache = {}
        self.name = f'LeakyReLU(α={alpha})'
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        self.cache['x'] = x
        return np.where(x > 0, x, self.alpha * x)
    
    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        x = self.cache['x']
        grad = np.where(x > 0, 1.0, self.alpha)
        return grad_output * grad
    
    def gradient(self, x: np.ndarray) -> np.ndarray:
        self.forward(x)
        return self.backward(np.ones_like(x))
    
    def __call__(self, x):
        return self.forward(x)

In [None]:
class GELU:
    """
    Gaussian Error Linear Unit: GELU(x) = x * Φ(x)
    
    Where Φ is the CDF of the standard normal distribution.
    
    Approximation: 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
    
    Output range: (-0.17, ∞)
    
    ELI5: GELU is like a smarter ReLU. Instead of a hard cutoff,
    it uses probability to decide how much of each input to keep.
    Values near zero might partially get through - it's probabilistic!
    
    Use cases:
    - BERT, GPT, and most modern transformers
    - State-of-the-art NLP models
    
    Why it works:
    - Smooth (helps optimization)
    - Non-monotonic (the dip allows for interesting dynamics)
    - Works great with layer normalization
    """
    
    def __init__(self):
        self.cache = {}
        self.name = 'GELU'
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        self.cache['x'] = x
        # Tanh approximation
        return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))
    
    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        x = self.cache['x']
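        # Approximate derivative Φ(x) + x * φ(x): the CDF Φ uses the tanh
        # approximation, while the PDF φ is the exact standard normal density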
        cdf = 0.5 * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))
        pdf = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
        grad = cdf + x * pdf
        return grad_output * grad
    
    def gradient(self, x: np.ndarray) -> np.ndarray:
        self.forward(x)
        return self.backward(np.ones_like(x))
    
    def __call__(self, x):
        return self.forward(x)
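
To see how close the tanh approximation is to the exact definition `x * Φ(x)`, the two can be compared directly. This sketch writes Φ with Python's `math.erf`; the helper names `gelu_exact` and `gelu_tanh` are illustrative and not part of the `GELU` class above.

```python
import math
import numpy as np

def gelu_exact(x: np.ndarray) -> np.ndarray:
    """GELU(x) = x * Phi(x), with Phi written via the error function."""
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def gelu_tanh(x: np.ndarray) -> np.ndarray:
    """Tanh approximation, the same formula used in GELU.forward above."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.linspace(-4.0, 4.0, 801)
max_abs_error = float(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))
print(f"max |exact - tanh approx| on [-4, 4]: {max_abs_error:.2e}")
print(f"minimum of exact GELU on the grid: {gelu_exact(x).min():.4f}")
```

The second print also lets you confirm the lower end of the output range quoted in the docstring.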

In [None]:
class SiLU:
    """
    Sigmoid Linear Unit (also known as Swish): SiLU(x) = x * σ(x)
    
    Output range: (-0.28, ∞)
    
    ELI5: SiLU multiplies each value by its own sigmoid.
    It's like asking "how confident are you?" and scaling the value
    by that confidence. Self-gated activation!
    
    Use cases:
    - EfficientNet, MobileNetV3
    - Many modern architectures
    - LLaMA and other LLMs
    
    Why it works:
    - Smooth and non-monotonic
    - Self-gating provides regularization
    - Performs well across many tasks
    """
    
    def __init__(self):
        self.cache = {}
        self.name = 'SiLU/Swish'
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        self.cache['x'] = x
        sigmoid = 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))
        self.cache['sigmoid'] = sigmoid
        return x * sigmoid
    
    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        x = self.cache['x']
        sigmoid = self.cache['sigmoid']
        # Derivative: σ(x) + x * σ(x) * (1 - σ(x))
        grad = sigmoid + x * sigmoid * (1 - sigmoid)
        return grad_output * grad
    
    def gradient(self, x: np.ndarray) -> np.ndarray:
        self.forward(x)
        return self.backward(np.ones_like(x))
    
    def __call__(self, x):
        return self.forward(x)
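
A quick way to gain confidence in hand-written derivatives is a finite-difference check. The sketch below re-states the sigmoid, tanh, and SiLU formulas as plain functions (so that it runs on its own) and compares each analytic derivative with a central difference; `numerical_grad` is a hypothetical helper.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    """Central finite difference, applied elementwise."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# (function, analytic derivative) pairs matching the formulas in the classes above.
checks = {
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1 - sigmoid(x))),
    "tanh": (np.tanh, lambda x: 1 - np.tanh(x) ** 2),
    "silu": (lambda x: x * sigmoid(x),
             lambda x: sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))),
}

x = np.linspace(-3.0, 3.0, 61)
max_errors = {
    name: float(np.max(np.abs(numerical_grad(f, x) - df(x))))
    for name, (f, df) in checks.items()
}
for name, err in max_errors.items():
    print(f"{name:8s} max |numeric - analytic| = {err:.2e}")
```

The same check can be pointed at the classes in this notebook by passing `activation.forward` as `f` and `activation.gradient` as `df`. ReLU and Leaky ReLU are left out because their derivative is undefined at the kink at zero.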

---

## Part 2: Visualizing the Activations

Let's plot each activation function and its derivative side by side.

In [None]:
# Create all activations
activations = [
    Sigmoid(),
    Tanh(),
    ReLU(),
    LeakyReLU(alpha=0.1),
    GELU(),
    SiLU()
]

# Generate x values
x = np.linspace(-4, 4, 200)

# Create the plot
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b']

for idx, (ax, activation) in enumerate(zip(axes.flat, activations)):
    color = colors[idx]
    
    # Compute forward and gradient
    y = activation(x.copy())
    grad = activation.gradient(x.copy())
    
    # Plot activation
    ax.plot(x, y, color=color, linewidth=2.5, label='f(x)')
    
    # Plot gradient
    ax.plot(x, grad, color=color, linewidth=2, linestyle='--', alpha=0.7, label="f'(x)")
    
    # Reference lines
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    
    # Styling
    ax.set_title(activation.name, fontsize=14, fontweight='bold')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.legend(loc='upper left')
    ax.grid(True, alpha=0.3)
    ax.set_xlim(-4, 4)
    ax.set_ylim(-2, 4)

plt.suptitle('Activation Functions and Their Derivatives', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

### 🔍 What to Notice:

1. **Sigmoid/Tanh gradients flatten at extremes** → Vanishing gradients!
2. **ReLU has constant gradient for positive inputs** → No vanishing!
3. **GELU and SiLU have smooth transitions** → Easier optimization
4. **Leaky ReLU never has zero gradient** → No dead neurons

---

## Part 3: The Vanishing Gradient Problem

Let's visualize why sigmoid and tanh cause problems in deep networks.

In [None]:
def simulate_gradient_flow(activation, num_layers: int = 10) -> List[float]:
    """
    Simulate gradient flowing backward through multiple layers.
    
    In backprop, gradients get multiplied at each layer.
    If local gradient < 1, the total gradient shrinks exponentially!
    """
    # Start with gradient = 1
    gradient = 1.0
    gradient_history = [gradient]
    
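    # Use a small positive pre-activation as the "typical" input.
    # (At exactly x = 0 the ReLU class above returns a gradient of 0,
    # which would make ReLU look as if it vanished after one layer.)
    x_typical = np.array([0.5])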
    
    for layer in range(num_layers):
        # Get local gradient
        local_grad = activation.gradient(x_typical)[0]
        
        # Chain rule: multiply gradients
        gradient *= local_grad
        gradient_history.append(gradient)
    
    return gradient_history

In [None]:
# Compare gradient flow through 20 layers
num_layers = 20

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Activations to compare
test_activations = [Sigmoid(), Tanh(), ReLU(), GELU()]
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#9467bd']

for activation, color in zip(test_activations, colors):
    history = simulate_gradient_flow(activation, num_layers)
    
    # Linear scale
    axes[0].plot(history, color=color, linewidth=2, marker='o', 
                 markersize=4, label=activation.name)
    
    # Log scale
    axes[1].semilogy(history, color=color, linewidth=2, marker='o', 
                     markersize=4, label=activation.name)

axes[0].set_xlabel('Layer (from output to input)', fontsize=12)
axes[0].set_ylabel('Gradient Magnitude', fontsize=12)
axes[0].set_title('Gradient Flow (Linear Scale)', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].set_xlabel('Layer (from output to input)', fontsize=12)
axes[1].set_ylabel('Gradient Magnitude (log scale)', fontsize=12)
axes[1].set_title('Gradient Flow (Log Scale)', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📉 Gradient after 20 layers:")
for activation in test_activations:
    history = simulate_gradient_flow(activation, num_layers)
    print(f"   {activation.name:15s}: {history[-1]:.2e}")

### 💡 Key Insight: Why Sigmoid/Tanh Vanish

- Sigmoid's max gradient = 0.25 (at x=0)
- After 20 layers the product is at most 0.25^20 ≈ 10^-12 (essentially zero!)
- Tanh's gradient equals 1 only at x=0 and is below 1 everywhere else, so it shrinks too, just more slowly
- This is a major reason deep sigmoid networks were so hard to train in the 1990s

**ReLU's brilliance:** Gradient = 1 for positive inputs → No vanishing!
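
The arithmetic behind the sigmoid bound is short enough to spell out. The `1e-6` threshold below is an arbitrary illustrative choice, and the bound ignores the weight matrices, which also multiply the gradient in a real network.

```python
import math

max_sigmoid_grad = 0.25   # sigma'(0), the largest value the derivative takes
num_layers = 20

# Upper bound on the product of 20 sigmoid derivatives.
bound = max_sigmoid_grad ** num_layers

# Number of layers before the bound drops below a hypothetical 1e-6 threshold.
threshold = 1e-6
layers_to_threshold = math.ceil(math.log(threshold) / math.log(max_sigmoid_grad))

print(f"0.25^{num_layers} = {bound:.2e}")
print(f"bound falls below {threshold:g} after {layers_to_threshold} layers")
```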

---

## Part 4: Training Comparison

Let's train the same architecture with different activations and compare.

In [None]:
# First, let's create a flexible MLP that accepts any activation

class FlexibleMLP:
    """
    MLP that can use any activation function.
    """
    
    def __init__(self, layer_sizes: List[int], activation_class):
        self.layers = []
        self.activations = []
        
        for i in range(len(layer_sizes) - 1):
            # Linear layer
            W = np.random.randn(layer_sizes[i], layer_sizes[i + 1]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros(layer_sizes[i + 1])
            self.layers.append({'W': W, 'b': b, 'cache': {}, 'dW': None, 'db': None})
            
            # Add activation for all but last layer
            if i < len(layer_sizes) - 2:
                self.activations.append(activation_class())
    
    def forward(self, X: np.ndarray) -> np.ndarray:
        out = X
        
        for i, layer in enumerate(self.layers[:-1]):
            layer['cache']['X'] = out
            out = out @ layer['W'] + layer['b']
            out = self.activations[i](out)
        
        # Last layer (no activation, handled by softmax)
        self.layers[-1]['cache']['X'] = out
        out = out @ self.layers[-1]['W'] + self.layers[-1]['b']
        
        # Softmax
        out_shifted = out - np.max(out, axis=1, keepdims=True)
        exp_out = np.exp(out_shifted)
        self.probs = exp_out / np.sum(exp_out, axis=1, keepdims=True)
        
        return self.probs
    
    def backward(self, targets: np.ndarray, learning_rate: float = 0.01):
        batch_size = targets.shape[0]
        
        # Gradient from softmax + cross-entropy
        grad = self.probs.copy()
        grad[np.arange(batch_size), targets] -= 1
        
        # Backward through layers
        for i in range(len(self.layers) - 1, -1, -1):
            layer = self.layers[i]
            X = layer['cache']['X']
            
            # Compute gradients
            layer['dW'] = X.T @ grad / batch_size
            layer['db'] = np.mean(grad, axis=0)
            
            # Gradient with respect to this layer's input (uses W before the update)
            grad = grad @ layer['W'].T
            
            # Apply the preceding activation's gradient (layer 0 reads the raw input, so it has none)
            if i > 0:
                grad = self.activations[i - 1].backward(grad)
            
            # Update weights
            layer['W'] -= learning_rate * layer['dW']
            layer['b'] -= learning_rate * layer['db']
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        probs = self.forward(X)
        return np.argmax(probs, axis=1)
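
The `backward` method starts from `probs - one_hot`, the well-known gradient of cross-entropy with respect to the logits when softmax is applied first. The sketch below verifies that identity numerically on a tiny random batch; the helper names and shapes are illustrative. Note that `FlexibleMLP` applies the `1 / batch_size` factor later (in `dW` and `db`), whereas here it is folded into the gradient directly.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

def cross_entropy(logits, targets):
    probs = softmax(logits)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
targets = np.array([0, 2, 1, 2])

# Analytic gradient: (probs - one_hot) / batch_size.
analytic = softmax(logits)
analytic[np.arange(len(targets)), targets] -= 1
analytic /= len(targets)

# Numerical gradient, one logit at a time.
numeric = np.zeros_like(logits)
eps = 1e-6
for i in range(logits.shape[0]):
    for j in range(logits.shape[1]):
        bumped_up, bumped_down = logits.copy(), logits.copy()
        bumped_up[i, j] += eps
        bumped_down[i, j] -= eps
        numeric[i, j] = (cross_entropy(bumped_up, targets)
                         - cross_entropy(bumped_down, targets)) / (2 * eps)

max_diff = float(np.max(np.abs(analytic - numeric)))
print(f"max |analytic - numeric| = {max_diff:.2e}")
```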

In [None]:
# Load MNIST
import gzip
import urllib.request

def load_mnist(path='../data'):
    os.makedirs(path, exist_ok=True)
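    # This host has at times refused automated downloads; if the request fails,
    # place the four .gz files listed below in `path` manually.
    base_url = 'http://yann.lecun.com/exdb/mnist/'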
    files = {
        'train_images': 'train-images-idx3-ubyte.gz',
        'train_labels': 'train-labels-idx1-ubyte.gz',
        'test_images': 't10k-images-idx3-ubyte.gz',
        'test_labels': 't10k-labels-idx1-ubyte.gz'
    }
    
    def download(filename):
        filepath = os.path.join(path, filename)
        if not os.path.exists(filepath):
            print(f"Downloading {filename}...")
            urllib.request.urlretrieve(base_url + filename, filepath)
        return filepath
    
    def load_images(fp):
        with gzip.open(fp, 'rb') as f:
            f.read(16)
            return np.frombuffer(f.read(), dtype=np.uint8).reshape(-1, 784).astype(np.float32) / 255.0
    
    def load_labels(fp):
        with gzip.open(fp, 'rb') as f:
            f.read(8)
            return np.frombuffer(f.read(), dtype=np.uint8)
    
    X_train = load_images(download(files['train_images']))
    y_train = load_labels(download(files['train_labels']))
    X_test = load_images(download(files['test_images']))
    y_test = load_labels(download(files['test_labels']))
    
    return X_train, y_train, X_test, y_test

print("Loading MNIST...")
X_train, y_train, X_test, y_test = load_mnist()

# Use subset for faster comparison
X_train_subset = X_train[:10000]
y_train_subset = y_train[:10000]
print(f"Using {len(X_train_subset)} training samples for comparison")

In [None]:
def train_and_evaluate(activation_class, epochs=5, lr=0.1):
    """
    Train a model with given activation and return history.
    """
    np.random.seed(42)
    
    model = FlexibleMLP([784, 256, 128, 10], activation_class)
    
    history = {'loss': [], 'accuracy': []}
    batch_size = 64
    
    for epoch in range(epochs):
        # Shuffle data
        indices = np.random.permutation(len(X_train_subset))
        epoch_loss = 0
        n_batches = 0
        
        for start in range(0, len(X_train_subset), batch_size):
            batch_idx = indices[start:start + batch_size]
            X_batch = X_train_subset[batch_idx]
            y_batch = y_train_subset[batch_idx]
            
            # Forward
            probs = model.forward(X_batch)
            
            # Loss
            loss = -np.mean(np.log(probs[np.arange(len(y_batch)), y_batch] + 1e-10))
            epoch_loss += loss
            n_batches += 1
            
            # Backward
            model.backward(y_batch, lr)
        
        # Record metrics
        avg_loss = epoch_loss / n_batches
        preds = model.predict(X_test[:1000])
        accuracy = np.mean(preds == y_test[:1000])
        
        history['loss'].append(avg_loss)
        history['accuracy'].append(accuracy)
    
    return history

In [None]:
# Train with each activation
print("🏋️ Training with different activations...")
print("=" * 50)

activations_to_test = [
    ('Sigmoid', Sigmoid),
    ('Tanh', Tanh),
    ('ReLU', ReLU),
    ('LeakyReLU', lambda: LeakyReLU(0.1)),
    ('GELU', GELU),
    ('SiLU', SiLU)
]

results = {}

for name, activation_class in activations_to_test:
    start_time = time.time()
    
    # Each entry is a zero-argument factory (a class or a lambda), which is
    # all FlexibleMLP needs, so no wrapper class is required.
    history = train_and_evaluate(activation_class, epochs=5, lr=0.1)
    
    elapsed = time.time() - start_time
    results[name] = history
    
    print(f"{name:12s} | Final Acc: {history['accuracy'][-1]:.2%} | "
          f"Final Loss: {history['loss'][-1]:.4f} | Time: {elapsed:.2f}s")

print("=" * 50)

In [None]:
# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

colors = plt.cm.tab10(np.linspace(0, 1, len(results)))

for (name, history), color in zip(results.items(), colors):
    axes[0].plot(history['loss'], color=color, linewidth=2, marker='o', 
                 markersize=5, label=name)
    axes[1].plot(history['accuracy'], color=color, linewidth=2, marker='o', 
                 markersize=5, label=name)

axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Training Loss by Activation', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Accuracy', fontsize=12)
axes[1].set_title('Test Accuracy by Activation', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Part 5: Recommendations Summary

In [None]:
# Create a summary table
print("\n" + "=" * 80)
print("                    ACTIVATION FUNCTION RECOMMENDATIONS")
print("=" * 80)

recommendations = [
    ("ReLU", "Default choice for most networks", "CNNs, MLPs, general hidden layers"),
    ("LeakyReLU", "When worried about dead neurons", "GANs, very deep networks"),
    ("GELU", "Modern NLP and transformers", "BERT, GPT, ViT"),
    ("SiLU/Swish", "Modern efficient architectures", "EfficientNet, MobileNet, LLaMA"),
    ("Sigmoid", "Binary classification output, gates", "Final layer for binary prediction, LSTM/GRU gates"),
    ("Tanh", "When output needs to be in [-1,1]", "RNN hidden states, LSTM cell candidates")
]

print(f"{'Activation':<15} {'When to Use':<35} {'Common Applications'}")
print("-" * 80)
for name, when, apps in recommendations:
    print(f"{name:<15} {when:<35} {apps}")

print("\n" + "=" * 80)
print("                         KEY TAKEAWAYS")
print("=" * 80)
print("""
1. Avoid Sigmoid/Tanh in hidden layers of deep feedforward networks (vanishing gradients)

2. Start with ReLU - it's simple, fast, and works well

3. For transformers, use GELU (it's what BERT/GPT use)

4. For efficient models, try SiLU/Swish

5. If you see "dead neurons" (many zeros), switch to LeakyReLU

6. When reusing a published architecture, keep the activation it was designed with
""")

---

## ⚠️ Common Mistakes

### Mistake 1: Using Sigmoid in hidden layers

```python
# ❌ Wrong - vanishing gradients
model = Sequential([
    Linear(784, 256), Sigmoid(),
    Linear(256, 128), Sigmoid(),
    Linear(128, 10)
])

# ✅ Right
model = Sequential([
    Linear(784, 256), ReLU(),
    Linear(256, 128), ReLU(),
    Linear(128, 10)
])
```

### Mistake 2: Forgetting activation altogether

```python
# ❌ Wrong - without a nonlinearity, the two layers collapse into one linear map!
out = Linear(784, 256)(x)
out = Linear(256, 10)(out)

# ✅ Right
out = Linear(784, 256)(x)
out = ReLU()(out)  # Don't forget this!
out = Linear(256, 10)(out)
```

### Mistake 3: Wrong activation for output layer

```python
# For multi-class classification:
# ✅ Use Softmax on final layer (gives probabilities that sum to 1)

# For binary classification:
# ✅ Use Sigmoid on final layer (gives probability of class 1)

# For regression:
# ✅ Use no activation (or linear) on final layer
```
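
As a rough illustration of those three output choices, here is a NumPy sketch applied to made-up logits (the values are arbitrary and chosen only to be easy to read).

```python
import numpy as np

logits = np.array([[2.0, 1.0, -1.0],
                   [0.5, 0.5, 0.5]])

# Multi-class: softmax turns each row into probabilities that sum to 1.
shifted = logits - logits.max(axis=1, keepdims=True)
softmax_probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# Binary: sigmoid maps a single logit per example to P(class 1).
binary_logits = np.array([-2.0, 0.0, 3.0])
sigmoid_probs = 1.0 / (1.0 + np.exp(-binary_logits))

# Regression: the final linear output is used as-is.
regression_output = logits[:, 0]

print("softmax row sums:", softmax_probs.sum(axis=1))
print("sigmoid probabilities:", sigmoid_probs)
print("regression output:", regression_output)
```

Applying sigmoid independently to each of several logits does not produce values that sum to 1, which is why softmax is used when exactly one class is correct.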

---

## ✋ Try It Yourself

### Exercise 1: Implement PReLU

PReLU (Parametric ReLU) is like LeakyReLU, but the slope for negative values is learned:

$$\text{PReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}$$

Where $\alpha$ is a learnable parameter!

<details>
<summary>💡 Hint</summary>
Start with LeakyReLU, but make alpha a trainable parameter.
You'll need to compute the gradient with respect to alpha too!
</details>

In [None]:
# Your code here: Implement PReLU
class PReLU:
    def __init__(self, alpha_init=0.01):
        self.alpha = alpha_init
        # TODO: Implement forward and backward
        pass

### Exercise 2: Experiment with Deep Sigmoid Network

Try training a 10-layer network with Sigmoid activations. What happens?
Then try with ReLU. Compare the gradients in early layers.

In [None]:
# Your code here: Compare deep networks

---

## 🎉 Checkpoint

You've learned:

- ✅ How to implement 6 activation functions from scratch
- ✅ Why the vanishing gradient problem occurs
- ✅ How to choose the right activation for your task
- ✅ The historical evolution of activation functions

---

## 📖 Further Reading

- [Delving Deep into Rectifiers (He et al.)](https://arxiv.org/abs/1502.01852) - PReLU and He initialization
- [GELU Paper](https://arxiv.org/abs/1606.08415) - Gaussian Error Linear Units
- [Swish/SiLU Paper](https://arxiv.org/abs/1710.05941) - Searching for Activation Functions

---

## 🧹 Cleanup

In [None]:
import gc
gc.collect()

print("✅ Cleanup complete!")
print("\n🎯 Next: Proceed to notebook 03-regularization-experiments.ipynb")