# 🧪 Concept 4: Activation Functions Analysis

## Deep Neural Network Architectures - Week 5
**Module:** 2 - Optimization and Regularization  
**Topic:** Comparing Activation Functions and Their Impact on Gradients

---

## 📋 Learning Objectives
By the end of this notebook, you will:
1. **Analyze** mathematical properties of different activation functions
2. **Compare** gradient flow characteristics across activations
3. **Understand** why ReLU revolutionized deep learning
4. **Choose** appropriate activations for different scenarios

---

## 🚰 The Water Flow Analogy

Think of activation functions as different types of pipes in a water system:

**Sigmoid Pipe:**
- **Adjustable valve** that reduces flow
- **Maximum flow:** 25% of input pressure
- **Multiple pipes in series:** Each pipe passes at most 25%, so the flow shrinks to almost nothing

**ReLU Pipe:**
- **Check valve:** Either fully open (100% flow) or fully closed (0% flow)
- **No reduction:** When open, full pressure passes through
- **Result:** Strong flow maintained through many pipes
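
The 25% figure in the analogy is the maximum slope of the sigmoid. The short derivation below makes that bound explicit. It uses only the standard definition of the sigmoid; the symbol $z_k$ (the pre-activation at layer $k$) is notation assumed here for illustration, and the product ignores the weight matrices, which also scale the gradient.

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad
\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr) \le \frac{1}{4}
\quad \text{(equality at } x = 0 \text{, where } \sigma(0) = \tfrac{1}{2}\text{)}

\prod_{k=1}^{n} \sigma'(z_k) \le \left(\frac{1}{4}\right)^{n}
\qquad \text{versus} \qquad
\prod_{k=1}^{n} \mathrm{ReLU}'(z_k) = 1 \ \text{ when every } z_k > 0
```

The bound follows because $s(1-s)$ is a downward parabola in $s$ that peaks at $s = 1/2$.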

---

## 💻 Mathematical Analysis

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

print(f"TensorFlow version: {tf.__version__}")
print(f"NumPy version: {np.__version__}")

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Configure plotting
plt.style.use('default')
sns.set_palette("Set2")
plt.rcParams['figure.figsize'] = (12, 8)

In [None]:
def analyze_sigmoid():
    """Comprehensive analysis of sigmoid activation function"""
    
    print("🔍 SIGMOID FUNCTION ANALYSIS")
    print("=" * 40)
    
    x = np.linspace(-6, 6, 1000)
    sigmoid = 1 / (1 + np.exp(-x))
    sigmoid_deriv = sigmoid * (1 - sigmoid)

    # Create comprehensive visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))

    # Plot 1: Sigmoid function
    axes[0, 0].plot(x, sigmoid, 'b-', linewidth=3, label='σ(x) = 1/(1 + e⁻ˣ)')
    axes[0, 0].axhline(y=0.5, color='gray', linestyle='--', alpha=0.5)
    axes[0, 0].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
    axes[0, 0].set_title('Sigmoid Function', fontsize=14, fontweight='bold')
    axes[0, 0].set_xlabel('Input (x)')
    axes[0, 0].set_ylabel('Output σ(x)')
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)
    axes[0, 0].set_ylim(-0.1, 1.1)

    # Plot 2: Sigmoid derivative
    axes[0, 1].plot(x, sigmoid_deriv, 'r-', linewidth=3, label="σ'(x) = σ(x)(1 - σ(x))")
    axes[0, 1].axhline(y=0.25, color='k', linestyle='--', alpha=0.7, label='Maximum = 0.25')
    axes[0, 1].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
    axes[0, 1].set_title('Sigmoid Derivative\n(The Gradient Killer!)', fontsize=14, fontweight='bold')
    axes[0, 1].set_xlabel('Input (x)')
    axes[0, 1].set_ylabel("Derivative σ'(x)")
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)
    
    # Plot 3: Gradient reduction through layers
    layers = range(1, 11)
    gradient_reduction = [0.25**i for i in layers]
    axes[1, 0].semilogy(layers, gradient_reduction, 'ro-', linewidth=3, markersize=8)
    axes[1, 0].set_title('Gradient Reduction Through Layers', fontsize=14, fontweight='bold')
    axes[1, 0].set_xlabel('Layer Depth')
    axes[1, 0].set_ylabel('Gradient Magnitude (log scale)')
    axes[1, 0].grid(True, alpha=0.3)
    
    # Add annotations for key points
    axes[1, 0].annotate(f'Layer 5: {0.25**5:.6f}', xy=(5, 0.25**5), 
                       xytext=(7, 0.01), arrowprops=dict(arrowstyle='->', color='red'))
    axes[1, 0].annotate(f'Layer 10: {0.25**10:.2e}', xy=(10, 0.25**10), 
                       xytext=(8, 1e-5), arrowprops=dict(arrowstyle='->', color='red'))
    
    # Plot 4: Saturation zones
    saturation_left = sigmoid_deriv[x < -3]
    saturation_right = sigmoid_deriv[x > 3]
    active_zone = sigmoid_deriv[(-3 <= x) & (x <= 3)]
    
    x_left = x[x < -3]
    x_right = x[x > 3]
    x_active = x[(-3 <= x) & (x <= 3)]
    
    axes[1, 1].fill_between(x_left, 0, saturation_left, alpha=0.3, color='red', label='Left Saturation')
    axes[1, 1].fill_between(x_right, 0, saturation_right, alpha=0.3, color='red', label='Right Saturation')
    axes[1, 1].fill_between(x_active, 0, active_zone, alpha=0.3, color='green', label='Active Zone')
    axes[1, 1].plot(x, sigmoid_deriv, 'b-', linewidth=2)
    axes[1, 1].set_title('Sigmoid Saturation Zones', fontsize=14, fontweight='bold')
    axes[1, 1].set_xlabel('Input (x)')
    axes[1, 1].set_ylabel("Derivative σ'(x)")
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Print numerical analysis
    print("\n📊 NUMERICAL ANALYSIS:")
    print(f"Maximum derivative: {np.max(sigmoid_deriv):.6f} (at x = 0)")
    print(f"Derivative at x = ±3: {sigmoid_deriv[np.argmin(np.abs(x-3))]:.6f}")
    print(f"Derivative at x = ±6: {sigmoid_deriv[np.argmin(np.abs(x-6))]:.6f}")
    
    print("\n💀 GRADIENT DEATH PROGRESSION:")
    for layer in [1, 3, 5, 7, 10]:
        reduction = 0.25**layer
        percentage = reduction * 100
        print(f"Layer {layer:2d}: {reduction:.2e} ({percentage:.4f}% of original)")
    
    print("\n🚨 CRITICAL INSIGHT:")
    print("After just 5 sigmoid layers, the activation derivatives alone shrink the gradient by at least ~1000x!")
    print("Early layers receive virtually no learning signal.")

# Run sigmoid analysis
analyze_sigmoid()
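
One way to gain confidence in the analytic derivative used above is to compare it against a central finite difference. The sketch below is a minimal check under stated assumptions: the helper names (`sigmoid`, `sigmoid_grad`, `central_difference`) and the step size `h` are chosen here for illustration and are not part of the analysis cell.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Analytic derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def central_difference(f, x, h=1e-5):
    """Numerical derivative (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# An odd number of points puts x = 0 on the grid, where the derivative peaks.
x = np.linspace(-6, 6, 1001)
analytic = sigmoid_grad(x)
numeric = central_difference(sigmoid, x)

max_abs_error = np.max(np.abs(analytic - numeric))
print(f"Largest analytic-vs-numeric gap: {max_abs_error:.2e}")
print(f"Peak derivative: {analytic.max():.6f}")
```

If the two curves agree to within floating-point noise, the formula σ'(x) = σ(x)(1 - σ(x)) is behaving as expected on this range.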

In [None]:
def compare_relu_variants():
    """Compare different ReLU variants and their properties"""
    
    print("\n🔥 ReLU VARIANTS COMPARISON")
    print("=" * 40)

    x = np.linspace(-3, 3, 1000)

    # Define activation functions
    activations = {
        'ReLU': {
            'func': np.maximum(0, x),
            'derivative': np.where(x > 0, 1, 0),
            'description': 'Standard ReLU: f(x) = max(0, x)',
            'pros': ['Simple', 'Fast', 'Gradient of exactly 1 for positive inputs'],
            'cons': ['Dying ReLU problem', 'Not differentiable at 0']
        },
        'Leaky ReLU': {
            'func': np.where(x > 0, x, 0.01 * x),
            'derivative': np.where(x > 0, 1, 0.01),
            'description': 'Leaky ReLU: f(x) = max(0.01x, x)',
            'pros': ['Fixes dying ReLU', 'Always has gradient'],
            'cons': ['Still piecewise linear', 'Hyperparameter (slope)']
        },
        'ELU': {
            'func': np.where(x > 0, x, np.exp(x) - 1),
            'derivative': np.where(x > 0, 1, np.exp(x)),
            'description': 'ELU: f(x) = x if x>0, else α(eˣ-1), α=1 here',
            'pros': ['Smooth', 'Negative outputs', 'Mean activation closer to zero'],
            'cons': ['Computationally expensive', 'Saturation for large negative']
        },
        'Swish': {
            'func': x * (1 / (1 + np.exp(-x))),
            'derivative': (1 / (1 + np.exp(-x))) + x * (1 / (1 + np.exp(-x))) * (1 - (1 / (1 + np.exp(-x)))),
            'description': 'Swish: f(x) = x * σ(x)',
            'pros': ['Smooth', 'Self-gated', 'Unbounded above'],
            'cons': ['Computational overhead', 'Can vanish for large negative']
        }
    }

    # Create comprehensive comparison plot
    fig, axes = plt.subplots(2, 4, figsize=(20, 10))
    colors = ['blue', 'green', 'red', 'purple']

    for i, (name, props) in enumerate(activations.items()):
        color = colors[i]
        
        # Top row: Activation functions
        axes[0, i].plot(x, props['func'], color=color, linewidth=3, label=name)
        axes[0, i].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
        axes[0, i].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
        axes[0, i].set_title(f'{name}\n{props["description"]}', fontsize=12, fontweight='bold')
        axes[0, i].set_xlabel('x')
        axes[0, i].set_ylabel(f'{name}(x)')
        axes[0, i].grid(True, alpha=0.3)
        axes[0, i].legend()
        
        # Bottom row: Derivatives
        axes[1, i].plot(x, props['derivative'], color=color, linewidth=3, label=f"{name} derivative")
        axes[1, i].axhline(y=1, color='orange', linestyle='--', alpha=0.7, label='Gradient = 1')
        axes[1, i].axhline(y=0, color='gray', linestyle='--', alpha=0.5)
        axes[1, i].axvline(x=0, color='gray', linestyle='--', alpha=0.5)
        axes[1, i].set_title(f'{name} Derivative', fontsize=12, fontweight='bold')
        axes[1, i].set_xlabel('x')
        axes[1, i].set_ylabel(f"{name}'(x)")
        axes[1, i].grid(True, alpha=0.3)
        axes[1, i].legend()
        
        # Set consistent y-limits for derivatives
        axes[1, i].set_ylim(-0.1, 2.0)

    plt.tight_layout()
    plt.show()
    
    # Print detailed comparison
    print("\n📋 DETAILED COMPARISON:")
    print("=" * 60)
    
    for name, props in activations.items():
        print(f"\n🔹 {name.upper()}:")
        print(f"   Description: {props['description']}")
        print(f"   ✅ Pros: {', '.join(props['pros'])}")
        print(f"   ❌ Cons: {', '.join(props['cons'])}")
        
        # Calculate key statistics
        max_deriv = np.max(props['derivative'])
        min_deriv = np.min(props['derivative'])
        mean_deriv = np.mean(props['derivative'])
        
        print(f"   📊 Derivative stats: Max={max_deriv:.3f}, Min={min_deriv:.3f}, Mean={mean_deriv:.3f}")

# Run ReLU variants comparison
compare_relu_variants()
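
The comparison above fixes the Leaky ReLU slope at 0.01 and the ELU α at 1. To experiment with other values, the activations can be written as small parameterized helpers. This is an illustrative sketch: the function names and the `slope` / `alpha` parameter names are assumptions introduced here, not part of the cell above.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: x for x > 0, slope * x otherwise."""
    return np.where(x > 0, x, slope * x)

def leaky_relu_grad(x, slope=0.01):
    """Derivative of Leaky ReLU: 1 for x > 0, slope otherwise."""
    return np.where(x > 0, 1.0, slope)

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    # np.minimum keeps exp() from overflowing on large positive inputs;
    # that branch is discarded by np.where anyway.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def elu_grad(x, alpha=1.0):
    """Derivative of ELU: 1 for x > 0, alpha * exp(x) otherwise."""
    return np.where(x > 0, 1.0, alpha * np.exp(np.minimum(x, 0.0)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for alpha in (0.5, 1.0, 2.0):
    print(f"alpha={alpha}: ELU = {np.round(elu(x, alpha), 3)}")
for slope in (0.01, 0.1, 0.3):
    print(f"slope={slope}: Leaky ReLU grad = {leaky_relu_grad(x, slope)}")
```

For negative inputs, a larger `alpha` lowers the floor that ELU saturates toward (it approaches -alpha), while a larger `slope` lets more gradient through Leaky ReLU.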

In [None]:
def practical_activation_comparison():
    """Compare activations in actual neural networks"""
    
    print("\n🧪 PRACTICAL NETWORK COMPARISON")
    print("=" * 40)
    
    # Create networks with different activations
    activations_to_test = ['sigmoid', 'tanh', 'relu', 'elu']
    models = {}
    
    for activation in activations_to_test:
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation=activation, input_shape=(10,)),
            tf.keras.layers.Dense(64, activation=activation),
            tf.keras.layers.Dense(64, activation=activation),
            tf.keras.layers.Dense(64, activation=activation),
            tf.keras.layers.Dense(1, activation='sigmoid')
        ], name=f'{activation.title()}Network')
        models[activation] = model
    
    # Test data
    X_test = tf.random.normal((100, 10))
    y_test = tf.random.uniform((100, 1))
    
    # Analyze gradient flow for each activation
    results = {}
    
    for activation, model in models.items():
        print(f"\nAnalyzing {activation.upper()} network...")
        
        with tf.GradientTape() as tape:
            predictions = model(X_test)
            loss = tf.reduce_mean(tf.square(predictions - y_test))
        
        gradients = tape.gradient(loss, model.trainable_variables)
        
        # Calculate gradient norms for every trainable variable
        grad_norms = [tf.norm(g).numpy() for g in gradients if g is not None]
        
        # Variables alternate kernel, bias: keep the kernels so there is
        # exactly one norm per layer (5 in total), matching the "/5" report below
        weight_norms = grad_norms[::2]
        
        # Calculate statistics over the per-layer kernel gradients
        results[activation] = {
            'gradient_norms': weight_norms,
            'min_gradient': min(weight_norms),
            'max_gradient': max(weight_norms),
            'mean_gradient': np.mean(weight_norms),
            'std_gradient': np.std(weight_norms),
            'vanished_layers': sum(1 for g in weight_norms if g < 1e-6),
            'weak_layers': sum(1 for g in weight_norms if g < 1e-4),
            'loss': loss.numpy()
        }
        
        print(f"  Gradient range: {results[activation]['min_gradient']:.2e} to {results[activation]['max_gradient']:.2e}")
        print(f"  Vanished layers: {results[activation]['vanished_layers']}/5")
        print(f"  Loss: {results[activation]['loss']:.6f}")
    
    return results, models

# Run practical comparison
comparison_results, comparison_models = practical_activation_comparison()
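
As a dependency-light cross-check of the TensorFlow experiment, the same chain-rule effect can be reproduced with a hand-written backward pass in NumPy. This is a rough sketch under several assumptions that differ from the Keras models above: the network is bias-free, weights are drawn with a 1/sqrt(width) scale, the loss is just the mean of the final activations, and `input_grad_norm` is a hypothetical helper name.

```python
import numpy as np

def input_grad_norm(activation, depth=8, width=64, batch=32, seed=0):
    """Norm of d(loss)/d(input) for a bias-free MLP with a toy mean loss."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, width))
    # Assumed initialization: zero-mean Gaussian with std 1/sqrt(width).
    weights = [rng.normal(scale=1.0 / np.sqrt(width), size=(width, width))
               for _ in range(depth)]

    # Forward pass, caching each layer's local activation derivative.
    local_derivs = []
    for W in weights:
        z = h @ W
        if activation == "sigmoid":
            h = 1.0 / (1.0 + np.exp(-z))
            local_derivs.append(h * (1.0 - h))
        elif activation == "relu":
            h = np.maximum(0.0, z)
            local_derivs.append((z > 0).astype(float))
        else:
            raise ValueError(f"unsupported activation: {activation}")

    # Backward pass: apply the chain rule layer by layer from the output.
    grad = np.full_like(h, 1.0 / h.size)  # d(mean(h)) / dh
    for W, d in zip(reversed(weights), reversed(local_derivs)):
        grad = (grad * d) @ W.T  # through the activation, then the weights
    return np.linalg.norm(grad)

sigmoid_norm = input_grad_norm("sigmoid")
relu_norm = input_grad_norm("relu")
print(f"sigmoid: {sigmoid_norm:.3e}")
print(f"relu:    {relu_norm:.3e}")
print(f"ratio relu/sigmoid: {relu_norm / sigmoid_norm:.1f}")
```

The exact numbers depend on the seed and the assumed initialization, but the gradient reaching the input should be orders of magnitude smaller with sigmoid than with ReLU at this depth.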

In [None]:
# Visualize practical comparison results
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

activations = list(comparison_results.keys())
colors = ['red', 'orange', 'green', 'blue']

# Plot 1: Gradient magnitudes by layer and activation
ax1 = axes[0, 0]
x_positions = np.arange(5)  # 5 layers
width = 0.2

for i, (activation, color) in enumerate(zip(activations, colors)):
    grad_norms = comparison_results[activation]['gradient_norms']
    ax1.bar(x_positions + i * width, grad_norms, width, 
            label=activation.title(), color=color, alpha=0.7)

ax1.set_xlabel('Layer')
ax1.set_ylabel('Gradient Magnitude')
ax1.set_title('Gradient Magnitudes by Layer and Activation')
ax1.set_yscale('log')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xticks(x_positions + width * 1.5)
ax1.set_xticklabels([f'Layer {i+1}' for i in range(5)])

# Plot 2: Vanished layers comparison
ax2 = axes[0, 1]
vanished_counts = [comparison_results[act]['vanished_layers'] for act in activations]
bars = ax2.bar(activations, vanished_counts, color=colors, alpha=0.7)
ax2.set_ylabel('Number of Vanished Layers')
ax2.set_title('Vanished Layers by Activation')
ax2.set_ylim(0, 5)

# Add value labels on bars
for bar, value in zip(bars, vanished_counts):
    ax2.text(bar.get_x() + bar.get_width()/2, value + 0.1, str(value), 
             ha='center', va='bottom', fontweight='bold')

# Plot 3: Gradient statistics comparison
ax3 = axes[1, 0]
metrics = ['min_gradient', 'max_gradient', 'mean_gradient', 'std_gradient']
metric_labels = ['Min', 'Max', 'Mean', 'Std']

x_pos = np.arange(len(metrics))
for i, (activation, color) in enumerate(zip(activations, colors)):
    values = [comparison_results[activation][metric] for metric in metrics]
    ax3.bar(x_pos + i * width, values, width, 
            label=activation.title(), color=color, alpha=0.7)

ax3.set_xlabel('Metric')
ax3.set_ylabel('Gradient Value (log scale)')
ax3.set_title('Gradient Statistics Comparison')
ax3.set_yscale('log')
ax3.legend()
ax3.set_xticks(x_pos + width * 1.5)
ax3.set_xticklabels(metric_labels)
ax3.grid(True, alpha=0.3)

# Plot 4: Health score comparison
ax4 = axes[1, 1]

# Calculate health scores (higher is better)
health_scores = []
for activation in activations:
    result = comparison_results[activation]
    score = 0
    
    # No vanished layers: +3 points
    if result['vanished_layers'] == 0:
        score += 3
    elif result['vanished_layers'] <= 1:
        score += 1
    
    # Reasonable gradient range: +2 points
    if result['min_gradient'] > 1e-5:
        score += 2
    elif result['min_gradient'] > 1e-6:
        score += 1
    
    # No explosion: +2 points
    if result['max_gradient'] < 10:
        score += 2
    elif result['max_gradient'] < 100:
        score += 1
    
    # Stable gradients: +1 point
    if result['std_gradient'] < result['mean_gradient']:
        score += 1
    
    health_scores.append(score)

bars = ax4.bar(activations, health_scores, color=colors, alpha=0.7)
ax4.set_ylabel('Health Score (0-8)')
ax4.set_title('Overall Gradient Health Score')
ax4.set_ylim(0, 8)

# Add value labels and health assessment
for bar, score in zip(bars, health_scores):
    ax4.text(bar.get_x() + bar.get_width()/2, score + 0.1, f'{score}/8', 
             ha='center', va='bottom', fontweight='bold')
    
    if score >= 7:
        health_label = 'Excellent'
        color_code = '🟢'
    elif score >= 5:
        health_label = 'Good'
        color_code = '🟡'
    elif score >= 3:
        health_label = 'Fair'
        color_code = '🟠'
    else:
        health_label = 'Poor'
        color_code = '🔴'
    
    ax4.text(bar.get_x() + bar.get_width()/2, score/2, f'{color_code}\n{health_label}', 
             ha='center', va='center', fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Create activation function recommendation system
def recommend_activation(problem_type, network_depth, computational_budget):
    """Recommend activation function based on problem characteristics"""
    
    print(f"\n🎯 ACTIVATION FUNCTION RECOMMENDATION")
    print(f"Problem Type: {problem_type}")
    print(f"Network Depth: {network_depth} layers")
    print(f"Computational Budget: {computational_budget}")
    print("=" * 50)
    
    recommendations = []
    
    # Deep network recommendations
    if network_depth > 10:
        if computational_budget == 'high':
            recommendations.append(("Swish", "Excellent for very deep networks, smooth gradients", 9))
            recommendations.append(("ELU", "Good alternative, smooth with negative outputs", 8))
        else:
            recommendations.append(("ReLU", "Best balance of performance and speed", 9))
            recommendations.append(("Leaky ReLU", "Fixes dying ReLU in deep networks", 7))
    
    # Medium depth networks
    elif network_depth > 5:
        if problem_type == 'classification':
            recommendations.append(("ReLU", "Standard choice for classification", 9))
            recommendations.append(("ELU", "Smooth alternative with good properties", 8))
        elif problem_type == 'regression':
            recommendations.append(("ELU", "Smooth, works well for regression", 9))
            recommendations.append(("ReLU", "Fast and effective", 8))
    
    # Shallow networks
    else:
        if computational_budget == 'low':
            recommendations.append(("ReLU", "Simple and fast", 8))
            recommendations.append(("Tanh", "Classic choice for shallow networks", 6))
        else:
            recommendations.append(("Swish", "Smooth, good for optimization", 8))
            recommendations.append(("ELU", "Good general purpose choice", 7))
    
    # Special considerations
    if problem_type == 'sequence_modeling':
        recommendations.append(("Tanh", "Good for RNNs, bounded output", 7))
        recommendations.append(("Sigmoid", "If you need bounded outputs [0,1]", 5))
    
    # Sort by score
    recommendations.sort(key=lambda x: x[2], reverse=True)
    
    print("📋 RECOMMENDATIONS (ranked by suitability):")
    print()
    
    for i, (activation, reason, score) in enumerate(recommendations[:3], 1):
        stars = "⭐" * ((score + 1) // 2)  # map the 0-10 score onto 0-5 stars
        print(f"{i}. {activation.upper()} {stars}")
        print(f"   Reason: {reason}")
        print(f"   Score: {score}/10")
        print()
    
    return recommendations[0][0] if recommendations else "ReLU"

# Test recommendation system
scenarios = [
    ("classification", 15, "high"),
    ("regression", 8, "medium"),
    ("classification", 3, "low"),
    ("sequence_modeling", 6, "medium")
]

print("🧪 TESTING RECOMMENDATION SYSTEM")
print("=" * 50)

for problem_type, depth, budget in scenarios:
    best_activation = recommend_activation(problem_type, depth, budget)
    print(f"\n➡️ BEST CHOICE: {best_activation}\n")
    print("-" * 50)

---

## 🔍 Key Insights from Analysis

### 📊 Mathematical Properties Summary

| Activation | Max Derivative | Saturates? | Computational Cost | Gradient Flow |
|------------|----------------|------------|--------------------|--------------|
| **Sigmoid** | 0.25 | Yes | Low | 🔴 Poor (vanishing) |
| **Tanh** | 1.0 | Yes | Low | 🟡 Fair (still vanishing) |
| **ReLU** | 1.0 | No (positive) | Very Low | 🟢 Excellent |
| **Leaky ReLU** | 1.0 | No | Very Low | 🟢 Excellent |
| **ELU** | 1.0 | Partial | Medium | 🟢 Excellent |
| **Swish** | ≈1.10 | Partial (negative side) | High | 🟢 Excellent |
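
The "Max Derivative" column can be checked numerically by evaluating each derivative on a dense grid. The sketch below is one way to do that. The grid range and resolution are arbitrary choices made here, and ELU is evaluated with α = 1 as in the comparison code.

```python
import numpy as np

x = np.linspace(-10, 10, 200001)  # dense grid, step 1e-4
s = 1.0 / (1.0 + np.exp(-x))      # sigmoid, reused by Sigmoid and Swish

derivs = {
    "Sigmoid": s * (1.0 - s),
    "Tanh": 1.0 - np.tanh(x) ** 2,
    "ReLU": (x > 0).astype(float),
    "Leaky ReLU": np.where(x > 0, 1.0, 0.01),
    "ELU": np.where(x > 0, 1.0, np.exp(np.minimum(x, 0.0))),  # alpha = 1
    "Swish": s + x * s * (1.0 - s),
}

max_derivs = {name: float(d.max()) for name, d in derivs.items()}
for name, value in max_derivs.items():
    print(f"{name:<11} max derivative = {value:.4f}")
```

On this grid, Swish peaks at roughly 1.10 (near x ≈ 2.4), slightly above the value of 1 reached by tanh and the ReLU family.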

### 🎯 Practical Guidelines

#### ✅ **Use ReLU when:**
- Building deep networks (>5 layers)
- Computational efficiency is important
- You need simple, reliable performance
- Classification tasks

#### ✅ **Use Leaky ReLU when:**
- Experiencing "dying ReLU" problem
- Need gradients for all neurons
- Deep networks with sparse activations
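
The "dying ReLU" problem can be illustrated with a single unit whose pre-activation is negative for every input. The sketch below builds that state artificially: the bias of -5, the weight scale, and the standard-normal inputs are all assumptions chosen to force it, not values from the networks above.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(1000, 10))      # toy standard-normal inputs
weights = rng.normal(scale=0.1, size=10)  # small weights for one unit
bias = -5.0                               # large negative bias: the "dead" state

pre_activation = inputs @ weights + bias

# Local derivative of each activation with respect to its pre-activation.
relu_grad = (pre_activation > 0).astype(float)
leaky_grad = np.where(pre_activation > 0, 1.0, 0.01)

print(f"Fraction of samples with z > 0: {np.mean(pre_activation > 0):.3f}")
print(f"Mean ReLU gradient:       {relu_grad.mean():.4f}")
print(f"Mean Leaky ReLU gradient: {leaky_grad.mean():.4f}")
```

With ReLU the unit receives no gradient on any sample, so gradient descent cannot move its weights back into the active range. Leaky ReLU still passes a small signal through.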

#### ✅ **Use ELU when:**
- Want a smooth activation function
- Regression tasks
- Want negative outputs that keep mean activations closer to zero
- Medium computational budget

#### ✅ **Use Swish when:**
- Maximum performance is critical
- High computational budget available
- Very deep networks
- Research/experimental settings

#### ❌ **Avoid Sigmoid/Tanh in hidden layers when:**
- Building deep networks (>3 layers)
- Gradient flow is critical
- Training time is important

---

## 💡 The ReLU Revolution

### Why ReLU Changed Everything:

1. **No Shrinking Gradient for Active Units:** Derivative is exactly 1 for x > 0 (and 0 otherwise), so active paths pass gradients through unscaled
2. **Computational Efficiency:** Just max(0, x)
3. **Sparsity:** Creates sparse representations
4. **Unbounded:** No upper saturation
5. **Empirical Success:** Enabled deep learning breakthroughs
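
The sparsity and unboundedness points can be illustrated on random pre-activations. The sketch below assumes standard-normal pre-activations, which is a simplification made here for illustration. Real networks will show different fractions.

```python
import numpy as np

rng = np.random.default_rng(42)
z = rng.normal(size=100_000)  # assumed standard-normal pre-activations

relu_out = np.maximum(0.0, z)
sigmoid_out = 1.0 / (1.0 + np.exp(-z))

# Sparsity: fraction of outputs that are exactly zero.
relu_sparsity = np.mean(relu_out == 0.0)
sigmoid_sparsity = np.mean(sigmoid_out == 0.0)

print(f"ReLU zero fraction:    {relu_sparsity:.3f}")
print(f"Sigmoid zero fraction: {sigmoid_sparsity:.3f}")
print(f"Largest ReLU output:    {relu_out.max():.3f}")
print(f"Largest sigmoid output: {sigmoid_out.max():.3f}")
```

Under this assumption roughly half of the ReLU outputs are exactly zero, while sigmoid outputs are never exactly zero and stay strictly between 0 and 1.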

### The Simple Truth:
Sometimes the simplest solutions are the most powerful. ReLU's success shows that elegant mathematics often beats complex engineering.

---

## 🎯 Next Steps

In the next notebook, we'll explore:
- **Weight initialization strategies** (Xavier, He, LSUV)
- **How initialization affects gradient flow**
- **Proper initialization for different activations**

---

*This notebook demonstrates Concept 4 of Week 5: Deep Neural Network Architectures*