---
**These materials are created by Prof. Ramesh Babu exclusively for M.Tech Students of SRM University**

© 2025 Prof. Ramesh Babu. All rights reserved. This material is protected by copyright and may not be reproduced, distributed, or transmitted in any form or by any means without prior written permission.

---

# 🎭 T3-Exercise-3: Activation Functions - The Soul of Neural Networks
**Deep Neural Network Architectures (21CSE558T) - Week 2, Day 4**  
**M.Tech Lab Session - Duration: 30-45 minutes**

---

## 🎯 LEARNING OBJECTIVES
By the end of this exercise, you will:
- 🎪 Master the classic activation trinity: **ReLU, Sigmoid, Tanh**
- ⚡ Explore advanced activations: **Leaky ReLU, ELU, Swish, GELU**
- 🎲 Understand **Softmax** - the probability wizard
- 📊 **Visualize** activation behaviors and their impact
- 🧠 Choose the **right activation** for different scenarios
- 🔥 Apply activations in **real neural network** contexts

## 🔗 CONNECTION TO NEURAL NETWORKS
Activation functions are the **soul** of neural networks:
- 🎭 **Without activations** → Just linear transformations (boring!)
- ✨ **With activations** → Universal function approximators, given enough hidden units (magic!)
- 🧬 **They introduce non-linearity** → Enable complex pattern learning
- 🎯 **Different tasks need different activations** → Classification vs Regression

**Mind-blowing fact:** No matter how many layers you stack, a neural network without activation functions collapses into a single linear model, a glorified linear regression! 🤯

## 📚 PREREQUISITES
- ✅ Completed T3-Exercise-1 (Tensor Fundamentals)
- ✅ Completed T3-Exercise-2 (Mathematical Operations)
- 📈 Basic understanding of function graphs

## ⚙️ SETUP & ACTIVATION TOOLKIT
🎨 Let's prepare our visualization and computation tools!

In [None]:
# 🎨 Complete toolkit for activation function exploration
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys

# Set up beautiful plotting
plt.style.use('default')
sns.set_palette("husl")

# 🔧 Environment check
print("🎭 ACTIVATION FUNCTION LABORATORY")
print("=" * 38)
print(f"🐍 Python: {sys.version.split()[0]}")
print(f"🔥 TensorFlow: {tf.__version__}")
print(f"🔢 NumPy: {np.__version__}")
print("📊 Matplotlib: Ready for beautiful plots!")

# 🎮 Computational power check
if tf.config.list_physical_devices('GPU'):
    print("🚀 GPU: Ready for lightning-fast activations!")
else:
    print("💻 CPU: Perfect for learning and visualization!")

print("\n🎨 Ready to bring neural networks to life!\n")

# Helper function for beautiful plots
def plot_activation(x, y, title, color='blue', derivative=None):
    """Create beautiful activation function plots"""
    plt.figure(figsize=(10, 6))
    
    if derivative is not None:
        plt.subplot(1, 2, 1)
    
    plt.plot(x, y, color=color, linewidth=3, label=f'{title}')
    plt.grid(True, alpha=0.3)
    plt.xlabel('Input (x)', fontsize=12)
    plt.ylabel('Output f(x)', fontsize=12)
    plt.title(f'🎭 {title} Activation Function', fontsize=14, fontweight='bold')
    plt.legend(fontsize=12)
    
    if derivative is not None:
        plt.subplot(1, 2, 2)
        plt.plot(x, derivative, color='red', linewidth=3, label=f"{title} Derivative")
        plt.grid(True, alpha=0.3)
        plt.xlabel('Input (x)', fontsize=12)
        plt.ylabel("Derivative f'(x)", fontsize=12)
        plt.title(f'📈 {title} Gradient', fontsize=14, fontweight='bold')
        plt.legend(fontsize=12)
    
    plt.tight_layout()
    plt.show()

print("🎨 Plotting toolkit ready!")

## 🧠 CORE CONCEPTS: Why Activation Functions Matter

### 🎭 **The Drama of Non-linearity:**

**🎬 Scene 1: Life Without Activations**
```
Layer1: input × W1 + b1 = linear_output1
Layer2: linear_output1 × W2 + b2 = linear_output2
Result: Still just one big linear transformation! 😴
```
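
Scene 1 can be checked with a few lines of arithmetic. The sketch here is plain Python (no TensorFlow needed) and uses made-up weights `W1`, `b1`, `W2`, `b2` chosen only for illustration. It suggests that two stacked linear layers produce the same outputs as one layer with `W_eq = W1 @ W2` and `b_eq = b1 @ W2 + b2`.

```python
# Illustrative weights only (hypothetical values, not taken from the exercise)

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def add_bias(m, bias):
    """Add a bias vector to every row of a matrix."""
    return [[v + bias[j] for j, v in enumerate(row)] for row in m]

X = [[1.0, 2.0], [3.0, -1.0]]                # 2 samples, 2 features
W1 = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]    # layer 1: 2 -> 3
b1 = [0.1, -0.2, 0.3]
W2 = [[1.0], [-2.0], [0.5]]                  # layer 2: 3 -> 1
b2 = [0.25]

# Two linear layers applied one after the other (no activation in between)
two_layer = add_bias(matmul(add_bias(matmul(X, W1), b1), W2), b2)

# One equivalent linear layer: W_eq = W1 @ W2, b_eq = b1 @ W2 + b2
W_eq = matmul(W1, W2)
b_eq = [matmul([b1], W2)[0][0] + b2[0]]
one_layer = add_bias(matmul(X, W_eq), b_eq)

print("Two layers:", two_layer)
print("One layer: ", one_layer)
```

The two printed results should agree up to floating-point rounding, which is why depth alone adds no expressive power without a non-linearity between the layers.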

**🎪 Scene 2: Life With Activations**
```
Layer1: activation(input × W1 + b1) = non_linear_output1
Layer2: activation(non_linear_output1 × W2 + b2) = complex_patterns! ✨
Result: Universal function approximation! 🚀
```

### 🎯 **Activation Function Categories:**

1. **🔥 Rectified Family** (ReLU, Leaky ReLU, ELU)
   - ReLU gives sparse activation (many exact zeros); Leaky ReLU and ELU keep a small negative response
   - Computationally efficient
   - Great for hidden layers

2. **🌊 Smooth Functions** (Sigmoid, Tanh)
   - Smooth gradients everywhere
   - Bounded outputs
   - Classic but can saturate

3. **🎲 Probability Functions** (Softmax)
   - Outputs sum to 1
   - Perfect for classification
   - Converts scores to probabilities

4. **⚡ Modern Innovations** (Swish, GELU, Mish)
   - State-of-the-art performance
   - Self-gating properties
   - GELU is standard in transformers; Swish appears in EfficientNet

## 🔥 STEP 1: The Rectified Family - ReLU & Friends
### 💪 The workhorses of modern deep learning!

In [None]:
# 🔥 ReLU - The Game Changer
print("🔥 ReLU: RECTIFIED LINEAR UNIT")
print("=" * 32)

# Create input range for visualization
x = tf.linspace(-5.0, 5.0, 1000)

# ReLU function: f(x) = max(0, x)
relu_output = tf.nn.relu(x)

print("📝 Mathematical Definition: f(x) = max(0, x)")
print("🎯 Key Properties:")
print("   • Simple and fast computation")
print("   • Sparse activation (about half the units output 0 for zero-centered inputs)")
print("   • No saturation for positive values")
print("   • Gradient is either 0 or 1")
print()

# Test with sample values
test_values = tf.constant([-2.0, -1.0, 0.0, 1.0, 2.0])
relu_test = tf.nn.relu(test_values)

print("🧪 Test Values:")
for i in range(len(test_values)):
    print(f"   ReLU({test_values[i].numpy():4.1f}) = {relu_test[i].numpy():4.1f}")

print()
print("🧠 Neural Network Impact:")
print("   • Mitigates vanishing gradients (slope is exactly 1 for x > 0)")
print("   • Creates sparse representations")
print("   • Enables very deep networks")
print()

# Visualize ReLU
plot_activation(x.numpy(), relu_output.numpy(), "ReLU", color='red')
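
The "about half the neurons are inactive" property depends on where the pre-activations are centered. As a rough sketch in plain Python, assuming pre-activations drawn from a standard normal distribution (an assumption made for illustration, not something measured on a real network), the fraction of exact zeros can be counted directly:

```python
import random

def relu(z):
    """ReLU: f(z) = max(0, z)."""
    return max(0.0, z)

# Hypothetical pre-activations: 10,000 samples from N(0, 1), fixed seed
rng = random.Random(0)
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Zero-centered inputs: roughly half land below zero and are clipped
outputs = [relu(z) for z in pre_activations]
inactive_fraction = sum(1 for a in outputs if a == 0.0) / len(outputs)

# Shifting the inputs upward (for example via a positive bias) reduces sparsity
shifted = [relu(z + 1.0) for z in pre_activations]
shifted_fraction = sum(1 for a in shifted if a == 0.0) / len(shifted)

print(f"Inactive fraction (centered at 0): {inactive_fraction:.3f}")
print(f"Inactive fraction (shifted by +1): {shifted_fraction:.3f}")
```

The first fraction should sit near one half and the second well below it, so the sparsity of ReLU is a property of the input distribution as much as of the function itself.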

In [None]:
# ⚡ Leaky ReLU - The Improved Version
print("⚡ LEAKY ReLU: Never Completely Dead!")
print("=" * 39)

# Leaky ReLU: f(x) = max(αx, x) where α is small (e.g., 0.01)
alpha = 0.01
leaky_relu_output = tf.nn.leaky_relu(x, alpha=alpha)

print(f"📝 Mathematical Definition: f(x) = max({alpha}x, x)")
print("🎯 Key Advantages:")
print("   • Prevents 'dying ReLU' problem")
print("   • Small gradient for negative values")
print("   • All neurons can contribute to learning")
print()

# Test with sample values
leaky_test = tf.nn.leaky_relu(test_values, alpha=alpha)

print("🧪 Comparison with ReLU:")
print("Input\tReLU\tLeaky ReLU")
print("-" * 25)
for i in range(len(test_values)):
    print(f"{test_values[i].numpy():4.1f}\t{relu_test[i].numpy():4.1f}\t{leaky_test[i].numpy():8.3f}")

print()
print("🎨 Visual difference: Notice the small slope for negative values!")

# Visualize Leaky ReLU
plot_activation(x.numpy(), leaky_relu_output.numpy(), "Leaky ReLU", color='orange')
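
The "dying ReLU" problem is easiest to see in the gradient rather than in the output. A minimal sketch, assuming a single hypothetical neuron `z = w * x_in` with an upstream gradient of 1 (all values are illustrative):

```python
def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if z > 0 else alpha

# Hypothetical neuron whose pre-activation is negative
w, x_in = -1.0, 2.0
z = w * x_in              # -2.0
upstream = 1.0            # gradient arriving from the loss

# Chain rule: dL/dw = upstream * f'(z) * x_in
grad_w_relu = upstream * relu_grad(z) * x_in
grad_w_leaky = upstream * leaky_relu_grad(z) * x_in

print(f"ReLU weight gradient:       {grad_w_relu}")
print(f"Leaky ReLU weight gradient: {grad_w_leaky}")
```

With ReLU the weight gradient is zero, so gradient descent cannot move this neuron back into the active region for this input. Leaky ReLU keeps a small non-zero gradient, which gives the weight a chance to recover.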

In [None]:
# 🌟 ELU - Exponential Linear Unit
print("🌟 ELU: EXPONENTIAL LINEAR UNIT")
print("=" * 34)

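# ELU: f(x) = x if x > 0, α(e^x - 1) if x ≤ 0
# Note: tf.nn.elu takes no alpha argument; it always uses α = 1.0
alpha_elu = 1.0  # kept for reference only, not passed to tf.nn.elu
elu_output = tf.nn.elu(x)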

print("📝 Mathematical Definition:")
print("   f(x) = x           if x > 0")
print("   f(x) = α(e^x - 1)  if x ≤ 0")
print()
print("🎯 Special Properties:")
print("   • Differentiable everywhere when α = 1")
print("   • Negative values saturate to -α")
print("   • Can produce negative outputs")
print("   • Pushes mean activations closer to zero")
print()

# Test with sample values
elu_test = tf.nn.elu(test_values)

print("🧪 Activation Comparison:")
print("Input\tReLU\tLeaky\tELU")
print("-" * 30)
for i in range(len(test_values)):
    print(f"{test_values[i].numpy():4.1f}\t{relu_test[i].numpy():4.1f}\t{leaky_test[i].numpy():5.2f}\t{elu_test[i].numpy():6.3f}")

# Visualize ELU
plot_activation(x.numpy(), elu_output.numpy(), "ELU", color='green')
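
Since the cell above only exercises α = 1, here is a small plain-Python sketch of ELU with an adjustable α. The helper names `elu` and `left_slope_at_zero` are made up for this illustration. It checks two of the listed properties: saturation toward -α and the slope on the negative side of zero.

```python
import math

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (e^x - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def left_slope_at_zero(alpha, h=1e-6):
    """Finite-difference slope just to the left of x = 0."""
    return (elu(0.0, alpha) - elu(-h, alpha)) / h

for alpha in (0.5, 1.0, 2.0):
    print(f"alpha={alpha}: elu(-20) = {elu(-20.0, alpha):.6f}, "
          f"left slope at 0 = {left_slope_at_zero(alpha):.4f}")
```

The left slope at zero comes out close to α, while the right slope is always 1. The two sides therefore match only when α = 1; any other α leaves a kink at the origin.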

## 🌊 STEP 2: The Smooth Classics - Sigmoid & Tanh
### 📈 The smooth operators of neural networks!

In [None]:
# 🌊 Sigmoid - The Probability Pioneer
print("🌊 SIGMOID: The Classic Probability Function")
print("=" * 43)

# Sigmoid: f(x) = 1 / (1 + e^(-x))
sigmoid_output = tf.nn.sigmoid(x)

print("📝 Mathematical Definition: f(x) = 1 / (1 + e^(-x))")
print("🎯 Key Properties:")
print("   • Output range: (0, 1) - perfect for probabilities!")
print("   • Smooth and differentiable everywhere")
print("   • S-shaped curve")
print("   • Can saturate (vanishing gradients)")
print()

# Test with sample values
sigmoid_test = tf.nn.sigmoid(test_values)

print("🧪 Sigmoid Outputs (probabilities):")
for i in range(len(test_values)):
    prob = sigmoid_test[i].numpy()
    print(f"   Sigmoid({test_values[i].numpy():4.1f}) = {prob:.3f} ({prob*100:.1f}%)")

print()
print("🧠 Perfect for:")
print("   • Binary classification (output layer)")
print("   • Gate mechanisms (LSTM forget gates)")
print("   • Converting logits to probabilities")
print()

# Calculate the derivative for gradient visualization
# (x is a constant tensor, not a tf.Variable, so the tape must watch it explicitly)
with tf.GradientTape() as tape:
    tape.watch(x)
    sigmoid_out = tf.nn.sigmoid(x)
sigmoid_grad = tape.gradient(sigmoid_out, x)

# Visualize Sigmoid with its derivative
plot_activation(x.numpy(), sigmoid_output.numpy(), "Sigmoid", color='purple', 
               derivative=sigmoid_grad.numpy())
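
The gradient curve plotted above has a closed form, σ'(x) = σ(x)(1 − σ(x)), which peaks at 0.25 when x = 0. The following plain-Python sketch (helper names are illustrative) uses that formula to hint at why stacked sigmoids suffer from vanishing gradients: even in the best case, each layer scales the gradient by at most 0.25.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Closed-form derivative: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)    # largest possible slope
tail = sigmoid_grad(5.0)    # slope in the saturated region

# Best-case gradient factor contributed by 10 stacked sigmoid layers
# (ignores the weights, which also multiply into the chain rule)
depth = 10
best_case_product = peak ** depth

print(f"Peak slope at x=0:      {peak}")
print(f"Slope at x=5:           {tail:.6f}")
print(f"Best case over {depth} layers: {best_case_product:.2e}")
```

This ignores the weight matrices, so it is only an upper bound on the activation's contribution, not a full account of gradient flow.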

In [None]:
# 🎭 Tanh - The Zero-Centered Champion
print("🎭 TANH: The Zero-Centered Activation")
print("=" * 37)

# Tanh: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
tanh_output = tf.nn.tanh(x)

print("📝 Mathematical Definition: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))")
print("🎯 Key Properties:")
print("   • Output range: (-1, 1) - zero-centered!")
print("   • Stronger gradients than sigmoid (peak slope 1.0 vs 0.25)")
print("   • Symmetric around origin")
print("   • Scaled and shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1")
print()

# Test with sample values
tanh_test = tf.nn.tanh(test_values)

print("🧪 Sigmoid vs Tanh Comparison:")
print("Input\tSigmoid\tTanh")
print("-" * 20)
for i in range(len(test_values)):
    print(f"{test_values[i].numpy():4.1f}\t{sigmoid_test[i].numpy():6.3f}\t{tanh_test[i].numpy():6.3f}")

print()
print("🧠 Advantages of Tanh:")
print("   • Zero-centered outputs keep the next layer's inputs balanced")
print("   • Stronger gradients in the linear region near x = 0")
print("   • Usually a better hidden-layer choice than sigmoid")
print()

# Calculate derivative
with tf.GradientTape() as tape:
    tape.watch(x)
    tanh_out = tf.nn.tanh(x)
tanh_grad = tape.gradient(tanh_out, x)

# Visualize Tanh with its derivative
plot_activation(x.numpy(), tanh_output.numpy(), "Tanh", color='teal', 
               derivative=tanh_grad.numpy())
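
Tanh being a "scaled and shifted sigmoid" can be stated precisely as tanh(x) = 2·σ(2x) − 1, and its derivative is 1 − tanh²(x). A quick numerical check in plain Python, offered as a sketch rather than a proof:

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x):
    """Closed-form derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Compare tanh(x) with 2*sigmoid(2x) - 1 on a grid from -5 to 5
xs = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) for x in xs)

print(f"Largest gap between tanh(x) and 2*sigmoid(2x)-1: {max_gap:.2e}")
print(f"tanh slope at 0: {tanh_grad(0.0)}  (sigmoid's peak slope is 0.25)")
```

The gap should be at the level of floating-point rounding. The slope of 1.0 at the origin is four times the sigmoid's peak slope, which is the sense in which tanh has "stronger gradients".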

## 🎲 STEP 3: Softmax - The Probability Wizard
### 🧙‍♂️ Converting scores into beautiful probabilities!

In [None]:
# 🎲 Softmax - The Classification Master
print("🎲 SOFTMAX: The Probability Distribution Creator")
print("=" * 48)

print("📝 Mathematical Definition:")
print("   softmax(x_i) = e^(x_i) / Σ(e^(x_j)) for all j")
print()
print("🎯 Magic Properties:")
print("   • All outputs are positive")
print("   • All outputs sum to 1.0 (up to floating-point rounding)")
print("   • Emphasizes the largest input")
print("   • Perfect for multi-class classification")
print()

# Example: Multi-class classification scenario
print("🎪 DEMO: Image Classification with 5 Classes")
print("=" * 45)

# Simulate raw scores (logits) from a classifier
class_names = ['🐱 Cat', '🐶 Dog', '🐦 Bird', '🐟 Fish', '🐸 Frog']
logits = tf.constant([[2.5, 1.2, 0.8, 3.1, 0.5],  # Sample 1: Fish likely
                      [4.2, 1.0, 2.1, 0.3, 0.8],  # Sample 2: Cat likely
                      [1.1, 1.2, 4.8, 0.9, 1.3]]) # Sample 3: Bird likely

print("🔢 Raw Logits (classifier outputs):")
for i, sample in enumerate(logits):
    print(f"   Sample {i+1}: {sample.numpy()}")
print()

# Apply softmax
probabilities = tf.nn.softmax(logits)

print("✨ After Softmax (probabilities):")
for i, (sample_logits, sample_probs) in enumerate(zip(logits, probabilities)):
    print(f"\n📊 Sample {i+1}:")
    print("   Class\t\tLogit\tProbability")
    print("   " + "-"*35)
    
    for j, (name, logit, prob) in enumerate(zip(class_names, sample_logits, sample_probs)):
        print(f"   {name}\t{logit.numpy():.1f}\t{prob.numpy():.3f} ({prob.numpy()*100:.1f}%)")
    
    # Show prediction
    predicted_class = int(tf.argmax(sample_probs))  # plain int for list indexing
    confidence = tf.reduce_max(sample_probs)
    print(f"   \n🏆 Prediction: {class_names[predicted_class]} with {confidence.numpy()*100:.1f}% confidence")
    
    # Verify probabilities sum to 1
    prob_sum = tf.reduce_sum(sample_probs)
    print(f"   ✅ Probability sum: {prob_sum.numpy():.6f}")

print("\n🧠 Key Insights:")
print("   • Higher logits → Higher probabilities")
print("   • Softmax amplifies differences between classes")
print("   • Always produces valid probability distributions")
print()
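
The formula e^(x_i) / Σ e^(x_j) overflows if coded literally with large logits. A common remedy is to subtract the largest logit before exponentiating, which leaves the result unchanged because softmax is invariant to adding a constant to every input. The sketch below is plain Python written for illustration; it is not a description of how `tf.nn.softmax` is implemented internally.

```python
import math

def softmax(logits):
    """Numerically stable softmax: shift by the max logit before exp."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Sample 1 from the demo above (Fish has the largest logit)
probs = softmax([2.5, 1.2, 0.8, 3.1, 0.5])
print("Probabilities:", [round(p, 3) for p in probs])
print("Sum:", sum(probs))

# Shift invariance: adding 100 to every logit gives the same distribution
shifted = softmax([102.5, 101.2, 100.8, 103.1, 100.5])

# Large logits: a literal exp(1000) overflows, the shifted version does not
try:
    math.exp(1000.0)
except OverflowError:
    print("math.exp(1000.0) overflows")
big = softmax([1000.0, 1001.0])
print("Stable result for [1000, 1001]:", big)
```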

In [None]:
# 📊 Visualizing Softmax Behavior
print("📊 SOFTMAX BEHAVIOR VISUALIZATION")
print("=" * 35)

# Create a simple 3-class example for visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Three different scenarios
scenarios = [
    ([1.0, 1.0, 1.0], "Equal Logits", "Uniform distribution"),
    ([3.0, 1.0, 1.0], "One High Logit", "Clear winner"),
    ([10.0, 1.0, 1.0], "Very High Logit", "Dominant class")
]

colors = ['skyblue', 'lightcoral', 'lightgreen']
class_labels = ['Class A', 'Class B', 'Class C']

for idx, (logits_list, title, description) in enumerate(scenarios):
    # Calculate softmax
    logits_tensor = tf.constant(logits_list)
    probs = tf.nn.softmax(logits_tensor)
    
    # Create bar plot
    axes[idx].bar(class_labels, probs.numpy(), color=colors)
    axes[idx].set_title(f'{title}\n{description}', fontweight='bold')
    axes[idx].set_ylabel('Probability')
    axes[idx].set_ylim(0, 1)
    
    # Add value labels on bars
    for i, prob in enumerate(probs.numpy()):
        axes[idx].text(i, prob + 0.02, f'{prob:.3f}', ha='center', fontweight='bold')
    
    # Add logits as subtitle
    axes[idx].text(0.5, -0.15, f'Logits: {logits_list}', ha='center', 
                   transform=axes[idx].transAxes, fontsize=10)

plt.suptitle('🎲 Softmax Behavior: From Logits to Probabilities', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("💡 Observations:")
print("   • Equal logits → Equal probabilities (1/n for n classes)")
print("   • Higher logits → Much higher probabilities")
print("   • Softmax amplifies differences exponentially!")
print()
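
"Amplifies differences exponentially" has a concrete meaning: the ratio of two softmax probabilities equals e raised to the difference of their logits, p_i / p_j = e^(x_i − x_j). A plain-Python sketch using the same three scenarios as the bar charts (the variable names are illustrative):

```python
import math

def softmax(logits):
    """Softmax with the usual max-shift for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

scenarios = [[1.0, 1.0, 1.0], [3.0, 1.0, 1.0], [10.0, 1.0, 1.0]]
winner_probs = []
for logits in scenarios:
    probs = softmax(logits)
    winner_probs.append(probs[0])
    ratio = probs[0] / probs[1]            # Class A vs Class B
    predicted = math.exp(logits[0] - logits[1])
    print(f"logits={logits}: P(A)={probs[0]:.4f}, "
          f"P(A)/P(B)={ratio:.1f}, e^(difference)={predicted:.1f}")
```

A logit gap of 2 already makes Class A about 7.4 times as likely as Class B, and a gap of 9 makes it thousands of times as likely, which is why the third bar chart is almost entirely one class.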

## ⚡ STEP 4: Modern Activations - The New Generation
### 🚀 State-of-the-art activations powering today's AI!

In [None]:
# ⚡ Swish - Google's Self-Gating Activation
print("⚡ SWISH: The Self-Gating Innovation")
print("=" * 36)

# Swish: f(x) = x * sigmoid(x)
def swish(x):
    return x * tf.nn.sigmoid(x)

swish_output = swish(x)

print("📝 Mathematical Definition: f(x) = x * sigmoid(x)")
print("🎯 Revolutionary Properties:")
print("   • Self-gating: uses its own value as gate")
print("   • Smooth and non-monotonic")
print("   • Often matches or beats ReLU on deep models")
print("   • Used in EfficientNet and other SOTA models")
print()

# Test with sample values
swish_test = swish(test_values)

print("🧪 Activation Comparison:")
print("Input\tReLU\tSwish")
print("-" * 18)
for i in range(len(test_values)):
    print(f"{test_values[i].numpy():4.1f}\t{relu_test[i].numpy():4.1f}\t{swish_test[i].numpy():6.3f}")

print()
print("🔍 Key Insight: Notice how Swish allows small negative values!")
print("   This helps with gradient flow and information preservation.")

# Visualize Swish
plot_activation(x.numpy(), swish_output.numpy(), "Swish", color='magenta')
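
"Non-monotonic" means the Swish curve dips below zero and then rises back toward zero as x becomes more negative. A simple grid search in plain Python (a sketch; the step size of 0.001 is an arbitrary choice) locates the bottom of that dip:

```python
import math

def swish(x):
    """Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

# Scan the negative half-line from -5 to 0 in steps of 0.001
xs = [-5.0 + i * 0.001 for i in range(5001)]
x_min = min(xs, key=swish)
y_min = swish(x_min)

print(f"Swish minimum near x = {x_min:.3f}, value = {y_min:.4f}")
print(f"swish(-3.0) = {swish(-3.0):.4f}   (higher than the minimum)")
print(f"swish(-0.5) = {swish(-0.5):.4f}   (also higher than the minimum)")
```

Values on both sides of the minimum are larger than the minimum itself, so the function decreases and then increases. ReLU, by contrast, never decreases.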

In [None]:
# 🌟 GELU - The Transformer's Favorite
print("🌟 GELU: Gaussian Error Linear Unit")
print("=" * 35)

# GELU: f(x) = x * Φ(x) where Φ is the cumulative distribution function of the standard normal distribution
# Approximation: f(x) ≈ 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))
def gelu_approx(x):
    return 0.5 * x * (1 + tf.nn.tanh(tf.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))

gelu_output = gelu_approx(x)

print("📝 Mathematical Concept: f(x) = x * P(X ≤ x) where X ~ N(0,1)")
print("🎯 Transformer Power:")
print("   • Probabilistic interpretation")
print("   • Smooth approximation to ReLU")
print("   • Standard in BERT, GPT, and other transformers")
print("   • Close to a sharper Swish: GELU(x) ≈ x * sigmoid(1.702x)")
print()

# Test with sample values
gelu_test = gelu_approx(test_values)

print("🧪 Modern Activation Comparison:")
print("Input\tReLU\tSwish\tGELU")
print("-" * 25)
for i in range(len(test_values)):
    print(f"{test_values[i].numpy():4.1f}\t{relu_test[i].numpy():4.1f}\t{swish_test[i].numpy():5.2f}\t{gelu_test[i].numpy():5.2f}")

print()
print("🤖 Takeaway: GELU scales each input by the probability that a standard normal value falls below it.")

# Visualize GELU
plot_activation(x.numpy(), gelu_output.numpy(), "GELU", color='indigo')
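
The cell above uses the tanh approximation. The exact definition x·Φ(x) can be written with the error function, since Φ(x) = 0.5·(1 + erf(x/√2)). The following plain-Python sketch compares the two forms on a grid; the function names `gelu_exact` and `gelu_tanh` are chosen for this illustration.

```python
import math

def gelu_exact(x):
    """Exact GELU: x * Phi(x), with Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU, same formula as gelu_approx above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Largest disagreement between the two forms on a grid from -5 to 5
xs = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)

print(f"gelu_exact(1.0) = {gelu_exact(1.0):.6f}")
print(f"gelu_tanh(1.0)  = {gelu_tanh(1.0):.6f}")
print(f"Largest gap on [-5, 5]: {max_gap:.2e}")
```

The gap should be small enough that the two curves are indistinguishable on a plot, which is why the approximation is widely used where `erf` is expensive or unavailable.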

## 🎨 STEP 5: The Grand Comparison - All Activations Together
### 🌈 See how different activations shape neural network behavior!

In [None]:
# 🎨 The Ultimate Activation Function Comparison
print("🎨 THE ULTIMATE ACTIVATION COMPARISON")
print("=" * 40)

# Calculate all activations
activations = {
    'ReLU': tf.nn.relu(x),
    'Leaky ReLU': tf.nn.leaky_relu(x, alpha=0.01),
    'ELU': tf.nn.elu(x),
    'Sigmoid': tf.nn.sigmoid(x),
    'Tanh': tf.nn.tanh(x),
    'Swish': swish(x),
    'GELU': gelu_approx(x)
}

# Create comprehensive comparison plot
plt.figure(figsize=(15, 10))

colors = ['red', 'orange', 'green', 'purple', 'teal', 'magenta', 'indigo']

for i, (name, activation) in enumerate(activations.items()):
    plt.plot(x.numpy(), activation.numpy(), label=name, linewidth=3, color=colors[i])

plt.grid(True, alpha=0.3)
plt.xlabel('Input (x)', fontsize=14)
plt.ylabel('Output f(x)', fontsize=14)
plt.title('🎨 The Complete Activation Function Family', fontsize=16, fontweight='bold')
plt.legend(fontsize=12, loc='upper left')
plt.xlim(-5, 5)
plt.ylim(-2, 5)

# Add some annotations
plt.annotate('ReLU family\n(sparse)', xy=(2, 2), xytext=(3, 3.5),
            arrowprops=dict(arrowstyle='->', color='red', alpha=0.7),
            fontsize=10, ha='center', bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.7))

plt.annotate('Smooth classics\n(bounded)', xy=(-2, -0.8), xytext=(-3.5, -1.5),
            arrowprops=dict(arrowstyle='->', color='purple', alpha=0.7),
            fontsize=10, ha='center', bbox=dict(boxstyle='round,pad=0.3', facecolor='lightblue', alpha=0.7))

plt.annotate('Modern marvels\n(self-gating)', xy=(1, 0.5), xytext=(1, 2),
            arrowprops=dict(arrowstyle='->', color='magenta', alpha=0.7),
            fontsize=10, ha='center', bbox=dict(boxstyle='round,pad=0.3', facecolor='lightgreen', alpha=0.7))

plt.tight_layout()
plt.show()

print("🌈 Visual Insights:")
print("   🔥 ReLU family: Sharp transitions, sparse activation")
print("   🌊 Classics: Smooth curves, bounded outputs")
print("   ⚡ Modern: Self-gating, smooth approximations to ReLU")
print()

In [None]:
# ⚡ Performance and Use Case Guide
print("⚡ ACTIVATION FUNCTION SELECTION GUIDE")
print("=" * 42)

# Create a comprehensive guide
guide_data = {
    'Activation': ['ReLU', 'Leaky ReLU', 'ELU', 'Sigmoid', 'Tanh', 'Swish', 'GELU'],
    'Speed': ['⚡⚡⚡', '⚡⚡⚡', '⚡⚡', '⚡⚡', '⚡⚡', '⚡', '⚡'],
    'Gradient Flow': ['⚠️', '✅', '✅', '⚠️', '⚠️', '✅', '✅'],
    'Hidden Layers': ['✅', '✅', '✅', '❌', '✅', '✅', '✅'],
    'Output Layer': ['❌', '❌', '❌', '✅ (binary)', '❌', '❌', '❌'],
    'Best For': ['General purpose', 'Avoiding dead neurons', 'Zero-centered', 'Binary classification', 'Hidden layers', 'High performance', 'Transformers']
}

import pandas as pd
df = pd.DataFrame(guide_data)

print("📊 Quick Reference Table:")
print(df.to_string(index=False))
print()

print("🎯 CHOOSING THE RIGHT ACTIVATION:")
print()
print("🏗️ For Hidden Layers:")
print("   1st choice: ReLU (fast, simple, works well)")
print("   2nd choice: Leaky ReLU (if ReLU neurons die)")
print("   3rd choice: Swish/GELU (for state-of-the-art performance)")
print()
print("🎲 For Output Layers:")
print("   Binary classification: Sigmoid")
print("   Multi-class classification: Softmax")
print("   Regression: None (linear) or ReLU (if outputs should be positive)")
print()
print("🚀 For Modern Architectures:")
print("   Transformers: GELU")
print("   CNNs: ReLU or Swish")
print("   RNNs: Tanh (traditional) or modern variants")
print()
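
The output-layer rules in the guide can be summarized as a tiny dispatcher. This is a plain-Python sketch, and `apply_output_activation` along with its task labels is a hypothetical helper invented for illustration, not part of TensorFlow or Keras.

```python
import math

def sigmoid(x):
    """Logistic sigmoid for a single logit."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_output_activation(logits, task):
    """Pick the output activation from the task type (hypothetical helper)."""
    if task == "binary":
        return [sigmoid(v) for v in logits]   # one probability per logit
    if task == "multiclass":
        return softmax(logits)                # probabilities that sum to 1
    if task == "regression":
        return list(logits)                   # linear: values pass through
    raise ValueError(f"Unknown task: {task}")

print(apply_output_activation([0.0], "binary"))
print(apply_output_activation([2.0, 1.0, 0.1], "multiclass"))
print(apply_output_activation([3.7, -1.2], "regression"))
```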

## ✅ PRACTICAL APPLICATION & VALIDATION
### 🎪 Let's run the same small architecture with different activations!

In [None]:
# 🎪 Building Networks with Different Activations
print("🎪 ACTIVATION FUNCTION BATTLE: Same Network, Different Activations")
print("=" * 68)

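# Create sample data (XOR is not linearly separable, so it needs a non-linear hidden layer)
# Fix the seed so the random weights below are reproducible across runs
tf.random.set_seed(42)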
X = tf.constant([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y_true = tf.constant([[0.], [1.], [1.], [0.]])

print("🧩 Problem: XOR Truth Table")
print("Input\tOutput")
print("-" * 12)
for i in range(4):
    print(f"{X[i].numpy()}\t{y_true[i].numpy()[0]}")
print()

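def create_xor_network(activation_fn, activation_name):
    """Build an untrained 2 → 4 → 1 network with random weights and run one forward pass on XOR.

    activation_name is accepted for labeling only; it is not used in the computation.
    """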
    
    # Network architecture: 2 → 4 → 1
    W1 = tf.Variable(tf.random.normal([2, 4], stddev=0.5))
    b1 = tf.Variable(tf.zeros([4]))
    W2 = tf.Variable(tf.random.normal([4, 1], stddev=0.5))
    b2 = tf.Variable(tf.zeros([1]))
    
    # Forward pass
    hidden = activation_fn(tf.matmul(X, W1) + b1)
    output = tf.nn.sigmoid(tf.matmul(hidden, W2) + b2)  # Sigmoid for output
    
    # Calculate loss
    loss = tf.reduce_mean(tf.square(y_true - output))
    
    return output, loss, hidden

# Test different activations
activation_tests = [
    (tf.nn.relu, "ReLU"),
    (tf.nn.tanh, "Tanh"),
    (lambda x: tf.nn.leaky_relu(x, alpha=0.01), "Leaky ReLU"),
    (swish, "Swish")
]

print("🔬 Testing Different Activations on XOR Problem:")
print("=" * 50)

results = []
for activation_fn, name in activation_tests:
    output, loss, hidden = create_xor_network(activation_fn, name)
    
    print(f"\n🎭 {name} Activation:")
    print(f"   📉 Loss: {loss.numpy():.4f}")
    print(f"   🧠 Hidden layer activations (sample):")
    print(f"      Input [0,0] → {hidden[0].numpy()}")
    print(f"   🎯 Predictions:")
    
    for i in range(4):
        pred = output[i].numpy()[0]
        true_val = y_true[i].numpy()[0]
        error = abs(pred - true_val)
        print(f"      {X[i].numpy()} → {pred:.3f} (target: {true_val}, error: {error:.3f})")
    
    results.append((name, loss.numpy()))

print("\n🏆 LOSS RANKING AT RANDOM INITIALIZATION (no training):")
print("=" * 54)
results.sort(key=lambda x: x[1])  # Sort by loss
for i, (name, loss) in enumerate(results, 1):
    medal = "🥇" if i == 1 else "🥈" if i == 2 else "🥉" if i == 3 else "📊"
    print(f"{medal} {i}. {name}: Loss = {loss:.4f}")

print("\n💡 Remember: no training happened here, and each network drew its own random weights!")
print("   The ranking reflects initialization luck; real performance depends on training, data, and architecture.")
print()
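
Because the networks above are untrained, they do not show that an activation actually makes XOR solvable. One way to see it is a hand-built network with two ReLU hidden units. The weights in this plain-Python sketch are picked by hand for illustration (they are not learned, and not taken from the exercise), and the same weights without ReLU fail.

```python
def relu(z):
    """ReLU: f(z) = max(0, z)."""
    return max(0.0, z)

def xor_relu_net(x1, x2):
    """2 -> 2 -> 1 network with hand-picked weights and ReLU hidden units."""
    h1 = relu(x1 + x2)          # fires when at least one input is 1
    h2 = relu(x1 + x2 - 1.0)    # fires only when both inputs are 1
    return h1 - 2.0 * h2        # subtract the "both on" case twice

def xor_linear_net(x1, x2):
    """Same weights with the ReLU removed: now a purely affine function."""
    h1 = x1 + x2
    h2 = x1 + x2 - 1.0
    return h1 - 2.0 * h2

inputs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
relu_outputs = [xor_relu_net(a, b) for a, b in inputs]
linear_outputs = [xor_linear_net(a, b) for a, b in inputs]

print("Targets:        [0.0, 1.0, 1.0, 0.0]")
print("With ReLU:     ", relu_outputs)
print("Without ReLU:  ", linear_outputs)
```

The failure of the linear version is not specific to these weights. Any affine function f satisfies f(0,0) + f(1,1) = f(0,1) + f(1,0), while XOR needs 0 + 0 on the left and 1 + 1 on the right, so no choice of weights can fit it without a non-linearity.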

## 🔍 KEY TAKEAWAYS

### 🎭 **Activation Function Mastery:**

1. **🔥 ReLU Family** - The workhorses of deep learning
   - Simple, fast, and effective
   - Watch out for dying ReLU problem
   - Leaky ReLU and ELU mitigate the dying neuron issue by keeping a non-zero gradient for negative inputs

2. **🌊 Smooth Classics** - Still relevant in specific contexts
   - Sigmoid: Perfect for binary classification output
   - Tanh: Better than sigmoid for hidden layers (zero-centered)
   - Both can suffer from vanishing gradients

3. **🎲 Softmax** - The probability master
   - Converts any vector of real scores into a valid probability distribution
   - Essential for multi-class classification
   - Amplifies differences between classes

4. **⚡ Modern Innovations** - State-of-the-art performance
   - Swish: Self-gating, smooth, non-monotonic
   - GELU: The transformer standard, probabilistic interpretation
   - Often outperform ReLU in deep or complex models, at some extra compute cost

### 🎯 **Selection Guidelines:**
- **Hidden layers:** Start with ReLU, upgrade to Swish/GELU for performance
- **Binary output:** Sigmoid
- **Multi-class output:** Softmax
- **Regression output:** Linear (no activation) or ReLU if positive

### 🧠 **Deep Understanding:**
- Activations introduce non-linearity (the magic ingredient!)
- Different activations create different learning dynamics
- Gradient flow is crucial for deep networks
- Modern activations often outperform classics in deep models, though the gain depends on the task

### 🤔 **Questions to Explore:**
- How do activation choices affect training speed?
- Why do transformers prefer GELU over ReLU?
- What happens if you use different activations in different layers?

## ➡️ NEXT EXERCISE PREVIEW

### 📊 T3-Exercise-4: Reduction Operations - Aggregating Intelligence

**Get ready to master:**
- 🔢 **Sum, Mean, Max, Min** - The fundamental aggregators
- 📐 **Axis operations** - Understanding dimensions and broadcasting
- 📊 **Statistical operations** - Variance, standard deviation, moments
- 🎯 **Neural network applications** - Loss functions, batch statistics, attention
- 🧠 **Practical scenarios** - Building actual loss functions and metrics

🔮 **Coming up:** Learn how neural networks summarize and aggregate information!

---

# 🎉 EXERCISE 3 COMPLETED!
## 🎭 **You've mastered the soul of neural networks!**
### ⚡ **Now you understand how linear operations become intelligent behavior!**
#### 🚀 **Ready to aggregate intelligence with reduction operations!**