# Activations - Nonlinearity in Neural Networks

Welcome to the Activations module! This is where neural networks get their power through nonlinearity.

## Learning Goals
- Understand why activation functions are essential for neural networks
- Implement the four most important activation functions: ReLU, Sigmoid, Tanh, and Softmax
- Visualize how activations transform data and enable complex learning
- See how activations work with layers to build powerful networks
- Master the NBGrader workflow with comprehensive testing

## Build → Use → Understand
1. **Build**: Activation functions that add nonlinearity
2. **Use**: Transform tensors and see immediate results
3. **Understand**: How nonlinearity enables complex pattern learning

In [None]:
#| default_exp core.activations

#| export
import math
import numpy as np
import matplotlib.pyplot as plt
import os
import sys
from typing import Union, List

# Import our Tensor class - try from package first, then from local module
try:
    from tinytorch.core.tensor import Tensor
except ImportError:
    # For development, import from local tensor module
    sys.path.append(os.path.join(os.path.dirname(__file__), '..', '01_tensor'))
    from tensor_dev import Tensor

In [None]:
print("🔥 TinyTorch Activations Module")
print(f"NumPy version: {np.__version__}")
print(f"Python version: {sys.version_info.major}.{sys.version_info.minor}")
print("Ready to build activation functions!")

## 📦 Where This Code Lives in the Final Package

**Learning Side:** You work in `modules/source/02_activations/activations_dev.py`  
**Building Side:** Code exports to `tinytorch.core.activations`

```python
# Final package structure:
from tinytorch.core.activations import ReLU, Sigmoid, Tanh, Softmax
from tinytorch.core.tensor import Tensor  # Foundation
from tinytorch.core.layers import Dense  # Uses activations
```

**Why this matters:**
- **Learning:** Focused modules for deep understanding
- **Production:** Proper organization like PyTorch's `torch.nn.ReLU`
- **Consistency:** All activation functions live together in `core.activations`
- **Integration:** Works seamlessly with tensors and layers

## What Are Activation Functions? The Key to Neural Network Intelligence

### 🎯 The Core Problem: Linear Limitations

Without activation functions, neural networks are fundamentally limited. No matter how many layers you stack, they can only learn linear relationships:

```
Layer 1: h₁ = W₁ · x + b₁
Layer 2: h₂ = W₂ · h₁ + b₂ = W₂ · (W₁ · x + b₁) + b₂
Layer 3: h₃ = W₃ · h₂ + b₃ = W₃ · (W₂ · (W₁ · x + b₁) + b₂) + b₃
```

**Mathematical Reality**: This always simplifies to:
```
y = W_combined · x + b_combined
```

**A single linear transformation!** This means neural networks without activations cannot learn:
- **Image patterns**: Recognizing curves, shapes, textures (nonlinear pixel relationships)
- **Language patterns**: Understanding syntax, semantics, context (nonlinear word relationships)
- **Strategic patterns**: Game playing, decision making (nonlinear strategy relationships)
- **Any complex real-world pattern**: Most useful relationships in data are nonlinear
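
The collapse into a single linear transformation can be checked numerically. The sketch below uses plain NumPy arrays rather than the `Tensor` class, and the layer sizes and random weights are illustrative assumptions, not values from this module. It stacks three linear layers and compares the result with one combined layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes (arbitrary choices): 4 inputs -> 5 -> 3 -> 2 outputs
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
W3, b3 = rng.normal(size=(2, 3)), rng.normal(size=2)

x = rng.normal(size=4)

# Three stacked linear layers with no activation in between
h1 = W1 @ x + b1
h2 = W2 @ h1 + b2
h3 = W3 @ h2 + b3

# The equivalent single layer: y = W_combined · x + b_combined
W_combined = W3 @ W2 @ W1
b_combined = W3 @ (W2 @ b1 + b2) + b3
y = W_combined @ x + b_combined

print(np.allclose(h3, y))  # True
```

However many linear layers are stacked, the same folding applies, which is why depth alone adds no expressive power.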

### 🔑 The Solution: Strategic Nonlinearity

Activation functions break this linear limitation by adding nonlinearity between layers:

```
Neural Network Flow:
Input → Linear Layer → Activation → Linear Layer → Activation → ... → Output
  x   →    W₁x + b₁   →    f(·)    →    W₂h₁ + b₂  →    f(·)    →     y
```

**Now each layer can learn complex transformations!**
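
As a rough illustration of this flow, the sketch below builds a two-layer network in plain NumPy (not the `Tensor` class) with a ReLU between the layers. The weights, sizes, and the `forward` helper are hypothetical. Any linear map `f` must satisfy `f(-a) = -f(a)`, so failing that check shows the network is no longer linear:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(1)

# Hypothetical weights: 3 inputs -> 8 hidden units -> 2 outputs, zero biases
W1, b1 = rng.normal(size=(8, 3)), np.zeros(8)
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)

def forward(x):
    h1 = relu(W1 @ x + b1)  # Linear layer -> activation
    return W2 @ h1 + b2     # Final linear layer

a = rng.normal(size=3)

# A purely linear network would give True here
is_odd = np.allclose(forward(-a), -forward(a))
print(is_odd)
```

Removing the `relu` call from `forward` makes the check pass again, because the network then reduces to a single matrix product.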

### 📊 Visual Understanding: The Power of Nonlinearity

#### Linear Network (No Activations):
```
Input Space:        Decision Boundary:       Capability:
     ·              ─────────────────────       Only straight lines
   ·   ·                                      Cannot separate:
 ·   ·   ·          ─────────────────────       • XOR problem
   ·   ·                                       • Circular patterns
     ·              ─────────────────────       • Any curved boundary
```

#### Nonlinear Network (With Activations):
```
Input Space:        Decision Boundary:       Capability:
     ·              ╭─────────────────╮       Complex curves
   ·   ·            │  ╭─────────╮    │       Can separate:
 ·   ·   ·          │  │         │    │       • XOR problem
   ·   ·            │  ╰─────────╯    │       • Circular patterns  
     ·              ╰─────────────────╯       • Any complex shape
```
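
The XOR case can be made concrete with a tiny hand-built network. This is a sketch with hand-picked weights (an illustrative assumption, not a trained model), using plain NumPy: two hidden ReLU units followed by a linear output reproduce XOR exactly, which no single linear layer can do.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# All four XOR inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden layer: both units compute x1 + x2, with biases 0 and -1
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])

# Output layer: h1 - 2 * h2
W2 = np.array([1.0, -2.0])

hidden = relu(X @ W1.T + b1)
xor = hidden @ W2
print(xor)  # [0. 1. 1. 0.]
```

The second hidden unit only switches on when both inputs are 1, and the output layer uses it to cancel the first unit's response.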

### 🏭 Real-World Impact: The Deep Learning Revolution

**Linear Models and Single-Layer Perceptrons**:
- Limited to linear decision boundaries
- Cannot solve the XOR problem (a fundamental nonlinear pattern)
- Stacking more linear layers adds no capability
- These limitations contributed to an early AI winter

**Multi-Layer Networks with Nonlinear Activations**:
- Backpropagation with sigmoid and tanh made multi-layer training practical in the 1980s
- ReLU helped make much deeper networks trainable, fueling the deep learning revolution of the 2010s
- State-of-the-art results in vision, language, games
- Universal approximation theorem: with enough hidden units, such networks can approximate any continuous function on a bounded domain

### 🎯 The Four Essential Activations We'll Master

1. **ReLU (Rectified Linear Unit)**: `f(x) = max(0, x)`
   - **The foundation** of modern deep learning
   - **Used in**: ResNet, VGG, and most CNNs; Transformer models such as GPT and BERT use the closely related GELU
   - **Why crucial**: Mitigates the vanishing gradient problem, computationally efficient

2. **Sigmoid**: `f(x) = 1/(1 + e^(-x))`  
   - **The classic** activation for probability outputs
   - **Used in**: Binary classification, LSTM gates, attention mechanisms
   - **Why crucial**: Maps any input to probability range (0,1)

3. **Tanh**: `f(x) = (e^x - e^(-x))/(e^x + e^(-x))`
   - **Zero-centered** activation for better training
   - **Used in**: LSTM cells, traditional neural networks, signal processing
   - **Why crucial**: Better gradients due to zero-centered output

4. **Softmax**: `f(x_i) = e^(x_i) / Σⱼ e^(x_j)`
   - **Probability distribution** for multi-class problems  
   - **Used in**: Classification heads, attention mechanisms, language models
   - **Why crucial**: Converts logits to valid probability distributions

### 🧠 Mathematical Foundation: Understanding the Functions

Each activation function serves a specific purpose:

#### **Activation Properties Table**
```
Function  | Range     | Key Property        | Primary Use
----------|-----------|---------------------|-------------------
ReLU      | [0, ∞)    | Sparse (many 0s)    | Hidden layers
Sigmoid   | (0, 1)    | Probability-like    | Binary output
Tanh      | (-1, 1)   | Zero-centered       | Hidden layers
Softmax   | (0, 1)    | Sums to 1           | Multi-class output
```

#### **Gradient Properties (Critical for Training)**
```
Function  | Gradient   | Vanishing Gradient?  | Training Efficiency
----------|------------|----------------------|--------------------
ReLU      | 1 or 0     | No (for x > 0)       | Excellent
Sigmoid   | ≤ 0.25     | Yes (for large |x|)  | Poor for deep nets
Tanh      | ≤ 1        | Yes (for large |x|)  | Better than sigmoid
Softmax   | Complex    | No                   | Good for outputs
```

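This mathematical foundation explains why ReLU revolutionized deep learning: its gradient is exactly 1 for every positive input, so it does not shrink as signals pass back through many layers, whereas sigmoid and tanh gradients shrink toward zero as |x| grows.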
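
To see how these per-layer gradient bounds compound with depth, the sketch below multiplies the activation derivative across 10 layers. It is a simplification under stated assumptions: weights are ignored, and every layer is assumed to see the same hypothetical pre-activation value `x = 0.5`.

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    return np.where(x > 0, 1.0, 0.0)

depth = 10
x = np.array(0.5)  # hypothetical pre-activation, identical at every layer

for name, grad in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad), ("relu", relu_grad)]:
    factor = float(grad(x))
    print(f"{name:8s} per-layer {factor:.4f}   after {depth} layers {factor ** depth:.2e}")
```

Even in sigmoid's best case (a per-layer factor of 0.25 at `x = 0`), ten layers scale the gradient by roughly one millionth, while ReLU's factor stays at 1 for positive inputs.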

## 🔧 DEVELOPMENT

## Step 1: ReLU - The Foundation of Deep Learning

### What is ReLU?
**ReLU (Rectified Linear Unit)** is the most important activation function in deep learning:

```
f(x) = max(0, x)
```

- **Positive inputs**: Pass through unchanged
- **Negative inputs**: Become zero
- **Zero**: Stays zero

### Why ReLU Revolutionized Deep Learning
1. **Computational efficiency**: Just a max operation
2. **Reduced vanishing gradients**: Derivative is exactly 1 for positive values (and 0 for negative values)
3. **Sparsity**: Many neurons output exactly 0
4. **Empirical success**: Works well in practice

### Visual Understanding
```
Input:  [-2, -1, 0, 1, 2]
ReLU:   [ 0,  0, 0, 1, 2]
```

### Real-World Applications
- **Image classification**: ResNet, VGG, AlexNet
- **Object detection**: YOLO, R-CNN
- **Language models**: Transformer feedforward layers
- **Recommendation**: Deep collaborative filtering

### Mathematical Properties
- **Derivative**: f'(x) = 1 if x > 0, else 0
- **Range**: [0, ∞)
- **Sparsity**: Outputs exactly 0 for negative inputs
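
These properties can be measured directly. The sketch below works on plain NumPy arrays; the standard-normal `pre_activations` are a hypothetical stand-in for real layer inputs. It computes the ReLU output, the derivative mask, and the fraction of outputs that are exactly zero:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical layer inputs: 1000 samples, 64 units, standard normal
pre_activations = rng.normal(size=(1000, 64))

activations = np.maximum(0, pre_activations)
gradient_mask = (pre_activations > 0).astype(float)  # f'(x): 1 where x > 0, else 0

sparsity = np.mean(activations == 0)
print(f"Fraction of zero outputs: {sparsity:.2f}")
print(f"Fraction of units passing gradient: {gradient_mask.mean():.2f}")
```

With zero-centered inputs, about half the units are silent at any time, and those same units pass no gradient backward.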

In [None]:
#| export
class ReLU:
    """
    ReLU Activation Function: f(x) = max(0, x)
    
    The most popular activation function in deep learning.
    Simple, fast, and effective for most applications.
    """
    
    def forward(self, x):
        """
        Apply ReLU activation: f(x) = max(0, x)
        
        TODO: Implement ReLU activation function.
        
        STEP-BY-STEP IMPLEMENTATION:
        1. For each element in the input tensor, apply max(0, element)
        2. Use NumPy's maximum function for efficient element-wise operation
        3. Return a new tensor of the same type with the results
        4. Preserve the input tensor's shape
        
        EXAMPLE USAGE:
        ```python
        relu = ReLU()
        input_tensor = Tensor([[-2, -1, 0, 1, 2]])
        output = relu(input_tensor)
        print(output.data)  # [[0, 0, 0, 1, 2]]
        ```
        
        IMPLEMENTATION HINTS:
        - Use np.maximum(0, x.data) for element-wise max with 0
        - Return the same type as input: return type(x)(result)
        - The shape should remain the same as input
        - Don't modify the input tensor (immutable operations)
        
        LEARNING CONNECTIONS:
        - This is like torch.nn.ReLU() in PyTorch
        - Used in virtually every modern neural network
        - Enables deep networks by mitigating vanishing gradients
        - Creates sparse representations (many zeros)
        """
        ### BEGIN SOLUTION
        result = np.maximum(0, x.data)
        return type(x)(result)
        ### END SOLUTION
    
    def __call__(self, x):
        """Make the class callable: relu(x) instead of relu.forward(x)"""
        return self.forward(x)

### 🧪 Test Your ReLU Implementation

Once you implement the ReLU forward method above, run this cell to test it:

In [None]:
def test_unit_relu_activation():
    """Unit test for the ReLU activation function."""
    print("🔬 Unit Test: ReLU Activation...")

    # Create ReLU instance
    relu = ReLU()

    # Test with mixed positive/negative values
    test_input = Tensor([[-2, -1, 0, 1, 2]])
    result = relu(test_input)
    expected = np.array([[0, 0, 0, 1, 2]])
    
    assert np.array_equal(result.data, expected), f"ReLU failed: expected {expected}, got {result.data}"
    
    # Test that negative values become zero
    assert np.all(result.data >= 0), "ReLU should make all negative values zero"
    
    # Test that positive values remain unchanged
    positive_input = Tensor([[1, 2, 3, 4, 5]])
    positive_result = relu(positive_input)
    assert np.array_equal(positive_result.data, positive_input.data), "ReLU should preserve positive values"
    
    # Test with 2D tensor
    matrix_input = Tensor([[-1, 2], [3, -4]])
    matrix_result = relu(matrix_input)
    matrix_expected = np.array([[0, 2], [3, 0]])
    assert np.array_equal(matrix_result.data, matrix_expected), "ReLU should work with 2D tensors"
    
    # Test shape preservation
    assert matrix_result.shape == matrix_input.shape, "ReLU should preserve input shape"
    
    print("✅ ReLU activation tests passed!")
    print(f"✅ Negative values correctly zeroed")
    print(f"✅ Positive values preserved")
    print(f"✅ Shape preservation working")
    print(f"✅ Works with multi-dimensional tensors")

# Run the test
test_unit_relu_activation()

### 🎯 Checkpoint: ReLU Mastery
Congratulations! You've successfully implemented and tested the ReLU activation function. 

Before moving to the next activation, make sure you can:

```python
# Create and test ReLU with different input patterns
relu = ReLU()

# Test 1: Basic functionality
basic_input = Tensor([[-3, -1, 0, 1, 3]])
basic_output = relu(basic_input)
print(f"Input:  {basic_input.data}")   # [[-3, -1,  0,  1,  3]]
print(f"Output: {basic_output.data}")  # [[ 0,  0,  0,  1,  3]]

# Test 2: 2D tensors (simulating mini-batch)
batch_input = Tensor([[-2, 1], [3, -1]])
batch_output = relu(batch_input)
print(f"Batch Output: {batch_output.data}")  # [[0, 1], [3, 0]]

# Test 3: Understand sparsity
print(f"Sparsity: {np.count_nonzero(batch_output.data == 0) / batch_output.data.size * 100:.1f}% zeros")
```

**Key Understanding**: ReLU creates **sparse representations** - many outputs are exactly zero. This sparsity makes networks:
- **Computationally efficient**: Skip zero computations
- **Memory efficient**: Compress sparse representations  
- **Biologically inspired**: Real neurons often stay silent

You now have the foundation activation that powers modern deep learning!

## Step 2: Sigmoid - Classic Binary Classification

### What is Sigmoid? The Probability Gateway

**Sigmoid** is the classic S-shaped activation function that gracefully maps any real number to the probability range (0, 1):

```
f(x) = 1 / (1 + e^(-x))
```

### 📊 Visual Understanding: The Sigmoid Curve

```
Sigmoid Function Shape:

     1.0 |                    ╭─────
         |                 ╭──╯
     0.8 |              ╭─╯
         |           ╭─╯
     0.6 |        ╭─╯
         |      ╭╯
     0.4 |   ╭─╯
         |  ╱
     0.2 |╱
         ╱
     0.0 ╱────────────────────────────
        -4    -2     0     2     4
```

**Key Insight**: The sigmoid is nature's smooth switch - it gradually transitions from "off" (0) to "on" (1).

### 🎯 Why Sigmoid Is Essential for ML

#### **1. Probability Interpretation**
```
Raw Neural Network Output:  [-2.5, 0.3, 1.8, -0.7]  (any range)
After Sigmoid:              [0.08, 0.57, 0.86, 0.33]  (valid probabilities)
```

#### **2. Smooth Decision Boundaries**
Unlike step functions, sigmoid provides smooth transitions:
```
Inputs:                   -3, -2, -1, 1, 2, 3
Hard Decision (Step):     0, 0, 0, 1, 1, 1  (abrupt jump)
Soft Decision (Sigmoid):  0.05, 0.12, 0.27, 0.73, 0.88, 0.95  (smooth transition)
```

#### **3. Binary Classification Perfect Fit**
```
For binary classification:
- Output > 0.5 → Class 1 (positive)
- Output < 0.5 → Class 0 (negative)  
- Output = 0.5 → Uncertain (decision boundary)
```
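
A minimal sketch of this decision rule, using plain NumPy rather than the `Tensor` class (the `sigmoid` helper is a stand-in for the class implemented below, and the logits are the example values from the probability interpretation above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([-2.5, 0.3, 1.8, -0.7])  # raw network outputs
probs = sigmoid(logits)
labels = (probs > 0.5).astype(int)         # threshold at the decision boundary

print(np.round(probs, 2))  # [0.08 0.57 0.86 0.33]
print(labels)              # [0 1 1 0]
```

Because `sigmoid(0) = 0.5`, thresholding the probability at 0.5 is the same as checking whether the raw logit is positive.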

### 🏭 Real-World Applications: Where Sigmoid Shines

**Binary Classification Tasks**:
- **Email spam detection**: Probability this email is spam
- **Medical diagnosis**: Probability patient has condition
- **Fraud detection**: Probability transaction is fraudulent
- **A/B testing**: Probability user clicks/converts

**Neural Network Components**:
- **LSTM gates**: Forget gate, input gate, output gate decisions
- **Attention mechanisms**: How much attention to pay to each element
- **Probability outputs**: Final layer for binary classification

### 🧠 Mathematical Deep Dive

#### **Critical Properties**
```
Property          | Value/Formula        | ML Significance
------------------|---------------------|----------------------------
Range             | (0, 1)              | Valid probability space
Derivative        | σ(x)·(1-σ(x))       | Self-referential gradient
Maximum gradient  | 0.25 (at x=0)       | Vanishing gradient problem
Symmetry point    | σ(0) = 0.5          | Natural decision boundary
Saturation        | σ(±∞) ≈ 0 or 1      | Confident predictions
```

#### **The Vanishing Gradient Problem**
```
Input x   | Sigmoid Output | Gradient σ'(x) | Training Impact
----------|----------------|----------------|----------------
x = ±1    | 0.27 / 0.73    | ~0.2           | Good learning
x = ±3    | 0.05 / 0.95    | ~0.05          | Slow learning
x = ±5    | 0.007 / 0.993  | ~0.007         | Very slow
x = ±10   | ~0 / ~1        | ~0.00005       | Learning stops
```

**This is why ReLU replaced sigmoid in hidden layers - sigmoid gradients vanish for large inputs!**
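
The gradient sizes in the table can be recomputed from the derivative formula σ(x)·(1-σ(x)). A small sketch in plain NumPy (the `sigmoid` helper and the `grads` dictionary are illustrative names, not part of this module):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

grads = {}
for bound in [1, 3, 5, 10]:
    s = sigmoid(bound)
    grads[bound] = s * (1 - s)  # derivative of sigmoid at x = bound
    print(f"x = ±{bound:<2d}  sigmoid(x) = {s:.5f}  gradient = {grads[bound]:.1e}")
```

The gradient is the same at `+x` and `-x`, so a single evaluation per row covers both ends of each range.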

### 🎯 When to Use Sigmoid vs ReLU

```
Use Case                     | Sigmoid | ReLU | Why?
----------------------------|---------|------|---------------------------
Hidden layers (deep nets)  |    ❌   |  ✅  | ReLU avoids vanishing gradients
Binary classification output|    ✅   |  ❌  | Need probability interpretation
LSTM/GRU gates             |    ✅   |  ❌  | Need smooth 0-1 gating
Multi-class classification  |    ❌   |  ❌  | Use Softmax instead
```

In [None]:
#| export
class Sigmoid:
    """
    Sigmoid Activation Function: f(x) = 1 / (1 + e^(-x))
    
    Maps any real number to the range (0, 1).
    Useful for binary classification and probability outputs.
    """
    
    def forward(self, x):
        """
        Apply Sigmoid activation: f(x) = 1 / (1 + e^(-x))
        
        TODO: Implement Sigmoid activation function.
        
        STEP-BY-STEP IMPLEMENTATION:
        1. Compute the negative of input: -x.data
        2. Compute the exponential: np.exp(-x.data)
        3. Add 1 to the exponential: 1 + np.exp(-x.data)
        4. Take the reciprocal: 1 / (1 + np.exp(-x.data))
        5. Return as new Tensor
        
        EXAMPLE USAGE:
        ```python
        sigmoid = Sigmoid()
        input_tensor = Tensor([[-2, -1, 0, 1, 2]])
        output = sigmoid(input_tensor)
        print(output.data)  # [[0.119, 0.269, 0.5, 0.731, 0.881]]
        ```
        
        IMPLEMENTATION HINTS:
        - Use np.exp() for exponential function
        - Formula: 1 / (1 + np.exp(-x.data))
        - Handle potential overflow with np.clip(-x.data, -500, 500)
        - Return Tensor(result)
        
        LEARNING CONNECTIONS:
        - This is like torch.nn.Sigmoid() in PyTorch
        - Used in binary classification output layers
        - Key component in LSTM and GRU gating mechanisms
        - Historically important for early neural networks
        """
        ### BEGIN SOLUTION
        # Clip to prevent overflow
        clipped_input = np.clip(-x.data, -500, 500)
        result = 1 / (1 + np.exp(clipped_input))
        return type(x)(result)
        ### END SOLUTION
    
    def __call__(self, x):
        """Make the class callable: sigmoid(x) instead of sigmoid.forward(x)"""
        return self.forward(x)

### 🧪 Test Your Sigmoid Implementation

Once you implement the Sigmoid forward method above, run this cell to test it:

In [None]:
def test_unit_sigmoid_activation():
    """Unit test for the Sigmoid activation function."""
    print("🔬 Unit Test: Sigmoid Activation...")

    # Create Sigmoid instance
    sigmoid = Sigmoid()

    # Test with known values
    test_input = Tensor([[0]])
    result = sigmoid(test_input)
    expected = 0.5
    
    assert abs(result.data[0][0] - expected) < 1e-6, f"Sigmoid(0) should be 0.5, got {result.data[0][0]}"
    
    # Test with positive and negative values
    test_input = Tensor([[-2, -1, 0, 1, 2]])
    result = sigmoid(test_input)
    
    # Check that all values are between 0 and 1
    assert np.all(result.data > 0), "Sigmoid output should be > 0"
    assert np.all(result.data < 1), "Sigmoid output should be < 1"
    
    # Test symmetry: sigmoid(-x) = 1 - sigmoid(x)
    x_val = 1.0
    pos_result = sigmoid(Tensor([[x_val]]))
    neg_result = sigmoid(Tensor([[-x_val]]))
    symmetry_check = abs(pos_result.data[0][0] + neg_result.data[0][0] - 1.0)
    assert symmetry_check < 1e-6, "Sigmoid should be symmetric around 0.5"
    
    # Test with 2D tensor
    matrix_input = Tensor([[-1, 1], [0, 2]])
    matrix_result = sigmoid(matrix_input)
    assert matrix_result.shape == matrix_input.shape, "Sigmoid should preserve shape"
    
    # Test extreme values (should not overflow)
    extreme_input = Tensor([[-100, 100]])
    extreme_result = sigmoid(extreme_input)
    assert not np.any(np.isnan(extreme_result.data)), "Sigmoid should handle extreme values"
    assert not np.any(np.isinf(extreme_result.data)), "Sigmoid should not produce inf values"
    
    print("✅ Sigmoid activation tests passed!")
    print(f"✅ Outputs correctly bounded between 0 and 1")
    print(f"✅ Symmetric property verified")
    print(f"✅ Handles extreme values without overflow")
    print(f"✅ Shape preservation working")

# Run the test
test_unit_sigmoid_activation()

## Step 3: Tanh - Centered Activation

### What is Tanh?
**Tanh (Hyperbolic Tangent)** is similar to sigmoid but centered around zero:

```
f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
```

### Why Tanh is Better Than Sigmoid
1. **Zero-centered**: Outputs range from -1 to 1
2. **Better gradients**: Helps with gradient flow in deep networks
3. **Faster convergence**: Less bias shift during training
4. **Stronger gradients**: Maximum gradient is 1 vs 0.25 for sigmoid

### Visual Understanding
```
Input: [-∞, -2, -1, 0, 1, 2, ∞]
Tanh:  [-1, -0.96, -0.76, 0, 0.76, 0.96, 1]
```

### Real-World Applications
- **Hidden layers**: Better than sigmoid for internal activations
- **RNN cells**: Classic RNN and LSTM use tanh
- **Normalization**: When you need zero-centered outputs
- **Feature scaling**: Maps inputs to the (-1, 1) range

### Mathematical Properties
- **Range**: (-1, 1)
- **Derivative**: f'(x) = 1 - f(x)²
- **Zero-centered**: f(0) = 0
- **Antisymmetric**: f(-x) = -f(x)
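
These properties can be verified numerically. The sketch below uses plain NumPy (not the `Tensor` class) to check the derivative formula against a finite difference, confirm antisymmetry, and show that tanh is a rescaled, shifted sigmoid: tanh(x) = 2·sigmoid(2x) − 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-4, 4, 17)

# tanh as a rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
rescaled = 2 * sigmoid(2 * x) - 1
print(np.allclose(np.tanh(x), rescaled))  # True

# Derivative f'(x) = 1 - f(x)^2, compared with a central finite difference
eps = 1e-5
analytic = 1 - np.tanh(x) ** 2
numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-7))  # True

# Antisymmetry: f(-x) = -f(x)
print(np.allclose(np.tanh(-x), -np.tanh(x)))  # True
```

The sigmoid relationship explains why tanh shares sigmoid's saturation behavior while gaining a zero-centered output.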

In [None]:
#| export
class Tanh:
    """
    Tanh Activation Function: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
    
    Zero-centered activation function with range (-1, 1).
    Better gradient properties than sigmoid.
    """
    
    def forward(self, x: Tensor) -> Tensor:
        """
        Apply Tanh activation: f(x) = (e^x - e^(-x)) / (e^x + e^(-x))
        
        TODO: Implement Tanh activation function.
        
        STEP-BY-STEP IMPLEMENTATION:
        1. Use NumPy's built-in tanh function: np.tanh(x.data)
        2. Alternatively, implement manually:
           - Compute e^x and e^(-x)
           - Calculate (e^x - e^(-x)) / (e^x + e^(-x))
        3. Return as new Tensor
        
        EXAMPLE USAGE:
        ```python
        tanh = Tanh()
        input_tensor = Tensor([[-2, -1, 0, 1, 2]])
        output = tanh(input_tensor)
        print(output.data)  # [[-0.964, -0.762, 0, 0.762, 0.964]]
        ```
        
        IMPLEMENTATION HINTS:
        - Use np.tanh(x.data) for simplicity
        - Manual implementation: (np.exp(x.data) - np.exp(-x.data)) / (np.exp(x.data) + np.exp(-x.data))
        - Handle overflow by clipping inputs: np.clip(x.data, -500, 500)
        - Return Tensor(result)
        
        LEARNING CONNECTIONS:
        - This is like torch.nn.Tanh() in PyTorch
        - Used in RNN, LSTM, and GRU cells
        - Better than sigmoid for hidden layers
        - Zero-centered outputs help with gradient flow
        """
        ### BEGIN SOLUTION
        # Use NumPy's built-in tanh function
        result = np.tanh(x.data)
        return type(x)(result)
        ### END SOLUTION
    
    def __call__(self, x: Tensor) -> Tensor:
        """Make the class callable: tanh(x) instead of tanh.forward(x)"""
        return self.forward(x)

### 🧪 Test Your Tanh Implementation

Once you implement the Tanh forward method above, run this cell to test it:

In [None]:
def test_unit_tanh_activation():
    """Unit test for the Tanh activation function."""
    print("🔬 Unit Test: Tanh Activation...")

    # Create Tanh instance
    tanh = Tanh()

    # Test with zero (should be 0)
    test_input = Tensor([[0]])
    result = tanh(test_input)
    expected = 0.0
    
    assert abs(result.data[0][0] - expected) < 1e-6, f"Tanh(0) should be 0, got {result.data[0][0]}"
    
    # Test with positive and negative values
    test_input = Tensor([[-2, -1, 0, 1, 2]])
    result = tanh(test_input)
    
    # Check that all values are between -1 and 1
    assert np.all(result.data > -1), "Tanh output should be > -1"
    assert np.all(result.data < 1), "Tanh output should be < 1"
    
    # Test antisymmetry: tanh(-x) = -tanh(x)
    x_val = 1.5
    pos_result = tanh(Tensor([[x_val]]))
    neg_result = tanh(Tensor([[-x_val]]))
    antisymmetry_check = abs(pos_result.data[0][0] + neg_result.data[0][0])
    assert antisymmetry_check < 1e-6, "Tanh should be antisymmetric"
    
    # Test with 2D tensor
    matrix_input = Tensor([[-1, 1], [0, 2]])
    matrix_result = tanh(matrix_input)
    assert matrix_result.shape == matrix_input.shape, "Tanh should preserve shape"
    
    # Test extreme values (should not overflow)
    extreme_input = Tensor([[-100, 100]])
    extreme_result = tanh(extreme_input)
    assert not np.any(np.isnan(extreme_result.data)), "Tanh should handle extreme values"
    assert not np.any(np.isinf(extreme_result.data)), "Tanh should not produce inf values"
    
    # Test that extreme values approach ±1
    assert abs(extreme_result.data[0][0] - (-1)) < 1e-6, "Tanh(-∞) should approach -1"
    assert abs(extreme_result.data[0][1] - 1) < 1e-6, "Tanh(∞) should approach 1"
    
    print("✅ Tanh activation tests passed!")
    print(f"✅ Outputs correctly bounded between -1 and 1")
    print(f"✅ Antisymmetric property verified")
    print(f"✅ Zero-centered (tanh(0) = 0)")
    print(f"✅ Handles extreme values correctly")

# Run the test
test_unit_tanh_activation()

## Step 4: Softmax - Probability Distributions

### What is Softmax?
**Softmax** converts a vector of real numbers into a probability distribution:

```
f(x_i) = e^(x_i) / Σ(e^(x_j))
```

### Why Softmax is Essential
1. **Probability distribution**: Outputs sum to 1
2. **Multi-class classification**: Choose one class from many
3. **Interpretable**: Each output is a probability
4. **Differentiable**: Enables gradient-based learning

### Visual Understanding
```
Input:  [1, 2, 3]
Softmax:[0.09, 0.24, 0.67]  # Sums to 1.0
```
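
The same computation breaks down for large inputs unless the maximum is subtracted first. The sketch below uses plain NumPy, and `softmax_naive` and `softmax_stable` are illustrative helper names rather than this module's API. It compares the direct formula with the shifted version:

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)  # overflows to inf for large z
    return e / e.sum(axis=-1, keepdims=True)

def softmax_stable(z):
    shifted = z - np.max(z, axis=-1, keepdims=True)  # largest exponent becomes 0
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

small = np.array([1.0, 2.0, 3.0])
print(np.round(softmax_stable(small), 2))  # [0.09 0.24 0.67]

large = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(large))            # [nan nan nan]
print(np.round(softmax_stable(large), 2))  # [0.09 0.24 0.67]
```

Subtracting a constant from every input leaves softmax unchanged, because the factor `e^(-c)` cancels between numerator and denominator. That is why both inputs give the same probabilities.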

### Real-World Applications
- **Classification**: Image classification, text classification
- **Language models**: Next word prediction
- **Attention mechanisms**: Where to focus attention
- **Reinforcement learning**: Action selection probabilities

### Mathematical Properties
- **Range**: (0, 1) for each output
- **Constraint**: Σ(f(x_i)) = 1
- **Argmax preservation**: Larger inputs always get larger probabilities, so the relative ordering (and the argmax) is unchanged
- **Temperature scaling**: Can be made sharper or softer
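
Temperature scaling and argmax preservation can be illustrated together. In the sketch below, the `temperature` parameter is a hypothetical addition for demonstration only; the `Softmax` class in this module does not take one. Dividing the inputs by a temperature below 1 sharpens the distribution, and a temperature above 1 flattens it:

```python
import numpy as np

def softmax(z, temperature=1.0):
    """Softmax with an illustrative temperature parameter (an assumption, not this module's API)."""
    scaled = np.asarray(z, dtype=float) / temperature
    shifted = scaled - np.max(scaled, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([1.0, 2.0, 3.0])

for t in [0.5, 1.0, 2.0]:
    probs = softmax(logits, temperature=t)
    print(f"T={t}: {np.round(probs, 3)}  sum={probs.sum():.1f}  argmax={np.argmax(probs)}")
```

In every case the outputs sum to 1 and the largest logit keeps the largest probability; only the confidence of the distribution changes.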

In [None]:
#| export
class Softmax:
    """
    Softmax Activation Function: f(x_i) = e^(x_i) / Σ(e^(x_j))
    
    Converts a vector of real numbers into a probability distribution.
    Essential for multi-class classification.
    """
    
    def forward(self, x):
        """
        Apply Softmax activation: f(x_i) = e^(x_i) / Σ(e^(x_j))
        
        TODO: Implement Softmax activation function.
        
        STEP-BY-STEP IMPLEMENTATION:
        1. Handle empty input case
        2. Subtract max value for numerical stability: x - max(x)
        3. Compute exponentials: np.exp(x - max(x))
        4. Compute sum of exponentials: np.sum(exp_values)
        5. Divide each exponential by the sum: exp_values / sum
        6. Return as same tensor type as input
        
        EXAMPLE USAGE:
        ```python
        softmax = Softmax()
        input_tensor = Tensor([[1, 2, 3]])
        output = softmax(input_tensor)
        print(output.data)  # [[0.09, 0.24, 0.67]]
        print(np.sum(output.data))  # 1.0
        ```
        
        IMPLEMENTATION HINTS:
        - Handle empty case: if x.data.size == 0: return type(x)(x.data.copy())
        - Subtract max for numerical stability: x_shifted = x.data - np.max(x.data, axis=-1, keepdims=True)
        - Compute exponentials: exp_values = np.exp(x_shifted)
        - Sum along last axis: sum_exp = np.sum(exp_values, axis=-1, keepdims=True)
        - Divide: result = exp_values / sum_exp
        - Return same type as input: return type(x)(result)
        
        LEARNING CONNECTIONS:
        - This is like torch.nn.Softmax() in PyTorch
        - Used in classification output layers
        - Key component in attention mechanisms
        - Enables probability-based decision making
        """
        ### BEGIN SOLUTION
        # Handle empty input
        if x.data.size == 0:
            return type(x)(x.data.copy())
        
        # Subtract max for numerical stability
        x_shifted = x.data - np.max(x.data, axis=-1, keepdims=True)
        
        # Compute exponentials
        exp_values = np.exp(x_shifted)
        
        # Sum along last axis
        sum_exp = np.sum(exp_values, axis=-1, keepdims=True)
        
        # Divide to get probabilities
        result = exp_values / sum_exp
        
        return type(x)(result)
        ### END SOLUTION
    
    def __call__(self, x):
        """Make the class callable: softmax(x) instead of softmax.forward(x)"""
        return self.forward(x)

### 🧪 Test Your Softmax Implementation

Once you implement the Softmax forward method above, run this cell to test it:

In [None]:
def test_unit_softmax_activation():
    """Unit test for the Softmax activation function."""
    print("🔬 Unit Test: Softmax Activation...")

    # Create Softmax instance
    softmax = Softmax()

    # Test with simple input
    test_input = Tensor([[1, 2, 3]])
    result = softmax(test_input)
    
    # Check that outputs sum to 1
    output_sum = np.sum(result.data)
    assert abs(output_sum - 1.0) < 1e-6, f"Softmax outputs should sum to 1, got {output_sum}"
    
    # Check that all outputs are positive
    assert np.all(result.data > 0), "Softmax outputs should be positive"
    assert np.all(result.data < 1), "Softmax outputs should be less than 1"
    
    # Test with uniform input (should give equal probabilities)
    uniform_input = Tensor([[1, 1, 1]])
    uniform_result = softmax(uniform_input)
    expected_prob = 1.0 / 3.0
    
    for prob in uniform_result.data[0]:
        assert abs(prob - expected_prob) < 1e-6, f"Uniform input should give equal probabilities"
    
    # Test with batch input (multiple samples)
    batch_input = Tensor([[1, 2, 3], [4, 5, 6]])
    batch_result = softmax(batch_input)
    
    # Check that each row sums to 1
    for i in range(batch_input.shape[0]):
        row_sum = np.sum(batch_result.data[i])
        assert abs(row_sum - 1.0) < 1e-6, f"Each row should sum to 1, row {i} sums to {row_sum}"
    
    # Test numerical stability with large values
    large_input = Tensor([[1000, 1001, 1002]])
    large_result = softmax(large_input)
    
    assert not np.any(np.isnan(large_result.data)), "Softmax should handle large values"
    assert not np.any(np.isinf(large_result.data)), "Softmax should not produce inf values"
    
    large_sum = np.sum(large_result.data)
    assert abs(large_sum - 1.0) < 1e-6, "Large values should still sum to 1"

    # Test shape preservation
    assert batch_result.shape == batch_input.shape, "Softmax should preserve shape"
    
    print("✅ Softmax activation tests passed!")
    print(f"✅ Outputs sum to 1 (probability distribution)")
    print(f"✅ All outputs are positive")
    print(f"✅ Handles uniform inputs correctly")
    print(f"✅ Works with batch inputs")
    print(f"✅ Numerically stable with large values")

# Run the test
test_unit_softmax_activation()

## 🎯 Comprehensive Test: All Activations Working Together

### Real-World Scenario
Let's test how all activation functions work together in a realistic neural network scenario:

- **Input processing**: Raw data transformation
- **Hidden layers**: ReLU for internal processing
- **Output layer**: Softmax for classification
- **Comparison**: See how different activations transform the same data
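
Before running the full test, it can help to see the comparison side by side. This sketch applies all four functions to the same input using plain NumPy expressions, as stand-ins for the `ReLU`, `Sigmoid`, `Tanh`, and `Softmax` classes above:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

relu_out = np.maximum(0, x)
sigmoid_out = 1.0 / (1.0 + np.exp(-x))
tanh_out = np.tanh(x)
exp_shifted = np.exp(x - x.max())        # shift by max for numerical stability
softmax_out = exp_shifted / exp_shifted.sum()

for name, out in [("ReLU", relu_out), ("Sigmoid", sigmoid_out),
                  ("Tanh", tanh_out), ("Softmax", softmax_out)]:
    print(f"{name:8s} {np.round(out, 3)}")
```

ReLU, Sigmoid, and Tanh act on each element independently, while Softmax couples all elements through the shared normalizing sum.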

In [None]:
def test_unit_activations_comprehensive():
    """Comprehensive unit test for all activation functions working together."""
    print("🔬 Unit Test: Activation Functions Comprehensive Test...")
    
    # Create instances of all activation functions
    relu = ReLU()
    sigmoid = Sigmoid()
    tanh = Tanh()
    softmax = Softmax()
    
    # Test data: simulating neural network layer outputs
    test_data = Tensor([[-2, -1, 0, 1, 2]])
    
    # Apply each activation function
    relu_result = relu(test_data)
    sigmoid_result = sigmoid(test_data)
    tanh_result = tanh(test_data)
    softmax_result = softmax(test_data)
    
    # Test that all functions preserve input shape
    assert relu_result.shape == test_data.shape, "ReLU should preserve shape"
    assert sigmoid_result.shape == test_data.shape, "Sigmoid should preserve shape"
    assert tanh_result.shape == test_data.shape, "Tanh should preserve shape"
    assert softmax_result.shape == test_data.shape, "Softmax should preserve shape"
    
    # Test that all functions return Tensor objects
    assert isinstance(relu_result, Tensor), "ReLU should return Tensor"
    assert isinstance(sigmoid_result, Tensor), "Sigmoid should return Tensor"
    assert isinstance(tanh_result, Tensor), "Tanh should return Tensor"
    assert isinstance(softmax_result, Tensor), "Softmax should return Tensor"
    
    # Test ReLU properties
    assert np.all(relu_result.data >= 0), "ReLU output should be non-negative"
    
    # Test Sigmoid properties
    assert np.all(sigmoid_result.data > 0), "Sigmoid output should be positive"
    assert np.all(sigmoid_result.data < 1), "Sigmoid output should be less than 1"
    
    # Test Tanh properties
    assert np.all(tanh_result.data > -1), "Tanh output should be > -1"
    assert np.all(tanh_result.data < 1), "Tanh output should be < 1"
    
    # Test Softmax properties
    softmax_sum = np.sum(softmax_result.data)
    assert abs(softmax_sum - 1.0) < 1e-6, "Softmax outputs should sum to 1"
    
    # Test chaining activations (realistic neural network scenario)
    # Hidden layer with ReLU
    hidden_output = relu(test_data)
    
    # Add some weights simulation (element-wise multiplication)
    weights = Tensor([[0.5, 0.3, 0.8, 0.2, 0.7]])
    weighted_output = hidden_output * weights
    
    # Final layer with Softmax
    final_output = softmax(weighted_output)
    
    # Test that chained operations work
    assert isinstance(final_output, Tensor), "Chained operations should return Tensor"
    assert abs(np.sum(final_output.data) - 1.0) < 1e-6, "Final output should be valid probability"
    
    # Test with batch data (multiple samples)
    batch_data = Tensor([
        [-2, -1, 0, 1, 2],
        [1, 2, 3, 4, 5],
        [-1, 0, 1, 2, 3]
    ])
    
    batch_softmax = softmax(batch_data)
    
    # Each row should sum to 1
    for i in range(batch_data.shape[0]):
        row_sum = np.sum(batch_softmax.data[i])
        assert abs(row_sum - 1.0) < 1e-6, f"Batch row {i} should sum to 1"
    
    print("✅ Activation functions comprehensive tests passed!")
    print("✅ All functions work together seamlessly")
    print("✅ Shape preservation across all activations")
    print("✅ Chained operations work correctly")
    print("✅ Batch processing works for all activations")
    print("✅ Ready for neural network integration!")

# Run the comprehensive test
test_unit_activations_comprehensive()

In [None]:
def test_module_activation_tensor_integration():
    """
    Integration test for activation functions with Tensor operations.
    
    Tests that activation functions properly integrate with the Tensor class
    and maintain compatibility for neural network operations.
    """
    print("🔬 Running Integration Test: Activation-Tensor Integration...")
    
    # Test 1: Activation functions preserve Tensor types
    input_tensor = Tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
    
    relu_fn = ReLU()
    sigmoid_fn = Sigmoid()
    tanh_fn = Tanh()
    
    relu_result = relu_fn(input_tensor)
    sigmoid_result = sigmoid_fn(input_tensor) 
    tanh_result = tanh_fn(input_tensor)
    
    assert isinstance(relu_result, Tensor), "ReLU should return Tensor"
    assert isinstance(sigmoid_result, Tensor), "Sigmoid should return Tensor"
    assert isinstance(tanh_result, Tensor), "Tanh should return Tensor"
    
    # Test 2: Activations work with matrix Tensors (neural network layers)
    layer_output = Tensor([[1.0, -2.0, 3.0], 
                          [-1.0, 2.0, -3.0]])  # Simulating dense layer output
    
    relu_fn = ReLU()
    activated = relu_fn(layer_output)
    expected = np.array([[1.0, 0.0, 3.0], 
                        [0.0, 2.0, 0.0]])
    
    assert isinstance(activated, Tensor), "Matrix activation should return Tensor"
    assert np.array_equal(activated.data, expected), "Matrix ReLU should work correctly"
    
    # Test 3: Softmax with classification scenario
    logits = Tensor([[2.0, 1.0, 0.1],  # Batch of 2 samples
                    [1.0, 3.0, 0.2]])   # Each with 3 classes
    
    softmax_fn = Softmax()
    probabilities = softmax_fn(logits)
    
    assert isinstance(probabilities, Tensor), "Softmax should return Tensor"
    assert probabilities.shape == logits.shape, "Softmax should preserve shape"
    
    # Each row should sum to 1 (probability distribution)
    for i in range(logits.shape[0]):
        row_sum = np.sum(probabilities.data[i])
        assert abs(row_sum - 1.0) < 1e-6, f"Probability row {i} should sum to 1"
    
    # Test 4: Chaining tensor operations with activations
    x = Tensor([1.0, 2.0, 3.0])
    y = Tensor([4.0, 5.0, 6.0])
    
    # Simulate: dense layer output -> activation -> more operations
    dense_sim = x * y  # Element-wise multiplication (simulating dense layer)
    relu_fn = ReLU()
    activated = relu_fn(dense_sim)  # Apply activation
    final = activated + Tensor([1.0, 1.0, 1.0])  # More tensor operations
    
    expected_final = np.array([5.0, 11.0, 19.0])  # [4,10,18] -> relu -> +1 = [5,11,19]
    
    assert isinstance(final, Tensor), "Chained operations should maintain Tensor type"
    assert np.array_equal(final.data, expected_final), "Chained operations should work correctly"
    
    print("✅ Integration Test Passed: Activation-Tensor integration works correctly.")

# Run the integration test
test_module_activation_tensor_integration()

## 🎯 MODULE SUMMARY: Activation Functions

Congratulations! You've successfully implemented all four essential activation functions:

### ✅ What You've Built
- **ReLU**: The default hidden-layer activation in modern deep learning, giving sparse outputs at very low cost
- **Sigmoid**: Classic activation for binary classification and probability outputs
- **Tanh**: Zero-centered activation, which often gives better-behaved gradients than Sigmoid
- **Softmax**: Probability distribution for multi-class classification

### ✅ Key Learning Outcomes
- **Understanding**: Why nonlinearity is essential for neural networks
- **Implementation**: Built activation functions from scratch using NumPy
- **Testing**: Progressive validation with immediate feedback after each function
- **Integration**: Saw how activations work together in neural networks
- **Real-world context**: Understanding where each activation is used

### ✅ Mathematical Mastery
- **ReLU**: f(x) = max(0, x) - Simple but powerful
- **Sigmoid**: f(x) = 1/(1 + e^(-x)) - Maps to (0,1)
- **Tanh**: f(x) = tanh(x) - Zero-centered, maps to (-1,1)
- **Softmax**: f(x_i) = e^(x_i) / Σ_j e^(x_j) - Probability distribution
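
These formulas are more closely related than they look. The sketch below numerically checks two standard identities: Tanh is a rescaled Sigmoid, and Softmax over the two logits `[z, 0]` reduces to Sigmoid. The helpers `sigmoid` and `softmax` are plain NumPy functions written for this illustration, not the classes you implemented in this module.

```python
import numpy as np

def sigmoid(x):
    """Illustrative NumPy sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Illustrative NumPy softmax over a 1-D vector."""
    shifted = np.exp(x - np.max(x))
    return shifted / np.sum(shifted)

x = np.linspace(-3.0, 3.0, 7)

# Identity 1: tanh(x) = 2 * sigmoid(2x) - 1
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
assert np.allclose(tanh_from_sigmoid, np.tanh(x))

# Identity 2: softmax([z, 0])[0] = e^z / (e^z + 1) = sigmoid(z)
two_class = np.array([softmax(np.array([z, 0.0]))[0] for z in x])
assert np.allclose(two_class, sigmoid(x))

print("Both identities hold on the sampled points.")
```

This is why Sigmoid is the natural output for binary classification and Softmax for the multi-class case: the second is a generalization of the first.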

### ✅ Professional Skills Developed
- **Numerical stability**: Handling overflow and underflow
- **API design**: Consistent interfaces across all functions
- **Testing discipline**: Immediate validation after each implementation
- **Integration thinking**: Understanding how components work together
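
As a reminder of what numerical stability means in practice, the sketch below contrasts a naive softmax with one that subtracts the maximum before exponentiating. Subtracting the maximum is one common stabilization technique. Whether your `Softmax` uses exactly this approach is an assumption, and `softmax_naive` and `softmax_stable` are hypothetical helper names used only here.

```python
import numpy as np

def softmax_naive(x):
    """Direct translation of the formula; overflows for large inputs."""
    exps = np.exp(x)
    return exps / np.sum(exps)

def softmax_stable(x):
    """Shift by the max first; the result is mathematically unchanged."""
    shifted = np.exp(x - np.max(x))
    return shifted / np.sum(shifted)

logits = np.array([1000.0, 1001.0, 1002.0])

# exp(1000) overflows to inf, and inf / inf evaluates to nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(logits)

stable = softmax_stable(logits)
small = softmax_stable(np.array([0.0, 1.0, 2.0]))

print("naive: ", naive)
print("stable:", np.round(stable, 4))
```

The stable version gives the same answer for `[1000, 1001, 1002]` as for `[0, 1, 2]`, because softmax depends only on the differences between logits.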

### ✅ Ready for Next Steps
Your activation functions are now ready to power:
- **Dense layers**: Linear transformations with nonlinear activations
- **Convolutional layers**: Spatial feature extraction with ReLU
- **Network architectures**: Complete neural networks with proper activations
- **Training**: Gradient computation through activation functions

### 🔗 Connection to Real ML Systems
Your implementations mirror production systems:
- **PyTorch**: `torch.nn.ReLU()`, `torch.nn.Sigmoid()`, `torch.nn.Tanh()`, `torch.nn.Softmax()`
- **TensorFlow**: `tf.nn.relu()`, `tf.nn.sigmoid()`, `tf.nn.tanh()`, `tf.nn.softmax()`
- **Industry applications**: Most modern deep learning models use one or more of these functions, or close variants of them

### 🎯 The Power of Nonlinearity
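You've unlocked the key to deep learning:
- **Before**: Linear models, limited to linear relationships no matter how many layers are stacked
- **After**: Nonlinear models that, with enough hidden units, can approximate a very wide class of functions (the universal approximation theorem)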
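
The before/after contrast can be checked directly. The sketch below uses small hand-picked NumPy matrices (illustrative values, not taken from this module) to show that two stacked linear layers collapse into a single linear layer, while putting a ReLU between them produces a different result.

```python
import numpy as np

x = np.array([[1.0, -1.0],
              [2.0,  0.5]])      # 2 samples, 2 features
W1 = np.array([[1.0, -1.0],
               [0.5,  2.0]])     # first "layer" weights
W2 = np.array([[1.0],
               [1.0]])           # second "layer" weights

# Without an activation, two layers equal one layer with weights W1 @ W2.
two_linear = (x @ W1) @ W2
one_linear = x @ (W1 @ W2)
print(two_linear.ravel(), one_linear.ravel())   # both are [-2.5, 1.25]

# With ReLU in between, the negative hidden values are zeroed first.
with_relu = np.maximum(0.0, x @ W1) @ W2
print(with_relu.ravel())                        # [0.5, 2.25]
```

Depth adds expressive power only because of the activation between layers. Without it, any stack of linear layers is equivalent to a single matrix multiplication.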

**Next Module**: Layers - Building blocks that combine your tensors and activations into powerful transformations!

### 🧪 Integration Test: Tensor → Activations Workflow

This comprehensive test validates that your tensor and activation implementations work together seamlessly, simulating a realistic neural network forward pass.

In [None]:
def test_tensor_activations_integration():
    """
    Integration test validating end-to-end tensor + activations workflow.
    
    Simulates a realistic neural network scenario:
    1. Create input data (Tensor)
    2. Apply linear transformation (simulated weight multiplication)
    3. Apply each activation function
    4. Verify mathematical properties and integration
    """
    print("🔬 Integration Test: Tensor ↔ Activations Integration...")
    
    # Simulate realistic neural network data
    # Batch of 3 samples, each with 4 features (mini-batch processing)
    raw_input = Tensor([
        [-2.5, -1.0,  0.0,  1.5],  # Sample 1: mixed positive/negative
        [ 3.2,  0.5, -0.8,  2.1],  # Sample 2: mostly positive
        [-1.8,  2.3, -3.1,  0.7]   # Sample 3: mixed with extreme values
    ])
    
    print(f"📊 Input shape: {raw_input.shape}")
    print(f"📊 Input data:\n{raw_input.data}")
    
    # Simulate weight matrix (4 input features → 3 hidden units)
    weights = Tensor([
        [ 0.5, -0.3,  0.8],  # Feature 1 weights to 3 hidden units
        [-0.2,  0.7,  0.1],  # Feature 2 weights  
        [ 0.9, -0.5,  0.4],  # Feature 3 weights
        [ 0.3,  0.8, -0.6]   # Feature 4 weights
    ])
    
    # Simulate matrix multiplication (input @ weights)
    # In real neural networks, this would be: output = input @ weights + bias
    # Shapes: (3, 4) @ (4, 3) -> (3, 3), so no transpose is needed
    hidden_pre_activation = Tensor(raw_input.data @ weights.data)
    
    print(f"📊 Pre-activation shape: {hidden_pre_activation.shape}")
    print(f"📊 Pre-activation values:\n{hidden_pre_activation.data}")
    
    # Test each activation function on the realistic data
    relu = ReLU()
    sigmoid = Sigmoid() 
    tanh = Tanh()
    softmax = Softmax()
    
    # Apply all activations
    relu_output = relu(hidden_pre_activation)
    sigmoid_output = sigmoid(hidden_pre_activation)  
    tanh_output = tanh(hidden_pre_activation)
    softmax_output = softmax(hidden_pre_activation)
    
    print("\n🔍 Testing activation outputs...")
    
    # Validate shapes are preserved
    assert relu_output.shape == hidden_pre_activation.shape, "ReLU should preserve shape"
    assert sigmoid_output.shape == hidden_pre_activation.shape, "Sigmoid should preserve shape"
    assert tanh_output.shape == hidden_pre_activation.shape, "Tanh should preserve shape"
    assert softmax_output.shape == hidden_pre_activation.shape, "Softmax should preserve shape"
    print("✅ Shape preservation: All activations maintain input dimensions")
    
    # Validate mathematical properties
    # ReLU properties
    assert np.all(relu_output.data >= 0), "ReLU outputs must be non-negative"
    sparsity = np.count_nonzero(relu_output.data == 0) / relu_output.size
    print(f"✅ ReLU sparsity: {sparsity*100:.1f}% zeros (good for efficiency)")
    
    # Sigmoid properties  
    assert np.all(sigmoid_output.data > 0), "Sigmoid outputs must be positive"
    assert np.all(sigmoid_output.data < 1), "Sigmoid outputs must be less than 1"
    sigmoid_range = [np.min(sigmoid_output.data), np.max(sigmoid_output.data)]
    print(f"✅ Sigmoid range: [{sigmoid_range[0]:.3f}, {sigmoid_range[1]:.3f}] ∈ (0,1)")
    
    # Tanh properties
    assert np.all(tanh_output.data > -1), "Tanh outputs must be greater than -1"  
    assert np.all(tanh_output.data < 1), "Tanh outputs must be less than 1"
    tanh_range = [np.min(tanh_output.data), np.max(tanh_output.data)]
    print(f"✅ Tanh range: [{tanh_range[0]:.3f}, {tanh_range[1]:.3f}] ∈ (-1,1)")
    
    # Softmax properties (most important for multi-class classification)
    for i in range(softmax_output.shape[0]):  # Check each sample
        sample_sum = np.sum(softmax_output.data[i])
        assert abs(sample_sum - 1.0) < 1e-6, f"Softmax row {i} should sum to 1, got {sample_sum}"
    print("✅ Softmax probability: Each row sums to 1.0 (valid probability distribution)")
    
    # Test activation chaining (realistic neural network scenario)
    print("\n🔗 Testing activation chaining (hidden → output layers)...")
    
    # Hidden layer: ReLU activation (common choice)
    hidden_output = relu(hidden_pre_activation)
    
    # Simulate output layer weights (3 hidden → 2 output classes)
    output_weights = Tensor([
        [0.6, -0.4],  # Hidden unit 1 → [class 0, class 1]
        [-0.3, 0.8],  # Hidden unit 2 → [class 0, class 1]  
        [0.5, 0.2]    # Hidden unit 3 → [class 0, class 1]
    ])
    
    # Output layer pre-activation
    # Shapes: (3, 3) @ (3, 2) -> (3, 2); multiply the underlying arrays, then wrap
    output_pre_activation = Tensor(hidden_output.data @ output_weights.data)
    
    # Output layer: Softmax for classification
    final_output = softmax(output_pre_activation)
    
    print(f"📊 Final classification output shape: {final_output.shape}")
    print(f"📊 Final probabilities:\n{final_output.data}")
    
    # Validate final output properties
    assert final_output.shape == (3, 2), "Should have 3 samples × 2 classes"
    for i in range(3):
        sample_probs = final_output.data[i]
        assert abs(np.sum(sample_probs) - 1.0) < 1e-6, f"Sample {i} probabilities should sum to 1"
        assert np.all(sample_probs > 0), f"Sample {i} should have positive probabilities"
        predicted_class = np.argmax(sample_probs)
        confidence = np.max(sample_probs)
        print(f"✅ Sample {i+1}: Class {predicted_class}, Confidence {confidence:.3f}")
    
    # Test tensor arithmetic integration
    print("\n🔢 Testing tensor arithmetic with activations...")
    
    # Element-wise operations should work seamlessly
    combined_output = relu_output + sigmoid_output * 0.5
    assert isinstance(combined_output, Tensor), "Arithmetic should return Tensor"
    assert combined_output.shape == relu_output.shape, "Arithmetic should preserve shape"
    print("✅ Tensor arithmetic integration: Addition and multiplication work")
    
    # Test broadcasting with activations
    bias = Tensor([0.1, -0.05, 0.2])  # Shape: (3,)
    biased_output = sigmoid_output + bias  # Should broadcast
    assert biased_output.shape == sigmoid_output.shape, "Broadcasting should work with activations"
    print("✅ Broadcasting integration: Bias addition works")
    
    print("\n🎯 Integration Test Results:")
    print("✅ Tensor creation and manipulation")
    print("✅ Matrix operations (simulated linear layers)")  
    print("✅ All activation functions working correctly")
    print("✅ Mathematical properties validated")
    print("✅ Realistic neural network forward pass")
    print("✅ Activation chaining (hidden → output)")
    print("✅ Tensor arithmetic with activations")
    print("✅ Broadcasting compatibility")
    
    print("\n🚀 Ready for real neural networks! Your tensor and activation implementations")
    print("   can handle the computational demands of modern deep learning.")

# Run the comprehensive integration test
test_tensor_activations_integration()

Your activation functions are the key to neural network intelligence. Now let's build the layers that use them!