# 🧠 Activation Functions in Neural Networks

This notebook demonstrates the implementation and behavior of common activation functions used in deep learning. Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns.

**What you'll learn:**
- How each activation function transforms input values
- The mathematical formula behind each function
- When to use each activation function

## Setup

We'll use NumPy for efficient numerical computations. NumPy's vectorized operations allow us to apply activation functions to entire arrays at once.

In [1]:
import numpy as np

---
## 1. Sigmoid Function

**Formula:** σ(x) = 1 / (1 + e⁻ˣ)

**Output Range:** 0 to 1

The sigmoid function squashes any input value into a range between 0 and 1, making it ideal for:
- **Binary classification** output layers (predicting probabilities)
- Interpreting outputs as probabilities

**Characteristics:**
- Smooth, S-shaped curve
- Output of 0.5 when input is 0
- Saturates (flattens) for large positive or large negative inputs, which can cause vanishing gradients
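
The saturation behavior can be made concrete by evaluating the derivative, σ'(x) = σ(x)(1 - σ(x)). The following is a minimal sketch; the helper name `sigmoid_grad` and the sample inputs are illustrative assumptions, not part of the notebook's own cells.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s), where s = sigmoid(x)
    s = sigmoid(x)
    return s * (1 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
grads = sigmoid_grad(xs)
print(grads)
```

The gradient peaks at 0.25 for an input of 0 and falls below 1e-4 at ±10, which is why deep stacks of sigmoid layers tend to pass back very small gradients.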

**⚠️ Numerical Stability Note:** The simple implementation below works well for typical input ranges. For large negative inputs (roughly x < -709 in float64), however, `np.exp(-x)` overflows to `inf` and NumPy emits a `RuntimeWarning`; for large positive inputs it underflows harmlessly to 0. Production code often uses numerically stable variants such as conditional formulations or input clipping. See the stable version in the cell below the basic implementation.
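
To see the overflow concretely, NumPy's floating-point error handling can be switched from warning to raising. This is a small sketch assuming float64 inputs; the variable names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([-1000.0])

# Turn the usual overflow RuntimeWarning into an exception so it is easy to detect.
overflowed = False
with np.errstate(over="raise"):
    try:
        sigmoid(x)
    except FloatingPointError:
        overflowed = True

print("Overflow in np.exp(-x) for x = -1000:", overflowed)

# With the warning silenced, the result is still 0.0,
# because 1 / (1 + inf) evaluates to 0.
with np.errstate(over="ignore"):
    print(sigmoid(x))
```

The final value is correct here, so the practical cost of the basic version is the warning rather than a wrong answer.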

In [2]:
# Basic sigmoid implementation (works for typical input ranges)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

values = np.array([-2, -1, 0, 1, 2])
sigmoid_values = sigmoid(values)
print("Sigmoid Function Results:")
print(sigmoid_values)

Sigmoid Function Results:
[0.11920292 0.26894142 0.5        0.73105858 0.88079708]


In [None]:
# Numerically stable sigmoid
# np.where evaluates both branches for every element, so each branch must be
# safe on its own. Exponentiating -|x| guarantees np.exp never overflows.
def sigmoid_stable(x):
    z = np.exp(-np.abs(x))    # never exceeds 1
    return np.where(
        x >= 0,
        1 / (1 + z),          # x >= 0: z = exp(-x), standard formula
        z / (1 + z)           # x < 0:  z = exp(x), equivalent form
    )

# Test with extreme values
extreme_values = np.array([-1000, -100, 0, 100, 1000])
print("Stable Sigmoid with extreme inputs:")
print(sigmoid_stable(extreme_values))

**Interpreting the results:**
- Input `-2` → Output `0.12` (close to 0, low probability)
- Input `0` → Output `0.5` (exactly in the middle)
- Input `2` → Output `0.88` (close to 1, high probability)

Notice how negative inputs map to values below 0.5, and positive inputs map to values above 0.5.

---
## 2. Softmax Function

**Formula:** softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ)

**Output Range:** 0 to 1 (all outputs sum to 1)

Softmax converts a vector of raw scores (logits) into a probability distribution. It's the go-to choice for:
- **Multi-class classification** output layers
- When you need outputs to represent mutually exclusive class probabilities

**Characteristics:**
- All outputs are positive and sum to exactly 1
- Larger inputs get exponentially larger probabilities
- We subtract `max(x)` for numerical stability to prevent overflow

**Implementation Note:** The version below handles both 1D vectors and batched 2D+ inputs by using `axis=-1` and `keepdims=True` for proper broadcasting.
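
Subtracting the maximum is safe because softmax is unchanged when the same constant is added to every logit: the extra factor cancels between numerator and denominator. The sketch below checks this numerically and shows what goes wrong without the shift; `softmax_naive` is a hypothetical name used only for comparison.

```python
import numpy as np

def softmax_naive(x):
    # No max subtraction: np.exp overflows for large logits
    e_x = np.exp(x)
    return e_x / e_x.sum(axis=-1, keepdims=True)

def softmax(x):
    x_max = np.max(x, axis=-1, keepdims=True)
    e_x = np.exp(x - x_max)
    return e_x / e_x.sum(axis=-1, keepdims=True)

small = np.array([2.0, 1.0, 0.1])
large = small + 1000.0  # same logits shifted by a constant

# Shift invariance: both inputs give the same distribution
print(np.allclose(softmax(small), softmax(large)))

# The naive version produces nan for the shifted logits (inf / inf)
with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)
print(np.isnan(naive_large).all())
```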

In [3]:
def softmax(x):
    # Subtract max for numerical stability (prevents overflow)
    # axis=-1 and keepdims=True ensure this works for both 1D and batched inputs
    x_max = np.max(x, axis=-1, keepdims=True)
    e_x = np.exp(x - x_max)
    return e_x / e_x.sum(axis=-1, keepdims=True)

# Single vector example
values = np.array([2.0, 1.0, 0.1])
softmax_values = softmax(values)
print("Softmax Function Results (1D):")
print(softmax_values)
print(f"Sum: {softmax_values.sum():.4f}")


Softmax Function Results (1D):
[0.65900114 0.24243297 0.09856589]
Sum: 1.0000


In [None]:
# Batched example: multiple samples at once
batch_values = np.array([
    [2.0, 1.0, 0.1],   # Sample 1
    [1.0, 2.0, 3.0],   # Sample 2
    [0.5, 0.5, 0.5]    # Sample 3 (equal logits)
])
batch_softmax = softmax(batch_values)
print("\nSoftmax Function Results (Batched 2D):")
print(batch_softmax)
print(f"\nRow sums (should all be 1.0): {batch_softmax.sum(axis=-1)}")

**Interpreting the results:**
- Input `[2.0, 1.0, 0.1]` represents raw scores for 3 classes
- Class 0 (score 2.0) → 65.9% probability (highest score = highest probability)
- Class 1 (score 1.0) → 24.2% probability
- Class 2 (score 0.1) → 9.9% probability
- **Sum: 0.659 + 0.242 + 0.099 = 1.0** ✓

This suits tasks like image classification, where each image belongs to exactly one category.

---
## 3. Tanh (Hyperbolic Tangent) Function

**Formula:** tanh(x) = (eˣ - e⁻ˣ) / (eˣ + e⁻ˣ)

**Output Range:** -1 to 1

Tanh is similar to sigmoid but outputs values centered around zero. This makes it useful for:
- **Hidden layers** where zero-centered outputs improve training
- RNNs and LSTMs, where hidden-state updates need to take both positive and negative values

**Characteristics:**
- Zero-centered (output is 0 when input is 0)
- Stronger gradients than sigmoid: the maximum slope is 1 at x = 0, versus 0.25 for sigmoid
- Still suffers from vanishing gradients at extreme values

In [4]:
def tanh(x):
    return np.tanh(x)

values = np.array([-2, -1, 0, 1, 2])
tanh_values = tanh(values)
print("\nTanh Function Results:")
print(tanh_values)


Tanh Function Results:
[-0.96402758 -0.76159416  0.          0.76159416  0.96402758]


**Interpreting the results:**
- Input `-2` → Output `-0.96` (close to -1)
- Input `0` → Output `0` (exactly zero-centered)
- Input `2` → Output `0.96` (close to 1)

**Comparison with Sigmoid:**
- Tanh outputs are symmetric around 0 (-1 to 1)
- Sigmoid outputs are always positive (0 to 1)
- Tanh is essentially a scaled and shifted sigmoid: tanh(x) = 2 × sigmoid(2x) - 1
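
The identity relating tanh and sigmoid can be checked numerically. This is a quick sketch over an arbitrary grid of inputs; the grid and variable names are illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

xs = np.linspace(-5, 5, 11)

lhs = np.tanh(xs)                # tanh computed directly
rhs = 2 * sigmoid(2 * xs) - 1    # scaled and shifted sigmoid

# True when the two agree up to floating-point tolerance
print(np.allclose(lhs, rhs))
```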

---
## 4. ReLU (Rectified Linear Unit) Function

**Formula:** ReLU(x) = max(0, x)

**Output Range:** 0 to ∞

ReLU is the most widely used activation function in modern deep learning. It's the **default choice for hidden layers** because:
- Computationally efficient (simple comparison operation)
- Mitigates the vanishing gradient problem (gradient is exactly 1 for positive inputs)
- Promotes sparsity (many neurons output exactly 0)

**Characteristics:**
- Outputs 0 for all negative inputs
- Outputs the input unchanged for positive values
- Can suffer from "dying ReLU": a neuron whose input is negative for every training example outputs 0, receives zero gradient, and stops learning
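
The gradient behavior behind these characteristics can be sketched as follows. The helper `relu_grad` is a hypothetical name, and treating the gradient at exactly 0 as 0 is a convention assumed here.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    # Assumed convention: the gradient at x == 0 is taken to be 0
    return (x > 0).astype(float)

pre_activations = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(pre_activations))
print(relu_grad(pre_activations))

# A "dead" unit: every pre-activation in the batch is negative,
# so both the output and the gradient are zero for all samples.
dead_unit = np.array([-2.0, -1.5, -0.1])
print(relu_grad(dead_unit).sum())
```

Because the gradient of the dead unit is zero for every sample, gradient descent has no signal with which to move its weights back into the active region.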

In [5]:
def relu(x):
    return np.maximum(0, x)

values = np.array([-2, -1, 0, 1, 2])
relu_values = relu(values)
print("\nReLU Function Results:")
print(relu_values)


ReLU Function Results:
[0 0 0 1 2]


**Interpreting the results:**
- Input `-2` → Output `0` (negative values become 0)
- Input `-1` → Output `0` (negative values become 0)
- Input `0` → Output `0` (boundary case)
- Input `1` → Output `1` (positive values pass through unchanged)
- Input `2` → Output `2` (positive values pass through unchanged)

**Why ReLU is so popular:**
1. **Speed**: Just a comparison, no exponentials to compute
2. **Gradient flow**: Gradient is 1 for positive inputs, which helps mitigate vanishing gradients
3. **Sparsity**: Many neurons output exactly 0, so each input activates only a subset of the network

---
## Summary: Choosing the Right Activation Function

| Layer Type | Recommended Activation | Reason |
|------------|----------------------|--------|
| Hidden layers | **ReLU** | Fast, reduces vanishing gradients |
| Binary classification output | **Sigmoid** | Outputs probability (0-1) |
| Multi-class classification output | **Softmax** | Outputs probability distribution |
| RNN/LSTM hidden layers | **Tanh** | Zero-centered, works well with sequences |
| Regression output | **Linear (none)** | Allows any output value |

**Pro tip:** When in doubt, start with ReLU for hidden layers. Only switch to alternatives like Leaky ReLU or ELU if you encounter training issues.
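
For reference, the two alternatives mentioned in the tip can be sketched in the same style as the functions above. The default `alpha` values (0.01 for Leaky ReLU, 1.0 for ELU) are commonly used settings assumed here, not values taken from this notebook.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Negative inputs are scaled by alpha instead of being zeroed,
    # so the gradient never becomes exactly 0
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Negative inputs follow alpha * (exp(x) - 1), which approaches -alpha.
    # np.minimum keeps the exponential from overflowing on the positive
    # branch, which np.where still evaluates.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0)))

values = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print("Leaky ReLU:", leaky_relu(values))
print("ELU:", elu(values))
```

Both keep the identity behavior of ReLU for positive inputs while giving negative inputs a small nonzero output and gradient.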