# Module 01: Perceptrons and Activation Functions

**Difficulty**: ⭐⭐ (Intermediate)
**Estimated Time**: 45-60 minutes
**Prerequisites**: 
- Module 00: Introduction to Neural Networks
- Linear algebra (vectors, matrices, dot products)
- Basic calculus (derivatives)
- Python and NumPy

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Implement** a single perceptron from scratch using NumPy
2. **Explain** linear separability and the limitation it places on a single perceptron
3. **Compare** different activation functions (Sigmoid, Tanh, ReLU, LeakyReLU, GELU, Swish)
4. **Calculate** derivatives of activation functions for backpropagation
5. **Visualize** activation function behaviors and their properties
6. **Choose** appropriate activation functions for different scenarios

## 1. Setup and Imports

In [None]:
# Standard libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs

# Set random seeds for reproducibility
np.random.seed(42)

# Configure plotting
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

print("Setup complete!")
print(f"NumPy version: {np.__version__}")

## 2. The Perceptron Model

### 2.1 What is a Perceptron?

A **perceptron** is the simplest form of a neural network, consisting of a single neuron. Invented by Frank Rosenblatt in 1958, it was one of the first algorithms capable of learning from data.

### Mathematical Model

Given input vector $\mathbf{x} = [x_1, x_2, ..., x_n]$:

1. **Weighted Sum (Pre-activation)**:
   $$z = \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T \mathbf{x} + b$$

2. **Activation Function**:
   $$y = f(z)$$

Where:
- $\mathbf{w}$ = weight vector (learnable)
- $b$ = bias term (learnable)
- $f$ = activation function
- $y$ = output/prediction

### Original Perceptron (Step Function)

The original perceptron used a step activation:

$$f(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{if } z < 0 \end{cases}$$

This creates a **linear decision boundary** that divides the input space into two regions.
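
The two steps above can be traced by hand for a single input. The sketch below uses hypothetical values for `w`, `b`, and `x` (they are not taken from any dataset in this notebook) to show the weighted sum followed by the step activation.

```python
import numpy as np

def step(z):
    """Step activation: 1 where z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

# Hypothetical values chosen only for illustration
w = np.array([0.5, -1.0])   # weight vector
b = 0.25                    # bias
x = np.array([2.0, 1.0])    # one input sample

# Pre-activation: 0.5*2.0 + (-1.0)*1.0 + 0.25 = 0.25
z = np.dot(w, x) + b
y = step(z)

print(f"z = {z:.2f}, y = {y}")  # z = 0.25, y = 1
```

Because `z` is non-negative, the point falls on the "1" side of the line $\mathbf{w}^T\mathbf{x} + b = 0$.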

In [None]:
class Perceptron:
    """
    Simple perceptron implementation with step activation function.
    
    This is the original perceptron algorithm that can learn linearly
    separable patterns through iterative weight updates.
    """
    
    def __init__(self, n_features, learning_rate=0.01, n_iterations=100):
        """
        Initialize perceptron with random weights.
        
        Parameters:
        -----------
        n_features : int
            Number of input features
        learning_rate : float
            Step size for weight updates
        n_iterations : int
            Number of training epochs
        """
        self.lr = learning_rate
        self.n_iter = n_iterations
        
        # Initialize weights to small random values and the bias to zero
        self.weights = np.random.randn(n_features) * 0.01
        self.bias = 0.0
        
        # Track training history
        self.errors = []
    
    def activation(self, z):
        """
        Step activation function.
        Returns 1 if z >= 0, else 0.
        """
        return np.where(z >= 0, 1, 0)
    
    def predict(self, X):
        """
        Make predictions on input data.
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
            Input data
            
        Returns:
        --------
        predictions : array, shape (n_samples,)
            Binary predictions (0 or 1)
        """
        # Compute weighted sum
        z = np.dot(X, self.weights) + self.bias
        # Apply activation
        return self.activation(z)
    
    def fit(self, X, y):
        """
        Train the perceptron using the perceptron learning rule.
        
        Update rule: w = w + lr * (y_true - y_pred) * x
                     b = b + lr * (y_true - y_pred)
        
        Parameters:
        -----------
        X : array-like, shape (n_samples, n_features)
            Training data
        y : array-like, shape (n_samples,)
            Target labels (0 or 1)
        """
        for iteration in range(self.n_iter):
            errors = 0
            
            # Iterate through each training sample
            for xi, target in zip(X, y):
                # Make prediction
                prediction = self.predict(xi.reshape(1, -1))[0]
                
                # Calculate error
                error = target - prediction
                
                # Update weights and bias if there's an error
                if error != 0:
                    self.weights += self.lr * error * xi
                    self.bias += self.lr * error
                    errors += 1
            
            # Record number of errors in this epoch
            self.errors.append(errors)
            
            # Early stopping if no errors
            if errors == 0:
                print(f"Converged after {iteration + 1} iterations")
                break
        
        return self

# Test the perceptron on a simple dataset
print("Creating a linearly separable dataset...")
X, y = make_blobs(n_samples=100, centers=2, n_features=2, 
                  center_box=(-5, 5), random_state=42)

# Train perceptron
perceptron = Perceptron(n_features=2, learning_rate=0.1, n_iterations=50)
perceptron.fit(X, y)

# Calculate accuracy
predictions = perceptron.predict(X)
accuracy = np.mean(predictions == y)
print(f"\nTraining Accuracy: {accuracy * 100:.2f}%")
print(f"Final weights: {perceptron.weights}")
print(f"Final bias: {perceptron.bias:.4f}")
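
To make the update rule used in `fit` concrete, here is a single update traced by hand. The starting weights, learning rate, and sample are hypothetical values picked so the arithmetic is easy to follow; this is a sketch of one step, not a replacement for the class above.

```python
import numpy as np

# Hypothetical starting point and sample, chosen for illustration
w = np.array([0.0, 0.0])
b = 0.0
lr = 0.1
x = np.array([1.0, 2.0])
target = 0

z = np.dot(w, x) + b              # 0.0
prediction = 1 if z >= 0 else 0   # step activation gives 1
error = target - prediction       # 0 - 1 = -1

# Perceptron learning rule: shift the boundary so this sample moves
# toward the correct side
w = w + lr * error * x            # [-0.1, -0.2]
b = b + lr * error                # -0.1

z_after = np.dot(w, x) + b        # -0.1 - 0.4 - 0.1 = -0.6
print(w, b, z_after)
```

After the update the pre-activation for this sample is negative, so the same input would now be classified as 0.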

In [None]:
def plot_perceptron_decision_boundary(perceptron, X, y):
    """
    Visualize the perceptron's decision boundary and training data.
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: Decision boundary
    # Create mesh grid
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    
    # Predict for each point in mesh
    Z = perceptron.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    # Plot decision boundary
    ax1.contourf(xx, yy, Z, alpha=0.3, levels=1, cmap='RdYlBu')
    ax1.scatter(X[:, 0], X[:, 1], c=y, edgecolors='black', 
               cmap='RdYlBu', s=100, linewidth=1.5)
    
    # Draw the decision boundary line
    # Decision boundary: w1*x1 + w2*x2 + b = 0
    # Solve for x2: x2 = -(w1*x1 + b) / w2
    w1, w2 = perceptron.weights
    if w2 != 0:
        x_boundary = np.array([x_min, x_max])
        y_boundary = -(w1 * x_boundary + perceptron.bias) / w2
        ax1.plot(x_boundary, y_boundary, 'k--', linewidth=2, 
                label='Decision Boundary')
    
    ax1.set_xlabel('Feature 1', fontsize=12, weight='bold')
    ax1.set_ylabel('Feature 2', fontsize=12, weight='bold')
    ax1.set_title('Perceptron Decision Boundary', fontsize=13, weight='bold')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Plot 2: Training errors over time
    ax2.plot(range(1, len(perceptron.errors) + 1), perceptron.errors, 
            marker='o', linewidth=2, markersize=6)
    ax2.set_xlabel('Epoch', fontsize=12, weight='bold')
    ax2.set_ylabel('Number of Errors', fontsize=12, weight='bold')
    ax2.set_title('Training Progress', fontsize=13, weight='bold')
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

# Visualize results
plot_perceptron_decision_boundary(perceptron, X, y)

## 3. Linear Separability and Limitations

### What is Linear Separability?

A dataset is **linearly separable** if a straight line (in 2D), plane (in 3D), or hyperplane (in higher dimensions) can perfectly separate the two classes.
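
As a small illustration of the definition, the logical AND function is linearly separable. The sketch below checks one hand-picked (not learned) line, $x_1 + x_2 - 1.5 = 0$, against all four inputs.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])

# Hand-picked (not learned) separating line: x1 + x2 - 1.5 = 0
w = np.array([1.0, 1.0])
b = -1.5

predictions = np.where(X @ w + b >= 0, 1, 0)
print(predictions)                          # [0 0 0 1]
print(np.array_equal(predictions, y_and))   # True
```

Any line that leaves (1, 1) on one side and the other three points on the other would work equally well; separability only requires that at least one such line exists.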

### The XOR Problem

The famous limitation of single-layer perceptrons is their inability to solve the **XOR (exclusive OR)** problem:

| X₁ | X₂ | XOR Output |
|----|----|------------|
| 0  | 0  | 0          |
| 0  | 1  | 1          |
| 1  | 0  | 1          |
| 1  | 1  | 0          |

XOR is **not linearly separable**: no single straight line separates the two classes. Minsky and Papert (1969) highlighted this limitation, which contributed to the decline in neural network research often called the first "AI winter."

**Solution**: Multi-layer networks with non-linear activations can solve XOR and other non-linearly separable problems!
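
One way to see this is to build a tiny two-layer network by hand. The sketch below assumes hand-picked weights (nothing is trained): one hidden unit computes OR, the other computes AND, and the output fires when OR is on but AND is off.

```python
import numpy as np

def step(z):
    """Step activation: 1 where z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# Hidden layer (hand-picked weights): column 0 computes OR, column 1 computes AND
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])

# Output unit: fires when OR is on and AND is off
w_out = np.array([1.0, -1.0])
b_out = -0.5

hidden = step(X_xor @ W_hidden + b_hidden)
output = step(hidden @ w_out + b_out)
print(output)  # [0 1 1 0]
```

The hidden layer maps the four inputs into a space where the classes become linearly separable, which is what the single output unit then exploits.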

In [None]:
# Demonstrate the XOR problem
print("Attempting to learn XOR with a single perceptron...\n")

# XOR dataset
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# Try to train perceptron on XOR
perceptron_xor = Perceptron(n_features=2, learning_rate=0.1, n_iterations=100)
perceptron_xor.fit(X_xor, y_xor)

# Evaluate
predictions = perceptron_xor.predict(X_xor)
accuracy = np.mean(predictions == y_xor)

print(f"\nXOR Accuracy: {accuracy * 100:.2f}%")
print("\nPredictions vs Actual:")
for i in range(len(X_xor)):
    print(f"Input: {X_xor[i]} | Predicted: {predictions[i]} | Actual: {y_xor[i]} | "
          f"{'✓' if predictions[i] == y_xor[i] else '✗'}")

# Visualize why it fails
fig, ax = plt.subplots(figsize=(8, 6))

# Plot XOR points
colors = ['red' if label == 0 else 'blue' for label in y_xor]
ax.scatter(X_xor[:, 0], X_xor[:, 1], c=colors, s=300, 
          edgecolors='black', linewidth=2, alpha=0.7)

# Add labels
# Use x1/x2 as loop names so the dataset variables X and y are not overwritten
for x1, x2, label in zip(X_xor[:, 0], X_xor[:, 1], y_xor):
    ax.annotate(f'({int(x1)},{int(x2)})\nClass {label}', 
               xy=(x1, x2), xytext=(10, 10), textcoords='offset points',
               fontsize=11, weight='bold')

ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel('X₁', fontsize=13, weight='bold')
ax.set_ylabel('X₂', fontsize=13, weight='bold')
ax.set_title('XOR Problem: Not Linearly Separable', fontsize=14, weight='bold')
ax.grid(True, alpha=0.3)
ax.text(0.5, -0.3, 'No single line can separate red from blue!', 
       ha='center', fontsize=12, style='italic', color='darkred')

plt.tight_layout()
plt.show()

## 4. Activation Functions

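Modern neural networks use activation functions that are **differentiable** (or, like ReLU, differentiable almost everywhere) instead of the step function. The step function's gradient is zero everywhere except at the jump, so it gives gradient-based optimization (backpropagation) nothing to follow.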

### Why Activation Functions?

1. **Introduce Non-linearity**: Without activations, a stack of layers collapses into a single linear (affine) transformation, no matter how deep it is
2. **Enable Complex Patterns**: Non-linear activations allow learning of complex decision boundaries
3. **Gradient Flow**: Differentiable activations enable backpropagation
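
The first point can be checked numerically. The sketch below uses randomly generated weights with hypothetical layer sizes (3 inputs, 4 hidden units, 2 outputs) and compares two stacked layers without an activation against a single equivalent layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with hypothetical shapes: 3 inputs -> 4 hidden -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Without an activation, the two layers compose into one affine map
two_layers = W2 @ (W1 @ x + b1) + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layers, one_layer))  # True

# With ReLU in between, the single-layer shortcut no longer applies in general
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print(np.allclose(with_relu, one_layer))
```

The first comparison holds for any weights, since $W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$. The second depends on the sampled values, so its printed result is left for you to inspect.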

### Common Activation Functions

Let's implement and visualize the most important activation functions.

In [None]:
class ActivationFunctions:
    """
    Collection of common activation functions and their derivatives.
    Each function is implemented as both forward and backward (derivative).
    """
    
    @staticmethod
    def sigmoid(z):
        """
        Sigmoid (Logistic) activation: σ(z) = 1 / (1 + e^(-z))
        
        Properties:
        - Range: (0, 1)
        - S-shaped curve
        - Often used in output layer for binary classification
        - Problem: Vanishing gradients for large |z|
        """
        return 1 / (1 + np.exp(-z))
    
    @staticmethod
    def sigmoid_derivative(z):
        """
        Derivative: σ'(z) = σ(z) * (1 - σ(z))
        """
        s = ActivationFunctions.sigmoid(z)
        return s * (1 - s)
    
    @staticmethod
    def tanh(z):
        """
        Hyperbolic Tangent: tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
        
        Properties:
        - Range: (-1, 1)
        - Zero-centered (better than sigmoid)
        - Still suffers from vanishing gradients
        """
        return np.tanh(z)
    
    @staticmethod
    def tanh_derivative(z):
        """
        Derivative: tanh'(z) = 1 - tanh²(z)
        """
        return 1 - np.tanh(z) ** 2
    
    @staticmethod
    def relu(z):
        """
        Rectified Linear Unit: ReLU(z) = max(0, z)
        
        Properties:
        - Range: [0, ∞)
        - Most popular in hidden layers
        - Computationally efficient
        - Helps with vanishing gradient problem
        - Problem: "Dying ReLU" (neurons can stop learning)
        """
        return np.maximum(0, z)
    
    @staticmethod
    def relu_derivative(z):
        """
        Derivative: ReLU'(z) = 1 if z > 0, else 0
        (undefined at z = 0; by convention we use 0 there)
        """
        return np.where(z > 0, 1, 0)
    
    @staticmethod
    def leaky_relu(z, alpha=0.01):
        """
        Leaky ReLU: max(αz, z)
        
        Properties:
        - Allows small gradient when z < 0
        - Mitigates the dying ReLU problem
        - α typically set to 0.01
        """
        return np.where(z > 0, z, alpha * z)
    
    @staticmethod
    def leaky_relu_derivative(z, alpha=0.01):
        """
        Derivative: 1 if z > 0, else α
        """
        return np.where(z > 0, 1, alpha)
    
    @staticmethod
    def gelu(z):
        """
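        Gaussian Error Linear Unit: GELU(z) = z * Φ(z), where Φ is the
        standard normal CDF.
        Tanh approximation used here:
        GELU(z) ≈ 0.5z(1 + tanh[√(2/π)(z + 0.044715z³)])
        
        Properties:
        - Smooth, ReLU-like curve that dips slightly below zero for z < 0
        - Used in transformers (BERT, GPT)
        - Reported to match or outperform ReLU on several benchmarks
          (Hendrycks & Gimpel, 2016)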
        """
        return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * 
                                       (z + 0.044715 * z**3)))
    
    @staticmethod
    def swish(z, beta=1.0):
        """
        Swish: z * sigmoid(βz); with β = 1 it is also called SiLU
        
        Properties:
        - Smooth, non-monotonic
        - Self-gated activation
        - Used in EfficientNet and other modern architectures
        """
        return z * ActivationFunctions.sigmoid(beta * z)

# Create instance for easy access
act = ActivationFunctions()

print("Activation functions loaded successfully!")
print("\nAvailable functions:")
print("  - Sigmoid")
print("  - Tanh")
print("  - ReLU")
print("  - Leaky ReLU")
print("  - GELU")
print("  - Swish")

In [None]:
# Visualize all activation functions

z = np.linspace(-5, 5, 1000)

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

# Define activations to plot
activations = [
    ('Sigmoid', act.sigmoid, act.sigmoid_derivative),
    ('Tanh', act.tanh, act.tanh_derivative),
    ('ReLU', act.relu, act.relu_derivative),
    ('Leaky ReLU', act.leaky_relu, act.leaky_relu_derivative),
    ('GELU', act.gelu, None),
    ('Swish', act.swish, None)
]

for idx, (name, func, deriv_func) in enumerate(activations):
    ax = axes[idx]
    
    # Plot activation function
    y = func(z)
    ax.plot(z, y, linewidth=2.5, label=f'{name}', color='blue')
    
    # Plot derivative if available
    if deriv_func is not None:
        dy = deriv_func(z)
        ax.plot(z, dy, linewidth=2, linestyle='--', 
               label=f"{name}'", color='red', alpha=0.7)
    
    # Styling
    ax.axhline(y=0, color='black', linewidth=0.8, alpha=0.3)
    ax.axvline(x=0, color='black', linewidth=0.8, alpha=0.3)
    ax.grid(True, alpha=0.3)
    ax.set_xlabel('z (input)', fontsize=11, weight='bold')
    ax.set_ylabel('f(z) (output)', fontsize=11, weight='bold')
    ax.set_title(name, fontsize=13, weight='bold')
    ax.legend(loc='best', fontsize=10)
    ax.set_xlim(-5, 5)

plt.suptitle('Activation Functions and Their Derivatives', 
            fontsize=16, weight='bold', y=1.00)
plt.tight_layout()
plt.show()
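
The plotting cell passes `None` for the GELU and Swish derivatives. A possible sketch of both is given below, derived with the product and chain rules. The names `swish_derivative` and `gelu_derivative` are hypothetical helpers, not part of the `ActivationFunctions` class, and the GELU version differentiates the tanh approximation rather than the exact form.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def swish_derivative(z, beta=1.0):
    """d/dz [z * sigmoid(beta*z)] via the product rule."""
    s = sigmoid(beta * z)
    return s + beta * z * s * (1 - s)

def gelu_derivative(z):
    """Derivative of the tanh approximation of GELU (product and chain rule)."""
    c = np.sqrt(2 / np.pi)
    t = np.tanh(c * (z + 0.044715 * z**3))
    du_dz = c * (1 + 3 * 0.044715 * z**2)   # derivative of the tanh argument
    return 0.5 * (1 + t) + 0.5 * z * (1 - t**2) * du_dz

z = np.array([-2.0, 0.0, 2.0])
print(swish_derivative(z))
print(gelu_derivative(z))
```

Both derivatives equal 0.5 at $z = 0$ and approach 1 for large positive $z$, which matches the ReLU-like shape of the curves.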

## 5. Choosing Activation Functions

### Decision Guide

**Hidden Layers:**
- **ReLU** (default choice): Fast, works well in most cases
- **Leaky ReLU**: When you suspect dying ReLU problem
- **GELU/Swish**: For transformers or when you want smooth non-linearity
- **Tanh**: When you need zero-centered outputs

**Output Layer:**
- **Sigmoid**: Binary classification (probability output 0-1)
- **Softmax**: Multi-class classification (probability distribution)
- **Linear (no activation)**: Regression problems
- **Tanh**: When output should be in range (-1, 1)
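
Softmax appears in this guide but is not implemented elsewhere in the notebook. A minimal sketch follows, assuming a hypothetical `softmax` helper and made-up logits; subtracting the maximum before exponentiating is a common trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = np.asarray(logits, dtype=float)
    # Shifting by the max leaves the output unchanged but keeps exp() bounded
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
probs = softmax(logits)
print(probs, probs.sum())
```

The outputs are positive and sum to 1, so they can be read as a probability distribution over the classes, with the largest logit receiving the largest probability.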

### Comparison Table

| Activation | Range | Advantages | Disadvantages | Use Cases |
|-----------|-------|------------|---------------|----------|
| **Sigmoid** | (0, 1) | Smooth, interpretable as probability | Vanishing gradients, not zero-centered | Output layer (binary) |
| **Tanh** | (-1, 1) | Zero-centered | Vanishing gradients | Hidden layers (when zero-centered needed) |
| **ReLU** | [0, ∞) | Fast, does not saturate for z > 0 | Dying ReLU, not zero-centered | Hidden layers (default) |
| **Leaky ReLU** | (-∞, ∞) | Mitigates dying ReLU | Small negative slope may not help much | Hidden layers (alternative to ReLU) |
| **GELU** | ≈ [-0.17, ∞) | Smooth, strong results in transformers | Costlier to compute than ReLU | Transformers, modern architectures |
| **Swish** | ≈ [-0.28, ∞) for β = 1 | Self-gated, smooth | Costlier to compute than ReLU | Modern CNNs, when performance matters |

In [None]:
# Compare gradient flow for different activations

def compare_gradients():
    """
    Visualize how gradients behave for different activation functions.
    This is crucial for understanding the vanishing/exploding gradient problem.
    """
    z_range = np.linspace(-6, 6, 1000)
    
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    # Plot 1: Gradients comparison
    ax1.plot(z_range, act.sigmoid_derivative(z_range), 
            label='Sigmoid', linewidth=2.5)
    ax1.plot(z_range, act.tanh_derivative(z_range), 
            label='Tanh', linewidth=2.5)
    ax1.plot(z_range, act.relu_derivative(z_range), 
            label='ReLU', linewidth=2.5)
    ax1.plot(z_range, act.leaky_relu_derivative(z_range), 
            label='Leaky ReLU', linewidth=2.5)
    
    ax1.axhline(y=1, color='green', linestyle=':', linewidth=2, 
               alpha=0.5, label='Ideal (gradient = 1)')
    ax1.axhline(y=0, color='black', linewidth=1, alpha=0.3)
    ax1.axvline(x=0, color='black', linewidth=1, alpha=0.3)
    
    ax1.set_xlabel('z (pre-activation)', fontsize=12, weight='bold')
    ax1.set_ylabel('Gradient (df/dz)', fontsize=12, weight='bold')
    ax1.set_title('Gradient Flow Comparison', fontsize=14, weight='bold')
    ax1.legend(loc='best', fontsize=11)
    ax1.grid(True, alpha=0.3)
    ax1.set_ylim(-0.1, 1.2)
    
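    # Plot 2: Vanishing gradient problem
    # Simplified simulation: every layer sees the same pre-activation z and
    # weight matrices are ignored, so only the activation derivatives multiply
    n_layers = 10
    z_test = np.array([4.0])  # Large pre-activation (saturated region for sigmoid/tanh)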
    
    sigmoid_grads = [1.0]
    tanh_grads = [1.0]
    relu_grads = [1.0]
    
    for layer in range(n_layers):
        # Multiply gradients (chain rule)
        sigmoid_grads.append(sigmoid_grads[-1] * 
                           act.sigmoid_derivative(z_test)[0])
        tanh_grads.append(tanh_grads[-1] * 
                        act.tanh_derivative(z_test)[0])
        relu_grads.append(relu_grads[-1] * 
                        act.relu_derivative(z_test)[0])
    
    layers = range(n_layers + 1)
    ax2.semilogy(layers, sigmoid_grads, 'o-', label='Sigmoid', 
                linewidth=2.5, markersize=6)
    ax2.semilogy(layers, tanh_grads, 's-', label='Tanh', 
                linewidth=2.5, markersize=6)
    ax2.semilogy(layers, relu_grads, '^-', label='ReLU', 
                linewidth=2.5, markersize=6)
    
    ax2.set_xlabel('Layer Depth', fontsize=12, weight='bold')
    ax2.set_ylabel('Gradient Magnitude (log scale)', fontsize=12, weight='bold')
    ax2.set_title('Vanishing Gradient Problem\n(z=4.0, propagating backward)', 
                 fontsize=14, weight='bold')
    ax2.legend(loc='best', fontsize=11)
    ax2.grid(True, alpha=0.3, which='both')
    
    plt.tight_layout()
    plt.show()
    
    print("Gradient Analysis:")
    print("=" * 60)
    print(f"After {n_layers} layers (with z=4.0):")
    print(f"  Sigmoid gradient: {sigmoid_grads[-1]:.2e} (vanishing!)")
    print(f"  Tanh gradient: {tanh_grads[-1]:.2e} (vanishing!)")
    print(f"  ReLU gradient: {relu_grads[-1]:.2e} (preserved!)")
    print("\nThis demonstrates why ReLU helps with deep networks!")

compare_gradients()

## 6. Exercises

### Exercise 1: Implement Custom Activation

Implement the **ELU (Exponential Linear Unit)** activation function:

$$\text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha(e^z - 1) & \text{if } z \leq 0 \end{cases}$$

Where $\alpha$ is typically set to 1.0.

**Properties**:
- Smooth: continuously differentiable at $z = 0$ when $\alpha = 1$
- Negative values push mean activation closer to zero
- Helps with vanishing gradient

Implement both the function and its derivative.
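
One way to check a hand-written derivative is to compare it with a finite-difference estimate. The helper below, `numerical_derivative`, is a hypothetical utility (not part of the notebook) and is demonstrated on tanh, whose derivative is already known.

```python
import numpy as np

def numerical_derivative(f, z, h=1e-5):
    """Central-difference estimate of df/dz, evaluated element-wise."""
    z = np.asarray(z, dtype=float)
    return (f(z + h) - f(z - h)) / (2 * h)

# Demonstrated on tanh, whose derivative 1 - tanh(z)^2 is known
z_check = np.array([-2.0, -0.5, 0.5, 2.0])
analytic = 1 - np.tanh(z_check) ** 2
numeric = numerical_derivative(np.tanh, z_check)
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```

Once your `elu` and `elu_derivative` are written, the same comparison can be run on them by passing `elu` as `f` and checking the result against `elu_derivative(z_check)`.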

In [None]:
# Exercise 1: Your solution here

def elu(z, alpha=1.0):
    """
    Implement ELU activation function.
    
    Parameters:
    -----------
    z : array-like
        Input values
    alpha : float
        Scaling parameter (default=1.0)
    
    Returns:
    --------
    output : array-like
        ELU(z)
    """
    # TODO: Implement ELU
    pass

def elu_derivative(z, alpha=1.0):
    """
    Derivative of ELU.
    
    Hint: 
    - For z > 0: derivative is 1
    - For z <= 0: derivative is ELU(z) + alpha
    """
    # TODO: Implement ELU derivative
    pass

# Test your implementation
z_test = np.array([-2, -1, 0, 1, 2])
print("Test values:", z_test)
print("ELU output:", elu(z_test))
print("ELU derivative:", elu_derivative(z_test))

In [None]:
# Solution to Exercise 1

def elu(z, alpha=1.0):
    """ELU activation function."""
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def elu_derivative(z, alpha=1.0):
    """Derivative of ELU."""
    return np.where(z > 0, 1, elu(z, alpha) + alpha)

# Test and visualize
z_range = np.linspace(-5, 5, 1000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot ELU vs ReLU
ax1.plot(z_range, elu(z_range), label='ELU', linewidth=2.5)
ax1.plot(z_range, act.relu(z_range), label='ReLU', 
        linewidth=2.5, linestyle='--', alpha=0.7)
ax1.axhline(y=0, color='black', linewidth=0.8, alpha=0.3)
ax1.axvline(x=0, color='black', linewidth=0.8, alpha=0.3)
ax1.grid(True, alpha=0.3)
ax1.set_xlabel('z', fontsize=12, weight='bold')
ax1.set_ylabel('f(z)', fontsize=12, weight='bold')
ax1.set_title('ELU vs ReLU', fontsize=13, weight='bold')
ax1.legend(fontsize=11)

# Plot derivatives
ax2.plot(z_range, elu_derivative(z_range), label="ELU'", linewidth=2.5)
ax2.plot(z_range, act.relu_derivative(z_range), label="ReLU'", 
        linewidth=2.5, linestyle='--', alpha=0.7)
ax2.axhline(y=0, color='black', linewidth=0.8, alpha=0.3)
ax2.axvline(x=0, color='black', linewidth=0.8, alpha=0.3)
ax2.grid(True, alpha=0.3)
ax2.set_xlabel('z', fontsize=12, weight='bold')
ax2.set_ylabel("f'(z)", fontsize=12, weight='bold')
ax2.set_title('Derivatives', fontsize=13, weight='bold')
ax2.legend(fontsize=11)

plt.tight_layout()
plt.show()

print("\nKey difference: for negative inputs ELU decays smoothly toward -alpha, while ReLU outputs exactly zero.")
print("This helps push mean activations closer to zero!")

### Exercise 2: Perceptron on Real Data

Train a perceptron on a linearly separable subset of the Iris dataset.

**Tasks**:
1. Create a binary classification problem (select 2 classes from Iris)
2. Use only 2 features for visualization
3. Train the perceptron
4. Visualize the decision boundary
5. Calculate accuracy

In [None]:
# Exercise 2: Your solution here

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load Iris dataset
iris = load_iris()
X_full = iris.data
y_full = iris.target

# TODO: Select only classes 0 and 1 (setosa and versicolor)
# Hint: Use boolean indexing with (y_full == 0) | (y_full == 1)

# TODO: Select only features 2 and 3 (petal length and petal width)

# TODO: Standardize features (important for perceptron!)
# Use StandardScaler from sklearn

# TODO: Train perceptron

# TODO: Calculate and print accuracy

# TODO: Visualize decision boundary

In [None]:
# Solution to Exercise 2

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load and prepare data
iris = load_iris()
X_full = iris.data
y_full = iris.target

# Select only setosa (0) and versicolor (1)
mask = (y_full == 0) | (y_full == 1)
X = X_full[mask][:, 2:]  # Petal length and width
y = y_full[mask]

print(f"Dataset shape: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train perceptron
print("\nTraining perceptron...")
perceptron_iris = Perceptron(n_features=2, learning_rate=0.01, n_iterations=50)
perceptron_iris.fit(X_scaled, y)

# Evaluate
predictions = perceptron_iris.predict(X_scaled)
accuracy = np.mean(predictions == y)
print(f"\nAccuracy: {accuracy * 100:.2f}%")

# Visualize
plot_perceptron_decision_boundary(perceptron_iris, X_scaled, y)

# Add feature names
plt.figure(figsize=(10, 6))
for class_val in [0, 1]:
    mask = y == class_val
    plt.scatter(X_scaled[mask, 0], X_scaled[mask, 1], 
               label=iris.target_names[class_val],
               s=100, edgecolors='black', linewidth=1.5, alpha=0.7)

# Draw decision boundary
x_min, x_max = X_scaled[:, 0].min() - 1, X_scaled[:, 0].max() + 1
if perceptron_iris.weights[1] != 0:
    x_boundary = np.array([x_min, x_max])
    y_boundary = -(perceptron_iris.weights[0] * x_boundary + 
                   perceptron_iris.bias) / perceptron_iris.weights[1]
    plt.plot(x_boundary, y_boundary, 'k--', linewidth=2, 
            label='Decision Boundary')

plt.xlabel('Petal Length (standardized)', fontsize=12, weight='bold')
plt.ylabel('Petal Width (standardized)', fontsize=12, weight='bold')
plt.title(f'Perceptron on Iris Dataset (Accuracy: {accuracy*100:.1f}%)', 
         fontsize=14, weight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### Exercise 3: Activation Function Selection

For each scenario below, choose the most appropriate activation function and explain why:

1. **Hidden layer in a deep CNN for image classification (100 layers deep)**
2. **Output layer for predicting house prices (regression)**
3. **Output layer for binary classification (spam detection)**
4. **Hidden layer in a transformer model for language understanding**
5. **Output layer for multi-class classification (10 classes)**

Write your answers in the markdown cell below.

**Your Answers to Exercise 3:**

*(Double-click to edit)*

1. **Deep CNN hidden layer**:
   - Activation: 
   - Reason:

2. **House price regression output**:
   - Activation:
   - Reason:

3. **Binary classification output**:
   - Activation:
   - Reason:

4. **Transformer hidden layer**:
   - Activation:
   - Reason:

5. **Multi-class classification output**:
   - Activation:
   - Reason:

In [None]:
# Solutions to Exercise 3

print("ACTIVATION FUNCTION SELECTION - SOLUTIONS")
print("=" * 70)

print("\n1. Deep CNN Hidden Layer (100 layers):")
print("   Activation: ReLU or Leaky ReLU")
print("   Reason: Mitigates vanishing gradients in very deep networks.")
print("           Fast computation. Leaky ReLU mitigates the dying ReLU problem.")
print("           Widely used in deep CNNs such as VGG and ResNet; at this depth,")
print("           residual connections and normalization are typically needed too.")

print("\n2. House Price Regression Output:")
print("   Activation: None (Linear)")
print("   Reason: Regression requires predicting continuous values without")
print("           bounds. Linear activation allows any real number output.")
print("           Prices can range from low to very high values.")

print("\n3. Binary Classification Output (Spam Detection):")
print("   Activation: Sigmoid")
print("   Reason: Outputs a probability between 0 and 1.")
print("           Natural fit for binary classification.")
print("           Can interpret as P(spam), typically thresholded at 0.5.")
print("           Works with binary cross-entropy loss.")

print("\n4. Transformer Hidden Layer:")
print("   Activation: GELU")
print("   Reason: Standard in transformers (BERT, GPT use GELU).")
print("           Smooth, ReLU-like curve; reported to match or outperform ReLU")
print("           in the original GELU paper and widely adopted in attention-based models.")
print("           Allows probabilistic interpretation of neuron activation.")

print("\n5. Multi-class Classification Output (10 classes):")
print("   Activation: Softmax")
print("   Reason: Converts logits to probability distribution over classes.")
print("           Ensures outputs sum to 1.0.")
print("           Works with categorical cross-entropy loss.")
print("           Provides interpretable class probabilities.")

print("\n" + "=" * 70)

## 7. Summary

Excellent work! You've completed the perceptron and activation functions module. Let's recap:

### Key Concepts

1. **Perceptron**
   - Simplest neural network (single neuron)
   - Linear model: $y = f(\mathbf{w}^T\mathbf{x} + b)$
   - Can learn linearly separable patterns
   - Cannot solve XOR (not linearly separable)

2. **Linear Separability**
   - Determines whether a single perceptron can solve a problem
   - XOR is the classic example of a problem that is not linearly separable
   - Multi-layer networks are needed for such patterns

3. **Activation Functions**
   - Introduce non-linearity to networks
   - Enable learning of complex patterns
   - Must be differentiable (at least almost everywhere) for backpropagation

4. **Common Activations**
   - **Sigmoid**: (0, 1), good for probabilities, vanishing gradients
   - **Tanh**: (-1, 1), zero-centered, still vanishing gradients
   - **ReLU**: [0, ∞), default choice, fast, mitigates vanishing gradients
   - **Leaky ReLU**: Mitigates the dying ReLU problem
   - **GELU**: Smooth, used in transformers
   - **Swish**: Self-gated, used in modern architectures

5. **Selection Guide**
   - Hidden layers: ReLU (default), GELU (transformers)
   - Binary output: Sigmoid
   - Multi-class output: Softmax
   - Regression output: Linear (no activation)

### Important Insights

- **Vanishing Gradients**: Sigmoid/Tanh derivatives become very small for large |z|, making deep networks hard to train
- **ReLU Advantage**: A constant gradient of 1 for positive inputs avoids the saturation that shrinks sigmoid/tanh gradients
- **Dying ReLU**: Neurons can "die" if they output 0 for every input, since their gradient is then 0 as well; Leaky ReLU helps
- **Modern Trends**: GELU and Swish are now common in transformers and recent CNNs
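
The dying ReLU effect can be reproduced with a single neuron. The sketch below assumes randomly generated inputs and illustrative weights with a strongly negative bias, so the pre-activation is negative for every sample.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # hypothetical inputs

# A neuron whose bias has been pushed far negative (illustrative values)
w = np.array([0.5, -0.3, 0.2])
b = -10.0
z = X @ w + b                         # negative for every sample here

relu_grad = np.where(z > 0, 1.0, 0.0)
leaky_grad = np.where(z > 0, 1.0, 0.01)

print(relu_grad.sum())    # 0.0: no gradient reaches w or b
print(leaky_grad.min())   # 0.01: a small gradient still flows
```

With ReLU, every sample contributes a zero gradient, so gradient descent cannot move `w` or `b` back into a useful region. The leaky variant keeps a slope of 0.01 on the negative side, which leaves a path for recovery.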

### What's Next?

In **Module 02: Backpropagation and Gradient Descent**, we'll learn:
- How neural networks actually learn (optimization)
- Forward and backward propagation in detail
- Computing gradients using the chain rule
- Loss functions and their derivatives
- Gradient descent variants

### Additional Resources

**Papers**:
- Rosenblatt (1958): "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain"
- Glorot et al. (2011): "Deep Sparse Rectifier Neural Networks" (ReLU)
- Hendrycks & Gimpel (2016): "Gaussian Error Linear Units (GELUs)"

**Interactive**:
- TensorFlow Playground: Visualize activation effects
- Distill.pub: "Activation Atlas" for exploring what a network's hidden-layer activations represent

---

**Ready to learn how neural networks optimize?** Continue to **Module 02: Backpropagation and Gradient Descent**!