# Day 47: Feedforward Neural Networks and Activation Functions

Welcome to Day 47 of the 100 Days of Machine Learning Challenge! Today, we embark on an exciting journey into the world of deep learning by exploring feedforward neural networks and activation functions. These concepts form the foundation of modern deep learning and artificial intelligence.

## Introduction

Feedforward neural networks (FFNNs), also known as multilayer perceptrons (MLPs), represent one of the most fundamental architectures in deep learning. Unlike the simple perceptron we might have encountered earlier, feedforward neural networks can model complex, non-linear relationships in data through the use of multiple layers and non-linear activation functions.

### Why Feedforward Neural Networks Matter

In machine learning, we've learned about linear models like linear regression and logistic regression. While these models are powerful, they can only represent relationships that are linear in the input features: a straight-line (or hyperplane) fit for regression, and a linear decision boundary for classification. Real-world problems often involve highly complex, non-linear patterns that linear models cannot capture. Feedforward neural networks bridge this gap by:

1. **Modeling Non-Linearity**: Through activation functions, neural networks can learn complex non-linear patterns
2. **Feature Learning**: Networks automatically learn useful representations of data through hidden layers
3. **Universal Approximation**: In theory, a network with a single hidden layer, a suitable non-linear activation, and enough neurons can approximate any continuous function on a bounded input domain to arbitrary accuracy
4. **Scalability**: They can handle high-dimensional data and large datasets effectively

### Learning Objectives

By the end of this lesson, you will be able to:

- Understand the architecture of feedforward neural networks
- Explain the role of activation functions in neural networks
- Implement different activation functions and visualize their properties
- Build and train a feedforward neural network using scikit-learn
- Apply neural networks to real-world classification problems
- Interpret the results and evaluate network performance

## Theoretical Foundation

### What is a Feedforward Neural Network?

A feedforward neural network is a computational model inspired by biological neural networks in the human brain. Information flows in one direction—from input to output—without cycles or loops, hence the term "feedforward."

#### Network Architecture

A feedforward neural network consists of:

1. **Input Layer**: Receives the raw features/data
2. **Hidden Layer(s)**: Intermediate layers that transform the input
3. **Output Layer**: Produces the final prediction

Each layer contains multiple **neurons** (also called nodes or units), and neurons in adjacent layers are connected by **weights**. Each neuron also has a **bias** term.

#### Mathematical Representation

For a single neuron, the computation can be expressed as:

$$z = w_1x_1 + w_2x_2 + ... + w_nx_n + b = \sum_{i=1}^{n} w_ix_i + b = \mathbf{w}^T\mathbf{x} + b$$

Where:
- $x_i$ are the input features
- $w_i$ are the weights
- $b$ is the bias term
- $z$ is the weighted sum (also called pre-activation)

The neuron then applies an **activation function** $f$ to produce its output:

$$a = f(z) = f(\mathbf{w}^T\mathbf{x} + b)$$

For a complete layer with $m$ neurons, we can write this in matrix form:

$$\mathbf{Z} = \mathbf{W}\mathbf{X} + \mathbf{b}$$
$$\mathbf{A} = f(\mathbf{Z})$$

Where:
- $\mathbf{X}$ is the input vector or matrix
- $\mathbf{W}$ is the weight matrix
- $\mathbf{b}$ is the bias vector
- $\mathbf{Z}$ is the pre-activation
- $\mathbf{A}$ is the activation (output)
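
The layer equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not a library implementation: the helper name `dense_forward`, the shapes, and the random values are assumptions chosen for the example. It follows the column convention of the formula (one sample per column of $\mathbf{X}$), whereas scikit-learn stores one sample per row.

```python
import numpy as np

def relu(z):
    """Element-wise ReLU, used here as the activation f."""
    return np.maximum(0, z)

def dense_forward(X, W, b, activation):
    """Compute A = f(WX + b) for one layer (hypothetical helper).

    X: (n_features, n_samples), W: (m, n_features), b: (m, 1).
    """
    Z = W @ X + b          # b is broadcast across the sample columns
    return activation(Z)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))   # 3 features, 4 samples (one per column)
W = rng.normal(size=(5, 3))   # 5 neurons, each with 3 weights
b = np.zeros((5, 1))          # one bias per neuron

A = dense_forward(X, W, b, relu)
print(A.shape)  # (5, 4): one activation per neuron per sample
```

Each row of `W` holds the weights of one neuron, so row `i` of `A` is that neuron's output across all samples.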

### Why Do We Need Activation Functions?

Without activation functions, no matter how many layers we stack, the network would still be equivalent to a single-layer linear model. This is because the composition of linear functions is still linear:

$$f(g(x)) = W_2(W_1x + b_1) + b_2 = (W_2W_1)x + (W_2b_1 + b_2) = W_{combined}x + b_{combined}$$

Activation functions introduce **non-linearity**, allowing the network to learn complex patterns and approximate non-linear functions.
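
The collapse of stacked linear layers can be checked numerically. The sketch below uses small hand-picked matrices (assumptions made only for illustration) and compares a two-layer linear network with the single combined layer from the formula above, then shows that inserting a ReLU between the layers changes the result.

```python
import numpy as np

# Hand-picked illustrative parameters (not from any trained model)
W1 = np.array([[1.0, -1.0],
               [2.0,  0.5]])
b1 = np.array([0.5, -1.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.0])
x = np.array([1.0, 2.0])

# Two linear layers applied in sequence
two_layer = W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2
one_layer = W_combined @ x + b_combined

print(np.allclose(two_layer, one_layer))  # True

# With a ReLU between the layers the network is no longer the same linear map
with_relu = W2 @ np.maximum(0, W1 @ x + b1) + b2
print(two_layer, with_relu)
```

Here the first hidden pre-activation is negative, so ReLU zeroes it and the output differs from the purely linear version.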

## Common Activation Functions

Let's explore the most important activation functions used in neural networks:

### 1. Sigmoid Function

The sigmoid function squashes input values to the range (0, 1):

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

**Properties:**
- Output range: (0, 1)
- S-shaped curve
- Useful for binary classification in output layer
- Derivative: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$

**Drawbacks:**
- Vanishing gradient problem: the derivative approaches 0 for inputs of large magnitude (very positive or very negative)
- Outputs not zero-centered
- Computationally expensive (exponential function)
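
To make the vanishing gradient drawback concrete, the sketch below evaluates the sigmoid derivative at a few points. It also computes a rough upper bound on the product of ten such derivatives. That product is a simplification: real backpropagation also multiplies by the weights at each layer, so treat the number as an illustration rather than an exact gradient.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

# The derivative peaks at z = 0 and shrinks quickly as |z| grows
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}   sigma'(z) = {sigmoid_derivative(z):.6f}")

# The derivative never exceeds 0.25, so a chain of 10 sigmoid layers
# scales the gradient by at most 0.25**10 (ignoring the weights)
depth = 10
upper_bound = 0.25 ** depth
print(f"Upper bound after {depth} layers: {upper_bound:.2e}")
```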

### 2. Hyperbolic Tangent (tanh)

The tanh function is similar to sigmoid but maps inputs to (-1, 1):

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = \frac{e^{2z} - 1}{e^{2z} + 1}$$

**Properties:**
- Output range: (-1, 1)
- Zero-centered (better than sigmoid)
- S-shaped curve
- Derivative: $\tanh'(z) = 1 - \tanh^2(z)$

**Drawbacks:**
- Still suffers from vanishing gradient problem
- Computationally expensive
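
One way to see why tanh behaves so much like sigmoid is the identity $\tanh(z) = 2\sigma(2z) - 1$: tanh is a sigmoid that has been rescaled and shifted to be zero-centered. The short check below verifies the identity numerically on an arbitrary grid of inputs.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-5, 5, 11)

lhs = np.tanh(z)
rhs = 2 * sigmoid(2 * z) - 1   # rescaled, shifted sigmoid

print(np.allclose(lhs, rhs))  # True
```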

### 3. Rectified Linear Unit (ReLU)

ReLU is one of the most widely used activation functions in deep learning:

$$\text{ReLU}(z) = \max(0, z) = \begin{cases} z & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}$$

**Properties:**
- Output range: [0, ∞)
- Very simple and fast to compute
- Doesn't saturate for positive values
- Derivative: $\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}$

**Advantages:**
- Computationally efficient
- Helps mitigate vanishing gradient problem
- Leads to sparse activations (some neurons output 0)

**Drawbacks:**
- "Dying ReLU" problem: neurons can get stuck outputting 0
- Not zero-centered
- Non-differentiable at z=0 (not a problem in practice: implementations simply pick a value there, commonly 0, as in the derivative above)
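
The "dying ReLU" problem can be illustrated with a single neuron whose bias has drifted far into the negative range. The weights, bias, and data below are made-up values chosen so that the pre-activation is negative for every input; they are assumptions for the sketch, not values from a trained network. The last lines compare against the gradient of Leaky ReLU, which is covered in the next section.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))   # inputs in [-1, 1]

# Hypothetical neuron with a large negative bias
w = np.array([0.5, -0.3])
b = -5.0

z = X @ w + b                  # always negative: |X @ w| <= 0.8
a = np.maximum(0, z)           # ReLU output
relu_grad = (z > 0).astype(float)

print("Fraction of zero outputs:", np.mean(a == 0))
print("Samples with non-zero ReLU gradient:", int(relu_grad.sum()))

# With Leaky ReLU (alpha = 0.01) the gradient is small but never zero
alpha = 0.01
leaky_grad = np.where(z > 0, 1.0, alpha)
print("Smallest Leaky ReLU gradient:", leaky_grad.min())
```

Because the ReLU gradient is zero for every sample, gradient descent receives no signal to move `w` or `b`, so the neuron stays inactive.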

### 4. Leaky ReLU

Leaky ReLU addresses the dying ReLU problem by allowing small negative values:

$$\text{Leaky ReLU}(z) = \max(\alpha z, z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}$$

Where $\alpha$ is a small constant (typically 0.01).

**Properties:**
- Output range: (-∞, ∞)
- Prevents dying neurons
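- Keeps ReLU's low computational cost and non-saturating positive side (though negative inputs no longer produce exact zeros)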

### 5. Softmax Function

Softmax is used in the output layer for multi-class classification:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Where $K$ is the number of classes.

**Properties:**
- Outputs sum to 1 (can be interpreted as probabilities)
- Used for multi-class classification
- Each output is between 0 and 1
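
A direct translation of the softmax formula overflows for large inputs, so implementations usually subtract the maximum logit first. This does not change the result, because the common factor cancels in the ratio. The function below is a minimal sketch of that approach; the example logits are arbitrary.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)   # largest entry becomes 0
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())

# Large logits would overflow np.exp without the shift
big = softmax(np.array([1000.0, 1001.0, 1002.0]))
print(big)
```

Shifting all logits by the same constant leaves the probabilities unchanged, which is why `big` matches `softmax([0, 1, 2])`.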

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")

Libraries imported successfully!
NumPy version: 2.3.4


## Visualizing Activation Functions

Let's visualize these activation functions to better understand their behavior:

In [None]:
# Define activation functions
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

# Generate input values
z = np.linspace(-10, 10, 1000)

# Create subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Plot Sigmoid
axes[0, 0].plot(z, sigmoid(z), 'b-', linewidth=2, label='Sigmoid')
axes[0, 0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[0, 0].axhline(y=1, color='k', linestyle='--', alpha=0.3)
axes[0, 0].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_title('Sigmoid Activation Function', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('z')
axes[0, 0].set_ylabel('σ(z)')
axes[0, 0].legend()
axes[0, 0].text(-8, 0.5, r'$\sigma(z) = \frac{1}{1 + e^{-z}}$', fontsize=12, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Plot Tanh
axes[0, 1].plot(z, tanh(z), 'g-', linewidth=2, label='Tanh')
axes[0, 1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[0, 1].axhline(y=1, color='k', linestyle='--', alpha=0.3)
axes[0, 1].axhline(y=-1, color='k', linestyle='--', alpha=0.3)
axes[0, 1].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_title('Tanh Activation Function', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('z')
axes[0, 1].set_ylabel('tanh(z)')
axes[0, 1].legend()
axes[0, 1].text(-8, 0.5, r'$tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$', fontsize=12, bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.5))

# Plot ReLU
axes[1, 0].plot(z, relu(z), 'r-', linewidth=2, label='ReLU')
axes[1, 0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[1, 0].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].set_title('ReLU Activation Function', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('z')
axes[1, 0].set_ylabel('ReLU(z)')
axes[1, 0].legend()
axes[1, 0].text(-8, 5, r'$ReLU(z) = max(0, z)$', fontsize=12, bbox=dict(boxstyle='round', facecolor='lightcoral', alpha=0.5))

# Plot Leaky ReLU
axes[1, 1].plot(z, leaky_relu(z), 'm-', linewidth=2, label='Leaky ReLU (α=0.01)')
axes[1, 1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[1, 1].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].set_title('Leaky ReLU Activation Function', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('z')
axes[1, 1].set_ylabel('Leaky ReLU(z)')
axes[1, 1].legend()
axes[1, 1].text(-8, 5, r'$Leaky\ ReLU(z) = max(\alpha z, z)$', fontsize=12, bbox=dict(boxstyle='round', facecolor='plum', alpha=0.5))

plt.tight_layout()
plt.show()

print("\nKey Observations:")
print("1. Sigmoid: Smooth S-curve, outputs in (0,1), saturates at extremes")
print("2. Tanh: Similar to sigmoid but zero-centered, outputs in (-1,1)")
print("3. ReLU: Simple linear for positive values, zero for negative")
print("4. Leaky ReLU: Similar to ReLU but allows small negative values")


## Understanding Derivatives of Activation Functions

The derivatives of activation functions are crucial for backpropagation (which we'll explore in the next lesson). Let's visualize them:

In [None]:
# Define derivatives
def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

def tanh_derivative(z):
    return 1 - np.tanh(z)**2

def relu_derivative(z):
    return np.where(z > 0, 1, 0)

def leaky_relu_derivative(z, alpha=0.01):
    return np.where(z > 0, 1, alpha)

# Create visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Sigmoid derivative
axes[0, 0].plot(z, sigmoid_derivative(z), 'b-', linewidth=2)
axes[0, 0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[0, 0].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[0, 0].grid(True, alpha=0.3)
axes[0, 0].set_title("Sigmoid Derivative", fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('z')
axes[0, 0].set_ylabel("σ'(z)")
axes[0, 0].fill_between(z, 0, sigmoid_derivative(z), alpha=0.3)

# Tanh derivative
axes[0, 1].plot(z, tanh_derivative(z), 'g-', linewidth=2)
axes[0, 1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[0, 1].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[0, 1].grid(True, alpha=0.3)
axes[0, 1].set_title("Tanh Derivative", fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('z')
axes[0, 1].set_ylabel("tanh'(z)")
axes[0, 1].fill_between(z, 0, tanh_derivative(z), alpha=0.3, color='green')

# ReLU derivative
axes[1, 0].plot(z, relu_derivative(z), 'r-', linewidth=2)
axes[1, 0].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[1, 0].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[1, 0].grid(True, alpha=0.3)
axes[1, 0].set_title("ReLU Derivative", fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('z')
axes[1, 0].set_ylabel("ReLU'(z)")
axes[1, 0].fill_between(z, 0, relu_derivative(z), alpha=0.3, color='red')

# Leaky ReLU derivative
axes[1, 1].plot(z, leaky_relu_derivative(z), 'm-', linewidth=2)
axes[1, 1].axhline(y=0, color='k', linestyle='--', alpha=0.3)
axes[1, 1].axvline(x=0, color='k', linestyle='--', alpha=0.3)
axes[1, 1].grid(True, alpha=0.3)
axes[1, 1].set_title("Leaky ReLU Derivative", fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('z')
axes[1, 1].set_ylabel("Leaky ReLU'(z)")
axes[1, 1].fill_between(z, 0, leaky_relu_derivative(z), alpha=0.3, color='magenta')

plt.tight_layout()
plt.show()

print("\nDerivative Insights:")
print("1. Sigmoid/Tanh: Derivatives approach 0 at extremes (vanishing gradient)")
print("2. ReLU: Derivative is 1 for positive, 0 for negative (can die)")
print("3. Leaky ReLU: Always has a non-zero gradient (prevents dying neurons)")


## Feedforward Neural Network Architecture

A typical feedforward neural network processes data through multiple layers. Let's visualize how data flows through the network:

In [None]:
# Simple visualization of network architecture
fig, ax = plt.subplots(1, 1, figsize=(12, 8))

# Network structure: 4 input, 5 hidden, 3 output
layer_sizes = [4, 5, 3]
layer_names = ['Input Layer\n(4 features)', 'Hidden Layer\n(5 neurons)', 'Output Layer\n(3 classes)']

# Vertical positions for each layer
v_spacing = 1.0 / max(layer_sizes)
h_spacing = 1.0 / (len(layer_sizes) - 1)

# Draw neurons
for n, (layer_size, layer_name) in enumerate(zip(layer_sizes, layer_names)):
    layer_top = v_spacing * (layer_size - 1) / 2
    for m in range(layer_size):
        x = n * h_spacing
        y = layer_top - m * v_spacing
        circle = plt.Circle((x, y), v_spacing/4, facecolor='steelblue', edgecolor='black', linewidth=2, zorder=4)
        ax.add_artist(circle)
        
        # Add labels for input and output
        if n == 0:
            ax.text(x - 0.15, y, f'$x_{m+1}$', ha='right', va='center', fontsize=11)
        elif n == len(layer_sizes) - 1:
            ax.text(x + 0.15, y, f'$y_{m+1}$', ha='left', va='center', fontsize=11)

# Draw connections
for n in range(len(layer_sizes) - 1):
    layer_size_a = layer_sizes[n]
    layer_size_b = layer_sizes[n + 1]
    layer_top_a = v_spacing * (layer_size_a - 1) / 2
    layer_top_b = v_spacing * (layer_size_b - 1) / 2
    
    for m in range(layer_size_a):
        for o in range(layer_size_b):
            x1 = n * h_spacing
            y1 = layer_top_a - m * v_spacing
            x2 = (n + 1) * h_spacing
            y2 = layer_top_b - o * v_spacing
            line = plt.Line2D([x1, x2], [y1, y2], c='gray', alpha=0.3, linewidth=1, zorder=1)
            ax.add_artist(line)

# Add layer labels
for n, layer_name in enumerate(layer_names):
    x = n * h_spacing
    ax.text(x, -0.4, layer_name, ha='center', fontsize=12, fontweight='bold',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.7))

# Add arrows showing flow
arrow_y = 0.6
for n in range(len(layer_sizes) - 1):
    x1 = n * h_spacing + 0.05
    x2 = (n + 1) * h_spacing - 0.05
    ax.annotate('', xy=(x2, arrow_y), xytext=(x1, arrow_y),
                arrowprops=dict(arrowstyle='->', lw=2, color='red'))

ax.text(0.5, 0.7, 'Forward Pass Direction', ha='center', fontsize=12, color='red', fontweight='bold')

ax.set_xlim(-0.3, 1.3)
ax.set_ylim(-0.6, 0.8)
ax.axis('off')
ax.set_title('Feedforward Neural Network Architecture', fontsize=16, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()

print("\nArchitecture Details:")
print("- Input Layer: Receives raw features (no activation)")
print("- Hidden Layer: Applies weights, bias, and activation function")
print("- Output Layer: Produces final predictions")
print("- Each connection has an associated weight")
print("- Information flows forward: Input → Hidden → Output")


## Python Implementation

### Building a Feedforward Neural Network with scikit-learn

Scikit-learn provides the `MLPClassifier` (Multi-Layer Perceptron Classifier) for building feedforward neural networks. Let's implement a neural network for a binary classification task.

#### Step 1: Generate Synthetic Data

We'll use the `make_moons` dataset, which creates two interleaving half-moon shapes - perfect for demonstrating neural networks' ability to learn non-linear decision boundaries:

In [None]:
# Generate non-linear dataset
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Visualize the dataset
plt.figure(figsize=(10, 6))
plt.scatter(X_train[y_train == 0, 0], X_train[y_train == 0, 1], 
            c='blue', marker='o', label='Class 0', alpha=0.6, edgecolors='k')
plt.scatter(X_train[y_train == 1, 0], X_train[y_train == 1, 1], 
            c='red', marker='s', label='Class 1', alpha=0.6, edgecolors='k')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Feature 2', fontsize=12)
plt.title('Training Data: Moons Dataset (Non-Linear)', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")
print(f"Features: {X_train.shape[1]}")
print(f"Classes: {len(np.unique(y))}")


#### Step 2: Data Preprocessing

Neural networks work best when features are scaled. We'll use standardization to ensure all features have mean 0 and standard deviation 1:

In [None]:
# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Original data statistics:")
print(f"Mean: {X_train.mean(axis=0)}")
print(f"Std: {X_train.std(axis=0)}")
print(f"\nScaled data statistics:")
print(f"Mean: {X_train_scaled.mean(axis=0)}")
print(f"Std: {X_train_scaled.std(axis=0)}")

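
An alternative to scaling by hand is to bundle the scaler and the classifier in a scikit-learn pipeline, so the scaler is always fitted on the training data only. The sketch below is one possible arrangement, not the approach used in the rest of this lesson; it regenerates the moons data so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on X_train and reuses its statistics at predict time
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10, 5), max_iter=1000, random_state=42),
)
pipe.fit(X_train, y_train)

test_acc = pipe.score(X_test, y_test)   # scaling is applied automatically
scaler_mean = pipe.named_steps["standardscaler"].mean_
print(f"Test accuracy: {test_acc:.4f}")
print(f"Mean learned from training data: {np.round(scaler_mean, 3)}")
```

Keeping both steps in one object makes it harder to accidentally fit the scaler on test data.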

#### Step 3: Build and Train the Neural Network

We'll create a feedforward neural network with:
- Input layer: 2 features
- Hidden layer 1: 10 neurons with ReLU activation
- Hidden layer 2: 5 neurons with ReLU activation  
- Output layer: 1 neuron with logistic (sigmoid) activation; scikit-learn uses a single output unit for binary classification

In [None]:
# Create the neural network
mlp = MLPClassifier(
    hidden_layer_sizes=(10, 5),  # Two hidden layers with 10 and 5 neurons
    activation='relu',            # ReLU activation function
    solver='adam',                # Adam optimizer
    max_iter=1000,               # Maximum iterations
    random_state=42,
    verbose=False
)

# Train the model
mlp.fit(X_train_scaled, y_train)

# Make predictions
y_train_pred = mlp.predict(X_train_scaled)
y_test_pred = mlp.predict(X_test_scaled)

# Calculate accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print("Model Training Complete!")
print(f"\nNetwork Architecture:")
print(f"- Input Layer: {X_train.shape[1]} neurons")
print(f"- Hidden Layer 1: {mlp.hidden_layer_sizes[0]} neurons (ReLU)")
print(f"- Hidden Layer 2: {mlp.hidden_layer_sizes[1]} neurons (ReLU)")
print(f"- Output Layer: {mlp.n_outputs_} neuron (logistic; one output unit for binary labels)")
print(f"\nTotal Parameters: {sum([w.size for w in mlp.coefs_]) + sum([b.size for b in mlp.intercepts_])}")
print(f"\nPerformance:")
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"\nIterations run: {mlp.n_iter_}")

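
The "Total Parameters" figure printed above can be reproduced by hand: each layer contributes one weight per connection plus one bias per neuron. The helper `count_parameters` below is a hypothetical function written for this illustration. It assumes a single output unit for the binary moons problem, which is how scikit-learn's `MLPClassifier` handles two classes.

```python
def count_parameters(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each pair of adjacent layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    )

# Moons network: 2 inputs -> 10 -> 5 -> 1 output unit
# (2*10 + 10) + (10*5 + 5) + (5*1 + 1) = 30 + 55 + 6
moons_params = count_parameters([2, 10, 5, 1])
print(moons_params)  # 91

# A 3-class network with 4 inputs and hidden layers of 20 and 10 neurons
# (4*20 + 20) + (20*10 + 10) + (10*3 + 3) = 100 + 210 + 33
three_class_params = count_parameters([4, 20, 10, 3])
print(three_class_params)  # 343
```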

#### Step 4: Visualize Decision Boundary

One of the best ways to understand how a neural network learns is to visualize its decision boundary:

In [None]:
# Create a mesh for decision boundary
x_min, x_max = X_train_scaled[:, 0].min() - 0.5, X_train_scaled[:, 0].max() + 0.5
y_min, y_max = X_train_scaled[:, 1].min() - 0.5, X_train_scaled[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))

# Predict for each point in the mesh
Z = mlp.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Training data decision boundary
axes[0].contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
axes[0].scatter(X_train_scaled[y_train == 0, 0], X_train_scaled[y_train == 0, 1],
                c='blue', marker='o', label='Class 0', edgecolors='k', alpha=0.7)
axes[0].scatter(X_train_scaled[y_train == 1, 0], X_train_scaled[y_train == 1, 1],
                c='red', marker='s', label='Class 1', edgecolors='k', alpha=0.7)
axes[0].set_xlabel('Feature 1 (scaled)', fontsize=12)
axes[0].set_ylabel('Feature 2 (scaled)', fontsize=12)
axes[0].set_title(f'Training Set Decision Boundary\nAccuracy: {train_accuracy:.4f}', 
                  fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Test data decision boundary
axes[1].contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
axes[1].scatter(X_test_scaled[y_test == 0, 0], X_test_scaled[y_test == 0, 1],
                c='blue', marker='o', label='Class 0', edgecolors='k', alpha=0.7)
axes[1].scatter(X_test_scaled[y_test == 1, 0], X_test_scaled[y_test == 1, 1],
                c='red', marker='s', label='Class 1', edgecolors='k', alpha=0.7)
axes[1].set_xlabel('Feature 1 (scaled)', fontsize=12)
axes[1].set_ylabel('Feature 2 (scaled)', fontsize=12)
axes[1].set_title(f'Test Set Decision Boundary\nAccuracy: {test_accuracy:.4f}', 
                  fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nObservations:")
print("- The neural network learned a complex, non-linear decision boundary")
print("- The boundary successfully separates the two moon-shaped classes")
print("- A linear classifier such as logistic regression on the raw features can only draw a straight-line boundary")


## Comparing Different Activation Functions

Let's compare how different activation functions perform on the same dataset:

In [None]:
# Train models with different activation functions
activations = ['logistic', 'tanh', 'relu']
results = {}

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, activation in enumerate(activations):
    # Train model
    model = MLPClassifier(
        hidden_layer_sizes=(10, 5),
        activation=activation,
        solver='adam',
        max_iter=1000,
        random_state=42,
        verbose=False
    )
    model.fit(X_train_scaled, y_train)
    
    # Predict and evaluate
    y_pred = model.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    results[activation] = acc
    
    # Plot decision boundary
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    axes[idx].contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
    axes[idx].scatter(X_test_scaled[y_test == 0, 0], X_test_scaled[y_test == 0, 1],
                      c='blue', marker='o', label='Class 0', edgecolors='k', alpha=0.7, s=30)
    axes[idx].scatter(X_test_scaled[y_test == 1, 0], X_test_scaled[y_test == 1, 1],
                      c='red', marker='s', label='Class 1', edgecolors='k', alpha=0.7, s=30)
    axes[idx].set_xlabel('Feature 1', fontsize=11)
    axes[idx].set_ylabel('Feature 2', fontsize=11)
    axes[idx].set_title(f'{activation.upper()} Activation\nAccuracy: {acc:.4f}', 
                        fontsize=12, fontweight='bold')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print comparison
print("\nActivation Function Comparison:")
print("-" * 40)
for activation, acc in results.items():
    print(f"{activation.upper():>12}: {acc:.4f}")
print("-" * 40)
print(f"\nBest: {max(results, key=results.get).upper()} with {max(results.values()):.4f} accuracy")

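
A single train/test split with one random seed can make small accuracy differences look more meaningful than they are. As a rough sketch, the comparison can be repeated over several seeds with cross-validation. The dataset size, number of seeds, fold count, and `max_iter` below are arbitrary choices made to keep the example quick, and convergence warnings are silenced for readability.

```python
import warnings

import numpy as np
from sklearn.datasets import make_moons
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=400, noise=0.2, random_state=42)

summary = {}
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    for activation in ["logistic", "tanh", "relu"]:
        scores = []
        for seed in range(3):  # different weight initializations
            model = make_pipeline(
                StandardScaler(),
                MLPClassifier(hidden_layer_sizes=(10, 5), activation=activation,
                              max_iter=500, random_state=seed),
            )
            scores.extend(cross_val_score(model, X, y, cv=3))
        summary[activation] = (np.mean(scores), np.std(scores))

for activation, (mean, std) in summary.items():
    print(f"{activation:>10}: {mean:.4f} +/- {std:.4f}")
```

If the spread across seeds is as large as the gap between activations, the ranking from a single run should not be over-interpreted.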

## Hands-On Exercise: Multi-Class Classification

Now it's your turn! Let's apply what we've learned to a more complex multi-class classification problem using the Iris dataset - a classic machine learning dataset.

### The Problem

The Iris dataset contains measurements of iris flowers from three different species:
- Setosa
- Versicolor  
- Virginica

We'll build a neural network to classify iris flowers based on four features:
1. Sepal length
2. Sepal width
3. Petal length
4. Petal width

Let's load and explore the data:

In [None]:
# Load Iris dataset
from sklearn.datasets import load_iris

iris = load_iris()
X_iris = iris.data
y_iris = iris.target

print("Iris Dataset Information:")
print(f"Samples: {X_iris.shape[0]}")
print(f"Features: {X_iris.shape[1]}")
print(f"Classes: {len(np.unique(y_iris))}")
print(f"\nFeature names: {iris.feature_names}")
print(f"Class names: {iris.target_names}")

# Display sample data
import pandas as pd
df = pd.DataFrame(X_iris, columns=iris.feature_names)
df['species'] = [iris.target_names[i] for i in y_iris]
print(f"\nFirst 5 samples:")
print(df.head())

# Class distribution
print(f"\nClass distribution:")
for i, name in enumerate(iris.target_names):
    count = np.sum(y_iris == i)
    print(f"{name}: {count} samples")


### Visualizing the Iris Dataset

Let's visualize the relationships between features:

In [None]:
# Visualize pairwise feature relationships
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

feature_pairs = [(0, 1), (0, 2), (0, 3), (2, 3)]
colors = ['blue', 'red', 'green']

for idx, (f1, f2) in enumerate(feature_pairs):
    for class_idx, class_name in enumerate(iris.target_names):
        mask = y_iris == class_idx
        axes[idx].scatter(X_iris[mask, f1], X_iris[mask, f2],
                         c=colors[class_idx], label=class_name,
                         alpha=0.6, edgecolors='k', s=50)
    
    axes[idx].set_xlabel(iris.feature_names[f1], fontsize=11)
    axes[idx].set_ylabel(iris.feature_names[f2], fontsize=11)
    axes[idx].set_title(f'{iris.feature_names[f1]} vs {iris.feature_names[f2]}',
                       fontsize=12, fontweight='bold')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nObservations:")
print("- Setosa (blue) is clearly separable from the other two classes")
print("- Versicolor (red) and Virginica (green) have some overlap")
print("- Petal measurements seem more discriminative than sepal measurements")


### Building the Multi-Class Neural Network

Now let's build, train, and evaluate a neural network for this 3-class classification problem:

In [None]:
# Split the data
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42, stratify=y_iris
)

# Scale features
scaler_iris = StandardScaler()
X_train_iris_scaled = scaler_iris.fit_transform(X_train_iris)
X_test_iris_scaled = scaler_iris.transform(X_test_iris)

# Build neural network with 2 hidden layers
mlp_iris = MLPClassifier(
    hidden_layer_sizes=(20, 10),  # 20 neurons in first hidden layer, 10 in second
    activation='relu',
    solver='adam',
    max_iter=2000,
    random_state=42,
    verbose=False
)

# Train the model
mlp_iris.fit(X_train_iris_scaled, y_train_iris)

# Make predictions
y_train_pred_iris = mlp_iris.predict(X_train_iris_scaled)
y_test_pred_iris = mlp_iris.predict(X_test_iris_scaled)

# Calculate accuracies
train_acc_iris = accuracy_score(y_train_iris, y_train_pred_iris)
test_acc_iris = accuracy_score(y_test_iris, y_test_pred_iris)

print("Multi-Class Neural Network Results")
print("=" * 50)
print(f"\nNetwork Architecture:")
print(f"- Input Layer: 4 neurons (4 features)")
print(f"- Hidden Layer 1: 20 neurons (ReLU)")
print(f"- Hidden Layer 2: 10 neurons (ReLU)")
print(f"- Output Layer: 3 neurons (3 classes)")
print(f"\nTotal Parameters: {sum([w.size for w in mlp_iris.coefs_]) + sum([b.size for b in mlp_iris.intercepts_])}")
print(f"\nPerformance:")
print(f"- Training Accuracy: {train_acc_iris:.4f} ({train_acc_iris*100:.2f}%)")
print(f"- Test Accuracy: {test_acc_iris:.4f} ({test_acc_iris*100:.2f}%)")
print(f"\nTraining completed in {mlp_iris.n_iter_} iterations")

# Detailed classification report
print(f"\nDetailed Classification Report (Test Set):")
print(classification_report(y_test_iris, y_test_pred_iris, target_names=iris.target_names))


### Analyzing Performance with Confusion Matrix

A confusion matrix helps us understand which classes the model confuses:

In [None]:
# Compute confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

cm = confusion_matrix(y_test_iris, y_test_pred_iris)

# Plot confusion matrix
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
disp.plot(cmap='Blues', ax=ax, values_format='d')
ax.set_title('Confusion Matrix - Iris Classification', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

print("\nConfusion Matrix Analysis:")
print("-" * 50)
for i, class_name in enumerate(iris.target_names):
    correct = cm[i, i]
    total = cm[i, :].sum()
    accuracy = correct / total * 100
    print(f"{class_name:>12}: {correct}/{total} correct ({accuracy:.1f}%)")

# Calculate per-class errors
print(f"\nMisclassification Analysis:")
for i in range(len(iris.target_names)):
    for j in range(len(iris.target_names)):
        if i != j and cm[i, j] > 0:
            print(f"  {cm[i, j]} {iris.target_names[i]} classified as {iris.target_names[j]}")

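
Per-class precision and recall can be read directly off a confusion matrix. The matrix below is a made-up example for illustration only, not the output of the model above. It follows scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows = true, columns = predicted)
cm_example = np.array([
    [15,  0,  0],
    [ 0, 14,  1],
    [ 0,  2, 13],
])

true_positives = np.diag(cm_example)
recall = true_positives / cm_example.sum(axis=1)      # correct / all true members of the class
precision = true_positives / cm_example.sum(axis=0)   # correct / all predictions of the class
accuracy = true_positives.sum() / cm_example.sum()

for i in range(cm_example.shape[0]):
    print(f"class {i}: precision = {precision[i]:.3f}, recall = {recall[i]:.3f}")
print(f"overall accuracy = {accuracy:.3f}")
```

Recall divides by the row sum and precision by the column sum, which is why the two can differ for the same class.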

### Understanding Prediction Probabilities

Neural networks can also output probability estimates for each class. Let's examine some predictions:

In [None]:
# Get probability predictions
y_proba = mlp_iris.predict_proba(X_test_iris_scaled)

# Display predictions for first 10 test samples
print("Sample Predictions with Probabilities:")
print("=" * 80)
print(f"{'Sample':<8} {'True':<12} {'Predicted':<12} {'Confidence':<12} {'Probabilities':<30}")
print("-" * 80)

for i in range(10):
    true_class = iris.target_names[y_test_iris[i]]
    pred_class = iris.target_names[y_test_pred_iris[i]]
    confidence = y_proba[i].max()
    probs = ', '.join([f'{p:.3f}' for p in y_proba[i]])
    
    marker = "✓" if y_test_iris[i] == y_test_pred_iris[i] else "✗"
    print(f"{i+1:<8} {true_class:<12} {pred_class:<12} {confidence:<12.3f} [{probs}] {marker}")

print("=" * 80)

# Visualize probability distribution
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
x_pos = np.arange(len(y_test_iris))
width = 0.25

for class_idx in range(3):
    ax.bar(x_pos + class_idx * width, y_proba[:, class_idx], width,
           label=iris.target_names[class_idx], alpha=0.8)

ax.set_xlabel('Test Sample Index', fontsize=12)
ax.set_ylabel('Probability', fontsize=12)
ax.set_title('Class Probability Predictions for All Test Samples', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print(f"\nAverage prediction confidence: {y_proba.max(axis=1).mean():.4f}")
print(f"Minimum prediction confidence: {y_proba.max(axis=1).min():.4f}")
print(f"Maximum prediction confidence: {y_proba.max(axis=1).max():.4f}")

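
To connect these probabilities back to the theory, the forward pass can be reproduced by hand from the fitted weights. The sketch below assumes ReLU hidden layers and a softmax output, which matches a multi-class `MLPClassifier` trained with `activation='relu'`. The helper `manual_predict_proba` is a hypothetical name, and the model is refitted on the full scaled Iris data so the example is self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

clf = MLPClassifier(hidden_layer_sizes=(20, 10), activation="relu",
                    max_iter=2000, random_state=42).fit(X_scaled, y)

def manual_predict_proba(model, X):
    """Forward pass using the fitted weights (ReLU hidden layers, softmax output)."""
    a = X
    # Hidden layers: scikit-learn stores one sample per row, so a = f(a W + b)
    for W, b in zip(model.coefs_[:-1], model.intercepts_[:-1]):
        a = np.maximum(0, a @ W + b)
    # Output layer: softmax over the class logits
    z = a @ model.coefs_[-1] + model.intercepts_[-1]
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

manual = manual_predict_proba(clf, X_scaled)
print(clf.out_activation_)
print(np.allclose(manual, clf.predict_proba(X_scaled)))
```

The network's prediction is nothing more than the weighted sums and activations described earlier, applied layer by layer.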

## Key Takeaways

Congratulations on completing Day 47! Let's review the essential concepts:

### 1. Feedforward Neural Networks
- FFNNs are composed of layers: input, hidden, and output
- Information flows forward through the network without cycles
- Each neuron computes a weighted sum plus bias, then applies an activation function
- Networks can learn complex, non-linear patterns through multiple layers

### 2. Activation Functions
- **Essential for non-linearity**: Without activation functions, networks are just linear models
- **Sigmoid**: Outputs (0,1), useful for binary output, suffers from vanishing gradients
- **Tanh**: Outputs (-1,1), zero-centered, still has vanishing gradient issues
- **ReLU**: Most popular, simple and effective, can suffer from dying neurons
- **Leaky ReLU**: Addresses dying ReLU problem with small negative slope

### 3. Network Architecture
- Input layer size = number of features
- Hidden layer sizes are hyperparameters to tune
- Output layer size = number of classes (or 1 for binary classification)
- More layers and neurons = more capacity but risk of overfitting

### 4. Best Practices
- Always scale/normalize your input features
- Start with ReLU activation for hidden layers
- Use appropriate output activation (sigmoid for binary, softmax for multi-class)
- Monitor both training and test accuracy to detect overfitting
- Experiment with different architectures and hyperparameters
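
As one way to act on the overfitting advice above, `MLPClassifier` can hold out part of the training data and stop when the validation score stops improving. The sketch below uses the `early_stopping`, `validation_fraction`, and `n_iter_no_change` parameters with arbitrary illustrative values, then compares training and test accuracy.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

clf = MLPClassifier(
    hidden_layer_sizes=(10, 5),
    early_stopping=True,        # hold out part of the training set for validation
    validation_fraction=0.1,    # illustrative value
    n_iter_no_change=10,        # illustrative value
    max_iter=1000,
    random_state=42,
)
clf.fit(X_train_s, y_train)

train_acc = clf.score(X_train_s, y_train)
test_acc = clf.score(X_test_s, y_test)
print(f"Epochs run: {clf.n_iter_}")
print(f"Best validation score: {clf.best_validation_score_:.4f}")
print(f"Train accuracy: {train_acc:.4f}, test accuracy: {test_acc:.4f}")
```

A training accuracy far above the test accuracy is a common sign of overfitting. The per-epoch values in `clf.loss_curve_` and `clf.validation_scores_` can be plotted to see where the two start to diverge.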

### 5. Practical Insights
- Neural networks excel at learning non-linear decision boundaries
- They can achieve high accuracy on complex classification tasks
- Proper preprocessing (feature scaling) is crucial
- Understanding prediction probabilities helps interpret model confidence

## What's Next?

In the coming lessons, we'll dive deeper into:
- **Day 48**: Backpropagation - how neural networks actually learn
- **Day 49**: Training techniques, loss functions, and optimizers
- **Day 50**: Evaluation, regularization, and hyperparameter tuning
- Future weeks: Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)

## Further Resources

To deepen your understanding of feedforward neural networks and activation functions, explore these resources:

### Online Courses and Tutorials
1. **Neural Networks and Deep Learning** by deeplearning.ai on Coursera
   - Comprehensive introduction to neural networks
   - https://www.coursera.org/learn/neural-networks-deep-learning

2. **3Blue1Brown - Neural Networks Playlist**
   - Excellent visual explanations of how neural networks work
   - https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi

3. **Stanford CS231n: Convolutional Neural Networks**
   - Lecture notes on neural networks and activation functions
   - http://cs231n.github.io/neural-networks-1/

### Documentation and References
4. **Scikit-learn MLPClassifier Documentation**
   - Complete API reference and examples
   - https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

5. **Deep Learning Book by Goodfellow, Bengio, and Courville**
   - Chapter 6: Deep Feedforward Networks
   - https://www.deeplearningbook.org/contents/mlp.html

### Research Papers
6. **"Understanding the difficulty of training deep feedforward neural networks"** by Glorot and Bengio (2010)
   - Important insights on initialization and activation functions
   - http://proceedings.mlr.press/v9/glorot10a.html

### Interactive Resources
7. **TensorFlow Playground**
   - Interactive visualization of neural networks
   - https://playground.tensorflow.org/

8. **Neural Network Playground by ConvNetJS**
   - Experiment with different architectures in your browser
   - https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html

### Practice Datasets
9. **UCI Machine Learning Repository**
   - Hundreds of datasets for practice
   - https://archive.ics.uci.edu/ml/index.php

10. **Kaggle Datasets**
    - Real-world datasets and competitions
    - https://www.kaggle.com/datasets

Happy learning! Remember, the best way to master neural networks is through hands-on practice and experimentation.