# Coding Assignment 3: Neural Networks

**Name:** [Your Name Here]  
**Student ID:** [Your Student ID]  
**Date:** [Today's Date]  

## Overview

Welcome to your first deep learning assignment! In this notebook, you'll move from linear models to neural networks by implementing a Multi-Layer Perceptron (MLP) from scratch using NumPy. You'll then compare your implementation with PyTorch, a widely used deep learning framework.

**Learning Goals:**
- Understand neural network architecture and forward propagation
- Implement backpropagation and gradient descent from scratch
- Compare different activation functions and architectures
- Apply neural networks to real classification problems
- Transition from NumPy to PyTorch implementations
- Reflect on when to use neural networks vs simpler models

**Estimated Time:** 2 hours

## Part 1: Mathematical Foundation (30 minutes)

Neural networks extend the linear models you've implemented by adding layers and non-linear activation functions.

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris, load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# PyTorch imports (we'll use these later)
try:
    import torch
    import torch.nn as nn
    import torch.optim as optim
    import torch.nn.functional as F
    PYTORCH_AVAILABLE = True
    print(f"PyTorch version: {torch.__version__}")
except ImportError:
    PYTORCH_AVAILABLE = False
    print("PyTorch not available - you can install it with: pip install torch")

# Set random seeds for reproducibility
np.random.seed(42)
if PYTORCH_AVAILABLE:
    torch.manual_seed(42)

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")

### 1.1 From Linear to Neural Networks

**Linear Model (CA1 & CA2):**
- Single layer: `y = Wx + b`
- Linear decision boundaries
- Limited representation power

**Neural Network:**
- Multiple layers with non-linear activations
- Can learn complex, non-linear patterns
- Universal approximation capability

**Multi-Layer Perceptron (MLP) Architecture:**
```
Input → Hidden Layer 1 → Hidden Layer 2 → ... → Output
  x   →   σ(W₁x + b₁)  →  σ(W₂h₁ + b₂) → ... →  y
```
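
The non-linearity σ in the diagram is what separates an MLP from a linear model: without it, stacked layers collapse into a single linear map. The sketch below illustrates this with NumPy. The shapes, the random weights, and the variable names are assumptions chosen for illustration and are not part of the assignment code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 2 inputs -> 3 hidden -> 1 output
x = rng.normal(size=(5, 2))  # batch of 5 samples
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=3)
W2, b2 = rng.normal(size=(3, 1)), rng.normal(size=1)

# Two stacked layers with NO activation in between
two_layer_linear = (x @ W1 + b1) @ W2 + b2

# The same map written as ONE linear layer
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = x @ W_combined + b_combined

# Without an activation, depth adds nothing
print("linear stack equals single layer:", np.allclose(two_layer_linear, one_layer))

# Inserting a non-linearity (tanh here) breaks the equivalence
two_layer_tanh = np.tanh(x @ W1 + b1) @ W2 + b2
print("tanh stack equals single layer:  ", np.allclose(two_layer_tanh, one_layer))
```

The first check holds for any weights, because a composition of affine maps is itself affine. The second check is expected to fail for generic weights, which is why each hidden layer applies an activation.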

In [None]:
# Let's visualize the difference between linear and non-linear decision boundaries
np.random.seed(42)

# Create a non-linearly separable dataset (XOR-like problem)
n_samples = 200
X_nonlinear = np.random.randn(n_samples, 2)
y_nonlinear = ((X_nonlinear[:, 0] * X_nonlinear[:, 1]) > 0).astype(int)

# Create a linearly separable dataset
X_linear = np.random.randn(n_samples, 2)
X_linear[y_nonlinear == 1] += [1.5, 1.5]  # Shift one class
y_linear = y_nonlinear.copy()

# Plot both datasets
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Linear problem
colors = ['red', 'blue']
for i in range(2):
    mask = y_linear == i
    ax1.scatter(X_linear[mask, 0], X_linear[mask, 1], 
               c=colors[i], alpha=0.6, label=f'Class {i}')
ax1.set_title('Linearly Separable Problem\n(Good for Linear Models)')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Non-linear problem
for i in range(2):
    mask = y_nonlinear == i
    ax2.scatter(X_nonlinear[mask, 0], X_nonlinear[mask, 1], 
               c=colors[i], alpha=0.6, label=f'Class {i}')
ax2.set_title('Non-Linearly Separable Problem\n(Needs a Non-Linear Model)')
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key Insight: Neural networks can solve problems that linear models cannot!")

### 1.2 Activation Functions

Activation functions introduce non-linearity, enabling neural networks to learn complex patterns.

**Common Activation Functions:**
- **Sigmoid**: `σ(x) = 1/(1 + e^(-x))` → Output range (0, 1)
- **Tanh**: `tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))` → Output range (-1, 1)
- **ReLU**: `ReLU(x) = max(0, x)` → Output range [0, ∞)
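
Backpropagation also needs the derivative of each activation, and a finite-difference comparison is a cheap way to catch a wrong derivative. The sketch below is one possible self-check, not the required implementation: `numerical_derivative` and `reference_sigmoid` are hypothetical helper names introduced here, and the analytic formulas are the standard ones for sigmoid and tanh.

```python
import numpy as np

def numerical_derivative(f, x, eps=1e-5):
    """Central finite-difference approximation of f'(x), element-wise."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def reference_sigmoid(v):
    """Plain sigmoid, used only as a reference for this check."""
    return 1.0 / (1.0 + np.exp(-v))

x = np.linspace(-3, 3, 7)

# Analytic derivatives expressed through the function values
analytic_sigmoid = reference_sigmoid(x) * (1 - reference_sigmoid(x))
analytic_tanh = 1 - np.tanh(x) ** 2

sigmoid_err = np.max(np.abs(numerical_derivative(reference_sigmoid, x) - analytic_sigmoid))
tanh_err = np.max(np.abs(numerical_derivative(np.tanh, x) - analytic_tanh))

print(f"max |numeric - analytic|: sigmoid={sigmoid_err:.2e}, tanh={tanh_err:.2e}")
```

The same helper can be pointed at your own activation functions once they are implemented. ReLU is left out on purpose: its derivative is undefined at exactly `x = 0`, so a finite-difference check there is not meaningful.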

In [None]:
def sigmoid(x):
    """
    Sigmoid activation function.
    
    Parameters:
    x (array): Input values
    
    Returns:
    array: Sigmoid of input
    """
    # TODO: Implement sigmoid function: 1 / (1 + exp(-x))
    # Hint: Use np.exp() and clip x to prevent overflow
    x_clipped = np.clip(x, -500, 500)  # Prevent overflow
    return None  # Replace None with your implementation

def sigmoid_derivative(x):
    """
    Derivative of sigmoid function.
    
    Parameters:
    x (array): Input values (pre-activation)
    
    Returns:
    array: Derivative of sigmoid
    """
    # TODO: Implement sigmoid derivative: sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return None  # Replace None with your implementation

def tanh(x):
    """
    Tanh activation function.
    
    Parameters:
    x (array): Input values
    
    Returns:
    array: Tanh of input
    """
    # TODO: Implement tanh function
    # Hint: Use np.tanh() or (exp(x) - exp(-x)) / (exp(x) + exp(-x))
    return None  # Replace None with your implementation

def tanh_derivative(x):
    """
    Derivative of tanh function.
    
    Parameters:
    x (array): Input values (pre-activation)
    
    Returns:
    array: Derivative of tanh
    """
    # TODO: Implement tanh derivative: 1 - tanh(x)^2
    t = tanh(x)
    return None  # Replace None with your implementation

def relu(x):
    """
    ReLU activation function.
    
    Parameters:
    x (array): Input values
    
    Returns:
    array: ReLU of input
    """
    # TODO: Implement ReLU function: max(0, x)
    # Hint: Use np.maximum(0, x)
    return None  # Replace None with your implementation

def relu_derivative(x):
    """
    Derivative of ReLU function.
    
    Parameters:
    x (array): Input values (pre-activation)
    
    Returns:
    array: Derivative of ReLU
    """
    # TODO: Implement ReLU derivative: 1 if x > 0, else 0
    # Hint: Use (x > 0).astype(float)
    return None  # Replace None with your implementation

print("TODO: Implement the activation functions above")

# TODO: Uncomment the visualization code below after implementing the functions
# # Visualize activation functions
# x_range = np.linspace(-5, 5, 100)
# 
# fig, axes = plt.subplots(2, 3, figsize=(18, 10))
# 
# # Activation functions
# axes[0, 0].plot(x_range, sigmoid(x_range), 'b-', linewidth=2, label='Sigmoid')
# axes[0, 0].set_title('Sigmoid Activation')
# axes[0, 0].grid(True, alpha=0.3)
# axes[0, 0].set_ylabel('Output')
# 
# axes[0, 1].plot(x_range, tanh(x_range), 'r-', linewidth=2, label='Tanh')
# axes[0, 1].set_title('Tanh Activation')
# axes[0, 1].grid(True, alpha=0.3)
# 
# axes[0, 2].plot(x_range, relu(x_range), 'g-', linewidth=2, label='ReLU')
# axes[0, 2].set_title('ReLU Activation')
# axes[0, 2].grid(True, alpha=0.3)
# 
# # Derivatives
# axes[1, 0].plot(x_range, sigmoid_derivative(x_range), 'b--', linewidth=2, label='Sigmoid Derivative')
# axes[1, 0].set_title('Sigmoid Derivative')
# axes[1, 0].grid(True, alpha=0.3)
# axes[1, 0].set_xlabel('Input')
# axes[1, 0].set_ylabel('Derivative')
# 
# axes[1, 1].plot(x_range, tanh_derivative(x_range), 'r--', linewidth=2, label='Tanh Derivative')
# axes[1, 1].set_title('Tanh Derivative')
# axes[1, 1].grid(True, alpha=0.3)
# axes[1, 1].set_xlabel('Input')
# 
# axes[1, 2].plot(x_range, relu_derivative(x_range), 'g--', linewidth=2, label='ReLU Derivative')
# axes[1, 2].set_title('ReLU Derivative')
# axes[1, 2].grid(True, alpha=0.3)
# axes[1, 2].set_xlabel('Input')
# 
# plt.tight_layout()
# plt.show()
# 
# print("Key Properties:")
# print("• Sigmoid: Smooth, bounded in (0, 1), but saturates and can cause vanishing gradients")
# print("• Tanh: Smooth, bounded in (-1, 1), zero-centered, but also saturates")
# print("• ReLU: Simple, unbounded above, gradient is 1 for x > 0, which mitigates vanishing gradients")

### 1.3 Forward Propagation

Forward propagation computes the output of a neural network given an input.

**For a 2-layer network:**
1. **Hidden Layer**: `h = σ(W₁x + b₁)`
2. **Output Layer**: `y = σ(W₂h + b₂)`

**Matrix Dimensions:**
- Input: `(batch_size, input_features)`
- W₁: `(input_features, hidden_size)`
- b₁: `(hidden_size,)`
- W₂: `(hidden_size, output_size)`
- b₂: `(output_size,)`
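
The dimensions listed above can be traced with a small shape check. This is a sketch under assumed sizes (batch of 4, 2 inputs, 3 hidden units, 1 output) with constant placeholder weights. It uses tanh for the hidden layer and an inline sigmoid for the output only to show that element-wise activations leave shapes unchanged.

```python
import numpy as np

batch_size, input_features, hidden_size, output_size = 4, 2, 3, 1

X = np.ones((batch_size, input_features))
W1 = np.full((input_features, hidden_size), 0.1)
b1 = np.zeros(hidden_size)
W2 = np.full((hidden_size, output_size), 0.1)
b2 = np.zeros(output_size)

z1 = X @ W1 + b1                   # (4, 2) @ (2, 3) -> (4, 3); b1 (3,) broadcasts over the 4 rows
h = np.tanh(z1)                    # element-wise, shape stays (4, 3)
z2 = h @ W2 + b2                   # (4, 3) @ (3, 1) -> (4, 1)
y_hat = 1.0 / (1.0 + np.exp(-z2))  # element-wise sigmoid, shape stays (4, 1)

print(z1.shape, h.shape, z2.shape, y_hat.shape)  # (4, 3) (4, 3) (4, 1) (4, 1)
```

Because the output has shape `(batch_size, 1)`, labels should be reshaped to the same shape before computing `y_hat - y`. Subtracting a flat `(batch_size,)` label vector would silently broadcast to `(batch_size, batch_size)`.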

In [None]:
def forward_pass_example():
    """
    Example of forward propagation through a simple 2-layer network.
    """
    # Network architecture: 2 inputs → 3 hidden → 1 output
    
    # Sample input (batch_size=4, input_features=2)
    X = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [0.5, 1.5],
                  [1.5, 0.5]])
    
    # Initialize random weights and biases
    np.random.seed(42)
    W1 = np.random.randn(2, 3) * 0.5  # (input_size, hidden_size)
    b1 = np.random.randn(3) * 0.5      # (hidden_size,)
    W2 = np.random.randn(3, 1) * 0.5   # (hidden_size, output_size)
    b2 = np.random.randn(1) * 0.5      # (output_size,)
    
    print(f"Input X shape: {X.shape}")
    print(f"W1 shape: {W1.shape}, b1 shape: {b1.shape}")
    print(f"W2 shape: {W2.shape}, b2 shape: {b2.shape}")
    
    # TODO: Implement forward pass
    # Step 1: Compute hidden layer pre-activation
    z1 = None  # Replace None with X @ W1 + b1
    
    # Step 2: Apply activation function to get hidden layer output
    # Use tanh activation (implement tanh first above!)
    h1 = None  # Replace None with activation function
    
    # Step 3: Compute output layer pre-activation
    z2 = None  # Replace None with h1 @ W2 + b2
    
    # Step 4: Apply activation function to get final output
    # Use sigmoid for binary classification
    output = None  # Replace None with activation function
    
    # TODO: Uncomment the print statements after implementation
    # print(f"\nForward pass results:")
    # print(f"Hidden layer output shape: {h1.shape}")
    # print(f"Final output shape: {output.shape}")
    # print(f"Sample outputs: {output.flatten()[:3]}")
    
    return X, (W1, b1, W2, b2), (z1, h1, z2, output)

# Run the example
print("Forward Propagation Example:")
example_data = forward_pass_example()
print("TODO: Complete the forward pass implementation above")

## Part 2: From-Scratch MLP Implementation (45 minutes)

Now let's build a complete Multi-Layer Perceptron class with forward propagation, backpropagation, and training!
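
Backpropagation in the class below starts from the output-layer error `y_pred - y_true`, which is the gradient of binary cross-entropy with respect to the pre-activation when the output uses a sigmoid. The following sketch checks that identity numerically. The helper names `sigmoid_ref` and `bce_from_logit` and the sample values are assumptions made for this illustration.

```python
import numpy as np

def sigmoid_ref(z):
    """Plain sigmoid, used only as a reference for this check."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logit(z, y):
    """Per-sample binary cross-entropy as a function of the pre-activation z."""
    p = sigmoid_ref(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z = np.array([-2.0, -0.5, 0.3, 1.7])  # pre-activations of the output unit
y = np.array([0.0, 1.0, 1.0, 0.0])    # true labels

# Central finite difference of the loss with respect to z
eps = 1e-6
numeric_grad = (bce_from_logit(z + eps, y) - bce_from_logit(z - eps, y)) / (2 * eps)

# Closed form used as the starting delta in backpropagation
analytic_grad = sigmoid_ref(z) - y

print("max difference:", np.max(np.abs(numeric_grad - analytic_grad)))
```

The sigmoid derivative cancels against the derivative of the log terms, so no separate `sigmoid_derivative` factor appears at the output layer. Hidden layers do need the activation derivative.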

In [None]:
class MLP:
    """
    Multi-Layer Perceptron implementation from scratch.
    """
    
    def __init__(self, input_size, hidden_sizes, output_size, activation='relu'):
        """
        Initialize the MLP.
        
        Parameters:
        input_size (int): Number of input features
        hidden_sizes (list): List of hidden layer sizes [h1, h2, ...]
        output_size (int): Number of output classes
        activation (str): Activation function ('relu', 'tanh', 'sigmoid')
        """
        self.input_size = input_size
        self.hidden_sizes = hidden_sizes
        self.output_size = output_size
        self.activation = activation
        
        # Set activation functions
        self._set_activation_functions()
        
        # Initialize weights and biases
        self.weights = []
        self.biases = []
        self._initialize_parameters()
        
        # Training history
        self.loss_history = []
        
    def _set_activation_functions(self):
        """Set activation function and its derivative."""
        if self.activation == 'relu':
            self.activation_func = relu
            self.activation_derivative = relu_derivative
        elif self.activation == 'tanh':
            self.activation_func = tanh
            self.activation_derivative = tanh_derivative
        elif self.activation == 'sigmoid':
            self.activation_func = sigmoid
            self.activation_derivative = sigmoid_derivative
        else:
            raise ValueError(f"Unknown activation: {self.activation}")
    
    def _initialize_parameters(self):
        """Initialize weights and biases using Xavier initialization."""
        # Create layer sizes list: [input, hidden1, hidden2, ..., output]
        layer_sizes = [self.input_size] + self.hidden_sizes + [self.output_size]
        
        for i in range(len(layer_sizes) - 1):
            # TODO: Initialize weights using Xavier (Glorot) initialization
            # weights ~ Normal(mean=0, std=sqrt(2 / (fan_in + fan_out)))
            fan_in = layer_sizes[i]
            fan_out = layer_sizes[i + 1]
            
            # Replace None with proper weight initialization
            weight = None  # np.random.normal(0, np.sqrt(2 / (fan_in + fan_out)), (fan_in, fan_out))
            bias = None    # np.zeros(fan_out)
            
            self.weights.append(weight)
            self.biases.append(bias)
    
    def forward(self, X):
        """
        Forward propagation through the network.
        
        Parameters:
        X (array): Input data of shape (batch_size, input_size)
        
        Returns:
        tuple: (final_output, activations, pre_activations)
        """
        activations = [X]  # Store all activations for backprop
        pre_activations = []  # Store pre-activation values for backprop
        
        current_input = X
        
        # Forward pass through all layers
        for i, (weight, bias) in enumerate(zip(self.weights, self.biases)):
            # TODO: Compute pre-activation (linear transformation)
            z = None  # Replace None with current_input @ weight + bias
            pre_activations.append(z)
            
            # Apply activation function
            if i < len(self.weights) - 1:  # Hidden layers
                a = self.activation_func(z)
            else:  # Output layer (use sigmoid for binary classification)
                a = sigmoid(z)
            
            activations.append(a)
            current_input = a
        
        return activations[-1], activations, pre_activations
    
    def backward(self, X, y, activations, pre_activations):
        """
        Backward propagation (compute gradients).
        
        Parameters:
        X (array): Input data
        y (array): True labels
        activations (list): Activations from forward pass
        pre_activations (list): Pre-activations from forward pass
        
        Returns:
        tuple: (weight_gradients, bias_gradients)
        """
        m = X.shape[0]  # batch size
        weight_grads = []
        bias_grads = []
        
        # Start with output layer error
        # For binary cross-entropy with sigmoid: dL/dz = (y_pred - y_true)
        delta = activations[-1] - y
        
        # Backpropagate through all layers (reverse order)
        for i in range(len(self.weights) - 1, -1, -1):
            # TODO: Compute gradients for current layer
            # Weight gradient: dL/dW = (1/m) * a_prev^T @ delta, where a_prev = activations[i]
            dW = None  # Replace None with gradient computation
            
            # Bias gradient: dL/db = (1/m) * sum(delta, axis=0)
            db = None  # Replace None with gradient computation
            
            weight_grads.append(dW)
            bias_grads.append(db)
            
            # Compute delta for previous layer (if not input layer)
            if i > 0:
                # TODO: Backpropagate error to previous layer
                # delta_prev = (delta @ W^T) * activation_derivative(z_prev),
                # where W = self.weights[i] and z_prev = pre_activations[i - 1]
                delta_prev = None  # Replace None with backpropagation computation
                delta = delta_prev
        
        # Reverse gradients to match forward order
        weight_grads.reverse()
        bias_grads.reverse()
        
        return weight_grads, bias_grads
    
    def update_parameters(self, weight_grads, bias_grads, learning_rate):
        """Update weights and biases using gradients."""
        for i in range(len(self.weights)):
            # TODO: Update weights and biases
            # weight -= learning_rate * gradient
            self.weights[i] -= None  # Replace None with update rule
            self.biases[i] -= None   # Replace None with update rule
    
    def compute_loss(self, y_true, y_pred):
        """
        Compute binary cross-entropy loss.
        
        Parameters:
        y_true (array): True labels
        y_pred (array): Predicted probabilities
        
        Returns:
        float: Average loss
        """
        # TODO: Implement binary cross-entropy loss, averaged over all samples
        # BCE = -mean(y*log(p) + (1-y)*log(1-p))
        # Predictions are clipped below to prevent log(0)
        epsilon = 1e-7
        y_pred_clipped = np.clip(y_pred, epsilon, 1 - epsilon)
        
        loss = None  # Replace None with loss computation
        return loss
    
    def fit(self, X, y, epochs=1000, learning_rate=0.01, batch_size=32, verbose=True):
        """
        Train the neural network.
        
        Parameters:
        X (array): Training data
        y (array): Training labels
        epochs (int): Number of training epochs
        learning_rate (float): Learning rate
        batch_size (int): Mini-batch size
        verbose (bool): Print training progress
        """
        self.loss_history = []
        n_samples = X.shape[0]
        
        for epoch in range(epochs):
            # Shuffle data for each epoch
            indices = np.random.permutation(n_samples)
            X_shuffled = X[indices]
            y_shuffled = y[indices]
            
            epoch_loss = 0
            
            # Mini-batch training
            for i in range(0, n_samples, batch_size):
                batch_X = X_shuffled[i:i+batch_size]
                batch_y = y_shuffled[i:i+batch_size]
                
                # Forward pass
                output, activations, pre_activations = self.forward(batch_X)
                
                # Compute loss
                batch_loss = self.compute_loss(batch_y, output)
                epoch_loss += batch_loss * len(batch_X)
                
                # Backward pass
                weight_grads, bias_grads = self.backward(batch_X, batch_y, activations, pre_activations)
                
                # Update parameters
                self.update_parameters(weight_grads, bias_grads, learning_rate)
            
            # Record average epoch loss
            avg_loss = epoch_loss / n_samples
            self.loss_history.append(avg_loss)
            
            # Print progress
            if verbose and (epoch + 1) % max(1, epochs // 10) == 0:
                print(f"Epoch {epoch + 1:4d}/{epochs}: Loss = {avg_loss:.6f}")
    
    def predict(self, X):
        """
        Make predictions on new data.
        
        Parameters:
        X (array): Input data
        
        Returns:
        array: Predicted class probabilities
        """
        output, _, _ = self.forward(X)
        return output
    
    def predict_classes(self, X, threshold=0.5):
        """
        Predict class labels.
        
        Parameters:
        X (array): Input data
        threshold (float): Classification threshold
        
        Returns:
        array: Predicted class labels
        """
        probabilities = self.predict(X)
        return (probabilities >= threshold).astype(int)

print("MLP class defined!")
print("TODO: Complete the missing implementations in the methods above")

### 2.1 Test Your MLP Implementation

Let's test your implementation on the non-linearly separable dataset we created earlier.
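
For a reference point, it helps to know how a purely linear classifier fares on this kind of data. The sketch below regenerates XOR-like data with the same labelling rule and fits a least-squares linear classifier. This is a stand-in baseline chosen here for simplicity (not a model from the assignment), and the sample size and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same labelling rule as the XOR-like dataset: class 1 where x1 * x2 > 0
X = rng.normal(size=(1000, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Linear baseline: least-squares fit of targets in {-1, +1} on [x1, x2, 1]
A = np.c_[X, np.ones(len(X))]
targets = 2 * y - 1
coef, *_ = np.linalg.lstsq(A, targets, rcond=None)

# Classify by the sign of the fitted linear function
linear_pred = (A @ coef > 0).astype(int)
linear_acc = (linear_pred == y).mean()

print(f"Linear baseline accuracy: {linear_acc:.3f}")
```

A single straight line cannot separate the four quadrants, so the baseline should land near chance level. A working MLP is expected to do clearly better on the same kind of data.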

In [None]:
# TODO: Uncomment and run this code after implementing the MLP class

# # Test on the XOR-like problem
# print("Testing MLP on Non-Linear Problem:")
# 
# # Prepare data (X_test and y_test hold the full dataset here; it is split into train/validation below)
# X_test = X_nonlinear
# y_test = y_nonlinear.reshape(-1, 1)  # Reshape for compatibility
# 
# # Split data
# X_train, X_val, y_train, y_val = train_test_split(X_test, y_test, test_size=0.2, random_state=42)
# 
# # Scale features
# scaler = StandardScaler()
# X_train_scaled = scaler.fit_transform(X_train)
# X_val_scaled = scaler.transform(X_val)
# 
# print(f"Training set: {X_train_scaled.shape}, Validation set: {X_val_scaled.shape}")
# 
# # Create and train MLP
# mlp = MLP(input_size=2, hidden_sizes=[8, 4], output_size=1, activation='relu')
# print(f"\nMLP Architecture: 2 → {mlp.hidden_sizes} → 1")
# 
# # Train the model
# mlp.fit(X_train_scaled, y_train, epochs=1000, learning_rate=0.1, batch_size=16, verbose=True)
# 
# # Make predictions
# train_pred = mlp.predict_classes(X_train_scaled)
# val_pred = mlp.predict_classes(X_val_scaled)
# 
# # Calculate accuracy
# train_accuracy = accuracy_score(y_train, train_pred)
# val_accuracy = accuracy_score(y_val, val_pred)
# 
# print(f"\nResults:")
# print(f"Training Accuracy: {train_accuracy:.4f}")
# print(f"Validation Accuracy: {val_accuracy:.4f}")

print("TODO: Complete the MLP implementation first, then uncomment the code above")

In [None]:
# TODO: Uncomment this visualization after implementing and testing the MLP

# # Visualize training progress and decision boundary
# fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# 
# # Plot 1: Loss curve
# axes[0].plot(mlp.loss_history, 'b-', linewidth=2)
# axes[0].set_title('Training Loss')
# axes[0].set_xlabel('Epoch')
# axes[0].set_ylabel('Loss')
# axes[0].grid(True, alpha=0.3)
# 
# # Plot 2: Decision boundary
# def plot_decision_boundary(model, X, y, scaler, ax, title):
#     """Plot decision boundary for 2D data."""
#     h = 0.01
#     x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
#     y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
#     xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
#                          np.arange(y_min, y_max, h))
#     
#     grid_points = np.c_[xx.ravel(), yy.ravel()]
#     grid_points_scaled = scaler.transform(grid_points)
#     Z = model.predict(grid_points_scaled)
#     Z = Z.reshape(xx.shape)
#     
#     ax.contourf(xx, yy, Z, levels=50, alpha=0.6, cmap='RdYlBu')
#     colors = ['red', 'blue']
#     for i in range(2):
#         mask = y.flatten() == i
#         ax.scatter(X[mask, 0], X[mask, 1], c=colors[i], alpha=0.8, label=f'Class {i}')
#     ax.set_title(title)
#     ax.legend()
#     ax.grid(True, alpha=0.3)
# 
# plot_decision_boundary(mlp, X_test, y_test, scaler, axes[1], 'MLP Decision Boundary')
# 
# # Plot 3: Confusion matrix
# from sklearn.metrics import confusion_matrix
# cm = confusion_matrix(y_val, val_pred)
# sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[2])
# axes[2].set_title('Confusion Matrix')
# axes[2].set_xlabel('Predicted')
# axes[2].set_ylabel('Actual')
# 
# plt.tight_layout()
# plt.show()

print("TODO: Visualization will be available after MLP implementation")

## Part 3: Real-World Application (30 minutes)

Now let's apply your neural network to a real dataset!

In [None]:
# Load a real dataset
print("Loading Wine Dataset...")
wine_data = load_wine()
X_wine = wine_data.data
y_wine = wine_data.target

print(f"Dataset shape: {X_wine.shape}")
print(f"Number of classes: {len(np.unique(y_wine))}")
print(f"Feature names: {wine_data.feature_names[:5]}... (showing first 5)")
print(f"Class distribution: {np.bincount(y_wine)}")

# For binary classification, let's use classes 0 vs 1
binary_mask = y_wine != 2  # Remove class 2
X_binary = X_wine[binary_mask]
y_binary = y_wine[binary_mask]

print(f"\nBinary classification dataset:")
print(f"Shape: {X_binary.shape}")
print(f"Class distribution: {np.bincount(y_binary)}")

In [None]:
# Data preprocessing and splitting
X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(
    X_binary, y_binary, test_size=0.2, random_state=42, stratify=y_binary
)

# Scale features
wine_scaler = StandardScaler()
X_train_wine_scaled = wine_scaler.fit_transform(X_train_wine)
X_test_wine_scaled = wine_scaler.transform(X_test_wine)

# Reshape targets for compatibility
y_train_wine = y_train_wine.reshape(-1, 1)
y_test_wine = y_test_wine.reshape(-1, 1)

print(f"Training set: {X_train_wine_scaled.shape}")
print(f"Test set: {X_test_wine_scaled.shape}")
print(f"Feature ranges after scaling:")
print(f"  Min: {X_train_wine_scaled.min():.3f}")
print(f"  Max: {X_train_wine_scaled.max():.3f}")
print(f"  Mean: {X_train_wine_scaled.mean():.3f}")
print(f"  Std: {X_train_wine_scaled.std():.3f}")

### 3.1 Experiment with Different Architectures
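
When comparing architectures it is useful to know how many trainable parameters each one has. The sketch below counts weights and biases for fully connected networks with 13 inputs (the number of Wine features) and one output. `count_parameters` is a hypothetical helper written for this illustration, and the hidden sizes are the four configurations compared in this section.

```python
def count_parameters(input_size, hidden_sizes, output_size):
    """Total number of weights and biases in a fully connected network."""
    layer_sizes = [input_size] + list(hidden_sizes) + [output_size]
    # Each layer has fan_in * fan_out weights plus fan_out biases
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]))

configs = {
    "Shallow": [16],
    "Deep Narrow": [8, 8],
    "Wide": [32],
    "Deep Wide": [32, 16, 8],
}

param_totals = {name: count_parameters(13, hidden, 1) for name, hidden in configs.items()}

for name, total in param_totals.items():
    print(f"{name:12s} {total}")
# Shallow      241
# Deep Narrow  193
# Wide         481
# Deep Wide    1121
```

Note that depth and parameter count are not the same thing: the two-layer `[8, 8]` network has fewer parameters than the single-layer `[16]` network.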

In [None]:
# TODO: Uncomment and complete this section after implementing the MLP

# # Test different architectures
# architectures = {
#     'Shallow': [16],
#     'Deep Narrow': [8, 8],
#     'Wide': [32],
#     'Deep Wide': [32, 16, 8]
# }
# 
# results = {}
# 
# print("Testing Different Architectures:")
# print("=" * 50)
# 
# for name, hidden_sizes in architectures.items():
#     print(f"\nTraining {name} network: {X_train_wine_scaled.shape[1]} → {hidden_sizes} → 1")
#     
#     # TODO: Create and train MLP with current architecture
#     mlp_arch = MLP(
#         input_size=X_train_wine_scaled.shape[1],
#         hidden_sizes=hidden_sizes,
#         output_size=1,
#         activation='relu'
#     )
#     
#     # Train the model
#     mlp_arch.fit(X_train_wine_scaled, y_train_wine, 
#                  epochs=500, learning_rate=0.01, batch_size=16, verbose=False)
#     
#     # Evaluate
#     train_pred = mlp_arch.predict_classes(X_train_wine_scaled)
#     test_pred = mlp_arch.predict_classes(X_test_wine_scaled)
#     
#     train_acc = accuracy_score(y_train_wine, train_pred)
#     test_acc = accuracy_score(y_test_wine, test_pred)
#     
#     results[name] = {
#         'train_accuracy': train_acc,
#         'test_accuracy': test_acc,
#         'final_loss': mlp_arch.loss_history[-1],
#         'parameters': sum(w.size for w in mlp_arch.weights) + sum(b.size for b in mlp_arch.biases)
#     }
#     
#     print(f"  Train Accuracy: {train_acc:.4f}")
#     print(f"  Test Accuracy:  {test_acc:.4f}")
#     print(f"  Parameters:     {results[name]['parameters']}")
# 
# # TODO: Create comparison visualization
# # Uncomment the visualization code below

print("TODO: Complete MLP implementation first, then uncomment architecture comparison")

In [None]:
# TODO: Uncomment this visualization after running the architecture comparison

# # Visualize architecture comparison
# fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# 
# names = list(results.keys())
# train_accs = [results[name]['train_accuracy'] for name in names]
# test_accs = [results[name]['test_accuracy'] for name in names]
# param_counts = [results[name]['parameters'] for name in names]
# 
# # Plot 1: Accuracy comparison
# x = np.arange(len(names))
# width = 0.35
# 
# axes[0].bar(x - width/2, train_accs, width, label='Train', alpha=0.8)
# axes[0].bar(x + width/2, test_accs, width, label='Test', alpha=0.8)
# axes[0].set_xlabel('Architecture')
# axes[0].set_ylabel('Accuracy')
# axes[0].set_title('Accuracy by Architecture')
# axes[0].set_xticks(x)
# axes[0].set_xticklabels(names, rotation=45)
# axes[0].legend()
# axes[0].grid(True, alpha=0.3)
# 
# # Plot 2: Parameters vs Performance
# axes[1].scatter(param_counts, test_accs, s=100, alpha=0.7)
# for i, name in enumerate(names):
#     axes[1].annotate(name, (param_counts[i], test_accs[i]), 
#                     xytext=(5, 5), textcoords='offset points')
# axes[1].set_xlabel('Number of Parameters')
# axes[1].set_ylabel('Test Accuracy')
# axes[1].set_title('Model Complexity vs Performance')
# axes[1].grid(True, alpha=0.3)
# 
# # Plot 3: Overfitting analysis
# overfitting = [train_accs[i] - test_accs[i] for i in range(len(names))]
# colors = ['green' if x < 0.05 else 'orange' if x < 0.1 else 'red' for x in overfitting]
# axes[2].bar(names, overfitting, color=colors, alpha=0.7)
# axes[2].set_xlabel('Architecture')
# axes[2].set_ylabel('Train - Test Accuracy')
# axes[2].set_title('Overfitting Analysis')
# axes[2].tick_params(axis='x', labelrotation=45)
# axes[2].grid(True, alpha=0.3)
# axes[2].axhline(y=0.05, color='orange', linestyle='--', alpha=0.7, label='Moderate Overfitting')
# axes[2].axhline(y=0.1, color='red', linestyle='--', alpha=0.7, label='High Overfitting')
# axes[2].legend()
# 
# plt.tight_layout()
# plt.show()
# 
# # Summary
# best_arch = max(results.keys(), key=lambda x: results[x]['test_accuracy'])
# print(f"\nBest Architecture: {best_arch}")
# print(f"Test Accuracy: {results[best_arch]['test_accuracy']:.4f}")
# print(f"Parameters: {results[best_arch]['parameters']}")

print("TODO: Architecture visualization will be available after implementation")

### 3.2 Activation Function Comparison

In [None]:
# TODO: Uncomment and complete this section after implementing the MLP

# # Compare different activation functions
# activations = ['relu', 'tanh', 'sigmoid']
# activation_results = {}
# 
# print("Comparing Activation Functions:")
# print("=" * 40)
# 
# for activation in activations:
#     print(f"\nTesting {activation.upper()} activation...")
#     
#     # TODO: Create MLP with current activation function
#     mlp_act = MLP(
#         input_size=X_train_wine_scaled.shape[1],
#         hidden_sizes=[16, 8],  # Use consistent architecture
#         output_size=1,
#         activation=activation
#     )
#     
#     # Train the model
#     mlp_act.fit(X_train_wine_scaled, y_train_wine,
#                 epochs=500, learning_rate=0.01, batch_size=16, verbose=False)
#     
#     # Evaluate
#     test_pred = mlp_act.predict_classes(X_test_wine_scaled)
#     test_acc = accuracy_score(y_test_wine, test_pred)
#     
#     activation_results[activation] = {
#         'accuracy': test_acc,
#         'loss_history': mlp_act.loss_history
#     }
#     
#     print(f"  Test Accuracy: {test_acc:.4f}")
#     print(f"  Final Loss: {mlp_act.loss_history[-1]:.6f}")
# 
# # TODO: Visualize activation function comparison
# # Uncomment the visualization code below

print("TODO: Complete MLP implementation first, then uncomment activation comparison")

In [None]:
# TODO: Uncomment this visualization after running the activation comparison

# # Visualize activation function comparison
# fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
# 
# # Plot 1: Learning curves
# colors = ['red', 'green', 'blue']
# for i, (activation, data) in enumerate(activation_results.items()):
#     ax1.plot(data['loss_history'], color=colors[i], linewidth=2, 
#             label=f'{activation.upper()} (Acc: {data["accuracy"]:.3f})', alpha=0.8)
# 
# ax1.set_xlabel('Epoch')
# ax1.set_ylabel('Loss')
# ax1.set_title('Learning Curves by Activation Function')
# ax1.legend()
# ax1.grid(True, alpha=0.3)
# ax1.set_yscale('log')
# 
# # Plot 2: Final accuracy comparison
# activations_list = list(activation_results.keys())
# accuracies = [activation_results[act]['accuracy'] for act in activations_list]
# 
# bars = ax2.bar(activations_list, accuracies, color=colors, alpha=0.7)
# ax2.set_xlabel('Activation Function')
# ax2.set_ylabel('Test Accuracy')
# ax2.set_title('Test Accuracy by Activation Function')
# ax2.grid(True, alpha=0.3)
# 
# # Add accuracy values on top of bars
# for bar, acc in zip(bars, accuracies):
#     ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
#             f'{acc:.3f}', ha='center', va='bottom', fontweight='bold')
# 
# plt.tight_layout()
# plt.show()
# 
# # Find best activation
# best_activation = max(activation_results.keys(), key=lambda x: activation_results[x]['accuracy'])
# print(f"\nBest Activation Function: {best_activation.upper()}")
# print(f"Test Accuracy: {activation_results[best_activation]['accuracy']:.4f}")

print("TODO: Activation function visualization will be available after implementation")

## Part 4: PyTorch Implementation (20 minutes)

Now let's see how easy it is to implement the same network using PyTorch!

In [None]:
if not PYTORCH_AVAILABLE:
    print("PyTorch not available. Please install it with: pip install torch")
    print("Skipping PyTorch section...")
else:
    print("PyTorch is available! Let's implement the same MLP.")
    
    class PyTorchMLP(nn.Module):
        """
        PyTorch implementation of MLP.
        """
        
        def __init__(self, input_size, hidden_sizes, output_size, activation='relu'):
            super(PyTorchMLP, self).__init__()
            
            # TODO: Create layers using nn.ModuleList()
            layers = []
            
            # Input to first hidden layer
            layers.append(nn.Linear(input_size, hidden_sizes[0]))
            
            # Hidden layers
            for i in range(len(hidden_sizes) - 1):
                layers.append(nn.Linear(hidden_sizes[i], hidden_sizes[i + 1]))
            
            # Last hidden to output
            layers.append(nn.Linear(hidden_sizes[-1], output_size))
            
            self.layers = nn.ModuleList(layers)
            
            # Set activation function
            if activation == 'relu':
                self.activation = F.relu
            elif activation == 'tanh':
                self.activation = torch.tanh
            elif activation == 'sigmoid':
                self.activation = torch.sigmoid
            else:
                raise ValueError(f"Unknown activation: {activation}")
        
        def forward(self, x):
            """
            Forward pass through the network.
            
            Parameters:
            x (tensor): Input data
            
            Returns:
            tensor: Output predictions
            """
            # TODO: Implement forward pass
            for i, layer in enumerate(self.layers):
                x = layer(x)
                
                # Apply activation function (except for output layer)
                if i < len(self.layers) - 1:  # Hidden layers
                    x = self.activation(x)
                else:  # Output layer
                    x = torch.sigmoid(x)  # Sigmoid for binary classification
            
            return x
    
    print("PyTorch MLP class defined!")

In [None]:
if PYTORCH_AVAILABLE:
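    # Convert data to PyTorch tensors
    # Labels are reshaped to (N, 1) to match the model's output shape:
    # nn.BCELoss expects predictions and targets of the same shape, and comparing
    # (N, 1) predictions with (N,) labels would broadcast to an (N, N) matrix
    X_train_torch = torch.FloatTensor(X_train_wine_scaled)
    y_train_torch = torch.FloatTensor(y_train_wine).reshape(-1, 1)
    X_test_torch = torch.FloatTensor(X_test_wine_scaled)
    y_test_torch = torch.FloatTensor(y_test_wine).reshape(-1, 1)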
    
    print(f"Converted to PyTorch tensors:")
    print(f"X_train shape: {X_train_torch.shape}")
    print(f"y_train shape: {y_train_torch.shape}")
    
    # Create PyTorch model
    pytorch_mlp = PyTorchMLP(
        input_size=X_train_torch.shape[1],
        hidden_sizes=[16, 8],
        output_size=1,
        activation='relu'
    )
    
    print(f"\nPyTorch Model Architecture:")
    print(pytorch_mlp)
    
    # Count parameters
    total_params = sum(p.numel() for p in pytorch_mlp.parameters())
    print(f"\nTotal parameters: {total_params}")
    
    # TODO: Set up training components
    criterion = nn.BCELoss()  # Binary Cross Entropy Loss
    optimizer = optim.Adam(pytorch_mlp.parameters(), lr=0.01)
    
    print(f"\nTraining setup:")
    print(f"Loss function: {criterion}")
    print(f"Optimizer: {optimizer}")
else:
    print("Skipping PyTorch implementation...")
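
The `Total parameters` figure printed by the cell above can be checked by hand: each `nn.Linear(n_in, n_out)` layer holds `n_in * n_out` weights plus `n_out` biases. The sketch below does that arithmetic. `count_mlp_params` is a hypothetical helper (not part of PyTorch), and the input size of 13 assumes that all 13 features of scikit-learn's wine dataset are used.

```python
def count_mlp_params(input_size, hidden_sizes, output_size):
    """Count weights and biases of a fully connected network (hypothetical helper)."""
    sizes = [input_size] + list(hidden_sizes) + [output_size]
    # Each layer contributes n_in * n_out weights plus n_out biases
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))

# Assumed input size: 13 features; hidden sizes and output size match the cell above
print(count_mlp_params(13, [16, 8], 1))  # 224 + 136 + 9 = 369
```

If the number PyTorch reports differs from your hand count, the layer sizes are probably not what you intended.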

In [None]:
if PYTORCH_AVAILABLE:
    # Train PyTorch model
    print("Training PyTorch MLP...")
    
    epochs = 500
    batch_size = 16
    pytorch_losses = []
    
    # Create data loader
    from torch.utils.data import TensorDataset, DataLoader
    train_dataset = TensorDataset(X_train_torch, y_train_torch)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    
    pytorch_mlp.train()  # Set to training mode
    
    for epoch in range(epochs):
        epoch_loss = 0.0
        
        for batch_X, batch_y in train_loader:
            # TODO: Implement training loop
            # 1. Zero gradients
            optimizer.zero_grad()
            
            # 2. Forward pass
            outputs = pytorch_mlp(batch_X)
            
            # 3. Compute loss
            loss = criterion(outputs, batch_y)
            
            # 4. Backward pass
            loss.backward()
            
            # 5. Update parameters
            optimizer.step()
            
            epoch_loss += loss.item() * batch_X.size(0)
        
        # Record average epoch loss
        avg_loss = epoch_loss / len(train_dataset)
        pytorch_losses.append(avg_loss)
        
        # Print progress roughly ten times per run
        # (max(1, ...) avoids a modulo-by-zero error when epochs < 10)
        if (epoch + 1) % max(1, epochs // 10) == 0:
            print(f"Epoch {epoch + 1:4d}/{epochs}: Loss = {avg_loss:.6f}")
    
    print("\nPyTorch training completed!")
else:
    print("Skipping PyTorch training...")

In [None]:
if PYTORCH_AVAILABLE:
    # Evaluate PyTorch model
    pytorch_mlp.eval()  # Set to evaluation mode
    
    with torch.no_grad():
        # Get predictions
        train_pred_torch = pytorch_mlp(X_train_torch)
        test_pred_torch = pytorch_mlp(X_test_torch)
        
        # Convert to class predictions
        train_pred_classes = (train_pred_torch >= 0.5).float()
        test_pred_classes = (test_pred_torch >= 0.5).float()
        
        # Calculate accuracy
        train_acc_torch = (train_pred_classes == y_train_torch).float().mean().item()
        test_acc_torch = (test_pred_classes == y_test_torch).float().mean().item()
    
    print(f"PyTorch Results:")
    print(f"Training Accuracy: {train_acc_torch:.4f}")
    print(f"Test Accuracy: {test_acc_torch:.4f}")
    print(f"Final Loss: {pytorch_losses[-1]:.6f}")
    
    # TODO: Compare with your NumPy implementation
    # Uncomment after implementing your MLP
    # print(f"\nComparison with NumPy Implementation:")
    # print(f"NumPy Test Accuracy: {test_accuracy:.4f}")
    # print(f"PyTorch Test Accuracy: {test_acc_torch:.4f}")
    # print(f"Difference: {abs(test_accuracy - test_acc_torch):.4f}")
else:
    print("PyTorch not available for comparison.")
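
Before trusting these accuracy numbers, check that predictions and labels have the same shape. The model returns predictions of shape `(N, 1)`. If the labels have shape `(N,)`, the `==` comparison broadcasts to an `(N, N)` matrix instead of comparing element by element. The sketch below reproduces the effect with NumPy, which uses the same broadcasting rules as PyTorch. The arrays are made-up illustration data.

```python
import numpy as np

preds = np.array([[1.0], [0.0], [1.0], [1.0]])  # shape (4, 1), like the model output
labels = np.array([1.0, 0.0, 0.0, 1.0])         # shape (4,), a flat label vector

# Broadcasting compares every prediction with every label: result has shape (4, 4)
wrong = (preds == labels)
wrong_acc = wrong.mean()

# Reshaping the labels to (4, 1) gives the intended element-wise comparison
right = (preds == labels.reshape(-1, 1))
right_acc = right.mean()

print(wrong.shape, wrong_acc)   # (4, 4) 0.5
print(right.shape, right_acc)   # (4, 1) 0.75
```

Printing `.shape` on both tensors before computing accuracy is a cheap way to catch this.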

In [None]:
if PYTORCH_AVAILABLE:
    # Visualize PyTorch vs NumPy comparison
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    # Plot 1: Learning curves comparison
    # TODO: Add your NumPy loss history for comparison
    ax1.plot(pytorch_losses, 'b-', linewidth=2, label='PyTorch', alpha=0.8)
    # ax1.plot(your_mlp.loss_history, 'r--', linewidth=2, label='NumPy', alpha=0.8)  # Uncomment after implementing
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss')
    ax1.set_title('Learning Curves: PyTorch vs NumPy')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    ax1.set_yscale('log')
    
    # Plot 2: Implementation comparison
    implementations = ['PyTorch']
    accuracies_comp = [test_acc_torch]
    
    # TODO: Add NumPy results when available
    # implementations.append('NumPy')
    # accuracies_comp.append(your_test_accuracy)
    
    bars = ax2.bar(implementations, accuracies_comp, 
                   color=['blue', 'red'][:len(implementations)], alpha=0.7)
    ax2.set_ylabel('Test Accuracy')
    ax2.set_title('Implementation Comparison')
    ax2.grid(True, alpha=0.3)
    
    # Add accuracy values on bars
    for bar, acc in zip(bars, accuracies_comp):
        ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                f'{acc:.3f}', ha='center', va='bottom', fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    print("Key Advantages of PyTorch:")
    print("• Automatic differentiation (no manual backprop implementation)")
    print("• GPU acceleration support")
    print("• Optimized operations and memory management")
    print("• Rich ecosystem of pre-built components")
    print("• Production-ready deployment tools")
else:
    print("Install PyTorch to see the comparison!")

## Part 5: Critical Reflection (15 minutes)

Now let's reflect on what you've learned about neural networks and their practical applications.

### 5.1 Architecture and Design Choices

**TODO: Answer these questions based on your experiments:**

**1. How did different network architectures (shallow vs deep, narrow vs wide) affect performance?**

[TODO: Analyze your architecture comparison results. Discuss trade-offs between model complexity and performance.]

**2. Which activation function worked best for your dataset? Why do you think this was the case?**

[TODO: Compare the activation functions you tested. Consider convergence speed, final accuracy, and potential issues like vanishing gradients.]

**3. What signs of overfitting (if any) did you observe? How could you address them?**

[TODO: Look at the difference between training and test accuracy. Suggest techniques like regularization, dropout, or early stopping.]

### 5.2 Neural Networks vs Previous Models

**4. How do neural networks compare to the linear regression (CA1) and classification models (CA2) you implemented?**

**Advantages of Neural Networks:**
[TODO: List advantages like non-linear modeling, universal approximation, feature learning]

**Disadvantages of Neural Networks:**
[TODO: List disadvantages like complexity, interpretability, training time, overfitting risk]

**5. When would you choose a neural network over simpler models like logistic regression or decision trees?**

[TODO: Consider factors like data size, problem complexity, interpretability requirements, computational resources]

### 5.3 Implementation Insights

**6. What was the most challenging part of implementing backpropagation from scratch?**

[TODO: Reflect on difficulties with gradient computation, matrix operations, or debugging]

**7. How does the PyTorch implementation compare to your NumPy version?**

**Ease of Implementation:**
[TODO: Compare the complexity and lines of code]

**Performance:**
[TODO: Compare training speed and final accuracy]

**Debugging and Development:**
[TODO: Discuss which was easier to debug and modify]

### 5.4 Real-World Applications and Ethics

**8. Give three real-world applications where neural networks would be appropriate:**

**Application 1:**
[TODO: Describe a specific use case, why neural networks are suitable, and what type of data would be involved]

**Application 2:**
[TODO: Describe another use case with different characteristics]

**Application 3:**
[TODO: Describe a third use case, perhaps in a different domain]

**9. What ethical considerations should be kept in mind when deploying neural networks?**

[TODO: Discuss issues like bias, fairness, transparency, privacy, and accountability. Consider the "black box" nature of neural networks.]

**10. How might the "black box" nature of neural networks be problematic in certain applications?**

[TODO: Consider applications like medical diagnosis, loan approval, criminal justice where explainability is crucial]

## Bonus: Advanced Experiments (Optional)

If you have extra time, try these advanced experiments:

In [None]:
# Bonus 1: Implement different weight initialization strategies
def xavier_init(fan_in, fan_out):
    """Xavier/Glorot initialization."""
    limit = np.sqrt(6 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, (fan_in, fan_out))

def he_init(fan_in, fan_out):
    """He initialization (good for ReLU)."""
    std = np.sqrt(2 / fan_in)
    return np.random.normal(0, std, (fan_in, fan_out))

print("Bonus: Try different initialization strategies in your MLP class!")
print("1. Xavier initialization: Good for tanh and sigmoid")
print("2. He initialization: Good for ReLU")
print("3. Compare convergence speed and final performance")
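
As a quick sanity check before wiring these into your MLP class, you can compare the empirical spread of the sampled weights with the values the formulas imply. This is a minimal sketch: the layer sizes (256 and 128) are arbitrary illustration values, and the two functions are repeated from the cell above so that the snippet runs on its own.

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    """Xavier/Glorot uniform initialization."""
    limit = np.sqrt(6 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, (fan_in, fan_out))

def he_init(fan_in, fan_out):
    """He normal initialization."""
    std = np.sqrt(2 / fan_in)
    return np.random.normal(0, std, (fan_in, fan_out))

fan_in, fan_out = 256, 128  # arbitrary sizes for illustration
W_xavier = xavier_init(fan_in, fan_out)
W_he = he_init(fan_in, fan_out)

# Uniform(-a, a) has variance a^2 / 3, so Xavier gives std = sqrt(2 / (fan_in + fan_out))
expected_xavier_std = np.sqrt(2 / (fan_in + fan_out))
expected_he_std = np.sqrt(2 / fan_in)

print(f"Xavier: empirical std {W_xavier.std():.4f}, expected {expected_xavier_std:.4f}")
print(f"He:     empirical std {W_he.std():.4f}, expected {expected_he_std:.4f}")
```

The empirical and expected values should agree closely for matrices of this size. A large gap usually means `fan_in` and `fan_out` were swapped or the wrong formula was used.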

In [None]:
# Bonus 2: Implement learning rate scheduling
def step_decay(initial_lr, epoch, drop_rate=0.5, epochs_drop=100):
    """Step decay learning rate schedule."""
    return initial_lr * (drop_rate ** (epoch // epochs_drop))

def exponential_decay(initial_lr, epoch, decay_rate=0.95):
    """Exponential decay learning rate schedule."""
    return initial_lr * (decay_rate ** epoch)

print("Bonus: Implement learning rate schedules!")
print("1. Start with higher learning rate for faster initial convergence")
print("2. Reduce learning rate over time for fine-tuning")
print("3. Compare with constant learning rate")
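
To see how the two schedules behave, it can help to print the learning rate at a few epochs. The sketch below repeats the functions from the cell above so that it runs on its own. The initial rate of 0.1 and the chosen epochs are arbitrary illustration values.

```python
def step_decay(initial_lr, epoch, drop_rate=0.5, epochs_drop=100):
    """Step decay learning rate schedule."""
    return initial_lr * (drop_rate ** (epoch // epochs_drop))

def exponential_decay(initial_lr, epoch, decay_rate=0.95):
    """Exponential decay learning rate schedule."""
    return initial_lr * (decay_rate ** epoch)

initial_lr = 0.1  # arbitrary starting value for illustration
for epoch in [0, 99, 100, 250]:
    lr_step = step_decay(initial_lr, epoch)
    lr_exp = exponential_decay(initial_lr, epoch)
    # In a training loop, the scheduled value would replace the fixed
    # learning rate for this epoch's parameter updates
    print(f"epoch {epoch:3d}: step = {lr_step:.5f}, exponential = {lr_exp:.3e}")
```

Step decay halves the rate every 100 epochs. With `decay_rate=0.95`, the exponential schedule falls below 1% of the initial rate by epoch 100, so a value closer to 1 may suit longer runs.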

In [None]:
# Bonus 3: Implement early stopping
def early_stopping_demo():
    """
    Demonstrate early stopping to prevent overfitting.
    """
    print("Bonus: Implement early stopping!")
    print("1. Monitor validation loss during training")
    print("2. Stop training when validation loss stops improving")
    print("3. Save the best model weights")
    print("4. Compare with and without early stopping")
    
    # TODO: Modify your MLP class to include validation monitoring
    # and early stopping logic

early_stopping_demo()
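
The four steps printed above can be sketched as a small helper that tracks the best validation loss and counts epochs without improvement. This is one possible design, not a required one: `EarlyStopping`, `patience`, and `min_delta` are illustrative names, and the validation losses are made-up numbers standing in for values your MLP would compute each epoch.

```python
class EarlyStopping:
    """Track validation loss and signal when to stop (illustrative helper)."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.epochs_without_improvement = 0

    def step(self, val_loss, epoch):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
            # A real training loop would also save a copy of the weights here
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

# Made-up validation losses: improvement stalls after epoch 3
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.60, 0.65]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss, epoch):
        stopped_at = epoch
        break

print(f"Stopped at epoch {stopped_at}; best loss {stopper.best_loss} at epoch {stopper.best_epoch}")
```

In your own training loop, the call to `step` would come after computing the loss on a held-out validation split, and the saved weights from the best epoch would be restored once training stops.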

## Summary and Submission

### What You've Accomplished

Congratulations! In this assignment, you have:

- **Understood neural network fundamentals** including forward/backward propagation
- **Implemented a complete MLP from scratch** using only NumPy
- **Mastered backpropagation** and gradient descent for neural networks
- **Compared different architectures** and activation functions
- **Applied neural networks** to real-world classification problems
- **Transitioned to PyTorch** for modern deep learning development
- **Reflected critically** on when and how to use neural networks

### Key Takeaways

**TODO: Write 3-4 key insights from this assignment:**

1. [TODO: Your first key takeaway about neural network capabilities vs limitations]
2. [TODO: Your second key takeaway about implementation challenges or surprises]
3. [TODO: Your third key takeaway about practical considerations (overfitting, architecture choice, etc.)]
4. [TODO: Your fourth key takeaway about PyTorch vs manual implementation]

### Neural Networks vs Previous Models

**TODO: Compare neural networks to the models from CA1 and CA2:**

**When to use Neural Networks:**
[TODO: List scenarios where neural networks are the best choice]

**When to use Simpler Models:**
[TODO: List scenarios where linear/logistic regression or other simple models are better]

### Looking Forward

**TODO: What aspects of neural networks would you like to explore further?**

[TODO: Mention topics like convolutional networks, recurrent networks, attention mechanisms, or specific applications]

### Final Reflection

**TODO: Write a brief (150-200 words) reflection on your experience implementing neural networks from scratch:**

[TODO: Your final reflection here - discuss what was challenging, what was surprising, how it changed your understanding of neural networks, and what you'd like to learn next]

---

**Assignment Complete!**

Make sure to:
1. Complete all TODO sections
2. Test your MLP implementation thoroughly
3. Answer all reflection questions
4. Save your notebook
5. Export as HTML
6. Submit both .ipynb and .html files
7. Include your name and student ID at the top