# 🧠 Neural Networks from Scratch

This notebook implements neural networks from scratch and walks through backpropagation and the optimization techniques that commonly come up in deep learning interviews.

## 📋 Table of Contents
1. [Neural Network Fundamentals](#neural-network-fundamentals)
2. [Activation Functions](#activation-functions)
3. [Backpropagation Algorithm](#backpropagation)
4. [Optimizers and Regularization](#optimizers-regularization)
5. [Convolutional Neural Networks](#cnn-basics)
6. [Practice Problems](#practice-problems)
7. [Interview Tips](#interview-tips)

In [None]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_circles, make_moons, load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix
import time
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
np.random.seed(42)

print("✅ All libraries imported successfully!")
print(f"📊 NumPy version: {np.__version__}")
print("🧠 Ready for neural network implementations!")

## 🧠 Problem 1: Neural Network Fundamentals

**Problem Statement**: Implement a feedforward neural network with customizable architecture.

**Requirements**:
- Support for multiple hidden layers
- Different activation functions (ReLU, Sigmoid, Tanh)
- Forward propagation implementation
- Weight initialization strategies
- Batch processing capability
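
As a rough illustration of the weight initialization requirement, the sketch below compares Xavier (Glorot) uniform and He normal initialization. The helper names `xavier_uniform` and `he_normal` and the layer sizes (256 inputs, 128 outputs) are assumptions chosen for this example, not part of the notebook's classes.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out, rng):
    """He normal: zero-mean Gaussian with variance 2 / fan_in, commonly paired with ReLU."""
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)

rng = np.random.default_rng(0)  # hypothetical seed, only for reproducibility
W_xavier = xavier_uniform(256, 128, rng)
W_he = he_normal(256, 128, rng)

# A uniform on (-limit, limit) has variance limit**2 / 3 = 2 / (fan_in + fan_out)
print("Xavier std:", W_xavier.std(), "expected:", np.sqrt(2.0 / (256 + 128)))
print("He std:    ", W_he.std(), "expected:", np.sqrt(2.0 / 256))
```

Both schemes scale the weights by the layer width so that activations neither shrink nor blow up as they pass through many layers; He uses a larger variance to compensate for ReLU zeroing roughly half of its inputs.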

**Mathematical Foundation**: 
- Forward pass: $z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$, $a^{[l]} = g^{[l]}(z^{[l]})$, with $a^{[0]} = X^{\top}$ (one column per sample)
- $g^{[l]}$ is the activation function for layer $l$
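
The forward-pass equations can be checked by hand on a tiny layer. The sketch below is a minimal example with made-up weights, using the same (features, samples) column layout as the equations; `dense_forward` is a hypothetical helper name, not a method of the class defined in this notebook.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def dense_forward(W, b, a_prev, g):
    """One layer: z = W a_prev + b, a = g(z). Columns of a_prev are samples."""
    z = W @ a_prev + b          # b has shape (units, 1) and broadcasts over samples
    return z, g(z)

W1 = np.array([[1.0, -1.0],
               [0.5,  0.5],
               [-1.0, 2.0]])    # (3 units, 2 features)
b1 = np.array([[0.0], [1.0], [-1.0]])
A0 = np.array([[1.0, 2.0],
               [3.0, 0.0]])     # (2 features, 2 samples)

z1, a1 = dense_forward(W1, b1, A0, relu)
print(z1)  # [[-2.  2.] [ 3.  2.] [ 4. -3.]]
print(a1)  # [[0. 2.] [3. 2.] [4. 0.]]
```

Storing samples as columns means the bias vector broadcasts across the batch, so the same code handles one sample or many.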

In [None]:
class ActivationFunctions:
    """Collection of activation functions and their derivatives."""
    
    @staticmethod
    def sigmoid(z):
        """Sigmoid activation function."""
        # Clip to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    @staticmethod
    def sigmoid_derivative(z):
        """Derivative of sigmoid function."""
        s = ActivationFunctions.sigmoid(z)
        return s * (1 - s)
    
    @staticmethod
    def relu(z):
        """ReLU activation function."""
        return np.maximum(0, z)
    
    @staticmethod
    def relu_derivative(z):
        """Derivative of ReLU function."""
        return (z > 0).astype(float)
    
    @staticmethod
    def tanh(z):
        """Tanh activation function."""
        return np.tanh(z)
    
    @staticmethod
    def tanh_derivative(z):
        """Derivative of tanh function."""
        return 1 - np.tanh(z) ** 2
    
    @staticmethod
    def leaky_relu(z, alpha=0.01):
        """Leaky ReLU activation function."""
        return np.where(z > 0, z, alpha * z)
    
    @staticmethod
    def leaky_relu_derivative(z, alpha=0.01):
        """Derivative of Leaky ReLU function."""
        return np.where(z > 0, 1, alpha)
    
    @staticmethod
    def softmax(z):
        """Softmax activation function."""
        # Numerical stability
        exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
        return exp_z / np.sum(exp_z, axis=1, keepdims=True)

class NeuralNetworkFromScratch:
    """Feedforward Neural Network implementation from scratch."""
    
    def __init__(self, layers, activations, learning_rate=0.01, 
                 weight_init='xavier', random_state=None):
        """
        Initialize neural network.
        
        Parameters:
        layers: list of integers giving the number of neurons in each layer, input layer first
        activations: list of activation function names, one per non-input layer (len(layers) - 1 entries)
        learning_rate: learning rate for gradient descent
        weight_init: weight initialization strategy ('xavier', 'he', 'random')
        random_state: optional seed passed to np.random.seed for reproducibility
        """
        self.layers = layers
        self.activations = activations
        self.learning_rate = learning_rate
        self.weight_init = weight_init
        
        if random_state is not None:
            np.random.seed(random_state)
        
        # Initialize weights and biases
        self.weights = {}
        self.biases = {}
        
        for i in range(1, len(layers)):
            if weight_init == 'xavier':
                # Xavier/Glorot initialization
                limit = np.sqrt(6.0 / (layers[i-1] + layers[i]))
                self.weights[i] = np.random.uniform(-limit, limit, (layers[i], layers[i-1]))
            elif weight_init == 'he':
                # He initialization (good for ReLU)
                self.weights[i] = np.random.randn(layers[i], layers[i-1]) * np.sqrt(2.0 / layers[i-1])
            else:
                # Random initialization
                self.weights[i] = np.random.randn(layers[i], layers[i-1]) * 0.1
            
            self.biases[i] = np.zeros((layers[i], 1))
        
        # Activation function mapping
        self.activation_funcs = {
            'sigmoid': ActivationFunctions.sigmoid,
            'relu': ActivationFunctions.relu,
            'tanh': ActivationFunctions.tanh,
            'leaky_relu': ActivationFunctions.leaky_relu,
            'softmax': ActivationFunctions.softmax
        }
        
        self.activation_derivatives = {
            'sigmoid': ActivationFunctions.sigmoid_derivative,
            'relu': ActivationFunctions.relu_derivative,
            'tanh': ActivationFunctions.tanh_derivative,
            'leaky_relu': ActivationFunctions.leaky_relu_derivative
        }
    
    def forward_propagation(self, X):
        """Forward propagation through the network."""
        # Store activations for backpropagation
        self.activations_cache = {}
        self.z_cache = {}
        
        # Input layer
        self.activations_cache[0] = X.T  # Shape: (features, samples)
        
        # Forward through hidden and output layers
        for i in range(1, len(self.layers)):
            # Linear transformation
            z = self.weights[i] @ self.activations_cache[i-1] + self.biases[i]
            self.z_cache[i] = z
            
            # Apply activation function
            activation_func = self.activation_funcs[self.activations[i-1]]
            
            if self.activations[i-1] == 'softmax':
                # Softmax needs special handling for batch processing
                a = activation_func(z.T).T  # Transpose for batch processing
            else:
                a = activation_func(z)
            
            self.activations_cache[i] = a
        
        return self.activations_cache[len(self.layers)-1].T  # Shape: (samples, output_neurons)
    
    def compute_cost(self, Y_true, Y_pred):
        """Compute the cost function."""
        m = Y_true.shape[0]
        
        if self.activations[-1] == 'softmax':
            # Cross-entropy loss for multiclass
            # Convert labels to one-hot if needed
            if len(Y_true.shape) == 1:
                Y_true_onehot = np.eye(self.layers[-1])[Y_true]
            else:
                Y_true_onehot = Y_true
            
            # Avoid log(0)
            Y_pred_clipped = np.clip(Y_pred, 1e-10, 1 - 1e-10)
            cost = -np.mean(np.sum(Y_true_onehot * np.log(Y_pred_clipped), axis=1))
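        else:
            # Align targets with predictions: a 1-D label vector of shape (m,)
            # would otherwise broadcast against (m, 1) predictions into an (m, m) array
            Y_true = np.asarray(Y_true).reshape(Y_pred.shape)
            
            # Binary cross-entropy or MSE
            if self.activations[-1] == 'sigmoid':
                # Binary cross-entropy
                Y_pred_clipped = np.clip(Y_pred, 1e-10, 1 - 1e-10)
                cost = -np.mean(Y_true * np.log(Y_pred_clipped) + (1 - Y_true) * np.log(1 - Y_pred_clipped))
            else:
                # Mean Squared Error
                cost = np.mean((Y_true - Y_pred) ** 2) / 2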
        
        return cost
    
    def backward_propagation(self, X, Y_true):
        """Backward propagation to compute gradients."""
        m = X.shape[0]
        
        # Initialize gradients
        dW = {}
        db = {}
        
        # Output layer error
        Y_pred = self.activations_cache[len(self.layers)-1].T
        
        if self.activations[-1] == 'softmax':
            # For softmax + cross-entropy, derivative is simplified
            if len(Y_true.shape) == 1:
                Y_true_onehot = np.eye(self.layers[-1])[Y_true]
            else:
                Y_true_onehot = Y_true
            
            dz = (Y_pred - Y_true_onehot) / m
            dz = dz.T  # Shape: (output_neurons, samples)
        else:
            # General case
            if self.activations[-1] == 'sigmoid':
                dz = (Y_pred - Y_true.reshape(-1, 1)) / m
            else:
                # For other activations, apply the activation derivative.
                # z_cache is (output_neurons, samples), so transpose it to match Y_pred.
                activation_deriv = self.activation_derivatives[self.activations[-1]]
                dz = (Y_pred - Y_true.reshape(-1, 1)) * activation_deriv(self.z_cache[len(self.layers)-1]).T / m
            
            dz = dz.T  # Shape: (output_neurons, samples)
        
        # Backpropagate through layers
        for i in range(len(self.layers)-1, 0, -1):
            # Compute gradients for current layer
            dW[i] = dz @ self.activations_cache[i-1].T
            db[i] = np.sum(dz, axis=1, keepdims=True)
            
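            if i > 1:  # Not the first layer
                # Compute error for previous layer
                da_prev = self.weights[i].T @ dz
                
                # If a subclass applied inverted dropout to layer i-1 in the forward
                # pass, route the gradient through the same mask and scaling
                mask = getattr(self, 'dropout_masks', {}).get(i-1)
                if mask is not None:
                    da_prev = da_prev * mask / (1 - self.dropout_rate)
                
                # Apply activation derivative
                activation_deriv = self.activation_derivatives[self.activations[i-2]]
                dz = da_prev * activation_deriv(self.z_cache[i-1])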
        
        return dW, db
    
    def update_parameters(self, dW, db):
        """Update weights and biases using gradient descent."""
        for i in range(1, len(self.layers)):
            self.weights[i] -= self.learning_rate * dW[i]
            self.biases[i] -= self.learning_rate * db[i]
    
    def fit(self, X, y, epochs=1000, batch_size=None, verbose=True, validation_split=0.1):
        """Train the neural network."""
        # Split data for validation
        if validation_split > 0:
            X_train, X_val, y_train, y_val = train_test_split(
                X, y, test_size=validation_split, random_state=42
            )
        else:
            X_train, y_train = X, y
            X_val, y_val = None, None
        
        # Training history
        self.history = {
            'cost': [],
            'accuracy': [],
            'val_cost': [],
            'val_accuracy': []
        }
        
        m = X_train.shape[0]
        
        # Set batch size
        if batch_size is None:
            batch_size = m  # Full batch
        
        for epoch in range(epochs):
            epoch_cost = 0
            num_batches = int(np.ceil(m / batch_size))  # Include the final partial batch
            
            # Shuffle data
            indices = np.random.permutation(m)
            X_shuffled = X_train[indices]
            y_shuffled = y_train[indices]
            
            # Mini-batch training
            for i in range(num_batches):
                start_idx = i * batch_size
                end_idx = min((i + 1) * batch_size, m)
                
                X_batch = X_shuffled[start_idx:end_idx]
                y_batch = y_shuffled[start_idx:end_idx]
                
                # Forward propagation
                Y_pred = self.forward_propagation(X_batch)
                
                # Compute cost
                cost = self.compute_cost(y_batch, Y_pred)
                epoch_cost += cost
                
                # Backward propagation
                dW, db = self.backward_propagation(X_batch, y_batch)
                
                # Update parameters
                self.update_parameters(dW, db)
            
            # Average cost over batches
            epoch_cost /= num_batches
            
            # Calculate training accuracy
            train_pred = self.predict(X_train)
            train_accuracy = accuracy_score(y_train, train_pred)
            
            # Store history
            self.history['cost'].append(epoch_cost)
            self.history['accuracy'].append(train_accuracy)
            
            # Validation metrics
            if X_val is not None:
                val_pred_proba = self.forward_propagation(X_val)
                val_cost = self.compute_cost(y_val, val_pred_proba)
                val_pred = self.predict(X_val)
                val_accuracy = accuracy_score(y_val, val_pred)
                
                self.history['val_cost'].append(val_cost)
                self.history['val_accuracy'].append(val_accuracy)
            
            # Print progress
            if verbose and (epoch + 1) % max(1, epochs // 10) == 0:
                if X_val is not None:
                    print(f"Epoch {epoch+1}/{epochs} - Cost: {epoch_cost:.4f} - "
                          f"Accuracy: {train_accuracy:.4f} - Val_Cost: {val_cost:.4f} - "
                          f"Val_Accuracy: {val_accuracy:.4f}")
                else:
                    print(f"Epoch {epoch+1}/{epochs} - Cost: {epoch_cost:.4f} - "
                          f"Accuracy: {train_accuracy:.4f}")
        
        return self
    
    def predict_proba(self, X):
        """Predict class probabilities."""
        return self.forward_propagation(X)
    
    def predict(self, X):
        """Make predictions."""
        probabilities = self.forward_propagation(X)
        
        if self.activations[-1] == 'softmax':
            # Multiclass: return class with highest probability
            return np.argmax(probabilities, axis=1)
        else:
            # Binary: threshold at 0.5
            return (probabilities > 0.5).astype(int).flatten()

# Test Neural Network implementation
print("🧪 Testing Neural Network Implementation:")

# Generate sample data
X_class, y_class = make_classification(n_samples=1000, n_features=10, n_classes=3,
                                      n_informative=8, n_redundant=2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_class)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_class, test_size=0.2, random_state=42)

# Create neural network
# Architecture: Input(10) -> Hidden(20) -> Hidden(15) -> Output(3)
nn = NeuralNetworkFromScratch(
    layers=[10, 20, 15, 3],
    activations=['relu', 'relu', 'softmax'],
    learning_rate=0.01,
    weight_init='xavier',
    random_state=42
)

print(f"Network architecture: {nn.layers}")
print(f"Activation functions: {nn.activations}")
print(f"Total parameters: {sum(w.size for w in nn.weights.values()) + sum(b.size for b in nn.biases.values())}")

# Train the network
print("\nTraining neural network...")
start_time = time.time()
nn.fit(X_train, y_train, epochs=500, batch_size=32, verbose=True, validation_split=0.15)
training_time = time.time() - start_time

# Make predictions
y_pred = nn.predict(X_test)
y_pred_proba = nn.predict_proba(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"\n✅ Training completed in {training_time:.2f} seconds")
print(f"Test Accuracy: {accuracy:.4f}")
print(f"Final Training Cost: {nn.history['cost'][-1]:.4f}")
print(f"Final Validation Cost: {nn.history['val_cost'][-1]:.4f}")

In [None]:
# Visualize Neural Network results
plt.figure(figsize=(16, 12))

# Plot 1: Training history
plt.subplot(3, 4, 1)
epochs = range(1, len(nn.history['cost']) + 1)
plt.plot(epochs, nn.history['cost'], label='Training Cost', alpha=0.8)
plt.plot(epochs, nn.history['val_cost'], label='Validation Cost', alpha=0.8)
plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.title('Training and Validation Cost')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 2: Accuracy history
plt.subplot(3, 4, 2)
plt.plot(epochs, nn.history['accuracy'], label='Training Accuracy', alpha=0.8)
plt.plot(epochs, nn.history['val_accuracy'], label='Validation Accuracy', alpha=0.8)
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 3: Confusion matrix
plt.subplot(3, 4, 3)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

# Plot 4: Weight distribution for first layer
plt.subplot(3, 4, 4)
first_layer_weights = nn.weights[1].flatten()
plt.hist(first_layer_weights, bins=30, alpha=0.7, color='green')
plt.xlabel('Weight Value')
plt.ylabel('Frequency')
plt.title('First Layer Weight Distribution')
plt.grid(True, alpha=0.3)

# Plot 5-8: Activation function comparisons
activation_functions = {
    'Sigmoid': ActivationFunctions.sigmoid,
    'ReLU': ActivationFunctions.relu,
    'Tanh': ActivationFunctions.tanh,
    'Leaky ReLU': ActivationFunctions.leaky_relu
}

x = np.linspace(-5, 5, 100)
for i, (name, func) in enumerate(activation_functions.items()):
    plt.subplot(3, 4, 5 + i)
    y = func(x)
    plt.plot(x, y, linewidth=2, alpha=0.8)
    plt.title(f'{name} Activation')
    plt.xlabel('Input')
    plt.ylabel('Output')
    plt.grid(True, alpha=0.3)

# Plot 9: Prediction confidence
plt.subplot(3, 4, 9)
max_probs = np.max(y_pred_proba, axis=1)
plt.hist(max_probs, bins=20, alpha=0.7, color='orange')
plt.xlabel('Prediction Confidence')
plt.ylabel('Frequency')
plt.title('Prediction Confidence Distribution')
plt.grid(True, alpha=0.3)

# Plot 10: Feature importance (based on first layer weights)
plt.subplot(3, 4, 10)
feature_importance = np.mean(np.abs(nn.weights[1]), axis=0)
features = range(len(feature_importance))
bars = plt.bar(features, feature_importance, alpha=0.7, color='purple')
plt.xlabel('Feature Index')
plt.ylabel('Average |Weight|')
plt.title('Feature Importance (First Layer)')
plt.grid(True, alpha=0.3)

# Plot 11: Learning rate effect (short training runs on a 200-sample subset)
plt.subplot(3, 4, 11)
learning_rates = [0.001, 0.01, 0.05, 0.1, 0.5]
final_costs = []

for lr in learning_rates:
    # Quick training run with fewer epochs
    nn_temp = NeuralNetworkFromScratch(
        layers=[10, 15, 3],
        activations=['relu', 'softmax'],
        learning_rate=lr,
        random_state=42
    )
    nn_temp.fit(X_train[:200], y_train[:200], epochs=100, verbose=False, validation_split=0)
    final_costs.append(nn_temp.history['cost'][-1])

plt.semilogx(learning_rates, final_costs, 'o-', alpha=0.8, linewidth=2, markersize=8)
plt.xlabel('Learning Rate')
plt.ylabel('Final Cost')
plt.title('Learning Rate vs Final Cost')
plt.grid(True, alpha=0.3)

# Plot 12: Network architecture visualization
plt.subplot(3, 4, 12)
# Simple visualization of network structure
layer_sizes = nn.layers
max_neurons = max(layer_sizes)
layer_positions = np.arange(len(layer_sizes))

for i, size in enumerate(layer_sizes):
    y_positions = np.linspace(-max_neurons/2, max_neurons/2, size)
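    # color= takes one RGBA tuple for all points; c= is ambiguous between a single
    # color and per-point values when the tuple length matches the number of points
    plt.scatter([i] * size, y_positions, s=100, alpha=0.7, 
                color=plt.cm.viridis(i / len(layer_sizes)))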
    
    # Add connections (simplified)
    if i < len(layer_sizes) - 1:
        next_y_positions = np.linspace(-max_neurons/2, max_neurons/2, layer_sizes[i+1])
        for y1 in y_positions[:min(3, len(y_positions))]:
            for y2 in next_y_positions[:min(3, len(next_y_positions))]:
                plt.plot([i, i+1], [y1, y2], 'k-', alpha=0.1)

plt.xlabel('Layer')
plt.ylabel('Neuron Position')
plt.title('Network Architecture')
plt.xticks(range(len(layer_sizes)), [f'Layer {i}\n({size} neurons)' 
                                     for i, size in enumerate(layer_sizes)])
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Performance summary
print("\n📊 Neural Network Performance Summary:")
print("=" * 50)
print(f"Architecture: {' -> '.join(map(str, nn.layers))}")
print(f"Activations: {' -> '.join(nn.activations)}")
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Training time: {training_time:.2f} seconds")
print(f"Final training accuracy: {nn.history['accuracy'][-1]:.4f}")
print(f"Final validation accuracy: {nn.history['val_accuracy'][-1]:.4f}")
print(f"Test accuracy: {accuracy:.4f}")
print(f"Total parameters: {sum(w.size for w in nn.weights.values()) + sum(b.size for b in nn.biases.values())}")

## 🔄 Problem 2: Advanced Optimizers and Regularization

**Problem Statement**: Implement advanced optimization algorithms and regularization techniques.

**Requirements**:
- Adam, RMSprop, and Momentum optimizers
- L1/L2 regularization and Dropout
- Batch normalization
- Learning rate scheduling
- Early stopping
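
Learning rate scheduling can be sketched independently of any particular optimizer. The two schedules below (step decay and exponential decay) are common choices; the function names and default values are assumptions made for illustration.

```python
import math

def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)

def exponential_decay(initial_lr, epoch, decay_rate=0.05):
    """Shrink the learning rate smoothly: lr = lr0 * exp(-decay_rate * epoch)."""
    return initial_lr * math.exp(-decay_rate * epoch)

for epoch in (0, 10, 25):
    print(epoch, step_decay(0.1, epoch), round(exponential_decay(0.1, epoch), 5))
```

One way to use such a schedule is to assign its value to an optimizer's `learning_rate` attribute at the start of each epoch, before the mini-batch loop runs.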

**Key Concepts**: Adaptive learning rates, gradient momentum, regularization for generalization
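
To make "adaptive learning rates" concrete, the sketch below applies a single Adam step to gradients of very different magnitudes. The standalone function `adam_step` is a hypothetical helper written for this illustration; it follows the standard Adam update with bias correction.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameter and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad           # biased first moment
    v = beta2 * v + (1 - beta2) * grad ** 2      # biased second raw moment
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

first_steps = {}
for scale in (0.01, 1.0, 100.0):
    theta, m, v = adam_step(np.array([0.0]), np.array([scale]), m=0.0, v=0.0, t=1)
    first_steps[scale] = theta[0]
    print(f"grad={scale:>6}: theta after one step = {theta[0]:.6f}")
```

On the first step the bias-corrected moments reduce to the gradient and its square, so the update is approximately `-lr * sign(grad)` whatever the gradient's scale. Plain SGD would instead move 10,000 times further for the largest gradient than for the smallest.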

In [None]:
class Optimizer:
    """Base class for optimizers."""
    def update(self, params, grads):
        raise NotImplementedError

class SGD(Optimizer):
    """Stochastic Gradient Descent optimizer."""
    def __init__(self, learning_rate=0.01):
        self.learning_rate = learning_rate
    
    def update(self, params, grads):
        for key in params:
            params[key] -= self.learning_rate * grads[key]

class Momentum(Optimizer):
    """SGD with Momentum optimizer."""
    def __init__(self, learning_rate=0.01, momentum=0.9):
        self.learning_rate = learning_rate
        self.momentum = momentum
        self.velocity = {}
    
    def update(self, params, grads):
        for key in params:
            if key not in self.velocity:
                self.velocity[key] = np.zeros_like(params[key])
            
            self.velocity[key] = self.momentum * self.velocity[key] - self.learning_rate * grads[key]
            params[key] += self.velocity[key]

class Adam(Optimizer):
    """Adam optimizer."""
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = {}  # First moment
        self.v = {}  # Second moment
        self.t = 0   # Time step
    
    def update(self, params, grads):
        self.t += 1
        
        for key in params:
            if key not in self.m:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])
            
            # Update biased first moment estimate
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            
            # Update biased second raw moment estimate
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key] ** 2)
            
            # Compute bias-corrected first moment estimate
            m_corrected = self.m[key] / (1 - self.beta1 ** self.t)
            
            # Compute bias-corrected second raw moment estimate
            v_corrected = self.v[key] / (1 - self.beta2 ** self.t)
            
            # Update parameters
            params[key] -= self.learning_rate * m_corrected / (np.sqrt(v_corrected) + self.epsilon)

class RMSprop(Optimizer):
    """RMSprop optimizer."""
    def __init__(self, learning_rate=0.001, decay_rate=0.9, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.decay_rate = decay_rate
        self.epsilon = epsilon
        self.cache = {}
    
    def update(self, params, grads):
        for key in params:
            if key not in self.cache:
                self.cache[key] = np.zeros_like(params[key])
            
            self.cache[key] = self.decay_rate * self.cache[key] + (1 - self.decay_rate) * (grads[key] ** 2)
            params[key] -= self.learning_rate * grads[key] / (np.sqrt(self.cache[key]) + self.epsilon)

class AdvancedNeuralNetwork(NeuralNetworkFromScratch):
    """Neural Network with advanced optimizers and regularization."""
    
    def __init__(self, layers, activations, optimizer='adam', 
                 l1_reg=0.0, l2_reg=0.0, dropout_rate=0.0,
                 weight_init='xavier', random_state=None):
        super().__init__(layers, activations, weight_init=weight_init, random_state=random_state)
        
        # Initialize optimizer
        if optimizer == 'sgd':
            self.optimizer = SGD(learning_rate=0.01)
        elif optimizer == 'momentum':
            self.optimizer = Momentum(learning_rate=0.01)
        elif optimizer == 'adam':
            self.optimizer = Adam(learning_rate=0.001)
        elif optimizer == 'rmsprop':
            self.optimizer = RMSprop(learning_rate=0.001)
        else:
            raise ValueError(f"Unknown optimizer: {optimizer}")
        
        # Regularization parameters
        self.l1_reg = l1_reg
        self.l2_reg = l2_reg
        self.dropout_rate = dropout_rate
        self.dropout_masks = {}
        self.training = True
    
    def forward_propagation(self, X):
        """Forward propagation with dropout."""
        # Store activations for backpropagation
        self.activations_cache = {}
        self.z_cache = {}
        self.dropout_masks = {}
        
        # Input layer
        self.activations_cache[0] = X.T  # Shape: (features, samples)
        
        # Forward through hidden and output layers
        for i in range(1, len(self.layers)):
            # Linear transformation
            z = self.weights[i] @ self.activations_cache[i-1] + self.biases[i]
            self.z_cache[i] = z
            
            # Apply activation function
            activation_func = self.activation_funcs[self.activations[i-1]]
            
            if self.activations[i-1] == 'softmax':
                a = activation_func(z.T).T
            else:
                a = activation_func(z)
            
            # Apply dropout (except for output layer)
            if self.training and self.dropout_rate > 0 and i < len(self.layers) - 1:
                mask = np.random.rand(*a.shape) > self.dropout_rate
                a = a * mask / (1 - self.dropout_rate)  # Inverted dropout
                self.dropout_masks[i] = mask
            
            self.activations_cache[i] = a
        
        return self.activations_cache[len(self.layers)-1].T
    
    def compute_cost(self, Y_true, Y_pred):
        """Compute cost with regularization."""
        # Base cost
        cost = super().compute_cost(Y_true, Y_pred)
        
        # Add regularization
        l1_cost = 0
        l2_cost = 0
        
        for i in range(1, len(self.layers)):
            l1_cost += np.sum(np.abs(self.weights[i]))
            l2_cost += np.sum(self.weights[i] ** 2)
        
        cost += self.l1_reg * l1_cost + self.l2_reg * l2_cost / 2
        
        return cost
    
    def backward_propagation(self, X, Y_true):
        """Backward propagation with regularization."""
        # Get base gradients
        dW, db = super().backward_propagation(X, Y_true)
        
        # Add regularization gradients
        for i in range(1, len(self.layers)):
            # L1 regularization
            if self.l1_reg > 0:
                dW[i] += self.l1_reg * np.sign(self.weights[i])
            
            # L2 regularization
            if self.l2_reg > 0:
                dW[i] += self.l2_reg * self.weights[i]
        
        return dW, db
    
    def update_parameters(self, dW, db):
        """Update parameters using the optimizer."""
        # Combine weights and biases for optimizer
        params = {}
        grads = {}
        
        for i in range(1, len(self.layers)):
            params[f'W{i}'] = self.weights[i]
            params[f'b{i}'] = self.biases[i]
            grads[f'W{i}'] = dW[i]
            grads[f'b{i}'] = db[i]
        
        # Update using optimizer
        self.optimizer.update(params, grads)
    
    def fit(self, X, y, epochs=1000, batch_size=32, validation_split=0.1, 
            verbose=True, early_stopping_patience=None):
        """Train with early stopping."""
        self.training = True
        
        # Split data for validation
        if validation_split > 0:
            X_train, X_val, y_train, y_val = train_test_split(
                X, y, test_size=validation_split, random_state=42
            )
        else:
            X_train, y_train = X, y
            X_val, y_val = None, None
        
        # Training history
        self.history = {
            'cost': [],
            'accuracy': [],
            'val_cost': [],
            'val_accuracy': []
        }
        
        # Early stopping variables
        best_val_cost = float('inf')
        patience_counter = 0
        best_weights = None
        best_biases = None
        
        m = X_train.shape[0]
        
        for epoch in range(epochs):
            epoch_cost = 0
            num_batches = int(np.ceil(m / batch_size))  # Include the final partial batch
            
            # Shuffle data
            indices = np.random.permutation(m)
            X_shuffled = X_train[indices]
            y_shuffled = y_train[indices]
            
            # Mini-batch training
            for i in range(num_batches):
                start_idx = i * batch_size
                end_idx = min((i + 1) * batch_size, m)
                
                X_batch = X_shuffled[start_idx:end_idx]
                y_batch = y_shuffled[start_idx:end_idx]
                
                # Forward propagation
                Y_pred = self.forward_propagation(X_batch)
                
                # Compute cost
                cost = self.compute_cost(y_batch, Y_pred)
                epoch_cost += cost
                
                # Backward propagation
                dW, db = self.backward_propagation(X_batch, y_batch)
                
                # Update parameters
                self.update_parameters(dW, db)
            
            # Average cost over batches
            epoch_cost /= num_batches
            
            # Calculate training accuracy (without dropout)
            self.training = False
            train_pred = self.predict(X_train)
            train_accuracy = accuracy_score(y_train, train_pred)
            self.training = True
            
            # Store history
            self.history['cost'].append(epoch_cost)
            self.history['accuracy'].append(train_accuracy)
            
            # Validation metrics
            if X_val is not None:
                self.training = False  # Disable dropout for validation
                val_pred_proba = self.forward_propagation(X_val)
                val_cost = self.compute_cost(y_val, val_pred_proba)
                val_pred = self.predict(X_val)
                val_accuracy = accuracy_score(y_val, val_pred)
                self.training = True
                
                self.history['val_cost'].append(val_cost)
                self.history['val_accuracy'].append(val_accuracy)
                
                # Early stopping check
                if early_stopping_patience is not None:
                    if val_cost < best_val_cost:
                        best_val_cost = val_cost
                        patience_counter = 0
                        # Save best weights
                        best_weights = {k: v.copy() for k, v in self.weights.items()}
                        best_biases = {k: v.copy() for k, v in self.biases.items()}
                    else:
                        patience_counter += 1
                        
                        if patience_counter >= early_stopping_patience:
                            print(f"\nEarly stopping at epoch {epoch+1}")
                            # Restore best weights
                            if best_weights is not None:
                                self.weights = best_weights
                                self.biases = best_biases
                            break
            
            # Print progress
            if verbose and (epoch + 1) % max(1, epochs // 10) == 0:
                if X_val is not None:
                    print(f"Epoch {epoch+1}/{epochs} - Cost: {epoch_cost:.4f} - "
                          f"Accuracy: {train_accuracy:.4f} - Val_Cost: {val_cost:.4f} - "
                          f"Val_Accuracy: {val_accuracy:.4f}")
                else:
                    print(f"Epoch {epoch+1}/{epochs} - Cost: {epoch_cost:.4f} - "
                          f"Accuracy: {train_accuracy:.4f}")
        
        self.training = False  # Set to inference mode
        return self
    
    def predict(self, X):
        """Make predictions (inference mode)."""
        was_training = self.training
        self.training = False  # Disable dropout
        predictions = super().predict(X)
        self.training = was_training
        return predictions

# Test Advanced Neural Network
print("🧪 Testing Advanced Neural Network with Different Optimizers:")

# Generate more complex dataset
X_complex, y_complex = make_circles(n_samples=1000, noise=0.1, factor=0.3, random_state=42)

# Split data
X_train_adv, X_test_adv, y_train_adv, y_test_adv = train_test_split(
    X_complex, y_complex, test_size=0.2, random_state=42
)

# Test different optimizers
optimizers = ['sgd', 'momentum', 'adam', 'rmsprop']
results = []

for optimizer in optimizers:
    print(f"\n=== Testing {optimizer.upper()} Optimizer ===")
    
    # Create network
    nn_adv = AdvancedNeuralNetwork(
        layers=[2, 10, 10, 1],
        activations=['relu', 'relu', 'sigmoid'],
        optimizer=optimizer,
        l2_reg=0.001,
        dropout_rate=0.2,
        random_state=42
    )
    
    # Train network
    start_time = time.time()
    nn_adv.fit(X_train_adv, y_train_adv, epochs=200, batch_size=32,
               validation_split=0.15, verbose=False, early_stopping_patience=20)
    training_time = time.time() - start_time
    
    # Evaluate
    y_pred_adv = nn_adv.predict(X_test_adv)
    accuracy_adv = accuracy_score(y_test_adv, y_pred_adv)
    
    results.append({
        'optimizer': optimizer,
        'accuracy': accuracy_adv,
        'training_time': training_time,
        'final_cost': nn_adv.history['cost'][-1],
        'epochs_trained': len(nn_adv.history['cost']),
        'history': nn_adv.history
    })
    
    print(f"Test Accuracy: {accuracy_adv:.4f}")
    print(f"Training Time: {training_time:.2f}s")
    print(f"Epochs Trained: {len(nn_adv.history['cost'])}")

print("\n✅ Advanced neural network tests completed!")

In [None]:
# Visualize optimizer comparison
plt.figure(figsize=(16, 10))

# Plot 1: Optimizer performance comparison
plt.subplot(2, 4, 1)
optimizers_names = [r['optimizer'] for r in results]
accuracies = [r['accuracy'] for r in results]
bars = plt.bar(optimizers_names, accuracies, alpha=0.7)
plt.ylabel('Test Accuracy')
plt.title('Optimizer Performance Comparison')
plt.xticks(rotation=45)

# Add value labels
for bar, acc in zip(bars, accuracies):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{acc:.3f}', ha='center', va='bottom')

plt.grid(True, alpha=0.3)

# Plot 2: Training time comparison
plt.subplot(2, 4, 2)
times = [r['training_time'] for r in results]
bars = plt.bar(optimizers_names, times, alpha=0.7, color='orange')
plt.ylabel('Training Time (s)')
plt.title('Training Time Comparison')
plt.xticks(rotation=45)

for bar, time_val in zip(bars, times):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
             f'{time_val:.1f}s', ha='center', va='bottom')

plt.grid(True, alpha=0.3)

# Plot 3: Training curves for all optimizers
plt.subplot(2, 4, 3)
for result in results:
    epochs = range(1, len(result['history']['cost']) + 1)
    plt.plot(epochs, result['history']['cost'], 
             label=result['optimizer'].upper(), alpha=0.8)

plt.xlabel('Epoch')
plt.ylabel('Training Cost')
plt.title('Training Cost Curves')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 4: Validation curves for all optimizers
plt.subplot(2, 4, 4)
for result in results:
    epochs = range(1, len(result['history']['val_accuracy']) + 1)
    plt.plot(epochs, result['history']['val_accuracy'], 
             label=result['optimizer'].upper(), alpha=0.8)

plt.xlabel('Epoch')
plt.ylabel('Validation Accuracy')
plt.title('Validation Accuracy Curves')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 5-8: Decision boundaries for each optimizer
def plot_decision_boundary(X, y, model, title, ax):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    mesh_points = np.c_[xx.ravel(), yy.ravel()]
    Z = model.predict_proba(mesh_points)
    if Z.shape[1] > 1:
        Z = Z[:, 1]  # Take second class probability for binary classification
    else:
        Z = Z.flatten()
    Z = Z.reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, levels=50, alpha=0.6, cmap='RdYlBu')
    ax.contour(xx, yy, Z, levels=[0.5], colors='black', linestyles='--', linewidths=2)
    scatter = ax.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8, edgecolors='black')
    ax.set_title(title)
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    return scatter

# Create models for decision boundary visualization
for i, optimizer in enumerate(optimizers):
    ax = plt.subplot(2, 4, 5 + i)
    
    # Create and train a model for visualization
    nn_viz = AdvancedNeuralNetwork(
        layers=[2, 8, 1],
        activations=['relu', 'sigmoid'],
        optimizer=optimizer,
        l2_reg=0.001,
        random_state=42
    )
    
    nn_viz.fit(X_train_adv, y_train_adv, epochs=150, verbose=False, validation_split=0)
    
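    # Report the accuracy of the model actually plotted (nn_viz), not the larger
    # network stored in `results`, which uses a different architecture
    viz_accuracy = accuracy_score(y_test_adv, nn_viz.predict(X_test_adv))
    scatter = plot_decision_boundary(X_test_adv, y_test_adv, nn_viz, 
                                   f'{optimizer.upper()}\nAcc: {viz_accuracy:.3f}', ax)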

plt.tight_layout()
plt.show()

# Detailed comparison table
print("\n📊 Detailed Optimizer Comparison:")
print("=" * 80)
print(f"{'Optimizer':<12} {'Accuracy':<10} {'Time (s)':<10} {'Epochs':<8} {'Final Cost':<12}")
print("=" * 80)

for result in results:
    print(f"{result['optimizer'].upper():<12} {result['accuracy']:<10.4f} "
          f"{result['training_time']:<10.2f} {result['epochs_trained']:<8} "
          f"{result['final_cost']:<12.4f}")

print("\n🏆 Best performing optimizer:", 
      max(results, key=lambda x: x['accuracy'])['optimizer'].upper())

## 🎯 Problem 3: Simple Convolutional Neural Network

**Problem Statement**: Implement basic CNN components for image classification.

**Requirements**:
- Convolution and pooling layers
- Multiple filters and feature maps
- Forward propagation through CNN
- Flattening for fully connected layers
- Handle different padding strategies

**Key Concepts**: Convolution operation, pooling, parameter sharing, translation equivariance (convolution) and approximate translation invariance (pooling)
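
Before reading the implementation, it can help to trace how the spatial size changes from layer to layer. The sketch below uses the standard output-size formula `(n + 2*padding - kernel) // stride + 1`. The helper name `conv_output_size` is hypothetical (it is not part of the notebook's classes), and the layer settings are copied from the `SimpleCNN` defined in the next cell.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output size along one spatial axis for a convolution or pooling window."""
    return (n + 2 * padding - kernel) // stride + 1

# Trace an 8x8 grayscale image through conv(3x3, pad 1) -> pool(2x2, stride 2), twice
size = 8
size = conv_output_size(size, kernel=3, stride=1, padding=1)  # conv1 keeps 8x8
size = conv_output_size(size, kernel=2, stride=2)             # pool1 halves to 4x4
size = conv_output_size(size, kernel=3, stride=1, padding=1)  # conv2 keeps 4x4
size = conv_output_size(size, kernel=2, stride=2)             # pool2 halves to 2x2

num_filters_last = 16  # number of filters in the second conv layer
flattened_size = num_filters_last * size * size
print(f"Final feature map: {size}x{size}, flattened size: {flattened_size}")
```

With padding 1 and a 3x3 filter the convolution preserves the spatial size, so only the pooling layers shrink the feature maps.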

In [None]:
class ConvolutionalLayer:
    """Simple convolutional layer implementation."""
    
    def __init__(self, num_filters, filter_size, stride=1, padding=0):
        self.num_filters = num_filters
        self.filter_size = filter_size
        self.stride = stride
        self.padding = padding
        
        # Initialize filters with small random weights
        self.filters = np.random.randn(num_filters, filter_size, filter_size) * 0.1
        self.biases = np.zeros(num_filters)
    
    def forward(self, input_data):
        """
        Forward pass through convolutional layer.
        input_data shape: (height, width) for single channel or (channels, height, width)
        """
        if len(input_data.shape) == 2:
            # Single channel
            input_data = input_data.reshape(1, input_data.shape[0], input_data.shape[1])
        
        channels, height, width = input_data.shape
        
        # Add padding
        if self.padding > 0:
            padded_input = np.pad(input_data, 
                                ((0, 0), (self.padding, self.padding), (self.padding, self.padding)), 
                                mode='constant')
        else:
            padded_input = input_data
        
        padded_height, padded_width = padded_input.shape[1], padded_input.shape[2]
        
        # Calculate output dimensions
        output_height = (padded_height - self.filter_size) // self.stride + 1
        output_width = (padded_width - self.filter_size) // self.stride + 1
        
        # Initialize output
        output = np.zeros((self.num_filters, output_height, output_width))
        
        # Perform convolution
        for f in range(self.num_filters):
            for i in range(output_height):
                for j in range(output_width):
                    # Extract region
                    h_start = i * self.stride
                    h_end = h_start + self.filter_size
                    w_start = j * self.stride
                    w_end = w_start + self.filter_size
                    
                    region = padded_input[:, h_start:h_end, w_start:w_end]
                    
                    # For simplicity, we'll average across input channels
                    # In practice, filters would have depth equal to input channels
                    region_avg = np.mean(region, axis=0)
                    
                    # Convolution operation
                    output[f, i, j] = np.sum(region_avg * self.filters[f]) + self.biases[f]
        
        return output

class MaxPoolingLayer:
    """Max pooling layer implementation."""
    
    def __init__(self, pool_size=2, stride=None):
        self.pool_size = pool_size
        self.stride = stride if stride is not None else pool_size
    
    def forward(self, input_data):
        """
        Forward pass through max pooling layer.
        input_data shape: (channels, height, width)
        """
        channels, height, width = input_data.shape
        
        # Calculate output dimensions
        output_height = (height - self.pool_size) // self.stride + 1
        output_width = (width - self.pool_size) // self.stride + 1
        
        # Initialize output
        output = np.zeros((channels, output_height, output_width))
        
        # Perform max pooling
        for c in range(channels):
            for i in range(output_height):
                for j in range(output_width):
                    h_start = i * self.stride
                    h_end = h_start + self.pool_size
                    w_start = j * self.stride
                    w_end = w_start + self.pool_size
                    
                    # Max pooling operation
                    output[c, i, j] = np.max(input_data[c, h_start:h_end, w_start:w_end])
        
        return output

class SimpleCNN:
    """Simple CNN for demonstration purposes."""
    
    def __init__(self, input_shape, num_classes):
        self.input_shape = input_shape  # (height, width) for grayscale
        self.num_classes = num_classes
        
        # Define layers
        self.conv1 = ConvolutionalLayer(num_filters=8, filter_size=3, stride=1, padding=1)
        self.pool1 = MaxPoolingLayer(pool_size=2, stride=2)
        
        self.conv2 = ConvolutionalLayer(num_filters=16, filter_size=3, stride=1, padding=1)
        self.pool2 = MaxPoolingLayer(pool_size=2, stride=2)
        
        # Calculate flattened size
        self._calculate_flattened_size()
        
        # Fully connected layers
        self.fc = AdvancedNeuralNetwork(
            layers=[self.flattened_size, 64, num_classes],
            activations=['relu', 'softmax'],
            optimizer='adam',
            random_state=42
        )
    
    def _calculate_flattened_size(self):
        """Calculate the size after convolution and pooling layers."""
        # Simulate forward pass to get dimensions
        dummy_input = np.zeros(self.input_shape)
        
        # Conv1 + Pool1
        conv1_out = self.conv1.forward(dummy_input)
        pool1_out = self.pool1.forward(conv1_out)
        
        # Conv2 + Pool2
        conv2_out = self.conv2.forward(pool1_out)
        pool2_out = self.pool2.forward(conv2_out)
        
        self.flattened_size = pool2_out.size
        print(f"Flattened size after conv layers: {self.flattened_size}")
    
    def forward(self, X):
        """Forward pass through the entire CNN."""
        batch_size = X.shape[0]
        outputs = []
        
        for i in range(batch_size):
            # Single sample forward pass
            sample = X[i]
            
            # Convolutional layers
            conv1_out = self.conv1.forward(sample)
            pool1_out = self.pool1.forward(conv1_out)
            
            conv2_out = self.conv2.forward(pool1_out)
            pool2_out = self.pool2.forward(conv2_out)
            
            # Flatten
            flattened = pool2_out.flatten()
            outputs.append(flattened)
        
        # Stack outputs
        fc_input = np.array(outputs)
        
        # Fully connected layers
        return self.fc.forward_propagation(fc_input)
    
    def fit(self, X, y, epochs=50, batch_size=32, validation_split=0.1):
        """Train the CNN (simplified - only FC layers are trained)."""
        print("Preprocessing data through convolutional layers...")
        
        # Forward pass through conv layers to get features
        features = []
        for i in range(X.shape[0]):
            sample = X[i]
            
            conv1_out = self.conv1.forward(sample)
            pool1_out = self.pool1.forward(conv1_out)
            
            conv2_out = self.conv2.forward(pool1_out)
            pool2_out = self.pool2.forward(conv2_out)
            
            features.append(pool2_out.flatten())
        
        features = np.array(features)
        
        print(f"Feature extraction completed. Shape: {features.shape}")
        print("Training fully connected layers...")
        
        # Train only the fully connected part
        return self.fc.fit(features, y, epochs=epochs, batch_size=batch_size, 
                          validation_split=validation_split)
    
    def predict(self, X):
        """Make predictions."""
        probabilities = self.forward(X)
        return np.argmax(probabilities, axis=1)

# Test Simple CNN
print("🧪 Testing Simple CNN Implementation:")

# Load digits dataset (8x8 grayscale images)
digits = load_digits()
X_digits = digits.data.reshape(-1, 8, 8)  # Reshape to 2D images
y_digits = digits.target

# Use only first 3 classes for simplicity
mask = y_digits < 3
X_digits = X_digits[mask]
y_digits = y_digits[mask]

print(f"Using {len(X_digits)} samples with {len(np.unique(y_digits))} classes")
print(f"Image shape: {X_digits.shape[1:]}")

# Split data
X_train_cnn, X_test_cnn, y_train_cnn, y_test_cnn = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=42
)

# Normalize pixel values
X_train_cnn = X_train_cnn / 16.0  # Max pixel value in digits dataset is 16
X_test_cnn = X_test_cnn / 16.0

# Create CNN
cnn = SimpleCNN(input_shape=(8, 8), num_classes=3)

# Train CNN
print("\nTraining CNN...")
start_time = time.time()
cnn.fit(X_train_cnn, y_train_cnn, epochs=100, batch_size=16, validation_split=0.15)
cnn_training_time = time.time() - start_time

# Make predictions
y_pred_cnn = cnn.predict(X_test_cnn)
cnn_accuracy = accuracy_score(y_test_cnn, y_pred_cnn)

print(f"\n✅ CNN training completed in {cnn_training_time:.2f} seconds")
print(f"CNN Test Accuracy: {cnn_accuracy:.4f}")

# Compare with regular neural network on flattened images
print("\nComparing with regular neural network...")
X_train_flat = X_train_cnn.reshape(X_train_cnn.shape[0], -1)
X_test_flat = X_test_cnn.reshape(X_test_cnn.shape[0], -1)

nn_regular = AdvancedNeuralNetwork(
    layers=[64, 32, 16, 3],
    activations=['relu', 'relu', 'softmax'],
    optimizer='adam',
    random_state=42
)

nn_regular.fit(X_train_flat, y_train_cnn, epochs=100, batch_size=16, 
               validation_split=0.15, verbose=False)
y_pred_regular = nn_regular.predict(X_test_flat)
regular_accuracy = accuracy_score(y_test_cnn, y_pred_regular)

print(f"Regular NN Test Accuracy: {regular_accuracy:.4f}")
print(f"CNN vs Regular NN accuracy difference: {cnn_accuracy - regular_accuracy:+.4f}")

In [None]:
# Visualize CNN results
plt.figure(figsize=(16, 10))

# Plots 1-3: Sample images from each class
for i in range(3):
    class_samples = X_digits[y_digits == i]
    sample_image = class_samples[0]
    plt.subplot(3, 4, i + 1)
    plt.imshow(sample_image, cmap='gray')
    plt.title(f'Class {i} Sample')
    plt.axis('off')

# Plot 4: Confusion matrix
plt.subplot(3, 4, 4)
cm_cnn = confusion_matrix(y_test_cnn, y_pred_cnn)
sns.heatmap(cm_cnn, annot=True, fmt='d', cmap='Blues')
plt.title('CNN Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

# Plot 5: Feature maps from first convolution layer
sample_image = X_test_cnn[0]
conv1_features = cnn.conv1.forward(sample_image)

for i in range(min(4, conv1_features.shape[0])):
    plt.subplot(3, 4, 5 + i)
    plt.imshow(conv1_features[i], cmap='viridis')
    plt.title(f'Conv1 Filter {i+1}')
    plt.axis('off')

# Plot 9: Training curves comparison
plt.subplot(3, 4, 9)
cnn_epochs = range(1, len(cnn.fc.history['cost']) + 1)
regular_epochs = range(1, len(nn_regular.history['cost']) + 1)

plt.plot(cnn_epochs, cnn.fc.history['cost'], label='CNN', alpha=0.8)
plt.plot(regular_epochs, nn_regular.history['cost'], label='Regular NN', alpha=0.8)
plt.xlabel('Epoch')
plt.ylabel('Training Cost')
plt.title('Training Cost Comparison')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 10: Accuracy comparison
plt.subplot(3, 4, 10)
plt.plot(cnn_epochs, cnn.fc.history['accuracy'], label='CNN', alpha=0.8)
plt.plot(regular_epochs, nn_regular.history['accuracy'], label='Regular NN', alpha=0.8)
plt.xlabel('Epoch')
plt.ylabel('Training Accuracy')
plt.title('Training Accuracy Comparison')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 11: Model comparison bar chart
plt.subplot(3, 4, 11)
models = ['CNN', 'Regular NN']
accuracies_comp = [cnn_accuracy, regular_accuracy]
bars = plt.bar(models, accuracies_comp, alpha=0.7, color=['blue', 'orange'])
plt.ylabel('Test Accuracy')
plt.title('Model Performance Comparison')

for bar, acc in zip(bars, accuracies_comp):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{acc:.3f}', ha='center', va='bottom')

plt.grid(True, alpha=0.3)

# Plot 12: Filter weights visualization
plt.subplot(3, 4, 12)
filter_weights = cnn.conv1.filters[0]  # First filter
plt.imshow(filter_weights, cmap='RdBu', vmin=-0.5, vmax=0.5)
plt.title('First Conv Filter Weights')
plt.colorbar()
plt.axis('off')

plt.tight_layout()
plt.show()

print("\n📊 CNN vs Regular NN Comparison:")
print("=" * 50)
print(f"{'Model':<15} {'Accuracy':<12} {'Parameters':<12} {'Time (s)':<10}")
print("=" * 50)

# Calculate parameters (approximate)
cnn_params = (cnn.conv1.filters.size + cnn.conv1.biases.size + 
              cnn.conv2.filters.size + cnn.conv2.biases.size +
              sum(w.size for w in cnn.fc.weights.values()) + 
              sum(b.size for b in cnn.fc.biases.values()))

regular_params = (sum(w.size for w in nn_regular.weights.values()) + 
                  sum(b.size for b in nn_regular.biases.values()))

print(f"{'CNN':<15} {cnn_accuracy:<12.4f} {cnn_params:<12} {cnn_training_time:<10.2f}")
print(f"{'Regular NN':<15} {regular_accuracy:<12.4f} {regular_params:<12} {'N/A':<10}")
print("=" * 50)
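# Use an explicit sign so a negative difference is not printed as "+-0.0123"
print(f"CNN accuracy difference vs. regular NN: {cnn_accuracy - regular_accuracy:+.4f}")
print(f"Parameter count: CNN uses {cnn_params/regular_params:.2f}x the parameters of the regular NN")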

## 🏃‍♂️ Practice Problems

Let's practice some additional neural network concepts that commonly come up in interviews.

In [None]:
# Problem 4: Implement Batch Normalization
class BatchNormalization:
    """Batch Normalization layer implementation."""
    
    def __init__(self, num_features, momentum=0.9, epsilon=1e-5):
        self.num_features = num_features
        self.momentum = momentum
        self.epsilon = epsilon
        
        # Learnable parameters
        self.gamma = np.ones(num_features)  # Scale parameter
        self.beta = np.zeros(num_features)  # Shift parameter
        
        # Running statistics (for inference)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        
        self.training = True
    
    def forward(self, x):
        """Forward pass through batch normalization."""
        if self.training:
            # Training mode: use batch statistics
            batch_mean = np.mean(x, axis=0)
            batch_var = np.var(x, axis=0)
            
            # Normalize
            x_normalized = (x - batch_mean) / np.sqrt(batch_var + self.epsilon)
            
            # Update running statistics
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * batch_mean
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * batch_var
            
        else:
            # Inference mode: use running statistics
            x_normalized = (x - self.running_mean) / np.sqrt(self.running_var + self.epsilon)
        
        # Scale and shift
        return self.gamma * x_normalized + self.beta

# Problem 5: Implement different weight initialization strategies
class WeightInitializer:
    """Different weight initialization strategies."""
    
    @staticmethod
    def xavier_uniform(fan_in, fan_out, shape):
        """Xavier/Glorot uniform initialization."""
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return np.random.uniform(-limit, limit, shape)
    
    @staticmethod
    def xavier_normal(fan_in, fan_out, shape):
        """Xavier/Glorot normal initialization."""
        std = np.sqrt(2.0 / (fan_in + fan_out))
        return np.random.normal(0, std, shape)
    
    @staticmethod
    def he_uniform(fan_in, shape):
        """He uniform initialization (good for ReLU)."""
        limit = np.sqrt(6.0 / fan_in)
        return np.random.uniform(-limit, limit, shape)
    
    @staticmethod
    def he_normal(fan_in, shape):
        """He normal initialization (good for ReLU)."""
        std = np.sqrt(2.0 / fan_in)
        return np.random.normal(0, std, shape)
    
    @staticmethod
    def lecun_uniform(fan_in, shape):
        """LeCun uniform initialization."""
        limit = np.sqrt(3.0 / fan_in)
        return np.random.uniform(-limit, limit, shape)
    
    @staticmethod
    def lecun_normal(fan_in, shape):
        """LeCun normal initialization."""
        std = np.sqrt(1.0 / fan_in)
        return np.random.normal(0, std, shape)

# Test different initialization strategies
print("🧪 Testing Weight Initialization Strategies:")

# Generate a challenging dataset
X_init, y_init = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train_init, X_test_init, y_train_init, y_test_init = train_test_split(
    X_init, y_init, test_size=0.2, random_state=42
)

# Test different initialization methods
init_methods = ['xavier', 'he', 'random']
init_results = []

for init_method in init_methods:
    print(f"\n=== Testing {init_method} initialization ===")
    
    # Create network with different initialization
    nn_init = AdvancedNeuralNetwork(
        layers=[2, 20, 20, 1],
        activations=['relu', 'relu', 'sigmoid'],
        optimizer='adam',
        weight_init=init_method,
        dropout_rate=0.1,
        random_state=42
    )
    
    # Train network
    start_time = time.time()
    nn_init.fit(X_train_init, y_train_init, epochs=200, batch_size=32,
                validation_split=0.1, verbose=False, early_stopping_patience=15)
    training_time = time.time() - start_time
    
    # Evaluate
    y_pred_init = nn_init.predict(X_test_init)
    accuracy_init = accuracy_score(y_test_init, y_pred_init)
    
    init_results.append({
        'method': init_method,
        'accuracy': accuracy_init,
        'training_time': training_time,
        'epochs_trained': len(nn_init.history['cost']),
        'final_cost': nn_init.history['cost'][-1],
        'history': nn_init.history
    })
    
    print(f"Accuracy: {accuracy_init:.4f}")
    print(f"Training time: {training_time:.2f}s")
    print(f"Epochs trained: {len(nn_init.history['cost'])}")

# Problem 6: Gradient checking
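def gradient_check(model, X, y, epsilon=1e-7):
    """Numerical gradient checking for debugging.

    Compares the analytic gradient from backpropagation with a centered
    finite-difference estimate, (J(w + eps) - J(w - eps)) / (2 * eps).
    Run it on a model without dropout (or in inference mode): random
    dropout masks make the two estimates incomparable.
    """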
    print("\n🔍 Performing Gradient Check:")
    
    # Forward pass to compute gradients
    model.forward_propagation(X)
    dW, db = model.backward_propagation(X, y)
    
    # Check gradients for first layer weights (subset)
    W = model.weights[1]
    dW_analytic = dW[1]
    
    # Numerical gradient computation (check a few weights)
    check_indices = [(0, 0), (0, 1), (1, 0)] if W.shape[0] > 1 and W.shape[1] > 1 else [(0, 0)]
    
    for i, j in check_indices:
        # Forward pass with W[i,j] + epsilon
        W[i, j] += epsilon
        pred_plus = model.forward_propagation(X)
        cost_plus = model.compute_cost(y, pred_plus)
        
        # Forward pass with W[i,j] - epsilon
        W[i, j] -= 2 * epsilon
        pred_minus = model.forward_propagation(X)
        cost_minus = model.compute_cost(y, pred_minus)
        
        # Restore original weight
        W[i, j] += epsilon
        
        # Numerical gradient
        dW_numerical = (cost_plus - cost_minus) / (2 * epsilon)
        
        # Compare
        dW_analytic_val = dW_analytic[i, j]
        difference = abs(dW_numerical - dW_analytic_val)
        relative_error = difference / (abs(dW_numerical) + abs(dW_analytic_val) + 1e-8)
        
        print(f"Weight[{i},{j}]: Numerical={dW_numerical:.8f}, "
              f"Analytic={dW_analytic_val:.8f}, "
              f"Relative Error={relative_error:.2e}")
        
        if relative_error < 1e-5:
            print("  ✅ Gradient check passed")
        elif relative_error < 1e-3:
            print("  ⚠️ Gradient check warning (acceptable)")
        else:
            print("  ❌ Gradient check failed")

# Perform gradient check on a small network
print("\n🔍 Gradient Check on Small Network:")
X_small = X_train_init[:10]  # Small batch for gradient check
y_small = y_train_init[:10]

# Create small network for gradient checking
nn_check = AdvancedNeuralNetwork(
    layers=[2, 3, 1],
    activations=['relu', 'sigmoid'],
    optimizer='sgd',
    random_state=42
)

gradient_check(nn_check, X_small, y_small)

print("\n✅ Advanced neural network concepts tested!")

In [None]:
# Visualize initialization and advanced concepts results
plt.figure(figsize=(16, 10))

# Plot 1: Initialization method comparison
plt.subplot(2, 4, 1)
methods = [r['method'] for r in init_results]
accuracies_init = [r['accuracy'] for r in init_results]
bars = plt.bar(methods, accuracies_init, alpha=0.7, color=['blue', 'green', 'red'])
plt.ylabel('Test Accuracy')
plt.title('Weight Initialization Comparison')

for bar, acc in zip(bars, accuracies_init):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{acc:.3f}', ha='center', va='bottom')

plt.grid(True, alpha=0.3)

# Plot 2: Training convergence for different initializations
plt.subplot(2, 4, 2)
for result in init_results:
    epochs = range(1, len(result['history']['cost']) + 1)
    plt.plot(epochs, result['history']['cost'], 
             label=result['method'].title(), alpha=0.8, linewidth=2)

plt.xlabel('Epoch')
plt.ylabel('Training Cost')
plt.title('Convergence by Initialization')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 3: Training time comparison
plt.subplot(2, 4, 3)
times_init = [r['training_time'] for r in init_results]
bars = plt.bar(methods, times_init, alpha=0.7, color='orange')
plt.ylabel('Training Time (s)')
plt.title('Training Time by Initialization')

for bar, time_val in zip(bars, times_init):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
             f'{time_val:.1f}s', ha='center', va='bottom')

plt.grid(True, alpha=0.3)

# Plot 4: Batch normalization effect demonstration
plt.subplot(2, 4, 4)
# Simulate batch normalization effect
np.random.seed(42)
x_before = np.random.normal(5, 3, 1000)  # Shifted and scaled data
bn = BatchNormalization(1)
x_after = bn.forward(x_before.reshape(-1, 1)).flatten()

plt.hist(x_before, bins=30, alpha=0.7, label='Before BatchNorm', density=True)
plt.hist(x_after, bins=30, alpha=0.7, label='After BatchNorm', density=True)
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Batch Normalization Effect')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 5-8: Weight distributions for different initialization methods
init_names = ['Xavier', 'He', 'LeCun', 'Random']
fan_in, fan_out = 10, 20
shape = (fan_out, fan_in)

weight_inits = [
    WeightInitializer.xavier_normal(fan_in, fan_out, shape),
    WeightInitializer.he_normal(fan_in, shape),
    WeightInitializer.lecun_normal(fan_in, shape),
    np.random.normal(0, 1, shape)  # Standard normal
]

for i, (weights, name) in enumerate(zip(weight_inits, init_names)):
    plt.subplot(2, 4, 5 + i)
    plt.hist(weights.flatten(), bins=30, alpha=0.7, density=True)
    plt.xlabel('Weight Value')
    plt.ylabel('Density')
    plt.title(f'{name} Initialization')
    plt.grid(True, alpha=0.3)
    
    # Add statistics
    mean_val = np.mean(weights)
    std_val = np.std(weights)
    plt.text(0.05, 0.95, f'μ={mean_val:.3f}\nσ={std_val:.3f}', 
             transform=plt.gca().transAxes, verticalalignment='top',
             bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()

# Summary of all results
print("\n📊 Complete Neural Network Analysis Summary:")
print("=" * 70)
print("\n🧠 Basic Neural Network:")
print(f"  Architecture: {' -> '.join(map(str, nn.layers))}")
print(f"  Test Accuracy: {accuracy:.4f}")
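# Note: `training_time` was reassigned by the optimizer and initialization loops above,
# so it no longer holds the basic network's training time and is not reported here.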

print("\n⚙️ Optimizer Comparison:")
for result in results:
    print(f"  {result['optimizer'].upper():<10}: {result['accuracy']:.4f} accuracy, {result['training_time']:.1f}s")

print("\n🏗️ Weight Initialization Comparison:")
for result in init_results:
    print(f"  {result['method'].title():<10}: {result['accuracy']:.4f} accuracy, {result['epochs_trained']} epochs")

print("\n🖼️ CNN vs Regular NN:")
print(f"  CNN Accuracy: {cnn_accuracy:.4f}")
print(f"  Regular NN:   {regular_accuracy:.4f}")
print(f"  CNN Advantage: {cnn_accuracy - regular_accuracy:+.4f}")

print("\n🏆 Best Configurations:")
best_optimizer = max(results, key=lambda x: x['accuracy'])
best_init = max(init_results, key=lambda x: x['accuracy'])
print(f"  Best Optimizer: {best_optimizer['optimizer'].upper()} ({best_optimizer['accuracy']:.4f})")
print(f"  Best Initialization: {best_init['method'].title()} ({best_init['accuracy']:.4f})")
print(f"  CNN accuracy difference vs. regular NN: {cnn_accuracy - regular_accuracy:+.4f}")

## 💡 Interview Tips

### 🧠 Neural Network Fundamentals
1. **Understand the mathematics** - Know forward/backward propagation equations
2. **Explain activation functions** - When and why to use each one
3. **Know optimization algorithms** - SGD, Momentum, Adam, RMSprop differences
4. **Regularization techniques** - L1/L2, Dropout (Batch Normalization also has a mild regularizing effect)
5. **Weight initialization** - Why it matters and different strategies

### ⚡ Common Interview Questions
1. **"Implement backpropagation from scratch"**
   - Know chain rule, gradient computation
   - Handle different activation functions
   - Understand matrix dimensions

2. **"Explain vanishing/exploding gradients"**
   - Causes: deep networks, poor initialization, saturating activations
   - Solutions: Better initialization, BatchNorm, ResNets, LSTM/GRU, gradient clipping (for exploding gradients)

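3. **"Why does Adam often converge faster than plain SGD?"**
   - Adaptive learning rates per parameter
   - Momentum and bias correction
   - Better handling of sparse gradients
   - Caveat: well-tuned SGD with momentum can match or exceed Adam's generalization on some tasks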

4. **"How does dropout prevent overfitting?"**
   - Forces network to not rely on specific neurons
   - Creates ensemble effect
   - Only applied during training
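
To make the vanishing-gradient answer concrete, the sketch below backpropagates through a chain of scalar layers `a = g(w * a_prev)` and reports the magnitude of the gradient with respect to the input. It is a toy setup, not the notebook's network: the function `chain_gradient`, the weight of 1.0, and the chosen inputs are all assumptions made for illustration. Because the sigmoid derivative is at most 0.25, the product shrinks at least geometrically with depth, while an active ReLU passes the gradient through unchanged.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def chain_gradient(depth, weight=1.0, x=0.0, activation="sigmoid"):
    """Return |d a_depth / d x| for a chain of scalar layers a = g(weight * a_prev)."""
    a = x
    grad = 1.0
    for _ in range(depth):
        z = weight * a
        if activation == "sigmoid":
            a = sigmoid(z)
            local = a * (1.0 - a)           # sigmoid'(z), never larger than 0.25
        else:                               # ReLU
            a = max(z, 0.0)
            local = 1.0 if z > 0 else 0.0   # gradient is 1 while the unit is active
        grad *= weight * local              # chain rule through this layer
    return abs(grad)

for depth in (1, 5, 10, 20):
    sig = chain_gradient(depth)
    relu = chain_gradient(depth, x=1.0, activation="relu")
    print(f"depth={depth:<3} sigmoid grad={sig:.3e}  relu grad={relu:.3e}")
```

Raising `weight` well above 1 in the ReLU case makes the same product grow with depth, which is the exploding-gradient side of the problem.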

### 🎯 Activation Functions Guide
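- **ReLU**: Default choice for hidden layers, cheap to compute, but units can "die" (output zero for every input)
- **Leaky ReLU**: Keeps a small slope for negative inputs, which mitigates the dead neuron problem
- **Sigmoid**: Output layer for binary classification; saturates for large |z|
- **Tanh**: Zero-centered, but still saturates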
- **Softmax**: Multi-class classification output
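
To illustrate what "saturates" and "dead neuron" mean in the guide above, here is a small sketch that evaluates each activation's derivative at a few points. The function names are chosen for this example only.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2  # peaks at 1.0 when z = 0

def relu_grad(z):
    return (z > 0).astype(float)  # exactly 0 for all non-positive inputs

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)  # small but non-zero for z <= 0

z = np.array([-10.0, 0.0, 10.0])
for name, grad_fn in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad),
                      ("relu", relu_grad), ("leaky_relu", leaky_relu_grad)]:
    print(f"{name:>10}: {grad_fn(z)}")
```

Sigmoid and tanh gradients are nearly zero at |z| = 10 (saturation). ReLU has zero gradient for negative inputs, which is how a unit can die, while Leaky ReLU keeps a small gradient there.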

### 🔧 Debugging Neural Networks
1. **Start simple** - Small network, simple data
2. **Check gradients** - Compare backprop gradients against numerical (finite-difference) estimates
3. **Monitor training** - Loss curves, accuracy plots
4. **Visualization** - Weight distributions, activations
5. **Regularization** - Add gradually if overfitting
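
As a sketch of the gradient-checking step in the list above, the following compares an analytic gradient with a central-difference estimate on a toy logistic-regression loss. The helper names, the toy data, and the tolerance are assumptions made for illustration.

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    """Central-difference estimate of df/dw, one coordinate at a time."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + eps
        f_plus = f(w)
        w.flat[i] = old - eps
        f_minus = f(w)
        w.flat[i] = old  # restore the original value
        grad.flat[i] = (f_plus - f_minus) / (2 * eps)
    return grad

def relative_error(a, b):
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b) + 1e-12)

# Toy problem: binary cross-entropy loss of logistic regression
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

def loss(w):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

p = 1.0 / (1.0 + np.exp(-X @ w))
analytic = X.T @ (p - y) / len(y)   # closed-form gradient of the loss
numeric = numerical_gradient(loss, w)
err = relative_error(analytic, numeric)
print(f"relative error: {err:.2e}")
```

A relative error several orders of magnitude below 1 suggests the analytic gradient is right, while a value near 1 points to a bug. The exact threshold is a judgment call and depends on `eps` and floating-point precision.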

### 📊 Performance Optimization
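- **Batch size**: Larger batches give lower-variance gradient estimates; smaller batches add noise that can act as regularization
- **Learning rate**: Often the single most important hyperparameter; tune it first
- **Architecture**: Trade off depth against width for the same parameter budget
- **Normalization**: BatchNorm, LayerNorm for stability
- **Early stopping**: Stop when validation loss stops improving, to prevent overfitting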
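
Early stopping from the list above can be sketched as a patience counter over per-epoch validation losses. The function name, the `patience` and `min_delta` parameters, and the loss values are hypothetical and chosen for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: remember this epoch and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training used every epoch
    return len(val_losses) - 1, best_epoch

# Illustrative validation losses (made up for this example)
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.60, 0.65, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

In a real training loop, the weights from `best_epoch` would be saved when the improvement is seen and restored after stopping.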

### 🏗️ CNN Concepts
- **Convolution**: Local connectivity, parameter sharing
- **Pooling**: Approximate invariance to small translations, dimension reduction
- **Receptive field**: The region of the input that influences a given output unit
- **Feature maps**: What different filters detect
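
The output-size and receptive-field bookkeeping behind the concepts above can be sketched with two small helpers. The function names and the 8x8 example stack are assumptions chosen for illustration, not the architecture used earlier in this notebook.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    `layers` is a list of (kernel, stride) pairs ordered from input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Example stack on an 8x8 input: conv3x3 (pad 1) -> maxpool2x2 (stride 2) -> conv3x3
size = 8
size = conv_output_size(size, kernel=3, padding=1)  # "same" padding keeps the size
size = conv_output_size(size, kernel=2, stride=2)   # pooling halves the size
size = conv_output_size(size, kernel=3)             # unpadded conv shrinks by 2
rf = receptive_field([(3, 1), (2, 2), (3, 1)])
print(f"final feature map: {size}x{size}, receptive field: {rf}x{rf}")
```

Each layer widens the receptive field by `(kernel - 1)` times the current spacing between units, so strided layers make later kernels cover more of the input.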

### 🎓 Advanced Topics to Know
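- **Batch Normalization**: Originally motivated by reducing internal covariate shift; later analyses attribute much of its benefit to a smoother optimization landscape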
- **Skip connections**: ResNet, gradient flow
- **Attention mechanisms**: Transformer architecture
- **Regularization**: DropConnect, Spectral normalization
- **Optimization**: Learning rate scheduling, warm restarts
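
As one example of learning rate scheduling with warm restarts, here is a minimal sketch of cosine annealing with a fixed restart period. The function name, the default rates, and the fixed-length cycle are simplifying assumptions. Published variants often grow the period after each restart.

```python
import math

def cosine_warm_restarts(step, lr_max=0.1, lr_min=0.001, period=10):
    """Learning rate at `step`, restarting the cosine decay every `period` steps."""
    t = step % period  # position inside the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / period))

schedule = [cosine_warm_restarts(s) for s in range(25)]
for s in (0, 5, 9, 10):
    print(f"step {s:2d}: lr = {schedule[s]:.4f}")
```

The rate decays smoothly within a cycle and jumps back to `lr_max` at each restart (steps 0, 10, 20 here).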

## 🎓 Summary

In this notebook, we covered:

✅ **Neural Network Fundamentals** - Forward/backward propagation, activation functions  
✅ **Advanced Optimizers** - SGD, Momentum, Adam, RMSprop implementations  
✅ **Regularization Techniques** - L1/L2, Dropout, early stopping  
✅ **Weight Initialization** - Xavier, He, LeCun strategies  
✅ **Convolutional Networks** - Conv layers, pooling, feature extraction  
✅ **Batch Normalization** - Normalization for stable training  
✅ **Gradient Checking** - Debugging backpropagation implementation  

### 🚀 Next Steps
1. Practice implementing networks from memory
2. Try different architectures and datasets
3. Move on to TensorFlow/Keras implementations
4. Study advanced architectures (ResNet, Transformer)

### 📚 Additional Practice
- Implement LSTM/GRU for sequence modeling
- Create attention mechanisms
- Build VAEs and GANs
- Implement custom loss functions

### 🔑 Key Interview Points
- **Mathematical Understanding**: Know the equations, not just the APIs
- **Implementation Skills**: Be able to code backpropagation from scratch
- **Debugging Ability**: Be able to diagnose and fix training issues
- **Architecture Knowledge**: Understand different network types
- **Optimization Expertise**: Know when and how to tune hyperparameters

### 🧠 Core Concepts Mastered
- **Forward Propagation**: $a^{[l]} = g^{[l]}(W^{[l]}a^{[l-1]} + b^{[l]})$
- **Backward Propagation**: Chain rule application for gradient computation
- **Cost Functions**: Cross-entropy, MSE, regularization terms
- **Activation Functions**: Non-linearity introduction and properties
- **Optimization**: Gradient-based parameter updates
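
The forward and backward equations listed above can be tied together in one compact sketch: a two-layer network (ReLU hidden layer, sigmoid output, binary cross-entropy), using the column-per-example convention of $z^{[l]} = W^{[l]}a^{[l-1]} + b^{[l]}$. The layer sizes, random data, and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, n_y, m = 4, 5, 1, 8            # input, hidden, output sizes; batch size
X = rng.normal(size=(n_x, m))            # each column is one example
Y = rng.integers(0, 2, size=(n_y, m)).astype(float)

W1 = rng.normal(size=(n_h, n_x)) * np.sqrt(2.0 / n_x)  # He-style scaling for ReLU
b1 = np.zeros((n_h, 1))
W2 = rng.normal(size=(n_y, n_h)) * np.sqrt(1.0 / n_h)
b2 = np.zeros((n_y, 1))

# Forward pass: a[l] = g[l](W[l] a[l-1] + b[l])
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)                   # ReLU
Z2 = W2 @ A1 + b2
A2 = 1.0 / (1.0 + np.exp(-Z2))           # sigmoid
loss = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))

# Backward pass: chain rule, layer by layer
dZ2 = (A2 - Y) / m                       # sigmoid + cross-entropy simplify to this
dW2 = dZ2 @ A1.T
db2 = dZ2.sum(axis=1, keepdims=True)
dA1 = W2.T @ dZ2
dZ1 = dA1 * (Z1 > 0)                     # ReLU derivative
dW1 = dZ1 @ X.T
db1 = dZ1.sum(axis=1, keepdims=True)

# Every gradient must have the same shape as its parameter
for name, grad, param in [("W1", dW1, W1), ("b1", db1, b1),
                          ("W2", dW2, W2), ("b2", db2, b2)]:
    assert grad.shape == param.shape, name
print(f"loss: {loss:.4f}")
```

Checking that each gradient matches its parameter's shape is a quick way to catch transposition mistakes when writing backpropagation by hand.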

**Ready for deep learning frameworks! 🚀**