# PA3: Backpropagation & Neural Networks - Analysis Notebook

## Learning Objectives

In this analysis, you will:

1. **Analyze gradient flow** through neural networks using concrete numerical examples
2. **Compare network architectures** on a challenging nonlinear classification problem
3. **Diagnose training dynamics** and identify common failure modes
4. **Validate your implementations** using gradient checking techniques
5. **Develop professional ML skills** for real-world neural network debugging

## Assignment Structure

You will work with a 2D "Swiss Roll" (spiral) dataset, labeled by distance from the spiral's center, that demonstrates the limitations of linear methods and the power of neural networks. Your task is to analyze how network complexity affects performance and to develop insights about training dynamics.

**Your contributions**: Choose appropriate visualizations, interpret results, and write analytical summaries explaining your findings. The helper functions are provided, but the analysis choices and interpretations are yours.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.animation import FuncAnimation
from IPython.display import HTML
import seaborn as sns
from typing import Dict, List, Tuple
import warnings
warnings.filterwarnings('ignore')

# Import your implementations
from student_code import *
from utils import *

# Set style for professional plots
# (the 'seaborn-v0_8' style name requires matplotlib >= 3.6)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
np.random.seed(42)

## Helper Functions

The following functions are provided to support your analysis. You should use these tools to explore different aspects of neural network training, but the choice of what to analyze and how to interpret results is up to you.

In [None]:
def generate_swiss_roll_2d(n_samples=300, noise=0.1, seed=42):
    """
    Generate 2D Swiss Roll classification data.
    This creates a nonlinear classification problem.
    """
    np.random.seed(seed)
    
    # Generate spiral pattern
    t = np.random.uniform(0, 4*np.pi, n_samples)
    x = t * np.cos(t) + noise * np.random.randn(n_samples)
    y = t * np.sin(t) + noise * np.random.randn(n_samples)
    
    # Create classification based on radius
    radius = np.sqrt(x**2 + y**2)
    labels = (radius > np.median(radius)).astype(int)
    
    X = np.column_stack([x, y])
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # Normalize
    
    return X, labels

def plot_decision_boundary(X, y, predict_func, title="Decision Boundary", ax=None):
    """
    Plot 2D data with decision boundary.
    
    Parameters:
    - X: Feature matrix (n_samples, 2)
    - y: Labels (n_samples,)
    - predict_func: Function that takes X and returns predictions
    - title: Plot title
    - ax: Matplotlib axis (optional)
    """
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=(8, 6))
    
    # Plot data points
    colors = ['red', 'blue']
    for i in range(2):
        mask = y == i
        ax.scatter(X[mask, 0], X[mask, 1], c=colors[i], alpha=0.7, s=50,
                  label=f'Class {i}', edgecolors='black', linewidth=0.5)
    
    # Create decision boundary
    if predict_func is not None:
        x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
        y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
        xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                            np.linspace(y_min, y_max, 100))
        
        mesh_points = np.c_[xx.ravel(), yy.ravel()]
        Z = predict_func(mesh_points)
        Z = Z.reshape(xx.shape)
        
        ax.contour(xx, yy, Z, levels=[0.5], colors='black', linestyles='--', linewidths=2)
        ax.contourf(xx, yy, Z, levels=50, alpha=0.3, cmap='RdYlBu')
    
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.set_title(title)
    ax.legend()
    ax.grid(True, alpha=0.3)

def train_simple_network(X, y, hidden_size=0, learning_rate=0.1, epochs=300, activation='sigmoid'):
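    """
    Train a neural network with 0 or 1 hidden layers using MSE loss.
    
    Parameters:
    - X: Feature matrix (n_samples, n_features)
    - y: Binary labels (n_samples,)
    - hidden_size: 0 for a single-layer network, >0 for a two-layer network
    - learning_rate: Gradient descent step size
    - epochs: Number of training epochs
    - activation: Activation of the single layer (hidden_size=0) or of the
      hidden layer (hidden_size>0); the two-layer output is always sigmoid
    
    Returns:
    - weights dict, loss history, prediction function
    """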
    y_formatted = y.reshape(-1, 1)
    
    if hidden_size == 0:
        # Single layer network
        weights, loss_history = train_single_layer(
            X, y_formatted, activation=activation, loss_type='mse',
            epochs=epochs, learning_rate=learning_rate
        )
        
        def predict_func(X_new):
            u, v = single_layer_forward(X_new, weights['W'], weights['b'], activation)
            return v.flatten()
            
        return weights, loss_history, predict_func
    
    else:
        # Two layer network (manual implementation)
        np.random.seed(42)
        W1 = np.random.randn(X.shape[1], hidden_size) * 0.1
        b1 = np.zeros(hidden_size)
        W2 = np.random.randn(hidden_size, 1) * 0.1
        b2 = np.zeros(1)
        
        loss_history = []
        
        for epoch in range(epochs):
            # Forward pass
            u1, v1 = single_layer_forward(X, W1, b1, activation)
            u2, v2 = single_layer_forward(v1, W2, b2, 'sigmoid')
            
            loss = mse_loss(y_formatted, v2)
            loss_history.append(loss)
            
            # Backward pass
            dL_dv2 = mse_derivative(y_formatted, v2)
            dL_dW2, dL_db2, dL_dv1 = single_layer_backward(dL_dv2, u2, v1, W2, 'sigmoid')
            dL_dW1, dL_db1, dL_dX = single_layer_backward(dL_dv1, u1, X, W1, activation)
            
            # Update weights
            W1 -= learning_rate * dL_dW1
            b1 -= learning_rate * dL_db1
            W2 -= learning_rate * dL_dW2
            b2 -= learning_rate * dL_db2
        
        weights = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
        
        def predict_func(X_new):
            u1, v1 = single_layer_forward(X_new, W1, b1, activation)
            u2, v2 = single_layer_forward(v1, W2, b2, 'sigmoid')
            return v2.flatten()
        
        return weights, loss_history, predict_func

def demonstrate_chain_rule_step_by_step(X_sample, y_sample, weights):
    """
    Trace gradient computation through a single training example.
    Expects single-layer weights (keys 'W' and 'b') and assumes a sigmoid activation.
    Returns detailed breakdown of chain rule application.
    """
    # Forward pass
    W, b = weights['W'], weights['b']
    u = linear_forward(X_sample, W, b)
    v = sigmoid_forward(u)
    loss = mse_loss(np.array([[y_sample]]), v)
    
    # Backward pass with detailed tracking
    dL_dv = mse_derivative(np.array([[y_sample]]), v)
    dv_du = sigmoid_derivative(u)
    dL_du = dL_dv * dv_du
    dL_dW, dL_db, dL_dX = linear_backward(dL_du, X_sample, W)
    
    return {
        'input': X_sample.flatten(),
        'target': y_sample,
        'weights': W.flatten(),
        'bias': b[0],
        'u': u[0, 0],
        'v': v[0, 0],
        'loss': loss,
        'dL_dv': dL_dv[0, 0],
        'dv_du': dv_du[0, 0],
        'dL_du': dL_du[0, 0],
        'dL_dW': dL_dW.flatten(),
        'dL_db': dL_db[0],
        'dL_dX': dL_dX.flatten()
    }

def analyze_training_curves(loss_histories, labels, title="Training Comparison"):
    """
    Plot and analyze multiple training curves.
    """
    plt.figure(figsize=(10, 6))
    
    for loss_history, label in zip(loss_histories, labels):
        plt.plot(loss_history, label=label, linewidth=2)
    
    plt.xlabel('Epoch')
    plt.ylabel('MSE Loss')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.yscale('log')
    plt.show()
    
    # Return final losses for comparison
    return [history[-1] for history in loss_histories]

def gradient_magnitude_analysis(X, y, weights, activation='sigmoid'):
    """
    Analyze gradient magnitudes to understand training dynamics.
    """
    y_formatted = y.reshape(-1, 1)
    
    if 'W1' in weights:  # Two layer
        u1, v1 = single_layer_forward(X, weights['W1'], weights['b1'], activation)
        u2, v2 = single_layer_forward(v1, weights['W2'], weights['b2'], 'sigmoid')
        
        dL_dv2 = mse_derivative(y_formatted, v2)
        dL_dW2, dL_db2, dL_dv1 = single_layer_backward(dL_dv2, u2, v1, weights['W2'], 'sigmoid')
        dL_dW1, dL_db1, dL_dX = single_layer_backward(dL_dv1, u1, X, weights['W1'], activation)
        
        return {
            'layer1_weights': np.linalg.norm(dL_dW1),
            'layer1_bias': np.linalg.norm(dL_db1),
            'layer2_weights': np.linalg.norm(dL_dW2),
            'layer2_bias': np.linalg.norm(dL_db2)
        }
    else:  # Single layer
        u, v = single_layer_forward(X, weights['W'], weights['b'], activation)
        dL_dv = mse_derivative(y_formatted, v)
        dL_dW, dL_db, dL_dX = single_layer_backward(dL_dv, u, X, weights['W'], activation)
        
        return {
            'weights': np.linalg.norm(dL_dW),
            'bias': np.linalg.norm(dL_db)
        }

## Part 1: Data Setup and Initial Analysis

Generate the Swiss Roll dataset and create an initial visualization. 

**Your task**: Examine the data pattern and explain why this would be challenging for linear methods.

In [None]:
# Generate the dataset
X_swiss, y_swiss = generate_swiss_roll_2d(n_samples=300, noise=0.15)

# Visualize the challenge
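# Pass the axis explicitly; otherwise plot_decision_boundary creates its own
# figure and the one created here is left empty.
fig, ax = plt.subplots(figsize=(10, 8))
plot_decision_boundary(X_swiss, y_swiss, predict_func=None, 
                      title="Swiss Roll Classification Challenge", ax=ax)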
plt.show()

print(f"Dataset shape: {X_swiss.shape}")
print(f"Class distribution: {np.bincount(y_swiss)}")

**Analysis Question 1**: Examine the data pattern above. Why would a single linear classifier (one neuron) struggle with this pattern? Write your explanation below:

*[Your answer here]*

## Part 2: Single Neuron Analysis

Train a single neuron on the Swiss Roll data and analyze its performance.

**Your task**: Train the network, visualize results, and analyze the chain rule for a specific example.

In [None]:
# Train single neuron
single_weights, single_loss, single_predict = train_simple_network(
    X_swiss, y_swiss, hidden_size=0, epochs=200
)

# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

plot_decision_boundary(X_swiss, y_swiss, single_predict, 
                      "Single Neuron Decision Boundary", ax=ax1)

ax2.plot(single_loss, linewidth=2)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('MSE Loss')
ax2.set_title('Single Neuron Training Curve')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate accuracy
predictions = single_predict(X_swiss)
binary_preds = (predictions > 0.5).astype(int)
accuracy = np.mean(binary_preds == y_swiss)
print(f"Single neuron accuracy: {accuracy:.3f}")
print(f"Final loss: {single_loss[-1]:.6f}")

### Chain Rule Demonstration

**Your task**: Select a training example and trace the gradient computation step-by-step. Choose an example that helps illustrate the chain rule clearly.
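
The factors returned by `demonstrate_chain_rule_step_by_step` multiply together roughly as in the minimal sketch below. It is self-contained and does not call `student_code`. It assumes a per-example loss of `L = (v - y)**2`, so the scaling may differ from your `mse_derivative`, and the name `chain_rule_single_neuron` is hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def chain_rule_single_neuron(x, y, w, b):
    """Gradient of L = (v - y)**2 with v = sigmoid(w . x + b), for one example.

    Assumption: no 1/2 or 1/n factor in the loss; adjust to match your own code.
    """
    u = float(np.dot(w, x) + b)   # pre-activation
    v = sigmoid(u)                # prediction
    dL_dv = 2.0 * (v - y)         # loss w.r.t. prediction
    dv_du = v * (1.0 - v)         # sigmoid slope at u
    dL_du = dL_dv * dv_du         # error signal at the pre-activation
    return {
        "u": u,
        "v": v,
        "dL_dv": dL_dv,
        "dv_du": dv_du,
        "dL_du": dL_du,
        "dL_dw": dL_du * x,       # du/dw = x
        "dL_db": dL_du,           # du/db = 1
    }

steps = chain_rule_single_neuron(
    x=np.array([1.0, -2.0]), y=1.0, w=np.array([0.5, 0.25]), b=0.0
)
for name, value in steps.items():
    print(f"{name}: {value}")
```

For this input the pre-activation is exactly 0, so the sigmoid slope takes its largest value, 0.25. Examples with a large `|u|` show a much smaller `dv_du`, which shrinks every gradient downstream of it.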

In [None]:
# TODO: Choose an example index and justify your choice
example_idx = None  # Replace with your chosen index

# TODO: Explain why you chose this particular example
print("Example selection rationale:")
print("[Your explanation here]")

# Demonstrate chain rule
if example_idx is not None:
    X_example = X_swiss[example_idx:example_idx+1]
    y_example = y_swiss[example_idx]
    
    chain_rule_data = demonstrate_chain_rule_step_by_step(X_example, y_example, single_weights)
    
    # TODO: Display and interpret the chain rule computation
    print("\nChain Rule Step-by-Step:")
    print(f"Input: {chain_rule_data['input']}")
    print(f"Target: {chain_rule_data['target']}")
    # Add more detailed analysis here

**Analysis Question 2**: Based on your chain rule demonstration, explain how the gradient flows backward through the network. How does each partial derivative contribute to the final weight update?

*[Your answer here]*

## Part 3: Network Architecture Comparison

Compare different network architectures on the same problem.

**Your task**: Train networks with different complexities and analyze their performance differences.
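
One way to put a number on "complexity" is the count of trainable parameters. The sketch below uses the shapes from `train_simple_network` (`W1` is `(n_inputs, hidden_size)`, `W2` is `(hidden_size, 1)`). For the single-layer case it assumes one weight per input plus one bias, which depends on your `train_single_layer`. The helper name `count_parameters` is hypothetical.

```python
def count_parameters(n_inputs, hidden_size):
    """Trainable parameters for a network with one output unit."""
    if hidden_size == 0:
        # Assumed single-layer layout: one weight per input plus one bias
        return n_inputs + 1
    hidden_layer = n_inputs * hidden_size + hidden_size  # W1 and b1
    output_layer = hidden_size + 1                       # W2 and b2
    return hidden_layer + output_layer

for hidden_size in [0, 2, 8, 32]:
    n_params = count_parameters(n_inputs=2, hidden_size=hidden_size)
    print(f"hidden_size={hidden_size:>2}: {n_params} parameters")
```

Reporting this count next to accuracy and final loss makes the complexity versus performance trade-off easier to discuss.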

In [None]:
# TODO: Design your architecture comparison experiment
# Consider: hidden layer sizes, activation functions, training parameters

architectures = [
    # TODO: Define your architecture configurations
    # Example format: {'hidden_size': 0, 'activation': 'sigmoid', 'label': 'Single Neuron'}
]

results = {}
loss_histories = []
labels = []

# TODO: Train each architecture and collect results
for config in architectures:
    # Train network with this configuration
    # Store results for comparison
    pass

# TODO: Create comparison visualizations
# Consider: decision boundaries, training curves, accuracy comparison

**Analysis Question 3**: Which architecture performed best and why? What trade-offs did you observe between model complexity and performance?

*[Your answer here]*

## Part 4: Activation Function Analysis

Compare how different activation functions affect training dynamics.

**Your task**: Train identical architectures with different activation functions and analyze the differences in gradient flow and convergence.
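
Before training, it can help to look at the local slopes that each activation contributes to the backward pass. The sketch below is self-contained and uses the textbook definitions. Your `student_code` versions may treat details such as the ReLU slope at exactly 0 differently, and the function names here are hypothetical.

```python
import numpy as np

def sigmoid_slope(u):
    """Derivative of the sigmoid, s(u) * (1 - s(u))."""
    s = 1.0 / (1.0 + np.exp(-u))
    return s * (1.0 - s)

def relu_slope(u):
    """Derivative of ReLU, taken as 0 at u == 0 (a convention, not a rule)."""
    return (u > 0).astype(float)

u = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
sig = sigmoid_slope(u)
rel = relu_slope(u)

for ui, si, ri in zip(u, sig, rel):
    print(f"u={ui:+.1f}  sigmoid slope={si:.4f}  relu slope={ri:.0f}")
```

The sigmoid slope never exceeds 0.25 and is already below 0.01 at `|u| = 6`, while the ReLU slope is either 0 or 1. Comparing these values with the layer-wise norms from `gradient_magnitude_analysis` is one possible way to connect the activation choice to gradient flow.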

In [None]:
# TODO: Design activation function comparison
activations = ['sigmoid', 'relu']  # Add others if you implement them

activation_results = {}

# TODO: Train networks with different activations
# TODO: Analyze gradient magnitudes during training
# TODO: Compare convergence behavior

for activation in activations:
    # Train and analyze
    pass

**Analysis Question 4**: How do different activation functions affect gradient flow? Which activation function worked best for this problem and why?

*[Your answer here]*

## Part 5: Training Dynamics and Learning Rate Analysis

Investigate how learning rate affects training dynamics.

**Your task**: Test different learning rates and identify common training problems.
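
The typical failure modes are easiest to see on a toy problem first. The sketch below runs gradient descent on `f(w) = w**2`, which is only a stand-in for the network's loss surface. The helper `gd_on_quadratic` is hypothetical, and the extra rate 1.1 is added purely for illustration.

```python
import numpy as np

def gd_on_quadratic(lr, w0=1.0, steps=50):
    """Run gradient descent on f(w) = w**2 and return the loss after each step."""
    w = w0
    losses = []
    for _ in range(steps):
        w -= lr * 2.0 * w      # the gradient of w**2 is 2w
        losses.append(w ** 2)
    return np.array(losses)

for lr in [0.001, 0.01, 0.1, 1.0, 1.1]:
    final_loss = gd_on_quadratic(lr)[-1]
    print(f"lr={lr:<5} final loss={final_loss:.3e}")
```

On this toy problem a very small rate barely moves, 0.1 converges quickly, 1.0 flips the sign of `w` every step without reducing the loss, and anything larger diverges. The thresholds for the Swiss Roll networks will differ, but the same three signatures (slow, oscillating, diverging) are what to look for in the loss curves.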

In [None]:
# TODO: Design learning rate experiment
learning_rates = [0.001, 0.01, 0.1, 1.0]  # Add more if needed

lr_results = {}

# TODO: Train with different learning rates
# TODO: Identify and categorize training problems
# TODO: Create diagnostic visualizations

for lr in learning_rates:
    # Train and diagnose
    pass

**Analysis Question 5**: What training problems did you observe with different learning rates? How would you diagnose these issues in practice?

*[Your answer here]*

## Part 6: Gradient Checking Validation

Use the `simple_gradient_check` function from your `student_code` to validate your backpropagation implementation.

**Your task**: Perform gradient checking on your trained networks and interpret the results.
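
For reference, the idea behind gradient checking can be sketched with central differences. This is a generic, self-contained illustration, not the interface of `simple_gradient_check`, whose signature is not shown here. The helpers `numerical_gradient` and `relative_error`, the random toy data, and the mean-squared loss scaling are all assumptions.

```python
import numpy as np

def numerical_gradient(loss_fn, params, eps=1e-5):
    """Central-difference estimate of d loss / d params (params is a float array)."""
    grad = np.zeros_like(params, dtype=float)
    for idx in np.ndindex(params.shape):
        original = params[idx]
        params[idx] = original + eps
        loss_plus = loss_fn(params)
        params[idx] = original - eps
        loss_minus = loss_fn(params)
        params[idx] = original          # restore the perturbed entry
        grad[idx] = (loss_plus - loss_minus) / (2.0 * eps)
    return grad

def relative_error(a, b):
    """Scale-free difference between two gradient arrays."""
    return np.linalg.norm(a - b) / max(np.linalg.norm(a) + np.linalg.norm(b), 1e-12)

# Toy problem: one sigmoid neuron without bias, loss = mean((v - y)**2)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.integers(0, 2, size=(5, 1)).astype(float)
W = rng.normal(size=(2, 1))

def loss_fn(W_):
    v = 1.0 / (1.0 + np.exp(-(X @ W_)))
    return np.mean((v - y) ** 2)

v = 1.0 / (1.0 + np.exp(-(X @ W)))
analytic = X.T @ (2.0 * (v - y) * v * (1.0 - v)) / len(X)
numeric = numerical_gradient(loss_fn, W)
err = relative_error(analytic, numeric)
print(f"relative error: {err:.2e}")
```

A very small relative error suggests the analytic gradient matches the loss being differentiated. A large one often points to a mismatch in scaling (for example a missing `1/n` or factor of 2) rather than a wrong derivation, so check the loss definition first.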

In [None]:
# TODO: Implement gradient checking analysis
# Use the simple_gradient_check function from your student_code

# Select a subset of data for checking
X_check = X_swiss[:5]
y_check = y_swiss[:5]

# TODO: Define loss function for gradient checking
# TODO: Get analytical gradients from your implementation
# TODO: Use simple_gradient_check to validate
# TODO: Interpret results

**Analysis Question 6**: What did gradient checking reveal about your implementation? How confident are you in your backpropagation code?

*[Your answer here]*

## Part 7: Executive Summary and Professional Insights

**Your task**: Synthesize your findings into a professional analysis that could be shared with colleagues or stakeholders.

### Executive Summary

**Write a 2-3 paragraph summary addressing:**

1. **Problem Complexity**: What made the Swiss Roll challenging for different approaches?
2. **Architecture Insights**: How did network complexity affect solution quality?
3. **Training Considerations**: What practical insights did you gain about neural network training?
4. **Professional Applications**: How would you apply these insights to real-world projects?

*[Your executive summary here]*

### Key Technical Findings

**List your 3-5 most important technical insights:**

1. *[Finding 1]*
2. *[Finding 2]*
3. *[Finding 3]*
4. *[Finding 4]*
5. *[Finding 5]*

### Recommendations for Practice

**Based on your analysis, what practical recommendations would you make for neural network development?**

*[Your recommendations here]*

## Peer Review Questions

**For your peer reviewer:**

1. **Analysis Depth**: Are the architecture comparisons thorough and well-justified? What additional experiments would strengthen the analysis?

2. **Chain Rule Understanding**: Is the gradient flow explanation clear and mathematically sound? How could it be improved?

3. **Professional Insights**: Do the findings translate to practical guidance for neural network development? Are the recommendations actionable?

4. **Experimental Design**: Are the experimental choices (learning rates, architectures, etc.) well-motivated? What would you change?

5. **Communication**: Is the technical content accessible to someone learning neural networks? Where could explanations be clearer?

**Reviewer Guidelines:**
- Focus on the quality of analysis and interpretation, not just correctness
- Consider whether the insights would help someone understand neural networks better
- Suggest specific improvements for unclear sections
- Evaluate the professional relevance of the findings