# Gradient Descent Variants: Implementation & Analysis
**Course:** Deep Neural Network Architectures (21CSE558T)  
**Module:** 2 - Optimization and Regularization  
**Assignment:** Week 4, Day 4 Homework  
**Due:** Before Week 5, Day 1  

---

**© 2025 Prof. Ramesh Babu | SRM University | Data Science and Business Systems (DSBS)**  
*Course Materials for 21CSE558T - Deep Neural Network Architectures*

---

## Learning Objectives
By completing this assignment, you will:
1. **Implement** all three gradient descent variants (batch, stochastic, and mini-batch) on a real dataset
2. **Compare** convergence patterns and computational trade-offs
3. **Analyze** the impact of batch size on optimization performance
4. **Develop** practical intuition for choosing optimization algorithms

## Assignment Structure
- **Part 1:** Implementation Challenge (60%)
- **Part 2:** Experimental Analysis (30%)
- **Part 3:** Written Analysis (10%)

---

## Setup and Data Loading
We'll use the Boston Housing dataset, a classic regression problem with real-world characteristics.

In [None]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
import time
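import pandas as pd
from sklearn.utils import Bunch
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")

In [None]:
# Load and explore the Boston Housing dataset
# Note: sklearn.datasets.load_boston was removed in scikit-learn 1.2, so the data
# is read from the original StatLib file instead (requires network access).
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each record spans two rows in the raw file; stitch them back together
boston = Bunch(
    data=np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]]),
    target=raw_df.values[1::2, 2],
    feature_names=np.array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                            'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']),
)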
X, y = boston.data, boston.target

print("=== Boston Housing Dataset ===")
print(f"Features: {X.shape[1]} (dimensions)")
print(f"Samples: {X.shape[0]} (houses)")
print(f"Target: Housing prices in $1000s")
print(f"\nFeature names: {boston.feature_names}")
print(f"\nPrice statistics:")
print(f"  Min: ${y.min():.1f}k")
print(f"  Max: ${y.max():.1f}k")
print(f"  Mean: ${y.mean():.1f}k")
print(f"  Std: ${y.std():.1f}k")

In [None]:
# Data preprocessing
# 1. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Standardize features (crucial for gradient descent)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 3. Add bias column (intercept term)
X_train_bias = np.column_stack([np.ones(X_train_scaled.shape[0]), X_train_scaled])
X_test_bias = np.column_stack([np.ones(X_test_scaled.shape[0]), X_test_scaled])

print("=== Data Preprocessing Complete ===")
print(f"Training set: {X_train_bias.shape[0]} samples, {X_train_bias.shape[1]} features (including bias)")
print(f"Test set: {X_test_bias.shape[0]} samples")
print(f"Feature scaling: Mean ≈ 0, Std ≈ 1")
print(f"Training data shape: {X_train_bias.shape}")
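
The preprocessing cell calls standardization crucial for gradient descent. One way to see why is to compare the condition number of the MSE Hessian, $(2/m)\,X^\top X$, before and after scaling. The sketch below uses two synthetic features on very different scales (hypothetical stand-ins, not the Boston columns) and standardizes them by hand with NumPy, which should mirror what `StandardScaler` does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (illustrative only)
X_raw = np.column_stack([
    rng.normal(loc=0.5, scale=0.1, size=200),
    rng.normal(loc=400.0, scale=150.0, size=200),
])

# Manual standardization: zero mean, unit (population) standard deviation
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

def hessian_condition_number(X):
    """Condition number of the MSE Hessian (2/m) X^T X for a linear model."""
    m = X.shape[0]
    return np.linalg.cond((2 / m) * X.T @ X)

cond_raw = hessian_condition_number(X_raw)
cond_std = hessian_condition_number(X_std)

print(f"Condition number, raw features:          {cond_raw:.3g}")
print(f"Condition number, standardized features: {cond_std:.3g}")
```

A large condition number means the cost surface is a long, narrow valley, so a single learning rate is either too large for the steep direction or too small for the flat one.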

---
# Part 1: Implementation Challenge (60%)
Implement all three gradient descent variants with proper metrics tracking.
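
All three variants rely on the same MSE gradient, $\nabla J(w) = \frac{2}{m} X^\top (Xw - y)$, evaluated on different subsets of the data. Before comparing optimizers, it can help to confirm that formula against a finite-difference approximation. The following is a minimal sketch on synthetic data; the helper names (`mse_cost`, `mse_gradient`, `numerical_gradient`) are illustrative and not part of the assignment code.

```python
import numpy as np

def mse_cost(X, y, w):
    """Mean squared error of the linear model X @ w."""
    return np.mean((X @ w - y) ** 2)

def mse_gradient(X, y, w):
    """Analytic gradient of the MSE: (2/m) * X^T (Xw - y)."""
    m = X.shape[0]
    return (2 / m) * X.T @ (X @ w - y)

def numerical_gradient(X, y, w, eps=1e-6):
    """Central-difference approximation, one coordinate at a time."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (mse_cost(X, y, w + step) - mse_cost(X, y, w - step)) / (2 * eps)
    return grad

# Small synthetic problem with a bias column, as in the preprocessing step
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
y_demo = rng.normal(size=50)
w_demo = rng.normal(size=4)

max_abs_diff = np.max(np.abs(
    mse_gradient(X_demo, y_demo, w_demo) - numerical_gradient(X_demo, y_demo, w_demo)
))
print(f"Max |analytic - numerical| = {max_abs_diff:.2e}")
```

If the two gradients disagree by more than rounding error, the update rule is likely wrong, whichever variant uses it.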

## 1.1 Batch Gradient Descent Implementation

In [None]:
def batch_gradient_descent(X, y, learning_rate=0.01, epochs=1000, tolerance=1e-6):
    """
    Batch Gradient Descent Implementation
    
    Args:
        X: Feature matrix with bias column (m × n+1)
        y: Target values (m,)
        learning_rate: Step size for parameter updates
        epochs: Maximum number of iterations
        tolerance: Convergence threshold
    
    Returns:
        weights: Learned parameters
        costs: Cost function values over epochs
        training_time: Time taken for training
    """
    m, n = X.shape
    
    # Initialize weights randomly
    weights = np.random.normal(0, 0.01, n)
    
    # Track metrics
    costs = []
    prev_cost = float('inf')
    
    # Start timing
    start_time = time.time()
    
    for epoch in range(epochs):
        # Forward pass: compute predictions for ALL examples
        predictions = X @ weights
        
        # Compute cost (Mean Squared Error)
        cost = np.mean((predictions - y) ** 2)
        costs.append(cost)
        
        # Compute gradients using ALL examples
        gradients = (2/m) * X.T @ (predictions - y)
        
        # Update weights
        weights -= learning_rate * gradients
        
        # Check for convergence
        if abs(prev_cost - cost) < tolerance:
            print(f"BGD converged at epoch {epoch}")
            break
        prev_cost = cost
        
        # Progress reporting
        if epoch % 100 == 0:
            print(f"BGD Epoch {epoch}: Cost = {cost:.6f}")
    
    training_time = time.time() - start_time
    
    return weights, costs, training_time

# Test BGD implementation
print("=== Testing Batch Gradient Descent ===")
weights_bgd, costs_bgd, time_bgd = batch_gradient_descent(
    X_train_bias, y_train, learning_rate=0.1, epochs=1000
)

print(f"\nBGD Results:")
print(f"Final cost: {costs_bgd[-1]:.6f}")
print(f"Training time: {time_bgd:.3f} seconds")
print(f"Total epochs: {len(costs_bgd)}")

## 1.2 Stochastic Gradient Descent Implementation

In [None]:
def stochastic_gradient_descent(X, y, learning_rate=0.01, epochs=1000, tolerance=1e-6):
    """
    Stochastic Gradient Descent Implementation
    
    Args:
        X: Feature matrix with bias column (m × n+1)
        y: Target values (m,)
        learning_rate: Step size for parameter updates
        epochs: Number of full passes over the training data
        tolerance: Unused in this variant; kept so all three functions share a signature
    
    Returns:
        weights: Learned parameters
        costs: Running average of per-example squared errors, one value per epoch
        training_time: Time taken for training
    """
    m, n = X.shape
    
    # Initialize weights randomly
    weights = np.random.normal(0, 0.01, n)
    
    # Track metrics
    costs = []
    
    # Start timing
    start_time = time.time()
    
    for epoch in range(epochs):
        epoch_cost = 0
        
        # Shuffle data each epoch for better convergence
        indices = np.random.permutation(m)
        
        for i in indices:
            # Forward pass: compute prediction for SINGLE example
            x_i = X[i:i+1]  # Keep as 2D array
            y_i = y[i]
            
            prediction = x_i @ weights
            
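            # Compute cost for this example; .item() converts the shape-(1,)
            # array to a Python float so avg_cost can be formatted with :.6f
            cost = ((prediction - y_i) ** 2).item()
            epoch_cost += cost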
            
            # Compute gradients using SINGLE example
            gradient = 2 * x_i.T @ (prediction - y_i)
            
            # Update weights after each example
            weights -= learning_rate * gradient.flatten()
        
        # Average cost for the epoch
        avg_cost = epoch_cost / m
        costs.append(avg_cost)
        
        # Progress reporting
        if epoch % 100 == 0:
            print(f"SGD Epoch {epoch}: Cost = {avg_cost:.6f}")
    
    training_time = time.time() - start_time
    
    return weights, costs, training_time

# Test SGD implementation
print("=== Testing Stochastic Gradient Descent ===")
weights_sgd, costs_sgd, time_sgd = stochastic_gradient_descent(
    X_train_bias, y_train, learning_rate=0.01, epochs=500  # Lower LR and epochs for SGD
)

print(f"\nSGD Results:")
print(f"Final cost: {costs_sgd[-1]:.6f}")
print(f"Training time: {time_sgd:.3f} seconds")
print(f"Total epochs: {len(costs_sgd)}")

## 1.3 Mini-batch Gradient Descent Implementation

In [None]:
def mini_batch_gradient_descent(X, y, batch_size=32, learning_rate=0.01, epochs=1000, tolerance=1e-6):
    """
    Mini-batch Gradient Descent Implementation
    
    Args:
        X: Feature matrix with bias column (m × n+1)
        y: Target values (m,)
        batch_size: Number of examples per batch
        learning_rate: Step size for parameter updates
        epochs: Number of full passes over the training data
        tolerance: Unused in this variant; kept so all three functions share a signature
    
    Returns:
        weights: Learned parameters
        costs: Size-weighted average of the batch costs, one value per epoch
        training_time: Time taken for training
    """
    m, n = X.shape
    
    # Initialize weights randomly
    weights = np.random.normal(0, 0.01, n)
    
    # Track metrics
    costs = []
    
    # Start timing
    start_time = time.time()
    
    for epoch in range(epochs):
        epoch_cost = 0
        num_batches = 0
        
        # Shuffle data each epoch
        indices = np.random.permutation(m)
        
        # Create mini-batches
        for i in range(0, m, batch_size):
            batch_indices = indices[i:i+batch_size]
            X_batch = X[batch_indices]
            y_batch = y[batch_indices]
            
            # Forward pass: compute predictions for BATCH
            predictions = X_batch @ weights
            
            # Compute cost for this batch
            batch_cost = np.mean((predictions - y_batch) ** 2)
            epoch_cost += batch_cost * len(X_batch)  # Weight by batch size
            
            # Compute gradients using BATCH
            gradients = (2/len(X_batch)) * X_batch.T @ (predictions - y_batch)
            
            # Update weights after each batch
            weights -= learning_rate * gradients
            
            num_batches += 1
        
        # Average cost for the epoch
        avg_cost = epoch_cost / m
        costs.append(avg_cost)
        
        # Progress reporting
        if epoch % 100 == 0:
            print(f"Mini-batch GD (bs={batch_size}) Epoch {epoch}: Cost = {avg_cost:.6f}")
    
    training_time = time.time() - start_time
    
    return weights, costs, training_time

# Test Mini-batch GD with different batch sizes
batch_sizes = [16, 32, 64, 128]
mb_results = {}

print("=== Testing Mini-batch Gradient Descent ===")
for bs in batch_sizes:
    print(f"\nTesting batch size {bs}:")
    weights, costs, training_time = mini_batch_gradient_descent(
        X_train_bias, y_train, batch_size=bs, learning_rate=0.05, epochs=500
    )
    
    mb_results[bs] = {
        'weights': weights,
        'costs': costs,
        'time': training_time,
        'final_cost': costs[-1]
    }
    
    print(f"  Final cost: {costs[-1]:.6f}")
    print(f"  Training time: {training_time:.3f} seconds")
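
The three variants differ only in how many examples feed each update. The sketch below isolates the batching logic used in the mini-batch loop and counts the updates per epoch for several batch sizes. The generator name `iterate_minibatches` and the training-set size `m = 404` are assumptions made for illustration.

```python
import numpy as np

def iterate_minibatches(m, batch_size, rng):
    """Yield shuffled index arrays that together cover all m examples once."""
    indices = rng.permutation(m)
    for start in range(0, m, batch_size):
        yield indices[start:start + batch_size]

rng = np.random.default_rng(0)
m = 404  # assumed training-set size, for illustration

updates_per_epoch = {}
for batch_size in (1, 32, 128, m):
    batch_lengths = [len(batch) for batch in iterate_minibatches(m, batch_size, rng)]
    updates_per_epoch[batch_size] = len(batch_lengths)
    print(f"batch_size={batch_size:3d}: {len(batch_lengths):3d} updates, "
          f"last batch has {batch_lengths[-1]} examples")
```

With `batch_size=1` the loop behaves like SGD (one update per example), and with `batch_size=m` it behaves like batch GD (one update per epoch). When `m` is not a multiple of the batch size, the final, smaller batch still counts as an update, so the count is the ceiling of `m / batch_size`.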

---
# Part 2: Experimental Analysis (30%)
Compare performance metrics and create visualizations.

## 2.1 Convergence Comparison

In [None]:
# Create comprehensive convergence comparison
plt.figure(figsize=(15, 10))

# Plot 1: All algorithms together
plt.subplot(2, 3, 1)
plt.plot(costs_bgd[:200], 'b-', linewidth=2, label='Batch GD', alpha=0.8)
plt.plot(costs_sgd[:200], 'r-', linewidth=1, label='Stochastic GD', alpha=0.7)
for bs in [32, 64]:
    plt.plot(mb_results[bs]['costs'][:200], '--', linewidth=1.5, label=f'Mini-batch (bs={bs})')
plt.title('Convergence Comparison (First 200 Epochs)')
plt.xlabel('Epoch')
plt.ylabel('Cost (MSE)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')

# Plot 2: Batch GD detail
plt.subplot(2, 3, 2)
plt.plot(costs_bgd, 'b-', linewidth=2)
plt.title('Batch Gradient Descent')
plt.xlabel('Epoch')
plt.ylabel('Cost (MSE)')
plt.grid(True, alpha=0.3)
plt.yscale('log')

# Plot 3: SGD detail
plt.subplot(2, 3, 3)
plt.plot(costs_sgd, 'r-', linewidth=1, alpha=0.7)
plt.title('Stochastic Gradient Descent')
plt.xlabel('Epoch')
plt.ylabel('Cost (MSE)')
plt.grid(True, alpha=0.3)
plt.yscale('log')

# Plot 4: Mini-batch comparison
plt.subplot(2, 3, 4)
colors = ['green', 'orange', 'purple', 'brown']
for i, bs in enumerate(batch_sizes):
    plt.plot(mb_results[bs]['costs'], color=colors[i], linewidth=1.5, 
             label=f'Batch size {bs}', alpha=0.8)
plt.title('Mini-batch GD: Batch Size Comparison')
plt.xlabel('Epoch')
plt.ylabel('Cost (MSE)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')

# Plot 5: Training time comparison
plt.subplot(2, 3, 5)
methods = ['BGD', 'SGD'] + [f'MB-{bs}' for bs in batch_sizes]
times = [time_bgd, time_sgd] + [mb_results[bs]['time'] for bs in batch_sizes]
colors_bar = ['blue', 'red'] + ['green', 'orange', 'purple', 'brown']

bars = plt.bar(methods, times, color=colors_bar, alpha=0.7)
plt.title('Training Time Comparison')
plt.ylabel('Time (seconds)')
plt.xticks(rotation=45)
# Add value labels on bars
for bar, time_val in zip(bars, times):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
             f'{time_val:.2f}s', ha='center', va='bottom')
plt.grid(True, alpha=0.3)

# Plot 6: Final cost comparison
plt.subplot(2, 3, 6)
final_costs = [costs_bgd[-1], costs_sgd[-1]] + [mb_results[bs]['final_cost'] for bs in batch_sizes]
bars = plt.bar(methods, final_costs, color=colors_bar, alpha=0.7)
plt.title('Final Cost Comparison')
plt.ylabel('Final MSE')
plt.xticks(rotation=45)
# Add value labels on bars
for bar, cost_val in zip(bars, final_costs):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5, 
             f'{cost_val:.1f}', ha='center', va='bottom')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("=== Convergence Analysis Summary ===")
print(f"Batch GD: {len(costs_bgd)} epochs, final cost: {costs_bgd[-1]:.4f}, time: {time_bgd:.2f}s")
print(f"SGD: {len(costs_sgd)} epochs, final cost: {costs_sgd[-1]:.4f}, time: {time_sgd:.2f}s")
for bs in batch_sizes:
    result = mb_results[bs]
    print(f"Mini-batch (bs={bs}): {len(result['costs'])} epochs, final cost: {result['final_cost']:.4f}, time: {result['time']:.2f}s")
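
The jaggedness of the SGD and small-batch curves comes from noise in the gradient estimate. A rough way to quantify it is to measure how far mini-batch gradients fall from the full-batch gradient at a fixed set of weights. The sketch below does this on synthetic data; the data, the helper `gradient_noise`, and the choice of 500 trials are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression problem with a bias column
m, n = 400, 5
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])
true_w = rng.normal(size=n + 1)
y = X @ true_w + rng.normal(scale=0.5, size=m)
w = np.zeros(n + 1)  # fixed point at which all gradients are evaluated

def mse_gradient(Xb, yb, w):
    """MSE gradient on a batch: (2/b) * Xb^T (Xb w - yb)."""
    return (2 / len(Xb)) * Xb.T @ (Xb @ w - yb)

full_grad = mse_gradient(X, y, w)

def gradient_noise(batch_size, trials=500):
    """Mean squared distance between a mini-batch gradient and the full gradient."""
    sq_errors = []
    for _ in range(trials):
        idx = rng.choice(m, size=batch_size, replace=False)
        diff = mse_gradient(X[idx], y[idx], w) - full_grad
        sq_errors.append(np.sum(diff ** 2))
    return float(np.mean(sq_errors))

noise = {bs: gradient_noise(bs) for bs in (1, 16, 64, m)}
for bs, value in noise.items():
    print(f"batch_size={bs:3d}: mean squared gradient error = {value:.4g}")
```

The noise should shrink as the batch grows and vanish (up to rounding) when the batch covers the whole dataset, which is one way to explain smoother convergence at larger batch sizes.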

## 2.2 Performance Evaluation on Test Set

In [None]:
def evaluate_model(weights, X_test, y_test):
    """Evaluate model performance on test set"""
    predictions = X_test @ weights
    mse = mean_squared_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)
    rmse = np.sqrt(mse)
    return {'MSE': mse, 'RMSE': rmse, 'R2': r2, 'predictions': predictions}

# Evaluate all models
print("=== Test Set Performance Evaluation ===")
print("\nModel Performance on Test Set:")
print("-" * 50)

# BGD evaluation
bgd_eval = evaluate_model(weights_bgd, X_test_bias, y_test)
print(f"Batch GD:     MSE = {bgd_eval['MSE']:.3f}, RMSE = {bgd_eval['RMSE']:.3f}, R² = {bgd_eval['R2']:.4f}")

# SGD evaluation
sgd_eval = evaluate_model(weights_sgd, X_test_bias, y_test)
print(f"Stochastic GD: MSE = {sgd_eval['MSE']:.3f}, RMSE = {sgd_eval['RMSE']:.3f}, R² = {sgd_eval['R2']:.4f}")

# Mini-batch evaluations
mb_evals = {}
for bs in batch_sizes:
    mb_eval = evaluate_model(mb_results[bs]['weights'], X_test_bias, y_test)
    mb_evals[bs] = mb_eval
    print(f"Mini-batch {bs:2d}: MSE = {mb_eval['MSE']:.3f}, RMSE = {mb_eval['RMSE']:.3f}, R² = {mb_eval['R2']:.4f}")

print(f"\nBaseline (predicting mean): MSE = {np.var(y_test):.3f}, R² = 0.0000")
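
Alongside the mean baseline, another useful reference is the closed-form least-squares solution. For a linear model it gives the lowest training MSE that any of the gradient descent variants can reach, so a large gap points to too few epochs or a poorly chosen learning rate. The sketch below compares `np.linalg.lstsq` with plain full-batch gradient descent on synthetic data; the data, learning rate, and epoch count are assumptions, not the assignment's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem with a bias column
m = 300
X = np.column_stack([np.ones(m), rng.normal(size=(m, 4))])
y = X @ np.array([22.0, 1.5, -3.0, 0.7, 2.2]) + rng.normal(scale=2.0, size=m)

# Closed-form least squares: minimizes the same MSE that gradient descent descends
w_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
mse_exact = np.mean((X @ w_exact - y) ** 2)

# Full-batch gradient descent on the same problem
w_gd = np.zeros(X.shape[1])
for _ in range(2000):
    w_gd -= 0.05 * (2 / m) * X.T @ (X @ w_gd - y)
mse_gd = np.mean((X @ w_gd - y) ** 2)

print(f"Closed-form MSE:      {mse_exact:.6f}")
print(f"Gradient descent MSE: {mse_gd:.6f}")
print(f"Gap:                  {mse_gd - mse_exact:.2e}")
```

In the notebook, the same comparison could be run with `X_train_bias` and `y_train` in place of the synthetic arrays.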

## 2.3 Prediction Visualization

In [None]:
# Visualize predictions vs actual values
plt.figure(figsize=(15, 10))

# BGD predictions
plt.subplot(2, 3, 1)
plt.scatter(y_test, bgd_eval['predictions'], alpha=0.6, color='blue')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Price ($1000s)')
plt.ylabel('Predicted Price ($1000s)')
plt.title(f'Batch GD\nR² = {bgd_eval["R2"]:.4f}')
plt.grid(True, alpha=0.3)

# SGD predictions
plt.subplot(2, 3, 2)
plt.scatter(y_test, sgd_eval['predictions'], alpha=0.6, color='red')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Price ($1000s)')
plt.ylabel('Predicted Price ($1000s)')
plt.title(f'Stochastic GD\nR² = {sgd_eval["R2"]:.4f}')
plt.grid(True, alpha=0.3)

# Mini-batch predictions (show best performing)
best_mb_bs = max(batch_sizes, key=lambda bs: mb_evals[bs]['R2'])
plt.subplot(2, 3, 3)
plt.scatter(y_test, mb_evals[best_mb_bs]['predictions'], alpha=0.6, color='green')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Price ($1000s)')
plt.ylabel('Predicted Price ($1000s)')
plt.title(f'Mini-batch GD (bs={best_mb_bs})\nR² = {mb_evals[best_mb_bs]["R2"]:.4f}')
plt.grid(True, alpha=0.3)

# Residual plots
plt.subplot(2, 3, 4)
residuals_bgd = y_test - bgd_eval['predictions']
plt.scatter(bgd_eval['predictions'], residuals_bgd, alpha=0.6, color='blue')
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Price ($1000s)')
plt.ylabel('Residuals')
plt.title('BGD Residuals')
plt.grid(True, alpha=0.3)

plt.subplot(2, 3, 5)
residuals_sgd = y_test - sgd_eval['predictions']
plt.scatter(sgd_eval['predictions'], residuals_sgd, alpha=0.6, color='red')
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Price ($1000s)')
plt.ylabel('Residuals')
plt.title('SGD Residuals')
plt.grid(True, alpha=0.3)

plt.subplot(2, 3, 6)
residuals_mb = y_test - mb_evals[best_mb_bs]['predictions']
plt.scatter(mb_evals[best_mb_bs]['predictions'], residuals_mb, alpha=0.6, color='green')
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Price ($1000s)')
plt.ylabel('Residuals')
plt.title(f'Mini-batch (bs={best_mb_bs}) Residuals')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 2.4 Learning Rate Sensitivity Analysis

In [None]:
# Test different learning rates for mini-batch GD
learning_rates = [0.001, 0.01, 0.05, 0.1, 0.2]
lr_results = {}

print("=== Learning Rate Sensitivity Analysis ===")
print("Testing different learning rates with mini-batch GD (batch_size=32)\n")

for lr in learning_rates:
    print(f"Testing learning rate {lr}...")
    try:
        weights, costs, training_time = mini_batch_gradient_descent(
            X_train_bias, y_train, batch_size=32, learning_rate=lr, epochs=200
        )
        
        # Check if training was stable (no NaN or extremely large values)
        if np.any(np.isnan(weights)) or np.any(np.abs(weights) > 1000):
            print(f"  Learning rate {lr}: UNSTABLE (diverged)")
            lr_results[lr] = {'stable': False, 'final_cost': float('inf')}
        else:
            lr_results[lr] = {
                'stable': True,
                'weights': weights,
                'costs': costs,
                'time': training_time,
                'final_cost': costs[-1]
            }
            print(f"  Learning rate {lr}: Final cost = {costs[-1]:.4f}, Time = {training_time:.2f}s")
    except Exception as e:
        print(f"  Learning rate {lr}: ERROR - {str(e)}")
        lr_results[lr] = {'stable': False, 'final_cost': float('inf')}

# Visualize learning rate effects
plt.figure(figsize=(12, 4))

# Plot 1: Convergence curves for different learning rates
plt.subplot(1, 2, 1)
colors = ['blue', 'green', 'orange', 'red', 'purple']
for i, lr in enumerate(learning_rates):
    if lr_results[lr]['stable']:
        plt.plot(lr_results[lr]['costs'], color=colors[i], linewidth=1.5, 
                label=f'LR = {lr}', alpha=0.8)
    else:
        plt.axhline(y=1000, color=colors[i], linestyle='--', 
                   label=f'LR = {lr} (diverged)', alpha=0.5)

plt.title('Learning Rate Sensitivity')
plt.xlabel('Epoch')
plt.ylabel('Cost (MSE)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')

# Plot 2: Final cost vs learning rate
plt.subplot(1, 2, 2)
stable_lrs = [lr for lr in learning_rates if lr_results[lr]['stable']]
stable_costs = [lr_results[lr]['final_cost'] for lr in stable_lrs]

plt.plot(stable_lrs, stable_costs, 'bo-', linewidth=2, markersize=8)
plt.title('Final Cost vs Learning Rate')
plt.xlabel('Learning Rate')
plt.ylabel('Final Cost (MSE)')
plt.grid(True, alpha=0.3)
plt.xscale('log')

# Highlight optimal learning rate
if stable_costs:
    best_lr_idx = np.argmin(stable_costs)
    best_lr = stable_lrs[best_lr_idx]
    best_cost = stable_costs[best_lr_idx]
    plt.plot(best_lr, best_cost, 'ro', markersize=12, alpha=0.7, label=f'Optimal: {best_lr}')
    plt.legend()

plt.tight_layout()
plt.show()

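# best_lr is only defined when at least one learning rate was stable
if stable_costs:
    print(f"\nOptimal learning rate: {best_lr} (Final cost: {best_cost:.4f})")
else:
    print("\nNo stable learning rate found; try smaller values.")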
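
For full-batch gradient descent on the MSE of a linear model, the stable range of learning rates can be predicted rather than found by a sweep. The cost is quadratic with Hessian $H = \frac{2}{m} X^\top X$, and the iteration converges only when the learning rate is below $2 / \lambda_{\max}(H)$. The sketch below checks that bound on synthetic data (an assumption for illustration). It applies to batch GD only; the noisier mini-batch and stochastic updates typically need smaller steps in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem with a bias column
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=(m, 3))])
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=m)

# Hessian of the MSE cost for a linear model: (2/m) X^T X
hessian = (2 / m) * X.T @ X
lambda_max = np.linalg.eigvalsh(hessian).max()
lr_limit = 2 / lambda_max  # full-batch GD on this quadratic diverges above this

def run_bgd(lr, epochs=200):
    """Run full-batch gradient descent from zero weights and return the final MSE."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w -= lr * (2 / m) * X.T @ (X @ w - y)
    return np.mean((X @ w - y) ** 2)

cost_below = run_bgd(0.9 * lr_limit)
cost_above = run_bgd(1.1 * lr_limit)

print(f"Stability limit: lr < {lr_limit:.4f}")
print(f"Final MSE at 0.9 x limit: {cost_below:.4g}")
print(f"Final MSE at 1.1 x limit: {cost_above:.4g}")
```

Applied to `X_train_bias`, the same eigenvalue computation would give a rough upper bound to compare against the learning rates tested in the sweep.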

---
# Part 3: Written Analysis (10%)
Answer the critical questions based on your experimental results.

## Analysis Questions

**Answer these questions based on your experimental results:**

### Question 1: Which variant converged fastest and why?

**Your Answer:** 
*(Replace this with your analysis based on the results above)*

Based on the experimental results:
- **Fastest convergence**: [Analyze which method reached low cost values quickest]
- **Reasons**: [Explain in terms of update frequency, gradient noise, etc.]
- **Trade-offs**: [Discuss what was sacrificed for speed]

### Question 2: How did batch size affect convergence stability?

**Your Answer:**
*(Analyze the mini-batch results with different batch sizes)*

- **Small batch sizes (16-32)**: [Describe convergence pattern]
- **Large batch sizes (64-128)**: [Describe convergence pattern]
- **Stability vs Speed trade-off**: [Explain the relationship]

### Question 3: What learning rates worked best for each variant?

**Your Answer:**
*(Based on your learning rate sensitivity analysis)*

- **SGD optimal LR**: [From your observations]
- **Mini-batch optimal LR**: [From sensitivity analysis]
- **BGD optimal LR**: [From your observations]
- **Why different?**: [Explain the relationship between batch size and learning rate]

### Question 4: When would you choose each variant in practice?

**Your Answer:**
*(Practical recommendations based on different scenarios)*

**Choose Batch GD when:**
- [List scenarios: dataset size, computational resources, etc.]

**Choose SGD when:**
- [List scenarios: memory constraints, online learning, etc.]

**Choose Mini-batch GD when:**
- [List scenarios: most common use cases]

### Additional Insights

**What surprised you in the results?**
*(Discuss any unexpected findings)*

**Real-world implications:**
*(How would this knowledge help in practical deep learning projects?)*

---
# Summary and Conclusions

## Key Findings Summary

Create a final summary table with your results:

In [None]:
# Create comprehensive results summary
import pandas as pd

# Compile all results
summary_data = {
    'Algorithm': ['Batch GD', 'Stochastic GD'] + [f'Mini-batch (bs={bs})' for bs in batch_sizes],
    'Final Training Cost': [costs_bgd[-1], costs_sgd[-1]] + [mb_results[bs]['final_cost'] for bs in batch_sizes],
    'Test MSE': [bgd_eval['MSE'], sgd_eval['MSE']] + [mb_evals[bs]['MSE'] for bs in batch_sizes],
    'Test R²': [bgd_eval['R2'], sgd_eval['R2']] + [mb_evals[bs]['R2'] for bs in batch_sizes],
    'Training Time (s)': [time_bgd, time_sgd] + [mb_results[bs]['time'] for bs in batch_sizes],
    # Ceiling division: the final, smaller batch also triggers an update
    'Updates per Epoch': [1, len(X_train_bias)] + [-(-len(X_train_bias) // bs) for bs in batch_sizes]
}

results_df = pd.DataFrame(summary_data)
results_df = results_df.round(4)

print("=== COMPREHENSIVE RESULTS SUMMARY ===")
print(results_df.to_string(index=False))

# Find best performing algorithm
best_r2_idx = results_df['Test R²'].idxmax()
best_algorithm = results_df.loc[best_r2_idx, 'Algorithm']
best_r2 = results_df.loc[best_r2_idx, 'Test R²']

print(f"\n🏆 Best performing algorithm: {best_algorithm} (R² = {best_r2:.4f})")

# Find fastest algorithm
fastest_idx = results_df['Training Time (s)'].idxmin()
fastest_algorithm = results_df.loc[fastest_idx, 'Algorithm']
fastest_time = results_df.loc[fastest_idx, 'Training Time (s)']

print(f"⚡ Fastest algorithm: {fastest_algorithm} ({fastest_time:.2f} seconds)")

print("\n=== ASSIGNMENT COMPLETION ===")
print("✅ Part 1: Implementation Challenge - Complete")
print("✅ Part 2: Experimental Analysis - Complete") 
print("📝 Part 3: Written Analysis - Complete your answers above")
print("\n📊 All visualizations and metrics generated successfully!")
print("📋 Ready for submission - ensure all markdown analysis sections are filled out.")

---
# Submission Checklist

Before submitting, ensure you have:

- [ ] **Implemented all three gradient descent variants correctly**
- [ ] **Tested multiple batch sizes for mini-batch GD**
- [ ] **Generated all required visualizations**
- [ ] **Completed the learning rate sensitivity analysis**
- [ ] **Answered all analysis questions in Part 3**
- [ ] **Included performance comparison table**
- [ ] **Added your personal insights and conclusions**
- [ ] **Code is well-commented and runs without errors**
- [ ] **All plots have proper titles, labels, and legends**

## Grading Rubric (Total: 100 points)

### Part 1: Implementation (60 points)
- Batch GD implementation (15 points)
- Stochastic GD implementation (15 points) 
- Mini-batch GD implementation (20 points)
- Code quality and documentation (10 points)

### Part 2: Experimental Analysis (30 points)
- Convergence visualizations (10 points)
- Performance metrics (10 points)
- Learning rate analysis (10 points)

### Part 3: Written Analysis (10 points)
- Quality of analysis (5 points)
- Practical insights (5 points)

**Good luck with your assignment! 🚀**