# PA2: Gradient Descent Visualization & Analysis

**Learning Goals:**
- Understand the impact of hyperparameters on gradient descent behavior
- Visualize optimization landscapes and convergence patterns
- Develop intuition for numerical stability vs. accuracy trade-offs
- Practice critical analysis and communication of machine learning results

**Submission Requirements:**
- Complete all code cells and written analysis sections
- Export notebook as PDF for peer review
- Focus on clear explanations and insightful visualizations

---

## Setup & Data Preparation

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from student_code import (
    numerical_derivative,
    numerical_gradient,
    linear_predict,
    mse_cost,
    mse_gradient,
    initialize_weights,
    gradient_descent_step,
    gradient_descent,
    add_intercept,
    generate_synthetic_data,
    has_converged
)

# Set plotting style for clean visualizations
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

print("All functions imported successfully!")
print("Ready for gradient descent analysis...")

## Part 1: Delta-h Sensitivity Analysis

**Objective**: Explore how the step size `h` in numerical derivatives affects both accuracy and numerical stability.

**Key Questions:**
- What happens when `h` is too large? Too small?
- How do we balance numerical precision with approximation error?
- Why does this matter for gradient descent?
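
The trade-off behind these questions can be previewed with a small standalone sketch. It does not use `student_code`; the helpers `forward_diff` and `central_diff` are hypothetical names introduced here for illustration, and the assignment's `numerical_derivative` may use a different scheme. The sketch compares both difference formulas on the cubic f(x) = x^3 + 2x^2 + x at x = 2, where the exact derivative is 21.

```python
import numpy as np

def f(x):
    """Cubic with a known derivative: f'(x) = 3*x**2 + 4*x + 1."""
    return x**3 + 2 * x**2 + x

def forward_diff(func, x, h):
    """Forward difference (hypothetical helper): truncation error shrinks like h."""
    return (func(x + h) - func(x)) / h

def central_diff(func, x, h):
    """Central difference (hypothetical helper): truncation error shrinks like h**2."""
    return (func(x + h) - func(x - h)) / (2 * h)

x0 = 2.0
exact = 3 * x0**2 + 4 * x0 + 1  # analytical derivative at x0

# A handful of step sizes spanning the "too large" and "too small" regimes
sketch_h = np.array([1e-1, 1e-3, 1e-5, 1e-8, 1e-12])
forward_err = np.array([abs(forward_diff(f, x0, h) - exact) for h in sketch_h])
central_err = np.array([abs(central_diff(f, x0, h) - exact) for h in sketch_h])

for h, fe, ce in zip(sketch_h, forward_err, central_err):
    print(f"h={h:.0e}  forward error={fe:.2e}  central error={ce:.2e}")
```

For large `h` the error is dominated by the approximation itself; for very small `h` the subtraction of nearly equal floating-point numbers dominates. Where the minimum falls depends on the difference formula, so treat the numbers printed here as a rough guide rather than a prediction for your own implementation.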

In [None]:
# Define a test function where we know the exact derivative
def test_function(x):
    """f(x) = x^3 + 2*x^2 + x, so f'(x) = 3*x^2 + 4*x + 1"""
    return x[0]**3 + 2*x[0]**2 + x[0]

def true_derivative(x_val):
    """Analytical derivative of test_function"""
    return 3*x_val**2 + 4*x_val + 1

# Test point
x_test = np.array([2.0])
true_deriv = true_derivative(x_test[0])

print(f"Test point: x = {x_test[0]}")
print(f"True derivative: f'({x_test[0]}) = {true_deriv}")

In [None]:
# Test different h values
h_values = np.logspace(-12, 0, 50)  # From 1e-12 to 1
numerical_derivatives = []
errors = []

for h in h_values:
    numerical_deriv = numerical_derivative(test_function, x_test, dimension=0, h=h)
    error = abs(numerical_deriv - true_deriv)
    
    numerical_derivatives.append(numerical_deriv)
    errors.append(error)

numerical_derivatives = np.array(numerical_derivatives)
errors = np.array(errors)

In [None]:
# Visualize the results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Plot 1: Numerical derivative vs h
ax1.semilogx(h_values, numerical_derivatives, 'b-', linewidth=2, label='Numerical Derivative')
ax1.axhline(y=true_deriv, color='r', linestyle='--', linewidth=2, label='True Derivative')
ax1.set_xlabel('Step Size (h)')
ax1.set_ylabel('Derivative Value')
ax1.set_title('Numerical Derivative vs Step Size')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Error vs h
ax2.loglog(h_values, errors, 'g-', linewidth=2, label='Absolute Error')
ax2.set_xlabel('Step Size (h)')
ax2.set_ylabel('Absolute Error')
ax2.set_title('Numerical Derivative Error vs Step Size')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find optimal h
optimal_idx = np.argmin(errors)
optimal_h = h_values[optimal_idx]
min_error = errors[optimal_idx]

print(f"\nOptimal h: {optimal_h:.2e}")
print(f"Minimum error: {min_error:.2e}")

### 1.1 Delta-h Analysis Questions

**Answer the following based on your plots above:**

1. **Large h behavior**: What happens to numerical derivative accuracy when h is too large (h > 0.1)? Why?

*Your answer here:*

2. **Small h behavior**: What happens when h becomes very small (h < 1e-10)? What causes this?

*Your answer here:*

3. **Sweet spot**: Why is there an optimal h value, and where does it fall in your plot (often around 1e-5 to 1e-6 for a central difference)? What two types of error are being balanced?

*Your answer here:*

4. **Practical implications**: How does this analysis inform your choice of h in gradient descent? What could go wrong if you choose h poorly?

*Your answer here:*

---

## Part 2: Learning Rate Exploration

**Objective**: Understand how learning rate affects gradient descent convergence, stability, and speed.

**Key Concepts:**
- Convergence speed vs. stability trade-offs
- Oscillation and divergence patterns
- The "Goldilocks zone" for learning rates
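
These concepts can be seen in isolation on a one-dimensional toy problem. The sketch below is a hedged illustration, not part of `student_code`: `run_quadratic_gd` is a hypothetical helper that runs gradient descent on J(w) = w^2. Each update multiplies `w` by `(1 - 2 * lr)`, so for this particular cost the iterates shrink slowly for tiny rates, alternate in sign once `lr` exceeds 0.5, and grow without bound once `lr` exceeds 1. The thresholds are specific to this toy cost and will differ for the assignment's dataset.

```python
def run_quadratic_gd(lr, w0=1.0, steps=20):
    """Gradient descent on J(w) = w**2, whose gradient is 2*w (toy example)."""
    w = w0
    trajectory = [w]
    for _ in range(steps):
        w = w - lr * 2 * w  # equivalent to w *= (1 - 2 * lr)
        trajectory.append(w)
    return trajectory

for toy_lr in (0.01, 0.4, 0.9, 1.1):
    traj = run_quadratic_gd(toy_lr)
    print(f"lr={toy_lr}: w after 1 step={traj[1]:+.3f}, final |w|={abs(traj[-1]):.3e}")
```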

In [None]:
# Generate synthetic dataset for consistent comparison
np.random.seed(42)  # Fixed seed for reproducible analysis
X, y = generate_synthetic_data(100, 2, noise=0.1, seed=42)
X_with_intercept = add_intercept(X)

print(f"Dataset shape: X={X.shape}, y={y.shape}")
print(f"X with intercept shape: {X_with_intercept.shape}")
print(f"Features range: [{X.min():.2f}, {X.max():.2f}]")
print(f"Target range: [{y.min():.2f}, {y.max():.2f}]")

In [None]:
# Test different learning rates
learning_rates = [0.001, 0.01, 0.05, 0.1, 0.2, 0.5]
epochs = 200

# Store results for each learning rate
lr_results = {}

for lr in learning_rates:
    print(f"Training with learning rate: {lr}")
    
    final_weights, cost_history = gradient_descent(
        X_with_intercept, y, learning_rate=lr, epochs=epochs
    )
    
    lr_results[lr] = {
        'weights': final_weights,
        'costs': cost_history,
        'final_cost': cost_history[-1],
        'converged': has_converged(cost_history, tolerance=1e-6)
    }
    
    print(f"  Final cost: {cost_history[-1]:.6f}")
    print(f"  Converged: {lr_results[lr]['converged']}")
    print(f"  Final weights: [{final_weights[0]:.3f}, {final_weights[1]:.3f}, {final_weights[2]:.3f}]")
    print()

In [None]:
# Visualize learning rate effects
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

colors = plt.cm.viridis(np.linspace(0, 1, len(learning_rates)))

for i, lr in enumerate(learning_rates):
    costs = lr_results[lr]['costs']
    
    axes[i].plot(costs, color=colors[i], linewidth=2)
    axes[i].set_title(f'Learning Rate: {lr}')
    axes[i].set_xlabel('Epoch')
    axes[i].set_ylabel('Cost')
    axes[i].grid(True, alpha=0.3)
    
    # Add convergence info
    final_cost = lr_results[lr]['final_cost']
    converged = lr_results[lr]['converged']
    status = "Converged" if converged else "Not Converged"
    axes[i].text(0.05, 0.95, f'Final: {final_cost:.4f}\n{status}', 
                transform=axes[i].transAxes, verticalalignment='top',
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

plt.tight_layout()
plt.show()

In [None]:
# Compare all learning rates on one plot
plt.figure(figsize=(12, 8))

for i, lr in enumerate(learning_rates):
    costs = lr_results[lr]['costs']
    plt.plot(costs, color=colors[i], linewidth=2, label=f'LR = {lr}')

plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.title('Cost Curves for Different Learning Rates')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')  # Log scale to see all curves clearly
plt.show()

# Summary table
print("\n" + "="*60)
print("LEARNING RATE COMPARISON SUMMARY")
print("="*60)
print(f"{'LR':<8} {'Final Cost':<12} {'Converged':<10} {'Final Weights':<25}")
print("-"*60)
for lr in learning_rates:
    result = lr_results[lr]
    weights_str = f"[{result['weights'][0]:.2f}, {result['weights'][1]:.2f}, {result['weights'][2]:.2f}]"
    print(f"{lr:<8} {result['final_cost']:<12.6f} {result['converged']!s:<10} {weights_str:<25}")

### 2.1 Learning Rate Analysis Questions

**Analyze your results and answer the following:**

1. **Too small learning rate**: What happens with very small learning rates (0.001)? What are the pros and cons?

*Your answer here:*

2. **Too large learning rate**: What happens with large learning rates (0.5)? Do you see oscillations or divergence?

*Your answer here:*

3. **Optimal range**: Which learning rate(s) work best for this problem? What makes them "just right"?

*Your answer here:*

4. **Practical guidelines**: Based on this analysis, how would you choose a learning rate for a new problem?

*Your answer here:*

---

## Part 3: Standardized Stopping Condition Analysis

**Objective**: Using identical starting conditions, analyze when to stop gradient descent and justify your reasoning.

**Setup**: Everyone will use the same data, initial weights, and learning rate. Your task is to determine the optimal stopping point and defend your choice.
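
One common way to turn "when to stop" into a concrete rule is to require that the cost change stays below a tolerance for several consecutive epochs. The sketch below is an assumption-laden illustration: `first_stopping_epoch` and its `patience` parameter are hypothetical and are not claimed to match how `has_converged` in `student_code` works. The cost curve is synthetic data made up for the demo.

```python
import numpy as np

def first_stopping_epoch(costs, tolerance=1e-6, patience=5):
    """Return the first epoch index at which the absolute cost change has
    stayed below `tolerance` for `patience` consecutive epochs, else None."""
    costs = np.asarray(costs, dtype=float)
    changes = np.abs(np.diff(costs))
    streak = 0
    for i, change in enumerate(changes):
        streak = streak + 1 if change < tolerance else 0
        if streak >= patience:
            return i + 1  # changes[i] compares costs[i] and costs[i + 1]
    return None

# Synthetic, exponentially decaying cost curve (illustrative data only)
demo_costs = 0.5 + np.exp(-0.05 * np.arange(500))
for tol in (1e-3, 1e-5):
    print(f"tolerance={tol:.0e}: stop at epoch {first_stopping_epoch(demo_costs, tolerance=tol)}")
```

A tighter tolerance pushes the stopping epoch later, so the choice of tolerance is itself part of the decision you are asked to justify.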

In [None]:
# STANDARDIZED SETUP - DO NOT MODIFY
# Everyone uses identical conditions for fair comparison

np.random.seed(12345)  # Fixed seed
X_std, y_std = generate_synthetic_data(150, 3, noise=0.2, seed=12345)
X_std_intercept = add_intercept(X_std)

# Fixed initial conditions
initial_weights = np.array([0.1, -0.1, 0.05, -0.05])  # [bias, w1, w2, w3]
learning_rate = 0.02
max_epochs = 2000

print("STANDARDIZED CONDITIONS:")
print(f"Dataset: {X_std.shape[0]} samples, {X_std.shape[1]} features")
print(f"Initial weights: {initial_weights}")
print(f"Learning rate: {learning_rate}")
print(f"Maximum epochs: {max_epochs}")
print("\nRunning gradient descent...")

In [None]:
# Run gradient descent with detailed tracking
weights = initial_weights.copy()
cost_history_std = []
weight_history = [weights.copy()]

for epoch in range(max_epochs):
    weights, cost = gradient_descent_step(X_std_intercept, y_std, weights, learning_rate)
    cost_history_std.append(cost)
    
    # Store weights after every 50th update for trajectory analysis
    if (epoch + 1) % 50 == 0:
        weight_history.append(weights.copy())

final_weights_std = weights
print(f"Training completed: {len(cost_history_std)} epochs")
print(f"Final cost: {cost_history_std[-1]:.6f}")
print(f"Final weights: {final_weights_std}")

In [None]:
# Detailed cost curve analysis
plt.figure(figsize=(15, 10))

# Main cost curve
plt.subplot(2, 2, 1)
plt.plot(cost_history_std, 'b-', linewidth=1)
plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.title('Full Training Cost Curve')
plt.grid(True, alpha=0.3)

# Zoomed view of later epochs
plt.subplot(2, 2, 2)
start_idx = max(0, len(cost_history_std) - 500)
plt.plot(range(start_idx, len(cost_history_std)), cost_history_std[start_idx:], 'r-', linewidth=1)
plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.title('Final 500 Epochs (Detailed View)')
plt.grid(True, alpha=0.3)

# Cost differences between consecutive epochs
plt.subplot(2, 2, 3)
cost_diffs = np.diff(cost_history_std)
plt.plot(cost_diffs, 'g-', linewidth=1)
plt.xlabel('Epoch')
plt.ylabel('Cost Change')
plt.title('Cost Change Per Epoch')
plt.grid(True, alpha=0.3)

# Log scale for better visibility
plt.subplot(2, 2, 4)
plt.semilogy(cost_history_std, 'purple', linewidth=1)
plt.xlabel('Epoch')
plt.ylabel('Cost (log scale)')
plt.title('Cost Curve (Log Scale)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Analysis tools for stopping condition
def analyze_convergence_metrics(costs, window_size=100):
    """Calculate various convergence metrics"""
    costs = np.array(costs)
    
    # Moving average of cost changes
    cost_changes = np.abs(np.diff(costs))
    moving_avg_change = np.convolve(cost_changes, np.ones(window_size)/window_size, mode='valid')
    
    # Relative improvement (reuses the absolute changes; 1e-10 guards against division by zero)
    rel_improvements = cost_changes / (costs[:-1] + 1e-10)
    
    # Cost variance in recent window
    recent_variance = []
    for i in range(window_size, len(costs)):
        window_costs = costs[i-window_size:i]
        recent_variance.append(np.var(window_costs))
    
    return {
        'moving_avg_change': moving_avg_change,
        'relative_improvements': rel_improvements,
        'recent_variance': np.array(recent_variance)
    }

metrics = analyze_convergence_metrics(cost_history_std, window_size=50)

# Plot convergence metrics
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Moving average of cost changes
axes[0,0].plot(metrics['moving_avg_change'])
axes[0,0].set_title('Moving Average of Cost Changes')
axes[0,0].set_xlabel('Epoch')
axes[0,0].set_ylabel('Average |Cost Change|')
axes[0,0].grid(True, alpha=0.3)

# Relative improvements
axes[0,1].semilogy(metrics['relative_improvements'])
axes[0,1].set_title('Relative Cost Improvements')
axes[0,1].set_xlabel('Epoch')
axes[0,1].set_ylabel('|ΔCost| / Cost')
axes[0,1].grid(True, alpha=0.3)

# Recent variance
axes[1,0].semilogy(metrics['recent_variance'])
axes[1,0].set_title('Cost Variance in Recent Window')
axes[1,0].set_xlabel('Epoch')
axes[1,0].set_ylabel('Variance')
axes[1,0].grid(True, alpha=0.3)

# Convergence test with different tolerances
tolerances = [1e-3, 1e-4, 1e-5, 1e-6, 1e-7]
convergence_epochs = []

for tol in tolerances:
    for epoch in range(10, len(cost_history_std)):
        partial_history = cost_history_std[:epoch+1]
        if has_converged(partial_history, tolerance=tol):
            convergence_epochs.append(epoch)
            break
    else:
        convergence_epochs.append(len(cost_history_std))

axes[1,1].semilogx(tolerances, convergence_epochs, 'ro-', linewidth=2, markersize=8)
axes[1,1].set_title('Convergence Epoch vs Tolerance')
axes[1,1].set_xlabel('Tolerance')
axes[1,1].set_ylabel('Convergence Epoch')
axes[1,1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("CONVERGENCE ANALYSIS:")
for i, tol in enumerate(tolerances):
    epoch = convergence_epochs[i]
    if epoch < len(cost_history_std):
        print(f"Tolerance {tol:.0e}: Converged at epoch {epoch} (cost: {cost_history_std[epoch]:.6f})")
    else:
        print(f"Tolerance {tol:.0e}: Did not converge within {max_epochs} epochs")

### 3.1 Your Stopping Condition Decision

**Based on the standardized run above, determine your optimal stopping point and justify your choice.**

#### Your Recommended Stopping Point:

**Epoch:** *(Enter your chosen epoch here)*

**Cost at stopping point:** *(Enter the cost at your chosen epoch)*

**Tolerance threshold:** *(What tolerance would you recommend?)*

#### Your Justification:

**1. Why did you choose this stopping point?**

*Your reasoning here - consider computational cost, diminishing returns, practical considerations*

**2. What evidence from the plots supports your decision?**

*Reference specific metrics, curves, or patterns that informed your choice*

**3. What are the trade-offs of your choice?**

*What do you gain and what do you potentially sacrifice with this stopping point?*

**4. How would you defend this choice to a colleague?**

*Present your argument as if explaining to another data scientist*

---

## Part 4: Comparative Studies

**Objective**: Compare different aspects of gradient descent to understand their relative importance.

### 4.1 Initialization Method Comparison

In [None]:
# Compare different weight initialization methods
np.random.seed(42)  # For reproducible comparison
X_init, y_init = generate_synthetic_data(80, 2, noise=0.15, seed=42)
X_init_intercept = add_intercept(X_init)

init_methods = ['zeros', 'random', 'small_random']
init_results = {}
epochs = 150
lr = 0.05

for method in init_methods:
    print(f"Testing initialization: {method}")
    
    # Reset random seed before each method for fair comparison of random methods
    np.random.seed(42)
    
    # Initialize weights (kept under a separate name so Part 3's `initial_weights` is not overwritten)
    init_weights = initialize_weights(X_init_intercept.shape[1], method=method)
    
    print(f"  Initial weights: {init_weights}")
    
    # Run gradient descent manually to track from specific initial weights
    weights = init_weights.copy()
    costs = []
    
    for epoch in range(epochs):
        weights, cost = gradient_descent_step(X_init_intercept, y_init, weights, lr)
        costs.append(cost)
    
    init_results[method] = {
        'initial_weights': init_weights,
        'final_weights': weights,
        'costs': costs,
        'final_cost': costs[-1]
    }
    
    print(f"  Final cost: {costs[-1]:.6f}")
    print(f"  Final weights: {weights}")
    print()

In [None]:
# Visualize initialization comparison
plt.figure(figsize=(12, 8))

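init_colors = ['blue', 'red', 'green']  # separate name so Part 2's `colors` array is not overwritten
for i, method in enumerate(init_methods):
    costs = init_results[method]['costs']
    label = f"{method.replace('_', ' ').title()} Init"  # e.g. 'Small Random Init'
    plt.plot(costs, color=init_colors[i], linewidth=2, label=label)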

plt.xlabel('Epoch')
plt.ylabel('Cost')
plt.title('Cost Curves for Different Weight Initializations')
plt.legend()
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()

# Summary comparison
print("INITIALIZATION COMPARISON:")
print("-" * 70)
print(f"{'Method':<15} {'Initial Weights':<25} {'Final Cost':<12} {'Convergence':<12}")
print("-" * 70)
for method in init_methods:
    result = init_results[method]
    init_str = f"[{result['initial_weights'][0]:.3f}, {result['initial_weights'][1]:.3f}, {result['initial_weights'][2]:.3f}]"
    converged = has_converged(result['costs'])
    print(f"{method:<15} {init_str:<25} {result['final_cost']:<12.6f} {converged!s:<12}")

### 4.2 Feature Scaling Impact
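
One way to quantify how feature scales shape the optimization landscape is the condition number of `X.T @ X / n`, which is proportional to the Hessian of the MSE cost for a linear model. The sketch below is a standalone illustration under stated assumptions: it builds its own small dataset with NumPy, and `hessian_condition_number` is a hypothetical helper rather than part of `student_code`. A large condition number means the cost surface is a long, narrow valley in which a single learning rate cannot suit every direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_demo = 200
small = rng.uniform(0, 1, n_demo)        # feature on a 0-1 scale
large = rng.uniform(1000, 2000, n_demo)  # feature on a 1000-2000 scale

raw_features = np.column_stack([small, large])
scaled_features = (raw_features - raw_features.mean(axis=0)) / raw_features.std(axis=0)

# Prepend a column of ones as the intercept term
X_raw_demo = np.column_stack([np.ones(n_demo), raw_features])
X_scaled_demo = np.column_stack([np.ones(n_demo), scaled_features])

def hessian_condition_number(X):
    """Condition number of X^T X / n (proportional to the MSE Hessian)."""
    return np.linalg.cond(X.T @ X / X.shape[0])

cond_raw = hessian_condition_number(X_raw_demo)
cond_scaled = hessian_condition_number(X_scaled_demo)
print(f"Condition number, unscaled: {cond_raw:.3e}")
print(f"Condition number, scaled:   {cond_scaled:.3e}")
```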

In [None]:
# Create dataset with features on very different scales
np.random.seed(42)
n_samples = 100

# Feature 1: Small scale (0-1)
feature1 = np.random.uniform(0, 1, n_samples)
# Feature 2: Large scale (1000-2000) 
feature2 = np.random.uniform(1000, 2000, n_samples)

X_unscaled = np.column_stack([feature1, feature2])
true_weights = np.array([2.0, 0.001])  # Compensate for scale difference
y_scale = X_unscaled @ true_weights + 0.1 * np.random.randn(n_samples)

# Create scaled version (standardization)
X_scaled = (X_unscaled - X_unscaled.mean(axis=0)) / X_unscaled.std(axis=0)

# Add intercepts
X_unscaled_intercept = add_intercept(X_unscaled)
X_scaled_intercept = add_intercept(X_scaled)

print("FEATURE SCALING EXPERIMENT:")
print(f"Unscaled features - Feature 1 range: [{feature1.min():.3f}, {feature1.max():.3f}]")
print(f"Unscaled features - Feature 2 range: [{feature2.min():.0f}, {feature2.max():.0f}]")
print(f"Scaled features - Feature 1 range: [{X_scaled[:,0].min():.3f}, {X_scaled[:,0].max():.3f}]")
print(f"Scaled features - Feature 2 range: [{X_scaled[:,1].min():.3f}, {X_scaled[:,1].max():.3f}]")
print(f"Target range: [{y_scale.min():.3f}, {y_scale.max():.3f}]")

In [None]:
# Train on both scaled and unscaled data
epochs = 300
lr = 0.01  # Same learning rate for both; the unscaled run may overflow and emit RuntimeWarnings

# Unscaled training
print("Training on unscaled features...")
weights_unscaled, costs_unscaled = gradient_descent(
    X_unscaled_intercept, y_scale, learning_rate=lr, epochs=epochs
)

# Scaled training  
print("Training on scaled features...")
weights_scaled, costs_scaled = gradient_descent(
    X_scaled_intercept, y_scale, learning_rate=lr, epochs=epochs
)

print(f"\nUnscaled - Final cost: {costs_unscaled[-1]:.6f}")
print(f"Scaled - Final cost: {costs_scaled[-1]:.6f}")
print(f"\nUnscaled - Final weights: {weights_unscaled}")
print(f"Scaled - Final weights: {weights_scaled}")

In [None]:
# Visualize feature scaling impact
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Cost curves comparison
ax1.plot(costs_unscaled, 'r-', linewidth=2, label='Unscaled Features')
ax1.plot(costs_scaled, 'b-', linewidth=2, label='Scaled Features')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Cost')
ax1.set_title('Feature Scaling Impact on Convergence')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_yscale('log')

# Gradient magnitudes (approximate)
# Replay training once and record the gradient norm every 20 epochs.
# Assumes gradient_descent also starts from zero weights.
sample_epochs = range(0, epochs, 20)
unscaled_grad_norms = []
scaled_grad_norms = []

w_unscaled = initialize_weights(3, 'zeros')
w_scaled = initialize_weights(3, 'zeros')

for epoch in range(epochs):
    if epoch in sample_epochs:
        # Gradient norm at the weights reached after `epoch` updates
        grad_unscaled = mse_gradient(X_unscaled_intercept, y_scale, w_unscaled)
        grad_scaled = mse_gradient(X_scaled_intercept, y_scale, w_scaled)
        unscaled_grad_norms.append(np.linalg.norm(grad_unscaled))
        scaled_grad_norms.append(np.linalg.norm(grad_scaled))
    
    w_unscaled, _ = gradient_descent_step(X_unscaled_intercept, y_scale, w_unscaled, lr)
    w_scaled, _ = gradient_descent_step(X_scaled_intercept, y_scale, w_scaled, lr)

ax2.plot(sample_epochs, unscaled_grad_norms, 'r-', linewidth=2, label='Unscaled Features')
ax2.plot(sample_epochs, scaled_grad_norms, 'b-', linewidth=2, label='Scaled Features') 
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Gradient Norm')
ax2.set_title('Gradient Magnitude During Training')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_yscale('log')

plt.tight_layout()
plt.show()

### 4.3 Comparative Analysis Questions

**1. Initialization Methods:**

a) Which initialization method performed best? Why do you think this happened?

*Your answer:*

b) When might different initialization strategies be preferred?

*Your answer:*

**2. Feature Scaling:**

a) How did feature scaling affect convergence speed and stability?

*Your answer:*

b) Why does feature scaling matter for gradient descent? (Think about the optimization landscape)

*Your answer:*

c) What practical advice would you give about feature scaling?

*Your answer:*

---

## Part 5: Executive Summary & Communication

**Objective**: Synthesize your findings into a clear, professional summary suitable for peer review.

### 5.1 Executive Summary

**Write a 3-4 paragraph executive summary of your gradient descent analysis for a technical audience:**

#### Overview & Methodology
*Describe what you analyzed and how*

#### Key Findings
*Summarize your most important discoveries about hyperparameters, convergence, etc.*

#### Practical Recommendations  
*What actionable advice would you give to someone implementing gradient descent?*

#### Implications
*Why do these findings matter for machine learning practitioners?*

### 5.2 Questions for Peer Reviewers

**Provide 3 specific questions you'd like your peers to address when reviewing your analysis:**

1. **Question about stopping condition:** 

2. **Question about hyperparameter insights:**

3. **Question about practical applications:**

### 5.3 Reflection on Learning

**Reflect on what you learned from implementing gradient descent from scratch:**

#### Technical Insights
*What did you understand better after implementing the math yourself?*

#### Challenges Encountered
*What was difficult? How did you overcome challenges?*

#### Connections to Broader ML
*How does this foundational understanding help with more complex ML algorithms?*

#### Future Applications
*How will you apply these insights to future projects?*

---

## Submission Checklist

Before submitting, verify you have completed:

**Part 1: Delta-h Analysis**
- [ ] Generated delta-h sensitivity plots
- [ ] Answered all analysis questions about numerical stability
- [ ] Identified optimal h value and explained trade-offs

**Part 2: Learning Rate Exploration**  
- [ ] Tested multiple learning rates with visualizations
- [ ] Analyzed convergence patterns and identified optimal range
- [ ] Answered questions about learning rate selection

**Part 3: Standardized Stopping Analysis**
- [ ] Used identical setup for fair comparison
- [ ] Analyzed convergence metrics and stopping criteria
- [ ] Made specific stopping recommendation with justification

**Part 4: Comparative Studies**
- [ ] Compared initialization methods with analysis
- [ ] Demonstrated feature scaling impact
- [ ] Answered comparative analysis questions

**Part 5: Communication**
- [ ] Written professional executive summary
- [ ] Provided specific questions for peer review
- [ ] Reflected on learning and broader applications

**Export your notebook as PDF for peer review submission.**