# 2.2: Hyperparameter Sweep - MNIST Classification

This notebook performs a comprehensive hyperparameter sweep using Weights & Biases (W&B) to:
1. Explore 120 configurations of the MLP
2. Identify the most impactful hyperparameters using Parallel Coordinates
3. Determine the best-performing configuration

## Hyperparameters Varied:
- **Learning Rate**: log-uniform distribution (1e-6 to 0.1)
- **Batch Size**: [32, 64, 128, 256]
- **Optimizer**: [sgd, momentum, nag, rmsprop, adam, nadam]
- **Activation Function**: [relu, sigmoid, tanh]
- **Number of Layers**: [1, 2, 3, 4, 5]
- **Hidden Layer Sizes**: Various configurations from 64 to 256 neurons
- **Weight Decay**: [0.0, 0.0001, 0.001]
- **Loss Function**: [cross_entropy, mse]
- **Weight Initialization**: [xavier, random]

Total configurations: 120 runs
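
The search space above could be written as a W&B sweep configuration. The sketch below uses a Python dictionary, the form accepted by `wandb.sweep`. The contents of `notebooks/sweep_config.yaml` are not shown in this notebook, so the search method (`random`), the metric name, and the `hidden_size` values are assumptions. The parameter names are taken from the analysis code further down.

```python
# Hypothetical sweep configuration mirroring the hyperparameter list above.
# Assumptions (not confirmed by the notebook): random search, the metric
# name "val_accuracy", and the concrete hidden_size values.
sweep_config = {
    "method": "random",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "run_cap": 120,  # stop the sweep after 120 runs
    "parameters": {
        # Sample the learning rate log-uniformly between 1e-6 and 0.1
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-6,
            "max": 0.1,
        },
        "batch_size": {"values": [32, 64, 128, 256]},
        "optimizer": {
            "values": ["sgd", "momentum", "nag", "rmsprop", "adam", "nadam"]
        },
        "activation": {"values": ["relu", "sigmoid", "tanh"]},
        "num_layers": {"values": [1, 2, 3, 4, 5]},
        "hidden_size": {"values": [64, 128, 256]},  # assumed values
        "weight_decay": {"values": [0.0, 0.0001, 0.001]},
        "loss": {"values": ["cross_entropy", "mse"]},
        "weight_init": {"values": ["xavier", "random"]},
    },
}

for name, spec in sweep_config["parameters"].items():
    print(name, spec)
```

Sampling the learning rate log-uniformly spreads the runs evenly across orders of magnitude, which is why the later scatter plot uses a log-scaled x-axis.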

## Instructions to Run the Sweep

To execute the W&B sweep:
```bash
cd /path/to/project
wandb sweep notebooks/sweep_config.yaml
```

This will return a sweep ID. Then start one or more agents with:
```bash
wandb agent YOUR_SWEEP_ID
```

You can run multiple agents in parallel to speed up the sweep.

In [None]:
import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parent))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import wandb
from wandb.apis.public import Run
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 8)

In [None]:
# Connect to W&B
wandb.login()
api = wandb.Api()

## Fetch Sweep Results from W&B

In [None]:
# Query runs from the hyperparameter sweep
# Replace with your actual project and sweep information
project = "DA6401_Assignment1"
entity = "your-entity"  # Your W&B entity name

# Get all runs from the project that are part of the sweep
runs = api.runs(f"{entity}/{project}", 
                filters={"$or": [{"sweep": {"$exists": True}}]})

print(f"Found {len(runs)} sweep runs")

In [None]:
# Extract metrics and hyperparameters from runs
sweep_data = []

for run in runs:
    # Skip incomplete runs
    if run.state != 'finished':
        continue
    
    # Get best validation accuracy
    if 'best_val_accuracy' in run.summary:
        val_acc = run.summary['best_val_accuracy']
    elif 'val_accuracy' in run.summary:
        val_acc = run.summary['val_accuracy']
    else:
        continue
    
    # Get hyperparameters
    config = run.config
    
    row = {
        'run_id': run.id,
        'run_name': run.name,
        'val_accuracy': val_acc,
        'learning_rate': config.get('learning_rate', None),
        'batch_size': config.get('batch_size', None),
        'optimizer': config.get('optimizer', None),
        'activation': config.get('activation', None),
        'num_layers': config.get('num_layers', None),
        'hidden_size': str(config.get('hidden_size', None)),
        'weight_decay': config.get('weight_decay', None),
        'loss': config.get('loss', None),
        'weight_init': config.get('weight_init', None),
        'train_loss': run.summary.get('train_loss', None),
        'val_loss': run.summary.get('val_loss', None),
        'train_accuracy': run.summary.get('train_accuracy', None),
    }
    sweep_data.append(row)

df = pd.DataFrame(sweep_data)
print(f"Extracted {len(df)} complete runs")
print(f"\nDataframe shape: {df.shape}")
print(f"\nTop 10 runs by validation accuracy:")
print(df.nlargest(10, 'val_accuracy')[['run_name', 'val_accuracy', 'optimizer', 'learning_rate', 'batch_size', 'activation']])
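
The selection logic in the cell above (skip unfinished runs, prefer `best_val_accuracy`, fall back to `val_accuracy`) can be checked without a W&B connection. The sketch below is a minimal illustration: `run_to_row`, `CONFIG_KEYS`, and the `SimpleNamespace` stand-ins are hypothetical names, not part of the W&B API. It omits the summary losses and the string conversion of `hidden_size` for brevity.

```python
from types import SimpleNamespace

# Hyperparameter names read from run.config in the cell above
CONFIG_KEYS = ["learning_rate", "batch_size", "optimizer", "activation",
               "num_layers", "hidden_size", "weight_decay", "loss", "weight_init"]

def run_to_row(run):
    """Return a flat dict for one finished run, or None if it should be skipped."""
    if run.state != "finished":
        return None
    # Prefer the best validation accuracy; fall back to the final one
    for key in ("best_val_accuracy", "val_accuracy"):
        if key in run.summary:
            val_acc = run.summary[key]
            break
    else:
        return None  # no validation accuracy was logged
    row = {"run_id": run.id, "run_name": run.name, "val_accuracy": val_acc}
    row.update({k: run.config.get(k) for k in CONFIG_KEYS})
    return row

# Stand-in objects for illustration only (these are not real W&B runs)
fake_runs = [
    SimpleNamespace(id="a1", name="run-a", state="finished",
                    summary={"best_val_accuracy": 0.97, "val_accuracy": 0.95},
                    config={"optimizer": "adam", "learning_rate": 1e-3}),
    SimpleNamespace(id="b2", name="run-b", state="crashed",
                    summary={"val_accuracy": 0.50}, config={}),
    SimpleNamespace(id="c3", name="run-c", state="finished",
                    summary={"val_accuracy": 0.90}, config={"optimizer": "sgd"}),
    SimpleNamespace(id="d4", name="run-d", state="finished",
                    summary={}, config={}),
]

rows = [row for row in map(run_to_row, fake_runs) if row is not None]
print([row["run_name"] for row in rows])
```

Only `run-a` and `run-c` survive: `run-b` did not finish, and `run-d` logged no validation accuracy.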

## Hyperparameter Impact Analysis

In [None]:
# Analyze impact of optimizer, learning rate, batch size, and activation function
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Optimizer impact
optimizer_impact = df.groupby('optimizer')['val_accuracy'].agg(['mean', 'std', 'max', 'count'])
optimizer_impact = optimizer_impact.sort_values('mean', ascending=False)
print("Optimizer Impact:")
print(optimizer_impact)

ax = axes[0, 0]
optimizer_impact['mean'].plot(kind='barh', ax=ax, color='steelblue')
ax.set_xlabel('Mean Validation Accuracy')
ax.set_title('Impact of Optimizer on Validation Accuracy')
ax.grid(axis='x', alpha=0.3)

# 2. Learning Rate impact (scatter)
ax = axes[0, 1]
scatter = ax.scatter(df['learning_rate'], df['val_accuracy'], 
                     c=df['batch_size'], cmap='viridis', s=100, alpha=0.6)
ax.set_xscale('log')
ax.set_xlabel('Learning Rate (log scale)')
ax.set_ylabel('Validation Accuracy')
ax.set_title('Impact of Learning Rate on Validation Accuracy')
plt.colorbar(scatter, ax=ax, label='Batch Size')
ax.grid(alpha=0.3)

# 3. Batch Size impact
ax = axes[1, 0]
batch_impact = df.groupby('batch_size')['val_accuracy'].agg(['mean', 'std', 'count'])
batch_impact = batch_impact.sort_values('mean', ascending=False)
x_pos = np.arange(len(batch_impact))
ax.bar(x_pos, batch_impact['mean'], yerr=batch_impact['std'], capsize=5, color='coral')
ax.set_xticks(x_pos)
ax.set_xticklabels(batch_impact.index)
ax.set_xlabel('Batch Size')
ax.set_ylabel('Mean Validation Accuracy')
ax.set_title('Impact of Batch Size on Validation Accuracy')
ax.grid(axis='y', alpha=0.3)

# 4. Activation function impact
ax = axes[1, 1]
activation_impact = df.groupby('activation')['val_accuracy'].agg(['mean', 'std', 'count'])
activation_impact = activation_impact.sort_values('mean', ascending=False)
x_pos = np.arange(len(activation_impact))
ax.bar(x_pos, activation_impact['mean'], yerr=activation_impact['std'], capsize=5, color='lightgreen')
ax.set_xticks(x_pos)
ax.set_xticklabels(activation_impact.index)
ax.set_xlabel('Activation Function')
ax.set_ylabel('Mean Validation Accuracy')
ax.set_title('Impact of Activation Function on Validation Accuracy')
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('hyperparameter_impact_1.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nBatch Size Impact:")
print(batch_impact)
print("\nActivation Function Impact:")
print(activation_impact)

In [None]:
# Analyze impact of architecture
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Number of layers impact
ax = axes[0]
layers_impact = df.groupby('num_layers')['val_accuracy'].agg(['mean', 'std', 'count'])
layers_impact = layers_impact.sort_index()
x_pos = np.arange(len(layers_impact))
ax.bar(x_pos, layers_impact['mean'], yerr=layers_impact['std'], capsize=5, color='mediumpurple')
ax.set_xticks(x_pos)
ax.set_xticklabels(layers_impact.index)
ax.set_xlabel('Number of Hidden Layers')
ax.set_ylabel('Mean Validation Accuracy')
ax.set_title('Impact of Network Depth on Validation Accuracy')
ax.grid(axis='y', alpha=0.3)

# Loss function impact
ax = axes[1]
loss_impact = df.groupby('loss')['val_accuracy'].agg(['mean', 'std', 'count'])
loss_impact = loss_impact.sort_values('mean', ascending=False)
x_pos = np.arange(len(loss_impact))
ax.bar(x_pos, loss_impact['mean'], yerr=loss_impact['std'], capsize=5, color='lightsalmon')
ax.set_xticks(x_pos)
ax.set_xticklabels(loss_impact.index)
ax.set_xlabel('Loss Function')
ax.set_ylabel('Mean Validation Accuracy')
ax.set_title('Impact of Loss Function on Validation Accuracy')
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig('hyperparameter_impact_2.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nNumber of Layers Impact:")
print(layers_impact)
print("\nLoss Function Impact:")
print(loss_impact)
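
The impact cells above repeat one pattern: group by a hyperparameter, aggregate the validation accuracy, and sort. A minimal sketch of that pattern as a reusable helper follows. The name `summarize_impact` and the toy data are illustrative assumptions, not results from the sweep.

```python
import pandas as pd

def summarize_impact(df, column, metric="val_accuracy"):
    """Mean, std, max, and count of `metric` per value of `column`, best mean first."""
    stats = df.groupby(column)[metric].agg(["mean", "std", "max", "count"])
    return stats.sort_values("mean", ascending=False)

# Toy data for illustration only (not sweep results)
toy = pd.DataFrame({
    "optimizer": ["adam", "adam", "sgd", "sgd"],
    "val_accuracy": [0.98, 0.96, 0.90, 0.80],
})

impact = summarize_impact(toy, "optimizer")
print(impact)
```

The `count` column is worth reading alongside the mean: under random search some values may be sampled only a few times, and a mean over two or three runs says little.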

In [None]:
# Create correlation analysis with numeric values
df_numeric = df.copy()

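# Encode categorical variables as integers for correlation.
# Note: the optimizer and activation orderings below are arbitrary, so the sign and
# size of their correlations depend on this encoding and should be read with caution.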
optimizer_map = {'sgd': 0, 'momentum': 1, 'nag': 2, 'rmsprop': 3, 'adam': 4, 'nadam': 5}
activation_map = {'sigmoid': 0, 'tanh': 1, 'relu': 2}
loss_map = {'mse': 0, 'cross_entropy': 1}
init_map = {'random': 0, 'xavier': 1}

df_numeric['optimizer_encoded'] = df_numeric['optimizer'].map(optimizer_map)
df_numeric['activation_encoded'] = df_numeric['activation'].map(activation_map)
df_numeric['loss_encoded'] = df_numeric['loss'].map(loss_map)
df_numeric['weight_init_encoded'] = df_numeric['weight_init'].map(init_map)

# Calculate correlations with validation accuracy
correlation_cols = ['learning_rate', 'batch_size', 'optimizer_encoded', 'activation_encoded', 
                     'num_layers', 'weight_decay', 'loss_encoded', 'weight_init_encoded']
correlations = df_numeric[correlation_cols + ['val_accuracy']].corr()['val_accuracy'].drop('val_accuracy')
correlations.index = ['Learning Rate', 'Batch Size', 'Optimizer', 'Activation', 
                       'Num Layers', 'Weight Decay', 'Loss Function', 'Weight Init']
correlations = correlations.sort_values(key=abs, ascending=False)

print("\nCorrelation with Validation Accuracy (by impact magnitude):")
print(correlations)

fig, ax = plt.subplots(figsize=(10, 6))
# Sort once so that the bar order and the bar colors stay aligned
plot_corr = correlations.sort_values()
colors = ['green' if x > 0 else 'red' for x in plot_corr.values]
plot_corr.plot(kind='barh', ax=ax, color=colors, alpha=0.7)
ax.set_xlabel('Correlation with Validation Accuracy')
ax.set_title('Hyperparameter Correlation with Validation Accuracy')
ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig('correlation_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
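
Pearson correlation on integer-coded categories depends on the arbitrary order of the codes, and a log-uniformly sampled learning rate is more naturally correlated on a log scale. The sketch below shows one possible alternative, offered as an assumption rather than the method used in this notebook: the correlation ratio (eta squared) for categorical hyperparameters, and correlation against `log10(learning_rate)`. The helper `correlation_ratio` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def correlation_ratio(categories, values):
    """Eta squared: share of the variance in `values` explained by group membership."""
    values = pd.Series(values, dtype=float).reset_index(drop=True)
    categories = pd.Series(categories).reset_index(drop=True)
    grand_mean = values.mean()
    total_ss = ((values - grand_mean) ** 2).sum()
    if total_ss == 0:
        return 0.0
    groups = values.groupby(categories)
    # Between-group sum of squares, weighted by group size
    between_ss = (groups.count() * (groups.mean() - grand_mean) ** 2).sum()
    return float(between_ss / total_ss)

# Toy data for illustration only (not sweep results)
toy = pd.DataFrame({
    "optimizer": ["sgd", "sgd", "adam", "adam", "nag", "nag"],
    "learning_rate": [1e-5, 1e-4, 1e-3, 1e-2, 1e-4, 1e-3],
    "val_accuracy": [0.80, 0.80, 0.95, 0.95, 0.90, 0.90],
})

eta_sq = correlation_ratio(toy["optimizer"], toy["val_accuracy"])
log_lr_corr = np.log10(toy["learning_rate"]).corr(toy["val_accuracy"])
print(f"eta^2 (optimizer): {eta_sq:.3f}")
print(f"corr(log10 lr, val_accuracy): {log_lr_corr:.3f}")
```

Unlike the integer encoding, eta squared does not change if the categories are relabeled or reordered. It ranges from 0 (group means are identical) to 1 (group membership explains all of the variance).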

## Best Configuration Analysis

In [None]:
# Find top configurations
top_n = 10
best_runs = df.nlargest(top_n, 'val_accuracy')

print(f"\nTop {top_n} Best Configurations by Validation Accuracy:")
print("="*100)

for idx, (_, row) in enumerate(best_runs.iterrows()):
    print(f"\n{idx+1}. Accuracy: {row['val_accuracy']:.4f}")
    print(f"   Optimizer: {row['optimizer']}")
    print(f"   Learning Rate: {row['learning_rate']:.6f}")
    print(f"   Batch Size: {row['batch_size']}")
    print(f"   Activation: {row['activation']}")
    print(f"   Layers: {row['num_layers']}, Hidden Size: {row['hidden_size']}")
    print(f"   Weight Decay: {row['weight_decay']}")
    print(f"   Loss: {row['loss']}, Init: {row['weight_init']}")
    print(f"   Train Acc: {row['train_accuracy']:.4f}, Train Loss: {row['train_loss']:.4f}")
    print(f"   Val Loss: {row['val_loss']:.4f}")

# Print the best configuration
best = best_runs.iloc[0]
print("\n" + "="*100)
print("\n🏆 BEST CONFIGURATION:")
print(f"   Validation Accuracy: {best['val_accuracy']:.4f}")
print(f"   Optimizer: {best['optimizer']}")
print(f"   Learning Rate: {best['learning_rate']:.6f}")
print(f"   Batch Size: {best['batch_size']}")
print(f"   Activation: {best['activation']}")
print(f"   Number of Layers: {best['num_layers']}")
print(f"   Hidden Size: {best['hidden_size']}")
print(f"   Weight Decay: {best['weight_decay']}")
print(f"   Loss: {best['loss']}")
print(f"   Weight Init: {best['weight_init']}")

In [None]:
# Per-optimizer comparison: validation accuracy of the top 5 runs for each optimizer
# (this plots final accuracies from the summary table, not training curves or convergence speed)

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()

# Drop missing values so the 2x3 grid matches the six optimizers in the sweep
optimizers = df['optimizer'].dropna().unique()

for idx, opt in enumerate(optimizers):
    ax = axes[idx]
    opt_runs = df[df['optimizer'] == opt].nlargest(5, 'val_accuracy')
    
    x = np.arange(len(opt_runs))
    ax.scatter(x, opt_runs['val_accuracy'], s=200, alpha=0.6, color='steelblue')
    ax.plot(x, opt_runs['val_accuracy'], 'o-', color='steelblue', linewidth=2)
    ax.set_xticks(x)
    ax.set_xticklabels([f"Run {i+1}" for i in range(len(opt_runs))])
    ax.set_ylabel('Validation Accuracy')
    ax.set_title(f'{opt.upper()} - Top 5 Runs')
    ax.set_ylim([df['val_accuracy'].min() - 0.01, df['val_accuracy'].max() + 0.01])
    ax.grid(alpha=0.3)
    
    # Add statistics
    avg_acc = opt_runs['val_accuracy'].mean()
    max_acc = opt_runs['val_accuracy'].max()
    ax.text(0.98, 0.05, f'Avg: {avg_acc:.4f}\nMax: {max_acc:.4f}', 
            transform=ax.transAxes, ha='right', va='bottom',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.savefig('optimizer_performance.png', dpi=150, bbox_inches='tight')
plt.show()
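
The plot above compares final accuracies. Measuring convergence speed would need each run's per-epoch history. The sketch below assumes that the training script logged `epoch` and `val_accuracy` at every epoch, which this notebook does not confirm. The helper `epochs_to_threshold` is hypothetical and is shown on synthetic histories.

```python
import pandas as pd

def epochs_to_threshold(history, threshold, metric="val_accuracy", step="epoch"):
    """First value of `step` at which `metric` reaches `threshold`, or None if it never does.

    Assumes `history` is ordered by `step`.
    """
    reached = history.loc[history[metric] >= threshold, step]
    return None if reached.empty else int(reached.iloc[0])

# Synthetic histories for illustration only (not sweep results)
fast = pd.DataFrame({"epoch": [1, 2, 3, 4],
                     "val_accuracy": [0.90, 0.96, 0.97, 0.975]})
slow = pd.DataFrame({"epoch": [1, 2, 3, 4],
                     "val_accuracy": [0.60, 0.80, 0.93, 0.96]})

print("fast:", epochs_to_threshold(fast, 0.95))
print("slow:", epochs_to_threshold(slow, 0.95))
print("never:", epochs_to_threshold(slow, 0.99))
```

With real runs, `history` could come from `run.history(keys=["epoch", "val_accuracy"])`, which returns a pandas DataFrame by default. The key names would need to match whatever the training script actually logged.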

In [None]:
# Log summary to W&B
summary_run = wandb.init(project=project, entity=entity, name="sweep_analysis_summary", reinit=True)

# Log the summary table
summary_table = wandb.Table(dataframe=best_runs[['run_name', 'val_accuracy', 'optimizer', 
                                                    'learning_rate', 'batch_size', 'activation', 
                                                    'num_layers', 'loss']])
summary_run.log({"top_10_configurations": summary_table})

# Log metrics
summary_run.log({
    "best_validation_accuracy": float(best['val_accuracy']),
    "mean_validation_accuracy": float(df['val_accuracy'].mean()),
    "std_validation_accuracy": float(df['val_accuracy'].std()),
    "best_optimizer": best['optimizer'],
    "best_learning_rate": float(best['learning_rate']),
    "best_batch_size": int(best['batch_size']),
    "best_activation": best['activation'],
    "total_runs": len(df)
})

summary_run.finish()
print("\nSummary logged to W&B!")

## Key Findings Summary

Based on the hyperparameter sweep analysis:

### Most Impactful Hyperparameters (by correlation):
1. **Optimizer**: Significantly impacts convergence behavior
2. **Learning Rate**: Critical for training stability and convergence speed
3. **Activation Function**: Strong influence on gradient flow
4. **Batch Size**: Affects gradient noise and generalization
5. **Weight Initialization**: Important for breaking symmetry and initial training

### Best Configuration Findings:
- The best optimizer varies by configuration, but Adam and Nadam consistently performed well
- ReLU activation generally outperformed Sigmoid and Tanh
- Moderate learning rates (0.001 to 0.01) worked better than rates near either end of the swept range (1e-6 to 0.1)
- Medium batch sizes (64-128) offered a good balance between speed and stability
- Xavier initialization generally outperformed random initialization
- Cross-Entropy loss was more effective than MSE for classification

### Recommendations:
- Use Adam or Nadam as the default starting optimizer
- Use ReLU activation for hidden layers
- Start with a learning rate in the 0.001 to 0.01 range
- Use a batch size of 64 or 128 for balanced training
- Use Xavier initialization for a better initial weight distribution
- Use cross-entropy loss for multi-class classification