# Task 3.3: Loss Landscape Visualization

**Module:** 3 - Mathematics for Deep Learning  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand what a loss landscape represents
- [ ] Create 2D heatmap visualizations of loss
- [ ] Create 3D surface plots of loss landscapes
- [ ] Visualize optimization trajectories on loss surfaces
- [ ] Identify local minima, saddle points, and flat regions

---

## 📚 Prerequisites

- Completed: Task 3.1-3.2
- Knowledge of: Neural networks, gradient descent

---

## 🌍 Real-World Context

**Why visualize loss landscapes?**

- **Debugging:** Why isn't my model learning? (It may be stuck on a plateau, near a saddle point, or in a poor local minimum.)
- **Architecture:** Wide and deep networks can have very different landscapes
- **Understanding:** Skip connections tend to produce smoother landscapes (one reason ResNets are easier to train)

**Famous insight:** Li et al. (2018) showed that ResNets have much smoother loss landscapes than comparable networks without skip connections, which helps explain their trainability.

---

## 🧒 ELI5: What is a Loss Landscape?

> **Imagine you're exploring a foggy mountain range...**
>
> - The **height** at any point = how wrong your model is (loss)
> - **Your position** on the map = your model's parameters
> - The **lowest valleys** = best model parameters
> - **Peaks and ridges** = bad parameters (high loss)
>
> When you train a model, you're:
> 1. Standing somewhere on this mountain (initial parameters)
> 2. Trying to find the lowest valley (minimize loss)
> 3. You can only feel which way is DOWN (gradients)
> 4. You take small steps hoping to reach the bottom
>
> **The challenge:**
> - There might be multiple valleys (local minima)
> - Some spots are flat without being a valley floor (saddle points), so progress can stall there
> - Some valleys are narrow and hard to find
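
As a minimal sketch of the "feel which way is down, then take a small step" idea, the snippet below runs plain gradient descent on the bowl-shaped function x² + y². The starting point, learning rate, and step count are illustrative assumptions, not values used elsewhere in this notebook.

```python
def loss(x, y):
    """Height of the landscape at position (x, y)."""
    return x**2 + y**2

def grad(x, y):
    """Slope of the landscape: the gradient points uphill."""
    return 2 * x, 2 * y

x, y = 2.5, -1.5   # assumed starting position (initial parameters)
lr = 0.1           # assumed step size (learning rate)

history = [loss(x, y)]
for _ in range(50):
    gx, gy = grad(x, y)
    # Step against the gradient, i.e. downhill
    x, y = x - lr * gx, y - lr * gy
    history.append(loss(x, y))

print(f"Loss: {history[0]:.4f} -> {history[-1]:.2e}")
```

On this convex bowl every step lowers the loss. The harder cases in the rest of the notebook are the ones where that simple picture breaks down.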

---

In [None]:
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import warnings
warnings.filterwarnings('ignore')

# Set seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("🚀 Loss Landscape Visualization Lab")
print("=" * 50)
print(f"PyTorch version: {torch.__version__}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

---

## Part 1: Understanding Loss Landscape Concepts

Before we dive into neural networks, let's visualize different landscape features using simple 2D functions.

In [None]:
def plot_landscape_3d(func, x_range, y_range, title, ax=None):
    """Create a 3D surface plot of a 2D function"""
    x = np.linspace(*x_range, 100)
    y = np.linspace(*y_range, 100)
    X, Y = np.meshgrid(x, y)
    Z = func(X, Y)
    
    if ax is None:
        fig = plt.figure(figsize=(10, 7))
        ax = fig.add_subplot(111, projection='3d')
    
    surf = ax.plot_surface(X, Y, Z, cmap='viridis', alpha=0.8, 
                          linewidth=0, antialiased=True)
    ax.set_xlabel('Parameter 1')
    ax.set_ylabel('Parameter 2')
    ax.set_zlabel('Loss')
    ax.set_title(title)
    
    return ax

# Different landscape types
fig = plt.figure(figsize=(16, 10))

# 1. Single global minimum (ideal!)
ax1 = fig.add_subplot(221, projection='3d')
single_min = lambda x, y: x**2 + y**2
plot_landscape_3d(single_min, (-3, 3), (-3, 3), "1. Single Global Minimum\n(Ideal - Convex)", ax1)

# 2. Multiple local minima
ax2 = fig.add_subplot(222, projection='3d')
multi_min = lambda x, y: np.sin(2*x) * np.cos(2*y) + 0.1*(x**2 + y**2)
plot_landscape_3d(multi_min, (-3, 3), (-3, 3), "2. Multiple Local Minima\n(Non-convex)", ax2)

# 3. Saddle point
ax3 = fig.add_subplot(223, projection='3d')
saddle = lambda x, y: x**2 - y**2
plot_landscape_3d(saddle, (-2, 2), (-2, 2), "3. Saddle Point\n(Curves up in x, down in y)", ax3)

# 4. Narrow valley (hard to optimize)
ax4 = fig.add_subplot(224, projection='3d')
narrow = lambda x, y: 0.1*x**2 + 10*y**2  # Very different scales!
plot_landscape_3d(narrow, (-3, 3), (-3, 3), "4. Narrow Valley\n(Ill-conditioned)", ax4)

plt.tight_layout()
plt.show()

print("\n📊 Four types of loss landscapes:")
print("1. Convex: One global minimum - easy to optimize!")
print("2. Non-convex: Multiple minima - might get stuck")
print("3. Saddle point: Gradient is zero but it is not a minimum - deceptive")
print("4. Ill-conditioned: Very different curvature per direction - causes zigzagging")

### 🔍 Key Concepts

| Feature | Description | Challenge |
|---------|-------------|-----------|
| **Global Minimum** | Lowest point overall | The goal! |
| **Local Minimum** | Lowest point nearby | Can trap optimizer |
| **Saddle Point** | Zero gradient, but curves up in some directions and down in others | Gradient ~0, so progress stalls before the minimum |
| **Narrow Valley** | Steep in some directions, shallow in others | Causes oscillation |
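
One way to tell these features apart numerically is to look at the curvature (the Hessian) at a point where the gradient is zero. The sketch below estimates the Hessian by finite differences for three of the toy functions from the previous cell and classifies the origin by the signs of its eigenvalues. The helper names (`hessian_2d`, `eigenvalues_sym_2x2`, `classify`) are hypothetical, and the code assumes the origin is a critical point, which holds for these three functions.

```python
import math

def hessian_2d(f, x, y, h=1e-3):
    """Central-difference estimate of (f_xx, f_xy, f_yy) at (x, y)."""
    fxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h**2
    fxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h**2)
    return fxx, fxy, fyy

def eigenvalues_sym_2x2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    return mean - radius, mean + radius

def classify(f, x=0.0, y=0.0, tol=1e-6):
    """Label a critical point by the signs of the Hessian eigenvalues."""
    lo, hi = eigenvalues_sym_2x2(*hessian_2d(f, x, y))
    if lo > tol:
        return "minimum"
    if hi < -tol:
        return "maximum"
    if lo < -tol and hi > tol:
        return "saddle"
    return "degenerate"

bowl = lambda x, y: x**2 + y**2
saddle = lambda x, y: x**2 - y**2
narrow = lambda x, y: 0.1 * x**2 + 10 * y**2

results = {name: classify(f) for name, f in
           [("bowl", bowl), ("saddle", saddle), ("narrow", narrow)]}
print(results)

# Condition number of the narrow valley: largest / smallest curvature
lo, hi = eigenvalues_sym_2x2(*hessian_2d(narrow, 0.0, 0.0))
condition_number = hi / lo
print(f"Narrow valley condition number: {condition_number:.1f}")
```

The narrow valley is still a minimum, but its curvatures differ by a factor of about 100. A single learning rate is then either too large for the steep direction or too small for the shallow one, which is where the zigzagging comes from.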

---

## Part 2: Creating a Neural Network for Visualization

Let's create a simple network and visualize its actual loss landscape on a toy dataset.

In [None]:
# Create a simple dataset (2D classification)
def create_moons_dataset(n_samples=200, noise=0.1):
    """Create a two-moons classification dataset"""
    n_samples_per_class = n_samples // 2
    
    # First moon
    theta1 = np.linspace(0, np.pi, n_samples_per_class)
    X1 = np.column_stack([np.cos(theta1), np.sin(theta1)])
    
    # Second moon (shifted)
    theta2 = np.linspace(0, np.pi, n_samples_per_class)
    X2 = np.column_stack([1 - np.cos(theta2), 1 - np.sin(theta2) - 0.5])
    
    X = np.vstack([X1, X2]) + np.random.randn(n_samples, 2) * noise
    y = np.array([0] * n_samples_per_class + [1] * n_samples_per_class)
    
    return X, y

# Generate data
X_np, y_np = create_moons_dataset(200, noise=0.15)

# Convert to PyTorch tensors
X_data = torch.FloatTensor(X_np)
y_data = torch.FloatTensor(y_np).unsqueeze(1)

# Visualize the dataset
plt.figure(figsize=(8, 6))
plt.scatter(X_np[y_np==0, 0], X_np[y_np==0, 1], c='red', label='Class 0', alpha=0.6)
plt.scatter(X_np[y_np==1, 0], X_np[y_np==1, 1], c='blue', label='Class 1', alpha=0.6)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Two Moons Dataset')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Dataset: {X_np.shape[0]} samples, {X_np.shape[1]} features")

In [None]:
# Define a simple network
class SimpleNet(nn.Module):
    """A tiny network for loss landscape visualization"""
    
    def __init__(self, hidden_size=4):
        super().__init__()
        self.fc1 = nn.Linear(2, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 1)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        return x

# Create network and loss function
model = SimpleNet(hidden_size=4)
criterion = nn.BCELoss()

print("Network architecture:")
print(f"  Input(2) → Hidden({model.fc1.out_features}) → Output(1)")
print(f"  Total parameters: {sum(p.numel() for p in model.parameters())}")

---

## Part 3: Visualizing a 2D Slice of the Loss Landscape

Neural networks have many parameters, but we can visualize a 2D "slice" by:
1. Picking a trained model's parameters as the "center"
2. Choosing two random directions in parameter space
3. Moving along these directions and measuring loss

Li et al. (2018) popularized this technique in "Visualizing the Loss Landscape of Neural Nets" and introduced filter normalization to make the resulting plots comparable across networks.
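
Written as a formula, the three steps above define a function of two scalars. The symbols are notation assumed here for illustration: $\theta^{*}$ is the trained parameter vector, $\delta$ and $\eta$ are the two random directions, and $\mathcal{L}$ is the training loss.

$$
f(\alpha, \beta) = \mathcal{L}\left(\theta^{*} + \alpha\,\delta + \beta\,\eta\right)
$$

In this notebook each direction is rescaled per parameter tensor $i$ so that its norm matches the norm of the corresponding trained parameters:

$$
\delta_i \leftarrow \frac{\delta_i}{\lVert \delta_i \rVert}\,\lVert \theta^{*}_i \rVert
$$

With that scaling, $\alpha = 1$ corresponds to a perturbation as large as the weights themselves, which is why a range of $[-1, 1]$ is a reasonable window to explore.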

In [None]:
def get_params_as_vector(model):
    """Flatten all model parameters into a single vector"""
    return torch.cat([p.data.view(-1) for p in model.parameters()])

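def set_params_from_vector(model, params_vector):
    """Copy values from a flattened vector into the model's parameters"""
    idx = 0
    with torch.no_grad():
        for p in model.parameters():
            numel = p.numel()
            # copy_ writes into the existing parameter storage, so the parameters
            # never end up as views that share memory with params_vector
            p.copy_(params_vector[idx:idx+numel].view_as(p))
            idx += numel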

def compute_loss(model, X, y, criterion):
    """Compute loss without gradient tracking"""
    with torch.no_grad():
        outputs = model(X)
        return criterion(outputs, y).item()

def create_random_direction(model):
    """Create a random direction in parameter space, rescaled per parameter tensor"""
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        # Rescale so this tensor's direction has the same norm as its parameters.
        # This is a layer-wise simplification of the filter normalization in
        # Li et al. (2018), which rescales each filter (output unit) separately.
        d = d / (d.norm() + 1e-10) * p.norm()
        direction.append(d.view(-1))
    return torch.cat(direction)

print("Helper functions defined!")

In [None]:
# First, train the model so we have a good center point
model = SimpleNet(hidden_size=8)
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
criterion = nn.BCELoss()

# Training loop
losses = []
for epoch in range(500):
    optimizer.zero_grad()
    outputs = model(X_data)
    loss = criterion(outputs, y_data)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"Training complete!")
print(f"  Initial loss: {losses[0]:.4f}")
print(f"  Final loss: {losses[-1]:.4f}")

# Plot training curve
plt.figure(figsize=(8, 4))
plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Create the 2D loss landscape visualization

# Save the trained parameters as center
center_params = get_params_as_vector(model).clone()

# Create two random directions
torch.manual_seed(42)
direction1 = create_random_direction(model)
direction2 = create_random_direction(model)

# Define the range to explore
alpha_range = np.linspace(-1.0, 1.0, 51)
beta_range = np.linspace(-1.0, 1.0, 51)

# Compute loss for each point in the grid
print("Computing loss landscape (this may take a moment)...")
loss_surface = np.zeros((len(beta_range), len(alpha_range)))

for i, beta in enumerate(beta_range):
    for j, alpha in enumerate(alpha_range):
        # Move in the two directions from center
        new_params = center_params + alpha * direction1 + beta * direction2
        set_params_from_vector(model, new_params)
        loss_surface[i, j] = compute_loss(model, X_data, y_data, criterion)

# Restore original parameters
set_params_from_vector(model, center_params)

print(f"Loss range: [{loss_surface.min():.4f}, {loss_surface.max():.4f}]")

In [None]:
# Visualize as 2D contour plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 2D Contour (heatmap)
A, B = np.meshgrid(alpha_range, beta_range)
contour = axes[0].contourf(A, B, loss_surface, levels=50, cmap='viridis')
axes[0].contour(A, B, loss_surface, levels=20, colors='white', alpha=0.3, linewidths=0.5)
axes[0].scatter([0], [0], color='red', s=100, marker='*', label='Trained model', zorder=5)
axes[0].set_xlabel('Direction 1 (α)')
axes[0].set_ylabel('Direction 2 (β)')
axes[0].set_title('Loss Landscape (2D Slice)')
axes[0].legend()
plt.colorbar(contour, ax=axes[0], label='Loss')

# 3D Surface
ax3d = fig.add_subplot(122, projection='3d')
surf = ax3d.plot_surface(A, B, loss_surface, cmap='viridis', alpha=0.8)
ax3d.scatter([0], [0], [loss_surface[len(beta_range)//2, len(alpha_range)//2]], 
            color='red', s=100, marker='*')
ax3d.set_xlabel('Direction 1 (α)')
ax3d.set_ylabel('Direction 2 (β)')
ax3d.set_zlabel('Loss')
ax3d.set_title('Loss Landscape (3D Surface)')

plt.tight_layout()
plt.show()

print("\n📊 The trained model (red star) sits in a valley!")
print("   This is what optimization achieved.")

### 🔍 What Just Happened?

We visualized a 2D "slice" of the loss landscape:

- **Center (red star):** Our trained model's parameters
- **Axes:** Two random directions in parameter space
- **Color:** Loss value (darker = lower = better)

The trained model sits at a **low point** (valley) of this slice. Keep in mind that a 2D slice shows only two of the network's many parameter directions.

---

## Part 4: Visualizing the Optimization Trajectory

Let's watch how the optimizer navigates the loss landscape during training.

In [None]:
# Train a fresh model and record the trajectory
torch.manual_seed(123)
model = SimpleNet(hidden_size=8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)  # Use SGD for clearer trajectory

# Record parameters at each step
trajectory = [get_params_as_vector(model).clone()]
losses = [compute_loss(model, X_data, y_data, criterion)]

# Train and record
for epoch in range(200):
    optimizer.zero_grad()
    outputs = model(X_data)
    loss = criterion(outputs, y_data)
    loss.backward()
    optimizer.step()
    
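    trajectory.append(get_params_as_vector(model).clone())
    # Record the loss at the updated parameters so losses[k] matches trajectory[k]
    # (loss.item() was measured before optimizer.step())
    losses.append(compute_loss(model, X_data, y_data, criterion))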

print(f"Recorded {len(trajectory)} points along the trajectory")
print(f"Loss: {losses[0]:.4f} → {losses[-1]:.4f}")

In [None]:
# Use PCA to project trajectory to 2D
from sklearn.decomposition import PCA

# Stack all trajectory points
traj_matrix = torch.stack(trajectory).numpy()

# Fit PCA to find the 2 most important directions
pca = PCA(n_components=2)
traj_2d = pca.fit_transform(traj_matrix)

print(f"Variance explained by 2 PCs: {pca.explained_variance_ratio_.sum()*100:.1f}%")

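# Create loss landscape along these PCA directions
# PCA coordinates are measured relative to the trajectory mean (pca.mean_),
# so the grid must use the same origin for the path to line up with the surface
center = pca.mean_
pc1 = pca.components_[0]
pc2 = pca.components_[1]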

# Determine range from trajectory
margin = 0.5
x_min, x_max = traj_2d[:, 0].min() - margin, traj_2d[:, 0].max() + margin
y_min, y_max = traj_2d[:, 1].min() - margin, traj_2d[:, 1].max() + margin

# Create grid
x_range = np.linspace(x_min, x_max, 50)
y_range = np.linspace(y_min, y_max, 50)

# Compute loss surface
print("Computing loss surface along PCA directions...")
loss_surface = np.zeros((len(y_range), len(x_range)))

for i, y_val in enumerate(y_range):
    for j, x_val in enumerate(x_range):
        # Reconstruct parameters
        params = center + x_val * pc1 + y_val * pc2
        set_params_from_vector(model, torch.FloatTensor(params))
        loss_surface[i, j] = compute_loss(model, X_data, y_data, criterion)

# Restore original
set_params_from_vector(model, torch.FloatTensor(traj_matrix[-1]))

In [None]:
# Plot trajectory on loss landscape
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 2D with trajectory
X_grid, Y_grid = np.meshgrid(x_range, y_range)
contour = axes[0].contourf(X_grid, Y_grid, loss_surface, levels=50, cmap='viridis')
axes[0].contour(X_grid, Y_grid, loss_surface, levels=15, colors='white', alpha=0.3, linewidths=0.5)

# Plot trajectory
axes[0].plot(traj_2d[:, 0], traj_2d[:, 1], 'r-', linewidth=2, alpha=0.7, label='Optimization path')
axes[0].scatter(traj_2d[0, 0], traj_2d[0, 1], color='red', s=150, marker='o', 
               edgecolors='white', linewidth=2, label='Start', zorder=5)
axes[0].scatter(traj_2d[-1, 0], traj_2d[-1, 1], color='yellow', s=150, marker='*', 
               edgecolors='black', linewidth=2, label='End', zorder=5)

axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')
axes[0].set_title('Optimization Trajectory on Loss Landscape')
axes[0].legend(loc='upper right')
plt.colorbar(contour, ax=axes[0], label='Loss')

# 3D with trajectory
ax3d = fig.add_subplot(122, projection='3d')
surf = ax3d.plot_surface(X_grid, Y_grid, loss_surface, cmap='viridis', alpha=0.6)

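# Plot trajectory in 3D (these are the true training losses; the surface only covers
# the PCA plane, so the path may float slightly above or below it)
traj_losses = np.array(losses)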
ax3d.plot(traj_2d[:, 0], traj_2d[:, 1], traj_losses, 'r-', linewidth=2, label='Path')
ax3d.scatter([traj_2d[0, 0]], [traj_2d[0, 1]], [traj_losses[0]], 
            color='red', s=100, marker='o', label='Start')
ax3d.scatter([traj_2d[-1, 0]], [traj_2d[-1, 1]], [traj_losses[-1]], 
            color='yellow', s=100, marker='*', label='End')

ax3d.set_xlabel('PC1')
ax3d.set_ylabel('PC2')
ax3d.set_zlabel('Loss')
ax3d.set_title('3D Optimization Path')

plt.tight_layout()
plt.show()

print("\n📊 The optimizer follows a path from high loss (start) to low loss (end)!")
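
The projection used above can be sketched with plain NumPy to show what the 2D coordinates mean. This is an illustrative sketch, not the notebook's own pipeline: the trajectory is a hypothetical random walk in a 10-dimensional parameter space, and `project` and `reconstruct` are made-up helper names. The point is that PCA coordinates are offsets from the trajectory mean along the principal directions, so a grid point `(x, y)` maps back to parameters as `mean + x * pc1 + y * pc2`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a recorded trajectory: 50 steps, 10 parameters
steps = rng.normal(size=(50, 10)) * 0.1
trajectory = np.cumsum(steps, axis=0)

# PCA via SVD of the mean-centered trajectory; rows of vt are the directions
mean = trajectory.mean(axis=0)
_, _, vt = np.linalg.svd(trajectory - mean, full_matrices=False)
pc1, pc2 = vt[0], vt[1]

def project(theta):
    """2D PCA coordinates of a parameter vector (relative to the mean)."""
    return np.array([(theta - mean) @ pc1, (theta - mean) @ pc2])

def reconstruct(x, y):
    """Parameter vector in the PCA plane for grid coordinates (x, y)."""
    return mean + x * pc1 + y * pc2

coords = np.array([project(t) for t in trajectory])

# Round trip: a point in the plane projects back to the same coordinates
x, y = coords[-1]
roundtrip = project(reconstruct(x, y))

# The real final point usually lies slightly off the plane
residual = np.linalg.norm(trajectory[-1] - reconstruct(x, y))
print(f"Distance of final point from the PCA plane: {residual:.4f}")
```

The nonzero residual is the part of the trajectory that the two components do not capture. It is one reason a path drawn with its true losses does not sit exactly on a surface computed in the plane.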

---

## Part 5: Comparing Different Architectures

Let's see how network width affects the loss landscape.

In [None]:
def visualize_landscape_for_model(model, title, X_data, y_data, resolution=40):
    """Train model and visualize its loss landscape"""
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
    
    # Train
    for _ in range(300):
        optimizer.zero_grad()
        loss = criterion(model(X_data), y_data)
        loss.backward()
        optimizer.step()
    
    # Get center and directions
    center = get_params_as_vector(model).clone()
    torch.manual_seed(42)
    dir1 = create_random_direction(model)
    dir2 = create_random_direction(model)
    
    # Create landscape
    alpha_range = np.linspace(-1, 1, resolution)
    beta_range = np.linspace(-1, 1, resolution)
    loss_surface = np.zeros((resolution, resolution))
    
    for i, beta in enumerate(beta_range):
        for j, alpha in enumerate(alpha_range):
            new_params = center + alpha * dir1 + beta * dir2
            set_params_from_vector(model, new_params)
            loss_surface[i, j] = compute_loss(model, X_data, y_data, criterion)
    
    set_params_from_vector(model, center)
    
    return loss_surface, alpha_range, beta_range

# Compare narrow vs wide networks
print("Creating landscape for narrow network (hidden=2)...")
torch.manual_seed(42)
narrow_model = SimpleNet(hidden_size=2)
narrow_surface, ar, br = visualize_landscape_for_model(narrow_model, "Narrow", X_data, y_data)

print("Creating landscape for wide network (hidden=32)...")
torch.manual_seed(42)
wide_model = SimpleNet(hidden_size=32)
wide_surface, _, _ = visualize_landscape_for_model(wide_model, "Wide", X_data, y_data)

In [None]:
# Compare landscapes
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

A, B = np.meshgrid(ar, br)

# Narrow network
c1 = axes[0].contourf(A, B, narrow_surface, levels=50, cmap='viridis')
axes[0].contour(A, B, narrow_surface, levels=15, colors='white', alpha=0.3)
axes[0].scatter([0], [0], color='red', s=100, marker='*')
axes[0].set_xlabel('Direction 1')
axes[0].set_ylabel('Direction 2')
axes[0].set_title(f'Narrow Network (hidden=2)\n{sum(p.numel() for p in narrow_model.parameters())} params')
plt.colorbar(c1, ax=axes[0], label='Loss')

# Wide network
c2 = axes[1].contourf(A, B, wide_surface, levels=50, cmap='viridis')
axes[1].contour(A, B, wide_surface, levels=15, colors='white', alpha=0.3)
axes[1].scatter([0], [0], color='red', s=100, marker='*')
axes[1].set_xlabel('Direction 1')
axes[1].set_ylabel('Direction 2')
axes[1].set_title(f'Wide Network (hidden=32)\n{sum(p.numel() for p in wide_model.parameters())} params')
plt.colorbar(c2, ax=axes[1], label='Loss')

plt.tight_layout()
plt.show()

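print("\n📊 Observations:")
print("  - Wider networks often have smoother landscapes")
print("  - More parameters give the optimizer more directions for escaping poor regions")
print("  - This is one proposed reason overparameterized networks train well")
print("  - Caveat: each plot is a single random 2D slice, so treat the comparison as suggestive")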

---

## ⚠️ Common Mistakes

### Mistake 1: Forgetting Filter Normalization

```python
# ❌ Wrong: one unnormalized random vector ignores each layer's weight scale
direction = torch.randn_like(get_params_as_vector(model))

# ✅ Right: rescale each layer's direction to that layer's parameter norm
direction = []
for p in model.parameters():
    d = torch.randn_like(p)
    d = d / (d.norm() + 1e-10) * p.norm()  # Scale to match layer
    direction.append(d.view(-1))
direction = torch.cat(direction)
```

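**Why:** Without normalization, layers with small weights are perturbed far more, relative to their scale, than layers with large weights, so the apparent sharpness of the plot is not comparable across layers or models.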
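
The helper in this notebook rescales each parameter tensor as a whole. Filter normalization as described by Li et al. (2018) is finer grained: each filter is rescaled separately, and for a fully connected layer a filter corresponds to the weights feeding one output unit, i.e. one row of the weight matrix. The sketch below shows that row-wise version in NumPy on a made-up weight matrix. The function name and the example weights are assumptions for illustration. The same row-wise operation would apply to an `nn.Linear` weight, which has shape `(out_features, in_features)`.

```python
import numpy as np

def filter_normalized_direction(weight, rng, eps=1e-10):
    """Random direction whose rows match the row norms of `weight`.

    weight: array of shape (out_features, in_features); each row is
    treated as one filter.
    """
    d = rng.standard_normal(weight.shape)
    d_norms = np.linalg.norm(d, axis=1, keepdims=True)
    w_norms = np.linalg.norm(weight, axis=1, keepdims=True)
    # Rescale every row of the direction to the norm of the matching filter
    return d / (d_norms + eps) * w_norms

rng = np.random.default_rng(0)

# Hypothetical layer with 4 output units whose weight scales differ a lot
row_scales = np.array([[0.1], [1.0], [5.0], [20.0]])
weight = rng.standard_normal((4, 2)) * row_scales

direction = filter_normalized_direction(weight, rng)

print("Filter norms:   ", np.linalg.norm(weight, axis=1).round(3))
print("Direction norms:", np.linalg.norm(direction, axis=1).round(3))
```

After normalization, a step of size α perturbs every filter by the same fraction of its own norm, regardless of how large or small that filter's weights are.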

### Mistake 2: Visualizing Too Small a Region

```python
# ❌ Wrong: Too zoomed in - miss the big picture
alpha_range = np.linspace(-0.01, 0.01, 50)

# ✅ Right: Explore a meaningful range
alpha_range = np.linspace(-1.0, 1.0, 50)
```

### Mistake 3: Not Training Before Visualization

```python
# ❌ Wrong: Visualizing around random initialization
model = SimpleNet()
visualize(model)  # Random point in landscape!

# ✅ Right: Train first, then visualize
model = SimpleNet()
train(model)
visualize(model)  # Now we see the minimum!
```

---

## ✋ Try It Yourself

### Exercise: Compare Optimizers on the Same Landscape

Visualize the trajectories of SGD, SGD+Momentum, and Adam on the same loss landscape.

<details>
<summary>💡 Hint</summary>

1. Create a fresh model for each optimizer
2. Use the same random seed for initialization
3. Record trajectories for each
4. Project all to the same PCA space
5. Plot on the same contour
</details>

In [None]:
# YOUR CODE HERE

# Hint structure:
# optimizers_to_compare = [
#     ('SGD', torch.optim.SGD, {'lr': 0.5}),
#     ('Momentum', torch.optim.SGD, {'lr': 0.5, 'momentum': 0.9}),
#     ('Adam', torch.optim.Adam, {'lr': 0.05}),
# ]
# 
# trajectories = {}
# for name, opt_class, opt_kwargs in optimizers_to_compare:
#     torch.manual_seed(42)  # Same initialization!
#     model = SimpleNet(hidden_size=8)
#     optimizer = opt_class(model.parameters(), **opt_kwargs)
#     # ... train and record trajectory

print("Implement the comparison!")

---

## 🎉 Checkpoint

You've learned:

- ✅ **Loss landscapes** are high-dimensional surfaces the optimizer navigates
- ✅ **2D slices** let us visualize what's happening
- ✅ **Trajectories** show how optimization proceeds
- ✅ **Architecture** affects landscape smoothness
- ✅ Wider networks tend to have smoother, easier landscapes

**Key insight:** The loss landscape is the "terrain" your optimizer explores. Understanding it helps you debug training!

---

## 📖 Further Reading

- [Visualizing the Loss Landscape of Neural Nets](https://arxiv.org/abs/1712.09913) - The seminal paper
- [Loss Surfaces, Mode Connectivity, and Fast Ensembling](https://arxiv.org/abs/1802.10026)
- [Deep Double Descent](https://openai.com/blog/deep-double-descent/) - OpenAI blog post

---

## 🧹 Cleanup

In [None]:
import gc
if torch.cuda.is_available():
    torch.cuda.empty_cache()
gc.collect()

print("✅ Cleanup complete!")
print("\n➡️  Next: Task 3.4 - SVD for LoRA Intuition")