# Assignment 2: 2-Layer MLP From Scratch

**Goal**: Implement a 2-layer Multi-Layer Perceptron without using `nn.Sequential`.

## Requirements
- ❌ No `nn.Sequential` allowed
- ✅ Use `nn.Linear` and `nn.ReLU`, but define each one explicitly as a module attribute
- ✅ Implement forward pass manually
- ✅ Achieve >85% accuracy on MNIST

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import matplotlib.pyplot as plt
import numpy as np

# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

# Check for GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

## Step 1: Implement the 2-Layer MLP

**Important**: Do NOT use `nn.Sequential`!

In [None]:
class TwoLayerMLP(nn.Module):
    """
    A 2-layer MLP implemented from scratch.
    
    Architecture:
        Input (784) -> Linear -> ReLU -> Linear -> Output (10)
    """
    
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        
        # Define each layer explicitly as an attribute (no nn.Sequential)
        self.fc1 = nn.Linear(input_dim, hidden_dim)   # First linear layer
        self.relu = nn.ReLU()                         # Activation function
        self.fc2 = nn.Linear(hidden_dim, output_dim)  # Second linear layer
        
        # Initialize weights
        self._init_weights()
    
    def _init_weights(self):
        """Initialize weights: Kaiming for the ReLU layer, Xavier for the output layer."""
        nn.init.kaiming_uniform_(self.fc1.weight, nonlinearity='relu')
        nn.init.xavier_uniform_(self.fc2.weight)
        nn.init.zeros_(self.fc1.bias)
        nn.init.zeros_(self.fc2.bias)
    
    def forward(self, x):
        """
        Forward pass.
        
        Args:
            x: Input tensor of shape (batch_size, 1, 28, 28), (batch_size, 28, 28),
               or (batch_size, 784)
        
        Returns:
            Output tensor of raw logits with shape (batch_size, 10)
        """
        # 1. Flatten everything except the batch dimension
        x = x.view(x.size(0), -1)
        # 2. First layer + activation
        x = self.relu(self.fc1(x))
        # 3. Second layer
        x = self.fc2(x)
        # 4. Return logits (no softmax - CrossEntropyLoss handles it)
        return x

# Create model instance
model = TwoLayerMLP(input_dim=28*28, hidden_dim=256, output_dim=10).to(device)

# Verify no nn.Sequential is used
has_sequential = any(isinstance(m, nn.Sequential) for m in model.modules())
print(f"Uses nn.Sequential: {has_sequential} (should be False)")

print("\nModel architecture:")
print(model)

## Step 2: Load MNIST Dataset

In [None]:
# Load MNIST, normalizing with the dataset's mean (0.1307) and std (0.3081)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST('./data', train=False, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")

## Step 3: Visualize Some Data

In [None]:
# Visualize the first 10 images of one training batch
example_data, example_targets = next(iter(train_loader))

fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i in range(10):
    ax = axes[i//5, i%5]
    ax.imshow(example_data[i].squeeze(), cmap='gray')
    ax.set_title(f'Label: {example_targets[i].item()}')
    ax.axis('off')
plt.tight_layout()
plt.show()

## Step 4: Define Loss and Optimizer

In [None]:
# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

print("Loss function:", criterion)
print("Optimizer:", optimizer)

## Step 5: Training Loop

In [None]:
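# Train for a fixed number of epochs, evaluating on the test set after each one
num_epochs = 10
history = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}

for epoch in range(num_epochs):
    # Training phase
    model.train()
    train_loss = 0.0
    correct = 0
    total = 0
    
    for data, target in train_loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()              # Clear gradients from the previous step
        output = model(data)               # Forward pass
        loss = criterion(output, target)   # Mean cross-entropy over the batch
        loss.backward()                    # Backpropagate
        optimizer.step()                   # Update weights
        
        # Weight the batch-mean loss by batch size so the epoch average is exact
        train_loss += loss.item() * data.size(0)
        correct += (output.argmax(dim=1) == target).sum().item()
        total += target.size(0)
    
    train_loss /= total
    train_acc = 100.0 * correct / total
    
    # Validation phase
    model.eval()
    val_loss = 0.0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            val_loss += criterion(output, target).item() * data.size(0)
            correct += (output.argmax(dim=1) == target).sum().item()
            total += target.size(0)
    
    val_loss /= total
    val_acc = 100.0 * correct / total
    
    # Record history and print progress
    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)
    history['val_loss'].append(val_loss)
    history['val_acc'].append(val_acc)
    print(f"Epoch {epoch+1}/{num_epochs} | "
          f"Train loss: {train_loss:.4f}, acc: {train_acc:.2f}% | "
          f"Val loss: {val_loss:.4f}, acc: {val_acc:.2f}%")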

## Step 6: Plot Training Curves

In [None]:
# Plot loss and accuracy curves from the recorded history
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Loss plot
epochs = range(1, num_epochs + 1)
ax1.plot(epochs, history['train_loss'], 'b-', label='Train Loss')
ax1.plot(epochs, history['val_loss'], 'r-', label='Val Loss')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training and Validation Loss')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Accuracy plot
ax2.plot(epochs, history['train_acc'], 'b-', label='Train Acc')
ax2.plot(epochs, history['val_acc'], 'r-', label='Val Acc')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy (%)')
ax2.set_title('Training and Validation Accuracy')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Step 7: Test Predictions

In [None]:
# Visualize predictions on the first 10 images of one test batch
model.eval()
with torch.no_grad():
    example_data, example_targets = next(iter(test_loader))
    example_data = example_data.to(device)
    
    output = model(example_data)
    predictions = output.argmax(dim=1)  # Index of the largest logit per sample

fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i in range(10):
    ax = axes[i//5, i%5]
    ax.imshow(example_data[i].cpu().squeeze(), cmap='gray')
    pred = predictions[i].item()
    true = example_targets[i].item()
    color = 'green' if pred == true else 'red'
    ax.set_title(f'Pred: {pred}, True: {true}', color=color)
    ax.axis('off')
plt.tight_layout()
plt.show()

## Questions

### 1. What is the advantage of NOT using `nn.Sequential`?

**Answer:**  
The main advantage is **greater flexibility and control over the forward pass**.

- Enables complex architectures (skip connections, multi-branch networks)
- Allows conditional logic inside `forward()`
- Supports layer reuse and weight sharing
- Makes it easier to inspect or manipulate intermediate activations
- Allows insertion of custom operations (attention, gating, etc.)

`nn.Sequential` is best for simple linear stacks of layers, while custom modules allow full computational graph control.
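
As a rough illustration of the points above, here is a minimal sketch of a block that a plain `nn.Sequential` chain cannot express directly, because the input is needed twice. The class name `ResidualMLP`, the `return_hidden` flag, and the dimensions are hypothetical and are not part of the assignment.

```python
import torch
import torch.nn as nn


class ResidualMLP(nn.Module):
    """Hypothetical example: an MLP block whose input is added back to its output."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, return_hidden=False):
        hidden = self.relu(self.fc1(x))  # Intermediate activation, kept for inspection
        out = x + self.fc2(hidden)       # Skip connection: x is used a second time here
        if return_hidden:                # Conditional logic inside forward()
            return out, hidden
        return out


block = ResidualMLP(dim=16, hidden_dim=32)
x = torch.randn(4, 16)
out, hidden = block(x, return_hidden=True)
print(out.shape, hidden.shape)
```

Because `forward()` is ordinary Python, the same module can return the hidden activation on request without any hooks.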

---

### 2. How does weight initialization affect training?

**Answer:**  
Weight initialization affects **training stability, convergence speed, and final performance**.

- **Weights too small** → activations and gradients shrink layer by layer (vanishing)
- **Weights too large** → activations and gradients grow layer by layer (exploding)
- Proper initialization preserves activation and gradient variance across layers

**Common strategies:**
- **Xavier (Glorot)** initialization for tanh/sigmoid activations
- **He (Kaiming)** initialization for ReLU-based activations

Good initialization helps gradients flow efficiently, especially in deep networks.
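
The effect can be seen in a small experiment. The sketch below pushes random inputs through a stack of ReLU layers and reports the standard deviation of the final activations under three initializations. The helper `final_activation_std`, the depth of 10, and the width of 256 are illustrative assumptions, and the stack is deliberately deeper than the 2-layer model in this assignment so the trend is visible.

```python
import torch
import torch.nn as nn


def final_activation_std(init_fn, depth=10, width=256, seed=0):
    """Push random inputs through `depth` bias-free ReLU layers; return the final std."""
    torch.manual_seed(seed)
    x = torch.randn(512, width)
    for _ in range(depth):
        w = torch.empty(width, width)
        init_fn(w)                    # Fill the weight matrix in place
        x = torch.relu(x @ w.t())     # Linear layer followed by ReLU
    return x.std().item()


too_small = final_activation_std(lambda w: nn.init.normal_(w, std=0.01))
too_large = final_activation_std(lambda w: nn.init.normal_(w, std=1.0))
kaiming = final_activation_std(
    lambda w: nn.init.kaiming_normal_(w, nonlinearity='relu')
)

print(f"Normal(std=0.01): {too_small:.3e}")
print(f"Normal(std=1.0):  {too_large:.3e}")
print(f"Kaiming normal:   {kaiming:.3e}")
```

With the small standard deviation the activations should collapse toward zero, with the large one they should blow up, and with Kaiming initialization they should stay on the order of one.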

---

### 3. What happens if you use too many / too few hidden units?

**Answer:**

- **Too few hidden units**
  - Underfitting
  - High bias
  - Model cannot capture data complexity
  - Poor training and test performance

- **Too many hidden units**
  - Overfitting
  - High variance
  - Good training performance, poor generalization
  - Increased computation and memory cost

Choose the number of hidden units to match the complexity of the data, and use regularization to limit overfitting when the layer is large.
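
To make the computation and memory cost concrete, the sketch below counts the trainable parameters of a 2-layer MLP (weights plus biases of both linear layers) for a few hidden sizes. The helper name `mlp_param_count` and the list of sizes are hypothetical choices for illustration.

```python
def mlp_param_count(input_dim, hidden_dim, output_dim):
    """Count weights and biases of Linear(input, hidden) + Linear(hidden, output)."""
    fc1 = input_dim * hidden_dim + hidden_dim    # First layer: weight matrix + bias
    fc2 = hidden_dim * output_dim + output_dim   # Second layer: weight matrix + bias
    return fc1 + fc2


# Parameter count grows linearly with the hidden size for fixed input/output dims
for hidden_dim in (16, 64, 256, 1024):
    count = mlp_param_count(784, hidden_dim, 10)
    print(f"hidden_dim={hidden_dim:5d} -> {count:,} parameters")
```

For the MNIST dimensions used here (784 inputs, 10 outputs), almost all parameters sit in the first layer, so the hidden size largely determines the model size.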

---

### 4. How could you improve the model’s accuracy?

**Answer:**

- **Architecture**
  - Add or adjust hidden layers/units
  - Try other activation functions (e.g., LeakyReLU, GELU)
  - Add residual connections

- **Optimization**
  - Tune learning rate
  - Try adaptive optimizers (Adam, AdamW)
  - Apply learning rate schedules

- **Regularization**
  - Dropout
  - Weight decay (L2)
  - Batch normalization
  - Early stopping

- **Data**
  - Collect more data
  - Apply data augmentation
  - Improve feature quality

- **Training**
  - Better weight initialization
  - Train longer if not overfitting

Improving accuracy requires balancing model capacity, optimization, regularization, and data quality.
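
A few of these ideas can be combined in a short sketch: dropout after the hidden layer, AdamW with weight decay, and a step learning rate schedule. The class `RegularizedMLP` and all hyperparameter values are assumptions chosen for illustration, not tuned settings, and a random batch stands in for MNIST so the snippet runs on its own.

```python
import torch
import torch.nn as nn
import torch.optim as optim


class RegularizedMLP(nn.Module):
    """Hypothetical variant of the 2-layer MLP with dropout after the hidden layer."""

    def __init__(self, input_dim=784, hidden_dim=256, output_dim=10, p_drop=0.2):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p_drop)   # Active only in train() mode
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        x = self.dropout(self.relu(self.fc1(x)))
        return self.fc2(x)


torch.manual_seed(0)
model_reg = RegularizedMLP()
loss_fn = nn.CrossEntropyLoss()
# AdamW applies decoupled weight decay; the values here are illustrative
optimizer_reg = optim.AdamW(model_reg.parameters(), lr=1e-3, weight_decay=1e-2)
# Halve the learning rate every 5 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer_reg, step_size=5, gamma=0.5)

# Random batch standing in for MNIST images and labels
images = torch.randn(8, 1, 28, 28)
labels = torch.randint(0, 10, (8,))

for epoch in range(5):
    model_reg.train()
    optimizer_reg.zero_grad()
    loss = loss_fn(model_reg(images), labels)
    loss.backward()
    optimizer_reg.step()
    scheduler.step()   # Call once per epoch, after the optimizer step

final_lr = optimizer_reg.param_groups[0]['lr']
print(f"Learning rate after 5 epochs: {final_lr}")
```

In the real training loop, `scheduler.step()` would go at the end of each epoch, and `model.eval()` would disable dropout during validation.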
