# 🎲 Dropout Regularization: Preventing Overfitting

This notebook demonstrates **Dropout Regularization**, a technique that randomly deactivates neurons during training to reduce overfitting. We'll train two models on the Sonar dataset, one without dropout and one with it, and compare their performance.

## What You'll Learn

1. How dropout prevents overfitting
2. Implementing dropout in PyTorch
3. Comparing models with and without dropout
4. Interpreting training vs validation loss curves

---

## 1. Setup and Imports

In [None]:
import pandas as pd
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

## 2. Loading the Sonar Dataset

The **Sonar dataset** is a classic binary classification problem:
- **Task**: Distinguish between sonar signals bounced off a metal cylinder (Mine) vs a rock
- **Features**: 60 numerical attributes (energy in 60 frequency bands, each scaled to the range 0 to 1)
- **Samples**: 208 total
- **Classes**: 'M' (Mine) and 'R' (Rock)

This is a challenging dataset because:
- Small sample size (easy to overfit)
- High-dimensional features (60 features for 208 samples)

**Perfect for demonstrating dropout's effectiveness!**

In [None]:
df = pd.read_csv("sonar.all-data", header=None)
print(f"Dataset shape: {df.shape}")
df.sample(5)

In [None]:
# Check class distribution
print("Class distribution:")
print(df[60].value_counts())

## 3. Data Preprocessing

We need to:
1. Convert class labels ('M', 'R') to numeric (0, 1)
2. Split into features (X) and target (y)
3. Create train/test split
4. Convert to PyTorch tensors

In [None]:
# Convert labels: M (Mine) = 0, R (Rock) = 1
df[60] = df[60].map({'M': 0, 'R': 1})

# Split features and target
X = df.drop(60, axis=1)
y = df[60]

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

In [None]:
# Train/test split (75% train, 25% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")

In [None]:
# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train.values, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
X_test_tensor = torch.tensor(X_test.values, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

# Create TensorDatasets and DataLoaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

print(f"Number of batches: {len(train_loader)}")

---

## 4. Model Without Dropout

First, let's create a neural network **without any dropout**. This model is prone to overfitting, especially on small datasets like Sonar.

### Architecture

```
Input (60) → Hidden (128) → ReLU → Hidden (64) → ReLU → Output (2)
```

In [None]:
class SimpleNN(nn.Module):
    """Neural network WITHOUT dropout - prone to overfitting."""
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(60, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 2)  # Output: 2 classes (Mine, Rock)
        )

    def forward(self, x):
        return self.network(x)
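
To put the overfitting risk in numbers, the sketch below counts the trainable parameters of this architecture. It rebuilds the same layer sizes in a standalone `nn.Sequential` (the name `net` is only for illustration) so that it runs on its own.

```python
from torch import nn

# Same layer sizes as SimpleNN, rebuilt standalone for illustration
net = nn.Sequential(
    nn.Linear(60, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)

# Sum the element counts of every trainable tensor (weights and biases)
n_params = sum(p.numel() for p in net.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_params}")  # 16194
```

With 156 training samples (75% of 208), that is roughly 100 parameters per example, so the network has more than enough capacity to memorize the training set.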

## 5. Training Function

We define a training function that tracks:
- **Training loss**: How well the model fits training data
- **Validation loss**: How well the model generalizes
- **Validation accuracy**: Classification performance

**Key insight**: When training loss decreases but validation loss increases, the model is overfitting!

In [None]:
def train_model(model, train_loader, val_loader, criterion, optimizer, epochs=20):
    """
    Train model and track metrics.
    
    Returns:
        train_losses, val_losses, val_accuracies
    """
    train_losses, val_losses, val_accuracies = [], [], []
    
    for epoch in range(epochs):
        # Training phase
        model.train()  # Enable dropout
        running_loss = 0.0
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        train_losses.append(running_loss / len(train_loader))
        
        # Validation phase
        model.eval()  # Disable dropout
        val_loss = 0.0
        y_pred, y_true = [], []
        with torch.no_grad():
            for inputs, labels in val_loader:
                outputs = model(inputs)
                loss = criterion(outputs, labels)
                val_loss += loss.item()
                _, predicted = torch.max(outputs, 1)
                y_pred.extend(predicted.cpu().numpy())
                y_true.extend(labels.cpu().numpy())
        
        val_losses.append(val_loss / len(val_loader))
        val_accuracy = accuracy_score(y_true, y_pred)
        val_accuracies.append(val_accuracy)
        
        print(f"Epoch {epoch+1:2d}/{epochs} | "
              f"Train Loss: {train_losses[-1]:.4f} | "
              f"Val Loss: {val_losses[-1]:.4f} | "
              f"Val Acc: {val_accuracy:.4f}")
    
    return train_losses, val_losses, val_accuracies
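
One detail in `train_model` is worth knowing: it averages the per-batch mean losses, so a smaller final batch counts as much as a full one. With 52 test samples and `batch_size=16`, the last batch holds only 4 samples. The following is a minimal sketch of a sample-weighted alternative. `evaluate` is a hypothetical helper, not part of the original notebook, and the random tensors only stand in for the real data.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, criterion):
    """Return (mean loss per sample, accuracy) over a DataLoader."""
    model.eval()  # disable dropout for evaluation
    total_loss, correct, n = 0.0, 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            outputs = model(inputs)
            batch = labels.size(0)
            # criterion returns the batch mean, so weight it by the batch size
            total_loss += criterion(outputs, labels).item() * batch
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            n += batch
    return total_loss / n, correct / n

# Placeholder data with the same shape as the Sonar test split
torch.manual_seed(0)
X = torch.rand(52, 60)
y = torch.randint(0, 2, (52,))
loader = DataLoader(TensorDataset(X, y), batch_size=16)

model = nn.Linear(60, 2)
criterion = nn.CrossEntropyLoss()
loss, acc = evaluate(model, loader, criterion)
print(f"loss={loss:.4f}, acc={acc:.4f}")
```

The sample-weighted loss matches what a single pass over the whole split would report, regardless of batch size. On a dataset this small the difference from the batch-mean average is minor, but it keeps curves comparable if the batch size changes.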

## 6. Train Model Without Dropout

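Let's train the model without dropout and observe the overfitting behavior. The held-out test split doubles as the validation set here, which is fine for a demonstration but would bias model selection in a real project.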

In [None]:
# Initialize model without dropout
model_without_dropout = SimpleNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model_without_dropout.parameters(), lr=0.001)

print("Training WITHOUT Dropout")
print("=" * 60)
train_losses_no_dropout, val_losses_no_dropout, val_accuracies_no_dropout = train_model(
    model_without_dropout, train_loader, test_loader, criterion, optimizer, epochs=20
)

In [None]:
# Visualize training progress
plt.figure(figsize=(10, 6))
epochs = range(1, len(train_losses_no_dropout) + 1)  # follows the number of epochs actually trained

plt.plot(epochs, train_losses_no_dropout, 'b-', label="Train Loss", linewidth=2)
plt.plot(epochs, val_losses_no_dropout, 'r-', label="Validation Loss", linewidth=2)
plt.plot(epochs, val_accuracies_no_dropout, 'g--', label="Validation Accuracy", linewidth=2)

plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss / Accuracy', fontsize=12)
plt.title('Training WITHOUT Dropout (Overfitting Expected)', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

### Analyzing the Results

Look for signs of **overfitting**:
- Training loss keeps decreasing
- Validation loss starts increasing (or plateaus while train loss drops)
- Gap between training and validation loss grows
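
These signs can also be checked numerically rather than by eye. The sketch below assumes two equal-length lists of per-epoch losses, such as those returned by `train_model`. `summarize_curves` is a hypothetical helper, and the numbers are made up for illustration; they are not results from this notebook.

```python
def summarize_curves(train_losses, val_losses):
    """Return the best epoch (1-indexed) by validation loss and the per-epoch gap."""
    gaps = [v - t for t, v in zip(train_losses, val_losses)]
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1
    return best_epoch, gaps

# Made-up curves for illustration only (not results from this notebook)
train = [0.70, 0.55, 0.42, 0.30, 0.21, 0.15]
val = [0.69, 0.58, 0.50, 0.48, 0.52, 0.57]

best_epoch, gaps = summarize_curves(train, val)
print(f"Validation loss bottoms out at epoch {best_epoch}")
print("Gap per epoch:", [round(g, 2) for g in gaps])
```

A gap that keeps widening after the best epoch, as in these synthetic curves, is the pattern to look for in the plot above.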

---

## 7. Model With Dropout

Now let's add **Dropout layers** to prevent overfitting.

### How Dropout Works

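During training, dropout randomly "drops" (sets to zero) each neuron's output with probability `p` and scales the surviving outputs by `1/(1-p)`, so the expected activation is unchanged. At inference, every neuron is active and no scaling is applied: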

```
Training:   [●][○][●][●][○][●]  (○ = dropped)
Inference:  [●][●][●][●][●][●]  (all active)
```

This forces the network to:
- Not rely on any single neuron
- Learn redundant representations
- Generalize better
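
A minimal sketch of the mechanism, assuming the "inverted dropout" formulation that PyTorch uses: each element is zeroed with probability `p`, and survivors are multiplied by `1/(1-p)` so the expected activation stays the same. `manual_dropout` is a hypothetical helper written for illustration and assumes `p < 1`.

```python
import torch
from torch import nn

def manual_dropout(x, p=0.5, training=True):
    """Inverted dropout: zero each element with probability p, rescale the rest."""
    if not training or p == 0.0:
        return x  # inference: pass activations through unchanged
    keep_mask = (torch.rand_like(x) >= p).float()  # 1 = keep, 0 = drop
    return x * keep_mask / (1.0 - p)

torch.manual_seed(0)
x = torch.ones(10_000)

out = manual_dropout(x, p=0.5)
print(out.unique())        # only 0 and 1 / (1 - p) = 2 appear
print(out.mean().item())   # close to 1.0: the expected activation is preserved

# The built-in layer is the identity in eval mode
layer = nn.Dropout(p=0.5)
layer.eval()
print(torch.equal(layer(x), x))
```

Because the rescaling happens at training time, no correction is needed at inference, which is why `model.eval()` can simply switch the layer off.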

### Architecture with Dropout

```
Input (60) → Hidden (128) → ReLU → Dropout(0.5) → Hidden (64) → ReLU → Dropout(0.5) → Output (2)
```

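We use `p=0.5`, meaning each hidden activation is zeroed with 50% probability during training. In PyTorch, `p` is the probability of dropping a unit, not of keeping it. A rate of 0.5 is a common default for hidden layers.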

In [None]:
class SimpleNNWithDropout(nn.Module):
    """Neural network WITH dropout - better generalization."""
    def __init__(self):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(60, 128),
            nn.ReLU(),
            nn.Dropout(p=0.5),  # 50% dropout after first hidden layer
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(p=0.5),  # 50% dropout after second hidden layer
            nn.Linear(64, 2)    # No dropout before output
        )

    def forward(self, x):
        return self.network(x)

## 8. Train Model With Dropout

In [None]:
# Initialize model with dropout
model_with_dropout = SimpleNNWithDropout()
optimizer = optim.Adam(model_with_dropout.parameters(), lr=0.001)

print("\nTraining WITH Dropout (p=0.5)")
print("=" * 60)
train_losses_with_dropout, val_losses_with_dropout, val_accuracies_with_dropout = train_model(
    model_with_dropout, train_loader, test_loader, criterion, optimizer, epochs=20
)

In [None]:
# Visualize training progress with dropout
plt.figure(figsize=(10, 6))
epochs = range(1, len(train_losses_with_dropout) + 1)  # follows the number of epochs actually trained

plt.plot(epochs, train_losses_with_dropout, 'b-', label="Train Loss", linewidth=2)
plt.plot(epochs, val_losses_with_dropout, 'r-', label="Validation Loss", linewidth=2)
plt.plot(epochs, val_accuracies_with_dropout, 'g--', label="Validation Accuracy", linewidth=2)

plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Loss / Accuracy', fontsize=12)
plt.title('Training WITH Dropout (Better Generalization)', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.show()

---

## 9. Comparing Both Models

Let's compare the validation performance of both models side by side.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
epochs = range(1, len(val_losses_no_dropout) + 1)  # follows the number of epochs actually trained

# Validation Loss Comparison
axes[0].plot(epochs, val_losses_no_dropout, 'r-o', label="Without Dropout", linewidth=2)
axes[0].plot(epochs, val_losses_with_dropout, 'b-o', label="With Dropout", linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Validation Loss', fontsize=12)
axes[0].set_title('Validation Loss Comparison', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

# Validation Accuracy Comparison
axes[1].plot(epochs, val_accuracies_no_dropout, 'r-o', label="Without Dropout", linewidth=2)
axes[1].plot(epochs, val_accuracies_with_dropout, 'b-o', label="With Dropout", linewidth=2)
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Validation Accuracy', fontsize=12)
axes[1].set_title('Validation Accuracy Comparison', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Summary statistics
print("\n" + "=" * 50)
print("SUMMARY")
print("=" * 50)
print(f"\nWithout Dropout:")
print(f"  Best Val Accuracy: {max(val_accuracies_no_dropout):.4f}")
print(f"  Final Val Accuracy: {val_accuracies_no_dropout[-1]:.4f}")
print(f"  Final Val Loss: {val_losses_no_dropout[-1]:.4f}")

print(f"\nWith Dropout:")
print(f"  Best Val Accuracy: {max(val_accuracies_with_dropout):.4f}")
print(f"  Final Val Accuracy: {val_accuracies_with_dropout[-1]:.4f}")
print(f"  Final Val Loss: {val_losses_with_dropout[-1]:.4f}")

---

## 10. Key Takeaways

### What We Observed

| Aspect | Without Dropout | With Dropout |
|--------|-----------------|---------------|
| **Training loss** | Decreases rapidly | Decreases more slowly |
| **Validation loss** | May increase (overfit) | More stable |
| **Generalization** | Often weaker on small datasets | Usually better |

### When to Use Dropout

- Small datasets (like Sonar)
- Deep networks with many parameters
- When you see overfitting (train loss << val loss)
- Fully connected layers (for convolutional feature maps, `nn.Dropout2d` drops whole channels instead)
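
For the CNN case mentioned in the last bullet, here is a small sketch of how `nn.Dropout2d` differs from `nn.Dropout`: it zeroes entire channels rather than individual elements. The tensor shape is an arbitrary example, not something used elsewhere in this notebook.

```python
import torch
from torch import nn

torch.manual_seed(0)

x = torch.ones(1, 8, 4, 4)   # (batch, channels, height, width)

drop2d = nn.Dropout2d(p=0.5)
drop2d.train()               # dropout is only active in training mode
out = drop2d(x)

# Each channel is either entirely zero or entirely scaled by 1 / (1 - p)
channel_values = out.reshape(8, -1)
for c, values in enumerate(channel_values):
    print(f"channel {c}: min={values.min().item():.1f}, max={values.max().item():.1f}")
```

Dropping whole channels matters because neighboring pixels in a feature map are strongly correlated, so element-wise dropout removes little information there.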

### Dropout Best Practices

1. **Start with p=0.5** for hidden layers
2. **Use lower rates (0.1-0.2)** for input layers
3. **Never apply dropout to the output layer**
4. **Always call `model.eval()`** before inference, and `model.train()` before resuming training
5. **Combine with other regularization** (L2, early stopping) for best results
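
Several of these practices can be combined in one place. The sketch below is one possible arrangement, not the notebook's model: the input dropout rate of 0.2 and `weight_decay=1e-4` are arbitrary starting values chosen for illustration, and the random batch stands in for real data.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

model = nn.Sequential(
    nn.Dropout(p=0.2),      # lighter dropout on the raw inputs
    nn.Linear(60, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),      # standard rate for hidden layers
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 2),       # no dropout after the output layer
)

# weight_decay adds an L2 penalty on the weights (value is an assumption)
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)

x = torch.rand(4, 60)       # placeholder batch

model.train()               # dropout active: repeated passes differ
train_a, train_b = model(x), model(x)

model.eval()                # dropout off: repeated passes are identical
with torch.no_grad():
    eval_a, eval_b = model(x), model(x)

print("train passes equal:", torch.equal(train_a, train_b))
print("eval passes equal: ", torch.equal(eval_a, eval_b))
```

The two training-mode passes disagree because each one samples a fresh dropout mask, which is exactly the behavior that makes forgetting `model.eval()` produce noisy predictions.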