# Dropout

Let's think briefly about what we
expect from a good predictive model.
We want it to perform well on unseen data.
Classical generalization theory
suggests that to close the gap between
train and test performance,
we should aim for a simple model.
Simplicity can come in the form
of a small number of dimensions.
We explored this when discussing the
monomial basis functions of linear models.
Additionally, as we saw when discussing weight decay
($\ell_2$ regularization),
the (inverse) norm of the parameters also
represents a useful measure of simplicity.
Another useful notion of simplicity is smoothness,
i.e., that the function should not be sensitive
to small changes to its inputs.
For instance, when we classify images,
we would expect that adding some random noise
to the pixels should be mostly harmless.

Bishop formalized
this idea when he proved that training with input noise
is equivalent to Tikhonov regularization.
This work drew a clear mathematical connection
between the requirement that a function be smooth (and thus simple),
and the requirement that it be resilient
to perturbations in the input.

Then, :citet:`Srivastava.Hinton.Krizhevsky.ea.2014`
developed a clever idea for how to apply Bishop's idea
to the internal layers of a network, too.
Their idea, called *dropout*, involves
injecting noise while computing
each internal layer during forward propagation,
and it has become a standard technique
for training neural networks.
The method is called *dropout* because we literally
*drop out* some neurons during training.
Throughout training, on each iteration,
standard dropout consists of zeroing out
some fraction of the nodes in each layer
before calculating the subsequent layer.

To be clear, we are imposing
our own narrative with the link to Bishop.
The original paper on dropout
offers intuition through a surprising
analogy to sexual reproduction.
The authors argue that neural network overfitting
is characterized by a state in which
each layer relies on a specific
pattern of activations in the previous layer,
calling this condition *co-adaptation*.
Dropout, they claim, breaks up co-adaptation
just as sexual reproduction is argued to
break up co-adapted genes.
While the justification of this theory is certainly up for debate,
the dropout technique itself has proved enduring,
and various forms of dropout are implemented
in most deep learning libraries.


The key challenge is how to inject this noise.
One idea is to inject it in an *unbiased* manner
so that the expected value of each layer---while fixing
the others---equals the value it would have taken absent noise.
In Bishop's work, he added Gaussian noise
to the inputs to a linear model.
At each training iteration, he added noise
sampled from a distribution with mean zero
$\epsilon \sim \mathcal{N}(0,\sigma^2)$ to the input $\mathbf{x}$,
yielding a perturbed point $\mathbf{x}' = \mathbf{x} + \epsilon$.
In expectation, $E[\mathbf{x}'] = \mathbf{x}$.
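
As a quick sanity check of this unbiasedness claim, the sketch below averages many perturbed copies of a scalar input and compares the result with the original value. It uses plain Python rather than PyTorch, and the input `x`, the noise scale `sigma`, the sample count `n`, and the helper name `perturb` are all illustrative assumptions, not values from Bishop's work.

```python
import random

def perturb(x, sigma, rng):
    """Return x plus zero-mean Gaussian noise with standard deviation sigma."""
    return x + rng.gauss(0.0, sigma)

rng = random.Random(0)            # fixed seed so the check is repeatable
x, sigma, n = 2.0, 0.5, 100_000   # illustrative values only

# Average many independently perturbed copies of the same input
mean_perturbed = sum(perturb(x, sigma, rng) for _ in range(n)) / n
print(f"x = {x}, average of perturbed samples = {mean_perturbed:.3f}")
```

The average should land very close to `x`, since the noise has mean zero and its contribution shrinks as the number of samples grows.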

In standard dropout regularization,
one zeros out some fraction of the nodes in each layer
and then *debiases* each layer by normalizing
by the fraction of nodes that were retained (not dropped out).
In other words,
with *dropout probability* $p$,
each intermediate activation $h$ is replaced by
a random variable $h'$ as follows:

$$
\begin{aligned}
h' =
\begin{cases}
    0 & \textrm{ with probability } p \\
    \frac{h}{1-p} & \textrm{ otherwise}
\end{cases}
\end{aligned}
$$

By design, the expectation remains unchanged, i.e., $E[h'] = h$.
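
To see why, it may help to write out the expectation over the two cases explicitly, using only the definition of $h'$ above:

$$
\begin{aligned}
E[h'] &= p \cdot 0 + (1-p) \cdot \frac{h}{1-p} \\
      &= h.
\end{aligned}
$$

This holds for any $p < 1$. When $p = 1$ every activation is dropped, the rescaling factor is undefined, and the output is simply zero.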


In [1]:
import torch
from torch import nn

## Dropout in Practice

Recall the MLP with a hidden layer and five hidden units
from :numref:`fig_mlp`.
When we apply dropout to a hidden layer,
zeroing out each hidden unit with probability $p$,
the result can be viewed as a network
containing only a subset of the original neurons.
In :numref:`fig_dropout2`, $h_2$ and $h_5$ are removed.
Consequently, the calculation of the outputs
no longer depends on $h_2$ or $h_5$
and their respective gradients also vanish
when performing backpropagation.
In this way, the calculation of the output layer
cannot be overly dependent on any
one element of $h_1, \ldots, h_5$.

![MLP before and after dropout.](http://d2l.ai/_images/dropout2.svg)
:label:`fig_dropout2`

Typically, we disable dropout at test time.
Given a trained model and a new example,
we do not drop out any nodes
and thus do not need to normalize.
However, there are some exceptions:
some researchers use dropout at test time as a heuristic
for estimating the *uncertainty* of neural network predictions:
if the predictions agree across many different dropout outputs,
then we might say that the network is more confident.
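
The test-time use of dropout described above, often referred to as Monte Carlo dropout, might be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the tiny network, its layer sizes, the helper name `mc_dropout_predict`, and the choice of 50 samples are all hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Small illustrative network; layer sizes and dropout rate are arbitrary
net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Dropout(0.5),
                    nn.Linear(16, 3))

def mc_dropout_predict(model, X, num_samples=50):
    """Return the mean and standard deviation of softmax outputs
    over several stochastic forward passes with dropout active."""
    model.train()  # keeps nn.Dropout active (also affects layers like BatchNorm)
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(X), dim=1)
                             for _ in range(num_samples)])
    model.eval()   # leave the model in evaluation mode afterwards
    return probs.mean(dim=0), probs.std(dim=0)

X = torch.randn(2, 4)
mean_probs, std_probs = mc_dropout_predict(net, X)
print(mean_probs)
print(std_probs)  # larger values mean more disagreement across dropout masks
```

A small standard deviation across the sampled predictions is then read as a sign that the network is more confident about that example.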

## Implementation from Scratch

To implement the dropout function for a single layer,
we must draw as many samples
from a Bernoulli (binary) random variable
as our layer has dimensions,
where the random variable takes value $1$ (keep)
with probability $1-p$ and $0$ (drop) with probability $p$.
One easy way to implement this is to first draw samples
from the uniform distribution $U[0, 1]$.
Then we can keep those nodes for which the corresponding
sample is greater than $p$, dropping the rest.
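
The thresholding trick can be checked in isolation. The sketch below uses plain Python rather than PyTorch, with an arbitrary dropout probability `p = 0.3` and sample count chosen only for illustration, and confirms that comparing uniform samples against $p$ keeps roughly a fraction $1-p$ of the units.

```python
import random

rng = random.Random(0)   # fixed seed so the check is repeatable
p = 0.3                  # illustrative dropout probability
n = 100_000              # number of units to draw a mask for

# A unit is kept when its uniform sample exceeds p,
# which happens with probability 1 - p
mask = [1 if rng.random() > p else 0 for _ in range(n)]
keep_fraction = sum(mask) / n
print(f"kept {keep_fraction:.3f} of units, expected about {1 - p:.3f}")
```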

In the following code, we (**implement a `dropout_layer` function
that drops out the elements in the tensor input `X`
with probability `dropout`**),
rescaling the remainder as described above:
dividing the survivors by `1.0-dropout`.


In [2]:
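def dropout_layer(X, dropout):
    assert 0 <= dropout <= 1
    # Dropping everything: return zeros and avoid dividing by zero below
    if dropout == 1:
        return torch.zeros_like(X)
    # Keep each element with probability 1 - dropout; sample on X's device
    mask = (torch.rand(X.shape, device=X.device) > dropout).float()
    # Rescale the survivors so that the expected value is unchanged
    return mask * X / (1.0 - dropout)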

We can [**test out the `dropout_layer` function on a few examples**].
In the following lines of code,
we pass our input `X` through the dropout operation,
with probabilities 0, 0.5, and 1, respectively.


In [3]:
X = torch.arange(16, dtype = torch.float32).reshape((2, 8))
print('dropout_p = 0:', dropout_layer(X, 0))
print('dropout_p = 0.5:', dropout_layer(X, 0.5))
print('dropout_p = 1:', dropout_layer(X, 1))

dropout_p = 0: tensor([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11., 12., 13., 14., 15.]])
dropout_p = 0.5: tensor([[ 0.,  0.,  0.,  0.,  0.,  0., 12.,  0.],
        [ 0.,  0., 20.,  0., 24.,  0., 28., 30.]])
dropout_p = 1: tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])


### Defining the Model

The model below applies dropout to the output
of each hidden layer (following the activation function).
We can set dropout probabilities for each layer separately.
A common choice is to set
a lower dropout probability closer to the input layer.
We ensure that dropout is only active during training.


In [6]:
class DropoutMLPScratch(torch.nn.Module):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super(DropoutMLPScratch, self).__init__()
        self.lin1 = nn.LazyLinear(num_hiddens_1)
        self.lin2 = nn.LazyLinear(num_hiddens_2)
        self.lin3 = nn.LazyLinear(num_outputs)
        self.relu = nn.ReLU()
        self.dropout_1 = dropout_1
        self.dropout_2 = dropout_2

    def forward(self, X):
        H1 = self.relu(self.lin1(X.reshape((X.shape[0], -1))))
        if self.training:
            H1 = dropout_layer(H1, self.dropout_1)
        H2 = self.relu(self.lin2(H1))
        if self.training:
            H2 = dropout_layer(H2, self.dropout_2)
        return self.lin3(H2)

### Training

Write code to train the network above on the FashionMNIST dataset for 10 epochs, following the MLP training procedure described earlier in the lectures.

https://pytorch.org/vision/0.19/generated/torchvision.datasets.FashionMNIST.html


In [7]:
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
           'dropout_1':0.5, 'dropout_2':0.5, 'lr':0.1}
model = DropoutMLPScratch(**hparams)

#write your training and testing code here
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader
import torch.optim as optim
import torch.nn.functional as F

# Loading the FashionMNIST dataset
batch_size = 64
transform = transforms.Compose([transforms.ToTensor()])

train_dataset = torchvision.datasets.FashionMNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = torchvision.datasets.FashionMNIST(root='./data', train=False, transform=transform, download=True)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Defining training and evaluation functions
def train_model(model, train_loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        output = model(X)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(train_loader)

def evaluate_model(model, test_loader, device):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for X, y in test_loader:
            X, y = X.to(device), y.to(device)
            output = model(X)
            _, predicted = torch.max(output, 1)
            correct += (predicted == y).sum().item()
            total += y.size(0)
    return correct / total

# Setting device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Training setup
criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=hparams['lr'])
num_epochs = 10

# Training loop
for epoch in range(num_epochs):
    train_loss = train_model(model, train_loader, criterion, optimizer, device)
    test_accuracy = evaluate_model(model, test_loader, device)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}, Test Accuracy: {test_accuracy:.4f}")

Epoch 1/10, Loss: 0.7891, Test Accuracy: 0.8097
Epoch 2/10, Loss: 0.5210, Test Accuracy: 0.8207
Epoch 3/10, Loss: 0.4715, Test Accuracy: 0.8396
Epoch 4/10, Loss: 0.4449, Test Accuracy: 0.8508
Epoch 5/10, Loss: 0.4234, Test Accuracy: 0.8548
Epoch 6/10, Loss: 0.4078, Test Accuracy: 0.8560
Epoch 7/10, Loss: 0.3954, Test Accuracy: 0.8578
Epoch 8/10, Loss: 0.3859, Test Accuracy: 0.8643
Epoch 9/10, Loss: 0.3756, Test Accuracy: 0.8656
Epoch 10/10, Loss: 0.3694, Test Accuracy: 0.8633


## Higher Level Implementation

With high-level APIs, all we need to do is add a `Dropout` layer
after each fully connected layer,
passing in the dropout probability
as the only argument to its constructor.
During training, the `Dropout` layer will randomly
drop out outputs of the previous layer
(or equivalently, the inputs to the subsequent layer)
according to the specified dropout probability.
When not in training mode,
the `Dropout` layer simply passes the data through unchanged.
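
The difference between the two modes can be seen directly on a small tensor. The following is a minimal sketch; the all-ones input and the dropout probability of 0.5 are arbitrary choices made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(0.5)
X = torch.ones(2, 8)

drop.train()          # training mode: random zeros, survivors scaled by 1 / (1 - 0.5)
train_out = drop(X)

drop.eval()           # evaluation mode: the layer acts as the identity
eval_out = drop(X)

print(train_out)      # entries are either 0 or 2
print(eval_out)       # identical to X
```

Calling `model.train()` or `model.eval()` on a network that contains `Dropout` layers switches all of them at once, which is why the training and evaluation functions in this section call those methods before iterating over the data.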


In [9]:
class DropoutMLP(torch.nn.Module):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super(DropoutMLP, self).__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(num_hiddens_1), nn.ReLU(),
            nn.Dropout(dropout_1), nn.LazyLinear(num_hiddens_2), nn.ReLU(),
            nn.Dropout(dropout_2), nn.LazyLinear(num_outputs))
    def forward(self, X):
        return self.net(X)





Next, write your code to train the given model on the FashionMNIST Dataset for 10 epochs.


In [10]:
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
           'dropout_1':0.5, 'dropout_2':0.5, 'lr':0.1}
model = DropoutMLP(**hparams)

# write your training and testing code here
# Loading the FashionMNIST dataset
batch_size = 64
transform = transforms.Compose([transforms.ToTensor()])

train_dataset = torchvision.datasets.FashionMNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = torchvision.datasets.FashionMNIST(root='./data', train=False, transform=transform, download=True)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Defining training and evaluation functions
def train_model(model, train_loader, criterion, optimizer, device):
    model.train()
    total_loss = 0
    for X, y in train_loader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()
        output = model(X)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(train_loader)

def evaluate_model(model, test_loader, device):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for X, y in test_loader:
            X, y = X.to(device), y.to(device)
            output = model(X)
            _, predicted = torch.max(output, 1)
            correct += (predicted == y).sum().item()
            total += y.size(0)
    return correct / total

# Setting device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Training setup
criterion = torch.nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=hparams['lr'])
num_epochs = 10

# Training loop
for epoch in range(num_epochs):
    train_loss = train_model(model, train_loader, criterion, optimizer, device)
    test_accuracy = evaluate_model(model, test_loader, device)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {train_loss:.4f}, Test Accuracy: {test_accuracy:.4f}")

Epoch 1/10, Loss: 0.7909, Test Accuracy: 0.8090
Epoch 2/10, Loss: 0.5210, Test Accuracy: 0.8334
Epoch 3/10, Loss: 0.4704, Test Accuracy: 0.8375
Epoch 4/10, Loss: 0.4437, Test Accuracy: 0.8555
Epoch 5/10, Loss: 0.4263, Test Accuracy: 0.8588
Epoch 6/10, Loss: 0.4079, Test Accuracy: 0.8437
Epoch 7/10, Loss: 0.3951, Test Accuracy: 0.8644
Epoch 8/10, Loss: 0.3847, Test Accuracy: 0.8600
Epoch 9/10, Loss: 0.3786, Test Accuracy: 0.8689
Epoch 10/10, Loss: 0.3704, Test Accuracy: 0.8705


## Summary

Beyond controlling the number of dimensions and the size of the weight vector, dropout is yet another tool for avoiding overfitting. These tools are often used jointly.
Note that dropout is
used only during training:
it replaces an activation $h$ with a random variable with expected value $h$.


## Exercises

### 1. What happens if you change the dropout probabilities for the first and second layers? In particular, what happens if you switch the ones for both layers? Design an experiment to answer these questions, describe your results quantitatively, and summarize the qualitative takeaways.



In [11]:
# Function to load FashionMNIST dataset
def load_data(batch_size=64):
    transform = transforms.Compose([transforms.ToTensor()])
    train_dataset = torchvision.datasets.FashionMNIST(root='./data', train=True, transform=transform, download=True)
    test_dataset = torchvision.datasets.FashionMNIST(root='./data', train=False, transform=transform, download=True)
    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
    return train_loader, test_loader

# Model definition
def initialize_model(dropout_1, dropout_2):
    hparams = {
        'num_outputs': 10, 'num_hiddens_1': 256, 'num_hiddens_2': 256,
        'dropout_1': dropout_1, 'dropout_2': dropout_2, 'lr': 0.1
    }
    model = DropoutMLP(**hparams)
    return model

# Training and Evaluation Functions
def train_and_evaluate(model, train_loader, test_loader, num_epochs=10):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        for X, y in train_loader:
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            output = model(X)
            loss = criterion(output, y)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        test_accuracy = evaluate_model(model, test_loader, device)
        print(f"Epoch {epoch+1}/{num_epochs}, Loss: {total_loss / len(train_loader):.4f}, Test Accuracy: {test_accuracy:.4f}")

# Testing function
def evaluate_model(model, test_loader, device):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for X, y in test_loader:
            X, y = X.to(device), y.to(device)
            output = model(X)
            _, predicted = torch.max(output, 1)
            correct += (predicted == y).sum().item()
            total += y.size(0)
    return correct / total

# Loading data
train_loader, test_loader = load_data()

# Experiment 1: dropout_1 = 0.3, dropout_2 = 0.5
print("Experiment 1: dropout_1 = 0.3, dropout_2 = 0.5")
model1 = initialize_model(dropout_1=0.3, dropout_2=0.5)
train_and_evaluate(model1, train_loader, test_loader, num_epochs=10)

# Experiment 2: dropout_1 = 0.5, dropout_2 = 0.3
print("\nExperiment 2: dropout_1 = 0.5, dropout_2 = 0.3")
model2 = initialize_model(dropout_1=0.5, dropout_2=0.3)
train_and_evaluate(model2, train_loader, test_loader, num_epochs=10)

Experiment 1: dropout_1 = 0.3, dropout_2 = 0.5
Epoch 1/10, Loss: 0.7448, Test Accuracy: 0.7767
Epoch 2/10, Loss: 0.4876, Test Accuracy: 0.8159
Epoch 3/10, Loss: 0.4359, Test Accuracy: 0.8500
Epoch 4/10, Loss: 0.4097, Test Accuracy: 0.8624
Epoch 5/10, Loss: 0.3859, Test Accuracy: 0.8407
Epoch 6/10, Loss: 0.3742, Test Accuracy: 0.8579
Epoch 7/10, Loss: 0.3570, Test Accuracy: 0.8652
Epoch 8/10, Loss: 0.3505, Test Accuracy: 0.8636
Epoch 9/10, Loss: 0.3391, Test Accuracy: 0.8746
Epoch 10/10, Loss: 0.3300, Test Accuracy: 0.8724

Experiment 2: dropout_1 = 0.5, dropout_2 = 0.3
Epoch 1/10, Loss: 0.7717, Test Accuracy: 0.7878
Epoch 2/10, Loss: 0.5030, Test Accuracy: 0.8290
Epoch 3/10, Loss: 0.4585, Test Accuracy: 0.8260
Epoch 4/10, Loss: 0.4294, Test Accuracy: 0.8341
Epoch 5/10, Loss: 0.4104, Test Accuracy: 0.8561
Epoch 6/10, Loss: 0.3942, Test Accuracy: 0.8455
Epoch 7/10, Loss: 0.3810, Test Accuracy: 0.8598
Epoch 8/10, Loss: 0.3738, Test Accuracy: 0.8586
Epoch 9/10, Loss: 0.3646, Test Accuracy:

**Results**:
- **Experiment 1** (`dropout_1 = 0.3`, `dropout_2 = 0.5`):
  - The model started with a test accuracy of 77.67% and gradually improved to 87.24% after 10 epochs.
  - The final training loss was 0.3300. Training loss alone does not measure generalization, but together with the test accuracy it suggests that the model generalizes reasonably well with this dropout configuration.

- **Experiment 2** (`dropout_1 = 0.5`, `dropout_2 = 0.3`):
  - The model started with a test accuracy of 78.78% and reached 85.24% after 10 epochs.
  - The final loss was around 0.3563, which is slightly higher than Experiment 1.

**Qualitative Takeaways**:
- **Lower dropout in initial layers** (Experiment 1) gave higher test accuracy and lower training loss than **higher dropout in the initial layers** (Experiment 2) in these runs.
- Dropout closer to the input layer (Experiment 2) may disrupt the feature extraction process more, which could explain the slightly lower performance. In contrast, a higher dropout probability in later layers (Experiment 1) may improve generalization by concentrating regularization in the deeper layers, where co-adaptation of features is more likely to occur.

This suggests that placing a higher dropout rate in deeper layers might be preferable for achieving better generalization. Because each configuration was run only once, without a fixed random seed, and test accuracy fluctuates by up to about two percentage points between consecutive epochs in the logs above, this conclusion is tentative.

### 2. Increase the number of epochs to 50 and compare the results.

In [12]:
# Experiment 3: Training for 50 epochs with dropout_1 = 0.5, dropout_2 = 0.5
print("\nExperiment 3: Training for 50 epochs with dropout_1 = 0.5, dropout_2 = 0.5")
model3 = initialize_model(dropout_1=0.5, dropout_2=0.5)
train_and_evaluate(model3, train_loader, test_loader, num_epochs=50)


Experiment 3: Training for 50 epochs with dropout_1 = 0.5, dropout_2 = 0.5
Epoch 1/50, Loss: 0.7935, Test Accuracy: 0.8043
Epoch 2/50, Loss: 0.5233, Test Accuracy: 0.8261
Epoch 3/50, Loss: 0.4726, Test Accuracy: 0.8419
Epoch 4/50, Loss: 0.4456, Test Accuracy: 0.8440
Epoch 5/50, Loss: 0.4218, Test Accuracy: 0.8513
Epoch 6/50, Loss: 0.4058, Test Accuracy: 0.8535
Epoch 7/50, Loss: 0.3961, Test Accuracy: 0.8549
Epoch 8/50, Loss: 0.3856, Test Accuracy: 0.8632
Epoch 9/50, Loss: 0.3770, Test Accuracy: 0.8650
Epoch 10/50, Loss: 0.3689, Test Accuracy: 0.8693
Epoch 11/50, Loss: 0.3591, Test Accuracy: 0.8674
Epoch 12/50, Loss: 0.3546, Test Accuracy: 0.8699
Epoch 13/50, Loss: 0.3516, Test Accuracy: 0.8688
Epoch 14/50, Loss: 0.3481, Test Accuracy: 0.8612
Epoch 15/50, Loss: 0.3398, Test Accuracy: 0.8772
Epoch 16/50, Loss: 0.3336, Test Accuracy: 0.8782
Epoch 17/50, Loss: 0.3299, Test Accuracy: 0.8722
Epoch 18/50, Loss: 0.3300, Test Accuracy: 0.8806
Epoch 19/50, Loss: 0.3267, Test Accuracy: 0.8758
Ep

**Results**:
- Training the model for 50 epochs with `dropout_1 = 0.5` and `dropout_2 = 0.5` yielded a final test accuracy of approximately **89.08%** and a final loss of around **0.2614**.
- Test accuracy improved rapidly at first, already reaching around 87-88% by epochs 15 to 18 in the log above, after which it continued to improve but at a slower pace.

**Qualitative Takeaways**:
- **Increased epochs** allowed the model to achieve higher accuracy as it continued learning from the data.
- However, the rate of improvement slowed after around 30 epochs, showing **diminishing returns** in accuracy. This suggests that while longer training can yield better results, the model begins to converge after a certain number of epochs.
- Test accuracy kept improving over prolonged training, which is consistent with dropout (with both `dropout_1` and `dropout_2` at 0.5) limiting overfitting. Confirming that dropout is responsible would require a baseline run without dropout.


### 3. Why Dropout Is Not Typically Used at Test Time

**Answer**:
Dropout is used only during training because it introduces randomness by randomly deactivating neurons in each layer, which helps prevent overfitting by forcing the model to learn redundant representations. However, at test time, we want to use the **full model without any randomness** to ensure stable and consistent predictions.

If dropout were applied during testing, the predictions would vary due to random deactivation of neurons, resulting in **inconsistent and less reliable predictions**. Instead, during testing, we use the entire network with all neurons active. Because the surviving activations were already divided by $1-p$ during training, the expected value of each activation matches its test-time value, so no rescaling is needed at test time.

---

### Summary
1. **Changing Dropout Rates**: In our single-run experiment, using a lower dropout rate in the earlier layer and a higher rate in the later layer gave better test accuracy; more runs are needed to confirm this.
2. **Increasing Epochs**: Training for more epochs can enhance accuracy but with diminishing improvements after a certain point.
3. **Dropout at Test Time**: Dropout is not used at test time to maintain stability and consistency in predictions.