# Dropout

Let's think briefly about what we
expect from a good predictive model.
We want it to perform well on unseen data.
Classical generalization theory
suggests that to close the gap between
train and test performance,
we should aim for a simple model.
Simplicity can come in the form
of a small number of dimensions.
We explored this when discussing the
monomial basis functions of linear models.
Additionally, as we saw when discussing weight decay
($\ell_2$ regularization),
the (inverse) norm of the parameters also
represents a useful measure of simplicity.
Another useful notion of simplicity is smoothness,
i.e., that the function should not be sensitive
to small changes to its inputs.
For instance, when we classify images,
we would expect that adding some random noise
to the pixels should be mostly harmless.

Christopher Bishop formalized
this idea when he proved that training with input noise
is equivalent to Tikhonov regularization.
This work drew a clear mathematical connection
between the requirement that a function be smooth (and thus simple),
and the requirement that it be resilient
to perturbations in the input.

Then, :citet:`Srivastava.Hinton.Krizhevsky.ea.2014`
developed a clever idea for how to apply Bishop's idea
to the internal layers of a network, too.
Their idea, called *dropout*, involves
injecting noise while computing
each internal layer during forward propagation,
and it has become a standard technique
for training neural networks.
The method is called *dropout* because we literally
*drop out* some neurons during training.
Throughout training, on each iteration,
standard dropout consists of zeroing out
some fraction of the nodes in each layer
before calculating the subsequent layer.

To be clear, we are imposing
our own narrative with the link to Bishop.
The original paper on dropout
offers intuition through a surprising
analogy to sexual reproduction.
The authors argue that neural network overfitting
is characterized by a state in which
each layer relies on a specific
pattern of activations in the previous layer,
calling this condition *co-adaptation*.
Dropout, they claim, breaks up co-adaptation
just as sexual reproduction is argued to
break up co-adapted genes.
While the justification of this theory is certainly up for debate,
the dropout technique itself has proved enduring,
and various forms of dropout are implemented
in most deep learning libraries.


The key challenge is how to inject this noise.
One idea is to inject it in an *unbiased* manner
so that the expected value of each layer---while fixing
the others---equals the value it would have taken absent noise.
In Bishop's work, he added Gaussian noise
to the inputs to a linear model.
At each training iteration, he added noise
sampled from a distribution with mean zero
$\epsilon \sim \mathcal{N}(0,\sigma^2)$ to the input $\mathbf{x}$,
yielding a perturbed point $\mathbf{x}' = \mathbf{x} + \epsilon$.
In expectation, $E[\mathbf{x}'] = \mathbf{x}$.
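
As a quick numerical illustration of this unbiasedness, the following sketch averages many perturbed copies of an input and compares the result with the original point. The example vector, the noise level `sigma`, and the sample count are arbitrary choices made for illustration only.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is reproducible

x = torch.tensor([1.0, -2.0, 0.5])   # an arbitrary input point
sigma = 0.1                          # assumed noise standard deviation
num_samples = 100_000

# Draw epsilon ~ N(0, sigma^2) independently for each sample and coordinate
epsilon = sigma * torch.randn(num_samples, x.shape[0])
x_perturbed = x + epsilon            # broadcasting adds x to every row

# The sample mean of the perturbed points should be close to x
x_mean = x_perturbed.mean(dim=0)
print(x_mean)
```

The printed mean should agree with `x` up to sampling error, which shrinks as `num_samples` grows.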

In standard dropout regularization,
one zeros out some fraction of the nodes in each layer
and then *debiases* each layer by normalizing
by the fraction of nodes that were retained (not dropped out).
In other words,
with *dropout probability* $p$,
each intermediate activation $h$ is replaced by
a random variable $h'$ as follows:

$$
\begin{aligned}
h' =
\begin{cases}
    0 & \textrm{ with probability } p \\
    \frac{h}{1-p} & \textrm{ otherwise}
\end{cases}
\end{aligned}
$$

By design, the expectation remains unchanged, i.e., $E[h'] = h$.
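
To see why, note that $h'$ takes only two values, so for $p < 1$ the expectation can be written out directly:

$$
\begin{aligned}
E[h'] = p \cdot 0 + (1-p) \cdot \frac{h}{1-p} = h.
\end{aligned}
$$

The same two-value argument gives the variance, which does depend on $p$:

$$
\begin{aligned}
\textrm{Var}[h'] = E[h'^2] - E[h']^2 = (1-p) \cdot \frac{h^2}{(1-p)^2} - h^2 = \frac{p}{1-p} h^2.
\end{aligned}
$$

So the rescaling keeps the mean fixed, while a larger dropout probability injects more noise.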


In [1]:
import torch
from torch import nn

## Dropout in Practice

Recall the MLP with a hidden layer and five hidden units
from :numref:`fig_mlp`.
When we apply dropout to a hidden layer,
zeroing out each hidden unit with probability $p$,
the result can be viewed as a network
containing only a subset of the original neurons.
In :numref:`fig_dropout2`, $h_2$ and $h_5$ are removed.
Consequently, the calculation of the outputs
no longer depends on $h_2$ or $h_5$
and their respective gradients also vanish
when performing backpropagation.
In this way, the calculation of the output layer
cannot be overly dependent on any
one element of $h_1, \ldots, h_5$.

![MLP before and after dropout.](http://d2l.ai/_images/dropout2.svg)
:label:`fig_dropout2`
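
The vanishing gradients for dropped units can be checked with automatic differentiation. The sketch below uses a hand-picked mask that removes $h_2$ and $h_5$, as in the figure; the activation values and the weights `w`, which stand in for the rest of the network, are made up for illustration.

```python
import torch

# Five hidden activations, standing in for h_1, ..., h_5
h = torch.tensor([0.3, 1.2, 0.7, 2.0, 0.9], requires_grad=True)
p = 0.5
# A fixed mask that drops h_2 and h_5 (1 = keep, 0 = drop)
mask = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])

# Apply the mask and rescale the survivors by 1 / (1 - p)
h_dropped = mask * h / (1.0 - p)

# A stand-in for the output layer: a weighted sum of the activations
w = torch.tensor([0.5, 1.0, 2.0, 1.5, 0.5])
output = (w * h_dropped).sum()
output.backward()

# Gradient with respect to each h_i is w_i * mask_i / (1 - p)
print(h.grad)
```

The entries of `h.grad` for the dropped units are zero, so on this iteration no learning signal flows through them.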

Typically, we disable dropout at test time.
Given a trained model and a new example,
we do not drop out any nodes
and thus do not need to normalize.
However, there are some exceptions:
some researchers use dropout at test time as a heuristic
for estimating the *uncertainty* of neural network predictions:
if the predictions agree across many different dropout outputs,
then we might say that the network is more confident.
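
A minimal sketch of this heuristic is shown below. It keeps only the dropout layers in training mode at test time and summarizes the spread of the predicted probabilities over several random masks. The helper name `mc_dropout_predict`, the small network, and the number of samples are assumptions for illustration, not part of the original text.

```python
import torch
from torch import nn

torch.manual_seed(0)

# A small stand-in classifier; the layer sizes are arbitrary assumptions
net = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(64, 10))

def mc_dropout_predict(model, X, num_samples=50):
    """Hypothetical helper: summarize softmax outputs over several dropout masks."""
    model.eval()
    # Re-enable only the dropout layers, leaving other layers in eval mode
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(X), dim=1) for _ in range(num_samples)])
    model.eval()  # restore ordinary evaluation mode
    # Mean prediction and per-class spread across the sampled masks
    return probs.mean(dim=0), probs.std(dim=0)

X = torch.randn(4, 20)  # four made-up examples
mean_probs, std_probs = mc_dropout_predict(net, X)
print(mean_probs.shape, std_probs.shape)
```

A small standard deviation across masks would be read as the predictions agreeing, and hence as higher confidence, under this heuristic.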

## Implementation from Scratch

To implement the dropout function for a single layer,
we must draw as many samples
from a Bernoulli (binary) random variable
as our layer has dimensions,
where the random variable takes value $1$ (keep)
with probability $1-p$ and $0$ (drop) with probability $p$.
One easy way to implement this is to first draw samples
from the uniform distribution $U[0, 1]$.
Then we can keep those nodes for which the corresponding
sample is greater than $p$, dropping the rest.

In the following code, we (**implement a `dropout_layer` function
that drops out the elements in the tensor input `X`
with probability `dropout`**),
rescaling the remainder as described above:
dividing the survivors by `1.0-dropout`.


In [2]:
def dropout_layer(X, dropout):
    assert 0 <= dropout <= 1
    if dropout == 1: return torch.zeros_like(X)
    mask = (torch.rand(X.shape) > dropout).float()
    return mask * X / (1.0 - dropout)

We can [**test out the `dropout_layer` function on a few examples**].
In the following lines of code,
we pass our input `X` through the dropout operation,
with probabilities 0, 0.5, and 1, respectively.


In [3]:
X = torch.arange(16, dtype = torch.float32).reshape((2, 8))
print('dropout_p = 0:', dropout_layer(X, 0))
print('dropout_p = 0.5:', dropout_layer(X, 0.5))
print('dropout_p = 1:', dropout_layer(X, 1))

dropout_p = 0: tensor([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11., 12., 13., 14., 15.]])
dropout_p = 0.5: tensor([[ 0.,  0.,  0.,  0.,  0.,  0.,  0., 14.],
        [ 0.,  0., 20.,  0., 24., 26.,  0.,  0.]])
dropout_p = 1: tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])
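
The Bernoulli description above can also be sketched directly with `torch.bernoulli`, together with an empirical check that averaging many dropout draws recovers the input. The function name `dropout_bernoulli` and the number of trials are illustrative assumptions; this is a variant for comparison, not a replacement for `dropout_layer`.

```python
import torch

torch.manual_seed(0)

def dropout_bernoulli(X, dropout):
    """Sketch of a dropout variant that samples the keep mask with torch.bernoulli."""
    assert 0 <= dropout <= 1
    if dropout == 1:
        return torch.zeros_like(X)
    # Each mask entry is 1 (keep) with probability 1 - dropout, else 0 (drop)
    keep_prob = torch.full_like(X, 1.0 - dropout)
    mask = torch.bernoulli(keep_prob)
    return mask * X / (1.0 - dropout)

X = torch.arange(16, dtype=torch.float32).reshape((2, 8))

# Average many independent dropout draws; the result should approach X
num_trials = 20_000
average = torch.stack(
    [dropout_bernoulli(X, 0.5) for _ in range(num_trials)]).mean(dim=0)
max_error = (average - X).abs().max()
print(max_error)
```

The maximum deviation should be small relative to the entries of `X` and should shrink further as `num_trials` increases.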


### Defining the Model

The model below applies dropout to the output
of each hidden layer (following the activation function).
We can set dropout probabilities for each layer separately.
A common choice is to set
a lower dropout probability closer to the input layer.
We ensure that dropout is only active during training.


In [4]:
class DropoutMLPScratch(torch.nn.Module):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super(DropoutMLPScratch, self).__init__()
        self.lin1 = nn.LazyLinear(num_hiddens_1)
        self.lin2 = nn.LazyLinear(num_hiddens_2)
        self.lin3 = nn.LazyLinear(num_outputs)
        self.relu = nn.ReLU()
        self.dropout_1 = dropout_1
        self.dropout_2 = dropout_2


    def forward(self, X):
        H1 = self.relu(self.lin1(X.reshape((X.shape[0], -1))))
        if self.training:
            H1 = dropout_layer(H1, self.dropout_1)
        H2 = self.relu(self.lin2(H1))
        if self.training:
            H2 = dropout_layer(H2, self.dropout_2)
        return self.lin3(H2)

### Training

Write code to train the provided network on the FashionMNIST dataset for 10 epochs, following the MLP training procedure described earlier in the lectures.

https://pytorch.org/vision/0.19/generated/torchvision.datasets.FashionMNIST.html


In [6]:
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
           'dropout_1':0.5, 'dropout_2':0.5, 'lr':0.1}
model = DropoutMLPScratch(**hparams)

#write your training and testing code here
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1))  # Flatten the image to 784 features
])
# Load Fashion MNIST dataset
train_dataset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)

# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=hparams['lr'])

train_accuracies = []
test_accuracies = []
num_epochs = 10

for epoch in range(num_epochs):
    # Training phase
    model.train()
    correct_train = 0
    total_train = 0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
        _, predicted = torch.max(outputs, 1)
        total_train += y_batch.size(0)
        correct_train += (predicted == y_batch).sum().item()

    train_accuracy = 100 * correct_train / total_train
    train_accuracies.append(train_accuracy)

    # Testing phase
    model.eval()
    correct_test = 0
    total_test = 0
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            outputs = model(X_batch)
            _, predicted = torch.max(outputs, 1)
            total_test += y_batch.size(0)
            correct_test += (predicted == y_batch).sum().item()

    test_accuracy = 100 * correct_test / total_test
    test_accuracies.append(test_accuracy)

    print(f'Epoch {epoch + 1}, Train Accuracy: {train_accuracy:.2f}%, Test Accuracy: {test_accuracy:.2f}%')


Epoch 1, Train Accuracy: 73.53%, Test Accuracy: 81.60%
Epoch 2, Train Accuracy: 81.33%, Test Accuracy: 83.35%
Epoch 3, Train Accuracy: 82.97%, Test Accuracy: 81.58%
Epoch 4, Train Accuracy: 84.03%, Test Accuracy: 84.80%
Epoch 5, Train Accuracy: 84.44%, Test Accuracy: 85.49%
Epoch 6, Train Accuracy: 84.91%, Test Accuracy: 85.75%
Epoch 7, Train Accuracy: 85.30%, Test Accuracy: 85.72%
Epoch 8, Train Accuracy: 85.79%, Test Accuracy: 86.44%
Epoch 9, Train Accuracy: 85.88%, Test Accuracy: 86.52%
Epoch 10, Train Accuracy: 86.23%, Test Accuracy: 85.81%


**I also implemented the MLP without dropout, to compare the two and understand the effect of the dropout technique.**

## Building an MLP without Dropout

In [9]:
import torch
import torch.nn as nn

class MLPScratch(torch.nn.Module):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,lr):
        super(MLPScratch, self).__init__()
        self.lin1 = nn.LazyLinear(num_hiddens_1)
        self.lin2 = nn.LazyLinear(num_hiddens_2)
        self.lin3 = nn.LazyLinear(num_outputs)
        self.relu = nn.ReLU()

    def forward(self, X):
        H1 = self.relu(self.lin1(X.reshape((X.shape[0], -1))))
        H2 = self.relu(self.lin2(H1))
        return self.lin3(H2)



In [11]:
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
            'lr':0.1}
model_1 = MLPScratch(**hparams)

#write your training and testing code here
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1))  # Flatten the image to 784 features
])
# Load Fashion MNIST dataset
train_dataset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)

# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_1.parameters(), lr=hparams['lr'])

train_accuracies = []
test_accuracies = []
num_epochs = 10

for epoch in range(num_epochs):
    # Training phase
    model_1.train()
    correct_train = 0
    total_train = 0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model_1(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
        _, predicted = torch.max(outputs, 1)
        total_train += y_batch.size(0)
        correct_train += (predicted == y_batch).sum().item()

    train_accuracy = 100 * correct_train / total_train
    train_accuracies.append(train_accuracy)

    # Testing phase
    model_1.eval()
    correct_test = 0
    total_test = 0
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            outputs = model_1(X_batch)
            _, predicted = torch.max(outputs, 1)
            total_test += y_batch.size(0)
            correct_test += (predicted == y_batch).sum().item()

    test_accuracy = 100 * correct_test / total_test
    test_accuracies.append(test_accuracy)

    print(f'Epoch {epoch + 1}, Train Accuracy: {train_accuracy:.2f}%, Test Accuracy: {test_accuracy:.2f}%')


Epoch 1, Train Accuracy: 79.08%, Test Accuracy: 80.77%
Epoch 2, Train Accuracy: 85.32%, Test Accuracy: 85.23%
Epoch 3, Train Accuracy: 86.74%, Test Accuracy: 85.92%
Epoch 4, Train Accuracy: 87.75%, Test Accuracy: 86.63%
Epoch 5, Train Accuracy: 88.45%, Test Accuracy: 86.61%
Epoch 6, Train Accuracy: 88.99%, Test Accuracy: 86.36%
Epoch 7, Train Accuracy: 89.63%, Test Accuracy: 87.42%
Epoch 8, Train Accuracy: 89.90%, Test Accuracy: 87.81%
Epoch 9, Train Accuracy: 90.32%, Test Accuracy: 88.89%
Epoch 10, Train Accuracy: 90.64%, Test Accuracy: 88.65%


Comparing the two runs shows clear differences. After the first epoch, the model with dropout had a slightly higher test accuracy (81.60% versus 80.77%), but by epoch 10 the model without dropout was ahead on both measures: 90.64% training accuracy and 88.65% test accuracy (peaking at 88.89% in epoch 9), compared with 86.23% and 85.81% for the dropout model (peaking at 86.52% in epoch 9). The model without dropout shows early signs of overfitting: its training accuracy keeps rising while its test accuracy levels off, leaving a gap of about two percentage points. The dropout model shows almost no such gap, partly because its training accuracy is measured with dropout active.

In this 10-epoch run, then, dropout narrowed the gap between training and test accuracy but did not raise test accuracy itself. The dropout model fits the training data more slowly, so any benefit may only appear with longer training or a larger model. These results come from a single run of each model and should be read with that in mind.





## Higher Level Implementation

With high-level APIs, all we need to do is add a `Dropout` layer
after each fully connected layer,
passing in the dropout probability
as the only argument to its constructor.
During training, the `Dropout` layer will randomly
drop out outputs of the previous layer
(or equivalently, the inputs to the subsequent layer)
according to the specified dropout probability.
When not in training mode,
the `Dropout` layer simply passes the data through unchanged.
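
The difference between the two modes can be seen on a single `nn.Dropout` layer. This is a small sketch; the input of all ones and the probability 0.5 are arbitrary choices that make the rescaling easy to read.

```python
import torch
from torch import nn

torch.manual_seed(0)

dropout = nn.Dropout(0.5)
X = torch.ones(2, 8)

dropout.train()        # training mode: entries are zeroed or rescaled by 1 / (1 - p)
out_train = dropout(X)

dropout.eval()         # evaluation mode: the layer acts as the identity
out_eval = dropout(X)

print(out_train)
print(out_eval)
```

In training mode every entry is either 0 or 2 (that is, 1 divided by 0.5), while in evaluation mode the output equals the input.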


In [12]:
class DropoutMLP(torch.nn.Module):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super(DropoutMLP, self).__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(num_hiddens_1), nn.ReLU(),
            nn.Dropout(dropout_1), nn.LazyLinear(num_hiddens_2), nn.ReLU(),
            nn.Dropout(dropout_2), nn.LazyLinear(num_outputs))

    def forward(self, X):
        return self.net(X)






Next, write your code to train the given model on the FashionMNIST Dataset for 10 epochs.


In [14]:
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
           'dropout_1':0.5, 'dropout_2':0.5, 'lr':0.1}
model = DropoutMLP(**hparams)

#write your training and testing code here
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1))  # Flatten the image to 784 features
])
# Load Fashion MNIST dataset
train_dataset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)

# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=hparams['lr'])

train_accuracies = []
test_accuracies = []
num_epochs = 10

for epoch in range(num_epochs):
    # Training phase
    model.train()
    correct_train = 0
    total_train = 0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
        _, predicted = torch.max(outputs, 1)
        total_train += y_batch.size(0)
        correct_train += (predicted == y_batch).sum().item()

    train_accuracy = 100 * correct_train / total_train
    train_accuracies.append(train_accuracy)

    # Testing phase
    model.eval()
    correct_test = 0
    total_test = 0
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            outputs = model(X_batch)
            _, predicted = torch.max(outputs, 1)
            total_test += y_batch.size(0)
            correct_test += (predicted == y_batch).sum().item()

    test_accuracy = 100 * correct_test / total_test
    test_accuracies.append(test_accuracy)

    print(f'Epoch {epoch + 1}, Train Accuracy: {train_accuracy:.2f}%, Test Accuracy: {test_accuracy:.2f}%')


Epoch 1, Train Accuracy: 73.73%, Test Accuracy: 79.52%
Epoch 2, Train Accuracy: 81.36%, Test Accuracy: 83.61%
Epoch 3, Train Accuracy: 82.94%, Test Accuracy: 84.86%
Epoch 4, Train Accuracy: 83.91%, Test Accuracy: 85.54%
Epoch 5, Train Accuracy: 84.60%, Test Accuracy: 85.42%
Epoch 6, Train Accuracy: 85.04%, Test Accuracy: 86.48%
Epoch 7, Train Accuracy: 85.34%, Test Accuracy: 85.87%
Epoch 8, Train Accuracy: 85.49%, Test Accuracy: 86.40%
Epoch 9, Train Accuracy: 85.90%, Test Accuracy: 86.89%
Epoch 10, Train Accuracy: 86.28%, Test Accuracy: 86.52%


## Model without dropout

In [17]:
import torch
import torch.nn as nn

class MLP(torch.nn.Module):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,lr):
        super(MLP, self).__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(num_hiddens_1),
            nn.ReLU(),
            nn.LazyLinear(num_hiddens_2),
            nn.ReLU(),
            nn.LazyLinear(num_outputs)
        )

    def forward(self, X):
        return self.net(X)


In [19]:
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
            'lr':0.1}
model_1 = MLP(**hparams)

#write your training and testing code here
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1))  # Flatten the image to 784 features
])
# Load Fashion MNIST dataset
train_dataset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)

# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_1.parameters(), lr=hparams['lr'])

train_accuracies = []
test_accuracies = []
num_epochs = 10

for epoch in range(num_epochs):
    # Training phase
    model_1.train()
    correct_train = 0
    total_train = 0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model_1(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()
        _, predicted = torch.max(outputs, 1)
        total_train += y_batch.size(0)
        correct_train += (predicted == y_batch).sum().item()

    train_accuracy = 100 * correct_train / total_train
    train_accuracies.append(train_accuracy)

    # Testing phase
    model_1.eval()
    correct_test = 0
    total_test = 0
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            outputs = model_1(X_batch)
            _, predicted = torch.max(outputs, 1)
            total_test += y_batch.size(0)
            correct_test += (predicted == y_batch).sum().item()

    test_accuracy = 100 * correct_test / total_test
    test_accuracies.append(test_accuracy)

    print(f'Epoch {epoch + 1}, Train Accuracy: {train_accuracy:.2f}%, Test Accuracy: {test_accuracy:.2f}%')


Epoch 1, Train Accuracy: 78.89%, Test Accuracy: 84.30%
Epoch 2, Train Accuracy: 85.39%, Test Accuracy: 84.60%
Epoch 3, Train Accuracy: 86.81%, Test Accuracy: 86.42%
Epoch 4, Train Accuracy: 87.62%, Test Accuracy: 87.05%
Epoch 5, Train Accuracy: 88.45%, Test Accuracy: 86.71%
Epoch 6, Train Accuracy: 88.98%, Test Accuracy: 87.27%
Epoch 7, Train Accuracy: 89.48%, Test Accuracy: 87.17%
Epoch 8, Train Accuracy: 89.92%, Test Accuracy: 87.25%
Epoch 9, Train Accuracy: 90.33%, Test Accuracy: 87.68%
Epoch 10, Train Accuracy: 90.57%, Test Accuracy: 88.17%


The comparison between the models with and without dropout again shows notable differences. The model with dropout started at 73.73% training accuracy and 79.52% test accuracy, while the model without dropout started at 78.89% and 84.30%. Over ten epochs, the dropout model improved steadily to a final training accuracy of 86.28% and a test accuracy of 86.52%. The no-dropout model reached a higher training accuracy of 90.57% and a higher test accuracy of 88.17%, but with a wider gap between the two (about 2.4 percentage points).



## Summary

Beyond controlling the number of dimensions and the size of the weight vector, dropout is yet another tool for avoiding overfitting. Often tools are used jointly.
Note that dropout is
used only during training:
it replaces an activation $h$ with a random variable with expected value $h$.


## Exercises

1. What happens if you change the dropout probabilities for the first and second layers? In particular, what happens if you switch the ones for both layers? Design an experiment to answer these questions, describe your results quantitatively, and summarize the qualitative takeaways.
1. Increase the number of epochs to 50 and compare the results.
1. Why is dropout not typically used at test time?



## Exercise 1:
What happens if you change the dropout probabilities for the first and second layers? In particular, what happens if you switch the ones for both layers? Design an experiment to answer these questions, describe your results quantitatively, and summarize the qualitative takeaways.


In [None]:
# Here, I used the model DropoutMLP
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
           'dropout_1':0.3, 'dropout_2':0.7, 'lr':0.1}
model = DropoutMLP(**hparams)

#write your training and testing code here
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1))  # Flatten the image to 784 features
])

# Load Fashion MNIST dataset
train_dataset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)

# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=hparams['lr'])

# Training Loop
num_epochs = 10
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    total_loss = 0.0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()  # Clear gradients
        outputs = model(X_batch)  # Forward pass
        loss = criterion(outputs, y_batch)  # Compute loss
        loss.backward()  # Backpropagation
        optimizer.step()  # Update weights
        total_loss += loss.item()

    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {total_loss/len(train_loader):.4f}')

# Testing Loop
model.eval()  # Set the model to evaluation mode
correct = 0
total = 0

with torch.no_grad():  # No gradient calculation
    for X_batch, y_batch in test_loader:
        outputs = model(X_batch)  # Forward pass
        _, predicted = torch.max(outputs, 1)  # Get the predicted class
        total += y_batch.size(0)  # Number of samples
        correct += (predicted == y_batch).sum().item()  # Count correct predictions

print(f'Accuracy: {100 * correct / total:.2f}%')


Epoch [1/10], Loss: 0.7233
Epoch [2/10], Loss: 0.5043
Epoch [3/10], Loss: 0.4577
Epoch [4/10], Loss: 0.4329
Epoch [5/10], Loss: 0.4125
Epoch [6/10], Loss: 0.3963
Epoch [7/10], Loss: 0.3855
Epoch [8/10], Loss: 0.3751
Epoch [9/10], Loss: 0.3661
Epoch [10/10], Loss: 0.3607
Accuracy: 86.67%


In [None]:
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
           'dropout_1':0.7, 'dropout_2':0.3, 'lr':0.1}
model = DropoutMLP(**hparams)

#write your training and testing code here
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1))  # Flatten the image to 784 features
])

# Load Fashion MNIST dataset
train_dataset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)

# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=hparams['lr'])

# Training Loop
num_epochs = 10
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    total_loss = 0.0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()  # Clear gradients
        outputs = model(X_batch)  # Forward pass
        loss = criterion(outputs, y_batch)  # Compute loss
        loss.backward()  # Backpropagation
        optimizer.step()  # Update weights
        total_loss += loss.item()

    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {total_loss/len(train_loader):.4f}')

# Testing Loop
model.eval()  # Set the model to evaluation mode
correct = 0
total = 0

with torch.no_grad():  # No gradient calculation
    for X_batch, y_batch in test_loader:
        outputs = model(X_batch)  # Forward pass
        _, predicted = torch.max(outputs, 1)  # Get the predicted class
        total += y_batch.size(0)  # Number of samples
        correct += (predicted == y_batch).sum().item()  # Count correct predictions

print(f'Accuracy: {100 * correct / total:.2f}%')


Epoch [1/10], Loss: 0.7855
Epoch [2/10], Loss: 0.5755
Epoch [3/10], Loss: 0.5273
Epoch [4/10], Loss: 0.5045
Epoch [5/10], Loss: 0.4804
Epoch [6/10], Loss: 0.4702
Epoch [7/10], Loss: 0.4611
Epoch [8/10], Loss: 0.4455
Epoch [9/10], Loss: 0.4452
Epoch [10/10], Loss: 0.4358
Accuracy: 85.92%


In my experiment, I changed the dropout rates for the first and second layers, setting the first one to 0.3 and the second one to 0.7. With these settings, the model achieved a final accuracy of 86.67%, and the loss decreased consistently over ten epochs, starting from 0.7233 and dropping to 0.3607. This setup allowed the model to learn effectively, suggesting that a moderate dropout in the first layer helps keep important features while also promoting better generalization.

When I switched the rates, setting the first layer to 0.7 and the second to 0.3, the model's accuracy dropped slightly to 85.92%. The training loss started higher at 0.7855 and went down to 0.4358, indicating that learning was less effective this time. The higher dropout in the first layer likely removed too much important information early on, which hurt the model's overall performance.

**Summary of Results**:

- First setup ($p_1 = 0.3$, $p_2 = 0.7$): final accuracy 86.67%; loss went from 0.7233 to 0.3607.
- Switched setup ($p_1 = 0.7$, $p_2 = 0.3$): final accuracy 85.92%; loss went from 0.7855 to 0.4358.

**Inference**:

- Dropout strategy: a lower dropout rate in the first layer worked slightly better here, consistent with the common choice of a lower dropout probability closer to the input layer. A moderate rate there keeps important features while still helping to prevent overfitting.
- Impact of high dropout: a high dropout rate in the early layers can hurt learning by removing too much information, leading to lower performance.
- Generalization vs. learning: it is important to find dropout rates that balance learning and generalization. In this experiment the difference in accuracy was 0.75 percentage points from a single run of each setup, so the effect is modest.
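
Because a gap of under one percentage point from a single run may be within run-to-run variation, one way to strengthen such an experiment is to repeat each configuration over several random seeds. The sketch below does this on small synthetic tensors, used here as a fast stand-in for FashionMNIST. The helper names `make_mlp` and `run_once`, the layer sizes, the epoch count, and the seed list are all assumptions for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def make_mlp(num_inputs, num_hiddens, num_outputs, dropout_1, dropout_2):
    """Hypothetical helper: a two-hidden-layer MLP with dropout after each ReLU."""
    return nn.Sequential(
        nn.Linear(num_inputs, num_hiddens), nn.ReLU(), nn.Dropout(dropout_1),
        nn.Linear(num_hiddens, num_hiddens), nn.ReLU(), nn.Dropout(dropout_2),
        nn.Linear(num_hiddens, num_outputs))

def run_once(dropout_1, dropout_2, train_data, test_data, seed,
             num_epochs=3, lr=0.1):
    """Train one model with the given dropout rates; return test accuracy (%)."""
    torch.manual_seed(seed)  # controls weight init, shuffling, and dropout masks
    num_inputs = train_data.tensors[0].shape[1]
    model = make_mlp(num_inputs, 32, 3, dropout_1, dropout_2)
    train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
    test_loader = DataLoader(test_data, batch_size=32)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        model.train()
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            criterion(model(X_batch), y_batch).backward()
            optimizer.step()
    model.eval()  # dropout is disabled for evaluation
    correct = total = 0
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            predicted = model(X_batch).argmax(dim=1)
            correct += (predicted == y_batch).sum().item()
            total += y_batch.size(0)
    return 100 * correct / total

# Synthetic stand-in data: 20 features, 3 classes from a random linear rule
torch.manual_seed(0)
X = torch.randn(600, 20)
y = (X @ torch.randn(20, 3)).argmax(dim=1)
train_data = TensorDataset(X[:500], y[:500])
test_data = TensorDataset(X[500:], y[500:])

# Repeat each dropout configuration over several seeds and report mean and spread
results = {}
for p1, p2 in [(0.3, 0.7), (0.7, 0.3)]:
    accs = torch.tensor([run_once(p1, p2, train_data, test_data, seed)
                         for seed in range(3)])
    results[(p1, p2)] = (accs.mean().item(), accs.std().item())
    mean_acc, std_acc = results[(p1, p2)]
    print(f'p1={p1}, p2={p2}: {mean_acc:.2f}% +/- {std_acc:.2f}')
```

With the real dataset, a difference between configurations that is small compared with the spread across seeds would not support a firm conclusion about which ordering of dropout rates is better.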




## Exercise 2:
Increase the number of epochs to 50 and compare the results.

In [None]:
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
           'dropout_1':0.5, 'dropout_2':0.5, 'lr':0.1}
model = DropoutMLP(**hparams)

#write your training and testing code here
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Data transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1))  # Flatten the image to 784 features
])

# Load Fashion MNIST dataset
train_dataset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)

# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32)

# Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=hparams['lr'])

# Training Loop
num_epochs = 50
for epoch in range(num_epochs):
    model.train()  # Set the model to training mode
    total_loss = 0.0
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()  # Clear gradients
        outputs = model(X_batch)  # Forward pass
        loss = criterion(outputs, y_batch)  # Compute loss
        loss.backward()  # Backpropagation
        optimizer.step()  # Update weights
        total_loss += loss.item()

    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {total_loss/len(train_loader):.4f}')

# Testing Loop
model.eval()  # Set the model to evaluation mode
correct = 0
total = 0

with torch.no_grad():  # No gradient calculation
    for X_batch, y_batch in test_loader:
        outputs = model(X_batch)  # Forward pass
        _, predicted = torch.max(outputs, 1)  # Get the predicted class
        total += y_batch.size(0)  # Number of samples
        correct += (predicted == y_batch).sum().item()  # Count correct predictions

print(f'Accuracy: {100 * correct / total:.2f}%')


Epoch [1/50], Loss: 0.7075
Epoch [2/50], Loss: 0.5118
Epoch [3/50], Loss: 0.4716
Epoch [4/50], Loss: 0.4452
Epoch [5/50], Loss: 0.4263
Epoch [6/50], Loss: 0.4118
Epoch [7/50], Loss: 0.4035
Epoch [8/50], Loss: 0.3911
Epoch [9/50], Loss: 0.3819
Epoch [10/50], Loss: 0.3768
Epoch [11/50], Loss: 0.3700
Epoch [12/50], Loss: 0.3619
Epoch [13/50], Loss: 0.3604
Epoch [14/50], Loss: 0.3531
Epoch [15/50], Loss: 0.3529
Epoch [16/50], Loss: 0.3470
Epoch [17/50], Loss: 0.3451
Epoch [18/50], Loss: 0.3395
Epoch [19/50], Loss: 0.3401
Epoch [20/50], Loss: 0.3350
Epoch [21/50], Loss: 0.3293
Epoch [22/50], Loss: 0.3306
Epoch [23/50], Loss: 0.3266
Epoch [24/50], Loss: 0.3249
Epoch [25/50], Loss: 0.3207
Epoch [26/50], Loss: 0.3156
Epoch [27/50], Loss: 0.3167
Epoch [28/50], Loss: 0.3153
Epoch [29/50], Loss: 0.3113
Epoch [30/50], Loss: 0.3114
Epoch [31/50], Loss: 0.3084
Epoch [32/50], Loss: 0.3070
Epoch [33/50], Loss: 0.3052
Epoch [34/50], Loss: 0.3035
Epoch [35/50], Loss: 0.3024
Epoch [36/50], Loss: 0.3063
E

After increasing the number of epochs to 50 with dropout rates set to $p_1 = 0.5$ and $p_2 = 0.5$, the model showed steady improvement in performance. The loss decreased consistently from 0.7075 in the first epoch to a final value of 0.2861 by the end of the training. This downward trend indicates that the model was able to learn effectively over the epochs, gradually minimizing errors during training.

The final test accuracy was 88.64%. The steady decrease in training loss shows that the model kept fitting the training data despite the noise injected by dropout, and the test accuracy indicates that it also generalizes reasonably well to unseen data. This run has no baseline without dropout, so the accuracy cannot be attributed to dropout alone, but the result is consistent with balanced dropout rates limiting overfitting without sacrificing too much learning capacity.





## Exercise 3:
Why is dropout not typically used at test time?


To answer this, we apply dropout during the testing phase and observe how it affects model performance.


import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Define an MLP whose dropout can be switched on explicitly, even at test time
class DropoutMLPScratch_Test(nn.Module):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2):
        super().__init__()
        self.lin1 = nn.LazyLinear(num_hiddens_1)
        self.lin2 = nn.LazyLinear(num_hiddens_2)
        self.lin3 = nn.LazyLinear(num_outputs)
        self.relu = nn.ReLU()
        # These modules are kept only to store the dropout probabilities
        self.dropout_1 = nn.Dropout(dropout_1)
        self.dropout_2 = nn.Dropout(dropout_2)

    def forward(self, X, apply_dropout=False):
        H1 = self.relu(self.lin1(X.reshape((X.shape[0], -1))))

        if apply_dropout:
            # The functional form with training=True keeps dropout active even
            # after model.eval(), which would turn nn.Dropout into a no-op
            H1 = nn.functional.dropout(H1, p=self.dropout_1.p, training=True)
        # No rescaling is needed when dropout is off: PyTorch uses inverted
        # dropout, which already scales kept activations by 1 / (1 - p)

        H2 = self.relu(self.lin2(H1))

        if apply_dropout:
            H2 = nn.functional.dropout(H2, p=self.dropout_2.p, training=True)

        return self.lin3(H2)

# Function to train the model
def train_model(model1, train_loader, criterion, optimizer, epochs):
    model1.train()  # Set to training mode
    for epoch in range(epochs):
        for data, targets in train_loader:
            optimizer.zero_grad()  # Clear previous gradients
            outputs = model1(data, apply_dropout=True)  # Apply dropout during training
            loss = criterion(outputs, targets)
            loss.backward()  # Backpropagation
            optimizer.step()  # Update weights
        # Report the loss of the last mini-batch only (a noisy estimate)
        print(f'Epoch {epoch + 1}/{epochs}, Loss: {loss.item()}')

# Function to evaluate the model
def evaluate_model(model1, test_loader, num_samples):
    model1.eval()  # Set to evaluation mode
    outputs = []
    with torch.no_grad():
        for data, _ in test_loader:
            batch_outputs = []
            for _ in range(num_samples):  # Multiple forward passes
                output = model1(data, apply_dropout=True)  # Apply dropout during testing
                batch_outputs.append(output)
            avg_output = torch.mean(torch.stack(batch_outputs), dim=0)
            outputs.append(avg_output)
    return torch.cat(outputs)

# Define transformations for the dataset (without normalization)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x.view(-1))  # Flatten the image to 784 features
])

# Load the Fashion MNIST dataset
train_dataset = datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)

# Create DataLoader
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Model parameters
num_hiddens_1 = 256
num_hiddens_2 = 256
dropout_1 = 0.5
dropout_2 = 0.5
epochs = 10
learning_rate = 0.1

# Initialize model, criterion, and optimizer
model1 = DropoutMLPScratch_Test(num_outputs=10, num_hiddens_1=num_hiddens_1,
                           num_hiddens_2=num_hiddens_2, dropout_1=dropout_1, dropout_2=dropout_2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model1.parameters(), lr=learning_rate)

# Train the model
train_model(model1, train_loader, criterion, optimizer, epochs)

# Evaluate the model
num_test_samples = 100  # Number of forward passes for testing
test_outputs = evaluate_model(model1, test_loader, num_test_samples)

# Compute accuracy
_, predicted = torch.max(test_outputs, 1)
targets = torch.cat([y for _, y in test_loader])  # Get true labels from test_loader
accuracy = (predicted == targets).float().mean().item()
print(f'Accuracy: {accuracy:.4f}')


Epoch 1/10, Loss: 0.890283465385437
Epoch 2/10, Loss: 0.287642240524292
Epoch 3/10, Loss: 0.3459492325782776
Epoch 4/10, Loss: 0.5624974966049194
Epoch 5/10, Loss: 0.33294424414634705
Epoch 6/10, Loss: 0.46092286705970764
Epoch 7/10, Loss: 0.36244407296180725
Epoch 8/10, Loss: 0.32571399211883545
Epoch 9/10, Loss: 0.5869616270065308
Epoch 10/10, Loss: 0.5111770033836365
Accuracy: 0.8656


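The losses printed above are training losses from the last mini-batch of each epoch, so their fluctuation reflects mini-batch noise and the dropout applied during training rather than test-time behavior. The effect of dropout at test time shows up in the predictions themselves: when dropout is active, different neurons are randomly deactivated on each forward pass, so the same input can yield different outputs. Averaging over many passes (100 here) reduces this variability but multiplies the cost of inference. The resulting accuracy of 86.56% is not directly comparable with the 88.64% reported above, because this run trains for only 10 epochs.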
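
The variability across forward passes can be illustrated with a minimal sketch. It assumes a hypothetical toy network (`net`) with random weights and random inputs (`X`), not the trained model above, and it only compares how much the outputs spread over repeated passes with dropout active versus disabled.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # Fixed seed so the sketch is reproducible

# Hypothetical toy network, unrelated to the trained model above
net = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(64, 10))
X = torch.randn(4, 20)  # Random inputs stand in for a mini-batch

with torch.no_grad():
    net.train()  # Dropout active: each pass samples a fresh mask
    stochastic = torch.stack([net(X) for _ in range(100)])
    net.eval()   # Dropout disabled: the layer acts as the identity
    deterministic = torch.stack([net(X) for _ in range(100)])

# Standard deviation across the 100 passes, averaged over all outputs
spread_train = stochastic.std(dim=0).mean().item()
spread_eval = deterministic.std(dim=0).mean().item()
print(f'spread with dropout: {spread_train:.4f}')
print(f'spread without dropout: {spread_eval:.4f}')
```

With dropout active the outputs differ from pass to pass, whereas in evaluation mode every pass returns the same tensor.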

At test time we typically want the model to use all of its learned parameters, so that predictions are deterministic and the evaluation is stable and reproducible. Dropout is therefore normally disabled during testing: keeping it on injects noise into the predictions and makes the results harder to interpret. One exception is the use of repeated stochastic passes as a heuristic estimate of predictive uncertainty.
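
A short derivation shows why the full network can be used at test time without any correction. It assumes the inverted-dropout convention, which `nn.Dropout` follows: during training, each activation $h$ is replaced by a random variable $h'$ with

$$
h' =
\begin{cases}
0 & \textrm{with probability } p \\
\dfrac{h}{1-p} & \textrm{otherwise}
\end{cases}
$$

Taking the expectation over the random mask gives

$$
E[h'] = p \cdot 0 + (1-p) \cdot \frac{h}{1-p} = h.
$$

Under this assumption, the activations seen during training already match the dropout-free activations in expectation, so switching dropout off at test time removes the sampling noise without shifting the scale of the activations.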



