# Dropout

Idea: At training time,
- For every layer, sample a binary mask over its neurons in which each entry is selected independently with probability $p$.
- Set the activations of all selected neurons to zero.

![Dropout training time](Dropout.png)

Implementation:

- For every neuron, draw an independent Bernoulli random variable $r$ with parameter $p$ ($r = 1$ means the neuron is dropped).
- For every neuron activation $y$ compute $\tilde{y} = (1-r) \cdot y$ to be the activation used subsequently.

![Dropout training time detail](Dropout-detail.png)
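
As a quick check of what the mask does on average, here is a short derivation. It assumes only what is stated above: $r \sim \mathrm{Bernoulli}(p)$ is drawn independently of the activation $y$.

$$
\mathbb{E}[\tilde{y} \mid y] = \mathbb{E}[1-r] \cdot y = (1-p) \cdot y,
\qquad
\operatorname{Var}[\tilde{y} \mid y] = \operatorname{Var}[r] \cdot y^2 = p(1-p) \cdot y^2 .
$$

On average, a thinned activation is therefore smaller than the full one by a factor of $1-p$. This is the gap that has to be compensated when the full network is used.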

Interpretation:
- Instead of training one full network, we train an exponential number of thinned networks all at once.
- Each thinned network must work well, hence reducing dependency on any single latent representation.
    - Co-adaptation is avoided, independent features are encouraged.
    - Overfitting is reduced: we effectively train an ensemble of exponentially many thinned networks that share their weights.
    - Redundant representations are encouraged, potentially improving generalization.
- At test time we use the full network, which approximates an average over all thinned ones.

At test time:
- Use the original unthinned network.
- To account for the difference between the thinned and the full network, multiply the weights of each layer by $1-p$.
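
The weight scaling can be checked by brute force on a tiny example. The sketch below enumerates all $2^n$ thinned versions of a single linear unit and compares their probability-weighted average with the full unit whose weights are multiplied by $1-p$. The weights `w` and activations `y` are made-up illustration values, not taken from the notebook.

```python
from itertools import product

p = 0.25                 # drop probability
w = [0.5, -1.0, 2.0]     # hypothetical weights of one linear output unit
y = [1.0, 3.0, -2.0]     # hypothetical activations feeding into that unit

# Enumerate all 2^n thinned networks; keep[i] == 1 means neuron i is kept.
expected = 0.0
n_networks = 0
for keep in product([0, 1], repeat=len(y)):
    # Probability of this mask: (1 - p) per kept neuron, p per dropped neuron
    prob = 1.0
    for k in keep:
        prob *= (1 - p) if k else p
    output = sum(wi * ki * yi for wi, ki, yi in zip(w, keep, y))
    expected += prob * output
    n_networks += 1

# Full (unthinned) unit with weights scaled by (1 - p)
scaled = sum((1 - p) * wi * yi for wi, yi in zip(w, y))

print(n_networks, expected, scaled)
```

For a single linear unit the two numbers agree exactly, up to floating-point error. Once nonlinearities sit between layers, the scaled full network is only an approximation of the ensemble average.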

In [1]:
import torch
import numpy as np

from torchvision import datasets
import torchvision.transforms as transforms
from torch.utils.data.sampler import SubsetRandomSampler

import torch.nn as nn
import torch.nn.functional as F

import torch.optim as optim

import matplotlib.pyplot as plt
%matplotlib inline


In [2]:
# Use CUDA if available, then MPS, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')
print(device)

cuda


In [28]:
import torch
import torchvision
import torchvision.transforms as transforms

# Define the transformations for the dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Download and load the training data
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True, num_workers=2)

# Download and load the test data
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=32, shuffle=False, num_workers=2)

classes = ["airplane", "automobile", "bird", "cat", "deer", "dog", "frog",
          "horse", "ship", "truck"]

# Example usage
for images, labels in trainloader:
    print(images.shape, labels.shape)
    break

Files already downloaded and verified
Files already downloaded and verified
torch.Size([32, 3, 32, 32]) torch.Size([32])


In [None]:
class Dropout(nn.Module):
    def __init__(self, p):
        super().__init__()
        self.p = p
        
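    def get_mask(self, x):
        # p is the drop probability, so each entry is kept with probability 1 - p.
        # rand_like creates the random tensor with the shape and device of x.
        mask = torch.rand_like(x) >= self.p
        return mask
        
    def forward(self, x):
        if self.training:
            mask = self.get_mask(x)
            x = x * mask
        else:
            # Nothing is dropped at test time; scale by the keep probability instead
            x = x * (1-self.p)
        return x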

In [None]:
class Dropout(nn.Module):
    def __init__(self, p):
        super().__init__()
        self.p = p
        
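    def get_mask(self, x):
        # Keep each entry with probability 1 - p; the mask is created on the device of x
        mask = torch.rand_like(x) <= (1-self.p)
        return mask
        
    def forward(self, x):
        if self.training:
            mask = self.get_mask(x)
            # Inverted dropout: rescale during training, so no scaling is needed at test time
            x = 1/(1-self.p) * x * mask
        return x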

In [None]:
# Let us check dropout

d = Dropout(0.25)
x = torch.rand(10, 10).to(device)
print(x)

In [None]:
print(d.forward(x))
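
A slightly more systematic check than printing the tensor is sketched below. It re-declares the rescaling variant under the hypothetical name `InvertedDropout` so that the snippet runs on its own, and then verifies three things: roughly a fraction $p$ of the entries is zeroed in training mode, the mean activation is roughly preserved, and the layer is the identity in evaluation mode. The last lines compare against `nn.Dropout`, which also rescales the kept entries by $1/(1-p)$ during training.

```python
import torch
import torch.nn as nn

class InvertedDropout(nn.Module):
    """Same idea as the second Dropout class above; the name is only for this sketch."""
    def __init__(self, p):
        super().__init__()
        self.p = p

    def forward(self, x):
        if self.training:
            # keep each entry with probability 1 - p, then rescale the survivors
            keep = (torch.rand_like(x) >= self.p).to(x.dtype)
            x = x * keep / (1 - self.p)
        return x

torch.manual_seed(0)          # fixed seed only to make the sketch reproducible
p = 0.25
x = torch.ones(100, 100)      # all-ones input makes the statistics easy to read

layer = InvertedDropout(p)
layer.train()
out_train = layer(x)
zero_fraction = (out_train == 0).float().mean().item()  # should be close to p
train_mean = out_train.mean().item()                     # should be close to 1

layer.eval()
out_eval = layer(x)                                      # identity in eval mode

# Reference: kept entries of nn.Dropout are scaled by 1 / (1 - p) in training mode
ref = nn.Dropout(p)
ref.train()
ref_out = ref(x)
nonzero_vals = ref_out[ref_out != 0]

print(zero_fraction, train_mean, torch.equal(out_eval, x), nonzero_vals.unique())
```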

In [None]:
class CNN(nn.Module):
    def __init__(self, p=0.25):
        super(CNN, self).__init__()
        
        # Convolutional layers
        # Arguments: in_channels, out_channels, kernel_size, padding
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.conv3 = nn.Conv2d(32, 64, 3, padding=1)
        
        # Pooling layers
        self.pool = nn.MaxPool2d(2,2)
        
        # FC layers
        # Linear layer (64x4x4 -> 500)
        self.fc1 = nn.Linear(64 * 4 * 4, 500)
        
        # Linear Layer (500 -> 10)
        self.fc2 = nn.Linear(500, 10)
        
        # Dropout layer
        self.dropout = nn.Dropout(p)
        
    def forward(self, x):
        x = self.pool(F.elu(self.dropout(self.conv1(x))))
        x = self.pool(F.elu(self.dropout(self.conv2(x))))
        x = self.pool(F.elu(self.dropout(self.conv3(x))))
        
        # Flatten the image
        x = x.view(-1, 64*4*4)
        #x = self.dropout(x)
        x = F.elu(self.fc1(x))
        #x = self.dropout(x)
        x = self.fc2(x)
        return x

In [32]:
class CNN(nn.Module):
    def __init__(self, dropout=0.2):
        super(CNN, self).__init__()
        
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=5, stride=1, padding=2)
        #3*32*32 -> 32*32*32
        self.dropout1 = nn.Dropout(p=dropout)        
        self.pool1 = nn.MaxPool2d(kernel_size=(2,2), stride=2)
        #32*32*32 -> 16*16*32
        self.conv2 = nn.Conv2d(32, 64, 3, stride=1, padding=1)
        #16*16*32 -> 16*16*64
        self.dropout2 = nn.Dropout(p=dropout)
        self.pool2 = nn.MaxPool2d(kernel_size=(2,2), stride=2)
        #16*16*64 -> 8*8*64
        self.fc1 = nn.Linear(8*8*64, 1024)
        self.dropout3 = nn.Dropout(p=dropout)
        self.fc2 = nn.Linear(1024, 512)
        self.dropout4 = nn.Dropout(p=dropout)
        self.fc3 = nn.Linear(512, 10)
            

    def forward(self, x):
        x = self.dropout1(self.conv1(x))
        x = self.pool1(F.relu(x))
        x = self.dropout2(self.conv2(x))
        x = self.pool2(F.relu(x))
        x = x.view(-1, self.num_flat_features(x)) 
        # self.num_flat_features(x) = 8*8*64 here.
        # -1 lets PyTorch infer the remaining dimension, i.e. the batch size
        # (32 for a full batch from the data loaders above).
        # PyTorch nn layers expect a leading mini-batch dimension.
        
        x = F.relu(self.fc1(x))
        x = self.dropout3(x)
        x = F.relu(self.fc2(x))
        x = self.dropout4(x)
        x = self.fc3(x)
        return x
    
    def num_flat_features(self, x):
        size = x.size()[1:] # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
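
The shape comments in the class can be verified with a dummy batch. The sketch below rebuilds only the convolution and pooling layers with the same hyperparameters as above (dropout does not change shapes, so it is left out) and prints the intermediate sizes. Note that PyTorch orders tensors as (batch, channels, height, width), whereas the comments above write height*width*channels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same hyperparameters as the conv/pool layers of the CNN above
conv1 = nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=2)
conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.zeros(4, 3, 32, 32)            # dummy batch of 4 CIFAR-10-sized images
with torch.no_grad():
    h1 = pool(F.relu(conv1(x)))          # padding keeps 32x32, pooling halves it to 16x16
    h2 = pool(F.relu(conv2(h1)))         # padding keeps 16x16, pooling halves it to 8x8
    flat = torch.flatten(h2, start_dim=1)  # everything except the batch dimension

print(h1.shape, h2.shape, flat.shape)
```

The flattened size of `64 * 8 * 8 = 4096` is what `fc1` expects as its input dimension.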

In [43]:
def train(model, n_epochs=25, weight_decay=0.0000, model_checkpoint='model_cifar.pt'):
    # Specify the Loss function
    criterion = nn.CrossEntropyLoss()

    # Specify the optimizer
    optimizer = optim.SGD(model.parameters(), lr=0.001)

    valid_loss_min = np.inf # track change in validation loss (np.Inf was removed in NumPy 2.0)

    for epoch in range(1, n_epochs+1):

        # keep track of training and validation loss
        train_loss = 0.0
        valid_loss = 0.0
    
        ###################
        # train the model #
        ###################
        model.train()
        for data, target in trainloader:
            # move tensors to GPU if CUDA is available
            data = data.to(device)
            target = target.to(device)
            # clear the gradients of all optimized variables
            optimizer.zero_grad()
            # forward pass: compute predicted outputs by passing inputs to the model
            output = model(data)
            # calculate the batch loss
            loss = criterion(output, target)
            # backward pass: compute gradient of the loss with respect to model parameters
            loss.backward()

            # Manually add weight decay
            with torch.no_grad():
                for param in model.parameters():
                    if param.requires_grad:
                        param.grad += weight_decay * param
            
            # perform a single optimization step (parameter update)
            optimizer.step()
            # update training loss
            train_loss += loss.item()*data.size(0)
        
        ######################    
        # validate the model #
        ######################
        model.eval()
        # gradients are not needed for evaluation
        # (note that the test set doubles as the validation set here)
        with torch.no_grad():
            for data, target in testloader:
                # move tensors to GPU if CUDA is available
                data = data.to(device)
                target = target.to(device)

                # forward pass: compute predicted outputs by passing inputs to the model
                output = model(data)
                # calculate the batch loss
                loss = criterion(output, target)
                # update average validation loss 
                valid_loss += loss.item()*data.size(0)
    
        # calculate average losses
        train_loss = train_loss/len(trainloader.dataset)
        valid_loss = valid_loss/len(testloader.dataset)
            
        # print training/validation statistics 
        print('Epoch: {} \tTraining Loss: {:.6f} \tValidation Loss: {:.6f}'.format(
            epoch, train_loss, valid_loss))
    
        # save model if validation loss has decreased
        if valid_loss <= valid_loss_min:
            print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(
            valid_loss_min,
            valid_loss))
            torch.save(model.state_dict(), model_checkpoint)
            valid_loss_min = valid_loss
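
The manual weight decay step in `train` adds `weight_decay * param` to each gradient before the update. For plain SGD this should match the optimizer's own `weight_decay` argument. The sketch below checks this on a small, made-up linear model with random data; the model and the data are illustration values only.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
weight_decay, lr = 0.001, 0.001

model_manual = nn.Linear(5, 3)
model_builtin = copy.deepcopy(model_manual)   # identical initial weights

x = torch.randn(8, 5)                         # random stand-in inputs
target = torch.randint(0, 3, (8,))            # random stand-in labels
criterion = nn.CrossEntropyLoss()

# Variant 1: add weight_decay * param to the gradient by hand, as in train()
opt_manual = optim.SGD(model_manual.parameters(), lr=lr)
opt_manual.zero_grad()
criterion(model_manual(x), target).backward()
with torch.no_grad():
    for param in model_manual.parameters():
        param.grad += weight_decay * param
opt_manual.step()

# Variant 2: let the optimizer apply the same L2 penalty
opt_builtin = optim.SGD(model_builtin.parameters(), lr=lr, weight_decay=weight_decay)
opt_builtin.zero_grad()
criterion(model_builtin(x), target).backward()
opt_builtin.step()

# Largest parameter difference after one step
max_diff = max(
    (a - b).abs().max().item()
    for a, b in zip(model_manual.parameters(), model_builtin.parameters())
)
print(max_diff)
```

Both variants also decay the biases, since they loop over all parameters.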

In [35]:
model_dropout = CNN(dropout=0.2).to(device)
train(model_dropout, n_epochs=20, model_checkpoint='model_cifar_dropout.pt')

Epoch: 1 	Training Loss: 2.290635 	Validation Loss: 2.279601
Validation loss decreased (inf --> 2.279601).  Saving model ...
Epoch: 2 	Training Loss: 2.221849 	Validation Loss: 2.180125
Validation loss decreased (2.279601 --> 2.180125).  Saving model ...
Epoch: 3 	Training Loss: 2.085038 	Validation Loss: 2.061150
Validation loss decreased (2.180125 --> 2.061150).  Saving model ...
Epoch: 4 	Training Loss: 1.968961 	Validation Loss: 1.952566
Validation loss decreased (2.061150 --> 1.952566).  Saving model ...
Epoch: 5 	Training Loss: 1.864512 	Validation Loss: 1.859413
Validation loss decreased (1.952566 --> 1.859413).  Saving model ...
Epoch: 6 	Training Loss: 1.778727 	Validation Loss: 1.780695
Validation loss decreased (1.859413 --> 1.780695).  Saving model ...
Epoch: 7 	Training Loss: 1.695100 	Validation Loss: 1.700058
Validation loss decreased (1.780695 --> 1.700058).  Saving model ...
Epoch: 8 	Training Loss: 1.628724 	Validation Loss: 1.641476
Validation loss decreased (1.70005

In [36]:
model_wo_dropout = CNN(dropout=0.0).to(device)
train(model_wo_dropout, n_epochs=20, model_checkpoint='model_cifar_wo_dropout.pt')

Epoch: 1 	Training Loss: 2.298147 	Validation Loss: 2.292824
Validation loss decreased (inf --> 2.292824).  Saving model ...
Epoch: 2 	Training Loss: 2.284789 	Validation Loss: 2.272804
Validation loss decreased (2.292824 --> 2.272804).  Saving model ...
Epoch: 3 	Training Loss: 2.245980 	Validation Loss: 2.201385
Validation loss decreased (2.272804 --> 2.201385).  Saving model ...
Epoch: 4 	Training Loss: 2.137732 	Validation Loss: 2.066096
Validation loss decreased (2.201385 --> 2.066096).  Saving model ...
Epoch: 5 	Training Loss: 2.012112 	Validation Loss: 1.948091
Validation loss decreased (2.066096 --> 1.948091).  Saving model ...
Epoch: 6 	Training Loss: 1.916246 	Validation Loss: 1.867050
Validation loss decreased (1.948091 --> 1.867050).  Saving model ...
Epoch: 7 	Training Loss: 1.839795 	Validation Loss: 1.791742
Validation loss decreased (1.867050 --> 1.791742).  Saving model ...
Epoch: 8 	Training Loss: 1.761482 	Validation Loss: 1.712122
Validation loss decreased (1.79174

In [48]:
# Note: dropout=0.0 here, so despite its name this model is trained with weight decay only
model_dropout_wd = CNN(dropout=0.0).to(device)
train(model_dropout_wd, n_epochs=20, weight_decay=0.001, model_checkpoint='model_cifar_dropout_wd.pt')

Epoch: 1 	Training Loss: 2.297493 	Validation Loss: 2.290910
Validation loss decreased (inf --> 2.290910).  Saving model ...
Epoch: 2 	Training Loss: 2.279932 	Validation Loss: 2.262664
Validation loss decreased (2.290910 --> 2.262664).  Saving model ...
Epoch: 3 	Training Loss: 2.223989 	Validation Loss: 2.164115
Validation loss decreased (2.262664 --> 2.164115).  Saving model ...
Epoch: 4 	Training Loss: 2.098789 	Validation Loss: 2.023497
Validation loss decreased (2.164115 --> 2.023497).  Saving model ...
Epoch: 5 	Training Loss: 1.971953 	Validation Loss: 1.912690
Validation loss decreased (2.023497 --> 1.912690).  Saving model ...
Epoch: 6 	Training Loss: 1.882899 	Validation Loss: 1.833533
Validation loss decreased (1.912690 --> 1.833533).  Saving model ...
Epoch: 7 	Training Loss: 1.802457 	Validation Loss: 1.752810
Validation loss decreased (1.833533 --> 1.752810).  Saving model ...
Epoch: 8 	Training Loss: 1.726368 	Validation Loss: 1.680781
Validation loss decreased (1.75281

In [37]:
def track_test_loss(model):
    # track test loss
    test_loss = 0.0
    class_correct = list(0. for i in range(10))
    class_total = list(0. for i in range(10))
    criterion = nn.CrossEntropyLoss()

    model.eval()
    # iterate over test data
    for data, target in testloader:
        # move tensors to GPU if CUDA is available
        data = data.to(device)
        target = target.to(device)
        # forward pass: compute predicted outputs by passing inputs to the model
        output = model(data)
        # calculate the batch loss
        loss = criterion(output, target)
        # update test loss 
        test_loss += loss.item()*data.size(0)
        # convert output logits to predicted class
        _, pred = torch.max(output, 1)    
        # compare predictions to true label
        correct_tensor = pred.eq(target.data.view_as(pred))
        correct = np.squeeze(correct_tensor.cpu().numpy())
        # calculate test accuracy for each object class
        for i in range(len(target.data)):
            label = target.data[i]
            class_correct[label] += correct[i].item()
            class_total[label] += 1

    # average test loss
    test_loss = test_loss/len(testloader.dataset)
    print('Test Loss: {:.6f}\n'.format(test_loss))

    for i in range(10):
        if class_total[i] > 0:
            print('Test Accuracy of %5s: %2d%% (%2d/%2d)' % (
                classes[i], 100 * class_correct[i] / class_total[i],
                np.sum(class_correct[i]), np.sum(class_total[i])))
        else:
            print('Test Accuracy of %5s: N/A (no test examples)' % (classes[i]))

    print('\nTest Accuracy (Overall): %2d%% (%2d/%2d)' % (
        100. * np.sum(class_correct) / np.sum(class_total),
        np.sum(class_correct), np.sum(class_total)))

In [38]:
track_test_loss(model_dropout)

Test Loss: 1.335373

Test Accuracy of airplane: 67% (678/1000)
Test Accuracy of automobile: 58% (587/1000)
Test Accuracy of  bird: 39% (398/1000)
Test Accuracy of   cat: 33% (335/1000)
Test Accuracy of  deer: 47% (470/1000)
Test Accuracy of   dog: 47% (471/1000)
Test Accuracy of  frog: 61% (619/1000)
Test Accuracy of horse: 62% (623/1000)
Test Accuracy of  ship: 65% (650/1000)
Test Accuracy of truck: 61% (618/1000)

Test Accuracy (Overall): 54% (5449/10000)


In [39]:
track_test_loss(model_wo_dropout)

Test Loss: 1.321031

Test Accuracy of airplane: 65% (651/1000)
Test Accuracy of automobile: 64% (646/1000)
Test Accuracy of  bird: 29% (291/1000)
Test Accuracy of   cat: 26% (264/1000)
Test Accuracy of  deer: 42% (424/1000)
Test Accuracy of   dog: 42% (426/1000)
Test Accuracy of  frog: 62% (623/1000)
Test Accuracy of horse: 71% (715/1000)
Test Accuracy of  ship: 59% (590/1000)
Test Accuracy of truck: 63% (638/1000)

Test Accuracy (Overall): 52% (5268/10000)


In [49]:
track_test_loss(model_dropout_wd)

Test Loss: 1.298788

Test Accuracy of airplane: 53% (533/1000)
Test Accuracy of automobile: 70% (709/1000)
Test Accuracy of  bird: 40% (409/1000)
Test Accuracy of   cat: 21% (218/1000)
Test Accuracy of  deer: 43% (433/1000)
Test Accuracy of   dog: 49% (491/1000)
Test Accuracy of  frog: 64% (640/1000)
Test Accuracy of horse: 64% (648/1000)
Test Accuracy of  ship: 69% (694/1000)
Test Accuracy of truck: 57% (577/1000)

Test Accuracy (Overall): 53% (5352/10000)


## Conclusion

- Dropout is well suited to small datasets that are easy to overfit.
- It is used less for large datasets, where overfitting is less of a concern.
- It was used extensively in earlier architectures (AlexNet, GoogLeNet, ...) but is not used much in current ones (Transformers, ConvNeXt, ...).