# Dropout Intuition:

Dropout is applied to the activations of the hidden layers during training. Its main purpose is to prevent co-adaptation:

- Co-adaptation occurs when neurons adjust their weights in a way that depends on the presence of other neurons. This can lead to overfitting, where the model performs well on training data but poorly on unseen data.

- By randomly dropping activations, dropout ensures that neurons cannot rely on specific other neurons being present, promoting independent feature learning.

Suppose we have 4 neurons in one layer: A, B, C, and D. With an unlucky random initialization, the weights of A and B may be close to zero. Consider the following two situations:

Without dropout, this layer relies only on neurons C and D to pass information to the next layer. Neurons A and B may be too "lazy" to adjust their weights, since C and D already do the work.

With dropout = 0.5, each neuron is dropped independently with probability 0.5 on every forward pass, so on average two of the four neurons are dropped. When both C and D happen to be dropped, A and B have to carry the information and receive the gradient updates. (This encourages the learned weights to be spread more evenly across the existing neurons.)
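To make the A/B/C/D example concrete, here is a small sketch that samples dropout masks for a layer of four neurons. It assumes each neuron is dropped independently, which is how PyTorch's dropout behaves; the helper name `sample_masks`, the seed, and the number of passes are illustrative choices, not part of the original notes.

```python
import torch

def sample_masks(num_neurons=4, rate=0.5, num_passes=1000, seed=0):
    """Sample one keep-mask per forward pass (True = neuron kept)."""
    gen = torch.Generator().manual_seed(seed)
    # Each neuron is dropped independently with probability `rate`
    return torch.rand(num_passes, num_neurons, generator=gen) > rate

masks = sample_masks()
dropped_per_pass = (~masks).sum(dim=1)            # dropped neurons in each pass
both_cd_dropped = (~masks[:, 2]) & (~masks[:, 3])  # passes where C and D are both dropped

print("dropped per pass (first 8):", dropped_per_pass[:8].tolist())
print("mean dropped per pass:", dropped_per_pass.float().mean().item())
print("fraction of passes with C and D both dropped:", both_cd_dropped.float().mean().item())
```

The number of dropped neurons varies from pass to pass (it is 2 only on average), and C and D are dropped together in roughly a quarter of the passes. Those are the passes in which A and B must carry the signal.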


## Forward Pass
### During Training:
- Dropout randomly "drops out" (i.e., sets to zero) a fraction of the neurons' activations in a layer. This is done by generating a binary mask in which each neuron is kept with probability $p$ and dropped with probability $1-p$ (the dropout rate).
- Scaling: To maintain the expected value of the activations, the remaining (non-dropped) activations are scaled by $1/p$. This keeps the overall magnitude of the activations consistent between training and inference.
- In PyTorch: `model.train()`. Note that the `p` argument of `nn.Dropout` is the drop probability, i.e. $1-p$ in the notation used here.
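The effect of the scaling can be checked numerically. The sketch below uses `torch.nn.functional.dropout`, whose `p` argument is the probability of zeroing an element, so the keep probability is `1 - p`. The variable names `drop_prob` and `keep_prob`, the tensor size, and the seed are arbitrary choices for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
drop_prob = 0.2              # PyTorch's `p`: probability of zeroing an element
keep_prob = 1 - drop_prob    # probability that an element survives

x = torch.ones(100_000)
y = F.dropout(x, p=drop_prob, training=True)

# Kept elements are scaled to 1 / keep_prob, dropped elements are 0
kept = y[y != 0]
print("value of kept activations:", kept[0].item())
print("fraction kept:", (y != 0).float().mean().item())
print("mean before:", x.mean().item(), "mean after:", y.mean().item())
```

The mean after dropout stays close to the mean before it, which is what the scaling is meant to guarantee in expectation.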


### During Inference:
- Dropout is disabled: all neurons are active and no scaling is applied.
- In PyTorch: `model.eval()`
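A quick way to see the difference between the two modes is to call a standalone `nn.Dropout` layer in each of them. This is a minimal sketch; the input tensor and seed are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)   # p is the drop probability
x = torch.ones(2, 4)

drop.train()
y_train = drop(x)   # some entries zeroed, survivors scaled to 1 / (1 - 0.5) = 2.0

drop.eval()
y_eval = drop(x)    # identity: the input passes through unchanged

print("train mode:", y_train)
print("eval mode: ", y_eval)
print("eval output equals input:", torch.equal(y_eval, x))
```

Calling `model.eval()` on a full model switches every dropout layer inside it to this pass-through behavior.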

## Backward Pass

Dropout Behavior:

Recall the order of a training step: forward pass -> loss -> backward pass -> gradients -> weight update.

Gradient Flow: The same dropout mask used in the forward pass is applied to the gradients during the backward pass. Gradients for the dropped activations are therefore zeroed out, so the dropped neurons do not contribute to the weight update for that pass. (Informally, a dropped neuron is "laid off" for that pass: its output is 0 and it takes no part in the gradient update.) Only the neurons kept in the forward pass receive gradients; the dropped neurons are effectively frozen for that pass.

E.g., if the first and last neurons of a layer are dropped in the first forward pass, the gradient computed at the output is backpropagated only through the remaining neurons. When the second forward pass starts, a new mask is sampled, so a different set of neurons is dropped and their weights are temporarily frozen.



Scaling: As in the forward pass, the gradients of the kept activations are scaled by $1/p$, where $p$ is the keep probability. This follows from the chain rule, because the forward pass multiplies each kept activation by $1/p$.
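The masking and scaling of gradients can be observed directly with autograd. The sketch below is an illustration under simple assumptions: the input is all ones (so the forward mask can be recovered from the non-zero outputs) and the loss is just the sum of the outputs. `drop_prob` is PyTorch's drop probability, so the keep probability is `1 - drop_prob`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
drop_prob = 0.5
x = torch.ones(8, requires_grad=True)

y = F.dropout(x, p=drop_prob, training=True)
y.sum().backward()   # d(sum)/dy = 1 for every element

# Recover the forward mask (valid here because x contains no zeros)
mask = (y != 0)
expected_grad = mask.float() / (1 - drop_prob)

print("forward output:", y.detach())
print("gradient wrt x:", x.grad)
print("gradient matches mask / keep_prob:", torch.allclose(x.grad, expected_grad))
```

Dropped positions get a gradient of exactly 0, and kept positions get the upstream gradient multiplied by the same factor that was applied in the forward pass.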

![dropout](https://d2l.ai/_images/dropout2.svg)

In [73]:
import torch

In [None]:
def dropout(X, rate):
    """Inverted dropout: zero each element of X with probability `rate`."""
    if not (0 <= rate <= 1):
        raise ValueError("rate must be in [0, 1]")

    if rate == 1:
        return torch.zeros_like(X)

    keep_prob = 1 - rate
    # Boolean mask: True (keep) with probability keep_prob
    mask = torch.rand(X.shape, device=X.device) > rate
    results = X * mask  # randomly zero out values inside X
    # Scale up so that the expectation of each neuron stays the same.
    # Example: X = [1, 1, 1, 1, 1] with 5 neurons, rate = 0.2 (keep_prob = 0.8).
    # Each neuron is kept with probability 0.8 and dropped with probability 0.2,
    # so the expected value of X[0] is 1 * 0.8 + 0 * 0.2 = 0.8 instead of 1.
    # Dividing by keep_prob gives 0.8 / 0.8 = 1, restoring the expectation.
    results = results / keep_prob
    return results


In [98]:
X = torch.arange(16).reshape((2, 8))
X

tensor([[ 0,  1,  2,  3,  4,  5,  6,  7],
        [ 8,  9, 10, 11, 12, 13, 14, 15]])

In [99]:
X = dropout(X, 0.5)
X

tensor([[ 0.,  2.,  4.,  0.,  8.,  0.,  0.,  0.],
        [ 0.,  0., 20., 22.,  0.,  0.,  0.,  0.]])

# Dropout in PyTorch

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim


class SimpleModel(nn.Module):

    def __init__(self):
        super().__init__()
        self.l1 = nn.Linear(10, 20)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)
        self.l2 = nn.Linear(20, 1)
    
    def forward(self, X):
        x = self.l1(X)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.l2(x)
        return x


model = SimpleModel()
model.train() # <------ Turn on dropout
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

train_data = torch.randn(5, 10) # shape = (5, 10)
target_data = torch.randn(5, 1) # shape = (5, 1)

# Forward pass
output_train = model(train_data)
loss_train = criterion(output_train, target_data)

# Backward pass and optimization
optimizer.zero_grad()
loss_train.backward()
optimizer.step()

print(f"Training Output with Dropout: {output_train}")
print(f"\nTraining Loss: {loss_train.item()}")

Training Output with Dropout: tensor([[-0.1167],
        [-0.2020],
        [-0.3047],
        [-0.1872],
        [-0.0012]], grad_fn=<AddmmBackward0>)

Training Loss: 1.78287672996521


In [108]:
# set model to evaluation mode for validation and test
model.eval()

# Example validation data
input_val = torch.randn(5, 10)
target_val = torch.randn(5, 1)

# Forward pass
with torch.no_grad():
    output_val = model(input_val)
    loss_val = criterion(output_val, target_val)

print("\nValidation Output without Dropout:")
print(output_val)
print("\nValidation Loss:")
print(loss_val.item())


Validation Output without Dropout:
tensor([[-0.0591],
        [ 0.0543],
        [-0.0431],
        [-0.0533],
        [ 0.0093]])

Validation Loss:
1.2603604793548584


# Other Dropout Techniques
- DropConnect is a variant of the traditional dropout technique introduced by Wan et al. in their 2013 paper, "Regularization of Neural Networks using DropConnect". Unlike standard dropout, which randomly drops activations, DropConnect randomly drops weights during training.
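
A minimal sketch of the idea, not the reference implementation from the paper: the hypothetical `DropConnectLinear` module below zeroes individual weights of a linear layer during training. Rescaling the surviving weights and using the unmasked weights at inference are simplifying assumptions borrowed from inverted dropout, and they may differ from the procedure described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropConnectLinear(nn.Module):
    """Linear layer that randomly zeroes individual weights during training (illustrative)."""

    def __init__(self, in_features, out_features, drop_prob=0.5):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.drop_prob = drop_prob

    def forward(self, x):
        if not self.training or self.drop_prob == 0:
            return self.linear(x)
        # Assumption: reuse inverted-dropout scaling on the weight matrix, so each
        # weight is zeroed with probability drop_prob and survivors are rescaled
        weight = F.dropout(self.linear.weight, p=self.drop_prob, training=True)
        return F.linear(x, weight, self.linear.bias)

torch.manual_seed(0)
layer = DropConnectLinear(10, 20, drop_prob=0.3)
x = torch.randn(5, 10)

layer.train()
out_train = layer(x)   # a fresh weight mask is sampled on every call
layer.eval()
out_eval = layer(x)    # all weights are used
print(out_train.shape, out_eval.shape)
```

The contrast with standard dropout is in what gets masked: here the mask has the shape of the weight matrix, whereas `nn.Dropout` masks the activation tensor.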