# Dropout (Forward + Backward)

**Concept:**  
Dropout is a regularization technique to reduce overfitting in neural networks by randomly "dropping" units during training.

- During **training**, each neuron is kept with probability $1-p$ and dropped (set to zero) with probability $p$.  
- To maintain the expected scale of activations, the outputs of kept neurons are scaled by $\frac{1}{1-p}$.  
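- During **inference**, dropout is disabled (all neurons are active). No extra rescaling is needed, because the $\frac{1}{1-p}$ factor was already applied during training (this variant is often called inverted dropout).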
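
The scaling claim above can be checked numerically. The following is a minimal sketch, not part of the original notebook: the activation values and the number of trials are arbitrary choices made for illustration. It averages the dropout output over many independently sampled masks and compares it with the input.

```python
import torch

torch.manual_seed(0)

p = 0.5                                   # dropout probability
keep_prob = 1.0 - p
h = torch.tensor([1.0, 2.0, 3.0, 4.0])    # hypothetical activations

# Sample many independent masks, one row per trial.
n_trials = 100_000
mask = (torch.rand(n_trials, h.numel()) < keep_prob).float()

h_tilde = mask * h / keep_prob            # survivors rescaled by 1/(1-p)
h_unscaled = mask * h                     # same masks, no rescaling

print("input            :", h)
print("mean with scaling:", h_tilde.mean(dim=0))     # close to h
print("mean, no scaling :", h_unscaled.mean(dim=0))  # close to (1-p) * h
```

Without the rescaling, the average activation shrinks by a factor of $1-p$, which is the mismatch between training and inference that the $\frac{1}{1-p}$ factor removes.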

---

**Formulas:**

- **Forward pass (training):**
  $$
  \tilde{h} = \frac{m \odot h}{1-p}, \quad m \sim \text{Bernoulli}(1-p)
  $$

- **Backward pass:**
  $$
  \frac{\partial L}{\partial h} = \frac{m \odot \frac{\partial L}{\partial \tilde{h}}}{1-p}
  $$

where:
- $h$: input activations
- $\tilde{h}$: output after dropout
- $m$: dropout mask (0/1 values)
- $p$: dropout probability
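
Because dropout acts elementwise and the mask is a constant once sampled, the backward formula follows directly from the chain rule. A short derivation for a single unit (the index $i$ is notation added here for illustration):

$$
\tilde{h}_i = \frac{m_i\, h_i}{1-p}
\;\;\Longrightarrow\;\;
\frac{\partial \tilde{h}_i}{\partial h_i} = \frac{m_i}{1-p}
\;\;\Longrightarrow\;\;
\frac{\partial L}{\partial h_i}
= \frac{\partial L}{\partial \tilde{h}_i}\,\frac{\partial \tilde{h}_i}{\partial h_i}
= \frac{m_i}{1-p}\,\frac{\partial L}{\partial \tilde{h}_i}
$$

A dropped unit ($m_i = 0$) therefore receives zero gradient, and a kept unit ($m_i = 1$) has its gradient scaled by $\frac{1}{1-p}$.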

---

**Key Points:**
- Dropout reduces overfitting by preventing co-adaptation of neurons.  
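- A fresh mask is sampled independently at every training step (every forward pass).
- Backpropagation reuses the mask sampled in that forward pass; a newly sampled mask would give gradients for a different sub-network than the one that produced the loss.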
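
The same-mask point can be checked against autograd. This is a minimal sketch under stated assumptions: the tensor shapes and the toy loss `(h_tilde * w).sum()` are made up for illustration and do not come from the notebook.

```python
import torch

torch.manual_seed(0)

p = 0.5
keep_prob = 1.0 - p

h = torch.randn(4, 6, requires_grad=True)         # hypothetical activations
mask = (torch.rand_like(h) < keep_prob).float()   # sampled once, in the forward pass

# Forward through dropout, then a toy scalar loss with arbitrary weights w.
h_tilde = mask * h / keep_prob
w = torch.randn(4, 6)
loss = (h_tilde * w).sum()
loss.backward()                                   # autograd reference: h.grad

# Manual backward. For this toy loss, dL/dh_tilde is exactly w.
grad_manual = mask * w / keep_prob                # reuses the SAME mask

# For comparison: a gradient computed with a freshly sampled mask.
other_mask = (torch.rand_like(h) < keep_prob).float()
grad_other = other_mask * w / keep_prob

print("same mask matches autograd :", torch.allclose(h.grad, grad_manual))
print("fresh mask matches autograd:", torch.allclose(h.grad, grad_other))
```

The manual gradient agrees with autograd only when the forward mask is reused; a second mask would match only if it happened to coincide with the first.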


# Dropout (Forward + Backward) — From Scratch

**Idea.** During **training**, randomly drop each activation with probability $p$ (set to 0) and scale the survivors by $\tfrac{1}{1-p}$ so the expected activation stays unchanged. During **inference**, disable dropout.

**Forward (training)**
$$
\tilde{h} \;=\; \frac{m \odot h}{1-p}, \qquad m \sim \mathrm{Bernoulli}(1-p)
$$

**Backward**
$$
\frac{\partial L}{\partial h} \;=\; \frac{m \odot \frac{\partial L}{\partial \tilde{h}}}{1-p}
$$

We’ll insert dropout **after ReLU** on the hidden layer of a 2-layer MLP and implement the gradients manually (no autograd).


```python
import torch
torch.manual_seed(0)

# Toy data
N, d, H, K = 128, 20, 64, 5
X = torch.randn(N, d)
y = torch.randint(0, K, (N,))

# One-hot labels
Y = torch.zeros(N, K)
Y[torch.arange(N), y] = 1.0

# Parameters
W1 = torch.randn(d, H) * 0.02
b1 = torch.zeros(H)
W2 = torch.randn(H, K) * 0.02
b2 = torch.zeros(K)

# Hyperparams
lr = 0.5
epochs = 200
p_drop = 0.5 # dropout probability
training = True

def softmax(logits):
    z = logits - logits.max(dim=1, keepdim=True).values
    expz = torch.exp(z)
    return expz / expz.sum(dim=1, keepdim=True)

def dropout_forward(h, p, training):
    """
    Returns (h_tilde, mask).
    If not training, returns (h, None).
    """
    if not training or p == 0.0:
        return h, None
    keep_prob = 1.0 - p
    mask = (torch.rand_like(h) < keep_prob).float()
    h_tilde = (mask * h) / keep_prob
    return h_tilde, mask

def dropout_backward(grad_out, mask, p):
    """
    Backprop through dropout using the SAME mask from forward.
    """
    if mask is None or p == 0.0:
        return grad_out
    keep_prob = 1.0 - p
    return (mask * grad_out) / keep_prob

for epoch in range(epochs):
    # Forward
    Z1 = X @ W1 + b1
    H1 = torch.clamp(Z1, min=0.0)
    H1_do, mask = dropout_forward(H1, p_drop, training)  # Dropout after ReLU
    S  = H1_do @ W2 + b2
    P  = softmax(S)
    loss = -(Y * (P + 1e-12).log()).sum() / N

    # Backward (manual)
    dS  = (P - Y) / N
    dW2 = H1_do.t() @ dS
    db2 = dS.sum(dim=0)

    dH1_do = dS @ W2.t()
    dH1 = dropout_backward(dH1_do, mask, p_drop)        # back through dropout
    dZ1 = dH1 * (Z1 > 0).float()

    dW1 = X.t() @ dZ1
    db1 = dZ1.sum(dim=0)

    # SGD step
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

    if (epoch + 1) % 50 == 0:
        pred = P.argmax(dim=1)
        acc = (pred == y).float().mean().item()
        print(f"[manual+dropout] epoch {epoch+1:3d} | loss {loss.item():.4f} | acc {acc:.3f}")
```

```
[manual+dropout] epoch  50 | loss 1.0639 | acc 0.594
[manual+dropout] epoch 100 | loss 0.7131 | acc 0.758
[manual+dropout] epoch 150 | loss 0.4612 | acc 0.844
[manual+dropout] epoch 200 | loss 0.2999 | acc 0.922
```


# Dropout with PyTorch `nn.Dropout` (Autograd)

Here we use PyTorch modules so autograd handles gradients.

**Model (train mode):**
$$
\text{Linear}(d\!\to\!H) \;\rightarrow\; \mathrm{ReLU} \;\rightarrow\; \mathrm{Dropout}(p) \;\rightarrow\; \text{Linear}(H\!\to\!K)
$$

- `model.train()` enables dropout (applies the Bernoulli mask and $\tfrac{1}{1-p}$ scaling).
- `model.eval()` disables dropout for inference.
- Loss: `nn.CrossEntropyLoss`, which takes raw logits and applies log-softmax followed by NLL internally, so the model has no softmax layer.
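
The effect of the two modes can be seen on a standalone `nn.Dropout` layer. This is a small illustrative sketch; the all-ones input is an arbitrary choice that makes the scaling easy to read.

```python
import torch
from torch import nn

torch.manual_seed(0)

p = 0.5
drop = nn.Dropout(p)
x = torch.ones(2, 8)       # all-ones input: outputs show the mask and scale directly

drop.train()               # training mode: random mask plus 1/(1-p) scaling
out_train = drop(x)

drop.eval()                # eval mode: the layer acts as the identity
out_eval = drop(x)

print("train:", out_train) # every entry is either 0 or 1/(1-p)
print("eval :", out_eval)  # identical to x
```

Calling `model.train()` or `model.eval()` on an `nn.Sequential` sets the same flag on every submodule, which is how the dropout layer inside the model is switched on and off.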


```python
import torch
from torch import nn

torch.manual_seed(0)

# Same toy data
N, d, H, K = 128, 20, 64, 5
X = torch.randn(N, d)
y = torch.randint(0, K, (N,))

p_drop = 0.5

model = nn.Sequential(
    nn.Linear(d, H),
    nn.ReLU(),
    nn.Dropout(p_drop),   # dropout placed after ReLU
    nn.Linear(H, K)
)

criterion = nn.CrossEntropyLoss()
optim = torch.optim.SGD(model.parameters(), lr=0.5)

model.train()  # enable dropout
for epoch in range(200):
    optim.zero_grad()
    logits = model(X)             # dropout active here
    loss = criterion(logits, y)
    loss.backward()
    optim.step()

    if (epoch + 1) % 50 == 0:
        acc = (logits.argmax(1) == y).float().mean().item()
        print(f"[nn.Dropout]    epoch {epoch+1:3d} | loss {loss.item():.4f} | acc {acc:.3f}")

# Inference (disable dropout):
model.eval()
with torch.no_grad():
    logits = model(X)             # dropout OFF
    acc = (logits.argmax(1) == y).float().mean().item()
    print(f"[eval] accuracy (dropout off): {acc:.3f}")
```


```
[nn.Dropout]    epoch  50 | loss 0.8975 | acc 0.680
[nn.Dropout]    epoch 100 | loss 0.5385 | acc 0.836
[nn.Dropout]    epoch 150 | loss 0.3482 | acc 0.891
[nn.Dropout]    epoch 200 | loss 0.2693 | acc 0.922
[eval] accuracy (dropout off): 1.000
```
