# Dropout

> "Dropout, simply described, is the concept that if you can learn how to do a task repeatedly whilst drunk, you should be able to do the task even better when sober."

## Seriously, what is it?

> Randomly setting the outputs of some neurons to zero __during the forward pass in training__.

![dropout_gif](images/dropout.gif)

- Dropout is a widely used regularization method specific to neural networks
- The remaining neurons have to compensate for the dropped ones and perform the task without relying on them
- A different set of neurons is randomly dropped during each pass

For a single neuron, the equation looks like this:

$$
O_i = X_ig(\sum_{k=1}^{d_i}w_k x_k + b), \ P(X_i = 0) = p
$$

In simple terms, `p` is the probability of zeroing out this specific neuron (and `q=1-p` is the probability of keeping it).
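As a rough illustration of the equation above, here is a minimal sketch of a single neuron with dropout during training. The function name `neuron_with_dropout`, the choice of ReLU for `g`, and the input values are assumptions made for this example only.

```python
import torch

torch.manual_seed(0)

def neuron_with_dropout(x, w, b, p):
    """Training-time output of one neuron: X * g(w.x + b), with P(X = 0) = p."""
    activation = torch.relu(torch.dot(w, x) + b)   # g is assumed to be ReLU here
    keep = torch.bernoulli(torch.tensor(1.0 - p))  # X = 1 with probability q = 1 - p
    return keep * activation

x = torch.tensor([1.0, 2.0, 3.0])
w = torch.tensor([0.5, -0.25, 0.75])
b = torch.tensor(0.1)

# Repeated forward passes give either 0 or the full activation
outputs = torch.stack([neuron_with_dropout(x, w, b, p=0.5) for _ in range(10)])
print(outputs)
```

Each call samples a fresh mask value, which is why the same input can produce different outputs from one pass to the next.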

## Train vs evaluation behaviour

Of course, this approach would be wasteful at test time, as:
- It might produce non-reproducible predictions for a single sample
- It would not utilize the whole network

Because of that, the above equation only applies during the training phase.

> During evaluation (validation or testing) we use all of the neurons

But if we were to suddenly give the next layer ALL of the values after it had been trained with some of them dropped out, then its input is going to be way bigger than it is used to.

E.g. if during training p=0.5 and during evaluation we naively used all outputs, then the input to the next layer is going to be, on average, twice as large as what was experienced during training!

So we need to do one more __very important thing__

> During evaluation (validation or testing), we scale every output down BY MULTIPLYING IT BY THE PROBABILITY OF THE NEURON BEING KEPT, `q = 1 - p`

![train_vs_evaluation](images/train_vs_evaluation.png)

For a single neuron, the test-time equation looks like this:

$$
O_i = qg(\sum_{k=1}^{d_i}w_k x_k + b), \ q = 1-p
$$
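The factor `q` is what makes the test-time output match the average training-time output. The following sketch checks this numerically for one neuron; the activation value `2.0` is an arbitrary stand-in for `g(...)`, and the sample count is chosen only to make the average stable.

```python
import torch

torch.manual_seed(0)

p = 0.5                          # drop probability
q = 1 - p                        # keep probability
activation = torch.tensor(2.0)   # stands in for g(sum(w_k * x_k) + b)

# Training: sample the mask X many times and average X * activation
masks = torch.bernoulli(torch.full((100_000,), q))
train_mean = (masks * activation).mean()

# Evaluation: no sampling, just scale by q
eval_output = q * activation

print(train_mean.item(), eval_output.item())  # the two values should be close
```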

## Things to note

- Don't apply dropout after your final layer
    - You don't want to randomly set your prediction to zero!
    - If your model outputs probabilities for a classification problem, a zero probability for the true label class will give you an infinite loss when using a cross entropy loss.
- The probability of dropping out a node is usually kept consistent across a model's architecture
- The dropout probability is a hyperparameter and must not be trained using gradient descent. Make sure it is not part of the graph (`requires_grad` must be `False`).
- You should apply dropout after your activation functions.
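To see why a zeroed prediction is a problem, here is a small sketch of the cross entropy loss for one sample whose true class received probability zero. The probability values are made up for illustration.

```python
import torch

# Hypothetical predicted probabilities where the true class (index 0) was zeroed out
probs = torch.tensor([0.0, 0.7, 0.3])
true_class = 0

# Cross entropy for a single sample is -log(probability of the true class)
loss = -torch.log(probs[true_class])
print(loss)  # tensor(inf)
```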

## Dropout in PyTorch

As usual, PyTorch makes using this layer simple.

Check out the documentation [here](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html)

Below, we implement a simple neural network with dropout layers.

In [None]:
import torch

class MyNetwork(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(784, 256),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.5),
            torch.nn.Linear(256, 64),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.5),
            torch.nn.Linear(64, 10),
            torch.nn.Softmax(dim=1)  # normalise over the class dimension of (batch, classes)
        )

    def forward(self, x):
        return self.layers(x)

## `train` vs `eval` mode in PyTorch

As mentioned above, there are important differences for how dropout should be used during training and evaluation. But how do we tell PyTorch which mode we are in?

`torch.nn.Module` to the rescue!

PyTorch provides every class which inherits from `torch.nn.Module` a `.train()` and a `.eval()` method. These methods set the behaviour of layers which do different things depending on the mode you're currently in. Calling either of these methods on your model will apply the change to all layers within your model.

Check out the docs for train and eval [here](https://pytorch.org/docs/stable/generated/torch.nn.Module.html)

Note: Dropout is not the only layer which behaves differently between training and evaluation - check out batch normalisation (e.g. `torch.nn.BatchNorm1d`)

In [None]:
mymodel = MyNetwork()

mymodel.train() # set model to training mode

# do training here

mymodel.eval() # set model to evaluation mode

# do evaluation here
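
The effect of the two modes can be observed directly on a `torch.nn.Dropout` layer. This is a small sketch with an all-ones input chosen for readability. Note that the built-in layer rescales the kept values by `1 / (1 - p)` during training, so in evaluation mode it simply returns its input.

```python
import torch

torch.manual_seed(0)

dropout = torch.nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()                  # sets dropout.training = True
train_out = dropout(x)

dropout.eval()                   # sets dropout.training = False
eval_out = dropout(x)

print(dropout.training)          # False
print(train_out.unique())        # zeros, and kept values scaled by 1 / (1 - p)
print(torch.equal(eval_out, x))  # True
```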


## Demo: Implementing a custom `Dropout` layer
- Inside `__init__`:
    - a single argument `p`, the probability of a neuron being dropped
    - check whether `p` lies within the `(0, 1)` range and, if it doesn't, raise a `ValueError` with an appropriate message (e.g. `p (probability) has to lie in (0, 1) range!`)
    - Create a `self._distribution = torch.distributions.binomial.Binomial` object with success probability `1 - p`, so that a sampled `1` means "keep this neuron"
- Inside `forward`:
    - Use the `self.training` `bool` value to differentiate between test and train behaviour
    - Use the `self._distribution.sample` method to get a binary mask with the same shape as the `inputs` tensor (training)
    - Use `.to(inputs.device)` to move the created tensor to `cuda` (or another device) if needed (training). __Note:__ `torch.distributions` __is not moved to the device with the module__ as it's not a `torch.nn.Module` instance (see [this issue]() for more on the topic)
    - Multiply the inputs by the binary mask and return the result (training)
    - Multiply the inputs by the keep probability `1 - p` (testing phase)

In [8]:
import torch


class Dropout(torch.nn.Module):
    def __init__(self, p: float):
        if not 0 < p < 1:
            raise ValueError(f"p should lie between (0, 1), got: {p}")

        super().__init__()
        self.p = p  # probability of DROPPING a neuron
        # Binomial (total_count=1 by default) samples 1 with probability `probs`,
        # so we pass the KEEP probability. This is a plain attribute, NOT A PARAMETER.
        self._distribution = torch.distributions.binomial.Binomial(probs=1 - self.p)

    def forward(self, inputs):
        # inputs: (batch, in_features)
        if self.training:
            # mask: (batch, in_features), 1 = keep, 0 = drop
            mask = self._distribution.sample(inputs.size()).to(inputs.device)
            return inputs * mask
        # evaluation: keep every neuron, scaled by the keep probability
        return (1 - self.p) * inputs
    
module = Dropout(0.5)



### Test

Run the code below to check for errors.

You should see some values zeroed out during training and no zeroes during testing.

In [2]:
def test_my_dropout(module):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    cpu_input = torch.randn(8, 5)
    gpu_input = torch.randn(8, 5).to(device)


    module(cpu_input)
    print("\n\n------------------- TRAINING -------------------\n\n")
    print(module(gpu_input))
    print("\n\n-------------------- TESTING -------------------\n\n")

    module.eval()
    print(module(gpu_input))

test_my_dropout(Dropout(p=0.5))



------------------- TRAINING -------------------


tensor([[-0.0000, -0.7166, -0.0000, -1.8316, -0.0000],
        [-0.0000,  0.9460, -0.7197,  0.7440,  0.0000],
        [ 0.0000, -0.3741, -2.2701, -1.1797, -0.0000],
        [-0.0000,  0.6468, -1.7475, -0.0000, -0.0000],
        [-0.0000, -0.0000,  0.0693,  1.9605, -0.0000],
        [-0.1221,  0.3878, -0.0000,  0.0000, -1.0863],
        [-0.0000,  0.0000, -0.0000, -1.6344, -0.0000],
        [ 1.1280,  0.0000,  0.3795,  0.0000,  0.5406]], device='cuda:0')


-------------------- TESTING -------------------


tensor([[-0.8776, -0.3583, -0.0216, -0.9158, -0.2285],
        [-0.5360,  0.4730, -0.3598,  0.3720,  0.2497],
        [ 0.0518, -0.1870, -1.1350, -0.5899, -0.2168],
        [-0.1372,  0.3234, -0.8738, -0.7646, -0.5086],
        [-0.7511, -0.1746,  0.0346,  0.9803, -0.5257],
        [-0.0611,  0.1939, -0.0632,  0.1635, -0.5431],
        [-0.3338,  0.4279, -0.2291, -0.8172, -0.8041],
        [ 0.5640,  0.2071,  0.1897,  0.4786,  0.270

## Dropout rationale

### Ensemble

> Dropout works like an ensemble of models

- During each `forward` pass different internal routes are used to propagate information
- During `testing` phase all of the routes are considered but scaled appropriately

![](images/ensemble.png)

> __"Dropout provides an inexpensive way to approximate training and inference of exponentially many neural networks. This prevents co-adaptation of features and dependence on any specific features, which improves generalisation."__

Instead of having to train many individual networks, we can sample from many different "subnetworks" of the model which are generated when certain connections are removed.

### Sparsity (most important weights)

- Dropout pushes the distribution of activations towards zero
- The neural network focuses more on the important weights and important output neurons
- We get a model that is __easier to reason about__ (not to be confused with easy!)
- __Breaks co-adaptation__ (multiple neurons doing similar tasks, which makes the decision boundary less clear)
- Due to the above, generalization is likely to improve as the most important features are considered (the most important factor according to the original authors)

### Scaling rationale - Monte Carlo sampling

- Randomly save some of the sub-networks created during training forward passes (there are __a lot of them__; `k=50` were used), preferably after some training has passed
- Ask each one to predict on the test set
- Average their predictions
- __The result is similar to just multiplying the activations by their expected value (within one standard deviation)!__
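
The comparison in the list above can be sketched numerically for a single dropout layer followed by a linear layer. The tensor shapes, the random hidden activations, and the `output_layer` name are assumptions for this toy example; it samples fresh masks rather than saving models during real training.

```python
import torch

torch.manual_seed(0)

p = 0.5    # drop probability
q = 1 - p  # keep probability
k = 50     # number of sampled sub-networks

hidden = torch.relu(torch.randn(32, 64))  # stand-in hidden activations (batch, features)
output_layer = torch.nn.Linear(64, 10)    # hypothetical layer that follows the dropout

with torch.no_grad():
    # Monte Carlo: average the predictions of k randomly masked sub-networks
    masks = torch.bernoulli(torch.full((k, 32, 64), q))
    mc_average = output_layer(hidden * masks).mean(dim=0)

    # Scaling shortcut: a single pass with activations multiplied by q
    scaled = output_layer(q * hidden)

# Largest disagreement between the two approaches
print((mc_average - scaled).abs().max())
```

The shortcut needs one forward pass instead of `k`, which is why scaling is preferred at test time.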

### Noise addition

- As we randomly generate masks, we create noise during each forward pass
- Noise is known to improve generalization as it makes the model less likely to fit random/uninformative patterns
- __This is called internal representation noise__

In [None]:
import torch

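# Dropout applied directly to the input adds noise before the first linear layer.
# `torch.nn.Linear` needs both sizes; out_features=10 is an arbitrary placeholder.
torch.nn.Sequential(torch.nn.Dropout(p=0.2), torch.nn.Linear(100, 10))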

## Usage tips & tricks

> These are mostly anecdotal, always perform validation! They __might be__ worth checking out

- Use a layer size of `N/q`, where `q=1-p` is the keep probability. If you think a layer size of `128` would be good for this problem and you set `p=0.5`, go with `256` neurons instead
- __Use `p=0.5` for internal layers__
- __Use `p=0.2` if Dropout is applied on input__
- __Use with Fully Connected Networks__, that's where this technique is most likely to bring improvements
- __It should not be the first technique you go to__ as others are more popular and usually work better in practice
- __Increase the learning rate when using dropout__, and raise momentum to `0.95-0.99` instead of `0.9`
- `L1` regularization should improve sparsity and force the network to keep only the most valuable connections, __might be a good choice__.
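
A couple of the tips above can be combined in a short sketch. The input size `784`, the output size `10`, and the learning rate `0.1` are arbitrary placeholders; the layer is widened by dividing by the keep probability `1 - p`, which gives the same `256` as the example above when `p=0.5`.

```python
import torch

p = 0.5               # drop probability for the internal layer
baseline_width = 128  # width you would pick without dropout
width = int(baseline_width / (1 - p))  # widen to compensate for dropped units

model = torch.nn.Sequential(
    torch.nn.Linear(784, width),
    torch.nn.ReLU(),
    torch.nn.Dropout(p),
    torch.nn.Linear(width, 10),
)

# lr is a placeholder; momentum is taken from the 0.95-0.99 range suggested above
optimiser = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.95)
print(width)  # 256
```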

## When to use?

- Fully Connected Networks (without batch normalization)
- Between linear layers
- Possibly on input data (as long as it's not an image or text)

## When not to use?

> Possible solutions are outlined in the challenges for you to read!

- When using one of the most popular building blocks for neural networks: **Batch Normalization** (more about that during the batch normalization explanation)
    - At least not in the same "block", as dropout changes the mean and standard deviation of activations
    - Most neural network architectures use Batch Normalization, hence Dropout is not as popular anymore (it is sometimes used on the input, sometimes on the linear layers at the very end of the network)
- Convolutional neural networks (as weights are highly correlated and the effect is minuscule, if any)
    - Also, the prediction surface is smoother and we are "un-smoothing" it using standard Dropout
- Recurrent neural networks

## Summary

- Dropout is a well-known & battle-tested regularization technique
- Randomly switches off neurons after the activation layer during training
- Leaves all the connections in place during the test phase, but scaled
- Works like an ensemble
- __In practice Inverted Dropout is used__ (the test phase is left fully untouched)
- Should be used with FCNs, rarely with other types of layers (or you need a sound rationale for that)
- PyTorch provides `torch.distributions` module for random data generation
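
Inverted dropout, mentioned in the summary, moves the scaling from the test phase to the training phase: kept activations are divided by the keep probability, so evaluation is a plain identity. Below is a minimal sketch under that description; the class name `InvertedDropout` is made up for this example, and it uses `torch.bernoulli` instead of `torch.distributions` so the mask is created on the inputs' device.

```python
import torch


class InvertedDropout(torch.nn.Module):
    """Sketch of inverted dropout: rescale during training, do nothing at test time."""

    def __init__(self, p: float):
        super().__init__()
        if not 0 < p < 1:
            raise ValueError(f"p should lie in (0, 1), got: {p}")
        self.p = p  # probability of dropping a neuron

    def forward(self, inputs):
        if self.training:
            keep = 1 - self.p
            # Mask is 1 with probability `keep`, with the same shape and device as inputs
            mask = torch.bernoulli(torch.full_like(inputs, keep))
            return inputs * mask / keep
        return inputs


torch.manual_seed(0)
layer = InvertedDropout(0.5)
x = torch.ones(4, 5)

train_out = layer(x)  # entries are either 0 or 1 / (1 - p) = 2
layer.eval()
eval_out = layer(x)   # input returned unchanged
print(train_out)
print(eval_out)
```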

## Challenges

- What is `AlphaDropout`?
- What is `SpatialDropout`?
- What is `DropConnect`? 
- What is `ShakeShake` regularization? (You can do this one after convolutional neural networks.)