# Dropout

At each training iteration, we disable a randomly selected set of neurons, then run both the forward and backward pass
with those neurons disabled. Numerically, we sample a mask vector from a Bernoulli distribution,
then multiply the layer's activation matrix by this vector column-wise (i.e., the same neuron is disabled across all samples in the batch).

$$ p = \text{probability of keeping a node} $$
$$ \mathbf{d} = \text{vector of independent random variables } d_m \sim \text{Bernoulli}(p) $$

Now we can zero out randomly selected columns of the activation matrix, where column $m$ holds the activations of neuron $m$ for every sample:
$$ 
\tilde{\mathbf{a}}_{:, m}^{[1]} 
= \frac {\mathbf{a}_{:, m}^{[1]} * d_m} {p} 
$$

Why divide by $p$? To keep the expected activation unchanged. Because the mask $\mathbf{d}$ is sampled independently of the activations, for any sample $s$:

$$ 
\mathrm{E}[\tilde{\mathbf{a}}_{s, :}^{[1]}]
= \frac {\mathrm{E}[\mathbf{a}_{s, :}^{[1]}] * \mathrm{E}[\mathbf{d}]} {p} 
= \frac {\mathrm{E}[\mathbf{a}_{s, :}^{[1]}] * p}{p} 
= \mathrm{E}[\mathbf{a}_{s, :}^{[1]}]
$$
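The masking and rescaling above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names `dropout_forward` and `dropout_backward`, the array shapes, and the choice of `p = 0.8` are assumptions made for the example.

```python
import numpy as np

def dropout_forward(a, p, rng):
    """Inverted dropout on an activation matrix of shape (n_samples, n_neurons)."""
    # One Bernoulli(p) draw per neuron (column), shared by every sample (row).
    d = rng.binomial(1, p, size=a.shape[1])
    # Zero the dropped columns and rescale the kept ones by 1/p.
    a_tilde = a * d / p
    return a_tilde, d

def dropout_backward(grad_a_tilde, d, p):
    """Backward pass: reuse the same mask, so dropped neurons get zero gradient."""
    return grad_a_tilde * d / p

rng = np.random.default_rng(0)
p = 0.8
a = np.ones((4, 10000))                 # hypothetical activations, all equal to 1
a_tilde, d = dropout_forward(a, p, rng)
row_means = a_tilde.mean(axis=1)        # close to 1.0, the mean before dropout
grad_a = dropout_backward(np.ones_like(a), d, p)
```

With many neurons, each entry of `row_means` stays close to the original mean of 1, which is the effect the division by $p$ is meant to have.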

Why does this work as a regularization method?
Intuitively, no node can rely on any single input, because that input may be disabled at any iteration; each node has to spread its weights across all of its inputs.
Numerically, spreading the weights this way tends to shrink the L2 norm of the weight vector.
Theoretically, it's also similar to training an ensemble of neural networks that share weights, since each sampled dropout mask defines a different sub-network.
Overall, dropout makes the model more robust to noise.

In [5]:
from tensorflow.keras.layers import Dropout
?Dropout

Init signature: Dropout(rate, noise_shape=None, seed=None, **kwargs)
Docstring:
Applies Dropout to the input.

Dropout consists in randomly setting
a fraction `rate` of input units to 0 at each update during training time,
which helps prevent overfitting.

Arguments:
    rate: float between 0 and 1. Fraction of the input units to drop.
    noise_shape: 1D integer tensor representing the shape of the
        binary dropout mask that will be multiplied with the input.
        For instance, if your inputs have shape
        `(batch_size, timesteps, features)` and
        you want the dropout mask to be the same for all timesteps,
        you can use `noise_shape=(batch_size, 1, features)`.
    seed: A Python integer to use as random seed.
File:           /opt/conda/lib/python3.7
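
Note that `rate` in the docstring is the fraction of units to drop, whereas $p$ in the formulas above is the probability of keeping a unit, so `rate` corresponds to $1 - p$. The `noise_shape` argument can be pictured as ordinary broadcasting of the mask. The sketch below is a plain NumPy illustration of that idea, not Keras' actual implementation; the shapes and the `keep_prob` variable are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 0.5                      # fraction of units dropped, as in the `rate` argument
keep_prob = 1.0 - rate          # the p used in the formulas above

batch_size, timesteps, features = 2, 3, 4
x = np.ones((batch_size, timesteps, features))

# A mask of shape (batch_size, 1, features) is broadcast along the timesteps
# axis, so every timestep of a given sample sees the same dropped features.
mask = rng.binomial(1, keep_prob, size=(batch_size, 1, features))
y = x * mask / keep_prob
```

Every slice `y[:, t, :]` is identical across `t`, which is the "same mask for all timesteps" behavior the docstring describes.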