# Dropout

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Mitchell-Mirano/sorix/blob/qa/docs/learn/layers/06-Dropout.ipynb)
[![Open in GitHub](https://img.shields.io/badge/Open%20in-GitHub-black?logo=github)](https://github.com/Mitchell-Mirano/sorix/blob/qa/docs/learn/layers/06-Dropout.ipynb)
[![Open in Docs](https://img.shields.io/badge/Open%20in-Docs-blue?logo=readthedocs)](https://mitchell-mirano.github.io/sorix/latest/learn/layers/06-Dropout)


The **Dropout** layer implements a regularization technique widely used in deep neural networks to reduce overfitting. During training, it zeroes each element of the input tensor independently with probability $p$, using a mask sampled from a Bernoulli distribution. Because no unit can rely on any other unit being present, the network is pushed toward more robust features and neurons are discouraged from co-adapting.

## Mathematical definition

During training, for each element $x$ of the input tensor, the output $y$ is computed as:

$$
y = \begin{cases}
0 & \text{with probability } p, \\
\frac{x}{1-p} & \text{with probability } 1-p.
\end{cases}
$$

The factor $\frac{1}{1-p}$ rescales the surviving elements so that the expected value of the training output equals the input $x$, which is exactly what the layer returns at inference:

$$
\mathbb{E}[y] = p \cdot 0 + (1-p) \cdot \frac{x}{1-p} = x.
$$
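The expectation above can be checked empirically. The following is a minimal NumPy sketch of the formula, not Sorix's implementation; the helper name `inverted_dropout` and the seeded generator are assumptions made for illustration.

```python
import numpy as np

def inverted_dropout(x, p, rng):
    """Zero each element with probability p and scale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    # keep_mask is 1 where the element survives (probability 1 - p), else 0
    keep_mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * keep_mask / (1.0 - p), keep_mask

rng = np.random.default_rng(0)
x = np.full((1000, 100), 3.0)  # constant input so the mean is easy to read
y, keep_mask = inverted_dropout(x, p=0.5, rng=rng)

print("distinct output values:", np.unique(y))       # only 0 and x / (1 - p)
print("empirical mean of y:", y.mean())              # close to the input value
print("fraction dropped:", 1.0 - keep_mask.mean())   # close to p
```

With many elements, the empirical mean of `y` approaches the input value and the dropped fraction approaches $p$, which is the behavior the expectation predicts.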

## Training vs Inference

Like other layers whose behavior depends on the mode (for example, batch normalization), Dropout acts differently in training and in evaluation:

*   **Training Mode**: A new random mask is sampled on every forward pass and applied to the input, and the surviving elements are scaled by $\frac{1}{1-p}$.
*   **Evaluation Mode** (`model.train(False)`): Dropout acts as the identity function ($y = x$). No rescaling is needed, because the scaling was already applied during training.
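The two modes can be sketched as a small class with a `training` flag. This is a hypothetical NumPy stand-in written for illustration (the class name `DropoutSketch` and its attributes are assumptions), not the Sorix source.

```python
import numpy as np

class DropoutSketch:
    """Hypothetical stand-in for a dropout layer; not Sorix's implementation."""

    def __init__(self, p=0.5, seed=0):
        self.p = p
        self.training = True  # assumed default: start in training mode
        self.rng = np.random.default_rng(seed)

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)

    def __call__(self, x):
        if not self.training or self.p == 0.0:
            return x  # identity: no mask, no scaling
        keep_mask = self.rng.random(x.shape) >= self.p
        return x * keep_mask / (1.0 - self.p)

layer = DropoutSketch(p=0.5)
x = np.ones((1, 10))

y_train = layer(x)        # each entry is either 0.0 or 2.0
y_eval = layer.eval()(x)  # identical to x
```

Keeping the scaling inside the training branch is what allows the evaluation branch to return the input untouched.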

## Backward computation (gradient)

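The gradient is propagated through the same mask that was sampled in the forward pass, where $\text{mask}$ is 1 for kept elements and 0 for dropped ones. If an element was zeroed out, its gradient is also zero; the gradients of kept elements are scaled by $\frac{1}{1-p}$: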

$$
\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \cdot \text{mask} \cdot \frac{1}{1-p}.
$$
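As a sanity check of this formula, the sketch below computes the gradient by hand in NumPy and compares one entry against a finite-difference estimate. The toy loss $\mathcal{L} = \sum y \cdot w$ and all variable names are assumptions chosen for illustration; this does not use Sorix's autograd.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
x = rng.normal(size=(4, 3))

# Forward pass: keep the mask so the backward pass can reuse it
mask = (rng.random(x.shape) >= p).astype(x.dtype)
y = x * mask / (1.0 - p)

# Toy loss L = sum(y * w), so dL/dy = w
w = rng.normal(size=x.shape)
grad_y = w

# Backward pass from the formula: dL/dx = dL/dy * mask / (1 - p)
grad_x = grad_y * mask / (1.0 - p)

# Finite-difference estimate for one element, holding the mask fixed
eps = 1e-6
x_shift = x.copy()
x_shift[0, 0] += eps
numeric = (np.sum(x_shift * mask / (1.0 - p) * w) - np.sum(y * w)) / eps

print("analytic:", grad_x[0, 0], "numeric:", numeric)
```

The mask must be held fixed between the forward and backward passes; resampling it would make the gradient inconsistent with the output that produced the loss.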

## Functional view

The Dropout layer maps from $\mathbb{R}^{N \times d}$ to $\mathbb{R}^{N \times d}$. It is a zero-parameter layer (though it has the hyperparameter $p$), meaning it does not have trainable weights like Linear or BatchNorm1d.

## Implementation specifics

Sorix's Dropout implementation generates the random mask on the same device as the input tensor, so the mask is created and applied on the GPU when the input lives there.

In [1]:
# Uncomment the next line and run this cell to install sorix
#!pip install 'sorix @ git+https://github.com/Mitchell-Mirano/sorix.git@qa/docs_learn/docs_learn/docs_learn/docs_learn'

In [2]:
import numpy as np
from sorix import tensor
from sorix.nn import Dropout
import sorix

In [3]:
# Create a small input of ones so dropped (0) and rescaled (1 / (1 - p)) elements are easy to spot
X = tensor(np.ones((1, 10)))
print(f"Input: {X.data}")

dropout = Dropout(p=0.5)
dropout.train() # Default mode
Y_train = dropout(X)
print(f"Output (Training): {Y_train.data}")

dropout.eval() # Change to inference mode
Y_eval = dropout(X)
print(f"Output (Evaluation): {Y_eval.data}")

Input: [[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
Output (Training): [[2. 2. 0. 0. 2. 0. 0. 2. 0. 2.]]
Output (Evaluation): [[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]]
