# Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning

## Background

In deep learning, representing model uncertainty is of crucial importance. However, standard deep learning tools for regression and classification do not capture model uncertainty.

The use of dropout (and its variants) in NNs can be interpreted as a Bayesian approximation of a well-known probabilistic model: the Gaussian process (GP).

Dropout is used in many deep learning models as a way to avoid over-fitting.

We develop tools for representing model uncertainty of existing dropout NNs – extracting information that has been thrown away so far.

We give a complete theoretical treatment of the link between Gaussian processes and dropout, and develop the tools necessary to represent uncertainty in deep learning.

We show that a neural network with arbitrary depth and non-linearities, with dropout applied before every weight
layer, is mathematically equivalent to an approximation to the probabilistic deep Gaussian process.

We show that the dropout objective, in effect, minimises the Kullback–Leibler divergence between an approximate distribution and the posterior of a deep Gaussian process (marginalised over its finite rank covariance function parameters).

## Dropout as a Bayesian Approximation

Dropout 

Minimisation objective of an NN with $L_2$ regularisation:

$$L_{dropout}=\frac{1}{N}\sum_{i=1}^{N}E(y_i,\hat{y}_i)+\lambda\sum_{i=1}^{L} (||W_i||_2^2+||b_i||_2^2)$$
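The objective above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's code: the function name `dropout_objective` is hypothetical, and half the squared Euclidean error is assumed for the loss $E$, which the notes leave unspecified.

```python
import numpy as np

def dropout_objective(y, y_hat, weights, biases, lam):
    """Mean per-example loss plus L2 weight decay over all layers."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    # E(y_i, y_hat_i): assumed here to be 0.5 * squared Euclidean error.
    data_term = np.mean(0.5 * np.sum((y - y_hat) ** 2, axis=-1))
    # lambda * sum over layers of (||W_i||_2^2 + ||b_i||_2^2).
    penalty = sum(np.sum(W ** 2) + np.sum(b ** 2) for W, b in zip(weights, biases))
    return data_term + lam * penalty
```

In the dropout setting, `y_hat` would come from a forward pass with freshly sampled dropout masks.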

Deep Gaussian process

Covariance function:

$$K(x,y)=\int p(w)p(b)\sigma(w^Tx+b)\sigma(w^Ty+b)\,dw\,db$$
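The integral defining $K(x,y)$ can be approximated by Monte Carlo, which is one way to see where a finite rank covariance function comes from. The sketch below is illustrative only: it assumes standard normal distributions for the weight vector and the bias, and `tanh` for $\sigma$; none of these choices are fixed by the notes, and `mc_covariance` is a hypothetical name.

```python
import numpy as np

def mc_covariance(x, y, num_samples=10_000, sigma=np.tanh, seed=0):
    """Monte Carlo estimate of K(x, y) under assumed N(0, I) priors on weights and bias."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # One weight vector and one bias per Monte Carlo draw.
    w = rng.standard_normal((num_samples, x.shape[0]))
    b = rng.standard_normal(num_samples)
    # Average of sigma(w^T x + b) * sigma(w^T y + b) over the draws.
    return float(np.mean(sigma(w @ x + b) * sigma(w @ y + b)))
```

Each draw plays the role of one hidden unit, so the number of draws corresponds to the width of the hidden layer in this picture.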

$$p(y \vert x, X, Y)=\int p(y \vert x,\omega)\,p(\omega \vert X, Y)\,d\omega$$

$$p(y \vert x,\omega)=\mathcal{N}(y;\hat{y}(x,\omega),\tau^{-1}I_D)$$

$$\hat{y}(x,\omega=\{W_1,...,W_L\})=\sqrt{\frac{1}{K_L}}W_L\sigma(...\sqrt{\frac{1}{K_1}}W_2\sigma(W_1x+m_1)...)$$

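We use $q(\omega)$ to approximate the intractable posterior $p(\omega \vert X, Y)$.

$q(\omega)$ is defined, with variational parameters $M_i$ and probabilities $p_i$, as:

$$W_i=M_i\cdot \mathrm{diag}([z_{i,j}]_{j=1}^{K_i})$$

$$z_{i,j}\sim \mathrm{Bernoulli}(p_i) \text{ for } i=1,...,L,\; j=1,...,K_{i-1}$$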
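Sampling from $q(\omega)$ amounts to zeroing random columns of each $M_i$. The following is a minimal sketch, assuming that columns of $M_i$ index the input units of layer $i$ and that $p_i$ is the probability of keeping a unit; the helper names `sample_weight` and `sample_omega` are hypothetical.

```python
import numpy as np

def sample_weight(M, p, rng):
    """Draw W = M @ diag(z) with z_j ~ Bernoulli(p), one z_j per column of M."""
    z = rng.binomial(1, p, size=M.shape[1])
    # Broadcasting z across rows scales column j by z_j, same as M @ np.diag(z).
    return M * z

def sample_omega(Ms, ps, rng):
    """Draw one realisation of all layer weights {W_1, ..., W_L}."""
    return [sample_weight(M, p, rng) for M, p in zip(Ms, ps)]
```

A column that is zeroed out removes the corresponding input unit from that layer, which is exactly what a dropout mask does.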

Minimisation Objective

$$-\int q(\omega)\log p(Y\vert X,\omega)\,d\omega + KL(q(\omega)||p(\omega))$$

The first term can be rewritten as:

$$-\sum_{n=1}^{N}\int q(\omega)\log p(y_n\vert x_n,\omega)\,d\omega$$

The second term can be approximated by:

$$\sum_{i=1}^L (\frac{p_il^2}{2}||M_i||_2^2+\frac{l^2}{2}||m_i||_2^2)$$

Approximating each integral in the first term with a single Monte Carlo sample $\hat{\omega}_n \sim q(\omega)$ and, given model precision $\tau$, scaling the result by the constant $\frac{1}{\tau N}$, we obtain the objective:

$$\mathcal{L}_{GP-MC}\propto \frac{1}{N}\sum_{n=1}^{N}\frac{-\log p(y_n \vert x_n,\hat{\omega}_n)}{\tau}+\sum_{i=1}^{L}(\frac{p_il^2}{2\tau N}||M_i||_2^2+\frac{l^2}{2\tau N}||m_i||_2^2)$$

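Setting

$$E(y_n,\hat{y}(x_n,\hat{\omega}_n))=-\log p(y_n \vert x_n, \hat{\omega}_n)/\tau$$

recovers the dropout objective $L_{dropout}$ for an appropriate choice of the precision $\tau$ and length-scale $l$.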
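As a worked step, assuming the Gaussian likelihood $p(y \vert x,\omega)=\mathcal{N}(y;\hat{y}(x,\omega),\tau^{-1}I_D)$ with output dimension $D$, the negative log-likelihood expands to

$$-\log p(y_n \vert x_n,\hat{\omega}_n)=\frac{\tau}{2}||y_n-\hat{y}(x_n,\hat{\omega}_n)||_2^2+\frac{D}{2}\log 2\pi-\frac{D}{2}\log\tau$$

so that dividing by $\tau$ gives

$$E(y_n,\hat{y}(x_n,\hat{\omega}_n))=\frac{1}{2}||y_n-\hat{y}(x_n,\hat{\omega}_n)||_2^2+\frac{D}{2\tau}\log\frac{2\pi}{\tau}$$

The last term does not depend on $\hat{\omega}_n$, so under this likelihood $E$ is half the squared error up to an additive constant.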

## Obtaining Model Uncertainty


Our approximate predictive distribution is given by
$$q(y^{*}\vert x^{*})=\int p(y^{*}\vert x^{*},\omega)q(\omega)d\omega$$

Moment-matching

We sample $T$ sets of vectors of realisations from the Bernoulli distribution, $\{z_1^t,...,z_L^t\}_{t=1}^T$, giving weights $\{W_1^t,...,W_L^t\}_{t=1}^T$, and estimate the first moment as:

$$E_{q(y^{*}\vert x^{*})}(y^{*})\approx \frac{1}{T}\sum_{t=1}^T \hat{y}^{*}(x^{*},W_1^t,...,W_L^t)$$

$$E_{q(y^{*}\vert x^{*})}((y^{*})^T y^{*})\approx \tau^{-1}I_D+\frac{1}{T}\sum_{t=1}^T \hat{y}^{*}(x^{*}, W_1^t,...,W_L^t)^T\hat{y}^{*}(x^{*},W_1^t,...,W_L^t)$$

The model predictive variance:

$$Var_{q(y^{*}\vert x^{*})}(y^{*})\approx \tau^{-1}I_D+\frac{1}{T}\sum_{t=1}^{T}\hat{y}^{*}(x^{*},W_1^t,...,W_L^t)^T\hat{y}^{*}(x^{*},W_1^t,...,W_L^t)-E_{q(y^{*}\vert x^{*})}(y^{*})^TE_{q(y^{*}\vert x^{*})}(y^{*})$$
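The moment estimates above translate fairly directly into code. The sketch below is an illustration under stated assumptions, not the paper's implementation: it uses a toy one-hidden-layer `tanh` network, omits the $\sqrt{1/K_i}$ scaling, treats `p` as the keep probability, and reads $\hat{y}^T\hat{y}$ as an outer product of row vectors so that the variance is a $D \times D$ matrix. The names `make_dropout_forward` and `mc_dropout_moments` are hypothetical.

```python
import numpy as np

def make_dropout_forward(M1, m1, M2, p):
    """Return a stochastic forward pass for a toy one-hidden-layer network."""
    def forward(x, rng):
        z1 = rng.binomial(1, p, size=x.shape[0])  # mask on the input units
        h = np.tanh(M1 @ (x * z1) + m1)           # same as (M1 diag(z1)) x + m1
        z2 = rng.binomial(1, p, size=h.shape[0])  # mask on the hidden units
        return M2 @ (h * z2)
    return forward

def mc_dropout_moments(forward, x_star, T, tau, rng):
    """Estimate the predictive mean (D,) and variance (D, D) from T stochastic passes."""
    samples = np.stack([forward(x_star, rng) for _ in range(T)])  # shape (T, D)
    mean = samples.mean(axis=0)
    D = samples.shape[1]
    # Second raw moment: tau^{-1} I_D plus the average outer product of the samples.
    second_moment = np.eye(D) / tau + samples.T @ samples / T
    variance = second_moment - np.outer(mean, mean)
    return mean, variance
```

With `p = 1.0` no unit is ever dropped, every pass returns the same output, and the estimated variance reduces to $\tau^{-1}I_D$, which is a quick sanity check on the formula.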

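The model precision $\tau$ follows from the dropout probability $p$, the prior length-scale $l$, the number of training points $N$ and the weight decay $\lambda$:

$$\tau=\frac{pl^2}{2N\lambda}$$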
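As a small arithmetic illustration of the identity for $\tau$, the helper below (a hypothetical name, with made-up example values) computes the precision from the other four quantities.

```python
def model_precision(p, length_scale, n_train, weight_decay):
    """tau = p * l^2 / (2 * N * lambda)."""
    return p * length_scale ** 2 / (2.0 * n_train * weight_decay)

# Illustrative values only: p = 0.5, l = 1, N = 100, lambda = 0.0025.
tau_example = model_precision(0.5, 1.0, 100, 0.0025)
```

For these values $\tau$ comes out at approximately 1. Increasing the weight decay or the dataset size while holding the other quantities fixed lowers $\tau$, which corresponds to a larger observation noise $\tau^{-1}$.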

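The predictive log-likelihood can be estimated by Monte Carlo integration:

$$\log p(y^{*}\vert x^{*},X,Y)\approx \operatorname{logsumexp}\left(-\frac{1}{2}\tau||y^{*}-\hat{y}_t||^2\right)-\log T-\frac{1}{2}\log 2\pi-\frac{1}{2}\log \tau^{-1}$$

where $\hat{y}_t$ is the output of the $t$-th stochastic forward pass and the log-sum-exp runs over $t=1,...,T$.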
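The estimator can be sketched as follows, using a numerically stable log-sum-exp written by hand in NumPy. The function name `predictive_log_likelihood` is hypothetical, and the constant terms are taken exactly as written in the formula, which corresponds to a scalar output; a $D$-dimensional output would need those constants scaled accordingly.

```python
import numpy as np

def predictive_log_likelihood(y_star, y_hat_samples, tau):
    """Monte Carlo estimate of log p(y* | x*, X, Y) from T stochastic forward passes."""
    y_star = np.asarray(y_star, dtype=float)
    y_hat_samples = np.asarray(y_hat_samples, dtype=float)  # shape (T, D)
    T = y_hat_samples.shape[0]
    # Exponent for each pass: -0.5 * tau * ||y* - y_hat_t||^2.
    exponents = -0.5 * tau * np.sum((y_star - y_hat_samples) ** 2, axis=1)
    # Stable log-sum-exp: subtract the maximum before exponentiating.
    m = np.max(exponents)
    log_sum_exp = m + np.log(np.sum(np.exp(exponents - m)))
    return log_sum_exp - np.log(T) - 0.5 * np.log(2 * np.pi) - 0.5 * np.log(1.0 / tau)
```

With a single pass that predicts the target exactly and $\tau = 1$, the estimate equals the log-density of a standard normal at zero, $-\frac{1}{2}\log 2\pi$.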