# Variational inference using an invertible neural network

Given a data set of input-output pairs, $X = \{(x_i, t_i)\}_{i=1}^n$, a common goal of supervised machine learning is to train a model, $f$, to predict outcomes for new inputs. For regression tasks, we assume that the model predicts the expected value of the outcome and that observations are corrupted by zero-mean Gaussian noise, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$,

$$
t_i = f(x_i, z) + \varepsilon
$$
where $z$ are the parameters of the model. In this case, we'll consider $z$ to be latent (unobserved) variables whose distribution must be inferred given the observed data, $X$. 
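
As a concrete illustration of this noise model, the following minimal sketch simulates observations from it. The straight-line form of `f`, the helper `simulate_data`, and all numerical values are assumptions chosen for illustration; they are not part of the derivation.

```python
import numpy as np

def f(x, z):
    """Hypothetical model: a straight line with intercept z[0] and slope z[1]."""
    return z[0] + z[1] * x

def simulate_data(z_true, sigma, n, seed=0):
    """Draw n noisy observations t_i = f(x_i, z_true) + eps_i, eps_i ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)    # inputs (arbitrary choice of range)
    eps = rng.normal(0.0, sigma, size=n)  # zero-mean Gaussian noise
    t = f(x, z_true) + eps
    return x, t

z_true = np.array([0.5, -2.0])  # assumed "true" parameters, for illustration only
x, t = simulate_data(z_true, sigma=0.1, n=200)
```

In practice $z$ is unknown; the remainder of the derivation is about inferring its distribution from pairs like `(x, t)`.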

In other words, we want to determine the posterior distribution, $p(z | X)$. Variational inference provides a means of optimizing the parameters, $\lambda$, of an approximate distribution, $q(z | \lambda)$, so that it matches the true posterior as closely as possible. The objective function that quantifies the difference between the approximate distribution and the true posterior is the Kullback-Leibler divergence,

$$
\lambda^* = \underset{\lambda}{\text{argmin}} \; \text{KL} \; \left[q(z | \lambda) || p(z | X) \right]  
$$

$$
\text{KL} \; \left[q(z | \lambda) || p(z | X) \right] = -\int_z \mathrm{ln} \left( \frac{p(z | X)}{q(z | \lambda)} \right) q(z | \lambda) dz
$$

$$
= -\int_z \mathrm{ln}  p(z | X) \; q(z | \lambda) \; dz 
+ \int_z \mathrm{ln} q(z | \lambda) \; q(z | \lambda) \; dz
$$

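Using Bayes' theorem, $p(z | X) = p(X | z) \, p(z) / p(X)$, so that $\mathrm{ln} \, p(z | X) = \mathrm{ln} \, p(X | z) + \mathrm{ln} \, p(z) - \mathrm{ln} \, p(X)$, the first integral expands to give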

$$
= -\int_z \mathrm{ln}  p(X | z) \; q(z | \lambda) \; dz 
-\int_z \mathrm{ln}  p(z) \; q(z | \lambda) \; dz 
+ \underbrace{\int_z \mathrm{ln}  p(X) \; q(z | \lambda) \; dz}_{= \, \mathrm{ln} \, p(X), \; \text{const. w.r.t. } \lambda}
+ \int_z \mathrm{ln} q(z | \lambda) \; q(z | \lambda) \; dz
$$



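The $\mathrm{ln} \, p(X)$ term does not depend on $\lambda$ (since $q(z | \lambda)$ integrates to one, the integral reduces to $\mathrm{ln} \, p(X)$), so it can be dropped without changing the minimizer. This gives the objective function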

$$
J(\lambda) = -\int_z \mathrm{ln}  p(X | z) \; q(z | \lambda) \; dz 
-\int_z \mathrm{ln}  p(z) \; q(z | \lambda) \; dz 
+ \int_z \mathrm{ln} q(z | \lambda) \; q(z | \lambda) \; dz
$$

Samples from the approximate distribution, $q(z | \lambda)$, are generated using a neural network with parameters $\lambda$, which receives as input a random variable, $y$, drawn from a base distribution that is easy to sample. For example, $y \sim p(y) = \mathcal{U}[0, 1]$ (with the same dimension as $z$), so that $z = nn(y, \lambda)$. The parameters of the neural network are determined by minimizing the objective function, $J(\lambda)$, so that the neural network learns to transform samples from the base distribution into approximate samples from the posterior.
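
The sampling step can be sketched as follows. The map `nn` here is a deliberately simple stand-in (an element-wise logit followed by a shift and a positive scale), not an actual neural network; the dictionary layout of `lam` and the names `shift` and `log_scale` are assumptions made for illustration.

```python
import numpy as np

def nn(y, lam):
    """Hypothetical invertible map from (0, 1)^d to R^d.

    Element-wise logit of y, then a shift and a positive scale.
    A real invertible network would stack more expressive layers.
    """
    return lam["shift"] + np.exp(lam["log_scale"]) * np.log(y / (1.0 - y))

def nn_inverse(z, lam):
    """Inverse of nn: undo the shift and scale, then apply the logistic sigmoid."""
    u = (z - lam["shift"]) / np.exp(lam["log_scale"])
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(0)
lam = {
    "shift": np.array([0.5, -2.0]),              # illustrative values
    "log_scale": np.log(np.array([0.1, 0.3])),   # illustrative values
}

y = rng.uniform(size=(1000, 2))  # y ~ U[0, 1]^2, easy to sample
z = nn(y, lam)                   # samples from q(z | lambda)
y_back = nn_inverse(z, lam)      # invertibility: recovers y up to round-off
```

Changing `lam` moves and stretches the cloud of samples `z`, which is what the optimization of $J(\lambda)$ exploits.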

In order to evaluate the density of a sample under the approximate distribution, $q(z|\lambda)$, the neural network must be invertible so that the change of variables formula can be applied with $y = nn^{-1}(z, \lambda)$,

$$
q(z | \lambda) = p(y) \vert \mathrm{det} \; \nabla_{z} nn^{-1}(z, \lambda) \vert
$$
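
To make the change of variables formula concrete, here is a minimal sketch for the simplest invertible map, a linear one, $z = A y + b$ with $y \sim \mathcal{U}[0, 1]^2$. The matrix `A`, the vector `b`, and the function names are assumptions for illustration. In this case the Jacobian of the inverse is $A^{-1}$, so the density is $1 / \vert \mathrm{det} \, A \vert$ wherever $nn^{-1}(z)$ falls inside the unit square and zero elsewhere.

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 1.5]])   # illustrative invertible matrix
b = np.array([0.5, -2.0])    # illustrative shift

def nn(y):
    """Forward map z = A y + b (works for a single point or a batch of rows)."""
    return y @ A.T + b

def nn_inverse(z):
    """Inverse map y = A^{-1} (z - b), computed with a linear solve."""
    return np.linalg.solve(A, (z - b).T).T

def log_q(z):
    """ln q(z) = ln p(y) + ln|det grad_z nn^{-1}(z)| with p(y) = U[0, 1]^2."""
    y = nn_inverse(z)
    inside = np.all((y >= 0.0) & (y <= 1.0), axis=-1)  # support of p(y)
    _, logabsdet_A = np.linalg.slogdet(A)
    # ln p(y) = 0 on the support; Jacobian of the inverse is A^{-1},
    # so ln|det A^{-1}| = -ln|det A|.
    return np.where(inside, -logabsdet_A, -np.inf)

z_in = nn(np.array([0.5, 0.5]))   # image of a point inside the unit square
z_out = nn(np.array([2.0, 2.0]))  # image of a point outside the unit square
```

For a neural network the Jacobian depends on $z$, which is why architectures with cheap log-determinants are typically preferred.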

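The objective can be estimated with a Monte Carlo approach by drawing $m$ samples, $z_1, ..., z_m$, where $z_i = nn(y_i, \lambda)$ and $y_i \sim p(y)$, and averaging over them. The estimate below uses the Gaussian noise model for the likelihood, a zero-mean Gaussian prior on $z$ with precision $\alpha$ (which produces the $\frac{\alpha}{2} z_i^T z_i$ term), and the uniform base distribution, for which $\mathrm{ln} \, p(y_i) = 0$. Additive constants that do not depend on $\lambda$ are dropped,

$$
J(\lambda) \approx \frac{1}{m} \sum_{i=1}^{m} \left( \sum_{j=1}^{n} \frac{1}{2 \sigma^2} (t_j - f(x_j, z_i))^2 + \frac{\alpha}{2} z_i^T z_i + \mathrm{ln} \vert \mathrm{det} \; \nabla_{z_i} nn^{-1}(z_i, \lambda) \vert  \right)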
$$
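
A minimal sketch of this Monte Carlo estimate is given below. It assumes a straight-line model `f`, a stand-in invertible map `nn` (element-wise logit, shift, and scale) in place of a real network, and illustrative values for $\sigma$, $\alpha$, and $\lambda$; none of these choices come from the derivation itself.

```python
import numpy as np

def f(x, z):
    """Hypothetical model: intercept z[..., 0], slope z[..., 1]; batches over z."""
    return z[..., 0:1] + z[..., 1:2] * x

def nn(y, lam):
    """Stand-in invertible map (an assumption, not a real network)."""
    return lam["shift"] + np.exp(lam["log_scale"]) * np.log(y / (1.0 - y))

def log_det_inverse(y, lam):
    """ln|det grad_z nn^{-1}(z)| for the map above, in terms of y = nn^{-1}(z).

    Each coordinate has dy/dz = y (1 - y) / scale, and the Jacobian is diagonal.
    """
    return np.sum(np.log(y) + np.log1p(-y) - lam["log_scale"], axis=-1)

def objective(lam, x, t, sigma, alpha, m, rng):
    """Monte Carlo estimate of J(lambda) with m samples."""
    y = rng.uniform(size=(m, lam["shift"].size))  # y_i ~ U[0, 1]^d
    y = np.clip(y, 1e-12, 1.0 - 1e-12)            # guard against log(0)
    z = nn(y, lam)                                # z_i = nn(y_i, lambda)
    misfit = np.sum((t - f(x, z)) ** 2, axis=1) / (2.0 * sigma**2)
    prior = 0.5 * alpha * np.sum(z**2, axis=1)
    log_q = log_det_inverse(y, lam)               # ln q(z_i | lambda), ln p(y_i) = 0
    return np.mean(misfit + prior + log_q)

# Illustrative data and two candidate settings of lambda.
rng = np.random.default_rng(0)
z_true = np.array([0.5, -2.0])
x = np.linspace(-1.0, 1.0, 50)
t = f(x, z_true) + rng.normal(0.0, 0.1, size=x.size)

log_scale = np.log(np.array([0.02, 0.02]))
lam_near = {"shift": z_true, "log_scale": log_scale}
lam_far = {"shift": z_true + 3.0, "log_scale": log_scale}

J_near = objective(lam_near, x, t, sigma=0.1, alpha=1.0, m=2000, rng=rng)
J_far = objective(lam_far, x, t, sigma=0.1, alpha=1.0, m=2000, rng=rng)
```

A setting of $\lambda$ that concentrates samples near parameters that fit the data should give a lower estimate than one centered far away. Minimizing the estimate in practice would require gradients with respect to $\lambda$, for example from an automatic differentiation library, which this sketch does not include.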