# Variational Bayesian Inference

In this notebook, we review variational Bayes, beginning with a technical introduction to the formalism and its derivation, followed by a Python implementation.
The material covered here draws on Blei et al., 2018, Variational Inference: A Review for Statisticians, and Chappell et al., 2016, The FMRIB Variational Bayes Tutorial.


## Technical Overview

### Problem Statement 

As with sampling methods, the goal of variational inference (VI) is to approximate the posterior distribution of unknown quantities given data, specifically in cases where an analytical treatment is intractable.

Consider the following example (from Blei et al., 2018): 

For latent variables $\mathbf{z} = z_{1:m}$ and observations $\mathbf{x} = x_{1:n}$, the posterior density is given by:

$$ p(\mathbf{z} \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z})}{p(\mathbf{x})} $$

The denominator, $p(\mathbf{x})$, known as the _evidence_, is needed to compute the posterior. It is obtained by marginalising the joint density over the latent variables:

$$ p(\mathbf{x}) = \int p(\mathbf{z}, \mathbf{x}) \, d\mathbf{z} $$


This integral is often unavailable in closed form or too computationally expensive to evaluate. For similar reasons (notably the number of latent variables), sampling methods can be slow to converge.
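To make the evidence integral concrete, here is a minimal sketch under assumptions that are not taken from the references: a toy model with a single latent variable $z \sim \mathcal{N}(0, 1)$ and observations $x_i \mid z \sim \mathcal{N}(z, 1)$. The data values and all function names are hypothetical. With one latent variable, a simple trapezoid rule recovers the evidence and can be checked against the closed form that this conjugate model happens to have.

```python
import math

def log_normal_pdf(x, mean, std):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def log_joint(z, xs, prior_std=1.0, noise_std=1.0):
    """log p(z, x) = log p(z) + sum_i log p(x_i | z) for the toy model."""
    log_prior = log_normal_pdf(z, 0.0, prior_std)
    return log_prior + sum(log_normal_pdf(x, z, noise_std) for x in xs)

def evidence_by_quadrature(xs, lo=-10.0, hi=10.0, n_grid=4001):
    """Approximate p(x) = integral of p(z, x) dz with the trapezoid rule."""
    h = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        weight = 0.5 if i in (0, n_grid - 1) else 1.0
        total += weight * math.exp(log_joint(lo + i * h, xs))
    return total * h

def evidence_closed_form(xs, prior_std=1.0, noise_std=1.0):
    """Exact p(x): marginally, x is jointly Gaussian in this conjugate model."""
    n, s, ss = len(xs), sum(xs), sum(x * x for x in xs)
    v0, v = prior_std**2, noise_std**2
    log_det = n * math.log(v) + math.log(1 + n * v0 / v)
    quad = (ss - v0 * s * s / (v + n * v0)) / v
    return math.exp(-0.5 * (n * math.log(2 * math.pi) + log_det + quad))

xs = [0.8, 1.3, 0.2, 1.9, 1.1]  # hypothetical observations
numeric = evidence_by_quadrature(xs)
exact = evidence_closed_form(xs)
print(f"quadrature: {numeric:.6e}, closed form: {exact:.6e}")
```

The same grid approach with $G$ points per dimension needs $G^m$ evaluations of the joint density for $m$ latent variables, which is why it stops being feasible beyond a handful of dimensions.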

VI aims to avoid this computational cost by recasting inference as an optimisation problem.

The process begins by positing a family of _approximate_ densities, $\mathfrak{D}$, over the latent variables $\mathbf{z}$. We then search for the member $q(\mathbf{z}) \in \mathfrak{D}$ that minimises the Kullback-Leibler (KL) divergence between the approximate density and the true posterior:

$$ q^{*}(\mathbf{z}) = \underset{q(\mathbf{z}) \in \mathfrak{D}}{\mathrm{argmin}} \; \mathrm{KL}(q(\mathbf{z}) \mid \mid p(\mathbf{z} \mid \mathbf{x})) $$
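As an illustration of this argmin, the sketch below uses the closed-form KL divergence between two univariate Gaussians and searches a grid of candidate $(\text{mean}, \text{std})$ pairs standing in for the family $\mathfrak{D}$. The target density, the grid, and the function name `gaussian_kl` are assumptions made for the example. In a real problem the posterior is unknown, so this KL could not be evaluated directly.

```python
import math

def gaussian_kl(m_q, s_q, m_p, s_p):
    """KL( N(m_q, s_q^2) || N(m_p, s_p^2) ) in closed form."""
    return math.log(s_p / s_q) + (s_q**2 + (m_q - m_p) ** 2) / (2 * s_p**2) - 0.5

# Hypothetical "true posterior", known here only for the sake of illustration.
target_mean, target_std = 1.0, 0.5

# The variational family: Gaussians on a grid of means and standard deviations.
candidates = [(i / 10, j / 10) for i in range(-20, 21) for j in range(1, 31)]

# q* is the family member with the smallest KL divergence to the target.
best = min(candidates, key=lambda q: gaussian_kl(q[0], q[1], target_mean, target_std))
print("q* (mean, std):", best)

# KL is not symmetric: swapping the arguments changes the value.
forward = gaussian_kl(0.0, 1.0, 1.0, 2.0)
reverse = gaussian_kl(1.0, 2.0, 0.0, 1.0)
print(f"KL(q || p) = {forward:.4f}, KL(p || q) = {reverse:.4f}")
```

Because the target lies inside the family here, the search recovers it exactly. When it does not, $q^{*}$ is only the closest member in the KL sense.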



### VI Computation 

Solving the optimisation problem stated above requires some manipulation of its terms, because the objective still depends on the _evidence_ term $p(\mathbf{x})$ (recall that the integral defining it is often difficult or intractable). To see this clearly, we can rewrite our expression using Bayes' rule:

$$ q^{*}(\mathbf{z}) = \underset{q(\mathbf{z}) \in \mathfrak{D}}{\mathrm{argmin}} \; \mathrm{KL}\left(q(\mathbf{z}) \mid \mid \frac{p(\mathbf{z}, \mathbf{x})}{p(\mathbf{x})}\right) $$

We can avoid the evidence term using the following manipulations. First, let's expand the KL divergence, where all expectations are taken with respect to $q(\mathbf{z})$:

$$ \mathrm{KL}(q(\mathbf{z}) \mid \mid p(\mathbf{z} \mid \mathbf{x})) = \mathbb{E}[\log q(\mathbf{z})] - \mathbb{E}[\log p(\mathbf{z}, \mathbf{x})] + \log p(\mathbf{x}) $$

Because the evidence term $\log p(\mathbf{x})$ does not depend on $q$, it enters the objective as an additive constant. Moving it to the other side defines the Evidence Lower Bound (ELBO):


\begin{align}
 \mathrm{ELBO}(q) &= \log p(\mathbf{x}) - \mathrm{KL}(q(\mathbf{z}) \mid \mid p(\mathbf{z} \mid \mathbf{x})) \\
 \\
         &= \mathbb{E}[\log p(\mathbf{z}, \mathbf{x})] - \mathbb{E}[\log q(\mathbf{z})]
\end{align}

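Note that maximising the ELBO is equivalent to minimising the KL divergence, since $\log p(\mathbf{x})$ does not depend on $q$. Because the KL divergence is non-negative, the ELBO never exceeds $\log p(\mathbf{x})$, hence its name.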
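The relationship between the ELBO, the evidence, and the KL divergence can be checked numerically. The sketch below assumes a toy conjugate model that is not from the references: $z \sim \mathcal{N}(0, 1)$, $x_i \mid z \sim \mathcal{N}(z, 1)$, and a Gaussian $q(z) = \mathcal{N}(m, s^2)$. For this model every expectation has a closed form. The data and function names are hypothetical.

```python
import math

def gaussian_kl(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) in closed form."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

def log_evidence(xs, v0=1.0, v=1.0):
    """Exact log p(x) for z ~ N(0, v0), x_i | z ~ N(z, v)."""
    n, s, ss = len(xs), sum(xs), sum(x * x for x in xs)
    log_det = n * math.log(v) + math.log(1 + n * v0 / v)
    quad = (ss - v0 * s * s / (v + n * v0)) / v
    return -0.5 * (n * math.log(2 * math.pi) + log_det + quad)

def elbo(m, s, xs, v0=1.0, v=1.0):
    """E_q[log p(z, x)] - E_q[log q(z)] for q = N(m, s^2)."""
    e_log_prior = -0.5 * math.log(2 * math.pi * v0) - (m**2 + s**2) / (2 * v0)
    e_log_lik = sum(
        -0.5 * math.log(2 * math.pi * v) - ((x - m) ** 2 + s**2) / (2 * v)
        for x in xs
    )
    entropy = 0.5 * math.log(2 * math.pi * math.e * s**2)  # equals -E_q[log q(z)]
    return e_log_prior + e_log_lik + entropy

xs = [0.8, 1.3, 0.2, 1.9, 1.1]  # hypothetical observations

# Exact posterior of the conjugate model (unit prior and noise variances).
post_var = 1.0 / (1.0 + len(xs))
post_mean, post_std = post_var * sum(xs), math.sqrt(post_var)

results = []
for m, s in [(0.0, 1.0), (0.5, 0.3), (post_mean, post_std)]:
    gap = log_evidence(xs) - elbo(m, s, xs)          # log p(x) - ELBO(q)
    kl = gaussian_kl(m, s, post_mean, post_std)      # KL(q || posterior)
    results.append((gap, kl))
    print(f"q = N({m:.3f}, {s:.3f}^2): gap = {gap:.6f}, KL = {kl:.6f}")
```

For every choice of $q$ the gap between $\log p(\mathbf{x})$ and the ELBO matches the KL divergence to the posterior, and the gap closes when $q$ equals the exact posterior.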

By expanding the joint probability, we can rewrite the ELBO in terms of the expected log likelihood and the KL divergence between our variational density $q(\mathbf{z})$ and our prior $p(\mathbf{z})$:

\begin{align}
\mathrm{ELBO}(q) &= \mathbb{E}[\log p(\mathbf{x} \mid \mathbf{z})] + \mathbb{E}[\log p(\mathbf{z})] - \mathbb{E}[\log q(\mathbf{z})] \\
\\
&= \mathbb{E}[\log p(\mathbf{x} \mid \mathbf{z})] - \mathrm{KL}(q(\mathbf{z}) \mid \mid p(\mathbf{z}))
\end{align}
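The decomposition above reads as a trade-off: the expected log likelihood pulls $q$ towards values of $\mathbf{z}$ that explain the data, while the KL term keeps $q$ close to the prior. As a minimal sketch, assume the toy model $z \sim \mathcal{N}(0, 1)$, $x_i \mid z \sim \mathcal{N}(z, 1)$ with $q(z) = \mathcal{N}(m, s^2)$ (not taken from the references). The ELBO is then maximised with plain gradient ascent, which is a stand-in optimiser chosen for brevity and not necessarily the method used later in this notebook. The step size, iteration count, data, and function names are all hypothetical.

```python
import math

def elbo_terms(m, s, xs, prior_std=1.0, noise_std=1.0):
    """Return (E_q[log p(x | z)], KL(q || prior)) for q = N(m, s^2)."""
    v, v0 = noise_std**2, prior_std**2
    e_log_lik = sum(
        -0.5 * math.log(2 * math.pi * v) - ((x - m) ** 2 + s**2) / (2 * v)
        for x in xs
    )
    kl_prior = math.log(prior_std / s) + (s**2 + m**2) / (2 * v0) - 0.5
    return e_log_lik, kl_prior

def elbo_gradient(m, s, xs, prior_std=1.0, noise_std=1.0):
    """Analytic gradient of ELBO = E_q[log p(x | z)] - KL(q || prior) in (m, s)."""
    v, v0 = noise_std**2, prior_std**2
    d_m = sum(x - m for x in xs) / v - m / v0
    d_s = -len(xs) * s / v + 1.0 / s - s / v0
    return d_m, d_s

xs = [0.8, 1.3, 0.2, 1.9, 1.1]  # hypothetical observations

m, s = 0.0, 1.0                  # start q at the prior
start_lik, start_kl = elbo_terms(m, s, xs)
step = 0.05                      # small enough to be stable for this toy problem
for _ in range(2000):
    d_m, d_s = elbo_gradient(m, s, xs)
    m += step * d_m
    s += step * d_s

e_log_lik, kl_prior = elbo_terms(m, s, xs)
fitted_elbo = e_log_lik - kl_prior

# Exact posterior of the conjugate model, for comparison.
post_var = 1.0 / (1.0 + len(xs))
post_mean, post_std = post_var * sum(xs), math.sqrt(post_var)
print(f"fitted q: mean = {m:.4f}, std = {s:.4f}, ELBO = {fitted_elbo:.4f}")
print(f"exact posterior: mean = {post_mean:.4f}, std = {post_std:.4f}")
```

In this conjugate case the Gaussian family contains the exact posterior, so the maximiser of the ELBO coincides with it. For non-conjugate models the fitted $q$ is only an approximation.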



