# Variational Inference
Sources
* [David Blei talk on Variational Inference](https://www.youtube.com/watch?v=Dv86zdWjJKQ)
* Blei, et al. (2017) Variational Inference: A Review for Statisticians
* [Python code for variational inference](https://zhiyzuo.github.io/VI/#python-implementation)

Variational inference is a general method for approximating a posterior distribution: that's it. The posterior inference problem is: given a probability model (including observed and hidden variables), condition on the observed data to learn about the hidden variables. Once the posterior has uncovered patterns in the data, we can use it to predict and explore. This allows customized data analysis: we can draw up a graphical model for the problem at hand and implement it, keeping our assumptions separate from computation and application.

A probabilistic model is a joint distribution $$ p\left(z,x\right) $$ where $x$ are observed variables and $z$ are latent (hidden) variables. We make inferences about the hidden variables through the posterior, which is the conditional distribution $$ p\left(z\middle| x\right) = p\left(z,x\right) / p\left(x\right). $$

Main issue: the evidence $p\left(x\right) = \int p\left(z,x\right)\, dz$ is generally intractable, since it is an integral over the whole latent space.

**Variational inference** (VI) is one approach to approximating the posterior, and it does so via optimization. We posit a family $$ \mathcal{Q} = \left\{ q\left(z; \theta \right) \right\} $$ of distributions $q$ over the latent space. Each distribution in the family is parametrized by some variational parameters $\theta$. In VI we optimize over the variational parameters $\theta$ to find the "best" approximation to the posterior.

This seems difficult because we only have observed data. How can we possibly make this work?

**Stochastic optimization** techniques are particularly useful because they scale VI up to big data and generalize to a large class of models.

See: graphical models. Node: random variable; arrow: dependence between random variables; shaded node: observed random variable; blank node: latent variable; plate (rectangle): repetition.


### Conditionally conjugate models

The setup has observations $x=x_{1:n}$, local variables $z=z_{1:n}$, and global variables $\beta$. The $i$th data point $x_i$ depends only on $z_i$ and $\beta$. We have the joint distribution $$ p\left(\beta,z,x\right) = p\left(\beta\right) \prod_{i=1}^n p\left(z_i,x_i \, \middle| \,\beta \right).$$ Note that, given $\beta$, the pairs $\left(z_i, x_i\right)$ are conditionally independent of one another. The goal is to calculate the posterior $$ p\left(\beta,z \, \middle|\, x\right). $$ A **complete conditional** is the conditional of a latent variable given the observations and the other latent variables. Assume that each complete conditional is in the exponential family, i.e., $p\left(z_i\, \middle| \, \beta, x_i\right)$ and $p\left(\beta \, \middle| \, z,x\right)$ are in the exponential family. Given these assumptions, we can make claims about the parameters of these complete conditional distributions.

These are important because many common models fall into this category:
* Bayesian mixture models
* time-series models
* matrix factorization
* Dirichlet process mixtures
* multi-level regression
* stochastic block models
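
To make the global/local structure concrete, here is a minimal sketch that samples from a Bayesian mixture of Gaussians, the first item in the list above. The specific choices (three components, a $\mathcal{N}(0, 5^2)$ prior on the component means, uniform mixing weights, unit observation variance) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 500, 3                          # number of data points and mixture components

# Global variables beta: one mean per component, drawn from the prior p(beta).
beta = rng.normal(0.0, 5.0, size=K)

# Local variables z_i: component assignment of each data point (uniform weights assumed).
z = rng.integers(0, K, size=n)

# Observations: x_i depends only on its own z_i and on the shared beta.
x = rng.normal(beta[z], 1.0)
```

Each pair $\left(z_i, x_i\right)$ is drawn independently once `beta` is fixed, which mirrors the product over $i$ in the joint distribution.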


### Variational inference

We want to find the member of our variational family that minimizes the KL divergence to the true posterior. That member is our approximation. Unfortunately, the KL divergence contains the log evidence $\log p\left(x\right)$ and so is intractable (we can't compute the integral in the evidence). Instead, **we optimize using the evidence lower bound (ELBO)**, which equals $\log p\left(x\right)$ minus the KL divergence. Since $\log p\left(x\right)$ does not depend on $\theta$, maximizing the ELBO is equivalent to minimizing the KL divergence. By expanding the ELBO, we see that it balances two terms: $$ \mathcal{L}\left(\theta\right) = \mathbb{E}_q\left[\log\, p\left(\beta,z,x\right)\right]-\mathbb{E}_q\left[\log\, q\left(\beta,z ; \theta\right)\right]$$ The first term encourages $q$ to place its mass on the MAP estimate (i.e., on values of the latent variables under which the joint density with the data is high), and the second, which is the entropy of $q$, encourages $q$ to be diffuse (spread around). In general the ELBO is non-convex, so we can only expect to find a local optimum.
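
The relationship between the ELBO, the KL divergence, and the evidence can be checked numerically. The sketch below assumes a toy latent variable with three states and made-up values for the joint $p\left(z,x\right)$ at the observed $x$. It verifies the identity $$ \log p\left(x\right) = \mathcal{L} + \mathrm{KL}\left(q \,\|\, p\left(z \,\middle|\, x\right)\right), $$ which is why the ELBO is a lower bound.

```python
import numpy as np

# Hypothetical joint p(z, x) evaluated at the observed x, for z in {0, 1, 2}.
p_zx = np.array([0.10, 0.05, 0.25])
log_evidence = np.log(p_zx.sum())      # log p(x): a sum here, an integral in general
posterior = p_zx / p_zx.sum()          # exact posterior p(z | x)

q = np.array([0.2, 0.3, 0.5])          # an arbitrary variational distribution over z

# ELBO = E_q[log p(z, x)] - E_q[log q(z)]
elbo = np.sum(q * (np.log(p_zx) - np.log(q)))
# KL(q || p(z | x)) = E_q[log q(z) - log p(z | x)]
kl = np.sum(q * (np.log(q) - np.log(posterior)))

gap = log_evidence - (elbo + kl)       # zero up to floating-point error
# Setting q to the exact posterior makes the bound tight.
elbo_at_posterior = np.sum(posterior * (np.log(p_zx) - np.log(posterior)))
```

In this toy case the evidence is a three-term sum, so everything is computable. In real models only the ELBO is available, which is the point of optimizing it instead of the KL divergence.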


### One form for $q\left(\beta,z\right)$: the mean-field family

The mean-field family is a fully-factorized distribution: $$ q\left(\beta, z ; \lambda, \phi\right) = q\left(\beta; \lambda\right) \prod_{i=1}^n q\left(z_i;\phi_i\right) $$ (Note that the quantities after the semicolons are the parameters of the distributions.) Each factor is in the same exponential family as the corresponding complete conditional of the model: $$ p\left(\beta\, \middle| \, z,x\right) = h\left(\beta\right) \exp \left\{ \eta_g\left(z,x\right)^T \beta - a\left( \eta_g\left(z,x\right)\right)\right\}$$ $$q\left(\beta;\lambda\right) = h\left(\beta\right)\exp\left\{\lambda^T\beta - a\left(\lambda\right)\right\}$$ Under $q$, this is a bunch of disconnected variables: every variable is independent of every other. Through the ELBO, we are connecting this distribution to the posterior we care about. We will never capture posterior correlations, since these cannot be represented in $q$.

We can now optimize the ELBO using coordinate ascent.
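
As a minimal sketch of coordinate ascent, assume the "posterior" is a known correlated bivariate Gaussian with mean `mu` and precision matrix `Lam` (a stand-in chosen because the updates have a closed form, not one of the models above). With a mean-field $q\left(z_1\right) q\left(z_2\right)$, each optimal factor is Gaussian with variance $1/\Lambda_{jj}$, and its mean is updated while holding the other factor fixed.

```python
import numpy as np

# Assumed target: a bivariate Gaussian with correlation 0.8.
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
Lam = np.linalg.inv(Sigma)             # precision matrix

m = np.zeros(2)                        # variational means, arbitrary initialization
for _ in range(100):
    # Update q(z_1) with q(z_2) held fixed, then the reverse.
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

q_var = 1.0 / np.diag(Lam)             # variances of the optimal mean-field factors
```

The means converge to the true means, but `q_var` is 0.36 per coordinate against a true marginal variance of 1.0: the factorized $q$ cannot represent the correlation and underestimates the spread.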

How are the expectation values computed?
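
One hedged answer: when $q$ is in a standard exponential family, the expectations in the ELBO are often available in closed form, and otherwise they can be approximated by averaging over samples from $q$. The sketch below compares the two for $\mathbb{E}_q\left[z^2\right]$ under a Gaussian $q$ whose parameters are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s = 1.0, 0.5                        # assumed parameters of q(z) = N(m, s^2)

closed_form = m**2 + s**2              # E_q[z^2] for a Gaussian, known analytically

z = rng.normal(m, s, size=200_000)     # samples from q
monte_carlo = np.mean(z**2)            # the sample average approximates the expectation
```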

### Stochastic optimization

Main idea: replace expensive gradient computation with a noisy, cheap (local) version.
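
The following sketch illustrates the idea on a toy objective rather than on the ELBO itself: minimizing the average of $\frac{1}{2}\left(x_i - \theta\right)^2$ over a data set, whose minimizer is the data mean. The data, batch size, and step-size schedule are assumptions made for illustration. The noisy gradient looks at only a small random minibatch per step, and the step sizes decrease so that the noise averages out.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=100_000)   # synthetic "large" data set

def full_gradient(theta):
    # Exact gradient of the objective: needs a pass over all the data.
    return theta - x.mean()

def noisy_gradient(theta, batch_size=10):
    # Unbiased estimate of the same gradient from a random minibatch: cheap but noisy.
    batch = rng.choice(x, size=batch_size)
    return theta - batch.mean()

theta = 0.0
for t in range(1, 2001):
    step = 1.0 / t                     # decreasing steps: sum diverges, sum of squares converges
    theta -= step * noisy_gradient(theta)
```

After 2000 steps `theta` has touched only 20,000 sampled points, yet `full_gradient(theta)` is close to zero.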

### Black-box variational inference 

This is the ultimate goal, whereby we can take any data and any model and simply throw them into a variational inference black box.
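
One common way to approach this is the score-function gradient estimator, which only needs samples from $q$ and evaluations of $\log p\left(x,z\right)$ and $\log q$: $$ \nabla_\theta \mathcal{L} = \mathbb{E}_q\left[\nabla_\theta \log q\left(z;\theta\right)\left(\log p\left(x,z\right) - \log q\left(z;\theta\right)\right)\right]. $$ The sketch below applies it to an assumed toy model, $z \sim \mathcal{N}(0,1)$ and $x_i \mid z \sim \mathcal{N}(z,1)$, chosen because its exact posterior is known and can be used as a check. The Gaussian form of $q$, the learning rate, the sample count, and the mean baseline used for variance reduction are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=5)         # synthetic observations
n = x.size

def log_joint(z):
    # log p(x, z) up to an additive constant, for an array of z samples.
    log_prior = -0.5 * z**2
    log_lik = -0.5 * ((x[None, :] - z[:, None]) ** 2).sum(axis=1)
    return log_prior + log_lik

def log_q(z, m, rho):
    # log N(z; m, exp(rho)^2) up to an additive constant.
    return -rho - 0.5 * ((z - m) / np.exp(rho)) ** 2

m, rho = 0.0, 0.0                      # variational parameters: mean and log standard deviation
lr, n_samples = 0.01, 200

for _ in range(3000):
    s = np.exp(rho)
    z = m + s * rng.normal(size=n_samples)         # draw z ~ q(z; m, rho)
    f = log_joint(z) - log_q(z, m, rho)            # integrand of the ELBO
    f = f - f.mean()                               # baseline: lowers variance of the estimate
    score_m = (z - m) / s**2                       # d/dm   log q(z; m, rho)
    score_rho = ((z - m) / s) ** 2 - 1.0           # d/drho log q(z; m, rho)
    m += lr * np.mean(score_m * f)                 # noisy gradient ascent on the ELBO
    rho += lr * np.mean(score_rho * f)

# Exact posterior for this conjugate toy model, used only for comparison.
post_mean, post_sd = x.sum() / (n + 1), (n + 1) ** -0.5
```

Nothing in the loop uses the conjugate structure of the model. Swapping in a different `log_joint` leaves the rest unchanged, which is the sense in which the method is black-box.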