## [Workshop on Variational Bayes by Tamara Broderick (2020)](https://tamarabroderick.com/tutorial_2020_smiles.html)

Bayesian modeling can answer:
* What do we know?
* How well do we know it?

Modern problems often deal with large sets of high-dimensional data. Variational Bayes can scale and be very fast on these large problems!

Example applications
* [Latent Dirichlet Allocation (Blei et al., 2003)](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf?ref=https://githubhelp.com)

## What is Bayesian inference?

\begin{align*}
p\left(\theta \, \middle| \, y_{1:N} \right) & \propto_{\theta} \, p\left(y_{1:N}\, \middle| \, \theta \right) p\left(\theta\right) \\
\text{posterior} & \propto \text{likelihood} \times \text{prior} \\
\end{align*}

The (data) likelihood is the probability of observing the data we've seen, given a particular choice of the parameters.

The prior is our initial "belief" about what we expect the parameters to be, before we have seen any data.

The posterior combines the two (our initial beliefs and our observations) into a distribution that reflects our current "belief" about what we expect the parameters to be.

The posterior is computed from the prior and the likelihood by applying Bayes' rule, a "fact of probability."
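As a rough illustration of "posterior $\propto$ likelihood $\times$ prior," here is a minimal sketch that applies Bayes' rule on a grid. The coin-flip model, the flat prior, the data (7 heads, 3 tails) and the function name `grid_posterior` are all assumptions made up for this example; they are not from the workshop.

```python
def grid_posterior(heads, tails, grid_size=101):
    """Posterior over a coin's bias theta on a uniform grid, assuming a flat prior."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 for _ in thetas]                               # p(theta), flat
    likelihood = [t**heads * (1 - t)**tails for t in thetas]    # p(y | theta)
    unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
    total = sum(unnormalized)        # grid stand-in for the normalizing constant
    posterior = [u / total for u in unnormalized]
    return thetas, posterior


# Hypothetical data: 7 heads and 3 tails.
thetas, posterior = grid_posterior(heads=7, tails=3)
best_index = max(range(len(thetas)), key=lambda i: posterior[i])
posterior_mode = thetas[best_index]
print("posterior mode on the grid:", posterior_mode)
```

The normalization step is the only place where the proportionality constant enters. On a one-dimensional grid it is a cheap sum, but in high dimensions this sum (an integral) is what becomes intractable.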

### How do we use it in practice?

1. Build a model: choose a prior and a likelihood.
2. Compute the posterior.
3. Report a summary: posterior means, variances, etc.
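The three steps above can be sketched on a model where step 2 happens to have a closed form. This assumes a Beta prior with a Bernoulli likelihood (a conjugate pair, chosen only because it keeps the example short); the data and the helper names `beta_bernoulli_posterior` and `beta_summary` are hypothetical.

```python
def beta_bernoulli_posterior(alpha, beta, flips):
    """Step 2: conjugate update. A Beta(alpha, beta) prior with Bernoulli data
    gives a Beta(alpha + heads, beta + tails) posterior."""
    heads = sum(flips)
    tails = len(flips) - heads
    return alpha + heads, beta + tails


def beta_summary(alpha, beta):
    """Step 3: mean and variance of a Beta(alpha, beta) distribution."""
    mean = alpha / (alpha + beta)
    variance = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, variance


# Step 1: the model is the Beta(1, 1) prior plus the Bernoulli likelihood.
flips = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]   # made-up data: 7 heads, 3 tails
post_alpha, post_beta = beta_bernoulli_posterior(1.0, 1.0, flips)
post_mean, post_var = beta_summary(post_alpha, post_beta)
print("posterior: Beta(%g, %g)" % (post_alpha, post_beta))
print("mean = %.4f, variance = %.4f" % (post_mean, post_var))
```

Most models of practical interest are not conjugate, so step 2 has no such shortcut.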

Q: Does the model always have parameters? Is the posterior always over the model parameters, even in "Bayesian nonparametrics"? I think the answer is "yes," because Bayesian nonparametric models still have parameters, just infinitely many of them.


In general, we cannot compute a closed-form formula for the posterior distribution (nor would we necessarily want to). Instead, we perform approximate inference, which is a computational procedure rather than an analytical one.

Markov chain Monte Carlo (MCMC) is the dominant method for doing this. It is guaranteed to converge to the true posterior eventually, but it can be very slow!
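For reference, here is a minimal random-walk Metropolis sketch, one of the simplest MCMC algorithms. It is not taken from the workshop: the coin-flip target (flat prior, 7 heads, 3 tails), the step size, the burn-in length and the function names are all assumptions chosen for illustration.

```python
import math
import random


def log_unnormalized_posterior(theta, heads=7, tails=3):
    """log[p(y | theta) p(theta)] for a coin, assuming a flat prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return heads * math.log(theta) + tails * math.log(1.0 - theta)


def metropolis(log_target, n_samples, step=0.2, start=0.5, seed=0):
    """Random-walk Metropolis: only needs the target up to a constant."""
    rng = random.Random(seed)
    theta, log_p = start, log_target(start)
    samples = []
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step)
        log_p_new = log_target(proposal)
        # Accept with probability min(1, p_new / p_old).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            theta, log_p = proposal, log_p_new
        samples.append(theta)
    return samples


samples = metropolis(log_unnormalized_posterior, n_samples=20000)
kept = samples[1000:]                 # discard an (arbitrary) burn-in period
mcmc_mean = sum(kept) / len(kept)
print("MCMC estimate of the posterior mean:", round(mcmc_mean, 3))
```

Note that the sampler never needs the normalizing constant, only ratios of the unnormalized posterior. The cost is that many sequential, correlated samples are needed to get an accurate answer.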

### Variational Bayes

In variational Bayesian methods, we approximate the true posterior distribution, $p\left(\theta\,\middle|\, y\right)$, by a "variational" distribution, $q\left(\theta\right)$.

We determine $q$ as the optimal member of a family of "nice" distributions. By a "nice" distribution, we mean that we can easily compute quantities of interest from it; and by "optimal" we mean that it minimizes some loss function. 

Since $q$ is meant to approximate the posterior $p$, the loss function ends up being a sort of "distance" between the two distributions. In particular, variational Bayesian methods use the Kullback-Leibler divergence (KLD).

### What is the KLD?

\begin{align*}
    \text{KL}\left(  q\left(\cdot\right) \, \left\lVert \right. \, p\left(\cdot\middle|y\right) \right) & := \int q\left(\theta'\right) \log \frac{q\left(\theta'\right)}{p\left(\theta'\middle|y\right)} \, d\theta' \\
    & = \mathbb{E}_{q\left(\theta'\right)}\left[ \log \frac{q\left(\theta'\right)}{p\left(\theta'\middle|y\right)} \right] \\
    & = \mathbb{E}_{q\left(\theta'\right)}\left[ \log q\left(\theta'\right) \right] - \mathbb{E}_{q\left(\theta'\right)}\left[ \log p\left(\theta'\middle|y\right)\right] \\
\end{align*}

Note that in every equation above, $\theta'$ is a "dummy variable": in the integral and in the expectations (which are themselves really integrals), $\theta'$ is the variable of integration. Thus, the KLD is computed between two distributions over a space, and doesn't actually rely on any particular point in the space.

As is clear from the definition, this is not technically a "metric" because it is not symmetric. Thus, the order of the operands matters; in variational Bayes we use the convention that the variational distribution, $q$, comes first.
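The asymmetry is easy to see numerically. The sketch below uses the standard closed-form KL divergence between two univariate Gaussians; the particular means and standard deviations are arbitrary values picked for illustration.

```python
import math


def kl_gaussians(mean_q, std_q, mean_p, std_p):
    """KL(q || p) in nats for univariate Gaussians q = N(mean_q, std_q^2)
    and p = N(mean_p, std_p^2)."""
    return (math.log(std_p / std_q)
            + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
            - 0.5)


# Arbitrary example: q = N(0, 1) and p = N(1, 2^2).
kl_qp = kl_gaussians(0.0, 1.0, 1.0, 2.0)
kl_pq = kl_gaussians(1.0, 2.0, 0.0, 1.0)
print("KL(q || p) = %.4f" % kl_qp)
print("KL(p || q) = %.4f" % kl_pq)
```

Swapping the arguments gives a different number, and the divergence of a distribution from itself is zero.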

### The actual optimization problem

The optimization problem that we're trying to solve is thus $$ q^* = \operatorname*{arg\,min}_{q \in \mathcal{Q}} \text{KL}\left( q\left(\cdot\right) \, \left\lVert \right. \, p\left(\cdot\middle|y\right) \right) $$
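To make the optimization over a family concrete, here is a brute-force sketch on a toy problem. It assumes a bimodal target density that we can evaluate directly (which is exactly what we cannot do for a real posterior), takes the family to be univariate Gaussians, and searches a coarse grid of means and standard deviations. The target, the grid and the function names are all made up for this example.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def target_pdf(x):
    """Toy bimodal stand-in for a posterior (assumed known here)."""
    return 0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5)


def kl_q_to_target(mean, std, lo=-12.0, hi=12.0, n=1201):
    """Riemann-sum estimate of KL(q || target) with q = N(mean, std^2)."""
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        q = normal_pdf(x, mean, std)
        if q > 1e-300:   # q * log(q) -> 0 as q -> 0, so negligible terms are skipped
            total += q * math.log(q / target_pdf(x)) * dx
    return total


# The family Q: Gaussians with mean in [-3, 3] and std in [0.2, 2.0].
candidates = [(m / 4.0, s / 10.0) for m in range(-12, 13) for s in range(2, 21)]
best_mean, best_std = min(candidates, key=lambda ms: kl_q_to_target(ms[0], ms[1]))
print("best q: mean =", best_mean, "std =", best_std)
```

With this ordering of the KLD operands, the best Gaussian sits on one of the two modes rather than spreading over both. By symmetry the two modes are equally good, so which one is returned is just a tie-break.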

But wait: how can we actually even compute the KLD if we don't know the true posterior?

Answer: let's apply some rules of probability to manipulate it.

First, by Bayes' rule, we have
$$ \log p\left(\theta'\,\middle|\,y\right) = \log \frac{p\left(y\,\middle|\,\theta'\right) p\left(\theta'\right)}{p\left(y\right)} = \log p\left(y, \theta'\right) - \log p\left(y\right) $$

where $p\left(y, \theta'\right) = p\left(y\,\middle|\,\theta'\right) p\left(\theta'\right)$ is the joint distribution (likelihood times prior), and so the KLD becomes
\begin{align*}
    \text{KL}\left(q\left(\cdot\right) \, \left\lVert \right. \, p\left(\cdot\middle|y\right)\right) & = \mathbb{E}_{q\left(\theta'\right)}\left[ \log q\left(\theta'\right) \right] - \mathbb{E}_{q\left(\theta'\right)}\left[ \log p\left(\theta'\middle|y\right)\right] \\
    & = \mathbb{E}_{q\left(\theta'\right)}\left[ \log q\left(\theta'\right) \right] - \mathbb{E}_{q\left(\theta'\right)}\left[ \log p\left(y, \theta'\right)\right] + \mathbb{E}_{q\left(\theta'\right)}\left[ \log p\left(y\right)\right] \\
    & = \mathbb{E}_{q\left(\theta'\right)}\left[ \log q\left(\theta'\right) \right] - \mathbb{E}_{q\left(\theta'\right)}\left[ \log p\left(y, \theta'\right)\right] + \log p\left(y\right) \\
\end{align*}

where the last equality comes from the fact that $\log p\left(y\right)$ does not depend on $\theta'$ and so can be pulled outside of the expectation (an integral against $q$, which integrates to one).

Recall that the evidence $p\left(y\right)$ is the thing that's really difficult for us to compute! However, from the point of view of optimizing over the variational distribution $q$, $\log p\left(y\right)$ is a constant, and can be ignored.

We can rearrange this slightly and write
\begin{align*}
    \text{KL}\left(q\left(\cdot\right) \, \left\lVert \right. \, p\left(\cdot\middle|y\right)\right) & = \log p\left(y\right) - \left( \mathbb{E}_{q\left(\theta'\right)}\left[ \log p\left(y, \theta'\right)\right]-\mathbb{E}_{q\left(\theta'\right)}\left[ \log q\left(\theta'\right) \right]  \right) \\
    & = \log p\left(y\right) - \text{ELBO}\left(q\left(\cdot\right)\right) \\
\end{align*}

where we have defined the Evidence Lower BOund (ELBO) for the problem as $$ \text{ELBO}\left(q\right) = \mathbb{E}_{q\left(\theta'\right)}\left[ \log p\left(y, \theta'\right)\right]-\mathbb{E}_{q\left(\theta'\right)}\left[ \log q\left(\theta'\right) \right]   $$

Clearly minimizing the KLD is equivalent to maximizing the ELBO. Plus, the ELBO contains only distributions that are known to us, so it's actually feasible for us to compute. 

Note that the KLD's particular form is what permitted us to get away from the intractable evidence term, and so that's a big motivating factor for starting with the KLD loss.

Also note that the order of the operands to the KLD matters because the other order would have left us with expectations over the posterior distribution.

The name for the ELBO comes from the fact that it is a lower bound on the log evidence, $\log p\left(y\right)$:
$$ \text{KL} \geq 0 \implies \log p\left(y\right) \geq \text{ELBO}\left(q\right) $$
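The relationship between the log evidence, the ELBO and the KLD can be checked numerically on a model where everything is available in closed form. The sketch below assumes $y_i \sim \mathcal{N}(\theta, 1)$ with prior $\theta \sim \mathcal{N}(0, 1)$ and a Gaussian $q = \mathcal{N}(m, s^2)$, and takes the ELBO with respect to the joint $\log p(y, \theta) = \log p(y \mid \theta) + \log p(\theta)$. The model, the data and the function names are assumptions made for this example.

```python
import math


def log_evidence(ys):
    """Closed-form log p(y) for y_i ~ N(theta, 1), theta ~ N(0, 1)."""
    n, s1, s2 = len(ys), sum(ys), sum(y * y for y in ys)
    return (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(n + 1)
            - 0.5 * (s2 - s1 * s1 / (n + 1)))


def elbo(ys, m, s):
    """E_q[log p(y, theta)] - E_q[log q(theta)] for q = N(m, s^2)."""
    n = len(ys)
    e_log_lik = (-0.5 * n * math.log(2 * math.pi)
                 - 0.5 * sum((y - m) ** 2 + s * s for y in ys))
    e_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m * m + s * s)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s * s)   # equals -E_q[log q]
    return e_log_lik + e_log_prior + entropy


def kl_to_posterior(ys, m, s):
    """KL(q || posterior); the exact posterior is N(sum(y)/(n+1), 1/(n+1))."""
    n = len(ys)
    post_mean, post_var = sum(ys) / (n + 1), 1.0 / (n + 1)
    return (0.5 * math.log(post_var / (s * s))
            + (s * s + (m - post_mean) ** 2) / (2 * post_var) - 0.5)


ys = [0.3, -1.2, 2.0, 0.7]                     # made-up data
for m, s in [(0.0, 1.0), (0.5, 0.3), (-1.0, 2.0)]:
    gap = log_evidence(ys) - elbo(ys, m, s)
    print("m=%.1f s=%.1f  log p(y) - ELBO = %.6f  KL = %.6f"
          % (m, s, gap, kl_to_posterior(ys, m, s)))
```

For every choice of $(m, s)$ the gap between the log evidence and the ELBO equals the KLD, so the ELBO never exceeds the log evidence and touches it only when $q$ is the exact posterior.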

### Mean-field Variational Bayes

Mean-field variational Bayes (MFVB) is just variational Bayes using one of the simplest families of distributions, namely those that factorize over the components of $\theta$ and therefore treat them as independent: $$ q\left(\theta\right) = \prod_{j} q_j\left(\theta_j\right) $$

Typically, we take each of the $q_j$ to belong to the exponential family.
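One consequence of the independence assumption can be seen on a toy target. The sketch below assumes the "posterior" is a correlated bivariate Gaussian and uses the standard result that the KL-optimal factorized Gaussian has factor variances $1/\Lambda_{jj}$, where $\Lambda$ is the target's precision matrix. The correlation value and the helper `invert_2x2` are made up for this example.

```python
def invert_2x2(a):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a11, a12), (a21, a22) = a
    det = a11 * a22 - a12 * a21
    return [[a22 / det, -a12 / det], [-a21 / det, a11 / det]]


rho = 0.9                                     # assumed correlation of the target
covariance = [[1.0, rho], [rho, 1.0]]
precision = invert_2x2(covariance)

true_marginal_vars = [covariance[0][0], covariance[1][1]]
mean_field_vars = [1.0 / precision[0][0], 1.0 / precision[1][1]]

print("true marginal variances:", true_marginal_vars)
print("mean-field variances:   ", [round(v, 4) for v in mean_field_vars])
```

Each mean-field variance works out to $1 - \rho^2$, which is smaller than the true marginal variance of 1. In this toy setting the factorized approximation is therefore overconfident, and more so as the correlation grows.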

### Nuances about the variational family

Our choice of the variational family does __not__ impact our model itself; it is not a "modeling decision." It is a choice about what we will be __approximating the model with__.

(Different approximations will have different properties. For example, using a delta distribution family will lead to an approximation which is effectively the MAP point estimate of the posterior.)



### Optimization

Now that we have our inference problem specified as an optimization problem, how do we actually solve it?

There are several popular methods:
* coordinate ascent variational inference (CAVI)
* stochastic variational inference (Hoffman et al. 2013)
* automatic differentiation variational inference (ADVI)
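As a small illustration of the first method in the list, here is a sketch of coordinate ascent on a toy problem. It assumes the target is a bivariate Gaussian with known mean `mu` and precision matrix `precision`, and a factorized Gaussian $q$; for this special case each coordinate update has a standard closed form. The target values and the function name `cavi_bivariate_gaussian` are made up for this example, and a real model would need its own update equations.

```python
def cavi_bivariate_gaussian(mu, precision, n_sweeps=200):
    """Coordinate ascent for a factorized Gaussian q(theta_1) q(theta_2)
    approximating a bivariate Gaussian N(mu, precision^{-1})."""
    m = [0.0, 0.0]                             # arbitrary initial factor means
    for _ in range(n_sweeps):
        # Update each factor's mean while holding the other factor fixed.
        m[0] = mu[0] - precision[0][1] / precision[0][0] * (m[1] - mu[1])
        m[1] = mu[1] - precision[1][0] / precision[1][1] * (m[0] - mu[0])
    # For this target the optimal factor variances do not depend on the means.
    variances = [1.0 / precision[0][0], 1.0 / precision[1][1]]
    return m, variances


rho = 0.9                                      # assumed correlation of the target
scale = 1.0 / (1.0 - rho ** 2)
precision = [[scale, -rho * scale], [-rho * scale, scale]]
mu = [1.0, -1.0]

means, variances = cavi_bivariate_gaussian(mu, precision)
print("fitted means:", [round(v, 6) for v in means])
print("fitted variances:", [round(v, 4) for v in variances])
```

Each sweep only moves one factor at a time, so convergence slows down as the correlation between the coordinates increases. In this example the error in the means shrinks by a factor of $\rho^2$ per sweep.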