# Variational Inference

See:
* [Great first resource](http://www.cmap.polytechnique.fr/~zoltan.szabo/jc/2017_05_18_Massil_Achab.pdf)
* [VI: A review for statisticians](https://arxiv.org/pdf/1601.00670.pdf)

In [15]:
import numpy as np
import scipy.stats

# Kullback–Leibler (KL) Divergence

The KL divergence measures how different one probability distribution is from another. For discrete distributions (probability mass functions) it is:

$$
D_{KL}(P || Q) = \sum_{x \in X} P(x) \log \big(\frac{P(x)}{Q(x)}\big)
$$

The divergence is measured in bits of information if the log used is base 2, and nats if it is base $e$. It can be understood as the amount of information lost when approximating $P$ as $Q$, or as the amount of information gained when updating from $Q$ to $P$.

In [24]:
def discrete_KL_divergence(p, q):
    """D_KL(p || q) in nats. Assumes p and q are normalized arrays with no zero entries."""
    assert len(p) == len(q)
    assert np.isclose(np.sum(p), 1) and np.isclose(np.sum(q), 1)
    return np.sum(p * np.log(p / q))

p = np.array([2, 2, 2])
q = np.array([1, 4, 2])
p, q = p / np.sum(p), q / np.sum(q)

print(discrete_KL_divergence(p, q), discrete_KL_divergence(q, p))

print(discrete_KL_divergence(p, p))

# As always, if you actually want to use this, use scipy's: scipy.stats.entropy(a, b) computes D_KL(a || b).
assert np.isclose(scipy.stats.entropy(q, p), discrete_KL_divergence(q, p))

0.15415067982725839 0.14291239755557528
0.0


Some interesting properties:
* $D_{KL}(P || Q) = 0$ iff $Q(x) = P(x)$ for all $x$ (as $\log(1) = 0$).
* $D_{KL}(P || Q) \geq 0$, see [Gibbs' inequality](https://en.wikipedia.org/wiki/Gibbs%27_inequality).
* It is not symmetric! Be careful that this isn't the divergence *between* $P$ and $Q$ but rather the divergence *from* $P$ to $Q$.
* It is undefined if $Q(x) = 0$ where $P(x) \neq 0$.
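These properties can be checked numerically. The following is a minimal sketch, assuming `scipy.special.rel_entr` for the elementwise terms $P(x) \log(P(x) / Q(x))$; the helper `kl` and the arrays are made up for illustration. Note that `rel_entr` returns `inf` rather than raising an error where $Q(x) = 0$ and $P(x) \neq 0$, and treats terms with $P(x) = 0$ as zero.

```python
import numpy as np
from scipy.special import rel_entr

def kl(p, q):
    """D_KL(P || Q) in nats for discrete distributions given as arrays."""
    return np.sum(rel_entr(p, q))

# Illustrative distributions; p has no mass on the third outcome.
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.25, 0.25, 0.5])

print(kl(p, p))  # identical distributions
print(kl(p, q))  # finite: Q > 0 wherever P > 0
print(kl(q, p))  # inf: p is zero where q is not
```

The last two lines also show the asymmetry: swapping the arguments changes the answer.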

# Variational Inference Theory

The short summary of VI is:

* We have some unknown and complicated probability distribution $P$.
* Propose a family of easier to use distributions $D$.
* Find some $Q \in D$ that is a good approximation to $P$ by minimizing $D_{KL}(Q || P)$.
* Use $Q$ to do whatever we were going to do with $P$.
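The recipe above can be sketched on a toy problem. This is only an illustration under made-up assumptions: the target `p` is a hypothetical two-bump distribution on the integers 0 to 20, the family $D$ is single discretized normals, and the minimization of $D_{KL}(Q || P)$ (the direction used in the ELBO derivation later in this document) is done by brute-force grid search rather than any real VI algorithm.

```python
import numpy as np
from scipy.special import rel_entr
from scipy.stats import norm

x = np.arange(0, 21)

def discretized_normal(mu, sigma):
    """Normal pdf evaluated on the grid x and renormalized to sum to 1."""
    w = norm.pdf(x, loc=mu, scale=sigma)
    return w / w.sum()

# Hypothetical "complicated" target P: a mixture of two bumps.
p = 0.7 * discretized_normal(6, 1.5) + 0.3 * discretized_normal(12, 3.0)

# Family D: single discretized normals, indexed by (mu, sigma).
mus = np.linspace(0, 20, 81)
sigmas = np.linspace(0.5, 6, 45)

# Brute-force search for the member of D with the smallest D_KL(Q || P).
best_kl, best_mu, best_sigma = min(
    ((np.sum(rel_entr(discretized_normal(m, s), p)), m, s)
     for m in mus for s in sigmas),
    key=lambda t: t[0],
)
q = discretized_normal(best_mu, best_sigma)

print(best_kl, best_mu, best_sigma)
# Use Q in place of P, e.g. to estimate the mean.
print(np.sum(x * p), np.sum(x * q))
```

Because a single normal cannot represent both bumps, the best $Q$ still has a nonzero divergence from $P$; the quality of the approximation is limited by the choice of family.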

## A Reminder About Bayesian Inference

In Bayesian inference we have some observed variables $x$ and some hidden or [latent](https://en.wikipedia.org/wiki/Latent_variable) variables $z$ that encode the structure behind $x$. E.g. in cosmology $x$ are your observations (galaxies, CMB, etc) and $z$ are the cosmological parameters ($\Omega_M$, $\sigma_8$, etc). We want to constrain $z$ using the data $x$.

[Bayes' theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) tells us that:

$$
P(z | x) = \frac{P(z) P(x | z)}{P(x)}
$$

or in English, the posterior probability of the parameters $z$ given the data $x$ is equal to the prior on $z$ multiplied by the likelihood of that data given those parameters, normalized by the evidence $P(x)$. Remember, we want to learn the posterior $P(z | x)$.

We can remove the conditionals, $P(a | b)$, and instead express the posterior in terms of the joint distribution, $P(a, b)$, using $P(x, z) = P(z) P(x | z)$ and $P(x) = \int P(x, z)\ dz$:

$$
P(z | x) = \frac{P(x, z)}{\int P(x, z)\ dz}
$$

Computing this in practice is hard because we would need to do that integral over the potentially high-dimensional $z$ space. We need to do something different.
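To make the difficulty concrete, here is a sketch of the brute-force approach in a case where it does work: a single latent variable. The model is an assumption chosen for illustration (a coin with bias $z$, a Beta(2, 2) prior, and made-up data of 14 heads in 20 flips), picked because the exact posterior is known and can be compared against.

```python
import numpy as np
from scipy.stats import beta, binom

n, k = 20, 14                     # made-up data x: 14 heads in 20 flips
z = np.linspace(0.0, 1.0, 2001)   # grid over the single latent variable
dz = z[1] - z[0]

# Joint P(x, z) = P(z) P(x | z) evaluated on the grid.
joint = beta.pdf(z, 2, 2) * binom.pmf(k, n, z)

# Evidence P(x) as a Riemann sum approximating the integral over z.
evidence = np.sum(joint) * dz
posterior = joint / evidence      # P(z | x)

# Closed-form posterior for this conjugate model, for comparison.
exact = beta.pdf(z, 2 + k, 2 + n - k)
print(evidence, np.max(np.abs(posterior - exact)))

# The same grid resolution in d dimensions needs 2001**d joint evaluations.
print([2001 ** d for d in (1, 2, 5, 10)])
```

In one dimension the grid is cheap and accurate, but the number of evaluations grows exponentially with the dimension of $z$, which is why this approach does not scale.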

## Evidence Lower Bound (ELBO)

Assume we have some proposed $Q(z)$. We want to see how similar this is to the posterior $P(z | x)$. **N.B.** these are both defined on $z$ not $x$!

$$
\begin{align}
&\ \ \ D_{KL}(Q(z) || P(z | x)) \\
&= \int Q(z) \log \big(\frac{Q(z)}{P(z | x)}\big) dz \\
&= \int Q(z) ( \log(Q(z)) - \log(P(z | x)) ) dz \\
&= \int Q(z) \log(Q(z)) dz - \int Q(z) \log(P(z | x)) dz
\end{align}
$$

These transformations are all good but we are still stuck with the posterior which we can't compute... 

Let's make the notation a bit more concise using expected values. Remember:

$$
\mathbf{E}^P[X] = \int x P(x) dx \\
\mathbf{E}^P[g(X)] = \int g(x) P(x) dx
$$

Read this as: the expectation value of $g(X)$ under the probability distribution $P$. Similarly,

$$
\int Q(z) \log(P(z | x)) dz = \mathbf{E}^Q[\log(P(z | x))]
$$
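An expectation like this can also be estimated by averaging over samples drawn from $Q$, which is useful when the integral has no closed form. The sketch below is illustrative only: the choice $Q = \mathcal{N}(1, 2^2)$ and the test function $g(z) = z^2$ are assumptions, picked because the exact answer $\mu^2 + \sigma^2 = 5$ is known.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
q = norm(loc=1.0, scale=2.0)      # hypothetical Q(z)

def g(z):
    return z ** 2

# Monte Carlo: E^Q[g(z)] is approximated by the average of g over draws from Q.
samples = q.rvs(size=200_000, random_state=rng)
mc_estimate = np.mean(g(samples))

# Direct numerical integration of g(z) Q(z) dz on a wide grid.
z = np.linspace(-30, 30, 60001)
grid_estimate = np.sum(g(z) * q.pdf(z)) * (z[1] - z[0])

exact = 1.0 ** 2 + 2.0 ** 2       # mu^2 + sigma^2 for a normal distribution
print(mc_estimate, grid_estimate, exact)
```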

With this notation,

$$
\begin{align}
&\ \ \ D_{KL}(Q(z) || P(z | x)) \\
&= \mathbf{E}^Q[\log(Q(z))] - \mathbf{E}^Q[\log(P(z | x))] \\
&= \mathbf{E}^Q[\log(Q(z))] - \mathbf{E}^Q[\log(\frac{P(x, z)}{P(x)})] \\
&= \mathbf{E}^Q[\log(Q(z))] - \mathbf{E}^Q[\log(P(x, z)) - \log(P(x))] \\
&= \mathbf{E}^Q[\log(Q(z))] - \mathbf{E}^Q[\log(P(x, z))] + \mathbf{E}^Q[\log(P(x))]
\end{align}
$$

but $\log(P(x))$ does not depend on $z$ and $Q$ is normalized, so this final term is just $\int Q(z) \log(P(x)) dz = \log(P(x)) \int Q(z) dz = \log(P(x))$ and so,

$$
D_{KL}(Q(z) || P(z | x)) = \mathbf{E}^Q[\log(Q(z))] - \mathbf{E}^Q[\log(P(x, z))] + \log(P(x))
$$

We've now got rid of the posterior and instead have a joint and the evidence. The joint can be expressed as a prior and a likelihood $P(x, z) = P(z) P(x | z)$ which is easy to compute. The evidence is still a problem.

We want to minimize the KL divergence with respect to $Q$. However, as we have just seen, it contains the term $\log(P(x))$, the log evidence. We can't compute this term (it is the reason we are going through all this in the first place), but that doesn't matter: it does not depend on $Q$, so it is a constant and we can just minimize $\mathbf{E}^Q[\log(Q(z))] - \mathbf{E}^Q[\log(P(x, z))]$. In practice we tend to maximize the negative of this, which we call the ELBO.

$$
{\rm ELBO}(Q) = \mathbf{E}^Q[\log(P(x, z))] - \mathbf{E}^Q[\log(Q(z))]
$$

Maximizing this ELBO function is equivalent to minimizing the KL divergence.

The name, Evidence Lower Bound, comes from the fact that the ELBO is always less than or equal to the log evidence, $\log(P(x))$. It is a lower bound on the log evidence.

$$
\begin{align}
D_{KL}(Q(z) || P(z | x)) &= -{\rm ELBO}(Q) + \log(P(x)) \\
\log(P(x)) &= D_{KL}(Q(z) || P(z | x)) + {\rm ELBO}(Q) \\
\therefore \log(P(x)) &\geq {\rm ELBO}(Q)
\end{align}
$$

The inequality holds because the KL divergence is always non-negative.
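The decomposition $\log(P(x)) = D_{KL}(Q(z) || P(z | x)) + {\rm ELBO}(Q)$ can be checked numerically on a model small enough to integrate directly. The sketch below is an illustration under assumed choices: a coin-flip model with a Beta(2, 2) prior and made-up data (14 heads in 20 flips), with $Q$ taken from the family of Beta distributions. The helper `elbo_and_kl` is hypothetical and uses simple midpoint-rule integration.

```python
import numpy as np
from scipy.stats import beta, binom

n, k = 20, 14                     # made-up data x: 14 heads in 20 flips
N = 20000
z = (np.arange(N) + 0.5) / N      # midpoints, avoids log(0) at z = 0 and z = 1
dz = 1.0 / N

# log P(x, z) = log P(z) + log P(x | z) with a Beta(2, 2) prior.
log_joint = beta.logpdf(z, 2, 2) + binom.logpmf(k, n, z)
log_evidence = np.log(np.sum(np.exp(log_joint)) * dz)
log_posterior = log_joint - log_evidence

def elbo_and_kl(a, b):
    """ELBO(Q) and D_KL(Q || P(z | x)) for Q = Beta(a, b), by numerical integration."""
    log_q = beta.logpdf(z, a, b)
    q = np.exp(log_q)
    elbo = np.sum(q * (log_joint - log_q)) * dz
    kl = np.sum(q * (log_q - log_posterior)) * dz
    return elbo, kl

for a, b in [(2, 2), (10, 5), (16, 8)]:
    elbo, kl = elbo_and_kl(a, b)
    print((a, b), elbo, kl, elbo + kl, log_evidence)
```

For every $Q$ the sum of the two terms matches the log evidence, and the ELBO never exceeds it. The last choice, Beta(16, 8), is the exact posterior for this conjugate model, so its KL divergence is approximately zero and its ELBO approximately equals the log evidence.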

## Family of Distributions

We said that we were drawing $Q(z)$ from some family of distributions $D$. We want this $Q$ to be easy to work with. A popular choice is the **mean-field variational family**.

I don't understand this yet...

# Example

This follows section 3 in [VI: A review for statisticians](https://arxiv.org/pdf/1601.00670.pdf).