# Variational Inference

### 1. What's variational inference?

Variational inference (VI) is an approximation technique widely used for inference problems in modern statistics. Inference is especially important in Bayesian statistics, which frames all inference about unknown quantities as a calculation involving the posterior. In practice, the posterior is usually intractable in modern Bayesian models. Two main families of methods address this problem: Markov chain Monte Carlo (MCMC) and variational inference. MCMC is a sampling technique, with the Metropolis-Hastings algorithm and Gibbs sampling as examples, whereas VI turns inference into an optimization problem.

The core idea of VI is to posit a family of distributions and then find the member of that family that is closest to the target, where closeness is measured by the Kullback-Leibler (KL) divergence.

#### 1.1 Core idea of variational inference

Suppose we have observed data $X$ and latent variables $Z$, with prior $p(z)$ over the latent variables and data likelihood $p(x\vert z)$. In many situations we want to infer the latent variables $Z$ through the posterior $p(z\vert x)$. The posterior is usually intractable because of its normalization term $p(x)=\int p(x,z)\,dz$, which requires integrating over all latent variables $Z$.

In this case, MCMC is often used to approximate the posterior. However, in many situations, such as when the data set is large or the model is very complex, faster approximate techniques are necessary, and VI is a strong alternative.

As in EM, we start by writing the full-data likelihood:

$$
\begin{align*}
p(x,z) = p(z)p(x\vert z)\\
p(z\vert x) = \frac{p(x,z)}{p(x)}
\end{align*}
$$

The difference from EM is that $z$ here stands for all unobserved quantities, including model parameters, not just specific latent variables such as cluster memberships or missing data.

Thus, we posit a family $D$ of approximate distributions over the latent variables and look for the member of that family that minimizes the Kullback-Leibler divergence to the true posterior. This turns the inference problem into an optimization problem.

$$
q^{*}(z) = \arg \min_{q(z)\in D} D_{KL}(q(z)\vert\vert p(z\vert x))
$$

The figure below shows the core idea of VI: approximate the posterior $p(z\vert x)$ with $q(z)$, and optimize $q(z)$ to minimize the KL divergence.

![VI](https://raw.githubusercontent.com/Gwan-Siu/BlogCode/master/EM_and_VI/image/VI.png)
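To make the optimization view concrete, here is a minimal sketch that fits a single Gaussian $q(z)$ to a bimodal target by brute-force grid search over $D_{KL}(q\vert\vert p)$. The target mixture, the grid ranges, and the helper names `log_normal` and `reverse_kl` are illustrative assumptions rather than part of the original material, and practical VI would use coordinate or gradient updates instead of a grid search.

```python
import numpy as np

# Grid over the latent variable for numerical integration (illustrative choice).
z = np.linspace(-20.0, 20.0, 4001)
dz = z[1] - z[0]

def log_normal(z, mu, sigma):
    """Log density of N(mu, sigma^2) evaluated at z."""
    return -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)

# Hypothetical target "posterior": a two-component Gaussian mixture.
log_p = np.logaddexp(np.log(0.7) + log_normal(z, 3.0, 1.0),
                     np.log(0.3) + log_normal(z, -3.0, 1.0))

def reverse_kl(mu, sigma):
    """Numerical D_KL(q || p) for q = N(mu, sigma^2), integrated on the grid."""
    log_q = log_normal(z, mu, sigma)
    q = np.exp(log_q)
    return np.sum(q * (log_q - log_p)) * dz

# The variational family D: Gaussians indexed by (mu, sigma) on a coarse grid.
mus = np.linspace(-5.0, 5.0, 41)
sigmas = np.linspace(0.3, 3.0, 28)
kl = np.array([[reverse_kl(m, s) for s in sigmas] for m in mus])

# q* is the family member with the smallest KL divergence to the target.
i, j = np.unravel_index(np.argmin(kl), kl.shape)
best_mu, best_sigma, best_kl = mus[i], sigmas[j], kl[i, j]
print(f"best q: mu={best_mu:.2f}, sigma={best_sigma:.2f}, KL={best_kl:.3f}")
```

In this toy setup the best Gaussian should settle on the heavier mode of the target instead of spreading over both modes, which is the usual behavior when minimizing $D_{KL}(q\vert\vert p)$ with a unimodal family.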

**Comparison of MCMC and VI**

| MCMC | VI  |
| :----------: | :---: |
| More computationally intensive | Less intensive |
| Guarantees asymptotically exact samples from the target distribution | No such guarantees |
| Slower | Faster, especially for large data sets and complex distributions |
| Best for precise inference | Useful to explore many scenarios quickly or large data sets |


#### 1.2 How does variational inference work?

The goal of VI is:
    
$$
\begin{equation}
q^{*}(z) = \arg \min_{q(z)\in D}D_{KL}(q(z)\vert \vert p(z\vert x))
\end{equation}
$$

In fact, minimizing $D_{KL}(q(z)\vert \vert p(z\vert x))$ is equivalent to maximizing the ELBO.

$$
\begin{aligned}
D_{KL}(q(z)\vert \vert p(z\vert x)) &= \mathbb{E}_{q(z)}[\log q(z)] -\mathbb{E}_{q(z)}[\log p(z\vert x)] \\
&= \mathbb{E}_{q(z)}[\log q(z)]-\mathbb{E}_{q(z)}[\log p(x,z)] + \mathbb{E}_{q(z)}[\log p(x)] \\
\Rightarrow \log p(x) &= \mathbb{E}_{q(z)}[\log p(x,z)] -\mathbb{E}_{q(z)}[\log q(z)] + D_{KL}(q(z)\vert \vert p(z\vert x))  \\
\log p(x) &= \mathrm{ELBO}(q)+ D_{KL}(q(z)\vert \vert p(z\vert x))
\end{aligned}
$$

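where $\mathrm{ELBO}(q)= \mathbb{E}_{q(z)}[\log p(x,z)] -\mathbb{E}_{q(z)}[\log q(z)]$ is called the evidence lower bound, and $\mathbb{E}_{q(z)}[\log p(x)] =\log p(x)$ because $\log p(x)$ does not depend on $z$. Since the KL divergence is non-negative, $\mathrm{ELBO}(q)\leq \log p(x)$, which is where the name comes from.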
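The decomposition $\log p(x) = \mathrm{ELBO}(q) + D_{KL}(q(z)\vert\vert p(z\vert x))$ can be checked numerically on a tiny discrete model. The sketch below assumes a three-state latent variable with made-up prior and likelihood values. The numbers and the helper names `elbo` and `kl` are illustrative only.

```python
import numpy as np

# Toy model with a 3-state latent variable z and one fixed observation x.
# All numbers below are made up for illustration.
prior = np.array([0.5, 0.3, 0.2])           # p(z)
likelihood = np.array([0.10, 0.40, 0.70])   # p(x | z) at the observed x

joint = prior * likelihood                  # p(x, z)
evidence = joint.sum()                      # p(x) = sum_z p(x, z)
posterior = joint / evidence                # p(z | x)

def elbo(q):
    """ELBO(q) = E_q[log p(x, z)] - E_q[log q(z)]."""
    return np.sum(q * (np.log(joint) - np.log(q)))

def kl(q, p):
    """D_KL(q || p) for discrete distributions with full support."""
    return np.sum(q * (np.log(q) - np.log(p)))

q = np.array([0.2, 0.5, 0.3])               # an arbitrary variational distribution

print("log p(x)        :", np.log(evidence))
print("ELBO(q) + KL    :", elbo(q) + kl(q, posterior))
print("ELBO(posterior) :", elbo(posterior))
```

The first two printed values should agree for any valid `q`, and the ELBO reaches $\log p(x)$ only when `q` equals the exact posterior, where the KL term vanishes.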

The figure below resembles the one used for EM, except that here the log evidence $\ln p(X\vert\theta)$ is fixed. Because that total is fixed, minimizing the KL divergence between $q$ and $p$ is equivalent to maximizing the ELBO $\mathcal{L}(q,\theta)$.

![VI](https://am207.github.io/2017/wiki/images/klsplitup.png)

Therefore, we can recast our goal as an optimization problem: maximizing the $\mathrm{ELBO}$ is equivalent to minimizing the KL divergence. Our objective function is the $\mathrm{ELBO}$:

$$
\begin{align}
\mathrm{ELBO} &= \mathbb{E}_{q(z)}[\log p(x,z)] - \mathbb{E}_{q(z)}[\log q(z)] \\
&= \mathbb{E}_{q(z)}[\log p(x\vert z)]+\mathbb{E}_{q(z)}[\log p(z)]-\mathbb{E}_{q(z)}[\log q(z)] \\
&= \mathbb{E}_{q(z)}[\log p(x\vert z)]-D_{KL}(q(z)\vert \vert p(z))
\end{align}
$$

- 1. The first term is the expected log-likelihood of the data, so the $\mathrm{ELBO}$ encourages distributions that place their mass on configurations of the latent variables that explain the observed data.
- 2. The second term is the negative KL divergence between the variational distribution and the prior, so the $\mathrm{ELBO}$ pushes $q(z)$ to stay close to the prior $p(z)$.

**Hence, maximizing the** $\mathrm{ELBO}$ **means balancing the likelihood and the prior.**
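The trade-off between the two terms can be seen in a model where everything is available in closed form. The sketch below assumes a toy conjugate model that is not in the original text: prior $z\sim\mathcal{N}(0,1)$, likelihood $x\vert z\sim\mathcal{N}(z,1)$, and variational family $q(z)=\mathcal{N}(m,s^2)$. The function names `elbo_terms` and `elbo` are hypothetical.

```python
import numpy as np

def elbo_terms(m, s2, x):
    """Return (expected log-likelihood, KL to prior) for q(z) = N(m, s2).

    Assumed toy model: prior z ~ N(0, 1), likelihood x | z ~ N(z, 1).
    """
    expected_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
    kl_to_prior = 0.5 * (s2 + m ** 2 - 1.0 - np.log(s2))
    return expected_loglik, kl_to_prior

def elbo(m, s2, x):
    """ELBO = E_q[log p(x | z)] - D_KL(q(z) || p(z))."""
    expected_loglik, kl_to_prior = elbo_terms(m, s2, x)
    return expected_loglik - kl_to_prior

x = 2.0
# Marginally x ~ N(0, 2) in this model, so the log evidence is known exactly.
log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / 4.0

candidates = [
    ("q centered on the data", x),
    ("q centered on the prior mean", 0.0),
    ("q at the exact posterior mean", x / 2.0),
]
for name, m in candidates:
    loglik, kl = elbo_terms(m, 0.5, x)
    print(f"{name:30s} E[log lik]={loglik:.3f}  KL={kl:.3f}  ELBO={loglik - kl:.3f}")
print(f"log evidence = {log_evidence:.3f}")
```

Centering $q$ on the data maximizes the likelihood term but pays a large KL penalty, and centering it on the prior mean does the opposite. In this model the exact posterior is $\mathcal{N}(x/2, 1/2)$, and at that choice the ELBO equals the log evidence.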

### 2. Mean-field Theory

Section 1 introduced the core idea of VI and showed that its goal is to maximize the $\mathrm{ELBO}$. The next question is how to choose the family of distributions. Intuitively, the complexity of the family we choose directly determines the complexity of the optimization problem.

**The more flexibility in the family of distributions, the closer the approximation and the harder the optimization.**

The principle we use to choose the family of distributions is **mean-field theory**. **What's the mean-field theory?** The latent variables are assumed to be mutually independent, each governed by a distinct factor in the variational distribution. A generic member of the mean-field variational family is given by the equation below:

$$
\begin{equation}
q(z) = \prod_{j=1}^{m}q(z_{j})
\end{equation}
$$

Because the latent variables in a mean-field family are mutually independent, the family cannot capture correlations between them. When the latent variables are dependent under the true posterior, the quality of the mean-field approximation suffers. An example is shown below:

![mean-field](http://7xpqrs.com1.z0.glb.clouddn.com/Fu9ZVDbU07MHvRwdhShbD7NisdZ4)

**Notice that we are not making any comment about the conditional independence or lack thereof of posterior parameters. We are merely saying we will find new functions of these parameters such that we can multiply these functions to construct an approximate posterior.**
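The effect of ignoring correlations can be sketched for a target that is itself a correlated two-dimensional Gaussian. For this special case the optimal factors are known to be Gaussians $q(z_j)=\mathcal{N}(m_j, 1/\Lambda_{jj})$, where $\Lambda$ is the precision matrix of the target, and the means can be found by cycling through coordinate updates. The target parameters below (`rho`, `mu`) are illustrative assumptions.

```python
import numpy as np

# Hypothetical "true posterior": a correlated 2-D Gaussian N(mu, Sigma).
rho = 0.9
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])
Lambda = np.linalg.inv(Sigma)   # precision matrix of the target

# Coordinate updates for the means of the factors q(z_1) and q(z_2).
# Each update uses the current mean of the other factor.
m = np.zeros(2)
for _ in range(200):
    m[0] = mu[0] - Lambda[0, 1] / Lambda[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lambda[1, 0] / Lambda[1, 1] * (m[0] - mu[0])

# Variances of the optimal factors versus the true marginal variances.
mean_field_var = 1.0 / np.diag(Lambda)
true_marginal_var = np.diag(Sigma)

print("mean-field means     :", m)
print("mean-field variances :", mean_field_var)
print("true marginal vars   :", true_marginal_var)
```

In this toy case the factorized approximation recovers the correct means, but each factor has variance $1-\rho^2 = 0.19$, well below the true marginal variance of 1. The product $q(z_1)q(z_2)$ is therefore too compact and carries no information about the correlation between $z_1$ and $z_2$.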

### 3. CAVI Algorithm

### 4. Variational inference and GMM