# Evidence Lower Bound (ELBO)

An even more general perspective on the EM algorithm comes from the evidence lower bound (ELBO), which is also sometimes referred to as the variational lower bound.

Here "evidence" refers to the model evidence in a Bayesian setting, a quantity that allows different models to be compared without the use of hold-out data.

The ELBO is also an example of a variational framework: a distribution is introduced over the latent variables, and the bound is then optimized with respect to that distribution using the calculus of variations.

Again, the goal is to maximize the likelihood of the data (or, equivalently, its logarithm):

\begin{equation}
p(X|\theta) = \sum_{Z} p(X, Z|\theta)
\end{equation}

Here $Z$ is assumed to be a discrete latent variable. The same result holds for continuous latent variables, with the summation replaced by an integral.

Suppose that direct optimization of $p(X|\theta)$ is difficult but that optimization of the complete-data likelihood $p(X, Z|\theta)$ is much easier. Introducing a distribution $q(Z)$ over the latent variables, the log likelihood can be decomposed into:

\begin{equation}
\ln p(X|\theta) = \mathcal{L}(q, \theta) + KL(q||p)
\end{equation}

where 

\begin{equation}
\mathcal{L}(q, \theta) = \sum_{Z} q(Z) \ln \left( \frac{p(X, Z|\theta)}{q(Z)} \right)
\end{equation}

and 

\begin{equation}
KL(q||p) = - \sum_{Z} q(Z) \ln \left( \frac{p(Z|X, \theta)}{q(Z)} \right)
\end{equation}

is the Kullback-Leibler divergence between the variational distribution $q(Z)$ and the posterior of the latent variables $p(Z|X, \theta)$. 

To verify this decomposition, note that the product rule gives:

\begin{equation}
p(X, Z|\theta) = p(Z|X, \theta) p(X|\theta)
\end{equation}

and so 

\begin{equation}
\ln p(X, Z|\theta) = \ln p(Z|X, \theta) + \ln p(X|\theta)
\end{equation}

Substituting this into the expression for $\mathcal{L}(q, \theta)$ and noting that $\sum_{Z} q(Z) = 1$ verifies the decomposition. 
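
Written out in full, the substitution proceeds as follows. This only expands the step just described and introduces no new assumptions:

\begin{equation}
\begin{aligned}
\mathcal{L}(q, \theta) &= \sum_{Z} q(Z) \ln \left( \frac{p(Z|X, \theta) \, p(X|\theta)}{q(Z)} \right) \\
&= \sum_{Z} q(Z) \ln \left( \frac{p(Z|X, \theta)}{q(Z)} \right) + \ln p(X|\theta) \sum_{Z} q(Z) \\
&= -KL(q||p) + \ln p(X|\theta)
\end{aligned}
\end{equation}

Rearranging the last line recovers $\ln p(X|\theta) = \mathcal{L}(q, \theta) + KL(q||p)$.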

Recall that the KL divergence satisfies $KL(q||p) \geq 0$, with $KL(q||p) = 0$ if and only if $q(Z) = p(Z|X, \theta)$. It follows that $\mathcal{L}(q, \theta) \leq \ln p(X|\theta)$, so that $\mathcal{L}(q, \theta)$ provides a lower bound on the log likelihood.
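
The decomposition and the bound can be checked numerically on a small example. The sketch below is only an illustration: the toy joint values, the number of latent states `K`, and the helper names `elbo` and `kl` are all assumptions made here, not part of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: Z takes K discrete values and X is one fixed observation.
K = 4
joint = rng.random(K)           # joint[k] stands in for p(X, Z=k | theta)
joint *= 0.3 / joint.sum()      # rescale so that p(X | theta) is about 0.3

evidence = joint.sum()          # p(X | theta) = sum_Z p(X, Z | theta)
posterior = joint / evidence    # p(Z | X, theta), from the product rule

# An arbitrary distribution q(Z) over the latent variable.
q = rng.random(K)
q /= q.sum()

def elbo(q, joint):
    """L(q, theta) = sum_Z q(Z) ln( p(X, Z | theta) / q(Z) )."""
    return np.sum(q * np.log(joint / q))

def kl(q, p):
    """KL(q || p) = -sum_Z q(Z) ln( p(Z) / q(Z) )."""
    return -np.sum(q * np.log(p / q))

log_evidence = np.log(evidence)

# Difference between ln p(X | theta) and L(q, theta) + KL(q || p).
gap = log_evidence - (elbo(q, joint) + kl(q, posterior))

print("ln p(X|theta)     :", log_evidence)
print("L(q, theta)       :", elbo(q, joint))
print("KL(q || p)        :", kl(q, posterior))
print("decomposition gap :", gap)
```

The printed gap should vanish up to floating-point rounding, and choosing `q = posterior` should make the bound tight, that is, `elbo(posterior, joint)` should match `log_evidence`.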

![ELBO](./figures/ELBO.png)

This decomposition can be used to derive the EM algorithm and to show that each iteration does indeed increase the log likelihood. Suppose the current value of the parameter vector is $\theta^{old}$. In the E-step, the lower bound is maximized with respect to $q(Z)$ while holding $\theta^{old}$ fixed. As $\ln p(X|\theta^{old})$ does not depend on $q(Z)$, the largest value of $\mathcal{L}(q, \theta^{old})$ is obtained when $KL(q||p) = 0$, which occurs when $q(Z) = p(Z|X, \theta^{old})$. In this case, the lower bound equals the log likelihood.

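In the M-step, the distribution $q(Z)$ is held fixed and the lower bound is maximized with respect to $\theta$ to give new parameter values $\theta^{new}$. This causes the lower bound to increase, unless it is already at a maximum. Because $q(Z)$ was computed using $\theta^{old}$, it will in general no longer equal the new posterior $p(Z|X, \theta^{new})$, so the KL divergence becomes non-zero and the log likelihood increases by at least as much as the lower bound does.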
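
The two steps can be traced numerically. The following is a minimal sketch under several assumptions chosen purely for illustration: synthetic one-dimensional data, a two-component Gaussian mixture with unit variances held fixed (so $\theta$ consists only of the means and mixing weights), and hypothetical helper names such as `e_step` and `m_step`. For i.i.d. data $q(Z)$ factorizes over data points, so it is stored as an array of per-point log responsibilities.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data (an assumption for illustration): two unit-variance clusters.
x = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])

def log_joint(x, means, weights):
    """ln p(x_n, z_n = k | theta) for every point n and component k."""
    log_gauss = -0.5 * np.log(2.0 * np.pi) - 0.5 * (x[:, None] - means[None, :]) ** 2
    return np.log(weights)[None, :] + log_gauss

def log_sum_exp(a):
    """Numerically stable ln sum_k exp(a[n, k]) for each row n."""
    m = a.max(axis=1, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=1, keepdims=True))

def log_likelihood(x, means, weights):
    """ln p(X | theta), summing the joint over the latent variable."""
    return np.sum(log_sum_exp(log_joint(x, means, weights)))

def lower_bound(log_q, x, means, weights):
    """L(q, theta) = sum_Z q(Z) ln( p(X, Z | theta) / q(Z) )."""
    return np.sum(np.exp(log_q) * (log_joint(x, means, weights) - log_q))

def e_step(x, means, weights):
    """Set q(Z) to the posterior p(Z | X, theta_old); returned as ln q."""
    lj = log_joint(x, means, weights)
    return lj - log_sum_exp(lj)

def m_step(x, log_q):
    """Maximize the lower bound over the means and weights with q held fixed."""
    q = np.exp(log_q)
    nk = q.sum(axis=0)
    return (q * x[:, None]).sum(axis=0) / nk, nk / x.size

means, weights = np.array([-1.0, 1.0]), np.array([0.5, 0.5])
history = []
for _ in range(20):
    ll_old = log_likelihood(x, means, weights)
    log_q = e_step(x, means, weights)
    bound_e = lower_bound(log_q, x, means, weights)   # tight after the E-step
    means, weights = m_step(x, log_q)
    bound_m = lower_bound(log_q, x, means, weights)   # raised by the M-step
    ll_new = log_likelihood(x, means, weights)
    history.append((ll_old, bound_e, bound_m, ll_new))
```

Each tuple in `history` should show the pattern described above: the bound matches the log likelihood after the E-step, the M-step does not lower the bound, and the new log likelihood is at least as large as the new bound.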
