In [1]:
%load_ext autoreload
%autoreload 2
import sys
sys.path.insert(0,'../../modules')

In [2]:
import numpy as np
import common_plots
import plotly.graph_objects as go

# The Expectation-Maximization algorithm
The EM algorithm iteratively improves an estimate of the model parameters $\theta$ when some of the data is unobserved. Each iteration has two steps. First, an inference algorithm finds a distribution over all unknown/missing values given the current parameters (the E-step). Second, the model parameters are re-estimated by maximizing the expected log-likelihood under that distribution (the M-step). The algorithm can be derived from the Evidence Lower Bound (ELBO):
### ELBO EM algorithm derivation:
The Kullback-Leibler divergence is a measure of the difference between two distributions $P$ and $Q$: $$KL(Q||P)=\sum_x Q(x)\log \frac{Q(x)}{P(x)}$$
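As a small illustration of the definition above, the sketch below computes $KL(Q||P)$ for two discrete distributions. The function name `kl_divergence` and the example probabilities are made up for illustration; terms with $Q(x)=0$ are treated as contributing 0, which is the usual convention.

```python
import numpy as np

def kl_divergence(q, p):
    """KL(Q||P) in nats for two discrete distributions on the same support."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0  # terms with Q(x) = 0 contribute 0 by convention
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Hypothetical example distributions over three outcomes
q = np.array([0.5, 0.3, 0.2])
p = np.array([0.2, 0.5, 0.3])

kl_qp = kl_divergence(q, p)  # KL(Q||P)
kl_pq = kl_divergence(p, q)  # KL(P||Q), generally a different number
kl_qq = kl_divergence(q, q)  # divergence of a distribution from itself
print(kl_qp, kl_pq, kl_qq)
```

The two orderings give different values, so the divergence is not symmetric, and it is zero when the two distributions are identical.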
The divergence can be thought of as the average number of extra bits (when the logarithm is base 2) required to encode samples from distribution $Q$ using a code optimized for distribution $P$ instead of one optimized for $Q$. Say we have model parameters $\theta$, observed data $X$ and unobserved data $Z$. Define $P(Z|X,\theta)$ as the true distribution of $Z$ given $X$ and $\theta$, and $Q(Z)$ as an arbitrary distribution over $Z$. If we look at the KL divergence between the two we get:
$$
\begin{aligned}
    KL(Q(Z)||P(Z|X,\theta))&=\sum_z Q(z)\log \bigg(\frac{Q(z)}{P(z|X,\theta)}\bigg) \\
    KL(Q(Z)||P(Z|X,\theta))&=\sum_z Q(z)\log \bigg(\frac{Q(z)P(X,\theta)}{P(z,X,\theta)}\bigg) \\
    KL(Q(Z)||P(Z|X,\theta))&=\sum_z Q(z)\log \bigg(\frac{Q(z)P(X,\theta)}{P(z,X|\theta)p(\theta)}\bigg) \\
    KL(Q(Z)||P(Z|X,\theta))&=\sum_z Q(z)\log \bigg(\frac{Q(z)}{P(z,X|\theta)}\frac{P(X,\theta)}{p(\theta)} \bigg)\\
    KL(Q(Z)||P(Z|X,\theta))&=\sum_z Q(z)\bigg(\log \frac{Q(z)}{P(z,X|\theta)} + \log P(X|\theta)\bigg) \\
    KL(Q(Z)||P(Z|X,\theta))&=\bigg(\sum_z Q(z)\log \frac{Q(z)}{P(z,X|\theta)}\bigg) + \log P(X|\theta) \\
    KL(Q(Z)||P(Z|X,\theta))&=-\bigg(\sum_z Q(z)\log \frac{P(z,X|\theta)}{Q(z)}\bigg) + \log P(X|\theta) \\
    \log P(X|\theta)&=\bigg(\sum_z Q(z)\log \frac{P(z,X|\theta)}{Q(z)}\bigg)+KL(Q(Z)||P(Z|X,\theta))
\end{aligned}
$$ <br>
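The final line of the derivation can be checked numerically on a toy problem. In the sketch below the joint values `joint` (standing for $P(z,X|\theta)$ at one fixed observation $X$ and three values of $z$) and the distribution `q` are arbitrary numbers chosen for illustration, not taken from any model in this notebook.

```python
import numpy as np

# Hypothetical joint P(z, X | theta) for one fixed X and z in {0, 1, 2}
joint = np.array([0.10, 0.05, 0.25])
evidence = joint.sum()          # P(X | theta) = sum_z P(z, X | theta)
posterior = joint / evidence    # P(z | X, theta)

q = np.array([0.6, 0.3, 0.1])   # an arbitrary distribution Q(z)

elbo = np.sum(q * np.log(joint / q))      # sum_z Q(z) log P(z, X | theta) / Q(z)
kl = np.sum(q * np.log(q / posterior))    # KL(Q(Z) || P(Z | X, theta))
log_evidence = np.log(evidence)

# The decomposition log P(X | theta) = ELBO + KL should hold up to rounding
gap = log_evidence - (elbo + kl)

# Choosing Q equal to the posterior removes the KL term entirely
elbo_at_posterior = np.sum(posterior * np.log(joint / posterior))
print(log_evidence, elbo, kl, gap, elbo_at_posterior)
```

For any valid `q` the bound `elbo` stays below `log_evidence`, and the two coincide when `q` equals the posterior.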
In the final line of the derivation, the $KL$ divergence is always non-negative, so the first term on the right-hand side is a lower bound on $\log P(X|\theta)$, hence the name. For fixed $\theta$, $\log P(X|\theta)$ does not depend on $Q$, so increasing the first term with respect to $Q$ decreases the $KL$ divergence between $Q(Z)$ and $P(Z|X,\theta)$ by the same amount. Setting $Q(Z) = P(Z|X,\theta)$ makes the $KL$ divergence 0, so the bound is tight. With this choice the bound can be rewritten:
$$\sum_z P(z|X,\theta)\log \frac{P(z,X|\theta)}{P(z|X,\theta)}=\sum_z P(z|X,\theta)\log P(z,X|\theta) - \sum_z P(z|X,\theta)\log P(z|X,\theta)$$
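Once $Q(Z)$ is fixed at $P(Z|X,\theta_\text{old})$, the entropy term $-\sum_z Q(z)\log Q(z)$ does not depend on the $\theta$ being optimized. Thus increasing $\sum_z Q(z)\log \frac{P(z,X|\theta)}{Q(z)}$ by optimizing $\theta$ pushes the data log-likelihood up: the bound equals $\log P(X|\theta_\text{old})$ at $\theta_\text{old}$ and remains a lower bound at every other $\theta$. The EM algorithm works by first inferring $P(Z|X,\theta_\text{old})$, then finding $\theta_\text{new}$ by maximizing $\sum_z P(z|X,\theta_\text{old})\log P(z,X|\theta)$ with the inferred values held fixed. This is an ordinary MLE problem in which each possible value of the unobserved data is weighted by its inferred probability.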
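The two steps can be sketched on a simple example. The code below is a minimal illustration, not a general implementation: it assumes a mixture of two one-dimensional Gaussians with known unit variance, so $\theta$ consists only of the mixing weights and the two means, and $Z$ is the component that generated each point. The synthetic data, the starting values and the names `log_joint`, `log_likelihood` and `resp` are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two unit-variance Gaussians (assumed for illustration)
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 250)])

def log_joint(x, weights, means):
    """log P(z=k, x_i | theta) for every point i and component k, shape (n, 2)."""
    log_density = -0.5 * np.log(2 * np.pi) - 0.5 * (x[:, None] - means[None, :]) ** 2
    return np.log(weights)[None, :] + log_density

def log_likelihood(x, weights, means):
    """log P(X | theta), summing the joint over z with a stable log-sum-exp."""
    lj = log_joint(x, weights, means)
    m = lj.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))))

weights = np.array([0.5, 0.5])   # initial theta_old
means = np.array([-1.0, 1.0])
history = [log_likelihood(x, weights, means)]

for _ in range(50):
    # E-step: Q(z) = P(z | x_i, theta_old) for every data point
    lj = log_joint(x, weights, means)
    resp = np.exp(lj - lj.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: maximize sum_i sum_z Q(z) log P(z, x_i | theta), a weighted MLE
    nk = resp.sum(axis=0)
    weights = nk / len(x)
    means = (resp * x[:, None]).sum(axis=0) / nk
    history.append(log_likelihood(x, weights, means))

print(weights, means, history[0], history[-1])
```

Here `history` records $\log P(X|\theta)$ after each iteration, so plotting it is a quick way to see that the log-likelihood never decreases from one iteration to the next.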