# Lecture 5: Mixture Models and the Expectation Maximization Algorithm

## K-means review

K-means optimizes the cost function $J$, where

$$
J(U, Z) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{k,n} \| x_n - u_k \|_2^2, \quad
\text{s.t.} \> z_{k,n} \in \{0, 1\} \land \sum_{k=1}^{K} z_{k,n} = 1 \> \forall n
$$
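As a small illustration of the cost above, here is a sketch in plain Python. The function name `kmeans_cost` and the representation of $Z$ as a list of cluster indices (one per point) are assumptions made for this example, not something fixed by the lecture.

```python
def kmeans_cost(points, centroids, assignments):
    """Sum of squared Euclidean distances from each point to its centroid.

    assignments[n] is the index k for which z_{k,n} = 1 (hard assignment).
    """
    total = 0.0
    for x, k in zip(points, assignments):
        u = centroids[k]
        total += sum((xi - ui) ** 2 for xi, ui in zip(x, u))
    return total


# Toy data: two points near the first centroid, one exactly on the second.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0)]
centroids = [(0.0, 1.0), (10.0, 0.0)]
assignments = [0, 0, 1]
cost = kmeans_cost(points, centroids, assignments)
```

Each of the first two points is at squared distance 1 from its centroid and the third is at distance 0, so the cost is the sum of those three terms.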

We want **probabilistic cluster assignments**. Instead of saying "this data point belongs to that cluster", we want to say "this data point belongs to cluster X with probability 80%, cluster Y with probability 15%, and cluster Z with probability 5%".

**Relax z's constraint** to $z_{k,n} \in [0, 1]$ (still summing to 1 over $k$), i.e. let more than one $z_{\cdot,n}$ be non-zero.

Use **generative model == statistical model** and infer its parameters $\theta \in \Theta$ using **Maximum Likelihood**.

ML = pick model under which data has highest likelihood:

$$
\mathcal{L}(\theta; \mathbf{X}) := p_{\theta}(\mathbf{X}) \overset{\text{i.i.d.}}{=}
\prod_{n=1}^N p_{\theta}(x_n)
$$

We choose the **maximum likelihood estimator** $\hat{\theta}$, which maximizes the likelihood of the data under the assumption that the data points were sampled independently from the same underlying distribution. Taking the log turns the product into a sum without changing the maximizer:

$$
\hat{\theta} = {\arg\max}_{\theta\in\Theta}p_{\theta}(\mathbf{X}) = 
{\arg\max}_{\theta\in\Theta} \sum_{n=1}^{N} \log p_{\theta}(\mathbf{x_n})
$$

## Mixture models

Finite mixture model for one data point:

$$
p_\theta(x) = \sum_{k=1}^K \pi_k p_{\theta_k}(x)
$$

The $\pi_k$ are the mixing proportions, with $\pi_k \ge 0$ and $\sum_{k=1}^K \pi_k = 1$. Every component distribution has a proportion, **not every data point**. The whole model contains just $K$ mixing proportions. They dictate the relative cluster sizes.

### Gaussian Mixture Model (GMM)

$p_{\theta_k}(x) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$

For clustering, think of two-stage sampling: first sample a cluster, then sample from that cluster's Gaussian.
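The two-stage sampling can be sketched as follows for the 1-D case, using only the standard library. The name `sample_gmm_1d`, the fixed seed, and the toy parameters are assumptions chosen for illustration.

```python
import random


def sample_gmm_1d(weights, means, stds, n, seed=0):
    """Draw n (component, value) pairs from a 1-D Gaussian mixture."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Stage 1: pick a component index k with probability pi_k.
        k = rng.choices(range(len(weights)), weights=weights)[0]
        # Stage 2: draw x from that component's Gaussian.
        x = rng.gauss(means[k], stds[k])
        samples.append((k, x))
    return samples


draws = sample_gmm_1d([0.8, 0.15, 0.05], [0.0, 5.0, 10.0], [1.0, 1.0, 1.0], n=1000)
frac0 = sum(1 for k, _ in draws if k == 0) / len(draws)
```

With mixing proportions 0.8 / 0.15 / 0.05, roughly 80% of the draws should come from the first component, which is what "relative cluster sizes" means here.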

We introduce hard assignment variables $z_k$, BUT we denote their assignment probabilities by $\pi_k$, and work with those:

$$ P(z_k = 1) = \pi_k $$

$$ p_\pi(z) = \prod_{k=1}^{K} \pi_k^{z_k} $$

(since $z$ is one-hot, only the factor of the active component survives in this product; every other factor is $\pi_k^0 = 1$)

We are now left with the following joint distribution over x and z (**complete data** distribution):

$$
p_\theta(x, z) = \prod_{k=1}^K \left[ \pi_k p_{\theta_k}(x) \right]^{z_k}
$$
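As a sanity check, summing the complete-data distribution over the $K$ possible one-hot values of $z$ recovers the mixture density for one data point:

$$
\sum_{z} p_\theta(x, z) = \sum_{k=1}^K \pi_k p_{\theta_k}(x) = p_\theta(x)
$$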


Final MLE objective: 

$$
\hat{\theta} = {\arg\max}_{\theta\in\Theta} \sum_{n=1}^{N} \log p_{\theta}(\mathbf{x_n})
 = {\arg\max}_{\theta\in\Theta} \sum_{n=1}^{N} \log \left[ \sum_{k=1}^K \pi_k p_{\theta_k}(x_n) \right]
$$

Hard to optimize directly: there is no closed-form solution, because the sum over components sits inside the log.
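Even without a closed-form maximizer, the objective itself is easy to evaluate. Below is a sketch for 1-D Gaussians with scalar standard deviations, an assumption made to keep the example short. The helper names are hypothetical, and the log-sum-exp trick is added here only for numerical stability.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a 1-D Gaussian N(x | mean, std^2)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def gmm_log_likelihood(data, weights, means, stds):
    """Sum over points of log( sum_k pi_k * N(x_n | mu_k, sigma_k^2) )."""
    total = 0.0
    for x in data:
        # log(pi_k) + log N(x | mu_k, sigma_k^2) for every component k
        terms = [
            math.log(w) + gaussian_logpdf(x, m, s)
            for w, m, s in zip(weights, means, stds)
        ]
        # log-sum-exp: subtract the largest term before exponentiating
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total


ll_one = gmm_log_likelihood([0.0], [1.0], [0.0], [1.0])
ll_split = gmm_log_likelihood([0.0], [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
```

Splitting one component into two identical halves leaves the likelihood unchanged, which is a quick check that the mixing proportions are handled correctly.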

## Expectation maximization

Cannot optimize the log-likelihood directly (it contains the log of a sum). Instead we can maximize a lower bound on the log-likelihood, based on the complete data distribution, $p_\theta(x, z)$.

### Jensen's inequality

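For a concave function such as $\log$, a secant line always lies below the graph (for a convex function it lies above). Hence the log of an average is at least the average of the logs: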

$$ \log\left( \frac{ \sum_{i=1}^n x_i }{n} \right) \ge \frac{\sum_{i=1}^n\log{x_i} }{n} $$
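Applied to the mixture log-likelihood of one data point, with an arbitrary distribution $q$ over the $K$ components ($q_k > 0$, $\sum_k q_k = 1$), the weighted form of the same inequality gives the lower bound that EM works with:

$$
\log \sum_{k=1}^K \pi_k p_{\theta_k}(x)
= \log \sum_{k=1}^K q_k \frac{\pi_k p_{\theta_k}(x)}{q_k}
\ge \sum_{k=1}^K q_k \log \frac{\pi_k p_{\theta_k}(x)}{q_k}
$$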

### A. Expectation Step

Optimize **bound** w.r.t. "helper" distribution $q$.

TODO(andrei): Upgrade these sections based on the Bishop book since the slides aren't great.

$$ \sum_{k=1}^K q_k = 1, \quad q_k \ge 0 \quad \forall k $$

Set up the Lagrangian for each data point and obtain the optimal $q$ for that data point. Fix this $q$ and perform the next step.
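Solving the Lagrangian yields the standard result (see Bishop) that the optimal $q$ for a data point is the posterior over components, $q_k \propto \pi_k p_{\theta_k}(x)$. A minimal sketch for 1-D Gaussians follows. The name `e_step` and the toy parameters are assumptions, and the code works in probability space for readability, so it can underflow for points far from every component.

```python
import math


def e_step(data, weights, means, stds):
    """Return q[n][k], the posterior probability of component k for point n."""
    q = []
    for x in data:
        # Unnormalized posterior: pi_k * N(x | mu_k, sigma_k^2)
        joint = [
            w * math.exp(-(x - m) ** 2 / (2 * s ** 2)) / math.sqrt(2 * math.pi * s ** 2)
            for w, m, s in zip(weights, means, stds)
        ]
        norm = sum(joint)  # this is the mixture density p(x)
        q.append([j / norm for j in joint])
    return q


q = e_step([0.0, 2.0, 4.0], weights=[0.5, 0.5], means=[0.0, 4.0], stds=[1.0, 1.0])
```

The middle point is equidistant from both means, so it is split evenly between the two components, while the outer points are assigned almost entirely to their nearest component.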



### B. Maximization step

Can establish closed-form solutions for $\mathbf{\mu}^*$ and $\mathbf{\Sigma}^*$ (and for the mixing proportions $\pi^*$) given the previous $q$s, as well as the data.

 * cf. the centroid position recomputation in K-means. Remember that here, we don't have just ONE cluster assignment per point, but we have soft assignments to ALL clusters.
 * cf. naive derivation of EM solution (see slides and Bishop)
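The closed-form updates are weighted versions of the usual mean and variance estimates, with the $q$s as weights. The sketch below covers the 1-D case only (scalar variance instead of a full $\Sigma$). The name `m_step` and the toy data are assumptions for illustration.

```python
def m_step(data, q):
    """Re-estimate weights, means and standard deviations from soft assignments.

    q[n][k] is the soft assignment of point n to component k.
    """
    n_points = len(data)
    n_components = len(q[0])
    weights, means, stds = [], [], []
    for k in range(n_components):
        # Effective number of points assigned to component k
        nk = sum(q[n][k] for n in range(n_points))
        # Weighted mean, the soft analogue of the K-means centroid update
        mu = sum(q[n][k] * data[n] for n in range(n_points)) / nk
        # Weighted variance around the new mean
        var = sum(q[n][k] * (data[n] - mu) ** 2 for n in range(n_points)) / nk
        weights.append(nk / n_points)
        means.append(mu)
        stds.append(var ** 0.5)
    return weights, means, stds


data = [0.0, 2.0, 10.0, 12.0]
# Hard 0/1 assignments as a special case of soft ones
q = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
weights, means, stds = m_step(data, q)
```

With hard 0/1 assignments, as in this toy input, the mean update reduces exactly to the K-means centroid recomputation.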

TODO(andrei): Table/diagram with precise comparison between K-means and EM.

## Model Selection

**We still need to pick K first, even in GMMs solved with the EM algorithm!!!**

Technically, one can keep increasing K up to K = N, and the log-likelihood (LL) of the data under the model keeps getting better. This is overfitting, not a better model!

### AIC and BIC

$\kappa(\cdot)$ = number of free parameters in model.

So what does a model contain? Assuming full covariance matrices:
$K \cdot D$ mean parameters, $K \cdot \frac{D (D + 1)}{2}$ covariance parameters, plus $K - 1$ mixing weights $\pi$ ($-1$ because we're talking about free variables, and we know the weights must sum to 1).

Note that a covariance matrix is symmetric!
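The count of free parameters for a full-covariance GMM can be written as a small helper. The function name is hypothetical.

```python
def gmm_free_parameters(K, D):
    """Number of free parameters of a GMM with K full-covariance components in D dims."""
    means = K * D
    # A symmetric D x D matrix has D * (D + 1) / 2 distinct entries.
    covariances = K * D * (D + 1) // 2
    # The K mixing weights sum to 1, so only K - 1 of them are free.
    weights = K - 1
    return means + covariances + weights


kappa = gmm_free_parameters(K=3, D=2)
```

For $K = 3$ and $D = 2$ this gives $6 + 9 + 2 = 17$ free parameters.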

#### Akaike Information Criterion

(Smaller is better)

$$ AIC(\theta | X) = -\log p_\theta(X) + \kappa(\theta) $$

#### Bayesian Information Criterion

(Smaller is better)

$$ BIC(\theta | X) = -\log p_\theta(X) + \frac{1}{2} \kappa(\theta) \log N $$

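BIC is harsher: its penalty $\frac{1}{2} \kappa(\theta) \log N$ exceeds AIC's $\kappa(\theta)$ as soon as $\log N > 2$, i.e. for $N \ge 8$.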
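Both criteria, in the form used in these notes, are one-liners once the log-likelihood and parameter count are known. The function names are hypothetical, and the log-likelihood value in the usage lines is a made-up number chosen purely to show the call.

```python
import math


def aic(log_likelihood, n_params):
    """AIC as defined in these notes: -log p(X) + kappa."""
    return -log_likelihood + n_params


def bic(log_likelihood, n_params, n_points):
    """BIC as defined in these notes: -log p(X) + 0.5 * kappa * log N."""
    return -log_likelihood + 0.5 * n_params * math.log(n_points)


# Made-up log-likelihood, only to illustrate usage.
aic_score = aic(-100.0, n_params=17)
bic_score = bic(-100.0, n_params=17, n_points=1000)
```

For the same model and data, the BIC score is larger than the AIC score here because $\log 1000 > 2$.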

#### Howto

Both AIC and BIC typically have a clear minimum when computed for multiple values of K. It tends to coincide with the knee in the negative log-likelihood curve, where adding more clusters stops paying off.
