# Lecture 7: Non-Negative Matrix Factorization

## Topic models

Find low-dimensional document representation in topic/concept space.
Approach: **predictive model** (log-likelihood)

But: we can't maximize the likelihood (ML) directly. Instead, we maximize a lower bound on the likelihood using expectation maximization (EM), like we did for GMMs.

## pLSA

### Sampling model

**Topic-wise sampling.** For every token:
 (1) Sample a topic $z$ from the document's categorical distribution over topics, $p(z|d)$.
 (2) Sample a word $w$ from that topic's categorical distribution over words, $p(w|z)$.
 
Generally:

$$
p(w|d) = \sum_{z=1}^{K} p(w|z) p(z|d)
$$

(where z is the topic; need to pre-specify K)
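
The two sampling steps and the resulting mixture can be sketched in a few lines of NumPy. This is only an illustration: the two topics, the four-word vocabulary, and all probability values below are made up, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

K, M = 2, 4  # hypothetical sizes: 2 topics, 4 vocabulary words
p_z_given_d = np.array([0.7, 0.3])              # p(z | d) for one document
p_w_given_z = np.array([[0.5, 0.5, 0.0, 0.0],   # p(w | z = 0)
                        [0.0, 0.0, 0.5, 0.5]])  # p(w | z = 1)

def sample_token(rng):
    """Topic-wise sampling of a single token."""
    z = rng.choice(K, p=p_z_given_d)      # (1) sample a topic
    w = rng.choice(M, p=p_w_given_z[z])   # (2) sample a word from that topic
    return z, w

tokens = [sample_token(rng)[1] for _ in range(1000)]

# Marginal word distribution: p(w | d) = sum_z p(w | z) p(z | d)
p_w_given_d = p_z_given_d @ p_w_given_z
```

With enough sampled tokens, the empirical word frequencies in `tokens` should approach `p_w_given_d`.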

How do we reach this? Start with the probability of a word given a document, and use the sum rule (law of total probability) to express it as a sum over $z$:

$$
p(w | d) = \sum_{z} p( w, z \>|\> d ) = \sum_{z} p ( w \>|\> z, d ) p ( z \>|\> d )
$$

Remember the graphical model! The way we constructed it, we assume that knowing the topic is **enough** to predict a word:

$$ w \perp d  \>|\> z \iff p(w|z, d) = p(w|z) $$

We can apply this in the above formulation as follows:


$$
p(w | d) = \sum_{z} p ( w \>|\> z, d ) p ( z \>|\> d )
 = \sum_{z} p(w \>|\> z) p (z \>|\> d)
$$


### How do we compute these distributions?

Assume the word occurrences are conditionally independent (c.i.), apply the log, and get the log-likelihood:

$$
\mathcal{L}(U, V) \overset{\text{c.i.}}{=} \sum_{i,j} x_{ij} \log p(w_j \>|\> d_i)
 = \sum_{(i, j) \in \mathcal{X}} \log \sum_{z=1}^{K} p(w_j \>|\> z) p(z \>|\> d_i)
 := \sum_{(i, j) \in \mathcal{X}} \log \sum_{z=1}^{K} v_{zj} u_{zi}
$$

Where: $u_{zi} \ge 0 \land \sum_z u_{zi} = 1$, i.e. for a given document, the sum of its components is 1. Should not be interpreted as the probability of the document belonging to a single topic, as a document **can** cover multiple topics.

And: $v_{zj} \ge 0 \land \sum_j v_{zj} = 1$, i.e. for a given topic $z$, its word weights sum up to 1, i.e. a topic is a probability distribution over words.
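
As a rough sketch, the log-likelihood above can be evaluated with one matrix product. The shapes are an assumption consistent with the indices used here: `X` is documents by words (N, M), `U` is topics by documents (K, N), and `V` is topics by words (K, M). The tiny example matrices are made up.

```python
import numpy as np

def log_likelihood(X, U, V):
    """Sum over (i, j) of x_ij * log(sum_z u_zi * v_zj)."""
    P = U.T @ V        # P[i, j] = sum_z u_zi v_zj = p(w_j | d_i)
    mask = X > 0       # entries with x_ij = 0 contribute nothing
    return float(np.sum(X[mask] * np.log(P[mask])))

# Hypothetical toy data: 2 documents, 2 words, 2 topics.
X = np.array([[2, 0],
              [0, 3]])
U = np.array([[1.0, 0.0],    # each column sums to 1
              [0.0, 1.0]])
V = np.array([[0.5, 0.5],    # each row sums to 1
              [0.5, 0.5]])

ll = log_likelihood(X, U, V)  # here every observed word has probability 0.5
```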

### Optimizing the log likelihood

We have a log of a sum, so we can't optimize it directly (just like in GMMs), but we **can** optimize a lower bound instead. For any $q_{zij} \ge 0$ with $\sum_z q_{zij} = 1$:

$$
\underbrace{
\log\sum_{z=1}^K q_{zij} \frac{u_{zi} v_{zj}}{q_{zij}}
}_{\substack{\text{Our objective} \\ \text{Hard to optimize because of the log(sum).}}}
\overset{\text{Jensen's inequality}}{\ge}
\sum_{z=1}^K q_{zij} \log \frac{u_{zi} v_{zj}}{q_{zij}}
=
\underbrace{
\sum_{z=1}^K q_{zij} \left[ \log u_{zi} + \log v_{zj} - \log q_{zij} \right]
}_{\substack{\text{
Lower bound on our objective.}\\ \text{Not perfect, but solvable} \\ \text{(logarithm now inside sum)}
}}
$$

If we maximize the lower bound well enough, we can get very close to our real objective, which we cannot optimize directly. So, handwavily, given that we're going
"in the right direction", we can get away with computing the optimal model parameters
by optimizing the lower bound instead of the main objective.
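
A small numerical check can make the bound more concrete. The sketch below uses made-up values for one (document, word) pair and compares the true objective with the lower bound for two choices of $q$: an arbitrary distribution, and the normalized product $u_{zi} v_{zj}$. For the latter the bound is tight, since the gap between the two sides is the KL divergence between $q$ and that normalized product.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
u = rng.dirichlet(np.ones(K))          # u_{.i}: topic weights of one document
v = rng.uniform(0.1, 0.9, size=K)      # v_{.j}: p(w_j | z) for one word, per topic

def lower_bound(q, u, v):
    """sum_z q_z * (log u_z + log v_z - log q_z)"""
    return float(np.sum(q * (np.log(u) + np.log(v) - np.log(q))))

objective = float(np.log(np.sum(u * v)))   # log(sum_z u_z v_z)

q_random = rng.dirichlet(np.ones(K))       # some arbitrary distribution over topics
q_posterior = u * v / np.sum(u * v)        # normalized u_z * v_z

gap_random = objective - lower_bound(q_random, u, v)        # >= 0
gap_posterior = objective - lower_bound(q_posterior, u, v)  # ~ 0
```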

TODO(andrei): Theory on how good the approximation actually is.

So why is a $\log\left(\sum\right)$ hard to optimize? Well, its derivative ends up having the whole sum as a denominator, preventing it from being decomposable via addition. This makes SGD impossible, which is really bad.

### Expectation step

Optimize $q$:

$$
q_{zij} = \frac{v_{zj}u_{zi}}{\sum_{k=1}^K v_{kj} u_{ki}} 
 = \frac{  p(w_j | z) p(z | d_i) }{\sum_{k=1}^K p(w_j | k) p(k | d_i)}
$$

$q_{\cdot i j}$ now contains the posterior over topics for the occurrence (i, j) at the current step, i.e. given the previous U and V matrices. In the next step, we'll update the matrices themselves based on the newly-computed $q$s.

Yes, this is a closed-form solution!
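
A vectorized sketch of this update, assuming `U` has shape (K, N) and `V` has shape (K, M) as suggested by the indices above. The function name and the toy values are illustrative only, and the code assumes every denominator is strictly positive.

```python
import numpy as np

def e_step(U, V):
    """Return Q with Q[z, i, j] = u_zi * v_zj / sum_k u_ki * v_kj."""
    joint = U[:, :, None] * V[:, None, :]             # shape (K, N, M)
    return joint / joint.sum(axis=0, keepdims=True)   # normalize over topics z

# Hypothetical toy parameters: K = 2 topics, N = 2 documents, M = 2 words.
U = np.array([[0.5, 1.0],
              [0.5, 0.0]])
V = np.array([[0.5, 0.5],
              [0.9, 0.1]])

Q = e_step(U, V)
```

For every pair (i, j), `Q[:, i, j]` is a distribution over the K topics.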

### Maximization step

Compute the optimal U and V based on the $q$s determined in the previous expectation step. This is like K-means, where we update the model itself (the centroid positions) based on the latent variables (the assignments).

U, which models the topics in documents.

$$
u_{zi} = \frac{\sum_j x_{ij} q_{zij}}{\sum_j x_{ij}}
$$

V, which models the word distributions in topics.

$$
v_{zj} = \frac{\sum_i x_{ij} q_{zij}}{\sum_{i,l} x_{il} q_{zil}}
$$
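
Putting both steps together gives a compact EM loop. The following is a minimal sketch rather than a reference implementation: the function names, the random Dirichlet initialization, and the fixed iteration count are assumptions, and the code assumes every document contains at least one token.

```python
import numpy as np

def e_step(U, V):
    joint = U[:, :, None] * V[:, None, :]                      # u_zi * v_zj
    # Tiny floor guards against 0/0 for pairs that never occur.
    return joint / np.maximum(joint.sum(axis=0, keepdims=True), 1e-300)

def m_step(X, Q):
    weighted = X[None, :, :] * Q                               # x_ij * q_zij
    U = weighted.sum(axis=2) / X.sum(axis=1)[None, :]          # (K, N)
    V = weighted.sum(axis=1) / weighted.sum(axis=(1, 2))[:, None]  # (K, M)
    return U, V

def log_likelihood(X, U, V):
    P = U.T @ V
    mask = X > 0
    return float(np.sum(X[mask] * np.log(P[mask])))

def plsa(X, K, n_iter=30, seed=0):
    rng = np.random.default_rng(seed)
    N, M = X.shape
    U = rng.dirichlet(np.ones(K), size=N).T   # (K, N), columns sum to 1
    V = rng.dirichlet(np.ones(M), size=K)     # (K, M), rows sum to 1
    history = []
    for _ in range(n_iter):
        Q = e_step(U, V)
        U, V = m_step(X, Q)
        history.append(log_likelihood(X, U, V))
    return U, V, history

# Hypothetical count matrix: 4 documents, 4 words.
X = np.array([[4, 3, 0, 0],
              [3, 5, 0, 0],
              [0, 0, 4, 2],
              [0, 1, 3, 5]])
U, V, history = plsa(X, K=2)
```

Since each EM iteration cannot decrease the log-likelihood, `history` should be non-decreasing up to floating-point error, which is a handy sanity check for an implementation.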

## Latent Dirichlet Allocation (LDA)

Start with generative document model. For this, we first need to sample the topic weights $u_i$ for the new documents.

I.e. we need to come up with a probability distribution from which we can sample a probability vector.

While $u$ shouldn't be interpreted as a probability vector qualitatively, since it's meant to describe a mixture of topics in a document $d_i$, rather than the likelihood of that document belonging to certain topics, it still has the main properties of a categorical probability distribution: every element is $\ge 0$ and they all sum up to 1.

So mathematically, we can still treat $u_i$ as a probability vector.

And we sample it from a **Dirichlet distribution**, which is the **conjugate distribution** of the categorical distribution. [When in doubt, Wiki it!](https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions)

We therefore model $u_i$ as sampled from a Dirichlet distribution, parameterized by $\alpha \in \mathbb{R}_{>0}^K$, a vector with one concentration parameter $\alpha_z$ per entry of $u_i$, i.e. one per topic:

$$
p(\mathbf{u}_i \>|\> \alpha) \propto \prod_{z=1}^K u_{zi}^{\alpha_z-1}
$$
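
Sampling such topic weights is a one-liner with NumPy's `Generator.dirichlet`. The sketch below assumes one positive concentration parameter per topic, as the product in the density suggests; the values of `K` and `alpha` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3
alpha = np.full(K, 0.5)   # hypothetical concentration parameters, one per topic

# Each row is one sampled topic-weight vector u_i (non-negative, sums to 1).
u_samples = rng.dirichlet(alpha, size=5)   # shape (5, K)

def dirichlet_unnormalized_density(u, alpha):
    """prod_z u_z^(alpha_z - 1), i.e. the density up to its normalizing constant."""
    return float(np.prod(u ** (alpha - 1.0)))

example = dirichlet_unnormalized_density(np.array([0.25, 0.25, 0.5]),
                                         np.array([2.0, 2.0, 2.0]))
```

Entries of `alpha` below 1 tend to produce sparse samples (most of the mass on few topics), while entries above 1 favor more even mixtures.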

Treat U as nuisance parameter (TODO(andrei): WHY?).

Each column of U represents a document, and every row of that column represents a certain topic $z$'s weight in that document.

Why are V the only real parameters?
TODO(andrei): Ask around and/or check out Bishop book or Andrew Ng's paper on LDA.

$$
p(x|V, u) = \frac{l!}{\prod_j x_j!}\prod_j \pi_j^{x_j}
$$

Where $l = \sum_j x_j$ is the length of the document and:
$$
\pi_j := \sum_z v_{zj} u_{z}
$$
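
This is a multinomial over the word counts of one document, with per-word probabilities $\pi_j$. A minimal sketch of evaluating it in log space, assuming `x` is the count vector of a single document, `V` has shape (K, M), and `u` has length K (the function name and the toy numbers are illustrative):

```python
import math
import numpy as np

def doc_log_likelihood(x, V, u):
    """log p(x | V, u) for one document with word counts x (length M)."""
    pi = u @ V                    # pi_j = sum_z v_zj u_z
    l = int(x.sum())              # document length
    # log of the multinomial coefficient l! / prod_j x_j!
    log_coef = math.lgamma(l + 1) - sum(math.lgamma(int(c) + 1) for c in x)
    mask = x > 0                  # words with x_j = 0 contribute a factor of 1
    return log_coef + float(np.sum(x[mask] * np.log(pi[mask])))

# Hypothetical toy values: 2 topics, 2 words, a document with one of each word.
V = np.array([[0.5, 0.5],
              [0.5, 0.5]])
u = np.array([0.3, 0.7])
x = np.array([1, 1])

p = math.exp(doc_log_likelihood(x, V, u))   # 2!/(1! 1!) * 0.5 * 0.5
```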

What is $\pi_j$ for a word $j$?

Bayesian averaging integrates over $u$, conditioning on the Dirichlet parameter $\alpha$, i.e. $p(x \>|\> V, \alpha) = \int p(x \>|\> V, u) \, p(u \>|\> \alpha) \, du$. WHAT DOES THIS DO?
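
One way to get a feeling for this averaging is a plain Monte Carlo estimate: draw many $u$ from the Dirichlet, evaluate the multinomial likelihood of the document for each, and take the mean. The result no longer depends on any document-specific $u$, only on $V$ and $\alpha$. This is a naive sketch for intuition, not how LDA is actually trained; all names and values below are assumptions.

```python
import math
import numpy as np

def doc_likelihood(x, V, u):
    """Multinomial p(x | V, u) for one document with word counts x."""
    pi = u @ V                                   # pi_j = sum_z v_zj u_z
    l = int(x.sum())
    coef = math.factorial(l) / math.prod(math.factorial(int(c)) for c in x)
    return coef * float(np.prod(pi ** x))

def marginal_likelihood_mc(x, V, alpha, n_samples=2000, seed=0):
    """Estimate the average of p(x | V, u) over u ~ Dirichlet(alpha)."""
    rng = np.random.default_rng(seed)
    us = rng.dirichlet(alpha, size=n_samples)    # shape (n_samples, K)
    return float(np.mean([doc_likelihood(x, V, u) for u in us]))

# Hypothetical toy values. Both topics are identical here, so the
# likelihood does not depend on u and the average is simply 0.5.
V = np.array([[0.5, 0.5],
              [0.5, 0.5]])
x = np.array([1, 1])
estimate = marginal_likelihood_mc(x, V, alpha=np.ones(2))
```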

Learning LDA: **beyond the scope of the lecture**

## Non-Negative Matrix Factorization

Given a non-negative integer count matrix $X \in \mathbb{Z}_{\ge 0}^{N\times M}$.

NMF computes:

$$
X \approx U^T V
$$

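Where $U \in \mathbb{R}^{K \times N}$ and $V \in \mathbb{R}^{K \times M}$ are non-negative. In pLSA, additionally every column of U and every row of V sums to 1; the NMF objective below only imposes non-negativity.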

NMF objective:

$$
J(U,V) = \frac{1}{2} \| X - U^T V \|^2_F \quad \text{s.t.} \> u_{zi}, v_{zj} \ge 0
$$

(fraction meant to lead to cleaner gradient)
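
To see what the factor buys us, here are the gradients of the unconstrained objective, worked out under the assumption that $X$ is $N \times M$, $U$ is $K \times N$, and $V$ is $K \times M$. The factor $\frac{1}{2}$ cancels the 2 that comes from differentiating the square:

$$
\nabla_U J = -V \left( X - U^T V \right)^T, \qquad
\nabla_V J = -U \left( X - U^T V \right)
$$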

So we optimize:

$$
U^{*}, V^{*} = \arg\min_{U,V} J(U, V)
$$

The objective is convex in U for fixed V and convex in V for fixed U, but not jointly convex in (U, V), so we use **alternating least squares (ALS)**: fix one factor, solve for the other, and repeat.
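
A rough sketch of the alternating scheme follows. Note the assumption: this is the simple *projected* ALS heuristic, which solves each unconstrained least-squares subproblem with `np.linalg.lstsq` and then clips negative entries to zero. It is not an exact non-negative least-squares solver and is not guaranteed to decrease the objective at every step. Function names, initialization, and the toy matrix are illustrative.

```python
import numpy as np

def nmf_objective(X, U, V):
    """J(U, V) = 0.5 * ||X - U^T V||_F^2"""
    return 0.5 * float(np.linalg.norm(X - U.T @ V, "fro") ** 2)

def nmf_projected_als(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, M = X.shape
    U = rng.random((K, N))
    V = rng.random((K, M))
    for _ in range(n_iter):
        # Fix V, solve V^T U ~ X^T for U in the least-squares sense, then project.
        U = np.maximum(np.linalg.lstsq(V.T, X.T, rcond=None)[0], 0.0)
        # Fix U, solve U^T V ~ X for V in the least-squares sense, then project.
        V = np.maximum(np.linalg.lstsq(U.T, X, rcond=None)[0], 0.0)
    return U, V

# Hypothetical count matrix: 4 documents, 4 words.
X = np.array([[4.0, 3.0, 0.0, 0.0],
              [3.0, 5.0, 0.0, 0.0],
              [0.0, 0.0, 4.0, 2.0],
              [0.0, 1.0, 3.0, 5.0]])
U, V = nmf_projected_als(X, K=2)
J = nmf_objective(X, U, V)
```

Each subproblem is an ordinary least-squares fit, which is where the per-factor convexity is used; only the clipping step is heuristic.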