# Week 3 Overview

During this week's lessons, you will learn topic analysis in depth, including mixture models and how they work, the Expectation-Maximization (EM) algorithm and how it can be used to estimate the parameters of a mixture model, the basic topic model Probabilistic Latent Semantic Analysis (PLSA), and how Latent Dirichlet Allocation (LDA) extends PLSA.

## Goals and Objectives

After you actively engage in the learning experiences in this module, you should be able to:

* Explain what a mixture of unigram language models is and why using a background language model in a mixture can help “absorb” common words in English.
* Explain what PLSA is and how it can be used to mine and analyze topics in text.
* Explain the general idea of using a generative model for text mining.
* Explain how to compute the probability of observing a word from a mixture model like PLSA.
* Explain the basic idea of the EM algorithm and how it works.
* Explain the main difference between LDA and PLSA. 

## Key Phrases and Concepts

* Mixture model
* Component model
* Constraints on probabilities
* Probabilistic Latent Semantic Analysis (PLSA)
* Expectation-Maximization (EM) algorithm
* E-step and M-step
* Hidden variables
* Hill climbing
* Local maximum
* Latent Dirichlet Allocation (LDA)

## Guiding Questions

### What is a mixture model?

A mixture model is a generative model that combines several component distributions. For topic mining, it adds a second component, the background word distribution, alongside the document's topic word distribution so that common words are penalized in the topic. The background distribution assigns fixed probabilities to the words that are common in a general collection, such as a collection of English documents. The probability of a word is then a weighted mix of its probability under the document's topic distribution and its probability under the background distribution.

### In general, how do you compute the probability of observing a particular word from a mixture model?

Let:

- $\theta_d$ be the unknown topic model, that is, the word distribution of a particular document $d$.
- $\theta_B$ be the known background model.

Constraint:

- $p(\theta_d) + p(\theta_B) = 1$

Then:

- The mixture-model probability of a word $w$ is:
$$p(w) = p(\theta_d) p(w|\theta_d) + p(\theta_B) p(w|\theta_B)$$


![mixture model](images/mixture-model.png)
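
The formula can be checked numerically. The following is a minimal sketch, assuming a two-word vocabulary with toy probabilities; the function name `mixture_word_prob` and all numbers are hypothetical and chosen only for illustration.

```python
def mixture_word_prob(word, p_w_given_d, p_w_given_b, p_theta_b):
    """Return p(w) = p(theta_d) p(w|theta_d) + p(theta_B) p(w|theta_B)."""
    if not 0.0 <= p_theta_b <= 1.0:
        raise ValueError("p_theta_b must be in [0, 1]")
    p_theta_d = 1.0 - p_theta_b  # constraint: p(theta_d) + p(theta_B) = 1
    return (p_theta_d * p_w_given_d.get(word, 0.0)
            + p_theta_b * p_w_given_b.get(word, 0.0))


# Toy distributions (assumed values, not estimated from any corpus).
theta_d = {"text": 0.9, "the": 0.1}
theta_b = {"text": 0.1, "the": 0.9}

p_text = mixture_word_prob("text", theta_d, theta_b, p_theta_b=0.5)
p_the = mixture_word_prob("the", theta_d, theta_b, p_theta_b=0.5)
print(p_text, p_the)
```

Because each component distribution sums to 1 and the mixing weights sum to 1, the mixture probabilities over the vocabulary also sum to 1.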

### What is the general form of the expression for this probability?

Let:

- $d$ be the document.
- $\theta_d$ be the topic model of $d$.
- $\theta_B$ be the background model.

Then:

- The mixture is specified by the set of all its parameters, denoted $\Lambda$, which are the quantities to be estimated:
$$\Lambda = (\{p(w|\theta_d)\}, \{p(w|\theta_B)\}, p(\theta_B), p(\theta_d))$$
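
The two-component expression for $p(w)$ extends directly to more components. As a sketch, assuming $k$ component models $\theta_1, \dots, \theta_k$ (the index $j$ and the count $k$ are notation introduced here for illustration), the general form is a weighted sum over components:

$$p(w) = \sum_{j=1}^{k} p(\theta_j)\,p(w|\theta_j), \qquad \sum_{j=1}^{k} p(\theta_j) = 1$$

Setting $k=2$ with $\theta_1=\theta_d$ and $\theta_2=\theta_B$ recovers the document-plus-background mixture.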

### What does the maximum likelihood estimate of the component word distributions of a mixture model behave like?

Given that:

- Likelihood function, where $x_i$ is the word at position $i$ of $d$, $w_1, \dots, w_M$ are the $M$ distinct words of the vocabulary, and $c(w_i,d)$ is the count of $w_i$ in $d$:
$$\eqalign{
    p(d|\Lambda) &= \prod_{i=1}^{|d|} p(x_i|\Lambda) = \prod_{i=1}^{|d|}[p(\theta_d)p(x_i|\theta_d) + p(\theta_B)p(x_i|\theta_B)]\\
    &= \prod_{i=1}^M[p(\theta_d)p(w_i|\theta_d)+p(\theta_B)p(w_i|\theta_B)]^{c(w_i,d)}
}$$

- The Maximum Likelihood (ML) estimate is the parameter setting that maximizes the likelihood function:
$$\Lambda^* = \arg\max_{\Lambda} \ p(d|\Lambda)$$
subject to the constraints:
$$\sum_{i=1}^M p(w_i|\theta_d) = \sum_{i=1}^M p(w_i|\theta_B) = 1$$
and
$$p(\theta_d) + p(\theta_B) = 1$$
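
The two forms of the likelihood, one product over word positions and one product over distinct words raised to their counts, can be compared in code. This is a minimal sketch in log space, assuming a toy three-word vocabulary; the function names, the document, and all probabilities are hypothetical.

```python
import math
from collections import Counter


def word_prob(word, p_w_given_d, p_w_given_b, p_theta_b):
    """Mixture probability p(w|Lambda) of a single word."""
    return (1.0 - p_theta_b) * p_w_given_d[word] + p_theta_b * p_w_given_b[word]


def log_likelihood_by_position(doc, p_w_given_d, p_w_given_b, p_theta_b):
    """Sum log p(x_i|Lambda) over every word position i = 1..|d|."""
    return sum(math.log(word_prob(x, p_w_given_d, p_w_given_b, p_theta_b))
               for x in doc)


def log_likelihood_by_counts(doc, p_w_given_d, p_w_given_b, p_theta_b):
    """Sum c(w,d) * log p(w|Lambda) over the distinct words of the document.

    Vocabulary words absent from the document have count 0 and add nothing.
    """
    counts = Counter(doc)  # c(w, d)
    return sum(c * math.log(word_prob(w, p_w_given_d, p_w_given_b, p_theta_b))
               for w, c in counts.items())


# Toy values (assumed for illustration only).
doc = ["the", "text", "mining", "the", "text"]
theta_d = {"the": 0.2, "text": 0.5, "mining": 0.3}
theta_b = {"the": 0.8, "text": 0.1, "mining": 0.1}

ll_pos = log_likelihood_by_position(doc, theta_d, theta_b, p_theta_b=0.5)
ll_cnt = log_likelihood_by_counts(doc, theta_d, theta_b, p_theta_b=0.5)
print(ll_pos, ll_cnt)
```

Working with the logarithm turns the product into a sum, which avoids numerical underflow on long documents and does not change where the maximum lies.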

The ML estimate can be understood through two observations:

1. When the background distribution and the mixing weights are known, the likelihood depends only on the unknown distribution $\theta_d$, so $\theta_d$ can be solved for from the known quantities. For example:
    - Consider a document made of the two words *text* and *the*, each occurring once.
    - The background distribution ($\theta_B$) is:
    $$\eqalign{
        p("text"|\theta_B) &= 0.1\\
        p("the"|\theta_B) &= 0.9
    }$$
    - The mixing weights are:
    $$\eqalign{
        p(\theta_d) &= 0.5\\
        p(\theta_B) &= 0.5
    }$$
    - The likelihood is a product of two terms, each linear in the unknown probabilities:
    $$\eqalign{
        p(d|\Lambda) &= p("text"|\Lambda) p("the"|\Lambda)\\
                     &= [0.5*p("text"|\theta_d)+0.5*0.1]*[0.5*p("the"|\theta_d)+0.5*0.9]
    }$$
<br>
2. The maximum follows from a simple algebraic rule:
    - The rule: if $x+y = constant$, then $xy$ reaches its maximum when $x=y$.
    - Because $p("text"|\theta_d)+p("the"|\theta_d)=1$ and the background probabilities also sum to 1, the two bracketed terms of the likelihood add up to the constant $0.5+0.5=1$, so the rule applies:
    $$0.5*p("text"|\theta_d)+0.5*0.1 = 0.5*p("the"|\theta_d)+0.5*0.9$$
    - Combined with $p("text"|\theta_d)+p("the"|\theta_d)=1$, this gives:
    $$p("text"|\theta_d)=0.9 \gg p("the"|\theta_d)=0.1$$
    - The two distributions therefore rank the words in opposite order: when the words occur equally often, as here, if $p(w_1|\theta_B)>p(w_2|\theta_B)$, then $p(w_1|\theta_d)<p(w_2|\theta_d)$.
<br><br>
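
The worked example can be verified by brute force. The sketch below, assuming the same background probabilities and mixing weights as the example, evaluates the likelihood on a grid of candidate values for $p("text"|\theta_d)$ and keeps the best one. The function name `likelihood` and the grid resolution are choices made here for illustration.

```python
def likelihood(p_text_d, p_theta_b=0.5, p_text_b=0.1, p_the_b=0.9):
    """Likelihood of the two-word document "text the" under the mixture."""
    p_the_d = 1.0 - p_text_d      # the two topic probabilities sum to 1
    p_theta_d = 1.0 - p_theta_b   # the two mixing weights sum to 1
    term_text = p_theta_d * p_text_d + p_theta_b * p_text_b
    term_the = p_theta_d * p_the_d + p_theta_b * p_the_b
    return term_text * term_the


# Candidate values 0.000, 0.001, ..., 1.000 for p("text"|theta_d).
grid = [i / 1000 for i in range(1001)]
best_p_text = max(grid, key=likelihood)
print(best_p_text, 1.0 - best_p_text)
```

The grid maximum agrees with the algebraic solution $p("text"|\theta_d)=0.9$, where both bracketed terms equal 0.5.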

The ML estimate behaves as follows:

1. Higher-frequency words get a higher $p(w|\theta_d)$. If a word occurs more frequently in the observed text, the estimate pushes the unknown distribution $\theta_d$ to assign a somewhat higher probability to that word.
2. The component distributions tend to put high probability on different words: for two words $w_i$ and $w_j$ with similar counts, if $p(w_i|\theta_B)>p(w_j|\theta_B)$, then the estimate tends to give $p(w_i|\theta_d)<p(w_j|\theta_d)$. Each distribution “bets” its probability on the words that the other explains poorly.
3. The mixing weights $p(\theta_d)$ and $p(\theta_B)$ regulate the collaboration and competition between the component models.
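
The first and third behaviors can be observed on the two-word example by changing the word counts and the background weight. This is a minimal sketch using a grid search rather than EM; the function name `best_p_the_given_d`, the counts, and the weights tried are assumptions made for illustration.

```python
import math


def best_p_the_given_d(count_text, count_the, p_theta_b=0.5,
                       p_text_b=0.1, p_the_b=0.9, steps=1000):
    """Grid-search the ML estimate of p("the"|theta_d) for a two-word vocabulary."""
    p_theta_d = 1.0 - p_theta_b

    def log_likelihood(p_the_d):
        p_text_d = 1.0 - p_the_d
        return (count_text * math.log(p_theta_d * p_text_d + p_theta_b * p_text_b)
                + count_the * math.log(p_theta_d * p_the_d + p_theta_b * p_the_b))

    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=log_likelihood)


# Behavior 1: more occurrences of "the" raise p("the"|theta_d).
by_count = [best_p_the_given_d(1, c) for c in (1, 2, 5)]

# Behavior 3: a larger background weight p(theta_B) lowers p("the"|theta_d),
# because the background model already explains the common word.
by_weight = [best_p_the_given_d(1, 1, p_theta_b=w) for w in (0.3, 0.5, 0.7)]
print(by_count, by_weight)
```

The first list increases with the count of *the*, and the second decreases as the background weight grows.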

### In what sense do they “collaborate” and/or “compete”?

The behaviors of the Maximum Likelihood estimate of a mixture model show that the component word distributions both collaborate and compete:

- The two distributions **collaborate** to maximize the likelihood of the observed data. The probability of each word is the weighted sum of its probabilities under $\theta_d$ and $\theta_B$, with $p(\theta_d)+p(\theta_B)=1$, so a word is explained well as long as at least one component gives it a high probability.
- If one distribution assigns a higher probability to word $w_i$ than to word $w_j$, the other tends to do the opposite and assign a higher probability to $w_j$ than to $w_i$. In this sense the two distributions **compete** over which one accounts for a particular word.

### Why can we use a fixed background word distribution to force a discovered topic word distribution to reduce its probability on the common (often non-content) words? 

### What is the basic idea of the EM algorithm? What does the E-step typically do? What does the M-step typically do? In which of the two steps do we typically apply the Bayes rule? Does EM converge to a global maximum?

### What is PLSA? How many parameters does a PLSA model have? How is this number affected by the size of our data set to be mined? How can we adjust the standard PLSA to incorporate a prior on a topic word distribution? 

### How is LDA different from PLSA? What is shared by the two models? 

## Additional Readings and Resources


* C. Zhai and S. Massung, Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. ACM and Morgan & Claypool Publishers, 2016. Chapter 17.
* Blei, D. 2012. Probabilistic Topic Models. Communications of the ACM 55 (4): 77–84. doi: 10.1145/2133806.2133826.
* Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai. Automatic Labeling of Multinomial Topic Models. Proceedings of ACM KDD 2007, pp. 490-499, DOI=10.1145/1281192.1281246.
* Yue Lu, Qiaozhu Mei, and Chengxiang Zhai. 2011. Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Information Retrieval, 14, 2 (April 2011), 178-203. doi: 10.1007/s10791-010-9141-9.