# Topic Modeling

[1. Bag-of-Words in More Detail](#1)<br>
[2. Latent Variables](#2)<br>
[3. Matrix Representation of Latent Dirichlet Allocation](#3)<br>
[4. Beta Distribution](#4)<br>
[5. Dirichlet Distribution](#5)<br>
[6. More on Latent Dirichlet Allocation](#6)<br>
[7. Sample a Topic](#7)<br>
[8. Sample a Word](#8)<br>
[9. Combining the Models](#9)<br>
[10. Topic Modeling Lab](#10)<br>

## References
In this section, we'll be following this article by David Blei, Andrew Ng, and Michael Jordan.
* [Latent Dirichlet Allocation](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)

# <a id='1'>1: Bag-of-Words (BoW) in More Detail</a>

If you think about the BoW model graphically, it represents the relationship between a set of document objects and a set of word objects.

Assume we have the article, *Space exploration. A vote to explore space has been explored*, and that we have done a good job processing the text (case normalization, stemming, lemmatization, etc.).
* There are three main terms: **space**, **vote**, and **explore**
* To find the probability of each term appearing in the article, we divide the count of each term by the total number of terms
* We have three parameters - probabilities for each term ( $p(\text{space|article})$, $p(\text{vote|article})$, $p(\text{explore|article})$ )
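As a minimal sketch of the calculation above, the snippet below counts terms and divides by the total. The `tokens` list is an assumption: it is what the example article might look like after preprocessing, not the output of any particular library.

```python
from collections import Counter

# Hypothetical result of preprocessing the example article
# (lowercased, stop words removed, lemmatized). The exact tokens are an assumption.
tokens = ["space", "explore", "vote", "explore", "space", "explore"]

counts = Counter(tokens)
total = sum(counts.values())

# P(t | article) = count of term t / total number of terms
term_probs = {term: count / total for term, count in counts.items()}

for term, prob in sorted(term_probs.items()):
    print(f"p({term}|article) = {prob:.3f}")
```

The three probabilities sum to 1, so they form a distribution over the terms of this one document.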

To add some notation:
* d: documents (units of groups of terms to be analyzed)
* t: terms (elements that compose documents)
* P(t|d): probability of a term appearing in the document ("For any given document, $d$, and observed term, $t$, how likely is it that the document $d$ generated the term $t$")

<img src="assets/images/03/img_01.png" width=700 align='center'>

Now, if we do this for many documents, say 500, and many terms, say 1,000, we can get something of the sort:

<img src="assets/images/03/img_02.png" width=700 align='center'>

With 500 documents and 1,000 terms, we have 500 × 1,000 = 500,000 parameters, which is a lot to estimate. We can reduce the number of parameters and still keep most of the information by representing the terms in a latent space. This is commonly known as **topic modeling**.

# <a id='2'>2: Latent Variables</a>

Consider adding to the model the notion of a small set of topics or latent variables (or themes) that actually drive the generation of words in each document. So in this model, any document is considered to have an underlying mixture of topics associated with it. Similarly, a topic is considered to be a mixture of terms that it is likely to generate.

If we take our **documents** and our **terms**, and assert that there are a number of **topics**, say 3, then we have two sets of probability distributions:
1. $p(\text{z|d})$: topic-document probability (probability of topic $z$ given a document $d$)
2. $p(\text{t|z})$: term-topic probability (probability of a term $t$ given a topic $z$)

Our new probability of a term given a document, $p(\text{t|d})$, can be expressed as a sum over topics of the product of the two previous probabilities:
$$P\left( t \mid d \right) = \sum_{z} P\left( t \mid z \right) \cdot P\left( z \mid d \right)$$

<img src="assets/images/03/img_03.png" width=700 align='center'>
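The sum over topics can be written as a matrix product. The sketch below uses NumPy with small, randomly generated distributions; the sizes, the seed, and the helper `random_distributions` are illustrative assumptions, not values from a fitted model.

```python
import numpy as np

def random_distributions(rng, n_rows, n_cols):
    """Hypothetical helper: random non-negative rows, each normalized to sum to 1."""
    m = rng.random((n_rows, n_cols))
    return m / m.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_docs, n_topics, n_terms = 4, 3, 5  # toy sizes chosen for illustration

p_z_given_d = random_distributions(rng, n_docs, n_topics)   # row d holds P(z | d)
p_t_given_z = random_distributions(rng, n_topics, n_terms)  # row z holds P(t | z)

# P(t | d) = sum over z of P(t | z) * P(z | d), for every (d, t) pair at once
p_t_given_d = p_z_given_d @ p_t_given_z

print(p_t_given_d.shape)        # (n_docs, n_terms)
print(p_t_given_d.sum(axis=1))  # each row is still a probability distribution
```

Because each row of both factors sums to 1, each row of the product also sums to 1, so every document still has a valid distribution over terms.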

Now, the number of parameters is: (number of documents * number of topics) + (number of topics * number of terms)
> * 500 documents, 10 topics, 1,000 terms: (500 * 10) + (10 * 1,000) = 15,000
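
To make the comparison concrete, here is a small sketch of the two parameter counts. The function names are hypothetical and exist only for this illustration.

```python
def bow_parameters(n_docs, n_terms):
    """One P(t | d) per document-term pair."""
    return n_docs * n_terms

def topic_model_parameters(n_docs, n_topics, n_terms):
    """P(z | d) for each document-topic pair plus P(t | z) for each topic-term pair."""
    return n_docs * n_topics + n_topics * n_terms

n_docs, n_topics, n_terms = 500, 10, 1_000

direct = bow_parameters(n_docs, n_terms)                    # 500,000
latent = topic_model_parameters(n_docs, n_topics, n_terms)  # 15,000

print(f"direct: {direct:,}  latent: {latent:,}  reduction: {direct / latent:.1f}x")
```

The saving grows with the size of the corpus and vocabulary, as long as the number of topics stays small relative to both.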