# Topic Modeling

[1. Bag-of-Words in More Detail](#1)<br>
[2. Latent Variables](#2)<br>
[3. Matrix Representation of Latent Dirichlet Allocation](#3)<br>
[4. Beta Distribution](#4)<br>
[5. Dirichlet Distribution](#5)<br>
[6. More on Latent Dirichlet Allocation](#6)<br>
[7. Sample a Topic](#7)<br>
[8. Sample a Word](#8)<br>
[9. Combining the Models](#9)<br>
[10. Topic Modeling Lab](#10)<br>

## References
In this section, we'll be following this article by David Blei, Andrew Ng, and Michael Jordan.
* [Latent Dirichlet Allocation](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)

# <a id='1'>1: Bag-of-Words (BoW) in More Detail</a>

If you think about the BoW model graphically, it represents the relationship between a set of document objects and a set of word objects.

Assume we have the article, *Space exploration. A vote to explore space has been explored*, and that we have done a good job preprocessing the text (case folding, stemming, lemmatization, etc.).
* There are three main terms: **space**, **vote**, and **explore**
* To find the probability of each term appearing in the article, we divide the count of each term by the total number of terms
* We have three parameters - probabilities for each term ( $p(\text{space|article})$, $p(\text{vote|article})$, $p(\text{explore|article})$ )
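
As a minimal sketch of the counting step above, the snippet below assumes the article has already been reduced to the token list shown (the exact tokens depend on the preprocessing pipeline, so treat `tokens` as an assumption rather than the output of a specific tool).

```python
from collections import Counter

# Assumed result of preprocessing the example article
# (lowercased, stemmed/lemmatized, stop words removed).
tokens = ["space", "explore", "vote", "explore", "space", "explore"]

counts = Counter(tokens)
total = sum(counts.values())

# P(t | article) = (count of term t) / (total number of terms)
p_term_given_article = {term: count / total for term, count in counts.items()}

for term, prob in p_term_given_article.items():
    print(f"p({term}|article) = {prob:.3f}")
```

With these tokens, the three probabilities are 2/6 for **space**, 1/6 for **vote**, and 3/6 for **explore**, and they sum to one.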

To add some notation:
* d: documents (units of groups of terms to be analyzed)
* t: terms (elements that compose documents)
* P(t|d): probability of a term appearing in the document ("For any given document, $d$, and observed term, $t$, how likely is it that the document $d$ generated the term $t$")

<img src="assets/images/03/img_01.png" width=700 align='center'>

Now, if we do this for many documents, say 500, and many terms, say 1,000, we get something like this:

<img src="assets/images/03/img_02.png" width=700 align='center'>

With 500 documents and 1,000 terms, that is 500 * 1,000 = 500,000 parameters, which is a lot to estimate. We can reduce the number of parameters and still keep most of the information by representing the terms in a latent space. This is commonly known as **topic modeling**.

# <a id='2'>2: Latent Variables</a>

Consider adding to the model the notion of a small set of topics or latent variables (or themes) that actually drive the generation of words in each document. So in this model, any document is considered to have an underlying mixture of topics associated with it. Similarly, a topic is considered to be a mixture of terms that it is likely to generate.

If we take our **documents** and our **terms**, and assert there are a number of **topics**, say 3, then we have two sets of probability distributions:
1. $p(\text{z|d})$: topic-document probability (probability of topic $z$ given a document $d$)
2. $p(\text{t|z})$: term-topic probability (probability of a term $t$ given a topic $z$)

Our new probability of a term given a document, $p(\text{t|d})$, can be expressed as a sum over topics of the product of the two previous probabilities:
$$P\left( t \mid d \right) = \sum_{z} P\left( t \mid z \right) \cdot P\left( z \mid d \right)$$

<img src="assets/images/03/img_03.png" width=700 align='center'>
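
To make the sum over topics concrete, here is a small sketch for a single term and a single document. The topic labels `z1`, `z2`, `z3` and all probability values are made up for illustration.

```python
# Hypothetical P(z | d): how much of document d is about each topic (sums to 1).
p_topic_given_doc = {"z1": 0.5, "z2": 0.3, "z3": 0.2}

# Hypothetical P(t | z): how likely each topic is to generate one fixed term t.
p_term_given_topic = {"z1": 0.10, "z2": 0.02, "z3": 0.40}

# P(t | d) = sum over z of P(t | z) * P(z | d)
p_term_given_doc = sum(
    p_term_given_topic[z] * p_topic_given_doc[z] for z in p_topic_given_doc
)

print(round(p_term_given_doc, 3))
```

Here the three contributions are 0.05, 0.006, and 0.08, so the term's probability in this document is 0.136.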

Now, the number of parameters is: (number of documents * number of topics) + (number of topics * number of terms)
* 500 documents, 10 topics, 1,000 terms: (500 * 10) + (10 * 1,000) = 15,000
> * Note: same number of documents and terms as before, but far fewer parameters than 500,000!
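
The arithmetic above can be checked with a couple of helper functions. The function names are hypothetical and exist only for this illustration.

```python
def bow_parameter_count(n_docs, n_terms):
    """One probability P(t | d) per (document, term) pair."""
    return n_docs * n_terms


def topic_model_parameter_count(n_docs, n_topics, n_terms):
    """P(z | d) for every (document, topic) pair plus P(t | z) for every (topic, term) pair."""
    return n_docs * n_topics + n_topics * n_terms


print(bow_parameter_count(500, 1000))              # 500000
print(topic_model_parameter_count(500, 10, 1000))  # 15000
```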

This latent-topic structure is the basis of **Latent Dirichlet Allocation**, or LDiA for short.

# <a id='3'>3: Matrix Representation of Latent Dirichlet Allocation</a>

LDiA is an example of matrix factorization.

The idea is as follows:<br>
<img src="assets/images/03/img_04.png" width=700 align='center'>

* We go from a BoW model to an LDiA model
> * The BoW model on the left says "the probability of, say, the word 'tax' being generated by the second document is the label on the white arrow"
> * In the LDiA model on the right, that probability is calculated along the white arrows: for each topic $z$, say 'politics', multiply the probability of the term $t$, say 'tax', given that topic by the probability of the topic $z$ given the document $d$, then add the results over all topics

Then, you can have a BoW matrix, composed of terms as columns and documents as rows, like on the bottom left, equal to, or represented by, the product of two matrices:
1. tall skinny matrix of documents as rows and topics as columns
2. wide flat matrix of topics as rows and terms as columns

In this case, the entry for the second document and the term 'tax' equals the inner product of the corresponding row and column in the matrices on the right.
> * Suppose the matrices are big, say 500 documents and 1,000 terms, so the BoW matrix has 500,000 elements ($m \times n$ = 500 by 1,000)
> * With $k = 10$ topics, the two matrices in the topic model have only 15,000 elements combined: $(m \times k) \cdot (k \times n)$ = (500 by 10) times (10 by 1,000), and their product has the original size of 500 by 1,000
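
The shapes and sizes above can be sketched with NumPy. The matrix entries here are random placeholders: each row is drawn from a flat Dirichlet distribution only so that it is a valid probability distribution, not because this is how the entries are actually learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_topics, n_terms = 500, 10, 1000

# Placeholder matrices: every row is a probability distribution (sums to 1).
doc_topic = rng.dirichlet(np.ones(n_topics), size=n_docs)    # shape (500, 10)
topic_term = rng.dirichlet(np.ones(n_terms), size=n_topics)  # shape (10, 1000)

# Their product has the shape of the original BoW matrix.
bow_approx = doc_topic @ topic_term                          # shape (500, 1000)

print(doc_topic.size + topic_term.size)  # 15000 stored values
print(bow_approx.size)                   # 500000 implied values

# One entry is the inner product of a document row and a term column.
doc_idx, term_idx = 1, 42  # arbitrary indices for illustration
entry = doc_topic[doc_idx] @ topic_term[:, term_idx]
print(np.isclose(bow_approx[doc_idx, term_idx], entry))

# Rows of the product are still probability distributions over terms.
print(np.allclose(bow_approx.sum(axis=1), 1.0))
```

A useful side effect of both factors having rows that sum to one is that every row of the product also sums to one, so each document still gets a proper distribution over terms.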

Aside from being much simpler, the LDiA model has a huge advantage: it gives us a set of topics that we can use to group documents. In this example, we are asserting they are *science, politics, and sports*, but in reality the algorithm only produces unlabeled topics, and it is up to us to look at the associated words and decide what common theme they share.

For these examples, we'll keep asserting these 3 topics, but think of them instead as *topic 1, topic 2, and topic 3*.

The LDiA model is represented, as before, as:
$$P\left( t \mid d \right) = \sum_{z} P\left( t \mid z \right) \cdot P\left( z \mid d \right)$$

## Matrix Multiplication

The idea for building our LDiA model is to factor our BoW matrix into two matrices, one of documents by topics and the other of topics by terms.

<img src="assets/images/03/img_05.png" width=700 align='center'>

Recall how we built our BoW matrix: identify the terms and the number of times they appear in a specific document and divide by the sum of terms in that document to get the probabilities/frequencies:

<img src="assets/images/03/img_06.png" width=700 align='center'>
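
A minimal sketch of building such a BoW matrix for several documents follows. The three token lists are hypothetical, already-preprocessed documents invented for this example.

```python
import numpy as np

# Hypothetical preprocessed documents (one token list per document).
docs = [
    ["space", "explore", "vote", "explore", "space", "explore"],
    ["vote", "tax", "vote", "tax"],
    ["ball", "space", "ball", "ball"],
]

# Fix a column order for the vocabulary.
vocab = sorted({term for doc in docs for term in doc})
term_index = {term: j for j, term in enumerate(vocab)}

# Count how often each term appears in each document.
counts = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(docs):
    for term in doc:
        counts[i, term_index[term]] += 1

# Divide each row by the document length to get P(t | d).
bow = counts / counts.sum(axis=1, keepdims=True)

print(vocab)
print(bow.round(3))
```

Each row of `bow` is one document's distribution over the vocabulary, so every row sums to one.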

For our **document-topic matrix**, we proceed as follows.

Suppose we have a document, say `document 3` (or doc 3), that is mostly about science and a bit about sports and politics. Maybe it's 70% about science, 10% about politics, and 20% about sports. We record these values in the **document-topic matrix**:

<img src="assets/images/03/img_07.png" width=700 align='center'>

For the **topic-term matrix**, we take a similar approach. Start with a topic, say politics, and suppose we can figure out the probability that each word is generated by this topic. These probabilities must sum to one. We place them into the **topic-term matrix** as such:

<img src="assets/images/03/img_08.png" width=700 align='center'>

The product of these two matrices will approximate the BoW matrix!

<img src="assets/images/03/img_09.png" width=700 align='center'>
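
As a small worked sketch, take the doc 3 mixture from above (70% science, 10% politics, 20% sports) and multiply it by a topic-term matrix. The four terms and every value in `topic_term` are hypothetical and chosen only so that each row sums to one.

```python
import numpy as np

topics = ["science", "politics", "sports"]
terms = ["space", "vote", "tax", "ball"]  # hypothetical vocabulary

# Document-topic row for doc 3: P(z | doc 3).
doc_topic = np.array([[0.7, 0.1, 0.2]])

# Hypothetical topic-term matrix: one row of P(t | z) per topic.
topic_term = np.array([
    [0.80, 0.05, 0.05, 0.10],  # science
    [0.05, 0.45, 0.45, 0.05],  # politics
    [0.10, 0.05, 0.05, 0.80],  # sports
])

# Row of the approximated BoW matrix: P(t | doc 3) for every term.
bow_row = doc_topic @ topic_term

for term, prob in zip(terms, bow_row[0]):
    print(f"P({term} | doc 3) = {prob:.3f}")
```

For instance, the entry for **space** is 0.7 * 0.80 + 0.1 * 0.05 + 0.2 * 0.10 = 0.585, which is the sum over topics from the LDiA formula written as a row-times-column product.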

But we haven't gone into depth about HOW to calculate the entries in these matrices. One way is to use a traditional [*matrix factorization* algorithm](https://developers.google.com/machine-learning/recommendation/collaborative/matrix). However, these matrices are special: each row sums to one, and there is a lot of meaningful structure coming from the set of documents, topics, and words.
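
For comparison, here is a rough sketch of what a generic factorization could look like. It uses non-negative matrix factorization with the standard multiplicative update rules for squared error, followed by a rescaling so that rows sum to one. This is a swapped-in generic technique for illustration only: it is not the LDiA procedure and not necessarily the method described on the linked page, and the synthetic data, sizes, and iteration count are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_topics, n_terms = 20, 3, 30

# Synthetic BoW matrix built from known row-stochastic factors (illustration only).
true_doc_topic = rng.dirichlet(np.ones(n_topics), size=n_docs)
true_topic_term = rng.dirichlet(np.ones(n_terms), size=n_topics)
bow = true_doc_topic @ true_topic_term

# Random positive starting point for the two factors.
W = rng.random((n_docs, n_topics)) + 0.1   # documents x topics
H = rng.random((n_topics, n_terms)) + 0.1  # topics x terms
eps = 1e-12                                # avoids division by zero

initial_error = np.linalg.norm(bow - W @ H)
for _ in range(500):
    # Multiplicative updates keep every entry non-negative.
    H *= (W.T @ bow) / (W.T @ W @ H + eps)
    W *= (bow @ H.T) / (W @ H @ H.T + eps)
final_error = np.linalg.norm(bow - W @ H)

# Move the scale of each topic row from H into W, then normalize rows.
scale = H.sum(axis=1)
topic_term = H / scale[:, None]
doc_topic = W * scale[None, :]
doc_topic /= doc_topic.sum(axis=1, keepdims=True)

print(f"reconstruction error: {initial_error:.4f} -> {final_error:.4f}")
```

Note that the row-sum constraint is only imposed after the fact here, and nothing in the updates uses the structure of documents, topics, and words, which is the limitation the paragraph above points out.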

What we'll do is something more elaborate than plain matrix factorization. The basic idea is that the entries in the two topic modeling matrices come from special distributions. So, we'll embrace this fact and work with these distributions to find these two matrices!