# Topic Modeling

[1. Bag-of-Words in More Detail](#1)<br>
[2. Latent Variables](#2)<br>
[3. Matrix Representation of Latent Dirichlet Allocation](#3)<br>

> [3.1: Picking Topics](#3.1)<br>

[4. Beta Distribution](#4)<br>
[5. Dirichlet Distribution](#5)<br>
[6. More on Latent Dirichlet Allocation](#6)<br>
[7. Sample a Topic](#7)<br>
[8. Sample a Word](#8)<br>
[9. Combining the Models](#9)<br>
[10. Topic Modeling Lab](#10)<br>

## References
In this section, we'll be following this article by David Blei, Andrew Ng, and Michael Jordan.
* [Latent Dirichlet Allocation](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)

# <a id='1'>1: Bag-of-Words (BoW) in More Detail</a>

If you think about the BoW model graphically, it represents the relationship between a set of document objects and a set of word objects.

Assume we have the article, *Space exploration. A vote to explore space has been explored*, and that we have done a good job processing the text (case normalization, stemming, lemmatization, etc.).
* There are three main terms: **space**, **vote**, and **explore**
* To find the probability of each term appearing in the article, we divide the count of each term by the total number of terms
* We have three parameters - probabilities for each term ( $p(\text{space|article})$, $p(\text{vote|article})$, $p(\text{explore|article})$ )

To add some notation:
* d: documents (units of groups of terms to be analyzed)
* t: terms (elements that compose documents)
* P(t|d): probability of a term appearing in the document ("For any given document, $d$, and observed term, $t$, how likely is it that the document $d$ generated the term $t$")
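
As a minimal sketch of the calculation above, the snippet below computes $P(t|d)$ for the example article. The token list is an assumption about what the preprocessing step would produce; a real pipeline may tokenize differently.

```python
from collections import Counter

# Assumed output of preprocessing the example article
# (lowercased, lemmatized, stop words removed). Illustrative only.
tokens = ["space", "explore", "vote", "explore", "space", "explore"]

counts = Counter(tokens)
total = sum(counts.values())

# P(t|d): count of each term divided by the total number of terms
p_term_given_doc = {term: count / total for term, count in counts.items()}

for term, prob in sorted(p_term_given_doc.items()):
    print(f"P({term}|article) = {prob:.3f}")
```

The three probabilities sum to one, which is what lets us treat them as a distribution over terms for this document.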

<img src="assets/images/03/img_01.png" width=700 align='center'>

Now, if we do this for many documents, say 500, and many terms, say 1,000, we can get something of the sort:

<img src="assets/images/03/img_02.png" width=700 align='center'>

With 500 documents and 1,000 terms, we have 500 × 1,000 = 500,000 parameters, which is a lot to figure out. We can reduce the number of parameters and still keep most of the information by representing the terms in a latent space. This is commonly known as **topic modeling**.

# <a id='2'>2: Latent Variables</a>

Consider adding to the model the notion of a small set of topics or latent variables (or themes) that actually drive the generation of words in each document. So in this model, any document is considered to have an underlying mixture of topics associated with it. Similarly, a topic is considered to be a mixture of terms that it is likely to generate.

If we take our **documents**, our **terms**, and assert there are a number of **topics**, say 3, then we have two sets of probability distributions:
1. $p(\text{z|d})$: topic-document probability (probability of topic $z$ given a document $d$)
2. $p(\text{t|z})$: term-topic probability (probability of a term $t$ given a topic $z$)

Our new probability of a term given a document, $p(\text{t|d})$, can be expressed as a sum over topics of the product of the two previous probabilities:
$$P\left( t \mid d \right) = \sum_{z} P\left( t \mid z \right) \cdot P\left( z \mid d \right)$$
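
To make the sum concrete, here is a small sketch for a single document and a single term with three topics. All probability values are made up for illustration.

```python
import numpy as np

# Hypothetical values for one document d and one term t, with 3 topics.
p_topic_given_doc = np.array([0.7, 0.1, 0.2])      # P(z|d), sums to 1
p_term_given_topic = np.array([0.05, 0.30, 0.01])  # P(t|z) for the same term t

# P(t|d) = sum over z of P(t|z) * P(z|d)
p_term_given_doc = float(np.sum(p_term_given_topic * p_topic_given_doc))
print(p_term_given_doc)
```

Each topic contributes to the term's probability in proportion to how much of the document that topic accounts for.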

<img src="assets/images/03/img_03.png" width=700 align='center'>

Now, the number of parameters is: (number of documents * number of topics) + (number of topics * number of terms)
* 500 documents, 10 topics, 1,000 terms: (500 * 10) + (10 * 1,000) = 15,000
> * Note: same number of documents and terms as before, but far fewer parameters than 500,000!
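
The parameter counts above can be checked with a couple of helper functions. The function names are hypothetical and only exist for this illustration.

```python
def bow_parameter_count(n_docs, n_terms):
    # One probability per (document, term) pair
    return n_docs * n_terms


def topic_model_parameter_count(n_docs, n_topics, n_terms):
    # (document, topic) probabilities plus (topic, term) probabilities
    return n_docs * n_topics + n_topics * n_terms


bow_count = bow_parameter_count(500, 1_000)
topic_count = topic_model_parameter_count(500, 10, 1_000)
print(bow_count, topic_count)
```

Note that the saving depends on keeping the number of topics small relative to the number of documents and terms.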

This is the idea behind **Latent Dirichlet Allocation**, or LDiA for short.

# <a id='3'>3: Matrix Representation of Latent Dirichlet Allocation</a>

LDiA can be viewed as an example of matrix factorization.

The idea is as follows:<br>
<img src="assets/images/03/img_04.png" width=700 align='center'>

* We go from a BoW model to an LDiA model
> * In the BoW model on the left, the probability that, say, the word 'tax' is generated by the second document is the label on the white arrow
> * In the LDiA model on the right, that probability is calculated along the white arrows: for each topic $z$ (say 'politics'), multiply the probability of the term $t$ (say 'tax') given that topic by the probability of the topic $z$ given the document $d$, then add the products over all topics

Then, you can have a BoW matrix, composed of terms as columns and documents as rows, like on the bottom left, equal to, or represented by, the product of two matrices:
1. tall skinny matrix of documents as rows and topics as columns
2. wide flat matrix of topics as rows and terms as columns

In this case, the entry of the second document for the term 'tax' will be equal to the inner product of the corresponding row and column in the matrices on the right.
> * If the matrices are big, say 500 documents and 1,000 terms, the BoW matrix has 500,000 elements ($m \times n$ = 500 by 1,000)
> * The two matrices in the topic model combined have only 15,000 elements: with $k = 10$ topics, an $m \times k$ matrix (500 by 10) times a $k \times n$ matrix (10 by 1,000) gives a matrix of size 500 by 1,000, the same size as the original

Aside from being much simpler, the LDiA model has a huge advantage: it gives us a set of topics along which we can group documents. In this example, we are asserting they are *science, politics, and sports*, but in reality the algorithm just produces some unnamed topics, and it is up to us to look at the associated words and decide what they have in common.

For these examples, we'll keep asserting these 3 topics, but think of them instead as *topic 1, topic 2, and topic 3*

The LDiA model is represented, as before, as:
$$P\left( t \mid d \right) = \sum_{z} P\left( t \mid z \right) \cdot P\left( z \mid d \right)$$

## Matrix Multiplication

The idea for building our LDiA model will be to factor our BoW matrix into two matrices, one of documents by topics and the other of topics by terms.

<img src="assets/images/03/img_05.png" width=700 align='center'>

Recall how we built our BoW matrix: identify the terms and the number of times they appear in a specific document and divide by the sum of terms in that document to get the probabilities/frequencies:

<img src="assets/images/03/img_06.png" width=700 align='center'>

For our **document-topic matrix**, we proceed as follows.

Take a document, say `document 3` (or doc 3), that is mostly about science and a bit about sports and politics: perhaps 70% science, 10% politics, and 20% sports. We record these values in the corresponding row of the **document-topic matrix**:

<img src="assets/images/03/img_07.png" width=700 align='center'>

For the **topic-term matrix**, we take a similar approach. Start with a topic, say politics, and suppose we can figure out the probability of each term being generated by this topic. These probabilities sum to one across all terms, and we place them in the corresponding row of the **topic-term matrix**:

<img src="assets/images/03/img_08.png" width=700 align='center'>

From these two matrices, the product of them together will approximate the BoW matrix!

<img src="assets/images/03/img_09.png" width=700 align='center'>
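
The product of the two matrices can be sketched with NumPy as follows. The matrix entries are invented for illustration (2 documents, 3 topics, 4 terms); only the shapes and the row-sum property matter here.

```python
import numpy as np

# Hypothetical document-topic matrix: 2 documents x 3 topics
# (columns: science, politics, sports). Each row sums to 1.
doc_topic = np.array([
    [0.7, 0.1, 0.2],
    [0.1, 0.8, 0.1],
])

# Hypothetical topic-term matrix: 3 topics x 4 terms. Each row sums to 1.
topic_term = np.array([
    [0.4, 0.4, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.2],
    [0.3, 0.1, 0.1, 0.5],
])

# Approximate BoW matrix: 2 documents x 4 terms
approx_bow = doc_topic @ topic_term
print(approx_bow)
print(approx_bow.sum(axis=1))  # each row is again a probability distribution
```

Because both factors have rows that sum to one, every row of the product also sums to one, so each row is a valid distribution of terms for a document.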

But we haven't gone into depth about HOW to calculate the entries in these matrices. One way is to use a traditional [*matrix factorization* algorithm](https://developers.google.com/machine-learning/recommendation/collaborative/matrix). However, these matrices are special: each row sums to one, and there is a meaningful amount of structure coming from the set of documents, topics, and words.

What we'll do is something more elaborate than plain matrix factorization. The basic idea is that the entries in the two topic modeling matrices come from special distributions. So, we'll embrace this fact and work with these distributions to find these two matrices!

## <a id='3.1'>3.1: Picking Topics</a>

Pretend you are at a party in a triangular room. There are people roaming around the room. In each of the corners, there are different things happening. In one corner there is food, in another there is dessert, and in the last there is music.

<img src="assets/images/03/img_10.png" width=700 align='center'>

People naturally get drawn to these corners depending on whether they prefer food, dessert, or music. Or perhaps they are undecided and position themselves at an equal distance from, say, food and dessert. Either way, they mostly walk away from the blue areas and toward the red areas.

<img src="assets/images/03/img_11.png" width=700 align='center'>

Now, imagine the alternative. We are still at a party, but now in the corners, there is a lion, fire, and radioactive material. 

<img src="assets/images/03/img_12.png" width=700 align='center'>

Now, people will do the opposite of what they did when there were desirable things in the corners; they will move away from the corners. They will gravitate toward the center.

<img src="assets/images/03/img_13.png" width=700 align='center'>

So now, we have three scenarios:
1. We place nice things in the corners
2. We put nothing in the corners
3. We place bad things in the corners

<img src="assets/images/03/img_14.png" width=700 align='center'>

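In the above three scenarios, we can think of the parameters at the corners as *repelling factors*: if they are large (greater than $1$), the points are pushed away from the corners toward the center; if they are small (less than $1$), the points are drawn toward the corners; and if they equal $1$, the points are spread evenly over the triangle.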
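
One way to see the three regimes numerically is to sample from symmetric Dirichlet distributions and measure how dominant the largest component of each sample is. This is a rough sketch: the parameter values 0.1, 1.0, and 10.0 and the helper name `mean_largest_share` are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def mean_largest_share(alpha, n_samples=5000):
    """Average size of the largest component across Dirichlet samples."""
    # Symmetric Dirichlet over 3 corners, all with the same parameter
    samples = rng.dirichlet([alpha] * 3, size=n_samples)
    return samples.max(axis=1).mean()


for alpha in (0.1, 1.0, 10.0):
    print(f"alpha = {alpha}: mean largest share = {mean_largest_share(alpha):.3f}")
```

A value near 1 means most points sit close to a corner, while a value near 1/3 means most points sit close to the center of the triangle.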

As an example, given the following three Dirichlet distributions, which of the three is most likely to generate the topics in our model?

<img src="assets/images/03/img_15.png" width=700 align='center'>

Answer: **Left**
> * If we randomly select a point from each of the three distributions, the point from the left one is the most likely to be strongly associated with a single topic<br>
> * Most articles are represented strongly by one topic and perhaps weakly by the others, so the most useful distribution is the one that lets us clearly distinguish the primary topic of an article

<img src="assets/images/03/img_16.png" width=700 align='center'>

So, for our LDiA model, we will pick a Dirichlet Distribution with small parameters $\alpha$, such as $\overrightarrow{\alpha}=\{0.7,0.7,0.7\}$, and from here we'll sample a few points to be our documents. Each point gives us a mixture of probabilities $\overrightarrow{\theta}$ that will characterize the distribution of topics for that particular document.

<img src="assets/images/03/img_17.png" width=700 align='center'>
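
As a minimal sketch of this sampling step, the snippet below draws a topic mixture $\overrightarrow{\theta}$ for five hypothetical documents using NumPy. The topic names, the number of documents, and the random seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

alpha = np.array([0.7, 0.7, 0.7])          # small Dirichlet parameters
topics = ["science", "politics", "sports"]  # illustrative topic names

# One row per document; each row is a topic mixture theta for that document
theta = rng.dirichlet(alpha, size=5)

for i, row in enumerate(theta, start=1):
    dominant = topics[int(np.argmax(row))]
    print(f"doc {i}: theta = {np.round(row, 2)}, dominant topic = {dominant}")
```

Each row of `theta` plays the role of one row of the document-topic matrix.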

In 3D, the Dirichlet distributions look like this:

<img src="assets/images/03/img_18.png" width=700 align='center'>

This shows that the probability of picking a point on the triangle depends on the height of the probability distribution at that point. On the left, the distribution is highest at the corners, where a single topic is strongest, so we prefer the one on the left!

# <a id='4'>4: Beta Distribution</a>

Let's think about probability distributions.

Assume we have a coin and we toss it twice. The outcomes are 1 heads and 1 tails. What do we think about this coin? It could be a fair coin, or it could be biased toward heads or tails, but we don't have enough data to be sure. To continue the thought experiment, let's say that we think it's fair, but not with much confidence.

So, the probability distribution could look something like this - higher at $\frac{1}{2}$ but a bit *even* over the entire interval:

<img src="assets/images/03/img_19.png" width=700 align='center'>

Now, let's say we toss the coin 20 times and we get 10 heads and 10 tails. We feel more confident that the coin is fair. The probability distribution may look something more like this:

<img src="assets/images/03/img_20.png" width=700 align='center'>

But what if we toss the coin 4 times and get heads 3 times and tails once? Our estimate of the probability of heads is $\frac{3}{4}$, but we don't have much confidence in it. We might have a distribution like this:

<img src="assets/images/03/img_21.png" width=700 align='center'>

But if we toss it 400 times and get 300 heads and 100 tails, we become more confident in the coin being biased toward heads and may get a probability distribution like this:

<img src="assets/images/03/img_22.png" width=700 align='center'>

This is called the **Beta distribution**, and it is defined for any positive values $a$ and $b$:

<img src="assets/images/03/img_23.png" width=700 align='center'>

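The density of the Beta distribution, where $x$ is the probability of heads and $y = 1 - x$ is the probability of tails, is:

$$\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}x^{a-1}y^{b-1}$$

Here $\Gamma$ is the **gamma function**, which can be thought of as a continuous version of the factorial function. For a positive integer $a$:

$$\Gamma(a)=(a-1)!$$

When $a$ is a positive integer, we get the usual factorial. But if $a$ is not an integer, and is instead some other positive real number (a float), the gamma function still gives a value that sits between the neighboring factorials, something of the sort: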

<img src="assets/images/03/img_24.png" width=700 align='center'>
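
The formula above can be sketched with Python's standard library, which provides `math.gamma`. The parameter values used here are illustrative and are not taken from the figures.

```python
import math


def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at x, for 0 < x < 1 and a, b > 0."""
    normalizer = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return normalizer * x ** (a - 1) * (1 - x) ** (b - 1)


# For positive integers, gamma(n) equals (n - 1)!
print(math.gamma(5), math.factorial(4))

# For non-integers, gamma still returns a value between neighboring factorials
print(math.gamma(4.5))

# Larger, balanced parameters give a taller peak at 1/2 (more confidence)
print(beta_pdf(0.5, 2, 2), beta_pdf(0.5, 20, 20))
```

One caveat: `math.gamma` overflows for arguments above roughly 171, so very large parameters (such as hundreds of tosses) would need a log-space formulation with `math.lgamma` instead.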

So, if we have something like $0.1$ for heads and $0.4$ for tails, which makes no sense as counts of coin tosses, we can still plug these values into the Beta distribution, because the gamma function is defined for non-integer values. The resulting distribution looks as follows, and it means that $p$ is much more likely to be close to zero or close to one than to be somewhere in the middle.

<img src="assets/images/03/img_25.png" width=700 align='center'>

This makes a bit of sense: if $p$ is close to 0 or 1, then we expect to see almost no heads or almost no tails, which is at least consistent with "counts" as small as $0.1$ and $0.4$.
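
A quick simulation can illustrate this behavior. The sketch below uses `random.betavariate` from the standard library with the parameters $0.1$ and $0.4$. The sample size, the seed, and the 0.1 / 0.9 cutoffs for "close to zero or one" are arbitrary choices for illustration.

```python
import random

random.seed(0)

# Draw many values of p from a Beta distribution with parameters below 1
samples = [random.betavariate(0.1, 0.4) for _ in range(10_000)]

# Fraction of samples that land near either end of the interval [0, 1]
near_ends = sum(1 for p in samples if p < 0.1 or p > 0.9) / len(samples)
print(f"fraction of samples near 0 or 1: {near_ends:.2f}")
```

Most of the samples fall near the two ends of the interval, matching the U-shaped curve in the figure.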