# Background: Introduction to Topic Modelling 
## Theoretical Context and the Dirichlet Distribution
Topic modelling is a powerful tool for identifying the hidden thematic structure in a set of documents. By assuming that every word in a document is generated from a fixed set of topics, we can gain insight into the latent structure of these documents. There are different ways of doing this: probabilistic approaches, one of which we will explore today, and matrix factorization approaches (non-negative matrix approximations), which we won't be covering.

We will explore Latent Dirichlet Allocation (LDA), a specific probabilistic approach to topic modelling. Before we get into LDA, let's take a step back to review the Dirichlet distribution. LDA makes several assumptions, which we will mention as we go through this module, but there is one we can bring up now: LDA assumes a bag-of-words model, meaning that we don't model dependencies between words or dependencies between documents.

Recall:

A standard probability distribution used on the simplex is the *Dirichlet distribution*. The Dirichlet distribution can be defined as follows. 

$$p\left(\theta_{1},\ldots ,\theta_{K};\alpha _{1},\ldots ,\alpha _{K}\right)= \frac{1}{B(\vec{\alpha})}\prod _{i=1}^{K}\theta_{i}^{\alpha _{i}-1}$$

Where $B(\vec{\alpha})$ is just the normalizing constant that makes the distribution sum to (actually integrate to) $1$. 
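To build intuition for how the parameters $\vec{\alpha}$ shape the distribution, here is a minimal sketch that draws samples using only Python's standard library. It relies on the standard construction of a Dirichlet sample as independent Gamma draws normalized to sum to one. The function name `sample_dirichlet` and the specific $\alpha$ values are illustrative choices, not part of the original material.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex from Dirichlet(alpha).

    Uses the normalized-Gamma construction: draw g_i ~ Gamma(alpha_i, 1)
    independently, then divide each g_i by their sum.
    """
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Small alpha values (illustrative): mass tends to pile onto few components.
sparse = sample_dirichlet([0.1, 0.1, 0.1], rng)
# Large alpha values (illustrative): samples tend to sit near uniform.
flat = sample_dirichlet([10.0, 10.0, 10.0], rng)

print("alpha = 0.1:", [round(x, 3) for x in sparse])
print("alpha = 10 :", [round(x, 3) for x in flat])
```

Each sample is a vector of non-negative numbers summing to one, which is why the Dirichlet is a natural prior for a document's mixture over topics.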


### What can we do with topic modelling?
It might be of interest to examine how trends in documents change over time. For example, a study conducted at Brown University by Uriel Cohen Priva, Ph.D., and Joseph Austerweil, Ph.D., analyzed a large dataset of articles from the journal Cognition. They wanted to explore trends within Cognition over four decades. They found several trends in the data, like "the rise of moral cognition, eyetracking methods", etc.

Fun fact: before their applications in computational linguistics, topic models were used in image processing, as well as in the processing of biological data.

# Formalization: Topic Modelling Algorithm 


Topic models let us see how topics (rather than words) trend over time.

Broad process definition: topic models assume that every word in a document is generated by one of a number of topics.

1. A distribution over topics (the gist of a document) is sampled from a Dirichlet distribution.
2. Each word is sampled from the topics of a document.
3. Documents are biased to be more likely to generate some words rather than others.
4. Together, these biases lead the model, when given a corpus of documents, to converge on solutions in which words that are likely to co-occur are generated by the same topic.

Explain the process with a whiteboard demo?
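The generative story above can be sketched as a small simulation. This is a toy illustration under stated assumptions: the two topics, their vocabularies and word probabilities, and the names `sample_dirichlet` and `generate_document` are all hypothetical and chosen only to make the steps concrete.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Generate one document as a list of (word, topic_index) pairs."""
    # Step 1: sample the document's distribution over topics (its "gist").
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Step 2a: pick a topic for this word position according to theta.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Step 2b: pick a word from that topic's distribution over words.
        vocab = list(topics[z].keys())
        probs = list(topics[z].values())
        w = rng.choices(vocab, weights=probs)[0]
        words.append((w, z))
    return theta, words


# Hypothetical topics: each maps words to probabilities that sum to 1.
TOPICS = [
    {"gene": 0.5, "cell": 0.3, "data": 0.2},
    {"model": 0.4, "data": 0.4, "word": 0.2},
]

rng = random.Random(1)
theta, doc = generate_document(TOPICS, alpha=[0.5, 0.5], n_words=8, rng=rng)
print("theta:", [round(t, 3) for t in theta])
print("document:", doc)
```

Because `theta` weights the topics unevenly, the document ends up more likely to contain some words than others, which is the bias described in step 3.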


<img src="topicmodelwords.png">


# Approaches: Latent Dirichlet Allocation: An Approach to Topic Modelling 
note to graham: definitely explain the assumptions LDA makes (documents independent,words independent,



$$\Pr(\beta_{1:K},\theta_{1:D},Z_{1:D},W_{1:D}) = \prod_{i=1}^{K}\Pr(\beta_{i})\prod_{d=1}^{D}\Pr(\theta_{d})\left(\prod_{n=1}^{N}\Pr(Z_{d,n}\mid\theta_{d})\Pr(W_{d,n}\mid\beta_{1:K},Z_{d,n})\right)$$
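To see how the factors of the joint distribution fit together, here is a hedged sketch that evaluates its logarithm for one fixed setting of all the variables. It assumes that each topic $\beta_i$ and each $\theta_d$ has a Dirichlet prior, with hyperparameters named `eta` and `alpha` here. Those names, the data layout (lists indexed by topic, document, and word position), and the toy numbers are illustrative assumptions.

```python
import math


def log_dirichlet_pdf(x, alpha):
    """Log density of Dirichlet(alpha) at a point x on the simplex."""
    log_B = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    return sum((a - 1.0) * math.log(xi) for a, xi in zip(alpha, x)) - log_B


def log_joint(beta, theta, Z, W, eta, alpha):
    """Log of the joint: topic priors, then per-document and per-word terms.

    beta[k][v]  : probability of word id v under topic k
    theta[d][k] : proportion of topic k in document d
    Z[d][n]     : topic assigned to word position n of document d
    W[d][n]     : word id observed at position n of document d
    """
    lp = sum(log_dirichlet_pdf(b, eta) for b in beta)      # prod_i Pr(beta_i)
    for d in range(len(W)):
        lp += log_dirichlet_pdf(theta[d], alpha)           # Pr(theta_d)
        for z, w in zip(Z[d], W[d]):
            lp += math.log(theta[d][z])                    # Pr(Z_dn | theta_d)
            lp += math.log(beta[z][w])                     # Pr(W_dn | beta, Z_dn)
    return lp


# Toy setting: 2 topics, 2 word types, 1 document with 2 words.
beta = [[0.5, 0.5], [0.5, 0.5]]
theta = [[0.5, 0.5]]
Z = [[0, 1]]
W = [[1, 0]]
value = log_joint(beta, theta, Z, W, eta=[1.0, 1.0], alpha=[1.0, 1.0])
print(value)
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents have more than a handful of words.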

<img src="gist.png">
<img src = "plate.png">




## Where are we now, where are we going?

<img src = "summary.png">

# Bayesian Inference Algorithms and LDA

We have a model. How do we figure out, given a bunch of data, the assignments of the hidden random variables?

As we saw above, LDA is purely a generative statistical model. If our goal is to uncover the latent structure of a corpus of documents, then we need a way to invert our generative model. We do this via Bayesian inference. There are several families of inference methods, notably sampling methods and variational Bayes. Today we will look at sampling methods. But before we get into the particular approach, let's examine why we need such methods.

How might we compute the conditional distribution of the topic structure, given the observed documents? This involves (as we've seen in class before) computing the posterior distribution. 


$$\Pr(\beta_{1:K},\theta_{1:D},Z_{1:D}\mid W_{1:D}) = \frac{\Pr(\beta_{1:K},\theta_{1:D},Z_{1:D},W_{1:D})}{\Pr(W_{1:D})}$$


Let's digest this.
The numerator is the joint distribution of all of the random variables.
The denominator is the marginal probability of the observations: the probability of seeing the observed corpus under any topic model.

We can't compute this directly because of the denominator. Since all of these random variables are hidden, we would need to integrate (or, for discrete variables, sum) over every possible joint assignment of them for the corpus. Remember: each of these random variables can take on many values, so calculating this exactly is intractable. Consider the case where you have 10 random variables, each of which can take on 12 values. You would have to sum over $12^{10}$ (about 62 billion) joint assignments. The number of terms grows exponentially with the number of variables, so this cannot be done in polynomial time.
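The arithmetic in this example is easy to check. The short sketch below counts the terms in a brute-force sum; the function name `num_joint_assignments` is an illustrative choice.

```python
def num_joint_assignments(n_values, n_variables):
    """Number of terms in a brute-force sum over all joint assignments."""
    return n_values ** n_variables


print(num_joint_assignments(12, 10))  # 61917364224

# The count grows exponentially as variables are added.
for n in (10, 20, 40):
    print(n, "variables:", num_joint_assignments(12, n))
```

A real corpus has one hidden topic assignment per word, so the number of variables is in the thousands or millions rather than 10.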

Backing up a step, let's recall what a marginal distribution is. Let's say the joint distribution involves 4 hidden random variables. To obtain the marginal probability of the observations, we have to integrate the hidden variables out of the equation one at a time, and each of those sums ranges over every combination of values the remaining variables can take.
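For a small discrete case, marginalization can be carried out by brute force. The sketch below builds a made-up joint table over 4 variables with 3 values each and sums out all but one of them. The table is random and purely illustrative, and the names `random_joint` and `marginal` are hypothetical.

```python
import itertools
import random


def random_joint(n_vars, n_values, rng):
    """Build an illustrative normalized joint table over discrete variables."""
    states = list(itertools.product(range(n_values), repeat=n_vars))
    weights = [rng.random() for _ in states]
    total = sum(weights)
    return {s: w / total for s, w in zip(states, weights)}


def marginal(joint, keep):
    """Marginal of the variable at index `keep`, summing out all the others."""
    out = {}
    for state, p in joint.items():
        out[state[keep]] = out.get(state[keep], 0.0) + p
    return out


rng = random.Random(2)
joint = random_joint(n_vars=4, n_values=3, rng=rng)
print("entries in the joint table:", len(joint))  # 3**4 = 81
print("marginal of variable 0:", marginal(joint, keep=0))
```

Even this tiny table has 81 entries to visit; the same loop over a realistic topic model would never finish, which is the motivation for approximate inference.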


TL;DR:
The number of possible topic structures is exponentially large, so the posterior is intractable to compute exactly because of this nasty denominator. So instead, we approximate it. This can be done in several ways, but we're going to focus on Gibbs sampling.


<video controls src="WbFc7Rn.mp4" />


# (Collapsed) Gibbs Sampling: A Sampling Approach to Bayesian Inference



Gibbs sampling is a Markov chain Monte Carlo (MCMC) sampling algorithm, where the Markov chain is a sequence of random variable states. The fundamental idea of using MCMC here is that we make a separate probabilistic choice for each of the $k$ dimensions of our random variables, where each of these choices depends on the other $k-1$ dimensions.

As mentioned above, the distribution we want to sample from isn't easy to compute. However, we are able to compute the *conditional distribution* of one random variable given all the other random variables. So we compute the conditional distribution for each random variable we want to resample, given all other random variables in the model.
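Resampling each variable from its conditional can be sketched on a toy problem. The example below is not LDA: it uses a made-up joint distribution over two binary variables (the table `JOINT` and the function names are illustrative assumptions) so that the estimates can be compared against known true probabilities.

```python
import random

# Hypothetical joint distribution over two binary variables (x0, x1).
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def conditional_prob_one(joint, state, i):
    """P(x_i = 1 | the other variable), computed from the joint table."""
    s0, s1 = list(state), list(state)
    s0[i], s1[i] = 0, 1
    p0, p1 = joint[tuple(s0)], joint[tuple(s1)]
    return p1 / (p0 + p1)


def gibbs(joint, n_iter, rng):
    """Estimate the joint by repeatedly resampling each variable in turn."""
    state = [rng.randint(0, 1), rng.randint(0, 1)]  # random initial assignment
    counts = {s: 0 for s in joint}
    for _ in range(n_iter):
        for i in range(2):
            # Resample x_i given the current value of the other variable.
            p_one = conditional_prob_one(joint, state, i)
            state[i] = 1 if rng.random() < p_one else 0
        counts[tuple(state)] += 1
    return {s: c / n_iter for s, c in counts.items()}


estimate = gibbs(JOINT, n_iter=20000, rng=random.Random(0))
for s in sorted(estimate):
    print(s, "estimated:", round(estimate[s], 3), "true:", JOINT[s])
```

Note that the sampler only ever needs ratios of joint probabilities, so the normalizing constant never has to be computed.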

A generic Gibbs sampler:
<img src = "gibbsgeneric.png">

1. Start off with a random initial assignment^1, R1. (Randomly assign each word in each document to one of K topics.)
2. Algorithmically, find another set of random variables, R2.
    -> Divide the posterior probability of the new assignment (R2, found in step 2) by the posterior probability of the previous assignment (R1, found in step 1):

$$\frac{P(\text{corpus} \mid R_2)\,P(R_2)\,/\,P(\text{corpus})}{P(\text{corpus} \mid R_1)\,P(R_1)\,/\,P(\text{corpus})} = \frac{P(\text{corpus} \mid R_2)\,P(R_2)}{P(\text{corpus} \mid R_1)\,P(R_1)}$$

Above, we see that the intractable $P(\text{corpus})$ cancels out. When the two assignments are equally probable a priori, the $P(R_1)$ and $P(R_2)$ terms cancel as well, leaving $P(\text{corpus} \mid R_2) / P(\text{corpus} \mid R_1)$.

Now we have a ratio of how likely these two assignments are in comparison to each other. We accept the new set with a probability given by this ratio (capped at 1). If we accept, we set R1 = R2 and repeat the process; otherwise we keep R1 and repeat. This moves through the space of assignments, favoring those that explain the corpus better.

3. This continues, approximating the posterior distribution, until we asymptotically approach (converge to)^2 the true posterior.
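The accept/reject loop described above can be sketched as follows. This is a toy illustration, not an LDA sampler: the "assignments" are just the integers 0 to 4, the unnormalized scores in `WEIGHTS` stand in for $P(\text{corpus} \mid R)\,P(R)$, and the proposal moves one step left or right. All of these, along with the function names, are assumptions made for the example.

```python
import math
import random

# Hypothetical unnormalized scores for 5 candidate assignments.
WEIGHTS = [1.0, 2.0, 4.0, 2.0, 1.0]


def log_score(state):
    """Log of the unnormalized posterior; the normalizer is never needed."""
    return math.log(WEIGHTS[state])


def propose(state, rng):
    """Symmetric proposal: step to a neighbouring state, wrapping around."""
    return (state + rng.choice([-1, 1])) % len(WEIGHTS)


def sample(n_iter, rng):
    current = rng.randrange(len(WEIGHTS))  # random initial assignment (R1)
    samples = []
    for _ in range(n_iter):
        candidate = propose(current, rng)  # new assignment (R2)
        # Ratio of the two scores, capped at 1, computed in log space.
        log_ratio = log_score(candidate) - log_score(current)
        accept_prob = math.exp(min(0.0, log_ratio))
        if rng.random() < accept_prob:
            current = candidate            # accept: R1 = R2
        samples.append(current)            # otherwise keep R1
    return samples


samples = sample(50000, random.Random(0))
freq = [samples.count(s) / len(samples) for s in range(len(WEIGHTS))]
print("estimated:", [round(f, 3) for f in freq])
print("true     :", [w / sum(WEIGHTS) for w in WEIGHTS])
```

The visit frequencies approach the normalized weights even though the code never divides by their sum, mirroring how the intractable denominator drops out of the ratio.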

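(Note: Gibbs sampling is just a way of sampling from a complicated distribution. There are many different ways of implementing it, depending on what research area you're working in and which distributions you need to sample from given your data. One way is to take the ratio of two probabilities (one for the previous sample, one for the newly proposed sample) and then accept the new one with a probability based on that ratio. This accept/reject form is the Metropolis-Hastings style of MCMC; Gibbs sampling is the special case in which each new value is drawn directly from a conditional distribution and is always accepted.)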


1 If you start from a better assignment than a random one, the algorithm will converge in less time.
2 Convergence here means that our sampled values are distributed closely enough to samples from the true posterior joint distribution. Essentially, we have sampled each latent variable, each conditioned on the updated values of all other latent variables.

Let's examine this process from a high level.
<img src = "gibbsexample1.png">
<img src = "gibbsexample_2.png">

# Conclusion and Take-Aways


