# Latent Dirichlet Allocation (LDA) Theory

LDA is one of the most common techniques for topic modeling. At its core, it is a **probabilistic, generative** model that operates on a **bag-of-words** representation.

It assumes that each document is a mixture of *topics*, and that each topic is a probability distribution over *words*.
A *document* is treated as an unordered collection of words, each of which was randomly sampled by first picking a topic from the document's topic mixture and then picking a word from that topic's word distribution.
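The sampling story above can be sketched in a few lines of Python. This is only an illustration using the standard library: the vocabulary, the two hand-written topics, and the names `sample_dirichlet` and `generate_document` are assumptions made for this example, and `alpha` is the Dirichlet parameter that controls how the topic mixture is drawn.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, vocab, length, rng):
    """Generate one bag-of-words document following the LDA generative story."""
    num_topics = len(topic_word)
    # 1. Draw the document's topic mixture.
    theta = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(length):
        # 2. Pick a topic for this word position from the mixture.
        k = rng.choices(range(num_topics), weights=theta)[0]
        # 3. Pick a word from that topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[k])[0])
    return theta, words

rng = random.Random(0)
vocab = ["election", "policy", "goal", "match"]
# Two hypothetical topics: one politics-like, one sports-like.
topic_word = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]
theta, doc = generate_document(topic_word, alpha=0.5, vocab=vocab, length=8, rng=rng)
print(theta)
print(doc)
```

Training LDA runs this story in reverse: only the words are observed, and the topic mixtures and word distributions have to be inferred.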

The goal of LDA is to find the parameters of the topic and word distributions that maximize the probability of generating the documents in your corpus.

![LDA Overview](../assets/graphics/lda-overview.png)

The name "Latent Dirichlet Allocation" comes from the fact that it uses one Dirichlet distribution as the prior over the topic mixture of each document, and another Dirichlet distribution as the prior over the word distribution of each topic. "Latent" indicates that the topics are hidden, and "Allocation" indicates that the words are allocated to the topics.
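In symbols, the model can be written as follows. The notation is an assumption chosen for this sketch: $\theta_d$ is the topic mixture of document $d$, $\phi_k$ is the word distribution of topic $k$, $z_{d,n}$ is the hidden topic of the $n$-th word $w_{d,n}$, and $\alpha$ and $\eta$ are the parameters of the two Dirichlet priors.

$$
\theta_d \sim \mathrm{Dir}(\alpha), \qquad
\phi_k \sim \mathrm{Dir}(\eta), \qquad
z_{d,n} \sim \mathrm{Cat}(\theta_d), \qquad
w_{d,n} \sim \mathrm{Cat}(\phi_{z_{d,n}})
$$

The joint probability of a corpus with $D$ documents, $K$ topics, and $N_d$ words in document $d$ then factorizes as:

$$
p(w, z, \theta, \phi \mid \alpha, \eta)
= \prod_{k=1}^{K} p(\phi_k \mid \eta)
\prod_{d=1}^{D} p(\theta_d \mid \alpha)
\prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \phi_{z_{d,n}})
$$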

## Preparing your Data

Keep in mind that bag-of-words representations are context-free: each word is treated as having a single meaning, and the order of words in a document is ignored. This is a very strong assumption, and real text often violates it. However, it is a useful assumption, as it allows us to use a lot of the machinery of probability theory to perform inference.

To make your dataset better fit this assumption, it is common to perform preprocessing. In particular, because LDA treats every distinct word form as a separate word, lemmatization is an important step. Without it, LDA does not know that "*Merkel's* foreign policy" and "*Merkel* was Bundeskanzlerin" both refer to the same entity.

Further, it is common to remove stopwords, i.e. words with little semantic information, such as "it", "must", or "the", from the corpus.
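A minimal sketch of these two steps, using only the standard library, might look like this. The stopword list is a tiny illustrative assumption, and stripping the possessive `'s` is only a crude stand-in for a real lemmatizer; in practice you would use an NLP library for both.

```python
import re

# Tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"it", "must", "the", "was", "a", "an", "and", "of"}

def preprocess(text):
    """Lowercase, tokenize, strip possessive suffixes, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    cleaned = []
    for token in tokens:
        # Crude stand-in for lemmatization: "merkel's" becomes "merkel".
        if token.endswith("'s"):
            token = token[:-2]
        token = token.strip("'")
        if token and token not in STOPWORDS:
            cleaned.append(token)
    return cleaned

print(preprocess("Merkel's foreign policy"))     # ['merkel', 'foreign', 'policy']
print(preprocess("Merkel was Bundeskanzlerin"))  # ['merkel', 'bundeskanzlerin']
```

After this step both example phrases share the token `merkel`, so the model can count them as the same word.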

## Training LDA

LDA has three main hyperparameters you can choose. $K$ is the number of topics, $\alpha$ is the parameter of the Dirichlet prior for the document-topic distribution, and $\eta$ (sometimes $\beta$) is the parameter of the Dirichlet prior for the topic-word distribution.

$\alpha$ determines how many topics the model expects per document. Lowering $\alpha$ concentrates each document on a few distinct topics, while increasing $\alpha$ yields a more uniform distribution of topics per document.

$\eta$ does the same for words within each topic. Lowering $\eta$ yields a more distinct set of words per topic, while increasing $\eta$ yields a more uniform distribution of words per topic.
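The effect of the Dirichlet parameter can be checked empirically. The sketch below draws many samples from a symmetric Dirichlet distribution and reports the average share of the largest component. The helper names and the concrete values (5 components, 0.1 versus 10.0) are assumptions chosen for illustration; the same reasoning applies to both $\alpha$ and $\eta$.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(concentration, size=5, n_samples=2000, seed=0):
    """Average probability mass held by the single largest component."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += max(sample_dirichlet(concentration, size, rng))
    return total / n_samples

sparse = mean_largest_share(0.1)    # low value: one component tends to dominate
uniform = mean_largest_share(10.0)  # high value: mass is spread more evenly
print(f"largest share with 0.1:  {sparse:.2f}")
print(f"largest share with 10.0: {uniform:.2f}")
```

A low value pushes most of the probability mass onto a single component, while a high value keeps every component close to an equal share.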

There is no closed-form solution for the topic and word distributions that best fit a corpus under a given set of hyperparameters. Instead, we start with a random distribution, check it against the corpus, and then iteratively update the distribution to better fit the corpus. One round of this process is called a *pass*. The number of passes can be set when you train your model. More passes usually improve the fit until the model converges, but they are also computationally more expensive.
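To make the iterative idea concrete, here is a minimal sketch using collapsed Gibbs sampling. This is one of several inference schemes and not necessarily the one your library uses (many implementations rely on variational inference instead). The function name `fit_lda_gibbs`, its defaults, and the toy corpus are assumptions for this example, and `passes` here means one full sweep over all words.

```python
import random

def fit_lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, eta=0.01, passes=50, seed=0):
    """Fit LDA with collapsed Gibbs sampling; docs are lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # words per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # counts per (topic, word)
    topic_total = [0] * num_topics                              # words per topic
    assignments = []
    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(passes):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = assignments[d][n]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by how well it fits this document and this word.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + eta)
                    / (topic_total[j] + vocab_size * eta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Turn the final counts into smoothed topic-word distributions.
    return [
        [(topic_word[k][w] + eta) / (topic_total[k] + vocab_size * eta)
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]

vocab = ["election", "policy", "goal", "match"]
docs = [[0, 1, 0, 1, 1], [1, 0, 0, 1], [2, 3, 3, 2, 2], [3, 2, 3, 2]]
topics = fit_lda_gibbs(docs, num_topics=2, vocab_size=len(vocab))
for k, dist in enumerate(topics):
    print(k, [round(p, 2) for p in dist])
```

Each pass revisits every word, temporarily removes its topic assignment, and resamples it given all other assignments, which is the "check against the corpus, then update" loop in miniature.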

## Interpreting a Trained Model

A trained model is characterized by its topic and word distributions, of which the word-distributions per topic are of particular interest. It is common to name a topic after its top 5-10 words, i.e. the words that are the most likely to be generated by the topic.
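Extracting the top words amounts to sorting each topic's word distribution. The sketch below assumes the trained model is available as a plain list of per-topic probability lists; the function name `top_words`, the vocabulary, and the numbers are made up for illustration.

```python
def top_words(topic_word, vocab, n=3):
    """Return the n most probable words for each topic."""
    result = []
    for dist in topic_word:
        # Rank word ids by their probability under this topic, highest first.
        ranked = sorted(range(len(vocab)), key=lambda w: dist[w], reverse=True)
        result.append([vocab[w] for w in ranked[:n]])
    return result

vocab = ["election", "policy", "goal", "match", "vote"]
# Hypothetical word distributions of two trained topics.
topic_word = [
    [0.40, 0.30, 0.02, 0.03, 0.25],
    [0.03, 0.02, 0.50, 0.40, 0.05],
]
print(top_words(topic_word, vocab))
# [['election', 'policy', 'vote'], ['goal', 'match', 'vote']]
```

Note that a word such as `vote` can appear in the top list of more than one topic, which is worth keeping in mind when naming topics.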

Beyond this, one can also understand the word-distributions as points in a high-dimensional space, and use dimensionality reduction techniques such as t-SNE to visualize the topics in a 2D or 3D space. In such a space, topics with similar word distributions are close to one another, while those with very different word distributions are far apart.
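The notion of "close" and "far apart" can be illustrated without a full t-SNE projection by computing pairwise distances between the word distributions directly. The sketch below uses the Hellinger distance, which is an assumed choice for this example and not something t-SNE itself prescribes; the three topics are made-up numbers.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

# Hypothetical word distributions over the same five-word vocabulary.
topic_word = [
    [0.40, 0.30, 0.02, 0.03, 0.25],  # politics-like topic
    [0.35, 0.35, 0.03, 0.02, 0.25],  # a second, similar politics-like topic
    [0.03, 0.02, 0.50, 0.40, 0.05],  # sports-like topic
]
distances = [[hellinger(p, q) for q in topic_word] for p in topic_word]
for row in distances:
    print([round(value, 2) for value in row])
```

The two politics-like topics end up much closer to each other than either is to the sports-like topic, which is the structure a 2D or 3D projection aims to preserve.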