## Latent Dirichlet Allocation (LDiA)

Latent Semantic Analysis (LSA) should be the first choice for most topic modelling, semantic search, or content-based recommendation engines.
<br>
The maths behind LSA is efficient and fairly easy to follow, and it produces a linear transformation that can be applied to new batches of natural language without retraining and with little loss in accuracy. However, LDiA can give slightly better results in some situations.

LDiA does much of what topic models built with LSA (and SVD under the hood) do, but unlike LSA, LDiA assumes a Dirichlet distribution of word frequencies. It is more precise about the statistics of allocating words to topics than the linear maths of LSA.


LDiA creates a semantic vector space model (as with topic vectors) by allocating words to topics. The topic mix for any given document can then be determined from the words it contains and the topics those words were assigned to.
<br>
In some ways, this makes an LDiA topic model easier to understand, because the words assigned to topics and the topics assigned to documents tend to make more sense than they do for LSA.
<br>
LDiA assumes that each document is a mixture (linear combination) of an arbitrary number of topics, a number one selects before training the LDiA model. LDiA also assumes that each topic can be represented by a distribution of words (term frequencies).
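
Written as a formula, these two assumptions say that the probability of seeing a word $w$ in a document $d$ is a weighted sum over topics. The symbols ($K$ topics, topic index $k$) are notation introduced here for illustration and do not come from the text above:

$$
p(w \mid d) = \sum_{k=1}^{K} p(w \mid k)\, p(k \mid d)
$$

Here $p(k \mid d)$ is the weight of topic $k$ in the document's topic mix, and $p(w \mid k)$ is the probability of word $w$ under topic $k$.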
<br>
The probability/weight for each of these topics within a document, along with the probability of a word being assigned to a topic, is assumed to start from a Dirichlet probability distribution (the *prior*, in statistical terms). This is where the algorithm gets its name.
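
To get a feel for what a Dirichlet prior over topic weights looks like, the sketch below draws a few topic mixes with NumPy. The number of topics, the `alpha` values, and the helper name `sample_topic_mixes` are all assumptions chosen for illustration; nothing here is fitted to real data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_topics = 3  # hypothetical number of topics


def sample_topic_mixes(alpha, n_docs=5):
    """Draw one topic mix per document from a symmetric Dirichlet prior."""
    return rng.dirichlet([alpha] * n_topics, size=n_docs)


# A small alpha tends to put most of the weight on one topic per document.
sparse_mixes = sample_topic_mixes(alpha=0.1)
# A large alpha tends to spread the weight evenly across topics.
even_mixes = sample_topic_mixes(alpha=10.0)

print(sparse_mixes.round(2))
print(even_mixes.round(2))
```

Each row is one document's topic mix and sums to 1, which is what lets it be read as a set of topic weights.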

### LDiA idea

Researchers such as Blei and Ng arrived at the idea by imagining how a machine that could only roll dice (that is, generate random numbers) could write the documents in a corpus one wants to analyse. Because the model works only with bags of words (BOW), it skips the part about sequencing those words so that they make sense as a real document.
<br>
In this way, they modelled the statistics for the mix of words that would become part of the BOW for each document.

They envisioned a machine that had only two choices to make before it could start generating the mix of words for a specific document. The two rolls of the dice represent:
<br>
1. The number of words to generate for the document (Poisson distribution)
<br>
2. The number of topics to mix together for the document (Dirichlet distribution)

Once the machine has these two numbers, the difficult part is choosing the words for the document.
<br>
The imaginary BOW generating machine iterates over those topics and randomly chooses words appropriate to that topic until it hits the number of words that it had decided the document should contain in step 1.
<br>
Deciding the probabilities of those words for topics (the appropriateness of words for each topic) is the hard part. But once that has been determined, the 'bot' just looks up the probabilities for the words for each topic from a matrix of term-topic probabilities.
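
The imaginary machine can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not Blei and Ng's implementation: the vocabulary, the term-topic probabilities, `mean_len`, and `alpha` are all made up, and the Dirichlet roll is read here as the proportions of the topic mix, with one topic drawn per word.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical toy vocabulary and term-topic probabilities (each row sums to 1).
vocab = np.array(["cat", "dog", "pet", "nyc", "city", "apple"])
term_topic = np.array([
    [0.35, 0.35, 0.25, 0.00, 0.00, 0.05],  # topic 0: "pets"
    [0.00, 0.00, 0.05, 0.40, 0.40, 0.15],  # topic 1: "cities"
])


def generate_bow(mean_len=8, alpha=(0.5, 0.5)):
    """Generate one bag of words using the two dice rolls described above."""
    n_words = rng.poisson(mean_len)    # roll 1: document length
    topic_mix = rng.dirichlet(alpha)   # roll 2: topic proportions
    words = []
    for _ in range(n_words):
        # Pick a topic according to the mix, then look up a word for that topic.
        topic = rng.choice(len(topic_mix), p=topic_mix)
        words.append(str(rng.choice(vocab, p=term_topic[topic])))
    return topic_mix, words


topic_mix, bow = generate_bow()
print(topic_mix.round(2), bow)
```

The word order in `bow` carries no meaning, which matches the bag-of-words assumption: only the mix of words matters.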




So all this machine needs is a single parameter for the Poisson distribution (the dice roll from step 1) that tells it what the 'average' document length should be, and a couple more parameters to define the Dirichlet distribution that sets up the number of topics.
<br>
The document generation algorithm then needs a term-topic matrix of all the words and topics it likes to use (its vocabulary), plus a mix of topics that it likes to 'talk' about.

For our use case, we turn the document generation (writing) problem back around to the original problem: estimating the topics and words from existing documents.
<br>
First, we need to measure or compute the parameters about words and topics for the first two steps.
<br>
Then we need to compute the term-topic matrix from a collection of documents. This is ultimately what LDiA does.
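
As a rough sketch of this reverse direction, scikit-learn's `LatentDirichletAllocation` can estimate both a topic mix per document and per-topic word weights from BOW counts. The four-sentence corpus and the choice of two topics are assumptions made purely for illustration; a corpus this small will not produce meaningful topics, so the point is only the shape of the inputs and outputs.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only to show the shapes involved.
docs = [
    "the cat chased the dog",
    "my dog and my cat are pets",
    "new york is a big city",
    "the city never sleeps",
]

counts = CountVectorizer().fit_transform(docs)  # document-term BOW counts
ldia = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = ldia.fit_transform(counts)          # one topic mix per document

# components_ holds unnormalised topic-word weights; normalising each row
# lets it be read as a distribution over the vocabulary for that topic.
topic_term = ldia.components_ / ldia.components_.sum(axis=1, keepdims=True)

print(doc_topic.round(2))
print(topic_term.shape)
```

Note that scikit-learn stores this matrix with topics as rows and terms as columns, so it is the transpose of a term-topic matrix.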

The parameters for these steps can be computed by analysing the statistics of the documents in a corpus.
<br>
For instance, for step 1, one could calculate the mean number of words (or n-grams) across all the BOWs for the documents in a given corpus.
<br> 
An implementation in Python may look like this:

In [2]:
import pandas as pd 
from nltk.tokenize import casual_tokenize
from nlpia.data.loaders import get_data
pd.options.display.width = 120 
sms = get_data('sms-spam')

In [3]:
total_corpus_len = 0 
for document_text in sms.text:
    total_corpus_len += len(casual_tokenize(document_text))

In [4]:
mean_document_len = total_corpus_len/len(sms)
round(mean_document_len, 2)

21.35