## Latent Dirichlet Allocation (LDiA)

Latent Semantic Analysis (LSA) should be the first choice for most topic modelling, semantic search, or content-based recommendation engines.
<br>
The maths behind LSA is efficient and fairly easy to follow, and it produces a linear transformation that can be applied to new batches of natural language without retraining and with little loss in accuracy. However, LDiA can give slightly better results in some situations.

LDiA does many of the things that LSA topic models (with SVD under the hood) do, but unlike LSA, it assumes a Dirichlet distribution of word frequencies. It is more precise about the statistics of allocating words to topics than the linear maths of LSA.


LDiA creates a semantic vector space model (much like the topic vectors from LSA). For any given document, the topic mix is determined from the words the document contains and from the topics those words were assigned to.
<br>
In some ways, this makes an LDiA topic model easier to understand, as words assigned to topics and topics assigned to documents tend to make more sense than for LSA.
<br>
LDiA assumes that each document is a mixture (linear combination) of some arbitrary number of topics that one can select when beginning to train the LDiA model. LDiA also assumes that each topic can be represented by a distribution of words (term frequencies). 
<br>
The probability (weight) of each topic within a document, and the probability of a word being assigned to a topic, are both assumed to start out following a Dirichlet probability distribution (the *prior*, in statistical terms). This is where the algorithm gets its name.
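To make the "mixture of topics" assumption concrete, here is a minimal sketch using NumPy. The vocabulary, the number of topics and the Dirichlet concentration value of 0.5 are made-up illustrations, not values from the SMS corpus. It draws topic-term and document-topic distributions from Dirichlet priors and combines them into a word distribution for one document.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["call", "free", "prize", "home", "later", "love"]  # toy vocabulary (assumption)
n_topics = 2                                                # illustrative choice

# Topic-term distributions: each row is one topic's probabilities over the vocabulary.
# A concentration below 1 tends to give peaked rows (a few dominant words per topic).
topic_term = rng.dirichlet(np.full(len(vocab), 0.5), size=n_topics)

# Document-topic mixture for a single document, also drawn from a Dirichlet prior.
doc_topic = rng.dirichlet(np.full(n_topics, 0.5))

# The document's word distribution is the topic mixture applied to the topic rows.
doc_word_probs = doc_topic @ topic_term

print(dict(zip(vocab, doc_word_probs.round(3))))
```

Because both factors are probability distributions, the resulting word probabilities also sum to one.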

### LDiA idea

Researchers like Blei and Ng proposed the idea by imagining how a machine that could only roll dice (i.e. generate random numbers) could write the documents in a corpus one wants to analyse. Because the machine only works with bags of words (BOW), it skips the part about sequencing those words so that they make sense as a real document.
<br>
In this way, they modeled the statistics for the mix of words that would become a part of a particular BOW for each document.

They envisioned a machine that had only two choices to make before generating the mix of words for a specific document. The two rolls of the 'dice' represent:
<br>
1. The number of words to generate for the document (Poisson distribution)
<br>
2. The number of topics to mix together for the document (Dirichlet distribution)

After these two numbers, the difficult task is to choose the words for the document.
<br>
The imaginary BOW generating machine iterates over those topics and randomly chooses words appropriate to that topic until it hits the number of words that it had decided the document should contain in step 1.
<br>
Deciding the probabilities of those words for topics (the appropriateness of words for each topic) is the hard part. But once that has been determined, the 'bot' just looks up the probabilities for the words for each topic from a matrix of term-topic probabilities.




So all this machine needs is a single parameter for that Poisson distribution (in the dice roll from step 1) that tells it what the 'average' document length should be, and a couple more parameters to define that Dirichlet distribution that sets up the number of topics.
<br>
Then your document generation algorithm needs a term-topic matrix of all the words and topics it likes to use, its vocabulary. And it needs a mix of topics that it likes to 'talk' about.
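The dice-rolling machine described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the vocabulary, the mean length of 20, the Dirichlet parameters and the random term-topic matrix are all invented, and `generate_bow` is a hypothetical helper name. Note that the Dirichlet draw yields the proportion of each topic in the document, which in turn determines how many topics carry any real weight.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(42)

vocab = np.array(["call", "free", "prize", "txt", "home", "later", "love", "sorry"])
n_topics = 3
mean_doc_len = 20.0             # Poisson parameter (illustrative value)
alpha = np.full(n_topics, 0.3)  # Dirichlet parameters for the topic mix (illustrative)

# Stand-in term-topic matrix: one row of word probabilities per topic.
# In a real model these probabilities are what LDiA has to learn.
topic_term = rng.dirichlet(np.full(len(vocab), 0.5), size=n_topics)


def generate_bow(rng):
    """Generate one bag of words with the two 'dice rolls' plus word lookups."""
    n_words = rng.poisson(mean_doc_len)   # roll 1: document length
    topic_mix = rng.dirichlet(alpha)      # roll 2: topic proportions
    # Pick a topic for every word slot, then a word from that topic's row.
    topics = rng.choice(n_topics, size=n_words, p=topic_mix)
    words = [str(rng.choice(vocab, p=topic_term[t])) for t in topics]
    return Counter(words), topic_mix


bow, topic_mix = generate_bow(rng)
print(bow.most_common(3), topic_mix.round(2))
```

Training LDiA amounts to running this story in reverse: given only the bags of words, estimate the `topic_term` matrix and each document's `topic_mix`.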

For our use case we flip this document generation (writing) problem back around to the original problem: estimating the topics and words from existing documents.
<br>
Initially, we need to measure or compute those parameters about words and topics for the first two steps.
<br>
Then we need to compute the term-topic matrix from a collection of documents. This is ultimately the process of LDiA.

The parameters for the first two steps can be computed by analysing the statistics of the documents in a corpus.
<br>
For instance, for step 1, one could calculate the mean number of words (n-grams) across all the BOWs for the documents in a specified corpus.
<br> 
An implementation in Python may look like this:

In [52]:
import pandas as pd 
from nltk.tokenize import casual_tokenize
from nlpia.data.loaders import get_data
pd.options.display.width = 120 
sms = get_data('sms-spam')
index = ['sms{}{}'.format(i, '!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
sms.index = index 

In [53]:
total_corpus_len = 0 
for document_text in sms.text:
    total_corpus_len += len(casual_tokenize(document_text))

In [54]:
mean_document_len = total_corpus_len/len(sms)
round(mean_document_len, 2)

21.35

These document statistics are usually computed directly from the BOWs.
<br>
We need to make sure that we're counting the tokenized and vectorized words in the documents.
<br>
It's also important to apply the same text normalization (stop word filtering, case folding, etc.) before counting up the unique terms.
<br>
This ensures that the count includes all the words in our BOW vector vocabulary (all the n-grams we're counting), but only those words that our BOWs use (i.e. not stop words).
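As a hedged illustration of computing the statistic from the BOW vectors themselves, the sketch below uses `CountVectorizer` on a tiny invented corpus, with `sklearn`'s built-in English stop word list standing in for whatever normalization the real pipeline uses. The point is that the row sums of the count matrix only include tokens that survive the vectorizer's normalization.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [  # hypothetical toy corpus, not the SMS data
    "Free prize! Call now to claim your free prize",
    "Are you coming home later?",
    "Call me when you are home",
]

# Use the same vectorizer settings (case folding, stop words) for this statistic
# as for the vectors that are later passed to the topic model.
counter = CountVectorizer(lowercase=True, stop_words="english")
bow = counter.fit_transform(docs)

# Total count of retained tokens divided by the number of documents.
mean_doc_len = bow.sum() / bow.shape[0]
vocab_size = len(counter.vocabulary_)

print(round(float(mean_doc_len), 2), vocab_size)
```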


The second parameter we need to specify for an LDiA model is the **number of topics**, which is trickier.
<br>
The number of topics in a particular set of documents can't be measured directly until after we've assigned words to those topics.
<br>
As with ***k-means*** clustering and ***KNN***, we need to choose the ***k*** parameter ahead of time.
<br>
We can guess the number of topics (analogous to the ***k*** in k-means, which is the number of clusters) and then check whether that works for our set of documents.
<br>
Once we tell the LDiA model how many topics to look for, it will find the mix of words to put in each topic to optimise its objective function.

We can optimise this 'hyperparameter' (***k***, the number of topics) by adjusting it until it works for our application.
<br>
We can automate this optimisation if we can measure something about the quality of the LDiA language model for representing the meaning of the documents.
<br>
One 'cost function' for this optimisation is how well or poorly the LDiA model performs in a downstream classification or regression problem, such as sentiment analysis, document keyword tagging or topic analysis. We only need some labeled documents to test our topic model or classifier on.
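One possible way to automate that search is to score a small pipeline for several candidate values of ***k*** and keep the best one. The sketch below is an assumption-laden illustration: the twelve messages and their labels are invented, the candidate values (2, 3, 4) are arbitrary, and cross-validated accuracy of a Linear Discriminant Analysis classifier is just one possible cost function. On such a tiny corpus the classifier may emit collinearity warnings, which can be ignored for the purpose of the illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = [  # hypothetical labeled messages (1 = spam, 0 = not spam)
    "free prize call now to claim",
    "win cash prize txt now",
    "urgent call to claim your free prize",
    "free entry txt to win cash",
    "claim your free cash prize now",
    "urgent txt now to win free entry",
    "are you coming home later",
    "see you at home tonight",
    "sorry i will call you later",
    "love you see you tonight",
    "i am coming home sorry",
    "will you be home later tonight",
]
labels = [1] * 6 + [0] * 6

scores = {}
for k in (2, 3, 4):  # candidate numbers of topics
    pipeline = make_pipeline(
        CountVectorizer(),
        LatentDirichletAllocation(n_components=k, random_state=0),
        LinearDiscriminantAnalysis(),
    )
    # Mean accuracy over 3 stratified folds acts as the quality measure for this k.
    scores[k] = cross_val_score(pipeline, texts, labels, cv=3).mean()

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

With a real corpus the same loop would run over a wider range of ***k*** and a held-out test set would be kept aside for the final evaluation.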

### LDiA topic model for SMS messages 

The topics produced by LDiA tend to be more understandable and interpretable to humans.
<br>
This is the case as words that frequently occur together are assigned the same topic(s), which is expected by most humans. 
<br>
Whereas LSA (PCA) tries to keep things spread apart to begin with, LDiA tries to keep things close together that started out close together.


LSA and LDiA optimise for different things: the LDiA optimiser has a different objective function, so it reaches a different result.
<br>
To keep close high-dimensional vectors close together in the lower-dimensional space, LDiA has to twist and contort the space (and the vectors) in nonlinear ways.

To experiment with an LDiA topic model, let's see how it works for a dataset of a few thousand SMS messages, labelled for spaminess.
<br>
First we compute the BOW vectors and then a topic vector for each SMS message (document). We use 16 topics (components) to classify the spaminess of messages.
<br>
Keeping the number of topics (dimensions) low can help reduce the overfitting problem.

LDiA works with raw BOW count vectors rather than normalized TF-IDF vectors. A workflow to compute BOW vectors in `sklearn` looks like this:

In [55]:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np 
np.random.seed(42) # replicable results 

In [56]:
counter = CountVectorizer(tokenizer=casual_tokenize)
bow_vector = counter.fit_transform(raw_documents=sms.text)

In [57]:
index = ['sms{}{}'.format(i, '!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
bow_docs = pd.DataFrame(bow_vector.todense(), index=index)
bow_docs.head() # test vector transformation 

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9222,9223,9224,9225,9226,9227,9228,9229,9230,9231
sms0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sms1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sms2!,0,0,0,0,0,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
sms3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
sms4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [58]:
columns_nums, terms = zip(*sorted(zip(counter.vocabulary_.values(), counter.vocabulary_.keys())))
bow_docs.columns = terms

Using the first message `sms0`, we can double check to ensure that the count from the model makes sense.

In [59]:
sms.loc['sms0'].text

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

In [60]:
bow_docs.loc['sms0'][bow_docs.loc['sms0'] > 0].head()

,            1
..           1
...          2
amore        1
available    1
Name: sms0, dtype: int64

We can use LDiA as shown below to construct topic vectors for our SMS corpus:

In [61]:
# Alias as LDiA so the name does not clash with LinearDiscriminantAnalysis (LDA) later on
from sklearn.decomposition import LatentDirichletAllocation as LDiA
ldia = LDiA(n_components=16, learning_method='batch')
# fit/transform for LDiA takes slightly longer than for PCA/SVD,
# especially with a large number of topics or words in the corpus
ldia_model = ldia.fit_transform(bow_docs)

In [62]:
ldia.components_.shape

(16, 9232)

So the model has allocated our 9,232 words (terms) to 16 topics (components). Let’s take a look at the first few words and how they’re allocated to the 16 topics.
<br>
Keep in mind that the weights and topics can differ from run to run. LDiA is a stochastic algorithm that relies on a random number generator for some of the statistical decisions it makes when allocating words to topics.
<br>
So your topic-word weights may differ from those shown, but they should have similar magnitudes. Each time we run `LatentDirichletAllocation` within `sklearn` (or any LDiA algorithm), we will get different results unless the **random seed** is set to a fixed value.
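One way to fix the seed in `sklearn` is the `random_state` argument of `LatentDirichletAllocation`. The sketch below uses a small invented corpus (an assumption, not the SMS data) and a hypothetical helper `fit_topics` to check that two fits with the same seed produce the same topic-word weights.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [  # hypothetical toy corpus
    "free prize call now",
    "win a free prize now",
    "see you at home later",
    "coming home later tonight",
    "call me when you are home",
    "claim your free cash prize",
]
bow = CountVectorizer().fit_transform(docs)


def fit_topics(seed):
    """Fit a 2-topic LDiA model with a fixed seed and return its topic-word weights."""
    model = LatentDirichletAllocation(n_components=2, random_state=seed)
    model.fit(bow)
    return model.components_


# Same seed, same data: the topic-word weights should match.
same_seed_matches = np.allclose(fit_topics(0), fit_topics(0))
print(same_seed_matches)
```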

In [63]:
pd.set_option('display.width', 75)
columns = [f'topic{i}' for i in range(ldia.n_components)]
components = pd.DataFrame(ldia.components_.T, index=terms, columns=columns)

In [64]:
components.round(2).head()

| | topic0 | topic1 | topic2 | topic3 | topic4 | topic5 | topic6 | topic7 | topic8 | topic9 | topic10 | topic11 | topic12 | topic13 | topic14 | topic15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ! | 184.03 | 15.0 | 72.22 | 394.95 | 45.48 | 36.14 | 9.55 | 44.81 | 0.43 | 90.23 | 37.42 | 44.18 | 64.4 | 297.29 | 41.16 | 11.7 |
| " | 0.68 | 4.22 | 2.41 | 0.06 | 152.35 | 0.06 | 0.06 | 0.06 | 0.45 | 0.68 | 8.42 | 11.42 | 0.07 | 62.72 | 12.27 | 0.06 |
| # | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 2.07 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 1.07 | 4.05 | 0.06 | 0.06 |
| #150 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 1.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 |
| #5000 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 3.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 | 0.06 |


At a glance, the '!' term occurs across all topics, but is a particularly strong part of `topic3`, whereas the '"' punctuation mark rarely occurs there.
<br>
One inference is that `topic3` is about emotional intensity or emphasis and doesn't care much about numbers or quotes.

In [65]:
# exploring the top ten tokens for topic3 
components.topic3.sort_values(ascending=False)[:10]

!       394.952246
.       218.049724
to      119.533134
u       118.857546
call    111.948541
£       107.358914
,        96.954384
*        90.314783
your     90.215961
is       75.750037
Name: topic3, dtype: float64

So the top ten tokens for 'topic3' seem to be focussed around emphatic (imperative) directives requesting someone to do or pay something.
<br>
Also, it would be interesting to find out if this topic is used more often in spam messages as opposed to nonspam messages.
<br>
We can see the allocation of words to topics can be rationalised or reasoned about - even with this quick look.
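To check whether a topic shows up more often in spam, one option is to average its weight per class label. The helper `mean_topic_weight_by_label` below is a hypothetical name, and the four-row frame holds invented numbers purely to show the mechanics. It is a sketch of the idea, not a result from the SMS corpus.

```python
import pandas as pd


def mean_topic_weight_by_label(topic_vectors, labels, topic):
    """Average weight of one topic column, grouped by class label."""
    return topic_vectors[topic].groupby(labels.values).mean()


# Made-up topic weights for four messages ('!' marks spam, as in the index above).
toy_vectors = pd.DataFrame(
    {"topic3": [0.9, 0.1, 0.7, 0.0]},
    index=["sms0!", "sms1", "sms2!", "sms3"],
)
toy_labels = pd.Series([1, 0, 1, 0], index=toy_vectors.index)

by_label = mean_topic_weight_by_label(toy_vectors, toy_labels, "topic3")
print(by_label)
```

Once per-document topic vectors exist for the real corpus, the same call could be made with that DataFrame and the `sms.spam` column in place of the toy inputs.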

Before we fit a Linear Discriminant Analysis (LDA) classifier, we need to compute these LDiA topic vectors for all the documents (SMS messages). Let’s see how they differ from the topic vectors produced by SVD and PCA for those same documents:

In [66]:
ldia16_topic_vectors = pd.DataFrame(ldia_model, index=index, columns=columns)
ldia16_topic_vectors.round(2).head()

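| | topic0 | topic1 | topic2 | topic3 | topic4 | topic5 | topic6 | topic7 | topic8 | topic9 | topic10 | topic11 | topic12 | topic13 | topic14 | topic15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sms0 | 0.0 | 0.62 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.34 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sms1 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.78 | 0.01 | 0.01 | 0.12 | 0.01 | 0.01 | 0.01 | 0.01 |
| sms2! | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.98 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sms3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.09 | 0.0 | 0.0 | 0.0 | 0.85 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| sms4 | 0.39 | 0.0 | 0.33 | 0.0 | 0.0 | 0.0 | 0.14 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.09 | 0.0 | 0.0 | 0.0 |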


We can see that these topics are more cleanly separated among each document. 
<br>
There are a lot of zeros in the allocation of topics to messages. 
<br>
This is one of the things that makes LDiA topics easier to explain to peers or less technical folk when making business decisions based on the NLP pipeline results.
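Two properties of these topic vectors can be checked numerically: each row is a mixture that sums to one, and many weights sit close to zero. The sketch below fits a small model on an invented corpus (an assumption, not the SMS data). The 5% threshold used to call a weight "near zero" is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [  # hypothetical toy corpus
    "free prize call now to claim your prize",
    "win free cash txt now",
    "are you coming home later tonight",
    "see you at home sorry i am late",
    "urgent claim your free entry now",
    "love you see you later",
]
bow = CountVectorizer().fit_transform(docs)

ldia_toy = LatentDirichletAllocation(n_components=4, random_state=0)
topic_vectors = ldia_toy.fit_transform(bow)  # one row of topic weights per document

row_sums = topic_vectors.sum(axis=1)                    # each row is a probability mix
near_zero_share = float((topic_vectors < 0.05).mean())  # fraction of weights below 5%

print(row_sums.round(6), round(near_zero_share, 2))
```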

### LDiA/LDA - spam classifier 

To test how good LDiA topics are at predicting labels such as 'spam', we'll use the LDiA topic vectors to train an LDA model:

In [67]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [68]:
X_train, X_test, y_train, y_test = train_test_split(ldia16_topic_vectors, sms.spam, test_size=0.5, random_state=271828)
lda = LDA(n_components=1)
lda = lda.fit(X_train, y_train)
# Can make predictions based on whole dataset of sms messages 
sms['ldia16_spam'] = lda.predict(ldia16_topic_vectors)
round(float(lda.score(X_test, y_test)),2)

0.94

`train_test_split()` and LDiA are both stochastic, so without a fixed seed each run gives different results and different accuracy values.
<br>
To make the pipeline repeatable, look for the seed argument (`random_state` in `sklearn`) on these models and dataset splitters. Setting it to the same value on each run gives reproducible results.

One way a 'collinear' warning can occur is if the text has a few 2-grams or 3-grams whose component words only ever occur together. The resulting LDiA model then has to split the weights arbitrarily among these equivalent term frequencies. Can you find the words in the SMS messages that are causing this 'collinearity' (zero determinant)?
<br>
We’re looking for a word that, whenever it occurs, another word (its pair) is always in the same message.

We can do this search with Python rather than by hand.
<br> 
First, we probably just want to look for any identical bag-of-words vectors in the corpus. These could occur for SMS messages that aren’t identical, like 'Hi there Bob' and 'Bob hi there', because they have the same word counts once case is folded. 
<br>
We can iterate through all the pairings of the bags of words to look for identical vectors. These can cause a 'collinearity' warning in either **LDiA** or **LSA**.

If we don’t find any exact **BOW** vector duplicates, we could iterate through all the pairings of the words in our vocabulary. 
<br>
We would then iterate through all the bags of words to look for the SMS messages that contain those same two words. If those words never occur separately in the SMS messages, we’ve found one of the 'collinearities' in the dataset. 
<br>
Some common 2-grams that might cause this are the first and last names of famous people that always occur together and are never used separately, like 'Bill Gates' (as long as there are no other Bills in your SMS messages).

We got more than 90% accuracy on the test set, and we only had to train on half the available data. 
<br>
But we may have received a warning about the features being collinear due to the limited dataset, which gives LDA an “under-determined” problem. The determinant of the topic-document matrix is close to zero once we discard half the documents with `train_test_split`.
<br>
If needed, we can turn down the LDiA `n_components` to 'fix' this issue, but that tends to combine topics that are linear combinations of each other (collinear).
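A quick, hedged way to check whether a feature matrix is rank deficient is to compare its numerical rank with its number of columns. The sketch below uses random Dirichlet draws as a stand-in for 16-D topic vectors (an assumption, not the fitted SMS model). Because every such row sums to one, the centred columns are linearly dependent, which is one source of collinearity worth ruling out alongside the word-pair explanation above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for 16-D topic vectors: rows drawn from a Dirichlet, so each sums to 1.
topic_vectors = rng.dirichlet(np.full(16, 0.5), size=200)

# Centre the columns, then estimate the numerical rank of the result.
centered = topic_vectors - topic_vectors.mean(axis=0)
rank = np.linalg.matrix_rank(centered)
n_features = topic_vectors.shape[1]

print(rank, n_features)
```

A rank below the number of features means at least one column can be written as a linear combination of the others.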

Let's find out how this LDiA model compares to a higher-dimensional model based on the TF-IDF vectors.
<br>
The TF-IDF vectors have many more features (the same 9,232 terms as the BOW vectors, versus 16 topics), so a model trained on them is likely to suffer from overfitting and poor generalization.
<br>
This is where the intermediate generalisation step of LDiA or PCA should help:

In [69]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize
tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
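# .toarray() returns a plain ndarray; newer sklearn estimators may reject the np.matrix that .todense() returns
tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray()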

In [70]:
tfidf_docs = tfidf_docs - tfidf_docs.mean(axis=0)

In [71]:
X_train, X_test, y_train, y_test = train_test_split(tfidf_docs, sms.spam.values, test_size=0.5, random_state=271828)
# Only one discriminant component is needed: we just want a spam/not-spam score
lda = LDA(n_components=1)
# Fitting LDA to this many features takes a while
lda = lda.fit(X_train, y_train)
round(float(lda.score(X_test, y_test)), 3)

0.748

The test set accuracy is much worse than when we trained on the lower-dimensional topic vectors rather than the TF-IDF vectors.
<br>
This is exactly what topic modelling is supposed to do: it helps our models generalise from a small training set, so they still work well on messages that use different combinations of words (but similar topics).

### LDiA - increasing topics 

This time we'll do another run with more dimensions (topics). LDiA may be less efficient than LSA (PCA) at packing information into each dimension, so giving it more topics to allocate words to may compensate. Here, we'll try 32 topics (components).

In [72]:
from sklearn.decomposition import LatentDirichletAllocation as LDiA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

In [73]:
ldia32 = LDiA(n_components=32, learning_method='batch') # batch is default
ldia32_topic_vectors = ldia32.fit_transform(bow_docs)
ldia32.components_.shape

(32, 9232)

With this vector transformation, we can compute the corresponding 32-D topic vectors for all the documents (SMS messages) within the dataset.

In [74]:
columns_32 = ['topic{}'.format(i) for i in range(ldia32.n_components)]

In [75]:
ldia32_topic_vectors = pd.DataFrame(ldia32_topic_vectors, index=index, columns=columns_32)
ldia32_topic_vectors.round(2).head(5)

Unnamed: 0,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9,...,topic22,topic23,topic24,topic25,topic26,topic27,topic28,topic29,topic30,topic31
sms0,0.0,0.0,0.0,0.06,0.14,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms1,0.0,0.0,0.0,0.0,0.53,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.14,0.0,0.0
sms2!,0.0,0.0,0.0,0.0,0.0,0.65,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.33,0.0,0.0,0.0,0.0,0.0,0.0
sms3,0.0,0.11,0.0,0.0,0.39,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
sms4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.09,0.0,0.0,0.47,0.0,0.0,0.0,0.0


With 32 topics, the vectors are even sparser and more cleanly separated.
<br>
Next, we train the LDA model (classifier) on these 32-D LDiA topic vectors.

In [76]:
X_train, X_test, y_train, y_test = train_test_split(ldia32_topic_vectors, sms.spam, test_size=0.5, random_state=271828)
lda = LDA(n_components=1)
lda = lda.fit(X_train, y_train)
sms['ldia32_spam'] = lda.predict(ldia32_topic_vectors)
X_train.shape

(2418, 32)

In [79]:
round(float(lda.score(X_test, y_test)),3)

0.936

Based on the accuracy scores, 93.6% is comparable to the 94% score retrieved from the 16-D LDiA topic vectors.

Note that the collinearity warnings have nothing to do with optimising the number of topics (components) of a topic model.
<br>
Collinearity is an inherent property of the underlying data. 
<br>
To solve it, we need to add 'noise' or metadata to the SMS data as synthetic words, or remove the duplicate word vectors or word pairings that repeat in our documents.  

***Summary***

A larger number of topics allows for more precision about the topics and generally produces topics that separate linearly better, although here 32 topics scored about the same as 16.
<br>
Yet, the performance of `LDiA + LDA` is not quite as good as the 96% accuracy of `PCA + LDA`.
<br>
Hence, PCA is keeping our SMS topic vectors spread out more efficiently - allowing for a wider gap between messages to cut with a hyperplane (decision boundary) to separate classes.
<br>
Finding explainable topics, like those used for summarisation, is what `LDiA` is good at. Also, it can do a fairly good job at creating topics useful for linear classification.