# Latent Dirichlet Allocation (LDA)

Note: Adapted from the SLO TBL topic classification codebase and Derek Fisher's code for my own LDA topic extraction implementation.

#### Resources Used:

https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158

https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d

https://www.investopedia.com/terms/p/posterior-probability.asp

https://cs.calvin.edu/courses/cs/x95/videos/2018-2019/

## LDA Basics:

<span style="font-family:Papyrus; font-size:1.25em;">
    
##### Basic Concept:

Each document is described by a distribution over topics.<br>
Each topic is described by a distribution over words.<br>
Documents are typically represented with bag-of-words features.<br>
This permits the identification of topics within documents and the mapping of documents to their associated topics.<br>

##### Terms:

Observed layer: documents (composites) and words (parts).<br>
Hidden (latent) layer: topics (categories).<br>

k — Number of topics in the model (a fixed number chosen in advance); each document is a mixture over these k topics.<br>

V — Size of the vocabulary.<br>

M — Number of documents.<br>

N — Number of words in each document.<br>

w — A word in a document, represented as a one-hot encoded vector of size V (the vocabulary size).<br>

w (bold w): represents a document (i.e. vector of “w”s) of N words.<br>

D — Corpus, a collection of M documents.<br>

z — A topic from a set of k topics. A topic is a distribution over words. For example, Animal = (0.3 Cats, 0.4 Dogs, 0 AI, 0.2 Loyal, 0.1 Evil).<br>

![lda](lda_model.jpeg)
    
</span>
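
To make the terms above concrete, here is a minimal sketch of the bag-of-words representation on a made-up three-document corpus. The corpus and variable names are our own illustration and are not part of the SLO dataset; the sketch only shows how M, V, and N fall out of the document-term count matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus D with M = 3 documents (invented for illustration).
toy_corpus = [
    "cats and dogs are loyal",
    "dogs are loyal animals",
    "coal mine jobs and coal exports",
]

# Each row is one document, each column is one vocabulary word,
# and each cell holds how many times that word occurs in that document.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_corpus)

M, V = counts.shape
vocabulary = sorted(vectorizer.vocabulary_)

# N varies per document: it is the total number of word tokens in that row.
words_per_document = np.asarray(counts.sum(axis=1)).ravel()

print("M =", M, "V =", V)
print("vocabulary:", vocabulary)
print("N per document:", words_per_document.tolist())
```

Word order is discarded in this representation; only the counts are kept, which is what LDA works with.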

<span style="font-family:Papyrus; font-size:1.25em;">
    
α — Dirichlet parameter that governs what the distribution of topics looks like across all the documents in the corpus.<br>

θ — Random matrix where θ(i,j) represents the probability of the i-th document containing the j-th topic.<br>

η — Dirichlet parameter that governs what the distribution of words in each topic looks like.<br>

β — Random matrix where β(i,j) represents the probability of the i-th topic containing the j-th word.<br>

##### Dirichlet Distribution (example):

![dirichlet](dirichlet_distribution.png)

1) Large values of α push the distribution toward the center.<br>
2) Small values of α push the distribution toward the edges.<br>

</span>
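
The effect of α described above can be checked numerically. The sketch below assumes a symmetric Dirichlet over 3 topics (the same α for every topic) and uses the average size of the largest component as a rough measure of how close samples sit to an edge or corner; both choices are ours, made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3  # number of topics, so each sample is a point on a 3-way simplex


def mean_largest_share(alpha, n_samples=5000):
    """Average size of the biggest component across Dirichlet samples."""
    samples = rng.dirichlet([alpha] * k, size=n_samples)
    return samples.max(axis=1).mean()


# Large alpha: samples cluster near the center (1/3, 1/3, 1/3).
large_alpha_share = mean_largest_share(10.0)
# Small alpha: samples are pushed toward the corners, so one topic dominates.
small_alpha_share = mean_largest_share(0.1)

print("mean largest share, alpha = 10.0:", round(large_alpha_share, 2))
print("mean largest share, alpha = 0.1: ", round(small_alpha_share, 2))
```

A largest share close to 1/3 means the topics are mixed evenly; a largest share close to 1 means each sample is dominated by a single topic.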

<span style="font-family:Papyrus; font-size:1.25em;">
    
##### Mathematical equivalent of the above graphical representation of LDA:

![mathematical_model](lda_equation.png)

##### English Translation:

Given a set of M documents, each containing N words, with each word generated from one topic out of a set of k topics, find the joint posterior probability of:

θ — A distribution of topics, one for each document,<br>
z — N topic assignments for each document (one per word),<br>
β — A distribution of words, one for each topic,<br>

Given:

D — All the data we have (i.e. the corpus),<br>

Using the parameters:

α — A parameter vector for each document (document — Topic distribution).<br>
η — A parameter vector for each topic (topic — word distribution).<br>
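
One way to read the quantities above is through the generative story that LDA assumes: draw β and θ from Dirichlet priors, then draw a topic z and a word w for every word position. The sketch below simulates that story with NumPy. The sizes, the shared symmetric α and η, and the fixed N per document are simplifying assumptions of ours; fitting LDA means running this story in reverse to infer θ, z, and β from the observed words.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sizes and hyperparameters (assumed values, not tuned for any dataset).
M, N, k, V = 4, 6, 2, 5
alpha = np.full(k, 0.5)  # prior over each document's topic mixture
eta = np.full(V, 0.5)    # prior over each topic's word distribution

# beta[i, j]: probability that topic i emits word j.
beta = rng.dirichlet(eta, size=k)
# theta[i, j]: probability that document i uses topic j.
theta = rng.dirichlet(alpha, size=M)

corpus = []
for d in range(M):
    # z: one topic assignment per word position in document d.
    z = rng.choice(k, size=N, p=theta[d])
    # w: each word id is drawn from the word distribution of its assigned topic.
    w = [int(rng.choice(V, p=beta[topic])) for topic in z]
    corpus.append(w)

print("theta (documents x topics):\n", theta.round(2))
print("beta (topics x words):\n", beta.round(2))
print("generated word ids per document:", corpus)
```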


##### Joint posterior probability: 

In Bayesian statistics, it is the revised or updated probability of an event occurring given new information.<br>
Calculated by updating the prior probability using Bayes' Theorem.<br>
In other words, it is a conditional probability: the probability of event A occurring given that event B has occurred.<br>

##### Prior probability:

In Bayesian statistics, it is the probability of an event occurring before new information is given.<br>
It is the starting belief that Bayes' Theorem updates; it is not itself calculated from the new information.

##### Note:

Choose # of topics < # of documents to reduce dimensionality for further analysis using other algorithms.<br>

</span>
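
To make the prior and posterior definitions above concrete, here is a small worked example of Bayes' Theorem. Every number in it is invented for illustration: event A is "a Tweet is about a jobs topic" and event B is "the Tweet contains the word jobs".

```python
# All probabilities below are made up for illustration.
prior_topic = 0.2                  # P(A): belief before reading the Tweet
likelihood_word_given_topic = 0.6  # P(B | A)
likelihood_word_given_other = 0.1  # P(B | not A)

# P(B): total probability of seeing the word under either case.
evidence = (likelihood_word_given_topic * prior_topic
            + likelihood_word_given_other * (1 - prior_topic))

# Bayes' Theorem: P(A | B) = P(B | A) * P(A) / P(B).
posterior_topic = likelihood_word_given_topic * prior_topic / evidence

print(round(posterior_topic, 3))  # 0.6
```

Seeing the word raises the belief from the prior of 0.2 to a posterior of 0.6. LDA applies the same idea jointly to θ, z, and β rather than to a single event.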

# Scikit-Learn Latent Dirichlet Allocation on SLO Twitter Dataset:

### Import libraries and set parameters:

In [15]:
"""
Resources Used:

https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation
https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730

"""

################################################################################################################
################################################################################################################

# Import libraries.
import logging as log
import warnings
import tensorflow as tf
import time
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Import custom utility functions.
import slo_lda_topic_extraction_utility_functions as lda_util

#############################################################

# Miscellaneous parameter adjustments for pandas and python.
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

"""
Turn debug log statements for various sections of code on/off.
(adjust log level as necessary)
"""
log.basicConfig(level=log.INFO)
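# tf.compat.v1.logging is available in TensorFlow 1.13+ and 2.x (plain tf.logging was removed in 2.0).
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)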

################################################################################################################
################################################################################################################

<span style="font-family:Papyrus; font-size:1.25em;">

Adjust log verbosity levels as necessary.<br>

Set to "DEBUG" to view all debug output.<br>
Set to "INFO" to view useful information on dataframe shape, etc.<br>

</span>
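
As a quick illustration of how the verbosity level filters messages, the sketch below uses a separate, hypothetical logger named "lda_logging_demo" that writes to an in-memory stream, so it does not interfere with the root logger configured above.

```python
import io
import logging

# A dedicated logger and in-memory stream keep this demo apart from the root logger.
stream = io.StringIO()
demo_log = logging.getLogger("lda_logging_demo")
demo_log.addHandler(logging.StreamHandler(stream))
demo_log.propagate = False

demo_log.setLevel(logging.INFO)
demo_log.debug("hidden at INFO level")   # below the threshold, so it is dropped
demo_log.info("visible at INFO level")

demo_log.setLevel(logging.DEBUG)
demo_log.debug("visible at DEBUG level")

print(stream.getvalue())
```

One thing to keep in mind: `logging.basicConfig` does nothing if the root logger already has handlers, which can be the case inside a notebook. If the level does not seem to change, that may be the reason.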

### Import and prep the dataset for use in LDA topic extraction:

In [16]:
# Import the dataset.
tweet_dataset_processed = \
    pd.read_csv("D:/Dropbox/summer-research-2019/datasets/dataset_20100101-20180510_tok_LDA_PROCESSED.csv", sep=",")

# Shuffle the rows randomly (original index labels are kept).
# Note: pd.np was removed in pandas 2.0, so we use DataFrame.sample instead.
tweet_dataset_processed = tweet_dataset_processed.sample(frac=1)

# Generate a Pandas dataframe.
tweet_dataframe_processed = pd.DataFrame(tweet_dataset_processed)

# Print shape and column names.
log.info("\n")
log.info("The shape of our preprocessed SLO dataframe:")
log.info(tweet_dataframe_processed.shape)
log.info("\n")
log.info("The columns of our preprocessed SLO dataframe:")
log.info(tweet_dataframe_processed.head)
log.info("\n")

# Drop any NaN or empty Tweet rows in dataframe (or else CountVectorizer will blow up).
tweet_dataframe_processed = tweet_dataframe_processed.dropna()

# Print shape and column names.
log.info("\n")
log.info("The shape of our preprocessed SLO dataframe with NaN (empty) rows dropped:")
log.info(tweet_dataframe_processed.shape)
log.info("\n")
log.info("The columns of our preprocessed SLO dataframe with NaN (empty) rows dropped:")
log.info(tweet_dataframe_processed.head)
log.info("\n")

# Reindex everything.
tweet_dataframe_processed.index = pd.RangeIndex(len(tweet_dataframe_processed.index))

# Assign column names.
tweet_dataframe_processed_column_names = ['Tweet']

# Rename column in dataframe.
tweet_dataframe_processed.columns = tweet_dataframe_processed_column_names

# Create input feature.
selected_features = tweet_dataframe_processed[tweet_dataframe_processed_column_names]
processed_features = selected_features.copy()

# Check what we are using as inputs.
log.debug("\n")
log.debug("The Tweets in our input feature:")
log.debug(processed_features['Tweet'])
log.debug("\n")

# Create feature set.
slo_feature_set = processed_features['Tweet']

INFO:root:

INFO:root:The shape of our preprocessed SLO dataframe:
INFO:root:(658982, 1)
INFO:root:

INFO:root:The columns of our preprocessed SLO dataframe:
INFO:root:<bound method NDFrame.head of                                                   tweet_t
562440  sugar hit for local miners and shares in new y...
631505     needs for the coal mine please give generously
559563  riding the commodities wave with a nice breako...
301050  more than a hundred jobs to go at in sa worker...
164203  coal mine gautam adanis brother vinod named in...
...                                                   ...
504735                                                NaN
266067  indigenous owners threaten legal action unless...
503361  “ will push this project until no other fundin...
425274  no place for coal mines or a massive gasfield ...
398365  yeh the 6 was estrela nuno was ��� rochinha is...

[658982 rows x 1 columns]>
INFO:root:

INFO:root:

INFO:root:The shape of our preprocessed SLO dataframe 

<span style="font-family:Papyrus; font-size:1.25em;">

The log.info messages above show the shape and contents of the preprocessed dataframe before and after dropping any rows that are just "NaN", which indicates that the Tweet contained only irrelevant words.<br>

The rest of the code simply imports our dataset, puts it into a Pandas dataframe, drops any NaN rows to avoid blowing up CountVectorizer, re-assigns the column name to "Tweet", and assigns a name to the input feature we will use for LDA topic extraction.<br>

</span>
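
The same preparation steps can be tried on a tiny made-up dataframe to see what each one does. The Tweets below are invented and the variable names are ours; only the column name "tweet_t" and the sequence of steps mirror the cell above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the preprocessed CSV; NaN marks Tweets that ended up empty.
toy = pd.DataFrame(
    {"tweet_t": ["coal exports rise", np.nan, "new rail line approved", np.nan, "shares fall"]}
)

# Shuffle the rows; the original index labels travel with them.
shuffled = toy.sample(frac=1, random_state=0)

# Drop empty rows, then rebuild a clean 0..n-1 index.
cleaned = shuffled.dropna().reset_index(drop=True)

# Rename the single column and pull it out as the input feature.
cleaned.columns = ["Tweet"]
toy_feature_set = cleaned["Tweet"]

print(toy_feature_set)
```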

### Function that performs the topic extraction:

In [17]:
def latent_dirichlet_allocation_topic_extraction():
    """
    Function performs topic extraction on Tweets using Scikit-Learn LDA model.

    :return: None.
    """
    from sklearn.decomposition import LatentDirichletAllocation

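    # LDA needs raw term counts (not TF-IDF weights) because it is a probabilistic graphical model of word counts.
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')
    term_counts = tf_vectorizer.fit_transform(slo_feature_set)
    # get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names().
    tf_feature_names = tf_vectorizer.get_feature_names_out()

    # Run LDA. The number of topics is set via n_components (the older n_topics argument was removed).
    lda = LatentDirichletAllocation(n_components=20, max_iter=5, learning_method='online', learning_offset=50.,
                                    random_state=0).fit(term_counts)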

    # Display the top words for each topic.
    lda_util.display_topics(lda, tf_feature_names, 10)

<span style="font-family:Papyrus; font-size:1.25em;">

The above is the Scikit-Learn implementation of LDA.  We use the CountVectorizer class to vectorize our input feature and then fit the LDA model to our data.<br>

We call a utility function to display the top words associated with each topic.<br>

</span>
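
The utility function that displays the topics is not shown in this notebook. As a rough idea of what such a helper could look like, the sketch below defines a hypothetical `top_words_per_topic` function (our own name, not the actual `lda_util.display_topics`) and runs it on four invented Tweets. It relies on the fitted model's `components_` attribute, where each row holds one topic's weights over the vocabulary.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def top_words_per_topic(model, feature_names, num_top_words):
    """Return the highest-weighted words for each topic of a fitted LDA model."""
    topics = []
    for topic_weights in model.components_:
        # Sort word indices by weight, largest first, and keep the top few.
        top_indices = np.argsort(topic_weights)[::-1][:num_top_words]
        topics.append([feature_names[i] for i in top_indices])
    return topics


# Invented Tweets, only for demonstration.
toy_tweets = [
    "coal exports boost mining shares",
    "mining shares rise on coal exports",
    "farmers protest groundwater risk",
    "groundwater risk worries local farmers",
]

vectorizer = CountVectorizer(stop_words="english")
term_counts = vectorizer.fit_transform(toy_tweets)
# Order the vocabulary by column index so it lines up with components_.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(term_counts)

for topic_id, words in enumerate(top_words_per_topic(lda, feature_names, 3)):
    print("Topic %d: %s" % (topic_id, " ".join(words)))
```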

### Main executes the program:

In [None]:
start_time = time.time()
################################################
"""
Perform the Twitter dataset preprocessing.
"""
# lda_util.tweet_dataset_preprocessor("datasets/dataset_20100101-20180510_tok_PROCESSED.csv",
#                                     "datasets/dataset_20100101-20180510_tok_LDA_PROCESSED2.csv", "tweet_t")
"""
Perform exhaustive grid search.
"""
# latent_dirichlet_allocation_grid_search()
"""
Perform the topic extraction.
"""
latent_dirichlet_allocation_topic_extraction()
################################################
end_time = time.time()

log.info("The time taken to perform the operation is: ")
total_time = end_time - start_time
log.info(str(total_time))
log.info("\n")

<span style="font-family:Papyrus; font-size:1.25em;">

Tweet preprocessing is done via a custom library imported as "lda_util" from "slo_lda_topic_extraction_utility_functions.py".<br>

The function to perform exhaustive grid search isn't currently used.  We will use it in the future once we fully understand LDA, its associated hyperparameters, and how to tune for improved results.<br>

It takes around 450 seconds to finish LDA topic extraction per execution, so it is not a particularly fast process.<br>

</span>

### Results from a different execution of LDA topic extraction on our dataset:

### Exhaustive grid search for Scikit-Learn LDA:

In [None]:
def latent_dirichlet_allocation_grid_search():
    """
    Function performs exhaustive grid search for Scikit-Learn LDA model.

    :return: None.
    """
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.model_selection import GridSearchCV

    # Construct the pipeline.
    latent_dirichlet_allocation_clf = Pipeline([
        ('vect', CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')),
        ('clf', LatentDirichletAllocation()),
    ])

    # What parameters do we search for?
    parameters = {
        'vect__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
        'clf__n_components': [1, 5, 10, 15],
        'clf__doc_topic_prior': [None],
        'clf__topic_word_prior': [None],
        'clf__learning_method': ['batch', 'online'],
        'clf__learning_decay': [0.5, 0.7, 0.9],
        'clf__learning_offset': [5, 10, 15],
        'clf__max_iter': [5, 10, 15],
        'clf__batch_size': [64, 128, 256],
        'clf__evaluate_every': [0],
        'clf__total_samples': [1e4, 1e6, 1e8],
        'clf__perp_tol': [1e-1, 1e-2, 1e-3],
        'clf__mean_change_tol': [1e-1, 1e-3, 1e-5],
        'clf__max_doc_update_iter': [50, 100, 150],
        'clf__n_jobs': [-1],
        'clf__verbose': [0],
        'clf__random_state': [None],
    }

    # Perform the grid search.
    # (The iid argument was removed from GridSearchCV in scikit-learn 0.24, so it is no longer passed.)
    latent_dirichlet_allocation_clf = GridSearchCV(latent_dirichlet_allocation_clf, parameters, cv=5, n_jobs=-1)
    latent_dirichlet_allocation_clf.fit(slo_feature_set)

    # View all the information stored in the model after training it.
    classifier_results = pd.DataFrame(latent_dirichlet_allocation_clf.cv_results_)
    log.debug("The shape of the Latent Dirichlet Allocation model's result data structure is:")
    log.debug(classifier_results.shape)
    log.debug(
        "The contents of the Latent Dirichlet Allocation model's result data structure are:")
    log.debug(classifier_results.head())

    # Display the optimal parameters.
    log.info("The optimal parameters found for the Latent Dirichlet Allocation model are:")
    for param_name in sorted(parameters.keys()):
        log.info("%s: %r" % (param_name, latent_dirichlet_allocation_clf.best_params_[param_name]))
    log.info("\n")

<span style="font-family:Papyrus; font-size:1.25em;">

We use Scikit-Learn's Pipeline class to construct a pipeline consisting of the CountVectorizer and LatentDirichletAllocation classes.<br>

The "parameters" dictionary determines all the possible combinations of hyperparameters we will test in order to find the optimal hyperparameters for the Scikit-Learn LDA model.<br>

The grid search is performed by fitting on the data we wish to use for topic extraction.<br>

The optimal hyperparameters are displayed via "log.info" messages, so the log verbosity level must be set appropriately to view them.<br>

</span>
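
Before running the search, it can help to count how many combinations the "parameters" dictionary actually produces. The sketch below copies that dictionary, counts its combinations with Scikit-Learn's ParameterGrid, and then runs a much smaller search on a handful of invented Tweets. The toy corpus, the reduced grid, and cv=2 are our own assumptions, chosen only so the example finishes quickly.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline

# The same grid as in the function above.
full_grid = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
    'clf__n_components': [1, 5, 10, 15],
    'clf__doc_topic_prior': [None],
    'clf__topic_word_prior': [None],
    'clf__learning_method': ['batch', 'online'],
    'clf__learning_decay': [0.5, 0.7, 0.9],
    'clf__learning_offset': [5, 10, 15],
    'clf__max_iter': [5, 10, 15],
    'clf__batch_size': [64, 128, 256],
    'clf__evaluate_every': [0],
    'clf__total_samples': [1e4, 1e6, 1e8],
    'clf__perp_tol': [1e-1, 1e-2, 1e-3],
    'clf__mean_change_tol': [1e-1, 1e-3, 1e-5],
    'clf__max_doc_update_iter': [50, 100, 150],
    'clf__n_jobs': [-1],
    'clf__verbose': [0],
    'clf__random_state': [None],
}
n_combinations = len(ParameterGrid(full_grid))
print("Combinations in the full grid:", n_combinations)  # 209952

# A much smaller search on invented Tweets, just to show the mechanics.
toy_tweets = [
    "coal exports boost mining shares",
    "mining shares rise on coal exports",
    "coal mining jobs boost local shares",
    "exports of coal rise again",
    "farmers protest groundwater risk",
    "groundwater risk worries local farmers",
    "local farmers protest coal exports",
    "groundwater protest grows among farmers",
]
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LatentDirichletAllocation(random_state=0)),
])
small_grid = {'clf__n_components': [2, 3], 'clf__max_iter': [5, 10]}

# With no explicit scoring, GridSearchCV falls back to the pipeline's score(),
# which for LDA is an approximate log-likelihood on the held-out fold.
search = GridSearchCV(pipeline, small_grid, cv=2)
search.fit(toy_tweets)
print("Best parameters on the toy data:", search.best_params_)
```

With cv=5, the full grid would mean more than a million model fits, so the grid would probably need to be narrowed before the search is practical on the full dataset.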

### Why does it work poorly on Tweets?

<span style="font-family:Papyrus; font-size:1.25em;">
    
##### Based on Derek Fisher's senior project presentation:

1) LDA typically works best when the documents are lengthy (large word count) and written in a formal, proper style.

2) Tweet text is generally very short, with a maximum of around 280 characters.

3) Tweet text is generally written in a very informal style, with:

    i) emojis.
    ii) spelling errors.
    iii) other grammatical errors.
    iv) etc.

4) The above makes it difficult for the LDA algorithm to discover any prominent underlying hidden structures.

</span>
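
The length issue can be illustrated by counting how many distinct vocabulary terms a short text contributes compared with a longer one. Both texts below are invented, and the count of non-zero entries per row is just one simple way we chose to show how little a single Tweet gives the model to work with.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented examples: one Tweet-length text and one longer, more formal passage.
short_text = "coal jobs lol"
long_text = (
    "the proposed coal project would create jobs in the region, but local farmers "
    "argue that it threatens groundwater and that the jobs estimate is overstated "
    "by the company in its environmental impact statement"
)

counts = CountVectorizer(stop_words="english").fit_transform([short_text, long_text])

# Number of distinct vocabulary terms present in each document (non-zero columns per row).
distinct_terms = counts.getnnz(axis=1)

print("distinct terms in the short text:", distinct_terms[0])
print("distinct terms in the long text: ", distinct_terms[1])
print("vocabulary size:", counts.shape[1])
```

The short text fills only a few columns of the document-term matrix, so there are few word co-occurrences for LDA to learn from.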

## Notes:

<span style="font-family:Papyrus; font-size:1.25em;">

Refer to the URL below for the utility functions used above for data preprocessing and LDA topic extraction:

Placeholder.

</span>