# Latent Dirichlet Allocation (LDA)

Note: Tentatively finished.  Might update the codebase given progress on reproducing Derek Fisher's results.

#### Resources Used:

https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158

https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d

https://www.investopedia.com/terms/p/posterior-probability.asp

https://cs.calvin.edu/courses/cs/x95/videos/2018-2019/

## LDA Basics:

<span style="font-family:Papyrus; font-size:1.25em;">
    
##### Basic Concept:

Each document is described by a distribution over topics.<br>
Each topic is described by a distribution over words.<br>
Documents are typically represented as bag-of-words feature vectors.<br>
This permits identifying the topics within documents and mapping each document to its associated topics.<br>
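
A bag-of-words vector counts how often each vocabulary term occurs in a document and discards word order. A minimal sketch using only the standard library; the two-document corpus and the `bag_of_words` helper are hypothetical and only illustrate the representation:

```python
from collections import Counter

# Hypothetical toy corpus; each string is one document.
docs = [
    "coal mine approved by minister",
    "solar and wind replace coal",
]

# Vocabulary: every distinct token in the corpus, in sorted order.
vocabulary = sorted({token for doc in docs for token in doc.split()})


def bag_of_words(doc):
    """Return the term counts of `doc`, aligned with `vocabulary`."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]


# One row per document, one column per vocabulary term.
doc_term_matrix = [bag_of_words(doc) for doc in docs]

for doc, row in zip(docs, doc_term_matrix):
    print(row, "<-", doc)
```

The scikit-learn `CountVectorizer` used later in this notebook builds the same kind of matrix, in sparse form.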

##### Terms:

Observed layer: documents (composites) and words (parts).<br>
Hidden (latent) layer: topics (categories).<br>

k — Number of topics in the model (a fixed number chosen in advance).<br>

V — Size of the vocabulary.<br>

M — Number of documents.<br>

N — Number of words in each document.<br>

w — A word in a document. This is represented as a one hot encoded vector of size V (i.e. V — vocabulary size).<br>

w (bold w): represents a document (i.e. vector of “w”s) of N words.<br>

D — Corpus, a collection of M documents.<br>

z — A topic from a set of k topics. A topic is a distribution over words. For example, it might be Animal = (0.3 Cats, 0.4 Dogs, 0 AI, 0.2 Loyal, 0.1 Evil).<br>

![lda](lda_model.jpeg)
    
</span>

<span style="font-family:Papyrus; font-size:1.25em;">
    
α — Distribution-related parameter that governs what the distribution of topics looks like for all the documents in the corpus.<br>

θ — Random matrix where θ(i,j) represents the probability of the i th document containing the j th topic.<br>

η — Distribution related parameter that governs what the distribution of words in each topic looks like.<br>

β — A random matrix where β(i,j) represents the probability of i th topic containing the j th word.<br>
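
Read together, α, θ, η and β describe how LDA assumes a corpus is generated. A rough sketch with NumPy, under assumed toy sizes (k = 3, V = 8, M = 4, N = 10) and assumed symmetric prior values; none of these numbers come from the notebook:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: topics, vocabulary size, documents, words per document.
k, V, M, N = 3, 8, 4, 10
alpha = np.full(k, 0.5)   # assumed symmetric prior over topics (per document)
eta = np.full(V, 0.1)     # assumed symmetric prior over words (per topic)

# beta[i, j]: probability of topic i containing word j.
beta = rng.dirichlet(eta, size=k)
# theta[i, j]: probability of document i containing topic j.
theta = rng.dirichlet(alpha, size=M)

corpus = []
for d in range(M):
    # Draw a topic for each of the N word slots in document d.
    z = rng.choice(k, size=N, p=theta[d])
    # Draw each word id from the word distribution of its topic.
    words = [int(rng.choice(V, p=beta[t])) for t in z]
    corpus.append(words)

print(theta.round(2))
print(corpus[0])
```

Fitting LDA runs this story in reverse: only `corpus` is observed, and θ, z and β have to be inferred.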

##### Dirichlet Distribution (example):

![dirichlet](dirichlet_distribution.png)

1) Large values of α push the distribution toward the center.<br>
2) Small values of α push the distribution toward the edges.<br>
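
The effect of α can be checked numerically. The sketch below is illustrative only: the helper name, the two α values (10.0 and 0.1) and the sample count are assumptions. It draws samples from a symmetric 3-dimensional Dirichlet and reports the average size of the largest component, which is close to 1/3 when samples sit near the center and close to 1 when they sit near a corner:

```python
import numpy as np

rng = np.random.default_rng(0)


def mean_largest_component(alpha_value, k=3, n_samples=5000):
    """Average of the largest coordinate over samples from Dirichlet(alpha_value, ..., alpha_value)."""
    samples = rng.dirichlet(np.full(k, alpha_value), size=n_samples)
    return samples.max(axis=1).mean()


concentrated = mean_largest_component(10.0)  # large alpha: mass near the center
sparse = mean_largest_component(0.1)         # small alpha: mass near the corners

print("alpha = 10.0:", round(concentrated, 2))
print("alpha = 0.1: ", round(sparse, 2))
```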

</span>

<span style="font-family:Papyrus; font-size:1.25em;">
    
##### Mathematical equivalent of the above graphical representation of LDA:

![mathematical_model](lda_equation.png)

##### English Translation:

Given a set of M documents with each containing N words and each word generated from a topic "k" from a set of K topics, find the joint posterior probability of:

θ — A distribution of topics, one for each document,<br>
z — N Topics for each document,<br>
β — A distribution of words, one for each topic,<br>

Given:

D — All the data we have (i.e. the corpus),<br>

Using the parameters:

α — A parameter vector for each document (document — Topic distribution).<br>
η — A parameter vector for each topic (topic — word distribution).<br>
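
Written out with the symbols defined above, the quantity being described is a posterior of the following shape. This is the standard smoothed-LDA form, given here as an assumption about what the equation image contains rather than a transcription of it:

$$
P(\theta_{1:M}, \mathbf{z}_{1:M}, \beta_{1:K} \mid D; \alpha, \eta)
= \frac{P(\theta_{1:M}, \mathbf{z}_{1:M}, \beta_{1:K}, D \mid \alpha, \eta)}{P(D \mid \alpha, \eta)}
$$

where the joint distribution in the numerator factorizes according to the generative story:

$$
P(\theta, \mathbf{z}, \beta, D \mid \alpha, \eta)
= \prod_{i=1}^{K} P(\beta_i \mid \eta)
\prod_{d=1}^{M} P(\theta_d \mid \alpha)
\prod_{n=1}^{N} P(z_{d,n} \mid \theta_d)\, P(w_{d,n} \mid \beta_{z_{d,n}})
$$

The denominator requires summing over every possible topic assignment, which is why practical implementations approximate the posterior instead of computing it exactly.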


##### Joint posterior probability: 

In Bayesian statistics, it is the revised or updated probability of an event occurring given new information.<br>
It is calculated by updating the prior probability using Bayes' Theorem.<br>
In other words, it is a conditional probability: the probability of event A occurring given that event B has occurred.<br>

##### Prior probability:

In Bayesian statistics, it is the probability of an event occurring before new information is given.<br>
It is the starting point that Bayes' Theorem updates into the posterior, rather than something the theorem itself produces.<br>
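
A small worked example may make the prior/posterior relationship concrete. All three probabilities below are made-up numbers chosen for illustration: the prior belief that a document is about mining, and how likely the word "coal" is under each hypothesis.

```python
# Hypothetical numbers, for illustration only.
prior_mining = 0.30          # P(topic = mining), before seeing any word
p_coal_given_mining = 0.20   # P(word = "coal" | topic = mining)
p_coal_given_other = 0.02    # P(word = "coal" | topic != mining)

# Total probability of observing "coal" under either hypothesis.
evidence = (p_coal_given_mining * prior_mining
            + p_coal_given_other * (1 - prior_mining))

# Bayes' Theorem: posterior = likelihood * prior / evidence.
posterior_mining = p_coal_given_mining * prior_mining / evidence

print(round(posterior_mining, 3))  # prints 0.811
```

Seeing the word "coal" raises the belief in the mining topic from 0.30 to roughly 0.81.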

##### Note:

Choose # of topics < # of documents to reduce dimensionality for further analysis using other algorithms.<br>
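
The dimensionality reduction mentioned in the note can be seen directly in the shape of the document-topic matrix. A minimal sketch with scikit-learn, assuming a made-up six-document corpus and k = 2:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus (M = 6 documents).
docs = [
    "coal mine approved by the minister",
    "minister defends coal mine approval",
    "solar and wind power keep growing",
    "wind farms and solar panels replace coal",
    "climate activists protest the coal mine",
    "renewables funding boosts solar projects",
]

counts = CountVectorizer().fit_transform(docs)   # shape (M, V)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)           # shape (M, k)

print("term-count features:", counts.shape)
print("topic features:     ", doc_topics.shape)
```

Each row of `doc_topics` is a k-dimensional topic mixture that can be passed to other algorithms in place of the much wider term-count row.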

</span>

## An example of using LDA:

### Prepare the dataset to be used in the LDA model:

In [4]:
"""
SLO Topic Modeling
Advisor: Professor VanderLinden
Name: Joseph Jinn
Date: 5-29-19

LDA - Latent Dirichlet Allocation

###########################################################
Notes:

LDA can only use raw term counts (CANNOT use tfidf transformer)

###########################################################
Resources Used:

https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation

https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730

"""

################################################################################################################
################################################################################################################

import logging as log
import warnings
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

#############################################################

# Note: Need to set level AND turn on debug variables in order to see all debug output.
log.basicConfig(level=log.DEBUG)

# Miscellaneous parameter adjustments for pandas and python.
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

"""
Turn debug log statements for various sections of code on/off.
"""
# Debug the GridSearch functions for each Classifier.
debug_pipeline = True
# Debug the initial dataset import and feature/target set creation.
debug_preprocess_tweets = False
# Debug create_training_and_test_set() function.
debug_train_test_set_creation = False

################################################################################################################
################################################################################################################

# Import the datasets.
tweet_dataset_processed1 = \
    pd.read_csv("D:/Dropbox/summer-research-2019/datasets/tbl_kvlinden_PROCESSED.csv", sep=",")

tweet_dataset_processed2 = \
    pd.read_csv("D:/Dropbox/summer-research-2019/datasets/tbl_training_set_PROCESSED.csv", sep=",")

# Reindex and shuffle the data randomly.
# Note: pd.np was removed from pandas, so NumPy is used directly.
tweet_dataset_processed1 = tweet_dataset_processed1.reindex(
    np.random.permutation(tweet_dataset_processed1.index))

tweet_dataset_processed2 = tweet_dataset_processed2.reindex(
    np.random.permutation(tweet_dataset_processed2.index))

# Generate a Pandas dataframe.
tweet_dataframe_processed1 = pd.DataFrame(tweet_dataset_processed1)
tweet_dataframe_processed2 = pd.DataFrame(tweet_dataset_processed2)

if debug_preprocess_tweets:
    # Print shape and column names.
    log.debug("\n")
    log.debug("The shape of our SLO dataframe 1:")
    log.debug(tweet_dataframe_processed1.shape)
    log.debug("\n")
    log.debug("The columns of our SLO dataframe 1:")
    log.debug(tweet_dataframe_processed1.columns)
    log.debug("\n")
    # Print shape and column names.
    log.debug("\n")
    log.debug("The shape of our SLO dataframe 2:")
    log.debug(tweet_dataframe_processed2.shape)
    log.debug("\n")
    log.debug("The columns of our SLO dataframe 2:")
    log.debug(tweet_dataframe_processed2.columns)
    log.debug("\n")

# Concatenate the individual datasets together.
frames = [tweet_dataframe_processed1, tweet_dataframe_processed2]
slo_dataframe_combined = pd.concat(frames, ignore_index=True)

# Reindex everything.
slo_dataframe_combined.index = pd.RangeIndex(len(slo_dataframe_combined.index))
# slo_dataframe_combined.index = range(len(slo_dataframe_combined.index))

# Assign column names.
tweet_dataframe_processed_column_names = ['Tweet', 'SLO']

# Create input features.
selected_features = slo_dataframe_combined[tweet_dataframe_processed_column_names]
processed_features = selected_features.copy()

if debug_preprocess_tweets:
    # Check what we are using as inputs.
    log.debug("\n")
    log.debug("The Tweets in our input feature:")
    log.debug(processed_features['Tweet'])
    log.debug("\n")
    log.debug("SLO TBL topic classification label for each Tweet:")
    log.debug(processed_features['SLO'])
    log.debug("\n")

# Create feature set and target sets.
slo_feature_set = processed_features['Tweet']
slo_target_set = processed_features['SLO']


### Create the training and test sets:

In [5]:
#######################################################
def create_training_and_test_set():
    """
    This function splits the feature set and the target set into training and test sets.

    Note: We use this to generate a randomized training and test set in order to average our results over
    n iterations.

    random_state = rng (where rng = a randomly drawn integer seed)

    :return: Nothing.  Global variables are established.
    """
    global tweet_train, tweet_test, target_train, target_test, target_train_encoded, target_test_encoded

    from sklearn.model_selection import train_test_split

    import random
    rng = random.randint(1, 1000000)
    # Split feature and target set into training and test sets for each set.
    tweet_train, tweet_test, target_train, target_test = train_test_split(slo_feature_set, slo_target_set,
                                                                          test_size=0.33,
                                                                          random_state=rng)

    if debug_train_test_set_creation:
        log.debug("Shape of tweet training set:")
        log.debug(tweet_train.shape)
        log.debug("Shape of tweet test set:")
        log.debug(tweet_test.shape)
        log.debug("Shape of target training set:")
        log.debug(target_train.shape)
        log.debug("Shape of target test set:")
        log.debug(target_test.shape)
        log.debug("\n")

    #######################################################

    # Use scikit-learn to encode labels into integer values - one assigned integer value per class.
    from sklearn import preprocessing

    # Fit once on the full target set so the training and test splits share one label-to-integer mapping.
    target_label_encoder = preprocessing.LabelEncoder()
    target_label_encoder.fit(slo_target_set)
    target_train_encoded = target_label_encoder.transform(target_train)
    target_test_encoded = target_label_encoder.transform(target_test)

    target_train_decoded = target_label_encoder.inverse_transform(target_train_encoded)
    target_test_decoded = target_label_encoder.inverse_transform(target_test_encoded)

    if debug_train_test_set_creation:
        log.debug("Encoded target training labels:")
        log.debug(target_train_encoded)
        log.debug("Decoded target training labels:")
        log.debug(target_train_decoded)
        log.debug("\n")
        log.debug("Encoded target test labels:")
        log.debug(target_test_encoded)
        log.debug("Decoded target test labels:")
        log.debug(target_test_decoded)
        log.debug("\n")

    # return [tweet_train, tweet_test, target_train, target_test, target_train_encoded, target_test_encoded]
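
`LabelEncoder` assigns integers by sorted class order, so an encoder fitted on a different subset of labels can map the same label to a different integer. A minimal sketch with hypothetical company labels (not the actual SLO label values) showing the difference:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, for illustration only.
train_labels = ["adani", "bhp", "santos"]
test_labels = ["santos", "bhp"]

# One encoder fitted on the training labels and reused for the test labels.
encoder = LabelEncoder().fit(train_labels)
consistent = encoder.transform(test_labels)

# A second encoder fitted from scratch on the test labels alone.
refit = LabelEncoder().fit_transform(test_labels)

print(list(consistent))  # prints [2, 1]
print(list(refit))       # prints [1, 0]
```

With the shared encoder "santos" stays 2 in both splits. With the refitted encoder it becomes 1, so encoded training and test targets would no longer be comparable.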

### Exhaustive grid search for Scikit-Learn LDA:

In [6]:
################################################################################################################

def latent_dirichlet_allocation_grid_search():
    """
    Function performs exhaustive grid search for LDA.

    :return: None.
    """
    from sklearn.decomposition import LatentDirichletAllocation

    # Create randomized training and test set using our dataset.
    create_training_and_test_set()

    # Construct the pipeline.
    latent_dirichlet_allocation_clf = Pipeline([
        ('vect', CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')),
        ('clf', LatentDirichletAllocation()),
    ])

    from sklearn.model_selection import GridSearchCV

    # What parameters do we search for?
    parameters = {
        'vect__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
        'clf__n_components': [1, 5, 10, 15],
        'clf__doc_topic_prior': [None],
        'clf__topic_word_prior': [None],
        'clf__learning_method': ['batch', 'online'],
        'clf__learning_decay': [0.5, 0.7, 0.9],
        'clf__learning_offset': [5, 10, 15],
        'clf__max_iter': [5, 10, 15],
        'clf__batch_size': [64, 128, 256],
        'clf__evaluate_every': [0],
        'clf__total_samples': [1e4, 1e6, 1e8],
        'clf__perp_tol': [1e-1, 1e-2, 1e-3],
        'clf__mean_change_tol': [1e-1, 1e-3, 1e-5],
        'clf__max_doc_update_iter': [50, 100, 150],
        'clf__n_jobs': [-1],
        'clf__verbose': [0],
        'clf__random_state': [None],
    }

    # Perform the grid search.
    # Note: the "iid" argument was removed from GridSearchCV in scikit-learn 0.24, so it is not passed.
    latent_dirichlet_allocation_clf = GridSearchCV(latent_dirichlet_allocation_clf, parameters, cv=5,
                                                   n_jobs=-1)
    latent_dirichlet_allocation_clf.fit(tweet_train)

    if debug_pipeline:
        # View all the information stored in the model after training it.
        classifier_results = pd.DataFrame(latent_dirichlet_allocation_clf.cv_results_)
        log.debug("The shape of the Latent Dirichlet Allocation model's result data structure is:")
        log.debug(classifier_results.shape)
        log.debug(
            "The contents of the Latent Dirichlet Allocation model's result data structure are:")
        log.debug(classifier_results.head())

    # Display the optimal parameters.
    log.debug("The optimal parameters found for the Latent Dirichlet Allocation are:")
    for param_name in sorted(parameters.keys()):
        log.debug("%s: %r" % (param_name, latent_dirichlet_allocation_clf.best_params_[param_name]))
    log.debug("\n")
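
The grid above multiplies out to 4 x 4 x 2 x 3^8 = 209,952 parameter combinations, each fitted 5 times under `cv=5`, so an exhaustive run is very expensive. Because LDA is unsupervised, `GridSearchCV` is fitted without targets and ranks candidates by the estimator's own `score` method (an approximate log-likelihood). A scaled-down sketch on a made-up eight-document corpus, with a deliberately tiny grid chosen only for illustration:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus.
docs = [
    "coal mine approved by the minister",
    "minister defends coal mine approval",
    "solar and wind power keep growing",
    "wind farms and solar panels replace coal",
    "climate activists protest the coal mine",
    "renewables funding boosts solar projects",
    "the minister visits the coal mine",
    "solar power funding keeps growing",
]

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LatentDirichletAllocation(max_iter=5, random_state=0)),
])

# 2 x 2 = 4 combinations instead of hundreds of thousands.
small_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__n_components': [2, 3],
}

search = GridSearchCV(pipeline, small_grid, cv=2)
search.fit(docs)  # no target argument: scoring falls back to LDA's score()

print(search.best_params_)
```

For larger searches, `RandomizedSearchCV` from the same module samples a fixed number of combinations instead of enumerating all of them.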

### Function that performs the topic extraction:

In [7]:
################################################################################################################

def latent_dirichlet_allocation_topic_extraction():
    """
    Function performs topic extraction on Tweets using LDA.

    :return: none.
    """
    from sklearn.decomposition import LatentDirichletAllocation

    # Create randomized training and test set using our dataset.
    create_training_and_test_set()

    # LDA models raw term counts (not tf-idf weights) because it is a probabilistic model of word occurrences.
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')
    tf = tf_vectorizer.fit_transform(tweet_train)
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
    tf_feature_names = tf_vectorizer.get_feature_names_out()

    # Run LDA.
    lda = LatentDirichletAllocation(n_components=20, max_iter=5, learning_method='online', learning_offset=50.,
                                    random_state=0).fit(tf)

    # Display the top words for each topic.
    display_topics(lda, tf_feature_names, 10)


################################################################################################################

def display_topics(model, feature_names, num_top_words):
    """
    Helper function to display the top words for each topic in the LDA model.

    :param model: the LDA model
    :param feature_names: feature names from CountVectorizer
    :param num_top_words: # of words to display for each topic.
    :return: none.
    """
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-num_top_words - 1:-1]]))


############################################################################################
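
The slice `topic.argsort()[:-num_top_words - 1:-1]` in `display_topics` is compact but easy to misread: `argsort()` returns indices in ascending order of weight, and the negative-step slice walks backwards from the end, yielding the indices of the largest weights first. A small sketch with a hypothetical five-word vocabulary and made-up weights:

```python
import numpy as np

# Hypothetical vocabulary and word weights for a single topic.
feature_names = ["adani", "coal", "climate", "solar", "labor"]
topic = np.array([0.9, 3.2, 0.1, 2.5, 1.7])
num_top_words = 3

ascending = topic.argsort()                         # indices from smallest to largest weight
top_indices = ascending[:-num_top_words - 1:-1]     # last three indices, in reverse order
top_words = [feature_names[i] for i in top_indices]

print(list(ascending))  # prints [2, 0, 4, 3, 1]
print(top_words)        # prints ['coal', 'solar', 'labor']
```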

### Main function that executes the program:

In [8]:
"""
Main function.  Execute the program.
"""
import time

if __name__ == '__main__':

    start_time = time.time()

    ################################################
    """
    Scikit-Learn NMF-LDA example.
    """
    # topic_extraction_nmf_lda_example()

    """
    Perform exhaustive grid search.
    """
    # latent_dirichlet_allocation_grid_search()

    """
    Perform the topic extraction.
    """
    latent_dirichlet_allocation_topic_extraction()

    ################################################

    end_time = time.time()

    if debug_pipeline:
        log.debug("The time taken to perform LDA is: ")
        total_time = end_time - start_time
        log.debug(str(total_time))
        log.debug("\n")

############################################################################################

DEBUG:root:The time taken to perform LDA is: 
DEBUG:root:0.5928463935852051
DEBUG:root:



Topic 0:
slomention massive reapproved adani slohashtag work minister enviro approved says
Topic 1:
slohashtag slourl slomention future shenhua national hinge ffs ruled offences
Topic 2:
slomention slohashtag slourl saviour said intends adani donations time structural
Topic 3:
slohashtag slomention slourl coal adani https bhp amp just labor
Topic 4:
renewables slourl sa coalmine spruiker begin thoughts second canary rollback
Topic 5:
slohashtag transition stop adani said risks trojan saskatchewa whats real
Topic 6:
slomention slohashtag adani coal amp slourl change climate link definite
Topic 7:
slohashtag thats suit turnbull rules right pressure person nemo funding
Topic 8:
link definite insanity coal protect reason slourl ffs won climate
Topic 9:
07 major knows https solar supply approved resources open activists
Topic 10:
santoss sentence did group wake qld fracking surveyed clear project
Topic 11:
slohashtag check real background offences assess foreign fails corporate ignores
[output truncated]

##### The above shows numerically indexed topics and the top words associated with each topic.

### Why does it work poorly on Tweets?

<span style="font-family:Papyrus; font-size:1.25em;">
    
##### Based on Derek Fisher's senior project presentation:

1) LDA typically works best when the documents are lengthy (large word count) and written in a formal, proper style.

2) Tweet text is generally very short, with a maximum of around 280 characters.

3) Tweet text is generally written in a very informal style, which includes:

    i) emojis.
    ii) spelling errors.
    iii) other grammatical errors.
    iv) etc.

4) The above makes it difficult for the LDA algorithm to discover any prominent underlying hidden structures.
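
One way to make point 4 concrete: LDA infers topics from which words occur together in the same document, and a short document offers very few such word pairs. The sketch below is only an illustration with two made-up texts and a hypothetical helper; it counts the distinct word pairs that each text contributes:

```python
from itertools import combinations


def cooccurring_pairs(text):
    """Return every unordered pair of distinct words found in `text`."""
    distinct_words = sorted(set(text.lower().split()))
    return list(combinations(distinct_words, 2))


# Hypothetical examples of a short tweet and a longer, formal sentence.
tweet = "adani coal mine approved ffs"
article_sentence = ("the minister approved the adani coal mine despite warnings "
                    "from climate scientists about emissions water use and reef damage")

tweet_pairs = len(cooccurring_pairs(tweet))
article_pairs = len(cooccurring_pairs(article_sentence))

print(tweet_pairs)    # prints 10
print(article_pairs)  # prints 153
```

Misspellings and one-off tokens shrink the useful evidence further, since a word that appears only once in the corpus shares no pairs with any other document.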

</span>