# Latent Dirichlet Allocation (LDA)

Note: Adapted from the SLO TBL topic classification codebase and Derek Fisher's code for my own LDA topic extraction implementation.

#### Resources Used:

https://towardsdatascience.com/light-on-math-machine-learning-intuitive-guide-to-latent-dirichlet-allocation-437c81220158

https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d

https://www.investopedia.com/terms/p/posterior-probability.asp

https://cs.calvin.edu/courses/cs/x95/videos/2018-2019/

## LDA Basics:

<span style="font-family:Papyrus; font-size:1.25em;">
    
##### Basic Concept:

Each document is described by a distribution over topics.<br>
Each topic is described by a distribution over words.<br>
Documents are typically represented with bag-of-words features.<br>
This permits the identification of topics within documents and the mapping of documents to their associated topics.<br>

##### Terms:

Observed layer: documents (composites) and words (parts).<br>
Hidden (latent) layer: topics (categories).<br>

k — Number of topics a document belongs to (a fixed number).<br>

V — Size of the vocabulary.<br>

M — Number of documents.<br>

N — Number of words in each document.<br>

w — A word in a document, represented as a one-hot encoded vector of size V (the vocabulary size).<br>

w (bold w): A document, i.e. a vector of N words “w”.<br>

D — Corpus, a collection of M documents.<br>

z — A topic from a set of k topics. A topic is a distribution over words. For example, Animal = (0.3 Cats, 0.4 Dogs, 0 AI, 0.2 Loyal, 0.1 Evil).<br>

![lda](lda_model.jpeg)
    
</span>
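
To make the notation above concrete, here is a minimal sketch of the one-hot and bag-of-words representations. The toy corpus and the helper names (`one_hot`, `bag_of_words`) are made up for illustration and are not part of the project code.

```python
from collections import Counter

# Hypothetical toy corpus D with M = 2 documents.
corpus = [
    "cats and dogs are loyal",
    "dogs chase cats and cats chase mice",
]

# Vocabulary: every distinct word in the corpus, in sorted order (its length is V).
vocabulary = sorted({word for doc in corpus for word in doc.split()})
word_index = {word: i for i, word in enumerate(vocabulary)}


def one_hot(word):
    """Return the one-hot vector of length V for a single word w."""
    vector = [0] * len(vocabulary)
    vector[word_index[word]] = 1
    return vector


def bag_of_words(doc):
    """Return the length-V count vector for a document (word order is discarded)."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


for doc in corpus:
    print(bag_of_words(doc))
```

A document's bag-of-words vector is the element-wise sum of the one-hot vectors of its N words, so its entries add up to N.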

<span style="font-family:Papyrus; font-size:1.25em;">
    
α — Distribution-related parameter that governs what the distribution of topics looks like across all the documents in the corpus.<br>

θ — Random matrix where θ(i,j) represents the probability that the i-th document contains the j-th topic.<br>

η — Distribution-related parameter that governs what the distribution of words in each topic looks like.<br>

β — Random matrix where β(i,j) represents the probability that the i-th topic contains the j-th word.<br>

##### Dirichlet Distribution (example):

![dirichlet](dirichlet_distribution.png)

1) Large values of α push the distribution toward the center (each document mixes many topics).<br>
2) Small values of α push the distribution toward the edges (each document is dominated by a few topics).<br>

</span>
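
The effect of α can be checked numerically. The sketch below is an illustration only: it assumes a symmetric Dirichlet over 3 categories and draws samples by normalizing independent Gamma draws (a standard construction). The function names are hypothetical.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories.

    Uses the standard construction: normalize k independent Gamma(alpha, 1) draws.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_component(alpha, k=3, n_samples=2000, seed=0):
    """Average of the largest coordinate over many samples.

    Values near 1 mean samples sit near a corner of the simplex;
    values near 1/k mean samples sit near the center.
    """
    rng = random.Random(seed)
    total = sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples))
    return total / n_samples


for alpha in (0.1, 1.0, 10.0):
    print(alpha, round(mean_largest_component(alpha), 3))
```

As α grows, the average largest component should shrink toward 1/3, which matches the "center versus edges" picture above.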

<span style="font-family:Papyrus; font-size:1.25em;">
    
##### Mathematical equivalent of the above graphical representation of LDA:

![mathematical_model](lda_equation.png)

##### English Translation:

Given a set of M documents, each containing N words, where each word is generated by one topic from a set of k topics, find the joint posterior probability of:

θ — A distribution of topics, one for each document,<br>
z — N Topics for each document,<br>
β — A distribution of words, one for each topic,<br>

Given:

D — All the data we have (i.e. the corpus),<br>

Using the parameters:

α — A parameter vector for each document (document — Topic distribution).<br>
η — A parameter vector for each topic (topic — word distribution).<br>
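
The variables in this description come from LDA's generative story, which can be sketched in a few lines. This is a toy illustration under stated assumptions, not an inference procedure: α and η are taken to be symmetric scalars, every document has exactly N words, and `generate_corpus` and the vocabulary are hypothetical.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocabulary, k, m, n, alpha, eta, seed=0):
    """Sample M documents of N words each from the LDA generative story."""
    rng = random.Random(seed)
    v = len(vocabulary)
    # beta: one word distribution per topic (k rows, V columns).
    beta = [sample_dirichlet([eta] * v, rng) for _ in range(k)]
    theta, documents = [], []
    for _ in range(m):
        # theta_d: the topic distribution of this document.
        theta_d = sample_dirichlet([alpha] * k, rng)
        words = []
        for _ in range(n):
            # z: the topic that generates this word position.
            z = rng.choices(range(k), weights=theta_d)[0]
            # w: a word drawn from topic z's word distribution.
            words.append(rng.choices(vocabulary, weights=beta[z])[0])
        theta.append(theta_d)
        documents.append(words)
    return theta, beta, documents


vocab = ["cats", "dogs", "ai", "loyal", "evil"]  # hypothetical vocabulary
theta, beta, docs = generate_corpus(vocab, k=2, m=3, n=6, alpha=0.5, eta=0.5)
for doc in docs:
    print(" ".join(doc))
```

Fitting LDA runs this story in reverse: only the words in `docs` are observed, and θ, z and β have to be inferred from them.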


##### Joint posterior probability: 

In Bayesian statistics, it is the revised or updated probability of an event occurring given new information.<br>
Calculated by updating the prior probability using Bayes' Theorem.<br>
In other words, it is a conditional probability: the probability of event A occurring given that event B has occurred.<br>

##### Prior probability:

In Bayesian statistics, it is the probability of an event occurring before new information is taken into account.<br>
It is the starting point that Bayes' Theorem updates, rather than something calculated from it.
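
A small worked example of moving from a prior to a posterior with Bayes' Theorem. All numbers and topic names are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical numbers: two topics and the probability each assigns to the word "dogs".
prior = {"animal": 0.5, "technology": 0.5}        # P(topic) before seeing the word
likelihood = {"animal": 0.4, "technology": 0.01}  # P("dogs" | topic)

# Evidence: total probability of observing "dogs" under the prior.
evidence = sum(prior[t] * likelihood[t] for t in prior)

# Bayes' Theorem: posterior = likelihood * prior / evidence.
posterior = {t: likelihood[t] * prior[t] / evidence for t in prior}

for topic, probability in posterior.items():
    print(f"P({topic} | 'dogs') = {probability:.3f}")
```

Observing "dogs" moves the probability of the animal topic from 0.5 to roughly 0.976, because that topic explains the word far better than the alternative.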

##### Note:

Choose # of topics < # of documents to reduce dimensionality for further analysis using other algorithms.<br>

</span>

## An example of using LDA:

### Prepare the datasets to be used in the LDA model:

In [8]:
"""
SLO Topic Modeling
Advisor: Professor VanderLinden
Name: Joseph Jinn
Date: 5-29-19

LDA - Latent Dirichlet Allocation

###########################################################
Notes:

LDA can only use raw term counts (CANNOT use tfidf transformer)

###########################################################
Resources Used:

https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation

https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730

"""

################################################################################################################
################################################################################################################

import logging as log
import re
import string
import warnings
import tensorflow as tf
import time
from tensorflow import keras
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.pipeline import Pipeline
from sklearn import metrics

#############################################################

# Note: Need to set level AND turn on debug variables in order to see all debug output.
log.basicConfig(level=log.DEBUG)
tf.logging.set_verbosity(tf.logging.ERROR)

# Miscellaneous parameter adjustments for pandas and python.
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)

"""
Turn debug log statements for various sections of code on/off.
"""
# Debug the GridSearch functions for each Classifier.
debug_pipeline = True
# Debug the initial dataset import and feature/target set creation.
debug_preprocess_tweets = False
# Debug create_training_and_test_set() function.
debug_train_test_set_creation = False

################################################################################################################
################################################################################################################

# Import the datasets.
tweet_dataset_processed1 = \
    pd.read_csv("D:/Dropbox/summer-research-2019/datasets/tbl_kvlinden_LDA_PROCESSED.csv", sep=",")

tweet_dataset_processed2 = \
    pd.read_csv("D:/Dropbox/summer-research-2019/datasets/tbl_training_set_LDA_PROCESSED.csv", sep=",")

tweet_dataset_processed3 = \
    pd.read_csv("D:/Dropbox/summer-research-2019/datasets/dataset_20100101-20180510_tok_LDA_PROCESSED.csv", sep=",")

# Reindex and shuffle the data randomly (np.random is used directly because the pd.np alias is deprecated).
tweet_dataset_processed1 = tweet_dataset_processed1.reindex(
    np.random.permutation(tweet_dataset_processed1.index))

tweet_dataset_processed2 = tweet_dataset_processed2.reindex(
    np.random.permutation(tweet_dataset_processed2.index))

tweet_dataset_processed3 = tweet_dataset_processed3.reindex(
    np.random.permutation(tweet_dataset_processed3.index))

# Generate a Pandas dataframe.
tweet_dataframe_processed1 = pd.DataFrame(tweet_dataset_processed1)
tweet_dataframe_processed2 = pd.DataFrame(tweet_dataset_processed2)
tweet_dataframe_processed3 = pd.DataFrame(tweet_dataset_processed3)

if debug_preprocess_tweets:
    # Print shape and column names.
    log.debug("\n")
    log.debug("The shape of our SLO dataframe 1:")
    log.debug(tweet_dataframe_processed1.shape)
    log.debug("\n")
    log.debug("The columns of our SLO dataframe 1:")
    log.debug(tweet_dataframe_processed1.head())
    log.debug("\n")
    # Print shape and column names.
    log.debug("\n")
    log.debug("The shape of our SLO dataframe 2:")
    log.debug(tweet_dataframe_processed2.shape)
    log.debug("\n")
    log.debug("The columns of our SLO dataframe 2:")
    log.debug(tweet_dataframe_processed2.head())
    log.debug("\n")
    # Print shape and column names.
    log.debug("\n")
    log.debug("The shape of our SLO dataframe 3:")
    log.debug(tweet_dataframe_processed3.shape)
    log.debug("\n")
    log.debug("The columns of our SLO dataframe 3:")
    log.debug(tweet_dataframe_processed3.head())
    log.debug("\n")

# Rename column in 3rd dataframe for concatenation purposes.
tweet_dataframe_processed3.columns = ['Tweet']

# Drop any NaN or empty Tweet rows in 3rd dataframe (or else CountVectorizer will blow up).
tweet_dataframe_processed3 = tweet_dataframe_processed3.dropna()

# Concatenate the individual datasets together.
frames = [tweet_dataframe_processed1, tweet_dataframe_processed2, tweet_dataframe_processed3]
slo_dataframe_combined = pd.concat(frames, ignore_index=True)

if debug_preprocess_tweets:
    # Print shape and column names.
    log.debug("\n")
    log.debug("The shape of our SLO dataframe combined:")
    log.debug(slo_dataframe_combined.shape)
    log.debug("\n")
    log.debug("The columns of our SLO dataframe combined:")
    log.debug(slo_dataframe_combined.head())
    log.debug("\n")

# Reindex everything.
slo_dataframe_combined.index = pd.RangeIndex(len(slo_dataframe_combined.index))
# slo_dataframe_combined.index = range(len(slo_dataframe_combined.index))

# Assign column names.
tweet_dataframe_processed_column_names = ['Tweet']

# Create input features.
selected_features = slo_dataframe_combined[tweet_dataframe_processed_column_names]
processed_features = selected_features.copy()

if debug_preprocess_tweets:
    # Check what we are using as inputs.
    log.debug("\n")
    log.debug("The Tweets in our input feature:")
    log.debug(processed_features['Tweet'])
    log.debug("\n")

# Create feature set.
slo_feature_set = processed_features['Tweet']

### Exhaustive grid search for Scikit-Learn LDA:

In [9]:
def latent_dirichlet_allocation_grid_search():
    """
    Function performs exhaustive grid search for LDA.

    :return: None.
    """
    from sklearn.decomposition import LatentDirichletAllocation

    # Construct the pipeline.
    latent_dirichlet_allocation_clf = Pipeline([
        ('vect', CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')),
        ('clf', LatentDirichletAllocation()),
    ])

    from sklearn.model_selection import GridSearchCV

    # What parameters do we search for?
    parameters = {
        'vect__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
        'clf__n_components': [1, 5, 10, 15],
        'clf__doc_topic_prior': [None],
        'clf__topic_word_prior': [None],
        'clf__learning_method': ['batch', 'online'],
        'clf__learning_decay': [0.5, 0.7, 0.9],
        'clf__learning_offset': [5, 10, 15],
        'clf__max_iter': [5, 10, 15],
        'clf__batch_size': [64, 128, 256],
        'clf__evaluate_every': [0],
        'clf__total_samples': [1e4, 1e6, 1e8],
        'clf__perp_tol': [1e-1, 1e-2, 1e-3],
        'clf__mean_change_tol': [1e-1, 1e-3, 1e-5],
        'clf__max_doc_update_iter': [50, 100, 150],
        'clf__n_jobs': [-1],
        'clf__verbose': [0],
        'clf__random_state': [None],
    }

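    # Perform the grid search.
    # Note: the `iid` argument was deprecated in scikit-learn 0.22 and removed in 0.24; omit it on newer releases.
    latent_dirichlet_allocation_clf = GridSearchCV(latent_dirichlet_allocation_clf, parameters, cv=5, iid=False,
                                                   n_jobs=-1)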
    latent_dirichlet_allocation_clf.fit(slo_feature_set)

    if debug_pipeline:
        # View all the information stored in the model after training it.
        classifier_results = pd.DataFrame(latent_dirichlet_allocation_clf.cv_results_)
        log.debug("The shape of the Latent Dirichlet Allocation model's result data structure is:")
        log.debug(classifier_results.shape)
        log.debug(
            "The contents of the Latent Dirichlet Allocation model's result data structure is:")
        log.debug(classifier_results.head())

    # Display the optimal parameters.
    log.debug("The optimal parameters found for the Latent Dirichlet Allocation is:")
    for param_name in sorted(parameters.keys()):
        log.debug("%s: %r" % (param_name, latent_dirichlet_allocation_clf.best_params_[param_name]))
    log.debug("\n")
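
Before launching the search, it helps to size the grid. A minimal sketch that only multiplies the lengths of the value lists in the `parameters` dictionary above (the dictionary is copied here so the snippet runs on its own):

```python
# Candidate values per hyperparameter, copied from the grid defined above.
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
    'clf__n_components': [1, 5, 10, 15],
    'clf__doc_topic_prior': [None],
    'clf__topic_word_prior': [None],
    'clf__learning_method': ['batch', 'online'],
    'clf__learning_decay': [0.5, 0.7, 0.9],
    'clf__learning_offset': [5, 10, 15],
    'clf__max_iter': [5, 10, 15],
    'clf__batch_size': [64, 128, 256],
    'clf__evaluate_every': [0],
    'clf__total_samples': [1e4, 1e6, 1e8],
    'clf__perp_tol': [1e-1, 1e-2, 1e-3],
    'clf__mean_change_tol': [1e-1, 1e-3, 1e-5],
    'clf__max_doc_update_iter': [50, 100, 150],
    'clf__n_jobs': [-1],
    'clf__verbose': [0],
    'clf__random_state': [None],
}

# An exhaustive grid tries every combination, so the sizes multiply.
n_candidates = 1
for values in parameters.values():
    n_candidates *= len(values)

cv_folds = 5  # matches cv=5 in the GridSearchCV call
print("candidate settings:", n_candidates)
print("model fits:", n_candidates * cv_folds)
```

With these lists the grid has 209,952 candidate settings, which means 1,049,760 model fits at `cv=5` (plus one final refit). On a large Tweet corpus that is likely impractical, so trimming the value lists or sampling a fixed number of candidates with scikit-learn's `RandomizedSearchCV` may be a more realistic option.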

### Functions that perform the topic extraction:

In [10]:
def latent_dirichlet_allocation_topic_extraction():
    """
    Function performs topic extraction on Tweets using LDA.

    :return: none.
    """
    from sklearn.decomposition import LatentDirichletAllocation

    # LDA is a probabilistic model over word counts, so it needs raw term counts rather than tf-idf weights.
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')
    tf = tf_vectorizer.fit_transform(slo_feature_set)
    # Note: scikit-learn 1.0+ provides get_feature_names_out(); get_feature_names() was removed in 1.2.
    tf_feature_names = tf_vectorizer.get_feature_names()

    # Run LDA.
    # n_components replaces the deprecated n_topics argument (and matches the grid search above).
    lda = LatentDirichletAllocation(n_components=20, max_iter=5, learning_method='online', learning_offset=50.,
                                    random_state=0).fit(tf)

    # Display the top words for each topic.
    display_topics(lda, tf_feature_names, 10)


################################################################################################################

def display_topics(model, feature_names, num_top_words):
    """
    Helper function to display the top words for each topic in the LDA model.

    :param model: the LDA model
    :param feature_names: feature names from CountVectorizer.
    :param num_top_words: # of words to display for each topic.
    :return: none.
    """
    for topic_idx, topic in enumerate(model.components_):
        log.debug("Topic %d:" % (topic_idx))
        log.debug(" ".join([feature_names[i]
                            for i in topic.argsort()[:-num_top_words - 1:-1]]))
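
The slice `[:-num_top_words - 1:-1]` in `display_topics` is terse, so here is a pure-Python sketch of the same selection on a toy topic. The weights, vocabulary and the helper name `top_word_indices` are hypothetical. In scikit-learn, each row of `components_` holds unnormalized word weights for one topic.

```python
def top_word_indices(weights, num_top_words):
    """Pure-Python equivalent of weights.argsort()[:-num_top_words - 1:-1]."""
    # Indices ordered from smallest to largest weight (what argsort returns).
    ascending = sorted(range(len(weights)), key=lambda i: weights[i])
    # Walk backwards from the end and stop after num_top_words entries.
    return ascending[:-num_top_words - 1:-1]


feature_names = ["cats", "dogs", "ai", "loyal", "evil"]  # hypothetical vocabulary
topic = [3.0, 4.0, 0.1, 2.0, 1.0]                        # hypothetical unnormalized weights

# Normalizing the row turns the weights into the topic's word distribution (a row of beta).
total = sum(topic)
word_probabilities = [w / total for w in topic]

top = top_word_indices(topic, 3)
print(" ".join(feature_names[i] for i in top))  # prints "dogs cats loyal"
```

Sorting ascending and reading the tail in reverse returns the largest weights first, which is why the first word printed for each topic is its most heavily weighted one.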

### Helper function that pre-processes the Tweet text:

In [11]:
def preprocess_tweet_text(tweet_text):
    """
    Helper function performs text pre-processing using regular expressions and other Python functions.

    Notes:

    TODO - shrink character elongations
    TODO - remove non-english tweets
    TODO - remove non-company associated tweets
    TODO - remove year and time.
    TODO - remove cash items?

    Resources Used:

    https://thispointer.com/python-how-to-convert-a-list-to-string/
    http://jonathansoma.com/lede/foundations/classes/pandas%20columns%20and%20functions/apply-a-function-to-every-row-in-a-pandas-dataframe/

    :return: the processed Tweet.
    """

    # # Remove "RT" tags.
    # preprocessed_tweet_text = re.sub("rt", "", tweet_text)
    #
    # # Remove URL's.
    # preprocessed_tweet_text = re.sub("http[s]?://\S+", "slo_url", preprocessed_tweet_text)
    #
    # # Remove Tweet mentions.
    # preprocessed_tweet_text = re.sub("@\S+", "slo_mention", preprocessed_tweet_text)
    #
    # # Remove Tweet hashtags.
    # preprocessed_tweet_text = re.sub("#\S+", "slo_hashtag", preprocessed_tweet_text)
    #
    # # Remove all punctuation.
    # preprocessed_tweet_text = preprocessed_tweet_text.translate(str.maketrans('', '', string.punctuation))

    # Remove irrelevant words from Tweets.
    delete_list = ["slo_url", "slo_mention", "word_n", "slo_year", "slo_cash", "woodside", "auspol", "adani",
                   "stopadani",
                   "ausbiz", "santos", "whitehaven", "tinto", "fortescue", "bhp", "adelaide", "billiton", "csg",
                   "nswpol",
                   "nsw", "lng", "don", "rio", "pilliga", "australia", "asx", "just", "today", "great", "says", "like",
                   "big", "better", "rite", "would", "SCREEN_NAME", "mining", "former", "qldpod", "qldpol", "qld", "wr",
                   "melbourne", "andrew", "fuck", "spadani", "greg", "th", "australians", "http", "https", "rt",
                   "goadani",
                   "co", "amp", "riotinto", "carmichael", "abbot", "bill shorten",
                   "slourl", "slomention", "slohashtag", "sloyear", "slocash"]

    # Convert series to string.
    tweet_string = str(tweet_text)

    if debug_preprocess_tweets:
        log.debug("Tweet text as string:")
        log.debug(tweet_string)
        log.debug('\n')

    # Split Tweet into individual words.
    words = tweet_string.split()

    if debug_preprocess_tweets:
        log.debug("Tweet text as list:")
        log.debug(words)
        log.debug('\n')

    # Check to see if a word is irrelevant or not.
    words_relevant = []
    for w in words:
        if w not in delete_list:
            words_relevant.append(w)
        else:
            if debug_preprocess_tweets:
                log.debug("Irrelevant word found: ")
                log.debug(w)
                log.debug('\n')

    if debug_preprocess_tweets:
        log.debug("List of relevant words in Tweet: ")
        log.debug(words_relevant)
        log.debug('\n')

    # Convert list back into original Tweet text minus irrelevant words.
    tweet_string = ' '.join(words_relevant)
    # Convert back to a series object.
    tweet_series = pd.Series(tweet_string)

    if debug_preprocess_tweets:
        log.debug("Tweet text with irrelevant words removed: ")
        log.debug(tweet_series)
        log.debug('\n')

    return tweet_series
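
The core of `preprocess_tweet_text` is an exact, token-level membership test. The sketch below isolates that step. `remove_irrelevant_words` and the shortened delete list are hypothetical and only illustrate the behavior, using a set for faster lookups.

```python
def remove_irrelevant_words(tweet_text, delete_list):
    """Drop every whitespace-delimited token that appears in delete_list."""
    blocked = set(delete_list)  # set membership is O(1) per token
    return " ".join(word for word in str(tweet_text).split() if word not in blocked)


# A short, hypothetical subset of the delete list, for illustration only.
sample_delete_list = ["slo_url", "adani", "rt", "bill shorten", "SCREEN_NAME"]

print(remove_irrelevant_words("rt adani coal mine slo_url", sample_delete_list))
print(remove_irrelevant_words("bill shorten backs coal", sample_delete_list))
print(remove_irrelevant_words("Adani approval", sample_delete_list))
```

Two consequences of matching whole tokens are worth keeping in mind. A multi-word entry such as "bill shorten" can never equal a single token after `split()`, so it is never removed. The comparison is also case-sensitive, so "Adani" survives unless the text is lower-cased first.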

### Function that pre-processes the Tweet dataset for LDA topic extraction:

In [12]:
def tweet_dataset_preprocessor(input_file_path, output_file_path, column_name):
    """
    Function pre-processes the specified dataset in preparation for LDA topic extraction.

    :param input_file_path: relative filepath from project root directory for location of dataset to process.
    :param output_file_path: relative filepath from project root directory for location to save .csv file.
    :param column_name: name of the column in the dataset that we are pre-processing.
    :return: Nothing. Saves to CSV file.
    """

    # Import the dataset.
    slo_dataset_cmu = \
        pd.read_csv(str(input_file_path), sep=",")

    # Shuffle the data randomly (np.random is used directly because the pd.np alias is deprecated).
    slo_dataset_cmu = slo_dataset_cmu.reindex(
        np.random.permutation(slo_dataset_cmu.index))

    # Generate a Pandas dataframe.
    slo_dataframe_cmu = pd.DataFrame(slo_dataset_cmu[str(column_name)])

    if debug_preprocess_tweets:
        # Print shape and column names.
        log.debug("\n")
        log.debug("The shape of our SLO CMU dataframe:")
        log.debug(slo_dataframe_cmu.shape)
        log.debug("\n")
        log.debug("The columns of our SLO CMU dataframe:")
        log.debug(slo_dataframe_cmu.head())
        log.debug("\n")

    #######################################################

    # # Down-case all text.
    # slo_dataframe_cmu['tweet_t'] = slo_dataframe_cmu['tweet_t'].str.lower()

    # Pre-process each tweet individually.
    slo_dataframe_cmu[str(column_name)] = slo_dataframe_cmu[str(column_name)].apply(preprocess_tweet_text)

    # Reindex everything.
    slo_dataframe_cmu.index = pd.RangeIndex(len(slo_dataframe_cmu.index))
    # slo_dataframe_combined.index = range(len(slo_dataframe_combined.index))

    # Save to CSV file.
    slo_dataframe_cmu.to_csv(str(output_file_path), sep=',',
                             encoding='utf-8', index=False)

### Main function that executes the program:

In [13]:
"""
Main function.  Execute the program.
"""
if __name__ == '__main__':

    start_time = time.time()
    ################################################
    """
    Perform the Tweet preprocessing.
    """
    # tweet_dataset_preprocessor("datasets/dataset_20100101-20180510_tok_PROCESSED.csv",
    #                            "datasets/dataset_20100101-20180510_tok_LDA_PROCESSED.csv", "tweet_t")
    # tweet_dataset_preprocessor("datasets/tbl_kvlinden_PROCESSED.csv",
    #                            "datasets/tbl_kvlinden_LDA_PROCESSED.csv", "Tweet")
    # tweet_dataset_preprocessor("datasets/tbl_training_set_PROCESSED.csv",
    #                            "datasets/tbl_training_set_LDA_PROCESSED.csv", "Tweet")
    """
    Perform exhaustive grid search.
    """
    # latent_dirichlet_allocation_grid_search()
    """
    Perform the topic extraction.
    """
    latent_dirichlet_allocation_topic_extraction()
    ################################################
    end_time = time.time()

    if debug_pipeline:
        log.debug("The time taken to perform the operation is: ")
        total_time = end_time - start_time
        log.debug(str(total_time))
        log.debug("\n")

DEBUG:root:Topic 0:
DEBUG:root:time pay news foescue good iron ore join national times
DEBUG:root:Topic 1:
DEBUG:root:adanis coal cou point risk land barnaby joyce deal huge
DEBUG:root:Topic 2:
DEBUG:root:govt money india coal state approval banks needs free giving
DEBUG:root:Topic 3:
DEBUG:root:coal new labor mines build taxpayers native come title board
DEBUG:root:Topic 4:
DEBUG:root:turnbull funding fund basin galilee pm council tell carbon townsville
DEBUG:root:Topic 5:
DEBUG:root:reef company barrier environmental coal oil cut prices profit corporate
DEBUG:root:Topic 6:
DEBUG:root:queensland future coalmine business coal renewables investment palaszczuk workers noh
DEBUG:root:Topic 7:
DEBUG:root:want energy rail line help coal vote clean plan global
DEBUG:root:Topic 8:
DEBUG:root:climate change does ceo coal community thing politicians debt subsidies
DEBUG:root:Topic 9:
DEBUG:root:tax lnp greens paid election alp think public labor loan
DEBUG:root:Topic 10:
DEBUG:root:world doesnt

##### The above shows numerically indexed topics and the top words associated with each topic.
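
Besides listing top words per topic, a fitted model can also map each document to its topic distribution (θ). A minimal sketch, assuming a scikit-learn version that accepts `n_components`. The four toy documents are hypothetical stand-ins for `slo_feature_set`.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents; the real input would be slo_feature_set.
documents = [
    "coal mine approval queensland",
    "coal mine rail line funding",
    "reef climate change energy",
    "climate change renewables energy",
]

# Raw term counts, as LDA expects.
vectorizer = CountVectorizer()
term_counts = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
# Each row is one document's topic distribution; each row sums to 1.
doc_topic = lda.fit_transform(term_counts)

for text, distribution in zip(documents, doc_topic):
    dominant_topic = int(distribution.argmax())
    print(dominant_topic, distribution.round(2).tolist(), text)
```

The topic indices are arbitrary labels, so which index ends up describing which theme depends on the data and the random seed.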

### Why does it work poorly on Tweets?

<span style="font-family:Papyrus; font-size:1.25em;">
    
##### Based on Derek Fisher's senior project presentation:

1) LDA typically works best when the documents are long (high word count) and written in a formal style.

2) Tweet text is generally very short, with a maximum of around 280 characters.

3) Tweet text is generally written in a very informal style, with:

    i) emojis.
    ii) spelling errors.
    iii) other grammatical errors.
    iv) etc.

4) Together, these factors make it difficult for the LDA algorithm to discover any prominent underlying hidden structure.

</span>
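
One rough way to see the effect of document length: LDA learns topics from words that appear together in the same document, and a short document supplies very few such pairs. The sketch below only counts distinct word pairs per document. It is not part of the LDA algorithm, and both texts are hypothetical.

```python
def cooccurring_pairs(text):
    """Count the distinct unordered word pairs that co-occur in one document."""
    n = len(set(text.lower().split()))
    return n * (n - 1) // 2


tweet = "adani coal mine approval is a huge risk for the reef"  # hypothetical tweet
# Stand-in for a longer document containing 300 distinct words.
article = " ".join(f"word{i}" for i in range(300))

print("pairs in the tweet:  ", cooccurring_pairs(tweet))
print("pairs in the article:", cooccurring_pairs(article))
```

The number of pairs grows roughly with the square of the number of distinct words, so a single long article carries far more co-occurrence evidence than a single Tweet.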