# Hierarchical Latent Dirichlet Allocation Topic Model Implementation on SLO Twitter Dataset

### Joseph Jinn and Keith VanderLinden

Our HLDA topic model uses a third-party library, `hlda`, whose Gibbs sampler is ported from the Java-based MALLET machine learning toolkit.

### Import libraries and set parameters:

We import the requisite libraries and custom utility functions, and set the parameters for the imported libraries.

In [None]:
# Import libraries.
import logging as log
import time
import warnings

import pandas as pd
import seaborn as sns
import spacy
from matplotlib import pyplot as plt
from wordcloud import WordCloud

#############################################################

# Pandas options.
pd.options.display.max_rows = None
pd.options.display.max_columns = None
pd.options.display.width = None
pd.options.display.max_colwidth = 1000
# Pandas float precision display.
pd.set_option('display.precision', 12)
# Seaborn setting.
sns.set()
# Don't output these types of warnings to terminal.
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
# Matplotlib log settings.
mylog = log.getLogger("matplotlib")
mylog.setLevel(log.INFO)

"""
Turn debug log statements for various sections of code on/off.
(adjust log level as necessary)
"""
log.basicConfig(level=log.INFO)
log.disable(level=log.DEBUG)

### Pre-process and Post-process Tweets:


    
We preprocess our Twitter dataset as follows:<br>

1) Downcase all text.<br>
2) Check that there is text; otherwise convert to an empty string.<br>
3) Convert HTML character entities to Unicode characters.<br>
4) Remove "RT" tags.<br>
5) Remove concatenated URLs.<br>
6) Normalize whitespace by converting each run of whitespace characters to a single space.<br>
7) Replace URLs with "slo_url".<br>
8) Replace Tweet mentions with "slo_mention".<br>
9) Replace Tweet stock symbols with "slo_stock".<br>
10) Replace Tweet hashtags with "slo_hash".<br>
11) Replace Tweet cashtags with "slo_cash".<br>
12) Replace Tweet years with "slo_year".<br>
13) Replace Tweet times with "slo_time".<br>
14) Remove character elongations.<br>
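
A few of the steps above can be sketched with the standard library. This is only an illustration, not the code in `topic_extraction_utility_functions.py`: the function name `preprocess_tweet_sketch` and every regular expression are assumptions, only a subset of the steps is covered, and the placeholder substitutions are applied before the "RT" removal so that "rt" inside a URL or mention is left alone.

```python
import html
import re


def preprocess_tweet_sketch(text):
    """Illustrative subset of the preprocessing steps (hypothetical helper)."""
    # Step 2: treat missing or non-string values as an empty string.
    if not isinstance(text, str):
        return ""
    # Step 1: downcase all text.
    text = text.lower()
    # Step 3: convert HTML character entities (e.g. "&amp;") to Unicode.
    text = html.unescape(text)
    # Steps 7, 8, 10: replace URLs, mentions, and hashtags with placeholders.
    text = re.sub(r"https?://\S+", "slo_url", text)
    text = re.sub(r"@\w+", "slo_mention", text)
    text = re.sub(r"#\w+", "slo_hash", text)
    # Step 4: remove stand-alone "rt" tags.
    text = re.sub(r"\brt\b", " ", text)
    # Step 14 (assumed rule): collapse 3+ repeats of a letter to 2.
    text = re.sub(r"([a-z])\1{2,}", r"\1\1", text)
    # Step 6: collapse runs of whitespace to a single space.
    return re.sub(r"\s+", " ", text).strip()


print(preprocess_tweet_sketch(
    "RT @User: Cooool news!! https://t.co/abc #Coal &amp; jobs"))
# -> slo_mention: cool news!! slo_url slo_hash & jobs
```

Punctuation survives this stage; it is removed during postprocessing.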

We postprocess our Twitter dataset as follows:<br>

1) Remove the irrelevant words specified in the list below:<br>

    delete_list = ["word_n", "auspol", "ausbiz", "tinto", "adelaide", "csg", "nswpol",
                   "nsw", "lng", "don", "rio", "pilliga", "australia", "asx", "just", "today", "great", "says", "like",
                   "big", "better", "rite", "would", "SCREEN_NAME", "mining", "former", "qldpod", "qldpol", "qld", "wr",
                   "melbourne", "andrew", "fuck", "spadani", "greg", "th", "australians", "http", "https", "rt",
                   "co", "amp", "carmichael", "abbot", "bill shorten",
                   "slo_url", "slo_mention", "slo_hash", "slo_year", "slo_time", "slo_cash", "slo_stock",
                   "adani", "bhp", "cuesta", "fotescue", "riotinto", "newmontmining", "santos", "oilsearch",
                   "woodside", "ilukaresources", "whitehavencoal",
                   "stopadani", "goadani", "bhpbilliton", "billiton", "cuestacoal", "cuests coal", "cqc",
                   "fortescuenews", "fortescue metals", "rio tinto", "newmont", "newmont mining", "santosltd",
                   "oilsearchltd", "oil search", "woodsideenergy", "woodside petroleum", "woodside energy",
                   "iluka", "iluka resources", "whitehaven", "whitehaven coal"]

2) Remove all punctuation from the Tweet text.<br>
3) Remove all English stop words from the Tweet text.<br>
4) Lemmatize the words in the Tweet text.<br>
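
A rough sketch of postprocessing steps 1 to 3 follows. It is not the project's implementation: `DELETE_WORDS` and `STOP_WORDS` are tiny hypothetical stand-ins for the full delete list above and for a real English stop-word list, multi-word entries such as "rio tinto" are not handled, and lemmatization (step 4) is omitted because it needs an NLP library.

```python
import string

# Hypothetical, heavily shortened stand-ins for the real word lists.
DELETE_WORDS = {"adani", "auspol", "slo_url", "slo_mention", "slo_hash"}
STOP_WORDS = {"the", "a", "is", "for", "of", "and", "to"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def postprocess_tweet_sketch(text):
    """Drop delete-list words, punctuation, and stop words (hypothetical helper)."""
    kept = []
    for token in text.split():
        # Check the delete list before stripping punctuation, so that
        # placeholders such as "slo_url" still contain their underscore.
        if token.strip(string.punctuation) in DELETE_WORDS:
            continue
        token = token.translate(_PUNCT_TABLE)
        if token and token not in STOP_WORDS:
            kept.append(token)
    return " ".join(kept)


print(postprocess_tweet_sketch("slo_mention: the adani mine is bad for the reef!"))
# -> mine bad reef
```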



In [None]:
# The utility module must be on the Python path (see the repository link below).
import topic_extraction_utility_functions as lda_util

# Preprocess the Tweet text column of our Twitter dataset.
lda_util.tweet_dataset_preprocessor(
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-with-irrelevant-tweets-excluded.csv",
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-topic-extraction-ready-tweet-text-with-hashtags-excluded-created-7-29-19.csv",
    "text_derived")

# Preprocess the user description column of our Twitter dataset.
lda_util.tweet_dataset_preprocessor(
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-with-irrelevant-tweets-excluded.csv",
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-topic-extraction-ready-user-description-text-with-hashtags-excluded-created-7-29-19.csv",
    "user_description")



The first parameter in our function call specifies the file path to the dataset to be preprocessed. The second parameter specifies where to save the output CSV file. The third parameter specifies the name of the column in the dataset that contains the original text (Tweet text or user description).<br>

Tweet preprocessing is done via a custom library imported as "lda_util" from "topic_extraction_utility_functions.py".<br>

See the link below for the utility functions used above for data preprocessing and below for LDA topic extraction:<br>

https://github.com/Calvin-CS/slo-classifiers/blob/master/topic/models/topic_extraction_utility_functions.py



### Import and prepare the preprocessed dataset for use in HLDA topic extraction:

We load the preprocessed CSV file into a Pandas dataframe, isolate the column of interest, and build a dictionary of words (the vocabulary) and a corpus of documents. Please refer to the code comments for details on each step.

In [None]:
# Import the dataset (absolute path).
tweet_dataset_processed = \
    pd.read_csv("D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
                "twitter-dataset-7-10-19-topic-extraction-ready-tweet-text-with-hashtags-excluded"
                "-created-7-29-19-tokenized.csv", sep=",")

# # Import the dataset (test/debug).
# tweet_dataset_processed = \
#     pd.read_csv("D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
#                 "twitter-dataset-7-10-19-topic-extraction-ready-tweet-text-with-hashtags-excluded"
#                 "-created-7-30-19-test.csv", sep=",")

# Shuffle the rows randomly (pd.np was removed from recent pandas releases).
tweet_dataset_processed = tweet_dataset_processed.sample(frac=1)

# Generate a Pandas dataframe.
tweet_text_dataframe = pd.DataFrame(tweet_dataset_processed)

# Print shape and column names.
log.info(f"\nThe shape of the Tweet text dataframe:")
log.info(f"{tweet_text_dataframe.shape}\n")
log.info(f"\nThe columns of the Tweet text dataframe:")
log.info(f"{tweet_text_dataframe.columns}\n")

# Drop any NaN or empty Tweet rows in dataframe (or else tweet.split() below will fail on NaN values).
tweet_text_dataframe = tweet_text_dataframe.dropna()

# Print shape and column names.
log.info(f"\nThe shape of the Tweet text dataframe with NaN (empty) rows dropped:")
log.info(f"{tweet_text_dataframe.shape}\n")
log.info(f"\nThe columns of the Tweet text dataframe with NaN (empty) rows dropped:")
log.info(f"{tweet_text_dataframe.columns}\n")

# Reindex everything.
tweet_text_dataframe.index = pd.RangeIndex(len(tweet_text_dataframe.index))

# Assign column names.
tweet_text_dataframe_column_names = ['text_derived', 'text_derived_preprocessed', 'text_derived_postprocessed']

# Rename column in dataframe.
tweet_text_dataframe.columns = tweet_text_dataframe_column_names

# Create input feature.
selected_features = tweet_text_dataframe[['text_derived_postprocessed']]
processed_features = selected_features.copy()

# Check what we are using as inputs.
log.info(f"\nA sample Tweet in our input feature:")
log.info(f"{processed_features['text_derived_postprocessed'][0]}\n")

# Create feature set.
slo_feature_series = processed_features['text_derived_postprocessed']
slo_feature_series = pd.Series(slo_feature_series)
slo_feature_list = slo_feature_series.tolist()

#############################################################

corpus = []
dictionary = set()
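# The 'en' shortcut no longer exists in spaCy 3; load the model by its full name
# (install it with: python -m spacy download en_core_web_sm).
# Note: nlp is not used in the rest of this notebook.
nlp = spacy.load('en_core_web_sm', disable=["parser", "tagger", "ner"])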

# Create the corpus of documents and dictionary of words (vocabulary)
for tweet in slo_feature_list:
    # Tokenize each Tweet (document) and add to List of documents in the corpus.
    corpus.append(tweet.split())
    # Tokenize each Tweet (document) and add individual words to the dictionary of words (vocabulary).
    dictionary.update(tweet.split())

# Attach indices to each word to represent their position in the dictionary of words (vocabulary).
dictionary = sorted(dictionary)
vocab_index = {w: i for i, w in enumerate(dictionary)}

print(f"\nThe number of documents: {len(slo_feature_list)}")
print(f"\nThe number of words in the dictionary: {len(dictionary)}")
print(f"Sample of the words in the dictionary:\n {dictionary[0:100]}")
print(f"\nThe number of documents in the corpus: {len(corpus)}")
print(f"Sample of the documents in the corpus:\n {corpus[0:5]}")

# Visualize the dictionary of words.
wordcloud = WordCloud(background_color='white').generate(' '.join(slo_feature_list))
plt.figure(figsize=(12, 12))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

print("\nLength of the dictionary, corpus, document 0 in corpus, document 1 in corpus (in that order)")
print(len(dictionary), len(corpus), len(corpus[0]), len(corpus[1]))

"""
Modify the corpus of documents to store the index value of each word from the dictionary (vocabulary)
rather than the words themselves.
"""
new_corpus = [[vocab_index[word] for word in document] for document in corpus]

print("\nLength of the dictionary and corpus (as word dictionary index values (in that order))")
print(len(dictionary), len(new_corpus))

print("\nDocument 0 in the corpus as tokenized words:")
print(corpus[0][0:10])
print("Document 0 in the corpus as tokenized word index values from the dictionary:")
print(new_corpus[0][0:10])

print("\nDocument 1 in the corpus as tokenized words:")
print(corpus[1][0:10])
print("Document 1 in the corpus as tokenized word index values from the dictionary:")
print(new_corpus[1][0:10])

print("\nDocument 2 in the corpus as tokenized words:")
print(corpus[2][0:10])
print("Document 2 in the corpus as tokenized word index values from the dictionary:")
print(new_corpus[2][0:10])
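
As a toy illustration of the index encoding above, the same steps can be run on two made-up documents (they are not drawn from the dataset, and the `toy_*` names are hypothetical). Decoding the indices through the sorted vocabulary recovers the original tokens, which is a quick sanity check worth running before training.

```python
# Two made-up documents standing in for postprocessed Tweets.
toy_documents = ["coal job fund", "stop coal project"]

# Tokenize, build the sorted vocabulary, and map each word to its index.
toy_corpus = [doc.split() for doc in toy_documents]
toy_vocab = sorted({word for doc in toy_corpus for word in doc})
toy_index = {word: i for i, word in enumerate(toy_vocab)}

# Encode each document as a list of vocabulary indices, then decode it back.
toy_encoded = [[toy_index[word] for word in doc] for doc in toy_corpus]
toy_decoded = [[toy_vocab[i] for i in doc] for doc in toy_encoded]

print(toy_vocab)    # ['coal', 'fund', 'job', 'project', 'stop']
print(toy_encoded)  # [[0, 2, 1], [4, 0, 3]]
print(toy_decoded == toy_corpus)  # True
```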

### Perform the topic extraction:

This function contains the code specific to the HLDA topic modeling library we utilize.

In [None]:
def hierarchical_latent_dirichlet_allocation_topic_extraction():
    """
    Function performs topic extraction on Tweets using the HierarchicalLDA sampler from the hlda library.

    :return: None.
    """
    from hlda.sampler import HierarchicalLDA

    # Set parameters.
    n_samples = 500  # no of iterations for the sampler
    alpha = 10.0  # smoothing over level distributions
    gamma = 1.0  # CRP smoothing parameter; number of imaginary customers at next, as yet unused table
    eta = 0.1  # smoothing over topic-word distributions
    num_levels = 3  # the number of levels in the tree
    display_topics = 50  # the number of iterations between printing a brief summary of the topics so far
    n_words = 5  # the number of most probable words to print for each topic after model estimation
    with_weights = False  # whether to print the words with the weights

    # Train the model.
    hlda = HierarchicalLDA(new_corpus, dictionary, alpha=alpha, gamma=gamma, eta=eta, num_levels=num_levels)
    hlda.estimate(n_samples, display_topics=display_topics, n_words=n_words, with_weights=with_weights)
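
The `gamma` parameter above belongs to the Chinese restaurant process (CRP) that hierarchical LDA applies at every node of the topic tree: a new customer joins an existing table with probability proportional to the number of customers already seated there, and opens a new table with probability proportional to `gamma`. The sketch below simulates a single, non-nested CRP to show that behavior. It is a standalone illustration, not the `hlda` library's code, and the function name `crp_table_sizes` is hypothetical.

```python
import random


def crp_table_sizes(n_customers, gamma, seed=0):
    """Seat customers by a Chinese restaurant process; return the table sizes."""
    rng = random.Random(seed)
    tables = []  # tables[k] is the number of customers at table k
    for n_seated in range(n_customers):
        # Existing table k has weight tables[k]; a new table has weight gamma.
        threshold = rng.uniform(0, n_seated + gamma)
        cumulative = 0.0
        for k, size in enumerate(tables):
            cumulative += size
            if threshold < cumulative:
                tables[k] += 1
                break
        else:
            # No existing table was chosen: open a new one.
            tables.append(1)
    return tables


for gamma in (0.1, 1.0, 10.0):
    sizes = crp_table_sizes(1000, gamma)
    print(f"gamma={gamma}: {len(sizes)} tables")
```

Larger values of `gamma` tend to open more tables, which in the hierarchical model corresponds to more branches in the topic tree.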

Here, we call the topic modeling function, which trains the model on our Twitter dataset. We record the time it takes to process the entire dataset and extract topics.

In [None]:
"""
Main function.  Execute the program.
"""
if __name__ == '__main__':
    my_start_time = time.time()
    ################################################
    """
    Perform the topic extraction.
    """
    hierarchical_latent_dirichlet_allocation_topic_extraction()

    ################################################
    my_end_time = time.time()

    time_elapsed_in_seconds = (my_end_time - my_start_time)
    time_elapsed_in_minutes = (my_end_time - my_start_time) / 60.0
    time_elapsed_in_hours = (my_end_time - my_start_time) / 60.0 / 60.0
    print(f"Time taken to process dataset: {time_elapsed_in_seconds} seconds, "
          f"{time_elapsed_in_minutes} minutes, {time_elapsed_in_hours} hours.")

### Topic Extraction Results on Twitter Dataset Tweet Text:

The execution run with a recursive depth of 1 completes, but it is slow (the run logged below reports about 17 hours) and provides just the results below.

"""
(run 1)

....................................................................................................100
topic=0 level=0 (documents=653900): coal, $, job, 's, stop, project, want, í, tax, fund, 

....................................................................................................200
topic=0 level=0 (documents=653900): coal, $, job, 's, stop, project, want, í, tax, fund, 

....................................................................................................300
topic=0 level=0 (documents=653900): coal, $, job, 's, stop, project, want, í, tax, fund, 

....................................................................................................400
topic=0 level=0 (documents=653900): coal, $, job, 's, stop, project, want, í, tax, fund, 

....................................................................................................500
topic=0 level=0 (documents=653900): coal, $, job, 's, stop, project, want, í, tax, fund, 



Time taken to process dataset: 61687.211238861084 seconds, 1028.1201873143514 minutes, 17.13533645523919 hours.


Process finished with exit code 0

"""

Execution runs with a recursive depth of 2 or 3 will not complete successfully. These will require utilizing the Borg supercomputer with its much larger RAM capacity.

The library for this topic model is extremely memory intensive.  This makes sense as it is a recursive algorithm that creates a branching tree-like hierarchy of topics from the root.

## Resources Used:

- https://pypi.org/project/hlda/
    - The Python library we utilize for HLDA.