# Hierarchical Dirichlet Process Topic Model Implementation on SLO Twitter Dataset

### Joseph Jinn and Keith VanderLinden

We utilize the Gensim Hierarchical Dirichlet Process Model.

### Import libraries and set parameters:

We import the required libraries and custom utility functions, and set the display and logging options for the imported libraries.

In [None]:
# Import libraries.
import logging as log
import time
import warnings

import pandas as pd
import seaborn as sns
from gensim import corpora
from sklearn.feature_extraction.text import CountVectorizer

#############################################################

# Pandas options.
pd.options.display.max_rows = None
pd.options.display.max_columns = None
pd.options.display.width = None
pd.options.display.max_colwidth = 1000
# Pandas float precision display.
pd.set_option('display.precision', 12)
# Seaborn setting.
sns.set()
# Don't output these types of warnings to terminal.
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
# Matplotlib log settings.
mylog = log.getLogger("matplotlib")
mylog.setLevel(log.INFO)

"""
Turn debug log statements for various sections of code on/off.
(adjust log level as necessary)
"""
log.basicConfig(level=log.INFO)
log.disable(level=log.DEBUG)

### Pre-process and Post-process Tweets:


    
We preprocess our Twitter dataset as follows:<br>

1) Lowercase all text.<br>
2) Check that there is text; otherwise, convert to an empty string.<br>
3) Convert HTML character references to Unicode characters.<br>
4) Remove "RT" tags.<br>
5) Remove concatenated URLs.<br>
6) Convert each run of whitespace characters to a single space.<br>
7) Replace URLs with "slo_url".<br>
8) Replace Tweet mentions with "slo_mention".<br>
9) Replace Tweet stock symbols with "slo_stock".<br>
10) Replace Tweet hashtags with "slo_hash".<br>
11) Replace Tweet cashtags with "slo_cash".<br>
12) Replace years in the Tweet with "slo_year".<br>
13) Replace times in the Tweet with "slo_time".<br>
14) Remove character elongations.<br>
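
A few of the steps above can be sketched with the standard library alone. This is a minimal illustration, not the implementation in `topic_extraction_utility_functions.py`: the function name `preprocess_tweet_sketch` and the regular expressions are assumptions made for the example, only steps 1, 2, 3, 4, 6, 7, 8, and 14 are covered, and elongations are assumed to be shortened to two repeated letters.

```python
import html
import re


def preprocess_tweet_sketch(text):
    """Illustrative only: apply a subset of the preprocessing steps to one Tweet."""
    # Step 2: treat missing or non-string input as an empty string.
    if not isinstance(text, str):
        return ""
    # Step 1: lowercase.
    text = text.lower()
    # Step 3: convert HTML character references (e.g. "&amp;") to Unicode.
    text = html.unescape(text)
    # Step 4: remove standalone "rt" retweet tags (text is already lowercase).
    text = re.sub(r"\brt\b", " ", text)
    # Step 7: replace URLs with a placeholder token.
    text = re.sub(r"https?://\S+", "slo_url", text)
    # Step 8: replace user mentions with a placeholder token.
    text = re.sub(r"@\w+", "slo_mention", text)
    # Step 14: shorten letter elongations ("coooal" -> "cooal"); assumed behavior.
    text = re.sub(r"([a-z])\1{2,}", r"\1\1", text)
    # Step 6: collapse runs of whitespace into a single space (done last here).
    return re.sub(r"\s+", " ", text).strip()


print(preprocess_tweet_sketch("RT @User: Coooal &amp; gas   https://t.co/abc"))
```

The sample Tweet is invented for the example; the call prints `slo_mention: cooal & gas slo_url`.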

We postprocess our Twitter dataset as follows:<br>

1) Remove the irrelevant words specified in the list below:<br>

    delete_list = ["word_n", "auspol", "ausbiz", "tinto", "adelaide", "csg", "nswpol",
                   "nsw", "lng", "don", "rio", "pilliga", "australia", "asx", "just", "today", "great", "says", "like",
                   "big", "better", "rite", "would", "SCREEN_NAME", "mining", "former", "qldpod", "qldpol", "qld", "wr",
                   "melbourne", "andrew", "fuck", "spadani", "greg", "th", "australians", "http", "https", "rt",
                   "co", "amp", "carmichael", "abbot", "bill shorten",
                   "slo_url", "slo_mention", "slo_hash", "slo_year", "slo_time", "slo_cash", "slo_stock",
                   "adani", "bhp", "cuesta", "fotescue", "riotinto", "newmontmining", "santos", "oilsearch",
                   "woodside", "ilukaresources", "whitehavencoal",
                   "stopadani", "goadani", "bhpbilliton", "billiton", "cuestacoal", "cuests coal", "cqc",
                   "fortescuenews", "fortescue metals", "rio tinto", "newmont", "newmont mining", "santosltd",
                   "oilsearchltd", "oil search", "woodsideenergy", "woodside petroleum", "woodside energy",
                   "iluka", "iluka resources", "whitehaven", "whitehaven coal"]

2) Remove all punctuation from the Tweet text.<br>
3) Remove all English stop words from the Tweet text.<br>
4) Lemmatize the words in the Tweet text.<br>
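
The first three postprocessing steps can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the dataset: `postprocess_tweet_sketch` is a hypothetical name, `DELETE_WORDS` and `STOP_WORDS` are tiny stand-ins for the full lists, underscores are kept so that placeholder tokens survive punctuation removal, and lemmatization is left out because it needs an external library. Multi-word entries such as "rio tinto" would require phrase matching, which this sketch does not attempt.

```python
import string

# Small illustrative subsets; the real lists are much longer.
DELETE_WORDS = {"adani", "slo_url", "slo_mention", "rt", "amp"}
STOP_WORDS = {"the", "a", "is", "to", "of", "and", "for"}

# Translation table that deletes every ASCII punctuation mark except "_".
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation.replace("_", ""))


def postprocess_tweet_sketch(text):
    """Illustrative only: drop punctuation, delete-list words, and stop words."""
    kept = []
    for token in text.split():
        # Step 2: remove punctuation from the token.
        token = token.translate(PUNCTUATION_TABLE)
        # Steps 1 and 3: skip empty tokens, delete-list words, and stop words.
        if token and token not in DELETE_WORDS and token not in STOP_WORDS:
            kept.append(token)
    return " ".join(kept)


print(postprocess_tweet_sketch("slo_mention: the adani loan is a risk for the reef!"))
```

For the invented sample sentence, the call prints `loan risk reef`.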



In [None]:
# The preprocessing function comes from our custom utility module (linked below).
# This import assumes the module is available on the Python path.
import topic_extraction_utility_functions as lda_util

# Preprocess and tokenize the Tweet text column of our Twitter dataset.
lda_util.tweet_dataset_preprocessor(
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-with-irrelevant-tweets-excluded.csv",
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-topic-extraction-ready-tweet-text-with-hashtags-excluded-created-7-29-19.csv",
    "text_derived")

# Preprocess and tokenize the user description column of our Twitter dataset.
lda_util.tweet_dataset_preprocessor(
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-with-irrelevant-tweets-excluded.csv",
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-topic-extraction-ready-user-description-text-with-hashtags-excluded-created-7-29-19.csv",
    "user_description")



The first argument in each function call specifies the file path of the dataset to be preprocessed.  The second argument specifies where to save the resulting CSV file.  The third argument specifies the name of the dataset column that contains the original text (the Tweet text or the user description).<br>


Tweet preprocessing is done via a custom library imported as "lda_util" using "topic_extraction_utility_functions.py".<br>

See the following URL for the utility functions used above for data preprocessing and below for topic extraction:<br>

https://github.com/Calvin-CS/slo-classifiers/blob/master/topic/models/topic_extraction_utility_functions.py



### Import and prepare the preprocessed dataset for use in HDP topic extraction:

We load the preprocessed CSV file into a Pandas dataframe, isolate the column of interest, and build a Gensim dictionary of words and a bag-of-words corpus of documents.  Please refer to the code comments for details on each step.

In [None]:
# Import the dataset (absolute path).
tweet_dataset_processed = \
    pd.read_csv("D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
                "twitter-dataset-7-10-19-topic-extraction-ready-tweet-text-with-hashtags-excluded"
                "-created-7-29-19-tokenized.csv", sep=",")

# # Import the dataset (test/debug).
# tweet_dataset_processed = \
#     pd.read_csv("D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
#                 "twitter-dataset-7-10-19-topic-extraction-ready-tweet-text-with-hashtags-excluded"
#                 "-created-7-30-19-test.csv", sep=",")

# Shuffle the rows randomly (the pd.np alias is not available in recent pandas releases).
tweet_dataset_processed = tweet_dataset_processed.sample(frac=1)

# Generate a Pandas dataframe.
tweet_text_dataframe = pd.DataFrame(tweet_dataset_processed)

# Print shape and column names.
log.info(f"\nThe shape of the Tweet text dataframe:")
log.info(f"{tweet_text_dataframe.shape}\n")
log.info(f"\nThe columns of the Tweet text dataframe:")
log.info(f"{tweet_text_dataframe.columns}\n")

# Drop any NaN or empty Tweet rows in dataframe (or else CountVectorizer will blow up).
tweet_text_dataframe = tweet_text_dataframe.dropna()

# Print shape and column names.
log.info(f"\nThe shape of the Tweet text dataframe with NaN (empty) rows dropped:")
log.info(f"{tweet_text_dataframe.shape}\n")
log.info(f"\nThe columns of the Tweet text dataframe with NaN (empty) rows dropped:")
log.info(f"{tweet_text_dataframe.columns}\n")

# Reindex everything.
tweet_text_dataframe.index = pd.RangeIndex(len(tweet_text_dataframe.index))

# Assign column names.
tweet_text_dataframe_column_names = ['text_derived', 'text_derived_preprocessed', 'text_derived_postprocessed']

# Rename column in dataframe.
tweet_text_dataframe.columns = tweet_text_dataframe_column_names

# Create input feature.
selected_features = tweet_text_dataframe[['text_derived_postprocessed']]
processed_features = selected_features.copy()

# Check what we are using as inputs.
log.info(f"\nA sample Tweet in our input feature:")
log.info(f"{processed_features['text_derived_postprocessed'][0]}\n")

# Create feature set.
slo_feature_series = processed_features['text_derived_postprocessed']
slo_feature_series = pd.Series(slo_feature_series)
slo_feature_list = slo_feature_series.tolist()

# Split each Tweet into a list of tokens (one token list per document).
words = [tweet.split() for tweet in slo_feature_list]
log.info(f"\nTokens of the first Tweet:")
log.info(f"{words[0]}\n")

# Create the Gensim dictionary of words.
dictionary = corpora.Dictionary(words)
log.info(f"\nGensim dictionary of tokenized words.")
log.info(f"{dictionary}\n")
log.info(f"\nGensim dictionary of tokenized words with index ID's.")
log.info(f"{dictionary.token2id}\n")

# Create the Gensim bag-of-words corpus: one list of (token id, count) pairs per document.
corpus = [dictionary.doc2bow(tokens, allow_update=True) for tokens in words]
log.info(f"# of documents in corpus: {len(corpus)}\n")
log.info(f"\nSample of Gensim corpus of document-term frequencies.")
log.info(f"{corpus[0:10]}\n")
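
To make the two data structures above concrete, the following pure-Python sketch mimics the shape of a token-to-id dictionary and a bag-of-words corpus. It is an illustration only, not Gensim's implementation: `build_token_ids` and `to_bag_of_words` are hypothetical helpers, the two documents are invented, and the integer ids that Gensim assigns may differ.

```python
from collections import Counter


def build_token_ids(tokenized_docs):
    """Assign each distinct token an integer id in order of first appearance."""
    token_to_id = {}
    for doc in tokenized_docs:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id


def to_bag_of_words(doc, token_to_id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token_to_id[token] for token in doc if token in token_to_id)
    return sorted(counts.items())


docs = [["coal", "job", "coal"], ["gas", "water", "coal"]]
token_ids = build_token_ids(docs)
bow_corpus = [to_bag_of_words(doc, token_ids) for doc in docs]
print(token_ids)   # {'coal': 0, 'job': 1, 'gas': 2, 'water': 3}
print(bow_corpus)  # [[(0, 2), (1, 1)], [(0, 1), (2, 1), (3, 1)]]
```

Each document becomes a sparse list of counts, so word order is discarded and only term frequencies reach the topic model.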

### Perform the topic extraction:

The function below contains the code specific to the Gensim HDP model.

In [None]:
def hierarchical_dirichlet_process_topic_extraction():
    """
    Function performs topic extraction on Tweets using the Gensim HDP model.

    :return: None.
    """
    from gensim.models import HdpModel

    # HDP, like LDA, is a probabilistic model fitted on raw term counts.
    # Note: this term-count matrix is only logged; the HDP model below is trained on the Gensim corpus.
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')
    tf = tf_vectorizer.fit_transform(slo_feature_series)
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (1.0+).
    tf_feature_names = tf_vectorizer.get_feature_names_out()

    log.info("\n.fit_transform - Learn the vocabulary dictionary and return the document-term matrix.")
    log.info(f"{tf}\n")
    log.info("\n.get_feature_names_out - Array mapping from feature integer indices to feature names.")
    log.info(f"{tf_feature_names}\n")

    # Train the HDP model.
    hdp = HdpModel(corpus, dictionary)
    time.sleep(3)

    # # For use as wrapper with Scikit-Learn API.
    # model = HdpTransformer(id2word=dictionary)
    # distribution = model.fit_transform(corpus)

    # Display the top words for each topic.
    topic_info = hdp.print_topics(num_topics=20, num_words=10)

    for topic in topic_info:
        print(topic)
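
The `max_df=0.95` and `min_df=2` settings passed to `CountVectorizer` above prune the vocabulary by document frequency: a float is read as a proportion of documents and an integer as an absolute count. The following pure-Python sketch illustrates that rule on an invented toy corpus; `filter_by_document_frequency` is a hypothetical helper and not part of scikit-learn.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df=2, max_df=0.95):
    """Keep tokens found in at least min_df docs and in at most max_df of all docs."""
    n_docs = len(tokenized_docs)
    # Count each token once per document, however often it repeats there.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    return sorted(
        token for token, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )


docs = [
    ["coal", "job", "loan"],
    ["coal", "gas", "water"],
    ["coal", "job", "reef"],
    ["coal", "gas", "tax"],
]
print(filter_by_document_frequency(docs))  # ['gas', 'job']
```

Here "coal" is dropped because it occurs in every document (more than 95% of them), and the words that occur in only one document fall below `min_df`.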

Here, we call the topic modeling function, which trains the model on our Twitter dataset.  We record the time it takes to process the entire dataset and extract topics.

In [None]:
"""
Main function.  Execute the program.
"""
if __name__ == '__main__':
    my_start_time = time.time()
    ################################################
    """
    Perform the topic extraction.
    """
    hierarchical_dirichlet_process_topic_extraction()

    ################################################
    my_end_time = time.time()

    time_elapsed_in_seconds = (my_end_time - my_start_time)
    time_elapsed_in_minutes = (my_end_time - my_start_time) / 60.0
    time_elapsed_in_hours = (my_end_time - my_start_time) / 60.0 / 60.0
    print(f"Time taken to process dataset: {time_elapsed_in_seconds} seconds, "
          f"{time_elapsed_in_minutes} minutes, {time_elapsed_in_hours} hours.")
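
The elapsed time above is reported as three separate floating-point figures. As an alternative, a small helper can render one hours/minutes/seconds string; `format_elapsed` is a hypothetical name used only for this sketch, and the sample value is the run-time reported for the first run below.

```python
def format_elapsed(seconds):
    """Render a duration given in seconds as 'Hh Mm S.Ss'."""
    minutes, secs = divmod(seconds, 60)
    hours, minutes = divmod(int(minutes), 60)
    return f"{hours}h {minutes}m {secs:.1f}s"


print(format_elapsed(1036.1106204986572))  # 0h 17m 16.1s
```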

### Topic Extraction Results on Twitter Dataset Tweet Text:

First run.  Output shows the weight of each of the top words associated with each topic.

In [None]:
"""
(run 1)

(0, "0.025*coal + 0.008*job + 0.007*'s + 0.006*project + 0.006*stop + 0.006*want + 0.006*labor + 0.005*water + 0.005*fund + 0.005*loan")
(1, '0.082*í + 0.050*° + 0.043*tax + 0.041*½í² + 0.023*¼í¶\x93 + 0.016*pay + 0.013*$ + 0.012*coal + 0.010*energy + 0.008*½í')
(2, '0.170*$ + 0.009*cba + 0.009*anz + 0.009*wbc + 0.009*nab + 0.007*coal + 0.006*price + 0.005*fmg + 0.005*bxb + 0.005*syd')
(3, "0.016*coal + 0.005*'s + 0.005*job + 0.005*$ + 0.004*stop + 0.004*project + 0.004*want + 0.004*í + 0.003*fund + 0.003*reef")
(4, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*project + 0.004*stop + 0.004*want + 0.003*fund + 0.003*gas + 0.003*labor")
(5, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*project + 0.004*stop + 0.004*want + 0.003*gas + 0.003*water + 0.003*fund")
(6, "0.015*coal + 0.005*$ + 0.005*job + 0.005*'s + 0.004*stop + 0.004*project + 0.004*want + 0.003*gas + 0.003*fund + 0.003*support")
(7, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*project + 0.004*stop + 0.004*want + 0.003*gas + 0.003*fund + 0.003*labor")
(8, "0.015*coal + 0.005*job + 0.004*'s + 0.004*$ + 0.004*stop + 0.004*project + 0.003*want + 0.003*gas + 0.003*fund + 0.003*support")
(9, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*project + 0.004*stop + 0.004*want + 0.003*water + 0.003*gas + 0.003*fund")
(10, "0.015*coal + 0.005*'s + 0.005*job + 0.004*$ + 0.004*stop + 0.004*project + 0.004*argyle + 0.004*want + 0.003*gas + 0.003*fund")
(11, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*project + 0.004*stop + 0.004*want + 0.003*í + 0.003*fund + 0.003*gas")
(12, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*project + 0.004*stop + 0.004*want + 0.003*gas + 0.003*fund + 0.003*labor")
(13, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*stop + 0.004*project + 0.004*want + 0.003*gas + 0.003*fund + 0.003*water")
(14, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*stop + 0.004*project + 0.004*want + 0.004*fund + 0.003*gas + 0.003*labor")
(15, "0.015*coal + 0.005*job + 0.005*'s + 0.004*project + 0.004*stop + 0.004*$ + 0.004*want + 0.004*gas + 0.003*water + 0.003*fund")
(16, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*stop + 0.004*project + 0.004*want + 0.003*fund + 0.003*support + 0.003*gas")
(17, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*stop + 0.004*project + 0.004*want + 0.003*fund + 0.003*gas + 0.003*support")
(18, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*stop + 0.004*project + 0.004*want + 0.003*fund + 0.003*support + 0.003*labor")
(19, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*stop + 0.004*project + 0.004*want + 0.003*gas + 0.003*fund + 0.003*labor")


Time taken to process dataset: 1036.1106204986572 seconds, 17.268510341644287 minutes, 0.28780850569407146 hours.


Process finished with exit code 0

"""

Second run.  Output shows the weight of each of the top words associated with each topic.

In [None]:
"""
(run 2)

(0, "0.024*coal + 0.008*job + 0.007*'s + 0.006*want + 0.006*stop + 0.006*labor + 0.006*project + 0.005*fund + 0.005*support + 0.005*reef")
(1, '0.158*$ + 0.009*cba + 0.008*anz + 0.008*wbc + 0.008*nab + 0.007*coal + 0.007*price + 0.005*fmg + 0.005*share + 0.005*bxb')
(2, '0.086*í + 0.050*° + 0.043*½í² + 0.032*tax + 0.025*¼í¶\x93 + 0.015*$ + 0.013*coal + 0.009*energy + 0.007*½í + 0.006*pay')
(3, "0.020*coal + 0.013*gas + 0.008*water + 0.007*project + 0.005*'s + 0.005*seam + 0.004*narrabri + 0.004*new + 0.004*rail + 0.004*company")
(4, "0.035*tax + 0.024*pay + 0.012*coal + 0.007*'s + 0.006*ato + 0.005*energy + 0.004*job + 0.004*company + 0.004*haven + 0.004*chevron")
(5, '0.015*coal + 0.006*job + 0.006*go + 0.006*money + 0.006*taxpayer + 0.005*india + 0.005*joyce + 0.005*barnaby + 0.005*think + 0.005*give')
(6, "0.015*coal + 0.005*job + 0.005*'s + 0.004*project + 0.004*stop + 0.004*$ + 0.004*want + 0.003*fund + 0.003*gas + 0.003*reef")
(7, "0.016*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*stop + 0.004*project + 0.004*loan + 0.004*fund + 0.004*want + 0.003*support")
(8, "0.015*coal + 0.005*job + 0.005*'s + 0.004*loan + 0.004*project + 0.004*stop + 0.004*labor + 0.004*$ + 0.004*want + 0.003*reef")
(9, "0.015*coal + 0.006*job + 0.006*council + 0.005*townsville + 0.005*'s + 0.004*$ + 0.004*project + 0.004*stop + 0.004*fund + 0.004*pay")
(10, "0.016*coal + 0.005*í + 0.005*'s + 0.005*job + 0.004*$ + 0.004*° + 0.004*stop + 0.004*project + 0.004*want + 0.003*labor")
(11, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*stop + 0.004*project + 0.004*want + 0.003*water + 0.003*tax + 0.003*gas")
(12, "0.016*coal + 0.005*job + 0.005*'s + 0.004*fund + 0.004*$ + 0.004*stop + 0.004*project + 0.004*want + 0.003*water + 0.003*gas")
(13, "0.015*coal + 0.005*job + 0.004*'s + 0.004*stop + 0.004*$ + 0.004*project + 0.004*want + 0.003*water + 0.003*fund + 0.003*gas")
(14, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*stop + 0.004*project + 0.004*want + 0.003*í + 0.003*fund + 0.003*support")
(15, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*stop + 0.004*project + 0.004*want + 0.003*gas + 0.003*fund + 0.003*support")
(16, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*stop + 0.004*project + 0.004*want + 0.003*fund + 0.003*gas + 0.003*support")
(17, "0.015*coal + 0.006*í + 0.005*job + 0.004*'s + 0.004*$ + 0.004*stop + 0.004*project + 0.004*want + 0.003*fund + 0.003*gas")
(18, "0.015*coal + 0.006*job + 0.005*'s + 0.004*$ + 0.004*stop + 0.004*project + 0.004*want + 0.003*fund + 0.003*gas + 0.003*support")
(19, "0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*stop + 0.004*project + 0.004*want + 0.003*fund + 0.003*labor + 0.003*gas")


Time taken to process dataset: 1036.3824276924133 seconds, 17.273040461540223 minutes, 0.2878840076923371 hours.


Process finished with exit code 0

"""

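The results are sub-par when using only the default hyperparameter values.  Most topics contain the same words (coal, job, 's, $, stop, project, want) with only slightly different weights: in run 1, topics 3 through 19 are nearly indistinguishable, and in run 2 the same holds for most topics from 6 onward.  Tokens such as "$", "'s", and garbled non-ASCII fragments also appear among the top words, which suggests that some punctuation and emoji residue survives postprocessing.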
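
The redundancy described above can be quantified. The following sketch parses topic strings in the format shown in the output and computes the Jaccard overlap of their top-word sets. The helper names `top_words` and `jaccard` are hypothetical, and the three strings are copied from run 1 (topics 2, 4, and 7).

```python
def top_words(topic_string):
    """Extract the words from a string such as '0.025*coal + 0.008*job'."""
    return {term.split("*", 1)[1] for term in topic_string.split(" + ")}


def jaccard(words_a, words_b):
    """Size of the intersection divided by the size of the union."""
    return len(words_a & words_b) / len(words_a | words_b)


topic_2 = ("0.170*$ + 0.009*cba + 0.009*anz + 0.009*wbc + 0.009*nab + "
           "0.007*coal + 0.006*price + 0.005*fmg + 0.005*bxb + 0.005*syd")
topic_4 = ("0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*project + "
           "0.004*stop + 0.004*want + 0.003*fund + 0.003*gas + 0.003*labor")
topic_7 = ("0.015*coal + 0.005*job + 0.005*'s + 0.004*$ + 0.004*project + "
           "0.004*stop + 0.004*want + 0.003*gas + 0.003*fund + 0.003*labor")

print(jaccard(top_words(topic_4), top_words(topic_7)))            # 1.0
print(round(jaccard(top_words(topic_2), top_words(topic_4)), 3))  # 0.111
```

A value of 1.0 for topics 4 and 7 means the two share the same ten top words and differ only in their order, whereas the stock-ticker topic 2 shares just "$" and "coal" with topic 4.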

## Resources Used:

- https://radimrehurek.com/gensim/models/hdpmodel.html
    - Gensim HDP topic modeling Class.
    