# Biterm Topic Model Implementation on SLO Twitter Dataset

### Joseph Jinn and Keith VanderLinden

We use a third-party Biterm Topic Model library (`biterm`).  This library only installs and runs on Linux.

### Import libraries and set parameters:

We import the requisite libraries and custom utility functions, and set the parameters for the imported libraries.

In [None]:
# Import libraries.
import logging as log
import time
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
from biterm.cbtm import oBTM
from biterm.utility import vec_to_biterms, topic_summuary  # helper functions
from sklearn.feature_extraction.text import CountVectorizer

#############################################################

# Pandas options.
pd.options.display.max_rows = None
pd.options.display.max_columns = None
pd.options.display.width = None
pd.options.display.max_colwidth = 1000
# Pandas float precision display (use the full option name; the short form is ambiguous in newer Pandas).
pd.set_option('display.precision', 12)
# Seaborn setting.
sns.set()
# Don't output these types of warnings to terminal.
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
# Matplotlib log settings.
mylog = log.getLogger("matplotlib")
mylog.setLevel(log.INFO)

"""
Turn debug log statements for various sections of code on/off.
(adjust log level as necessary)
"""
log.basicConfig(level=log.INFO)
log.disable(level=log.DEBUG)

### Pre-process and Post-process Tweets:


    
We preprocess our Twitter dataset as follows:<br>

1) Downcase all text.<br>
2) Check that there is text; otherwise convert to an empty string.<br>
3) Convert HTML character entities to Unicode characters.<br>
4) Remove "RT" tags.<br>
5) Remove concatenated URLs.<br>
6) Convert every run of whitespace characters to a single space.<br>
7) Replace URLs with "slo_url".<br>
8) Replace Tweet mentions with "slo_mention".<br>
9) Replace Tweet stock symbols with "slo_stock".<br>
10) Replace Tweet hashtags with "slo_hash".<br>
11) Replace Tweet cashtags with "slo_cash".<br>
12) Replace Tweet years with "slo_year".<br>
13) Replace Tweet times with "slo_time".<br>
14) Remove character elongations.<br>
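
As a rough illustration of a few of these steps, the sketch below uses only the Python standard library. It is not the code in our utility module: the function name `preprocess_tweet` and the regular expressions are assumptions made for this example, only a subset of the steps is covered, and URLs are replaced before "RT" tags are removed so that URL paths are left untouched.

```python
import html
import re


def preprocess_tweet(text):
    """Apply a subset of the preprocessing steps to one Tweet (illustrative only)."""
    # Steps 1-2: guard against missing text, then downcase.
    if not isinstance(text, str):
        return ""
    text = text.lower()
    # Step 3: convert HTML character entities (e.g. "&amp;") to Unicode.
    text = html.unescape(text)
    # Step 7: replace URLs with a placeholder token.
    text = re.sub(r"https?://\S+", "slo_url", text)
    # Step 4: remove "rt" retweet tags (the text is already lowercase).
    text = re.sub(r"\brt\b", " ", text)
    # Step 8: replace user mentions with a placeholder token.
    text = re.sub(r"@\w+", "slo_mention", text)
    # Step 10: replace hashtags with a placeholder token.
    text = re.sub(r"#\w+", "slo_hash", text)
    # Step 14: squeeze runs of three or more identical characters to one
    # (an assumed rule; the real elongation handling may differ).
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Step 6: collapse every run of whitespace to a single space.
    return " ".join(text.split())


print(preprocess_tweet("RT @BHP: Sooooo good &amp; clean! https://t.co/abc #mining"))
```

The placeholder tokens keep the position of URLs, mentions, and hashtags visible at this stage; they are removed later during postprocessing.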

We postprocess our Twitter dataset as follows:<br>

1) Remove the irrelevant words specified in the list below:<br>

    delete_list = ["word_n", "auspol", "ausbiz", "tinto", "adelaide", "csg", "nswpol",
                   "nsw", "lng", "don", "rio", "pilliga", "australia", "asx", "just", "today", "great", "says", "like",
                   "big", "better", "rite", "would", "SCREEN_NAME", "mining", "former", "qldpod", "qldpol", "qld", "wr",
                   "melbourne", "andrew", "fuck", "spadani", "greg", "th", "australians", "http", "https", "rt",
                   "co", "amp", "carmichael", "abbot", "bill shorten",
                   "slo_url", "slo_mention", "slo_hash", "slo_year", "slo_time", "slo_cash", "slo_stock",
                   "adani", "bhp", "cuesta", "fotescue", "riotinto", "newmontmining", "santos", "oilsearch",
                   "woodside", "ilukaresources", "whitehavencoal",
                   "stopadani", "goadani", "bhpbilliton", "billiton", "cuestacoal", "cuests coal", "cqc",
                   "fortescuenews", "fortescue metals", "rio tinto", "newmont", "newmont mining", "santosltd",
                   "oilsearchltd", "oil search", "woodsideenergy", "woodside petroleum", "woodside energy",
                   "iluka", "iluka resources", "whitehaven", "whitehaven coal"]

2) Remove all punctuation from the Tweet text.<br>
3) Remove all English stop words from the Tweet text.<br>
4) Lemmatize the words in the Tweet text.<br>
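
The postprocessing steps can be sketched in a similar way. The example below is an assumption-laden illustration rather than our actual implementation: `postprocess_tweet`, `DELETE_WORDS`, and `STOP_WORDS` are hypothetical names, both word sets are heavily shortened stand-ins, multi-word delete-list entries are not handled, and lemmatization (step 4) is omitted because it needs an NLP library.

```python
import string

# Hypothetical, heavily shortened stand-ins for the full lists.
DELETE_WORDS = {"adani", "slo_url", "slo_mention", "slo_hash", "rt", "amp"}
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "for", "on"}


def postprocess_tweet(text):
    """Drop delete-list words, punctuation, and stop words from one Tweet."""
    strip_punct = str.maketrans("", "", string.punctuation)
    kept = []
    for token in text.split():
        # Trim punctuation from the ends only, so that placeholder tokens
        # such as "slo_url" keep their underscore while being matched.
        bare = token.strip(string.punctuation)
        if bare in DELETE_WORDS:
            continue  # Step 1: irrelevant word.
        # Step 2: remove any remaining punctuation inside the token.
        word = bare.translate(strip_punct)
        # Step 3: skip empty tokens and stop words.
        if word and word not in STOP_WORDS:
            kept.append(word)
    return " ".join(kept)


print(postprocess_tweet("slo_mention: the adani mine is a risk to the reef! slo_url"))
```

Matching the delete list before stripping inner punctuation matters here, because the "slo_" placeholders would otherwise lose their underscore and no longer match.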



In [None]:
# Preprocess and tokenize the Tweet text column of our Twitter dataset.
tweet_dataset_preprocessor(
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-with-irrelevant-tweets-excluded.csv",
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-topic-extraction-ready-tweet-text-with-hashtags-excluded-created-7-29-19.csv",
    "text_derived")

# Preprocess and tokenize the user description column of our Twitter dataset.
tweet_dataset_preprocessor(
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-with-irrelevant-tweets-excluded.csv",
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-topic-extraction-ready-user-description-text-with-hashtags-excluded-created-7-29-19.csv",
    "user_description")



The first argument in our function call specifies the file path of the dataset to be preprocessed.  The second argument specifies the path where the resulting CSV file is saved.  The third argument specifies the name of the column in the dataset that contains the original text to process (the Tweet text or the user description).<br>


Tweet preprocessing is done via our custom utility module "topic_extraction_utility_functions.py", imported as "lda_util".<br>

Refer to the link below for the codebase of the utility functions used above for data preprocessing and below for topic extraction:<br>

https://github.com/Calvin-CS/slo-classifiers/blob/master/topic/models/topic_extraction_utility_functions.py



### Import and prepare the preprocessed dataset for use in Biterm topic extraction:

We follow the general format of loading the dataset into a Pandas dataframe, isolating the column of interest, and generating the corpus of documents from which the dictionary of words is later built.  Please refer to the code comments for details on the specific steps for the entire process.

In [None]:
# Import the dataset (absolute path).
tweet_dataset_processed = \
    pd.read_csv("D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
                "twitter-dataset-7-10-19-topic-extraction-ready-tweet-text-with-hashtags-excluded"
                "-created-7-29-19-tokenized.csv", sep=",")

# # Import the dataset (test/debug).
# tweet_dataset_processed = \
#     pd.read_csv("D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
#                 "twitter-dataset-7-10-19-topic-extraction-ready-tweet-text-with-hashtags-excluded"
#                 "-created-7-30-19-test.csv", sep=",")

# Reindex and shuffle the data randomly (use NumPy directly; the "pd.np" alias is deprecated).
tweet_dataset_processed = tweet_dataset_processed.reindex(
    np.random.permutation(tweet_dataset_processed.index))

# Generate a Pandas dataframe.
tweet_text_dataframe = pd.DataFrame(tweet_dataset_processed)

# Print shape and column names.
log.info(f"\nThe shape of the Tweet text dataframe:")
log.info(f"{tweet_text_dataframe.shape}\n")
log.info(f"\nThe columns of the Tweet text dataframe:")
log.info(f"{tweet_text_dataframe.columns}\n")

# Drop any NaN or empty Tweet rows in dataframe (or else CountVectorizer will blow up).
tweet_text_dataframe = tweet_text_dataframe.dropna()

# Print shape and column names.
log.info(f"\nThe shape of the Tweet text dataframe with NaN (empty) rows dropped:")
log.info(f"{tweet_text_dataframe.shape}\n")
log.info(f"\nThe columns of the Tweet text dataframe with NaN (empty) rows dropped:")
log.info(f"{tweet_text_dataframe.columns}\n")

# Reindex everything.
tweet_text_dataframe.index = pd.RangeIndex(len(tweet_text_dataframe.index))

# Assign column names.
tweet_text_dataframe_column_names = ['text_derived', 'text_derived_preprocessed', 'text_derived_postprocessed']

# Rename column in dataframe.
tweet_text_dataframe.columns = tweet_text_dataframe_column_names

# Create input feature.
selected_features = tweet_text_dataframe[['text_derived_postprocessed']]
processed_features = selected_features.copy()

# Check what we are using as inputs.
log.info(f"\nA sample Tweet in our input feature:")
log.info(f"{processed_features['text_derived_postprocessed'][0]}\n")

# Create feature set.
slo_feature_series = processed_features['text_derived_postprocessed']
slo_feature_series = pd.Series(slo_feature_series)
slo_feature_list = slo_feature_series.tolist()

### Perform the topic extraction:

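The function below contains the code specific to the Biterm topic modeling library: it vectorizes the Tweets into term counts, converts them to biterms, trains an online Biterm Topic Model with 20 topics, and prints the topic coherence and a sample of topic assignments.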

In [None]:
def biterm_topic_model_topic_extraction():
    """
    Function performs topic extraction on Tweets using the Biterm Topic Model (online BTM).

    :return: None.
    """
    # BTM is trained on word co-occurrences (biterms), so we use raw term counts rather than TF-IDF weights.
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')
    tf = tf_vectorizer.fit_transform(slo_feature_series)
    tf_feature_names = tf_vectorizer.get_feature_names()

    log.info(f"\n.fit_transform - Learn the vocabulary dictionary and return term-document matrix.")
    log.info(f"{tf}\n")
    log.info(f"\n.get_feature_names - Array mapping from feature integer indices to feature name")
    log.info(f"{tf_feature_names}\n")

    # Convert corpus of documents (vectorized text) to numpy array.
    tf_array = tf.toarray()

    # Convert dictionary of words (vocabulary) to numpy array.
    tf_feature_names = np.array(tf_vectorizer.get_feature_names())

    # get biterms
    biterms = vec_to_biterms(tf_array)

    # create btm
    btm = oBTM(num_topics=20, V=tf_feature_names)

    print("\n\n Train Online BTM ..")
    for i in range(0, len(biterms), 100):  # process chunks of 100 texts
        biterms_chunk = biterms[i:i + 100]
        btm.fit(biterms_chunk, iterations=50)
    topics = btm.transform(biterms)
    time.sleep(3)

    # print("\n\n Visualize Topics ..")
    # vis = pyLDAvis.prepare(btm.phi_wz.T, topics, np.count_nonzero(tf_array, axis=1), tf_feature_names, np.sum(tf_array, axis=0))
    # pyLDAvis.save_html(vis, './vis/online_btm.html')

    print("\n\n Topic coherence ..")
    topic_summuary(btm.phi_wz.T, tf_array, tf_feature_names, 10)

    print("\n\n Texts & Topics ..")
    for i in range(1, 10):
        print("{} (topic: {})".format(slo_feature_series[i], topics[i].argmax()))

    # print("\n\n Texts & Topics ..")
    # for i in range(len(slo_feature_series)):
    #     print("{} (topic: {})".format(slo_feature_series[i], topics[i].argmax()))
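
A biterm is an unordered pair of distinct words that co-occur in the same short text, and the Biterm Topic Model learns topics from the set of all such pairs in the corpus.  The sketch below illustrates the idea on plain tokens; `text_to_biterms` is a hypothetical helper written for this explanation, whereas the library's `vec_to_biterms` works on the term-count vectors instead.

```python
from itertools import combinations


def text_to_biterms(text):
    """Return every unordered pair of distinct words in one short text."""
    # Sort the unique words so that each pair has a stable order.
    vocabulary = sorted(set(text.split()))
    return list(combinations(vocabulary, 2))


print(text_to_biterms("coal mine jobs"))
# [('coal', 'jobs'), ('coal', 'mine'), ('jobs', 'mine')]
```

A text with n distinct words yields n(n-1)/2 biterms, so the number of biterms grows quickly with text length; this is one reason the model is aimed at short texts such as Tweets.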

Here, we call the topic modeling function and train it on our Twitter dataset.  We record the time it takes to process the entire dataset and extract topics.

In [None]:
"""
Main function.  Execute the program.
"""
if __name__ == '__main__':
    my_start_time = time.time()
    ################################################
    """
    Perform the topic extraction.
    """
    biterm_topic_model_topic_extraction()

    ################################################
    my_end_time = time.time()

    time_elapsed_in_seconds = (my_end_time - my_start_time)
    time_elapsed_in_minutes = (my_end_time - my_start_time) / 60.0
    time_elapsed_in_hours = (my_end_time - my_start_time) / 60.0 / 60.0
    print(f"Time taken to process dataset: {time_elapsed_in_seconds} seconds, "
          f"{time_elapsed_in_minutes} minutes, {time_elapsed_in_hours} hours.")
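
As an alternative way to report the run time, the sketch below formats the elapsed seconds as hours, minutes, and seconds.  The helper `format_elapsed` is a hypothetical name, and the `sum(...)` call is only a stand-in for the topic extraction call.

```python
import time


def format_elapsed(seconds):
    """Format a duration given in seconds as H:MM:SS."""
    minutes, secs = divmod(int(round(seconds)), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


# perf_counter is a monotonic clock intended for measuring intervals.
start = time.perf_counter()
sum(range(1_000_000))  # stand-in for the topic extraction call
elapsed = time.perf_counter() - start
print(f"Time taken to process dataset: {format_elapsed(elapsed)}")
```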

### Topic Extraction Results on Twitter Dataset Tweet Text:

Execution run 1.

Execution run 2.

The results display each topic along with its topic coherence metric value and the top 10 words associated with it.  A sample of the Tweets in the dataset with their assigned topics is also given.  Of the topic modeling libraries we use, the Biterm library takes the longest to process the dataset when using default hyperparameters.
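
To give a feel for what a topic coherence value measures, the sketch below computes a UMass-style coherence score from document co-occurrence counts.  This is an assumption about the general family of metric, not a copy of the library's code: `umass_coherence` is a hypothetical function, and the exact formula used by the library may differ.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence for one topic's top words (most probable first)."""
    doc_sets = [set(doc.split()) for doc in documents]
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            # Number of documents containing the higher-ranked word.
            d_j = sum(1 for d in doc_sets if top_words[j] in d)
            # Number of documents containing both words.
            d_ij = sum(1 for d in doc_sets
                       if top_words[j] in d and top_words[i] in d)
            if d_j:
                # Add-one smoothing avoids log(0) for pairs that never co-occur.
                score += math.log((d_ij + 1) / d_j)
    return score


docs = ["coal mine jobs", "coal mine", "reef tourism jobs"]
print(umass_coherence(["coal", "mine"], docs))  # words that co-occur
print(umass_coherence(["coal", "reef"], docs))  # words that never co-occur
```

Under this metric, a topic whose top words frequently appear together in the same Tweets scores higher (closer to or above zero) than a topic whose top words rarely co-occur.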

## Resources Used:

- https://pypi.org/project/biterm/
    - Biterm Python library we utilize. (Linux OS only)