# Latent Dirichlet Allocation Topic Model Implementation on SLO Twitter Dataset

### Joseph Jinn and Keith VanderLinden



Our implementation uses the Scikit-Learn LatentDirichletAllocation class and the Python "lda" library.  We use Scikit-Learn's GridSearchCV class to perform an exhaustive grid search for the optimal hyperparameters for our Twitter dataset.  We preprocess the raw Twitter dataset and then run the LDA algorithm with each of the following numbers of topics: 3, 6, 12, and 20.  We describe each topic by its top 10 words.<br>
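
The grid search mentioned above is not shown in this notebook. As a minimal sketch of how GridSearchCV can be combined with LatentDirichletAllocation, the example below searches a small, hypothetical parameter grid on a toy corpus. The corpus, the grid values, and the variable names are assumptions for illustration, not the settings used on the Twitter dataset.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Toy corpus standing in for the Tweet text (illustrative only).
toy_documents = [
    "coal project approval government",
    "coal project jobs workers",
    "reef water farmers land",
    "water basin farmers risk",
    "climate change carbon energy",
    "energy future climate coal",
]

# Vectorize into raw term counts, as LDA expects.
toy_counts = CountVectorizer().fit_transform(toy_documents)

# Hypothetical search space; the real grid is not given in this notebook.
param_grid = {
    "n_components": [2, 3],
    "learning_decay": [0.6, 0.8],
}

# With no scoring argument, GridSearchCV falls back on the estimator's own
# score() method, which for LDA is an approximate log-likelihood.
search = GridSearchCV(
    LatentDirichletAllocation(max_iter=5, learning_method="online", random_state=0),
    param_grid=param_grid,
    cv=2,
)
search.fit(toy_counts)

print(search.best_params_)
```

On a corpus this small the selected parameters carry no meaning; the point is only the shape of the search.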



### Import libraries and set parameters:



Adjust log verbosity levels as necessary.<br>

Set to "DEBUG" to view all debug output.<br>
Set to "INFO" to view useful information on dataframe shape, etc.<br>



In [1]:
# Import libraries.
import logging as log
import warnings
import time
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Import custom utility functions.
import topic_extraction_utility_functions as lda_util

#############################################################

# Pandas options.
pd.options.display.max_rows = None
pd.options.display.max_columns = None
pd.options.display.width = None
pd.options.display.max_colwidth = 1000
# Pandas float precision display.
pd.set_option('display.precision', 12)
# Don't output these types of warnings to terminal.
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=DeprecationWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
# Matplotlib log settings.
mylog = log.getLogger("matplotlib")
mylog.setLevel(log.INFO)

"""
Turn debug log statements for various sections of code on/off.
(adjust log level as necessary)
"""
log.basicConfig(level=log.INFO)
log.disable(level=log.DEBUG)

### Pre-process and Post-process Tweets:


    
We preprocess our Twitter dataset as follows:<br>

1) Downcase all text.<br>
2) Check that there is text, otherwise convert to empty string.<br>
3) Convert HTML character references to Unicode characters.<br>
4) Remove "RT" tags.<br>
5) Remove concatenated URLs.<br>
6) Handle whitespace by converting any run of whitespace characters to a single space.<br>
7) Remove URLs and replace with "slo_url".<br>
8) Remove Tweet mentions and replace with "slo_mention".<br>
9) Remove Tweet stock symbols and replace with "slo_stock".<br>
10) Remove Tweet hashtags and replace with "slo_hash".<br>
11) Remove Tweet cashtags and replace with "slo_cash".<br>
12) Remove Tweet year and replace with "slo_year".<br>
13) Remove Tweet time and replace with "slo_time".<br>
14) Remove character elongations.<br>
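
A few of the preprocessing steps above can be sketched with regular expressions as follows. This is an illustrative approximation, not the code in topic_extraction_utility_functions.py: the function name and the exact patterns are assumptions, and only a subset of the steps is covered.

```python
import html
import re


def preprocess_tweet_sketch(text):
    """Apply a few of the preprocessing steps listed above to one Tweet."""
    # Steps 1-2: downcase, treating missing text as an empty string.
    text = text.lower() if isinstance(text, str) else ""
    # Step 3: convert HTML character references to Unicode characters.
    text = html.unescape(text)
    # Step 4: remove "RT" tags.
    text = re.sub(r"\brt\b", " ", text)
    # Step 7: replace URLs with a placeholder token.
    text = re.sub(r"https?://\S+", "slo_url", text)
    # Step 8: replace mentions with a placeholder token.
    text = re.sub(r"@\w+", "slo_mention", text)
    # Step 10: replace hashtags with a placeholder token.
    text = re.sub(r"#\w+", "slo_hash", text)
    # Step 14: shorten character elongations (3+ repeats become 2).
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)
    # Step 6: collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


example = preprocess_tweet_sketch("RT @user: Sooooo  good &amp; true! #coal https://t.co/abc")
print(example)  # slo_mention: soo good & true! slo_hash slo_url
```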

We postprocess our Twitter dataset as follows:<br>

1) Remove the irrelevant words in the list below:<br>

    delete_list = ["word_n", "auspol", "ausbiz", "tinto", "adelaide", "csg", "nswpol",
                   "nsw", "lng", "don", "rio", "pilliga", "australia", "asx", "just", "today", "great", "says", "like",
                   "big", "better", "rite", "would", "SCREEN_NAME", "mining", "former", "qldpod", "qldpol", "qld", "wr",
                   "melbourne", "andrew", "fuck", "spadani", "greg", "th", "australians", "http", "https", "rt",
                   "co", "amp", "carmichael", "abbot", "bill shorten",
                   "slo_url", "slo_mention", "slo_hash", "slo_year", "slo_time", "slo_cash", "slo_stock",
                   "adani", "bhp", "cuesta", "fotescue", "riotinto", "newmontmining", "santos", "oilsearch",
                   "woodside", "ilukaresources", "whitehavencoal",
                   "stopadani", "goadani", "bhpbilliton", "billiton", "cuestacoal", "cuests coal", "cqc",
                   "fortescuenews", "fortescue metals", "rio tinto", "newmont", "newmont mining", "santosltd",
                   "oilsearchltd", "oil search", "woodsideenergy", "woodside petroleum", "woodside energy",
                   "iluka", "iluka resources", "whitehaven", "whitehaven coal"]

2) Remove all punctuation from the Tweet text.<br>
3) Remove all English stop words from the Tweet text.<br>
4) Lemmatize the words in the Tweet text.<br>
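
The first three postprocessing steps can be sketched as follows. The function name, the short excerpt of the delete list, and the use of Scikit-Learn's built-in stop word list are assumptions for illustration; lemmatization (step 4) is omitted because the notebook does not say which lemmatizer is used.

```python
import re
import string

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# A short excerpt of the delete list above, for illustration only.
DELETE_LIST_EXCERPT = ["adani", "slo_url", "slo_mention", "slo_hash", "rio tinto", "says"]


def postprocess_tweet_sketch(text, delete_list=DELETE_LIST_EXCERPT):
    """Remove irrelevant words, punctuation, and English stop words."""
    # Step 1: remove irrelevant phrases and words, longest entries first so
    # that multi-word entries are matched before their component words.
    for phrase in sorted(delete_list, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", text)
    # Step 2: remove punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Step 3: remove English stop words.
    tokens = [token for token in text.split() if token not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)


example = postprocess_tweet_sketch(
    "slo_mention the rio tinto coal project is bad for adani, says slo_hash")
print(example)  # coal project bad
```

A Tweet made up entirely of delete-list words and stop words comes out of this function as an empty string.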



In [2]:
# Preprocess the Tweet text column of our Twitter dataset.
lda_util.tweet_dataset_preprocessor(
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-with-irrelevant-tweets-excluded.csv",
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-topic-extraction-ready-tweet-text-with-hashtags-excluded-created-7-29-19.csv",
    "text_derived")

# Preprocess the user description column of our Twitter dataset.
lda_util.tweet_dataset_preprocessor(
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-with-irrelevant-tweets-excluded.csv",
    "D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
    "twitter-dataset-7-10-19-topic-extraction-ready-user-description-text-with-hashtags-excluded-created-7-29-19.csv",
    "user_description")



The first parameter in our function call specifies the file path of the dataset to be preprocessed.  The second parameter specifies where to save the resulting CSV file.  The third parameter specifies the name of the column in the dataset that contains the original text.<br>


Tweet preprocessing is done via a custom library imported as "lda_util" using "topic_extraction_utility_functions.py".<br>

The source code for the utility functions used above for data preprocessing and below for LDA topic extraction is available at:<br>

https://github.com/Calvin-CS/slo-classifiers/blob/master/topic/models/topic_extraction_utility_functions.py



### Import and prepare the preprocessed dataset for use in LDA topic extraction:


    
Refer to the code comments for the specific steps performed.<br>
Note that we have to use absolute file paths in Jupyter notebook as opposed to relative file paths in PyCharm.<br>



In [3]:
# Import the dataset (absolute path).
tweet_dataset_processed = \
    pd.read_csv("D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
                "twitter-dataset-7-10-19-topic-extraction-ready-tweet-text-with-hashtags-excluded"
                "-created-7-29-19-tokenized.csv", sep=",")

# # Import the dataset (test/debug).
# tweet_dataset_processed = \
#     pd.read_csv("D:/Dropbox/summer-research-2019/jupyter-notebooks/attribute-datasets/"
#                 "twitter-dataset-7-10-19-topic-extraction-ready-tweet-text-with-hashtags-excluded"
#                 "-created-7-30-19-test.csv", sep=",")

# Reindex and shuffle the data randomly.
tweet_dataset_processed = tweet_dataset_processed.reindex(
    np.random.permutation(tweet_dataset_processed.index))

# Generate a Pandas dataframe.
tweet_text_dataframe = pd.DataFrame(tweet_dataset_processed)

# Print shape and column names.
log.info(f"\nThe shape of the Tweet text dataframe:")
log.info(f"{tweet_text_dataframe.shape}\n")
log.info(f"\nThe columns of the Tweet text dataframe:")
log.info(f"{tweet_text_dataframe.columns}\n")

# Drop any NaN or empty Tweet rows in dataframe (or else CountVectorizer will blow up).
tweet_text_dataframe = tweet_text_dataframe.dropna()

# Print shape and column names.
log.info(f"\nThe shape of the Tweet text dataframe with NaN (empty) rows dropped:")
log.info(f"{tweet_text_dataframe.shape}\n")
log.info(f"\nThe columns of the Tweet text dataframe with NaN (empty) rows dropped:")
log.info(f"{tweet_text_dataframe.columns}\n")

# Reindex everything.
tweet_text_dataframe.index = pd.RangeIndex(len(tweet_text_dataframe.index))

# Assign column names.
tweet_text_dataframe_column_names = ['text_derived', 'text_derived_preprocessed', 'text_derived_postprocessed']

# Rename column in dataframe.
tweet_text_dataframe.columns = tweet_text_dataframe_column_names

# Create input feature.
selected_features = tweet_text_dataframe[['text_derived_postprocessed']]
processed_features = selected_features.copy()

# Check what we are using as inputs.
log.info(f"\nA sample Tweet in our input feature:")
log.info(f"{processed_features['text_derived_postprocessed'][0]}\n")

# Create feature set.
slo_feature_series = processed_features['text_derived_postprocessed']
slo_feature_series = pd.Series(slo_feature_series)
slo_feature_list = slo_feature_series.tolist()

INFO:root:

INFO:root:The shape of our preprocessed SLO dataframe with NaN (empty) rows dropped:
INFO:root:(653094, 1)
INFO:root:

INFO:root:The columns of our preprocessed SLO dataframe with NaN (empty) rows dropped:
INFO:root:<bound method NDFrame.head of                                                   tweet_t
134130  every australian politician should be watching...
109525  exciting to see the results of continued commu...
175975  breaking will proceed on but it doesnt have th...
221807                 already wtf cant do anything right
476425  breaking 4 people occupy a coal train at willo...
...                                                   ...
434193  others have left holes why we never really got...
231520  so confirms they are exploring selling and or ...
14479   power refusing to c oo perate with credit rati...
38365   suppos mine if it stack up environmentally doe...
656954  queensland tourism award winner rejects adanis...

[653094 rows x 1 columns]>
INFO:root:





The log.INFO messages above show the shape and contents of the preprocessed dataframe after dropping any rows that are "NaN".  A "NaN" row indicates a Tweet that consisted entirely of irrelevant words and became empty once those words were removed.<br>
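
One plausible way such "NaN" rows arise (an assumption, not confirmed by the notebook) is the CSV round trip: an empty string is written as an empty field, which pandas reads back as NaN by default. The sketch below uses a small in-memory CSV with hypothetical column names to show the effect and the subsequent dropna() call.

```python
import io

import pandas as pd

# A Tweet reduced to an empty string is written as an empty CSV field...
frame = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "text": ["coal project approved", "", "reef water risk"],
})
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# ...which read_csv interprets as NaN by default.
reloaded = pd.read_csv(buffer)

# Drop the NaN rows and rebuild a contiguous index.
cleaned = reloaded.dropna().reset_index(drop=True)

print(len(reloaded), len(cleaned))
```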



### Perform the topic extraction (uses the stance detection tokenized dataset):



We use the Scikit-Learn CountVectorizer class to convert the Tweet text into a matrix of raw term counts.  We set the max_features parameter to 1000 to limit the vocabulary to the 1000 terms with the highest term frequencies across the corpus.  We set the stop_words parameter to "english" to remove English stop words using Scikit-Learn's built-in stop word list.  We set the min_df and max_df parameters to the document-frequency thresholds outside which terms are excluded from the vocabulary: with min_df=2 and max_df=0.95, we ignore terms that appear in fewer than 2 Tweets or in more than 95% of Tweets.<br>
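
The effect of these CountVectorizer parameters is easiest to see on a toy corpus. The sketch below uses the same min_df and max_df values as the notebook but a much smaller max_features; the documents are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_documents = [
    "coal project approval",
    "coal water farmers farmers",
    "coal water reef farmers",
    "coal jobs water",
    "coal reef",
]

# min_df=2 drops terms found in fewer than 2 documents
# ("project", "approval", "jobs").
# max_df=0.95 drops terms found in more than 95% of documents ("coal").
# max_features=2 then keeps the 2 remaining terms with the highest total
# counts ("water" and "farmers" with 3 each, ahead of "reef" with 2).
toy_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=2, stop_words="english")
toy_counts = toy_vectorizer.fit_transform(toy_documents)

print(sorted(toy_vectorizer.vocabulary_))  # ['farmers', 'water']
print(toy_counts.toarray())
```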

We use the Scikit-Learn LatentDirichletAllocation class with the hyperparameters below to fit our Tweet data.  The parameter n_components controls the number of topics to extract.  The parameter max_iter controls the maximum number of passes over the training data.  The parameter learning_method selects how the model is updated: "batch" uses all of the training data in each update, while "online" updates from mini-batches of the training data.<br>

We use a utility function to display Topics 0-19 and the top 10 words associated with each topic.<br>
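
The utility function itself lives in topic_extraction_utility_functions.py and is not reproduced here. As a rough sketch of what extracting the top words involves, the hypothetical helper below (top_words_per_topic is an assumed name) reads the fitted model's components_ matrix on a toy corpus.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def top_words_per_topic(model, feature_names, n_top_words):
    """Return the n_top_words highest-weighted words for each topic."""
    feature_names = np.asarray(feature_names)
    topics = []
    for topic_weights in model.components_:
        # argsort is ascending, so reverse it and keep the first n entries.
        top_indices = np.argsort(topic_weights)[::-1][:n_top_words]
        topics.append(feature_names[top_indices].tolist())
    return topics


# Toy corpus standing in for the Tweet text (illustrative only).
toy_documents = [
    "coal project approval government",
    "coal project jobs workers",
    "reef water farmers land",
    "water basin farmers risk",
]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_documents)
# Vocabulary terms ordered by their column index in the count matrix.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

lda_model = LatentDirichletAllocation(n_components=2, max_iter=5, learning_method="online",
                                      learning_offset=50.0, random_state=0).fit(counts)

toy_topics = top_words_per_topic(lda_model, feature_names, 3)
for index, words in enumerate(toy_topics):
    print(f"Topic {index}: {' '.join(words)}")
```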



In [5]:
def latent_dirichlet_allocation_topic_extraction():
    """
    Function performs topic extraction on Tweets using Scikit-Learn LDA model.

    :return: None.
    """
    from sklearn.decomposition import LatentDirichletAllocation

    # LDA requires raw term counts (not TF-IDF weights) because it is a probabilistic model of word counts.
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')
    tf = tf_vectorizer.fit_transform(slo_feature_series)
    # scikit-learn >= 1.0; older releases use get_feature_names() instead.
    tf_feature_names = tf_vectorizer.get_feature_names_out()

    # Run LDA.
    lda = LatentDirichletAllocation(n_components=20, max_iter=5, learning_method='online', learning_offset=50.,
                                    random_state=0).fit(tf)
    time.sleep(3)

    # Display the top words for each topic.
    lda_util.display_topics(lda, tf_feature_names, 10)

    import pyLDAvis
    from pyLDAvis import sklearn
    # pyLDAvis.enable_notebook()
    visualization = sklearn.prepare(lda_model=lda, vectorizer=tf_vectorizer, dtm=tf)
    pyLDAvis.save_html(visualization, 'lda_visualization-no-company-words.html')
    
    
    
"""
Perform the topic extraction.
"""
latent_dirichlet_allocation_topic_extraction()

INFO:root:The time taken to perform the operation is: 
INFO:root:439.08266735076904
INFO:root:



Topic 0:
money slocashn coal work minister ceo taxpayers banks join use
Topic 1:
cou public lnp community away shares townsville end workers best
Topic 2:
people time need group thanks come thing cut latest times
Topic 3:
water reef land farmers barrier help state free adanis plan
Topic 4:
going narrabri barnaby does fight joyce business pm massive australias
Topic 5:
stop new coal government wont breaking approval plans premier carbon
Topic 6:
climate change greens adani vote biggest council companies week coal
Topic 7:
oil tell political coal hey bad clear taxpayer noh message
Topic 8:
queensland funding world risk coal groundwater looks access policy paying
Topic 9:
govt coal india shoen coalmine deal wants years local build
Topic 10:
adanis project coal turnbull point basin right did protect pollution
Topic 11:
labor alp way repo coal canavan look really lnp corruption
Topic 12:
jobs say news create board planet high finance abc thousands
Topic 13:
rail environmental environment di



We could not find any strong relationship among the 10 words in each topic that would let us assign a descriptive label, such as "economic", "environmental", or "social", to that topic.

Interestingly, LDA topic extraction appears to take longer when we specify fewer topics than when we specify more.  We surmise this is because our large dataset of 650k+ Tweets translates to 650k+ separate documents in the corpus.  In the extreme, assigning 650k+ documents to 650k+ different topics would be simple, whereas assigning them to a mere 3 topics, or in general to far fewer topics than documents, takes the algorithm more time.<br>
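
The timing observation above can be checked with a small experiment along the following lines. The sketch times the fit for each topic count on a synthetic count matrix; the matrix size and the Poisson rate are arbitrary assumptions, and timings on synthetic data need not reproduce the behaviour seen on the 650k+ Tweets.

```python
import time

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Synthetic term counts standing in for the vectorized Tweets (illustrative).
rng = np.random.default_rng(0)
synthetic_counts = rng.poisson(0.05, size=(2000, 200))

fit_seconds = {}
for n_topics in (3, 6, 12, 20):
    start = time.perf_counter()
    LatentDirichletAllocation(n_components=n_topics, max_iter=5,
                              learning_method="online", random_state=0).fit(synthetic_counts)
    fit_seconds[n_topics] = time.perf_counter() - start

for n_topics, seconds in fit_seconds.items():
    print(f"{n_topics} topics: {seconds:.2f} s")
```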



## LDA Topic Extraction using the "lda" library and collapsed Gibbs Sampling (uses the stance detection tokenized dataset):



The code below uses the "lda" Python library package that performs LDA topic extraction using collapsed Gibbs Sampling.<br>
This is different from the Scikit-Learn implementation that uses online variational inference.<br>
Otherwise, the dataset is the same and we are still using Scikit-Learn's CountVectorizer class to vectorize our data.<br>
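
To make the difference in inference method concrete, the sketch below implements a bare-bones collapsed Gibbs sampler for LDA in NumPy. It is an illustration of the technique only, not the "lda" library's implementation: the function name, the hyperparameter values, and the toy documents are assumptions.

```python
import numpy as np


def collapsed_gibbs_lda(docs, n_topics, vocab_size, n_iter=50, alpha=0.1, eta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of word-id lists."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # tokens in doc d assigned to topic k
    topic_word = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic k
    topic_total = np.zeros(n_topics)               # tokens assigned to topic k
    assignments = []
    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        z_doc = rng.integers(n_topics, size=len(doc))
        assignments.append(z_doc)
        for w, z in zip(doc, z_doc):
            doc_topic[d, z] += 1
            topic_word[z, w] += 1
            topic_total[z] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d, z] -= 1
                topic_word[z, w] -= 1
                topic_total[z] -= 1
                # Conditional distribution over topics given all other assignments.
                p = (doc_topic[d] + alpha) * (topic_word[:, w] + eta) / (topic_total + vocab_size * eta)
                z = rng.choice(n_topics, p=p / p.sum())
                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d, z] += 1
                topic_word[z, w] += 1
                topic_total[z] += 1
    # Smoothed topic-word distributions; each row sums to 1.
    return (topic_word + eta) / (topic_total[:, None] + vocab_size * eta)


# Four toy documents over a vocabulary of 4 word ids.
toy_docs = [[0, 1, 0, 1], [1, 0, 0], [2, 3, 3], [3, 2, 2, 3]]
toy_topic_word = collapsed_gibbs_lda(toy_docs, n_topics=2, vocab_size=4)
print(toy_topic_word.round(3))
```

Each sweep resamples the topic of every token in turn, which is why the "lda" library reports its progress per iteration over the whole corpus.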



In [6]:
def latent_dirichlet_allocation_collapsed_gibbs_sampling():
    """
    Function performs LDA topic extraction using collapsed Gibbs Sampling.

    https://pypi.org/project/lda/

    :return: None.
    """
    import lda

    # LDA requires raw term counts (not TF-IDF weights) because it is a probabilistic model of word counts.
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=1000, stop_words='english')
    tf = tf_vectorizer.fit_transform(slo_feature_series)
    # scikit-learn >= 1.0; older releases use get_feature_names() instead.
    tf_feature_names = tf_vectorizer.get_feature_names_out()

    # Train and fit the LDA model.
    model = lda.LDA(n_topics=20, n_iter=100, random_state=1)
    model.fit(tf)  # model.fit_transform(X) is also available
    topic_word = model.topic_word_  # model.components_ also works
    n_top_words = 10
    time.sleep(3)

    # Display the topics and the top words associated with each.
    for i, topic_dist in enumerate(topic_word):
        topic_words = np.array(tf_feature_names)[np.argsort(topic_dist)][:-(n_top_words + 1):-1]
        print('Topic {}: {}'.format(i, ' '.join(topic_words)))
        

        
"""
Perform the topic extraction using collapsed Gibbs Sampling.
"""
latent_dirichlet_allocation_collapsed_gibbs_sampling()

INFO:lda:n_documents: 653094
INFO:lda:vocab_size: 1000
INFO:lda:n_words: 3267212
INFO:lda:n_topics: 20
INFO:lda:n_iter: 100
INFO:lda:<0> log likelihood: -33566606
INFO:lda:<10> log likelihood: -27631270
INFO:lda:<20> log likelihood: -24168941
INFO:lda:<30> log likelihood: -23191677
INFO:lda:<40> log likelihood: -22881500
INFO:lda:<50> log likelihood: -22754208
INFO:lda:<60> log likelihood: -22681384
INFO:lda:<70> log likelihood: -22639373
INFO:lda:<80> log likelihood: -22612153
INFO:lda:<90> log likelihood: -22593660
INFO:lda:<99> log likelihood: -22575655
INFO:root:The time taken to perform the operation is: 
INFO:root:90.21567153930664
INFO:root:



Topic 0: coal energy future clean fossil climate carbon time global need
Topic 1: water free billion owners coal traditional unlimited farmers giving dollars
Topic 2: cou coal native title stop adanis approval federal land turnbull
Topic 3: labor greens stop lnp alp vote shoen election suppo want
Topic 4: coal thanks latest times australian adanis green bank repo govt
Topic 5: coal fund money banks project funding govt adanis wont taxpayers
Topic 6: people action stop protest join day protesters campaign time message
Topic 7: gas project coal narrabri seam forest water farmers barnaby pipeline
Topic 8: beach dam watch day tour video brazil story iluka disaster
Topic 9: water basin aesian environmental risk coal right world health suppo
Topic 10: foescue shares group metals debt profit year loss fmg news
Topic 11: loan canavan minister taxpayer slocashn matt board joyce barnaby money
Topic 12: loan rail line adanis coal galilee naif veto basin government
Topic 13: coal new power india p



The results seem to be as incoherent as those from the Scikit-Learn implementation of LDA topic extraction using online variational inference.<br>

It is difficult to see any relationship among the top 10 words for each topic.<br>

Here, we use n_iter=100 (iterations) because fitting with the "lda" library is much faster: 100 iterations took about 90 seconds, whereas the Scikit-Learn implementation took about 440 seconds with max_iter=5.<br>



## Updated Topic Extraction Results on Twitter Topic Modeling Dataset Tweet Text (not the stance detection tokenized dataset):

First execution of LDA on our tokenized Twitter dataset.

Second execution of LDA with the same hyperparameters.

Results are similar to that of the tokenized stance detection dataset.

### Why does it work poorly on Tweets?


    
Based on Derek Fisher's senior project presentation:

- LDA typically works best when the documents are lengthy (large word count) and written in a formal style.

- Tweet text is generally very short, with a maximum of around 280 characters.

- Tweet text is generally written in an informal style, e.g.:

    - emojis
    - spelling errors
    - other grammatical errors

- The above makes it difficult for the LDA algorithm to discover any prominent underlying hidden structures.
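
The document-length point can be quantified from the "lda" library log shown earlier, which reports the number of documents and the total number of in-vocabulary words. The short calculation below uses only those two logged figures.

```python
# Figures reported in the "lda" library log output above.
n_documents = 653094
n_words = 3267212

average_tokens_per_tweet = n_words / n_documents
print(f"{average_tokens_per_tweet:.2f} in-vocabulary tokens per Tweet on average")
```

With only about 5 vocabulary tokens per Tweet, each document provides very little word co-occurrence evidence for the model to work with.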



## Resources Used:



- https://scikit-learn.org/stable/modules/decomposition.html#latentdirichletallocation<br>
    - Scikit-Learn introduction to LDA.<br>


- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation<br>
    - Scikit-Learn documentation on the LDA class.<br>


- https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730<br>
    - Article with example of topic modeling using Scikit-Learn LDA and NMF.<br>


- https://pypi.org/project/lda/<br>
    - Link to the "lda" Python package website.<br>


