# Latent Dirichlet Allocation Grid Search on SLO Twitter Dataset

### Joseph Jinn and Keith VanderLinden

<span style="font-family:Papyrus; font-size:1.25em;">
    
**Note: This exhaustive grid search applies specifically to Scikit-Learn's LatentDirichletAllocation() and CountVectorizer() classes.  It also uses the Pipeline() class to set up and chain those two classes.**<br>

</span>

<span style="font-family:Papyrus; font-size:1.25em;">

We use Scikit-Learn's Pipeline class to construct a pipeline that chains the CountVectorizer and LatentDirichletAllocation classes.<br>

The "parameters" dictionary determines all the possible combinations of hyperparameters we will test in order to find the optimal hyperparameters for the model.<br>

The grid search is performed by fitting on the Twitter data we wish to use for topic extraction.<br>

The optimal hyperparameters are reported via "log.info" messages, so the logging level must be set to INFO or lower to see them.<br>

We recommend executing this only on a supercomputer; otherwise, it will take an extremely long time to finish because of the number of hyperparameter combinations defined in the dictionary below.<br>

If you run this code snippet on a non-workstation PC, you may want to change "n_jobs=-1" to "n_jobs=1" for the GridSearchCV() class (a value of 0 is not valid and raises an error) so that Python does not use every CPU core and leave your system unusable for the duration of the search.  This requires editing the "slo_lda_topic_extraction_utility_functions.py" file linked in the "Notes" section below.<br>

</span>
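
The steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in the utility file: the toy sentences, the reduced grid, and the variable names are hypothetical, and the real utility function may configure GridSearchCV differently (for example, a different "cv" or "scoring" value).

```python
import logging

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

logging.basicConfig(level=logging.INFO)  # INFO is needed to see log.info output
log = logging.getLogger(__name__)

# Toy stand-in for the tweet text (hypothetical data, not the SLO dataset).
toy_tweets = [
    "mining company announces new coal project",
    "protesters oppose the coal mine expansion",
    "coal project approval delayed by court",
    "company shares rise after mine approval",
    "community supports jobs from new mine",
    "court rejects appeal against mine expansion",
    "new jobs promised by mining company",
    "protesters rally outside company office",
]

# The step names "vect" and "clf" match the prefixes used in the parameter dictionary.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LatentDirichletAllocation()),
])

# Deliberately tiny grid so the sketch finishes quickly.
small_grid = {
    "vect__stop_words": ["english"],
    "clf__n_components": [2, 3],
    "clf__learning_method": ["online"],
    "clf__random_state": [0],
}

# Without a "scoring" argument, candidates are ranked by the pipeline's score(),
# which for LDA is an approximate log-likelihood (higher is better).
# n_jobs=1 keeps the search on a single CPU core.
search = GridSearchCV(pipeline, small_grid, cv=2, n_jobs=1)
search.fit(toy_tweets)

log.info("Best score: %0.3f", search.best_score_)
for name in sorted(small_grid):
    log.info("\t%s: %r", name, search.best_params_[name])
```

Setting "n_jobs=1" here mirrors the advice for machines that should stay responsive; on a dedicated machine "n_jobs=-1" uses all cores.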


In [None]:
# What parameters do we search for?
lda_search_parameters = {
    'vect__strip_accents': [None],
    'vect__lowercase': [True],
    'vect__stop_words': ['english'],
    # 'vect__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4)],
    'vect__analyzer': ['word'],
    'vect__min_df': [2],
    'vect__max_df': [0.95],
    'vect__max_features': [1000],
    'clf__n_components': [5, 10, 20],
    'clf__doc_topic_prior': [None],
    'clf__topic_word_prior': [None],
    'clf__learning_method': ['online'],
    'clf__learning_decay': [0.5, 0.7, 0.9],
    'clf__learning_offset': [5, 10, 15],
    'clf__max_iter': [1, 5, 10],
    'clf__batch_size': [64, 128, 256],
    'clf__evaluate_every': [0],
    # 'clf__total_samples': [1e4, 1e6, 1e8],
    # 'clf__perp_tol': [1e-1, 1e-2, 1e-3],
    'clf__mean_change_tol': [1e-1, 1e-3, 1e-5],
    'clf__max_doc_update_iter': [50, 100, 150],
    'clf__n_jobs': [-1],
    'clf__verbose': [0],
    'clf__random_state': [None],
}

# lda_util.latent_dirichlet_allocation_grid_search(slo_feature_set, lda_search_parameters)
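
To get a feel for why the search is expensive, the number of candidate combinations can be counted before running anything. The sketch below copies only the multi-valued entries of the dictionary above into Scikit-Learn's ParameterGrid; the fold count is an assumption, since the "cv" setting used by the utility function is not shown in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same value lists as the dictionary above. Single-valued entries are omitted
# because they do not change the number of combinations.
varied_parameters = {
    "clf__n_components": [5, 10, 20],
    "clf__learning_decay": [0.5, 0.7, 0.9],
    "clf__learning_offset": [5, 10, 15],
    "clf__max_iter": [1, 5, 10],
    "clf__batch_size": [64, 128, 256],
    "clf__mean_change_tol": [1e-1, 1e-3, 1e-5],
    "clf__max_doc_update_iter": [50, 100, 150],
}

n_candidates = len(ParameterGrid(varied_parameters))  # 3 ** 7 combinations
n_folds = 5  # assumption: the actual cv value is set inside the utility function
n_fits = n_candidates * n_folds

print(n_candidates)  # 2187
print(n_fits)        # 10935
```

Uncommenting "ngram_range", "total_samples", or "perp_tol" in the dictionary multiplies these counts by the length of each added list.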

#### Parameter list searched by GridSearchCV():

<span style="font-family:Papyrus; font-size:1.25em;">

strip_accents: strip accents off of individual characters.<br>

lowercase: downcase all characters.<br>

stop_words: remove all the specified stopwords.<br>

ngram_range: lower/upper boundary for word/character n-grams.<br>

analyzer: determines whether features consist of words or characters.<br>

min_df: ignore words/characters that have a document frequency lower than this threshold.<br>
max_df: ignore words/characters that have a document frequency higher than this threshold.<br>

max_features: the maximum number of words/characters for building the vocabulary.<br>

n_components: determines the # of topics to extract.<br>

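doc_topic_prior: prior of the document-topic distribution, alpha (α); when set to "None", defaults to 1 / n_components.<br>
topic_word_prior: prior of the topic-word distribution, eta (η); when set to "None", defaults to 1 / n_components.<br>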


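learning_method: method used to update the topic-word components, either "batch" or "online".<br>
learning_decay: controls the learning rate of the "online" learning method; the Scikit-Learn documentation recommends a value in (0.5, 1.0] to guarantee asymptotic convergence.<br>
learning_offset: positive value that downweights early iterations of the "online" learning method; it should be greater than 1.0.<br>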

max_iter: maximum number of passes over the training data (epochs) in the "fit" method.<br>
batch_size: number of documents to use in each iteration of the "Expectation-maximization" (EM) algorithm; used only by the "online" learning method.<br>

evaluate_every: how often the "fit" method evaluates perplexity, a measure of how well a probability distribution or model predicts a sample; a value of 0 or less disables the evaluation.<br>

total_samples: total number of documents; used only in the "partial_fit" method of the model.<br>

perp_tol: perplexity tolerance level for the "batch" learning method; used only when "evaluate_every" > 0.<br>

mean_change_tol: stopping tolerance for updating the document-topic (theta) distribution, which gives the latent topic probabilities, in the "Expectation" (E) step of the EM-algorithm.<br>

max_doc_update_iter: maximum number of iterations for updating the document-topic (theta) distribution in the "Expectation" (E) step of the EM-algorithm.<br>

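n_jobs: number of parallel jobs to use in the "Expectation" (E) step of the EM-algorithm; -1 uses all processors.<br>

verbose: sets the verbosity level of the LDA model (GridSearchCV() has its own, separate "verbose" argument).<br>

random_state: seed for the random number generator; pass an integer for reproducible results.  When set to "None", np.random is used.<br>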

</span>
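
Each key in the search dictionary has the form "step name, two underscores, parameter name", which is how the Pipeline forwards a value to the right class. The short sketch below illustrates that routing with "set_params" and "get_params"; the step names "vect" and "clf" are taken from the dictionary, while the specific values are only examples.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LatentDirichletAllocation()),
])

# "<step name>__<parameter>" is forwarded to the matching pipeline step.
pipeline.set_params(
    vect__min_df=2,
    clf__n_components=10,
    clf__learning_method="online",
)

vectorizer = pipeline.named_steps["vect"]
lda = pipeline.named_steps["clf"]
print(vectorizer.min_df, lda.n_components, lda.learning_method)

# Every key a grid may reference is listed by get_params().
searchable = sorted(
    key for key in pipeline.get_params() if key.startswith(("vect__", "clf__"))
)
print(len(searchable) > 0)
```

A misspelled key, such as one with a single underscore, makes "set_params" (and therefore the grid search) raise an error instead of being silently ignored.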


### Exhaustive grid search for Scikit-Learn LDA using subset of Twitter dataset:

<span style="font-family:Papyrus; font-size:1.25em;">

Here, we run the exhaustive grid search on a smaller subset of the full Twitter dataset to cut down the computational time required.  The full dataset contains more than 650k Tweets, so using all of it drastically increases the search time.<br>

The first parameter of the "dataframe_subset" function is the full dataset to subset, and the second parameter is the number of rows (examples) the subset should contain.<br>

</span>
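
For readers without the utility file at hand, a subsetting helper with the same two parameters might look like the sketch below. It is a hypothetical stand-in: the function name, the random sampling without replacement, the fixed seed, and the "tweet_text" column are all assumptions, and the actual "dataframe_subset" implementation may select rows differently.

```python
import pandas as pd


def dataframe_subset_sketch(dataframe: pd.DataFrame, sample_size: int, seed: int = 0) -> pd.DataFrame:
    """Return `sample_size` randomly chosen rows (hypothetical stand-in helper)."""
    # Never ask for more rows than the dataframe holds.
    sample_size = min(sample_size, len(dataframe))
    # Sample without replacement; a fixed seed makes the subset repeatable.
    return dataframe.sample(n=sample_size, random_state=seed).reset_index(drop=True)


# Toy dataframe standing in for the processed tweet dataset.
toy = pd.DataFrame({"tweet_text": [f"tweet number {i}" for i in range(100)]})
subset = dataframe_subset_sketch(toy, 10)
print(subset.shape)  # (10, 1)
```

Fixing the seed keeps repeated grid searches comparable, since every run then sees the same 10,000-row subset.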


In [None]:
data_subset = lda_util.dataframe_subset(tweet_dataset_processed, 10000)
lda_util.latent_dirichlet_allocation_grid_search(data_subset, lda_search_parameters)

<span style="font-family:Papyrus; font-size:1.25em;">

**TODO: Get this running on the Borg supercomputer using Singularity Container.**

</span>

## Resources Used:

<span style="font-family:Papyrus; font-size:1.25em;">

Refer to Scikit-Learn Documentation for further information on each class used:<br>

- https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html<br>

- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html<br>

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html<br>

- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html<br>

</span>

### Notes:

<span style="font-family:Papyrus; font-size:1.25em;">

See the following URL for the codebase containing the utility functions used in the grid searches above:<br>

https://github.com/J-Jinn/Summer-Research-2019/blob/master/topic_extraction_utility_functions.py<br>

</span>