# Topic Modeling Algorithms

### Keith VanderLinden and Joseph Jinn

This Jupyter Notebook provides a table of contents with a summarized overview of each baseline topic modeling algorithm.

## Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a generative probabilistic model that represents each document in a corpus as a mixture of topics and each topic as a distribution over words, using Dirichlet distributions as priors. The model performs well on long documents written in a formal, consistent grammatical style but generally performs poorly on short texts with inconsistent grammar. LDA serves as the baseline algorithm for topic extraction.
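
As a rough illustration of the generative view described above, the following sketch samples a toy document: it draws a per-document topic mixture from a Dirichlet prior, then draws a topic and a word for each position. The vocabulary, the hyperparameter values, and the helper names (`sample_dirichlet`, `generate_document`) are assumptions made for this example and are not taken from the linked notebooks.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document following LDA's generative story."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    topic_ids = list(range(len(topic_word)))
    words = []
    for _ in range(length):
        z = rng.choices(topic_ids, weights=theta)[0]  # topic for this position
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words


rng = random.Random(0)
vocab = ["cat", "dog", "pet", "stock", "bond", "market"]  # hypothetical toy vocabulary
num_topics = 2

# One word distribution per topic, each drawn from a symmetric Dirichlet prior.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]

theta, doc = generate_document(topic_word, vocab, [0.5] * num_topics, 8, rng)
print(theta)
print(doc)
```

Fitting LDA to real data runs this story in reverse: given only the observed words, inference estimates the topic mixtures and word distributions that most plausibly produced them.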

For further details, please refer to the following Jupyter Notebook files:

[LDA Introduction](algorithms/latent-dirichlet-allocation/slo-lda-introduction.ipynb#bookmark)

[LDA Implementation](algorithms/latent-dirichlet-allocation/slo-lda-implementation.ipynb#bookmark)

[LDA Grid Search](algorithms/latent-dirichlet-allocation/slo-lda-grid-search.ipynb#bookmark)

### Resources Referenced

Refer to the resources section in "LDA Introduction".

## Hierarchical Dirichlet Process

The Hierarchical Dirichlet Process (HDP) is a generative probabilistic model similar to LDA, except that the number of topics is not a hyperparameter. Instead, the set of topics is itself a random (latent) variable generated via Dirichlet processes, and there is no upper bound on the number of topics. The "hierarchical" part of the name refers to a global set of topics shared across the entire corpus, from which the local set of topics assigned to each document is drawn.
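
One common way to build intuition for why a Dirichlet process places no fixed limit on the number of topics is the stick-breaking construction. The sketch below is a simplified illustration of a single Dirichlet process, not a full HDP: it keeps breaking off pieces of a unit-length stick, one piece per topic, until the leftover mass is negligible. The concentration value, the stopping threshold, and the function name are assumptions chosen for this example.

```python
import random


def stick_breaking_weights(gamma, threshold, rng):
    """Break a unit-length stick until the remaining piece is at most `threshold`.

    Each piece is the weight of one topic. A larger `gamma` (concentration)
    tends to yield more, smaller pieces.
    """
    weights = []
    remaining = 1.0
    while remaining > threshold:
        fraction = rng.betavariate(1.0, gamma)  # share of the remaining stick
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


rng = random.Random(0)
weights, leftover = stick_breaking_weights(gamma=2.0, threshold=1e-3, rng=rng)
print(len(weights), "topic weights; leftover mass:", leftover)
```

The number of pieces is decided by the random draws rather than fixed in advance, which mirrors how HDP lets the data determine the number of topics. In practice the process is truncated, as the threshold does here.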

For further details, please refer to the following Jupyter Notebook files:

[HDP Introduction](algorithms/hierarchical-dirichlet-process/slo-hdp-introduction.ipynb#bookmark)

[HDP Implementation](algorithms/hierarchical-dirichlet-process/slo-hdp-implementation.ipynb#bookmark)

### Resources Referenced

- https://datascience.stackexchange.com/questions/128/latent-dirichlet-allocation-vs-hierarchical-dirichlet-process
    - explains HDP, HLDA, and LDA in a more layman-friendly way.


- https://stats.stackexchange.com/questions/135736/hierarchical-dirichlet-processes-in-topic-modeling
    - a more statistical explanation of HDP in topic modeling.


- https://www.quora.com/What-is-an-intuitive-explanation-of-Dirichlet-process-clustering-How-do-Polyas-Urn-or-Stick-Breaking-exemplify-the-Dirichlet-process-How-does-Gibbs-sampling-based-clustering-for-a-Dirichlet-mixture-model-utilize-the-Dirichlet-process
    - explanation of Dirichlet processes.


- https://people.eecs.berkeley.edu/~jordan/papers/hdp.pdf
    - original paper on HDP.

## Hierarchical Latent Dirichlet Allocation

Hierarchical Latent Dirichlet Allocation (HLDA) is an extension of the LDA algorithm. LDA is first run on the original corpus of documents, producing the usual document-topic and word-topic assignments. Then, "synthetic" documents are created from the word-topic assignments for each document-topic assignment and are grouped into "synthetic" corpora by topic. LDA is then run recursively on the "synthetic" corpora until no further "synthetic" documents and corpora can be generated. Each level of the recursion adds a layer of topic distributions, resulting in a tree-like hierarchy.
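
The grouping step can be sketched as follows, under one possible reading of the description above: every word in a document carries a topic assignment from a previous LDA run, the words of a document that share a topic form one "synthetic" document, and synthetic documents are collected into one corpus per topic. The function name, the toy documents, and the hard-coded assignments are hypothetical; the linked implementation notebook may build its synthetic documents differently.

```python
from collections import defaultdict


def build_synthetic_corpora(documents, assignments):
    """Group each document's words by assigned topic into per-topic corpora.

    documents:   list of token lists
    assignments: list of topic-id lists, aligned token by token with documents
    Returns {topic_id: [synthetic_document, ...]}.
    """
    corpora = defaultdict(list)
    for tokens, topics in zip(documents, assignments):
        by_topic = defaultdict(list)
        for token, topic in zip(tokens, topics):
            by_topic[topic].append(token)
        for topic, synthetic_doc in by_topic.items():
            corpora[topic].append(synthetic_doc)
    return dict(corpora)


documents = [["cat", "dog", "stock", "pet"], ["bond", "market", "stock"]]
assignments = [[0, 0, 1, 0], [1, 1, 1]]  # hypothetical output of an LDA run

corpora = build_synthetic_corpora(documents, assignments)
print(corpora)
```

Each per-topic corpus would then be passed to another LDA run, and the recursion stops once a corpus becomes too small to split further.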

For further details, please refer to the following Jupyter Notebook files:

[HLDA Introduction](algorithms/hierarchical-latent-dirichlet-allocation/slo-hlda-introduction.ipynb#bookmark)

[HLDA Implementation](algorithms/hierarchical-latent-dirichlet-allocation/slo-hlda-implementation.ipynb#bookmark)

### Resources Referenced

- https://www.aclweb.org/anthology/W14-3111
    - has a nice overview of the HLDA algorithm with an image.
    
    
- https://papers.nips.cc/paper/2466-hierarchical-topic-models-and-the-nested-chinese-restaurant-process.pdf
    - original paper by Blei.
    

## Author-Topic Model (a.k.a. LDA-u)

The author-topic model is an extension of the LDA algorithm that combines the author model with the LDA topic model. Whereas standard LDA generates a document-topic and a word-topic distribution, the author-topic model generates an author-topic and a word-topic distribution (no document-topic distribution is used). The result is a probabilistic model that assigns topics to authors.
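
Because topics are attached to authors rather than to individual documents, a typical preprocessing step is to group documents by author. The sketch below builds such a mapping with plain Python. The record layout, the author names, and the function name are assumptions for illustration; the same grouping could be done with the Pandas `.groupby()` function listed in the resources.

```python
from collections import defaultdict


def group_documents_by_author(records):
    """Map each author to the indices of the documents they wrote.

    records: list of (author, text) pairs; the position in the list
    serves as the document id.
    """
    author_to_docs = defaultdict(list)
    for doc_id, (author, _text) in enumerate(records):
        author_to_docs[author].append(doc_id)
    return dict(author_to_docs)


records = [  # hypothetical toy data
    ("alice", "first short text"),
    ("bob", "another short text"),
    ("alice", "a third short text"),
]

author_to_docs = group_documents_by_author(records)
print(author_to_docs)
```

An author-topic implementation can then pool the words of all documents mapped to an author when estimating that author's topic distribution.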

For further details, please refer to the following Jupyter Notebook files:

[Author-Topic Introduction](algorithms/author-topic/slo-author-topic-introduction.ipynb#bookmark)

[Author-Topic Implementation](algorithms/author-topic/slo-author-topic-implementation.ipynb#bookmark)

### Resources Referenced

- https://mimno.infosci.cornell.edu/info6150/readings/398.pdf
    - original paper on author-topic model.
    
    
- https://www.slideshare.net/FREEZ7/author-topic-model?from_action=save
    - nice slides explaining the author-topic model.


- https://stackoverflow.com/questions/47434426/pandas-groupby-unique-multiple-columns
    - using the Pandas `.groupby()` function.


- https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
    - blog post on using the Pandas `.groupby()` function.
    

## Biterm Model

Biterms are unordered pairs of words that co-occur within a text. The biterm model addresses the data sparsity problem of short texts by modeling biterms across the entire corpus of documents instead of individual words within each document. Each biterm is associated with a single topic, whereas in LDA the words of a document can each be assigned a different topic. The model infers the topics of each document from the topics its biterms are associated with. In principle, topic extraction is easier with this model because word co-occurrence supplies context that individual words lack.
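
To make the notion of a biterm concrete, the sketch below extracts every unordered word pair from each tokenized text and counts the pairs across a toy corpus. It treats the whole text as the co-occurrence window, which is a simplifying assumption suited to very short texts; the function name and the example tokens are also made up for illustration.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return all unordered word pairs in a tokenized text.

    Each pair is sorted so that (a, b) and (b, a) count as the same biterm.
    """
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


corpus = [["cat", "dog", "pet"], ["dog", "cat"]]  # hypothetical toy corpus

# Biterms are pooled over the whole corpus rather than kept per document.
biterm_counts = Counter(
    biterm for tokens in corpus for biterm in extract_biterms(tokens)
)
print(biterm_counts)
```

Pooling the counts over the corpus is what counters sparsity: a single short text yields only a few biterms, but the corpus-wide collection is large enough to estimate topics from.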

For further details, please refer to the following Jupyter Notebook files:


[Biterm Introduction](algorithms/biterm/slo-biterm-introduction.ipynb#bookmark)

[Biterm Implementation](algorithms/biterm/slo-biterm-implementation.ipynb#bookmark)

### Resources Referenced

- https://www.researchgate.net/publication/262244963_A_biterm_topic_model_for_short_texts
    - original paper on Biterm model.


- https://sutheeblog.wordpress.com/2017/03/20/a-biterm-topic-model-for-short-texts/
    - blog explaining the biterm topic model in a more palatable way.


- https://www.cs.toronto.edu/~jstolee/projects/topic.pdf
    - contains a short section explaining the algorithm; includes plate notation diagram.
    

- https://stackoverflow.com/questions/29786985/whats-the-disadvantage-of-lda-for-short-texts
    - contains a response by the author of the biterm model.