## Comparing topic modelling techniques

Topic modelling approaches differ in their advantages and disadvantages. No technique is 'best' in absolute terms; the right choice depends largely on the nuances of the text data being analysed. To our knowledge, PFD data has not yet been analysed using NLP or topic modelling techniques, so there is no existing literature on the optimal approach(es).

This notebook compares the suitability of five topic modelling techniques for PFD 'concerns' data.<br><br><br>


1. **Latent Dirichlet Allocation (LDA)**

LDA is perhaps the most popular topic modelling technique. It is a probabilistic method that assumes each document is a mixture of topics, which likely suits PFD reports, as these frequently contain multiple concerns. Each topic is in turn characterised as a 'mixture of words': the model generates a topic distribution for each document and a word distribution for each topic.

LDA *does* require that we pre-define our number of topics.

LDA places Dirichlet priors on both the distribution of topics within documents and the distribution of words within topics, which gives the model an explicit statistical (generative) foundation.<br><br><br>


2. **Correlated Topic Model (CTM)**

CTM is an extension of LDA that allows for correlations between topics. Like LDA, it can produce keyword lists that are hard to interpret, but its distinctive contribution is a covariance structure that models how topics co-occur. This is particularly interesting for our PFD data, where many reports raise multiple concerns and therefore span multiple topics.
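To make the covariance structure more concrete, the sketch below simulates topic proportions the way CTM's generative model does: a vector is drawn from a multivariate normal and then mapped onto proportions with a softmax (a logistic-normal prior, in place of LDA's Dirichlet). It covers the generative side only, with no model fitting, and the mean vector and covariance matrix are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for three topics:
# topics 0 and 1 tend to co-occur, topic 2 is unrelated to both
mu = np.zeros(3)
sigma = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Step 1: draw an unconstrained vector per document from N(mu, sigma)
eta = rng.multivariate_normal(mu, sigma, size=5000)

# Step 2: softmax turns each vector into topic proportions that sum to 1
exp_eta = np.exp(eta - eta.max(axis=1, keepdims=True))
theta = exp_eta / exp_eta.sum(axis=1, keepdims=True)

# The correlation is defined on eta (before the softmax);
# the sample estimate should recover the structure of sigma
corr = np.corrcoef(eta, rowvar=False)
print(corr.round(2))
```

A Dirichlet prior has no equivalent of `sigma`, which is why plain LDA cannot express that two topics tend to appear together in the same report.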

CTM *does* require us to pre-define our number of topics.<br><br><br>


3. **Non-negative Matrix Factorisation (NMF)**

NMF is a matrix factorisation technique that decomposes the document-term matrix into two lower-dimensional matrices. Topics are characterised by non-negative components in the factorised matrices, representing the importance of words in topics and topics in documents. Similarly to LDA, it assumes that documents contain multiple topics.

NMF *does* require that we pre-define our number of topics.

NMF constrains both factor matrices to be non-negative, so each topic is built up additively from its words. Many practitioners report that the resulting topic keywords are therefore more interpretable than LDA's, with less 'noise' in the keyword lists.<br><br><br>


4. **Top2Vec**

Topics in Top2Vec are characterised by dense clusters of document and word embeddings. These clusters are identified in a joint embedding space, where both documents and words are represented. It does allow for multiple topics per document; this is achieved through the proximity of document embeddings to multiple topic vectors in the semantic space.

Top2Vec does *not* require us to pre-define our number of topics.
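The idea of relating one document to several topics through proximity in the embedding space can be sketched with plain NumPy. This is not the Top2Vec library's API: the two-dimensional vectors are hand-made placeholders (real embeddings have hundreds of dimensions) and the helper name is hypothetical.

```python
import numpy as np

# Hypothetical topic vectors in a shared embedding space
topic_vectors = np.array([
    [1.0, 0.0],   # topic 0
    [0.0, 1.0],   # topic 1
    [-1.0, 0.0],  # topic 2
])

# Hypothetical document vectors in the same space
doc_vectors = np.array([
    [0.9, 0.1],   # close to topic 0
    [0.5, 0.8],   # between topics 0 and 1, nearer to topic 1
    [-0.9, 0.2],  # close to topic 2
])

def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

# One row per document, one column per topic
sims = cosine_similarity(doc_vectors, topic_vectors)

# The two nearest topics for each document, most similar first
top2 = np.argsort(-sims, axis=1)[:, :2]
print(top2)
```

Here the second document sits between two topic vectors, so both appear in its ranking with similar scores, which is the sense in which a document can carry more than one topic.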

Top2Vec uses deep learning-based embeddings (e.g., Doc2Vec, Universal Sentence Encoder) to capture the semantic relationships in the text. Topics are then discovered from the natural clustering of similar documents and words, which tends to give a more intuitive, data-driven set of topics.<br><br><br>


5. **BERTopic**

BERTopic uses BERT-based embeddings and clustering algorithms to discover topics. Topics are characterised by dense clusters of semantically similar embeddings, identified through dimensionality reduction followed by clustering. Per-document topic distributions were not originally supported, but v0.13 (January 2023) allows us to approximate one for each report via `.approximate_distribution`.
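A minimal sketch of how this might look is given below, assuming `bertopic` v0.13 or later is installed and that `docs` is a list of report texts large enough for clustering (a handful of documents will not work with the default settings). The wrapper function and its name are hypothetical; only `fit_transform` and `approximate_distribution` come from the library.

```python
def report_topic_distributions(docs):
    """Fit BERTopic on docs and approximate a topic distribution per document."""
    # Imported inside the function because bertopic is a heavy dependency
    # that downloads an embedding model the first time it is used
    from bertopic import BERTopic

    # Default pipeline; the number of topics is not specified in advance
    topic_model = BERTopic()

    # A single topic label per document (-1 marks outliers)
    topics, _ = topic_model.fit_transform(docs)

    # Approximate distribution over topics: one row per document
    topic_distr, _ = topic_model.approximate_distribution(docs)
    return topics, topic_distr
```

Calling `report_topic_distributions(docs)` would return both the single-label assignment and the approximate distribution, so the two views of each report can be compared side by side.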

BERTopic does *not* require us to pre-define our number of topics.<br><br>