# Multiscale Topic Manifold Learning (MSTML)
### Topic Drift Mapping Examples

This notebook demonstrates the core results presented in "A Multiscale Geometric Method for Capturing Relational Topic Alignment" (to be published in 2025) and "Multiscale Topic Manifold Learning for Understanding Interdisciplinarity in Co-Authored Articles" (to appear on arXiv in 2025).

The MSTML framework, developed in these two papers, extends a prior information-geometric method for longitudinal topic modeling, which appeared in the Harvard Data Science Review in 2021 as "A Geometry-Driven Longitudinal Topic Model" (GDLTM). MSTML keeps GDLTM's geometric topic manifold learning approach but extends it to multimodal data that includes co-author networks. MSTML can therefore be thought of as a relational topic model (RTM) that is also designed with a temporal-modeling component in mind. To accomplish this, the Hellinger-PHATE manifold learning approach of GDLTM is augmented with agglomerative hierarchical clustering of topics using Ward's linkage. The learned dendrogram represents meta-topics, which are combinations of "chunk topics" learned by individual topic models applied to temporal chunks of the input document corpus.
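The Hellinger distance and Ward's linkage steps mentioned above can be illustrated with a minimal sketch. This is not the MSTML implementation: the toy `chunk_topics` matrix and the `hellinger` helper are hypothetical, the PHATE step is skipped entirely, and clustering is run directly on pairwise Hellinger distances using SciPy. Ward's method assumes Euclidean distances, which holds here because the Hellinger distance is the Euclidean distance between square-root-transformed distributions, scaled by 1/sqrt(2).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


# Hypothetical chunk topics: each row is a topic-word distribution
# over a 4-term vocabulary (rows sum to 1).
chunk_topics = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.65, 0.25, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
    [0.05, 0.05, 0.25, 0.65],
])

# Condensed vector of pairwise Hellinger distances between chunk topics.
dists = pdist(chunk_topics, metric=hellinger)

# Agglomerative clustering with Ward's linkage; Z encodes the dendrogram.
Z = linkage(dists, method="ward")

# Cut the dendrogram into two meta-topic clusters (an arbitrary choice here).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

In this toy example the first two chunk topics and the last two chunk topics fall into separate meta-topic clusters, since each pair places most of its mass on the same terms.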

Hierarchical clustering of the chunk topics is not in itself a novel contribution, since GDLTM also used it to color meta-topic clusters.

However, MSTML contributes several novel enhancements that substantially affect the model's final results and outputs:

1. The hierarchical clustering dendrogram is parameterized by link probabilities p_m. Each internal node m of the dendrogram is associated with a probability p_m, a maximum likelihood estimate of the "latent probability" that two authors would collaborate when node m is their nearest common ancestor, given that each author is treated as a distribution over the chunk-topic nodes (the leaf level). This is inspired by Hierarchical Random Graphs (HRG), originally developed in (Clauset, 2008), which shows how maximum likelihood estimates can be used to predict links in arbitrary networks. However, while HRG uses network topology alone, MSTML combines inferred author-topic distributions with the network structure, yielding a multimodal approach to inferring the link probabilities p_m.
2. Ward's linkage is used instead of "single" linkage (default) or "average" linkage (used in GDLTM). Ward's linkage respects the smooth manifold of the information-geometric approach by minimizing within-cluster variance at each merge step, which tends to produce more balanced clusters and adapts to the local density of the topic space. Single linkage can produce chaining effects, tying together transitional topics through single high-similarity pairs. Average linkage can over-smooth and ignore niche topics. Preserving niche topics is a theme of MSTML, which targets topic space imputation, including both interpolation between existing topics and extrapolation based on temporal trajectories. Niche topics are particularly interesting in academic research documents, where they are thought to be both novel and interdisciplinary. Ward's linkage can be replaced by other linkage methods as desired, depending on the application.
3. GDLTM helped develop the idea of training an ensemble of topic models across time and then tying those topics back together with a geometry-informed approach: a proper distance for probability distributions (Hellinger) and a diffusion-based manifold learning technique (PHATE). This ensemble-of-models (EOM) approach lends itself to model-based pre-filtering of the vocabulary. MSTML uses this idea, inspired by LDAvis (Sievert, 2014): we train a global LDA model on the entire corpus and then use term relevance scoring to rank and filter vocabulary terms. The goal is to identify the terms most relevant for distinguishing topics, which is different from simply taking the most frequent terms. The term relevance score is a mixture of frequency and relative frequency scores. The result of this pre-filtering is a much smaller vocabulary that still retains terms that may be rare yet relevant. Again, this is oriented toward discovering niche, emerging topics in evolving document corpora.
4. The MSTML papers contribute the first comparisons of language-model topic representations (as in BERTopic) with traditional distributional topic vectors (as in LDA) using ensembles of models over time. The representation spaces of the learned topic manifolds are contrasted, along with other metrics such as topic coherence and diversity. As highlighted in the MSTML papers, the relational properties of LDA turn out to provide temporal structure that is missing in language-model representations, which is advantageous when dealing with longitudinal corpora.
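The term relevance pre-filtering described in item 3 can be sketched as follows. This is a minimal illustration and not the MSTML code: it assumes the LDAvis relevance definition, relevance(w, t) = λ log p(w|t) + (1 − λ) log(p(w|t) / p(w)), and the function names (`term_relevance`, `filter_vocabulary`), the toy matrices, and the choices of `lam` and `top_n` are all hypothetical.

```python
import numpy as np


def term_relevance(topic_word, term_freq, lam=0.6):
    """LDAvis-style relevance for every (topic, term) pair.

    topic_word: (n_topics, n_terms) array of p(term | topic).
    term_freq:  (n_terms,) array of corpus-wide term counts.
    lam:        weight on log-frequency versus log-lift (illustrative value).
    """
    eps = 1e-12  # guards against log(0)
    p_w = term_freq / term_freq.sum()
    log_ptw = np.log(topic_word + eps)
    log_lift = log_ptw - np.log(p_w + eps)
    return lam * log_ptw + (1.0 - lam) * log_lift


def filter_vocabulary(topic_word, term_freq, top_n, lam=0.6):
    """Return sorted indices of terms ranked in the top_n of any topic."""
    rel = term_relevance(topic_word, term_freq, lam)
    top_idx = np.argsort(-rel, axis=1)[:, :top_n]
    return np.unique(top_idx)


# Hypothetical global model: 2 topics over a 5-term vocabulary.
# Term 0 is frequent in both topics; term 4 occurs only in topic 1.
topic_word = np.array([
    [0.50, 0.30, 0.15, 0.05, 0.00],
    [0.50, 0.05, 0.05, 0.10, 0.30],
])
term_freq = np.array([1000.0, 350.0, 200.0, 150.0, 300.0])

kept = filter_vocabulary(topic_word, term_freq, top_n=2)
print(kept)
```

Lowering `lam` puts more weight on lift, so a term like term 4, which is less frequent overall but exclusive to one topic, ranks above a term that is common to every topic. Setting `lam=1.0` reduces the score to plain within-topic frequency.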

In [None]:
from mstml.core import MstmlOrchestrator