# A Geometry-Driven Longitudinal Topic Model (GDLTM)

This notebook demonstrates a basic call flow for longitudinal topic modeling, based on the paper "A Geometry-Driven Longitudinal Topic Model" by Wang, Hougen, et al. (2021), published in Harvard Data Science Review (HDSR).

GDLTM is the topic modeling framework that feeds into Multiscale Topic Manifold Learning (MSTML). MSTML extends GDLTM to problems in relational topic modeling (RTM), which involve relational (network) data in conjunction with text data.

GDLTM advanced the temporal topic modeling field by demonstrating the advantages of Hellinger-PHATE manifold learning in conjunction with interpretable, probabilistic models such as the popular Latent Dirichlet Allocation (LDA). LDA uses a bag-of-words model for documents and treats each document as a mixture of topics. Using Bayesian inference, LDA infers each latent topic as a multinomial distribution over words and, simultaneously, represents each document as a multinomial distribution over the latent topics.

GDLTM first divides a text corpus into chunks by time (e.g. publication date, though any relevant date could be used), and then aligns topics across those chunks. The topic alignment over time is visualized using Hellinger-PHATE manifolds. Hellinger-PHATE manifolds over LDA topics were found to be particularly effective at visualizing complex temporal changes within document corpora while remaining computationally simple compared to prior probabilistic methods.

The resulting visualizations map topics into an embedding space that can be interpreted directly: individual topics translate into word clouds, and topic trajectories are learned using standard shortest-path algorithms. More details of the method are described in the HDSR 2021 paper.
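As a rough illustration of the "Hellinger" half of Hellinger-PHATE, the sketch below computes pairwise Hellinger distances between a few toy topic-word distributions. The topic vectors and the `hellinger` helper are made up for this example and are not taken from the `mstml` package; a distance matrix of this kind is the sort of input a PHATE embedding step would then work from.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same vocabulary."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)

# Toy topic-word distributions over a 4-word vocabulary (illustrative values only).
topic_a = [0.7, 0.2, 0.1, 0.0]
topic_b = [0.6, 0.3, 0.1, 0.0]
topic_c = [0.0, 0.0, 0.1, 0.9]
topics = [topic_a, topic_b, topic_c]

# Symmetric pairwise distance matrix; entries lie in [0, 1].
distances = [[hellinger(p, q) for q in topics] for p in topics]

for row in distances:
    print(["%.3f" % d for d in row])
```

The distance is 0 for identical distributions and 1 for distributions with disjoint support, so similar topics (here `topic_a` and `topic_b`) end up much closer together than dissimilar ones.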

In [1]:
from mstml.core import MstmlOrchestrator

In [2]:
# Set dataset name (matches directory name within data/)
dataset_name = "arxiv"

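# Keep only documents tagged with at least one of these arXiv categories
data_filters = {
    'categories':
        ['stat.AP',
         'stat.CO',
         'stat.ME',
         'stat.OT',
         'stat.TH',
         'cs.LG']
}

# Map arXiv metadata field names (keys) to the column names the pipeline expects (values)
arxiv_schema_map = {'abstract': 'raw_text',
                    'update_date': 'date',
                    'authors_parsed': 'authors'}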
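To make the role of the schema map concrete, here is a minimal sketch of how such a mapping could be applied to a single raw record. The `apply_schema_map` helper and the sample record are hypothetical and only illustrate the renaming idea; how `load_raw_data` actually uses `input_schema_map` internally is not shown in this notebook.

```python
def apply_schema_map(record, schema_map):
    """Return a copy of `record` with keys renamed according to `schema_map`.

    Keys that do not appear in the map are kept unchanged.
    """
    return {schema_map.get(key, key): value for key, value in record.items()}

# Same mapping as in the cell above.
arxiv_schema_map = {'abstract': 'raw_text',
                    'update_date': 'date',
                    'authors_parsed': 'authors'}

# Hypothetical raw record with made-up values.
raw_record = {
    'title': 'An example title',
    'abstract': 'An example abstract.',
    'update_date': '2017-12-29',
    'authors_parsed': [['Doe', 'Jane', '']],
}

mapped_record = apply_schema_map(raw_record, arxiv_schema_map)
print(sorted(mapped_record))
```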

In [3]:
orch = MstmlOrchestrator(dataset_name=dataset_name)

orch.configure_data_filters(
    date_range={'start': '2013-01-01', 'end': '2017-12-31'},
    categories=data_filters['categories'],
)


2025-08-04 21:45:50,377 - MSTML - INFO - Experiment directory: C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\experiments\mstml_08042025_2145
2025-08-04 21:45:50,378 - MSTML - INFO - Default components initialized: hellinger distance, PHATE embedding
2025-08-04 21:45:50,378 - MSTML - INFO - MstmlOrchestrator initialized
2025-08-04 21:45:50,381 - MSTML - INFO - Data filters configured


<mstml.core.MstmlOrchestrator at 0x1d45088dae0>

In [4]:
orch.load_raw_data(input_schema_map=arxiv_schema_map)

2025-08-04 21:45:50,406 - MSTML - INFO - Loading raw data from C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\data\arxiv/original/ (no filters)
2025-08-04 21:45:50,407 - MSTML - INFO - Discovered 1 supported file(s) in C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\data\arxiv\original
2025-08-04 21:45:50,408 - MSTML - INFO - Auto-discovered 1 file(s): ['arxiv-metadata-oai-snapshot.json']
C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\data\arxiv\original already exists. Use `overwrite=True` to recreate it.
C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\data\arxiv\clean already exists. Use `overwrite=True` to recreate it.
C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\data\arxiv\networks already exists. Use `overwrite=True` to recreate it.


Initialized ID pool with 9000000 IDs (7 digits)
Setting up data directory for arxiv at C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\data...
C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\data\arxiv\original already exists. Use `overwrite=True` to recreate it.
C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\data\arxiv\clean already exists. Use `overwrite=True` to recreate it.
C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\data\arxiv\networks already exists. Use `overwrite=True` to recreate it.
Found file in dataset original directory: C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\data\arxiv\original\arxiv-metadata-oai-snapshot.json
Loaded 2776569 entries from C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\data\arxiv\original\arxiv-metadata-oai-snapshot.json
Total loaded: 2776569 entries from 1 file(s)
Sample entry keys: ['id', 'submitter', 'authors', 'title', '

2025-08-04 21:47:31,650 - MSTML - INFO - Loaded 2776569 raw documents (no filters, no author disambiguation)


<mstml.core.MstmlOrchestrator at 0x1d45088dae0>

In [5]:
print(f"Sample categories: {orch.documents_df['categories'].iloc[0]}")

Sample categories: ['math.HO']


In [6]:
orch.data_loader.df.iloc[1]['categories']

['math.CO', 'cs.CG']

In [7]:
orch.apply_data_filters()

2025-08-04 21:47:31,800 - MSTML - INFO - Applying filters to 2776569 documents
2025-08-04 21:47:31,801 - MSTML - INFO - Using pending filters: ['date_range', 'categories']
2025-08-04 21:47:31,985 - MSTML - INFO - Applying filters: ['date_range', 'categories']
2025-08-04 21:47:31,986 - MSTML - INFO - Applying filter 'date_range' with config: {'start': '2013-01-01', 'end': '2017-12-31'}


   Date filter: >= 2013-01-01
   Date filter: <= 2017-12-31


2025-08-04 21:47:32,665 - MSTML - INFO - Filter 'date_range': 628689/2776569 documents retained (2147880 removed)
2025-08-04 21:47:32,666 - MSTML - INFO - Applying filter 'categories' with config: ['stat.AP', 'stat.CO', 'stat.ME', 'stat.OT', 'stat.TH', 'cs.LG']


   Categories filter: looking for {'stat.OT', 'stat.TH', 'stat.AP', 'stat.CO', 'stat.ME', 'cs.LG'} in column 'categories'
   Sample category values: [['math.NA'], ['quant-ph', 'physics.optics'], ['math.MG', 'math.CA', 'math.DG'], ['cond-mat.stat-mech', 'cs.SI', 'physics.soc-ph'], ['math-ph', 'cond-mat.stat-mech', 'math.MP']]
   Sample category types: ['list', 'list', 'list', 'list', 'list']


2025-08-04 21:47:33,630 - MSTML - INFO - Filter 'categories': 26452/628689 documents retained (602237 removed)
2025-08-04 21:47:33,764 - MSTML - INFO - Total filtering result: 26452/2776569 documents retained (2750117 removed)


   Categories filter applied: 26452/628689 rows retained


<mstml.core.MstmlOrchestrator at 0x1d45088dae0>
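The log above suggests two filters applied in sequence: an inclusive date range, then a category filter that retains a document when any of its listed categories is in the wanted set. The sketch below reproduces that logic on three made-up documents using plain Python. The helper names are hypothetical and this is not the `mstml` implementation.

```python
from datetime import date

def in_date_range(doc_date, start, end):
    """Inclusive on both ends, matching the '>=' and '<=' lines in the log."""
    return start <= doc_date <= end

def has_any_category(doc_categories, wanted):
    """True when the document lists at least one wanted category."""
    return not wanted.isdisjoint(doc_categories)

# Made-up documents for illustration.
docs = [
    {"title": "A", "date": date(2012, 6, 1), "categories": ["stat.ME"]},
    {"title": "B", "date": date(2015, 3, 9), "categories": ["math.NA"]},
    {"title": "C", "date": date(2017, 12, 29), "categories": ["cs.SY", "cs.LG"]},
]

wanted = {"stat.AP", "stat.CO", "stat.ME", "stat.OT", "stat.TH", "cs.LG"}
start, end = date(2013, 1, 1), date(2017, 12, 31)

# Apply the filters in the same order as the log: date range first, then categories.
after_date = [d for d in docs if in_date_range(d["date"], start, end)]
retained = [d for d in after_date if has_any_category(d["categories"], wanted)]

print([d["title"] for d in retained])
```

Document A is dropped by the date filter even though its category matches, and document B is dropped by the category filter, which mirrors the two-stage retained counts reported in the log.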

In [8]:
len(orch.documents_df)

26452

In [9]:
orch.documents_df.tail()

Unnamed: 0,title,date,raw_text,author_names,author_ids,preprocessed_text,categories
791227,Gap Safe screening rules for sparsity enforcin...,2017-12-29,"In high dimensional regression settings, spars...","[EUGENE, NDIAYE, OLIVIER, FERCOQ, ALEXANDRE, G...",,,"[stat.ML, cs.LG, math.OC, stat.CO]"
927586,Learning to Run with Actor-Critic Ensemble,2017-12-29,We introduce an Actor-Critic Ensemble(ACE) met...,"[ZHEWEI, HUANG, SHUCHANG, ZHOU, BOER, ZHUANG, ...",,,[cs.LG]
927931,Predicting protein-protein interactions based ...,2017-12-29,Protein-Protein Interactions (PPIs) perform es...,"[SAMANEH, AGHAJANBAGLO, SOBHAN, MOOSAVI, MASEU...",,,"[q-bio.QM, cs.LG]"
872374,An Online Learning Approach to Buying and Sell...,2017-12-29,"We adopt the perspective of an aggregator, whi...","[KIA, KHEZELI, EILYAN, BITAR]",,,"[cs.SY, cs.LG]"
928598,Parallel Active Subspace Decomposition for Sca...,2017-12-29,Tensor robust principal component analysis (TR...,"[JONATHAN Q., JIANG, MICHAEL K., NG]",,,"[cs.NA, cs.LG, math.NA]"


In [10]:
orch.preprocess_text()

2025-08-04 21:47:33,958 - MSTML - INFO - Starting comprehensive text preprocessing pipeline


Starting text preprocessing pipeline
Tokenizing documents
Applying lemmatization using multiprocessing
Building initial vocabulary dictionary
Applying frequency filtering (low_thresh=1, high_frac=0.995)
Debug: Initial vocabulary size: 38379
Debug: Tokens to cut: 18656
Debug: Approved tokens: 19330
Debug: Documents after filtering - Total: 26452, Non-empty: 26452
Vocab length: 19330
Training LDA model with 50 topics and 1 passes - this may take a moment...
Training LDA with 26452 non-empty documents, vocabulary size: 19330
LDA model training completed
Computing term relevancy scores (λ=0.6, top_n=2000)
Applying relevancy-based vocabulary reduction
Performing final cleanup and dictionary rebuild


2025-08-04 21:48:24,571 - MSTML - INFO - Text preprocessing complete. Initial vocab: 38379, After filtering: 19330, Final vocab: 9579
2025-08-04 21:48:24,575 - MSTML - INFO - Saved vocabulary dictionary with 9579 terms to id2word.pkl
2025-08-04 21:48:24,615 - MSTML - INFO - Copied 5 essential files to experiment directory: ['main_df.pkl', 'id2word.pkl', 'author_to_authorId.pkl', 'authorId_to_author.pkl', 'authorId_to_df_row.pkl']


Text preprocessing completed successfully for 26452 documents
Data successfully written to 'C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\data\arxiv\clean\id2word.pkl'.
Data successfully written to 'C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\data\arxiv\clean\main_df.pkl'.


2025-08-04 21:48:28,112 - MSTML - INFO - Updated main_df.pkl in both clean directory and experiment directory (26452 rows)


Data successfully written to 'C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\experiments\mstml_08042025_2145\main_df.pkl'.


<mstml.core.MstmlOrchestrator at 0x1d45088dae0>
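The preprocessing log reports a frequency-filtering step with `low_thresh=1` and `high_frac=0.995`. The exact semantics of those parameters are not shown in this notebook, so the sketch below makes an assumption: a token is cut if it appears in `low_thresh` or fewer documents, or in more than a `high_frac` fraction of documents. The `filter_vocabulary` helper is hypothetical.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, low_thresh=1, high_frac=0.995):
    """Drop very rare and near-ubiquitous tokens based on document frequency.

    Assumed rule (not confirmed by the source): keep a token only if it occurs
    in more than `low_thresh` documents and in at most `high_frac` of all documents.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document

    approved = {
        token for token, df in doc_freq.items()
        if df > low_thresh and df / n_docs <= high_frac
    }
    filtered = [[t for t in tokens if t in approved] for tokens in tokenized_docs]
    return approved, filtered

# Made-up tokenized documents.
toy_docs = [
    ["model", "topic", "rare"],
    ["model", "topic"],
    ["model", "lda", "topic"],
    ["model", "lda"],
]

approved, filtered = filter_vocabulary(toy_docs)
print(sorted(approved))
print(filtered)
```

In this toy corpus "rare" is cut because it occurs in a single document and "model" is cut because it occurs in every document, leaving a vocabulary of "lda" and "topic".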

In [11]:
orch.apply_author_disambiguation()

2025-08-04 21:48:33,257 - MSTML - INFO - Applying author disambiguation to 26452 documents


Initialized ID pool with 9000000 IDs (7 digits)


Skipping document at index 1155 with 21 authors (max: 20)
Skipping document at index 1600 with 28 authors (max: 20)
Skipping document at index 5902 with 67 authors (max: 20)
Skipping document at index 13336 with 40 authors (max: 20)
Skipping document at index 14201 with 113 authors (max: 20)
Skipping document at index 16179 with 22 authors (max: 20)
Skipping document at index 16452 with 61 authors (max: 20)
Skipping document at index 16918 with 31 authors (max: 20)
Skipping document at index 20699 with 75 authors (max: 20)
Skipping document at index 23231 with 25 authors (max: 20)
Skipping document at index 25732 with 22 authors (max: 20)


Dataframe sync: no changes needed
Number of authors prior to disambiguation: 37270
Pool has 9000000 available IDs for estimated need of 29816
Number of unique prefixes: 455
Group 0: matrix shape (2, 25), matches 2
Group 1: matrix shape (1, 12), matches 1
Group 2: matrix shape (6, 55), matches 6
Group 3: matrix shape (3, 27), matches 3
Group 4: matrix 

  matches = awesome_cossim_topn(


Group 74: matrix shape (1, 7), matches 1
Group 75: matrix shape (215, 1287), matches 215
Group 76: matrix shape (31, 274), matches 31
Group 77: matrix shape (221, 1317), matches 229
Group 78: matrix shape (10, 119), matches 10
Group 79: matrix shape (43, 312), matches 43
Group 80: matrix shape (214, 1164), matches 216
Group 81: matrix shape (1, 8), matches 1
Group 82: matrix shape (24, 233), matches 24
Group 83: matrix shape (43, 318), matches 43
Group 84: matrix shape (2, 26), matches 2
Group 85: matrix shape (1, 17), matches 1
Group 86: matrix shape (7, 57), matches 7
Group 87: matrix shape (2, 25), matches 2
Group 88: matrix shape (1, 7), matches 1
Group 89: matrix shape (1, 10), matches 1
Group 90: matrix shape (2, 21), matches 2
Group 91: matrix shape (68, 487), matches 70
Group 92: matrix shape (7, 67), matches 7
Group 93: matrix shape (5, 71), matches 5
Group 94: matrix shape (4, 42), matches 4
Group 95: matrix shape (128, 800), matches 134
Group 96: matrix shape (4, 49), matche

2025-08-04 21:48:36,663 - MSTML - INFO - Author disambiguation completed on 26452 documents


Data successfully written to 'C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\data\arxiv\clean\main_df.pkl'.


2025-08-04 21:48:40,132 - MSTML - INFO - Updated main_df.pkl in both clean directory and experiment directory (26452 rows)


Data successfully written to 'C:\Users\conra\Documents\GitHub\Multi-Scale-Topic-Manifold-Learning\experiments\mstml_08042025_2145\main_df.pkl'.


<mstml.core.MstmlOrchestrator at 0x1d45088dae0>
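The disambiguation log mentions name prefixes, per-group matrices, and a cosine-similarity top-n call, which suggests that names are blocked into groups by prefix and then compared within each group. The sketch below shows that general idea using character trigram counts and a plain cosine similarity. The prefix length, the trigram features, the threshold, and all helper names are assumptions made for illustration, not the pipeline's actual settings.

```python
import math
from collections import Counter, defaultdict

def trigrams(name):
    """Character trigram counts of a lowercased, padded name."""
    padded = f"  {name.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def candidate_matches(names, prefix_len=2, threshold=0.8):
    """Group names by prefix, then report similar pairs within each group."""
    groups = defaultdict(list)
    for name in names:
        groups[name[:prefix_len].lower()].append(name)

    pairs = []
    for group in groups.values():
        vectors = {n: trigrams(n) for n in group}
        for i, first in enumerate(group):
            for second in group[i + 1:]:
                score = cosine(vectors[first], vectors[second])
                if score >= threshold:
                    pairs.append((first, second, score))
    return pairs

# Made-up name strings in "LAST, FIRST" form.
names = ["NG, MICHAEL K.", "HUANG, ZHEWEI", "Ng, Michael K.", "HUANG, SHUCHANG", "NDIAYE, EUGENE"]
pairs = candidate_matches(names)
print(pairs)
```

Blocking by prefix keeps the number of pairwise comparisons manageable: only names that share a prefix are ever compared, so the two HUANG entries are scored against each other (and fall below the threshold) while names in different groups are never compared.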

In [13]:
orch.documents_df.tail()

Unnamed: 0,title,date,raw_text,author_names,author_ids,preprocessed_text,categories,text_processed
791227,Gap Safe screening rules for sparsity enforcin...,2017-12-29,"In high dimensional regression settings, spars...","[EUGENE, NDIAYE, OLIVIER, FERCOQ, ALEXANDRE, G...","[6713798, 4082153, 2528862, 6414939]","[high, dimensional, regression, setting, spars...","[stat.ML, cs.LG, math.OC, stat.CO]","[in, high, dimensional, regression, settings, ..."
927586,Learning to Run with Actor-Critic Ensemble,2017-12-29,We introduce an Actor-Critic Ensemble(ACE) met...,"[ZHEWEI, HUANG, SHUCHANG, ZHOU, BOER, ZHUANG, ...","[5279200, 3233547, 5353604, 1745488]","[introduce, actor, ensemble, ace, method, impr...",[cs.LG],"[we, introduce, an, actor, critic, ensemble, a..."
927931,Predicting protein-protein interactions based ...,2017-12-29,Protein-Protein Interactions (PPIs) perform es...,"[SAMANEH, AGHAJANBAGLO, SOBHAN, MOOSAVI, MASEU...","[6927599, 1631281, 9503693, 3138953]","[protein, protein, interaction, perform, essen...","[q-bio.QM, cs.LG]","[protein, protein, interactions, ppis, perform..."
872374,An Online Learning Approach to Buying and Sell...,2017-12-29,"We adopt the perspective of an aggregator, whi...","[KIA, KHEZELI, EILYAN, BITAR]","[4823052, 3571696]","[adopt, perspective, seek, coordinate, purchas...","[cs.SY, cs.LG]","[we, adopt, the, perspective, of, an, aggregat..."
928598,Parallel Active Subspace Decomposition for Sca...,2017-12-29,Tensor robust principal component analysis (TR...,"[JONATHAN Q., JIANG, MICHAEL K., NG]","[7026759, 6165296]","[tensor, robust, principal, component, analysi...","[cs.NA, cs.LG, math.NA]","[tensor, robust, principal, component, analysi..."


In [None]:
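# Deliberate stop so that "Run All" halts here; the cells below have not been executed in this notebook
raise Exception("Stop here")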

In [None]:
orch.setup_coauthor_network()

In [None]:
orch.create_temporal_chunks(months_per_chunk=1)
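As a rough picture of what dividing a corpus into chunks by time can look like, the sketch below groups made-up documents into consecutive windows of `months_per_chunk` calendar months. The `chunk_key` and `create_chunks` helpers are hypothetical; how `create_temporal_chunks` aligns its window boundaries is not shown in this notebook.

```python
from collections import defaultdict
from datetime import date

def chunk_key(doc_date, months_per_chunk=1):
    """Integer index of the calendar window that contains `doc_date`."""
    month_index = doc_date.year * 12 + (doc_date.month - 1)
    return month_index // months_per_chunk

def create_chunks(docs, months_per_chunk=1):
    """Group documents into time-ordered chunks; empty windows are skipped."""
    chunks = defaultdict(list)
    for doc in docs:
        chunks[chunk_key(doc["date"], months_per_chunk)].append(doc)
    return [chunks[key] for key in sorted(chunks)]

# Made-up documents.
toy_docs = [
    {"title": "A", "date": date(2013, 1, 5)},
    {"title": "B", "date": date(2013, 1, 20)},
    {"title": "C", "date": date(2013, 2, 2)},
    {"title": "D", "date": date(2013, 4, 10)},
]

monthly = create_chunks(toy_docs, months_per_chunk=1)
quarterly = create_chunks(toy_docs, months_per_chunk=3)

print([len(chunk) for chunk in monthly])
print([len(chunk) for chunk in quarterly])
```

Smaller windows give finer temporal resolution but fewer documents per chunk, which matters because a separate topic model is trained on each chunk.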

In [None]:
orch.train_ensemble_models()

In [None]:
orch.build_topic_manifold()

In [None]:
orch.compute_author_embeddings()

In [None]:
# Save results
results_path = orch.finalize_experiment()
print(f"Analysis complete! Results saved to: {results_path}")