# A Geometry-Driven Longitudinal Topic Model (GDLTM)

This notebook demonstrates a basic call flow for longitudinal topic modeling, based on the paper "A Geometry-Driven Longitudinal Topic Model" by Wang, Hougen, et al. (2021), published in Harvard Data Science Review (HDSR).

GDLTM is the topic modeling framework that feeds into Multiscale Topic Manifold Learning (MSTML). MSTML extends GDLTM to address problems in relational topic modeling (RTM), where relational (network) data is analyzed in conjunction with text data.

GDLTM advanced the temporal topic modeling field by demonstrating the advantages of Hellinger-PHATE manifold learning in conjunction with interpretable, probabilistic models such as the popular Latent Dirichlet Allocation (LDA). LDA uses a bag-of-words model and treats each document as a mixture of topics. Using Bayesian inference over a Bayesian network, LDA infers each latent topic as a multinomial distribution over words and, simultaneously, represents each document as a multinomial distribution over the latent topics.

GDLTM divides a text corpus into chunks by time (e.g. publication date, though any relevant date could be used) and then aligns topics across those chunks. The alignment over time is visualized using Hellinger-PHATE manifolds. Hellinger-PHATE manifolds over LDA topics were found to be particularly effective at visualizing complex temporal changes within document corpora while remaining computationally simple compared to prior probabilistic methods.

The resulting visualizations map topics into an embedding space that can be interpreted directly: each topic translates into a word cloud, and topic trajectories are learned using standard shortest-path algorithms. More details of the method are described in the HDSR 2021 paper.
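To make the "Hellinger" part of Hellinger-PHATE concrete, the following is a minimal sketch of the Hellinger distance between topic-word distributions from two adjacent time chunks. It is not the `mstml` implementation: the function name `hellinger_matrix`, the toy arrays, and the nearest-neighbor matching step are all illustrative assumptions. In particular, simple nearest-neighbor matching stands in here for the manifold-based alignment described in the paper.

```python
import numpy as np


def hellinger_matrix(P, Q):
    """Pairwise Hellinger distances between rows of P and rows of Q.

    Each row is assumed to be a probability distribution over the same
    vocabulary (non-negative entries summing to 1).
    """
    sqrt_p = np.sqrt(P)
    sqrt_q = np.sqrt(Q)
    # H(p, q) = (1 / sqrt(2)) * || sqrt(p) - sqrt(q) ||_2
    diff = sqrt_p[:, None, :] - sqrt_q[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1)) / np.sqrt(2.0)


# Hypothetical topic-word distributions (2 topics, 4-word vocabulary)
# for two adjacent time chunks. Real LDA topics would be much wider.
topics_t0 = np.array([[0.7, 0.2, 0.1, 0.0],
                      [0.0, 0.1, 0.2, 0.7]])
topics_t1 = np.array([[0.1, 0.1, 0.2, 0.6],
                      [0.6, 0.3, 0.1, 0.0]])

# dist[i, j] is the distance from topic i at t0 to topic j at t1.
dist = hellinger_matrix(topics_t0, topics_t1)

# Illustrative simplification: match each t0 topic to its closest t1 topic.
nearest = dist.argmin(axis=1)
print(dist.round(3))
print(nearest)
```

The Hellinger distance is bounded between 0 (identical distributions) and 1 (disjoint support), which makes distances comparable across topic pairs. In this toy example, topic 0 at the first time step is closest to topic 1 at the second, and vice versa.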

In [None]:
import os.path

from mstml._file_driver import read_pickle
from mstml.core import MstmlOrchestrator

In [None]:
# Set dataset name (matches directory name within data/)
dataset_name = "arxiv"

data_filters = {
    'categories':
        ['stat.AP',
         'stat.CO',
         'stat.ME',
         'stat.OT',
         'stat.TH',
         'cs.LG']
}

# Map raw arXiv field names (keys) to the column names passed to the loader (values)
arxiv_schema_map = {'abstract': 'raw_text',
                    'update_date': 'date',
                    'authors_parsed': 'authors'}

In [None]:
orch = MstmlOrchestrator(dataset_name=dataset_name)

orch.configure_data_filters(
    date_range={'start': '2016-01-01', 'end': '2017-12-31'},
    categories=data_filters['categories'],
)


In [None]:
orch.load_raw_data(input_schema_map=arxiv_schema_map,
                   overwrite=True)

In [None]:
orch.apply_data_filters()

In [None]:
len(orch.documents_df)

In [None]:
orch.documents_df.tail()

In [None]:
orch.preprocess_text()

In [None]:
orch.apply_author_disambiguation()

In [None]:
orch.documents_df.tail()

In [None]:
orch.setup_coauthor_network(temporal=True)

In [None]:
orch.create_temporal_chunks(months_per_chunk=1)

In [None]:
orch.train_ensemble_models()

In [None]:
orch.build_topic_manifold()

In [None]:
orch.create_topic_embedding()

In [None]:
orch.display_topic_embedding(color_by='meta_topic')

In [None]:
orch.save_topic_embedding()

In [None]:
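# Deliberate halt so that running all cells stops before the save step below;
# remove or skip this cell to finalize the experiment.
raise Exception("Stop here")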

In [None]:
# Save results
results_path = orch.finalize_experiment()
print(f"Analysis complete! Results saved to: {results_path}")