# MSTML Basic Usage Example

This notebook demonstrates the basic usage of the Multi-Scale Topic Manifold Learning (MSTML) framework.

## Overview

MSTML is a scalable method for predicting collaborative behavior from textual data, using the probabilistic information geometry of authors' topical interests. The framework includes:

- **GDLTM**: Geometry-Driven Longitudinal Topic Model
- **HRG**: Hierarchical Random Graph models
- **Text Processing**: Comprehensive preprocessing utilities
- **Network Analysis**: Tools for analyzing collaboration networks

In [2]:
# Import necessary libraries
import sys
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import MSTML modules
from mstml import Mstml, MstmlParams, MstmlEmbedType
from mstml.gdltm import Gdltm, GdltmParams
from mstml.hrg import HierarchicalRandomGraph
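# Wildcard import: the helpers used below (create_author_network, compute_network_metrics,
# hellinger_distance, jensen_shannon_divergence, create_time_windows, ...) are expected to come from here
from mstml.utils import *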
from mstml.text_processing import TextProcessor, create_academic_text_processor

# Set up plotting
%matplotlib inline
# The 'seaborn-v0_8' style name requires Matplotlib >= 3.6 (older releases call it 'seaborn')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

## 1. Text Processing Example

Let's start with basic text processing functionality.

In [None]:
# Create sample academic documents
sample_documents = [
    "Machine learning algorithms have revolutionized data analysis in recent years.",
    "Deep neural networks show promising results in natural language processing tasks.",
    "Topic modeling techniques help discover hidden thematic structures in document collections.",
    "Network analysis provides insights into collaboration patterns among researchers.",
    "Hierarchical random graphs capture the multi-scale structure of complex networks."
]

# Initialize text processor
processor = create_academic_text_processor()

# Process documents
processed_docs = processor.process_documents(sample_documents)

print("Original documents:")
for i, doc in enumerate(sample_documents):
    print(f"{i+1}: {doc}")

print("\nProcessed documents:")
for i, doc in enumerate(processed_docs):
    print(f"{i+1}: {doc}")
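
What `create_academic_text_processor()` does internally is not shown in this notebook. As a rough illustration of the kind of steps such a processor typically performs (lowercasing, tokenizing, dropping stop words and very short tokens), here is a minimal standard-library sketch. The function `simple_preprocess` and the list `STOP_WORDS` are hypothetical names for illustration and are not part of MSTML.

```python
import re

# Illustrative stop-word list (an assumption; a real pipeline would use a fuller one).
STOP_WORDS = {"a", "an", "and", "have", "in", "into", "of", "the", "to"}


def simple_preprocess(text, min_length=3):
    """Lowercase, split into alphabetic tokens, and drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= min_length]


example = "Topic modeling techniques help discover hidden thematic structures in document collections."
print(simple_preprocess(example))
```

The actual MSTML processor may apply further steps, such as lemmatization or phrase detection, that this sketch leaves out.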

In [None]:
# Get processing statistics
stats = processor.get_statistics(processed_docs)
print("Processing Statistics:")
for key, value in stats.items():
    print(f"{key}: {value}")

## 2. GDLTM Example

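This section outlines how the Geometry-Driven Longitudinal Topic Model (GDLTM) is configured and run. No model is fitted here, because GDLTM needs a preprocessed dataset.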

In [None]:
# Note: this cell only prints the workflow. The actual calls are
# commented out below because they require real, preprocessed data.

print("GDLTM Example:")
print("In a real scenario, you would:")
print("1. Load preprocessed data (main_df.pkl, data_words.pkl, etc.)")
print("2. Create GdltmParams with your dataset configuration")
print("3. Initialize and run the GDLTM pipeline")

# Example parameter setup (would use real data in practice)
# gdltm_params = GdltmParams(
#     dset="your_dataset",
#     dsub="subset_name",
#     ntopics=10,
#     knnk=15,
#     gamma=0.75
# )

# gdltm = Gdltm(gdltm_params)
# gdltm.run_full_pipeline()

## 3. Network Analysis Example

Create and analyze a simple collaboration network.

In [None]:
# Create sample collaboration data
collaborations = [
    ["Alice", "Bob", "Charlie"],
    ["Alice", "David"],
    ["Bob", "Eve", "Frank"],
    ["Charlie", "David", "Grace"],
    ["Eve", "Grace", "Henry"]
]

# Create collaboration network
network = create_author_network(collaborations)

# Compute network metrics
metrics = compute_network_metrics(network)

print("Network Metrics:")
for key, value in metrics.items():
    print(f"{key}: {value}")
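
To make the structure of such a network concrete, the sketch below builds co-authorship edge weights with the standard library only. It assumes that an edge's weight is the number of papers two authors share. Whether `create_author_network` weights edges this way is not confirmed by this notebook, and `coauthor_edge_weights` is a hypothetical helper, not an MSTML function.

```python
from collections import Counter
from itertools import combinations


def coauthor_edge_weights(author_lists):
    """Count, for every unordered author pair, how many papers they share."""
    weights = Counter()
    for authors in author_lists:
        # sorted(set(...)) removes duplicate names and gives each pair a canonical order
        for pair in combinations(sorted(set(authors)), 2):
            weights[pair] += 1
    return weights


papers = [
    ["Alice", "Bob", "Charlie"],
    ["Alice", "David"],
    ["Bob", "Eve", "Frank"],
    ["Charlie", "David", "Grace"],
    ["Eve", "Grace", "Henry"],
]
edge_weights = coauthor_edge_weights(papers)
authors = {name for pair in edge_weights for name in pair}
# Density of an undirected graph: edges present / edges possible
density = len(edge_weights) / (len(authors) * (len(authors) - 1) / 2)
print(f"{len(authors)} authors, {len(edge_weights)} edges, density {density:.3f}")
```

Each `(u, v)` pair and its count could then be loaded into a `networkx.Graph` with `add_edge(u, v, weight=w)`.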

In [None]:
# Visualize the network
import networkx as nx

plt.figure(figsize=(10, 8))
pos = nx.spring_layout(network, seed=42)

# Draw network
nx.draw_networkx_nodes(network, pos, node_color='lightblue',
                       node_size=1000, alpha=0.7)
nx.draw_networkx_labels(network, pos, font_size=12, font_weight='bold')

# Draw edges, scaling line width by edge weight (edges without a 'weight' attribute default to 1)
weights = [w for _, _, w in network.edges(data='weight', default=1)]
nx.draw_networkx_edges(network, pos, width=[w * 2 for w in weights],
                       alpha=0.6, edge_color='gray')

plt.title("Author Collaboration Network")
plt.axis('off')
plt.tight_layout()
plt.show()

## 4. HRG Example

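This section fits a Hierarchical Random Graph (HRG) model to the collaboration network from Section 3, then uses it to extract community structure and predict new links.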

In [None]:
# Initialize HRG model with the collaboration network
hrg = HierarchicalRandomGraph(network)

print("Fitting HRG model...")
# Note: This uses a placeholder implementation
# In practice, this would call external C++ HRG fitting code
hrg.fit(num_iterations=10000)

print("HRG model fitted successfully!")
print(f"Model likelihood: {hrg.compute_likelihood():.2f}")
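
For context on the likelihood printed above: in the hierarchical random graph model of Clauset, Moore and Newman, each internal dendrogram node with E edges running between its left subtree (L leaves) and right subtree (R leaves) contributes p^E (1 - p)^(L·R - E) to the likelihood, which is maximized at p = E / (L·R). Whether `hrg.compute_likelihood()` returns this quantity, its logarithm, or something else is not stated in this notebook, so the helper below (`node_log_likelihood`, a hypothetical name) is only a sketch of the textbook formula.

```python
import math


def node_log_likelihood(num_edges, left_size, right_size):
    """Log-likelihood contribution of one internal dendrogram node at its MLE."""
    num_pairs = left_size * right_size      # possible edges between the two subtrees
    p = num_edges / num_pairs               # maximum-likelihood connection probability
    if p in (0.0, 1.0):
        return 0.0                          # convention: 0 * log(0) = 0
    return num_edges * math.log(p) + (num_pairs - num_edges) * math.log(1.0 - p)


# Two subtrees of two nodes each, joined by 2 of the 4 possible edges
print(round(node_log_likelihood(2, 2, 2), 4))
```

Summing this value over all internal nodes gives the log-likelihood of the whole dendrogram under that model.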

In [None]:
# Get community structure
communities = hrg.get_community_structure()
print("Community Structure:")
for node, community in communities.items():
    print(f"{node}: Community {community}")

In [None]:
# Predict potential new collaborations
predictions = hrg.predict_links(num_predictions=5)
print(f"Top {len(predictions)} Link Predictions:")
for i, (node1, node2, prob) in enumerate(predictions, 1):
    print(f"{i}. {node1} - {node2}: {prob:.3f}")

## 5. Utility Functions Example

This section shows utility functions for comparing probability distributions and for splitting dates into time windows.

In [None]:
# Example probability distributions
p = np.array([0.3, 0.4, 0.2, 0.1])
q = np.array([0.25, 0.35, 0.25, 0.15])

# Compute various distance measures
hellinger_dist = hellinger_distance(p, q)
hellinger_sim = hellinger_similarity(p, q)
js_div = jensen_shannon_divergence(p, q)
cos_sim = cosine_similarity(p, q)

print("Distance/Similarity Measures:")
print(f"Hellinger Distance: {hellinger_dist:.4f}")
print(f"Hellinger Similarity: {hellinger_sim:.4f}")
print(f"Jensen-Shannon Divergence: {js_div:.4f}")
print(f"Cosine Similarity: {cos_sim:.4f}")
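
As a reference for what two of these measures compute, here is a plain-Python sketch of the standard definitions of the Hellinger distance and the Jensen-Shannon divergence. It assumes natural logarithms and the unsquared Hellinger form. The MSTML functions may use a different log base or normalization, so their outputs need not match these exactly. The function names below are hypothetical.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in nats, bounded above by ln 2."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


p = [0.3, 0.4, 0.2, 0.1]
q = [0.25, 0.35, 0.25, 0.15]
print(f"Hellinger: {hellinger(p, q):.4f}")
print(f"Jensen-Shannon: {jensen_shannon(p, q):.4f}")
```

Both measures are symmetric and equal zero when the two distributions are identical, which makes them convenient for comparing topic distributions.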

In [None]:
# Create time windows example
import datetime

# Sample dates
dates = [
    datetime.datetime(2020, 1, 15),
    datetime.datetime(2020, 6, 10),
    datetime.datetime(2021, 3, 20),
    datetime.datetime(2021, 9, 5),
    datetime.datetime(2022, 2, 28)
]

# Create yearly windows
windows = create_time_windows(dates, window_size='1Y')

print("Time Windows:")
for i, (start, end) in enumerate(windows, 1):
    print(f"Window {i}: {start.strftime('%Y-%m-%d')} to {end.strftime('%Y-%m-%d')}")
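
One plausible reading of yearly windows is one window per calendar year spanned by the data. The sketch below implements that reading with the standard library. The helper `yearly_windows` is hypothetical, and `create_time_windows` may define its windows differently (for example, rolling from the first date).

```python
import datetime


def yearly_windows(dates):
    """Return (start, end) pairs, one per calendar year spanned by `dates`."""
    first_year = min(d.year for d in dates)
    last_year = max(d.year for d in dates)
    # `end` marks the last day of the year (day granularity, time 00:00)
    return [
        (datetime.datetime(year, 1, 1), datetime.datetime(year, 12, 31))
        for year in range(first_year, last_year + 1)
    ]


sample_dates = [
    datetime.datetime(2020, 1, 15),
    datetime.datetime(2021, 9, 5),
    datetime.datetime(2022, 2, 28),
]
for start, end in yearly_windows(sample_dates):
    print(f"{start:%Y-%m-%d} to {end:%Y-%m-%d}")
```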

## Conclusion

This notebook demonstrated the basic functionality of the MSTML framework:

1. **Text Processing**: Preprocessing academic documents and inspecting processing statistics
2. **GDLTM**: Parameter setup for geometry-driven longitudinal topic modeling (outlined, not run)
3. **Network Analysis**: Collaboration network creation, metrics, and visualization
4. **HRG**: Hierarchical random graph fitting, community structure, and link prediction
5. **Utilities**: Distance and similarity measures between distributions, and time-window creation

For more advanced usage and real-world examples, refer to the other notebooks in this directory and the original research papers.

### Next Steps

- Load your own dataset using the preprocessing utilities
- Experiment with different GDLTM parameters
- Try HRG ensemble methods for improved predictions
- Explore the visualization capabilities for topic evolution