# MSTML Ensemble Interdisciplinarity Analysis Example

This notebook demonstrates how to use the MSTML framework for ensemble interdisciplinarity analysis.

## Overview

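Ensemble interdisciplinarity analysis measures how interdisciplinary author teams are by analyzing their topic distributions across a hierarchical topic model. The full analysis has three components; this notebook demonstrates the first two and leaves link prediction as a next step: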

1. **Document Interdisciplinarity Scoring**: Measuring how diverse the topics are within author teams
2. **Pairwise Topic Analysis**: Analyzing interdisciplinarity between specific topic pairs
3. **Link Prediction**: Using hierarchical random graphs to predict collaboration likelihood
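
As a rough intuition for the first component, one common way to quantify topic diversity is the Shannon entropy of a team's averaged topic distribution. The sketch below is illustrative only: `team_topic_entropy` is a hypothetical helper, not an MSTML function, and the library's own score may be defined differently.

```python
import numpy as np

def team_topic_entropy(author_distributions):
    """Shannon entropy (in nats) of the mean topic distribution of a team.

    Hypothetical helper for illustration; not part of MSTML.
    """
    team = np.mean(np.asarray(author_distributions, dtype=float), axis=0)
    team = team / team.sum()      # renormalize to guard against rounding drift
    nonzero = team[team > 0]      # treat 0 * log(0) as 0
    return float(-np.sum(nonzero * np.log(nonzero)))

# Two authors concentrated on the same topic
focused_team = [[0.9, 0.05, 0.05], [0.9, 0.05, 0.05]]
# Two authors concentrated on different topics
mixed_team = [[0.9, 0.05, 0.05], [0.05, 0.05, 0.9]]

focused_entropy = team_topic_entropy(focused_team)
mixed_entropy = team_topic_entropy(mixed_team)
print(f"focused: {focused_entropy:.3f}, mixed: {mixed_entropy:.3f}")
```

Under this measure the mixed team scores higher than the focused one, which matches the informal notion of a more interdisciplinary team.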

## Prerequisites

Before running this notebook:
1. Run `python setup_mstml.py` to set up the environment
2. Place your preprocessed data in the `data/clean/` directory
3. Ensure your data follows the standard MSTML DataFrame format (the expected columns are listed under Data Loading)

In [None]:
# Import required libraries
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Add the repository root to the Python path so the `source` package can be imported
sys.path.append('..')

# Import MSTML components
from source import (
    MstmlEnsembleInterdisciplinarity, 
    MstmlParams, 
    MstmlEmbedType,
    GdltmParams
)

from source.mstml_library import (
    score_interdisciplinarity,
    compute_interdisciplinarity_score_fast,
    compute_pairwise_interdisciplinarity,
    calculate_major_n_topic_score,
    get_nth_item_from_ordered_dict,
    compute_link_likelihood_scores
)

print("✓ Imports successful")

## Configuration

Set up the analysis parameters and data paths.

In [None]:
# Configuration
DATASET_NAME = "your_dataset"  # Change this to your dataset name
DATA_PATH = f"../data/clean/{DATASET_NAME}.pkl"  # Path to your preprocessed data

# Analysis parameters
CUT_HEIGHT = 0.5  # Dendrogram cut height for topic clustering
N_HOT = 1  # Number of "hot" topics per author
MIN_HOT_THRESHOLD = 0.2  # Minimum threshold for hot topics
N_TOPICS = 20  # Number of topics for LDA

print(f"Dataset: {DATASET_NAME}")
print(f"Data path: {DATA_PATH}")
print(f"Cut height: {CUT_HEIGHT}")
print(f"N-hot: {N_HOT}")
print(f"Topics: {N_TOPICS}")

## Data Loading

Load and validate the preprocessed data. The data should be a pandas DataFrame with the following columns:
- `title`: Document title
- `abstract`: Document abstract/content  
- `authors_parsed`: List of author names
- `authorID`: List of author IDs
- `date`: Publication date
- `text_processed`: Preprocessed text for topic modeling
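
For orientation, a toy DataFrame with these columns might look like the sketch below. All values are made up, and the token-list form of `text_processed` is an assumption, since the exact preprocessing output is not specified here.

```python
import pandas as pd

# Toy data only: every value below is invented for illustration
example_df = pd.DataFrame({
    "title": ["A toy paper", "Another toy paper"],
    "abstract": ["Short abstract one.", "Short abstract two."],
    "authors_parsed": [["Author One", "Author Two"], ["Author One"]],
    "authorID": [["a1", "a2"], ["a1"]],
    "date": pd.to_datetime(["2020-01-15", "2021-06-30"]),
    # Assumption: preprocessed text stored as a list of tokens
    "text_processed": [["toy", "paper"], ["another", "toy", "paper"]],
})

expected_columns = ["title", "abstract", "authors_parsed", "authorID", "date", "text_processed"]
missing = [col for col in expected_columns if col not in example_df.columns]

# Each document should list as many author IDs as author names
ids_match_names = (
    example_df["authors_parsed"].apply(len) == example_df["authorID"].apply(len)
).all()

print(f"Missing columns: {missing}")
print(f"Author names and IDs aligned: {ids_match_names}")
```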

In [None]:
# Load data
if not os.path.exists(DATA_PATH):
    print(f"❌ Data file not found: {DATA_PATH}")
    print("Please ensure you have:")
    print("1. Run the preprocessing scripts to create clean data")
    print("2. Updated the DATASET_NAME variable above")
    print("3. Placed your data in the correct directory")
    raise FileNotFoundError(f"Data file not found: {DATA_PATH}")

# Load the DataFrame
df = pd.read_pickle(DATA_PATH)

print(f"✓ Loaded dataset with {len(df)} documents")
print(f"Columns: {list(df.columns)}")

# Validate required columns before using any of them
# ('date' is included because later cells call df['date'].dt.year)
required_columns = ['title', 'authors_parsed', 'authorID', 'date', 'text_processed']
missing_columns = [col for col in required_columns if col not in df.columns]

if missing_columns:
    print(f"❌ Missing required columns: {missing_columns}")
    raise ValueError(f"Missing required columns: {missing_columns}")

# Ensure dates are datetime so the .dt accessor works in later cells
df['date'] = pd.to_datetime(df['date'])
print(f"Date range: {df['date'].min()} to {df['date'].max()}")

print("✓ Data validation passed")

## Data Exploration

Explore the dataset to understand its structure and characteristics.

In [None]:
# Basic statistics
print("Dataset Statistics:")
print(f"Total documents: {len(df)}")
print(f"Unique authors: {len(set([author for authors in df['authorID'] for author in authors]))}")
print(f"Average authors per document: {df['authorID'].apply(len).mean():.2f}")
print(f"Max authors per document: {df['authorID'].apply(len).max()}")

# Author collaboration distribution
author_counts = df['authorID'].apply(len)
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
author_counts.hist(bins=20, alpha=0.7)
plt.xlabel('Number of Authors per Document')
plt.ylabel('Frequency')
plt.title('Distribution of Authors per Document')

# Publication timeline
plt.subplot(1, 2, 2)
df['date'].dt.year.value_counts().sort_index().plot(kind='line')
plt.xlabel('Year')
plt.ylabel('Number of Publications')
plt.title('Publications Over Time')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## Topic Modeling Setup

Set up the topic modeling pipeline using GDLTM (Geometry-Driven Longitudinal Topic Model).
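
The embedding type used in the next cell, `SLC_TPC_HELLINGER_SIM_DISTN`, suggests by its name a similarity based on the Hellinger distance between topic distributions. How MSTML computes it internally is not shown in this notebook; the sketch below only illustrates the standard Hellinger distance, with `hellinger_distance` as a hypothetical helper name.

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions.

    Ranges from 0 (identical) to 1 (disjoint support).
    Hypothetical helper for illustration; not part of MSTML.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

same = hellinger_distance([0.5, 0.5], [0.5, 0.5])
disjoint = hellinger_distance([1.0, 0.0], [0.0, 1.0])
partial = hellinger_distance([0.7, 0.3], [0.4, 0.6])

print(f"identical: {same:.3f}, disjoint: {disjoint:.3f}, partial overlap: {partial:.3f}")
```

A similarity can then be taken as `1 - hellinger_distance(p, q)`, although this particular conversion is an assumption rather than something the framework documents here.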

In [None]:
# Create GDLTM parameters
gdltm_params = GdltmParams(
    dset=DATASET_NAME,
    dsub="ensemble_analysis",
    ntopics=N_TOPICS,
    nchunks=1,  # Single time chunk for ensemble analysis
    chunk_len=len(df),
    passes=10,
    iterations=50,
    eval_every=10
)

# Create MSTML parameters
mstml_params = MstmlParams(
    gdltm_params=gdltm_params,
    fwd_window=3,
    embed_type=MstmlEmbedType.SLC_TPC_HELLINGER_SIM_DISTN,
    alpha=0.5,
    beta=1.0
)

print("✓ Parameters configured")
mstml_params.print_params()

## Initialize Ensemble Interdisciplinarity Analysis

Create the ensemble interdisciplinarity analysis object.
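
The `cut_height` argument is the dendrogram cut height for topic clustering. To illustrate what cutting a dendrogram at a given height does, the sketch below uses SciPy as a stand-in on four made-up topics. The average-linkage choice and the Hellinger metric are assumptions; MSTML's own clustering may differ.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four made-up topics over a 4-word vocabulary: two similar pairs
topics = np.array([
    [0.70, 0.20, 0.05, 0.05],
    [0.60, 0.30, 0.05, 0.05],
    [0.05, 0.05, 0.20, 0.70],
    [0.05, 0.05, 0.30, 0.60],
])

# Euclidean distance between sqrt(p) / sqrt(2) vectors equals the Hellinger distance
points = np.sqrt(topics) / np.sqrt(2)
dendrogram_links = linkage(points, method="average")

# Topics whose merge height is at most the cut height share a cluster label
labels_at_half = fcluster(dendrogram_links, t=0.5, criterion="distance")
labels_low_cut = fcluster(dendrogram_links, t=0.05, criterion="distance")

print("cut at 0.50:", labels_at_half)
print("cut at 0.05:", labels_low_cut)
```

A lower cut height yields more, finer-grained topic clusters; a higher one merges topics into broader groups.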

In [None]:
# Initialize the ensemble interdisciplinarity analysis
ensemble_analysis = MstmlEnsembleInterdisciplinarity(
    params=mstml_params,
    cut_height=CUT_HEIGHT,
    n_hot=N_HOT,
    min_hot_threshold=MIN_HOT_THRESHOLD
)

print("✓ Ensemble analysis initialized")
print(f"Experiment directory: {ensemble_analysis.exp_dir}")

## Topic Model Training

Train the topic model on the document corpus.

**Note**: This is a simplified example. In practice, you would use the full GDLTM pipeline with proper preprocessing, chunking, and hierarchical topic modeling.
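
When real document-topic distributions are available, one simple way to derive author-topic distributions is to average the distributions of each author's documents. The sketch below assumes that approach; `author_topic_from_docs` is a hypothetical helper, and the full pipeline may aggregate or weight documents differently.

```python
import numpy as np

def author_topic_from_docs(doc_topic, doc_authors):
    """Average each author's document-topic distributions.

    doc_topic:   {doc_id: topic distribution (sums to 1)}
    doc_authors: {doc_id: list of author IDs}
    Hypothetical helper for illustration; not part of MSTML.
    """
    sums, counts = {}, {}
    for doc_id, authors in doc_authors.items():
        theta = np.asarray(doc_topic[doc_id], dtype=float)
        for author in authors:
            sums[author] = sums.get(author, np.zeros_like(theta)) + theta
            counts[author] = counts.get(author, 0) + 1
    # A mean of probability distributions is itself a probability distribution
    return {author: sums[author] / counts[author] for author in sums}

toy_doc_topic = {0: [0.8, 0.2], 1: [0.2, 0.8]}
toy_doc_authors = {0: ["a1", "a2"], 1: ["a1"]}

author_topics = author_topic_from_docs(toy_doc_topic, toy_doc_authors)
for author, dist in author_topics.items():
    print(author, np.round(dist, 2))
```

The resulting dictionary has the same shape as the mock `author_topic_distributions` built in the cell below, so it could replace the random draws.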

In [None]:
# This is a placeholder for the actual topic modeling pipeline
# In practice, you would:
# 1. Use GDLTM to train topic models on time chunks
# 2. Build hierarchical topic dendrograms
# 3. Create author-topic distributions
# 4. Build co-authorship networks

print("⚠️  Topic modeling pipeline placeholder")
print("In a complete implementation, this would include:")
print("1. Document preprocessing and chunking")
print("2. LDA topic model training")
print("3. Hierarchical topic clustering")
print("4. Author-topic distribution computation")
print("5. Co-authorship network construction")

# For demonstration, create mock data structures
# Replace this with actual topic modeling results

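# Mock author-topic distributions (replace with real data)
# Sort the authors so the seeded random draws are assigned reproducibly
# (set iteration order can change between kernel sessions)
unique_authors = sorted(set(author for authors in df['authorID'] for author in authors))
n_authors = len(unique_authors)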

# Create random author-topic distributions for demonstration
np.random.seed(42)
author_topic_distributions = {}
for author in unique_authors:
    # Random topic distribution that sums to 1
    dist = np.random.dirichlet(np.ones(N_TOPICS))
    author_topic_distributions[author] = dist

print(f"✓ Created mock author-topic distributions for {n_authors} authors")
print("⚠️  Replace this with actual topic modeling results")

## Document Interdisciplinarity Analysis

Compute interdisciplinarity scores for documents based on their author teams' topic diversity.
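
The `n_hot` and `min_hot_threshold` parameters control which topics count as "hot" for each author. One plausible reading, sketched below, is that an author's hot topics are their `n_hot` highest-weight topics that also meet the threshold, and that a team is more diverse when its members' hot topics differ. Both helpers are hypothetical, and the actual scoring inside `compute_interdisciplinarity_score_fast` may differ.

```python
import numpy as np

def hot_topics(dist, n_hot=1, min_hot_threshold=0.2):
    """Indices of the n_hot largest topics whose weight meets the threshold.

    Hypothetical helper for illustration; not part of MSTML.
    """
    dist = np.asarray(dist, dtype=float)
    top = np.argsort(dist)[::-1][:n_hot]   # topic indices by descending weight
    return {int(t) for t in top if dist[t] >= min_hot_threshold}

def distinct_hot_topic_count(authors, author_distributions, n_hot=1, min_hot_threshold=0.2):
    """Number of distinct hot topics across all authors of one document."""
    team_topics = set()
    for author in authors:
        team_topics |= hot_topics(author_distributions[author], n_hot, min_hot_threshold)
    return len(team_topics)

toy_distributions = {
    "a1": [0.70, 0.20, 0.10],
    "a2": [0.10, 0.80, 0.10],
    "a3": [0.34, 0.33, 0.33],
}

diverse_team = distinct_hot_topic_count(["a1", "a2"], toy_distributions)
similar_team = distinct_hot_topic_count(["a1", "a3"], toy_distributions)
print(f"a1 + a2: {diverse_team} distinct hot topics")
print(f"a1 + a3: {similar_team} distinct hot topic(s)")
```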

In [None]:
# Create document-authors mapping (document index -> list of author IDs)
doc_authors = df['authorID'].to_dict()

# Create author-to-documents mapping (author ID -> list of document indices)
author_to_docs = {}
for doc_id, authors in doc_authors.items():
    for author in authors:
        author_to_docs.setdefault(author, []).append(doc_id)

print(f"✓ Created mappings for {len(doc_authors)} documents and {len(author_to_docs)} authors")

# Compute interdisciplinarity scores
interdisciplinarity_scores = compute_interdisciplinarity_score_fast(
    doc_authors=doc_authors,
    author_distributions=author_topic_distributions,
    author_to_docs=author_to_docs,
    n_hot=N_HOT,
    min_hot_threshold=MIN_HOT_THRESHOLD
)

print(f"✓ Computed interdisciplinarity scores for {len(interdisciplinarity_scores)} documents")

# Display top interdisciplinary documents
print("\nTop 10 Most Interdisciplinary Documents:")
for i in range(min(10, len(interdisciplinarity_scores))):
    doc_id, score = get_nth_item_from_ordered_dict(interdisciplinarity_scores, i)
    full_title = df.loc[doc_id, 'title']
    title = full_title[:80] + "..." if len(full_title) > 80 else full_title
    team_size = len(df.loc[doc_id, 'authorID'])  # separate name so n_authors (total author count) is not overwritten
    print(f"{i+1:2d}. Score: {score:.3f} | Authors: {team_size} | {title}")

## Interdisciplinarity Score Analysis

Analyze the distribution of interdisciplinarity scores and their relationship with document characteristics.
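
The cell below reports a Pearson correlation with team size, which captures only linear association. If the relationship is monotonic but nonlinear, a rank-based (Spearman) correlation can be more informative. The sketch below computes it as the Pearson correlation of ranks on made-up numbers; the column names mirror those used in this notebook.

```python
import pandas as pd

# Made-up scores that rise monotonically, but not linearly, with team size
toy = pd.DataFrame({
    "n_authors": [1, 2, 3, 4, 5],
    "interdisciplinarity_score": [0.10, 0.12, 0.30, 0.31, 0.90],
})

pearson = toy["n_authors"].corr(toy["interdisciplinarity_score"])

# Spearman correlation is the Pearson correlation of the ranked values
spearman = toy["n_authors"].rank().corr(toy["interdisciplinarity_score"].rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```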

In [None]:
# Convert scores to DataFrame for analysis
scores_df = pd.DataFrame([
    {'doc_id': doc_id, 'interdisciplinarity_score': score}
    for doc_id, score in interdisciplinarity_scores.items()
])

# Merge with document metadata
# (rename_axis guarantees the index column is called 'doc_id', even if the index already has a name)
analysis_df = scores_df.merge(
    df[['title', 'authorID', 'date']].rename_axis('doc_id').reset_index(),
    on='doc_id'
)
analysis_df['n_authors'] = analysis_df['authorID'].apply(len)
analysis_df['year'] = analysis_df['date'].dt.year

# Visualizations
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Score distribution
axes[0, 0].hist(analysis_df['interdisciplinarity_score'], bins=30, alpha=0.7, edgecolor='black')
axes[0, 0].set_xlabel('Interdisciplinarity Score')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Interdisciplinarity Scores')

# Score vs number of authors
axes[0, 1].scatter(analysis_df['n_authors'], analysis_df['interdisciplinarity_score'], alpha=0.6)
axes[0, 1].set_xlabel('Number of Authors')
axes[0, 1].set_ylabel('Interdisciplinarity Score')
axes[0, 1].set_title('Interdisciplinarity vs Team Size')

# Score over time
yearly_scores = analysis_df.groupby('year')['interdisciplinarity_score'].mean()
axes[1, 0].plot(yearly_scores.index, yearly_scores.values, marker='o')
axes[1, 0].set_xlabel('Year')
axes[1, 0].set_ylabel('Average Interdisciplinarity Score')
axes[1, 0].set_title('Interdisciplinarity Trends Over Time')
axes[1, 0].tick_params(axis='x', rotation=45)

# Box plot by team size categories
analysis_df['team_size_category'] = pd.cut(
    analysis_df['n_authors'], 
    bins=[0, 1, 2, 3, 5, float('inf')], 
    labels=['Solo', '2 authors', '3 authors', '4-5 authors', '6+ authors']
)
sns.boxplot(data=analysis_df, x='team_size_category', y='interdisciplinarity_score', ax=axes[1, 1])
axes[1, 1].set_xlabel('Team Size Category')
axes[1, 1].set_ylabel('Interdisciplinarity Score')
axes[1, 1].set_title('Interdisciplinarity by Team Size')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Summary statistics
print("\nInterdisciplinarity Score Statistics:")
print(analysis_df['interdisciplinarity_score'].describe())

print("\nCorrelation with team size:")
correlation = analysis_df['n_authors'].corr(analysis_df['interdisciplinarity_score'])
print(f"Pearson correlation: {correlation:.3f}")

## Pairwise Topic Interdisciplinarity

Analyze interdisciplinarity between specific pairs of topics.
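
The number of unordered topic pairs grows quadratically with the number of topics, which is why the cell below restricts itself to a subset. A small sketch of that count, using only the standard library (`topic_pairs` is a hypothetical helper):

```python
from itertools import combinations
from math import comb

def topic_pairs(topics):
    """All unordered pairs (i, j) with i < j from a collection of topic IDs."""
    return list(combinations(sorted(topics), 2))

subset_pairs = topic_pairs(range(5))
print(f"5 topics  -> {len(subset_pairs)} pairs")
print(f"20 topics -> {comb(20, 2)} pairs")
```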

In [None]:
# Compute pairwise interdisciplinarity for a subset of topics
# (Computing all pairs can be computationally expensive)
selected_topics = list(range(min(5, N_TOPICS)))  # Analyze first 5 topics

print(f"Computing pairwise interdisciplinarity for topics: {selected_topics}")

pairwise_scores = compute_pairwise_interdisciplinarity(
    doc_authors=doc_authors,
    author_distributions=author_topic_distributions,
    author_to_docs=author_to_docs,
    topics=selected_topics,
    n_hot=N_HOT,
    min_hot_threshold=MIN_HOT_THRESHOLD
)

print(f"✓ Computed pairwise scores for {len(pairwise_scores)} topic pairs")

# Create heatmap of maximum interdisciplinarity scores between topic pairs
n_topics = len(selected_topics)
heatmap_matrix = np.zeros((n_topics, n_topics))

for i in range(n_topics):
    for j in range(i + 1, n_topics):
        topic_pair = (selected_topics[i], selected_topics[j])
        if topic_pair in pairwise_scores:
            scores = pairwise_scores[topic_pair]
            max_score = max(scores.values()) if scores else 0
            heatmap_matrix[i, j] = max_score
            heatmap_matrix[j, i] = max_score

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
    heatmap_matrix, 
    annot=True, 
    fmt='.3f', 
    xticklabels=[f'Topic {i}' for i in selected_topics],
    yticklabels=[f'Topic {i}' for i in selected_topics],
    cmap='viridis'
)
plt.title('Maximum Interdisciplinarity Scores Between Topic Pairs')
plt.tight_layout()
plt.show()

# Show top interdisciplinary documents for a specific topic pair
example_pair = (selected_topics[0], selected_topics[1])
if example_pair in pairwise_scores:
    pair_scores = pairwise_scores[example_pair]
    print(f"\nTop 5 documents for topic pair {example_pair}:")
    for i in range(min(5, len(pair_scores))):
        doc_id, score = get_nth_item_from_ordered_dict(pair_scores, i)
        full_title = df.loc[doc_id, 'title']
        title = full_title[:60] + "..." if len(full_title) > 60 else full_title
        print(f"{i+1}. Score: {score:.3f} | {title}")

## Results Summary and Export

Summarize the analysis results and export them for further use.
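
If a summary is extended with values that the `json` module cannot serialize directly, such as NumPy integers, 32-bit floats, or arrays, a `default` hook can convert them. The sketch below is one way to do that; `to_builtin` is a hypothetical helper and the summary values are made up.

```python
import json
import numpy as np

def to_builtin(value):
    """Fallback for json.dumps: convert NumPy scalars and arrays to built-in types."""
    if isinstance(value, np.generic):    # NumPy scalar, e.g. np.int64 or np.float32
        return value.item()
    if isinstance(value, np.ndarray):
        return value.tolist()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")

# Made-up summary containing NumPy types
toy_summary = {
    "n_documents": np.int64(120),
    "mean_score": np.float32(0.25),
    "top_topics": np.array([3, 7]),
}

summary_json = json.dumps(toy_summary, default=to_builtin, indent=2)
print(summary_json)
```

The same `default=to_builtin` argument can be passed to `json.dump` when writing to a file.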

In [None]:
# Create results summary
results_summary = {
    'dataset': DATASET_NAME,
    'n_documents': len(df),
    'n_authors': len(unique_authors),
    'n_topics': N_TOPICS,
    'cut_height': CUT_HEIGHT,
    'n_hot': N_HOT,
    'min_hot_threshold': MIN_HOT_THRESHOLD,
    'mean_interdisciplinarity': analysis_df['interdisciplinarity_score'].mean(),
    'std_interdisciplinarity': analysis_df['interdisciplinarity_score'].std(),
    'max_interdisciplinarity': analysis_df['interdisciplinarity_score'].max(),
    'correlation_team_size': correlation
}

print("\n" + "="*50)
print("ENSEMBLE INTERDISCIPLINARITY ANALYSIS SUMMARY")
print("="*50)

for key, value in results_summary.items():
    if isinstance(value, float):
        print(f"{key}: {value:.4f}")
    else:
        print(f"{key}: {value}")

# Export results
results_dir = Path("../results")
results_dir.mkdir(exist_ok=True)

# Export interdisciplinarity scores
analysis_df.to_csv(results_dir / f"{DATASET_NAME}_interdisciplinarity_scores.csv", index=False)

# Export summary
import json
with open(results_dir / f"{DATASET_NAME}_ensemble_summary.json", 'w') as f:
    json.dump(results_summary, f, indent=2)

print(f"\n✓ Results exported to {results_dir}")
print("Files created:")
print(f"  - {DATASET_NAME}_interdisciplinarity_scores.csv")
print(f"  - {DATASET_NAME}_ensemble_summary.json")

## Next Steps

This notebook provides a foundation for ensemble interdisciplinarity analysis. To extend the analysis:

1. **Implement Full Topic Modeling Pipeline**: Replace the mock data with actual GDLTM topic modeling results
2. **Hierarchical Topic Analysis**: Use the hierarchical random graph functionality for link prediction
3. **Network Analysis**: Analyze co-authorship networks and their relationship to interdisciplinarity
4. **Temporal Analysis**: Extend to longitudinal analysis to see how interdisciplinarity changes over time
5. **Domain-Specific Analysis**: Customize the analysis for your specific research domain
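
As a starting point for the network analysis step, co-authorship edges can be counted directly from a document-to-authors mapping like the `doc_authors` dictionary built earlier. The sketch below uses only the standard library and toy data; `coauthorship_edges` is a hypothetical helper, not an MSTML function.

```python
from collections import Counter
from itertools import combinations

def coauthorship_edges(doc_authors):
    """Count how many documents each pair of authors has written together.

    doc_authors: {doc_id: list of author IDs}
    Returns a Counter keyed by sorted (author_a, author_b) tuples.
    """
    edges = Counter()
    for authors in doc_authors.values():
        # Deduplicate and sort so each pair has one canonical key
        for pair in combinations(sorted(set(authors)), 2):
            edges[pair] += 1
    return edges

toy_doc_authors = {
    0: ["a1", "a2"],
    1: ["a2", "a1", "a3"],
    2: ["a3"],            # solo paper: contributes no edges
}

edge_weights = coauthorship_edges(toy_doc_authors)
for (author_a, author_b), weight in sorted(edge_weights.items()):
    print(f"{author_a} - {author_b}: {weight} shared document(s)")
```

The resulting weighted edge list can be loaded into a graph library of your choice for further analysis.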
