# EasyTPP - Data Generation & Benchmarks

This notebook demonstrates how to use EasyTPP for:
- Generating synthetic data with different simulators
- Running benchmarks with different baselines
- Analyzing and comparing results

## 📋 Contents

1. [Setup and imports](#setup)
2. [Synthetic data generation](#generation)
3. [Benchmarks and baselines](#benchmarks)
4. [Comparative analysis](#analysis)

## 1. Setup and imports {#setup}

In [None]:
import sys
from pathlib import Path
import matplotlib.pyplot as plt
import pandas as pd

# Add the project root (the parent of the current working directory) to sys.path so easy_tpp can be imported
project_root = Path().absolute().parent
sys.path.insert(0, str(project_root))

# EasyTPP imports
from easy_tpp.data.generation import HawkesSimulator, SelfCorrecting
from easy_tpp.config_factory import DataConfig
from easy_tpp.evaluation.benchmarks.mean_bench import MeanInterTimeBenchmark
from easy_tpp.evaluation.benchmarks.sample_distrib_mark_bench import MarkDistributionBenchmark
from easy_tpp.evaluation.benchmarks.sample_distrib_intertime_bench import InterTimeDistributionBenchmark

print("✅ Modules imported successfully!")

## 2. Synthetic data generation {#generation}

### Bivariate Hawkes Process

Let's generate data with a 2-dimensional Hawkes process.

In [None]:
# Bivariate Hawkes process configuration
params_hawkes = {
    "mu": [0.2, 0.15],                    # Base intensities
    "alpha": [[0.4, 0.1], [0.2, 0.3]],   # Excitation matrix
    "beta": [[2, 1], [1.5, 2.5]]         # Decay matrix
}

print("🎲 Hawkes process configuration:")
print(f"   Base intensities (μ): {params_hawkes['mu']}")
print(f"   Excitation matrix (α): {params_hawkes['alpha']}")
print(f"   Decay matrix (β): {params_hawkes['beta']}")

# Create simulator
hawkes_simulator = HawkesSimulator(
    mu=params_hawkes["mu"],
    alpha=params_hawkes["alpha"],
    beta=params_hawkes["beta"],
    dim_process=2,
    start_time=0,
    end_time=100
)

print("\n🚀 Generating Hawkes data...")
hawkes_simulator.generate_and_save(
    output_dir='./synthetic_hawkes',
    num_simulations=50,
    splits={'train': 0.6, 'test': 0.2, 'dev': 0.2}
)

print("✅ Hawkes data generated in './synthetic_hawkes'")
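
A multivariate Hawkes process is stationary only when the spectral radius of its branching matrix is below 1, so it is worth checking the parameters before simulating. The notebook does not say how `HawkesSimulator` parameterizes its kernel, so the sketch below checks two common conventions rather than assuming one. The helper `spectral_radius` is hypothetical and is not part of EasyTPP.

```python
import numpy as np

# Parameters copied from the bivariate configuration above.
alpha = np.array([[0.4, 0.1], [0.2, 0.3]])
beta = np.array([[2.0, 1.0], [1.5, 2.5]])


def spectral_radius(matrix):
    """Largest absolute eigenvalue of a square matrix (hypothetical helper)."""
    return float(np.max(np.abs(np.linalg.eigvals(matrix))))


# Assumption A: kernel alpha * exp(-beta * t), so the branching matrix is alpha / beta.
rho_unnormalized = spectral_radius(alpha / beta)
# Assumption B: kernel alpha * beta * exp(-beta * t), so the branching matrix is alpha.
rho_normalized = spectral_radius(alpha)

print(f"Spectral radius (alpha / beta): {rho_unnormalized:.3f}")
print(f"Spectral radius (alpha):        {rho_normalized:.3f}")
```

Both values are below 1 for these parameters, so the configuration is stable under either convention.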

### Self-Correcting Process

Let's also generate data with a self-correcting process.

In [None]:
# Univariate self-correcting process
self_correcting_simulator = SelfCorrecting(
    dim_process=1,
    start_time=0,
    end_time=200
)

print("🎲 Generating self-correcting data...")
self_correcting_simulator.generate_and_save(
    output_dir='./synthetic_self_correcting',
    num_simulations=30,
    splits={'train': 0.7, 'test': 0.15, 'dev': 0.15}
)

print("✅ Self-correcting data generated in './synthetic_self_correcting'")
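
For intuition about what a self-correcting process does, here is a minimal standalone sketch that assumes the textbook intensity λ(t) = exp(μ·t − α·N(t)): the intensity grows over time and drops by a factor of exp(−α) at each event. The `SelfCorrecting` class may use a different parameterization or sampling method. The function name and the values `mu=1.0`, `alpha=1.0` are illustrative assumptions.

```python
import math
import random


def simulate_self_correcting(mu, alpha, end_time, seed=0):
    """Sample event times with intensity exp(mu * t - alpha * N(t)) on [0, end_time]."""
    rng = random.Random(seed)
    times = []
    t = 0.0
    while True:
        # Draw a unit-rate exponential and invert the integrated intensity
        # since the last event, which gives the next event time exactly.
        e = rng.expovariate(1.0)
        n = len(times)
        t = t + math.log1p(mu * e * math.exp(alpha * n - mu * t)) / mu
        if t > end_time:
            return times
        times.append(t)


events = simulate_self_correcting(mu=1.0, alpha=1.0, end_time=50.0)
print(f"{len(events)} events, first five: {[round(x, 3) for x in events[:5]]}")
```

Because the intensity between events has a closed-form integral, the sketch can invert it directly and does not need thinning.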

### Complex Multivariate Hawkes Process

Let's create a more complex process with 3 dimensions.

In [None]:
# Complex multivariate configuration
params_complex = {
    "mu": [0.1, 0.15, 0.12],
    "alpha": [
        [0.3, 0.1, 0.05],
        [0.2, 0.4, 0.15],
        [0.1, 0.2, 0.35]
    ],
    "beta": [
        [2.5, 1.5, 1.0],
        [2.0, 3.0, 1.5],
        [1.5, 2.5, 3.5]
    ]
}

complex_simulator = HawkesSimulator(
    mu=params_complex["mu"],
    alpha=params_complex["alpha"],
    beta=params_complex["beta"],
    dim_process=3,
    start_time=0,
    end_time=150
)

print("🎲 Generating 3D Hawkes data...")
complex_simulator.generate_and_save(
    output_dir='./synthetic_hawkes_3d',
    num_simulations=25,
    splits={'train': 0.6, 'test': 0.25, 'dev': 0.15}
)

print("✅ 3D Hawkes data generated in './synthetic_hawkes_3d'")
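
With 25 simulations, the fractions 0.25 and 0.15 do not give whole numbers of sequences (6.25 and 3.75). How `generate_and_save` rounds these is not shown in this notebook. The sketch below illustrates one reasonable scheme, largest-remainder allocation, using a hypothetical `split_counts` helper; treat it as an illustration, not as the library's behavior.

```python
import math


def split_counts(num_items, splits):
    """Turn split fractions into integer counts that sum to num_items."""
    if not math.isclose(sum(splits.values()), 1.0):
        raise ValueError("split fractions must sum to 1")
    exact = {name: num_items * frac for name, frac in splits.items()}
    counts = {name: math.floor(value) for name, value in exact.items()}
    leftover = num_items - sum(counts.values())
    # Hand the remaining items to the splits with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


print(split_counts(50, {"train": 0.6, "test": 0.2, "dev": 0.2}))
print(split_counts(25, {"train": 0.6, "test": 0.25, "dev": 0.15}))
```

If exact split sizes matter for an experiment, inspect the generated files rather than relying on this sketch.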

## 3. Benchmarks and baselines {#benchmarks}

Now, let's evaluate different baselines on our synthetic data.

In [None]:
# Configuration for benchmarks on synthetic Hawkes data
data_config_hawkes = DataConfig(
    dataset_id="synthetic_hawkes",
    data_format="pickle",
    num_event_types=2,
    batch_size=16
)

print("📊 Benchmark configuration:")
print(f"   Dataset: {data_config_hawkes.dataset_id}")
print(f"   Event types: {data_config_hawkes.num_event_types}")
print(f"   Batch size: {data_config_hawkes.batch_size}")

### Benchmark 1: Mean Inter-Event Time Prediction

In [None]:
# Benchmark: predict the mean inter-event time
mean_benchmark = MeanInterTimeBenchmark(
    data_config=data_config_hawkes,
    experiment_id="mean_hawkes_baseline",
    save_dir="./benchmark_results"
)

print("🎯 Running mean benchmark...")
mean_results = mean_benchmark.evaluate()

print("📈 Mean benchmark results:")
for metric, value in mean_results.items():
    if isinstance(value, (int, float)):
        print(f"   {metric}: {value:.4f}")
    else:
        print(f"   {metric}: {value}")
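
To make the idea behind this baseline concrete, the following standalone sketch predicts a single constant, the mean inter-event time of the training data, and scores it on held-out data. It is not the implementation of `MeanInterTimeBenchmark`. The toy sequences are made up, and the choice of RMSE as the metric is an assumption.

```python
import math
from statistics import mean

# Toy event-time sequences, made up for illustration only.
train_sequences = [[0.5, 1.2, 2.0, 3.5], [0.3, 0.9, 2.4]]
test_sequences = [[0.4, 1.0, 2.5]]


def inter_times(sequence):
    """Differences between consecutive event times."""
    return [later - earlier for earlier, later in zip(sequence, sequence[1:])]


# The baseline predicts one constant: the mean inter-event time seen in training.
train_gaps = [gap for seq in train_sequences for gap in inter_times(seq)]
prediction = mean(train_gaps)

# Score the constant prediction with RMSE on the held-out gaps.
test_gaps = [gap for seq in test_sequences for gap in inter_times(seq)]
rmse = math.sqrt(mean((gap - prediction) ** 2 for gap in test_gaps))

print(f"Predicted inter-event time: {prediction:.4f}")
print(f"RMSE on held-out gaps:      {rmse:.4f}")
```

A learned model that cannot beat this constant prediction is not extracting useful timing information from the event history.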

### Benchmark 2: Event Type Distribution

In [None]:
# Benchmark: sample event types from the observed type distribution
mark_benchmark = MarkDistributionBenchmark(
    data_config=data_config_hawkes,
    experiment_id="mark_hawkes_baseline",
    save_dir="./benchmark_results"
)

print("🎯 Running type distribution benchmark...")
mark_results = mark_benchmark.evaluate()

print("📈 Type distribution benchmark results:")
for metric, value in mark_results.items():
    if isinstance(value, (int, float)):
        print(f"   {metric}: {value:.4f}")
    else:
        print(f"   {metric}: {value}")
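
The idea behind a mark-distribution baseline can be sketched as follows: estimate how often each event type occurs in the training data, then predict test marks by sampling from that distribution. This is an illustration on made-up data, not the code inside `MarkDistributionBenchmark`. Using accuracy as the score is an assumption.

```python
import random
from collections import Counter

# Toy event-type sequences, made up for illustration only.
train_marks = [0, 1, 0, 0, 1, 0, 1, 0]
test_marks = [0, 0, 1, 0]

# Empirical mark distribution from the training data.
counts = Counter(train_marks)
types = sorted(counts)
probs = [counts[k] / len(train_marks) for k in types]

# Predict each test mark by sampling from that distribution.
rng = random.Random(0)
predicted = rng.choices(types, weights=probs, k=len(test_marks))
accuracy = sum(p == y for p, y in zip(predicted, test_marks)) / len(test_marks)

print(f"Empirical mark probabilities: {dict(zip(types, probs))}")
print(f"Accuracy of sampled marks:    {accuracy:.2f}")
```

Because the predictions ignore event history entirely, the score reflects only how imbalanced the event types are.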

### Benchmark 3: Inter-Event Time Distribution

In [None]:
# Benchmark: Inter-event time distribution
intertime_benchmark = InterTimeDistributionBenchmark(
    data_config=data_config_hawkes,
    experiment_id="intertime_hawkes_baseline",
    save_dir="./benchmark_results"
)

print("🎯 Running inter-time distribution benchmark...")
intertime_results = intertime_benchmark.evaluate()

print("📈 Inter-time distribution benchmark results:")
for metric, value in intertime_results.items():
    if isinstance(value, (int, float)):
        print(f"   {metric}: {value:.4f}")
    else:
        print(f"   {metric}: {value}")
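
A distribution-level baseline for inter-event times can be illustrated by resampling the training gaps and comparing the resulting sample with the held-out gaps. The sketch below uses the Kolmogorov-Smirnov distance between empirical CDFs as the comparison. That metric, the helper `ks_distance`, and the toy data are all assumptions for illustration and may differ from what `InterTimeDistributionBenchmark` reports.

```python
import random

# Toy inter-event times, made up for illustration only.
train_gaps = [0.7, 0.8, 1.5, 0.6, 1.5, 0.4, 1.1]
test_gaps = [0.6, 1.5, 0.9, 1.2]

# Predict by resampling (with replacement) from the training gaps.
rng = random.Random(0)
sampled_gaps = rng.choices(train_gaps, k=len(test_gaps))


def ks_distance(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)
    # Both CDFs are step functions that only change at observed values,
    # so it is enough to compare them at those points.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))


distance = ks_distance(sampled_gaps, test_gaps)
print(f"KS distance between sampled and held-out gaps: {distance:.3f}")
```

A distance of 0 means the two samples have identical empirical distributions, and 1 means they do not overlap at all.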

## 4. Comparative analysis {#analysis}

### Baseline Comparison

In [None]:
# Collect results for comparison
all_benchmarks = {
    'Mean Inter-Time': mean_results,
    'Mark Distribution': mark_results,
    'Inter-Time Distribution': intertime_results
}

print("📊 Comparative benchmark summary:")
print("=" * 60)

# Structured display of results
for benchmark_name, results in all_benchmarks.items():
    print(f"\n🎯 {benchmark_name}:")
    for metric, value in results.items():
        if isinstance(value, (int, float)):
            print(f"   {metric:<25}: {value:>10.4f}")
        else:
            print(f"   {metric:<25}: {str(value):>10}")

### Benchmarks on Different Datasets

Let's compare performance on different types of data.

In [None]:
# Benchmarks on different datasets
datasets_to_test = [
    ('test', 'pickle', 2),
    ('synthetic_hawkes', 'pickle', 2),
    ('synthetic_hawkes_3d', 'pickle', 3)
]

comparative_results = {}

for dataset_id, data_format, num_types in datasets_to_test:
    print(f"\n🧪 Testing on {dataset_id}...")
    
    try:
        # Configuration for this dataset
        test_config = DataConfig(
            dataset_id=dataset_id,
            data_format=data_format,
            num_event_types=num_types,
            batch_size=16
        )
        
        # Test with mean benchmark
        test_benchmark = MeanInterTimeBenchmark(
            data_config=test_config,
            experiment_id=f"mean_{dataset_id}",
            save_dir="./comparative_results"
        )
        
        results = test_benchmark.evaluate()
        comparative_results[dataset_id] = results
        
        print(f"   ✅ Success on {dataset_id}")
        
    except Exception as e:
        print(f"   ❌ Error on {dataset_id}: {str(e)[:50]}...")
        comparative_results[dataset_id] = None

print("\n📈 Comparative results by dataset:")
print("=" * 70)

for dataset, results in comparative_results.items():
    print(f"\n📊 {dataset}:")
    if results:
        for metric, value in results.items():
            if isinstance(value, (int, float)):
                print(f"   {metric}: {value:.4f}")
    else:
        print("   ❌ No results available")

## 🎉 Conclusion

This notebook demonstrated:

✅ **Synthetic data generation** with different simulators
- Bivariate and multivariate Hawkes processes
- Self-correcting process
- Flexible parameter configuration

✅ **Evaluation with simple baselines**
- Mean inter-event time prediction
- Event type distribution
- Inter-event time distribution

✅ **Comparative analysis** on different datasets
- Systematic performance comparison
- Evaluation on real and synthetic data

### 🚀 Possible applications:

- **Model validation**: Use these baselines as reference points
- **Ablation studies**: Understand the impact of different components
- **Data generation**: Create datasets to test new algorithms
- **Exploratory analysis**: Understand data characteristics