### Introduction


**1. Experimental Configs** 

Set up:
1. Graph sizes for the sparse and dense runs
2. Random walk (RW) parameters
3. GPyTorch / linear_operator settings
4. GPU / CPU device selection
5. Random seeds

**2. Data Synthesis**

Generate synthetic ring graph data for the different graph sizes and store it in *'experiments_sparse/scaling_exp/synthetic_data'*
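
As a rough illustration of what gets synthesized, here is a dependency-free sketch of the ring signal and the train/val/test split. The signal formula and the flooring of the split sizes follow `generate_ring_graph_data` in this notebook; the function names `ring_signal` and `split_indices` are hypothetical, and the sketch uses Python's `random` module rather than NumPy, so the sampled values will not match the cached data.

```python
import math
import random

def ring_signal(n_nodes, beta_sample=1.0, noise_std=0.1, seed=42):
    """Smooth target on a ring plus Gaussian noise (illustrative sketch)."""
    rng = random.Random(seed)
    # Evenly spaced angles in [0, 2*pi), one per node
    angles = [2 * math.pi * i / n_nodes for i in range(n_nodes)]
    y_true = [
        beta_sample * (2 * math.sin(2 * a) + 0.5 * math.cos(4 * a) + 0.3 * math.sin(a))
        for a in angles
    ]
    y_observed = [y + rng.gauss(0.0, noise_std) for y in y_true]
    return y_true, y_observed

def split_indices(n_nodes, splits=(0.6, 0.2, 0.2), seed=42):
    """Shuffle node indices; train/val sizes are floored, test takes the remainder."""
    rng = random.Random(seed)
    indices = list(range(n_nodes))
    rng.shuffle(indices)
    n_train = int(splits[0] * n_nodes)
    n_val = int(splits[1] * n_nodes)
    return indices[:n_train], indices[n_train:n_train + n_val], indices[n_train + n_val:]

y_true, y_observed = ring_signal(32)
train_idx, val_idx, test_idx = split_indices(32)
print(len(train_idx), len(val_idx), len(test_idx))  # 19 6 7
```

Because the train and validation sizes are floored, the test set absorbs the rounding remainder (7 of 32 nodes here rather than 6).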

**3. Random Walk Sampling + Compute Step Matrices**

For both the dense and sparse settings, run the random walk (RW) sampling scheme once per seed:

1. Load the synthetic graphs
2. Run RW sampling for the specified graph sizes
3. Store the step matrices as pickle files in *'experiments_sparse/scaling_exp/step_matrices'*
4. Measure the in-memory size of the step matrices and the RW sampling time, and store the results in *'experiments_sparse/scaling_exp/stats'*
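
The sampling step above can be sketched in pure Python. This is only one plausible reading of what a "step matrix" holds (importance-weighted visit counts of walks that started at node `i` and sit at node `j` after a given number of steps); the actual `GraphPreprocessor` and `DenseRandomWalk` implementations may differ. The names `ring_neighbors` and `sample_step_matrices` are hypothetical.

```python
import random

def ring_neighbors(n_nodes):
    """Adjacency lists of an unweighted ring graph."""
    return {i: [(i - 1) % n_nodes, (i + 1) % n_nodes] for i in range(n_nodes)}

def sample_step_matrices(neighbors, walks_per_node, p_halt, max_walk_length, seed):
    """Return one dict-of-keys sparse matrix per walk length 0..max_walk_length-1."""
    rng = random.Random(seed)
    step = [dict() for _ in range(max_walk_length)]
    for start in range(len(neighbors)):
        for _ in range(walks_per_node):
            node, load = start, 1.0
            for length in range(max_walk_length):
                # Deposit the current load at (start, node), averaged over walks
                key = (start, node)
                step[length][key] = step[length].get(key, 0.0) + load / walks_per_node
                if rng.random() < p_halt:
                    break  # walker halts
                degree = len(neighbors[node])
                node = rng.choice(neighbors[node])
                # Importance weight: divide by the probability of the step just taken
                load *= degree / (1.0 - p_halt)
    return step

matrices = sample_step_matrices(ring_neighbors(8), walks_per_node=100,
                                p_halt=0.1, max_walk_length=3, seed=42)
print([len(m) for m in matrices])  # number of nonzeros per step matrix
```

Under these assumptions the length-0 matrix is the identity, and each later matrix only has entries within the corresponding number of hops, which is why the sparse representation stays small on a ring.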

**4. Gaussian Processes: Init, Training and Inference**

For both the dense and sparse settings, run GP model training and inference once per seed, across all graph sizes:

1. Load the step matrices to initialize the kernels
2. Train the model on the GPU for a fixed number of iterations and record the training time
3. Run inference on the GPU and record the inference time
4. Store the timing results in *'experiments_sparse/scaling_exp/stats'*
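
The repeat-and-time loop described above can be sketched without any GP code. Everything here is illustrative: `fake_train` is a hypothetical stand-in for model training, and `time_call`, `run_grid`, and `summarize` are not functions from this notebook.

```python
import statistics
import time

def time_call(fn, *args):
    """Return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def fake_train(n_nodes, seed):
    """Hypothetical stand-in for GP training; does O(n_nodes) work."""
    return sum((i * seed) % 7 for i in range(n_nodes))

def run_grid(sizes, seeds):
    """Time one run per (graph size, seed) pair and collect one row per run."""
    rows = []
    for n_nodes in sizes:
        for seed in seeds:
            _, elapsed = time_call(fake_train, n_nodes, seed)
            rows.append({'n_nodes': n_nodes, 'seed': seed, 'train_time': elapsed})
    return rows

def summarize(rows):
    """Mean and standard deviation of the timings, grouped by graph size."""
    by_size = {}
    for row in rows:
        by_size.setdefault(row['n_nodes'], []).append(row['train_time'])
    return {n: (statistics.mean(t), statistics.stdev(t)) for n, t in by_size.items()}

rows = run_grid([32, 64, 128], [42, 43, 44])
summary = summarize(rows)
```

When the timed function launches GPU work, the clock should only be read after the device has finished its queued kernels (for example after `torch.cuda.synchronize()`), otherwise the measurement can cover only the kernel launches.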


**Analysis and visualization will be in a different notebook**

### Import Packages

In [13]:
%reload_ext autoreload
%autoreload 2

# Core imports and setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.sparse as sp
import networkx as nx
import time
import psutil
import os
import sys
import gc
from tqdm import tqdm
import pickle
import json
from datetime import datetime

# GP framework imports
import torch
import gpytorch
from gpytorch import settings as gsettings
from gpytorch.kernels import MultiDeviceKernel
from linear_operator import settings
from linear_operator.utils import linear_cg
from linear_operator.operators import IdentityLinearOperator
import gpflow
import tensorflow as tf
from sklearn.metrics import mean_squared_error

# Custom imports
sys.path.append('../..')
from efficient_graph_gp.random_walk_samplers.sampler import RandomWalk as DenseRandomWalk, Graph as DenseGraph
from efficient_graph_gp_sparse.preprocessor import GraphPreprocessor
from efficient_graph_gp.gpflow_kernels import GraphGeneralFastGRFKernel
from efficient_graph_gp_sparse.gptorch_kernels_sparse.sparse_grf_kernel import SparseGRFKernel
from efficient_graph_gp_sparse.utils_sparse import SparseLinearOperator

# Set seeds
torch.manual_seed(42)
tf.random.set_seed(42)
np.random.seed(42)

### Helper Functions

In [14]:
def get_memory_usage():
    """Get current memory usage in MB"""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024

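def generate_ring_graph_data(n_nodes, beta_sample=1.0, kernel_std=1.0, noise_std=0.1, 
                           splits=(0.6, 0.2, 0.2), seed=42):
    """Generate synthetic data on a ring graph.

    Note: kernel_std does not affect the generated signal; it only appears in the
    cache filename. The test split is whatever remains after train and val.
    """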
    np.random.seed(seed)
    
    # Create ring graph
    G = nx.cycle_graph(n_nodes)
    A = nx.adjacency_matrix(G).tocsr()
    
    # Generate smooth function on ring
    angles = np.linspace(0, 2*np.pi, n_nodes, endpoint=False)
    y_true = beta_sample * (2*np.sin(2*angles) + 0.5*np.cos(4*angles) + 0.3*np.sin(angles))
    y_observed = y_true + np.random.normal(0, noise_std, n_nodes)
    
    # Create splits
    indices = np.arange(n_nodes)
    train_size = int(splits[0] * n_nodes)
    val_size = int(splits[1] * n_nodes)
    
    train_idx = np.random.choice(indices, train_size, replace=False)
    remaining = np.setdiff1d(indices, train_idx)
    val_idx = np.random.choice(remaining, val_size, replace=False)
    test_idx = np.setdiff1d(remaining, val_idx)
    
    return {
        'A_sparse': A,
        'A_dense': A.toarray().astype(np.float64),
        'G': G,
        'y_true': y_true,
        'y_observed': y_observed,
        'X_train': train_idx.reshape(-1, 1).astype(np.float64),
        'y_train': y_observed[train_idx].reshape(-1, 1),
        'X_val': val_idx.reshape(-1, 1).astype(np.float64),
        'y_val': y_observed[val_idx].reshape(-1, 1),
        'X_test': test_idx.reshape(-1, 1).astype(np.float64),
        'y_test': y_observed[test_idx].reshape(-1, 1),
        'train_idx': train_idx,
        'val_idx': val_idx,
        'test_idx': test_idx
    }

def save_data(data, filepath):
    """Save data to pickle file"""
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    with open(filepath, 'wb') as f:
        pickle.dump(data, f)

def load_data(filepath):
    """Load data from pickle file"""
    with open(filepath, 'rb') as f:
        return pickle.load(f)

def get_data_filepath(n_nodes, data_dir, params):
    """Generate filepath for cached data"""
    filename = f"ring_n{n_nodes}_beta{params['beta_sample']}_std{params['kernel_std']}_noise{params['noise_std']}_seed{params['seed']}.pkl"
    return os.path.join(data_dir, filename)

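def generate_and_cache_data(n_nodes, data_dir, beta_sample=1.0, kernel_std=1.0, 
                          noise_std=0.1, splits=(0.6, 0.2, 0.2), seed=42):
    """Generate data and cache to disk, or load from cache if it exists.

    Note: splits is not part of the cache filename, so changing it does not
    invalidate an existing cache file.
    """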
    params = {
        'beta_sample': beta_sample,
        'kernel_std': kernel_std, 
        'noise_std': noise_std,
        'seed': seed
    }
    
    filepath = get_data_filepath(n_nodes, data_dir, params)
    
    if os.path.exists(filepath):
        print(f"Loading cached data for {n_nodes} nodes...")
        return load_data(filepath)
    else:
        print(f"Generating data for {n_nodes} nodes...")
        data = generate_ring_graph_data(n_nodes, beta_sample, kernel_std, noise_std, splits, seed)
        save_data(data, filepath)
        return data

def to_device(data, device):
    """Helper function to move data to device"""
    if isinstance(data, dict):
        return {k: to_device(v, device) for k, v in data.items()}
    elif isinstance(data, (list, tuple)):
        return [to_device(item, device) for item in data]
    elif isinstance(data, torch.Tensor):
        return data.to(device)
    else:
        return data

def save_experiment_results(df, experiment_name, stats_dir, config_params=None):
    """
    Save experiment results with timestamped files and configuration
    
    Args:
        df: DataFrame with results
        experiment_name: Name for the experiment (e.g., 'sparse_gp_scaling')
        stats_dir: Directory to save results
        config_params: Optional dict of configuration parameters
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
    os.makedirs(stats_dir, exist_ok=True)
    
    # Save main results
    main_file = os.path.join(stats_dir, f'{experiment_name}_stats.csv')
    timestamped_file = os.path.join(stats_dir, f'{experiment_name}_stats_{timestamp}.csv')
    
    df.to_csv(main_file, index=False)
    df.to_csv(timestamped_file, index=False)
    
    # Save configuration if provided
    if config_params:
        config_summary = {
            'timestamp': timestamp,
            'total_experiments': len(df),
            'experiment_name': experiment_name,
            **config_params
        }
        
        with open(os.path.join(stats_dir, f'{experiment_name}_config_{timestamp}.json'), 'w') as f:
            json.dump(config_summary, f, indent=2)
    
    # Compute and save summary statistics
    if len(df) > 0:
        # Group by graph size if 'n_nodes' column exists
        if 'n_nodes' in df.columns:
            numeric_cols = df.select_dtypes(include=[np.number]).columns
            summary = df.groupby('n_nodes')[numeric_cols].agg(['mean', 'std', 'min', 'max']).round(4)
        else:
            numeric_cols = df.select_dtypes(include=[np.number]).columns
            summary = df[numeric_cols].agg(['mean', 'std', 'min', 'max']).round(4)
        
        summary.to_csv(os.path.join(stats_dir, f'{experiment_name}_summary_{timestamp}.csv'))
    
    print(f"📁 {experiment_name} results saved:")
    print(f"   Main file: {main_file}")
    print(f"   Timestamped: {timestamped_file}")
    if config_params:
        print(f"   Config: {experiment_name}_config_{timestamp}.json")
    # The summary is written whenever there are results, independent of config_params
    if len(df) > 0:
        print(f"   Summary: {experiment_name}_summary_{timestamp}.csv")
    
    return timestamped_file

### Exp Configs

In [None]:
# EXPERIMENTAL CONFIGURATION PARAMETERS
# =====================================

# GPyTorch & Linear Operator settings
settings.verbose_linalg._default = False
settings._fast_covar_root_decomposition._default = False
gsettings.max_cholesky_size._global_value = 0
gsettings.cg_tolerance._global_value = 1e-2
gsettings.max_lanczos_quadrature_iterations._global_value = 1
gsettings.num_trace_samples._global_value = 64
gsettings.min_preconditioning_size._global_value = 1e10 #TODO: Enable preconditioning in future

# Random Walk Parameters
WALKS_PER_NODE = 100
P_HALT = 0.1
MAX_WALK_LENGTH = 3

# GP Graph Sizes
GP_GRAPH_SIZES = [2**i for i in range(5, 8)]
GP_SPARSE_ONLY_SIZES = [2**i for i in range(8, 11)]

# Training Parameters
N_EPOCHS = 50
TRAIN_RATIO = 0.6
NOISE_STD = 0.1 # Noise in synthetic data
INITIAL_NOISE_VARIANCE = 0.1
LEARNING_RATE = 0.1

# Device configuration
output_device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
n_devices = torch.cuda.device_count()
print(f"Output device: {output_device}, Number of GPUs: {n_devices}")

# Number of Repeats & Random Seeds
N_REPEATS = 5
RW_SEEDS = [42 + i for i in range(N_REPEATS)]

# Data synthesis parameters
DATA_SYNTHESIS_PARAMS = {
    'beta_sample': 1.0,
    'kernel_std': 1.0,
    'noise_std': 0.1,
    'splits': [0.6, 0.2, 0.2],
    'seed': 42
}

# Data directory
DATA_DIR = os.path.join(os.getcwd(), 'synthetic_data')

Output device: cuda:0, Number of GPUs: 2


### Data Synthesis

In [16]:
def synthesize_all_data():
    print("Synthesizing ring graph data for all graph sizes...")
    
    all_sizes = GP_GRAPH_SIZES + GP_SPARSE_ONLY_SIZES
    
    for n_nodes in tqdm(all_sizes, desc="Generating datasets"):
        data = generate_and_cache_data(
            n_nodes=n_nodes,
            data_dir=DATA_DIR,
            **DATA_SYNTHESIS_PARAMS
        )
        
        # Release the dataset before moving to the next size
        # (gc is already imported at the top of the notebook)
        del data
        if n_nodes >= 10000:
            gc.collect()
    
    print(f"Data synthesis complete. Files stored in: {DATA_DIR}")

# Run data synthesis
synthesize_all_data()

Synthesizing ring graph data for all graph sizes...


Generating datasets:   0%|          | 0/6 [00:00<?, ?it/s]

Loading cached data for 32 nodes...


Generating datasets: 100%|██████████| 6/6 [00:00<00:00, 265.25it/s]

Loading cached data for 64 nodes...
Loading cached data for 128 nodes...
Loading cached data for 256 nodes...
Loading cached data for 512 nodes...
Loading cached data for 1024 nodes...
Data synthesis complete. Files stored in: /scratches/cartwright/mz473/Efficient-Gaussian-Process-on-Graphs/experiments_sparse/scaling_exp/synthetic_data





### RW Sampling

#### Helper Functions

In [17]:
def run_sparse_rw_sampling(data, rw_seed, n_nodes):
    """Run sparse random walk sampling for a single graph"""
    start_time = time.time()
    pp_sparse = GraphPreprocessor(
        adjacency_matrix=data['A_sparse'],
        walks_per_node=WALKS_PER_NODE,
        p_halt=P_HALT,
        max_walk_length=MAX_WALK_LENGTH,
        random_walk_seed=rw_seed,
        load_from_disk=False,
        use_tqdm=False,
        n_processes=4
    )
    
    # The preprocessor returns torch tensors, but we want scipy matrices
    # So we access the scipy matrices directly after preprocessing
    pp_sparse.preprocess_graph(save_to_disk=False)
    step_matrices_scipy = pp_sparse.step_matrices_scipy
    sparse_rw_time = time.time() - start_time
    
    # Calculate object sizes using scipy matrices
    sparse_total_nnz = sum(mat.nnz for mat in step_matrices_scipy)
    sparse_size_mb = sparse_total_nnz * 16 / (1024**2)  # 8 bytes for values + 8 for indices
    sparse_dense_equiv_mb = sum(mat.shape[0] * mat.shape[1] * 8 for mat in step_matrices_scipy) / (1024**2)
    
    return {
        'time': sparse_rw_time,
        'step_matrices': step_matrices_scipy,
        'total_nnz': sparse_total_nnz,
        'size_mb': sparse_size_mb,
        'dense_equiv_mb': sparse_dense_equiv_mb,
        'avg_nnz_per_matrix': sparse_total_nnz / len(step_matrices_scipy),
        'sparsity': sparse_total_nnz / sum(mat.shape[0] * mat.shape[1] for mat in step_matrices_scipy)
    }

def run_dense_rw_sampling(data, rw_seed, n_nodes):
    """Run dense random walk sampling for a single graph"""
    start_time = time.time()
    dense_graph = DenseGraph(data['A_dense'])
    dense_sampler = DenseRandomWalk(dense_graph, seed=rw_seed)
    
    dense_step_matrices = dense_sampler.get_random_walk_matrices(
        WALKS_PER_NODE, P_HALT, MAX_WALK_LENGTH
    )
    dense_rw_time = time.time() - start_time
    
    # Calculate object sizes
    dense_size_mb = dense_step_matrices.nbytes / (1024**2)
    
    return {
        'time': dense_rw_time,
        'step_matrices': dense_step_matrices,
        'size_mb': dense_size_mb
    }

def save_step_matrices(step_matrices_dir, method, n_nodes, rw_seed, step_matrices, config):
    """Save step matrices to disk and return file size"""
    filename = f"step_matrices_{method}_n{n_nodes}_seed{rw_seed}.pkl"
    filepath = os.path.join(step_matrices_dir, filename)
    
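    # The sparse matrices are SciPy objects despite the 'step_matrices_torch' key;
    # load_step_matrices_from_file reads the same key and converts them to torch.
    # The local name avoids shadowing the save_data() helper defined earlier.
    payload = {
        'step_matrices_torch' if method == 'sparse' else 'step_matrices': step_matrices,
        'n_nodes': n_nodes,
        'seed': rw_seed,
        'method': method,
        'config': config
    }
    
    with open(filepath, 'wb') as f:
        pickle.dump(payload, f)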
    
    return os.path.getsize(filepath) / (1024**2)

def process_single_graph(n_nodes, data_dir, step_matrices_dir, rw_seeds):
    """Process a single graph size with all seeds"""
    print(f"\nProcessing {n_nodes} nodes...")
    
    # Load synthetic data
    data = generate_and_cache_data(n_nodes=n_nodes, data_dir=data_dir, **DATA_SYNTHESIS_PARAMS)
    
    # Determine if we should run dense
    run_dense = n_nodes <= 256  # Conservative threshold
    
    results = []
    config = {'walks_per_node': WALKS_PER_NODE, 'p_halt': P_HALT, 'max_walk_length': MAX_WALK_LENGTH}
    
    for seed_idx, rw_seed in enumerate(rw_seeds):
        print(f"  Seed {seed_idx + 1}/{len(rw_seeds)} (seed={rw_seed})")
        
        # Run sparse sampling
        print("    Running sparse preprocessing...")
        sparse_result = run_sparse_rw_sampling(data, rw_seed, n_nodes)
        sparse_file_size = save_step_matrices(step_matrices_dir, 'sparse', n_nodes, rw_seed, 
                                            sparse_result['step_matrices'], config)
        
        # Run dense sampling if applicable
        if run_dense:
            print("    Running dense preprocessing...")
            dense_result = run_dense_rw_sampling(data, rw_seed, n_nodes)
            dense_file_size = save_step_matrices(step_matrices_dir, 'dense', n_nodes, rw_seed,
                                               dense_result['step_matrices'], config)
        else:
            print("    Skipping dense (graph too large)")
            dense_result = None
            dense_file_size = None
        
        # Compile statistics
        stat_entry = {
            'n_nodes': n_nodes,
            'n_edges': data['A_sparse'].nnz // 2,
            'seed': rw_seed,
            'sparse_rw_time': sparse_result['time'],
            'dense_rw_time': dense_result['time'] if dense_result else None,
            'sparse_size_mb': sparse_result['size_mb'],
            'dense_size_mb': dense_result['size_mb'] if dense_result else None,
            'sparse_dense_equiv_mb': sparse_result['dense_equiv_mb'],
            'compression_ratio': (dense_result['size_mb'] / sparse_result['size_mb'] 
                                if dense_result and sparse_result['size_mb'] > 0 else np.nan),
            'time_speedup': (dense_result['time'] / sparse_result['time'] 
                           if dense_result else np.nan),
            'sparse_file_size_mb': sparse_file_size,
            'dense_file_size_mb': dense_file_size,
            'sparse_total_nnz': sparse_result['total_nnz'],
            'sparse_avg_nnz_per_matrix': sparse_result['avg_nnz_per_matrix'],
            'graph_sparsity': data['A_sparse'].nnz / (n_nodes**2),  # fraction of nonzero entries (density)
            'step_matrix_sparsity': sparse_result['sparsity'],  # fraction of nonzero entries (density)
            'run_dense': run_dense
        }
        results.append(stat_entry)
        
        # Cleanup
        del sparse_result
        if dense_result:
            del dense_result
        if n_nodes >= 1000:
            gc.collect()
    
    # Cleanup data
    del data
    if n_nodes >= 1000:
        gc.collect()
    
    return results

def print_summary_statistics(rw_df):
    """Print formatted summary statistics"""
    comparison_df = rw_df[rw_df['run_dense']]
    sparse_only_df = rw_df[~rw_df['run_dense']]
    
    if len(comparison_df) > 0:
        summary = comparison_df.groupby('n_nodes').agg({
            'sparse_rw_time': ['mean', 'std'],
            'dense_rw_time': ['mean', 'std'],
            'time_speedup': ['mean', 'std'],
            'sparse_size_mb': ['mean', 'std'], 
            'dense_size_mb': ['mean', 'std'],
            'compression_ratio': ['mean', 'std']
        }).round(3)
        print("\n📊 Dense vs Sparse Comparison:")
        print(summary)
    
    if len(sparse_only_df) > 0:
        sparse_summary = sparse_only_df.groupby('n_nodes').agg({
            'sparse_rw_time': ['mean', 'std'],
            'sparse_size_mb': ['mean', 'std'],
            'sparse_file_size_mb': ['mean', 'std']
        }).round(3)
        print("\n📊 Sparse-Only Results (Large Graphs):")
        print(sparse_summary)
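
To make the size estimate in `run_sparse_rw_sampling` concrete, here is a small worked calculation using the same constants (16 bytes per stored nonzero, 8 bytes per dense entry). The nonzero count is an assumption: it supposes three step matrices for walk lengths 0, 1, and 2, where a walk on a ring can reach at most 1, 2, and 3 distinct nodes respectively. The helper names are hypothetical.

```python
def sparse_size_mb(total_nnz, bytes_per_nnz=16):
    """Estimated sparse footprint: 8 bytes per value plus 8 bytes of index data."""
    return total_nnz * bytes_per_nnz / 1024**2

def dense_size_mb(n_nodes, n_matrices, bytes_per_entry=8):
    """Footprint of the same matrices stored densely as float64."""
    return n_matrices * n_nodes * n_nodes * bytes_per_entry / 1024**2

n_nodes, n_matrices = 1024, 3
# Assumed worst case on a ring: at most 1 + 2 + 3 nonzeros per row across the three matrices
total_nnz = (1 + 2 + 3) * n_nodes

sparse_mb = sparse_size_mb(total_nnz)
dense_mb = dense_size_mb(n_nodes, n_matrices)
print(sparse_mb, dense_mb, dense_mb / sparse_mb)  # 0.09375 24.0 256.0
```

Under these assumptions the sparse footprint grows linearly with the number of nodes while the dense equivalent grows quadratically, so the ratio doubles each time the graph size doubles.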

#### Run it!

In [18]:
def run_rw_sampling_experiment():
    """Run random walk sampling experiments for all graph sizes and seeds"""
    
    # Setup directories
    step_matrices_dir = os.path.join(os.getcwd(), 'step_matrices')
    stats_dir = os.path.join(os.getcwd(), 'stats')
    os.makedirs(step_matrices_dir, exist_ok=True)
    os.makedirs(stats_dir, exist_ok=True)
    
    # Process all graph sizes
    all_sizes = GP_GRAPH_SIZES + GP_SPARSE_ONLY_SIZES
    all_results = []
    
    print(f"Running RW sampling experiments for {len(all_sizes)} graph sizes with {N_REPEATS} seeds each...")
    
    for n_nodes in tqdm(all_sizes, desc="Graph sizes"):
        graph_results = process_single_graph(n_nodes, DATA_DIR, step_matrices_dir, RW_SEEDS)
        all_results.extend(graph_results)
    
    # Save and summarize results
    rw_df = pd.DataFrame(all_results)
    stats_file = os.path.join(stats_dir, 'rw_sampling_stats.csv')
    rw_df.to_csv(stats_file, index=False)
    
    print("\n✅ RW sampling complete!")
    print(f"   Step matrices saved to: {step_matrices_dir}")
    print(f"   Statistics saved to: {stats_file}")
    print(f"   Processed {len(all_results)} experiments")
    
    print_summary_statistics(rw_df)
    return rw_df

# Run the experiment
rw_results_df = run_rw_sampling_experiment()

Running RW sampling experiments for 6 graph sizes with 5 seeds each...


Graph sizes:   0%|          | 0/6 [00:00<?, ?it/s]


Processing 32 nodes...
Loading cached data for 32 nodes...
  Seed 1/5 (seed=42)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 2/5 (seed=43)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 3/5 (seed=44)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 4/5 (seed=45)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 5/5 (seed=46)
    Running sparse preprocessing...
    Running dense preprocessing...


Graph sizes:  17%|█▋        | 1/6 [00:04<00:21,  4.24s/it]


Processing 64 nodes...
Loading cached data for 64 nodes...
  Seed 1/5 (seed=42)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 2/5 (seed=43)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 3/5 (seed=44)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 4/5 (seed=45)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 5/5 (seed=46)
    Running sparse preprocessing...
    Running dense preprocessing...


Graph sizes:  33%|███▎      | 2/6 [00:09<00:18,  4.69s/it]


Processing 128 nodes...
Loading cached data for 128 nodes...
  Seed 1/5 (seed=42)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 2/5 (seed=43)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 3/5 (seed=44)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 4/5 (seed=45)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 5/5 (seed=46)
    Running sparse preprocessing...
    Running dense preprocessing...


Graph sizes:  50%|█████     | 3/6 [00:15<00:15,  5.19s/it]


Processing 256 nodes...
Loading cached data for 256 nodes...
  Seed 1/5 (seed=42)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 2/5 (seed=43)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 3/5 (seed=44)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 4/5 (seed=45)
    Running sparse preprocessing...
    Running dense preprocessing...
  Seed 5/5 (seed=46)
    Running sparse preprocessing...
    Running dense preprocessing...


Graph sizes:  67%|██████▋   | 4/6 [00:24<00:13,  6.77s/it]


Processing 512 nodes...
Loading cached data for 512 nodes...
  Seed 1/5 (seed=42)
    Running sparse preprocessing...
    Skipping dense (graph too large)
  Seed 2/5 (seed=43)
    Running sparse preprocessing...
    Skipping dense (graph too large)
  Seed 3/5 (seed=44)
    Running sparse preprocessing...
    Skipping dense (graph too large)
  Seed 4/5 (seed=45)
    Running sparse preprocessing...
    Skipping dense (graph too large)
  Seed 5/5 (seed=46)
    Running sparse preprocessing...
    Skipping dense (graph too large)


Graph sizes:  83%|████████▎ | 5/6 [00:27<00:05,  5.66s/it]


Processing 1024 nodes...
Loading cached data for 1024 nodes...
  Seed 1/5 (seed=42)
    Running sparse preprocessing...
    Skipping dense (graph too large)
  Seed 2/5 (seed=43)
    Running sparse preprocessing...
    Skipping dense (graph too large)
  Seed 3/5 (seed=44)
    Running sparse preprocessing...
    Skipping dense (graph too large)
  Seed 4/5 (seed=45)
    Running sparse preprocessing...
    Skipping dense (graph too large)
  Seed 5/5 (seed=46)
    Running sparse preprocessing...
    Skipping dense (graph too large)


Graph sizes: 100%|██████████| 6/6 [00:34<00:00,  5.70s/it]

✅ RW sampling complete!
   Step matrices saved to: /scratches/cartwright/mz473/Efficient-Gaussian-Process-on-Graphs/experiments_sparse/scaling_exp/step_matrices
   Statistics saved to: /scratches/cartwright/mz473/Efficient-Gaussian-Process-on-Graphs/experiments_sparse/scaling_exp/stats/rw_sampling_stats.csv
   Processed 30 experiments

📊 Dense vs Sparse Comparison:
        sparse_rw_time        dense_rw_time        time_speedup         \
                  mean    std          mean    std         mean    std   
n_nodes                                                                  
32               0.348  0.007         0.108  0.001        0.309  0.007   
64               0.360  0.011         0.200  0.009        0.555  0.035   
128              0.384  0.012         0.407  0.021        1.060  0.078   
256              0.454  0.014         0.891  0.017        1.966  0.056   

        sparse_size_mb      dense_size_mb      compression_ratio      




### Sparse GP Inference

#### Sparse Helper Functions

In [19]:
# GPU-accelerated Sparse GP Model
class SparseGraphGPModel(gpytorch.models.ExactGP):
    """Sparse Graph GP Model with pathwise conditioning prediction"""
    
    def __init__(self, x_train, y_train, likelihood, step_matrices_torch):
        super().__init__(x_train, y_train, likelihood)
        self.x_train = x_train
        self.y_train = y_train
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = SparseGRFKernel(
            max_walk_length=MAX_WALK_LENGTH, 
            step_matrices_torch=step_matrices_torch
        )
        self.num_nodes = step_matrices_torch[0].shape[0]
        
    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
    
    def predict(self, x_test, n_samples=64):
        """
        Batch pathwise conditioning prediction
        
        f_test_posterior = f_test_prior + K_test_train @ v
        v = (K_train_train + noise_variance*I)^{-1} @ (y_train - (f_train_prior + eps))
        """
        num_train = self.x_train.shape[0]
        train_indices = self.x_train.int().flatten()
        test_indices = x_test.int().flatten()
        
        # Feature matrices
        phi = self.covar_module._get_feature_matrix()
        phi_train = phi[train_indices, :]
        phi_test = phi[test_indices, :]
        
        # Covariance matrices
        K_train_train = phi_train @ phi_train.T
        K_test_train = phi_test @ phi_train.T
        
        # Noise setup
        noise_variance = self.likelihood.noise.item()
        noise_std = torch.sqrt(torch.tensor(noise_variance, device=x_test.device))
        A = K_train_train + noise_variance * IdentityLinearOperator(num_train, device=x_test.device)
        
        # Batch samples
        eps1_batch = torch.randn(n_samples, self.num_nodes, device=x_test.device)
        eps2_batch = noise_std * torch.randn(n_samples, num_train, device=x_test.device)
        
        # Prior samples
        f_test_prior_batch = eps1_batch @ phi_test.T
        f_train_prior_batch = eps1_batch @ phi_train.T
        
        # CG solve
        b_batch = self.y_train.unsqueeze(0) - (f_train_prior_batch + eps2_batch)
        v_batch = linear_cg(A._matmul, b_batch.T, tolerance=gsettings.cg_tolerance.value())
        
        # Posterior
        return f_test_prior_batch + (K_test_train @ v_batch).T

def load_step_matrices_from_file(filepath, device):
    """Load and convert step matrices to torch tensors"""
    with open(filepath, 'rb') as f:
        data = pickle.load(f)
    
    step_matrices_scipy = data['step_matrices_torch']  # These are actually scipy matrices
    
    # Convert to torch tensors using GraphPreprocessor's static method
    step_matrices_torch = []
    for mat in step_matrices_scipy:
        tensor = GraphPreprocessor.from_scipy_csr(mat).to(device)
        step_matrices_torch.append(SparseLinearOperator(tensor))
    
    return step_matrices_torch

def train_sparse_gp(data_torch, step_matrices_torch, n_epochs=50, lr=0.1):
    """Train sparse GP model with GPU"""
    likelihood = gpytorch.likelihoods.GaussianLikelihood().to(output_device)
    model = SparseGraphGPModel(
        data_torch['X_train'], data_torch['y_train'], 
        likelihood, step_matrices_torch
    ).to(output_device)
    
    model.train()
    likelihood.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    
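    # CUDA kernels run asynchronously; synchronize so the timer covers all queued GPU work
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    train_time_start = time.time()
    for i in range(n_epochs):
        optimizer.zero_grad()
        output = model(data_torch['X_train'])
        loss = -mll(output, data_torch['y_train'])
        loss.backward()
        optimizer.step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    train_time = time.time() - train_time_start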
    
    return model, likelihood, train_time

def evaluate_sparse_gp(model, likelihood, data_torch, n_samples=64):
    """Evaluate sparse GP with pathwise sampling"""
    model.eval()
    likelihood.eval()
    
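    if torch.cuda.is_available():
        torch.cuda.synchronize()
    inference_time_start = time.time()
    with torch.no_grad():
        test_samples = model.predict(data_torch['X_test'], n_samples)
        test_mean = test_samples.mean(dim=0)
        test_std = test_samples.std(dim=0)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the timer
    inference_time = time.time() - inference_time_start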
    
    test_rmse = torch.sqrt(torch.mean((data_torch['y_test'] - test_mean) ** 2)).item()
    
    return {
        'test_rmse': test_rmse,
        'test_mean': test_mean,
        'test_std': test_std, 
        'inference_time': inference_time,
        'noise_variance': likelihood.noise.item(),
        'modulator': model.covar_module.modulator_vector.detach().cpu().numpy()
    }

def run_sparse_gp_experiment(n_nodes, rw_seed, n_epochs=N_EPOCHS):
    """Run complete sparse GP experiment for one graph size and seed"""
    print(f"  Processing {n_nodes} nodes, seed {rw_seed}")
    
    # Load data
    data = generate_and_cache_data(n_nodes, DATA_DIR, **DATA_SYNTHESIS_PARAMS)
    
    # Convert to torch tensors on GPU
    data_torch = {
        'X_train': torch.tensor(data['train_idx'], dtype=torch.float32, device=output_device).unsqueeze(1),
        'y_train': torch.tensor(data['y_train'].flatten(), dtype=torch.float32, device=output_device),
        'X_test': torch.tensor(data['test_idx'], dtype=torch.float32, device=output_device).unsqueeze(1),
        'y_test': torch.tensor(data['y_test'].flatten(), dtype=torch.float32, device=output_device)
    }
    
    # Load step matrices
    step_matrices_file = os.path.join(os.getcwd(), 'step_matrices', f'step_matrices_sparse_n{n_nodes}_seed{rw_seed}.pkl')
    if not os.path.exists(step_matrices_file):
        print(f"    Step matrices not found: {step_matrices_file}")
        return None
    
    step_matrices_torch = load_step_matrices_from_file(step_matrices_file, output_device)
    
    # Train model
    model, likelihood, train_time = train_sparse_gp(data_torch, step_matrices_torch, n_epochs, LEARNING_RATE)
    
    # Evaluate model
    eval_results = evaluate_sparse_gp(model, likelihood, data_torch)
    
    return {
        'n_nodes': n_nodes,
        'seed': rw_seed,
        'n_train': len(data['train_idx']),
        'n_test': len(data['test_idx']),
        'train_time': train_time, 
        'inference_time': eval_results['inference_time'],
        'total_time': train_time + eval_results['inference_time'],
        'test_rmse': eval_results['test_rmse'],
        'noise_variance': eval_results['noise_variance'],
        'modulator_l2': np.linalg.norm(eval_results['modulator'])
    }

def run_sparse_gp_scaling_experiment():
    """Run sparse GP experiments across all sizes and seeds"""
    gp_results = []
    all_sizes = GP_GRAPH_SIZES + GP_SPARSE_ONLY_SIZES
    
    print(f"Running sparse GP experiments for {len(all_sizes)} sizes × {len(RW_SEEDS)} seeds...")
    
    for n_nodes in tqdm(all_sizes, desc="Graph sizes"):
        for rw_seed in RW_SEEDS:
            result = run_sparse_gp_experiment(n_nodes, rw_seed)
            if result:
                gp_results.append(result)
            
            # Cleanup GPU memory
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
            gc.collect()
    
    # Save results using helper function
    gp_df = pd.DataFrame(gp_results)
    stats_dir = os.path.join(os.getcwd(), 'stats') 
    
    config_params = {
        'graph_sizes': sorted(gp_df['n_nodes'].unique().tolist()),
        'seeds': sorted(gp_df['seed'].unique().tolist()),
        'n_epochs': N_EPOCHS,
        'learning_rate': LEARNING_RATE,
        'walks_per_node': WALKS_PER_NODE,
        'p_halt': P_HALT,
        'max_walk_length': MAX_WALK_LENGTH,
        'device': str(output_device)
    }
    
    save_experiment_results(gp_df, 'sparse_gp_scaling', stats_dir, config_params)
    
    print(f"\n✅ Sparse GP scaling complete! Processed {len(gp_results)} experiments")
    return gp_df
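
For reference, the pathwise conditioning update implemented in `SparseGraphGPModel.predict` can be written out as follows. The notation is introduced here for illustration: $\Phi_{\mathrm{train}}$ and $\Phi_{\mathrm{test}}$ are the rows of the feature matrix `phi` selected by the train and test indices, $N$ is the number of graph nodes, $n$ is the number of training points, and $\sigma^2$ is the likelihood noise variance.

```latex
\begin{aligned}
\varepsilon_1 &\sim \mathcal{N}(0, I_N), \qquad \varepsilon_2 \sim \mathcal{N}(0, \sigma^2 I_n), \\
f_{\mathrm{train}}^{\mathrm{prior}} &= \Phi_{\mathrm{train}}\,\varepsilon_1, \qquad
f_{\mathrm{test}}^{\mathrm{prior}} = \Phi_{\mathrm{test}}\,\varepsilon_1, \\
v &= \left(\Phi_{\mathrm{train}}\Phi_{\mathrm{train}}^{\top} + \sigma^2 I_n\right)^{-1}
     \left(y_{\mathrm{train}} - f_{\mathrm{train}}^{\mathrm{prior}} - \varepsilon_2\right), \\
f_{\mathrm{test}}^{\mathrm{post}} &= f_{\mathrm{test}}^{\mathrm{prior}}
     + \Phi_{\mathrm{test}}\Phi_{\mathrm{train}}^{\top}\, v .
\end{aligned}
```

Each of the `n_samples` draws uses its own pair $(\varepsilon_1, \varepsilon_2)$, and the linear solve for $v$ is carried out with conjugate gradients rather than an explicit inverse.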

#### Run it!!

In [20]:
# Run sparse GP scaling experiment
sparse_gp_results_df = run_sparse_gp_scaling_experiment()

Running sparse GP experiments for 6 sizes × 5 seeds...


Graph sizes:   0%|          | 0/6 [00:00<?, ?it/s]

  Processing 32 nodes, seed 42
Loading cached data for 32 nodes...
  Processing 32 nodes, seed 43
Loading cached data for 32 nodes...
  Processing 32 nodes, seed 44
Loading cached data for 32 nodes...
  Processing 32 nodes, seed 45
Loading cached data for 32 nodes...
  Processing 32 nodes, seed 46
Loading cached data for 32 nodes...


Graph sizes:  17%|█▋        | 1/6 [00:29<02:26, 29.33s/it]

  Processing 64 nodes, seed 42
Loading cached data for 64 nodes...
  Processing 64 nodes, seed 43
Loading cached data for 64 nodes...
  Processing 64 nodes, seed 44
Loading cached data for 64 nodes...
  Processing 64 nodes, seed 45
Loading cached data for 64 nodes...
  Processing 64 nodes, seed 46
Loading cached data for 64 nodes...


Graph sizes:  33%|███▎      | 2/6 [01:09<02:22, 35.71s/it]

  Processing 128 nodes, seed 42
Loading cached data for 128 nodes...
  Processing 128 nodes, seed 43
Loading cached data for 128 nodes...
  Processing 128 nodes, seed 44
Loading cached data for 128 nodes...
  Processing 128 nodes, seed 45
Loading cached data for 128 nodes...
  Processing 128 nodes, seed 46
Loading cached data for 128 nodes...


Graph sizes:  50%|█████     | 3/6 [02:00<02:08, 42.77s/it]

  Processing 256 nodes, seed 42
Loading cached data for 256 nodes...
  Processing 256 nodes, seed 43
Loading cached data for 256 nodes...
  Processing 256 nodes, seed 44
Loading cached data for 256 nodes...
  Processing 256 nodes, seed 45
Loading cached data for 256 nodes...
  Processing 256 nodes, seed 46
Loading cached data for 256 nodes...


Graph sizes:  67%|██████▋   | 4/6 [02:56<01:35, 47.86s/it]

  Processing 512 nodes, seed 42
Loading cached data for 512 nodes...
  Processing 512 nodes, seed 43
Loading cached data for 512 nodes...
  Processing 512 nodes, seed 44
Loading cached data for 512 nodes...
  Processing 512 nodes, seed 45
Loading cached data for 512 nodes...
  Processing 512 nodes, seed 46
Loading cached data for 512 nodes...


Graph sizes:  83%|████████▎ | 5/6 [03:47<00:49, 49.19s/it]

  Processing 1024 nodes, seed 42
Loading cached data for 1024 nodes...
  Processing 1024 nodes, seed 43
Loading cached data for 1024 nodes...
  Processing 1024 nodes, seed 44
Loading cached data for 1024 nodes...
  Processing 1024 nodes, seed 45
Loading cached data for 1024 nodes...
  Processing 1024 nodes, seed 46
Loading cached data for 1024 nodes...


Graph sizes: 100%|██████████| 6/6 [04:42<00:00, 47.06s/it]


📁 sparse_gp_scaling results saved:
   Main file: /scratches/cartwright/mz473/Efficient-Gaussian-Process-on-Graphs/experiments_sparse/scaling_exp/stats/sparse_gp_scaling_stats.csv
   Timestamped: /scratches/cartwright/mz473/Efficient-Gaussian-Process-on-Graphs/experiments_sparse/scaling_exp/stats/sparse_gp_scaling_stats_20250809_235219.csv
   Config: sparse_gp_scaling_config_20250809_235219.json
   Summary: sparse_gp_scaling_summary_20250809_235219.csv

✅ Sparse GP scaling complete! Processed 30 experiments
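
A quick sanity check on the returned results is to aggregate the timing and error columns per graph size. The sketch below assumes a DataFrame with the same columns as the result dictionaries built above (`n_nodes`, `seed`, `train_time`, `inference_time`, `test_rmse`). The helper name `summarize_by_size` is hypothetical, and the toy values are placeholders, not measurements from this run.

```python
import pandas as pd

def summarize_by_size(df):
    """Mean and std of timing / error columns per graph size (hypothetical helper)."""
    cols = ['train_time', 'inference_time', 'test_rmse']
    summary = df.groupby('n_nodes')[cols].agg(['mean', 'std'])
    # Flatten the (column, statistic) MultiIndex into names like 'train_time_mean'
    summary.columns = [f'{col}_{stat}' for col, stat in summary.columns]
    return summary.reset_index()

# Toy stand-in for sparse_gp_results_df; all values are placeholders
toy_df = pd.DataFrame({
    'n_nodes': [32, 32, 64, 64],
    'seed': [42, 43, 42, 43],
    'train_time': [1.0, 3.0, 2.0, 4.0],
    'inference_time': [0.1, 0.3, 0.2, 0.4],
    'test_rmse': [0.5, 0.7, 0.6, 0.8],
})
summary = summarize_by_size(toy_df)
```

With the real results, `summarize_by_size(sparse_gp_results_df)` would give one row per graph size, aggregated over the random-walk seeds.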


### Dense GP Inference

#### Helper Functions

In [21]:
# Dense GP Model using GPflow
class DenseGPModel:
    """Dense GP model using GPflow"""
    
    def __init__(self, data, step_matrices_dense=None, walks_per_node=WALKS_PER_NODE, p_halt=P_HALT, max_walk_length=MAX_WALK_LENGTH):
        self.data = data
        self.step_matrices_dense = step_matrices_dense
        self.walks_per_node = walks_per_node
        self.p_halt = p_halt
        self.max_walk_length = max_walk_length
        self.model = None
        
    def build_model(self):
        """Build GPflow model with dense kernel"""
        # Create dense kernel with optional pre-computed step matrices
        self.kernel = GraphGeneralFastGRFKernel(
            self.data['A_dense'], 
            walks_per_node=self.walks_per_node, 
            p_halt=self.p_halt, 
            max_walk_length=self.max_walk_length,
            step_matrices=self.step_matrices_dense,  # Pre-computed step matrices
            use_tqdm=False
        )
        
        # Create GPflow model
        self.model = gpflow.models.GPR(
            data=(self.data['X_train'], self.data['y_train']), 
            kernel=self.kernel, 
            noise_variance=INITIAL_NOISE_VARIANCE
        )
        
    def train(self, n_epochs=50):
        """Train the model"""
        optimizer = gpflow.optimizers.Scipy()
        
        def objective():
            return -self.model.log_marginal_likelihood()
        
        optimizer.minimize(
            objective,
            self.model.trainable_variables,
            options={"maxiter": n_epochs},
            compile=False
        )
        
    def predict(self, X_test):
        """Make predictions"""
        mean_pred, var_pred = self.model.predict_f(X_test)
        return mean_pred.numpy(), var_pred.numpy()

def train_dense_gp(data, step_matrices_dense=None, n_epochs=50):
    """Train dense GP model"""
    model_wrapper = DenseGPModel(data, step_matrices_dense)
    
    # Build model (kernel initialization time)
    model_wrapper.build_model()
    
    # Train model
    train_time_start = time.time()
    model_wrapper.train(n_epochs)
    train_time = time.time() - train_time_start
    
    return model_wrapper, train_time

def evaluate_dense_gp(model_wrapper, data):
    """Evaluate dense GP model"""
    inference_time_start = time.time()
    test_mean, test_var = model_wrapper.predict(data['X_test'])
    inference_time = time.time() - inference_time_start
    
    test_std = np.sqrt(test_var.flatten())
    test_rmse = np.sqrt(mean_squared_error(data['y_test'], test_mean))
    
    return {
        'test_rmse': test_rmse,
        'test_mean': test_mean.flatten(),
        'test_std': test_std,
        'inference_time': inference_time,
        'noise_variance': float(model_wrapper.model.likelihood.variance.numpy()),
        'modulator': model_wrapper.kernel.modulator_vector.numpy()
    }

def run_dense_gp_experiment(n_nodes, rw_seed, n_epochs=N_EPOCHS):
    """Run complete dense GP experiment for one graph size and seed"""
    print(f"  Processing {n_nodes} nodes, seed {rw_seed}")
    
    # Load data
    data = generate_and_cache_data(n_nodes, DATA_DIR, **DATA_SYNTHESIS_PARAMS)
    
    # Load dense step matrices if available
    step_matrices_file = os.path.join(os.getcwd(), 'step_matrices', f'step_matrices_dense_n{n_nodes}_seed{rw_seed}.pkl')
    step_matrices_dense = None
    
    if os.path.exists(step_matrices_file):
        with open(step_matrices_file, 'rb') as f:
            step_data = pickle.load(f)
            step_matrices_dense = step_data['step_matrices']
        print(f"    Loaded pre-computed dense step matrices")
    else:
        print(f"    Computing dense step matrices on-the-fly")
    
    # Train model
    model_wrapper, train_time = train_dense_gp(data, step_matrices_dense, n_epochs)
    
    # Evaluate model
    eval_results = evaluate_dense_gp(model_wrapper, data)
    
    return {
        'n_nodes': n_nodes,
        'seed': rw_seed,
        'n_train': len(data['train_idx']),
        'n_test': len(data['test_idx']),
        'train_time': train_time,
        'inference_time': eval_results['inference_time'],
        'total_time': train_time + eval_results['inference_time'],
        'test_rmse': eval_results['test_rmse'],
        'noise_variance': eval_results['noise_variance'],
        'modulator_l2': np.linalg.norm(eval_results['modulator'])
    }

def run_dense_gp_scaling_experiment():
    """Run dense GP experiments across all feasible sizes and seeds"""
    gp_results = []
    
    # Run on the original GP_GRAPH_SIZES (no size restrictions)
    feasible_sizes = GP_GRAPH_SIZES
    
    print(f"Running dense GP experiments for {len(feasible_sizes)} sizes × {len(RW_SEEDS)} seeds...")
    print(f"Sizes: {feasible_sizes}")
    
    for n_nodes in tqdm(feasible_sizes, desc="Graph sizes"):
        for rw_seed in RW_SEEDS:
            result = run_dense_gp_experiment(n_nodes, rw_seed)
            if result:
                gp_results.append(result)
            
            # Cleanup memory
            gc.collect()
    
    # Save results using helper function
    if gp_results:
        gp_df = pd.DataFrame(gp_results)
        stats_dir = os.path.join(os.getcwd(), 'stats')
        
        config_params = {
            'graph_sizes': sorted(gp_df['n_nodes'].unique().tolist()),
            'seeds': sorted(gp_df['seed'].unique().tolist()),
            'n_epochs': N_EPOCHS,
            'walks_per_node': WALKS_PER_NODE,
            'p_halt': P_HALT,
            'max_walk_length': MAX_WALK_LENGTH,
            'framework': 'gpflow_dense'
        }
        
        save_experiment_results(gp_df, 'dense_gp_scaling', stats_dir, config_params)
        
        print(f"\n✅ Dense GP scaling complete! Processed {len(gp_results)} experiments")
        return gp_df
    else:
        print("\n❌ No dense GP experiments completed successfully")
        return pd.DataFrame()
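
The training and inference timings above are taken with `time.time()`. One possible alternative, sketched here, wraps the timed call in a small helper built on `time.perf_counter()`, a monotonic high-resolution clock intended for measuring short durations. The helper name `timed_call` is hypothetical and not part of this notebook.

```python
import time

def timed_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload used here instead of actual model training
def dummy_workload(n):
    return sum(i * i for i in range(n))

value, seconds = timed_call(dummy_workload, 1000)
```

In `train_dense_gp`, this could be used as `_, train_time = timed_call(model_wrapper.train, n_epochs)`. For GPU code such as the sparse GPyTorch runs, CUDA kernels launch asynchronously, so a `torch.cuda.synchronize()` before each clock read is usually needed for the measurement to reflect completed work.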

#### Run it!!

In [23]:
# Run dense GP scaling experiment  
dense_gp_results_df = run_dense_gp_scaling_experiment()

Running dense GP experiments for 3 sizes × 5 seeds...
Sizes: [32, 64, 128]


Graph sizes:   0%|          | 0/3 [00:00<?, ?it/s]

  Processing 32 nodes, seed 42
Loading cached data for 32 nodes...
    Loaded pre-computed dense step matrices
  Processing 32 nodes, seed 43
Loading cached data for 32 nodes...
    Loaded pre-computed dense step matrices
  Processing 32 nodes, seed 44
Loading cached data for 32 nodes...
    Loaded pre-computed dense step matrices
  Processing 32 nodes, seed 45
Loading cached data for 32 nodes...
    Loaded pre-computed dense step matrices
  Processing 32 nodes, seed 46
Loading cached data for 32 nodes...
    Loaded pre-computed dense step matrices


Graph sizes:  33%|███▎      | 1/3 [00:10<00:20, 10.08s/it]

  Processing 64 nodes, seed 42
Loading cached data for 64 nodes...
    Loaded pre-computed dense step matrices
  Processing 64 nodes, seed 43
Loading cached data for 64 nodes...
    Loaded pre-computed dense step matrices
  Processing 64 nodes, seed 44
Loading cached data for 64 nodes...
    Loaded pre-computed dense step matrices
  Processing 64 nodes, seed 45
Loading cached data for 64 nodes...
    Loaded pre-computed dense step matrices
  Processing 64 nodes, seed 46
Loading cached data for 64 nodes...
    Loaded pre-computed dense step matrices


Graph sizes:  67%|██████▋   | 2/3 [00:18<00:09,  9.22s/it]

  Processing 128 nodes, seed 42
Loading cached data for 128 nodes...
    Loaded pre-computed dense step matrices
  Processing 128 nodes, seed 43
Loading cached data for 128 nodes...
    Loaded pre-computed dense step matrices
  Processing 128 nodes, seed 44
Loading cached data for 128 nodes...
    Loaded pre-computed dense step matrices
  Processing 128 nodes, seed 45
Loading cached data for 128 nodes...
    Loaded pre-computed dense step matrices
  Processing 128 nodes, seed 46
Loading cached data for 128 nodes...
    Loaded pre-computed dense step matrices


Graph sizes: 100%|██████████| 3/3 [00:26<00:00,  8.92s/it]


📁 dense_gp_scaling results saved:
   Main file: /scratches/cartwright/mz473/Efficient-Gaussian-Process-on-Graphs/experiments_sparse/scaling_exp/stats/dense_gp_scaling_stats.csv
   Timestamped: /scratches/cartwright/mz473/Efficient-Gaussian-Process-on-Graphs/experiments_sparse/scaling_exp/stats/dense_gp_scaling_stats_20250809_235600.csv
   Config: dense_gp_scaling_config_20250809_235600.json
   Summary: dense_gp_scaling_summary_20250809_235600.csv

✅ Dense GP scaling complete! Processed 15 experiments
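
The dense runs cover only a subset of the sparse graph sizes, so any comparison has to be restricted to the shared sizes. The following sketch assumes both result DataFrames carry the `n_nodes` and `train_time` columns produced above. `compare_train_time` is a hypothetical helper, and the toy values are placeholders rather than results from this notebook.

```python
import pandas as pd

def compare_train_time(dense_df, sparse_df):
    """Mean train time per graph size for both settings, on shared sizes only."""
    dense_mean = (dense_df.groupby('n_nodes')['train_time'].mean()
                  .rename('dense_train_time').reset_index())
    sparse_mean = (sparse_df.groupby('n_nodes')['train_time'].mean()
                   .rename('sparse_train_time').reset_index())
    # Inner merge keeps only the graph sizes present in both experiments
    merged = dense_mean.merge(sparse_mean, on='n_nodes', how='inner')
    merged['dense_over_sparse'] = merged['dense_train_time'] / merged['sparse_train_time']
    return merged

# Placeholder data standing in for dense_gp_results_df / sparse_gp_results_df
toy_dense = pd.DataFrame({'n_nodes': [32, 32, 64], 'train_time': [2.0, 4.0, 6.0]})
toy_sparse = pd.DataFrame({'n_nodes': [32, 64, 64, 128], 'train_time': [1.0, 1.0, 3.0, 5.0]})
comparison = compare_train_time(toy_dense, toy_sparse)
```

Note that the dense model here runs through GPflow while the sparse model runs through GPyTorch, so such a ratio reflects framework and device differences as well as the kernel representation.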
