# PDE-Selector: Paper Reproduction Notebook

This notebook reproduces all results from the paper:

> **A Meta-Learning Framework for Automated Selection of PDE Identification Methods**
>
> Pranav Lende, Georgia Institute of Technology

## Step 0: Overview

**What this experiment does:**

When you have noisy spatiotemporal data and want to identify the governing PDE, which identification method should you use? Running all methods is expensive. This project trains a meta-learning selector that predicts the best method *before* running any identification algorithm.

**The approach:**
1. Extract 12 inexpensive features ("Tiny-12") from raw data: derivative statistics, spectral features, signal statistics
2. Train a Random Forest classifier to predict which of 4 methods (LASSO, STLSQ, WeakIDENT, RobustIDENT) will achieve the lowest error
3. Evaluate using regret: how much higher is the error of the selector's choice than that of the oracle, which always picks the best method for each window?
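
Step 1 can be illustrated with a minimal sketch. The actual Tiny-12 definitions are not listed in this notebook, so the function `sketch_features` and the four quantities it returns are hypothetical. They were chosen only to show one cheap representative of each feature family named above (derivative statistics, spectral features, signal statistics).

```python
import numpy as np

def sketch_features(u, dx=1.0, dt=1.0):
    """Hypothetical cheap features from a window u[t, x]; NOT the paper's Tiny-12."""
    u = np.asarray(u, dtype=float)
    # Derivative statistics: spread of finite-difference derivatives in t and x.
    u_t = np.gradient(u, dt, axis=0)
    u_x = np.gradient(u, dx, axis=1)
    # Spectral feature: share of spatial power in the upper half of the wavenumbers.
    power = (np.abs(np.fft.rfft(u, axis=1)) ** 2).mean(axis=0)
    half = power.size // 2
    high_freq_ratio = power[half:].sum() / (power.sum() + 1e-12)
    # Signal statistic: overall standard deviation of the field.
    return np.array([u_x.std(), u_t.std(), high_freq_ratio, u.std()])

# Usage on a synthetic travelling wave, with and without additive noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 64, endpoint=False)
t = np.linspace(0, 1, 32)
clean = np.sin(x[None, :] - t[:, None])
noisy = clean + 0.1 * rng.standard_normal(clean.shape)

feats_clean = sketch_features(clean, dx=x[1] - x[0], dt=t[1] - t[0])
feats_noisy = sketch_features(noisy, dx=x[1] - x[0], dt=t[1] - t[0])
print("clean:", feats_clean)
print("noisy:", feats_noisy)
```

Features of this kind need only finite differences and one FFT per window, which is what makes them far cheaper than running an identification method.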

**Key results:**
- 97.06% test accuracy in predicting the best method
- 99.4% zero-regret rate (selector matches oracle choice)
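
To make the regret and zero-regret metrics concrete, here is a small worked example. The error values and the selector's choices below are invented for illustration and are not taken from the paper's dataset.

```python
import numpy as np

methods = np.array(["LASSO", "STLSQ", "WeakIDENT", "RobustIDENT"])

# Illustrative errors only: rows are windows, columns follow `methods`.
errors = np.array([
    [0.10, 0.05, 0.20, 0.30],
    [0.40, 0.35, 0.02, 0.08],
    [0.15, 0.15, 0.50, 0.12],
])
# Hypothetical selector output, one method name per window.
selected = np.array(["STLSQ", "WeakIDENT", "LASSO"])

# The oracle error is the lowest error any method achieves on each window.
oracle_error = errors.min(axis=1)

# Look up the error of the method the selector picked on each window.
col = np.array([np.flatnonzero(methods == m)[0] for m in selected])
selector_error = errors[np.arange(len(errors)), col]

regret = selector_error - oracle_error
zero_regret_rate = np.mean(regret == 0)
print("regret per window:", regret)
print(f"zero-regret rate: {zero_regret_rate:.3f}")
```

The selector matches the oracle on the first two windows and pays a regret of about 0.03 on the third, so the zero-regret rate in this toy case is 2/3.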

## Setup

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import warnings
warnings.filterwarnings('ignore')

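# Paths (REPO_ROOT is the parent of the current working directory,
# so this assumes the notebook is run from a direct subdirectory of the repo)
REPO_ROOT = Path('.').resolve().parent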
PAPER_RUN = REPO_ROOT / 'experiments' / 'paper_run_2025-12-18'
print(f"Using frozen results from: {PAPER_RUN}")

## Step 1: Load Frozen Dataset and Show Statistics

In [None]:
# Load the frozen dataset
df = pd.read_csv(PAPER_RUN / 'full_dataset_4methods.csv')
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")

In [None]:
# PDE distribution
print("\n=== PDE Distribution ===")
print(df['pde_type'].value_counts())

In [None]:
# Best method distribution
print("\n=== Best Method Distribution ===")
best_method_counts = df['best_method'].value_counts()
print(best_method_counts)
print("\nPercentages:")
print((best_method_counts / len(df) * 100).round(2))
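
When the best-method labels are imbalanced, classifier accuracy is easier to interpret next to a majority-class baseline, i.e. the accuracy obtained by always predicting the most frequent method. The sketch below uses invented label counts rather than the frozen dataset. With the notebook's data, the same two lines could be applied to `df['best_method']`.

```python
import pandas as pd

# Invented label counts for illustration; not the paper's distribution.
labels = pd.Series(
    ["WeakIDENT"] * 6 + ["LASSO"] * 2 + ["STLSQ"] * 1 + ["RobustIDENT"] * 1
)

shares = labels.value_counts(normalize=True)
majority_method = shares.idxmax()      # most frequent label
majority_baseline_acc = shares.max()   # accuracy of always predicting it

print(f"Majority class: {majority_method}")
print(f"Majority-class baseline accuracy: {majority_baseline_acc:.2%}")
```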

In [None]:
# E2 statistics by method
print("\n=== E2 Error Statistics by Method ===")
methods = ['LASSO', 'STLSQ', 'RobustIDENT', 'WeakIDENT']
for method in methods:
    e2_col = f'{method}_e2'
    if e2_col in df.columns:
        print(f"{method}: mean={df[e2_col].mean():.4f}, median={df[e2_col].median():.4f}")

## Step 2: Train the Selector (Random Forest)

The cells below replicate `scripts/train_models.py` exactly.

In [None]:
# Extract features and labels
feature_cols = [f'feat_{i}' for i in range(12)]
X = df[feature_cols].values
y = df['best_method'].values

print(f"Features shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print(f"Unique labels: {np.unique(y)}")

In [None]:
# Train-test split (same seed as paper)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set: {X_train_scaled.shape[0]} samples")
print(f"Test set: {X_test_scaled.shape[0]} samples")

In [None]:
# Train models
models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'KNN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', random_state=42),
    'Ridge Classifier': RidgeClassifier(random_state=42),
}

results = []
for name, model in models.items():
    # Train
    model.fit(X_train_scaled, y_train)
    
    # Evaluate
    y_pred = model.predict(X_test_scaled)
    test_acc = accuracy_score(y_test, y_pred)
    
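    # 5-fold cross-validation on the training set
    # (cross_val_score fits clones, so the fitted `model` above is unchanged)
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)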
    
    results.append({
        'Model': name,
        'Test Accuracy': test_acc,
        'CV Mean': cv_scores.mean(),
        'CV Std': cv_scores.std()
    })
    print(f"{name}: Test Acc={test_acc:.4f}, CV={cv_scores.mean():.4f}±{cv_scores.std():.4f}")

results_df = pd.DataFrame(results).sort_values('Test Accuracy', ascending=False)
print("\n=== Final Rankings ===")
print(results_df.to_string(index=False))
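
In the cell above, the scaler is fitted on the full training split before `cross_val_score` runs, so each validation fold has already contributed to the scaling statistics. A common variant, sketched below, wraps the scaler and the classifier in a scikit-learn pipeline so that scaling is re-fitted inside every fold. This is not what `scripts/train_models.py` does, and the synthetic data from `make_classification` is only a stand-in for the frozen dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 12 features, 4 classes (assumed shapes, not real data).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=6, n_classes=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline re-fits the scaler on the training part of each CV fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
cv_scores_demo = cross_val_score(pipe, X_tr, y_tr, cv=5)

# Final fit on the whole training split, then score on the held-out split.
pipe.fit(X_tr, y_tr)
test_acc_demo = pipe.score(X_te, y_te)
print(f"CV={cv_scores_demo.mean():.4f}±{cv_scores_demo.std():.4f}, Test Acc={test_acc_demo:.4f}")
```

The difference matters most for scale-sensitive models such as KNN, SVM, and logistic regression. Tree ensembles like Random Forest are largely insensitive to feature scaling.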

## Step 3: Compute Accuracy and Regret

In [None]:
# Use Random Forest (best model)
best_model = RandomForestClassifier(n_estimators=100, random_state=42)
best_model.fit(X_train_scaled, y_train)

# Predict on the full dataset for regret analysis
# (note: this includes the rows the model was trained on)
X_full_scaled = scaler.transform(X)
predictions = best_model.predict(X_full_scaled)

# Compute regret
def compute_regret(row, prediction):
    """Return regret = selector_e2 - oracle_e2 for one dataset row."""
    selector_e2 = row[f'{prediction}_e2']
    oracle_e2 = row['oracle_e2']
    return selector_e2 - oracle_e2

# One regret value per row, pairing each row with the selector's prediction
regrets = np.array([
    compute_regret(row, prediction)
    for (_, row), prediction in zip(df.iterrows(), predictions)
])

print("=== Regret Analysis ===")
print(f"Zero-regret count: {np.sum(regrets == 0)} / {len(regrets)} ({100*np.mean(regrets==0):.1f}%)")
print(f"Mean regret: {regrets.mean():.6f}")
print(f"Max regret: {regrets.max():.4f}")
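
The regret above is evaluated on the full dataset, which includes the rows used to fit the Random Forest. A complementary check is to compute regret on held-out rows only. The sketch below does this by splitting row indices alongside the features. The data frame is synthetic, and its columns (`<method>_e2`, `oracle_e2`, `best_method`) only mirror the naming used in this notebook. The noisy stand-in features are an assumption made so the example is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
methods_demo = ["LASSO", "STLSQ", "WeakIDENT", "RobustIDENT"]
e2_cols = [f"{m}_e2" for m in methods_demo]
n = 300

# Synthetic errors; oracle_e2 and best_method are derived from them.
demo = pd.DataFrame(rng.random((n, 4)), columns=e2_cols)
demo["oracle_e2"] = demo[e2_cols].min(axis=1)
demo["best_method"] = demo[e2_cols].idxmin(axis=1).str.replace("_e2", "", regex=False)

# Stand-in features: noisy copies of the errors (illustration only).
X_demo = demo[e2_cols].to_numpy() + 0.05 * rng.standard_normal((n, 4))

# Split row indices so held-out rows can be looked up in the frame afterwards.
idx_train, idx_test = train_test_split(
    np.arange(n), test_size=0.2, random_state=42, stratify=demo["best_method"]
)
clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_demo[idx_train], demo["best_method"].iloc[idx_train])
pred_test = clf.predict(X_demo[idx_test])

# Regret on held-out rows: error of the predicted method minus the oracle error.
test_rows = demo.iloc[idx_test]
col = np.array([methods_demo.index(m) for m in pred_test])
selector_e2 = test_rows[e2_cols].to_numpy()[np.arange(len(test_rows)), col]
regret_test = selector_e2 - test_rows["oracle_e2"].to_numpy()

print(f"Held-out zero-regret rate: {100 * np.mean(regret_test == 0):.1f}%")
print(f"Held-out mean regret: {regret_test.mean():.6f}")
```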

In [None]:
# Classification report on test set
y_pred_test = best_model.predict(X_test_scaled)
print("\n=== Classification Report (Test Set) ===")
print(classification_report(y_test, y_pred_test))

## Step 4: Regenerate Paper Figures

In [None]:
# Figure 1: Model Comparison
fig, ax = plt.subplots(figsize=(10, 6))
x = range(len(results_df))
ax.bar(x, results_df['Test Accuracy'], color='steelblue', alpha=0.8)
ax.set_xticks(x)
ax.set_xticklabels(results_df['Model'], rotation=45, ha='right')
ax.set_ylabel('Test Accuracy')
ax.set_title('Model Comparison for PDE Method Selection')
ax.set_ylim(0.85, 1.0)
for i, v in enumerate(results_df['Test Accuracy']):
    ax.text(i, v + 0.005, f'{v:.3f}', ha='center', fontsize=9)
plt.tight_layout()
plt.show()

In [None]:
# Figure 2: Confusion Matrix
cm = confusion_matrix(y_test, y_pred_test, labels=best_model.classes_)
fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(cm, cmap='Blues')
ax.set_xticks(range(len(best_model.classes_)))
ax.set_yticks(range(len(best_model.classes_)))
ax.set_xticklabels(best_model.classes_, rotation=45, ha='right')
ax.set_yticklabels(best_model.classes_)
ax.set_xlabel('Predicted')
ax.set_ylabel('True')
ax.set_title('Confusion Matrix (Test Set)')
for i in range(len(best_model.classes_)):
    for j in range(len(best_model.classes_)):
        ax.text(j, i, str(cm[i, j]), ha='center', va='center', 
                color='white' if cm[i, j] > cm.max()/2 else 'black')
plt.colorbar(im)
plt.tight_layout()
plt.show()

In [None]:
# Figure 3: Feature Importance
importances = best_model.feature_importances_
feature_names = [f'feat_{i}' for i in range(12)]
sorted_idx = np.argsort(importances)[::-1]

fig, ax = plt.subplots(figsize=(10, 6))
ax.bar(range(12), importances[sorted_idx], color='forestgreen', alpha=0.8)
ax.set_xticks(range(12))
ax.set_xticklabels([feature_names[i] for i in sorted_idx], rotation=45, ha='right')
ax.set_ylabel('Importance')
ax.set_title('Random Forest Feature Importance')
plt.tight_layout()
plt.show()

print("\nTop 3 features:")
for i in sorted_idx[:3]:
    print(f"  {feature_names[i]}: {importances[i]:.3f}")
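
The `feature_importances_` attribute used above is impurity-based and derived from the training data. Permutation importance on held-out data is a common cross-check. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic data in which, by construction, only the first four of twelve features are informative. It is an illustration under those assumptions, not part of the paper's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False the informative features occupy columns 0..3.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on the held-out split and record the score drop.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
ranking = perm.importances_mean.argsort()[::-1]

print("Top 3 features by permutation importance:")
for i in ranking[:3]:
    print(f"  feat_{i}: {perm.importances_mean[i]:.3f} ± {perm.importances_std[i]:.3f}")
```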

In [None]:
# Figure 4: Regret CDF
sorted_regrets = np.sort(regrets)
cdf = np.arange(1, len(sorted_regrets) + 1) / len(sorted_regrets)

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(sorted_regrets, cdf, 'b-', linewidth=2)
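zero_regret_rate = np.mean(regrets == 0)  # computed from the data rather than hard-coded
ax.axhline(y=zero_regret_rate, color='r', linestyle='--',
           label=f'{100*zero_regret_rate:.1f}% (zero-regret)')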
ax.set_xlabel('Regret')
ax.set_ylabel('Cumulative Probability')
ax.set_title('Cumulative Distribution of Regret')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Figure 5: Best Method Distribution
fig, ax = plt.subplots(figsize=(8, 6))
counts = df['best_method'].value_counts()
ax.bar(counts.index, counts.values, color='coral', alpha=0.8)
ax.set_ylabel('Count')
ax.set_title('Distribution of Best-Performing Methods')
for i, (method, count) in enumerate(counts.items()):
    ax.text(i, count + 50, f'{count}\n({100*count/len(df):.1f}%)', ha='center', fontsize=9)
plt.tight_layout()
plt.show()

## Step 5: Full Rerun (Optional)

⚠️ **Warning**: This takes about 25 minutes because it re-runs all 4 identification methods on every PDE window.

Set `RUN_FULL = True` to execute.

In [None]:
RUN_FULL = False  # Set to True to re-run the full pipeline

if RUN_FULL:
    import subprocess
    import sys

    scripts_dir = REPO_ROOT / 'scripts'

    def run_script(script_name):
        """Run a pipeline script with the current interpreter; raise if it fails."""
        result = subprocess.run(
            [sys.executable, script_name],  # same Python as the notebook kernel
            cwd=scripts_dir,
            capture_output=True,
            text=True
        )
        print(result.stdout)
        if result.returncode != 0:
            # Stop here so later steps never run on stale or missing outputs
            raise RuntimeError(f"{script_name} failed:\n{result.stderr}")

    print("Step 1: Running all IDENT methods on PDE windows (~25 min)...")
    run_script('run_all_methods.py')

    print("\nStep 2: Training models...")
    run_script('train_models.py')

    print("\nStep 3: Generating figures...")
    run_script('generate_figures.py')

    print("\n✅ Full pipeline complete!")
else:
    print("Skipping full rerun. Set RUN_FULL = True to execute.")

## Summary

This notebook reproduced the key results from the paper:

| Metric | Value |
|--------|-------|
| Dataset size | 5,786 windows |
| PDEs | KdV, Heat, KS, Transport |
| Methods | LASSO, STLSQ, WeakIDENT, RobustIDENT |
| Best classifier | Random Forest |
| Test accuracy | 97.06% |
| Zero-regret rate | 99.4% |
| Mean regret | 0.0002 |

The frozen results are stored in `experiments/paper_run_2025-12-18/` for exact reproducibility.