# DataGen - Physics-SR Framework v3.0 Benchmark

## Benchmark Data Generation Module

**Author:** Zhengze Zhang  
**Affiliation:** Department of Statistics, Columbia University  
**Date:** January 2026

---

### Purpose

This notebook generates synthetic test datasets for benchmarking the Physics-SR Framework v3.0.

**Test Equations:**

| # | Name | Equation | Type | Difficulty |
|---|------|----------|------|------------|
| 1 | Coulomb | $F = k \cdot q_1 \cdot q_2 / r^2$ | Rational | Medium |
| 2 | Newton | $F = G \cdot m_1 \cdot m_2 / r^2$ | Rational | Medium |
| 3 | Ideal Gas | $P = n \cdot R \cdot T / V$ | Rational | Easy |
| 4 | Damped | $x = A \cdot e^{-b t} \cdot \cos(\omega t)$ | Nested | Hard |

**Generated Configurations:**
- Sample sizes: 250, 500, 750
- Noise levels: 0%, 5%
- Dummy features: 0, 5

### Output

Datasets are saved to the `data/` directory as `.npz` files.
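
As a rough illustration of what a saved dataset can look like on disk, the sketch below writes two arrays plus a nested metadata dictionary to an `.npz` file and reads them back. The file name follows the `eq{N}_{name}_n{samples}_noise{level}_dummy{count}.npz` pattern used by the generator in Section 3. The array contents, the `meta` key, and the temporary directory are illustrative assumptions, not the notebook's actual save routine.

```python
import tempfile
from pathlib import Path

import numpy as np

# Illustrative payload: a small feature matrix, a target, and nested metadata.
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 1.0, size=(5, 3))
y = X[:, 0] * X[:, 1] / X[:, 2] ** 2
meta = {"physical_bounds": {"r": {"min": 0.0, "max": None}}}  # hypothetical

# File name in the generator's pattern (equation 1, 5 samples, no noise, no dummies).
eq_num, eq_name, n_samples, noise_level, n_dummy = 1, "coulomb", 5, 0.0, 0
filename = f"eq{eq_num}_{eq_name}_n{n_samples}_noise{noise_level:.2f}_dummy{n_dummy}.npz"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / filename
    # A dict is not an array, so wrap it in a 0-d object array; NumPy pickles it.
    np.savez(path, X=X, y=y, meta=np.array(meta, dtype=object))

    # Object arrays are only restored when allow_pickle=True is passed explicitly.
    with np.load(path, allow_pickle=True) as data:
        X_loaded = data["X"]
        meta_loaded = data["meta"].item()  # unwrap the 0-d object array

print(filename)
print(np.array_equal(X, X_loaded), meta_loaded == meta)
```

Because loading pickled objects can execute arbitrary code, `allow_pickle=True` is best reserved for files you generated yourself.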

---
## Section 1: Header and Imports

In [None]:
# ==============================================================================
# ENVIRONMENT RESET AND FRESH CLONE
# ==============================================================================
# Run this cell to clear runtime memory and clone the latest repository.
# This ensures you are working with the most up-to-date code.

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    import os
    import shutil
    import gc
    
    # Clear memory
    gc.collect()
    
    # Remove existing repository if present
    repo_path = '/content/Physics-Informed-Symbolic-Regression'
    if os.path.exists(repo_path):
        shutil.rmtree(repo_path)
        print("[OK] Removed existing repository.")
    
    # Remove any cached data
    cache_paths = [
        '/content/sample_data',
        '/root/.cache',
    ]
    for path in cache_paths:
        if os.path.exists(path):
            try:
                shutil.rmtree(path)
            except OSError:
                # Best-effort cleanup: skip paths that cannot be removed
                pass
    
    # Clone fresh repository
    !git clone https://github.com/Garthzzz/Physics-Informed-Symbolic-Regression.git
    print("[OK] Fresh repository cloned.")
    
    # Change to repository directory
    %cd /content/Physics-Informed-Symbolic-Regression
    
    # Verify clone
    !git log --oneline -3
    print()
    print("[OK] Environment reset complete. Ready to run.")
else:
    print("[INFO] Not in Colab environment. Skipping reset.")

In [None]:
"""
DataGen.ipynb - Benchmark Data Generation
==========================================

Physics-SR Framework v3.0 Benchmark Suite

This module provides:
- BaseTestEquation: Abstract base class for test equations
- CoulombEquation: Electrostatic force (rational, inverse-square)
- NewtonGravityEquation: Gravitational force (rational)
- IdealGasEquation: Ideal gas law (rational)
- DampedOscillationEquation: Damped harmonic motion (nested)
- BenchmarkDataGenerator: Data generation and management

Author: Zhengze Zhang
Affiliation: Department of Statistics, Columbia University
"""

print("DataGen: Initializing benchmark data generation module...")
print()

In [None]:
# Standard library imports
import os
import sys
import pickle
import warnings
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Union, Any
from datetime import datetime

# Scientific computing
import numpy as np
import pandas as pd
from scipy import stats

# Visualization
import matplotlib.pyplot as plt

print("DataGen: All imports successful.")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

In [None]:
# ==============================================================================
# CONFIGURATION CONSTANTS
# ==============================================================================

# Random seed for reproducibility
RANDOM_SEED = 42

# Data generation parameters
DEFAULT_N_SAMPLES = 500
SAMPLE_SIZES = [250, 500, 750]
NOISE_LEVELS = [0.0, 0.05]
DUMMY_COUNTS = [0, 5]

# Output directory - environment aware
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    # Colab: running from repo root
    DATA_DIR = Path('/content/Physics-Informed-Symbolic-Regression/benchmark/data')
else:
    # Local: running from benchmark directory
    DATA_DIR = Path('data')

# Numerical stability thresholds
Y_MIN_THRESHOLD = 1e-6
Y_MAX_THRESHOLD = 1e8
EPS = 1e-10

print("Configuration constants defined.")
print(f"  Environment: {'Google Colab' if IN_COLAB else 'Local'}")
print(f"  Data directory: {DATA_DIR}")
print(f"  Random seed: {RANDOM_SEED}")
print(f"  Sample sizes: {SAMPLE_SIZES}")
print(f"  Noise levels: {NOISE_LEVELS}")
print(f"  Dummy counts: {DUMMY_COUNTS}")

In [None]:
# ==============================================================================
# USERINPUTS DATACLASS
# ==============================================================================

@dataclass
class UserInputs:
    """
    User-defined inputs required for the Physics-SR Framework.
    
    Attributes
    ----------
    variable_dimensions : Dict[str, List[float]]
        Dictionary mapping variable names to dimensional exponents [M, L, T, Theta].
    target_dimensions : List[float]
        Dimensional exponents [M, L, T, Theta] for the target variable.
    physical_bounds : Dict[str, Dict[str, Optional[float]]]
        Physical constraints: {var_name: {'min': float, 'max': float}}
    variable_mapping : Optional[Dict[str, str]]
        Maps column names to physical variable names.
    unit_conversions : Optional[Dict[str, float]]
        Conversion factors to SI units.
    """
    
    variable_dimensions: Dict[str, List[float]]
    target_dimensions: List[float]
    physical_bounds: Dict[str, Dict[str, Optional[float]]]
    variable_mapping: Optional[Dict[str, str]] = None
    unit_conversions: Optional[Dict[str, float]] = None
    
    def __post_init__(self):
        """Validate inputs after initialization."""
        for var_name, dims in self.variable_dimensions.items():
            if len(dims) != 4:
                raise ValueError(
                    f"Variable '{var_name}' has {len(dims)} dimensional exponents, "
                    f"expected 4 [M, L, T, Theta]"
                )
        if len(self.target_dimensions) != 4:
            raise ValueError(
                f"Target dimensions has {len(self.target_dimensions)} exponents, "
                f"expected 4 [M, L, T, Theta]"
            )
    
    def get_variable_names(self) -> List[str]:
        """Return list of variable names."""
        return list(self.variable_dimensions.keys())
    
    def to_dict(self) -> Dict:
        """Convert to dictionary for serialization."""
        return {
            'variable_dimensions': self.variable_dimensions,
            'target_dimensions': self.target_dimensions,
            'physical_bounds': self.physical_bounds,
            'variable_mapping': self.variable_mapping,
            'unit_conversions': self.unit_conversions
        }
    
    @classmethod
    def from_dict(cls, d: Dict) -> 'UserInputs':
        """Create from dictionary."""
        return cls(
            variable_dimensions=d['variable_dimensions'],
            target_dimensions=d['target_dimensions'],
            physical_bounds=d['physical_bounds'],
            variable_mapping=d.get('variable_mapping'),
            unit_conversions=d.get('unit_conversions')
        )

print("UserInputs dataclass defined.")
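
The length check in `__post_init__` is what catches malformed dimension vectors early. The sketch below uses a trimmed stand-in, `MiniUserInputs` (a hypothetical name that keeps only the two dimension fields and the same rule), to show one valid construction and one rejected construction.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MiniUserInputs:
    """Hypothetical, trimmed stand-in for UserInputs (dimension fields only)."""

    variable_dimensions: Dict[str, List[float]]
    target_dimensions: List[float]

    def __post_init__(self):
        # Same rule as the full class: exactly four exponents [M, L, T, Theta].
        for var_name, dims in self.variable_dimensions.items():
            if len(dims) != 4:
                raise ValueError(
                    f"Variable '{var_name}' has {len(dims)} exponents, expected 4"
                )
        if len(self.target_dimensions) != 4:
            raise ValueError("Target dimensions must have 4 exponents")


# Valid: mass [1,0,0,0] and distance [0,1,0,0] with a force target [1,1,-2,0].
ok = MiniUserInputs({"m1": [1, 0, 0, 0], "r": [0, 1, 0, 0]}, [1, 1, -2, 0])

# Invalid: 'r' is missing its Theta exponent, so construction fails.
try:
    MiniUserInputs({"r": [0, 1, 0]}, [1, 1, -2, 0])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(sorted(ok.variable_dimensions))
print(error_message)
```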

---
## Section 2: Test Equation Classes

In [None]:
# ==============================================================================
# BASE TEST EQUATION CLASS
# ==============================================================================

class BaseTestEquation(ABC):
    """
    Abstract base class for test equations.
    
    All test equations must implement the methods defined here.
    This provides a consistent interface for data generation.
    """
    
    # Class attributes to be defined by subclasses
    name: str = "base"
    full_name: str = "Base Equation"
    equation_str: str = "y = f(x)"
    equation_type: str = "abstract"
    
    true_feature_names: List[str] = []
    dummy_feature_pool: List[str] = []
    
    @abstractmethod
    def generate_true_features(
        self, 
        n_samples: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """Generate values for true features."""
        pass
    
    @abstractmethod
    def generate_dummy_features(
        self, 
        n_samples: int, 
        n_dummy: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """Generate values for dummy (irrelevant) features."""
        pass
    
    @abstractmethod
    def compute_target(
        self, 
        features: Dict[str, np.ndarray]
    ) -> np.ndarray:
        """Compute target variable from true features."""
        pass
    
    def add_noise(
        self, 
        y: np.ndarray, 
        noise_level: float, 
        seed: int,
        noise_type: str = 'multiplicative'
    ) -> np.ndarray:
        """
        Add noise to target variable.
        
        Parameters
        ----------
        y : np.ndarray
            Clean target values
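        noise_level : float
            Noise scale. For 'multiplicative', the standard deviation of the
            log-normal factor in log space (close to the relative error for
            small values). For 'additive', a fraction of std(y).
        seed : int
            Random seed for reproducibility (offset by 2000 internally)
        noise_type : str
            'multiplicative' or 'additive'; any other value is treated as
            'additive'
        
        Returns
        -------
        np.ndarray
            Noisy copy of y; the input array is not modified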
        """
        if noise_level <= 0:
            return y.copy()
        
        np.random.seed(seed + 2000)
        
        if noise_type == 'multiplicative':
            noise = np.exp(np.random.normal(0, noise_level, len(y)))
            return y * noise
        else:
            noise = np.random.normal(0, noise_level * np.std(y), len(y))
            return y + noise
    
    @abstractmethod
    def get_variable_dimensions(
        self, 
        feature_names: List[str]
    ) -> Dict[str, List[float]]:
        """Return dimensional exponents for given features."""
        pass
    
    @abstractmethod
    def get_target_dimensions(self) -> List[float]:
        """Return dimensional exponents [M, L, T, Theta] for target."""
        pass
    
    @abstractmethod
    def get_physical_bounds(self) -> Dict[str, Dict[str, Optional[float]]]:
        """Return physical bounds for all variables."""
        pass
    
    def create_user_inputs(self, feature_names: List[str]) -> UserInputs:
        """Create UserInputs object for this equation."""
        return UserInputs(
            variable_dimensions=self.get_variable_dimensions(feature_names),
            target_dimensions=self.get_target_dimensions(),
            physical_bounds=self.get_physical_bounds()
        )
    
    def get_ground_truth(self) -> Dict[str, Any]:
        """Return ground truth information."""
        return {
            'name': self.name,
            'full_name': self.full_name,
            'equation': self.equation_str,
            'active_features': self.true_feature_names.copy(),
            'equation_type': self.equation_type
        }
    
    def generate_dataset(
        self,
        n_samples: int,
        noise_level: float,
        n_dummy: int,
        seed: int
    ) -> Dict[str, Any]:
        """Generate complete dataset."""
        # Generate features
        true_features = self.generate_true_features(n_samples, seed)
        dummy_features = self.generate_dummy_features(n_samples, n_dummy, seed)
        
        all_features = {**true_features, **dummy_features}
        
        true_names = list(true_features.keys())
        dummy_names = list(dummy_features.keys())
        feature_names = true_names + dummy_names
        
        X = np.column_stack([all_features[name] for name in feature_names])
        
        # Compute target
        y_true = self.compute_target(true_features)
        y = self.add_noise(y_true, noise_level, seed)
        
        # Create UserInputs
        user_inputs = self.create_user_inputs(feature_names)
        ground_truth = self.get_ground_truth()
        
        return {
            'X': X,
            'y': y,
            'y_true': y_true,
            'feature_names': np.array(feature_names),
            'true_features': np.array(true_names),
            'dummy_features': np.array(dummy_names),
            'equation_name': self.name,
            'equation_str': self.equation_str,
            'equation_type': self.equation_type,
            'n_samples': n_samples,
            'noise_level': noise_level,
            'n_dummy': n_dummy,
            'seed': seed,
            'variable_dimensions': user_inputs.variable_dimensions,
            'target_dimensions': np.array(user_inputs.target_dimensions),
            'physical_bounds': user_inputs.physical_bounds,
            'ground_truth': ground_truth
        }

print("BaseTestEquation class defined.")
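
To make the two noise models concrete, the sketch below re-implements them as a standalone function, `add_noise_sketch` (a hypothetical name), with the same formulas as `add_noise`: a log-normal factor `exp(N(0, sigma))` for multiplicative noise and `N(0, sigma * std(y))` for additive noise. Two things differ from the method above and are assumptions of the sketch: it draws from a local `numpy.random.Generator` rather than the global seed, so the numbers will not match the notebook's output, and it rejects unknown `noise_type` values instead of falling back to additive noise.

```python
import numpy as np


def add_noise_sketch(y, noise_level, seed, noise_type="multiplicative"):
    """Standalone illustration of the two noise models (not the notebook's RNG)."""
    if noise_level <= 0:
        return y.copy()
    rng = np.random.default_rng(seed)
    if noise_type == "multiplicative":
        # Log-normal factor: always positive, so the sign of y is preserved.
        return y * np.exp(rng.normal(0.0, noise_level, len(y)))
    if noise_type == "additive":
        # Absolute scale is tied to the spread of the clean target.
        return y + rng.normal(0.0, noise_level * np.std(y), len(y))
    raise ValueError(f"Unknown noise_type: {noise_type}")  # assumption of the sketch


y_clean = np.linspace(1.0, 100.0, 2000)
y_mult = add_noise_sketch(y_clean, 0.05, seed=42)
y_add = add_noise_sketch(y_clean, 0.05, seed=42, noise_type="additive")

# Multiplicative: the relative error is roughly noise_level at every magnitude.
rel_err = np.std(y_mult / y_clean - 1.0)
# Additive: the absolute error is roughly noise_level * std(y) everywhere.
abs_err = np.std(y_add - y_clean)

print(f"relative error (multiplicative): {rel_err:.3f}")
print(f"absolute error (additive): {abs_err:.3f}")
```

With additive noise, small targets receive the same absolute perturbation as large ones, so their relative error is much larger. That trade-off is worth keeping in mind when comparing results across equations.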

In [None]:
# ==============================================================================
# EQUATION 1: COULOMB'S LAW OF ELECTROSTATIC FORCE
# ==============================================================================

class CoulombEquation(BaseTestEquation):
    """
    Coulomb's Law of Electrostatic Force.
    
    Reference:
        Coulomb, C. A., 1785: Premier memoire sur l'electricite et le magnetisme.
        Feynman Lectures on Physics, Vol. II, Chapter 4.
    
    Equation:
        F = k * q1 * q2 / r^2
    
    Variables:
        q1, q2: Electric charges (C)
        r: Distance between charges (m)
        F: Electrostatic force (N)
        k: Coulomb constant = 8.99e9 N*m^2/C^2
    
    Attainable y range for the sampling ranges below: roughly [9e-5, 9e3] N
    """
    
    name = "coulomb"
    full_name = "Coulomb's Law of Electrostatic Force"
    equation_str = "F = k * q1 * q2 / r^2"
    equation_type = "rational"
    
    true_feature_names = ['q1', 'q2', 'r']
    dummy_feature_pool = ['v1', 'v2', 'T', 'm', 'epsilon']
    
    # Coulomb constant
    K = 8.99e9  # N*m^2/C^2
    
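    # Dimensional exponents [M, L, T, Theta]
    # Note: charge (C = A*s) has no exact form in this basis, which lacks a
    # current dimension; q1 and q2 keep only the time exponent.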
    _dimensions = {
        'q1':      [0, 0, 1, 0],
        'q2':      [0, 0, 1, 0],
        'r':       [0, 1, 0, 0],
        'v1':      [0, 1, -1, 0],
        'v2':      [0, 1, -1, 0],
        'T':       [0, 0, 0, 1],
        'm':       [1, 0, 0, 0],
        'epsilon': [0, 0, 0, 0],
    }
    
    _target_dims = [1, 1, -2, 0]  # N = kg*m/s^2
    
    # Sampling ranges chosen for numerical stability: y spans roughly [9e-5, 9e3] N
    _ranges = {
        'q1':      (1e-6, 1e-4),      # 1 - 100 microCoulombs
        'q2':      (1e-6, 1e-4),      # 1 - 100 microCoulombs
        'r':       (0.1, 10.0),       # 0.1 - 10 meters
        'v1':      (0.0, 100.0),
        'v2':      (0.0, 100.0),
        'T':       (200.0, 400.0),
        'm':       (0.01, 10.0),
        'epsilon': (1.0, 10.0),
    }
    
    def generate_true_features(
        self, 
        n_samples: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """Generate electric charges and distance."""
        np.random.seed(seed)
        return {
            'q1': np.random.uniform(*self._ranges['q1'], n_samples),
            'q2': np.random.uniform(*self._ranges['q2'], n_samples),
            'r':  np.random.uniform(*self._ranges['r'], n_samples),
        }
    
    def generate_dummy_features(
        self, 
        n_samples: int, 
        n_dummy: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """Generate dummy variables."""
        if n_dummy <= 0:
            return {}
        
        np.random.seed(seed + 1000)
        selected = self.dummy_feature_pool[:n_dummy]
        return {
            name: np.random.uniform(*self._ranges[name], n_samples)
            for name in selected
        }
    
    def compute_target(
        self, 
        features: Dict[str, np.ndarray]
    ) -> np.ndarray:
        """Compute electrostatic force using Coulomb's law."""
        q1 = features['q1']
        q2 = features['q2']
        r = features['r']
        return self.K * q1 * q2 / (r ** 2)
    
    def get_variable_dimensions(
        self, 
        feature_names: List[str]
    ) -> Dict[str, List[float]]:
        """Return dimensions for specified features."""
        return {name: self._dimensions[name] for name in feature_names}
    
    def get_target_dimensions(self) -> List[float]:
        """Return target dimensions [M, L, T, Theta]."""
        return self._target_dims.copy()
    
    def get_physical_bounds(self) -> Dict[str, Dict[str, Optional[float]]]:
        """Return physical bounds for all variables."""
        bounds = {'target': {'min': 0.0, 'max': None}}
        for name, (low, high) in self._ranges.items():
            bounds[name] = {'min': 0.0, 'max': high * 10}
        return bounds

print("CoulombEquation class defined.")
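
The target range implied by the sampling intervals follows from the monotonicity of the formula: F grows with q1 and q2 and shrinks with r, so the extremes sit at the corners of the sampling box. A quick check with the values from `_ranges`:

```python
K = 8.99e9  # Coulomb constant, N*m^2/C^2

# Sampling intervals copied from CoulombEquation._ranges.
q_low, q_high = 1e-6, 1e-4   # C
r_low, r_high = 0.1, 10.0    # m

# Smallest force: smallest charges at the largest separation.
f_min = K * q_low * q_low / r_high ** 2
# Largest force: largest charges at the smallest separation.
f_max = K * q_high * q_high / r_low ** 2

print(f"F_min = {f_min:.3e} N")
print(f"F_max = {f_max:.3e} N")
```

The extremes come out at roughly 9e-5 N and 9e3 N, both inside the `Y_MIN_THRESHOLD = 1e-6` and `Y_MAX_THRESHOLD = 1e8` limits from the configuration cell.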

In [None]:
# ==============================================================================
# EQUATION 2: NEWTON'S LAW OF GRAVITATION
# ==============================================================================

class NewtonGravityEquation(BaseTestEquation):
    """
    Newton's Law of Universal Gravitation.
    
    Reference:
        Newton, I., 1687: Philosophiae Naturalis Principia Mathematica.
        Feynman Lectures on Physics, Vol. I, Chapter 7.
    
    Equation:
        F = G * m1 * m2 / r^2
    
    Variables:
        m1, m2: Masses (kg)
        r: Distance between centers (m)
        F: Gravitational force (N)
        G: Gravitational constant = 6.674e-11 N*m^2/kg^2
    
    Note: Uses astronomical-scale masses for numerical stability.
    Attainable y range for the sampling ranges below: roughly [0.67, 6.7e7] N
    """
    
    name = "newton"
    full_name = "Newton's Law of Gravitation"
    equation_str = "F = G * m1 * m2 / r^2"
    equation_type = "rational"
    
    true_feature_names = ['m1', 'm2', 'r']
    dummy_feature_pool = ['v1', 'v2', 'T', 't', 'rho']
    
    # Gravitational constant
    G = 6.674e-11  # N*m^2/kg^2
    
    # Dimensional exponents [M, L, T, Theta]
    _dimensions = {
        'm1':  [1, 0, 0, 0],
        'm2':  [1, 0, 0, 0],
        'r':   [0, 1, 0, 0],
        'v1':  [0, 1, -1, 0],
        'v2':  [0, 1, -1, 0],
        'T':   [0, 0, 0, 1],
        't':   [0, 0, 1, 0],
        'rho': [1, -3, 0, 0],
    }
    
    _target_dims = [1, 1, -2, 0]  # N = kg*m/s^2
    
    # Astronomical-scale ranges for numerical stability
    # y spans roughly [0.67, 6.7e7] N
    _ranges = {
        'm1':  (1e10, 1e12),        # 10^10 - 10^12 kg (asteroid-scale)
        'm2':  (1e10, 1e12),
        'r':   (1e3, 1e5),          # 1 - 100 km
        'v1':  (0.0, 1000.0),
        'v2':  (0.0, 1000.0),
        'T':   (100.0, 500.0),
        't':   (0.0, 1e6),
        'rho': (1000.0, 8000.0),
    }
    
    def generate_true_features(
        self, 
        n_samples: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """Generate masses and distance."""
        np.random.seed(seed)
        return {
            'm1': np.random.uniform(*self._ranges['m1'], n_samples),
            'm2': np.random.uniform(*self._ranges['m2'], n_samples),
            'r':  np.random.uniform(*self._ranges['r'], n_samples),
        }
    
    def generate_dummy_features(
        self, 
        n_samples: int, 
        n_dummy: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """Generate dummy variables."""
        if n_dummy <= 0:
            return {}
        
        np.random.seed(seed + 1000)
        selected = self.dummy_feature_pool[:n_dummy]
        return {
            name: np.random.uniform(*self._ranges[name], n_samples)
            for name in selected
        }
    
    def compute_target(
        self, 
        features: Dict[str, np.ndarray]
    ) -> np.ndarray:
        """Compute gravitational force using Newton's law."""
        m1 = features['m1']
        m2 = features['m2']
        r = features['r']
        return self.G * m1 * m2 / (r ** 2)
    
    def get_variable_dimensions(
        self, 
        feature_names: List[str]
    ) -> Dict[str, List[float]]:
        """Return dimensions for specified features."""
        return {name: self._dimensions[name] for name in feature_names}
    
    def get_target_dimensions(self) -> List[float]:
        """Return target dimensions [M, L, T, Theta]."""
        return self._target_dims.copy()
    
    def get_physical_bounds(self) -> Dict[str, Dict[str, Optional[float]]]:
        """Return physical bounds for all variables."""
        bounds = {'target': {'min': 0.0, 'max': None}}
        for name, (low, high) in self._ranges.items():
            bounds[name] = {'min': 0.0, 'max': high * 10}
        return bounds

print("NewtonGravityEquation class defined.")
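
Dimensional consistency of the formula can be checked by arithmetic on exponent vectors: multiplying quantities adds their `[M, L, T, Theta]` exponents, and raising to a power scales them. The sketch below does this for `G * m1 * m2 / r^2`. The exponent vector for `G` is not stored in the class; it is written out here from its SI units, m^3/(kg*s^2).

```python
import numpy as np

# Exponent vectors [M, L, T, Theta], as in NewtonGravityEquation._dimensions.
dims = {
    "m1": np.array([1, 0, 0, 0]),
    "m2": np.array([1, 0, 0, 0]),
    "r":  np.array([0, 1, 0, 0]),
}
force = np.array([1, 1, -2, 0])  # N = kg*m/s^2

# G in SI base units is m^3 / (kg * s^2).
G_dims = np.array([-1, 3, -2, 0])

# Products add exponents; r^2 in the denominator contributes -2 * dims["r"].
result = G_dims + dims["m1"] + dims["m2"] - 2 * dims["r"]

print("result dims:", result.tolist())
print("matches force:", np.array_equal(result, force))
```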

In [None]:
# ==============================================================================
# EQUATION 3: IDEAL GAS LAW
# ==============================================================================

class IdealGasEquation(BaseTestEquation):
    """
    Ideal Gas Law (solved for pressure).
    
    Reference:
        Classical thermodynamics.
        Feynman Lectures on Physics, Vol. I, Chapter 39.
    
    Equation:
        P = n * R * T / V
    
    Variables:
        n: Amount of substance (mol)
        T: Temperature (K)
        V: Volume (m^3)
        P: Pressure (Pa)
        R: Gas constant = 8.314 J/(mol*K)
    """
    
    name = "ideal_gas"
    full_name = "Ideal Gas Law"
    equation_str = "P = n * R * T / V"
    equation_type = "rational"
    
    true_feature_names = ['n', 'T', 'V']
    dummy_feature_pool = ['m', 'rho', 'c_p', 'mu', 'k_th']
    
    # Gas constant
    R = 8.314  # J/(mol*K)
    
    # Dimensional exponents [M, L, T, Theta]
    # Note: the basis has no amount-of-substance dimension, so n (mol) is
    # entered as dimensionless.
    _dimensions = {
        'n':    [0, 0, 0, 0],
        'T':    [0, 0, 0, 1],
        'V':    [0, 3, 0, 0],
        'm':    [1, 0, 0, 0],
        'rho':  [1, -3, 0, 0],
        'c_p':  [0, 2, -2, -1],
        'mu':   [1, -1, -1, 0],
        'k_th': [1, 1, -3, -1],
    }
    
    _target_dims = [1, -1, -2, 0]  # Pa = kg/(m*s^2)
    
    _ranges = {
        'n':    (0.1, 10.0),
        'T':    (200.0, 500.0),
        'V':    (0.001, 1.0),
        'm':    (0.01, 1.0),
        'rho':  (0.1, 10.0),
        'c_p':  (500.0, 2000.0),
        'mu':   (1e-5, 1e-3),
        'k_th': (0.01, 1.0),
    }
    
    def generate_true_features(
        self, 
        n_samples: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """Generate amount, temperature, and volume."""
        np.random.seed(seed)
        return {
            'n': np.random.uniform(*self._ranges['n'], n_samples),
            'T': np.random.uniform(*self._ranges['T'], n_samples),
            'V': np.random.uniform(*self._ranges['V'], n_samples),
        }
    
    def generate_dummy_features(
        self, 
        n_samples: int, 
        n_dummy: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """Generate dummy thermodynamic properties."""
        if n_dummy <= 0:
            return {}
        
        np.random.seed(seed + 1000)
        selected = self.dummy_feature_pool[:n_dummy]
        return {
            name: np.random.uniform(*self._ranges[name], n_samples)
            for name in selected
        }
    
    def compute_target(
        self, 
        features: Dict[str, np.ndarray]
    ) -> np.ndarray:
        """Compute pressure using ideal gas law."""
        n = features['n']
        T = features['T']
        V = features['V']
        return self.R * n * T / V
    
    def get_variable_dimensions(
        self, 
        feature_names: List[str]
    ) -> Dict[str, List[float]]:
        """Return dimensions for specified features."""
        return {name: self._dimensions[name] for name in feature_names}
    
    def get_target_dimensions(self) -> List[float]:
        """Return target dimensions [M, L, T, Theta]."""
        return self._target_dims.copy()
    
    def get_physical_bounds(self) -> Dict[str, Dict[str, Optional[float]]]:
        """Return physical bounds for all variables."""
        bounds = {'target': {'min': 0.0, 'max': None}}
        for name, (low, high) in self._ranges.items():
            bounds[name] = {'min': 0.0, 'max': high * 10}
        return bounds

print("IdealGasEquation class defined.")
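
A quick numeric sanity check of `P = n * R * T / V`: one mole of gas at 273.15 K in 22.4 litres should sit close to one standard atmosphere (101325 Pa). The molar volume of 22.4 L is the usual rounded textbook figure, so the match is approximate. All three inputs fall inside the sampling ranges defined in `_ranges`.

```python
R = 8.314  # gas constant, J/(mol*K), same value as IdealGasEquation.R

n = 1.0        # mol
T = 273.15     # K
V = 22.4e-3    # m^3 (22.4 litres)

P = n * R * T / V  # Pa

ATM = 101325.0  # standard atmosphere in Pa
print(f"P = {P:.0f} Pa ({P / ATM:.4f} atm)")
```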

In [None]:
# ==============================================================================
# EQUATION 4: DAMPED HARMONIC OSCILLATION
# ==============================================================================

class DampedOscillationEquation(BaseTestEquation):
    """
    Damped Harmonic Oscillation.
    
    Reference:
        Classical mechanics.
        Feynman Lectures on Physics, Vol. I, Chapter 24.
    
    Equation:
        x(t) = A * exp(-b * t) * cos(omega * t)
    
    Variables:
        A: Initial amplitude (m)
        b: Damping coefficient (s^-1)
        omega: Angular frequency (rad/s)
        t: Time (s)
        x: Displacement (m)
    """
    
    name = "damped"
    full_name = "Damped Harmonic Oscillation"
    equation_str = "x = A * exp(-b * t) * cos(omega * t)"
    equation_type = "nested_transcendental"
    
    true_feature_names = ['A', 'b', 'omega', 't']
    dummy_feature_pool = ['m', 'k', 'v0', 'F_ext', 'theta']
    
    # Dimensional exponents [M, L, T, Theta]
    _dimensions = {
        'A':     [0, 1, 0, 0],
        'b':     [0, 0, -1, 0],
        'omega': [0, 0, -1, 0],
        't':     [0, 0, 1, 0],
        'm':     [1, 0, 0, 0],
        'k':     [1, 0, -2, 0],
        'v0':    [0, 1, -1, 0],
        'F_ext': [1, 1, -2, 0],
        'theta': [0, 0, 0, 0],
    }
    
    _target_dims = [0, 1, 0, 0]  # m (displacement)
    
    _ranges = {
        'A':     (1.0, 10.0),
        'b':     (0.1, 1.0),
        'omega': (1.0, 10.0),
        't':     (0.0, 5.0),
        'm':     (0.1, 10.0),
        'k':     (1.0, 100.0),
        'v0':    (-10.0, 10.0),
        'F_ext': (-100.0, 100.0),
        'theta': (0.0, 2*np.pi),
    }
    
    def generate_true_features(
        self, 
        n_samples: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """Generate oscillation parameters."""
        np.random.seed(seed)
        return {
            'A':     np.random.uniform(*self._ranges['A'], n_samples),
            'b':     np.random.uniform(*self._ranges['b'], n_samples),
            'omega': np.random.uniform(*self._ranges['omega'], n_samples),
            't':     np.random.uniform(*self._ranges['t'], n_samples),
        }
    
    def generate_dummy_features(
        self, 
        n_samples: int, 
        n_dummy: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """Generate dummy mechanical parameters."""
        if n_dummy <= 0:
            return {}
        
        np.random.seed(seed + 1000)
        selected = self.dummy_feature_pool[:n_dummy]
        return {
            name: np.random.uniform(*self._ranges[name], n_samples)
            for name in selected
        }
    
    def compute_target(
        self, 
        features: Dict[str, np.ndarray]
    ) -> np.ndarray:
        """Compute displacement for damped oscillation."""
        A = features['A']
        b = features['b']
        omega = features['omega']
        t = features['t']
        return A * np.exp(-b * t) * np.cos(omega * t)
    
    def add_noise(
        self, 
        y: np.ndarray, 
        noise_level: float, 
        seed: int,
        noise_type: str = 'additive'
    ) -> np.ndarray:
        """Always add additive noise, since the target can be negative; noise_type is ignored."""
        return super().add_noise(y, noise_level, seed, noise_type='additive')
    
    def get_variable_dimensions(
        self, 
        feature_names: List[str]
    ) -> Dict[str, List[float]]:
        """Return dimensions for specified features."""
        return {name: self._dimensions[name] for name in feature_names}
    
    def get_target_dimensions(self) -> List[float]:
        """Return target dimensions [M, L, T, Theta]."""
        return self._target_dims.copy()
    
    def get_physical_bounds(self) -> Dict[str, Dict[str, Optional[float]]]:
        """Return physical bounds for all variables."""
        bounds = {'target': {'min': None, 'max': None}}
        for name, (low, high) in self._ranges.items():
            if name in ['v0', 'F_ext']:
                bounds[name] = {'min': None, 'max': None}
            elif name == 'theta':
                bounds[name] = {'min': 0.0, 'max': 2*np.pi}
            else:
                bounds[name] = {'min': 0.0, 'max': high * 10}
        return bounds

print("DampedOscillationEquation class defined.")
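
Because the cosine factor changes sign, the displacement oscillates inside the decaying envelope `±A * exp(-b * t)`. The sketch below evaluates the formula on a time grid for one illustrative parameter set (values picked from inside the sampling ranges) and checks both properties.

```python
import numpy as np

# Illustrative parameters, each inside the corresponding sampling range.
A, b, omega = 5.0, 0.5, 4.0
t = np.linspace(0.0, 5.0, 501)

x = A * np.exp(-b * t) * np.cos(omega * t)
envelope = A * np.exp(-b * t)

# The displacement never leaves the envelope and takes both signs.
inside = bool(np.all(np.abs(x) <= envelope + 1e-12))
has_both_signs = bool(x.min() < 0.0 < x.max())

print("inside envelope:", inside)
print("takes both signs:", has_both_signs)
```

Near the zero crossings a multiplicative factor would leave the target almost unperturbed, which is one practical reason to prefer the additive model that this class forces.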

In [None]:
# ==============================================================================
# EQUATION REGISTRY
# ==============================================================================

EQUATION_REGISTRY = {
    'coulomb': CoulombEquation,
    'newton': NewtonGravityEquation,
    'ideal_gas': IdealGasEquation,
    'damped': DampedOscillationEquation,
}

def get_equation(name: str) -> BaseTestEquation:
    """
    Get equation instance by name.
    
    Parameters
    ----------
    name : str
        Equation name (coulomb, newton, ideal_gas, damped)
    """
    if name not in EQUATION_REGISTRY:
        raise ValueError(
            f"Unknown equation: {name}. "
            f"Available: {list(EQUATION_REGISTRY.keys())}"
        )
    return EQUATION_REGISTRY[name]()

print("Equation registry defined.")
print(f"Available equations: {list(EQUATION_REGISTRY.keys())}")

In [None]:
# ==============================================================================
# EXPERIMENT CONFIGURATION FUNCTIONS
# ==============================================================================

def get_core_experiment_configs() -> List[Dict[str, Any]]:
    """
    Generate configurations for core experiments.
    
    Core experiments: 4 equations x 2 noise x 2 dummy x 2 dims = 32 configs
    With 3 methods each = 96 total experiments
    
    Returns
    -------
    List[Dict[str, Any]]
        List of experiment configurations
    """
    configs = []
    equations = list(EQUATION_REGISTRY.keys())
    
    for eq_name in equations:
        for noise in [0.0, 0.05]:
            for n_dummy in [0, 5]:
                for with_dims in [True, False]:
                    configs.append({
                        'equation_name': eq_name,
                        'n_samples': 500,
                        'noise_level': noise,
                        'n_dummy': n_dummy,
                        'with_dims': with_dims,
                    })
    
    return configs


def get_supplementary_experiment_configs() -> List[Dict[str, Any]]:
    """
    Generate configurations for supplementary experiments.
    
    Supplementary experiments: 4 equations x 2 sample sizes = 8 configs
    Physics-SR only
    
    Returns
    -------
    List[Dict[str, Any]]
        List of experiment configurations
    """
    configs = []
    equations = list(EQUATION_REGISTRY.keys())
    
    for eq_name in equations:
        for n_samples in [250, 750]:
            configs.append({
                'equation_name': eq_name,
                'n_samples': n_samples,
                'noise_level': 0.05,
                'n_dummy': 5,
                'with_dims': True,
            })
    
    return configs


print("Experiment configuration functions defined.")
print(f"  Core configs: {len(get_core_experiment_configs())}")
print(f"  Supplementary configs: {len(get_supplementary_experiment_configs())}")
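
Several configurations share a dataset. `with_dims` does not appear among the arguments of `generate_dataset` or `generate_filename` in Section 3, so a data file is identified by equation, sample size, noise level, and dummy count alone. The sketch below rebuilds the same two grids with `itertools.product` (variable names are illustrative) and counts the distinct datasets they imply.

```python
from itertools import product

equations = ["coulomb", "newton", "ideal_gas", "damped"]

# Core grid: equations x noise x dummy x with_dims at n_samples = 500.
core = [
    {"equation_name": eq, "n_samples": 500, "noise_level": noise,
     "n_dummy": n_dummy, "with_dims": with_dims}
    for eq, noise, n_dummy, with_dims
    in product(equations, [0.0, 0.05], [0, 5], [True, False])
]

# Supplementary grid: equations x sample sizes at fixed noise and dummy count.
supplementary = [
    {"equation_name": eq, "n_samples": n, "noise_level": 0.05,
     "n_dummy": 5, "with_dims": True}
    for eq, n in product(equations, [250, 750])
]

# A dataset is identified by everything except with_dims.
dataset_keys = {
    (c["equation_name"], c["n_samples"], c["noise_level"], c["n_dummy"])
    for c in core + supplementary
}

print(len(core), len(supplementary), len(dataset_keys))
```

The 32 core and 8 supplementary configurations map onto 24 distinct datasets: 16 at 500 samples and 8 at the two other sample sizes.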

---
## Section 3: Data Generator Class

In [None]:
# ==============================================================================
# BENCHMARK DATA GENERATOR
# ==============================================================================

class BenchmarkDataGenerator:
    """
    Generator for benchmark test datasets.
    
    Handles creation, saving, and loading of test datasets for
    the Physics-SR Framework benchmark suite.
    """
    
    def __init__(self, output_dir: Union[str, Path] = "data"):
        """
        Initialize the data generator.
        
        Parameters
        ----------
        output_dir : Union[str, Path]
            Directory for saving datasets
        """
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        self.equations = {
            name: cls() for name, cls in EQUATION_REGISTRY.items()
        }
        
        print("BenchmarkDataGenerator initialized.")
        print(f"  Output directory: {self.output_dir}")
        print(f"  Available equations: {list(self.equations.keys())}")
    
    def generate_filename(
        self,
        equation_name: str,
        n_samples: int,
        noise_level: float,
        n_dummy: int
    ) -> str:
        """
        Generate standardized filename for dataset.
        
        Format: eq{N}_{name}_n{samples}_noise{level}_dummy{count}.npz
        """
        eq_names = list(EQUATION_REGISTRY.keys())
        eq_num = eq_names.index(equation_name) + 1
        return f"eq{eq_num}_{equation_name}_n{n_samples}_noise{noise_level:.2f}_dummy{n_dummy}.npz"
    
    def generate_dataset(
        self,
        equation_name: str,
        n_samples: int = DEFAULT_N_SAMPLES,
        noise_level: float = 0.0,
        n_dummy: int = 0,
        seed: int = RANDOM_SEED
    ) -> Dict[str, Any]:
        """Generate a single test dataset."""
        if equation_name not in self.equations:
            raise ValueError(
                f"Unknown equation: {equation_name}. "
                f"Available: {list(self.equations.keys())}"
            )
        
        equation = self.equations[equation_name]
        data = equation.generate_dataset(n_samples, noise_level, n_dummy, seed)
        
        eq_names = list(EQUATION_REGISTRY.keys())
        data['equation_index'] = eq_names.index(equation_name) + 1
        data['full_name'] = equation.full_name
        
        return data
    
    def save_dataset(
        self,
        data: Dict[str, Any],
        filename: Optional[str] = None
    ) -> Path:
        """
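        Save dataset to .npz file.
        
        Nested dicts ('variable_dimensions', 'physical_bounds', 'ground_truth')
        are pickled into uint8 arrays stored under '<key>_pkl'; lists and
        tuples are converted to NumPy arrays.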
        """
        if filename is None:
            filename = self.generate_filename(
                data['equation_name'],
                data['n_samples'],
                data['noise_level'],
                data['n_dummy']
            )
        
        filepath = self.output_dir / filename
        
        save_data = {}
        for key, value in data.items():
            if key in ['variable_dimensions', 'physical_bounds', 'ground_truth']:
                # Use pickle for complex nested dict structures
                save_data[key + '_pkl'] = np.frombuffer(
                    pickle.dumps(value), dtype=np.uint8
                )
            elif isinstance(value, np.ndarray):
                save_data[key] = value
            elif isinstance(value, (list, tuple)):
                save_data[key] = np.array(value)
            else:
                save_data[key] = value
        
        np.savez(filepath, **save_data)
        return filepath
    
    def load_dataset(self, filename: Union[str, Path]) -> Dict[str, Any]:
        """
        Load dataset from .npz file.
        
        Decodes '_pkl' entries written by save_dataset; object arrays from
        files that stored dicts directly are also unpacked.
        """
        filepath = Path(filename)
        if not filepath.is_absolute():
            filepath = self.output_dir / filepath
        
        data = {}
        # Context manager closes the underlying archive once all keys are read
        with np.load(filepath, allow_pickle=True) as loaded:
            for key in loaded.files:
                value = loaded[key]
                
                # Handle pickle-serialized dicts
                if key.endswith('_pkl'):
                    original_key = key[:-4]  # Remove '_pkl' suffix
                    data[original_key] = pickle.loads(value.tobytes())
                elif value.dtype == object and value.shape == ():
                    data[key] = value.item()
                elif value.dtype == object:
                    data[key] = value.tolist()
                else:
                    data[key] = value
        
        return data
    
    def generate_all_datasets(
        self,
        equations: Optional[List[str]] = None,
        sample_sizes: Optional[List[int]] = None,
        noise_levels: Optional[List[float]] = None,
        dummy_counts: Optional[List[int]] = None,
        seed: int = RANDOM_SEED,
        verbose: bool = True
    ) -> List[Path]:
        """
        Generate and save one dataset for every combination of equation,
        sample size, noise level, and dummy count.
        """
        if equations is None:
            equations = list(EQUATION_REGISTRY.keys())
        if sample_sizes is None:
            sample_sizes = [500]
        if noise_levels is None:
            noise_levels = NOISE_LEVELS
        if dummy_counts is None:
            dummy_counts = DUMMY_COUNTS
        
        total = len(equations) * len(sample_sizes) * len(noise_levels) * len(dummy_counts)
        
        if verbose:
            print(f"Generating {total} datasets...")
            print(f"  Equations: {equations}")
            print(f"  Sample sizes: {sample_sizes}")
            print(f"  Noise levels: {noise_levels}")
            print(f"  Dummy counts: {dummy_counts}")
            print()
        
        generated_files = []
        count = 0
        
        for eq_name in equations:
            for n_samples in sample_sizes:
                for noise in noise_levels:
                    for n_dummy in dummy_counts:
                        data = self.generate_dataset(
                            equation_name=eq_name,
                            n_samples=n_samples,
                            noise_level=noise,
                            n_dummy=n_dummy,
                            seed=seed
                        )
                        
                        # Check y range (prints a warning only; the dataset is still saved)
                        self._validate_y_range(data['y'], eq_name)
                        
                        filepath = self.save_dataset(data)
                        generated_files.append(filepath)
                        
                        count += 1
                        if verbose:
                            print(f"  [{count}/{total}] Generated: {filepath.name}")
        
        if verbose:
            print()
            print(f"Generation complete. {len(generated_files)} files created.")
        
        return generated_files
    
    def _validate_y_range(
        self, 
        y: np.ndarray, 
        eq_name: str,
        min_abs: float = Y_MIN_THRESHOLD,
        max_abs: float = Y_MAX_THRESHOLD
    ) -> bool:
        """
        Validate that y values are in numerically stable range.
        
        Parameters
        ----------
        y : np.ndarray
            Target values
        eq_name : str
            Equation name for logging
        min_abs : float
            Minimum acceptable |y| value
        max_abs : float
            Maximum acceptable |y| value
            
        Returns
        -------
        bool
            True if y is in valid range
        """
        y_nonzero = y[y != 0]
        if len(y_nonzero) == 0:
            print(f"    [WARNING] {eq_name}: All y values are zero!")
            return False
        
        y_abs = np.abs(y_nonzero)
        y_min, y_max = y_abs.min(), y_abs.max()
        
        if y_min < min_abs:
            print(f"    [WARNING] {eq_name}: y_min ({y_min:.2e}) < {min_abs:.0e}")
            return False
        if y_max > max_abs:
            print(f"    [WARNING] {eq_name}: y_max ({y_max:.2e}) > {max_abs:.0e}")
            return False
        
        return True
    
    def list_datasets(self) -> List[Path]:
        """List all datasets in output directory."""
        return sorted(self.output_dir.glob("*.npz"))
    
    def print_summary(self):
        """Print summary of available datasets."""
        datasets = self.list_datasets()
        
        print("=" * 70)
        print(" DATASET SUMMARY")
        print("=" * 70)
        print(f"Output directory: {self.output_dir}")
        print(f"Total datasets: {len(datasets)}")
        print()
        
        if len(datasets) == 0:
            print("No datasets found.")
            return
        
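        by_equation = {}
        for path in datasets:
            # Stem format: eq{N}_{name}_n{samples}_noise{level}_dummy{count}.
            # Drop the first token and the last three so that equation names
            # containing underscores are kept whole.
            parts = path.stem.split('_')
            eq_name = '_'.join(parts[1:-3])
            by_equation.setdefault(eq_name, []).append(path)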
        
        for eq_name, paths in by_equation.items():
            print(f"  {eq_name}: {len(paths)} datasets")

print("BenchmarkDataGenerator class defined.")
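
`save_dataset` and `load_dataset` store nested dicts by pickling them into `uint8` arrays inside the `.npz` archive. The sketch below isolates that round trip under stated assumptions: the `nested` dict and the file name are made up for illustration, and the code writes to a temporary directory rather than the notebook's `data/` directory.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical nested metadata, standing in for a field such as 'physical_bounds'.
nested = {"r": {"min": 0.1, "max": 10.0}, "q1": {"min": -1.0, "max": 1.0}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "roundtrip_example.npz"

    # Encode: pickle the dict and view the resulting bytes as a uint8 array.
    encoded = np.frombuffer(pickle.dumps(nested), dtype=np.uint8)
    np.savez(path, y=np.arange(3.0), physical_bounds_pkl=encoded)

    # Decode: this archive holds only plain numeric arrays, so the default
    # allow_pickle=False is sufficient here.
    with np.load(path) as archive:
        y_loaded = archive["y"]
        restored = pickle.loads(archive["physical_bounds_pkl"].tobytes())

print(restored == nested)
```

Because the payload is unpickled on load, this pattern should only be used on files from a trusted source.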

---
## Section 4: Generate All Datasets

In [None]:
print("=" * 70)
print(" GENERATING CORE EXPERIMENT DATASETS")
print("=" * 70)
print()

generator = BenchmarkDataGenerator(output_dir=DATA_DIR)
print()

In [None]:
# Generate core experiment datasets (n=500): every equation x 2 noise levels x 2 dummy counts
core_files = generator.generate_all_datasets(
    equations=None,
    sample_sizes=[500],
    noise_levels=[0.0, 0.05],
    dummy_counts=[0, 5],
    seed=RANDOM_SEED,
    verbose=True
)

print(f"\nCore datasets generated: {len(core_files)}")

In [None]:
print("=" * 70)
print(" GENERATING SUPPLEMENTARY EXPERIMENT DATASETS")
print("=" * 70)
print()

supp_files = generator.generate_all_datasets(
    equations=None,
    sample_sizes=[250, 750],
    noise_levels=[0.05],
    dummy_counts=[5],
    seed=RANDOM_SEED,
    verbose=True
)

print(f"\nSupplementary datasets generated: {len(supp_files)}")

In [None]:
print("\n" + "=" * 70)
print(" GENERATION COMPLETE")
print("=" * 70)

all_files = generator.list_datasets()
total_size = sum(f.stat().st_size for f in all_files) / 1024 / 1024

print(f"\nTotal datasets: {len(all_files)}")
print(f"Total size: {total_size:.2f} MB")
print(f"Output directory: {generator.output_dir.absolute()}")
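
The generated files follow the `eq{N}_{name}_n{samples}_noise{level}_dummy{count}.npz` pattern produced by `generate_filename`. A minimal sketch of recovering the settings from such a filename is shown below. The regular expression and the `parse_dataset_filename` helper are hypothetical and not part of the notebook.

```python
import re
from typing import Any, Dict

# Hypothetical pattern mirroring the format produced by generate_filename.
FILENAME_PATTERN = re.compile(
    r"^eq(?P<index>\d+)_(?P<name>.+)_n(?P<n_samples>\d+)"
    r"_noise(?P<noise>\d+\.\d{2})_dummy(?P<n_dummy>\d+)\.npz$"
)


def parse_dataset_filename(filename: str) -> Dict[str, Any]:
    """Split a dataset filename into its configuration fields."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized dataset filename: {filename}")
    return {
        "equation_index": int(match["index"]),
        "equation_name": match["name"],
        "n_samples": int(match["n_samples"]),
        "noise_level": float(match["noise"]),
        "n_dummy": int(match["n_dummy"]),
    }


parsed = parse_dataset_filename("eq1_coulomb_n500_noise0.05_dummy5.npz")
print(parsed)
```

The `name` group is matched greedily up to the `_n{samples}` token, so equation names that contain underscores are kept whole.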

---
## Section 5: Validation and Verification

In [None]:
print("=" * 70)
print(" Y-VALUE RANGE VERIFICATION")
print("=" * 70)
print()

for eq_name in EQUATION_REGISTRY.keys():
    filename = generator.generate_filename(eq_name, 500, 0.0, 0)
    filepath = generator.output_dir / filename
    
    if filepath.exists():
        data = generator.load_dataset(filename)
        y = data['y']
        
        y_nonzero = y[y != 0]
        if len(y_nonzero) > 0:
            y_abs = np.abs(y_nonzero)
            y_min, y_max = y_abs.min(), y_abs.max()
        else:
            y_min, y_max = 0, 0
        
        # Check if in stable range
        stable = (y_min >= Y_MIN_THRESHOLD) and (y_max <= Y_MAX_THRESHOLD)
        status = "[OK]" if stable else "[WARNING]"
        
        print(f"{eq_name}:")
        print(f"  Equation: {data['equation_str']}")
        print(f"  y range: [{y_min:.2e}, {y_max:.2e}]")
        print(f"  Status: {status}")
        print()
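
The check above passes when every nonzero `|y|` lies between the two thresholds. The standalone sketch below shows the same logic on small arrays. The threshold values and the `abs_range_ok` helper are assumptions for illustration; the notebook uses its own `Y_MIN_THRESHOLD` and `Y_MAX_THRESHOLD`, defined earlier.

```python
import numpy as np

# Placeholder thresholds for illustration; not the notebook's actual values.
EXAMPLE_MIN_ABS = 1e-6
EXAMPLE_MAX_ABS = 1e6


def abs_range_ok(y: np.ndarray, min_abs: float, max_abs: float) -> bool:
    """Return True if every nonzero |y| lies within [min_abs, max_abs]."""
    nonzero = np.abs(y[y != 0])
    if nonzero.size == 0:
        return False  # all-zero targets are treated as out of range
    return bool(nonzero.min() >= min_abs and nonzero.max() <= max_abs)


well_scaled = np.array([0.0, 0.5, -3.0, 120.0])
tiny_values = np.array([1e-12, 2e-11])

print(abs_range_ok(well_scaled, EXAMPLE_MIN_ABS, EXAMPLE_MAX_ABS))
print(abs_range_ok(tiny_values, EXAMPLE_MIN_ABS, EXAMPLE_MAX_ABS))
```

Exact zeros are excluded before taking the minimum, so a target that merely crosses zero is not flagged.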

In [None]:
print("=" * 70)
print(" DATASET STRUCTURE VERIFICATION")
print("=" * 70)
print()

for eq_name in EQUATION_REGISTRY.keys():
    filename = generator.generate_filename(eq_name, 500, 0.0, 5)
    filepath = generator.output_dir / filename
    
    if filepath.exists():
        data = generator.load_dataset(filename)
        
        print(f"{eq_name}:")
        print(f"  X.shape = {data['X'].shape}")
        print(f"  true_features = {list(data['true_features'])}")
        print(f"  dummy_features = {list(data['dummy_features'])}")
        print(f"  target_dimensions = {data['target_dimensions']}")
        print(f"  has variable_dimensions = {'variable_dimensions' in data}")
        print(f"  has physical_bounds = {'physical_bounds' in data}")
        print()

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for idx, eq_name in enumerate(EQUATION_REGISTRY.keys()):
    ax = axes[idx]
    
    filename = generator.generate_filename(eq_name, 500, 0.0, 0)
    filepath = generator.output_dir / filename
    
    if filepath.exists():
        data = generator.load_dataset(filename)
        y = data['y']
        
        ax.hist(y, bins=50, edgecolor='black', alpha=0.7)
        ax.set_title(f"{eq_name.upper()}\n{data['equation_str']}", fontsize=10)
        ax.set_xlabel('Target Value')
        ax.set_ylabel('Frequency')
        
        stats_text = f"Mean: {np.mean(y):.2e}\nStd: {np.std(y):.2e}"
        ax.text(0.95, 0.95, stats_text, transform=ax.transAxes,
                verticalalignment='top', horizontalalignment='right',
                fontsize=8, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.savefig(generator.output_dir / 'target_distributions.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\nDistribution plot saved to: {generator.output_dir / 'target_distributions.png'}")

In [None]:
generator.print_summary()

---
## Section 6: Usage Examples

In [None]:
print("=" * 70)
print(" EXAMPLE: Loading Dataset for Experiments")
print("=" * 70)
print()

example_file = "eq1_coulomb_n500_noise0.05_dummy5.npz"
data = generator.load_dataset(example_file)

print(f"Loaded: {example_file}")
print()
print("Data structure:")
print(f"  X.shape = {data['X'].shape}")
print(f"  y.shape = {data['y'].shape}")
print(f"  feature_names = {list(data['feature_names'])}")
print(f"  true_features = {list(data['true_features'])}")
print(f"  dummy_features = {list(data['dummy_features'])}")
print()
print("Ground truth:")
print(f"  equation = {data['equation_str']}")
print(f"  type = {data['equation_type']}")
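
A common next step after loading is to separate the true input columns from the dummy columns. The sketch below does this on a synthetic stand-in for a loaded dataset. It assumes that `true_features` holds names that also appear in `feature_names`; that layout and the example values are assumptions, so check them against an actual file.

```python
import numpy as np

# Synthetic stand-in for a loaded dataset. The field names follow the
# printout above, but the values and the name-based layout are assumptions.
example = {
    "X": np.arange(12.0).reshape(3, 4),
    "feature_names": ["q1", "q2", "r", "dummy_0"],
    "true_features": ["q1", "q2", "r"],
}

# Map each true feature name to its column index in X.
names = list(example["feature_names"])
true_idx = [names.index(name) for name in example["true_features"]]
X_true = example["X"][:, true_idx]

print(X_true.shape)  # prints (3, 3)
```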

In [None]:
print("\n" + "=" * 70)
print(" DataGen Module Complete")
print("=" * 70)
print()
print("Available classes:")
print("  - UserInputs")
print("  - BaseTestEquation")
print("  - CoulombEquation")
print("  - NewtonGravityEquation")
print("  - IdealGasEquation")
print("  - DampedOscillationEquation")
print("  - BenchmarkDataGenerator")
print()
print("Available functions:")
print("  - get_equation(name)")
print("  - get_core_experiment_configs()")
print("  - get_supplementary_experiment_configs()")
print()
print(f"Generated datasets: {len(generator.list_datasets())}")
print(f"Output directory: {generator.output_dir.absolute()}")