# DataGen - Physics-SR Framework v3.0 Benchmark

## Benchmark Data Generation Module

**Author:** Zhengze Zhang  
**Affiliation:** Department of Statistics, Columbia University  
**Date:** January 2026

---

### Purpose

This notebook generates all test datasets for the Physics-SR Framework v3.0 benchmark experiments.

### Test Equations

1. **KK2000** - Khairoutdinov-Kogan Warm Rain Autoconversion (power-law)
2. **Newton** - Gravitational Force Law (rational function)
3. **Ideal Gas** - Ideal Gas Law for Pressure (rational function)
4. **Damped** - Damped Harmonic Oscillation (nested transcendental)

### Experimental Design

**Core Experiments (16 datasets):**
- 4 equations x 2 noise (0%, 5%) x 2 dummy (0, 5) = 16
- Each configuration has n=500 samples

**Supplementary Experiments (8 datasets):**
- Sample size sensitivity: 4 equations x 2 sample sizes (n = 250, 750) = 8
- Fixed: noise=5%, dummy=5
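
The two grids above can be enumerated in a few lines. This is a minimal sketch, not the notebook's own batch routine: `dataset_filename` is a hypothetical helper, and the file-name pattern is taken from the convention documented in the `BenchmarkDataGenerator` docstring in Section 8.

```python
from itertools import product

# Equation index and short name, as listed under "Test Equations".
EQUATIONS = [(1, "kk2000"), (2, "newton"), (3, "ideal_gas"), (4, "damped")]

def dataset_filename(index, name, n_samples, noise, n_dummy):
    """Hypothetical helper: eq{N}_{name}_n{samples}_noise{level:.2f}_dummy{count}.npz"""
    return f"eq{index}_{name}_n{n_samples}_noise{noise:.2f}_dummy{n_dummy}.npz"

# Core grid: 4 equations x 2 noise levels x 2 dummy counts, all at n=500.
core = [
    dataset_filename(i, name, 500, noise, dummy)
    for (i, name), noise, dummy in product(EQUATIONS, [0.0, 0.05], [0, 5])
]

# Supplementary grid: 4 equations x 2 sample sizes, noise and dummy count fixed.
supplementary = [
    dataset_filename(i, name, n, 0.05, 5)
    for (i, name), n in product(EQUATIONS, [250, 750])
]

print(len(core), len(supplementary))  # 16 8
```

Counting the lists this way is a quick check that the design really yields 16 + 8 = 24 distinct files.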

### Output Format

Each dataset is saved as a `.npz` file containing:
- `X`: Feature matrix (n_samples, n_features)
- `y`: Target vector with noise
- `y_true`: Noise-free target
- Metadata: equation info, ground truth, dimensional information
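
As an illustration of this layout, the sketch below writes and reads back a toy archive with the three arrays listed above. It rests on assumptions: the metadata is stored here as a single JSON string under a hypothetical `metadata` key, and an in-memory buffer stands in for a file on disk. The notebook's actual save routine may organize the metadata differently.

```python
import io
import json
import numpy as np

rng = np.random.default_rng(0)            # illustrative data only
X = rng.uniform(size=(500, 7))            # (n_samples, n_features)
y_true = X[:, 0] * X[:, 1]                # toy noise-free target
y = y_true * np.exp(rng.normal(0, 0.05, size=500))

# Assumption: metadata serialized as one JSON string (hypothetical key name).
metadata = {"equation": "toy: y = x0 * x1", "active_features": ["x0", "x1"]}

buffer = io.BytesIO()                     # stands in for a .npz file on disk
np.savez(buffer, X=X, y=y, y_true=y_true, metadata=json.dumps(metadata))
buffer.seek(0)

with np.load(buffer) as data:
    X_loaded = data["X"]
    y_loaded = data["y"]
    meta_loaded = json.loads(str(data["metadata"]))

print(X_loaded.shape, y_loaded.shape, meta_loaded["active_features"])
```

Storing metadata as a string keeps the archive loadable with NumPy's default `allow_pickle=False`.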

---
## Section 1: Header and Imports

In [None]:
# ==============================================================================
# COLAB SETUP - Run this cell first!
# ==============================================================================
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    import os
    if not os.path.exists('/content/Physics-Informed-Symbolic-Regression'):
        !git clone https://github.com/Garthzzz/Physics-Informed-Symbolic-Regression.git
    %cd /content/Physics-Informed-Symbolic-Regression
    
    !pip install -q pysr
    import pysr
    pysr.install()
    
    print("Setup complete!")

In [None]:
"""
DataGen.ipynb - Benchmark Data Generation Module
=================================================

Physics-SR Framework v3.0 Benchmark Suite

This module generates synthetic datasets for testing the Physics-SR framework
across four physics problems with varying complexity:
- KK2000 Autoconversion (power-law, 2 true variables)
- Newton Gravitation (rational, 3 true variables)
- Ideal Gas Law (rational, 3 true variables)
- Damped Oscillation (nested transcendental, 4 true variables)

Author: Zhengze Zhang
Affiliation: Department of Statistics, Columbia University
"""

# Standard library imports
import os
import sys
import json
import pickle
import warnings
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Optional, Union, Any
from pathlib import Path

# Scientific computing
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

print("DataGen: All imports successful.")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

In [None]:
# ==============================================================================
# CONFIGURATION CONSTANTS
# ==============================================================================

# Random seed for reproducibility
RANDOM_SEED = 42

# Default sample size for core experiments
DEFAULT_N_SAMPLES = 500

# Output directory for generated datasets
DATA_DIR = Path('data')
DATA_DIR.mkdir(exist_ok=True)

# Noise levels to test (multiplicative log-normal noise by default;
# the damped oscillation equation uses additive Gaussian noise instead)
NOISE_LEVELS = [0.0, 0.05]  # 0% and 5%

# Dummy feature counts
DUMMY_COUNTS = [0, 5]

# Sample sizes for supplementary experiments
SAMPLE_SIZES_SUPPLEMENTARY = [250, 750]

# Dimensional exponent order: [Mass, Length, Time, Temperature]
DIM_NAMES = ['M', 'L', 'T', 'Theta']

print(f"Output directory: {DATA_DIR.resolve()}")
print(f"Random seed: {RANDOM_SEED}")
print(f"Noise levels: {NOISE_LEVELS}")
print(f"Dummy feature counts: {DUMMY_COUNTS}")

---
## Section 2: Base Test Equation Class

In [None]:
# ==============================================================================
# ABSTRACT BASE CLASS FOR TEST EQUATIONS
# ==============================================================================

class BaseTestEquation(ABC):
    """
    Abstract base class for test equations in the Physics-SR benchmark.
    
    Each equation defines:
    - True mathematical relationship
    - Feature generation with physically realistic ranges
    - Dummy feature generation
    - Dimensional information for physics-informed methods
    - Noise addition mechanism
    
    Subclasses must implement all abstract methods.
    """
    
    # Class attributes to be defined by subclasses
    name: str                           # Short identifier (e.g., "kk2000")
    full_name: str                      # Full description
    equation_str: str                   # Human-readable equation
    equation_type: str                  # power_law, rational, nested_transcendental
    equation_index: int                 # 1, 2, 3, 4 for file naming
    
    true_feature_names: List[str]       # Features in true equation
    dummy_feature_pool: List[str]       # Available dummy features
    
    @abstractmethod
    def generate_true_features(self, n_samples: int, seed: int) -> Dict[str, np.ndarray]:
        """
        Generate values for features that appear in the true equation.
        
        Parameters
        ----------
        n_samples : int
            Number of samples to generate
        seed : int
            Random seed for reproducibility
            
        Returns
        -------
        Dict[str, np.ndarray]
            Dictionary mapping feature names to value arrays
        """
        pass
    
    @abstractmethod
    def generate_dummy_features(self, n_samples: int, n_dummy: int, seed: int) -> Dict[str, np.ndarray]:
        """
        Generate values for dummy (irrelevant) features.
        
        Parameters
        ----------
        n_samples : int
            Number of samples to generate
        n_dummy : int
            Number of dummy features to include (0 to len(dummy_feature_pool))
        seed : int
            Random seed for reproducibility
            
        Returns
        -------
        Dict[str, np.ndarray]
            Dictionary mapping dummy feature names to value arrays
        """
        pass
    
    @abstractmethod
    def compute_target(self, features: Dict[str, np.ndarray]) -> np.ndarray:
        """
        Compute the target variable from true features.
        
        Parameters
        ----------
        features : Dict[str, np.ndarray]
            Dictionary mapping feature names to value arrays
            
        Returns
        -------
        np.ndarray
            Target variable values (noise-free)
        """
        pass
    
    @abstractmethod
    def get_variable_dimensions(self, feature_names: List[str]) -> Dict[str, List[float]]:
        """
        Return dimensional exponents [M, L, T, Theta] for given features.
        
        Parameters
        ----------
        feature_names : List[str]
            List of feature names to get dimensions for
            
        Returns
        -------
        Dict[str, List[float]]
            Dictionary mapping feature names to dimension vectors
        """
        pass
    
    @abstractmethod
    def get_target_dimensions(self) -> List[float]:
        """
        Return dimensional exponents [M, L, T, Theta] for target variable.
        
        Returns
        -------
        List[float]
            Dimension vector for target
        """
        pass
    
    @abstractmethod
    def get_physical_bounds(self) -> Dict[str, Dict[str, Optional[float]]]:
        """
        Return physical bounds for all variables.
        
        Returns
        -------
        Dict[str, Dict[str, Optional[float]]]
            Dictionary mapping variable names to {min, max} bounds
        """
        pass
    
    def add_noise(self, y: np.ndarray, noise_level: float, seed: int) -> np.ndarray:
        """
        Add multiplicative log-normal noise to target.
        
        For positive-valued targets (rates, forces), multiplicative noise
        is more physically appropriate than additive noise.
        
        y_noisy = y * exp(eps), with eps ~ N(0, noise_level^2)
        
        Parameters
        ----------
        y : np.ndarray
            Noise-free target values
        noise_level : float
            Standard deviation of the Gaussian noise in log space (0.05 is roughly 5%)
        seed : int
            Random seed for reproducibility
            
        Returns
        -------
        np.ndarray
            Target with noise added
        """
        if noise_level <= 0:
            return y.copy()
        
        np.random.seed(seed + 2000)  # Offset seed for noise
        
        # Multiplicative log-normal noise
        noise_factor = np.exp(np.random.normal(0, noise_level, len(y)))
        y_noisy = y * noise_factor
        
        return y_noisy
    
    def get_ground_truth(self) -> Dict[str, Any]:
        """
        Return ground truth information for evaluation.
        
        Returns
        -------
        Dict[str, Any]
            Ground truth including equation string, active features, type
        """
        return {
            'equation': self.equation_str,
            'active_features': self.true_feature_names,
            'equation_type': self.equation_type,
            'equation_index': self.equation_index,
            'name': self.name,
            'full_name': self.full_name
        }

print("BaseTestEquation abstract class defined.")
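
The behaviour of the default `add_noise` can be checked in isolation. The sketch below reproduces the same recipe (legacy `np.random.seed` with the `+ 2000` offset, then a log-normal factor) on a made-up positive target; the target values and sample count are illustrative assumptions.

```python
import numpy as np

noise_level = 0.05
y = np.linspace(1.0, 100.0, 10_000)       # any strictly positive target

np.random.seed(42 + 2000)                 # same seed offset as add_noise
eps = np.random.normal(0, noise_level, len(y))
y_noisy = y * np.exp(eps)                 # multiplicative log-normal noise

log_ratio = np.log(y_noisy / y)           # recovers eps up to rounding
all_positive = bool((y_noisy > 0).all())  # the noise factor is always > 0
log_std = float(log_ratio.std())          # should be close to noise_level

print(all_positive, round(log_std, 3))
```

Because the noise factor is strictly positive, the sign of the target is preserved, and the spread of `log(y_noisy / y)` matches `noise_level` up to sampling error.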

---
## Section 3: Equation 1 - KK2000 Warm Rain Autoconversion

In [None]:
# ==============================================================================
# EQUATION 1: KHAIROUTDINOV-KOGAN 2000 AUTOCONVERSION
# ==============================================================================

class KK2000Equation(BaseTestEquation):
    """
    Khairoutdinov-Kogan 2000 Warm Rain Autoconversion Parameterization.
    
    Equation: P_auto = 1350 * q_c^2.47 * N_d^(-1.79)
    
    This is a widely-used empirical parameterization for the autoconversion
    process in warm rain formation, derived from LES simulations.
    
    Reference:
        Khairoutdinov, M., and Y. Kogan, 2000: "A New Cloud Physics
        Parameterization in a Large-Eddy Simulation Model of Marine
        Stratocumulus." Monthly Weather Review, 128, 229-243.
    
    Characteristics:
    - Type: Power-law (multiplicative)
    - True variables: 2 (q_c, N_d)
    - Non-integer exponents: Yes (2.47, -1.79)
    - Difficulty: Medium (core application of framework)
    """
    
    # Class attributes
    name = "kk2000"
    full_name = "Khairoutdinov-Kogan 2000 Autoconversion"
    equation_str = "P_auto = 1350 * q_c^2.47 * N_d^(-1.79)"
    equation_type = "power_law"
    equation_index = 1
    
    true_feature_names = ['q_c', 'N_d']
    dummy_feature_pool = ['r_eff', 'LWC', 'T', 'w', 'p']
    
    # Coefficients from KK2000
    C = 1350.0
    alpha = 2.47    # Exponent for q_c
    beta = -1.79    # Exponent for N_d
    
    # Dimensional information: [Mass, Length, Time, Temperature]
    _dimensions = {
        'q_c':   [0, 0, 0, 0],      # Cloud water mixing ratio (kg/kg, dimensionless)
        'N_d':   [0, -3, 0, 0],     # Droplet number concentration (m^-3)
        'r_eff': [0, 1, 0, 0],      # Effective radius (m)
        'LWC':   [1, -3, 0, 0],     # Liquid water content (kg/m^3)
        'T':     [0, 0, 0, 1],      # Temperature (K)
        'w':     [0, 1, -1, 0],     # Vertical velocity (m/s)
        'p':     [1, -1, -2, 0],    # Pressure (Pa = kg/(m*s^2))
    }
    
    _target_dims = [0, 0, -1, 0]    # Autoconversion rate (s^-1)
    
    # Physically realistic ranges
    _ranges = {
        'q_c':   (1e-4, 5e-3),      # Cloud water: 0.1 - 5 g/kg
        'N_d':   (1e7, 5e8),        # Droplet concentration: 10 - 500 cm^-3
        'r_eff': (5e-6, 25e-6),     # Effective radius: 5 - 25 um
        'LWC':   (0.1, 2.0),        # Liquid water content: 0.1 - 2 g/m^3 (values in g/m^3, not SI kg/m^3)
        'T':     (270, 300),        # Temperature: 270 - 300 K (about -3 to 27 C)
        'w':     (-5, 5),           # Vertical velocity: -5 to 5 m/s
        'p':     (8e4, 1e5),        # Pressure: 800 - 1000 hPa
    }
    
    def generate_true_features(self, n_samples: int, seed: int) -> Dict[str, np.ndarray]:
        """Generate cloud water mixing ratio and droplet concentration."""
        np.random.seed(seed)
        return {
            'q_c': np.random.uniform(*self._ranges['q_c'], n_samples),
            'N_d': np.random.uniform(*self._ranges['N_d'], n_samples),
        }
    
    def generate_dummy_features(self, n_samples: int, n_dummy: int, seed: int) -> Dict[str, np.ndarray]:
        """Generate irrelevant features (r_eff, LWC, T, w, p)."""
        if n_dummy <= 0:
            return {}
        
        np.random.seed(seed + 1000)  # Offset seed for dummy features
        
        selected = self.dummy_feature_pool[:min(n_dummy, len(self.dummy_feature_pool))]
        return {
            name: np.random.uniform(*self._ranges[name], n_samples)
            for name in selected
        }
    
    def compute_target(self, features: Dict[str, np.ndarray]) -> np.ndarray:
        """Compute autoconversion rate: P = 1350 * q_c^2.47 * N_d^(-1.79)"""
        q_c = features['q_c']
        N_d = features['N_d']
        return self.C * np.power(q_c, self.alpha) * np.power(N_d, self.beta)
    
    def get_variable_dimensions(self, feature_names: List[str]) -> Dict[str, List[float]]:
        """Return dimensions for specified features."""
        return {name: self._dimensions[name] for name in feature_names if name in self._dimensions}
    
    def get_target_dimensions(self) -> List[float]:
        """Return dimensions for target (s^-1)."""
        return self._target_dims.copy()
    
    def get_physical_bounds(self) -> Dict[str, Dict[str, Optional[float]]]:
        """Return physical bounds for all variables."""
        bounds = {'target': {'min': 0, 'max': None}}  # Rate must be non-negative
        for name, (low, high) in self._ranges.items():
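            # Scaling a negative lower bound by 0.1 would shrink it, so widen it by 10 instead
            margin_low = low * 0.1 if low >= 0 else low * 10
            bounds[name] = {'min': margin_low, 'max': high * 10}  # Allow some margin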
        return bounds

# Test instantiation
eq1 = KK2000Equation()
print(f"Equation 1: {eq1.full_name}")
print(f"  Formula: {eq1.equation_str}")
print(f"  Type: {eq1.equation_type}")
print(f"  True features: {eq1.true_feature_names}")
print(f"  Dummy pool: {eq1.dummy_feature_pool}")
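
A power law such as KK2000 becomes linear after taking logarithms: log P = log C + alpha * log q_c + beta * log N_d. The sketch below is an illustration of that identity on noise-free data, not part of the Physics-SR pipeline; it recovers the coefficients by ordinary least squares using the same ranges as `_ranges`.

```python
import numpy as np

rng = np.random.default_rng(0)            # illustrative sampling, not the notebook's seed scheme
q_c = rng.uniform(1e-4, 5e-3, 500)
N_d = rng.uniform(1e7, 5e8, 500)
P = 1350.0 * q_c**2.47 * N_d**(-1.79)     # noise-free KK2000 target

# Design matrix for: log P = log C + alpha * log q_c + beta * log N_d
A = np.column_stack([np.ones_like(q_c), np.log(q_c), np.log(N_d)])
coef, *_ = np.linalg.lstsq(A, np.log(P), rcond=None)

C_hat = float(np.exp(coef[0]))
alpha_hat = float(coef[1])
beta_hat = float(coef[2])

print(round(C_hat, 3), round(alpha_hat, 3), round(beta_hat, 3))
```

With noise-free data the fit returns the prefactor and both non-integer exponents to numerical precision, which is what makes this equation a natural test for power-law discovery.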

---
## Section 4: Equation 2 - Newton's Law of Universal Gravitation

In [None]:
# ==============================================================================
# EQUATION 2: NEWTON'S LAW OF UNIVERSAL GRAVITATION
# ==============================================================================

class NewtonGravityEquation(BaseTestEquation):
    """
    Newton's Law of Universal Gravitation.
    
    Equation: F = G * m1 * m2 / r^2
    
    The fundamental law governing gravitational attraction between masses.
    
    Reference:
        Feynman Lectures on Physics, Volume I, Chapter 7.1
    
    Characteristics:
    - Type: Rational function (quotient)
    - True variables: 3 (m1, m2, r)
    - Integer exponents: Yes (1, 1, -2)
    - Difficulty: Easy (tests inverse-square law discovery)
    """
    
    # Class attributes
    name = "newton"
    full_name = "Newton's Law of Universal Gravitation"
    equation_str = "F = G * m1 * m2 / r^2"
    equation_type = "rational"
    equation_index = 2
    
    true_feature_names = ['m1', 'm2', 'r']
    dummy_feature_pool = ['v1', 'v2', 'T', 't', 'rho']
    
    # Gravitational constant
    G = 6.674e-11  # N*m^2/kg^2
    
    # Dimensional information: [Mass, Length, Time, Temperature]
    _dimensions = {
        'm1':  [1, 0, 0, 0],       # Mass (kg)
        'm2':  [1, 0, 0, 0],       # Mass (kg)
        'r':   [0, 1, 0, 0],       # Distance (m)
        'v1':  [0, 1, -1, 0],      # Velocity (m/s)
        'v2':  [0, 1, -1, 0],      # Velocity (m/s)
        'T':   [0, 0, 0, 1],       # Temperature (K)
        't':   [0, 0, 1, 0],       # Time (s)
        'rho': [1, -3, 0, 0],      # Density (kg/m^3)
    }
    
    _target_dims = [1, 1, -2, 0]   # Force (N = kg*m/s^2)
    
    # Physically realistic ranges
    _ranges = {
        'm1':  (1, 1000),          # Mass 1: 1 - 1000 kg
        'm2':  (1, 1000),          # Mass 2: 1 - 1000 kg
        'r':   (1, 100),           # Distance: 1 - 100 m
        'v1':  (0, 100),           # Velocity 1: 0 - 100 m/s
        'v2':  (0, 100),           # Velocity 2: 0 - 100 m/s
        'T':   (200, 400),         # Temperature: 200 - 400 K
        't':   (0, 1000),          # Time: 0 - 1000 s
        'rho': (1, 10),            # Density: 1 - 10 kg/m^3
    }
    
    def generate_true_features(self, n_samples: int, seed: int) -> Dict[str, np.ndarray]:
        """Generate masses and distance."""
        np.random.seed(seed)
        return {
            'm1': np.random.uniform(*self._ranges['m1'], n_samples),
            'm2': np.random.uniform(*self._ranges['m2'], n_samples),
            'r':  np.random.uniform(*self._ranges['r'], n_samples),
        }
    
    def generate_dummy_features(self, n_samples: int, n_dummy: int, seed: int) -> Dict[str, np.ndarray]:
        """Generate irrelevant features (velocities, temperature, time, density)."""
        if n_dummy <= 0:
            return {}
        
        np.random.seed(seed + 1000)
        
        selected = self.dummy_feature_pool[:min(n_dummy, len(self.dummy_feature_pool))]
        return {
            name: np.random.uniform(*self._ranges[name], n_samples)
            for name in selected
        }
    
    def compute_target(self, features: Dict[str, np.ndarray]) -> np.ndarray:
        """Compute gravitational force: F = G * m1 * m2 / r^2"""
        m1 = features['m1']
        m2 = features['m2']
        r = features['r']
        return self.G * m1 * m2 / (r ** 2)
    
    def get_variable_dimensions(self, feature_names: List[str]) -> Dict[str, List[float]]:
        """Return dimensions for specified features."""
        return {name: self._dimensions[name] for name in feature_names if name in self._dimensions}
    
    def get_target_dimensions(self) -> List[float]:
        """Return dimensions for target (N = kg*m/s^2)."""
        return self._target_dims.copy()
    
    def get_physical_bounds(self) -> Dict[str, Dict[str, Optional[float]]]:
        """Return physical bounds for all variables."""
        bounds = {'target': {'min': 0, 'max': None}}  # Force magnitude is non-negative
        for name, (low, high) in self._ranges.items():
            bounds[name] = {'min': low * 0.1, 'max': high * 10}
        return bounds

# Test instantiation
eq2 = NewtonGravityEquation()
print(f"Equation 2: {eq2.full_name}")
print(f"  Formula: {eq2.equation_str}")
print(f"  Type: {eq2.equation_type}")
print(f"  True features: {eq2.true_feature_names}")
print(f"  Dummy pool: {eq2.dummy_feature_pool}")
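
The dimension vectors above also fix the dimensions that the constant G must carry for F = G * m1 * m2 / r^2 to be dimensionally consistent. A minimal sketch of that bookkeeping, using the `[M, L, T, Theta]` exponent convention from this notebook (the `exponents` dictionary is introduced here for illustration):

```python
import numpy as np

# Dimension vectors [M, L, T, Theta], copied from NewtonGravityEquation
dims = {"m1": [1, 0, 0, 0], "m2": [1, 0, 0, 0], "r": [0, 1, 0, 0]}
exponents = {"m1": 1, "m2": 1, "r": -2}   # powers in F = G * m1 * m2 / r^2
target = np.array([1, 1, -2, 0])          # force: kg * m / s^2

# Dimensions of a product of powers add up as exponent-weighted vectors.
feature_dims = sum(exponents[k] * np.array(dims[k]) for k in dims)
G_dims = target - feature_dims

print(G_dims.tolist())  # [-1, 3, -2, 0]
```

The result, kg^-1 m^3 s^-2, agrees with the units N*m^2/kg^2 given for `G`.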

---
## Section 5: Equation 3 - Ideal Gas Law

In [None]:
# ==============================================================================
# EQUATION 3: IDEAL GAS LAW
# ==============================================================================

class IdealGasEquation(BaseTestEquation):
    """
    Ideal Gas Law (solved for pressure).
    
    Equation: P = n * R * T / V
    
    The fundamental equation of state for ideal gases, relating
    pressure to temperature, volume, and amount of substance.
    
    Reference:
        Feynman Lectures on Physics, Volume I, Chapter 39.11
    
    Characteristics:
    - Type: Rational function (product with division)
    - True variables: 3 (n, T, V)
    - Integer exponents: Yes (1, 1, -1)
    - Difficulty: Easy (tests multi-variable product structure)
    """
    
    # Class attributes
    name = "ideal_gas"
    full_name = "Ideal Gas Law"
    equation_str = "P = n * R * T / V"
    equation_type = "rational"
    equation_index = 3
    
    true_feature_names = ['n', 'T', 'V']
    dummy_feature_pool = ['m', 'rho', 'c_p', 'mu', 'k']
    
    # Universal gas constant
    R = 8.314  # J/(mol*K)
    
    # Dimensional information: [Mass, Length, Time, Temperature]
    # Note: mol is treated as dimensionless for simplicity
    _dimensions = {
        'n':   [0, 0, 0, 0],       # Amount of substance (mol, dimensionless)
        'T':   [0, 0, 0, 1],       # Temperature (K)
        'V':   [0, 3, 0, 0],       # Volume (m^3)
        'm':   [1, 0, 0, 0],       # Mass (kg)
        'rho': [1, -3, 0, 0],      # Density (kg/m^3)
        'c_p': [0, 2, -2, -1],     # Specific heat (J/(kg*K) = m^2/(s^2*K))
        'mu':  [1, -1, -1, 0],     # Dynamic viscosity (Pa*s = kg/(m*s))
        'k':   [1, 1, -3, -1],     # Thermal conductivity (W/(m*K) = kg*m/(s^3*K))
    }
    
    _target_dims = [1, -1, -2, 0]  # Pressure (Pa = kg/(m*s^2))
    
    # Physically realistic ranges
    _ranges = {
        'n':   (0.1, 10),          # Moles: 0.1 - 10 mol
        'T':   (200, 500),         # Temperature: 200 - 500 K
        'V':   (0.001, 1),         # Volume: 1 L - 1 m^3
        'm':   (0.01, 1),          # Mass: 10 g - 1 kg
        'rho': (0.1, 10),          # Density: 0.1 - 10 kg/m^3
        'c_p': (500, 2000),        # Specific heat: 500 - 2000 J/(kg*K)
        'mu':  (1e-5, 1e-3),       # Viscosity: 1e-5 - 1e-3 Pa*s
        'k':   (0.01, 1),          # Thermal conductivity: 0.01 - 1 W/(m*K)
    }
    
    def generate_true_features(self, n_samples: int, seed: int) -> Dict[str, np.ndarray]:
        """Generate amount of substance, temperature, and volume."""
        np.random.seed(seed)
        return {
            'n': np.random.uniform(*self._ranges['n'], n_samples),
            'T': np.random.uniform(*self._ranges['T'], n_samples),
            'V': np.random.uniform(*self._ranges['V'], n_samples),
        }
    
    def generate_dummy_features(self, n_samples: int, n_dummy: int, seed: int) -> Dict[str, np.ndarray]:
        """Generate irrelevant features (mass, density, specific heat, etc.)."""
        if n_dummy <= 0:
            return {}
        
        np.random.seed(seed + 1000)
        
        selected = self.dummy_feature_pool[:min(n_dummy, len(self.dummy_feature_pool))]
        return {
            name: np.random.uniform(*self._ranges[name], n_samples)
            for name in selected
        }
    
    def compute_target(self, features: Dict[str, np.ndarray]) -> np.ndarray:
        """Compute pressure: P = n * R * T / V"""
        n = features['n']
        T = features['T']
        V = features['V']
        return n * self.R * T / V
    
    def get_variable_dimensions(self, feature_names: List[str]) -> Dict[str, List[float]]:
        """Return dimensions for specified features."""
        return {name: self._dimensions[name] for name in feature_names if name in self._dimensions}
    
    def get_target_dimensions(self) -> List[float]:
        """Return dimensions for target (Pa = kg/(m*s^2))."""
        return self._target_dims.copy()
    
    def get_physical_bounds(self) -> Dict[str, Dict[str, Optional[float]]]:
        """Return physical bounds for all variables."""
        bounds = {'target': {'min': 0, 'max': None}}  # Pressure is non-negative
        for name, (low, high) in self._ranges.items():
            bounds[name] = {'min': low * 0.1, 'max': high * 10}
        return bounds

# Test instantiation
eq3 = IdealGasEquation()
print(f"Equation 3: {eq3.full_name}")
print(f"  Formula: {eq3.equation_str}")
print(f"  Type: {eq3.equation_type}")
print(f"  True features: {eq3.true_feature_names}")
print(f"  Dummy pool: {eq3.dummy_feature_pool}")
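
As a sanity check on the formula and units, one mole of gas at 273.15 K in about 22.4 L should sit close to one standard atmosphere (101325 Pa). The sketch below evaluates that case and also derives the dimensions of R under this notebook's convention that mol is dimensionless; the chosen values are illustrative and fall inside the sampling ranges.

```python
import numpy as np

R = 8.314                                  # J/(mol*K), as in IdealGasEquation
n, T, V = 1.0, 273.15, 0.022414            # mol, K, m^3 (about 22.4 L)
P = n * R * T / V                          # pressure in Pa

# Dimensions [M, L, T, Theta]: R = P * V / (n * T), with n dimensionless
P_dims = np.array([1, -1, -2, 0])
T_dims = np.array([0, 0, 0, 1])
V_dims = np.array([0, 3, 0, 0])
R_dims = P_dims + V_dims - T_dims

print(f"{P:.0f} Pa", R_dims.tolist())
```

The dimension vector for R corresponds to kg m^2 s^-2 K^-1, that is J/K once mol is treated as dimensionless.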

---
## Section 6: Equation 4 - Damped Harmonic Oscillation

In [None]:
# ==============================================================================
# EQUATION 4: DAMPED HARMONIC OSCILLATION
# ==============================================================================

class DampedOscillationEquation(BaseTestEquation):
    """
    Damped Harmonic Oscillation Displacement.
    
    Equation: x(t) = A * exp(-gamma * t) * cos(omega * t)
    
    The solution to a damped harmonic oscillator, showing
    exponential decay modulated by oscillation.
    
    Reference:
        Feynman Lectures on Physics, Volume I, Chapter 24
    
    Characteristics:
    - Type: Nested/Composite function (exp * cos)
    - True variables: 4 (A, gamma, omega, t)
    - Transcendental functions: Yes (exp, cos)
    - Difficulty: HARD (tests discovery of nested structures)
    
    Note: This is the most challenging equation in the benchmark,
    as it requires discovering nested transcendental functions.
    """
    
    # Class attributes
    name = "damped"
    full_name = "Damped Harmonic Oscillation"
    equation_str = "x = A * exp(-gamma * t) * cos(omega * t)"
    equation_type = "nested_transcendental"
    equation_index = 4
    
    true_feature_names = ['A', 'gamma', 'omega', 't']
    dummy_feature_pool = ['m', 'k', 'v_0', 'F_ext', 'theta']
    
    # Dimensional information: [Mass, Length, Time, Temperature]
    _dimensions = {
        'A':     [0, 1, 0, 0],      # Initial amplitude (m)
        'gamma': [0, 0, -1, 0],     # Damping coefficient (s^-1)
        'omega': [0, 0, -1, 0],     # Angular frequency (rad/s, s^-1)
        't':     [0, 0, 1, 0],      # Time (s)
        'm':     [1, 0, 0, 0],      # Mass (kg)
        'k':     [1, 0, -2, 0],     # Spring constant (N/m = kg/s^2)
        'v_0':   [0, 1, -1, 0],     # Initial velocity (m/s)
        'F_ext': [1, 1, -2, 0],     # External force (N)
        'theta': [0, 0, 0, 0],      # Phase angle (rad, dimensionless)
    }
    
    _target_dims = [0, 1, 0, 0]    # Displacement (m)
    
    # Physically realistic ranges
    _ranges = {
        'A':     (0.1, 10),         # Amplitude: 0.1 - 10 m
        'gamma': (0.01, 1),         # Damping: 0.01 - 1 s^-1
        'omega': (0.1, 10),         # Angular frequency: 0.1 - 10 rad/s
        't':     (0, 10),           # Time: 0 - 10 s
        'm':     (0.1, 10),         # Mass: 0.1 - 10 kg
        'k':     (1, 100),          # Spring constant: 1 - 100 N/m
        'v_0':   (0, 10),           # Initial velocity: 0 - 10 m/s
        'F_ext': (0, 10),           # External force: 0 - 10 N
        'theta': (0, 2*np.pi),      # Phase angle: 0 - 2*pi rad
    }
    
    def generate_true_features(self, n_samples: int, seed: int) -> Dict[str, np.ndarray]:
        """Generate amplitude, damping, frequency, and time."""
        np.random.seed(seed)
        return {
            'A':     np.random.uniform(*self._ranges['A'], n_samples),
            'gamma': np.random.uniform(*self._ranges['gamma'], n_samples),
            'omega': np.random.uniform(*self._ranges['omega'], n_samples),
            't':     np.random.uniform(*self._ranges['t'], n_samples),
        }
    
    def generate_dummy_features(self, n_samples: int, n_dummy: int, seed: int) -> Dict[str, np.ndarray]:
        """Generate irrelevant features (mass, spring constant, etc.)."""
        if n_dummy <= 0:
            return {}
        
        np.random.seed(seed + 1000)
        
        selected = self.dummy_feature_pool[:min(n_dummy, len(self.dummy_feature_pool))]
        return {
            name: np.random.uniform(*self._ranges[name], n_samples)
            for name in selected
        }
    
    def compute_target(self, features: Dict[str, np.ndarray]) -> np.ndarray:
        """Compute displacement: x = A * exp(-gamma * t) * cos(omega * t)"""
        A = features['A']
        gamma = features['gamma']
        omega = features['omega']
        t = features['t']
        return A * np.exp(-gamma * t) * np.cos(omega * t)
    
    def add_noise(self, y: np.ndarray, noise_level: float, seed: int) -> np.ndarray:
        """
        Add additive Gaussian noise for damped oscillation.
        
        Unlike the other equations (which have strictly positive targets),
        the damped oscillation can have negative values. We use additive
        noise scaled to the standard deviation of the noise-free signal.
        """
        if noise_level <= 0:
            return y.copy()
        
        np.random.seed(seed + 2000)
        
        # Additive Gaussian noise scaled to signal standard deviation
        noise_std = noise_level * np.std(y)
        noise = np.random.normal(0, noise_std, len(y))
        return y + noise
    
    def get_variable_dimensions(self, feature_names: List[str]) -> Dict[str, List[float]]:
        """Return dimensions for specified features."""
        return {name: self._dimensions[name] for name in feature_names if name in self._dimensions}
    
    def get_target_dimensions(self) -> List[float]:
        """Return dimensions for target (m)."""
        return self._target_dims.copy()
    
    def get_physical_bounds(self) -> Dict[str, Dict[str, Optional[float]]]:
        """Return physical bounds for all variables."""
        # Displacement can be positive or negative (oscillation)
        bounds = {'target': {'min': None, 'max': None}}
        for name, (low, high) in self._ranges.items():
            bounds[name] = {'min': low * 0.1, 'max': high * 10}
        return bounds

# Test instantiation
eq4 = DampedOscillationEquation()
print(f"Equation 4: {eq4.full_name}")
print(f"  Formula: {eq4.equation_str}")
print(f"  Type: {eq4.equation_type}")
print(f"  True features: {eq4.true_feature_names}")
print(f"  Dummy pool: {eq4.dummy_feature_pool}")
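
The reason this class overrides `add_noise` can be seen directly from the target. The sketch below samples the four true variables over the same ranges, confirms that the displacement changes sign and stays inside the envelope A * exp(-gamma * t), and applies additive Gaussian noise scaled to the signal's standard deviation. It uses `np.random.default_rng` for brevity, which is an assumption of this example and will not reproduce the notebook's seeded values.

```python
import numpy as np

rng = np.random.default_rng(0)             # illustrative; the notebook uses np.random.seed
n = 500
A = rng.uniform(0.1, 10, n)
gamma = rng.uniform(0.01, 1, n)
omega = rng.uniform(0.1, 10, n)
t = rng.uniform(0, 10, n)

envelope = A * np.exp(-gamma * t)          # decaying amplitude
x = envelope * np.cos(omega * t)           # damped displacement

has_negative = bool((x < 0).any())         # multiplicative noise could not flip these signs
within_envelope = bool((np.abs(x) <= envelope + 1e-12).all())

# Additive Gaussian noise scaled to the standard deviation of the clean signal
noise_std = 0.05 * np.std(x)
x_noisy = x + rng.normal(0, noise_std, n)

print(has_negative, within_envelope, x_noisy.shape)
```

One consequence of a single global `noise_std` is that samples near zero crossings receive proportionally more noise than samples near the envelope.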

---
## Section 7: Equation Registry

In [None]:
# ==============================================================================
# EQUATION REGISTRY
# ==============================================================================

# Central registry of all test equations
EQUATION_REGISTRY = {
    'kk2000': KK2000Equation,
    'newton': NewtonGravityEquation,
    'ideal_gas': IdealGasEquation,
    'damped': DampedOscillationEquation,
}

def get_equation(name: str) -> BaseTestEquation:
    """
    Get equation instance by name.
    
    Parameters
    ----------
    name : str
        Equation name ('kk2000', 'newton', 'ideal_gas', 'damped')
        
    Returns
    -------
    BaseTestEquation
        Instantiated equation object
    """
    if name not in EQUATION_REGISTRY:
        raise ValueError(f"Unknown equation: {name}. Available: {list(EQUATION_REGISTRY.keys())}")
    return EQUATION_REGISTRY[name]()

def get_all_equations() -> List[BaseTestEquation]:
    """
    Get instances of all registered equations.
    
    Returns
    -------
    List[BaseTestEquation]
        List of all equation instances in order
    """
    return [cls() for cls in EQUATION_REGISTRY.values()]

# Print summary
print("Equation Registry Summary:")
print("=" * 80)
for name, cls in EQUATION_REGISTRY.items():
    eq = cls()
    print(f"  [{eq.equation_index}] {name:12s} | {eq.equation_type:22s} | True vars: {len(eq.true_feature_names)}")
print("=" * 80)

---
## Section 8: Benchmark Data Generator

In [None]:
# ==============================================================================
# BENCHMARK DATA GENERATOR CLASS
# ==============================================================================

class BenchmarkDataGenerator:
    """
    Generator for benchmark test datasets.
    
    This class handles:
    - Dataset generation with configurable parameters
    - Saving datasets to .npz format with full metadata
    - Loading existing datasets
    - Batch generation for all experimental configurations
    
    File naming convention:
        eq{N}_{name}_n{samples}_noise{level:.2f}_dummy{count}.npz
    
    Example:
        eq1_kk2000_n500_noise0.05_dummy5.npz
    """
    
    def __init__(self, output_dir: Union[str, Path] = DATA_DIR):
        """
        Initialize the data generator.
        
        Parameters
        ----------
        output_dir : str or Path
            Directory to save generated datasets
        """
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True, parents=True)
    
    def generate_dataset(
        self,
        equation: BaseTestEquation,
        n_samples: int = DEFAULT_N_SAMPLES,
        noise_level: float = 0.0,
        n_dummy: int = 0,
        seed: int = RANDOM_SEED
    ) -> Dict[str, Any]:
        """
        Generate a single benchmark dataset.
        
        Parameters
        ----------
        equation : BaseTestEquation
            Equation object defining the true relationship
        n_samples : int
            Number of samples to generate
        noise_level : float
            Noise level (0.05 = 5%)
        n_dummy : int
            Number of irrelevant features to include
        seed : int
            Random seed for reproducibility
            
        Returns
        -------
        Dict[str, Any]
            Dataset dictionary with all data and metadata
        """
        # Generate true features
        true_features = equation.generate_true_features(n_samples, seed)
        
        # Generate dummy features
        dummy_features = equation.generate_dummy_features(n_samples, n_dummy, seed)
        
        # Combine all features
        all_features = {**true_features, **dummy_features}
        
        # Get feature order (true features first, then dummy)
        feature_names = list(true_features.keys()) + list(dummy_features.keys())
        
        # Build feature matrix X
        X = np.column_stack([all_features[name] for name in feature_names])
        
        # Compute noise-free target
        y_true = equation.compute_target(true_features)
        
        # Add noise
        y = equation.add_noise(y_true, noise_level, seed)
        
        # Get dimensional information
        variable_dimensions = equation.get_variable_dimensions(feature_names)
        target_dimensions = equation.get_target_dimensions()
        physical_bounds = equation.get_physical_bounds()
        
        # Get ground truth
        ground_truth = equation.get_ground_truth()
        
        # Build dataset dictionary
        dataset = {
            # Data arrays
            'X': X,
            'y': y,
            'y_true': y_true,
            
            # Feature information
            'feature_names': np.array(feature_names, dtype=object),
            'true_features': np.array(equation.true_feature_names, dtype=object),
            'dummy_features': np.array(list(dummy_features.keys()), dtype=object),
            
            # Equation information
            'equation_name': equation.name,
            'equation_index': equation.equation_index,
            'equation_str': equation.equation_str,
            'equation_type': equation.equation_type,
            'full_name': equation.full_name,
            
            # Experiment parameters
            'n_samples': n_samples,
            'noise_level': noise_level,
            'n_dummy': n_dummy,
            'seed': seed,
            
            # Dimensional information (for UserInputs)
            'variable_dimensions': variable_dimensions,
            'target_dimensions': np.array(target_dimensions),
            'physical_bounds': physical_bounds,
        }
        
        return dataset
    
    def get_filename(
        self,
        equation: BaseTestEquation,
        n_samples: int,
        noise_level: float,
        n_dummy: int
    ) -> str:
        """
        Generate standardized filename for a dataset.
        
        Format: eq{N}_{name}_n{samples}_noise{level:.2f}_dummy{count}.npz
        """
        return f"eq{equation.equation_index}_{equation.name}_n{n_samples}_noise{noise_level:.2f}_dummy{n_dummy}.npz"
    
    def save_dataset(self, dataset: Dict[str, Any], filename: str) -> Path:
        """
        Save dataset to .npz file.
        
        Parameters
        ----------
        dataset : Dict[str, Any]
            Dataset dictionary from generate_dataset()
        filename : str
            Output filename (without path)
            
        Returns
        -------
        Path
            Full path to saved file
        """
        filepath = self.output_dir / filename
        
        # Prepare data for saving
        # Dictionaries need to be pickled
        save_dict = {}
        for key, value in dataset.items():
            if isinstance(value, dict):
                # Pickle dictionaries to bytes, then store as array
                save_dict[key + '_pkl'] = np.frombuffer(pickle.dumps(value), dtype=np.uint8)
            else:
                save_dict[key] = value
        
        np.savez(filepath, **save_dict)
        return filepath
    
    def load_dataset(self, filename: str) -> Dict[str, Any]:
        """
        Load dataset from .npz file.
        
        Parameters
        ----------
        filename : str
            Filename (with or without path)
            
        Returns
        -------
        Dict[str, Any]
            Loaded dataset dictionary
        """
        # Check if full path or just filename
        filepath = Path(filename)
        if not filepath.exists():
            filepath = self.output_dir / filename
        
        if not filepath.exists():
            raise FileNotFoundError(f"Dataset not found: {filepath}")
        
        # Load npz file; the context manager closes the file handle when done.
        # allow_pickle=True is needed for the object arrays, so only load
        # files from a trusted source.
        dataset = {}
        with np.load(filepath, allow_pickle=True) as data:
            # Reconstruct dictionary
            for key in data.files:
                if key.endswith('_pkl'):
                    # Unpickle dictionaries
                    original_key = key[:-4]
                    dataset[original_key] = pickle.loads(data[key].tobytes())
                else:
                    value = data[key]
                    # Convert 0-d arrays to scalars
                    if value.ndim == 0:
                        dataset[key] = value.item()
                    else:
                        dataset[key] = value
        
        return dataset
    
    def generate_and_save(
        self,
        equation: BaseTestEquation,
        n_samples: int = DEFAULT_N_SAMPLES,
        noise_level: float = 0.0,
        n_dummy: int = 0,
        seed: int = RANDOM_SEED
    ) -> Path:
        """
        Generate and save a dataset in one call.
        
        Returns
        -------
        Path
            Path to saved file
        """
        dataset = self.generate_dataset(equation, n_samples, noise_level, n_dummy, seed)
        filename = self.get_filename(equation, n_samples, noise_level, n_dummy)
        return self.save_dataset(dataset, filename)

print("BenchmarkDataGenerator class defined.")

---
## Section 9: Generate Core Experiment Datasets
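
The core grid produced in this section can be enumerated ahead of time. The sketch below is a minimal illustration, assuming the values stated in the Experimental Design (noise levels 0% and 5%, dummy counts 0 and 5, n=500) and the equation names used elsewhere in this notebook. The names `EQUATION_NAMES`, `NOISE_LEVELS_ASSUMED`, `DUMMY_COUNTS_ASSUMED`, and `N_SAMPLES_ASSUMED` are hypothetical stand-ins, not the notebook's own constants.

```python
from itertools import product

# Assumed values, taken from the Experimental Design description
EQUATION_NAMES = ["kk2000", "newton", "ideal_gas", "damped"]
NOISE_LEVELS_ASSUMED = [0.0, 0.05]
DUMMY_COUNTS_ASSUMED = [0, 5]
N_SAMPLES_ASSUMED = 500

# One entry per (equation, noise level, dummy count) combination
core_grid = [
    {
        "equation_name": name,
        "n_samples": N_SAMPLES_ASSUMED,
        "noise_level": noise,
        "n_dummy": dummy,
    }
    for name, noise, dummy in product(
        EQUATION_NAMES, NOISE_LEVELS_ASSUMED, DUMMY_COUNTS_ASSUMED
    )
]

print(len(core_grid))  # 4 equations x 2 noise levels x 2 dummy counts
```

The ordering matches a nested loop over equations, then noise levels, then dummy counts.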

In [None]:
# ==============================================================================
# GENERATE CORE EXPERIMENT DATASETS
# ==============================================================================

def generate_core_datasets(generator: BenchmarkDataGenerator, verbose: bool = True) -> List[Path]:
    """
    Generate all datasets for core experiments.
    
    Core experiments: 4 equations x 2 noise x 2 dummy = 16 configurations
    Each with n=500 samples.
    
    Parameters
    ----------
    generator : BenchmarkDataGenerator
        Data generator instance
    verbose : bool
        Whether to print progress
        
    Returns
    -------
    List[Path]
        List of generated file paths
    """
    generated_files = []
    equations = get_all_equations()
    
    if verbose:
        print("Generating Core Experiment Datasets")
        print("=" * 60)
        total = len(equations) * len(NOISE_LEVELS) * len(DUMMY_COUNTS)
        print(f"Total datasets: {total}")
        print()
    
    count = 0
    for eq in equations:
        for noise_level in NOISE_LEVELS:
            for n_dummy in DUMMY_COUNTS:
                # Generate and save
                filepath = generator.generate_and_save(
                    equation=eq,
                    n_samples=DEFAULT_N_SAMPLES,
                    noise_level=noise_level,
                    n_dummy=n_dummy,
                    seed=RANDOM_SEED
                )
                generated_files.append(filepath)
                count += 1
                
                if verbose:
                    print(f"  [{count:3d}] {filepath.name}")
    
    if verbose:
        print()
        print(f"Generated {len(generated_files)} core datasets.")
    
    return generated_files

# Generate core datasets
generator = BenchmarkDataGenerator(DATA_DIR)
core_files = generate_core_datasets(generator, verbose=True)

---
## Section 10: Generate Supplementary Experiment Datasets

In [None]:
# ==============================================================================
# GENERATE SUPPLEMENTARY EXPERIMENT DATASETS
# ==============================================================================

def generate_supplementary_datasets(generator: BenchmarkDataGenerator, verbose: bool = True) -> List[Path]:
    """
    Generate datasets for supplementary experiments (sample size sensitivity).
    
    Supplementary: 4 equations x 2 sizes (250, 750) = 8 configurations
    Fixed: noise=5%, dummy=5
    
    Parameters
    ----------
    generator : BenchmarkDataGenerator
        Data generator instance
    verbose : bool
        Whether to print progress
        
    Returns
    -------
    List[Path]
        List of generated file paths
    """
    generated_files = []
    equations = get_all_equations()
    
    # Fixed parameters for supplementary experiments
    fixed_noise = 0.05
    fixed_dummy = 5
    
    if verbose:
        print("\nGenerating Supplementary Experiment Datasets")
        print("=" * 60)
        total = len(equations) * len(SAMPLE_SIZES_SUPPLEMENTARY)
        print(f"Total datasets: {total}")
        print(f"Fixed parameters: noise={fixed_noise}, dummy={fixed_dummy}")
        print()
    
    count = 0
    for eq in equations:
        for n_samples in SAMPLE_SIZES_SUPPLEMENTARY:
            # Generate and save
            filepath = generator.generate_and_save(
                equation=eq,
                n_samples=n_samples,
                noise_level=fixed_noise,
                n_dummy=fixed_dummy,
                seed=RANDOM_SEED
            )
            generated_files.append(filepath)
            count += 1
            
            if verbose:
                print(f"  [{count:3d}] {filepath.name}")
    
    if verbose:
        print()
        print(f"Generated {len(generated_files)} supplementary datasets.")
    
    return generated_files

# Generate supplementary datasets
supp_files = generate_supplementary_datasets(generator, verbose=True)

---
## Section 11: Data Verification and Summary

In [None]:
# ==============================================================================
# VERIFY GENERATED DATASETS
# ==============================================================================

def verify_datasets(generator: BenchmarkDataGenerator, verbose: bool = True) -> pd.DataFrame:
    """
    Verify all generated datasets and create summary table.
    """
    # List all npz files
    files = sorted(generator.output_dir.glob('*.npz'))
    
    if verbose:
        print("\nDataset Verification")
        print("=" * 80)
        print(f"Found {len(files)} dataset files in {generator.output_dir}")
        print()
    
    # Build summary
    summary_data = []
    
    for filepath in files:
        try:
            dataset = generator.load_dataset(filepath.name)
            
            # Extract info
            row = {
                'filename': filepath.name,
                'equation': dataset['equation_name'],
                'n_samples': dataset['n_samples'],
                'noise_level': dataset['noise_level'],
                'n_dummy': dataset['n_dummy'],
                'n_features': dataset['X'].shape[1],
                'n_true': len(dataset['true_features']),
                'y_mean': np.mean(dataset['y']),
                'y_std': np.std(dataset['y']),
                'y_min': np.min(dataset['y']),
                'y_max': np.max(dataset['y']),
                'has_dims': 'variable_dimensions' in dataset,
            }
            summary_data.append(row)
            
        except Exception as e:
            print(f"  ERROR loading {filepath.name}: {e}")
    
    # Create DataFrame
    summary_df = pd.DataFrame(summary_data)
    
    if verbose and len(summary_df) > 0:
        # Print summary statistics
        print("Summary by Equation:")
        eq_summary = summary_df.groupby('equation').agg({
            'filename': 'count',
            'n_samples': 'mean',
            'n_features': 'mean',
        }).round(1)
        eq_summary.columns = ['Count', 'Avg Samples', 'Avg Features']
        print(eq_summary.to_string())
        print()
        
        print("Summary by Configuration:")
        config_summary = summary_df.groupby(['noise_level', 'n_dummy']).size().unstack(fill_value=0)
        print(config_summary.to_string())
        print()
    
    return summary_df

# Verify datasets
summary_df = verify_datasets(generator, verbose=True)

In [None]:
# ==============================================================================
# DISPLAY FULL DATASET SUMMARY TABLE
# ==============================================================================

# Display formatted table
print("\nComplete Dataset Summary:")
print("=" * 100)

# Format for display
display_df = summary_df[['filename', 'equation', 'n_samples', 'noise_level', 
                          'n_dummy', 'n_features', 'n_true']].copy()
display_df['noise_level'] = display_df['noise_level'].map(lambda x: f"{x:.0%}")

# Print
print(display_df.to_string(index=False))
print("=" * 100)
print(f"Total datasets: {len(summary_df)}")

---
## Section 12: Sample Dataset Visualization
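
The top-row panels annotate each scatter with an R^2 between the noisy and noise-free targets, computed as 1 - SS_res / SS_tot. The sketch below reproduces that calculation on synthetic data. The noise model used here (additive Gaussian noise scaled by the standard deviation of the clean signal) is an assumption for illustration only and may differ from what `add_noise()` does; `r2_noisy_vs_true` and the `*_demo` arrays are hypothetical names.

```python
import numpy as np

def r2_noisy_vs_true(y: np.ndarray, y_true: np.ndarray) -> float:
    """R^2 of the noisy target measured against the noise-free target."""
    ss_res = np.sum((y - y_true) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic stand-in for a dataset (assumed noise model, see lead-in)
rng = np.random.default_rng(0)
y_true_demo = np.linspace(1.0, 10.0, 500)
noise_demo = 0.05 * np.std(y_true_demo) * rng.standard_normal(500)
y_demo = y_true_demo + noise_demo

r2_demo = r2_noisy_vs_true(y_demo, y_true_demo)
print(f"R^2 = {r2_demo:.4f}")
```

Under this assumed noise model, R^2 should land roughly at 1 - 0.05^2 = 0.9975, since SS_res / SS_tot approximates the squared noise-to-signal ratio.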

In [None]:
# ==============================================================================
# VISUALIZE SAMPLE DATASETS
# ==============================================================================

def plot_sample_datasets(generator: BenchmarkDataGenerator):
    """
    Create visualization of sample datasets for each equation.
    """
    equations = get_all_equations()
    
    fig, axes = plt.subplots(2, 4, figsize=(16, 8))
    fig.suptitle('Sample Dataset Visualization (n=500, noise=5%, dummy=5)', fontsize=14)
    
    for i, eq in enumerate(equations):
        # Load dataset with noise and dummy features
        filename = generator.get_filename(eq, 500, 0.05, 5)
        dataset = generator.load_dataset(filename)
        
        X = dataset['X']
        y = dataset['y']
        y_true = dataset['y_true']
        feature_names = list(dataset['feature_names'])
        
        # Top row: True vs Noisy target
        ax1 = axes[0, i]
        ax1.scatter(y_true, y, alpha=0.3, s=5, c='blue')
        ax1.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 
                 'r--', linewidth=1, label='y=y_true')
        ax1.set_xlabel('y_true')
        ax1.set_ylabel('y (noisy)')
        ax1.set_title(f'{eq.name}\n{eq.equation_type}')
        ax1.legend(loc='upper left', fontsize=8)
        
        # Calculate R^2 between noisy and true
        ss_res = np.sum((y - y_true) ** 2)
        ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
        r2 = 1 - ss_res / ss_tot
        ax1.text(0.95, 0.05, f'R^2={r2:.4f}', transform=ax1.transAxes, 
                 fontsize=8, ha='right', va='bottom')
        
        # Bottom row: Feature importance (correlation with target)
        ax2 = axes[1, i]
        correlations = [np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])]
        colors = ['green' if name in eq.true_feature_names else 'red' 
                  for name in feature_names]
        
        ax2.bar(range(len(feature_names)), np.abs(correlations), color=colors, alpha=0.7)
        ax2.set_xticks(range(len(feature_names)))
        ax2.set_xticklabels(feature_names, rotation=45, ha='right', fontsize=8)
        ax2.set_ylabel('|Correlation with y|')
        ax2.set_title('Feature Correlations\nGreen=True, Red=Dummy')
        ax2.set_ylim([0, 1])
    
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    
    # Save figure
    fig_path = generator.output_dir.parent / 'results' / 'figures' / 'data_overview.png'
    fig_path.parent.mkdir(exist_ok=True, parents=True)
    plt.savefig(fig_path, dpi=150, bbox_inches='tight')
    print(f"\nFigure saved to: {fig_path}")
    
    plt.show()

# Generate visualization
plot_sample_datasets(generator)

---
## Section 13: Test Data Loading Interface

In [None]:
# ==============================================================================
# TEST DATA LOADING INTERFACE
# ==============================================================================

def test_data_interface():
    """
    Test the data loading interface to ensure datasets can be
    properly loaded and used for experiments.
    """
    print("Testing Data Loading Interface")
    print("=" * 60)
    
    # Load a sample dataset
    generator = BenchmarkDataGenerator(DATA_DIR)
    filename = 'eq1_kk2000_n500_noise0.05_dummy5.npz'
    
    print(f"\nLoading: {filename}")
    dataset = generator.load_dataset(filename)
    
    # Print structure
    print("\nDataset Structure:")
    print("-" * 40)
    for key, value in dataset.items():
        if isinstance(value, np.ndarray):
            print(f"  {key:25s} : ndarray{value.shape} dtype={value.dtype}")
        elif isinstance(value, dict):
            print(f"  {key:25s} : dict with {len(value)} keys")
        else:
            print(f"  {key:25s} : {type(value).__name__} = {value}")
    
    # Verify dimensions
    print("\nDimensional Information:")
    print("-" * 40)
    for var, dims in dataset['variable_dimensions'].items():
        print(f"  {var:10s} : [M={dims[0]:+.0f}, L={dims[1]:+.0f}, T={dims[2]:+.0f}, Th={dims[3]:+.0f}]")
    print(f"  {'target':10s} : [M={dataset['target_dimensions'][0]:+.0f}, L={dataset['target_dimensions'][1]:+.0f}, T={dataset['target_dimensions'][2]:+.0f}, Th={dataset['target_dimensions'][3]:+.0f}]")
    
    print("\nData loading interface test PASSED.")

# Run test
test_data_interface()

---
## Section 14: Export Utility Functions

In [None]:
# ==============================================================================
# UTILITY FUNCTIONS FOR EXPERIMENTS NOTEBOOK
# ==============================================================================

def list_all_datasets(data_dir: Union[str, Path] = DATA_DIR) -> pd.DataFrame:
    """
    List all available datasets with their configurations.
    """
    data_dir = Path(data_dir)
    files = sorted(data_dir.glob('*.npz'))
    
    records = []
    for f in files:
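        # Parse filename: eq{N}_{name}_n{samples}_noise{level}_dummy{count}.npz
        # The trailing fields are read from the right because the equation
        # name may itself contain underscores (e.g. 'ideal_gas').
        parts = f.stem.split('_')
        eq_idx = int(parts[0][2:])  # eq1 -> 1
        eq_name = '_'.join(parts[1:-3])  # keeps 'ideal_gas' intact
        n_samples = int(parts[-3][1:])  # n500 -> 500
        noise = float(parts[-2][5:])  # noise0.05 -> 0.05
        n_dummy = int(parts[-1][5:])  # dummy5 -> 5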
        
        # Determine experiment type
        if n_samples == DEFAULT_N_SAMPLES:
            exp_type = 'core'
        else:
            exp_type = 'supplementary'
        
        records.append({
            'filename': f.name,
            'filepath': str(f),
            'equation': eq_name,
            'equation_index': eq_idx,
            'n_samples': n_samples,
            'noise_level': noise,
            'n_dummy': n_dummy,
            'experiment_type': exp_type,
        })
    
    return pd.DataFrame(records)

def get_core_experiment_configs() -> List[Dict]:
    """
    Get all configurations for core experiments.
    """
    configs = []
    equations = ['kk2000', 'newton', 'ideal_gas', 'damped']
    
    for eq in equations:
        for noise in NOISE_LEVELS:
            for dummy in DUMMY_COUNTS:
                for with_dims in [True, False]:
                    configs.append({
                        'equation_name': eq,
                        'n_samples': DEFAULT_N_SAMPLES,
                        'noise_level': noise,
                        'n_dummy': dummy,
                        'with_dims': with_dims,
                    })
    
    return configs

def get_supplementary_experiment_configs() -> List[Dict]:
    """
    Get all configurations for supplementary experiments.
    """
    configs = []
    equations = ['kk2000', 'newton', 'ideal_gas', 'damped']
    
    for eq in equations:
        for n_samples in SAMPLE_SIZES_SUPPLEMENTARY:
            configs.append({
                'equation_name': eq,
                'n_samples': n_samples,
                'noise_level': 0.05,
                'n_dummy': 5,
                'with_dims': True,
            })
    
    return configs

# Print summary
print("Utility Functions Defined:")
print("  - list_all_datasets(data_dir) -> DataFrame")
print("  - get_core_experiment_configs() -> List[Dict]")
print("  - get_supplementary_experiment_configs() -> List[Dict]")
print()
print(f"Core experiments: {len(get_core_experiment_configs())} configurations")
print(f"Supplementary experiments: {len(get_supplementary_experiment_configs())} configurations")

---
## Section 15: Final Summary and Validation

In [None]:
# ==============================================================================
# FINAL SUMMARY AND VALIDATION
# ==============================================================================

print("="*80)
print("DATA GENERATION COMPLETE")
print("="*80)
print()

# Count files
all_files = list(DATA_DIR.glob('*.npz'))
print(f"Total datasets generated: {len(all_files)}")
print(f"Output directory: {DATA_DIR.resolve()}")
print()

# List all generated files
print("Generated files:")
for f in sorted(all_files):
    print(f"  - {f.name}")
print()

# Expected vs actual
expected_core = 4 * 2 * 2  # 4 eq x 2 noise x 2 dummy = 16
expected_supp = 4 * 2      # 4 eq x 2 sizes = 8
expected_total = expected_core + expected_supp

print(f"Expected datasets: {expected_total} ({expected_core} core + {expected_supp} supplementary)")
print(f"Actual datasets:   {len(all_files)}")
print()

if len(all_files) == expected_total:
    print("[PASSED] All expected datasets generated successfully!")
elif len(all_files) > 0:
    # A mismatch can mean missing datasets or stale files left in DATA_DIR
    print(f"[WARNING] Found {len(all_files)} datasets, expected {expected_total}.")
else:
    print("[WARNING] No datasets found!")

# Validate one sample file
if all_files:
    sample_file = all_files[0]
    data = np.load(sample_file)
    print()
    print(f"Sample file validation ({sample_file.name}):")
    print(f"  Keys: {list(data.keys())}")
    print(f"  X shape: {data['X'].shape}")
    print(f"  y shape: {data['y'].shape}")

print()
print("="*80)
print("Ready for Experiments.ipynb")
print("="*80)

---
## Appendix: Quick Reference

### Equation Summary

| # | Name | Formula | Type | True Vars | Difficulty |
|---|------|---------|------|-----------|------------|
| 1 | KK2000 | `P = 1350*q_c^2.47*N_d^-1.79` | power_law | 2 | Medium |
| 2 | Newton | `F = G*m1*m2/r^2` | rational | 3 | Easy |
| 3 | Ideal Gas | `P = nRT/V` | rational | 3 | Easy |
| 4 | Damped | `x = A*exp(-gamma*t)*cos(omega*t)` | nested | 4 | Hard |
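
For reference, the four formulas in the table can be written as plain functions. This is a minimal sketch rather than the notebook's equation classes: the function names and argument order are hypothetical, and the constants `G` and `R` are rounded standard values that may differ from those the notebook uses.

```python
import numpy as np

# Assumed physical constants (rounded standard values)
G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2
R = 8.314      # molar gas constant, J mol^-1 K^-1

def kk2000(q_c, N_d):
    """Autoconversion rate: P = 1350 * q_c^2.47 * N_d^-1.79."""
    return 1350.0 * q_c**2.47 * N_d**(-1.79)

def newton(m1, m2, r):
    """Gravitational force: F = G * m1 * m2 / r^2."""
    return G * m1 * m2 / r**2

def ideal_gas(n, T, V):
    """Pressure: P = n * R * T / V."""
    return n * R * T / V

def damped(A, gamma, omega, t):
    """Damped oscillation: x = A * exp(-gamma * t) * cos(omega * t)."""
    return A * np.exp(-gamma * t) * np.cos(omega * t)

# All functions accept NumPy arrays as well as scalars
t = np.linspace(0.0, 5.0, 6)
print(damped(A=1.0, gamma=0.5, omega=2.0, t=t))
```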

### File Naming Convention

```
eq{N}_{name}_n{samples}_noise{level:.2f}_dummy{count}.npz
```
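
A short sketch of building and parsing names that follow this convention. The helpers `build_filename` and `parse_filename` are hypothetical and independent of the notebook's own functions. Parsing reads the last three fields from the right, on the assumption that an equation name such as `ideal_gas` can contain underscores.

```python
def build_filename(eq_index, name, n_samples, noise_level, n_dummy):
    """Format a dataset filename following the naming convention."""
    return f"eq{eq_index}_{name}_n{n_samples}_noise{noise_level:.2f}_dummy{n_dummy}.npz"

def parse_filename(filename):
    """Recover the configuration encoded in a dataset filename."""
    stem = filename[:-len(".npz")] if filename.endswith(".npz") else filename
    parts = stem.split("_")
    return {
        "equation_index": int(parts[0][2:]),       # 'eq3' -> 3
        "equation_name": "_".join(parts[1:-3]),    # may contain underscores
        "n_samples": int(parts[-3][1:]),           # 'n500' -> 500
        "noise_level": float(parts[-2][5:]),       # 'noise0.05' -> 0.05
        "n_dummy": int(parts[-1][5:]),             # 'dummy5' -> 5
    }

name = build_filename(3, "ideal_gas", 500, 0.05, 5)
print(name)
print(parse_filename(name))
```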

### Dataset Contents

- `X`: Feature matrix (n_samples, n_features)
- `y`: Target with noise
- `y_true`: Noise-free target
- `feature_names`: List of feature names
- `true_features`: Features in true equation
- `dummy_features`: Irrelevant features
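- `variable_dimensions`: Dict mapping each feature name to its [M, L, T, Theta] exponents, for UserInputs (stored as pickled bytes under `variable_dimensions_pkl`; `load_dataset()` restores it)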
- `target_dimensions`: [M, L, T, Theta]
- `physical_bounds`: Physical constraints
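
Dict-valued entries are stored in the `.npz` file as pickled bytes under a `_pkl` suffix. The sketch below round-trips a small stand-in dataset through that scheme in a temporary directory. The `demo` contents and the helpers `to_npz_fields` and `from_npz` are hypothetical and only illustrate the storage pattern; only unpickle files from a trusted source.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Illustrative stand-in dataset; the values are made up for this example
demo = {
    "X": np.arange(6, dtype=float).reshape(3, 2),
    "y": np.array([1.0, 2.0, 3.0]),
    "n_dummy": 0,
    "variable_dimensions": {"m1": [1, 0, 0, 0], "r": [0, 1, 0, 0]},
}

def to_npz_fields(dataset):
    """Pickle dict values into uint8 arrays so np.savez can store them."""
    fields = {}
    for key, value in dataset.items():
        if isinstance(value, dict):
            fields[key + "_pkl"] = np.frombuffer(pickle.dumps(value), dtype=np.uint8)
        else:
            fields[key] = value
    return fields

def from_npz(path):
    """Load an .npz file and undo the pickling of dict values."""
    out = {}
    with np.load(path, allow_pickle=True) as data:
        for key in data.files:
            value = data[key]
            if key.endswith("_pkl"):
                out[key[:-4]] = pickle.loads(value.tobytes())
            elif value.ndim == 0:
                out[key] = value.item()  # 0-d array back to a Python scalar
            else:
                out[key] = value
    return out

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.npz"
    np.savez(path, **to_npz_fields(demo))
    restored = from_npz(path)

print(sorted(restored))
```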