# DataGen - Physics-SR Framework v4.1 Benchmark

## Benchmark Data Generation Module

**Author:** Zhengze Zhang  
**Affiliation:** Department of Statistics, Columbia University  
**Contact:** zz3239@columbia.edu  
**Date:** January 2026  
**Version:** 4.1 (AI Feynman Benchmark Integration)

---

### Purpose

This notebook generates synthetic test datasets for benchmarking the Physics-SR Framework v4.1.

**Test Equations (AI Feynman Benchmark):**

| # | Name | AI Feynman ID | Equation | Type | Tests |
|---|------|---------------|----------|------|-------|
| 1 | Coulomb | I.12.2 | $F = k \cdot q_1 \cdot q_2 / r^2$ | Rational | Inverse-square law |
| 2 | Cosines | I.29.16 | $x = \sqrt{x_1^2 + x_2^2 - 2x_1 x_2 \cos(\theta_1 - \theta_2)}$ | Nested Trig | Structure parsing |
| 3 | Barometric | I.40.1 | $n = n_0 \exp(-mgx/k_B T)$ | Exponential | High-dim discovery |
| 4 | DotProduct | I.11.19 | $A = x_1 y_1 + x_2 y_2 + x_3 y_3$ | Polynomial | Interaction discovery |

**Generated Configurations:**
- Sample sizes: 250, 500, 750 (core configuration: 500 only)
- Noise levels: 0%, 5%
- Dummy features: 0, 5
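
The three lists above define a factorial grid. A minimal sketch, assuming every combination is generated for each equation; `enumerate_configs` is a hypothetical helper, not a function from this notebook:

```python
from itertools import product

SAMPLE_SIZES = [250, 500, 750]
NOISE_LEVELS = [0.0, 0.05]
DUMMY_COUNTS = [0, 5]

def enumerate_configs(sample_sizes, noise_levels, dummy_counts):
    """Return one dict per (n_samples, noise_level, n_dummy) combination."""
    return [
        {"n_samples": n, "noise_level": noise, "n_dummy": d}
        for n, noise, d in product(sample_sizes, noise_levels, dummy_counts)
    ]

configs = enumerate_configs(SAMPLE_SIZES, NOISE_LEVELS, DUMMY_COUNTS)

# Subset restricted to the core sample size of 500
core = [c for c in configs if c["n_samples"] == 500]
print(len(configs), len(core))
```

Under this assumption each equation yields 12 configurations, 4 of which use the core sample size.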

### Output

Datasets saved to `data/` directory as `.npz` files.
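
A dataset dictionary that mixes arrays with nested metadata can be round-tripped through `.npz` roughly as follows. This is a sketch under assumptions: the file name `example.npz`, the helper `save_and_reload`, and the toy dataset are illustrative only, and the notebook's actual save routine may differ.

```python
import tempfile
from pathlib import Path

import numpy as np

def save_and_reload(dataset, path):
    """Write a dataset dict with np.savez and read it back as a plain dict."""
    np.savez(path, **dataset)
    # Dict-valued entries are stored as 0-d object arrays, so loading them
    # requires allow_pickle=True.
    with np.load(path, allow_pickle=True) as archive:
        return {key: archive[key] for key in archive.files}

# Toy dataset (hypothetical values) mixing arrays and a metadata dict
dataset = {
    "X": np.ones((4, 2), dtype=np.float32),
    "y": np.arange(4, dtype=np.float32),
    "feature_names": np.array(["q1", "q2"]),
    "ground_truth": {"name": "coulomb", "ai_feynman_id": "I.12.2"},
}

with tempfile.TemporaryDirectory() as tmp:
    loaded = save_and_reload(dataset, Path(tmp) / "example.npz")

# Unwrap the 0-d object array to recover the original dict
ground_truth = loaded["ground_truth"].item()
```

Only load pickled archives from sources you trust, since `allow_pickle=True` can execute arbitrary code during unpickling.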

### Changelog from v3.0

- Replaced Newton, IdealGas, Damped equations with AI Feynman benchmark equations
- Added `ai_feynman_id` attribute to all equation classes
- Updated data types to Float32 for memory optimization
- Enhanced metadata storage for v4.1 pipeline compatibility

---
## Section 1: Header and Imports

In [None]:
# ==============================================================================
# ENVIRONMENT RESET AND FRESH CLONE
# ==============================================================================
# This cell only resets when necessary (not when already in valid repo)

import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    import os
    import shutil
    import gc
    
    repo_path = '/content/Physics-Informed-Symbolic-Regression'
    current_dir = os.getcwd()
    
    # Only reset if NOT already inside a valid repo
    if not current_dir.startswith(repo_path) or not os.path.exists(repo_path + '/.git'):
        os.chdir('/content')
        gc.collect()
        
        if os.path.exists(repo_path):
            shutil.rmtree(repo_path)
            print("[OK] Removed existing repository.")
        
        !git clone https://github.com/Garthzzz/Physics-Informed-Symbolic-Regression.git
        
        if os.path.exists(repo_path + '/.git'):
            os.chdir(repo_path + '/benchmark')
            print(f"[OK] Working directory: {os.getcwd()}")
            print("[OK] Environment reset complete.")
        else:
            print("[FAIL] Clone incomplete!")
    else:
        print("[SKIP] Already in valid repository.")
else:
    print("[INFO] Not in Colab environment.")

In [None]:
"""
DataGen.ipynb - Benchmark Data Generation
==========================================

Physics-SR Framework v4.1 Benchmark Suite

This module provides:
- BaseTestEquation: Abstract base class for test equations
- CoulombEquation: Electrostatic force (AI Feynman I.12.2)
- LawOfCosinesEquation: Wave interference (AI Feynman I.29.16)
- BarometricEquation: Atmospheric pressure (AI Feynman I.40.1)
- DotProductEquation: Vector dot product (AI Feynman I.11.19)
- BenchmarkDataGenerator: Data generation and management

Author: Zhengze Zhang
Affiliation: Department of Statistics, Columbia University
Contact: zz3239@columbia.edu
Version: 4.1
"""

print("DataGen: Initializing benchmark data generation module...")
print("Version: 4.1 (AI Feynman Benchmark)")
print()

In [None]:
# ==============================================================================
# IMPORTS
# ==============================================================================

# Standard library imports
import os
import sys
import pickle
import warnings
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Union, Any
from datetime import datetime

# Scientific computing
import numpy as np
import pandas as pd
from scipy import stats

# Visualization
import matplotlib.pyplot as plt

print("DataGen: All imports successful.")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

In [None]:
# ==============================================================================
# CONFIGURATION CONSTANTS
# ==============================================================================

# Random seed for reproducibility
RANDOM_SEED = 42

# Data generation parameters
DEFAULT_N_SAMPLES = 500
SAMPLE_SIZES = [250, 500, 750]
NOISE_LEVELS = [0.0, 0.05]
DUMMY_COUNTS = [0, 5]

# Output directory - environment aware
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    # Colab: absolute path inside the cloned repository
    DATA_DIR = Path('/content/Physics-Informed-Symbolic-Regression/benchmark/data')
else:
    # Local: running from benchmark directory
    DATA_DIR = Path('data')

# Numerical stability thresholds
Y_MIN_THRESHOLD = 1e-6
Y_MAX_THRESHOLD = 1e8
EPS = 1e-10

# AI Feynman standard variable range
AI_FEYNMAN_RANGE = (1.0, 5.0)

print("Configuration constants defined.")
print(f"  Environment: {'Google Colab' if IN_COLAB else 'Local'}")
print(f"  Data directory: {DATA_DIR}")
print(f"  Random seed: {RANDOM_SEED}")
print(f"  Sample sizes: {SAMPLE_SIZES}")
print(f"  Noise levels: {NOISE_LEVELS}")
print(f"  Dummy counts: {DUMMY_COUNTS}")

In [None]:
# ==============================================================================
# USERINPUTS DATACLASS
# ==============================================================================

@dataclass
class UserInputs:
    """
    User-defined inputs required for the Physics-SR Framework.
    
    This dataclass encapsulates all physics-informed prior knowledge
    that users provide to guide the symbolic regression process.
    
    Attributes
    ----------
    variable_dimensions : Dict[str, List[float]]
        Dictionary mapping variable names to dimensional exponents [M, L, T, Theta].
        M = Mass, L = Length, T = Time, Theta = Temperature.
    target_dimensions : List[float]
        Dimensional exponents [M, L, T, Theta] for the target variable.
    physical_bounds : Dict[str, Dict[str, Optional[float]]]
        Physical constraints: {var_name: {'min': float, 'max': float}}
    variable_mapping : Optional[Dict[str, str]]
        Maps column names to physical variable names.
    unit_conversions : Optional[Dict[str, float]]
        Conversion factors to SI units.
    """
    
    variable_dimensions: Dict[str, List[float]]
    target_dimensions: List[float]
    physical_bounds: Dict[str, Dict[str, Optional[float]]]
    variable_mapping: Optional[Dict[str, str]] = None
    unit_conversions: Optional[Dict[str, float]] = None
    
    def __post_init__(self):
        """Validate inputs after initialization."""
        for var_name, dims in self.variable_dimensions.items():
            if len(dims) != 4:
                raise ValueError(
                    f"Variable '{var_name}' has {len(dims)} dimensional exponents, "
                    f"expected 4 [M, L, T, Theta]"
                )
        if len(self.target_dimensions) != 4:
            raise ValueError(
                f"Target dimensions has {len(self.target_dimensions)} exponents, "
                f"expected 4 [M, L, T, Theta]"
            )
    
    def get_variable_names(self) -> List[str]:
        """Return list of variable names."""
        return list(self.variable_dimensions.keys())
    
    def to_dict(self) -> Dict:
        """Convert to dictionary for serialization."""
        return {
            'variable_dimensions': self.variable_dimensions,
            'target_dimensions': self.target_dimensions,
            'physical_bounds': self.physical_bounds,
            'variable_mapping': self.variable_mapping,
            'unit_conversions': self.unit_conversions
        }
    
    @classmethod
    def from_dict(cls, d: Dict) -> 'UserInputs':
        """Create from dictionary."""
        return cls(
            variable_dimensions=d['variable_dimensions'],
            target_dimensions=d['target_dimensions'],
            physical_bounds=d['physical_bounds'],
            variable_mapping=d.get('variable_mapping'),
            unit_conversions=d.get('unit_conversions')
        )

print("UserInputs dataclass defined.")
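
The `[M, L, T, Theta]` exponent vectors combine linearly: multiplying variables adds their vectors, and raising a variable to a power scales its vector. A minimal sketch of that rule, assuming a hypothetical helper `monomial_dimensions` (not part of the framework) and the dot-product variables as the example:

```python
import numpy as np

def monomial_dimensions(variable_dimensions, powers):
    """Dimension vector [M, L, T, Theta] of prod(var ** power)."""
    total = np.zeros(4)
    for name, power in powers.items():
        dims = variable_dimensions[name]
        if len(dims) != 4:
            raise ValueError(f"'{name}' needs 4 exponents, got {len(dims)}")
        total += power * np.asarray(dims, dtype=float)
    return total.tolist()

# Two length-valued components and a length^2 target
variable_dimensions = {"x1": [0, 1, 0, 0], "y1": [0, 1, 0, 0]}
target_dimensions = [0, 2, 0, 0]

term = monomial_dimensions(variable_dimensions, {"x1": 1, "y1": 1})
print(term == target_dimensions)
```

A candidate term whose vector differs from `target_dimensions` cannot appear in a dimensionally consistent sum without a dimensioned constant.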

---
## Section 2: Base Test Equation Class

In [None]:
# ==============================================================================
# BASE TEST EQUATION CLASS
# ==============================================================================

class BaseTestEquation(ABC):
    """
    Abstract base class for AI Feynman benchmark test equations.
    
    All test equations must implement the methods defined here.
    This provides a consistent interface for data generation.
    
    Class Attributes
    ----------------
    name : str
        Short identifier (e.g., 'coulomb')
    full_name : str
        Full descriptive name (e.g., "Coulomb's Law")
    equation_str : str
        Human-readable equation string
    equation_type : str
        Type classification ('rational', 'nested_trigonometric', etc.)
    ai_feynman_id : str
        AI Feynman benchmark ID (e.g., 'I.12.2')
    true_feature_names : List[str]
        Features that appear in the true equation
    dummy_feature_pool : List[str]
        Available dummy (irrelevant) features
    """
    
    # Class attributes to be defined by subclasses
    name: str = "base"
    full_name: str = "Base Equation"
    equation_str: str = "y = f(x)"
    equation_type: str = "abstract"
    ai_feynman_id: str = "N/A"
    
    true_feature_names: List[str] = []
    dummy_feature_pool: List[str] = []
    
    # Internal storage
    _dimensions: Dict[str, List[float]] = {}
    _target_dims: List[float] = [0, 0, 0, 0]
    _ranges: Dict[str, Tuple[float, float]] = {}
    
    @abstractmethod
    def generate_true_features(
        self, 
        n_samples: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """
        Generate values for true features.
        
        Parameters
        ----------
        n_samples : int
            Number of samples to generate
        seed : int
            Random seed for reproducibility
            
        Returns
        -------
        Dict[str, np.ndarray]
            Dictionary mapping feature names to value arrays
        """
        pass
    
    def generate_dummy_features(
        self, 
        n_samples: int, 
        n_dummy: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """
        Generate values for dummy (irrelevant) features.
        
        Parameters
        ----------
        n_samples : int
            Number of samples
        n_dummy : int
            Number of dummy features to generate
        seed : int
            Random seed (offset by 1000 from true features)
            
        Returns
        -------
        Dict[str, np.ndarray]
            Dictionary mapping dummy feature names to value arrays
        """
        if n_dummy <= 0:
            return {}
        
        np.random.seed(seed + 1000)
        selected = self.dummy_feature_pool[:min(n_dummy, len(self.dummy_feature_pool))]
        return {
            name: np.random.uniform(*self._ranges[name], n_samples)
            for name in selected
        }
    
    @abstractmethod
    def compute_target(
        self, 
        features: Dict[str, np.ndarray]
    ) -> np.ndarray:
        """
        Compute target variable from true features.
        
        Parameters
        ----------
        features : Dict[str, np.ndarray]
            Dictionary of feature arrays
            
        Returns
        -------
        np.ndarray
            Target variable values
        """
        pass
    
    def add_noise(
        self, 
        y: np.ndarray, 
        noise_level: float, 
        seed: int,
        noise_type: str = 'multiplicative'
    ) -> np.ndarray:
        """
        Add noise to target variable.
        
        Parameters
        ----------
        y : np.ndarray
            Clean target values
        noise_level : float
            Noise standard deviation (relative for multiplicative)
        seed : int
            Random seed (offset by 2000)
        noise_type : str
            'multiplicative' (log-normal) or 'additive' (Gaussian)
            
        Returns
        -------
        np.ndarray
            Noisy target values
        """
        if noise_level <= 0:
            return y.copy()
        
        np.random.seed(seed + 2000)
        
        if noise_type == 'multiplicative':
            # Log-normal noise (appropriate for positive targets)
            noise = np.exp(np.random.normal(0, noise_level, len(y)))
            return y * noise
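        elif noise_type == 'additive':
            # Additive Gaussian noise (relative to target std)
            noise = np.random.normal(0, noise_level * np.std(y), len(y))
            return y + noise
        else:
            raise ValueError(
                f"Unknown noise_type '{noise_type}'; expected 'multiplicative' or 'additive'"
            )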
    
    def get_variable_dimensions(
        self, 
        feature_names: List[str]
    ) -> Dict[str, List[float]]:
        """Return dimensional exponents [M, L, T, Theta] for given features."""
        return {name: self._dimensions.get(name, [0, 0, 0, 0]) 
                for name in feature_names}
    
    def get_target_dimensions(self) -> List[float]:
        """Return dimensional exponents [M, L, T, Theta] for target."""
        return self._target_dims.copy()
    
    def get_physical_bounds(self) -> Dict[str, Dict[str, Optional[float]]]:
        """Return physical bounds for all variables."""
        bounds = {'target': {'min': None, 'max': None}}
        for name, (low, high) in self._ranges.items():
            bounds[name] = {'min': low * 0.1, 'max': high * 10}
        return bounds
    
    def create_user_inputs(self, feature_names: List[str]) -> UserInputs:
        """Create UserInputs object for this equation."""
        return UserInputs(
            variable_dimensions=self.get_variable_dimensions(feature_names),
            target_dimensions=self.get_target_dimensions(),
            physical_bounds=self.get_physical_bounds()
        )
    
    def get_ground_truth(self) -> Dict[str, Any]:
        """Return ground truth information for evaluation."""
        return {
            'name': self.name,
            'full_name': self.full_name,
            'equation': self.equation_str,
            'active_features': self.true_feature_names.copy(),
            'equation_type': self.equation_type,
            'ai_feynman_id': self.ai_feynman_id
        }
    
    def generate_dataset(
        self,
        n_samples: int,
        noise_level: float,
        n_dummy: int,
        seed: int
    ) -> Dict[str, Any]:
        """
        Generate complete dataset with all metadata.
        
        Parameters
        ----------
        n_samples : int
            Number of samples
        noise_level : float
            Noise level (0.0 to 0.1)
        n_dummy : int
            Number of dummy features
        seed : int
            Random seed
            
        Returns
        -------
        Dict[str, Any]
            Complete dataset dictionary for .npz storage
        """
        # Generate features
        true_features = self.generate_true_features(n_samples, seed)
        dummy_features = self.generate_dummy_features(n_samples, n_dummy, seed)
        
        all_features = {**true_features, **dummy_features}
        
        true_names = list(true_features.keys())
        dummy_names = list(dummy_features.keys())
        feature_names = true_names + dummy_names
        
        # Assemble feature matrix
        X = np.column_stack([all_features[name] for name in feature_names])
        
        # Compute target
        y_true = self.compute_target(true_features)
        y = self.add_noise(y_true, noise_level, seed)
        
        # Create UserInputs for dimensional analysis
        user_inputs = self.create_user_inputs(feature_names)
        ground_truth = self.get_ground_truth()
        
        return {
            # Data arrays (Float32 for v4.1 memory optimization)
            'X': X.astype(np.float32),
            'y': y.astype(np.float32),
            'y_true': y_true.astype(np.float32),
            
            # Feature metadata
            'feature_names': np.array(feature_names),
            'true_features': np.array(true_names),
            'dummy_features': np.array(dummy_names),
            
            # Equation metadata
            'equation_name': self.name,
            'equation_str': self.equation_str,
            'equation_type': self.equation_type,
            'ai_feynman_id': self.ai_feynman_id,
            
            # Experiment parameters
            'n_samples': n_samples,
            'noise_level': noise_level,
            'n_dummy': n_dummy,
            'seed': seed,
            
            # Dimensional information (for UserInputs)
            'variable_dimensions': user_inputs.variable_dimensions,
            'target_dimensions': np.array(user_inputs.target_dimensions),
            'physical_bounds': user_inputs.physical_bounds,
            
            # Ground truth for evaluation
            'ground_truth': ground_truth
        }

print("BaseTestEquation class defined.")
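
The multiplicative branch of `add_noise` scales each target by a log-normal factor, so the sign of `y` is preserved and, for small `noise_level`, the relative error has a standard deviation close to `noise_level`. A minimal sketch of that model, assuming a local `np.random.default_rng` generator in place of the notebook's global `np.random.seed` call (so the draws differ from the notebook's); `lognormal_noise` is a hypothetical name:

```python
import numpy as np

def lognormal_noise(y, noise_level, seed):
    """Multiply y by exp(N(0, noise_level^2)) using a local generator."""
    rng = np.random.default_rng(seed)
    return y * np.exp(rng.normal(0.0, noise_level, size=len(y)))

y_clean = np.linspace(1.0, 5.0, 1000)
y_noisy = lognormal_noise(y_clean, noise_level=0.05, seed=42)

# Per-sample multiplicative factor; its log is Gaussian with std = noise_level
ratio = y_noisy / y_clean
log_std = np.std(np.log(ratio))

# Same seed, same draws: the noise is reproducible
y_again = lognormal_noise(y_clean, noise_level=0.05, seed=42)
```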

---
## Section 3: AI Feynman Test Equation Implementations

In [None]:
# ==============================================================================
# EQUATION 1: COULOMB'S LAW OF ELECTROSTATIC FORCE (AI FEYNMAN I.12.2)
# ==============================================================================

class CoulombEquation(BaseTestEquation):
    """
    Coulomb's Law of Electrostatic Force (AI Feynman I.12.2).
    
    Reference:
        Coulomb, C. A., 1785: Premier memoire sur l'electricite et le magnetisme.
        Feynman Lectures on Physics, Vol. I, Chapter 12.
        AI Feynman Benchmark I.12.2
    
    Equation:
        F = k * q1 * q2 / r^2
    
    Variables:
        q1, q2: Electric charges (C)
        r: Distance between charges (m)
        F: Electrostatic force (N)
        k: Coulomb constant = 8.99e9 N*m^2/C^2
    
    Tests:
        - Inverse-square law detection
        - Rational function discovery
        - Dimensional analysis (distinct dimensions)
    
    Attainable y range for the sampling ranges below: about [9e-5, 9e3] N (numerically stable)
    """
    
    name = "coulomb"
    full_name = "Coulomb's Law of Electrostatic Force"
    equation_str = "F = k * q1 * q2 / r^2"
    equation_type = "rational"
    ai_feynman_id = "I.12.2"
    
    true_feature_names = ['q1', 'q2', 'r']
    dummy_feature_pool = ['v1', 'v2', 'T', 'm', 'epsilon']
    
    # Coulomb constant
    K = 8.99e9  # N*m^2/C^2
    
    # Dimensional exponents [M, L, T, Theta]
    # In SI, 1 C = 1 A*s; the [M, L, T, Theta] basis has no current dimension,
    # so charge is approximated here as [0, 0, 1, 0]
    _dimensions = {
        'q1':      [0, 0, 1, 0],     # Charge (A*s approximation)
        'q2':      [0, 0, 1, 0],
        'r':       [0, 1, 0, 0],     # Length
        'v1':      [0, 1, -1, 0],    # Velocity
        'v2':      [0, 1, -1, 0],
        'T':       [0, 0, 0, 1],     # Temperature
        'm':       [1, 0, 0, 0],     # Mass
        'epsilon': [0, 0, 0, 0],     # Dimensionless
    }
    
    _target_dims = [1, 1, -2, 0]  # Force: N = kg*m/s^2
    
    # Ranges chosen for numerical stability: y spans about [9e-5, 9e3] N
    _ranges = {
        'q1':      (1e-6, 1e-4),      # 1 - 100 microCoulombs
        'q2':      (1e-6, 1e-4),
        'r':       (0.1, 10.0),       # 0.1 - 10 meters
        'v1':      (0.0, 100.0),
        'v2':      (0.0, 100.0),
        'T':       (200.0, 400.0),
        'm':       (0.01, 10.0),
        'epsilon': (1.0, 10.0),
    }
    
    def generate_true_features(
        self, 
        n_samples: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """Generate electric charges and distance."""
        np.random.seed(seed)
        return {
            'q1': np.random.uniform(*self._ranges['q1'], n_samples),
            'q2': np.random.uniform(*self._ranges['q2'], n_samples),
            'r':  np.random.uniform(*self._ranges['r'], n_samples),
        }
    
    def compute_target(
        self, 
        features: Dict[str, np.ndarray]
    ) -> np.ndarray:
        """Compute electrostatic force using Coulomb's law."""
        q1 = features['q1']
        q2 = features['q2']
        r = features['r']
        return self.K * q1 * q2 / (r ** 2)
    
    def get_physical_bounds(self) -> Dict[str, Dict[str, Optional[float]]]:
        """Return physical bounds for all variables."""
        bounds = {'target': {'min': 0.0, 'max': None}}
        for name, (low, high) in self._ranges.items():
            bounds[name] = {'min': 0.0, 'max': high * 10}
        return bounds

print("CoulombEquation class defined (AI Feynman I.12.2).")
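
The extreme values of `F` implied by `_ranges` can be checked directly, since the force is largest for maximal charges at minimal distance and smallest for the reverse. A short worked calculation, with the constants copied from the cells above:

```python
K = 8.99e9  # Coulomb constant, N*m^2/C^2
q_min, q_max = 1e-6, 1e-4   # charge range, C
r_min, r_max = 0.1, 10.0    # distance range, m

# Smallest force: both charges minimal, distance maximal
f_min = K * q_min * q_min / r_max**2
# Largest force: both charges maximal, distance minimal
f_max = K * q_max * q_max / r_min**2

# Thresholds from the configuration cell
Y_MIN_THRESHOLD = 1e-6
Y_MAX_THRESHOLD = 1e8
within_thresholds = Y_MIN_THRESHOLD < f_min and f_max < Y_MAX_THRESHOLD

print(f"{f_min:.3g} N to {f_max:.3g} N, within thresholds: {within_thresholds}")
```

The force therefore spans roughly 8.99e-5 N to 8.99e3 N, about eight orders of magnitude, and stays inside the `Y_MIN_THRESHOLD` and `Y_MAX_THRESHOLD` constants.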

In [None]:
# ==============================================================================
# EQUATION 2: LAW OF COSINES / WAVE INTERFERENCE (AI FEYNMAN I.29.16)
# ==============================================================================

class LawOfCosinesEquation(BaseTestEquation):
    """
    Law of Cosines / Wave Interference (AI Feynman I.29.16).
    
    Reference:
        Feynman Lectures on Physics, Vol. I, Chapter 29.
        AI Feynman Benchmark I.29.16
    
    Equation:
        x = sqrt(x1^2 + x2^2 - 2*x1*x2*cos(theta1 - theta2))
    
    Tests:
        - Trigonometric operator detection (cos)
        - Square root detection (sqrt)
        - Angle difference pattern (translational symmetry)
        - Nested function structure
        - v4.0 Structure-Guided Library benefit
    
    Variables:
        x1, x2: Amplitudes (length)
        theta1, theta2: Phase angles (dimensionless)
        x: Resultant amplitude (length)
    """
    
    name = "cosines"
    full_name = "Law of Cosines (Wave Interference)"
    equation_str = "x = sqrt(x1^2 + x2^2 - 2*x1*x2*cos(theta1 - theta2))"
    equation_type = "nested_trigonometric"
    ai_feynman_id = "I.29.16"
    
    true_feature_names = ['x1', 'x2', 'theta1', 'theta2']
    dummy_feature_pool = ['omega', 't', 'k', 'A', 'phi']
    
    _dimensions = {
        'x1':     [0, 1, 0, 0],     # Amplitude (length)
        'x2':     [0, 1, 0, 0],
        'theta1': [0, 0, 0, 0],     # Angle (dimensionless)
        'theta2': [0, 0, 0, 0],
        'omega':  [0, 0, -1, 0],    # Angular frequency
        't':      [0, 0, 1, 0],     # Time
        'k':      [0, -1, 0, 0],    # Wave number
        'A':      [0, 1, 0, 0],     # Reference amplitude
        'phi':    [0, 0, 0, 0],     # Phase angle
    }
    
    _target_dims = [0, 1, 0, 0]  # Resultant amplitude (length)
    
    # Amplitudes follow the AI Feynman standard range [1, 5]; angles span
    # [0, 2*pi] and dummy features use their own ranges
    _ranges = {
        'x1':     (1.0, 5.0),
        'x2':     (1.0, 5.0),
        'theta1': (0.0, 2 * np.pi),  # Full range for angles
        'theta2': (0.0, 2 * np.pi),
        'omega':  (0.1, 10.0),
        't':      (0.0, 10.0),
        'k':      (0.1, 5.0),
        'A':      (0.5, 5.0),
        'phi':    (0.0, np.pi),
    }
    
    def generate_true_features(
        self, 
        n_samples: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """Generate amplitudes and phase angles."""
        np.random.seed(seed)
        return {
            'x1':     np.random.uniform(*self._ranges['x1'], n_samples),
            'x2':     np.random.uniform(*self._ranges['x2'], n_samples),
            'theta1': np.random.uniform(*self._ranges['theta1'], n_samples),
            'theta2': np.random.uniform(*self._ranges['theta2'], n_samples),
        }
    
    def compute_target(
        self, 
        features: Dict[str, np.ndarray]
    ) -> np.ndarray:
        """Compute resultant amplitude using law of cosines."""
        x1 = features['x1']
        x2 = features['x2']
        theta1 = features['theta1']
        theta2 = features['theta2']
        
        # Law of cosines: c^2 = a^2 + b^2 - 2ab*cos(C)
        arg = x1**2 + x2**2 - 2 * x1 * x2 * np.cos(theta1 - theta2)
        
        # Ensure non-negative before sqrt (numerical safety)
        arg = np.maximum(arg, EPS)
        
        return np.sqrt(arg)
    
    def get_physical_bounds(self) -> Dict[str, Dict[str, Optional[float]]]:
        """Return physical bounds for all variables."""
        bounds = {'target': {'min': 0.0, 'max': None}}
        for name, (low, high) in self._ranges.items():
            if 'theta' in name or name == 'phi':
                bounds[name] = {'min': 0.0, 'max': 2 * np.pi}
            else:
                bounds[name] = {'min': 0.0, 'max': high * 10}
        return bounds

print("LawOfCosinesEquation class defined (AI Feynman I.29.16).")
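
The closed form in `compute_target` is the length of the difference of two planar vectors with magnitudes `x1`, `x2` and angles `theta1`, `theta2`, which also explains why only the angle difference matters. A small numerical check of both facts, using independently drawn samples (the generator and seed here are illustrative, not the notebook's):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1 = rng.uniform(1.0, 5.0, n)
x2 = rng.uniform(1.0, 5.0, n)
theta1 = rng.uniform(0.0, 2 * np.pi, n)
theta2 = rng.uniform(0.0, 2 * np.pi, n)

def law_of_cosines(a, b, t1, t2):
    """sqrt(a^2 + b^2 - 2ab*cos(t1 - t2)), floored at 0 against round-off."""
    arg = a**2 + b**2 - 2 * a * b * np.cos(t1 - t2)
    return np.sqrt(np.maximum(arg, 0.0))

closed_form = law_of_cosines(x1, x2, theta1, theta2)

# Same quantity as |x1*e^{i*theta1} - x2*e^{i*theta2}|
phasor = np.abs(x1 * np.exp(1j * theta1) - x2 * np.exp(1j * theta2))

# Shifting both angles by the same amount leaves the result unchanged
shifted = law_of_cosines(x1, x2, theta1 + 0.7, theta2 + 0.7)
```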

In [None]:
# ==============================================================================
# EQUATION 3: BAROMETRIC FORMULA (AI FEYNMAN I.40.1)
# ==============================================================================

class BarometricEquation(BaseTestEquation):
    """
    Barometric Formula (AI Feynman I.40.1).
    
    Reference:
        Feynman Lectures on Physics, Vol. I, Chapter 40.
        AI Feynman Benchmark I.40.1
    
    Equation:
        n = n0 * exp(-m * g * x / (k_b * T))
    
    Tests:
        - Exponential operator detection (exp)
        - Multi-variable quotient inside exp
        - High-dimensional discovery (6 true variables)
        - Dimensional analysis benefit (many variables)
    
    Note: Eureqa failed to solve this equation in the AI Feynman study.
    
    Variables:
        n0: Reference number density (dimensionless in normalized form)
        m: Particle mass
        g: Gravitational acceleration
        x: Height
        k_b: Boltzmann constant
        T: Temperature
        n: Number density
    """
    
    name = "barometric"
    full_name = "Barometric Formula (Isothermal Atmosphere)"
    equation_str = "n = n0 * exp(-m * g * x / (k_b * T))"
    equation_type = "exponential"
    ai_feynman_id = "I.40.1"
    
    true_feature_names = ['n0', 'm', 'g', 'x', 'k_b', 'T']
    dummy_feature_pool = ['rho', 'P', 'V', 'R', 'mu']
    
    _dimensions = {
        # True features (using normalized/dimensionless in AI Feynman style)
        'n0':  [0, 0, 0, 0],     # Reference density (normalized)
        'm':   [0, 0, 0, 0],     # Particle mass (normalized)
        'g':   [0, 0, 0, 0],     # Gravitational accel (normalized)
        'x':   [0, 0, 0, 0],     # Height (normalized)
        'k_b': [0, 0, 0, 0],     # Boltzmann constant (normalized)
        'T':   [0, 0, 0, 0],     # Temperature (normalized)
        # Dummy features
        'rho': [1, -3, 0, 0],    # Density
        'P':   [1, -1, -2, 0],   # Pressure
        'V':   [0, 3, 0, 0],     # Volume
        'R':   [0, 0, 0, 0],     # Gas constant (normalized)
        'mu':  [1, -1, -1, 0],   # Viscosity
    }
    
    _target_dims = [0, 0, 0, 0]  # Number density (normalized/dimensionless)
    
    # AI Feynman standard: true features in [1, 5]; dummy features use their own ranges
    _ranges = {
        'n0':  (1.0, 5.0),
        'm':   (1.0, 5.0),
        'g':   (1.0, 5.0),
        'x':   (1.0, 5.0),
        'k_b': (1.0, 5.0),
        'T':   (1.0, 5.0),
        'rho': (0.5, 5.0),
        'P':   (1.0, 10.0),
        'V':   (0.5, 5.0),
        'R':   (1.0, 5.0),
        'mu':  (0.1, 2.0),
    }
    
    def generate_true_features(
        self, 
        n_samples: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """Generate all 6 true features."""
        np.random.seed(seed)
        return {
            'n0':  np.random.uniform(*self._ranges['n0'], n_samples),
            'm':   np.random.uniform(*self._ranges['m'], n_samples),
            'g':   np.random.uniform(*self._ranges['g'], n_samples),
            'x':   np.random.uniform(*self._ranges['x'], n_samples),
            'k_b': np.random.uniform(*self._ranges['k_b'], n_samples),
            'T':   np.random.uniform(*self._ranges['T'], n_samples),
        }
    
    def compute_target(
        self, 
        features: Dict[str, np.ndarray]
    ) -> np.ndarray:
        """Compute number density using barometric formula."""
        n0 = features['n0']
        m = features['m']
        g = features['g']
        x = features['x']
        k_b = features['k_b']
        T = features['T']
        
        # n = n0 * exp(-m * g * x / (k_b * T))
        exponent = -m * g * x / (k_b * T)
        
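        # Clip exponent to prevent underflow. With the ranges above the exponent
        # lies in [-125, -0.04], so only the lower bound can be active; clipped
        # samples are floored at n0 * exp(-20) instead of following the exact formula.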
        exponent = np.clip(exponent, -20, 20)
        
        return n0 * np.exp(exponent)
    
    def get_physical_bounds(self) -> Dict[str, Dict[str, Optional[float]]]:
        """Return physical bounds for all variables."""
        bounds = {'target': {'min': 0.0, 'max': None}}
        for name, (low, high) in self._ranges.items():
            bounds[name] = {'min': 0.0, 'max': high * 10}
        return bounds

print("BarometricEquation class defined (AI Feynman I.40.1).")
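
Because every true feature is sampled from [1, 5], the exponent `-m*g*x/(k_b*T)` is always negative and its range follows from the extremes of numerator and denominator. The sketch below works that out and estimates how often the `-20` clip is triggered; the samples are drawn independently with an illustrative seed, so the fraction is indicative rather than the notebook's exact value:

```python
import numpy as np

lo, hi = 1.0, 5.0  # sampling range of m, g, x, k_b, T

# Most negative: numerator maximal (hi^3), denominator minimal (lo^2)
exponent_min = -(hi**3) / (lo**2)
# Least negative: numerator minimal (lo^3), denominator maximal (hi^2)
exponent_max = -(lo**3) / (hi**2)

rng = np.random.default_rng(0)
m, g, x, k_b, T = (rng.uniform(lo, hi, 100_000) for _ in range(5))
exponent = -m * g * x / (k_b * T)

# Share of samples whose exponent falls below the clip bound of -20
clipped_fraction = np.mean(exponent < -20)
print(exponent_min, exponent_max, clipped_fraction)
```

Samples below the bound receive `n0 * exp(-20)` as their target, so a regression method evaluated on them is not fitting the exact barometric formula.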

In [None]:
# ==============================================================================
# EQUATION 4: DOT PRODUCT OF TWO 3D VECTORS (AI FEYNMAN I.11.19)
# ==============================================================================

class DotProductEquation(BaseTestEquation):
    """
    Dot Product of Two 3D Vectors (AI Feynman I.11.19).
    
    Reference:
        Feynman Lectures on Physics, Vol. I, Chapter 11.
        AI Feynman Benchmark I.11.19
    
    Equation:
        A = x1*y1 + x2*y2 + x3*y3
    
    Tests:
        - Variable interaction discovery (iRF Stage 1.4)
        - Correct pairing identification
        - 15 possible pairwise interactions, only 3 correct
        - Pure polynomial form
    
    Variables:
        x1, x2, x3: Components of vector x (length)
        y1, y2, y3: Components of vector y (length)
        A: Dot product result (length^2)
    """
    
    name = "dotproduct"
    full_name = "Dot Product (3D Vector)"
    equation_str = "A = x1*y1 + x2*y2 + x3*y3"
    equation_type = "polynomial_interaction"
    ai_feynman_id = "I.11.19"
    
    true_feature_names = ['x1', 'x2', 'x3', 'y1', 'y2', 'y3']
    dummy_feature_pool = ['z1', 'z2', 'z3', 'w1', 'w2']
    
    _dimensions = {
        # All components have same dimension (length)
        'x1': [0, 1, 0, 0],
        'x2': [0, 1, 0, 0],
        'x3': [0, 1, 0, 0],
        'y1': [0, 1, 0, 0],
        'y2': [0, 1, 0, 0],
        'y3': [0, 1, 0, 0],
        'z1': [0, 1, 0, 0],
        'z2': [0, 1, 0, 0],
        'z3': [0, 1, 0, 0],
        'w1': [0, 1, 0, 0],
        'w2': [0, 1, 0, 0],
    }
    
    _target_dims = [0, 2, 0, 0]  # Dot product: length^2
    
    # AI Feynman standard: all variables in [1, 5]
    _ranges = {
        'x1': (1.0, 5.0),
        'x2': (1.0, 5.0),
        'x3': (1.0, 5.0),
        'y1': (1.0, 5.0),
        'y2': (1.0, 5.0),
        'y3': (1.0, 5.0),
        'z1': (1.0, 5.0),
        'z2': (1.0, 5.0),
        'z3': (1.0, 5.0),
        'w1': (1.0, 5.0),
        'w2': (1.0, 5.0),
    }
    
    def generate_true_features(
        self, 
        n_samples: int, 
        seed: int
    ) -> Dict[str, np.ndarray]:
        """Generate all 6 vector components."""
        np.random.seed(seed)
        return {
            'x1': np.random.uniform(*self._ranges['x1'], n_samples),
            'x2': np.random.uniform(*self._ranges['x2'], n_samples),
            'x3': np.random.uniform(*self._ranges['x3'], n_samples),
            'y1': np.random.uniform(*self._ranges['y1'], n_samples),
            'y2': np.random.uniform(*self._ranges['y2'], n_samples),
            'y3': np.random.uniform(*self._ranges['y3'], n_samples),
        }
    
    def compute_target(
        self, 
        features: Dict[str, np.ndarray]
    ) -> np.ndarray:
        """Compute dot product."""
        x1, x2, x3 = features['x1'], features['x2'], features['x3']
        y1, y2, y3 = features['y1'], features['y2'], features['y3']
        
        return x1 * y1 + x2 * y2 + x3 * y3
    
    def get_physical_bounds(self) -> Dict[str, Dict[str, Optional[float]]]:
        """Return physical bounds for all variables."""
        bounds = {'target': {'min': 0.0, 'max': None}}
        for name, (_low, high) in self._ranges.items():
            bounds[name] = {'min': 0.0, 'max': high * 10}
        return bounds

print("DotProductEquation class defined (AI Feynman I.11.19).")
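
The dimension vectors in `DotProductEquation` can be checked mechanically. Under the usual convention that multiplying two quantities adds their exponent vectors, each term `x_i * y_i` has dimension `[0, 1, 0, 0] + [0, 1, 0, 0] = [0, 2, 0, 0]`, which matches `_target_dims`. The sketch below is only an illustration and not part of the framework: `product_dims` is a hypothetical helper, and the data are freshly sampled rather than taken from a generated dataset.

```python
import numpy as np

# Exponent vectors copied from DotProductEquation._dimensions (true features only)
dims = {name: np.array([0, 1, 0, 0]) for name in ['x1', 'x2', 'x3', 'y1', 'y2', 'y3']}
target_dims = np.array([0, 2, 0, 0])


def product_dims(*names):
    """Hypothetical helper: the dimension of a product is the sum of exponent vectors."""
    return sum(dims[n] for n in names)


# Every additive term must carry the same dimension as the target
term_dims = [product_dims(f'x{i}', f'y{i}') for i in (1, 2, 3)]
all_match = all(np.array_equal(d, target_dims) for d in term_dims)
print("Terms dimensionally consistent:", all_match)

# Numerical check: the explicit sum equals a row-wise dot product
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 5.0, size=(10, 3))
y = rng.uniform(1.0, 5.0, size=(10, 3))
a = x[:, 0] * y[:, 0] + x[:, 1] * y[:, 1] + x[:, 2] * y[:, 2]
print("Matches einsum:", np.allclose(a, np.einsum('ij,ij->i', x, y)))
```

With every component drawn from [1, 5], the noise-free target lies in [3, 75], so the lower bound of 0.0 returned by `get_physical_bounds` is never violated by clean data.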

In [None]:
# ==============================================================================
# EQUATION REGISTRY
# ==============================================================================

EQUATION_REGISTRY = {
    'coulomb': CoulombEquation,
    'cosines': LawOfCosinesEquation,
    'barometric': BarometricEquation,
    'dotproduct': DotProductEquation,
}

# Ground truth reference dictionary
GROUND_TRUTH_REGISTRY = {
    'coulomb': {
        'equation': 'F = k * q1 * q2 / r**2',
        'active_features': ['q1', 'q2', 'r'],
        'type': 'rational',
        'ai_feynman_id': 'I.12.2',
    },
    'cosines': {
        'equation': 'x = sqrt(x1**2 + x2**2 - 2*x1*x2*cos(theta1 - theta2))',
        'active_features': ['x1', 'x2', 'theta1', 'theta2'],
        'type': 'nested_trigonometric',
        'ai_feynman_id': 'I.29.16',
    },
    'barometric': {
        'equation': 'n = n0 * exp(-m*g*x/(k_b*T))',
        'active_features': ['n0', 'm', 'g', 'x', 'k_b', 'T'],
        'type': 'exponential',
        'ai_feynman_id': 'I.40.1',
    },
    'dotproduct': {
        'equation': 'A = x1*y1 + x2*y2 + x3*y3',
        'active_features': ['x1', 'x2', 'x3', 'y1', 'y2', 'y3'],
        'type': 'polynomial_interaction',
        'ai_feynman_id': 'I.11.19',
    },
}


def get_equation(name: str) -> BaseTestEquation:
    """
    Get equation instance by name.
    
    Parameters
    ----------
    name : str
        Equation name (coulomb, cosines, barometric, dotproduct)
        
    Returns
    -------
    BaseTestEquation
        Equation instance
    """
    if name not in EQUATION_REGISTRY:
        raise ValueError(
            f"Unknown equation: {name}. "
            f"Available: {list(EQUATION_REGISTRY.keys())}"
        )
    return EQUATION_REGISTRY[name]()


print("Equation registry defined.")
print(f"Available equations: {list(EQUATION_REGISTRY.keys())}")
print()
print("AI Feynman Benchmark IDs:")
for name, info in GROUND_TRUTH_REGISTRY.items():
    print(f"  {name}: {info['ai_feynman_id']} ({info['type']})")

In [None]:
# ==============================================================================
# EXPERIMENT CONFIGURATION FUNCTIONS
# ==============================================================================

def get_core_experiment_configs() -> List[Dict[str, Any]]:
    """
    Generate configurations for core experiments.
    
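    Core experiments: 4 equations x 2 noise x 2 dummy x 2 dims = 32 experiment configs.
    The `with_dims` flag does not change the generated data, so these 32 configs
    share 16 dataset files. With 3 methods each = 96 total experiments.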
    
    Returns
    -------
    List[Dict[str, Any]]
        List of experiment configurations
    """
    configs = []
    equations = list(EQUATION_REGISTRY.keys())
    
    for eq_name in equations:
        for noise in [0.0, 0.05]:
            for n_dummy in [0, 5]:
                for with_dims in [True, False]:
                    configs.append({
                        'equation_name': eq_name,
                        'n_samples': 500,
                        'noise_level': noise,
                        'n_dummy': n_dummy,
                        'with_dims': with_dims,
                    })
    
    return configs


def get_supplementary_experiment_configs() -> List[Dict[str, Any]]:
    """
    Generate configurations for supplementary experiments.
    
    Supplementary experiments: 4 equations x 2 sample sizes (250, 750) = 8 configs.
    Physics-SR only, with 5% noise, 5 dummy features, and dimensions enabled.
    
    Returns
    -------
    List[Dict[str, Any]]
        List of experiment configurations
    """
    configs = []
    equations = list(EQUATION_REGISTRY.keys())
    
    for eq_name in equations:
        for n_samples in [250, 750]:
            configs.append({
                'equation_name': eq_name,
                'n_samples': n_samples,
                'noise_level': 0.05,
                'n_dummy': 5,
                'with_dims': True,
            })
    
    return configs


print("Experiment configuration functions defined.")
print(f"  Core configs: {len(get_core_experiment_configs())}")
print(f"  Supplementary configs: {len(get_supplementary_experiment_configs())}")
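
The counting in `get_core_experiment_configs` can be reproduced with `itertools.product`. The sketch below rebuilds the same grid from hard-coded lists (copied from the loops above, not read from the registry) and shows that dropping `with_dims` leaves 16 distinct dataset keys, since the dataset filename depends only on equation, sample size, noise level, and dummy count.

```python
from itertools import product

# Values copied from the nested loops in get_core_experiment_configs
equations = ['coulomb', 'cosines', 'barometric', 'dotproduct']
noise_levels = [0.0, 0.05]
dummy_counts = [0, 5]
dims_flags = [True, False]

core = [
    {
        'equation_name': eq,
        'n_samples': 500,
        'noise_level': noise,
        'n_dummy': n_dummy,
        'with_dims': with_dims,
    }
    for eq, noise, n_dummy, with_dims in product(
        equations, noise_levels, dummy_counts, dims_flags
    )
]

# A dataset file is identified by everything except with_dims
unique_datasets = {
    (c['equation_name'], c['n_samples'], c['noise_level'], c['n_dummy'])
    for c in core
}

print(len(core), len(unique_datasets))  # 32 16
```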

---
## Section 4: Benchmark Data Generator Class
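
The generator in this section stores nested dictionaries (`variable_dimensions`, `physical_bounds`, `ground_truth`) inside the `.npz` archive by pickling them and saving the bytes as a `uint8` array under a `_pkl` suffix. The sketch below shows that round trip in isolation, using an in-memory buffer instead of a file; the key name `meta_pkl` and the sample dictionary are made up for illustration.

```python
import io
import pickle

import numpy as np

# Example nested structure (illustrative values only)
nested = {'x1': [0, 1, 0, 0], 'target': {'min': 0.0, 'max': None}}

# Serialize to bytes, then view those bytes as a uint8 array
blob = np.frombuffer(pickle.dumps(nested), dtype=np.uint8)

# Write an archive to memory instead of disk
buffer = io.BytesIO()
np.savez_compressed(buffer, X=np.ones((3, 2), dtype=np.float32), meta_pkl=blob)
buffer.seek(0)

# Plain numeric arrays load without allow_pickle; unpickling is done explicitly
with np.load(buffer) as npz:
    X = npz['X']
    restored = pickle.loads(npz['meta_pkl'].tobytes())

print(restored == nested, X.dtype)
```

Because `pickle.loads` can execute arbitrary code, archives in this format should only be loaded from trusted sources.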

In [None]:
# ==============================================================================
# BENCHMARK DATA GENERATOR
# ==============================================================================

class BenchmarkDataGenerator:
    """
    Generate and manage benchmark datasets for Physics-SR Framework v4.1.
    
    This class handles:
    - Dataset generation from equation objects
    - Saving/loading .npz files with pickled nested structures
    - Generating all experiment configurations
    - Dataset verification and statistics
    
    Attributes
    ----------
    output_dir : Path
        Directory for saving generated datasets
    equations : Dict[str, BaseTestEquation]
        Registry of available equation instances
    """
    
    def __init__(self, output_dir: Union[str, Path] = "data"):
        """
        Initialize the data generator.
        
        Parameters
        ----------
        output_dir : Union[str, Path]
            Directory for saving datasets
        """
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        
        # Instantiate all equation classes
        self.equations = {
            name: cls() for name, cls in EQUATION_REGISTRY.items()
        }
        
        print("BenchmarkDataGenerator initialized.")
        print(f"  Output directory: {self.output_dir}")
        print(f"  Available equations: {list(self.equations.keys())}")
    
    def generate_filename(
        self,
        equation_name: str,
        n_samples: int,
        noise_level: float,
        n_dummy: int
    ) -> str:
        """
        Generate standardized filename for dataset.
        
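        Format: eq{N}_{name}_n{samples}_noise{level}_dummy{count}.npz
        where N is the 1-based position of the equation in EQUATION_REGISTRY
        and {level} is formatted with two decimals (e.g., noise0.05).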
        
        Parameters
        ----------
        equation_name : str
            Name of the equation
        n_samples : int
            Number of samples
        noise_level : float
            Noise level (e.g., 0.05)
        n_dummy : int
            Number of dummy features
            
        Returns
        -------
        str
            Standardized filename
        """
        eq_names = list(EQUATION_REGISTRY.keys())
        eq_num = eq_names.index(equation_name) + 1
        return f"eq{eq_num}_{equation_name}_n{n_samples}_noise{noise_level:.2f}_dummy{n_dummy}.npz"
    
    def generate_dataset(
        self,
        equation_name: str,
        n_samples: int = DEFAULT_N_SAMPLES,
        noise_level: float = 0.0,
        n_dummy: int = 0,
        seed: int = RANDOM_SEED
    ) -> Dict[str, Any]:
        """
        Generate a single test dataset.
        
        Parameters
        ----------
        equation_name : str
            Name of the equation ('coulomb', 'cosines', 'barometric', 'dotproduct')
        n_samples : int
            Number of samples to generate
        noise_level : float
            Noise level (0.0 to 0.1)
        n_dummy : int
            Number of dummy features to include
        seed : int
            Random seed for reproducibility
            
        Returns
        -------
        Dict[str, Any]
            Generated dataset with all metadata
        """
        if equation_name not in self.equations:
            raise ValueError(
                f"Unknown equation: {equation_name}. "
                f"Available: {list(self.equations.keys())}"
            )
        
        equation = self.equations[equation_name]
        data = equation.generate_dataset(n_samples, noise_level, n_dummy, seed)
        
        # Add additional metadata
        eq_names = list(EQUATION_REGISTRY.keys())
        data['equation_index'] = eq_names.index(equation_name) + 1
        data['full_name'] = equation.full_name
        
        return data
    
    def save_dataset(
        self,
        data: Dict[str, Any],
        filename: Optional[str] = None
    ) -> Path:
        """
        Save dataset to .npz file.
        
        Uses pickle serialization for complex nested dict structures
        (variable_dimensions, physical_bounds, ground_truth).
        
        Parameters
        ----------
        data : Dict[str, Any]
            Dataset dictionary to save
        filename : Optional[str]
            Output filename (auto-generated if None)
            
        Returns
        -------
        Path
            Path to saved file
        """
        if filename is None:
            filename = self.generate_filename(
                data['equation_name'],
                data['n_samples'],
                data['noise_level'],
                data['n_dummy']
            )
        
        filepath = self.output_dir / filename
        
        # Prepare data for saving
        save_data = {}
        for key, value in data.items():
            if key in ['variable_dimensions', 'physical_bounds', 'ground_truth']:
                # Use pickle for complex nested dict structures
                save_data[key + '_pkl'] = np.frombuffer(
                    pickle.dumps(value), dtype=np.uint8
                )
            elif isinstance(value, np.ndarray):
                save_data[key] = value
            elif isinstance(value, (list, tuple)):
                save_data[key] = np.array(value)
            else:
                save_data[key] = value
        
        np.savez_compressed(filepath, **save_data)
        return filepath
    
    def load_dataset(self, filename: Union[str, Path]) -> Dict[str, Any]:
        """
        Load dataset from .npz file.
        
        Automatically unpickles nested dict structures.
        
        Parameters
        ----------
        filename : Union[str, Path]
            Filename to load
            
        Returns
        -------
        Dict[str, Any]
            Loaded dataset
        """
        filepath = self.output_dir / filename
        
        with np.load(filepath, allow_pickle=True) as npz:
            data = {}
            for key in npz.files:
                if key.endswith('_pkl'):
                    # Unpickle nested dict structures
                    original_key = key[:-4]
                    data[original_key] = pickle.loads(npz[key].tobytes())
                else:
                    value = npz[key]
                    # Convert 0-d arrays to scalars
                    if value.ndim == 0:
                        data[key] = value.item()
                    else:
                        data[key] = value
        
        return data
    
    def generate_all_datasets(
        self,
        equations: Optional[List[str]] = None,
        sample_sizes: Optional[List[int]] = None,
        noise_levels: Optional[List[float]] = None,
        dummy_counts: Optional[List[int]] = None,
        seed: int = RANDOM_SEED,
        verbose: bool = True
    ) -> List[Path]:
        """
        Generate all datasets for benchmark suite.
        
        Parameters
        ----------
        equations : Optional[List[str]]
            List of equation names (all if None)
        sample_sizes : Optional[List[int]]
            List of sample sizes (default [500])
        noise_levels : Optional[List[float]]
            List of noise levels (default [0.0, 0.05])
        dummy_counts : Optional[List[int]]
            List of dummy feature counts (default [0, 5])
        seed : int
            Random seed for reproducibility
        verbose : bool
            Print progress messages
            
        Returns
        -------
        List[Path]
            List of paths to generated files
        """
        if equations is None:
            equations = list(EQUATION_REGISTRY.keys())
        if sample_sizes is None:
            sample_sizes = [500]
        if noise_levels is None:
            noise_levels = NOISE_LEVELS
        if dummy_counts is None:
            dummy_counts = DUMMY_COUNTS
        
        total = len(equations) * len(sample_sizes) * len(noise_levels) * len(dummy_counts)
        
        if verbose:
            print(f"Generating {total} datasets...")
            print(f"  Equations: {equations}")
            print(f"  Sample sizes: {sample_sizes}")
            print(f"  Noise levels: {noise_levels}")
            print(f"  Dummy counts: {dummy_counts}")
            print()
        
        generated_files = []
        count = 0
        
        for eq_name in equations:
            for n_samples in sample_sizes:
                for noise in noise_levels:
                    for n_dummy in dummy_counts:
                        data = self.generate_dataset(
                            equation_name=eq_name,
                            n_samples=n_samples,
                            noise_level=noise,
                            n_dummy=n_dummy,
                            seed=seed
                        )
                        
                        # Validate y range (prints a warning on failure; generation continues)
                        self._validate_y_range(data['y'], eq_name)
                        
                        filepath = self.save_dataset(data)
                        generated_files.append(filepath)
                        
                        count += 1
                        if verbose:
                            print(f"  [{count}/{total}] Generated: {filepath.name}")
        
        if verbose:
            print()
            print(f"Generation complete. {len(generated_files)} files created.")
        
        return generated_files
    
    def _validate_y_range(
        self, 
        y: np.ndarray, 
        eq_name: str,
        min_abs: float = Y_MIN_THRESHOLD,
        max_abs: float = Y_MAX_THRESHOLD
    ) -> bool:
        """
        Validate that y values are in numerically stable range.
        
        Parameters
        ----------
        y : np.ndarray
            Target values
        eq_name : str
            Equation name for logging
        min_abs : float
            Minimum acceptable |y| value
        max_abs : float
            Maximum acceptable |y| value
            
        Returns
        -------
        bool
            True if y is in valid range
        """
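        # NaN/inf would slip through the comparisons below, so reject them first
        if not np.all(np.isfinite(y)):
            print(f"    [WARNING] {eq_name}: y contains NaN or inf values!")
            return False
        
        y_nonzero = y[y != 0]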
        if len(y_nonzero) == 0:
            print(f"    [WARNING] {eq_name}: All y values are zero!")
            return False
        
        y_abs = np.abs(y_nonzero)
        y_min, y_max = y_abs.min(), y_abs.max()
        
        if y_min < min_abs:
            print(f"    [WARNING] {eq_name}: y_min ({y_min:.2e}) < {min_abs:.0e}")
            return False
        if y_max > max_abs:
            print(f"    [WARNING] {eq_name}: y_max ({y_max:.2e}) > {max_abs:.0e}")
            return False
        
        return True
    
    def verify_dataset(self, filename: str) -> Dict[str, Any]:
        """
        Verify dataset integrity and compute statistics.
        
        Parameters
        ----------
        filename : str
            Filename to verify
            
        Returns
        -------
        Dict[str, Any]
            Verification results
        """
        data = self.load_dataset(filename)
        
        X = data['X']
        y = data['y']
        y_true = data['y_true']
        
        # Compute statistics
        stats = {
            'n_samples': X.shape[0],
            'n_features': X.shape[1],
            'n_true_features': len(data['true_features']),
            'n_dummy_features': len(data['dummy_features']),
            'y_min': float(np.min(y)),
            'y_max': float(np.max(y)),
            'y_mean': float(np.mean(y)),
            'y_std': float(np.std(y)),
            'y_has_nan': bool(np.any(np.isnan(y))),
            'y_has_inf': bool(np.any(np.isinf(y))),
            'noise_actual': float(np.std(y - y_true) / (np.std(y_true) + EPS)),
        }
        
        return stats
    
    def list_datasets(self) -> List[Path]:
        """List all datasets in output directory."""
        return sorted(self.output_dir.glob("*.npz"))
    
    def print_summary(self):
        """Print summary of available datasets."""
        datasets = self.list_datasets()
        
        print("=" * 70)
        print(" DATASET SUMMARY")
        print("=" * 70)
        print(f"Output directory: {self.output_dir}")
        print(f"Total datasets: {len(datasets)}")
        print()
        
        if len(datasets) == 0:
            print("No datasets found.")
            return
        
        by_equation = {}
        for path in datasets:
            # Filenames follow eq{N}_{name}_...; skip any .npz that does not
            parts = path.stem.split('_')
            if len(parts) < 2:
                continue
            by_equation.setdefault(parts[1], []).append(path)
        
        for eq_name, paths in by_equation.items():
            print(f"  {eq_name}: {len(paths)} datasets")

print("BenchmarkDataGenerator class defined.")
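
`generate_filename` encodes the dataset parameters in the filename, and `print_summary` recovers the equation name by splitting on underscores. When the other fields are needed as well, a regular expression can invert the format. The function below is a hypothetical helper, not part of the notebook, and it assumes equation names consist of lowercase letters only (true for the four registered equations).

```python
import re

# Mirrors the format eq{N}_{name}_n{samples}_noise{level:.2f}_dummy{count}.npz
FILENAME_RE = re.compile(
    r'^eq(?P<index>\d+)_(?P<name>[a-z]+)_n(?P<n_samples>\d+)'
    r'_noise(?P<noise>\d+\.\d{2})_dummy(?P<n_dummy>\d+)\.npz$'
)


def parse_dataset_filename(filename):
    """Hypothetical inverse of generate_filename; raises on non-matching names."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"Not a benchmark dataset filename: {filename}")
    return {
        'equation_index': int(match['index']),
        'equation_name': match['name'],
        'n_samples': int(match['n_samples']),
        'noise_level': float(match['noise']),
        'n_dummy': int(match['n_dummy']),
    }


info = parse_dataset_filename("eq1_coulomb_n500_noise0.05_dummy5.npz")
print(info)
```

Parsing the full name rather than splitting on underscores also rejects unrelated files, which matters if other `.npz` files end up in the output directory.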

---
## Section 5: Generate All Datasets

In [None]:
print("=" * 70)
print(" GENERATING CORE EXPERIMENT DATASETS")
print("=" * 70)
print()

generator = BenchmarkDataGenerator(output_dir=DATA_DIR)
print()

In [None]:
# Generate core experiment datasets (n=500)
core_files = generator.generate_all_datasets(
    equations=None,  # All equations
    sample_sizes=[500],
    noise_levels=[0.0, 0.05],
    dummy_counts=[0, 5],
    seed=RANDOM_SEED,
    verbose=True
)

print(f"\nCore datasets generated: {len(core_files)}")

In [None]:
print("=" * 70)
print(" GENERATING SUPPLEMENTARY EXPERIMENT DATASETS")
print("=" * 70)
print()

# Generate supplementary datasets (n=250, 750)
supp_files = generator.generate_all_datasets(
    equations=None,  # All equations
    sample_sizes=[250, 750],
    noise_levels=[0.05],
    dummy_counts=[5],
    seed=RANDOM_SEED,
    verbose=True
)

print(f"\nSupplementary datasets generated: {len(supp_files)}")

In [None]:
print("\n" + "=" * 70)
print(" GENERATION COMPLETE")
print("=" * 70)

all_files = generator.list_datasets()
total_size = sum(f.stat().st_size for f in all_files) / 1024 / 1024

print(f"\nTotal datasets: {len(all_files)}")
print(f"Total size: {total_size:.2f} MB")
print(f"Output directory: {generator.output_dir.absolute()}")

---
## Section 6: Validation and Verification
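
`verify_dataset` reports `noise_actual` as `std(y - y_true) / std(y_true)`. The sketch below illustrates what that ratio measures on synthetic data. It assumes additive Gaussian noise with standard deviation `noise_level * std(y_true)`; the actual noise model lives in `BaseTestEquation.generate_dataset`, which is not shown in this section, so this is an assumption rather than a description of the framework. The `EPS` value is likewise a stand-in for the notebook constant.

```python
import numpy as np

EPS = 1e-12  # stand-in for the notebook's EPS constant (assumed value)

rng = np.random.default_rng(42)

# Clean target: any non-constant signal works for the illustration
y_true = rng.uniform(3.0, 75.0, size=100_000)

# ASSUMPTION: additive Gaussian noise scaled by the spread of the clean target
noise_level = 0.05
y = y_true + rng.normal(0.0, noise_level * np.std(y_true), size=y_true.shape)

# Same ratio as the 'noise_actual' entry in verify_dataset
noise_actual = float(np.std(y - y_true) / (np.std(y_true) + EPS))
print(f"Requested noise level: {noise_level:.3f}")
print(f"Measured noise ratio:  {noise_actual:.3f}")
```

Under this assumed noise model the measured ratio should land close to the requested level, with the gap shrinking as the sample size grows. For the 250 to 750 samples used in the benchmark, a deviation of a few percent of the requested level is expected from sampling alone.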

In [None]:
print("=" * 70)
print(" Y-VALUE RANGE VERIFICATION")
print("=" * 70)
print()

for eq_name in EQUATION_REGISTRY.keys():
    filename = generator.generate_filename(eq_name, 500, 0.0, 0)
    filepath = generator.output_dir / filename
    
    if filepath.exists():
        data = generator.load_dataset(filename)
        y = data['y']
        
        y_nonzero = y[y != 0]
        if len(y_nonzero) > 0:
            y_abs = np.abs(y_nonzero)
            y_min, y_max = y_abs.min(), y_abs.max()
        else:
            y_min, y_max = 0, 0
        
        # Check if in stable range
        stable = (y_min >= Y_MIN_THRESHOLD) and (y_max <= Y_MAX_THRESHOLD)
        status = "[OK]" if stable else "[WARNING]"
        
        print(f"{eq_name} ({data['ai_feynman_id']}):")
        print(f"  Equation: {data['equation_str']}")
        print(f"  y range: [{y_min:.2e}, {y_max:.2e}]")
        print(f"  Status: {status}")
        print()

In [None]:
print("=" * 70)
print(" DATASET STRUCTURE VERIFICATION")
print("=" * 70)
print()

for eq_name in EQUATION_REGISTRY.keys():
    filename = generator.generate_filename(eq_name, 500, 0.0, 5)
    filepath = generator.output_dir / filename
    
    if filepath.exists():
        data = generator.load_dataset(filename)
        
        print(f"{eq_name} ({data['ai_feynman_id']}):")
        print(f"  X.shape = {data['X'].shape}")
        print(f"  X.dtype = {data['X'].dtype}")
        print(f"  true_features = {list(data['true_features'])}")
        print(f"  dummy_features = {list(data['dummy_features'])}")
        print(f"  target_dimensions = {data['target_dimensions']}")
        print(f"  has variable_dimensions = {'variable_dimensions' in data}")
        print(f"  has physical_bounds = {'physical_bounds' in data}")
        print()

In [None]:
# Plot target distributions
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for idx, eq_name in enumerate(EQUATION_REGISTRY.keys()):
    ax = axes[idx]
    
    filename = generator.generate_filename(eq_name, 500, 0.0, 0)
    filepath = generator.output_dir / filename
    
    if filepath.exists():
        data = generator.load_dataset(filename)
        y = data['y']
        
        ax.hist(y, bins=50, edgecolor='black', alpha=0.7)
        title = f"{eq_name.upper()} ({data['ai_feynman_id']})\n{data['equation_str']}"
        ax.set_title(title, fontsize=10)
        ax.set_xlabel('Target Value')
        ax.set_ylabel('Frequency')
        
        stats_text = f"Mean: {np.mean(y):.2e}\nStd: {np.std(y):.2e}"
        ax.text(0.95, 0.95, stats_text, transform=ax.transAxes,
                verticalalignment='top', horizontalalignment='right',
                fontsize=8, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.savefig(generator.output_dir / 'target_distributions.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\nDistribution plot saved to: {generator.output_dir / 'target_distributions.png'}")

In [None]:
generator.print_summary()

---
## Section 7: Usage Examples
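
A common follow-up to loading a dataset is separating the true features from the dummy ones, for example to score feature selection. The sketch below uses a small stand-in array instead of a loaded file. It assumes that column `j` of `X` corresponds to `feature_names[j]`, which this section does not state explicitly, and the dummy names `d1` and `d2` are placeholders.

```python
import numpy as np

# Stand-in for fields of a loaded dataset (placeholder values and dummy names)
feature_names = np.array(['q1', 'q2', 'r', 'd1', 'd2'])
true_features = np.array(['q1', 'q2', 'r'])
X = np.arange(20, dtype=np.float32).reshape(4, 5)

# ASSUMPTION: X columns are ordered exactly as feature_names
name_to_col = {name: j for j, name in enumerate(feature_names.tolist())}
true_cols = [name_to_col[name] for name in true_features.tolist()]

# Boolean mask over columns, useful for comparing against a selected-feature set
is_true = np.isin(feature_names, true_features)

X_true = X[:, true_cols]
X_dummy = X[:, ~is_true]
print(X_true.shape, X_dummy.shape)  # (4, 3) (4, 2)
```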

In [None]:
print("=" * 70)
print(" EXAMPLE: Loading Dataset for Experiments")
print("=" * 70)
print()

# Example: Load Coulomb dataset with noise and dummy features
example_file = "eq1_coulomb_n500_noise0.05_dummy5.npz"
data = generator.load_dataset(example_file)

print(f"Loaded: {example_file}")
print()
print("Data structure:")
print(f"  X.shape = {data['X'].shape}")
print(f"  X.dtype = {data['X'].dtype}")
print(f"  y.shape = {data['y'].shape}")
print(f"  feature_names = {list(data['feature_names'])}")
print(f"  true_features = {list(data['true_features'])}")
print(f"  dummy_features = {list(data['dummy_features'])}")
print()
print("Ground truth:")
print(f"  equation = {data['equation_str']}")
print(f"  type = {data['equation_type']}")
print(f"  ai_feynman_id = {data['ai_feynman_id']}")

In [None]:
print("=" * 70)
print(" EXAMPLE: Creating UserInputs for Pipeline")
print("=" * 70)
print()

# Reconstruct UserInputs from loaded dataset
user_inputs = UserInputs(
    variable_dimensions=data['variable_dimensions'],
    target_dimensions=list(data['target_dimensions']),
    physical_bounds=data['physical_bounds']
)

print("UserInputs created from dataset:")
print(f"  Variables: {user_inputs.get_variable_names()}")
print(f"  Target dims: {user_inputs.target_dimensions}")
print()
print("Variable dimensions:")
for var, dims in user_inputs.variable_dimensions.items():
    print(f"  {var}: {dims}")

In [None]:
print("\n" + "=" * 70)
print(" DataGen Module Complete")
print("=" * 70)
print()
print("Available classes:")
print("  - UserInputs")
print("  - BaseTestEquation")
print("  - CoulombEquation (AI Feynman I.12.2)")
print("  - LawOfCosinesEquation (AI Feynman I.29.16)")
print("  - BarometricEquation (AI Feynman I.40.1)")
print("  - DotProductEquation (AI Feynman I.11.19)")
print("  - BenchmarkDataGenerator")
print()
print("Available functions:")
print("  - get_equation(name)")
print("  - get_core_experiment_configs()")
print("  - get_supplementary_experiment_configs()")
print()
print(f"Generated datasets: {len(generator.list_datasets())}")
print(f"Output directory: {generator.output_dir.absolute()}")

In [None]:
'''
# ==============================================================================
# DOWNLOAD GENERATED DATA (Uncomment in Colab)
# ==============================================================================

import shutil
from pathlib import Path
from google.colab import files

# Define path directly
data_dir = Path('/content/Physics-Informed-Symbolic-Regression/benchmark/data')
zip_path = '/content/benchmark_data_v4.1.zip'

# Create zip of data directory
shutil.make_archive('/content/benchmark_data_v4.1', 'zip', data_dir)

print(f"Data directory: {data_dir}")
print(f"Number of files: {len(list(data_dir.glob('*.npz')))}")
print()
print("Downloading benchmark_data_v4.1.zip...")

files.download(zip_path)
'''