# 00_Core - Physics-SR Framework v4.1

## Foundation Module: DataClasses, Utilities, TimeBudgetManager, and Test Data Generators

**Author:** Zhengze Zhang  
**Affiliation:** Department of Statistics, Columbia University  
**Date:** January 2026  
**Version:** 4.1 (Structure-Guided Feature Library Enhancement + Computational Optimization)

---

### Purpose

This notebook provides the foundational components for the Three-Stage Physics-Informed Symbolic Regression Framework:

1. **DataClasses**: `UserInputs`, `Stage1Results`, `Stage2Results`, `Stage3Results`
2. **TimeBudgetManager**: Adaptive time allocation for computational optimization (NEW v4.1)
3. **Utility Functions**: Safe math operations, formatting, metrics, memory management
4. **Test Data Generators**: Warm rain microphysics, polynomial, trigonometric, pendulum
5. **Global Configuration Constants**: Including PYSR_MODES (v4.1)

### Usage

This module is imported by all other notebooks via:
```python
%run 00_Core.ipynb
```

### Changelog v4.1

- Added TimeBudgetManager for adaptive time allocation
- Added PYSR_MODES configuration dictionary
- Added Float32 conversion and memory cleanup utilities
- Updated Stage2Results with parsed_terms, detected_operators, augmented_library
- Added timing field to all Stage dataclasses

---
## Section 1: Imports and Constants

In [None]:
"""
00_Core.ipynb - Foundation Module v4.1
======================================

Three-Stage Physics-Informed Symbolic Regression Framework v4.1

This module provides:
- DataClasses for user inputs and stage results
- TimeBudgetManager for computational optimization (NEW v4.1)
- Utility functions for safe numerical operations
- Test data generators for algorithm validation
- Global configuration constants including PYSR_MODES

Author: Zhengze Zhang
Affiliation: Department of Statistics, Columbia University
Contact: zz3239@columbia.edu
"""

# Standard library imports
import warnings
import time
import gc
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Optional, Union, Any

# Scientific computing
import numpy as np
import pandas as pd
from scipy import stats
from scipy.special import comb
from scipy.integrate import simpson

# Machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, LassoCV
from sklearn.metrics import mean_squared_error, r2_score

# Symbolic computation
import sympy as sp
from sympy import symbols, sympify, expand, Add, Mul, Pow

# Optional imports (with graceful fallback)
try:
    from pysr import PySRRegressor
    PYSR_AVAILABLE = True
except ImportError:
    PYSR_AVAILABLE = False
    warnings.warn("PySR not installed. PySR pathway will be disabled.")

# Parallel computing
try:
    from joblib import Parallel, delayed
    JOBLIB_AVAILABLE = True
except ImportError:
    JOBLIB_AVAILABLE = False

print("00_Core v4.1: Required imports successful.")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"PySR available: {PYSR_AVAILABLE}")
print(f"Joblib available: {JOBLIB_AVAILABLE}")

In [None]:
# ==============================================================================
# GLOBAL CONFIGURATION CONSTANTS v4.1
# ==============================================================================

# Stage 1 defaults
DEFAULT_MAX_EXPONENT = 4              # Buckingham pi search range
DEFAULT_IMPORTANCE_THRESHOLD = 0.01   # Variable screening threshold
DEFAULT_POWERLAW_R2_THRESHOLD = 0.9   # Power-law detection R^2 threshold
DEFAULT_SOFTMAX_TEMPERATURE = 0.5     # iRF soft reweighting temperature
DEFAULT_STABILITY_THRESHOLD = 0.5     # Interaction stability threshold

# Stage 2 defaults
DEFAULT_MAX_POLY_DEGREE = 3           # Feature library polynomial degree
DEFAULT_STLSQ_THRESHOLD = 0.1         # STLSQ sparsity threshold
DEFAULT_STLSQ_MAX_ITER = 20           # STLSQ maximum iterations
DEFAULT_ALASSO_GAMMA = 1.0            # Adaptive Lasso gamma parameter
DEFAULT_ALASSO_EPS = 1e-6             # Adaptive Lasso stabilization constant

# Stage 3 defaults
DEFAULT_CV_FOLDS = 5                  # Cross-validation folds
DEFAULT_EBIC_GAMMA = 0.5              # EBIC gamma parameter
DEFAULT_N_BOOTSTRAP = 200             # Number of bootstrap samples
DEFAULT_CONFIDENCE_LEVEL = 0.95       # Confidence interval level
DEFAULT_DIM_TOLERANCE = 0.05          # Dimensional check tolerance

# v4.1 Computational Optimization Constants
DEFAULT_RUNTIME_BUDGET = 180          # Total runtime budget (seconds)
DEFAULT_PYSR_TIMEOUT = 100            # PySR timeout (seconds)
DEFAULT_PROCS = 2                     # Number of parallel processes
DEFAULT_PRECISION = 32                # Float precision (32 or 64)

# PySR Mode Configurations (v4.1)
PYSR_MODES = {
    'fast': {
        'niterations': 20,
        'maxsize': 18,
        'maxdepth': 8,
        'populations': 8,
        'population_size': 33,
        'ncycles_per_iteration': 350,
        'timeout_in_seconds': 60
    },
    'standard': {
        'niterations': 40,
        'maxsize': 20,
        'maxdepth': 10,
        'populations': 15,
        'population_size': 33,
        'ncycles_per_iteration': 400,
        'timeout_in_seconds': 100
    },
    'thorough': {
        'niterations': 80,
        'maxsize': 25,
        'maxdepth': 12,
        'populations': 20,
        'population_size': 50,
        'ncycles_per_iteration': 550,
        'timeout_in_seconds': 150
    }
}

# Random seed for reproducibility
RANDOM_SEED = 42

# Numerical stability constants
EPS_LOG = 1e-10                       # Epsilon for log safety
EPS_DIV = 1e-10                       # Epsilon for division safety
EPS_EXP_CLIP = 20                     # Clip for exp to prevent overflow

# ==============================================================================
# v4.7 DUAL-TRACK SELECTION CONSTANTS
# ==============================================================================
DEFAULT_PYSR_TRUST_THRESHOLD = 0.70   # Above this, trust PySR structure
DEFAULT_PYSR_SKIP_THRESHOLD = 0.95    # Above this, HIGH TRUST mode
DEFAULT_MAX_PYSR_TERMS = 12           # Cap on PySR terms even in HIGH TRUST
DEFAULT_MAX_TOTAL_TERMS = 15          # Cap on total selected terms

print("Global configuration constants v4.7 defined.")
print(f"PYSR_MODES available: {list(PYSR_MODES.keys())}")
print(f"v4.7 Dual-Track: TRUST_THRESHOLD={DEFAULT_PYSR_TRUST_THRESHOLD}, MAX_PYSR_TERMS={DEFAULT_MAX_PYSR_TERMS}")
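
As a rough illustration of how one of the `PYSR_MODES` presets might be consumed, the sketch below copies a preset and optionally tightens its timeout, for example with a value derived from the runtime budget. The helper `resolve_pysr_kwargs` is hypothetical and not part of the framework, and the two presets are repeated here (trimmed to three keys) only so the snippet runs on its own.

```python
from typing import Any, Dict, Optional

# Trimmed copy of two presets from PYSR_MODES, repeated so the sketch is standalone.
MODES: Dict[str, Dict[str, Any]] = {
    "fast": {"niterations": 20, "maxsize": 18, "timeout_in_seconds": 60},
    "standard": {"niterations": 40, "maxsize": 20, "timeout_in_seconds": 100},
}


def resolve_pysr_kwargs(mode: str, timeout: Optional[int] = None) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of a preset, optionally lowering its timeout."""
    if mode not in MODES:
        raise ValueError(f"Unknown mode '{mode}', expected one of {sorted(MODES)}")
    kwargs = dict(MODES[mode])  # copy so the shared preset is never mutated
    if timeout is not None:
        # Assumption of this sketch: never exceed the preset's own timeout.
        kwargs["timeout_in_seconds"] = min(timeout, kwargs["timeout_in_seconds"])
    return kwargs


kwargs = resolve_pysr_kwargs("standard", timeout=70)
print(kwargs)
```

Copying the preset before modifying it keeps the module-level dictionary unchanged across runs. The preset keys appear to match `PySRRegressor` keyword arguments, so the result could presumably be unpacked as `PySRRegressor(**kwargs)`.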

---
## Section 2: DataClasses

In [None]:
# ==============================================================================
# USER INPUTS DATACLASS
# ==============================================================================

@dataclass
class UserInputs:
    """
    User-defined inputs required for the Physics-SR Framework.
    
    These inputs must be prepared before running the pipeline and require
    domain knowledge about the physical system being modeled.
    
    Attributes
    ----------
    variable_dimensions : Dict[str, List[float]]
        Dictionary mapping variable names to their dimensional exponents [M, L, T, Theta].
        M = Mass, L = Length, T = Time, Theta = Temperature.
        Example: {'velocity': [0, 1, -1, 0]}  # m/s has L^1 * T^-1
        
    target_dimensions : List[float]
        Dimensional exponents [M, L, T, Theta] for the target variable.
        Example: [0, 0, -1, 0] for a rate with units s^-1
        
    physical_bounds : Dict[str, Dict[str, Optional[float]]]
        Physical constraints for variables and target.
        Format: {var_name: {'min': float or None, 'max': float or None}}
        Example: {'target': {'min': 0, 'max': None}}  # Non-negative target
        
    variable_mapping : Optional[Dict[str, str]]
        Maps data column names to standardized physical variable names.
        Example: {'cloud_water_mixing_ratio': 'q_c'}
        
    unit_conversions : Optional[Dict[str, float]]
        Conversion factors to convert data to SI units.
        Example: {'radius_um': 1e-6}  # Convert micrometers to meters
    """
    
    # Required fields
    variable_dimensions: Dict[str, List[float]]
    target_dimensions: List[float]
    physical_bounds: Dict[str, Dict[str, Optional[float]]]
    
    # Optional fields with defaults
    variable_mapping: Optional[Dict[str, str]] = None
    unit_conversions: Optional[Dict[str, float]] = None
    
    def __post_init__(self):
        """Validate inputs after initialization."""
        # Validate dimensional exponents have length 4
        for var_name, dims in self.variable_dimensions.items():
            if len(dims) != 4:
                raise ValueError(
                    f"Variable '{var_name}' has {len(dims)} dimensional exponents, "
                    f"expected 4 [M, L, T, Theta]"
                )
        
        if len(self.target_dimensions) != 4:
            raise ValueError(
                f"Target dimensions has {len(self.target_dimensions)} exponents, "
                f"expected 4 [M, L, T, Theta]"
            )
    
    def get_variable_names(self) -> List[str]:
        """Return list of variable names."""
        return list(self.variable_dimensions.keys())
    
    def get_dimension_matrix(self) -> np.ndarray:
        """Return dimensional matrix D where D[i,j] = exponent of dimension i for variable j."""
        var_names = self.get_variable_names()
        n_vars = len(var_names)
        D = np.zeros((4, n_vars))
        for j, var_name in enumerate(var_names):
            D[:, j] = self.variable_dimensions[var_name]
        return D

print("UserInputs dataclass defined.")
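
To make the role of the dimension matrix concrete, here is a minimal sketch that builds the same 4 x n_vars layout as `get_dimension_matrix` for a hypothetical pendulum problem and checks whether a candidate product of variables is dimensionless. The variable names, the candidate exponents, and the helper `group_dimensions` are illustrative assumptions; plain lists are used instead of NumPy so the snippet has no dependencies.

```python
from typing import Dict, List

# Hypothetical pendulum variables; exponents are ordered [M, L, T, Theta].
variable_dimensions: Dict[str, List[float]] = {
    "period": [0, 0, 1, 0],    # s
    "length": [0, 1, 0, 0],    # m
    "gravity": [0, 1, -2, 0],  # m s^-2
    "mass": [1, 0, 0, 0],      # kg
}
names = list(variable_dimensions)

# D[i][j] = exponent of base dimension i for variable j (4 rows, one column per variable).
D = [[variable_dimensions[name][i] for name in names] for i in range(4)]


def group_dimensions(exponents: List[float]) -> List[float]:
    """Dimensions of prod_j x_j ** exponents[j], i.e. the matrix-vector product D @ exponents."""
    return [sum(D[i][j] * exponents[j] for j in range(len(names))) for i in range(4)]


# Candidate group: period^2 * gravity / length (mass does not appear).
candidate = [2, -1, 1, 0]
dims = group_dimensions(candidate)
print(dims, "dimensionless" if all(d == 0 for d in dims) else "not dimensionless")
```

A product of variables is dimensionless exactly when its exponent vector lies in the null space of `D`, which is the condition a Buckingham pi search has to satisfy.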

In [None]:
# ==============================================================================
# STAGE 1 RESULTS DATACLASS
# ==============================================================================

@dataclass
class Stage1Results:
    """
    Results from Stage 1: Variable Selection & Preprocessing.
    
    Contains outputs from:
    - 1.1 Buckingham Pi Dimensional Analysis
    - 1.2 PAN+SR Variable Screening
    - 1.3 Symmetry Analysis
    - 1.4 iRF Interaction Discovery
    
    Attributes
    ----------
    # Buckingham Pi results
    pi_groups : Optional[Dict[str, np.ndarray]]
        Dictionary mapping pi-group names to their exponent vectors
    pi_exponents : Optional[np.ndarray]
        Matrix of exponents for selected pi-groups (n_groups x n_variables)
    pi_group_names : Optional[List[str]]
        Human-readable names for each pi-group
    X_transformed : Optional[np.ndarray]
        Data transformed to dimensionless pi-groups
    all_pi_candidates : Optional[List[Dict]]
        All candidate pi-groups with complexity scores
        
    # Variable screening results
    selected_indices : Optional[List[int]]
        Indices of variables selected by screening
    selected_names : Optional[List[str]]
        Names of selected variables
    importance_scores : Optional[Dict[str, float]]
        RF permutation importance for each variable
        
    # Symmetry analysis results
    is_power_law : bool
        Whether power-law relationship was detected
    estimated_exponents : Optional[Dict[str, float]]
        Estimated power-law exponents for each variable
    power_law_r2 : Optional[float]
        R-squared of log-log regression
    structural_hints : Optional[Dict]
        Hints about equation structure for Stage 2
        
    # Interaction discovery results
    stable_interactions : Optional[List[Tuple[int, ...]]]
        List of stable feature interactions (as tuples of indices)
    interaction_stability : Optional[Dict[Tuple, float]]
        Stability scores for each interaction
    soft_weights : Optional[np.ndarray]
        Soft reweighting weights from iRF
        
    # Timing (v4.1)
    timing : Optional[Dict[str, float]]
        Execution time for each sub-stage
    """
    
    # Buckingham Pi results
    pi_groups: Optional[Dict[str, np.ndarray]] = None
    pi_exponents: Optional[np.ndarray] = None
    pi_group_names: Optional[List[str]] = None
    X_transformed: Optional[np.ndarray] = None
    all_pi_candidates: Optional[List[Dict]] = None
    
    # Variable screening results
    selected_indices: Optional[List[int]] = None
    selected_names: Optional[List[str]] = None
    importance_scores: Optional[Dict[str, float]] = None
    
    # Symmetry analysis results
    is_power_law: bool = False
    estimated_exponents: Optional[Dict[str, float]] = None
    power_law_r2: Optional[float] = None
    structural_hints: Optional[Dict] = None
    
    # Interaction discovery results
    stable_interactions: Optional[List[Tuple[int, ...]]] = None
    interaction_stability: Optional[Dict[Tuple, float]] = None
    soft_weights: Optional[np.ndarray] = None
    
    # Timing (v4.1)
    timing: Optional[Dict[str, float]] = None

print("Stage1Results dataclass defined.")

In [None]:
# ==============================================================================
# STAGE 2 RESULTS DATACLASS (v4.1 Enhanced)
# ==============================================================================

@dataclass
class Stage2Results:
    """
    Results from Stage 2: Structure-Guided Discovery (v4.1).
    
    Contains outputs from:
    - 2.1 PySR Structure Exploration
    - 2.2 Structure Parsing (NEW v4.0)
    - 2.3 Augmented Library Construction (NEW v4.0)
    - 2.4 E-WSINDy Sparse Selection
    - 2.5 Adaptive Lasso Verification (optional)
    
    Attributes
    ----------
    # PySR results
    pysr_equations : Optional[List[str]]
        List of equations discovered by PySR
    pysr_pareto : Optional[pd.DataFrame]
        Pareto front of complexity vs accuracy
    best_pysr_equation : Optional[str]
        Best equation from PySR
    best_pysr_sympy : Optional[Any]
        Best PySR equation as SymPy expression
    best_pysr_r2 : Optional[float]
        R-squared of best PySR equation
    pysr_elapsed_time : Optional[float]
        PySR execution time (v4.1)
    pysr_model : Optional[Any]
        Fitted PySR model, stored for test predictions (v4.1.2)
        
    # Structure Parsing results (NEW v4.0)
    parsed_terms : Optional[List[Tuple]]
        List of (expr, name, func) tuples from parsing
    detected_operators : Optional[set]
        Set of operators found in PySR equations {'sin', 'cos', 'exp', ...}
    term_to_equation_map : Optional[Dict]
        Mapping from terms to source equations
        
    # Augmented Library (NEW v4.0)
    augmented_library : Optional[np.ndarray]
        5-layer augmented feature matrix Phi_aug
    library_names : Optional[List[str]]
        Feature names with source tags [PowLaw], [PySR], [Var], [Poly], [Op]
    library_info : Optional[Dict]
        Library composition statistics
    library_builder : Optional[AugmentedLibraryBuilder]
        Builder object for transforming new data (v4.1, for test predictions)
        
    # E-WSINDy results
    ewsindy_coefficients : Optional[np.ndarray]
        Coefficient vector from E-WSINDy
    ewsindy_support : Optional[np.ndarray]
        Boolean mask of selected features
    ewsindy_equation : Optional[str]
        Equation string from E-WSINDy
    ewsindy_r2 : Optional[float]
        R-squared of E-WSINDy fit
    selection_analysis : Optional[Dict]
        Analysis of which sources contributed selected terms (v4.1)
        
    # Adaptive Lasso results (optional)
    alasso_coefficients : Optional[np.ndarray]
        Coefficient vector from Adaptive Lasso
    alasso_support : Optional[np.ndarray]
        Boolean mask of selected features
    alasso_r2 : Optional[float]
        R-squared of Adaptive Lasso fit
        
    # Timing (v4.1)
    timing : Optional[Dict[str, float]]
        Execution time for each sub-stage
        
    # Backward compatibility aliases (v4.1)
    feature_library : property
        Read-only alias for augmented_library
    feature_names : property
        Read-only alias for library_names
    """
    
    # PySR results
    pysr_equations: Optional[List[str]] = None
    pysr_pareto: Optional[pd.DataFrame] = None
    best_pysr_equation: Optional[str] = None
    best_pysr_sympy: Optional[Any] = None
    best_pysr_r2: Optional[float] = None
    pysr_elapsed_time: Optional[float] = None
    pysr_model: Optional[Any] = None  # v4.1.2: Store PySR model for test predictions
    
    # Structure Parsing results (NEW v4.0)
    parsed_terms: Optional[List[Tuple]] = None
    detected_operators: Optional[set] = None
    term_to_equation_map: Optional[Dict] = None
    
    # Augmented Library (NEW v4.0)
    augmented_library: Optional[np.ndarray] = None
    library_names: Optional[List[str]] = None
    library_info: Optional[Dict] = None
    library_builder: Optional[Any] = None  # v4.1: For test set transformation
    
    # E-WSINDy results
    ewsindy_coefficients: Optional[np.ndarray] = None
    ewsindy_support: Optional[np.ndarray] = None
    ewsindy_equation: Optional[str] = None
    ewsindy_r2: Optional[float] = None
    selection_analysis: Optional[Dict] = None
    
    # Adaptive Lasso results (optional)
    alasso_coefficients: Optional[np.ndarray] = None
    alasso_support: Optional[np.ndarray] = None
    alasso_r2: Optional[float] = None
    
    # Timing (v4.1)
    timing: Optional[Dict[str, float]] = None
    
    # v4.7 Dual-Track results
    ewsindy_intercept: Optional[float] = None
    final_method: Optional[str] = None  # 'pysr_refined' or 'ewsindy'
    pysr_refined_equation: Optional[str] = None
    pysr_refined_r2: Optional[float] = None
    curve_fit_success: Optional[bool] = None
    
    # Backward compatibility properties
    @property
    def feature_library(self):
        """Alias for augmented_library (backward compatibility)."""
        return self.augmented_library
    
    @property
    def feature_names(self):
        """Alias for library_names (backward compatibility)."""
        return self.library_names

print("Stage2Results dataclass v4.7 defined.")
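
The backward-compatibility aliases rely on the fact that a `@property` defined on a dataclass is not a field. The sketch below shows this with a hypothetical, cut-down stand-in for `Stage2Results`; the class name `LibraryResult` and the sample feature names are illustrative only.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class LibraryResult:
    """Hypothetical, cut-down stand-in for Stage2Results."""
    library_names: Optional[List[str]] = None

    @property
    def feature_names(self) -> Optional[List[str]]:
        """Old name, kept as a read-only alias for library_names."""
        return self.library_names


result = LibraryResult(library_names=["[Var] x0", "[Poly] x0^2"])
field_names = [f.name for f in fields(result)]

print(result.feature_names)  # same list object as result.library_names
print(field_names)           # the alias is not listed as a dataclass field
```

Because the alias has no setter, older code can still read `feature_names`, but it cannot be assigned to or passed to the constructor; new values have to go through `library_names`.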

In [None]:
# ==============================================================================
# STAGE 3 RESULTS DATACLASS
# ==============================================================================

@dataclass
class Stage3Results:
    """
    Results from Stage 3: Validation & Uncertainty Quantification.
    
    Contains outputs from:
    - 3.1 Model Selection (CV + EBIC)
    - 3.2 Physics Verification
    - 3.3 Uncertainty Quantification (Three-Layer)
    - 3.4 Formal Statistical Inference
    
    Attributes
    ----------
    # Model selection
    cv_scores : Optional[Dict[str, Tuple[float, float]]]
        CV scores for candidate models {model_id: (mean, std)}
    ebic_scores : Optional[Dict[str, float]]
        EBIC scores for candidate models
    best_model : Optional[str]
        Identifier of best model
        
    # Physics verification
    dim_consistent : bool
        Whether equation is dimensionally consistent
    dim_details : Optional[Dict]
        Details of dimensional analysis
    bounds_violations : Optional[Dict]
        Physical bounds violations detected
    physics_score : Optional[float]
        Overall physics verification score [0-1]
        
    # Structural UQ (Layer 1)
    inclusion_probabilities : Optional[np.ndarray]
        Bootstrap inclusion probabilities for each term
    structural_confidence : Optional[Dict[str, str]]
        Confidence classification (HIGH/MODERATE/LOW) per term
    bootstrap_supports : Optional[np.ndarray]
        Support patterns across bootstrap samples
        
    # Parametric UQ (Layer 2)
    coefficient_estimates : Optional[np.ndarray]
        Point estimates of coefficients
    coefficient_CI : Optional[np.ndarray]
        Confidence intervals (n_coef x 2)
    coefficient_SE : Optional[np.ndarray]
        Standard errors of coefficients
    bootstrap_coefficients : Optional[np.ndarray]
        Bootstrap coefficient samples (B x n_coef)
        
    # Predictive UQ (Layer 3)
    prediction_intervals : Optional[Tuple[np.ndarray, np.ndarray]]
        Lower and upper prediction interval bounds
    pi_coverage : Optional[float]
        Empirical coverage of prediction intervals
    model_variance : Optional[np.ndarray]
        Model uncertainty component
    residual_variance : Optional[float]
        Residual uncertainty component
        
    # Hypothesis testing
    p_values : Optional[Dict[str, float]]
        P-values from hypothesis tests
    significant_terms : Optional[List[str]]
        Terms that are statistically significant
    test_statistics : Optional[Dict[str, float]]
        Test statistics for each term
        
    # Final equation
    final_equation : Optional[str]
        Final equation string
    final_coefficients : Optional[Dict[str, float]]
        Final coefficient values by term name
        
    # Timing (v4.1)
    timing : Optional[Dict[str, float]]
        Execution time for each sub-stage
    """
    
    # Model selection
    cv_scores: Optional[Dict[str, Tuple[float, float]]] = None
    ebic_scores: Optional[Dict[str, float]] = None
    best_model: Optional[str] = None
    
    # Physics verification
    dim_consistent: bool = False
    dim_details: Optional[Dict] = None
    bounds_violations: Optional[Dict] = None
    physics_score: Optional[float] = None
    
    # Structural UQ (Layer 1)
    inclusion_probabilities: Optional[np.ndarray] = None
    structural_confidence: Optional[Dict[str, str]] = None
    bootstrap_supports: Optional[np.ndarray] = None
    
    # Parametric UQ (Layer 2)
    coefficient_estimates: Optional[np.ndarray] = None
    coefficient_CI: Optional[np.ndarray] = None
    coefficient_SE: Optional[np.ndarray] = None
    bootstrap_coefficients: Optional[np.ndarray] = None
    
    # Predictive UQ (Layer 3)
    prediction_intervals: Optional[Tuple[np.ndarray, np.ndarray]] = None
    pi_coverage: Optional[float] = None
    model_variance: Optional[np.ndarray] = None
    residual_variance: Optional[float] = None
    
    # Hypothesis testing
    p_values: Optional[Dict[str, float]] = None
    significant_terms: Optional[List[str]] = None
    test_statistics: Optional[Dict[str, float]] = None
    
    # Final equation
    final_equation: Optional[str] = None
    final_coefficients: Optional[Dict[str, float]] = None
    
    # Timing (v4.1)
    timing: Optional[Dict[str, float]] = None

print("Stage3Results dataclass defined.")

---
## Section 3: TimeBudgetManager (NEW v4.1)

In [None]:
# ==============================================================================
# TIME BUDGET MANAGER (NEW v4.1)
# ==============================================================================

class TimeBudgetManager:
    """
    Manage runtime budget across pipeline stages.
    
    Provides adaptive time allocation for computational optimization.
    Designed for Google Colab Pro with 180-second total budget.
    
    Attributes
    ----------
    total_budget : float
        Total runtime budget in seconds
    start_time : float
        Pipeline start time
    stage_times : Dict[str, float]
        Cumulative time at each checkpoint
    
    Examples
    --------
    >>> budget = TimeBudgetManager(total_budget_seconds=180)
    >>> pysr_timeout = budget.allocate_pysr_time()
    >>> # ... run PySR ...
    >>> budget.record_stage('PySR')
    >>> n_bootstrap = budget.allocate_bootstrap_count()
    >>> print(budget.report())
    """
    
    def __init__(self, total_budget_seconds: float = DEFAULT_RUNTIME_BUDGET):
        """
        Initialize TimeBudgetManager.
        
        Parameters
        ----------
        total_budget_seconds : float
            Total runtime budget (default: 180 for Colab Pro)
        """
        self.total_budget = total_budget_seconds
        self.start_time = time.time()
        self.stage_times = {}
    
    def elapsed(self) -> float:
        """Return elapsed time since start."""
        return time.time() - self.start_time
    
    def remaining(self) -> float:
        """Return remaining time in budget."""
        return max(0, self.total_budget - self.elapsed())
    
    def allocate_pysr_time(self, reserve_for_stage3: float = 40) -> int:
        """
        Calculate PySR timeout based on remaining budget.
        
        Parameters
        ----------
        reserve_for_stage3 : float
            Time to reserve for Stage 3 (default: 40s)
            
        Returns
        -------
        int
            PySR timeout in seconds
        """
        available = self.remaining() - reserve_for_stage3
        # Give 70% of the unreserved time to PySR, clamped to [30, 120] s.
        # The 30 s floor applies even when less time than that remains.
        pysr_time = max(30, min(available * 0.7, 120))
        return int(pysr_time)
    
    def allocate_bootstrap_count(self, time_per_bootstrap: float = 0.2) -> int:
        """
        Calculate bootstrap count based on remaining budget.
        
        Parameters
        ----------
        time_per_bootstrap : float
            Estimated time per bootstrap sample
            
        Returns
        -------
        int
            Number of bootstrap samples
        """
        available = self.remaining() - 10  # Reserve 10s for output
        max_bootstraps = int(available / time_per_bootstrap)
        # Clamp to [50, 200]; the floor of 50 applies even if the budget is exhausted
        return max(50, min(max_bootstraps, 200))
    
    def should_skip_optional(self, min_required: float = 30) -> bool:
        """
        Decide whether to skip optional components.
        
        Parameters
        ----------
        min_required : float
            Minimum time required for optional component
            
        Returns
        -------
        bool
            True if should skip
        """
        return self.remaining() < min_required
    
    def record_stage(self, stage_name: str) -> None:
        """
        Record time for a stage.
        
        Parameters
        ----------
        stage_name : str
            Name of the stage
        """
        self.stage_times[stage_name] = self.elapsed()
    
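    def get_stage_duration(self, stage_name: str) -> float:
        """Return the duration of one stage (0.0 if it was never recorded).

        Relies on stage_times preserving the order in which stages were recorded.
        """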
        times = list(self.stage_times.values())
        keys = list(self.stage_times.keys())
        if stage_name not in keys:
            return 0.0
        idx = keys.index(stage_name)
        if idx == 0:
            return times[0]
        return times[idx] - times[idx-1]
    
    def report(self) -> str:
        """Generate timing report."""
        lines = ["=== Timing Report ==="]
        prev = 0
        for stage, cumulative in self.stage_times.items():
            duration = cumulative - prev
            lines.append(f"  {stage}: {duration:.1f}s")
            prev = cumulative
        lines.append("  ---")
        lines.append(f"  Total: {self.elapsed():.1f}s / {self.total_budget}s")
        lines.append(f"  Remaining: {self.remaining():.1f}s")
        return "\n".join(lines)
    
    def to_dict(self) -> Dict[str, float]:
        """Return timing as dictionary."""
        result = {}
        prev = 0
        for stage, cumulative in self.stage_times.items():
            result[stage] = cumulative - prev
            prev = cumulative
        result['total'] = self.elapsed()
        result['remaining'] = self.remaining()
        return result

print("TimeBudgetManager class defined.")
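
The allocation rules are easier to follow with numbers plugged in. The sketch below restates the arithmetic of `allocate_pysr_time`, `allocate_bootstrap_count`, and the cumulative-to-duration conversion used by `report` and `to_dict` as standalone functions that take the remaining time as an argument. The function names and the sample checkpoint values are assumptions made for this illustration.

```python
from typing import Dict


def pysr_timeout(remaining: float, reserve_for_stage3: float = 40) -> int:
    """Same arithmetic as allocate_pysr_time, with `remaining` passed in explicitly."""
    available = remaining - reserve_for_stage3
    return int(max(30, min(available * 0.7, 120)))


def bootstrap_count(remaining: float, time_per_bootstrap: float = 0.2) -> int:
    """Same arithmetic as allocate_bootstrap_count."""
    available = remaining - 10  # 10 s held back for output
    return max(50, min(int(available / time_per_bootstrap), 200))


def stage_durations(checkpoints: Dict[str, float]) -> Dict[str, float]:
    """Convert cumulative checkpoints (as stored in stage_times) to per-stage durations."""
    durations, prev = {}, 0.0
    for stage, cumulative in checkpoints.items():
        durations[stage] = cumulative - prev
        prev = cumulative
    return durations


# Plenty of time, a typical case, and a nearly exhausted budget.
for remaining in (300, 170, 50):
    print(remaining, pysr_timeout(remaining), bootstrap_count(remaining))

print(stage_durations({"Stage1": 12.0, "PySR": 80.0, "Stage3": 130.0}))
```

With 170 s remaining, 130 s are unreserved and 70% of that gives a 91 s timeout. With only 50 s remaining, the 30 s floor still applies, so the timeout plus the 40 s reserve exceeds what is actually left; the manager favors a minimum amount of search over strict adherence to the budget.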

---
## Section 4: Utility Functions

In [None]:
# ==============================================================================
# SAFE NUMERICAL OPERATIONS
# ==============================================================================

def safe_log(x: np.ndarray, eps: float = EPS_LOG) -> np.ndarray:
    """
    Safe logarithm that handles zero and negative values.
    
    Parameters
    ----------
    x : np.ndarray
        Input array
    eps : float
        Small constant to add before taking log
        
    Returns
    -------
    np.ndarray
        log(|x| + eps)
    """
    return np.log(np.abs(x) + eps)


def safe_exp(x: np.ndarray, clip: float = EPS_EXP_CLIP) -> np.ndarray:
    """
    Safe exponential that prevents overflow.
    
    Parameters
    ----------
    x : np.ndarray
        Input array
    clip : float
        Maximum absolute value for clipping
        
    Returns
    -------
    np.ndarray
        exp(clip(x, -clip, clip))
    """
    return np.exp(np.clip(x, -clip, clip))


def safe_divide(num: np.ndarray, denom: np.ndarray, eps: float = EPS_DIV) -> np.ndarray:
    """
    Safe division that handles zero denominators.
    
    Parameters
    ----------
    num : np.ndarray
        Numerator
    denom : np.ndarray
        Denominator
    eps : float
        Small constant to add to denominator
        
    Returns
    -------
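    np.ndarray
        num / (denom + eps * s), where s = sign(denom) and s = +1 if denom == 0
    """
    # Adding the boolean mask maps sign 0 to +1 without in-place assignment,
    # so scalar denominators work as well as arrays.
    sign = np.sign(denom) + (denom == 0)
    return num / (denom + eps * sign)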

# Alias for backward compatibility
safe_div = safe_divide


def safe_sqrt(x: np.ndarray, eps: float = EPS_LOG) -> np.ndarray:
    """
    Safe square root that handles negative values.
    
    Parameters
    ----------
    x : np.ndarray
        Input array
    eps : float
        Small constant to add
        
    Returns
    -------
    np.ndarray
        sqrt(|x| + eps)
    """
    return np.sqrt(np.abs(x) + eps)


def safe_power(base: np.ndarray, exp: float, eps: float = EPS_LOG) -> np.ndarray:
    """
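    Safe power function that handles negative bases with non-integer exponents.

    The sign of the base is always reapplied to the result, so a negative base
    yields a negative value even for even integer exponents.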
    
    Parameters
    ----------
    base : np.ndarray
        Base values
    exp : float
        Exponent
    eps : float
        Small constant to add
        
    Returns
    -------
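    np.ndarray
        (|base| + eps) ** exp * s, where s = sign(base) and s = +1 if base == 0
    """
    abs_base = np.abs(base) + eps
    # Adding the boolean mask maps sign 0 to +1 without in-place assignment,
    # so scalar bases work as well as arrays.
    sign = np.sign(base) + (base == 0)
    return (abs_base ** exp) * sign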

print("Safe numerical operations defined.")
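
The effect of each guard is easiest to see on the edge cases it targets. The sketch below reproduces the formulas from the docstrings above on single scalar values using only the standard `math` module; it is a worked illustration under that simplification, not a replacement for the array versions.

```python
import math

EPS = 1e-10   # same value as EPS_LOG and EPS_DIV
CLIP = 20     # same value as EPS_EXP_CLIP

# safe_log at x = 0: log(|0| + eps) is finite, whereas log(0) is undefined.
log_at_zero = math.log(abs(0.0) + EPS)

# safe_exp at x = 1000: the argument is clipped to 20 first.
# math.exp(1000.0) would raise OverflowError.
exp_clipped = math.exp(max(-CLIP, min(1000.0, CLIP)))

# safe_divide with a zero denominator: zero is treated as having sign +1.
div_by_zero = 1.0 / (0.0 + EPS * 1.0)

# safe_power: the power is taken on |base| + eps and the sign is reapplied.
cube_root = math.copysign((abs(-8.0) + EPS) ** (1 / 3), -8.0)
signed_square = math.copysign((abs(-2.0) + EPS) ** 2, -2.0)

print(log_at_zero, exp_clipped, div_by_zero, cube_root, signed_square)
```

The last value shows a consequence worth keeping in mind: because the sign is always reapplied, squaring a negative base through `safe_power` gives a negative result, unlike ordinary exponentiation.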

In [None]:
# ==============================================================================
# METRICS AND EVALUATION FUNCTIONS
# ==============================================================================

def compute_r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Compute R-squared coefficient of determination.
    
    Parameters
    ----------
    y_true : np.ndarray
        True values
    y_pred : np.ndarray
        Predicted values
        
    Returns
    -------
    float
        R-squared value (0.0 if y_true is constant, i.e. the total sum of squares is zero)
    """
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot if ss_tot > 0 else 0.0


def compute_mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Compute Mean Squared Error.
    
    Parameters
    ----------
    y_true : np.ndarray
        True values
    y_pred : np.ndarray
        Predicted values
        
    Returns
    -------
    float
        MSE value
    """
    return np.mean((y_true - y_pred) ** 2)


def compute_rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Compute Root Mean Squared Error.
    
    Parameters
    ----------
    y_true : np.ndarray
        True values
    y_pred : np.ndarray
        Predicted values
        
    Returns
    -------
    float
        RMSE value
    """
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


def compute_mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Compute Mean Absolute Error.
    
    Parameters
    ----------
    y_true : np.ndarray
        True values
    y_pred : np.ndarray
        Predicted values
        
    Returns
    -------
    float
        MAE value
    """
    return np.mean(np.abs(y_true - y_pred))

print("Metrics and evaluation functions defined.")
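
As a quick sanity check on the four metrics, the following worked example applies the same formulas to a four-point data set in pure Python. The data values are made up for illustration. By hand, the residual sum of squares is 0.01 + 0.01 + 0.04 + 0.04 = 0.10 and the total sum of squares around the mean 2.5 is 5.0, so R-squared is 1 - 0.10 / 5.0 = 0.98.

```python
import math

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
n = len(y_true)
mean_true = sum(y_true) / n

ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares

r2 = 1 - ss_res / ss_tot if ss_tot > 0 else 0.0  # same guard as compute_r2
mse = ss_res / n
rmse = math.sqrt(mse)
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

print(f"R2={r2:.4f}  MSE={mse:.4f}  RMSE={rmse:.4f}  MAE={mae:.4f}")
```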

In [None]:
# ==============================================================================
# FORMATTING AND OUTPUT FUNCTIONS
# ==============================================================================

def format_equation(coefficients: np.ndarray, 
                   feature_names: List[str],
                   threshold: float = 1e-10,
                   precision: int = 4) -> str:
    """
    Format sparse coefficients as equation string.
    
    Parameters
    ----------
    coefficients : np.ndarray
        Coefficient vector
    feature_names : List[str]
        Names of features
    threshold : float
        Minimum coefficient magnitude to include
    precision : int
        Decimal precision for coefficients
        
    Returns
    -------
    str
        Formatted equation string
    """
    terms = []
    for coef, name in zip(coefficients, feature_names):
        if np.abs(coef) > threshold:
            if name == '1' or name == '[Poly] 1':
                terms.append(f"{coef:.{precision}f}")
            else:
                # Remove source tags for cleaner display
                clean_name = name
                for tag in ['[PySR] ', '[Var] ', '[Poly] ', '[Op] ']:
                    clean_name = clean_name.replace(tag, '')
                terms.append(f"{coef:.{precision}f} * {clean_name}")
    
    if not terms:
        return "0"
    
    # Join terms, rendering negative coefficients as subtraction
    equation = terms[0]
    for term in terms[1:]:
        if term.startswith('-'):
            equation += f" - {term[1:]}"
        else:
            equation += f" + {term}"
    
    return equation


def print_section_header(title: str, width: int = 70) -> None:
    """
    Print formatted section header.
    
    Parameters
    ----------
    title : str
        Section title
    width : int
        Total width of header line
    """
    print("=" * width)
    print(f" {title}")
    print("=" * width)


def print_subsection_header(title: str, width: int = 70) -> None:
    """
    Print formatted subsection header.
    
    Parameters
    ----------
    title : str
        Subsection title
    width : int
        Total width of header line
    """
    print("-" * width)
    print(f" {title}")
    print("-" * width)

print("Formatting and output functions defined.")

In [None]:
# ==============================================================================
# DATA PREPROCESSING FUNCTIONS
# ==============================================================================

def normalize_features(X: np.ndarray) -> Tuple[np.ndarray, StandardScaler]:
    """
    Normalize features using StandardScaler.
    
    Parameters
    ----------
    X : np.ndarray
        Feature matrix (n_samples, n_features)
        
    Returns
    -------
    X_normalized : np.ndarray
        Normalized feature matrix
    scaler : StandardScaler
        Fitted scaler for inverse transform
    """
    scaler = StandardScaler()
    X_normalized = scaler.fit_transform(X)
    return X_normalized, scaler


def check_data_validity(X: np.ndarray, y: np.ndarray) -> Dict[str, Any]:
    """
    Check data for common issues.
    
    Parameters
    ----------
    X : np.ndarray
        Feature matrix
    y : np.ndarray
        Target vector
        
    Returns
    -------
    Dict[str, Any]
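        Dictionary with validity checks:
        - is_valid: bool (False only if NaN or Inf values are present)
        - n_samples: int
        - n_features: int
        - has_nan: bool
        - has_inf: bool
        - constant_features: List[int] (column indices with std < 1e-10)
        - warnings: List[str] (also lists constant features and small samples)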
    """
    result = {
        'is_valid': True,
        'n_samples': X.shape[0],
        'n_features': X.shape[1],
        'has_nan': False,
        'has_inf': False,
        'constant_features': [],
        'warnings': []
    }
    
    # Check for NaN values
    if np.any(np.isnan(X)) or np.any(np.isnan(y)):
        result['has_nan'] = True
        result['is_valid'] = False
        result['warnings'].append("Data contains NaN values")
    
    # Check for Inf values
    if np.any(np.isinf(X)) or np.any(np.isinf(y)):
        result['has_inf'] = True
        result['is_valid'] = False
        result['warnings'].append("Data contains Inf values")
    
    # Check for constant features
    for j in range(X.shape[1]):
        if np.std(X[:, j]) < 1e-10:
            result['constant_features'].append(j)
            result['warnings'].append(f"Feature {j} is constant")
    
    # Check sample size
    if X.shape[0] < 50:
        result['warnings'].append(f"Small sample size ({X.shape[0]}), results may be unreliable")
    
    return result

print("Data preprocessing functions defined.")
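
The fitted scaler returned by `normalize_features` is needed whenever a model is fitted on standardized features but reported in original units. The sketch below illustrates the algebra with plain NumPy instead of `StandardScaler`, using mean removal and division by the population standard deviation. The synthetic data and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on very different scales
X = rng.uniform([0.0, 100.0], [1.0, 500.0], size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 0.01 * X[:, 1]     # noise-free linear target

# Standardize: z = (x - mean) / std, with population std (ddof=0)
mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma

# Fit y = b0 + Z @ b in standardized space
A = np.column_stack([np.ones(len(Z)), Z])
params = np.linalg.lstsq(A, y, rcond=None)[0]
b0, b = params[0], params[1:]

# Map the fit back to original units:
#   y = b0 + sum_j b_j * (x_j - mu_j) / sigma_j
coef_orig = b / sigma
intercept_orig = b0 - np.sum(b * mu / sigma)
```

Here `coef_orig` and `intercept_orig` recover the coefficients 2.0, -0.01 and 3.0 used to build the target, up to floating-point error.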

In [None]:
# ==============================================================================
# v4.1 MEMORY AND OPTIMIZATION UTILITIES
# ==============================================================================

def cleanup_memory() -> None:
    """Force garbage collection to free memory."""
    gc.collect()


def convert_to_float32(X: np.ndarray, y: np.ndarray = None) -> Tuple:
    """
    Convert arrays to Float32 for memory efficiency.
    
    Parameters
    ----------
    X : np.ndarray
        Feature matrix
    y : np.ndarray, optional
        Target vector
        
    Returns
    -------
    np.ndarray or Tuple[np.ndarray, np.ndarray]
        X as float32 if y is None, otherwise the pair (X, y) as float32
    """
    X_32 = np.asarray(X, dtype=np.float32)
    if y is not None:
        y_32 = np.asarray(y, dtype=np.float32)
        return X_32, y_32
    return X_32


def is_valid_feature(values: np.ndarray, 
                    existing_columns: List[np.ndarray] = None,
                    corr_threshold: float = 0.9999) -> bool:
    """
    Check if feature is valid (finite, non-constant, non-duplicate).
    
    Parameters
    ----------
    values : np.ndarray
        Feature values
    existing_columns : List[np.ndarray], optional
        Existing features for duplicate checking; only the 20 most
        recently added columns of matching length are compared
    corr_threshold : float
        Absolute Pearson correlation above which a feature is a duplicate
        
    Returns
    -------
    bool
        True if feature is valid
    """
    # Check finite
    if not np.all(np.isfinite(values)):
        return False
    
    # Check non-constant
    if np.std(values) < 1e-10:
        return False
    
    # Check non-duplicate
    if existing_columns is not None:
        for existing in existing_columns[-20:]:  # Only check recent columns
            if len(existing) == len(values):
                corr = np.corrcoef(values, existing)[0, 1]
                if np.abs(corr) > corr_threshold:
                    return False
    
    return True

print("v4.1 memory and optimization utilities defined.")
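
The non-constant check in `is_valid_feature` uses an absolute threshold (`std < 1e-10`), which interacts with the scale of the data. The sketch below builds a feature with the structure of the warm rain autoconversion term on the ranges used by the generator in Section 5 and compares the absolute test with a relative one. The relative variant is hypothetical and not part of the framework. Whether this matters in practice depends on whether features are rescaled before the check, which this section does not show. The last lines illustrate the memory effect of the Float32 conversion.

```python
import numpy as np

rng = np.random.default_rng(0)
q_c = rng.uniform(1e-4, 5e-3, 500)   # same ranges as the warm rain generator
N_d = rng.uniform(1e7, 5e8, 500)

# Candidate feature with the structure of the true autoconversion term
term = q_c ** 2.47 * N_d ** (-1.79)

abs_std = np.std(term)                            # tiny in absolute terms
rel_std = np.std(term) / np.abs(np.mean(term))    # large relative to the mean

flagged_constant = abs_std < 1e-10    # absolute test, as in is_valid_feature
flagged_relative = rel_std < 1e-10    # hypothetical scale-aware variant

# Float32 halves the memory footprint but keeps roughly 7 significant digits
X64 = np.column_stack([q_c, N_d])
X32 = np.asarray(X64, dtype=np.float32)
memory_ratio = X32.nbytes / X64.nbytes
```

On these ranges the term never exceeds roughly 1e-18, so the absolute test labels it constant even though it varies by orders of magnitude relative to its own mean.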

---
## Section 5: Test Data Generators

In [None]:
# ==============================================================================
# WARM RAIN DATA GENERATOR
# ==============================================================================

def generate_warm_rain_data(
    n_samples: int = 500,
    noise_level: float = 0.01,
    seed: int = RANDOM_SEED
) -> Tuple[np.ndarray, np.ndarray, List[str], UserInputs]:
    """
    Generate synthetic warm rain microphysics data.
    
    True equation: dq_r/dt = 0.89 * q_c^2.47 * N_d^(-1.79)
    (KK2000 autoconversion rate). r_eff and LWC are sampled independently
    and do not enter the true equation.
    
    Parameters
    ----------
    n_samples : int
        Number of data points to generate
    noise_level : float
        Standard deviation of the log-normal multiplicative noise
        (approximately the relative noise level when small)
    seed : int
        Random seed for reproducibility
        
    Returns
    -------
    X : np.ndarray
        Feature matrix (n_samples, 4)
    y : np.ndarray
        Target vector
    feature_names : List[str]
        List of feature names
    user_inputs : UserInputs
        Complete UserInputs with dimensional information
    """
    np.random.seed(seed)
    
    # Generate realistic ranges
    q_c = np.random.uniform(1e-4, 5e-3, n_samples)      # kg/kg
    N_d = np.random.uniform(1e7, 5e8, n_samples)        # m^-3
    r_eff = np.random.uniform(5e-6, 25e-6, n_samples)   # m
    LWC = np.random.uniform(0.1, 2.0, n_samples)        # kg/m^3
    
    # True autoconversion rate
    y_true = 0.89 * (q_c ** 2.47) * (N_d ** (-1.79))
    
    # Add multiplicative noise
    if noise_level > 0:
        noise = np.exp(np.random.normal(0, noise_level, n_samples))
        y = y_true * noise
    else:
        y = y_true.copy()
    
    # Construct feature matrix
    X = np.column_stack([q_c, N_d, r_eff, LWC])
    feature_names = ['q_c', 'N_d', 'r_eff', 'LWC']
    
    # Create UserInputs with dimensional information
    user_inputs = UserInputs(
        variable_dimensions={
            'q_c':   [0, 0, 0, 0],     # kg/kg (dimensionless)
            'N_d':   [0, -3, 0, 0],    # m^-3
            'r_eff': [0, 1, 0, 0],     # m
            'LWC':   [1, -3, 0, 0],    # kg/m^3
        },
        target_dimensions=[0, 0, -1, 0],  # s^-1 (rate)
        physical_bounds={
            'target': {'min': 0, 'max': None},
            'q_c': {'min': 0, 'max': 0.01},
            'N_d': {'min': 0, 'max': None},
            'r_eff': {'min': 0, 'max': None},
            'LWC': {'min': 0, 'max': None},
        },
        variable_mapping=None,
        unit_conversions=None
    )
    
    return X, y, feature_names, user_inputs

print("Warm rain data generator defined.")
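
Because the warm rain target is a power law with multiplicative log-normal noise, taking logarithms turns it into a linear model: log y = log(0.89) + 2.47 log(q_c) - 1.79 log(N_d) + eps. The sketch below regenerates comparable data inline (with NumPy's `default_rng` rather than the notebook's seeding, so the samples differ) and recovers the exponents by ordinary least squares. It is a plausibility check on the generator, not the framework's discovery procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
q_c = rng.uniform(1e-4, 5e-3, n)
N_d = rng.uniform(1e7, 5e8, n)

# Power law with multiplicative log-normal noise (log-std 0.01)
y = 0.89 * q_c**2.47 * N_d**(-1.79) * np.exp(rng.normal(0.0, 0.01, n))

# Linear regression in log space: columns are [1, log q_c, log N_d]
A = np.column_stack([np.ones(n), np.log(q_c), np.log(N_d)])
intercept, exp_qc, exp_Nd = np.linalg.lstsq(A, np.log(y), rcond=None)[0]

# The intercept estimates log(prefactor)
prefactor = np.exp(intercept)
```

The fitted exponents should land close to 2.47 and -1.79. The prefactor is estimated less tightly because the intercept is far from the centroid of the log-features.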

In [None]:
# ==============================================================================
# POLYNOMIAL DATA GENERATOR
# ==============================================================================

def generate_polynomial_data(
    n_samples: int = 500,
    noise_level: float = 0.01,
    seed: int = RANDOM_SEED
) -> Tuple[np.ndarray, np.ndarray, List[str], Dict[str, float]]:
    """
    Generate synthetic polynomial data for testing.
    
    True equation: y = 0.5*x1^2 + 0.3*x2*x3 - 0.1*x1*x2^2 + 0.8
    
    Parameters
    ----------
    n_samples : int
        Number of data points to generate
    noise_level : float
        Additive noise standard deviation as a fraction of std(y_true)
    seed : int
        Random seed for reproducibility
        
    Returns
    -------
    X : np.ndarray
        Feature matrix (n_samples, 5)
    y : np.ndarray
        Target vector
    feature_names : List[str]
        List of feature names
    true_coefficients : Dict[str, float]
        True coefficient values
    """
    np.random.seed(seed)
    
    # Generate base features
    x1 = np.random.uniform(-1, 1, n_samples)
    x2 = np.random.uniform(-1, 1, n_samples)
    x3 = np.random.uniform(-1, 1, n_samples)
    
    # Noise features (should not be selected)
    noise1 = np.random.uniform(-1, 1, n_samples)
    noise2 = np.random.uniform(-1, 1, n_samples)
    
    # True equation: y = 0.5*x1^2 + 0.3*x2*x3 - 0.1*x1*x2^2 + 0.8
    y_true = 0.5 * x1**2 + 0.3 * x2 * x3 - 0.1 * x1 * x2**2 + 0.8
    
    # Add noise
    if noise_level > 0:
        noise = np.random.normal(0, noise_level * np.std(y_true), n_samples)
        y = y_true + noise
    else:
        y = y_true.copy()
    
    # Construct feature matrix
    X = np.column_stack([x1, x2, x3, noise1, noise2])
    feature_names = ['x1', 'x2', 'x3', 'noise1', 'noise2']
    
    true_coefficients = {
        '1': 0.8,
        'x1^2': 0.5,
        'x2*x3': 0.3,
        'x1*x2^2': -0.1
    }
    
    return X, y, feature_names, true_coefficients

print("Polynomial data generator defined.")
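
The `true_coefficients` dictionary gives a direct target for validation: a linear fit over a library containing the true terms should reproduce those values and assign near-zero weight to distractors. The sketch below uses ordinary least squares on a small hand-built library as a stand-in. The library contents and the use of `default_rng` are illustrative assumptions, and the framework's own sparse selection is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1, x2, x3, noise1 = (rng.uniform(-1, 1, n) for _ in range(4))

y_true = 0.5 * x1**2 + 0.3 * x2 * x3 - 0.1 * x1 * x2**2 + 0.8
y = y_true + rng.normal(0.0, 0.01 * np.std(y_true), n)

# Candidate library: the four true terms plus two distractor columns
names = ['1', 'x1^2', 'x2*x3', 'x1*x2^2', 'noise1', 'x1*x3']
library = np.column_stack([
    np.ones(n), x1**2, x2 * x3, x1 * x2**2, noise1, x1 * x3
])

coefs = np.linalg.lstsq(library, y, rcond=None)[0]
recovered = dict(zip(names, np.round(coefs, 3)))
```

With the default noise level, the four true coefficients are recovered to within about 0.01 and the distractor weights stay near zero.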

In [None]:
# ==============================================================================
# TRIGONOMETRIC DATA GENERATOR
# ==============================================================================

def generate_trigonometric_data(
    n_samples: int = 500,
    noise_level: float = 0.01,
    seed: int = RANDOM_SEED
) -> Tuple[np.ndarray, np.ndarray, List[str], Dict[str, float]]:
    """
    Generate synthetic trigonometric data for testing.
    
    True equation: y = sin(x1) + 0.5*cos(x2) + 0.3*x1*x2
    
    Parameters
    ----------
    n_samples : int
        Number of data points to generate
    noise_level : float
        Additive noise standard deviation as a fraction of std(y_true)
    seed : int
        Random seed for reproducibility
        
    Returns
    -------
    X : np.ndarray
        Feature matrix (n_samples, 4)
    y : np.ndarray
        Target vector
    feature_names : List[str]
        List of feature names
    true_coefficients : Dict[str, float]
        True coefficient values
    """
    np.random.seed(seed)
    
    # Generate base features (in range suitable for trig functions)
    x1 = np.random.uniform(-np.pi, np.pi, n_samples)
    x2 = np.random.uniform(-np.pi, np.pi, n_samples)
    
    # Noise features (should not be selected)
    noise1 = np.random.uniform(-np.pi, np.pi, n_samples)
    noise2 = np.random.uniform(-np.pi, np.pi, n_samples)
    
    # True equation: y = sin(x1) + 0.5*cos(x2) + 0.3*x1*x2
    y_true = np.sin(x1) + 0.5 * np.cos(x2) + 0.3 * x1 * x2
    
    # Add noise
    if noise_level > 0:
        noise = np.random.normal(0, noise_level * np.std(y_true), n_samples)
        y = y_true + noise
    else:
        y = y_true.copy()
    
    # Construct feature matrix
    X = np.column_stack([x1, x2, noise1, noise2])
    feature_names = ['x1', 'x2', 'noise1', 'noise2']
    
    true_coefficients = {
        'sin(x1)': 1.0,
        'cos(x2)': 0.5,
        'x1*x2': 0.3
    }
    
    return X, y, feature_names, true_coefficients

print("Trigonometric data generator defined.")
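
Since the additive noise has standard deviation `noise_level * std(y_true)`, even the exact model cannot reach an R-squared of 1 on the noisy target. The expected ceiling is Var(y_true) / (Var(y_true) + sigma^2) = 1 / (1 + noise_level^2). The sketch below checks this numerically with a larger sample and a larger noise level than the defaults, both chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n, noise_level = 20000, 0.1
x1 = rng.uniform(-np.pi, np.pi, n)
x2 = rng.uniform(-np.pi, np.pi, n)

y_true = np.sin(x1) + 0.5 * np.cos(x2) + 0.3 * x1 * x2
y = y_true + rng.normal(0.0, noise_level * np.std(y_true), n)

# R^2 of the exact model evaluated against the noisy target
ss_res = np.sum((y - y_true) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2_exact_model = 1 - ss_res / ss_tot

# Theoretical ceiling implied by the noise definition
r2_ceiling = 1 / (1 + noise_level**2)
```

At the default `noise_level = 0.01` the ceiling is about 0.9999, so an R-squared noticeably below that points to a structural mismatch rather than noise.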

In [None]:
# ==============================================================================
# PENDULUM DATA GENERATOR (FOR BUCKINGHAM PI TESTING)
# ==============================================================================

def generate_pendulum_data(
    n_samples: int = 500,
    noise_level: float = 0.01,
    seed: int = RANDOM_SEED
) -> Tuple[np.ndarray, np.ndarray, List[str], UserInputs]:
    """
    Generate synthetic simple pendulum period data.
    
    True equation: T = 2*pi * sqrt(L/g)
    
    Variables:
    - L: pendulum length (m)
    - m: mass (kg) - does not appear in true equation
    - g: gravitational acceleration (m/s^2)
    - T: period (s)
    
    This is a classic dimensional analysis example where:
    T * sqrt(g/L) is the dimensionless group.
    
    Parameters
    ----------
    n_samples : int
        Number of data points to generate
    noise_level : float
        Standard deviation of the log-normal multiplicative noise
        (approximately the relative noise level when small)
    seed : int
        Random seed for reproducibility
        
    Returns
    -------
    X : np.ndarray
        Feature matrix (n_samples, 3) with L, m, g
    y : np.ndarray
        Target vector - period T
    feature_names : List[str]
        List of feature names
    user_inputs : UserInputs
        Complete UserInputs with dimensional information
    """
    np.random.seed(seed)
    
    # Generate physically realistic ranges
    L = np.random.uniform(0.1, 2.0, n_samples)      # Length (m)
    m = np.random.uniform(0.01, 1.0, n_samples)     # Mass (kg) - irrelevant
    g = np.random.uniform(9.7, 10.0, n_samples)     # Gravity (m/s^2)
    
    # True period: T = 2*pi * sqrt(L/g)
    T_true = 2 * np.pi * np.sqrt(L / g)
    
    # Add multiplicative noise
    if noise_level > 0:
        noise = np.exp(np.random.normal(0, noise_level, n_samples))
        T = T_true * noise
    else:
        T = T_true.copy()
    
    # Construct feature matrix
    X = np.column_stack([L, m, g])
    feature_names = ['L', 'm', 'g']
    
    # Create UserInputs with dimensional information
    user_inputs = UserInputs(
        variable_dimensions={
            'L': [0, 1, 0, 0],      # Length: m
            'm': [1, 0, 0, 0],      # Mass: kg
            'g': [0, 1, -2, 0],     # Acceleration: m/s^2
        },
        target_dimensions=[0, 0, 1, 0],  # Time: s
        physical_bounds={
            'target': {'min': 0, 'max': None},
            'L': {'min': 0, 'max': None},
            'm': {'min': 0, 'max': None},
            'g': {'min': 0, 'max': None},
        },
        variable_mapping=None,
        unit_conversions=None
    )
    
    return X, T, feature_names, user_inputs

print("Pendulum data generator defined.")
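
The dimensionless group quoted in the docstring, T * sqrt(g/L), can be derived from the dimension vectors alone: it spans the null space of the dimension matrix. The sketch below computes that null space with an SVD. The ordering of the four base-unit exponents (mass, length, time, plus one unit unused here) is inferred from the comments in `UserInputs` and should be treated as an assumption.

```python
import numpy as np

# One row per variable, holding its base-unit exponents; transposing gives
# the dimension matrix with base units as rows and variables as columns.
variables = ['L', 'm', 'g', 'T']
D = np.array([
    [0, 1, 0, 0],     # L: m
    [1, 0, 0, 0],     # m: kg
    [0, 1, -2, 0],    # g: m/s^2
    [0, 0, 1, 0],     # T: s
], dtype=float).T

# Exponent vectors a with D @ a = 0 define dimensionless products.
# They are the right singular vectors belonging to zero singular values.
_, s, vt = np.linalg.svd(D)
null_vectors = vt[s < 1e-10]

# Scale the single null vector so that T carries exponent 1
t_index = variables.index('T')
pi_exponents = null_vectors[0] / null_vectors[0][t_index]
```

The resulting exponents are -1/2 for `L`, 0 for `m`, 1/2 for `g`, and 1 for `T`, which is T * sqrt(g/L). The zero exponent on `m` is the dimensional reason mass cannot appear in the period.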

---
## Section 6: Module Summary

In [None]:
# ==============================================================================
# MODULE SUMMARY
# ==============================================================================

print("="*70)
print(" 00_Core.ipynb v4.1 - Module Summary")
print("="*70)
print()
print("DATACLASSES:")
print("  - UserInputs: User-defined inputs (dimensions, bounds, mappings)")
print("  - Stage1Results: Variable selection & preprocessing results")
print("  - Stage2Results: Structure-guided discovery results (v4.1 enhanced)")
print("  - Stage3Results: Validation & UQ results")
print()
print("TimeBudgetManager (NEW v4.1):")
print("  - Adaptive time allocation for computational optimization")
print("  - Methods: allocate_pysr_time(), allocate_bootstrap_count()")
print("  - Methods: should_skip_optional(), record_stage(), report()")
print()
print("UTILITY FUNCTIONS:")
print("  - safe_log, safe_exp, safe_divide, safe_sqrt, safe_power: Numerical safety")
print("  - compute_r2, compute_mse, compute_rmse, compute_mae: Metrics")
print("  - format_equation: Equation string formatting")
print("  - print_section_header, print_subsection_header: Output formatting")
print("  - normalize_features, check_data_validity: Data preprocessing")
print("  - cleanup_memory, convert_to_float32, is_valid_feature: v4.1 optimization")
print()
print("DATA GENERATORS:")
print("  - generate_warm_rain_data: dq_r/dt = 0.89 * q_c^2.47 * N_d^(-1.79)")
print("  - generate_polynomial_data: y = 0.5*x1^2 + 0.3*x2*x3 - 0.1*x1*x2^2 + 0.8")
print("  - generate_trigonometric_data: y = sin(x1) + 0.5*cos(x2) + 0.3*x1*x2")
print("  - generate_pendulum_data: T = 2*pi * sqrt(L/g)")
print()
print("CONFIGURATION CONSTANTS:")
print(f"  - RANDOM_SEED: {RANDOM_SEED}")
print(f"  - DEFAULT_N_BOOTSTRAP: {DEFAULT_N_BOOTSTRAP}")
print(f"  - DEFAULT_RUNTIME_BUDGET: {DEFAULT_RUNTIME_BUDGET}s")
print(f"  - PYSR_MODES: {list(PYSR_MODES.keys())}")
print(f"  - PYSR_AVAILABLE: {PYSR_AVAILABLE}")
print()
print("="*70)
print("Module loaded successfully. Import via: %run 00_Core.ipynb")
print("="*70)