# 00_Core - Physics-SR Framework v3.0

## Foundation Module: DataClasses, Utilities, and Test Data Generators

**Author:** Zhengze Zhang  
**Affiliation:** Department of Statistics, Columbia University  
**Date:** January 2026

---

### Purpose

This notebook provides the foundational components for the Three-Stage Physics-Informed Symbolic Regression Framework:

1. **DataClasses**: `UserInputs`, `Stage1Results`, `Stage2Results`, `Stage3Results`
2. **Utility Functions**: Safe math operations, formatting, metrics
3. **Test Data Generators**: Warm rain microphysics, polynomial, trigonometric
4. **Global Configuration Constants**

### Usage

All other notebooks load this module by executing it with the `%run` magic:
```python
%run 00_Core.ipynb
```

---
## Section 1: Imports and Constants

In [None]:
"""
00_Core.ipynb - Foundation Module
==================================

Three-Stage Physics-Informed Symbolic Regression Framework v3.0

This module provides:
- DataClasses for user inputs and stage results
- Utility functions for safe numerical operations
- Test data generators for algorithm validation
- Global configuration constants

Author: Zhengze Zhang
Affiliation: Department of Statistics, Columbia University
"""

# Standard library imports
import warnings
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Optional, Union, Any

# Scientific computing
import numpy as np
import pandas as pd
from scipy import stats
from scipy.special import comb

# Machine learning
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, r2_score

# Symbolic computation
import sympy as sp

print("00_Core: All imports successful.")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

In [None]:
# ==============================================================================
# GLOBAL CONFIGURATION CONSTANTS
# ==============================================================================

# Stage 1 defaults
DEFAULT_MAX_EXPONENT = 4              # Buckingham pi search range
DEFAULT_IMPORTANCE_THRESHOLD = 0.01  # Variable screening threshold
DEFAULT_POWERLAW_R2_THRESHOLD = 0.9  # Power-law detection R^2 threshold
DEFAULT_SOFTMAX_TEMPERATURE = 0.5    # iRF soft reweighting temperature
DEFAULT_STABILITY_THRESHOLD = 0.5    # Interaction stability threshold

# Stage 2 defaults
DEFAULT_MAX_POLY_DEGREE = 3          # Feature library polynomial degree
DEFAULT_STLSQ_THRESHOLD = 0.1        # STLSQ sparsity threshold
DEFAULT_STLSQ_MAX_ITER = 20          # STLSQ maximum iterations
DEFAULT_ALASSO_GAMMA = 1.0           # Adaptive Lasso gamma parameter
DEFAULT_ALASSO_EPS = 1e-6            # Adaptive Lasso stabilization constant

# Stage 3 defaults
DEFAULT_CV_FOLDS = 5                 # Cross-validation folds
DEFAULT_EBIC_GAMMA = 0.5             # EBIC gamma parameter
DEFAULT_N_BOOTSTRAP = 200            # Number of bootstrap samples
DEFAULT_CONFIDENCE_LEVEL = 0.95      # Confidence interval level
DEFAULT_DIM_TOLERANCE = 0.05         # Dimensional check tolerance

# Random seed for reproducibility
RANDOM_SEED = 42

# Numerical stability constants
EPS_LOG = 1e-10                      # Epsilon for log safety
EPS_DIV = 1e-10                      # Epsilon for division safety

print("Global configuration constants defined.")

---
## Section 2: DataClasses

In [None]:
# ==============================================================================
# USER INPUTS DATACLASS
# ==============================================================================

@dataclass
class UserInputs:
    """
    User-defined inputs required for the Physics-SR Framework.
    
    These inputs must be prepared before running the pipeline and require
    domain knowledge about the physical system being modeled.
    
    Attributes
    ----------
    variable_dimensions : Dict[str, List[float]]
        Dictionary mapping variable names to their dimensional exponents [M, L, T, Theta].
        M = Mass, L = Length, T = Time, Theta = Temperature.
        Example: {'velocity': [0, 1, -1, 0]}  # m/s has L^1 * T^-1
        
    target_dimensions : List[float]
        Dimensional exponents [M, L, T, Theta] for the target variable.
        Example: [0, 0, -1, 0] for a rate with units s^-1
        
    physical_bounds : Dict[str, Dict[str, Optional[float]]]
        Physical constraints for variables and target.
        Format: {var_name: {'min': float or None, 'max': float or None}}
        Example: {'target': {'min': 0, 'max': None}}  # Non-negative target
        
    variable_mapping : Optional[Dict[str, str]]
        Maps data column names to standardized physical variable names.
        Example: {'cloud_water_mixing_ratio': 'q_c'}
        
    unit_conversions : Optional[Dict[str, float]]
        Conversion factors to convert data to SI units.
        Example: {'radius_um': 1e-6}  # Convert micrometers to meters
    """
    
    # Required fields
    variable_dimensions: Dict[str, List[float]]
    target_dimensions: List[float]
    physical_bounds: Dict[str, Dict[str, Optional[float]]]
    
    # Optional fields with defaults
    variable_mapping: Optional[Dict[str, str]] = None
    unit_conversions: Optional[Dict[str, float]] = None
    
    def __post_init__(self):
        """Validate inputs after initialization."""
        # Validate dimensional exponents have length 4
        for var_name, dims in self.variable_dimensions.items():
            if len(dims) != 4:
                raise ValueError(
                    f"Variable '{var_name}' has {len(dims)} dimensional exponents, "
                    f"expected 4 [M, L, T, Theta]"
                )
        
        if len(self.target_dimensions) != 4:
            raise ValueError(
                f"Target dimensions has {len(self.target_dimensions)} exponents, "
                f"expected 4 [M, L, T, Theta]"
            )
    
    def get_variable_names(self) -> List[str]:
        """Return list of variable names."""
        return list(self.variable_dimensions.keys())
    
    def get_dimension_matrix(self) -> np.ndarray:
        """Return dimensional matrix D where D[i,j] = exponent of dimension i for variable j."""
        var_names = self.get_variable_names()
        n_vars = len(var_names)
        D = np.zeros((4, n_vars))
        for j, var_name in enumerate(var_names):
            D[:, j] = self.variable_dimensions[var_name]
        return D

print("UserInputs dataclass defined.")
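
The dimension matrix returned by `get_dimension_matrix` is what a Buckingham pi analysis operates on: the number of independent dimensionless groups is the number of variables minus the rank of `D`. The sketch below illustrates this with plain NumPy. The pendulum variables and the candidate exponent vector are assumptions chosen for illustration and are not taken from the framework.

```python
import numpy as np

# Hypothetical variables for a simple pendulum; exponents are [M, L, T, Theta]
variable_dimensions = {
    "period":  [0, 0, 1, 0],   # s
    "length":  [0, 1, 0, 0],   # m
    "gravity": [0, 1, -2, 0],  # m s^-2
    "mass":    [1, 0, 0, 0],   # kg
}

names = list(variable_dimensions)
# Same layout as get_dimension_matrix: one row per base dimension, one column per variable
D = np.column_stack([variable_dimensions[name] for name in names]).astype(float)

rank = np.linalg.matrix_rank(D)
n_pi_groups = len(names) - rank  # Buckingham pi theorem: n_variables - rank(D)

# Candidate exponent vector for period^2 * gravity / length
candidate = np.array([2.0, -1.0, 1.0, 0.0])
is_dimensionless = np.allclose(D @ candidate, 0.0)

print(f"rank(D) = {rank}, pi-groups = {n_pi_groups}, dimensionless = {is_dimensionless}")
```

An exponent vector describes a dimensionless group exactly when it lies in the null space of `D`, which is the condition checked on the last lines.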

In [None]:
# ==============================================================================
# STAGE 1 RESULTS DATACLASS
# ==============================================================================

@dataclass
class Stage1Results:
    """
    Results from Stage 1: Variable Selection & Preprocessing.
    
    Contains outputs from:
    - 1.1 Buckingham Pi Dimensional Analysis
    - 1.2 PAN+SR Variable Screening
    - 1.3 Symmetry Analysis
    - 1.4 iRF Interaction Discovery
    
    Attributes
    ----------
    pi_groups : Optional[Dict[str, np.ndarray]]
        Dictionary mapping pi-group names to their exponent vectors
    pi_exponents : Optional[np.ndarray]
        Matrix of exponents for selected pi-groups (n_groups x n_variables)
    pi_group_names : Optional[List[str]]
        Human-readable names for each pi-group
    X_transformed : Optional[np.ndarray]
        Data transformed to dimensionless pi-groups
    selected_indices : Optional[List[int]]
        Indices of variables selected by screening
    selected_names : Optional[List[str]]
        Names of selected variables
    importance_scores : Optional[Dict[str, float]]
        RF permutation importance for each variable
    is_power_law : bool
        Whether power-law relationship was detected
    estimated_exponents : Optional[Dict[str, float]]
        Estimated power-law exponents for each variable
    power_law_r2 : Optional[float]
        R-squared of log-log regression
    structural_hints : Optional[Dict]
        Hints about equation structure for Stage 2
    stable_interactions : Optional[List[Tuple[int, ...]]]
        List of stable feature interactions (as tuples of indices)
    interaction_stability : Optional[Dict[Tuple, float]]
        Stability scores for each interaction
    soft_weights : Optional[np.ndarray]
        Soft reweighting weights from iRF
    """
    
    # Buckingham Pi results
    pi_groups: Optional[Dict[str, np.ndarray]] = None
    pi_exponents: Optional[np.ndarray] = None
    pi_group_names: Optional[List[str]] = None
    X_transformed: Optional[np.ndarray] = None
    
    # Variable screening results
    selected_indices: Optional[List[int]] = None
    selected_names: Optional[List[str]] = None
    importance_scores: Optional[Dict[str, float]] = None
    
    # Symmetry analysis results
    is_power_law: bool = False
    estimated_exponents: Optional[Dict[str, float]] = None
    power_law_r2: Optional[float] = None
    structural_hints: Optional[Dict] = None
    
    # Interaction discovery results
    stable_interactions: Optional[List[Tuple[int, ...]]] = None
    interaction_stability: Optional[Dict[Tuple, float]] = None
    soft_weights: Optional[np.ndarray] = None

print("Stage1Results dataclass defined.")

In [None]:
# ==============================================================================
# STAGE 2 RESULTS DATACLASS
# ==============================================================================

@dataclass
class Stage2Results:
    """
    Results from Stage 2: Structure Discovery.
    
    Contains outputs from:
    - 2.1 Feature Library Construction
    - 2.2a PySR Genetic Programming
    - 2.2b E-WSINDy with STLSQ
    - 2.2c Adaptive Lasso
    - 2.3 Structure Parsing
    
    Attributes
    ----------
    feature_library : Optional[np.ndarray]
        Feature library matrix Phi (n_samples x n_features)
    feature_names : Optional[List[str]]
        Names of features in the library
    scaler : Optional[StandardScaler]
        Fitted scaler for feature normalization
    pysr_equations : Optional[List[str]]
        List of equations discovered by PySR
    pysr_pareto : Optional[pd.DataFrame]
        Pareto front of complexity vs accuracy
    best_pysr_equation : Optional[str]
        Best equation from PySR
    best_pysr_sympy : Optional[sp.Expr]
        Best PySR equation as SymPy expression
    refined_features : Optional[np.ndarray]
        Features extracted from PySR equation structure
    refined_feature_names : Optional[List[str]]
        Names of refined features
    stlsq_coefficients : Optional[np.ndarray]
        Coefficients from STLSQ
    stlsq_support : Optional[np.ndarray]
        Boolean mask of selected features from STLSQ
    stlsq_equation : Optional[str]
        Equation string from STLSQ
    weak_form_Q : Optional[np.ndarray]
        Weak-form feature matrix
    weak_form_b : Optional[np.ndarray]
        Weak-form target vector
    alasso_coefficients : Optional[np.ndarray]
        Coefficients from Adaptive Lasso
    alasso_support : Optional[np.ndarray]
        Boolean mask of selected features from Adaptive Lasso
    alasso_equation : Optional[str]
        Equation string from Adaptive Lasso
    alasso_lambda : Optional[float]
        Optimal lambda from Adaptive Lasso
    """
    
    # Feature library
    feature_library: Optional[np.ndarray] = None
    feature_names: Optional[List[str]] = None
    scaler: Optional[Any] = None  # StandardScaler, but using Any for flexibility
    
    # PySR results
    pysr_equations: Optional[List[str]] = None
    pysr_pareto: Optional[pd.DataFrame] = None
    best_pysr_equation: Optional[str] = None
    best_pysr_sympy: Optional[Any] = None  # sp.Expr
    refined_features: Optional[np.ndarray] = None
    refined_feature_names: Optional[List[str]] = None
    
    # E-WSINDy/STLSQ results
    stlsq_coefficients: Optional[np.ndarray] = None
    stlsq_support: Optional[np.ndarray] = None
    stlsq_equation: Optional[str] = None
    weak_form_Q: Optional[np.ndarray] = None
    weak_form_b: Optional[np.ndarray] = None
    
    # Adaptive Lasso results
    alasso_coefficients: Optional[np.ndarray] = None
    alasso_support: Optional[np.ndarray] = None
    alasso_equation: Optional[str] = None
    alasso_lambda: Optional[float] = None

print("Stage2Results dataclass defined.")

In [None]:
# ==============================================================================
# STAGE 3 RESULTS DATACLASS
# ==============================================================================

@dataclass
class Stage3Results:
    """
    Results from Stage 3: Validation & Uncertainty Quantification.
    
    Contains outputs from:
    - 3.1 Model Selection (CV + EBIC)
    - 3.2 Physics Verification (Dimensional + Bounds)
    - 3.3 Three-Layer UQ (Structural, Parametric, Predictive)
    - 3.4 Statistical Inference
    
    Attributes
    ----------
    cv_scores : Optional[Dict[str, Tuple[float, float]]]
        Cross-validation scores {model_name: (mean, std)}
    ebic_scores : Optional[Dict[str, float]]
        EBIC scores for each model
    best_model : Optional[str]
        Name of selected best model
    dim_consistent : bool
        Whether equation is dimensionally consistent
    dim_details : Optional[Dict]
        Detailed dimensional analysis per term
    bounds_violations : Optional[Dict]
        Physical bounds violation statistics
    physics_score : Optional[float]
        Overall physics verification score
    inclusion_probabilities : Optional[np.ndarray]
        Bootstrap inclusion probability for each feature
    structural_confidence : Optional[Dict[str, str]]
        Confidence classification per feature (HIGH/MEDIUM/LOW)
    coefficient_estimates : Optional[np.ndarray]
        Point estimates for coefficients
    coefficient_CI : Optional[np.ndarray]
        95% confidence intervals (n_features x 2)
    coefficient_SE : Optional[np.ndarray]
        Standard errors for coefficients
    prediction_intervals : Optional[Tuple[np.ndarray, np.ndarray]]
        Prediction intervals (lower, upper)
    pi_coverage : Optional[float]
        Empirical coverage of prediction intervals
    p_values : Optional[Dict[str, float]]
        P-values from hypothesis tests
    significant_terms : Optional[List[str]]
        Terms that are statistically significant
    final_equation : Optional[str]
        Final equation string
    final_coefficients : Optional[Dict[str, float]]
        Final coefficient values by term name
    """
    
    # Model selection
    cv_scores: Optional[Dict[str, Tuple[float, float]]] = None
    ebic_scores: Optional[Dict[str, float]] = None
    best_model: Optional[str] = None
    
    # Physics verification
    dim_consistent: bool = False
    dim_details: Optional[Dict] = None
    bounds_violations: Optional[Dict] = None
    physics_score: Optional[float] = None
    
    # Structural UQ (Layer 1)
    inclusion_probabilities: Optional[np.ndarray] = None
    structural_confidence: Optional[Dict[str, str]] = None
    
    # Parametric UQ (Layer 2)
    coefficient_estimates: Optional[np.ndarray] = None
    coefficient_CI: Optional[np.ndarray] = None
    coefficient_SE: Optional[np.ndarray] = None
    
    # Predictive UQ (Layer 3)
    prediction_intervals: Optional[Tuple[np.ndarray, np.ndarray]] = None
    pi_coverage: Optional[float] = None
    
    # Hypothesis testing
    p_values: Optional[Dict[str, float]] = None
    significant_terms: Optional[List[str]] = None
    
    # Final equation
    final_equation: Optional[str] = None
    final_coefficients: Optional[Dict[str, float]] = None

print("Stage3Results dataclass defined.")

---
## Section 3: Utility Functions

In [None]:
# ==============================================================================
# NUMERICAL SAFETY FUNCTIONS
# ==============================================================================

def safe_log(x: np.ndarray, eps: float = EPS_LOG) -> np.ndarray:
    """
    Compute logarithm with numerical safety.
    
    Prevents log(0) and log of negative values by taking |x| and clamping
    it from below at eps.
    
    Parameters
    ----------
    x : np.ndarray
        Input array (expected positive; negative entries are replaced by |x|)
    eps : float
        Lower bound applied to |x| before taking the log
        
    Returns
    -------
    np.ndarray
        log(max(|x|, eps))
    """
    return np.log(np.maximum(np.abs(x), eps))


def safe_divide(a: np.ndarray, b: np.ndarray, eps: float = EPS_DIV) -> np.ndarray:
    """
    Compute division with numerical safety.
    
    Prevents division by zero by ensuring the denominator's magnitude is at least eps.
    
    Parameters
    ----------
    a : np.ndarray
        Numerator
    b : np.ndarray
        Denominator
    eps : float
        Small constant to prevent division by zero
        
    Returns
    -------
    np.ndarray
        a / (sign(b) * max(|b|, eps)), with b == 0 treated as positive
    """
    # Clamp |b| from below while keeping its sign; np.where also handles scalar input
    b = np.asarray(b, dtype=float)
    sign_b = np.where(b < 0, -1.0, 1.0)
    safe_b = np.maximum(np.abs(b), eps) * sign_b
    return a / safe_b


def safe_power(base: np.ndarray, exponent: float, eps: float = EPS_LOG) -> np.ndarray:
    """
    Compute power with numerical safety for non-integer exponents.
    
    For non-integer exponents, ensures base is positive.
    
    Parameters
    ----------
    base : np.ndarray
        Base values
    exponent : float
        Exponent
    eps : float
        Small constant for numerical safety
        
    Returns
    -------
    np.ndarray
        base ** exponent, with numerical safety
    """
    if float(exponent).is_integer():
        # Integer-valued exponent: defined for negative bases, so compute directly
        # (a zero base with a negative exponent still yields inf)
        return np.power(base, exponent)
    else:
        # Non-integer exponent: ensure positive base
        safe_base = np.maximum(np.abs(base), eps)
        return np.power(safe_base, exponent)

print("Numerical safety functions defined.")
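
To make the clamping behaviour concrete, the following sketch reproduces the two core ideas (a floor on `|x|` before the log, and a sign-preserving floor on the denominator) in standalone form. The helper names `clamped_log` and `sign_preserving_divide` are hypothetical stand-ins written for this example, not part of the module.

```python
import numpy as np

EPS = 1e-10  # same value as EPS_LOG and EPS_DIV above


def clamped_log(x, eps=EPS):
    """log(max(|x|, eps)); illustrative stand-in for safe_log."""
    return np.log(np.maximum(np.abs(x), eps))


def sign_preserving_divide(a, b, eps=EPS):
    """a / (sign(b) * max(|b|, eps)), treating b == 0 as positive."""
    b = np.asarray(b, dtype=float)
    sign_b = np.where(b < 0, -1.0, 1.0)
    return a / (sign_b * np.maximum(np.abs(b), eps))


# Zero and tiny inputs are clamped to eps; the negative input is replaced by |x|
x = np.array([0.0, 1e-20, 1.0, -np.e])
log_values = clamped_log(x)

# Zero and near-zero denominators are pushed to magnitude eps, keeping their sign
denominators = np.array([0.0, -1e-20, 2.0])
quotients = sign_preserving_divide(1.0, denominators)

print(log_values)
print(quotients)
```

Note that clamping silently maps negative inputs to `log(|x|)`, so callers that need to detect sign errors should validate their data before calling these helpers.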

In [None]:
# ==============================================================================
# METRIC FUNCTIONS
# ==============================================================================

def compute_r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Compute coefficient of determination (R-squared).
    
    R^2 = 1 - SS_res / SS_tot
    
    Parameters
    ----------
    y_true : np.ndarray
        True target values
    y_pred : np.ndarray
        Predicted values
        
    Returns
    -------
    float
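        R-squared value. Returns 0.0 when SS_tot < EPS_DIV; this is an
        absolute threshold, so it also triggers for targets of very small
        magnitude even when they vary.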
    """
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    if ss_tot < EPS_DIV:
        return 0.0
    return 1.0 - ss_res / ss_tot


def compute_mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Compute Mean Squared Error.
    
    MSE = (1/n) * sum((y_true - y_pred)^2)
    
    Parameters
    ----------
    y_true : np.ndarray
        True target values
    y_pred : np.ndarray
        Predicted values
        
    Returns
    -------
    float
        Mean squared error
    """
    return np.mean((y_true - y_pred) ** 2)


def compute_rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Compute Root Mean Squared Error.
    
    Parameters
    ----------
    y_true : np.ndarray
        True target values
    y_pred : np.ndarray
        Predicted values
        
    Returns
    -------
    float
        Root mean squared error
    """
    return np.sqrt(compute_mse(y_true, y_pred))


def compute_mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """
    Compute Mean Absolute Error.
    
    Parameters
    ----------
    y_true : np.ndarray
        True target values
    y_pred : np.ndarray
        Predicted values
        
    Returns
    -------
    float
        Mean absolute error
    """
    return np.mean(np.abs(y_true - y_pred))

print("Metric functions defined.")
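
Because the guard in `compute_r2` compares `SS_tot` against an absolute constant, the result depends on the scale of the target. The sketch below re-implements the same logic under the hypothetical name `r2_with_absolute_guard` and evaluates it on a small hand-checkable example, on the same data scaled by `1e-22` (an assumed factor, chosen to resemble the magnitude of the warm rain targets), and in log space.

```python
import numpy as np


def r2_with_absolute_guard(y_true, y_pred, guard=1e-10):
    """Re-implementation of the compute_r2 logic, for illustration only."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    if ss_tot < guard:
        return 0.0
    return 1.0 - ss_res / ss_tot


y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# Unit scale: SS_res = 0.10 and SS_tot = 5.0, so R^2 = 1 - 0.10 / 5.0
r2_unit_scale = r2_with_absolute_guard(y_true, y_pred)

# Same data scaled down: SS_tot is about 5e-44, below the guard, so 0.0 is returned
scale = 1e-22
r2_tiny_scale = r2_with_absolute_guard(scale * y_true, scale * y_pred)

# In log space the scale becomes an additive constant and no longer matters
r2_log_space = r2_with_absolute_guard(np.log(scale * y_true), np.log(scale * y_pred))

print(r2_unit_scale, r2_tiny_scale, r2_log_space)
```

For targets with very small magnitudes, evaluating metrics on rescaled or log-transformed values is one way to avoid the guard; whether the downstream stages do this is not shown in this notebook.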

In [None]:
# ==============================================================================
# FORMATTING FUNCTIONS
# ==============================================================================

def format_equation(
    coefficients: np.ndarray,
    feature_names: List[str],
    threshold: float = 1e-6,
    precision: int = 4
) -> str:
    """
    Format coefficients and feature names into an equation string.
    
    Parameters
    ----------
    coefficients : np.ndarray
        Coefficient values
    feature_names : List[str]
        Names of features
    threshold : float
        Coefficients whose absolute value does not exceed this threshold are omitted
    precision : int
        Number of decimal places for coefficients
        
    Returns
    -------
    str
        Formatted equation string
    """
    terms = []
    
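    for coef, name in zip(coefficients, feature_names):
        if not (np.abs(coef) > threshold):
            continue  # Skip negligible (and NaN) coefficients
        magnitude = f"{abs(coef):.{precision}f}"
        # Intercept terms are printed without a feature name
        body = magnitude if name in ('1', 'intercept') else f"{magnitude}*{name}"
        if not terms:
            # First term: attach a minus sign directly, no leading plus
            terms.append(f"-{body}" if coef < 0 else body)
        else:
            # Later terms: always emit an explicit sign so terms stay separated
            terms.append(f"- {body}" if coef < 0 else f"+ {body}")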
    
    if len(terms) == 0:
        return "0"
    
    equation = " ".join(terms)
    # Clean up leading +
    if equation.startswith("+ "):
        equation = equation[2:]
    
    return equation


def print_section_header(title: str, width: int = 70, char: str = "=") -> None:
    """
    Print a formatted section header.
    
    Parameters
    ----------
    title : str
        Section title
    width : int
        Total width of header
    char : str
        Character to use for border
    """
    print(char * width)
    print(f" {title}")
    print(char * width)


def print_subsection_header(title: str, width: int = 70, char: str = "-") -> None:
    """
    Print a formatted subsection header.
    
    Parameters
    ----------
    title : str
        Subsection title
    width : int
        Total width of header
    char : str
        Character to use for border
    """
    print(char * width)
    print(f" {title}")
    print(char * width)

print("Formatting functions defined.")

In [None]:
# ==============================================================================
# DATA PREPROCESSING FUNCTIONS
# ==============================================================================

def normalize_features(
    X: np.ndarray,
    scaler: Optional[StandardScaler] = None
) -> Tuple[np.ndarray, StandardScaler]:
    """
    Normalize features to zero mean and unit variance.
    
    Parameters
    ----------
    X : np.ndarray
        Feature matrix (n_samples, n_features)
    scaler : Optional[StandardScaler]
        Pre-fitted scaler. If None, a new scaler is fitted.
        
    Returns
    -------
    Tuple[np.ndarray, StandardScaler]
        Normalized features and fitted scaler
    """
    if scaler is None:
        scaler = StandardScaler()
        X_normalized = scaler.fit_transform(X)
    else:
        X_normalized = scaler.transform(X)
    
    return X_normalized, scaler


def check_data_validity(X: np.ndarray, y: np.ndarray) -> Dict[str, Any]:
    """
    Check data for common issues.
    
    Parameters
    ----------
    X : np.ndarray
        Feature matrix
    y : np.ndarray
        Target vector
        
    Returns
    -------
    Dict[str, Any]
        Dictionary with validity checks:
        - 'valid': bool, overall validity
        - 'n_samples': int
        - 'n_features': int
        - 'has_nan': bool
        - 'has_inf': bool
        - 'n_nan': int
        - 'n_inf': int
        - 'warnings': List[str]
    """
    result = {
        'valid': True,
        'n_samples': X.shape[0],
        'n_features': X.shape[1],
        'has_nan': False,
        'has_inf': False,
        'n_nan': 0,
        'n_inf': 0,
        'warnings': []
    }
    
    # Check X
    n_nan_X = np.sum(np.isnan(X))
    n_inf_X = np.sum(np.isinf(X))
    
    # Check y
    n_nan_y = np.sum(np.isnan(y))
    n_inf_y = np.sum(np.isinf(y))
    
    result['n_nan'] = n_nan_X + n_nan_y
    result['n_inf'] = n_inf_X + n_inf_y
    
    if result['n_nan'] > 0:
        result['has_nan'] = True
        result['valid'] = False
        result['warnings'].append(f"Data contains {result['n_nan']} NaN values")
    
    if result['n_inf'] > 0:
        result['has_inf'] = True
        result['valid'] = False
        result['warnings'].append(f"Data contains {result['n_inf']} Inf values")
    
    # Check sample size
    if result['n_samples'] < 10:
        result['warnings'].append(f"Very small sample size: {result['n_samples']}")
    
    # Check feature count
    if result['n_features'] > result['n_samples']:
        result['warnings'].append(
            f"More features ({result['n_features']}) than samples ({result['n_samples']})"
        )
    
    return result

print("Data preprocessing functions defined.")
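
The optional `scaler` argument of `normalize_features` exists so that statistics fitted on one split can be reused on another. The sketch below shows the same fit-then-transform pattern with plain NumPy instead of `StandardScaler`, using the population standard deviation (`ddof=0`) as `StandardScaler` does. The synthetic train and test arrays are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# Fit step: statistics come from the training split only
mean_ = X_train.mean(axis=0)
scale_ = X_train.std(axis=0)  # ddof=0, population standard deviation

# Transform step: the same statistics are applied to both splits
X_train_norm = (X_train - mean_) / scale_
X_test_norm = (X_test - mean_) / scale_

print("train mean:", X_train_norm.mean(axis=0).round(3))
print("test mean: ", X_test_norm.mean(axis=0).round(3))
```

The training split ends up with exactly zero mean and unit variance, while the test split is only approximately standardized; refitting on the test split would hide that difference and leak test statistics into the model.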

---
## Section 4: Test Data Generators

In [None]:
# ==============================================================================
# WARM RAIN MICROPHYSICS DATA GENERATOR
# ==============================================================================

def generate_warm_rain_data(
    n_samples: int = 1000,
    noise_level: float = 0.01,
    seed: int = RANDOM_SEED,
    include_irrelevant: bool = True
) -> Tuple[np.ndarray, np.ndarray, List[str], UserInputs]:
    """
    Generate synthetic warm rain microphysics data.
    
    True equation: dq_r/dt = 0.89 * q_c^2.47 * N_d^(-1.79)
    
    This is a Khairoutdinov-Kogan-type power-law autoconversion parameterization where:
    - q_c: cloud water mixing ratio (kg/kg) - dimensionless
    - N_d: droplet number concentration (m^-3)
    - dq_r/dt: rain water autoconversion rate (s^-1)
    
    Additional variables (r_eff, LWC) are included but do not appear in
    the true equation, serving as irrelevant features for testing.
    
    Parameters
    ----------
    n_samples : int
        Number of data points to generate
    noise_level : float
        Standard deviation of the multiplicative log-normal noise in log space
        (0.01 corresponds to roughly 1% relative noise)
    seed : int
        Random seed for reproducibility
    include_irrelevant : bool
        Whether to include irrelevant features (r_eff, LWC)
        
    Returns
    -------
    X : np.ndarray
        Feature matrix (n_samples, n_features)
    y : np.ndarray
        Target vector (n_samples,) - autoconversion rate
    feature_names : List[str]
        List of feature names
    user_inputs : UserInputs
        Complete UserInputs with dimensional information
    """
    np.random.seed(seed)
    
    # True equation coefficients
    C = 0.89       # Leading coefficient
    alpha = 2.47   # Exponent for q_c
    beta = -1.79   # Exponent for N_d
    
    # Generate physically realistic ranges
    # q_c: cloud water mixing ratio (0.1 - 5 g/kg = 1e-4 to 5e-3 kg/kg)
    q_c = np.random.uniform(1e-4, 5e-3, n_samples)
    
    # N_d: droplet number concentration (10 - 500 cm^-3 = 1e7 to 5e8 m^-3)
    N_d = np.random.uniform(1e7, 5e8, n_samples)
    
    # Additional variables (not in true equation)
    # r_eff: effective radius (5 - 25 um = 5e-6 to 25e-6 m)
    r_eff = np.random.uniform(5e-6, 25e-6, n_samples)
    
    # LWC: liquid water content, sampled as 0.1 - 2.0 in g/m^3
    # (equal to 1e-4 - 2e-3 kg/m^3; the values are not converted to SI here)
    LWC = np.random.uniform(0.1, 2.0, n_samples)
    
    # True autoconversion rate (Khairoutdinov-Kogan-type power law)
    y_true = C * (q_c ** alpha) * (N_d ** beta)
    
    # Add multiplicative log-normal noise
    if noise_level > 0:
        noise = np.exp(np.random.normal(0, noise_level, n_samples))
        y = y_true * noise
    else:
        y = y_true.copy()
    
    # Ensure physical constraints (non-negative)
    y = np.maximum(y, 1e-30)
    
    # Construct feature matrix
    if include_irrelevant:
        X = np.column_stack([q_c, N_d, r_eff, LWC])
        feature_names = ['q_c', 'N_d', 'r_eff', 'LWC']
        variable_dimensions = {
            'q_c':   [0, 0, 0, 0],     # kg/kg (dimensionless)
            'N_d':   [0, -3, 0, 0],    # m^-3
            'r_eff': [0, 1, 0, 0],     # m
            'LWC':   [1, -3, 0, 0],    # kg/m^3
        }
    else:
        X = np.column_stack([q_c, N_d])
        feature_names = ['q_c', 'N_d']
        variable_dimensions = {
            'q_c':   [0, 0, 0, 0],     # kg/kg (dimensionless)
            'N_d':   [0, -3, 0, 0],    # m^-3
        }
    
    # Create UserInputs with complete dimensional information
    user_inputs = UserInputs(
        variable_dimensions=variable_dimensions,
        target_dimensions=[0, 0, -1, 0],  # s^-1 (rate)
        physical_bounds={
            'target': {'min': 0, 'max': None},
            'q_c': {'min': 0, 'max': 0.1},
            'N_d': {'min': 0, 'max': 1e10},
            'r_eff': {'min': 0, 'max': 1e-3},
            'LWC': {'min': 0, 'max': 10},
        },
        variable_mapping=None,
        unit_conversions=None
    )
    
    return X, y, feature_names, user_inputs


def get_warm_rain_ground_truth() -> Dict[str, Any]:
    """
    Return ground truth information for warm rain data.
    
    Returns
    -------
    Dict[str, Any]
        Ground truth information including:
        - equation: str, equation string
        - coefficients: dict of coefficient values
        - active_features: list of features in true equation
    """
    return {
        'equation': 'dq_r/dt = 0.89 * q_c^2.47 * N_d^(-1.79)',
        'coefficients': {
            'C': 0.89,
            'alpha_q_c': 2.47,
            'beta_N_d': -1.79
        },
        'active_features': ['q_c', 'N_d'],
        'inactive_features': ['r_eff', 'LWC'],
        'equation_type': 'power_law'
    }

print("Warm rain data generator defined.")
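
Since the ground truth is a pure power law with multiplicative noise, taking logarithms turns it into a linear model, `log y = log C + alpha * log q_c + beta * log N_d + noise`, which is the idea behind the log-log regression recorded in `Stage1Results.power_law_r2`. The sketch below is a minimal standalone check of that idea with ordinary least squares. It regenerates the data with its own random generator, so the samples differ from those of `generate_warm_rain_data`, and it is not the framework's detection routine.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000

# Same ranges, exponents, and noise model as the generator above
q_c = rng.uniform(1e-4, 5e-3, n_samples)
N_d = rng.uniform(1e7, 5e8, n_samples)
noise = np.exp(rng.normal(0.0, 0.01, n_samples))
y = 0.89 * q_c**2.47 * N_d**(-1.79) * noise

# Linear model in log space: columns are intercept, log q_c, log N_d
A = np.column_stack([np.ones(n_samples), np.log(q_c), np.log(N_d)])
coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
log_C, alpha_hat, beta_hat = coef

# Goodness of fit of the log-log regression
residuals = np.log(y) - A @ coef
r2_log = 1.0 - residuals.var() / np.log(y).var()

print(f"C = {np.exp(log_C):.3f}, alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}")
print(f"log-log R^2 = {r2_log:.5f}")
```

The exponents are recovered much more precisely than the leading coefficient, because the intercept is an extrapolation to `log q_c = log N_d = 0`, far outside the sampled range.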

In [None]:
# ==============================================================================
# POLYNOMIAL DATA GENERATOR
# ==============================================================================

def generate_polynomial_data(
    n_samples: int = 1000,
    noise_level: float = 0.01,
    seed: int = RANDOM_SEED
) -> Tuple[np.ndarray, np.ndarray, List[str], Dict[str, float]]:
    """
    Generate synthetic polynomial regression data.
    
    True equation: y = 0.5*x1^2 + 0.3*x2*x3 - 0.1*x1*x2^2 + 0.8
    
    Parameters
    ----------
    n_samples : int
        Number of data points to generate
    noise_level : float
        Additive Gaussian noise level, expressed as a fraction of std(y_true)
    seed : int
        Random seed for reproducibility
        
    Returns
    -------
    X : np.ndarray
        Feature matrix (n_samples, 5) with x1, x2, x3 and 2 noise features
    y : np.ndarray
        Target vector
    feature_names : List[str]
        List of feature names
    true_coefficients : Dict[str, float]
        True coefficient values
    """
    np.random.seed(seed)
    
    # Generate base features
    x1 = np.random.uniform(-2, 2, n_samples)
    x2 = np.random.uniform(-2, 2, n_samples)
    x3 = np.random.uniform(-2, 2, n_samples)
    
    # Noise features (should not be selected)
    noise1 = np.random.uniform(-2, 2, n_samples)
    noise2 = np.random.uniform(-2, 2, n_samples)
    
    # True equation: y = 0.5*x1^2 + 0.3*x2*x3 - 0.1*x1*x2^2 + 0.8
    y_true = 0.5 * x1**2 + 0.3 * x2 * x3 - 0.1 * x1 * x2**2 + 0.8
    
    # Add noise
    if noise_level > 0:
        noise = np.random.normal(0, noise_level * np.std(y_true), n_samples)
        y = y_true + noise
    else:
        y = y_true.copy()
    
    # Construct feature matrix
    X = np.column_stack([x1, x2, x3, noise1, noise2])
    feature_names = ['x1', 'x2', 'x3', 'noise1', 'noise2']
    
    true_coefficients = {
        'intercept': 0.8,
        'x1^2': 0.5,
        'x2*x3': 0.3,
        'x1*x2^2': -0.1
    }
    
    return X, y, feature_names, true_coefficients

print("Polynomial data generator defined.")
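
The polynomial benchmark is designed for sparse regression: the true terms should survive and everything else should be pruned. The sketch below is a minimal version of sequentially thresholded least squares (STLSQ) on a small hand-built library. The function `stlsq`, the library columns, and the threshold of 0.05 are assumptions made for this example and are not the framework's Stage 2 implementation. The threshold is set below the smallest true coefficient magnitude (0.1), because on unscaled features a cutoff of 0.1 would sit exactly at the `x1*x2^2` coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x1, x2, x3, distractor = rng.uniform(-2, 2, size=(4, n))

# Same true equation and relative noise level as generate_polynomial_data
y_true = 0.5 * x1**2 + 0.3 * x2 * x3 - 0.1 * x1 * x2**2 + 0.8
y = y_true + rng.normal(0.0, 0.01 * np.std(y_true), n)

names = ["1", "x1", "x2", "x3", "x1^2", "x2*x3", "x1*x2^2", "distractor"]
library = np.column_stack(
    [np.ones(n), x1, x2, x3, x1**2, x2 * x3, x1 * x2**2, distractor]
)


def stlsq(Phi, target, threshold, max_iter=20):
    """Sequentially thresholded least squares (illustrative sketch)."""
    coef, *_ = np.linalg.lstsq(Phi, target, rcond=None)
    for _ in range(max_iter):
        support = np.abs(coef) >= threshold
        if not support.any():
            return np.zeros_like(coef)
        # Refit using only the columns that survived thresholding
        refit, *_ = np.linalg.lstsq(Phi[:, support], target, rcond=None)
        updated = np.zeros_like(coef)
        updated[support] = refit
        converged = np.array_equal(updated != 0, coef != 0)
        coef = updated
        if converged:
            break
    return coef


coef = stlsq(library, y, threshold=0.05)
recovered = {name: round(float(c), 3) for name, c in zip(names, coef) if c != 0}
print(recovered)
```

The loop stops once the set of non-zero coefficients no longer changes between iterations, which here happens after the first pruning step.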

In [None]:
# ==============================================================================
# TRIGONOMETRIC DATA GENERATOR
# ==============================================================================

def generate_trigonometric_data(
    n_samples: int = 1000,
    noise_level: float = 0.01,
    seed: int = RANDOM_SEED
) -> Tuple[np.ndarray, np.ndarray, List[str], Dict[str, float]]:
    """
    Generate synthetic trigonometric regression data.
    
    True equation: y = sin(x1) + 0.5*cos(x2) + 0.3*x1*x2
    
    Parameters
    ----------
    n_samples : int
        Number of data points to generate
    noise_level : float
        Noise standard deviation as a fraction of std(y_true);
        0 disables noise
    seed : int
        Random seed for reproducibility
        
    Returns
    -------
    X : np.ndarray
        Feature matrix (n_samples, 4) with x1, x2 and 2 noise features
    y : np.ndarray
        Target vector
    feature_names : List[str]
        List of feature names
    true_coefficients : Dict[str, float]
        True coefficient values
    """
    np.random.seed(seed)
    
    # Generate base features (in range suitable for trig functions)
    x1 = np.random.uniform(-np.pi, np.pi, n_samples)
    x2 = np.random.uniform(-np.pi, np.pi, n_samples)
    
    # Noise features (should not be selected)
    noise1 = np.random.uniform(-np.pi, np.pi, n_samples)
    noise2 = np.random.uniform(-np.pi, np.pi, n_samples)
    
    # True equation: y = sin(x1) + 0.5*cos(x2) + 0.3*x1*x2
    y_true = np.sin(x1) + 0.5 * np.cos(x2) + 0.3 * x1 * x2
    
    # Add noise
    if noise_level > 0:
        noise = np.random.normal(0, noise_level * np.std(y_true), n_samples)
        y = y_true + noise
    else:
        y = y_true.copy()
    
    # Construct feature matrix
    X = np.column_stack([x1, x2, noise1, noise2])
    feature_names = ['x1', 'x2', 'noise1', 'noise2']
    
    true_coefficients = {
        'sin(x1)': 1.0,
        'cos(x2)': 0.5,
        'x1*x2': 0.3
    }
    
    return X, y, feature_names, true_coefficients

print("Trigonometric data generator defined.")
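
The two noise columns exist so that a selection method can be checked for false positives. One rough way to see what "should not be selected" means is to fit least squares on the three true terms plus both noise features and look at the size of the noise coefficients. The sketch below is self-contained and only mirrors the generator: the seed, `default_rng`, and the hand-built candidate library are assumptions for illustration, not part of the framework.

```python
import numpy as np

# Hypothetical stand-in for generate_trigonometric_data (illustrative seed).
rng = np.random.default_rng(0)
n_samples = 1000
noise_level = 0.01

x1 = rng.uniform(-np.pi, np.pi, n_samples)
x2 = rng.uniform(-np.pi, np.pi, n_samples)
noise1 = rng.uniform(-np.pi, np.pi, n_samples)
noise2 = rng.uniform(-np.pi, np.pi, n_samples)

# Same true equation and relative noise model as the generator above
y_true = np.sin(x1) + 0.5 * np.cos(x2) + 0.3 * x1 * x2
y = y_true + rng.normal(0, noise_level * np.std(y_true), n_samples)

# Candidate library: the three true terms plus the two distractor features
term_names = ['sin(x1)', 'cos(x2)', 'x1*x2', 'noise1', 'noise2']
A = np.column_stack([np.sin(x1), np.cos(x2), x1 * x2, noise1, noise2])

coef, *_ = np.linalg.lstsq(A, y, rcond=None)

for name, value in zip(term_names, coef):
    print(f"{name:>8s}: {value:+.4f}")
```

The first three coefficients should land near 1.0, 0.5, and 0.3, while the two noise coefficients should be close to zero. A sparse method is expected to drop them entirely rather than merely shrink them.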

In [None]:
# ==============================================================================
# PENDULUM DATA GENERATOR (FOR BUCKINGHAM PI TESTING)
# ==============================================================================

def generate_pendulum_data(
    n_samples: int = 500,
    noise_level: float = 0.01,
    seed: int = RANDOM_SEED
) -> Tuple[np.ndarray, np.ndarray, List[str], UserInputs]:
    """
    Generate synthetic simple pendulum period data.
    
    True equation: T = 2*pi * sqrt(L/g)
    
    Variables:
    - L: pendulum length (m)
    - m: mass (kg) - does not appear in true equation
    - g: gravitational acceleration (m/s^2)
    - T: period (s)
    
    This is a classic dimensional analysis example: T * sqrt(g/L) is the
    only independent dimensionless group, so the mass m cannot appear.
    
    Parameters
    ----------
    n_samples : int
        Number of data points to generate
    noise_level : float
        Standard deviation of the log-normal multiplicative noise
        (approximately the relative error in T); 0 disables noise
    seed : int
        Random seed for reproducibility
        
    Returns
    -------
    X : np.ndarray
        Feature matrix (n_samples, 3) with L, m, g
    y : np.ndarray
        Target vector - period T
    feature_names : List[str]
        List of feature names
    user_inputs : UserInputs
        Complete UserInputs with dimensional information
    """
    np.random.seed(seed)
    
    # Generate physically realistic ranges
    L = np.random.uniform(0.1, 2.0, n_samples)      # Length (m)
    m = np.random.uniform(0.01, 1.0, n_samples)     # Mass (kg) - irrelevant
    g = np.random.uniform(9.7, 10.0, n_samples)     # Gravity (m/s^2)
    
    # True period: T = 2*pi * sqrt(L/g)
    T_true = 2 * np.pi * np.sqrt(L / g)
    
    # Add multiplicative log-normal noise (keeps T strictly positive)
    if noise_level > 0:
        noise = np.exp(np.random.normal(0, noise_level, n_samples))
        T = T_true * noise
    else:
        T = T_true.copy()
    
    # Construct feature matrix
    X = np.column_stack([L, m, g])
    feature_names = ['L', 'm', 'g']
    
    # Create UserInputs with dimensional information
    user_inputs = UserInputs(
        variable_dimensions={
            'L': [0, 1, 0, 0],      # Length: m
            'm': [1, 0, 0, 0],      # Mass: kg
            'g': [0, 1, -2, 0],     # Acceleration: m/s^2
        },
        target_dimensions=[0, 0, 1, 0],  # Time: s
        physical_bounds={
            'target': {'min': 0, 'max': None},
            'L': {'min': 0, 'max': None},
            'm': {'min': 0, 'max': None},
            'g': {'min': 0, 'max': None},
        },
        variable_mapping=None,
        unit_conversions=None
    )
    
    return X, T, feature_names, user_inputs

print("Pendulum data generator defined.")
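
The dimensionless group quoted in the docstring can be derived mechanically from the dimension vectors: it is the null space of the dimension matrix. The sketch below assumes that the first three entries of each vector in `variable_dimensions` are the exponents of mass, length, and time, and it drops the fourth entry, which is zero for every variable here. The matrix layout and variable names are illustrative and are not taken from the framework's own Buckingham Pi code.

```python
import numpy as np
import sympy as sp

# Dimension matrix: one column per variable (L, m, g, T),
# one row per base dimension (mass, length, time).
# Assumption: exponent order in the source vectors is [mass, length, time, ...].
D = sp.Matrix([
    [0, 1, 0, 0],    # mass exponents
    [1, 0, 1, 0],    # length exponents
    [0, 0, -2, 1],   # time exponents
])

# Each null-space vector gives the exponents of one dimensionless group
null_vectors = D.nullspace()
exponents = null_vectors[0]
print("Number of dimensionless groups:", len(null_vectors))
print("Exponents of (L, m, g, T):", list(exponents))

# Numerical check on noise-free pendulum data (illustrative seed and size)
rng = np.random.default_rng(0)
n_samples = 200
L = rng.uniform(0.1, 2.0, n_samples)
m = rng.uniform(0.01, 1.0, n_samples)
g = rng.uniform(9.7, 10.0, n_samples)
T = 2 * np.pi * np.sqrt(L / g)

a = [float(e) for e in exponents]
pi_group = L**a[0] * m**a[1] * g**a[2] * T**a[3]
print("Spread of the group across samples:", np.ptp(pi_group))
```

Four variables and three independent dimensions leave a one-dimensional null space, proportional to (-1/2, 0, 1/2, 1), which is the group T * sqrt(g/L). The mass exponent is zero because m is the only variable that carries a mass dimension, and on noise-free data the group is constant across samples.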

---
## Section 5: Module Summary

In [None]:
# ==============================================================================
# MODULE SUMMARY
# ==============================================================================

print("="*70)
print(" 00_Core.ipynb - Module Summary")
print("="*70)
print()
print("DATACLASSES:")
print("  - UserInputs: User-defined inputs (dimensions, bounds, mappings)")
print("  - Stage1Results: Variable selection & preprocessing results")
print("  - Stage2Results: Structure discovery results")
print("  - Stage3Results: Validation & UQ results")
print()
print("UTILITY FUNCTIONS:")
print("  - safe_log, safe_divide, safe_power: Numerical safety")
print("  - compute_r2, compute_mse, compute_rmse, compute_mae: Metrics")
print("  - format_equation: Equation string formatting")
print("  - print_section_header, print_subsection_header: Output formatting")
print("  - normalize_features: StandardScaler wrapper")
print("  - check_data_validity: Data validation")
print()
print("DATA GENERATORS:")
print("  - generate_warm_rain_data: dq_r/dt = 0.89 * q_c^2.47 * N_d^(-1.79)")
print("  - generate_polynomial_data: y = 0.5*x1^2 + 0.3*x2*x3 - 0.1*x1*x2^2 + 0.8")
print("  - generate_trigonometric_data: y = sin(x1) + 0.5*cos(x2) + 0.3*x1*x2")
print("  - generate_pendulum_data: T = 2*pi * sqrt(L/g)")
print()
print("CONFIGURATION CONSTANTS:")
print(f"  - RANDOM_SEED: {RANDOM_SEED}")
print(f"  - DEFAULT_N_BOOTSTRAP: {DEFAULT_N_BOOTSTRAP}")
print(f"  - DEFAULT_CONFIDENCE_LEVEL: {DEFAULT_CONFIDENCE_LEVEL}")
print()
print("="*70)
print("Module loaded successfully. Import via: %run 00_Core.ipynb")
print("="*70)