# 06_PySR - Physics-SR Framework v4.1

## Stage 2.1-2.2: PySR Structure Exploration + Structure Parsing

**Author:** Zhengze Zhang  
**Affiliation:** Department of Statistics, Columbia University  
**Contact:** zz3239@columbia.edu  
**Date:** January 2026  
**Version:** 4.1 (Structure-Guided Feature Library Enhancement + Computational Optimization)

---

### Purpose

Search for symbolic expressions via evolutionary algorithms and parse discovered structures
for augmented library construction. This is an **ENHANCED** module for v4.1.

### v4.1 Enhancements

| Component | v3.0 | v4.1 |
|-----------|------|------|
| PySRDiscoverer | Basic configuration | Mode-based (fast/standard/thorough) |
| Timeout | None | Hard timeout with elapsed_time tracking |
| Precision | Float64 | Float32 (half the memory of Float64) |
| StructureParser | Single equation | Multi-equation Pareto parsing |
| Operator Detection | None | Automatic detection for Layer 4 |
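
The memory figure in the Precision row can be checked directly with NumPy. The sketch below uses a made-up array shape and only illustrates the cast that `discover()` applies to its inputs when `precision=32`; it says nothing about PySR's internal memory use.

```python
import numpy as np

# Hypothetical feature matrix; the shape is chosen only for illustration.
X64 = np.random.default_rng(0).uniform(0.5, 2.0, size=(1000, 3))
X32 = X64.astype(np.float32)  # same kind of cast discover() applies for precision == 32

# 8 bytes per float64 element versus 4 bytes per float32 element
ratio = X64.nbytes / X32.nbytes

# The price of the cast: a small relative rounding error per element
max_rel_err = float(np.max(np.abs(X64 - X32) / np.abs(X64)))

print(f"float64: {X64.nbytes} B, float32: {X32.nbytes} B, ratio: {ratio:.1f}")
print(f"largest relative rounding error after the cast: {max_rel_err:.2e}")
```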

### Algorithm Overview

1. **Configure PySR** with mode-based parameters (fast/standard/thorough)
2. **Run symbolic search** via genetic programming with timeout
3. **Extract Pareto front** (complexity vs accuracy tradeoff)
4. **Parse structure** - extract unique terms and operators for library construction
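
Step 3 can be illustrated on a toy table. This is a minimal sketch, not the notebook's `_extract_pareto_front` method: the candidate values are invented, and `pareto_front` is a hypothetical helper that uses a sort-and-scan pass instead of a pairwise comparison. Unlike the pairwise version, it keeps only one row when two rows share the same complexity and loss.

```python
import pandas as pd

# Toy candidate table; all values are invented for illustration.
candidates = pd.DataFrame({
    "complexity": [1, 3, 5, 5, 7, 9],
    "loss": [2.0, 0.9, 0.4, 0.6, 0.45, 0.1],
    "equation": ["x", "x*x", "0.5*x*x", "x*x + z",
                 "0.5*x*x + z", "0.5*x*x + sin(z)"],
})


def pareto_front(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows that no other row beats on both complexity and loss."""
    ordered = df.sort_values(["complexity", "loss"]).reset_index(drop=True)
    best_loss_so_far = float("inf")
    keep = []
    for _, row in ordered.iterrows():
        # After sorting, a row is non-dominated only if it strictly improves
        # on the best loss seen at any lower or equal complexity.
        if row["loss"] < best_loss_so_far:
            keep.append(True)
            best_loss_so_far = row["loss"]
        else:
            keep.append(False)
    return ordered[keep].reset_index(drop=True)


front = pareto_front(candidates)
print(front)
```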

### Reference

- Cranmer, M. (2023). Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl. *arXiv:2305.01582*.
- Framework v4.0/v4.1 Section 4.1-4.2

---
## Section 1: Header and Imports

In [None]:
"""
06_PySR.ipynb - PySR Structure Exploration + Structure Parsing
===============================================================

Three-Stage Physics-Informed Symbolic Regression Framework v4.1

This module provides:
- PySRDiscoverer: Mode-based PySR configuration with computational optimization
- StructureParser: Parse Pareto equations for augmented library construction
- Operator detection for Layer 4 term generation
- Graceful fallback when PySR is not available

v4.1 Key Changes from v3.0:
- Mode-based configuration: 'fast', 'standard', 'thorough'
- Hard timeout with elapsed_time tracking
- Float32 precision and turbo mode
- StructureParser.parse_pareto_equations() for multi-equation parsing
- Automatic operator detection for Layer 4

Output Format:
- PySRDiscoverer.discover() returns dict with elapsed_time, best_r2
- StructureParser.parse_pareto_equations() returns (unique_terms, detected_operators, term_map)

Author: Zhengze Zhang
Affiliation: Department of Statistics, Columbia University
Contact: zz3239@columbia.edu
"""

# Import core module
%run 00_Core.ipynb

In [None]:
# PySR installation check
PYSR_AVAILABLE = False

try:
    from pysr import PySRRegressor
    PYSR_AVAILABLE = True
    print("06_PySR v4.1: PySR is available.")
except ImportError:
    warnings.warn(
        "PySR not available. Install with: pip install pysr\n"
        "PySRDiscoverer will use fallback mode."
    )
    print("06_PySR v4.1: PySR not available, fallback mode enabled.")

In [None]:
# Additional imports for Structure Parser
import sympy as sp
from sympy import Symbol, sympify, Add, Mul, Pow, Float, Integer, lambdify
from typing import Dict, List, Tuple, Optional, Any, Union, Set, Callable
import re
import time

print("06_PySR v4.1: SymPy imports successful.")

---
## Section 2: PySRDiscoverer Class
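
`PySRDiscoverer` reads its search budget from `PYSR_MODES`, which is defined in `00_Core` and not shown in this notebook. The sketch below is a hypothetical stand-in: the keys are the ones this notebook accesses, but every number, the name `EXAMPLE_MODES`, and the helper `resolve_timeout` are assumptions made for illustration.

```python
from typing import Any, Dict, Optional

# Hypothetical stand-in for PYSR_MODES in 00_Core. The keys are the ones this
# notebook reads; every number is a placeholder, not the framework's setting.
EXAMPLE_MODES: Dict[str, Dict[str, Any]] = {
    "fast": {
        "niterations": 20, "maxsize": 15, "maxdepth": 5,
        "populations": 8, "population_size": 30,
        "ncycles_per_iteration": 100, "timeout_in_seconds": 60,
    },
    "standard": {
        "niterations": 50, "maxsize": 20, "maxdepth": 6,
        "populations": 15, "population_size": 33,
        "ncycles_per_iteration": 300, "timeout_in_seconds": 300,
    },
    "thorough": {
        "niterations": 100, "maxsize": 25, "maxdepth": 8,
        "populations": 20, "population_size": 50,
        "ncycles_per_iteration": 500, "timeout_in_seconds": 900,
    },
}


def resolve_timeout(mode: str, timeout_seconds: Optional[int] = None) -> int:
    """Return the explicit timeout if one is given, else the mode default."""
    if mode not in EXAMPLE_MODES:
        raise ValueError(f"mode must be one of {list(EXAMPLE_MODES)}")
    if timeout_seconds is not None:
        return timeout_seconds
    return EXAMPLE_MODES[mode]["timeout_in_seconds"]


print(resolve_timeout("fast"))           # falls back to the mode default
print(resolve_timeout("thorough", 120))  # explicit value wins
```

The helper tests `is not None`, so an explicit `0` would be passed through; a shorter `timeout_seconds or default` expression would treat `0` as missing.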

In [None]:
# ==============================================================================
# PYSR DISCOVERER CLASS (v4.1 ENHANCED)
# ==============================================================================

class PySRDiscoverer:
    """
    PySR-based Symbolic Regression Discoverer (v4.1 Enhanced).
    
    Wrapper around PySR with physics-informed configuration and
    v4.1 computational optimizations.
    
    Attributes
    ----------
    mode : str
        Configuration mode: 'fast', 'standard', 'thorough'
    binary_operators : List[str]
        Binary operators (default: ["+", "-", "*", "/"])
    unary_operators : List[str]
        Unary operators (default: ["sin", "cos", "exp"])
    timeout_seconds : int
        Hard timeout in seconds (v4.1)
    procs : int
        Number of parallel processes (default: 2 for Colab). Stored but not
        forwarded to PySR, because deterministic mode runs serially
    turbo : bool
        Use turbo mode for faster evaluation (v4.1)
    precision : int
        Float precision, 32 or 64 (v4.1)
    model : PySRRegressor
        Fitted PySR model (public)
    elapsed_time : float
        Execution time in seconds (public)
    
    Methods
    -------
    discover(X, y, feature_names) -> Dict
        Run PySR symbolic regression
    get_pareto_equations() -> List[str]
        Get equations from Pareto front
    predict(X) -> np.ndarray
        Predict using best equation
    print_discovery_report() -> None
        Print detailed discovery results
    
    Reference
    ---------
    Cranmer, M. (2023). arXiv:2305.01582.
    Framework v4.0/v4.1 Section 4.1: PySR Structure Exploration
    
    Examples
    --------
    >>> discoverer = PySRDiscoverer(mode='standard')
    >>> result = discoverer.discover(X, y, feature_names)
    >>> print(f"Best equation: {result['best_equation']}")
    >>> print(f"Elapsed time: {result['elapsed_time']:.1f}s")
    """
    
    def __init__(
        self,
        mode: str = 'standard',
        binary_operators: Optional[List[str]] = None,
        unary_operators: Optional[List[str]] = None,
        timeout_seconds: Optional[int] = None,
        procs: int = DEFAULT_PROCS,
        turbo: bool = True,
        precision: int = DEFAULT_PRECISION,
        random_state: int = RANDOM_SEED
    ):
        """
        Initialize PySRDiscoverer.
        
        Parameters
        ----------
        mode : str
            Configuration mode: 'fast', 'standard', 'thorough'.
            Uses PYSR_MODES from 00_Core for settings.
            Default: 'standard'
        binary_operators : List[str], optional
            Binary operators. Default: ["+", "-", "*", "/"]
        unary_operators : List[str], optional
            Unary operators. Default: ["sin", "cos", "exp"]
        timeout_seconds : int, optional
            Hard timeout. If None, uses mode-specific default.
        procs : int
            Number of CPU processes. Default: 2 (Colab optimized).
            Stored but not forwarded to PySR: deterministic mode runs serially.
        turbo : bool
            Enable turbo mode for faster evaluation. Default: True
        precision : int
            Float precision (32 or 64). Default: 32
        random_state : int
            Random seed for reproducibility. Default: 42
        """
        # Validate mode
        if mode not in PYSR_MODES:
            raise ValueError(f"mode must be one of {list(PYSR_MODES.keys())}")
        
        self.mode = mode
        self.binary_operators = binary_operators or ["+", "-", "*", "/"]
        self.unary_operators = unary_operators or ["sin", "cos", "exp"]
        self.procs = procs
        self.turbo = turbo
        self.precision = precision
        self.random_state = random_state
        
        # Get mode-specific timeout or use provided
        mode_config = PYSR_MODES[mode]
        self.timeout_seconds = timeout_seconds or mode_config['timeout_in_seconds']
        
        # Public attributes (v4.1)
        self.model = None
        self.elapsed_time = None
        
        # Internal state
        self._feature_names = None
        self._pareto_front = None
        self._best_equation = None
        self._best_equation_sympy = None
        self._best_complexity = None
        self._best_loss = None
        self._best_r2 = None
        self._discovery_complete = False
    
    def discover(
        self,
        X: np.ndarray,
        y: np.ndarray,
        feature_names: List[str]
    ) -> Dict[str, Any]:
        """
        Run PySR symbolic regression.
        
        Parameters
        ----------
        X : np.ndarray
            Feature matrix of shape (n_samples, n_features)
        y : np.ndarray
            Target vector of shape (n_samples,)
        feature_names : List[str]
            Names of features
        
        Returns
        -------
        Dict[str, Any]
            Dictionary containing:
            - best_equation: Best equation string
            - best_equation_sympy: SymPy expression of best equation
            - best_complexity: Complexity of best equation
            - best_loss: Loss of best equation
            - best_r2: R-squared of best equation
            - pareto_front: DataFrame of Pareto-optimal equations
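            - all_equations: List of Pareto-front equation strings
            - elapsed_time: Execution time in seconds
            - pysr_available: Whether PySR was used
            - n_equations: Number of candidate equations returned by PySR
            - mode: Configuration mode used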
        """
        self._feature_names = list(feature_names)
        
        if not PYSR_AVAILABLE:
            return self._fallback_discovery(X, y)
        
        # Start timer
        start_time = time.time()
        
        # Configure PySR with v4.1 optimizations
        self.model = self._configure_pysr()
        
        # Convert to float32 if precision=32
        if self.precision == 32:
            X_fit = X.astype(np.float32)
            y_fit = y.astype(np.float32)
        else:
            X_fit = X
            y_fit = y
        
        # Run search
        try:
            self.model.fit(X_fit, y_fit, variable_names=self._feature_names)
        except Exception as e:
            warnings.warn(f"PySR fit failed: {e}. Using fallback.")
            # Drop the unfitted model so predict() takes the fallback path
            self.model = None
            return self._fallback_discovery(X, y)
        
        # Record elapsed time
        self.elapsed_time = time.time() - start_time
        
        # Extract results
        self._pareto_front = self._extract_pareto_front()
        self._best_equation = self._select_best_equation(criterion='accuracy')
        
        # Get best equation details
        if self.model.equations_ is not None and len(self.model.equations_) > 0:
            best_idx = self.model.equations_.loss.idxmin()
            self._best_complexity = int(self.model.equations_.loc[best_idx, 'complexity'])
            self._best_loss = float(self.model.equations_.loc[best_idx, 'loss'])
            
            # Parse to SymPy
            try:
                local_dict = {name: Symbol(name) for name in self._feature_names}
                local_dict.update({'sqrt': sp.sqrt, 'exp': sp.exp, 'log': sp.log,
                                   'sin': sp.sin, 'cos': sp.cos, 'abs': sp.Abs})
                self._best_equation_sympy = sympify(
                    self._best_equation.replace('^', '**'),
                    locals=local_dict
                )
            except Exception:
                self._best_equation_sympy = None
            
            # Compute R2
            try:
                y_pred = self.model.predict(X_fit)
                ss_tot = np.sum((y - np.mean(y))**2)
                ss_res = np.sum((y - y_pred)**2)
                self._best_r2 = 1 - ss_res / ss_tot if ss_tot > 0 else 0.0
            except Exception:
                self._best_r2 = None
        else:
            self._best_complexity = 0
            self._best_loss = float('inf')
            self._best_equation_sympy = None
            self._best_r2 = None
        
        self._discovery_complete = True
        
        return {
            'best_equation': self._best_equation,
            'best_equation_sympy': self._best_equation_sympy,
            'best_complexity': self._best_complexity,
            'best_loss': self._best_loss,
            'best_r2': self._best_r2,
            'pareto_front': self._pareto_front,
            'all_equations': self.get_pareto_equations(),
            'elapsed_time': self.elapsed_time,
            'pysr_available': True,
            'n_equations': len(self.model.equations_) if self.model.equations_ is not None else 0,
            'mode': self.mode
        }
    
    def _configure_pysr(self) -> 'PySRRegressor':
        """
        Configure PySR model with v4.1 optimized settings.
        
        Uses PYSR_MODES configuration based on self.mode.
        
        Returns
        -------
        PySRRegressor
            Configured PySR model
        """
        # Get mode-specific configuration
        config = PYSR_MODES[self.mode]
        
        # Get constraints
        constraints, nested_constraints = self._get_constraints()
        
        model = PySRRegressor(
            # Core parameters from mode config
            niterations=config['niterations'],
            maxsize=config['maxsize'],
            maxdepth=config['maxdepth'],
            populations=config['populations'],
            population_size=config['population_size'],
            ncycles_per_iteration=config['ncycles_per_iteration'],
            
            # Operators
            binary_operators=self.binary_operators,
            unary_operators=self.unary_operators,
            
            # Parallelization: deterministic=True requires serial execution,
            # so self.procs is not forwarded to PySR here
            procs=0,
            
            # Performance optimizations (v4.1)
            turbo=self.turbo,
            precision=self.precision,
            
            # Time control
            timeout_in_seconds=self.timeout_seconds,
            
            # Constraints
            constraints=constraints,
            nested_constraints=nested_constraints,
            
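            # Reproducibility
            random_state=self.random_state,
            deterministic=True,
            parallelism='serial',
            
            # Make model.predict() use the lowest-loss equation, matching
            # _select_best_equation(criterion='accuracy')
            model_selection='accuracy',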
            
            # Output control
            verbosity=0,
            progress=False
        )
        return model
    
    def _get_constraints(self) -> Tuple[Dict, Dict]:
        """
        Get complexity and nesting constraints.
        
        Returns
        -------
        Tuple[Dict, Dict]
            (constraints, nested_constraints)
        """
        constraints = {
            "/": (-1, 7),   # Denominator complexity limit
            "exp": 7        # Exponent complexity limit
        }
        
        nested_constraints = {
            "exp": {"exp": 0},              # No nested exponentials
            "sin": {"sin": 0, "cos": 1},    # Limited trig nesting
            "cos": {"sin": 1, "cos": 0}
        }
        
        return constraints, nested_constraints
    
    def _extract_pareto_front(self) -> pd.DataFrame:
        """
        Extract Pareto front from PySR results.
        
        Returns
        -------
        pd.DataFrame
            DataFrame with columns: complexity, loss, equation
        """
        if self.model is None or self.model.equations_ is None:
            return pd.DataFrame(columns=['complexity', 'loss', 'equation'])
        
        df = self.model.equations_[['complexity', 'loss', 'equation']].copy()
        
        # Filter to Pareto-optimal points
        pareto_mask = np.ones(len(df), dtype=bool)
        for i, (c1, l1) in enumerate(zip(df['complexity'], df['loss'])):
            for j, (c2, l2) in enumerate(zip(df['complexity'], df['loss'])):
                if i != j and c2 <= c1 and l2 <= l1 and (c2 < c1 or l2 < l1):
                    pareto_mask[i] = False
                    break
        
        return df[pareto_mask].sort_values('complexity').reset_index(drop=True)
    
    def _select_best_equation(
        self,
        criterion: str = 'accuracy'
    ) -> str:
        """
        Select best equation based on criterion.
        
        Parameters
        ----------
        criterion : str
            Selection criterion: 'accuracy', 'complexity', or 'pareto'
        
        Returns
        -------
        str
            Best equation string
        """
        if self.model is None or self.model.equations_ is None:
            return ""
        
        df = self.model.equations_
        
        if len(df) == 0:
            return ""
        
        if criterion == 'accuracy':
            best_idx = df['loss'].idxmin()
        elif criterion == 'complexity':
            best_idx = df['complexity'].idxmin()
        else:  # 'pareto': lowest loss among equations at or below median complexity
            med_complexity = df['complexity'].median()
            simple_df = df[df['complexity'] <= med_complexity]
            if len(simple_df) > 0:
                best_idx = simple_df['loss'].idxmin()
            else:
                best_idx = df['loss'].idxmin()
        
        return str(df.loc[best_idx, 'equation'])
    
    def _fallback_discovery(
        self,
        X: np.ndarray,
        y: np.ndarray
    ) -> Dict[str, Any]:
        """
        Fallback when PySR is not available or the PySR fit fails.
        
        Uses simple polynomial regression to suggest a basic form.
        """
        from sklearn.linear_model import Ridge
        from sklearn.preprocessing import PolynomialFeatures
        
        start_time = time.time()
        
        # Fit simple polynomial
        poly = PolynomialFeatures(degree=2, include_bias=False)
        X_poly = poly.fit_transform(X)
        
        reg = Ridge(alpha=0.1)
        reg.fit(X_poly, y)
        
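        # Build equation string in a form SymPy can parse
        terms = []
        feature_names_poly = poly.get_feature_names_out(self._feature_names)
        for coef, name in zip(reg.coef_, feature_names_poly):
            if abs(coef) > 0.01:
                # PolynomialFeatures writes products as "x y" and powers as "x^2"
                name = name.replace(' ', '*').replace('^', '**')
                terms.append(f"{coef:.4f}*{name}")
        
        # Join intercept and terms; avoids a dangling "+" when no term survives
        parts = ([f"{reg.intercept_:.4f}"] if reg.intercept_ != 0 else []) + terms
        equation = " + ".join(parts) if parts else "0"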
        
        # Compute metrics
        y_pred = reg.predict(X_poly)
        loss = np.mean((y - y_pred)**2)
        ss_tot = np.sum((y - np.mean(y))**2)
        ss_res = np.sum((y - y_pred)**2)
        r2 = 1 - ss_res / ss_tot if ss_tot > 0 else 0.0
        
        self.elapsed_time = time.time() - start_time
        self._best_equation = equation
        self._best_loss = loss
        self._best_r2 = r2
        self._best_complexity = len(terms) + 1
        self._discovery_complete = True
        
        return {
            'best_equation': equation,
            'best_equation_sympy': None,
            'best_complexity': self._best_complexity,
            'best_loss': loss,
            'best_r2': r2,
            'pareto_front': pd.DataFrame({
                'complexity': [self._best_complexity],
                'loss': [loss],
                'equation': [equation]
            }),
            'all_equations': [equation],
            'elapsed_time': self.elapsed_time,
            'pysr_available': False,
            'n_equations': 1,
            'mode': self.mode
        }
    
    def get_pareto_equations(self) -> List[str]:
        """
        Get equations from Pareto front.
        
        Returns
        -------
        List[str]
            List of equation strings
        """
        if self._pareto_front is None or len(self._pareto_front) == 0:
            return [self._best_equation] if self._best_equation else []
        
        return list(self._pareto_front['equation'].astype(str))
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Predict using the best discovered equation.
        
        Parameters
        ----------
        X : np.ndarray
            Feature matrix
        
        Returns
        -------
        np.ndarray
            Predictions
        """
        if not self._discovery_complete:
            raise RuntimeError("Must run discover() first")
        
        if self.model is not None and PYSR_AVAILABLE:
            if self.precision == 32:
                return self.model.predict(X.astype(np.float32))
            return self.model.predict(X)
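        else:
            # Fallback: evaluate the full equation string so that coefficients
            # and the intercept are kept (StructureParser drops coefficients).
            # Spaces between names are read as products, '^' as a power.
            eq = re.sub(r'(?<=\w) (?=\w)', '*', self._best_equation).replace('^', '**')
            symbols_list = [Symbol(name) for name in self._feature_names]
            expr = sympify(eq, locals=dict(zip(self._feature_names, symbols_list)))
            func = lambdify(symbols_list, expr, modules=['numpy'])
            values = func(*[X[:, i] for i in range(X.shape[1])])
            # A constant equation evaluates to a scalar; expand it to one value per row
            return np.broadcast_to(np.asarray(values, dtype=float), (X.shape[0],)).copy()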
    
    def print_discovery_report(self) -> None:
        """
        Print a detailed discovery report in v4.1 format.
        """
        if not self._discovery_complete:
            print("Discovery not yet performed. Run discover() first.")
            return
        
        print("=" * 70)
        print("=== PySR Discovery (v4.1) ===")
        print("=" * 70)
        print()
        print(f"Mode: {self.mode}")
        print(f"Timeout: {self.timeout_seconds}s")
        print(f"Elapsed: {self.elapsed_time:.1f}s")
        print(f"PySR available: {PYSR_AVAILABLE}")
        print()
        print("-" * 70)
        print("Configuration:")
        print("-" * 70)
        print(f"  Binary operators: {self.binary_operators}")
        print(f"  Unary operators: {self.unary_operators}")
        print(f"  Turbo mode: {self.turbo}")
        print(f"  Precision: {self.precision}-bit")
        print()
        print("-" * 70)
        print("Best Equation:")
        print("-" * 70)
        print(f"  {self._best_equation}")
        print(f"  Complexity: {self._best_complexity}")
        print(f"  Loss (MSE): {self._best_loss:.6f}")
        if self._best_r2 is not None:
            print(f"  R-squared: {self._best_r2:.4f}")
        print()
        
        if self._pareto_front is not None and len(self._pareto_front) > 0:
            print("-" * 70)
            print("Pareto Front:")
            print("-" * 70)
            for _, row in self._pareto_front.iterrows():
                eq_str = str(row['equation'])[:50]
                print(f"  Complexity {int(row['complexity']):2d}: {eq_str} (loss: {row['loss']:.6f})")
        
        print()
        print("=" * 70)

print("PySRDiscoverer class v4.1 defined.")

---
## Section 3: StructureParser Class
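
The parsing idea behind `StructureParser` can be shown on a single equation with plain SymPy calls. This is a minimal sketch with a made-up equation, not the class itself: it expands the expression, splits it into additive terms, and separates each coefficient from its functional part. It also shows why `sqrt` needs special handling when detecting operators.

```python
import sympy as sp

x, y, z = sp.symbols("x y z")
local_dict = {"x": x, "y": y, "z": z}

# One equation in the style of a Pareto-front entry (made up for this example).
expr = sp.expand(
    sp.sympify("0.5*x**2 + 0.98*sin(z) + 0.01*y + 1.5", locals=local_dict)
)

# Split into additive terms, then separate each numeric coefficient from the
# functional part; pure constants such as 1.5 are skipped.
coefficients = {}
for term in sp.Add.make_args(expr):
    coef, functional = term.as_coeff_Mul()
    if functional.is_number:
        continue
    coefficients[str(functional)] = float(coef)

print(coefficients)

# sqrt has no class of its own in SymPy: sqrt(u) is represented as Pow(u, 1/2),
# so it is recognised through the exponent rather than through a function type.
root = sp.sqrt(x + 1)
print(type(root).__name__, root.exp)
```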

In [None]:
# ==============================================================================
# STRUCTURE PARSER CLASS (v4.0/v4.1 ENHANCED)
# ==============================================================================

class StructureParser:
    """
    Structure Parser for extracting terms from PySR equations (v4.0/v4.1).
    
    Parses PySR equations to extract exact functional terms for
    augmented library construction (Layer 1 and operator detection for Layer 4).
    
    v4.1 Key Features:
    - parse_pareto_equations(): Parse multiple equations from Pareto front
    - Automatic operator detection for Layer 4 terms
    - Returns (unique_terms, detected_operators, term_map)
    - unique_terms format: List[(sympy_expr, name_str, eval_function)]
    
    Attributes
    ----------
    None (stateless parser)
    
    Methods
    -------
    parse_pareto_equations(equations, feature_names, X) -> Tuple
        Parse multiple equations from Pareto front
    parse_single_equation(equation_str, feature_names, X) -> List
        Parse a single equation string
    
    Reference
    ---------
    Framework v4.0/v4.1 Section 4.2: Structure Parsing
    
    Examples
    --------
    >>> parser = StructureParser()
    >>> terms, operators, term_map = parser.parse_pareto_equations(
    ...     equations=['0.5*x**2 + sin(z)'],
    ...     feature_names=['x', 'y', 'z'],
    ...     X=X_data
    ... )
    >>> print(f"Detected operators: {operators}")
    """
    
    def __init__(self):
        """
        Initialize StructureParser.
        """
        pass
    
    def parse_pareto_equations(
        self,
        equations: List[str],
        feature_names: List[str],
        X: np.ndarray
    ) -> Tuple[List[Tuple], Set[str], Dict[str, str]]:
        """
        Parse multiple equations from Pareto front.
        
        Parameters
        ----------
        equations : List[str]
            Equation strings from PySR Pareto front
        feature_names : List[str]
            Variable names
        X : np.ndarray
            Feature matrix for evaluation testing
            
        Returns
        -------
        unique_terms : List[Tuple]
            List of (sympy_expr, name_str, eval_function)
        detected_operators : Set[str]
            Set of operators found; a subset of
            {'sin', 'cos', 'exp', 'log', 'sqrt', 'tan', 'abs'}
        term_to_equation_map : Dict[str, str]
            Mapping from term names to source equations
        """
        all_terms = []
        detected_operators = set()
        term_to_equation_map = {}
        
        # Create local dictionary for parsing
        local_dict = self._create_local_dict(feature_names)
        symbols_list = [Symbol(name) for name in feature_names]
        
        for equation_str in equations:
            try:
                # Clean and parse equation
                equation_str = equation_str.strip().replace('^', '**')
                expr = sympify(equation_str, locals=local_dict)
                
                # Expand expression
                expr = sp.expand(expr)
                
                # Decompose into additive terms
                if isinstance(expr, Add):
                    terms = list(expr.args)
                else:
                    terms = [expr]
                
                # Process each term
                for term in terms:
                    # Extract coefficient and functional part
                    coef, functional = term.as_coeff_Mul()
                    
                    # Skip pure constants
                    if functional.is_number or functional == 1:
                        continue
                    
                    # Detect operators in this term
                    term_operators = self._detect_operators(functional)
                    detected_operators.update(term_operators)
                    
                    # Create evaluation function
                    try:
                        eval_func = lambdify(symbols_list, functional, modules=['numpy'])
                        
                        # Test evaluability
                        test_values = eval_func(*[X[:10, i] for i in range(X.shape[1])])
                        if not np.all(np.isfinite(test_values)):
                            continue
                        
                        # Add to collection
                        term_name = str(functional)
                        all_terms.append((functional, term_name, eval_func))
                        term_to_equation_map[term_name] = equation_str
                        
                    except Exception:
                        continue
                        
            except Exception:
                continue
        
        # Deduplicate terms
        unique_terms = self._deduplicate_terms(all_terms)
        
        return unique_terms, detected_operators, term_to_equation_map
    
    def parse_single_equation(
        self,
        equation_str: str,
        feature_names: List[str],
        X: np.ndarray
    ) -> List[Tuple]:
        """
        Parse a single equation string.
        
        Parameters
        ----------
        equation_str : str
            Equation string to parse
        feature_names : List[str]
            Variable names
        X : np.ndarray
            Feature matrix for evaluation testing
            
        Returns
        -------
        List[Tuple]
            List of (sympy_expr, name_str, eval_function)
        """
        terms, _, _ = self.parse_pareto_equations([equation_str], feature_names, X)
        return terms
    
    def _create_local_dict(self, feature_names: List[str]) -> Dict:
        """
        Create symbol mapping for SymPy parsing.
        
        Parameters
        ----------
        feature_names : List[str]
            Variable names
            
        Returns
        -------
        Dict
            Local dictionary for sympify
        """
        local_dict = {name: Symbol(name) for name in feature_names}
        local_dict.update({
            'sqrt': sp.sqrt,
            'exp': sp.exp,
            'log': sp.log,
            'sin': sp.sin,
            'cos': sp.cos,
            'abs': sp.Abs,
            'tan': sp.tan
        })
        return local_dict
    
    def _detect_operators(self, expr: sp.Expr) -> Set[str]:
        """
        Detect which operators are used in expression.
        
        Parameters
        ----------
        expr : sp.Expr
            SymPy expression
            
        Returns
        -------
        Set[str]
            Set of operator names found
        """
        operators = set()
        
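        # Function classes that expr.has() can match directly
        operator_map = [
            (sp.sin, 'sin'),
            (sp.cos, 'cos'),
            (sp.exp, 'exp'),
            (sp.log, 'log'),
            (sp.tan, 'tan'),
            (sp.Abs, 'abs')
        ]
        
        for func_type, name in operator_map:
            if expr.has(func_type):
                operators.add(name)
        
        # sp.sqrt is a helper function, not a class: sqrt(u) is stored as
        # Pow(u, 1/2), so it is detected through the exponent instead
        if any(abs(p.exp) == sp.Rational(1, 2) for p in expr.atoms(Pow)):
            operators.add('sqrt')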
        
        return operators
    
    def _deduplicate_terms(
        self,
        terms: List[Tuple]
    ) -> List[Tuple]:
        """
        Remove duplicate terms based on simplified expression.
        
        Parameters
        ----------
        terms : List[Tuple]
            List of (expr, name, func) tuples
            
        Returns
        -------
        List[Tuple]
            Deduplicated list
        """
        seen_exprs = set()
        unique_terms = []
        
        for expr, name, func in terms:
            try:
                expr_key = str(sp.simplify(expr))
                if expr_key not in seen_exprs:
                    seen_exprs.add(expr_key)
                    unique_terms.append((expr, name, func))
            except Exception:
                # If simplification fails, use string representation
                if name not in seen_exprs:
                    seen_exprs.add(name)
                    unique_terms.append((expr, name, func))
        
        return unique_terms
    
    def _safe_evaluate(
        self,
        expr: sp.Expr,
        X: np.ndarray,
        feature_names: List[str]
    ) -> Optional[np.ndarray]:
        """
        Safely evaluate a SymPy expression on data.
        
        Handles numerical edge cases (overflow, division by zero, etc.).
        
        Parameters
        ----------
        expr : sp.Expr
            SymPy expression
        X : np.ndarray
            Feature matrix
        feature_names : List[str]
            Variable names
            
        Returns
        -------
        Optional[np.ndarray]
            Evaluated values or None if evaluation fails
        """
        try:
            symbols_list = [Symbol(name) for name in feature_names]
            func = lambdify(symbols_list, expr, modules=['numpy'])
            values = func(*[X[:, i] for i in range(X.shape[1])])
            
            # Check for numerical issues
            if not np.all(np.isfinite(values)):
                return None
            
            return values
        except Exception:
            return None

print("StructureParser class v4.1 defined.")

---
## Section 4: Internal Tests
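
The tests in this section print their results for visual inspection. As a complement, the sketch below shows how the R-squared formula used by `discover()` could be checked with assertions. The helper `r_squared` and the synthetic data are assumptions made for this example and are not part of the module.

```python
import numpy as np


def r_squared(y_true, y_pred) -> float:
    """R^2 = 1 - SS_res / SS_tot, with 0.0 returned for a constant target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    ss_res = np.sum((y_true - y_pred) ** 2)
    return float(1.0 - ss_res / ss_tot) if ss_tot > 0 else 0.0


# Synthetic target with the same functional form as the test data below
rng = np.random.default_rng(42)
x = rng.uniform(0.5, 2.0, 100)
z = rng.uniform(0.5, 2.0, 100)
y = 0.5 * x**2 + np.sin(z)

# A perfect prediction scores 1, the mean predictor scores 0
perfect = r_squared(y, y)
baseline = r_squared(y, np.full_like(y, y.mean()))
constant_target = r_squared(np.ones(10), np.ones(10))

assert perfect == 1.0
assert abs(baseline) < 1e-12
assert constant_target == 0.0
print("R-squared checks passed")
```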

In [None]:
# ==============================================================================
# TEST CONTROL FLAG
# ==============================================================================

_RUN_TESTS = False  # Set to True to run internal tests

if _RUN_TESTS:
    print("=" * 70)
    print(" RUNNING INTERNAL TESTS FOR 06_PySR v4.1")
    print("=" * 70)

In [None]:
# ==============================================================================
# TEST 1: PySRDiscoverer Mode Configuration
# ==============================================================================

if _RUN_TESTS:
    print()
    print_section_header("Test 1: PySRDiscoverer Mode Configuration")
    
    # Test each mode configuration
    for mode in ['fast', 'standard', 'thorough']:
        discoverer = PySRDiscoverer(mode=mode)
        print(f"Mode: {mode}")
        print(f"  Timeout: {discoverer.timeout_seconds}s")
        print(f"  Turbo: {discoverer.turbo}")
        print(f"  Precision: {discoverer.precision}-bit")
        print()
    
    print("[PASS] All modes instantiated without error")

In [None]:
# ==============================================================================
# TEST 2: PySR Discovery (Fallback Mode)
# ==============================================================================

if _RUN_TESTS:
    print()
    print_section_header("Test 2: PySR Discovery (Fallback Mode)")
    
    # Generate test data
    np.random.seed(42)
    n_samples = 100
    
    x = np.random.uniform(0.5, 2, n_samples)
    y_var = np.random.uniform(0.5, 2, n_samples)
    z = np.random.uniform(0.5, 2, n_samples)
    
    y = 0.5 * x**2 + np.sin(z) + 0.01 * np.random.randn(n_samples)
    
    X = np.column_stack([x, y_var, z])
    feature_names = ['x', 'y', 'z']
    
    # Run discovery (will use fallback if PySR not available)
    discoverer = PySRDiscoverer(mode='fast')
    result = discoverer.discover(X, y, feature_names)
    
    print(f"Best equation: {result['best_equation']}")
    print(f"Elapsed time: {result['elapsed_time']:.1f}s")
    print(f"PySR available: {result['pysr_available']}")
    print(f"Mode: {result['mode']}")
    if result['best_r2'] is not None:
        print(f"R-squared: {result['best_r2']:.4f}")
    
    discoverer.print_discovery_report()
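
The `best_r2` value printed above is a goodness-of-fit score. A minimal sketch of the standard coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, follows. The helper `r_squared` is hypothetical, and whether `PySRDiscoverer` computes `best_r2` in exactly this way is an assumption; returning `None` for an undefined score is chosen only to match the `is not None` guard in the test.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Standard coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    if ss_tot == 0.0:
        return None  # undefined for a constant target
    return 1.0 - ss_res / ss_tot


# Synthetic data with the same structure as the test above (different RNG stream).
rng = np.random.default_rng(42)
x = rng.uniform(0.5, 2, 100)
z = rng.uniform(0.5, 2, 100)
y = 0.5 * x**2 + np.sin(z) + 0.01 * rng.standard_normal(100)

r2_full = r_squared(y, 0.5 * x**2 + np.sin(z))   # true structure
r2_partial = r_squared(y, 0.5 * x**2)            # sin(z) term missing
```

Comparing `r2_full` with `r2_partial` shows how much of the fit is lost when a discovered equation omits one of the true terms.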

In [None]:
# ==============================================================================
# TEST 3: StructureParser - Multi-Equation Parsing
# ==============================================================================

if _RUN_TESTS:
    print()
    print_section_header("Test 3: StructureParser - Multi-Equation Parsing")
    
    np.random.seed(42)
    n_samples = 100
    
    x = np.random.uniform(0.5, 2, n_samples)
    y_var = np.random.uniform(0.5, 2, n_samples)
    z = np.random.uniform(0.5, 2, n_samples)
    
    X = np.column_stack([x, y_var, z])
    feature_names = ['x', 'y', 'z']
    
    # Simulated Pareto front equations
    equations = [
        '0.5*x**2',
        '0.5*x**2 + sin(z)',
        '0.5*x**2 + 0.98*sin(z) + 0.01*y'
    ]
    
    print(f"Input equations: {len(equations)}")
    for eq in equations:
        print(f"  {eq}")
    print()
    
    parser = StructureParser()
    unique_terms, detected_operators, term_map = parser.parse_pareto_equations(
        equations, feature_names, X
    )
    
    print("=== Structure Parsing ===")
    print(f"Parsed terms: {len(unique_terms)}")
    for _, name, _ in unique_terms:
        print(f"  - {name}")
    print()
    print(f"Detected operators: {detected_operators}")
    
    # Verify operator detection
    if 'sin' in detected_operators:
        print("[PASS] sin operator detected")
    else:
        print("[WARNING] sin operator not detected")
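
One way to see why three Pareto equations collapse to a small set of unique terms is to split each equation into additive terms and drop the numeric coefficients. The sketch below assumes this decomposition rule; `unique_additive_terms` is a hypothetical helper built on SymPy's `parse_expr`, `Add.make_args`, and `as_coeff_Mul`, and the rules applied by `StructureParser` may differ.

```python
import sympy as sp
from sympy.parsing.sympy_parser import parse_expr


def unique_additive_terms(equations, feature_names):
    """Hypothetical helper: collect coefficient-free additive terms across equations."""
    local_dict = {name: sp.Symbol(name) for name in feature_names}
    seen = []
    for eq in equations:
        expr = sp.expand(parse_expr(eq, local_dict=local_dict))
        for term in sp.Add.make_args(expr):
            _, core = term.as_coeff_Mul()  # strip the leading numeric coefficient
            if core != 1 and core not in seen:  # skip pure constants and duplicates
                seen.append(core)
    return seen


equations = [
    "0.5*x**2",
    "0.5*x**2 + sin(z)",
    "0.5*x**2 + 0.98*sin(z) + 0.01*y",
]
terms = unique_additive_terms(equations, ["x", "y", "z"])
term_names = sorted(str(t) for t in terms)
```

Under this rule, `0.98*sin(z)` and `sin(z)` count as the same term, so the three equations contribute `x**2`, `sin(z)`, and `y`.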

In [None]:
# ==============================================================================
# TEST 4: StructureParser - Operator Detection
# ==============================================================================

if _RUN_TESTS:
    print()
    print_section_header("Test 4: StructureParser - Operator Detection")
    
    np.random.seed(42)
    n_samples = 100
    
    x = np.random.uniform(0.1, 2, n_samples)
    y_var = np.random.uniform(0.1, 2, n_samples)
    
    X = np.column_stack([x, y_var])
    feature_names = ['x', 'y']
    
    # Test various operators
    test_equations = [
        ('sin(x) + cos(y)', {'sin', 'cos'}),
        ('exp(x) + log(y)', {'exp', 'log'}),
        ('sqrt(x) + x**2', {'sqrt'}),
        ('x + y', set())
    ]
    
    parser = StructureParser()
    
    print(f"{'Equation':<30} {'Expected':<20} {'Detected':<20} {'Status'}")
    print("-" * 80)
    
    all_pass = True
    for eq, expected in test_equations:
        _, detected, _ = parser.parse_pareto_equations([eq], feature_names, X)
        status = "PASS" if detected == expected else "FAIL"
        if status == "FAIL":
            all_pass = False
        print(f"{eq:<30} {str(expected):<20} {str(detected):<20} [{status}]")
    
    print()
    if all_pass:
        print("[PASS] All operator detection tests passed")
    else:
        print("[WARNING] Some operator detection tests failed")
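
The expected sets in this test treat `sqrt` as a named operator and `x**2` as contributing none. One caveat when detecting operators on a parsed SymPy expression: SymPy represents `sqrt(x)` internally as `x**(1/2)`, so a search for function atoms alone would not report `sqrt`. The sketch below sidesteps this by inspecting the equation string with Python's `ast` module. Both `detect_operators` and `KNOWN_OPERATORS` are hypothetical and are not the `StructureParser` implementation.

```python
import ast

# Hypothetical operator vocabulary; the set used by StructureParser may differ.
KNOWN_OPERATORS = {"sin", "cos", "tan", "exp", "log", "sqrt", "tanh"}


def detect_operators(equation):
    """Return the named unary operators called in an equation string."""
    tree = ast.parse(equation, mode="eval")
    return {
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id in KNOWN_OPERATORS
    }


cases = {
    "sin(x) + cos(y)": {"sin", "cos"},
    "exp(x) + log(y)": {"exp", "log"},
    "sqrt(x) + x**2": {"sqrt"},
    "x + y": set(),
}
results = {eq: detect_operators(eq) for eq in cases}
```

String-level detection only works when the equations use Python-compatible call syntax, which holds for the test equations here.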

In [None]:
# ==============================================================================
# TEST 5: Evaluation Function Correctness
# ==============================================================================

if _RUN_TESTS:
    print()
    print_section_header("Test 5: Evaluation Function Correctness")
    
    np.random.seed(42)
    n_samples = 50
    
    x = np.random.uniform(0.5, 2, n_samples)
    y_var = np.random.uniform(0.5, 2, n_samples)
    
    X = np.column_stack([x, y_var])
    feature_names = ['x', 'y']
    
    equation = 'x**2 + sin(y)'
    expected = x**2 + np.sin(y_var)
    
    parser = StructureParser()
    terms, _, _ = parser.parse_pareto_equations([equation], feature_names, X)
    
    print(f"Equation: {equation}")
    print(f"Parsed {len(terms)} terms")
    
    # Evaluate each term
    result = np.zeros(n_samples)
    for expr, name, func in terms:
        values = func(x, y_var)
        print(f"  {name}: mean = {np.mean(values):.4f}")
        result += values
    
    # Check correctness
    error = np.mean(np.abs(result - expected))
    print()
    print(f"Mean absolute error: {error:.6f}")
    
    if error < 1e-6:
        print("[PASS] Evaluation functions are correct")
    else:
        print("[WARNING] Evaluation functions may have issues")

---
## Section 5: Module Summary

In [None]:
# ==============================================================================
# MODULE SUMMARY
# ==============================================================================

print("=" * 70)
print(" 06_PySR.ipynb v4.1 - Module Summary")
print("=" * 70)
print()
print("CLASSES:")
print("-" * 70)
print()
print("1. PySRDiscoverer (v4.1 Enhanced)")
print("   Purpose: Symbolic regression via genetic programming")
print("   v4.1 Features:")
print("     - Mode-based configuration: 'fast', 'standard', 'thorough'")
print("     - Hard timeout with elapsed_time tracking")
print("     - Float32 precision and turbo mode")
print("   Main Methods:")
print("     discover(X, y, feature_names) -> Dict with elapsed_time, best_r2")
print("     get_pareto_equations() -> List[str]")
print("     predict(X) -> np.ndarray")
print("     print_discovery_report()")
print()
print("2. StructureParser (v4.1 Enhanced)")
print("   Purpose: Parse equations for augmented library construction")
print("   v4.1 Features:")
print("     - Multi-equation Pareto parsing")
print("     - Automatic operator detection for Layer 4")
print("     - Returns (unique_terms, detected_operators, term_map)")
print("   Main Methods:")
print("     parse_pareto_equations(equations, feature_names, X) -> Tuple")
print("     parse_single_equation(equation_str, feature_names, X) -> List")
print()
print(f"PySR Status: {'Available' if PYSR_AVAILABLE else 'Not available (fallback mode)'}")
print()
print("Usage Example:")
print("-" * 70)
print("""
# Discover equations with PySR
discoverer = PySRDiscoverer(mode='standard')
result = discoverer.discover(X, y, feature_names)
print(f"Best equation: {result['best_equation']}")
print(f"Elapsed: {result['elapsed_time']:.1f}s")
if result['best_r2'] is not None:  # best_r2 may be None
    print(f"R-squared: {result['best_r2']:.4f}")

# Parse for augmented library construction
parser = StructureParser()
unique_terms, detected_operators, term_map = parser.parse_pareto_equations(
    result['all_equations'], feature_names, X
)
print(f"Detected operators: {detected_operators}")

# Use with AugmentedLibraryBuilder
builder = AugmentedLibraryBuilder()
Phi, names, info = builder.build(
    X, feature_names,
    parsed_terms=unique_terms,
    detected_operators=detected_operators,
    pysr_r2=result['best_r2']
)
""")
print()
print("=" * 70)
print("Module loaded successfully. Import via: %run 06_PySR.ipynb")
print("=" * 70)