# 08_AdaptiveLasso - Physics-SR Framework v4.1

## Stage 2.5: Adaptive Lasso with Oracle Property

**Author:** Zhengze Zhang  
**Affiliation:** Department of Statistics, Columbia University  
**Contact:** zz3239@columbia.edu  
**Date:** January 2026  
**Version:** 4.1 (Structure-Guided Feature Library Enhancement + Computational Optimization)

---

### Purpose

Select the active terms of the feature library with Adaptive Lasso, an estimator that attains the oracle property (Zou, 2006).
This module changes only slightly in v4.1.

### v4.1 Modifications

| Feature | v3.0 | v4.1 |
|---------|------|------|
| Source tracking | None | selection_analysis in output |
| Agreement metric | None | compute_agreement() method |
| Parameter names | feature_names | library_names (feature_names still accepted as an alias) |
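
The source tracking in the table can be illustrated with a standalone sketch. It assumes the tag prefixes used elsewhere in this notebook (`[PySR]`, `[Var]`, `[Poly]`, `[Op]`); `count_sources` and `TAG_TO_KEY` are hypothetical names for illustration, not part of the module, and the example term names are made up.

```python
# Tag prefixes as they appear in the library names of this notebook
TAG_TO_KEY = {
    "[PySR]": "from_pysr",
    "[Var]": "from_variant",
    "[Poly]": "from_poly",
    "[Op]": "from_op",
}

def count_sources(selected_names):
    """Count selected terms by the source tag at the start of each name."""
    counts = {key: 0 for key in TAG_TO_KEY.values()}
    counts["from_unknown"] = 0
    for name in selected_names:
        # First matching prefix wins; untagged names count as unknown
        key = next(
            (k for tag, k in TAG_TO_KEY.items() if name.startswith(tag)),
            "from_unknown",
        )
        counts[key] += 1
    counts["total_selected"] = len(selected_names)
    return counts

analysis = count_sources(["[PySR] x**2", "[PySR] sin(z)", "[Poly] 1", "x*z"])
print(analysis)
```

The untagged name `x*z` lands in `from_unknown`, which is also what happens to names passed through the legacy `feature_names` argument without tags.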

### Oracle Property

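An estimator with the oracle property performs asymptotically as well as if the true support $\mathcal{A} = \{j : \xi_j^* \neq 0\}$ were known in advance:
1. **Selection consistency:** $P(\hat{\mathcal{A}} = \mathcal{A}) \to 1$ as $n \to \infty$, where $\hat{\mathcal{A}}$ is the estimated support
2. **Asymptotic normality:** $\sqrt{n}(\hat{\xi}_{\mathcal{A}} - \xi^*_{\mathcal{A}}) \xrightarrow{d} N(0, V)$, restricted to the truly active coefficients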
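
For reference, the estimator these guarantees refer to can be written as follows. This is a sketch following Zou (2006), where $\Phi$ is the feature library, $\lambda_n$ is the regularization parameter, and $w_j$ are the adaptive weights defined under Key Innovation:

$$\hat{\xi} = \arg\min_{\xi} \; \|y - \Phi \xi\|_2^2 + \lambda_n \sum_{j} w_j |\xi_j|$$

In Zou's Theorem 2, both properties hold for a fixed number of features when the initial estimate is $\sqrt{n}$-consistent, $\lambda_n / \sqrt{n} \to 0$, and $\lambda_n n^{(\gamma - 1)/2} \to \infty$. A cross-validated $\lambda$, as used in this module, is not guaranteed to follow these rates, so the property should be read as asymptotic motivation rather than a finite-sample guarantee.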

### Key Innovation

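Adaptive Lasso penalizes each coefficient with a **data-driven weight** computed from an initial estimate $\hat{\beta}^{init}$ (Ridge or OLS in this module):
$$w_j = \frac{1}{(|\hat{\beta}_j^{init}| + \varepsilon)^\gamma}$$
A small initial coefficient yields a large weight, so that term is shrunk to zero first; $\varepsilon > 0$ keeps the weight finite when $\hat{\beta}_j^{init} = 0$.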
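
A small numeric sketch of the weight formula may help. `adaptive_weights` is a hypothetical standalone helper (not the class method), and the initial coefficients are made-up values chosen to show a large, a small, and a zero estimate.

```python
import numpy as np

def adaptive_weights(beta_init, gamma=1.0, eps=1e-6):
    """Return w_j = 1 / (|beta_init_j| + eps)**gamma for each coefficient."""
    beta_init = np.asarray(beta_init, dtype=float)
    return 1.0 / (np.abs(beta_init) + eps) ** gamma

# Made-up initial estimates: strong term, weak term, exactly-zero term
beta_init = np.array([2.0, 0.5, 0.0])
weights = adaptive_weights(beta_init, gamma=1.0, eps=1e-6)

# The zero estimate gets a weight of about 1 / eps instead of infinity
print(weights)
```

With `gamma=1.0` the weights are roughly 0.5, 2, and 1e6: the stronger the initial evidence for a term, the lighter its penalty.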

### Reference

- Zou, H. (2006). The adaptive lasso and its oracle properties. *JASA*, 101(476), 1418-1429.
- Framework v4.0/v4.1 Section 4.5

---
## Section 1: Header and Imports

In [None]:
"""
08_AdaptiveLasso.ipynb - Adaptive Lasso with Oracle Property
=============================================================

Three-Stage Physics-Informed Symbolic Regression Framework v4.1

This module provides:
- AdaptiveLassoSelector: Adaptive Lasso with data-driven weights
- Oracle property for variable selection consistency
- Epsilon-stabilization to prevent weight explosion
- Cross-validation for lambda selection

v4.1 Key Changes from v3.0:
- selection_analysis in output (source attribution)
- compute_agreement() method for STLSQ comparison
- library_names parameter (compatible with feature_names)

Author: Zhengze Zhang
Affiliation: Department of Statistics, Columbia University
Contact: zz3239@columbia.edu
"""

# Import core module
%run 00_Core.ipynb

In [None]:
# Additional imports for Adaptive Lasso
from sklearn.linear_model import Ridge, LassoCV, Lasso
from sklearn.preprocessing import StandardScaler
from typing import Dict, List, Tuple, Optional, Any

print("08_AdaptiveLasso v4.1: Additional imports successful.")

---
## Section 2: Class Definition

In [None]:
# ==============================================================================
# ADAPTIVE LASSO SELECTOR CLASS (v4.1 Minor Update)
# ==============================================================================

class AdaptiveLassoSelector:
    """
    Adaptive Lasso with Oracle Property (v4.1 Minor Update).
    
    Achieves selection consistency and asymptotic normality.
    Uses epsilon-stabilization to handle zero initial estimates.
    
    v4.1 Enhancements:
    - selection_analysis in output (source attribution)
    - compute_agreement() method for STLSQ comparison
    - Accepts library_names with source tags
    
    Attributes
    ----------
    gamma : float
        Weight exponent (default: 1.0). Higher gamma = stronger penalty
        on small initial coefficients.
    eps : float
        Stabilization constant to prevent weight explosion (default: 1e-6)
    cv_folds : int
        Number of folds for cross-validation (default: 5)
    initial_method : str
        Method for initial estimate: 'ridge' or 'ols' (default: 'ridge')
    ridge_alpha : float
        Regularization for initial Ridge estimate (default: 0.1)
    
    Methods
    -------
    fit(feature_library, y, library_names) -> Dict
        Fit Adaptive Lasso model
    compute_agreement(support_stlsq, support_alasso) -> float
        Compute Jaccard similarity between supports (v4.1)
    
    Reference
    ---------
    Zou, H. (2006). JASA, 101(476), 1418-1429.
    
    Examples
    --------
    >>> selector = AdaptiveLassoSelector(gamma=1.0)
    >>> result = selector.fit(Phi, y, library_names=names)
    >>> print(f"Selection analysis: {result['selection_analysis']}")
    """
    
    def __init__(
        self,
        gamma: float = DEFAULT_ALASSO_GAMMA,
        eps: float = DEFAULT_ALASSO_EPS,
        cv_folds: int = DEFAULT_CV_FOLDS,
        initial_method: str = 'ridge',
        ridge_alpha: float = 0.1
    ):
        """
        Initialize AdaptiveLassoSelector.
        
        Parameters
        ----------
        gamma : float
            Exponent for adaptive weights. gamma=1 is standard,
            gamma=2 provides stronger adaptation.
            Default: 1.0
        eps : float
            Stabilization constant to prevent division by zero.
            Default: 1e-6
        cv_folds : int
            Number of cross-validation folds for lambda selection.
            Default: 5
        initial_method : str
            'ridge' or 'ols' for initial estimate.
            Default: 'ridge' (more stable)
        ridge_alpha : float
            Ridge regularization parameter for initial estimate.
            Default: 0.1
        """
        self.gamma = gamma
        self.eps = eps
        self.cv_folds = cv_folds
        self.initial_method = initial_method
        self.ridge_alpha = ridge_alpha
        
        # Internal state
        self._coefficients = None
        self._support = None
        self._library_names = None
        self._n_features = None
        self._initial_estimate = None
        self._adaptive_weights = None
        self._optimal_lambda = None
        self._fit_complete = False
        self._r2_score = None
        self._mse = None
        self._selection_analysis = None
    
    def fit(
        self,
        feature_library: np.ndarray,
        y: np.ndarray,
        library_names: Optional[List[str]] = None,
        feature_names: Optional[List[str]] = None  # Backward compatibility alias
    ) -> Dict[str, Any]:
        """
        Fit Adaptive Lasso model.
        
        Parameters
        ----------
        feature_library : np.ndarray
            Feature library matrix
        y : np.ndarray
            Target vector
        library_names : List[str], optional
            Feature names with source tags for attribution
        feature_names : List[str], optional
            Backward compatibility alias for library_names
            
        Returns
        -------
        Dict
            - coefficients: Adaptive Lasso coefficients
            - support: Boolean mask of selected terms
            - equation: Formatted equation string
            - lambda_optimal: Selected regularization parameter
            - initial_coef: Initial estimate used for weights
            - selection_analysis: Source attribution (v4.1)
            - r2_score: R-squared on training data
            - mse: Mean squared error
        """
        n_samples, n_features = feature_library.shape
        self._n_features = n_features
        
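        # Handle backward compatibility: library_names takes precedence.
        # The explicit None check avoids the ambiguous truth value of NumPy arrays.
        names = library_names if library_names is not None else feature_names
        if names is None:
            self._library_names = [f'f{i}' for i in range(n_features)]
        else:
            self._library_names = list(names)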
        
        # Step 1: Compute initial estimate
        self._initial_estimate = self._compute_initial_estimate(
            feature_library, y
        )
        
        # Step 2: Compute adaptive weights
        self._adaptive_weights = self._compute_adaptive_weights(
            self._initial_estimate
        )
        
        # Step 3: Fit weighted Lasso
        self._coefficients, self._optimal_lambda = self._fit_weighted_lasso(
            feature_library, y, self._adaptive_weights
        )
        
        # Determine support
        self._support = np.abs(self._coefficients) > 1e-10
        
        # Compute metrics
        y_pred = feature_library @ self._coefficients
        self._mse = np.mean((y - y_pred)**2)
        ss_tot = np.sum((y - np.mean(y))**2)
        ss_res = np.sum((y - y_pred)**2)
        self._r2_score = 1 - ss_res / ss_tot if ss_tot > 0 else 0.0
        
        # Analyze selection sources (v4.1)
        self._selection_analysis = self._analyze_selection_sources(
            self._support, self._library_names
        )
        
        self._fit_complete = True
        
        return {
            'coefficients': self._coefficients,
            'support': self._support,
            'equation': self.get_equation(),
            'n_active_terms': int(np.sum(self._support)),
            'lambda_optimal': self._optimal_lambda,
            'optimal_lambda': self._optimal_lambda,  # Alias
            'initial_coef': self._initial_estimate,
            'initial_estimate': self._initial_estimate,  # Alias
            'adaptive_weights': self._adaptive_weights,
            'selection_analysis': self._selection_analysis,  # v4.1
            'r2_score': self._r2_score,
            'r_squared': self._r2_score,  # Alias
            'mse': self._mse,
            'gamma': self.gamma,
            'eps': self.eps
        }
    
    def _analyze_selection_sources(
        self,
        support: np.ndarray,
        library_names: List[str]
    ) -> Dict[str, int]:
        """
        Analyze where selected terms originated (v4.1).
        
        Parameters
        ----------
        support : np.ndarray
            Boolean mask of selected terms
        library_names : List[str]
            Feature names with source tags
            
        Returns
        -------
        Dict[str, int]
            Source attribution counts
        """
        sources = {
            'from_pysr': 0,
            'from_variant': 0,
            'from_poly': 0,
            'from_op': 0,
            'from_unknown': 0,
            'total_selected': 0
        }
        
        selected_indices = np.where(support)[0]
        sources['total_selected'] = len(selected_indices)
        
        for idx in selected_indices:
            if idx >= len(library_names):
                sources['from_unknown'] += 1
                continue
                
            name = library_names[idx]
            if name.startswith('[PySR]'):
                sources['from_pysr'] += 1
            elif name.startswith('[Var]'):
                sources['from_variant'] += 1
            elif name.startswith('[Poly]'):
                sources['from_poly'] += 1
            elif name.startswith('[Op]'):
                sources['from_op'] += 1
            else:
                sources['from_unknown'] += 1
        
        return sources
    
    def compute_agreement(
        self,
        support_stlsq: np.ndarray,
        support_alasso: np.ndarray = None
    ) -> float:
        """
        Compute agreement score between STLSQ and Adaptive Lasso (v4.1).
        
        Uses Jaccard similarity: |intersection| / |union|
        
        Parameters
        ----------
        support_stlsq : np.ndarray
            Boolean support mask from E-WSINDy STLSQ
        support_alasso : np.ndarray, optional
            Boolean support mask from Adaptive Lasso.
            If None, uses self._support.
            
        Returns
        -------
        float
            Jaccard similarity in [0, 1]. 1.0 = perfect agreement.
        """
        if support_alasso is None:
            if self._support is None:
                raise RuntimeError("Must call fit() first or provide support_alasso")
            support_alasso = self._support
        
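        # Coerce to boolean arrays so that & and | act as logical operators
        s1 = np.asarray(support_stlsq, dtype=bool)
        s2 = np.asarray(support_alasso, dtype=bool)
        
        # If the lengths differ, compare only the overlapping prefix
        min_len = min(len(s1), len(s2))
        s1, s2 = s1[:min_len], s2[:min_len]
        
        intersection = np.sum(s1 & s2)
        union = np.sum(s1 | s2)
        
        # Two empty supports are treated as perfect agreement
        return float(intersection / union) if union > 0 else 1.0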
    
    def _compute_initial_estimate(
        self,
        Phi: np.ndarray,
        y: np.ndarray
    ) -> np.ndarray:
        """
        Compute initial coefficient estimate.
        
        Uses Ridge regression by default for stability. With
        initial_method='ols', uses least squares and falls back to
        Ridge if the solver fails.
        """
        if self.initial_method == 'ridge':
            ridge = Ridge(alpha=self.ridge_alpha, fit_intercept=False)
            ridge.fit(Phi, y)
            return ridge.coef_
        else:  # OLS
            try:
                beta, _, _, _ = np.linalg.lstsq(Phi, y, rcond=None)
                return beta
            except np.linalg.LinAlgError:
                # Fall back to Ridge with the configured regularization
                ridge = Ridge(alpha=self.ridge_alpha, fit_intercept=False)
                ridge.fit(Phi, y)
                return ridge.coef_
    
    def _compute_adaptive_weights(
        self,
        beta_init: np.ndarray
    ) -> np.ndarray:
        """
        Compute adaptive weights with epsilon stabilization.
        
        w_j = 1 / (|beta_init[j]| + eps)^gamma
        """
        return 1.0 / (np.abs(beta_init) + self.eps) ** self.gamma
    
    def _fit_weighted_lasso(
        self,
        Phi: np.ndarray,
        y: np.ndarray,
        weights: np.ndarray
    ) -> Tuple[np.ndarray, float]:
        """
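        Fit weighted Lasso via variable transformation.
        
        Substituting b_j = w_j * beta_j turns the weighted penalty
        sum_j w_j * |beta_j| into a plain L1 penalty on b, with each
        column of Phi divided by its weight w_j.
        """
        # Divide each column by w_j itself. Dividing by sqrt(w_j) would
        # penalize with sqrt(w_j), i.e. an effective exponent of gamma / 2.
        Phi_weighted = Phi / weights
        
        lasso_cv = LassoCV(
            cv=self.cv_folds,
            fit_intercept=False,
            max_iter=10000,
            tol=1e-6
        )
        lasso_cv.fit(Phi_weighted, y)
        
        # Map back to the original scale: beta_j = b_j / w_j
        beta_weighted = lasso_cv.coef_
        beta = beta_weighted / weights
        
        return beta, lasso_cv.alpha_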
    
    def get_equation(self) -> str:
        """
        Get string representation of discovered equation.
        
        Terms with |coef| <= 0.001 are omitted from the string even if
        they are in the support.
        """
        if self._coefficients is None:
            return ""
        
        terms = []
        for i, (coef, active) in enumerate(zip(self._coefficients, self._support)):
            if active:
                name = self._library_names[i]
                if abs(coef) > 0.001:
                    terms.append(f"{coef:.3f} * {name}")
        
        if len(terms) == 0:
            return "0"
        
        return " + ".join(terms)
    
    def get_active_terms(self) -> List[Tuple[str, float]]:
        """
        Get list of active terms with coefficients.
        """
        if not self._fit_complete:
            raise RuntimeError("Must call fit() first")
        
        active = []
        for i, (coef, name) in enumerate(zip(self._coefficients, self._library_names)):
            if self._support[i]:
                active.append((name, float(coef)))
        
        return active
    
    def predict(self, Phi_new: np.ndarray) -> np.ndarray:
        """
        Predict using fitted model.
        """
        if self._coefficients is None:
            raise RuntimeError("Must call fit() before predict()")
        return Phi_new @ self._coefficients
    
    def print_alasso_report(self) -> None:
        """
        Print detailed Adaptive Lasso report in v4.1 format.
        """
        if not self._fit_complete:
            print("Fit not yet performed. Call fit() first.")
            return
        
        print("=" * 70)
        print("=== Adaptive Lasso Results (v4.1) ===")
        print("=" * 70)
        print()
        print("Configuration:")
        print(f"  Gamma: {self.gamma}")
        print(f"  Epsilon: {self.eps}")
        print(f"  CV folds: {self.cv_folds}")
        print(f"  Initial method: {self.initial_method}")
        print(f"  Optimal lambda: {self._optimal_lambda:.6f}")
        print()
        print(f"Selected terms: {np.sum(self._support)}")
        print()
        
        # Print active terms with source tags
        for name, coef in self.get_active_terms():
            print(f"  {coef:8.3f} * {name}")
        print()
        
        # Print selection analysis (v4.1)
        print("Selection Analysis:")
        print(f"  from_pysr: {self._selection_analysis['from_pysr']}")
        print(f"  from_variant: {self._selection_analysis['from_variant']}")
        print(f"  from_poly: {self._selection_analysis['from_poly']}")
        print(f"  from_op: {self._selection_analysis['from_op']}")
        print(f"  from_unknown: {self._selection_analysis['from_unknown']}")
        print()
        
        print(f"R-squared: {self._r2_score:.4f}")
        print(f"MSE: {self._mse:.6f}")
        print()
        print("=" * 70)

print("AdaptiveLassoSelector class v4.1 defined.")
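
The variable transformation behind a weighted Lasso can be checked numerically. The sketch below is illustrative only: `weighted_lasso_cd` is a hypothetical minimal coordinate-descent solver (the class itself uses `LassoCV`), the objective scaling `1/(2n)` is assumed to match scikit-learn's convention, and the data and weights are made up. It solves the weighted problem directly, then solves a plain Lasso on columns divided by `w_j` and maps the result back with `b_j / w_j`.

```python
import numpy as np

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def weighted_lasso_cd(X, y, alpha, weights, n_iter=200):
    """Minimize (1/(2n)) * ||y - X b||^2 + alpha * sum_j weights[j] * |b_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    residual = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(p):
            residual += X[:, j] * beta[j]      # remove feature j from the fit
            rho = X[:, j] @ residual / n
            beta[j] = soft_threshold(rho, alpha * weights[j]) / col_sq[j]
            residual -= X[:, j] * beta[j]      # put the updated feature back
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.standard_normal(100)
weights = np.array([0.5, 1.0, 20.0, 50.0])    # made-up adaptive weights
alpha = 0.05

# Direct solve of the weighted problem
direct = weighted_lasso_cd(X, y, alpha, weights)

# Plain Lasso on rescaled columns, then map back to the original scale
plain = weighted_lasso_cd(X / weights, y, alpha, np.ones(4))
rescaled = plain / weights

assert np.allclose(direct, rescaled)
print(direct)
```

The two solutions coincide because the substitution `b_j = w_j * beta_j` leaves the residual unchanged while turning `w_j * |beta_j|` into `|b_j|`. The heavily weighted columns (weights 20 and 50) are driven exactly to zero.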

---
## Section 3: Internal Tests

In [None]:
# ==============================================================================
# TEST CONTROL FLAG
# ==============================================================================

_RUN_TESTS = False  # Set to True to run internal tests

if _RUN_TESTS:
    print("=" * 70)
    print(" RUNNING INTERNAL TESTS FOR 08_AdaptiveLasso v4.1")
    print("=" * 70)

In [None]:
# ==============================================================================
# TEST 1: Basic Adaptive Lasso with Source Attribution
# ==============================================================================

if _RUN_TESTS:
    print()
    print_section_header("Test 1: Basic Adaptive Lasso with Source Attribution")
    
    np.random.seed(42)
    n_samples = 200
    
    x = np.random.uniform(0.1, 2, n_samples)
    z = np.random.uniform(0.1, 2, n_samples)
    y = 0.5*x**2 + np.sin(z) + 0.01*np.random.randn(n_samples)
    
    # Simulated augmented library with source tags
    Phi = np.column_stack([
        x**2,
        np.sin(z),
        np.ones(n_samples),
        x,
        z,
        np.cos(x)
    ])
    
    library_names = [
        '[PySR] x**2',
        '[PySR] sin(z)',
        '[Poly] 1',
        '[Poly] x',
        '[Poly] z',
        '[Op] cos(x)'
    ]
    
    selector = AdaptiveLassoSelector(gamma=1.0)
    result = selector.fit(Phi, y, library_names=library_names)
    
    print("True: y = 0.5*x^2 + sin(z)")
    print()
    print("Selected terms:")
    for name, coef in selector.get_active_terms():
        print(f"  {coef:8.3f} * {name}")
    print()
    print(f"Selection Analysis: {result['selection_analysis']}")
    print(f"R-squared: {result['r2_score']:.4f}")

In [None]:
# ==============================================================================
# TEST 2: compute_agreement Method
# ==============================================================================

if _RUN_TESTS:
    print()
    print_section_header("Test 2: compute_agreement Method")
    
    # Test case 1: Perfect agreement
    support_stlsq = np.array([True, True, False, False])
    support_alasso = np.array([True, True, False, False])
    
    selector = AdaptiveLassoSelector()
    agreement = selector.compute_agreement(support_stlsq, support_alasso)
    print(f"Case 1 - Perfect match: {agreement:.4f}")
    
    # Test case 2: Partial agreement
    support_stlsq = np.array([True, True, False, False])
    support_alasso = np.array([True, False, True, False])
    
    agreement = selector.compute_agreement(support_stlsq, support_alasso)
    print(f"Case 2 - Partial overlap: {agreement:.4f}")
    
    # Test case 3: No agreement
    support_stlsq = np.array([True, True, False, False])
    support_alasso = np.array([False, False, True, True])
    
    agreement = selector.compute_agreement(support_stlsq, support_alasso)
    print(f"Case 3 - No overlap: {agreement:.4f}")
    
    # Expected output: 1.0000, 0.3333, 0.0000
    print()
    print("[INFO] Jaccard similarity: |intersection| / |union|")

In [None]:
# ==============================================================================
# TEST 3: Oracle Property Verification
# ==============================================================================

if _RUN_TESTS:
    print()
    print_section_header("Test 3: Oracle Property Verification")
    
    np.random.seed(42)
    n_samples = 300
    
    # Sparse ground truth: 3 of 20 features active
    n_features = 20
    true_support = np.zeros(n_features, dtype=bool)
    true_support[:3] = True
    true_coefs = np.zeros(n_features)
    true_coefs[:3] = [2.0, 1.5, -1.0]
    
    # Generate data
    X = np.random.randn(n_samples, n_features)
    y = X @ true_coefs + 0.1*np.random.randn(n_samples)
    
    library_names = [f'[Poly] f{i}' for i in range(n_features)]
    
    selector = AdaptiveLassoSelector(gamma=1.0)
    result = selector.fit(X, y, library_names=library_names)
    
    recovered_support = result['support']
    
    # Compute support recovery accuracy
    correct = np.sum(recovered_support == true_support)
    accuracy = correct / n_features * 100
    
    print(f"True active features: {np.where(true_support)[0]}")
    print(f"Recovered active: {np.where(recovered_support)[0]}")
    print(f"Support recovery accuracy: {accuracy:.1f}%")
    print()
    
    if accuracy > 90:
        print("[PASS] Support recovery > 90% (finite-sample check of selection consistency)")
    else:
        print(f"[INFO] Support recovery: {accuracy:.1f}%")

In [None]:
# ==============================================================================
# TEST 4: Agreement with Simulated STLSQ
# ==============================================================================

if _RUN_TESTS:
    print()
    print_section_header("Test 4: Agreement with Simulated STLSQ")
    
    np.random.seed(42)
    n_samples = 200
    
    x = np.random.uniform(0.1, 2, n_samples)
    z = np.random.uniform(0.1, 2, n_samples)
    y = 2*x + 0.5*x**2 + 0.01*np.random.randn(n_samples)
    
    Phi = np.column_stack([np.ones(n_samples), x, x**2, x**3, z, z**2])
    library_names = ['[Poly] 1', '[Poly] x', '[Poly] x^2', '[Poly] x^3', '[Poly] z', '[Poly] z^2']
    
    # Fit Adaptive Lasso
    selector = AdaptiveLassoSelector(gamma=1.0)
    result = selector.fit(Phi, y, library_names=library_names)
    
    # Stand-in for an STLSQ result: the true support {x, x^2}
    support_stlsq = np.array([False, True, True, False, False, False])
    
    agreement = selector.compute_agreement(support_stlsq)
    
    print(f"STLSQ support: {np.where(support_stlsq)[0]}")
    print(f"ALasso support: {np.where(result['support'])[0]}")
    print(f"Agreement score: {agreement:.4f}")
    print()
    
    if agreement > 0.8:
        print("[PASS] Agreement with STLSQ > 0.8")
    else:
        print(f"[INFO] Agreement: {agreement:.4f}")

In [None]:
# ==============================================================================
# TEST 5: Full Report Output
# ==============================================================================

if _RUN_TESTS:
    print()
    print_section_header("Test 5: Full Report Output")
    
    np.random.seed(42)
    n_samples = 200
    
    x = np.random.uniform(0.1, 2, n_samples)
    z = np.random.uniform(0.1, 2, n_samples)
    y = 0.5*x**2 + np.sin(z) + 0.01*np.random.randn(n_samples)
    
    Phi = np.column_stack([x**2, np.sin(z), np.ones(n_samples), x, z])
    library_names = ['[PySR] x**2', '[PySR] sin(z)', '[Poly] 1', '[Poly] x', '[Poly] z']
    
    selector = AdaptiveLassoSelector(gamma=1.0)
    result = selector.fit(Phi, y, library_names=library_names)
    
    # Print full report
    selector.print_alasso_report()

---
## Section 4: Module Summary

In [None]:
# ==============================================================================
# MODULE SUMMARY
# ==============================================================================

print("=" * 70)
print(" 08_AdaptiveLasso.ipynb v4.1 - Module Summary")
print("=" * 70)
print()
print("CLASS: AdaptiveLassoSelector (v4.1 Minor Update)")
print("-" * 70)
print()
print("Purpose:")
print("  Adaptive Lasso with oracle property for variable selection.")
print("  Achieves selection consistency and asymptotic normality.")
print()
print("v4.1 Modifications:")
print("  - selection_analysis in output (source attribution)")
print("  - compute_agreement() method for STLSQ comparison")
print("  - library_names parameter (backward compatible)")
print()
print("Main Methods:")
print("  fit(feature_library, y, library_names=None) -> Dict")
print("      Returns: coefficients, support, selection_analysis, r2_score, ...")
print()
print("  compute_agreement(support_stlsq, support_alasso=None) -> float")
print("      Returns: Jaccard similarity in [0, 1]")
print()
print("  get_equation() -> str")
print("      Get string representation with source tags")
print()
print("  print_alasso_report()")
print("      Print detailed results with selection analysis")
print()
print("Key Parameters:")
print("  gamma: Weight exponent (1.0 = standard, 2.0 = stronger)")
print("  eps: Stabilization constant (default: 1e-6)")
print()
print("Usage Example (with augmented library):")
print("-" * 70)
print("""
# Build augmented library (from 05_FeatureLibrary)
builder = AugmentedLibraryBuilder(max_poly_degree=3)
Phi, names, info = builder.build(X, feature_names, ...)

# Fit Adaptive Lasso with source attribution
selector = AdaptiveLassoSelector(gamma=1.0)
result = selector.fit(Phi, y, library_names=names)

# Check source attribution
print(f"Selection Analysis: {result['selection_analysis']}")

# Compare with STLSQ
agreement = selector.compute_agreement(stlsq_result['support'])
print(f"Agreement with STLSQ: {agreement:.4f}")
""")
print()
print("=" * 70)
print("Module loaded successfully. Import via: %run 08_AdaptiveLasso.ipynb")
print("=" * 70)