# 02_VariableScreening - Physics-SR Framework v3.0

## Stage 1.2: PAN+SR Nonlinear Variable Screening

**Author:** Zhengze Zhang  
**Affiliation:** Department of Statistics, Columbia University  
**Date:** January 2026

---

### Purpose

Select important variables using a nonlinear importance measure, addressing a limitation of LARS, which ranks features by linear correlation.

### Problem with LARS

At each step, LARS adds the feature most linearly correlated with the current residual $r$:
$$\hat{j} = \arg\max_j |\text{corr}(X_j, r)|$$

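This fails for nonlinear dependencies. For example, if $y = \cos(x)$ with $x$ uniform on the symmetric range $[-\pi, \pi]$, then $\text{corr}(x, y) = 0$ even though $y$ is a deterministic function of $x$. The target must be an even function for this to hold: an odd function such as $\sin(x)$ still has $\text{corr}(x, \sin x) = \sqrt{6}/\pi \approx 0.78$ on the same range.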
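
A minimal numerical check of this failure mode, using two even functions ($\cos x$ and $x^2$) as illustrative targets on a deterministic symmetric grid. The grid and the targets are assumptions chosen for the sketch, not data from this framework.

```python
import numpy as np

# Symmetric, evenly spaced grid on [-pi, pi] (illustrative choice)
x = np.linspace(-np.pi, np.pi, 2001)

# Both targets are deterministic functions of x, yet even in x
y_cos = np.cos(x)
y_sq = x ** 2

# Pearson correlation, the quantity a correlation-based screen would rank by
r_cos = np.corrcoef(x, y_cos)[0, 1]
r_sq = np.corrcoef(x, y_sq)[0, 1]

print(f"corr(x, cos x) = {r_cos:.2e}")
print(f"corr(x, x^2)   = {r_sq:.2e}")
```

Both correlations vanish up to floating-point error, so a screen based only on linear correlation would rank `x` as uninformative for either target.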

### Solution: Random Forest Permutation Importance

Permutation importance measures how much the prediction error increases when a feature is randomly shuffled:

$$I_j = \frac{\text{MSE}_{\text{permuted}_j} - \text{MSE}_{\text{baseline}}}{\text{MSE}_{\text{baseline}}}$$

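Because a Random Forest is a flexible nonparametric model, this score reflects nonlinear dependencies and interactions rather than linear trends only. The score is only as reliable as the forest's fit. Once the scores are normalized to sum to 1, the common factor $1/\text{MSE}_{\text{baseline}}$ cancels, so the absolute and relative MSE increases give the same ranking and the same normalized values.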
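
The formula for $I_j$ can be sketched without any library support. The helper `relative_permutation_importance` and the stand-in `predict` function below are hypothetical names introduced for illustration; in the notebook the model would be a fitted Random Forest rather than the known true function.

```python
import numpy as np

def relative_permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Return I_j = (MSE_permuted_j - MSE_baseline) / MSE_baseline per column."""
    rng = np.random.default_rng(seed)
    mse_baseline = np.mean((y - predict(X)) ** 2)
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its link to y but keeps its marginal distribution
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            mse_perm = np.mean((y - predict(X_perm)) ** 2)
            increases.append((mse_perm - mse_baseline) / mse_baseline)
        importance[j] = np.mean(increases)
    return importance

# Toy data: y depends on column 0 only (through cos); column 1 is pure noise
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(-np.pi, np.pi, 500), rng.normal(size=500)])
y = np.cos(X[:, 0]) + 0.1 * rng.normal(size=500)

def predict(X):
    # Stand-in for a fitted model: the true function, so no training is needed
    return np.cos(X[:, 0])

scores = relative_permutation_importance(predict, X, y)
print(scores)
```

Column 0 receives a large score although its linear correlation with `y` is near zero, while column 1 scores exactly zero because the stand-in model never reads it.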

### References

- Breiman, L. (2001). Random Forests. *Machine Learning*, 45(1), 5-32.
- Strobl, C., et al. (2007). Bias in random forest variable importance measures. *BMC Bioinformatics*, 8, 25.

---
## Section 1: Header and Imports

In [None]:
"""
02_VariableScreening.ipynb - PAN+SR Nonlinear Variable Screening
=================================================================

Three-Stage Physics-Informed Symbolic Regression Framework v3.0

This module provides:
- PANSRVariableScreener: Variable selection using Random Forest permutation importance
- Nonlinear importance detection that captures dependencies LARS misses
- Normalized importance scores and threshold-based feature selection

Algorithm:
    1. Fit Random Forest with n_estimators=500, max_features='sqrt'
    2. Compute permutation importance for each feature
    3. Normalize importance scores to sum to 1
    4. Select features with normalized importance > threshold

Author: Zhengze Zhang
Affiliation: Department of Statistics, Columbia University
"""

# Import core module
%run 00_Core.ipynb

In [None]:
# Additional imports for Variable Screening
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from typing import Dict, List, Tuple, Optional, Any

print("02_VariableScreening: Additional imports successful.")

---
## Section 2: Class Definition
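
A condensed sketch of the four screening steps, written as a flat script on toy data. The data, the threshold of 0.05, and the reduced tree count are assumptions made for the example. `max_features` is left at the scikit-learn default here, whereas the class uses `'sqrt'`. The sign convention matters: `permutation_importance` reports `baseline_score - permuted_score`, which under `neg_mean_squared_error` is already the increase in MSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Toy data (illustrative only): y depends on x0 and x1; x2 is noise
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 3))
y = X[:, 0] ** 2 + X[:, 1] + 0.01 * rng.normal(size=400)

# Step 1: fit the forest (fewer trees than the class default, for speed)
rf = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)
rf.fit(X, y)

# Step 2: permutation importance = MSE_permuted - MSE_baseline (positive = important)
result = permutation_importance(
    rf, X, y, n_repeats=5, random_state=0, scoring="neg_mean_squared_error"
)
raw = np.maximum(result.importances_mean, 0.0)  # clip small negative noise

# Step 3: normalize to sum to 1, with a uniform fallback if all scores are ~0
total = raw.sum()
normalized = raw / total if total > 1e-12 else np.full_like(raw, 1.0 / raw.size)

# Step 4: keep features above the threshold (hypothetical value for this toy case)
threshold = 0.05
selected = [j for j, w in enumerate(normalized) if w > threshold]

print("normalized importance:", normalized.round(3))
print("selected columns:", selected)
```

Importance is evaluated on the training data in this sketch, as in the class below. Evaluating on a held-out split is a common alternative when overfitting is a concern.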

In [None]:
# ==============================================================================
# PAN+SR VARIABLE SCREENER CLASS
# ==============================================================================

class PANSRVariableScreener:
    """
    PAN+SR Nonlinear Variable Screening using Random Forest Permutation Importance.
    
    This screener addresses the limitation of LARS (Least Angle Regression) which
    relies on linear correlation and fails for nonlinear dependencies. Random Forest
    permutation importance can detect arbitrary nonlinear relationships.
    
    The algorithm:
    1. Fits a Random Forest regressor to capture nonlinear patterns
    2. Computes permutation importance by measuring MSE increase when each
       feature is randomly shuffled
    3. Normalizes importance scores to create a probability distribution
    4. Selects features exceeding the importance threshold
    
    Attributes
    ----------
    n_estimators : int
        Number of trees in the Random Forest (default: 500)
    importance_threshold : float
        Minimum normalized importance for feature selection (default: 0.01)
    n_permutations : int
        Number of permutations for importance estimation (default: 10)
    random_state : int
        Random seed for reproducibility
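    _rf_model : RandomForestRegressor
        Fitted Random Forest model
    _raw_importance : np.ndarray
        Raw permutation importance scores (clipped at zero)
    _importance_std : np.ndarray
        Standard deviation of the raw importance across permutations
    _normalized_importance : np.ndarray
        Normalized importance scores (sum to 1)
    _feature_names : List[str]
        Names of input features
    _selected_indices : List[int]
        Indices of selected features
    _selected_names : List[str]
        Names of selected features
    _baseline_mse : float
        In-sample MSE of the Random Forest before permutation
    _rf_r2 : float
        In-sample R-squared of the Random Forest
    _screening_complete : bool
        Whether screening has been performed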
    
    Examples
    --------
    >>> screener = PANSRVariableScreener(importance_threshold=0.01)
    >>> result = screener.screen(X, y, feature_names)
    >>> print(result['selected_names'])
    ['q_c', 'N_d']  # Features in the true equation
    """
    
    def __init__(
        self,
        n_estimators: int = 500,
        importance_threshold: float = DEFAULT_IMPORTANCE_THRESHOLD,
        n_permutations: int = 10,
        random_state: int = RANDOM_SEED
    ):
        """
        Initialize PANSRVariableScreener.
        
        Parameters
        ----------
        n_estimators : int
            Number of trees in Random Forest. More trees give more stable
            importance estimates but increase computation time.
            Default: 500
        importance_threshold : float
            Minimum normalized importance for a feature to be selected.
            Features with importance < threshold are considered unimportant.
            Default: 0.01 (1%)
        n_permutations : int
            Number of times to permute each feature when computing importance.
            More permutations give more stable estimates.
            Default: 10
        random_state : int
            Random seed for reproducibility.
            Default: 42
        """
        self.n_estimators = n_estimators
        self.importance_threshold = importance_threshold
        self.n_permutations = n_permutations
        self.random_state = random_state
        
        # Internal state
        self._rf_model = None
        self._raw_importance = None
        self._normalized_importance = None
        self._importance_std = None
        self._feature_names = None
        self._selected_indices = None
        self._selected_names = None
        self._baseline_mse = None
        self._rf_r2 = None
        self._screening_complete = False
    
    def screen(
        self,
        X: np.ndarray,
        y: np.ndarray,
        feature_names: List[str]
    ) -> Dict[str, Any]:
        """
        Perform variable screening using Random Forest permutation importance.
        
        Parameters
        ----------
        X : np.ndarray
            Feature matrix of shape (n_samples, n_features)
        y : np.ndarray
            Target vector of shape (n_samples,)
        feature_names : List[str]
            Names of features corresponding to columns of X
        
        Returns
        -------
        Dict[str, Any]
            Dictionary containing:
            - n_features: Total number of input features
            - n_selected: Number of selected features
            - selected_indices: List of selected feature indices
            - selected_names: List of selected feature names
            - importance_scores: Dict mapping feature names to normalized importance
            - importance_ranking: List of (name, importance) sorted by importance
            - rf_r2: In-sample R-squared of the Random Forest fit
            - baseline_mse: In-sample baseline MSE before permutation
            - threshold_used: Importance threshold applied during selection
        """
        self._feature_names = list(feature_names)
        n_features = X.shape[1]
        
        # Step 1: Fit Random Forest
        self._rf_model = self._fit_random_forest(X, y)
        
        # Compute baseline metrics (in-sample: evaluated on the training data, so optimistic)
        y_pred = self._rf_model.predict(X)
        self._baseline_mse = compute_mse(y, y_pred)
        self._rf_r2 = compute_r2(y, y_pred)
        
        # Step 2: Compute permutation importance
        self._raw_importance, self._importance_std = self._compute_permutation_importance(X, y)
        
        # Step 3: Normalize importance
        self._normalized_importance = self._normalize_importance(self._raw_importance)
        
        # Step 4: Select features
        self._selected_indices = [
            i for i in range(n_features)
            if self._normalized_importance[i] > self.importance_threshold
        ]
        self._selected_names = [self._feature_names[i] for i in self._selected_indices]
        
        self._screening_complete = True
        
        # Build importance scores dict
        importance_scores = {
            name: float(self._normalized_importance[i])
            for i, name in enumerate(self._feature_names)
        }
        
        # Build ranking
        importance_ranking = sorted(
            importance_scores.items(),
            key=lambda x: x[1],
            reverse=True
        )
        
        return {
            'n_features': n_features,
            'n_selected': len(self._selected_indices),
            'selected_indices': self._selected_indices,
            'selected_names': self._selected_names,
            'importance_scores': importance_scores,
            'importance_ranking': importance_ranking,
            'rf_r2': self._rf_r2,
            'baseline_mse': self._baseline_mse,
            'threshold_used': self.importance_threshold
        }
    
    def _fit_random_forest(
        self,
        X: np.ndarray,
        y: np.ndarray
    ) -> RandomForestRegressor:
        """
        Fit Random Forest regressor.
        
        Parameters
        ----------
        X : np.ndarray
            Feature matrix
        y : np.ndarray
            Target vector
        
        Returns
        -------
        RandomForestRegressor
            Fitted Random Forest model
        """
        rf = RandomForestRegressor(
            n_estimators=self.n_estimators,
            max_features='sqrt',
            min_samples_leaf=5,
            n_jobs=-1,
            random_state=self.random_state
        )
        rf.fit(X, y)
        return rf
    
    def _compute_permutation_importance(
        self,
        X: np.ndarray,
        y: np.ndarray
    ) -> Tuple[np.ndarray, np.ndarray]:
        """
        Compute permutation importance for all features.
        
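        For each feature j:
        1. Compute baseline MSE
        2. Shuffle column j and recompute MSE (repeated n_permutations times)
        3. Importance = mean of (MSE_permuted - MSE_baseline)
        
        The division by MSE_baseline in the relative form is omitted here
        because it is a common factor that cancels when scores are normalized.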
        
        Parameters
        ----------
        X : np.ndarray
            Feature matrix
        y : np.ndarray
            Target vector
        
        Returns
        -------
        Tuple[np.ndarray, np.ndarray]
            - importance_mean: Mean importance for each feature
            - importance_std: Standard deviation of importance
        """
        result = permutation_importance(
            self._rf_model,
            X,
            y,
            n_repeats=self.n_permutations,
            random_state=self.random_state,
            n_jobs=-1,
            scoring='neg_mean_squared_error'
        )
        
        # sklearn defines importance as baseline_score - permuted_score.
        # With scoring='neg_mean_squared_error' this equals
        # MSE_permuted - MSE_baseline, so importances_mean is already the
        # increase in error (positive = important). Negating it would push
        # every informative feature below zero and clip it to 0.
        importance_mean = result.importances_mean
        importance_std = result.importances_std
        
        # Clip at zero: uninformative features can score slightly negative due to noise
        importance_mean = np.maximum(importance_mean, 0)
        
        return importance_mean, importance_std
    
    def _normalize_importance(
        self,
        importance: np.ndarray
    ) -> np.ndarray:
        """
        Normalize importance scores to sum to 1.
        
        Parameters
        ----------
        importance : np.ndarray
            Raw importance scores
        
        Returns
        -------
        np.ndarray
            Normalized importance (sums to 1)
        """
        total = np.sum(importance)
        if total < EPS_DIV:
            # All importances are zero - return uniform distribution
            return np.ones_like(importance) / len(importance)
        return importance / total
    
    def get_selected_features(self) -> List[str]:
        """
        Get list of selected feature names.
        
        Returns
        -------
        List[str]
            Names of selected features
        
        Raises
        ------
        ValueError
            If screening has not been performed
        """
        if not self._screening_complete:
            raise ValueError("Must run screen() before getting selected features")
        return self._selected_names.copy()
    
    def get_importance_ranking(self) -> List[Tuple[str, float]]:
        """
        Get features ranked by importance.
        
        Returns
        -------
        List[Tuple[str, float]]
            List of (feature_name, normalized_importance) sorted descending
        
        Raises
        ------
        ValueError
            If screening has not been performed
        """
        if not self._screening_complete:
            raise ValueError("Must run screen() before getting ranking")
        
        ranking = [
            (name, float(self._normalized_importance[i]))
            for i, name in enumerate(self._feature_names)
        ]
        return sorted(ranking, key=lambda x: x[1], reverse=True)
    
    def get_gini_importance(self) -> Dict[str, float]:
        """
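        Get impurity-based importance (mean decrease in impurity) from Random Forest.
        
        For a regressor the impurity is the squared error, so "Gini" is used
        here only as the conventional name. This is a faster alternative to
        permutation importance, but it is computed on the training data and is
        biased toward features with many distinct values (Strobl et al., 2007).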
        
        Returns
        -------
        Dict[str, float]
            Feature names to Gini importance
        """
        if self._rf_model is None:
            raise ValueError("Must run screen() before getting Gini importance")
        
        gini_importance = self._rf_model.feature_importances_
        return {
            name: float(gini_importance[i])
            for i, name in enumerate(self._feature_names)
        }
    
    def print_screening_report(self) -> None:
        """
        Print a detailed screening report.
        """
        if not self._screening_complete:
            print("Screening not yet performed. Run screen() first.")
            return
        
        print("=" * 70)
        print(" Variable Screening Results (PAN+SR)")
        print("=" * 70)
        print()
        print("Random Forest Configuration:")
        print(f"  n_estimators: {self.n_estimators}")
        print(f"  n_permutations: {self.n_permutations}")
        print(f"  importance_threshold: {self.importance_threshold}")
        print()
        print("Random Forest Performance:")
        print(f"  R-squared: {self._rf_r2:.4f}")
        print(f"  Baseline MSE: {self._baseline_mse:.6e}")
        print()
        print("-" * 70)
        print(" Feature Importance Ranking:")
        print("-" * 70)
        print(f"{'Rank':<6} {'Feature':<20} {'Importance':<15} {'Selected'}")
        print("-" * 70)
        
        ranking = self.get_importance_ranking()
        for rank, (name, importance) in enumerate(ranking, 1):
            selected = "YES" if name in self._selected_names else "no"
            print(f"{rank:<6} {name:<20} {importance:<15.4f} {selected}")
        
        print()
        print("-" * 70)
        print(f" Selected Features: {len(self._selected_names)} / {len(self._feature_names)}")
        print("-" * 70)
        print(f"  {self._selected_names}")
        print()
        print("=" * 70)

---
## Section 3: Internal Tests
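
The tests in this section print `[PASS]` or `[WARNING]` rather than raising. As a complement, an assert-style check of the normalization step could look like the sketch below. `normalize_importance` is a hypothetical standalone mirror of `_normalize_importance`, and `eps` is an assumed stand-in for `EPS_DIV` from the core module.

```python
import numpy as np

def normalize_importance(importance, eps=1e-12):
    """Scale non-negative scores to sum to 1; fall back to uniform if all are ~0."""
    importance = np.asarray(importance, dtype=float)
    total = importance.sum()
    if total < eps:
        # No feature carries measurable importance: spread weight evenly
        return np.full(importance.shape, 1.0 / importance.size)
    return importance / total

# Regular case: proportions are preserved and the result sums to 1
regular = normalize_importance([3.0, 1.0, 0.0])
assert np.allclose(regular, [0.75, 0.25, 0.0])
assert np.isclose(regular.sum(), 1.0)

# Degenerate case: all-zero input yields a uniform distribution
uniform = normalize_importance([0.0, 0.0, 0.0, 0.0])
assert np.allclose(uniform, 0.25)
```

One consequence of the uniform fallback: with `p` features each receives `1/p`, so every feature passes any threshold below `1/p`. A caller may want to treat that case as "nothing selected" instead.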

In [None]:
# ==============================================================================
# TEST CONTROL FLAG
# ==============================================================================

_RUN_TESTS = False  # Set to True to run internal tests

if _RUN_TESTS:
    print("=" * 70)
    print(" RUNNING INTERNAL TESTS FOR 02_VariableScreening")
    print("=" * 70)

In [None]:
# ==============================================================================
# TEST 1: Warm Rain Data
# ==============================================================================

if _RUN_TESTS:
    print()
    print_section_header("Test 1: Warm Rain Data")
    
    # Generate warm rain data
    # True equation: dq_r/dt = 0.89 * q_c^2.47 * N_d^(-1.79)
    # Expected: q_c and N_d should have highest importance
    X, y, feature_names, user_inputs = generate_warm_rain_data(
        n_samples=1000, noise_level=0.01, seed=42
    )
    
    print(f"Data shape: X={X.shape}, y={y.shape}")
    print(f"Features: {feature_names}")
    print(f"True equation uses: q_c, N_d")
    print(f"Irrelevant features: r_eff, LWC")
    print()
    
    # Run screening
    screener = PANSRVariableScreener(
        n_estimators=200,  # Reduced for faster testing
        importance_threshold=0.05,
        n_permutations=5
    )
    result = screener.screen(X, y, feature_names)
    
    # Print results
    screener.print_screening_report()
    
    # Verification
    print("\nVerification:")
    ranking = result['importance_ranking']
    top_2_features = [name for name, _ in ranking[:2]]
    print(f"  Top 2 features: {top_2_features}")
    
    if 'q_c' in top_2_features and 'N_d' in top_2_features:
        print("  [PASS] Correctly identified q_c and N_d as most important")
    else:
        print("  [WARNING] Did not correctly identify both true features")

In [None]:
# ==============================================================================
# TEST 2: Irrelevant Feature Detection
# ==============================================================================

if _RUN_TESTS:
    print()
    print_section_header("Test 2: Irrelevant Feature Detection")
    
    # Generate data with many irrelevant features
    np.random.seed(42)
    n_samples = 500
    
    # True features
    x1 = np.random.uniform(0, 1, n_samples)
    x2 = np.random.uniform(0, 1, n_samples)
    
    # Irrelevant features (noise)
    noise_features = np.random.randn(n_samples, 5)
    
    # True equation: y = x1^2 + x2
    y = x1**2 + x2 + 0.01 * np.random.randn(n_samples)
    
    # Combine
    X = np.column_stack([x1, x2, noise_features])
    feature_names = ['x1', 'x2', 'noise1', 'noise2', 'noise3', 'noise4', 'noise5']
    
    print(f"True equation: y = x1^2 + x2")
    print(f"Irrelevant features: noise1-5")
    print()
    
    # Screen
    screener = PANSRVariableScreener(
        n_estimators=200,
        importance_threshold=0.05
    )
    result = screener.screen(X, y, feature_names)
    
    # Print ranking
    print("Importance Ranking:")
    for name, imp in result['importance_ranking']:
        marker = " <-- TRUE" if name in ['x1', 'x2'] else ""
        print(f"  {name}: {imp:.4f}{marker}")
    
    print(f"\nSelected: {result['selected_names']}")
    
    # Verify
    noise_selected = [n for n in result['selected_names'] if 'noise' in n]
    if len(noise_selected) == 0:
        print("[PASS] No noise features selected")
    else:
        print(f"[WARNING] Noise features selected: {noise_selected}")

In [None]:
# ==============================================================================
# TEST 3: Nonlinear Relationship Detection
# ==============================================================================

if _RUN_TESTS:
    print()
    print_section_header("Test 3: Nonlinear Relationship Detection")
    
    # Generate data with a nonlinear relationship that a correlation-based screen would miss
    np.random.seed(42)
    n_samples = 500
    
    # x1 has a symmetric range and enters through an even function, so
    # corr(x1, cos(x1)) ~ 0. An odd function such as sin would not work here:
    # corr(x, sin(x)) ~ 0.78 on [-pi, pi].
    x1 = np.random.uniform(-np.pi, np.pi, n_samples)
    x2 = np.random.uniform(0, 1, n_samples)
    noise_feat = np.random.randn(n_samples)
    
    # True equation: y = cos(x1) + x2^2
    y = np.cos(x1) + x2**2 + 0.01 * np.random.randn(n_samples)
    
    X = np.column_stack([x1, x2, noise_feat])
    feature_names = ['x1', 'x2', 'noise']
    
    print("True equation: y = cos(x1) + x2^2")
    print("Note: x1 has symmetric range, so corr(x1, cos(x1)) ~ 0")
    print("A purely correlation-based screen would miss x1's importance")
    print()
    
    # Compute linear correlation (what LARS would use)
    corr_x1 = np.corrcoef(x1, y)[0, 1]
    corr_x2 = np.corrcoef(x2, y)[0, 1]
    print(f"Linear correlations (LARS perspective):")
    print(f"  corr(x1, y) = {corr_x1:.4f}")
    print(f"  corr(x2, y) = {corr_x2:.4f}")
    print()
    
    # Screen with RF
    screener = PANSRVariableScreener(
        n_estimators=200,
        importance_threshold=0.05
    )
    result = screener.screen(X, y, feature_names)
    
    print("RF Permutation Importance:")
    for name, imp in result['importance_ranking']:
        print(f"  {name}: {imp:.4f}")
    
    print(f"\nSelected: {result['selected_names']}")
    
    # Verify
    if 'x1' in result['selected_names'] and 'x2' in result['selected_names']:
        print("[PASS] RF correctly detected both nonlinear relationships")
    else:
        print("[WARNING] RF did not detect both true features")

In [None]:
# ==============================================================================
# TEST 4: Threshold Sensitivity
# ==============================================================================

if _RUN_TESTS:
    print()
    print_section_header("Test 4: Threshold Sensitivity")
    
    # Use warm rain data
    X, y, feature_names, _ = generate_warm_rain_data(
        n_samples=500, noise_level=0.01, seed=42  # fixed seed, as in Test 1
    )
    
    thresholds = [0.001, 0.01, 0.05, 0.1, 0.2]
    
    print(f"Testing different importance thresholds:")
    print(f"{'Threshold':<12} {'N Selected':<12} {'Selected Features'}")
    print("-" * 60)
    
    for thresh in thresholds:
        screener = PANSRVariableScreener(
            n_estimators=100,
            importance_threshold=thresh
        )
        result = screener.screen(X, y, feature_names)
        print(f"{thresh:<12} {result['n_selected']:<12} {result['selected_names']}")
    
    print()
    print("Note: Tests 1-3 use a threshold of 0.05. Scores sum to 1, so a suitable")
    print("      threshold depends on the number of features; compare the rows above.")

---
## Section 4: Module Summary

In [None]:
# ==============================================================================
# MODULE SUMMARY
# ==============================================================================

print("=" * 70)
print(" 02_VariableScreening.ipynb - Module Summary")
print("=" * 70)
print()
print("CLASS: PANSRVariableScreener")
print("-" * 70)
print()
print("Purpose:")
print("  Select important variables using Random Forest permutation importance.")
print("  Detects nonlinear dependencies that LARS misses.")
print()
print("Main Methods:")
print("  screen(X, y, feature_names)")
print("      Perform variable screening")
print("      Returns: dict with selected features and importance scores")
print()
print("  get_selected_features()")
print("      Get list of selected feature names")
print()
print("  get_importance_ranking()")
print("      Get features ranked by importance")
print()
print("  get_gini_importance()")
print("      Get Gini importance (faster alternative)")
print()
print("  print_screening_report()")
print("      Print detailed screening report")
print()
print("Usage Example:")
print("-" * 70)
print("""
# Create screener
screener = PANSRVariableScreener(
    n_estimators=500,
    importance_threshold=0.01
)

# Run screening
result = screener.screen(X, y, feature_names)

# Get selected features
selected = result['selected_names']
print(f"Selected: {selected}")

# View importance ranking
for name, importance in result['importance_ranking']:
    print(f"{name}: {importance:.4f}")
""")
print()
print("=" * 70)
print("Module loaded successfully. Import via: %run 02_VariableScreening.ipynb")
print("=" * 70)