# **Chapter 16: Feature Selection and Dimensionality Reduction**

## **16.1 The Need for Feature Selection**

Feature selection is the process of identifying and retaining the most predictive subset of features while eliminating redundant or irrelevant ones. In the context of NEPSE (Nepal Stock Exchange) prediction systems, feature selection addresses the "curse of dimensionality" where an abundance of technical indicators, fundamental ratios, and macroeconomic variables can degrade model performance rather than improve it.

**The Dimensionality Problem in NEPSE:**

A comprehensive NEPSE prediction system might generate 200+ features per stock:
- Price action features (Open, High, Low, Close, VWAP) → 20+ indicators
- Volume metrics (Vol, Turnover, Trans.) → 15+ transformations  
- Technical indicators (RSI, MACD, Bollinger, ATR) → 30+ calculations
- Lagged values (t-1, t-5, t-20) → 60+ time-shifted features
- Cross-sectional rankings → 10+ relative metrics
- Fundamental data (PE, PB, EPS growth) → 20+ ratios

**Consequences of High Dimensionality:**
1. **Overfitting**: With only 252 trading days per year, 200 features leave ample room to fit noise. A NEPSE model with 200 features and 500 training samples (about two years of daily data) has only 2.5 observations per feature, so it has high variance and is likely to fail on out-of-sample data.
2. **Multicollinearity**: NEPSE technical indicators are highly correlated. RSI(14) and Stochastic(14) can share a correlation of 0.85 or more; including both creates redundant parameters and unstable coefficient estimates in linear models.
3. **Computational Cost**: Training XGBoost or LSTM models on 200 features can take roughly 10x longer than on 20 carefully selected features, complicating real-time NEPSE trading systems.
4. **Interpretability**: A model using 5 interpretable features (Price momentum, Volume spike, RSI divergence, VWAP deviation, 52-week position) is actionable for traders; a black box with 200 features is not.

**Feature Selection vs. Dimensionality Reduction:**
- **Selection**: Chooses a subset of original features (e.g., select RSI but not Stochastic). Preserves interpretability and feature names.
- **Reduction**: Creates new synthetic features (e.g., PCA components combining RSI, MACD, and Volume). Loses direct interpretability but may capture latent patterns.

**NEPSE-Specific Considerations:**
- **Regime Dependency**: Features predictive in NEPSE bull markets (e.g., momentum) may fail in bear markets (where mean-reversion dominates). Selection must test stability across 2019-2020 (crash), 2021-2022 (recovery), and 2023-2024 (bull run) regimes.
- **Low Signal-to-Noise**: NEPSE has lower liquidity than major exchanges, creating random price movements. Feature selection must distinguish true predictive signals from spurious correlations that arise by chance in small samples.
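
The point about spurious correlations can be made concrete with a small simulation. The sketch below uses purely synthetic noise (no NEPSE data), with 200 features and 500 samples chosen to match the figures quoted above. Every feature is unrelated to the target by construction, yet some still show correlations that look meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 500, 200

# Pure noise: no feature has any real relationship with the target
X = rng.standard_normal((n_samples, n_features))
y = rng.standard_normal(n_samples)

# Pearson correlation of every feature with the target
X_centered = X - X.mean(axis=0)
y_centered = y - y.mean()
corrs = (X_centered.T @ y_centered) / (
    np.linalg.norm(X_centered, axis=0) * np.linalg.norm(y_centered)
)

max_abs_corr = float(np.abs(corrs).max())
# 0.0877 is approximately the two-sided 5% cutoff for n = 500
n_false_positives = int((np.abs(corrs) > 0.0877).sum())

print(f"Largest |correlation| among noise features: {max_abs_corr:.3f}")
print(f"Noise features passing p < 0.05: {n_false_positives} of {n_features}")
```

Roughly 5% of the noise features are expected to pass a p < 0.05 test by chance alone, which is why a single significance filter applied to hundreds of candidates should be treated as a first screen rather than proof of predictive power.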

---

## **16.2 Filter Methods**

Filter methods evaluate features based on statistical measures independent of any machine learning algorithm. They are computationally efficient and provide a preprocessing step to eliminate obviously irrelevant features before model-specific selection.

### **16.2.1 Correlation-Based Selection**

Correlation analysis identifies and removes redundant features that provide duplicate information, common with NEPSE technical indicators derived from the same price series.

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import spearmanr

def correlation_based_selection(df, features, target_col='Target_Return', 
                               threshold=0.85, method='spearman'):
    """
    Select features based on correlation analysis for NEPSE data.
    
    Removes highly correlated features to reduce multicollinearity.
    Uses Spearman (rank) correlation to capture non-linear monotonic relationships
    common in financial time-series.
    
    Parameters:
    -----------
    df : pd.DataFrame
        NEPSE feature matrix
    features : list
        Candidate feature columns
    target_col : str
        Target variable for predictive correlation check
    threshold : float
        Correlation threshold for redundancy (0.85 = 85% correlated)
    method : str
        'spearman' (rank-based, robust to outliers) or 'pearson' (linear)
    
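    Returns:
    --------
    significant_features : list
        Non-redundant features whose target correlation is statistically significant
    corr_matrix : pd.DataFrame
        Full correlation matrix for inspection
    target_corrs : dict
        Spearman correlation and p-value against the target for each non-redundant feature
    """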
    # Calculate correlation matrix
    X = df[features]
    if method == 'spearman':
        corr_matrix = X.corr(method='spearman')
    else:
        corr_matrix = X.corr(method='pearson')
    
    # Find highly correlated pairs
    upper_triangle = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    
    # Identify features to drop (above threshold)
    to_drop = []
    for col in upper_triangle.columns:
        high_corr = upper_triangle[col][abs(upper_triangle[col]) > threshold]
        if not high_corr.empty:
            # Keep the one with higher correlation to target, drop the other
            for idx, corr_val in high_corr.items():
                if idx in to_drop or col in to_drop:
                    continue
                
                # Compare correlation with target
                corr_idx_target = abs(df[idx].corr(df[target_col], method=method))
                corr_col_target = abs(df[col].corr(df[target_col], method=method))
                
                if corr_col_target > corr_idx_target:
                    to_drop.append(idx)
                else:
                    to_drop.append(col)
    
    selected_features = [f for f in features if f not in to_drop]
    
    # Additional filter: score each surviving feature against the target
    # (Spearman is used here regardless of `method`)
    target_corrs = {}
    for feat in selected_features:
        corr, p_value = spearmanr(df[feat], df[target_col], nan_policy='omit')
        target_corrs[feat] = {'correlation': corr, 'p_value': p_value}
    
    # Keep only statistically significant features (p < 0.05) with |correlation| > 0.03
    significant_features = [
        f for f in selected_features 
        if target_corrs[f]['p_value'] < 0.05 and abs(target_corrs[f]['correlation']) > 0.03
    ]
    
    return significant_features, corr_matrix, target_corrs

def visualize_nepse_correlation_network(df, features, threshold=0.7):
    """
    Visualize feature correlations as a network graph for NEPSE indicators.
    
    Helps identify clusters of redundant indicators (e.g., all momentum indicators
    grouping together, all volatility indicators grouping together).
    """
    import networkx as nx
    
    corr_matrix = df[features].corr().abs()
    
    # Create graph
    G = nx.Graph()
    
    # Add edges for correlations above threshold
    for i in range(len(features)):
        for j in range(i+1, len(features)):
            if corr_matrix.iloc[i, j] > threshold:
                G.add_edge(features[i], features[j], weight=corr_matrix.iloc[i, j])
    
    # Plot
    plt.figure(figsize=(12, 8))
    pos = nx.spring_layout(G, k=3, iterations=50)
    
    edges = G.edges(data=True)
    weights = [d['weight'] for (u, v, d) in edges]
    
    nx.draw_networkx_nodes(G, pos, node_color='lightblue', node_size=500)
    nx.draw_networkx_edges(G, pos, width=[w*2 for w in weights], alpha=0.5)
    nx.draw_networkx_labels(G, pos, font_size=8)
    
    plt.title('NEPSE Feature Correlation Network\n(Edges > 0.7 correlation)')
    plt.show()
    
    # Identify clusters (connected components)
    clusters = list(nx.connected_components(G))
    print(f"Identified {len(clusters)} feature clusters:")
    for i, cluster in enumerate(clusters):
        print(f"  Cluster {i+1}: {cluster}")
    
    return clusters

# Detailed Explanation:
#
# 1. Spearman vs Pearson for NEPSE:
#    - Pearson measures linear correlation (sensitive to outliers)
#    - Spearman measures rank correlation (monotonic relationships)
#    - Spearman helps when a relationship is monotonic but non-linear
#      (e.g., an effect that saturates at extreme values). It does NOT
#      capture non-monotonic effects such as RSI 30→40 being bullish
#      while RSI 70→80 is bearish/overbought; those need a measure
#      like mutual information (Section 16.2.3).
#
# 2. Target Correlation Threshold (0.03):
#    - In noisy financial data, correlations < 0.03 are often spurious
#    - With 252 samples/year, a correlation of 0.03 has p-value ~ 0.64;
#      reaching p < 0.05 takes |correlation| of roughly 0.12 or more
#    - 0.03 is therefore only a floor; the p < 0.05 requirement is the
#      binding filter at typical NEPSE sample sizes
#
# 3. Handling Multicollinearity:
#    - When RSI and Stochastic correlate at 0.90, we keep the one with
#      higher target correlation (better predictive power)
#    - This greedy approach ensures we retain the most informative 
#      member of each correlated cluster
#
# 4. Network Visualization:
#    - Clusters reveal feature redundancy (e.g., all Bollinger Band 
#      features cluster together; all MACD components cluster)
#    - Between-cluster features provide orthogonal information
#    - Ideal NEPSE model uses one representative from each cluster
```

**Explanation:**

The `correlation_based_selection` function implements a two-stage filter for NEPSE features. **Stage 1** removes multicollinear features using a greedy algorithm: when two features correlate above 0.85 (threshold), it retains whichever has stronger predictive correlation with the target (next-day returns) and discards the other. This prevents the "double counting" of information common in technical analysis where indicators like RSI and Stochastic Oscillator measure similar momentum concepts.

**Spearman vs. Pearson:**
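The function defaults to Spearman (rank) correlation because NEPSE relationships are often non-linear. Spearman captures any monotonic relationship, including ones that flatten out at extreme values, and is less sensitive to outliers than Pearson. It does not help with non-monotonic patterns: if the relationship between RSI and future returns is inverse-U shaped (extreme values predict reversals, middle values predict continuation), both Pearson and Spearman can be close to zero. Those cases call for mutual information (Section 16.2.3) or a transformed feature such as the distance of RSI from 50.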
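
A minimal sketch of this distinction on synthetic data follows. The helper names (`average_ranks`, `pearson`, `spearman`) are hypothetical and written out only to keep the example dependency-light; in practice `scipy.stats.spearmanr` or `DataFrame.corr(method='spearman')` would be used.

```python
import numpy as np

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ranks)
    return rank_sums[inverse] / counts[inverse]

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def spearman(a, b):
    # Spearman is Pearson correlation computed on ranks
    return pearson(average_ranks(a), average_ranks(b))

x = np.arange(-100, 101, dtype=float)   # symmetric synthetic "indicator"
y_monotonic = x ** 3                    # non-linear but monotonic
y_inverse_u = -(x ** 2)                 # non-monotonic (inverse-U)

print(f"Monotonic:  Pearson={pearson(x, y_monotonic):.3f}, "
      f"Spearman={spearman(x, y_monotonic):.3f}")
print(f"Inverse-U:  Pearson={pearson(x, y_inverse_u):.3f}, "
      f"Spearman={spearman(x, y_inverse_u):.3f}")
```

On the monotonic curve Spearman reaches 1 while Pearson falls short; on the inverse-U curve both measures are essentially zero even though the target is a deterministic function of the feature.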

**Statistical Significance:**
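Even after removing redundancy, features must pass individual predictive tests. With typical NEPSE sample sizes (500-1000 trading days), a correlation must reach roughly 0.06 to 0.09 in absolute value to be significant at p < 0.05; anything smaller, including values near the 0.03 floor, is indistinguishable from chance. The function filters these out, so only features that clear both the 0.03 floor and the significance test remain.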
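
The link between sample size and the smallest detectable correlation can be checked with a few lines of arithmetic. This is a sketch using the Fisher z approximation, which is derived for Pearson correlation and is only approximate for Spearman; the function names are illustrative, not part of any library.

```python
import math

def correlation_p_value(r, n):
    """Two-sided p-value for a correlation r from n samples (Fisher z approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def min_significant_correlation(n, z_crit=1.96):
    """Smallest |r| that reaches p < 0.05 (two-sided) with n samples."""
    return math.tanh(z_crit / math.sqrt(n - 3))

for n in (252, 500, 1000):
    p = correlation_p_value(0.03, n)
    r_min = min_significant_correlation(n)
    print(f"n={n:4d}: p-value of r=0.03 is {p:.2f}; "
          f"p < 0.05 needs |r| >= {r_min:.3f}")
```

Under these assumptions a correlation of 0.03 is far from significant at any of the three sample sizes, so the 0.03 cutoff only matters for much longer histories or pooled multi-stock panels.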

### **16.2.2 Variance Threshold**

Variance thresholding removes features with near-zero variance (constant or nearly constant values), which provide no discriminative power for NEPSE prediction.

```python
def variance_threshold_selection(df, features, threshold=0.01, target_aware=True,
                                 target_col='Target'):
    """
    Remove low-variance features from NEPSE dataset.
    
    In financial time-series, low variance can indicate:
    - Illiquid stocks with constant prices for days
    - Technical indicators stuck at bounds (RSI = 100 for weeks)
    - Calculated features with numerical errors
    
    Parameters:
    -----------
    threshold : float
        Minimum scale-free dispersion (std / mean absolute value) to retain feature
    target_aware : bool
        If True, only remove features with low variance AND low target correlation
        (preserves low-variance features that are strongly predictive)
    target_col : str
        Target column used for the target-aware check
    """
    X = df[features]
    
    # Scale-free dispersion: std divided by the mean absolute value.
    # This is close to the coefficient of variation (std / |mean|) but
    # stays stable for features whose mean is near zero, such as returns.
    stds = X.std()
    mean_abs = X.abs().mean()
    cv = stds / (mean_abs + 1e-8)
    
    low_var_features = cv[cv < threshold].index.tolist()
    
    if target_aware and target_col in df.columns:
        # Check if low-variance features still predict target (rare but possible)
        for feat in low_var_features[:]:  # Copy list to allow removal
            corr = abs(df[feat].corr(df[target_col]))
            if corr > 0.05:  # If predictive despite low variance, keep it
                low_var_features.remove(feat)
                print(f"Keeping low-variance feature {feat} due to high target correlation ({corr:.3f})")
    
    selected = [f for f in features if f not in low_var_features]
    
    print(f"Removed {len(low_var_features)} low-variance features:")
    print(f"  {low_var_features}")
    print(f"Retained {len(selected)} features")
    
    return selected

# NEPSE Example:
# Feature "Is_Holiday" might have variance 0.02 (2% of days are holidays)
# But it has zero predictive power for returns → Remove
# 
# Feature "RSI_Above_70" might have variance 0.03 (rarely >70)
# But when True, predicts -2% returns → Keep despite low variance
```
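
Raw variance depends on the unit a feature is stored in, which is why the filter above normalizes by the feature's typical magnitude. The sketch below illustrates this with synthetic turnover figures (not real NEPSE data); `dispersion_ratio` is a hypothetical helper that computes the standard deviation divided by the mean absolute value.

```python
import numpy as np

def dispersion_ratio(values):
    """std / mean(|x|): a scale-free measure of how much a feature varies."""
    values = np.asarray(values, dtype=float)
    return float(values.std(ddof=1) / (np.abs(values).mean() + 1e-8))

rng = np.random.default_rng(1)

# Synthetic daily turnover, stored once in rupees and once in lakh rupees
turnover_rs = rng.lognormal(mean=15.0, sigma=0.5, size=500)
turnover_lakh = turnover_rs / 1e5

# A nearly constant feature: an illiquid price that moved once in 500 days
stuck_price = np.full(500, 100.0)
stuck_price[0] = 100.01

print(f"Variance in Rs:      {turnover_rs.var(ddof=1):.3e}")
print(f"Variance in lakh Rs: {turnover_lakh.var(ddof=1):.3e}")
print(f"Dispersion ratio (Rs):    {dispersion_ratio(turnover_rs):.4f}")
print(f"Dispersion ratio (lakh):  {dispersion_ratio(turnover_lakh):.4f}")
print(f"Dispersion ratio (stuck): {dispersion_ratio(stuck_price):.2e}")
```

The two turnover series carry identical information but differ in raw variance by a factor of 10^10, while their dispersion ratios agree; only the stuck price falls below a threshold such as 0.01.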

### **16.2.3 Mutual Information**

Mutual Information (MI) measures the reduction in uncertainty about the target variable given knowledge of a feature, capturing non-linear relationships that correlation misses.

```python
from sklearn.feature_selection import mutual_info_regression, mutual_info_classif

def mutual_information_selection(df, features, target_col, n_neighbors=3, 
                                task='regression', n_features=20):
    """
    Select features using Mutual Information for NEPSE prediction.
    
    MI measures: I(X;Y) = H(Y) - H(Y|X)
    (Reduction in entropy of target given feature)
    
    Advantages for NEPSE:
    - Captures non-linear relationships (e.g., V-shaped patterns)
    - Scores each feature on its own, so interactions (e.g., Volume * Volatility)
      are only detected if they are built as explicit features
    - No assumption of Gaussian distribution
    
    Parameters:
    -----------
    n_neighbors : int
        Number of neighbors for k-NN based MI estimation (3-5 for financial data)
    task : str
        'regression' for continuous returns, 'classification' for direction
    """
    X = df[features]
    y = df[target_col]
    
    # Handle missing values
    X = X.fillna(X.median())
    
    if task == 'regression':
        mi_scores = mutual_info_regression(X, y, n_neighbors=n_neighbors, random_state=42)
    else:
        mi_scores = mutual_info_classif(X, y, n_neighbors=n_neighbors, random_state=42)
    
    mi_results = pd.DataFrame({
        'Feature': features,
        'MI_Score': mi_scores
    }).sort_values('MI_Score', ascending=False)
    
    # Take a candidate pool larger than n_features so the redundancy
    # penalty below can actually change which features are chosen
    pool_size = min(len(features), 2 * n_features)
    candidates = mi_results.head(pool_size)['Feature'].tolist()
    
    # Redundancy: mean feature-feature MI against the other candidates
    # (cost grows with pool_size squared)
    redundancy_penalty = {}
    for feat in candidates:
        redundancies = []
        for other_feat in candidates:
            if feat != other_feat:
                mi_ff = mutual_info_regression(X[[feat]], X[other_feat],
                                               n_neighbors=n_neighbors, random_state=42)[0]
                redundancies.append(mi_ff)
        redundancy_penalty[feat] = np.mean(redundancies) if redundancies else 0
    
    mi_results['Redundancy'] = mi_results['Feature'].map(redundancy_penalty)
    mi_results['Adjusted_MI'] = mi_results['MI_Score'] - 0.5 * mi_results['Redundancy']
    
    # Re-rank by adjusted MI (features outside the pool have NaN and sort last)
    selected_adjusted = mi_results.sort_values('Adjusted_MI', ascending=False).head(n_features)['Feature'].tolist()
    
    return selected_adjusted, mi_results

# Explanation:
#
# 1. Why MI for NEPSE:
#    - Correlation misses non-linear patterns like "extreme RSI predicts reversal"
#    - MI captures: "When RSI is between 30-70, returns are random (low MI)
#      but when RSI < 20 or > 80, returns have high negative/positive skew (high MI)"
#
# 2. Adjusted MI:
#    - Pure MI selects features that are individually informative but redundant
#    - Adjusted MI penalizes features that are highly mutual with other selected features
#    - Ensures diverse information set (e.g., selects RSI OR Stochastic, not both)
```
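
To see why mutual information picks up patterns that correlation misses, the sketch below compares the two on a synthetic inverse-U relationship. It uses a simple histogram estimator (`binned_mutual_information`, a hypothetical helper) rather than the k-NN estimator behind `mutual_info_regression`, so the absolute values are not comparable with scikit-learn's output; only the contrast matters.

```python
import numpy as np

def binned_mutual_information(x, y, bins=10):
    """Estimate I(X;Y) in nats from equal-frequency (quantile) bins."""
    x_edges = np.quantile(x, np.linspace(0.0, 1.0, bins + 1))
    y_edges = np.quantile(y, np.linspace(0.0, 1.0, bins + 1))
    joint_counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    p_xy = joint_counts / joint_counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    independent = p_x @ p_y                 # joint distribution if x, y were independent
    nonzero = p_xy > 0
    return float(np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / independent[nonzero])))

rng = np.random.default_rng(7)
x = np.linspace(-3.0, 3.0, 2000)                        # synthetic centered indicator
y_inverse_u = -(x ** 2) + 0.3 * rng.standard_normal(2000)
y_unrelated = rng.standard_normal(2000)

linear_corr = float(np.corrcoef(x, y_inverse_u)[0, 1])
mi_inverse_u = binned_mutual_information(x, y_inverse_u)
mi_unrelated = binned_mutual_information(x, y_unrelated)

print(f"Pearson correlation (inverse-U): {linear_corr:.3f}")
print(f"Binned MI (inverse-U):           {mi_inverse_u:.3f} nats")
print(f"Binned MI (unrelated target):    {mi_unrelated:.3f} nats")
```

The correlation is near zero for the inverse-U target, while the MI estimate is far above the small positive value that the estimator reports for an unrelated target. That residual positive value is estimator bias, which is one reason MI scores should be compared against each other rather than against zero.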

### **16.2.4 Chi-Square Test**

Chi-square tests evaluate independence between categorical features and categorical targets, useful for NEPSE regime classification (bull/bear/sideways markets).

```python
from scipy.stats import chi2_contingency

def chi_square_feature_selection(df, features, target_col, n_bins=5):
    """
    Apply Chi-Square test of independence for feature selection in NEPSE.
    
    Best for: Discrete features (candlestick patterns, gap types, volume spikes)
    predicting discrete targets (Up/Down/Flat next day).
    
    Continuous features are binned into quantiles first. The test is run on a
    contingency table per feature; sklearn's chi2 expects non-negative count or
    frequency features, so ordinal bin codes are not a valid input for it.
    """
    # Ensure target is discrete (0, 1, 2 for Down, Flat, Up)
    y = df[target_col]
    if y.nunique() > 5:
        # Bin continuous target into terciles
        y = pd.qcut(y, q=3, labels=[0, 1, 2])
    
    rows = []
    for feat in features:
        x = df[feat]
        if x.nunique() > n_bins:
            # Discretize continuous features into quantile bins
            x = pd.qcut(x, q=n_bins, labels=False, duplicates='drop')
        
        # Contingency table: feature bin (rows) vs target class (columns)
        table = pd.crosstab(x, y)
        if table.shape[0] < 2 or table.shape[1] < 2:
            # Constant feature or single-class target: the test is undefined
            rows.append({'Feature': feat, 'Chi2_Score': 0.0, 'P_Value': 1.0})
            continue
        
        chi_score, p_value, _, _ = chi2_contingency(table)
        rows.append({'Feature': feat, 'Chi2_Score': chi_score, 'P_Value': p_value})
    
    results = pd.DataFrame(rows).sort_values('Chi2_Score', ascending=False)
    
    # Select significant features (p < 0.05)
    selected = results[results['P_Value'] < 0.05]['Feature'].tolist()
    
    return selected, results
```
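
The statistic behind the test is easy to compute by hand for a single feature. The sketch below works through one contingency table; the counts are made up for illustration and are not real NEPSE data.

```python
import math
import numpy as np

# Made-up counts for illustration only (not real NEPSE data):
# rows = volume spike today (yes, no); columns = next-day move (Down, Flat, Up)
observed = np.array([[20, 10, 40],
                     [150, 130, 150]], dtype=float)

# Expected counts if the feature and the target were independent
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

chi2_stat = float(((observed - expected) ** 2 / expected).sum())
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# With exactly 2 degrees of freedom the chi-square survival function
# has the closed form exp(-x / 2), so no stats library is needed here
p_value = math.exp(-chi2_stat / 2)

print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, p = {p_value:.4f}")
```

The closed-form p-value only holds for two degrees of freedom; for other table shapes a library routine such as `scipy.stats.chi2_contingency` is the practical choice.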

---

## **16.3 Wrapper Methods**

Wrapper methods evaluate feature subsets by training and testing actual machine learning models, selecting the combination that optimizes predictive performance. They are computationally expensive but capture feature interactions missed by filter methods.

### **16.3.1 Recursive Feature Elimination (RFE)**

RFE iteratively removes the weakest features according to model coefficients or feature importance, starting with all features and pruning backward.

```python
from sklearn.feature_selection import RFE, RFECV
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def recursive_feature_elimination(df, features, target_col, n_features=15, 
                                 model_type='ridge', step=1):
    """
    Apply Recursive Feature Elimination for NEPSE feature selection.
    
    Process:
    1. Train model on all features
    2. Rank features by importance (coefficients for linear, feature_importances_ for trees)
    3. Remove weakest feature(s)
    4. Repeat, scoring every subset size with cross-validation
    5. Keep the best-scoring subset that has at least n_features features
    
    Time-Series Aware: Uses TimeSeriesSplit to prevent look-ahead bias.
    """
    X = df[features]
    y = df[target_col]
    
    # Choose estimator
    if model_type == 'ridge':
        # Ridge coefficients are only comparable across standardized features,
        # so scale inside a pipeline (the scaler is refit on each training fold)
        estimator = Pipeline([
            ('scaler', StandardScaler()),
            ('ridge', Ridge(alpha=1.0))
        ])
        importance_getter = 'named_steps.ridge.coef_'
    elif model_type == 'random_forest':
        estimator = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
        importance_getter = 'auto'
    else:
        raise ValueError("Unsupported model type")
    
    # Time-series cross-validation
    tscv = TimeSeriesSplit(n_splits=5)
    
    # RFE with cross-validation to find optimal number of features
    selector = RFECV(
        estimator=estimator,
        step=step,  # Number of features removed per iteration
        cv=tscv,
        scoring='neg_mean_squared_error',
        n_jobs=-1,
        min_features_to_select=min(n_features, len(features)),
        importance_getter=importance_getter
    )
    
    selector.fit(X, y)
    
    selected_features = [f for f, selected in zip(features, selector.support_) if selected]
    ranking = dict(zip(features, selector.ranking_))
    
    print(f"Optimal number of features: {selector.n_features_}")
    print(f"Selected: {selected_features}")
    
    return selected_features, ranking, selector

# Explanation:
#
# 1. Why RFE for NEPSE:
#    - Considers feature interactions (e.g., RSI might be useless alone but 
#      powerful when combined with Volume)
#    - Model-specific selection (Ridge prefers uncorrelated features, 
#      Random Forest handles redundancy better)
#
# 2. TimeSeriesSplit:
#    - Standard KFold leaks future information into training set
#    - TimeSeriesSplit ensures training set always before validation set
#    - Critical for NEPSE where feature importance varies by regime
#
# 3. Step Size:
#    - step=1 is accurate but slow for 200+ features
#    - step=5 (remove 5 at a time) faster but might miss optimal combinations
```
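
The expanding-window idea behind `TimeSeriesSplit` can be sketched without any library. The helper below, `expanding_window_splits`, is a hypothetical stand-in written for illustration; it is similar in spirit to scikit-learn's splitter but is not its implementation.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) where training always precedes testing."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_indices = list(range(test_start))
        test_indices = list(range(test_start, test_start + test_size))
        yield train_indices, test_indices

# 12 trading days, 3 folds: each fold trains on everything before its test window
folds = list(expanding_window_splits(n_samples=12, n_splits=3))
for fold_number, (train, test) in enumerate(folds, start=1):
    print(f"Fold {fold_number}: train days {train[0]}-{train[-1]}, "
          f"test days {test[0]}-{test[-1]}")
```

Because every training index is earlier than every test index, a feature ranked inside a fold never sees prices from the period it is evaluated on. A shuffled K-fold split offers no such guarantee.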

### **16.3.2 Forward Selection**

Forward selection starts with no features and iteratively adds the feature that most improves model performance until a stopping criterion is met.

```python
from sklearn.model_selection import cross_val_score

def forward_feature_selection(df, features, target_col, max_features=20, 
                             model_type='ridge', cv_folds=5):
    """
    Forward Selection for NEPSE: Start empty, add best feature each iteration.
    
    More conservative than RFE (backward), less prone to overfitting with small NEPSE samples.
    Stops when the relative CV score improvement falls below 0.1%.
    """
    X = df[features]
    y = df[target_col]
    
    if model_type == 'ridge':
        model = Ridge(alpha=1.0)
    else:
        raise ValueError("Unsupported model type")
    
    # Same time-series CV for every trial so scores are comparable
    tscv = TimeSeriesSplit(n_splits=cv_folds)
    
    selected = []
    remaining = list(features)  # List (not set) keeps tie-breaking deterministic
    best_score = -np.inf
    scores_history = []
    
    while remaining and len(selected) < max_features:
        scores = []
        
        for feature in remaining:
            # Try adding this feature
            trial_features = selected + [feature]
            cv_scores = cross_val_score(model, X[trial_features], y, cv=tscv, 
                                       scoring='neg_mean_squared_error')
            scores.append((feature, cv_scores.mean()))
        
        # Find best feature to add
        scores.sort(key=lambda x: x[1], reverse=True)
        best_feature, best_new_score = scores[0]
        
        # Stopping criterion: relative improvement of at least 0.1%.
        # Negative MSE depends on the scale of the target, so an absolute
        # cutoff of 0.001 would never be met for small daily returns.
        required_gain = 0.001 * abs(best_score) if np.isfinite(best_score) else 0.0
        if best_new_score > best_score + required_gain:
            selected.append(best_feature)
            remaining.remove(best_feature)
            best_score = best_new_score
            scores_history.append(best_score)
            print(f"Added {best_feature}, CV Score: {best_score:.4f}")
        else:
            print("No significant improvement, stopping.")
            break
    
    return selected, scores_history
```

### **16.3.3 Backward Elimination**

Backward elimination starts with all features and removes the least significant feature iteratively (similar to RFE but often using statistical significance rather than importance).

```python
import statsmodels.api as sm

def backward_elimination(df, features, target_col, significance_level=0.05):
    """
    Statistical backward elimination using p-values from OLS regression.
    
    Appropriate for NEPSE linear factor models where statistical significance
    and interpretability matter.
    """
    # Add constant for intercept
    X = sm.add_constant(df[features])
    y = df[target_col]
    
    features = list(features)
    
    while features:
        model = sm.OLS(y, X[['const'] + features]).fit()
        p_values = model.pvalues.drop('const')
        max_p = p_values.max()
        
        if max_p <= significance_level:
            break
        
        excluded_feature = p_values.idxmax()
        features.remove(excluded_feature)
        print(f"Removed {excluded_feature}, p-value: {max_p:.4f}")
    
    # Refit so the returned model matches the final feature list
    # (intercept-only if every feature was eliminated)
    model = sm.OLS(y, X[['const'] + features]).fit()
    
    return features, model
```

### **16.3.4 Stepwise Selection**

Stepwise selection combines forward and backward steps, allowing features to be added and removed dynamically based on statistical criteria.

```python
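def stepwise_selection(df, features, target_col, initial_list=None, 
                      threshold_in=0.01, threshold_out=0.05, verbose=True):
    """
    Stepwise selection for NEPSE: Add significant features, remove insignificant ones.
    
    Alternates forward and backward passes until no feature is added or dropped.
    threshold_in must stay below threshold_out, otherwise a feature can be
    added and dropped in an endless cycle.
    """
    # Avoid a mutable default argument; start empty unless a list is given
    included = list(initial_list) if initial_list is not None else []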
    X = df[features]
    y = df[target_col]
    
    while True:
        changed = False
        
        # Forward step
        excluded = list(set(features) - set(included))
        new_pvals = pd.Series(index=excluded, dtype=float)
        
        for new_col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [new_col]])).fit()
            new_pvals[new_col] = model.pvalues[new_col]
        
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_feature = new_pvals.idxmin()
            included.append(best_feature)
            changed = True
            if verbose:
                print(f"Add  {best_feature:30s} | p-value: {best_pval:.4f}")
        
        # Backward step
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvalues = model.pvalues.iloc[1:]  # Exclude intercept
        worst_pval = pvalues.max()
        
        if worst_pval > threshold_out:
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            changed = True
            if verbose:
                print(f"Drop {worst_feature:30s} | p-value: {worst_pval:.4f}")
        
        if not changed:
            break
    
    return included
```

---

## **16.4 Embedded Methods**

Embedded methods perform feature selection during model training, penalizing complex models with many features. They are efficient and model-specific.

### **16.4.1 LASSO (L1 Regularization)**

LASSO (Least Absolute Shrinkage and Selection Operator) adds a penalty proportional to the absolute value of coefficients, driving some coefficients to exactly zero (automatic feature selection).

```python
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import StandardScaler

def lasso_feature_selection(df, features, target_col, alphas=None, 
                           cv_folds=5, max_features=30):
    """
    LASSO regression for NEPSE feature selection.
    
    L1 penalty: Loss = MSE + alpha * sum(|coefficients|)
    As alpha increases, more coefficients become exactly zero.
    
    Advantages for NEPSE:
    - Automatic feature selection (sparse models)
    - Handles multicollinearity by selecting one from correlated group
    - Interpretable coefficients (change in target per one standard deviation of a feature)
    
    At most max_features features are returned, strongest coefficients first.
    """
    X = df[features]
    y = df[target_col]
    
    # Standardize features (the L1 penalty is scale-sensitive)
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Cross-validated LASSO to find optimal alpha
    if alphas is None:
        alphas = np.logspace(-4, 0, 50)  # 0.0001 to 1.0
    
    lasso_cv = LassoCV(alphas=alphas, cv=TimeSeriesSplit(cv_folds), 
                       max_iter=10000, random_state=42)
    lasso_cv.fit(X_scaled, y)
    
    # Feature importance by coefficient magnitude
    coefs = pd.Series(lasso_cv.coef_, index=features)
    importance = coefs.abs().sort_values(ascending=False)
    
    # Get selected features (non-zero coefficients), capped at max_features
    selected = importance[importance > 0].index.tolist()[:max_features]
    
    print(f"Optimal alpha: {lasso_cv.alpha_:.6f}")
    print(f"Selected {len(selected)} features out of {len(features)}")
    print(f"In-sample R² Score: {lasso_cv.score(X_scaled, y):.4f}")
    
    return selected, importance, lasso_cv

# Explanation:
#
# 1. Alpha Tuning:
#    - Low alpha (0.0001): Almost OLS, keeps many features, overfitting risk
#    - High alpha (1.0): Aggressive sparsity, may underfit
#    - LassoCV finds sweet spot via cross-validation on TimeSeriesSplit
#
# 2. Multicollinearity Handling:
#    - When RSI and Stochastic are correlated (0.9), LASSO tends to pick
#      one and zero out the other, avoiding redundant features
#    - Which one survives depends on minor noise; use Elastic Net for stability
#
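# 3. NEPSE Interpretation:
#    - Features are standardized, so a coefficient of +0.02 on "RSI_Oversold" means:
#      "A one-standard-deviation increase in RSI_Oversold raises the expected
#      next-day return by 0.02 in the units of the target
#      (2 basis points if returns are expressed in percent)"
#    - Sparse models allow traders to understand exactly which signals matter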
```

### **16.4.2 Ridge (L2 Regularization)**

Ridge regression uses L2 penalty (squared coefficients) which shrinks coefficients toward zero but rarely to exactly zero. It handles multicollinearity by distributing coefficients among correlated features.

```python
from sklearn.linear_model import RidgeCV

def ridge_feature_selection(df, features, target_col, alphas=None):
    """
    Ridge regression for NEPSE: Shrinkage without elimination.
    
    Useful when all features have some predictive power and you want
    to keep them all with reduced weights (e.g., ensemble models).
    """
    X = df[features]
    y = df[target_col]
    
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    if alphas is None:
        alphas = np.logspace(-2, 2, 50)
    
    ridge = RidgeCV(alphas=alphas, cv=TimeSeriesSplit(5))
    ridge.fit(X_scaled, y)
    
    coefs = pd.Series(ridge.coef_, index=features)
    
    # Ridge keeps all features but ranks them by coefficient magnitude
    # Consider removing features with |coef| < 1e-4 (numerical zero)
    selected = coefs[abs(coefs) > 1e-4].index.tolist()
    
    return selected, coefs, ridge
```
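
The difference between "exactly zero" and "small but non-zero" is easiest to see in the textbook case of orthonormal features, where both estimators have closed forms. The sketch below assumes the objectives (1/2)·||y − Xb||² + λ·Σ|b| for LASSO and (1/2)·||y − Xb||² + (λ/2)·Σb² for Ridge; this penalty scaling differs from scikit-learn's `alpha`, and the function names and coefficient values are illustrative.

```python
import numpy as np

def lasso_shrink(ols_coefs, penalty):
    """Soft-thresholding: the LASSO solution when features are orthonormal."""
    ols_coefs = np.asarray(ols_coefs, dtype=float)
    return np.sign(ols_coefs) * np.maximum(np.abs(ols_coefs) - penalty, 0.0)

def ridge_shrink(ols_coefs, penalty):
    """Proportional shrinkage: the Ridge solution when features are orthonormal."""
    return np.asarray(ols_coefs, dtype=float) / (1.0 + penalty)

# Illustrative unpenalized coefficients: one strong signal, two weak ones
ols_coefs = np.array([0.50, -0.05, 0.02])
penalty = 0.10

lasso_coefs = lasso_shrink(ols_coefs, penalty)
ridge_coefs = ridge_shrink(ols_coefs, penalty)

print("OLS:  ", ols_coefs)
print("LASSO:", lasso_coefs)
print("Ridge:", ridge_coefs)
```

LASSO subtracts a fixed amount from every coefficient and clips at zero, so the two weak coefficients vanish; Ridge divides every coefficient by the same factor, so none of them ever reaches zero. With correlated features the closed forms no longer apply, but the qualitative behaviour is the same.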

### **16.4.3 Elastic Net**

Elastic Net combines L1 (LASSO) and L2 (Ridge) penalties, providing both feature selection and handling of correlated groups.

```python
from sklearn.linear_model import ElasticNetCV

def elastic_net_selection(df, features, target_col, l1_ratio=None):
    """
    Elastic Net for NEPSE: Balance between LASSO (selection) and Ridge (grouping).
    
    l1_ratio sets the mix of the two penalties: 1.0 is pure LASSO and values
    near 0 approach Ridge. Pass a float to fix the mix, a list to search over,
    or None to search a default grid.
    """
    X = df[features]
    y = df[target_col]
    
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    if l1_ratio is None:
        l1_ratio = [0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99, 1]
    
    # Search over alphas (automatic grid) and the requested l1_ratio values
    enet = ElasticNetCV(
        l1_ratio=l1_ratio,
        cv=TimeSeriesSplit(5),
        max_iter=10000,
        random_state=42
    )
    
    enet.fit(X_scaled, y)
    
    coefs = pd.Series(enet.coef_, index=features)
    selected = coefs[coefs != 0].index.tolist()
    
    print(f"Optimal l1_ratio: {enet.l1_ratio_}")
    print(f"Optimal alpha: {enet.alpha_:.6f}")
    
    return selected, coefs, enet
```
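
How `alpha` and `l1_ratio` combine is worth spelling out, because an `l1_ratio` of 0.5 weights the two penalty terms equally without making their contributions equal. The sketch below evaluates the penalty part of the objective as documented for scikit-learn's `ElasticNet`, `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`; the helper name and coefficient values are illustrative.

```python
import numpy as np

def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Return the (L1 term, L2 term) of the Elastic Net penalty."""
    coefs = np.asarray(coefs, dtype=float)
    l1_term = alpha * l1_ratio * np.abs(coefs).sum()
    l2_term = 0.5 * alpha * (1.0 - l1_ratio) * np.square(coefs).sum()
    return float(l1_term), float(l2_term)

coefs = [0.5, -0.2, 0.0]   # illustrative standardized coefficients

for ratio in (1.0, 0.5, 0.1):
    l1_term, l2_term = elastic_net_penalty(coefs, alpha=0.1, l1_ratio=ratio)
    print(f"l1_ratio={ratio:.1f}: L1 term={l1_term:.5f}, L2 term={l2_term:.5f}")
```

For coefficients well below 1 in magnitude, as is typical for return prediction on standardized features, the squared term is much smaller than the absolute term, so the L1 part tends to dominate unless `l1_ratio` is set quite low.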

### **16.4.4 Tree-Based Importance**

Tree ensembles (Random Forest, XGBoost) provide built-in feature importance measures based on impurity reduction or permutation.

```python
import xgboost as xgb
from sklearn.inspection import permutation_importance

def tree_based_selection(df, features, target_col, method='gain', 
                        n_features=20, n_permutations=10):
    """
    Feature selection using XGBoost importance scores.
    
    Methods:
    - 'gain': Average gain of splits using feature (default)
    - 'cover': Average coverage (number of samples affected)
    - 'weight': Number of times feature appears in trees
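    - 'permutation': Shuffle feature and measure performance drop on held-out data
      (most reliable, but slowest)
    
    Assumes df is sorted chronologically: the model is trained on the first 80%
    of rows and permutation importance is scored on the last 20%.
    """
    X = df[features]
    y = df[target_col]
    
    # Chronological hold-out so permutation importance is not measured in-sample
    split = int(len(X) * 0.8)
    X_train, X_val = X.iloc[:split], X.iloc[split:]
    y_train, y_val = y.iloc[:split], y.iloc[split:]
    
    # Train XGBoost
    model = xgb.XGBRegressor(
        n_estimators=100,
        max_depth=5,
        learning_rate=0.05,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42
    )
    
    model.fit(X_train, y_train)
    
    if method == 'permutation':
        # Permutation importance (model-agnostic, more reliable)
        perm_importance = permutation_importance(
            model, X_val, y_val, 
            n_repeats=n_permutations,
            random_state=42,
            scoring='neg_mean_squared_error'
        )
        importance_scores = perm_importance.importances_mean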
    else:
        # Built-in importance
        importance_dict = model.get_booster().get_score(importance_type=method)
        importance_scores = [importance_dict.get(f, 0) for f in features]
    
    importance_df = pd.DataFrame({
        'Feature': features,
        'Importance': importance_scores
    }).sort_values('Importance', ascending=False)
    
    selected = importance_df.head(n_features)['Feature'].tolist()
    
    return selected, importance_df, model

# Explanation:
#
# 1. Gain vs Permutation:
#    - Gain tends to favor features with many possible split points
#      (continuous or high-cardinality) over binary or low-cardinality ones
#    - Permutation shuffles each feature and measures performance drop
#    - Permutation is slower but usually more trustworthy for NEPSE feature
#      selection, especially when scored on held-out data
#
# 2. Tree Importance Bias:
#    - Correlated features (RSI, Stochastic) split importance between them
#    - Each may look only moderately important, or both may reach the top-N
#      even though one of them is redundant
#    - Combine with RFE or correlation filtering for best results
```
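
The two points above (prefer permutation importance, and watch for correlated features sharing credit) can be illustrated with a small sketch. It is not the chapter's XGBoost pipeline: it assumes synthetic data, hypothetical column names (`rsi_like`, `stoch_like`, `noise`), and a scikit-learn `RandomForestRegressor` as a stand-in model, and it scores permutation importance on a held-out later period rather than on the training rows.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic stand-in for NEPSE features (hypothetical column names).
rng = np.random.default_rng(0)
n = 600
signal = rng.normal(size=n)
X_demo = pd.DataFrame({
    "rsi_like": signal + 0.05 * rng.normal(size=n),
    "stoch_like": signal + 0.05 * rng.normal(size=n),  # near-duplicate of rsi_like
    "noise": rng.normal(size=n),                       # unrelated to the target
})
y_demo = signal + 0.1 * rng.normal(size=n)

# Temporal split: first 80% of rows for training, last 20% held out.
split = int(n * 0.8)
X_tr, X_te = X_demo.iloc[:split], X_demo.iloc[split:]
y_tr, y_te = y_demo[:split], y_demo[split:]

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the held-out rows and record the drop in R².
perm = permutation_importance(rf, X_te, y_te, n_repeats=10,
                              random_state=0, scoring="r2")
holdout_importance = pd.Series(
    perm.importances_mean, index=X_demo.columns
).sort_values(ascending=False)
print(holdout_importance)
```

In a run like this the two near-duplicate columns both rank above `noise`, which is exactly why a correlation filter or RFE step is still needed to drop one of them.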

---

## **16.5 Dimensionality Reduction Techniques**

When features are highly correlated or when the feature space is too large for the sample size, dimensionality reduction creates new composite features that capture the majority of variance in fewer dimensions.

### **16.5.1 Principal Component Analysis (PCA)**

PCA transforms features into orthogonal components ranked by explained variance, removing multicollinearity and reducing noise.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def apply_pca_reduction(df, features, variance_threshold=0.95, 
                       n_components=None, visualize=True):
    """
    Apply PCA to NEPSE features for dimensionality reduction.
    
    Critical for:
    - Removing multicollinearity among technical indicators
    - Reducing noise (leading components tend to capture shared signal,
      trailing components mostly capture noise)
    - Visualization (plotting stocks in 2D/3D factor space)
    
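    Parameters:
    -----------
    variance_threshold : float
        Fraction of total variance to retain (default 0.95). Ignored when
        n_components is given.
    n_components : int or None
        Fixed number of components; if None, it is derived from
        variance_threshold.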
    """
    X = df[features]
    
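    # Standardize (essential for PCA). For backtests, fit the scaler and PCA
    # on the training window only and reuse them on later data; fitting on the
    # full history, as done here, introduces lookahead bias.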
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # Determine number of components
    if n_components is None:
        pca_full = PCA()
        pca_full.fit(X_scaled)
        cumsum = np.cumsum(pca_full.explained_variance_ratio_)
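        # First index where cumulative variance reaches the threshold; min()
        # guards against a floating-point shortfall when the threshold is near 1.0
        n_components = min(int(np.searchsorted(cumsum, variance_threshold)) + 1,
                           len(cumsum))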
    
    # Fit PCA
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X_scaled)
    
    # Create component DataFrame
    components_df = pd.DataFrame(
        X_pca, 
        columns=[f'PC{i+1}' for i in range(n_components)],
        index=df.index
    )
    
    # Component interpretation (loadings)
    loadings = pd.DataFrame(
        pca.components_.T,
        columns=[f'PC{i+1}' for i in range(n_components)],
        index=features
    )
    
    print(f"Reduced from {len(features)} to {n_components} dimensions")
    print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2%}")
    
    # Interpret top components for NEPSE
    for i in range(min(3, n_components)):
        top_features = loadings[f'PC{i+1}'].abs().sort_values(ascending=False).head(5)
        print(f"\nPC{i+1} top loadings:")
        print(top_features)
    
    if visualize and n_components >= 2:
        plt.figure(figsize=(10, 6))
        # Assumes df has a DatetimeIndex
        plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5, c=df.index.year)
        plt.xlabel('PC1')
        plt.ylabel('PC2')
        plt.title('NEPSE Stocks in PCA Space (colored by year)')
        plt.colorbar(label='Year')
        plt.show()
    
    return components_df, loadings, pca

# Explanation:
#
# 1. PCA for NEPSE Technical Indicators:
#    - PC1 often represents "Market Trend" (positive loadings on all price MAs)
#    - PC2 often represents "Volatility" (positive on ATR, Range, Volume)
#    - PC3 often represents "Momentum" (RSI, MACD, Stochastic)
#    - Instead of 30 correlated indicators, use 3-5 orthogonal components
#
# 2. Variance Threshold:
#    - 95% retains most information while removing noise
#    - For NEPSE with 200 features, typically reduces to 15-30 components
#    - Check scree plot to identify "elbow" (optimal cutoff)
#
# 3. Limitations:
#    - Components are linear combinations (lose interpretability)
#    - "PC1" is hard to explain to traders vs "RSI_Oversold"
#    - Use for intermediate modeling, not final interpretable models
```
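
For backtesting, the scaler and PCA should be fitted on the training window only and then applied unchanged to later data. The following is a minimal sketch of that pattern under stated assumptions: the data are synthetic (three latent drivers behind twelve correlated "indicators"), and the 400/100 split is arbitrary. Passing a float to `n_components` asks scikit-learn to keep just enough components to reach that share of variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_days, n_indicators = 500, 12

# Three latent drivers generate twelve correlated synthetic indicators.
latent = rng.normal(size=(n_days, 3))
mixing = rng.normal(size=(3, n_indicators))
X_all = latent @ mixing + 0.1 * rng.normal(size=(n_days, n_indicators))

# Chronological split: earlier rows train, later rows are "future" data.
split = 400
X_train, X_test = X_all[:split], X_all[split:]

# A float n_components in (0, 1) keeps enough components to explain that
# fraction of variance; svd_solver="full" is set explicitly for this mode.
reducer = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
)
Z_train = reducer.fit_transform(X_train)  # statistics learned from train only
Z_test = reducer.transform(X_test)        # reused on later data, no refit

n_kept = reducer.named_steps["pca"].n_components_
print(n_kept, Z_train.shape, Z_test.shape)
```

Because the means, standard deviations, and loadings all come from the training rows, the test-period components contain no information from the future.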

### **16.5.2 t-SNE**

t-Distributed Stochastic Neighbor Embedding is a non-linear technique for visualizing high-dimensional data in 2D or 3D, useful for identifying clusters of similar market regimes.

```python
from sklearn.manifold import TSNE

def apply_tsne_visualization(df, features, perplexity=30, n_iter=1000):
    """
    t-SNE for visualizing NEPSE market regimes and stock similarities.
    
    Non-linear dimensionality reduction preserving local neighborhoods.
    Good for identifying:
    - Clusters of similar trading days (bull vs bear regimes)
    - Outlier days (crashes, circuit breakers)
    - Evolution of market structure over time
    """
    X = df[features]
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
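    # NOTE: scikit-learn 1.5 renamed TSNE's `n_iter` argument to `max_iter`;
    # on older versions pass n_iter=n_iter instead.
    tsne = TSNE(n_components=2, perplexity=perplexity, 
                max_iter=n_iter, random_state=42)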
    X_tsne = tsne.fit_transform(X_scaled)
    
    tsne_df = pd.DataFrame(X_tsne, columns=['TSNE1', 'TSNE2'], index=df.index)
    
    # Plot with market regimes colored
    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], 
                         c=df['Close'], cmap='viridis', alpha=0.6)
    plt.colorbar(scatter, label='Close Price')
    plt.title('NEPSE Market Regimes (t-SNE)')
    plt.show()
    
    return tsne_df
```
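
Because t-SNE output depends heavily on `perplexity`, it helps to check how well an embedding preserves local neighborhoods before reading regimes into the plot. The sketch below is one way to do that, assuming two synthetic "regimes" in place of real NEPSE features and using scikit-learn's `trustworthiness` score (values range from 0 to 1, with higher values meaning neighborhoods are better preserved). The perplexity values tried are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Two synthetic "regimes" with different means across 8 features.
calm = rng.normal(loc=0.0, scale=1.0, size=(100, 8))
stressed = rng.normal(loc=4.0, scale=1.0, size=(100, 8))
X_regimes = StandardScaler().fit_transform(np.vstack([calm, stressed]))

scores = {}
for perplexity in (5, 30):
    # perplexity must be smaller than the number of samples (200 here)
    emb = TSNE(n_components=2, perplexity=perplexity,
               init="pca", random_state=0).fit_transform(X_regimes)
    # Fraction-like score of how well 10-nearest-neighbor sets survive
    scores[perplexity] = trustworthiness(X_regimes, emb, n_neighbors=10)

print(scores)
```

Note that t-SNE has no `transform` method for unseen rows, so it is suited to exploration and visualization rather than to producing model inputs for future data.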

### **16.5.3 UMAP**

Uniform Manifold Approximation and Projection is a modern technique that often preserves both local and global structure better than t-SNE.

```python
import umap  # provided by the `umap-learn` package (pip install umap-learn)

def apply_umap_reduction(df, features, n_neighbors=15, min_dist=0.1, n_components=2):
    """
    UMAP for NEPSE feature reduction and visualization.
    
    Typically faster than t-SNE and tends to preserve more global structure
    (relative distances between clusters, not just local neighborhoods).
    """
    X = df[features]
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    reducer = umap.UMAP(
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        n_components=n_components,
        random_state=42
    )
    
    X_umap = reducer.fit_transform(X_scaled)
    
    # Keep the original index so the embedding aligns with df (dates/stocks)
    return pd.DataFrame(
        X_umap,
        columns=[f'UMAP{i+1}' for i in range(n_components)],
        index=df.index
    )
```

### **16.5.4 Autoencoders**

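Autoencoders are neural networks that learn a non-linear compressed representation by being trained to reconstruct their own input through a narrow bottleneck layer. The bottleneck activations then serve as the reduced feature set.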

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_feature_autoencoder(input_dim, encoding_dim=10):
    """
    Build autoencoder for non-linear dimensionality reduction of NEPSE features.
    
    Useful when PCA fails to capture non-linear relationships between
    technical indicators (e.g., RSI thresholds are non-linear).
    """
    # Encoder
    input_layer = layers.Input(shape=(input_dim,))
    encoded = layers.Dense(64, activation='relu')(input_layer)
    encoded = layers.Dense(32, activation='relu')(encoded)
    encoded = layers.Dense(encoding_dim, activation='linear', name='bottleneck')(encoded)
    
    # Decoder
    decoded = layers.Dense(32, activation='relu')(encoded)
    decoded = layers.Dense(64, activation='relu')(decoded)
    decoded = layers.Dense(input_dim, activation='linear')(decoded)
    
    autoencoder = Model(input_layer, decoded)
    encoder = Model(input_layer, encoded)
    
    autoencoder.compile(optimizer='adam', loss='mse')
    
    return autoencoder, encoder

def train_autoencoder(df, features, encoding_dim=10, epochs=100):
    """
    Train autoencoder to compress NEPSE features into latent space.
    """
    X = df[features].values
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    autoencoder, encoder = build_feature_autoencoder(len(features), encoding_dim)
    
    # Train with early stopping
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor='val_loss', patience=10, restore_best_weights=True
    )
    
    autoencoder.fit(
        X_scaled, X_scaled,
        epochs=epochs,
        batch_size=32,
        validation_split=0.2,  # Keras holds out the LAST 20% of rows, so df must be in chronological order
        callbacks=[early_stop],
        verbose=0
    )
    
    # Extract encoded features
    encoded_features = encoder.predict(X_scaled)
    encoded_df = pd.DataFrame(
        encoded_features,
        columns=[f'AE_{i+1}' for i in range(encoding_dim)],
        index=df.index
    )
    
    return encoded_df, autoencoder, encoder
```

---

## **16.6 Feature Selection for Time-Series**

Time-series feature selection requires special handling to prevent lookahead bias and ensure temporal consistency.

```python
class TimeSeriesFeatureSelector:
    """
    Feature selection pipeline designed specifically for NEPSE time-series.
    
    Enforces:
    - No future data leakage in feature statistics
    - Rolling/expanding windows for stability metrics
    - Regime-aware selection (features stable across bull/bear markets)
    """
    
    def __init__(self, df, date_col='Date'):
        # Remember the date column so every method uses the same one
        self.date_col = date_col
        self.df = df.sort_values(date_col)
        self.selected_features = []
        
    def temporal_train_test_split(self, features, target, train_end_date):
        """
        Split data temporally: train on or before date, test after.
        """
        train_mask = self.df[self.date_col] <= train_end_date
        test_mask = self.df[self.date_col] > train_end_date
        
        X_train = self.df.loc[train_mask, features]
        y_train = self.df.loc[train_mask, target]
        X_test = self.df.loc[test_mask, features]
        y_test = self.df.loc[test_mask, target]
        
        return X_train, X_test, y_train, y_test
    
    def stability_selection(self, features, target, n_splits=5, threshold=0.7):
        """
        Select features that are consistently important across time periods.
        
        A feature might work well in 2021 (bull market) but fail in 2022 (bear).
        We want features robust across regimes.
        
        Each fold trains on an expanding window that ends at the next cut date.
        """
        # Split into temporal folds
        dates = self.df[self.date_col].quantile(np.linspace(0, 1, n_splits+1)).values
        importances = pd.DataFrame(index=features)
        
        for i in range(n_splits):
            train_end = dates[i+1]
            X_train, X_test, y_train, y_test = self.temporal_train_test_split(
                features, target, train_end
            )
            
            # Train model and get importances
            model = xgb.XGBRegressor(n_estimators=50, max_depth=3)
            model.fit(X_train, y_train)
            
            importances[f'period_{i}'] = model.feature_importances_
        
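        # Calculate stability (mean importance / std across periods).
        # Both statistics use the period columns only; otherwise the newly
        # added 'mean' column would be included in the std calculation.
        period_cols = [c for c in importances.columns if c.startswith('period_')]
        importances['mean'] = importances[period_cols].mean(axis=1)
        importances['std'] = importances[period_cols].std(axis=1)
        importances['stability'] = importances['mean'] / (importances['std'] + 1e-8)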
        
        # Select features with high mean importance and low variance (stable)
        stable_features = importances[
            (importances['mean'] > importances['mean'].quantile(0.5)) &
            (importances['stability'] > threshold)
        ].index.tolist()
        
        return stable_features, importances
    
    def select_features(self, features, target, method='mutual_info', 
                       n_features=20, validation_date='2023-01-01'):
        """
        Main selection method with temporal validation.
        """
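        # Import at the top of the method: importing inside only one branch
        # would leave these names undefined in the other branch.
        from sklearn.feature_selection import (
            SelectKBest, f_regression, mutual_info_regression
        )
        
        # Split data
        X_train, X_val, y_train, y_val = self.temporal_train_test_split(
            features, target, validation_date
        )
        
        if method == 'mutual_info':
            selector = SelectKBest(mutual_info_regression, k=n_features)
        elif method == 'f_regression':
            selector = SelectKBest(f_regression, k=n_features)
        else:
            raise ValueError(f"Unknown method: {method!r}")
        
        # Fit on the training period only (no lookahead)
        selector.fit(X_train, y_train)
        
        # Get selected feature names
        mask = selector.get_support()
        selected = [f for f, m in zip(features, mask) if m]
        
        # Validate performance. SelectKBest has no score() method, so fit a
        # small model on the selected features and compare train vs. validation R²
        model = xgb.XGBRegressor(n_estimators=50, max_depth=3, random_state=42)
        model.fit(X_train[selected], y_train)
        train_score = model.score(X_train[selected], y_train)
        val_score = model.score(X_val[selected], y_val)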
        
        print(f"Selected {len(selected)} features")
        print(f"Train score: {train_score:.4f}, Val score: {val_score:.4f}")
        print(f"Features: {selected}")
        
        return selected
```
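
The core rule of this section, that selection statistics may only see the training window, can also be shown without the class above. The following is a minimal walk-forward sketch under stated assumptions: synthetic data, hypothetical feature names (`feat_0` to `feat_9`), and scikit-learn's `TimeSeriesSplit` with `SelectKBest`. It counts how often each feature is chosen across folds.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(3)
n = 400
cols = [f"feat_{i}" for i in range(10)]  # hypothetical feature names
X_ts = pd.DataFrame(rng.normal(size=(n, 10)), columns=cols)

# Only feat_0 and feat_1 drive the synthetic target.
y_ts = 0.8 * X_ts["feat_0"] - 0.6 * X_ts["feat_1"] + 0.5 * rng.normal(size=n)

selection_counts = pd.Series(0, index=cols)
tscv = TimeSeriesSplit(n_splits=5)

for train_idx, _ in tscv.split(X_ts):
    # Fit the selector on the training window only; later rows never
    # influence which features are chosen for this fold.
    selector = SelectKBest(f_regression, k=3)
    selector.fit(X_ts.iloc[train_idx], y_ts.iloc[train_idx])
    selection_counts[selector.get_support()] += 1

# Share of folds in which each feature was selected.
selection_frequency = selection_counts / tscv.get_n_splits()
print(selection_frequency.sort_values(ascending=False))
```

Features selected in every fold are candidates for the final set, while features that appear in only one or two folds are likely picking up period-specific noise.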

---

## **16.7 Stability Analysis**

Feature stability measures how consistently a feature performs across different time periods and market regimes in NEPSE.

```python
def analyze_feature_stability(df, features, target, window=252, step=63):
    """
    Analyze how feature importance changes over time in NEPSE.
    
    Parameters:
    -----------
    window : int
        Rolling window size (252 = 1 year of NEPSE trading)
    step : int
        Step size between analyses (63 = quarterly re-evaluation)
    
    Returns stability metrics for each feature.
    """
    stability_results = {}
    
    # "+ 1" so the final full window is included
    for start in range(0, len(df) - window + 1, step):
        end = start + window
        period_data = df.iloc[start:end]
        
        # Calculate correlations for this period
        period_corr = period_data[features].corrwith(period_data[target])
        
        # iloc slicing excludes `end`, so the last row in the window is end - 1
        period_label = f"{df.index[start]} to {df.index[end - 1]}"
        stability_results[period_label] = period_corr
    
    stability_df = pd.DataFrame(stability_results).T
    
    # Calculate stability metrics
    metrics = pd.DataFrame(index=features)
    metrics['mean_correlation'] = stability_df.mean()
    metrics['std_correlation'] = stability_df.std()
    metrics['stability_ratio'] = metrics['mean_correlation'].abs() / (metrics['std_correlation'] + 0.01)
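    metrics['percent_positive'] = (stability_df > 0).mean()
    # Sign consistency: share of windows agreeing with the majority sign, so a
    # reliably negative predictor counts as just as stable as a positive one
    metrics['sign_consistency'] = np.maximum(
        metrics['percent_positive'], 1 - metrics['percent_positive']
    )
    
    # Rank by stability (high |mean correlation|, low variance, consistent sign)
    metrics['stability_score'] = (
        metrics['mean_correlation'].abs() * 
        (1 - metrics['std_correlation']) * 
        metrics['sign_consistency']
    )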
    
    return metrics.sort_values('stability_score', ascending=False), stability_df

# Interpretation for NEPSE:
# - High stability features: Volume trends, 52-week position, long-term MAs
#   (These work consistently across bull/bear markets)
# - Low stability features: Short-term RSI, daily gaps, news sentiment
#   (These work only in specific regimes)
#
# Strategy: Build ensemble using stable features as base, 
# add regime-specific features as overlays.
```
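
To make the stability metrics concrete, the sketch below applies the same rolling-window idea to synthetic data. It assumes two hypothetical features: `stable_feat`, whose relationship with the target never changes, and `regime_feat`, whose relationship reverses sign halfway through the sample. The window and step sizes mirror the defaults above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 1008  # roughly four years of trading days
target = pd.Series(rng.normal(size=n))

# +1 in the first half of the sample, -1 in the second half.
flip = np.where(np.arange(n) < n // 2, 1.0, -1.0)
feats = pd.DataFrame({
    "stable_feat": 0.5 * target + rng.normal(size=n),
    "regime_feat": 0.5 * flip * target + rng.normal(size=n),
})

window, step = 252, 63
rows = []
for start in range(0, n - window + 1, step):
    sl = slice(start, start + window)
    # Correlation of each feature with the target inside this window only
    rows.append(feats.iloc[sl].corrwith(target.iloc[sl]))
window_corr = pd.DataFrame(rows)

frac_pos = (window_corr > 0).mean()
summary = pd.DataFrame({
    "mean_corr": window_corr.mean(),
    "std_corr": window_corr.std(),
    # Share of windows agreeing with the majority sign
    # (0.5 = coin flip, 1.0 = always the same sign)
    "sign_consistency": np.maximum(frac_pos, 1 - frac_pos),
})
print(summary.round(3))
```

The regime-dependent feature shows a mean correlation near zero and a large standard deviation even though it is strongly predictive within each half, which is the pattern that separates "overlay" features from stable base features.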

---

## **16.8 Automated Feature Selection**

Automated pipelines chain several selection methods so that a reasonable feature subset can be found with little manual intervention.

```python
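from sklearn.feature_selection import RFE, VarianceThreshold
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler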

def automated_feature_selection_pipeline(df, features, target, max_features=30):
    """
    Automated multi-stage feature selection for NEPSE.
    
    Stage 1: Filter - Remove low variance and high correlation
    Stage 2: Embedded - LASSO for initial screening
    Stage 3: Wrapper - RFE with XGBoost for final selection
    
    Stability across time periods is not checked inside this function;
    run analyze_feature_stability (Section 16.7) on the returned features.
    """
    print("Stage 1: Filter Methods...")
    # Remove low variance. The threshold is in raw feature units, so it is
    # scale-dependent and must be applied before standardizing.
    X = df[features]
    selector_var = VarianceThreshold(threshold=0.01)
    selector_var.fit(X)
    features_var = [f for f, m in zip(features, selector_var.get_support()) if m]
    
    # Remove high correlation
    corr_matrix = X[features_var].corr().abs()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
    features_uncorr = [f for f in features_var if f not in to_drop]
    print(f"  Filtered from {len(features)} to {len(features_uncorr)}")
    
    print("Stage 2: LASSO Screening...")
    X_scaled = StandardScaler().fit_transform(df[features_uncorr])
    lasso = LassoCV(cv=TimeSeriesSplit(5), random_state=42)
    lasso.fit(X_scaled, df[target])
    features_lasso = [f for f, c in zip(features_uncorr, lasso.coef_) if c != 0]
    print(f"  LASSO selected {len(features_lasso)}")
    
    print("Stage 3: RFE with XGBoost...")
    if len(features_lasso) > max_features:
        rfe = RFE(
            estimator=xgb.XGBRegressor(n_estimators=50, max_depth=3),
            n_features_to_select=max_features,
            step=1
        )
        rfe.fit(df[features_lasso], df[target])
        final_features = [f for f, m in zip(features_lasso, rfe.support_) if m]
    else:
        final_features = features_lasso
    
    print(f"Final selection: {len(final_features)} features")
    return final_features
```
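
One caveat with the function above is that its filter and LASSO stages see the full sample, so any later cross-validated score is slightly optimistic. A common remedy is to put the selection step inside a scikit-learn `Pipeline`, so that it is refitted on each training fold. The sketch below illustrates this under stated assumptions: synthetic data, a fixed illustrative `alpha` for the LASSO selector rather than a tuned one, and a `Ridge` model standing in for XGBoost.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n, p = 500, 40
X_auto = rng.normal(size=(n, p))

# Only the first three columns carry signal in this synthetic target.
y_auto = (X_auto[:, 0] - 0.7 * X_auto[:, 1] + 0.5 * X_auto[:, 2]
          + rng.normal(size=n))

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFromModel(Lasso(alpha=0.1))),  # illustrative alpha, not tuned
    ("model", Ridge(alpha=1.0)),
])

# Each fold refits the scaler, selector, and model on its own training window.
fold_r2 = cross_val_score(pipe, X_auto, y_auto,
                          cv=TimeSeriesSplit(n_splits=5), scoring="r2")
print(fold_r2.round(3))

# Refit on all data to inspect which column indices the selector keeps.
pipe.fit(X_auto, y_auto)
kept = np.flatnonzero(pipe.named_steps["select"].get_support())
print(kept)
```

The trade-off is that the selected set can differ between folds, so the pipeline gives an honest performance estimate while the per-fold selections give a view of how stable the choice is.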

---

## **16.9 Evaluation and Validation**

Validating feature selection ensures the chosen subset generalizes to unseen NEPSE data and future time periods.

```python
def evaluate_feature_selection(df, all_features, selected_features, target, 
                              test_start_date='2023-01-01'):
    """
    Compare model performance with full feature set vs selected subset.
    """
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    
    # Temporal split
    train = df[df['Date'] < test_start_date]
    test = df[df['Date'] >= test_start_date]
    
    results = {}
    
    for feature_set, name in [(all_features, 'All_Features'), 
                              (selected_features, 'Selected')]:
        X_train = train[feature_set]
        X_test = test[feature_set]
        y_train = train[target]
        y_test = test[target]
        
        model = xgb.XGBRegressor(n_estimators=100, max_depth=5, random_state=42)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        
        results[name] = {
            'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
            'MAE': mean_absolute_error(y_test, y_pred),
            'R2': r2_score(y_test, y_pred),
            'Feature_Count': len(feature_set)
        }
    
    comparison = pd.DataFrame(results).T
    print(comparison)
    
    # Check for improvement
    if comparison.loc['Selected', 'RMSE'] < comparison.loc['All_Features', 'RMSE']:
        print("\nFeature selection improved generalization!")
    else:
        print("\nWarning: Feature selection may have removed predictive features.")
    
    return comparison

# Key Metrics for NEPSE Feature Selection:
# 1. Out-of-time R² (should be positive and stable)
# 2. RMSE improvement (selected should be equal or better with fewer features)
# 3. Sharpe Ratio of strategy using selected features (if trading)
# 4. Turnover stability (selected features should persist across retrainings)
```
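
The fourth metric, turnover stability, can be quantified as the overlap between feature sets chosen at successive retrainings. The sketch below uses Jaccard similarity for that purpose. The retraining labels and feature names are hypothetical, and `jaccard` is a helper defined here rather than a library function.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature-name collections (1.0 = identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)

# Hypothetical selections from three successive retrainings.
retrain_selections = {
    "2022Q4": ["RSI_14", "VWAP_dev", "Vol_spike", "Pos_52w"],
    "2023Q1": ["RSI_14", "VWAP_dev", "Vol_spike", "MACD_hist"],
    "2023Q2": ["RSI_14", "VWAP_dev", "ATR_14", "MACD_hist"],
}

# Compare each retraining with the one immediately before it.
labels = list(retrain_selections)
consecutive = {
    f"{prev}->{cur}": jaccard(retrain_selections[prev], retrain_selections[cur])
    for prev, cur in zip(labels, labels[1:])
}
print(consecutive)
```

In this toy example each consecutive pair shares three of five distinct features, so both overlaps are 0.6. Persistently low overlap between retrainings suggests the selection is tracking noise rather than durable signal.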

**End of Chapter 16**

This chapter covered systematic feature selection and dimensionality reduction techniques for managing the high-dimensional feature space in NEPSE prediction systems. It moved from filter methods (correlation, mutual information) through wrapper methods (RFE, forward selection) and embedded methods (LASSO, tree importance) to dimensionality reduction (PCA, t-SNE, UMAP, autoencoders), with specific attention to time-series validation and stability analysis across market regimes.



