# Feature Selection: The Ten-Level Framework

## Overview

This notebook demonstrates the **Ten-Level Feature Selection Framework**, a novel methodology for robust, reproducible identification of DNA methylation biomarkers. Its graduated stringency levels let researchers trade discovery power against reproducibility.

### The Ten-Level Framework

The framework defines ten stringency levels, each with a calibrated p-value threshold and a minimum effect size:

| Level | Name | P-value threshold | Min. effect size | Use Case |
|-------|------|---------|-------------|----------|
| L1 | Discovery | 0.05 | 0.2 | Initial exploration |
| L2 | Liberal | 0.01 | 0.3 | Hypothesis generation |
| L3 | Standard | 0.005 | 0.4 | Standard analysis |
| L4 | Conservative | 0.001 | 0.5 | Robust features |
| L5 | Moderate | 0.001 | 0.5 | Balanced approach |
| L6 | Stringent | 0.0005 | 0.6 | High confidence |
| L7 | Very Stringent | 0.0001 | 0.7 | Publication quality |
| L8 | Ultra | 0.00005 | 0.8 | High replication potential |
| L9 | Extreme | 0.00001 | 0.9 | Very stringent filtering |
| L10 | Maximum | 0.000001 | 1.0 | Maximum stringency |
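
As a minimal sketch of how one level's thresholds might be applied, the snippet below keeps a probe only if its p-value is below the level's cutoff and its absolute effect size reaches the level's minimum. The `LEVELS` dictionary copies the table above. The helper names `passes_level` and `strictest_level`, and the exact comparison operators (strict `<` for the p-value, `>=` for the effect size), are assumptions for illustration and may differ from what `TenLevelFeatureSelector` does internally.

```python
# Thresholds copied from the table: level -> (p-value cutoff, minimum |effect size|)
LEVELS = {
    1: (0.05, 0.2), 2: (0.01, 0.3), 3: (0.005, 0.4), 4: (0.001, 0.5),
    5: (0.001, 0.5), 6: (0.0005, 0.6), 7: (0.0001, 0.7), 8: (0.00005, 0.8),
    9: (0.00001, 0.9), 10: (0.000001, 1.0),
}


def passes_level(p_value, effect_size, level):
    """Return True if a probe meets both thresholds of a level (hypothetical helper)."""
    p_cutoff, min_effect = LEVELS[level]
    return p_value < p_cutoff and abs(effect_size) >= min_effect


def strictest_level(p_value, effect_size):
    """Return the highest level the probe passes, or 0 if it fails even level 1."""
    passed = [lvl for lvl in LEVELS if passes_level(p_value, effect_size, lvl)]
    return max(passed, default=0)


# A made-up probe with p = 0.0004 and effect size -0.65
print(strictest_level(0.0004, -0.65))
```

The sign of the effect size is ignored here, so hypo- and hypermethylated probes are treated alike.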

### Learning Objectives

By the end of this notebook, you will be able to:

1. Configure and apply the Ten-Level Feature Selection Framework
2. Perform binary feature selection (HIIT vs Control)
3. Perform multiclass feature selection (4W/8W/12W duration)
4. Analyze time-series methylation trajectories
5. Identify consensus features across methods

## 1. Environment Setup

In [None]:
# Standard library imports
import sys
import logging
from pathlib import Path

# Scientific computing
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Project-specific imports - Feature Selection Framework
from src.features import (
    TenLevelFeatureSelector,
    FeatureSelectionConfig,
    StatisticalFeatureSelector,
    LassoFeatureSelector,
    ElasticNetFeatureSelector,
    RandomForestFeatureSelector,
    TimeSeriesFeatureAnalyzer,
    run_ttest,
    run_anova,
    calculate_effect_size,
    adjust_pvalues
)

# Visualization
from src.visualization import (
    plot_volcano,
    plot_heatmap,
    plot_feature_importance
)

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Visualization settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('colorblind')

print(f"Project root: {project_root}")

## 2. Load Preprocessed Data

In [None]:
import pickle

# Define paths
processed_dir = project_root / 'data' / 'processed'
features_dir = processed_dir / 'features'
figures_dir = project_root / 'data' / 'figures' / 'feature_importance'

# Create output directories
features_dir.mkdir(parents=True, exist_ok=True)
figures_dir.mkdir(parents=True, exist_ok=True)

# Load preprocessed methylation data
with open(processed_dir / 'methyl_data_preprocessed.pkl', 'rb') as f:
    methylation_data = pickle.load(f)

# Load sample mapping
sample_mapping = pd.read_csv(
    project_root / 'data' / 'raw' / 'GSE171140_sample_mapping.csv'
)

print(f"Methylation data: {methylation_data.shape[0]:,} probes x {methylation_data.shape[1]} samples")
print(f"Sample mapping: {len(sample_mapping)} entries")

## 3. Prepare Data for Feature Selection

Organize samples into appropriate groups for each classification task.

In [None]:
# Align sample mapping with methylation data
sample_ids = methylation_data.columns.tolist()
sample_info = sample_mapping.set_index('sample_id').loc[sample_ids].reset_index()

# Create binary labels
binary_mask = sample_info['binary_class'].isin(['HIIT', 'Control'])
binary_samples = sample_info[binary_mask]['sample_id'].tolist()
binary_labels = sample_info[binary_mask]['binary_class'].values

# Create multiclass labels (HIIT duration)
multiclass_mask = sample_info['multi_class'].notna()
multiclass_samples = sample_info[multiclass_mask]['sample_id'].tolist()
multiclass_labels = sample_info[multiclass_mask]['multi_class'].values

print("Binary Classification Dataset:")
print(f"  HIIT samples: {sum(binary_labels == 'HIIT')}")
print(f"  Control samples: {sum(binary_labels == 'Control')}")

print("\nMulticlass Classification Dataset:")
for cls in ['4W', '8W', '12W']:
    print(f"  {cls}: {sum(multiclass_labels == cls)} samples")
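
The alignment step relies on `.loc[sample_ids]`, which raises a `KeyError` if any methylation column has no matching row in the sample mapping. A small pre-check along the following lines can make such mismatches easier to diagnose. The helper `check_alignment` and the sample IDs are hypothetical and only illustrate the idea.

```python
def check_alignment(data_ids, mapping_ids):
    """Report IDs present in one source but not the other (hypothetical helper)."""
    data_set, mapping_set = set(data_ids), set(mapping_ids)
    return {
        "missing_from_mapping": sorted(data_set - mapping_set),
        "unused_in_mapping": sorted(mapping_set - data_set),
    }


# Made-up sample IDs, for illustration only
report = check_alignment(["S1", "S2", "S3"], ["S2", "S3", "S4"])
print(report)
```

In the notebook, the two arguments would be `methylation_data.columns` and `sample_mapping['sample_id']`.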

## 4. Initialize the Ten-Level Feature Selector

The `TenLevelFeatureSelector` is the main orchestrator of the framework. It integrates multiple selection methods and provides unified access to all stringency levels.

In [None]:
# Create configuration for the selector
config = FeatureSelectionConfig()

# Display available levels
print("Available Stringency Levels:")
print("=" * 60)
for level_name, level_config in config.levels.items():
    print(f"{level_name}: p-value={level_config['p_value']}, "
          f"effect_size={level_config['effect_size']}")

In [None]:
# Initialize the Ten-Level Feature Selector
selector = TenLevelFeatureSelector(config)

print("Ten-Level Feature Selector initialized.")
print("\nSelection methods available:")
print("  - Statistical: t-test, ANOVA with FDR correction")
print("  - Machine Learning: LASSO, Elastic Net, Random Forest")
print("  - Time-series: Trajectory analysis")

## 5. Binary Feature Selection: HIIT vs Control

We apply the framework to identify CpG sites that differentiate HIIT intervention samples from control/baseline samples.

In [None]:
# Extract binary classification data
X_binary = methylation_data[binary_samples].T  # Samples as rows
y_binary = (pd.Series(binary_labels) == 'HIIT').astype(int).values

print(f"Binary data shape: {X_binary.shape}")
print(f"Label distribution: HIIT={sum(y_binary)}, Control={len(y_binary)-sum(y_binary)}")

### 5.1 Statistical Selection at Different Levels

We perform feature selection at multiple stringency levels to understand the trade-off between discovery power and stringency.

In [None]:
# Run feature selection at multiple levels
levels_to_test = ['L1_discovery', 'L3_standard', 'L5_moderate', 'L7_very_stringent']

binary_results = {}
for level in levels_to_test:
    print(f"\nRunning feature selection at {level}...")
    
    # Select features at this level
    features = selector.select_binary_features(
        X_binary,
        y_binary,
        level=level
    )
    
    binary_results[level] = features
    print(f"  Selected features: {len(features)}")
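
If selection is purely threshold-based, the feature set at a stricter level should be a subset of the set at any looser level. The sketch below checks that property for adjacent levels. Whether `select_binary_features` guarantees nesting is an assumption, and `find_nesting_violations` and the probe IDs are hypothetical.

```python
def find_nesting_violations(results, ordered_levels):
    """Return (looser, stricter, extra_features) for each adjacent pair that is not nested."""
    violations = []
    for looser, stricter in zip(ordered_levels, ordered_levels[1:]):
        extra = set(results[stricter]) - set(results[looser])
        if extra:
            violations.append((looser, stricter, sorted(extra)))
    return violations


# Made-up probe IDs, for illustration only
demo = {
    "L1_discovery": ["cg01", "cg02", "cg03"],
    "L3_standard": ["cg01", "cg02"],
    "L5_moderate": ["cg02", "cg09"],
}
violations = find_nesting_violations(demo, ["L1_discovery", "L3_standard", "L5_moderate"])
print(violations)
```

An empty list means every stricter set is contained in its looser neighbour.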

In [None]:
# Visualize feature counts across levels
fig, ax = plt.subplots(figsize=(10, 6))

levels = list(binary_results.keys())
counts = [len(binary_results[level]) for level in levels]

bars = ax.bar(range(len(levels)), counts, color='steelblue', edgecolor='black')
ax.set_xticks(range(len(levels)))
ax.set_xticklabels([l.replace('_', '\n') for l in levels], rotation=0)
ax.set_ylabel('Number of Selected Features')
ax.set_title('Binary Feature Selection: Features at Different Stringency Levels')

# Add count labels above the bars; padding is in points, so it does not
# depend on the y-axis scale the way a fixed data offset would
ax.bar_label(bars, labels=[str(count) for count in counts], padding=3, fontsize=10)

plt.tight_layout()
plt.savefig(figures_dir / 'binary_feature_counts_by_level.png', dpi=150)
plt.show()

### 5.2 Detailed Statistical Analysis

For comprehensive analysis, we examine p-values, effect sizes, and their distributions.

In [None]:
# Run statistical tests on all probes
stat_selector = StatisticalFeatureSelector()

# Group samples
hiit_samples = [s for s, l in zip(binary_samples, binary_labels) if l == 'HIIT']
control_samples = [s for s, l in zip(binary_samples, binary_labels) if l == 'Control']

# Calculate statistics for all probes
print("Calculating statistics for all probes...")

# Call the module-level helpers directly so that statistics are computed
# for every probe, not only for the probes that pass a stringency level
ttest_results = run_ttest(
    methylation_data[hiit_samples],
    methylation_data[control_samples]
)

# Calculate effect sizes (Cohen's d)
effect_sizes = calculate_effect_size(
    methylation_data[hiit_samples],
    methylation_data[control_samples]
)

# Adjust p-values for multiple testing
adjusted_pvalues = adjust_pvalues(ttest_results['pvalue'], method='fdr_bh')

print(f"\nStatistics calculated for {len(ttest_results['pvalue']):,} probes")
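
The internals of `calculate_effect_size` are not shown in this notebook. For reference, the textbook definition of Cohen's d with a pooled standard deviation can be sketched as follows. The project's implementation may differ (for example in how it handles unequal variances), and the beta values below are made up.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Made-up methylation values for a single probe
hiit = [0.62, 0.66, 0.64, 0.68]
control = [0.58, 0.60, 0.62, 0.60]
print(round(cohens_d(hiit, control), 2))
```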

In [None]:
# Create a statistics DataFrame
stats_df = pd.DataFrame({
    'probe_id': methylation_data.index,
    'pvalue': ttest_results['pvalue'],
    'adjusted_pvalue': adjusted_pvalues,
    'effect_size': effect_sizes,
    't_statistic': ttest_results['statistic']
})

# Mean methylation difference (HIIT minus Control); a difference in means, not a log fold change
stats_df['mean_diff'] = (
    methylation_data[hiit_samples].mean(axis=1) - 
    methylation_data[control_samples].mean(axis=1)
).values

# Summary statistics
print("Statistical Summary:")
print(f"  Probes with p < 0.05: {(stats_df['pvalue'] < 0.05).sum():,}")
print(f"  Probes with adj p < 0.05: {(stats_df['adjusted_pvalue'] < 0.05).sum():,}")
print(f"  Probes with |effect| > 0.5: {(stats_df['effect_size'].abs() > 0.5).sum():,}")
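
The method name `'fdr_bh'` suggests the Benjamini-Hochberg procedure. A dependency-free sketch of that adjustment is shown below to make the gap between raw and adjusted counts easier to interpret. It is not the project's `adjust_pvalues` implementation, and the input p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(p, 4) for p in adjusted])
```

Each raw p-value is scaled by `m / rank`, so with hundreds of thousands of probes only very small raw p-values remain below 0.05 after adjustment.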

In [None]:
# Create volcano plot
fig, ax = plot_volcano(
    stats_df['mean_diff'],
    stats_df['pvalue'],
    title='Volcano Plot: HIIT vs Control',
    pvalue_threshold=0.05,
    effect_threshold=0.1,
    figsize=(10, 8)
)

plt.savefig(figures_dir / 'binary_volcano_plot.png', dpi=150, bbox_inches='tight')
plt.show()
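
A volcano plot places each probe at its effect (x) and `-log10(p)` (y), then highlights points that clear both thresholds. The sketch below shows that classification with the same threshold values passed to `plot_volcano` above. How `plot_volcano` actually colours points is not shown in this notebook, so the `classify_probe` helper and its labels (`hyper`, `hypo`, `ns`) are assumptions.

```python
import math


def classify_probe(mean_diff, pvalue, pvalue_threshold=0.05, effect_threshold=0.1):
    """Return (x, y, label) for one volcano-plot point (hypothetical helper)."""
    y = -math.log10(pvalue)
    significant = pvalue < pvalue_threshold
    if significant and mean_diff >= effect_threshold:
        label = "hyper"  # higher methylation in HIIT
    elif significant and mean_diff <= -effect_threshold:
        label = "hypo"   # lower methylation in HIIT
    else:
        label = "ns"     # fails the p-value or the effect threshold
    return mean_diff, y, label


# Made-up (mean difference, p-value) pairs
points = [(0.15, 0.001), (-0.12, 0.02), (0.30, 0.2), (0.02, 1e-6)]
labels = [classify_probe(diff, p)[2] for diff, p in points]
print(labels)
```

The last point illustrates why both thresholds matter: a very small p-value alone does not mark a probe when the methylation difference is negligible.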

### 5.3 Machine Learning-Based Selection

Complement statistical methods with regularized regression models.

In [None]:
# LASSO feature selection
print("Running LASSO feature selection...")

lasso_selector = LassoFeatureSelector(alpha=0.01)
lasso_features = lasso_selector.fit_select(X_binary, y_binary)

print(f"LASSO selected {len(lasso_features)} features")
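
LASSO acts as a feature selector because its L1 penalty can shrink coefficients to exactly zero. In coordinate-descent solvers this happens through the soft-thresholding operator, sketched below. The penalty strength is called `lam` here; that it corresponds to the `alpha` argument of `LassoFeatureSelector` is an assumption, and the coefficients are made up.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Made-up unpenalized coefficients for five probes
raw = {"cg01": 0.80, "cg02": -0.05, "cg03": 0.02, "cg04": -0.40, "cg05": 0.10}
shrunk = {probe: soft_threshold(coef, lam=0.1) for probe, coef in raw.items()}
selected = [probe for probe, coef in shrunk.items() if coef != 0.0]
print(selected)
```

Raising `lam` zeroes out more coefficients, which is why a larger penalty yields a shorter feature list.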

In [None]:
# Elastic Net feature selection (combines L1 and L2 regularization)
print("Running Elastic Net feature selection...")

enet_selector = ElasticNetFeatureSelector(alpha=0.01, l1_ratio=0.5)
enet_features = enet_selector.fit_select(X_binary, y_binary)

print(f"Elastic Net selected {len(enet_features)} features")

In [None]:
# Random Forest feature importance
print("Running Random Forest feature selection...")

rf_selector = RandomForestFeatureSelector(
    n_estimators=100,
    importance_threshold=0.001
)
rf_features = rf_selector.fit_select(X_binary, y_binary)

print(f"Random Forest selected {len(rf_features)} features")

### 5.4 Consensus Features

Identify features that are selected by multiple methods for higher confidence.

In [None]:
# Find consensus features across methods
statistical_features = set(binary_results['L5_moderate'])
ml_features = set(lasso_features) & set(enet_features) & set(rf_features)

# Features selected by both statistical and ML methods
consensus_features = statistical_features & ml_features

print("Consensus Feature Analysis:")
print(f"  Statistical features (L5): {len(statistical_features)}")
print(f"  ML consensus features: {len(ml_features)}")
print(f"  Overall consensus: {len(consensus_features)}")
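
A strict intersection across four methods can be very small or empty. One possible relaxation, offered here as an alternative rather than as part of the framework, is to keep features chosen by at least `min_votes` methods. The helper `vote_consensus` and the probe IDs are hypothetical.

```python
from collections import Counter


def vote_consensus(feature_sets, min_votes):
    """Return features chosen by at least min_votes of the given methods."""
    votes = Counter(f for features in feature_sets.values() for f in set(features))
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Made-up probe IDs, for illustration only
demo_sets = {
    "statistical": ["cg01", "cg02", "cg03"],
    "lasso": ["cg02", "cg03", "cg04"],
    "elastic_net": ["cg02", "cg03", "cg05"],
    "random_forest": ["cg03", "cg06"],
}
print(vote_consensus(demo_sets, min_votes=3))
print(vote_consensus(demo_sets, min_votes=4))
```

Setting `min_votes` to the number of methods reproduces the strict intersection used in the cell above.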

In [None]:
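# Visualize feature overlap (three sets only, so Elastic Net is omitted from the diagram)
# venn3 comes from the third-party matplotlib-venn package
from matplotlib_venn import venn3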

fig, ax = plt.subplots(figsize=(10, 8))

venn3(
    [statistical_features, set(lasso_features), set(rf_features)],
    set_labels=('Statistical', 'LASSO', 'Random Forest'),
    ax=ax
)
ax.set_title('Feature Selection Method Overlap (Binary Classification)')

plt.savefig(figures_dir / 'binary_feature_overlap.png', dpi=150, bbox_inches='tight')
plt.show()

## 6. Multiclass Feature Selection: HIIT Duration

Select features that distinguish between different HIIT training durations (4W, 8W, 12W).

In [None]:
# Prepare multiclass data
X_multi = methylation_data[multiclass_samples].T

# Encode labels
label_map = {'4W': 0, '8W': 1, '12W': 2}
y_multi = np.array([label_map[l] for l in multiclass_labels])

print(f"Multiclass data shape: {X_multi.shape}")
print(f"Class distribution: {np.bincount(y_multi)}")

In [None]:
# Run multiclass feature selection
multiclass_results = {}

for level in ['L1_discovery', 'L3_standard', 'L5_moderate']:
    print(f"\nRunning multiclass selection at {level}...")
    
    features = selector.select_multiclass_features(
        X_multi,
        y_multi,
        level=level
    )
    
    multiclass_results[level] = features
    print(f"  Selected features: {len(features)}")

In [None]:
# ANOVA-based selection for multiclass
print("\nRunning ANOVA for multiclass comparison...")

# Group samples by duration
groups = {
    '4W': [s for s, l in zip(multiclass_samples, multiclass_labels) if l == '4W'],
    '8W': [s for s, l in zip(multiclass_samples, multiclass_labels) if l == '8W'],
    '12W': [s for s, l in zip(multiclass_samples, multiclass_labels) if l == '12W']
}

anova_results = run_anova(
    [methylation_data[groups['4W']],
     methylation_data[groups['8W']],
     methylation_data[groups['12W']]]
)

# Adjust p-values
anova_adjusted = adjust_pvalues(anova_results['pvalue'], method='fdr_bh')

significant_probes = int((np.asarray(anova_adjusted) < 0.05).sum())
print(f"Significant probes (FDR < 0.05): {significant_probes:,}")
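
For a single probe, the one-way ANOVA F statistic compares variation between the duration groups with variation within them. The sketch below computes it from first principles. It is a textbook formula rather than the project's `run_anova`, and the values are made up.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    group_means = [mean(g) for g in groups]
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group and within-group sums of squares
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Made-up methylation values for one probe at 4W, 8W and 12W
f_stat = one_way_anova_f([[0.1, 0.2, 0.3], [0.2, 0.3, 0.4], [0.5, 0.6, 0.7]])
print(round(f_stat, 2))
```

Turning F into a p-value requires the F distribution with `(df_between, df_within)` degrees of freedom, which is omitted here.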

## 7. Time-Series Feature Analysis

Analyze methylation trajectories over the training period to identify CpG sites with consistent temporal patterns.

In [None]:
# Initialize time-series analyzer
ts_analyzer = TimeSeriesFeatureAnalyzer()

# Prepare time-series data
# For each individual, we need measurements at multiple timepoints
timepoint_map = {'Baseline': 0, '4W HIIT': 4, '8W HIIT': 8, '12W HIIT': 12}

# Create time-series structure
ts_data = sample_info[
    sample_info['time_point'].isin(timepoint_map.keys())
].copy()
ts_data['time_numeric'] = ts_data['time_point'].map(timepoint_map)

print("Time-series data structure:")
print(ts_data.groupby('time_point').size())

In [None]:
# Analyze temporal patterns
print("Analyzing temporal methylation patterns...")

# Run time-series analysis
ts_features = ts_analyzer.analyze_trajectories(
    methylation_data,
    ts_data,
    time_column='time_numeric',
    individual_column='individual_id'
)

print(f"\nIdentified {len(ts_features)} features with significant temporal patterns")

In [None]:
# Categorize temporal patterns
# Features can show: increasing, decreasing, or non-monotonic trends

if len(ts_features) > 0:
    pattern_counts = ts_features['pattern_type'].value_counts()
    print("Temporal Pattern Distribution:")
    for pattern, count in pattern_counts.items():
        print(f"  {pattern}: {count} features")
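
The pattern categories named above (increasing, decreasing, non-monotonic) can be illustrated with a simple rule on consecutive differences. This is only a sketch: how `TimeSeriesFeatureAnalyzer` assigns `pattern_type` is not shown in this notebook, and both the `classify_trajectory` helper and the extra `flat` label are assumptions.

```python
def classify_trajectory(values, tolerance=0.0):
    """Label a time-ordered series as increasing, decreasing, flat, or non-monotonic."""
    steps = [b - a for a, b in zip(values, values[1:])]
    ups = any(step > tolerance for step in steps)
    downs = any(step < -tolerance for step in steps)
    if ups and downs:
        return "non-monotonic"
    if ups:
        return "increasing"
    if downs:
        return "decreasing"
    return "flat"


# Made-up mean methylation at Baseline, 4W, 8W and 12W
print(classify_trajectory([0.40, 0.45, 0.52, 0.60]))
print(classify_trajectory([0.70, 0.60, 0.60, 0.50]))
print(classify_trajectory([0.50, 0.60, 0.55, 0.58]))
```

A non-zero `tolerance` treats small changes as noise, which matters when differences between timepoints are within measurement error.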

## 8. Feature Set Comparison

Compare features selected across different tasks and methods.

In [None]:
# Compare binary, multiclass, and time-series features
binary_set = set(binary_results['L5_moderate'])
multi_set = set(multiclass_results.get('L5_moderate', []))
ts_set = set(ts_features['probe_id'].tolist()) if len(ts_features) > 0 else set()

print("Feature Set Comparison:")
print(f"  Binary (L5): {len(binary_set)} features")
print(f"  Multiclass (L5): {len(multi_set)} features")
print(f"  Time-series: {len(ts_set)} features")

# Calculate overlaps
binary_multi = binary_set & multi_set
binary_ts = binary_set & ts_set
multi_ts = multi_set & ts_set
all_three = binary_set & multi_set & ts_set

print("\nOverlaps:")
print(f"  Binary & Multiclass: {len(binary_multi)}")
print(f"  Binary & Time-series: {len(binary_ts)}")
print(f"  Multiclass & Time-series: {len(multi_ts)}")
print(f"  All three: {len(all_three)}")
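
Raw overlap counts are hard to compare when the feature sets differ greatly in size. One common normalisation, suggested here as an optional addition, is the Jaccard index (intersection size divided by union size). The probe IDs are made up.

```python
def jaccard(a, b):
    """Return |A & B| / |A | B|, or 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up probe IDs, for illustration only
binary_demo = {"cg01", "cg02", "cg03"}
multi_demo = {"cg02", "cg03", "cg04", "cg05"}
print(jaccard(binary_demo, multi_demo))
```

A value of 1.0 means the two tasks selected identical probes, and 0.0 means they share none.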

## 9. Save Selected Features

Save all feature sets for use in downstream classification and enrichment analysis.

In [None]:
import json

# Save binary features at different levels
for level, features in binary_results.items():
    feature_path = features_dir / f'binary_features_{level}.csv'
    pd.DataFrame({'probe_id': list(features)}).to_csv(feature_path, index=False)
    print(f"Saved: {feature_path.name} ({len(features)} features)")

In [None]:
# Save multiclass features
for level, features in multiclass_results.items():
    feature_path = features_dir / f'multiclass_features_{level}.csv'
    pd.DataFrame({'probe_id': list(features)}).to_csv(feature_path, index=False)
    print(f"Saved: {feature_path.name} ({len(features)} features)")

In [None]:
# Save consensus features
consensus_path = features_dir / 'consensus_features.csv'
pd.DataFrame({'probe_id': list(consensus_features)}).to_csv(consensus_path, index=False)
print(f"Saved consensus features: {len(consensus_features)} features")

# Save time-series features
if len(ts_features) > 0:
    ts_path = features_dir / 'timeseries_features.csv'
    ts_features.to_csv(ts_path, index=False)
    print(f"Saved time-series features: {len(ts_features)} features")

In [None]:
# Save summary statistics
summary = {
    'binary_features': {
        level: len(features) for level, features in binary_results.items()
    },
    'multiclass_features': {
        level: len(features) for level, features in multiclass_results.items()
    },
    'ml_features': {
        'lasso': len(lasso_features),
        'elastic_net': len(enet_features),
        'random_forest': len(rf_features)
    },
    'consensus_features': len(consensus_features),
    'timeseries_features': len(ts_features)
}

summary_path = features_dir / 'feature_selection_summary.json'
with open(summary_path, 'w') as f:
    json.dump(summary, f, indent=2)

print(f"\nSummary saved to: {summary_path}")
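
Downstream notebooks need to read these files back. The sketch below round-trips a feature list through a CSV with the same single `probe_id` column written above, using only the standard library. The helper `load_probe_ids`, the demo file name, and the probe IDs are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def load_probe_ids(csv_path):
    """Read the probe_id column from a feature CSV (hypothetical helper)."""
    with open(csv_path, newline="") as handle:
        return [row["probe_id"] for row in csv.DictReader(handle)]


# Round-trip demo in a temporary directory with made-up probe IDs
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "binary_features_demo.csv"
    with open(demo_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["probe_id"])
        writer.writerows([["cg01"], ["cg02"]])
    loaded = load_probe_ids(demo_path)

print(loaded)
```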

## Summary

In this notebook, we demonstrated the Ten-Level Feature Selection Framework:

### Key Accomplishments

1. **Framework Configuration**: Set up graduated stringency levels with calibrated thresholds

2. **Binary Feature Selection**: Identified CpG sites differentiating HIIT from Control
   - Statistical methods (t-test with FDR correction)
   - Effect size filtering (Cohen's d)
   - Multiple stringency levels for different use cases

3. **Multiclass Feature Selection**: Found features distinguishing training durations
   - ANOVA-based selection
   - Duration-specific patterns

4. **Time-Series Analysis**: Identified temporal methylation patterns
   - Trajectory analysis across timepoints
   - Pattern categorization

5. **Consensus Features**: Combined multiple methods for robust selection
   - Statistical + ML agreement
   - Cross-task overlap analysis

### Framework Benefits

- **Reproducibility**: Standardized thresholds at each level
- **Flexibility**: Choose appropriate stringency for your use case
- **Robustness**: Requiring agreement across multiple methods is intended to reduce false discoveries
- **Interpretability**: Clear documentation of selection criteria

### Next Steps

Continue to **04_classification.ipynb** to:
- Train classifiers using selected features
- Evaluate model performance with cross-validation
- Compare different feature sets and data versions

In [None]:
# Session summary
print("=" * 60)
print("FEATURE SELECTION COMPLETE")
print("=" * 60)
print(f"\nBinary features (L5): {len(binary_results['L5_moderate'])}")
print(f"Multiclass features (L5): {len(multiclass_results.get('L5_moderate', []))}")
print(f"Consensus features: {len(consensus_features)}")
print(f"\nAll feature sets saved to: {features_dir}")