# Feature Importance with Monte Carlo Cross-Validation (Python)

**Purpose:** Calculate scaled feature importance using multiple ML models  
**Method:** Normalized feature importance scaled by MC-CV Recall scores  
**Updated:** December 2024  
**Hardware:** Optimized for EC2 (32 cores, 1TB RAM)  
**Validation:** Temporal hold-out; models trained on 2016-2018 are evaluated on unseen 2019 data

## Key Features

✅ **Monte Carlo Cross-Validation** - Up to 1000 random train/test splits (100-split runs used for faster iteration)  
✅ **Stratified Sampling** - Maintains target distribution  
✅ **Parallel Processing** - Fast execution with joblib (≈30 workers)  
✅ **95% Confidence Intervals** - Narrow, precise estimates (tighter with more splits)  
✅ **Multiple Models** - Tree ensembles: CatBoost, Random Forest, XGBoost, XGBoost RF, LightGBM, ExtraTrees  
✅ **Linear Models** - LogisticRegression, LinearSVC, ElasticNet, LASSO  
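
As a rough illustration of the stratified sampling and confidence interval items above, the sketch below uses only the standard library. The names `stratified_split` and `mean_with_ci95` are hypothetical and not taken from the project's helpers, the recall values are made up, and the normal-approximation interval is an assumption about how a 95% CI could be computed across splits.

```python
import random
import statistics
from collections import defaultdict


def stratified_split(labels, train_prop, rng):
    """Return (train_idx, holdout_idx), sampling each class separately
    so both parts keep the class ratio of `labels`."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, holdout_idx = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_prop)
        train_idx.extend(shuffled[:cut])
        holdout_idx.extend(shuffled[cut:])
    return sorted(train_idx), sorted(holdout_idx)


def mean_with_ci95(values):
    """Mean and normal-approximation 95% interval across splits."""
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / len(values) ** 0.5
    return mean, mean - half_width, mean + half_width


# Toy target with 20% positives (illustrative data only)
labels = [1] * 20 + [0] * 80
rng = random.Random(42)

# Monte Carlo CV: draw many independent random splits
splits = [stratified_split(labels, 0.8, rng) for _ in range(100)]
train_idx, test_idx = splits[0]
positive_share = sum(labels[i] for i in train_idx) / len(train_idx)

# Made-up per-split recall values, only to show the interval calculation
recalls = [0.70, 0.72, 0.74]
mean, low, high = mean_with_ci95(recalls)
```

The `len(values) ** 0.5` term in the denominator is why intervals tighten as more splits are run.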

## Methodology

This notebook implements the following feature selection methodology:

1. Load cohort data from parquet files (same as FP-Growth notebook)
   - **Training data**: Years 2016-2018 (combined)
   - **Test data**: Year 2019 (avoiding COVID year 2020)
2. Create patient-level features (one-hot encoding of items)
3. For each model type:
   - Create 100–1000 MC-CV splits (each split samples from 2016-2018 training data)
   - Train model on sampled training subset
   - Evaluate Recall on 2019 test set (temporal validation)
   - Extract feature importance
   - Aggregate results across splits
4. Normalize and scale feature importance by MC-CV Recall
5. Aggregate across models
6. Extract top features
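
Steps 4 to 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `helpers_1997_13`: it assumes importances are normalized to sum to 1 per model (using absolute values, so that signed linear-model coefficients are comparable), multiplied by that model's mean MC-CV Recall, and summed across models. The function names and the numbers are hypothetical. A lower-is-better metric such as log loss would need inverting before being used as a weight; how the helper handles that is not shown here.

```python
def normalize(importance):
    """Rescale absolute importances so they sum to 1 for one model."""
    total = sum(abs(v) for v in importance.values())
    if total == 0:
        return {name: 0.0 for name in importance}
    return {name: abs(v) / total for name, v in importance.items()}


def scale_by_metric(importance, metric):
    """Weight a model's normalized importances by its mean MC-CV metric."""
    return {name: v * metric for name, v in normalize(importance).items()}


def aggregate(models):
    """Sum scaled importances across models and rank features.

    `models` is a list of (importance_dict, mean_recall) pairs.
    """
    totals = {}
    for importance, recall in models:
        for name, v in scale_by_metric(importance, recall).items():
            totals[name] = totals.get(name, 0.0) + v
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Illustrative values only: two models, two features
models = [
    ({"feature_a": 3.0, "feature_b": 1.0}, 0.8),   # e.g. a tree ensemble
    ({"feature_a": -1.0, "feature_b": 1.0}, 0.5),  # e.g. signed coefficients
]
ranked = aggregate(models)
top_features = list(ranked)[:1]
```

With these inputs the first model contributes 0.6 and 0.2, the second 0.25 and 0.25, so `feature_a` ranks first with 0.85 against 0.45 for `feature_b`.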

## Expected Runtime

- **100 splits (current default):**
  - Local (4 cores): ~2–4 hours
  - Workstation (16 cores): ~1–2 hours
  - EC2 (32 cores, 1TB RAM): ~1–2 hours ✅ **RECOMMENDED FOR DEVELOPMENT**
- **1000 splits (extended / publication-level):**
  - Local (4 cores): 8–12+ hours
  - Workstation (16 cores): ~8–16 hours
  - EC2 (32 cores, 1TB RAM): ~10–20 hours ✅ **RECOMMENDED FOR FINAL RESULTS**

**📖 Documentation:** See [Feature Importance README](README.md) for detailed documentation, usage examples, and troubleshooting.


## 1. Setup and Configuration

Load required packages and configure parallel processing.


In [None]:
import sys
import os
import pandas as pd
import numpy as np
from pathlib import Path
from typing import Dict, List, Optional, Tuple
from itertools import product
import warnings
warnings.filterwarnings('ignore')

# Add project root to path
project_root = Path.cwd().parent if Path.cwd().name == '3_feature_importance' else Path.cwd()
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Import constants
from helpers_1997_13.constants import AGE_BANDS, COHORT_NAMES, EVENT_YEARS, S3_BUCKET

# Import helper modules from helpers_1997_13
from helpers_1997_13.logging_utils import setup_r_logging, save_logs_to_s3_r, check_memory_usage_r
from helpers_1997_13.model_utils import calculate_recall, calculate_logloss
from helpers_1997_13.mc_cv_utils import run_mc_cv_method
from helpers_1997_13.feature_importance_utils import run_cohort_analysis
from helpers_1997_13.s3_utils import check_feature_importance_results_exist, check_cohort_file_exists

print("✓ All packages loaded successfully")
print(f"✓ Age bands loaded from constants: {', '.join(AGE_BANDS)}")
print(f"✓ Cohorts loaded from constants: {', '.join(COHORT_NAMES)}")
# map(str, ...) keeps the join working if EVENT_YEARS holds integers
print(f"✓ Event years loaded from constants: {', '.join(map(str, EVENT_YEARS))}")


## 2. Configuration

Set debug mode, MC-CV parameters, and model configuration.


In [None]:
# ============================================================
# DEBUG/TEST MODE - Quick testing before full run
# ============================================================
# Set DEBUG_MODE = True for quick testing (5 splits, ~2-5 min)
# Set DEBUG_MODE = False for full analysis (100 splits, ~1-2 hours on EC2)

DEBUG_MODE = False  # Change to True for quick test

if DEBUG_MODE:
    print("\n" + "="*80)
    print("🔍 DEBUG MODE ENABLED")
    print("="*80)
    print("\nQuick test configuration:")
    print("  • MC-CV Splits: 5 (instead of 100)")
    print("  • Expected time: 2-5 minutes")
    print("  ‚Ä¢ Purpose: Verify everything works before full run")
    print("\nTo run full analysis, set DEBUG_MODE = False\n")

# Configuration
# Train on 2016-2018, test on 2019 (avoiding COVID year 2020)
TRAIN_YEARS = [2016, 2017, 2018]  # Years to use for training
TEST_YEAR = 2019  # Year to use for testing

N_SPLITS = 5 if DEBUG_MODE else 100  # MC-CV splits (5 for debug, 100 for development, 1000 for production)
TEST_SIZE = 0.2  # Fraction of the 2016-2018 pool held out in each MC-CV split (not the 2019 test set)
TRAIN_PROP = 1 - TEST_SIZE  # Fraction of the 2016-2018 pool sampled for training in each split (80%)

# Scaling metric for feature importance
# Options: "recall" (default) or "logloss"
SCALING_METRIC = "recall"  # Change to "logloss" if preferred

# Model parameters
MODEL_PARAMS = {
    # Tree Ensembles
    'catboost': {
        'iterations': 100 if DEBUG_MODE else 500,
        'learning_rate': 0.1,
        'depth': 6,
        'verbose': False,
        'random_seed': 42
    },
    'random_forest': {
        'ntree': 100 if DEBUG_MODE else 500,
        'mtry': None,  # Will be set to sqrt(n_features)
        'nodesize': 1,
        'maxnodes': None,
        'random_seed': 42
    },
    'xgboost': {
        'max_depth': 6,
        'learning_rate': 0.1,
        'n_estimators': 100 if DEBUG_MODE else 500,
        'subsample': 1.0,
        'colsample_bytree': 1.0,
        'random_seed': 42
    },
    'xgboost_rf': {
        'max_depth': 6,
        'learning_rate': 0.1,
        'n_estimators': 100 if DEBUG_MODE else 500,
        'subsample': 0.8,
        'max_features': None,  # Will be set to sqrt(n_features)
        'random_seed': 42
    },
    'lightgbm': {
        'n_estimators': 100 if DEBUG_MODE else 500,
        'learning_rate': 0.1,
        'num_leaves': 31,
        'feature_fraction': 1.0,
        'bagging_fraction': 1.0,
        'bagging_freq': 0,
        'random_seed': 42
    },
    'extratrees': {
        'n_estimators': 100 if DEBUG_MODE else 500,
        'max_features': None,  # Will be set to sqrt(n_features)
        'min_samples_leaf': 1,
        'max_depth': None,
        'random_seed': 42
    },
    # Linear Models
    'logistic_regression': {
        'penalty': 'l2',
        'C': 1.0,
        'solver': 'lbfgs',
        'max_iter': 1000,
        'random_seed': 42
    },
    'linearsvc': {
        'penalty': 'l2',
        'C': 1.0,
        'loss': 'squared_hinge',
        'max_iter': 1000,
        'dual': True,
        'random_seed': 42
    },
    'elasticnet': {
        'C': 1.0,
        'l1_ratio': 0.5,
        'max_iter': 1000,
        'random_seed': 42
    },
    'lasso': {
        'C': 1.0,
        'max_iter': 1000,
        'random_seed': 42
    }
}

# Set up parallel processing
# EC2 optimization: Use 30 out of 32 cores (leave 2 for system)
import multiprocessing
N_WORKERS = int(os.getenv("N_WORKERS", "0"))
if N_WORKERS < 1:
    # Auto-detect: use all cores minus 2 for system
    total_cores = multiprocessing.cpu_count()
    N_WORKERS = max(1, total_cores - 2)
    print(f"Auto-detected {total_cores} cores, using {N_WORKERS} workers")
else:
    print(f"Using {N_WORKERS} workers from N_WORKERS environment variable")

# Output directory
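# Anchored to project_root so the path does not depend on the current working directory
output_dir = project_root / "3_feature_importance" / "outputs"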
output_dir.mkdir(exist_ok=True, parents=True)

print(f"\nOutput directory: {output_dir}")
print(f"MC-CV Configuration: {N_SPLITS} splits, {TRAIN_PROP*100:.0f}/{TEST_SIZE*100:.0f} train/test split")
print(f"Cohorts to process: {', '.join(COHORT_NAMES)}")
print(f"Age bands to process: {', '.join(AGE_BANDS)}")
print(f"Running {len(COHORT_NAMES)} cohort(s) × {len(AGE_BANDS)} age-band(s) = {len(COHORT_NAMES) * len(AGE_BANDS)} combinations")
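
The `random_forest` entry above uses R `randomForest`-style names (`ntree`, `mtry`, `nodesize`, `maxnodes`). How the helper modules translate them is not shown in this notebook. The sketch below is one plausible mapping onto scikit-learn's `RandomForestClassifier` keyword arguments; the mapping table and the function `to_sklearn_rf_kwargs` are assumptions for illustration. It also resolves `mtry=None` to the square root of the feature count, as the inline comment in the configuration describes, which matters because scikit-learn reads `max_features=None` as "use all features".

```python
import math

# Hypothetical translation table: R randomForest name -> scikit-learn name
R_TO_SKLEARN = {
    "ntree": "n_estimators",
    "mtry": "max_features",
    "nodesize": "min_samples_leaf",
    "maxnodes": "max_leaf_nodes",
    "random_seed": "random_state",
}


def to_sklearn_rf_kwargs(params, n_features):
    """Rename R-style keys and fill in the sqrt(n_features) default."""
    kwargs = {R_TO_SKLEARN.get(key, key): value for key, value in params.items()}
    if kwargs.get("max_features") is None:
        # Integer square root, never below one feature per split
        kwargs["max_features"] = max(1, math.isqrt(n_features))
    return kwargs


rf_params = {
    "ntree": 500,
    "mtry": None,
    "nodesize": 1,
    "maxnodes": None,
    "random_seed": 42,
}
rf_kwargs = to_sklearn_rf_kwargs(rf_params, n_features=100)
```

The resulting dictionary could then be passed as `RandomForestClassifier(**rf_kwargs)`, assuming scikit-learn is the backend in use.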


## 3. Parallel Execution

Run feature importance analysis for all cohort × age-band combinations in parallel.


In [None]:
from joblib import Parallel, delayed
from tqdm import tqdm

# Create all combinations of cohort and age-band
combinations = list(product(COHORT_NAMES, AGE_BANDS))

# Filter combinations: check if cohort files exist and if results already exist
combinations_to_process = []
for cohort_name, age_band in combinations:
    # Check that a training file exists for every training year
    # (each year is checked once; the result drives both the skip and the message)
    missing_years = [
        year for year in TRAIN_YEARS
        if not check_cohort_file_exists(cohort_name, age_band, year)
    ]
    if missing_years:
        print(f"⚠ Skipping {cohort_name}/{age_band}: missing training files for years: {missing_years}")
        continue
    
    # Check if test file exists
    if not check_cohort_file_exists(cohort_name, age_band, TEST_YEAR):
        print(f"⚠ Skipping {cohort_name}/{age_band}: test file not found for year {TEST_YEAR}")
        continue
    
    # Check if results already exist (idempotency) - using test_year for S3 path
    if check_feature_importance_results_exist(cohort_name, age_band, TEST_YEAR):
        print(f"✓ Skipping {cohort_name}/{age_band} (train: {TRAIN_YEARS}, test: {TEST_YEAR}): results already exist in S3")
        continue
    
    combinations_to_process.append((cohort_name, age_band))

print(f"\n{'='*80}")
print(f"Processing {len(combinations_to_process)} combinations")
print(f"{'='*80}\n")

if len(combinations_to_process) == 0:
    print("No combinations to process. All results already exist or cohort files missing.")
else:
    # Run analysis in parallel
    def run_single_combination(args):
        cohort_name, age_band = args
        try:
            result = run_cohort_analysis(
                cohort_name=cohort_name,
                age_band=age_band,
                train_years=TRAIN_YEARS,
                test_year=TEST_YEAR,
                n_splits=N_SPLITS,
                train_prop=TRAIN_PROP,
                n_workers=N_WORKERS,
                scaling_metric=SCALING_METRIC,
                model_params=MODEL_PARAMS,
                debug_mode=DEBUG_MODE,
                output_dir=str(output_dir)
            )
            return result
        except Exception as e:
            print(f"✗ Error processing {cohort_name}/{age_band} (train: {TRAIN_YEARS}, test: {TEST_YEAR}): {str(e)}")
            return {
                'cohort': cohort_name,
                'age_band': age_band,
                'train_years': TRAIN_YEARS,
                'test_year': TEST_YEAR,
                'status': 'error',
                'error': str(e)
            }
    
    # Run in parallel with progress bar
    # (tqdm wraps the task generator, so the bar tracks dispatch rather than completion)
    results = Parallel(n_jobs=min(N_WORKERS, len(combinations_to_process)), verbose=0)(
        delayed(run_single_combination)(combo) 
        for combo in tqdm(combinations_to_process, desc="Processing combinations")
    )
    
    # Print summary
    print(f"\n{'='*80}")
    print("Processing Summary")
    print(f"{'='*80}")
    
    successful = [r for r in results if r.get('status') == 'success']
    failed = [r for r in results if r.get('status') == 'error']
    
    print(f"✓ Successful: {len(successful)}")
    print(f"✗ Failed: {len(failed)}")
    
    if failed:
        print("\nFailed combinations:")
        for r in failed:
            train_years_str = ', '.join(map(str, r.get('train_years', TRAIN_YEARS)))
            print(f"  - {r['cohort']}/{r['age_band']} (train: {train_years_str}, test: {r.get('test_year', TEST_YEAR)}): {r.get('error', 'Unknown error')}")
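
The cell above starts up to `N_WORKERS` outer jobs and also passes `n_workers=N_WORKERS` into each `run_cohort_analysis` call. Whether that function parallelizes internally is not shown here; if it does, the two levels together could request far more processes than there are cores. One way to avoid that, sketched below with a hypothetical helper `split_worker_budget`, is to divide a single worker budget between the outer and inner levels.

```python
def split_worker_budget(total_workers, n_tasks):
    """Divide one core budget between outer jobs and per-job workers.

    Returns (outer_jobs, inner_workers) such that
    outer_jobs * inner_workers <= total_workers (for total_workers >= 1).
    """
    outer_jobs = max(1, min(total_workers, n_tasks))
    inner_workers = max(1, total_workers // outer_jobs)
    return outer_jobs, inner_workers


# Few combinations: give each one several inner workers
few_outer, few_inner = split_worker_budget(total_workers=30, n_tasks=4)

# Many combinations: one worker each, all parallelism at the outer level
many_outer, many_inner = split_worker_budget(total_workers=30, n_tasks=50)
```

Under this scheme, `outer_jobs` would be the `n_jobs` value given to `Parallel`, and `inner_workers` the `n_workers` value passed to each analysis call.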
