# Feature Importance Cohort Runner (EC2)

This notebook runs the **feature importance Monte Carlo CV pipeline** for all configured cohorts on an **EC2 instance**.

- **Cohort scripts**: `3_feature_importance/run_cohort_*.py`
- **Data location (EC2)**: `/mnt/nvme/cohorts` (synced from `s3://pgxdatalake/gold/cohorts_F1120/`)
- **Environment**: Python environment with `xgboost`, `catboost`, `lightgbm`, `scikit-learn`, `pandas`, `numpy`, etc.

**Purpose:** Calculate scaled feature importance across various ML algorithms  
**Method:** Normalized feature importance scaled by MC-CV Recall scores  
**Hardware:** Optimized for EC2 (32 cores, 1TB RAM)  

## Performance Optimizations

🚀 **Parallel Processing** – Leverages all available CPU cores for maximum performance:
- **Feature Matrix Creation**: Parallel column-by-column construction using `joblib.Parallel` (uses `cpu_count() - 2` workers)
- **MC-CV Splits**: Parallel execution of Monte Carlo cross-validation splits (uses `cpu_count() - 2` workers)
- **Memory Efficient**: Replaced memory-intensive `pivot_table` with incremental column building to handle large cohorts (32,000+ features)

💾 **Memory Optimization**:
- Column-by-column feature matrix construction reduces peak memory usage
- Optimized for large cohorts (e.g., age band 25-44 with 78,000+ patients and 32,000+ features)
- Efficient handling of sparse categorical features for CatBoost
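
The column-by-column idea can be illustrated with a minimal, standard-library-only sketch. It is not the pipeline's implementation: the pipeline is described as using `joblib.Parallel`, whereas this sketch swaps in `concurrent.futures.ThreadPoolExecutor`. The names `build_column` and `build_feature_matrix` and the toy `records` are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Leave two cores free, mirroring the `cpu_count() - 2` rule described above.
N_WORKERS = max(1, (os.cpu_count() or 1) - 2)


def build_column(item, patient_ids, items_by_patient):
    """Return one binary indicator column: 1 if the patient has `item`."""
    return item, [1 if item in items_by_patient[pid] else 0 for pid in patient_ids]


def build_feature_matrix(records):
    """Build an {item: column} mapping one column at a time.

    `records` is an iterable of (patient_id, item) pairs (hypothetical layout).
    """
    items_by_patient = {}
    for pid, item in records:
        items_by_patient.setdefault(pid, set()).add(item)

    patient_ids = sorted(items_by_patient)
    all_items = sorted({item for items in items_by_patient.values() for item in items})

    # Each column is built independently, so no wide pivot is materialized at once.
    with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
        columns = dict(
            pool.map(lambda it: build_column(it, patient_ids, items_by_patient), all_items)
        )
    return patient_ids, columns


# Toy data for illustration only.
records = [("p1", "drug_a"), ("p1", "icd_x"), ("p2", "drug_a"), ("p3", "icd_y")]
patient_ids, columns = build_feature_matrix(records)
```

Here `columns["drug_a"]` holds one 0/1 value per patient in `patient_ids` order; a real run would likely store each column in a sparse or compact dtype rather than a Python list.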

## Key Features

✅ **Monte Carlo Cross-Validation** – up to 1000 random train/test splits (200-split runs used for faster iteration)  
✅ **Stratified Sampling** - Maintains target distribution  
✅ **95% Confidence Intervals** - Narrow, precise estimates (tighter with more splits)  
✅ **Multiple Models** - CatBoost, Random Forest, XGBoost, LightGBM, Extra Trees, Logistic Regression, LinearSVC, ElasticNet, Lasso  
✅ **Permutation-Based Feature Importance** - Model-agnostic importance calculation for fair comparison across models  
✅ **Idempotent Workflow** - Automatically skips models with existing results (checks local files first, then S3)  
✅ **Parallel Processing** - Leverages all available CPU cores for feature matrix creation and MC-CV splits (uses `cpu_count() - 2` workers)  
✅ **Memory Optimized** - Column-by-column feature matrix construction handles large cohorts (32,000+ features, 78,000+ patients)  
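
To make the "tighter with more splits" point concrete, here is a small sketch of a 95% confidence interval over per-split recall scores. It assumes a normal approximation (mean ± 1.96 standard errors); the notebook does not state which interval the pipeline computes, and the recall values below are simulated for illustration only. The function name `recall_confidence_interval` is hypothetical.

```python
import math
import random
import statistics


def recall_confidence_interval(recalls, z=1.96):
    """Normal-approximation 95% CI for the mean of per-split recall scores."""
    mean = statistics.mean(recalls)
    # The standard error of the mean shrinks with sqrt(number of splits).
    sem = statistics.stdev(recalls) / math.sqrt(len(recalls))
    return mean - z * sem, mean + z * sem


rng = random.Random(0)
# Simulated recall scores (illustrative only, not pipeline results).
recalls_200 = [min(1.0, max(0.0, rng.gauss(0.7, 0.05))) for _ in range(200)]
recalls_1000 = [min(1.0, max(0.0, rng.gauss(0.7, 0.05))) for _ in range(1000)]

low_200, high_200 = recall_confidence_interval(recalls_200)
low_1000, high_1000 = recall_confidence_interval(recalls_1000)
width_200 = high_200 - low_200
width_1000 = high_1000 - low_1000
```

With the same per-split spread, the 1000-split interval is roughly `sqrt(5)` times narrower than the 200-split one. Because MC-CV splits reuse the same patients, the scores are not fully independent, so an interval like this is best read as approximate.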

## Methodology

This notebook implements the following feature selection methodology:

1. Load cohort data from parquet files (same as FP-Growth notebook)
2. Create patient-level features (one-hot encoding of items)
   - **Optimized**: Parallel feature matrix creation for CatBoost categorical features
   - Filters constant features globally before MC-CV splits
3. For each model type:
   - Create 200 stratified train/test splits (parallelized across all cores)
   - Train model on training set
   - Evaluate Recall on unseen test set
   - Extract permutation-based feature importance
   - Aggregate results across splits
4. Normalize and scale feature importance by MC-CV Recall
5. Aggregate across models
6. Extract top features
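
Steps 4 to 6 can be sketched as follows. This is one plausible reading, not the pipeline's confirmed formula: it assumes each model's importances are clipped at zero, normalized to sum to 1, multiplied by that model's mean MC-CV recall, and summed across models. The function names and all numbers are illustrative.

```python
def normalize(importances):
    """Scale non-negative importances so they sum to 1 (all zeros stay zero)."""
    # Permutation importance can be negative; clipping at zero is an assumption.
    clipped = {f: max(0.0, v) for f, v in importances.items()}
    total = sum(clipped.values())
    return {f: (v / total if total else 0.0) for f, v in clipped.items()}


def aggregate(model_results, top_k=2):
    """Weight each model's normalized importances by its mean recall, then sum."""
    combined = {}
    for result in model_results.values():
        for feature, value in normalize(result["importance"]).items():
            combined[feature] = combined.get(feature, 0.0) + value * result["recall"]
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Toy inputs (illustrative values only).
model_results = {
    "xgboost": {"recall": 0.8, "importance": {"a": 3.0, "b": 1.0, "c": 0.0}},
    "lasso": {"recall": 0.5, "importance": {"a": 1.0, "b": 1.0, "c": 2.0}},
}
top_features = aggregate(model_results)
```

Scaling by recall means a model that finds few true positives contributes less to the combined ranking, even if its raw importances are large.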

## Output Files

All results are saved locally and uploaded to S3:
- **Individual model results**: `{cohort}_{age_band}_{method}_feature_importance.csv`
- **Aggregated results**: `{cohort}_{age_band}_aggregated_feature_importance.csv`
- **Constant features**: `{cohort}_{age_band}_constant_features.csv`
- **S3 location**: `s3://pgxdatalake/gold/feature_importance/{cohort}/{age_band}/`
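
The naming pattern above is what makes the idempotent skip possible. The sketch below shows only the local half of that check. The S3 lookup is omitted, the helper names are hypothetical, and the exact `age_band` and `method` strings used by the real scripts are assumptions.

```python
import tempfile
from pathlib import Path


def result_path(output_dir, cohort, age_band, method):
    """Build the per-model result path from the naming pattern listed above."""
    return Path(output_dir) / f"{cohort}_{age_band}_{method}_feature_importance.csv"


def pending_methods(output_dir, cohort, age_band, methods):
    """Return the methods that have no local result file yet."""
    return [
        m for m in methods
        if not result_path(output_dir, cohort, age_band, m).exists()
    ]


with tempfile.TemporaryDirectory() as tmp:
    # Pretend one model already finished.
    result_path(tmp, "opioid_ed", "0-12", "xgboost").write_text("feature,importance\n")
    todo = pending_methods(tmp, "opioid_ed", "0-12", ["xgboost", "lightgbm"])
```

Only `lightgbm` remains in `todo`, so a re-run would train that model and leave the existing `xgboost` result untouched.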

# Environment

In [6]:
import os
import sys
from pathlib import Path


PYTHON_BIN = Path("/home/pgx3874/jupyter-env/bin/python3.11")

if not PYTHON_BIN.exists():
    raise FileNotFoundError(
        f"Python binary not found at:\n  {PYTHON_BIN}\n"
        "Ensure your EC2 environment path is correct."
    )

print(f"[INFO] Using Python binary: {PYTHON_BIN}")

# -------------------------------------------------------------
# Resolve project_root robustly for BOTH notebook + script mode
# -------------------------------------------------------------
def resolve_project_root():
    # Case 1: running as a script → __file__ exists
    if '__file__' in globals():
        return Path(__file__).resolve().parents[1]

    # Case 2: running in Jupyter/Notebook → no __file__
    # Fallback = assume notebook is running inside project folder structure
    notebook_path = Path(os.getcwd()).resolve()

    # If running in .../pgx-analysis/3_feature_importance, go up 1 level
    if notebook_path.name == "3_feature_importance":
        return notebook_path.parent

    # If running deeper inside scripts, go up until pgx-analysis appears
    for parent in notebook_path.parents:
        if parent.name == "pgx-analysis":
            return parent

    # Last fallback: use current working directory
    return notebook_path


PROJECT_ROOT = resolve_project_root()
print(f"[INFO] Project root: {PROJECT_ROOT}")

# Add to sys.path if needed
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

# Expected EC2 data location (synced from S3) 
DATA_PATH = Path("/mnt/nvme/cohorts") 
print(f"Expected cohort data path: {DATA_PATH}") 

[INFO] Using Python binary: /home/pgx3874/jupyter-env/bin/python3.11
[INFO] Project root: /home/pgx3874/pgx-analysis
Expected cohort data path: /mnt/nvme/cohorts


# Per-Cohort Runner Cells

Each cell below runs a **single cohort script**. This makes it easy to:

- Debug failures for a specific cohort/age-band
- Modify a cohort script and immediately re-run just that cohort

All scripts automatically leverage parallel processing and are idempotent (skip completed models).

All cells assume this notebook is running from the `3_feature_importance/` directory (the default when opened from Jupyter in the project root).
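
The per-cohort cells all follow one pattern, which could be wrapped in a small helper. The sketch below is a suggestion, not part of the notebook: `cohort_command` and `run_cohort` are hypothetical names, and `check=True` is added so that a failing script raises instead of returning silently.

```python
import subprocess
import sys
from pathlib import Path


def cohort_command(python_bin, cohort_id, age_suffix):
    """Build the argument list for one cohort/age-band script."""
    script = f"3_feature_importance/run_cohort_{cohort_id}_{age_suffix}.py"
    return [str(python_bin), script]


def run_cohort(python_bin, project_root, cohort_id, age_suffix):
    """Run one cohort script from the project root; raise if it exits non-zero."""
    cmd = cohort_command(python_bin, cohort_id, age_suffix)
    # check=True raises subprocess.CalledProcessError on a non-zero exit code.
    return subprocess.run(cmd, cwd=Path(project_root), check=True)


# Building the command does not run anything.
example_cmd = cohort_command(sys.executable, 1, "0_12")
```

In the notebook this would be called as `run_cohort(PYTHON_BIN, PROJECT_ROOT, 1, "0_12")`.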



## Cohort 1 – Age 0–12



In [7]:
# Cohort 1, Age 0-12

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_0_12.py"],
    cwd=PROJECT_ROOT,
)


Running feature importance analysis:
  Cohort: opioid_ed
  Age Band: 0-12
  Train Years: [2016, 2017, 2018]
  Test Year: 2019
  MC-CV Splits: 200
  Workers: 30
  Output Directory: 3_feature_importance/outputs

Note: This script is idempotent - models with existing results in S3 will be skipped.

2025-11-27 08:36:41,274 - INFO - FEATURE IMPORTANCE ANALYSIS - MONTE CARLO CROSS-VALIDATION
2025-11-27 08:36:41,274 - INFO - Cohort: opioid_ed
2025-11-27 08:36:41,274 - INFO - Age Band: 0-12
2025-11-27 08:36:41,274 - INFO - Train Years: 2016, 2017, 2018
2025-11-27 08:36:41,274 - INFO - Test Year: 2019
2025-11-27 08:36:41,274 - INFO - MC-CV Splits: 200
2025-11-27 08:36:41,274 - INFO - Scaling Metric: recall
2025-11-27 08:36:41,274 - INFO - Debug Mode: Disabled
2025-11-27 08:36:41,274 - INFO - Loading cohort data...
2025-11-27 08:36:41,275 - INFO - Memory usage [Before Data Loading]: 336.3 MB
2025-11-27 08:36:41,275 - INFO - Loading training data from years: 2016, 2017, 2018
2025-11-27 08:36:41,2

CompletedProcess(args=['/home/pgx3874/jupyter-env/bin/python3.11', '3_feature_importance/run_cohort_1_0_12.py'], returncode=0)

## Cohort 1 – Age 13–24



In [None]:
# Cohort 1, Age 13-24
# Medium cohort: ~9,800 patients × 12,500+ features

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_13_24.py"],
    cwd=PROJECT_ROOT,
)


Running feature importance for opioid_ed / 13-24
2025-11-27 08:36:55,989 - INFO - FEATURE IMPORTANCE ANALYSIS - MONTE CARLO CROSS-VALIDATION
2025-11-27 08:36:55,989 - INFO - Cohort: opioid_ed
2025-11-27 08:36:55,989 - INFO - Age Band: 13-24
2025-11-27 08:36:55,989 - INFO - Train Years: 2016, 2017, 2018
2025-11-27 08:36:55,989 - INFO - Test Year: 2019
2025-11-27 08:36:55,989 - INFO - MC-CV Splits: 200
2025-11-27 08:36:55,989 - INFO - Scaling Metric: recall
2025-11-27 08:36:55,989 - INFO - Debug Mode: Disabled
2025-11-27 08:36:55,989 - INFO - Loading cohort data...
2025-11-27 08:36:55,990 - INFO - Memory usage [Before Data Loading]: 336.1 MB
2025-11-27 08:36:55,990 - INFO - Loading training data from years: 2016, 2017, 2018
2025-11-27 08:36:56,150 - INFO - Loaded 236568 records from year 2016
2025-11-27 08:36:56,229 - INFO - Loaded 116367 records from year 2017
2025-11-27 08:36:56,283 - INFO - Loaded 83047 records from year 2018
2025-11-27 08:36:56,332 - INFO - Combined training data: 43

## Cohort 1 – Age 25–44



In [None]:
# Cohort 1, Age 25-44
# Large cohort: ~78,000 patients × 32,000+ features

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_25_44.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 1 – Age 45–54



In [None]:
# Cohort 1, Age 45-54

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_45_54.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 1 – Age 55–64



In [None]:
# Cohort 1, Age 55-64

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_55_64.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 1 – Age 65–74



In [None]:
# Cohort 1, Age 65-74

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_65_74.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 1 – Age 75–84



In [None]:
# Cohort 1, Age 75-84
# Note: Feature matrix creation and MC-CV splits run in parallel using all available cores

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_75_84.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 1 – Age 85–94



In [None]:
# Cohort 1, Age 85-94
# Note: Feature matrix creation and MC-CV splits run in parallel using all available cores

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_85_94.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 1 – Age 95–114



In [None]:
# Cohort 1, Age 95-114
# Note: Feature matrix creation and MC-CV splits run in parallel using all available cores

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_95_114.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 0–12



In [None]:
# Cohort 2, Age 0-12 (e.g., non_opioid_ed)

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_0_12.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 13–24



In [None]:
# Cohort 2, Age 13-24

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_13_24.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 25–44



In [None]:
# Cohort 2, Age 25-44
# Large cohort: Similar size to Cohort 1, Age 25-44

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_25_44.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 45–54



In [None]:
# Cohort 2, Age 45-54

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_45_54.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 55–64



In [None]:
# Cohort 2, Age 55-64

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_55_64.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 65–74



In [None]:
# Cohort 2, Age 65-74

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_65_74.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 75–84



In [None]:
# Cohort 2, Age 75-84

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_75_84.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 85–94



In [None]:
# Cohort 2, Age 85-94

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_85_94.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 95–114



In [None]:
# Cohort 2, Age 95-114

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_95_114.py"],
    cwd=PROJECT_ROOT,
)


# Run All Cohorts

## Run All Age Bands in Parallel (Per Cohort)

Use this section to run all **9 age bands** for a given cohort in parallel on EC2.

- Uses `ThreadPoolExecutor` to launch multiple `run_cohort_*` scripts concurrently
- Each script remains idempotent (skips models with existing results in S3)
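- Adjust `MAX_PARALLEL_AGE_BANDS` based on available CPU/memory: each script already starts `cpu_count() - 2` workers of its own, so running all 9 at once oversubscribes the cores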
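
As a rough guide for choosing `MAX_PARALLEL_AGE_BANDS`, the heuristic below estimates how many scripts fit on the machine without exceeding the core count. It considers CPU only and ignores memory, and `max_concurrent_scripts` is a hypothetical helper. Whether the cohort scripts accept a lower worker count is not stated in this notebook.

```python
import os


def max_concurrent_scripts(total_cores, workers_per_script):
    """How many scripts fit without exceeding the core count (at least 1)."""
    return max(1, total_cores // max(1, workers_per_script))


total_cores = os.cpu_count() or 1
# Each script is described as using cpu_count() - 2 workers.
workers_per_script = max(1, total_cores - 2)
suggested = max_concurrent_scripts(total_cores, workers_per_script)
```

On a 32-core instance with 30 workers per script, `max_concurrent_scripts(32, 30)` gives 1. Higher concurrency may still be reasonable when most age bands are small or already completed, since skipped models cost little.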



In [None]:
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed

# Configuration: which cohort to run in parallel
# 1 => opioid_ed (run_cohort_1_*.py)
# 2 => non_opioid_ed (run_cohort_2_*.py)
COHORT_ID = 1

# Age-band suffixes used in the script filenames
AGE_BAND_SUFFIXES = [
    "0_12",
    "13_24",
    "25_44",
    "45_54",
    "55_64",
    "65_74",
    "75_84",
    "85_94",
    "95_114",
]

MAX_PARALLEL_AGE_BANDS = 9  # Set lower (e.g., 3-4) if memory is tight
FAIL_FAST = True


def run_age_band_script(script_rel: str) -> int:
    """Run a single cohort age-band script and return its exit code."""
    print("=" * 80)
    print(f"[PARALLEL] Starting: {script_rel}")
    print("=" * 80)

    result = subprocess.run(
        [str(PYTHON_BIN), script_rel],
        cwd=PROJECT_ROOT,
    )

    if result.returncode == 0:
        print(f"[PARALLEL] COMPLETED: {script_rel}")
    else:
        print(f"[PARALLEL] FAILED ({result.returncode}): {script_rel}")
    return result.returncode


scripts = [
    f"3_feature_importance/run_cohort_{COHORT_ID}_{suffix}.py"
    for suffix in AGE_BAND_SUFFIXES
]

print(f"Running {len(scripts)} age bands in parallel for cohort ID {COHORT_ID}...")

errors = []
with ThreadPoolExecutor(max_workers=MAX_PARALLEL_AGE_BANDS) as executor:
    future_to_script = {
        executor.submit(run_age_band_script, script): script
        for script in scripts
    }

    for future in as_completed(future_to_script):
        script = future_to_script[future]
        try:
            code = future.result()
        except Exception as exc:
            print(f"[PARALLEL] EXCEPTION in {script}: {exc}")
            errors.append((script, str(exc)))
        else:
            if code != 0:
                errors.append((script, f"exit code {code}"))

        if errors and FAIL_FAST:
            # Cancel scripts that have not started yet. Scripts that are already
            # running cannot be cancelled; leaving the `with` block waits for them.
            for pending in future_to_script:
                pending.cancel()
            break

if errors:
    print("\nOne or more age-band runs failed:")
    for script, msg in errors:
        print(f"  - {script}: {msg}")
else:
    print("\nAll age bands completed successfully (or were skipped as already done).")



In [None]:
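# Run all cohort scripts sequentially
# Each script is idempotent and will skip work if results already exist in S3.

import subprocess

FAIL_FAST = True  # Stop on first failure; set to False to continue on errors

# Discover every cohort script (pattern from the notebook header).
# Note: sorted() orders filenames lexicographically, not by age band.
COHORT_SCRIPTS = sorted((PROJECT_ROOT / "3_feature_importance").glob("run_cohort_*.py"))

for script in COHORT_SCRIPTS: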
    rel_path = script.relative_to(PROJECT_ROOT)
    print("=" * 80)
    print(f"Running cohort script: {rel_path}")
    print("=" * 80)

    result = subprocess.run(
        [str(PYTHON_BIN), str(rel_path)],
        cwd=PROJECT_ROOT,
    )

    if result.returncode != 0:
        msg = f"Script {rel_path} failed with exit code {result.returncode}"
        print(msg)
        if FAIL_FAST:
            raise RuntimeError(msg)

print("\nAll cohort scripts completed (or were skipped as already done).")



# Sync Results and Code to S3

Sync output files and code (notebook + R script) to S3 bucket. 
- Outputs: CSV results files
- Code: Notebook and R script for reproducibility
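
For readers running the sync from the Python kernel, the sketch below builds an equivalent `aws s3 sync` command. It relies on the AWS CLI rule that filters are applied in order and later filters take precedence, so `--exclude "*"` comes first and the `--include` patterns follow. The helper name `build_sync_command` and the shortened pattern lists are assumptions for illustration.

```python
import shutil


def build_sync_command(source_dir, s3_dest, aws_bin=None):
    """Build an `aws s3 sync` argument list that uploads only code and outputs."""
    aws = aws_bin or shutil.which("aws")
    if aws is None:
        raise FileNotFoundError("AWS CLI not found on PATH")
    cmd = [aws, "s3", "sync", str(source_dir), s3_dest]
    # Later filters take precedence, so exclude everything first, then re-include.
    cmd += ["--exclude", "*"]
    for pattern in ("*.ipynb", "*.R", "README*.md", "outputs/*"):
        cmd += ["--include", pattern]
    # These final excludes override the includes above.
    for pattern in ("*checkpoint*", "*.log"):
        cmd += ["--exclude", pattern]
    return cmd


cmd = build_sync_command(
    ".",
    "s3://pgx-repository/pgx-analysis/3_feature_importance/",
    aws_bin="aws",  # passed explicitly so the example does not need the CLI installed
)
```

The list can be passed to `subprocess.run(cmd, check=True)`; appending `--dryrun` first previews what would be uploaded.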

In [None]:
# Sync outputs and code to S3
# On EC2, we're in the feature_importance directory  
s3_bucket <- "s3://pgx-repository/pgx-analysis/3_feature_importance/"

# Find AWS CLI (check common locations - EC2 typically has it in /usr/local/bin or /usr/bin)
aws_cmd <- Sys.which("aws")
if (aws_cmd == "") {
  # Try common EC2 installation paths
  aws_paths <- c(
    "/usr/local/bin/aws",
    "/usr/bin/aws",
    "/home/ec2-user/.local/bin/aws"
  )
  aws_cmd <- NULL
  for (path in aws_paths) {
    if (file.exists(path)) {
      aws_cmd <- path
      break
    }
  }
  if (is.null(aws_cmd)) {
    stop("AWS CLI not found. Please install AWS CLI or ensure it's in your PATH.")
  }
}

cat("Syncing outputs and code to S3...\n")
cat("Source: feature_importance/ directory\n")
cat("Destination:", s3_bucket, "\n")
cat("AWS CLI:", aws_cmd, "\n\n")

# Get current directory (should be feature_importance)
current_dir <- getwd()
if (!grepl("feature_importance", current_dir)) {
  warning("Current directory doesn't appear to be feature_importance. Double-check sync destination.")
}

# Sync feature_importance directory (includes outputs/ and code files)
# Explicitly include notebook, R scripts, README files, and outputs directory
# Exclude temporary files, checkpoints, and unnecessary directories
# Note: --delete flag removed for safety (won't delete files in S3 that don't exist locally)
# AWS CLI filters are applied in order and later filters take precedence, so
# exclude everything first, re-include the wanted files, then drop temporary files
sync_cmd <- sprintf(
  '"%s" s3 sync "%s" %s --exclude "*" --include "*.ipynb" --include "*.R" --include "README*.md" --include "outputs/**" --exclude "*checkpoint*" --exclude "*.tmp" --exclude "*.ipynb_checkpoints/*" --exclude "*.RData" --exclude "*.Rhistory" --exclude ".Rproj.user/*" --exclude "catboost_info/*" --exclude "*.log"',
  aws_cmd,
  current_dir,
  s3_bucket
)

cat("Running:", sync_cmd, "\n\n")
result <- system(sync_cmd)

if (result == 0) {
  cat("✓ Successfully synced outputs and code to S3\n")
  cat("  - Outputs:", file.path(current_dir, "outputs"), "\n")
  cat("  - Code: *.ipynb, *.R, README*.md\n")
} else {
  warning(sprintf("S3 sync returned exit code %d. Check AWS credentials and permissions.", result))
}

# ============================================================
# SAVE LOGS TO S3 (aligned with 2_create_cohort)
# ============================================================
cat("\n========================================\n")
cat("Saving logs to S3...\n")
cat("========================================\n")

# Close log file connection
if (exists("log_setup") && !is.null(log_setup$log_connection)) {
  if (isOpen(log_setup$log_connection)) {
    close(log_setup$log_connection)
  }
}

# Save logs to S3
if (exists("logger") && exists("log_file_path")) {
  tryCatch({
    s3_path <- save_logs_to_s3_r(log_file_path, COHORT_NAME, AGE_BAND, EVENT_YEAR, logger)
    if (!is.null(s3_path)) {
      logger$info("✓ Analysis completed successfully. Logs saved to S3.")
    }
  }, error = function(e) {
    cat(sprintf("Warning: Could not save logs to S3: %s\n", e$message))
    cat(sprintf("Log file saved locally: %s\n", log_file_path))
  })
} else {
  cat("Warning: Logger not initialized. Logs not saved to S3.\n")
}


# Shutdown EC2

In [None]:

# Shutdown EC2 instance after analysis completes
# Set SHUTDOWN_EC2 = TRUE to enable, FALSE to disable
SHUTDOWN_EC2 <- TRUE  # Auto-shutdown is enabled; set to FALSE to keep the instance running

if (SHUTDOWN_EC2) {
  cat("\n========================================\n")
  cat("Shutting down EC2 instance...\n")
  cat("========================================\n")
  
  # Get instance ID from EC2 metadata service
  instance_id <- tryCatch({
    system("curl -s http://169.254.169.254/latest/meta-data/instance-id", intern = TRUE)
  }, error = function(e) {
    cat("Warning: Could not retrieve instance ID from metadata service.\n")
    cat("If running on EC2, check that metadata service is accessible.\n")
    return(NULL)
  })
  
  if (!is.null(instance_id) && length(instance_id) > 0 && nchar(instance_id[1]) > 0) {
    instance_id <- instance_id[1]
    cat(sprintf("Instance ID: %s\n", instance_id))
    
    # Find AWS CLI
    aws_cmd <- Sys.which("aws")
    if (aws_cmd == "") {
      aws_paths <- c(
        "/usr/local/bin/aws",
        "/usr/bin/aws",
        "/home/ec2-user/.local/bin/aws"
      )
      aws_cmd <- NULL
      for (path in aws_paths) {
        if (file.exists(path)) {
          aws_cmd <- path
          break
        }
      }
    }
    
    if (!is.null(aws_cmd) && aws_cmd != "") {
      # Stop the instance (use terminate-instances for permanent deletion)
      shutdown_cmd <- sprintf(
        '"%s" ec2 stop-instances --instance-ids %s',
        aws_cmd,
        instance_id
      )
      
      cat("Running:", shutdown_cmd, "\n")
      result <- system(shutdown_cmd)
      
      if (result == 0) {
        cat("✓ EC2 instance stop command sent successfully\n")
        cat("Instance will stop in a few moments.\n")
        cat("Note: This is a STOP (not terminate), so you can restart it later.\n")
      } else {
        warning(sprintf("EC2 stop command returned exit code %d. Check AWS credentials and permissions.", result))
      }
    } else {
      cat("Warning: AWS CLI not found. Cannot shutdown instance.\n")
      cat("Install AWS CLI or ensure it's in your PATH.\n")
    }
  } else {
    cat("Warning: Could not determine instance ID. Skipping shutdown.\n")
    cat("If you want to shutdown manually, use:\n")
    cat("  aws ec2 stop-instances --instance-ids <your-instance-id>\n")
  }
} else {
  cat("\n========================================\n")
  cat("EC2 Auto-Shutdown: DISABLED\n")
  cat("========================================\n")
  cat("To enable auto-shutdown, set SHUTDOWN_EC2 = TRUE in this cell.\n")
  cat("Instance will continue running.\n")
}
