# Feature Importance Cohort Runner (EC2)

This notebook runs the **feature importance Monte Carlo CV pipeline** for every configured cohort on an **EC2 instance**.

- **Cohort scripts**: `3_feature_importance/run_cohort_*.py`
- **Data location (EC2)**: `/mnt/nvme/cohorts` (synced from `s3://pgxdatalake/gold/cohorts_F1120/`)
- **Environment**: Python environment with `xgboost`, `catboost`, `lightgbm`, `scikit-learn`, `pandas`, `numpy`, etc.

**Purpose:** Calculate scaled feature importance across several ML algorithms  
**Method:** Normalized feature importance scaled by MC-CV Recall scores  
**Hardware:** Optimized for EC2 (32 cores, 1TB RAM)  

## Key Features

✅ **Monte Carlo Cross-Validation** - Up to 1000 random train/test splits (100-split runs used for faster iteration)  
✅ **Stratified Sampling** - Maintains target distribution in every split  
✅ **95% Confidence Intervals** - Computed across splits; intervals narrow as the number of splits grows  
✅ **Multiple Models** - CatBoost (R) and Random Forest (R)  
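
As a rough illustration of the stratified sampling and confidence interval features listed above, the sketch below draws one stratified split and summarizes per-split scores. It is not the pipeline's code: `stratified_split` and `mean_with_ci95` are hypothetical helpers, and the normal-approximation interval (1.96 standard errors) is an assumption about how the 95% interval could be computed.

```python
import math
import random
import statistics
from collections import defaultdict


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx), sampling each class separately
    so both sides keep the original class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_fraction)
        test_idx.extend(shuffled[:n_test])
        train_idx.extend(shuffled[n_test:])
    return sorted(train_idx), sorted(test_idx)


def mean_with_ci95(values):
    """Mean and half-width of a normal-approximation 95% interval
    (assumed form; the pipeline may use a different estimator)."""
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width


rng = random.Random(42)
labels = [1] * 20 + [0] * 80  # toy target with 20% positives
train_idx, test_idx = stratified_split(labels, test_fraction=0.25, rng=rng)
```

Repeating `stratified_split` with fresh random draws gives the Monte Carlo splits. Because the half-width shrinks with the square root of the number of splits, a 1000-split run yields a tighter interval than a 100-split run at the same score variability.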

## Methodology

This notebook implements the following feature selection methodology:

1. Load cohort data from parquet files (same as FP-Growth notebook)
2. Create patient-level features (one-hot encoding of items)
3. For each model type:
   - Create 100–1000 stratified train/test splits
   - Train model on training set
   - Evaluate Recall on unseen test set
   - Extract feature importance
   - Aggregate results across splits
4. Normalize and scale feature importance by MC-CV Recall
5. Aggregate across models
6. Extract top features
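
Steps 4 to 6 can be sketched in a few lines. The exact formulas are not spelled out here, so this assumes that normalization means rescaling each model's importances to sum to 1, that scaling means multiplying by the model's mean MC-CV Recall, and that aggregation means summing across models. The function names, feature names, and numbers are illustrative only.

```python
from collections import defaultdict


def normalize(importances):
    """Rescale raw importances so they sum to 1 (assumed normalization)."""
    total = sum(importances.values())
    if total == 0:
        return {name: 0.0 for name in importances}
    return {name: value / total for name, value in importances.items()}


def scale_and_aggregate(model_results):
    """Weight each model's normalized importances by its mean MC-CV recall,
    then sum the weighted values across models."""
    combined = defaultdict(float)
    for result in model_results.values():
        for name, share in normalize(result["importance"]).items():
            combined[name] += share * result["recall"]
    return dict(combined)


def top_features(scores, k):
    """Return the k feature names with the highest aggregated score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy numbers for illustration only; not results from the pipeline.
model_results = {
    "catboost": {
        "importance": {"drug_a": 30.0, "icd_b": 10.0},
        "recall": 0.8,
    },
    "random_forest": {
        "importance": {"drug_a": 0.2, "icd_b": 0.6, "cpt_c": 0.2},
        "recall": 0.5,
    },
}
scores = scale_and_aggregate(model_results)
print(top_features(scores, k=2))
```

Normalizing first puts models with different importance units on a common footing, and the recall weight lets models that generalize better contribute more to the final ranking.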

# Environment

In [7]:
import os
import sys
from pathlib import Path

# -------------------------------------------------------------
# Resolve project_root robustly for BOTH notebook + script mode
# -------------------------------------------------------------
def resolve_project_root():
    # Case 1: running as a script → __file__ exists
    if '__file__' in globals():
        return Path(__file__).resolve().parents[1]

    # Case 2: running in Jupyter/Notebook → no __file__
    # Fallback = assume notebook is running inside project folder structure
    notebook_path = Path(os.getcwd()).resolve()

    # If running in .../pgx-analysis/3_feature_importance, go up 1 level
    if notebook_path.name == "3_feature_importance":
        return notebook_path.parent

    # If running deeper inside scripts, go up until pgx-analysis appears
    for parent in notebook_path.parents:
        if parent.name == "pgx-analysis":
            return parent

    # Last fallback: use current working directory
    return notebook_path


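project_root = resolve_project_root()
print(f"[INFO] Project root: {project_root}")

# Add to sys.path if needed
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Names used by the runner cells below. These definitions are assumptions:
# the interpreter running this notebook and every run_cohort_*.py script.
PROJECT_ROOT = project_root
PYTHON_BIN = Path(sys.executable)
COHORT_SCRIPTS = sorted((PROJECT_ROOT / "3_feature_importance").glob("run_cohort_*.py"))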



[INFO] Project root: /home/pgx3874/pgx-analysis


# Per-Cohort Runner Cells

Each cell below runs a **single cohort script**. This makes it easy to:

- Debug failures for a specific cohort/age-band
- Modify a cohort script and immediately re-run just that cohort

Each cell launches its script with `cwd=PROJECT_ROOT`, so script paths are given relative to the project root rather than to the notebook's own directory. `PYTHON_BIN` and `PROJECT_ROOT` must be defined before any of these cells is run.
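
The per-cohort cells all follow the same pattern, so they could also be driven by a small helper. The sketch below is one possible shape, not part of the existing scripts: `cohort_script` and `run_cohort` are hypothetical names, and the file naming is inferred from the `run_cohort_<cohort>_<age band>.py` scripts used in this notebook.

```python
import subprocess
import sys
from pathlib import Path


def cohort_script(cohort_id, age_band):
    """Build the script path relative to the project root,
    e.g. (1, "0-12") -> 3_feature_importance/run_cohort_1_0_12.py."""
    start, end = age_band.split("-")
    return Path("3_feature_importance") / f"run_cohort_{cohort_id}_{start}_{end}.py"


def run_cohort(cohort_id, age_band, project_root, python_bin=sys.executable):
    """Run one cohort script from the project root.

    check=True raises subprocess.CalledProcessError on a non-zero exit,
    so a failed cohort cannot be mistaken for a successful one.
    """
    return subprocess.run(
        [str(python_bin), str(cohort_script(cohort_id, age_band))],
        cwd=project_root,
        check=True,
    )
```

A call such as `run_cohort(1, "13-24", PROJECT_ROOT, PYTHON_BIN)` would then replace the body of the corresponding cell.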



## Cohort 1 – Age 0–12



In [8]:
# Cohort 1, Age 0-12

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_0_12.py"],
    cwd=PROJECT_ROOT,
)


Running feature importance analysis:
  Cohort: opioid_ed
  Age Band: 0-12
  Train Years: [2016, 2017, 2018]
  Test Year: 2019
  MC-CV Splits: 200
  Workers: 30
  Output Directory: 3_feature_importance/outputs

Note: This script is idempotent - models with existing results in S3 will be skipped.

2025-11-27 07:37:47,363 - INFO - FEATURE IMPORTANCE ANALYSIS - MONTE CARLO CROSS-VALIDATION
2025-11-27 07:37:47,363 - INFO - Cohort: opioid_ed
2025-11-27 07:37:47,363 - INFO - Age Band: 0-12
2025-11-27 07:37:47,363 - INFO - Train Years: 2016, 2017, 2018
2025-11-27 07:37:47,363 - INFO - Test Year: 2019
2025-11-27 07:37:47,364 - INFO - MC-CV Splits: 200
2025-11-27 07:37:47,364 - INFO - Scaling Metric: recall
2025-11-27 07:37:47,364 - INFO - Debug Mode: Disabled
2025-11-27 07:37:47,364 - INFO - Loading cohort data...
2025-11-27 07:37:47,364 - INFO - Memory usage [Before Data Loading]: 340.1 MB
2025-11-27 07:37:47,364 - INFO - Loading training data from years: 2016, 2017, 2018
2025-11-27 07:37:47,4

CompletedProcess(args=['/home/pgx3874/jupyter-env/bin/python3.11', '3_feature_importance/run_cohort_1_0_12.py'], returncode=0)

## Cohort 1 – Age 13–24



In [None]:
# Cohort 1, Age 13-24

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_13_24.py"],
    cwd=PROJECT_ROOT,
)


Running feature importance for opioid_ed / 13-24
2025-11-27 07:38:56,968 - INFO - FEATURE IMPORTANCE ANALYSIS - MONTE CARLO CROSS-VALIDATION
2025-11-27 07:38:56,968 - INFO - Cohort: opioid_ed
2025-11-27 07:38:56,968 - INFO - Age Band: 13-24
2025-11-27 07:38:56,968 - INFO - Train Years: 2016, 2017, 2018
2025-11-27 07:38:56,968 - INFO - Test Year: 2019
2025-11-27 07:38:56,968 - INFO - MC-CV Splits: 200
2025-11-27 07:38:56,968 - INFO - Scaling Metric: recall
2025-11-27 07:38:56,968 - INFO - Debug Mode: Disabled
2025-11-27 07:38:56,968 - INFO - Loading cohort data...
2025-11-27 07:38:56,968 - INFO - Memory usage [Before Data Loading]: 336.0 MB
2025-11-27 07:38:56,968 - INFO - Loading training data from years: 2016, 2017, 2018
2025-11-27 07:38:57,133 - INFO - Loaded 236568 records from year 2016
2025-11-27 07:38:57,215 - INFO - Loaded 116367 records from year 2017
2025-11-27 07:38:57,275 - INFO - Loaded 83047 records from year 2018
2025-11-27 07:38:57,323 - INFO - Combined training data: 43

## Cohort 1 – Age 25–44



In [None]:
# Cohort 1, Age 25-44

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_25_44.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 1 – Age 45–54



In [None]:
# Cohort 1, Age 45-54

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_45_54.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 1 – Age 55–64



In [None]:
# Cohort 1, Age 55-64

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_55_64.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 1 – Age 65–74



In [None]:
# Cohort 1, Age 65-74

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_65_74.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 1 – Age 75–84



In [None]:
# Cohort 1, Age 75-84

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_75_84.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 1 – Age 85–94



In [None]:
# Cohort 1, Age 85-94

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_85_94.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 1 – Age 95–114



In [None]:
# Cohort 1, Age 95-114

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_1_95_114.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 0–12



In [None]:
# Cohort 2, Age 0-12 (e.g., non-opioid_ed)

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_0_12.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 13–24



In [None]:
# Cohort 2, Age 13-24

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_13_24.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 25–44



In [None]:
# Cohort 2, Age 25-44

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_25_44.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 45–54



In [None]:
# Cohort 2, Age 45-54

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_45_54.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 55–64



In [None]:
# Cohort 2, Age 55-64

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_55_64.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 65–74



In [None]:
# Cohort 2, Age 65-74

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_65_74.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 75–84



In [None]:
# Cohort 2, Age 75-84

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_75_84.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 85–94



In [None]:
# Cohort 2, Age 85-94

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_85_94.py"],
    cwd=PROJECT_ROOT,
)


## Cohort 2 – Age 95–114



In [None]:
# Cohort 2, Age 95-114

import subprocess

subprocess.run(
    [str(PYTHON_BIN), "3_feature_importance/run_cohort_2_95_114.py"],
    cwd=PROJECT_ROOT,
)


# Run All Cohorts

In [None]:
# Run all cohort scripts sequentially
# Each script is idempotent and will skip work if results already exist in S3.

import subprocess

FAIL_FAST = True  # Stop on first failure; set to False to continue on errors
failed = []  # Scripts that exited non-zero (only filled when FAIL_FAST is False)

for script in COHORT_SCRIPTS:
    rel_path = script.relative_to(PROJECT_ROOT)
    print("=" * 80)
    print(f"Running cohort script: {rel_path}")
    print("=" * 80)

    result = subprocess.run(
        [str(PYTHON_BIN), str(rel_path)],
        cwd=PROJECT_ROOT,
    )

    if result.returncode != 0:
        msg = f"Script {rel_path} failed with exit code {result.returncode}"
        print(msg)
        if FAIL_FAST:
            raise RuntimeError(msg)
        failed.append(rel_path)

# Report failures instead of claiming success when FAIL_FAST is False
if failed:
    print(f"\n{len(failed)} cohort script(s) failed: {', '.join(map(str, failed))}")
else:
    print("\nAll cohort scripts completed (or were skipped as already done).")



# Sync Results and Code to S3

Sync output files and code (notebook + R script) to the S3 bucket:
- Outputs: CSV results files
- Code: Notebook and R script for reproducibility
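
For `aws s3 sync`, filter order matters: the AWS CLI applies `--exclude` and `--include` filters in the order given, and later filters take precedence over earlier ones. The sketch below shows the same sync expressed in Python with that ordering (exclude everything, re-include the wanted files, then exclude unwanted matches). `build_sync_command` is a hypothetical helper and the pattern lists are a shortened illustration, not the full set used by the cell.

```python
import shutil


def build_sync_command(source_dir, s3_uri, aws_cmd=None):
    """Return the argument list for an include-only `aws s3 sync`."""
    aws_cmd = aws_cmd or shutil.which("aws")
    if aws_cmd is None:
        raise FileNotFoundError("AWS CLI not found on PATH")

    # 1. Exclude everything, so only explicitly included files are synced.
    cmd = [aws_cmd, "s3", "sync", str(source_dir), s3_uri, "--exclude", "*"]
    # 2. Re-include code and outputs.
    for pattern in ("*.ipynb", "*.R", "README*.md", "outputs/*"):
        cmd += ["--include", pattern]
    # 3. Exclude unwanted files that the includes would otherwise match.
    for pattern in ("*checkpoint*", "*.tmp", "*.log", "catboost_info/*"):
        cmd += ["--exclude", pattern]
    return cmd


cmd = build_sync_command(
    ".",
    "s3://pgx-repository/pgx-analysis/3_feature_importance/",
    aws_cmd="aws",  # passed explicitly so the sketch runs without the CLI installed
)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems. Appending `--dryrun` first is a cheap way to confirm which files would be uploaded.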

In [None]:
# Sync outputs and code to S3
# On EC2, we're in the feature_importance directory  
s3_bucket <- "s3://pgx-repository/pgx-analysis/3_feature_importance/"

# Find AWS CLI (check common locations - EC2 typically has it in /usr/local/bin or /usr/bin)
aws_cmd <- Sys.which("aws")
if (aws_cmd == "") {
  # Try common EC2 installation paths
  aws_paths <- c(
    "/usr/local/bin/aws",
    "/usr/bin/aws",
    "/home/ec2-user/.local/bin/aws"
  )
  aws_cmd <- NULL
  for (path in aws_paths) {
    if (file.exists(path)) {
      aws_cmd <- path
      break
    }
  }
  if (is.null(aws_cmd)) {
    stop("AWS CLI not found. Please install AWS CLI or ensure it's in your PATH.")
  }
}

cat("Syncing outputs and code to S3...\n")
cat("Source: feature_importance/ directory\n")
cat("Destination:", s3_bucket, "\n")
cat("AWS CLI:", aws_cmd, "\n\n")

# Get current directory (should be feature_importance)
current_dir <- getwd()
if (!grepl("feature_importance", current_dir)) {
  warning("Current directory doesn't appear to be feature_importance. Double-check sync destination.")
}

# Sync feature_importance directory (includes outputs/ and code files)
# Explicitly include notebook, R scripts, README files, and outputs directory
# Exclude temporary files, checkpoints, and unnecessary directories
# Note: --delete flag removed for safety (won't delete files in S3 that don't exist locally)
# The AWS CLI applies filters in order and later filters take precedence, so
# exclude everything first, re-include the wanted files, then exclude the
# unwanted matches (checkpoints, temp files, logs) again.
sync_cmd <- sprintf(
  '"%s" s3 sync "%s" %s --exclude "*" --include "*.ipynb" --include "*.R" --include "README*.md" --include "outputs/**" --exclude "*checkpoint*" --exclude "*.tmp" --exclude "*.ipynb_checkpoints/*" --exclude "*.RData" --exclude "*.Rhistory" --exclude ".Rproj.user/*" --exclude "catboost_info/*" --exclude "*.log"',
  aws_cmd,
  current_dir,
  s3_bucket
)

cat("Running:", sync_cmd, "\n\n")
result <- system(sync_cmd)

if (result == 0) {
  cat("✓ Successfully synced outputs and code to S3\n")
  cat("  - Outputs:", file.path(current_dir, "outputs"), "\n")
  cat("  - Code: *.ipynb, *.R, README*.md\n")
} else {
  warning(sprintf("S3 sync returned exit code %d. Check AWS credentials and permissions.", result))
}

# ============================================================
# SAVE LOGS TO S3 (aligned with 2_create_cohort)
# ============================================================
cat("\n========================================\n")
cat("Saving logs to S3...\n")
cat("========================================\n")

# Close log file connection
if (exists("log_setup") && !is.null(log_setup$log_connection)) {
  if (isOpen(log_setup$log_connection)) {
    close(log_setup$log_connection)
  }
}

# Save logs to S3
if (exists("logger") && exists("log_file_path")) {
  tryCatch({
    s3_path <- save_logs_to_s3_r(log_file_path, COHORT_NAME, AGE_BAND, EVENT_YEAR, logger)
    if (!is.null(s3_path)) {
      logger$info("✓ Analysis completed successfully. Logs saved to S3.")
    }
  }, error = function(e) {
    cat(sprintf("Warning: Could not save logs to S3: %s\n", e$message))
    cat(sprintf("Log file saved locally: %s\n", log_file_path))
  })
} else {
  cat("Warning: Logger not initialized. Logs not saved to S3.\n")
}


# Shutdown EC2

In [None]:

# Shutdown EC2 instance after analysis completes
# Set SHUTDOWN_EC2 <- TRUE to enable, FALSE to disable
SHUTDOWN_EC2 <- TRUE  # Auto-shutdown is enabled; set to FALSE to keep the instance running

if (SHUTDOWN_EC2) {
  cat("\n========================================\n")
  cat("Shutting down EC2 instance...\n")
  cat("========================================\n")
  
  # Get instance ID from EC2 metadata service
  instance_id <- tryCatch({
    system("curl -s http://169.254.169.254/latest/meta-data/instance-id", intern = TRUE)
  }, error = function(e) {
    cat("Warning: Could not retrieve instance ID from metadata service.\n")
    cat("If running on EC2, check that metadata service is accessible.\n")
    return(NULL)
  })
  
  if (!is.null(instance_id) && length(instance_id) > 0 && nchar(instance_id[1]) > 0) {
    instance_id <- instance_id[1]
    cat(sprintf("Instance ID: %s\n", instance_id))
    
    # Find AWS CLI
    aws_cmd <- Sys.which("aws")
    if (aws_cmd == "") {
      aws_paths <- c(
        "/usr/local/bin/aws",
        "/usr/bin/aws",
        "/home/ec2-user/.local/bin/aws"
      )
      aws_cmd <- NULL
      for (path in aws_paths) {
        if (file.exists(path)) {
          aws_cmd <- path
          break
        }
      }
    }
    
    if (!is.null(aws_cmd) && aws_cmd != "") {
      # Stop the instance (use terminate-instances for permanent deletion)
      shutdown_cmd <- sprintf(
        '"%s" ec2 stop-instances --instance-ids %s',
        aws_cmd,
        instance_id
      )
      
      cat("Running:", shutdown_cmd, "\n")
      result <- system(shutdown_cmd)
      
      if (result == 0) {
        cat("✓ EC2 instance stop command sent successfully\n")
        cat("Instance will stop in a few moments.\n")
        cat("Note: This is a STOP (not terminate), so you can restart it later.\n")
      } else {
        warning(sprintf("EC2 stop command returned exit code %d. Check AWS credentials and permissions.", result))
      }
    } else {
      cat("Warning: AWS CLI not found. Cannot shutdown instance.\n")
      cat("Install AWS CLI or ensure it's in your PATH.\n")
    }
  } else {
    cat("Warning: Could not determine instance ID. Skipping shutdown.\n")
    cat("If you want to shutdown manually, use:\n")
    cat("  aws ec2 stop-instances --instance-ids <your-instance-id>\n")
  }
} else {
  cat("\n========================================\n")
  cat("EC2 Auto-Shutdown: DISABLED\n")
  cat("========================================\n")
  cat("To enable auto-shutdown, set SHUTDOWN_EC2 = TRUE in this cell.\n")
  cat("Instance will continue running.\n")
}
