# Global FPGrowth Feature Importance Analysis - Target-Focused

**Purpose:** Target-focused pattern mining - discovers patterns that PREDICT outcomes  
**Updated:** November 24, 2025  
**Hardware:** x2iedn.8xlarge (32 vCPUs, 1024 GiB RAM, NVMe SSD)  
**Data:** `/mnt/nvme/cohorts/` (instance storage for fast I/O)  
**Output:** `s3://pgxdatalake/gold/fpgrowth/global/{item_type}/`

## 🎯 Target-Focused Approach

**NEW:** Rather than saving one undifferentiated rule set, the pipeline writes three rule files per item type, separating predictive rules from baseline rules:
- **`rules_TARGET_ICD.json`** - Patterns predicting opioid dependence (F11.20-F11.29)
- **`rules_TARGET_ED.json`** - Patterns predicting ED visits (HCG Lines: P51, O11, P33)
- **`rules_CONTROL.json`** - Baseline patterns (non-target, for comparison)

## Key Features

✅ **Target-Focused** - Predictive rules only, not descriptive  
✅ **Three Item Types** - Drugs, ICD codes, CPT codes  
✅ **Global Patterns** - 5.7M patients across all cohorts  
✅ **Quality Over Quantity** - 40-50% minimum confidence (40% for drugs/ICD, 50% for CPT), not 1%  
✅ **Comparative Analysis** - Target vs Control differences

## Methodology

For each item type (drug_name, icd_code, cpt_code):
1. Extract all unique items from cohort data
2. Add target markers (TARGET_ICD, TARGET_ED) to patient transactions
3. Create patient-level transactions
4. Run FP-Growth to find frequent itemsets
5. Generate ALL association rules
6. **Split rules**: Target-predicting vs Control (baseline)
7. **Split target rules**: ICD vs ED outcomes
8. Save three separate files per item type
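
Steps 2 and 6 can be illustrated on toy data. The following is a minimal sketch, not the notebook's implementation: it counts rule metrics directly instead of running FP-Growth, and it assumes each patient arrives with two boolean outcome flags (how markers are attached is not shown in the helper functions of this notebook). The marker labels follow the examples printed in the configuration cell; `add_target_markers` and `rule_metrics` are hypothetical names.

```python
from typing import Dict, FrozenSet, List, Set

# Marker labels as they appear in the configuration cell's example rules.
ICD_MARKER = "TARGET_ICD:OPIOID_DEPENDENCE"
ED_MARKER = "TARGET_ED:EMERGENCY_DEPT"


def add_target_markers(items: Set[str], has_icd: bool, has_ed: bool) -> FrozenSet[str]:
    """Return the patient's item set with outcome markers appended (hypothetical helper)."""
    marked = set(items)
    if has_icd:
        marked.add(ICD_MARKER)
    if has_ed:
        marked.add(ED_MARKER)
    return frozenset(marked)


def rule_metrics(transactions: List[FrozenSet[str]],
                 antecedent: FrozenSet[str],
                 consequent: FrozenSet[str]) -> Dict[str, float]:
    """Support, confidence and lift of antecedent -> consequent by direct counting."""
    n = len(transactions)
    n_a = sum(antecedent <= t for t in transactions)
    n_c = sum(consequent <= t for t in transactions)
    n_ac = sum((antecedent | consequent) <= t for t in transactions)
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return {"support": n_ac / n, "confidence": confidence, "lift": lift}


# Toy patients: (items, has opioid-dependence diagnosis, has ED visit)
patients = [
    ({"gabapentin", "oxycodone"}, True, False),
    ({"gabapentin", "oxycodone"}, True, True),
    ({"gabapentin"}, False, False),
    ({"metoprolol"}, False, False),
]
transactions = [add_target_markers(*p) for p in patients]

metrics = rule_metrics(
    transactions,
    antecedent=frozenset({"gabapentin", "oxycodone"}),
    consequent=frozenset({ICD_MARKER}),
)
print(metrics)
```

A rule counts as target-predicting when its consequent contains one of the markers; every other rule falls into the control set.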

## Expected Runtime (x2iedn.8xlarge)

### Parallel Execution (3 workers, all running simultaneously):

- **Drug names**: 15-25 min (28K items, 5.7M transactions)
- **ICD codes**: 15-25 min (20K items, 5.7M transactions)
- **CPT codes**: 20-35 min (15K items, more complex patterns)
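- **Total (PARALLEL)**: **⚡ 20-35 minutes** (the longest job determines the runtime)

**Sequential would be**: 50-85 min  
**Speedup**: **~2.4-2.5x faster** with parallel execution (50/20 to 85/35)  
**Note**: the execution cell in this notebook processes item types sequentially to avoid OOM errors, so the 50-85 min figure applies unless that cell is changed to run jobs in parallel.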

**Cost**: ~$2-3 on Spot pricing (~$6-7 on-demand)

### Why Parallel Works Here:
- ✅ 1024 GiB RAM → ~340 GiB per worker (the execution cell estimates a 300-500 GB peak per job, so headroom is tight at the upper end)
- ✅ 32 vCPUs → ~10 cores per worker
- ✅ Independent jobs → no data sharing needed
- ✅ NVMe SSD → fast I/O even with 3 concurrent reads
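
The runtime and resource figures above can be recomputed from the per-job estimates. This is a back-of-the-envelope sketch only; the 300-500 GB peak per job is the estimate quoted in the execution cell later in this notebook, not a measurement.

```python
# Per-job runtime estimates in minutes (low, high), copied from the list above.
RUNTIME_MIN = {"drug_name": (15, 25), "icd_code": (15, 25), "cpt_code": (20, 35)}

# Sequential time is the sum of the jobs; parallel time is the slowest job.
sequential = (sum(lo for lo, _ in RUNTIME_MIN.values()),
              sum(hi for _, hi in RUNTIME_MIN.values()))
parallel = (max(lo for lo, _ in RUNTIME_MIN.values()),
            max(hi for _, hi in RUNTIME_MIN.values()))
speedup = (sequential[1] / parallel[1], sequential[0] / parallel[0])

WORKERS = 3
TOTAL_RAM_GIB = 1024
TOTAL_VCPUS = 32
ram_per_worker_gib = TOTAL_RAM_GIB / WORKERS
vcpus_per_worker = TOTAL_VCPUS // WORKERS

# Assumption: peak memory per job of 300-500 GB (decimal), as estimated in the execution cell.
total_ram_gb = TOTAL_RAM_GIB * 2**30 / 1e9
fits = {peak_gb: WORKERS * peak_gb <= total_ram_gb for peak_gb in (300, 500)}

print(f"Sequential: {sequential[0]}-{sequential[1]} min")
print(f"Parallel:   {parallel[0]}-{parallel[1]} min")
print(f"Speedup:    {speedup[0]:.2f}x-{speedup[1]:.2f}x")
print(f"Per worker: {ram_per_worker_gib:.0f} GiB RAM, {vcpus_per_worker} vCPUs")
print(f"Three concurrent jobs fit in RAM: {fits}")
```

Under these assumptions three concurrent jobs fit at the low end of the peak estimate but not at the high end, which is the trade-off behind choosing parallel or sequential execution.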

## Data Scale

- **Total Events**: 947 million
- **Patients**: 5.7 million
- **Unique Drugs**: ~28,000
- **Unique ICD Codes**: ~20,000
- **Unique CPT Codes**: ~15,000
- **Data Location**: `/mnt/nvme/cohorts/` (NVMe SSD for fast I/O)
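
These counts give a rough upper bound on the size of the one-hot transaction matrix. The sketch below assumes a dense boolean matrix at one byte per cell with one column per unique item, which is what `TransactionEncoder.transform` produces by default; the real footprint depends on how many items actually occur and on intermediate copies.

```python
PATIENTS = 5_700_000  # approximate patient count from the list above
UNIQUE_ITEMS = {"drug_name": 28_000, "icd_code": 20_000, "cpt_code": 15_000}


def dense_bool_matrix_gib(n_rows: int, n_cols: int) -> float:
    """Size in GiB of a dense boolean matrix, assuming 1 byte per cell."""
    return n_rows * n_cols / 2**30


estimates = {
    item_type: dense_bool_matrix_gib(PATIENTS, n_items)
    for item_type, n_items in UNIQUE_ITEMS.items()
}
for item_type, gib in estimates.items():
    print(f"{item_type}: ~{gib:.1f} GiB")
```

If the dense matrix turns out to be the bottleneck, `TransactionEncoder.transform(..., sparse=True)` is an option worth evaluating.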


## 1. Setup and Imports


In [None]:
import sys
import time
import json
import logging
from pathlib import Path
from typing import List, Dict
from concurrent.futures import ProcessPoolExecutor, as_completed
import pandas as pd
import boto3
from mlxtend.frequent_patterns import fpgrowth, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Add project root to path
project_root = Path.cwd().parent if Path.cwd().name == '3_fpgrowth_analysis' else Path.cwd()
sys.path.insert(0, str(project_root))

from helpers_1997_13.duckdb_utils import get_duckdb_connection

print("✓ All imports successful (including ProcessPoolExecutor for parallel execution)")
print(f"✓ Project root: {project_root}")


✓ All imports successful
✓ Project root: /home/pgx3874/pgx-analysis


## 2. Configuration


In [None]:
# =============================================================================
# EC2 CONFIGURATION (32 cores, 1TB RAM)
# =============================================================================

# FP-Growth parameters (quality-focused for ML features)
MIN_SUPPORT = 0.01       # Items must appear in 1% of patients (5.7M patients = 57K occurrences)
MIN_CONFIDENCE = 0.4     # 40% confidence - meaningful associations for CatBoost

# Item-specific thresholds (balance coverage vs quality)
MIN_CONFIDENCE_CPT = 0.5 # 50% confidence for CPT - strong procedure associations
MIN_SUPPORT_CPT = 0.02   # 2% support for CPT - focuses on common procedures

# Rule limits (quality over quantity)
MAX_RULES_PER_ITEM_TYPE = 5000  # Top 5000 rules by lift (for ML feature engineering)

# Target-focused rule mining (NEW!)
TARGET_FOCUSED = True  # Only generate rules that predict target outcomes
TARGET_ICD_CODES = ['F11.20', 'F11.21', 'F11.22', 'F11.23', 'F11.24', 'F11.25', 'F11.29']  # Opioid dependence codes
TARGET_HCG_LINES = [
    "P51 - ER Visits and Observation Care",
    "O11 - Emergency Room",
    "P33 - Urgent Care Visits"
]  # ED visits (HCG Line codes - matches phase2_event_processing.py)
TARGET_PREFIXES = ['TARGET_ICD:', 'TARGET_ED:']  # Prefixes for target items in transactions

# Item types to process
ITEM_TYPES = ['drug_name', 'icd_code', 'cpt_code']

# Paths (x2iedn.8xlarge optimized)
S3_OUTPUT_BASE = "s3://pgxdatalake/gold/fpgrowth/global"
LOCAL_DATA_PATH = Path("/mnt/nvme/cohorts")  # Instance storage (NVMe SSD for fast I/O)

# Setup logger with file output (prevents Jupyter rate limit issues)
logger = logging.getLogger('global_fpgrowth')
logger.setLevel(logging.INFO)
logger.handlers.clear()  # Clear any existing handlers

# File handler - full logs to file
log_file = project_root / "3_fpgrowth_analysis" / "global_fpgrowth_execution.log"
file_handler = logging.FileHandler(log_file)
file_handler.setLevel(logging.INFO)
file_handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
logger.addHandler(file_handler)

# Console handler - only major milestones (prevents Jupyter rate limit)
console_handler = logging.StreamHandler()
console_handler.setLevel(logging.WARNING)  # Only warnings/errors to console
console_handler.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
logger.addHandler(console_handler)

print(f"✓ Min Support (drug/ICD): {MIN_SUPPORT} (1% = ~57K patients)")
print(f"✓ Min Support (CPT): {MIN_SUPPORT_CPT} (2% = ~114K patients)")
print(f"✓ Min Confidence (drug/ICD): {MIN_CONFIDENCE} (40% - meaningful associations)")
print(f"✓ Min Confidence (CPT): {MIN_CONFIDENCE_CPT} (50% - strong procedure patterns)")
print(f"✓ Max Rules per Item Type: {MAX_RULES_PER_ITEM_TYPE:,} (top by lift)")
print(f"✓ Item Types: {ITEM_TYPES}")
print(f"✓ S3 Output: {S3_OUTPUT_BASE}")
print(f"✓ Local Data: {LOCAL_DATA_PATH}")
print(f"✓ Local Data Exists: {LOCAL_DATA_PATH.exists()}")
print(f"✓ Detailed logs → {log_file}")
print(f"✓ Console output: WARNING level only (check log file for progress)")
print("\n🎯 Quality-Focused Global Analysis:")
print("  - 5.7M patients = strong statistical power")
print("  - High confidence (40-50%) = meaningful patterns for CatBoost")
print("  - Top 5,000 rules per type = comprehensive yet manageable")
print("  - Encoding maps for all frequent items (for feature engineering)")
print(f"\n🎯 TARGET-FOCUSED RULE MINING: {'ENABLED' if TARGET_FOCUSED else 'DISABLED'}")
if TARGET_FOCUSED:
    print(f"  - Target ICD codes: {TARGET_ICD_CODES}")
    print(f"  - Target HCG lines (ED visits): {TARGET_HCG_LINES}")
    print("  - Only generates rules that PREDICT target outcomes")
    print("  - Example: {Metoprolol, Gabapentin} → {TARGET_ICD:OPIOID_DEPENDENCE}")
    print("  - Example: {99213: Office Visit, J0670: Morphine} → {TARGET_ED:EMERGENCY_DEPT}")
    print("  ✅ Focus on PREDICTIVE patterns across 5.7M patients")
    print("  ✅ Better CatBoost features (what predicts target?)")
    print("  ✅ Interpretable global patterns (risk factors)")


✓ Min Support: 0.01
✓ Min Confidence: 0.01
✓ Item Types: ['drug_name', 'icd_code', 'cpt_code']
✓ S3 Output: s3://pgxdatalake/gold/fpgrowth/global
✓ Local Data: /home/pgx3874/pgx-analysis/data
✓ Local Data Exists: True
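
The support thresholds in the configuration cell translate into minimum patient counts. A minimal sketch, assuming roughly 5.7M patient transactions per item type; `thresholds_for` and `min_patient_count` are hypothetical helpers, not part of the notebook.

```python
from typing import Tuple

N_PATIENTS = 5_700_000  # approximate; the real count is the number of transactions built

# (min_support, min_confidence) values mirroring the configuration cell.
DEFAULT_THRESHOLDS = (0.01, 0.4)
OVERRIDES = {"cpt_code": (0.02, 0.5)}


def thresholds_for(item_type: str) -> Tuple[float, float]:
    """Return (min_support, min_confidence) for an item type."""
    return OVERRIDES.get(item_type, DEFAULT_THRESHOLDS)


def min_patient_count(item_type: str, n_patients: int = N_PATIENTS) -> int:
    """Smallest number of patients an itemset must cover to count as frequent."""
    min_support, _ = thresholds_for(item_type)
    return round(min_support * n_patients)


for item_type in ("drug_name", "icd_code", "cpt_code"):
    support, confidence = thresholds_for(item_type)
    print(f"{item_type}: support={support}, confidence={confidence}, "
          f"min patients={min_patient_count(item_type):,}")
```

Keeping the overrides in one lookup avoids repeating the `item_type == 'cpt_code'` check in several places.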


## 3. Define Helper Functions


In [4]:
def extract_global_items(local_data_path: Path, item_type: str, logger: logging.Logger) -> List[str]:
    """Extract all unique items of specified type from local cohort data."""
    logger.info(f"Extracting global {item_type}s from local cohort data...")
    start_time = time.time()
    
    con = get_duckdb_connection(logger=logger)
    parquet_pattern = str(local_data_path / "**" / "cohort.parquet")
    
    if item_type == 'drug_name':
        query = f"""
        SELECT DISTINCT drug_name as item
        FROM read_parquet('{parquet_pattern}', hive_partitioning=1)
        WHERE drug_name IS NOT NULL AND drug_name != '' AND event_type = 'pharmacy'
        ORDER BY item
        """
    elif item_type == 'icd_code':
        query = f"""
        WITH all_icds AS (
            SELECT primary_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) 
            WHERE primary_icd_diagnosis_code IS NOT NULL AND event_type = 'medical'
            UNION ALL
            SELECT two_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) 
            WHERE two_icd_diagnosis_code IS NOT NULL AND event_type = 'medical'
            UNION ALL
            SELECT three_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) 
            WHERE three_icd_diagnosis_code IS NOT NULL AND event_type = 'medical'
            UNION ALL
            SELECT four_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) 
            WHERE four_icd_diagnosis_code IS NOT NULL AND event_type = 'medical'
            UNION ALL
            SELECT five_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) 
            WHERE five_icd_diagnosis_code IS NOT NULL AND event_type = 'medical'
        )
        SELECT DISTINCT icd as item FROM all_icds WHERE icd != '' ORDER BY item
        """
    elif item_type == 'cpt_code':
        query = f"""
        SELECT DISTINCT procedure_code as item
        FROM read_parquet('{parquet_pattern}', hive_partitioning=1)
        WHERE procedure_code IS NOT NULL AND procedure_code != '' AND event_type = 'medical'
        ORDER BY item
        """
    else:
        raise ValueError(f"Unknown item_type: {item_type}")
    
    logger.info(f"Running query for {item_type}...")
    df = con.execute(query).df()
    con.close()
    items = df['item'].tolist()
    
    elapsed = time.time() - start_time
    logger.info(f"✓ Extracted {len(items):,} unique {item_type}s in {elapsed:.1f}s")
    return items


def create_global_transactions(local_data_path: Path, item_type: str, logger: logging.Logger) -> List[List[str]]:
    """Create patient-level transactions from local cohort data."""
    logger.info(f"Creating global {item_type} transactions...")
    start_time = time.time()
    
    con = get_duckdb_connection(logger=logger)
    parquet_pattern = str(local_data_path / "**" / "cohort.parquet")
    
    if item_type == 'drug_name':
        query = f"""
        SELECT mi_person_key, drug_name as item
        FROM read_parquet('{parquet_pattern}', hive_partitioning=1)
        WHERE drug_name IS NOT NULL AND drug_name != '' AND event_type = 'pharmacy'
        """
    elif item_type == 'icd_code':
        query = f"""
        WITH all_icds AS (
            SELECT mi_person_key, primary_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) 
            WHERE primary_icd_diagnosis_code IS NOT NULL AND event_type = 'medical'
            UNION ALL
            SELECT mi_person_key, two_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) 
            WHERE two_icd_diagnosis_code IS NOT NULL AND event_type = 'medical'
            UNION ALL
            SELECT mi_person_key, three_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) 
            WHERE three_icd_diagnosis_code IS NOT NULL AND event_type = 'medical'
            UNION ALL
            SELECT mi_person_key, four_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) 
            WHERE four_icd_diagnosis_code IS NOT NULL AND event_type = 'medical'
            UNION ALL
            SELECT mi_person_key, five_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) 
            WHERE five_icd_diagnosis_code IS NOT NULL AND event_type = 'medical'
        )
        SELECT mi_person_key, icd as item FROM all_icds WHERE icd != ''
        """
    elif item_type == 'cpt_code':
        query = f"""
        SELECT mi_person_key, procedure_code as item
        FROM read_parquet('{parquet_pattern}', hive_partitioning=1)
        WHERE procedure_code IS NOT NULL AND procedure_code != '' AND event_type = 'medical'
        """
    else:
        raise ValueError(f"Unknown item_type: {item_type}")
    
    logger.info(f"Loading {item_type} events...")
    df = con.execute(query).df()
    con.close()
    
    logger.info(f"Grouping by patient...")
    transactions = (
        df.groupby('mi_person_key')['item']
        .apply(lambda x: sorted(set(x.tolist())))
        .tolist()
    )
    
    elapsed = time.time() - start_time
    logger.info(f"✓ Created {len(transactions):,} patient transactions in {elapsed:.1f}s")
    return transactions

print("✓ Helper functions defined")


✓ Helper functions defined
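
Both helpers repeat the same five-way `UNION ALL` over the ICD diagnosis columns. One way to avoid the duplication is to build the query from a column list. This is a sketch only: `build_icd_query` is a hypothetical helper, and it assumes the five column names used in the queries above.

```python
from typing import List

# ICD diagnosis columns, as they appear in the queries above.
ICD_COLUMNS: List[str] = [
    "primary_icd_diagnosis_code",
    "two_icd_diagnosis_code",
    "three_icd_diagnosis_code",
    "four_icd_diagnosis_code",
    "five_icd_diagnosis_code",
]


def build_icd_query(parquet_pattern: str, with_person_key: bool) -> str:
    """Build the ICD query; include mi_person_key for transactions, omit it for distinct items."""
    key = "mi_person_key, " if with_person_key else ""
    selects = [
        f"SELECT {key}{col} AS icd "
        f"FROM read_parquet('{parquet_pattern}', hive_partitioning=1) "
        f"WHERE {col} IS NOT NULL AND event_type = 'medical'"
        for col in ICD_COLUMNS
    ]
    union = "\n    UNION ALL\n    ".join(selects)
    if with_person_key:
        final = "SELECT mi_person_key, icd AS item FROM all_icds WHERE icd != ''"
    else:
        final = "SELECT DISTINCT icd AS item FROM all_icds WHERE icd != '' ORDER BY item"
    return f"WITH all_icds AS (\n    {union}\n)\n{final}"


query = build_icd_query("/mnt/nvme/cohorts/**/cohort.parquet", with_person_key=True)
print(query)
```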


## 4. FP-Growth Processing

In [None]:
def process_item_type(item_type, local_data_path, s3_output_base, min_support, min_confidence, logger):
    """Process a single item type: extract, encode, FP-Growth, save to S3."""
    logger.info(f"\n{'='*80}")
    logger.info(f"Processing {item_type.upper()}")
    logger.info(f"{'='*80}")
    overall_start = time.time()
    
    try:
        # Extract items
        items = extract_global_items(local_data_path, item_type, logger)
        
        # Create transactions
        transactions = create_global_transactions(local_data_path, item_type, logger)
        
        # Encode transactions
        logger.info(f"Encoding {len(transactions):,} transactions...")
        encode_start = time.time()
        te = TransactionEncoder()
        te_ary = te.fit(transactions).transform(transactions)
        df_encoded = pd.DataFrame(te_ary, columns=te.columns_)
        encode_time = time.time() - encode_start
        logger.info(f"✓ Encoded to {df_encoded.shape} matrix in {encode_time:.1f}s")
        
        # Run FP-Growth (use item-specific parameters)
        actual_min_support = MIN_SUPPORT_CPT if item_type == 'cpt_code' else min_support
        logger.info(f"Running FP-Growth (min_support={actual_min_support})...")
        fpgrowth_start = time.time()
        itemsets = fpgrowth(df_encoded, min_support=actual_min_support, use_colnames=True)
        itemsets = itemsets.sort_values('support', ascending=False).reset_index(drop=True)
        fpgrowth_time = time.time() - fpgrowth_start
        logger.info(f"✓ Found {len(itemsets):,} frequent itemsets in {fpgrowth_time:.1f}s")
        
        # Generate association rules (with item-specific confidence and limits)
        actual_min_confidence = MIN_CONFIDENCE_CPT if item_type == 'cpt_code' else min_confidence
        logger.info(f"Generating association rules (min_confidence={actual_min_confidence})...")
        rules_start = time.time()
        
        try:
            all_rules = association_rules(itemsets, metric="confidence", min_threshold=actual_min_confidence)
            
            if len(all_rules) > 0:
                # Split rules: target-predicting vs control (non-target)
                if TARGET_FOCUSED:
                    # Target rules: consequent contains target marker
                    target_mask = all_rules['consequents'].apply(
                        lambda x: any(item.startswith(tuple(TARGET_PREFIXES)) for item in x)
                    )
                    rules_target = all_rules[target_mask].copy()
                    rules_control = all_rules[~target_mask].copy()
                    
                    logger.info(f"Split: {len(rules_target):,} target rules, {len(rules_control):,} control rules")
                    
                    # Limit both sets to top N by lift
                    if len(rules_target) > 0:
                        rules_target = rules_target.sort_values('lift', ascending=False)
                        if len(rules_target) > MAX_RULES_PER_ITEM_TYPE:
                            logger.info(f"Keeping top {MAX_RULES_PER_ITEM_TYPE:,} target rules (from {len(rules_target):,})")
                            rules_target = rules_target.head(MAX_RULES_PER_ITEM_TYPE)
                        rules_target = rules_target.reset_index(drop=True)
                    
                    if len(rules_control) > 0:
                        rules_control = rules_control.sort_values('lift', ascending=False)
                        if len(rules_control) > MAX_RULES_PER_ITEM_TYPE:
                            logger.info(f"Keeping top {MAX_RULES_PER_ITEM_TYPE:,} control rules (from {len(rules_control):,})")
                            rules_control = rules_control.head(MAX_RULES_PER_ITEM_TYPE)
                        rules_control = rules_control.reset_index(drop=True)
                    
                    # Keep target rules as main 'rules' for backward compatibility
                    rules = rules_target
                else:
                    # Not target-focused: all rules are kept
                    rules = all_rules.sort_values('lift', ascending=False).head(MAX_RULES_PER_ITEM_TYPE).reset_index(drop=True)
                    rules_control = pd.DataFrame()
                
                logger.info(f"Final: {len(rules):,} target rules, {len(rules_control):,} control rules")
            else:
                logger.info(f"No rules met confidence threshold of {actual_min_confidence}")
                rules = pd.DataFrame()
                rules_control = pd.DataFrame()
                
            rules_time = time.time() - rules_start
            logger.info(f"✓ Rule generation completed in {rules_time:.1f}s")
            
        except MemoryError as e:
            logger.error(f"MemoryError during rule generation")
            rules = pd.DataFrame()
            rules_control = pd.DataFrame()
            rules_time = time.time() - rules_start
        
        # Create encoding map
        encoding_map = {}
        for idx, row in itemsets.iterrows():
            if len(row['itemsets']) == 1:
                item = list(row['itemsets'])[0]
                encoding_map[item] = {'support': float(row['support']), 'rank': int(idx)}
        logger.info(f"✓ Created encoding map with {len(encoding_map):,} items")
        
        # Save to S3
        logger.info(f"Saving results to S3...")
        s3_client = boto3.client('s3')
        prefix = f"gold/fpgrowth/global/{item_type}"
        
        # Convert frozensets to lists
        itemsets_json = itemsets.copy()
        itemsets_json['itemsets'] = itemsets_json['itemsets'].apply(list)
        
        # Prepare rules for saving (split target rules by type, plus control)
        rules_by_target = {}
        
        # Process target rules (split by ICD vs ED)
        if not rules.empty:
            rules_json = rules.copy()
            rules_json['antecedents'] = rules_json['antecedents'].apply(list)
            rules_json['consequents'] = rules_json['consequents'].apply(list)
            
            # Split target rules by outcome type
            rules_by_target['TARGET_ICD'] = rules_json[
                rules_json['consequents'].apply(lambda x: any('TARGET_ICD:' in str(item) for item in x))
            ]
            rules_by_target['TARGET_ED'] = rules_json[
                rules_json['consequents'].apply(lambda x: any('TARGET_ED:' in str(item) for item in x))
            ]
        
        # Process control rules (non-target patterns)
        if not rules_control.empty:
            rules_control_json = rules_control.copy()
            rules_control_json['antecedents'] = rules_control_json['antecedents'].apply(list)
            rules_control_json['consequents'] = rules_control_json['consequents'].apply(list)
            rules_by_target['CONTROL'] = rules_control_json
        
        if rules_by_target:
            logger.info(f"Prepared for S3: {len(rules_by_target.get('TARGET_ICD', pd.DataFrame())):,} ICD, "
                       f"{len(rules_by_target.get('TARGET_ED', pd.DataFrame())):,} ED, "
                       f"{len(rules_by_target.get('CONTROL', pd.DataFrame())):,} control")
        
        # Upload files
        s3_client.put_object(Bucket='pgxdatalake', Key=f"{prefix}/encoding_map.json", 
                            Body=json.dumps(encoding_map, indent=2))
        s3_client.put_object(Bucket='pgxdatalake', Key=f"{prefix}/itemsets.json", 
                            Body=itemsets_json.to_json(orient='records', indent=2))
        
        # Save rules by target type (separate files)
        if rules_by_target:
            for target_type, target_rules in rules_by_target.items():
                if not target_rules.empty:
                    s3_client.put_object(
                        Bucket='pgxdatalake', 
                        Key=f"{prefix}/rules_{target_type}.json",
                        Body=target_rules.to_json(orient='records', indent=2)
                    )
                    logger.info(f"✓ Saved {len(target_rules):,} {target_type} rules to S3")
        else:
            logger.info(f"✓ No rules to save (empty rules)")
        
        # Save metrics
        metrics = {
            'item_type': item_type, 
            'min_support': actual_min_support, 
            'min_confidence': actual_min_confidence,
            'max_rules_limit': MAX_RULES_PER_ITEM_TYPE,
            'rules_truncated': len(rules) == MAX_RULES_PER_ITEM_TYPE if len(rules) > 0 else False,
            'unique_items': len(items), 
            'total_transactions': len(transactions),
            'frequent_itemsets': len(itemsets), 
            'association_rules': len(rules),
            'rules_by_target': {
                'TARGET_ICD': len(rules_by_target.get('TARGET_ICD', pd.DataFrame())),
                'TARGET_ED': len(rules_by_target.get('TARGET_ED', pd.DataFrame())),
                'CONTROL': len(rules_by_target.get('CONTROL', pd.DataFrame()))
            } if rules_by_target else {'TARGET_ICD': 0, 'TARGET_ED': 0, 'CONTROL': 0},
            'encoding_map_size': len(encoding_map),
            'target_focused': TARGET_FOCUSED,
            'target_icd_codes': TARGET_ICD_CODES if TARGET_FOCUSED else None,
            'target_hcg_lines': TARGET_HCG_LINES if TARGET_FOCUSED else None,
            'processing_time_seconds': {'total': time.time() - overall_start}
        }
        s3_client.put_object(Bucket='pgxdatalake', Key=f"{prefix}/metrics.json", 
                            Body=json.dumps(metrics, indent=2))
        
        logger.info(f"✓ {item_type.upper()} COMPLETE - {len(itemsets):,} itemsets, {len(rules):,} rules")
        return metrics
    except Exception as e:
        logger.error(f"✗ Failed: {e}", exc_info=True)
        return {'item_type': item_type, 'error': str(e)}

print("✓ process_item_type function defined")


✓ process_item_type function defined
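
The split-and-cap logic inside `process_item_type` can be seen in isolation on plain records. This is a simplified sketch: it uses dicts instead of the mlxtend rules DataFrame, and `split_rules` is a hypothetical name. Like the function above, it caps the combined target rules by lift first and only then separates ICD from ED outcomes.

```python
from typing import Dict, List

TARGET_PREFIXES = ("TARGET_ICD:", "TARGET_ED:")


def split_rules(rules: List[dict], max_rules: int) -> Dict[str, List[dict]]:
    """Split rules into TARGET_ICD / TARGET_ED / CONTROL, keeping the top rules by lift."""

    def predicts(rule: dict, prefix) -> bool:
        # str.startswith accepts a single prefix or a tuple of prefixes
        return any(item.startswith(prefix) for item in rule["consequents"])

    def top(candidates: List[dict]) -> List[dict]:
        return sorted(candidates, key=lambda r: r["lift"], reverse=True)[:max_rules]

    target = top([r for r in rules if predicts(r, TARGET_PREFIXES)])
    control = top([r for r in rules if not predicts(r, TARGET_PREFIXES)])
    return {
        "TARGET_ICD": [r for r in target if predicts(r, "TARGET_ICD:")],
        "TARGET_ED": [r for r in target if predicts(r, "TARGET_ED:")],
        "CONTROL": control,
    }


# Toy rules with made-up lift values, for illustration only.
rules = [
    {"antecedents": ["gabapentin"], "consequents": ["TARGET_ICD:OPIOID_DEPENDENCE"], "lift": 3.0},
    {"antecedents": ["99213"], "consequents": ["TARGET_ED:EMERGENCY_DEPT"], "lift": 2.0},
    {"antecedents": ["metoprolol"], "consequents": ["lisinopril"], "lift": 1.5},
    {"antecedents": ["oxycodone"], "consequents": ["TARGET_ICD:OPIOID_DEPENDENCE"], "lift": 4.0},
]
split = split_rules(rules, max_rules=2)
print({name: len(group) for name, group in split.items()})
```

In this toy run the cap of 2 keeps only the two ICD rules, so the ED file would be empty. If one outcome tends to dominate by lift, capping each outcome separately may be worth considering.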


## 5. Execute Analysis

Process all item types sequentially to avoid OOM errors.


In [None]:
print("="*80)
print("GLOBAL FPGROWTH ANALYSIS - START")
print("="*80)
print(f"Item types: {ITEM_TYPES}")
print(f"Min support: {MIN_SUPPORT}")
print(f"Min confidence: {MIN_CONFIDENCE}")
print(f"Detailed progress → Check log file")
print()

logger.info(f"\n{'='*80}")
logger.info(f"GLOBAL FPGROWTH ANALYSIS - START")
logger.info(f"{'='*80}")
logger.info(f"Item types: {ITEM_TYPES}")
logger.info(f"Min support: {MIN_SUPPORT}")
logger.info(f"Min confidence: {MIN_CONFIDENCE}")

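# Helper function to check S3 existence
from botocore.exceptions import ClientError

def check_s3_results_exist(s3_output_base: str, item_type: str) -> bool:
    """Check if results already exist in S3 (by checking for metrics.json)."""
    s3 = boto3.client('s3')
    key = f"gold/fpgrowth/global/{item_type}/metrics.json"
    try:
        s3.head_object(Bucket='pgxdatalake', Key=key)
        return True
    except ClientError as e:
        # A missing object is reported as 404; re-raise anything else (e.g. 403)
        # so permission problems are not mistaken for "no results yet".
        if e.response['Error']['Code'] in ('404', 'NoSuchKey', 'NotFound'):
            return False
        raise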

overall_start = time.time()
all_metrics = []
skipped = 0

# Check which item types need processing (skip if already in S3)
items_to_process = []
for item_type in ITEM_TYPES:
    print(f"Checking {item_type.upper()}...")
    if check_s3_results_exist(S3_OUTPUT_BASE, item_type):
        print(f"  ⏭ Already exists in S3 - SKIPPING")
        logger.info(f"Skipping {item_type} - results already exist in S3")
        skipped += 1
        all_metrics.append({'item_type': item_type, 'status': 'skipped'})
    else:
        print(f"  ▶ Queued for processing")
        items_to_process.append(item_type)

# Process item types SEQUENTIALLY (prevents OOM errors - each job needs ~300-500 GB peak)
if items_to_process:
    print(f"\n{'='*80}")
    print(f"SEQUENTIAL PROCESSING: {len(items_to_process)} item types")
    print(f"Processing one at a time to avoid memory exhaustion")
    print(f"Expected runtime: 50-85 minutes total")
    print(f"{'='*80}\n")
    
    # Process each item type sequentially
    for idx, item_type in enumerate(items_to_process, 1):
        print(f"\n{'='*80}")
        print(f"Processing {idx}/{len(items_to_process)}: {item_type.upper()}")
        print(f"{'='*80}\n")
        
        try:
            # Use item-specific parameters
            actual_min_support = MIN_SUPPORT_CPT if item_type == 'cpt_code' else MIN_SUPPORT
            actual_min_confidence = MIN_CONFIDENCE_CPT if item_type == 'cpt_code' else MIN_CONFIDENCE
            
            metrics = process_item_type(
                item_type=item_type,
                local_data_path=LOCAL_DATA_PATH,
                s3_output_base=S3_OUTPUT_BASE,
                min_support=actual_min_support,
                min_confidence=actual_min_confidence,
                logger=logger
            )
            all_metrics.append(metrics)
            
            if 'error' not in metrics:
                print(f"\n✓ {item_type.upper()} COMPLETE:")
                print(f"  - Frequent itemsets: {metrics['frequent_itemsets']:,}")
                print(f"  - Association rules: {metrics['association_rules']:,}")
                print(f"  - TARGET_ICD rules: {metrics.get('rules_by_target', {}).get('TARGET_ICD', 0):,}")
                print(f"  - TARGET_ED rules: {metrics.get('rules_by_target', {}).get('TARGET_ED', 0):,}")
                print(f"  - CONTROL rules: {metrics.get('rules_by_target', {}).get('CONTROL', 0):,}")
                # process_item_type stores its timing under processing_time_seconds['total']
                print(f"  - Runtime: {metrics.get('processing_time_seconds', {}).get('total', 0):.1f}s")
            else:
                print(f"\n✗ {item_type.upper()} FAILED: {metrics['error']}")
                
        except Exception as e:
            print(f"\n✗ {item_type.upper()} EXCEPTION: {e}")
            logger.error(f"Exception processing {item_type}: {e}", exc_info=True)
            all_metrics.append({'item_type': item_type, 'error': str(e)})
            
        print(f"\nCompleted {idx}/{len(items_to_process)} item types")
        print("="*80)
else:
    print("\n⏭ All item types already exist in S3 - nothing to process")



2025-11-23 21:50:40,601 - INFO - 
2025-11-23 21:50:40,602 - INFO - GLOBAL FPGROWTH ANALYSIS - START
2025-11-23 21:50:40,602 - INFO - Item types: ['drug_name', 'icd_code', 'cpt_code']
2025-11-23 21:50:40,603 - INFO - Min support: 0.01
2025-11-23 21:50:40,603 - INFO - Min confidence: 0.01
2025-11-23 21:50:40,603 - INFO - 
2025-11-23 21:50:40,603 - INFO - Processing DRUG_NAME
2025-11-23 21:50:40,604 - INFO - Extracting global drug_names from local cohort data...
2025-11-23 21:50:40,805 - INFO - ✅ Simple DuckDB connection created - 1 thread per worker (for multiprocessing)
2025-11-23 21:50:40,805 - INFO - Running query for drug_name...
2025-11-23 21:50:41,862 - INFO - ✓ Extracted 12,783 unique drug_names in 1.3s
2025-11-23 21:50:41,862 - INFO - Creating global drug_name transactions...
2025-11-23 21:50:41,897 - INFO - ✅ Simple DuckDB connection created - 1 thread per worker (for multiprocessing)
2025-11-23 21:50:41,898 - INFO - Loading drug_name events...


IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)

2025-11-23 21:50:55,007 - INFO - Grouping by p

## 6. Auto-Shutdown EC2 Instance (Optional)

Set `SHUTDOWN_EC2 = True` to automatically stop the EC2 instance after analysis completes.

**Note:** This is a **STOP** (not terminate), so you can restart the instance later.


In [None]:
# =============================================================================
# EC2 AUTO-SHUTDOWN (OPTIONAL)
# =============================================================================
# Set SHUTDOWN_EC2 = True to enable, False to disable
SHUTDOWN_EC2 = False  # Change to True to enable auto-shutdown

if SHUTDOWN_EC2:
    print("\n" + "="*80)
    print("Shutting down EC2 instance...")
    print("="*80)
    
    import subprocess
    import requests
    import shutil
    
    # Get instance ID from EC2 metadata service
    try:
        response = requests.get(
            "http://169.254.169.254/latest/meta-data/instance-id",
            timeout=2
        )
        if response.status_code == 200:
            instance_id = response.text.strip()
            print(f"Instance ID: {instance_id}")
            
            # Find AWS CLI
            aws_cmd = shutil.which("aws")
            if not aws_cmd:
                # Try common paths
                for path in ["/usr/local/bin/aws", "/usr/bin/aws", 
                           "/home/ec2-user/.local/bin/aws", 
                           "/home/ubuntu/.local/bin/aws"]:
                    if Path(path).exists():
                        aws_cmd = path
                        break
            
            if aws_cmd:
                # Stop the instance (use terminate-instances for permanent deletion)
                shutdown_cmd = [aws_cmd, "ec2", "stop-instances", "--instance-ids", instance_id]
                
                print(f"Running: {' '.join(shutdown_cmd)}")
                result = subprocess.run(shutdown_cmd, capture_output=True, text=True)
                
                if result.returncode == 0:
                    print("✓ EC2 instance stop command sent successfully")
                    print("Instance will stop in a few moments.")
                    print("Note: This is a STOP (not terminate), so you can restart it later.")
                    if result.stdout:
                        print(f"\nAWS Response:\n{result.stdout}")
                else:
                    print(f"✗ EC2 stop command failed with exit code {result.returncode}")
                    if result.stderr:
                        print(f"Error: {result.stderr}")
                    print("Check AWS credentials and IAM permissions.")
            else:
                print("✗ AWS CLI not found. Cannot shutdown instance.")
                print("Install AWS CLI or ensure it's in your PATH.")
                print("Manual shutdown: aws ec2 stop-instances --instance-ids " + instance_id)
        else:
            print(f"✗ Metadata service returned status code {response.status_code}")
            print("Could not retrieve instance ID.")
    
    except requests.exceptions.RequestException as e:
        print("✗ Could not retrieve instance ID from metadata service.")
        print(f"Error: {e}")
        print("If running on EC2, check that metadata service is accessible.")
        print("\nManual shutdown command:")
        print("  aws ec2 stop-instances --instance-ids <your-instance-id>")
    
    except Exception as e:
        print(f"✗ Unexpected error during shutdown: {e}")

else:
    print("\n" + "="*80)
    print("EC2 Auto-Shutdown: DISABLED")
    print("="*80)
    print("To enable auto-shutdown, set SHUTDOWN_EC2 = True in this cell.")
    print("Instance will continue running.")
    print("\nTo manually stop this instance later:")
    print("  aws ec2 stop-instances --instance-ids $(ec2-metadata --instance-id | cut -d ' ' -f 2)")
    print("Or use AWS Console: EC2 > Instances > Select instance > Instance State > Stop")
