# Global FPGrowth Feature Importance Analysis

## Overview

This notebook performs **global FPGrowth analysis** across all cohorts to create universal encoding features for machine learning models. The analysis covers three item types:

1. **Drug Names**: Pharmacy events (drug co-prescriptions)
2. **ICD Codes**: Diagnosis codes (condition associations)
3. **CPT Codes**: Procedure codes (treatment patterns)

## Use Cases

- **CatBoost Feature Engineering**: Creates consistent encodings across training/validation/test sets
- **Population-Level Insights**: Discovers association patterns across all patients
- **Feature Importance**: Identifies most frequent patterns in the population

## Key Outputs (per item type)

Each item type gets its own folder with:
- **Global Encoding Map**: Universal encodings for ML
- **Frequent Itemsets**: Combinations that appear frequently
- **Association Rules**: Co-occurrence patterns
- **Summary Metrics**: Processing statistics
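
To make the encoding map concrete, here is a minimal sketch of how a single encoding string is formed, mirroring the f-string used in Step 3. The helper name `make_encoding` and the sample values are hypothetical and only illustrate the format.

```python
def make_encoding(item: str, support: float, confidence: float) -> str:
    """Build '<item>_<support x 1000>_<confidence x 1000>', each zero-padded to 4 digits.

    int() truncates toward zero, matching the expression used in Step 3.
    """
    return f"{item}_{int(support * 1000):04d}_{int(confidence * 1000):04d}"


# Hypothetical values, for illustration only.
print(make_encoding("metformin", 0.125, 0.5))
# An item that never reaches the support threshold keeps zeros in both fields.
print(make_encoding("E11.9", 0.0, 0.0))
```

Because both fields are truncated to thousandths, items whose support differs by less than 0.001 can share the same numeric suffix; the item name keeps the encodings distinct.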

## S3 Output Structure

```
gold/fpgrowth/global/
├── drug_name/
│   ├── encoding_map.json
│   ├── itemsets.json
│   ├── rules.json
│   └── metrics.json
├── icd_code/
│   └── (same files)
└── cpt_code/
    └── (same files)
```
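
The layout above can be enumerated programmatically, for example to check that every expected object exists after a run. This is a small sketch; the helper name `output_keys` is hypothetical, and the file names are taken from the tree above.

```python
ITEM_TYPES = ["drug_name", "icd_code", "cpt_code"]
OUTPUT_FILES = ["encoding_map.json", "itemsets.json", "rules.json", "metrics.json"]


def output_keys(base: str = "gold/fpgrowth/global") -> list[str]:
    """Return every expected output key, grouped by item type."""
    return [
        f"{base}/{item_type}/{name}"
        for item_type in ITEM_TYPES
        for name in OUTPUT_FILES
    ]


for key in output_keys():
    print(key)
```

Note that the processing code in Step 3 writes `rules.json` only when at least one rule was generated, so that file may legitimately be missing for an item type.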

## Parameters

- **Min Support**: 0.005 (an itemset must appear in at least 0.5% of patient transactions)
- **Min Confidence**: 0.01 (a rule must have at least 1% confidence)
- **Data Source**: Local cohort data from `data/gold/cohorts_F1120/`
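
As a reminder of what the two thresholds measure, the sketch below computes support and confidence by hand on a toy dataset. The transactions and the item labels `A`, `B`, `C` are made up for illustration; the notebook itself relies on `mlxtend` for these calculations.

```python
# Four toy patient transactions (hypothetical data).
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A"},
    {"C"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)


print(support({"A", "B"}, transactions))       # 2 of 4 transactions
print(confidence({"A"}, {"B"}, transactions))  # 2 of the 3 transactions containing A
```

With `MIN_SUPPORT = 0.005`, an itemset in a population of 10,000 patients would need to occur in at least 50 of them to be kept.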

---


## Setup and Imports


In [1]:
import os
import sys
import json
import pandas as pd
import numpy as np
from datetime import datetime
import logging
from pathlib import Path
import time

# MLxtend for FP-Growth
from mlxtend.frequent_patterns import fpgrowth, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Project root
project_root = Path.cwd().parent if Path.cwd().name == '3_fpgrowth_analysis' else Path.cwd()
sys.path.insert(0, str(project_root))

# Project utilities
from helpers_1997_13.common_imports import s3_client, S3_BUCKET
from helpers_1997_13.duckdb_utils import get_duckdb_connection
from helpers_1997_13.s3_utils import save_to_s3_json, save_to_s3_parquet
from helpers_1997_13.drug_utils import encode_drug_name
from helpers_1997_13.visualization_utils import create_network_visualization

print(f"✓ Project root: {project_root}")
print(f"✓ All imports successful")
print(f"✓ Timestamp: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")


✓ Project root: C:\Projects\pgx-analysis
✓ All imports successful
✓ Timestamp: 2025-11-22 14:37:31


## Configuration


In [2]:
# FP-Growth parameters
MIN_SUPPORT = 0.005  # 0.5% support threshold
MIN_CONFIDENCE = 0.01  # 1% confidence threshold
TOP_K = 50  # Top K itemsets to analyze

# Item types to process
ITEM_TYPES = ['drug_name', 'icd_code', 'cpt_code']

# S3 output base path
S3_OUTPUT_BASE = f"s3://{S3_BUCKET}/gold/fpgrowth/global"

# Local data path
LOCAL_DATA_PATH = project_root / "data" / "gold" / "cohorts_F1120"

# Create logger
logger = logging.getLogger('global_fpgrowth')
logger.setLevel(logging.INFO)
if not logger.handlers:
    handler = logging.StreamHandler()
    formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)

print(f"✓ Min Support: {MIN_SUPPORT}")
print(f"✓ Min Confidence: {MIN_CONFIDENCE}")
print(f"✓ Item Types: {ITEM_TYPES}")
print(f"✓ S3 Output: {S3_OUTPUT_BASE}")
print(f"✓ Local Data: {LOCAL_DATA_PATH}")
print(f"✓ Local Data Exists: {LOCAL_DATA_PATH.exists()}")


✓ Min Support: 0.005
✓ Min Confidence: 0.01
✓ Item Types: ['drug_name', 'icd_code', 'cpt_code']
✓ S3 Output: s3://pgxdatalake/gold/fpgrowth/global
✓ Local Data: C:\Projects\pgx-analysis\data\gold\cohorts_F1120
✓ Local Data Exists: True


## Step 1: Define Item Extraction Functions

Create functions to extract different item types from cohort data.


In [3]:
def extract_global_items(local_data_path, item_type, logger):
    """
    Extract all unique items of specified type from local cohort data.
    
    Args:
        item_type: 'drug_name', 'icd_code', or 'cpt_code'
    """
    logger.info(f"Extracting global {item_type}s from local cohort data...")
    start_time = time.time()
    
    # Get DuckDB connection
    con = get_duckdb_connection(logger=logger)
    
    # Build glob pattern for all parquet files
    parquet_pattern = str(local_data_path / "**" / "cohort.parquet")
    
    # Build query based on item type
    if item_type == 'drug_name':
        query = f"""
        SELECT DISTINCT drug_name as item
        FROM read_parquet('{parquet_pattern}', hive_partitioning=1)
        WHERE drug_name IS NOT NULL 
          AND drug_name != ''
          AND event_type = 'PHARMACY'
        ORDER BY item
        """
    elif item_type == 'icd_code':
        # Collect from all ICD diagnosis columns
        query = f"""
        WITH all_icds AS (
            SELECT primary_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) WHERE primary_icd_diagnosis_code IS NOT NULL AND event_type = 'MEDICAL'
            UNION ALL
            SELECT two_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) WHERE two_icd_diagnosis_code IS NOT NULL AND event_type = 'MEDICAL'
            UNION ALL
            SELECT three_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) WHERE three_icd_diagnosis_code IS NOT NULL AND event_type = 'MEDICAL'
            UNION ALL
            SELECT four_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) WHERE four_icd_diagnosis_code IS NOT NULL AND event_type = 'MEDICAL'
            UNION ALL
            SELECT five_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) WHERE five_icd_diagnosis_code IS NOT NULL AND event_type = 'MEDICAL'
        )
        SELECT DISTINCT icd as item FROM all_icds WHERE icd != '' ORDER BY item
        """
    elif item_type == 'cpt_code':
        query = f"""
        SELECT DISTINCT procedure_code as item
        FROM read_parquet('{parquet_pattern}', hive_partitioning=1)
        WHERE procedure_code IS NOT NULL 
          AND procedure_code != ''
          AND event_type = 'MEDICAL'
        ORDER BY item
        """
    else:
        raise ValueError(f"Unknown item_type: {item_type}")
    
    logger.info(f"Running query for {item_type}...")
    df = con.execute(query).df()
    con.close()
    
    items = df['item'].tolist()
    
    elapsed = time.time() - start_time
    logger.info(f"✓ Extracted {len(items):,} unique {item_type}s in {elapsed:.1f}s")
    
    return items

# Test extraction function
print("Testing item extraction...")
test_items = extract_global_items(LOCAL_DATA_PATH, 'drug_name', logger)
print(f"✓ Found {len(test_items):,} drugs")
print(f"  Sample: {test_items[:5]}")


Testing item extraction...
2025-11-22 14:37:31,880 - INFO - Extracting global drug_names from local cohort data...
2025-11-22 14:37:32,215 - INFO - ✅ Simple DuckDB connection created - 1 thread per worker (for multiprocessing)
2025-11-22 14:37:32,217 - INFO - Running query for drug_name...
2025-11-22 14:37:33,362 - INFO - ✓ Extracted 0 unique drug_names in 1.5s
✓ Found 0 drugs
  Sample: []
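
The test run above found 0 drugs even though the data directory exists. The output does not say why, so the following is only a suggested first check, not a diagnosis: count how many `cohort.parquet` files sit under the data root, since a glob that matches nothing (or files whose `event_type` values differ from the filter) would both yield empty results. The helper name `count_cohort_files` is hypothetical, and `Path.rglob` is used here as an approximation of the `**` pattern passed to DuckDB.

```python
from pathlib import Path


def count_cohort_files(data_root: Path, filename: str = "cohort.parquet") -> int:
    """Count files named `filename` anywhere under `data_root` (recursive)."""
    return sum(1 for p in Path(data_root).rglob(filename) if p.is_file())


# Path taken from the Parameters section; adjust to your checkout.
print(count_cohort_files(Path("data/gold/cohorts_F1120")))
```

If the count is positive, a natural next step would be to inspect the distinct `event_type` values in the files, because the queries above filter on the exact strings `'PHARMACY'` and `'MEDICAL'`.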


## Step 2: Define Transaction Creation Functions

Create patient-level transactions for each item type.


In [4]:
def create_global_transactions(local_data_path, item_type, logger):
    """
    Create patient-level transactions from local cohort data.
    
    Args:
        item_type: 'drug_name', 'icd_code', or 'cpt_code'
    """
    logger.info(f"Creating global {item_type} transactions...")
    start_time = time.time()
    
    # Get DuckDB connection
    con = get_duckdb_connection(logger=logger)
    
    # Build glob pattern for all parquet files
    parquet_pattern = str(local_data_path / "**" / "cohort.parquet")
    
    # Build query based on item type
    if item_type == 'drug_name':
        query = f"""
        SELECT mi_person_key, drug_name as item
        FROM read_parquet('{parquet_pattern}', hive_partitioning=1)
        WHERE drug_name IS NOT NULL AND drug_name != '' AND event_type = 'PHARMACY'
        """
    elif item_type == 'icd_code':
        query = f"""
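        WITH all_icds AS (
            SELECT mi_person_key, primary_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) WHERE primary_icd_diagnosis_code IS NOT NULL AND event_type = 'MEDICAL'
            UNION ALL
            SELECT mi_person_key, two_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) WHERE two_icd_diagnosis_code IS NOT NULL AND event_type = 'MEDICAL'
            UNION ALL
            SELECT mi_person_key, three_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) WHERE three_icd_diagnosis_code IS NOT NULL AND event_type = 'MEDICAL'
            UNION ALL
            SELECT mi_person_key, four_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) WHERE four_icd_diagnosis_code IS NOT NULL AND event_type = 'MEDICAL'
            UNION ALL
            SELECT mi_person_key, five_icd_diagnosis_code as icd FROM read_parquet('{parquet_pattern}', hive_partitioning=1) WHERE five_icd_diagnosis_code IS NOT NULL AND event_type = 'MEDICAL'
        )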
        SELECT mi_person_key, icd as item FROM all_icds WHERE icd != ''
        """
    elif item_type == 'cpt_code':
        query = f"""
        SELECT mi_person_key, procedure_code as item
        FROM read_parquet('{parquet_pattern}', hive_partitioning=1)
        WHERE procedure_code IS NOT NULL AND procedure_code != '' AND event_type = 'MEDICAL'
        """
    else:
        raise ValueError(f"Unknown item_type: {item_type}")
    
    logger.info(f"Loading {item_type} events...")
    df = con.execute(query).df()
    con.close()
    
    # Group by patient and create item lists
    logger.info(f"Grouping by patient...")
    transactions = (
        df.groupby('mi_person_key')['item']
        .apply(lambda x: sorted(set(x.tolist())))
        .tolist()
    )
    
    elapsed = time.time() - start_time
    logger.info(f"✓ Created {len(transactions):,} patient transactions in {elapsed:.1f}s")
    
    return transactions

print("✓ Transaction creation function defined")


✓ Transaction creation function defined
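
The grouping step can be illustrated without pandas or DuckDB. The sketch below takes `(person_key, item)` pairs and produces one sorted, de-duplicated item list per patient, which is the shape `TransactionEncoder` expects in Step 3. The function name `build_transactions` and the sample events are hypothetical.

```python
from collections import defaultdict


def build_transactions(events):
    """Turn (person_key, item) pairs into one sorted, de-duplicated item list per patient."""
    by_person = defaultdict(set)
    for person_key, item in events:
        if item:  # skip None and empty strings, as the SQL filters do
            by_person[person_key].add(item)
    # Order patients by key so the output is deterministic.
    return [sorted(items) for _, items in sorted(by_person.items())]


# Hypothetical events: patient 1 has a repeated item, patient 3 has no usable item.
events = [(1, "A"), (1, "B"), (1, "A"), (2, "C"), (3, ""), (2, None)]
print(build_transactions(events))
```

Repeated events for the same patient collapse to a single item, so support counts patients rather than prescriptions or claims.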


## Step 3: Process All Item Types

Run FP-Growth analysis for each item type (drug_name, icd_code, cpt_code).


In [5]:
def process_item_type(item_type, local_data_path, s3_output_base, min_support, min_confidence, logger):
    """
    Process a single item type end-to-end: extract, create transactions, run FP-Growth, save results.
    """
    logger.info(f"\n{'='*80}")
    logger.info(f"Processing {item_type.upper()}")
    logger.info(f"{'='*80}")
    
    overall_start = time.time()
    
    # Step 1: Extract items
    items = extract_global_items(local_data_path, item_type, logger)
    
    # Step 2: Create transactions
    transactions = create_global_transactions(local_data_path, item_type, logger)
    
    # Step 3: Encode transactions
    logger.info(f"Encoding {len(transactions):,} transactions...")
    te = TransactionEncoder()
    te_ary = te.fit(transactions).transform(transactions)
    df_encoded = pd.DataFrame(te_ary, columns=te.columns_)
    logger.info(f"✓ Encoded to {df_encoded.shape} matrix")
    
    # Step 4: Run FP-Growth
    logger.info(f"Running FP-Growth...")
    itemsets = fpgrowth(df_encoded, min_support=min_support, use_colnames=True)
    itemsets = itemsets.sort_values('support', ascending=False).reset_index(drop=True)
    logger.info(f"✓ Found {len(itemsets):,} frequent itemsets")
    
    # Step 5: Generate association rules
    logger.info(f"Generating association rules...")
    try:
        rules = association_rules(itemsets, metric="confidence", min_threshold=min_confidence)
        rules = rules.sort_values('lift', ascending=False).reset_index(drop=True)
        logger.info(f"✓ Generated {len(rules):,} association rules")
    except ValueError as e:
        logger.warning(f"Could not generate rules: {e}")
        rules = pd.DataFrame()
    
    # Step 6: Create encoding map
    logger.info(f"Creating encoding map...")
    encoding_map = {}
    for item in items:
        support = 0.0
        matching = itemsets[itemsets['itemsets'].apply(lambda x: item in x)]
        if not matching.empty:
            support = matching['support'].max()
        
        confidence = 0.0
        if not rules.empty:
            matching_rules = rules[
                rules['antecedents'].apply(lambda x: item in x) |
                rules['consequents'].apply(lambda x: item in x)
            ]
            if not matching_rules.empty:
                confidence = matching_rules['confidence'].max()
        
        # Simple encoding: <item>_<SSSS>_<CCCC>, where SSSS and CCCC are support and confidence x 1000, truncated and zero-padded to 4 digits
        encoding = f"{item}_{int(support*1000):04d}_{int(confidence*1000):04d}"
        encoding_map[item] = encoding
    
    logger.info(f"✓ Created encoding map with {len(encoding_map):,} items")
    
    # Step 7: Save to S3
    s3_folder = f"{s3_output_base}/{item_type}"
    logger.info(f"Saving to {s3_folder}...")
    
    # Convert frozensets to lists
    itemsets_json = itemsets.copy()
    itemsets_json['itemsets'] = itemsets_json['itemsets'].apply(lambda x: list(x))
    
    # Save encoding map
    encoding_path = f"{s3_folder}/encoding_map.json"
    save_to_s3_json(encoding_map, encoding_path)
    
    # Save itemsets
    itemsets_path = f"{s3_folder}/itemsets.json"
    save_to_s3_json(itemsets_json.to_dict(orient='records'), itemsets_path)
    
    # Save rules
    if not rules.empty:
        rules_json = rules.copy()
        rules_json['antecedents'] = rules_json['antecedents'].apply(lambda x: list(x))
        rules_json['consequents'] = rules_json['consequents'].apply(lambda x: list(x))
        rules_path = f"{s3_folder}/rules.json"
        save_to_s3_json(rules_json.to_dict(orient='records'), rules_path)
    
    # Save metrics
    summary = {
        'timestamp': datetime.now().isoformat(),
        'item_type': item_type,
        'total_items': len(items),
        'total_patients': len(transactions),
        'total_itemsets': len(itemsets),
        'total_rules': len(rules),
        'min_support': min_support,
        'min_confidence': min_confidence,
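        # Guard against an empty transaction list: np.mean([]) is NaN and emits a RuntimeWarning
        'avg_items_per_patient': float(np.mean([len(t) for t in transactions])) if transactions else 0.0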
    }
    metrics_path = f"{s3_folder}/metrics.json"
    save_to_s3_json(summary, metrics_path)
    
    elapsed = time.time() - overall_start
    logger.info(f"✓ {item_type} complete in {elapsed:.1f}s ({elapsed/60:.1f}min)")
    
    return {
        'item_type': item_type,
        'total_items': len(items),
        'total_patients': len(transactions),
        'total_itemsets': len(itemsets),
        'total_rules': len(rules),
        'elapsed_seconds': elapsed,
        's3_folder': s3_folder
    }

# Process all item types
print("\nüöÄ Starting FP-Growth analysis for all item types...\n")
results = []

for item_type in ITEM_TYPES:
    try:
        result = process_item_type(item_type, LOCAL_DATA_PATH, S3_OUTPUT_BASE, MIN_SUPPORT, MIN_CONFIDENCE, logger)
        results.append(result)
        print(f"\n✅ {item_type}: {result['total_itemsets']:,} itemsets, {result['total_rules']:,} rules")
    except Exception as e:
        logger.error(f"❌ Failed to process {item_type}: {e}")
        results.append({'item_type': item_type, 'error': str(e)})

print("\n" + "="*80)
print("ALL ITEM TYPES PROCESSED")
print("="*80)


🚀 Starting FP-Growth analysis for all item types...

2025-11-22 14:37:33,401 - INFO - Processing DRUG_NAME
2025-11-22 14:37:33,401 - INFO - Extracting global drug_names from local cohort data...
2025-11-22 14:37:33,711 - INFO - ✅ Simple DuckDB connection created - 1 thread per worker (for multiprocessing)
2025-11-22 14:37:33,713 - INFO - Running query for drug_name...
2025-11-22 14:37:34,782 - INFO - ✓ Extracted 0 unique drug_names in 1.4s
2025-11-22 14:37:34,783 - INFO - Creating global drug_name transactions...
2025-11-22 14:37:35,076 - INFO - ✅ Simple DuckDB connection created - 1 thread per worker (for multiprocessing)
2025-11-22 14:37:35,076 - INFO - Loading drug_name events...
2025-11-22 14:37:36,090 - INFO - Grouping by patient...
2025-11-22 14:37:36,090 - INFO - ✓ Created 0 patient transactions in 1.3s
2025-11-22 14:37:36,090 - INFO - Encoding 0 transactions...
2025-11-22 14:37:36,090 - INFO - ✓ Encoded to (0, 0) matrix
2025-11-22 14:37:36,090 - INFO - Running FP-Growth...
2025-11-22 14:37:36,102 - INFO - ✓ Found 0 frequent itemsets
2025-11-22 14:37:36,103 - INFO - Generating association rules...
2025-11-22 14:37:36,107 - INFO - Creating encoding map...
2025-11-22 14:37:36,107 - INFO - ✓ Created encoding map with 0 items
2025-11-22 14:37:36,107 - INFO - Saving to s3://pgxdatalake/gold/fpgrowth/global/drug_name...
2025-11-22 14:37:36,758 - INFO - ✓ drug_name complete in 3.4s (0.1min)


✅ drug_name: 0 itemsets, 0 rules

2025-11-22 14:37:36,758 - INFO - Processing ICD_CODE
2025-11-22 14:37:36,758 - INFO - Extracting global icd_codes from local cohort data...
2025-11-22 14:37:37,057 - INFO - ✅ Simple DuckDB connection created - 1 thread per worker (for multiprocessing)
2025-11-22 14:37:37,061 - INFO - Running query for icd_code...
2025-11-22 14:37:42,115 - INFO - ✓ Extracted 0 unique icd_codes in 5.4s
2025-11-22 14:37:42,117 - INFO - Creating global icd_code transactions...
2025-11-22 14:37:42,403 - INFO - ✅ Simple DuckDB connection created - 1 thread per worker (for multiprocessing)
2025-11-22 14:37:42,404 - INFO - Loading icd_code events...
2025-11-22 14:37:45,115 - INFO - Grouping by patient...
2025-11-22 14:37:45,115 - INFO - ✓ Created 0 patient transactions in 3.0s
2025-11-22 14:37:45,115 - INFO - Encoding 0 transactions...
2025-11-22 14:37:45,115 - INFO - ✓ Encoded to (0, 0) matrix
2025-11-22 14:37:45,118 - INFO - Running FP-Growth...
2025-11-22 14:37:45,118 - INFO - ✓ Found 0 frequent itemsets
2025-11-22 14:37:45,118 - INFO - Generating association rules...
2025-11-22 14:37:45,118 - INFO - Creating encoding map...
2025-11-22 14:37:45,118 - INFO - ✓ Created encoding map with 0 items
2025-11-22 14:37:45,118 - INFO - Saving to s3://pgxdatalake/gold/fpgrowth/global/icd_code...
2025-11-22 14:37:45,740 - INFO - ✓ icd_code complete in 9.0s (0.1min)


✅ icd_code: 0 itemsets, 0 rules

2025-11-22 14:37:45,740 - INFO - Processing CPT_CODE
2025-11-22 14:37:45,743 - INFO - Extracting global cpt_codes from local cohort data...
2025-11-22 14:37:45,885 - INFO - ✅ Simple DuckDB connection created - 1 thread per worker (for multiprocessing)
2025-11-22 14:37:45,889 - INFO - Running query for cpt_code...
2025-11-22 14:37:46,468 - INFO - ✓ Extracted 0 unique cpt_codes in 0.7s
2025-11-22 14:37:46,468 - INFO - Creating global cpt_code transactions...
2025-11-22 14:37:46,609 - INFO - ✅ Simple DuckDB connection created - 1 thread per worker (for multiprocessing)
2025-11-22 14:37:46,610 - INFO - Loading cpt_code events...
2025-11-22 14:37:47,156 - INFO - Grouping by patient...
2025-11-22 14:37:47,160 - INFO - ✓ Created 0 patient transactions in 0.7s
2025-11-22 14:37:47,160 - INFO - Encoding 0 transactions...
2025-11-22 14:37:47,162 - INFO - ✓ Encoded to (0, 0) matrix
2025-11-22 14:37:47,162 - INFO - Running FP-Growth...
2025-11-22 14:37:47,162 - INFO - ✓ Found 0 frequent itemsets
2025-11-22 14:37:47,164 - INFO - Generating association rules...
2025-11-22 14:37:47,164 - INFO - Creating encoding map...
2025-11-22 14:37:47,164 - INFO - ✓ Created encoding map with 0 items
2025-11-22 14:37:47,164 - INFO - Saving to s3://pgxdatalake/gold/fpgrowth/global/cpt_code...
2025-11-22 14:37:47,451 - INFO - ✓ cpt_code complete in 1.7s (0.0min)

✅ cpt_code: 0 itemsets, 0 rules

ALL ITEM TYPES PROCESSED


## View Results

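Check the processing output above for results from each item type. In this run, all three item types returned 0 items and 0 patient transactions, so no itemsets or rules were produced and the saved outputs are empty.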
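
Instead of scanning the log, the `results` list built in Step 3 can be condensed into one line per item type. The sketch below assumes the dictionary keys returned by `process_item_type` (or `item_type` plus `error` on failure); the helper name `summarize` and the sample failure entry are hypothetical.

```python
def summarize(results):
    """Return one human-readable line per item type, flagging failures."""
    lines = []
    for r in results:
        if "error" in r:
            lines.append(f"{r['item_type']}: FAILED ({r['error']})")
        else:
            lines.append(
                f"{r['item_type']}: {r['total_patients']:,} patients, "
                f"{r['total_itemsets']:,} itemsets, {r['total_rules']:,} rules"
            )
    return lines


# First entry mirrors the counts logged in this run; the second is a made-up failure.
sample_results = [
    {"item_type": "drug_name", "total_patients": 0, "total_itemsets": 0, "total_rules": 0},
    {"item_type": "icd_code", "error": "example failure"},
]
for line in summarize(sample_results):
    print(line)
```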