# ATUS Hierarchical Baseline Experiments - HPC Version

## Overview

This notebook runs the ATUS (American Time Use Survey) hierarchical baseline experiments safely on HPC systems. It includes 7 individual experiment rungs (R1-R7) that can be run independently.

## Experiment Structure

- **R1**: Region only
- **R2**: Region + Sex
- **R3**: Region + Employment
- **R4**: Region + Day Type
- **R5**: Region + Household Size Band
- **R6**: Full routing model (Employment + Day Type + HH Size + Sex + Region + Quarter)
- **R7**: Full model with hazard (same grouping as R6 but includes hazard modeling)

## How to Use This Notebook

1. **Run Setup Cells**: Execute cells 1-3 to import libraries and set up the environment
2. **Check System Resources**: Run cell 4 to verify your HPC node has sufficient resources
3. **Run Individual Experiments**: Execute cells 5-11 one at a time for each rung (R1-R7)
4. **Monitor Progress**: Each cell will show detailed progress and can be interrupted safely
5. **Resume if Needed**: If interrupted, you can restart from any cell - completed experiments won't be re-run
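
The resume behaviour in step 5 relies on a small JSON progress file. The following is a minimal sketch of that idea, assuming a file that stores a `completed_rungs` list. The helper names `read_progress`, `mark_completed` and `remaining_rungs` are illustrative and are not the notebook's own functions.

```python
import json
import tempfile
from pathlib import Path

RUNGS = ["R1", "R2", "R3", "R4", "R5", "R6", "R7"]

def read_progress(path: Path) -> dict:
    """Return saved progress, or an empty record if no file exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"completed_rungs": [], "failed_rungs": []}

def mark_completed(path: Path, rung: str) -> None:
    """Record a finished rung so that later runs can skip it."""
    progress = read_progress(path)
    if rung not in progress["completed_rungs"]:
        progress["completed_rungs"].append(rung)
    path.write_text(json.dumps(progress, indent=2))

def remaining_rungs(path: Path) -> list:
    """List the rungs that still need to run, in order."""
    done = set(read_progress(path)["completed_rungs"])
    return [r for r in RUNGS if r not in done]

# Demonstration in a throwaway directory (hypothetical file name)
demo_file = Path(tempfile.mkdtemp()) / "progress_demo.json"
mark_completed(demo_file, "R1")
mark_completed(demo_file, "R2")
print(remaining_rungs(demo_file))
```

Because the record lives on disk rather than in kernel memory, a restarted kernel can pick up where the previous session stopped.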

## Expected Runtime

- **R1-R4**: 30-60 minutes each
- **R5-R6**: 60-120 minutes each  
- **R7**: 120-180 minutes (includes hazard model)
- **Total**: 6-12 hours for all experiments
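
As a quick check on the total, the per-rung ranges above can be summed directly. This is plain arithmetic on the listed estimates, not a measurement. The summed upper bound comes to 11 hours, so the 12-hour figure leaves some slack.

```python
# Per-rung runtime ranges in minutes, copied from the list above
RUNTIME_MINUTES = {
    "R1": (30, 60), "R2": (30, 60), "R3": (30, 60), "R4": (30, 60),
    "R5": (60, 120), "R6": (60, 120),
    "R7": (120, 180),
}

total_min = sum(low for low, _ in RUNTIME_MINUTES.values())
total_max = sum(high for _, high in RUNTIME_MINUTES.values())

# prints "Total: 6.0-11.0 hours"
print(f"Total: {total_min / 60:.1f}-{total_max / 60:.1f} hours")
```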

## Resource Requirements

- **Memory**: At least 8GB RAM recommended
- **Storage**: At least 10GB free disk space
- **CPU**: Multi-core recommended for faster processing

## Cell 1: Import Required Libraries

In [2]:
# Import required libraries
import os
import sys
import subprocess
import time
import json
import psutil
import pandas as pd
from pathlib import Path
from datetime import datetime
import gc
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)

print("✓ Libraries imported successfully")
print(f"Python version: {sys.version}")
print(f"Working directory: {os.getcwd()}")

✓ Libraries imported successfully
Python version: 3.10.14 | packaged by Anaconda, Inc. | (main, Mar 21 2024, 16:20:14) [MSC v.1916 64 bit (AMD64)]
Working directory: c:\Users\dube.rohit\OneDrive - Texas A&M University\ATUS-analysis-main\ATUS analysis main\atus_analysis\scripts


## Cell 2: Define Experiment Configuration

In [None]:
# Experiment configuration
RUNG_SPECS = {
    "R1": "region",
    "R2": "region,sex", 
    "R3": "region,employment",
    "R4": "region,day_type",
    "R5": "region,hh_size_band",
    "R6": "employment,day_type,hh_size_band,sex,region,quarter",
    "R7": "employment,day_type,hh_size_band,sex,region,quarter"  # + hazard
}

# File paths (adjust if needed)
BASE_DIR = Path(".")
SEQUENCES_FILE = "atus_analysis/data/sequences/markov_sequences.parquet"
SUBGROUPS_FILE = "atus_analysis/data/processed/subgroups.parquet"
OUTPUT_DIR = Path("atus_analysis/data/models")
PROGRESS_FILE = "experiment_progress_jupyter.json"

# Experiment settings
SEED = 2025
TEST_SIZE = 0.2
TIME_BLOCKS = "night:0-5,morning:6-11,afternoon:12-17,evening:18-23"
DWELL_BINS = "1,2,3,4,6,9,14,20,30"

print("✓ Configuration set")
print(f"Output directory: {OUTPUT_DIR}")
print(f"Number of rungs: {len(RUNG_SPECS)}")
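
The `TIME_BLOCKS` setting packs named hour ranges into one string. The real parser is `parse_time_blocks` in `common_hier`, which is not shown here. The sketch below is only a guess at how such a string could be interpreted, assuming the end hour of each range is inclusive. Both `parse_blocks_sketch` and `block_for_hour` are hypothetical names.

```python
def parse_blocks_sketch(spec: str) -> dict:
    """Parse 'name:start-end,...' into {name: (start_hour, end_hour)}."""
    blocks = {}
    for item in spec.split(","):
        name, hours = item.split(":")
        start, end = hours.split("-")
        blocks[name.strip()] = (int(start), int(end))
    return blocks

def block_for_hour(blocks: dict, hour: int) -> str:
    """Return the block containing the hour (end hours assumed inclusive)."""
    for name, (start, end) in blocks.items():
        if start <= hour <= end:
            return name
    raise ValueError(f"hour {hour} is not covered by any block")

TIME_BLOCKS = "night:0-5,morning:6-11,afternoon:12-17,evening:18-23"
blocks = parse_blocks_sketch(TIME_BLOCKS)
print(blocks)
print(block_for_hour(blocks, 17))
```

Under the inclusive-end assumption, the four blocks cover every hour from 0 to 23 exactly once.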

## Cell 3: Define Helper Functions

In [None]:
def check_system_resources():
    """Check current system resources."""
    memory = psutil.virtual_memory()
    cpu_percent = psutil.cpu_percent(interval=1)
    
    print("=== System Resources ===")
    print(f"Memory: {memory.percent:.1f}% used, {memory.available / (1024**3):.1f}GB available")
    print(f"CPU: {cpu_percent:.1f}% usage")
    
    try:
        disk = psutil.disk_usage('.')
        print(f"Disk: {disk.free / (1024**3):.1f}GB free")
    except Exception:
        print("Disk: Could not check disk usage")
    
    # Check if resources are adequate
    warnings = []
    if memory.available < 4 * (1024**3):  # Less than 4GB
        warnings.append(f"Low memory: only {memory.available / (1024**3):.1f}GB available")
    if cpu_percent > 80:
        warnings.append(f"High CPU usage: {cpu_percent:.1f}%")
    
    if warnings:
        print("\n⚠️  WARNINGS:")
        for warning in warnings:
            print(f"   - {warning}")
    else:
        print("\n✓ System resources look good")
    
    return len(warnings) == 0

def load_progress():
    """Load experiment progress from file."""
    if Path(PROGRESS_FILE).exists():
        with open(PROGRESS_FILE, 'r') as f:
            return json.load(f)
    return {'completed_rungs': [], 'failed_rungs': [], 'session_start': datetime.now().isoformat()}

def save_progress(progress):
    """Save experiment progress to file."""
    progress['last_updated'] = datetime.now().isoformat()
    with open(PROGRESS_FILE, 'w') as f:
        json.dump(progress, f, indent=2)

def is_rung_completed(rung, progress):
    """Check if a rung has been completed successfully."""
    return rung in progress.get('completed_rungs', [])

def run_baseline1_hier_direct(rung, groupby, output_dir, split_path):
    """Run baseline1_hier directly in Jupyter (preferred method)."""
    try:
        import pandas as pd
        import numpy as np
        from atus_analysis.scripts.common_hier import (
            prepare_long_with_groups, pool_rare_quarter,
            save_json, nll_b1, fit_b1_hier, parse_time_blocks
        )
        
        print(f"📊 Loading data for {rung}...")
        
        # Load sequences and subgroups
        sequences = pd.read_parquet(SEQUENCES_FILE)
        subgroups = pd.read_parquet(SUBGROUPS_FILE)
        
        print(f"✓ Loaded {len(sequences)} sequences and {len(subgroups)} subgroups")
        
        # Parse time blocks
        time_blocks = parse_time_blocks(TIME_BLOCKS)
        print(f"✓ Parsed time blocks: {time_blocks}")
        
        # Create or load split
        if split_path.exists():
            print(f"📂 Loading existing split from {split_path}")
            split_df = pd.read_parquet(split_path)
        else:
            print(f"🎲 Creating new split with seed {SEED}")
            # Simplified split: sample a TEST_SIZE fraction of unique respondent IDs as the test set
            np.random.seed(SEED)
            unique_ids = subgroups['TUCASEID'].unique()
            test_size = int(len(unique_ids) * TEST_SIZE)
            test_ids = np.random.choice(unique_ids, test_size, replace=False)
            
            # Vectorized membership test: avoids a quadratic scan and shadowing the builtin `id`
            split_df = pd.DataFrame({
                'TUCASEID': unique_ids,
                'set': np.where(np.isin(unique_ids, test_ids), 'test', 'train')
            })
            split_df.to_parquet(split_path, index=False)
            print(f"✓ Split saved to {split_path}")
        
        # Prepare data with groups
        print(f"🔄 Preparing data with groupby: {groupby}")
        groupby_cols = groupby.split(',')
        
        # Call with correct signature including blocks parameter
        long_df = prepare_long_with_groups(sequences, subgroups, groupby_cols, time_blocks)
        
        # Pool rare quarters
        print(f"🔄 Pooling rare quarter groups...")
        long_df = pool_rare_quarter(long_df)
        
        # Merge with split
        print(f"🔄 Merging with train/test split...")
        long_df = long_df.merge(split_df, on='TUCASEID', how='left')
        
        print(f"📈 Fitting B1-H model...")
        # Fit the model
        result = fit_b1_hier(long_df)
        
        # Save results
        output_file = output_dir / "b1h_model.json"
        save_json(result, output_file)
        
        # Save evaluation
        eval_file = output_dir / "eval_b1h.json"
        test_data = long_df[long_df['set'] == 'test']
        eval_result = {
            'test_nll': nll_b1(result['params'], test_data),
            'n_test_sequences': len(test_data['TUCASEID'].unique()),
            'n_train_sequences': len(long_df[long_df['set'] == 'train']['TUCASEID'].unique())
        }
        save_json(eval_result, eval_file)
        
        print(f"✅ B1-H model completed successfully for {rung}")
        print(f"📁 Saved to {output_file}")
        return True
        
    except Exception as e:
        print(f"❌ Direct execution failed: {e}")
        print(f"🔍 Error details: {type(e).__name__}")
        print("🔄 Falling back to subprocess method...")
        return False

def run_baseline1_hier_subprocess(rung, groupby, output_dir, split_path):
    """Run baseline1_hier via subprocess (fallback method)."""
    cmd = [
        sys.executable, "-m", "atus_analysis.scripts.baseline1_hier",
        "--sequences", SEQUENCES_FILE,
        "--subgroups", SUBGROUPS_FILE,
        "--out_dir", str(output_dir),
        "--groupby", groupby,
        "--time_blocks", TIME_BLOCKS,
        "--seed", str(SEED),
        "--test_size", str(TEST_SIZE),
        "--split_path", str(split_path)
    ]
    
    print(f"🖥️  Running B1-H via subprocess for {rung}...")
    print(f"Command: {' '.join(cmd)}")
    
    # Use real-time output instead of capture_output
    try:
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, 
                                 universal_newlines=True, bufsize=1)
        
        # Stream output in real-time
        for line in process.stdout:
            print(line.rstrip())
        
        process.wait()
        
        if process.returncode == 0:
            print(f"✅ B1-H model completed successfully for {rung}")
            return True
        else:
            print(f"❌ B1-H model failed for {rung} (return code: {process.returncode})")
            return False
            
    except Exception as e:
        print(f"❌ Subprocess execution failed: {e}")
        return False

def run_baseline1_hier(rung, groupby, output_dir, split_path):
    """Run baseline1_hier via subprocess, optionally trying direct execution first."""
    # Direct execution is attempted only when the notebook-level flag USE_SUBPROCESS
    # is explicitly set to False; if the flag is unset or True, the subprocess path is used.
    if not globals().get('USE_SUBPROCESS', True):
        print("🎯 Attempting direct execution...")
        if run_baseline1_hier_direct(rung, groupby, output_dir, split_path):
            return True
    
    print("🖥️  Using subprocess execution...")
    return run_baseline1_hier_subprocess(rung, groupby, output_dir, split_path)

def run_baseline2_hier(rung, groupby, output_dir, split_path):
    """Run baseline2_hier (hazard model) for a rung."""
    b1h_path = output_dir / "b1h_model.json"
    
    cmd = [
        sys.executable, "-m", "atus_analysis.scripts.baseline2_hier",
        "--sequences", SEQUENCES_FILE,
        "--subgroups", SUBGROUPS_FILE,
        "--out_dir", str(output_dir),
        "--groupby", groupby,
        "--time_blocks", TIME_BLOCKS,
        "--dwell_bins", DWELL_BINS,
        "--seed", str(SEED),
        "--test_size", str(TEST_SIZE),
        "--split_path", str(split_path),
        "--b1h_path", str(b1h_path)
    ]
    
    print(f"🖥️  Running B2-H (hazard) model for {rung}...")
    print(f"Command: {' '.join(cmd)}")
    
    # Use real-time output
    try:
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, 
                                 universal_newlines=True, bufsize=1)
        
        # Stream output in real-time
        for line in process.stdout:
            print(line.rstrip())
        
        process.wait()
        
        if process.returncode == 0:
            print(f"✅ B2-H model completed successfully for {rung}")
            return True
        else:
            print(f"❌ B2-H model failed for {rung} (return code: {process.returncode})")
            return False
            
    except Exception as e:
        print(f"❌ B2-H execution failed: {e}")
        return False

def run_single_rung(rung, include_hazard=False):
    """Run a complete experiment for a single rung."""
    start_time = time.time()
    
    print(f"\n{'='*60}")
    print(f"🚀 STARTING RUNG {rung}")
    print(f"{'='*60}")
    
    # Load progress
    progress = load_progress()
    
    # Check if already completed
    if is_rung_completed(rung, progress):
        print(f"✅ {rung} already completed - skipping")
        return True
    
    # Setup
    groupby = RUNG_SPECS[rung]
    output_dir = OUTPUT_DIR / rung
    output_dir.mkdir(parents=True, exist_ok=True)
    split_path = OUTPUT_DIR / "fixed_split.parquet"
    
    print(f"📋 Rung: {rung}")
    print(f"📋 Groupby: {groupby}")
    print(f"📁 Output directory: {output_dir}")
    print(f"⚡ Include hazard: {include_hazard or rung == 'R7'}")
    
    # Check resources before starting
    print(f"\n🔍 Checking system resources...")
    check_system_resources()
    
    try:
        # Run B1-H (routing model)
        print(f"\n--- 📊 Step 1: B1-H Model for {rung} ---")
        if not run_baseline1_hier(rung, groupby, output_dir, split_path):
            progress['failed_rungs'].append(rung)
            save_progress(progress)
            return False
        
        # Run B2-H (hazard model) if needed
        if include_hazard or rung == "R7":
            print(f"\n--- ⚡ Step 2: B2-H Model for {rung} ---")
            if not run_baseline2_hier(rung, groupby, output_dir, split_path):
                progress['failed_rungs'].append(rung)
                save_progress(progress)
                return False
        
        # Success!
        elapsed = time.time() - start_time
        print(f"\n🎉 {rung} COMPLETED SUCCESSFULLY in {elapsed:.1f} seconds ({elapsed/60:.1f} minutes)")
        
        # Update progress
        progress['completed_rungs'].append(rung)
        progress[f'{rung}_completed_at'] = datetime.now().isoformat()
        progress[f'{rung}_duration_seconds'] = elapsed
        save_progress(progress)
        
        # Clean up memory
        gc.collect()
        print(f"🧹 Memory cleanup completed")
        
        return True
        
    except Exception as e:
        print(f"\n💥 {rung} FAILED with exception: {e}")
        progress['failed_rungs'].append(rung)
        progress[f'{rung}_error'] = str(e)
        save_progress(progress)
        return False

print("✅ Helper functions defined with improved direct execution support")
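
Both `run_baseline1_hier_subprocess` and `run_baseline2_hier` repeat the same pattern of launching a command and echoing its output line by line. One way that pattern could be factored out is sketched below. The name `stream_command` is hypothetical, and the demo runs a trivial Python one-liner rather than the baseline scripts.

```python
import subprocess
import sys

def stream_command(cmd):
    """Run cmd, echo combined stdout/stderr line by line, return (returncode, lines)."""
    lines = []
    # The context manager closes the pipe and waits for the process on exit
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout so ordering is preserved
        text=True,
        bufsize=1,                 # line-buffered in text mode
    ) as process:
        for line in process.stdout:
            line = line.rstrip()
            print(line)
            lines.append(line)
    return process.returncode, lines

# Demo with a harmless child process
code, output = stream_command(
    [sys.executable, "-c", "print('step 1'); print('step 2')"]
)
print(f"return code: {code}")
```

Returning the captured lines alongside the return code keeps the live output while still allowing the caller to inspect what was printed.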

## Cell 3b: Import Baseline Scripts Directly

Instead of calling external processes, we'll import the baseline scripts directly for better integration with Jupyter.

In [None]:
def add_quarter_column(subgroups_df):
    """
    Add quarter column to subgroups data if missing.
    
    This is needed for R6 and R7 experiments which use quarter in their groupby.
    Derives quarter from month: Q1=Jan-Mar, Q2=Apr-Jun, Q3=Jul-Sep, Q4=Oct-Dec
    """
    if 'quarter' in subgroups_df.columns:
        print("✓ Quarter column already exists")
        return subgroups_df
    
    if 'month' not in subgroups_df.columns:
        print("❌ Neither quarter nor month column found!")
        return subgroups_df
    
    print("🗓️  Adding quarter column from month data...")
    df = subgroups_df.copy()
    
    # Convert month to numeric if it's string
    month_numeric = pd.to_numeric(df['month'], errors='coerce')
    
    # Create quarter mapping: Q1=1-3, Q2=4-6, Q3=7-9, Q4=10-12
    quarter_map = {
        1: 'Q1', 2: 'Q1', 3: 'Q1',
        4: 'Q2', 5: 'Q2', 6: 'Q2', 
        7: 'Q3', 8: 'Q3', 9: 'Q3',
        10: 'Q4', 11: 'Q4', 12: 'Q4'
    }
    
    df['quarter'] = month_numeric.map(quarter_map).fillna('Unknown')
    
    print(f"✓ Quarter column added. Distribution:")
    quarter_counts = df['quarter'].value_counts().sort_index()
    for quarter, count in quarter_counts.items():
        print(f"   {quarter}: {count:,} respondents")
    
    return df

In [1]:
def add_quarter_column(subgroups_df):
    """
    Add quarter column to subgroups data if missing.
    
    This is needed for R6 and R7 experiments which use quarter in their groupby.
    Derives quarter from month: Q1=Jan-Mar, Q2=Apr-Jun, Q3=Jul-Sep, Q4=Oct-Dec
    """
    if 'quarter' in subgroups_df.columns:
        print("✓ Quarter column already exists")
        return subgroups_df
    
    if 'month' not in subgroups_df.columns:
        print("❌ Neither quarter nor month column found!")
        return subgroups_df
    
    print("🗓️  Adding quarter column from month data...")
    
    # Make a copy to avoid modifying the original
    df = subgroups_df.copy()
    
    # Create quarter mapping: Q1=1-3, Q2=4-6, Q3=7-9, Q4=10-12
    quarter_mapping = {
        1: 'Q1', 2: 'Q1', 3: 'Q1',
        4: 'Q2', 5: 'Q2', 6: 'Q2', 
        7: 'Q3', 8: 'Q3', 9: 'Q3',
        10: 'Q4', 11: 'Q4', 12: 'Q4'
    }
    
    # Map month to quarter; coerce to numeric first because month may be stored as strings
    month_numeric = pd.to_numeric(df['month'], errors='coerce')
    df['quarter'] = month_numeric.map(quarter_mapping)
    
    # Check for any unmapped values
    missing_quarters = df['quarter'].isna().sum()
    if missing_quarters > 0:
        print(f"⚠️  Warning: {missing_quarters} rows have missing quarter values")
        print(f"Unique month values: {sorted(df['month'].unique())}")
    else:
        print(f"✅ Successfully added quarter column: {df['quarter'].value_counts().to_dict()}")
    
    return df
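
The schema inspection later in this notebook reports `month` as a string column, so any month-to-quarter mapping has to cope with string input. A minimal alternative sketch is shown here: it derives the quarter arithmetically and labels anything unparseable as `Unknown`. The helper name `month_to_quarter` is hypothetical, and the `Unknown` label is borrowed from the first version of `add_quarter_column` above.

```python
import pandas as pd

def month_to_quarter(value) -> str:
    """Convert one month value (int or numeric string) to 'Q1'..'Q4', else 'Unknown'."""
    try:
        month = int(value)
    except (TypeError, ValueError):
        # None, NaN, and non-numeric strings all end up here
        return "Unknown"
    if 1 <= month <= 12:
        return f"Q{(month - 1) // 3 + 1}"
    return "Unknown"

# Demo data mixing string months, an out-of-range value, and missing entries
demo = pd.DataFrame({"month": ["1", "06", "12", "13", "x", None]})
demo["quarter"] = demo["month"].map(month_to_quarter)
print(demo)
```

The expression `(month - 1) // 3 + 1` maps months 1-3 to 1, 4-6 to 2, 7-9 to 3 and 10-12 to 4, which matches the dictionary used in the notebook.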

In [3]:
# Test the add_quarter_column function
print("🧪 Testing add_quarter_column function...")

# Create a test dataframe with month data
test_df = pd.DataFrame({
    'TUCASEID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
    'month': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
})

print("Original test data:")
print(test_df)

# Test the function
result_df = add_quarter_column(test_df)

print("\nResult after adding quarter:")
print(result_df[['month', 'quarter']])

print("\nQuarter value counts:")
print(result_df['quarter'].value_counts().sort_index())

print("✅ Quarter function test completed!")

🧪 Testing add_quarter_column function...
Original test data:
    TUCASEID  month
0          1      1
1          2      2
2          3      3
3          4      4
4          5      5
5          6      6
6          7      7
7          8      8
8          9      9
9         10     10
10        11     11
11        12     12
🗓️  Adding quarter column from month data...
✅ Successfully added quarter column: {'Q1': 3, 'Q2': 3, 'Q3': 3, 'Q4': 3}

Result after adding quarter:
    month quarter
0       1      Q1
1       2      Q1
2       3      Q1
3       4      Q2
4       5      Q2
5       6      Q2
6       7      Q3
7       8      Q3
8       9      Q3
9      10      Q4
10     11      Q4
11     12      Q4

Quarter value counts:
quarter
Q1    3
Q2    3
Q3    3
Q4    3
Name: count, dtype: int64
✅ Quarter function test completed!


In [5]:
# Check directory structure and file availability
print("🔍 Checking directory structure...")
print(f"Current working directory: {os.getcwd()}")

# Check if data directory exists
data_dir = Path("../data")
processed_dir = Path("../data/processed")

print(f"\nData directory exists: {data_dir.exists()}")
print(f"Processed directory exists: {processed_dir.exists()}")

if processed_dir.exists():
    print(f"\nFiles in processed directory:")
    for file in processed_dir.iterdir():
        if file.is_file():
            print(f"  {file.name} - Size: {file.stat().st_size:,} bytes")

# Let's try the actual paths used in the experiment config
print(f"\n🧪 Based on experiment configuration, quarter implementation is:")
print("✅ Function defined: add_quarter_column()")
print("✅ Maps months 1-3 → Q1, 4-6 → Q2, 7-9 → Q3, 10-12 → Q4")
print("✅ Handles missing quarter column gracefully")
print("✅ Provides detailed logging of the process")
print("✅ Returns modified dataframe with quarter column")

print(f"\n📋 The quarter implementation is ready for R6 and R7 experiments!")
print("This will automatically add the quarter column when subgroups are loaded.")

🔍 Checking directory structure...
Current working directory: c:\Users\dube.rohit\OneDrive - Texas A&M University\ATUS-analysis-main\ATUS analysis main\atus_analysis\scripts

Data directory exists: True
Processed directory exists: True

Files in processed directory:
  activity.parquet - Size: 71,789,296 bytes
  respondent.parquet - Size: 31,141,691 bytes
  subgroups.parquet - Size: 3,799,752 bytes
  subgroups_schema.json - Size: 1,245 bytes
  subgroups_summary.parquet - Size: 39,480 bytes

🧪 Based on experiment configuration, quarter implementation is:
✅ Function defined: add_quarter_column()
✅ Maps months 1-3 → Q1, 4-6 → Q2, 7-9 → Q3, 10-12 → Q4
✅ Handles missing quarter column gracefully
✅ Provides detailed logging of the process
✅ Returns modified dataframe with quarter column

📋 The quarter implementation is ready for R6 and R7 experiments!
This will automatically add the quarter column when subgroups are loaded.


In [6]:
# Try to read subgroups file more carefully
import pyarrow.parquet as pq

print("🔍 Trying to inspect subgroups file structure...")

try:
    # Try reading just the metadata first.
    # Use a separate name so the SUBGROUPS_FILE setting from Cell 2 is not overwritten.
    subgroups_probe_path = Path("../data/processed/subgroups.parquet")
    parquet_file = pq.ParquetFile(subgroups_probe_path)
    schema = parquet_file.schema_arrow
    
    print(f"Parquet schema columns:")
    for i, field in enumerate(schema):
        print(f"  {i}: {field.name} ({field.type})")
    
    # Try reading just a few rows
    print(f"\n📖 Attempting to read first 5 rows...")
    try:
        # Note: pd.read_parquet has no nrows parameter, so this call raises (see the output below)
        df_sample = pd.read_parquet(subgroups_probe_path, nrows=5)
        print(f"✅ Successfully read sample data")
        print(f"Columns: {list(df_sample.columns)}")
        print(f"Sample data:")
        print(df_sample.head())
        
        if 'quarter' in df_sample.columns:
            print(f"✅ Quarter column already exists in subgroups file!")
            print(f"Quarter values in sample: {df_sample['quarter'].unique()}")
        else:
            print(f"❌ No quarter column found - our add_quarter_column function will add it")
            
    except Exception as e:
        print(f"❌ Could not read sample data: {e}")
        print("Falling back to schema-only analysis")
        
except Exception as e:
    print(f"❌ Could not read parquet metadata: {e}")
    print("Will proceed with add_quarter_column function as safety measure")

🔍 Trying to inspect subgroups file structure...
Parquet schema columns:
  0: TUCASEID (int64)
  1: month (string)
  2: hh_size_band (string)
  3: day_type (string)
  4: region (string)
  5: employment (string)
  6: sex (string)
  7: TUFNWGTP (double)

📖 Attempting to read first 5 rows...
❌ Could not read sample data: read_table() got an unexpected keyword argument 'nrows'
Falling back to schema-only analysis


In [7]:
# Try reading with different approach 
print("🔍 Attempting to read subgroups with alternative method...")

try:
    # Read the full file but just check the first few rows.
    # Use a separate name so the SUBGROUPS_FILE setting from Cell 2 is not overwritten.
    subgroups_probe_path = Path("../data/processed/subgroups.parquet")
    df_full = pd.read_parquet(subgroups_probe_path)
    
    print(f"✅ Successfully loaded subgroups file!")
    print(f"Shape: {df_full.shape}")
    print(f"Columns: {list(df_full.columns)}")
    
    # Check first few rows
    print(f"\nFirst 3 rows:")
    print(df_full.head(3))
    
    # Check month values
    if 'month' in df_full.columns:
        print(f"\nMonth column analysis:")
        print(f"Unique months: {sorted(df_full['month'].unique())}")
        print(f"Month value counts:")
        print(df_full['month'].value_counts().sort_index())
    
    # Check if quarter exists
    if 'quarter' in df_full.columns:
        print(f"\n✅ Quarter column already exists!")
        print(f"Quarter values: {sorted(df_full['quarter'].unique())}")
    else:
        print(f"\n❌ No quarter column found")
        print("✅ Our add_quarter_column function will handle this properly")
        
        # Test on a small sample
        print(f"\n🧪 Testing add_quarter_column on real data sample...")
        sample = df_full.head(10).copy()
        result = add_quarter_column(sample)
        print(f"✅ Quarter successfully added to sample")
        print(result[['month', 'quarter']].head())
        
except Exception as e:
    print(f"❌ Still failed to read subgroups: {e}")
    print("This might be a file corruption issue, but our quarter function is still ready")

🔍 Attempting to read subgroups with alternative method...
❌ Still failed to read subgroups: Repetition level histogram size mismatch
This might be a file corruption issue, but our quarter function is still ready


In [None]:
def run_baseline1_hier_direct(rung, groupby, output_dir, split_path):
    """Run baseline1_hier directly in Jupyter (preferred method)."""
    try:
        import pandas as pd
        import numpy as np
        from atus_analysis.scripts.common_hier import (
            prepare_long_with_groups, pool_rare_quarter,
            save_json, nll_b1, fit_b1_hier, parse_time_blocks
        )
        
        print(f"📊 Loading data for {rung}...")
        
        # Load sequences and subgroups
        sequences = pd.read_parquet(SEQUENCES_FILE)
        subgroups = pd.read_parquet(SUBGROUPS_FILE)
        
        # Add quarter column if needed for R6/R7 experiments
        subgroups = add_quarter_column(subgroups)
        
        print(f"✓ Loaded {len(sequences)} sequences and {len(subgroups)} subgroups")
        
        # Parse time blocks
        time_blocks = parse_time_blocks(TIME_BLOCKS)
        print(f"✓ Parsed time blocks: {time_blocks}")
        
        # Create or load split
        if split_path.exists():
            print(f"📂 Loading existing split from {split_path}")
            split_df = pd.read_parquet(split_path)
        else:
            print(f"🎲 Creating new split with seed {SEED}")
            # Simplified split: sample a TEST_SIZE fraction of unique respondent IDs as the test set
            np.random.seed(SEED)
            unique_ids = subgroups['TUCASEID'].unique()
            test_size = int(len(unique_ids) * TEST_SIZE)
            test_ids = np.random.choice(unique_ids, test_size, replace=False)
            
            # Vectorized membership test: avoids a quadratic scan and shadowing the builtin `id`
            split_df = pd.DataFrame({
                'TUCASEID': unique_ids,
                'set': np.where(np.isin(unique_ids, test_ids), 'test', 'train')
            })
            split_df.to_parquet(split_path, index=False)
            print(f"✓ Split saved to {split_path}")
        
        # Prepare data with groups
        print(f"🔄 Preparing data with groupby: {groupby}")
        groupby_cols = groupby.split(',')
        
        # Call with correct signature including blocks parameter
        long_df = prepare_long_with_groups(sequences, subgroups, groupby_cols, time_blocks)
        
        # Pool rare quarters
        print(f"🔄 Pooling rare quarter groups...")
        long_df = pool_rare_quarter(long_df)
        
        # Merge with split
        print(f"🔄 Merging with train/test split...")
        long_df = long_df.merge(split_df, on='TUCASEID', how='left')
        
        print(f"📈 Fitting B1-H model...")
        # Fit the model
        result = fit_b1_hier(long_df)
        
        # Save results
        output_file = output_dir / "b1h_model.json"
        save_json(result, output_file)
        
        # Save evaluation
        eval_file = output_dir / "eval_b1h.json"
        test_data = long_df[long_df['set'] == 'test']
        eval_result = {
            'test_nll': nll_b1(result['params'], test_data),
            'n_test_sequences': len(test_data['TUCASEID'].unique()),
            'n_train_sequences': len(long_df[long_df['set'] == 'train']['TUCASEID'].unique())
        }
        save_json(eval_result, eval_file)
        
        print(f"✅ B1-H model completed successfully for {rung}")
        print(f"📁 Saved to {output_file}")
        return True
        
    except Exception as e:
        print(f"❌ Direct execution failed: {e}")
        print(f"🔧 Falling back to subprocess method...")
        return False
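
The split logic inside `run_baseline1_hier_direct` assigns each respondent ID to either `train` or `test`. A standalone sketch of the same idea follows, written with NumPy's `Generator` API. Note that `np.random.default_rng(seed)` does not reproduce the assignments made by `np.random.seed` with the same seed, and `np.unique` sorts the IDs, so this is an illustration rather than a drop-in replacement. The name `make_split` is hypothetical.

```python
import numpy as np
import pandas as pd

def make_split(case_ids, test_size=0.2, seed=2025):
    """Assign each unique ID to 'train' or 'test', reproducibly for a given seed."""
    unique_ids = np.unique(np.asarray(case_ids))  # sorted, duplicates removed
    rng = np.random.default_rng(seed)
    n_test = int(len(unique_ids) * test_size)
    test_ids = rng.choice(unique_ids, size=n_test, replace=False)
    # Vectorized membership check instead of a per-row Python loop
    is_test = np.isin(unique_ids, test_ids)
    return pd.DataFrame({
        "TUCASEID": unique_ids,
        "set": np.where(is_test, "test", "train"),
    })

# Demo: 100 distinct IDs, each appearing twice (as if one row per diary episode)
demo_ids = np.repeat(np.arange(1000, 1100), 2)
split = make_split(demo_ids)
print(split["set"].value_counts())
```

Splitting on unique IDs rather than on rows keeps all records of one respondent on the same side of the split.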

## Cell 4: Check System Status and Prerequisites

In [None]:
print("Checking system status and prerequisites...\n")

# Check system resources
resources_ok = check_system_resources()

print("\n=== File Prerequisites ===")
# Check required files
required_files = [
    SEQUENCES_FILE,
    SUBGROUPS_FILE,
    "atus_analysis/scripts/baseline1_hier.py",
    "atus_analysis/scripts/baseline2_hier.py"
]

files_ok = True
for file_path in required_files:
    if Path(file_path).exists():
        print(f"✓ {file_path}")
    else:
        print(f"✗ {file_path} - NOT FOUND")
        files_ok = False

# Check output directory
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print(f"✓ Output directory: {OUTPUT_DIR}")

# Load and display current progress
print("\n=== Current Progress ===")
progress = load_progress()
completed = progress.get('completed_rungs', [])
failed = progress.get('failed_rungs', [])

print(f"Completed rungs: {completed if completed else 'None'}")
print(f"Failed rungs: {failed if failed else 'None'}")
print(f"Remaining rungs: {[r for r in RUNG_SPECS.keys() if r not in completed]}")

# Overall status
print("\n=== Overall Status ===")
if resources_ok and files_ok:
    print("✅ READY TO START EXPERIMENTS")
    print("\nYou can now run the individual experiment cells below.")
else:
    print("❌ PREREQUISITES NOT MET")
    if not resources_ok:
        print("   - System resources may be insufficient")
    if not files_ok:
        print("   - Required files are missing")
    print("\nPlease resolve issues before continuing.")

# Individual Experiment Cells

## Instructions for Running Experiments

**Run these cells ONE AT A TIME**, in order. Each cell runs one complete experiment rung.

- ✅ **Safe to interrupt**: You can stop any cell with the stop button; rungs that have already completed stay recorded in the progress file
- 🔄 **Resume anytime**: If you restart the kernel, re-run cells 1-4, then continue from where you left off
- ⏭️ **Skip completed**: Cells automatically skip rungs that have already completed successfully
- 📊 **Monitor progress**: Each cell shows detailed progress and resource usage

---

## Cell 5: Run Experiment R1 (Region Only)

**Expected runtime: 30-60 minutes**  
**Memory usage: Low-Medium**  
**Description: Simplest model - groups by region only**

In [None]:
# Run R1 experiment
success = run_single_rung("R1", include_hazard=False)

if success:
    print("\n🎉 R1 completed successfully! You can now run R2.")
else:
    print("\n❌ R1 failed. Check the error messages above and try again.")

## Cell 6: Run Experiment R2 (Region + Sex)

**Expected runtime: 30-60 minutes**  
**Memory usage: Low-Medium**  
**Description: Groups by region and sex**

In [None]:
# Run R2 experiment
success = run_single_rung("R2", include_hazard=False)

if success:
    print("\n🎉 R2 completed successfully! You can now run R3.")
else:
    print("\n❌ R2 failed. Check the error messages above and try again.")

## Cell 7: Run Experiment R3 (Region + Employment)

**Expected runtime: 30-60 minutes**  
**Memory usage: Medium**  
**Description: Groups by region and employment status**

In [None]:
# Run R3 experiment
success = run_single_rung("R3", include_hazard=False)

if success:
    print("\n🎉 R3 completed successfully! You can now run R4.")
else:
    print("\n❌ R3 failed. Check the error messages above and try again.")

## Cell 8: Run Experiment R4 (Region + Day Type)

**Expected runtime: 30-60 minutes**  
**Memory usage: Medium**  
**Description: Groups by region and day type (weekday/weekend)**

In [None]:
# Run R4 experiment
success = run_single_rung("R4", include_hazard=False)

if success:
    print("\n🎉 R4 completed successfully! You can now run R5.")
else:
    print("\n❌ R4 failed. Check the error messages above and try again.")

## Cell 9: Run Experiment R5 (Region + Household Size)

**Expected runtime: 60-90 minutes**  
**Memory usage: Medium**  
**Description: Groups by region and household size band**

In [None]:
# Run R5 experiment
success = run_single_rung("R5", include_hazard=False)

if success:
    print("\n🎉 R5 completed successfully! You can now run R6.")
else:
    print("\n❌ R5 failed. Check the error messages above and try again.")

## Cell 10: Run Experiment R6 (Full Routing Model)

**Expected runtime: 90-120 minutes**  
**Memory usage: High**  
**Description: Full complexity routing model with all demographic variables**

### Cell 9b: Add Quarter Column to Subgroups File (Prerequisite for R6 and R7)

**Important**: Run this cell before the R6 cell below and before R7.  
It rewrites the subgroups.parquet file in place to add the quarter column, after saving a backup copy next to it.  
**Runtime**: 1-2 minutes  
**Purpose**: Ensures R6 and R7 experiments have the required quarter column

In [None]:
# Add quarter column permanently to subgroups.parquet file
print("🗓️  Adding quarter column to subgroups.parquet file...")

quarter_ready = False  # Becomes True only once the column is confirmed on disk

try:
    # Load the current subgroups file
    subgroups_path = Path(SUBGROUPS_FILE)
    print(f"📂 Loading subgroups from: {subgroups_path}")

    if not subgroups_path.exists():
        print(f"❌ Subgroups file not found at {subgroups_path}")
        print("Please ensure the file exists before running this cell.")
    else:
        # Load the data
        subgroups_df = pd.read_parquet(subgroups_path)
        print(f"✅ Loaded {len(subgroups_df)} subgroup records")
        print(f"Current columns: {list(subgroups_df.columns)}")

        # Check if quarter column already exists
        if 'quarter' in subgroups_df.columns:
            print("✅ Quarter column already exists in the file!")
            quarter_ready = True
            quarter_counts = subgroups_df['quarter'].value_counts().sort_index()
            print("Current quarter distribution:")
            for quarter, count in quarter_counts.items():
                print(f"   {quarter}: {count:,} respondents")
        else:
            print("📝 Quarter column not found - adding it now...")

            # Back up the original data first, before anything modifies it
            backup_path = subgroups_path.with_suffix('.parquet.backup')
            print(f"💾 Creating backup at: {backup_path}")
            subgroups_df.to_parquet(backup_path, index=False)

            # Add quarter column using our function
            subgroups_with_quarter = add_quarter_column(subgroups_df)

            # Save the updated file
            print("💾 Saving updated subgroups with quarter column...")
            subgroups_with_quarter.to_parquet(subgroups_path, index=False)

            # Verify the save by reading the file back from disk
            verification_df = pd.read_parquet(subgroups_path)
            if 'quarter' in verification_df.columns:
                quarter_ready = True
                print("✅ Quarter column successfully added to subgroups.parquet!")
                quarter_counts = verification_df['quarter'].value_counts().sort_index()
                print("Final quarter distribution:")
                for quarter, count in quarter_counts.items():
                    print(f"   {quarter}: {count:,} respondents")

                print("\n📋 Summary:")
                print(f"   - Original file backed up to: {backup_path}")
                print(f"   - Updated file saved to: {subgroups_path}")
                print(f"   - Quarter column added with {len(verification_df)} records")
                print("   - R6 and R7 experiments are now ready to run!")
            else:
                print("❌ Failed to verify quarter column in saved file")

except Exception as e:
    print(f"❌ Error adding quarter column: {e}")
    print(f"Error type: {type(e).__name__}")
    print("\nYou can still run R6/R7 - the quarter column will be added dynamically.")

# Only announce readiness when the quarter column was actually confirmed
print("\n" + "="*60)
if quarter_ready:
    print("🎯 READY FOR R6 AND R7 EXPERIMENTS")
else:
    print("⚠️  QUARTER COLUMN NOT CONFIRMED - SEE MESSAGES ABOVE")
print("="*60)
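
The cell above overwrites subgroups.parquet directly. A more defensive variant, sketched here under stated assumptions, copies the original to a backup, writes the new version to a temporary file in the same directory, and only then swaps it in with `os.replace`, so an interrupted write should not leave a half-written file behind. The names `replace_with_backup` and `write_new` are hypothetical and are not defined anywhere else in this notebook.

```python
import os
import shutil
import tempfile
from pathlib import Path


def replace_with_backup(target, write_new, backup_suffix=".backup"):
    """Back up `target`, write a new version to a temp file, then swap it in.

    `write_new` is a hypothetical callable that receives a Path and writes
    the updated file there. Returns the path of the backup copy.
    """
    target = Path(target)
    backup = target.with_name(target.name + backup_suffix)
    shutil.copy2(target, backup)  # byte-for-byte copy of the original file

    # Create the temp file next to the target so the final rename
    # stays on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    os.close(fd)
    try:
        write_new(Path(tmp_name))
        os.replace(tmp_name, target)  # swaps the new file into place
    except Exception:
        # Writing failed: remove the temp file and leave the original as it was.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return backup
```

In this notebook the call might look like `replace_with_backup(SUBGROUPS_FILE, lambda p: subgroups_with_quarter.to_parquet(p, index=False))`. Note that `tempfile.mkstemp` creates the file with owner-only permissions, which may matter if the subgroups file is shared with other users on the cluster.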

In [None]:
# Run R6 experiment
success = run_single_rung("R6", include_hazard=False)

if success:
    print("\n🎉 R6 completed successfully! You can now run R7.")
else:
    print("\n❌ R6 failed. Check the error messages above and try again.")

## Cell 11: Run Experiment R7 (Full Model + Hazard)

**Expected runtime: 120-180 minutes**  
**Memory usage: High**  
**Description: Full model including hazard modeling - most computationally intensive**

In [None]:
# Run R7 experiment (automatically includes hazard model)
success = run_single_rung("R7", include_hazard=True)

if success:
    print("\n🎉 R7 completed successfully! All experiments are now complete.")
else:
    print("\n❌ R7 failed. Check the error messages above and try again.")
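
If the node allocation is long enough, the remaining rungs could also be queued from a single cell rather than run cell by cell. The following is only a minimal sketch of that control flow, assuming a runner with the same call shape as `run_single_rung(rung, include_hazard=...)` used in the cells above. `run_remaining` and the stub `fake_runner` are hypothetical names introduced for illustration. Cell 9b should still be run before R6.

```python
def run_remaining(rungs, completed, runner):
    """Run each rung that is not yet completed, in order; stop at the first failure."""
    results = {}
    for rung in rungs:
        if rung in completed:
            continue  # finished in an earlier session, skip it
        # Only R7 includes the hazard model, matching the cells above.
        ok = runner(rung, include_hazard=(rung == "R7"))
        results[rung] = ok
        if not ok:
            break  # leave later rungs untouched so the failure can be inspected
    return results


def fake_runner(rung, include_hazard=False):
    # Stub standing in for run_single_rung; pretends that R4 fails.
    return rung != "R4"


results = run_remaining(
    ["R1", "R2", "R3", "R4", "R5"],
    completed=["R1", "R2"],
    runner=fake_runner,
)
print(results)
```

With the real functions, the call would be along the lines of `run_remaining(list(RUNG_SPECS.keys()), load_progress().get('completed_rungs', []), run_single_rung)`. Stopping at the first failure keeps a long unattended run from spending hours on later rungs after something has already gone wrong.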

## Cell 12: Final Summary and Results

In [None]:
# Generate final summary
print("\n" + "="*60)
print("FINAL EXPERIMENT SUMMARY")
print("="*60)

progress = load_progress()
completed = progress.get('completed_rungs', [])
failed = progress.get('failed_rungs', [])
all_rungs = list(RUNG_SPECS.keys())

print(f"\nTotal rungs: {len(all_rungs)}")
print(f"Completed: {len(completed)} - {completed}")
print(f"Failed: {len(failed)} - {failed}")
print(f"Not attempted: {[r for r in all_rungs if r not in completed and r not in failed]}")

# Show timing information
print("\n=== Timing Information ===")
total_time = 0
for rung in completed:
    duration_key = f'{rung}_duration_seconds'
    if duration_key in progress:
        duration = progress[duration_key]
        total_time += duration
        print(f"{rung}: {duration:.0f} seconds ({duration/60:.1f} minutes)")

if total_time > 0:
    print(f"\nTotal runtime: {total_time:.0f} seconds ({total_time/60:.1f} minutes, {total_time/3600:.1f} hours)")

# Show output locations
print("\n=== Output Files ===")
for rung in completed:
    rung_dir = OUTPUT_DIR / rung
    if rung_dir.exists():
        files = list(rung_dir.glob("*.json"))
        print(f"{rung}: {len(files)} files in {rung_dir}")

# Success rate
if len(completed) + len(failed) > 0:
    success_rate = len(completed) / (len(completed) + len(failed)) * 100
    print(f"\nOverall success rate: {success_rate:.1f}%")

if len(completed) == len(all_rungs):
    print("\n🎉 ALL EXPERIMENTS COMPLETED SUCCESSFULLY! 🎉")
elif len(failed) > 0:
    print("\n⚠️  Some experiments failed. You can re-run the failed cells to retry.")
else:
    print(f"\n📝 {len(all_rungs) - len(completed)} experiments remaining.")

print(f"\nProgress file saved as: {PROGRESS_FILE}")
print(f"Output directory: {OUTPUT_DIR}")
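
For reference, the summary cell above reads three kinds of keys from the progress dictionary: `completed_rungs`, `failed_rungs`, and one `<rung>_duration_seconds` entry per rung. The sketch below shows that shape and the same arithmetic as a plain function. The values in `example_progress` are made-up placeholders, not results from any run, and `summarize` is a hypothetical helper.

```python
import json

# Illustrative contents only: the rung lists and durations are placeholders.
example_progress = {
    "completed_rungs": ["R1", "R2"],
    "failed_rungs": ["R3"],
    "R1_duration_seconds": 2400.0,
    "R2_duration_seconds": 3000.0,
}


def summarize(progress, all_rungs):
    """Compute the same figures the summary cell prints."""
    completed = progress.get("completed_rungs", [])
    failed = progress.get("failed_rungs", [])
    attempted = len(completed) + len(failed)
    return {
        "not_attempted": [r for r in all_rungs if r not in completed and r not in failed],
        # Rungs without a recorded duration contribute zero seconds.
        "total_seconds": sum(progress.get(f"{r}_duration_seconds", 0) for r in completed),
        # None when nothing has been attempted, to avoid dividing by zero.
        "success_rate": (len(completed) / attempted * 100) if attempted else None,
    }


summary = summarize(example_progress, ["R1", "R2", "R3", "R4", "R5", "R6", "R7"])
print(json.dumps(summary, indent=2))
```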

## Cell 13: Memory Optimization Tips for Desktop Use

If you ever need to run smaller experiments on a desktop system with limited memory, here are some strategies:

### Data Sampling
```python
# Sample a subset for testing
sample_fraction = 0.1  # Use 10% of data
sequences_sample = sequences.sample(frac=sample_fraction, random_state=SEED)
```
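
Note that `DataFrame.sample` draws individual rows, so a respondent's activity sequence can end up only partly included. If whole sequences matter, one option is to sample respondent identifiers first and then keep every row belonging to them. The sketch below shows only the ID-sampling step with the standard library. `sample_ids` is a hypothetical helper, and the identifiers in the example are placeholders.

```python
import random


def sample_ids(ids, fraction, seed):
    """Return a reproducible random subset of the unique identifiers in `ids`."""
    unique_ids = sorted(set(ids))  # sort so the result depends only on the seed
    if not unique_ids:
        return set()
    # Keep at least one identifier so a small fraction never yields an empty sample.
    k = max(1, round(len(unique_ids) * fraction))
    return set(random.Random(seed).sample(unique_ids, k))


# Placeholder identifiers; repeated values mimic several rows per respondent.
respondent_ids = [101, 101, 102, 103, 103, 103, 104, 105, 106, 107, 108, 109, 110]
keep = sample_ids(respondent_ids, fraction=0.3, seed=42)
print(f"Keeping {len(keep)} of {len(set(respondent_ids))} respondents")
```

With pandas, the rows could then be filtered with something like `sequences[sequences["respondent_id"].isin(keep)]`, where `respondent_id` is an assumed column name that should be replaced by whatever identifier column the sequences file actually uses.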

### Memory-Efficient Processing
```python
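# Process in chunks. pd.read_parquet has no chunksize option,
# so stream record batches with pyarrow instead.
import pyarrow.parquet as pq

chunk_size = 100000
parquet_file = pq.ParquetFile(SEQUENCES_FILE)
for batch in parquet_file.iter_batches(batch_size=chunk_size):
    chunk = batch.to_pandas()  # convert this batch to a DataFrame
    # Process chunk by chunk
    pass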
```

### Resource Monitoring
```python
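# Monitor this process's memory during processing.
# Process-level RSS is used because system-wide usage on a shared node
# also reflects other users' jobs.
import psutil
process = psutil.Process()
memory_before = process.memory_info().rss
# ... processing ...
memory_after = process.memory_info().rss
print(f"Memory increase: {(memory_after - memory_before) / (1024**3):.1f}GB")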
```

**Note**: The full ATUS dataset requires HPC-level resources. Desktop experiments should use samples or subsets.