# ATUS Hierarchical Baseline Experiments - HPC Version

## Overview

This notebook runs the ATUS (American Time Use Survey) hierarchical baseline experiments safely on HPC systems. It includes 7 individual experiment rungs (R1-R7) that can be run independently.

## Experiment Structure

- **R1**: Region only
- **R2**: Region + Sex
- **R3**: Region + Employment
- **R4**: Region + Day Type
- **R5**: Region + Household Size Band
- **R6**: Full routing model (Employment + Day Type + HH Size + Sex + Region + Quarter)
- **R7**: Full model with hazard (same grouping as R6 but includes hazard modeling)

## How to Use This Notebook

1. **Run Setup Cells**: Execute cells 1-3 to import libraries and set up the environment
2. **Check System Resources**: Run cell 4 to verify your HPC node has sufficient resources
3. **Run Individual Experiments**: Execute cells 5-11 one at a time for each rung (R1-R7)
4. **Monitor Progress**: Each cell will show detailed progress and can be interrupted safely
5. **Resume if Needed**: If interrupted, you can restart from any cell - completed experiments won't be re-run

## Expected Runtime

- **R1-R4**: 30-60 minutes each
- **R5-R6**: 60-120 minutes each  
- **R7**: 120-180 minutes (includes hazard model)
- **Total**: 6-12 hours for all experiments

## Resource Requirements

- **Memory**: At least 16GB RAM recommended
- **Storage**: At least 20GB free disk space
- **CPU**: Multi-core recommended for faster processing

## Import Required Libraries

In [1]:
# Import required libraries
import os
import sys
import subprocess
import time
import json
import psutil
import pandas as pd
from pathlib import Path
from datetime import datetime
import gc
import logging

from pathlib import Path
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)

print("✓ Libraries imported successfully")
print(f"Python version: {sys.version}")
# Set working directory
os.chdir('/ztank/scratch/user/u.rd143338/atus_analysis-main')

print(f"Working directory: {os.getcwd()}")

✓ Libraries imported successfully
Python version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]
Working directory: /ztank/scratch/user/u.rd143338/atus_analysis-main


## Define Experiment Configuration

In [19]:
# Experiment configuration
RUNG_SPECS = {
    "R1": "region",
    "R2": "region,sex", 
    "R3": "region,employment",
    "R4": "region,day_type",
    "R5": "region,hh_size_band",
    "R6": "employment,day_type,hh_size_band,sex,region,quarter",
    "R7": "employment,day_type,hh_size_band,sex,region,quarter"  # + hazard
}

# File paths (adjust if needed)
BASE_DIR = Path(".")
SEQUENCES_FILE = "atus_analysis/data/sequences/markov_sequences.parquet"
SUBGROUPS_FILE = "atus_analysis/data/processed/subgroups.parquet"
OUTPUT_DIR = Path("atus_analysis/data/models")
PROGRESS_FILE = "experiment_progress_jupyter.json"

# Experiment settings
SEED = 2025
TEST_SIZE = 0.2
TIME_BLOCKS = "night:0-5,morning:6-11,afternoon:12-17,evening:18-23"
DWELL_BINS = "1,2,3,4,6,9,14,20,30"

print("✓ Configuration set")
print(f"Output directory: {OUTPUT_DIR}")
print(f"Number of rungs: {len(RUNG_SPECS)}")

✓ Configuration set
Output directory: atus_analysis/data/models
Number of rungs: 7


## Define Helper Functions

In [28]:
def check_system_resources():
    """Check current system resources."""
    memory = psutil.virtual_memory()
    cpu_percent = psutil.cpu_percent(interval=1)
    
    print("=== System Resources ===")
    print(f"Memory: {memory.percent:.1f}% used, {memory.available / (1024**3):.1f}GB available")
    print(f"CPU: {cpu_percent:.1f}% usage")
    
    try:
        disk = psutil.disk_usage('.')
        print(f"Disk: {disk.free / (1024**3):.1f}GB free")
    except:
        print("Disk: Could not check disk usage")
    
    # Check if resources are adequate
    warnings = []
    if memory.available < 4 * (1024**3):  # Less than 4GB
        warnings.append(f"Low memory: only {memory.available / (1024**3):.1f}GB available")
    if cpu_percent > 80:
        warnings.append(f"High CPU usage: {cpu_percent:.1f}%")
    
    if warnings:
        print("\n⚠️  WARNINGS:")
        for warning in warnings:
            print(f"   - {warning}")
    else:
        print("\n✓ System resources look good")
    
    return len(warnings) == 0

def load_progress():
    """Load experiment progress from file."""
    if Path(PROGRESS_FILE).exists():
        with open(PROGRESS_FILE, 'r') as f:
            return json.load(f)
    return {'completed_rungs': [], 'failed_rungs': [], 'session_start': datetime.now().isoformat()}

def save_progress(progress):
    """Save experiment progress to file."""
    progress['last_updated'] = datetime.now().isoformat()
    with open(PROGRESS_FILE, 'w') as f:
        json.dump(progress, f, indent=2)

def is_rung_completed(rung, progress):
    """Check if a rung has been completed successfully."""
    return rung in progress.get('completed_rungs', [])

def run_baseline1_hier_direct(rung, groupby, output_dir, split_path):
    """Run baseline1_hier directly in Jupyter (preferred method)."""
    try:
        import pandas as pd
        import numpy as np
        from atus_analysis.scripts.common_hier import (
            prepare_long_with_groups, pool_rare_quarter,
            save_json, nll_b1, fit_b1_hier, parse_time_blocks
        )
        
        print(f"📊 Loading data for {rung}...")
        
        # Load sequences and subgroups
        sequences = pd.read_parquet(SEQUENCES_FILE)
        subgroups = pd.read_parquet(SUBGROUPS_FILE)
        
        print(f"✓ Loaded {len(sequences)} sequences and {len(subgroups)} subgroups")
        
        # Parse time blocks
        time_blocks = parse_time_blocks(TIME_BLOCKS)
        print(f"✓ Parsed time blocks: {time_blocks}")
        
        # Create or load split
        if split_path.exists():
            print(f"📂 Loading existing split from {split_path}")
            split_df = pd.read_parquet(split_path)
        else:
            print(f"🎲 Creating new split with seed {SEED}")
            # Create split logic here (simplified)
            np.random.seed(SEED)
            unique_ids = subgroups['TUCASEID'].unique()
            test_size = int(len(unique_ids) * TEST_SIZE)
            test_ids = np.random.choice(unique_ids, test_size, replace=False)
            
            split_df = pd.DataFrame({
                'TUCASEID': subgroups['TUCASEID'].unique(),
                'set': ['test' if id in test_ids else 'train' for id in subgroups['TUCASEID'].unique()]
            })
            split_df.to_parquet(split_path, index=False)
            print(f"✓ Split saved to {split_path}")
        
        # Prepare data with groups - fix the function call
        print(f"🔄 Preparing data with groupby: {groupby}")
        groupby_cols = groupby.split(',')
        
        # Call with correct signature including blocks parameter
        long_df = prepare_long_with_groups(sequences, subgroups, groupby_cols, time_blocks)
        
        # Pool rare quarters
        print(f"🔄 Pooling rare quarter groups...")
        long_df = pool_rare_quarter(long_df)
        
        # Merge with split
        print(f"🔄 Merging with train/test split...")
        long_df = long_df.merge(split_df, on='TUCASEID', how='left')
        
        print(f"📈 Fitting B1-H model...")
        # Fit the model
        result = fit_b1_hier(long_df)
        
        # Save results
        output_file = output_dir / "b1h_model.json"
        save_json(result, output_file)
        
        # Save evaluation
        eval_file = output_dir / "eval_b1h.json"
        test_data = long_df[long_df['set'] == 'test']
        eval_result = {
            'test_nll': nll_b1(result['params'], test_data),
            'n_test_sequences': len(test_data['TUCASEID'].unique()),
            'n_train_sequences': len(long_df[long_df['set'] == 'train']['TUCASEID'].unique())
        }
        save_json(eval_result, eval_file)
        
        print(f"✅ B1-H model completed successfully for {rung}")
        print(f"📁 Saved to {output_file}")
        return True
        
    except Exception as e:
        print(f"❌ Direct execution failed: {e}")
        print(f"🔍 Error details: {type(e).__name__}")
        print("🔄 Falling back to subprocess method...")
        return False

def run_baseline1_hier_subprocess(rung, groupby, output_dir, split_path):
    """Run baseline1_hier via subprocess (fallback method)."""
    cmd = [
        sys.executable, "-m", "atus_analysis.scripts.baseline1_hier",
        "--sequences", SEQUENCES_FILE,
        "--subgroups", SUBGROUPS_FILE,
        "--out_dir", str(output_dir),
        "--groupby", groupby,
        "--time_blocks", TIME_BLOCKS,
        "--seed", str(SEED),
        "--test_size", str(TEST_SIZE),
        "--split_path", str(split_path)
    ]
    
    print(f"🖥️  Running B1-H via subprocess for {rung}...")
    print(f"Command: {' '.join(cmd)}")
    
    # Use real-time output instead of capture_output
    try:
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, 
                                 universal_newlines=True, bufsize=1)
        
        # Stream output in real-time
        for line in process.stdout:
            print(line.rstrip())
        
        process.wait()
        
        if process.returncode == 0:
            print(f"✅ B1-H model completed successfully for {rung}")
            return True
        else:
            print(f"❌ B1-H model failed for {rung} (return code: {process.returncode})")
            return False
            
    except Exception as e:
        print(f"❌ Subprocess execution failed: {e}")
        return False

def run_baseline1_hier(rung, groupby, output_dir, split_path):
    """Run baseline1_hier with automatic fallback."""
    # Try direct execution first, fall back to subprocess if needed
    if not globals().get('USE_SUBPROCESS', True):
        print("🎯 Attempting direct execution...")
        if run_baseline1_hier_direct(rung, groupby, output_dir, split_path):
            return True
    
    print("🖥️  Using subprocess execution...")
    return run_baseline1_hier_subprocess(rung, groupby, output_dir, split_path)

def run_baseline2_hier(rung, groupby, output_dir, split_path):
    """Run baseline2_hier (hazard model) for a rung."""
    b1h_path = output_dir / "b1h_model.json"
    
    cmd = [
        sys.executable, "-m", "atus_analysis.scripts.baseline2_hier",
        "--sequences", SEQUENCES_FILE,
        "--subgroups", SUBGROUPS_FILE,
        "--out_dir", str(output_dir),
        "--groupby", groupby,
        "--time_blocks", TIME_BLOCKS,
        "--dwell_bins", DWELL_BINS,
        "--seed", str(SEED),
        "--test_size", str(TEST_SIZE),
        "--split_path", str(split_path),
        "--b1h_path", str(b1h_path)
    ]
    
    print(f"🖥️  Running B2-H (hazard) model for {rung}...")
    print(f"Command: {' '.join(cmd)}")
    
    # Use real-time output
    try:
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, 
                                 universal_newlines=True, bufsize=1)
        
        # Stream output in real-time
        for line in process.stdout:
            print(line.rstrip())
        
        process.wait()
        
        if process.returncode == 0:
            print(f"✅ B2-H model completed successfully for {rung}")
            return True
        else:
            print(f"❌ B2-H model failed for {rung} (return code: {process.returncode})")
            return False
            
    except Exception as e:
        print(f"❌ B2-H execution failed: {e}")
        return False

def run_single_rung(rung, include_hazard=False):
    """Run a complete experiment for a single rung."""
    start_time = time.time()
    
    print(f"\n{'='*60}")
    print(f"🚀 STARTING RUNG {rung}")
    print(f"{'='*60}")
    
    # Load progress
    progress = load_progress()
    
    # Check if already completed
    if is_rung_completed(rung, progress):
        print(f"✅ {rung} already completed - skipping")
        return True
    
    # Setup
    groupby = RUNG_SPECS[rung]
    output_dir = OUTPUT_DIR / rung
    output_dir.mkdir(parents=True, exist_ok=True)
    split_path = OUTPUT_DIR / "fixed_split.parquet"
    
    print(f"📋 Rung: {rung}")
    print(f"📋 Groupby: {groupby}")
    print(f"📁 Output directory: {output_dir}")
    print(f"⚡ Include hazard: {include_hazard or rung == 'R7'}")
    
    # Check resources before starting
    print(f"\n🔍 Checking system resources...")
    check_system_resources()
    
    try:
        # Run B1-H (routing model)
        print(f"\n--- 📊 Step 1: B1-H Model for {rung} ---")
        if not run_baseline1_hier(rung, groupby, output_dir, split_path):
            progress['failed_rungs'].append(rung)
            save_progress(progress)
            return False
        
        # Run B2-H (hazard model) if needed
        if include_hazard or rung == "R7":
            print(f"\n--- ⚡ Step 2: B2-H Model for {rung} ---")
            if not run_baseline2_hier(rung, groupby, output_dir, split_path):
                progress['failed_rungs'].append(rung)
                save_progress(progress)
                return False
        
        # Success!
        elapsed = time.time() - start_time
        print(f"\n🎉 {rung} COMPLETED SUCCESSFULLY in {elapsed:.1f} seconds ({elapsed/60:.1f} minutes)")
        
        # Update progress
        progress['completed_rungs'].append(rung)
        progress[f'{rung}_completed_at'] = datetime.now().isoformat()
        progress[f'{rung}_duration_seconds'] = elapsed
        save_progress(progress)
        
        # Clean up memory
        gc.collect()
        print(f"🧹 Memory cleanup completed")
        
        return True
        
    except Exception as e:
        print(f"\n💥 {rung} FAILED with exception: {e}")
        progress['failed_rungs'].append(rung)
        progress[f'{rung}_error'] = str(e)
        save_progress(progress)
        return False

print("✅ Helper functions defined with improved direct execution support")

✅ Helper functions defined with improved direct execution support


In [29]:
def add_quarter_column(subgroups_df):
    """
    Add quarter column to subgroups data if missing.
    
    This is needed for R6 and R7 experiments which use quarter in their groupby.
    Derives quarter from month: Q1=Jan-Mar, Q2=Apr-Jun, Q3=Jul-Sep, Q4=Oct-Dec
    """
    if 'quarter' in subgroups_df.columns:
        print("✓ Quarter column already exists")
        return subgroups_df
    
    if 'month' not in subgroups_df.columns:
        print("❌ Neither quarter nor month column found!")
        return subgroups_df
    
    print("🗓️  Adding quarter column from month data...")
    df = subgroups_df.copy()
    
    # Convert month to numeric if it's string
    month_numeric = pd.to_numeric(df['month'], errors='coerce')
    
    # Create quarter mapping: Q1=1-3, Q2=4-6, Q3=7-9, Q4=10-12
    quarter_map = {
        1: 'Q1', 2: 'Q1', 3: 'Q1',
        4: 'Q2', 5: 'Q2', 6: 'Q2', 
        7: 'Q3', 8: 'Q3', 9: 'Q3',
        10: 'Q4', 11: 'Q4', 12: 'Q4'
    }
    
    df['quarter'] = month_numeric.map(quarter_map).fillna('Unknown')
    
    print(f"✓ Quarter column added. Distribution:")
    quarter_counts = df['quarter'].value_counts().sort_index()
    for quarter, count in quarter_counts.items():
        print(f"   {quarter}: {count:,} respondents")
    
    return df

In [30]:
def run_baseline1_hier_direct(rung, groupby, output_dir, split_path):
    """Run baseline1_hier directly in Jupyter (preferred method)."""
    try:
        import pandas as pd
        import numpy as np
        from atus_analysis.scripts.common_hier import (
            prepare_long_with_groups, pool_rare_quarter,
            save_json, nll_b1, fit_b1_hier, parse_time_blocks
        )
        
        print(f"📊 Loading data for {rung}...")
        
        # Load sequences and subgroups
        sequences = pd.read_parquet(SEQUENCES_FILE)
        subgroups = pd.read_parquet(SUBGROUPS_FILE)
        
        # Add quarter column if needed for R6/R7 experiments
        subgroups = add_quarter_column(subgroups)
        
        print(f"✓ Loaded {len(sequences)} sequences and {len(subgroups)} subgroups")
        
        # Parse time blocks
        time_blocks = parse_time_blocks(TIME_BLOCKS)
        print(f"✓ Parsed time blocks: {time_blocks}")
        
        # Create or load split
        if split_path.exists():
            print(f"📂 Loading existing split from {split_path}")
            split_df = pd.read_parquet(split_path)
        else:
            print(f"🎲 Creating new split with seed {SEED}")
            # Create split logic here (simplified)
            np.random.seed(SEED)
            unique_ids = subgroups['TUCASEID'].unique()
            test_size = int(len(unique_ids) * TEST_SIZE)
            test_ids = np.random.choice(unique_ids, test_size, replace=False)
            
            split_df = pd.DataFrame({
                'TUCASEID': subgroups['TUCASEID'].unique(),
                'set': ['test' if id in test_ids else 'train' for id in subgroups['TUCASEID'].unique()]
            })
            split_df.to_parquet(split_path, index=False)
            print(f"✓ Split saved to {split_path}")
        
        # Prepare data with groups - fix the function call
        print(f"🔄 Preparing data with groupby: {groupby}")
        groupby_cols = groupby.split(',')
        
        # Call with correct signature including blocks parameter
        long_df = prepare_long_with_groups(sequences, subgroups, groupby_cols, time_blocks)
        
        # Pool rare quarters
        print(f"🔄 Pooling rare quarter groups...")
        long_df = pool_rare_quarter(long_df)
        
        # Merge with split
        print(f"🔄 Merging with train/test split...")
        long_df = long_df.merge(split_df, on='TUCASEID', how='left')
        
        print(f"📈 Fitting B1-H model...")
        # Fit the model
        result = fit_b1_hier(long_df)
        
        # Save results
        output_file = output_dir / "b1h_model.json"
        save_json(result, output_file)
        
        # Save evaluation
        eval_file = output_dir / "eval_b1h.json"
        test_data = long_df[long_df['set'] == 'test']
        eval_result = {
            'test_nll': nll_b1(result['params'], test_data),
            'n_test_sequences': len(test_data['TUCASEID'].unique()),
            'n_train_sequences': len(long_df[long_df['set'] == 'train']['TUCASEID'].unique())
        }
        save_json(eval_result, eval_file)
        
        print(f"✅ B1-H model completed successfully for {rung}")
        print(f"📁 Saved to {output_file}")
        return True
        
    except Exception as e:
        print(f"❌ Direct execution failed: {e}")
        print(f"🔧 Falling back to subprocess method...")
        return False

## Cell 3b: Import Baseline Scripts Directly

Instead of calling external processes, we'll import the baseline scripts directly for better integration with Jupyter.

In [31]:
# Import the baseline scripts directly instead of using subprocess
import sys
from pathlib import Path

# Add the project root to Python path so we can import the modules
project_root = Path('.').resolve()
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

try:
    # Import the baseline functions directly
    from atus_analysis.scripts.common_hier import (
        prepare_long_with_groups, pool_rare_quarter, 
        save_json, nll_b1, fit_b1_hier, parse_time_blocks
    )
    print("✓ Successfully imported baseline common functions")
    
    
except ImportError as e:
    print(f"⚠️  Could not import baseline functions directly: {e}")
    print("Will fall back to subprocess calls")
    USE_SUBPROCESS = True
else:
    USE_SUBPROCESS = False

✓ Successfully imported baseline common functions


## Check System Status and Prerequisites

In [32]:
print("Checking system status and prerequisites...\n")

# Check system resources
resources_ok = check_system_resources()

print("\n=== File Prerequisites ===")
# Check required files
required_files = [
    SEQUENCES_FILE,
    SUBGROUPS_FILE,
    "atus_analysis/scripts/baseline1_hier.py",
    "atus_analysis/scripts/baseline2_hier.py"
]

files_ok = True
for file_path in required_files:
    if Path(file_path).exists():
        print(f"✓ {file_path}")
    else:
        print(f"✗ {file_path} - NOT FOUND")
        files_ok = False

# Check output directory
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print(f"✓ Output directory: {OUTPUT_DIR}")

# Load and display current progress
print("\n=== Current Progress ===")
progress = load_progress()
completed = progress.get('completed_rungs', [])
failed = progress.get('failed_rungs', [])

print(f"Completed rungs: {completed if completed else 'None'}")
print(f"Failed rungs: {failed if failed else 'None'}")
print(f"Remaining rungs: {[r for r in RUNG_SPECS.keys() if r not in completed]}")

# Overall status
print("\n=== Overall Status ===")
if resources_ok and files_ok:
    print("✅ READY TO START EXPERIMENTS")
    print("\nYou can now run the individual experiment cells below.")
else:
    print("❌ PREREQUISITES NOT MET")
    if not resources_ok:
        print("   - System resources may be insufficient")
    if not files_ok:
        print("   - Required files are missing")
    print("\nPlease resolve issues before continuing.")

Checking system status and prerequisites...

=== System Resources ===
Memory: 1.2% used, 745.8GB available
CPU: 0.0% usage
Disk: 1941552.2GB free

✓ System resources look good

=== File Prerequisites ===
✓ atus_analysis/data/sequences/markov_sequences.parquet
✓ atus_analysis/data/processed/subgroups.parquet
✓ atus_analysis/scripts/baseline1_hier.py
✓ atus_analysis/scripts/baseline2_hier.py
✓ Output directory: atus_analysis/data/models

=== Current Progress ===
Completed rungs: ['R1', 'R2', 'R3', 'R4', 'R5']
Failed rungs: ['R1', 'R6', 'R6', 'R6']
Remaining rungs: ['R6', 'R7']

=== Overall Status ===
✅ READY TO START EXPERIMENTS

You can now run the individual experiment cells below.


# Individual Experiment Cells

## Instructions for Running Experiments

**Run these cells ONE AT A TIME** in order. Each cell represents one complete experiment rung.

- ✅ **Safe to interrupt**: You can stop any cell with the stop button - progress is automatically saved
- 🔄 **Resume anytime**: If you restart the kernel, just re-run cells 1-4, then continue from where you left off
- ⏭️ **Skip completed**: Cells will automatically skip rungs that have already completed successfully
- 📊 **Monitor progress**: Each cell shows detailed progress and resource usage

---

## Cell 5: Run Experiment R1 (Region Only)

**Expected runtime: 30-60 minutes**  
**Memory usage: Low-Medium**  
**Description: Simplest model - groups by region only**

In [6]:
# Run R1 experiment
success = run_single_rung("R1", include_hazard=False)

if success:
    print("\n R1 completed successfully! You can now run R2.")
else:
    print("\n❌ R1 failed. Check the error messages above and try again.")


🚀 STARTING RUNG R1
📋 Rung: R1
📋 Groupby: region
📁 Output directory: atus_analysis/data/models/R1
⚡ Include hazard: False

🔍 Checking system resources...
=== System Resources ===
Memory: 1.1% used, 746.5GB available
CPU: 0.0% usage
Disk: 1941548.8GB free

✓ System resources look good

--- 📊 Step 1: B1-H Model for R1 ---
🎯 Attempting direct execution...
📊 Loading data for R1...
✓ Loaded 36404352 sequences and 252808 subgroups
📂 Loading existing split from atus_analysis/data/models/fixed_split.parquet
🔄 Preparing data with groupby: region
❌ Direct execution failed: prepare_long_with_groups() missing 1 required positional argument: 'blocks'
🔄 Falling back to subprocess method...
🖥️  Using subprocess execution...
🖥️  Running B1-H via subprocess for R1...
Command: /sw/eb/sw/Anaconda3/2023.09-0/bin/python -m atus_analysis.scripts.baseline1_hier --sequences atus_analysis/data/sequences/markov_sequences.parquet --subgroups atus_analysis/data/processed/subgroups.parquet --out_dir atus_analysis/

## Cell 6: Run Experiment R2 (Region + Sex)

**Expected runtime: 30-60 minutes**  
**Memory usage: Low-Medium**  
**Description: Groups by region and sex**

In [7]:
# Run R2 experiment
success = run_single_rung("R2", include_hazard=False)

if success:
    print("\n🎉 R2 completed successfully! You can now run R3.")
else:
    print("\n❌ R2 failed. Check the error messages above and try again.")


🚀 STARTING RUNG R2
📋 Rung: R2
📋 Groupby: region,sex
📁 Output directory: atus_analysis/data/models/R2
⚡ Include hazard: False

🔍 Checking system resources...
=== System Resources ===
Memory: 1.2% used, 746.0GB available
CPU: 0.0% usage
Disk: 1941548.8GB free

✓ System resources look good

--- 📊 Step 1: B1-H Model for R2 ---
🎯 Attempting direct execution...
📊 Loading data for R2...
✓ Loaded 36404352 sequences and 252808 subgroups
📂 Loading existing split from atus_analysis/data/models/fixed_split.parquet
🔄 Preparing data with groupby: region,sex
❌ Direct execution failed: prepare_long_with_groups() missing 1 required positional argument: 'blocks'
🔄 Falling back to subprocess method...
🖥️  Using subprocess execution...
🖥️  Running B1-H via subprocess for R2...
Command: /sw/eb/sw/Anaconda3/2023.09-0/bin/python -m atus_analysis.scripts.baseline1_hier --sequences atus_analysis/data/sequences/markov_sequences.parquet --subgroups atus_analysis/data/processed/subgroups.parquet --out_dir atus_a

## Cell 7: Run Experiment R3 (Region + Employment)

**Expected runtime: 30-60 minutes**  
**Memory usage: Medium**  
**Description: Groups by region and employment status**

In [8]:
# Run R3 experiment
success = run_single_rung("R3", include_hazard=False)

if success:
    print("\n🎉 R3 completed successfully! You can now run R4.")
else:
    print("\n❌ R3 failed. Check the error messages above and try again.")


🚀 STARTING RUNG R3
📋 Rung: R3
📋 Groupby: region,employment
📁 Output directory: atus_analysis/data/models/R3
⚡ Include hazard: False

🔍 Checking system resources...
=== System Resources ===
Memory: 1.2% used, 745.9GB available
CPU: 0.0% usage
Disk: 1941548.8GB free

✓ System resources look good

--- 📊 Step 1: B1-H Model for R3 ---
🎯 Attempting direct execution...
📊 Loading data for R3...
✓ Loaded 36404352 sequences and 252808 subgroups
📂 Loading existing split from atus_analysis/data/models/fixed_split.parquet
🔄 Preparing data with groupby: region,employment
❌ Direct execution failed: prepare_long_with_groups() missing 1 required positional argument: 'blocks'
🔄 Falling back to subprocess method...
🖥️  Using subprocess execution...
🖥️  Running B1-H via subprocess for R3...
Command: /sw/eb/sw/Anaconda3/2023.09-0/bin/python -m atus_analysis.scripts.baseline1_hier --sequences atus_analysis/data/sequences/markov_sequences.parquet --subgroups atus_analysis/data/processed/subgroups.parquet --

## Cell 8: Run Experiment R4 (Region + Day Type)

**Expected runtime: 30-60 minutes**  
**Memory usage: Medium**  
**Description: Groups by region and day type (weekday/weekend)**

In [9]:
# Run R4 experiment
success = run_single_rung("R4", include_hazard=False)

if success:
    print("\n🎉 R4 completed successfully! You can now run R5.")
else:
    print("\n❌ R4 failed. Check the error messages above and try again.")


🚀 STARTING RUNG R4
📋 Rung: R4
📋 Groupby: region,day_type
📁 Output directory: atus_analysis/data/models/R4
⚡ Include hazard: False

🔍 Checking system resources...
=== System Resources ===
Memory: 1.2% used, 745.9GB available
CPU: 0.0% usage
Disk: 1941548.7GB free

✓ System resources look good

--- 📊 Step 1: B1-H Model for R4 ---
🎯 Attempting direct execution...
📊 Loading data for R4...
✓ Loaded 36404352 sequences and 252808 subgroups
📂 Loading existing split from atus_analysis/data/models/fixed_split.parquet
🔄 Preparing data with groupby: region,day_type
❌ Direct execution failed: prepare_long_with_groups() missing 1 required positional argument: 'blocks'
🔄 Falling back to subprocess method...
🖥️  Using subprocess execution...
🖥️  Running B1-H via subprocess for R4...
Command: /sw/eb/sw/Anaconda3/2023.09-0/bin/python -m atus_analysis.scripts.baseline1_hier --sequences atus_analysis/data/sequences/markov_sequences.parquet --subgroups atus_analysis/data/processed/subgroups.parquet --out_

## Cell 9: Run Experiment R5 (Region + Household Size)

**Expected runtime: 60-90 minutes**  
**Memory usage: Medium**  
**Description: Groups by region and household size band**

In [10]:
# Run R5 experiment
success = run_single_rung("R5", include_hazard=False)

if success:
    print("\n🎉 R5 completed successfully! You can now run R6.")
else:
    print("\n❌ R5 failed. Check the error messages above and try again.")


🚀 STARTING RUNG R5
📋 Rung: R5
📋 Groupby: region,hh_size_band
📁 Output directory: atus_analysis/data/models/R5
⚡ Include hazard: False

🔍 Checking system resources...
=== System Resources ===
Memory: 1.2% used, 745.9GB available
CPU: 0.0% usage
Disk: 1941548.6GB free

✓ System resources look good

--- 📊 Step 1: B1-H Model for R5 ---
🎯 Attempting direct execution...
📊 Loading data for R5...
✓ Loaded 36404352 sequences and 252808 subgroups
📂 Loading existing split from atus_analysis/data/models/fixed_split.parquet
🔄 Preparing data with groupby: region,hh_size_band
❌ Direct execution failed: prepare_long_with_groups() missing 1 required positional argument: 'blocks'
🔄 Falling back to subprocess method...
🖥️  Using subprocess execution...
🖥️  Running B1-H via subprocess for R5...
Command: /sw/eb/sw/Anaconda3/2023.09-0/bin/python -m atus_analysis.scripts.baseline1_hier --sequences atus_analysis/data/sequences/markov_sequences.parquet --subgroups atus_analysis/data/processed/subgroups.parque

## Cell 9b: Add Quarter Column to Subgroups File (Run Before R6)

**Important**: Run this cell before attempting R6 or R7 experiments.  
This will permanently add the quarter column to the subgroups.parquet file.  
**Runtime**: 1-2 minutes  
**Purpose**: Ensures R6 and R7 experiments have the required quarter column

In [35]:
# Add quarter column permanently to subgroups.parquet file
print("🗓️  Adding quarter column to subgroups.parquet file...")

try:
    # Load the current subgroups file
    subgroups_path = Path(SUBGROUPS_FILE)
    print(f"📂 Loading subgroups from: {subgroups_path}")
    
    if not subgroups_path.exists():
        print(f"❌ Subgroups file not found at {subgroups_path}")
        print("Please ensure the file exists before running this cell.")
    else:
        # Load the data
        subgroups_df = pd.read_parquet(subgroups_path)
        print(f"✅ Loaded {len(subgroups_df)} subgroup records")
        print(f"Current columns: {list(subgroups_df.columns)}")
        
        # Check if quarter column already exists
        if 'quarter' in subgroups_df.columns:
            print("✅ Quarter column already exists in the file!")
            quarter_counts = subgroups_df['quarter'].value_counts().sort_index()
            print("Current quarter distribution:")
            for quarter, count in quarter_counts.items():
                print(f"   {quarter}: {count:,} respondents")
        else:
            print("📝 Quarter column not found - adding it now...")
            
            # Add quarter column using our function
            subgroups_with_quarter = add_quarter_column(subgroups_df)
            
            # Create backup of original file
            backup_path = subgroups_path.with_suffix('.parquet.backup')
            print(f"💾 Creating backup at: {backup_path}")
            subgroups_df.to_parquet(backup_path, index=False)
            
            # Save the updated file
            print(f"💾 Saving updated subgroups with quarter column...")
            subgroups_with_quarter.to_parquet(subgroups_path, index=False)
            
            # Verify the save
            verification_df = pd.read_parquet(subgroups_path)
            if 'quarter' in verification_df.columns:
                print("✅ Quarter column successfully added to subgroups.parquet!")
                quarter_counts = verification_df['quarter'].value_counts().sort_index()
                print("Final quarter distribution:")
                for quarter, count in quarter_counts.items():
                    print(f"   {quarter}: {count:,} respondents")
                    
                print(f"\n📋 Summary:")
                print(f"   - Original file backed up to: {backup_path}")
                print(f"   - Updated file saved to: {subgroups_path}")
                print(f"   - Quarter column added with {len(verification_df)} records")
                print(f"   - R6 and R7 experiments are now ready to run!")
            else:
                print("❌ Failed to verify quarter column in saved file")
                
except Exception as e:
    print(f"❌ Error adding quarter column: {e}")
    print(f"Error type: {type(e).__name__}")
    print("\nYou can still run R6/R7 - the quarter column will be added dynamically.")

print("\n" + "="*60)
print("🎯 READY FOR R6 AND R7 EXPERIMENTS")
print("="*60)

🗓️  Adding quarter column to subgroups.parquet file...
📂 Loading subgroups from: atus_analysis/data/processed/subgroups.parquet
✅ Loaded 252808 subgroup records
Current columns: ['TUCASEID', 'sex', 'hh_size_band', 'month', 'region', 'employment', 'day_type', 'TUFNWGTP']
📝 Quarter column not found - adding it now...
🗓️  Adding quarter column from month data...
✓ Quarter column added. Distribution:
   Q1: 68,017 respondents
   Q2: 62,107 respondents
   Q3: 61,880 respondents
   Q4: 60,804 respondents
💾 Creating backup at: atus_analysis/data/processed/subgroups.parquet.backup
💾 Saving updated subgroups with quarter column...
✅ Quarter column successfully added to subgroups.parquet!
Final quarter distribution:
   Q1: 68,017 respondents
   Q2: 62,107 respondents
   Q3: 61,880 respondents
   Q4: 60,804 respondents

📋 Summary:
   - Original file backed up to: atus_analysis/data/processed/subgroups.parquet.backup
   - Updated file saved to: atus_analysis/data/processed/subgroups.parquet
   - Q

## Cell 10: Run Experiment R6 (Full Routing Model)

**Expected runtime: 90-120 minutes**  
**Memory usage: High**  
**Description: Full complexity routing model with all demographic variables**

In [36]:
# Run R6 experiment
success = run_single_rung("R6", include_hazard=False)

if success:
    print("\n🎉 R6 completed successfully! You can now run R7.")
else:
    print("\n❌ R6 failed. Check the error messages above and try again.")


🚀 STARTING RUNG R6
📋 Rung: R6
📋 Groupby: employment,day_type,hh_size_band,sex,region,quarter
📁 Output directory: atus_analysis/data/models/R6
⚡ Include hazard: False

🔍 Checking system resources...
=== System Resources ===
Memory: 1.2% used, 745.8GB available
CPU: 0.0% usage
Disk: 1941552.2GB free

✓ System resources look good

--- 📊 Step 1: B1-H Model for R6 ---
🎯 Attempting direct execution...
📊 Loading data for R6...
✓ Quarter column already exists
✓ Loaded 36404352 sequences and 252808 subgroups
✓ Parsed time blocks: [('night', 0, 5), ('morning', 6, 11), ('afternoon', 12, 17), ('evening', 18, 23)]
📂 Loading existing split from atus_analysis/data/models/fixed_split.parquet
🔄 Preparing data with groupby: employment,day_type,hh_size_band,sex,region,quarter
❌ Direct execution failed: prepare_long_with_groups() missing 1 required positional argument: 'blocks'
🔧 Falling back to subprocess method...
🖥️  Using subprocess execution...
🖥️  Running B1-H via subprocess for R6...
Command: /sw/

## Cell 11: Run Experiment R7 (Full Model + Hazard)

**Expected runtime: 120-180 minutes**  
**Memory usage: High**  
**Description: Full model including hazard modeling - most computationally intensive**

In [39]:
# Run R7 experiment (automatically includes hazard model)
success = run_single_rung("R7", include_hazard=True)

if success:
    print("\n🎉 R7 completed successfully! All experiments are now complete.")
else:
    print("\n❌ R7 failed. Check the error messages above and try again.")


🚀 STARTING RUNG R7
📋 Rung: R7
📋 Groupby: employment,day_type,hh_size_band,sex,region,quarter
📁 Output directory: atus_analysis/data/models/R7
⚡ Include hazard: True

🔍 Checking system resources...
=== System Resources ===
Memory: 1.2% used, 745.8GB available
CPU: 0.0% usage
Disk: 1941553.0GB free

✓ System resources look good

--- 📊 Step 1: B1-H Model for R7 ---
🎯 Attempting direct execution...
📊 Loading data for R7...
✓ Quarter column already exists
✓ Loaded 36404352 sequences and 252808 subgroups
✓ Parsed time blocks: [('night', 0, 5), ('morning', 6, 11), ('afternoon', 12, 17), ('evening', 18, 23)]
📂 Loading existing split from atus_analysis/data/models/fixed_split.parquet
🔄 Preparing data with groupby: employment,day_type,hh_size_band,sex,region,quarter
❌ Direct execution failed: prepare_long_with_groups() missing 1 required positional argument: 'blocks'
🔧 Falling back to subprocess method...
🖥️  Using subprocess execution...
🖥️  Running B1-H via subprocess for R7...
Command: /sw/e

## Cell 12: Final Summary and Results

In [41]:
# Generate final summary
print("\n" + "="*60)
print("FINAL EXPERIMENT SUMMARY")
print("="*60)

progress = load_progress()
completed = progress.get('completed_rungs', [])
failed = progress.get('failed_rungs', [])
all_rungs = list(RUNG_SPECS.keys())

print(f"\nTotal rungs: {len(all_rungs)}")
print(f"Completed: {len(completed)} - {completed}")
print(f"Failed: {len(failed)} - {failed}")
print(f"Not attempted: {[r for r in all_rungs if r not in completed and r not in failed]}")

# Show timing information
print("\n=== Timing Information ===")
total_time = 0
for rung in completed:
    duration_key = f'{rung}_duration_seconds'
    if duration_key in progress:
        duration = progress[duration_key]
        total_time += duration
        print(f"{rung}: {duration:.0f} seconds ({duration/60:.1f} minutes)")

if total_time > 0:
    print(f"\nTotal runtime: {total_time:.0f} seconds ({total_time/60:.1f} minutes, {total_time/3600:.1f} hours)")

# Show output locations
print("\n=== Output Files ===")
for rung in completed:
    rung_dir = OUTPUT_DIR / rung
    if rung_dir.exists():
        files = list(rung_dir.glob("*.json"))
        print(f"{rung}: {len(files)} files in {rung_dir}")

# Success rate
if len(completed) + len(failed) > 0:
    success_rate = len(completed) / 7 * 100
    print(f"\nOverall success rate: {success_rate:.1f}%")

if len(completed) == len(all_rungs):
    print("\n🎉 ALL EXPERIMENTS COMPLETED SUCCESSFULLY! 🎉")
elif len(failed) > 0:
    print(f"\n⚠️  Some experiments failed. You can re-run the failed cells to retry.")
else:
    print(f"\n📝 {len(all_rungs) - len(completed)} experiments remaining.")

print(f"\nProgress file saved as: {PROGRESS_FILE}")
print(f"Output directory: {OUTPUT_DIR}")


FINAL EXPERIMENT SUMMARY

Total rungs: 7
Completed: 7 - ['R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7']
Failed: 6 - ['R1', 'R6', 'R6', 'R6', 'R6', 'R7']
Not attempted: []

=== Timing Information ===
R1: 1094 seconds (18.2 minutes)
R2: 1537 seconds (25.6 minutes)
R3: 1525 seconds (25.4 minutes)
R4: 1528 seconds (25.5 minutes)
R5: 1525 seconds (25.4 minutes)
R6: 2185 seconds (36.4 minutes)
R7: 4735 seconds (78.9 minutes)

Total runtime: 14129 seconds (235.5 minutes, 3.9 hours)

=== Output Files ===
R1: 2 files in atus_analysis/data/models/R1
R2: 2 files in atus_analysis/data/models/R2
R3: 2 files in atus_analysis/data/models/R3
R4: 2 files in atus_analysis/data/models/R4
R5: 2 files in atus_analysis/data/models/R5
R6: 2 files in atus_analysis/data/models/R6
R7: 4 files in atus_analysis/data/models/R7

Overall success rate: 100.0%

🎉 ALL EXPERIMENTS COMPLETED SUCCESSFULLY! 🎉

Progress file saved as: experiment_progress_jupyter.json
Output directory: atus_analysis/data/models
