# Phase 2: Data Preprocessing & Longitudinal Transitions

Process the full PATH Study data (Waves 1-5) into a person-period dataset with quit outcomes.

## Objectives

1. Load Waves 1-5 of the PATH Adult Public Use Files (the Wave 6-7 files are checked for but not loaded)
2. Create longitudinal person-period structure
3. Define quit outcome: smoking status at wave t+1
4. Filter to baseline smokers with follow-up data
5. Apply feature engineering from Phase 3 (motivation + environment updates)
6. Normalize PATH negative missing codes (e.g., -9, -8, -1, plus an extended set)
7. Save compact dataset with canonical engineered features only
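
Objective 6 can be illustrated with a minimal sketch. It assumes that negative missing codes appear only in numeric columns and should become `NaN`. The function name `recode_missing` and the three-code tuple are illustrative; the extended code set mentioned above is not reproduced here, and the project's own recoding inside `engineer_all_features` may differ.

```python
import pandas as pd

# Illustrative subset only; the extended PATH code set is not listed here.
MISSING_CODES = (-9, -8, -1)

def recode_missing(df, codes=MISSING_CODES):
    """Return a copy in which the given codes become NaN in numeric columns."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    # mask() replaces values with NaN wherever the condition is True
    out[numeric_cols] = out[numeric_cols].mask(out[numeric_cols].isin(list(codes)))
    return out

# Toy data with made-up values
demo = pd.DataFrame({
    "cpd": [10, -9, 20],
    "sex": [1, 2, -8],
    "label": ["a", "b", "c"],
})
cleaned = recode_missing(demo)
print(cleaned)
```

Recoding before any arithmetic matters because a code such as -9 would otherwise be averaged as if it were a real measurement.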

## 1. Setup and Configuration

## Optional: Run Full Preprocessing Script

Run the standalone pipeline to regenerate the processed dataset with only canonical engineered features (no raw or alias columns):

```python
!python ../scripts/run_preprocessing.py
```

Reload and inspect basic shape:

```python
import pandas as pd
processed = pd.read_csv('../data/processed/pooled_transitions.csv')
print('Rows, Cols:', processed.shape)
print('Columns (first 25):', list(processed.columns[:25]))
```

In [1]:
# Import libraries
import sys
from pathlib import Path
import pandas as pd
import numpy as np
from tqdm import tqdm

# Add parent directory to path
sys.path.insert(0, str(Path.cwd().parent))

# Import feature engineering
from src.feature_engineering import engineer_all_features, map_from_codebook

print("✓ Libraries imported")

✓ Libraries imported


In [2]:
# Configuration
DATA_DIR = Path('../data/raw')
OUTPUT_DIR = Path('../data/processed')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)  # also create ../data if it is missing

# Wave files
WAVE_FILES = {
    1: DATA_DIR / 'PATH_W1_Adult_Public.dta',
    2: DATA_DIR / 'PATH_W2_Adult_Public.dta',
    3: DATA_DIR / 'PATH_W3_Adult_Public.dta',
    4: DATA_DIR / 'PATH_W4_Adult_Public.dta',
    5: DATA_DIR / 'PATH_W5_Adult_Public.dta',
    6: DATA_DIR / 'PATH_W6_Adult_Public.dta',
    7: DATA_DIR / 'PATH_W7_Adult_Public.dta'
}

# Check which files exist
print("Checking for PATH data files:")
for wave, path in WAVE_FILES.items():
    status = "✓" if path.exists() else "✗"
    print(f"  {status} Wave {wave}: {path.name}")

Checking for PATH data files:
  ✓ Wave 1: PATH_W1_Adult_Public.dta
  ✓ Wave 2: PATH_W2_Adult_Public.dta
  ✓ Wave 3: PATH_W3_Adult_Public.dta
  ✓ Wave 4: PATH_W4_Adult_Public.dta
  ✓ Wave 5: PATH_W5_Adult_Public.dta
  ✓ Wave 6: PATH_W6_Adult_Public.dta
  ✓ Wave 7: PATH_W7_Adult_Public.dta


## 2. Load Individual Waves

Load each wave and extract key smoking status variables for transition analysis.

In [3]:
# Key variables to track across waves for smoking status
# These patterns work for most waves (adjust if needed)
SMOKING_STATUS_PATTERNS = [
    'R0{wave}R_A_EVERSMOKE',      # Ever smoked
    'R0{wave}_AC1002',             # Smoked in past 30 days
    'R0{wave}_AC1003',             # Current smoking frequency (every day, some days, not at all)
    'R0{wave}R_A_CURRCIGUSE',      # Current cigarette use (derived)
    'R0{wave}R_A_EVERCIGUSE',      # Ever cigarette use (derived)
]

print("Variables to track smoking status transitions:")
for pattern in SMOKING_STATUS_PATTERNS:
    print(f"  - {pattern}")

Variables to track smoking status transitions:
  - R0{wave}R_A_EVERSMOKE
  - R0{wave}_AC1002
  - R0{wave}_AC1003
  - R0{wave}R_A_CURRCIGUSE
  - R0{wave}R_A_EVERCIGUSE
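
The `{wave}` placeholder in these patterns can be filled in with `str.format`. The sketch below expands the list for one wave and reports which names are present in a set of columns. The helper names `expand_patterns` and `find_available` are illustrative, and the column list is made up.

```python
SMOKING_STATUS_PATTERNS = [
    'R0{wave}R_A_EVERSMOKE',
    'R0{wave}_AC1002',
    'R0{wave}_AC1003',
    'R0{wave}R_A_CURRCIGUSE',
    'R0{wave}R_A_EVERCIGUSE',
]

def expand_patterns(patterns, wave):
    """Fill the {wave} placeholder for a single wave number."""
    return [p.format(wave=wave) for p in patterns]

def find_available(patterns, wave, columns):
    """Return the expanded variable names that exist in `columns`."""
    present = set(columns)
    return [name for name in expand_patterns(patterns, wave) if name in present]

wave2_names = expand_patterns(SMOKING_STATUS_PATTERNS, 2)
# Made-up column list standing in for a loaded wave's df.columns
found = find_available(SMOKING_STATUS_PATTERNS, 2, ['PERSONID', 'R02_AC1002', 'R02_AC1003'])
print(wave2_names)
print(found)
```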


In [4]:
def load_wave(wave_num, nrows=None):
    """
    Load a single wave of PATH data.
    
    Parameters
    ----------
    wave_num : int
        Wave number (a key of WAVE_FILES, 1-7)
    nrows : int, optional
        Number of rows to load (for testing)
    
    Returns
    -------
    pd.DataFrame
        Wave data with wave number added as column
    """
    path = WAVE_FILES[wave_num]
    
    if not path.exists():
        print(f"⚠️  Wave {wave_num} file not found: {path}")
        return None
    
    print(f"Loading Wave {wave_num}...", end=' ')
    
    # Load data - disable convert_categoricals to avoid duplicate label errors.
    # The context manager closes the file handle once the rows are read.
    with pd.read_stata(path, iterator=True, convert_categoricals=False) as reader:
        df = reader.read(nrows=nrows)
    
    # Add wave identifier
    df['wave'] = wave_num
    
    print(f"✓ {len(df):,} rows, {len(df.columns):,} columns")
    return df
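
Each wave has roughly two thousand columns, so for exploratory work on smoking status alone it may be enough to read a subset. The sketch below uses the `columns` argument of `pd.read_stata` on a small temporary file written for the demonstration. `load_columns` and the toy file are illustrative; the later feature-engineering step needs many more variables, so a real column list would have to include those as well.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_columns(path, columns):
    """Read only the named columns from a Stata file, keeping numeric codes."""
    return pd.read_stata(path, columns=columns, convert_categoricals=False)

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "toy_wave.dta"
    # Toy file with made-up values, standing in for a PATH wave file
    toy = pd.DataFrame({
        "R01_AC1002": [1, 2, 1],
        "R01_AC1003": [1, 3, 2],
        "OTHER_VAR": [5, 6, 7],
    })
    toy.to_stata(demo_path, write_index=False)
    subset = load_columns(demo_path, ["R01_AC1002", "R01_AC1003"])

print(subset)
```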

In [5]:
# Load Waves 1-5. For a quick test, set SAMPLE_SIZE = 1000 to read only the
# first 1,000 rows of each wave; use None for the full data.

SAMPLE_SIZE = None

print(f"Loading waves with sample_size={SAMPLE_SIZE}...\n")

waves_data = {}
for wave_num in range(1, 6):  # Waves 1-5 only; the Wave 6-7 files are not loaded
    df = load_wave(wave_num, nrows=SAMPLE_SIZE)
    if df is not None:
        waves_data[wave_num] = df

print(f"\n✓ Loaded {len(waves_data)} waves")

Loading waves with sample_size=None...

Loading Wave 1... ✓ 32,320 rows, 1,743 columns
Loading Wave 2... ✓ 28,362 rows, 2,209 columns
Loading Wave 3... ✓ 28,148 rows, 2,141 columns
Loading Wave 4... ✓ 33,822 rows, 2,182 columns
Loading Wave 5... ✓ 34,309 rows, 2,316 columns

✓ Loaded 5 waves


## 3. Identify Baseline Smokers

For each wave, identify current smokers who could potentially quit by the next wave.

In [6]:
def identify_current_smokers(df, wave_num):
    """
    Identify current smokers in a given wave.
    
    Current smoker definition:
    - Smoked in past 30 days (R0X_AC1002 = 1 "Yes")
    - OR smoking frequency is "Every day" or "Some days" (R0X_AC1003 = 1 or 2)
    
    Parameters
    ----------
    df : pd.DataFrame
        Wave data
    wave_num : int
        Wave number for variable names
    
    Returns
    -------
    pd.Series
        Boolean series indicating current smokers
    """
    # Variable names for this wave
    smoked_30d = f'R0{wave_num}_AC1002'  # Past 30 day smoking
    freq_var = f'R0{wave_num}_AC1003'     # Smoking frequency
    
    # Extract numeric codes from categorical variables
    from src.feature_engineering import _extract_numeric_code
    
    is_smoker = pd.Series(False, index=df.index)
    
    # Check if variables exist
    if smoked_30d in df.columns:
        smoked_code = _extract_numeric_code(df[smoked_30d])
        is_smoker |= (smoked_code == 1)  # 1 = Yes
    
    if freq_var in df.columns:
        freq_code = _extract_numeric_code(df[freq_var])
        is_smoker |= (freq_code.isin([1, 2]))  # 1 = Every day, 2 = Some days
    
    return is_smoker
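
The OR rule in the docstring can be checked on a toy table. This sketch substitutes `pd.to_numeric` for the project's `_extract_numeric_code`, which is an assumption about what that helper returns; `flag_current_smokers` and the data values are illustrative.

```python
import pandas as pd

def flag_current_smokers(past30, freq):
    """Toy version of the rule: past-30-day == 1 OR frequency in {1, 2}."""
    past30 = pd.to_numeric(past30, errors="coerce")
    freq = pd.to_numeric(freq, errors="coerce")
    return (past30 == 1) | freq.isin([1, 2])

# Made-up codes; None stands for a missing answer
demo = pd.DataFrame({
    "R01_AC1002": [1, 2, None, 2, None],
    "R01_AC1003": [None, 3, 2, 1, None],
})
demo["is_current_smoker"] = flag_current_smokers(demo["R01_AC1002"], demo["R01_AC1003"])
print(demo)
```

The last row has both answers missing and is flagged `False`, the same value given to an observed non-smoker.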

In [7]:
# Count current smokers in each wave
print("Current smokers by wave:")
print("="*50)

smoker_counts = {}
for wave_num, df in waves_data.items():
    is_smoker = identify_current_smokers(df, wave_num)
    n_smokers = is_smoker.sum()
    pct_smokers = 100 * n_smokers / len(df)
    
    smoker_counts[wave_num] = n_smokers
    print(f"Wave {wave_num}: {n_smokers:>6,} / {len(df):>6,} ({pct_smokers:>5.1f}%)")
    
    # Store flag in dataframe
    waves_data[wave_num]['is_current_smoker'] = is_smoker

print("="*50)

Current smokers by wave:
Wave 1: 25,183 / 32,320 ( 77.9%)
Wave 2: 10,722 / 28,362 ( 37.8%)
Wave 3:  9,817 / 28,148 ( 34.9%)
Wave 4: 10,967 / 33,822 ( 32.4%)
Wave 5:  9,705 / 34,309 ( 28.3%)


## 4. Create Person-Period Transitions

For each person who is a smoker at wave t, create a record with:
- Baseline features from wave t
- Outcome (quit_success) from wave t+1
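
A toy example of this record structure, with made-up IDs and values: person `c` smokes at wave t but has no follow-up row, so the inner merge drops them, and person `d` is not a baseline smoker, so they never enter the table.

```python
import pandas as pd

# Baseline wave t (made-up data)
wave_t = pd.DataFrame({
    "PERSONID": ["a", "b", "c", "d"],
    "is_current_smoker": [True, True, True, False],
    "cpd": [10, 20, 5, 0],
})
# Follow-up wave t+1 (person c was not re-interviewed)
wave_t1 = pd.DataFrame({
    "PERSONID": ["a", "b", "d"],
    "is_current_smoker": [True, False, False],
})

smokers_t = wave_t[wave_t["is_current_smoker"]]
transitions = smokers_t.merge(
    wave_t1[["PERSONID", "is_current_smoker"]],
    on="PERSONID",
    how="inner",            # keep only people observed at both waves
    suffixes=("", "_t1"),   # follow-up status becomes is_current_smoker_t1
)
transitions["quit_success"] = (~transitions["is_current_smoker_t1"]).astype(int)
print(transitions)
```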

In [8]:
def create_transitions(wave_t_data, wave_t1_data, wave_t, wave_t1):
    """
    Create transition records from wave t to wave t+1.
    
    Parameters
    ----------
    wave_t_data : pd.DataFrame
        Baseline wave data
    wave_t1_data : pd.DataFrame
        Follow-up wave data
    wave_t : int
        Baseline wave number
    wave_t1 : int
        Follow-up wave number
    
    Returns
    -------
    pd.DataFrame
        Transition records with baseline features and quit outcome
    """
    print(f"\nCreating transitions: Wave {wave_t} → Wave {wave_t1}")
    
    # Get smokers at baseline
    smokers_t = wave_t_data[wave_t_data['is_current_smoker']].copy()
    print(f"  Baseline smokers: {len(smokers_t):,}")
    
    # Merge with follow-up data on PERSONID
    transitions = smokers_t.merge(
        wave_t1_data[['PERSONID', 'is_current_smoker']],
        on='PERSONID',
        how='inner',
        suffixes=('', '_t1')
    )
    
    print(f"  With follow-up data: {len(transitions):,}")
    
    # Define quit success: was smoking at t, not smoking at t+1
    transitions['quit_success'] = (~transitions['is_current_smoker_t1']).astype(int)
    
    # Add transition info
    transitions['baseline_wave'] = wave_t
    transitions['followup_wave'] = wave_t1
    transitions['transition'] = f'W{wave_t}→W{wave_t1}'
    
    # Calculate quit rate
    quit_rate = 100 * transitions['quit_success'].mean()
    print(f"  Quit rate: {quit_rate:.1f}%")
    
    return transitions
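
One caveat with this definition: `is_current_smoker` is `False` both for observed non-smokers and for respondents whose smoking items are missing, so a follow-up row with no usable answers is counted as a quit. Whether that affects many PATH respondents has not been checked here. The sketch below shows one possible alternative, a nullable status that is `<NA>` when both items are missing, so such rows can be excluded before the outcome is computed. The function name and toy columns are hypothetical.

```python
import pandas as pd

def smoker_status_nullable(past30, freq):
    """True/False where status is observed, <NA> where both items are missing."""
    past30 = pd.to_numeric(past30, errors="coerce")
    freq = pd.to_numeric(freq, errors="coerce")
    status = ((past30 == 1) | freq.isin([1, 2])).astype("boolean")
    return status.mask(past30.isna() & freq.isna())

# Made-up follow-up data; person c answered neither item
followup = pd.DataFrame({
    "PERSONID": ["a", "b", "c"],
    "past30": [1, 2, None],
    "freq": [None, 3, None],
})
followup["smoker_t1"] = smoker_status_nullable(followup["past30"], followup["freq"])

# Keep only rows with a known follow-up status before defining the outcome
known = followup.dropna(subset=["smoker_t1"])
quit_success = (~known["smoker_t1"]).astype(int)
print(followup)
print(quit_success.tolist())
```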

In [9]:
# Create transitions for all consecutive wave pairs
print("Creating person-period transitions...")
print("="*70)

all_transitions = []

for wave_t in range(1, 5):  # Waves 1-4 (need t+1 for outcome)
    wave_t1 = wave_t + 1
    
    if wave_t in waves_data and wave_t1 in waves_data:
        transitions = create_transitions(
            waves_data[wave_t],
            waves_data[wave_t1],
            wave_t,
            wave_t1
        )
        all_transitions.append(transitions)

# Pool all transitions
if all_transitions:
    pooled = pd.concat(all_transitions, ignore_index=True)
    print("\n" + "="*70)
    print(f"✓ Total transitions: {len(pooled):,}")
    print(f"✓ Overall quit rate: {100 * pooled['quit_success'].mean():.1f}%")
    print(f"✓ Unique persons: {pooled['PERSONID'].nunique():,}")
else:
    print("⚠️  No transitions created")
    pooled = None

Creating person-period transitions...

Creating transitions: Wave 1 → Wave 2
  Baseline smokers: 25,183
  With follow-up data: 20,656
  Quit rate: 50.2%

Creating transitions: Wave 2 → Wave 3
  Baseline smokers: 10,722
  With follow-up data: 9,504
  Quit rate: 12.6%

Creating transitions: Wave 3 → Wave 4
  Baseline smokers: 9,817
  With follow-up data: 8,618
  Quit rate: 12.2%

Creating transitions: Wave 4 → Wave 5
  Baseline smokers: 10,967
  With follow-up data: 9,104
  Quit rate: 17.7%

✓ Total transitions: 47,882
✓ Overall quit rate: 29.7%
✓ Unique persons: 23,411

## 5. Apply Feature Engineering

Apply the Phase 3 feature-engineering pipeline to the pooled transitions.

In [10]:
# Codebook overrides from Phase 3
codebook_overrides = {
    # Demographics
    'age': 'R01R_A_AGECAT7',  # Use wave-specific version (adjust for each wave)
    'sex': 'R01R_A_SEX',
    'income': 'R01R_POVCAT3',
    'education_code': None,  # Not available
    
    # Race/Ethnicity
    'race': 'R01R_A_RACECAT3',
    'hispanic': 'R01R_A_HISP',
    'race_map': {1: 'White', 2: 'Black', 3: 'Other'},
    'hisp_yes_values': (1,),
    'race_collapse_to_other': (),
    
    # Smoking behavior
    'cpd': 'R01R_A_PERDAY_P30D_CIGS',
    'ttfc_minutes': 'R01R_A_MINFIRST_CIGS',
    
    # Cessation methods
    'nrt_any': 'R01R_A_PST12M_LSTQUIT_NRT',
    'varenicline': 'R01R_A_PST12M_LSTQUIT_RX',
}

print("Codebook overrides configured")

Codebook overrides configured


In [11]:
def adjust_overrides_for_wave(overrides, wave_num):
    """
    Adjust variable names for specific wave.
    Replace R01R_ prefix with R0{wave}R_.
    """
    adjusted = {}
    for key, value in overrides.items():
        if isinstance(value, str) and value.startswith('R01R_'):
            adjusted[key] = value.replace('R01R_', f'R0{wave_num}R_')
        elif isinstance(value, str) and value.startswith('R01_'):
            adjusted[key] = value.replace('R01_', f'R0{wave_num}_')
        else:
            adjusted[key] = value
    return adjusted

# Test
wave1_overrides = adjust_overrides_for_wave(codebook_overrides, 1)
wave2_overrides = adjust_overrides_for_wave(codebook_overrides, 2)

print("Wave-specific variable adjustment function ready")

Wave-specific variable adjustment function ready
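
`adjust_overrides_for_wave` relies on `str.replace`, which rewrites every occurrence of the prefix in a name, not only the leading one. For the names in `codebook_overrides` the result is the same, but a regex anchored at the start of the string makes the intent explicit. The sketch below is one such variant; `retarget_variable`, `retarget_overrides`, and the `smoked_30d` key are hypothetical and not part of the project.

```python
import re

# Matches a leading wave prefix such as "R01R_" or "R01_"
_WAVE_PREFIX = re.compile(r"^R0\d(R?_)")

def retarget_variable(name, wave_num):
    """Rewrite only the leading wave prefix of a PATH variable name."""
    return _WAVE_PREFIX.sub(lambda m: f"R0{wave_num}{m.group(1)}", name)

def retarget_overrides(overrides, wave_num):
    """Apply retarget_variable to string values; pass other values through."""
    return {
        key: retarget_variable(value, wave_num) if isinstance(value, str) else value
        for key, value in overrides.items()
    }

sample = {
    "age": "R01R_A_AGECAT7",
    "smoked_30d": "R01_AC1002",      # hypothetical key, for illustration
    "education_code": None,
    "race_map": {1: "White", 2: "Black", 3: "Other"},
}
wave3 = retarget_overrides(sample, 3)
print(wave3)
```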


In [12]:
if pooled is not None:
    print("Applying feature engineering to pooled transitions...")
    print("="*70)
    
    # Limitation: the overrides below use Wave 1 variable names (R01...) for every
    # row. W1→W2 is the largest single group of transitions, but not the majority.
    # A more robust approach is to engineer features separately per baseline wave
    # (using adjust_overrides_for_wave) and then pool the results.
    
    print("Transition distribution:")
    print(pooled['transition'].value_counts())
    print()
    
    # Engineer features using Wave 1 variable names
    wave1_overrides = adjust_overrides_for_wave(codebook_overrides, 1)
    
    print("Running feature engineering...")
    engineered = engineer_all_features(
        pooled.copy(),
        codebook_overrides=wave1_overrides,
        recode_missing=True
    )
    
    print(f"\n✓ Features created: {engineered.shape[1]} columns")
    print(f"✓ Records: {len(engineered):,}")
else:
    print("⚠️  No data to engineer features")
    engineered = None

Applying feature engineering to pooled transitions...
Transition distribution:
transition
W1→W2    20656
W2→W3     9504
W4→W5     9104
W3→W4     8618
Name: count, dtype: int64

Running feature engineering...

✓ Features created: 8341 columns
✓ Records: 47,882
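
The per-wave approach mentioned in the code comments could look roughly like the sketch below: split the pooled table by `baseline_wave`, build wave-specific overrides for each group, engineer each group, then concatenate. The `toy_overrides` and `toy_engineer` functions are stand-ins written for this example. The real pipeline would pass `adjust_overrides_for_wave` and `engineer_all_features`, whose behavior on per-wave subsets is not verified here.

```python
import pandas as pd

def engineer_per_wave(pooled, overrides_for_wave, engineer_fn):
    """Engineer each baseline wave with its own variable names, then pool."""
    parts = []
    for wave, group in pooled.groupby("baseline_wave"):
        wave_overrides = overrides_for_wave(int(wave))
        parts.append(engineer_fn(group.copy(), wave_overrides))
    return pd.concat(parts, ignore_index=True)

def toy_overrides(wave):
    # Stand-in for adjust_overrides_for_wave(codebook_overrides, wave)
    return {"sex": f"R0{wave}R_A_SEX"}

def toy_engineer(df, overrides):
    # Stand-in for engineer_all_features: copy one raw column to a canonical name
    df["sex"] = df[overrides["sex"]]
    return df[["PERSONID", "baseline_wave", "sex"]]

# Made-up pooled data: each row only has values in its own wave's column
pooled_demo = pd.DataFrame({
    "PERSONID": ["a", "b", "c"],
    "baseline_wave": [1, 2, 2],
    "R01R_A_SEX": [1.0, None, None],
    "R02R_A_SEX": [None, 2.0, 1.0],
})
result = engineer_per_wave(pooled_demo, toy_overrides, toy_engineer)
print(result)
```

In the toy data, reading `R01R_A_SEX` for every row would leave `sex` missing for the two Wave 2 baselines, which is the failure mode the per-wave split is meant to avoid.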



## 6. Validate and Summarize Dataset

In [13]:
if engineered is not None:
    print("Data Quality Summary")
    print("="*70)
    
    # Overall shape
    print(f"\nDataset shape: {engineered.shape[0]:,} rows × {engineered.shape[1]:,} columns")
    
    # Outcome distribution
    print(f"\nOutcome (quit_success):")
    print(engineered['quit_success'].value_counts())
    print(f"Quit rate: {100 * engineered['quit_success'].mean():.1f}%")
    
    # Missing data in key features
    from src.feature_engineering import get_feature_list
    feature_cols = get_feature_list()
    available_features = [f for f in feature_cols if f in engineered.columns]
    
    print(f"\nFeature availability: {len(available_features)}/{len(feature_cols)}")
    
    # Check missing rates for key features
    print("\nMissing rates for key features:")
    key_features = ['age', 'sex', 'cpd', 'ttfc_minutes', 'high_dependence', 
                    'race_white', 'used_nrt', 'used_varenicline']
    
    for feat in key_features:
        if feat in engineered.columns:
            missing_pct = 100 * engineered[feat].isna().mean()
            print(f"  {feat:20s}: {missing_pct:>5.1f}% missing")

Data Quality Summary

Dataset shape: 47,882 rows × 8,341 columns

Outcome (quit_success):
quit_success
0    33662
1    14220
Name: count, dtype: int64
Quit rate: 29.7%

Feature availability: 52/52

Missing rates for key features:
  age                 :   0.0% missing
  sex                 :   0.1% missing
  cpd                 :  77.7% missing
  ttfc_minutes        :  26.4% missing
  high_dependence     :   0.0% missing
  race_white          :   0.0% missing
  used_nrt            :   0.0% missing
  used_varenicline    :   0.0% missing
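
As a consistency check on the figures reported in this notebook, the pooled quit rate should equal the count-weighted mean of the per-transition quit rates. The sketch below redoes that arithmetic with the counts and rates printed in the outputs. The per-transition rates are rounded to one decimal, so agreement is only expected to the same precision.

```python
# (transitions with follow-up, quit rate in %) as printed in the notebook outputs
transitions = {
    "W1→W2": (20656, 50.2),
    "W2→W3": (9504, 12.6),
    "W3→W4": (8618, 12.2),
    "W4→W5": (9104, 17.7),
}

total = sum(n for n, _ in transitions.values())
weighted_rate = sum(n * rate for n, rate in transitions.values()) / total

# Outcome counts from value_counts(): 33,662 zeros and 14,220 ones
rate_from_counts = 100 * 14220 / (33662 + 14220)

print(f"Total transitions: {total:,}")
print(f"Weighted quit rate: {weighted_rate:.1f}%")
print(f"Quit rate from outcome counts: {rate_from_counts:.1f}%")
```

Both routes give 29.7%. The W1→W2 transition contributes the most rows and has a quit rate of 50.2%, far above the 12-18% of the later transitions, so it pulls the pooled rate up.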


## 7. Save Final Dataset

In [14]:
if engineered is not None:
    from src.feature_engineering import get_feature_list
    feature_cols = get_feature_list()

    # Ensure stable schema: add any missing engineered feature columns with zeros
    missing_features = [f for f in feature_cols if f not in engineered.columns]
    if missing_features:
        print(f"Adding {len(missing_features)} missing feature column(s) with zeros:")
        for col in missing_features:
            engineered[col] = 0
            print(f"  • {col}")

    # Only keep identifiers + canonical engineered features (no raw or alias columns)
    modeling_cols = ['PERSONID', 'baseline_wave', 'followup_wave', 'transition', 'quit_success'] + feature_cols
    existing_cols = [c for c in modeling_cols if c in engineered.columns]
    modeling_data = engineered[existing_cols].copy()

    csv_path = OUTPUT_DIR / 'pooled_transitions.csv'
    parquet_path = OUTPUT_DIR / 'pooled_transitions.parquet'

    print("Saving compact dataset with canonical features only...")
    modeling_data.to_csv(csv_path, index=False)
    modeling_data.to_parquet(parquet_path, index=False)

    print(f"\n✓ Saved: {csv_path}")
    print(f"✓ Saved: {parquet_path}")
    print(f"Final dataset: {len(modeling_data):,} rows × {len(modeling_data.columns):,} columns")
else:
    print("⚠️  No data to save")

Saving compact dataset with canonical features only...

✓ Saved: ../data/processed/pooled_transitions.csv
✓ Saved: ../data/processed/pooled_transitions.parquet
Final dataset: 47,882 rows × 57 columns
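
Before the saved file is used for modeling, a few structural checks can catch merge problems early. This is a minimal sketch, assuming each `(PERSONID, baseline_wave)` pair should appear once and that `quit_success` is coded 0/1. `check_person_period` and the toy rows are illustrative.

```python
import pandas as pd

def check_person_period(df, keys=("PERSONID", "baseline_wave")):
    """Summarize duplicate person-period keys and out-of-range outcomes."""
    return {
        "rows": len(df),
        "duplicate_keys": int(df.duplicated(list(keys)).sum()),
        "invalid_outcomes": int((~df["quit_success"].isin([0, 1])).sum()),
    }

# Made-up rows: person "a" appears twice for baseline wave 1
demo = pd.DataFrame({
    "PERSONID": ["a", "a", "b", "a"],
    "baseline_wave": [1, 1, 1, 2],
    "quit_success": [0, 0, 1, 1],
})
report = check_person_period(demo)
print(report)
```

With the real data, the same call could be run on the reloaded `pooled_transitions.csv`; a nonzero `duplicate_keys` would suggest that `PERSONID` is not unique within a wave file.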

