# Demo: MVP Conditional Expected Values with Linear Regression

This demo shows how to use the MVP conditional expected values estimator in a linear regression pipeline with cross-validation.

**Key Features:**
- Flight lineage features (already applied in `split.py` when folds are created)
- Conditional expected values (computed on-the-fly from training data)
- Simple linear regression model
- Cross-validation with automatic fold handling

**Note:** Flight lineage features are pre-applied to all folds by `split.py`, so they don't need to be added in the pipeline.

**Naming Convention:**
- **"crs"** prefix = Raw data columns from source (or direct LAGs of raw data)
  - Examples: `crs_elapsed_time`, `prev_flight_crs_elapsed_time`
- **"scheduled"** prefix = Engineered/computed features derived from CRS data
  - Examples: `scheduled_lineage_rotation_time_minutes`, `prev_flight_scheduled_flight_time_minutes`

**Label and Prediction Handling:**
- **Log Transform**: DEP_DELAY is transformed using `SIGN(DEP_DELAY) * LOG(ABS(DEP_DELAY) + 1.0)` to handle the right-skewed distribution while preserving sign
- **Inverse Transform**: Predictions are transformed back using `SIGN(pred) * (EXP(ABS(pred)) - 1)` and floored to 0
- This approach:
  - Better handles the skewed delay distribution
  - Preserves information about negative delays (early departures) during training
  - Ensures final predictions are non-negative (matches hypothesis: `departure_delay ≈ max(0, required_time - rotation_time)`)
  - Prevents penalizing the model for "under-predicting" negative delays
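
The label transform and its inverse can be checked outside Spark. The following is a minimal pure-Python sketch of the two SQL expressions above; the function names `signed_log` and `inverse_signed_log` are illustrative and do not appear in the repo.

```python
import math


def signed_log(delay_minutes: float) -> float:
    """Forward transform: SIGN(x) * LOG(ABS(x) + 1)."""
    return math.copysign(math.log1p(abs(delay_minutes)), delay_minutes)


def inverse_signed_log(pred_log: float, floor_at_zero: bool = True) -> float:
    """Inverse transform: SIGN(p) * (EXP(ABS(p)) - 1), optionally floored at 0."""
    value = math.copysign(math.expm1(abs(pred_log)), pred_log)
    return max(value, 0.0) if floor_at_zero else value


# Round-trip a few illustrative delays (in minutes): early, on time, late.
for delay in (-10.0, 0.0, 45.0):
    y = signed_log(delay)
    unfloored = inverse_signed_log(y, floor_at_zero=False)
    floored = inverse_signed_log(y)
    print(f"delay={delay:6.1f}  log={y:8.4f}  inverse={unfloored:8.4f}  floored={floored:8.4f}")
```

Without the floor, the inverse recovers the original delay up to floating-point error. With the floor, an early departure maps to 0, which is the behavior the final pipeline stage applies.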

## Dependencies

In [0]:
import sys
import pandas as pd

# Load modules from our Databricks repo
import importlib.util

# CV module
cv_path = "/Workspace/Shared/Team 4_2/flight-departure-delay-predictive-modeling/notebooks/Cross Validator/cv.py"
spec = importlib.util.spec_from_file_location("cv", cv_path)
cv = importlib.util.module_from_spec(spec)
spec.loader.exec_module(cv)

# Conditional expected values MVP estimator
cond_exp_mvp_path = "/Workspace/Shared/Team 4_2/flight-departure-delay-predictive-modeling/notebooks/Feature Engineering/flight_lineage/mvp_simple/conditional_expected_values_mvp.py"
spec = importlib.util.spec_from_file_location("conditional_expected_values_mvp", cond_exp_mvp_path)
conditional_expected_values_mvp = importlib.util.module_from_spec(spec)
spec.loader.exec_module(conditional_expected_values_mvp)

from pyspark.sql import functions as F
from pyspark.ml.feature import Imputer, StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler, SQLTransformer
from pyspark.ml.regression import LinearRegression
from pyspark.ml import Pipeline
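
The two module loads above repeat the same three `importlib` calls. A small helper could wrap that pattern; the sketch below is one possible shape, with `load_module_from_path` as a hypothetical name and a throwaway temp file standing in for the real workspace paths.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(module_name: str, file_path: str):
    """Load a Python source file as a module object without touching sys.path."""
    spec = importlib.util.spec_from_file_location(module_name, file_path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load {module_name!r} from {file_path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Self-contained demonstration with a temporary file (a stand-in for cv.py).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_module.py"
    demo_path.write_text("ANSWER = 42\n")
    demo = load_module_from_path("demo_module", str(demo_path))

print(demo.ANSWER)  # 42
```

With such a helper, each load in the cell above would reduce to a single call, for example `cv = load_module_from_path("cv", cv_path)`.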

## Define Features

In [0]:
categorical_features = [
    # 'day_of_week',
    # 'op_carrier',
    # 'origin',

    # Binary flag: True when required_time > rotation_time (impossible to depart on time)
    # Non-linear indicator that captures "impossible on-time" scenarios
    # If True: Not enough time between previous departure and scheduled departure
    # Always data leakage-free
    'safe_impossible_on_time_flag',
]

# Raw numerical features (pared back to test hypothesis)
raw_numerical_features = [
    # 'hourlyprecipitation',
    # 'hourlysealevelpressure',
    # 'hourlyaltimetersetting',
    # 'hourlywetbulbtemperature',
    # 'hourlystationpressure',
    # 'crs_elapsed_time',
    # 'distance',
    # ============================================================================
    # Flight Lineage Features (pre-applied in split.py)
    # ============================================================================
    
    # Cumulative count of recorded flights for this tail_num across all time in dataset
    # Rank 1 = first recorded flight for this aircraft (plane already at airport, large buffer)
    # Higher ranks = later flights in aircraft's history (may have accumulated delays from previous flights)
    # Note: This is NOT per-day; it's cumulative across the entire dataset
    'lineage_rank',
    
    # Safe rotation time: Time from previous departure to current scheduled departure
    # Handles data leakage intelligently:
    #   - If prev flight already departed: use actual departure time
    #   - If prev flight hasn't departed yet: use scheduled time or "right now" (prediction cutoff)
    # Always data leakage-free, safe for training
    'safe_lineage_rotation_time_minutes',
    
    # Scheduled rotation time: Time from previous scheduled departure to current scheduled departure
    # Entire scheduled sequence: prev_crs_dep → flight → arrival → turnover → curr_crs_dep
    # Always data leakage-free (uses scheduled times only)
    'scheduled_lineage_rotation_time_minutes',
    
    # Scheduled turnover time: Time from previous scheduled arrival to current scheduled departure
    # Ground time between flights (component of rotation time)
    # Rotation Time = Air Time + Turnover Time
    # Always data leakage-free (uses scheduled times only)
    'scheduled_lineage_turnover_time_minutes',
    
    # Previous flight's scheduled flight duration (air time)
    # Computed as: prev_flight_crs_arr_time - prev_flight_crs_dep_time
    # Hypothesis: When rotation_time >> scheduled_flight_time, we have more buffer = likely to depart on time
    # Imputed to 0.0 for first flights or jumps
    'prev_flight_scheduled_flight_time_minutes',
    
    # Required time: Expected air_time + expected_turnover_time (hypothesis feature)
    # Computed as: prev_flight_crs_elapsed_time + scheduled_lineage_turnover_time_minutes
    # Note: At flight lineage computation time (split.py), conditional expected values don't exist yet
    # The function CAN use conditional expected values if available (from ConditionalExpectedValuesEstimator),
    # but when computed during lineage feature engineering, it falls back to scheduled times
    # Core hypothesis: departure_delay ≈ max(0, required_time - rotation_time)
    # Always data leakage-free (uses scheduled times which are known in advance)
    'safe_required_time_prev_flight_minutes',
        
    # Current flight's scheduled elapsed time (air time)
    # Helps capture cascading delay effects - longer flights may have more variability
    'crs_elapsed_time',
    # ============================================================================
    # Conditional Expected Values (added by ConditionalExpectedValuesMVPEstimator)
    # ============================================================================
    
    # Expected air time: Average historical air time for route (origin → dest)
    # Non-temporal baseline (all-time average)
    # Captures route-specific characteristics (distance, typical weather patterns)
    'expected_air_time_route_minutes',
    
    # Expected air time: Average historical air time for route × month
    # Temporal baseline (monthly average)
    # Captures seasonal effects (jet streams, weather patterns vary by month)
    # Falls back to non-temporal route average if month data insufficient
    'expected_air_time_route_month_minutes',
    
    # Expected turnover time: Average historical turnover time for carrier × airport
    # Non-temporal baseline (all-time average)
    # Captures carrier-specific operational efficiency at specific airports
    'expected_turnover_time_carrier_airport_minutes',
    
    # Expected turnover time: Average historical turnover time for carrier × airport × month
    # Temporal baseline (monthly average)
    # Captures seasonal operational patterns (holidays, weather, seasonal demand)
    # Falls back to carrier-airport → airport-month → airport → 0.0
    'expected_turnover_time_carrier_airport_month_minutes',
    
    # Expected turnover time: Average historical turnover time for airport (all carriers)
    # Non-temporal baseline, airport-level fallback
    # Used when carrier-airport data is sparse
    'expected_turnover_time_airport_minutes',
    
    # Expected turnover time: Average historical turnover time for airport × month
    # Temporal baseline, airport-level fallback with seasonality
    # Used when carrier-airport-month data is sparse
    'expected_turnover_time_airport_month_minutes',
]


numerical_features = raw_numerical_features

print(f"Using {len(numerical_features)} numerical features:")
for i, feat in enumerate(numerical_features, 1):
    print(f"  {i}. {feat}")
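
The core hypothesis in the feature comments, `departure_delay ≈ max(0, required_time - rotation_time)`, and the `safe_impossible_on_time_flag` condition can be written as a small rule-based reference. This is a sketch only: the function names and the example minute values are hypothetical, and the real features are computed in `split.py`.

```python
def hypothesized_delay_minutes(required_time: float, rotation_time: float) -> float:
    """Delay implied by the hypothesis: the shortfall in rotation time, never negative."""
    return max(0.0, required_time - rotation_time)


def impossible_on_time(required_time: float, rotation_time: float) -> bool:
    """True when the aircraft needs more time than the schedule allows."""
    return required_time > rotation_time


# Hypothetical (required_time, rotation_time) pairs in minutes.
examples = [
    (150.0, 240.0),  # ample buffer: no delay expected
    (150.0, 120.0),  # 30-minute shortfall: delay expected
]
for required, rotation in examples:
    delay = hypothesized_delay_minutes(required, rotation)
    flag = impossible_on_time(required, rotation)
    print(f"required={required:.0f} rotation={rotation:.0f} -> delay={delay:.0f} flag={flag}")
```

The flag is the non-linear part of the hypothesis: it is true exactly when the implied delay is positive.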


## Construct Model Pipeline

In [0]:
# Step 1: Conditional Expected Values MVP Estimator
# (Computes conditional means on-the-fly from training data)
# Note: Flight lineage features are already in the folds (applied by split.py),
# so turnover time columns are available for conditional expected values computation
cond_exp_estimator = conditional_expected_values_mvp.ConditionalExpectedValuesMVPEstimator()

# Step 2: Imputer
imputer = Imputer(
    inputCols=numerical_features,
    outputCols=[f"{col}_IMPUTED" for col in numerical_features],
    strategy="mean"
)

# Step 3: Vector Assembler (categorical_features is not used in this pipeline, so the indexer/encoder are skipped)
assembler = VectorAssembler(
    inputCols=[f"{col}_IMPUTED" for col in numerical_features],
    outputCol="features",
    handleInvalid="skip"
)

# Step 4: Standard Scaler
scaler = StandardScaler(
    inputCol="features",
    outputCol="scaled_features",
    withMean=True,
    withStd=True
)

# Step 5: Log transform DEP_DELAY
# This preserves the sign while applying log transform
# SIGN(DEP_DELAY) * LOG(ABS(DEP_DELAY) + 1.0) handles both positive and negative delays
# Note: Filters out NULL DEP_DELAY values (required for LinearRegression)
log_transform = SQLTransformer(
    statement="""
    SELECT *, 
           SIGN(DEP_DELAY) * LOG(ABS(DEP_DELAY) + 1.0) AS DEP_DELAY_log
    FROM __THIS__
    WHERE DEP_DELAY IS NOT NULL
    """
)

# Step 6: Linear Regression (train on log-transformed label)
lr = LinearRegression(
    featuresCol="scaled_features", 
    labelCol="DEP_DELAY_log",  # Use log-transformed target
    predictionCol="prediction_log",  # Predictions in log space
    elasticNetParam=0.0,
)

# Step 7: Inverse log transform (convert predictions back from log space)
# Inverse of SIGN(y) * LOG(ABS(y) + 1) is SIGN(y) * (EXP(ABS(y)) - 1)
# Also floor to 0 to match hypothesis: departure_delay ≈ max(0, required_time - rotation_time)
exp_transform = SQLTransformer(
    statement="""
    SELECT *, 
           GREATEST(
               CASE 
                   WHEN prediction_log IS NOT NULL THEN 
                       SIGN(prediction_log) * (EXP(ABS(prediction_log)) - 1.0)
                   ELSE 0.0
               END,
               0.0
           ) AS prediction
    FROM __THIS__
    """
)

# Complete Pipeline
lr_pipe = Pipeline(stages=[
    cond_exp_estimator,           # Adds conditional expected values (uses pre-existing flight lineage features)
    imputer,                      # Imputes missing values
    assembler,                    # Assembles feature vector
    scaler,                       # Standardizes features
    log_transform,                # Log transform DEP_DELAY (SIGN * LOG(ABS + 1))
    lr,                           # Linear regression model (trained on log-transformed label)
    exp_transform                 # Inverse log transform and floor to 0
])

print("Pipeline configured with:")
print("  - Log transform: SIGN(DEP_DELAY) * LOG(ABS(DEP_DELAY) + 1.0)")
print("  - Inverse transform: SIGN(pred) * (EXP(ABS(pred)) - 1), floored to 0")
print("  - Handles both positive and negative delays while preserving sign")
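
The first pipeline stage computes conditional means from the training data, and the feature comments describe a fallback order of carrier-airport-month → carrier-airport → airport-month → airport → 0.0. The estimator's internals are not shown in this notebook, so the following is only a pure-Python sketch of that idea. The row keys (`carrier`, `airport`, `month`, `turnover`) and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_group_means(rows, keys, value):
    """Average `value` per combination of `keys`, using training rows only."""
    groups = defaultdict(list)
    for row in rows:
        if row.get(value) is not None:
            groups[tuple(row[k] for k in keys)].append(row[value])
    return {group: mean(values) for group, values in groups.items()}


# Most specific grouping first, mirroring the fallback order in the feature comments.
FALLBACK_LEVELS = [
    ("carrier", "airport", "month"),
    ("carrier", "airport"),
    ("airport", "month"),
    ("airport",),
]


def fit_turnover_tables(train_rows):
    """Build one lookup table per fallback level from the training rows."""
    return [(keys, fit_group_means(train_rows, keys, "turnover")) for keys in FALLBACK_LEVELS]


def expected_turnover(row, tables, default=0.0):
    """Return the mean from the most specific level that has data, else `default`."""
    for keys, table in tables:
        group = tuple(row[k] for k in keys)
        if group in table:
            return table[group]
    return default


# Hypothetical training rows (turnover in minutes).
train_rows = [
    {"carrier": "AA", "airport": "DFW", "month": 1, "turnover": 50.0},
    {"carrier": "AA", "airport": "DFW", "month": 1, "turnover": 70.0},
    {"carrier": "DL", "airport": "DFW", "month": 2, "turnover": 40.0},
]
tables = fit_turnover_tables(train_rows)
print(expected_turnover({"carrier": "AA", "airport": "DFW", "month": 1}, tables))
print(expected_turnover({"carrier": "UA", "airport": "JFK", "month": 5}, tables))
```

Fitting the tables on training rows only is what keeps the feature free of leakage when the same lookup is applied to a validation or test fold.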


## Run Cross-Validation

In [0]:
# Initialize cross-validator
# FlightDelayCV automatically sets version and fold_index on estimators that support it
cv_obj = cv.FlightDelayCV(
    estimator=lr_pipe,
    version="60M"
)

# Run cross-validation (fits on folds 0-2, excludes test fold)
metrics_df = cv_obj.fit()


In [0]:
print("=" * 80)
print("CROSS-VALIDATION RESULTS")
print("=" * 80)
print(metrics_df)

# Print coefficients for each fold
print("\n" + "=" * 80)
print("MODEL COEFFICIENTS (by fold)")
print("=" * 80)
for i, model in enumerate(cv_obj.models):
    print(f"\n--- Fold {i+1} ---")
    lr_model = model.stages[-2]  # Linear regression is second-to-last (exp_transform is last)
    
    # Get feature names
    feature_names = numerical_features
    
    # Get coefficients
    coefficients = lr_model.coefficients.toArray()
    intercept = lr_model.intercept
    
    print(f"Intercept: {intercept:.4f}")
    print("\nCoefficients:")
    for name, coef in zip(feature_names, coefficients):
        print(f"  {name:50s}: {coef:10.4f}")
    
    # Print training summary stats (computed on the log-transformed label,
    # so RMSE and MSE are in log space, not minutes)
    print(f"\nModel Summary:")
    print(f"  R²: {lr_model.summary.r2:.4f}")
    print(f"  RMSE: {lr_model.summary.rootMeanSquaredError:.4f}")
    print(f"  MSE: {lr_model.summary.meanSquaredError:.4f}")

# Analyze feature importance (absolute coefficient values)
print("\n" + "=" * 80)
print("FEATURE IMPORTANCE (Average |coefficient| across folds)")
print("=" * 80)
avg_coefficients = {}
for i, model in enumerate(cv_obj.models):
    lr_model = model.stages[-2]  # Linear regression is second-to-last (exp_transform is last)
    coefficients = lr_model.coefficients.toArray()
    for name, coef in zip(numerical_features, coefficients):
        if name not in avg_coefficients:
            avg_coefficients[name] = []
        avg_coefficients[name].append(abs(coef))

for name in sorted(avg_coefficients.keys(), key=lambda x: sum(avg_coefficients[x])/len(avg_coefficients[x]), reverse=True):
    avg_abs_coef = sum(avg_coefficients[name]) / len(avg_coefficients[name])
    print(f"  {name:50s}: {avg_abs_coef:10.4f}")
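
The averaging step above can also be written as a standalone function, which makes it easier to test without a fitted model. The sketch below uses a hypothetical name (`mean_abs_coefficients`) and made-up coefficient values. It also checks that the number of coefficients matches the number of feature names, since `zip` would otherwise truncate silently.

```python
from collections import defaultdict


def mean_abs_coefficients(feature_names, coefficients_by_fold):
    """Average |coefficient| per feature across folds, sorted from largest to smallest."""
    totals = defaultdict(float)
    for coefficients in coefficients_by_fold:
        if len(coefficients) != len(feature_names):
            raise ValueError(
                f"Expected {len(feature_names)} coefficients, got {len(coefficients)}"
            )
        for name, coef in zip(feature_names, coefficients):
            totals[name] += abs(coef)
    n_folds = len(coefficients_by_fold)
    averaged = {name: total / n_folds for name, total in totals.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical coefficients for two features across two folds.
ranking = mean_abs_coefficients(
    ["feature_a", "feature_b"],
    [[1.0, -3.0], [-2.0, 1.0]],
)
for name, avg_abs_coef in ranking:
    print(f"  {name:50s}: {avg_abs_coef:10.4f}")
```

Because the features pass through `StandardScaler` before the regression, coefficient magnitudes are on a comparable scale, which is what makes this ranking a rough proxy for importance.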


## Evaluate on Test Set

In [0]:
# Evaluate on held-out test fold
print("\n" + "=" * 80)
print("TEST SET EVALUATION")
print("=" * 80)
test_results = cv_obj.evaluate()
print(test_results)

# Print test model coefficients
print("\n" + "=" * 80)
print("TEST MODEL COEFFICIENTS")
print("=" * 80)
test_lr_model = cv_obj.test_model.stages[-2]  # Linear regression is second-to-last (exp_transform is last)
test_coefficients = test_lr_model.coefficients.toArray()
test_intercept = test_lr_model.intercept

print(f"Intercept: {test_intercept:.4f}")
print("\nCoefficients:")
for name, coef in zip(numerical_features, test_coefficients):
    print(f"  {name:50s}: {coef:10.4f}")

# Summary metrics below are training metrics on the log-transformed label (log space, not minutes)
print(f"\nTest Model Summary:")
print(f"  R²: {test_lr_model.summary.r2:.4f}")
print(f"  RMSE: {test_lr_model.summary.rootMeanSquaredError:.4f}")

## Sanity Checks

In [0]:
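# Sanity check for flight lineage features
# NOTE: `train_transformed` is not defined anywhere in this notebook. It is assumed to be a
# Spark DataFrame of training rows that already carries the lineage features; define it
# before running this cell.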
print("=" * 80)
print("SANITY CHECK: FLIGHT LINEAGE FEATURES")
print("=" * 80)

# Statistics for lineage features
lineage_features = [
    'lineage_rank', 
    'safe_lineage_rotation_time_minutes', 
    'scheduled_lineage_rotation_time_minutes', 
    'scheduled_lineage_turnover_time_minutes',
    'prev_flight_scheduled_flight_time_minutes',
    'safe_required_time_prev_flight_minutes',
    'safe_impossible_on_time_flag',
    'crs_elapsed_time'
]

for col_name in lineage_features:
    if col_name in train_transformed.columns:
        print(f"\n{col_name}:")
        print("-" * 80)
        
        # Basic statistics
        stats = train_transformed.select(
            F.avg(col_name).alias('mean'),
            F.min(col_name).alias('min'),
            F.max(col_name).alias('max'),
            F.stddev(col_name).alias('std'),
            F.count(col_name).alias('count'),
            F.percentile_approx(col_name, 0.5).alias('median'),
            F.percentile_approx(col_name, 0.25).alias('q25'),
            F.percentile_approx(col_name, 0.75).alias('q75')
        ).collect()[0]
        
        total_count = stats['count']
        print(f"  Count:   {total_count:,}")
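        # Format each statistic outside the f-string: a conditional is not valid inside a format spec
        mean_val = f"{stats['mean']:.4f}" if stats['mean'] is not None else "N/A"
        median_val = f"{stats['median']:.4f}" if stats['median'] is not None else "N/A"
        q25_val = f"{stats['q25']:.4f}" if stats['q25'] is not None else "N/A"
        q75_val = f"{stats['q75']:.4f}" if stats['q75'] is not None else "N/A"
        print(f"  Mean:    {mean_val}")
        print(f"  Median:  {median_val}")
        print(f"  Q25:     {q25_val}")
        print(f"  Q75:     {q75_val}")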
        std_val = f"{stats['std']:.4f}" if stats['std'] is not None else "N/A"
        print(f"  Std:     {std_val}")
        min_val = f"{stats['min']:.4f}" if stats['min'] is not None else "N/A"
        max_val = f"{stats['max']:.4f}" if stats['max'] is not None else "N/A"
        print(f"  Min:     {min_val}")
        print(f"  Max:     {max_val}")
        
        # Null and zero counts. Percentages use the total row count, because
        # F.count(col_name) above counts only non-null values.
        row_count = train_transformed.count()
        null_count = train_transformed.filter(F.col(col_name).isNull()).count()
        zero_count = train_transformed.filter(F.col(col_name) == 0.0).count()
        pct_null = (null_count / row_count * 100) if row_count > 0 else 0
        pct_zero = (zero_count / row_count * 100) if row_count > 0 else 0
        print(f"  Nulls:   {null_count:,} ({pct_null:.2f}%)")
        print(f"  Zeros:   {zero_count:,} ({pct_zero:.2f}%)")
        
        # Correlation with DEP_DELAY
        corr = train_transformed.select(
            F.corr(col_name, 'DEP_DELAY').alias('corr')
        ).collect()[0]['corr']
        print(f"  Corr with DEP_DELAY: {corr:.4f}" if corr is not None else "  Corr with DEP_DELAY: N/A")

# Check relationship between safe and scheduled rotation time
print("\n" + "=" * 80)
print("SAFE vs SCHEDULED ROTATION TIME COMPARISON")
print("=" * 80)

if 'safe_lineage_rotation_time_minutes' in train_transformed.columns and 'scheduled_lineage_rotation_time_minutes' in train_transformed.columns:
    comparison = train_transformed.select(
        F.col('safe_lineage_rotation_time_minutes').alias('safe'),
        F.col('scheduled_lineage_rotation_time_minutes').alias('scheduled')
    )
    
    # Where they differ (data leakage detected)
    different = comparison.filter(
        F.col('safe') != F.col('scheduled')
    ).count()
    
    # Where safe is less than scheduled (imputed due to data leakage)
    safe_less = comparison.filter(
        F.col('safe') < F.col('scheduled')
    ).count()
    
    total = train_transformed.count()
    same_count = total - different
    pct_different = (different / total * 100) if total > 0 else 0
    pct_safe_less = (safe_less / total * 100) if total > 0 else 0
    
    print(f"\nTotal rows: {total:,}")
    print(f"Rows where safe == scheduled: {same_count:,} ({100-pct_different:.2f}%)")
    print(f"Rows where safe != scheduled: {different:,} ({pct_different:.2f}%)")
    print(f"Rows where safe < scheduled (data leakage imputed): {safe_less:,} ({pct_safe_less:.2f}%)")
    
    # Sample rows where they differ
    if different > 0:
        sample_diff = train_transformed.filter(
            F.col('safe_lineage_rotation_time_minutes') != F.col('scheduled_lineage_rotation_time_minutes')
        ).select(
            'FL_DATE', 'origin', 'dest',
            'lineage_rank',
            'safe_lineage_rotation_time_minutes',
            'scheduled_lineage_rotation_time_minutes'
        ).limit(10).toPandas()
        print(f"\nSample rows where safe != scheduled (data leakage detected):")
        print(sample_diff.to_string())

# Check lineage_rank distribution
print("\n" + "=" * 80)
print("LINEAGE RANK DISTRIBUTION")
print("=" * 80)

if 'lineage_rank' in train_transformed.columns:
    rank_dist = train_transformed.groupBy('lineage_rank').agg(
        F.count('*').alias('count')
    ).orderBy('lineage_rank').limit(20).toPandas()
    
    total = train_transformed.count()
    print(f"\nDistribution of lineage_rank (first 20 ranks):")
    print("-" * 80)
    for _, row in rank_dist.iterrows():
        pct = (row['count'] / total * 100) if total > 0 else 0
        print(f"  Rank {int(row['lineage_rank']):3d}: {row['count']:8,} flights ({pct:5.2f}%)")
    
    # Check rotation time for rank=1 flights (first flight in lineage); expected to be ~1440 minutes, not 0
    rank_1_rotation = train_transformed.filter(F.col('lineage_rank') == 1.0).select(
        F.avg('safe_lineage_rotation_time_minutes').alias('avg_rotation_rank1'),
        F.avg('scheduled_lineage_rotation_time_minutes').alias('avg_scheduled_rank1')
    ).collect()[0]
    
    print(f"\nRank 1 (first flight) statistics:")
    print(f"  Avg safe rotation time: {rank_1_rotation['avg_rotation_rank1']:.4f}" if rank_1_rotation['avg_rotation_rank1'] is not None else "  Avg safe rotation time: N/A")
    print(f"  Avg scheduled rotation time: {rank_1_rotation['avg_scheduled_rank1']:.4f}" if rank_1_rotation['avg_scheduled_rank1'] is not None else "  Avg scheduled rotation time: N/A")
    print(f"  (Should be ~1440 minutes (24 hours) for first flight - indicates plane already at airport, plenty of buffer)")
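
Several of the checks print either a formatted statistic or "N/A" when Spark returns NULL. Two details matter here: a conditional expression cannot go inside an f-string format spec, and a truthiness test would treat a legitimate `0.0` as missing. A small helper handles both; `fmt_optional` is a hypothetical name, not part of the repo.

```python
from typing import Optional


def fmt_optional(value: Optional[float], spec: str = ".4f", missing: str = "N/A") -> str:
    """Format a number with `spec`, or return `missing` when the value is None."""
    return missing if value is None else format(value, spec)


print(fmt_optional(12.5))   # 12.5000
print(fmt_optional(0.0))    # 0.0000
print(fmt_optional(None))   # N/A
```

With this helper, a line such as `print(f"  Median:  {fmt_optional(stats['median'])}")` stays short and treats zero correctly.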
