# Demo: Phase 2 Meta-Model for Departure Delay Prediction

This demo covers Phase 2 of the flight delay prediction approach, which uses **meta-models** to predict previous flight components.

**Key Features:**
- **Meta-Models**: Random Forest models predict previous flight components:
  - `predicted_prev_flight_air_time`: Predicted actual air time for previous flight
  - `predicted_prev_flight_taxi_time`: Predicted actual taxi time for previous flight
  - `predicted_prev_flight_total_duration`: Predicted actual total duration for previous flight
- **Final Model**: Random Forest using meta-model predictions + comprehensive covariates
- Cross-validation with automatic fold handling

**Meta-Model Approach:**
- Train Random Forest models on training folds to predict previous flight components
- Use comprehensive covariates: weather, temporal features, carrier, state-level location
- CV-safe: Meta-models train only on training data, applied to validation/test
- **These predictions serve as the estimated conditional expected values** for previous flight components
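
The CV-safe rule above can be illustrated without Spark. The sketch below is a minimal stand-in, not the notebook's `MetaModelEstimator`: it swaps the Random Forest for a per-route mean, and the helper names `fit_route_mean` and `predict_route_mean`, the `route` and `air_time` keys, and the sample rows are all hypothetical. The only point it shows is the data flow: fit on training rows, then apply the fitted model to validation rows.

```python
from collections import defaultdict

def fit_route_mean(train_rows):
    """Fit a toy meta-model (mean air time per route) on training rows only."""
    totals = defaultdict(lambda: [0.0, 0])  # route -> [sum of air time, row count]
    for row in train_rows:
        acc = totals[row["route"]]
        acc[0] += row["air_time"]
        acc[1] += 1
    route_means = {route: s / n for route, (s, n) in totals.items()}
    total_rows = sum(n for _, n in totals.values())
    global_mean = sum(s for s, _ in totals.values()) / total_rows
    return route_means, global_mean

def predict_route_mean(model, rows):
    """Apply the fitted model; routes unseen in training fall back to the global mean."""
    route_means, global_mean = model
    return [route_means.get(row["route"], global_mean) for row in rows]

# Hypothetical training and validation folds
train = [
    {"route": "SFO-LAX", "air_time": 60.0},
    {"route": "SFO-LAX", "air_time": 70.0},
    {"route": "JFK-BOS", "air_time": 50.0},
]
valid = [
    {"route": "SFO-LAX", "air_time": 90.0},   # actual value is never seen by the model
    {"route": "ORD-DEN", "air_time": 140.0},  # route absent from the training fold
]

model = fit_route_mean(train)                   # uses training rows only
valid_predictions = predict_route_mean(model, valid)
print(valid_predictions)
```

The validation rows' actual `air_time` values never influence `model`, which is the property that keeps the downstream final model free of leakage from the validation fold.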

**Meta-Model Covariates:**
- Weather variables at previous flight origin and destination (comprehensive set: precipitation, pressure, wind, temperature, visibility, etc.)
- Temporal features (month, day_of_week, day_of_month, time blocks)
- Route characteristics (previous flight route, scheduled times, distance)
- Carrier and state-level location
- Airport characteristics (elevation if available)

**Final Model Features:**
- Flight lineage features (rotation time, turnover time, etc.)
- Meta-model predictions (predicted previous flight components - these ARE the conditional expected values)
- Weather variables (current origin)
- Temporal features (month, day_of_week, day_of_month, dep_time_blk, arr_time_blk) - aligned with MLP PyTorch notebook
- Flight characteristics (distance, scheduled times, elevation)

**Log Transform:**
- Applied to `DEP_DELAY` to handle right-skewed distribution: `SIGN(DEP_DELAY) * LOG(ABS(DEP_DELAY) + 1.0)`
- Inverse transform: `SIGN(pred) * (EXP(ABS(pred)) - 1)`, floored to 0
- Similar to MVP demo approach
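
As a quick check of the two formulas above, here is a minimal pure-Python sketch of the signed log transform and its floored inverse. The function names `signed_log` and `inverse_signed_log` are illustrative and do not appear in the notebook, which applies the same expressions through `SQLTransformer`.

```python
import math

def signed_log(delay):
    """SIGN(delay) * LOG(ABS(delay) + 1.0)"""
    sign = (delay > 0) - (delay < 0)
    return sign * math.log(abs(delay) + 1.0)

def inverse_signed_log(pred_log, floor=0.0):
    """SIGN(pred) * (EXP(ABS(pred)) - 1), floored at `floor`."""
    sign = (pred_log > 0) - (pred_log < 0)
    return max(sign * (math.exp(abs(pred_log)) - 1.0), floor)

for delay in (-5.0, 0.0, 30.0, 240.0):
    transformed = signed_log(delay)
    recovered = inverse_signed_log(transformed)
    print(f"{delay:7.1f} -> {transformed:8.4f} -> {recovered:7.1f}")
```

Non-negative delays round-trip to their original value, while early departures (negative delays) come back as 0 because of the floor.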


## Dependencies

In [0]:
import sys
import pandas as pd

# Load modules from our Databricks repo
import importlib.util

# CV module
cv_path = "/Workspace/Shared/Team 4_2/flight-departure-delay-predictive-modeling/notebooks/Cross Validator/cv.py"
spec = importlib.util.spec_from_file_location("cv", cv_path)
cv = importlib.util.module_from_spec(spec)
spec.loader.exec_module(cv)

# Meta-model estimator
meta_model_path = "/Workspace/Shared/Team 4_2/flight-departure-delay-predictive-modeling/notebooks/Feature Engineering/meta_model_estimator.py"
spec = importlib.util.spec_from_file_location("meta_model_estimator", meta_model_path)
meta_model_estimator = importlib.util.module_from_spec(spec)
spec.loader.exec_module(meta_model_estimator)

from pyspark.sql import functions as F
from pyspark.ml.feature import Imputer, StringIndexer, OneHotEncoder, VectorAssembler, SQLTransformer
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml import Pipeline


## Define Features

**Categorical Features**: Carrier and temporal features. State-level location columns are listed but currently commented out, and high-cardinality airport codes are avoided.

**Numerical Features**: 
- Flight lineage features (rotation time, turnover time, etc.)
- Meta-model predictions (predicted previous flight components - these ARE the conditional expected values)
- Weather variables (current origin)
- Flight characteristics (distance, scheduled times)


In [0]:
# Categorical features (state-level to avoid high cardinality)
# Note: Avoiding origin/dest directly as they have 200+ categories
# Based on MLP PyTorch notebook and other models
categorical_features = [
    'op_carrier',              # Carrier
    # 'origin_state_abr',        # Origin state (lower cardinality than airport)
    # 'dest_state_abr',          # Destination state
    'day_of_week',             # Day of week
    'month',                   # Month
    # 'day_of_month',            # Day of month (from MLP notebook)
    'dep_time_blk',            # Departure time block
    'arr_time_blk',            # Arrival time block (from MLP notebook)
]

# Numerical features
raw_numerical_features = [
    # ============================================================================
    # Flight Lineage Features (pre-applied in split.py)
    # ============================================================================
    'lineage_rank',
    'safe_lineage_rotation_time_minutes',
    'scheduled_lineage_rotation_time_minutes',
    'scheduled_lineage_turnover_time_minutes',
    'prev_flight_scheduled_flight_time_minutes',
    'safe_required_time_prev_flight_minutes',
        
    # ============================================================================
    # Note: Meta-model predictions (predicted_prev_flight_*) are added by MetaModelEstimator
    # and do NOT need imputation (they're already predicted values)
    # They will be added to final_model_features after meta-models run
    # ============================================================================
    
    # ============================================================================
    # Weather Variables (Current Origin)
    # ============================================================================
    'hourlyprecipitation',
    'hourlysealevelpressure',
    'hourlyaltimetersetting',
    'hourlywetbulbtemperature',
    'hourlystationpressure',
    'hourlywinddirection',
    'hourlyrelativehumidity',
    'hourlywindspeed',
    'hourlydewpointtemperature',
    'hourlydrybulbtemperature',
    'hourlyvisibility',
    
    # ============================================================================
    # Flight Characteristics
    # ============================================================================
    'crs_elapsed_time',        # Scheduled elapsed time
    'distance',                # Flight distance
    'elevation',               # Airport elevation (if available)
]

numerical_features = raw_numerical_features

print(f"Using {len(categorical_features)} categorical features:")
for i, feat in enumerate(categorical_features, 1):
    print(f"  {i}. {feat}")

print(f"\nUsing {len(numerical_features)} numerical features:")
for i, feat in enumerate(numerical_features, 1):
    print(f"  {i}. {feat}")


## Construct Model Pipeline

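Pipeline stages:
1. **Shared Feature Processing**: Imputer + StringIndexer + OneHotEncoder (processes ALL features once)
2. **Meta-Model Estimator**: uses the preprocessed features to predict previous flight components
3. **Final Model Feature Assembly**: VectorAssembler (combines processed features + meta-model predictions)
4. **Log Transform**: SQLTransformer that adds `DEP_DELAY_log`
5. **Random Forest Regressor**: final model, trained on the log-transformed target
6. **Inverse Log Transform**: SQLTransformer that converts predictions back to the original scale and floors them at 0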

**Note**: Feature processing runs FIRST so both meta-models and final model use consistently encoded features.
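
To make the ordering constraint concrete, the following sketch walks a simplified list of stages and reports the first stage whose input columns have not been produced yet. It is an illustration only: the helper `first_missing_input` is hypothetical, each stage is reduced to one or two representative columns, and the inputs assumed for `meta_model_est` are a guess, since its internals are not shown in this notebook.

```python
# (stage name, columns consumed, columns produced): a simplified, assumed view
stages = [
    ("imputer", {"distance"}, {"distance_IMPUTED"}),
    ("indexer", {"op_carrier"}, {"op_carrier_INDEX"}),
    ("encoder", {"op_carrier_INDEX"}, {"op_carrier_VEC"}),
    ("meta_model_est", {"op_carrier_VEC", "distance_IMPUTED"},
     {"predicted_prev_flight_air_time"}),
    ("assembler",
     {"op_carrier_VEC", "distance_IMPUTED", "predicted_prev_flight_air_time"},
     {"features"}),
    ("log_transform", {"DEP_DELAY"}, {"DEP_DELAY_log"}),
    ("rf", {"features", "DEP_DELAY_log"}, {"prediction_log"}),
    ("exp_transform", {"prediction_log"}, {"prediction"}),
]

def first_missing_input(stage_list, raw_columns):
    """Return (stage, missing columns) for the first stage that cannot run, else None."""
    available = set(raw_columns)
    for name, needs, makes in stage_list:
        missing = needs - available
        if missing:
            return name, sorted(missing)
        available |= makes
    return None

raw = {"distance", "op_carrier", "DEP_DELAY"}
ordering_problem = first_missing_input(stages, raw)

# Swap the meta-model and the assembler to show what goes wrong out of order
swapped = stages[:3] + [stages[4], stages[3]] + stages[5:]
swapped_problem = first_missing_input(swapped, raw)

print(ordering_problem)
print(swapped_problem)
```

With the order listed above every stage finds its inputs. Once the assembler runs before the meta-models, the predicted column it needs does not exist yet.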


In [0]:
# ============================================================================
# STEP 1: Shared Feature Processing (runs FIRST for consistency)
# ============================================================================
# Process ALL features upfront so meta-models and final model use consistently encoded features

# Get ALL features (current flight + previous flight) for processing
# Meta-models use prev_flight_* features, final model uses current flight features
all_categorical_features = categorical_features.copy()
all_numerical_features = numerical_features.copy()

# Add previous flight features that meta-models need (but final model doesn't use directly)
meta_model_categorical = [
    'prev_flight_op_carrier',
    'prev_flight_origin_state_abr',
    'prev_flight_dest_state_abr',
    'prev_flight_day_of_week',
    'prev_flight_month',
    'prev_flight_dep_time_blk',  # For taxi time model
    'prev_flight_arr_time_blk',  # For taxi time model
]

meta_model_numerical = [
    'prev_flight_crs_elapsed_time',
    'prev_flight_distance',
    'prev_flight_crs_dep_time',
    'prev_flight_crs_arr_time',
    'prev_flight_day_of_month',
    # Weather at previous origin/dest (add if available)
    'prev_flight_origin_hourlyprecipitation',
    'prev_flight_origin_hourlysealevelpressure',
    'prev_flight_origin_hourlyaltimetersetting',
    'prev_flight_origin_hourlywetbulbtemperature',
    'prev_flight_origin_hourlystationpressure',
    'prev_flight_origin_hourlywinddirection',
    'prev_flight_origin_hourlyrelativehumidity',
    'prev_flight_origin_hourlywindspeed',
    'prev_flight_origin_hourlydewpointtemperature',
    'prev_flight_origin_hourlydrybulbtemperature',
    'prev_flight_origin_hourlyvisibility',
    'prev_flight_dest_hourlyprecipitation',
    'prev_flight_dest_hourlysealevelpressure',
    'prev_flight_dest_hourlyaltimetersetting',
    'prev_flight_dest_hourlywetbulbtemperature',
    'prev_flight_dest_hourlystationpressure',
    'prev_flight_dest_hourlywinddirection',
    'prev_flight_dest_hourlyrelativehumidity',
    'prev_flight_dest_hourlywindspeed',
    'prev_flight_dest_hourlydewpointtemperature',
    'prev_flight_dest_hourlydrybulbtemperature',
    'prev_flight_dest_hourlyvisibility',
]

# Combine all features for processing
all_categorical_features.extend([f for f in meta_model_categorical if f not in all_categorical_features])
all_numerical_features.extend([f for f in meta_model_numerical if f not in all_numerical_features])

# Note: We'll filter to only existing features when we actually build the pipeline
# For now, this defines the complete feature set we want to process
print(f"Target feature set: {len(all_categorical_features)} categorical and {len(all_numerical_features)} numerical features")
print("(Actual features used will be filtered to only those available in the data)")

# Note: Feature lists will be filtered to available features in the next cell
# Step 1a: Imputer (all numerical features)
# Will be updated after we check available features
imputer = None  # Will be set after feature filtering

# Step 1b: StringIndexer (all categorical features)  
indexer = None  # Will be set after feature filtering

# Step 1c: OneHotEncoder (all categorical features)
encoder = None  # Will be set after feature filtering

# ============================================================================
# STEP 2: Meta-Model Estimator (uses preprocessed features)
# ============================================================================
# Trains Random Forest models to predict previous flight components
# Note: use_preprocessed_features=True so it uses already-processed features
meta_model_est = meta_model_estimator.MetaModelEstimator(
    num_trees=50,
    max_depth=20,  # Increased for high-cardinality categoricals (3000+ routes)
    # Random Forest can handle high cardinality, but needs depth to capture route-specific patterns
    # Depth 20 balances: enough depth for route-specific splits vs overfitting risk
    min_instances_per_node=20,  # Increased to prevent overfitting with deeper trees
    use_preprocessed_features=True  # Use features processed in Step 1
)

# ============================================================================
# STEP 3: Final Model Feature Assembly
# ============================================================================
# Assemble features for final model: processed current flight features + meta-model predictions
# Note: final_model_categorical and final_model_numerical will be defined after feature filtering
assembler = None  # Will be set after feature filtering

# ============================================================================
# STEP 4: Log Transform (for skewed target distribution)
# ============================================================================
# Log transform DEP_DELAY to handle right-skewed distribution
# SIGN(DEP_DELAY) * LOG(ABS(DEP_DELAY) + 1.0) preserves sign for negative delays
log_transform = SQLTransformer(
    statement="""
    SELECT *, 
           SIGN(DEP_DELAY) * LOG(ABS(DEP_DELAY) + 1.0) AS DEP_DELAY_log
    FROM __THIS__
    WHERE DEP_DELAY IS NOT NULL
    """
)

# ============================================================================
# STEP 5: Final Model
# ============================================================================
# Random Forest Regressor (final model)
# Note: No StandardScaler needed for tree-based models
# Training on log-transformed target to handle skewed distribution
rf = RandomForestRegressor(
    featuresCol="features",
    labelCol="DEP_DELAY_log",  # Use log-transformed target
    predictionCol="prediction_log",  # Predictions in log space
    numTrees=50,  # Reduced from 100 to save memory (60M rows is large)
    maxDepth=12,  # Reduced from 15 to save memory and reduce overfitting
    minInstancesPerNode=50,  # Increased from 20 to reduce tree size and memory usage
    seed=42
)

# ============================================================================
# STEP 6: Inverse Log Transform
# ============================================================================
# Convert predictions back from log space
# Inverse of SIGN(y) * LOG(ABS(y) + 1) is SIGN(y) * (EXP(ABS(y)) - 1)
# Floor to 0 to match hypothesis: departure_delay ≈ max(0, required_time - rotation_time)
exp_transform = SQLTransformer(
    statement="""
    SELECT *, 
           GREATEST(
               CASE 
                   WHEN prediction_log IS NOT NULL THEN 
                       SIGN(prediction_log) * (EXP(ABS(prediction_log)) - 1.0)
                   ELSE 0.0
               END,
               0.0
           ) AS prediction
    FROM __THIS__
    """
)

# Complete Pipeline will be built after feature filtering (in next cell)
# Order matters: Feature processing FIRST, then meta-models (which use processed features), then final model
rf_pipe = None  # Will be built after feature filtering

print("Pipeline configured with:")
print("  - Step 1: Shared Feature Processing (Imputer + Indexer + Encoder)")
print("    → Processes ALL features once for consistent encoding")
print("  - Step 2: Meta-Model Estimator (uses preprocessed features)")
print("    → Predicts: air_time, taxi_time, total_duration (these ARE the conditional expected values)")
print("  - Step 3: Final Model Assembly (processed current flight features + meta-model predictions)")
print("  - Step 4: Log Transform (SIGN(DEP_DELAY) * LOG(ABS(DEP_DELAY) + 1.0))")
print("    → Handles right-skewed distribution, preserves sign for negative delays")
print("  - Step 5: Random Forest Regressor (trained on log-transformed target)")
print("  - Step 6: Inverse Log Transform (convert back and floor to 0)")
print("\nBenefits:")
print("  ✓ Consistent feature encoding across meta-models and final model")
print("  ✓ No duplicate processing")
print("  ✓ Same categories get same indices/encoding")
print("  ✓ Log transform handles skewed delay distribution")


## Dynamic Feature Filtering

Before building the pipeline, we need to filter features to only those that exist in the data. This ensures we don't try to process columns that don't exist.
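
A small variation on the filtering idea, sketched in plain Python: besides keeping the available features, it also returns the names that were requested but are missing, which can be easier to debug than a count alone. The helper name `split_available` and the sample column lists are hypothetical.

```python
def split_available(requested, existing_columns):
    """Split requested feature names into those present in the data and those missing."""
    existing = set(existing_columns)
    available = [f for f in requested if f in existing]
    missing = [f for f in requested if f not in existing]
    return available, missing

# Hypothetical request and DataFrame columns
requested = ["distance", "elevation", "prev_flight_origin_hourlywindspeed"]
columns = ["distance", "prev_flight_origin_hourlywindspeed", "DEP_DELAY"]

available, missing = split_available(requested, columns)
print("available:", available)
print("missing:  ", missing)
```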

In [0]:
# Load one training fold to check which features are actually available in the data
data_loader = cv.FlightDelayDataLoader()
data_loader.load()
folds = data_loader.get_version("60M")
sample_train_df, _ = folds[0]

# Filter to only features that exist in the data
available_categorical = [f for f in all_categorical_features if f in sample_train_df.columns]
available_numerical = [f for f in all_numerical_features if f in sample_train_df.columns]

print(f"Available categorical features: {len(available_categorical)}/{len(all_categorical_features)}")
print(f"Available numerical features: {len(available_numerical)}/{len(all_numerical_features)}")

# Update feature lists to only include available features
all_categorical_features = available_categorical
all_numerical_features = available_numerical

# Also update final model features to only include what's available
final_model_categorical = [f for f in categorical_features if f in sample_train_df.columns]
final_model_numerical = [f for f in numerical_features if f in sample_train_df.columns]

# Now build the final model assembler with filtered features
final_model_features = (
    [f"{col}_VEC" for col in final_model_categorical] + 
    [f"{col}_IMPUTED" for col in final_model_numerical] +
    ['predicted_prev_flight_air_time', 'predicted_prev_flight_taxi_time', 'predicted_prev_flight_total_duration']
)

# Now create the feature processing stages with filtered features
# Step 1a: Imputer (all numerical features)
imputer = Imputer(
    inputCols=all_numerical_features,
    outputCols=[f"{col}_IMPUTED" for col in all_numerical_features],
    strategy="mean"
) if all_numerical_features else None

# Step 1b: StringIndexer (all categorical features)
# StringIndexer can handle boolean columns, but we need to ensure NULLs are handled
# The flight_lineage_features.py already imputes safe_impossible_on_time_flag to False
# handleInvalid="keep" will assign a new index for NULL/invalid values
indexer = StringIndexer(
    inputCols=all_categorical_features,
    outputCols=[f"{col}_INDEX" for col in all_categorical_features],
    handleInvalid="keep"  # Handles NULLs and unknown categories by assigning a new index
) if all_categorical_features else None

# Step 1c: OneHotEncoder (all categorical features)
encoder = OneHotEncoder(
    inputCols=[f"{col}_INDEX" for col in all_categorical_features],
    outputCols=[f"{col}_VEC" for col in all_categorical_features],
    dropLast=False  # Keep all categories for Random Forest
)

# Step 3: Final Model Feature Assembly
assembler = VectorAssembler(
    inputCols=final_model_features,
    outputCol="features",
    handleInvalid="skip"
)

# Build the complete pipeline with filtered features
# Filter out None stages (in case we have no numerical or categorical features)
pipeline_stages = []
if imputer is not None:
    pipeline_stages.append(imputer)       # Step 1a: Impute ALL numerical features
if indexer is not None:
    pipeline_stages.append(indexer)       # Step 1b: Index ALL categorical features
    pipeline_stages.append(encoder)       # Step 1c: One-hot encode ALL categorical features (needs the indexer output)
pipeline_stages.extend([
    meta_model_est,            # Step 2: Meta-models (use preprocessed features to predict prev flight components)
    assembler,                 # Step 3: Assemble final model features (processed current flight + meta-model predictions)
    log_transform,             # Step 4: Log transform DEP_DELAY (SIGN * LOG(ABS + 1))
    rf,                        # Step 5: Random Forest (trained on log-transformed target)
    exp_transform              # Step 6: Inverse log transform and floor to 0
])

rf_pipe = Pipeline(stages=pipeline_stages)

print(f"\nFinal model will use:")
print(f"  - {len(final_model_categorical)} categorical features (current flight)")
print(f"  - {len(final_model_numerical)} numerical features (current flight)")
print(f"  - 3 meta-model predictions (predicted_prev_flight_*)")
print("\n✓ Pipeline built with filtered features!")
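
The `handleInvalid="keep"` setting used for the indexer can be pictured with a short pure-Python sketch. It mimics the general idea (labels ordered by descending frequency, with one extra index reserved for NULLs and unseen categories) and is not Spark's implementation. The helper names `fit_index` and `transform_keep` and the carrier codes are illustrative.

```python
from collections import Counter

def fit_index(values):
    """Assign indices by descending frequency, breaking ties alphabetically."""
    counts = Counter(v for v in values if v is not None)
    labels = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: i for i, label in enumerate(labels)}

def transform_keep(index, values):
    """Map values to indices; NULLs and unseen labels share one extra index."""
    unknown = len(index)
    return [index.get(v, unknown) for v in values]

train_carriers = ["AA", "AA", "DL", "UA", "DL", "AA"]
carrier_index = fit_index(train_carriers)

# Validation data with a carrier not seen in training and a NULL
encoded = transform_keep(carrier_index, ["DL", "WN", None])
print(carrier_index)
print(encoded)
```

Because the mapping is fitted once and reused, a given category receives the same index in every fold, and rows with unexpected values are kept rather than dropped.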


## Run Cross-Validation


In [0]:
# Initialize cross-validator
# FlightDelayCV automatically sets version and fold_index on estimators that support it
cv_rf_meta_12 = cv.FlightDelayCV(
    estimator=rf_pipe,
    version="12M"  # 12M dataset version
)

# Run cross-validation (fits on folds 0-2, excludes test fold)
print("=" * 80)
print("RUNNING CROSS-VALIDATION WITH RF & META-MODELS")
print("=" * 80)
metrics_df = cv_rf_meta_12.fit()


In [0]:
cv_rf_meta_12.evaluate()

### 60M

In [0]:
# Initialize cross-validator
# FlightDelayCV automatically sets version and fold_index on estimators that support it
cv_rf_meta_60 = cv.FlightDelayCV(
    estimator=rf_pipe,
    version="60M"  # Use 60M for better data coverage
)

# Run cross-validation (fits on folds 0-2, excludes test fold)
print("=" * 80)
print("RUNNING CROSS-VALIDATION WITH RF & META-MODELS")
print("=" * 80)
metrics_df = cv_rf_meta_60.fit()

In [0]:
cv_rf_meta_60.evaluate()

## View Cross-Validation Results


In [0]:
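# The cells below refer to `cv_obj`; point it at the 60M run, whose fit produced `metrics_df`
cv_obj = cv_rf_meta_60

print("=" * 80)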
print("CROSS-VALIDATION RESULTS")
print("=" * 80)
print(metrics_df)

# Print feature importance for each fold
print("\n" + "=" * 80)
print("FEATURE IMPORTANCE (by fold)")
print("=" * 80)
for i, model in enumerate(cv_obj.models):
    print(f"\n--- Fold {i+1} ---")
    # Random Forest is second-to-last (exp_transform is last)
    rf_model = model.stages[-2]
    feature_importance = rf_model.featureImportances.toArray()
    
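    # Get feature names (filtered lists from the Dynamic Feature Filtering cell).
    # Caveat: featureImportances has one entry per slot of the assembled vector, and each
    # one-hot *_VEC column spans several slots, so the per-column names below do not
    # line up one-to-one with the importances.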
    feature_names = [f"{col}_VEC" for col in final_model_categorical] + [f"{col}_IMPUTED" for col in final_model_numerical] + [
        'predicted_prev_flight_air_time', 'predicted_prev_flight_taxi_time', 'predicted_prev_flight_total_duration'
    ]
    
    # Sort by importance
    importance_pairs = list(zip(feature_names, feature_importance))
    importance_pairs.sort(key=lambda x: x[1], reverse=True)
    
    print("Top 20 Features:")
    for name, importance in importance_pairs[:20]:
        print(f"  {name:50s}: {importance:10.6f}")
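
One caveat with pairing names and importances: `featureImportances` has one entry per slot of the assembled vector, and a one-hot encoded column occupies several slots. The sketch below shows one way to expand column names to slot names before zipping. It is pure Python, and the helper `expand_feature_names` is hypothetical. The vector sizes and importance values are made up for illustration; in the notebook the real sizes would have to be read from the fitted pipeline.

```python
def expand_feature_names(columns_with_sizes):
    """Expand (column, vector size) pairs into one name per vector slot."""
    names = []
    for column, size in columns_with_sizes:
        if size == 1:
            names.append(column)
        else:
            names.extend(f"{column}[{slot}]" for slot in range(size))
    return names

# Hypothetical layout: a 3-slot one-hot vector followed by two scalar features
layout = [
    ("op_carrier_VEC", 3),
    ("distance_IMPUTED", 1),
    ("predicted_prev_flight_air_time", 1),
]
importances = [0.10, 0.05, 0.15, 0.30, 0.40]  # made-up values, one per slot

slot_names = expand_feature_names(layout)
if len(slot_names) != len(importances):
    raise ValueError("name/importance length mismatch")

ranked = sorted(zip(slot_names, importances), key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"  {name:50s}: {importance:10.6f}")
```

The explicit length check is the useful part: if the names and the importance array differ in length, a plain `zip` silently truncates and mislabels features.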


## Evaluate on Test Set

Evaluate the final model on the held-out test fold (fold 4). This gives us an unbiased estimate of model performance on unseen data.


In [0]:
# Evaluate on held-out test fold
print("=" * 80)
print("TEST SET EVALUATION")
print("=" * 80)

test_results = cv_obj.evaluate()
print("\nTest Set Metrics:")
print(test_results)

# Print test model feature importance
print("\n" + "=" * 80)
print("TEST MODEL FEATURE IMPORTANCE")
print("=" * 80)

test_rf_model = cv_obj.test_model.stages[-2]  # Random Forest is second-to-last (exp_transform is last)
feature_importance = test_rf_model.featureImportances.toArray()

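# Get feature names (filtered lists from the Dynamic Feature Filtering cell).
# Caveat: featureImportances has one entry per slot of the assembled vector, and each
# one-hot *_VEC column spans several slots, so the per-column names below do not
# line up one-to-one with the importances.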
feature_names = [f"{col}_VEC" for col in final_model_categorical] + [f"{col}_IMPUTED" for col in final_model_numerical] + [
    'predicted_prev_flight_air_time', 'predicted_prev_flight_taxi_time', 'predicted_prev_flight_total_duration'
]

# Sort by importance
importance_pairs = list(zip(feature_names, feature_importance))
importance_pairs.sort(key=lambda x: x[1], reverse=True)

print("\nTop 20 Features (Test Model):")
for name, importance in importance_pairs[:20]:
    print(f"  {name:50s}: {importance:10.6f}")
