# XGBoost with Graph Features and Meta-Model

**Based on the top-performing XGBoost pipeline**, with these additions:
- **Graph Features**: PageRank scores (weighted and unweighted) for the origin and destination airports of the current and previous flight
- **Meta-Models**: Random Forest models that predict previous-flight components (air_time, taxi_time, total_duration)
- **Log Transform** (comparison variant `xgb_pipe_with_log` only): handles the right-skewed target distribution
- **Final Model**: XGBoost (fast, optimized, top leaderboard performer)

**Pipeline Order:**
1. Graph Features Estimator (builds graph, adds PageRank)
2. Imputer (numerical features + graph features - graph features already filled to 0.0 but included for consistency)
3. StringIndexer (categorical features)
4. OneHotEncoder (categorical features)
5. Meta-Model Estimator (uses preprocessed features to predict prev_flight components)
6. VectorAssembler (combines processed features + meta-model predictions)
7. Log Transform (`SIGN(DEP_DELAY) * LOG(ABS(DEP_DELAY) + 1.0)`), `xgb_pipe_with_log` only
8. XGBoost Regressor
9. Inverse Log Transform (converts predictions back to minutes and floors them at 0), `xgb_pipe_with_log` only
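
The signed log transform in step 7 and its inverse in step 9 can be sketched in plain Python. This is only an illustration of the two SQL expressions used later in the notebook; the function names `signed_log` and `inverse_signed_log` are hypothetical and do not exist in the pipeline.

```python
import math


def signed_log(delay: float) -> float:
    """Mirror of SIGN(x) * LOG(ABS(x) + 1.0): compresses large delays, keeps the sign."""
    return math.copysign(math.log(abs(delay) + 1.0), delay)


def inverse_signed_log(pred_log: float, floor: float = 0.0) -> float:
    """Mirror of SIGN(y) * (EXP(ABS(y)) - 1.0), then floored at `floor` (0 in this notebook)."""
    raw = math.copysign(math.exp(abs(pred_log)) - 1.0, pred_log)
    return max(raw, floor)


# Round trip for a few example delays (in minutes)
for delay in [-12.0, 0.0, 5.0, 180.0]:
    y = signed_log(delay)
    print(f"{delay:7.1f} -> {y:8.4f} -> {inverse_signed_log(y):7.1f}")
```

Note that the floor at 0 means early departures (negative delays) are not recovered by the round trip: a true delay of -12 minutes comes back as 0.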

**Delta from Base XGBoost Pipeline:**
- ✅ Added: Meta-Model Estimator (predicts prev_flight_air_time, prev_flight_taxi_time, prev_flight_total_duration)
- ✅ Added: Log transform (handles skewed target; used in `xgb_pipe_with_log` only)
- ✅ Added: Inverse log transform (converts predictions back; used in `xgb_pipe_with_log` only)
- ✅ Kept: Same optimized structure (graph → impute → index → encode → assemble → model)


## Dependencies


In [0]:
import sys
import pandas as pd

# Load modules from our Databricks repo
import importlib.util

# CV module
cv_path = "/Workspace/Shared/Team 4_2/flight-departure-delay-predictive-modeling/notebooks/Cross Validator/cv.py"
spec = importlib.util.spec_from_file_location("cv", cv_path)
cv = importlib.util.module_from_spec(spec)
spec.loader.exec_module(cv)

# Graph features
graph_features_path = "/Workspace/Shared/Team 4_2/flight-departure-delay-predictive-modeling/notebooks/Feature Engineering/graph_features.py"
spec = importlib.util.spec_from_file_location("graph_features", graph_features_path)
graph_features = importlib.util.module_from_spec(spec)
spec.loader.exec_module(graph_features)

# Meta-model estimator
meta_model_path = "/Workspace/Shared/Team 4_2/flight-departure-delay-predictive-modeling/notebooks/Feature Engineering/meta_model_estimator.py"
spec = importlib.util.spec_from_file_location("meta_model_estimator", meta_model_path)
meta_model_estimator = importlib.util.module_from_spec(spec)
spec.loader.exec_module(meta_model_estimator)

from pyspark.sql import functions as F
from pyspark.ml.feature import Imputer, StringIndexer, OneHotEncoder, VectorAssembler, SQLTransformer
from pyspark.ml import Pipeline
from xgboost.spark import SparkXGBRegressor


## Define Features

**Categorical Features**: Includes `origin` and `dest` airports directly (XGBoost handles high cardinality well), plus carrier and temporal features

**Numerical Features**: 
- Flight lineage features (rotation time, turnover time, etc.)
- **Graph features** (PageRank scores - added by GraphFeaturesEstimator)
- Meta-model predictions (predicted previous flight components - added by MetaModelEstimator)
- Weather variables (current origin)
- Flight characteristics (distance, scheduled times)
- `prev_flight_distance` (from top-performing approach)
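
To make the graph features concrete, here is a minimal power-iteration sketch of PageRank on a toy route network. It is not the `GraphFeaturesEstimator` implementation (that lives in `graph_features.py` and is not shown in this notebook). The airports, the route counts, the use of flight counts as edge weights, and the choice to normalize ranks so they sum to 1 are all assumptions made for illustration.

```python
from collections import defaultdict


def pagerank(edges, reset_probability=0.15, max_iter=30):
    """Power-iteration PageRank over (src, dst, weight) edges; ranks sum to 1."""
    nodes = sorted({n for src, dst, _ in edges for n in (src, dst)})
    out_weight = defaultdict(float)
    for src, _, w in edges:
        out_weight[src] += w

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_iter):
        # Every node receives the same "teleport" share
        nxt = {n: reset_probability / len(nodes) for n in nodes}
        # Rank held by nodes with no outgoing edges is spread uniformly
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        for src, dst, w in edges:
            nxt[dst] += (1 - reset_probability) * rank[src] * w / out_weight[src]
        for n in nodes:
            nxt[n] += (1 - reset_probability) * dangling / len(nodes)
        rank = nxt
    return rank


# Hypothetical routes: (origin, dest, number of flights)
routes = [
    ("ATL", "ORD", 120), ("ORD", "ATL", 110),
    ("ATL", "SFO", 60), ("SFO", "ATL", 55),
    ("ORD", "SFO", 40), ("SFO", "ORD", 45),
    ("BZN", "ORD", 5), ("ORD", "BZN", 5),
]

weighted = pagerank(routes)
unweighted = pagerank([(src, dst, 1.0) for src, dst, _ in routes])

for airport in sorted(weighted, key=weighted.get, reverse=True):
    print(f"{airport}: weighted={weighted[airport]:.4f}  unweighted={unweighted[airport]:.4f}")
```

The weighted variant lets busy routes pass on more rank, while the unweighted variant only reflects which airports are connected; this is the intuition behind keeping both as separate features.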


In [0]:
# Categorical features (XGBoost handles high cardinality well, so include origin/dest directly)
# Based on top-performing XGBoost approach
categorical_features = [
    'op_carrier',              # Carrier
    'origin',                  # Origin airport code (XGBoost handles ~200 categories well)
    'origin_state_abr',        # Origin state
    'dest',                    # Destination airport code (XGBoost handles ~200 categories well)
    'dest_state_abr',          # Destination state
    'day_of_week',             # Day of week
    'month',                   # Month
    'day_of_month',            # Day of month
    'dep_time_blk',            # Departure time block
    'arr_time_blk',            # Arrival time block
]

# Numerical features (graph features will be added by estimator)
raw_numerical_features = [
    # ============================================================================
    # Flight Lineage Features (pre-applied in split.py)
    # ============================================================================
    'lineage_rank',
    'safe_lineage_rotation_time_minutes',
    'scheduled_lineage_rotation_time_minutes',
    'scheduled_lineage_turnover_time_minutes',
    'prev_flight_scheduled_flight_time_minutes',
    'safe_required_time_prev_flight_minutes',
    'prev_flight_distance',    # From top-performing XGBoost approach
    
    # ============================================================================
    # Graph Features (added by GraphFeaturesEstimator)
    # Note: These will be added dynamically, listed here for reference
    # 'origin_pagerank_weighted',
    # 'origin_pagerank_unweighted',
    # 'dest_pagerank_weighted',
    # 'dest_pagerank_unweighted',
    
    # ============================================================================
    # Note: Meta-model predictions (predicted_prev_flight_*) are added by MetaModelEstimator
    # and do NOT need imputation (they're already predicted values)
    # They will be added to final_model_features after meta-models run
    # ============================================================================
    
    # ============================================================================
    # Weather Variables (Current Origin)
    # ============================================================================
    'hourlyprecipitation',
    'hourlysealevelpressure',
    'hourlyaltimetersetting',
    'hourlywetbulbtemperature',
    'hourlystationpressure',
    'hourlywinddirection',
    'hourlyrelativehumidity',
    'hourlywindspeed',
    'hourlydewpointtemperature',
    'hourlydrybulbtemperature',
    'hourlyvisibility',
    
    # ============================================================================
    # Flight Characteristics
    # ============================================================================
    'crs_elapsed_time',        # Scheduled elapsed time
    'distance',                # Flight distance
    'elevation',               # Airport elevation (if available)
]

# Graph feature column names (for reference, added by GraphFeaturesEstimator)
# All eight columns below are imputed and passed to the final VectorAssembler,
# so the main model sees every one of them as the pipeline is currently written.
# Meta models specifically need prev_flight_dest_* (for jumps where prev_flight_dest != origin)
graph_feature_cols = [
    'origin_pagerank_weighted',
    'origin_pagerank_unweighted',
    'dest_pagerank_weighted',
    'dest_pagerank_unweighted',
    'prev_flight_origin_pagerank_weighted',
    'prev_flight_origin_pagerank_unweighted',
    'prev_flight_dest_pagerank_weighted',   # Needed by meta models (air_time, total_duration) for jumps
    'prev_flight_dest_pagerank_unweighted'  # Largely redundant with origin_* for normal rotations
]

numerical_features = raw_numerical_features

print(f"Using {len(categorical_features)} categorical features:")
for i, feat in enumerate(categorical_features, 1):
    print(f"  {i}. {feat}")

print(f"\nUsing {len(numerical_features)} base numerical features:")
for i, feat in enumerate(numerical_features, 1):
    print(f"  {i}. {feat}")

print(f"\nGraph features (added by estimator): {len(graph_feature_cols)} features")


## Construct Model Pipeline

**Following optimized pipeline structure** with meta-model additions:

1. **Graph Features Estimator**: Builds graph, adds PageRank features (already fills NULLs to 0.0)
2. **Imputer**: Imputes numerical features + graph features (graph already filled but included for consistency)
3. **StringIndexer**: Indexes categorical features
4. **OneHotEncoder**: Encodes categorical features
5. **Meta-Model Estimator**: Predicts prev_flight components using preprocessed features
6. **VectorAssembler**: Combines processed features + meta-model predictions
7. **Log Transform**: Transforms DEP_DELAY to handle skew (`xgb_pipe_with_log` only)
8. **XGBoost Regressor**: Final model
9. **Inverse Log Transform**: Converts predictions back to minutes (`xgb_pipe_with_log` only)
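
Steps 3 and 4 decide how many slots each categorical column occupies in the final feature vector. The plain-Python sketch below mimics that behavior under stated assumptions: Spark's default `frequencyDesc` label ordering for `StringIndexer`, `handleInvalid="keep"` (as configured in this notebook), and the default `dropLast=True` for `OneHotEncoder`. The helper names are hypothetical and the carrier codes are example data.

```python
from collections import Counter


def fit_string_index(values):
    """Assign index 0 to the most frequent label; ties are broken alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot(value, index):
    """Encode one value; unseen labels fall into the extra 'keep' bucket.

    There are len(index) + 1 categories (seen labels plus the 'keep' bucket).
    Dropping the last one leaves len(index) slots, so unseen labels encode as all zeros.
    """
    size = len(index)
    idx = index.get(value, size)  # unseen label -> bucket index == size
    vec = [0.0] * size
    if idx < size:
        vec[idx] = 1.0
    return vec


carriers = ["DL", "DL", "AA", "UA", "AA", "DL"]
index = fit_string_index(carriers)
print(index)
print(one_hot("AA", index))
print(one_hot("ZZ", index))  # carrier not seen during fit
```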


In [0]:
# ============================================================================
# Build Pipeline (Optimized Structure)
# ============================================================================

# Step 1: Graph Features Estimator
# max_iter raised to 30 for better PageRank convergence.
# The estimator is run with a fixed iteration count (max_iter); GraphFrames' pageRank
# can alternatively take a tolerance (tol), which this notebook does not use.
# With ~200 airports, PageRank typically converges in 10-30 iterations.
# Runtime is fast (~10s for 3M with 10 iter), so 30 iter should still be <30s
graph_estimator = graph_features.GraphFeaturesEstimator(
    origin_col="origin",
    dest_col="dest",
    reset_probability=0.15,
    max_iter=30
)

# Step 2: Imputer (numerical features + graph features)
# Note: Graph features are already filled to 0.0 by GraphFeaturesEstimator,
# but including them in Imputer for consistency
imputer = Imputer(
    inputCols=numerical_features + graph_feature_cols,
    outputCols=[f"{col}_IMPUTED" for col in numerical_features + graph_feature_cols],
    strategy="mean"
)

# Step 3: StringIndexer (categorical features)
indexer = StringIndexer(
    inputCols=categorical_features,
    outputCols=[f"{col}_INDEX" for col in categorical_features],
    handleInvalid="keep"
)

# Step 4: OneHotEncoder (categorical features)
encoder = OneHotEncoder(
    inputCols=[f"{col}_INDEX" for col in categorical_features],
    outputCols=[f"{col}_VEC" for col in categorical_features]
)

# Step 5: Meta-Model Estimator (uses preprocessed features)
# Note: Meta-models need prev_flight features which will be processed by indexer/encoder above
# The meta-model estimator will automatically find and use the preprocessed versions
meta_model_est = meta_model_estimator.MetaModelEstimator(
    num_trees=50,
    max_depth=20,
    min_instances_per_node=20,
    use_preprocessed_features=True
)

# Step 6: VectorAssembler (processed current flight features + meta-model predictions)
# Final model uses: current flight processed features + 3 meta-model predictions
assembler = VectorAssembler(
    inputCols=[f"{col}_VEC" for col in categorical_features] + 
              [f"{col}_IMPUTED" for col in numerical_features + graph_feature_cols] +
              ['predicted_prev_flight_air_time', 'predicted_prev_flight_taxi_time', 'predicted_prev_flight_total_duration'],
    outputCol="features",
    handleInvalid="skip"
)

# Step 7: Log Transform
log_transform = SQLTransformer(
    statement="""
    SELECT *, 
           SIGN(DEP_DELAY) * LOG(ABS(DEP_DELAY) + 1.0) AS DEP_DELAY_log
    FROM __THIS__
    WHERE DEP_DELAY IS NOT NULL
    """
)

# Step 8: XGBoost Regressor (default - NO log transform)
xgb = SparkXGBRegressor(
    num_workers=sc.defaultParallelism,
    label_col="DEP_DELAY",  # Direct target, no log transform
    features_col="features",
    prediction_col="prediction",  # Direct prediction, no log space
    missing=0.0
)

# Step 8b: XGBoost Regressor (WITH log transform - for comparison)
xgb_with_log = SparkXGBRegressor(
    num_workers=sc.defaultParallelism,
    label_col="DEP_DELAY_log",
    features_col="features",
    prediction_col="prediction_log",
    missing=0.0
)

# Step 9: Inverse Log Transform
exp_transform = SQLTransformer(
    statement="""
    SELECT *, 
           GREATEST(
               CASE 
                   WHEN prediction_log IS NOT NULL THEN 
                       SIGN(prediction_log) * (EXP(ABS(prediction_log)) - 1.0)
                   ELSE 0.0
               END,
               0.0
           ) AS prediction
    FROM __THIS__
    """
)

# Build pipeline WITHOUT log transform (default)
xgb_pipe = Pipeline(stages=[
    graph_estimator,    # Step 1: Add graph features
    imputer,            # Step 2: Impute numerical + graph features
    indexer,            # Step 3: Index categorical features
    encoder,            # Step 4: Encode categorical features
    meta_model_est,     # Step 5: Meta-models (use preprocessed features)
    assembler,          # Step 6: Assemble final features
    # No log_transform - train directly on DEP_DELAY
    xgb                 # Step 8: XGBoost model (trained on raw target)
    # No exp_transform - predictions are already in original scale
])

# Build pipeline WITH log transform (for comparison)
xgb_pipe_with_log = Pipeline(stages=[
    graph_estimator,    # Step 1: Add graph features
    imputer,            # Step 2: Impute numerical + graph features
    indexer,            # Step 3: Index categorical features
    encoder,            # Step 4: Encode categorical features
    meta_model_est,     # Step 5: Meta-models (use preprocessed features)
    assembler,          # Step 6: Assemble final features
    log_transform,      # Step 7: Log transform target
    xgb_with_log,       # Step 8: XGBoost model (trained on log-transformed target)
    exp_transform       # Step 9: Inverse log transform (creates "prediction" column)
])

print("✓ Two pipelines built for comparison:")
print(f"  - xgb_pipe: No log transform (direct prediction)")
print(f"  - xgb_pipe_with_log: Uses log transform (handles skewed distribution)")
print(f"\n  Common features:")
print(f"  - {len(categorical_features)} categorical features")
print(f"  - {len(numerical_features)} numerical features")
print(f"  - {len(graph_feature_cols)} graph features")
print(f"  - 3 meta-model predictions")
print(f"\n  Usage: Train both and compare performance!")

## Run Cross-Validation


In [0]:
# # 3M Data
# cv_xgb_3M = cv.FlightDelayCV(
#     estimator=xgb_pipe,
#     version="3M"
# )
# cv_xgb_3M.fit()


In [0]:
# cv_xgb_3M.evaluate()


In [0]:
# 12M Data
cv_xgb_12M = cv.FlightDelayCV(
    estimator=xgb_pipe,
    version="12M"
)
cv_xgb_12M.fit()


In [0]:

cv_xgb_12M.evaluate()


In [0]:
# 60M Data
cv_xgb_60M = cv.FlightDelayCV(
    estimator=xgb_pipe,
    version="60M"
)
cv_xgb_60M.fit()


In [0]:

cv_xgb_60M.evaluate()


## Benchmarking Inference Time

In [0]:
benchmark_12M = cv_xgb_12M.benchmark_inference(dataset="fold_3_val")
benchmark_12M

In [0]:
benchmark_60M = cv_xgb_60M.benchmark_inference(dataset="fold_3_val")
benchmark_60M

## Detailed Evaluation (Optional)

Run these cells after the main training is complete for detailed analysis (feature importance, etc.)
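
One thing to watch when reading feature importances: the assembled `features` vector holds one slot per one-hot category, so it is usually much longer than the list of assembler input columns. Zipping a per-column name list against the importance array silently truncates and mislabels the result. The sketch below shows one way to expand names to one per slot and to fail loudly on a length mismatch. The slot counts are hypothetical; in practice they would have to come from the fitted encoder (for example its `categorySizes`, which should be verified against the Spark version in use).

```python
def expand_feature_names(vec_sizes, scalar_cols):
    """Build one name per vector slot, in assembler order (encoded columns, then scalars)."""
    names = []
    for col, size in vec_sizes.items():
        names.extend(f"{col}_VEC[{slot}]" for slot in range(size))
    names.extend(scalar_cols)
    return names


def top_features(names, importances, k=20):
    """Return the k most important (name, importance) pairs; refuse misaligned inputs."""
    if len(names) != len(importances):
        raise ValueError(
            f"{len(names)} names but {len(importances)} importances; labels would be misaligned"
        )
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical slot counts and importances, for illustration only
names = expand_feature_names({"op_carrier": 3, "month": 2}, ["distance_IMPUTED"])
importances = [0.10, 0.00, 0.20, 0.05, 0.05, 0.60]

for name, importance in top_features(names, importances, k=3):
    print(f"  {name:30s}: {importance:10.6f}")
```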


In [0]:
# Example: Detailed evaluation for 12M (modify for 3M or 60M as needed)
from xgboost.spark import SparkXGBRegressorModel

cv_obj = cv_xgb_12M  # Change to cv_xgb_3M or cv_xgb_60M as needed

print("=" * 80)
print("FEATURE IMPORTANCE (by fold)")
print("=" * 80)

# Assembler input column names (for reference).
# Caveat: each *_VEC column expands to one slot per category inside `features`,
# so these per-column names do not line up one-to-one with the importance array.
feature_names = [f"{col}_VEC" for col in categorical_features] + \
                [f"{col}_IMPUTED" for col in numerical_features + graph_feature_cols] + \
                ['predicted_prev_flight_air_time', 'predicted_prev_flight_taxi_time', 'predicted_prev_flight_total_duration']


def find_xgb_stage(pipeline_model):
    """Locate the fitted XGBoost stage (last in xgb_pipe, second-to-last in xgb_pipe_with_log)."""
    for stage in reversed(pipeline_model.stages):
        if isinstance(stage, SparkXGBRegressorModel):
            return stage
    raise ValueError("No fitted XGBoost stage found in pipeline")


def get_feature_importance(xgb_model):
    """Return the importance array, or None if it cannot be extracted."""
    try:
        return xgb_model.getFeatureImportances()
    except AttributeError:
        try:
            return xgb_model._xgb_skl_model.feature_importances_
        except Exception:
            return None


def print_top_features(xgb_model, title, k=20):
    feature_importance = get_feature_importance(xgb_model)
    if feature_importance is None:
        print("    ⚠ Warning: Could not extract feature importance")
        return
    if len(feature_importance) != len(feature_names):
        print(f"    ⚠ Warning: {len(feature_importance)} importances vs "
              f"{len(feature_names)} names; labels below are misaligned")
    # Sort by importance
    importance_pairs = sorted(zip(feature_names, feature_importance),
                              key=lambda x: x[1], reverse=True)
    print(title)
    for name, importance in importance_pairs[:k]:
        print(f"  {name:50s}: {importance:10.6f}")


for i, model in enumerate(cv_obj.models):
    print(f"\n--- Fold {i+1} ---")
    print_top_features(find_xgb_stage(model), "Top 20 Features:")

# Test set feature importance
print("\n" + "=" * 80)
print("TEST MODEL FEATURE IMPORTANCE")
print("=" * 80)

print_top_features(find_xgb_stage(cv_obj.test_model), "\nTop 20 Features (Test Model):")

# Test set metrics summary
test_results = cv_obj.evaluate()


def fmt_metric(value):
    """Format a metric to 4 decimals; fall back to 'N/A' when it is missing."""
    try:
        return f"{float(value):.4f}"
    except (TypeError, ValueError):
        return "N/A"


print("\n" + "=" * 80)
print("Test Model Summary:")
print(f"  RMSE: {fmt_metric(test_results.get('rmse'))}")
print(f"  MAE: {fmt_metric(test_results.get('mae'))}")
print(f"  R²: {fmt_metric(test_results.get('r2'))}")
print(f"  OTPA: {fmt_metric(test_results.get('otpa'))}")
print(f"  SDDR: {fmt_metric(test_results.get('sddr'))}")
