# Time-Series Feature Engineering for Flight Delay Prediction

This notebook uses Prophet models to generate time-series-based conditional expected values.
These features capture temporal trends and seasonality, and they feed into the Flight Lineage feature engineering.

## Features Generated:

### Time-Series Aggregations:
1. **Global**: All flights aggregated by date
2. **Per Airport**: Aggregated by origin airport and date
3. **Per Airline (Carrier)**: Aggregated by carrier and date
4. **Per Airport & Airline**: Aggregated by carrier-airport and date
5. **Per Route (Origin-Dest)**: Aggregated by route and date (for air time)
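
The five levels differ only in their grouping keys. The pandas sketch below illustrates this on a toy frame. Only `FL_DATE` appears in this notebook; the column names `OP_UNIQUE_CARRIER`, `ORIGIN`, `DEST`, and `DEP_DELAY`, the helper `daily_series`, and the choice of the mean as the daily aggregate are all assumptions made for illustration.

```python
import pandas as pd

# Grouping keys per aggregation level. Column names other than FL_DATE
# are assumed for illustration and may differ in the real dataset.
AGG_LEVELS = {
    "global": [],
    "airport": ["ORIGIN"],
    "carrier": ["OP_UNIQUE_CARRIER"],
    "carrier_airport": ["OP_UNIQUE_CARRIER", "ORIGIN"],
    "route": ["ORIGIN", "DEST"],
}

def daily_series(df, level, value_col):
    """Return one row per (group keys, date) holding the mean of value_col."""
    keys = AGG_LEVELS[level] + ["FL_DATE"]
    return (
        df.groupby(keys, as_index=False)[value_col]
        .mean()
        # Prophet expects the date column to be named ds and the target y.
        .rename(columns={"FL_DATE": "ds", value_col: "y"})
    )

# Toy data standing in for the flight table.
flights = pd.DataFrame({
    "FL_DATE": pd.to_datetime(
        ["2019-01-01", "2019-01-01", "2019-01-01", "2019-01-02"]
    ),
    "OP_UNIQUE_CARRIER": ["AA", "AA", "DL", "AA"],
    "ORIGIN": ["JFK", "JFK", "ATL", "JFK"],
    "DEST": ["LAX", "SFO", "JFK", "LAX"],
    "DEP_DELAY": [10.0, 20.0, 0.0, 6.0],
})

global_daily = daily_series(flights, "global", "DEP_DELAY")
carrier_daily = daily_series(flights, "carrier", "DEP_DELAY")
```

On the full dataset the same grouping would more likely be done in Spark before converting each series to pandas; the logic is unchanged.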

### Features of Interest:
- **Departure Delay**: Expected departure delays (conditional on carrier, airport, time)
- **Arrival Delay**: Expected arrival delays
- **Air Time**: Expected flight time from origin to destination
- **Taxi Times**: Expected taxi-in and taxi-out times
- **Turnover Time**: Time between an aircraft's arrival and its next departure (requires joining each flight to its prior leg; see note at end)
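
Turnover time is the only feature in this list that cannot be read from a single flight record. One possible way to derive it, sketched in pandas, is to order each aircraft's legs by departure and subtract the previous leg's arrival. The column names `TAIL_NUM`, `dep_ts`, and `arr_ts` are hypothetical and not taken from this notebook.

```python
import pandas as pd

# Toy legs for two aircraft; all column names are assumed for illustration.
legs = pd.DataFrame({
    "TAIL_NUM": ["N1", "N1", "N2", "N1"],
    "dep_ts": pd.to_datetime([
        "2019-01-01 08:00", "2019-01-01 12:00",
        "2019-01-01 09:00", "2019-01-01 17:30",
    ]),
    "arr_ts": pd.to_datetime([
        "2019-01-01 10:30", "2019-01-01 15:00",
        "2019-01-01 11:00", "2019-01-01 20:00",
    ]),
})

# Order each aircraft's legs chronologically.
legs = legs.sort_values(["TAIL_NUM", "dep_ts"]).reset_index(drop=True)

# Arrival time of the same aircraft's previous leg (NaT for its first leg).
legs["prev_arr_ts"] = legs.groupby("TAIL_NUM")["arr_ts"].shift(1)

# Minutes on the ground between the prior arrival and this departure.
legs["turnover_min"] = (
    (legs["dep_ts"] - legs["prev_arr_ts"]).dt.total_seconds() / 60
)
```

The first leg of each aircraft has no prior arrival, so its turnover is missing and needs separate handling downstream.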

### Prophet Models:
Each aggregation level uses Prophet to capture:
- Long-term trends
- Weekly seasonality
- Yearly seasonality (if sufficient data)
- Date-specific forecasts
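
A minimal sketch of one such model follows, assuming a daily series with Prophet's expected `ds` (date) and `y` (value) columns. The helper names and the 730-day threshold for enabling yearly seasonality are illustrative assumptions, not settings taken from this notebook. Prophet is imported inside the fitting function so that the coverage check can be run on its own.

```python
import pandas as pd

def has_yearly_coverage(series, min_days=730):
    """True if the series spans at least min_days between first and last date."""
    span = series["ds"].max() - series["ds"].min()
    return span >= pd.Timedelta(days=min_days)

def fit_daily_prophet(series):
    """Fit Prophet on a daily series with columns ds (date) and y (value)."""
    from prophet import Prophet  # lazy import; only needed when fitting

    model = Prophet(
        weekly_seasonality=True,
        # Enable yearly seasonality only when enough history is available.
        yearly_seasonality=has_yearly_coverage(series),
        # Data is aggregated to one point per day, so no intra-day pattern.
        daily_seasonality=False,
    )
    model.fit(series[["ds", "y"]])
    return model

def date_forecasts(model, series):
    """Return the fitted expected value (yhat) and trend for each date."""
    forecast = model.predict(series[["ds"]])
    return forecast[["ds", "yhat", "trend"]]

# Toy series used only to exercise the coverage check.
short = pd.DataFrame({"ds": pd.date_range("2019-01-01", periods=90), "y": 0.0})
long = pd.DataFrame({"ds": pd.date_range("2018-01-01", periods=731), "y": 0.0})
```

With a real series, `fit_daily_prophet(series)` followed by `date_forecasts(model, series)` would yield one expected value per date that can be joined back to flights as a feature.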


In [None]:
# Dependencies
import importlib.util
import sys

# Load cv module
cv_path = "/Workspace/Shared/Team 4_2/flight-departure-delay-predictive-modeling/notebooks/Cross Validator/cv.py"
spec = importlib.util.spec_from_file_location("cv", cv_path)
cv = importlib.util.module_from_spec(spec)
spec.loader.exec_module(cv)

# Dependencies for time series features
from pyspark.sql import functions as F
from pyspark.sql.functions import col, to_timestamp, when, hour, dayofweek, month
import pandas as pd
import numpy as np
from prophet import Prophet

# Dependencies for visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Other dependencies
import time

# Path for persistent storage
FOLDER_PATH = "dbfs:/mnt/mids-w261/student-groups/Group_4_2/experiments"


## Load Data

Load the final fold's training and validation sets, which together span 2 years, for time-series analysis.


In [None]:
# Load from data_loader and save snapshot (run once)
ts_data_path = f"{FOLDER_PATH}/timeseries_data_snapshot.parquet"

print("Loading from data_loader and saving snapshot...")
start = time.time()
data_loader = cv.FlightDelayDataLoader()
data_loader.load()
folds = data_loader.get_version("60M")

# Use the final fold and combine its training and validation sets to get
# 2 years of data. This allows us to learn yearly seasonality without data leakage.
train_df, val_df = folds[-1]

# unionByName matches columns by name, so a difference in column order
# between the two DataFrames cannot silently misalign values.
ts_data = train_df.unionByName(val_df)

# Check partition count and repartition if needed
num_partitions = ts_data.rdd.getNumPartitions()
if num_partitions > 500:
    ts_data = ts_data.coalesce(200)
elif num_partitions < 10:
    ts_data = ts_data.repartition(50)

# Save snapshot
ts_data.write.mode("overwrite").parquet(ts_data_path)
print(f"Saved snapshot in {time.time() - start:.2f} seconds")

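# Compute the row count and date range in one pass; .first() returns a single
# Row, so the dates print cleanly instead of as a list of Row objects.
stats = ts_data.agg(
    F.count(F.lit(1)).alias("n"),
    F.min("FL_DATE").alias("start"),
    F.max("FL_DATE").alias("end"),
).first()
print(f"\nTime series data: {stats['n']:,} flights")
print(f"Date range: {stats['start']} to {stats['end']}")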
