# Flight Sequence Feature Engineering

This notebook engineers features from the **Flight Sequence**: the ordered flights an aircraft operates in a day. The features track how delays compound as a plane moves from one flight to the next.

## Graph/Network Representation

### Formal Structure:
- **Nodes**: Individual flights (each flight is a node with attributes: origin, dest, scheduled/actual times, delays, etc.)
- **Edges**: Connect consecutive flights in a **Flight Sequence** (same aircraft/tail_num, same day)
  - Edge represents the relationship: "previous flight → current flight"
  - **Edge Weight**: Air time (actual or expected) of the previous flight
- **Flight Sequence**: The ordered sequence of flights an aircraft operates in a day (e.g., A → B → C)

### Node Attributes:
- `origin`: Origin airport
- `dest`: Destination airport  
- `scheduled_dep_time`: Scheduled departure time
- `scheduled_arr_time`: Scheduled arrival time
- `actual_dep_time`: Actual departure time (if available)
- `actual_arr_time`: Actual arrival time (if available)
- `dep_delay`: Departure delay
- `arr_delay`: Arrival delay

### Edge Attributes:
- `air_time`: Actual or expected flight time (edge weight)
- `turn_time`: Time between the previous flight's arrival and the next flight's departure at the same airport
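
The node and edge definitions above can be sketched in plain Python. This is a minimal illustration rather than the notebook's implementation: the toy records, the `fl_date` key, and the `build_edges` helper are hypothetical, and `turn_time` is computed from scheduled times as an assumption. In the notebook itself the same pairing would more likely be done with `lag` over a `Window` partitioned by `tail_num` and `FL_DATE`.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical toy records; field names follow the node attributes above.
flights = [
    {"tail_num": "N123", "fl_date": "2019-01-01", "origin": "B", "dest": "C",
     "scheduled_dep_time": datetime(2019, 1, 1, 11, 0),
     "scheduled_arr_time": datetime(2019, 1, 1, 12, 30), "air_time": 70},
    {"tail_num": "N123", "fl_date": "2019-01-01", "origin": "A", "dest": "B",
     "scheduled_dep_time": datetime(2019, 1, 1, 8, 0),
     "scheduled_arr_time": datetime(2019, 1, 1, 10, 0), "air_time": 100},
]


def build_edges(flights):
    """Link consecutive flights of the same aircraft on the same day."""
    # Group flights into one sequence per (aircraft, day).
    sequences = defaultdict(list)
    for f in flights:
        sequences[(f["tail_num"], f["fl_date"])].append(f)

    edges = []
    for seq in sequences.values():
        # Order each sequence by scheduled departure time.
        seq.sort(key=lambda f: f["scheduled_dep_time"])
        for prev, cur in zip(seq, seq[1:]):
            # Assumption: turn time is measured on scheduled times.
            turn = cur["scheduled_dep_time"] - prev["scheduled_arr_time"]
            edges.append({
                "prev": (prev["origin"], prev["dest"]),
                "cur": (cur["origin"], cur["dest"]),
                "air_time": prev["air_time"],            # edge weight
                "turn_time": turn.total_seconds() / 60,  # minutes
            })
    return edges


edges = build_edges(flights)
```

Each edge carries the previous flight's air time as its weight, so a sequence A → B → C yields two edges.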

## Deterministic Prediction Formula

For a **Flight Sequence A → B** where B has not yet departed:

**Expected Departure Time of B** =
- **Actual Departure Time of A** (used only if known at least 2 hours before B's scheduled departure; otherwise the scheduled departure time of A)
- **+ Expected Air Time of A** (conditional on weather, aircraft type, route)
- **+ Expected Turn Time at B's origin airport** (conditional on carrier, airport, time of day)

**Impossible On-Time Flag** = 1 if Expected Departure Time of B > Scheduled Departure Time of B, else 0
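
As a worked example of the formula, the sketch below uses plain `datetime` arithmetic. The function names and the example numbers (100 minutes of air time, 45 minutes of turn time) are hypothetical and only illustrate the calculation.

```python
from datetime import datetime, timedelta


def expected_departure(dep_a, expected_air_min, expected_turn_min):
    """Departure of A + expected air time + expected turn time."""
    return dep_a + timedelta(minutes=expected_air_min + expected_turn_min)


def impossible_on_time(expected_dep_b, scheduled_dep_b):
    """1 if B cannot leave on time under the expectation, else 0."""
    return int(expected_dep_b > scheduled_dep_b)


# Hypothetical numbers: A left at 08:40, B is scheduled for 11:00.
actual_dep_a = datetime(2019, 1, 1, 8, 40)
scheduled_dep_b = datetime(2019, 1, 1, 11, 0)

exp_dep_b = expected_departure(actual_dep_a, expected_air_min=100, expected_turn_min=45)
flag = impossible_on_time(exp_dep_b, scheduled_dep_b)
```

Here 08:40 + 100 min + 45 min gives 11:05, which is later than the 11:00 schedule, so the flag is 1.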

## Key Concepts:

1. **Flight Sequence**: Ordered sequence of flights by same aircraft in a day (A → B → C)
2. **Previous Flight/Leg**: The flight immediately before the current one in the sequence
3. **Cumulative Delay**: Total delay accumulated since the aircraft's first flight of the day (the operating day is typically counted from 3 AM)
4. **Conditional Dependencies**:
   - **Turn Time**: Varies by carrier, airport, time of day, aircraft type
   - **Air Time**: Varies by weather (wind speed/direction), aircraft type, route
   - **Taxi Time**: Varies by airport, runway configuration, gate position, time of day
5. **Data Leakage Prevention**: Only use a previous flight's actual departure time if it falls at least 2 hours before the current flight's scheduled departure
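
The 2-hour rule can be written as a small guard. This is a sketch under one assumption: a departure time counts as "known" from the moment the departure happens. The helper name `usable_actual_dep` and the `CUTOFF` constant are hypothetical.

```python
from datetime import datetime, timedelta

CUTOFF = timedelta(hours=2)  # prediction horizon taken from the rule above


def usable_actual_dep(prev_actual_dep, cur_scheduled_dep, cutoff=CUTOFF):
    """Return the previous flight's actual departure only if it is not leaked.

    The value is kept when it falls at least `cutoff` before the current
    flight's scheduled departure; otherwise None is returned so that callers
    fall back to scheduled information.
    """
    if prev_actual_dep is None:
        return None
    if prev_actual_dep <= cur_scheduled_dep - cutoff:
        return prev_actual_dep
    return None
```

For a flight scheduled at 11:00, a previous departure at 09:00 or earlier is kept, while one at 09:30 is masked.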

In [0]:
# Dependencies
import importlib.util
import sys

# Load cv module
cv_path = "/Workspace/Shared/Team 4_2/flight-departure-delay-predictive-modeling/notebooks/Cross Validator/cv.py"
spec = importlib.util.spec_from_file_location("cv", cv_path)
cv = importlib.util.module_from_spec(spec)
spec.loader.exec_module(cv)

from pyspark.sql import functions as F
from pyspark.sql.functions import col, to_timestamp, when, lag, sum as spark_sum, count, avg, min as spark_min, max as spark_max
from pyspark.sql.window import Window
import pandas as pd
import time

# Path for persistent storage
FOLDER_PATH = "dbfs:/mnt/mids-w261/student-groups/Group_4_2/experiments"

## Load Data

### Column Names (Verified from Dataset)

Based on actual data exploration, the following columns are available for Flight Sequence feature engineering:

**Flight Identification:**
- `tail_num`: Tail number (aircraft identifier) - **used to track Flight Sequences**
- `FL_DATE`: Flight date (uppercase, date format)
- `op_carrier`: Operating carrier code
- `op_carrier_fl_num`: Flight number
- `origin`: Origin airport code
- `dest`: Destination airport code

**Scheduled Times:**
- `crs_dep_time`: Scheduled departure time (integer HHMM format, e.g., 1158 = 11:58)
- `crs_arr_time`: Scheduled arrival time (integer HHMM format)
- `crs_elapsed_time`: Scheduled elapsed time (minutes)
- `sched_depart_date_time`: Scheduled departure datetime (if created from `FL_DATE` + `crs_dep_time`)

**Actual Times:**
- `dep_time`: Actual departure time (integer HHMM format, may be NULL, e.g., 1151 = 11:51)
- `arr_time`: Actual arrival time (integer HHMM format, may be NULL)
- `actual_elapsed_time`: Actual elapsed time (minutes)
- `wheels_off`: Wheels off time (if available)
- `wheels_on`: Wheels on time (if available)

**Time Components (verify availability):**
- `air_time`: Air time (minutes) - flight time in the air
- `taxi_out`: Taxi-out time (minutes) - from gate to wheels off
- `taxi_in`: Taxi-in time (minutes) - from wheels on to gate

**Delays:**
- `DEP_DELAY`: Departure delay (minutes, uppercase - likely the label column)
- `dep_delay_new`: Alternative departure delay calculation
- `ARR_DELAY`: Arrival delay (minutes, uppercase)
- `arr_delay`: Arrival delay (minutes, lowercase)
- `arr_delay_new`: Alternative arrival delay calculation

**Other:**
- `distance`: Flight distance (miles, if available)
- `cancelled`: Cancellation indicator (if available)
- `diverted`: Diversion indicator (if available)

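**Note:** Column names are a mix of uppercase and lowercase. Reference columns with the exact case shown here. Be aware that Spark resolves column names case-insensitively by default (`spark.sql.caseSensitive` is `false`), so pairs such as `ARR_DELAY` and `arr_delay` may resolve to the same column.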
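
Because `crs_dep_time`, `dep_time`, and the other clock columns are HHMM integers, they have to be combined with `FL_DATE` before any time arithmetic. The sketch below shows one way to do that in plain Python. The helper name `hhmm_to_datetime` is hypothetical, and treating `2400` as midnight of the following day is an assumption about how the dataset encodes midnight.

```python
from datetime import date, datetime, timedelta


def hhmm_to_datetime(fl_date, hhmm):
    """Combine a flight date with an HHMM integer (e.g. 1158 -> 11:58)."""
    if hhmm is None:
        return None  # actual times may be NULL
    hours, minutes = divmod(int(hhmm), 100)
    midnight = datetime(fl_date.year, fl_date.month, fl_date.day)
    # Adding a timedelta (instead of building the time directly) lets a
    # value of 2400 roll over to 00:00 on the next day.
    return midnight + timedelta(hours=hours, minutes=minutes)


dep = hhmm_to_datetime(date(2019, 1, 1), 1158)
```

This sketch does not handle flights that arrive after midnight: an `arr_time` smaller than the departure time belongs to the following day and would need an extra day added.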

In [0]:
# Load data for Flight Sequence ("lineage") feature engineering
# We need tail_num to track each aircraft, plus scheduled and actual times
lineage_data_path = f"{FOLDER_PATH}/lineage_data_snapshot.parquet"

print("Loading data for lineage feature engineering...")
start = time.time()
data_loader = cv.FlightDelayDataLoader()
data_loader.load()
folds = data_loader.get_version("3M")  # Start with 3M for faster iteration

# Use first fold for now
train_df, val_df = folds[0]
lineage_data = train_df

# Check partition count and repartition if needed
num_partitions = lineage_data.rdd.getNumPartitions()
if num_partitions > 500:
    lineage_data = lineage_data.coalesce(200)
elif num_partitions < 10:
    lineage_data = lineage_data.repartition(50)

# Save snapshot
lineage_data.write.mode("overwrite").parquet(lineage_data_path)
print(f"Saved snapshot in {time.time() - start:.2f} seconds")

print(f"\nLineage data: {lineage_data.count():,} flights")
print(f"Date range: {lineage_data.agg(F.min('FL_DATE'), F.max('FL_DATE')).collect()}")

In [0]:
# Load from saved snapshot (run this on subsequent runs)
lineage_data_path = f"{FOLDER_PATH}/lineage_data_snapshot.parquet"

print(f"Loading lineage data from {lineage_data_path}...")
start = time.time()
lineage_data = spark.read.parquet(lineage_data_path)
num_flights = lineage_data.count()  # Trigger the read so the timing is meaningful
print(f"Loaded in {time.time() - start:.2f} seconds")

print(f"\nLineage data: {num_flights:,} flights")
print(f"Date range: {lineage_data.agg(F.min('FL_DATE'), F.max('FL_DATE')).collect()}")

## Prepare Data for Lineage Analysis

We need:
- `tail_num`: To track the same plane across flights
- `FL_DATE`: To group flights by day
- Scheduled and actual departure/arrival times
- Flight duration information

In [0]:
# Check what columns are available
print("Checking available columns for lineage analysis...")
print(f"Total columns: {len(lineage_data.columns)}")

# Look for tail number and time columns
tail_cols = [c for c in lineage_data.columns if 'tail' in c.lower()]
time_cols = [c for c in lineage_data.columns if any(term in c.lower() for term in ['time', 'dep', 'arr', 'crs', 'sched'])]

print(f"\nTail number columns: {tail_cols}")
print(f"\nTime-related columns (first 20): {time_cols[:20]}")

# Check for specific time component columns needed for feature engineering
time_component_cols = ['air_time', 'taxi_in', 'taxi_out', 'wheels_off', 'wheels_on']
available_time_components = [c for c in time_component_cols if c in lineage_data.columns]
missing_time_components = [c for c in time_component_cols if c not in lineage_data.columns]

print(f"\nTime component columns:")
print(f"  Available: {available_time_components}")
if missing_time_components:
    print(f"  Missing: {missing_time_components}")

# Sample data to see structure
print("\nSample row:")
display_cols = tail_cols + ['FL_DATE', 'origin', 'dest', 'crs_dep_time', 'dep_time', 
                            'crs_arr_time', 'arr_time', 'DEP_DELAY', 'ARR_DELAY']
# Add available time components
display_cols.extend(available_time_components)
# Filter to only columns that exist
display_cols = [c for c in display_cols if c in lineage_data.columns]
lineage_data.select(display_cols).limit(5).show(truncate=False)

## Feature Engineering: Flight Sequence Features

### Features to Create:

#### 1. Previous Flight/Leg Information
- `prev_flight_arr_delay`: Previous flight arrival delay
- `prev_flight_dep_delay`: Previous flight departure delay
- `prev_flight_actual_dep_time`: Previous flight actual departure time (if >= 2 hours before current scheduled departure)
- `prev_flight_actual_arr_time`: Previous flight actual arrival time (if available)
- `prev_flight_origin`: Origin airport of previous flight
- `prev_flight_dest`: Destination airport of previous flight (should match current origin)

#### 2. Cumulative Delay Features
- `cumulative_delay_since_3am`: Total delay accumulated by the plane's previous flights in the day
- `num_previous_flights_today`: Number of flights the plane has already completed today
- `avg_delay_per_previous_flight`: Mean delay across previous flights
- `max_delay_previous_flights`: Maximum delay in previous flights
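
The four cumulative features can be illustrated on a single aircraft-day. This is a sketch with several assumptions: delays are departure delays in minutes, the list already covers one operating day ordered by scheduled departure, there are no NULLs, and the 2-hour leakage cutoff is ignored for brevity. The helper name is hypothetical.

```python
def cumulative_delay_features(dep_delays):
    """Per-flight features built only from earlier flights of the same day.

    dep_delays: one aircraft's delays (minutes), ordered by scheduled departure.
    """
    features = []
    for i in range(len(dep_delays)):
        previous = dep_delays[:i]  # strictly earlier flights; excludes flight i
        n = len(previous)
        features.append({
            "cumulative_delay_since_3am": sum(previous),
            "num_previous_flights_today": n,
            "avg_delay_per_previous_flight": sum(previous) / n if n else None,
            "max_delay_previous_flights": max(previous) if n else None,
        })
    return features


feats = cumulative_delay_features([10, 30, -5])
```

The current flight's own delay is excluded from its features. In Spark the same effect comes from a window frame such as `rowsBetween(Window.unboundedPreceding, -1)`. A production version would also need to drop earlier flights whose outcomes are not yet known 2 hours before departure.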

#### 3. Conditional Turn Time Features
- `expected_turn_time_carrier_airport`: Average time between arrival and departure for this carrier at this airport
- `expected_turn_time_carrier_airport_time`: Conditional on time of day (morning/afternoon/evening)
- `expected_turn_time_carrier_airport_aircraft`: Conditional on aircraft type
- `actual_turn_time`: Actual time between previous flight arrival and current scheduled departure (if previous flight has arrived)
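
A conditional expectation such as `expected_turn_time_carrier_airport` is a grouped mean. The sketch below shows the idea in plain Python; the helper name and the sample observations are hypothetical. The notebook would more likely compute the same quantity with `groupBy(...).agg(avg(...))`.

```python
from collections import defaultdict


def expected_turn_time(observations):
    """Mean turn time in minutes per (carrier, airport).

    observations: iterable of (carrier, airport, turn_minutes) tuples,
    where turn_minutes may be None for flights with no usable turn.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for carrier, airport, minutes in observations:
        if minutes is None:
            continue  # skip missing turns instead of counting them as 0
        acc = totals[(carrier, airport)]
        acc[0] += minutes
        acc[1] += 1
    return {key: total / n for key, (total, n) in totals.items()}


# Hypothetical observations for illustration only.
lookup = expected_turn_time([
    ("AA", "ORD", 40), ("AA", "ORD", 60), ("DL", "ATL", 35), ("AA", "ORD", None),
])
```

Adding time-of-day or aircraft type to the key gives the more specific variants listed above. To stay consistent with the leakage rule, these averages would presumably be fitted on the training fold only and then joined onto validation data.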

#### 4. Conditional Air Time Features (Edge Weights)
- `expected_air_time_route`: Average air time for this origin-destination pair
- `expected_air_time_route_weather`: Conditional on wind speed, wind direction, weather conditions
- `expected_air_time_route_aircraft`: Conditional on aircraft type
- `expected_air_time_route_time_of_day`: Conditional on time of day (affects air traffic, congestion)
- `expected_air_time_route_time_of_year`: Conditional on time of year (affects seasonal wind patterns, jet streams)
- `expected_air_time_route_weather_time`: Conditional on weather AND time of day/year (combined effects)
- `prev_flight_actual_air_time`: Actual air time of previous flight (if completed)

#### 5. Deterministic "Impossible On-Time" Features
- `expected_arrival_time_prev_flight`: Previous actual departure (or scheduled if not available) + expected air time
- `expected_departure_time_current`: Expected arrival time of previous flight + expected turn time at current airport
- `time_buffer`: Scheduled departure time - Expected departure time (can be negative)
- `impossible_on_time_flag`: Binary (1 if Expected departure time > Scheduled departure time, else 0)
- `minutes_until_departure`: Minutes between the previous flight's expected arrival and the current flight's scheduled departure (can be negative)
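
The five features in this group can be derived together, including the fallback to the scheduled departure. This is a sketch: the function name and example values are hypothetical, and `prev_actual_dep` is assumed to be `None` whenever the actual time is missing or fails the 2-hour leakage rule.

```python
from datetime import datetime, timedelta


def impossible_on_time_features(prev_sched_dep, prev_actual_dep, cur_sched_dep,
                                expected_air_min, expected_turn_min):
    """Build the deterministic features for one previous -> current pair."""
    # Fall back to the schedule when no usable actual departure exists.
    prev_dep = prev_actual_dep if prev_actual_dep is not None else prev_sched_dep
    expected_arrival = prev_dep + timedelta(minutes=expected_air_min)
    expected_departure = expected_arrival + timedelta(minutes=expected_turn_min)

    def minutes(delta):
        return delta.total_seconds() / 60

    return {
        "expected_arrival_time_prev_flight": expected_arrival,
        "expected_departure_time_current": expected_departure,
        "time_buffer": minutes(cur_sched_dep - expected_departure),
        "impossible_on_time_flag": int(expected_departure > cur_sched_dep),
        "minutes_until_departure": minutes(cur_sched_dep - expected_arrival),
    }


# Hypothetical pair: previous flight scheduled 08:00, current flight 11:00.
on_schedule = impossible_on_time_features(
    datetime(2019, 1, 1, 8, 0), None, datetime(2019, 1, 1, 11, 0), 100, 45)
late = impossible_on_time_features(
    datetime(2019, 1, 1, 8, 0), datetime(2019, 1, 1, 8, 40),
    datetime(2019, 1, 1, 11, 0), 100, 45)
```

With no actual departure the buffer is 35 minutes and the flag is 0. With a 40-minute late departure the buffer drops to -5 minutes and the flag becomes 1. `time_buffer` is negative exactly when the flag is set, while `minutes_until_departure` measures the slack available for the turn itself.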

#### 6. Taxi Time Features
- `expected_taxi_in_time_airport`: Average taxi time from landing to gate at destination airport
- `expected_taxi_out_time_airport`: Average taxi time from gate to takeoff at origin airport
- `expected_taxi_time_airport_time`: Conditional on time of day