# Flight Sequence Feature Engineering

This notebook engineers features based on the **Flight Sequence** - tracking how delays compound as planes travel through multiple flights in a day.

## Graph/Network Representation

### Formal Structure:
- **Nodes**: Individual flights (each flight is a node with attributes: origin, dest, scheduled/actual times, delays, etc.)
- **Edges**: Connect consecutive flights in a **Flight Sequence** (same aircraft/tail_num, same day)
  - Edge represents the relationship: "previous flight → current flight"
  - **Edge Weight**: Air time (actual or expected) of the previous flight
- **Flight Sequence**: The ordered sequence of flights an aircraft operates in a day (e.g., A → B → C)

### Node Attributes:
- `origin`: Origin airport
- `dest`: Destination airport  
- `scheduled_dep_time`: Scheduled departure time
- `scheduled_arr_time`: Scheduled arrival time
- `actual_dep_time`: Actual departure time (if available)
- `actual_arr_time`: Actual arrival time (if available)
- `dep_delay`: Departure delay
- `arr_delay`: Arrival delay

### Edge Attributes:
- `air_time`: Actual or expected flight time (edge weight)
- `turn_time`: Time between arrival at destination and next departure
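
To make the structure concrete, here is a minimal plain-Python sketch (not the PySpark pipeline used later in this notebook) that links consecutive flights of one aircraft on one day. The `Flight` class, the `build_edges` helper, and the sample values are hypothetical. Times are simplified to minutes since midnight, and time zones are ignored.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Optional, Tuple


@dataclass
class Flight:
    """One node of the graph (illustrative subset of the node attributes)."""
    tail_num: str
    fl_date: str
    origin: str
    dest: str
    sched_dep_min: int         # scheduled departure, minutes since midnight
    sched_arr_min: int         # scheduled arrival, minutes since midnight
    air_time: Optional[float]  # minutes in the air, None if unknown


def build_edges(flights: List[Flight]) -> List[Tuple[Flight, Flight, dict]]:
    """Connect consecutive flights of the same aircraft on the same day."""
    ordered = sorted(flights, key=lambda f: (f.tail_num, f.fl_date, f.sched_dep_min))
    edges = []
    for _, group in groupby(ordered, key=lambda f: (f.tail_num, f.fl_date)):
        seq = list(group)
        for prev, cur in zip(seq, seq[1:]):
            edges.append((prev, cur, {
                "air_time": prev.air_time,                            # edge weight
                "turn_time": cur.sched_dep_min - prev.sched_arr_min,  # scheduled turn
            }))
    return edges


# Hypothetical sample: one aircraft flies two legs, another flies a single leg
flights = [
    Flight("N12345", "2023-01-15", "SFO", "SEA", 630, 750, 100.0),
    Flight("N12345", "2023-01-15", "LAX", "SFO", 480, 570, 60.0),
    Flight("N67890", "2023-01-15", "JFK", "BOS", 600, 675, 45.0),
]
edges = build_edges(flights)
```

An aircraft with a single flight in the day contributes a node but no edge, which is why first flights have no previous-flight features.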

## Deterministic Prediction Formula

For a **Flight Sequence A → B** where B has not been realized yet:

**Expected Departure Time of B** = 
- **Actual Departure Time of A** (used only if it is known at least 2 hours before B's scheduled departure)
- **+ Expected Air Time** of A (conditional on weather, aircraft type, route)
- **+ Expected Turn Time** at B's origin airport (conditional on carrier, airport, time of day)

**Impossible On-Time Flag** = 1 if Expected Departure Time of B > Scheduled Departure Time of B, else 0
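
As a worked example of the formula, the sketch below uses plain Python with all times in minutes since midnight. The function names and the numbers are illustrative assumptions, not values from the dataset.

```python
from typing import Optional


def expected_departure_min(prev_actual_dep_min: Optional[float],
                           expected_air_time: Optional[float],
                           expected_turn_time: Optional[float]) -> Optional[float]:
    """Expected departure of B = actual departure of A + air time + turn time."""
    parts = (prev_actual_dep_min, expected_air_time, expected_turn_time)
    if any(p is None for p in parts):
        return None  # not enough information for a deterministic estimate
    return prev_actual_dep_min + expected_air_time + expected_turn_time


def impossible_on_time_flag(expected_dep_min: Optional[float],
                            sched_dep_min: float) -> Optional[int]:
    """1 if B cannot leave on time under the estimate, 0 if it can, None if unknown."""
    if expected_dep_min is None:
        return None
    return 1 if expected_dep_min > sched_dep_min else 0


# A left at 10:15 (615 min); assume 95 min in the air and a 45 min turn
expected = expected_departure_min(615, 95, 45)      # 755 min, i.e. 12:35
flag_tight = impossible_on_time_flag(expected, 740)  # B scheduled at 12:20
flag_loose = impossible_on_time_flag(expected, 780)  # B scheduled at 13:00
```

With a 12:20 schedule the flag is 1 because the aircraft is not expected to be ready until 12:35. With a 13:00 schedule there is a 25 minute buffer and the flag is 0.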

## Key Concepts:

1. **Flight Sequence**: Ordered sequence of flights by same aircraft in a day (A → B → C)
2. **Previous Flight/Leg**: The flight immediately before the current one in the sequence
3. **Cumulative Delay**: Total delay accumulated since first flight of the day (typically since 3 AM)
4. **Conditional Dependencies**:
   - **Turn Time**: Varies by carrier, airport, time of day, aircraft type
   - **Air Time**: Varies by weather (wind speed/direction), aircraft type, route
   - **Taxi Time**: Varies by airport, runway configuration, gate position, time of day
5. **Data Leakage Prevention**: Only use actual departure times that are >= 2 hours before the current flight's scheduled departure
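
The leakage rule in item 5 can be sketched as a simple cutoff check. This is a plain-Python illustration, not the notebook's PySpark code. `usable_prev_departure` and `CUTOFF` are hypothetical names, and both timestamps are assumed to be in the same time zone (for example UTC).

```python
from datetime import datetime, timedelta
from typing import Optional

CUTOFF = timedelta(hours=2)  # prediction horizon described in the text


def usable_prev_departure(prev_actual_dep: Optional[datetime],
                          cur_sched_dep: datetime) -> Optional[datetime]:
    """Return the previous flight's actual departure only if it happened
    at least 2 hours before the current flight's scheduled departure."""
    if prev_actual_dep is None:
        return None
    if prev_actual_dep <= cur_sched_dep - CUTOFF:
        return prev_actual_dep
    return None  # would not be known at prediction time, so mask it


cur_sched = datetime(2023, 1, 15, 12, 0)
known = usable_prev_departure(datetime(2023, 1, 15, 9, 30), cur_sched)
masked = usable_prev_departure(datetime(2023, 1, 15, 10, 30), cur_sched)
```

The same mask would apply to any feature derived from the previous flight's actual times, such as its departure or arrival delay.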

In [0]:
# Dependencies
import importlib.util
import sys

# Load cv module
cv_path = "/Workspace/Shared/Team 4_2/flight-departure-delay-predictive-modeling/notebooks/Cross Validator/cv.py"
spec = importlib.util.spec_from_file_location("cv", cv_path)
cv = importlib.util.module_from_spec(spec)
spec.loader.exec_module(cv)

from pyspark.sql import functions as F
from pyspark.sql.functions import col, to_timestamp, when, lag, sum as spark_sum, count, avg, min as spark_min, max as spark_max
from pyspark.sql.window import Window
import pandas as pd
import time

# Path for persistent storage
FOLDER_PATH = "dbfs:/mnt/mids-w261/student-groups/Group_4_2/experiments"

## Load Data

### Column Names (Verified from Dataset)

Based on actual data exploration, the following columns are available for Flight Sequence feature engineering:

**Flight Identification:**
- `tail_num`: Tail number (aircraft identifier) - **used to track Flight Sequences**
- `FL_DATE`: Flight date (uppercase, date format)
- `op_carrier`: Operating carrier code
- `op_carrier_fl_num`: Flight number
- `origin`: Origin airport code
- `dest`: Destination airport code

**Scheduled Times:**
- `crs_dep_time`: Scheduled departure time (integer HHMM format, e.g., 1158 = 11:58)
- `crs_arr_time`: Scheduled arrival time (integer HHMM format)
- `crs_elapsed_time`: Scheduled elapsed time (minutes)
- `sched_depart_date_time`: Scheduled departure datetime (if created from `FL_DATE` + `crs_dep_time`)

**Actual Times:**
- `dep_time`: Actual departure time (integer HHMM format, may be NULL, e.g., 1151 = 11:51)
- `arr_time`: Actual arrival time (integer HHMM format, may be NULL)
- `actual_elapsed_time`: Actual elapsed time (minutes)
- `wheels_off`: Wheels off time (if available)
- `wheels_on`: Wheels on time (if available)

**Time Components (verify availability):**
- `air_time`: Air time (minutes) - flight time in the air
- `taxi_out`: Taxi-out time (minutes) - from gate to wheels off
- `taxi_in`: Taxi-in time (minutes) - from wheels on to gate

**Delays:**
- `DEP_DELAY`: Departure delay (minutes, uppercase - likely the label column)
- `dep_delay_new`: Alternative departure delay calculation
- `ARR_DELAY`: Arrival delay (minutes, uppercase)
- `arr_delay`: Arrival delay (minutes, lowercase)
- `arr_delay_new`: Alternative arrival delay calculation

**Other:**
- `distance`: Flight distance (miles, if available)
- `cancelled`: Cancellation indicator (if available)
- `diverted`: Diversion indicator (if available)

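**Note:** Column names are a mix of uppercase and lowercase. Spark resolves column names case-insensitively by default (`spark.sql.caseSensitive=false`), but Python-side checks such as `'DEP_DELAY' in df.columns` are case-sensitive, so use the exact case when referencing columns.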
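
Because the scheduled and actual times are HHMM integers, they need converting before any arithmetic: 1158 minus 1051 is not 107 minutes. Below is a minimal plain-Python sketch of the conversion. The helper names are hypothetical, and a value of 2400, if present in the data, is assumed to mean midnight at the end of the flight date.

```python
from datetime import date, datetime, time, timedelta
from typing import Optional


def hhmm_to_minutes(hhmm: Optional[int]) -> Optional[int]:
    """Convert an HHMM integer (e.g. 1158) to minutes since midnight (718)."""
    if hhmm is None:
        return None
    return (hhmm // 100) * 60 + hhmm % 100


def hhmm_to_datetime(fl_date: date, hhmm: Optional[int]) -> Optional[datetime]:
    """Combine a flight date with an HHMM integer into a local datetime."""
    minutes = hhmm_to_minutes(hhmm)
    if minutes is None:
        return None
    return datetime.combine(fl_date, time.min) + timedelta(minutes=minutes)


minutes_1158 = hhmm_to_minutes(1158)
dep_dt = hhmm_to_datetime(date(2023, 1, 15), 1151)
```

The resulting datetimes are local to each airport, so comparing times across airports still requires a time zone conversion that this sketch does not attempt.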

In [0]:
# Load data for lineage feature engineering
# We need tail_num to track planes, and scheduled/actual times
lineage_data_path = f"{FOLDER_PATH}/lineage_data_snapshot.parquet"

print("Loading data for lineage feature engineering...")
start = time.time()
data_loader = cv.FlightDelayDataLoader()
data_loader.load()
folds = data_loader.get_version("3M")  # Start with 3M for faster iteration

# Use first fold for now
train_df, val_df = folds[0]
lineage_data = train_df

# Check partition count and repartition if needed
num_partitions = lineage_data.rdd.getNumPartitions()
if num_partitions > 500:
    lineage_data = lineage_data.coalesce(200)
elif num_partitions < 10:
    lineage_data = lineage_data.repartition(50)

# Save snapshot
lineage_data.write.mode("overwrite").parquet(lineage_data_path)
print(f"Saved snapshot in {time.time() - start:.2f} seconds")

print(f"\nLineage data: {lineage_data.count():,} flights")
print(f"Date range: {lineage_data.agg(F.min('FL_DATE'), F.max('FL_DATE')).collect()}")

In [0]:
# Load from saved snapshot (run this on subsequent runs)
lineage_data_path = f"{FOLDER_PATH}/lineage_data_snapshot.parquet"

print(f"Loading lineage data from {lineage_data_path}...")
start = time.time()
lineage_data = spark.read.parquet(lineage_data_path)
lineage_data.count()  # Materialize
print(f"Loaded in {time.time() - start:.2f} seconds")

print(f"\nLineage data: {lineage_data.count():,} flights")
print(f"Date range: {lineage_data.agg(F.min('FL_DATE'), F.max('FL_DATE')).collect()}")

## Prepare Data for Lineage Analysis

We need:
- `tail_num`: To track the same plane across flights
- `FL_DATE`: To group flights by day
- Scheduled and actual departure/arrival times
- Flight duration information

In [0]:
# Check what columns are available
print("Checking available columns for lineage analysis...")
print(f"Total columns: {len(lineage_data.columns)}")

# Look for tail number and time columns
tail_cols = [c for c in lineage_data.columns if 'tail' in c.lower()]
time_cols = [c for c in lineage_data.columns if any(term in c.lower() for term in ['time', 'dep', 'arr', 'crs', 'sched'])]

print(f"\nTail number columns: {tail_cols}")
print(f"\nTime-related columns (first 20): {time_cols[:20]}")

# Check for specific time component columns needed for feature engineering
time_component_cols = ['air_time', 'taxi_in', 'taxi_out', 'wheels_off', 'wheels_on']
available_time_components = [c for c in time_component_cols if c in lineage_data.columns]
missing_time_components = [c for c in time_component_cols if c not in lineage_data.columns]

print(f"\nTime component columns:")
print(f"  Available: {available_time_components}")
if missing_time_components:
    print(f"  Missing: {missing_time_components}")

# Sample data to see structure
print("\nSample row:")
display_cols = tail_cols + ['FL_DATE', 'origin', 'dest', 'crs_dep_time', 'dep_time', 
                            'crs_arr_time', 'arr_time', 'DEP_DELAY', 'ARR_DELAY']
# Add available time components
display_cols.extend(available_time_components)
# Filter to only columns that exist
display_cols = [c for c in display_cols if c in lineage_data.columns]
lineage_data.select(display_cols).limit(5).show(truncate=False)

## Feature Engineering: Flight Sequence Features

### Features to Create:

#### 1. Previous Flight/Leg Information
- `prev_flight_arr_delay`: Previous flight arrival delay
- `prev_flight_dep_delay`: Previous flight departure delay
- `prev_flight_actual_dep_time`: Previous flight actual departure time (if >= 2 hours before current scheduled departure)
- `prev_flight_actual_arr_time`: Previous flight actual arrival time (if available)
- `prev_flight_origin`: Origin airport of previous flight
- `prev_flight_dest`: Destination airport of previous flight (should match current origin)

#### 2. Cumulative Delay Features
- `cumulative_delay_since_3am`: Total delay accumulated by the plane's previous flights in the day
- `num_previous_flights_today`: Number of flights the plane has already completed today
- `avg_delay_per_previous_flight`: Mean delay across previous flights
- `max_delay_previous_flights`: Maximum delay in previous flights
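
The cumulative features only look at flights strictly before the current one. The plain-Python sketch below shows that "exclusive" window for one aircraft on one day. `cumulative_features` is a hypothetical helper. Missing delays are skipped in the aggregates but still counted as flights, which is an assumption chosen to mirror how SQL-style aggregates treat NULLs.

```python
from typing import Dict, List, Optional


def cumulative_features(dep_delays: List[Optional[float]]) -> List[Dict[str, Optional[float]]]:
    """dep_delays must be ordered by scheduled departure for one aircraft-day."""
    rows = []
    for i in range(len(dep_delays)):
        # Only flights before index i are visible to flight i
        known = [d for d in dep_delays[:i] if d is not None]
        rows.append({
            "num_previous_flights_today": i,
            "cumulative_delay_since_3am": sum(known) if known else None,
            "avg_delay_per_previous_flight": sum(known) / len(known) if known else None,
            "max_delay_previous_flights": max(known) if known else None,
        })
    return rows


# Four flights; the second has no recorded delay
features = cumulative_features([10, None, 25, -5])
```

The first flight of the day has no history, so its aggregates are `None` rather than 0. Whether to impute 0 for these rows is a modeling choice.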

#### 3. Conditional Turn Time Features
- `expected_turn_time_carrier_airport`: Average time between arrival and departure for this carrier at this airport
- `expected_turn_time_carrier_airport_time`: Conditional on time of day (morning/afternoon/evening)
- `expected_turn_time_carrier_airport_aircraft`: Conditional on aircraft type
- `actual_turn_time`: Actual time between previous flight arrival and current scheduled departure (if previous flight has arrived)

#### 4. Conditional Air Time Features (Edge Weights)
- `expected_air_time_route`: Average air time for this origin-destination pair
- `expected_air_time_route_weather`: Conditional on wind speed, wind direction, weather conditions
- `expected_air_time_route_aircraft`: Conditional on aircraft type
- `expected_air_time_route_time_of_day`: Conditional on time of day (affects air traffic, congestion)
- `expected_air_time_route_time_of_year`: Conditional on time of year (affects seasonal wind patterns, jet streams)
- `expected_air_time_route_weather_time`: Conditional on weather AND time of day/year (combined effects)
- `prev_flight_actual_air_time`: Actual air time of previous flight (if completed)

#### 5. Deterministic "Impossible On-Time" Features
- `expected_arrival_time_prev_flight`: Previous actual departure (or scheduled if not available) + expected air time
- `expected_departure_time_current`: Expected arrival time of previous flight + expected turn time at current airport
- `time_buffer`: Scheduled departure time - Expected departure time (can be negative)
- `impossible_on_time_flag`: Binary (1 if Expected departure time > Scheduled departure time, else 0)
- `minutes_until_departure`: Minutes between expected arrival and scheduled departure (can be negative)

#### 6. Taxi Time Features
- `expected_taxi_in_time_airport`: Average taxi time from landing to gate at destination airport
- `expected_taxi_out_time_airport`: Average taxi time from gate to takeoff at origin airport
- `expected_taxi_time_airport_time`: Conditional on time of day

## Implementation: Pre-Compute Lineage Features

### Step 1: Create Unique Flight Key

We need a unique identifier to join lineage features back to the main dataset.
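
One detail to watch with a concatenated key: Spark's `concat` returns NULL if any input is NULL, so a flight with a missing key part would get a NULL key. The plain-Python sketch below shows the key format with a placeholder for missing parts. `make_flight_key` and the `"NA"` token are hypothetical and not part of the notebook's code.

```python
from typing import Optional


def make_flight_key(tail_num: Optional[str], fl_date: Optional[str],
                    crs_dep_time: Optional[int], fl_num: Optional[int],
                    origin: Optional[str], dest: Optional[str],
                    null_token: str = "NA") -> str:
    """Build a readable composite key, replacing missing parts with a token."""
    parts = [tail_num, fl_date, crs_dep_time, fl_num, origin, dest]
    return "_".join(null_token if p is None else str(p) for p in parts)


key = make_flight_key("N12345", "2023-01-15", 1158, 402, "LAX", "JFK")
key_missing_num = make_flight_key("N12345", "2023-01-15", 1158, None, "LAX", "JFK")
```

The placeholder keeps the key non-null, but rows that share a placeholder in the same position can still collide, so the uniqueness check remains necessary.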


In [None]:
# Create unique flight key for joining lineage features back to main dataset
# This key uniquely identifies each flight and can be used for efficient joins

def create_flight_key(df, tail_col='tail_num'):
    """
    Create a unique flight identifier.
    
    Options:
    1. Composite key: {tail_num}_{FL_DATE}_{crs_dep_time}_{op_carrier_fl_num}_{origin}_{dest}
       - Readable, debuggable
       - Includes flight number for uniqueness
       - Longer strings, but Spark handles this well
    
    2. Hash key: SHA2 hash of composite fields
       - Shorter, more efficient
       - Not human-readable
    
    Recommendation: Use composite key with flight number for uniqueness and debuggability.
    """
    # Composite key approach (recommended)
    # Include op_carrier_fl_num for uniqueness (same tail_num can have multiple flights at same time)
    if 'op_carrier_fl_num' in df.columns:
        flight_key = (
            df
            .withColumn(
                'flight_key',
                F.concat(
                    col(tail_col), F.lit('_'),
                    col('FL_DATE'), F.lit('_'),
                    col('crs_dep_time').cast('string'), F.lit('_'),
                    col('op_carrier_fl_num').cast('string'), F.lit('_'),
                    col('origin'), F.lit('_'),
                    col('dest')
                )
            )
        )
    else:
        # Fallback if flight number not available
        flight_key = (
            df
            .withColumn(
                'flight_key',
                F.concat(
                    col(tail_col), F.lit('_'),
                    col('FL_DATE'), F.lit('_'),
                    col('crs_dep_time').cast('string'), F.lit('_'),
                    col('origin'), F.lit('_'),
                    col('dest')
                )
            )
        )
    
    # Alternative: Hash-based key (uncomment if needed)
    # if 'op_carrier_fl_num' in df.columns:
    #     flight_key = (
    #         df
    #         .withColumn(
    #             'flight_key',
    #             F.sha2(
    #                 F.concat(
    #                     col(tail_col), col('FL_DATE'),
    #                     col('crs_dep_time').cast('string'),
    #                     col('op_carrier_fl_num').cast('string'),
    #                     col('origin'), col('dest')
    #                 ),
    #                 256
    #             )
    #         )
    #     )
    # else:
    #     flight_key = (
    #         df
    #         .withColumn(
    #             'flight_key',
    #             F.sha2(
    #                 F.concat(
    #                     col(tail_col), col('FL_DATE'),
    #                     col('crs_dep_time').cast('string'),
    #                     col('origin'), col('dest')
    #                 ),
    #                 256
    #             )
    #         )
    #     )
    
    return flight_key

# Test on sample data
if 'lineage_data' in locals():
    print("Creating flight keys for lineage data...")
    tail_cols = [c for c in lineage_data.columns if 'tail' in c.lower()]
    if tail_cols:
        tail_col = tail_cols[0]
        lineage_with_keys = create_flight_key(lineage_data, tail_col)
        
        print(f"\nSample flight keys:")
        display(lineage_with_keys.select('flight_key', tail_col, 'FL_DATE', 'crs_dep_time', 
                                         'origin', 'dest').limit(10))
        
        # Check for uniqueness
        total_flights = lineage_with_keys.count()
        unique_keys = lineage_with_keys.select('flight_key').distinct().count()
        print(f"\nUniqueness check:")
        print(f"  Total flights: {total_flights:,}")
        print(f"  Unique flight keys: {unique_keys:,}")
        if total_flights == unique_keys:
            print("  ✓ All flight keys are unique!")
        else:
            print(f"  ⚠ Warning: {total_flights - unique_keys} duplicate keys found")
            print("    May need to add more fields to flight_key, or check for NULL key parts")
    else:
        print("tail_num column not found.")
else:
    print("lineage_data not available. Load data first.")


### Step 2: Compute All Lineage Features (One-Time Computation)

This is the expensive operation - compute ONCE, save as materialized table.


In [None]:
# Compute all lineage features using window functions
# This should be run ONCE after the custom join, then saved as a materialized table

if 'lineage_data' in locals():
    print("Computing lineage features (one-time computation)...")
    print("This may take a while - will save as materialized table for future use.\n")
    
    tail_cols = [c for c in lineage_data.columns if 'tail' in c.lower()]
    if tail_cols:
        tail_col = tail_cols[0]
        
        # Create flight key first
        df = create_flight_key(lineage_data, tail_col)
        
        # Filter to flights with required columns
        df = df.filter(
            col(tail_col).isNotNull() & 
            col('FL_DATE').isNotNull() &
            col('origin').isNotNull() & 
            col('dest').isNotNull() &
            col('crs_dep_time').isNotNull()
        )
        
        # Window specification: partition by tail_num and date, order by scheduled departure time
        window_spec = Window.partitionBy(tail_col, 'FL_DATE').orderBy('crs_dep_time')
        
        print("Pass 1: Sequence numbers and previous flight raw values...")
        # Sequence number
        df = df.withColumn('seq_num', F.row_number().over(window_spec))
        
        # Previous flight raw values
        df = df.withColumn('prev_flight_dest', F.lag('dest', 1).over(window_spec))
        df = df.withColumn('prev_flight_origin', F.lag('origin', 1).over(window_spec))
        df = df.withColumn('prev_flight_actual_dep_time', F.lag('dep_time', 1).over(window_spec))
        df = df.withColumn('prev_flight_actual_arr_time', F.lag('arr_time', 1).over(window_spec))
        df = df.withColumn('prev_flight_dep_delay', F.lag('DEP_DELAY', 1).over(window_spec))
        df = df.withColumn('prev_flight_arr_delay', F.lag('ARR_DELAY', 1).over(window_spec))
        df = df.withColumn('prev_flight_air_time', F.lag('air_time', 1).over(window_spec))
        
        print("Pass 2: Jump detection and cumulative features...")
        # Jump detection
        df = df.withColumn(
            'is_jump',
            when(col('seq_num') == 1, F.lit(False))
            .when(col('prev_flight_dest').isNull(), F.lit(True))
            .otherwise(col('prev_flight_dest') != col('origin'))
        )
        
        # Cumulative delay features
        df = df.withColumn(
            'cumulative_delay_since_3am',
            F.sum('DEP_DELAY').over(
                window_spec.rowsBetween(Window.unboundedPreceding, -1)
            )
        )
        
        df = df.withColumn(
            'num_previous_flights_today',
            F.count('*').over(
                window_spec.rowsBetween(Window.unboundedPreceding, -1)
            )
        )
        
        df = df.withColumn(
            'avg_delay_per_previous_flight',
            F.avg('DEP_DELAY').over(
                window_spec.rowsBetween(Window.unboundedPreceding, -1)
            )
        )
        
        df = df.withColumn(
            'max_delay_previous_flights',
            F.max('DEP_DELAY').over(
                window_spec.rowsBetween(Window.unboundedPreceding, -1)
            )
        )
        
        print("Pass 3: Join with conditional expected values...")
        print("  IMPORTANT: For previous flight's route, join on (prev_flight_origin, prev_flight_dest)")
        print("             For current flight's airport, join on (op_carrier, origin)\n")
        
        # Load conditional expected value tables (from Time-Series Features Experiment)
        # These should be pre-computed and saved as parquet files
        
        # Join expected air time for PREVIOUS flight's route
        # CRITICAL: Join on prev_flight_origin → prev_flight_dest, not origin → dest!
        expected_air_time_route_path = f"{FOLDER_PATH}/expected_values_route.parquet"
        try:
            expected_air_time_route = spark.read.parquet(expected_air_time_route_path)
            
            # Join on previous flight's route
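            # Rename the lookup columns first: both DataFrames have 'origin' and 'dest',
            # so selecting them unqualified after the join would be ambiguous
            prev_route = expected_air_time_route.select(
                col('origin').alias('prev_route_origin'),
                col('dest').alias('prev_route_dest'),
                col('expected_air_time_route').alias('prev_expected_air_time_route')
            )
            df = df.join(
                prev_route,
                (col('prev_flight_origin') == col('prev_route_origin')) &
                (col('prev_flight_dest') == col('prev_route_dest')),
                'left'
            ).drop('prev_route_origin', 'prev_route_dest')  # Drop temporary join columns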
            print("  ✓ Joined expected_air_time_route for previous flight's route")
        except Exception as e:
            print(f"  ⚠ Could not load expected_air_time_route: {e}")
            print("    Run Time-Series Features Experiment to generate this table")
            # Add placeholder column
            df = df.withColumn('prev_expected_air_time_route', F.lit(None).cast('double'))
        
        # Join expected taxi times for CURRENT flight's (carrier, origin airport);
        # taxi-out is used in Pass 4 as a stand-in for the expected turn time
        expected_turn_time_path = f"{FOLDER_PATH}/expected_values_carrier_airport.parquet"
        try:
            expected_turn_time = spark.read.parquet(expected_turn_time_path)
            # Rename columns to avoid conflicts
            expected_turn_time_renamed = expected_turn_time.select(
                col('carrier').alias('turn_carrier'),
                col('origin').alias('turn_origin'),
                col('expected_taxi_out_carrier_airport'),
                col('expected_taxi_in_carrier_airport')
            )
            df = df.join(
                expected_turn_time_renamed,
                (col('op_carrier') == col('turn_carrier')) & 
                (col('origin') == col('turn_origin')),
                'left'
            ).drop('turn_carrier', 'turn_origin')  # Drop temporary join columns
            print("  ✓ Joined expected_turn_time for current flight's airport")
        except Exception as e:
            print(f"  ⚠ Could not load expected_turn_time: {e}")
            print("    Run Time-Series Features Experiment to generate this table")
            # Add placeholder columns
            df = df.withColumn('expected_taxi_out_carrier_airport', F.lit(None).cast('double'))
            df = df.withColumn('expected_taxi_in_carrier_airport', F.lit(None).cast('double'))
        
        print("\nPass 4: Deterministic features (using previous flight's values)...")
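        # Compute expected arrival time of the previous flight, in minutes since midnight.
        # dep_time is an HHMM integer, so convert it to minutes before adding air time (minutes).
        prev_dep_minutes = (
            F.floor(col('prev_flight_actual_dep_time') / 100) * 60 +
            (col('prev_flight_actual_dep_time') % 100)
        )
        df = df.withColumn(
            'expected_arrival_time_prev_flight',
            when(
                (col('prev_flight_actual_dep_time').isNotNull()) &
                (col('prev_expected_air_time_route').isNotNull()),
                prev_dep_minutes + col('prev_expected_air_time_route')
            )
            .otherwise(None)
        )
        
        # expected_arrival_time_prev_flight is already built from LAG'd columns, so it
        # describes the previous leg; a second LAG here would point two legs back
        df = df.withColumn(
            'prev_expected_arrival_time',
            col('expected_arrival_time_prev_flight')
        )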
        
        # Compute expected departure time for CURRENT flight
        # Uses previous flight's expected arrival + expected turn time at current airport
        df = df.withColumn(
            'expected_departure_time_current',
            when(
                (col('prev_expected_arrival_time').isNotNull()) &
                (col('expected_taxi_out_carrier_airport').isNotNull()),
                col('prev_expected_arrival_time') + col('expected_taxi_out_carrier_airport')
            )
            .otherwise(None)
        )
        
        # Compute time buffer and impossible on-time flag
        # Convert crs_dep_time to minutes for comparison (HHMM format)
        df = df.withColumn(
            'crs_dep_time_minutes',
            (F.floor(col('crs_dep_time') / 100) * 60 + (col('crs_dep_time') % 100))
        )
        
        df = df.withColumn(
            'time_buffer',
            when(
                (col('expected_departure_time_current').isNotNull()) &
                (col('crs_dep_time_minutes').isNotNull()),
                col('crs_dep_time_minutes') - col('expected_departure_time_current')
            )
            .otherwise(None)
        )
        
        df = df.withColumn(
            'impossible_on_time_flag',
            when(col('time_buffer').isNotNull(),
                 when(col('time_buffer') < 0, 1).otherwise(0))
            .otherwise(None)
        )
        
        print("  ✓ Computed deterministic features")
        
        print("\n✓ Lineage features computed!")
        print(f"Total flights with lineage features: {df.count():,}")
        
        # Show sample
        display_cols = [
            'flight_key', tail_col, 'FL_DATE', 'seq_num', 'origin', 'dest',
            'prev_flight_dest', 'is_jump', 'cumulative_delay_since_3am',
            'num_previous_flights_today'
        ]
        print("\nSample lineage features:")
        display(df.select(display_cols).limit(10))
        
        # Save as materialized table
        lineage_features_path = f"{FOLDER_PATH}/lineage_features_materialized.parquet"
        print(f"\nSaving materialized lineage features to: {lineage_features_path}")
        print("This table can be joined back to main dataset using flight_key")
        
        # Partition by date for efficient joins
        df.write.mode("overwrite").partitionBy("FL_DATE").parquet(lineage_features_path)
        print("✓ Saved! Can now join back to main dataset using flight_key")
        
    else:
        print("tail_num column not found.")
else:
    print("lineage_data not available. Load data first.")


### Step 3: Join Lineage Features Back to Main Dataset

Example of how to use the materialized table in model training/prediction.


In [None]:
# Example: Join lineage features back to main dataset
# This is fast - just a simple join operation

lineage_features_path = f"{FOLDER_PATH}/lineage_features_materialized.parquet"

try:
    # Load materialized lineage features
    lineage_features = spark.read.parquet(lineage_features_path)
    
    print(f"Loaded materialized lineage features: {lineage_features.count():,} flights")
    print(f"Date range: {lineage_features.agg(F.min('FL_DATE'), F.max('FL_DATE')).collect()}")
    
    # Example: Join to main dataset
    # In production, this would be your main flight dataset after custom join
    if 'lineage_data' in locals():
        print("\nExample: Joining lineage features back to main dataset...")
        
        # Create flight_key on main dataset
        tail_cols = [c for c in lineage_data.columns if 'tail' in c.lower()]
        if tail_cols:
            tail_col = tail_cols[0]
            main_with_keys = create_flight_key(lineage_data, tail_col)
            
            # Join lineage features
            main_with_lineage = main_with_keys.join(
                lineage_features.select('flight_key', 
                    'seq_num', 'is_jump', 'prev_flight_dest',
                    'cumulative_delay_since_3am', 'num_previous_flights_today',
                    'prev_flight_dep_delay', 'prev_flight_arr_delay'
                    # Add other lineage feature columns as needed
                ),
                'flight_key',
                'left'
            )
            
            print(f"\n✓ Joined lineage features to main dataset")
            print(f"Total flights: {main_with_lineage.count():,}")
            print(f"Flights with lineage features: {main_with_lineage.filter(col('seq_num').isNotNull()).count():,}")
            
            print("\nSample joined data:")
            display(main_with_lineage.select(
                'flight_key', tail_col, 'FL_DATE', 'origin', 'dest',
                'seq_num', 'is_jump', 'cumulative_delay_since_3am'
            ).limit(10))
            
            print("\n" + "="*60)
            print("Performance Benefits:")
            print("="*60)
            print("1. Window functions computed ONCE (not on every model run)")
            print("2. Simple join operation (fast, efficient)")
            print("3. Can partition lineage table by date for even faster joins")
            print("4. Clear separation: feature computation vs model training")
        else:
            print("tail_num column not found in main dataset.")
    else:
        print("Main dataset not available. This is just an example.")
        
except Exception as e:
    print(f"Could not load materialized lineage features: {e}")
    print("Run the previous cell to compute and save them first.")


In [None]:
# Visual example: How LAG works to get previous flight data
# This demonstrates that LAG is NOT a join - it's a window function

if 'lineage_data' in locals():
    print("Demonstrating how LAG works to get previous flight data...\n")
    
    tail_cols = [c for c in lineage_data.columns if 'tail' in c.lower()]
    if tail_cols:
        tail_col = tail_cols[0]
        
        # Get a sample aircraft with multiple flights on the same day
        sample_flights = (
            lineage_data
            .filter(col(tail_col).isNotNull() & col('FL_DATE').isNotNull())
            .select(tail_col, 'FL_DATE', 'origin', 'dest', 'crs_dep_time', 'DEP_DELAY')
            .orderBy(tail_col, 'FL_DATE', 'crs_dep_time')
            .limit(1000)  # Get enough to find a good example
        )
        
        # Find an aircraft with multiple flights on same day
        multi_flight_aircraft = (
            sample_flights
            .groupBy(tail_col, 'FL_DATE')
            .count()
            .filter(col('count') > 2)
            .orderBy(F.desc('count'))
            .limit(1)
        )
        
        if multi_flight_aircraft.count() > 0:
            example = multi_flight_aircraft.collect()[0]
            example_tail = example[tail_col]
            example_date = example['FL_DATE']
            
            print(f"Example aircraft: {example_tail} on {example_date}")
            print(f"Number of flights: {example['count']}\n")
            
            # Get flights for this aircraft on this day
            example_flights = (
                sample_flights
                .filter(
                    (col(tail_col) == example_tail) & 
                    (col('FL_DATE') == example_date)
                )
                .orderBy('crs_dep_time')
            )
            
            print("BEFORE LAG (original data):")
            display(example_flights.select(tail_col, 'FL_DATE', 'crs_dep_time', 'origin', 'dest', 'DEP_DELAY'))
            
            # Apply LAG to get previous flight's destination
            window_spec = Window.partitionBy(tail_col, 'FL_DATE').orderBy('crs_dep_time')
            
            example_with_lag = example_flights.withColumn(
                'prev_flight_dest',
                F.lag('dest', 1).over(window_spec)
            ).withColumn(
                'prev_flight_dep_delay',
                F.lag('DEP_DELAY', 1).over(window_spec)
            ).withColumn(
                'seq_num',
                F.row_number().over(window_spec)
            )
            
            print("\nAFTER LAG (with previous flight data):")
            print("Notice how prev_flight_dest and prev_flight_dep_delay are automatically")
            print("filled from the previous row in the sequence!\n")
            display(example_with_lag.select(
                'seq_num', tail_col, 'FL_DATE', 'crs_dep_time', 
                'origin', 'dest', 'DEP_DELAY',
                'prev_flight_dest', 'prev_flight_dep_delay'
            ))
            
            print("\n" + "="*60)
            print("Key Points:")
            print("="*60)
            print("1. LAG is a WINDOW FUNCTION, not a join")
            print("2. It operates on the SAME DataFrame (no separate table needed)")
            print("3. Window partitions by (tail_num, FL_DATE) and orders by crs_dep_time")
            print("4. LAG(column, 1) gets the value from 1 row before in the partition")
            print("5. No primary key or join needed - the window function handles sequencing")
            print("6. Much more efficient than self-joining the DataFrame")
        else:
            print("No aircraft with multiple flights found in sample.")
            print("LAG still works, but example would be clearer with multi-flight sequences.")
    else:
        print("tail_num column not found.")
else:
    print("lineage_data not available. Load data first.")


## Understanding LAG: How We Get Previous Flight Data

### What is LAG?

**LAG is not a join and does not need a primary key.** It is a **window function** that returns the value from the previous row within the same DataFrame.

### How LAG Works

LAG operates on a **window** (partition + ordering):
- **Partition**: Groups rows by `(tail_num, FL_DATE)` - all flights by same plane on same day
- **Order**: Orders by `crs_dep_time` - earliest to latest
- **LAG(column, 1)**: Gets the value from 1 row before in the ordered partition

**Example**:
```
Window: Partition by (tail_num='N12345', FL_DATE='2023-01-15'), Order by crs_dep_time

Row 1: Flight 1 (crs_dep_time=800, origin='LAX', dest='JFK')
Row 2: Flight 2 (crs_dep_time=1200, origin='JFK', dest='LAX')
Row 3: Flight 3 (crs_dep_time=1500, origin='LAX', dest='SFO')

After LAG('dest', 1):
Row 1: prev_flight_dest = NULL (no previous flight)
Row 2: prev_flight_dest = 'JFK' (from Flight 1)
Row 3: prev_flight_dest = 'LAX' (from Flight 2)
```
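
The same example can be reproduced without Spark. The sketch below is a plain-Python emulation of `LAG(column, 1)` over a partitioned, ordered window. `lag_within_partition` is a hypothetical helper written for illustration and is not a PySpark API.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence


def lag_within_partition(rows: List[Dict[str, Any]], partition_keys: Sequence[str],
                         order_key: str, column: str, out_name: str) -> List[Dict[str, Any]]:
    """Emulate LAG(column, 1) OVER (PARTITION BY partition_keys ORDER BY order_key)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[tuple(row[k] for k in partition_keys)].append(row)

    result = []
    for part in partitions.values():
        prev_value = None  # first row of each partition has no previous row
        for row in sorted(part, key=lambda r: r[order_key]):
            result.append({**row, out_name: prev_value})
            prev_value = row[column]
    return result


# Rows deliberately out of order: the window's ORDER BY fixes the sequence
rows = [
    {"tail_num": "N12345", "FL_DATE": "2023-01-15", "crs_dep_time": 1500, "origin": "LAX", "dest": "SFO"},
    {"tail_num": "N12345", "FL_DATE": "2023-01-15", "crs_dep_time": 800, "origin": "LAX", "dest": "JFK"},
    {"tail_num": "N12345", "FL_DATE": "2023-01-15", "crs_dep_time": 1200, "origin": "JFK", "dest": "LAX"},
]
lagged = lag_within_partition(rows, ["tail_num", "FL_DATE"], "crs_dep_time", "dest", "prev_flight_dest")
prev_by_dep = {r["crs_dep_time"]: r["prev_flight_dest"] for r in lagged}
```

Here `prev_by_dep` maps 800 to `None`, 1200 to `'JFK'`, and 1500 to `'LAX'`, matching the three rows above.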

### Why LAG Instead of Join?

**LAG is more efficient** because:
1. **No join needed**: All data is in the same DataFrame
2. **Automatic ordering**: Window function handles the sequence automatically
3. **Single pass**: Computes all previous flight values in one operation
4. **No key matching**: No need to create join keys

**Join would require**:
- Creating a join key for each flight
- Self-joining the DataFrame on that key
- More complex and less efficient

### Join Logic for Expected Values

**Important**: When joining expected values for the **previous flight**, you must join on `(prev_flight_origin, prev_flight_dest)`, NOT `(origin, dest)`.

**Example**:
- Flight 1: LAX → JFK
- Flight 2: JFK → LAX  
- Flight 3: LAX → SFO

For Flight 3:
- To get expected air time for Flight 2's route: Join on `(prev_flight_origin='JFK', prev_flight_dest='LAX')`
- To get expected turn time for Flight 3's airport: Join on `(op_carrier, origin='LAX')`

**Wrong**: `df.join(expected_air_time_route, ['origin', 'dest'], 'left')` 
- This joins on Flight 3's route (LAX→SFO), not Flight 2's route (JFK→LAX)!

**Correct**: `df.join(routes, (df.prev_flight_origin == routes.origin) & (df.prev_flight_dest == routes.dest), 'left')`, with `routes = expected_air_time_route`
- This joins on Flight 2's route (JFK→LAX) ✓
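
The difference between the two joins can be seen with a plain dictionary lookup standing in for the left join. The air-time values below are made up for illustration, and `lookup` is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple

# Hypothetical expected air times (minutes) keyed by (origin, dest)
expected_air_time_route: Dict[Tuple[str, str], float] = {
    ("LAX", "JFK"): 300.0,
    ("JFK", "LAX"): 330.0,
    ("LAX", "SFO"): 60.0,
}


def lookup(origin: Optional[str], dest: Optional[str]) -> Optional[float]:
    """Left-join semantics: a missing route yields None instead of an error."""
    return expected_air_time_route.get((origin, dest))


# Flight 3 (LAX -> SFO), whose previous flight was Flight 2 (JFK -> LAX)
flight3 = {"origin": "LAX", "dest": "SFO",
           "prev_flight_origin": "JFK", "prev_flight_dest": "LAX"}

wrong = lookup(flight3["origin"], flight3["dest"])                         # Flight 3's own route
correct = lookup(flight3["prev_flight_origin"], flight3["prev_flight_dest"])  # Flight 2's route
first_leg = lookup(None, None)  # first flight of the day has no previous route
```

In this example the wrong key returns 60 minutes (LAX→SFO) where the previous leg actually needs 330 minutes (JFK→LAX), which would badly underestimate the expected arrival time.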
