## Temporal Delay Feature Engineering

A key concept we wanted to capture in some of our features is what we refer to as the "snowball" effect: a flight is delayed because of earlier delays, either on flights directly related to it or on other flights at its departure airport. To explore this effect, we added two new features:

* **DEP_DEL15_PREV_FIN** - Whether or not the plane scheduled to run this flight was delayed by 15 minutes or more on its most recent available flight.
* **NUM_PREV_DELAYS_ORIGIN** - The number of earlier flights at the departure airport on the flight date that were delayed by 15 minutes or more.

An essential aspect of creating these features was using only past flight data that could realistically be known at prediction time, which is two hours before the scheduled departure. For both features, we only considered flights that were scheduled to depart at least 2 hours and 15 minutes before the departure time of the flight in question. A flight scheduled to depart only an hour and a half earlier, for example, has not yet reached its own scheduled departure at prediction time, so it is impossible to know whether it was delayed by 15 minutes.
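
The cutoff rule can be checked with a small plain-Python sketch. This is only an illustration of the timing argument, not part of the Spark pipeline; the names `PREDICTION_LEAD`, `DELAY_THRESHOLD`, and `is_usable_at_prediction_time` are hypothetical, and the boundary is treated as inclusive to match the `>= 15` comparison used in the function below.

```python
from datetime import datetime, timedelta

# Assumed constants mirroring the text: predictions are made 2 hours before
# scheduled departure, and a 15-minute delay can only be confirmed 15 minutes
# after the earlier flight's scheduled departure.
PREDICTION_LEAD = timedelta(hours=2)
DELAY_THRESHOLD = timedelta(minutes=15)

def is_usable_at_prediction_time(prev_sched_dep, sched_dep):
    """Return True if the earlier flight's delay status is knowable in time."""
    prediction_time = sched_dep - PREDICTION_LEAD
    return prev_sched_dep + DELAY_THRESHOLD <= prediction_time

flight = datetime(2015, 1, 1, 12, 0)
# Scheduled 2h30m earlier: its delay status is known at 09:45, before 10:00
usable_early = is_usable_at_prediction_time(datetime(2015, 1, 1, 9, 30), flight)
# Scheduled 1h30m earlier: it has not even departed at the 10:00 prediction time
usable_late = is_usable_at_prediction_time(datetime(2015, 1, 1, 10, 30), flight)
print(usable_early, usable_late)
```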

In [0]:
# Import appropriate packages

from pyspark.sql.functions import concat, col, lit, lag
from pyspark.sql import Window
import pyspark.sql.functions as f

Below is a function that adds `DEP_DEL15_PREV_FIN` to the joined dataframe of flights and weather. The function defines several intermediate columns to determine whether the most recent usable flight before the flight in question was delayed by 15 minutes or more. These columns are then dropped. The intermediate columns are:

* `PREV_FLIGHT_DEP_DATETIME_1` - The scheduled UTC departure date and time of the most recent flight for a given tail number
* `PREV_FLIGHT_DEP_DATETIME_2` - The scheduled UTC departure date and time of the second most recent flight for a given tail number
* `FL_DATE_PREV_1` - The local date of the most recent flight for a given tail number
* `FL_DATE_PREV_2` - The local date of the second most recent flight for a given tail number
* `DIFF_IN_PREV_DEP_AND_PREDICT_TIME_1` - The difference, in minutes, between the prediction time (two hours before the given flight's departure time) and `PREV_FLIGHT_DEP_DATETIME_1`
* `DIFF_IN_PREV_DEP_AND_PREDICT_TIME_2` - The difference, in minutes, between the prediction time (two hours before the given flight's departure time) and `PREV_FLIGHT_DEP_DATETIME_2`
* `DEP_DELAY_NEW_PREV_1` - The departure delay, in minutes, of the most recent flight for a given tail number
* `DEP_DELAY_NEW_PREV_2` - The departure delay, in minutes, of the second most recent flight for a given tail number

In [0]:
def add_most_recent_tail_num_delayed(df):
  # Define window partitioned by the tail number variable with entries ordered
  # by departure time in UTC time
  windowSpec = Window.partitionBy("TAIL_NUM").orderBy("CRS_DEP_DATETIME")
  
  # Add the columns above. Set DEP_DEL15_PREV_FIN to 1 if the most recent flight
  # for this tail number was scheduled to depart at least two hours and 15 minutes
  # before the flight in question, both flights were on the same day, and the
  # most recent flight was delayed by 15 minutes or more. If the most recent
  # flight was scheduled to depart less than two hours and 15 minutes before,
  # apply the same logic to the second most recent flight. Otherwise set it to 0.
  df_most_recent_tail_number_delayed = df.withColumn("PREV_FLIGHT_DEP_DATETIME_1",lag("CRS_DEP_DATETIME",1).over(windowSpec)) \
                                         .withColumn("PREV_FLIGHT_DEP_DATETIME_2",lag("CRS_DEP_DATETIME",2).over(windowSpec)) \
                                         .withColumn("FL_DATE_PREV_1",lag("FL_DATE",1).over(windowSpec)) \
                                         .withColumn("FL_DATE_PREV_2",lag("FL_DATE",2).over(windowSpec)) \
                                         .withColumn('DIFF_IN_PREV_DEP_AND_PREDICT_TIME_1',((f.unix_timestamp("EARLIER_DATETIME") - f.unix_timestamp('PREV_FLIGHT_DEP_DATETIME_1'))/60)) \
                                         .withColumn('DIFF_IN_PREV_DEP_AND_PREDICT_TIME_2',((f.unix_timestamp("EARLIER_DATETIME") - f.unix_timestamp('PREV_FLIGHT_DEP_DATETIME_2'))/60)) \
                                         .withColumn("DEP_DELAY_NEW_PREV_1",lag("DEP_DELAY_NEW",1).over(windowSpec)) \
                                         .withColumn("DEP_DELAY_NEW_PREV_2",lag("DEP_DELAY_NEW",2).over(windowSpec)) \
                                         .withColumn("DEP_DEL15_PREV_FIN", f.when((f.col("DIFF_IN_PREV_DEP_AND_PREDICT_TIME_1") >= 15) & (f.col("FL_DATE_PREV_1") == f.col("FL_DATE")) & \
                                                                            (f.col("DEP_DELAY_NEW_PREV_1") >= 15), 1).when((f.col("DIFF_IN_PREV_DEP_AND_PREDICT_TIME_2") >= 15) & \
                                                                            (f.col("DIFF_IN_PREV_DEP_AND_PREDICT_TIME_1") < 15) & (f.col("FL_DATE_PREV_2") == f.col("FL_DATE")) & \
                                                                            (f.col("DEP_DELAY_NEW_PREV_2") >= 15), 1).otherwise(0)) \
                                         .drop("PREV_FLIGHT_DEP_DATETIME_1", "PREV_FLIGHT_DEP_DATETIME_2", "FL_DATE_PREV_1", "FL_DATE_PREV_2",'DIFF_IN_PREV_DEP_AND_PREDICT_TIME_1', \
                                                                            'DIFF_IN_PREV_DEP_AND_PREDICT_TIME_2', "DEP_DELAY_NEW_PREV_1", "DEP_DELAY_NEW_PREV_2")
  
  # Return edited dataframe
  return df_most_recent_tail_number_delayed
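
To make the two-branch `when` logic easier to follow, here is a plain-Python sketch of the decision for a single row. It is an illustrative mirror of the Spark expression, not a replacement for it; the function name `prev_flight_delayed` and its argument names are hypothetical, and `None` stands in for a Spark null (for example, when a tail number has no earlier flight).

```python
def prev_flight_delayed(diff_1, same_day_1, delay_1, diff_2, same_day_2, delay_2):
    """Plain-Python mirror of the chained when/otherwise logic for one row.

    diff_*     : minutes between the prediction time and the earlier flight's
                 scheduled departure (None if there is no such flight)
    same_day_* : whether the earlier flight shares the current FL_DATE
    delay_*    : departure delay of the earlier flight, in minutes
    """
    def known(value):
        return value is not None

    # Branch 1: the most recent flight is old enough to be observable
    if (known(diff_1) and diff_1 >= 15 and same_day_1
            and known(delay_1) and delay_1 >= 15):
        return 1
    # Branch 2: the most recent flight is too recent, so fall back to the
    # second most recent flight
    if (known(diff_1) and diff_1 < 15 and known(diff_2) and diff_2 >= 15
            and same_day_2 and known(delay_2) and delay_2 >= 15):
        return 1
    # Mirrors .otherwise(0): any missing or failing condition yields 0
    return 0

# Most recent flight is observable and was delayed 40 minutes
case_observable = prev_flight_delayed(90, True, 40, None, None, None)
# Most recent flight is too recent; second most recent was delayed 20 minutes
case_fallback = prev_flight_delayed(-30, True, 40, 120, True, 20)
# First flight for this tail number: no earlier flights at all
case_first = prev_flight_delayed(None, None, None, None, None, None)
print(case_observable, case_fallback, case_first)
```

Note that the last case returns 0 rather than a missing value, because a null comparison in Spark's `when` is treated as not matched and falls through to `otherwise(0)`.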

Next, we define a function to add `NUM_PREV_DELAYS_ORIGIN`. This function has the same time requirements as the previous one, but it partitions flights by flight date and origin airport, and counts the flights in each partition that were scheduled at least two hours and 15 minutes before the flight in question and were delayed by 15 minutes or more.

In [0]:
def add_num_delays_origin_airport(df):
  # Define columns for partitioning
  column_list = ["FL_DATE","ORIGIN"]
  
  # Partition and examine flights within the partition whose scheduled
  # departure time was at least two hours and 15 minutes before the scheduled
  # departure time of the flight in question (-135 * 60 seconds)
  windowval = (Window.partitionBy([col(x) for x in column_list])
                     .orderBy(col('CRS_DEP_DATETIME').cast("long"))
                     .rangeBetween(Window.unboundedPreceding, -135*60))
  
  # Count the flights delayed by 15 minutes or more over the defined window
  df_num_origin_delays = df.withColumn('NUM_PREV_DELAYS_ORIGIN',
                                       f.sum(f.when(col("DEP_DELAY_NEW") >= 15, 1)
                                              .otherwise(0)).over(windowval))
  return df_num_origin_delays
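
The behaviour of the range-based window can be sketched in plain Python for a single (`FL_DATE`, `ORIGIN`) partition. This is a simplified illustration under the assumption that each flight is a pair of scheduled departure time in epoch seconds and departure delay in minutes; the function name `num_prev_delays_origin` is hypothetical and the loop is quadratic, unlike the Spark window.

```python
def num_prev_delays_origin(flights, cutoff_seconds=135 * 60):
    """Count earlier delayed flights for each flight in one partition.

    flights: list of (sched_dep_epoch_seconds, dep_delay_minutes) tuples.
    Returns one value per flight; None stands in for a Spark null.
    """
    results = []
    for dep, _ in flights:
        # Frame: every flight scheduled at or before (dep - cutoff_seconds)
        eligible = [delay for other_dep, delay in flights
                    if other_dep <= dep - cutoff_seconds]
        if not eligible:
            # A sum over an empty window frame yields null in Spark
            results.append(None)
        else:
            results.append(sum(1 for d in eligible if d is not None and d >= 15))
    return results

# Departures at 06:00, 07:00, 08:15 and 10:30, expressed as seconds of the day
partition = [
    (6 * 3600, 30),              # delayed 30 minutes
    (7 * 3600, 0),               # on time
    (8 * 3600 + 15 * 60, 20),    # delayed 20 minutes
    (10 * 3600 + 30 * 60, 5),    # delayed 5 minutes
]
counts = num_prev_delays_origin(partition)
print(counts)
```

The first two flights have no earlier flight that clears the cutoff, so their value is missing rather than 0, which matches the empty `NUM_PREV_DELAYS_ORIGIN` entries in the displayed output further down.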

To add these two features, load the dataframe `all_flight_weather_5y_v*`, which contains all joined flight and weather data, and run both functions on it.

In [0]:
df_full_joined = spark.read.parquet("/mnt/team11/all_flight_weather_5y_v2/")
df_full_joined = add_most_recent_tail_num_delayed(df_full_joined)
df_full_joined = add_num_delays_origin_airport(df_full_joined)

In [0]:
# Write to parquet
df_full_joined.write.parquet("/mnt/team11/all_flight_weather_5y_tail_num_recent_flight_delayed_num_origin_delays_v2")

In [0]:
display(df_full_joined)

YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,OP_CARRIER_AIRLINE_ID,OP_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_FIPS,ORIGIN_STATE_NM,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_FIPS,DEST_STATE_NM,DEST_WAC,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,DEP_TIME_BLK,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15,ARR_DELAY_GROUP,ARR_TIME_BLK,CANCELLED,CANCELLATION_CODE,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,flightID,ORIGIN_airportID,ORIGIN_ICAO,ORIGIN_Timezone,ORIGIN_TZ,DEST_airportID,DEST_ICAO,DEST_Timezone,DEST_TZ,CRS_DEP_DATETIME,ACT_DEP_DATETIME,EARLIER_DATETIME,CRS_ARR_DATETIME,ACT_ARR_DATETIME,ORIGIN_near_station,DEST_near_station,ORIG_weather_date,ORIG_LATITUDE,ORIG_LONGITUDE,ORIG_ELEVATION,ORIG_direction_angle,ORIG_speed,ORIG_ceiling_height,ORIG_ceiling_quality,ORIG_vis_distance,ORIG_variability,ORIG_air_temp,ORIG_dew_point,ORIG_sea_level_pressure,ORIG_precipitation_hrs,ORIG_precipitation_depth,DEST_weather_date,DEST_LATITUDE,DEST_LONGITUDE,DEST_ELEVATION,DEST_direction_angle,DEST_speed,DEST_ceiling_height,DEST_ceiling_quality,DEST_vis_distance,DEST_variability,DEST_air_temp,DEST_dew_point,DEST_sea_level_pressure,DEST_precipitation_hrs,DEST_precipitation_depth,DEP_DEL15_PREV_FIN,NUM_PREV_DELAYS_ORIGIN
2015,1,1,1,4,2015-01-01,UA,19977,UA,N37293,1162,10732,1073203,30732,BQN,"Aguadilla, PR",PR,72,Puerto Rico,3,11618,1161802,31703,EWR,"Newark, NJ",NJ,34,New Jersey,21,259,258,-1.0,0.0,0.0,-1,0001-0559,14.0,312,557,8.0,559,605,6.0,6.0,0.0,0,0001-0559,0.0,,0.0,240.0,247.0,225.0,1.0,1585.0,7,,,,,,60132018794,2885,TJBQ,-4,America/Puerto_Rico,3494,KEWR,-5,America/New_York,2015-01-01T06:59:00.000+0000,2015-01-01T06:58:00.000+0000,2015-01-01T04:59:00.000+0000,2015-01-01T10:59:00.000+0000,2015-01-01T11:05:00.000+0000,78514011603.0,72502014734.0,2015-01-01T00:50:00.000+0000,18.4977,-67.1294,66.4,140.0,26.0,22000.0,1.0,16093.0,9.0,230.0,220.0,,,,2015-01-01T04:59:00.000+0000,40.6825,-74.1694,2.1,,,,9.0,,9.0,,,,24.0,0.0,0,
2015,1,1,1,4,2015-01-01,B6,20409,B6,N239JB,1030,10732,1073203,30732,BQN,"Aguadilla, PR",PR,72,Puerto Rico,3,13204,1320402,31454,MCO,"Orlando, FL",FL,12,Florida,33,307,304,-3.0,0.0,0.0,-1,0001-0559,25.0,329,509,11.0,500,520,20.0,20.0,1.0,1,0001-0559,0.0,,0.0,173.0,196.0,160.0,1.0,1129.0,5,0.0,0.0,20.0,0.0,0.0,42952603948,2885,TJBQ,-4,America/Puerto_Rico,3878,KMCO,-5,America/New_York,2015-01-01T07:07:00.000+0000,2015-01-01T07:04:00.000+0000,2015-01-01T05:07:00.000+0000,2015-01-01T10:00:00.000+0000,2015-01-01T10:20:00.000+0000,78514011603.0,72205012815.0,2015-01-01T00:50:00.000+0000,18.4977,-67.1294,66.4,140.0,26.0,22000.0,1.0,16093.0,9.0,230.0,220.0,,,,2015-01-01T04:59:00.000+0000,28.4339,-81.325,27.4,,,,9.0,,9.0,,,,24.0,0.0,0,
2015,1,1,1,4,2015-01-01,B6,20409,B6,N621JB,730,10732,1073203,30732,BQN,"Aguadilla, PR",PR,72,Puerto Rico,3,13204,1320402,31454,MCO,"Orlando, FL",FL,12,Florida,33,419,423,4.0,4.0,0.0,0,0001-0559,11.0,434,602,4.0,613,606,-7.0,0.0,0.0,-1,0600-0659,0.0,,0.0,174.0,163.0,148.0,1.0,1129.0,5,,,,,,42952599672,2885,TJBQ,-4,America/Puerto_Rico,3878,KMCO,-5,America/New_York,2015-01-01T08:19:00.000+0000,2015-01-01T08:23:00.000+0000,2015-01-01T06:19:00.000+0000,2015-01-01T11:13:00.000+0000,2015-01-01T11:06:00.000+0000,78514011603.0,72205012815.0,2015-01-01T00:50:00.000+0000,18.4977,-67.1294,66.4,140.0,26.0,22000.0,1.0,16093.0,9.0,230.0,220.0,,,,2015-01-01T06:00:00.000+0000,28.4339,-81.325,27.4,360.0,36.0,,9.0,8000.0,9.0,156.0,144.0,10232.0,6.0,0.0,0,
2015,1,1,1,4,2015-01-01,B6,20409,B6,N715JB,838,10732,1073203,30732,BQN,"Aguadilla, PR",PR,72,Puerto Rico,3,12478,1247802,31703,JFK,"New York, NY",NY,36,New York,22,600,601,1.0,1.0,0.0,0,0600-0659,11.0,612,857,10.0,852,907,15.0,15.0,1.0,1,0800-0859,0.0,,0.0,232.0,246.0,225.0,1.0,1576.0,7,1.0,0.0,14.0,0.0,0.0,42952600954,2885,TJBQ,-4,America/Puerto_Rico,3797,KJFK,-5,America/New_York,2015-01-01T10:00:00.000+0000,2015-01-01T10:01:00.000+0000,2015-01-01T08:00:00.000+0000,2015-01-01T13:52:00.000+0000,2015-01-01T14:07:00.000+0000,78514011603.0,74486094789.0,2015-01-01T00:50:00.000+0000,18.4977,-67.1294,66.4,140.0,26.0,22000.0,1.0,16093.0,9.0,230.0,220.0,,,,2015-01-01T07:51:00.000+0000,40.6386,-73.7622,3.4,250.0,62.0,22000.0,5.0,16093.0,,-17.0,-117.0,10212.0,1.0,0.0,0,0.0
2015,1,1,1,4,2015-01-01,B6,20409,B6,N794JB,938,10732,1073203,30732,BQN,"Aguadilla, PR",PR,72,Puerto Rico,3,12478,1247802,31703,JFK,"New York, NY",NY,36,New York,22,1108,1109,1.0,1.0,0.0,0,1100-1159,10.0,1119,1402,4.0,1359,1406,7.0,7.0,0.0,0,1300-1359,0.0,,0.0,231.0,237.0,223.0,1.0,1576.0,7,,,,,,42952602921,2885,TJBQ,-4,America/Puerto_Rico,3797,KJFK,-5,America/New_York,2015-01-01T15:08:00.000+0000,2015-01-01T15:09:00.000+0000,2015-01-01T13:08:00.000+0000,2015-01-01T18:59:00.000+0000,2015-01-01T19:06:00.000+0000,78514011603.0,74486094789.0,2015-01-01T12:50:00.000+0000,18.4977,-67.1294,66.4,120.0,36.0,22000.0,1.0,16093.0,9.0,240.0,210.0,,,,2015-01-01T12:51:00.000+0000,40.6386,-73.7622,3.4,260.0,62.0,22000.0,5.0,16093.0,,-17.0,-128.0,10222.0,1.0,0.0,0,0.0
2015,1,1,1,4,2015-01-01,MQ,20398,MQ,N651MQ,3009,11067,1106702,31067,CMI,"Champaign/Urbana, IL",IL,17,Illinois,41,11298,1129803,30194,DFW,"Dallas/Fort Worth, TX",TX,48,Texas,74,645,642,-3.0,0.0,0.0,-1,0600-0659,12.0,654,851,10.0,910,901,-9.0,0.0,0.0,-1,0900-0959,0.0,,0.0,145.0,139.0,117.0,1.0,693.0,3,,,,,,42952759866,4049,KCMI,-6,America/Chicago,3670,KDFW,-6,America/Chicago,2015-01-01T12:45:00.000+0000,2015-01-01T12:42:00.000+0000,2015-01-01T10:45:00.000+0000,2015-01-01T15:10:00.000+0000,2015-01-01T15:01:00.000+0000,72531594870.0,72259003927.0,2015-01-01T09:53:00.000+0000,40.03972,-88.27778,229.8,230.0,51.0,22000.0,5.0,16093.0,,-83.0,-122.0,10240.0,1.0,0.0,2015-01-01T09:53:00.000+0000,32.8978,-97.0189,170.7,50.0,31.0,2438.0,5.0,16093.0,,0.0,-39.0,10266.0,1.0,0.0,0,
2015,1,1,1,4,2015-01-01,MQ,20398,MQ,N847MQ,3170,11067,1106702,31067,CMI,"Champaign/Urbana, IL",IL,17,Illinois,41,13930,1393003,30977,ORD,"Chicago, IL",IL,17,Illinois,41,730,722,-8.0,0.0,0.0,-1,0700-0759,18.0,740,812,7.0,845,819,-26.0,0.0,0.0,-2,0800-0859,0.0,,0.0,75.0,57.0,32.0,1.0,135.0,1,,,,,,42952635772,4049,KCMI,-6,America/Chicago,3830,KORD,-6,America/Chicago,2015-01-01T13:30:00.000+0000,2015-01-01T13:22:00.000+0000,2015-01-01T11:30:00.000+0000,2015-01-01T14:45:00.000+0000,2015-01-01T14:19:00.000+0000,72531594870.0,72530094846.0,2015-01-01T10:53:00.000+0000,40.03972,-88.27778,229.8,230.0,67.0,22000.0,5.0,16093.0,,-78.0,-117.0,10239.0,1.0,0.0,2015-01-01T10:51:00.000+0000,41.995,-87.9336,201.8,240.0,62.0,22000.0,5.0,16093.0,,-100.0,-150.0,10178.0,1.0,0.0,0,
2015,1,1,1,4,2015-01-01,MQ,20398,MQ,N675MQ,3120,11067,1106702,31067,CMI,"Champaign/Urbana, IL",IL,17,Illinois,41,13930,1393003,30977,ORD,"Chicago, IL",IL,17,Illinois,41,835,830,-5.0,0.0,0.0,-1,0800-0859,12.0,842,915,55.0,940,1010,30.0,30.0,1.0,2,0900-0959,0.0,,0.0,65.0,100.0,33.0,1.0,135.0,1,0.0,0.0,30.0,0.0,0.0,42952633454,4049,KCMI,-6,America/Chicago,3830,KORD,-6,America/Chicago,2015-01-01T14:35:00.000+0000,2015-01-01T14:30:00.000+0000,2015-01-01T12:35:00.000+0000,2015-01-01T15:40:00.000+0000,2015-01-01T16:10:00.000+0000,72531594870.0,72530094846.0,2015-01-01T11:53:00.000+0000,40.03972,-88.27778,229.8,230.0,51.0,22000.0,5.0,16093.0,,-78.0,-117.0,10235.0,1.0,0.0,2015-01-01T12:00:00.000+0000,41.995,-87.9336,201.8,240.0,82.0,,9.0,16000.0,9.0,-94.0,-150.0,10181.0,,,0,
2015,1,1,1,4,2015-01-01,MQ,20398,MQ,N673MQ,2974,11067,1106702,31067,CMI,"Champaign/Urbana, IL",IL,17,Illinois,41,13930,1393003,30977,ORD,"Chicago, IL",IL,17,Illinois,41,950,943,-7.0,0.0,0.0,-1,0900-0959,17.0,1000,1030,27.0,1055,1057,2.0,2.0,0.0,0,1000-1059,0.0,,0.0,65.0,74.0,30.0,1.0,135.0,1,,,,,,42952760985,4049,KCMI,-6,America/Chicago,3830,KORD,-6,America/Chicago,2015-01-01T15:50:00.000+0000,2015-01-01T15:43:00.000+0000,2015-01-01T13:50:00.000+0000,2015-01-01T16:55:00.000+0000,2015-01-01T16:57:00.000+0000,72531594870.0,72530094846.0,2015-01-01T12:53:00.000+0000,40.03972,-88.27778,229.8,220.0,51.0,22000.0,5.0,16093.0,,-89.0,-117.0,10225.0,1.0,0.0,2015-01-01T12:51:00.000+0000,41.995,-87.9336,201.8,240.0,82.0,22000.0,5.0,16093.0,,-94.0,-150.0,10179.0,1.0,0.0,0,0.0
2015,1,1,1,4,2015-01-01,MQ,20398,MQ,N698MQ,3274,11067,1106702,31067,CMI,"Champaign/Urbana, IL",IL,17,Illinois,41,13930,1393003,30977,ORD,"Chicago, IL",IL,17,Illinois,41,1150,1343,113.0,113.0,1.0,7,1100-1159,10.0,1353,1432,10.0,1250,1442,112.0,112.0,1.0,7,1200-1259,0.0,,0.0,60.0,59.0,39.0,1.0,135.0,1,112.0,0.0,0.0,0.0,0.0,42952865966,4049,KCMI,-6,America/Chicago,3830,KORD,-6,America/Chicago,2015-01-01T17:50:00.000+0000,2015-01-01T19:43:00.000+0000,2015-01-01T15:50:00.000+0000,2015-01-01T18:50:00.000+0000,2015-01-01T20:42:00.000+0000,72531594870.0,72530094846.0,2015-01-01T14:53:00.000+0000,40.03972,-88.27778,229.8,230.0,88.0,22000.0,5.0,16093.0,,-61.0,-100.0,10230.0,1.0,0.0,2015-01-01T14:51:00.000+0000,41.995,-87.9336,201.8,240.0,93.0,22000.0,5.0,16093.0,,-83.0,-144.0,10187.0,1.0,0.0,0,0.0


`NUM_PREV_DELAYS_ORIGIN` will contain null values, which we keep as null. A null value means that, for the given day, no flights at the origin airport were scheduled early enough for us to know whether they were delayed at prediction time. `DEP_DEL15_PREV_FIN`, as implemented above, does not produce nulls: the `.otherwise(0)` branch assigns 0 when there were no previous flights that day for a given tail number whose delay could have been known at prediction time. A value of 0 therefore covers both "the previous flight was not delayed" and "no usable previous flight exists".