# Flight Preprocessing - Data Generation

In [1]:
%AddJar file:///home/jovyan/work/apps/Emiasd-Flight-Data-Analysis.jar

Starting download from file:///home/jovyan/work/apps/Emiasd-Flight-Data-Analysis.jar
Finished download of Emiasd-Flight-Data-Analysis.jar
Using cached version of Emiasd-Flight-Data-Analysis.jar


In [2]:
import org.apache.spark.sql.SparkSession
import com.flightdelay.config.{AppConfiguration, ConfigurationLoader}
import com.flightdelay.data.loaders.FlightDataLoader

//Env Configuration
val args: Array[String] = Array("juniper")

val spark = SparkSession.builder()
  .config(sc.getConf)
  .getOrCreate()

// Rendre la session Spark implicite
implicit val session = spark
implicit val configuration: AppConfiguration = ConfigurationLoader.loadConfiguration(args)

// Cellule 4: Test
val flightData = FlightDataLoader.loadFromConfiguration(false)


[STEP 1][DataLoader] Flight Data Loading - Start

Loading from existing Parquet file:
  - Path: /home/jovyan/work/output/common/data/raw_flights.parquet
  - Loaded 486133 records from Parquet (optimized)

Schema:
root
 |-- FL_DATE: date (nullable = true)
 |-- OP_CARRIER_AIRLINE_ID: integer (nullable = true)
 |-- OP_CARRIER_FL_NUM: integer (nullable = true)
 |-- ORIGIN_AIRPORT_ID: integer (nullable = true)
 |-- DEST_AIRPORT_ID: integer (nullable = true)
 |-- CRS_DEP_TIME: integer (nullable = true)
 |-- ARR_DELAY_NEW: double (nullable = true)
 |-- CANCELLED: integer (nullable = true)
 |-- DIVERTED: integer (nullable = true)
 |-- CRS_ELAPSED_TIME: double (nullable = true)
 |-- WEATHER_DELAY: double (nullable = true)
 |-- NAS_DELAY: double (nullable = true)


Sample data (10 rows):
+----------+---------------------+-----------------+-----------------+---------------+------------+-------------+---------+--------+----------------+-------------+---------+
|   FL_DATE|OP_CARRIER_AIRLINE_ID|OP

args = Array(juniper)
spark = org.apache.spark.sql.SparkSession@6d08c785
session = org.apache.spark.sql.SparkSession@6d08c785
configuration = AppConfiguration(local,CommonConfig(42,DataConfig(/home/jovyan/work/data,FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Flights/201201.csv),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Weather/201201hourly.txt),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/wban_airport_timezone.csv)),OutputConfig(/home/jovyan/work/output,FileConfig(/home/jovyan/work/output/data),FileCo...


AppConfiguration(local,CommonConfig(42,DataConfig(/home/jovyan/work/data,FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Flights/201201.csv),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Weather/201201hourly.txt),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/wban_airport_timezone.csv)),OutputConfig(/home/jovyan/work/output,FileConfig(/home/jovyan/work/output/data),FileCo...

In [3]:
import com.flightdelay.data.utils.DataQualityMetrics

val flightDataMetrics = DataQualityMetrics.metrics(flightData)
flightDataMetrics.show()

+--------------------+-----------+-------+----------+----------------+
|                name|   origType|colType| compRatio|nbDistinctValues|
+--------------------+-----------+-------+----------+----------------+
|             FL_DATE|   DateType|   date|       1.0|              31|
|OP_CARRIER_AIRLIN...|IntegerType|numeric|       1.0|              15|
|   OP_CARRIER_FL_NUM|IntegerType|numeric|       1.0|            6237|
|   ORIGIN_AIRPORT_ID|IntegerType|numeric|       1.0|             287|
|     DEST_AIRPORT_ID|IntegerType|numeric|       1.0|             287|
|        CRS_DEP_TIME|IntegerType|numeric|       1.0|            1153|
|       ARR_DELAY_NEW| DoubleType|numeric| 0.9833214|             549|
|           CANCELLED|IntegerType|numeric|       0.0|               1|
|            DIVERTED|IntegerType|numeric|       0.0|               1|
|    CRS_ELAPSED_TIME| DoubleType|numeric|       1.0|             419|
|       WEATHER_DELAY| DoubleType|numeric|0.14586131|             288|
|     

flightDataMetrics = [name: string, origType: string ... 3 more fields]


[name: string, origType: string ... 3 more fields]

In [6]:
import com.flightdelay.data.preprocessing.flights.FlightDataCleaner

val flightCleanedData = FlightDataCleaner.preprocess(flightData)



[STEP 2][DataCleaner] Flight Data Cleaning - Start

Original dataset: 486133 records

Phase 1: Basic Cleaning
  - Current count: 486133 records

Phase 2: Filter Invalid Flights
  - Filtering cancelled and diverted flights
  - Filtering invalid departure times
  - Filtering invalid airports
  - Current count: 486133 records

Phase 3: Data Type Conversion
Conversion des types de données: NAS_DELAY, OP_CARRIER_AIRLINE_ID, OP_CARRIER_FL_NUM, WEATHER_DELAY, DEST_AIRPORT_ID, ORIGIN_AIRPORT_ID, CRS_ELAPSED_TIME, FL_DATE, CRS_DEP_TIME, ARR_DELAY_NEW
  - Filtering invalid flight dates
  - Current count: 486133 records

Phase 4: Outlier Filtering
  - Filtering delays > 600 minutes
  - Filtering flight times (10 min - 24 hours)
  - Current count: 486067 records

Phase 5: Final Validation
  - Validation passed: 486067 records

Cleaning Summary
Original records:       486,133
Final records:          486,067
Removed records:             66
Reduction:             0%


flightCleanedData = [FL_DATE: date, OP_CARRIER_AIRLINE_ID: int ... 8 more fields]


[FL_DATE: date, OP_CARRIER_AIRLINE_ID: int ... 8 more fields]

In [7]:
import com.flightdelay.data.preprocessing.flights.FlightDataGenerator

val withTemporalFeatures = FlightDataGenerator.addTemporalFeatures(flightCleanedData)




Phase 1: Add Temporal Features
Temporal features added: 24


withTemporalFeatures = [FL_DATE: date, OP_CARRIER_AIRLINE_ID: int ... 32 more fields]


[FL_DATE: date, OP_CARRIER_AIRLINE_ID: int ... 32 more fields]

In [8]:
val addedFeatureColumns = Seq(
  "FL_DATE",
  
  /**"OP_CARRIER_AIRLINE_ID",
  "OP_CARRIER_FL_NUM",
  "ORIGIN_AIRPORT_ID",
  "DEST_AIRPORT_ID",
  "CANCELLED",
  "DIVERTED",
  "WEATHER_DELAY",
  "NAS_DELAY",**/

  "CRS_DEP_TIME",
  "CRS_ELAPSED_TIME",
  "ARR_DELAY_NEW",
  
    
  /**"feature_flight_timestamp",
  "feature_flight_year",
  "feature_flight_month",
  "feature_flight_quarter",
  "feature_flight_day_of_month",
  "feature_flight_day_of_week",
  "feature_flight_day_of_year",
  "feature_flight_week_of_year",
  "feature_departure_hour",
  "feature_departure_minute",
  "feature_departure_hour_decimal",
  "feature_departure_quarter_day",
  "feature_minutes_since_midnight",
  "feature_departure_quarter_name",
  "feature_departure_time_period",**/

  "feature_departure_hour_rounded",
  "feature_departure_minutes_total",
  "feature_arrival_minutes_total",
  "feature_arrival_time",
  "feature_arrival_hour",
  "feature_arrival_minute",
  "feature_arrival_hour_decimal",
  "feature_arrival_hour_rounded"  
)

val df = withTemporalFeatures
  .select(addedFeatureColumns.head, addedFeatureColumns.tail: _*)
df.show(5)


+------------------------+-------------------+--------------------+----------------------+---------------------------+--------------------------+--------------------------+---------------------------+----------------------+------------------------+------------------------------+-----------------------------+------------------------------+------------------------------+-----------------------------+
|feature_flight_timestamp|feature_flight_year|feature_flight_month|feature_flight_quarter|feature_flight_day_of_month|feature_flight_day_of_week|feature_flight_day_of_year|feature_flight_week_of_year|feature_departure_hour|feature_departure_minute|feature_departure_hour_decimal|feature_departure_quarter_day|feature_minutes_since_midnight|feature_departure_quarter_name|feature_departure_time_period|
+------------------------+-------------------+--------------------+----------------------+---------------------------+--------------------------+--------------------------+------------------------

addedFeatureColumns = List(feature_flight_timestamp, feature_flight_year, feature_flight_month, feature_flight_quarter, feature_flight_day_of_month, feature_flight_day_of_week, feature_flight_day_of_year, feature_flight_week_of_year, feature_departure_hour, feature_departure_minute, feature_departure_hour_decimal, feature_departure_quarter_day, feature_minutes_since_midnight, feature_departure_quarter_name, feature_departure_time_period)
df = [feature_flight_timestamp: timestamp, feature_flight_year: int ... 13 more fields]


[feature_flight_timestamp: timestamp, feature_flight_year: int ... 13 more fields]

In [8]:
import com.flightdelay.data.utils.DataQualityMetrics

val temporalfeatureMetric = DataQualityMetrics.metrics(df3)
temporalfeatureMetric.show()

temporalfeatureMetric = [name: string, origType: string ... 3 more fields]


+--------------------+-------------+-------+---------+----------------+
|                name|     origType|colType|compRatio|nbDistinctValues|
+--------------------+-------------+-------+---------+----------------+
|feature_flight_ti...|TimestampType|   date|      1.0|              31|
| feature_flight_year|  IntegerType|numeric|      1.0|               1|
|feature_flight_month|  IntegerType|numeric|      1.0|               1|
|feature_flight_qu...|  IntegerType|numeric|      1.0|               1|
|feature_flight_da...|  IntegerType|numeric|      1.0|              31|
|feature_flight_da...|  IntegerType|numeric|      1.0|               7|
|feature_flight_da...|  IntegerType|numeric|      1.0|              31|
|feature_flight_we...|  IntegerType|numeric|      1.0|               6|
|feature_departure...|  IntegerType|numeric|      1.0|              24|
|feature_departure...|  IntegerType|numeric|      1.0|              60|
|feature_departure...|   DoubleType|numeric|      1.0|          

[name: string, origType: string ... 3 more fields]

In [8]:
import com.flightdelay.data.preprocessing.FlightDataGenerator

val withFlightFeatures = FlightDataGenerator.addFlightCharacteristics(withTemporalFeatures)


Phase 2: Add Flight Characteristics
- Add feature_flight_unique_id 
- Add feature_distance_category (short, medium, long, very_long) 
- Add feature_distance_score 
- Add feature_is_likely_domestic 
- Add feature_carrier_hash 
- Add feature_route_id 
- Add feature_is_roundtrip_candidate 
Added Flight features: 7


withFlightFeatures = [FL_DATE: string, OP_CARRIER_AIRLINE_ID: int ... 31 more fields]


[FL_DATE: string, OP_CARRIER_AIRLINE_ID: int ... 31 more fields]

In [9]:
val filgthCarateristiquesColumns = Seq(
  "feature_flight_unique_id",
  "feature_distance_category",
  "feature_distance_score",
  "feature_is_likely_domestic",
  "feature_carrier_hash",
  "feature_route_id",
  "feature_is_roundtrip_candidate"
)

val df = withFlightFeatures
  .select(filgthCarateristiquesColumns.head, filgthCarateristiquesColumns.tail: _*)
df.show(5)

+------------------------+-------------------------+----------------------+--------------------------+--------------------+----------------+------------------------------+
|feature_flight_unique_id|feature_distance_category|feature_distance_score|feature_is_likely_domestic|feature_carrier_hash|feature_route_id|feature_is_roundtrip_candidate|
+------------------------+-------------------------+----------------------+--------------------------+--------------------+----------------+------------------------------+
|    2012-01-01_20366_...|                   medium|   0.26166666666666666|                         1|         -2080468873|     10397_11618|                             1|
|    2012-01-01_20366_...|                   medium|   0.23666666666666666|                         1|         -2080468873|     11618_13930|                             1|
|    2012-01-01_20366_...|                    short|                 0.135|                         1|         -2080468873|     10397_11617|

filgthCarateristiquesColumns = List(feature_flight_unique_id, feature_distance_category, feature_distance_score, feature_is_likely_domestic, feature_carrier_hash, feature_route_id, feature_is_roundtrip_candidate)
df = [feature_flight_unique_id: string, feature_distance_category: string ... 5 more fields]


[feature_flight_unique_id: string, feature_distance_category: string ... 5 more fields]

In [10]:
import com.flightdelay.data.utils.DataQualityMetrics

val fightFeaturesMetric = DataQualityMetrics.metrics(df)
fightFeaturesMetric.show()

fightFeaturesMetric = [name: string, origType: string ... 3 more fields]


+--------------------+-----------+-------+---------+----------------+
|                name|   origType|colType|compRatio|nbDistinctValues|
+--------------------+-----------+-------+---------+----------------+
|feature_flight_un...| StringType|textual|      1.0|          477960|
|feature_distance_...| StringType|textual|      1.0|               4|
|feature_distance_...| DoubleType|numeric|      1.0|             417|
|feature_is_likely...|IntegerType|numeric|      1.0|               2|
|feature_carrier_hash|IntegerType|numeric|      1.0|              15|
|    feature_route_id| StringType|textual|      1.0|            2033|
|feature_is_roundt...|IntegerType|numeric|      1.0|               2|
+--------------------+-----------+-------+---------+----------------+



[name: string, origType: string ... 3 more fields]

In [11]:
val filgthCarateristiquesNumericColumns = Seq(
  "feature_distance_score",
  "feature_is_likely_domestic",
  "feature_carrier_hash",
  "feature_is_roundtrip_candidate"
)

val filgthCarateristiquesCategoricalColumns = Seq(
  "feature_flight_unique_id",
  "feature_distance_category",
  "feature_route_id"
)

filgthCarateristiquesNumericColumns = List(feature_distance_score, feature_is_likely_domestic, feature_carrier_hash, feature_is_roundtrip_candidate)
filgthCarateristiquesCategoricalColumns = List(feature_flight_unique_id, feature_distance_category, feature_route_id)


List(feature_flight_unique_id, feature_distance_category, feature_route_id)

In [12]:
import com.flightdelay.data.preprocessing.FlightDataGenerator

val withPeriodIndicators = FlightDataGenerator.addPeriodIndicators(withFlightFeatures)


Phase 3: Add Period <indicator
- Add feature_is_weekend, feature_is_friday, feature_is_monday
- Add feature_is_summer, feature_is_winter, feature_is_spring, feature_is_fall 
- Add feature_is_holiday_season (approximative)
- Add feature_is_early_morning 
- Add feature_is_morning_rush 
- Add feature_is_business_hours 
- Add feature_is_evening_rush 
- Add feature_is_night_flight 
- Add feature_is_month_start 
- Add feature_is_month_end 
- Add feature_is_extended_weekend 


withPeriodIndicators = [FL_DATE: string, OP_CARRIER_AIRLINE_ID: int ... 47 more fields]


Added Flight features: 16


[FL_DATE: string, OP_CARRIER_AIRLINE_ID: int ... 47 more fields]

In [13]:
val filgthPeriodIndicatorsColumns = Seq(
  "feature_is_weekend",
  "feature_is_friday",
  "feature_is_monday",
  "feature_is_summer",
  "feature_is_winter",
  "feature_is_spring",
  "feature_is_fall",
  "feature_is_holiday_season",
  "feature_is_early_morning", 
  "feature_is_morning_rush", 
  "feature_is_business_hours", 
  "feature_is_evening_rush",
  "feature_is_night_flight", 
  "feature_is_month_start", 
  "feature_is_month_end", 
  "feature_is_extended_weekend"     
)

val df = withPeriodIndicators
  .select(filgthPeriodIndicatorsColumns.head, filgthPeriodIndicatorsColumns.tail: _*)
df.show(5)

+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+---------------+-------------------------+------------------------+-----------------------+-------------------------+-----------------------+-----------------------+----------------------+--------------------+---------------------------+
|feature_is_weekend|feature_is_friday|feature_is_monday|feature_is_summer|feature_is_winter|feature_is_spring|feature_is_fall|feature_is_holiday_season|feature_is_early_morning|feature_is_morning_rush|feature_is_business_hours|feature_is_evening_rush|feature_is_night_flight|feature_is_month_start|feature_is_month_end|feature_is_extended_weekend|
+------------------+-----------------+-----------------+-----------------+-----------------+-----------------+---------------+-------------------------+------------------------+-----------------------+-------------------------+-----------------------+-----------------------+----------------------+------

filgthPeriodIndicatorsColumns = List(feature_is_weekend, feature_is_friday, feature_is_monday, feature_is_summer, feature_is_winter, feature_is_spring, feature_is_fall, feature_is_holiday_season, feature_is_early_morning, feature_is_morning_rush, feature_is_business_hours, feature_is_evening_rush, feature_is_night_flight, feature_is_month_start, feature_is_month_end, feature_is_extended_weekend)
df = [feature_is_weekend: int, feature_is_friday: int ... 14 more fields]


[feature_is_weekend: int, feature_is_friday: int ... 14 more fields]

In [14]:
import com.flightdelay.data.preprocessing.FlightDataGenerator

val withGeographicFeatures = FlightDataGenerator.addGeographicFeatures(withPeriodIndicators)


Phase 4: Add Geographical Features
- Add feature_origin_is_major_hub (10397, 11298, 12266, 13930, 14107, 14771, 15016  // Principaux hubs US)
- Add feature_dest_is_major_hub  (10397, 11298, 12266, 13930, 14107, 14771, 15016  // Principaux hubs US)
- Add feature_is_hub_to_hub
- Add feature_flight_quarter
- Add feature_origin_complexity_score
- Add feature_dest_complexity_score
- Add feature_timezone_diff_proxy
- Add feature_flight_week_of_year
- Add feature_is_eastbound
- Add feature_is_westbound
Added Flight features: 8


withGeographicFeatures = [FL_DATE: string, OP_CARRIER_AIRLINE_ID: int ... 55 more fields]


[FL_DATE: string, OP_CARRIER_AIRLINE_ID: int ... 55 more fields]

In [15]:
val filgthGeographicFeaturesColumns = Seq(
  "feature_origin_is_major_hub",
  "feature_dest_is_major_hub",
  "feature_is_hub_to_hub",
  "feature_flight_quarter",
  "feature_origin_complexity_score",
  "feature_dest_complexity_score",
  "feature_timezone_diff_proxy",
  "feature_flight_week_of_year",
  "feature_is_eastbound",
  "feature_is_westbound"
)

val df = withGeographicFeatures
  .select(filgthGeographicFeaturesColumns.head, filgthGeographicFeaturesColumns.tail: _*)
df.show(5)

+---------------------------+-------------------------+---------------------+----------------------+-------------------------------+-----------------------------+---------------------------+---------------------------+--------------------+--------------------+
|feature_origin_is_major_hub|feature_dest_is_major_hub|feature_is_hub_to_hub|feature_flight_quarter|feature_origin_complexity_score|feature_dest_complexity_score|feature_timezone_diff_proxy|feature_flight_week_of_year|feature_is_eastbound|feature_is_westbound|
+---------------------------+-------------------------+---------------------+----------------------+-------------------------------+-----------------------------+---------------------------+---------------------------+--------------------+--------------------+
|                          0|                        1|                    0|                     1|                           0.18|                         0.97|                          1|                         52

filgthGeographicFeaturesColumns = List(feature_origin_is_major_hub, feature_dest_is_major_hub, feature_is_hub_to_hub, feature_flight_quarter, feature_origin_complexity_score, feature_dest_complexity_score, feature_timezone_diff_proxy, feature_flight_week_of_year, feature_is_eastbound, feature_is_westbound)
df = [feature_origin_is_major_hub: int, feature_dest_is_major_hub: int ... 8 more fields]


[feature_origin_is_major_hub: int, feature_dest_is_major_hub: int ... 8 more fields]

In [16]:
import com.flightdelay.data.preprocessing.FlightDataGenerator

val withAggregatedFeatures = FlightDataGenerator.addAggregatedFeatures(withGeographicFeatures)


Phase 5 : Add Aggregated Features
- Add feature_flights_on_route
- Add feature_carrier_flight_count
- Add feature_origin_airport_traffic
- Add feature_route_popularity_score
- Add feature_carrier_size_category


withAggregatedFeatures = [FL_DATE: string, OP_CARRIER_AIRLINE_ID: int ... 60 more fields]


Added Flight features: 5


[FL_DATE: string, OP_CARRIER_AIRLINE_ID: int ... 60 more fields]

In [17]:
val filgthAggregatedFeaturesColumns = Seq(
  "feature_flights_on_route",
  "feature_carrier_flight_count",
  "feature_origin_airport_traffic",
  "feature_route_popularity_score",
  "feature_carrier_size_category"
)

val df = withAggregatedFeatures
  .select(filgthAggregatedFeaturesColumns.head, filgthAggregatedFeaturesColumns.tail: _*)
df.show(5)

+------------------------+----------------------------+------------------------------+------------------------------+-----------------------------+
|feature_flights_on_route|feature_carrier_flight_count|feature_origin_airport_traffic|feature_route_popularity_score|feature_carrier_size_category|
+------------------------+----------------------------+------------------------------+------------------------------+-----------------------------+
|                      59|                       33946|                           245|                        medium|                        major|
|                      59|                       33946|                           245|                        medium|                        major|
|                      59|                       33946|                           245|                        medium|                        major|
|                      59|                       33946|                           245|                        me

filgthAggregatedFeaturesColumns = List(feature_flights_on_route, feature_carrier_flight_count, feature_origin_airport_traffic, feature_route_popularity_score, feature_carrier_size_category)
df = [feature_flights_on_route: bigint, feature_carrier_flight_count: bigint ... 3 more fields]


[feature_flights_on_route: bigint, feature_carrier_flight_count: bigint ... 3 more fields]