# Flight Data Pipeline

In [1]:
%AddJar file:///home/jovyan/work/apps/Emiasd-Flight-Data-Analysis.jar

Starting download from file:///home/jovyan/work/apps/Emiasd-Flight-Data-Analysis.jar
Finished download of Emiasd-Flight-Data-Analysis.jar
Using cached version of Emiasd-Flight-Data-Analysis.jar


In [2]:
import org.apache.spark.sql.SparkSession
import com.flightdelay.config.{AppConfiguration, ConfigurationLoader}
import com.flightdelay.data.loaders.FlightDataLoader

// Environment configuration: the single argument selects the configuration profile
val args: Array[String] = Array("juniper")

val spark = SparkSession.builder()
  .config(sc.getConf)
  .getOrCreate()

// Make the Spark session implicit
implicit val session: SparkSession = spark
implicit val configuration: AppConfiguration = ConfigurationLoader.loadConfiguration(args)

// Cell 4: Test
val flightData = FlightDataLoader.loadFromConfiguration(false)


[STEP 1][DataLoader] Flight Data Loading - Start

Loading from existing Parquet file:
  - Path: /home/jovyan/work/output/common/data/raw_flights.parquet
  - Loaded 486133 records from Parquet (optimized)

Schema:
root
 |-- FL_DATE: date (nullable = true)
 |-- OP_CARRIER_AIRLINE_ID: integer (nullable = true)
 |-- OP_CARRIER_FL_NUM: integer (nullable = true)
 |-- ORIGIN_AIRPORT_ID: integer (nullable = true)
 |-- DEST_AIRPORT_ID: integer (nullable = true)
 |-- CRS_DEP_TIME: integer (nullable = true)
 |-- ARR_DELAY_NEW: double (nullable = true)
 |-- CANCELLED: integer (nullable = true)
 |-- DIVERTED: integer (nullable = true)
 |-- CRS_ELAPSED_TIME: double (nullable = true)
 |-- WEATHER_DELAY: double (nullable = true)
 |-- NAS_DELAY: double (nullable = true)


Sample data (10 rows):
+----------+---------------------+-----------------+-----------------+---------------+------------+-------------+---------+--------+----------------+-------------+---------+
|   FL_DATE|OP_CARRIER_AIRLINE_ID|OP

args = Array(juniper)
spark = org.apache.spark.sql.SparkSession@657c5a35
session = org.apache.spark.sql.SparkSession@657c5a35
configuration = AppConfiguration(local,CommonConfig(42,DataConfig(/home/jovyan/work/data,FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Flights/201201.csv),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Weather/201201hourly.txt),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/wban_airport_timezone.csv)),OutputConfig(/home/jovyan/work/output,FileConfig(/home/jovyan/work/output/data),FileCo...

# Data Pipeline

In [3]:
import com.flightdelay.data.DataPipeline

val dataPipelineDF = DataPipeline.execute()


[DataPipeline] Complete Data Pipeline - Start

[Step 1/6] Loading raw flight data...

[STEP 1][DataLoader] Flight Data Loading - Start

Loading from existing Parquet file:
  - Path: /home/jovyan/work/output/common/data/raw_flights.parquet
  - Loaded 486133 records from Parquet (optimized)

Schema:
root
 |-- FL_DATE: date (nullable = true)
 |-- OP_CARRIER_AIRLINE_ID: integer (nullable = true)
 |-- OP_CARRIER_FL_NUM: integer (nullable = true)
 |-- ORIGIN_AIRPORT_ID: integer (nullable = true)
 |-- DEST_AIRPORT_ID: integer (nullable = true)
 |-- CRS_DEP_TIME: integer (nullable = true)
 |-- ARR_DELAY_NEW: double (nullable = true)
 |-- CANCELLED: integer (nullable = true)
 |-- DIVERTED: integer (nullable = true)
 |-- CRS_ELAPSED_TIME: double (nullable = true)
 |-- WEATHER_DELAY: double (nullable = true)
 |-- NAS_DELAY: double (nullable = true)


Sample data (10 rows):
+----------+---------------------+-----------------+-----------------+---------------+------------+-------------+---------+-

dataPipelineDF = [FL_DATE: date, OP_CARRIER_AIRLINE_ID: int ... 227 more fields]

In [4]:
dataPipelineDF.cache()
dataPipelineDF.count()

134665

In [5]:
dataPipelineDF.printSchema()

root
 |-- FL_DATE: date (nullable = true)
 |-- OP_CARRIER_AIRLINE_ID: integer (nullable = true)
 |-- OP_CARRIER_FL_NUM: integer (nullable = true)
 |-- ORIGIN_AIRPORT_ID: integer (nullable = true)
 |-- DEST_AIRPORT_ID: integer (nullable = true)
 |-- CRS_DEP_TIME: integer (nullable = true)
 |-- ARR_DELAY_NEW: double (nullable = true)
 |-- CRS_ELAPSED_TIME: double (nullable = true)
 |-- WEATHER_DELAY: double (nullable = true)
 |-- NAS_DELAY: double (nullable = true)
 |-- ORIGIN_WBAN: string (nullable = true)
 |-- ORIGIN_TIMEZONE: integer (nullable = true)
 |-- DEST_WBAN: string (nullable = true)
 |-- DEST_TIMEZONE: integer (nullable = true)
 |-- UTC_CRS_DEP_TIME: string (nullable = true)
 |-- UTC_FL_DATE: date (nullable = true)
 |-- feature_departure_minute: integer (nullable = true)
 |-- feature_flight_days_span: integer (nullable = true)
 |-- feature_arrival_minutes_total: double (nullable = true)
 |-- feature_crosses_midnight: integer (nullable = true)
 |-- feature_flight_day_of_week: 

In [6]:
// Columns to display; commented-out names are kept so they can be re-enabled quickly.
val addedFeatureColumns = Seq(
  "FL_DATE",
  "UTC_FL_DATE",
  /* "OP_CARRIER_AIRLINE_ID",
  "OP_CARRIER_FL_NUM",
  "ORIGIN_AIRPORT_ID",
  "DEST_AIRPORT_ID",
  "WEATHER_DELAY",
  "NAS_DELAY", */

  "CRS_DEP_TIME",
  "UTC_CRS_DEP_TIME",
  "CRS_ELAPSED_TIME",

  // "ORIGIN_WBAN",
  "ORIGIN_TIMEZONE",
  // "DEST_WBAN",
  // "DEST_TIMEZONE",

  /* "feature_flight_timestamp",
  "feature_flight_year",
  "feature_flight_month",
  "feature_flight_quarter",
  "feature_flight_day_of_month",
  "feature_flight_day_of_week",
  "feature_flight_day_of_year",
  "feature_flight_week_of_year",
  "feature_departure_hour",
  "feature_departure_minute",
  "feature_departure_hour_decimal",
  "feature_departure_quarter_day",
  "feature_minutes_since_midnight",
  "feature_departure_quarter_name",
  "feature_departure_time_period", */

  "feature_departure_hour_rounded",
  "feature_utc_departure_hour_rounded",
  // "feature_departure_minutes_total",
  // "feature_arrival_hour_rounded",
  /* "feature_arrival_minutes_total",
  "feature_arrival_time",
  "feature_arrival_hour",
  "feature_arrival_minute",
  "feature_arrival_hour_decimal", */
  // "feature_crosses_midnight",
  // "feature_flight_days_span", "feature_arrival_date"
  "origin_weather_SkyCondition_array",
  "origin_weather_SkyConditionFlag_array"
)

// Keep only the selected columns; select(String, String*) needs head/tail expansion.
val df = dataPipelineDF
  .select(addedFeatureColumns.head, addedFeatureColumns.tail: _*)
df.show(100)


+----------+-----------+------------+----------------+----------------+---------------+------------------------------+----------------------------------+---------------------------------+-------------------------------------+
|   FL_DATE|UTC_FL_DATE|CRS_DEP_TIME|UTC_CRS_DEP_TIME|CRS_ELAPSED_TIME|ORIGIN_TIMEZONE|feature_departure_hour_rounded|feature_utc_departure_hour_rounded|origin_weather_SkyCondition_array|origin_weather_SkyConditionFlag_array|
+----------+-----------+------------+----------------+----------------+---------------+------------------------------+----------------------------------+---------------------------------+-------------------------------------+
|2012-01-27| 2012-01-27|        1047|            1847|            82.0|             -8|                          1100|                              1900|                               []|                                   []|
|2012-01-20| 2012-01-20|         700|            1500|            96.0|             -8|         

addedFeatureColumns = List(FL_DATE, UTC_FL_DATE, CRS_DEP_TIME, UTC_CRS_DEP_TIME, CRS_ELAPSED_TIME, ORIGIN_TIMEZONE, feature_departure_hour_rounded, feature_utc_departure_hour_rounded, origin_weather_SkyCondition_array, origin_weather_SkyConditionFlag_array)
df = [FL_DATE: date, UTC_FL_DATE: date ... 8 more fields]
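
In the sample rows above, `CRS_DEP_TIME` 1047 with `ORIGIN_TIMEZONE` -8 appears as `UTC_CRS_DEP_TIME` 1847, and the rounded-hour features are 1100 and 1900. The pipeline's actual implementation is not shown in this notebook, so the following is only a minimal sketch of arithmetic that reproduces those values. The object `HhmmSketch` and its methods are hypothetical names, and the sketch assumes whole-hour offsets and rounding to the nearest hour (half up).

```scala
// Hypothetical helpers, not the pipeline's real code.
object HhmmSketch {
  private val MinutesPerDay = 24 * 60

  /** Convert an HHMM integer (e.g. 1047) to minutes since midnight. */
  def toMinutes(hhmm: Int): Int = (hhmm / 100) * 60 + hhmm % 100

  /** Convert minutes since midnight back to HHMM, wrapping around 24 hours. */
  def toHhmm(minutes: Int): Int = {
    val wrapped = Math.floorMod(minutes, MinutesPerDay)
    (wrapped / 60) * 100 + wrapped % 60
  }

  /** Local HHMM to UTC HHMM, assuming a whole-hour offset such as -8. */
  def toUtc(localHhmm: Int, utcOffsetHours: Int): Int =
    toHhmm(toMinutes(localHhmm) - utcOffsetHours * 60)

  /** Round an HHMM value to the nearest hour (assumption: half up). */
  def roundToHour(hhmm: Int): Int =
    toHhmm(((toMinutes(hhmm) + 30) / 60) * 60)
}

val utc = HhmmSketch.toUtc(1047, -8)          // 1847, as in the first sample row
val localRounded = HhmmSketch.roundToHour(1047) // 1100
val utcRounded = HhmmSketch.roundToHour(utc)    // 1900
```

The wrap-around in `toHhmm` discards the day change, which is presumably why the dataset carries a separate `UTC_FL_DATE` column next to `FL_DATE`.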