# Flight Data Pipeline

In [5]:
%AddJar file:///home/jovyan/work/apps/Emiasd-Flight-Data-Analysis.jar

Using cached version of Emiasd-Flight-Data-Analysis.jar


In [6]:
import org.apache.spark.sql.SparkSession
import com.flightdelay.config.{AppConfiguration, ConfigurationLoader}
import com.flightdelay.data.loaders.FlightDataLoader

// Environment configuration
val args: Array[String] = Array("juniper")

val spark = SparkSession.builder()
  .config(sc.getConf)
  .getOrCreate()

// Make the Spark session implicit
implicit val session: SparkSession = spark
implicit val configuration: AppConfiguration = ConfigurationLoader.loadConfiguration(args)

// Cell 4: Test
val flightData = FlightDataLoader.loadFromConfiguration(false)


[STEP 1][DataLoader] Flight Data Loading - Start

Loading from existing Parquet file:
  - Path: /home/jovyan/work/output/common/data/raw_flights.parquet
  - Loaded 486133 records from Parquet (optimized)

Schema:
root
 |-- FL_DATE: date (nullable = true)
 |-- OP_CARRIER_AIRLINE_ID: integer (nullable = true)
 |-- OP_CARRIER_FL_NUM: integer (nullable = true)
 |-- ORIGIN_AIRPORT_ID: integer (nullable = true)
 |-- DEST_AIRPORT_ID: integer (nullable = true)
 |-- CRS_DEP_TIME: integer (nullable = true)
 |-- ARR_DELAY_NEW: double (nullable = true)
 |-- CANCELLED: integer (nullable = true)
 |-- DIVERTED: integer (nullable = true)
 |-- CRS_ELAPSED_TIME: double (nullable = true)
 |-- WEATHER_DELAY: double (nullable = true)
 |-- NAS_DELAY: double (nullable = true)


Sample data (10 rows):


args = Array(juniper)
spark = org.apache.spark.sql.SparkSession@636bb7f5
session = org.apache.spark.sql.SparkSession@636bb7f5
configuration = AppConfiguration(local,CommonConfig(42,DataConfig(/home/jovyan/work/data,FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Flights/201201.csv),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Weather/201201hourly.txt),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/wban_airport_timezone.csv)),OutputConfig(/home/jovyan/work/output,FileConfig(/home/jovyan/work/output/data),FileCo...


+----------+---------------------+-----------------+-----------------+---------------+------------+-------------+---------+--------+----------------+-------------+---------+
|   FL_DATE|OP_CARRIER_AIRLINE_ID|OP_CARRIER_FL_NUM|ORIGIN_AIRPORT_ID|DEST_AIRPORT_ID|CRS_DEP_TIME|ARR_DELAY_NEW|CANCELLED|DIVERTED|CRS_ELAPSED_TIME|WEATHER_DELAY|NAS_DELAY|
+----------+---------------------+-----------------+-----------------+---------------+------------+-------------+---------+--------+----------------+-------------+---------+
|2012-01-01|                20366|             4426|            15370|          12266|         845|          0.0|     NULL|    NULL|            99.0|         NULL|     NULL|
|2012-01-01|                20366|             4427|            12266|          15370|         858|          0.0|     NULL|    NULL|            88.0|         NULL|     NULL|
|2012-01-01|                20366|             4427|            15370|          12266|        1051|          0.0|     NULL|    NUL

AppConfiguration(local,CommonConfig(42,DataConfig(/home/jovyan/work/data,FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Flights/201201.csv),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Weather/201201hourly.txt),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/wban_airport_timezone.csv)),OutputConfig(/home/jovyan/work/output,FileConfig(/home/jovyan/work/output/data),FileCo...
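
The cell above declares the session and the configuration as implicit values, which is what allows `FlightDataLoader.loadFromConfiguration(false)` to be called without passing them explicitly. The project's real signatures are not shown in this notebook, so the sketch below uses hypothetical stand-in types (`Session`, `Config`, `Loader`) and a hypothetical parameter name (`validate`) purely to illustrate how Scala resolves implicit parameters.

```scala
// Hypothetical stand-ins for illustration; these are NOT the project's real classes.
final case class Session(name: String)
final case class Config(env: String)

object Loader {
  // The second parameter list is filled in by the compiler from implicit values in scope.
  // The meaning of the Boolean flag is an assumption; the notebook does not document it.
  def loadFromConfiguration(validate: Boolean)(implicit session: Session, config: Config): String =
    s"session=${session.name}, env=${config.env}, validate=$validate"
}

object ImplicitDemo {
  implicit val session: Session = Session("notebook")
  implicit val config: Config = Config("local")

  // No explicit session or config argument is needed at the call site.
  val description: String = Loader.loadFromConfiguration(false)

  def main(args: Array[String]): Unit = println(description)
}
```

Giving each implicit value an explicit type, as in `implicit val session: Session`, keeps implicit resolution predictable and is required by Scala 3.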

# Data Pipeline

In [7]:
import com.flightdelay.data.DataPipeline

val dataPipelineDF = DataPipeline.execute()


[DataPipeline] Complete Data Pipeline - Start

[Step 1/6] Loading raw flight data...

[STEP 1][DataLoader] Flight Data Loading - Start

Loading from existing Parquet file:
  - Path: /home/jovyan/work/output/common/data/raw_flights.parquet
  - Loaded 486133 records from Parquet (optimized)

Schema:
root
 |-- FL_DATE: date (nullable = true)
 |-- OP_CARRIER_AIRLINE_ID: integer (nullable = true)
 |-- OP_CARRIER_FL_NUM: integer (nullable = true)
 |-- ORIGIN_AIRPORT_ID: integer (nullable = true)
 |-- DEST_AIRPORT_ID: integer (nullable = true)
 |-- CRS_DEP_TIME: integer (nullable = true)
 |-- ARR_DELAY_NEW: double (nullable = true)
 |-- CANCELLED: integer (nullable = true)
 |-- DIVERTED: integer (nullable = true)
 |-- CRS_ELAPSED_TIME: double (nullable = true)
 |-- WEATHER_DELAY: double (nullable = true)
 |-- NAS_DELAY: double (nullable = true)


Sample data (10 rows):
+----------+---------------------+-----------------+-----------------+---------------+------------+-------------+---------+-

dataPipelineDF = [FL_DATE: date, OP_CARRIER_AIRLINE_ID: int ... 89 more fields]


[FL_DATE: date, OP_CARRIER_AIRLINE_ID: int ... 89 more fields]

In [10]:
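// cache() is lazy: it only marks the DataFrame for caching
dataPipelineDF.cache()
// count() is an action, so it runs the pipeline and populates the cache
dataPipelineDF.count()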

135006

In [11]:
dataPipelineDF.printSchema

root
 |-- FL_DATE: date (nullable = true)
 |-- OP_CARRIER_AIRLINE_ID: integer (nullable = true)
 |-- OP_CARRIER_FL_NUM: integer (nullable = true)
 |-- ORIGIN_AIRPORT_ID: integer (nullable = true)
 |-- DEST_AIRPORT_ID: integer (nullable = true)
 |-- CRS_DEP_TIME: integer (nullable = true)
 |-- CRS_ELAPSED_TIME: double (nullable = true)
 |-- ORIGIN_WBAN: string (nullable = true)
 |-- ORIGIN_TIMEZONE: integer (nullable = true)
 |-- DEST_WBAN: string (nullable = true)
 |-- DEST_TIMEZONE: integer (nullable = true)
 |-- feature_departure_minute: integer (nullable = true)
 |-- feature_flight_days_span: integer (nullable = true)
 |-- feature_arrival_minutes_total: double (nullable = true)
 |-- feature_crosses_midnight: integer (nullable = false)
 |-- feature_flight_day_of_week: integer (nullable = true)
 |-- feature_departure_hour_decimal: double (nullable = true)
 |-- feature_flight_timestamp: timestamp (nullable = true)
 |-- feature_flight_quarter_name: string (nullable = false)
 |-- feature

In [13]:
// Columns to display; names inside comment blocks are excluded from the selection
val addedFeatureColumns = Seq(
  "FL_DATE",
  
  /**"OP_CARRIER_AIRLINE_ID",
  "OP_CARRIER_FL_NUM",
  "ORIGIN_AIRPORT_ID",
  "DEST_AIRPORT_ID",
  "CANCELLED",
  "DIVERTED",
  "WEATHER_DELAY",
  "NAS_DELAY",**/

  "CRS_DEP_TIME",
  "CRS_ELAPSED_TIME",
  
    
  /**"feature_flight_timestamp",
  "feature_flight_year",
  "feature_flight_month",
  "feature_flight_quarter",
  "feature_flight_day_of_month",
  "feature_flight_day_of_week",
  "feature_flight_day_of_year",
  "feature_flight_week_of_year",
  "feature_departure_hour",
  "feature_departure_minute",
  "feature_departure_hour_decimal",
  "feature_departure_quarter_day",
  "feature_minutes_since_midnight",
  "feature_departure_quarter_name",
  "feature_departure_time_period",**/

  "feature_departure_hour_rounded",
  //"feature_departure_minutes_total",
  "feature_arrival_hour_rounded", 
  /**"feature_arrival_minutes_total",
  "feature_arrival_time",
  "feature_arrival_hour",
  "feature_arrival_minute",
  "feature_arrival_hour_decimal",**/
  "feature_crosses_midnight",
  "feature_flight_days_span",
  "feature_arrival_date"
)

val df = dataPipelineDF
  .select(addedFeatureColumns.head, addedFeatureColumns.tail: _*)
df.show(5)


+----------+------------+----------------+------------------------------+----------------------------+------------------------+------------------------+--------------------+
|   FL_DATE|CRS_DEP_TIME|CRS_ELAPSED_TIME|feature_departure_hour_rounded|feature_arrival_hour_rounded|feature_crosses_midnight|feature_flight_days_span|feature_arrival_date|
+----------+------------+----------------+------------------------------+----------------------------+------------------------+------------------------+--------------------+
|2012-01-30|        1410|            90.0|                          1400|                        1600|                       0|                       0|          2012-01-30|
|2012-01-22|         955|            95.0|                          1000|                        1200|                       0|                       0|          2012-01-22|
|2012-01-27|        1900|            90.0|                          1900|                        2100|                       0|   

addedFeatureColumns = List(FL_DATE, CRS_DEP_TIME, CRS_ELAPSED_TIME, feature_departure_hour_rounded, feature_arrival_hour_rounded, feature_crosses_midnight, feature_flight_days_span, feature_arrival_date)
df = [FL_DATE: date, CRS_DEP_TIME: int ... 6 more fields]


[FL_DATE: date, CRS_DEP_TIME: int ... 6 more fields]
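
The feature logic itself lives in the project JAR and is not shown in this notebook. As a rough illustration of how columns such as `feature_departure_hour_rounded`, `feature_arrival_hour_rounded`, `feature_crosses_midnight`, `feature_flight_days_span` and `feature_arrival_date` could be derived from `FL_DATE`, `CRS_DEP_TIME` and `CRS_ELAPSED_TIME`, here is a minimal pure-Scala sketch. Everything in it is an assumption: the names `ArrivalFeatures`, `ArrivalFeatureSketch`, `roundToHour` and `compute` are hypothetical, no time zone offset is applied, and departure times are assumed to lie between 0 and 2359.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit

// Hypothetical reconstruction; the real pipeline may differ
// (for example, it could apply the ORIGIN_TIMEZONE / DEST_TIMEZONE offsets).
final case class ArrivalFeatures(
  departureHourRounded: Int,
  arrivalHourRounded: Int,
  crossesMidnight: Int,
  flightDaysSpan: Int,
  arrivalDate: LocalDate
)

object ArrivalFeatureSketch {
  // Round an HHMM clock value to the nearest hour (half rounds up), e.g. 955 -> 1000.
  def roundToHour(hhmm: Int): Int = {
    val hour = hhmm / 100
    val minute = hhmm % 100
    val rounded = if (minute >= 30) hour + 1 else hour
    (rounded % 24) * 100 // wrap 24 back to 0; an assumption about the real behaviour
  }

  // Assumes crsDepTime is a valid HHMM value in the range 0..2359.
  def compute(flDate: LocalDate, crsDepTime: Int, crsElapsedMinutes: Double): ArrivalFeatures = {
    val departure = flDate.atTime(crsDepTime / 100, crsDepTime % 100)
    val arrival = departure.plusMinutes(math.round(crsElapsedMinutes))
    val daysSpan = ChronoUnit.DAYS.between(flDate, arrival.toLocalDate).toInt
    ArrivalFeatures(
      departureHourRounded = roundToHour(crsDepTime),
      arrivalHourRounded = roundToHour(arrival.getHour * 100 + arrival.getMinute),
      crossesMidnight = if (daysSpan > 0) 1 else 0,
      flightDaysSpan = daysSpan,
      arrivalDate = arrival.toLocalDate
    )
  }
}
```

For the first displayed row (`2012-01-30`, `1410`, `90.0`), `ArrivalFeatureSketch.compute(LocalDate.of(2012, 1, 30), 1410, 90.0)` gives 1400, 1600, 0, 0 and 2012-01-30, which agrees with the table above. Agreement on a few rows does not confirm that the project computes these features the same way.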