# Flight Preprocessing

In [1]:
%AddJar file:///home/jovyan/work/apps/Emiasd-Flight-Data-Analysis.jar

Starting download from file:///home/jovyan/work/apps/Emiasd-Flight-Data-Analysis.jar
Finished download of Emiasd-Flight-Data-Analysis.jar
Using cached version of Emiasd-Flight-Data-Analysis.jar


In [2]:
import org.apache.spark.sql.SparkSession
import com.flightdelay.config.{AppConfiguration, ConfigurationLoader}
import com.flightdelay.data.loaders.FlightDataLoader

// Environment configuration
val args: Array[String] = Array("juniper")

val spark = SparkSession.builder()
  .config(sc.getConf)
  .getOrCreate()

// Make the Spark session implicit
implicit val session: SparkSession = spark
implicit val configuration: AppConfiguration = ConfigurationLoader.loadConfiguration(args)

// Cell 4: Test
val flightData = FlightDataLoader.loadFromConfiguration(false)


[STEP 1][DataLoader] Flight Data Loading - Start

Loading from existing Parquet file:
  - Path: /home/jovyan/work/output/common/data/raw_flights.parquet
  - Loaded 486133 records from Parquet (optimized)

Schema:
root
 |-- FL_DATE: date (nullable = true)
 |-- OP_CARRIER_AIRLINE_ID: integer (nullable = true)
 |-- OP_CARRIER_FL_NUM: integer (nullable = true)
 |-- ORIGIN_AIRPORT_ID: integer (nullable = true)
 |-- DEST_AIRPORT_ID: integer (nullable = true)
 |-- CRS_DEP_TIME: integer (nullable = true)
 |-- ARR_DELAY_NEW: double (nullable = true)
 |-- CANCELLED: integer (nullable = true)
 |-- DIVERTED: integer (nullable = true)
 |-- CRS_ELAPSED_TIME: double (nullable = true)
 |-- WEATHER_DELAY: double (nullable = true)
 |-- NAS_DELAY: double (nullable = true)


Sample data (10 rows):
+----------+---------------------+-----------------+-----------------+---------------+------------+-------------+---------+--------+----------------+-------------+---------+
|   FL_DATE|OP_CARRIER_AIRLINE_ID|OP

args = Array(juniper)
spark = org.apache.spark.sql.SparkSession@55c111fd
session = org.apache.spark.sql.SparkSession@55c111fd
configuration = AppConfiguration(local,CommonConfig(42,DataConfig(/home/jovyan/work/data,FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Flights/201201.csv),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Weather/201201hourly.txt),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/wban_airport_timezone.csv)),OutputConfig(/home/jovyan/work/output,FileConfig(/home/jovyan/work/output/data),FileCo...


AppConfiguration(local,CommonConfig(42,DataConfig(/home/jovyan/work/data,FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Flights/201201.csv),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Weather/201201hourly.txt),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/wban_airport_timezone.csv)),OutputConfig(/home/jovyan/work/output,FileConfig(/home/jovyan/work/output/data),FileCo...
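The loader log above reports that the data came from an existing Parquet file rather than from the source CSV. A minimal sketch of that kind of cache-then-load pattern follows, assuming a local filesystem. `loadWithParquetCache` and its parameters are hypothetical names chosen for illustration; the project's `FlightDataLoader` may be implemented differently.

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper (not part of the project's API): read the Parquet copy
// if it already exists, otherwise parse the CSV once and cache it as Parquet.
// Assumption: both paths are on the local filesystem, so java.nio can check
// for existence; an HDFS or S3 path would need the Hadoop FileSystem API.
def loadWithParquetCache(
    spark: SparkSession,
    csvPath: String,
    parquetPath: String
): DataFrame = {
  if (Files.exists(Paths.get(parquetPath))) {
    spark.read.parquet(parquetPath)
  } else {
    val raw = spark.read
      .option("header", "true")      // first CSV line holds column names
      .option("inferSchema", "true") // let Spark guess column types
      .csv(csvPath)
    raw.write.mode("overwrite").parquet(parquetPath)
    // Re-read so later stages work from the columnar copy.
    spark.read.parquet(parquetPath)
  }
}

// Usage (paths are placeholders):
// val flights = loadWithParquetCache(spark, "<csv path>", "<parquet path>")
```

Reading from Parquet avoids re-parsing and re-inferring types from the CSV on every run, which is the usual reason for keeping such a cache.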

In [5]:
import com.flightdelay.data.preprocessing.flights.FlightPreprocessingPipeline

val flightProcessedData = FlightPreprocessingPipeline.execute()


[Preprocessing] Flight Data Preprocessing Pipeline - Start

Loading raw data from parquet:
  - Path: /home/jovyan/work/output/common/data/raw_flights.parquet
  - Loaded 486133 raw records

[STEP 2][DataCleaner] Flight Data Cleaning - Start

Original dataset: 486133 records

Phase 1: Basic Cleaning
  - Current count: 486133 records

Phase 2: Filter Invalid Flights
  - Filtering cancelled and diverted flights
  - Filtering invalid departure times
  - Filtering invalid airports
  - Current count: 486133 records

Phase 3: Data Type Conversion
Converting data types: NAS_DELAY, OP_CARRIER_AIRLINE_ID, OP_CARRIER_FL_NUM, WEATHER_DELAY, DEST_AIRPORT_ID, ORIGIN_AIRPORT_ID, CRS_ELAPSED_TIME, FL_DATE, CRS_DEP_TIME, ARR_DELAY_NEW
  - Filtering invalid flight dates
  - Current count: 486133 records

Phase 4: Outlier Filtering
  - Filtering delays > 600 minutes
  - Filtering flight times (10 min - 24 hours)
  - Current count: 486067 records

Phase 5: Final Validation
  - Validation passed:

flightProcessedData = [FL_DATE: date, OP_CARRIER_AIRLINE_ID: int ... 79 more fields]


[FL_DATE: date, OP_CARRIER_AIRLINE_ID: int ... 79 more fields]
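In the cleaning log above, Phase 2 leaves the count at 486133, while Phase 4 brings it down to 486067, a drop of 66 records. The filters named in those two phases could be expressed with the DataFrame API roughly as follows. This is a sketch under stated assumptions, not the project's `DataCleaner`: `dropInvalidAndOutliers` is a hypothetical helper, the thresholds are taken from the log messages, and rows with a null delay or elapsed time are dropped here, which may not match the real pipeline.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Reuses the notebook's session if one exists, otherwise starts a local one.
val sketchSpark = SparkSession.builder()
  .master("local[*]")
  .appName("cleaning-sketch")
  .getOrCreate()
import sketchSpark.implicits._

// Hypothetical helper mirroring the log messages of Phase 2 and Phase 4.
def dropInvalidAndOutliers(df: DataFrame, maxDelayMin: Double = 600.0): DataFrame =
  df
    // Phase 2: keep only flights that were neither cancelled nor diverted.
    .filter(col("CANCELLED") === 0 && col("DIVERTED") === 0)
    // Phase 4: drop delays above the threshold (null delays are dropped too).
    .filter(col("ARR_DELAY_NEW") <= maxDelayMin)
    // Phase 4: keep scheduled flight times from 10 minutes to 24 hours.
    .filter(col("CRS_ELAPSED_TIME").between(10.0, 24 * 60.0))

// Toy rows for illustration only; these are not values from the dataset.
val toyFlights = Seq(
  (0, 0, 15.0, 95.0),   // kept
  (0, 0, 720.0, 120.0), // delay above 600 minutes
  (0, 0, 5.0, 4.0),     // elapsed time below 10 minutes
  (1, 0, 0.0, 60.0)     // cancelled
).toDF("CANCELLED", "DIVERTED", "ARR_DELAY_NEW", "CRS_ELAPSED_TIME")

val cleanedToy = dropInvalidAndOutliers(toyFlights)
cleanedToy.show()
```

Chaining one `filter` per rule keeps each condition easy to log and count separately, which is how the pipeline output above reports its phases.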

In [6]:
flightProcessedData.printSchema

root
 |-- FL_DATE: date (nullable = true)
 |-- OP_CARRIER_AIRLINE_ID: integer (nullable = true)
 |-- OP_CARRIER_FL_NUM: integer (nullable = true)
 |-- ORIGIN_AIRPORT_ID: integer (nullable = true)
 |-- DEST_AIRPORT_ID: integer (nullable = true)
 |-- CRS_DEP_TIME: integer (nullable = true)
 |-- CRS_ELAPSED_TIME: double (nullable = true)
 |-- ORIGIN_WBAN: string (nullable = true)
 |-- ORIGIN_TIMEZONE: integer (nullable = true)
 |-- DEST_WBAN: string (nullable = true)
 |-- DEST_TIMEZONE: integer (nullable = true)
 |-- feature_departure_minute: integer (nullable = true)
 |-- feature_flight_day_of_week: integer (nullable = true)
 |-- feature_departure_hour_decimal: double (nullable = true)
 |-- feature_flight_timestamp: timestamp (nullable = true)
 |-- feature_flight_quarter_name: string (nullable = false)
 |-- feature_departure_hour: integer (nullable = true)
 |-- feature_minutes_since_midnight: double (nullable = true)
 |-- feature_departure_quarter_day: integer (nullable = false)
 |-- fea