# Flight Preprocessing - Data Generation

In [1]:
%AddJar file:///home/jovyan/work/apps/Emiasd-Flight-Data-Analysis.jar

Starting download from file:///home/jovyan/work/apps/Emiasd-Flight-Data-Analysis.jar
Finished download of Emiasd-Flight-Data-Analysis.jar
Using cached version of Emiasd-Flight-Data-Analysis.jar


In [2]:
import org.apache.spark.sql.SparkSession
import com.flightdelay.config.{AppConfiguration, ConfigurationLoader}
import com.flightdelay.data.loaders.FlightDataLoader

// Environment configuration
val args: Array[String] = Array("juniper")

val spark = SparkSession.builder()
  .config(sc.getConf)
  .getOrCreate()

// Make the Spark session implicit (explicit type annotation on the implicit val)
implicit val session: SparkSession = spark
implicit val configuration: AppConfiguration = ConfigurationLoader.loadConfiguration(args)

// Cell 4: Test
val flightData = FlightDataLoader.loadFromConfiguration(false)


[STEP 1][DataLoader] Flight Data Loading - Start

Loading from existing Parquet file:
  - Path: /home/jovyan/work/output/common/data/raw_flights.parquet
  - Loaded 486133 records from Parquet (optimized)

Schema:
root
 |-- FL_DATE: date (nullable = true)
 |-- OP_CARRIER_AIRLINE_ID: integer (nullable = true)
 |-- OP_CARRIER_FL_NUM: integer (nullable = true)
 |-- ORIGIN_AIRPORT_ID: integer (nullable = true)
 |-- DEST_AIRPORT_ID: integer (nullable = true)
 |-- CRS_DEP_TIME: integer (nullable = true)
 |-- ARR_DELAY_NEW: double (nullable = true)
 |-- CANCELLED: integer (nullable = true)
 |-- DIVERTED: integer (nullable = true)
 |-- CRS_ELAPSED_TIME: double (nullable = true)
 |-- WEATHER_DELAY: double (nullable = true)
 |-- NAS_DELAY: double (nullable = true)


Sample data (10 rows):
+----------+---------------------+-----------------+-----------------+---------------+------------+-------------+---------+--------+----------------+-------------+---------+
|   FL_DATE|OP_CARRIER_AIRLINE_ID|OP

args = Array(juniper)
spark = org.apache.spark.sql.SparkSession@779b9e9f
session = org.apache.spark.sql.SparkSession@779b9e9f
configuration = AppConfiguration(local,CommonConfig(42,DataConfig(/home/jovyan/work/data,FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Flights/201201.csv),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Weather/201201hourly.txt),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/wban_airport_timezone.csv)),OutputConfig(/home/jovyan/work/output,FileConfig(/home/jovyan/work/output/data),FileCo...



# Join between Flight and WBAN

In [4]:
import com.flightdelay.data.preprocessing.flights.FlightWBANEnricher

val withWBANData = FlightWBANEnricher.preprocess(flightData)


[Preprocessing] Flight WBAN Enrichment - Start

Loading WBAN-Airport-Timezone mapping:
  - Path: /home/jovyan/work/output/common/data/raw_wban_airport_timezone.parquet
  - Loaded 305 airport-WBAN mappings

Enrichment statistics:
  - Total flights: 461369
  - Flights with origin WBAN: 461369 (100%)
  - Flights with destination WBAN: 461369 (100%)

[Preprocessing] Flight WBAN Enrichment - End


withWBANData = [FL_DATE: date, OP_CARRIER_AIRLINE_ID: int ... 14 more fields]
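The loader reported 486133 records, while the enrichment statistics report 461369 flights, all of them with both an origin and a destination WBAN. That pattern is consistent with an inner join against the mapping, which silently discards flights whose airport has no WBAN entry. The sketch below illustrates that behaviour; it is not the actual `FlightWBANEnricher` implementation. The function name `enrichWithWBAN` and the output column names (`ORIGIN_WBAN`, `ORIGIN_TIMEZONE`, `DEST_WBAN`, `DEST_TIMEZONE`) are assumptions, and the mapping is assumed to have the columns `AirportID`, `WBAN` and `TimeZone`.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical sketch of a WBAN enrichment step (not the project's code).
// Assumes `mapping` has the columns AirportID, WBAN and TimeZone.
def enrichWithWBAN(flights: DataFrame, mapping: DataFrame): DataFrame = {
  // Rename the mapping columns once for the origin side...
  val origin = mapping.select(
    col("AirportID").as("ORIGIN_AIRPORT_ID"),
    col("WBAN").as("ORIGIN_WBAN"),
    col("TimeZone").as("ORIGIN_TIMEZONE"))
  // ...and once for the destination side.
  val dest = mapping.select(
    col("AirportID").as("DEST_AIRPORT_ID"),
    col("WBAN").as("DEST_WBAN"),
    col("TimeZone").as("DEST_TIMEZONE"))

  // Inner joins keep only flights whose two airports are both in the mapping.
  flights
    .join(origin, Seq("ORIGIN_AIRPORT_ID"), "inner")
    .join(dest, Seq("DEST_AIRPORT_ID"), "inner")
}
```

Switching both joins to `"left"` would keep every flight and leave the WBAN columns null for unmapped airports, which makes the missing mappings visible instead of shrinking the dataset.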



In [5]:
flightData.count()

486133

In [6]:
flightData.show(5)

+----------+---------------------+-----------------+-----------------+---------------+------------+-------------+---------+--------+----------------+-------------+---------+
|   FL_DATE|OP_CARRIER_AIRLINE_ID|OP_CARRIER_FL_NUM|ORIGIN_AIRPORT_ID|DEST_AIRPORT_ID|CRS_DEP_TIME|ARR_DELAY_NEW|CANCELLED|DIVERTED|CRS_ELAPSED_TIME|WEATHER_DELAY|NAS_DELAY|
+----------+---------------------+-----------------+-----------------+---------------+------------+-------------+---------+--------+----------------+-------------+---------+
|2012-01-01|                20366|             4426|            15370|          12266|         845|          0.0|     NULL|    NULL|            99.0|         NULL|     NULL|
|2012-01-01|                20366|             4427|            12266|          15370|         858|          0.0|     NULL|    NULL|            88.0|         NULL|     NULL|
|2012-01-01|                20366|             4427|            15370|          12266|        1051|          0.0|     NULL|    NUL

In [14]:
import org.apache.spark.sql.functions._

// Collect all distinct airport IDs (origin + destination).
// union matches columns by position, so the result keeps the name ORIGIN_AIRPORT_ID.
val allAirportIDs = flightData.select("ORIGIN_AIRPORT_ID")
  .union(flightData.select("DEST_AIRPORT_ID"))
  .distinct()

allAirportIDs.show()

// Count the total number of unique airports
val nbAirportsUniques = allAirportIDs.count()
println(s"Number of unique airports: $nbAirportsUniques")

+-----------------+
|ORIGIN_AIRPORT_ID|
+-----------------+
|            14570|
|            11146|
|            11630|
|            13795|
|            12264|
|            10257|
|            15070|
|            14771|
|            12436|
|            12523|
|            12007|
|            11057|
|            13830|
|            13377|
|            10994|
|            15096|
|            14814|
|            13873|
|            12191|
|            15024|
+-----------------+
only showing top 20 rows

Number of unique airports: 287


allAirportIDs = [ORIGIN_AIRPORT_ID: int]
nbAirportsUniques = 287



In [7]:
withWBANData.count()

461369
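Comparing this count with `flightData.count()` (486133) gives 486133 - 461369 = 24764 flights that do not survive the enrichment. A minimal sketch for computing that gap between any two pipeline stages follows; `rowsDropped` is a hypothetical helper, not part of the project.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: number of rows lost between two pipeline stages.
// Each call triggers two Spark jobs (one count per DataFrame).
def rowsDropped(before: DataFrame, after: DataFrame): Long =
  before.count() - after.count()

// In this notebook it would be called as:
//   rowsDropped(flightData, withWBANData)
```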

In [12]:
val wbanParquetPath = s"${configuration.common.output.basePath}/common/data/raw_wban_airport_timezone.parquet"
val wbanMappingDf = spark.read.parquet(wbanParquetPath)

wbanParquetPath = /home/jovyan/work/output/common/data/raw_wban_airport_timezone.parquet
wbanMappingDf = [AirportID: int, WBAN: string ... 1 more field]



In [None]:
wbanMappingDf.count()

In [15]:
wbanMappingDf.show(5)

+---------+-----+--------+
|AirportID| WBAN|TimeZone|
+---------+-----+--------+
|    10685|54831|      -6|
|    14871|24232|      -8|
|    10620|24033|      -7|
|    14747|24233|      -8|
|    11252|12834|      -5|
+---------+-----+--------+
only showing top 5 rows
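Since the enrichment joins on `AirportID`, it may be worth checking that the mapping holds at most one row per airport: a duplicated key would multiply the matching flights in the join. The sketch below is one way to list such keys; `duplicateKeys` is a hypothetical helper name.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: returns the values of `key` that appear more than once,
// together with their number of occurrences (column "count").
def duplicateKeys(df: DataFrame, key: String): DataFrame =
  df.groupBy(key)
    .count()
    .filter(col("count") > 1)

// In this notebook it would be called as:
//   duplicateKeys(wbanMappingDf, "AirportID").show()
```

An empty result means the key is unique and the join cannot inflate the number of flights.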



In [16]:
import org.apache.spark.sql.functions._

// Find the airports present in Flights but missing from the Airport (WBAN) mapping
val airportsInFlightsOnly = allAirportIDs
  .select(col("ORIGIN_AIRPORT_ID").as("AirportID"))
  .distinct()
  .join(
    wbanMappingDf.select("AirportID"),
    Seq("AirportID"),
    "left_anti"  // Keep only the left-side rows that have no match on the right
  )

airportsInFlightsOnly.show()
println(s"Number of airports in Flights but not in Airport: ${airportsInFlightsOnly.count()}")

+---------+
|AirportID|
+---------+
|    11146|
|    11630|
|    13795|
|    15070|
|    10994|
|    11076|
|    10135|
|    13290|
|    10165|
|    14254|
|    10627|
|    10643|
|    10434|
|    13256|
|    15323|
|    15919|
|    11995|
|    11617|
|    14794|
|    12758|
+---------+
only showing top 20 rows

Number of airports in Flights but not in Airport: 38


airportsInFlightsOnly = [AirportID: int]
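A natural follow-up is to count how many flights involve one of these unmapped airports, as origin or as destination, and compare that number with the gap between `flightData` and `withWBANData`. The sketch below uses a left semi join so that each flight is returned at most once even when both of its airports are unmapped. `flightsTouchingUnmapped` and the temporary column `UnmappedID` are hypothetical names, and whether the result accounts for the whole gap depends on how the enricher treats unmatched flights, which is not shown here.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: flights whose origin OR destination appears in `unmapped`.
// Assumes `unmapped` has a single AirportID column, as airportsInFlightsOnly does.
def flightsTouchingUnmapped(flights: DataFrame, unmapped: DataFrame): DataFrame = {
  // Rename the key so the join condition is unambiguous.
  val ids = unmapped.select(col("AirportID").as("UnmappedID"))
  flights.join(
    ids,
    col("ORIGIN_AIRPORT_ID") === col("UnmappedID") ||
      col("DEST_AIRPORT_ID") === col("UnmappedID"),
    "left_semi" // keep matching flights once, with only the flight columns
  )
}

// In this notebook it would be called as:
//   flightsTouchingUnmapped(flightData, airportsInFlightsOnly).count()
```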

