# Join NaN Values

In [1]:
%AddJar file:///home/jovyan/work/apps/Emiasd-Flight-Data-Analysis.jar

Starting download from file:///home/jovyan/work/apps/Emiasd-Flight-Data-Analysis.jar
Finished download of Emiasd-Flight-Data-Analysis.jar
Using cached version of Emiasd-Flight-Data-Analysis.jar


In [2]:
import org.apache.spark.sql.SparkSession
import com.flightdelay.config.{AppConfiguration, ConfigurationLoader, ExperimentConfig}
import com.flightdelay.data.loaders.FlightDataLoader

// Environment configuration
val args: Array[String] = Array("jupyter")

val spark = SparkSession.builder()
  .config(sc.getConf)
  .getOrCreate()

// Make the Spark session implicit (explicit type annotation avoids implicit-resolution surprises)
implicit val session: SparkSession = spark
implicit val configuration: AppConfiguration = ConfigurationLoader.loadConfiguration(args)
implicit val experiment: ExperimentConfig = configuration.experiments(1)

// Set the checkpoint directory
spark.sparkContext.setCheckpointDir(s"${configuration.common.output.basePath}/spark-checkpoints")
// Reduce log verbosity for clearer output
spark.sparkContext.setLogLevel("WARN")

args = Array(jupyter)
spark = org.apache.spark.sql.SparkSession@6a356ab0
session = org.apache.spark.sql.SparkSession@6a356ab0
configuration = AppConfiguration(local,CommonConfig(42,DataConfig(/home/jovyan/work/data,FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Flights/*.csv),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Weather/*.txt),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/wban_airport_timezone.csv)),OutputConfig(/home/jovyan/work/output,FileConfig(/home/jovyan/work/output/data),File...


AppConfiguration(local,CommonConfig(42,DataConfig(/home/jovyan/work/data,FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Flights/*.csv),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/Weather/*.txt),FileConfig(/home/jovyan/work/data/FLIGHT-3Y/wban_airport_timezone.csv)),OutputConfig(/home/jovyan/work/output,FileConfig(/home/jovyan/work/output/data),File...

## Flights Raw - Process Job

In [3]:
val rawFlightsDFPath = s"${configuration.common.output.basePath}/common/data/raw_flights.parquet"
val rawFlightsDF = spark.read.parquet(rawFlightsDFPath)


rawFlightsDFPath = /home/jovyan/work/output/common/data/raw_flights.parquet
rawFlightsDF = [FL_DATE: date, OP_CARRIER_AIRLINE_ID: int ... 10 more fields]


[FL_DATE: date, OP_CARRIER_AIRLINE_ID: int ... 10 more fields]

In [4]:
rawFlightsDF.printSchema

root
 |-- FL_DATE: date (nullable = true)
 |-- OP_CARRIER_AIRLINE_ID: integer (nullable = true)
 |-- OP_CARRIER_FL_NUM: integer (nullable = true)
 |-- ORIGIN_AIRPORT_ID: integer (nullable = true)
 |-- DEST_AIRPORT_ID: integer (nullable = true)
 |-- CRS_DEP_TIME: integer (nullable = true)
 |-- ARR_DELAY_NEW: double (nullable = true)
 |-- CANCELLED: integer (nullable = true)
 |-- DIVERTED: integer (nullable = true)
 |-- CRS_ELAPSED_TIME: double (nullable = true)
 |-- WEATHER_DELAY: double (nullable = true)
 |-- NAS_DELAY: double (nullable = true)



In [5]:
import com.flightdelay.data.preprocessing.flights.FlightPreprocessingPipeline
val processedFlightData = FlightPreprocessingPipeline.execute()


[Preprocessing] Flight Data Preprocessing Pipeline - Start

Loading raw data from parquet:
  - Path: /home/jovyan/work/output/common/data/raw_flights.parquet
  - Loaded 486133 raw records

[Pipeline Step 1/9] Cleaning flight data...

[STEP 2][DataCleaner] Flight Data Cleaning - Start

Original dataset: 486133 records

Phase 1: Basic Cleaning
  - Current count: 486133 records

Phase 2: Filter Invalid Flights
  - Filtering cancelled and diverted flights
  - Filtering invalid departure times
  - Filtering invalid airports
  - Current count: 486133 records

Phase 3: Data Type Conversion
Conversion des types de données: NAS_DELAY, OP_CARRIER_AIRLINE_ID, OP_CARRIER_FL_NUM, WEATHER_DELAY, DEST_AIRPORT_ID, ORIGIN_AIRPORT_ID, CRS_ELAPSED_TIME, FL_DATE, CRS_DEP_TIME, ARR_DELAY_NEW
  - Filtering invalid flight dates
  - Current count: 486133 records

Phase 5: Final Validation
  - Validation passed: 486133 records

Cleaning Summary
Original records:       486,133
Final records:          486,133
R

processedFlightData = [FL_DATE: date, OP_CARRIER_AIRLINE_ID: int ... 135 more fields]


[FL_DATE: date, OP_CARRIER_AIRLINE_ID: int ... 135 more fields]

## Check Flight Numbers after Join on WBAN (OK, Validated)

In [9]:
val rawWBANAirPortDFPath = s"${configuration.common.output.basePath}/common/data/raw_wban_airport_timezone.parquet"
val rawWBANAirPortDF = spark.read.parquet(rawWBANAirPortDFPath)

rawWBANAirPortDFPath = /home/jovyan/work/output/common/data/raw_wban_airport_timezone.parquet
rawWBANAirPortDF = [AirportID: int, WBAN: string ... 1 more field]


[AirportID: int, WBAN: string ... 1 more field]

In [10]:
rawWBANAirPortDF.printSchema

root
 |-- AirportID: integer (nullable = true)
 |-- WBAN: string (nullable = true)
 |-- TimeZone: integer (nullable = true)



In [13]:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: rawFlightsDF and rawWBANAirPortDF are already defined

def reportMissingAirports(
  flights: DataFrame,
  wbanAirports: DataFrame
)(implicit spark: SparkSession): Unit = {

  // Distinct reference set of airport IDs
  val ref = wbanAirports.select(col("AirportID")).distinct().cache()
  val totalFlights = flights.count()

  // Flights whose ORIGIN is absent from the reference set
  val missingOriginFlights = flights
    .filter(col("ORIGIN_AIRPORT_ID").isNotNull)
    .join(broadcast(ref), flights("ORIGIN_AIRPORT_ID") === ref("AirportID"), "left_anti")
    .cache()

  // Flights whose DEST is absent from the reference set
  val missingDestFlights = flights
    .filter(col("DEST_AIRPORT_ID").isNotNull)
    .join(broadcast(ref), flights("DEST_AIRPORT_ID") === ref("AirportID"), "left_anti")
    .cache()

  val cntMissingOrigin = missingOriginFlights.count()
  val cntMissingDest   = missingDestFlights.count()

  // Flights with at least one of the two airports missing
  val wOriginFlag = flights
    .join(broadcast(ref).withColumn("ok_origin", lit(true)),
          flights("ORIGIN_AIRPORT_ID") === ref("AirportID"), "left")
    .drop(ref("AirportID"))

  val wBothFlags = wOriginFlag
    .join(broadcast(ref).withColumn("ok_dest", lit(true)),
          wOriginFlag("DEST_AIRPORT_ID") === ref("AirportID"), "left")
    .drop(ref("AirportID"))

  val flagged = wBothFlags
    .withColumn("missing_origin", !coalesce(col("ok_origin"), lit(false)))
    .withColumn("missing_dest",   !coalesce(col("ok_dest"),   lit(false)))

  val cntEitherMissing = flagged.filter(col("missing_origin") || col("missing_dest")).count()

  val cntOriginNull = flights.filter(col("ORIGIN_AIRPORT_ID").isNull).count()
  val cntDestNull   = flights.filter(col("DEST_AIRPORT_ID").isNull).count()

  val missingOriginAirportIDs = missingOriginFlights
    .select("ORIGIN_AIRPORT_ID").distinct().orderBy("ORIGIN_AIRPORT_ID")

  val missingDestAirportIDs = missingDestFlights
    .select("DEST_AIRPORT_ID").distinct().orderBy("DEST_AIRPORT_ID")

  // === Corrected output ===
  println("\n===== WBAN Coverage Report =====")
  println(f"Total flights:                 $totalFlights%,d")
  println(f"Missing ORIGIN flights:        $cntMissingOrigin%,d  (${cntMissingOrigin * 100.0 / totalFlights}%.3f%%)")
  println(f"Missing DEST flights:          $cntMissingDest%,d    (${cntMissingDest * 100.0 / totalFlights}%.3f%%)")
  println(f"Either origin/dest missing:    $cntEitherMissing%,d  (${cntEitherMissing * 100.0 / totalFlights}%.3f%%)")
  println(f"NULL ORIGIN_AIRPORT_ID:        $cntOriginNull%,d")
  println(f"NULL DEST_AIRPORT_ID:          $cntDestNull%,d")

  println("\nTop 20 ORIGIN_AIRPORT_ID not in WBAN:")
  missingOriginAirportIDs.show(20, truncate = false)

  println("Top 20 DEST_AIRPORT_ID not in WBAN:")
  missingDestAirportIDs.show(20, truncate = false)
}

// --- Call ---
reportMissingAirports(rawFlightsDF, rawWBANAirPortDF)


===== WBAN Coverage Report =====
Total flights:                 486,133
Missing ORIGIN flights:        12,377  (2.546%)
Missing DEST flights:          12,387    (2.548%)
Either origin/dest missing:    24,764  (5.094%)
NULL ORIGIN_AIRPORT_ID:        0
NULL DEST_AIRPORT_ID:          0

Top 20 ORIGIN_AIRPORT_ID not in WBAN:
+-----------------+
|ORIGIN_AIRPORT_ID|
+-----------------+
|10135            |
|10158            |
|10165            |
|10434            |
|10599            |
|10627            |
|10643            |
|10994            |
|11003            |
|11049            |
|11076            |
|11146            |
|11617            |
|11630            |
|11973            |
|11995            |
|12016            |
|12206            |
|12448            |
|12758            |
+-----------------+
only showing top 20 rows

Top 20 DEST_AIRPORT_ID not in WBAN:
+---------------+
|DEST_AIRPORT_ID|
+---------------+
|10135          |
|10158          |
|10165          |
|10434          |
|10599          |
|10627          |
|10643          |
|10994          |
|11003          |
|11049          |
|11076          |
|11146          |
|11617          |
|11630          |
|11973          |
|11995          |
|12016          |
|12206          |
|12448          |
|12758          |
+---------------+
only showing top 20 rows


reportMissingAirports: (flights: org.apache.spark.sql.DataFrame, wbanAirports: org.apache.spark.sql.DataFrame)(implicit spark: org.apache.spark.sql.SparkSession)Unit
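
The three counts in the WBAN Coverage Report can be cross-checked with inclusion-exclusion: the number of flights missing either airport equals the origin count plus the destination count minus the number missing both. The sketch below hard-codes the figures printed by the report (the variable names are illustrative, not part of the project) and derives the overlap. With these figures the overlap comes out to zero, which suggests that no flight in this dataset has both its origin and its destination outside the WBAN table.

```scala
// Counts copied from the WBAN Coverage Report printed by reportMissingAirports.
val totalFlights  = 486133L
val missingOrigin = 12377L
val missingDest   = 12387L
val eitherMissing = 24764L

// Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|,
// so the overlap is |A ∩ B| = |A| + |B| - |A ∪ B|.
val bothMissing = missingOrigin + missingDest - eitherMissing

// Flights whose origin and destination are both present in the WBAN table.
val fullyCovered = totalFlights - eitherMissing

println(s"Flights missing both airports: $bothMissing")
println(s"Flights fully covered:         $fullyCovered")
```

A join that requires a WBAN match on both airports would therefore be expected to keep `fullyCovered` flights, assuming the reference table has one row per `AirportID`.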



In [14]:
// Row-count difference between the preprocessed and the raw flight data
print(processedFlightData.count() - rawFlightsDF.count())

-347627
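
To put this difference in perspective, the following sketch converts it into a retained-row count and percentage. It only reuses two numbers shown in this notebook (the raw record count from the pipeline log and the difference printed above); the variable names are illustrative. Since the cleaning step reports 486,133 final records, the reduction presumably happens in a later pipeline step that the truncated log does not show.

```scala
// Raw record count, from "Loaded 486133 raw records" in the pipeline log.
val rawCount = 486133L
// Difference printed by the previous cell (processed minus raw).
val delta = -347627L

// Rows remaining after preprocessing, and their share of the raw data.
val processedCount = rawCount + delta
val retainedPct    = processedCount * 100.0 / rawCount

println(f"Processed rows: $processedCount%d ($retainedPct%.1f%% of raw)")
```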