# Flight Data Exploratory Analysis – King Khalid International Airport (RUH)

This notebook presents an exploratory data analysis (EDA) of flight operations at King Khalid International Airport (RUH).
The dataset describes each flight's airline, aircraft, status, terminal, schedule, and destination.
The goal is to understand how flights are distributed across airlines, terminals, destinations, and dates.

These findings are groundwork for later stages such as flight-delay prediction, demand forecasting, or terminal resource optimization with machine learning or simulation models.

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySpark Data Analysis") \
    .getOrCreate()

In [4]:
df = spark.read.parquet("/content/flights_RUH.parquet")
df.show(5)

+-------------+---------------+------------+--------------+-----------------+------------+------------+-------+-----------+---------------+-------+--------+-------------------+-------------------+-------------------+-----------------+----------------+------------------------+------------------------+------------------------+-------------------------+--------------------------+----------------------------+
|flight_number| aircraft.model|aircraft.reg|aircraft.modeS|     airline.name|airline.iata|airline.icao| status|flight_type|codeshareStatus|isCargo|callSign|origin_airport_name|origin_airport_icao|origin_airport_iata|movement.terminal|movement.quality|destination_airport_icao|destination_airport_iata|destination_airport_name|movement.airport.timeZone|movement.scheduledTime.utc|movement.scheduledTime.local|
+-------------+---------------+------------+--------------+-----------------+------------+------------+-------+-----------+---------------+-------+--------+-------------------+------

In [9]:
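from pyspark.sql import functions as F

def q(name: str):
    # Backtick-quote the name so that dots (e.g. "aircraft.model") are read as
    # part of the column name rather than as struct field access.
    return F.col(f"`{name}`")

# Count nulls per column: emit 1 for each null value, 0 otherwise, then sum.
null_counts = df.select([
    F.sum(F.when(q(c).isNull(), 1).otherwise(0)).alias(c)
    for c in df.columns
])
null_counts.show(truncate=False)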

+-------------+--------------+------------+--------------+------------+------------+------------+------+-----------+---------------+-------+--------+-------------------+-------------------+-------------------+-----------------+----------------+------------------------+------------------------+------------------------+-------------------------+--------------------------+----------------------------+
|flight_number|aircraft.model|aircraft.reg|aircraft.modeS|airline.name|airline.iata|airline.icao|status|flight_type|codeshareStatus|isCargo|callSign|origin_airport_name|origin_airport_icao|origin_airport_iata|movement.terminal|movement.quality|destination_airport_icao|destination_airport_iata|destination_airport_name|movement.airport.timeZone|movement.scheduledTime.utc|movement.scheduledTime.local|
+-------------+--------------+------------+--------------+------------+------------+------------+------+-----------+---------------+-------+--------+-------------------+-------------------+-------

In [11]:
cols_to_drop = ['aircraft.reg', 'callSign', 'aircraft.modeS']
df = df.drop(*cols_to_drop)

In [13]:
for c in df.columns:
    df = df.withColumnRenamed(c, c.replace('.', '_'))

df.printSchema()

root
 |-- flight_number: string (nullable = true)
 |-- aircraft_model: string (nullable = true)
 |-- airline_name: string (nullable = true)
 |-- airline_iata: string (nullable = true)
 |-- airline_icao: string (nullable = true)
 |-- status: string (nullable = true)
 |-- flight_type: string (nullable = true)
 |-- codeshareStatus: string (nullable = true)
 |-- isCargo: boolean (nullable = true)
 |-- origin_airport_name: string (nullable = true)
 |-- origin_airport_icao: string (nullable = true)
 |-- origin_airport_iata: string (nullable = true)
 |-- movement_terminal: string (nullable = true)
 |-- movement_quality: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- destination_airport_icao: string (nullable = true)
 |-- destination_airport_iata: string (nullable = true)
 |-- destination_airport_name: string (nullable = true)
 |-- movement_airport_timeZone: string (nullable = true)
 |-- movement_scheduledTime_utc: string (nullable = true)
 |-- movement_scheduledT

In [14]:
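# These columns were already dropped in In [11], before the rename, so this
# cell is a no-op: drop() ignores column names that are not present.
cols_to_drop = ['aircraft_reg', 'callSign', 'aircraft_modeS']
df = df.drop(*cols_to_drop)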

In [15]:
fill_values = {
    'airline_icao': 'Unknown',
    'aircraft_model': 'Unknown',
    'airline_iata': 'Unknown',
    'movement_terminal': 'Unknown',
    'destination_airport_icao': 'Unknown',
    'destination_airport_iata': 'Unknown',
    'movement_airport_timeZone': 'Unknown',
}
df = df.fillna(fill_values)

# Total number of flights and distinct airlines

In [28]:
df.selectExpr("count(*) as total_flights",
              "count(distinct airline_name) as unique_airlines").show()

+-------------+---------------+
|total_flights|unique_airlines|
+-------------+---------------+
|       153308|             68|
+-------------+---------------+



# Number of airlines serving each destination

In [41]:

df.groupBy("destination_airport_name").agg(F.countDistinct("airline_name").alias("num_airlines")).orderBy(F.col("num_airlines").desc()).show(10, truncate=False)

+------------------------+------------+
|destination_airport_name|num_airlines|
+------------------------+------------+
|Cairo                   |8           |
|Islamabad               |7           |
|Istanbul                |6           |
|Mumbai                  |5           |
|Lahore                  |5           |
|Dubai                   |5           |
|Ad Dammam               |5           |
|Jeddah                  |4           |
|Amman                   |4           |
|Kuwait City             |4           |
+------------------------+------------+
only showing top 10 rows



# Average daily flights per airline

In [39]:
# Relies on the departure_date column created in cell In [22].
(df.groupBy("airline_name", "departure_date")
   .count()
   .groupBy("airline_name")
   .agg(F.avg("count").alias("avg_daily_flights"))
   .orderBy(F.col("avg_daily_flights").desc())
).show(10, truncate=False)

+-------------+------------------+
|airline_name |avg_daily_flights |
+-------------+------------------+
|Saudi Arabian|288.5592417061611 |
|flynas       |160.82938388625593|
|flyadeal     |117.70142180094787|
|Gulf Air     |10.466666666666667|
|flydubai     |9.75829383886256  |
|EgyptAir     |8.942857142857143 |
|Qatar        |8.585714285714285 |
|Etihad       |7.328571428571428 |
|Emirates     |6.469194312796208 |
|Turkish      |5.971563981042654 |
+-------------+------------------+
only showing top 10 rows
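
Note that this average is taken over the dates on which an airline has at least one flight in the data, not over every calendar day, because `groupBy("airline_name", "departure_date")` only yields rows for dates that occur. As a rough check, dividing an airline's total flight count (from the top-airlines table in this notebook) by its average recovers the number of active dates. The sketch below is plain Python using only figures printed in this notebook; the variable names are illustrative.

```python
# Figures copied from the notebook outputs: (total flights, avg_daily_flights).
airline_stats = {
    "Saudi Arabian": (60886, 288.5592417061611),
    "Gulf Air": (2198, 10.466666666666667),
    "EgyptAir": (1878, 8.942857142857143),
}

# total / average = number of distinct dates on which the airline appears.
active_days = {
    name: round(total / avg) for name, (total, avg) in airline_stats.items()
}

for name, days in active_days.items():
    print(f"{name}: {days} active days")
```

The three carriers do not share the same denominator (211 dates for Saudi Arabian, 210 for the other two), so averages for airlines that fly only occasionally would be inflated relative to a per-calendar-day figure.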



# Busiest day overall

In [34]:
# Relies on the departure_date column created in cell In [22].
(df.groupBy("departure_date")
   .agg(F.count("*").alias("num_flights"))
   .orderBy(F.col("num_flights").desc())
   .limit(1)).show()

+--------------+-----------+
|departure_date|num_flights|
+--------------+-----------+
|    2025-08-11|        814|
+--------------+-----------+



# Busiest destination-airline combinations


In [30]:
df.groupBy("airline_name","destination_airport_name").count().orderBy(F.col("count").desc()).show(15, truncate=False)

+-------------+------------------------+-----+
|airline_name |destination_airport_name|count|
+-------------+------------------------+-----+
|Saudi Arabian|Jeddah                  |10915|
|flyadeal     |Jeddah                  |9184 |
|flynas       |Jeddah                  |7833 |
|Saudi Arabian|Ad Dammam               |4556 |
|flynas       |Dubai                   |3385 |
|Saudi Arabian|Abha                    |3365 |
|Saudi Arabian|Dubai                   |3359 |
|flyadeal     |Abha                    |3281 |
|Saudi Arabian|Medina                  |3184 |
|Saudi Arabian|Jazan                   |3056 |
|flynas       |Cairo                   |2882 |
|flynas       |Abha                    |2397 |
|Saudi Arabian|Tabuk                   |2395 |
|Gulf Air     |Manama                  |2198 |
|flyadeal     |Dubai                   |2165 |
+-------------+------------------------+-----+
only showing top 15 rows



# Top 10 most frequent routes (origin → destination)

In [40]:
df.groupBy("origin_airport_name","destination_airport_name").count().orderBy(F.col("count").desc()).show(10, truncate=False)

+-------------------+------------------------+-----+
|origin_airport_name|destination_airport_name|count|
+-------------------+------------------------+-----+
|Riyadh             |Jeddah                  |27938|
|Riyadh             |Dubai                   |12333|
|Riyadh             |Cairo                   |10003|
|Riyadh             |Abha                    |9043 |
|Riyadh             |Ad Dammam               |7393 |
|Riyadh             |Medina                  |6563 |
|Riyadh             |Jazan                   |5196 |
|Riyadh             |Tabuk                   |4295 |
|Riyadh             |Istanbul                |4083 |
|Riyadh             |Amman                   |3325 |
+-------------------+------------------------+-----+
only showing top 10 rows



# What are the top 10 airlines by number of flights?

In [17]:
from pyspark.sql import functions as F
top_airlines = (df.groupBy("airline_name")
                  .count()
                  .orderBy(F.col("count").desc()))
top_airlines.show(10, truncate=False)

+-------------+-----+
|airline_name |count|
+-------------+-----+
|Saudi Arabian|60886|
|flynas       |33935|
|flyadeal     |24835|
|Gulf Air     |2198 |
|flydubai     |2059 |
|EgyptAir     |1878 |
|Qatar        |1803 |
|Etihad       |1539 |
|Emirates     |1365 |
|Turkish      |1260 |
+-------------+-----+
only showing top 10 rows
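
The gap between the third and fourth rows suggests that traffic is concentrated in three carriers. A minimal sketch of that concentration, using only counts printed in this notebook (the variable names are illustrative):

```python
# Counts copied from the outputs in this notebook.
total_flights = 153308
top_three = {"Saudi Arabian": 60886, "flynas": 33935, "flyadeal": 24835}

top_three_total = sum(top_three.values())
top_three_share = top_three_total / total_flights

print(f"Top three carriers: {top_three_total} flights ({top_three_share:.1%})")
```

By this count the three Saudi carriers account for roughly 78% of all flights in the dataset, so per-airline charts will be dominated by them unless the axis is log-scaled or they are plotted separately.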



# What are the most common flight statuses?


In [18]:
df.groupBy("status").count().orderBy(F.col("count").desc()).show()

+-----------------+------+
|           status| count|
+-----------------+------+
|          Unknown|128618|
|         Expected| 13059|
|         Departed| 11520|
|         Canceled|    78|
|CanceledUncertain|    33|
+-----------------+------+
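
Most rows carry the status `Unknown`, which limits how far status-based conclusions can be pushed. The sketch below turns the counts above into shares; it is plain Python on the printed figures, not a query against the DataFrame.

```python
# Counts copied from the status table above.
status_counts = {
    "Unknown": 128618,
    "Expected": 13059,
    "Departed": 11520,
    "Canceled": 78,
    "CanceledUncertain": 33,
}

total = sum(status_counts.values())
shares = {status: count / total for status, count in status_counts.items()}

for status, share in shares.items():
    print(f"{status:<18} {share:7.2%}")
```

The counts sum to the 153308 flights reported earlier, and about 84% of them have no usable status, so any cancellation rate derived from this column describes only the remaining records.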



# Which are the top 10 destination airports?

In [20]:
df.groupBy("destination_airport_name").count().orderBy(F.col("count").desc()).show(10, truncate=False)

+------------------------+-----+
|destination_airport_name|count|
+------------------------+-----+
|Jeddah                  |27938|
|Dubai                   |12333|
|Cairo                   |10003|
|Abha                    |9043 |
|Ad Dammam               |7393 |
|Medina                  |6563 |
|Jazan                   |5196 |
|Tabuk                   |4295 |
|Istanbul                |4083 |
|Amman                   |3325 |
+------------------------+-----+
only showing top 10 rows



# What’s the daily flight volume trend?

In [22]:
from pyspark.sql.functions import to_date

# Use F.col here: a bare `col` is never imported in this notebook.
df = df.withColumn("departure_date", to_date(F.col("movement_scheduledTime_utc")))
daily_flights = (df.groupBy("departure_date")
                   .agg(F.count("*").alias("num_flights"))
                   .orderBy("departure_date"))
daily_flights.show(10, truncate=False)

+--------------+-----------+
|departure_date|num_flights|
+--------------+-----------+
|2025-03-14    |69         |
|2025-03-15    |600        |
|2025-03-16    |566        |
|2025-03-17    |597        |
|2025-03-18    |430        |
|2025-03-19    |599        |
|2025-03-20    |644        |
|2025-03-21    |631        |
|2025-03-22    |664        |
|2025-03-23    |609        |
+--------------+-----------+
only showing top 10 rows
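
Because `departure_date` is derived from `movement_scheduledTime_utc`, it is presumably a UTC calendar date rather than a Riyadh local date. Flights scheduled late in the UTC day then land on the previous date compared with local time, which may partly explain a short first day such as 2025-03-14. The sketch below illustrates the shift with a hypothetical timestamp (not taken from the dataset) and assumes Riyadh is at a fixed UTC+3 offset.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Riyadh local time is a fixed UTC+3 offset (no daylight saving).
RIYADH = timezone(timedelta(hours=3))

# Hypothetical scheduled time, chosen for illustration only.
scheduled_utc = datetime(2025, 3, 14, 22, 30, tzinfo=timezone.utc)
scheduled_local = scheduled_utc.astimezone(RIYADH)

utc_date = scheduled_utc.date()
local_date = scheduled_local.date()

print("UTC date:  ", utc_date)
print("Local date:", local_date)
```

If local-day totals are what matters operationally, grouping on the dataset's local scheduled-time column instead of the UTC one would be the natural alternative.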



# Which airlines fly to the most destinations?

In [23]:
unique_dest = (df.groupBy("airline_name")
                 .agg(F.countDistinct("destination_airport_name").alias("unique_destinations"))
                 .orderBy(F.col("unique_destinations").desc()))
unique_dest.show(10, truncate=False)

+----------------------+-------------------+
|airline_name          |unique_destinations|
+----------------------+-------------------+
|Saudi Arabian         |77                 |
|flynas                |60                 |
|flyadeal              |27                 |
|Air India Express     |7                  |
|Unknown/Private owner |5                  |
|Pakistan International|5                  |
|EgyptAir              |4                  |
|IndiGo                |4                  |
|Air India             |3                  |
|China Southern        |3                  |
+----------------------+-------------------+
only showing top 10 rows



# How many terminals are used at RUH?

In [24]:
df.groupBy("movement_terminal").count().orderBy(F.col("count").desc()).show()

+-----------------+-----+
|movement_terminal|count|
+-----------------+-----+
|                5|80729|
|                1|23420|
|                3|22415|
|                4|21892|
|                2| 4002|
|          Unknown|  850|
+-----------------+-----+
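
To answer the heading's question directly, the `Unknown` row should be set aside, since it is the placeholder written by the earlier `fillna` step rather than a real terminal. A small sketch on the printed counts (variable names are illustrative):

```python
# Counts copied from the terminal table above.
terminal_counts = {
    "5": 80729,
    "1": 23420,
    "3": 22415,
    "4": 21892,
    "2": 4002,
    "Unknown": 850,
}

# Drop the fillna placeholder before counting real terminals.
known = {t: c for t, c in terminal_counts.items() if t != "Unknown"}
num_terminals = len(known)

# Busiest terminal and its share of all flights (including Unknown rows).
busiest = max(known, key=known.get)
busiest_share = known[busiest] / sum(terminal_counts.values())

print(f"{num_terminals} terminals; Terminal {busiest} handles {busiest_share:.1%}")
```

On these figures five terminals appear in the data, and Terminal 5 alone handles a little over half of all flights.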



# Number of Airlines Serving Each Destination

In [42]:
import plotly.express as px
from pyspark.sql import functions as F

dest_airlines = (
    df.groupBy("destination_airport_name")
      .agg(F.countDistinct("airline_name").alias("num_airlines"))
      .orderBy(F.col("num_airlines").desc())
)
dest_airlines_pd = dest_airlines.limit(20).toPandas()
fig = px.bar(
    dest_airlines_pd,
    x="destination_airport_name",
    y="num_airlines",
    title="Number of Airlines Serving Each Destination",
    color="num_airlines",
    color_continuous_scale="Blues",
    text="num_airlines"
)
fig.update_traces(textposition="outside")
fig.update_layout(
    xaxis_title="Destination Airport",
    yaxis_title="Number of Airlines",
    xaxis_tickangle=-45,
    template="plotly_white",
    title_font=dict(size=18, family="Arial", color="black"),
    margin=dict(l=40, r=40, t=60, b=120)
)

fig.show()

# Top 15 Airlines by Number of Flights

In [45]:
airline_counts = (
    df.groupBy("airline_name")
      .count()
      .orderBy(F.col("count").desc())
      .limit(15)  # match the "Top 15" heading and chart title
      .toPandas()
)
fig = px.bar(
    airline_counts,
    x="airline_name",
    y="count",
    title="Top 15 Airlines by Number of Flights",
    text="count",
    color_discrete_sequence=["#4B9CD3"]
)
fig.update_traces(textposition="outside")
fig.update_layout(
    xaxis_title="Airline",
    yaxis_title="Flight Count",
    xaxis_tickangle=-45,
    template="plotly_white"
)
fig.show()

# Daily Flight Volume Over Time

In [47]:
daily_trend = (
    df.groupBy("departure_date")
      .agg(F.count("*").alias("num_flights"))
      .orderBy("departure_date")
      .toPandas()
)

fig = px.line(
    daily_trend,
    x="departure_date",
    y="num_flights",
    title="Daily Flight Volume Over Time",
    markers=True
)
fig.update_layout(template="plotly_white", xaxis_title="Date", yaxis_title="Number of Flights")
fig.show()

# Flight Status Distribution by Airline

In [50]:
top10 = (df.groupBy("airline_name").count()
           .orderBy(F.col("count").desc()).limit(10)
           .select("airline_name"))

data = (df.join(top10, "airline_name")
          .groupBy("airline_name","status").count()
          .orderBy("airline_name","status")
          .toPandas())

fig = px.bar(data, x="airline_name", y="count", color="status",
             title="Flight Status Distribution by Airline (Top 10)")
fig.update_layout(xaxis_tickangle=-45, template="plotly_white",
                  yaxis_title="Flights", xaxis_title="Airline")
fig.show()

# Airline Coverage by Destination

In [51]:
cover = (df.groupBy("airline_name","destination_airport_name")
           .agg(F.count("*").alias("flights"))
           .toPandas())

fig = px.treemap(cover, path=["airline_name","destination_airport_name"], values="flights",
                 title="Airline Coverage by Destination")
fig.update_layout(template="plotly_white")
fig.show()

# Top 20 Destinations by Flight Count

In [52]:
from pyspark.sql import functions as F
import plotly.express as px

dest_counts = (df.groupBy("destination_airport_name")
                 .agg(F.count("*").alias("num_flights"))
                 .orderBy(F.col("num_flights").desc())
                 .limit(20)
                 .toPandas())

fig = px.bar(dest_counts,
             x="num_flights", y="destination_airport_name",
             orientation="h",
             color="num_flights",
             color_continuous_scale="Blues",
             title="Top 20 Destinations by Flight Count",
             text="num_flights")

fig.update_traces(textposition="outside")
fig.update_layout(template="plotly_white",
                  xaxis_title="Flights",
                  yaxis_title="Destination",
                  yaxis=dict(autorange="reversed"))
fig.show()

# Flights by Terminal

In [54]:
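# term_counts is not defined in any cell shown above; rebuild it here from the
# same aggregation used in the terminal-count cell (In [24]).
term_counts = (df.groupBy("movement_terminal")
                 .count()
                 .orderBy(F.col("count").desc())
                 .toPandas())

fig = px.pie(
    term_counts,
    names="movement_terminal",
    values="count",
    hole=0.4,
    color_discrete_sequence=px.colors.sequential.Blues[::-1],  # reverse colors
    title="Flights by Terminal"
)
fig.update_traces(textinfo="percent+label")
fig.update_layout(template="plotly_white")
fig.show()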