# Group Project Milestone 2: Data Exploration & Initial PreProcessing

In this assignment you will need to:

1. Create a GitHub ID
2. Create a GitHub repository (public or private, it is up to you; in the end it will have to be public) and add your group members as collaborators
3. Perform the data exploration step (i.e. evaluate your data, # of observations, details about your data distributions, scales, missing data, column descriptions) Note: For image data you can still describe your data by the number of classes, # of images, plot example classes of the image, size of images, are sizes uniform? Do they need to be cropped? normalized? etc.
4. Plot your data. For tabular data, you will need to run scatters, for image data, you will need to plot your example classes.
5. How will you preprocess your data? You should explain this in your README.md file and link your Jupyter notebook to it. All code and Jupyter notebooks have to be uploaded to your repo.
6. You must also include in your Jupyter Notebook, a link for data download and environment setup requirements: 


`!wget`- and `!unzip`-like commands, as well as `!pip install` commands for non-standard libraries not available in Colab, are required in the top section of your Jupyter notebook. Alternatively, host the data on GitHub (you will need the academic license for GitHub to do this; larger datasets will require a link to external storage).

## GitHub ID

https://github.com/SmoothData-BigBrain

## Dataset link

https://www.kaggle.com/datasets/robikscube/flight-delay-dataset-20182022

## Setup for spark and data


### Import libraries

In [4]:
# Install everything inside the 'requirements.txt' file before running this notebook
!pip install -r ../requirements.txt

pandas==2.2.3
seaborn==0.13.2
plotly==6.1.1
pyspark==3.5.5
matplotlib==3.9.4
numpy==1.26.4


In [2]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import glob
import util


from pyspark import SparkContext
from pyspark.sql import functions as F
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col, isnan, when, count, isnull, sum, concat_ws, coalesce, lit, avg, rand, round




### Create Spark session

In [3]:
# spark = SparkSession.builder \
#     .appName("Flight Data Analysis") \
#     .getOrCreate()

# spark.conf.set("spark.sql.debug.maxToStringFields", 1000)
# spark.sparkContext.setLogLevel("ERROR")

### Mihirs machine spark setting
spark = SparkSession.builder \
    .appName("Flight Data Analysis") \
    .master("local[*]") \
    .config("spark.driver.memory", "12g") \
    .config("spark.executor.memory", "12g") \
    .config("spark.sql.shuffle.partitions", "100") \
    .config("spark.sql.debug.maxToStringFields", "1000") \
    .getOrCreate()

25/05/22 14:32:52 WARN Utils: Your hostname, Mihirs-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 192.168.1.8 instead (on interface en0)
25/05/22 14:32:52 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/22 14:32:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Read in data files

> **Note: Update the home_dir and download_path variables before running this cell block**

In [4]:
# home_dir = os.path.expanduser('~')
# path_for_Nam = 'C:/GitGroupProject/GroupProject' # comment this later
# download_path = os.path.join(path_for_Nam,'/data/') # comment this later

# download_path = os.path.join('/workspaces/GroupProject/data/') # Uncomment this later

home_dir = os.path.expanduser('~')
local_download_path = os.path.join(home_dir, 'Desktop/GroupProject/data/')
file_id = '1tch7xbFIgBtXKXa16E4QCpVKedUExfO3'  # My File ID for airlines.zip on GDrive 
util.check_and_fetch_data(file_id, local_download_path)

CSV files already exist in /Users/mihir/Desktop/GroupProject/data/archive/raw/. Skipping data download.
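
The skip message above suggests that the helper checks for existing CSV files before downloading. The implementation of `util.check_and_fetch_data` is not shown in this notebook, so the following is only a minimal sketch of that check-then-fetch pattern. The names `csvs_present`, `fetch_if_missing`, and `fake_download` are hypothetical, and the download itself is replaced by a stand-in that writes one small file.

```python
import glob
import os
import tempfile


def csvs_present(raw_dir):
    """Return True if raw_dir already contains at least one CSV file."""
    return len(glob.glob(os.path.join(raw_dir, "*.csv"))) > 0


def fetch_if_missing(raw_dir, download):
    """Call download(raw_dir) only when no CSV files are found.

    `download` is a caller-supplied function (hypothetical here); the
    notebook itself delegates this step to util.check_and_fetch_data.
    """
    if csvs_present(raw_dir):
        return "skipped"
    os.makedirs(raw_dir, exist_ok=True)
    download(raw_dir)
    return "downloaded"


def fake_download(target_dir):
    # Stand-in for a real download: writes one tiny CSV file.
    with open(os.path.join(target_dir, "sample.csv"), "w") as fh:
        fh.write("Year,DepDelay\n2021,5\n")


with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "archive", "raw")
    first = fetch_if_missing(raw, fake_download)   # directory is empty
    second = fetch_if_missing(raw, fake_download)  # CSV now exists
    print(first, second)
```

Making the fetch step idempotent like this lets the cell be re-run without downloading the archive again.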


In [5]:
# folder_path = '~/Desktop/GroupProject/data/archive/raw'
# path_for_Nam = 'C:/GitGroupProject/GroupProject'
# home_dir = os.path.expanduser('~')
# download_path = os.path.join(home_dir, 'Desktop/GroupProject/data/')

# download_path = os.path.join(path_for_Nam, '/data/')
# Nam_local = 'C:/lecture-notebooks/GroupProject/data/archive/raw' # comment this later

# csv_files = glob.glob(f"{Nam_local}/*.csv") # comment this later

### *****************
# Do we want to keep this spark.read.csv or just use parquet files from the start? 
# This would avoid running this, then saving to parquet, and reading parquet files into df again
### *****************
# csv_files = glob.glob(f"{local_download_path}archive/raw/*.csv") #Uncomment this later
# df = spark.read.csv(csv_files,
#                        sep = ',',
#                        inferSchema = True,
#                        header = True)

# df.printSchema()

In [4]:
# write out combined single csv file - why? 
# df.coalesce(1).write.csv("combined_file_csv", header=True) # Uncomment this later
# df.write.mode("overwrite").parquet("combined_files")
df = spark.read.parquet("combined_files")

In [5]:
col_des = spark.read.csv('flights_column_des.csv', sep = ',', inferSchema = True, header = True)

## Explore Dataset

### Get dataset shape

In [6]:
# get df shape
num_entries = df.count()
num_cols = len(df.columns)
print(f"Shape of the DataFrame: ({num_entries}, {num_cols})")

Shape of the DataFrame: (29193782, 120)


### Explore null values

#### Computing non-null counts as percentages

In [7]:
non_null_counts = df.select([count(col(c)).alias(c) for c in df.columns]).collect()[0].asDict()

# Calculate non-null percentages
non_null_percentages = {
    col_name: (count_val / num_entries) * 100
    for col_name, count_val in non_null_counts.items()
}

sorted_columns = sorted(non_null_percentages.items(), key=lambda x: x[1], reverse=True)

for col_name, pct in sorted_columns:
    print(f"{col_name}: {pct:.2f}% non-null")



Year: 100.00% non-null
Quarter: 100.00% non-null
Month: 100.00% non-null
DayofMonth: 100.00% non-null
DayOfWeek: 100.00% non-null
FlightDate: 100.00% non-null
Marketing_Airline_Network: 100.00% non-null
Operated_or_Branded_Code_Share_Partners: 100.00% non-null
DOT_ID_Marketing_Airline: 100.00% non-null
IATA_Code_Marketing_Airline: 100.00% non-null
Flight_Number_Marketing_Airline: 100.00% non-null
Operating_Airline : 100.00% non-null
DOT_ID_Operating_Airline: 100.00% non-null
IATA_Code_Operating_Airline: 100.00% non-null
Flight_Number_Operating_Airline: 100.00% non-null
OriginAirportID: 100.00% non-null
OriginAirportSeqID: 100.00% non-null
OriginCityMarketID: 100.00% non-null
Origin: 100.00% non-null
OriginCityName: 100.00% non-null
OriginState: 100.00% non-null
OriginStateFips: 100.00% non-null
OriginStateName: 100.00% non-null
OriginWac: 100.00% non-null
DestAirportID: 100.00% non-null
DestAirportSeqID: 100.00% non-null
DestCityMarketID: 100.00% non-null
Dest: 100.00% non-null
DestCit

                                                                                

#### Subset dataset
Removing columns with <90% non-null values

In [8]:
columns_above_90 = [col_name for col_name, pct in non_null_percentages.items() if pct >= 90]
filtered_df = df.select(columns_above_90)
filtered_df.select(filtered_df.columns[:8]).show(5)

+----+-------+-----+----------+---------+----------+-------------------------+---------------------------------------+
|Year|Quarter|Month|DayofMonth|DayOfWeek|FlightDate|Marketing_Airline_Network|Operated_or_Branded_Code_Share_Partners|
+----+-------+-----+----------+---------+----------+-------------------------+---------------------------------------+
|2021|      3|    7|        15|        4|2021-07-15|                       AS|                                     AS|
|2021|      3|    7|        15|        4|2021-07-15|                       AS|                                     AS|
|2021|      3|    7|        15|        4|2021-07-15|                       AS|                                     AS|
|2021|      3|    7|        15|        4|2021-07-15|                       AS|                                     AS|
|2021|      3|    7|        15|        4|2021-07-15|                       AS|                                     AS|
+----+-------+-----+----------+---------+-------

In [9]:
# Strip the extra spaces from col names 
for c in filtered_df.columns:
    filtered_df = filtered_df.withColumnRenamed(c, c.strip())

In [10]:
# get filtered df shape
filtered_num_rows = filtered_df.count()
filtered_num_cols = len(filtered_df.columns)
print(f"Shape of the Filtered DataFrame removing cols w/ <90% non-null values: ({filtered_num_rows}, {filtered_num_cols})")

Shape of the Filtered DataFrame removing cols w/ <90% non-null values: (29193782, 62)


In [11]:
filtered_df.printSchema()

root
 |-- Year: integer (nullable = true)
 |-- Quarter: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- FlightDate: date (nullable = true)
 |-- Marketing_Airline_Network: string (nullable = true)
 |-- Operated_or_Branded_Code_Share_Partners: string (nullable = true)
 |-- DOT_ID_Marketing_Airline: integer (nullable = true)
 |-- IATA_Code_Marketing_Airline: string (nullable = true)
 |-- Flight_Number_Marketing_Airline: integer (nullable = true)
 |-- Operating_Airline: string (nullable = true)
 |-- DOT_ID_Operating_Airline: integer (nullable = true)
 |-- IATA_Code_Operating_Airline: string (nullable = true)
 |-- Tail_Number: string (nullable = true)
 |-- Flight_Number_Operating_Airline: integer (nullable = true)
 |-- OriginAirportID: integer (nullable = true)
 |-- OriginAirportSeqID: integer (nullable = true)
 |-- OriginCityMarketID: integer (nullable = true)
 |-- Origin: string (null

In [12]:
# delete the master df since we won't need it anymore at all
del df

In [None]:
# save filtered df to not have to redo code later
#filtered_df.coalesce(1).write.mode("overwrite").option("header", True).csv("filtered_df_temp")

# # read in already filtered_df saved previously
# filtered_df = spark.read.csv('part-00000-b248588c-b561-414a-ba2c-bc77825e455a-c000.csv', sep = ',', inferSchema = True, header = True)

#### Discussion on null values
When sorted by non-null percentage, the columns split into two groups: those with >90% non-null values, and the rest, which drop down to 0-17% non-null. The dataset used for further exploration includes only the columns with at least 90% non-null values, for a more robust analysis.
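
The same thresholding logic can be checked on a small in-memory table. This is a sketch in pandas rather than Spark, and the toy values (and the use of `Div1Airport` as the sparse column) are made up for illustration only.

```python
import pandas as pd

# Toy table with 20 rows: one complete column, one mostly complete,
# and one that is almost entirely null.
toy = pd.DataFrame({
    "Year": [2021] * 20,
    "DepDelay": [5.0] * 19 + [None],         # 95% non-null
    "Div1Airport": ["DEN"] * 2 + [None] * 18,  # 10% non-null
})

# Share of non-null values per column, as a percentage
non_null_pct = toy.notna().mean() * 100

# Keep only the columns that meet the 90% threshold
kept = [name for name, pct in non_null_pct.items() if pct >= 90]
print(kept)
```

Only the columns at or above the threshold survive, which mirrors the `pct >= 90` filter used on the Spark DataFrame.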

### Remaining Column Descriptions

In [None]:
# get all cols in filtered_df
filtered_cols = filtered_df.columns 

# remove any white space
filtered_cols = [str(c).strip() for c in filtered_cols]

# subset column description dataframe for only columns in filtered dataset
filtered_col_des = col_des.filter(col('column').isin(filtered_cols))

In [None]:
# check df was filtered correctly, length & row count should match
f_col_len = filtered_col_des.count()
f_col_len

In [None]:
# View all column descriptions in filtered dataframe
# Full data col description is in "../data/README.md"
filtered_col_des.show(n=f_col_len, truncate=False)

## Dataset Statistics & Distributions

In [None]:
# get data type for each column
for name, dtype in filtered_df.dtypes:
    print(f"{name}: {dtype}")

In [None]:
non_string_cols = [col_name for col_name, dtype in filtered_df.dtypes if dtype != 'string']

In [None]:
# subset column description dataframe for only non-string
non_string_col_des = filtered_col_des.filter(col('column').isin(non_string_cols))
non_string_col_des.show(n=non_string_col_des.count(), truncate=False)

### Discussion on skewed data distributions

Among the columns with the most skewed distributions, those that are ID indicators or flight numbers are not worth investigating further. Although their values are numerical, they represent categorical rather than continuous variables.

Columns with 'ID', 'Number', 'Origin', or 'Dest' in the column name, along with 'FlightDate', are excluded from the statistical analysis to remove these categorical variables.
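
To see which names such a pattern excludes, the filter can be tried on a handful of column names with Python's `re` module. This is a sketch: the list below is a small hand-picked subset, and `re.search` is used because it matches anywhere in the string, which is how a substring pattern like this one is intended to work.

```python
import re

# Same alternation as the filter applied to the column descriptions
pattern = re.compile(r"Dest|Origin|ID|Number|FlightDate")

names = [
    "Year",
    "DepDelay",
    "Distance",
    "OriginAirportID",
    "OriginStateFips",
    "DOT_ID_Marketing_Airline",
    "Flight_Number_Operating_Airline",
    "FlightDate",
]

# Keep only names in which the pattern does not occur anywhere
kept = [n for n in names if not pattern.search(n)]
print(kept)
```

The match is case-sensitive, so a name is dropped only if it contains one of the fragments with exactly this capitalization.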

In [None]:
cont_col_des = non_string_col_des.filter(
    ~non_string_col_des['column'].rlike('Dest|Origin|ID|Number|FlightDate')
)
cont_col_des.show(n=cont_col_des.count(), truncate=False)

In [None]:
# get statistics for all continuous variables
cont_cols = [row['column'] for row in cont_col_des.select('column').collect()]

describe_df = filtered_df.select(cont_cols).describe()

# compute Q1, Median, Q3 for each column
stats = {
    "25%": {},
    "50%": {},
    "75%": {}
}

for col_name in cont_cols:
    q1, median, q3 = filtered_df.approxQuantile(col_name, [0.25, 0.5, 0.75], 0.01)
    stats["25%"][col_name] = str(q1)
    stats["50%"][col_name] = str(median)
    stats["75%"][col_name] = str(q3)

# convert new rows to df rows
new_rows = [Row(summary=stat_name, **cols) for stat_name, cols in stats.items()]
quartile_df = spark.createDataFrame(new_rows)

# append the new rows to describe_df
full_summary_df = describe_df.unionByName(quartile_df)

In [None]:
# convert and save as parquet
# full_summary_df.to_parquet("full_summary_df", index=False)
full_summary_df.write.mode("overwrite").parquet("full_summary_df")

In [None]:
full_summary_df = spark.read.parquet("full_summary_df")

In [None]:
full_summary_df.select(full_summary_df.columns[:6]).show(truncate=False)

In [None]:
# view df columns
full_summary_df.select(full_summary_df.columns[11:17]).show(truncate=False)

### Explore skewed data

mean > median: data is likely right-skewed (longer tail on the right)
mean < median: data is likely left-skewed (longer tail on the left)

The code below finds the 20 features with the largest absolute difference between mean and median, used here as a rough indicator of skew. These features are then plotted as histograms.

The purpose is to understand whether there are outliers that may be worth removing from the dataset before applying ML methods.
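
The mean-versus-median rule can be illustrated with a few made-up numbers. This is a small sketch using only the standard library; `skew_direction` is a hypothetical helper name, and the rule is a heuristic rather than a formal skewness statistic.

```python
from statistics import mean, median


def skew_direction(values):
    """Classify skew direction by comparing the mean with the median."""
    m, md = mean(values), median(values)
    if m > md:
        return "right"
    if m < md:
        return "left"
    return "none"


# One very large value pulls the mean far above the median
right_tailed = [10, 12, 14, 15, 300]
# One very small value pulls the mean below the median
left_tailed = [1, 50, 52, 55, 57]
symmetric = [1, 2, 3]

print(skew_direction(right_tailed), skew_direction(left_tailed), skew_direction(symmetric))
```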

### Explore data distributions

In [None]:
# get mean and median rows as dicts
mean_row = full_summary_df.filter(col("summary") == "mean").collect()[0].asDict()
median_row = full_summary_df.filter(col("summary") == "50%").collect()[0].asDict()

# skip the 'summary' key
cols = [col for col in mean_row.keys() if col != "summary"]

# build rows of (column, absolute_diff, skew direction)
result_rows = []
for c in cols: # for each col
    mean_val = float(mean_row[c]) # get mean
    median_val = float(median_row[c]) # get median
    diff = abs(mean_val - median_val) # get abs difference (built-in abs is not shadowed by the pyspark imports)
    skew = "right" if mean_val > median_val else "left" if mean_val < median_val else "none" # get skew direction
    result_rows.append(Row(column=c, absolute_diff=diff, skew=skew)) # aggregate

# create df
diff_df = spark.createDataFrame(result_rows)

# get top 20
top_skewed = diff_df.orderBy(col("absolute_diff").desc()).limit(20)

top_skewed.show(truncate=False)


In [None]:
# get the names of the top skewed columns
skewed_cols = [row['column'] for row in top_skewed.select('column').collect()]

# remove any white space
skewed_cols = [str(c).strip() for c in skewed_cols]

# subset column description dataframe for only the top skewed columns
skewed_col_des = col_des.filter(col('column').isin(skewed_cols))

In [None]:
skewed_col_des.show(n=skewed_col_des.count(), truncate=False)

In [None]:
# list of columns from 'top_skewed'
columns_to_plot = [row['column'] for row in top_skewed.collect()]

# filter the columns that exist in filtered_df
valid_columns = [col for col in columns_to_plot if col in filtered_df.columns]

# plot histograms for each column
n_cols = 4  # 4 histograms per row
n_rows = (len(valid_columns) + n_cols - 1) // n_cols  # calculate num rows needed

fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, n_rows * 5))

# flatten axes for easier indexing
axes = axes.flatten()

# loop through cols and plot
for i, column in enumerate(valid_columns):
    hist = filtered_df.select(column).rdd.flatMap(lambda x: x).histogram(20)  # 20 bins

    bin_edges, bin_counts = hist

    # plot the histogram using the bin edges and counts
    axes[i].bar(bin_edges[:-1], bin_counts, width=(bin_edges[1] - bin_edges[0]), align='edge', edgecolor='black')

    # set axes & title
    axes[i].set_title(f"Histogram of {column}")
    axes[i].set_xlabel('Value')
    axes[i].set_ylabel('Frequency')

# turn off any unused subplots
for i in range(len(valid_columns), len(axes)):
    axes[i].axis('off')

# need to update to add labels for axis 
plt.tight_layout()
plt.show()

### Discussion

In the Distance column, the majority of flights have a distance <1000 miles, with a few outliers ranging from 3000-5000 miles.

The Wheels On, Wheels Off, CRSDepTime, and DepTime columns have a few outliers at 0:00-4:00am; the majority of times are listed between 5:00 and 23:59.

The majority of TaxiOut and TaxiIn times are close to 0 (or <50 minutes). However, there are outliers at ~1300 and ~300 minutes, respectively.
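
One common way to flag values like these is the interquartile-range (IQR) rule. It is not used elsewhere in this notebook, so the sketch below is only a suggestion, and the taxi-out times are made-up values chosen to include one extreme entry.

```python
from statistics import quantiles

# Made-up taxi-out times in minutes, with one extreme value
taxi_out = [8, 10, 11, 12, 13, 14, 15, 16, 18, 20, 1300]

# Quartiles of the sample (three cut points: Q1, median, Q3)
q1, _, q3 = quantiles(taxi_out, n=4)
iqr = q3 - q1

# Conventional 1.5 * IQR fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [v for v in taxi_out if v < lower_fence or v > upper_fence]
print(outliers)
```

Whether flagged values should be removed or kept is a separate decision, since a very long taxi time may be a real event rather than a data error.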

## Questions to analyze data with

### Which Origin Cities had the most delayed flights?

In [None]:
count_delay = filtered_df.select(["Origin", "DepDelay"]).groupBy("Origin")\
        .agg(count(F.when(col("DepDelay") > 0, 1)).alias("DelayCount"), 
             count(F.when(col("DepDelay") < 0, 1)).alias("EarlyCount"),
            count("*").alias("TotalCount")).orderBy(col("TotalCount").desc())
pandas_delay = count_delay.toPandas()

pdf = pandas_delay.copy()
# Remainder after removing delayed and early flights; rows with a null DepDelay also land here
pdf["OnTimeCount"] = pdf["TotalCount"] - pdf["DelayCount"] - pdf["EarlyCount"]
top_20 = pdf.head(20)

# Bar positions
cities = top_20["Origin"]
x = np.arange(len(cities))

# Heights
early = top_20["EarlyCount"]
on_time = top_20["OnTimeCount"]
delayed = top_20["DelayCount"]

# Plot
plt.figure(figsize=(12, 6))
plt.bar(x, early, label="Early", color="green")
plt.bar(x, on_time, bottom=early, label="On Time", color="gray")
plt.bar(x, delayed, bottom=early + on_time, label="Delayed", color="red")

# Labels and formatting
plt.xticks(x, cities, rotation=45)
plt.ylabel("Number of Flights")
plt.title("Flight Status by Origin City (Top 20)")
plt.legend(title="Flight Status")
plt.tight_layout()
plt.show()

#### Discussion
The plot above shows a stacked bar chart of the 20 origin cities (identified by airport code) with the highest total flight counts. The bars are stacked by early departures (green), on-time departures (gray), and delayed departures (red), and are ordered by descending total flight count from left to right. The overall trend suggests that the majority of flights depart early and only a small proportion are counted as on time, since that category only holds flights that are neither early nor delayed. The origin city with the seemingly largest proportion of delayed departures is Denver and, speaking as someone from Colorado, I can personally attest to this.
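
The three categories in the chart follow from the sign of `DepDelay`. The sketch below spells out that classification on made-up values and treats a missing delay as its own category. The `departure_status` helper and the `"unknown"` label are assumptions for illustration, not part of the notebook.

```python
from collections import Counter


def departure_status(dep_delay):
    """Map a departure delay in minutes to a status label."""
    if dep_delay is None:
        return "unknown"   # e.g. no recorded departure
    if dep_delay > 0:
        return "delayed"
    if dep_delay < 0:
        return "early"
    return "on_time"       # exactly 0 minutes


delays = [-5, 0, 12, None, -1, 30]
counts = Counter(departure_status(d) for d in delays)

# Computing "on time" as total - delayed - early also absorbs the unknowns
remainder = len(delays) - counts["delayed"] - counts["early"]
print(counts, remainder)
```

Counting missing values separately keeps them from inflating the on-time bar.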

### Which years had the most delayed flights?

In [None]:
year_delay = filtered_df.select(["Year", "DepDelay"]).groupBy("Year")\
        .agg(count(F.when(col("DepDelay") > 0, 1)).alias("DelayCount"), 
             count(F.when(col("DepDelay") < 0, 1)).alias("EarlyCount"),
            count("*").alias("TotalCount")).orderBy(col("Year"))
pandas_year_delay = year_delay.toPandas()

pydf = pandas_year_delay.copy()
pydf["OnTimeCount"] = pydf["TotalCount"] - pydf["DelayCount"] - pydf["EarlyCount"]
year_counts = pydf

# Bar positions
years = year_counts["Year"]
x = np.arange(len(years))

# Heights
early = year_counts["EarlyCount"]
on_time = year_counts["OnTimeCount"]
delayed = year_counts["DelayCount"]

# Plot
plt.figure(figsize=(12, 6))
plt.bar(x, early, label="Early", color="green")
plt.bar(x, on_time, bottom=early, label="On Time", color="gray")
plt.bar(x, delayed, bottom=early + on_time, label="Delayed", color="red")

# Labels and formatting
plt.xticks(x, years)
plt.ylabel("Number of Flights")
plt.title("Flight Status by Year")
plt.legend(title="Flight Status")
plt.tight_layout()
plt.show()

#### Discussion
Similar to the first bar chart, this stacked bar chart shows the overall flight count per year between 2018 and 2022. The scale of this plot is in millions of flights, and 2019 appears to have a much higher overall flight count than the other years. Surprisingly, 2020 had by far the fewest delayed departures even though it had a similar number of overall flights. This could well be related to the emergence of COVID at the beginning of that year, and it would be interesting to look more closely at flight trends during that time.

### Which months had the most delayed flights?

In [None]:
month_delay = filtered_df.select(["Month", "DepDelay"]).groupBy("Month")\
        .agg(count(F.when(col("DepDelay") > 0, 1)).alias("DelayCount"), 
             count(F.when(col("DepDelay") < 0, 1)).alias("EarlyCount"),
            count("*").alias("TotalCount")).orderBy(col("Month"))
pandas_month_delay = month_delay.toPandas()

pmdf = pandas_month_delay.copy()
pmdf["OnTimeCount"] = pmdf["TotalCount"] - pmdf["DelayCount"] - pmdf["EarlyCount"]
month_counts = pmdf

# Bar positions
months = month_counts["Month"]
x = np.arange(len(months))

# Heights
early = month_counts["EarlyCount"]
on_time = month_counts["OnTimeCount"]
delayed = month_counts["DelayCount"]

# Plot
plt.figure(figsize=(12, 6))
plt.bar(x, early, label="Early", color="green")
plt.bar(x, on_time, bottom=early, label="On Time", color="gray")
plt.bar(x, delayed, bottom=early + on_time, label="Delayed", color="red")

# Labels and formatting
plt.xticks(x, months)
plt.ylabel("Number of Flights")
plt.title("Flight Status by Month")
plt.legend(title="Flight Status")
plt.tight_layout()
plt.show()

#### Discussion
Continuing with the bar charts, the chart above shows departure status stacked within the total number of flights per month, with January as month 1. Intuitively, one might expect more delayed flights during the winter months (December-March). However, the graph shows no discernible difference in delayed departures during that time, and the largest proportion of delayed flights actually occurs in June. Keep in mind that this only looks at departures, so other statuses could show different outcomes, but it is interesting to note.
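
Because months differ in total flight count, comparing proportions is easier when each row is normalized to shares. The sketch below shows one way to do that in pandas. The counts are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Invented monthly counts, indexed by month number
month_counts = pd.DataFrame(
    {
        "EarlyCount": [60, 50],
        "OnTimeCount": [5, 5],
        "DelayCount": [35, 45],
    },
    index=pd.Index([1, 6], name="Month"),
)

status_cols = ["EarlyCount", "OnTimeCount", "DelayCount"]

# Divide each row by its own total so the three shares sum to 1
totals = month_counts[status_cols].sum(axis=1)
shares = month_counts[status_cols].div(totals, axis=0)
print(shares)
```

Plotting `shares` instead of raw counts would give bars of equal height, which makes the delayed fraction directly comparable across months.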

### Which routes had the most delays?

In [None]:
year_delay = filtered_df.select(["Year", "DepDelay"]).groupBy("Year")\
        .agg(count(F.when(col("DepDelay") > 0, 1)).alias("DelayCount"), 
             count(F.when(col("DepDelay") < 0, 1)).alias("EarlyCount"),
            count("*").alias("TotalCount")).orderBy(col("TotalCount").desc())
pandas_year_delay = year_delay.toPandas()

In [None]:
cols_to_keep_2 = ["Operating_Airline", "Origin", "Dest", "ArrDelayMinutes", "DepDelayMinutes", "Distance", "OriginCityName", "DestCityName"]
df2 = filtered_df.select(cols_to_keep_2)

In [None]:
# group by origin and destination city, then calculate the total average delay between the cities
route_delays = df2.groupBy("OriginCityName", "DestCityName") \
    .agg(
        (F.avg("DepDelayMinutes") + F.avg("ArrDelayMinutes")).alias("AvgTotalDelay")
    ) \
    .orderBy(F.col("AvgTotalDelay").desc())

In [None]:
# convert to pandas
route_delays_pd = route_delays.limit(10).toPandas()

In [None]:
route_delays_pd.head()

In [None]:
# combining origin and dest for visual purposes 
route_delays_pd['Route'] = route_delays_pd['OriginCityName'] + ' to ' + route_delays_pd['DestCityName']

In [None]:
# plot
plt.figure(figsize=(14, 8))
sns.barplot(x='AvgTotalDelay', y='Route', data=route_delays_pd)
plt.title('Top 10 Most Delayed Flight Routes')
plt.xlabel('Average Total Delay in Minutes')
plt.ylabel('Route (Origin to Destination)')
plt.xticks(rotation=0)
plt.show()

#### Discussion
This bar plot shows the 10 most delayed flight routes ranked by average total delay, which is the sum of the average departure delay and the average arrival delay. The x-axis represents the average total delay in minutes, while the y-axis displays the origin and destination cities. The route with the largest delay, Bend/Redmond, OR to Medford, OR, lies within a single state and has an average total delay of around 2200 minutes (roughly 37 hours), which is significantly higher than the other routes. Most of the other routes in the top 10 span approximately half the country. Further analysis could explore how the number of delayed flights on each route correlates with the average total delay.
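
One way to follow up on this is to report the flight count next to each route average, since an average based on very few flights can be dominated by a single extreme value. Whether that applies to the route above has not been checked here. The sketch below uses pandas on invented data, and the `min_flights` threshold is an arbitrary illustrative choice.

```python
import pandas as pd

# Invented data: one route with a single extreme flight, one with several flights
flights = pd.DataFrame({
    "Route": ["A to B", "C to D", "C to D", "C to D", "C to D"],
    "TotalDelay": [2200, 30, 10, 0, 20],
})

# Average delay and number of flights per route
summary = flights.groupby("Route")["TotalDelay"].agg(
    AvgTotalDelay="mean",
    Flights="count",
)

# Keep only routes with enough flights for the average to be meaningful
min_flights = 3  # illustrative threshold
reliable = summary[summary["Flights"] >= min_flights]
print(summary)
print(reliable)
```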

### Which airlines experience the most delays?

In [None]:
# combining departure delay and arrival delay to one column
df2 = df2.withColumn('TotalDelay', F.col('DepDelayMinutes') + F.col('ArrDelayMinutes'))

In [None]:
# renaming airline codes to their respective names
airline_mapping = {
    'AX': 'Trans States Airlines',
    'C5': 'Commutair/Champlain Enterprises Inc.',
    'G7': 'GoJet Airlines/United Express',
    'ZW': 'Air Wisconsin Airlines Corp',
    'EV': 'ExpressJet Airlines inc.',
    'B6': 'JetBlue Airways',
    'YV': 'Mesa Airlines Inc.',
    'OO': 'Skywest Airlines Inc',
    'F9': 'Frontier Airlines Inc',
    'G4': 'Allegiant Air'
}

In [None]:
# selecting only delays that are over 0
delayed_flights = df2.filter(df2['TotalDelay'] > 0)

In [None]:
# delayed_flights.printSchema()

In [None]:
# group by airline, then calculating the total average delay
total_delay = delayed_flights.groupBy("Operating_Airline").agg(
    F.avg("TotalDelay").alias("TotalDelayMinutes")
).orderBy(F.col("TotalDelayMinutes").desc())

In [None]:
# convert to pandas
total_delay_pd = total_delay.limit(10).toPandas()

In [None]:
total_delay_pd['Operating_Airline_Name'] = total_delay_pd['Operating_Airline'].map(airline_mapping)

In [None]:
total_delay_pd.head(10)

In [None]:
# plot
plt.figure(figsize=(12, 8))
sns.barplot(x="TotalDelayMinutes", y="Operating_Airline", data=total_delay_pd)
plt.title('Top 10 Airlines with the Most Delay in Minutes')
plt.xlabel('Average Total Delay in Minutes')
plt.ylabel('Airline')
plt.show()

#### Discussion
This bar plot shows the 10 airlines with the highest average total delay in minutes, computed only over flights with a total delay above 0. The x-axis represents the average total delay, while the y-axis lists the airlines by operating airline code. Certain airlines, such as Trans States Airlines and Commutair, experience somewhat higher delays than the others, with total delays averaging over 100 minutes (a little over 1.5 hours). Further analysis could focus on the specific factors contributing to delays within these airlines, such as location, weather, or operational challenges.

### Do flights with a longer distance have longer departure delays?

In [None]:
# convert to pandas
distance_delays_sample_pd = df2.select("Distance", "DepDelayMinutes").sample(withReplacement=False, fraction=0.3, seed=42).toPandas()


In [None]:
# convert to numeric to prevent error
distance_delays_sample_pd['Distance'] = pd.to_numeric(distance_delays_sample_pd['Distance'], errors='coerce')
distance_delays_sample_pd['DepDelayMinutes'] = pd.to_numeric(distance_delays_sample_pd['DepDelayMinutes'], errors='coerce')

In [None]:
# removing any NaNs and 0 values
distance_delays_sample_pd = distance_delays_sample_pd.dropna(subset=['Distance', 'DepDelayMinutes'])
distance_delays_sample_pd = distance_delays_sample_pd[(distance_delays_sample_pd['Distance'] > 0) & (distance_delays_sample_pd['DepDelayMinutes'] > 0)]

In [None]:
# plot
plt.figure(figsize=(12, 8))
sns.scatterplot(x='Distance', y='DepDelayMinutes', data= distance_delays_sample_pd, alpha=0.6)
plt.title('Flight Distance vs Departure Delay')
plt.xlabel('Flight Distance in Miles')
plt.ylabel('Departure Delay in Minutes')
plt.show()

#### Discussion
This scatter plot compares flight distance (x-axis) with departure delay (y-axis). It is based on a 30% sample of the dataset, restricted to flights with a positive departure delay, and shows no strong correlation between distance and delay. If anything, there is a slight negative trend in which shorter flights have longer delays and longer flights have shorter delays. Further analysis could explore how specific factors contribute to delays, especially for shorter flights.
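
A visual impression of correlation can be backed by a number. The sketch below computes the Pearson correlation coefficient with NumPy on a few invented points. It does not use the sampled flight data, so the value says nothing about the real dataset.

```python
import numpy as np

# Invented distances (miles) and departure delays (minutes)
distance = np.array([200, 500, 900, 1500, 2500], dtype=float)
dep_delay = np.array([40, 35, 30, 32, 20], dtype=float)

# Pearson correlation: off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(distance, dep_delay)[0, 1]
print(round(r, 3))
```

A value near 0 indicates no linear relationship, while values toward -1 or 1 indicate a stronger negative or positive linear relationship.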

### Checking delayed and cancelled flights grouped by operating airline and time

The following columns were manually chosen after reading column description on Kaggle.

In [None]:
cols_to_check = ['Marketing_Airline_Network', 'Operating_Airline', 'Origin', 'Dest', 'DepDel15', 'DepDelay', 'ArrDel15', 'ArrDelay',\
                 'Cancelled', 'CarrierDelay', 'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay',\
                 'Distance', 'Year', 'Month', 'DayofMonth', 'FlightDate', 'Flight_Number_Operating_Airline']
print(f'col_to_check = {len(cols_to_check)}')
# Check whether each column of interest is present in filtered_df
removed_col = []
my_cols = []
for c in cols_to_check:
    if c in filtered_df.columns:
        my_cols.append(c)
    else:
        removed_col.append(c)

print('Chosen columns:')
print(my_cols)
print()
print('Rejected columns:')
print(removed_col)

In [None]:
# Filter out a subset of columns to do data analysis
Nam_df = filtered_df.select(my_cols)

Cancelled flights have null values in other columns!

In [None]:
cols_to_check = ['DepDel15', 'DepDelay', 'ArrDel15', 'ArrDelay']
print('There are entries where Cancelled column will have value of 1 (canceled) while other columns might be null')
for c in cols_to_check:
    num_to_display = Nam_df.where(isnull(col(c)) & (col('Cancelled') == 1)).count()
    print(f'Number of entries where {c} column is null but Cancelled is 1: {num_to_display}')

Since I intend to do data exploration with cancelled flights, it is not a good idea for me to do dropna() on the dataset as I will lose data for those flights.
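
The effect can be reproduced on a tiny table. The sketch below uses pandas with invented rows to show that a blanket `dropna()` removes every cancelled flight when their delay fields are null, while restricting the check to selected columns keeps them.

```python
import pandas as pd

# Invented rows: cancelled flights have no departure delay recorded
flights = pd.DataFrame({
    "Cancelled": [0.0, 1.0, 0.0, 1.0],
    "DepDelay": [5.0, None, -2.0, None],
    "Distance": [300.0, 450.0, 1200.0, 800.0],
})

cancelled_before = int((flights["Cancelled"] == 1).sum())

# Dropping any row with a null removes all cancelled flights here
after_dropna = flights.dropna()
cancelled_after = int((after_dropna["Cancelled"] == 1).sum())

# Dropping only on columns that must be present keeps them
kept = flights.dropna(subset=["Distance"])
cancelled_kept = int((kept["Cancelled"] == 1).sum())

print(cancelled_before, cancelled_after, cancelled_kept)
```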

In [None]:
# check number of flights that were delayed by 15 minutes or more (DepDel15 == 1) and number of cancelled flights
num_delayed = Nam_df.where(col('DepDel15') == 1).count()
num_cancelled = Nam_df.where(col('Cancelled') == 1).count()
print(f'Number of flights that were delayed by 15 minutes or more: {num_delayed}')
print(f'Number of flights that were cancelled: {num_cancelled}')

#### Group by operating airline data analysis

In [None]:
# aggregate by operating airline and turn into Pandas dataframe
airline_flights = Nam_df.groupBy(col('Operating_Airline')).agg(F.sum(col('DepDel15')).alias('DelayedFlights'),
                                                               F.sum(col('Cancelled')).alias('CancelledFlights'),
                                                               F.count('*').alias('TotalFlights'))
airline_flights_pd = airline_flights.toPandas()
airline_flights_pd['DelayedPercentage'] = airline_flights_pd['DelayedFlights'] / airline_flights_pd['TotalFlights'] * 100
airline_flights_pd['CancelledPercentage'] = airline_flights_pd['CancelledFlights'] / airline_flights_pd['TotalFlights'] * 100
airline_flights_pd.head(5)

In [None]:
# Graph number of delayed flights by airlines
airline_flights_pd = airline_flights_pd.sort_values(by = 'DelayedFlights', axis = 0, ascending = True)
fig, ax = plt.subplots(figsize=(10, 10))
percentage_barplot = sns.barplot(data = airline_flights_pd, x = 'Operating_Airline', y = 'DelayedFlights')
plt.title('Number of delayed flights by Operating Airline')
plt.xlabel('')
plt.ylabel('')

In [None]:
# Graph delay flight percentage by airline
airline_flights_pd = airline_flights_pd.sort_values(by = 'DelayedPercentage', axis = 0, ascending = True)
fig, ax = plt.subplots(figsize=(10, 10))
percentage_barplot = sns.barplot(data = airline_flights_pd, x = 'Operating_Airline', y = 'DelayedPercentage')
plt.title('Percentage of delayed flights by Operating Airline')
plt.xlabel('')
plt.ylabel('')

In [None]:
# Graph number of cancelled flights by airlines
airline_flights_pd = airline_flights_pd.sort_values(by = 'CancelledFlights', axis = 0, ascending = True)
fig, ax = plt.subplots(figsize=(10, 10))
sns.barplot(data = airline_flights_pd, x = 'Operating_Airline', y = 'CancelledFlights')
plt.title('Number of cancelled flights by Operating Airline')
plt.xlabel('')
plt.ylabel('')

In [None]:
# Graph percentage of cancelled flights by airlines
airline_flights_pd = airline_flights_pd.sort_values(by = 'CancelledPercentage', axis = 0, ascending = True)
fig, ax = plt.subplots(figsize=(10, 10))
sns.barplot(data = airline_flights_pd, x = 'Operating_Airline', y = 'CancelledPercentage')
plt.title('Percentage of cancelled flights by Operating Airline')
plt.xlabel('')
plt.ylabel('')

#### Group by time data analysis

In [None]:
# Group by Year and Month and turn into Pandas df
time_agg = Nam_df.groupBy('Year', 'Month').agg(F.sum(col('DepDel15')).alias('DelayedFlights'),
                                              F.sum(col('Cancelled')).alias('CancelledFlights'),
                                              F.count('*').alias('TotalFlights'))
time_agg_pd = time_agg.toPandas()
time_agg_pd['DelayedPercentage'] = time_agg_pd['DelayedFlights'] / time_agg_pd['TotalFlights']
time_agg_pd['CancelledPercentage'] = time_agg_pd['CancelledFlights'] / time_agg_pd['TotalFlights']
time_agg_pd.head(5)

In [None]:
# Graph number of flights over time
colors = ['red', 'blue', 'green', 'purple', 'gold']
fig, ax = plt.subplots(figsize=(10, 10))
barplot = sns.lineplot(data = time_agg_pd, x = 'Month', y = 'TotalFlights', hue = 'Year', palette = colors)
plt.title('Number of monthly flights over the year')
plt.ylabel('')
plt.legend(loc = 'upper right', bbox_to_anchor = (1.15, 1), title = 'Year')

In [None]:
# Graph percentage of delayed flights over time
colors = ['red', 'blue', 'green', 'purple', 'gold']
fig, ax = plt.subplots(figsize=(10, 10))
barplot = sns.lineplot(data = time_agg_pd, x = 'Month', y = 'DelayedPercentage', hue = 'Year', palette = colors)
plt.title('Percentage of delayed monthly flights over the year')
plt.ylabel('')
plt.legend(loc = 'upper right', bbox_to_anchor = (1.15, 1), title = 'Year')

In [None]:
# Graph percentage of cancelled flights over time
colors = ['red', 'blue', 'green', 'purple', 'gold']
fig, ax = plt.subplots(figsize=(10, 10))
barplot = sns.lineplot(data = time_agg_pd, x = 'Month', y = 'CancelledPercentage', hue = 'Year', palette = colors)
plt.title('Percentage of cancelled monthly flights over the year')
plt.ylabel('')
plt.legend(loc = 'upper right', bbox_to_anchor = (1.15, 1), title = 'Year')

#### Discussion

**<u>Data cleaning</u>**

Thanks to the team's work, we learned that the majority of columns in our dataset contain a large number of NaN values. We were able to quickly filter out those columns and focus on the others.
My analysis focused on delayed and cancelled flights. I learned that entries for cancelled flights have nulls in other columns, so a simple dropna() is not a viable data cleaning method.
If we are to work with cancelled flights, we have to find a way to fill in the nulls in those columns.

**<u>Aggregated by operating airlines</u>**

Despite how miserable air travel is, flights are seldom late. Depending on the airline, about 10 - 25% of flights depart at least 15 minutes late. From the graph, I would estimate that on average about 15% of an airline's flights depart late.
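
One caveat on these percentages: a SQL-style sum skips nulls, so a cancelled flight whose `DepDel15` is null adds nothing to `DelayedFlights` but still counts toward `TotalFlights`. The plain-Python sketch below uses invented rows (not the real dataset) to show how the delay rate shifts when the denominator only includes flights with a known `DepDel15`. The helper name `delay_rates` is hypothetical.

```python
# Toy rows, invented for illustration: DepDel15 is None when the flight never departed.
flights = [
    {"Operating_Airline": "AA", "DepDel15": 1.0, "Cancelled": 0},
    {"Operating_Airline": "AA", "DepDel15": 0.0, "Cancelled": 0},
    {"Operating_Airline": "AA", "DepDel15": 0.0, "Cancelled": 0},
    {"Operating_Airline": "AA", "DepDel15": None, "Cancelled": 1},
]

def delay_rates(rows):
    """Return (rate over all rows, rate over rows with a known DepDel15)."""
    known = [r["DepDel15"] for r in rows if r["DepDel15"] is not None]
    delayed = sum(known)                # nulls are skipped, like a SQL SUM
    rate_all = delayed / len(rows)      # denominator used in the notebook
    rate_known = delayed / len(known)   # denominator excluding null DepDel15
    return rate_all, rate_known

rate_all, rate_known = delay_rates(flights)
print(f"delayed / all scheduled flights: {rate_all:.3f}")
print(f"delayed / flights with DepDel15: {rate_known:.3f}")
```

With cancellations under 5% for most airlines the gap is small, but it could matter for months with very high cancellation rates.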

Airlines are keen to keep their flights operational, cancelling less than 5% of their total scheduled flights. This makes sense, as a cancellation results not only in lost revenue but also in compensation for damages and potential loss of opportunities.

The airline denoted by KS (Peninsula Airways) stood out to me. They do not operate many flights, but their delay and cancellation metrics leave room for improvement, to say the least.

**<u>Aggregated by time</u>**

Acknowledgement: I understand that the airline industry was heavily affected by COVID-19 and that the industry is recovering to pre-pandemic numbers.

Overall, it seems that February tends to be a slow month for the airline industry. The summer months are busy. The holiday months of September - December are only slightly less busy than the summer months. People are more likely to travel in the second half of the year.

The data for 2018 flights stood out to me. The year started with fewer flights than 2021 and 2022 (post COVID-19 years) but then suddenly gained 300,000 flights for the summer months. This suggests to me a potential socioeconomic or geopolitical event and/or an issue with data collection.

Delayed flights are more likely during the peak travel seasons, contributing to the misery of air travel. February is an interesting month: fewer customers are travelling, yet flights are more likely to be delayed. My guess is that flight crews and ground crews are burned out from the holiday season and their performance suffers.

As for cancelled flights, there is a gigantic mountain that sits in the middle of the graph. Almost half of scheduled flights for April of 2020 are cancelled. I wonder what happened.

# Group Project Milestone 3: Data PreProcessing & First Model

In this assignment you will need to:

1. Finish major preprocessing. This includes scaling and/or transforming your data, imputing your data, encoding your data, and feature expansion (for example, taking features and generating new features by transforming them via polynomial, log, or multiplication of features).

2. Train your first model

3. Evaluate your model and compare training vs. test error

4. Where does your model fit in the fitting graph?

5. What are the next models you are thinking of and why?

6. Update your README.md to include your new work and updates you have all added. Make sure to upload all code and notebooks. Provide links in your README.md

7. Conclusion section: What is the conclusion of your 1st model? What can be done to possibly improve it?

Note: For supervised learning, include example ground truth and predictions for train, validation, and test


## 1. PreProcessing Finalization

### Handling Missing Data

- Columns with >10% null values are excluded from analysis
- Columns that pass this criterion but still have null values contain 1-3% null values
- Most columns that remain in the dataset have no null values
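
The 10% rule above can be sketched in plain Python as follows. The column values are invented for illustration, and the names `null_fraction` and `MAX_NULL_FRACTION` are hypothetical; the notebook itself applies the rule to the Spark DataFrame.

```python
# Toy column data (invented); None marks a missing value.
columns = {
    "Distance": [100, 250, 400, 90, 310, 75, 620, 180, 240, 55],
    "DepDelay": [5.0, None, -2.0, 0.0, 12.0, 3.0, 40.0, -1.0, 7.0, 0.0],
    "CancellationCode": [None, "B", None, None, None, None, None, None, None, None],
}

MAX_NULL_FRACTION = 0.10  # columns with more than 10% nulls are excluded

def null_fraction(values):
    """Share of entries that are missing."""
    return sum(v is None for v in values) / len(values)

fractions = {name: null_fraction(vals) for name, vals in columns.items()}
kept = [name for name, frac in fractions.items() if frac <= MAX_NULL_FRACTION]
dropped = [name for name in columns if name not in kept]

print("kept:", kept)
print("dropped:", dropped)
```

A column sitting exactly at 10% is kept here because the rule excludes only columns with more than 10% nulls.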

#### Columns with null values present
- Tail_Number: 99.08% non-null
- DepTime: 97.39% non-null
- DepDelay: 97.39% non-null
- DepDelayMinutes: 97.39% non-null
- DepDel15: 97.39% non-null
- DepartureDelayGroups: 97.39% non-null
- WheelsOff: 97.33% non-null
- TaxiOut: 97.33% non-null
- ArrTime: 97.31% non-null
- WheelsOn: 97.28% non-null
- TaxiIn: 97.28% non-null
- ActualElapsedTime: 97.10% non-null
- ArrDelay: 97.10% non-null
- ArrDelayMinutes: 97.10% non-null
- ArrDel15: 97.10% non-null
- ArrivalDelayGroups: 97.10% non-null
- AirTime: 97.08% non-null

#### Missing values are likely due to cancelled flights. Check if this is true. How do we want to handle missing data where the flight was cancelled?

- Check if NA values fall under columns where df['Cancelled'] == 1
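
A minimal sketch of that check, in plain Python on invented rows. The helper `null_breakdown` is hypothetical. Note that the arrival columns listed above have slightly more nulls than the departure columns (97.10% vs 97.39% non-null), so cancellation is probably not the only cause; the non-cancelled row with a missing `ArrTime` below stands in for such a case (a diverted flight is one guess, not something verified here).

```python
# Toy rows (invented for illustration).
rows = [
    {"Cancelled": 0, "DepTime": 905, "ArrTime": 1130},
    {"Cancelled": 1, "DepTime": None, "ArrTime": None},
    {"Cancelled": 0, "DepTime": 1410, "ArrTime": None},  # departed but has no arrival time
    {"Cancelled": 1, "DepTime": None, "ArrTime": None},
    {"Cancelled": 0, "DepTime": 700, "ArrTime": 845},
]

def null_breakdown(rows, column):
    """Count nulls in `column`, split by whether the flight was cancelled."""
    nulls = [r for r in rows if r[column] is None]
    cancelled = sum(1 for r in nulls if r["Cancelled"] == 1)
    return {
        "null_total": len(nulls),
        "null_and_cancelled": cancelled,
        "null_not_cancelled": len(nulls) - cancelled,
    }

breakdowns = {c: null_breakdown(rows, c) for c in ("DepTime", "ArrTime")}
for column, result in breakdowns.items():
    print(column, result)
```

If `null_not_cancelled` is zero for a column, cancellation fully explains its missing values; otherwise the remaining rows need a separate look before imputing.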

#### Missing Data Imputation - MissForest

### Removal of Redundant Features

- FlightDate is already parsed out in Year, Quarter, Month, DayofMonth, DayofWeek columns

### Indexing Categorical Variables - StringIndexer
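
To illustrate what the indexing step produces: with its default `frequencyDesc` ordering, StringIndexer gives index 0 to the most frequent label, 1 to the next, and so on. The plain-Python emulation below is only a sketch of that behaviour; `fit_string_index` is a hypothetical helper, the airline codes are an invented sample, and the alphabetical tie-break reflects my understanding of Spark 3 rather than anything confirmed in this notebook.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties: alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

airlines = ["WN", "DL", "WN", "AA", "DL", "WN", "UA"]  # invented sample
index = fit_string_index(airlines)
encoded = [index[a] for a in airlines]

print(index)
print(encoded)
```

The resulting numbers carry no meaningful order, which is generally acceptable for tree-based models such as the Random Forest planned below.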

### Feature Expansion Ideas

- Combine Origin & Dest into one feature (Route)
- Ratio features AirTime/Distance
- Log transformation of skewed numeric features - recommended 'Distance' had a >3 fold increase in skew magnitude out of top 20 skewed features
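
As a rough illustration of the log-transformation idea, the sketch below compares the skewness of a right-skewed list of distances before and after `log1p`. The distances are invented, and `skewness` is a hypothetical helper that computes the population skewness (third central moment divided by the cubed standard deviation).

```python
import math

def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

distances = [120, 180, 250, 300, 340, 410, 520, 760, 1400, 2600]  # invented, in miles
log_distances = [math.log1p(d) for d in distances]

raw_skew = skewness(distances)
log_skew = skewness(log_distances)
print(f"skew before log1p: {raw_skew:.2f}")
print(f"skew after log1p:  {log_skew:.2f}")
```

`log1p` is used instead of `log` so that a zero value would not raise an error.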

In [27]:
spark.catalog.clearCache()

In [33]:

filtered_df = filtered_df.drop("is_popular_route")
filtered_df.printSchema()

root
 |-- route: string (nullable = false)
 |-- Year: integer (nullable = true)
 |-- Quarter: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- FlightDate: date (nullable = true)
 |-- Marketing_Airline_Network: string (nullable = true)
 |-- Operated_or_Branded_Code_Share_Partners: string (nullable = true)
 |-- DOT_ID_Marketing_Airline: integer (nullable = true)
 |-- IATA_Code_Marketing_Airline: string (nullable = true)
 |-- Flight_Number_Marketing_Airline: integer (nullable = true)
 |-- Operating_Airline: string (nullable = true)
 |-- DOT_ID_Operating_Airline: integer (nullable = true)
 |-- IATA_Code_Operating_Airline: string (nullable = true)
 |-- Tail_Number: string (nullable = true)
 |-- Flight_Number_Operating_Airline: integer (nullable = true)
 |-- OriginAirportID: integer (nullable = true)
 |-- OriginAirportSeqID: integer (nullable = true)
 |-- OriginCityMarketID: integer (null

In [38]:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# ─── Add "route" Column ─────────────────────────────────────────────
# Format: "Origin - Dest", replacing nulls with "Unknown"
filtered_df = filtered_df.withColumn(
    "route",
    F.concat_ws(
        " - ",
        F.coalesce(F.col("Origin"), F.lit("Unknown")),
        F.coalesce(F.col("Dest"), F.lit("Unknown"))
    )
)

# ─── Add Average Speed (mph) ─────────────────────────────────────────
# Formula: (Distance / AirTime) * 60, rounded to nearest integer
filtered_df = filtered_df.withColumn(
    "avg_speed_mph",
    F.when(
        F.col("AirTime").isNotNull() & (F.col("AirTime") != 0),
        F.round((F.col("Distance") / F.col("AirTime")) * 60)
    ).otherwise(None)
)

# ─── Add num_flights Using Window Function ───────────────────────────
# Counts how many times each route appears, without using join
route_window = Window.partitionBy("route")
filtered_df = filtered_df.withColumn(
    "num_flights",
    F.count("*").over(route_window)
)

# ─── Monthly Route Count (for EDA only, optional) ────────────────────
monthly_counts = filtered_df.groupBy("route", "Month").count()

# ─── Add route_popularity Bucket ─────────────────────────────────────
# 0 = low (<5,000), 1 = med (5,000–14,999), 2 = high (>=15,000)
filtered_df = filtered_df.withColumn(
    "route_popularity",
    F.when(F.col("num_flights") < 5000, 0)
     .when(F.col("num_flights") < 15000, 1)
     .otherwise(2)
)

# ─── Preview Random Sample of Final Feature Columns ──────────────────
sample_cols = ["route", "avg_speed_mph", "num_flights", "route_popularity"]
filtered_df.select(sample_cols) \
    .orderBy(F.rand(seed=42)) \
    .limit(20) \
    .show(truncate=False)

# ─── DataFrame Shape Summary ─────────────────────────────────────────
num_rows = filtered_df.count()
num_cols = len(filtered_df.columns)
print(f"Shape of the filtered DataFrame: ({num_rows}, {num_cols})")

# ─── Optional: Unpersist if it was cached earlier ────────────────────
filtered_df.unpersist()

+---------+-------------+-----------+----------------+
|route    |avg_speed_mph|num_flights|route_popularity|
+---------+-------------+-----------+----------------+
|JNU - SIT|259.0        |2782       |0               |
|ONT - OAK|NULL         |8590       |1               |
|PHX - BNA|497.0        |6166       |1               |
|LAS - AMA|469.0        |1624       |0               |
|SEA - LAS|430.0        |26350      |2               |
|DFW - AUS|356.0        |14542      |1               |
|BWI - ATL|403.0        |23063      |2               |
|LGA - FLL|394.0        |17765      |2               |
|MSP - MEM|420.0        |3548       |0               |
|SDF - DTW|360.0        |4974       |0               |
|TPA - MEM|423.0        |1070       |0               |
|PDX - BZN|365.0        |1092       |0               |
|GEG - SAN|440.0        |1583       |0               |
|CMH - TPA|452.0        |4128       |0               |
|IAD - ORD|392.0        |7822       |1               |
|RDU - SLC

DataFrame[route: string, Year: int, Quarter: int, Month: int, DayofMonth: int, DayOfWeek: int, FlightDate: date, Marketing_Airline_Network: string, Operated_or_Branded_Code_Share_Partners: string, DOT_ID_Marketing_Airline: int, IATA_Code_Marketing_Airline: string, Flight_Number_Marketing_Airline: int, Operating_Airline: string, DOT_ID_Operating_Airline: int, IATA_Code_Operating_Airline: string, Tail_Number: string, Flight_Number_Operating_Airline: int, OriginAirportID: int, OriginAirportSeqID: int, OriginCityMarketID: int, Origin: string, OriginCityName: string, OriginState: string, OriginStateFips: int, OriginStateName: string, OriginWac: int, DestAirportID: int, DestAirportSeqID: int, DestCityMarketID: int, Dest: string, DestCityName: string, DestState: string, DestStateFips: int, DestStateName: string, DestWac: int, CRSDepTime: int, DepTime: int, DepDelay: double, DepDelayMinutes: double, DepDel15: double, DepartureDelayGroups: int, DepTimeBlk: string, TaxiOut: double, WheelsOff: in
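
The two derived features in the cell above can be checked against the sample output with a plain-Python sketch. The function names mirror the column names but are hypothetical helpers, not part of the notebook; the logic follows the `F.when` expressions shown.

```python
import math

def avg_speed_mph(distance_miles, air_time_minutes):
    """Distance / AirTime * 60, or None when an input is missing or AirTime is zero."""
    if distance_miles is None or air_time_minutes is None or air_time_minutes == 0:
        return None
    # floor(x + 0.5) rounds halves up for positive x, like Spark's F.round;
    # Python's built-in round() would round halves to the nearest even number.
    return float(math.floor(distance_miles / air_time_minutes * 60 + 0.5))

def route_popularity(num_flights):
    """0 = fewer than 5,000 flights, 1 = 5,000 to 14,999, 2 = 15,000 or more."""
    if num_flights < 5000:
        return 0
    if num_flights < 15000:
        return 1
    return 2

# Flight counts taken from the sample output above.
for route, n in [("SDF - DTW", 4974), ("DFW - AUS", 14542), ("SEA - LAS", 26350)]:
    print(route, n, route_popularity(n))
```

The boundaries are worth noting: a route with exactly 5,000 flights lands in bucket 1 and one with exactly 15,000 lands in bucket 2.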

In [53]:
# https://plotly.com/python/lines-on-maps/
airport_cols = ['AirportID', 'Name', 'City', 'Country', 'IATA', 'ICAO',
                'Latitude', 'Longitude', 'Altitude', 'Timezone', 'DST',
                'Tz database time zone', 'Type', 'Source']

airport_df = pd.read_csv(
    "https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat",
    header=None, names=airport_cols
)

airport_coords = airport_df[['IATA', 'Latitude', 'Longitude']].dropna()
airport_coords = airport_coords[airport_coords['IATA'].str.len() == 3]  # Keep valid codes

top_routes_pd = (
    filtered_df.groupBy("route", "Origin", "Dest")
    .agg(F.count("*").alias("num_flights"))
    .orderBy(F.desc("num_flights"))
    .limit(1000)
    .toPandas()
)

# Join lat/lon for Origin and Dest
top_routes_pd = top_routes_pd.merge(airport_coords, left_on="Origin", right_on="IATA") \
                             .rename(columns={"Latitude": "Origin_Lat", "Longitude": "Origin_Lon"}) \
                             .drop("IATA", axis=1)

top_routes_pd = top_routes_pd.merge(airport_coords, left_on="Dest", right_on="IATA") \
                             .rename(columns={"Latitude": "Dest_Lat", "Longitude": "Dest_Lon"}) \
                             .drop("IATA", axis=1)

In [75]:
import plotly.graph_objects as go

fig = go.Figure()

# ─── Add Airport Markers ─────────────────────────────────────────────
# Collect unique airport locations from both Origin and Dest
airport_df = pd.concat([
    top_routes_pd[['Origin', 'Origin_Lat', 'Origin_Lon']].rename(
        columns={"Origin": "IATA", "Origin_Lat": "lat", "Origin_Lon": "lon"}),
    top_routes_pd[['Dest', 'Dest_Lat', 'Dest_Lon']].rename(
        columns={"Dest": "IATA", "Dest_Lat": "lat", "Dest_Lon": "lon"})
]).drop_duplicates(subset=["IATA"])

fig.add_trace(go.Scattergeo(
    locationmode='USA-states',
    lon=airport_df["lon"],
    lat=airport_df["lat"],
    hoverinfo='text',
    text=airport_df["IATA"],
    mode='markers',
    marker=dict(
        size=4,
        color='blue',
        line=dict(width=1, color='rgba(68, 68, 68, 0)')
    ),
    name="Airports"
))

# ─── Add Flight Routes ───────────────────────────────────────────────
for _, row in top_routes_pd.iterrows():
    fig.add_trace(go.Scattergeo(
        locationmode='USA-states',
        lon=[row["Origin_Lon"], row["Dest_Lon"]],
        lat=[row["Origin_Lat"], row["Dest_Lat"]],
        mode='lines',
        line=dict(width=1, color='red'),
        opacity=row["num_flights"] / top_routes_pd["num_flights"].max(),
        name=f"{row['Origin']} → {row['Dest']}"
    ))

# ─── Layout & Display ────────────────────────────────────────────────
fig.update_layout(
    title_text='Top 1000 Flight Routes (Interactive Map)',
    showlegend=True,
    geo=dict(
        scope='north america',
        projection_type='azimuthal equal area',
        showland=True,
        landcolor='rgb(243, 243, 243)',
        countrycolor='rgb(204, 204, 204)',
    ),
    height=500
)

fig.show()

In [None]:
# import plotly.graph_objects as go
# import plotly.io as pio

# pio.renderers.default = 'notebook'

# fig = go.Figure()

# for _, row in top_routes_pd.iterrows():
#     fig.add_trace(go.Scattergeo(
#         locationmode='ISO-3',
#         lon=[row["Origin_Lon"], row["Dest_Lon"]],
#         lat=[row["Origin_Lat"], row["Dest_Lat"]],
#         mode='lines',
#         line=dict(width=min(0.01 + row['num_flights'] / top_routes_pd['num_flights'].max(), 1), color='blue'),
#         opacity=min(0.01 + row['num_flights'] / top_routes_pd['num_flights'].max(), 1),
#         hoverinfo='text',
#         text=f"{row['Origin']} → {row['Dest']} ({row['num_flights']} flights)",
#         name=f"{row['Origin']} → {row['Dest']}"  # ✅ this makes it show up in legend
#     ))

# fig.update_geos(
#     scope="north america",  # ✅ focus only on the USA
#     showland=True,
#     landcolor="rgb(243, 243, 243)",
#     countrycolor="rgb(204, 204, 204)"
# )

# fig.update_layout(
#     title_text="Top 1000 Most Frequent Flight Routes",
#     showlegend=True,  # ✅ Ensure legend is visible
#     height=500,
#     margin={"r":0,"t":30,"l":0,"b":0}
# )

# fig.show()

## 2. Train First Model: Random Forest Classifier + Feature Selection

- RF cannot handle missing values
- Values must be numeric (categorical features must be indexed)
- Define Label column (Classification df['ArrivalDelayGroups'] or Regression df['ArrDelay'])

### Split dataset into train, test, validation set (x & y)
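
A minimal sketch of a three-way split, in plain Python. The 70/15/15 weights and the seed are assumptions chosen for illustration, and `three_way_split` is a hypothetical helper. On the Spark DataFrame itself, `DataFrame.randomSplit(weights, seed)` serves the same purpose, though it returns splits of approximate rather than exact size.

```python
import random

def three_way_split(rows, weights=(0.7, 0.15, 0.15), seed=42):
    """Shuffle rows with a fixed seed and cut them into train/validation/test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * weights[0])
    n_val = round(len(shuffled) * weights[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test

rows = list(range(1000))  # stand-in for flight records
train, val, test = three_way_split(rows)
print(len(train), len(val), len(test))
```

Fixing the seed keeps the split reproducible, so validation and test results stay comparable between runs.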

### Optimize Random Forest Classifier Using Validation Dataset

- numTrees
- maxDepth
- minInstancesPerNode
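
One way to organise the search over these three hyperparameters is a small grid, sketched below in plain Python. The candidate values are invented, and `expand_grid`, `pick_best`, and `toy_validation_score` are hypothetical helpers. The scoring function is only a placeholder for "train on the training set, evaluate on the validation set".

```python
from itertools import product

# Candidate values are assumptions for illustration, not tuned choices.
param_grid = {
    "numTrees": [20, 50],
    "maxDepth": [5, 10],
    "minInstancesPerNode": [1, 10],
}

def expand_grid(grid):
    """Yield one dict per combination of hyperparameter values."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

def pick_best(candidates, score_fn):
    """Return the candidate with the highest validation score."""
    return max(candidates, key=score_fn)

def toy_validation_score(params):
    # Placeholder: in practice, fit a model with `params` on the training set
    # and return a metric computed on the validation set.
    return -abs(params["maxDepth"] - 10) - abs(params["numTrees"] - 50) / 100

candidates = list(expand_grid(param_grid))
best = pick_best(candidates, toy_validation_score)
print(len(candidates), "combinations; best:", best)
```

The grid grows multiplicatively with each added value, so on a dataset of this size it may be worth tuning on a sample first.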

### Train RF Classifier & Apply to Test Dataset