A full documentation of the data we used, relavent links and relavent schema

Flights were sourced from here: https://www.kaggle.com/datasets/usdot/flight-delays?select=flights.csv

Weather was sourced from here: https://asmith.ucdavis.edu/data/prism-weather

AirportLocations.csv was sourced from: https://geodata.bts.gov/datasets/usdot::aviation-facilities/about

For weather use the settings: Temporal unit should be daily, use county as spatial unit, start and end year are both 2015, months go from 1 to 12, states are all states, choose variables are tmin, tmax, tavg, ppt, dday_a5C, dday_b15C



In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import polars as pl
import kagglehub
import seaborn as sns
import gc

State regions were determined by the following image (kind of arbitrary). We decided to set AK to north and HI to pacific, this is here for the purpose of hypothesis testing later on

![Alt Text](region.png)

In [2]:
state_to_region_dict = {
    'WA': 'northwest',
    'OR': 'northwest',
    'ID': 'northwest',
    'MT': 'northwest',
    'WY': 'northwest',
    'CA': 'west',
    'NV': 'west',
    'UT': 'southwest',
    'AZ': 'southwest',
    'CO': 'southwest',
    'NM': 'southwest',
    'TX': 'southwest',
    'OK': 'southwest',
    'ND': 'midwest',
    'SD': 'midwest',
    'NE': 'midwest',
    'KS': 'midwest',
    'MN': 'midwest',
    'IA': 'midwest',
    'MO': 'midwest',
    'WI': 'midwest',
    'IL': 'midwest',
    'MI': 'midwest',
    'KY': 'midwest',
    'IN': 'midwest',
    'OH': 'midwest',
    'AR': 'southeast',
    'LA': 'southeast',
    'MS': 'southeast',
    'AL': 'southeast',
    'GA': 'southeast',
    'FL': 'southeast',
    'TN': 'southeast',
    'NC': 'southeast',
    'SC': 'southeast',
    'VA': 'midatlantic',
    'WV': 'midatlantic',
    'MD': 'midatlantic',
    'DE': 'midatlantic',
    'DC': 'midatlantic',
    'NJ': 'midatlantic',
    'PA': 'midatlantic',
    'NY': 'midatlantic',
    'CT': 'newengland',
    'RI': 'newengland',
    'MA': 'newengland',
    'NH': 'newengland',
    'VT': 'newengland',
    'ME': 'newengland',
    'AK': 'north',
    'HI': 'pacific'
}

In [3]:
flights_df = pl.read_csv('flights.csv')
airport_loc_df = pl.read_csv('AirportLocations.csv')
airlines_df = pl.read_csv('airlines.csv')
weather_df = pl.read_csv('weather.csv')

ER diagram:

This entity diagram is based on the 4 datasets we are using. Namely, the entities below correspond to the dataframes as follows: flight (flight_df), airport (airport_loc_df), airline (airlines_df), weather (weather_df). The relationships that exist are as follows:
- A flight departs from/arrives at exactly 1 aiport (respectively). An airport can be the departure location/arrival location of multiple flights.
- A flight is operated by exaclty 1 airline. An airline may operate multiple flights.
- An airport experiences various weather conditions (on different days). A specific weather condition on a specific day is experienced by exactly 1 airport.

For this last relationship, we assume that all (location, date) pairs are different, and that there is only 1 airport located in each county. That is, weather conditions at a specific location (and hence corresponding to a specific airport) are unique.

Note: Some attributes are left off of the flight entity (AIR_SYSTEM_DELAY, SECURITY_DELAY, AIRLINE_DELAY, LATE_AIRCRAFT_DELAY, WEATHER_DELAY) as they were not used going forward and the "CANCELLATION_REASON" column essentially had the same information.

![Alt Text](project_er_diagram.png)

In [4]:
#delays_df = flights_df.filter(pl.col("DEPARTURE_DELAY") > 0)
#ontime_df = flights_df.filter(pl.col("DEPARTURE_DELAY") <= 0)

Joining the County information so that we can compare weather

In [5]:
airport_loc_df = airport_loc_df.select(["ARPT_ID", "COUNTY_NAME", "STATE_CODE"])
flights_df = flights_df.join(
    airport_loc_df,
    left_on="ORIGIN_AIRPORT",
    right_on="ARPT_ID",
    how="left"
)
#delays_df = flights_df.filter(pl.col("DEPARTURE_DELAY") > 0)
#ontime_df = flights_df.filter(pl.col("DEPARTURE_DELAY") <= 0)
flights_df = flights_df.with_columns(
    pl.datetime(
        year=pl.col("YEAR"),
        month=pl.col("MONTH"),
        day=pl.col("DAY")
    ).cast(pl.Date).alias("date")
)
flights_df

YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,COUNTY_NAME,STATE_CODE,date
i64,i64,i64,i64,str,i64,str,str,str,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,str,i64,i64,i64,i64,i64,str,str,date
2015,1,1,4,"""AS""",98,"""N407AS""","""ANC""","""SEA""",5,2354,-11,21,15,205,194,169,1448,404,4,430,408,-22,0,0,,,,,,,"""ANCHORAGE""","""AK""",2015-01-01
2015,1,1,4,"""AA""",2336,"""N3KUAA""","""LAX""","""PBI""",10,2,-8,12,14,280,279,263,2330,737,4,750,741,-9,0,0,,,,,,,"""LOS ANGELES""","""CA""",2015-01-01
2015,1,1,4,"""US""",840,"""N171US""","""SFO""","""CLT""",20,18,-2,16,34,286,293,266,2296,800,11,806,811,5,0,0,,,,,,,"""SAN MATEO""","""CA""",2015-01-01
2015,1,1,4,"""AA""",258,"""N3HYAA""","""LAX""","""MIA""",20,15,-5,15,30,285,281,258,2342,748,8,805,756,-9,0,0,,,,,,,"""LOS ANGELES""","""CA""",2015-01-01
2015,1,1,4,"""AS""",135,"""N527AS""","""SEA""","""ANC""",25,24,-1,11,35,235,215,199,1448,254,5,320,259,-21,0,0,,,,,,,"""KING""","""WA""",2015-01-01
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2015,12,31,4,"""B6""",688,"""N657JB""","""LAX""","""BOS""",2359,2355,-4,22,17,320,298,272,2611,749,4,819,753,-26,0,0,,,,,,,"""LOS ANGELES""","""CA""",2015-12-31
2015,12,31,4,"""B6""",745,"""N828JB""","""JFK""","""PSE""",2359,2355,-4,17,12,227,215,195,1617,427,3,446,430,-16,0,0,,,,,,,"""QUEENS""","""NY""",2015-12-31
2015,12,31,4,"""B6""",1503,"""N913JB""","""JFK""","""SJU""",2359,2350,-9,17,7,221,222,197,1598,424,8,440,432,-8,0,0,,,,,,,"""QUEENS""","""NY""",2015-12-31
2015,12,31,4,"""B6""",333,"""N527JB""","""MCO""","""SJU""",2359,2353,-6,10,3,161,157,144,1189,327,3,340,330,-10,0,0,,,,,,,"""ORANGE""","""FL""",2015-12-31


Don't run the below cell more than once

In [6]:
weather_df = weather_df.with_columns(
    pl.col("date").cast(pl.Utf8).str.strptime(pl.Date, "%Y%m%d").alias("date")
)
weather_df = weather_df.with_columns(
    pl.col("county_name").str.to_lowercase().alias("county_name")
)
weather_df

st_abb,st_code,county_name,fips,date,stability,tmin,tmax,tavg,ppt,dday_a5C,dday_b15C
str,i64,str,i64,date,str,f64,f64,f64,f64,f64,f64
"""AL""",1,"""autauga""",1001,2015-01-01,"""stable""",-0.835,10.961,5.063,0.059,1.909,9.937
"""AL""",1,"""autauga""",1001,2015-01-02,"""stable""",0.276,13.216,6.746,3.863,3.008,8.254
"""AL""",1,"""autauga""",1001,2015-01-03,"""stable""",8.511,12.552,10.531,14.217,5.532,4.469
"""AL""",1,"""autauga""",1001,2015-01-04,"""stable""",12.328,20.585,16.457,48.919,11.456,0.668
"""AL""",1,"""autauga""",1001,2015-01-05,"""stable""",2.642,15.865,9.254,0.0,4.684,5.841
…,…,…,…,…,…,…,…,…,…,…,…
"""WY""",56,"""weston""",56045,2015-12-27,"""stable""",-19.242,-6.704,-12.973,0.0,0.0,27.973
"""WY""",56,"""weston""",56045,2015-12-28,"""stable""",-18.188,-2.366,-10.277,0.0,0.0,25.277
"""WY""",56,"""weston""",56045,2015-12-29,"""stable""",-20.651,-3.123,-11.887,0.0,0.0,26.887
"""WY""",56,"""weston""",56045,2015-12-30,"""stable""",-18.454,-8.474,-13.464,0.464,0.0,28.464


Cross Referenced Data from here in order to ensure that the temperatures were correctly aligned: https://www.timeanddate.com/weather/usa/new-york/historic?month=12&year=2015

In [7]:
flights_df = flights_df.with_columns(
    pl.col("COUNTY_NAME").str.to_lowercase().alias("COUNTY_NAME")
)

result_df = flights_df.join(
    weather_df,
    left_on=["COUNTY_NAME", "date", "STATE_CODE"],
    right_on=["county_name", "date", "st_abb"],
    how="left"
)

flights_df

YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,COUNTY_NAME,STATE_CODE,date
i64,i64,i64,i64,str,i64,str,str,str,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,str,i64,i64,i64,i64,i64,str,str,date
2015,1,1,4,"""AS""",98,"""N407AS""","""ANC""","""SEA""",5,2354,-11,21,15,205,194,169,1448,404,4,430,408,-22,0,0,,,,,,,"""anchorage""","""AK""",2015-01-01
2015,1,1,4,"""AA""",2336,"""N3KUAA""","""LAX""","""PBI""",10,2,-8,12,14,280,279,263,2330,737,4,750,741,-9,0,0,,,,,,,"""los angeles""","""CA""",2015-01-01
2015,1,1,4,"""US""",840,"""N171US""","""SFO""","""CLT""",20,18,-2,16,34,286,293,266,2296,800,11,806,811,5,0,0,,,,,,,"""san mateo""","""CA""",2015-01-01
2015,1,1,4,"""AA""",258,"""N3HYAA""","""LAX""","""MIA""",20,15,-5,15,30,285,281,258,2342,748,8,805,756,-9,0,0,,,,,,,"""los angeles""","""CA""",2015-01-01
2015,1,1,4,"""AS""",135,"""N527AS""","""SEA""","""ANC""",25,24,-1,11,35,235,215,199,1448,254,5,320,259,-21,0,0,,,,,,,"""king""","""WA""",2015-01-01
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2015,12,31,4,"""B6""",688,"""N657JB""","""LAX""","""BOS""",2359,2355,-4,22,17,320,298,272,2611,749,4,819,753,-26,0,0,,,,,,,"""los angeles""","""CA""",2015-12-31
2015,12,31,4,"""B6""",745,"""N828JB""","""JFK""","""PSE""",2359,2355,-4,17,12,227,215,195,1617,427,3,446,430,-16,0,0,,,,,,,"""queens""","""NY""",2015-12-31
2015,12,31,4,"""B6""",1503,"""N913JB""","""JFK""","""SJU""",2359,2350,-9,17,7,221,222,197,1598,424,8,440,432,-8,0,0,,,,,,,"""queens""","""NY""",2015-12-31
2015,12,31,4,"""B6""",333,"""N527JB""","""MCO""","""SJU""",2359,2353,-6,10,3,161,157,144,1189,327,3,340,330,-10,0,0,,,,,,,"""orange""","""FL""",2015-12-31


In [8]:
'''
result_df2 = flights_df.join(
    weather_df,
    left_on=["COUNTY_NAME", "date"],
    right_on=["county_name", "date"],
    how="inner"
)

result_df2
'''

'\nresult_df2 = flights_df.join(\n    weather_df,\n    left_on=["COUNTY_NAME", "date"],\n    right_on=["county_name", "date"],\n    how="inner"\n)\n\nresult_df2\n'

In [9]:
result_important_df = result_df.filter(pl.col('tavg').is_not_null())
result_important_df = result_important_df.with_columns(
    (pl.col('DEPARTURE_DELAY') + pl.col('ARRIVAL_DELAY')).alias('TOTAL_DELAY')
)
result_important_df
gc.collect()

30

In [10]:
del flights_df

In [11]:
del result_df

RUN TO HERE ELSE COLAB CRASHES!

We will now split in to test and train sets and proceed with EDA on the train set

In [12]:
from sklearn.model_selection import train_test_split

target = ['DEPARTURE_DELAY']

X_train, X_test = train_test_split(result_important_df, test_size=0.2, random_state=0)
print(X_train.shape)
print(X_test.shape)
gc.collect()

KeyboardInterrupt: 

Let us now get some understanding of what our data looks like

In [None]:
basic_stats = X_train.describe()
info = X_train.schema
columns_of_interest = ['SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'ELAPSED_TIME', 'AIR_TIME', 'WHEELS_ON', 'DISTANCE', 'ARRIVAL_DELAY', 'DIVERTED', 'CANCELLED', 'tmin', 'tmax', 'tavg', 'ppt', 'dday_a5C', 'dday_b15C']
X_train_numerical = X_train.select(columns_of_interest)
df = X_train_numerical.to_pandas()
corr_matrix = df.corr()
corr_matrix

In [None]:
'''
VISUAL 1: Naturally as we are interested in DEPARTURE and ARRIVAL delay when choosing features, we should choose features that seem to be highly correlated with these delays
We will use 2 metrics that I think do a good job of measuring correlation, 1 is the standard pearson correlation, then to compare categorical values we will use the CramerV correlation metric
'''
plt.figure(figsize=(12, 10))

sns.heatmap(
    corr_matrix,
    annot=True,
    fmt=".2f",
    cmap='coolwarm',
    square=True,
    linewidths=.5,
    cbar_kws={"shrink": .5}
)

plt.title('Correlation Matrix Heatmap', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()


As we can see there are very weak correlations, but what if we filter for precipitation > 1?

In [None]:
df_ppt_filtered = df[df['ppt'] > 5]
corr_matrix = df_ppt_filtered.corr()
plt.figure(figsize=(12, 10))

sns.heatmap(
    corr_matrix,
    annot=True,
    fmt=".2f",
    cmap='coolwarm',
    square=True,
    linewidths=.5,
    cbar_kws={"shrink": .5}
)

plt.title('Correlation Matrix Heatmap', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
print(df_ppt_filtered.shape)

In [None]:
origin_counts = X_train.group_by("ORIGIN_AIRPORT").count()
origin_counts_pd = origin_counts.to_pandas()
origin_counts_pd = origin_counts_pd.sort_values(by="count", ascending=False)
top_30 = origin_counts_pd.head(30)

# Plot the bar chart
plt.figure(figsize=(12, 6))
plt.bar(top_30["ORIGIN_AIRPORT"], top_30["count"], color="skyblue")
plt.xlabel("Origin Airport", fontsize=14)
plt.ylabel("Flight Count", fontsize=14)
plt.title("Flight Count by Origin Airport", fontsize=16)
plt.xticks(rotation=45, fontsize=12)
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
airline_counts = X_train.group_by("AIRLINE").count()
airline_counts = airline_counts.to_pandas()
airline_counts = airline_counts.sort_values(by="count", ascending=False)

plt.figure(figsize=(12, 6))
plt.bar(airline_counts["AIRLINE"], airline_counts["count"], color="skyblue")
plt.xlabel("Airlines Represented", fontsize=14)
plt.ylabel("Flight Count", fontsize=14)
plt.title("Airline Count", fontsize=16)
plt.xticks(rotation=45, fontsize=12)
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.tight_layout()
plt.show()

Now for some feature importance! We have a few modeling ideas

1) Given delay, weather, location, and other features in the dataframe, can we accurately predict which airline caused the delay

2) Conversely given an airline, weather, origin, destination, and other features, can we accurately predict the total delay (arrival + departure delay)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

target = ['TOTAL_DELAY']
y_train = X_train.select(target)
X_train_total_delay = X_train.drop(target)

non_numeric_cols = [col for col in X_train_total_delay.columns if X_train_total_delay[col].dtype == pl.Utf8]
encoded_non_numeric = X_train_total_delay
for col in non_numeric_cols:
    le = LabelEncoder()
    encoded_non_numeric = encoded_non_numeric.with_columns(
        pl.Series(name=col, values=le.fit_transform(encoded_non_numeric[col].to_list()))
    )
print(non_numeric_cols)
X_train_total_delay = encoded_non_numeric
X_train_total_delay = pl.concat([X_train_total_delay, y_train], how="horizontal")
X_train_total_delay_cleaned = X_train_total_delay.filter(
    pl.col("TOTAL_DELAY").is_not_null()
)

X_train_total_delay_cleaned = X_train_total_delay_cleaned.to_pandas()
X_train_total_delay_cleaned = X_train_total_delay_cleaned.dropna()
X_train_total_delay_cleaned = pl.DataFrame(X_train_total_delay_cleaned)
X_train_total_delay_cleaned = X_train_total_delay_cleaned.drop(['DEPARTURE_DELAY', 'ARRIVAL_DELAY'])
X_train_total_delay_cleaned

In [None]:
target = ['TOTAL_DELAY']
y_train = X_train_total_delay_cleaned.select(target)
X_train_total_delay = X_train_total_delay_cleaned.drop(target)
X_train_total_delay = X_train_total_delay.with_columns(
    [pl.col(column).cast(pl.Int32) for column in X_train_total_delay.columns if X_train_total_delay[column].dtype == pl.Int64]
)
X_train_total_delay = X_train_total_delay.with_columns(
    [pl.col(column).cast(pl.Float32) for column in X_train_total_delay.columns if X_train_total_delay[column].dtype == pl.Float64]
)


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


scaler = StandardScaler()
X_standardized_np = scaler.fit_transform(X_train_total_delay.to_numpy())


X_standardized = pl.DataFrame(
    X_standardized_np,
    schema=X_train_total_delay.columns
)

X_standardized = X_standardized.drop_nulls()
X_standardized_pd = X_standardized.to_pandas()

lr = LinearRegression()
lr.fit(X_standardized_pd, y_train)

coefficients = lr.coef_.ravel() if len(lr.coef_.shape) > 1 else lr.coef_

feature_importances = pd.DataFrame({
    "Feature": X_standardized.columns,
    "Importance": np.abs(coefficients)
}).sort_values(by="Importance", ascending=False)

plt.figure(figsize=(10, 6))
plt.barh(feature_importances["Feature"], feature_importances["Importance"])
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Feature Importance using Linear Regression")
plt.gca().invert_yaxis()
plt.show()

Now it is time to move into some hypothesis testing

We will be looking at the following hypotheses all tested with $\alpha = 0.05$ for significance:

1) When precipitation is greater than 1 is a flight more likely to be delayed?

$H_0$: When precipitation is greater than 1 a flight has the same liklihood of being delayed

2) Are flights in western, southwestern, and southeastern states less likely to be delayed than states in the midwest, new england, mid-atlantic, and northwest in winter months?

$H_0$: western, southwestern, and southeastern states are delayed the same amount as flights in the midwest, new england, mid-atlantic, and the northwest in winter months.

3) Some of us have had some bad experience with certain airlines (looking at you AA), is this the truth, are some airlines significantly more likely to have delays than others?

$H_0$: all airlines experience the same average delay

In [None]:
# Don't run the following two lines more than once very bad for RAM
precip_df = result_important_df.filter(pl.col("ppt") > 1)
noprecip_df = result_important_df.filter(pl.col("ppt") <= 1)

In [None]:
'''
HYPOTHESIS 1
'''
# TOTAL: 4,933,302 rows
# (a) Are we are testing a population proportion - looking at the proportion of flights that are delayed (as binary)
#     We are testing the difference of population proportions
#     H_0: p_precip - p_noprecip = 0, H_1: p_precip - p_noprecip > 0
# (b) Or are we testing like based on the actual number of minutes - the mean delay time - which would be inference on the mean

# (a)
# Create a new column "IS_DELAYED" that's 1 if delayed and 0 if on time
precip_df = precip_df.with_columns(
    (pl.col('DEPARTURE_DELAY') > 0).alias('IS_DELAYED').cast(pl.Int64)
)
noprecip_df = noprecip_df.with_columns(
    (pl.col('DEPARTURE_DELAY') > 0).alias('IS_DELAYED').cast(pl.Int64)
)
delay_precip_proportion = precip_df.select(pl.sum('IS_DELAYED')) / len(precip_df)
delay_noprecip_proportion = noprecip_df.select(pl.sum('IS_DELAYED')) / len(noprecip_df)

# Calculate original test statistic (0.420968 - 0.354947 = 0.066021)
delay_proportion_diff = delay_precip_proportion[0, 0] - delay_noprecip_proportion[0, 0]

# Simulate the null world - same proportion of flights being delayed regardless of precipitation, 10,000,000 trials each
simulated_delays_precip = np.random.binomial(len(precip_df), 0.5, 10_000_000) / len(precip_df)
simulated_delays_noprecip = np.random.binomial(len(noprecip_df), 0.5, 10_000_000) / len(noprecip_df)
simulated_precip_proportion = sum(simulated_delays_precip) / len(simulated_delays_precip)
simulated_noprecip_proportion = sum(simulated_delays_noprecip) / len(simulated_delays_noprecip)
# simulated values: (0.500 - 0.499 = -3.615e-08)
simulated_proportion_diff = simulated_precip_proportion - simulated_noprecip_proportion

# Calculate the p-value (p = 0)
simulated_p_value = sum(x >= delay_proportion_diff for x in (simulated_delays_precip - simulated_delays_noprecip)) / len(simulated_delays_precip)

Since p = 0 < 0.05, we reject the null hypothesis the the proportion of delayed flights on days with precipitation > 1 is the same as when precipitation <= 1.

In [None]:
# Plot the distribution of simulated difference in proportions of delays for precipitation > 1 vs. <= 1
_ = plt.hist(simulated_delays_precip - simulated_delays_noprecip, bins = 50)
plt.title("Distribution of Simulated Differences in Proportions of Delays for Precipitation > 1 vs. <= 1")
plt.xlabel("Difference in Proportions")
plt.ylabel("Frequency")
round_dpd = round(delay_proportion_diff, 4)
plt.axvline(round_dpd, color = "red", label = "Observed Difference in Proportions: {}".format(round_dpd))
plt.legend()

In [None]:
'''
HYPOTHESIS 2:
Are flights in western, southwestern, and southeastern states less likely to be delayed than states in the midwest, new england, mid-atlantic, and northwest in winter months?
H_0: western, southwestern, and southeastern states are delayed the same amount as flights in the midwest, new england, and the northwest in winter months.
H_0: p_north - p_south = 0, H_1: p_north - p_south > 0
'''
# Winter months: December - February (12, 1, 2), need to use state_to_region_dict to get region classifications
# Convert state_to_region_dict to a dataframe for joining purposes
state_to_region_df = pl.from_dict(state_to_region_dict)
states = pl.Series(state_to_region_df.columns)
state_to_region_df = state_to_region_df.transpose()
state_to_region_df = state_to_region_df.insert_column(1, states)
state_to_region_df.columns = ['region', 'STATE_CODE']
# Join result_important_df with this to add a new column with the regions
result_important_df = result_important_df.join(state_to_region_df, on = 'STATE_CODE', how = 'left')

In [None]:
# Compute smaller dfs: south = western + southerwestern + southeast, north = midwest + new england + mid-atlantic + northwest
# Then filter for winter months
south_df = result_important_df.filter((pl.col("region") == "western") | (pl.col("region") == "southwest") | (pl.col("region") == "southeast"))
north_df = result_important_df.filter((pl.col("region") == "midwest") | (pl.col("region") == "newengland") |
 (pl.col("region") == "midatlantic") | (pl.col("region") == "northwest"))
south_df = south_df.filter((pl.col("MONTH") == 12) | (pl.col("MONTH") == 1) | (pl.col("MONTH") == 2))
north_df = north_df.filter((pl.col("MONTH") == 12) | (pl.col("MONTH") == 1) | (pl.col("MONTH") == 2))

In [None]:
# Create a new column "IS_DELAYED" that's 1 if delayed and 0 if on time
south_df = south_df.with_columns(
    (pl.col('DEPARTURE_DELAY') > 0).alias('IS_DELAYED').cast(pl.Int64)
)
north_df = north_df.with_columns(
    (pl.col('DEPARTURE_DELAY') > 0).alias('IS_DELAYED').cast(pl.Int64)
)
delay_south_proportion = south_df.select(pl.sum('IS_DELAYED')) / len(south_df)
delay_north_proportion = north_df.select(pl.sum('IS_DELAYED')) / len(north_df)

# Calculate original test statistic (0.399198 - 0.386936 = 0.012262)
delay_proportion_diff1 = delay_north_proportion[0, 0] - delay_south_proportion[0, 0]

# Simulate the null world - same proportion of flights being delayed regardless of north/south, 10,000,000 trials each
simulated_delays_north = np.random.binomial(len(north_df), 0.5, 10_000_000) / len(north_df)
simulated_delays_south = np.random.binomial(len(south_df), 0.5, 10_000_000) / len(south_df)
simulated_north_proportion = sum(simulated_delays_north) / len(simulated_delays_north)
simulated_south_proportion = sum(simulated_delays_south) / len(simulated_delays_south)
# simulated values: (0.500 - 0.500 = 0)
simulated_proportion_diff1 = simulated_north_proportion - simulated_south_proportion

# Calculate the p-value (p = 0)
simulated_p_value1 = sum(x >= delay_proportion_diff1 for x in (simulated_delays_north - simulated_delays_south)) / len(simulated_delays_north)

Since p = 0 < 0.05, we reject the hypothesis that the proportion of delayed flights during winter is the same in northern and southern states

In [None]:
# Plot the distribution of simulated difference in proportions of delays for north vs. south states
_ = plt.hist(simulated_delays_north - simulated_delays_south, bins = 50)
plt.title("Distribution of Simulated Differences in Proportions of Delays for Northern - Southern States")
plt.xlabel("Difference in Proportions")
plt.ylabel("Frequency")
round_dpd = round(delay_proportion_diff1, 4)
plt.axvline(round_dpd, color = "red", label = "Observed Difference in Proportions: {}".format(round_dpd))
plt.legend()

In [None]:
'''
HYPOTHESIS 3
Are some airlines significantly more likely to have delays than others? (same format hypotheses for other airlines)
$H_0$: all airlines experience the same average delay
H_0: American Airlines has same average delay as overall airlines average delay (delay_AA - delay_avg = 0), H_1: delay_AA - delay_avg < 0

In this data set American and Delta Airlines has lower than average delay, Southwest has higher than average delay.
Though these are all not statistically significant.
'''
american = result_important_df.filter(pl.col('AIRLINE') == 'AA')
southwest = result_important_df.filter(pl.col('AIRLINE') == 'WN')
delta = result_important_df.filter(pl.col('AIRLINE') == 'DL')

In [None]:
# Calculate test statistics - average delay for American Airlines
aa_delay = american.select(pl.mean('DEPARTURE_DELAY'))[0, 0]

# Calculate mean (9.9388) and stdev (37.4972) for delay across all airlines
mean_delay = result_important_df.select(pl.mean('DEPARTURE_DELAY'))[0, 0]
stdev_delay = result_important_df.select(pl.std('DEPARTURE_DELAY'))[0, 0]

# Simulate the null world where American Airlines has same amount of delay, 10,000,000 trials
# I don't know if we can assume a Normal distribution but ok
simulated_delays = np.random.normal(mean_delay, stdev_delay, 10_000_000)
simulated_aa_delay = sum(simulated_delays) / len(simulated_delays)

# Calculate the p-value (p = 0.49077) this is > 0.05
simulated_p_value_aa = sum(x <= aa_delay for x in simulated_delays) / len(simulated_delays)
simulated_p_value_aa

Repeating the same for Southwest and Delta airlines, the other top airlines in this dataset

In [None]:
# Southwest Airlines
wn_delay = southwest.select(pl.mean('DEPARTURE_DELAY'))[0, 0]

# Calculate the p-value (p = 0.51201) this is > 0.05
simulated_p_value_wn = sum(x <= wn_delay for x in simulated_delays) / len(simulated_delays)
simulated_p_value_wn

In [None]:
# Delta Airlines
dl_delay = delta.select(pl.mean('DEPARTURE_DELAY'))[0, 0]

# Calculate the p-value (p = 0.4906998) this is > 0.05
simulated_p_value_dl = sum(x <= dl_delay for x in simulated_delays) / len(simulated_delays)
simulated_p_value_dl

With p-values of 0.49055 > 0.05, 0.51201 > 0.05, and 0.47657 > 0.05 we fail to reject H_0 that average delay for American Airlines, Southwest Airlines, and Delta Airlines are each significantly lower than average delay for all airlines. These are separate hypothesis tests, just writing them all.

In [None]:
# Plot the distribution of simulated delay times for American Airlines
_ = plt.hist(simulated_delays, bins = 50)
plt.title("Distribution of Simulated Delays for American Airlines")
plt.xlabel("Number of minutes delayed")
plt.ylabel("Frequency")

round_aa = round(aa_delay, 4)
plt.axvline(round_aa, color = "red", label = "Observed American Delay Amount: {}".format(round_aa))

round_wn = round(wn_delay, 4)
plt.axvline(round_wn, color = "green", label = "Observed Southwest Delay Amount: {}".format(round_wn))

round_dl = round(dl_delay, 4)
plt.axvline(round_dl, color = "blue", label = "Observed Delta Delay Amount: {}".format(round_dl))

plt.legend()

MODELING!

We want to predict whether or not a plane will be delayed, given information that can reasoably be attained a few hours before the flight. We first create a column IS_DELAYED for whether a flight has been delayed or not.

In [12]:
target = "DEPARTURE_DELAY"
print(result_important_df.filter(result_important_df[target] > 0).shape)
print(result_important_df.filter(result_important_df[target] <= 0).shape)
binary_results_df = result_important_df.with_columns(
    pl.when(result_important_df[target] > 0)
    .then(1)
    .otherwise(0)
    .alias("IS_DELAYED")  # Replace the column with the modified values
)
binary_results_df = binary_results_df.sample(fraction = 0.3,seed =42)
binary_results_df = binary_results_df.to_pandas()
#del result_important_df

(1844064, 44)
(3010652, 44)


In [13]:
binary_results_df.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,...,fips,stability,tmin,tmax,tavg,ppt,dday_a5C,dday_b15C,TOTAL_DELAY,IS_DELAYED
0,2015,2,25,3,DL,758,N983DL,ATL,PWM,2046,...,13121,stable,-1.137,2.655,0.759,2.088,0.0,14.241,,0
1,2015,2,23,1,MQ,3023,N856MQ,CMH,ORD,825,...,39049,stable,-15.008,-0.709,-7.858,0.0,0.0,22.858,235.0,1
2,2015,4,19,7,NK,801,N588NK,MCO,SJU,1216,...,12095,stable,22.17,31.353,26.761,0.001,21.762,0.001,-8.0,0
3,2015,6,4,4,VX,797,N629VA,LAX,SEA,1835,...,6037,stable,13.026,22.682,17.854,0.007,12.854,0.387,62.0,1
4,2015,3,21,6,EV,2535,N685AE,CAE,DFW,1812,...,45063,stable,7.927,19.965,13.946,0.121,8.946,2.472,110.0,1


Use Hot-One Binary Encoding to Encode the categorical variable Airline. We also pick out all of the numerical features that can be reasonably attained a few hours before a flight is scheduled to depart.

In [15]:
from sklearn.preprocessing import OneHotEncoder
numeric_features = ["SCHEDULED_DEPARTURE", "SCHEDULED_TIME","YEAR","MONTH","DAY","DAY_OF_WEEK","DISTANCE","tmin","tmax","tavg","ppt","dday_a5C","dday_b15C"]
target = ["IS_DELAYED"]
features_to_encode = ["AIRLINE"]
one_hot_encoder = OneHotEncoder(sparse_output = False)
encoded_binary_df = one_hot_encoder.fit_transform(binary_results_df[features_to_encode])
encoded_binary_df = pd.DataFrame(encoded_binary_df, columns=one_hot_encoder.get_feature_names_out(features_to_encode))
encoded_binary_df = pd.concat([encoded_binary_df, binary_results_df[numeric_features + target]], axis=1)

In [16]:
print(encoded_binary_df.shape)
encoded_binary_df = encoded_binary_df.dropna()
print(encoded_binary_df.shape)
encoded_binary_df.head()

(1479990, 28)
(1479986, 28)


Unnamed: 0,AIRLINE_AA,AIRLINE_AS,AIRLINE_B6,AIRLINE_DL,AIRLINE_EV,AIRLINE_F9,AIRLINE_HA,AIRLINE_MQ,AIRLINE_NK,AIRLINE_OO,...,DAY,DAY_OF_WEEK,DISTANCE,tmin,tmax,tavg,ppt,dday_a5C,dday_b15C,IS_DELAYED
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,25,3,1027,-1.137,2.655,0.759,2.088,0.0,14.241,0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,23,1,296,-15.008,-0.709,-7.858,0.0,0.0,22.858,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,19,7,1189,22.17,31.353,26.761,0.001,21.762,0.001,0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4,4,954,13.026,22.682,17.854,0.007,12.854,0.387,1
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,21,6,922,7.927,19.965,13.946,0.121,8.946,2.472,1


We split the data into train and test sets.

In [17]:
from sklearn.model_selection import train_test_split

#all_features = ["SCHEDULED_DEPARTURE", "SCHEDULED_TIME","YEAR","MONTH","DAY","DAY_OF_WEEK","DISTANCE","tmin","tmax","tavg","ppt","dday_a5C","dday_b15C"] +

X = encoded_binary_df.drop('IS_DELAYED',axis = 1)
y = encoded_binary_df['IS_DELAYED']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)

(1183988, 27)
(1183988,)
(295998, 27)


We scale the non-discrete features in our dataset to fit a normal distribution using StandardScaler.

In [18]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scalable_features = ['DISTANCE', 'tmin', 'tmax', 'tavg', 'ppt', 'dday_a5C','dday_b15C']

X_train_numeric = pd.DataFrame(scaler.fit_transform(X_train[scalable_features]),columns = scaler.get_feature_names_out(scalable_features))
X_train = pd.concat([X_train.drop(scalable_features,axis = 1).reset_index(drop=True),X_train_numeric.reset_index(drop=True)],axis = 1)
X_test_numeric = pd.DataFrame(scaler.transform(X_test[scalable_features]), columns = scaler.get_feature_names_out(scalable_features))
X_test = pd.concat([X_test.drop(scalable_features, axis = 1).reset_index(drop=True),X_test_numeric.reset_index(drop=True)], axis = 1)

In [19]:
X_train.head()

Unnamed: 0,AIRLINE_AA,AIRLINE_AS,AIRLINE_B6,AIRLINE_DL,AIRLINE_EV,AIRLINE_F9,AIRLINE_HA,AIRLINE_MQ,AIRLINE_NK,AIRLINE_OO,...,MONTH,DAY,DAY_OF_WEEK,DISTANCE,tmin,tmax,tavg,ppt,dday_a5C,dday_b15C
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7,10,5,0.239175,1.045295,0.365595,0.707196,1.009148,0.744477,-0.661065
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,7,14,2,1.399754,0.483049,0.387203,0.440894,-0.377476,0.404432,-0.660143
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,7,7,2,1.396331,1.271396,0.600113,0.941464,-0.377476,1.043492,-0.661065
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,9,2,3,1.119024,1.217164,1.294539,1.277677,-0.377476,1.472806,-0.661065
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,5,6,3,-0.175073,0.630168,0.723519,0.689378,-0.377476,0.721724,-0.661065


We predict if a flight will be delayed using a Random Forest Classifier.

In [46]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

clf = RandomForestClassifier(n_estimators=10, random_state=42,verbose = 1)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(clf.score(X_test,y_test))

0.6633896174974155


We look at what features are important when predicting loss.

In [69]:
feature_importances = clf.feature_importances_
for feature, importance in zip(X_train.columns, feature_importances):
    print(f"Feature: {feature}, Importance: {importance:.4f}")

Feature: AIRLINE_AA, Importance: 0.0054
Feature: AIRLINE_AS, Importance: 0.0023
Feature: AIRLINE_B6, Importance: 0.0033
Feature: AIRLINE_DL, Importance: 0.0060
Feature: AIRLINE_EV, Importance: 0.0045
Feature: AIRLINE_F9, Importance: 0.0022
Feature: AIRLINE_HA, Importance: 0.0002
Feature: AIRLINE_MQ, Importance: 0.0037
Feature: AIRLINE_NK, Importance: 0.0025
Feature: AIRLINE_OO, Importance: 0.0042
Feature: AIRLINE_UA, Importance: 0.0073
Feature: AIRLINE_US, Importance: 0.0027
Feature: AIRLINE_VX, Importance: 0.0015
Feature: AIRLINE_WN, Importance: 0.0122
Feature: SCHEDULED_DEPARTURE, Importance: 0.1850
Feature: SCHEDULED_TIME, Importance: 0.1223
Feature: YEAR, Importance: 0.0000
Feature: MONTH, Importance: 0.0339
Feature: DAY, Importance: 0.0604
Feature: DAY_OF_WEEK, Importance: 0.0362
Feature: DISTANCE, Importance: 0.1245
Feature: tmin, Importance: 0.0751
Feature: tmax, Importance: 0.0739
Feature: tavg, Importance: 0.0696
Feature: ppt, Importance: 0.0501
Feature: dday_a5C, Importance: 

In [69]:
from sklearn.metrics import f1_score, accuracy_score,precision_score, recall_score
forest_preds = y_pred
all_labels = y_test.to_numpy()

#calcualte metrics
precision = precision_score(all_labels, forest_preds)
recall = recall_score(all_labels, forest_preds)
f1 = f1_score(all_labels, forest_preds)
accuracy = accuracy_score(all_labels, forest_preds)

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"Accuracy: {accuracy:.4f}")

Precision: 0.5764
Recall: 0.3858
F1-Score: 0.4622
Accuracy: 0.6634


We now train a neural network to predict if a flight is going to be delayed.

In [70]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define the neural network architecture
class SimpleNN(nn.Module):
    def __init__(self, input_size):
        super(SimpleNN, self).__init__()
        # Define the layers
        self.fc1 = nn.Linear(input_size, 128)   # First fully connected layer
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 64)           # Second fully connected layer
        self.fc4 = nn.Linear(64, 1)            # Output layer (binary output)
        self.sigmoid = nn.Sigmoid()            # Sigmoid activation for binary output

    def forward(self, x):
        # Forward pass through the network
        x = torch.relu(self.fc1(x))  # ReLU activation for hidden layer
        x = torch.relu(self.fc2(x))  # ReLU activation for hidden layer
        x = torch.relu(self.fc3(x))
        x = self.sigmoid(self.fc4(x))  # Sigmoid activation for binary output
        return x

Training
 the neural network.

In [71]:
from torch.utils.data import DataLoader, TensorDataset

model = SimpleNN(len(X_train.columns))
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


model.train()  # Set model to training mode
epochs = 10
y_train = y_train.astype(float)
train_dataset = TensorDataset(torch.tensor(X_train.values, dtype=torch.float32), torch.tensor(y_train.values, dtype=torch.float32))
batch_size = 128
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
for i in range(epochs):
  model.train()  # Set the model to training mode
  running_loss = 0.0

  for batch_idx, (inputs, labels) in enumerate(train_loader):
    optimizer.zero_grad()  # Zero the gradients

    # Forward pass
    outputs = model(inputs)

    if torch.isnan(outputs).any():
      print("NaN detected in outputs!")
      break

    #print(f"Inputs shape: {inputs.shape}, Labels shape: {labels.shape}")
    #print(f"Outputs shape: {outputs.shape}, Outputs (sample): {outputs[:5]}")  # Print the first 5 output values
    #print(f"Labels (sample): {labels[:5]}")

    # Compute the loss
    loss = criterion(outputs.squeeze(dim = 1), labels)
    running_loss += loss.item()

    # Backward pass
    loss.backward()

    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Update the model weights
    optimizer.step()

  # Print the average loss for this epoch
  avg_loss = running_loss / len(train_loader)
  print(f"Epoch [{i+1}/{epochs}], Loss: {avg_loss:.4f}")

Epoch [1/10], Loss: 0.6501
Epoch [2/10], Loss: 0.6252
Epoch [3/10], Loss: 0.6240
Epoch [4/10], Loss: 0.6235
Epoch [5/10], Loss: 0.6233
Epoch [6/10], Loss: 0.6227
Epoch [7/10], Loss: 0.6215
Epoch [8/10], Loss: 0.6210
Epoch [9/10], Loss: 0.6200
Epoch [10/10], Loss: 0.6198


Neural network loss initially capped off loss at 5 epochs, so increased complexity of neural network by adding more nodes then adding another layer in order to try and stop underfitting on the model. Chose BCE loss to fit with the nature of the binary classification problem. Loss gets bottlenecked at 0.62.

Evaluating the accuracy of the neural network on the test set.

In [72]:
model.eval()

test_dataset = TensorDataset(torch.tensor(X_test.values, dtype=torch.float32), torch.tensor(y_test.values, dtype=torch.float32))
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)

all_preds = []
all_labels = []

with torch.no_grad():  # Disable gradient computation during testing
    for inputs, labels in test_loader:
        # Forward pass: Compute predicted y by passing X_test to the model
        outputs = model(inputs).squeeze(dim = 1)

        # Apply threshold to get binary predictions
        predicted = (outputs >= 0.5).float()

        # Count the number of correct predictions
        all_preds.append(predicted.squeeze().cpu().numpy())  # Convert to numpy for sklearn
        all_labels.append(labels.squeeze().cpu().numpy())


all_preds = np.concatenate(all_preds)
all_labels = np.concatenate(all_labels)

#calcualte metrics
precision = precision_score(all_labels, all_preds)
recall = recall_score(all_labels, all_preds)
f1 = f1_score(all_labels, all_preds)
accuracy = accuracy_score(all_labels, all_preds)

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"Accuracy: {accuracy:.4f}")

Precision: 0.5809
Recall: 0.3233
F1-Score: 0.4154
Accuracy: 0.6588


In [45]:
1 - torch.tensor(y_test.values, dtype=torch.float32).mean().item()

0.6250244975090027

Considering that a model that only guesses 0 (Not Delayed) will get 62.5% accuracy our models are not much more accurate. Our neural network has decent precision given that 37.5% of elements are 1s, but very bad recall as we correctly predicty only 30% of delayed flights. The Random Forest Classifier is about the same, except with a better recall of 40%.

Train an XGBoosting model to predict if a flight will be delayed.

In [73]:
import xgboost as xgb

xgbClassifier = xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss',random_state = 42)
xgbClassifier.fit(X_train,y_train)

xgb_y_pred = xgbClassifier.predict(X_test)

print(xgbClassifier.score(X_train,y_train))
print(xgbClassifier.score(X_test,y_test))

Parameters: { "use_label_encoder" } are not used.



0.6933355743470373
0.6856431462374746


In [74]:
feature_importances = xgbClassifier.feature_importances_
for feature, importance in zip(X_train.columns, feature_importances):
    print(f"Feature: {feature}, Importance: {importance:.4f}")

Feature: AIRLINE_AA, Importance: 0.0087
Feature: AIRLINE_AS, Importance: 0.0656
Feature: AIRLINE_B6, Importance: 0.0193
Feature: AIRLINE_DL, Importance: 0.0141
Feature: AIRLINE_EV, Importance: 0.0202
Feature: AIRLINE_F9, Importance: 0.0133
Feature: AIRLINE_HA, Importance: 0.0095
Feature: AIRLINE_MQ, Importance: 0.0161
Feature: AIRLINE_NK, Importance: 0.0458
Feature: AIRLINE_OO, Importance: 0.0200
Feature: AIRLINE_UA, Importance: 0.2282
Feature: AIRLINE_US, Importance: 0.0182
Feature: AIRLINE_VX, Importance: 0.0164
Feature: AIRLINE_WN, Importance: 0.2121
Feature: SCHEDULED_DEPARTURE, Importance: 0.0840
Feature: SCHEDULED_TIME, Importance: 0.0131
Feature: YEAR, Importance: 0.0000
Feature: MONTH, Importance: 0.0388
Feature: DAY, Importance: 0.0260
Feature: DAY_OF_WEEK, Importance: 0.0182
Feature: DISTANCE, Importance: 0.0191
Feature: tmin, Importance: 0.0160
Feature: tmax, Importance: 0.0130
Feature: tavg, Importance: 0.0092
Feature: ppt, Importance: 0.0245
Feature: dday_a5C, Importance: 

In [76]:
all_preds = xgb_y_pred
all_labels = y_test.to_numpy()

#calcualte metrics
precision = precision_score(all_labels, all_preds)
recall = recall_score(all_labels, all_preds)
f1 = f1_score(all_labels, all_preds)
accuracy = accuracy_score(all_labels, all_preds)

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
print(f"Accuracy: {accuracy:.4f}")

Precision: 0.6297
Recall: 0.3925
F1-Score: 0.4836
Accuracy: 0.6856
