Full documentation of the data we used, with relevant links and schemas

Flights were sourced from here: https://www.kaggle.com/datasets/usdot/flight-delays?select=flights.csv

Weather was sourced from here: https://asmith.ucdavis.edu/data/prism-weather

AirportLocations.csv was sourced from: https://geodata.bts.gov/datasets/usdot::aviation-facilities/about

For the weather data, use these settings: temporal unit daily; spatial unit county; start and end year both 2015; months 1 through 12; all states; variables tmin, tmax, tavg, ppt, dday_a5C, dday_b15C



In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import polars as pl
import kagglehub
import seaborn as sns

State regions were assigned according to the image below; the grouping is somewhat arbitrary. We placed AK in the north region and HI in the Pacific region

![Alt Text](region.png)

In [None]:
flights_df = pl.read_csv("flights.csv")
airport_loc_df = pl.read_csv('AirportLocations.csv')
airlines_df = pl.read_csv('airlines.csv')
weather_df = pl.read_csv('weather.csv')

In [None]:
flights_df.shape

(5819079, 31)

In [None]:
delays_df = flights_df.filter(pl.col("DEPARTURE_DELAY") > 0)
ontime_df = flights_df.filter(pl.col("DEPARTURE_DELAY") <= 0)

Join the county information onto each flight's origin airport so that flights can be matched with county-level weather

In [None]:
airport_loc_df = airport_loc_df.select(["ARPT_ID", "COUNTY_NAME", "STATE_CODE"])
flights_df = flights_df.join(
    airport_loc_df,
    left_on="ORIGIN_AIRPORT",
    right_on="ARPT_ID",
    how="left"
)
delays_df = flights_df.filter(pl.col("DEPARTURE_DELAY") > 0)
ontime_df = flights_df.filter(pl.col("DEPARTURE_DELAY") <= 0)
flights_df = flights_df.with_columns(
    pl.datetime(
        year=pl.col("YEAR"),
        month=pl.col("MONTH"),
        day=pl.col("DAY")
    ).cast(pl.Date).alias("date")
)
flights_df

YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,COUNTY_NAME,STATE_CODE,date
i64,i64,i64,i64,str,i64,str,str,str,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,str,i64,i64,i64,i64,i64,str,str,date
2015,1,1,4,"""AS""",98,"""N407AS""","""ANC""","""SEA""",5,2354,-11,21,15,205,194,169,1448,404,4,430,408,-22,0,0,,,,,,,"""ANCHORAGE""","""AK""",2015-01-01
2015,1,1,4,"""AA""",2336,"""N3KUAA""","""LAX""","""PBI""",10,2,-8,12,14,280,279,263,2330,737,4,750,741,-9,0,0,,,,,,,"""LOS ANGELES""","""CA""",2015-01-01
2015,1,1,4,"""US""",840,"""N171US""","""SFO""","""CLT""",20,18,-2,16,34,286,293,266,2296,800,11,806,811,5,0,0,,,,,,,"""SAN MATEO""","""CA""",2015-01-01
2015,1,1,4,"""AA""",258,"""N3HYAA""","""LAX""","""MIA""",20,15,-5,15,30,285,281,258,2342,748,8,805,756,-9,0,0,,,,,,,"""LOS ANGELES""","""CA""",2015-01-01
2015,1,1,4,"""AS""",135,"""N527AS""","""SEA""","""ANC""",25,24,-1,11,35,235,215,199,1448,254,5,320,259,-21,0,0,,,,,,,"""KING""","""WA""",2015-01-01
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2015,12,31,4,"""B6""",688,"""N657JB""","""LAX""","""BOS""",2359,2355,-4,22,17,320,298,272,2611,749,4,819,753,-26,0,0,,,,,,,"""LOS ANGELES""","""CA""",2015-12-31
2015,12,31,4,"""B6""",745,"""N828JB""","""JFK""","""PSE""",2359,2355,-4,17,12,227,215,195,1617,427,3,446,430,-16,0,0,,,,,,,"""QUEENS""","""NY""",2015-12-31
2015,12,31,4,"""B6""",1503,"""N913JB""","""JFK""","""SJU""",2359,2350,-9,17,7,221,222,197,1598,424,8,440,432,-8,0,0,,,,,,,"""QUEENS""","""NY""",2015-12-31
2015,12,31,4,"""B6""",333,"""N527JB""","""MCO""","""SJU""",2359,2353,-6,10,3,161,157,144,1189,327,3,340,330,-10,0,0,,,,,,,"""ORANGE""","""FL""",2015-12-31


In [None]:
type(flights_df)

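Don't run the cell below more than once: after the first run the date column is already a Date, and casting it back to a string gives values like 2015-01-01, which no longer match the %Y%m%d format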

In [None]:
weather_df = weather_df.with_columns(
    pl.col("date").cast(pl.Utf8).str.strptime(pl.Date, "%Y%m%d").alias("date")
)
weather_df = weather_df.with_columns(
    pl.col("county_name").str.to_lowercase().alias("county_name")
)
weather_df

st_abb,st_code,county_name,fips,date,stability,tmin,tmax,tavg,ppt,dday_a5C,dday_b15C
str,i64,str,i64,date,str,f64,f64,f64,f64,f64,f64
"""AL""",1,"""autauga""",1001,2015-01-01,"""stable""",-0.835,10.961,5.063,0.059,1.909,9.937
"""AL""",1,"""autauga""",1001,2015-01-02,"""stable""",0.276,13.216,6.746,3.863,3.008,8.254
"""AL""",1,"""autauga""",1001,2015-01-03,"""stable""",8.511,12.552,10.531,14.217,5.532,4.469
"""AL""",1,"""autauga""",1001,2015-01-04,"""stable""",12.328,20.585,16.457,48.919,11.456,0.668
"""AL""",1,"""autauga""",1001,2015-01-05,"""stable""",2.642,15.865,9.254,0.0,4.684,5.841
…,…,…,…,…,…,…,…,…,…,…,…
"""PA""",42,"""cumberland""",42041,2015-01-03,"""stable""",-5.854,5.052,-0.401,0.0,0.002,15.401
"""PA""",42,"""cumberland""",42041,2015-01-04,"""stable""",-2.57,2.941,0.185,13.427,0.0,14.815
"""PA""",42,"""cumberland""",42041,2015-01-05,"""stable""",-0.039,10.977,5.469,5.055,1.994,9.531
"""PA""",42,"""cumberland""",42041,2015-01-06,"""stable""",-9.558,0.322,-4.618,1.246,0.0,19.618


We cross-referenced the temperatures against the historical records here to confirm that they were aligned correctly: https://www.timeanddate.com/weather/usa/new-york/historic?month=12&year=2015

In [None]:
flights_df = flights_df.with_columns(
    pl.col("COUNTY_NAME").str.to_lowercase().alias("COUNTY_NAME")
)

result_df = flights_df.join(
    weather_df,
    left_on=["COUNTY_NAME", "date", "STATE_CODE"],
    right_on=["county_name", "date", "st_abb"],
    how="left"
)

flights_df

YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,COUNTY_NAME,STATE_CODE,date
i64,i64,i64,i64,str,i64,str,str,str,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,str,i64,i64,i64,i64,i64,str,str,date
2015,1,1,4,"""AS""",98,"""N407AS""","""ANC""","""SEA""",5,2354,-11,21,15,205,194,169,1448,404,4,430,408,-22,0,0,,,,,,,"""anchorage""","""AK""",2015-01-01
2015,1,1,4,"""AA""",2336,"""N3KUAA""","""LAX""","""PBI""",10,2,-8,12,14,280,279,263,2330,737,4,750,741,-9,0,0,,,,,,,"""los angeles""","""CA""",2015-01-01
2015,1,1,4,"""US""",840,"""N171US""","""SFO""","""CLT""",20,18,-2,16,34,286,293,266,2296,800,11,806,811,5,0,0,,,,,,,"""san mateo""","""CA""",2015-01-01
2015,1,1,4,"""AA""",258,"""N3HYAA""","""LAX""","""MIA""",20,15,-5,15,30,285,281,258,2342,748,8,805,756,-9,0,0,,,,,,,"""los angeles""","""CA""",2015-01-01
2015,1,1,4,"""AS""",135,"""N527AS""","""SEA""","""ANC""",25,24,-1,11,35,235,215,199,1448,254,5,320,259,-21,0,0,,,,,,,"""king""","""WA""",2015-01-01
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2015,12,31,4,"""B6""",688,"""N657JB""","""LAX""","""BOS""",2359,2355,-4,22,17,320,298,272,2611,749,4,819,753,-26,0,0,,,,,,,"""los angeles""","""CA""",2015-12-31
2015,12,31,4,"""B6""",745,"""N828JB""","""JFK""","""PSE""",2359,2355,-4,17,12,227,215,195,1617,427,3,446,430,-16,0,0,,,,,,,"""queens""","""NY""",2015-12-31
2015,12,31,4,"""B6""",1503,"""N913JB""","""JFK""","""SJU""",2359,2350,-9,17,7,221,222,197,1598,424,8,440,432,-8,0,0,,,,,,,"""queens""","""NY""",2015-12-31
2015,12,31,4,"""B6""",333,"""N527JB""","""MCO""","""SJU""",2359,2353,-6,10,3,161,157,144,1189,327,3,340,330,-10,0,0,,,,,,,"""orange""","""FL""",2015-12-31


In [None]:
'''
result_df2 = flights_df.join(
    weather_df,
    left_on=["COUNTY_NAME", "date"],
    right_on=["county_name", "date"],
    how="inner"
)

result_df2
'''

In [None]:
result_important_df = result_df.filter(pl.col('tavg').is_not_null())
result_important_df

YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,COUNTY_NAME,STATE_CODE,date,st_code,fips,stability,tmin,tmax,tavg,ppt,dday_a5C,dday_b15C
i64,i64,i64,i64,str,i64,str,str,str,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,str,i64,i64,i64,i64,i64,str,str,date,i64,i64,str,f64,f64,f64,f64,f64,f64
2015,1,1,4,"""AA""",2336,"""N3KUAA""","""LAX""","""PBI""",10,2,-8,12,14,280,279,263,2330,737,4,750,741,-9,0,0,,,,,,,"""los angeles""","""CA""",2015-01-01,6,6037,"""stable""",-2.162,7.71,2.774,0.009,0.621,12.226
2015,1,1,4,"""US""",840,"""N171US""","""SFO""","""CLT""",20,18,-2,16,34,286,293,266,2296,800,11,806,811,5,0,0,,,,,,,"""san mateo""","""CA""",2015-01-01,6,6081,"""stable""",2.561,12.746,7.654,0.0,3.173,7.346
2015,1,1,4,"""AA""",258,"""N3HYAA""","""LAX""","""MIA""",20,15,-5,15,30,285,281,258,2342,748,8,805,756,-9,0,0,,,,,,,"""los angeles""","""CA""",2015-01-01,6,6037,"""stable""",-2.162,7.71,2.774,0.009,0.621,12.226
2015,1,1,4,"""DL""",806,"""N3730B""","""SFO""","""MSP""",25,20,-5,18,38,217,230,206,1589,604,6,602,610,8,0,0,,,,,,,"""san mateo""","""CA""",2015-01-01,6,6081,"""stable""",2.561,12.746,7.654,0.0,3.173,7.346
2015,1,1,4,"""NK""",612,"""N635NK""","""LAS""","""MSP""",25,19,-6,11,30,181,170,154,1299,504,5,526,509,-17,0,0,,,,,,,"""clark""","""NV""",2015-01-01,32,32003,"""stable""",-3.51,1.801,-0.854,0.068,0.0,15.854
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2015,1,31,6,"""US""",889,"""N656AW""","""PHX""","""MSP""",2359,2352,-7,9,1,180,176,159,1276,340,8,359,348,-11,0,0,,,,,,,"""maricopa""","""AZ""",2015-01-31,4,4013,"""stable""",11.649,14.992,13.32,9.89,8.32,1.68
2015,1,31,6,"""B6""",745,"""N627JB""","""JFK""","""PSE""",2359,2353,-6,19,12,227,205,182,1617,414,4,446,418,-28,0,0,,,,,,,"""queens""","""NY""",2015-01-31,36,36081,"""stable""",-9.897,2.994,-3.451,0.228,0.0,18.451
2015,1,31,6,"""B6""",839,"""N658JB""","""JFK""","""BQN""",2359,2359,0,18,17,221,200,179,1576,416,3,440,419,-21,0,0,,,,,,,"""queens""","""NY""",2015-01-31,36,36081,"""stable""",-9.897,2.994,-3.451,0.228,0.0,18.451
2015,1,31,6,"""F9""",300,"""N218FR""","""DEN""","""TPA""",2359,2,3,35,37,192,212,168,1506,525,9,511,534,23,0,0,,21,0,2,0,0,"""denver""","""CO""",2015-01-31,8,8031,"""stable""",-0.508,8.696,4.094,0.0,1.04,10.906


We will now split into test and train sets and proceed with EDA on the train set

In [None]:
from sklearn.model_selection import train_test_split

target = ['DEPARTURE_DELAY']

# Split the joined frame itself; with a single input, train_test_split returns two pieces.
# Assumes a scikit-learn version that can index polars DataFrames.
X_train, X_test = train_test_split(result_important_df, test_size=0.2, random_state=0)
print(X_train.shape)
print(X_test.shape)

Let us now get some understanding of what our data looks like

In [None]:
basic_stats = X_train.describe()
info = X_train.schema
columns_of_interest = ['SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'ELAPSED_TIME', 'AIR_TIME', 'WHEELS_ON', 'DISTANCE', 'ARRIVAL_DELAY', 'DIVERTED', 'CANCELLED', 'tmin', 'tmax', 'tavg', 'ppt', 'dday_a5C', 'dday_b15C']
X_train_numerical = X_train.select(columns_of_interest)
df = X_train_numerical.to_pandas()
corr_matrix = df.corr()
corr_matrix

In [None]:
'''
VISUAL 1: Since we are interested in DEPARTURE and ARRIVAL delay, we should choose features that are highly correlated with these delays.
We will use two metrics that I think measure association well: the standard Pearson correlation for numeric columns, and Cramér's V for categorical columns.
'''
plt.figure(figsize=(12, 10))

sns.heatmap(
    corr_matrix,
    annot=True,
    fmt=".2f",
    cmap='coolwarm',
    square=True,
    linewidths=.5,
    cbar_kws={"shrink": .5}
)

plt.title('Correlation Matrix Heatmap', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()


As we can see, the correlations are very weak. What if we keep only rows with precipitation (ppt) above 5, the threshold used in the next cell?

In [None]:
df_ppt_filtered = df[df['ppt'] > 5]
corr_matrix = df_ppt_filtered.corr()
plt.figure(figsize=(12, 10))

sns.heatmap(
    corr_matrix,
    annot=True,
    fmt=".2f",
    cmap='coolwarm',
    square=True,
    linewidths=.5,
    cbar_kws={"shrink": .5}
)

plt.title('Correlation Matrix Heatmap', fontsize=16)
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
print(df_ppt_filtered.shape)
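The heatmaps above cover only Pearson correlation on numeric columns. The Cramér's V metric mentioned for categorical columns could be computed along the following lines. This is a sketch on toy data: the helper name `cramers_v` and the `delayed` column are hypothetical, and the uncorrected chi-squared statistic is an assumption about which variant is wanted.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (0 = no association, 1 = perfect)."""
    table = pd.crosstab(x, y)                      # contingency table of counts
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.to_numpy().sum()                     # number of observations
    k = min(table.shape) - 1                       # min(rows, cols) - 1
    return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

# Toy example: airline perfectly determines the delayed flag
toy = pd.DataFrame({
    "AIRLINE": ["AA", "AA", "DL", "DL", "AA", "DL"],
    "delayed": [1, 1, 0, 0, 1, 0],
})
v_perfect = cramers_v(toy["AIRLINE"], toy["delayed"])

# Toy example: no association at all
v_none = cramers_v(pd.Series(list("aabb")), pd.Series([0, 1, 0, 1]))
```

On the real data the same function could be applied to pairs such as AIRLINE or ORIGIN_AIRPORT against a binary delay flag.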

In [None]:
origin_counts = result_important_df.group_by("ORIGIN_AIRPORT").count()
origin_counts_pd = origin_counts.to_pandas()
origin_counts_pd = origin_counts_pd.sort_values(by="count", ascending=False)
top_30 = origin_counts_pd.head(30)

# Plot the bar chart
plt.figure(figsize=(12, 6))
plt.bar(top_30["ORIGIN_AIRPORT"], top_30["count"], color="skyblue")
plt.xlabel("Origin Airport", fontsize=14)
plt.ylabel("Flight Count", fontsize=14)
plt.title("Flight Count by Origin Airport", fontsize=16)
plt.xticks(rotation=45, fontsize=12)
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.tight_layout()
plt.show()

In [None]:
airline_counts = result_important_df.group_by("AIRLINE").count()
airline_counts = airline_counts.to_pandas()
airline_counts = airline_counts.sort_values(by="count", ascending=False)

# Plot the bar chart
plt.figure(figsize=(12, 6))
plt.bar(airline_counts["AIRLINE"], airline_counts["count"], color="skyblue")
plt.xlabel("Airlines Represented", fontsize=14)
plt.ylabel("Flight Count", fontsize=14)
plt.title("Airline Count", fontsize=16)
plt.xticks(rotation=45, fontsize=12)
plt.grid(axis="y", linestyle="--", alpha=0.7)
plt.tight_layout()
plt.show()

Next we try classification: predicting whether a flight will be delayed under given conditions. First we need a rule for labelling a flight as delayed; we treat any flight with DEPARTURE_DELAY > 0 as delayed.

In [None]:
target = "DEPARTURE_DELAY"
print(result_important_df.filter(result_important_df[target] > 0).shape)
print(result_important_df.filter(result_important_df[target] <= 0).shape)

(125271, 43)
(196783, 43)


In [None]:
binary_results_df = result_important_df.with_columns(
    pl.when(pl.col(target) > 0)
    .then(1)
    .otherwise(0)  # rows with a null delay also end up here
    .alias("IS_DELAYED")  # adds a new 0/1 column; DEPARTURE_DELAY is left unchanged
)

In [None]:
binary_results_df.head()

YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,COUNTY_NAME,STATE_CODE,date,st_code,fips,stability,tmin,tmax,tavg,ppt,dday_a5C,dday_b15C,IS_DELAYED
i64,i64,i64,i64,str,i64,str,str,str,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,str,i64,i64,i64,i64,i64,str,str,date,i64,i64,str,f64,f64,f64,f64,f64,f64,i32
2015,1,1,4,"""AA""",2336,"""N3KUAA""","""LAX""","""PBI""",10,2,-8,12,14,280,279,263,2330,737,4,750,741,-9,0,0,,,,,,,"""los angeles""","""CA""",2015-01-01,6,6037,"""stable""",-2.162,7.71,2.774,0.009,0.621,12.226,0
2015,1,1,4,"""US""",840,"""N171US""","""SFO""","""CLT""",20,18,-2,16,34,286,293,266,2296,800,11,806,811,5,0,0,,,,,,,"""san mateo""","""CA""",2015-01-01,6,6081,"""stable""",2.561,12.746,7.654,0.0,3.173,7.346,0
2015,1,1,4,"""AA""",258,"""N3HYAA""","""LAX""","""MIA""",20,15,-5,15,30,285,281,258,2342,748,8,805,756,-9,0,0,,,,,,,"""los angeles""","""CA""",2015-01-01,6,6037,"""stable""",-2.162,7.71,2.774,0.009,0.621,12.226,0
2015,1,1,4,"""DL""",806,"""N3730B""","""SFO""","""MSP""",25,20,-5,18,38,217,230,206,1589,604,6,602,610,8,0,0,,,,,,,"""san mateo""","""CA""",2015-01-01,6,6081,"""stable""",2.561,12.746,7.654,0.0,3.173,7.346,0
2015,1,1,4,"""NK""",612,"""N635NK""","""LAS""","""MSP""",25,19,-6,11,30,181,170,154,1299,504,5,526,509,-17,0,0,,,,,,,"""clark""","""NV""",2015-01-01,32,32003,"""stable""",-3.51,1.801,-0.854,0.068,0.0,15.854,0


For now we train only on the numeric features and leave out the categorical ones.

In [None]:
from sklearn.model_selection import train_test_split

binary_results_df = binary_results_df.to_pandas()

numeric_features = ["SCHEDULED_DEPARTURE", "SCHEDULED_TIME","YEAR","MONTH","DAY","DAY_OF_WEEK","DISTANCE","tmin","tmax","tavg","ppt","dday_a5C","dday_b15C"]

X = binary_results_df[numeric_features]
y = binary_results_df['IS_DELAYED']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)

(264604, 13)
(66152, 13)


Train a random forest classifier to predict whether a flight is delayed.

In [None]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26460 entries, 194526 to 257364
Data columns (total 43 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   YEAR                 26460 non-null  int64         
 1   MONTH                26460 non-null  int64         
 2   DAY                  26460 non-null  int64         
 3   DAY_OF_WEEK          26460 non-null  int64         
 4   AIRLINE              26460 non-null  object        
 5   FLIGHT_NUMBER        26460 non-null  int64         
 6   TAIL_NUMBER          26306 non-null  object        
 7   ORIGIN_AIRPORT       26460 non-null  object        
 8   DESTINATION_AIRPORT  26460 non-null  object        
 9   SCHEDULED_DEPARTURE  26460 non-null  int64         
 10  DEPARTURE_TIME       25764 non-null  float64       
 11  DEPARTURE_DELAY      25764 non-null  float64       
 12  TAXI_OUT             25754 non-null  float64       
 13  WHEELS_OFF           25754 non

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Fit a 100-tree random forest on the numeric features
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Accuracy on the held-out 20% test split
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.6744769621477809
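The accuracy above is easier to judge next to a trivial baseline that always predicts the majority class. The sketch below derives that baseline from the counts printed earlier in the notebook (computed on the full data, so the share in the test split may differ slightly) and shows how the same baseline could be fitted with scikit-learn's DummyClassifier; the toy arrays are stand-ins, not the real split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Counts printed earlier in the notebook (full data set, before the split)
n_total = 264604 + 66152   # train rows + test rows
n_delayed = 125271         # rows with DEPARTURE_DELAY > 0
majority_share = (n_total - n_delayed) / n_total
print(f"always-predict-0 accuracy on the full data: {majority_share:.4f}")

# The same idea fitted on a split; toy arrays stand in for X_train / y_train
X_toy = np.zeros((10, 1))
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
dummy_acc = dummy.score(X_toy, y_toy)
```

Under these counts the baseline comes out at roughly 0.62, so the random forest improves on it by about five percentage points.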


In [None]:
feature_importances = clf.feature_importances_
for feature, importance in zip(X.columns, feature_importances):
    print(f"Feature: {feature}, Importance: {importance:.4f}")

Feature: SCHEDULED_DEPARTURE, Importance: 0.2982
Feature: SCHEDULED_TIME, Importance: 0.2131
Feature: YEAR, Importance: 0.0000
Feature: MONTH, Importance: 0.0000
Feature: DAY, Importance: 0.0603
Feature: DAY_OF_WEEK, Importance: 0.0187
Feature: DISTANCE, Importance: 0.1951
Feature: tmin, Importance: 0.0439
Feature: tmax, Importance: 0.0421
Feature: tavg, Importance: 0.0400
Feature: ppt, Importance: 0.0270
Feature: dday_a5C, Importance: 0.0226
Feature: dday_b15C, Importance: 0.0388
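The zero importances for YEAR and MONTH are what a tree model assigns to columns that never vary, since a constant column offers no split. The joined rows displayed earlier all fall in January 2015, which would explain it. A sketch of how such columns could be detected and dropped before training, on a toy frame with hypothetical rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "YEAR": [2015, 2015, 2015],
    "MONTH": [1, 1, 1],
    "DAY": [1, 2, 3],
    "DISTANCE": [2330, 2296, 1589],
})

# A column with at most one distinct value cannot help any model
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]
informative = toy.drop(columns=constant_cols)

print("dropped:", constant_cols)
print("kept:", list(informative.columns))
```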


Now, we try and predict which airline a flight belongs to given that it was delayed. We need to

In [None]:
numeric_features = ["SCHEDULED_DEPARTURE", "SCHEDULED_TIME","YEAR","MONTH","DAY","DAY_OF_WEEK","DISTANCE","tmin","tmax","tavg","ppt","dday_a5C","dday_b15C"]

X = binary_results_df[numeric_features]
y = binary_results_df['IS_DELAYED']

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
print(X_train.shape)
print(X_test.shape)
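The cell above still uses IS_DELAYED as the target. For the stated goal of predicting the airline of a delayed flight, one possible setup is to keep only delayed rows and use AIRLINE as the label. The sketch below illustrates this on a toy frame; the rows, the feature subset, and the names `X_air`, `y_air`, and `clf_air` are all hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

toy = pd.DataFrame({
    "AIRLINE": ["AA", "DL", "AA", "DL", "AA", "DL"],
    "SCHEDULED_DEPARTURE": [10, 25, 20, 600, 15, 630],
    "DISTANCE": [2330, 1589, 2342, 500, 2300, 520],
    "IS_DELAYED": [1, 1, 1, 1, 0, 1],
})

# Condition on the flight being delayed, then predict the carrier
delayed = toy[toy["IS_DELAYED"] == 1]
X_air = delayed[["SCHEDULED_DEPARTURE", "DISTANCE"]]
y_air = delayed["AIRLINE"]  # multi-class string labels are accepted as-is

clf_air = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_air, y_air)
print(clf_air.classes_)
```

On the real data this would be followed by the same train/test split as before, ideally stratified on AIRLINE because the carriers are far from equally represented.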

In [45]:
binary_results_df.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,...,st_code,fips,stability,tmin,tmax,tavg,ppt,dday_a5C,dday_b15C,IS_DELAYED
0,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,10,...,6,6037,stable,-2.162,7.71,2.774,0.009,0.621,12.226,0
1,2015,1,1,4,US,840,N171US,SFO,CLT,20,...,6,6081,stable,2.561,12.746,7.654,0.0,3.173,7.346,0
2,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,20,...,6,6037,stable,-2.162,7.71,2.774,0.009,0.621,12.226,0
3,2015,1,1,4,DL,806,N3730B,SFO,MSP,25,...,6,6081,stable,2.561,12.746,7.654,0.0,3.173,7.346,0
4,2015,1,1,4,NK,612,N635NK,LAS,MSP,25,...,32,32003,stable,-3.51,1.801,-0.854,0.068,0.0,15.854,0
