# Flight Delay Prediction 

## Main Goals

- Preprocess the dataset
    - Analyze and clean raw flight data, handling missing values and outliers
    - Parse and transform time features into cyclical numerical representations
    - Encode categorical variables such as airline and airport codes
    - Exclude columns that would leak information not available at departure time

- Engineer features
    - Create new time-based features (e.g., hour of departure, day of week)
    - Aggregate and incorporate average delay statistics for airlines and airports

- Model development
    - Formulate and train a multi-class classification model to predict flight delays



### Context 

Flight delays are a persistent challenge within the aviation industry, impacting both passenger satisfaction and airline operations. Understanding and predicting the likelihood of delays enables airlines and airports to proactively manage schedules, resources, and communication. This project leverages a comprehensive real-world flight dataset to try and predict flight delays based on a combination of time, location, and operational features available prior to departure.

## 1. Loading in Data

For this project, we'll be using a collection of datasets from [2015 Flight Delays and Cancellations](https://www.kaggle.com/datasets/usdot/flight-delays?select=flights.csv). In accordance with Kaggle licenses, please directly visit the Kaggle website and download the `flights.csv` dataset for this activity. It'll help to have access to both `airports.csv` and `airlines.csv`, so while the need to download them isn't there, please be able view them.

We can start by loading in the dataset into a pandas dataframe, and then displaying it to ensure it loaded correctly, and so we can see what the features are and how the target is displayed. This means that we have to start by importing pandas as well.

It's worth mentioning that anytime you have a dataset from an external source, such as Kaggle, you can and should refer back to the source of the data to clear up misconceptions and also to get a better understanding of the data. Especially in this case as we are working with a large and extensive amount of data.

Our main source of data for this project is `flights.csv`, which is why we only need to load in this file. However, for clarity, it is a good practice to view and try to understand the other datasets as well.

In [1]:
#Import pandas
import pandas as pd

#Read the flights.csv file
flights = pd.read_csv('flights.csv')

#Display our dataframes
display(flights)

  flights = pd.read_csv('flights.csv')


Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,...,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,5,...,408.0,-22.0,0,0,,,,,,
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,10,...,741.0,-9.0,0,0,,,,,,
2,2015,1,1,4,US,840,N171US,SFO,CLT,20,...,811.0,5.0,0,0,,,,,,
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,20,...,756.0,-9.0,0,0,,,,,,
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,25,...,259.0,-21.0,0,0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5819074,2015,12,31,4,B6,688,N657JB,LAX,BOS,2359,...,753.0,-26.0,0,0,,,,,,
5819075,2015,12,31,4,B6,745,N828JB,JFK,PSE,2359,...,430.0,-16.0,0,0,,,,,,
5819076,2015,12,31,4,B6,1503,N913JB,JFK,SJU,2359,...,432.0,-8.0,0,0,,,,,,
5819077,2015,12,31,4,B6,333,N527JB,MCO,SJU,2359,...,330.0,-10.0,0,0,,,,,,


### Understanding the Data
While the data did load in properly, you can notice immediately that there is a LOT of data. 31 features, and seemingly a large number of null values as well.

As always, it's important to understand what you're working with. Since we weren't able to view every feature with the display() funciton, we'll go ahead and print out the features so that we know what they are. Additionally, we can refer back to the source of the data, in this case Kaggle, to learn more about the dataset.

So let's go ahead and print out the features of our main dataframe. We'll use that information and the source of the data to clarify and take note of important features.

In [2]:
#Print the columns of the dataframe
print(flights.columns)

Index(['YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE', 'FLIGHT_NUMBER',
       'TAIL_NUMBER', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT',
       'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'DEPARTURE_DELAY', 'TAXI_OUT',
       'WHEELS_OFF', 'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE',
       'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME',
       'ARRIVAL_DELAY', 'DIVERTED', 'CANCELLED', 'CANCELLATION_REASON',
       'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY',
       'LATE_AIRCRAFT_DELAY', 'WEATHER_DELAY'],
      dtype='object')


#### Analysis of Flight Dataset Features

Using the information from Kaggle as reference, let's clarify some of the values and take note of important features.

- **AIRLINE**: Carrier code (e.g. AA, UA). We can learn more about this through airlines.csv 
- **ORIGIN_AIRPORT**: Departure airport code (can be found through airports.csv)
- **DESTINATION_AIRPORT**: Arrival airport code (can be found through airports.csv)
- **TAXI_OUT**: Time spent taxiing out (in minutes)
- **TAXI_IN**: Time spent taxiing in (in minutes)
- **WHEELS_OFF**: Time aircraft wheels left ground
- **WHEELS_ON**: Time aircraft wheels touched ground
- **DEPARTURE_DELAY**: Departure delay in minutes
- **ARRIVAL_DELAY**: Arrival delay in minutes
- **AIR_SYSTEM_DELAY**: Delay caused by air traffic control
- **SECURITY_DELAY**: Delay caused by security issues
- **AIRLINE_DELAY**: Delay caused by the airline
- **LATE_AIRCRAFT_DELAY**: Delay caused by late arriving aircraft
- **WEATHER_DELAY**: Delay caused by weather
- **DIVERTED**: Binary indicator if flight was diverted
- **CANCELLED**: Binary indicator if flight was cancelled
- **CANCELLATION_REASON**: Reason for cancellation if applicable (mainly null)

Not even examining every feature here, we still have quite the large list. The biggest thing to note is that there isn't a clear target column for us to use when we create our model. As such, we will have to engineer our own target column later on.

Additionally, while we may not have many reasons for delays, especially considering how these are reasons that wouldn't be available to us at prediction time, it might seem fruitless to try and make predictions with the rest of the features. However, this is actually similar to how many airports create their prediction models, and the models can subesequently be used to explore patterns as opposed to genuinely predict flight delays. For example, the model might reveal that maybe flights on Fridays at 6 pm during the month of Feburary are consistently delayed. We don't know that for sure, but our model could possibly reveal such patterns to us, which is where the inherit usefulness of this kind of modeling comes from. That being said, if given data from an actual airport, we would have even more data to work with, and would be able to predict the delays with more accuraccy. With all that in mind, let's continue with the goal of simply creating a model for this data.

### Sampling the Data
Before we continue on to preprocessing, we'll be taking a sample of the data to work with. In a real world scenario, you wouldn't typically sample the data unless you explicitly need a small set to work with. In our case, this project is meant to run on laptops or school devices, and these devices might not be capable of handling the tasks later on with this large of a dataset. As such, we'll be sampling 500,000 values to work with. For our purposes of creating a model, this will be plenty. Just note that you'll likely have a stronger device to work with in a real world scenario, and wouldn't have to worry about this. 

In [3]:
#Sample 500,000 values from the dataframe
df = flights.sample(n=1000000, random_state=64)

#Print the shape to make sure it sampled correctly
print("Shape of our new project dataset:", df.shape)

Shape of our new project dataset: (1000000, 31)


## 2. Preprocessing.

Having taken a good look a the data, we can now start to clean it. Before we split the data or build the model, it is important to make sure the data is ready for the model and any other transformations. Additionally, we should also take this oppurtunity to remove any features that will not assist our model, as well as features that wouldn't be available to us if we're trying to make predictions, like elapsed time. 

### Dealing with Null Values and Cleaning Columns

Let's first check how many null entries we have.

In [4]:
#Check how many null values are in the dataframe
#Here, we filter the columns to show only those with null values   
print(df.isnull().sum()[df.isnull().sum() > 0])



TAIL_NUMBER              2542
DEPARTURE_TIME          14630
DEPARTURE_DELAY         14630
TAXI_OUT                15148
WHEELS_OFF              15148
SCHEDULED_TIME              1
ELAPSED_TIME            17816
AIR_TIME                17816
WHEELS_ON               15705
TAXI_IN                 15705
ARRIVAL_TIME            15705
ARRIVAL_DELAY           17816
CANCELLATION_REASON    984714
AIR_SYSTEM_DELAY       817169
SECURITY_DELAY         817169
AIRLINE_DELAY          817169
LATE_AIRCRAFT_DELAY    817169
WEATHER_DELAY          817169
dtype: int64


Here we can see that we have missing values in a lot of rows. Considering what we learned prior as well, we can start by simply dropping columns that won't help in our prediction or wouldn't be available to us normal, like tail_number or Weather_Delay respectively. Additionally, features like flight number, though completely filled out, won't assist our model in creating predictions, so we'll drop that too. In the same idea, while the year seems like it could be an important feature, all of this data is from 2015, so that means the entire feature only has one output. This won't assist our model either, so we're free to drop this as well.

Dropping these extra columns makes it easier for our model to predict in the long run. For columns with more value to us such as Arrival_delay or departure delay, we can impute, or fill in the missing values with 0, as in there's nothing there. Specifically with Arrival_delay however, we will use it to create our target, and then we can drop it. 

We'll end up dropping a lot of features here, but it will help our model run in the end.

In [5]:
#List of columns to keep for now (features available before flight plus columns for target creation)
columns_to_keep = [
    'MONTH', 'DAY', 'DAY_OF_WEEK', 'AIRLINE',
    'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT',
    'SCHEDULED_DEPARTURE', 'SCHEDULED_ARRIVAL',
    'SCHEDULED_TIME', 'DISTANCE',
    'ARRIVAL_DELAY','CANCELLED'
]

#Select only these columns
df = df[columns_to_keep]

#Drop all other columns because they are either only known after the flight (post-flight stats),
#unique identifiers not useful for prediction, or explanation columns that are unavailable at prediction time

# Impute any remaining null values in the kept columns with zero
df = df.fillna(0)

#View the cleaned dataframe (takes time to load due to size)
display(df)

Unnamed: 0,MONTH,DAY,DAY_OF_WEEK,AIRLINE,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,SCHEDULED_ARRIVAL,SCHEDULED_TIME,DISTANCE,ARRIVAL_DELAY,CANCELLED
5266008,11,25,3,WN,DEN,PDX,2220,15,175.0,991,-9.0,0
1284652,3,24,2,MQ,LAW,DFW,1722,1809,47.0,140,-8.0,0
4745705,10,23,5,UA,13930,12266,1550,1835,165.0,925,8.0,0
3471346,8,4,2,OO,SFO,BUR,1643,1808,85.0,326,26.0,0
649019,2,12,4,DL,MEM,ATL,1835,2057,82.0,332,-11.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1820600,4,26,7,OO,IND,DEN,2024,2112,168.0,977,-11.0,0
1603501,4,13,1,WN,PVD,BWI,1205,1325,80.0,327,2.0,0
4291158,9,24,4,WN,STL,HOU,2110,2305,115.0,687,-13.0,0
4925302,11,4,3,AA,BWI,DFW,1030,1300,210.0,1217,-20.0,0


With that, we can see that our data has been cleaned. Many features that wouldn't typically be available to us have been dropped, and all null values have been imputed as well.

### Time Based Feature Engineering

Now that our data is cleaned and ready, it's time to move on to creating new features that will help our model make better predictions. Feature engineering is a crucial step in any machine learning workflow, as it allows us to extract more meaningful information from our existing data.

Some of the new features we can create include:

- **Hour of Day**: Extracting the hour from the scheduled departure time can help capture daily patterns in flight delays.
- **Delay**: We'll also need to create a target feature that categorizes whether or not we have a delay. If the flight takes more than 10 minutes to be where it needs to be, we'll consider it as late. For our purposes as well, we'll considered canceled flights as delayed since canceled flights likely occur for the same reasons as delays, just more severe. With what data we have, our goal is to create a model to see if we can reveal patterns in the data. Though for now, our focus is just the model.

By engineering these features, we provide our model with richer information, which can lead to improved predictive performance. Additionally, after creating this, we can drop certain features just meant to help create the target, such as Arrival_Delay.

We'll start by importing numpy for the .clip function so that we can use it later on as well. Then we'll create our hour of day feature, and then our target.

In [6]:
#import numpy for numerical operations
import numpy as np

#Create hour of day from scheduled departure (HHMM format)
#Divide by 100 and take the integer part to get the hour (e.g., 1530 -> 15)
df['HOUR_OF_DAY'] = df['SCHEDULED_DEPARTURE'] // 100

#Ensure hour is in range 0-23 (in rare cases, scheduled departure can be 2400)
df['HOUR_OF_DAY'] = df['HOUR_OF_DAY'].clip(upper=23)

#Create a function to determine delay 
def delays(row):
    delay = row['ARRIVAL_DELAY']
    if delay < 10:
        return 0  #On-time
    elif delay >= 10 or row['CANCELLED'] == 1:
        return 1  #Delayed/Canceled.

#Apply our new delay severity function to each row and create a new column
df['delay'] = df.apply(delays, axis=1)


With our target and extra features created, we can go ahead and drop some of the remaining features that might not help us or wouldn't be available to us, such as arrival delay and cancelled. 

In [7]:
#Drop the original ARRIVAL_DELAY and CANCELLED columns
df = df.drop(columns=['ARRIVAL_DELAY', 'CANCELLED'])

#Display the final dataframe with the new target column
display(df)

Unnamed: 0,MONTH,DAY,DAY_OF_WEEK,AIRLINE,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,SCHEDULED_ARRIVAL,SCHEDULED_TIME,DISTANCE,HOUR_OF_DAY,delay
5266008,11,25,3,WN,DEN,PDX,2220,15,175.0,991,22,0
1284652,3,24,2,MQ,LAW,DFW,1722,1809,47.0,140,17,0
4745705,10,23,5,UA,13930,12266,1550,1835,165.0,925,15,0
3471346,8,4,2,OO,SFO,BUR,1643,1808,85.0,326,16,1
649019,2,12,4,DL,MEM,ATL,1835,2057,82.0,332,18,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1820600,4,26,7,OO,IND,DEN,2024,2112,168.0,977,20,0
1603501,4,13,1,WN,PVD,BWI,1205,1325,80.0,327,12,0
4291158,9,24,4,WN,STL,HOU,2110,2305,115.0,687,21,0
4925302,11,4,3,AA,BWI,DFW,1030,1300,210.0,1217,10,0


### One-Hot Encoding

Something else we have to take care of while preprocessing is our categorical features. Many models expect to see numerical values, so we'll have to deal with our Airline, Destination Airport, and Origin Airport features, which aren't numerical. It's here that it helps to have inspected the other datasets, such as `airlines.csv` and `airports.csv`. You'll notice that there are less than twenty airlines, while there are hundreds of airports. Our plan is to use a tecnhique called one-hot encoding, where we give each possible outcome in a feature it's own column. While this will be ok for the airlines, creating hundreds of features for the airports will be an issue, so we'll choose the top 10 airpots for destinations and origins.

We'll start by using pd.get_dummies to one-hot encode the airlines, and then use other pandas functions alongside pd.get_dummies to choose the top ten origin and destination airports, and put the remaining airports in an 'other' column.

In [8]:
#One-hot encode the AIRLINE column. Keep the original column for reference for later.
df['AIRLINE_ORIGINAL'] = df['AIRLINE']
df = pd.get_dummies(df, columns=['AIRLINE'])

#Now one-hot encode the top 10 busiest airports 
#Find the busiest N origin airports
top_airports = df['ORIGIN_AIRPORT'].value_counts().nlargest(10).index

#Create a new column grouping rare airports as "OTHER"
df['ORIGIN_AIRPORT_GROUPED'] = df['ORIGIN_AIRPORT'].apply(lambda x: x if x in top_airports else 'OTHER')

#One-hot encode the grouped column
origin_airport_dummies = pd.get_dummies(df['ORIGIN_AIRPORT_GROUPED'], prefix='ORIGIN')

#Repeat for destination airports
top_dest_airports = df['DESTINATION_AIRPORT'].value_counts().nlargest(10).index
df['DESTINATION_AIRPORT_GROUPED'] = df['DESTINATION_AIRPORT'].apply(lambda x: x if x in top_dest_airports else 'OTHER')
dest_airport_dummies = pd.get_dummies(df['DESTINATION_AIRPORT_GROUPED'], prefix='DEST')

#Concatenate back to your DataFrame
df = pd.concat([df, origin_airport_dummies, dest_airport_dummies], axis=1)

#display the final DataFrame with one-hot encoded columns
display(df)

#print columns of the final DataFrame since the previous result is likely truncated
print(df.columns)


Unnamed: 0,MONTH,DAY,DAY_OF_WEEK,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,SCHEDULED_ARRIVAL,SCHEDULED_TIME,DISTANCE,HOUR_OF_DAY,...,DEST_DEN,DEST_DFW,DEST_IAH,DEST_LAS,DEST_LAX,DEST_MSP,DEST_ORD,DEST_OTHER,DEST_PHX,DEST_SFO
5266008,11,25,3,DEN,PDX,2220,15,175.0,991,22,...,False,False,False,False,False,False,False,True,False,False
1284652,3,24,2,LAW,DFW,1722,1809,47.0,140,17,...,False,True,False,False,False,False,False,False,False,False
4745705,10,23,5,13930,12266,1550,1835,165.0,925,15,...,False,False,False,False,False,False,False,True,False,False
3471346,8,4,2,SFO,BUR,1643,1808,85.0,326,16,...,False,False,False,False,False,False,False,True,False,False
649019,2,12,4,MEM,ATL,1835,2057,82.0,332,18,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1820600,4,26,7,IND,DEN,2024,2112,168.0,977,20,...,True,False,False,False,False,False,False,False,False,False
1603501,4,13,1,PVD,BWI,1205,1325,80.0,327,12,...,False,False,False,False,False,False,False,True,False,False
4291158,9,24,4,STL,HOU,2110,2305,115.0,687,21,...,False,False,False,False,False,False,False,True,False,False
4925302,11,4,3,BWI,DFW,1030,1300,210.0,1217,10,...,False,True,False,False,False,False,False,False,False,False


Index(['MONTH', 'DAY', 'DAY_OF_WEEK', 'ORIGIN_AIRPORT', 'DESTINATION_AIRPORT',
       'SCHEDULED_DEPARTURE', 'SCHEDULED_ARRIVAL', 'SCHEDULED_TIME',
       'DISTANCE', 'HOUR_OF_DAY', 'delay', 'AIRLINE_ORIGINAL', 'AIRLINE_AA',
       'AIRLINE_AS', 'AIRLINE_B6', 'AIRLINE_DL', 'AIRLINE_EV', 'AIRLINE_F9',
       'AIRLINE_HA', 'AIRLINE_MQ', 'AIRLINE_NK', 'AIRLINE_OO', 'AIRLINE_UA',
       'AIRLINE_US', 'AIRLINE_VX', 'AIRLINE_WN', 'ORIGIN_AIRPORT_GROUPED',
       'DESTINATION_AIRPORT_GROUPED', 'ORIGIN_ATL', 'ORIGIN_DEN', 'ORIGIN_DFW',
       'ORIGIN_IAH', 'ORIGIN_LAS', 'ORIGIN_LAX', 'ORIGIN_ORD', 'ORIGIN_OTHER',
       'ORIGIN_PHX', 'ORIGIN_SEA', 'ORIGIN_SFO', 'DEST_ATL', 'DEST_DEN',
       'DEST_DFW', 'DEST_IAH', 'DEST_LAS', 'DEST_LAX', 'DEST_MSP', 'DEST_ORD',
       'DEST_OTHER', 'DEST_PHX', 'DEST_SFO'],
      dtype='object')


You'll notice that not only do we have a lot more code for encoding the airports, but we use the pd.get_dummies() function differently. What we're ultimately doing is creating two one-hot encoded dataframes, one for origin airpors, and one for destination airports, and concatenated them into our original dataframe. We specifically make it so that any airport not in the top 10 gets sent to the 'OTHER' column in both dataframes. In the end, we'll have our one-hot encoded airlines, as well as DEST_OTHER and ORIGIN_OTHER.

However, you might also notice that this actually leaves our temporary grouped features, as well as the original airport features as well. It's always important to make sure you properly deal with features left behind like this. Before moving on, we'll remove these features from the data set. We'll specifcally keep the AIRLINE_ORIGINAL and ORIGIN_AIRPORT features, as we'll need them later for further feature engineering.


In [9]:
#Remove the original ORIGIN_AIRPORT and DESTINATION_AIRPORT columns
df = df.drop(columns=['DESTINATION_AIRPORT', 'ORIGIN_AIRPORT_GROUPED', 'DESTINATION_AIRPORT_GROUPED'])

### Cyclical encoding

As we continue our feature engineering, it’s important to consider how certain features represent repeating patterns in time, such as the hour of the day or the day of the week. Unlike straightforward numerical variables, time features are inherently cyclical; midnight and 11 PM are only an hour apart, even though their values (0 and 23) appear far apart numerically. If we were to encode these features as simple numbers, our model might misunderstand the true relationship between different hours or days. To address this, we’ll use a technique called **cyclical encoding**, where each time-based value is transformed using sine and cosine functions. This method effectively wraps the feature around a circle, ensuring that values at the “edges” (like the start and end of a day or week) are considered close together. For example, we’ll convert our hour of day and day of week columns into their sine and cosine representations, allowing our model to better recognize natural patterns and periodic trends in flight delays. By capturing these cyclical relationships, we provide our model with a more accurate understanding of how time influences flight performance.


In [10]:
#Encode hour of day (0–23) (using 24 as the period for sine/cosine, aka the full day cycle)
#This way
df['HOUR_SIN'] = np.sin(2 * np.pi * df['HOUR_OF_DAY'] / 24)
df['HOUR_COS'] = np.cos(2 * np.pi * df['HOUR_OF_DAY'] / 24)

#Encode day of week (using 24 as the period for sine/cosine, aka the full week cycle)
df['DOW_SIN'] = np.sin(2 * np.pi * df['DAY_OF_WEEK'] / 7)
df['DOW_COS'] = np.cos(2 * np.pi * df['DAY_OF_WEEK'] / 7)

#Drop the original HOUR_OF_DAY and DAY_OF_WEEK columns
df = df.drop(columns=['HOUR_OF_DAY', 'DAY_OF_WEEK'])

#print columns of the DataFrame to see the new features
print(df.columns)

Index(['MONTH', 'DAY', 'ORIGIN_AIRPORT', 'SCHEDULED_DEPARTURE',
       'SCHEDULED_ARRIVAL', 'SCHEDULED_TIME', 'DISTANCE', 'delay',
       'AIRLINE_ORIGINAL', 'AIRLINE_AA', 'AIRLINE_AS', 'AIRLINE_B6',
       'AIRLINE_DL', 'AIRLINE_EV', 'AIRLINE_F9', 'AIRLINE_HA', 'AIRLINE_MQ',
       'AIRLINE_NK', 'AIRLINE_OO', 'AIRLINE_UA', 'AIRLINE_US', 'AIRLINE_VX',
       'AIRLINE_WN', 'ORIGIN_ATL', 'ORIGIN_DEN', 'ORIGIN_DFW', 'ORIGIN_IAH',
       'ORIGIN_LAS', 'ORIGIN_LAX', 'ORIGIN_ORD', 'ORIGIN_OTHER', 'ORIGIN_PHX',
       'ORIGIN_SEA', 'ORIGIN_SFO', 'DEST_ATL', 'DEST_DEN', 'DEST_DFW',
       'DEST_IAH', 'DEST_LAS', 'DEST_LAX', 'DEST_MSP', 'DEST_ORD',
       'DEST_OTHER', 'DEST_PHX', 'DEST_SFO', 'HOUR_SIN', 'HOUR_COS', 'DOW_SIN',
       'DOW_COS'],
      dtype='object')


By encoding time features using both sine and cosine, the model can recognize that the beginning and end of cycles like hours or days are actually close together. For example:

- If hour = 0, the angle is 2π×0/24=0 radians.
- If hour = 12, the angle is 2π×12/24=π radians (directly opposite on the circle).
- If day = 0, the angle is 2π×0/7=0.
- If day = 6, the angle is 2π×6/7 (almost a full circle).

A fun application of precalculus! This allows the model to learn true circular patterns in the data and more accurately capture the periodic trends that affect flight delays.

With this, we have done a majority of our preprocessing. The remaing preprocessing would specifically need to be done on training data, so for that, we'll move on to the train test split.

## 3. Train Test Split

At this point, we can split our data into training and testing data. This leaves us with data to train our model with and data to test our model with. As per standard, we'll be doing an 80-20 split, so 80% of the data will be for training, and 20% for testing. This allows us to later evaluate our model and observe how well it performs. Before doing this, we'll also separate our features from our target, just so that any future transformations or processing won't affect our target. Additionally, having our target separate gives our model something to try to predict. It can view the target to learn during training, but not during testing. 

We'll start by importing the test_train_split module from Sklearn. It does the split for us thankfully. 

In [11]:
#Import the train_test_split function from sklearn
from sklearn.model_selection import train_test_split

#Split the data into training and testing sets
X = df.drop(columns=['delay'])
y = df['delay']

#Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=64)

## 4. Remaining Preprocessing

While we've already performed extensive cleaning and feature engineering, there are a few crucial preprocessing steps left before modeling. One important task is to calculate aggregate features, such as the average delay per airline or per airport. These features can provide valuable context to our model, helping it recognize patterns specific to certain carriers or locations. However, to avoid data leakage, we must compute these averages using only the training set, and then merge the resulting features into both the training and test sets. This ensures that our model does not inadvertently gain information from the test data during training.

Finally, to ensure all features are on a comparable scale, we'll fit a scaler (such as StandardScaler) on the training data and use it to transform both the training and test sets. By performing these steps after the train test split and fitting transformations only on the training data, we maintain the integrity of our evaluation and prevent information from the test set from influencing our preprocessing or model training. This, alongside aggregation, is why we waited until the train 

### Aggregate Features

As part of our feature engineering process, we can further enhance our model by introducing aggregate features. These are summary statistics calculated from the training data and added as new columns to our dataset. By including these aggregate values, our model gains additional context about the typical delay performance of different airlines and airports, allowing it to make more informed predictions even without access to external factors like weather. In other words, our model would be able to view historical patterns and tendencies, which might help it understand predict the likelihood of future delays. It's for this that we kept our original AIRLINE and ORIGIN_AIRPORT column. After doing this, we can remove the feature from the testing and training data.

In [12]:
#Attach target to X_train for grouping
train_with_target = X_train.copy()
train_with_target['delay'] = y_train

#Calculate average delay severity per airline
avg_airline_delay = train_with_target.groupby('AIRLINE_ORIGINAL')['delay'].mean()

#Calculate average delay severity per origin airport
avg_origin_delay = train_with_target.groupby('ORIGIN_AIRPORT')['delay'].mean()

#Map these averages back to both train and test
X_train['AVG_AIRLINE_DELAY'] = X_train['AIRLINE_ORIGINAL'].map(avg_airline_delay)
X_test['AVG_AIRLINE_DELAY'] = X_test['AIRLINE_ORIGINAL'].map(avg_airline_delay)

X_train['AVG_ORIGIN_DELAY'] = X_train['ORIGIN_AIRPORT'].map(avg_origin_delay)
X_test['AVG_ORIGIN_DELAY'] = X_test['ORIGIN_AIRPORT'].map(avg_origin_delay)

#Fill any missing values in test (if an airline or airport never appeared in train) with the global average
global_avg = y_train.mean()
X_test['AVG_AIRLINE_DELAY'] = X_test['AVG_AIRLINE_DELAY'].fillna(global_avg)
X_test['AVG_ORIGIN_DELAY'] = X_test['AVG_ORIGIN_DELAY'].fillna(global_avg)

#Drop the original AIRLINE and ORIGIN_AIRPORT columns
X_train = X_train.drop(columns=['AIRLINE_ORIGINAL', 'ORIGIN_AIRPORT'])
X_test = X_test.drop(columns=['AIRLINE_ORIGINAL', 'ORIGIN_AIRPORT'])

#Display the final training and testing sets
display(X_train)
display(X_test)

Unnamed: 0,MONTH,DAY,SCHEDULED_DEPARTURE,SCHEDULED_ARRIVAL,SCHEDULED_TIME,DISTANCE,AIRLINE_AA,AIRLINE_AS,AIRLINE_B6,AIRLINE_DL,...,DEST_ORD,DEST_OTHER,DEST_PHX,DEST_SFO,HOUR_SIN,HOUR_COS,DOW_SIN,DOW_COS,AVG_AIRLINE_DELAY,AVG_ORIGIN_DELAY
4261762,9,23,735,1018,163.0,978,False,True,False,False,...,False,True,False,False,0.965926,-0.258819,4.338837e-01,-0.900969,0.177309,0.226525
4435015,10,4,915,1208,173.0,1211,False,False,True,False,...,False,True,False,False,0.707107,-0.707107,-2.449294e-16,1.000000,0.263209,0.159590
2472833,6,6,745,905,80.0,338,False,False,False,False,...,False,True,False,False,0.965926,-0.258819,-7.818315e-01,0.623490,0.235924,0.237530
1128714,3,15,920,1755,335.0,2586,False,False,False,False,...,False,True,False,False,0.707107,-0.707107,-2.449294e-16,1.000000,0.236186,0.239073
5199481,11,21,1530,1717,167.0,967,False,False,False,False,...,False,False,False,True,-0.707107,-0.707107,-7.818315e-01,0.623490,0.304473,0.266604
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2445011,6,4,1348,1701,133.0,763,False,False,False,False,...,False,True,False,False,-0.258819,-0.965926,-4.338837e-01,-0.900969,0.234564,0.277056
3120027,7,14,2105,2210,65.0,319,False,False,False,False,...,False,True,False,False,-0.707107,0.707107,9.749279e-01,-0.222521,0.233737,0.238049
241016,1,16,1442,2020,218.0,1514,False,False,False,False,...,True,False,False,False,-0.500000,-0.866025,-9.749279e-01,-0.222521,0.241380,0.255064
2440773,6,4,940,951,71.0,214,False,False,False,True,...,False,True,False,False,0.707107,-0.707107,-4.338837e-01,-0.900969,0.170237,0.201499


Unnamed: 0,MONTH,DAY,SCHEDULED_DEPARTURE,SCHEDULED_ARRIVAL,SCHEDULED_TIME,DISTANCE,AIRLINE_AA,AIRLINE_AS,AIRLINE_B6,AIRLINE_DL,...,DEST_ORD,DEST_OTHER,DEST_PHX,DEST_SFO,HOUR_SIN,HOUR_COS,DOW_SIN,DOW_COS,AVG_AIRLINE_DELAY,AVG_ORIGIN_DELAY
2738426,6,22,600,903,123.0,719,False,False,False,False,...,False,True,False,False,1.000000e+00,6.123234e-17,7.818315e-01,0.623490,0.242452,0.277056
5735928,12,26,1645,1830,105.0,583,True,False,False,False,...,False,True,False,False,-8.660254e-01,-5.000000e-01,-7.818315e-01,0.623490,0.222008,0.258993
2150895,5,17,1120,1420,120.0,707,False,False,False,True,...,False,False,False,False,2.588190e-01,-9.659258e-01,-2.449294e-16,1.000000,0.170237,0.183962
5765306,12,28,1329,1612,223.0,1183,False,False,False,True,...,False,True,False,False,-2.588190e-01,-9.659258e-01,7.818315e-01,0.623490,0.170237,0.267566
5666333,12,21,2220,2330,70.0,327,False,False,False,False,...,False,True,False,False,-5.000000e-01,8.660254e-01,7.818315e-01,0.623490,0.233737,0.270436
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5463331,12,9,550,710,80.0,273,False,False,True,False,...,False,True,False,False,9.659258e-01,2.588190e-01,4.338837e-01,-0.900969,0.263209,0.211180
5457793,12,8,1603,2055,172.0,1205,True,False,False,False,...,False,False,False,False,-8.660254e-01,-5.000000e-01,9.749279e-01,-0.222521,0.222008,0.185624
2866311,6,29,1525,1656,151.0,725,False,False,False,False,...,False,True,False,False,-7.071068e-01,-7.071068e-01,7.818315e-01,0.623490,0.234564,0.248406
1992551,5,7,1240,2117,337.0,2446,False,False,False,True,...,False,True,False,False,1.224647e-16,-1.000000e+00,-4.338837e-01,-0.900969,0.170237,0.216402


### Normalizing the Data
With our final feature in place and with all extra features removed, we can finally start normalizing the data. This means that we scale the numerical so that the mean is 0 and the standard deviation is 1. We do this so that in case the model we use prioritizes larger values, it won't be skewed by said larger values. For our one-hot encoded columns however, normalization can actually make By scaling down everything within a certain range, the model can learn from everything without bias.

We'll start by importing StandardScaler and ColumnTransformer from Sklearn, fit it on the training data, and then transform both the training and testing data. We fit the object only on the training data so that it isn't able to make any inferences on the testing data. ColumnTransformer allows us to specifically transform certain columns, so as to not affect our one-hot encoded columns. We'll have to list our one-hot encoded columns so that we can get them ready for the transformations.

In [13]:
#Import the StandardScaler and ColumnTransformer from sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

#List the columns we don't want to scale
passthrough = [
    'AIRLINE_AA', 'AIRLINE_AS', 'AIRLINE_B6', 'AIRLINE_DL', 'AIRLINE_EV',
    'AIRLINE_F9', 'AIRLINE_HA', 'AIRLINE_MQ', 'AIRLINE_NK', 'AIRLINE_OO',
    'AIRLINE_UA', 'AIRLINE_US', 'AIRLINE_VX', 'AIRLINE_WN', 'ORIGIN_ATL',
    'ORIGIN_DEN', 'ORIGIN_DFW', 'ORIGIN_IAH', 'ORIGIN_LAS', 'ORIGIN_LAX',
    'ORIGIN_ORD', 'ORIGIN_OTHER', 'ORIGIN_PHX','ORIGIN_SEA', 'ORIGIN_SFO',
    'DEST_ATL', 'DEST_DEN', 'DEST_DFW', 'DEST_IAH', 'DEST_LAS', 'DEST_LAX',
    'DEST_MSP', 'DEST_ORD', 'DEST_OTHER', 'DEST_PHX', 'DEST_SFO'
]

#List the remaining columns to scale
columns_to_scale = [
    'MONTH', 'DAY', 'SCHEDULED_DEPARTURE', 'SCHEDULED_ARRIVAL',
    'SCHEDULED_TIME', 'DISTANCE', 'HOUR_SIN', 'HOUR_COS', 'DOW_SIN', 'DOW_COS',
    'AVG_AIRLINE_DELAY', 'AVG_ORIGIN_DELAY'
]

#Create a StandardScaler instance
scaler = StandardScaler()

#Set up the ColumnTransformer to scale the numerical features and pass through the one-hot encoded features
#The first strings are just the labels for the transformers, which can be anything descriptive
#The second parameter is the transformer to apply, and the third is the columns to apply it to
preprocessor = ColumnTransformer(
    transformers=[
        ('scale', StandardScaler(), columns_to_scale),
        ('pass', 'passthrough', passthrough)
    ],
    remainder='drop' # Good practice to ensure no columns are accidentally left out
)

#Fit the preprocessor on the training data
preprocessor.fit(X_train)

#Transform both training and testing data
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)


#Add the column names back to the transformed data
all_columns = (
    columns_to_scale + passthrough
)

#display the scaled training and testing sets. 
#We specify the data as a dataframe since our actions made it a numpy array
#Additonally, viewing a dataframe is more readable than a numpy array 
print("Scaled Training Set:")
display(pd.DataFrame(X_train_processed, columns=all_columns, index=X_train.index))
print("Scaled Testing Set:")
display(pd.DataFrame(X_test_processed, columns=all_columns, index=X_test.index))



Scaled Training Set:


Unnamed: 0,MONTH,DAY,SCHEDULED_DEPARTURE,SCHEDULED_ARRIVAL,SCHEDULED_TIME,DISTANCE,HOUR_SIN,HOUR_COS,DOW_SIN,DOW_COS,...,DEST_DEN,DEST_DFW,DEST_IAH,DEST_LAS,DEST_LAX,DEST_MSP,DEST_ORD,DEST_OTHER,DEST_PHX,DEST_SFO
4261762,0.728084,0.830831,-1.230609,-0.936663,0.285037,0.257635,1.401781,0.218718,0.588789,-1.240107,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4435015,1.021807,-1.332274,-0.858334,-0.562158,0.418176,0.641519,1.057430,-0.619311,-0.025406,1.448102,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2472833,-0.153085,-1.104579,-1.209927,-1.159395,-0.820022,-0.796811,1.401781,0.218718,-1.132147,0.915669,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1128714,-1.034254,-0.079950,-0.847993,0.516022,2.575037,2.906931,1.057430,-0.619311,-0.025406,1.448102,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5199481,1.315530,0.603135,0.413607,0.441121,0.338293,0.239512,-0.824140,-0.619311,-1.132147,0.915669,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2445011,-0.153085,-1.332274,0.037195,0.409583,-0.114382,-0.096593,-0.227707,-1.103147,-0.639601,-1.240107,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3120027,0.140638,-0.193798,1.602819,1.412862,-1.019731,-0.828115,-0.824140,2.024418,1.354678,-0.280696,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
241016,-1.621701,0.033897,0.231605,1.038357,1.017304,1.140734,-0.548591,-0.916393,-1.405490,-0.280696,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2440773,-0.153085,-1.332274,-0.806629,-1.068726,-0.939847,-1.001110,1.057430,-0.619311,-0.639601,-1.240107,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


Scaled Testing Set:


Unnamed: 0,MONTH,DAY,SCHEDULED_DEPARTURE,SCHEDULED_ARRIVAL,SCHEDULED_TIME,DISTANCE,HOUR_SIN,HOUR_COS,DOW_SIN,DOW_COS,...,DEST_DEN,DEST_DFW,DEST_IAH,DEST_LAS,DEST_LAX,DEST_MSP,DEST_ORD,DEST_OTHER,DEST_PHX,DEST_SFO
2738426,-0.153085,0.716983,-1.509816,-1.163337,-0.247522,-0.169086,1.447116,0.702554,1.081335,0.915669,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5735928,1.609253,1.172373,0.651449,0.663852,-0.487173,-0.393156,-1.035577,-0.232146,-1.132147,0.915669,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2150895,-0.446808,0.147745,-0.434354,-0.144290,-0.287463,-0.188857,0.460996,-1.103147,-0.025406,1.448102,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5765306,1.609253,1.400069,-0.002101,0.234157,1.083874,0.595387,-0.227707,-1.103147,1.081335,0.915669,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5666333,1.609253,0.603135,1.840662,1.649392,-0.953161,-0.814934,-0.548591,2.321500,1.081335,0.915669,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5463331,1.609253,-0.763036,-1.613226,-1.543756,-0.820022,-0.903903,1.401781,1.186390,0.588789,-1.240107,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5457793,1.609253,-0.876883,0.564585,1.107345,0.404862,0.631634,-1.035577,-0.232146,1.354678,-0.280696,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2866311,-0.153085,1.513916,0.403266,0.320885,0.125269,-0.159201,-0.824140,-0.619311,1.081335,0.915669,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1992551,-0.446808,-0.990731,-0.186170,1.229552,2.601665,2.676271,0.116645,-1.166845,-0.639601,-1.240107,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


Now while the values in the data might not make a lot of sense to us, it'll make a lot more sense to whatever model we use.

## 5. Building and Training the Model

Finally we can move on to building our Random Forest model. 

We'll begin by building our Random Forest model. This model is a powerful ensemble method that is particularly well-suited for our task because it can capture complex, non-linear interactions between features that might be missed by a linear model. It works by constructing a multitude of decision trees during training and outputting the class that is the mode of the classes of the individual trees. This approach makes it robust and often leads to high predictive accuracy.

To implement this, we will first import the RandomForestClassifier from sklearn. After creating an instance of the model, we will then fit it using our scaled and preprocessed training data.

In [14]:
#import the RandomForestClassifier from sklearn
from sklearn.ensemble import RandomForestClassifier

#Create a RandomForestClassifier instance
model = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=64)

#Fit the model on the training data
model.fit(X_train_processed, y_train)


With that, our model has been created and trained. For those of you who ran the code yourself, you'll notice that it took quite a long while to fit the model. That's only to be expected, as we have a significantly large dataset, and we are building 100 different decision trees for this forest. As you move into real world examples, running times tend to increase as well. Now it's time to test our model and see how well it performs

## 6. Evaluating the Model

With our model trained, it's now time to see how well it predicts. From Sklearn, we can import several different metrics off success. After testing our models with said metrics, we'll evaluate it to see how well it performs.

In [15]:
#Import necessary libraries for evaluation
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

#Make predictions on the validation set, and store it in a variable. 
y_pred = model.predict(X_test_processed)

#Check the accuracy of our predictions
print("Validation Accuracy:", accuracy_score(y_test, y_pred))

#Display the confusion matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

#Print the classification report for more details
print("Classification Report:\n", classification_report(y_test, y_pred))

Validation Accuracy: 0.769065
Confusion Matrix:
 [[147585   7110]
 [ 39077   6228]]
Classification Report:
               precision    recall  f1-score   support

           0       0.79      0.95      0.86    154695
           1       0.47      0.14      0.21     45305

    accuracy                           0.77    200000
   macro avg       0.63      0.55      0.54    200000
weighted avg       0.72      0.77      0.72    200000



### Analysis

#### Confusion Matrix
The confusion matrix breaks down the model’s predictions for on-time versus delayed flights. Each row represents the actual class (0 for on-time, 1 for delayed), while each column shows the predicted class. The top-left value (147,585) indicates the high number of on-time flights that were correctly identified. The bottom-right value (6,228) shows the number of delayed flights the model successfully predicted. The off-diagonal values reveal where the model made mistakes: 7,110 on-time flights were incorrectly flagged as delayed, but more significantly, 39,077 delayed flights were missed by the model and predicted to be on-time.

#### Classification Report
Precision, recall, and F1-score give us more insight into how well the model handles each class:

- Precision: For class 0 (on-time), precision is strong at 0.79, meaning when the model predicts a flight is on-time, it is generally correct. For class 1 (delayed), precision is much lower at 0.47, indicating that when the model does predict a delay, it is only correct about half the time.
- Recall: The recall for class 0 is very high at 0.95, showing the model successfully identifies the vast majority of on-time flights. For class 1, however, the recall is very low at 0.14. This remains the model's critical weakness, as it means the model is only able to identify 14% of all flights that were actually delayed.
- F1-score: The F1-scores, which balance precision and recall, summarize the performance gap: a strong 0.86 for on-time flights, but a very weak 0.21 for delayed flights, confirming that the model is not yet reliable for its primary purpose of predicting delays.

#### Classification Report Analysis
- Overall Accuracy: The model achieved an accuracy of 0.77, meaning it correctly classified 77% of flights. As we've discovered, this high number is misleading because it is mostly driven by the model's success in predicting the majority "on-time" class.
- Class 0 (On-Time): The model is effective and reliable when identifying on-time flights.
- Class 1 (Delayed): The model's performance for delayed flights is poor. The low recall is the key indicator that the model cannot reliably detect a delay.
- Support: The class imbalance, with on-time flights outnumbering delayed flights by a ratio of more than 3-to-1 (154,695 vs. 45,305), remains the most significant challenge and the primary cause of the skewed performance.

#### Project Conclusion and Path to Improvement
Overall, this project successfully demonstrated an end-to-end machine learning workflow, from cleaning a large dataset to advanced feature engineering and model evaluation. The final Random Forest classifier achieved a high overall accuracy of 77%, but a deeper analysis of metrics like recall revealed a significant weakness in predicting the minority 'Delayed' class. This outcome powerfully illustrates the challenges of working with imbalanced, real-world data, where simple accuracy can be a misleading metric. It also illustrates the importance of evaluation metrics aside from accuracy.

The most significant improvements to this model would come not from more advanced algorithms, but from more informative data. In a real-world operational setting at an airline or aviation authority, a data science team would be provided with or tasked to source live data feeds that are truly predictive of delays. This includes:

- Real-time Weather Data: Detailed forecasts, wind speed, precipitation, and visibility information for both the origin and destination airports around the scheduled departure and arrival times.
- Air Traffic System Data: Information on national airspace status, airport congestion, runway availability, and ground stops issued by air traffic control.
- Aircraft-Specific Data: The specific aircraft's recent maintenance history, its age, and the status of its previous incoming flight (to model the effect of cascading delays).

For the purposes of this educational project, however, the skills developed are invaluable and directly transferable. You have successfully navigated the challenges of data cleaning, implementing cyclical feature encoding, engineering new features from existing ones, building a robust data pipeline, and critically interpreting a model's performance beyond its accuracy score. The hurdles encountered here are not failures, but realistic representations of the problems faced in professional data science roles. This workflow provides a strong foundation for tackling any classification problem, highlighting the practical steps and critical thinking needed to turn raw data into actionable insights.

Excellent work on seeing this complex project through to the end. Congratulations!