# Flight Delay Project - Data Cleaning
- The goal of this project is to demonstrate relevant data science skills using data from US domestic flights.

- This first notebook focuses on data cleaning, data engineering, and data preparation.

- Future notebooks will use this dataset for exploratory data analysis and to predict flight delays by tuning machine learning algorithms.

### The Aim of This Notebook Is To:

- Preprocess the dataset to prepare for data analysis and machine learning.
- Clean and repair the heterogeneous airport identifiers.
- Clean and standardise the time-related columns to ensure that they are machine friendly.

# Step 1: Import the Dataset

In [1]:
import numpy as np
import pandas as pd

In [2]:
# airlines_df includes information about each airline
airlines_df = pd.read_csv("data/airlines.csv")

# airports_df includes information about each airport
airports_df = pd.read_csv("data/airports.csv")

# flights_df includes information about each flight and will be the primary dataframe in this project.
flights_df = pd.read_csv("data/flights.csv",
                         dtype={"YEAR":"category",
                                "MONTH":"uint8",
                                "DAY":"uint8",
                                "DAY_OF_WEEK":"uint8",
                                "AIRLINE":"category",
                                "FLIGHT_NUMBER":"category",
                                "TAIL_NUMBER":"category",
                                "ORIGIN_AIRPORT":"str",
                                "DESTINATION_AIRPORT":"str",
                                "SCHEDULED_DEPARTURE":"uint16",
                                "SCHEDULED_ARRIVAL":"uint16",
                                "DIVERTED":"int8",
                                "CANCELLED":"int8"})

print(f"flights_df is the main dataframe with {flights_df.shape[1]} features and {flights_df.shape[0]:,} rows.  Each row is a flight")

flights_df is the main dataframe with 31 features and 5,819,079 rows.  Each row is a flight
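
As a rough illustration of why the dtypes are declared up front, the sketch below reads a tiny invented CSV twice and compares memory usage. The sample rows, `csv_text`, `default_df`, and `typed_df` are assumptions made for this example and are not part of the real dataset.

```python
import io

import pandas as pd

# Hypothetical miniature of flights.csv; the rows are invented for illustration.
csv_text = "AIRLINE,MONTH,ORIGIN_AIRPORT\n" + "AA,1,LAX\nAS,1,SEA\nUS,10,14771\n" * 1000

# Let pandas infer the dtypes.
default_df = pd.read_csv(io.StringIO(csv_text))

# Declare compact dtypes, and keep the airport column as text so that
# IATA codes and numeric DOT codes share one type.
typed_df = pd.read_csv(io.StringIO(csv_text),
                       dtype={"AIRLINE": "category",
                              "MONTH": "uint8",
                              "ORIGIN_AIRPORT": "str"})

default_bytes = default_df.memory_usage(deep=True).sum()
typed_bytes = typed_df.memory_usage(deep=True).sum()
print(f"inferred dtypes: {default_bytes:,} bytes, declared dtypes: {typed_bytes:,} bytes")
```

On a file with millions of rows the same idea should reduce memory noticeably, although the exact saving depends on the pandas version and on how many distinct values each column holds.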


In [3]:
flights_df.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,...,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,5,...,408.0,-22.0,0,0,,,,,,
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,10,...,741.0,-9.0,0,0,,,,,,
2,2015,1,1,4,US,840,N171US,SFO,CLT,20,...,811.0,5.0,0,0,,,,,,
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,20,...,756.0,-9.0,0,0,,,,,,
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,25,...,259.0,-21.0,0,0,,,,,,


# Step 2: Heterogeneous Identifiers
The main issue with the dataset is in the columns "ORIGIN_AIRPORT" and "DESTINATION_AIRPORT".  The majority of the elements are airport IATA codes.  These are the 3-letter codes you can see in the head of the dataframe.  However, all of the elements for the month of October use a different identifier, which is a numeric 5-digit code.

After some research I was able to find information about these codes.  They are internal identifiers used by the US Department of Transportation (DOT).  I downloaded their database of these codes; however, it does not include the corresponding IATA codes used in flights_df (which also comes from the US DOT but was uploaded in 2015).

To download the data, go to:<br>
https://transtats.bts.gov/Fields.asp?gnoyr_VQ=FIL and click on "OriginAirportID".  Unfortunately, the direct link to the data changes regularly.

I will replace the 5-digit DOT codes in flights_df with the 3-character IATA codes by matching the airports in the DOT database with the airports in my original dataset.
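
A minimal sketch of how the two kinds of identifier can be told apart, assuming every DOT code is exactly five digits. The series `airport_codes` is a hypothetical sample, not taken from flights_df.

```python
import pandas as pd

# Hypothetical sample: IATA codes mixed with 5-digit DOT codes.
airport_codes = pd.Series(["ANC", "LAX", "14747", "12892", "SFO"])

# fullmatch is True only when the whole string consists of exactly five digits.
is_dot_code = airport_codes.str.fullmatch(r"\d{5}")

print(airport_codes[is_dot_code].tolist())
```

Matching the whole string is slightly stricter than searching for any digit, which may help if an identifier ever mixes letters and digits.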

### Preparing the Data:
I will first clean and prepare both airports_df (part of my original dataset) and numeric_airports_df (the new dataframe for the DOT codes) in order to match up the airports.

In [4]:
# All flights that depart from an airport identified by a numeric DOT code.
numeric_flights_df = flights_df[flights_df["ORIGIN_AIRPORT"].str.contains(r"\d+", regex=True)]

print(f"Number of flights in October: {len(flights_df[flights_df['MONTH'] == 10]):,}")
print(f"Number of October flights with a numeric code: {numeric_flights_df['MONTH'].value_counts().loc[10]:,}")
print(f"Number of flights with a numeric code in flights_df {len(numeric_flights_df):,}")

Number of flights in October: 486,165
Number of October flights with a numeric code: 486,165
Number of flights with a numeric code in flights_df 486,165


In [5]:
# Reading in the DOT data as numeric_airports_df
numeric_airports_df = pd.read_csv("data/L_AIRPORT_ID.csv")
numeric_airports_df.head()

Unnamed: 0,Code,Description
0,10001,"Afognak Lake, AK: Afognak Lake Airport"
1,10003,"Granite Mountain, AK: Bear Creek Mining Strip"
2,10004,"Lik, AK: Lik Mining Camp"
3,10005,"Little Squaw, AK: Little Squaw Airport"
4,10006,"Kizhuyak, AK: Kizhuyak Bay"


In [6]:
# Creating a list of all the DOT codes in the dataset.
codes = (pd.concat([numeric_flights_df["ORIGIN_AIRPORT"],
                    numeric_flights_df["DESTINATION_AIRPORT"]],
                   ignore_index=True)
         .unique()
         .astype("int64"))

# Filter numeric_airports_df to only include the DOT codes that are in the flights_df dataset.
numeric_airports_df = numeric_airports_df.set_index("Code").loc[codes].reset_index()

print("Number of airports in the month of October:", len(numeric_airports_df))
print("Number of airports in the dataset as a whole:", len(airports_df))
print(f"There are {len(airports_df) - len(numeric_airports_df)} fewer airports used in October than in the dataset as a whole.")

Number of airports in the month of October: 307
Number of airports in the dataset as a whole: 322
There are 15 fewer airports used in October than in the dataset as a whole.


In [7]:
numeric_airports_df["Code"] = numeric_airports_df["Code"].astype("str")

# Extracting derivative columns from the "Description" column in numeric_airports_df.
numeric_airports_df["N_CITY"] = numeric_airports_df["Description"].str.split(",").str[0] # Derivative column for the City
numeric_airports_df["AIRPORT"] = numeric_airports_df["Description"].str.split(": ").str[1] # Derivative column for the Airport name
numeric_airports_df["N_STATE"] = numeric_airports_df["Description"].str.extract(r", (.+?):") # Derivative column for the State

# Strip white space from the derivative columns.
# str.strip() returns a new Series, so the result must be assigned back.
for c in ["N_CITY", "AIRPORT", "N_STATE"]:
    numeric_airports_df[c] = numeric_airports_df[c].str.strip()

numeric_airports_df.head()

Unnamed: 0,Code,Description,N_CITY,AIRPORT,N_STATE
0,14747,"Seattle, WA: Seattle/Tacoma International",Seattle,Seattle/Tacoma International,WA
1,14771,"San Francisco, CA: San Francisco International",San Francisco,San Francisco International,CA
2,12889,"Las Vegas, NV: Harry Reid International",Las Vegas,Harry Reid International,NV
3,12892,"Los Angeles, CA: Los Angeles International",Los Angeles,Los Angeles International,CA
4,14869,"Salt Lake City, UT: Salt Lake City International",Salt Lake City,Salt Lake City International,UT
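
The three derivative columns could also be pulled out in a single pass. The sketch below is one possible alternative, assuming every description follows the "City, ST: Airport name" layout seen above. The regular expression `pattern` and the frame `parts` are names introduced for this example.

```python
import pandas as pd

# Two descriptions copied from the output above.
descriptions = pd.Series([
    "Seattle, WA: Seattle/Tacoma International",
    "Salt Lake City, UT: Salt Lake City International",
])

# Named groups become column names in the extracted dataframe.
pattern = r"^(?P<N_CITY>[^,]+),\s*(?P<N_STATE>[^:]+):\s*(?P<AIRPORT>.+)$"

# Extract all three parts at once, then strip stray white space from each column.
parts = descriptions.str.extract(pattern).apply(lambda col: col.str.strip())
print(parts)
```

Rows that do not fit the assumed layout would come back as NaN in all three columns, which makes malformed descriptions easy to spot.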


In [8]:
# Cleaning airports_df
# Get rid of all non-breaking spaces.
airports_df["AIRPORT"] = airports_df["AIRPORT"].str.replace("\u00A0", "")
# Remove the word "Airport", which numeric_airports_df rarely includes, so that the names match.
airports_df["AIRPORT"] = airports_df["AIRPORT"].str.replace("Airport", "").str.strip()

### Matching the Airports:
Now I will merge the 2 dataframes together in order to match airports with the same name.

In [9]:
# Match airports based on their names.
numeric_airports_df = pd.merge(numeric_airports_df, airports_df, on="AIRPORT", how="left")
matched_airports_df = numeric_airports_df.dropna()[["Code", "IATA_CODE"]]

# Storing the unmatched airports that have numeric DOT codes.
unmatched_numeric_airports_df = numeric_airports_df.loc[numeric_airports_df["LATITUDE"].isna(), ["Code", "Description", "N_CITY", "AIRPORT", "N_STATE"]]

# Storing the unmatched airports that have IATA codes.
unmatched_airports_df = airports_df[~airports_df["IATA_CODE"].isin(matched_airports_df["IATA_CODE"].to_list())]

print("Number of matched airports:", len(matched_airports_df))
print("Number of unmatched airports from the month of October:", len(unmatched_numeric_airports_df))
print("Number of unmatched airports from the dataset as a whole:", len(unmatched_airports_df))

Number of matched airports: 181
Number of unmatched airports from the month of October: 126
Number of unmatched airports from the dataset as a whole: 141
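
One way to see matched and unmatched rows from both sides in a single step is an outer merge with `indicator=True`. This is a sketch on invented data rather than the approach used in this notebook. The frames `dot_df` and `iata_df`, and the codes and names inside them, are illustrative assumptions.

```python
import pandas as pd

# Illustrative DOT-style rows (codes and names are placeholders).
dot_df = pd.DataFrame({
    "Code": ["14771", "12892", "12478"],
    "AIRPORT": ["San Francisco International",
                "Los Angeles International",
                "John F. Kennedy International"],
})

# Illustrative IATA-style rows; the last name is spelled differently on purpose.
iata_df = pd.DataFrame({
    "IATA_CODE": ["SFO", "LAX", "JFK"],
    "AIRPORT": ["San Francisco International",
                "Los Angeles International",
                "John F. Kennedy International (New York International)"],
})

# The _merge column records which side each row came from.
merged = dot_df.merge(iata_df, on="AIRPORT", how="outer", indicator=True)

matched = merged[merged["_merge"] == "both"]
dot_only = merged[merged["_merge"] == "left_only"]
iata_only = merged[merged["_merge"] == "right_only"]

print(len(matched), len(dot_only), len(iata_only))
```

The `_merge` column avoids relying on a NaN in an unrelated column such as LATITUDE to decide whether a row was matched.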


I will now match airports together if they are the only airport in their city.

In [10]:
# Gather how many times each city occurs in both dataframes.
unmatched_numeric_cities = unmatched_numeric_airports_df["N_CITY"].value_counts()
unmatched_cities = unmatched_airports_df["CITY"].value_counts()

# Finding when the city occurs only once in both unmatched_numeric_airports_df (DOT) and unmatched_airports_df (IATA).
by_city_airports_df =  pd.merge(unmatched_numeric_airports_df.set_index("N_CITY").loc[unmatched_numeric_cities[unmatched_numeric_cities ==1].index],
                                unmatched_airports_df.set_index("CITY").loc[unmatched_cities[unmatched_cities == 1].index],
                                left_index=True,
                                right_index=True,
                                how="left")

matched_by_city_airports_df = by_city_airports_df[["Code", "IATA_CODE"]].dropna()

# Update the matched_airports_df with the new matched airports.
matched_airports_df = pd.concat([matched_airports_df, matched_by_city_airports_df])


# Update both datasets of unmatched airport codes.
unmatched_numeric_airports_df = unmatched_numeric_airports_df[~unmatched_numeric_airports_df["Code"].isin(matched_by_city_airports_df["Code"].to_list())]
unmatched_airports_df = unmatched_airports_df[~unmatched_airports_df["IATA_CODE"].isin(matched_by_city_airports_df["IATA_CODE"].to_list())]

print("Number of matched airports:", len(matched_airports_df))
print("Number of unmatched airports from the month of October:", len(unmatched_numeric_airports_df))
print("Number of unmatched airports from the dataset as a whole:", len(unmatched_airports_df))

Number of matched airports: 278
Number of unmatched airports from the month of October: 29
Number of unmatched airports from the dataset as a whole: 44
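
The "only airport in its city" rule can also be sketched with `drop_duplicates(keep=False)`, which discards every city that appears more than once before the merge. The data below is invented, and `dot_df`, `iata_df`, and the placeholder codes are assumptions for this example.

```python
import pandas as pd

# Invented unmatched DOT airports: Houston appears twice, so it is ambiguous.
dot_df = pd.DataFrame({
    "Code": ["10001", "10002", "10003", "10004"],
    "N_CITY": ["Houston", "Houston", "San Diego", "Bend"],
})

# Invented unmatched IATA airports with placeholder codes.
iata_df = pd.DataFrame({
    "IATA_CODE": ["AAA", "BBB", "CCC", "DDD"],
    "CITY": ["Houston", "Houston", "San Diego", "Redmond"],
})

# keep=False removes all rows whose city is duplicated, leaving single-airport cities.
unique_dot = dot_df.drop_duplicates("N_CITY", keep=False)
unique_iata = iata_df.drop_duplicates("CITY", keep=False)

# An inner merge keeps only cities that are unique on both sides.
by_city = unique_dot.merge(unique_iata, left_on="N_CITY", right_on="CITY", how="inner")
print(by_city[["Code", "IATA_CODE"]])
```

In this toy data only San Diego is matched: Houston is ambiguous on both sides, and Bend and Redmond have different city names, so they would be left for a later step.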


I will now match airports together if they are the only airport in their state.

In [11]:
# Gather how many times each state occurs in both dataframes.
unmatched_numeric_states = unmatched_numeric_airports_df["N_STATE"].value_counts()
unmatched_states = unmatched_airports_df["STATE"].value_counts()

# Finding when the State occurs only once in both unmatched_numeric_airports_df (DOT) and unmatched_airports_df (IATA).
by_state_airports_df = pd.merge(unmatched_numeric_airports_df.set_index("N_STATE").loc[unmatched_numeric_states[unmatched_numeric_states ==1].index],
                                unmatched_airports_df.set_index("STATE").loc[unmatched_states[unmatched_states == 1].index],
                                left_index=True,
                                right_index=True,
                                how="left")

matched_by_state_airports_df = by_state_airports_df[["Code", "IATA_CODE"]].dropna()

# Update the matched_airports_df with the new matched airports.
matched_airports_df = pd.concat([matched_airports_df, matched_by_state_airports_df])

# Update both datasets of unmatched airport codes.
unmatched_airports_df = unmatched_airports_df[~unmatched_airports_df["IATA_CODE"].isin(matched_by_state_airports_df["IATA_CODE"].to_list())]
unmatched_numeric_airports_df = unmatched_numeric_airports_df[~unmatched_numeric_airports_df["Code"].isin(matched_by_state_airports_df["Code"].to_list())]

print("Number of matched airports:", len(matched_airports_df))
print("Number of unmatched airports from the month of October:", len(unmatched_numeric_airports_df))
print("Number of unmatched airports from the dataset as a whole:", len(unmatched_airports_df))

Number of matched airports: 291
Number of unmatched airports from the month of October: 16
Number of unmatched airports from the dataset as a whole: 31


I will now manually match the final 16 airports.

In [12]:
# The key in the dictionary is the index of the airport in unmatched_numeric_airports_df.
codes ={216: "SAN",
        297: "MVY",
        153: "EWN",
        154: "OAJ",
        177: "FAY",
        92: "JFK",
        107: "LGA",
        265: "SWF",
        130: "RDM",
        296: "OTH",
        17: "IAH",
        26: "HOU",
        49: "CLL",
        52: "MAF",
        64: "MFE",
        241: "SGU"}

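# Add the manually chosen IATA codes to the unmatched dataframe.
unmatched_numeric_airports_df = unmatched_numeric_airports_df.copy() # Work on a copy to avoid a SettingWithCopyWarning on the filtered slice.
unmatched_numeric_airports_df["IATA_CODE"] = ""
unmatched_numeric_airports_df.loc[list(codes.keys()), "IATA_CODE"] = list(codes.values())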

# Add the airports and the codes to the matched dataframe.
matched_airports_df = pd.concat([matched_airports_df, unmatched_numeric_airports_df[["Code", "IATA_CODE"]]])

print(f"I have now matched {len(matched_airports_df)} of {len(numeric_airports_df)} airports")

I have now matched 307 of 307 airports


In [13]:
matched_airports_df.head()

Unnamed: 0,Code,IATA_CODE
1,14771,SFO
3,12892,LAX
4,14869,SLC
5,10299,ANC
6,11292,DEN


I will now edit flights_df to replace the DOT codes with the IATA airport codes.

In [14]:
for c in ["ORIGIN", "DESTINATION"]:

    # Renaming the columns of matched_airports_df to match those of flights_df.
    matched_airports_df.columns=[f"{c}_AIRPORT", f"{c}_IATA_CODE"]

    # Merge the dataframes to add a new column with the correct IATA codes.
    flights_df = pd.merge(flights_df, matched_airports_df, on=f"{c}_AIRPORT", how="left")

    # Replace the DOT codes with the IATA codes.
    flights_df.loc[flights_df["MONTH"] == 10, f"{c}_AIRPORT"] = flights_df[f"{c}_IATA_CODE"]

    # Drop the added column.
    flights_df = flights_df.drop(f"{c}_IATA_CODE", axis=1)

    # Check to see if any values in the column still contain a DOT code.
    print("There are", flights_df[f"{c}_AIRPORT"].str.contains(r"\d+", regex=True).sum(), f"airports in the \"{c}_AIRPORT\" column with a DOT code.")

There are 0 airports in the "ORIGIN_AIRPORT" column with a DOT code.
There are 0 airports in the "DESTINATION_AIRPORT" column with a DOT code.
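
As an alternative to merging and dropping a helper column, the replacement can be sketched with `Series.map` and a dictionary. The small `flights` frame is invented, while the two code pairs in `dot_to_iata` are taken from the matched_airports_df output shown earlier.

```python
import pandas as pd

# Invented flights: one row already has an IATA code, two have DOT codes.
flights = pd.DataFrame({
    "MONTH": [9, 10, 10],
    "ORIGIN_AIRPORT": ["SEA", "14771", "12892"],
})

# DOT code -> IATA code lookup (pairs copied from matched_airports_df.head()).
dot_to_iata = {"14771": "SFO", "12892": "LAX"}

# map() returns NaN for values that are not in the lookup,
# so fillna() restores the original IATA codes for those rows.
mapped = flights["ORIGIN_AIRPORT"].map(dot_to_iata)
flights["ORIGIN_AIRPORT"] = mapped.fillna(flights["ORIGIN_AIRPORT"])

print(flights["ORIGIN_AIRPORT"].tolist())
```

With the real data, the dictionary could presumably be built from the two columns of matched_airports_df, and the same two lines would be repeated for DESTINATION_AIRPORT.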


# Step 3: The Time Columns
Many of the columns in flights_df deal with time and these all need to be made machine-friendly.

In [15]:
flights_df.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,...,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,5,...,408.0,-22.0,0,0,,,,,,
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,10,...,741.0,-9.0,0,0,,,,,,
2,2015,1,1,4,US,840,N171US,SFO,CLT,20,...,811.0,5.0,0,0,,,,,,
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,20,...,756.0,-9.0,0,0,,,,,,
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,25,...,259.0,-21.0,0,0,,,,,,


For data that is recorded in minutes, such as a TAXI_IN of 10 minutes being recorded as 10, I will leave the columns as they are:
- DEPARTURE_DELAY
- TAXI_OUT
- SCHEDULED_TIME
- ELAPSED_TIME
- AIR_TIME
- TAXI_IN
- ARRIVAL_DELAY

<br>

For data that is formatted as hhmm, such as WHEELS_ON happening at 08:21 but being recorded as 821, I will turn these columns into pandas datetime columns.  However, there are several problems:

The columns:
- YEAR
- MONTH
- DAY
- DAY_OF_WEEK

are accurate only for SCHEDULED_DEPARTURE and for none of the other 5 recorded times:
- DEPARTURE_TIME
- WHEELS_OFF
- WHEELS_ON
- SCHEDULED_ARRIVAL
- ARRIVAL_TIME

If we look at the first row, the SCHEDULED_DEPARTURE is 00:05 on the 1st of January and the DEPARTURE_TIME is 23:54.  If I were to use the YEAR, MONTH, and DAY columns to calculate the datetime, the flight would be shown as departing on the 1st of January 2015 at 23:54 rather than on the 31st of December 2014 at 23:54.

So, in order to find an accurate datetime for all the time columns, I will use the columns recorded in minutes such as DEPARTURE_DELAY to calculate them.
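
The first-row example can be worked through directly. In the sketch below, the delay of -11 minutes is inferred from the two times quoted above (00:05 scheduled, 23:54 actual) rather than read from the dataset, and the variable names are introduced for this example.

```python
import pandas as pd

# Scheduled departure of the first row: 00:05 on the 1st of January 2015.
scheduled_departure = pd.Timestamp(year=2015, month=1, day=1, hour=0, minute=5)

# Assumed delay in minutes; negative means the flight left early.
departure_delay = -11

# Adding the delay as a timedelta rolls the date back across midnight.
departure_time = scheduled_departure + pd.to_timedelta(departure_delay, unit="min")

# Naive alternative: reuse YEAR, MONTH, and DAY with the recorded 23:54.
naive_departure_time = pd.Timestamp(year=2015, month=1, day=1, hour=23, minute=54)

print(departure_time)        # 2014-12-31 23:54:00
print(naive_departure_time)  # 2015-01-01 23:54:00
```

The naive timestamp is almost a full day later than the true departure, which is the error the minute-based calculation avoids.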

However, 3 of these columns are related to the destination airport:
- WHEELS_ON
- SCHEDULED_ARRIVAL
- ARRIVAL_TIME

and they are recorded, in flights_df, in the timezone of the destination airport rather than the origin airport.

In order to retain data for the destination timezone, I will extract the hour and minute from SCHEDULED_ARRIVAL but ensure that all columns in datetime are in the timezone of the origin airport.

In [16]:
# Converting SCHEDULED_DEPARTURE from hhmm into datetime.
flights_df["SCHEDULED_DEPARTURE"] = flights_df["SCHEDULED_DEPARTURE"].astype(str).str.zfill(4) # Increasing the total number of digits to 4.
flights_df["HOURS"] = flights_df["SCHEDULED_DEPARTURE"].str[:2].astype("uint8") # Creating the HOURS column.
flights_df["MINUTES"] = flights_df["SCHEDULED_DEPARTURE"].str[2:].astype("uint8") # Creating the MINUTES column.
flights_df["SCHEDULED_DEPARTURE"] = pd.to_datetime(flights_df[["YEAR", "MONTH", "DAY", "HOURS", "MINUTES"]]) # Transforming SCHEDULED_DEPARTURE into datetime.

# Adding separate HOUR and MINUTE columns for departure.
flights_df = flights_df.rename({"HOURS":"SCHEDULED_DEPARTURE_HOURS",
                                "MINUTES":"SCHEDULED_DEPARTURE_MINUTES"}, axis=1)

# Adding separate HOUR and MINUTE columns, in the destination timezone, for arrival.
flights_df["SCHEDULED_ARRIVAL"] = flights_df["SCHEDULED_ARRIVAL"].astype(str).str.zfill(4) # Increasing the total number of digits to 4.
flights_df["SCHEDULED_ARRIVAL_HOUR_IN_DESTINATION_TIMEZONE"] = flights_df["SCHEDULED_ARRIVAL"].str[:2].astype("uint8") # Creating the HOURS column.
flights_df["SCHEDULED_ARRIVAL_MINUTE_IN_DESTINATION_TIMEZONE"] = flights_df["SCHEDULED_ARRIVAL"].str[2:].astype("uint8") # Creating the MINUTES column.
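
The hour and minute can also be split out of an hhmm integer with arithmetic instead of string slicing. This is a small sketch of that alternative on invented values. The series `hhmm` is hypothetical.

```python
import pandas as pd

# Invented hhmm values: 00:05, 08:21, and 23:59.
hhmm = pd.Series([5, 821, 2359], dtype="uint16")

# Integer division by 100 gives the hour; the remainder gives the minute.
hours = (hhmm // 100).astype("uint8")
minutes = (hhmm % 100).astype("uint8")

print(hours.tolist(), minutes.tolist())
```

This skips the zero-padding step, since 5 divides to hour 0 and minute 5 without being turned into the string "0005" first.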

Creating datetime columns for the 5 other times using the features that are recorded in minutes.

In [17]:
# Calculating DEPARTURE_TIME from SCHEDULED_DEPARTURE and DEPARTURE_DELAY
flights_df["DEPARTURE_TIME"] = flights_df["SCHEDULED_DEPARTURE"] + pd.to_timedelta(flights_df["DEPARTURE_DELAY"], unit="m")

# Calculating WHEELS_OFF from DEPARTURE_TIME and TAXI_OUT
flights_df["WHEELS_OFF"] = flights_df["DEPARTURE_TIME"] + pd.to_timedelta(flights_df["TAXI_OUT"], unit="m")

# Calculating WHEELS_ON from WHEELS_OFF and AIR_TIME
flights_df["WHEELS_ON"] = flights_df["WHEELS_OFF"] + pd.to_timedelta(flights_df["AIR_TIME"], unit="m")

# Calculating ARRIVAL_TIME from WHEELS_ON and TAXI_IN
flights_df["ARRIVAL_TIME"] = flights_df["WHEELS_ON"] + pd.to_timedelta(flights_df["TAXI_IN"], unit="m")

# Calculating SCHEDULED_ARRIVAL from SCHEDULED_DEPARTURE and SCHEDULED_TIME
flights_df["SCHEDULED_ARRIVAL"] = flights_df["SCHEDULED_DEPARTURE"] + pd.to_timedelta(flights_df["SCHEDULED_TIME"], unit="m")

# Step 4: Cleaning the Data

After preparing the time columns and fixing the heterogeneous identifiers of the airports, I will now clean the rest of the data.

flights_df includes flights that were cancelled or diverted as well as the ones that actually landed at the target airport.

In order to predict flight delays, I will only be using flights that have landed at the target airports.

In [18]:
# Dropping rows of flights that were cancelled or diverted.
landed_df = flights_df.loc[(flights_df["CANCELLED"] == 0) & (flights_df["DIVERTED"] == 0)]

print(f"{(1 - len(landed_df)/len(flights_df)) * 100:.2f}% of flights, out of a possible {len(flights_df):,} were diverted or cancelled.")

# Dropping the columns related to cancelled flights, diverted flights, as well as the YEAR column.
landed_df = landed_df.drop(["YEAR", "DIVERTED", "CANCELLED", "CANCELLATION_REASON"], axis=1)

1.81% of flights, out of a possible 5,819,079 were diverted or cancelled.


The columns:
- AIR_SYSTEM_DELAY
- SECURITY_DELAY
- AIRLINE_DELAY
- LATE_AIRCRAFT_DELAY
- WEATHER_DELAY

are all NaN when a flight has no named delay.  When any one of them records a delay, the other 4 columns are 0 rather than NaN.

In [19]:
# Turning the NaNs in these columns into 0s.
delay_cols = ["AIR_SYSTEM_DELAY", "SECURITY_DELAY", "AIRLINE_DELAY", "LATE_AIRCRAFT_DELAY", "WEATHER_DELAY"]
landed_df.loc[:, delay_cols] = landed_df.loc[:, delay_cols].fillna(0.0)
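
To make the NaN pattern concrete, here is a toy version of the same fill on two of the delay columns. The frame `toy_df` and its values are invented for this example.

```python
import numpy as np
import pandas as pd

toy_delay_cols = ["AIR_SYSTEM_DELAY", "WEATHER_DELAY"]

# Row 0 has no named delay (all NaN); row 1 has an air system delay,
# so its other named delay is already 0 rather than NaN.
toy_df = pd.DataFrame({
    "ARRIVAL_DELAY": [-9.0, 25.0],
    "AIR_SYSTEM_DELAY": [np.nan, 25.0],
    "WEATHER_DELAY": [np.nan, 0.0],
})

# Only the named delay columns are filled; other columns are left untouched.
toy_df[toy_delay_cols] = toy_df[toy_delay_cols].fillna(0.0)

print(toy_df)
```

After the fill, a 0 means "no delay of this type" in every row, so the columns can later be cast to integer types.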

In [20]:
landed_df.isna().sum()

MONTH                                               0
DAY                                                 0
DAY_OF_WEEK                                         0
AIRLINE                                             0
FLIGHT_NUMBER                                       0
TAIL_NUMBER                                         0
ORIGIN_AIRPORT                                      0
DESTINATION_AIRPORT                                 0
SCHEDULED_DEPARTURE                                 0
DEPARTURE_TIME                                      0
DEPARTURE_DELAY                                     0
TAXI_OUT                                            0
WHEELS_OFF                                          0
SCHEDULED_TIME                                      0
ELAPSED_TIME                                        0
AIR_TIME                                            0
DISTANCE                                            0
WHEELS_ON                                           0
TAXI_IN                     

In [21]:
landed_df.dtypes

MONTH                                                        uint8
DAY                                                          uint8
DAY_OF_WEEK                                                  uint8
AIRLINE                                                   category
FLIGHT_NUMBER                                             category
TAIL_NUMBER                                               category
ORIGIN_AIRPORT                                              object
DESTINATION_AIRPORT                                         object
SCHEDULED_DEPARTURE                                 datetime64[ns]
DEPARTURE_TIME                                      datetime64[ns]
DEPARTURE_DELAY                                            float64
TAXI_OUT                                                   float64
WHEELS_OFF                                          datetime64[ns]
SCHEDULED_TIME                                             float64
ELAPSED_TIME                                               flo

In [22]:
landed_df = landed_df.astype({"ORIGIN_AIRPORT":"category",
                              "DESTINATION_AIRPORT":"category",
                              "DEPARTURE_DELAY":"int16",
                              "TAXI_OUT":"uint16",
                              "SCHEDULED_TIME":"uint16",
                              "ELAPSED_TIME":"uint16",
                              "AIR_TIME":"uint16",
                              "DISTANCE":"uint16",
                              "TAXI_IN":"uint8",
                              "ARRIVAL_DELAY":"int16",
                              "AIR_SYSTEM_DELAY":"uint16",
                              "SECURITY_DELAY":"uint16",
                              "AIRLINE_DELAY":"uint16",
                              "LATE_AIRCRAFT_DELAY":"uint16",
                              "WEATHER_DELAY":"uint16"})

In [23]:
landed_df.to_pickle("landed_flights.pkl")
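
Casting float columns to small integer types, as in the astype cell above, assumes that every value is whole, non-missing, and inside the range of the target type. An out-of-range cast may not raise an error, so a check along the following lines could be run first. The helper `fits_dtype` and the sample values are hypothetical and not part of this notebook.

```python
import numpy as np
import pandas as pd


def fits_dtype(series: pd.Series, dtype: str) -> bool:
    """Return True if every value is a whole number within the integer dtype's range."""
    info = np.iinfo(dtype)
    no_missing = series.notna().all()
    whole_numbers = (series.dropna() % 1 == 0).all()
    in_range = series.min() >= info.min and series.max() <= info.max
    return bool(no_missing and whole_numbers and in_range)


# Invented values for illustration.
taxi_in = pd.Series([4.0, 12.0, 248.0])
arrival_delay = pd.Series([-22.0, 5.0, 1500.0])

print(fits_dtype(taxi_in, "uint8"))        # True: 248 fits in 0..255
print(fits_dtype(arrival_delay, "int8"))   # False: 1500 is above 127
print(fits_dtype(arrival_delay, "int16"))  # True: within -32768..32767
```

Running such a check on each column before the cast would flag any type that is too narrow for the data.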

# Further Work

In the next notebook, Exploratory_Data_Analysis, I will explore and analyse the data.

I will also use the cleaned data from this notebook to predict the arrival delay of each flight in the Flight_Predictions notebook.

Further work could be done to clean, and work with, the cancelled and diverted flights, but they are outside the scope of this project.