## Preprocess Flight Delay Data
This only needs to be run once. It requires downloading all of the datasets from the Bureau of Transportation Statistics. To recreate the data, follow these steps:
- visit the [Bureau of Transportation Statistics](https://www.transtats.bts.gov/databases.asp?Z1qr_VQ=E&Z1qr_Qr5p=N8vn6v10&f7owrp6_VQF=D)
- select Airline Performance Data
- select download for Reporting Carrier On-Time Performance (1987-Present)
- select the relevant features from the GUI (see relevant_columns below)
- download and unzip all files into the Downloads folder (see src_filepath). This is a set of 36 files: all 12 months for each of 2020, 2021, and 2022

### Preprocess data steps
1. Trim the data, keeping only the relevant columns
2. Balance the data, keeping 10k samples from each month and class (Delayed, Not Delayed)
3. Join all years and months and save them to a single CSV file, flight_data.csv
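
Steps 1 and 3 could be sketched roughly as below. This is only an illustration: the file naming pattern `flights_{year}_{month}.csv`, the helper `load_monthly_files`, and the `RELEVANT_COLUMNS` list are assumptions made for the example, not the actual names of the downloaded files or the real column selection.

```python
from pathlib import Path

import pandas as pd

# Hypothetical column subset; replace with the real list of relevant columns.
RELEVANT_COLUMNS = ["Year", "Month", "ArrDel15"]


def load_monthly_files(src_dir, years, months, columns=RELEVANT_COLUMNS):
    """Read one CSV per (year, month), keep only `columns`, and join them."""
    frames = []
    for year in years:
        for month in months:
            # Assumed file naming pattern, for illustration only.
            path = Path(src_dir).expanduser() / f"flights_{year}_{month}.csv"
            # usecols trims the data while reading, so unused columns never load.
            frames.append(pd.read_csv(path, usecols=columns))
    return pd.concat(frames, ignore_index=True)
```

Passing `usecols` trims each file as it is read, which keeps memory use lower than loading every column and dropping the extras afterwards.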

> Total row count is 713,664, slightly under 720k, because some months didn't have 10k samples for each class.

In [None]:
import pandas as pd


YEARS = ["2020", "2021", "2022"]
MONTHS = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"]
src_filepath = "~/Downloads/"
dest_filepath = "../data/"

RECORDS_TO_KEEP_FROM_EACH_DATASET = 5000

In [None]:
filename = "flight_data.csv"
# src_filepath already ends with a slash, so no extra separator is needed
flight_df = pd.read_csv(f"{src_filepath}{filename}")

In [None]:
flight_df.columns

## Drop rows that are marked as diverted or cancelled

In [None]:
flight_df = flight_df[flight_df.Cancelled == 0]
flight_df = flight_df[flight_df.Diverted == 0]


In [None]:
# Drop columns that are not needed
flight_df = flight_df.drop(['Cancelled', 'Diverted', 'CarrierDelay', 'WeatherDelay',
       'NASDelay', 'SecurityDelay', 'LateAircraftDelay'], axis=1)

## For each of the 36 months, grab 10k records (5k delayed and 5k not delayed)

In [None]:
def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """Return a class-balanced random sample of one month's flights."""
    # Delayed flights are indicated by ArrDel15 == 1
    df_delayed = df[df["ArrDel15"] == 1]
    df_not_delayed = df[df["ArrDel15"] == 0]

    # Cap by the smaller class so both samples can always be drawn
    keep_amount = min(len(df_delayed), len(df_not_delayed), RECORDS_TO_KEEP_FROM_EACH_DATASET)
    print(f"there are {len(df_delayed)} flight records that were delayed, randomly sampling {keep_amount}")
    print(f"there are {len(df_not_delayed)} flight records that weren't delayed, randomly sampling {keep_amount}")

    df_delayed_sample = df_delayed.sample(n=keep_amount, random_state=1)
    df_not_delayed_sample = df_not_delayed.sample(n=keep_amount, random_state=1)

    return pd.concat([df_delayed_sample, df_not_delayed_sample])
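
As a point of comparison, the same balancing idea can be written with `groupby(...).sample(...)` (available in pandas 1.1 and later). This is a minimal sketch on toy data, not the notebook's own method: the helper name `balanced_sample`, the `cap` parameter, and the `FlightId` column are made up for the example.

```python
import pandas as pd


def balanced_sample(df, label="ArrDel15", cap=5000, seed=1):
    """Draw the same number of rows from every class present in `label`."""
    # The smallest class (or the cap) decides how many rows each class keeps.
    keep = int(min(df[label].value_counts().min(), cap))
    return df.groupby(label).sample(n=keep, random_state=seed)


# Toy data: 3 delayed flights and 7 on-time flights.
toy = pd.DataFrame({"ArrDel15": [1] * 3 + [0] * 7, "FlightId": range(10)})
sample = balanced_sample(toy, cap=5)
print(sample["ArrDel15"].value_counts().to_dict())
```

Note that this variant only balances the classes that actually appear in the data, so a month with no delayed flights at all would return rows from a single class.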


In [None]:
preprocessed_data = []
for year in YEARS:
    for month in MONTHS:
        print(f"processing year:{year} and month:{month}")
        # YEARS and MONTHS hold strings, so cast to int to compare with the Year and Month columns
        month_mask = (flight_df["Year"] == int(year)) & (flight_df["Month"] == int(month))
        preprocessed_data.append(preprocess_data(flight_df[month_mask]))

all_flights_df = pd.concat(preprocessed_data)

In [None]:
all_flights_df.describe()

In [None]:
all_flights_df.info()
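
Beyond `describe()` and `info()`, it may be worth checking that every month really ended up balanced. The sketch below shows one way to do that on a toy frame; `toy_flights` is a made-up stand-in for `all_flights_df` and only assumes the `Year`, `Month`, and `ArrDel15` columns used above.

```python
import pandas as pd

# Toy stand-in for all_flights_df, for illustration only.
toy_flights = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2020, 2021, 2021],
    "Month": [1, 1, 2, 2, 1, 1],
    "ArrDel15": [0, 1, 0, 1, 0, 1],
})

# Rows per (Year, Month) for each class; the columns are the ArrDel15 values.
class_counts = (
    toy_flights.groupby(["Year", "Month", "ArrDel15"]).size().unstack(fill_value=0)
)
print(class_counts)

# A month is balanced when both classes have the same number of rows.
unbalanced = class_counts[class_counts[0] != class_counts[1]]
print(f"{len(unbalanced)} unbalanced month(s)")
```

The same table also shows which months fell short of the per-class cap, which would explain a total row count below the maximum.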

In [None]:
output_file = "flight_data.csv"
print(f"writing to {output_file}")
all_flights_df.to_csv(dest_filepath + output_file, index=False)