## Preprocess Flight Delay Data
This only needs to be run once. It requires downloading the monthly datasets from the Bureau of Transportation Statistics (BTS). To recreate the data, follow these steps:
- Visit the [Bureau of Transportation Statistics](https://www.transtats.bts.gov/databases.asp?Z1qr_VQ=E&Z1qr_Qr5p=N8vn6v10&f7owrp6_VQF=D)
- Select Airline Performance Data
- Select download for Reporting Carrier On-Time Performance (1987-Present)
- Select the relevant features in the GUI. See relevant_columns below
- Download and unzip all files into the Downloads folder (see src_filepath). This is a set of 36 files: one for each of the 12 months of 2020, 2021, and 2022

### Preprocess data steps
1. Trim data, keeping only relevant columns
2. Balance data, keeping up to 5k samples per class (Delayed, Not Delayed) from each month, for 10k samples per month
3. Join years and months and save to a single CSV file, flight_data.csv

> Total row count is 360,000: 36 months × 10k rows (5k per class), matching the `describe()` and `info()` output further down.
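Steps 1 and 3 above can be sketched as follows. This is a minimal sketch, not the notebook's actual loading code: the helper name `load_and_trim` and its parameters are hypothetical, and it assumes every monthly file contains all of the requested columns.

```python
from pathlib import Path

import pandas as pd


def load_and_trim(csv_paths, relevant_columns):
    """Read each monthly CSV, keep only the relevant columns, and stack the rows.

    Hypothetical helper: the notebook does not show how the 36 files were combined.
    """
    frames = [
        # usecols trims the columns while reading, so unused columns never load
        pd.read_csv(Path(path).expanduser(), usecols=relevant_columns)
        for path in csv_paths
    ]
    # ignore_index gives the combined frame a fresh 0..n-1 row index
    return pd.concat(frames, ignore_index=True)
```

For example, `load_and_trim(sorted(Path("~/Downloads").expanduser().glob("*.csv")), relevant_columns)` would combine every CSV in the Downloads folder, assuming that folder holds only the 36 monthly files.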

In [22]:
import pandas as pd


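# Years and months covered by the downloaded files
YEARS = ["2020", "2021", "2022"]
MONTHS = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"]
src_filepath = "~/Downloads/"
dest_filepath = "../data/"

# Maximum number of rows to sample per class (delayed / not delayed) for each month
RECORDS_TO_KEEP_FROM_EACH_DATASET = 5000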

In [3]:
filename = "flight_data.csv"
# src_filepath already ends with "/", so no extra separator is needed
flight_df = pd.read_csv(f"{src_filepath}{filename}")

In [8]:
flight_df.columns

Index(['Year', 'Quarter', 'Month', 'DayofMonth', 'DayOfWeek', 'FlightDate',
       'Reporting_Airline', 'Tail_Number', 'Flight_Number_Reporting_Airline',
       'Origin', 'Dest', 'DepTime', 'DepDelay', 'TaxiOut', 'WheelsOff',
       'WheelsOn', 'TaxiIn', 'CRSArrTime', 'ArrTime', 'ArrDelay', 'ArrDel15',
       'Cancelled', 'Diverted', 'CRSElapsedTime', 'ActualElapsedTime',
       'AirTime', 'Flights', 'Distance', 'CarrierDelay', 'WeatherDelay',
       'NASDelay', 'SecurityDelay', 'LateAircraftDelay', 'Carrier',
       'Full-time', 'Part-time', 'Grand Total'],
      dtype='object')

## Drop rows that are marked as diverted or cancelled

In [15]:
flight_df = flight_df[flight_df.Cancelled == 0]
flight_df = flight_df[flight_df.Diverted == 0]


In [17]:
# Drop columns that are not needed
flight_df = flight_df.drop(['Cancelled', 'Diverted', 'CarrierDelay', 'WeatherDelay',
       'NASDelay', 'SecurityDelay', 'LateAircraftDelay'], axis=1)

## For each of the 36 months, sample 10k records (5k delayed and 5k not delayed)

In [23]:
def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """Return a class-balanced random sample of one month's flights."""
    # Delayed flights are indicated by ArrDel15 == 1
    df_delayed = df[df["ArrDel15"] == 1]
    df_not_delayed = df[df["ArrDel15"] == 0]
    # Cap the sample size at the smaller class so both classes stay balanced.
    # Delayed is normally the smaller class, but guard against the reverse too.
    keep_amount = min(len(df_delayed), len(df_not_delayed), RECORDS_TO_KEEP_FROM_EACH_DATASET)
    print(f"there are {len(df_delayed)} flight records that were delayed. randomly sampling {keep_amount}")
    df_delayed_sample = df_delayed.sample(n=keep_amount, random_state=1)

    print(f"there are {len(df_not_delayed)} flight records that weren't delayed, sampling {keep_amount} of them")
    df_non_delayed_sample = df_not_delayed.sample(n=keep_amount, random_state=1)

    return pd.concat([df_delayed_sample, df_non_delayed_sample])
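As an alternative to calling `preprocess_data` once per month, the same balancing idea can be expressed with `groupby`. The following is a minimal sketch, assuming the `Year`, `Month`, and `ArrDel15` columns shown earlier; `balanced_sample` and its parameters are hypothetical names, not part of the notebook.

```python
import pandas as pd


def balanced_sample(df, per_class, group_cols=("Year", "Month"), label_col="ArrDel15", seed=1):
    """Sample the same number of rows from each class within every group."""
    samples = []
    for _, group in df.groupby(list(group_cols)):
        counts = group[label_col].value_counts()
        if len(counts) < 2:
            continue  # a group with only one class cannot be balanced
        # Never ask for more rows than the smallest class in this group has
        n = min(per_class, int(counts.min()))
        for _, class_rows in group.groupby(label_col):
            samples.append(class_rows.sample(n=n, random_state=seed))
    # Return an empty frame with the same columns if nothing could be sampled
    return pd.concat(samples) if samples else df.iloc[0:0]
```

With `per_class=5000` this mirrors the 5k delayed plus 5k not delayed target, and a month with fewer than 5k rows in either class contributes fewer rows rather than raising an error.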


In [24]:
preprocessed_data = []
for year in YEARS:
    for month in MONTHS:
        print(f"processing year:{year} and month:{month}")
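        # YEARS and MONTHS hold strings, while the Year and Month columns are integers
        month_mask = (flight_df["Year"] == int(year)) & (flight_df["Month"] == int(month))
        preprocessed_data.append(preprocess_data(flight_df[month_mask]))
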
all_flights_df = pd.concat(preprocessed_data)

processing year:2020 and month:1
there are 82285 flight records that were delayed. randomly sampling 5000
there are 516983 flight records that weren't delayed, sampling 5000 of them
processing year:2020 and month:2
there are 84616 flight records that were delayed. randomly sampling 5000
there are 483460 flight records that weren't delayed, sampling 5000 of them
processing year:2020 and month:3
there are 53720 flight records that were delayed. randomly sampling 5000
there are 483524 flight records that weren't delayed, sampling 5000 of them
processing year:2020 and month:4
there are 9038 flight records that were delayed. randomly sampling 5000
there are 173968 flight records that weren't delayed, sampling 5000 of them
processing year:2020 and month:5
there are 7794 flight records that were delayed. randomly sampling 5000
there are 160853 flight records that weren't delayed, sampling 5000 of them
processing year:2020 and month:6
there are 14949 flight records that were delayed. randomly 
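
The log above reports the sample sizes month by month. One way to confirm the balance on the combined frame is to count rows per month and class. This is a sketch under the assumption that the frame has `Year`, `Month`, and `ArrDel15` columns; `class_counts` is a hypothetical helper name.

```python
import pandas as pd


def class_counts(df, label_col="ArrDel15"):
    """Count rows per (Year, Month) for each class.

    Returns one row per month and one column per class value.
    """
    return df.groupby(["Year", "Month", label_col]).size().unstack(fill_value=0)
```

Calling `counts = class_counts(all_flights_df)` and then checking `(counts.nunique(axis=1) == 1).all()` should return `True` when every month holds the same number of delayed and not-delayed rows.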

In [25]:
all_flights_df.describe()

| | Year | Quarter | Month | DayofMonth | DayOfWeek | Flight_Number_Reporting_Airline | DepTime | DepDelay | TaxiOut | WheelsOff | ... | ArrDelay | ArrDel15 | CRSElapsedTime | ActualElapsedTime | AirTime | Flights | Distance | Full-time | Part-time | Grand Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 360000.0 | 360000.0 | 360000.0 | 360000.0 | 360000.0 | 360000.0 | 360000.0 | 360000.0 | 360000.0 | 360000.0 | ... | 360000.0 | 360000.0 | 360000.0 | 360000.0 | 360000.0 | 360000.0 | 360000.0 | 358754.0 | 358754.0 | 358754.0 |
| mean | 2021.0 | 2.5 | 6.5 | 15.733522 | 4.039153 | 2538.964169 | 1385.208219 | 28.711775 | 17.883931 | 1408.563928 | ... | 26.670942 | 0.5 | 142.023333 | 139.983686 | 114.126142 | 1.0 | 815.165475 | 40896.425107 | 3907.236535 | 44803.661643 |
| std | 0.816498 | 1.118036 | 3.452057 | 8.744654 | 2.009596 | 1778.124821 | 492.271123 | 74.5935 | 11.835681 | 494.764773 | ... | 75.450179 | 0.500001 | 70.052962 | 71.527463 | 69.273166 | 0.0 | 576.230327 | 32754.907819 | 4377.865772 | 35962.942062 |
| min | 2020.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | -62.0 | 1.0 | 1.0 | ... | -82.0 | 0.0 | 10.0 | 16.0 | 8.0 | 1.0 | 29.0 | 2357.0 | 0.0 | 2374.0 |
| 25% | 2020.0 | 1.75 | 3.75 | 8.0 | 2.0 | 1035.0 | 1004.0 | -4.0 | 11.0 | 1020.0 | ... | -11.0 | 0.0 | 90.0 | 87.0 | 63.0 | 1.0 | 391.0 | 10932.0 | 740.0 | 13143.0 |
| 50% | 2021.0 | 2.5 | 6.5 | 16.0 | 4.0 | 2164.0 | 1411.0 | 2.0 | 14.0 | 1424.0 | ... | 14.5 | 0.5 | 127.0 | 125.0 | 99.0 | 1.0 | 678.0 | 53197.0 | 1818.0 | 54698.0 |
| 75% | 2022.0 | 3.25 | 9.25 | 23.0 | 6.0 | 3902.0 | 1804.0 | 37.0 | 20.0 | 1820.0 | ... | 37.0 | 1.0 | 172.0 | 172.0 | 145.0 | 1.0 | 1055.0 | 65424.0 | 5655.0 | 71849.0 |
| max | 2022.0 | 4.0 | 12.0 | 31.0 | 7.0 | 8819.0 | 2400.0 | 2175.0 | 218.0 | 2400.0 | ... | 2186.0 | 1.0 | 695.0 | 716.0 | 677.0 | 1.0 | 5812.0 | 97373.0 | 16424.0 | 109108.0 |


In [26]:
all_flights_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 360000 entries, 559890 to 17400394
Data columns (total 30 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   Year                             360000 non-null  int64  
 1   Quarter                          360000 non-null  int64  
 2   Month                            360000 non-null  int64  
 3   DayofMonth                       360000 non-null  int64  
 4   DayOfWeek                        360000 non-null  int64  
 5   FlightDate                       360000 non-null  object 
 6   Reporting_Airline                360000 non-null  object 
 7   Tail_Number                      360000 non-null  object 
 8   Flight_Number_Reporting_Airline  360000 non-null  int64  
 9   Origin                           360000 non-null  object 
 10  Dest                             360000 non-null  object 
 11  DepTime                          360000 non-null  float64


In [27]:
output_file = "flight_data.csv"
print(f"writing to {output_file}")
all_flights_df.to_csv(dest_filepath + output_file, index=False)

writing to flight_data.csv
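
The `index=False` argument matters here because the sampled frame keeps the row labels of the original data (559890 to 17400394 in the `info()` output). A small sketch of the effect, using an in-memory buffer and a made-up two-row frame; `roundtrip_columns` is a hypothetical helper, not part of the notebook.

```python
import io

import pandas as pd


def roundtrip_columns(df, write_index):
    """Write `df` to an in-memory CSV and return the column names seen on reload."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=write_index)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)


# Made-up rows that keep non-contiguous labels, as a sampled frame would
sample = pd.DataFrame({"Year": [2020, 2021], "ArrDel15": [1, 0]}, index=[559890, 17400394])

print(roundtrip_columns(sample, write_index=False))  # ['Year', 'ArrDel15']
print(roundtrip_columns(sample, write_index=True))   # ['Unnamed: 0', 'Year', 'ArrDel15']
```

Writing the index would add an unnamed first column that `read_csv` labels `Unnamed: 0` on reload, so leaving it out keeps the saved file limited to the flight columns.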
