# Processing the Airline Reporting Carrier On-Time Performance Dataset

This notebook processes the Airline Reporting Carrier On-Time Performance Dataset. The dataset contains information on approximately 200 million domestic US flights reported to the United States Bureau of Transportation Statistics from 1987 to 2020. It is freely available from the IBM Developer [Data Asset Exchange](https://developer.ibm.com/exchanges/data/all/airline/).

**Note:** The full dataset is very large (over 80GB uncompressed), so here we work with a smaller sample containing 2 million rows.

In this notebook, we process the raw dataset by:
* selecting the columns we wish to keep for later analysis
* converting and cleaning data where required
* handling missing values
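
As a rough illustration of the last step, the sketch below counts missing values per column on a few made-up rows and then applies one possible policy. The toy values and the choice to drop rows without a `DepDel15` label are assumptions for illustration, not something this notebook prescribes.

```python
import numpy as np
import pandas as pd

# Toy rows (made up for illustration); column names follow the raw dataset.
toy = pd.DataFrame({
    'Origin': ['JFK', 'JFK', 'JFK', 'LAX'],
    'DepDelay': [5.0, np.nan, 42.0, -3.0],
    'DepDel15': [0.0, np.nan, 1.0, 0.0],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# One possible policy (an assumption): drop rows that have no delay label.
labelled = toy.dropna(subset=['DepDel15'])
print(len(labelled))
```

Whether to drop, impute, or keep such rows depends on the downstream model, so the counts are worth inspecting first.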

### Import Required Modules

Import and configure the required modules.

In [None]:
# !pip install pandas > /dev/null 2>&1

In [None]:
# Define required imports
import pandas as pd
# These set pandas max column and row display in the notebook
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 50)

### Read the Raw Data

We start by reading in the raw dataset and displaying the first few rows of the dataframe.
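
The read call used in this section passes `parse_dates` and an explicit `dtype` mapping. The sketch below shows the effect of those two arguments on a tiny in-memory CSV; the rows are made up, and the reason given here for forcing `str` (sparse diversion columns that pandas might otherwise infer inconsistently) is an assumption rather than something the notebook states.

```python
import io
import pandas as pd

# A tiny made-up CSV with the same kind of sparse diversion columns.
csv_text = (
    "FlightDate,Origin,Div1Airport,Div1TailNum\n"
    "2015-01-01,JFK,,\n"
    "2015-01-02,JFK,BOS,N123AB\n"
)

toy = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=['FlightDate'],                        # parse into datetime64
    dtype={'Div1Airport': str, 'Div1TailNum': str},    # force string columns
)
print(toy.dtypes)
```

Empty fields in the string columns are still read as missing values, so `isna()` works on them as usual.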

In [None]:
data_path = 'data/airline_2m.csv'
raw_data = pd.read_csv(data_path, encoding = "ISO-8859-1", parse_dates=['FlightDate'],
                 dtype={'Div1Airport': str, 'Div1TailNum': str, 'Div2Airport': str, 'Div2TailNum': str})
raw_data.head()

### Clean the Data

Fortunately, the airline delay dataset is relatively clean already! The fields we wish to use already represent variables such as unique codes for the airline, origin and destination. There are also fields representing binned variables for departure time slot and flight distance.

We will select a subset of the data relating to years 2010-2017, with origin airport `JFK`, to match our weather data. For simplicity, we will focus on delays and exclude cancelled flights. We will also ignore arrival delays, so the task is to predict departure delays _from JFK_ to other destinations, using the field `DepDel15`. This is a binary value indicating whether the flight departed 15 minutes or more behind schedule (deemed `delayed`) or not (deemed `on time`).
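
The selection criteria and the meaning of the delay flag can be sketched on a few made-up rows. The `is_delayed` column is a hypothetical name, and the 15-minutes-or-more threshold reflects our reading of the BTS field description; in the real data the flag is already provided as `DepDel15`.

```python
import pandas as pd

# Toy rows (made up) covering each exclusion rule.
toy = pd.DataFrame({
    'Year': [2009, 2012, 2015, 2016, 2017],
    'Origin': ['JFK', 'JFK', 'LAX', 'JFK', 'JFK'],
    'Cancelled': [0.0, 0.0, 0.0, 1.0, 0.0],
    'DepDelay': [20.0, 3.0, 0.0, None, 15.0],
})

# Keep non-cancelled JFK departures in 2010-2017 (both bounds inclusive).
mask = (
    (toy['Origin'] == 'JFK')
    & toy['Year'].between(2010, 2017)
    & (toy['Cancelled'] == 0)
)
subset = toy[mask].copy()

# Hypothetical re-derivation of the flag: 1 if the delay is at least 15 minutes.
subset['is_delayed'] = (subset['DepDelay'] >= 15).astype(int)
print(subset)
```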

In [None]:
# Select non-cancelled flights originating from JFK for the years 2010-2017
jfk_mask = (raw_data['Origin'] == 'JFK') & (raw_data['Year'].isin(range(2010, 2018))) & (raw_data['Cancelled'] == 0)
# Copy after filtering so later column assignments do not act on a slice of raw_data
jfk_flights = raw_data[jfk_mask].copy()
jfk_flights.head()

We create a mapping of airline ID to a more readable airline name (see the [airline ID lookup table](https://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_AIRLINE_ID)).
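
A dictionary lookup like this only covers the IDs listed in it. The sketch below, using a two-entry subset of the table and a made-up ID `99999`, shows one way to apply such a mapping with `Series.map` and to spot any IDs that were not covered; the variable names are illustrative.

```python
import pandas as pd

# Two entries from the lookup table, for illustration only.
codes = {20409: 'JetBlue', 19790: 'Delta'}

# 99999 is a made-up ID that is absent from the mapping.
ids = pd.Series([20409, 19790, 99999])

# Passing the dict directly maps known IDs and leaves unknown ones as NaN.
names = ids.map(codes)

# List the IDs that did not receive a name.
unmapped = ids[names.isna()].unique()
print(names.tolist())
print(unmapped)
```

Checking for unmapped IDs is useful if the year range or origin airport is later changed, since other carriers may then appear in the data.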

In [None]:
airline_codes = {
    20409: 'JetBlue',
    19790: 'Delta',
    19805: 'American Airlines',
    20398: 'Envoy Air',
    19977: 'United',
    21171: 'Virgin America',
    20363: 'Endeavor Air',
    20417: 'Comair',
    20355: 'US Airways',
    20366: 'ExpressJet',
    19690: 'Hawaiian Airlines',
    19930: 'Alaska Airlines',
    20378: 'Mesa Airlines'
}

In [None]:
# Map each airline ID to its name; IDs absent from airline_codes become NaN instead of raising KeyError
jfk_flights.loc[:, 'airline_name'] = jfk_flights['DOT_ID_Reporting_Airline'].map(airline_codes)

Next, we select the set of columns to keep for downstream analysis and rename the columns to `snake_case` for consistency with our processed weather dataset.

In [None]:
cols_to_keep = ['FlightDate', 'Month', 'DayofMonth', 'DayOfWeek', 'DOT_ID_Reporting_Airline', 'airline_name', 'Origin', 'Dest', 'CRSDepTime', 'DepTimeBlk', 'DistanceGroup', 'DepDel15', 'DepDelay']
jfk_flights = jfk_flights[cols_to_keep]
col_names = {
    'FlightDate': 'flight_date',
    'Month': 'month',
    'DayofMonth': 'day_of_month',
    'DayOfWeek': 'day_of_week',
    'DOT_ID_Reporting_Airline': 'airline_id',
    'Origin': 'origin',
    'Dest': 'dest',
    'CRSDepTime': 'sched_dep_time',
    'DepTimeBlk': 'dep_time_bin',
    'DistanceGroup': 'distance_bin',
    'DepDel15': 'delayed',
    'DepDelay': 'dep_delay'
}
jfk_flights_renamed = jfk_flights.rename(columns=col_names)
jfk_flights_renamed.head()

In [None]:
# Log some general information about the dataset
print(f"# of columns: {jfk_flights_renamed.shape[1]}")
print(f"# of observations: {jfk_flights_renamed.shape[0]}")
print(f"Start date: {jfk_flights_renamed['flight_date'].min()}")
print(f"End date: {jfk_flights_renamed['flight_date'].max()}")

### Save the Processed Data

Finally, we save the processed dataset for use by downstream tasks.
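
The save call uses `float_format='%g'`, which writes whole-number floats such as `1.0` as `1`. The sketch below previews that behaviour on made-up rows by writing to a string instead of a file, then reads the text back; the toy values are assumptions, while the column names follow the renamed dataset.

```python
import io
import numpy as np
import pandas as pd

# Toy rows (made up) using the renamed column names.
toy = pd.DataFrame({
    'flight_date': pd.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03']),
    'delayed': [1.0, 0.0, np.nan],
    'dep_delay': [22.0, -3.0, np.nan],
})

# With no path argument, to_csv returns the CSV text.
csv_text = toy.to_csv(index=False, float_format='%g')
print(csv_text)

# Downstream readers need to re-parse the date column explicitly.
restored = pd.read_csv(io.StringIO(csv_text), parse_dates=['flight_date'])
print(restored.dtypes)
```

Missing values are written as empty fields and come back as `NaN`, so the float columns keep their dtype after the round trip.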

In [None]:
jfk_flights_renamed.to_csv('data/jfk_flight_features.csv', index=False, float_format='%g')

### Authors

This notebook was created by the [Center for Open-Source Data & AI Technologies](http://codait.org).

Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.