# Airline Delays and Cancellations

## Introduction
In this project, we predict airline departure delays at John F. Kennedy International Airport (JFK) in New York. The aim is to find relevant variables for building models that predict departure delay at JFK. We use two datasets for the analysis: the first contains information on airline delays and cancellations, and the second contains weather information.

The airline delay dataset consists of more than 6 million flight observations and 28 variables. Some of these are technical data on airlines, airports, and flight numbers; the rest are time-related. The dataset covers the year 2010.

The weather dataset consists of hourly weather observations of "Temperature" and "Windspeed" for the year 2010.

We first explore the airline delay dataset to find relevant variables for predicting departure delay. We then use these variables to build Bayesian linear regression models in Pyro.

## Data preprocessing
### Airline delay dataset
The data is loaded from a CSV file into a pandas DataFrame and preprocessed. We convert the flight date to a datetime and set it as the index. We remove rows with NaN values in the columns used by our baseline model, because such rows carry no usable information. The data is then reduced to the first six months of 2010 and to flights departing from JFK.

In [None]:
# Imports
import numpy as np
import pandas as pd
import datetime

In [None]:
data = pd.read_csv('Data/2010.csv')

In [None]:
# Convert string to DateTime and set as index
data.FL_DATE = pd.to_datetime(data.FL_DATE)
data.set_index('FL_DATE', inplace=True)
data.drop(columns='Unnamed: 27', inplace=True) # Drop the unnamed trailing column

# Drop NaN values in the variables we will use in our baseline model
data = data[data['TAXI_OUT'].notna()]
data = data[data['ORIGIN'].notna()]
data = data[data['DEST'].notna()]

# Take first 6 months (copy so later assignments do not act on a view of data)
df = data.loc['2010-01-01':'2010-06-30'].copy()
# Renaming airline codes to company names
# Source: https://en.wikipedia.org/wiki/List_of_airlines_of_the_United_States

df['OP_CARRIER'] = df['OP_CARRIER'].replace({
    'UA':'United Airlines',
    'AS':'Alaska Airlines',
    '9E':'Endeavor Air',
    'B6':'JetBlue Airways',
    'EV':'ExpressJet',
    'F9':'Frontier Airlines',
    'G4':'Allegiant Air',
    'HA':'Hawaiian Airlines',
    'MQ':'Envoy Air',
    'NK':'Spirit Airlines',
    'OH':'PSA Airlines',
    'OO':'SkyWest Airlines',
    'VX':'Virgin America',
    'WN':'Southwest Airlines',
    'YV':'Mesa Airlines',
    'YX':'Republic Airways',
    'AA':'American Airlines',
    'DL':'Delta Air Lines'
})

# Use only JFK
df = df[df.ORIGIN == 'JFK'].copy()

The variable "DEP_TIME" stores the departure time as a number in hhmm format, so we convert it to a proper time and combine it with the flight date. With this change, we know the exact date and time of each departure and not just the day. "DEP_DELAY" is kept as is.

In [None]:
delays = df[['DEP_DELAY', 'DEP_TIME']].copy()
delays['DEP_TIME'] = delays['DEP_TIME'].astype(int)

preps = []
for i in range(len(delays['DEP_TIME'])):
    # Zero-fill to four digits (e.g. 5 -> '0005')
    dep_time_val = str(delays['DEP_TIME'].iloc[i]).zfill(4)
    # If flight at 24:00, set that as 00:00
    if dep_time_val == '2400':
        dep_time_val = '0000'
    dep_time_act = datetime.datetime.strptime(dep_time_val, '%H%M').strftime('%H:%M')

    # append
    preps.append(dep_time_act)

# Replace DEP_TIME with the corrected times
df['DEP_TIME'] = preps
df['DEP_TIME'] = pd.to_datetime(df['DEP_TIME'], format='%H:%M') # Convert to datetime

# Move FL_DATE from the index back to a regular column
df = df.reset_index(level=0)

In [None]:
comb_date = []
# Loop through all to combine dates and time
for i in range(len(df['DEP_TIME'])):

    # Get date and time
    date = datetime.datetime.date(df['FL_DATE'].iloc[i])
    time = datetime.datetime.time(df['DEP_TIME'].iloc[i])

    # Get combined as a string
    comb = datetime.datetime.combine(date, time).strftime('%Y-%m-%d %H:%M:%S')

    #append
    comb_date.append(comb)


In [None]:
# Add the combined date-time column
df['DATE_TIME'] = pd.to_datetime(comb_date)
# Keep only the time of day in DEP_TIME
df['DEP_TIME'] = df['DEP_TIME'].dt.time


In [None]:
# Move it to the front of the data frame
dates = df.pop('DATE_TIME')
df.insert(0, 'DATE_TIME', dates)
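
The row-by-row loops above can also be written with vectorized pandas operations. The following is only a minimal sketch on made-up toy data, not the code used in this project: the `toy` frame and its values are assumptions for illustration, and only the column names `FL_DATE`, `DEP_TIME` and `DATE_TIME` come from the notebook.

```python
import pandas as pd

# Toy stand-in for the flight table (values are made up for illustration)
toy = pd.DataFrame({
    "FL_DATE": pd.to_datetime(["2010-01-01", "2010-01-01", "2010-01-02"]),
    "DEP_TIME": [5.0, 1430.0, 2400.0],  # hhmm stored as a number
})

# Map 2400 to 0000, mirroring the loop above
hhmm = toy["DEP_TIME"].astype(int).replace(2400, 0)

# Split hhmm into hours and minutes and add them to the flight date
offset = pd.to_timedelta(hhmm // 100, unit="h") + pd.to_timedelta(hhmm % 100, unit="min")
toy["DATE_TIME"] = toy["FL_DATE"] + offset

print(toy)
```

Working on whole columns avoids the Python-level loop, which may matter for a table with many rows.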


### Weather dataset
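The weather data is loaded and reduced to the relevant features: temperature and wind speed at JFK for the first half of the year. It is then matched to each flight by month, day, and hour of departure.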

In [None]:
# Load weather data
df_weather_all = pd.read_csv('Data/Weather.csv')
# Select only relevant features
df_weather = df_weather_all[["NAME","DATE","HLY-TEMP-NORMAL","HLY-WIND-AVGSPD"]]
# Select only JFK airport and dates before July 1st; copy so the assignments below do not act on a view
df_weather = df_weather[(df_weather["NAME"] == "JFK INTERNATIONAL AIRPORT, NY US") & (df_weather["DATE"] < '07-01T00:00:00')].copy()
# Parse DATE (month-dayThour:minute:second, no year) as datetime
df_weather["DATE"] = pd.to_datetime(df_weather["DATE"], format='%m-%dT%H:%M:%S')
# Change the format to month - day - hour
df_weather["DATE"] = df_weather["DATE"].dt.strftime("%m-%d-%H")
# Change NAME to just JFK
df_weather["NAME"] = 'JFK'
# Reset index
df_weather = df_weather.reset_index(drop=True)

# Select interesting features
df = df[["FL_DATE","OP_CARRIER","DEP_TIME","TAXI_OUT", "DEP_DELAY"]]

# Now we must merge df_weather and df
df["TEMP"] = np.nan
df["WIND"] = np.nan

for i in range(len(df)):
    # Build the month-day-hour key used in df_weather["DATE"]
    hour = df["DEP_TIME"].iloc[i].strftime("%H")
    monthday = df["FL_DATE"].iloc[i].strftime("%m-%d")
    key = monthday + '-' + hour

    # The first hour of January 1st is not included in the weather data
    if key == '01-01-00':
        continue

    # Look up the weather row for this month, day and hour
    match = df_weather[df_weather["DATE"] == key]
    df.iloc[i, -2] = match.iloc[0, 2]  # temperature
    df.iloc[i, -1] = match.iloc[0, 3]  # wind speed

# Drop the flights without weather data (NaN marks them, so valid readings of 0 or below are kept)
df = df[df["TEMP"].notna()]
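
The lookup loop above is in effect a join on a month-day-hour key. As a hedged alternative, the sketch below does the same matching with `DataFrame.merge` on small made-up tables. The `flights` and `weather` frames and all their values are assumptions for illustration; only the column names follow the notebook.

```python
from datetime import time

import pandas as pd

# Toy stand-ins for the flight and weather tables (made-up values)
flights = pd.DataFrame({
    "FL_DATE": pd.to_datetime(["2010-01-01", "2010-01-01", "2010-01-02"]),
    "DEP_TIME": [time(0, 15), time(9, 30), time(9, 45)],
    "DEP_DELAY": [3.0, 12.0, -2.0],
})
weather = pd.DataFrame({
    "DATE": ["01-01-09", "01-02-09"],  # month-day-hour, as in df_weather
    "HLY-TEMP-NORMAL": [33.1, 32.8],
    "HLY-WIND-AVGSPD": [12.4, 11.9],
})

# Build the same month-day-hour key on the flight side
flights["DATE"] = (
    flights["FL_DATE"].dt.strftime("%m-%d")
    + "-"
    + flights["DEP_TIME"].map(lambda t: t.strftime("%H"))
)

# An inner join keeps only flights that have a matching weather row
merged = flights.merge(weather, on="DATE", how="inner").rename(
    columns={"HLY-TEMP-NORMAL": "TEMP", "HLY-WIND-AVGSPD": "WIND"}
)

print(merged[["FL_DATE", "DEP_TIME", "TEMP", "WIND"]])
```

With an inner join, flights in an hour that is missing from the weather table (here the 00:15 departure) drop out automatically, so no placeholder value or separate filtering step is needed.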

In [None]:
# Average departure delay per flight date
y = df.groupby('FL_DATE', as_index=False)['DEP_DELAY'].mean()

In [None]:
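# Add the average delay of the previous day as a feature
df["DELAY_PREV"] = 0.0
for i in range(len(y)):
    # Only the first day in the data has no previous day
    # (checking for day '01' would also skip the first day of every later month)
    if i == 0:
        continue

    # Take date
    date = y['FL_DATE'][i].strftime('%Y-%m-%d')

    # Assign the previous day's average delay to all flights on this date
    df.loc[df.FL_DATE == date, 'DELAY_PREV'] = y['DEP_DELAY'][i-1]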
    



# One hot encode the carriers
carrier_dummies = pd.get_dummies(df.OP_CARRIER, prefix='OP_CARRIER')
# Drop OP_CARRIER as it is now encoded
df.drop(columns='OP_CARRIER', inplace=True)
# Join the encoded df
df = df.join(carrier_dummies)
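
The previous-day delay feature built in the loop above can also be derived without a loop. The sketch below is one possible approach under stated assumptions, shown on made-up data: it moves each daily mean forward by one calendar day and maps it back onto the flights. The `flights` and `daily` names and all values are hypothetical.

```python
import pandas as pd

# Toy flight table spanning a month boundary (made-up values)
flights = pd.DataFrame({
    "FL_DATE": pd.to_datetime(
        ["2010-01-31", "2010-01-31", "2010-02-01", "2010-02-02"]
    ),
    "DEP_DELAY": [10.0, 20.0, 4.0, 8.0],
})

# Mean departure delay per day, indexed by date
daily = flights.groupby("FL_DATE")["DEP_DELAY"].mean()

# Re-label each daily mean with the following day, so that looking up
# a date returns the mean delay of the day before it
prev = daily.copy()
prev.index = daily.index + pd.Timedelta(days=1)

# Map every flight date to the previous day's mean (NaN if there is none)
flights["DELAY_PREV"] = flights["FL_DATE"].map(prev)

print(flights)
```

Unlike indexing with `i-1`, this lookup goes by calendar date, so a day without flights would yield NaN for the next day instead of silently using an older value. Whether NaN or 0 is the better placeholder for the first day is a modelling choice.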


In [None]:
# Compare shape to see effect of preprocessing
print('Shape before:',data.shape,'\n Shape after:',df.shape)
# Save as new dataframe
df.to_csv('Data/data_pre.csv')
df_weather.to_csv('Data/weather_pre.csv')