# Predicting Flight Delay

Problem Set-up:
We define a delayed flight to be one that is delayed by >= 15 minutes. 
The prediction problem is to train a model that classifies each flight as delayed or not delayed.

Use case:
- The model is meant to help with choosing airlines, flight paths, and airports at the time of booking, well in advance of the scheduled departure (days, weeks, or months ahead of time). The prediction problem therefore focuses on features that are known in advance, rather than on day-of features such as weather and earlier flights from that day.
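
As an illustration of "features known in advance", the following is a minimal sketch that builds a feature matrix from schedule information only. The helper name `advance_features`, the derived `DEP_HOUR` column, and the toy data are hypothetical; the sketch also assumes that `SCHEDULED_DEPARTURE` is stored as an integer in HHMM form, which the notebook does not confirm.

```python
import pandas as pd

def advance_features(df):
    """Build a feature matrix from columns that are known at booking time."""
    # Assumption: SCHEDULED_DEPARTURE is an integer in HHMM form (e.g. 1345)
    dep_hour = (df['SCHEDULED_DEPARTURE'] // 100).rename('DEP_HOUR')
    numeric = pd.concat([df['DISTANCE'], dep_hour], axis=1)
    # One-hot encode the calendar and airline columns
    categorical = pd.get_dummies(
        df[['MONTH', 'DAY_OF_WEEK', 'AIRLINE']],
        columns=['MONTH', 'DAY_OF_WEEK', 'AIRLINE'])
    return pd.concat([numeric, categorical], axis=1)

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'MONTH': [1, 1, 2],
    'DAY_OF_WEEK': [4, 5, 4],
    'AIRLINE': ['AS', 'AA', 'AS'],
    'SCHEDULED_DEPARTURE': [5, 1345, 2359],
    'DISTANCE': [1448, 2330, 679],
})
X_toy = advance_features(toy)
```

Columns such as `DEPARTURE_TIME`, `TAXI_OUT`, or `WEATHER_DELAY` are deliberately left out, since they are only observed on or after the day of the flight.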

Notes:
- We restrict the analysis to relatively large airports, those with more than 20 (domestic) flights a day.
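
The notebook does not show where this restriction is applied (it may happen inside `flio.load_data`). One way it could be done is sketched below, assuming one row per flight and measuring airport size by average daily departures from `ORIGIN_AIRPORT`. The helper name `filter_large_airports` and the toy data are hypothetical.

```python
import pandas as pd

def filter_large_airports(df, min_daily_flights=20):
    """Keep flights whose origin airport averages more than
    min_daily_flights departures per day."""
    # Number of distinct calendar days covered by the data
    n_days = df[['YEAR', 'MONTH', 'DAY']].drop_duplicates().shape[0]
    # Average departures per day for each origin airport
    daily_avg = df['ORIGIN_AIRPORT'].value_counts() / n_days
    large = daily_avg[daily_avg > min_daily_flights].index
    return df[df['ORIGIN_AIRPORT'].isin(large)].copy()

# Toy data: over 2 days, LAX has 3 flights a day and ANC has 1
toy = pd.DataFrame({
    'YEAR': [2015] * 8,
    'MONTH': [1] * 8,
    'DAY': [1, 1, 1, 1, 2, 2, 2, 2],
    'ORIGIN_AIRPORT': ['LAX', 'LAX', 'LAX', 'ANC',
                       'LAX', 'LAX', 'LAX', 'ANC'],
})
toy_large = filter_large_airports(toy, min_daily_flights=2)
```

With a threshold of 2 flights a day, only the LAX rows remain in `toy_large`.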

In [1]:
# Imports
from sklearn.linear_model import LogisticRegression

import numpy as np
import pandas as pd

In [7]:
# Import custom code
from flightdelay.fld import io as flio
import importlib  # `imp` is deprecated and was removed in Python 3.12
importlib.reload(flio)

<module 'flightdelay.fld.io' from '/gh/flightdelay/fld/io.py'>

In [8]:
airlines_df, airports_df, flights_df = flio.load_data(N_flights=10000)

In [9]:
# Drop cancelled flights
# .copy() avoids a SettingWithCopyWarning when a column is added later
flights_df = flights_df[flights_df['CANCELLED'] != 1].copy()

In [10]:
flights_df.columns

Index(['Unnamed: 0', 'Unnamed: 0.1', 'YEAR', 'MONTH', 'DAY', 'DAY_OF_WEEK',
       'AIRLINE', 'FLIGHT_NUMBER', 'TAIL_NUMBER', 'ORIGIN_AIRPORT',
       'DESTINATION_AIRPORT', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME',
       'DEPARTURE_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'SCHEDULED_TIME',
       'ELAPSED_TIME', 'AIR_TIME', 'DISTANCE', 'WHEELS_ON', 'TAXI_IN',
       'SCHEDULED_ARRIVAL', 'ARRIVAL_TIME', 'ARRIVAL_DELAY', 'DIVERTED',
       'CANCELLED', 'CANCELLATION_REASON', 'AIR_SYSTEM_DELAY',
       'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY',
       'WEATHER_DELAY'],
      dtype='object')

In [12]:
flights_df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,...,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
0,0,0,2015,1,1,4,AS,98,N407AS,ANC,...,408.0,-22.0,0,0,,,,,,
1,1,1,2015,1,1,4,AA,2336,N3KUAA,LAX,...,741.0,-9.0,0,0,,,,,,
2,2,2,2015,1,1,4,US,840,N171US,SFO,...,811.0,5.0,0,0,,,,,,
3,3,3,2015,1,1,4,AA,258,N3HYAA,LAX,...,756.0,-9.0,0,0,,,,,,
4,4,4,2015,1,1,4,AS,135,N527AS,SEA,...,259.0,-21.0,0,0,,,,,,


In [13]:
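# Set labels
def create_label(row):
    # 1 = delayed, 0 = not delayed. The comparison is strict, so a flight
    # delayed by exactly 15 minutes is labeled 0.
    return 1 if row['DEPARTURE_DELAY'] > 15 else 0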

In [14]:
def features(df):
    # Feature matrix of shape (n_flights, 1): flight distance only
    return np.array(df['DISTANCE']).reshape(-1, 1)
    #return np.hstack([np.array(df['AIR_TIME']).reshape(-1, 1), 
    #                 np.array(df['DISTANCE']).reshape(-1, 1)])

def labels(df):
    # Label vector of shape (n_flights,)
    return np.array(df['LABEL'])

In [15]:
# Create labels
flights_df['LABEL'] = flights_df.apply(create_label, axis=1)

In [16]:
X = features(flights_df)
y = labels(flights_df)

model = LogisticRegression()
model.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [17]:
# Accuracy on the same data the model was fit on
model.score(X, y)

0.83586594504579514
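
The score above is computed on the rows the model was trained on. A minimal sketch of a held-out evaluation with scikit-learn's `train_test_split` follows. The synthetic `X_demo` and `y_demo` arrays are stand-ins for `X` and `y`, and the 75/25 split is an arbitrary choice, not something taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for X (distance) and y (delay label)
rng = np.random.RandomState(0)
X_demo = rng.uniform(100, 3000, size=(1000, 1))
y_demo = (rng.uniform(size=1000) < 0.16).astype(int)

# Hold out 25% of the rows, keeping the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

holdout_model = LogisticRegression()
holdout_model.fit(X_train, y_train)

# Accuracy on rows the model has not seen during fitting
test_accuracy = holdout_model.score(X_test, y_test)
```

In the notebook itself, the same pattern would apply with `X` and `y` in place of the synthetic arrays.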

In [18]:
features(flights_df).shape

(9608, 1)

In [122]:
# Number of flights the model predicts as delayed
sum(model.predict(X))

0

In [123]:
# Fraction of flights labeled as delayed (base rate)
sum(y) / len(y)

0.16413405495420483
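
Since the model predicts 0 for every flight, its accuracy equals the share of flights that are not delayed, 1 - 0.1641 = 0.8359, which is exactly the `model.score` value reported earlier. The sketch below reproduces this effect with scikit-learn's `DummyClassifier` and shows two metrics that expose it. The synthetic labels (16% positives) are an assumption chosen to resemble the base rate above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Synthetic stand-in labels: 160 delayed, 840 not delayed
y_demo = np.array([1] * 160 + [0] * 840)
X_demo = np.zeros((1000, 1))  # features are ignored by the baseline

# Baseline that always predicts the most frequent class (0)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)
y_pred = baseline.predict(X_demo)

baseline_accuracy = baseline.score(X_demo, y_demo)      # share of class 0
balanced_acc = balanced_accuracy_score(y_demo, y_pred)  # mean per-class recall
cm = confusion_matrix(y_demo, y_pred)  # rows: true class, columns: predicted
```

Here the plain accuracy is 0.84 while the balanced accuracy is 0.5, and the confusion matrix has an empty second column because no flight is ever predicted as delayed.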

In [124]:
# Count missing values in each feature column
sum(np.isnan(X))

array([0])