# Flight Delay Model Training

Trains and saves the deployment model used by the Flight Delay Predictor. The model is a neural network trained on years of flight data from the US Bureau of Transportation Statistics, and it predicts flight delays from the flight schedule, the airline, and the airports. A Python notebook showing model cross-validation and a summary report of the modeling procedure and general findings are available at https://github.com/Tate-G/portfolio

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
tf.compat.v1.disable_eager_execution()
from tensorflow import keras
from tensorflow.keras import layers
from datetime import datetime
import matplotlib.pyplot as plt

In [2]:


print(tf.config.list_physical_devices('GPU'))
print('Number of GPUs Available: ', len(tf.config.list_physical_devices('GPU')))
print('Tensorflow version: '+tf.version.VERSION)


[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Number of GPUs Available:  1
Tensorflow version: 2.1.0


## Input

Define input variables for this run of modeling and evaluation. User can train a new model or load a previously trained model.

In [3]:
#output file names
output_folder='7-22_May2019-Apr2021/'


#define the first and last month of the training data
train_list_start='2019_5'
train_list_end='2021_4'
#define the first and last month of the validation data
val_list_start='2021_5'
val_list_end='2021_5'


#define predictor variables
X_vars=['Month','DayOfWeek','Reporting_Airline',
            'Origin','Dest','OriginState','DestState',
            'CRSDepTime','CRSArrTime',
            'CRSElapsedTime']
X_vars_categorical=['Reporting_Airline','Origin','Dest','OriginState','DestState']
X_vars_hours=['CRSDepTime','CRSArrTime']
X_vars_cyclical_dict={'Month':12,'DayOfWeek':7,'CRSDepTime':24,'CRSArrTime':24}
X_vars_normalize=['CRSElapsedTime']
X_vars_log=['CRSElapsedTime'] #variables to log while normalizing, must be a subset of X_vars_normalize
y_var=['ArrDelayMinutes']

#define number of busiest airports to consider
num_airports=100

#define the delay length to consider
delay_minutes=15

#define the random fraction of training data to use
#true random subset if imbalanced==False, subset with increased % minority class (delays) if imbalanced==True
imbalanced=False
rand_frac_train=0.35
rand_frac_val=1


#define modeling parameters
Epochs = 15
Batch_Size = 2048
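
Several of the lists above must stay consistent with one another (for example, X_vars_log must be a subset of X_vars_normalize). The following is a minimal sketch of a sanity check that could be run before training. It is not part of the original notebook, and the helper name `check_inputs` is hypothetical.

```python
# Hypothetical helper (not in the original notebook): reports inconsistencies
# between the feature lists defined in the input cell.
def check_inputs(X_vars, X_vars_categorical, X_vars_hours,
                 X_vars_cyclical_dict, X_vars_normalize, X_vars_log):
    problems = []
    # every specialised list should only name variables that are in X_vars
    for name, subset in [('X_vars_categorical', X_vars_categorical),
                         ('X_vars_hours', X_vars_hours),
                         ('X_vars_cyclical_dict', list(X_vars_cyclical_dict)),
                         ('X_vars_normalize', X_vars_normalize)]:
        missing = [v for v in subset if v not in X_vars]
        if missing:
            problems.append(name + ' has variables not in X_vars: ' + str(missing))
    # logged variables must also be normalized
    not_normalized = [v for v in X_vars_log if v not in X_vars_normalize]
    if not_normalized:
        problems.append('X_vars_log has variables not in X_vars_normalize: '
                        + str(not_normalized))
    return problems

# values copied from the input cell above
X_vars = ['Month', 'DayOfWeek', 'Reporting_Airline',
          'Origin', 'Dest', 'OriginState', 'DestState',
          'CRSDepTime', 'CRSArrTime', 'CRSElapsedTime']
X_vars_categorical = ['Reporting_Airline', 'Origin', 'Dest', 'OriginState', 'DestState']
X_vars_hours = ['CRSDepTime', 'CRSArrTime']
X_vars_cyclical_dict = {'Month': 12, 'DayOfWeek': 7, 'CRSDepTime': 24, 'CRSArrTime': 24}
X_vars_normalize = ['CRSElapsedTime']
X_vars_log = ['CRSElapsedTime']

# an empty list means no problems were found
print(check_inputs(X_vars, X_vars_categorical, X_vars_hours,
                   X_vars_cyclical_dict, X_vars_normalize, X_vars_log))
```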



## Model Supporting Functions

Functions to support reading input data from csv files, which are expected in the subfolder 'US_DOT/On-Time/'. The dataset csv files are available for download at https://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time

In [4]:
#function to import csv files into dataframe
def read_flight_csv(filenames,col_names):
    #read each monthly file, then concatenate once (DataFrame.append was removed in pandas 2.0)
    frames=[]
    for filename in filenames:
        frames.append(pd.read_csv(filename,usecols=col_names))
        print('Read '+filename)
    df=pd.concat(frames,ignore_index=True)
    return df

#make list of filenames to import in a range of months starting from given YYYY_MM strings
def make_filenames(start_date,end_date):
    filename_start='US_DOT/On-Time/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_'
    filename_ext='.csv'
    start=datetime.strptime(start_date,'%Y_%m')
    end=datetime.strptime(end_date,'%Y_%m')
    dates=pd.date_range(start,end,freq='MS')
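    #build 'YYYY_M' labels without zero padding; the '%#m' strftime flag is platform-specific (Windows)
    date_list=[str(d.year)+'_'+str(d.month) for d in dates]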
    files=[filename_start+date+filename_ext for date in date_list]
    return files
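
To illustrate the file naming convention that make_filenames relies on, here is a minimal standard-library sketch that builds the same unpadded 'YYYY_M' labels for a range of months. The helper name `month_labels` is hypothetical, and the prefix is copied from the function above.

```python
from datetime import datetime

# prefix and extension copied from make_filenames above
PREFIX = 'US_DOT/On-Time/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_'
EXT = '.csv'

def month_labels(start_date, end_date):
    """Return 'YYYY_M' labels for every month from start_date to end_date, inclusive."""
    start = datetime.strptime(start_date, '%Y_%m')
    end = datetime.strptime(end_date, '%Y_%m')
    labels = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        labels.append(f'{year}_{month}')  # month is not zero padded
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return labels

labels = month_labels('2019_11', '2020_2')
print(labels)  # ['2019_11', '2019_12', '2020_1', '2020_2']
print(PREFIX + labels[0] + EXT)
```

With the training range used in this notebook ('2019_5' to '2021_4'), the same logic yields 24 monthly files.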


Functions to support restricting data to only include flights to and from a given number of the busiest airports, and to clean airline names.

In [5]:
#function to find the given number of busiest airports in a dataset
def airport_select(df,num_airports):
    arrivals=df.groupby(['Dest']).size()
    departures=df.groupby(['Origin']).size()
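    #fill_value=0 keeps airports that appear only as an origin or only as a destination (plain + would give NaN)
    flights=arrivals.add(departures,fill_value=0)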
    airports=flights.sort_values(ascending=False)[:num_airports].index.tolist()
    return airports
    
#function to restrict to list of airports
#this helps limit training data size and in most cases gives consistent lists of airports for train, val, and test
def airport_restrict(df,airports):
    restricted=df[(df['Dest'].isin(airports))&(df['Origin'].isin(airports))]
    return restricted


#function to give airlines actual names
#translates select airlines that have been integrated into other airlines over time
def airline_names(df):
    carriers=pd.read_csv('US_DOT/On-Time/L_UNIQUE_CARRIERS.csv')
    carrier_dict=pd.Series(carriers['Description'].values,index=carriers['Code']).to_dict()
    df_new=df.copy()
    df_new['Reporting_Airline']=df['Reporting_Airline'].map(carrier_dict)
    df_new.loc[df_new['Reporting_Airline']=='Virgin America','Reporting_Airline']='Alaska Airlines Inc.'  #Virgin America integrated into Alaska Airlines in 2018
    return df_new
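
The airport selection logic can be illustrated on a toy example. The sketch below uses only the standard library rather than pandas, and the flight list and helper names (`busiest_airports`, `restrict`) are hypothetical. It counts arrivals plus departures per airport, keeps the busiest few, and then drops any flight that does not both start and end at a kept airport.

```python
from collections import Counter

# toy (origin, destination) pairs; not real data
flights = [('ATL', 'ORD'), ('ORD', 'ATL'), ('ATL', 'DFW'),
           ('DFW', 'ATL'), ('ORD', 'DFW'), ('BOI', 'ATL')]

def busiest_airports(flights, num_airports):
    """Rank airports by arrivals plus departures and keep the top num_airports."""
    counts = Counter()
    for origin, dest in flights:
        counts[origin] += 1
        counts[dest] += 1
    return [airport for airport, _ in counts.most_common(num_airports)]

def restrict(flights, airports):
    """Keep only flights whose origin and destination are both in airports."""
    keep = set(airports)
    return [(o, d) for o, d in flights if o in keep and d in keep]

top = busiest_airports(flights, 3)
print(top)
print(restrict(flights, top))  # the BOI flight is dropped
```

Note that BOI appears only as an origin here. A counter handles that case naturally, which is the same situation the pandas version has to handle when adding the arrival and departure counts.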



Functions to extract and preprocess model features and target variable from data that has been read from 'On-Time' csv files. Supports feature cleaning, transformation, and normalization, as well as selection of a random subset of the dataset.

In [6]:

#function to encode cyclical time data into sin and cos variables
#useful resource at https://www.kaggle.com/avanwyk/encoding-cyclical-features-for-deep-learning
def convert_cyclical(df,col_name_maxval_dict):
    df_copy=df.copy()
    for col in col_name_maxval_dict.keys():
        max_value=col_name_maxval_dict[col]
        df_copy['sin('+col+')']=np.sin(2*np.pi*df_copy[col]/max_value)
        df_copy['cos('+col+')']=np.cos(2*np.pi*df_copy[col]/max_value)
        df_copy.drop(col,axis=1,inplace=True)
    return df_copy

#function to convert times to decimal hours. 
#Format in file is an integer with hour in thousands and hundreds place and minutes in tens and ones place 
#(For example, in input csv 1415 corresponds to 2:15pm, convert to decimal hour 14.25)
def convert_hours(df,columns):
    df_copy=df.copy()
    for col in columns:
        df_copy[col]=df[col]//100+(df[col]%100)/60
    return df_copy


#function to produce the desired predictor and target variable dataframes
def predictors_and_target(df,X_vars,X_vars_categorical,target_delay_minutes=1):
    subset=df[X_vars]
    X=pd.get_dummies(subset,columns=X_vars_categorical)#one-hot encode the categorical variables
    y=df['ArrDelayMinutes']>=target_delay_minutes
    return X,y


#function to choose random subset of matching predictor and target rows
#if subset_frac is not less than 1, returns copies of original X and y
def rand_subset(X,y,subset_frac,rand_state=0):
    if subset_frac<1:
        X_copy=X.copy()
        X_copy['y']=y
        samp=X_copy.sample(frac=subset_frac,random_state=rand_state)
        y_samp=samp['y']
        X_samp=samp.drop(['y'],axis=1)
    else:
        X_samp=X.copy()
        y_samp=y.copy()
    return X_samp,y_samp


#function to choose imbalanced random subset of matching predictor and target rows
#if subset_frac is not less than 1, returns copies of original X and y
#if sum(y==True)>=len(y)*subset_frac//2, will return rows with equal number of T and F values in y
#if sum(y==True)<len(y)*subset_frac//2, include all rows where y is T and len(y)*subset_frac//2 rows where y is F
#same conditions on sum(y==False)
def imbalanced_subsample(X,y,subset_frac,rand_state=0):
    if subset_frac<1:
        X_copy=X.copy()
        X_copy['y']=y
        X_copy_T=X_copy[X_copy['y']==True]
        X_copy_F=X_copy[X_copy['y']==False]
        num_samp=X_copy.shape[0]*subset_frac
        if X_copy_T.shape[0]>=num_samp//2:
            samp_T=X_copy_T.sample(n=int(num_samp//2),random_state=rand_state)
        else:
            samp_T=X_copy_T
        if X_copy_F.shape[0]>=num_samp//2:
            samp_F=X_copy_F.sample(n=int(num_samp//2),random_state=rand_state)
        else:
            samp_F=X_copy_F
        samp=pd.concat([samp_T,samp_F])  #DataFrame.append was removed in pandas 2.0
        y_samp=samp['y']
        X_samp=samp.drop(['y'],axis=1)
    else:
        X_samp=X.copy()
        y_samp=y.copy()
    return X_samp,y_samp


#function to normalize continuous variables in train and val data
def normalize(X_train,X_val,X_vars_normalize,X_vars_log):
    from sklearn.preprocessing import StandardScaler
    scaler=StandardScaler()
    
    #explicit copies avoid a SettingWithCopyWarning when the log transform is assigned below
    train_to_scale=X_train.loc[:,X_vars_normalize].copy()
    val_to_scale=X_val.loc[:,X_vars_normalize].copy()
    train_to_log=train_to_scale.loc[:,X_vars_log]
    val_to_log=val_to_scale.loc[:,X_vars_log]
    
    X_train_out=X_train.copy()
    X_val_out=X_val.copy()
    
    train_to_scale[X_vars_log]=np.log(train_to_log)
    train_scaled=scaler.fit_transform(train_to_scale)
    X_train_out[X_vars_normalize]=train_scaled
    
    val_to_scale[X_vars_log]=np.log(val_to_log)
    val_scaled=scaler.transform(val_to_scale)
    X_val_out[X_vars_normalize]=val_scaled

    
    return X_train_out,X_val_out,scaler

#function to add categorical variables from train that are missing in validation
#(for example, an airport included in train that was not flown to in validation)
#note however that this eliminates extra columns in the target dataframe (if you insert an airport or airline not in train, etc.)
def missing_cat_add(X_train,X_val):
    X_val_new=X_val.reindex(columns=X_train.columns.values.tolist(),fill_value=0)
    
    return X_val_new
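
To see why the time features are converted to decimal hours and then to sine and cosine components, consider two departures on either side of midnight. The following standard-library sketch mirrors the arithmetic in convert_hours and convert_cyclical on single values; the helper names are hypothetical.

```python
import math

def to_decimal_hours(hhmm):
    """Convert an HHMM integer (e.g. 1415) to decimal hours (14.25)."""
    return hhmm // 100 + (hhmm % 100) / 60

def encode_cyclical(value, period):
    """Map a value on a cycle of length period to a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

print(to_decimal_hours(1415))  # 14.25

late = to_decimal_hours(2330)   # 23.5
early = to_decimal_hours(30)    # 0.5
noon = to_decimal_hours(1200)   # 12.0

# raw hours put 23:30 and 00:30 far apart, although they are one hour apart on the clock
print(abs(late - early))  # 23.0

# after encoding, 23:30 is close to 00:30 and far from noon
print(distance(encode_cyclical(late, 24), encode_cyclical(early, 24)))
print(distance(encode_cyclical(late, 24), encode_cyclical(noon, 24)))
```

The encoded distance between 23:30 and 00:30 is roughly 0.26, while the distance between 23:30 and noon is close to 2, the maximum possible on the unit circle.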

Function to support the full sequence of importing data: reading csv files, processing and subsampling data, and extracting predictive features and the target variable.

In [7]:
#Read csv flight data, restrict airport list, convert hour and other cyclical data, and return predictor and target
#Must give either the number of airports (for train) or an airport list (for val and test)
def import_data(filenames,rand_frac,X_vars,X_vars_categorical,y_var,delay_minutes,
                X_vars_hours,X_vars_cyclical_dict,
                num_airports=None,airports_list=None,subsample_imbalanced=False):
    
    df=read_flight_csv(filenames,X_vars+y_var)
    df.dropna(inplace=True)  
    #EDA not shown here finds a few hundred CRSElapsedTime values missing some months, 
    #otherwise few tens of thousands of delay time targets missing out of half a million flights in a month
    if airports_list is not None:
        airports=airports_list
    else:
        airports=airport_select(df,num_airports)
    df_restricted=airline_names(airport_restrict(df,airports))
    
    X,y_full=predictors_and_target(df_restricted,X_vars,X_vars_categorical,delay_minutes)
    X_hours=convert_hours(X,X_vars_hours)
    X_full=convert_cyclical(X_hours,X_vars_cyclical_dict)
    
    if subsample_imbalanced:
        X_samp,y_samp=imbalanced_subsample(X_full,y_full,rand_frac)
    else:
        X_samp,y_samp=rand_subset(X_full,y_full,rand_frac)
    
    return X_samp,y_samp,airports,X_full,y_full
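
The final step of import_data draws either a plain random subset or an imbalanced one. The sketch below illustrates the size logic of the imbalanced option on a toy target, using the standard library instead of pandas. The helper name `imbalanced_indices` and the toy class proportions are assumptions made for illustration.

```python
import random

def imbalanced_indices(y, subset_frac, seed=0):
    """Pick row indices the way imbalanced_subsample does: up to half of the
    target sample size from each class, or the whole class if it is smaller."""
    rng = random.Random(seed)
    half = int(len(y) * subset_frac // 2)
    idx_true = [i for i, v in enumerate(y) if v]
    idx_false = [i for i, v in enumerate(y) if not v]
    samp_true = rng.sample(idx_true, half) if len(idx_true) >= half else idx_true
    samp_false = rng.sample(idx_false, half) if len(idx_false) >= half else idx_false
    return samp_true + samp_false

# toy target with 20% delayed flights (True) and 80% on-time flights (False)
y = [True] * 20 + [False] * 80
idx = imbalanced_indices(y, 0.5)

n_delayed = sum(y[i] for i in idx)
print(len(idx), n_delayed)  # 45 20
```

Here half of the requested sample is 25 rows per class. Only 20 delayed rows exist, so all of them are kept alongside 25 sampled on-time rows. The result is smaller than the requested 50 rows, but the share of delays rises from 20% to about 44%.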


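Define the Keras modeling procedure. The sequential model has one 16-unit dense hidden layer with ReLU activation, a dropout layer (rate 0.5), and a single sigmoid output unit that gives the predicted delay probability.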

In [8]:
#create modeling procedure
#useful reference at https://www.tensorflow.org/tutorials/structured_data/imbalanced_data

Metrics=[
    keras.metrics.TruePositives(name='tp'),
    keras.metrics.FalsePositives(name='fp'),
    keras.metrics.TrueNegatives(name='tn'),
    keras.metrics.FalseNegatives(name='fn'), 
    keras.metrics.BinaryAccuracy(name='accuracy'),
    keras.metrics.Precision(name='precision'),
    keras.metrics.Recall(name='recall'),
    keras.metrics.AUC(name='auc')]

def make_model(metrics=Metrics):
    model = keras.Sequential([
        keras.layers.Dense(
            16, activation='relu',
            input_shape=(X_train_normalized.shape[-1],)),
        keras.layers.Dropout(0.5),
        keras.layers.Dense(1, activation='sigmoid')])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=1e-3),  #learning_rate replaces the deprecated lr argument
        loss=keras.losses.BinaryCrossentropy(),
        metrics=metrics)
    
    return model
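
For a sense of how small this network is, the number of trainable parameters can be worked out by hand. The sketch below is plain arithmetic and does not call Keras. The feature count is a hypothetical placeholder, since the real value is X_train_normalized.shape[-1] and depends on how many one-hot columns the data produces.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

n_features = 320  # hypothetical; the real value is X_train_normalized.shape[-1]

hidden = dense_params(n_features, 16)  # Dense(16, activation='relu')
output = dense_params(16, 1)           # Dense(1, activation='sigmoid')
# the Dropout layer has no trainable parameters

print(hidden, output, hidden + output)  # 5136 17 5153
```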





## Model Training

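Loads the data, trains the model, and evaluates it against the validation dataset after each epoch, printing loading and modeling progress along the way. The model, scaler, X column names, and preprocessing inputs such as X_vars_log are saved in the cell that follows. Note that no train_new_model flag is defined in the inputs above, so this cell always trains a new model.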

In [9]:
#set train and val files
train_files=make_filenames(train_list_start,train_list_end)
val_files=make_filenames(val_list_start,val_list_end)

train_range=train_list_start+'-'+train_list_end

 
print('******************************\n')

    
##############################
#import and normalize data
print('start data import '+str(datetime.now()))
X_train,y_train,airports,_,y_train_full=import_data(train_files,rand_frac_train,X_vars,X_vars_categorical,y_var,delay_minutes,
                                X_vars_hours,X_vars_cyclical_dict,num_airports=num_airports,
                                subsample_imbalanced=imbalanced)
print('imported train '+str(datetime.now()))


X_val,y_val,_,_,_=import_data(val_files,rand_frac_val,X_vars,X_vars_categorical,y_var,delay_minutes,
                                X_vars_hours,X_vars_cyclical_dict,airports_list=airports,subsample_imbalanced=imbalanced)
print('imported validation '+str(datetime.now()))


X_train_normalized,X_val_norm0,scaler=normalize(X_train,X_val,X_vars_normalize,X_vars_log)
X_val_normalized=missing_cat_add(X_train_normalized,X_val_norm0)
print('data normalized '+str(datetime.now()))
    

##############################
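#train model
#note: early_stopping is defined here but is not passed to model.fit below,
#so training always runs for the full number of epochs; add callbacks=[early_stopping] to model.fit to enable it
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_auc',verbose=1,patience=10,
                                                    mode='max',restore_best_weights=True)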
model=make_model()

print('modeling begin '+str(datetime.now()))
history=model.fit(X_train_normalized,y_train,batch_size=Batch_Size,epochs=Epochs,
                    validation_data=(X_val_normalized,y_val))
print('model fitted'+str(datetime.now()))


print('******************************\n')




******************************

start data import 2021-07-23 17:21:48.011512
Read US_DOT/On-Time/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_5.csv
Read US_DOT/On-Time/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_6.csv
Read US_DOT/On-Time/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_7.csv
Read US_DOT/On-Time/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_8.csv
Read US_DOT/On-Time/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_9.csv
Read US_DOT/On-Time/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_10.csv
Read US_DOT/On-Time/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_11.csv
Read US_DOT/On-Time/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2019_12.csv
Read US_DOT/On-Time/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2020_1.csv
Read US_DOT/On-Time/On_Time_Reporting_Carrier_On_Time_Performance_(1987_present)_2020_2.csv


Epoch 15/15
model fitted2021-07-23 17:37:32.734292
******************************
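
The normalize step called above logs CRSElapsedTime and then standardizes it with statistics fitted on the training data only, so the validation data is scaled with the training mean and standard deviation. The following is a minimal standard-library sketch of that idea on toy values; the helper names and the numbers are assumptions, and the notebook itself uses scikit-learn's StandardScaler.

```python
import math
import statistics

def fit_log_standardizer(train_values):
    """Fit mean and standard deviation on the log of the training values."""
    logs = [math.log(v) for v in train_values]
    mean = statistics.mean(logs)
    std = statistics.pstdev(logs)  # population std, matching StandardScaler
    return mean, std

def apply_log_standardizer(values, mean, std):
    """Log-transform values and scale them with previously fitted statistics."""
    return [(math.log(v) - mean) / std for v in values]

train_minutes = [60, 120, 240]  # toy scheduled flight durations
val_minutes = [90, 480]

mean, std = fit_log_standardizer(train_minutes)
train_scaled = apply_log_standardizer(train_minutes, mean, std)
val_scaled = apply_log_standardizer(val_minutes, mean, std)

print(train_scaled)  # centered on 0 with unit variance
print(val_scaled)    # scaled with the training statistics, so not centered
```

Fitting on the training data alone keeps information about the validation set out of the preprocessing, and it is also why the fitted scaler has to be saved for use at prediction time.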



Save model and other variables needed to later make predictions

In [11]:
#save model
model.save(output_folder+'saved_model')  #output_folder already ends with '/', which works on all platforms

#save scaler
from joblib import dump
dump(scaler,output_folder+'scaler.joblib')

#save distribution of predicted probabilities for validation data
val_probs=model.predict(X_val_normalized)  #sigmoid output is already a probability; predict_proba was removed in later TensorFlow releases
dump(val_probs,output_folder+'val_probs.joblib')

#save training data characteristics
dump(X_train.columns.values.tolist(),output_folder+'X_train_cols.joblib')
dump(np.mean(y_train_full),output_folder+'Train_delay_proportion.joblib')
dump(airports,output_folder+'airports.joblib')

#save processing inputs
dump(X_vars_normalize,output_folder+'X_vars_normalize.joblib')
dump(X_vars_log,output_folder+'X_vars_log.joblib')
dump(X_vars_categorical,output_folder+'X_vars_categorical.joblib')
dump(X_vars_hours,output_folder+'X_vars_hours.joblib')
dump(X_vars_cyclical_dict,output_folder+'X_vars_cyclical_dict.joblib')
dump(delay_minutes,output_folder+'delay_minutes.joblib')
dump(imbalanced,output_folder+'subsample_imbalanced.joblib')



INFO:tensorflow:Assets written to: 7-22_May2019-Apr2021\saved_model\assets


['7-22_May2019-Apr2021/subsample_imbalanced.joblib']
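
The training column list is saved because a new flight must be encoded into exactly the same columns, in the same order, before the model can score it. How the Flight Delay Predictor does this is not shown in this notebook, so the following is only a sketch of the idea using plain dictionaries. The helper names, the column list, and the example flight are hypothetical.

```python
def one_hot_row(record, categorical):
    """Expand categorical fields into get_dummies-style 'name_value' columns."""
    row = {}
    for key, value in record.items():
        if key in categorical:
            row[key + '_' + str(value)] = 1
        else:
            row[key] = value
    return row

def align_to_training(row, train_cols):
    """Order values by the training columns. Columns absent from the row become 0
    and columns unseen in training are dropped, like reindex with fill_value=0."""
    return [row.get(col, 0) for col in train_cols]

# hypothetical stand-in for the list saved in X_train_cols.joblib
train_cols = ['CRSElapsedTime', 'Origin_ATL', 'Origin_ORD', 'Dest_ATL', 'Dest_ORD']

# hypothetical new flight; 'XYZ' is a destination that was not in the training data
new_flight = {'CRSElapsedTime': 0.4, 'Origin': 'ORD', 'Dest': 'XYZ'}

vector = align_to_training(one_hot_row(new_flight, ['Origin', 'Dest']), train_cols)
print(vector)  # [0.4, 0, 1, 0, 0]
```

The unseen destination is silently dropped, which is the same caveat noted in the comments on missing_cat_add.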

## Resources

Useful TensorFlow resources:

- https://www.tensorflow.org/tutorials/structured_data/feature_columns
- https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers
- https://www.tensorflow.org/tutorials/structured_data/time_series
- https://www.tensorflow.org/tutorials/structured_data/imbalanced_data

Cyclical time discussion: https://www.kaggle.com/avanwyk/encoding-cyclical-features-for-deep-learning

