# Clean Connection

### Use this tool 2-3 days before your flight to estimate whether you are likely to miss your connection because of weather

#### The Clean Connection tool uses historical flight data, archived weather forecast data, and a machine learning algorithm to estimate whether you are likely to miss your connection because of extreme weather

#### User Input: Specify the origin, connecting, and final airports (four-letter codes), along with the scheduled departure and arrival times

In [1]:
UserOrigin = 'KBOI'
UserDepartureYear = 2019
UserDepartureMonth = 10
UserDepartureDay = 6
UserScheduledDepartureTime = '11:00' #local time

UserConnectingAirport = 'KBOS'
UserArrivalYear = 2019
UserArrivalMonth = 10
UserArrivalDay = 6
UserScheduledArrivalTime = '20:00' #local time

UserDestinationAirport = 'KDCA'
UserFinalYear = 2019
UserFinalMonth = 10
UserFinalDay = 6
UserScheduledFinalTime = '22:00' #local time

##### ------- Behind the scenes --------

In [2]:
#Import modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier

##### Read in the meteorological (met) data so it can be paired with the user-specified flight info, then read in the merged flight+met data for the ML algorithm

In [4]:
#Read met data
metfile = '../data/processed/met/2019_ProcessedMet.csv'
dfMet = pd.read_csv(metfile)

#Formatting to help with merge
dfMet['timeLocal'] = pd.to_datetime(dfMet['timeLocal'])
dfMet['ORIGIN'] = dfMet['airport']
dfMet['DEST'] = dfMet['airport']
dfMet.sort_values(by=['timeLocal'],inplace=True)

#Forward-fill missing met values for now (note: this is not the final location for this step)
#DataFrame.ffill replaces fillna(method='ffill'), which is deprecated in recent pandas versions
dfMet.ffill(inplace=True)

#Read merged flight+met data for ML and perform slight processing
file = '../data/processed/merged/2019_FlightMetMerged.csv'
df = pd.read_csv(file)

df.drop_duplicates(inplace=True)

#Convert flight times to datetime (flights with arrival time before departure time are not filtered out in this cell)
df['ARR_TIME'] = pd.to_datetime(df['ARR_TIME'])
df['DEP_TIME'] = pd.to_datetime(df['DEP_TIME'])

In [5]:
#Convert user-input into df-compatible format
depTime = str(UserDepartureYear)+'-'+str(UserDepartureMonth)+'-'+str(UserDepartureDay)+' '+UserScheduledDepartureTime
depTime = pd.to_datetime(depTime)

arrTime = str(UserArrivalYear)+'-'+str(UserArrivalMonth)+'-'+str(UserArrivalDay)+' '+UserScheduledArrivalTime
arrTime = pd.to_datetime(arrTime)

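#Look up the numeric airport IDs; note that depCode holds the ID of the connecting (destination) airport
orgCode = df[df['ORIGIN']==UserOrigin]['ORIGIN_AIRPORT_ID'].iloc[0]
depCode = df[df['DEST']==UserConnectingAirport]['DEST_AIRPORT_ID'].iloc[0]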
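
The string concatenation above can also be written with `pd.Timestamp`, which rejects invalid date parts. The following is only a minimal sketch: the helper name `to_timestamp` is hypothetical and not part of this notebook, and the values simply repeat the user-input cell.

```python
import pandas as pd

def to_timestamp(year, month, day, hhmm):
    """Combine date parts and an 'HH:MM' string into a Timestamp (hypothetical helper)."""
    hour, minute = (int(part) for part in hhmm.split(':'))
    return pd.Timestamp(year=year, month=month, day=day, hour=hour, minute=minute)

# Same values as the user-input cell
dep_time = to_timestamp(2019, 10, 6, '11:00')
arr_time = to_timestamp(2019, 10, 6, '20:00')

# Simple sanity check on the first leg. These are naive local times at two
# different airports, so the comparison ignores time zones.
if arr_time <= dep_time:
    raise ValueError('Scheduled arrival must be after scheduled departure')

print(dep_time, arr_time)
```

Because the times are local to each airport, a check like this is only a rough guard against typos in the input.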

In [11]:
#Create a row of 'X' (called dfUser) based upon the given flight data, and pair it with the appropriate meteorology
dftmp = df[['ORIGIN_AIRPORT_ID','DEST_AIRPORT_ID','DEP_TIME','ARR_TIME']].iloc[0]
dfUser1 = pd.DataFrame(data=dftmp).transpose()
dfUser1['ORIGIN'] = UserOrigin
dfUser1['DEST'] = UserConnectingAirport
dfUser1['ORIGIN_AIRPORT_ID'] = orgCode
dfUser1['DEST_AIRPORT_ID'] = depCode
dfUser1['DEP_TIME'] = depTime
dfUser1['ARR_TIME'] = arrTime

#Link in meteorology at the departure airport
dfUser2 = pd.merge_asof(left=dfUser1,right=dfMet,left_on=['DEP_TIME'],right_on=['timeLocal'],by=['ORIGIN'])

#Drop columns no longer needed and rename met columns so we know those are tied to departure
dfUser2.drop(['timeLocal','airport','DEST_y'],axis=1,inplace=True)
dfUser2.rename(columns={'DEST_x':'DEST','tmpF':'tmpF_D','dptF':'dptF_D','CC':'CC_D','dir':'dir_D',
        'spd':'spd_D','6hPrecPrb':'6hPrecPrb_D','12hPrecPrb':'12hPrecPrb_D',
        '6hQntPrec':'6hQntPrec_D','12hQntPrec':'12hQntPrec_D','snow':'snow_D',
        'ceil':'ceil_D','visib':'visib_D','obstruc':'obstruc_D','fzRnPrb':'fzRnPrb_D',
        'snowPrb':'snowPrb_D','6hrTsPrb_15mi':'6hrTsPrb_15mi_D',
        '6hrSvrTsPrb_25mi':'6hrSvrTsPrb_25mi_D'},inplace=True)

#Link in meteorology at the arrival airport
dfUser2.sort_values(by=['ARR_TIME'],inplace=True)
dfUser = pd.merge_asof(left=dfUser2,right=dfMet,left_on=['ARR_TIME'],right_on=['timeLocal'],
            by=['DEST'])

dfUser.drop(['timeLocal','airport','ORIGIN_y'],axis=1,inplace=True)
dfUser.rename(columns={'ORIGIN_x':'ORIGIN','tmpF':'tmpF_A','dptF':'dptF_A','CC':'CC_A','dir':'dir_A',
            'spd':'spd_A','6hPrecPrb':'6hPrecPrb_A','12hPrecPrb':'12hPrecPrb_A',
            '6hQntPrec':'6hQntPrec_A','12hQntPrec':'12hQntPrec_A','snow':'snow_A',
            'ceil':'ceil_A','visib':'visib_A','obstruc':'obstruc_A','fzRnPrb':'fzRnPrb_A',
            'snowPrb':'snowPrb_A','6hrTsPrb_15mi':'6hrTsPrb_15mi_A',
            '6hrSvrTsPrb_25mi':'6hrSvrTsPrb_25mi_A'},inplace=True)
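
To illustrate what the two `pd.merge_asof` calls above do, here is a small sketch on made-up data (the temperatures and forecast times are invented for illustration). With the default `direction='backward'`, each flight row is paired with the most recent forecast row at or before its time, and `by` restricts the match to the same airport.

```python
import pandas as pd

# Toy forecast table for two airports (values are made up)
met = pd.DataFrame({
    'timeLocal': pd.to_datetime(['2019-10-06 09:00', '2019-10-06 12:00',
                                 '2019-10-06 09:00', '2019-10-06 12:00']),
    'ORIGIN': ['KBOI', 'KBOI', 'KBOS', 'KBOS'],
    'tmpF': [50.0, 58.0, 61.0, 64.0],
}).sort_values('timeLocal')  # merge_asof requires the key column to be sorted

# One flight leaving KBOI at 11:00
flight = pd.DataFrame({
    'DEP_TIME': pd.to_datetime(['2019-10-06 11:00']),
    'ORIGIN': ['KBOI'],
})

# Pair the flight with the latest KBOI forecast at or before 11:00
paired = pd.merge_asof(flight, met, left_on='DEP_TIME', right_on='timeLocal', by='ORIGIN')
print(paired[['ORIGIN', 'DEP_TIME', 'timeLocal', 'tmpF']])
```

In this toy case the 11:00 departure picks up the 09:00 KBOI row, not the 12:00 one, because only earlier or equal forecast times are considered.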

#### Train ML model

In [8]:
#Prepare the dataframe used for ML modeling

#Drop any row from df with NaNs; take an explicit copy so the assignment below does not act on a slice of df
dfML = df.dropna().copy()

#Merge arrival delay bins into two classes for now: 0 (delay group <= 0) and 1 (delay group > 0)
dfML['ARR_DELAY_GROUP'] = (dfML['ARR_DELAY_GROUP'] > 0).astype(int)

#Use subset of dfML for ML modeling
X = dfML[['tmpF_D', 'dptF_D','dir_D', 'spd_D', '6hPrecPrb_D', 
          '6hQntPrec_D', 'ceil_D', 'visib_D','fzRnPrb_D', 'snowPrb_D', '6hrTsPrb_15mi_D',
          '6hrSvrTsPrb_25mi_D', 'tmpF_A', 'dptF_A', 'dir_A', 'spd_A','6hPrecPrb_A', '6hQntPrec_A', 'ceil_A', 
          'visib_A','fzRnPrb_A', 'snowPrb_A', '6hrTsPrb_15mi_A','6hrSvrTsPrb_25mi_A','snow_D','snow_A',
          '12hPrecPrb_D','12hQntPrec_D','12hPrecPrb_A','12hQntPrec_A']]

y = dfML['ARR_DELAY_GROUP']


In [12]:
#Normalize the dataset; include the user data row so it is scaled together with X
dfUser = dfUser[['tmpF_D', 'dptF_D', 'dir_D', 'spd_D', '6hPrecPrb_D', '6hQntPrec_D','ceil_D', 'visib_D', 
                 'fzRnPrb_D', 'snowPrb_D', '6hrTsPrb_15mi_D','6hrSvrTsPrb_25mi_D', 'tmpF_A', 'dptF_A', 
                 'dir_A', 'spd_A','6hPrecPrb_A', '6hQntPrec_A', 'ceil_A', 'visib_A', 'fzRnPrb_A','snowPrb_A', 
                 '6hrTsPrb_15mi_A', '6hrSvrTsPrb_25mi_A', 'snow_D','snow_A', '12hPrecPrb_D', '12hQntPrec_D', 
                 '12hPrecPrb_A','12hQntPrec_A']]

#Make a dataframe with X plus user data
XwithUsertmp = [X,dfUser]
XwithUser = pd.concat(XwithUsertmp)

#Perform scaling
sc = MinMaxScaler()
data = sc.fit_transform(XwithUser)

#Separate X from user data
Xdata = data[:-1,:]

XUser = data[-1,:]
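
An alternative to concatenating the user row before scaling is to fit the scaler on the historical rows only and then reuse it for the user row. This is a sketch of that alternative, not what the notebook does; `X_hist` and `x_user` are synthetic stand-ins for `X` and `dfUser`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X_hist = rng.uniform(0, 100, size=(50, 3))  # synthetic stand-in for X
x_user = np.array([[25.0, 50.0, 75.0]])     # synthetic stand-in for the dfUser row

# Learn each column's min and max from the historical rows only
sc = MinMaxScaler().fit(X_hist)

# Apply the same min/max to both the historical rows and the user row
X_hist_scaled = sc.transform(X_hist)
x_user_scaled = sc.transform(x_user)

print(X_hist_scaled.min(), X_hist_scaled.max(), x_user_scaled)
```

With this approach the user row cannot shift the fitted minimum or maximum. The trade-off is that a user value outside the historical range maps outside [0, 1].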


In [17]:
#Split the scaled dataset into train and test groups
X_train, X_test, y_train, y_test = train_test_split(Xdata, y)

In [18]:
#Train random forest model
clf = RandomForestClassifier(n_estimators=10).fit(X_train, y_train)
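
The split above produces `X_test` and `y_test`, which can be used to check the classifier on rows it has not seen. Below is a minimal sketch on synthetic data; the `_demo` names, the `random_state` values, and the stratified split are illustrative assumptions, not settings used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic features and labels standing in for Xdata and y
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 4))
y_demo = (X_demo[:, 0] > 0.5).astype(int)

# Stratify so both classes appear in the train and test groups
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

clf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_tr, y_tr)

# Score the model on the held-out rows
y_pred = clf_demo.predict(X_te)
acc = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)  # rows: true class, columns: predicted class
print(acc)
print(cm)
```

If delayed flights are much rarer than on-time flights, the confusion matrix is usually more informative than accuracy alone.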

#### Ingest user-provided data into trained ML model

In [40]:
#Run random forest model using user-provided data
XUserT = XUser.reshape(1, -1)
forest_predicted = clf.predict(XUserT)
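
Since the tool aims to say how likely a missed connection is, a class probability may be more useful than a hard 0/1 label. The sketch below uses `predict_proba` on synthetic data; the `_demo` names and data are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic features and labels standing in for the scaled training data
rng = np.random.default_rng(1)
X_demo = rng.uniform(size=(200, 4))
y_demo = (X_demo[:, 1] > 0.5).astype(int)

clf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# A single row reshaped to (1, n_features), as done for XUser above
x_user = X_demo[0].reshape(1, -1)

# predict_proba returns one column per class, ordered as in clf_demo.classes_
proba = clf_demo.predict_proba(x_user)
p_delay = proba[0, list(clf_demo.classes_).index(1)]
print(f"Estimated probability of a delayed arrival: {p_delay:.2f}")
```

With only 10 trees the probabilities are coarse, so a larger `n_estimators` would give finer-grained estimates.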

##### ------- End behind the scenes --------

#### Important Information for Customer

In [44]:
#forest_predicted is a one-element array; class 0 means no arrival delay is predicted
if forest_predicted[0] == 0:
    print("You are going to make your connection!")
else:
    print("You are likely going to miss your connection!")

You are going to make your connection!
