# Task Description

Case description:
Welcome to East Carmen Airlines! As a strategic data scientist, you are confronted with many use cases to digitalize the airline business. Most important for airlines are accurate predictions of flight arrivals during a day of operations. When it comes to passenger handling, such predictions would admit a proactive steering, in order that the passengers reach the connection flights and therefore save a lot of money.

Your model should be applicable to each aircraft in East Carmen’s fleet. Let us take as an example the aircraft with registration name EC-LPD. At scheduled departure of the earliest morning flight from East Carmen’s hub Madrid the model should predict all flights of EC-LPD on the given day.

Many business stakeholders do not trust in the robustness of such a model. They claim that for a given aircraft, one could predict the first flight of the day with a decent error, but the errors of second and third flights will be huge. It is now your task to convince the business stakeholders that with a well-prepared data set and meaningful features you can deliver robust flight predictions.

Project Aim:
From different departments business analysts have collected a lot of information on possible features for flight predictions. Now it is your task to use these data to deliver a prototype forecasting model for all flights of a given aircraft at scheduled departure of the first flight of the day. 

For instance, for an aircraft with registration name EC-LPD, first flight of the day (FFOD) scheduled at 5:30 am from MAD to VIE and with the connection chain MAD-VIE-MAD-CDG-MAD-LHR-MAD your model should predict the successive landing times in VIE, MAD, CDG, MAD, LHR, MAD. The colleagues from daily operations are especially interested in how the confidence level of such a model evolves during a day of operations.

# General Information
| Amendments by me


3.1 File Names
- zip-folder: use_case_3.zip
- flight_information.csv : information on flights in a given time period, where each data point is a so-called 
leg, i.e. a flight from departure airport to destination airport with all relevant flight and crew information
- ground_information.csv : information on the ground processes after landing at our hub “East Carmen”,
where each data point describes the processes between two flights (inbound = landing, outbound = takeoff).


3.2 Column Description – Flight Information
- leg_no: unique identifier of a flight on a given day, at a certain time, with a given flight number
- fn_carrier: airline name
- fn_number: flight number; has to be unique on a given day
- dep_ap_sched: scheduled departure airport
- arr_ap_sched: scheduled arrival airport
- dep_sched_date: scheduled departure date
- dep_sched_time: scheduled departure time
- arr_sched_date: scheduled arrival date
- arr_sched_time: scheduled arrival time
- m_offblockdt: timestamp of departure
- m_onblockdt: timestamp of arrival
- ac_registration: aircraft registration number, i.e. the “license plate” of the aircraft
- change_reason_code: reason for delay (assigned after the flight)
- dep_delay: departure delay
- Ac Type Code: aircraft type (example: 320 = Airbus A320)
- trans_time: true minimal transition time for crew members after flight, i.e. transition time of the crew
member with the least time
- sched_trans_time: scheduled minimal transition time for crew members after flight, i.e. scheduled transition
time of the crew member with the least time
4
- Crew Group: assignment of what happens to the whole crew after a flight
o Start : First flight of day
o A : all crew members stay on the aircraft for the next flight
o B, B2: all crew members switch aircraft for the next flight
o C: at least one crew member switches aircraft for the next flight
- TLC_trans: names of crew members on flight with some additional information attached to each name (but business does not know exactly which additional information is visible there)
- crew_type_change: rank (cp = pilot, ca = cabin member) of crew members, who changed aircraft
- Sched Groundtime: scheduled ground time of the aircraft between flights | has mistakes; mostly (dep_sched_time of next line) - (arr_sched_time of current line); groundtime AFTER landing
- Act Groundtime: actual ground time of the aircraft between flights | has mistakes; mostly (m_offblocktime of next line) - (m_onblocktime of current line); groundtime AFTER landing


3.3 Column Description – Ground Information
- Each datapoint consists of an inbound (=arriving) flight and an outbound (=departing) flight from our hub “East Carmen)
- Catering, cleaning and pax (=passenger) boarding durations given for our hub “East Carmen” between the inbound and outbound flights
- day_of_origin: day of flight (also given in Flight Information)
- ac_type: aircraft type (also given in Flight Information)
- fn_number: flight number (also given in Flight Information)
- ac_registration: aircraft registration (also given in Flight Information)
- mingt: minimal scheduled ground time for the given aircraft
- dep_leg_inbound: departure airport name of inbound (=arriving) flight
- arr_leg_inbound: arrival airport name of inbound (=arriving) flight
- arr_leg_outbound: departure airport name of outbound (=departing) flight
- sched_inbound_dep: scheduled departure time of inbound (=arriving) flight
- sched_inbound_arr: scheduled arrival time of inbound (=arriving) flight
- sched_outbound_dep: scheduled departure time of outbound (=departing) flight
- sched_outbound_arr: scheduled arrival time of outbound (=departing) flight
- sched_turnaround: scheduled ground time for aircraft
- leg_inbound: leg number of inbound flight
- leg_outbound: leg number of outbound flight
- catering_duration: catering duration (i.e. filling up meal boxes) between flights in minutes
- cleaning_duration: cleaning duration between flights in minutes
- pax_boarding_duration: boarding duration between flights in minutes


3.4 Additional Info from Business Side:
- BI (=business intelligence) colleagues have collected a lot of flight information from many different de-partments: flight-, crew- and ground operations. Each department uses a different data warehouse. BI colleagues claim that some of the columns might be redundant and some information could be wrong, due to bad and not consistent data in the different data warehouses of the business departments. It is central for this prototype study to clean and transform the data and make consistency checks, before applying a machine learning algorithm.

In [1]:
# Imports
import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt

import datetime

import functions # own helper functions

#Settings
pd.set_option("display.max_rows", 70)
pd.set_option("display.max_columns", 101)


# Scope of this workbook

This workbook aims at initial feature creation and engineering without validating them. Validation is done in advanced eda with final preparation in advanced feature engineering. 

# 1. Full df import

In [2]:
# data frame for all edits
full_df = pd.read_pickle('../data/interim/full_df_unedited.pkl')

# Sorting data frame correctly
full_df.sort_values(by = ['ac_registration_x', 'm_offblockdt'], inplace = True)


# 4. Full df - Data Cleaning

This is the overall standard data frame we will use for all models except the baseline model. Here we do all the data cleaning we found out in raw_eda

## 4.1 Specific Add Columns 

In [3]:
# Create flt event number
#full_df["flt_event_number"] = full_df.groupby(['dep_sched_date', 'ac_registration_x']).cumcount()+1

functions.create_leg_no(full_df)

In [4]:
# Create leg as directional route 

functions.create_leg(full_df)

In [5]:
# Create rotation mismatch indicator
full_df = functions.create_rotmismatch(full_df)

idx = full_df[full_df['rot_mismatch']==True].index
idx_rol = []
for i in idx:
    idx_rol.append(i-1)
    idx_rol.append(i)
    idx_rol.append(i+1)

print(len(idx), " rotational mistakes")
full_df.loc[idx_rol,:].head(9)

222  rotational mistakes


Unnamed: 0,leg_no,fn_carrier,fn_number_x,dep_ap_sched,arr_ap_sched,dep_sched_date,dep_sched_time,arr_sched_date,arr_sched_time,m_offblockdt,m_onblockdt,ac_registration_x,change_reason_code,dep_delay,Ac Type Code,trans_time,sched_trans_time,Crew Group,TLC_trans,crew_type_change,Sched Groundtime,Act Groundtime,day_of_origin,ac_type,fn_number_y,ac_registration_y,mingt,dep_leg_inbound,arr_leg_inbound,arr_leg_outbound,sched_inbound_dep,sched_inbound_arr,sched_outbound_dep,sched_outbound_arr,sched_turnaround,leg_inbound,leg_outbound,catering_duration,cleaning_duration,pax_boarding_duration,flt_event_number,leg,rot_mismatch
4733,272004639,East Carmen Airlines,EC3088,Paulbury,East Carmen,2019-06-12,2019-06-12 17:50:00,2019-06-12,2019-06-12 19:05:00,2019-06-12 17:51:00,2019-06-12 19:07:00,ECLGLX,other problem,1.0,DH4,25,45,A,"['Brian Lester_302_25_45_cp', 'Jessica Wilson_...",[],130.0,140.0,NaT,,,,,,,,NaT,NaT,NaT,NaT,,,,,,,7,Paulbury-East Carmen,False
4734,271958002,East Carmen Airlines,EC2582,Lake Lawrencechester,East Carmen,2019-06-12,2019-06-12 17:50:00,2019-06-12,2019-06-12 19:15:00,2019-06-12 18:00:00,2019-06-12 19:19:00,ECLBCX,other problem,10.0,320,47,70,A,"['David Cruz_286_47_70_ca', 'Veronica Nunez_28...",[],,,2019-06-12,321,EC2582,ECLBCX,45.0,Lake Lawrencechester,East Carmen,East Latashaview,2019-06-12 17:50:00,2019-06-12 19:15:00,2019-06-12 18:15:00,2019-06-12 21:40:00,-60.0,271958002.0,272213781.0,23.0,,19.0,6,Lake Lawrencechester-East Carmen,True
4735,271990429,East Carmen Airlines,EC2930,East Melindachester,East Carmen,2019-06-12,2019-06-12 17:50:00,2019-06-12,2019-06-12 19:25:00,2019-06-12 17:50:00,2019-06-12 19:09:00,ECLWEX,no reason,0.0,E95,-9,40,A,"['Chad Ford_318_-9_40_ca', 'Juan Dennis_318_-9...",[],,,NaT,,,,,,,,NaT,NaT,NaT,NaT,,,,,,,7,East Melindachester-East Carmen,False
7941,272214515,East Carmen Airlines,EC3320,New Bobby,East Carmen,2019-06-20,2019-06-20 13:40:00,2019-06-20,2019-06-20 17:25:00,2019-06-20 15:20:00,2019-06-20 18:46:00,ECLBVX,technical problem,100.0,320,37,60,A,"['Kevin Robinson_147_37_60_ca', 'Ariana Turner...",[],60.0,-1.0,2019-06-20,320,EC3320,ECLBVX,40.0,New Bobby,East Carmen,Yoderburgh,2019-06-20 13:40:00,2019-06-20 17:25:00,2019-06-20 18:25:00,2019-06-20 20:05:00,60.0,272214515.0,271960963.0,29.0,,8.0,4,New Bobby-East Carmen,False
7942,272024194,East Carmen Airlines,EC3288,New Jessica,East Carmen,2019-06-20,2019-06-20 13:50:00,2019-06-20,2019-06-20 17:15:00,2019-06-20 14:13:00,2019-06-20 17:24:00,ECLBDX,other problem,23.0,320,0,0,Start,"['Angel Vazquez_nan_nan_nan_ca', 'Patrick Brow...",[],,,2019-06-20,322,EC3288,ECLBDX,45.0,New Jessica,East Carmen,East Latashaview,2019-06-20 13:50:00,2019-06-20 17:15:00,2019-06-20 18:15:00,2019-06-20 21:40:00,60.0,272024194.0,272213789.0,25.0,,20.0,3,New Jessica-East Carmen,True
7943,272057439,East Carmen Airlines,EC6516,Christophershire,East Carmen,2019-06-20,2019-06-20 14:00:00,2019-06-20,2019-06-20 18:40:00,2019-06-20 14:02:00,2019-06-20 18:14:00,ECLBQX,other problem,2.0,320,14,45,A,"['Danielle Harrison_118_14_45_ca', 'Jonathan T...",[],45.0,83.0,2019-06-20,320,EC6516,ECLBQX,40.0,Christophershire,East Carmen,Lisamouth,2019-06-20 14:00:00,2019-06-20 18:40:00,2019-06-20 19:25:00,2019-06-20 21:05:00,45.0,272057439.0,272016582.0,20.0,,17.0,3,Christophershire-East Carmen,False
5263,271992583,East Carmen Airlines,EC2974,West Danielport,East Carmen,2019-06-14,2019-06-14 05:05:00,2019-06-14,2019-06-14 06:35:00,2019-06-14 05:16:00,2019-06-14 06:36:00,ECLWNX,other problem,11.0,E95,0,0,Start,"['Kathryn Garcia_nan_nan_nan_cp', 'Angie Smith...",[],80.0,145.0,2019-06-14,E95,EC2974,ECLWNX,40.0,West Danielport,East Carmen,Kaylashire,2019-06-14 05:05:00,2019-06-14 06:35:00,2019-06-14 07:55:00,2019-06-14 10:30:00,80.0,271992583.0,272202801.0,24.0,,21.0,1,West Danielport-East Carmen,False
5264,271985389,East Carmen Airlines,EC2867,East Carmen,Kennethfort,2019-06-14,2019-06-14 05:05:00,2019-06-14,2019-06-14 07:15:00,2019-06-14 05:24:00,2019-06-14 07:33:00,ECLBEX,other problem,19.0,320,0,0,Start,"['Aaron Allen_nan_nan_nan_ca', 'Ian Odonnell_n...",[],215.0,229.0,NaT,,,,,,,,NaT,NaT,NaT,NaT,,,,,,,1,East Carmen-Kennethfort,True
5265,271999086,East Carmen Airlines,EC3027,East Carmen,Gilbertland,2019-06-14,2019-06-14 05:10:00,2019-06-14,2019-06-14 06:50:00,2019-06-14 05:24:00,2019-06-14 06:55:00,ECLWQX,other problem,14.0,E95,0,0,Start,"['William Ray_nan_nan_nan_cp', 'Heather Casey_...",[],80.0,104.0,NaT,,,,,,,,NaT,NaT,NaT,NaT,,,,,,,1,East Carmen-Gilbertland,False


In [6]:
# Create delay columns
functions.create_actualblocktime(full_df)
functions.create_blockdelay(full_df)
functions.create_grounddelay(full_df)

In [7]:
# Create new features for day of the week and hour of the day
full_df['day_of_week'] = full_df['dep_sched_time'].dt.dayofweek
full_df['hour_of_day_dep'] = full_df['dep_sched_time'].dt.hour
full_df['hour_of_day_arr'] = full_df['arr_sched_time'].dt.hour
full_df['minute_of_day_dep'] = full_df['dep_sched_time'].dt.hour*60+full_df['dep_sched_time'].dt.minute

In [8]:
# Create correct crew change column

functions.create_crewchange(full_df)

full_df['Crewchange'].value_counts()

no change          5207
both               4127
last flt of day    2013
cabin               453
cockpit             301
Name: Crewchange, dtype: int64

In [10]:
#Correcting the ground times to display the ground values before a flight, not after
functions.create_groundtimes_before(full_df)
full_df['Crewchange_before'].value_counts()

no change           5207
both                4127
first flt of day    2013
cabin                453
cockpit              301
Name: Crewchange_before, dtype: int64

In [11]:

# Create crew change column before flight
functions.create_crewchange_before(full_df)
full_df.head()

Unnamed: 0,leg_no,fn_carrier,fn_number_x,dep_ap_sched,arr_ap_sched,dep_sched_date,dep_sched_time,arr_sched_date,arr_sched_time,m_offblockdt,m_onblockdt,ac_registration_x,change_reason_code,dep_delay,Ac Type Code,trans_time,sched_trans_time,Crew Group,TLC_trans,crew_type_change,Sched Groundtime,Act Groundtime,day_of_origin,ac_type,fn_number_y,ac_registration_y,mingt,dep_leg_inbound,arr_leg_inbound,arr_leg_outbound,sched_inbound_dep,sched_inbound_arr,sched_outbound_dep,sched_outbound_arr,sched_turnaround,leg_inbound,leg_outbound,catering_duration,cleaning_duration,pax_boarding_duration,flt_event_number,leg,rot_mismatch,actual_block_time,block_delay,ground_delay,day_of_week,hour_of_day_dep,hour_of_day_arr,minute_of_day_dep,Crewchange,act_groundtime_before,sched_groundtime_before,Crewchange_before
9,272024970,East Carmen Airlines,EC3292,New Jessica,East Carmen,2019-06-01,2019-06-01 03:25:00,2019-06-01,2019-06-01 06:45:00,2019-06-01 03:50:00,2019-06-01 07:01:00,ECLBAX,other problem,25.0,320,0,0,Start,"['Sean Weeks_nan_nan_nan_ca', 'Tony Lloyd_nan_...",[],95.0,94.0,2019-06-01,321.0,EC3292,ECLBAX,45.0,New Jessica,East Carmen,South Nathaniel,2019-06-01 03:25:00,2019-06-01 06:45:00,2019-06-01 08:20:00,2019-06-01 10:35:00,95.0,272024970.0,272022230.0,27.0,,25.0,1,New Jessica-East Carmen,False,191.0,-9.0,25.0,5,3,6,205,both,94.0,95.0,first flt of day
120,272022230,East Carmen Airlines,EC3257,East Carmen,South Nathaniel,2019-06-01,2019-06-01 08:20:00,2019-06-01,2019-06-01 10:35:00,2019-06-01 08:35:00,2019-06-01 10:41:00,ECLBAX,other problem,15.0,320,60,60,B,"['Toni Rodriguez_48_60_60_ca', 'Rachel Smith_4...",[],75.0,120.0,NaT,,,,,,,,NaT,NaT,NaT,NaT,,,,,,,2,East Carmen-South Nathaniel,False,126.0,-9.0,15.0,5,8,10,500,no change,94.0,95.0,both
198,272212848,East Carmen Airlines,EC3258,South Nathaniel,East Carmen,2019-06-01,2019-06-01 11:50:00,2019-06-01,2019-06-01 14:15:00,2019-06-01 12:41:00,2019-06-01 14:52:00,ECLBAX,rotational problem,51.0,320,68,75,A,"['Toni Rodriguez_127_68_75_ca', 'Rachel Smith_...",[],80.0,86.0,2019-06-01,321.0,EC3258,ECLBAX,45.0,South Nathaniel,East Carmen,Joneshaven,2019-06-01 11:50:00,2019-06-01 14:15:00,2019-06-01 15:35:00,2019-06-01 17:00:00,80.0,272212848.0,271997824.0,25.0,,,3,South Nathaniel-East Carmen,False,131.0,-14.0,51.0,5,11,14,710,both,120.0,75.0,no change
283,271997824,East Carmen Airlines,EC3021,East Carmen,Joneshaven,2019-06-01,2019-06-01 15:35:00,2019-06-01,2019-06-01 17:00:00,2019-06-01 16:18:00,2019-06-01 17:32:00,ECLBAX,rotational problem,43.0,320,64,70,B2,"['Tyler Morris_224_64_70_ca', 'Sergio Cummings...","['ca', 'cp', 'ca', 'ca', 'ca']",50.0,32.0,NaT,,,,,,,,NaT,NaT,NaT,NaT,,,,,,,4,East Carmen-Joneshaven,False,74.0,-11.0,43.0,5,15,17,935,no change,86.0,80.0,both
317,271998033,East Carmen Airlines,EC3022,Joneshaven,East Carmen,2019-06-01,2019-06-01 17:50:00,2019-06-01,2019-06-01 19:10:00,2019-06-01 18:04:00,2019-06-01 19:14:00,ECLBAX,other problem,14.0,320,42,50,A,"['Sergio Cummings_279_42_50_cp', 'Jackie Black...",[],,,NaT,,,,,,,,NaT,NaT,NaT,NaT,,,,,,,5,Joneshaven-East Carmen,False,70.0,-10.0,14.0,5,17,19,1070,last flt of day,32.0,50.0,no change


## 4.2 Specific Drop Rows 

In [10]:
# Mark dupes to drop 
print(full_df.shape)
full_df.drop_duplicates(subset = ['leg_no'], inplace = True, keep = False) # we keep no dupes at all
print(full_df.shape)

(12101, 54)
(11977, 54)


In [11]:
# Drop NAs for ground delay, block delay
print(full_df.shape)
full_df.dropna(subset=['block_delay'], inplace = True)
print(full_df.shape)
full_df.dropna(subset=['ground_delay'], inplace = True)
print(full_df.shape)

(11977, 54)
(11896, 54)
(11896, 54)


In [12]:
# Find cases where dep ap flight is not equal dep ap ground
print(full_df[full_df['dep_leg_inbound'].notna()][full_df['dep_ap_sched']!=full_df['dep_leg_inbound']].shape)
print(full_df[full_df['arr_leg_inbound'].notna()][full_df['arr_ap_sched']!=full_df['arr_leg_inbound']].shape)

(0, 54)
(0, 54)


  print(full_df[full_df['dep_leg_inbound'].notna()][full_df['dep_ap_sched']!=full_df['dep_leg_inbound']].shape)
  print(full_df[full_df['arr_leg_inbound'].notna()][full_df['arr_ap_sched']!=full_df['arr_leg_inbound']].shape)


In [15]:
# Mark rows that have wrong sched/act ground time 

def recalc_groundtimes_no_replace(df):
    df['act_groundtime_new'] = np.where(
        (df['ac_registration_x'].shift(-1).eq(df['ac_registration_x']) == True) & (df['Act Groundtime'].isna() == False),
        np.around((df['m_offblockdt'].shift(-1) - df['m_onblockdt']).dt.total_seconds()/60, decimals =1),
        df['Act Groundtime']
    )
    df['sched_groundtime_new'] = np.where(
        (df['ac_registration_x'].shift(-1).eq(df['ac_registration_x']) == True) & (df['Sched Groundtime'].isna() == False),
        np.around((df['dep_sched_time'].shift(-1) - df['arr_sched_time']).dt.total_seconds()/60, decimals =1),
        df['Sched Groundtime']
    )

recalc_groundtimes_no_replace(full_df)

# rows which do not have correct actual or sched ground time

full_df['act_gt_diff'] = np.around(full_df['Act Groundtime'] - full_df['act_groundtime_new'],0)
full_df['sched_gt_diff'] = np.around(full_df['Sched Groundtime'] - full_df['sched_groundtime_new'],0)
#print(str(full_df[full_df['act_groundtime_new'].isna() == False].shape[0]) + " rows have wrongly calculated ground times. We filter out NAs for Act Groundtime as the majority of them" +
#    " are the last flights of the respective day.")

#Fill NAs which are created as some ground times are NA
full_df['act_gt_diff'].fillna(0, inplace = True)
full_df['sched_gt_diff'].fillna(0, inplace = True)

full_df["rows_to_drop"] = np.where(full_df['act_gt_diff']!=0, 1, 0)
print(full_df["rows_to_drop"].sum())
full_df["rows_to_drop"] = np.where(full_df['sched_gt_diff']!=0, 1, 0)
print(full_df["rows_to_drop"].sum())

full_df.drop(['act_gt_diff', 'sched_gt_diff'], inplace = True, axis = 1)
full_df.shape

161
155


(11896, 57)

In [16]:
# Mark rows with rotational mistakes
full_df["rows_to_drop"] = np.where(full_df['rot_mismatch']==True, 1, full_df["rows_to_drop"])
full_df["rows_to_drop"].sum()

224

In [17]:

# Create Feature for additional rows to drop for ground delay prediction
full_df['rows_to_drop_grounddelay'] = np.where(full_df['Sched Groundtime'].isna() == True, 1, 0)
full_df["rows_to_drop_grounddelay"].sum()

1992

In [18]:

# Also drop all last flights of a day for grounddelay
full_df['rows_to_drop_grounddelay'] = np.where(full_df['Crewchange']=='last flt of day', 1, full_df['rows_to_drop_grounddelay'])
full_df["rows_to_drop_grounddelay"].sum()

1997

In [19]:
# Temp von Hinnerk ******************************************************************
# hieran liegts nicht

'''print(full_df.shape)

# Sort values based on registration and scheduled departure date.
full_df.sort_values(by=['ac_registration_x', 'dep_sched_time'], inplace=True)

# Create new variable to check for consistent routing.
full_df['routing'] = np.where(((full_df['ac_registration_x'] == full_df['ac_registration_x'].shift(-1)) | full_df['dep_sched_time'].dt.date != full_df['dep_sched_time'].dt.date.shift(-1))
    & (full_df['arr_ap_sched'] == full_df['dep_ap_sched'].shift(-1)), True, False)

# Explicitly set routing on True for first entry in dataset as the above code doesn't work for the first line.
full_df['routing'].iloc[0] = True

# Save all scheduling mistakes in a seperate dataframe and write it to a csv file
sched_mistake = full_df[(full_df['routing'] == False)]
#sched_mistake.to_csv('../data/interim/sched_mistake.csv')

# Drop rows with scheduling mistake
print(full_df.shape)
print(full_df[(full_df['routing'] == True)].shape)
#full_df = full_df[(full_df['routing'] == True)]
#print(full_df.shape)

sched_mistake#[(sched_mistake['rows_to_drop'] == 0 ) & (sched_mistake['rows_to_drop_grounddelay'] ==0)]

#idx = [150,151,152]
#full_df.loc[idx]'''

"print(full_df.shape)\n\n# Sort values based on registration and scheduled departure date.\nfull_df.sort_values(by=['ac_registration_x', 'dep_sched_time'], inplace=True)\n\n# Create new variable to check for consistent routing.\nfull_df['routing'] = np.where(((full_df['ac_registration_x'] == full_df['ac_registration_x'].shift(-1)) | full_df['dep_sched_time'].dt.date != full_df['dep_sched_time'].dt.date.shift(-1))\n    & (full_df['arr_ap_sched'] == full_df['dep_ap_sched'].shift(-1)), True, False)\n\n# Explicitly set routing on True for first entry in dataset as the above code doesn't work for the first line.\nfull_df['routing'].iloc[0] = True\n\n# Save all scheduling mistakes in a seperate dataframe and write it to a csv file\nsched_mistake = full_df[(full_df['routing'] == False)]\n#sched_mistake.to_csv('../data/interim/sched_mistake.csv')\n\n# Drop rows with scheduling mistake\nprint(full_df.shape)\nprint(full_df[(full_df['routing'] == True)].shape)\n#full_df = full_df[(full_df['routin

In [20]:
# Actual drop of rows

full_df = full_df[full_df['rows_to_drop'] == 0]
full_df.shape

(11672, 58)

In [21]:
full_df.head()

Unnamed: 0,leg_no,fn_carrier,fn_number_x,dep_ap_sched,arr_ap_sched,dep_sched_date,dep_sched_time,arr_sched_date,arr_sched_time,m_offblockdt,m_onblockdt,ac_registration_x,change_reason_code,dep_delay,Ac Type Code,trans_time,sched_trans_time,Crew Group,TLC_trans,crew_type_change,Sched Groundtime,Act Groundtime,day_of_origin,ac_type,fn_number_y,ac_registration_y,mingt,dep_leg_inbound,arr_leg_inbound,arr_leg_outbound,sched_inbound_dep,sched_inbound_arr,sched_outbound_dep,sched_outbound_arr,sched_turnaround,leg_inbound,leg_outbound,catering_duration,cleaning_duration,pax_boarding_duration,flt_event_number,leg,rot_mismatch,actual_block_time,block_delay,ground_delay,day_of_week,hour_of_day_dep,hour_of_day_arr,minute_of_day_dep,Crewchange,act_groundtime_before,sched_groundtime_before,Crewchange_before,act_groundtime_new,sched_groundtime_new,rows_to_drop,rows_to_drop_grounddelay
9,272024970,East Carmen Airlines,EC3292,New Jessica,East Carmen,2019-06-01,2019-06-01 03:25:00,2019-06-01,2019-06-01 06:45:00,2019-06-01 03:50:00,2019-06-01 07:01:00,ECLBAX,other problem,25.0,320,0,0,Start,"['Sean Weeks_nan_nan_nan_ca', 'Tony Lloyd_nan_...",[],95.0,94.0,2019-06-01,321.0,EC3292,ECLBAX,45.0,New Jessica,East Carmen,South Nathaniel,2019-06-01 03:25:00,2019-06-01 06:45:00,2019-06-01 08:20:00,2019-06-01 10:35:00,95.0,272024970.0,272022230.0,27.0,,25.0,1,New Jessica-East Carmen,False,191.0,-9.0,25.0,5,3,6,205,both,94.0,95.0,first flt of day,94.0,95.0,0,0
120,272022230,East Carmen Airlines,EC3257,East Carmen,South Nathaniel,2019-06-01,2019-06-01 08:20:00,2019-06-01,2019-06-01 10:35:00,2019-06-01 08:35:00,2019-06-01 10:41:00,ECLBAX,other problem,15.0,320,60,60,B,"['Toni Rodriguez_48_60_60_ca', 'Rachel Smith_4...",[],75.0,120.0,NaT,,,,,,,,NaT,NaT,NaT,NaT,,,,,,,2,East Carmen-South Nathaniel,False,126.0,-9.0,15.0,5,8,10,500,no change,94.0,95.0,both,120.0,75.0,0,0
198,272212848,East Carmen Airlines,EC3258,South Nathaniel,East Carmen,2019-06-01,2019-06-01 11:50:00,2019-06-01,2019-06-01 14:15:00,2019-06-01 12:41:00,2019-06-01 14:52:00,ECLBAX,rotational problem,51.0,320,68,75,A,"['Toni Rodriguez_127_68_75_ca', 'Rachel Smith_...",[],80.0,86.0,2019-06-01,321.0,EC3258,ECLBAX,45.0,South Nathaniel,East Carmen,Joneshaven,2019-06-01 11:50:00,2019-06-01 14:15:00,2019-06-01 15:35:00,2019-06-01 17:00:00,80.0,272212848.0,271997824.0,25.0,,,3,South Nathaniel-East Carmen,False,131.0,-14.0,51.0,5,11,14,710,both,120.0,75.0,no change,86.0,80.0,0,0
283,271997824,East Carmen Airlines,EC3021,East Carmen,Joneshaven,2019-06-01,2019-06-01 15:35:00,2019-06-01,2019-06-01 17:00:00,2019-06-01 16:18:00,2019-06-01 17:32:00,ECLBAX,rotational problem,43.0,320,64,70,B2,"['Tyler Morris_224_64_70_ca', 'Sergio Cummings...","['ca', 'cp', 'ca', 'ca', 'ca']",50.0,32.0,NaT,,,,,,,,NaT,NaT,NaT,NaT,,,,,,,4,East Carmen-Joneshaven,False,74.0,-11.0,43.0,5,15,17,935,no change,86.0,80.0,both,32.0,50.0,0,0
317,271998033,East Carmen Airlines,EC3022,Joneshaven,East Carmen,2019-06-01,2019-06-01 17:50:00,2019-06-01,2019-06-01 19:10:00,2019-06-01 18:04:00,2019-06-01 19:14:00,ECLBAX,other problem,14.0,320,42,50,A,"['Sergio Cummings_279_42_50_cp', 'Jackie Black...",[],,,NaT,,,,,,,,NaT,NaT,NaT,NaT,,,,,,,5,Joneshaven-East Carmen,False,70.0,-10.0,14.0,5,17,19,1070,last flt of day,32.0,50.0,no change,,,0,1


## 4.3 Specific Drop Columns 

In [22]:
full_df.head()

Unnamed: 0,leg_no,fn_carrier,fn_number_x,dep_ap_sched,arr_ap_sched,dep_sched_date,dep_sched_time,arr_sched_date,arr_sched_time,m_offblockdt,m_onblockdt,ac_registration_x,change_reason_code,dep_delay,Ac Type Code,trans_time,sched_trans_time,Crew Group,TLC_trans,crew_type_change,Sched Groundtime,Act Groundtime,day_of_origin,ac_type,fn_number_y,ac_registration_y,mingt,dep_leg_inbound,arr_leg_inbound,arr_leg_outbound,sched_inbound_dep,sched_inbound_arr,sched_outbound_dep,sched_outbound_arr,sched_turnaround,leg_inbound,leg_outbound,catering_duration,cleaning_duration,pax_boarding_duration,flt_event_number,leg,rot_mismatch,actual_block_time,block_delay,ground_delay,day_of_week,hour_of_day_dep,hour_of_day_arr,minute_of_day_dep,Crewchange,act_groundtime_before,sched_groundtime_before,Crewchange_before,act_groundtime_new,sched_groundtime_new,rows_to_drop,rows_to_drop_grounddelay
9,272024970,East Carmen Airlines,EC3292,New Jessica,East Carmen,2019-06-01,2019-06-01 03:25:00,2019-06-01,2019-06-01 06:45:00,2019-06-01 03:50:00,2019-06-01 07:01:00,ECLBAX,other problem,25.0,320,0,0,Start,"['Sean Weeks_nan_nan_nan_ca', 'Tony Lloyd_nan_...",[],95.0,94.0,2019-06-01,321.0,EC3292,ECLBAX,45.0,New Jessica,East Carmen,South Nathaniel,2019-06-01 03:25:00,2019-06-01 06:45:00,2019-06-01 08:20:00,2019-06-01 10:35:00,95.0,272024970.0,272022230.0,27.0,,25.0,1,New Jessica-East Carmen,False,191.0,-9.0,25.0,5,3,6,205,both,94.0,95.0,first flt of day,94.0,95.0,0,0
120,272022230,East Carmen Airlines,EC3257,East Carmen,South Nathaniel,2019-06-01,2019-06-01 08:20:00,2019-06-01,2019-06-01 10:35:00,2019-06-01 08:35:00,2019-06-01 10:41:00,ECLBAX,other problem,15.0,320,60,60,B,"['Toni Rodriguez_48_60_60_ca', 'Rachel Smith_4...",[],75.0,120.0,NaT,,,,,,,,NaT,NaT,NaT,NaT,,,,,,,2,East Carmen-South Nathaniel,False,126.0,-9.0,15.0,5,8,10,500,no change,94.0,95.0,both,120.0,75.0,0,0
198,272212848,East Carmen Airlines,EC3258,South Nathaniel,East Carmen,2019-06-01,2019-06-01 11:50:00,2019-06-01,2019-06-01 14:15:00,2019-06-01 12:41:00,2019-06-01 14:52:00,ECLBAX,rotational problem,51.0,320,68,75,A,"['Toni Rodriguez_127_68_75_ca', 'Rachel Smith_...",[],80.0,86.0,2019-06-01,321.0,EC3258,ECLBAX,45.0,South Nathaniel,East Carmen,Joneshaven,2019-06-01 11:50:00,2019-06-01 14:15:00,2019-06-01 15:35:00,2019-06-01 17:00:00,80.0,272212848.0,271997824.0,25.0,,,3,South Nathaniel-East Carmen,False,131.0,-14.0,51.0,5,11,14,710,both,120.0,75.0,no change,86.0,80.0,0,0
283,271997824,East Carmen Airlines,EC3021,East Carmen,Joneshaven,2019-06-01,2019-06-01 15:35:00,2019-06-01,2019-06-01 17:00:00,2019-06-01 16:18:00,2019-06-01 17:32:00,ECLBAX,rotational problem,43.0,320,64,70,B2,"['Tyler Morris_224_64_70_ca', 'Sergio Cummings...","['ca', 'cp', 'ca', 'ca', 'ca']",50.0,32.0,NaT,,,,,,,,NaT,NaT,NaT,NaT,,,,,,,4,East Carmen-Joneshaven,False,74.0,-11.0,43.0,5,15,17,935,no change,86.0,80.0,both,32.0,50.0,0,0
317,271998033,East Carmen Airlines,EC3022,Joneshaven,East Carmen,2019-06-01,2019-06-01 17:50:00,2019-06-01,2019-06-01 19:10:00,2019-06-01 18:04:00,2019-06-01 19:14:00,ECLBAX,other problem,14.0,320,42,50,A,"['Sergio Cummings_279_42_50_cp', 'Jackie Black...",[],,,NaT,,,,,,,,NaT,NaT,NaT,NaT,,,,,,,5,Joneshaven-East Carmen,False,70.0,-10.0,14.0,5,17,19,1070,last flt of day,32.0,50.0,no change,,,0,1


In [23]:
adv_drop_columns = [
    'Crew Group',
    'TLC_trans',
    'crew_type_change',
    'fn_number_x', # unreliable
    'dep_ap_sched', # encoded in route/leg; no additional information
    'arr_ap_sched' # encoded in route/leg; no additional information
    'Sched Groundtime' # As we have sched_groundtime_before and calculate landing times, Sched Groundtime does not have any predictive value for us
]

In [24]:
full_df = full_df.drop(adv_drop_columns, axis = 1)
full_df.shape

(11672, 52)

## 4.4 Save full_df

In [25]:
full_df.to_pickle('../data/processed/full_df.pkl')
full_df.to_csv('../data/processed/full_df.csv')