# Task Description

Case description:
Welcome to East Carmen Airlines! As a strategic data scientist, you are confronted with many use cases to digitalize the airline business. Most important for airlines are accurate predictions of flight arrivals during a day of operations. In passenger handling, such predictions would allow proactive steering so that passengers reach their connecting flights, which in turn saves a lot of money.

Your model should be applicable to each aircraft in East Carmen’s fleet. Let us take as an example the aircraft with registration name EC-LPD. At scheduled departure of the earliest morning flight from East Carmen’s hub Madrid the model should predict all flights of EC-LPD on the given day.

Many business stakeholders do not trust the robustness of such a model. They claim that for a given aircraft, one could predict the first flight of the day with an acceptable error, but the errors for the second and third flights will be huge. It is now your task to convince the business stakeholders that with a well-prepared data set and meaningful features you can deliver robust flight predictions.

Project Aim:
Business analysts have collected a lot of information from different departments on possible features for flight predictions. Now it is your task to use these data to deliver a prototype forecasting model for all flights of a given aircraft at scheduled departure of the first flight of the day.

For instance, for an aircraft with registration name EC-LPD, first flight of the day (FFOD) scheduled at 5:30 am from MAD to VIE and with the connection chain MAD-VIE-MAD-CDG-MAD-LHR-MAD your model should predict the successive landing times in VIE, MAD, CDG, MAD, LHR, MAD. The colleagues from daily operations are especially interested in how the confidence level of such a model evolves during a day of operations.

# General Information
| Amendments by me


3.1 File Names
- zip-folder: use_case_3.zip
- flight_information.csv: information on flights in a given time period, where each data point is a so-called leg, i.e. a flight from departure airport to destination airport with all relevant flight and crew information
- ground_information.csv: information on the ground processes after landing at our hub “East Carmen”, where each data point describes the processes between two flights (inbound = landing, outbound = takeoff).


3.2 Column Description – Flight Information
- leg_no: unique identifier of a flight on a given day, at a certain time, with a given flight number
- fn_carrier: airline name
- fn_number: flight number; has to be unique on a given day
- dep_ap_sched: scheduled departure airport
- arr_ap_sched: scheduled arrival airport
- dep_sched_date: scheduled departure date
- dep_sched_time: scheduled departure time
- arr_sched_date: scheduled arrival date
- arr_sched_time: scheduled arrival time
- m_offblockdt: timestamp of departure
- m_onblockdt: timestamp of arrival
- ac_registration: aircraft registration number, i.e. the “license plate” of the aircraft
- change_reason_code: reason for delay (assigned after the flight)
- dep_delay: departure delay
- Ac Type Code: aircraft type (example: 320 = Airbus A320)
- trans_time: true minimal transition time for crew members after flight, i.e. transition time of the crew member with the least time
- sched_trans_time: scheduled minimal transition time for crew members after flight, i.e. scheduled transition time of the crew member with the least time
- Crew Group: assignment of what happens to the whole crew after a flight
  - Start: first flight of day
  - A: all crew members stay on the aircraft for the next flight
  - B, B2: all crew members switch aircraft for the next flight
  - C: at least one crew member switches aircraft for the next flight
- TLC_trans: names of crew members on flight with some additional information attached to each name (but business does not know exactly which additional information is visible there)
- crew_type_change: rank (cp = pilot, ca = cabin member) of crew members, who changed aircraft
- Sched Groundtime: scheduled ground time of the aircraft between flights | contains errors; mostly equals (dep_sched_time of next row) - (arr_sched_time of current row), i.e. the ground time AFTER landing
- Act Groundtime: actual ground time of the aircraft between flights | contains errors; mostly equals (m_offblockdt of next row) - (m_onblockdt of current row), i.e. the ground time AFTER landing
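
Since both ground time columns are noted as error-prone, one way to check them is to recompute them from the neighbouring legs of the same aircraft. The following is a minimal sketch under stated assumptions: the toy values are invented, the helper name `recompute_groundtimes` and the `*_check` column names are hypothetical, and "next row" is assumed to mean the next scheduled leg of the same aircraft.

```python
import pandas as pd

# Toy legs for a single aircraft (values invented for illustration).
legs = pd.DataFrame({
    "ac_registration": ["ECLBAX", "ECLBAX", "ECLBAX"],
    "dep_sched_time": pd.to_datetime(["2019-06-01 03:25", "2019-06-01 08:20", "2019-06-01 11:15"]),
    "arr_sched_time": pd.to_datetime(["2019-06-01 06:45", "2019-06-01 10:30", "2019-06-01 13:00"]),
    "m_offblockdt": pd.to_datetime(["2019-06-01 03:50", "2019-06-01 08:35", "2019-06-01 11:20"]),
    "m_onblockdt": pd.to_datetime(["2019-06-01 07:01", "2019-06-01 10:50", "2019-06-01 13:05"]),
    "Sched Groundtime": [95.0, 50.0, None],  # second value is deliberately wrong
})


def recompute_groundtimes(df):
    """Hypothetical helper: recompute ground times after landing, in minutes."""
    out = df.sort_values(["ac_registration", "dep_sched_time"]).copy()
    grouped = out.groupby("ac_registration")
    # Departure of the next leg of the same aircraft (NaT for the last leg).
    next_sched_dep = grouped["dep_sched_time"].shift(-1)
    next_offblock = grouped["m_offblockdt"].shift(-1)
    out["sched_groundtime_check"] = (next_sched_dep - out["arr_sched_time"]).dt.total_seconds() / 60
    out["act_groundtime_check"] = (next_offblock - out["m_onblockdt"]).dt.total_seconds() / 60
    # Rows where the stored value disagrees with the recomputed one.
    out["sched_gt_mismatch"] = (out["Sched Groundtime"] - out["sched_groundtime_check"]).abs() > 0
    return out


checked = recompute_groundtimes(legs)
print(checked[["Sched Groundtime", "sched_groundtime_check", "act_groundtime_check", "sched_gt_mismatch"]])
```

The last leg of each aircraft has no successor, so its recomputed ground time stays missing and is not flagged as a mismatch.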


3.3 Column Description – Ground Information
- Each data point consists of an inbound (=arriving) flight and an outbound (=departing) flight from our hub “East Carmen”
- Catering, cleaning and pax (=passenger) boarding durations given for our hub “East Carmen” between the inbound and outbound flights
- day_of_origin: day of flight (also given in Flight Information)
- ac_type: aircraft type (also given in Flight Information)
- fn_number: flight number (also given in Flight Information)
- ac_registration: aircraft registration (also given in Flight Information)
- mingt: minimal scheduled ground time for the given aircraft
- dep_leg_inbound: departure airport name of inbound (=arriving) flight
- arr_leg_inbound: arrival airport name of inbound (=arriving) flight
- arr_leg_outbound: arrival airport name of outbound (=departing) flight
- sched_inbound_dep: scheduled departure time of inbound (=arriving) flight
- sched_inbound_arr: scheduled arrival time of inbound (=arriving) flight
- sched_outbound_dep: scheduled departure time of outbound (=departing) flight
- sched_outbound_arr: scheduled arrival time of outbound (=departing) flight
- sched_turnaround: scheduled ground time for aircraft
- leg_inbound: leg number of inbound flight
- leg_outbound: leg number of outbound flight
- catering_duration: catering duration (i.e. filling up meal boxes) between flights in minutes
- cleaning_duration: cleaning duration between flights in minutes
- pax_boarding_duration: boarding duration between flights in minutes


3.4 Additional Info from Business Side:
- BI (=business intelligence) colleagues have collected a lot of flight information from many different departments: flight, crew and ground operations. Each department uses a different data warehouse. BI colleagues claim that some of the columns might be redundant and some information could be wrong, due to bad and inconsistent data in the different data warehouses of the business departments. It is central for this prototype study to clean and transform the data and run consistency checks before applying a machine learning algorithm.
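
The consistency checks mentioned above could look roughly like the sketch below. It is illustrative only: the function name `consistency_report`, the chosen checks, and the one-minute tolerance are assumptions, and the toy rows are invented. Column names follow the flight information description.

```python
import pandas as pd

# Toy flight rows (invented) with a few deliberate inconsistencies.
flights = pd.DataFrame({
    "leg_no": [1, 2, 2],
    "dep_sched_time": pd.to_datetime(["2019-06-01 02:30", "2019-06-01 03:15", "2019-06-01 04:00"]),
    "arr_sched_time": pd.to_datetime(["2019-06-01 04:15", "2019-06-01 07:05", "2019-06-01 03:35"]),
    "m_offblockdt": pd.to_datetime(["2019-06-01 03:02", "2019-06-01 03:30", "2019-06-01 04:04"]),
    "m_onblockdt": pd.to_datetime(["2019-06-01 04:43", "2019-06-01 07:07", "2019-06-01 04:00"]),
    "dep_delay": [32.0, 10.0, 4.0],
})


def consistency_report(df, tolerance_min=1):
    """Hypothetical helper: count rows that violate simple plausibility rules."""
    recomputed_delay = (df["m_offblockdt"] - df["dep_sched_time"]).dt.total_seconds() / 60
    checks = {
        # leg_no is described as a unique identifier
        "duplicate_leg_no": df["leg_no"].duplicated(keep=False),
        # an aircraft cannot arrive before it has departed
        "arrival_before_departure": df["m_onblockdt"] <= df["m_offblockdt"],
        "sched_arrival_before_departure": df["arr_sched_time"] <= df["dep_sched_time"],
        # dep_delay should agree with the timestamps it is derived from
        "dep_delay_mismatch": (df["dep_delay"] - recomputed_delay).abs() > tolerance_min,
    }
    return pd.DataFrame(checks).sum()


report = consistency_report(flights)
print(report)
```

Whether a flagged row is dropped, repaired, or kept is a separate decision; the report only shows how often each rule is violated.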

In [1]:
# Imports
import pandas as pd 
import numpy as np 
import seaborn as sns
import matplotlib.pyplot as plt

import datetime

import functions # own helper functions

#Settings
pd.set_option("display.max_rows", 70)
pd.set_option("display.max_columns", 101)


# Scope of this workbook

This workbook covers initial feature creation and engineering without validating the resulting features. Validation is done in the advanced EDA, with final preparation in the advanced feature engineering.

# Standard Add Column Functions

Helper functions that add columns to any data frame. Each function modifies the data frame in place.

In [2]:
# Scheduled block time 
def create_scheduledblocktime(df):
    df['scheduled_block_time'] = (df['arr_sched_time']-df['dep_sched_time']).dt.total_seconds()/60

# Leg column 
def create_leg(df):
    df['leg'] = df["dep_ap_sched"] + "-" + df["arr_ap_sched"]

# Routing column 
def create_route(df):
    df['route'] = np.where(df["dep_ap_sched"]<df["arr_ap_sched"], 
        df["dep_ap_sched"] + "-" + df["arr_ap_sched"],
        df["arr_ap_sched"] + "-" + df["dep_ap_sched"]
        )

# Ground/Departure delay in minutes
def create_grounddelay(df):
    df['ground_delay'] = (df['m_offblockdt'] - df['dep_sched_time']).dt.total_seconds()/60

# Block time delay in minutes (expects an existing 'block_time' column in minutes)
def create_blockdelay(df):
    df['block_delay'] = (df['block_time'] - (
        (df['arr_sched_time']-df['dep_sched_time']).dt.total_seconds()/60) # this is just scheduled_block_time without adding the column
    )

# standard function 
def create_standard_columns(df):
    create_scheduledblocktime(df)
    create_leg(df)
    create_route(df)
    create_grounddelay(df)
    create_blockdelay(df)
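
As an illustration of what these helpers compute, here is a minimal sketch of a non-mutating variant that returns a new data frame instead of editing its input. The function name `with_standard_columns` is hypothetical. The first toy row uses the timestamps of flight EC3114 from the sample output further down; the second row is invented.

```python
import numpy as np
import pandas as pd

flights = pd.DataFrame({
    "dep_ap_sched": ["East Melissaberg", "East Carmen"],
    "arr_ap_sched": ["East Carmen", "East Melissaberg"],
    "dep_sched_time": pd.to_datetime(["2019-06-01 02:30", "2019-06-01 05:00"]),
    "arr_sched_time": pd.to_datetime(["2019-06-01 04:15", "2019-06-01 06:45"]),
    "m_offblockdt": pd.to_datetime(["2019-06-01 03:02", "2019-06-01 05:10"]),
    "m_onblockdt": pd.to_datetime(["2019-06-01 04:43", "2019-06-01 06:50"]),
})


def with_standard_columns(df):
    """Hypothetical variant: same formulas, but the input frame is left untouched."""
    sched_block = (df["arr_sched_time"] - df["dep_sched_time"]).dt.total_seconds() / 60
    block_time = (df["m_onblockdt"] - df["m_offblockdt"]).dt.total_seconds() / 60
    forward = df["dep_ap_sched"] + "-" + df["arr_ap_sched"]
    backward = df["arr_ap_sched"] + "-" + df["dep_ap_sched"]
    return df.assign(
        scheduled_block_time=sched_block,
        block_time=block_time,
        leg=forward,
        # direction-independent route: airports in alphabetical order
        route=np.where(df["dep_ap_sched"] < df["arr_ap_sched"], forward, backward),
        ground_delay=(df["m_offblockdt"] - df["dep_sched_time"]).dt.total_seconds() / 60,
        block_delay=block_time - sched_block,
    )


result = with_standard_columns(flights)
print(result[["leg", "route", "scheduled_block_time", "block_time", "block_delay", "ground_delay"]])
```

Both directions of the same airport pair share one `route` value, while `leg` keeps the direction. A negative `block_delay` means the flight was faster than scheduled.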



# Standard Drop Columns

In [18]:
# Columns to drop

basic_drop_columns = [
    'leg_no',
    'fn_carrier',
    'change_reason_code',
    'day_of_origin',
    'ac_type',
    'ac_registration_y',
    'dep_leg_inbound',
    'arr_leg_inbound',
    'arr_leg_outbound',
    'sched_inbound_dep',
    'sched_inbound_arr',
    'sched_outbound_dep',
    'sched_outbound_arr',
    'sched_turnaround',
    'leg_inbound',
    'leg_outbound',
    'cleaning_duration' # dropped as we only have three observations, see EDA
]

# 1. Full df import

In [5]:
# data frame for all edits
full_df = pd.read_pickle('../data/interim/full_df_unedited.pkl')

# Block time in minutes
# Block time is also needed for the fully unedited data frame, so we add it here. 
full_df["block_time"] = (full_df['m_onblockdt']-full_df['m_offblockdt']).dt.total_seconds()/60

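# Sequence number of each flight event per aircraft and day.
# Added to allow RMSE calculation over the course of the day; it may be a redundant proxy because the scheduled time is already available.
# cumcount() follows the current row order, so the frame must be in chronological order for this numbering to be meaningful.
# Todo: Check whether to include it here.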
full_df["flt_event_number"] = full_df.groupby(['dep_sched_date', 'ac_registration_x']).cumcount()+1

# source all data frames from this full_df variable and store each in a case-specific variable
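
The flight event number relies on `cumcount()`, which numbers rows in the order they currently appear in each group. Whether the pickled frame is already in chronological order is not visible here, so the sketch below, with invented toy rows, shows the difference between numbering an unsorted frame and numbering after sorting by aircraft and scheduled departure.

```python
import pandas as pd

# Toy rows in non-chronological order (values invented).
flights = pd.DataFrame({
    "ac_registration_x": ["ECLBAX", "ECLBIX", "ECLBAX", "ECLBAX"],
    "dep_sched_date": pd.to_datetime(["2019-06-01"] * 4),
    "dep_sched_time": pd.to_datetime([
        "2019-06-01 12:00", "2019-06-01 03:15", "2019-06-01 03:25", "2019-06-01 08:20",
    ]),
})

group_keys = ["dep_sched_date", "ac_registration_x"]

# Numbering on the frame as-is: follows row order, not time.
unsorted_number = flights.groupby(group_keys).cumcount() + 1

# Numbering after sorting: 1 is the earliest scheduled departure of the day.
ordered = flights.sort_values(["ac_registration_x", "dep_sched_time"])
ordered["flt_event_number"] = ordered.groupby(group_keys).cumcount() + 1

print(unsorted_number.tolist())
print(ordered[["ac_registration_x", "dep_sched_time", "flt_event_number"]])
```

In the unsorted case the 12:00 departure of ECLBAX is numbered 1 although it is the third flight of the day; after sorting it is numbered 3.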

# 2. Full df unedited

In [20]:
# read data frames

# data frame for unedited model
full_df_unedited = full_df

In [21]:
# dropping columns
full_df_unedited = full_df_unedited.drop(basic_drop_columns, axis = 1)

# We also get rid of NaN
# flt_columns = [i for i in full_df_unedited.columns[:19]] # we only take flt dataset columns Todo: Check how gradient boosting can cope with nan as some remain if we do this
full_df_unedited = full_df_unedited.dropna()#subset = flt_columns)

# Export
full_df_unedited.to_pickle('../data/processed/full_df_unedited.pkl')
full_df_unedited.to_csv('../data/processed/full_df_unedited.csv')
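
The commented-out `subset` argument above changes how many rows survive `dropna()`. The sketch below uses invented toy values to compare dropping on all columns with dropping only on columns that are required; the list `required` is an assumption for illustration, not the notebook's actual choice.

```python
import numpy as np
import pandas as pd

# Toy frame (invented): one row lacks the target, another lacks a ground feature.
df = pd.DataFrame({
    "m_onblockdt": pd.to_datetime(["2019-06-01 04:43", None, "2019-06-01 07:01"]),
    "block_time": [101.0, np.nan, 191.0],
    "catering_duration": [26.0, 27.0, np.nan],
})

# Share of missing values per column, useful before deciding what to drop.
missing_share = df.isna().mean()

# Strict: any missing value removes the row.
strict = df.dropna()

# Lenient: only rows missing a required column are removed (assumed list).
required = ["m_onblockdt", "block_time"]
lenient = df.dropna(subset=required)

print(missing_share)
print(len(strict), len(lenient))
```

The lenient variant keeps rows with missing ground features, which only helps if the downstream model accepts missing values. Some gradient boosting implementations do, for example scikit-learn's `HistGradientBoostingRegressor`.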

In [22]:
full_df_unedited.head()

Unnamed: 0,fn_number,dep_ap_sched,arr_ap_sched,dep_sched_date,dep_sched_time,arr_sched_date,arr_sched_time,m_offblockdt,m_onblockdt,ac_registration_x,dep_delay,Ac Type Code,trans_time,sched_trans_time,Crew Group,TLC_trans,crew_type_change,Sched Groundtime,Act Groundtime,mingt,catering_duration,pax_boarding_duration,block_time
4,EC3114,East Melissaberg,East Carmen,2019-06-01,2019-06-01 02:30:00,2019-06-01,2019-06-01 04:15:00,2019-06-01 03:02:00,2019-06-01 04:43:00,ECLGNX,32.0,DH4,0,0,Start,"['Renee Fisher_nan_nan_nan_ca', 'Rebecca Castr...",[],35.0,21.0,35.0,26.0,26.0,101.0
7,EC3316,East Latashaview,East Carmen,2019-06-01,2019-06-01 03:15:00,2019-06-01,2019-06-01 07:05:00,2019-06-01 03:30:00,2019-06-01 07:07:00,ECLBIX,15.0,320,0,0,Start,"['Nicholas Evans_nan_nan_nan_ca', 'Jessica Her...",[],45.0,61.0,40.0,27.0,15.0,217.0
9,EC3292,New Jessica,East Carmen,2019-06-01,2019-06-01 03:25:00,2019-06-01,2019-06-01 06:45:00,2019-06-01 03:50:00,2019-06-01 07:01:00,ECLBAX,25.0,320,0,0,Start,"['Sean Weeks_nan_nan_nan_ca', 'Tony Lloyd_nan_...",[],95.0,94.0,45.0,27.0,25.0,191.0
13,EC3420,East Allisontown,East Carmen,2019-06-01,2019-06-01 04:00:00,2019-06-01,2019-06-01 04:35:00,2019-06-01 04:04:00,2019-06-01 04:36:00,ECLWFX,4.0,E95,0,0,Start,"['Frederick Ramirez_nan_nan_nan_cp', 'Ariel Wi...",[],55.0,120.0,40.0,18.0,11.0,32.0
15,EC3400,Port Courtneytown,East Carmen,2019-06-01,2019-06-01 04:00:00,2019-06-01,2019-06-01 04:35:00,2019-06-01 04:14:00,2019-06-01 04:57:00,ECLGBX,14.0,DH4,0,0,Start,"['Heather Ryan_nan_nan_nan_ca', 'Jeff Hays_nan...",[],35.0,23.0,30.0,15.0,12.0,43.0


In [23]:
# Save
full_df_unedited.to_pickle('../data/processed/full_df_unedited.pkl')

# 3. Full df unedited with delay calculation

This creates a new dataframe where we focus on forecasting delay instead of total ground/block times.

In [24]:
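# creating full_df_unedited_wdelay
# Note: this assignment is an alias, not a copy, so the delay columns created below are also added to full_df
full_df_unedited_wdelay = full_df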

# creating standard columns
# create_standard_columns(full_df_unedited_wdelay) # As this is the unedited df, we do not add our standard columns

# create delay columns
create_blockdelay(full_df_unedited_wdelay)
create_grounddelay(full_df_unedited_wdelay)

# dropping basic drop columns
full_df_unedited_wdelay = full_df_unedited_wdelay.drop(basic_drop_columns, axis = 1)

# We also get rid of NaN
# flt_columns = [i for i in full_df_unedited.columns[:19]] # we only take flt dataset columns Todo: Check how gradient boosting can cope with nan as some remain if we do this
full_df_unedited_wdelay = full_df_unedited_wdelay.dropna()#subset = flt_columns)

# Export
full_df_unedited_wdelay.to_pickle('../data/processed/full_df_unedited_wdelay.pkl')
full_df_unedited_wdelay.to_csv('../data/processed/full_df_unedited_wdelay.csv')

full_df_unedited_wdelay.head()

Unnamed: 0,fn_number,dep_ap_sched,arr_ap_sched,dep_sched_date,dep_sched_time,arr_sched_date,arr_sched_time,m_offblockdt,m_onblockdt,ac_registration_x,dep_delay,Ac Type Code,trans_time,sched_trans_time,Crew Group,TLC_trans,crew_type_change,Sched Groundtime,Act Groundtime,mingt,catering_duration,pax_boarding_duration,block_time,block_delay,ground_delay
4,EC3114,East Melissaberg,East Carmen,2019-06-01,2019-06-01 02:30:00,2019-06-01,2019-06-01 04:15:00,2019-06-01 03:02:00,2019-06-01 04:43:00,ECLGNX,32.0,DH4,0,0,Start,"['Renee Fisher_nan_nan_nan_ca', 'Rebecca Castr...",[],35.0,21.0,35.0,26.0,26.0,101.0,-4.0,32.0
7,EC3316,East Latashaview,East Carmen,2019-06-01,2019-06-01 03:15:00,2019-06-01,2019-06-01 07:05:00,2019-06-01 03:30:00,2019-06-01 07:07:00,ECLBIX,15.0,320,0,0,Start,"['Nicholas Evans_nan_nan_nan_ca', 'Jessica Her...",[],45.0,61.0,40.0,27.0,15.0,217.0,-13.0,15.0
9,EC3292,New Jessica,East Carmen,2019-06-01,2019-06-01 03:25:00,2019-06-01,2019-06-01 06:45:00,2019-06-01 03:50:00,2019-06-01 07:01:00,ECLBAX,25.0,320,0,0,Start,"['Sean Weeks_nan_nan_nan_ca', 'Tony Lloyd_nan_...",[],95.0,94.0,45.0,27.0,25.0,191.0,-9.0,25.0
13,EC3420,East Allisontown,East Carmen,2019-06-01,2019-06-01 04:00:00,2019-06-01,2019-06-01 04:35:00,2019-06-01 04:04:00,2019-06-01 04:36:00,ECLWFX,4.0,E95,0,0,Start,"['Frederick Ramirez_nan_nan_nan_cp', 'Ariel Wi...",[],55.0,120.0,40.0,18.0,11.0,32.0,-3.0,4.0
15,EC3400,Port Courtneytown,East Carmen,2019-06-01,2019-06-01 04:00:00,2019-06-01,2019-06-01 04:35:00,2019-06-01 04:14:00,2019-06-01 04:57:00,ECLGBX,14.0,DH4,0,0,Start,"['Heather Ryan_nan_nan_nan_ca', 'Jeff Hays_nan...",[],35.0,23.0,30.0,15.0,12.0,43.0,8.0,14.0


# 4. Full df - Data Cleaning

In [25]:
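# Note: full_df is rebound to a sorted frame in 4.1, so full_df_cleaned keeps pointing at the unsorted frame
full_df_cleaned = full_df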

## 4.1 Specific Add Columns 

In [26]:
# Sorting data frame correctly
full_df = full_df.sort_values(by = ['ac_registration_x', 'm_offblockdt']).reset_index(drop = True)

full_df = functions.create_rotmismatch(full_df)
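
The helper `functions.create_rotmismatch` is not shown in this workbook. As a purely hypothetical sketch of what a rotation mismatch flag could look like, the function below marks legs whose departure airport differs from the arrival airport of the same aircraft's previous leg. The name `flag_rotation_mismatch`, the column `rot_mismatch`, and the rule itself are assumptions and may differ from the actual helper. The toy rows are invented.

```python
import pandas as pd

# Toy rotation (invented): the third leg departs from an airport the aircraft never reached.
flights = pd.DataFrame({
    "ac_registration_x": ["ECLBAX", "ECLBAX", "ECLBAX"],
    "dep_ap_sched": ["New Jessica", "East Carmen", "East Latashaview"],
    "arr_ap_sched": ["East Carmen", "East Allisontown", "East Carmen"],
    "m_offblockdt": pd.to_datetime(["2019-06-01 03:50", "2019-06-01 08:35", "2019-06-01 11:20"]),
})


def flag_rotation_mismatch(df):
    """Hypothetical rule: 1 if the previous leg of the aircraft ended at a different airport."""
    out = df.sort_values(["ac_registration_x", "m_offblockdt"]).reset_index(drop=True)
    prev_arrival = out.groupby("ac_registration_x")["arr_ap_sched"].shift(1)
    # The first leg of each aircraft has no predecessor and is never flagged.
    out["rot_mismatch"] = (prev_arrival.notna() & (prev_arrival != out["dep_ap_sched"])).astype(int)
    return out


flagged = flag_rotation_mismatch(flights)
print(flagged[["dep_ap_sched", "arr_ap_sched", "rot_mismatch"]])
```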

## 4.2 Specific Drop Rows 

In [None]:
# Mark dupes to drop (keep=False flags every row that shares a leg_no with another row)
full_df["rows_to_drop"] = full_df.duplicated(subset=['leg_no'], keep=False).astype(int)

# Mark NAs in m_onblockdt for dropping
full_df["rows_to_drop"] += full_df['m_onblockdt'].isna().astype(int)

# Create feature for additional rows to drop for ground delay prediction
full_df['rows_to_drop_grounddelay'] = full_df['Sched Groundtime'].isna().astype(int)
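
The two marker columns can later be turned into row filters. The sketch below uses invented toy rows to show how they might be applied for a block time model and for a ground delay model. The variable names `block_rows` and `ground_rows` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (invented): a duplicated leg, a missing on-block time, a missing ground time.
df = pd.DataFrame({
    "leg_no": [1, 2, 2, 3, 4],
    "m_onblockdt": pd.to_datetime([
        "2019-06-01 04:43", "2019-06-01 07:07", "2019-06-01 07:07", None, "2019-06-01 09:30",
    ]),
    "Sched Groundtime": [35.0, 45.0, 45.0, 30.0, np.nan],
})

# Same marking logic as in the cell above.
df["rows_to_drop"] = (
    df.duplicated(subset=["leg_no"], keep=False).astype(int)
    + df["m_onblockdt"].isna().astype(int)
)
df["rows_to_drop_grounddelay"] = df["Sched Groundtime"].isna().astype(int)

# Rows usable for block time prediction: no general drop marker.
block_rows = df[df["rows_to_drop"] == 0]

# Rows usable for ground delay prediction: additionally need a scheduled ground time.
ground_rows = df[(df["rows_to_drop"] == 0) & (df["rows_to_drop_grounddelay"] == 0)]

print(block_rows["leg_no"].tolist(), ground_rows["leg_no"].tolist())
```

Because `keep=False` is used, both copies of a duplicated `leg_no` are removed, not just the second one.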


## 4.3 Specific Drop Columns 