# Machine Learning

This file contains instructions on how to approach the challenge.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict the delay of flights.
- **(Stretch) Multiclass Classification**: If the plane was delayed, we will predict what type of delay it is (or will be).
- **(Stretch) Binary Classification**: The goal is to predict whether the flight will be cancelled.

In [1]:
import pandas as pd
import numpy as np
import sklearn.model_selection as ms
import harness
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.decomposition import PCA

In [2]:
pd.options.display.max_columns=None

In [3]:
def train_test_split():
    flights = pd.read_csv('data/flights_1%.csv')
    flights_train, flights_test = ms.train_test_split(
        flights, test_size=0.2, random_state=729
    )
    flights_train.to_csv('data/flights_train.csv', index_label='id')
    flights_test.to_csv('data/flights_test.csv', index_label='id')

In [4]:
#train_test_split()

In [5]:
flights_train = pd.read_csv('data/flights_train.csv', index_col=0)
flights_test = pd.read_csv('data/flights_test.csv', index_col=0)

In [6]:
weather_0 = pd.read_csv('weather_0.csv', index_col=0)
weather_1 = pd.read_csv('weather_1.csv', index_col=0)
weather_2 = pd.read_csv('weather_2.csv', index_col=0)
weather_3 = pd.read_csv('weather_3.csv', index_col=0)
weather_4 = pd.read_csv('weather_4.csv', index_col=0)
weather = pd.concat([weather_0,weather_1,weather_2,weather_3,weather_4])

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful about which columns to use and which to leave out. For example, DEP_DELAY would be a very strong predictor, but we can't use it because in a real-life scenario we want to predict the delay before the flight takes off. We can use the average delay from earlier days, but not the delay of the actual flight we are predicting.

Similarly, the variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors. However, we can create various transformations from their earlier values.

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.
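
One way to respect the one-week horizon is to aggregate delays per day and shift them, so that each flight only sees values that are at least seven days old. The following is a minimal sketch on a toy frame, not a prescribed method: the column names follow the dataset, while the helper name `lagged_mean_delay`, the 7-day rolling window, and the toy values are assumptions made for illustration.

```python
import pandas as pd

def lagged_mean_delay(df, key, horizon_days=7, window_days=7):
    """Rolling mean of arr_delay per `key`, built only from flights at least `horizon_days` old."""
    dates = pd.to_datetime(df['fl_date'])
    name = f'{key}_delay_mean_lag{horizon_days}'
    # Daily mean delay per group on a complete calendar, so days without flights stay NaN
    daily = (
        df.assign(fl_date=dates)
          .groupby([key, 'fl_date'])['arr_delay'].mean()
          .unstack(key)
          .asfreq('D')
          .rename_axis(index='fl_date')
    )
    # Average over the window, then push the result forward by the forecast horizon
    lagged = daily.rolling(window_days, min_periods=1).mean().shift(horizon_days)
    # Back to long format so the values can be joined onto the individual flights
    long = lagged.reset_index().melt(id_vars='fl_date', var_name=key, value_name=name)
    merged = df.assign(fl_date=dates).merge(long, on=['fl_date', key], how='left')
    return pd.Series(merged[name].to_numpy(), index=df.index, name=name)

# Toy data (hypothetical values)
flights = pd.DataFrame({
    'fl_date': ['2019-01-01', '2019-01-02', '2019-01-08',
                '2019-01-09', '2019-01-01', '2019-01-08'],
    'op_unique_carrier': ['AA', 'AA', 'AA', 'AA', 'UA', 'UA'],
    'arr_delay': [10.0, 20.0, 100.0, 0.0, 5.0, 7.0],
})
flights['carrier_delay_mean_lag7'] = lagged_mean_delay(flights, 'op_unique_carrier')
```

Flights in the first week of the data have no history to draw on and receive NaN, so they need to be imputed or dropped before modelling.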

### Feature Engineering

Feature engineering will play a crucial role in this problem. We have very few attributes, so we need to create features that have some predictive power.

- weather: we can use a weather API to look up the weather at the time of the scheduled departure and scheduled arrival.
- statistics (mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics.
- airports encoding: we need to think about what to do with the airports and other categorical variables.
- time of the day: the delay probably depends on the airport traffic, which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.
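
Two of the ideas above, time of day and airport traffic, can be derived from schedule columns alone, so they are known before the flight departs. The sketch below uses a toy frame with hypothetical values; the bucket edges and the new column names (`dep_hour`, `dep_period`, `origin_daily_departures`) are illustrative assumptions.

```python
import pandas as pd

flights = pd.DataFrame({
    'fl_date': ['2019-08-25', '2019-08-25', '2019-08-25', '2019-08-26'],
    'origin': ['CLT', 'CLT', 'DEN', 'CLT'],
    'crs_dep_time': [1818, 745, 1214, 2359],   # scheduled departure encoded as HHMM
})

# Scheduled departure hour (0-23); the modulo folds a possible 2400 back to 0
flights['dep_hour'] = flights['crs_dep_time'] // 100 % 24

# Coarse time-of-day buckets; the bin edges are an arbitrary illustrative choice
flights['dep_period'] = pd.cut(
    flights['dep_hour'],
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=['night', 'morning', 'afternoon', 'evening'],
)

# Airport traffic: number of scheduled departures per origin airport and day
flights['origin_daily_departures'] = (
    flights.groupby(['origin', 'fl_date'])['crs_dep_time'].transform('size')
)
```

Counts computed on a small sample of the flights only reflect relative traffic between airports, not absolute volumes.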

In [7]:
flights_clean = harness.clean_train(flights_train)

In [8]:
flights_clean

id,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,dest_airport_id,dest,dest_city_name,crs_dep_time,dep_time,dep_delay,taxi_out,wheels_off,wheels_on,taxi_in,crs_arr_time,arr_time,arr_delay,cancelled,cancellation_code,diverted,dup,crs_elapsed_time,actual_elapsed_time,air_time,flights,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,first_dep_time,total_add_gtime,longest_add_gtime,no_name
121941,2019-08-25,AA,AA_CODESHARE,AA,5636,OH,N723PS,5636,11057,CLT,"Charlotte, NC",10208,AGS,"Augusta, GA",1818,1811.0,-7.0,14.0,1825.0,1856.0,3.0,1921,1859.0,-22.0,0,,0,N,63,48.0,31.0,1,140,,,,,,,,,
109932,2018-03-02,WN,WN,WN,2028,WN,N244WN,2028,13495,MSY,"New Orleans, LA",10693,BNA,"Nashville, TN",745,738.0,-7.0,7.0,745.0,851.0,4.0,915,855.0,-20.0,0,,0,N,90,77.0,66.0,1,471,,,,,,,,,
63108,2019-03-06,UA,UA,UA,545,UA,N69804,545,11292,DEN,"Denver, CO",12266,IAH,"Houston, TX",1214,1233.0,19.0,12.0,1245.0,1531.0,4.0,1531,1535.0,4.0,0,,0,N,137,122.0,106.0,1,862,,,,,,,,,
91519,2019-10-11,DL,DL_CODESHARE,DL,3798,OO,N452SW,3798,16869,XWA,"Williston, ND",13487,MSP,"Minneapolis, MN",600,609.0,9.0,31.0,640.0,817.0,3.0,806,820.0,14.0,0,,0,N,126,131.0,97.0,1,553,,,,,,,,,
8361,2019-09-26,UA,UA_CODESHARE,UA,5359,OO,N120SY,5359,10713,BOI,"Boise, ID",14771,SFO,"San Francisco, CA",1426,1431.0,5.0,10.0,1441.0,1504.0,7.0,1521,1511.0,-10.0,0,,0,N,115,100.0,83.0,1,522,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48069,2019-05-01,G4,G4,G4,2170,G4,260NV,2170,14761,SFB,"Sanford, FL",11973,GPT,"Gulfport/Biloxi, MS",605,559.0,-6.0,13.0,612.0,627.0,5.0,638,632.0,-6.0,0,,0,N,93,93.0,75.0,1,485,,,,,,,,,
68804,2019-07-01,UA,UA_CODESHARE,UA,6259,YV,N80343,6259,10140,ABQ,"Albuquerque, NM",12266,IAH,"Houston, TX",540,553.0,13.0,12.0,605.0,853.0,3.0,851,856.0,5.0,0,,0,N,131,123.0,108.0,1,744,,,,,,,,,
43542,2018-07-12,AA,AA,AA,412,AA,N177US,412,11057,CLT,"Charlotte, NC",14107,PHX,"Phoenix, AZ",1300,1313.0,13.0,13.0,1326.0,1408.0,5.0,1415,1413.0,-2.0,0,,0,N,255,240.0,222.0,1,1773,,,,,,,,,
100583,2018-11-21,AA,AA,AA,1252,AA,N136AN,1252,11298,DFW,"Dallas/Fort Worth, TX",14893,SMF,"Sacramento, CA",1855,2007.0,72.0,15.0,2022.0,2137.0,8.0,2056,2145.0,49.0,0,,0,N,241,218.0,195.0,1,1431,49.0,0.0,0.0,0.0,0.0,,,,


In [9]:
flights_clean_copy = flights_clean.copy()
# fl_date is a 'YYYY-MM-DD' string: slice out the month and the day of the month
flights_clean_copy['month'] = flights_clean_copy.fl_date.str[5:7].astype(int)
flights_clean_copy['day'] = flights_clean_copy.fl_date.str[8:].astype(int)
# Bucket the scheduled flight duration (in hours) into haul categories
flights_clean_copy['haul'] = pd.cut(
    flights_clean_copy.crs_elapsed_time / 60, bins=[0, 3, 6, 12], labels=['Short', 'Medium', 'Long']
)

In [10]:
flights_clean_copy

id,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,dest_airport_id,dest,dest_city_name,crs_dep_time,dep_time,dep_delay,taxi_out,wheels_off,wheels_on,taxi_in,crs_arr_time,arr_time,arr_delay,cancelled,cancellation_code,diverted,dup,crs_elapsed_time,actual_elapsed_time,air_time,flights,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,first_dep_time,total_add_gtime,longest_add_gtime,no_name,month,day,haul
121941,2019-08-25,AA,AA_CODESHARE,AA,5636,OH,N723PS,5636,11057,CLT,"Charlotte, NC",10208,AGS,"Augusta, GA",1818,1811.0,-7.0,14.0,1825.0,1856.0,3.0,1921,1859.0,-22.0,0,,0,N,63,48.0,31.0,1,140,,,,,,,,,,8,25,Short
109932,2018-03-02,WN,WN,WN,2028,WN,N244WN,2028,13495,MSY,"New Orleans, LA",10693,BNA,"Nashville, TN",745,738.0,-7.0,7.0,745.0,851.0,4.0,915,855.0,-20.0,0,,0,N,90,77.0,66.0,1,471,,,,,,,,,,3,2,Short
63108,2019-03-06,UA,UA,UA,545,UA,N69804,545,11292,DEN,"Denver, CO",12266,IAH,"Houston, TX",1214,1233.0,19.0,12.0,1245.0,1531.0,4.0,1531,1535.0,4.0,0,,0,N,137,122.0,106.0,1,862,,,,,,,,,,3,6,Short
91519,2019-10-11,DL,DL_CODESHARE,DL,3798,OO,N452SW,3798,16869,XWA,"Williston, ND",13487,MSP,"Minneapolis, MN",600,609.0,9.0,31.0,640.0,817.0,3.0,806,820.0,14.0,0,,0,N,126,131.0,97.0,1,553,,,,,,,,,,10,11,Short
8361,2019-09-26,UA,UA_CODESHARE,UA,5359,OO,N120SY,5359,10713,BOI,"Boise, ID",14771,SFO,"San Francisco, CA",1426,1431.0,5.0,10.0,1441.0,1504.0,7.0,1521,1511.0,-10.0,0,,0,N,115,100.0,83.0,1,522,,,,,,,,,,9,26,Short
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48069,2019-05-01,G4,G4,G4,2170,G4,260NV,2170,14761,SFB,"Sanford, FL",11973,GPT,"Gulfport/Biloxi, MS",605,559.0,-6.0,13.0,612.0,627.0,5.0,638,632.0,-6.0,0,,0,N,93,93.0,75.0,1,485,,,,,,,,,,5,1,Short
68804,2019-07-01,UA,UA_CODESHARE,UA,6259,YV,N80343,6259,10140,ABQ,"Albuquerque, NM",12266,IAH,"Houston, TX",540,553.0,13.0,12.0,605.0,853.0,3.0,851,856.0,5.0,0,,0,N,131,123.0,108.0,1,744,,,,,,,,,,7,1,Short
43542,2018-07-12,AA,AA,AA,412,AA,N177US,412,11057,CLT,"Charlotte, NC",14107,PHX,"Phoenix, AZ",1300,1313.0,13.0,13.0,1326.0,1408.0,5.0,1415,1413.0,-2.0,0,,0,N,255,240.0,222.0,1,1773,,,,,,,,,,7,12,Medium
100583,2018-11-21,AA,AA,AA,1252,AA,N136AN,1252,11298,DFW,"Dallas/Fort Worth, TX",14893,SMF,"Sacramento, CA",1855,2007.0,72.0,15.0,2022.0,2137.0,8.0,2056,2145.0,49.0,0,,0,N,241,218.0,195.0,1,1431,49.0,0.0,0.0,0.0,0.0,,,,,11,21,Medium


In [11]:
flights_with_weather = flights_clean_copy.merge(
    weather, left_on=['fl_date', 'origin_city_name'], right_on=['date', 'city']
).merge(
    weather, left_on=['fl_date', 'dest_city_name'], right_on=['date', 'city'],
    suffixes=('_origin', '_dest')
)

weather_category_map = {
    'Partially cloudy': 'Cloudy',
    'Clear': 'Sunny',
    'Rain, Partially cloudy': 'Rainy',
    'Rain, Overcast': 'Rainy',
    'Overcast': 'Cloudy',
    'Rain': 'Rainy',
    'Snow, Partially cloudy': 'Snowy',
    'Snow, Overcast': 'Snowy',
    'Snow': 'Snowy',
}
flights_with_weather['weather_origin'] = flights_with_weather.conditions_origin.map(weather_category_map)
flights_with_weather['weather_dest'] = flights_with_weather.conditions_dest.map(weather_category_map)

In [12]:
flights_with_weather = flights_clean_copy.merge(flights_with_weather, how='outer')
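
The inner merges above keep only flights that have a weather row for both the origin and the destination; the outer merge then restores the remaining flights with NaN in the weather columns. To see how many flights actually found a match, a left merge with `indicator=True` can be inspected first. This is a small diagnostic sketch on hypothetical toy frames; only the key columns `fl_date`, `origin_city_name`, `date`, `city` and the `conditions` column are taken from the cells above.

```python
import pandas as pd

flights = pd.DataFrame({
    'fl_date': ['2019-08-25', '2018-03-02', '2019-03-06'],
    'origin_city_name': ['Charlotte, NC', 'New Orleans, LA', 'Denver, CO'],
})
weather = pd.DataFrame({
    'date': ['2019-08-25', '2019-03-06'],
    'city': ['Charlotte, NC', 'Denver'],   # second key is formatted differently on purpose
    'conditions': ['Clear', 'Snow'],
})

# how='left' keeps every flight; indicator=True adds a `_merge` column
# telling whether a weather row was found ('both') or not ('left_only')
merged = flights.merge(
    weather,
    left_on=['fl_date', 'origin_city_name'],
    right_on=['date', 'city'],
    how='left',
    indicator=True,
)
match_counts = merged['_merge'].value_counts()
unmatched_cities = merged.loc[merged['_merge'] == 'left_only', 'origin_city_name']
```

A large `left_only` count usually points to missing weather dates or to differently formatted city names in the two sources.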

In [13]:
def add_grouped_stats(df, col):
    """
    Takes a DataFrame and a categorical column name and adds five new columns
    with arrival-delay statistics grouped by the categories of that column.
    Parameters:
        - df: The DataFrame; it must contain `col` and `arr_delay`.
        - col (str): The categorical column to group by when computing the delay statistics.
    Returns:
        - df_2: A copy of df with the columns <col>_delay_mean, _median, _std, _min and _max added.
    """
    df_2 = df.copy()
    # Statistics of arr_delay per category, mapped back onto every row of that category
    grouped = df.groupby([col]).arr_delay
    for stat in ('mean', 'median', 'std', 'min', 'max'):
        df_2[col + '_delay_' + stat] = df[col].map(grouped.agg(stat).to_dict())
    return df_2
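
Because `add_grouped_stats` computes the statistics from the same rows it annotates, each flight's own `arr_delay` contributes to its features. For held-out data, one option is to derive the statistics from one frame and map them onto another, in the spirit of the commented-out `add_outside_grouped_stats` helper further down. The sketch below is an assumption-laden illustration: the name `map_group_stats` and the fallback to the overall statistic are not part of the notebook.

```python
import pandas as pd

def map_group_stats(stats_source, target, col, stats=('mean', 'median', 'std')):
    """Compute arr_delay statistics per `col` on stats_source and map them onto target."""
    out = target.copy()
    grouped = stats_source.groupby(col)['arr_delay']
    for stat in stats:
        per_group = grouped.agg(stat)
        # Fallback for categories unseen in stats_source and for groups where
        # the statistic is undefined (for example, the std of a single row)
        overall = stats_source['arr_delay'].agg(stat)
        out[f'{col}_delay_{stat}'] = target[col].map(per_group).fillna(overall)
    return out

# Toy data (hypothetical values)
train = pd.DataFrame({
    'op_unique_carrier': ['AA', 'AA', 'UA', 'WN'],
    'arr_delay': [10.0, 30.0, -5.0, 2.0],
})
test = pd.DataFrame({'op_unique_carrier': ['AA', 'UA', 'G4']})   # 'G4' is unseen in train
test_features = map_group_stats(train, test, 'op_unique_carrier')
```

The target frame does not need an `arr_delay` column at all, which matches the situation at prediction time.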
                            


In [14]:
# Treat missing delay-cause and ground-time values as zero.
# Note: first_dep_time is filled from its own column (it was previously overwritten with late_aircraft_delay).
zero_fill_cols = ['carrier_delay', 'weather_delay', 'nas_delay', 'security_delay',
                  'late_aircraft_delay', 'first_dep_time', 'total_add_gtime', 'longest_add_gtime']
flights_with_weather[zero_fill_cols] = flights_with_weather[zero_fill_cols].fillna(0)



In [15]:
flights_with_weather.isnull().sum().head(50)

fl_date                     0
mkt_unique_carrier          0
branded_code_share          0
mkt_carrier                 0
mkt_carrier_fl_num          0
op_unique_carrier           0
tail_num                    0
op_carrier_fl_num           0
origin_airport_id           0
origin                      0
origin_city_name            0
dest_airport_id             0
dest                        0
dest_city_name              0
crs_dep_time                0
dep_time                    0
dep_delay                   0
taxi_out                    0
wheels_off                  0
wheels_on                   0
taxi_in                     0
crs_arr_time                0
arr_time                    0
arr_delay                   0
cancelled                   0
cancellation_code      125434
diverted                    0
dup                         0
crs_elapsed_time            0
actual_elapsed_time         0
air_time                    0
flights                     0
distance                    0
carrier_de

In [16]:
flights_with_weather.isnull().sum().tail(50)

mkt_unique_carrier          0
branded_code_share          0
mkt_carrier                 0
mkt_carrier_fl_num          0
op_unique_carrier           0
tail_num                    0
op_carrier_fl_num           0
origin_airport_id           0
origin                      0
origin_city_name            0
dest_airport_id             0
dest                        0
dest_city_name              0
crs_dep_time                0
dep_time                    0
dep_delay                   0
taxi_out                    0
wheels_off                  0
wheels_on                   0
taxi_in                     0
crs_arr_time                0
arr_time                    0
arr_delay                   0
cancelled                   0
cancellation_code      125434
diverted                    0
dup                         0
crs_elapsed_time            0
actual_elapsed_time         0
air_time                    0
flights                     0
distance                    0
carrier_delay               0
weather_de

In [17]:
flights_with_weather = add_grouped_stats(flights_with_weather,'conditions_origin')
flights_with_weather = add_grouped_stats(flights_with_weather,'conditions_dest')
flights_with_weather = add_grouped_stats(flights_with_weather,'weather_origin')
flights_with_weather = add_grouped_stats(flights_with_weather,'weather_dest')
flights_with_weather = add_grouped_stats(flights_with_weather,'op_unique_carrier')
flights_with_weather = add_grouped_stats(flights_with_weather,'fl_date')
flights_with_weather = add_grouped_stats(flights_with_weather,'dest_city_name')
flights_with_weather = add_grouped_stats(flights_with_weather,'origin_airport_id')
flights_with_weather = add_grouped_stats(flights_with_weather,'origin_city_name')
flights_with_weather = add_grouped_stats(flights_with_weather,'dest_airport_id')
flights_with_weather = add_grouped_stats(flights_with_weather,'day')
flights_with_weather = add_grouped_stats(flights_with_weather,'month')
flights_with_weather = add_grouped_stats(flights_with_weather,'haul')
flights_with_weather = add_grouped_stats(flights_with_weather,'tail_num')

In [18]:
flights_with_weather = flights_with_weather.drop(['conditions_origin',
                                                  'conditions_dest','cancellation_code',
                                                  'weather_origin', 'date_dest',
                                                  'weather_dest', 'dup','no_name',
                                                  'op_unique_carrier', 'date_origin',
                                                  'fl_date','dest_city_name',
                                                  'origin_airport_id','origin_city_name',
                                                  'dest_airport_id','day','month','haul',
                                                  'branded_code_share',
                                                  'mkt_carrier','origin','dest',
                                                  'tail_num','mkt_unique_carrier'], axis =1)




In [19]:
flights_with_weather.isnull().sum().head(50)

mkt_carrier_fl_num                     0
op_carrier_fl_num                      0
crs_dep_time                           0
dep_time                               0
dep_delay                              0
taxi_out                               0
wheels_off                             0
wheels_on                              0
taxi_in                                0
crs_arr_time                           0
arr_time                               0
arr_delay                              0
cancelled                              0
diverted                               0
crs_elapsed_time                       0
actual_elapsed_time                    0
air_time                               0
flights                                0
distance                               0
carrier_delay                          0
weather_delay                          0
nas_delay                              0
security_delay                         0
late_aircraft_delay                    0
first_dep_time  

In [20]:
flights_with_weather.isnull().sum().tail(50)

op_unique_carrier_delay_mean        0
op_unique_carrier_delay_median      0
op_unique_carrier_delay_std         0
op_unique_carrier_delay_min         0
op_unique_carrier_delay_max         0
fl_date_delay_mean                  0
fl_date_delay_median                0
fl_date_delay_std                   0
fl_date_delay_min                   0
fl_date_delay_max                   0
dest_city_name_delay_mean           0
dest_city_name_delay_median         0
dest_city_name_delay_std            4
dest_city_name_delay_min            0
dest_city_name_delay_max            0
origin_airport_id_delay_mean        0
origin_airport_id_delay_median      0
origin_airport_id_delay_std         5
origin_airport_id_delay_min         0
origin_airport_id_delay_max         0
origin_city_name_delay_mean         0
origin_city_name_delay_median       0
origin_city_name_delay_std          5
origin_city_name_delay_min          0
origin_city_name_delay_max          0
dest_airport_id_delay_mean          0
dest_airport

In [21]:
flights_with_weather

Unnamed: 0,mkt_carrier_fl_num,op_carrier_fl_num,crs_dep_time,dep_time,dep_delay,taxi_out,wheels_off,wheels_on,taxi_in,crs_arr_time,arr_time,arr_delay,cancelled,diverted,crs_elapsed_time,actual_elapsed_time,air_time,flights,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,first_dep_time,total_add_gtime,longest_add_gtime,conditions_origin_delay_mean,conditions_origin_delay_median,conditions_origin_delay_std,conditions_origin_delay_min,conditions_origin_delay_max,conditions_dest_delay_mean,conditions_dest_delay_median,conditions_dest_delay_std,conditions_dest_delay_min,conditions_dest_delay_max,weather_origin_delay_mean,weather_origin_delay_median,weather_origin_delay_std,weather_origin_delay_min,weather_origin_delay_max,weather_dest_delay_mean,weather_dest_delay_median,weather_dest_delay_std,weather_dest_delay_min,weather_dest_delay_max,op_unique_carrier_delay_mean,op_unique_carrier_delay_median,op_unique_carrier_delay_std,op_unique_carrier_delay_min,op_unique_carrier_delay_max,fl_date_delay_mean,fl_date_delay_median,fl_date_delay_std,fl_date_delay_min,fl_date_delay_max,dest_city_name_delay_mean,dest_city_name_delay_median,dest_city_name_delay_std,dest_city_name_delay_min,dest_city_name_delay_max,origin_airport_id_delay_mean,origin_airport_id_delay_median,origin_airport_id_delay_std,origin_airport_id_delay_min,origin_airport_id_delay_max,origin_city_name_delay_mean,origin_city_name_delay_median,origin_city_name_delay_std,origin_city_name_delay_min,origin_city_name_delay_max,dest_airport_id_delay_mean,dest_airport_id_delay_median,dest_airport_id_delay_std,dest_airport_id_delay_min,dest_airport_id_delay_max,day_delay_mean,day_delay_median,day_delay_std,day_delay_min,day_delay_max,month_delay_mean,month_delay_median,month_delay_std,month_delay_min,month_delay_max,haul_delay_mean,haul_delay_median,haul_delay_std,haul_delay_min,haul_delay_max,tail_num_delay_mean,tail_num_delay_median,tail_num_delay_std,tail_num_delay_min,tail_num_delay_max
0,5636,5636,1818,1811.0,-7.0,14.0,1825.0,1856.0,3.0,1921,1859.0,-22.0,0,0,63,48.0,31.0,1,140,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,8.537708,-4.0,47.071042,-59.0,674.0,-3.043689,-9.0,25.477355,-34.0,166.0,3.933333,-8.0,34.339849,-22.0,163.0,7.251644,-4.0,42.269576,-43.0,1163.0,7.251644,-4.0,42.269576,-43.0,1163.0,3.933333,-8.0,34.339849,-22.0,163.0,1.441833,-8.0,38.705541,-54.0,627.0,9.752049,-5.0,59.957840,-60.0,1417.0,5.734592,-6.0,50.170304,-59.0,1532.0,11.241379,-4.0,42.447998,-44.0,143.0
1,2028,2028,745,738.0,-7.0,7.0,745.0,851.0,4.0,915,855.0,-20.0,0,0,90,77.0,66.0,1,471,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,3.509573,-5.0,31.549720,-67.0,587.0,20.691824,-5.0,114.508247,-36.0,918.0,6.174742,-6.0,58.390846,-52.0,1157.0,7.313725,-5.0,52.047747,-41.0,934.0,7.313725,-5.0,52.047747,-41.0,934.0,6.174742,-6.0,58.390846,-52.0,1157.0,6.736996,-6.0,50.716204,-57.0,918.0,3.965452,-6.0,52.066359,-67.0,1402.0,5.734592,-6.0,50.170304,-59.0,1532.0,-4.343750,-12.0,17.830039,-21.0,61.0
2,545,545,1214,1233.0,19.0,12.0,1245.0,1531.0,4.0,1531,1535.0,4.0,0,0,137,122.0,106.0,1,862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,6.732577,-6.0,56.435969,-57.0,1402.0,3.689655,-5.0,55.029957,-38.0,661.0,4.770677,-5.0,45.131706,-63.0,960.0,7.128471,-5.0,45.896500,-50.0,821.0,7.128471,-5.0,45.896500,-50.0,821.0,5.459028,-5.0,49.232498,-63.0,960.0,3.581111,-7.0,50.341037,-57.0,1263.0,3.965452,-6.0,52.066359,-67.0,1402.0,5.734592,-6.0,50.170304,-59.0,1532.0,4.391304,-1.0,31.517296,-31.0,117.0
3,3798,3798,600,609.0,9.0,31.0,640.0,817.0,3.0,806,820.0,14.0,0,0,126,131.0,97.0,1,553,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,7.885413,-7.0,69.573885,-66.0,1417.0,19.863388,-3.0,101.813706,-42.0,1205.0,0.911169,-10.0,50.740951,-58.0,1077.0,-12.200000,-17.0,20.191582,-40.0,14.0,18.666667,-11.5,103.973898,-40.0,559.0,0.911169,-10.0,50.740951,-58.0,1077.0,7.085337,-6.0,55.418038,-58.0,1205.0,2.820572,-7.0,46.805755,-53.0,1532.0,5.734592,-6.0,50.170304,-59.0,1532.0,-6.120000,-10.0,18.619703,-30.0,39.0
4,5359,5359,1426,1431.0,5.0,10.0,1441.0,1504.0,7.0,1521,1511.0,-10.0,0,0,115,100.0,83.0,1,522,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,7.885413,-7.0,69.573885,-66.0,1417.0,0.547619,-7.0,29.396690,-34.0,186.0,9.499104,-6.0,62.487939,-68.0,1271.0,2.589499,-6.0,52.541016,-38.0,815.0,2.589499,-6.0,52.541016,-38.0,815.0,9.499104,-6.0,62.487939,-68.0,1271.0,4.267059,-6.0,45.897289,-64.0,727.0,0.223971,-8.0,37.688462,-68.0,1095.0,5.734592,-6.0,50.170304,-59.0,1532.0,-2.545455,-6.0,19.421949,-32.0,57.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125429,2170,2170,605,559.0,-6.0,13.0,612.0,627.0,5.0,638,632.0,-6.0,0,0,93,93.0,75.0,1,485,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,10.008679,-5.0,65.135847,-54.0,1532.0,-1.508671,-7.0,25.198426,-45.0,142.0,5.242424,-6.5,40.402232,-20.0,275.0,9.173653,-5.0,57.759575,-27.0,618.0,9.173653,-5.0,57.759575,-27.0,618.0,5.242424,-6.5,40.402232,-20.0,275.0,6.960794,-6.0,57.170933,-56.0,1177.0,6.803424,-6.0,50.362827,-66.0,1271.0,5.734592,-6.0,50.170304,-59.0,1532.0,5.857143,-9.5,47.419336,-27.0,128.0
125430,6259,6259,540,553.0,13.0,12.0,605.0,853.0,3.0,851,856.0,5.0,0,0,131,123.0,108.0,1,744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,9.422725,-5.0,52.363297,-63.0,869.0,-0.597701,-9.0,33.384417,-56.0,150.0,4.770677,-5.0,45.131706,-63.0,960.0,0.077694,-7.0,29.793535,-37.0,222.0,0.077694,-7.0,29.793535,-37.0,222.0,5.459028,-5.0,49.232498,-63.0,960.0,6.960794,-6.0,57.170933,-56.0,1177.0,8.501768,-5.0,53.961650,-57.0,1263.0,5.734592,-6.0,50.170304,-59.0,1532.0,9.086957,-3.0,33.709876,-22.0,103.0
125431,412,412,1300,1313.0,13.0,13.0,1326.0,1408.0,5.0,1415,1413.0,-2.0,0,0,255,240.0,222.0,1,1773,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,5.627454,-5.0,46.718174,-61.0,1128.0,7.514286,-5.0,41.970961,-41.0,257.0,5.219892,-5.0,57.125311,-53.0,1532.0,7.251644,-4.0,42.269576,-43.0,1163.0,7.251644,-4.0,42.269576,-43.0,1163.0,4.507322,-5.0,49.729182,-53.0,1417.0,5.529552,-5.0,49.425103,-61.0,1216.0,8.501768,-5.0,53.961650,-57.0,1263.0,4.104058,-6.0,49.418211,-67.0,1402.0,12.920000,-2.0,37.694960,-29.0,143.0
125432,1252,1252,1855,2007.0,72.0,15.0,2022.0,2137.0,8.0,2056,2145.0,49.0,0,0,241,218.0,195.0,1,1431,49.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,5.627454,-5.0,46.718174,-61.0,1128.0,8.106383,-7.0,67.744166,-38.0,711.0,3.546286,-5.0,34.471274,-51.0,341.0,9.364856,-3.0,45.672426,-44.0,1006.0,9.364856,-3.0,45.672426,-44.0,1006.0,3.546286,-5.0,34.471274,-51.0,341.0,6.890931,-5.0,50.717822,-61.0,1236.0,3.181731,-7.0,49.027435,-56.0,1236.0,4.104058,-6.0,49.418211,-67.0,1402.0,6.375000,-1.5,33.668729,-29.0,113.0


In [22]:
flights_with_weather.carrier_delay

0          0.0
1          0.0
2          0.0
3          0.0
4          0.0
          ... 
125429     0.0
125430     0.0
125431     0.0
125432    49.0
125433     0.0
Name: carrier_delay, Length: 125434, dtype: float64

In [23]:


# def add_outside_grouped_stats(df, df_sub, col):
#     """
#     Creates grouped statistics of a column in one DataFrame and maps them onto another (subset) DataFrame.
#     Parameters:
#         - df: DataFrame from which the stats will be derived
#         - df_sub: DataFrame which will have the stats mapped onto it
#         - col (str): The categorical column to compute stats for in df and map onto df_sub.
#     Returns:
#         - df_2: The new subset DataFrame with the stat columns based on df.
#     """
#     df_2 = df_sub.copy()
#     col_mean=col + '_delay_mean'
#     col_median=col+'_delay_median'
#     col_std=col+'_delay_std'
#     col_min = col+'_delay_min'
#     col_max=col+'_delay_max'
    
#     df_2[col_mean] = df_sub[col].map(df.groupby([col]).arr_delay.mean().to_dict())
#     df_2[col_median] = df_sub[col].map(df.groupby([col]).arr_delay.median().to_dict())
#     df_2[col_std] = df_sub[col].map(df.groupby([col]).arr_delay.std().to_dict())
#     df_2[col_min] = df_sub[col].map(df.groupby([col]).arr_delay.min().to_dict())
#     df_2[col_max] = df_sub[col].map(df.groupby([col]).arr_delay.max().to_dict())
#     return df_2




# flights_with_weather = add_outside_grouped_stats(flights_clean_copy, flights_with_weather, 'op_unique_carrier')
# flights_with_weather = add_outside_grouped_stats(flights_clean_copy, flights_with_weather, 'fl_date')
# flights_with_weather = add_outside_grouped_stats(flights_clean_copy, flights_with_weather, 'origin_city_name')
# flights_with_weather = add_outside_grouped_stats(flights_clean_copy, flights_with_weather, 'dest_city_name')
# flights_with_weather = add_outside_grouped_stats(flights_clean_copy, flights_with_weather, 'origin_airport_id')
# flights_with_weather = add_outside_grouped_stats(flights_clean_copy, flights_with_weather, 'dest_airport_id')
# flights_with_weather = add_outside_grouped_stats(flights_clean_copy, flights_with_weather, 'day')
# flights_with_weather = add_outside_grouped_stats(flights_clean_copy, flights_with_weather, 'month')
# flights_with_weather = add_outside_grouped_stats(flights_clean_copy, flights_with_weather, 'haul')
# flights_with_weather = add_outside_grouped_stats(flights_clean_copy, flights_with_weather, 'tail_num')

### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one works best for our problem.

- Original features vs. PCA components?
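
One way to approach the question above is to run the same model once on the scaled original features and once on PCA components, and compare cross-validated scores. The sketch below uses synthetic data as a stand-in for the engineered feature matrix; the linear model, the median imputation, the 95% variance threshold, and the helper name `cv_r2` are all assumptions for illustration. On the real data, leakage columns such as `dep_delay` would have to be removed first.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (not real flight data)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f'f{i}' for i in range(6)])
X['f5'] = X['f0'] + 0.1 * rng.normal(size=200)   # a nearly redundant feature
X.iloc[::25, 2] = np.nan                          # a few missing values in f2
y = 3 * X['f0'] - 2 * X['f1'] + rng.normal(size=200)

def cv_r2(*extra_steps):
    """Mean 5-fold cross-validated R^2 of impute -> scale -> extra steps -> linear regression."""
    model = make_pipeline(
        SimpleImputer(strategy='median'),
        StandardScaler(),
        *extra_steps,
        LinearRegression(),
    )
    return cross_val_score(model, X, y, cv=5, scoring='r2').mean()

r2_original = cv_r2()
r2_pca = cv_r2(PCA(n_components=0.95))   # keep components explaining 95% of the variance
```

Keeping the imputer, scaler and PCA inside the pipeline means they are refit on each training fold, so the validation folds do not influence the transformation.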

In [24]:
flights_with_weather

Unnamed: 0,mkt_carrier_fl_num,op_carrier_fl_num,crs_dep_time,dep_time,dep_delay,taxi_out,wheels_off,wheels_on,taxi_in,crs_arr_time,arr_time,arr_delay,cancelled,diverted,crs_elapsed_time,actual_elapsed_time,air_time,flights,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,first_dep_time,total_add_gtime,longest_add_gtime,conditions_origin_delay_mean,conditions_origin_delay_median,conditions_origin_delay_std,conditions_origin_delay_min,conditions_origin_delay_max,conditions_dest_delay_mean,conditions_dest_delay_median,conditions_dest_delay_std,conditions_dest_delay_min,conditions_dest_delay_max,weather_origin_delay_mean,weather_origin_delay_median,weather_origin_delay_std,weather_origin_delay_min,weather_origin_delay_max,weather_dest_delay_mean,weather_dest_delay_median,weather_dest_delay_std,weather_dest_delay_min,weather_dest_delay_max,op_unique_carrier_delay_mean,op_unique_carrier_delay_median,op_unique_carrier_delay_std,op_unique_carrier_delay_min,op_unique_carrier_delay_max,fl_date_delay_mean,fl_date_delay_median,fl_date_delay_std,fl_date_delay_min,fl_date_delay_max,dest_city_name_delay_mean,dest_city_name_delay_median,dest_city_name_delay_std,dest_city_name_delay_min,dest_city_name_delay_max,origin_airport_id_delay_mean,origin_airport_id_delay_median,origin_airport_id_delay_std,origin_airport_id_delay_min,origin_airport_id_delay_max,origin_city_name_delay_mean,origin_city_name_delay_median,origin_city_name_delay_std,origin_city_name_delay_min,origin_city_name_delay_max,dest_airport_id_delay_mean,dest_airport_id_delay_median,dest_airport_id_delay_std,dest_airport_id_delay_min,dest_airport_id_delay_max,day_delay_mean,day_delay_median,day_delay_std,day_delay_min,day_delay_max,month_delay_mean,month_delay_median,month_delay_std,month_delay_min,month_delay_max,haul_delay_mean,haul_delay_median,haul_delay_std,haul_delay_min,haul_delay_max,tail_num_delay_mean,tail_num_delay_median,tail_num_delay_std,tail_num_delay_min,tail_num_delay_max
0,5636,5636,1818,1811.0,-7.0,14.0,1825.0,1856.0,3.0,1921,1859.0,-22.0,0,0,63,48.0,31.0,1,140,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,8.537708,-4.0,47.071042,-59.0,674.0,-3.043689,-9.0,25.477355,-34.0,166.0,3.933333,-8.0,34.339849,-22.0,163.0,7.251644,-4.0,42.269576,-43.0,1163.0,7.251644,-4.0,42.269576,-43.0,1163.0,3.933333,-8.0,34.339849,-22.0,163.0,1.441833,-8.0,38.705541,-54.0,627.0,9.752049,-5.0,59.957840,-60.0,1417.0,5.734592,-6.0,50.170304,-59.0,1532.0,11.241379,-4.0,42.447998,-44.0,143.0
1,2028,2028,745,738.0,-7.0,7.0,745.0,851.0,4.0,915,855.0,-20.0,0,0,90,77.0,66.0,1,471,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,3.509573,-5.0,31.549720,-67.0,587.0,20.691824,-5.0,114.508247,-36.0,918.0,6.174742,-6.0,58.390846,-52.0,1157.0,7.313725,-5.0,52.047747,-41.0,934.0,7.313725,-5.0,52.047747,-41.0,934.0,6.174742,-6.0,58.390846,-52.0,1157.0,6.736996,-6.0,50.716204,-57.0,918.0,3.965452,-6.0,52.066359,-67.0,1402.0,5.734592,-6.0,50.170304,-59.0,1532.0,-4.343750,-12.0,17.830039,-21.0,61.0
2,545,545,1214,1233.0,19.0,12.0,1245.0,1531.0,4.0,1531,1535.0,4.0,0,0,137,122.0,106.0,1,862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,6.732577,-6.0,56.435969,-57.0,1402.0,3.689655,-5.0,55.029957,-38.0,661.0,4.770677,-5.0,45.131706,-63.0,960.0,7.128471,-5.0,45.896500,-50.0,821.0,7.128471,-5.0,45.896500,-50.0,821.0,5.459028,-5.0,49.232498,-63.0,960.0,3.581111,-7.0,50.341037,-57.0,1263.0,3.965452,-6.0,52.066359,-67.0,1402.0,5.734592,-6.0,50.170304,-59.0,1532.0,4.391304,-1.0,31.517296,-31.0,117.0
3,3798,3798,600,609.0,9.0,31.0,640.0,817.0,3.0,806,820.0,14.0,0,0,126,131.0,97.0,1,553,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,7.885413,-7.0,69.573885,-66.0,1417.0,19.863388,-3.0,101.813706,-42.0,1205.0,0.911169,-10.0,50.740951,-58.0,1077.0,-12.200000,-17.0,20.191582,-40.0,14.0,18.666667,-11.5,103.973898,-40.0,559.0,0.911169,-10.0,50.740951,-58.0,1077.0,7.085337,-6.0,55.418038,-58.0,1205.0,2.820572,-7.0,46.805755,-53.0,1532.0,5.734592,-6.0,50.170304,-59.0,1532.0,-6.120000,-10.0,18.619703,-30.0,39.0
4,5359,5359,1426,1431.0,5.0,10.0,1441.0,1504.0,7.0,1521,1511.0,-10.0,0,0,115,100.0,83.0,1,522,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,7.885413,-7.0,69.573885,-66.0,1417.0,0.547619,-7.0,29.396690,-34.0,186.0,9.499104,-6.0,62.487939,-68.0,1271.0,2.589499,-6.0,52.541016,-38.0,815.0,2.589499,-6.0,52.541016,-38.0,815.0,9.499104,-6.0,62.487939,-68.0,1271.0,4.267059,-6.0,45.897289,-64.0,727.0,0.223971,-8.0,37.688462,-68.0,1095.0,5.734592,-6.0,50.170304,-59.0,1532.0,-2.545455,-6.0,19.421949,-32.0,57.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125429,2170,2170,605,559.0,-6.0,13.0,612.0,627.0,5.0,638,632.0,-6.0,0,0,93,93.0,75.0,1,485,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,10.008679,-5.0,65.135847,-54.0,1532.0,-1.508671,-7.0,25.198426,-45.0,142.0,5.242424,-6.5,40.402232,-20.0,275.0,9.173653,-5.0,57.759575,-27.0,618.0,9.173653,-5.0,57.759575,-27.0,618.0,5.242424,-6.5,40.402232,-20.0,275.0,6.960794,-6.0,57.170933,-56.0,1177.0,6.803424,-6.0,50.362827,-66.0,1271.0,5.734592,-6.0,50.170304,-59.0,1532.0,5.857143,-9.5,47.419336,-27.0,128.0
125430,6259,6259,540,553.0,13.0,12.0,605.0,853.0,3.0,851,856.0,5.0,0,0,131,123.0,108.0,1,744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,9.422725,-5.0,52.363297,-63.0,869.0,-0.597701,-9.0,33.384417,-56.0,150.0,4.770677,-5.0,45.131706,-63.0,960.0,0.077694,-7.0,29.793535,-37.0,222.0,0.077694,-7.0,29.793535,-37.0,222.0,5.459028,-5.0,49.232498,-63.0,960.0,6.960794,-6.0,57.170933,-56.0,1177.0,8.501768,-5.0,53.961650,-57.0,1263.0,5.734592,-6.0,50.170304,-59.0,1532.0,9.086957,-3.0,33.709876,-22.0,103.0
125431,412,412,1300,1313.0,13.0,13.0,1326.0,1408.0,5.0,1415,1413.0,-2.0,0,0,255,240.0,222.0,1,1773,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,5.627454,-5.0,46.718174,-61.0,1128.0,7.514286,-5.0,41.970961,-41.0,257.0,5.219892,-5.0,57.125311,-53.0,1532.0,7.251644,-4.0,42.269576,-43.0,1163.0,7.251644,-4.0,42.269576,-43.0,1163.0,4.507322,-5.0,49.729182,-53.0,1417.0,5.529552,-5.0,49.425103,-61.0,1216.0,8.501768,-5.0,53.961650,-57.0,1263.0,4.104058,-6.0,49.418211,-67.0,1402.0,12.920000,-2.0,37.694960,-29.0,143.0
125432,1252,1252,1855,2007.0,72.0,15.0,2022.0,2137.0,8.0,2056,2145.0,49.0,0,0,241,218.0,195.0,1,1431,49.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,5.627454,-5.0,46.718174,-61.0,1128.0,8.106383,-7.0,67.744166,-38.0,711.0,3.546286,-5.0,34.471274,-51.0,341.0,9.364856,-3.0,45.672426,-44.0,1006.0,9.364856,-3.0,45.672426,-44.0,1006.0,3.546286,-5.0,34.471274,-51.0,341.0,6.890931,-5.0,50.717822,-61.0,1236.0,3.181731,-7.0,49.027435,-56.0,1236.0,4.104058,-6.0,49.418211,-67.0,1402.0,6.375000,-1.5,33.668729,-29.0,113.0


In [25]:
# dummy = pd.get_dummies(flights_with_weather['fl_date'])
# flights_with_weather = pd.concat([flights_with_weather,dummy], axis=1)
# flights_with_weather = flights_with_weather.drop('fl_date',axis=1)

# dummy = pd.get_dummies(flights_with_weather['mkt_unique_carrier'])
# flights_with_weather = pd.concat([flights_with_weather,dummy], axis=1)
# flights_with_weather = flights_with_weather.drop('mkt_unique_carrier',axis=1)


In [26]:
def dummy_maker(df, col):
    """
    Replace a categorical column of a DataFrame with dummy (one-hot) variables.

    Parameters:
        - df: the DataFrame
        - col (str): the name of the categorical column
    Returns:
        - df_2: a copy of the DataFrame with dummy columns instead of the selected column
    """
    # One indicator column per distinct value of df[col]
    dummy = pd.get_dummies(df[col])
    # Append the indicator columns, then drop the original column
    df_2 = pd.concat([df, dummy], axis=1)
    df_2 = df_2.drop(col, axis=1)
    return df_2
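
One thing to watch for with `dummy_maker` is that the dummy columns are named after the raw category values, so encoding two carrier columns that share codes (for example `mkt_carrier` and `op_unique_carrier`) would produce duplicate column names. A minimal sketch of an alternative, assuming a made-up toy frame rather than the real `flights_with_weather`, is to let `pd.get_dummies` prefix each dummy with its source column:

```python
import pandas as pd

# Toy frame (made-up values); the two carrier columns share the code "AA".
toy = pd.DataFrame({
    "mkt_carrier": ["AA", "DL"],
    "op_unique_carrier": ["AA", "OO"],
    "distance": [140, 471],
})

# Passing `columns=` encodes several columns at once and prefixes every
# dummy with the name of the column it came from.
encoded = pd.get_dummies(toy, columns=["mkt_carrier", "op_unique_carrier"])
print(list(encoded.columns))
```

High-cardinality columns such as `tail_num` would still create one column per distinct value, so they may be better served by the aggregated `*_delay_mean` style features already present in the frame.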

In [27]:
flights_with_weather


In [28]:
flights_with_weather.isnull().sum().head(50)

mkt_carrier_fl_num                     0
op_carrier_fl_num                      0
crs_dep_time                           0
dep_time                               0
dep_delay                              0
taxi_out                               0
wheels_off                             0
wheels_on                              0
taxi_in                                0
crs_arr_time                           0
arr_time                               0
arr_delay                              0
cancelled                              0
diverted                               0
crs_elapsed_time                       0
actual_elapsed_time                    0
air_time                               0
flights                                0
distance                               0
carrier_delay                          0
weather_delay                          0
nas_delay                              0
security_delay                         0
late_aircraft_delay                    0
first_dep_time  

In [29]:
flights_with_weather.isnull().sum().tail(50)

op_unique_carrier_delay_mean        0
op_unique_carrier_delay_median      0
op_unique_carrier_delay_std         0
op_unique_carrier_delay_min         0
op_unique_carrier_delay_max         0
fl_date_delay_mean                  0
fl_date_delay_median                0
fl_date_delay_std                   0
fl_date_delay_min                   0
fl_date_delay_max                   0
dest_city_name_delay_mean           0
dest_city_name_delay_median         0
dest_city_name_delay_std            4
dest_city_name_delay_min            0
dest_city_name_delay_max            0
origin_airport_id_delay_mean        0
origin_airport_id_delay_median      0
origin_airport_id_delay_std         5
origin_airport_id_delay_min         0
origin_airport_id_delay_max         0
origin_city_name_delay_mean         0
origin_city_name_delay_median       0
origin_city_name_delay_std          5
origin_city_name_delay_min          0
origin_city_name_delay_max          0
dest_airport_id_delay_mean          0
dest_airport

In [30]:
# flights_with_weather = dummy_maker(flights_with_weather,'fl_date')
# flights_with_weather = dummy_maker(flights_with_weather,'mkt_unique_carrier')
# flights_with_weather = dummy_maker(flights_with_weather,'branded_code_share')
# flights_with_weather = dummy_maker(flights_with_weather,'mkt_carrier')
# flights_with_weather = dummy_maker(flights_with_weather,'mkt_carrier_fl_num')
# flights_with_weather = dummy_maker(flights_with_weather,'op_unique_carrier')
# flights_with_weather = dummy_maker(flights_with_weather,'tail_num')
# flights_with_weather = dummy_maker(flights_with_weather,'dup')



In [31]:
pca = PCA()
pca.fit(flights_with_weather)
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)

px.area(
    x=range(1, exp_var_cumul.shape[0] + 1),
    y=exp_var_cumul,
    labels={"x": "# Components", "y": "Explained Variance"}
)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
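
The `ValueError` above is raised because `PCA` does not accept missing values, and the frame contains columns that are entirely empty (the `conditions_*` and `weather_*` aggregates in the rows shown) as well as a few `*_std` columns with isolated NaNs. A possible way around it, sketched here on a made-up toy frame, is to drop the all-NaN columns, impute the remaining gaps, and standardize before fitting. Median imputation and standardization are assumptions chosen for illustration, not steps taken elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for flights_with_weather (random values, borrowed column names).
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(200, 4)),
    columns=["distance", "crs_elapsed_time", "fl_date_delay_mean", "tail_num_delay_std"],
)
toy.loc[::25, "tail_num_delay_std"] = np.nan   # a few isolated gaps
toy["weather_origin_delay_mean"] = np.nan      # an entirely empty column

# Keep numeric columns only and drop those with no values at all.
numeric = toy.select_dtypes(include="number").dropna(axis=1, how="all")

# Fill the remaining gaps with the column median, then put every
# feature on the same scale so no single column dominates the variance.
imputed = SimpleImputer(strategy="median").fit_transform(numeric)
scaled = StandardScaler().fit_transform(imputed)

pca = PCA().fit(scaled)
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)
print(exp_var_cumul)
```

Note that the real frame also contains `arr_delay` and other columns that are only known after the flight; those would need to be excluded before using the components as model inputs.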

In [None]:
n_components = None  # placeholder: to be determined; set to an integer before running the next cell
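
One common way to settle on a value is to take the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below uses a made-up cumulative curve and a 95% threshold; both are assumptions for illustration, not results from this dataset.

```python
import numpy as np

# Hypothetical cumulative explained-variance curve (not computed from the flights data).
exp_var_cumul = np.array([0.55, 0.80, 0.93, 0.97, 1.00])
threshold = 0.95  # assumed target share of variance to retain

# searchsorted returns the index of the first entry >= threshold;
# adding 1 converts that index into a component count.
n_components = int(np.searchsorted(exp_var_cumul, threshold) + 1)
print(n_components)
```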

In [None]:
pca = PCA(n_components=n_components)
components = pca.fit_transform(flights_with_weather)

total_var = pca.explained_variance_ratio_.sum() * 100

labels = {str(i): f"PC {i+1}" for i in range(n_components)}
labels['color'] = 'Arrival Delay (min)'

fig = px.scatter_matrix(
    components,
    color=flights_with_weather['arr_delay'],  # color points by the regression target
    dimensions=range(n_components),
    labels=labels,
    title=f'Total Explained Variance: {total_var:.2f}%',
)
fig.update_traces(diagonal_visible=False)
fig.show()

### Modeling

Use different ML techniques to solve each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- An ensemble of your own choice
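
To compare several of the techniques listed above on equal footing, each candidate can be scored with the same cross-validation folds and the same metric. The following is a minimal sketch for the regression problem on synthetic data from `make_regression`; the two models, the fold count, and the metric are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for the engineered flight features.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # scikit-learn reports errors as negative scores, so flip the sign back.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    results[name] = -scores.mean()

for name, mae in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: MAE = {mae:.2f}")
```

The same loop extends to the classification tasks by swapping in classifiers and a classification metric.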

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from the **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. A sample submission can be found in the file **_sample_submission.csv_**
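
Because the final predictions are for dates after the training period, a chronological hold-out can mimic that setup more closely than a random split. The sketch below trains on 2018, validates on 2019, and scores a trivial mean-delay baseline with three regression metrics. The toy data, the split date, and the baseline are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy data: one made-up arrival delay per day across 2018 and 2019.
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", "2019-12-31", freq="D")
toy = pd.DataFrame({
    "fl_date": dates,
    "arr_delay": rng.normal(loc=5.0, scale=40.0, size=len(dates)),
})

# Chronological split: fit on 2018, validate on 2019.
train = toy[toy["fl_date"] < "2019-01-01"]
valid = toy[toy["fl_date"] >= "2019-01-01"]

# Baseline: always predict the mean delay seen in training.
baseline = np.full(len(valid), train["arr_delay"].mean())

mae = mean_absolute_error(valid["arr_delay"], baseline)
rmse = np.sqrt(mean_squared_error(valid["arr_delay"], baseline))
r2 = r2_score(valid["arr_delay"], baseline)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

Any real model should beat this baseline on the same validation year before it is used for the January 2020 predictions.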

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. These variables are continuous (minutes of delay), not binary, so they need an additional transformation: for each flight that was delayed, exactly one of them should be set to 1 and the others to 0.

A flight can have two or more delay types with more than 0 minutes. In that case, set the one with the largest value to 1 and the others to 0.
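
The transformation described above can be sketched with `idxmax` across the delay columns. The toy rows are made up, and ties are resolved in favor of the first column in the list, which is an assumption since the text does not say how to break them.

```python
import pandas as pd

delay_cols = [
    "carrier_delay", "weather_delay", "nas_delay",
    "security_delay", "late_aircraft_delay",
]

# Toy rows: one cause, two competing causes, and no reported cause.
toy = pd.DataFrame(
    [[49.0, 0, 0, 0, 0], [10, 0, 25, 0, 0], [0, 0, 0, 0, 0]],
    columns=delay_cols,
)

# Keep only flights with at least one delay cause above 0 minutes.
delayed = toy[(toy[delay_cols] > 0).any(axis=1)]

# The label is the name of the column holding the largest delay.
delay_type = delayed[delay_cols].idxmax(axis=1)

# One-hot version: exactly one 1 per row, 0 elsewhere.
one_hot = (
    pd.get_dummies(delay_type)
    .reindex(columns=delay_cols, fill_value=0)
    .astype(int)
)
print(delay_type.tolist())
print(one_hot)
```

Most classifiers can take `delay_type` directly as a single multiclass label, so the one-hot form is only needed when a model expects it.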

### Binary Classification

The target variable is **CANCELLED**. The main problem here is going to be a huge class imbalance: only a very small fraction of all flights are cancelled. It is important to sample the data appropriately before training and to choose suitable evaluation metrics.
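
As a rough illustration of both points, the sketch below makes a stratified split, undersamples the majority class in the training portion only, and reports average precision and recall instead of accuracy. The synthetic data, the roughly 2% positive rate, random undersampling, and logistic regression are all assumptions; other resampling schemes or class weights may work better on the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for cancelled vs. not cancelled.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)

# Stratify so both splits keep the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Random undersampling of the majority class, applied to the training
# data only so the test set keeps the realistic imbalance.
rng = np.random.default_rng(0)
pos_idx = np.flatnonzero(y_train == 1)
neg_idx = np.flatnonzero(y_train == 0)
keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
balanced_idx = np.concatenate([pos_idx, keep_neg])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train[balanced_idx], y_train[balanced_idx])

# Accuracy is misleading under imbalance; use metrics focused on the rare class.
scores = clf.predict_proba(X_test)[:, 1]
ap = average_precision_score(y_test, scores)
recall = recall_score(y_test, clf.predict(X_test))
print(f"average precision={ap:.3f}  recall={recall:.3f}")
```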