# Machine Learning

This file contains instructions on how to approach the challenge.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict the delay of flights.
- **(Stretch) Multiclass Classification**: If the plane is delayed, we predict what type of delay it is.
- **(Stretch) Binary Classification**: The goal is to predict whether the flight will be cancelled.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful about which columns we use and which we don't. For example, DEP_DELAY would be the perfect predictor, but we can't use it because in a real-life scenario we want to predict the delay before the flight takes off. We can use the average delay from earlier days, but not the delay of the actual flight we are predicting.

Similarly, the variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors. However, we can create various transformations from their earlier values.

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.
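As a rough illustration of "transformations from earlier values", the sketch below builds a lagged per-carrier average delay on a made-up table. The toy values, the `carrier_daily_delay` and `carrier_prev_day_delay` names, and the one-row lag are assumptions for illustration only.

```python
import pandas as pd

# Toy flights table; column names mirror the dataset, values are made up.
flights = pd.DataFrame({
    "fl_date": pd.to_datetime(["2019-01-01", "2019-01-01", "2019-01-02",
                               "2019-01-02", "2019-01-03", "2019-01-03"]),
    "op_unique_carrier": ["WN", "AA", "WN", "AA", "WN", "AA"],
    "arr_delay": [10.0, 0.0, 30.0, 20.0, 50.0, 40.0],
})

# Mean arrival delay per carrier and day.
daily = (flights.groupby(["op_unique_carrier", "fl_date"])["arr_delay"]
         .mean()
         .rename("carrier_daily_delay")
         .reset_index()
         .sort_values(["op_unique_carrier", "fl_date"]))

# shift(1) moves each value to the next available day within its carrier, so the
# feature attached to a flight only uses delays observed on earlier days.
daily["carrier_prev_day_delay"] = (
    daily.groupby("op_unique_carrier")["carrier_daily_delay"].shift(1)
)

flights = flights.merge(
    daily[["op_unique_carrier", "fl_date", "carrier_prev_day_delay"]],
    on=["op_unique_carrier", "fl_date"], how="left",
)
```

The first day of each carrier has no history, so its feature is missing. Because the evaluation predicts one week in advance, a lag of at least seven days would probably be the safer choice on the real data.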

### Feature Engineering

Feature engineering will play a crucial role in this problem. We have only a few attributes, so we need to create features that have some predictive power.

- weather: we can use a weather API to look up the weather at the time of the scheduled departure and scheduled arrival.
- statistics (mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics
- airports encoding: we need to think about what to do with the airports and other categorical variables
- time of the day: the delay probably depends on the airport traffic which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.
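One possible take on the "time of the day" idea is sketched below. It assumes `crs_dep_time` is an HHMM integer (as the values in this dataset suggest). The `dep_hour`, `dep_hour_sin` and `dep_hour_cos` names are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy scheduled departure times in HHMM form.
sched = pd.DataFrame({"crs_dep_time": [5, 600, 1200, 1800, 2359]})

# With HHMM integers, integer division by 100 gives the hour of day.
sched["dep_hour"] = sched["crs_dep_time"] // 100

# Map the hour onto a circle so that 23:00 and 00:00 end up close together,
# which a plain 0-23 integer cannot express.
angle = 2 * np.pi * sched["dep_hour"] / 24
sched["dep_hour_sin"] = np.sin(angle)
sched["dep_hour_cos"] = np.cos(angle)
```

Tree-based models can often work with the raw hour directly. The cyclical encoding tends to matter more for linear models.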

### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one works best for our problem.

- Original features vs. PCA components?
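One way to approach the "original features vs. PCA components" question is to cross-validate the same model with and without a PCA step. The sketch below does this on synthetic data. The data, the choice of 5 components and the linear model are all assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flight features: only the first column carries signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = X_demo[:, 0] * 3.0 + rng.normal(scale=0.5, size=200)

# Same estimator, once on the scaled original features and once on 5 PCA components.
original = make_pipeline(StandardScaler(), LinearRegression())
with_pca = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())

# Mean cross-validated R^2 for each variant.
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2").mean()
    for name, model in [("original", original), ("pca", with_pca)]
}
```

Putting PCA inside the pipeline means it is refitted on each training fold, so the validation folds do not influence the components.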

In [None]:
#Holidays?
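A holiday flag could be derived from the flight date, for example with the US federal holiday calendar that ships with pandas. This is only a sketch: it assumes `fl_date` is still available at this point (the notebook drops it later), and the `is_holiday` name is made up.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Toy flight dates around two US federal holidays.
dates = pd.DataFrame({"fl_date": pd.to_datetime(
    ["2019-07-03", "2019-07-04", "2019-12-25", "2019-12-26"])})

# All observed federal holidays in the period covered by the data.
holidays = USFederalHolidayCalendar().holidays(start="2018-01-01", end="2020-01-31")

# True when the flight date is a federal holiday.
dates["is_holiday"] = dates["fl_date"].isin(holidays)
```

Days adjacent to a holiday may matter as much as the holiday itself, so a "days to nearest holiday" feature could be a natural extension.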

In [19]:
df = pd.read_csv('df_flights_final.csv')

#drop flight date and first column
df = df.drop(columns='Unnamed: 0')
df.drop("fl_date",axis=1, inplace=True)

In [20]:
# random sample of 10000 rows
df_small = df.sample(n = 10000)
df_small.shape

(10000, 41)

In [21]:
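# separate the target from the features
# (this run uses std_tail_num_arr_delay as y, so arr_delay remains in X)
y = df_small.std_tail_num_arr_delay
df_small.drop("std_tail_num_arr_delay",axis=1, inplace=True)
X = df_small
X.columns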

Index(['mkt_unique_carrier', 'op_unique_carrier', 'tail_num',
       'origin_airport_id', 'origin', 'dest_airport_id', 'dest',
       'crs_dep_time', 'crs_arr_time', 'arr_delay', 'crs_elapsed_time',
       'distance', 'hour_of_day_dep', 'hour_of_day_arr', 'origin_weather',
       'mean_weather_delay', 'std_weather_delay', 'state', 'fl_day',
       'daily_arr_delay_mean', 'daily_carrier_delay_mean',
       'daily_weather_delay_mean', 'daily_nas_delay_mean',
       'daily_security_delay_mean', 'daily_late_aircraft_delay_mean',
       'daily_arr_delay_std', 'daily_carrier_delay_std',
       'daily_weather_delay_std', 'daily_nas_delay_std',
       'daily_security_delay_std', 'daily_late_aircraft_delay_std',
       'dep_mean_hourly_delay', 'arr_mean_hourly_delay',
       'dep_std_hourly_delay', 'arr_std_hourly_delay',
       'mean_mkt_carrier_delay', 'mean_op_carrier_delay',
       'std_mkt_carrier_delay', 'std_op_carrier_delay',
       'mean_tail_num_arr_delay'],
      dtype='object')

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [23]:
enc = OneHotEncoder(handle_unknown='ignore')

In [24]:
# features whose dtype is object
categorical_columns = X_train.columns[X_train.dtypes == "object"]
categorical_columns

Index(['mkt_unique_carrier', 'op_unique_carrier', 'tail_num', 'origin', 'dest',
       'origin_weather', 'state'],
      dtype='object')

In [25]:
# Encode categorical features as a one-hot numeric array
enc = OneHotEncoder(handle_unknown='ignore')
transformed_columns = enc.fit_transform(X_train[['mkt_unique_carrier', 'op_unique_carrier', 'tail_num', 'fl_day', 'origin', 'dest',
       'origin_weather', 'state']])

In [26]:
#hash encoder, pca 10
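The note above hints at hashing the categorical columns and then reducing them to a handful of components. A minimal sketch with scikit-learn's `FeatureHasher` follows. It uses `TruncatedSVD` instead of PCA because the hashed matrix is sparse. The toy values, `n_features=64` and `n_components=3` are assumptions; on the real data `n_components=10` would match the note.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction import FeatureHasher

# Toy categorical columns; values are made up.
cats = pd.DataFrame({
    "op_unique_carrier": ["WN", "AA", "DL", "WN"] * 10,
    "origin": ["DEN", "DFW", "ATL", "LAS"] * 10,
    "dest": ["LAS", "ORD", "JFK", "DEN"] * 10,
})

# Prefix each value with its column name so "DEN" as origin and "DEN" as dest
# are hashed as different tokens.
tokens = [[f"{col}={val}" for col, val in row.items()]
          for row in cats.to_dict("records")]

# Hash every row's tokens into a fixed-width sparse matrix.
hasher = FeatureHasher(n_features=64, input_type="string")
hashed = hasher.transform(tokens)

# Reduce the hashed matrix to a few dense components.
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(hashed)
```

Unlike one-hot encoding, the width of the hashed matrix does not grow with the number of distinct tail numbers or airports, at the cost of occasional hash collisions.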

In [27]:
df_transformed_columns = pd.DataFrame(transformed_columns.todense())

In [28]:
df_transformed_columns

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4641,4642,4643,4644,4645,4646,4647,4648,4649,4650
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7496,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7497,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7498,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [29]:
# concatenating X_train and encoded values

In [33]:
X_train = pd.concat([X_train.drop(columns = categorical_columns).reset_index(), df_transformed_columns], axis=1)

In [45]:
X_train.drop("index",axis=1, inplace=True)

In [46]:
X_train

Unnamed: 0,origin_airport_id,dest_airport_id,crs_dep_time,crs_arr_time,arr_delay,crs_elapsed_time,distance,hour_of_day_dep,hour_of_day_arr,mean_weather_delay,...,4641,4642,4643,4644,4645,4646,4647,4648,4649,4650
0,12266,13851,2125,2254,121.0,89.0,395.0,23.0,0.0,0.304863,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,12523,12819,800,905,-16.0,65.0,234.0,7.0,8.0,3.141741,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,15096,11433,540,740,-10.0,120.0,374.0,5.0,7.0,0.170538,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,15295,13930,1835,1854,20.0,79.0,213.0,18.0,19.0,0.170538,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,14771,13930,1658,2255,59.0,237.0,1846.0,18.0,23.0,0.304863,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,11433,10397,2059,2307,49.0,128.0,594.0,22.0,23.0,0.170538,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7496,14122,11292,1621,1800,-23.0,219.0,1290.0,16.0,17.0,0.170538,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7497,13303,10792,845,1155,4.0,190.0,1185.0,8.0,11.0,0.304863,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7498,12266,12278,2142,2332,8.0,110.0,542.0,21.0,23.0,0.304863,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [62]:
from sklearn.preprocessing import StandardScaler

In [78]:
X_train.dtypes

origin_airport_id      int64
dest_airport_id        int64
crs_dep_time           int64
crs_arr_time           int64
arr_delay            float64
                      ...   
4646                 float64
4647                 float64
4648                 float64
4649                 float64
4650                 float64
Length: 4684, dtype: object

In [89]:
data_for_scaling = X_train[['origin_airport_id', 'dest_airport_id',
       'crs_dep_time', 'crs_arr_time', 'arr_delay', 'crs_elapsed_time',
       'distance', 'hour_of_day_dep', 'hour_of_day_arr',
       'mean_weather_delay', 'std_weather_delay',
       'daily_arr_delay_mean', 'daily_carrier_delay_mean',
       'daily_weather_delay_mean', 'daily_nas_delay_mean',
       'daily_security_delay_mean', 'daily_late_aircraft_delay_mean',
       'daily_arr_delay_std', 'daily_carrier_delay_std',
       'daily_weather_delay_std', 'daily_nas_delay_std',
       'daily_security_delay_std', 'daily_late_aircraft_delay_std',
       'dep_mean_hourly_delay', 'arr_mean_hourly_delay',
       'dep_std_hourly_delay', 'arr_std_hourly_delay',
       'mean_mkt_carrier_delay', 'mean_op_carrier_delay',
       'std_mkt_carrier_delay', 'std_op_carrier_delay',
       'mean_tail_num_arr_delay']]
data_for_scaling

Unnamed: 0,origin_airport_id,dest_airport_id,crs_dep_time,crs_arr_time,arr_delay,crs_elapsed_time,distance,hour_of_day_dep,hour_of_day_arr,mean_weather_delay,...,daily_late_aircraft_delay_std,dep_mean_hourly_delay,arr_mean_hourly_delay,dep_std_hourly_delay,arr_std_hourly_delay,mean_mkt_carrier_delay,mean_op_carrier_delay,std_mkt_carrier_delay,std_op_carrier_delay,mean_tail_num_arr_delay
0,12266,13851,2125,2254,121.0,89.0,395.0,23.0,0.0,0.304863,...,23.812280,42.481198,33.482304,65.825566,56.155253,3.843078,4.641786,19.032934,20.498324,-8.222222
1,12523,12819,800,905,-16.0,65.0,234.0,7.0,8.0,3.141741,...,20.172657,1.692415,-6.587438,16.170308,20.110324,2.421876,2.475357,14.606831,14.950003,-5.829787
2,15096,11433,540,740,-10.0,120.0,374.0,5.0,7.0,0.170538,...,23.812280,-3.710128,-7.129878,6.302142,18.748851,2.992308,2.502434,17.311735,15.376128,0.222222
3,15295,13930,1835,1854,20.0,79.0,213.0,18.0,19.0,0.170538,...,23.812280,14.856133,7.616542,37.799754,38.967034,4.100424,4.518811,18.530901,22.746417,56.790123
4,14771,13930,1658,2255,59.0,237.0,1846.0,18.0,23.0,0.304863,...,20.918964,14.856133,12.420321,37.799754,45.207148,3.843078,2.671135,19.032934,14.365098,1.386364
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,11433,10397,2059,2307,49.0,128.0,594.0,22.0,23.0,0.170538,...,24.337276,23.171596,12.420321,50.110807,45.207148,2.427974,2.427974,13.555080,13.555080,-5.895522
7496,14122,11292,1621,1800,-23.0,219.0,1290.0,16.0,17.0,0.170538,...,20.172657,13.169301,6.147322,36.560999,36.670739,3.843078,2.671135,19.032934,14.365098,-6.080000
7497,13303,10792,845,1155,4.0,190.0,1185.0,8.0,11.0,0.304863,...,16.062124,3.580190,-0.148537,21.547013,31.106827,4.100424,2.672724,18.530901,14.686858,-0.119048
7498,12266,12278,2142,2332,8.0,110.0,542.0,21.0,23.0,0.304863,...,16.062124,22.703228,12.420321,48.076949,45.207148,3.843078,4.641786,19.032934,20.498324,-2.583333


In [91]:
# scaling data
scaler = StandardScaler()
print(scaler.fit(data_for_scaling))
print(scaler.mean_)

StandardScaler()
[1.26988273e+04 1.26772208e+04 1.32715600e+03 1.49003067e+03
 4.53920000e+00 1.42912000e+02 7.96602667e+02 1.30170667e+01
 1.43810667e+01 4.89210460e-01 5.60726902e+00 4.18066517e+00
 3.78493104e+00 5.00193491e-01 2.40010498e+00 2.67230210e-02
 5.12402182e+00 3.71791627e+01 1.79815323e+01 6.98777766e+00
 1.11553665e+01 1.13148371e+00 2.15269727e+01 1.04556745e+01
 4.28118802e+00 3.16519774e+01 3.49644715e+01 3.80139584e+00
 3.78832672e+00 1.76962999e+01 1.74409734e+01 4.20122531e+00]
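The encoder and the scaler are fitted on the training split, and the test split then has to go through the same fitted objects. One way to keep the two steps together is a `ColumnTransformer`, sketched below on toy data. The two-column frames and their values are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy train/test frames with one categorical and one numeric column.
train = pd.DataFrame({
    "op_unique_carrier": ["WN", "AA", "WN", "DL"],
    "distance": [395.0, 234.0, 374.0, 1846.0],
})
test = pd.DataFrame({
    "op_unique_carrier": ["AA", "UA"],      # "UA" never appears in train
    "distance": [594.0, 1290.0],
})

# One-hot encode the categorical column and standardise the numeric one.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["op_unique_carrier"]),
    ("scale", StandardScaler(), ["distance"]),
])

train_t = preprocess.fit_transform(train)   # categories and means learned from train only
test_t = preprocess.transform(test)         # the same fitted encoder and scaler are reused
```

With `handle_unknown="ignore"`, a category that was not seen during fitting is encoded as all zeros instead of raising an error.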


In [48]:
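# Removing Correlated Features
# step 1: absolute pairwise correlations between the columns of X_train
df_corr = X_train.corr().abs()

# step 2: list each pair with correlation above 0.8 once
# (x < y skips the diagonal and the mirrored pairs)
indices = np.where(df_corr > 0.8)
indices = [(df_corr.index[x], df_corr.columns[y])
           for x, y in zip(*indices)
           if x < y]

# step 3: drop the second feature of each pair from df (X_train itself is left unchanged)
for idx in indices: # each pair
    try:
        df.drop(idx[1], axis = 1, inplace=True)
    except KeyError:
        # column was already dropped or does not exist in df
        pass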

In [16]:
print(indices)

[('crs_dep_time', 'hour_of_day_dep'), ('crs_dep_time', 'dep_std_hourly_delay'), ('crs_arr_time', 'hour_of_day_arr'), ('crs_elapsed_time', 'distance'), ('mean_weather_delay', 'std_weather_delay'), ('daily_arr_delay_mean', 'daily_carrier_delay_mean'), ('daily_arr_delay_mean', 'daily_late_aircraft_delay_mean'), ('daily_arr_delay_mean', 'daily_carrier_delay_std'), ('daily_arr_delay_mean', 'daily_late_aircraft_delay_std'), ('daily_carrier_delay_mean', 'daily_late_aircraft_delay_mean'), ('daily_carrier_delay_mean', 'daily_carrier_delay_std'), ('daily_carrier_delay_mean', 'daily_late_aircraft_delay_std'), ('daily_weather_delay_mean', 'daily_late_aircraft_delay_mean'), ('daily_weather_delay_mean', 'daily_weather_delay_std'), ('daily_nas_delay_mean', 'daily_arr_delay_std'), ('daily_nas_delay_mean', 'daily_nas_delay_std'), ('daily_security_delay_mean', 'daily_security_delay_std'), ('daily_late_aircraft_delay_mean', 'daily_carrier_delay_std'), ('daily_late_aircraft_delay_mean', 'daily_late_aircra

In [51]:
df_corr

Unnamed: 0,origin_airport_id,dest_airport_id,crs_dep_time,crs_arr_time,arr_delay,crs_elapsed_time,distance,hour_of_day_dep,hour_of_day_arr,mean_weather_delay,...,4641,4642,4643,4644,4645,4646,4647,4648,4649,4650
origin_airport_id,1.000000,0.010318,0.008037,0.008765,0.012506,0.031993,0.060853,0.011891,0.014962,0.153556,...,0.023782,0.209875,0.180845,0.071432,0.025004,0.048294,0.195859,0.004429,0.024763,0.002825
dest_airport_id,0.010318,1.000000,0.028949,0.016931,0.030056,0.080152,0.065222,0.028527,0.011961,0.015016,...,0.002710,0.009275,0.019600,0.052945,0.017560,0.015261,0.042482,0.001739,0.032938,0.029489
crs_dep_time,0.008037,0.028949,1.000000,0.701678,0.092075,0.001801,0.013863,0.945510,0.611109,0.014876,...,0.023384,0.022203,0.021343,0.037765,0.012770,0.026852,0.003425,0.036375,0.021487,0.009815
crs_arr_time,0.008765,0.016931,0.701678,1.000000,0.068246,0.023479,0.023534,0.712306,0.845807,0.002268,...,0.017936,0.018068,0.018887,0.035340,0.015004,0.024014,0.020038,0.021080,0.021770,0.004334
arr_delay,0.012506,0.030056,0.092075,0.068246,1.000000,0.022286,0.018818,0.146277,0.025271,0.006524,...,0.003448,0.004797,0.023114,0.003376,0.005726,0.006400,0.026027,0.025163,0.005311,0.005571
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4646,0.048294,0.015261,0.026852,0.024014,0.006400,0.009198,0.018844,0.024983,0.021639,0.014359,...,0.001084,0.013085,0.004928,0.003666,0.000626,1.000000,0.006647,0.003720,0.001534,0.001716
4647,0.195859,0.042482,0.003425,0.020038,0.026027,0.036486,0.054131,0.000041,0.022887,0.037588,...,0.004907,0.059213,0.022302,0.016589,0.002832,0.006647,1.000000,0.016834,0.006943,0.007764
4648,0.004429,0.001739,0.036375,0.021080,0.025163,0.020101,0.024500,0.034673,0.017559,0.247195,...,0.002746,0.033140,0.012481,0.009284,0.001585,0.003720,0.016834,1.000000,0.003886,0.004345
4649,0.024763,0.032938,0.021487,0.021770,0.005311,0.030531,0.034426,0.020627,0.020957,0.014998,...,0.001133,0.013668,0.005148,0.003829,0.000654,0.001534,0.006943,0.003886,1.000000,0.001792


In [58]:
X_train.dtypes == object

origin_airport_id    False
dest_airport_id      False
crs_dep_time         False
crs_arr_time         False
arr_delay            False
                     ...  
4646                 False
4647                 False
4648                 False
4649                 False
4650                 False
Length: 4684, dtype: bool

In [60]:
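from sklearn.feature_selection import f_regression, SelectKBest
skb = SelectKBest(f_regression, k=20)
# fit on the encoded training frame: df_small still holds raw strings such as 'WN'
# and has 10000 rows, while y_train has 7500; scikit-learn also needs uniform column-name types
X_train.columns = X_train.columns.astype(str)
X_train = skb.fit_transform(X_train, y_train)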

### Modeling

Use different ML techniques to solve each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- The ensemble of your own choice
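A simple way to compare several of these techniques is to run each through the same cross-validation loop. The sketch below does this for two regressors on synthetic data. The data, the two models and their hyperparameters are assumptions for illustration, not a recommended setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data with one linear and one non-linear effect.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = 10 * X_demo[:, 0] + 5 * X_demo[:, 1] ** 2 + rng.normal(size=300)

models = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean absolute error per model, averaged over 5 folds.
results = {}
for name, model in models.items():
    mae = -cross_val_score(model, X_demo, y_demo, cv=5,
                           scoring="neg_mean_absolute_error")
    results[name] = mae.mean()
```

scikit-learn reports error metrics as negative scores so that higher is always better, hence the sign flip before averaging.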

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from the **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. A sample submission can be found in the file **_sample_submission.csv_**.
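Since the final predictions are for a later week, holding out the last week of the available data may mimic the task better than a random split. The sketch below shows such a time-based split with three regression metrics. The synthetic data, the single `distance` feature and the linear model are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic month of flights: 20 flights per day, delay loosely tied to distance.
rng = np.random.default_rng(1)
dates = pd.date_range("2019-12-01", "2019-12-31", freq="D")
demo = pd.DataFrame({
    "fl_date": dates.repeat(20),
    "distance": rng.uniform(100, 2000, size=len(dates) * 20),
})
demo["arr_delay"] = 0.01 * demo["distance"] + rng.normal(scale=5, size=len(demo))

# Hold out the last 7 days as the validation set.
cutoff = demo["fl_date"].max() - pd.Timedelta(days=7)
train = demo[demo["fl_date"] <= cutoff]
valid = demo[demo["fl_date"] > cutoff]

model = LinearRegression().fit(train[["distance"]], train["arr_delay"])
pred = model.predict(valid[["distance"]])

metrics = {
    "rmse": np.sqrt(mean_squared_error(valid["arr_delay"], pred)),
    "mae": mean_absolute_error(valid["arr_delay"], pred),
    "r2": r2_score(valid["arr_delay"], pred),
}
```

RMSE penalises large misses more heavily than MAE, so reporting both gives a sense of how much a few badly delayed flights dominate the error.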

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. We need to do additional transformations because these variables are continuous, not binary. For each flight that was delayed, exactly one of these variables should be 1 and the others 0.

It can happen that two types of delay have more than 0 minutes for the same flight. In this case, set the larger one to 1 and the others to 0.
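The rule described above (largest delay wins) can be sketched with `idxmax`. The lowercase column names and the toy values are assumptions, and the sketch assumes the frame has already been filtered to delayed flights.

```python
import pandas as pd

delay_cols = ["carrier_delay", "weather_delay", "nas_delay",
              "security_delay", "late_aircraft_delay"]

# Toy delayed flights; the second row has two delay types above 0 minutes.
delayed = pd.DataFrame(
    [[15.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 5.0, 40.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 22.0]],
    columns=delay_cols,
)

# idxmax returns, for each row, the column holding the largest number of minutes.
delayed["delay_type"] = delayed[delay_cols].idxmax(axis=1)

# One-hot view: 1 for the dominant delay type, 0 for all others.
one_hot = (pd.get_dummies(delayed["delay_type"])
           .reindex(columns=delay_cols, fill_value=0)
           .astype(int))
```

When two delay types are exactly tied, `idxmax` picks the first column in `delay_cols`, so the column order acts as the tie-breaker.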

### Binary Classification

The target variable is **CANCELLED**. The main problem here is going to be the huge class imbalance: very few flights are cancelled compared to all flights. It is important to apply the right sampling before training and to choose appropriate evaluation metrics.
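One possible way to handle the imbalance is a stratified split combined with class weighting, evaluated with metrics that focus on the rare class. The sketch below uses synthetic data. The data, the logistic regression model and the two metrics are assumptions for illustration; resampling the training set would be an alternative to class weights.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with a rare positive class standing in for "cancelled".
rng = np.random.default_rng(7)
n = 2000
X_demo = rng.normal(size=(n, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=n) > 2.2).astype(int)

# stratify keeps the share of positives the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0)

# class_weight="balanced" reweights errors inversely to class frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Accuracy would look good even for a model that never predicts a cancellation,
# so look at recall and average precision for the positive class instead.
recall = recall_score(y_te, clf.predict(X_te))
avg_precision = average_precision_score(y_te, clf.predict_proba(X_te)[:, 1])
```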