# Machine Learning

This file contains instructions on how to approach the challenge.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict the delay of flights.
- **(Stretch) Multiclass Classification**: If the plane was (or will be) delayed, we will predict what type of delay it is.
- **(Stretch) Binary Classification**: The goal is to predict if the flight will be cancelled.

In [2]:
import pandas as pd
import numpy as np
import sklearn
from matplotlib import pyplot
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.metrics import RocCurveDisplay
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [12]:
from sklearn.preprocessing import LabelEncoder

In [28]:
import statsmodels.api as sm

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful about which columns we use and which we don't. For example, DEP_DELAY would be a near-perfect predictor, but we can't use it: in a real-life scenario, we want to predict the delay before the flight takes off. We can use the average delay from earlier days, but not the delay of the actual flight we are predicting.

Similarly, the variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors. However, we can create various transformations from their earlier values.

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.
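
A minimal sketch of a leakage-safe "earlier values" feature: for each flight, the mean `arr_delay` of its route computed only from flights at least 7 days older. The function name `lagged_route_mean`, the 7-day default and the toy data are assumptions for illustration; the column names `fl_date`, `route` and `arr_delay` follow the dataset used in this notebook.

```python
import pandas as pd

def lagged_route_mean(flights: pd.DataFrame, lag_days: int = 7) -> pd.Series:
    """Historical mean arr_delay per route, using only flights that
    departed at least `lag_days` before the flight being scored."""
    flights = flights.copy()
    flights["fl_date"] = pd.to_datetime(flights["fl_date"])

    # Running mean of arr_delay per route, up to and including each day.
    daily = (flights.groupby(["route", "fl_date"])["arr_delay"]
             .agg(["sum", "count"])
             .reset_index()
             .sort_values(["route", "fl_date"]))
    totals = daily.groupby("route")[["sum", "count"]].cumsum()
    daily["hist_mean"] = totals["sum"] / totals["count"]
    history = (daily.rename(columns={"fl_date": "hist_date"})
               [["route", "hist_date", "hist_mean"]]
               .sort_values("hist_date"))

    # For each flight, take the most recent history at or before the cutoff.
    lookup = flights[["route", "fl_date"]].copy()
    lookup["cutoff"] = lookup["fl_date"] - pd.Timedelta(days=lag_days)
    lookup["row"] = range(len(lookup))
    merged = pd.merge_asof(lookup.sort_values("cutoff"), history,
                           left_on="cutoff", right_on="hist_date",
                           by="route", direction="backward")
    merged = merged.sort_values("row")  # restore the original row order
    return pd.Series(merged["hist_mean"].to_numpy(), index=flights.index,
                     name="route_mean_lagged")

# Toy data (hypothetical values) for a single route.
toy = pd.DataFrame({
    "fl_date": ["2019-01-01", "2019-01-02", "2019-01-08", "2019-01-10"],
    "route": ["RDUCLE"] * 4,
    "arr_delay": [10.0, 20.0, 100.0, 0.0],
})
toy["route_mean_lagged"] = lagged_route_mean(toy)
print(toy)
```

Flights with no history older than the lag get `NaN`, which has to be imputed or handled by the model.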

In [3]:
df = pd.read_csv('Heavy_Data/flightsprocessed.csv')

In [5]:
df = df.drop(columns = 'Unnamed: 0')

In [7]:
df.isnull().sum()

fl_date                                 0
mkt_unique_carrier                      0
branded_code_share                      0
mkt_carrier                             0
mkt_carrier_fl_num                      0
op_unique_carrier                       0
tail_num                                0
op_carrier_fl_num                       0
origin_airport_id                       0
origin                                  0
origin_city_name                        0
dest_airport_id                         0
dest                                    0
dest_city_name                          0
crs_dep_time                            0
dep_time                                0
dep_delay                               0
taxi_out                                0
wheels_off                              0
wheels_on                               0
taxi_in                                 0
crs_arr_time                            0
arr_time                                0
arr_delay                         

In [8]:
y = df['arr_delay']

In [10]:
# Take an explicit copy so that adding columns to X does not trigger SettingWithCopyWarning
X = df[['month','hour','haul','flight_speed','state','proportional_activity','dummydelay','route','mean delay for route','mean delay based on air carrier','distance','origin_airport_id']].copy()

In [14]:
# Encode the categorical haul label (such as 'Short' or 'Long') as an integer
X['haultest'] = LabelEncoder().fit_transform(X['haul'])


In [16]:
X = X.drop(columns='haul')

In [20]:
X = X.drop(columns=['dummydelay','route'])

In [22]:
pd.set_option('display.max_columns', None)
df

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,dest_airport_id,dest,dest_city_name,crs_dep_time,dep_time,dep_delay,taxi_out,wheels_off,wheels_on,taxi_in,crs_arr_time,arr_time,arr_delay,cancelled,diverted,dup,crs_elapsed_time,actual_elapsed_time,air_time,flights,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,first_dep_time,total_add_gtime,longest_add_gtime,month,hour,haul,flight_speed,state,proportional_activity,dummydelay,route,mean delay for route,mean delay based on air carrier
0,2018-06-15,DL,DL_CODESHARE,DL,6230,G7,N378CA,6230,14492,RDU,"Raleigh/Durham, NC",11042,CLE,"Cleveland, OH",1942,1958.0,16.0,28.0,2026.0,2132.0,6.0,2124,2138.0,14.0,0.0,0.0,N,102.0,100.0,66.0,1.0,416.0,,,,,,,,,6,20,Short,378.181818,OH,0.008601,1,RDUCLE,24.321429,2.490683
1,2019-07-30,DL,DL_CODESHARE,DL,6229,G7,N186GJ,6229,14492,RDU,"Raleigh/Durham, NC",11042,CLE,"Cleveland, OH",2049,2047.0,-2.0,24.0,2111.0,2214.0,5.0,2233,2219.0,-14.0,0.0,0.0,N,104.0,92.0,63.0,1.0,416.0,,,,,,,,,7,21,Short,396.190476,OH,0.008601,0,RDUCLE,24.321429,2.490683
2,2018-07-17,DL,DL_CODESHARE,DL,6234,G7,N354CA,6234,14492,RDU,"Raleigh/Durham, NC",11042,CLE,"Cleveland, OH",2059,2234.0,95.0,22.0,2256.0,2.0,11.0,2237,13.0,96.0,0.0,0.0,N,98.0,99.0,66.0,1.0,416.0,0.0,0.0,4.0,0.0,92.0,,,,7,22,Short,378.181818,OH,0.008601,1,RDUCLE,24.321429,2.490683
3,2019-08-04,DL,DL_CODESHARE,DL,6229,G7,N186GJ,6229,14492,RDU,"Raleigh/Durham, NC",11042,CLE,"Cleveland, OH",2049,2039.0,-10.0,20.0,2059.0,2158.0,5.0,2233,2203.0,-30.0,0.0,0.0,N,104.0,84.0,59.0,1.0,416.0,,,,,,,,,8,20,Short,423.050847,OH,0.008601,0,RDUCLE,24.321429,2.490683
4,2019-05-30,DL,DL_CODESHARE,DL,6229,G7,N331CA,6229,14492,RDU,"Raleigh/Durham, NC",11042,CLE,"Cleveland, OH",2030,2021.0,-9.0,25.0,2046.0,2152.0,16.0,2213,2208.0,-5.0,0.0,0.0,N,103.0,107.0,66.0,1.0,416.0,,,,,,,,,5,20,Short,378.181818,OH,0.008601,0,RDUCLE,24.321429,2.490683
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
684218,2018-01-15,HA,HA,HA,466,HA,N588HA,466,14222,PPG,"Pago Pago, TT",12173,HNL,"Honolulu, HI",2330,2349.0,19.0,17.0,6.0,613.0,7.0,600,620.0,20.0,0.0,0.0,N,330.0,331.0,307.0,1.0,2599.0,8.0,0.0,0.0,0.0,12.0,,,,1,6,Long,507.947883,HI,0.000015,1,PPGHNL,68.200000,0.178935
684219,2018-09-24,HA,HA,HA,466,HA,N380HA,466,14222,PPG,"Pago Pago, TT",12173,HNL,"Honolulu, HI",2320,2307.0,-13.0,15.0,2322.0,532.0,8.0,550,540.0,-10.0,0.0,0.0,N,330.0,333.0,310.0,1.0,2599.0,,,,,,,,,9,23,Long,503.032258,HI,0.000015,0,PPGHNL,68.200000,0.178935
684220,2019-07-08,HA,HA,HA,466,HA,N381HA,466,14222,PPG,"Pago Pago, TT",12173,HNL,"Honolulu, HI",2320,2325.0,5.0,14.0,2339.0,608.0,58.0,550,706.0,76.0,0.0,0.0,N,330.0,401.0,329.0,1.0,2599.0,71.0,0.0,0.0,0.0,5.0,,,,7,23,Long,473.981763,HI,0.000015,1,PPGHNL,68.200000,0.178935
684221,2019-05-09,HA,HA,HA,466,HA,N370HA,466,14222,PPG,"Pago Pago, TT",12173,HNL,"Honolulu, HI",2320,2309.0,-11.0,15.0,2324.0,521.0,10.0,545,531.0,-14.0,0.0,0.0,N,325.0,322.0,297.0,1.0,2599.0,,,,,,,,,5,23,Long,525.050505,HI,0.000015,0,PPGHNL,68.200000,0.178935


In [23]:
# Encode the state abbreviation as an integer
X['statetest'] = LabelEncoder().fit_transform(X['state'])

In [25]:
Xstorage = X.copy()

In [51]:
X

Unnamed: 0,const,month,hour,flight_speed,proportional_activity,mean delay for route,mean delay based on air carrier,distance,haultest
0,1.0,6,20,378.181818,0.008601,24.321429,2.490683,416.0,2
1,1.0,7,21,396.190476,0.008601,24.321429,2.490683,416.0,2
2,1.0,7,22,378.181818,0.008601,24.321429,2.490683,416.0,2
3,1.0,8,20,423.050847,0.008601,24.321429,2.490683,416.0,2
4,1.0,5,20,378.181818,0.008601,24.321429,2.490683,416.0,2
...,...,...,...,...,...,...,...,...,...
684218,1.0,1,6,507.947883,0.000015,68.200000,0.178935,2599.0,0
684219,1.0,9,23,503.032258,0.000015,68.200000,0.178935,2599.0,0
684220,1.0,7,23,473.981763,0.000015,68.200000,0.178935,2599.0,0
684221,1.0,5,23,525.050505,0.000015,68.200000,0.178935,2599.0,0


In [52]:
X = X.drop(columns='const')

In [54]:
Xscaled = MinMaxScaler().fit_transform(X)

In [61]:
yarr = np.array(y)

In [65]:
yarreshape = yarr.reshape(-1,1)

In [66]:
yscaled = MinMaxScaler().fit_transform(yarreshape)

In [46]:
X = X.drop(columns = 'origin_airport_id')

In [76]:
X

Unnamed: 0,month,hour,flight_speed,proportional_activity,mean delay for route,mean delay based on air carrier,distance,haultest
0,6,20,378.181818,0.008601,24.321429,2.490683,416.0,2
1,7,21,396.190476,0.008601,24.321429,2.490683,416.0,2
2,7,22,378.181818,0.008601,24.321429,2.490683,416.0,2
3,8,20,423.050847,0.008601,24.321429,2.490683,416.0,2
4,5,20,378.181818,0.008601,24.321429,2.490683,416.0,2
...,...,...,...,...,...,...,...,...
684218,1,6,507.947883,0.000015,68.200000,0.178935,2599.0,0
684219,9,23,503.032258,0.000015,68.200000,0.178935,2599.0,0
684220,7,23,473.981763,0.000015,68.200000,0.178935,2599.0,0
684221,5,23,525.050505,0.000015,68.200000,0.178935,2599.0,0


In [55]:
Xscaled = sm.add_constant(Xscaled)

In [79]:
Xtry = X.drop(columns = 'distance')

In [122]:
X.to_csv('features')
Xstorage.to_csv('featuresbackup')

In [88]:
model = sm.OLS(y,X)

In [89]:
# Binary label: 0 if the flight arrived on time or early, 1 otherwise
ylabel = y.apply(lambda x: 0 if x <= 0 else 1)

In [90]:
results = model.fit()
print(results.summary())

                                 OLS Regression Results                                
Dep. Variable:              arr_delay   R-squared (uncentered):                   0.057
Model:                            OLS   Adj. R-squared (uncentered):              0.057
Method:                 Least Squares   F-statistic:                              5167.
Date:                Mon, 28 Jun 2021   Prob (F-statistic):                        0.00
Time:                        23:20:36   Log-Likelihood:                     -3.6316e+06
No. Observations:              684223   AIC:                                  7.263e+06
Df Residuals:                  684215   BIC:                                  7.263e+06
Df Model:                           8                                                  
Covariance Type:            nonrobust                                                  
                                      coef    std err          t      P>|t|      [0.025      0.975]
--------------------


### Feature Engineering

Feature engineering will play a crucial role in these problems. We have only a few attributes, so we need to create features that have some predictive power.

- weather: we can use a weather API to look up the weather at the time of the scheduled departure and scheduled arrival.
- statistics (mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics.
- airports encoding: we need to think about what to do with the airports and other categorical variables
- time of the day: the delay probably depends on the airport traffic which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.
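
Two of the ideas above, sketched under assumptions: descriptive statistics of past delays per group, and a coarse time-of-day bucket derived from `crs_dep_time` (HHMM format). The helper names `add_group_delay_stats` and `time_of_day`, the bucket boundaries and the toy values are hypothetical. The statistics should be computed on historical (training) flights only and then joined onto the flights to be predicted.

```python
import pandas as pd

def add_group_delay_stats(history, target, key, value="arr_delay"):
    """Join per-group descriptive statistics of `value`, computed on
    `history`, onto the rows of `target`."""
    stats = (history.groupby(key)[value]
             .agg(["mean", "median", "std", "min", "max"])
             .add_prefix(f"{key}_{value}_")
             .reset_index())
    # Left join: groups unseen in `history` get NaN statistics.
    return target.merge(stats, on=key, how="left")

def time_of_day(crs_dep_time: pd.Series) -> pd.Series:
    """Map an HHMM scheduled departure time to a coarse bucket."""
    hour = (crs_dep_time // 100) % 24
    return pd.cut(hour, bins=[-1, 5, 11, 17, 23],
                  labels=["night", "morning", "afternoon", "evening"])

# Toy data (hypothetical values).
history = pd.DataFrame({
    "op_unique_carrier": ["DL", "DL", "DL", "HA"],
    "arr_delay": [0.0, 10.0, 20.0, 5.0],
})
target = pd.DataFrame({
    "op_unique_carrier": ["DL", "HA", "ZZ"],
    "crs_dep_time": [1942, 600, 2330],
})
enriched = add_group_delay_stats(history, target, "op_unique_carrier")
enriched["time_of_day"] = time_of_day(enriched["crs_dep_time"])
print(enriched)
```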

In [93]:
passengers = pd.read_csv('Heavy_Data/passengers.csv')

In [94]:
passengers

Unnamed: 0.1,Unnamed: 0,departures_scheduled,departures_performed,payload,seats,passengers,freight,mail,distance,ramp_to_ramp,air_time,unique_carrier,airline_id,unique_carrier_name,region,carrier,carrier_name,carrier_group,carrier_group_new,origin_airport_id,origin_city_market_id,origin,origin_city_name,origin_country,origin_country_name,dest_airport_id,dest_city_market_id,dest,dest_city_name,dest_country,dest_country_name,aircraft_group,aircraft_type,aircraft_config,year,month,distance_group,class,data_source
0,0,0.0,6.0,205170.0,0.0,0.0,37737.0,106181.0,548.0,495.0,471.0,5V,20408,Tatonduk Outfitters Limited d/b/a Everts Air A...,D,5V,Tatonduk Outfitters Limited d/b/a Everts Air A...,1,1,10299,30299,ANC,"Anchorage, AK",US,United States,13970,33970,OTZ,"Kotzebue, AK",US,United States,6,640,2,2018,6,2,G,DU
1,1,33.0,33.0,1259910.0,5280.0,4619.0,5680.0,82.0,692.0,4035.0,3435.0,DL,19790,Delta Air Lines Inc.,D,DL,Delta Air Lines Inc.,3,3,10397,30397,ATL,"Atlanta, GA",US,United States,13198,33198,MCI,"Kansas City, MO",US,United States,6,656,1,2016,4,2,F,DU
2,2,111.0,110.0,2187460.0,8353.0,7305.0,0.0,0.0,109.0,6349.0,2733.0,CP,21167,Compass Airlines,D,CP,Compass Airlines,2,2,14679,33570,SAN,"San Diego, CA",US,United States,12892,32575,LAX,"Los Angeles, CA",US,United States,6,673,1,2015,8,1,F,DU
3,3,0.0,2.0,6800.0,18.0,16.0,0.0,0.0,21.0,24.0,16.0,H6,20336,Hageland Aviation Service,D,H6,Hageland Aviation Service,1,5,15554,35554,VAK,"Chevak, AK",US,United States,14718,34718,SCM,"Scammon Bay, AK",US,United States,4,416,1,2017,4,1,L,DU
4,4,1.0,1.0,43400.0,190.0,140.0,0.0,0.0,1826.0,302.0,262.0,B6,20409,JetBlue Airways,D,B6,JetBlue Airways,3,3,12478,31703,JFK,"New York, NY",US,United States,10140,30140,ABQ,"Albuquerque, NM",US,United States,6,699,1,2016,11,4,F,DU
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
699995,699995,127.0,125.0,4583100.0,17500.0,13762.0,405.0,0.0,334.0,11399.0,6957.0,AA,19805,American Airlines Inc.,D,AA,American Airlines Inc.,3,3,13487,31650,MSP,"Minneapolis, MN",US,United States,13930,30977,ORD,"Chicago, IL",US,United States,6,655,1,2017,1,1,F,DU
699996,699996,0.0,1.0,49090.0,180.0,178.0,0.0,0.0,680.0,135.0,99.0,DL,19790,Delta Air Lines Inc.,D,DL,Delta Air Lines Inc.,3,3,13487,31650,MSP,"Minneapolis, MN",US,United States,11292,30325,DEN,"Denver, CO",US,United States,6,888,1,2018,3,2,F,DU
699997,699997,20.0,20.0,790000.0,3000.0,2747.0,1422.0,0.0,603.0,2074.0,1647.0,US,20355,US Airways Inc.,D,US,US Airways Inc.,3,3,11057,31057,CLT,"Charlotte, NC",US,United States,15096,35096,SYR,"Syracuse, NY",US,United States,6,694,1,2015,5,2,F,DU
699998,699998,47.0,44.0,868120.0,3344.0,2236.0,48.0,0.0,327.0,3815.0,2541.0,MQ,20398,Envoy Air,D,MQ,Envoy Air,3,3,12511,32511,JLN,"Joplin, MO",US,United States,11298,30194,DFW,"Dallas/Fort Worth, TX",US,United States,6,673,1,2018,1,1,F,DU


In [97]:
df.nunique()

fl_date                              730
mkt_unique_carrier                    11
branded_code_share                    16
mkt_carrier                           11
mkt_carrier_fl_num                  7058
op_unique_carrier                     28
tail_num                            6419
op_carrier_fl_num                   7067
origin_airport_id                    375
origin                               375
origin_city_name                     367
dest_airport_id                      375
dest                                 375
dest_city_name                       367
crs_dep_time                        1349
dep_time                            1425
dep_delay                           1060
taxi_out                             171
wheels_off                          1400
wheels_on                           1440
taxi_in                              155
crs_arr_time                        1414
arr_time                            1440
arr_delay                           1086
cancelled       

In [107]:
import requests

In [108]:
requests.get('https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/timeline/Anchorage,AK/2018-06-15/2018-06-22?unitGroup=metric&key=AHBS12DVFHFFSHF764DGDB')

<Response [400]>

In [114]:
shape = np.full((684223,1), 7)

In [115]:
shape

array([[7],
       [7],
       [7],
       ...,
       [7],
       [7],
       [7]])

In [119]:
df.dtypes

fl_date                             object
mkt_unique_carrier                  object
branded_code_share                  object
mkt_carrier                         object
mkt_carrier_fl_num                   int64
op_unique_carrier                   object
tail_num                            object
op_carrier_fl_num                    int64
origin_airport_id                    int64
origin                              object
origin_city_name                    object
dest_airport_id                      int64
dest                                object
dest_city_name                      object
crs_dep_time                         int64
dep_time                           float64
dep_delay                          float64
taxi_out                           float64
wheels_off                         float64
wheels_on                          float64
taxi_in                            float64
crs_arr_time                         int64
arr_time                           float64
arr_delay  

In [120]:
# fl_date is stored as a string (object dtype), so parse it before subtracting 7 days
df['fl_date_previous'] = pd.to_datetime(df['fl_date']) - pd.to_timedelta(7, unit='days')

### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one works best for our problems.

- Original features vs. PCA components?
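
As a rough sketch of the PCA option, the block below standardizes a synthetic feature matrix and keeps as many principal components as are needed to explain 95% of the variance. The synthetic data, the 95% threshold and the variable names are assumptions; with the real features, the same pipeline would be fitted on the training split only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
# Six features: three independent columns plus three noisy near-copies,
# so roughly half of the columns are redundant.
X_demo = np.hstack([base, base + 0.05 * rng.normal(size=(500, 3))])

# PCA is scale-sensitive, so standardize first.
pca_pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
)
X_reduced = pca_pipeline.fit_transform(X_demo)

pca = pca_pipeline.named_steps["pca"]
print(X_demo.shape, "->", X_reduced.shape)
print(pca.explained_variance_ratio_.round(3))
```

Comparing a model trained on `X_reduced` with one trained on the original columns, using the same validation split, shows whether the reduction helps or hurts.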

In [98]:
passengers.nunique()

Unnamed: 0               700000
departures_scheduled        649
departures_performed        604
payload                  164160
seats                     20724
passengers                23459
freight                  118274
mail                      35780
distance                   4808
ramp_to_ramp              27825
air_time                  25280
unique_carrier              375
airline_id                  375
unique_carrier_name         375
region                        6
carrier                     386
carrier_name                394
carrier_group                 4
carrier_group_new             8
origin_airport_id          2049
origin_city_market_id      1798
origin                     2050
origin_city_name           1857
origin_country              135
origin_country_name         136
dest_airport_id            2054
dest_city_market_id        1801
dest                       2055
dest_city_name             1861
dest_country                152
dest_country_name           152
aircraft

In [99]:
passengers.year

0         2018
1         2016
2         2015
3         2017
4         2016
          ... 
699995    2017
699996    2018
699997    2015
699998    2018
699999    2017
Name: year, Length: 700000, dtype: int64

In [101]:
df['year'] = pd.DatetimeIndex(df['fl_date']).year

In [102]:
df

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,dest_airport_id,dest,dest_city_name,crs_dep_time,dep_time,dep_delay,taxi_out,wheels_off,wheels_on,taxi_in,crs_arr_time,arr_time,arr_delay,cancelled,diverted,dup,crs_elapsed_time,actual_elapsed_time,air_time,flights,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,first_dep_time,total_add_gtime,longest_add_gtime,month,hour,haul,flight_speed,state,proportional_activity,dummydelay,route,mean delay for route,mean delay based on air carrier,year
0,2018-06-15,DL,DL_CODESHARE,DL,6230,G7,N378CA,6230,14492,RDU,"Raleigh/Durham, NC",11042,CLE,"Cleveland, OH",1942,1958.0,16.0,28.0,2026.0,2132.0,6.0,2124,2138.0,14.0,0.0,0.0,N,102.0,100.0,66.0,1.0,416.0,,,,,,,,,6,20,Short,378.181818,OH,0.008601,1,RDUCLE,24.321429,2.490683,2018
1,2019-07-30,DL,DL_CODESHARE,DL,6229,G7,N186GJ,6229,14492,RDU,"Raleigh/Durham, NC",11042,CLE,"Cleveland, OH",2049,2047.0,-2.0,24.0,2111.0,2214.0,5.0,2233,2219.0,-14.0,0.0,0.0,N,104.0,92.0,63.0,1.0,416.0,,,,,,,,,7,21,Short,396.190476,OH,0.008601,0,RDUCLE,24.321429,2.490683,2019
2,2018-07-17,DL,DL_CODESHARE,DL,6234,G7,N354CA,6234,14492,RDU,"Raleigh/Durham, NC",11042,CLE,"Cleveland, OH",2059,2234.0,95.0,22.0,2256.0,2.0,11.0,2237,13.0,96.0,0.0,0.0,N,98.0,99.0,66.0,1.0,416.0,0.0,0.0,4.0,0.0,92.0,,,,7,22,Short,378.181818,OH,0.008601,1,RDUCLE,24.321429,2.490683,2018
3,2019-08-04,DL,DL_CODESHARE,DL,6229,G7,N186GJ,6229,14492,RDU,"Raleigh/Durham, NC",11042,CLE,"Cleveland, OH",2049,2039.0,-10.0,20.0,2059.0,2158.0,5.0,2233,2203.0,-30.0,0.0,0.0,N,104.0,84.0,59.0,1.0,416.0,,,,,,,,,8,20,Short,423.050847,OH,0.008601,0,RDUCLE,24.321429,2.490683,2019
4,2019-05-30,DL,DL_CODESHARE,DL,6229,G7,N331CA,6229,14492,RDU,"Raleigh/Durham, NC",11042,CLE,"Cleveland, OH",2030,2021.0,-9.0,25.0,2046.0,2152.0,16.0,2213,2208.0,-5.0,0.0,0.0,N,103.0,107.0,66.0,1.0,416.0,,,,,,,,,5,20,Short,378.181818,OH,0.008601,0,RDUCLE,24.321429,2.490683,2019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
684218,2018-01-15,HA,HA,HA,466,HA,N588HA,466,14222,PPG,"Pago Pago, TT",12173,HNL,"Honolulu, HI",2330,2349.0,19.0,17.0,6.0,613.0,7.0,600,620.0,20.0,0.0,0.0,N,330.0,331.0,307.0,1.0,2599.0,8.0,0.0,0.0,0.0,12.0,,,,1,6,Long,507.947883,HI,0.000015,1,PPGHNL,68.200000,0.178935,2018
684219,2018-09-24,HA,HA,HA,466,HA,N380HA,466,14222,PPG,"Pago Pago, TT",12173,HNL,"Honolulu, HI",2320,2307.0,-13.0,15.0,2322.0,532.0,8.0,550,540.0,-10.0,0.0,0.0,N,330.0,333.0,310.0,1.0,2599.0,,,,,,,,,9,23,Long,503.032258,HI,0.000015,0,PPGHNL,68.200000,0.178935,2018
684220,2019-07-08,HA,HA,HA,466,HA,N381HA,466,14222,PPG,"Pago Pago, TT",12173,HNL,"Honolulu, HI",2320,2325.0,5.0,14.0,2339.0,608.0,58.0,550,706.0,76.0,0.0,0.0,N,330.0,401.0,329.0,1.0,2599.0,71.0,0.0,0.0,0.0,5.0,,,,7,23,Long,473.981763,HI,0.000015,1,PPGHNL,68.200000,0.178935,2019
684221,2019-05-09,HA,HA,HA,466,HA,N370HA,466,14222,PPG,"Pago Pago, TT",12173,HNL,"Honolulu, HI",2320,2309.0,-11.0,15.0,2324.0,521.0,10.0,545,531.0,-14.0,0.0,0.0,N,325.0,322.0,297.0,1.0,2599.0,,,,,,,,,5,23,Long,525.050505,HI,0.000015,0,PPGHNL,68.200000,0.178935,2019


### Modeling

Use different ML techniques to solve each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- The ensemble of your own choice
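
One way to compare candidates from this list on equal footing is to score them with the same cross-validation setup. The sketch below does this for a linear model and a random forest on synthetic regression data; the data, hyperparameters and the choice of mean absolute error are assumptions, and for the flight data a time-ordered split may be more appropriate than plain k-fold.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the engineered feature matrix and arr_delay.
X_demo, y_demo = make_regression(n_samples=300, n_features=8,
                                 noise=10.0, random_state=0)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

mae_by_model = {}
for name, estimator in candidates.items():
    # scikit-learn maximizes scores, so MAE is reported as a negative number.
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5,
                             scoring="neg_mean_absolute_error")
    mae_by_model[name] = -scores.mean()

for name, mae in mae_by_model.items():
    print(f"{name}: MAE = {mae:.2f}")
```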

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from the **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. A sample submission can be found in the file **_sample_submission.csv_**
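
Because the final predictions are for a week that lies after all training data, a validation split that holds out the most recent 7 days mimics that setup more closely than a random split. The sketch below is one possible version; the helper `last_days_split`, the toy data and the choice of MAE, RMSE and R² are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def last_days_split(frame, date_col="fl_date", days=7):
    """Hold out the most recent `days` days as a validation set."""
    dates = pd.to_datetime(frame[date_col])
    cutoff = dates.max() - pd.Timedelta(days=days)
    return frame[dates <= cutoff], frame[dates > cutoff]

# Toy data (hypothetical): one flight per day for 60 days.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "fl_date": pd.date_range("2019-11-01", periods=60, freq="D").astype(str),
    "feature": rng.normal(size=60),
})
toy["arr_delay"] = 3.0 * toy["feature"] + rng.normal(scale=0.5, size=60)

train, valid = last_days_split(toy)
model = Ridge().fit(train[["feature"]], train["arr_delay"])
pred = model.predict(valid[["feature"]])

metrics = {
    "mae": mean_absolute_error(valid["arr_delay"], pred),
    "rmse": float(np.sqrt(mean_squared_error(valid["arr_delay"], pred))),
    "r2": r2_score(valid["arr_delay"], pred),
}
print(len(train), len(valid), metrics)
```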

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. We need to do additional transformations because these variables are not binary but continuous. For each flight that was delayed, exactly one of these variables should be 1 and the others 0.

It can happen that two types of delay have more than 0 minutes. In this case, set the larger one to 1 and the others to 0.
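
The transformation described above can be sketched as follows. The helper name `delay_type_target` and the toy rows are assumptions; the lowercase column names follow the processed dataset. Ties between equal delay values resolve to the first column in `DELAY_COLS`, which is a choice of this sketch, not something the task specifies.

```python
import pandas as pd

DELAY_COLS = ["carrier_delay", "weather_delay", "nas_delay",
              "security_delay", "late_aircraft_delay"]

def delay_type_target(frame: pd.DataFrame):
    """Return (label, one_hot) for flights with at least one positive delay."""
    delays = frame[DELAY_COLS].fillna(0)
    delayed = delays.gt(0).any(axis=1)
    # Name of the column holding the largest delay for each delayed flight.
    label = delays.loc[delayed].idxmax(axis=1)
    one_hot = (pd.get_dummies(label)
               .reindex(columns=DELAY_COLS, fill_value=0)
               .astype(int))
    return label, one_hot

# Toy rows: one flight without delay causes and three delayed flights.
toy = pd.DataFrame({
    "carrier_delay":       [None, 0.0, 8.0, 71.0],
    "weather_delay":       [None, 0.0, 0.0, 0.0],
    "nas_delay":           [None, 4.0, 0.0, 0.0],
    "security_delay":      [None, 0.0, 0.0, 0.0],
    "late_aircraft_delay": [None, 92.0, 12.0, 5.0],
}, dtype=float)

labels, one_hot = delay_type_target(toy)
print(labels)
print(one_hot)
```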

### Binary Classification

The target variable is **CANCELLED**. The main problem here is a huge class imbalance: only a small fraction of all flights are cancelled. It is important to do the right sampling before training and to choose the correct evaluation metrics.
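
A minimal sketch of one way to handle the imbalance, on synthetic data with roughly 3% positives: a stratified split keeps the class ratio equal in both sets, `class_weight="balanced"` reweights the rare class instead of resampling, and the model is scored with F1, ROC AUC and average precision instead of accuracy. The data, the 3% rate and the choice of logistic regression are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cancelled / not-cancelled problem.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, n_informative=4,
    n_clusters_per_class=1, weights=[0.97, 0.03], random_state=0,
)

# Stratify so train and test keep the same share of positives.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0,
)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
pred = clf.predict(X_test)
metrics = {
    "f1": f1_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
    "avg_precision": average_precision_score(y_test, proba),
}
print(metrics)
```

Accuracy is left out on purpose: a model that always predicts "not cancelled" would score about 97% on this data while being useless.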