# Machine Learning

In this file, instructions how to approach the challenge can be found.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict delay of flights.
- **(Stretch) Multiclass Classification**: If the plane was delayed, we will predict what type of delay it is (will be).
- **(Stretch) Binary Classification**: The goal is to predict if the flight will be cancelled.

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful which columns to use and which don't. For example, DEP_DELAY is going to be the perfect predictor, but we can't use it because in real-life scenario, we want to predict the delay before the flight takes of --> We can use average delay from earlier days but not the one from the actual flight we predict.  

For example, variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors as well. However, we can create various transformations from earlier values.

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

pd.set_option('display.max_columns',None)
scaler = StandardScaler()

In [27]:
flight_df = pd.read_feather('data/v2_clean_flight')
flight_test_df = pd.read_feather('data/v1_clean_flight_test')
fuel_consumption_df = pd.read_feather('data/v1_clean_fuel_consumption')
passenger_df = pd.read_feather('data/v1_clean_passenger')

### Feature Engineering

Feature engineering will play a crucial role in this problems. We have only very little attributes so we need to create some features that will have some predictive power.

- weather: we can use some weather API to look for the weather in time of the scheduled departure and scheduled arrival.
- statistics (avg, mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics
- airports encoding: we need to think about what to do with the airports and other categorical variables
- time of the day: the delay probably depends on the airport traffic which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.

In [None]:
# Features for selection for second model to follow.

# Flight month, carrier ID, hour, (?weather?), state, distance and/or flight duration
# origin and destination (probably specific airport)

# Maybe feature to create stating whether or not the flight has passengers, mail, freight, or a combination of the three.
# Could specify if a score: 1/3 = 1, 2/3 = 2, etc.
# Also check to see how full the flight is: num of passengers / seats

# maybe if there is time we could set up a binary model for if the airport is in top 10 busiest airports,
# and the 8 states that make up 50% of the air traffic.

In [28]:
flight_test_df

Unnamed: 0,fl_date,mkt_unique_carrier,mkt_carrier_fl_num,op_unique_carrier,tail_num,origin_airport_id,origin,origin_city_name,dest_airport_id,dest,dest_city_name,crs_dep_time,crs_arr_time,crs_elapsed_time,distance
0,2020-01-30,WN,2193,WN,N8642E,12191,HOU,"Houston, TX",13232,MDW,"Chicago, IL",1010,1235,145,937
1,2020-01-26,WN,3352,WN,N796SW,12992,LIT,"Little Rock, AR",15016,STL,"St. Louis, MO",1705,1810,65,296
2,2020-01-17,AS,365,AS,N584AS,14771,SFO,"San Francisco, CA",14057,PDX,"Portland, OR",2155,2344,109,550
3,2020-01-31,AA,3156,OO,N776SK,13930,ORD,"Chicago, IL",10372,ASE,"Aspen, CO",950,1214,204,1013
4,2020-01-25,WN,3762,WN,N8679A,14747,SEA,"Seattle, WA",13796,OAK,"Oakland, CA",1450,1655,125,672
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,2020-01-14,AA,5338,OH,N562NN,11057,CLT,"Charlotte, NC",11986,GRR,"Grand Rapids, MI",1622,1830,128,583
99996,2020-01-30,WN,2053,WN,N942WN,11292,DEN,"Denver, CO",15016,STL,"St. Louis, MO",2000,2255,115,770
99997,2020-01-06,WN,5757,WN,N923WN,14492,RDU,"Raleigh/Durham, NC",13204,MCO,"Orlando, FL",1730,1915,105,534
99998,2020-01-16,AA,1704,AA,N651AW,15304,TPA,"Tampa, FL",11057,CLT,"Charlotte, NC",813,1007,114,507


In [25]:
passenger_df

Unnamed: 0,departures_scheduled,departures_performed,payload,seats,passengers,freight,mail,distance,air_time,unique_carrier,airline_id,unique_carrier_name,region,carrier_name,origin_airport_id,origin_city_name,origin_country_name,dest_airport_id,dest_city_name,dest_country_name,aircraft_type,year,month,class
0,1.0,1.0,31200.0,156.0,133.0,0.0,0.0,1141.0,150.0,G4,20368,Allegiant Air,D,Allegiant Air,10408,"Appleton, WI",United States,14761,"Sanford, FL",United States,698,2018,11,F
1,16.0,16.0,736000.0,3680.0,2864.0,0.0,0.0,957.0,2111.0,F9,20436,Frontier Airlines Inc.,D,Frontier Airlines Inc.,13204,"Orlando, FL",United States,11433,"Detroit, MI",United States,699,2016,6,F
2,0.0,57.0,76950.0,342.0,157.0,4701.0,7618.0,24.0,717.0,H6,20336,Hageland Aviation Service,D,Hageland Aviation Service,10551,"Bethel, AK",United States,12831,"Kasigluk, AK",United States,35,2017,4,F
3,0.0,1.0,31841.0,58.0,43.0,0.0,0.0,440.0,0.0,AC,19531,Air Canada,I,Air Canada,11433,"Detroit, MI",United States,16149,"Ottawa, Canada",Canada,698,2018,3,L
4,4.0,3.0,57276.0,210.0,197.0,0.0,0.0,284.0,152.0,S5,20448,Shuttle America Corp.,D,Shuttle America Corp.,11618,"Newark, NJ",United States,13931,"Norfolk, VA",United States,677,2016,10,F
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,27.0,0.0,0.0,0.0,0.0,0.0,0.0,6745.0,0.0,DL,19790,Delta Air Lines Inc.,P,Delta Air Lines Inc.,13744,"Tokyo, Japan",Japan,12478,"New York, NY",United States,650,2015,10,F
99996,163.0,154.0,2075500.0,7700.0,5343.0,0.0,0.0,383.0,10100.0,EV,20366,ExpressJet Airlines LLC,D,ExpressJet Airlines Inc.,10781,"Baton Rouge, LA",United States,11298,"Dallas/Fort Worth, TX",United States,629,2015,3,F
99997,0.0,1.0,3400.0,9.0,0.0,0.0,0.0,67.0,28.0,H6,20336,Hageland Aviation Service,D,Hageland Aviation Service,11535,"Elim, AK",United States,15478,"Unalakleet, AK",United States,416,2018,9,F
99998,28.0,27.0,3294000.0,6533.0,4830.0,654319.0,0.0,3434.0,12215.0,AA,19805,American Airlines Inc.,A,American Airlines Inc.,13165,"Manchester, United Kingdom",United Kingdom,14100,"Philadelphia, PA",United States,696,2019,3,F


In [29]:
fuel_consumption_df

Unnamed: 0,month,airline_id,carrier,carrier_name,carrier_group_new,total_gallons,total_cost,year
0,9,21629.0,KD,Western Global,1,2368002.0,4548958.0,2017
1,6,20422.0,SY,Sun Country Airlines d/b/a MN Airlines,2,5064157.0,13353711.0,2018
2,5,19930.0,AS,Alaska Airlines Inc.,3,63119777.0,150860233.0,2018
3,4,19930.0,AS,Alaska Airlines Inc.,3,59726464.0,137208447.0,2019
4,6,20447.0,U7,USA Jet Airlines Inc.,2,463401.0,921016.0,2015
...,...,...,...,...,...,...,...,...
3023,5,20416.0,NK,Spirit Air Lines,3,42123364.0,94757645.0,2019
3024,8,20447.0,U7,USA Jet Airlines Inc.,2,573486.0,863326.0,2016
3025,1,20149.0,PRQ,Florida West Airlines Inc.,1,0.0,0.0,2017
3026,10,20398.0,MQ,Envoy Air,3,0.0,0.0,2017


In [1]:
# Features for selection for second model to follow.

In [None]:
# Flight month, carrier ID, hour, (?weather?), state, distance and/or flight duration
# origin and destination (probably specific airport)

# Maybe feature to create stating whether or not the flight has passengers, mail, freight, or a combination of the three.
# Could specify if a score: 1/3 = 1, 2/3 = 2, etc.
# Also check to see how full the flight is: num of passengers / seats

# maybe if there is time we could set up a binary model for if the airport is in top 10 busiest airports,
# and the 8 states that make up 50% of the air traffic.

### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one will be the best for our problems.

- Original Features vs. PCA conponents?

In [1]:
# PCA is not suitable to be done as it only utilizes continuous variables. There are only two continuous variables in our models: 'crs_elapsed_time' and 'distance'.
# Therefore our original features are the best for our problems as they contain a lot more categorical variables that contributes more to our model.  

### Modeling

Use different ML techniques to predict each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- The ensemble of your own choice

In [21]:
flight_model_df = flight_df[['dest','month','arr_delay','distance','state/country']]
cat_col = ['dest','month','state/country']
num_col = ['distance']
dummies = pd.get_dummies(flight_model_df,columns = cat_col)
dummies.dropna(inplace=True)
y_target = dummies[['arr_delay']].copy()
dummies.drop('arr_delay',axis=1,inplace=True)

In [22]:
x_train,x_test,y_train,y_test = train_test_split(dummies, y_target ,test_size=0.3)
x_train_scaled = scaler.fit_transform(x_train[num_col])
x_test_scaled = scaler.fit_transform(x_test[num_col])
x_train[num_col] = x_train_scaled
x_test[num_col] = x_test_scaled

In [23]:
clf = LinearRegression()
clf.fit(x_train,y_train)
y_pred = clf.predict(x_test)
r2_score(y_test,y_pred)

-3.3937437933993906e+19

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. Sample submission can be found in the file **_sample_submission.csv_**

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. We need to do additional transformations because these variables are not binary but continuos. For each flight that was delayed, we need to have one of these variables as 1 and others 0.

It can happen that we have two types of delays with more than 0 minutes. In this case, take the bigger one as 1 and others as 0.

### Binary Classification

The target variable is **CANCELLED**. The main problem here is going to be huge class imbalance. We have only very little cancelled flights with comparison to all flights. It is important to do the right sampling before training and to choose correct evaluation metrics.