# Setting up a baseline model for benchmarking
The overall architecture uses two intermediate models to predict the departure and arrival delays of each consecutive flight. These predicted delays are then added to the scheduled times: `next_landing_time = last_landing_time + onblock_sched + dep_delay + offblock_sched + arr_delay`
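The formula above can be sketched as a small helper that is applied leg by leg along a connection chain. This is only an illustration: it assumes every term is a duration in minutes and that the chain is given as a list of dictionaries. The function names `next_landing_time` and `chain_landing_times` and the example values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def next_landing_time(last_landing_time, onblock_sched, dep_delay,
                      offblock_sched, arr_delay):
    """Apply the chain formula once (all durations assumed to be in minutes)."""
    total_minutes = onblock_sched + dep_delay + offblock_sched + arr_delay
    return last_landing_time + pd.Timedelta(minutes=total_minutes)

def chain_landing_times(first_landing_time, legs):
    """Propagate landing times through consecutive legs of a chain."""
    times = []
    current = first_landing_time
    for leg in legs:
        current = next_landing_time(current, **leg)
        times.append(current)
    return times

# Made-up example chain with two legs
start = pd.Timestamp("2024-01-01 08:00")
legs = [
    {"onblock_sched": 45, "dep_delay": 10, "offblock_sched": 90, "arr_delay": 5},
    {"onblock_sched": 40, "dep_delay": 0, "offblock_sched": 120, "arr_delay": -5},
]
landing_times = chain_landing_times(start, legs)
print(landing_times)
```

Because each landing time feeds into the next, errors in the predicted delays accumulate along the chain, which is one reason to benchmark the intermediate models separately first.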

Two simple baseline models will be established to predict the consecutive landing times of a given connection chain:
1. Predicting each delay as the mean of the observed delays
2. Random forest regression

In [49]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings
import random

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor


In [50]:
# Adjust settings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
%matplotlib inline

In [51]:
# Load preprocessed (one-hot encoded) dataset
data_path = '../data/processed/'
df_one_hot = pd.read_pickle(os.path.join(data_path, 'final_one_hot.pkl'))

In [53]:
# Generate the train/test split for the arrival-delay model
X_train_arr_delay, X_test_arr_delay, y_train_arr_delay, y_test_arr_delay = train_test_split(
    df_one_hot.drop(['arr_delay'], axis=1),
    df_one_hot['arr_delay'],
    test_size=0.33,
    random_state=42,
)
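Since the architecture relies on two intermediate models, the same split can be produced for each target with a small helper. The sketch below uses a synthetic toy frame so that it runs on its own. The helper name `split_for_target`, the feature columns, and the assumption that the departure target is stored in a column called `dep_delay` are illustrative and not taken from the original dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed frame (columns are assumptions)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.integers(0, 2, size=100),
    "dep_delay": rng.normal(10, 5, size=100),
    "arr_delay": rng.normal(15, 8, size=100),
})

def split_for_target(frame, target, test_size=0.33, random_state=42):
    """Drop the target from the features and return a train/test split."""
    X = frame.drop(columns=[target])
    y = frame[target]
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

# One split per intermediate model, keyed by target column
splits = {t: split_for_target(toy, t) for t in ("dep_delay", "arr_delay")}
X_train_dep, X_test_dep, y_train_dep, y_test_dep = splits["dep_delay"]
X_train_arr, X_test_arr, y_train_arr, y_test_arr = splits["arr_delay"]
```

Note that this helper keeps the other delay column among the features, as the split above does. Whether that is appropriate depends on whether the value would be known at prediction time.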

In [54]:
# Implement first baseline model as mean of delays
dummy_regr_off = DummyRegressor(strategy='mean')
dummy_regr_off.fit(X_train_arr_delay, y_train_arr_delay)
dummy_regr_off.predict(X_test_arr_delay)

array([18.43333333, 18.43333333, 18.43333333, ..., 18.43333333,
       18.43333333, 18.43333333])

In [55]:
r2_arr_delay = dummy_regr_off.score(X_test_arr_delay, y_test_arr_delay)
rmse_arr_delay = np.sqrt(mean_squared_error(y_test_arr_delay, dummy_regr_off.predict(X_test_arr_delay)))
print('The r^2 for arrival delay is ' + str(round(r2_arr_delay, 4)))
print('The RMSE for arrival delay is ' + str(round(rmse_arr_delay, 2)) + ' minutes.')
print('\n')

The r^2 for arrival delay is -0.0
The RMSE for arrival delay is 20.76 minutes.
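An r^2 of roughly zero is what a mean predictor should give: it explains none of the variance, and its test MSE equals the variance of the test targets plus the squared gap between the train and test means. The sketch below checks this on synthetic data. The exponential distribution and the sample sizes are arbitrary assumptions for illustration, not properties of the flight dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic "delays" (assumed distribution, illustration only)
rng = np.random.default_rng(0)
y_train = rng.exponential(scale=18.0, size=5000)
y_test = rng.exponential(scale=18.0, size=2000)

# DummyRegressor ignores the features, so placeholder columns suffice
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
pred = baseline.predict(X_test)

r2 = r2_score(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))

# Decomposition of the baseline's test MSE
mean_gap = y_test.mean() - y_train.mean()
expected_mse = np.var(y_test) + mean_gap ** 2

print("r^2:", round(r2, 4))
print("RMSE:", round(rmse, 2))
print("sqrt(var + gap^2):", round(np.sqrt(expected_mse), 2))
```

Any later model, such as the random forest, should therefore be judged by how far it pushes r^2 above zero and the RMSE below the spread of the test delays.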


