<font size="6"><b>XGBoost - without DEP_DELAY</b></font>

![Figure_8](img/Figure_8.png)

In [3]:
import pandas as pd
import numpy as np
np.random.seed(0)
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import xgboost as xgb
sns.set_style('darkgrid')
pd.set_option('display.max_columns', None)
import datetime, warnings, scipy
warnings.filterwarnings("ignore")

from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn import svm
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

This is the second version of the XGBoost model, in which I drop DEP_DELAY so that the algorithm gets no direct indication of a possible delay, and see how it behaves. I expect a significant drop in accuracy, but the result will be more realistic and useful from my point of view.

I think this is a good time to mention a factor that I should have considered while cleaning the data, and that could have affected the decisions I am making now about DEP_DELAY and the ones already made about ARR_DELAY. According to the OAG, a flight is not considered delayed, at either the departure or the arrival city, if the delay is 15 minutes or less. That could have affected my EDA as well as the choice of variables for the models, so it will definitely be a suggestion for future work. It would also have reduced the size of the dataframe and therefore the time needed to run the models, which, as you can see at the bottom of each fitting cell, has been considerable, with some runs taking up to 9 hours.

You can look at the source of the information <a href="https://www.oag.com/airline-on-time-performance-defining-late">here</a>
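A minimal sketch of the 15-minute rule described above, assuming a raw frame with an ARR_DELAY column in minutes. The helper name `label_delayed`, the column name `FLIGHT_STATUS_15` and the toy values are hypothetical and are not part of the original cleaning code.

```python
import pandas as pd

# Threshold taken from the OAG definition cited above: a flight counts as
# delayed only when it is MORE than 15 minutes late.
DELAY_THRESHOLD_MIN = 15


def label_delayed(delay_minutes: pd.Series, threshold: float = DELAY_THRESHOLD_MIN) -> pd.Series:
    """Return 1 where the delay exceeds the threshold, 0 otherwise."""
    return (delay_minutes > threshold).astype(int)


# Toy data (hypothetical values) chosen to show the behaviour at the boundary
toy = pd.DataFrame({"ARR_DELAY": [-5.0, 0.0, 15.0, 16.0, 42.0]})
toy["FLIGHT_STATUS_15"] = label_delayed(toy["ARR_DELAY"])
print(toy)
```

With this rule a flight that arrives exactly 15 minutes late is still labelled on time, which is the boundary case that differs from a plain "any positive delay" label.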

In [4]:
dfm_ready = pd.read_csv('dfm_ready.csv', index_col=0)
dfm_ready.head()

Unnamed: 0,DEP_DELAY,CRS_ELAPSED_TIME,AIR_TIME,DISTANCE,FLIGHT_STATUS,OP_CARRIER_Allegiant Air,OP_CARRIER_American Airlines,OP_CARRIER_Delta Airlines,OP_CARRIER_Endeavor Air,OP_CARRIER_Envoy Air,OP_CARRIER_ExpressJet,OP_CARRIER_Frontier Airlines,OP_CARRIER_Hawaiian Airlines,OP_CARRIER_JetBlue Airways,OP_CARRIER_Mesa Airline,OP_CARRIER_PSA Airlines,OP_CARRIER_Republic Airways,OP_CARRIER_SkyWest Airlines,OP_CARRIER_Southwest Airlines,OP_CARRIER_Spirit Airlines,OP_CARRIER_United Airlines,OP_CARRIER_Virgin America,DEST_Atlanta,DEST_Boston,DEST_Charlotte,DEST_Chicago,DEST_Dallas-Fort Worth,DEST_Denver,DEST_Detroit,DEST_Houston,DEST_Las Vegas,DEST_Los Angeles,DEST_Minneapolis,DEST_New York,DEST_Newark,DEST_Orlando,DEST_Philadelphia,DEST_Phoenix,DEST_Salt Lake City,DEST_San Francisco,DEST_Seattle,CRS_DEP_TIME_2,CRS_DEP_TIME_3,CRS_DEP_TIME_4,CRS_ARR_TIME_2,CRS_ARR_TIME_3,CRS_ARR_TIME_4,MONTH_2,MONTH_3,MONTH_4,MONTH_5,MONTH_6,MONTH_7,MONTH_8,MONTH_9,MONTH_10,MONTH_11,MONTH_12,WEEKDAY_1,WEEKDAY_2,WEEKDAY_3,WEEKDAY_4,WEEKDAY_5,WEEKDAY_6
0,-5.0,268.0,225.0,1605.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,-8.0,99.0,65.0,414.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,-5.0,134.0,106.0,846.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,6.0,190.0,157.0,1120.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,-3.0,206.0,173.0,1222.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [4]:
dfm_ready.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4008257 entries, 0 to 7213445
Data columns (total 64 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   DEP_DELAY                      float64
 1   CRS_ELAPSED_TIME               float64
 2   AIR_TIME                       float64
 3   DISTANCE                       float64
 4   FLIGHT_STATUS                  int64  
 5   OP_CARRIER_Allegiant Air       int64  
 6   OP_CARRIER_American Airlines   int64  
 7   OP_CARRIER_Delta Airlines      int64  
 8   OP_CARRIER_Endeavor Air        int64  
 9   OP_CARRIER_Envoy Air           int64  
 10  OP_CARRIER_ExpressJet          int64  
 11  OP_CARRIER_Frontier Airlines   int64  
 12  OP_CARRIER_Hawaiian Airlines   int64  
 13  OP_CARRIER_JetBlue Airways     int64  
 14  OP_CARRIER_Mesa Airline        int64  
 15  OP_CARRIER_PSA Airlines        int64  
 16  OP_CARRIER_Republic Airways    int64  
 17  OP_CARRIER_SkyWest Airlines    int64  
 18  OP

# XGBoost

In [5]:
# Define features (X) and target (y)
y = dfm_ready['FLIGHT_STATUS']
X = dfm_ready.drop(['FLIGHT_STATUS', 'DEP_DELAY'], axis = 1)

In [6]:
scaler = StandardScaler()
scaled_df = scaler.fit_transform(X)

In [7]:
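# Perform the dataset split
# train_test_split returns (X_train, X_test, y_train, y_test), in that order.
# The outputs recorded below were produced with the train and test names swapped,
# which is why the evaluation set there has 3,006,192 rows (75% of the data).
X_train, X_test, y_train, y_test = train_test_split(scaled_df, y, test_size=0.25, random_state=42)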

In [10]:
# Fitting the model and calculating the training and test (val) accuracies

clf = xgb.XGBClassifier()
clf.fit(X_train, y_train)
training_preds = clf.predict(X_train)
val_preds = clf.predict(X_test)
training_accuracy = accuracy_score(y_train, training_preds)
val_accuracy = accuracy_score(y_test, val_preds)

print("Training Accuracy: {:.4}%".format(training_accuracy * 100))
print("Validation accuracy: {:.4}%".format(val_accuracy * 100))

Training Accuracy: 66.67%
Validation accuracy: 66.64%


As expected, this model shows a drop of almost 20% compared with the previous one, where DEP_DELAY was kept. Still, almost 67% is not that bad, because the prediction is available before you board the plane and even before a possible delay is announced on the departure boards/screens.

The fact that the test (validation) accuracy is so close to the training accuracy suggests that the model is not overfitting. If there were a large difference between them, with the test accuracy considerably lower, that would suggest the model had overfit the training data.

Still, it is always worth trying to tune the model to see if the accuracy improves. As I said, 66.64% is not bad, but I believe there is room for improvement, and I would like to aim for at least 70%. I have already created a parameters dictionary and run it, so unless I run further tests with a new dictionary, which will depend on the time available, I will keep the previous values.

# Tuning XGBoost

In [14]:
param_grid = {
    "learning_rate": [0.1],
    'max_depth': [6],
    'min_child_weight': [10],
    'subsample': [ 0.7],
    'n_estimators': [100],
}

In [None]:
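grid_clf = GridSearchCV(clf, param_grid, scoring='accuracy', cv=None, n_jobs=1)
# Fit on the training split only, so the validation accuracy below is measured on rows the search has not seen
grid_clf.fit(X_train, y_train)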

best_parameters = grid_clf.best_params_

print("Grid Search found the following optimal parameters: ")
for param_name in sorted(best_parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))

training_preds = grid_clf.predict(X_train)
val_preds = grid_clf.predict(X_test)
training_accuracy = accuracy_score(y_train, training_preds)
val_accuracy = accuracy_score(y_test, val_preds)

print("")
print("Training Accuracy: {:.4}%".format(training_accuracy * 100))
print("Validation accuracy: {:.4}%".format(val_accuracy * 100))

OK, so this is more like what I was looking for: getting closer to 70% accuracy. Once I'm done with the rest of the models and have checked off everything I had planned to do, I will come back and create a different dictionary to see whether any additional tuning gets me to my goal of 70%. For the time being, let's say this is my best model so far.

In [28]:
model_grid_clf = xgb.XGBClassifier(learning_rate = 0.1, 
                                   max_depth = 6, 
                                   min_child_weight = 10,
                                   subsample = 0.7, 
                                   n_estimators = 100)
model_grid_clf.fit(X_train, y_train)

XGBClassifier(max_depth=6, min_child_weight=10, subsample=0.7)

In [30]:
print("Accuracy on training set: {:.2f}".format(model_grid_clf.score(X_train, y_train) * 100))
print("Accuracy on validation set: {:.2f}".format(model_grid_clf.score(X_test, y_test) * 100))

Accuracy on training set: 69.547
Accuracy on validation set: 69.374


In [52]:
model_grid_clf1_predict = model_grid_clf.predict(X_test)
model_grid_clf1_predict

array([0, 0, 0, ..., 0, 0, 0])

In [53]:
print(confusion_matrix(y_test, model_grid_clf1_predict))
print(classification_report(y_test, model_grid_clf1_predict))

[[1766883  122525]
 [ 798163  318621]]
              precision    recall  f1-score   support

           0       0.69      0.94      0.79   1889408
           1       0.72      0.29      0.41   1116784

    accuracy                           0.69   3006192
   macro avg       0.71      0.61      0.60   3006192
weighted avg       0.70      0.69      0.65   3006192



# XGBoost for Imbalance Classification

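Let's see whether the model improves with the addition of scale_pos_weight, which weights the positive (delayed) class, and small modifications to the parameters. The usual starting value is the ratio of negative to positive examples: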

scale_pos_weight = total_negative_examples/total_positive_examples
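As a quick check of that ratio, here is a minimal sketch that computes it from class counts. The counts are the ones shown in the support column of the classification report above; the helper name `compute_scale_pos_weight` and the toy label array are hypothetical.

```python
import numpy as np


def compute_scale_pos_weight(y) -> float:
    """Ratio of negative (0) to positive (1) labels."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("no positive examples, the ratio is undefined")
    return n_neg / n_pos


# Class counts from the support column of the classification report above
n_neg, n_pos = 1_889_408, 1_116_784
ratio = n_neg / n_pos
print(round(ratio, 2))  # 1.69

# The same calculation on a toy label array (6 negatives, 2 positives)
y_toy = np.array([0, 0, 0, 1, 0, 1, 0, 0])
print(compute_scale_pos_weight(y_toy))  # 3.0
```

In practice the ratio would normally be computed on the training labels only, so that the weight does not depend on the evaluation set.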

In [38]:
model_imb = xgb.XGBClassifier(learning_rate = 0.1, 
                                   max_depth = 6, 
                                   min_child_weight = 10,
                                   subsample = 0.7, 
                                   n_estimators = 100, 
                                   scale_pos_weight=1.69)
model_imb.fit(X_train, y_train)

XGBClassifier(max_depth=6, min_child_weight=10, scale_pos_weight=1.69,
              subsample=0.7)

In [39]:
print("Accuracy on training set: {:.2f}".format(model_imb.score(X_train, y_train) * 100))
print("Accuracy on validation set: {:.2f}".format(model_imb.score(X_test, y_test) * 100))

Accuracy on training set: 66.72
Accuracy on validation set: 66.30


In [40]:
xgb_predict = model_imb.predict(X_test)
xgb_predict

array([1, 0, 0, ..., 0, 1, 1])

In [41]:
print(confusion_matrix(y_test,xgb_predict))
print(classification_report(y_test,xgb_predict))

[[1317324  572084]
 [ 440873  675911]]
              precision    recall  f1-score   support

           0       0.75      0.70      0.72   1889408
           1       0.54      0.61      0.57   1116784

    accuracy                           0.66   3006192
   macro avg       0.65      0.65      0.65   3006192
weighted avg       0.67      0.66      0.67   3006192



Those modifications don't seem to have worked; on the contrary, they made the model less accurate. Validation accuracy fell from 69.37% to 66.30%, although recall for the delayed class rose from 0.29 to 0.61. I'm going to go back to the original model, but now try different values for scale_pos_weight and see if the model improves.

<b>scale_pos_weight = 2</b>

In [33]:
model_grid_clf1 = xgb.XGBClassifier(learning_rate = 0.1, 
                                   max_depth = 6, 
                                   min_child_weight = 10,
                                   subsample = 0.7, 
                                   n_estimators = 100, 
                                   scale_pos_weight=2)
model_grid_clf1.fit(X_train, y_train)

XGBClassifier(max_depth=6, min_child_weight=10, scale_pos_weight=2,
              subsample=0.7)

In [34]:
print("Accuracy on training set: {:.2f}".format(model_grid_clf1.score(X_train, y_train) * 100))
print("Accuracy on validation set: {:.2f}".format(model_grid_clf1.score(X_test, y_test) * 100))

Accuracy on training set: 62.54
Accuracy on validation set: 62.14


Wow, that was definitely a big drop from the original model, which doesn't account for the data imbalance. I'm going to try lowering scale_pos_weight from 1.69 to 1.25 and see how that behaves; if it doesn't work, I will stop there and carry on, taking my original model as the best XGBoost.

<b>scale_pos_weight = 1.25 </b>

In [42]:
model_grid_clf2 = xgb.XGBClassifier(learning_rate = 0.1, 
                                   max_depth = 6, 
                                   min_child_weight = 10,
                                   subsample = 0.7, 
                                   n_estimators = 100, 
                                   scale_pos_weight=1.25)
model_grid_clf2.fit(X_train, y_train)

XGBClassifier(max_depth=6, min_child_weight=10, scale_pos_weight=1.25,
              subsample=0.7)

In [43]:
print("Accuracy on training set: {:.2f}".format(model_grid_clf2.score(X_train, y_train) * 100))
print("Accuracy on validation set: {:.2f}".format(model_grid_clf2.score(X_test, y_test) * 100))

Accuracy on training set: 69.76
Accuracy on validation set: 69.54


<b>scale_pos_weight = 1.15</b>

In [44]:
model_grid_clf3 = xgb.XGBClassifier(learning_rate = 0.1, 
                                   max_depth = 6, 
                                   min_child_weight = 10,
                                   subsample = 0.7, 
                                   n_estimators = 100, 
                                   scale_pos_weight=1.15)
model_grid_clf3.fit(X_train, y_train)

XGBClassifier(max_depth=6, min_child_weight=10, scale_pos_weight=1.15,
              subsample=0.7)

In [45]:
print("Accuracy on training set: {:.2f}".format(model_grid_clf3.score(X_train, y_train) * 100))
print("Accuracy on validation set: {:.2f}".format(model_grid_clf3.score(X_test, y_test) * 100))

Accuracy on training set: 69.86
Accuracy on validation set: 69.68


In [50]:
xgb_predict_clf3 = model_grid_clf3.predict(X_test)
xgb_predict_clf3

array([0, 0, 0, ..., 0, 0, 1])

In [51]:
print(confusion_matrix(y_test,xgb_predict_clf3))
print(classification_report(y_test,xgb_predict_clf3))

[[1695821  193587]
 [ 717753  399031]]
              precision    recall  f1-score   support

           0       0.70      0.90      0.79   1889408
           1       0.67      0.36      0.47   1116784

    accuracy                           0.70   3006192
   macro avg       0.69      0.63      0.63   3006192
weighted avg       0.69      0.70      0.67   3006192



<b>The scale_pos_weight of 1.15 gave the best results, with a validation accuracy of 69.68%, very close to the 70% I had in mind, a macro-averaged recall of 63% and a macro-averaged precision of 69%. For the delayed class alone, recall is 36% and precision is 67%. So far this is the best of the models that I have made without DEP_DELAY while accounting for the imbalanced data.</b>
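The figures in this summary can be recomputed from the confusion matrix printed above. The sketch below assumes the scikit-learn layout (rows are true classes, columns are predicted classes); the variable names are my own.

```python
import numpy as np

# Confusion matrix printed above for scale_pos_weight = 1.15
# rows = true class (0, 1), columns = predicted class (0, 1)
cm = np.array([[1695821, 193587],
               [717753, 399031]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()

# Per-class precision and recall
precision_1 = tp / (tp + fp)  # of the flights predicted delayed, share truly delayed
recall_1 = tp / (tp + fn)     # of the truly delayed flights, share caught
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

# Macro averages are the unweighted mean over the two classes
macro_precision = (precision_0 + precision_1) / 2
macro_recall = (recall_0 + recall_1) / 2

print("accuracy        : {:.2f}%".format(accuracy * 100))
print("class 1 P / R   : {:.2f} / {:.2f}".format(precision_1, recall_1))
print("class 0 P / R   : {:.2f} / {:.2f}".format(precision_0, recall_0))
print("macro   P / R   : {:.2f} / {:.2f}".format(macro_precision, macro_recall))
```

The macro averages hide the gap between the classes: the model recovers about 90% of on-time flights but only about 36% of delayed ones.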

<b>scale_pos_weight = 0.9</b>

In [46]:
model_grid_clf4 = xgb.XGBClassifier(learning_rate = 0.1, 
                                   max_depth = 6, 
                                   min_child_weight = 10,
                                   subsample = 0.7, 
                                   n_estimators = 100, 
                                   scale_pos_weight=0.9)
model_grid_clf4.fit(X_train, y_train)

XGBClassifier(max_depth=6, min_child_weight=10, scale_pos_weight=0.9,
              subsample=0.7)

In [47]:
print("Accuracy on training set: {:.2f}".format(model_grid_clf4.score(X_train, y_train) * 100))
print("Accuracy on validation set: {:.2f}".format(model_grid_clf4.score(X_test, y_test) * 100))

Accuracy on training set: 69.13
Accuracy on validation set: 68.98


================================================================

Separate test to try and get best parameters...

In [8]:
param_grid_ltest = {
    "learning_rate": [0.01, 0.1],
    'max_depth': [4, 6],
    'min_child_weight': [10],
    'subsample': [ 0.7],
    'n_estimators': [100, 200],
}
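Given how long each fit takes on this dataset, it helps to count the fits before launching the search. The sketch below is a rough estimate that assumes GridSearchCV uses 5-fold cross-validation when cv=None (the default in scikit-learn 0.22 and later) and refits the best candidate once at the end (refit=True, also the default).

```python
from itertools import product
from math import prod

param_grid_ltest = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [4, 6],
    "min_child_weight": [10],
    "subsample": [0.7],
    "n_estimators": [100, 200],
}

# Number of parameter combinations = product of the list lengths
n_candidates = prod(len(values) for values in param_grid_ltest.values())

cv_folds = 5  # assumption: scikit-learn default when cv=None
n_fits = n_candidates * cv_folds + 1  # +1 for the final refit on all rows passed to fit()

print("candidates:", n_candidates)  # 8
print("total fits:", n_fits)        # 41

# Enumerate the candidates explicitly, e.g. to inspect or log them
keys = list(param_grid_ltest)
candidates = [dict(zip(keys, combo)) for combo in product(*param_grid_ltest.values())]
print(candidates[0])
```

The earlier single-candidate grid needs 6 fits under the same assumptions, so this grid costs roughly seven times as much, before accounting for the slower 200-tree candidates.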

In [None]:
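grid_clf_2 = GridSearchCV(clf, param_grid_ltest, scoring='accuracy', cv=None, n_jobs=1)
# Fit on the training split only, so the validation accuracy below is measured on rows the search has not seen
grid_clf_2.fit(X_train, y_train)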

best_parameters = grid_clf_2.best_params_

print("Grid Search found the following optimal parameters: ")
for param_name in sorted(best_parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))

training_preds = grid_clf_2.predict(X_train)
val_preds = grid_clf_2.predict(X_test)
training_accuracy = accuracy_score(y_train, training_preds)
val_accuracy = accuracy_score(y_test, val_preds)

print("")
print("Training Accuracy: {:.4}%".format(training_accuracy * 100))
print("Validation accuracy: {:.4}%".format(val_accuracy * 100))