In this notebook I tackle the "Flight Delays" Kaggle challenge:
https://www.kaggle.com/c/flight-delays-fall-2018

To start, we follow
https://www.kaggle.com/philippr/mlcourse-ai-assignment-3-starter/edit

In [1]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
import os


In [2]:
train_df = pd.read_csv('H:/Python/mlcourse.AI/flight-delays-fall-2018/flight_delays_train.csv')
test_df = pd.read_csv('H:/Python/mlcourse.AI/flight-delays-fall-2018/flight_delays_test.csv')

In [3]:
train_df.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,dep_delayed_15min
0,c-8,c-21,c-7,1934,AA,ATL,DFW,732,N
1,c-4,c-20,c-3,1548,US,PIT,MCO,834,N
2,c-9,c-2,c-5,1422,XE,RDU,CLE,416,N
3,c-11,c-25,c-6,1015,OO,DEN,MEM,872,N
4,c-10,c-7,c-6,1828,WN,MDW,OMA,423,Y


In [4]:
test_df.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance
0,c-7,c-25,c-3,615,YV,MRY,PHX,598
1,c-4,c-17,c-2,739,WN,LAS,HOU,1235
2,c-12,c-2,c-7,651,MQ,GSP,ORD,577
3,c-3,c-25,c-7,1614,WN,BWI,MHT,377
4,c-6,c-6,c-3,1505,UA,ORD,STL,258


I continue to follow Yury's notebook, but I don't restrict myself to two features. We therefore have to turn some of the existing features into more usable ones, starting with the day codes.

 <font color="blue"> Should take cos/sin for DepTime </font> 

In [5]:
train_df['DayofMonth'] = train_df['DayofMonth'].map({'c-'+str(i):i for i in range(32)})
train_df['Month'] = train_df['Month'].map({'c-'+str(i):i for i in range(13)})
train_df['DayOfWeek'] = train_df['DayOfWeek'].map({'c-'+str(i):i for i in range(8)})
test_df['DayofMonth'] = test_df['DayofMonth'].map({'c-'+str(i):i for i in range(32)})
test_df['Month'] = test_df['Month'].map({'c-'+str(i):i for i in range(13)})
test_df['DayOfWeek'] = test_df['DayOfWeek'].map({'c-'+str(i):i for i in range(8)})
#train_df['DepTime'] = train_df['DepTime'].apply(lambda x: np.cos(2*np.pi*x/2400))
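
The sine/cosine idea for DepTime could look roughly like the sketch below. It assumes DepTime is an HHMM integer (as the sample rows suggest) and converts it to minutes first, so that dividing by the length of a day gives a proper angle. The helper name `add_cyclic_time` and the `_sin`/`_cos` column names are made up for illustration.

```python
import numpy as np
import pandas as pd


def add_cyclic_time(df, col='DepTime'):
    """Return a copy of df with sine/cosine encodings of an HHMM column."""
    out = df.copy()
    # Convert HHMM (e.g. 1934) to minutes since midnight (19 * 60 + 34)
    minutes = (out[col] // 100) * 60 + out[col] % 100
    # Wrap onto one day so values past midnight stay on the circle
    angle = 2 * np.pi * (minutes % 1440) / 1440
    out[col + '_sin'] = np.sin(angle)
    out[col + '_cos'] = np.cos(angle)
    return out


# Toy frame, not the competition data
demo = add_cyclic_time(pd.DataFrame({'DepTime': [0, 600, 1200, 1934]}))
```

Using both the sine and the cosine keeps 23:59 and 00:01 close together, which a raw DepTime value does not.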

In [6]:
train_df.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,dep_delayed_15min
0,8,21,7,1934,AA,ATL,DFW,732,N
1,4,20,3,1548,US,PIT,MCO,834,N
2,9,2,5,1422,XE,RDU,CLE,416,N
3,11,25,6,1015,OO,DEN,MEM,872,N
4,10,7,6,1828,WN,MDW,OMA,423,Y


Let's also encode the origin and destination. We combine them into the actual route, which should make a more powerful feature.

In [7]:
train_df['Route'] = train_df['Origin'] + '-' + train_df['Dest']
test_df['Route'] = test_df['Origin'] + '-' + test_df['Dest']

We'll replace rarely used routes with a single 'RARE' label to keep the number of features more manageable.

In [8]:
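# True for routes that appear fewer than 10 times in the training set
cutroutes = train_df['Route'].value_counts() < 10
criterion = train_df['Route'].map(lambda x: cutroutes[x])
# .loc avoids chained assignment, which may silently modify a copy
train_df.loc[criterion, 'Route'] = 'RARE'
train_df['Route'].nunique()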

2934
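
The cell above only relabels the training routes. A possible variant, sketched here on toy data, applies the same training-set counts to both frames so that test routes outside the frequent set also land in 'RARE'. The helper `bucket_rare` and its parameters are hypothetical names, not part of the original notebook.

```python
import pandas as pd


def bucket_rare(train_col, test_col, min_count=10, label='RARE'):
    """Replace categories seen fewer than min_count times in train_col."""
    counts = train_col.value_counts()
    frequent = set(counts[counts >= min_count].index)
    # Series.where keeps values where the condition holds, else uses label
    train_out = train_col.where(train_col.isin(frequent), label)
    test_out = test_col.where(test_col.isin(frequent), label)
    return train_out, test_out


# Toy routes, not the competition data
train_routes = pd.Series(['ATL-DFW'] * 3 + ['PIT-MCO'])
test_routes = pd.Series(['ATL-DFW', 'PIT-MCO', 'RDU-CLE'])
train_b, test_b = bucket_rare(train_routes, test_routes, min_count=2)
```

Deciding the frequent set from the training data alone means the test set never influences which categories exist.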

The route labels now need to be one-hot encoded.

<font color="blue">We can also just use origins and destinations, as I did in an older version </font>

In [9]:
train_df = pd.concat([
        train_df.drop(["Route"], axis=1), 
        pd.get_dummies(train_df[["Route"]], prefix=['route'])
    ], axis=1)
test_df = pd.concat([
        test_df.drop(["Route"], axis=1), 
        pd.get_dummies(test_df[["Route"]], prefix=['route'])
    ], axis=1)

Remove unencoded features and only train the model on those that also exist in the test set.

<font color='blue'>I should probably check for myself whether I can find routes that are often delayed.</font>

In [10]:
trainfts = list(train_df.columns)
trainfts.remove('UniqueCarrier')
trainfts.remove('Origin')
trainfts.remove('Dest')

testfts = list(test_df.columns)

feats = list(set(trainfts).intersection(set(testfts)))
len(feats)

2891
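
As an alternative to intersecting column lists by hand, pandas can align the two one-hot frames directly. This is a minimal sketch on toy data, assuming the dummy columns are built the same way as above; `train_demo` and `test_demo` are illustrative names.

```python
import pandas as pd

# Toy frames, not the competition data
train_demo = pd.DataFrame({'Route': ['ATL-DFW', 'PIT-MCO', 'RARE']})
test_demo = pd.DataFrame({'Route': ['ATL-DFW', 'RDU-CLE']})

train_d = pd.get_dummies(train_demo[['Route']], prefix=['route'])
test_d = pd.get_dummies(test_demo[['Route']], prefix=['route'])

# join='inner' on axis=1 keeps only the columns present in both frames
train_d, test_d = train_d.align(test_d, join='inner', axis=1)
```

After the call both frames have the same columns in the same order, which is what the model needs at prediction time.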

In [11]:
X_train = train_df[feats].values
# Delayed flights ('Y') are the positive class
y_train = train_df['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values
X_test = test_df[feats].values

X_train_part, X_valid, y_train_part, y_valid = \
    train_test_split(X_train, y_train, 
                     test_size=0.3, random_state=17)

We keep Yury's pipeline and then set up a grid search with cross-validation to optimise C, the inverse regularisation strength of the logistic regression.

In [12]:
logit_pipe = Pipeline([('scaler', StandardScaler()),
                       ('logit', LogisticRegression(random_state=17, solver='liblinear'))])

In [13]:
logit_pipe_params = {'logit__C': np.logspace(-5, -3, 10)}
skf = StratifiedKFold(n_splits=5, shuffle=True)

logrid = GridSearchCV(
    logit_pipe,
    logit_pipe_params,
    cv=skf,
    scoring='roc_auc',
    return_train_score=True,
    refit=True)
logrid.fit(X_train, y_train)
logrid.best_score_, logrid.best_params_

(0.6716339870691026, {'logit__C': 0.00035938136638046257})
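
To judge whether the grid is wide enough, it can help to look at the full `cv_results_` table and check whether the best C sits on the edge of the grid. The sketch below does this on synthetic data from `make_classification`, so the scores and the grid range are not those of the flight data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=17)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('logit', LogisticRegression(random_state=17,
                                              solver='liblinear'))])
c_grid = np.logspace(-3, 1, 5)
grid = GridSearchCV(pipe, {'logit__C': c_grid},
                    cv=StratifiedKFold(n_splits=5, shuffle=True,
                                       random_state=17),
                    scoring='roc_auc', return_train_score=True)
grid.fit(X_demo, y_demo)

# One row per candidate C with mean train and validation AUC
results = pd.DataFrame(grid.cv_results_)[
    ['param_logit__C', 'mean_train_score', 'mean_test_score']]

# If the best C is the smallest or largest candidate, widen the grid
best_c = grid.best_params_['logit__C']
at_edge = best_c in (c_grid[0], c_grid[-1])
```

A large gap between `mean_train_score` and `mean_test_score` for a candidate hints at overfitting at that value of C.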

In [14]:
lobest = logrid.best_estimator_

Let's try the same approach with a random forest.

<font color='red'>I think that at least scikit-learn's random forests don't work well with categorical variables.</font>
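
One option that might suit tree models better than thousands of one-hot columns is frequency encoding, where each route is replaced by how often it occurs in the training set. This is only a sketch of that swapped-in technique on toy data, not something the notebook uses; `frequency_encode` is a hypothetical helper.

```python
import pandas as pd


def frequency_encode(train_col, test_col):
    """Map each category to its relative frequency in train_col."""
    freq = train_col.value_counts(normalize=True)
    train_enc = train_col.map(freq)
    # Categories never seen in training get frequency 0
    test_enc = test_col.map(freq).fillna(0.0)
    return train_enc, test_enc


# Toy routes, not the competition data
train_routes = pd.Series(['ATL-DFW', 'ATL-DFW', 'ATL-DFW', 'PIT-MCO'])
test_routes = pd.Series(['PIT-MCO', 'RDU-CLE'])
train_enc, test_enc = frequency_encode(train_routes, test_routes)
```

This yields a single numeric column, so the forest can split on it without needing one feature per route. Whether it actually improves the score here is untested.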

rfc_pipe = Pipeline([('scaler', StandardScaler()),
                       ('rfc', RandomForestClassifier(class_weight='balanced', n_estimators=20))])

max_depth_values = range(10, 12) #The range uses the fact that we know the previous ideal value
max_features_values = [40,50,60]
forest_params = {
    'rfc__max_depth': max_depth_values,
    'rfc__max_features': max_features_values
}
skf = StratifiedKFold(n_splits=5, shuffle=True)
rfcgrid = GridSearchCV(
    rfc_pipe,
    forest_params,
    cv=skf,
    scoring='roc_auc',
    return_train_score=True,
    refit=True
)
rfcgrid.fit(X_train,y_train)
rfcgrid.best_score_, rfcgrid.best_params_

rfcbest = rfcgrid.best_estimator_

Let's write submission files with these models.

In [17]:
PATH_TO_DATA = 'H:/Python/mlcourse.AI/flight-delays-fall-2018'
def write_submission_file(prediction, filename,
                          path_to_sample=os.path.join(PATH_TO_DATA, 
                                                      'sample_submission.csv')):
    submission = pd.read_csv(path_to_sample, index_col='id')
    
    submission['dep_delayed_15min'] = prediction
    submission.to_csv(filename)

# predict_proba returns one column per class; keep the column for class 1
rfcpred = rfcbest.predict_proba(X_test)[:, 1]
write_submission_file(rfcpred, 'RFC4.csv')

In [18]:
# predict_proba returns one column per class; keep the column for class 1
logpred = lobest.predict_proba(X_test)[:, 1]
write_submission_file(logpred, 'Log5.csv')
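
Since `predict_proba` returns one column per class, a single-column submission needs only the column of the positive class. The sketch below, on synthetic data, looks that column up through `classes_` instead of hard-coding its position. It assumes the positive class is labelled 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real data
X_demo, y_demo = make_classification(n_samples=200, n_features=5,
                                     random_state=17)
clf = LogisticRegression(solver='liblinear', random_state=17)
clf.fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)       # shape (n_samples, n_classes)
pos_col = list(clf.classes_).index(1)   # column of the class labelled 1
pos_proba = proba[:, pos_col]           # one probability per row
```

The columns follow the order of `classes_`, so for labels 0 and 1 the positive class is column 1.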

Old stuff