

Your task is to beat all benchmarks in this competition. Detailed instructions are not provided here. At this stage of the course, a quick look at the data should be enough to see that this is the kind of task where gradient boosting works well. Most likely that will be LightGBM, but you can try XGBoost or CatBoost as well.

<img src="https://habrastorage.org/webt/fs/42/ms/fs42ms0r7qsoj-da4x7yfntwrbq.jpeg" width=30% />

In [1]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [2]:
train_df = pd.read_csv('../../datasets/flight/flight_delays_train.csv')
test_df = pd.read_csv('../../datasets/flight/flight_delays_test.csv')

In [3]:
train_df.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,dep_delayed_15min
0,c-8,c-21,c-7,1934,AA,ATL,DFW,732,N
1,c-4,c-20,c-3,1548,US,PIT,MCO,834,N
2,c-9,c-2,c-5,1422,XE,RDU,CLE,416,N
3,c-11,c-25,c-6,1015,OO,DEN,MEM,872,N
4,c-10,c-7,c-6,1828,WN,MDW,OMA,423,Y


In [4]:
test_df.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance
0,c-7,c-25,c-3,615,YV,MRY,PHX,598
1,c-4,c-17,c-2,739,WN,LAS,HOU,1235
2,c-12,c-2,c-7,651,MQ,GSP,ORD,577
3,c-3,c-25,c-7,1614,WN,BWI,MHT,377
4,c-6,c-6,c-3,1505,UA,ORD,STL,258


Given the flight's departure time, carrier code, origin airport, destination airport, and flight distance, you have to predict whether the departure will be delayed by more than 15 minutes. As the simplest benchmark, let's take logistic regression and the two features that are easiest to use: DepTime and Distance. This corresponds to the **"simple logit baseline"** on Public LB.
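For reference, here is a minimal sketch of the literal two-feature version of that baseline. It is only an illustration: `toy_df`, `two_feat_pipe` and the other names are hypothetical, and the frame holds just the five rows printed by `train_df.head()` above, so the fitted model says nothing about real performance. The cells that follow in this notebook go one step further and also one-hot encode the categorical columns.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame: the five rows shown by train_df.head() (illustration only).
toy_df = pd.DataFrame({
    'DepTime': [1934, 1548, 1422, 1015, 1828],
    'Distance': [732, 834, 416, 872, 423],
    'dep_delayed_15min': ['N', 'N', 'N', 'N', 'Y'],
})

# Only the two numeric features named in the text.
X_two = toy_df[['Distance', 'DepTime']].values
y_two = toy_df['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values

# Same pipeline shape as the notebook: scaling followed by logistic regression.
two_feat_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logit', LogisticRegression(C=1, random_state=17, solver='liblinear')),
])
two_feat_pipe.fit(X_two, y_two)

# Predicted probability of the positive class (delay of more than 15 minutes).
two_feat_pred = two_feat_pipe.predict_proba(X_two)[:, 1]
```

On the real data you would replace `toy_df` with `train_df` and evaluate on a held-out part, as the cells below do.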

In [5]:
cat_cols = [col for col in train_df.columns
            if col not in ['Distance', 'DepTime', 'dep_delayed_15min']]
cat_cols

['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier', 'Origin', 'Dest']

In [6]:
df = pd.concat([train_df[cat_cols],test_df[cat_cols]],axis = 0)

In [7]:
from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on train and test categories together so both share one encoding
onehot = OneHotEncoder()
onehot.fit(df)

# One-hot encoded categorical columns plus the two numeric features
X_train_onehot = onehot.transform(train_df[cat_cols])
X_train = np.hstack([X_train_onehot.toarray(),
                     train_df[['Distance', 'DepTime']].values])
y_train = train_df['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values

# Build the test matrix from the test encoding
X_test_onehot = onehot.transform(test_df[cat_cols])
X_test = np.hstack([X_test_onehot.toarray(),
                    test_df[['Distance', 'DepTime']].values])

X_train_part, X_valid, y_train_part, y_valid = \
    train_test_split(X_train, y_train, 
                     test_size=0.3, random_state=17)

In [8]:
logit_pipe = Pipeline([('scaler', StandardScaler()),
                       ('logit', LogisticRegression(C=1, random_state=17, solver='liblinear'))])

In [9]:
logit_pipe.fit(X_train_part, y_train_part)
logit_valid_pred = logit_pipe.predict_proba(X_valid)[:, 1]

roc_auc_score(y_valid, logit_valid_pred)

0.6991062491760075

In [None]:
logit_pipe.fit(X_train, y_train)
logit_test_pred = logit_pipe.predict_proba(X_test)[:, 1]

pd.Series(logit_test_pred, 
          name='dep_delayed_15min').to_csv('logit_2feat.csv', 
                                           index_label='id', header=True)

Now you have to beat **"A10 benchmark"** on Public LB. It's not challenging at all. Go for LightGBM, maybe some other models (or ensembling) as well. Include categorical features, do some simple feature engineering as well. Good luck!
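As a starting point for the "simple feature engineering" mentioned above, here is a hedged sketch in pandas. The helper `add_simple_features` and the new column names are hypothetical, the toy frame reuses the five rows printed by `train_df.head()`, and the hour/minute split assumes that `DepTime` encodes the time as hhmm, which you should check against the full data before relying on it.

```python
import pandas as pd

def add_simple_features(df):
    """Return a copy of df with a few hand-made features (hypothetical helper)."""
    out = df.copy()
    # 'c-8' -> 8 for the calendar columns.
    for col in ['Month', 'DayofMonth', 'DayOfWeek']:
        out[col + '_num'] = out[col].str.replace('c-', '', regex=False).astype(int)
    # Assumption: DepTime is hhmm, so // 100 gives the hour and % 100 the minute.
    out['dep_hour'] = out['DepTime'] // 100
    out['dep_minute'] = out['DepTime'] % 100
    # Origin-destination pair as a single categorical feature.
    out['route'] = out['Origin'] + '_' + out['Dest']
    # Store string columns with the pandas 'category' dtype.
    for col in ['UniqueCarrier', 'Origin', 'Dest', 'route']:
        out[col] = out[col].astype('category')
    return out

# Toy frame: the five rows shown by train_df.head() (illustration only).
toy_df = pd.DataFrame({
    'Month': ['c-8', 'c-4', 'c-9', 'c-11', 'c-10'],
    'DayofMonth': ['c-21', 'c-20', 'c-2', 'c-25', 'c-7'],
    'DayOfWeek': ['c-7', 'c-3', 'c-5', 'c-6', 'c-6'],
    'DepTime': [1934, 1548, 1422, 1015, 1828],
    'UniqueCarrier': ['AA', 'US', 'XE', 'OO', 'WN'],
    'Origin': ['ATL', 'PIT', 'RDU', 'DEN', 'MDW'],
    'Dest': ['DFW', 'MCO', 'CLE', 'MEM', 'OMA'],
    'Distance': [732, 834, 416, 872, 423],
})
features = add_simple_features(toy_df)
```

Columns with the `category` dtype can be passed to LightGBM's scikit-learn estimators, which treat them as categorical by default, so one-hot encoding is not strictly required for a boosting model. Apply the same helper to `train_df` and `test_df` so both get identical columns.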

If you think this course is worth spreading, you can do us a favour:
* upvote this [announcement](https://www.kaggle.com/general/68205) on Kaggle Forum; optionally, tell your story therein
* upvote the mlcourse.ai [Kaggle Dataset](https://www.kaggle.com/kashnitsky/mlcourse), which will pull the Dataset up in the list of all datasets
* upvoting course [Kernels](https://www.kaggle.com/kashnitsky/mlcourse/kernels?sortBy=voteCount&group=everyone&pageSize=20&datasetId=32132) is also a nice thing to do
* spread the word about [mlcourse.ai](https://mlcourse.ai) on social networks; the next session is planned to launch in February 2019