

Your task is to beat all benchmarks in this competition. Detailed instructions are not provided here. Hopefully, at this stage of the course, a quick look at the data is enough to see that this is the type of task where gradient boosting will do. Most likely it will be LightGBM, but you can try XGBoost or CatBoost as well.

<img src="https://habrastorage.org/webt/fs/42/ms/fs42ms0r7qsoj-da4x7yfntwrbq.jpeg" width=30% />

In [6]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [7]:
train_df = pd.read_csv('../../datasets/flight/flight_delays_train.csv')
test_df = pd.read_csv('../../datasets/flight/flight_delays_test.csv')

In [8]:
train_df.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance,dep_delayed_15min
0,c-8,c-21,c-7,1934,AA,ATL,DFW,732,N
1,c-4,c-20,c-3,1548,US,PIT,MCO,834,N
2,c-9,c-2,c-5,1422,XE,RDU,CLE,416,N
3,c-11,c-25,c-6,1015,OO,DEN,MEM,872,N
4,c-10,c-7,c-6,1828,WN,MDW,OMA,423,Y


In [9]:
test_df.head()

Unnamed: 0,Month,DayofMonth,DayOfWeek,DepTime,UniqueCarrier,Origin,Dest,Distance
0,c-7,c-25,c-3,615,YV,MRY,PHX,598
1,c-4,c-17,c-2,739,WN,LAS,HOU,1235
2,c-12,c-2,c-7,651,MQ,GSP,ORD,577
3,c-3,c-25,c-7,1614,WN,BWI,MHT,377
4,c-6,c-6,c-3,1505,UA,ORD,STL,258


Given the flight departure time, carrier code, departure airport, destination, and flight distance, you have to predict whether the departure will be delayed by more than 15 minutes. As the simplest benchmark, let's take logistic regression and the two features that are easiest to use: DepTime and Distance. This corresponds to the **"simple logit baseline"** on the Public LB.
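As a rough illustration of the two-feature benchmark described above, the sketch below fits a scaler-plus-logistic-regression pipeline on `Distance` and `DepTime` only. The `toy_df` frame and the `two_feature_xy` helper are hypothetical names introduced here; the rows are copied from the `train_df.head()` output, so the resulting probabilities are meaningless and only show the mechanics.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy frame copied from the train_df.head() output (assumption: the real
# notebook would pass the full train_df instead).
toy_df = pd.DataFrame({
    "DepTime": [1934, 1548, 1422, 1015, 1828],
    "Distance": [732, 834, 416, 872, 423],
    "dep_delayed_15min": ["N", "N", "N", "N", "Y"],
})


def two_feature_xy(df):
    """Return the two numeric features and the 0/1 target (hypothetical helper)."""
    X = df[["Distance", "DepTime"]].to_numpy(dtype=float)
    y = df["dep_delayed_15min"].map({"Y": 1, "N": 0}).to_numpy()
    return X, y


X_toy, y_toy = two_feature_xy(toy_df)

# Same pipeline settings as the logit_pipe defined later in the notebook.
baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("logit", LogisticRegression(C=1, random_state=17, solver="liblinear")),
])
baseline.fit(X_toy, y_toy)

# Probability of the positive class (delay of more than 15 minutes).
toy_proba = baseline.predict_proba(X_toy)[:, 1]
```

On the full data, the same helper could be applied to `train_df` before the hold-out split, which keeps the benchmark limited to the two features named in the text.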

In [10]:
cat_cols = [col for col in train_df.columns
            if col not in ['Distance', 'DepTime', 'dep_delayed_15min']]
cat_cols

['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier', 'Origin', 'Dest']

In [11]:
df = pd.concat([train_df[cat_cols],test_df[cat_cols]],axis = 0)

In [12]:
from sklearn.preprocessing import OneHotEncoder
onehot=OneHotEncoder()
onehot.fit(df)
X_train_onehot = onehot.transform(train_df[cat_cols])

# Dense one-hot columns followed by the two numeric features
X_train = np.hstack([X_train_onehot.toarray(),
                     train_df[['Distance', 'DepTime']].values])
y_train = train_df['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values

X_test_onehot = onehot.transform(test_df[cat_cols])
# Use the test one-hot matrix here, not the train one
X_test = np.hstack([X_test_onehot.toarray(), test_df[['Distance', 'DepTime']].values])

# Hold out 30% of the training data; the cells below rely on these names
X_train_part, X_valid, y_train_part, y_valid = \
    train_test_split(X_train, y_train, 
                     test_size=0.3, random_state=17)

In [8]:
logit_pipe = Pipeline([('scaler', StandardScaler()),
                       ('logit', LogisticRegression(C=1, random_state=17, solver='liblinear'))])

In [9]:
logit_pipe.fit(X_train_part, y_train_part)
logit_valid_pred = logit_pipe.predict_proba(X_valid)[:, 1]

roc_auc_score(y_valid, logit_valid_pred)

0.6991062491760075

In [10]:
logit_pipe.fit(X_train, y_train)
logit_test_pred = logit_pipe.predict_proba(X_test)[:, 1]

pd.Series(logit_test_pred, 
          name='dep_delayed_15min').to_csv('logit_2feat.csv', 
                                           index_label='id', header=True)

In [35]:
from sklearn.ensemble import RandomForestClassifier

In [36]:
rf_clf = RandomForestClassifier(n_estimators=1000,verbose=1,oob_score=True,n_jobs=-1)

In [37]:
rf_clf.fit(X_train_part,y_train_part)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   47.7s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  4.5min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 10.3min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 14.5min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 16.7min finished


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=-1,
            oob_score=True, random_state=None, verbose=1, warm_start=False)

In [109]:
rf_clf.oob_score_

0.8170571428571428

In [110]:
rf_valid_pred = rf_clf.predict_proba(X_valid)[:, 1]

roc_auc_score(y_valid, rf_valid_pred)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    1.0s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    4.9s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:   12.3s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:   23.1s
[Parallel(n_jobs=4)]: Done 1000 out of 1000 | elapsed:   31.6s finished


0.7337395079820489

In [16]:
rf_clf.fit(X_train, y_train)
# Predict with the random forest, not the logistic regression pipeline
rf_test_pred = rf_clf.predict_proba(X_test)[:, 1]

pd.Series(rf_test_pred, 
          name='dep_delayed_15min').to_csv('rf_2feat.csv', 
                                           index_label='id', header=True)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   55.9s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:  7.4min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 13.5min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 17.6min finished


In [111]:
from sklearn.decomposition import PCA

In [112]:
pca = PCA(n_components=0.9)

In [113]:
X_train_pca=pca.fit_transform(X_train_onehot.toarray())

In [116]:
y_train.shape

(354,)

In [114]:
X_train_part_pca, X_valid_pca, y_train_part_pca, y_valid_pca = \
    train_test_split(X_train_pca, y_train, 
                     test_size=0.3, random_state=17)

ValueError: Found input variables with inconsistent numbers of samples: [100000, 354]

In [36]:
from sklearn.model_selection import GridSearchCV

In [None]:
rf_clf_pca = RandomForestClassifier(n_estimators=1000,verbose=1,oob_score=True,n_jobs=-1)
param={'max_depth':range(6,30,4)}

gs1=GridSearchCV(rf_clf_pca,param_grid=param,cv=5,n_jobs=-1,scoring='roc_auc',verbose=1)
gs1.fit(X_train_part_pca,y_train_part_pca)

In [40]:
gs1.best_params_

{'max_depth': 14}

In [43]:
rf_train_part_pred = gs1.best_estimator_.predict_proba(X_train_part_pca)[:, 1]

roc_auc_score(y_train_part_pca, rf_train_part_pred)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.2s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.9s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    2.0s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:    3.5s
[Parallel(n_jobs=4)]: Done 1000 out of 1000 | elapsed:    4.6s finished


0.9662637216795118

In [42]:
print(gs1.best_estimator_.oob_score_)
rf_valid_pred = gs1.best_estimator_.predict_proba(X_valid_pca)[:, 1]

roc_auc_score(y_valid_pca, rf_valid_pred)

0.8085428571428571


[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:    0.4s
[Parallel(n_jobs=4)]: Done 442 tasks      | elapsed:    0.9s
[Parallel(n_jobs=4)]: Done 792 tasks      | elapsed:    1.6s
[Parallel(n_jobs=4)]: Done 1000 out of 1000 | elapsed:    2.1s finished


0.6330848876677431

In [None]:
rf_clf_pca = RandomForestClassifier(max_depth=14,verbose=1,oob_score=True,n_jobs=-1)
param={'n_estimators':range(100,1000,50)}

gs2=GridSearchCV(rf_clf_pca,param_grid=param,cv=5,n_jobs=-1,scoring='roc_auc',verbose=1)
gs2.fit(X_train_part_pca,y_train_part_pca)

Fitting 5 folds for each of 18 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 69.8min


In [33]:
print(X_train_onehot.shape)
print(X_train_part_pca.shape)

(100000, 687)
(70000, 139)


In [14]:
from sklearn.datasets import load_boston
data=load_boston()

In [16]:
X,y=data['data'],data['target']

In [17]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.3,random_state=42)

Now you have to beat **"A10 benchmark"** on Public LB. It's not challenging at all. Go for LightGBM, maybe some other models (or ensembling) as well. Include categorical features, do some simple feature engineering as well. Good luck!
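As one possible starting point for the "simple feature engineering" mentioned above, the sketch below derives a few extra columns with pandas. The helper name `add_simple_features` and the new column names are hypothetical, and the code assumes that `DepTime` is encoded as HHMM (which the sample values such as 1934 suggest but the notebook does not state). The two rows of `toy_df` are copied from the `train_df.head()` output.

```python
import pandas as pd

# Two rows copied from the train_df.head() output, for illustration only.
toy_df = pd.DataFrame({
    "Month": ["c-8", "c-4"],
    "DayofMonth": ["c-21", "c-20"],
    "DayOfWeek": ["c-7", "c-3"],
    "DepTime": [1934, 1548],
    "UniqueCarrier": ["AA", "US"],
    "Origin": ["ATL", "PIT"],
    "Dest": ["DFW", "MCO"],
    "Distance": [732, 834],
})


def add_simple_features(df):
    """Return a copy of df with a few derived columns (hypothetical helper)."""
    out = df.copy()
    # Strip the 'c-' prefix so the calendar fields become plain integers.
    for col in ["Month", "DayofMonth", "DayOfWeek"]:
        out[col + "_num"] = out[col].str.replace("c-", "", regex=False).astype(int)
    # Assumption: DepTime is HHMM, so integer division by 100 gives the hour.
    out["dep_hour"] = out["DepTime"] // 100
    out["dep_minute"] = out["DepTime"] % 100
    # Origin/destination pair as a single categorical feature.
    out["route"] = out["Origin"] + "_" + out["Dest"]
    return out


features = add_simple_features(toy_df)
print(features[["Month_num", "dep_hour", "dep_minute", "route"]])
```

String columns such as `route` would still need to be encoded (or passed as categorical features) before being fed to a gradient boosting model.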

If you think this course is worth spreading, you can do a favour:
* upvote this [announcement](https://www.kaggle.com/general/68205) on Kaggle Forum; optionally, tell your story therein
* upvote the mlcourse.ai [Kaggle Dataset](https://www.kaggle.com/kashnitsky/mlcourse), it'll pull the Dataset up in the list of all datasets
* upvoting course [Kernels](https://www.kaggle.com/kashnitsky/mlcourse/kernels?sortBy=voteCount&group=everyone&pageSize=20&datasetId=32132) is also a nice thing to do
* spread the word about [mlcourse.ai](https://mlcourse.ai) on social networks, the next session is planned to launch in February 2019

In [38]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release
from sklearn.datasets import load_boston
data = load_boston()
X, y = data['data'], data['target']

In [39]:
from sklearn.model_selection import train_test_split

In [40]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=42)

In [48]:
from sklearn.linear_model import SGDRegressor, LinearRegression
lr_reg = LinearRegression()

In [94]:
from sklearn.preprocessing import StandardScaler
ss=StandardScaler()
X_train_s=ss.fit_transform(X_train)
X_test_s=ss.transform(X_test)

In [105]:
from sklearn.metrics import mean_squared_error
from sklearn.base import clone
from copy import deepcopy

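pre = 0    # lowest test MSE seen so far (0 means "not set yet")
count = 0  # number of epochs that improved on the previous best
sgd_reg = SGDRegressor(warm_start=True, eta0=0.01, penalty=None, learning_rate='constant')
for i in range(1000):
    # One pass over the scaled training data per iteration
    sgd_reg.partial_fit(X_train_s, y_train)
    y_pred = sgd_reg.predict(X_test_s)
    err = mean_squared_error(y_test, y_pred)
    if pre == 0 or err < pre:
        pre = err
        count += 1
        best_epoch = i
        # Keep a snapshot of the best model so far
        best_model = deepcopy(sgd_reg)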

In [104]:
pre,count

(19.97812549771452, 8)

In [108]:
lr_reg.fit(X_train,y_train)
y_pred=lr_reg.predict(X_test)
print(mean_squared_error(y_pred,y_test))
y_pred=best_model.predict(X_test_s)
print(mean_squared_error(y_pred,y_test))

21.517444231177205
20.111370716927553


In [93]:
mean_squared_error(y_train,y_train_pred)

5.0437795822376603e+30