# 3. Binary classification
**Goal:** The goal is to predict if the flight will be cancelled  

**Target variables:** 
- `CANCELLED` 

**Notes**:  

2% of flights are cancelled. it's a highly imbalanced dataset. We will want to predict probabilities with custom threshold.  
[Article on framework for highly imbalanced datasets](https://machinelearningmastery.com/framework-for-imbalanced-classification-projects/)

**Steps**:
1. Pick evaluation metrics
2. Spot check different algos
3. Tune hyperparameters using grid search on 2-3 selected algos
4. Compare them based on the selected metrics

## Step 1. Pick evaluation metrics
**Brier score** (because we will be using probabilities in some models)  
**Precision-Recall AUC** (because positive class is more important)  
**F1 score**  

## Step 2. Spot check different algos (using K-Folds on sample data)

### Step 2.1 Spot check regular algorythms

- Naive algorythm (used as a baseline model):  
Predict the majority class 

- Linear algorythms:  
Logistic Regression  
LDA  
- Non linear algos:  
k-Nearest Neighbors  

- Ensemble algorythms:  
Random forest  
Stochastic Gradient Boosting   
XGBoost  
 *Optional: custom ensemble*

- Other
One-Class Support Vector Machines (usually used for outlier detection but can be used to do classification)

### Step 2.2 Spot check imbalanced classification algos. Use those techniques on the ML algos to see if it improves performance

- Data sampling  
Oversampler and Undersampler method: SMOTE combined with RandomUnderSampler on the training datasets in this step [Article on how to perform SMOTE](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)

- Probability tuning (custom threshold) on the algos that can give probabilities as outputs  
Logistic Regression 
LDA 

- Optional: Try a [calibration algo on the predicted prbobabilities](https://machinelearningmastery.com/calibrated-classification-model-in-scikit-learn/)



In [3]:
import pandas as pd
import numpy as np
#from sklearn import cross_validation, metrics

In [4]:
# import data
data = pd.read_csv('db_binary_sample.csv', index_col = 0)

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 6552218 to 9619844
Data columns (total 14 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   branded_code_share  1000000 non-null  int64  
 1   crs_dep_time        1000000 non-null  int64  
 2   crs_arr_time        1000000 non-null  int64  
 3   cancelled           1000000 non-null  float64
 4   crs_elapsed_time    1000000 non-null  float64
 5   air_time            1000000 non-null  float64
 6   distance            1000000 non-null  float64
 7   fl_month            1000000 non-null  int64  
 8   fl_day_of_week      1000000 non-null  int64  
 9   fl_type             1000000 non-null  int64  
 10  state_travel_type   1000000 non-null  int64  
 11  origin_cat          1000000 non-null  int64  
 12  dest_cat            1000000 non-null  int64  
 13  mkt_op_combo_cat    1000000 non-null  int64  
dtypes: float64(4), int64(10)
memory usage: 114.4 MB


In [142]:
# get class representations

print(f'Class {data.cancelled.value_counts().index[0]}: {y.value_counts().values[0]} values')
print(f'Class {data.cancelled.value_counts().index[1]}: {y.value_counts().values[1]} values')
print(f'Sample imbalance: {data.cancelled.value_counts().values[1]/len(y)*100} %')

Class 0.0: 982771 values
Class 1.0: 17229 values
Sample imbalance: 1.7229 %


# ----- Trying things -----
### Try different splitting methods
Trying with `KFold` and `StratifiedShuffleSplit`.

In [81]:
# try using K-Fold

from sklearn.model_selection import KFold
kf = KFold(n_splits=10)
kf.get_n_splits(X_scaled, y)
KFold()

for train_index, test_index in kf.split(X_scaled,y):
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [100000 100001 100002 ... 999997 999998 999999] TEST: [    0     1     2 ... 99997 99998 99999]
TRAIN: [     0      1      2 ... 999997 999998 999999] TEST: [100000 100001 100002 ... 199997 199998 199999]
TRAIN: [     0      1      2 ... 999997 999998 999999] TEST: [200000 200001 200002 ... 299997 299998 299999]
TRAIN: [     0      1      2 ... 999997 999998 999999] TEST: [300000 300001 300002 ... 399997 399998 399999]
TRAIN: [     0      1      2 ... 999997 999998 999999] TEST: [400000 400001 400002 ... 499997 499998 499999]
TRAIN: [     0      1      2 ... 999997 999998 999999] TEST: [500000 500001 500002 ... 599997 599998 599999]
TRAIN: [     0      1      2 ... 999997 999998 999999] TEST: [600000 600001 600002 ... 699997 699998 699999]
TRAIN: [     0      1      2 ... 999997 999998 999999] TEST: [700000 700001 700002 ... 799997 799998 799999]
TRAIN: [     0      1      2 ... 999997 999998 999999] TEST: [800000 800001 800002 ... 899997 899998 899999]
TRAIN: [     0      1    

In [74]:
# try using StratifiesShuffleSplit

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=10, test_size=0.75, random_state=0)
sss.get_n_splits(X_scaled, y)

for train_index, test_index in sss.split(X_scaled, y):
    print("TRAIN:", train_index, "TEST:", test_index)

TRAIN: [281750 943380 795722 ... 660010 884461 260970] TEST: [276253 447818 356965 ... 975127  45228 627114]
TRAIN: [591858 801835 762344 ... 286665 458622 994887] TEST: [996020 796429 511687 ... 620788 815472 405986]
TRAIN: [576527 806177 930996 ...  90238 846362 903068] TEST: [167251 488581 985668 ... 560869 182965 112546]
TRAIN: [ 11082  66792  97360 ...  85078 995251 140024] TEST: [999600 458983 700206 ... 939295 780817 200095]
TRAIN: [459230 383268 532306 ...   5436 749097 297154] TEST: [232992 863587   5703 ...  11039 215364 768469]
TRAIN: [362763 729592 322726 ...  39634 497503 250751] TEST: [984895 653479 547897 ... 736018 312592 814918]
TRAIN: [ 66554 528276 832008 ... 295657 818478 957125] TEST: [152964 610160 944232 ... 305574 897138 353157]
TRAIN: [581090 849604   6935 ... 851652 875719 431848] TEST: [571587 352508 237495 ... 528434 892110 135950]
TRAIN: [ 81802  24268  53156 ... 323598 371099 539020] TEST: [128670 551496 672841 ...  33865 102943 810189]
TRAIN: [201134 8648

### Try syntax for training and predicting a model

In [109]:
# import selected metrics
from sklearn.metrics import brier_score_loss

# import model
from sklearn.linear_model import LogisticRegression

In [110]:
# get train/test
from sklearn.model_selection import train_test_split

# get train/test data
X_train, X_test, y_train, y_test = \
train_test_split(X_scaled, y, train_size=0.80, random_state=101)

In [115]:
# create placeolder
model = LogisticRegression(random_state=0)

# fit the data
model.fit(X_scaled, y)

# predict
y_pred = model.predict(X_test)
brier_score = brier_score_loss(y_test, y_pred)
print(brier_score)

0.0


### Trying using cross validate

In [134]:
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler

#model
from sklearn.linear_model import LogisticRegression

# placeholder
model = LogisticRegression(random_state=0)

# get the data, scale X
y = data.cancelled
X = data.drop('cancelled', axis = 1)
X_scaled = StandardScaler().fit_transform(X)

# get the number of folds
n_folds = 5

# get the scorings
scoring = ('neg_brier_score', 'roc_auc', 'f1')

# get scores
cv_results = cross_validate(model, X_scaled, y, cv = n_folds, scoring = scoring)
print(f"neg_brier_score\nMean:\t{cv_results['test_neg_brier_score'].mean()}\nStd.:\t{cv_results['test_neg_brier_score'].mean().std()}")
print(f"roc_auc\nMean:\t{cv_results['test_roc_auc'].mean()}\nStd.:\t{cv_results['test_roc_auc'].mean().std()}")
print(f"f1\nMean:\t{cv_results['test_f1'].mean()}\nStd.:\t{cv_results['test_f1'].mean().std()}")

neg_brier_score
Mean:	-8.30031802523422e-06
Std.:	0.0
roc_auc
Mean:	0.999999682279595
Std.:	0.0
f1
Mean:	0.9999709850573044
Std.:	0.0


# ------ End of trying things ----- back at checking algos

## Step 2.1 Spot check regular algos

In [136]:
# imports
from sklearn.model_selection import cross_validate
from sklearn.preprocessing import StandardScaler

In [138]:
# Setup data and parameters

# get the data, scale X
y = data.cancelled
X = data.drop('cancelled', axis = 1)
X_scaled = StandardScaler().fit_transform(X)

# get the number of folds
n_folds = 5

# get the scorings
scoring = ('neg_brier_score', 'roc_auc', 'f1')

### Naive algorythms: predicting class 0 for all

In [147]:
# import model
from sklearn.dummy import DummyClassifier

# placeholder
model = DummyClassifier(strategy = "most_frequent")

# get scores
cv_results = cross_validate(model, X_scaled, y, cv = n_folds, scoring = scoring)
print(f"neg_brier_score\nMean:\t{cv_results['test_neg_brier_score'].mean()}\nStd.:\t{cv_results['test_neg_brier_score'].mean().std()}")
print(f"roc_auc\nMean:\t{cv_results['test_roc_auc'].mean()}\nStd.:\t{cv_results['test_roc_auc'].mean().std()}")
print(f"f1\nMean:\t{cv_results['test_f1'].mean()}\nStd.:\t{cv_results['test_f1'].mean().std()}")

neg_brier_score
Mean:	-0.017228999999999998
Std.:	0.0
roc_auc
Mean:	0.5
Std.:	0.0
f1
Mean:	0.0
Std.:	0.0


### Logistic Regression

In [139]:
# import model
from sklearn.linear_model import LogisticRegression

# placeholder
model = LogisticRegression()

# get scores
cv_results = cross_validate(model, X_scaled, y, cv = n_folds, scoring = scoring)
print(f"neg_brier_score\nMean:\t{cv_results['test_neg_brier_score'].mean()}\nStd.:\t{cv_results['test_neg_brier_score'].mean().std()}")
print(f"roc_auc\nMean:\t{cv_results['test_roc_auc'].mean()}\nStd.:\t{cv_results['test_roc_auc'].mean().std()}")
print(f"f1\nMean:\t{cv_results['test_f1'].mean()}\nStd.:\t{cv_results['test_f1'].mean().std()}")

neg_brier_score
Mean:	-8.30031802523422e-06
Std.:	0.0
roc_auc
Mean:	0.999999682279595
Std.:	0.0
f1
Mean:	0.9999709850573044
Std.:	0.0


### LDA

In [141]:
# import model
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# placeholder
model = LinearDiscriminantAnalysis()

# get scores
cv_results = cross_validate(model, X_scaled, y, cv = n_folds, scoring = scoring)
print(f"neg_brier_score\nMean:\t{cv_results['test_neg_brier_score'].mean()}\nStd.:\t{cv_results['test_neg_brier_score'].mean().std()}")
print(f"roc_auc\nMean:\t{cv_results['test_roc_auc'].mean()}\nStd.:\t{cv_results['test_roc_auc'].mean().std()}")
print(f"f1\nMean:\t{cv_results['test_f1'].mean()}\nStd.:\t{cv_results['test_f1'].mean().std()}")

neg_brier_score
Mean:	-0.003590931248893237
Std.:	0.0
roc_auc
Mean:	0.9998115128775324
Std.:	0.0
f1
Mean:	0.8720059949150183
Std.:	0.0


### Random forest  

In [145]:
# import model
from sklearn.ensemble import RandomForestClassifier

# placeholder
model = RandomForestClassifier()

# get scores
cv_results = cross_validate(model, X_scaled, y, cv = n_folds, scoring = scoring)
print(f"neg_brier_score\nMean:\t{cv_results['test_neg_brier_score'].mean()}\nStd.:\t{cv_results['test_neg_brier_score'].mean().std()}")
print(f"roc_auc\nMean:\t{cv_results['test_roc_auc'].mean()}\nStd.:\t{cv_results['test_roc_auc'].mean().std()}")
print(f"f1\nMean:\t{cv_results['test_f1'].mean()}\nStd.:\t{cv_results['test_f1'].mean().std()}")

neg_brier_score
Mean:	-7.579000000000008e-06
Std.:	0.0
roc_auc
Mean:	1.0
Std.:	0.0
f1
Mean:	1.0
Std.:	0.0


### Stochastic Gradient Boosting 

In [148]:
# import model
from sklearn.ensemble import GradientBoostingClassifier

# placeholder
model = RandomForestClassifier()

# get scores
cv_results = cross_validate(model, X_scaled, y, cv = n_folds, scoring = scoring)
print(f"neg_brier_score\nMean:\t{cv_results['test_neg_brier_score'].mean()}\nStd.:\t{cv_results['test_neg_brier_score'].mean().std()}")
print(f"roc_auc\nMean:\t{cv_results['test_roc_auc'].mean()}\nStd.:\t{cv_results['test_roc_auc'].mean().std()}")
print(f"f1\nMean:\t{cv_results['test_f1'].mean()}\nStd.:\t{cv_results['test_f1'].mean().std()}")

neg_brier_score
Mean:	-7.587800000000011e-06
Std.:	0.0
roc_auc
Mean:	1.0
Std.:	0.0
f1
Mean:	1.0
Std.:	0.0


### One Class Support Vector Machines

In [None]:
# import model
from sklearn.svm import OneClassSVM

# placeholder
model = OneClassSVM()

# get scores
cv_results = cross_validate(model, X_scaled, y, cv = n_folds, scoring = scoring)
print(f"neg_brier_score\nMean:\t{cv_results['test_neg_brier_score'].mean()}\nStd.:\t{cv_results['test_neg_brier_score'].mean().std()}")
print(f"roc_auc\nMean:\t{cv_results['test_roc_auc'].mean()}\nStd.:\t{cv_results['test_roc_auc'].mean().std()}")
print(f"f1\nMean:\t{cv_results['test_f1'].mean()}\nStd.:\t{cv_results['test_f1'].mean().std()}")

### XGBoost

In [154]:
# import model
import xgboost as xgb

# placeholder
model = xgb.XGBClassifier(scale_pos_weight=100) # parameter set because the dataset is imbalanced

# get scores
cv_results = cross_validate(model, X_scaled, y, cv = n_folds, scoring = 'roc_auc')
# print(f"neg_brier_score\nMean:\t{cv_results['test_neg_brier_score'].mean()}\nStd.:\t{cv_results['test_neg_brier_score'].mean().std()}")
# print(f"roc_auc\nMean:\t{cv_results['test_roc_auc'].mean()}\nStd.:\t{cv_results['test_roc_auc'].mean().std()}")
# print(f"f1\nMean:\t{cv_results['test_f1'].mean()}\nStd.:\t{cv_results['test_f1'].mean().std()}")

In [156]:
cv_results

{'fit_time': array([31.20423722, 31.55741978, 33.59384584, 34.10717416, 34.16739893]),
 'score_time': array([0.19097233, 0.20549273, 0.19444442, 0.18151617, 0.16628432]),
 'test_score': array([1., 1., 1., 1., 1.])}

## Step 3. Hyperparameter tuning
Use Grid Search to tune the selected algo (on the whole dataframe?)