# Experiment 01: Airline dataset

In this experiment we use [the airline dataset](http://kt.ijs.si/elena_ikonomovska/data.html) to predict arrival delay. The dataset contains flight arrival and departure details for all commercial flights within the USA from October 1987 to April 2008. It has around 116 million records and occupies about 5.76 GB of memory.



In [3]:
import os,sys
import numpy as np
import pandas as pd
from lightgbm.sklearn import LGBMRegressor, LGBMClassifier
import xgboost as xgb
from xgboost import XGBRegressor, XGBClassifier
from sklearn.metrics import (confusion_matrix, accuracy_score, roc_auc_score, f1_score, log_loss, precision_score,
                             recall_score)
from libs.loaders import load_airline
from libs.conversion import convert_cols_categorical_to_numeric, convert_related_cols_categorical_to_numeric
from libs.timer import Timer
from libs.utils import get_number_processors
from libs.notebook_memory_management import start_watching_memory
import pkg_resources
import json
import matplotlib.pylab as plt

print("System version: {}".format(sys.version))
print("XGBoost version: {}".format(pkg_resources.get_distribution('xgboost').version))
print("LightGBM version: {}".format(pkg_resources.get_distribution('lightgbm').version))

%matplotlib inline


System version: 3.5.2 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
XGBoost version: 0.6
LightGBM version: 0.2
In [3] used 0.0000 MiB RAM in 0.13s, total RAM usage 225.30 MiB


In [2]:
start_watching_memory()

In [2] used 13.0781 MiB RAM in 0.04s, total RAM usage 225.30 MiB


# 1) XGBoost vs LightGBM benchmark
In this section we compare the speed, accuracy and other metrics of both libraries on the airline arrival delay dataset.

### Data loading and management

In [4]:
%%time
df_plane = load_airline()
print(df_plane.shape)

INFO:libs.loaders:MOUNT_POINT not found in environment. Defaulting to /fileshare


(115069017, 14)
CPU times: user 1min 49s, sys: 17.7 s, total: 2min 7s
Wall time: 4min 10s
In [4] used 21994.2773 MiB RAM in 250.27s, total RAM usage 22219.58 MiB


In [5]:
df_plane.head()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay
0,1987,10,1,4,1,556,AA,190,247,SFO,ORD,1846,0,27
1,1987,10,1,4,5,114,EA,57,74,LAX,SFO,337,0,5
2,1987,10,1,4,5,35,HP,351,167,ICT,LAS,987,0,17
3,1987,10,1,4,5,40,DL,251,35,MCO,PBI,142,0,-2
4,1987,10,1,4,8,517,UA,500,208,LAS,ORD,1515,0,17


In [5] used 0.3125 MiB RAM in 0.12s, total RAM usage 22219.89 MiB


The first step is to convert the categorical features to numeric features.

In [6]:
%%time
df_plane_numeric = convert_related_cols_categorical_to_numeric(df_plane, col_list=['Origin','Dest'])
del df_plane

CPU times: user 1min 56s, sys: 8.8 s, total: 2min 5s
Wall time: 2min 1s
In [6] used 5263.6406 MiB RAM in 121.30s, total RAM usage 27483.54 MiB


In [7]:
df_plane_numeric.head()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay
0,1987,10,1,4,1,556,AA,190,247,0,33,1846,0,27
1,1987,10,1,4,5,114,EA,57,74,1,0,337,0,5
2,1987,10,1,4,5,35,HP,351,167,2,4,987,0,17
3,1987,10,1,4,5,40,DL,251,35,3,41,142,0,-2
4,1987,10,1,4,8,517,UA,500,208,4,33,1515,0,17


In [7] used 0.0039 MiB RAM in 0.12s, total RAM usage 27483.54 MiB
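
The table above suggests that `Origin` and `Dest` share a single code table: SFO maps to 0 and LAS maps to 4 in both columns. The source of `convert_related_cols_categorical_to_numeric` in `libs.conversion` is not shown here, so the following is only a minimal sketch of one way such a shared encoding could work. It uses a hypothetical helper, `encode_related_columns`, and plain Python lists instead of a DataFrame. Codes are assigned in order of first appearance, so they need not match the ones above.

```python
def encode_related_columns(columns):
    """Map the values of several related columns to integers with one shared code table.

    `columns` is a dict of column name -> list of categorical values.
    Codes are assigned in order of first appearance, scanning the columns in turn.
    (Hypothetical helper; not the libs.conversion implementation.)
    """
    codes = {}
    for values in columns.values():
        for value in values:
            if value not in codes:
                codes[value] = len(codes)
    encoded = {name: [codes[v] for v in values] for name, values in columns.items()}
    return encoded, codes


# Origin and Dest values from the five rows shown above.
flights = {
    "Origin": ["SFO", "LAX", "ICT", "MCO", "LAS"],
    "Dest": ["ORD", "SFO", "LAS", "PBI", "ORD"],
}
encoded, codes = encode_related_columns(flights)
print(encoded["Origin"])  # [0, 1, 2, 3, 4]
print(encoded["Dest"])    # [5, 0, 4, 6, 5]
```

Because both columns go through the same table, an airport keeps the same code whether it appears as an origin or as a destination.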


In [8]:
%%time
df_plane_numeric = convert_cols_categorical_to_numeric(df_plane_numeric, col_list='UniqueCarrier')

CPU times: user 1min 2s, sys: 8.18 s, total: 1min 10s
Wall time: 1min 8s
In [8] used 13168.6641 MiB RAM in 68.96s, total RAM usage 40652.20 MiB


In [9]:
df_plane_numeric.head()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay
0,1987,10,1,4,1,556,0,190,247,0,33,1846,0,27
1,1987,10,1,4,5,114,1,57,74,1,0,337,0,5
2,1987,10,1,4,5,35,2,351,167,2,4,987,0,17
3,1987,10,1,4,5,40,3,251,35,3,41,142,0,-2
4,1987,10,1,4,8,517,4,500,208,4,33,1515,0,17


In [9] used 0.0078 MiB RAM in 0.11s, total RAM usage 40652.21 MiB


To simplify the pipeline, we set up a classification problem where the goal is to predict whether a flight arrived late or not. For that we need to binarize the variable `ArrDelay`.

If you want to extend this experiment, you can set up a regression problem and try to predict how many minutes of delay a flight has. Both XGBoost and LightGBM have regression classes.

In [10]:
%%time
df_plane_numeric['ArrDelayBinary'] = 1*(df_plane_numeric['ArrDelay'] > 0)

CPU times: user 652 ms, sys: 424 ms, total: 1.08 s
Wall time: 1.08 s
In [10] used 877.9102 MiB RAM in 1.18s, total RAM usage 41530.12 MiB


In [11]:
df_plane_numeric.head()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay,ArrDelayBinary
0,1987,10,1,4,1,556,0,190,247,0,33,1846,0,27,1
1,1987,10,1,4,5,114,1,57,74,1,0,337,0,5,1
2,1987,10,1,4,5,35,2,351,167,2,4,987,0,17,1
3,1987,10,1,4,5,40,3,251,35,3,41,142,0,-2,0
4,1987,10,1,4,8,517,4,500,208,4,33,1515,0,17,1


In [11] used -5267.4609 MiB RAM in 0.13s, total RAM usage 36262.66 MiB
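
The expression `1*(df_plane_numeric['ArrDelay'] > 0)` first builds a boolean mask and then multiplies it by 1 to turn it into integers. A minimal NumPy sketch using the five `ArrDelay` values displayed above, plus a zero that is added here as an assumption to show the boundary case:

```python
import numpy as np

# ArrDelay values from the five rows shown above, plus 0 as an extra boundary case.
arr_delay = np.array([27, 5, 17, -2, 17, 0])

# The comparison yields booleans; multiplying by 1 converts them to 0/1 integers.
arr_delay_binary = 1 * (arr_delay > 0)
print(arr_delay_binary.tolist())  # [1, 1, 1, 0, 1, 0]
```

With the strict `>` comparison, flights that arrive early or exactly on time fall in class 0.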


In [12]:
# Small dataset: sample one million rows (n must be an integer)
df_plane_numeric_small = df_plane_numeric.sample(n=int(1e6)).reset_index(drop=True)

In [12] used 0.0781 MiB RAM in 24.01s, total RAM usage 36262.74 MiB


Once the features are prepared, let's split the dataset into train, validation and test sets.

In [13]:
def split_train_val_test_df(df, val_size=0.2, test_size=0.2):
    train, validate, test = np.split(df.sample(frac=1), 
                                     [int((1-val_size-test_size)*len(df)), int((1-test_size)*len(df))])
    return train, validate, test

In [13] used 0.0000 MiB RAM in 0.11s, total RAM usage 36262.74 MiB


In [14]:
%%time
train, validate, test = split_train_val_test_df(df_plane_numeric_small, val_size=0, test_size=0.2)
#train, validate, test = split_train_val_test_df(df_plane_numeric, val_size=0, test_size=0.2)
print(train.shape)
print(validate.shape)
print(test.shape)

(8000000, 15)
(0, 15)
(2000000, 15)
CPU times: user 5.16 s, sys: 372 ms, total: 5.53 s
Wall time: 5.39 s
In [14] used 0.0000 MiB RAM in 5.50s, total RAM usage 36262.74 MiB
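
`np.split` cuts the shuffled data at two row indices: the first one ends the training set and the second one ends the validation set. With `val_size=0` both indices coincide, which is why the validation set above is empty. A minimal sketch on a 10-element array, assuming the same index arithmetic as `split_train_val_test_df` (the helper name `split_indices` is introduced here for illustration only):

```python
import numpy as np


def split_indices(n_rows, val_size=0.2, test_size=0.2):
    # Same boundary arithmetic as split_train_val_test_df (hypothetical helper).
    train_end = int((1 - val_size - test_size) * n_rows)
    val_end = int((1 - test_size) * n_rows)
    return [train_end, val_end]


data = np.arange(10)
boundaries = split_indices(len(data), val_size=0, test_size=0.2)
train_part, val_part, test_part = np.split(data, boundaries)
print(boundaries)                                      # [8, 8]
print(len(train_part), len(val_part), len(test_part))  # 8 0 2
```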


In [15]:
def generate_feables(df):
    X = df[df.columns.difference(['ArrDelay', 'ArrDelayBinary'])]
    y = df['ArrDelayBinary']
    return X,y

In [15] used 0.0000 MiB RAM in 0.11s, total RAM usage 36262.74 MiB


In [16]:
%%time
X_train, y_train = generate_feables(train)
X_test, y_test = generate_feables(test)


CPU times: user 212 ms, sys: 4 ms, total: 216 ms
Wall time: 216 ms
In [16] used 0.0625 MiB RAM in 0.32s, total RAM usage 36262.80 MiB


In [17]:
%%time
dtrain = xgb.DMatrix(data=X_train, label=y_train)
dtest = xgb.DMatrix(data=X_test, label=y_test)

CPU times: user 4.79 s, sys: 416 ms, total: 5.2 s
Wall time: 5.06 s
In [17] used 0.0664 MiB RAM in 5.03s, total RAM usage 36262.87 MiB


### Training 
Now we are going to create two pipelines, one for XGBoost and one for LightGBM. The two libraries grow trees differently, so it is difficult to compare them with exactly the same model settings. XGBoost grows trees depth-wise and controls model complexity with `max_depth`. LightGBM instead uses a leaf-wise algorithm and controls model complexity with `num_leaves`. As a tradeoff, we use XGBoost with `max_depth=8`, which allows at most 2^8 = 256 leaves per tree, and compare it with LightGBM with `num_leaves=255`.
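
The relation between tree depth and leaf count is simple arithmetic: a binary tree in which every leaf sits at depth `d` has `2**d` leaves and `2**d - 1` internal (split) nodes. A short sketch, with function names introduced here for illustration:

```python
def max_leaves(max_depth):
    # A full binary tree doubles its number of leaves at every level.
    return 2 ** max_depth


def max_internal_nodes(max_depth):
    # Every split turns one leaf into two, so splits = leaves - 1.
    return max_leaves(max_depth) - 1


print(max_leaves(8))          # 256
print(max_internal_nodes(8))  # 255
```

So `num_leaves=255` in LightGBM sits just below the 256-leaf ceiling of an XGBoost tree with `max_depth=8`.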

In [18]:
results_dict = dict()

In [18] used 0.0000 MiB RAM in 0.10s, total RAM usage 36262.87 MiB


Let's start with the XGBoost model.

In [19]:
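# Note: xgb.train() takes the number of boosting rounds from its num_boost_round
# argument (default 10); a 'num_round' entry in params does not set it.
params = {'max_depth':8, 'num_round':100, 'min_child_weight':30, 'eta':0.1, 'colsample_bytree':0.80,
          'scale_pos_weight':2, 'gamma':0.1, 'reg_lambda':1, 'subsample':1,'tree_method':'exact', 'updater':'grow_gpu'}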


In [19] used 0.0000 MiB RAM in 0.10s, total RAM usage 36262.87 MiB


In [20]:
with Timer() as t:
    xgb_clf_pipeline = xgb.train(params, dtrain)

In [20] used 80.9102 MiB RAM in 34.91s, total RAM usage 36343.78 MiB


In [21]:
results_dict['xgb']={
    'train_time': t.interval
}

In [21] used 0.0000 MiB RAM in 0.10s, total RAM usage 36343.78 MiB
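
`Timer` comes from `libs.timer`, whose source is not shown here. The following is a minimal sketch of a context manager that would support the `with Timer() as t:` and `t.interval` usage above, assuming it simply measures wall-clock time. The class name `SimpleTimer` is hypothetical.

```python
import time


class SimpleTimer:
    """Context manager that stores the elapsed wall-clock seconds in `interval`."""

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.interval = time.perf_counter() - self.start
        return False  # do not swallow exceptions raised inside the block


with SimpleTimer() as t:
    total = sum(range(100000))

print("Elapsed: {:.4f} s".format(t.interval))
```

The `interval` attribute is only set when the `with` block exits, so it has to be read after the block, as the notebook does when filling `results_dict`.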


Training the XGBoost model with the histogram-based updater (`grow_gpu_hist`)

In [22]:
params = {'max_depth':8, 'num_round':100, 'min_child_weight':30, 'eta':0.1, 'colsample_bytree':0.80,
          'scale_pos_weight':2, 'gamma':0.1, 'reg_lambda':1, 'subsample':1,'tree_method':'exact', 'updater':'grow_gpu_hist'}


In [22] used 0.0000 MiB RAM in 0.11s, total RAM usage 36343.78 MiB


In [23]:
with Timer() as t:
    xgb_hist_clf_pipeline = xgb.train(params, dtrain)

In [23] used 3.8242 MiB RAM in 5.30s, total RAM usage 36347.60 MiB


In [24]:
results_dict['xgb_hist']={
    'train_time': t.interval
}

In [24] used 0.0078 MiB RAM in 0.11s, total RAM usage 36347.61 MiB


Training the LightGBM model

In [26]:
lgbm_clf_pipeline = LGBMRegressor(num_leaves=255,
                                  n_estimators=50,
                                  min_child_weight=30,
                                  learning_rate=0.1,
                                  colsample_bytree=0.80,
                                  scale_pos_weight=2,
                                  min_split_gain=0.1,
                                  reg_lambda=1,
                                  subsample=1,
                                  device='gpu',
                                  seed=77)

In [26] used 0.0000 MiB RAM in 0.11s, total RAM usage 36347.61 MiB


In [27]:
with Timer() as t:
    lgbm_clf_pipeline.fit(X_train, y_train)

In [27] used 96.6250 MiB RAM in 24.72s, total RAM usage 36444.23 MiB


In [28]:
results_dict['lgbm']={
    'train_time': t.interval
}

In [28] used 0.0000 MiB RAM in 0.10s, total RAM usage 36444.23 MiB


As the training times show, for these specific versions and parameters and on this specific dataset, LightGBM (about 25 s) trains faster than XGBoost with the `grow_gpu` updater (about 36 s), while XGBoost with the `grow_gpu_hist` updater is the fastest of the three (about 5 s).

In general terms, leaf-wise algorithms are more efficient: they tend to converge much faster than depth-wise ones. However, leaf-wise growth may cause overfitting when the dataset is small or there are too many leaves.
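
To illustrate the difference between the two growth strategies, here is a toy sketch; it is not the implementation used by either library. It assumes a small binary tree whose candidate splits have fixed, made-up gains (nodes are indexed heap-style, so node `i` has children `2*i+1` and `2*i+2`) and compares which splits each strategy picks under the same budget of three splits.

```python
import heapq

# Made-up gain of splitting each node; node i has children 2*i+1 and 2*i+2.
GAINS = {0: 10, 1: 8, 2: 1, 3: 6, 4: 5, 5: 1, 6: 1}


def depth_wise(gains, n_splits):
    # Level-by-level growth: with heap-style indices, level order is index order.
    order = sorted(gains)[:n_splits]
    return order, sum(gains[i] for i in order)


def leaf_wise(gains, n_splits):
    # Best-first growth: always split the current leaf with the largest gain.
    frontier = [(-gains[0], 0)]  # max-heap emulated with negated gains
    order = []
    while frontier and len(order) < n_splits:
        _, node = heapq.heappop(frontier)
        order.append(node)
        for child in (2 * node + 1, 2 * node + 2):
            if child in gains:
                heapq.heappush(frontier, (-gains[child], child))
    return order, sum(gains[i] for i in order)


print(depth_wise(GAINS, 3))  # ([0, 1, 2], 19)
print(leaf_wise(GAINS, 3))   # ([0, 1, 3], 24)
```

With the same number of splits, the leaf-wise strategy collects more gain in this example by going deeper on one branch, which is also why it needs a limit such as `num_leaves` to keep trees from overfitting.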

### Evaluation
Now let's evaluate the models on the test set.

In [29]:
with Timer() as t:
    y_prob_xgb = np.clip(xgb_clf_pipeline.predict(dtest), 0.0001, 0.9999)

In [29] used 0.0039 MiB RAM in 0.22s, total RAM usage 36444.24 MiB
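
`np.clip` bounds every prediction to the interval [0.0001, 0.9999], so scores that fall outside the probability range are pulled back to its edges. A small sketch with made-up scores:

```python
import numpy as np

raw_scores = np.array([-0.2, 0.0, 0.35, 0.9, 1.3])  # made-up model outputs
y_prob = np.clip(raw_scores, 0.0001, 0.9999)
print(y_prob.tolist())  # [0.0001, 0.0001, 0.35, 0.9, 0.9999]
```

Values already inside the interval are left unchanged.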


In [30]:
results_dict['xgb']['test_time'] = t.interval

In [30] used 0.0000 MiB RAM in 0.10s, total RAM usage 36444.24 MiB


In [31]:
with Timer() as t:
    y_prob_xgb_hist = np.clip(xgb_hist_clf_pipeline.predict(dtest), 0.0001, 0.9999)

In [31] used 0.0000 MiB RAM in 0.40s, total RAM usage 36444.24 MiB


In [32]:
results_dict['xgb_hist']['test_time'] = t.interval

In [32] used 0.0000 MiB RAM in 0.10s, total RAM usage 36444.24 MiB


In [33]:
with Timer() as t:
    y_prob_lgbm = np.clip(lgbm_clf_pipeline.predict(X_test), 0.0001, 0.9999)

In [33] used 0.0000 MiB RAM in 1.26s, total RAM usage 36444.24 MiB


In [34]:
results_dict['lgbm']['test_time'] = t.interval


In [34] used 0.0000 MiB RAM in 0.10s, total RAM usage 36444.24 MiB


### Metrics
We are going to obtain some metrics to evaluate the performance of each of the models.

In [35]:
#https://github.com/miguelgfierro/codebase/blob/master/python/machine_learning/metrics.py
def classification_metrics_binary(y_true, y_pred):
    m_acc = accuracy_score(y_true, y_pred)
    m_f1 = f1_score(y_true, y_pred)
    m_precision = precision_score(y_true, y_pred)
    m_recall = recall_score(y_true, y_pred)
    report = {'Accuracy':m_acc, 'Precision':m_precision, 'Recall':m_recall, 'F1':m_f1}
    return report

In [35] used 0.0000 MiB RAM in 0.11s, total RAM usage 36444.24 MiB
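
The four scores returned by `classification_metrics_binary` can all be derived from the counts of true and false positives and negatives. Below is a dependency-free sketch on a made-up toy example. The function name `binary_metrics_from_counts` is introduced here for illustration, and the sketch assumes at least one predicted positive and one actual positive, so that no division by zero occurs.

```python
def binary_metrics_from_counts(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1': f1}


# Made-up labels and predictions: 3 TP, 2 TN, 2 FP, 1 FN.
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 0, 1, 1, 1, 0]
report_toy = binary_metrics_from_counts(y_true_toy, y_pred_toy)
print(report_toy)
```

Here accuracy is 5/8, precision is 3/5 and recall is 3/4; F1 is their harmonic mean.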


In [36]:
#https://github.com/miguelgfierro/codebase/blob/master/python/machine_learning/metrics.py
def classification_metrics_binary_prob(y_true, y_prob):
    m_auc = roc_auc_score(y_true, y_prob)
    report = {'AUC':m_auc}
    return report

In [36] used 0.0000 MiB RAM in 0.10s, total RAM usage 36444.24 MiB
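
ROC AUC can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counting as one half. The following brute-force sketch makes that definition concrete on four made-up scores. It is quadratic in the number of samples, so it is meant only as an illustration and not as a replacement for `roc_auc_score`; the function name `pairwise_auc` is hypothetical.

```python
def pairwise_auc(y_true, y_score):
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
auc_toy = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_toy)  # 0.75
```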


In [37]:
def binarize_prediction(y, threshold=0.5):
    y_pred = np.where(y > threshold, 1, 0)
    return y_pred

In [37] used 0.0000 MiB RAM in 0.10s, total RAM usage 36444.24 MiB
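
`binarize_prediction` uses a strict comparison, so a score exactly equal to the threshold is assigned to class 0. A small sketch with made-up scores, showing how a lower threshold labels more samples as positive:

```python
import numpy as np

y_prob_toy = np.array([0.2, 0.5, 0.7])  # made-up scores

pred_default = np.where(y_prob_toy > 0.5, 1, 0)  # same rule as binarize_prediction
pred_lower = np.where(y_prob_toy > 0.3, 1, 0)    # lower threshold, more positives

print(pred_default.tolist())  # [0, 0, 1]
print(pred_lower.tolist())    # [0, 1, 1]
```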


In [38]:
y_pred_xgb = binarize_prediction(y_prob_xgb)
y_pred_xgb_hist = binarize_prediction(y_prob_xgb_hist)
y_pred_lgbm = binarize_prediction(y_prob_lgbm)


In [38] used 0.0000 MiB RAM in 0.13s, total RAM usage 36444.24 MiB


In [39]:
report_xgb = classification_metrics_binary(y_test, y_pred_xgb)
report2_xgb = classification_metrics_binary_prob(y_test, y_prob_xgb)
report_xgb.update(report2_xgb)

In [39] used 0.0625 MiB RAM in 2.27s, total RAM usage 36444.30 MiB


In [40]:
results_dict['xgb']['performance'] = report_xgb

In [40] used 0.0000 MiB RAM in 0.10s, total RAM usage 36444.30 MiB


In [41]:
report_xgb_hist = classification_metrics_binary(y_test, y_pred_xgb_hist)
report2_xgb_hist = classification_metrics_binary_prob(y_test, y_prob_xgb_hist)
report_xgb_hist.update(report2_xgb_hist)

In [41] used 0.0000 MiB RAM in 1.70s, total RAM usage 36444.30 MiB


In [42]:
results_dict['xgb_hist']['performance'] = report_xgb_hist

In [42] used 0.0000 MiB RAM in -0.04s, total RAM usage 36444.30 MiB


In [43]:
report_lgbm = classification_metrics_binary(y_test, y_pred_lgbm)
report2_lgbm = classification_metrics_binary_prob(y_test, y_prob_lgbm)
report_lgbm.update(report2_lgbm)

In [43] used 0.0000 MiB RAM in 2.04s, total RAM usage 36444.30 MiB


In [44]:
results_dict['lgbm']['performance'] = report_lgbm

In [44] used 0.0039 MiB RAM in 0.11s, total RAM usage 36444.30 MiB


In [45]:
# Results
print(json.dumps(results_dict, indent=4, sort_keys=True))

{
    "lgbm": {
        "performance": {
            "AUC": 0.8093959657942654,
            "Accuracy": 0.7358005,
            "F1": 0.6939065625421792,
            "Precision": 0.76993008161644,
            "Recall": 0.6315470918216711
        },
        "test_time": 1.1605696250044275,
        "train_time": 25.3338222859893
    },
    "xgb": {
        "performance": {
            "AUC": 0.7429219386378774,
            "Accuracy": 0.5658205,
            "F1": 0.6727538924491292,
            "Precision": 0.5234582960693184,
            "Recall": 0.9411911088616137
        },
        "test_time": 0.11862853499769699,
        "train_time": 35.80327978699643
    },
    "xgb_hist": {
        "performance": {
            "AUC": 0.7439819371854803,
            "Accuracy": 0.563739,
            "F1": 0.6721286711268569,
            "Precision": 0.5221384483527672,
            "Recall": 0.9430226918047999
        },
        "test_time": 0.2936741500016069,
        "train_time": 5.3360232150007
    }
}

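In this run, LightGBM obtains a higher AUC (0.81 versus 0.74) and accuracy (0.74 versus 0.57) than both XGBoost configurations, while the F1 scores are close (0.69 versus 0.67). The XGBoost models reach a much higher recall (0.94 versus 0.63) at the cost of lower precision (0.52 versus 0.77).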
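
As a consistency check on the results printed above, F1 is the harmonic mean of precision and recall, so it can be recomputed from those two values alone. A minimal sketch, with the precision and recall figures copied from the results and a helper name introduced here for illustration:

```python
def f1_from_precision_recall(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the results printed above.
f1_lgbm = f1_from_precision_recall(0.76993008161644, 0.6315470918216711)
f1_xgb = f1_from_precision_recall(0.5234582960693184, 0.9411911088616137)

print(round(f1_lgbm, 6))  # 0.693907
print(round(f1_xgb, 6))   # 0.672754
```

Both values agree with the reported F1 scores, even though the two models reach them with very different precision and recall.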