# Experiment 06: Airline dataset (GPU version)

In this experiment we use [the airline dataset](http://kt.ijs.si/elena_ikonomovska/data.html) to predict arrival delay. The dataset consists of a large amount of records, containing flight arrival and departure details for all the commercial flights within the USA, from October 1987 to April 2008. Its size is around 116 million records and 5.76 GB of memory.

For this experiment we used an [Azure NV24 VM](https://azure.microsoft.com/en-gb/blog/azure-n-series-general-availability-on-december-1/), which has [NVIDIA M60](http://www.nvidia.com/object/tesla-m60.html) GPUs. Its operating system is Ubuntu 16.04.

For both XGBoost and LightGBM we compiled from source, to get the last improvements. In XGboost we used the commit 6776292951565c8cd72e69afd9d94de1474f00c0 of May 26th. **Note that it was different from the CPU notebooks**. For LightGBM we used the commit 73968a96829e212b333c88cd44725c8c39c03ad1 of June 2nd. To get these versions and replicate our experiments:
```python
git clone --recursive *url_of_library*
git checkout *oldcommit*
git submodule update --recursive
```

In [2]:
import os,sys
import numpy as np
import pandas as pd
import lightgbm as lgb
from lightgbm.sklearn import LGBMRegressor, LGBMClassifier
import xgboost as xgb
from xgboost import XGBRegressor, XGBClassifier
from sklearn.metrics import (confusion_matrix, accuracy_score, roc_auc_score, f1_score, log_loss, 
                             precision_score, recall_score)
from libs.loaders import load_airline
from libs.conversion import convert_cols_categorical_to_numeric, convert_related_cols_categorical_to_numeric
from libs.timer import Timer
from libs.utils import get_number_processors
import pkg_resources
import json
from toolz import curry

print("System version: {}".format(sys.version))
print("XGBoost version: {}".format(pkg_resources.get_distribution('xgboost').version))
print("LightGBM version: {}".format(pkg_resources.get_distribution('lightgbm').version))



System version: 3.5.2 |Anaconda custom (64-bit)| (default, Jul  2 2016, 17:53:06) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
XGBoost version: 0.6
LightGBM version: 0.2


# 1) XGBoost vs LightGBM benchmark
In the next section we compare both libraries speed, accuracy and other metrics for the dataset of airline arrival delay. 

### Data loading and management

In [3]:
%%time
df_plane = load_airline()
print(df_plane.shape)

INFO:libs.loaders:MOUNT_POINT not found in environment. Defaulting to /fileshare


(115069017, 14)
CPU times: user 1min 30s, sys: 14.9 s, total: 1min 45s
Wall time: 4min 14s


In [4]:
df_plane.head()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay
0,1987,10,1,4,1,556,AA,190,247,SFO,ORD,1846,0,27
1,1987,10,1,4,5,114,EA,57,74,LAX,SFO,337,0,5
2,1987,10,1,4,5,35,HP,351,167,ICT,LAS,987,0,17
3,1987,10,1,4,5,40,DL,251,35,MCO,PBI,142,0,-2
4,1987,10,1,4,8,517,UA,500,208,LAS,ORD,1515,0,17


The first step is to convert the categorical features to numeric features.

In [5]:
%%time
df_plane_numeric = convert_related_cols_categorical_to_numeric(df_plane, col_list=['Origin','Dest'])
del df_plane

CPU times: user 1min 36s, sys: 14.3 s, total: 1min 50s
Wall time: 1min 51s


In [6]:
df_plane_numeric.head()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay
0,1987,10,1,4,1,556,AA,190,247,0,33,1846,0,27
1,1987,10,1,4,5,114,EA,57,74,1,0,337,0,5
2,1987,10,1,4,5,35,HP,351,167,2,4,987,0,17
3,1987,10,1,4,5,40,DL,251,35,3,41,142,0,-2
4,1987,10,1,4,8,517,UA,500,208,4,33,1515,0,17


In [7]:
%%time
df_plane_numeric = convert_cols_categorical_to_numeric(df_plane_numeric, col_list='UniqueCarrier')

CPU times: user 51.2 s, sys: 11.5 s, total: 1min 2s
Wall time: 1min 3s


In [8]:
df_plane_numeric.head()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay
0,1987,10,1,4,1,556,0,190,247,0,33,1846,0,27
1,1987,10,1,4,5,114,1,57,74,1,0,337,0,5
2,1987,10,1,4,5,35,2,351,167,2,4,987,0,17
3,1987,10,1,4,5,40,3,251,35,3,41,142,0,-2
4,1987,10,1,4,8,517,4,500,208,4,33,1515,0,17


To simplify the pipeline, we are going to set a classification problem where the goal is to classify wheather a flight has arrived delayed or not. For that we need to binarize the variable `ArrDelay`.

If you want to extend this experiment, you can set a regression problem and try to identify the number of minutes of delay a fight has. Both XGBoost and LightGBM have regression classes.

In [9]:
%%time
df_plane_numeric['ArrDelayBinary'] = 1*(df_plane_numeric['ArrDelay'] > 0)

CPU times: user 484 ms, sys: 408 ms, total: 892 ms
Wall time: 892 ms


In [10]:
df_plane_numeric.head()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay,ArrDelayBinary
0,1987,10,1,4,1,556,0,190,247,0,33,1846,0,27,1
1,1987,10,1,4,5,114,1,57,74,1,0,337,0,5,1
2,1987,10,1,4,5,35,2,351,167,2,4,987,0,17,1
3,1987,10,1,4,5,40,3,251,35,3,41,142,0,-2,0
4,1987,10,1,4,8,517,4,500,208,4,33,1515,0,17,1


In [11]:
#small dataset
df_plane_numeric_small = df_plane_numeric.sample(n=1e6).reset_index(drop=True)

  locs = rs.choice(axis_length, size=n, replace=replace, p=weights)


Once the features are prepared, let's split the dataset into train, validation and test set.

In [12]:
def split_train_val_test_df(df, val_size=0.2, test_size=0.2):
    train, validate, test = np.split(df.sample(frac=1), 
                                     [int((1-val_size-test_size)*len(df)), int((1-test_size)*len(df))])
    return train, validate, test

In [13]:
%%time
train, validate, test = split_train_val_test_df(df_plane_numeric_small, val_size=0, test_size=0.2)
#train, validate, test = split_train_val_test_df(df_plane_numeric, val_size=0, test_size=0.2)
print(train.shape)
print(validate.shape)
print(test.shape)

(800000, 15)
(0, 15)
(200000, 15)
CPU times: user 212 ms, sys: 4 ms, total: 216 ms
Wall time: 218 ms


In [14]:
def generate_feables(df):
    X = df[df.columns.difference(['ArrDelay', 'ArrDelayBinary'])]
    y = df['ArrDelayBinary']
    return X,y

In [15]:
%%time
X_train, y_train = generate_feables(train)
X_test, y_test = generate_feables(test)

CPU times: user 24 ms, sys: 0 ns, total: 24 ms
Wall time: 24 ms


Let's put the data in the XGBoost format.

In [16]:
dtrain = xgb.DMatrix(data=X_train, label=y_train)
dtest = xgb.DMatrix(data=X_test, label=y_test)

Now, we'll do the same for LightGBM.

In [17]:
lgb_train = lgb.Dataset(X_train.values, y_train.values, free_raw_data=False)
lgb_test = lgb.Dataset(X_test.values, y_test.values, reference=lgb_train, free_raw_data=False)

### Training 
Now we are going to create two pipelines, one of XGBoost and one for LightGBM. The technology behind both libraries is different, so it is difficult to compare them in the exact same model setting. XGBoost grows the trees depth-wise and controls model complexity with `max_depth`. Instead, LightGBM uses a leaf-wise algorithm and controls the model complexity by `num_leaves`. As a tradeoff, we use XGBoost with `max_depth=8`, which will have max number leaves of 255, and compare it with LightGBM with `num_leaves=255`. 

In [18]:
results_dict = dict()

Let's start with the XGBoost model.

In [57]:
xgb_params = {'max_depth':8, 
              'objective':'binary:logistic', 
              'min_child_weight':30, 
              'eta':0.1, 
              'colsample_bytree':0.80,
              'scale_pos_weight':2, 
              'gamma':0.1, 
              'reg_lamda':1, 
              'subsample':1,
              'tree_method':'exact', 
              'updater':'grow_gpu'
             }

In [20]:
with Timer() as t:
    xgb_clf_pipeline = xgb.train(xgb_params, dtrain, num_boost_round=200)

In [21]:
results_dict['xgb']={
    'train_time': t.interval
}

Training XGBoost model with leaf-wise growth

In [58]:
xgb_hist_params = {'max_depth':0, 
                  'objective':'binary:logistic', 
                  'min_child_weight':30, 
                  'eta':0.1, 
                  'colsample_bytree':0.80,
                  'scale_pos_weight':2, 
                  'gamma':0.1, 
                  'reg_lamda':1, 
                  'subsample':1,
                  'tree_method':'hist', 
                  'max_leaves':255, 
                  'grow_policy':'lossguide', 
                  'updater':'grow_gpu_hist'
                 }

In [47]:
with Timer() as t:
    xgb_hist_clf_pipeline = xgb.train(xgb_hist_params, dtrain, num_boost_round=200)

In [48]:
results_dict['xgb_hist']={
    'train_time': t.interval
}

Training LightGBM model

In [59]:
lgbm_params = {'num_leaves': 2**8,
                 'learning_rate': 0.1,
                 'scale_pos_weight': 1,
                 'min_split_gain': 0.1,
                 'min_child_weight': 30,
                 'reg_lambda': 1,
                 'subsample': 1,
                 'objective':'binary',
                 'device': 'gpu',
                 'task': 'train'
                 }

In [26]:
with Timer() as t:
    lgbm_clf_pipeline = lgb.train(lgbm_params, lgb_train, num_boost_round=200)

In [27]:
results_dict['lgbm']={
    'train_time': t.interval
}

As it can be seen in the results, given the specific versions and parameters of both XGBoost and LightGBM and in this specific dataset, LightGBM is faster. 

In general terms, leaf-wise algorithms are more efficient, they converge much faster than depth-wise. However, it may cause over-fitting when the data is small or there are too many leaves.

### Evaluation
Now let's evaluate the model in the test set.

In [28]:
with Timer() as t:
    y_prob_xgb = xgb_clf_pipeline.predict(dtest)

In [29]:
results_dict['xgb']['test_time'] = t.interval

In [30]:
with Timer() as t:
    y_prob_xgb_hist = xgb_hist_clf_pipeline.predict(dtest)

In [31]:
results_dict['xgb_hist']['test_time'] = t.interval

In [32]:
with Timer() as t:
    y_prob_lgbm = lgbm_clf_pipeline.predict(X_test.values)

In [33]:
results_dict['lgbm']['test_time'] = t.interval

### Metrics
We are going to obtain some metrics to evaluate the performance of each of the models.

In [34]:
#https://github.com/miguelgfierro/codebase/blob/master/python/machine_learning/metrics.py
def classification_metrics_binary(y_true, y_pred):
    m_acc = accuracy_score(y_true, y_pred)
    m_f1 = f1_score(y_true, y_pred)
    m_precision = precision_score(y_true, y_pred)
    m_recall = recall_score(y_true, y_pred)
    report = {'Accuracy':m_acc, 'Precision':m_precision, 'Recall':m_recall, 'F1':m_f1}
    return report

In [35]:
#https://github.com/miguelgfierro/codebase/blob/master/python/machine_learning/metrics.py
def classification_metrics_binary_prob(y_true, y_prob):
    m_auc = roc_auc_score(y_true, y_prob)
    report = {'AUC':m_auc}
    return report

In [36]:
def binarize_prediction(y, threshold=0.5):
    y_pred = np.where(y > threshold, 1, 0)
    return y_pred

In [37]:
y_pred_xgb = binarize_prediction(y_prob_xgb)
y_pred_xgb_hist = binarize_prediction(y_prob_xgb_hist)
y_pred_lgbm = binarize_prediction(y_prob_lgbm)


In [38]:
report_xgb = classification_metrics_binary(y_test, y_pred_xgb)
report2_xgb = classification_metrics_binary_prob(y_test, y_prob_xgb)
report_xgb.update(report2_xgb)

In [39]:
results_dict['xgb']['performance'] = report_xgb

In [40]:
report_xgb_hist = classification_metrics_binary(y_test, y_pred_xgb_hist)
report2_xgb_hist = classification_metrics_binary_prob(y_test, y_prob_xgb_hist)
report_xgb_hist.update(report2_xgb_hist)

In [41]:
results_dict['xgb_hist']['performance'] = report_xgb_hist

In [42]:
report_lgbm = classification_metrics_binary(y_test, y_pred_lgbm)
report2_lgbm = classification_metrics_binary_prob(y_test, y_prob_lgbm)
report_lgbm.update(report2_lgbm)

In [43]:
results_dict['lgbm']['performance'] = report_lgbm

In [49]:
# Results
print(json.dumps(results_dict, indent=4, sort_keys=True))

{
    "lgbm": {
        "performance": {
            "AUC": 0.8353728178787345,
            "Accuracy": 0.760175,
            "F1": 0.7294682993135889,
            "Precision": 0.7855468228034159,
            "Recall": 0.6808629366800732
        },
        "test_time": 0.5706548339803703,
        "train_time": 15.469012161018327
    },
    "xgb": {
        "performance": {
            "AUC": 0.8222958270248748,
            "Accuracy": 0.6985,
            "F1": 0.7299234111165854,
            "Precision": 0.6351526205842921,
            "Recall": 0.8579355219103372
        },
        "test_time": 0.2692563059972599,
        "train_time": 52.310362780001014
    },
    "xgb_hist": {
        "train_time": 25.17619772598846
    }
}


The experiment shows a fairly similar performance in both libraries, being LightGBM slightly better.

# 2) Data size benchmark
Now we are going to analyze the performance of the libraries with different data sizes. The depth-wise implementation needs much more memory than the leaf-wise implementation.

In [80]:
def generate_partial_datasets(df, num_rows, test_size=0.2):
    df_small = df.sample(n=num_rows).reset_index(drop=True)
    train, _, test = split_train_val_test_df(df_plane_numeric_small, val_size=0, test_size=test_size)
    X_train, y_train = generate_feables(train)
    X_test, y_test = generate_feables(test)
    return X_train, y_train, X_test, y_test

In [88]:
sizes = [1e4, 1e5, 1e6, 1e7, 1e8]

In [86]:
def train_xgboost(parameters, X, y):
    ddata = xgb.DMatrix(data=X, label=y)
    with Timer() as t:
        clf = xgb.train(parameters, ddata, num_boost_round=50)
    return clf, t.interval

In [76]:
def test_xgboost(clf, X, y):
    ddata = xgb.DMatrix(data=X, label=y)
    with Timer() as t:
        y_pred = clf.predict(ddata)
    return y_pred, t.interval

In [87]:
def train_lightgbm(parameters, X, y):
    ddata = lgb.Dataset(X.values, y.values, free_raw_data=False)
    with Timer() as t:
        clf = lgb.train(parameters, ddata, num_boost_round=50)
    return clf, t.interval

In [78]:
def test_lightgbm(clf, X, y):
    with Timer() as t:
        y_pred = clf.predict(X.values)
    return y_pred, t.interval

Let's loop for the different data sizes.

In [89]:
xgb_data_bench_time = []
xgb_data_bench_AUC = []
xgb_hist_data_bench_time = []
xgb_hist_data_bench_AUC = []
lgbm_data_bench_time = []
lgbm_data_bench_AUC = []
for s in sizes:
    X_train, y_train, X_test, y_test = generate_partial_datasets(df_plane_numeric, s)
    clf_xgb, train_time_xgb = train_xgboost(xgb_params, X_train, y_train)
    y_pred, test_time_xgb = test_xgboost(clf_xgb, X_test, y_test)
    auc_xgb = roc_auc_score(y_test, y_pred)
    del clf_xgb #free GPU memory
    print("Computed XGBoost with {:.0e} samples in {:.3}s with AUC={:.3}".format(s, train_time_xgb, auc_xgb))
    
    clf_xgb_hist, train_time_xgb_hist = train_xgboost(xgb_hist_params, X_train, y_train)
    y_pred, test_time_xgb = test_xgboost(clf_xgb_hist, X_test, y_test)
    auc_xgb_hist = roc_auc_score(y_test, y_pred)
    del clf_xgb_hist
    print("Computed XGBoost hist with {:.0e} samples in {:.3}s with AUC={:.3}".format(s, train_time_xgb_hist, auc_xgb_hist))

    clf_lgbm, train_time_lgbm = train_lightgbm(lgbm_params, X_train, y_train)
    y_pred, test_time_lgbm = test_lightgbm(clf_lgbm, X_test, y_test)
    auc_lgbm = roc_auc_score(y_test, y_pred)
    del clf_lgbm
    print("Computed LightGBM with {:.0e} samples in {:.3}s with AUC={:.3}\n".format(s, train_time_lgbm, auc_lgbm))
   
    xgb_data_bench_time.append({s: train_time_xgb})
    xgb_data_bench_AUC.append({s:auc_xgb})
    xgb_hist_data_bench_time.append({s: train_time_xgb_hist})
    xgb_hist_data_bench_AUC.append({s: auc_xgb_hist})
    lgbm_data_bench_time.append({s: train_time_lgbm})
    lgbm_data_bench_AUC.append({s: auc_lgbm})


  locs = rs.choice(axis_length, size=n, replace=replace, p=weights)


Computed XGBoost with 1e+04 samples in 14.9s with AUC=0.786
Computed XGBoost hist with 1e+04 samples in 8.11s with AUC=0.804
Computed LightGBM with 1e+04 samples in 6.59s with AUC=0.807

Computed XGBoost with 1e+05 samples in 14.5s with AUC=0.787
Computed XGBoost hist with 1e+05 samples in 6.72s with AUC=0.804
Computed LightGBM with 1e+05 samples in 6.05s with AUC=0.806

Computed XGBoost with 1e+06 samples in 15.0s with AUC=0.788
Computed XGBoost hist with 1e+06 samples in 6.6s with AUC=0.805
Computed LightGBM with 1e+06 samples in 4.02s with AUC=0.807

Computed XGBoost with 1e+07 samples in 14.8s with AUC=0.788
Computed XGBoost hist with 1e+07 samples in 8.61s with AUC=0.806
Computed LightGBM with 1e+07 samples in 6.74s with AUC=0.807

Computed XGBoost with 1e+08 samples in 15.3s with AUC=0.786
Computed XGBoost hist with 1e+08 samples in 6.21s with AUC=0.805
Computed LightGBM with 1e+08 samples in 4.41s with AUC=0.807

