# Experiment 01: Airline dataset

In this experiment we use [the airline dataset](http://kt.ijs.si/elena_ikonomovska/data.html) to predict arrival delay. The dataset consists of a large number of records containing flight arrival and departure details for all commercial flights within the USA from October 1987 to April 2008. It contains around 116 million records and takes up 5.76 GB of memory.



In [2]:
import os,sys
import numpy as np
import pandas as pd
from lightgbm.sklearn import LGBMRegressor, LGBMClassifier
from xgboost import XGBRegressor
from sklearn.metrics import (confusion_matrix, accuracy_score, roc_auc_score, f1_score, log_loss, precision_score,
                             recall_score)
from libs.loaders import load_airline
from libs.conversion import convert_cols_categorical_to_numeric, convert_related_cols_categorical_to_numeric
from libs.timer import Timer
from libs.utils import get_number_processors
from libs.notebook_memory_management import start_watching_memory
import pkg_resources
import json
import matplotlib.pylab as plt
import warnings
from toolz import curry
from bokeh.io import output_notebook, show
from bokeh.plotting import figure

print("System version: {}".format(sys.version))
print("XGBoost version: {}".format(pkg_resources.get_distribution('xgboost').version))
print("LightGBM version: {}".format(pkg_resources.get_distribution('lightgbm').version))

%matplotlib inline
warnings.filterwarnings("ignore", category=DeprecationWarning) 

Using TensorFlow backend.

System version: 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:09:58) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
XGBoost version: 0.6
LightGBM version: 0.2


The dataset we are going to use in this notebook is huge, so we want to monitor memory consumption.

In [78]:
output_notebook()

In [78] used 0.0000 MiB RAM in 0.11s, total RAM usage 74903.22 MiB


In [3]:
start_watching_memory()

In [3] used 19.6289 MiB RAM in 0.78s, total RAM usage 225.94 MiB


# 1) XGBoost vs LightGBM benchmark
In the next section we compare the speed, accuracy, and other metrics of both libraries on the airline arrival delay dataset.

### Data loading and management

In [4]:
%%time
df_plane = load_airline()
print(df_plane.shape)

INFO:libs.loaders:MOUNT_POINT not found in environment. Defaulting to /fileshare


(115069017, 14)
CPU times: user 1min 38s, sys: 14.9 s, total: 1min 53s
Wall time: 3min 57s
In [4] used 21997.4922 MiB RAM in 237.78s, total RAM usage 22223.43 MiB


In [5]:
df_plane.head()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay
0,1987,10,1,4,1,556,AA,190,247,SFO,ORD,1846,0,27
1,1987,10,1,4,5,114,EA,57,74,LAX,SFO,337,0,5
2,1987,10,1,4,5,35,HP,351,167,ICT,LAS,987,0,17
3,1987,10,1,4,5,40,DL,251,35,MCO,PBI,142,0,-2
4,1987,10,1,4,8,517,UA,500,208,LAS,ORD,1515,0,17


In [5] used 0.2539 MiB RAM in 0.08s, total RAM usage 22223.68 MiB


The first step is to convert the categorical features to numeric features.

In [6]:
%%time
df_plane_numeric = convert_related_cols_categorical_to_numeric(df_plane, col_list=['Origin','Dest'])
del df_plane

CPU times: user 1min 42s, sys: 6.52 s, total: 1min 48s
Wall time: 1min 47s
In [6] used 5267.8203 MiB RAM in 108.09s, total RAM usage 27491.50 MiB


In [7]:
df_plane_numeric.head()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay
0,1987,10,1,4,1,556,AA,190,247,0,33,1846,0,27
1,1987,10,1,4,5,114,EA,57,74,1,0,337,0,5
2,1987,10,1,4,5,35,HP,351,167,2,4,987,0,17
3,1987,10,1,4,5,40,DL,251,35,3,41,142,0,-2
4,1987,10,1,4,8,517,UA,500,208,4,33,1515,0,17


In [7] used 0.0977 MiB RAM in 0.12s, total RAM usage 27491.60 MiB
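
The conversion helpers come from the repository's `libs.conversion` module, whose source is not shown in this notebook. The output above suggests that `Origin` and `Dest` share a single airport-to-integer mapping (for example, `SFO` is encoded as 0 in both columns). A minimal sketch of that idea, assuming a shared vocabulary built with `pd.factorize` (the helper name `encode_related_columns` is hypothetical and is not the library's function), might look like this:

```python
import pandas as pd


def encode_related_columns(df, col_list):
    """Encode several related categorical columns with one shared mapping.

    Hypothetical helper for illustration; not the function from libs.conversion.
    """
    # Stack the columns so that every category gets exactly one integer code.
    stacked = pd.concat([df[col] for col in col_list], ignore_index=True)
    _, uniques = pd.factorize(stacked)
    mapping = {value: code for code, value in enumerate(uniques)}

    encoded = df.copy()
    for col in col_list:
        encoded[col] = encoded[col].map(mapping)
    return encoded, mapping


toy = pd.DataFrame({
    'Origin': ['SFO', 'LAX', 'ICT'],
    'Dest': ['ORD', 'SFO', 'LAS'],
})
encoded, mapping = encode_related_columns(toy, ['Origin', 'Dest'])
print(encoded)
```

Sharing the mapping keeps the same airport comparable across both columns, which independent per-column encoding would not guarantee.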


In [8]:
%%time
df_plane_numeric = convert_cols_categorical_to_numeric(df_plane_numeric, col_list='UniqueCarrier')

CPU times: user 1min 1s, sys: 8.51 s, total: 1min 10s
Wall time: 1min 9s
In [8] used 12290.9609 MiB RAM in 69.61s, total RAM usage 39782.56 MiB


In [9]:
df_plane_numeric.head()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay
0,1987,10,1,4,1,556,0,190,247,0,33,1846,0,27
1,1987,10,1,4,5,114,1,57,74,1,0,337,0,5
2,1987,10,1,4,5,35,2,351,167,2,4,987,0,17
3,1987,10,1,4,5,40,3,251,35,3,41,142,0,-2
4,1987,10,1,4,8,517,4,500,208,4,33,1515,0,17


In [9] used 0.0039 MiB RAM in 0.12s, total RAM usage 39782.57 MiB


To simplify the pipeline, we are going to set up a classification problem where the goal is to classify whether a flight arrived late or not. For that we need to binarize the variable `ArrDelay`.

If you want to extend this experiment, you can set up a regression problem and try to predict the number of minutes of delay a flight has. Both XGBoost and LightGBM have regression classes.

In [10]:
%%time
df_plane_numeric['ArrDelayBinary'] = 1*(df_plane_numeric['ArrDelay'] > 0)

CPU times: user 568 ms, sys: 500 ms, total: 1.07 s
Wall time: 1.07 s
In [10] used 877.9141 MiB RAM in 1.17s, total RAM usage 40660.48 MiB


In [11]:
df_plane_numeric.head()

Unnamed: 0,Year,Month,DayofMonth,DayofWeek,CRSDepTime,CRSArrTime,UniqueCarrier,FlightNum,ActualElapsedTime,Origin,Dest,Distance,Diverted,ArrDelay,ArrDelayBinary
0,1987,10,1,4,1,556,0,190,247,0,33,1846,0,27,1
1,1987,10,1,4,5,114,1,57,74,1,0,337,0,5,1
2,1987,10,1,4,5,35,2,351,167,2,4,987,0,17,1
3,1987,10,1,4,5,40,3,251,35,3,41,142,0,-2,0
4,1987,10,1,4,8,517,4,500,208,4,33,1515,0,17,1


In [11] used 0.0000 MiB RAM in 0.12s, total RAM usage 40660.48 MiB
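
To make the binarization rule concrete, here is a small sketch on made-up delay values (the numbers are illustrative, not taken from the dataset). It shows that a flight arriving exactly on time (`ArrDelay == 0`) or early is labelled 0, and that `(x > 0).astype(int)` yields the same labels as the `1*(x > 0)` idiom used above:

```python
import pandas as pd

# Illustrative arrival delays in minutes; negative means early arrival.
arr_delay = pd.Series([27, 5, 17, -2, 0], name='ArrDelay')

# Strictly positive delay -> 1 (delayed), otherwise 0 (on time or early).
delayed = (arr_delay > 0).astype(int)

# Same rule written as in the notebook cell.
same_as_notebook = 1 * (arr_delay > 0)

print(delayed.tolist())
```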


Once the features are prepared, let's split the dataset into training and test sets. We won't use a validation set in this example (however, you can try adding one).

In [12]:
def split_train_val_test_df(df, val_size=0.2, test_size=0.2):
    train, validate, test = np.split(df.sample(frac=1), 
                                     [int((1-val_size-test_size)*len(df)), int((1-test_size)*len(df))])
    return train, validate, test

In [12] used 0.0039 MiB RAM in 0.10s, total RAM usage 40660.48 MiB


In [13]:
%%time
train, validate, test = split_train_val_test_df(df_plane_numeric, val_size=0, test_size=0.2)
print(train.shape)
print(validate.shape)
print(test.shape)

(92055213, 15)
(0, 15)
(23013804, 15)
CPU times: user 52.8 s, sys: 43.5 s, total: 1min 36s
Wall time: 1min 35s
In [13] used 14018.3594 MiB RAM in 95.70s, total RAM usage 54678.84 MiB
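
The sizes printed above follow directly from the two cut points that `split_train_val_test_df` passes to `np.split`. The sketch below recomputes them with plain arithmetic (the helper name `split_boundaries` is introduced here for illustration only). With `val_size=0` both cut points coincide, which is why the validation set is empty:

```python
def split_boundaries(n_rows, val_size=0.2, test_size=0.2):
    """Return the row positions where train ends and where validation ends."""
    train_end = int((1 - val_size - test_size) * n_rows)
    val_end = int((1 - test_size) * n_rows)
    return train_end, val_end


n_rows = 115069017  # number of rows reported by load_airline()
train_end, val_end = split_boundaries(n_rows, val_size=0, test_size=0.2)

# Sizes of the train, validation and test partitions.
sizes = (train_end, val_end - train_end, n_rows - val_end)
print(sizes)
```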


In [14]:
def generate_feables(df):
    X = df[df.columns.difference(['ArrDelay', 'ArrDelayBinary'])]
    y = df['ArrDelayBinary']
    return X,y

In [14] used 0.0000 MiB RAM in 0.07s, total RAM usage 54678.84 MiB


In [15]:
%%time
X_train, y_train = generate_feables(train)
X_val, y_val = generate_feables(validate)
X_test, y_test = generate_feables(test)


CPU times: user 2.24 s, sys: 4.43 s, total: 6.68 s
Wall time: 6.58 s
In [15] used 11412.8555 MiB RAM in 6.68s, total RAM usage 66091.70 MiB


In [16]:
del train, validate, test

In [16] used 0.0000 MiB RAM in 0.10s, total RAM usage 66091.70 MiB


### Training 
Now we are going to create two pipelines, one for XGBoost and one for LightGBM. The two libraries grow trees differently, so it is difficult to compare them in exactly the same model setting. XGBoost grows trees depth-wise and controls model complexity with `max_depth`. LightGBM instead grows trees leaf-wise and controls model complexity with `num_leaves`. As a trade-off, we use XGBoost with `max_depth=8`, which allows at most 2^8 = 256 leaves per tree, and compare it with LightGBM with `num_leaves=255`.
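
As a quick sanity check on how depth relates to leaf count, the following sketch computes the size of a fully grown binary tree. The helper names are introduced here for illustration; the only assumption is that every split is binary, as in both libraries:

```python
def max_leaves(depth):
    """Maximum number of leaves of a binary tree with `depth` levels of splits."""
    return 2 ** depth


def max_split_nodes(depth):
    """Maximum number of internal (split) nodes of the same tree."""
    return 2 ** depth - 1


depth = 8
print("leaves:", max_leaves(depth))
print("split nodes:", max_split_nodes(depth))
```

A depth-wise tree only reaches this bound if every branch is fully grown, whereas a leaf-wise tree with the same leaf budget can spend it on a few deep branches.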

In [17]:
results_dict = dict()

In [17] used 0.0000 MiB RAM in 0.10s, total RAM usage 66091.70 MiB


In [18]:
number_processors = get_number_processors()
print(number_processors)

20
In [18] used 0.0000 MiB RAM in 0.10s, total RAM usage 66091.70 MiB


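Let's start with the XGBoost model. Note that the code uses `XGBRegressor` on the binary target; its predictions are later clipped to the (0, 1) range and treated as scores.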

In [19]:
xgb_clf_pipeline = XGBRegressor(max_depth=8,
                                n_estimators=50,
                                min_child_weight=30,
                                learning_rate=0.1,
                                colsample_bytree=0.80,
                                scale_pos_weight=2,
                                gamma=0.1,
                                reg_lambda=1,
                                subsample=1,
                                n_jobs=number_processors,
                                random_state=77)

In [19] used 0.0000 MiB RAM in 0.10s, total RAM usage 66091.70 MiB


In [20]:
with Timer() as t:
    xgb_clf_pipeline.fit(X_train, y_train)

In [20] used 20226.5977 MiB RAM in 1803.87s, total RAM usage 86318.30 MiB


In [21]:
results_dict['xgb']={
    'train_time': t.interval
}

In [21] used 0.0273 MiB RAM in 0.10s, total RAM usage 86318.32 MiB
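
`Timer` comes from the repository's `libs.timer` module, whose source is not shown here. A minimal sketch of a context manager that exposes an `interval` attribute in the way the cells above use it might look like this (the implementation details are an assumption, not the library's code):

```python
import time


class Timer:
    """Measure the wall-clock time spent inside a `with` block."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Store the elapsed seconds; returning False lets exceptions propagate.
        self.interval = time.perf_counter() - self._start
        return False


with Timer() as t:
    total = sum(range(100000))

print("Elapsed: {:.6f} s".format(t.interval))
```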


Training XGBoost model with leaf-wise growth

In [22]:
xgb_hist_clf_pipeline = XGBRegressor(max_depth=0,
                                    n_estimators=50,
                                    min_child_weight=30,
                                    learning_rate=0.1,
                                    colsample_bytree=0.80,
                                    scale_pos_weight=2,
                                    gamma=0.1,
                                    reg_lambda=1,
                                    subsample=1,
                                    max_leaves=255,
                                    grow_policy='lossguide',
                                    tree_method='hist',
                                    n_jobs=number_processors,
                                    random_state=77)

In [22] used 0.0000 MiB RAM in 0.10s, total RAM usage 86318.32 MiB


In [23]:
with Timer() as t:
    xgb_hist_clf_pipeline.fit(X_train, y_train)

In [23] used 23414.4922 MiB RAM in 604.42s, total RAM usage 109732.82 MiB


In [24]:
results_dict['xgb_hist']={
    'train_time': t.interval
}

In [24] used 0.0000 MiB RAM in 0.10s, total RAM usage 109732.82 MiB


Training LightGBM model

In [25]:
lgbm_clf_pipeline = LGBMRegressor(num_leaves=255,
                                  n_estimators=50,
                                  min_child_weight=30,
                                  learning_rate=0.1,
                                  colsample_bytree=0.80,
                                  scale_pos_weight=2,
                                  min_split_gain=0.1,
                                  reg_lambda=1,
                                  subsample=1,
                                  nthread=number_processors,
                                  seed=77)

In [25] used 0.0000 MiB RAM in 0.10s, total RAM usage 109732.82 MiB


In [26]:
with Timer() as t:
    lgbm_clf_pipeline.fit(X_train, y_train)

In [26] used 12260.1016 MiB RAM in 493.91s, total RAM usage 121992.92 MiB


In [27]:
results_dict['lgbm']={
    'train_time': t.interval
}

In [27] used 1.3086 MiB RAM in 0.11s, total RAM usage 121994.23 MiB


As can be seen in the results, given the specific versions and parameters of XGBoost and LightGBM and this specific dataset, LightGBM is the fastest to train: about 494 s, compared with about 604 s for XGBoost with the histogram method and about 1804 s for standard XGBoost.

In general terms, leaf-wise algorithms are more efficient and converge much faster than depth-wise ones. However, leaf-wise growth may cause over-fitting when the dataset is small or there are too many leaves.
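
To put the training times in proportion, the sketch below computes each model's training time relative to LightGBM. It uses the per-cell times reported by the memory watcher for cells 20, 23 and 26, which include a small amount of overhead on top of the `Timer` intervals:

```python
# Seconds reported for the three training cells above.
train_seconds = {'xgb': 1803.87, 'xgb_hist': 604.42, 'lgbm': 493.91}

baseline = train_seconds['lgbm']
relative = {name: secs / baseline for name, secs in train_seconds.items()}

for name, ratio in sorted(relative.items(), key=lambda item: item[1]):
    print("{}: {:.2f}x the LightGBM training time".format(name, ratio))
```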

### Evaluation
Now let's evaluate the models on the test set.

In [28]:
with Timer() as t:
    y_prob_xgb = np.clip(xgb_clf_pipeline.predict(X_test), 0.0001, 0.9999)

In [28] used 187.4609 MiB RAM in 12.82s, total RAM usage 122181.69 MiB


In [29]:
results_dict['xgb']['test_time'] = t.interval

In [29] used 0.1641 MiB RAM in 0.10s, total RAM usage 122181.85 MiB


In [30]:
with Timer() as t:
    y_prob_xgb_hist = np.clip(xgb_hist_clf_pipeline.predict(X_test), 0.0001, 0.9999)

In [30] used -9041.6211 MiB RAM in 14.76s, total RAM usage 113140.23 MiB


In [31]:
results_dict['xgb_hist']['test_time'] = t.interval

In [31] used 0.5859 MiB RAM in 0.10s, total RAM usage 113140.82 MiB


In [32]:
with Timer() as t:
    y_prob_lgbm = np.clip(lgbm_clf_pipeline.predict(X_test), 0.0001, 0.9999)

In [32] used 2458.9727 MiB RAM in 16.12s, total RAM usage 115599.79 MiB


In [33]:
results_dict['lgbm']['test_time'] = t.interval


In [33] used 0.0039 MiB RAM in 0.10s, total RAM usage 115599.79 MiB


### Metrics
We are going to obtain some metrics to evaluate the performance of each of the models.

In [34]:
#https://github.com/miguelgfierro/codebase/blob/master/python/machine_learning/metrics.py
def classification_metrics_binary(y_true, y_pred):
    m_acc = accuracy_score(y_true, y_pred)
    m_f1 = f1_score(y_true, y_pred)
    m_precision = precision_score(y_true, y_pred)
    m_recall = recall_score(y_true, y_pred)
    report = {'Accuracy':m_acc, 'Precision':m_precision, 'Recall':m_recall, 'F1':m_f1}
    return report

In [34] used 0.0000 MiB RAM in 0.12s, total RAM usage 115599.79 MiB


In [35]:
#https://github.com/miguelgfierro/codebase/blob/master/python/machine_learning/metrics.py
def classification_metrics_binary_prob(y_true, y_prob):
    m_auc = roc_auc_score(y_true, y_prob)
    report = {'AUC':m_auc}
    return report

In [35] used 0.0000 MiB RAM in 0.10s, total RAM usage 115599.79 MiB


In [36]:
def binarize_prediction(y, threshold=0.5):
    y_pred = np.where(y > threshold, 1, 0)
    return y_pred

In [36] used 0.0000 MiB RAM in 0.10s, total RAM usage 115599.79 MiB


In [37]:
y_pred_xgb = binarize_prediction(y_prob_xgb)
y_pred_xgb_hist = binarize_prediction(y_prob_xgb_hist)
y_pred_lgbm = binarize_prediction(y_prob_lgbm)


In [37] used 526.7461 MiB RAM in 0.59s, total RAM usage 116126.54 MiB
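
Because the models are regressors, their raw outputs are not guaranteed to lie in [0, 1]. The sketch below walks through the two post-processing steps used above, clipping and thresholding, on made-up scores (the values are illustrative only):

```python
import numpy as np

# Illustrative raw regressor outputs, including values outside [0, 1].
raw_scores = np.array([-0.2, 0.3, 0.5, 0.7, 1.4])

# Step 1: clip into the open interval used in the notebook.
scores = np.clip(raw_scores, 0.0001, 0.9999)

# Step 2: label as delayed (1) only when the score is strictly above 0.5.
labels = np.where(scores > 0.5, 1, 0)

print(scores)
print(labels)
```

Note that a score of exactly 0.5 is labelled 0 because the comparison is strict.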


In [38]:
report_xgb = classification_metrics_binary(y_test, y_pred_xgb)
report2_xgb = classification_metrics_binary_prob(y_test, y_prob_xgb)
report_xgb.update(report2_xgb)

In [38] used 526.2344 MiB RAM in 33.18s, total RAM usage 116652.77 MiB


In [39]:
results_dict['xgb']['performance'] = report_xgb

In [39] used 0.0000 MiB RAM in 0.10s, total RAM usage 116652.77 MiB


In [40]:
report_xgb_hist = classification_metrics_binary(y_test, y_pred_xgb_hist)
report2_xgb_hist = classification_metrics_binary_prob(y_test, y_prob_xgb_hist)
report_xgb_hist.update(report2_xgb_hist)

In [40] used 0.0000 MiB RAM in 26.86s, total RAM usage 116652.77 MiB


In [41]:
results_dict['xgb_hist']['performance'] = report_xgb_hist

In [41] used 0.0000 MiB RAM in 0.10s, total RAM usage 116652.77 MiB


In [42]:
report_lgbm = classification_metrics_binary(y_test, y_pred_lgbm)
report2_lgbm = classification_metrics_binary_prob(y_test, y_prob_lgbm)
report_lgbm.update(report2_lgbm)

In [42] used 0.0000 MiB RAM in 27.64s, total RAM usage 116652.77 MiB


In [43]:
results_dict['lgbm']['performance'] = report_lgbm

In [43] used 0.0000 MiB RAM in 0.10s, total RAM usage 116652.77 MiB


In [44]:
# Results
print(json.dumps(results_dict, indent=4, sort_keys=True))

{
    "lgbm": {
        "performance": {
            "AUC": 0.8088803309182079,
            "Accuracy": 0.7356248449843407,
            "F1": 0.6940313699064132,
            "Precision": 0.7702070051501826,
            "Recall": 0.6315675645794081
        },
        "test_time": 16.16001626700745,
        "train_time": 498.1318734950037
    },
    "xgb": {
        "performance": {
            "AUC": 0.7857766547042155,
            "Accuracy": 0.6307526561015293,
            "F1": 0.6980503896875465,
            "Precision": 0.570517871408198,
            "Recall": 0.8990144248502152
        },
        "test_time": 12.853620631009107,
        "train_time": 1815.0101890249935
    },
    "xgb_hist": {
        "performance": {
            "AUC": 0.807637854492637,
            "Accuracy": 0.6739416916907783,
            "F1": 0.7171351977471667,
            "Precision": 0.6096706129530747,
            "Recall": 0.8705914128181881
        },
        "test_time": 14.775966885004891,
        "

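The experiment shows fairly similar performance across the libraries, with LightGBM slightly better in terms of AUC (0.809) and accuracy (0.736), while both XGBoost models obtain higher recall and F1.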
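
As a consistency check on the reported metrics, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the precision and recall values printed above and compares it with the reported F1 for each model:

```python
# (precision, recall, reported F1) copied from the printed results.
reported = {
    'lgbm': (0.7702070051501826, 0.6315675645794081, 0.6940313699064132),
    'xgb': (0.570517871408198, 0.8990144248502152, 0.6980503896875465),
    'xgb_hist': (0.6096706129530747, 0.8705914128181881, 0.7171351977471667),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {name: f1_from(p, r) for name, (p, r, _) in reported.items()}

for name, (_, _, f1) in reported.items():
    print("{}: reported {:.6f}, recomputed {:.6f}".format(name, f1, recomputed[name]))
```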

In [45]:
del xgb_clf_pipeline, xgb_hist_clf_pipeline, lgbm_clf_pipeline, X_train, X_test, X_val

In [45] used -57324.6953 MiB RAM in 1.84s, total RAM usage 59328.08 MiB


# 2) Concept drift
In this section we look for concept drift in the dataset to check whether retraining is valuable.

### Data management
We are going to group the data by year to try to find concept drift.

In [46]:
initial_year = 1987
num_ini = 5

In [46] used 0.0000 MiB RAM in 0.10s, total RAM usage 59328.08 MiB


In [47]:
def generate_subset_by_year(df, year_ini, year_end):
    return df[df['Year'].isin(range(year_ini, year_end))]

In [47] used 0.0000 MiB RAM in 0.11s, total RAM usage 59328.08 MiB


In [48]:
%%time
subset_base = generate_subset_by_year(df_plane_numeric, initial_year, initial_year + num_ini)
print(subset_base.shape)

(16810190, 15)
CPU times: user 1.8 s, sys: 688 ms, total: 2.48 s
Wall time: 2.44 s
In [48] used 1924.1172 MiB RAM in 2.54s, total RAM usage 61252.20 MiB
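
Since `range` excludes its end value, `generate_subset_by_year(df, 1987, 1987 + 5)` selects the years 1987 to 1991 inclusive. The short sketch below spells this out for the span of 5 used here and the span of 15 used later for retraining (the helper name `years_covered` is introduced for illustration):

```python
initial_year = 1987


def years_covered(year_ini, span):
    """Years selected by range(year_ini, year_ini + span); the end is exclusive."""
    return list(range(year_ini, year_ini + span))


base_years = years_covered(initial_year, 5)
retrain_years = years_covered(initial_year, 15)

print("base:", base_years[0], "to", base_years[-1])
print("retrain:", retrain_years[0], "to", retrain_years[-1])
```

This matches the evaluation outputs, where the first held-out year is 1992 for the base model and 2002 for the retrained one.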


In [49]:
%%time
rest_df = df_plane_numeric.loc[df_plane_numeric.index.difference(subset_base.index)]
print(rest_df.shape)

(98258827, 15)
CPU times: user 1min 15s, sys: 7.88 s, total: 1min 23s
Wall time: 1min 22s
In [49] used 15856.0078 MiB RAM in 82.65s, total RAM usage 77108.20 MiB
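
Selecting the remaining rows through `index.difference` takes over a minute on the full dataset. One possible alternative, sketched here on a toy frame and not benchmarked on the airline data, is to reuse the boolean year mask and negate it. The sketch assumes a unique index, in which case both approaches return the same rows:

```python
import pandas as pd

# Toy data standing in for df_plane_numeric.
toy = pd.DataFrame({
    'Year': [1987, 1990, 1992, 1995, 1991],
    'ArrDelayBinary': [1, 0, 1, 0, 1],
})

in_base = toy['Year'].isin(range(1987, 1992))
subset_base = toy[in_base]

# Alternative: negate the mask instead of computing an index difference.
rest_by_mask = toy[~in_base]

# Approach used in the notebook.
rest_by_index = toy.loc[toy.index.difference(subset_base.index)]

print(rest_by_mask)
```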


### Training
Let's see what happens when we train on a subset of the data and then evaluate on the data of the following years.

In [50]:
X_train, y_train = generate_feables(subset_base)
del(subset_base)

In [50] used -8867.6094 MiB RAM in 1.09s, total RAM usage 68240.59 MiB


In [51]:
clf = LGBMClassifier(num_leaves=255,
                    n_estimators=100,
                    min_child_weight=30,
                    learning_rate=0.1,
                    subsample=0.80,
                    colsample_bytree=0.80,
                    seed=42)

In [51] used 0.0000 MiB RAM in 0.11s, total RAM usage 68240.59 MiB


In [52]:
%%time
clf.fit(X_train, y_train)

CPU times: user 26min 56s, sys: 35.2 s, total: 27min 31s
Wall time: 1min 26s


LGBMClassifier(boosting_type='gbdt', colsample_bytree=0.8, learning_rate=0.1,
        max_bin=255, max_depth=-1, min_child_samples=10,
        min_child_weight=30, min_split_gain=0, n_estimators=100,
        nthread=-1, num_leaves=255, objective='binary', reg_alpha=0,
        reg_lambda=0, seed=42, silent=True, subsample=0.8,
        subsample_for_bin=50000, subsample_freq=1)

In [52] used 2116.9453 MiB RAM in 86.34s, total RAM usage 70357.54 MiB


### Prediction

In [53]:
@curry
def predict_accuracy(clf, test_df):
    X_test, y_test = generate_feables(test_df)
    y_pred = clf.predict(X_test)
    return accuracy_score(y_test, y_pred)

In [53] used 0.0000 MiB RAM in 0.11s, total RAM usage 70357.54 MiB
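
The `@curry` decorator lets `predict_accuracy(clf)` return a new function that only expects the test DataFrame, which is the shape needed for a per-year evaluation. The sketch below shows the same pattern with the standard library's `functools.partial`, using a toy frame and a hypothetical stand-in model (`ThresholdModel` is not part of the notebook):

```python
from functools import partial

import pandas as pd


class ThresholdModel:
    """Hypothetical stand-in for a fitted classifier."""

    def predict(self, X):
        return (X['Distance'] > 500).astype(int).to_numpy()


def predict_accuracy(clf, test_df):
    X = test_df[['Distance']]
    y = test_df['ArrDelayBinary'].to_numpy()
    return float((clf.predict(X) == y).mean())


toy = pd.DataFrame({
    'Year': [1992, 1992, 1993, 1993],
    'Distance': [300, 800, 900, 200],
    'ArrDelayBinary': [0, 1, 0, 0],
})

# Fix the classifier argument; the result takes only the per-year DataFrame.
score_group = partial(predict_accuracy, ThresholdModel())
accuracy_by_year = {year: score_group(group) for year, group in toy.groupby('Year')}
print(accuracy_by_year)
```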


In [54]:
%%time
accuracy_series = rest_df.groupby('Year').apply(predict_accuracy(clf))
print(accuracy_series)

Year
1992    0.755851
1993    0.755262
1994    0.743402
1995    0.730882
1996    0.722218
1997    0.720392
1998    0.705345
1999    0.700257
2000    0.689137
2001    0.673514
2002    0.679002
2003    0.684676
2004    0.679677
2005    0.672613
2006    0.663230
2007    0.651053
2008    0.635933
dtype: float64
CPU times: user 37min 19s, sys: 21.2 s, total: 37min 40s
Wall time: 2min 44s
In [54] used 15493.8555 MiB RAM in 164.21s, total RAM usage 85851.39 MiB


From the results we can observe that the accuracy of the model degrades as the years pass, from 0.756 in 1992 to 0.636 in 2008.
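
To quantify the degradation, the sketch below takes the yearly accuracies printed above and computes the total drop between the first and last evaluated year, along with the average drop per year:

```python
# Accuracy of the model trained on 1987-1991, copied from the output above.
accuracy_by_year = {
    1992: 0.755851, 1993: 0.755262, 1994: 0.743402, 1995: 0.730882,
    1996: 0.722218, 1997: 0.720392, 1998: 0.705345, 1999: 0.700257,
    2000: 0.689137, 2001: 0.673514, 2002: 0.679002, 2003: 0.684676,
    2004: 0.679677, 2005: 0.672613, 2006: 0.663230, 2007: 0.651053,
    2008: 0.635933,
}

years = sorted(accuracy_by_year)
total_drop = accuracy_by_year[years[0]] - accuracy_by_year[years[-1]]
mean_drop_per_year = total_drop / (years[-1] - years[0])

print("Total drop: {:.4f}".format(total_drop))
print("Mean drop per year: {:.4f}".format(mean_drop_per_year))
```

The decline is not strictly monotonic: accuracy recovers slightly in 2002 and 2003 before falling again.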

### Retraining
Now let's see what happens when we retrain on a longer period and evaluate on the data of the following years.

In [55]:
new_init = 15

In [55] used 0.0000 MiB RAM in 0.10s, total RAM usage 85851.39 MiB


In [56]:
%%time
subset_retrain = generate_subset_by_year(df_plane_numeric, initial_year, initial_year + new_init) 
print(subset_retrain.shape)

(69425349, 15)
CPU times: user 4.71 s, sys: 5.53 s, total: 10.2 s
Wall time: 10.1 s
In [56] used 8474.7773 MiB RAM in 10.23s, total RAM usage 94326.17 MiB


In [57]:
X_train, y_train = generate_feables(subset_retrain)

In [57] used 5218.4844 MiB RAM in 3.90s, total RAM usage 99544.66 MiB


In [58]:
clf_retrain = LGBMClassifier(num_leaves=255,
                            n_estimators=100,
                            min_child_weight=30,
                            learning_rate=0.1,
                            subsample=0.80,
                            colsample_bytree=0.80,
                            seed=42)

In [58] used 0.0000 MiB RAM in 0.10s, total RAM usage 99544.66 MiB


In [59]:
%%time
clf_retrain.fit(X_train, y_train)

CPU times: user 1h 57min 33s, sys: 3min 50s, total: 2h 1min 23s
Wall time: 6min 45s


LGBMClassifier(boosting_type='gbdt', colsample_bytree=0.8, learning_rate=0.1,
        max_bin=255, max_depth=-1, min_child_samples=10,
        min_child_weight=30, min_split_gain=0, n_estimators=100,
        nthread=-1, num_leaves=255, objective='binary', reg_alpha=0,
        reg_lambda=0, seed=42, silent=True, subsample=0.8,
        subsample_for_bin=50000, subsample_freq=1)

In [59] used 9798.9922 MiB RAM in 405.83s, total RAM usage 109343.65 MiB


### Prediction

In [60]:
%%time
rest_df = df_plane_numeric.loc[df_plane_numeric.index.difference(subset_retrain.index)]
print(rest_df.shape)

(45643668, 15)
CPU times: user 31.7 s, sys: 11 s, total: 42.7 s
Wall time: 41.8 s
In [60] used -5398.7500 MiB RAM in 41.88s, total RAM usage 103944.90 MiB


In [61]:
%%time
accuracy_retrain = rest_df.groupby('Year').apply(predict_accuracy(clf_retrain))
print(accuracy_retrain)

Year
2002    0.751166
2003    0.749844
2004    0.726010
2005    0.720412
2006    0.707695
2007    0.696212
2008    0.699735
dtype: float64
CPU times: user 16min 44s, sys: 12.1 s, total: 16min 56s
Wall time: 1min 16s
In [61] used -4243.3945 MiB RAM in 76.34s, total RAM usage 99701.50 MiB


### Plot

In [68]:
def plot_metrics(metric1, metric2, legend1=None, legend2=None, x_label=None, y_label=None):
    lists = sorted(metric1.items()) 
    x, y = zip(*lists) 
    fig, ax = plt.subplots()
    ax.plot(x, y, label=legend1, color='#5975a4')
    lists2 = sorted(metric2.items()) 
    x2, y2 = zip(*lists2) 
    ax.plot(x2, y2, label=legend2, color='#5f9e6f')
    legend = ax.legend(loc=0)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
    plt.show()
    return ax

In [68] used -22750.0078 MiB RAM in 0.62s, total RAM usage 84798.14 MiB


In [82]:
from bokeh.models.sources import ColumnDataSource
from bokeh.models import HoverTool

In [82] used 0.0000 MiB RAM in 0.10s, total RAM usage 74903.22 MiB


In [83]:
data_cds = ColumnDataSource(pd.DataFrame({
    'train_acc': accuracy_series,
    'retrain_acc': accuracy_retrain
}))

In [83] used 0.0000 MiB RAM in 0.10s, total RAM usage 74903.22 MiB


In [84]:
# Airline Retrain Results
p = figure(y_axis_label='Accuracy', plot_width=700, plot_height=350, tools="pan,wheel_zoom,box_zoom,reset")
l1 = p.line('Year', 'train_acc', legend=' Train accuracy', line_color="#5975a4", source=data_cds, line_width=6, line_cap="round")
p.line('Year', 'retrain_acc', legend=' Retrain accuracy', line_color="#a1bae3", source=data_cds, line_width=6, line_cap="round")
l1_hover = HoverTool(renderers=[l1], tooltips=[( 'Train',  '@{train_acc}{0.4f}' ), ( 'Retrain',  '@{retrain_acc}{0.4f}' )], mode='vline')
p.add_tools(l1_hover)
show(p)

In [84] used 0.0039 MiB RAM in 0.22s, total RAM usage 74903.23 MiB


As can be seen, accuracy is higher after retraining for every year from 2002 to 2008 (for example, 0.751 versus 0.679 in 2002). We have found concept drift in this dataset.
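
The comparison can also be made numerically. The sketch below uses the accuracies printed earlier for the overlapping years and computes the gain from retraining for each year:

```python
# Accuracy of the original model (trained on 1987-1991) for 2002-2008.
original = {2002: 0.679002, 2003: 0.684676, 2004: 0.679677, 2005: 0.672613,
            2006: 0.663230, 2007: 0.651053, 2008: 0.635933}

# Accuracy of the retrained model (trained on 1987-2001) for the same years.
retrained = {2002: 0.751166, 2003: 0.749844, 2004: 0.726010, 2005: 0.720412,
             2006: 0.707695, 2007: 0.696212, 2008: 0.699735}

gains = {year: retrained[year] - original[year] for year in sorted(original)}
mean_gain = sum(gains.values()) / len(gains)

for year, gain in gains.items():
    print("{}: +{:.4f}".format(year, gain))
print("Mean gain: {:.4f}".format(mean_gain))
```

The retrained model's accuracy also declines with the distance from its training period, which is consistent with the same drift.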