# Experiment 06: HIGGS boson (GPU version)

This experiment uses the [HIGGS dataset](https://archive.ics.uci.edu/ml/datasets/HIGGS) to predict the appearance of the Higgs boson. The dataset consists of 11 million observations. More information about the data can be found in [loaders.py](libs/loaders.py).

The details of the machine we used and the version of the libraries can be found in [experiment 01](01_airline.ipynb).

In [25]:
import sys
import os
import pandas as pd
import numpy as np
from tqdm import tqdm
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import train_test_split
import json
import seaborn
import matplotlib.pyplot as plt
import pkg_resources
from libs.loaders import load_higgs
from libs.timer import Timer
from libs.metrics import classification_metrics_binary, classification_metrics_binary_prob, binarize_prediction
import warnings

print("System version: {}".format(sys.version))
print("XGBoost version: {}".format(pkg_resources.get_distribution('xgboost').version))
print("LightGBM version: {}".format(pkg_resources.get_distribution('lightgbm').version))

warnings.filterwarnings("ignore")
%matplotlib inline
%load_ext autoreload
%autoreload 2

System version: 3.5.3 |Anaconda 4.4.0 (64-bit)| (default, Mar  6 2017, 11:58:13) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
XGBoost version: 0.6
LightGBM version: 0.2
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Data loading and management

In [2]:
%%time
df = load_higgs()
print(df.shape)

INFO:libs.loaders:MOUNT_POINT not found in environment. Defaulting to /fileshare


(11000000, 29)
CPU times: user 1min 14s, sys: 6.77 s, total: 1min 21s
Wall time: 4min 7s


In [3]:
df.head(5)

| | boson | lepton_pT | lepton_eta | lepton_phi | missing_energy_magnitude | missing_energy_phi | jet_1_pt | jet_1_eta | jet_1_phi | jet_1_b-tag | ... | jet_4_eta | jet_4_phi | jet_4_b-tag | m_jj | m_jjj | m_lv | m_jlv | m_bb | m_wbb | m_wwbb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.869293 | -0.635082 | 0.22569 | 0.32747 | -0.689993 | 0.754202 | -0.248573 | -1.092064 | 0.0 | ... | -0.010455 | -0.045767 | 3.101961 | 1.35376 | 0.979563 | 0.978076 | 0.920005 | 0.721657 | 0.988751 | 0.876678 |
| 1 | 1.0 | 0.907542 | 0.329147 | 0.359412 | 1.49797 | -0.31301 | 1.095531 | -0.557525 | -1.58823 | 2.173076 | ... | -1.13893 | -0.000819 | 0.0 | 0.30222 | 0.833048 | 0.9857 | 0.978098 | 0.779732 | 0.992356 | 0.798343 |
| 2 | 1.0 | 0.798835 | 1.470639 | -1.635975 | 0.453773 | 0.425629 | 1.104875 | 1.282322 | 1.381664 | 0.0 | ... | 1.128848 | 0.900461 | 0.0 | 0.909753 | 1.10833 | 0.985692 | 0.951331 | 0.803252 | 0.865924 | 0.780118 |
| 3 | 0.0 | 1.344385 | -0.876626 | 0.935913 | 1.99205 | 0.882454 | 1.786066 | -1.646778 | -0.942383 | 0.0 | ... | -0.678379 | -1.360356 | 0.0 | 0.946652 | 1.028704 | 0.998656 | 0.728281 | 0.8692 | 1.026736 | 0.957904 |
| 4 | 1.0 | 1.105009 | 0.321356 | 1.522401 | 0.882808 | -1.205349 | 0.681466 | -1.070464 | -0.921871 | 0.0 | ... | -0.373566 | 0.113041 | 0.0 | 0.755856 | 1.361057 | 0.98661 | 0.838085 | 1.133295 | 0.872245 | 0.808487 |


Depending on your GPU, you may run into memory issues. If so, try reducing the size of the dataset.

In [4]:
#subset = int(1e6)  # the sample size should be an integer, not a float
#df_small = df.sample(n=subset).reset_index(drop=True)
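
To illustrate the subsampling step without loading the full dataset, here is a minimal sketch on a small synthetic frame. The frame `df_demo`, its columns and the `random_state` value are assumptions made for the example; only `DataFrame.sample` and `reset_index` are taken from the commented cell.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the HIGGS frame (hypothetical data, 1000 rows).
df_demo = pd.DataFrame({
    "boson": np.arange(1000) % 2,
    "feature": np.arange(1000, dtype=float),
})

subset = int(1e2)  # number of rows to keep, as an integer
# Draw rows without replacement and renumber the index from 0.
df_small = df_demo.sample(n=subset, random_state=77).reset_index(drop=True)

print(df_small.shape)
```

Resetting the index keeps later positional operations simple, at the cost of losing the original row labels.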

Let's generate the training and test sets.

In [5]:
def generate_feables(df):
    X = df[df.columns.difference(['boson'])]
    y = df['boson']
    return X,y

In [6]:
%%time
X, y = generate_feables(df)
#X, y = generate_feables(df_small)
print(X.shape)
print(y.shape)

(11000000, 28)
(11000000,)
CPU times: user 392 ms, sys: 504 ms, total: 896 ms
Wall time: 892 ms


In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=77, test_size=500000)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(10500000, 28)
(10500000,)
(500000, 28)
(500000,)
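
The `stratify=y` argument asks `train_test_split` to keep the class proportions the same in both splits. A small sketch on synthetic labels shows the effect. The arrays `X_demo` and `y_demo` and the 53/47 class mix are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(77)
X_demo = rng.normal(size=(1000, 3))        # hypothetical features
y_demo = np.array([1] * 530 + [0] * 470)   # hypothetical labels, 53% positive

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=77, test_size=200
)

# With stratification the positive rate is (almost) identical in both splits.
print("train positive rate: {:.3f}".format(y_tr.mean()))
print("test positive rate:  {:.3f}".format(y_te.mean()))
```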


Let's put the data in the XGBoost format.

In [8]:
dtrain = xgb.DMatrix(data=X_train, label=y_train)
dtest = xgb.DMatrix(data=X_test, label=y_test)

Now, we'll do the same for LightGBM.

In [9]:
lgb_train = lgb.Dataset(X_train.values, y_train.values, free_raw_data=False)
lgb_test = lgb.Dataset(X_test.values, y_test.values, reference=lgb_train, free_raw_data=False)

### XGBoost
Let's start by training the standard version of XGBoost on a GPU.

In [10]:
results_dict = dict()
num_rounds = 200

In [11]:
params = {'max_depth':2, #'max_depth':5, 
          'objective':'binary:logistic', 
          'min_child_weight':1, 
          'learning_rate':0.1, 
          'scale_pos_weight':2, 
          'gamma':0.1, 
          'reg_lambda':1, 
          'subsample':1,
          'tree_method':'exact', 
          'updater':'grow_gpu'
          }

*NOTE: We got an out of memory error with xgb. Please see the comments at the end of the notebook.*

```python
with Timer() as train_t:
    xgb_clf_pipeline = xgb.train(params, dtrain, num_boost_round=num_rounds)
    
with Timer() as test_t:
    y_prob_xgb = xgb_clf_pipeline.predict(dtest)
    
```
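
The `Timer` helper comes from `libs/timer.py`, which is not shown in this notebook. As a rough sketch, assuming it is a context manager that exposes the elapsed wall-clock seconds as `interval` (the attribute used in the cells of this notebook), it could look like this. The class body is hypothetical and may differ from the real implementation.

```python
import time


class Timer:
    """Hypothetical stand-in for libs.timer.Timer (wall-clock seconds)."""

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Store the elapsed time so it can be read after the `with` block.
        self.interval = time.perf_counter() - self._start
        return False  # do not suppress exceptions raised inside the block


with Timer() as t:
    total = sum(range(10**6))

print("elapsed: {:.4f} s".format(t.interval))
```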

Once training and testing are finished, let's compute some metrics.

```python
y_pred_xgb = binarize_prediction(y_prob_xgb)
report_xgb = classification_metrics_binary(y_test, y_pred_xgb)
report2_xgb = classification_metrics_binary_prob(y_test, y_prob_xgb)
report_xgb.update(report2_xgb)
results_dict['xgb']={
    'train_time': train_t.interval,
    'test_time': test_t.interval,
    'performance': report_xgb 
}
del xgb_clf_pipeline 

```

Now let's try the XGBoost histogram method.

In [12]:
params = {'max_depth':0, 
          'max_leaves':2**5, 
          'objective':'binary:logistic', 
          'min_child_weight':1, 
          'learning_rate':0.1, 
          'scale_pos_weight':2, 
          'gamma':0.1, 
          'reg_lamda':1, 
          'subsample':1,
          'tree_method':'hist', 
          'grow_policy':'lossguide', 
          'updater':'grow_gpu_hist'
         }
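
With `grow_policy` set to `lossguide` and `max_depth` set to 0 (no depth limit), the tree size is bounded by `max_leaves` instead of by depth. The short calculation below is a sketch of why `2**5` is a natural budget: it equals the leaf count of a complete depth-5 tree, matching the commented-out `max_depth` of 5 in the earlier cell and the `num_leaves` used for LightGBM. The helper name is made up for the example.

```python
def max_leaves_depthwise(max_depth):
    """A binary tree grown level by level to depth d has at most 2**d leaves."""
    return 2 ** max_depth


leaf_budget = 2 ** 5  # the 'max_leaves' value used in the cell above

# Smallest depth whose complete tree can hold the whole leaf budget.
equivalent_depth = (leaf_budget - 1).bit_length()

# A leaf-wise tree may spend the same budget on one long branch:
# every split adds exactly one leaf, so 32 leaves allow up to 31 levels.
deepest_leafwise = leaf_budget - 1

print(leaf_budget, equivalent_depth, deepest_leafwise)
```

Both growth strategies get the same number of leaves per tree. What differs is where those leaves are placed.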

In [13]:
with Timer() as t_train:
    xgb_hist_clf_pipeline = xgb.train(params, dtrain, num_boost_round=num_rounds)
    
with Timer() as t_test:
    y_prob_xgb_hist = xgb_hist_clf_pipeline.predict(dtest)

In [14]:
y_pred_xgb_hist = binarize_prediction(y_prob_xgb_hist)

In [15]:
report_xgb_hist = classification_metrics_binary(y_test, y_pred_xgb_hist)
report2_xgb_hist = classification_metrics_binary_prob(y_test, y_prob_xgb_hist)
report_xgb_hist.update(report2_xgb_hist)
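
The helpers in `libs/metrics.py` are not shown here. The following is only a sketch of the kind of computation they presumably perform, assuming a 0.5 decision threshold and the usual definitions of accuracy, precision, recall and F1. The function names `binarize` and `binary_metrics` are hypothetical, and AUC is left out.

```python
import numpy as np


def binarize(y_prob, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels (threshold is an assumption)."""
    return (np.asarray(y_prob) >= threshold).astype(int)


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred).astype(int)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "Accuracy": (tp + tn) / len(y_true),
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
    }


# Toy example with made-up labels and probabilities.
y_true_demo = [1, 0, 1, 1, 0, 0]
y_prob_demo = [0.9, 0.6, 0.4, 0.8, 0.2, 0.1]
metrics = binary_metrics(y_true_demo, binarize(y_prob_demo))
print(metrics)
```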

In [16]:
results_dict['xgb_hist']={
    'train_time': t_train.interval,
    'test_time': t_test.interval,
    'performance': report_xgb_hist
}

In [17]:
del xgb_hist_clf_pipeline #clear GPU memory (214 MB)

### LightGBM
Now that the XGBoost runs are finished, let's try LightGBM on the GPU.

In [18]:
params = {'num_leaves': 2**5,
         'learning_rate': 0.1,
         'scale_pos_weight': 2,
         'min_split_gain': 0.1,
         'min_child_weight': 1,
         'reg_lambda': 1,
         'subsample': 1,
         'objective':'binary',
         'device': 'gpu',
         'task': 'train'
         }

In [19]:
with Timer() as train_t:
    lgbm_clf_pipeline = lgb.train(params, lgb_train, num_boost_round=num_rounds)
    
with Timer() as test_t:
    y_prob_lgbm = lgbm_clf_pipeline.predict(X_test.values)

As we did before, let's obtain some performance metrics.

In [20]:
y_pred_lgbm = binarize_prediction(y_prob_lgbm)

In [21]:
report_lgbm = classification_metrics_binary(y_test, y_pred_lgbm)
report2_lgbm = classification_metrics_binary_prob(y_test, y_prob_lgbm)
report_lgbm.update(report2_lgbm)

In [22]:
results_dict['lgbm']={
    'train_time': train_t.interval,
    'test_time': test_t.interval,
    'performance': report_lgbm 
}

In [23]:
del lgbm_clf_pipeline #clear GPU memory (135 MB)

Finally, we show the results.

In [24]:
# Results
print(json.dumps(results_dict, indent=4, sort_keys=True))

{
    "lgbm": {
        "performance": {
            "AUC": 0.8206713959277258,
            "Accuracy": 0.707346,
            "F1": 0.7678274211385622,
            "Precision": 0.6623814985860588,
            "Recall": 0.9132019927536232
        },
        "test_time": 0.6611301500006448,
        "train_time": 71.8760530440004
    },
    "xgb_hist": {
        "performance": {
            "AUC": 0.8205886356415744,
            "Accuracy": 0.70721,
            "F1": 0.767674555527519,
            "Precision": 0.6623426413523601,
            "Recall": 0.9128434480676328
        },
        "test_time": 0.5808831149997786,
        "train_time": 114.88390647300002
    }
}


The full HIGGS dataset has 11 million rows. The standard version of XGBoost (xgb) cannot process this much data on an NVIDIA M60 GPU, even when the maximum tree depth is reduced to 2: training fails with an out-of-memory error. With the dataset reduced to 1 million rows, however, xgb works correctly.

In our experiments with the reduced dataset of 1 million rows, xgb consumed around 10 times more memory than LightGBM and 5 times more than XGBoost histogram (leaf-wise implementation).

On the full dataset, LightGBM trains faster than XGBoost histogram (about 72 s versus 115 s) with nearly identical performance. In the experiment with the reduced dataset, we also found that XGBoost with the leaf-wise implementation is faster than with the depth-wise implementation.
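
To put numbers on the comparison, the training-time ratio and the AUC gap can be computed directly from the results printed above. This is a small sketch: the values are copied from that output, and the dictionary layout is simplified for the example.

```python
# Values copied from the results JSON printed earlier in the notebook.
results = {
    "lgbm": {"train_time": 71.8760530440004, "AUC": 0.8206713959277258},
    "xgb_hist": {"train_time": 114.88390647300002, "AUC": 0.8205886356415744},
}

# How many times longer XGBoost histogram takes to train than LightGBM.
speedup = results["xgb_hist"]["train_time"] / results["lgbm"]["train_time"]

# Absolute difference in AUC between the two models.
auc_gap = abs(results["lgbm"]["AUC"] - results["xgb_hist"]["AUC"])

print("LightGBM training speedup: {:.2f}x".format(speedup))
print("AUC difference: {:.6f}".format(auc_gap))
```

The ratio comes out at roughly 1.6, while the AUC values differ by less than 0.0001.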

Final advice: go leaf-wise :-)