<a id="introduction"></a>
## Introduction to XGBoost
#### Originally By Paul Hendricks
##### Modified By Ingine Hmwe
-------

In this notebook, we will show how to work with GPU-accelerated XGBoost in RAPIDS.

**Table of Contents**

* [Introduction to XGBoost](#introduction)
* [Setup](#setup)
* [Load Libraries](#libraries)
* [Generate Data](#generate)
  * [Load Data](#load)
  * [Simulate Data](#simulate)
  * [Split Data](#split)
  * [Check Dimensions](#check)
* [Convert NumPy data to DMatrix format](#convert)
* [Set Parameters](#parameters)
* [Train Model](#train)
* [Conclusion](#conclusion)

<a id="setup"></a>
## Setup

This notebook was tested using the following Docker container:

* `rapidsai/rapidsai:0.10-cuda10.0-runtime-ubuntu18.04` container from [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai/tags)

This notebook was run on an NVIDIA V100 GPU. Your system may differ, so you may need to modify the code or install packages to run the examples below.

Before we begin, let's check out our hardware setup by running the `nvidia-smi` command.

In [1]:
!nvidia-smi

Fri Nov 15 05:00:21 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-PCIE...  Off  | 00000000:00:08.0 Off |                    0 |
| N/A   37C    P0    37W / 250W |   2028MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00000000:00:09.0 Off |                    0 |
| N/A   35C    P0    37W / 250W |    696MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-------

<a id="libraries"></a>
## Load Libraries

Let's load some of the libraries within the RAPIDS ecosystem and see which versions we have.

In [2]:
import numpy as np; print('numpy Version:', np.__version__)
import pandas as pd; print('pandas Version:', pd.__version__)
import sklearn; print('Scikit-Learn Version:', sklearn.__version__)
import xgboost as xgb; print('XGBoost Version:', xgb.__version__)
import time

numpy Version: 1.16.4
pandas Version: 0.24.2
Scikit-Learn Version: 0.21.3
XGBoost Version: 1.0.0-SNAPSHOT


<a id="generate"></a>
## Generate Data

<a id="load"></a>
### Load Data

We can load the data using `cudf.read_csv`. We've provided a helper function, `load_data`, that loads data from a CSV file: it reads only the first `n_rows` rows, or the whole file when `n_rows` is 1 billion or more.

In [3]:
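# helper function for loading data
import cudf  # needed by load_data; not imported in the Load Libraries cell


def load_data(filename, n_rows):
    # read the whole file when n_rows is 1 billion or more, else only the first n_rows
    if n_rows >= 1e9:
        gdf = cudf.read_csv(filename)
    else:
        gdf = cudf.read_csv(filename, nrows=n_rows)
    return gdf.values.astype(np.float32)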

<a id="simulate"></a>
### Simulate Data

Alternatively, we can simulate data for our train and validation datasets. The features will be tabular, with `n_rows` rows and `n_columns` feature columns, where each value is of type `np.float32`. We can simulate data for either classification or regression using the `make_classification` or `make_regression` function from the Scikit-Learn package.

In [4]:
from sklearn.datasets import make_classification, make_regression


# helper function for simulating data
def simulate_data(m, n, k=2, random_state=None, classification=True):
    if classification:
        features, labels = make_classification(n_samples=m, n_features=n,
                                               n_informative=int(n/5), n_classes=k,
                                               random_state=random_state)
    else:
        features, labels = make_regression(n_samples=m, n_features=n, 
                                           n_informative=int(n/5), n_targets=1, 
                                           random_state=random_state)
    return np.c_[labels, features].astype(np.float32)

In [5]:
# settings
simulate = True
classification = True  # change this to false to use regression
n_rows = int(1e6)  # we'll use 1 million rows
n_columns = int(100)
n_categories = 2
random_state = np.random.RandomState(43210)

In [6]:
%%time
data_file = './simulated_data'

if simulate:
    dataset = simulate_data(n_rows, n_columns, n_categories, 
                            random_state=random_state, 
                            classification=classification)
else:
    dataset = load_data(data_file, n_rows)
print(dataset.shape)

(1000000, 101)
CPU times: user 14.9 s, sys: 13 s, total: 27.9 s
Wall time: 10.8 s
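
The `%%time` cell magic used above is only available in IPython and Jupyter. If we want similar wall-clock timings in a plain Python script, one option is a small context manager built on `time.perf_counter`. The sketch below is an illustration only; the helper name `timed` and the `timings` dictionary are hypothetical and not part of the original notebook.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Measure the wall-clock time of the enclosed block (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        # optionally store the timing so it can be compared later
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f} s wall time")


timings = {}
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100_000))
```

Collecting the timings in a dictionary makes it easy to compare steps, such as GPU and CPU training, side by side once they have all run.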


`dataset` is a 2D array in which the first column holds the label and the remaining columns hold the features. For the first row, `dataset[0][0]` is the `label` and `dataset[0][1:]` are the `features`.

In [7]:
dataset

array([[ 0.        , -0.7669604 ,  0.41350517, ..., -0.8127142 ,
        -0.11832006,  0.03517063],
       [ 1.        ,  0.03652514, -0.5934094 , ...,  0.32232106,
        -0.12324404,  2.4050565 ],
       [ 0.        , -0.4764251 ,  1.5536602 , ..., -0.0919623 ,
         1.7981887 ,  0.31243178],
       ...,
       [ 1.        , -1.3806853 , -0.96609867, ..., -1.8649141 ,
         0.8573347 ,  1.2853271 ],
       [ 1.        ,  1.0025431 , -0.48246416, ...,  0.12920411,
         0.03204911,  0.53436947],
       [ 1.        , -0.20219818, -0.7586921 , ..., -0.9084102 ,
         1.40354   ,  2.0290613 ]], dtype=float32)

<a id="split"></a>
### Split Data

We'll split our dataset into an 80% training dataset and a 20% validation dataset.

In [8]:
# identify shape and indices
n_rows, n_columns = dataset.shape
train_size = 0.80
train_index = int(n_rows * train_size)

# split X, y
X, y = dataset[:, 1:], dataset[:, 0]
del dataset

# split train data
X_train, y_train = X[:train_index, :], y[:train_index]

# split validation data
X_validation, y_validation = X[train_index:, :], y[train_index:]
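
To make the logic of this cell easier to check, here is the same split wrapped in a function and run on a tiny array. This is a sketch: the function name `split_label_first` and the toy data are hypothetical, and it assumes, as above, that the label sits in column 0 and that the first 80% of rows (in their existing order, without shuffling) go to training.

```python
import numpy as np


def split_label_first(dataset, train_size=0.80):
    """Split a label-first 2D array into train/validation parts by row order."""
    n_rows = dataset.shape[0]
    train_index = int(n_rows * train_size)
    # column 0 is the label, the remaining columns are features
    X, y = dataset[:, 1:], dataset[:, 0]
    return (X[:train_index], y[:train_index]), (X[train_index:], y[train_index:])


# toy array: 10 rows, label in column 0, three feature columns
toy = np.arange(40, dtype=np.float32).reshape(10, 4)
(X_tr, y_tr), (X_va, y_va) = split_label_first(toy)
print(X_tr.shape, y_tr.shape, X_va.shape, y_va.shape)
```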

<a id="check"></a>
### Check Dimensions

We can check the dimensions and proportions of our training and validation datasets.

In [9]:
# check dimensions
print('X_train: ', X_train.shape, X_train.dtype, 'y_train: ', y_train.shape, y_train.dtype)
print('X_validation', X_validation.shape, X_validation.dtype, 'y_validation: ', y_validation.shape, y_validation.dtype)

# check the proportions
total = X_train.shape[0] + X_validation.shape[0]
print('X_train proportion:', X_train.shape[0] / total)
print('X_validation proportion:', X_validation.shape[0] / total)

X_train:  (800000, 100) float32 y_train:  (800000,) float32
X_validation (200000, 100) float32 y_validation:  (200000,) float32
X_train proportion: 0.8
X_validation proportion: 0.2


<a id="convert"></a>
## Convert NumPy data to DMatrix format

With our data loaded and formatted as NumPy arrays, our next step is to convert it to a `DMatrix` object that XGBoost can work with. We can instantiate an `xgboost.DMatrix` object by passing in the feature matrix as the first argument, followed by the label vector using the `label=` keyword argument. To learn more about XGBoost's support for data structures other than NumPy arrays, see the documentation for the Data Interface:


https://xgboost.readthedocs.io/en/latest/python/python_intro.html#data-interface

In [10]:
%%time

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalidation = xgb.DMatrix(X_validation, label=y_validation)



CPU times: user 956 ms, sys: 765 ms, total: 1.72 s
Wall time: 1.72 s


<a id="parameters"></a>
## Set Parameters

There are a number of parameters that can be set before XGBoost can be run. 

* General parameters relate to which booster we are using to do boosting, commonly a tree or linear model
* Booster parameters depend on which booster you have chosen
* Learning task parameters decide on the learning scenario. For example, regression tasks may use different parameters than ranking tasks.

For more information on the configurable parameters within the XGBoost module, see the documentation here:


https://xgboost.readthedocs.io/en/latest/parameter.html

In [11]:
'''instantiate GPU PARAMS '''
paramsGPU = {}

# general params
general_params = {'verbosity': 0}
paramsGPU.update(general_params)

# booster params
n_gpus = 1  # 0 uses the CPU; XGBoost 1.0 reports n_gpus as deprecated and no longer supports single-process multi-GPU training (see the warning in the training output)
booster_params = {}

# gpu implementation of hist algorithm
if n_gpus != 0:
    booster_params['max_depth'] = 6
    booster_params['tree_method'] = 'gpu_hist'
    booster_params['n_gpus'] = n_gpus   
paramsGPU.update(booster_params)

# learning task params
learning_task_params = {}
if classification:
    learning_task_params['eval_metric'] = 'auc'
    learning_task_params['objective'] = 'binary:logistic'
else:
    learning_task_params['eval_metric'] = 'rmse'
    learning_task_params['objective'] = 'reg:squarederror'
paramsGPU.update(learning_task_params)

print(paramsGPU)

{'verbosity': 0, 'max_depth': 6, 'tree_method': 'gpu_hist', 'n_gpus': 1, 'eval_metric': 'auc', 'objective': 'binary:logistic'}
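
The same three groups of parameters can also be assembled by a small function, which makes it easier to switch between classification and regression or between GPU and CPU. The sketch below is hypothetical (`build_params` is not part of the original notebook) and differs from the cell above in two labeled ways: it sets `max_depth` for the CPU case as well, and it leaves out `n_gpus`, which the training output reports as deprecated. The value `'gpu_hist'` is the one used in this notebook; other XGBoost releases may select the GPU differently, so check the parameter documentation for your version.

```python
def build_params(classification=True, use_gpu=True, max_depth=6):
    """Assemble an XGBoost parameter dict (hypothetical helper)."""
    # general params and booster params
    params = {'verbosity': 0, 'max_depth': max_depth}
    # histogram-based tree construction on the GPU or the CPU
    params['tree_method'] = 'gpu_hist' if use_gpu else 'hist'
    # learning task params
    if classification:
        params.update(eval_metric='auc', objective='binary:logistic')
    else:
        params.update(eval_metric='rmse', objective='reg:squarederror')
    return params


print(build_params(classification=False, use_gpu=False))
```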


In [12]:
'''instantiate CPU PARAMS '''

nCores = !nproc --all
nCores = int(nCores[0])

cpu_num_round = 100

paramsCPU = {
    'n_estimators': cpu_num_round,
    'max_depth': 6,
    'tree_method': 'hist',
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'n_jobs': nCores
}

print(paramsCPU)

{'n_estimators': 100, 'max_depth': 6, 'tree_method': 'hist', 'objective': 'binary:logistic', 'eval_metric': 'auc', 'n_jobs': 32}
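
The `!nproc --all` line above relies on IPython's shell escape and on the `nproc` command being available. Outside a notebook, or on a system without `nproc`, a possible alternative is `os.cpu_count()` from the standard library. This is a sketch; the variable name `n_cores` is hypothetical, and the count it reports may not match `nproc --all` on every system.

```python
import os

# os.cpu_count() may return None when the count cannot be determined,
# so fall back to a single core in that case
n_cores = os.cpu_count() or 1
print('n_jobs:', n_cores)
```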


<a id="train"></a>
## Train Model

Now it's time to train our model! We can use the `xgb.train` function and pass in the parameters, training dataset, the number of boosting iterations, and the list of items to be evaluated during training. For more information on the parameters that can be passed into `xgb.train`, check out the documentation:


https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train

In [13]:
# model training settings
evallist = [(dvalidation, 'validation'), (dtrain, 'train')]
gpu_num_round = 100

### Train on GPU

In [14]:
%%time
gpu_bst = xgb.train(paramsGPU, dtrain, gpu_num_round, evallist)

n_gpus: 
	Deprecated. Single process multi-GPU training is no longer supported.
	Please switch to distributed training with one process per GPU.
	This can be done using Dask or Spark.  See documentation for details.
[0]	validation-auc:0.89143	train-auc:0.89310
[1]	validation-auc:0.91929	train-auc:0.92058
[2]	validation-auc:0.93920	train-auc:0.94055
[3]	validation-auc:0.94835	train-auc:0.94974
[4]	validation-auc:0.95581	train-auc:0.95714
[5]	validation-auc:0.96042	train-auc:0.96177
[6]	validation-auc:0.96489	train-auc:0.96612
[7]	validation-auc:0.96865	train-auc:0.96981
[8]	validation-auc:0.97128	train-auc:0.97245
[9]	validation-auc:0.97415	train-auc:0.97533
[10]	validation-auc:0.97564	train-auc:0.97680
[11]	validation-auc:0.97797	train-auc:0.97917
[12]	validation-auc:0.97891	train-auc:0.98010
[13]	validation-auc:0.98000	train-auc:0.98113
[14]	validation-auc:0.98075	train-auc:0.98188
[15]	validation-auc:0.98203	train-auc:0.98318
[16]	validation-auc:0.98269	train-auc:0.98383
[17]	validat

### Train on CPU

In [15]:
%%time
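# note: paramsCPU['max_depth'] is not passed here, so the classifier keeps its
# default max_depth (3 in the repr printed after fitting)
cpu_bst = xgb.XGBClassifier(n_estimators=paramsCPU['n_estimators'],
                            tree_method=paramsCPU['tree_method'],
                            objective=paramsCPU['objective'],
                            n_jobs=paramsCPU['n_jobs'])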

CPU times: user 46 µs, sys: 10 µs, total: 56 µs
Wall time: 60.1 µs


In [16]:
%%time
cpu_bst.fit(X_train, y_train,
            eval_set=[(X_validation, y_validation), (X_train, y_train)],
            eval_metric=paramsCPU['eval_metric'],
            verbose=True)

[0]	validation_0-auc:0.80024	validation_1-auc:0.80026
[1]	validation_0-auc:0.81536	validation_1-auc:0.81527
[2]	validation_0-auc:0.84754	validation_1-auc:0.84779
[3]	validation_0-auc:0.85908	validation_1-auc:0.85930
[4]	validation_0-auc:0.86482	validation_1-auc:0.86532
[5]	validation_0-auc:0.86835	validation_1-auc:0.86877
[6]	validation_0-auc:0.87200	validation_1-auc:0.87237
[7]	validation_0-auc:0.87644	validation_1-auc:0.87707
[8]	validation_0-auc:0.88128	validation_1-auc:0.88209
[9]	validation_0-auc:0.88544	validation_1-auc:0.88635
[10]	validation_0-auc:0.88760	validation_1-auc:0.88862
[11]	validation_0-auc:0.89376	validation_1-auc:0.89494
[12]	validation_0-auc:0.89780	validation_1-auc:0.89909
[13]	validation_0-auc:0.90027	validation_1-auc:0.90153
[14]	validation_0-auc:0.90407	validation_1-auc:0.90518
[15]	validation_0-auc:0.90748	validation_1-auc:0.90854
[16]	validation_0-auc:0.91069	validation_1-auc:0.91182
[17]	validation_0-auc:0.91273	validation_1-auc:0.91391
[18]	validation_0-au

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=32,
              objective='binary:logistic', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='hist',
              verbosity=1)

In [17]:
evals_result = cpu_bst.evals_result()
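
For the scikit-learn wrapper, `evals_result()` returns a nested dictionary keyed first by evaluation set (`validation_0`, `validation_1`, in the order given in `eval_set`) and then by metric name. The sketch below shows one way to read it. It uses a stand-in dictionary holding only the first three iterations copied from the training log above, and the helper `best_iteration` is hypothetical.

```python
# stand-in for cpu_bst.evals_result(); values are the first three
# iterations from the CPU training log
evals_result_example = {
    'validation_0': {'auc': [0.80024, 0.81536, 0.84754]},  # (X_validation, y_validation)
    'validation_1': {'auc': [0.80026, 0.81527, 0.84779]},  # (X_train, y_train)
}


def best_iteration(history, eval_name='validation_0', metric='auc'):
    """Return (index, score) of the highest score; suitable for AUC, where higher is better."""
    scores = history[eval_name][metric]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


print(best_iteration(evals_result_example))
```

For a metric where lower is better, such as `rmse`, the `max` would need to be replaced with `min`.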

<a id="conclusion"></a>
## Conclusion

In this notebook, we showed how to work with GPU-accelerated XGBoost in RAPIDS.

To learn more about RAPIDS, be sure to check out: 

* [Open Source Website](http://rapids.ai)
* [GitHub](https://github.com/rapidsai/)
* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)
* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)
* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)

Credits go to Paul Hendricks for authoring the original notebook:
* https://github.com/rapidsai/notebooks-contrib/blob/master/getting_started_notebooks/intro_tutorials/07_Introduction_to_XGBoost.ipynb