<a href="https://colab.research.google.com/github/Praveen76/Introduction-to-RAPIDS/blob/main/Introduction_to_RAPIDS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learning Objectives


At the end of the experiment, you will be able to:

* load, simulate, split data, and check dimensions
* convert numpy data to DMatrix format
* set the parameters and train the model

## Introduction

While the world’s data doubles each year, CPU computing has hit a brick wall with the end of Moore’s law. For the same reasons that scientific computing and deep learning have turned to NVIDIA GPU acceleration, data analytics and machine learning are areas where GPU acceleration is ideal.

NVIDIA created RAPIDS, an open source data analytics and machine learning acceleration platform that leverages GPUs to accelerate computations.

<br>
<img src='https://rapids.ai/images/RAPIDS-logo.png' width=180px>

RAPIDS is based on Python, has pandas-like and `scikit-learn`-like interfaces, is built on the Apache Arrow in-memory data format, and can scale from a single GPU to multiple GPUs to multiple nodes. RAPIDS integrates easily into the world’s most popular Python-based data science workflows and accelerates data science from data preparation through machine learning to deep learning. Through Arrow, Spark users can easily move data into the RAPIDS platform for acceleration.

In this notebook, the acceleration will be demonstrated by using GPUs with XGBoost in RAPIDS.

To learn more about RAPIDS, see the [RAPIDS website](https://rapids.ai/).

### Setup Steps:

### Import necessary libraries

In [4]:
import numpy as np
import pandas as pd
import xgboost as xgb

### Check the version of the imported libraries

In [5]:
print('numpy Version:', np.__version__)
print('pandas Version:', pd.__version__)
print('XGBoost Version:', xgb.__version__)

numpy Version: 1.25.2
pandas Version: 2.0.3
XGBoost Version: 2.0.3


Make sure you are connected to a GPU runtime in Colab.

> Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

**Requirements for using RAPIDS:**

1. NVIDIA Volta™ or higher GPU with compute capability 7.0+

2. Ubuntu 20.04 or 22.04, CentOS 7, Rocky Linux 8, or WSL2 on Windows 11

3. Recent CUDA version and NVIDIA driver pairs. Check yours with: `nvidia-smi`

Check OS:

In [6]:
!lsb_release -a

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.3 LTS
Release:	22.04
Codename:	jammy


Check the CUDA version:

In [7]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


In [8]:
# Check GPU
!nvidia-smi

Mon Jun 17 17:04:14 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

Colab's Tesla T4 GPU has compute capability 7.5, which satisfies the 7.0+ requirement listed above.
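The same requirement check can be expressed in code. The sketch below is only an illustration: `meets_min_capability` and `query_compute_capability` are hypothetical helper names, and the `compute_cap` query field of `nvidia-smi` is assumed to be supported by the installed driver (older drivers may not provide it, in which case the query helper returns `None`).

```python
import shutil
import subprocess


def meets_min_capability(compute_cap, minimum=(7, 0)):
    """Return True if a 'major.minor' capability string is at least `minimum`."""
    major, minor = (int(part) for part in compute_cap.strip().split("."))
    return (major, minor) >= minimum


def query_compute_capability():
    """Ask nvidia-smi for the first GPU's compute capability, or return None."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    lines = result.stdout.strip().splitlines()
    if result.returncode != 0 or not lines:
        return None  # query field unsupported or no GPU visible
    return lines[0].strip()


# The Tesla T4 reports 7.5, which is above the 7.0 minimum.
print(meets_min_capability("7.5"))

cap = query_compute_capability()
if cap is not None:
    print("Detected compute capability:", cap)
```

The parsing helper is kept separate from the `nvidia-smi` call so that it can be used with a capability string obtained from any source.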

## Load/Simulate data

### Load Data

The data can be loaded using `pandas.read_csv`.


In [9]:
# helper function for loading data
def load_data(filename, n_rows):
    if n_rows >= 1e9:    # at or above the threshold: read the whole file
        df = pd.read_csv(filename)
    else:                # below the threshold: read only the first n_rows rows
        df = pd.read_csv(filename, nrows=n_rows)
    return df.values.astype(np.float32)
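To make the effect of the `nrows` argument of `pandas.read_csv` concrete, here is a minimal sketch that writes a tiny CSV to a temporary directory and reads it back twice. The file name `demo.csv` and the column names are hypothetical and chosen only for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Tiny table: one label column followed by two feature columns.
demo = pd.DataFrame(
    {"label": [0, 1, 0, 1], "f0": [0.1, 0.2, 0.3, 0.4], "f1": [1, 0, 1, 0]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "demo.csv")  # hypothetical file name
    demo.to_csv(csv_path, index=False)

    # nrows limits how many data rows are parsed (the header is not counted).
    first_two = pd.read_csv(csv_path, nrows=2).values.astype(np.float32)
    everything = pd.read_csv(csv_path).values.astype(np.float32)

print(first_two.shape)   # (2, 3)
print(everything.shape)  # (4, 3)
```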

### Simulate Data

The training dataset is tabular, with `n_rows` rows and `n_columns` columns. Each value is of type `np.float32` if the data is numerical or `np.uint8` if the data is categorical. Numerical and categorical data can also be combined, but this experiment does not use that combination.

In [10]:
# helper function for simulating data
def simulate_data(m, n, k=2, numerical=False):
    if numerical:
        features = np.random.rand(m, n)
    else:
        features = np.random.randint(2, size=(m, n))
    labels = np.random.randint(k, size=m)
    return np.c_[labels, features].astype(np.float32)
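A small call makes the layout of the simulated array concrete: column 0 holds the labels and the remaining `n` columns hold the features. The sketch below repeats the helper so that it is self-contained. Seeding NumPy's global generator is an addition for reproducibility and is not part of the original helper.

```python
import numpy as np


def simulate_data(m, n, k=2, numerical=False):
    if numerical:
        features = np.random.rand(m, n)               # uniform values in [0, 1)
    else:
        features = np.random.randint(2, size=(m, n))  # binary categorical values
    labels = np.random.randint(k, size=m)             # class ids in [0, k)
    return np.c_[labels, features].astype(np.float32)


np.random.seed(0)  # assumption: fixed seed so repeated runs give the same values

small = simulate_data(5, 3)
print(small.shape)  # (5, 4): one label column plus three feature columns
print(small.dtype)  # float32

labels, features = small[:, 0], small[:, 1:]
small_numeric = simulate_data(5, 3, numerical=True)
```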

Define the number of rows and the number of columns to read or simulate.

If `LOAD = False`, the data is simulated instead of loaded.

In [11]:
# settings
LOAD = False
n_rows = int(1e5)
n_columns = int(100)
n_categories = 2

Depending on the 'LOAD' boolean value, either load or simulate the data.

In [12]:
%%time

if LOAD:
    dataset = load_data('/tmp', n_rows)
else:
    dataset = simulate_data(n_rows, n_columns, n_categories)
print(dataset.shape)

(100000, 101)
CPU times: user 71 ms, sys: 34.1 ms, total: 105 ms
Wall time: 105 ms


In [13]:
# first two rows of the dataset
dataset[0:2, :]

array([[0., 0., 1., 1., 1., 1., 1., 0., 1., 0., 0., 0., 0., 0., 1., 1.,
        1., 1., 1., 1., 0., 0., 1., 1., 1., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0.,
        0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 1., 1., 0., 1., 0., 0.,
        0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 1., 1., 1., 1., 0.,
        1., 0., 1., 0., 1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1.,
        0., 1., 1., 0., 1.],
       [0., 1., 0., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.,
        1., 1., 0., 1., 1., 1., 0., 0., 0., 1., 0., 0., 1., 1., 0., 0.,
        0., 0., 1., 0., 1., 1., 1., 0., 1., 0., 1., 1., 1., 0., 1., 1.,
        1., 1., 1., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1.,
        0., 1., 0., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1.,
        1., 1., 0., 0., 0., 0., 1., 0., 1., 1., 0., 0., 0., 0., 1., 0.,
        0., 1., 1., 0., 0.]], dtype=float32)

### Split Data

Split the dataset into an 80% training dataset and a 20% test dataset.

In [14]:
# identify shape and indices
n_rows, n_columns = dataset.shape
train_size = 0.80
train_index = int(n_rows * train_size)

print("number of rows is equal to", n_rows)
print("number of columns is equal to", n_columns)
print("The train index is equal to", train_index)

number of rows is equal to 100000
number of columns is equal to 101
The train index is equal to 80000


#### Split the data into features and target

In [15]:
# split X, y
X, y = dataset[:, 1:], dataset[:, 0]
del dataset

In [16]:
# split train data
X_train, y_train = X[:train_index, :], y[:train_index]

In [17]:
# split test data
X_test, y_test = X[train_index:, :], y[train_index:]
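The slices above take the first 80% of the rows for training and the remainder for testing, without shuffling. That is harmless for simulated data, whose rows are already random, but a file loaded from disk may be ordered. A minimal sketch of a shuffled split follows, assuming a hypothetical helper named `shuffled_split` and a fixed seed. It is an alternative to the notebook's split, not the method used here.

```python
import numpy as np


def shuffled_split(X, y, train_size=0.80, seed=0):
    """Shuffle row indices once, then cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[0])   # random ordering of row indices
    cut = int(X.shape[0] * train_size)    # number of training rows
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]


# Toy data in which row i has label i, so alignment is easy to verify.
X_demo = np.arange(20, dtype=np.float32).reshape(10, 2)
y_demo = np.arange(10, dtype=np.float32)

X_tr, y_tr, X_te, y_te = shuffled_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Indexing `X` and `y` with the same index array keeps each feature row paired with its label.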

### Check Dimensions

Check the dimensions and proportions of the training and test datasets.

In [18]:
# check dimensions
print('X_train: ', X_train.shape, X_train.dtype, 'y_train: ', y_train.shape, y_train.dtype)
print('X_test', X_test.shape, X_test.dtype, 'y_test: ', y_test.shape, y_test.dtype)

X_train:  (80000, 100) float32 y_train:  (80000,) float32
X_test (20000, 100) float32 y_test:  (20000,) float32


In [19]:
# check the proportions
total = X_train.shape[0] + X_test.shape[0]

print('X_train proportion:', X_train.shape[0] / total)
print('X_test proportion:', X_test.shape[0] / total)

X_train proportion: 0.8
X_test proportion: 0.2


## Convert NumPy data to DMatrix format

The data is now simulated and formatted as NumPy arrays. The next step is to convert it to a `DMatrix` object that XGBoost can work with. Instantiate an `xgboost.DMatrix` by passing the feature matrix as the first argument, followed by the label vector as the `label=` keyword argument. To learn more about XGBoost's support for data structures other than NumPy arrays, see the [Data Interface documentation](https://xgboost.readthedocs.io/en/latest/python/python_intro.html#data-interface).





In [20]:
%%time

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

CPU times: user 306 ms, sys: 73.7 ms, total: 380 ms
Wall time: 365 ms


## Set Parameters

Before running XGBoost, we must set three types of parameters:

* **General parameters** relate to which booster is being used to do boosting, commonly tree or linear model

* **Booster parameters** depend on which booster is chosen

* **Learning task parameters** decide on the learning scenario. For example, regression tasks may use different parameters than ranking tasks.

For more information on the configurable parameters within the XGBoost module, see the [XGBoost parameters documentation](https://xgboost.readthedocs.io/en/latest/parameter.html).




In [21]:
# instantiate params
params = {}

In [22]:
# general params
general_params = {'silent': 1}
params.update(general_params)

In [23]:
# booster params
n_gpus = 2           # no. of GPUs
booster_params = {}

In [24]:
if n_gpus != 0:
    booster_params['tree_method'] = 'hist'
    booster_params['n_gpus'] = n_gpus
params.update(booster_params)

In [25]:
# learning task params
learning_task_params = {'eval_metric': 'auc', 'objective': 'binary:logistic'}
params.update(learning_task_params)
print(params)

{'silent': 1, 'tree_method': 'hist', 'n_gpus': 2, 'eval_metric': 'auc', 'objective': 'binary:logistic'}


## Train Model

Use the `xgb.train` function and pass in the parameters, the training dataset, the number of boosting rounds, and the list of datasets to evaluate during training. Because the simulated labels are random and independent of the features, the test AUC is expected to stay close to 0.5, while the train AUC rises as the model fits noise.

For more information on the parameters that can be passed into `xgb.train`, see the [`xgboost.train` documentation](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train).




In [26]:
# model training settings
evallist = [(dtest, 'test'), (dtrain, 'train')]
num_round = 10

In [27]:
%%time

bst = xgb.train(params, dtrain, num_round, evallist)

Parameters: { "n_gpus", "silent" } are not used.



[0]	test-auc:0.50183	train-auc:0.54066
[1]	test-auc:0.50239	train-auc:0.55668
[2]	test-auc:0.50079	train-auc:0.56981
[3]	test-auc:0.50178	train-auc:0.57994
[4]	test-auc:0.50389	train-auc:0.58869
[5]	test-auc:0.50727	train-auc:0.59655
[6]	test-auc:0.50479	train-auc:0.60407
[7]	test-auc:0.50446	train-auc:0.61037
[8]	test-auc:0.50393	train-auc:0.61690
[9]	test-auc:0.50193	train-auc:0.62302
CPU times: user 1.28 s, sys: 25.7 ms, total: 1.31 s
Wall time: 700 ms
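The warning at the top of the training log shows that XGBoost 2.0.3 ignored `n_gpus` and `silent`. With only `tree_method='hist'` set, XGBoost 2.x trains on its default device, the CPU, so the timing above does not reflect GPU training. In XGBoost 2.0 and later, the `device` parameter selects the GPU and `verbosity` replaces `silent`. A minimal sketch of an updated parameter dictionary follows, assuming a hypothetical helper named `build_params`. It has not been run as part of this notebook.

```python
def build_params(use_gpu=True):
    """Return XGBoost 2.x style training parameters for binary classification."""
    params = {
        "verbosity": 0,            # 0 = silent; replaces the old 'silent' flag
        "tree_method": "hist",     # histogram-based tree construction
        "eval_metric": "auc",
        "objective": "binary:logistic",
    }
    # 'device' replaces the removed 'n_gpus' parameter in XGBoost 2.x.
    params["device"] = "cuda" if use_gpu else "cpu"
    return params


params = build_params(use_gpu=True)
print(params)
```

Passing this dictionary to `xgb.train(params, dtrain, num_round, evallist)` in place of the original `params` should run the histogram algorithm on the GPU, provided the installed XGBoost build includes CUDA support.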


### References

* [Open Source Website](http://rapids.ai)
* [GitHub](https://github.com/rapidsai/)
* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)
* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)
* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)