<a href="https://colab.research.google.com/github/Praveen76/Introduction-to-RAPIDS/blob/main/Introduction_to_RAPIDS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learning Objectives


At the end of the experiment, you will be able to:

* load, simulate, split data, and check dimensions
* convert NumPy data to DMatrix format
* set the parameters and train the model

## Introduction

While the world’s data doubles each year, CPU computing has hit a brick wall with the end of Moore’s law. For the same reasons that scientific computing and deep learning have turned to NVIDIA GPU acceleration, data analytics and machine learning are areas where GPU acceleration is ideal.

NVIDIA created RAPIDS, an open source data analytics and machine learning acceleration platform that leverages GPUs to accelerate computations.

<br>
<img src='https://rapids.ai/images/RAPIDS-logo.png' width=180px>

RAPIDS is based on Python, has pandas-like and `scikit-learn`-like interfaces, is built on the Apache Arrow in-memory data format, and can scale from one GPU to multiple GPUs to multiple nodes. RAPIDS integrates easily into the world’s most popular Python-based data science workflows. RAPIDS accelerates data science end to end, from data preparation through machine learning to deep learning. Through Arrow, Spark users can easily move data into the RAPIDS platform for acceleration.

In this notebook, the acceleration will be demonstrated by using GPUs with XGBoost in RAPIDS.

To learn more about RAPIDS, see the [RAPIDS website](https://rapids.ai/).

### Setup Steps:

In [None]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "" #@param {type:"string"}

In [None]:
#@title Please enter your password (your registered phone number) to continue: { run: "auto", display-mode: "form" }
password = "" #@param {type:"string"}

In [None]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "M8_AST_04_XGBoost_with_RAPIDS_C" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")

    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      # read the exported notebook and close the file handle afterwards
      with open(notebook + ".ipynb", "rb") as f:
        file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://aimlops-iisc.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else: return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError
    else:
      return Answer
  except NameError:
    print ("Please answer Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



### Import necessary libraries

In [None]:
import numpy as np
import pandas as pd
import xgboost as xgb

### Check the version of the imported libraries

In [None]:
print('numpy Version:', np.__version__)
print('pandas Version:', pd.__version__)
print('XGBoost Version:', xgb.__version__)

Make sure you are connected to a GPU runtime in Colab.

> Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

**Requirements for using RAPIDS:**

1. NVIDIA Volta™ or higher GPU with compute capability 7.0+

2. Ubuntu 20.04 or 22.04, CentOS 7, Rocky Linux 8, or WSL2 on Windows 11

3. Recent CUDA version and NVIDIA driver pairs. Check yours with: `nvidia-smi`

Check OS:

In [None]:
!lsb_release -a

Check the CUDA version:

In [None]:
!nvcc --version

In [None]:
# Check GPU
!nvidia-smi

Colab's Tesla T4 GPU has compute capability 7.5, which meets the 7.0+ requirement listed above.
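
The same check can be done from Python, which is handy when later cells should adapt to the hardware. This is a minimal sketch, assuming `nvidia-smi` is on the `PATH` whenever a GPU is attached; `gpu_summary` is a hypothetical helper name and not part of RAPIDS or XGBoost.

```python
import shutil
import subprocess

def gpu_summary():
    """Return one line per visible GPU (name, total memory), or None if none is found."""
    if shutil.which("nvidia-smi") is None:
        return None  # nvidia-smi is not installed, so most likely a CPU-only runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return None  # the tool exists but no driver or GPU is reachable
    return result.stdout.strip()

print(gpu_summary() or "No GPU detected; switch the Colab runtime type to GPU.")
```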

## Load/Simulate data

### Load Data

The data can be loaded using `pandas.read_csv`.


In [None]:
# helper function for loading data
def load_data(filename, n_rows):
    if n_rows >= 1e9:    # at or above the threshold, read the whole file
        df = pd.read_csv(filename)
    else:                # otherwise read only the first n_rows rows
        df = pd.read_csv(filename, nrows=n_rows)
    return df.values.astype(np.float32)
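
To see what the `nrows` argument of `pandas.read_csv` does, here is a small sketch that writes a throwaway CSV and reads part of it back. The file name, the column names, and the label-in-first-column layout are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build a tiny CSV: the first column is the label, the remaining columns are features.
demo = pd.DataFrame({
    "label": [0, 1, 0, 1, 1],
    "f0": [0.1, 0.2, 0.3, 0.4, 0.5],
    "f1": [1, 0, 1, 0, 1],
})
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
demo.to_csv(path, index=False)

# nrows caps the number of data rows parsed (the header line is not counted).
subset = pd.read_csv(path, nrows=3).values.astype(np.float32)
full = pd.read_csv(path).values.astype(np.float32)

print(subset.shape, full.shape)
```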

### Simulate Data

The training dataset is tabular, with `n_rows` rows and `n_columns` columns. Each value is of type `np.float32` if the data is numerical or `np.uint8` if the data is categorical. Numerical and categorical data can also be combined, but this experiment does not utilise that combination.

In [None]:
# helper function for simulating data
def simulate_data(m, n, k=2, numerical=False):
    if numerical:
        features = np.random.rand(m, n)
    else:
        features = np.random.randint(2, size=(m, n))
    labels = np.random.randint(k, size=m)
    return np.c_[labels, features].astype(np.float32)
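
A quick check on a small array makes the layout concrete: column 0 holds the labels and the remaining `n` columns hold the features. The sketch repeats the helper so that it runs on its own; the seed and sizes are arbitrary choices.

```python
import numpy as np

def simulate_data(m, n, k=2, numerical=False):
    if numerical:
        features = np.random.rand(m, n)
    else:
        features = np.random.randint(2, size=(m, n))
    labels = np.random.randint(k, size=m)
    return np.c_[labels, features].astype(np.float32)

np.random.seed(0)  # fixed seed so repeated runs give the same array
sample = simulate_data(6, 4, k=2)

print(sample.shape)               # one extra column for the labels
print(np.unique(sample[:, 0]))    # label values drawn from {0, ..., k-1}
print(np.unique(sample[:, 1:]))   # categorical features are 0 or 1
```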

Define the number of rows and columns to be read.

If LOAD = False, the data will be simulated.

In [None]:
# settings
LOAD = False
n_rows = int(1e5)
n_columns = int(100)
n_categories = 2

Depending on the value of the `LOAD` boolean, either load or simulate the data.

In [None]:
%%time

if LOAD:
    dataset = load_data('/tmp', n_rows)
else:
    dataset = simulate_data(n_rows, n_columns, n_categories)
print(dataset.shape)

In [None]:
# Few rows of the dataset
dataset[0:2, :]

### Split Data

Split the dataset into an 80% training dataset and a 20% test dataset.

In [None]:
# identify shape and indices
n_rows, n_columns = dataset.shape
train_size = 0.80
train_index = int(n_rows * train_size)

print("number of rows is equal to", n_rows)
print("number of columns is equal to", n_columns)
print("The train index is equal to", train_index)

#### Split the data into features and target

In [None]:
# split X, y
X, y = dataset[:, 1:], dataset[:, 0]
del dataset

In [None]:
# split train data
X_train, y_train = X[:train_index, :], y[:train_index]

In [None]:
# split test data
X_test, y_test = X[train_index:, :], y[train_index:]
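
Slicing by position works here because the simulated rows are already in random order. For a loaded file whose rows may be sorted (for example by label or by date), shuffling first is usually safer. The following is a sketch under that assumption, using a hypothetical helper named `shuffled_split` and toy data.

```python
import numpy as np

def shuffled_split(X, y, train_size=0.80, seed=0):
    """Shuffle the row indices once, then cut them at the train/test boundary."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[0])
    cut = int(X.shape[0] * train_size)
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Toy data: 10 rows, 3 features, labels sorted so that a positional split would be biased.
X_demo = np.arange(30, dtype=np.float32).reshape(10, 3)
y_demo = np.array([0] * 5 + [1] * 5, dtype=np.float32)

X_tr, y_tr, X_te, y_te = shuffled_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```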

### Check Dimensions

Check the dimensions and proportions of the training and test datasets.

In [None]:
# check dimensions
print('X_train: ', X_train.shape, X_train.dtype, 'y_train: ', y_train.shape, y_train.dtype)
print('X_test: ', X_test.shape, X_test.dtype, 'y_test: ', y_test.shape, y_test.dtype)

In [None]:
# check the proportions
total = X_train.shape[0] + X_test.shape[0]

print('X_train proportion:', X_train.shape[0] / total)
print('X_test proportion:', X_test.shape[0] / total)

## Convert NumPy data to DMatrix format

Now that the data is simulated and formatted as NumPy arrays, the next step is to convert it to a `DMatrix` object that XGBoost can work with. Instantiate an `xgboost.DMatrix` object by passing the feature matrix as the first argument, followed by the label vector using the `label=` keyword argument. To learn more about XGBoost's support for data structures other than NumPy arrays, see the documentation for the [Data Interface](https://xgboost.readthedocs.io/en/latest/python/python_intro.html#data-interface).





In [None]:
%%time

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

## Set Parameters

Before running XGBoost, we must set three types of parameters:

* **General parameters** relate to which booster is used for boosting, commonly a tree or linear model

* **Booster parameters** depend on which booster is chosen

* **Learning task parameters** decide on the learning scenario. For example, regression tasks may use different parameters than ranking tasks.

For more information on the configurable parameters within the XGBoost module, see the [XGBoost parameter documentation](https://xgboost.readthedocs.io/en/latest/parameter.html).




In [None]:
# instantiate params
params = {}

In [None]:
# general params
general_params = {'verbosity': 0}   # 0 = silent; replaces the deprecated 'silent' parameter
params.update(general_params)

In [None]:
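# booster params
n_gpus = 1           # no. of GPUs (a Colab GPU runtime provides a single GPU)
booster_params = {}

In [None]:
if n_gpus != 0:
    booster_params['tree_method'] = 'hist'
    # XGBoost 2.0+ selects the GPU through 'device'; the old 'n_gpus' parameter was removed.
    # On XGBoost versions before 2.0, use tree_method='gpu_hist' instead.
    booster_params['device'] = 'cuda'
params.update(booster_params)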

In [None]:
# learning task params
learning_task_params = {'eval_metric': 'auc', 'objective': 'binary:logistic'}
params.update(learning_task_params)
print(params)
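
How the GPU is requested depends on the installed XGBoost release. As far as the XGBoost release notes describe it, version 2.0 introduced the `device` parameter and deprecated `tree_method='gpu_hist'`. The helper below, `gpu_booster_params`, is a hypothetical sketch built on that assumption, so check it against the documentation for your version.

```python
def gpu_booster_params(xgb_version, use_gpu=True):
    """Return booster params for histogram-based training, on GPU if requested.

    Assumes XGBoost 2.0 is the release where 'device' replaced 'gpu_hist'.
    """
    if not use_gpu:
        return {"tree_method": "hist"}
    major = int(xgb_version.split(".")[0])
    if major >= 2:
        return {"tree_method": "hist", "device": "cuda"}
    return {"tree_method": "gpu_hist"}

print(gpu_booster_params("2.0.3"))
print(gpu_booster_params("1.7.6"))
print(gpu_booster_params("2.0.3", use_gpu=False))
```

In the notebook, the result could be merged with `params.update(gpu_booster_params(xgb.__version__))`.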

## Train Model

Use the `xgb.train` function and pass in the parameters, training dataset, the number of boosting iterations, and the list of items to be evaluated during training.

For more information on the parameters that can be passed into `xgb.train`, check out the documentation [here](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train)




In [None]:
# model training settings
evallist = [(dtest, 'test'), (dtrain, 'train')]
num_round = 10

In [None]:
%%time

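# pass the round count and evaluation list by keyword, which works across XGBoost versions
bst = xgb.train(params, dtrain, num_boost_round=num_round, evals=evallist)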

### References

* [Open Source Website](http://rapids.ai)
* [GitHub](https://github.com/rapidsai/)
* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)
* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)
* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)

### Please answer the questions below to complete the experiment:

In [None]:
# @title Select the FALSE statement: { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "" #@param ["","RAPIDS is an open source data analytics and ML acceleration platform that leverages GPUs to accelerate computations","RAPIDS is based on Python, has pandas like and scikit-learn like interfaces","RAPIDS is built on apache spark in memory data format, and can scale from 1 to multi GPU to multi nodes"]

In [None]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]

In [None]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "" #@param {type:"string"}

In [None]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "" #@param ["","Yes", "No"]

In [None]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]

In [None]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]

In [None]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")