![Nvidia Rapids](https://developer.nvidia.com/sites/default/files/pictures/2018/rapids/rapids-logo.png)

The RAPIDS suite of software libraries is built on CUDA. This means the libraries use the GPU for data loading and preparation and for building classic ML models.

Here, I use the **cuDF** and **cuML** parts of NVIDIA RAPIDS, which let me load and manipulate data on the GPU, thereby reducing my wait time significantly.

This is crucial in this month's TPS (Tabular Playground Series), as the data has approximately 10 lakh (1 million) rows and 250+ columns. Efficient data loading and processing can be key to achieving a good score in this competition.

Both cuDF and cuML follow the familiar syntax of pandas and scikit-learn respectively, so they integrate well with your existing code and make the framework easier to understand.
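To illustrate the shared syntax, here is a minimal sketch that runs the same DataFrame calls through either library. The toy columns `f0` and `f1` and the alias `xdf` are made up for illustration, and the fallback to pandas is an assumption so that the cell also runs on a machine without a GPU.

```python
# Use cuDF when it is installed and a GPU is usable; otherwise fall back to pandas.
try:
    import cudf as xdf  # GPU DataFrame library
except Exception:  # cuDF missing, or no usable GPU on this machine
    import pandas as xdf  # CPU fallback that accepts the same calls used below

# Toy data (hypothetical) standing in for the competition files.
df = xdf.DataFrame({
    "f0": [0.1, 0.4, 0.9, 0.3],
    "f1": [1, 0, 1, 1],
    "target": [0, 1, 1, 0],
})

# These lines are identical for pandas and cuDF.
y = df["target"]
X = df.drop(["target"], axis=1)
mean_f0_by_f1 = df.groupby("f1")["f0"].mean()

print(list(X.columns))
print(len(y))
```

Only the import line changes between the two libraries here; the rest of the cell stays as it is.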

In [None]:
import cudf
import pandas as pd
import xgboost as xgb
from cuml import train_test_split

In [None]:
!tree ../input

cuDF takes approximately 3 seconds per loop to read the three files, while pandas takes approximately 1 minute per loop. This is a huge leap in performance ⚡️

In [None]:
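%%timeit
# Time reading the train, test and sample submission files with cuDF
train = cudf.read_csv('../input/tabular-playground-series-oct-2021/train.csv')
test = cudf.read_csv('../input/tabular-playground-series-oct-2021/test.csv')
submission = cudf.read_csv('../input/tabular-playground-series-oct-2021/sample_submission.csv')

In [None]:
# Names assigned inside a %%timeit cell are not kept in the notebook namespace,
# so read the files once more to make them available to the cells below
train = cudf.read_csv('../input/tabular-playground-series-oct-2021/train.csv')
test = cudf.read_csv('../input/tabular-playground-series-oct-2021/test.csv')
submission = cudf.read_csv('../input/tabular-playground-series-oct-2021/sample_submission.csv')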

In [None]:
%%timeit
# Time reading the train, test and sample submission files with pandas
train_pd = pd.read_csv('../input/tabular-playground-series-oct-2021/train.csv')
test_pd = pd.read_csv('../input/tabular-playground-series-oct-2021/test.csv')
submission_pd = pd.read_csv('../input/tabular-playground-series-oct-2021/sample_submission.csv')
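As a rough check on the figures quoted above, the sketch below turns the two approximate timings into a speedup ratio. It also shows a small helper, `timed` (a hypothetical name, not part of RAPIDS or IPython), that times a single call and hands back the result, which can be handy when you want to keep what was loaded.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Approximate per-loop timings quoted in this notebook, in seconds.
cudf_seconds = 3
pandas_seconds = 60
speedup = pandas_seconds / cudf_seconds
print(f"cuDF is roughly {speedup:.0f}x faster at reading the files")

# Usage on a stand-in workload; in the notebook, fn could be a read_csv call.
total, elapsed = timed(sum, range(1_000_000))
print(f"stand-in workload took {elapsed:.4f} s")
```

The ratio is only as good as the two rounded timings, so treat it as an order-of-magnitude figure.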

Instead of using scikit-learn's train_test_split(), I have used cuML's train_test_split() to split the data into training and validation sets.

(I have not done a time comparison here, as the gains from using RAPIDS are clear from the example above.)

In [None]:
y = train['target']
X = train.drop(['target'], axis=1)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
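For intuition about what `test_size=0.2` and `random_state=42` do, here is a conceptual sketch of a shuffled hold-out split in plain Python. It is not cuML's implementation: the helper name `holdout_split` is hypothetical, the rounding rule is an assumption, and the shuffle order will differ from what cuML produces for the same seed.

```python
import random


def holdout_split(n_rows, test_size=0.2, seed=42):
    """Return (train_indices, valid_indices) for a shuffled hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable
    n_valid = round(n_rows * test_size)   # assumed rounding rule, for illustration
    return indices[n_valid:], indices[:n_valid]


train_idx, valid_idx = holdout_split(1_000)
print(len(train_idx), len(valid_idx))
```

With the roughly 1 million rows of this competition, the same 80/20 ratio leaves about 800,000 rows for training and 200,000 for validation.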

XGBoost parameters from: https://www.kaggle.com/rahulchauhan3j/tps-oct-2021-xgboost-pipeline-with-optuna#Model-Fit-and-Submission

In [None]:
xgb_params = {'n_estimators': 10000,
               'learning_rate': 0.03689407512484644,
               'max_depth': 8,
               'colsample_bytree': 0.3723914688159835,
               'subsample': 0.780714581166012,
               'eval_metric': 'auc',
               'use_label_encoder': False,
               'gamma': 0,
               'reg_lambda': 50.0,
               'tree_method': 'gpu_hist',
               'gpu_id': 0,
               'predictor': 'gpu_predictor',
               'random_state': 42}

In [None]:
xgb_classifier = xgb.XGBClassifier(**xgb_params)
xgb_classifier.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], early_stopping_rounds=10, verbose=True)

In [None]:
sub = cudf.DataFrame()
sub['id'] = submission['id']
sub['target'] = xgb_classifier.predict_proba(test)[:,-1]
sub = sub.set_index('id')

In [None]:
sub.to_csv('submission.csv')

# Kindly upvote 👍🏻 if you found this kernel helpful

**Also, kindly upvote if you are forking the kernel** 😊