In [1]:
############################################################################
##
## Copyright (C) 2021 NVIDIA Corporation.  All rights reserved.
##
## NVIDIA Sample Code
##
## Please refer to the NVIDIA end user license agreement (EULA) associated
## with this source code for terms and conditions that govern your use of
## this software. Any use, reproduction, disclosure, or distribution of
## this software and related documentation outside the terms of the EULA
## is strictly prohibited.
##
############################################################################

In this notebook we show how RAPIDS libraries such as cuDF (a GPU version of Pandas) and cuML (a GPU version of Scikit-learn) can be used to GPU-accelerate the data preprocessing and feature engineering needed to train an XGBoost model on the TabFormer dataset.

In [2]:
import os
import subprocess
import shutil
import cudf
import numpy as np

import cupy as cp
import cuml
from cuml.preprocessing import LabelEncoder

In [3]:
BASE_DIR = "./basedir"

In [4]:
# Path to the processed data directory
processed_path = os.path.join(BASE_DIR, "processed_data_1gpu")

In [5]:
gdf = cudf.read_parquet(os.path.join(processed_path, 'subset_transactions.parquet'))

In [6]:
gdf.head()

Unnamed: 0,user,card,year,month,day,time,amount,use_chip,merchant_id,merchant_city,merchant_state,zip,mcc,errors,is_fraud,card_id,is_train
0,0,0,2002,9,1,06:21,$134.09,Swipe Transaction,64518,La Verne,CA,91750.0,5300,,No,0,True
1,0,0,2002,9,1,06:42,$38.48,Swipe Transaction,43058,Monterey Park,CA,91754.0,5411,,No,0,True
2,0,0,2002,9,2,06:22,$120.34,Swipe Transaction,43058,Monterey Park,CA,91754.0,5411,,No,0,True
3,0,0,2002,9,2,17:45,$128.95,Swipe Transaction,63932,Monterey Park,CA,91754.0,5651,,No,0,True
4,0,0,2002,9,3,06:23,$104.71,Swipe Transaction,76114,La Verne,CA,91750.0,5912,,No,0,True


## Data Preprocessing

In [7]:
# Drop the user and card columns
gdf = gdf.drop(columns=['user', 'card'])

We start our data cleaning by slicing off the dollar-sign prefix from `amount` and casting the result to `float32`.

In [8]:
gdf['amount'] = gdf['amount'].str.slice(1).astype('float32')
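
To make the parsing step concrete, here is a minimal plain-Python sketch of what the cell above does to each value. The helper name `parse_amount` is hypothetical, and the sketch assumes every value starts with a single `$` character, as in the rows shown earlier.

```python
def parse_amount(raw):
    """Drop the first character (assumed to be '$') and parse the rest as a float."""
    return float(raw[1:])


# Sample values copied from the gdf.head() output above.
amounts = [parse_amount(s) for s in ["$134.09", "$38.48", "$120.34"]]
print(amounts)
```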

We then convert the categorical features `merchant_city`, `merchant_state`, `zip` and `mcc` to the `category` dtype.

In [9]:
gdf['merchant_city'] = gdf['merchant_city'].astype('category')
gdf['merchant_state'] = gdf['merchant_state'].astype('category')
gdf['zip'] = gdf['zip'].astype('category')
gdf['mcc'] = gdf['mcc'].astype('category')

## Dealing with Features That Have Multiple Categorical Values per Transaction

We now handle the `errors` column, which can list multiple errors for a single transaction.

In [10]:
unique_errors = ['Bad CVV',
 'Bad Zipcode',
 'Bad PIN',
 'Technical Glitch',
 'Insufficient Balance',
 'Bad Expiration',
 'Bad Card Number']

A transaction in this dataset can be associated with any of 7 possible errors:

In [11]:
print(unique_errors)

['Bad CVV', 'Bad Zipcode', 'Bad PIN', 'Technical Glitch', 'Insufficient Balance', 'Bad Expiration', 'Bad Card Number']


For example, a given transaction can have multiple errors such as `Bad CVV` and `Bad Zipcode` associated with it. We can perform multi-hot encoding, which encodes every error that appears with `1` and every other error with `0`. So across our 7 unique errors, a transaction with the `Bad CVV` and `Bad Zipcode` errors is encoded as `[1, 1, 0, 0, 0, 0, 0]`.

In [12]:
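# Split the comma-separated errors string and put each error on its own row;
# the exploded rows keep the index of the transaction they came from
exploded = gdf['errors'].str.strip(',').str.split(',').explode()

# One-hot encode each error, then sum per original row to get a multi-hot vector
raw_one_hot = cudf.get_dummies(exploded, columns=["errors"])
errs = raw_one_hot.groupby(raw_one_hot.index).sum()

gdf = cudf.concat([gdf, errs], axis=1)

# Prefix the new columns with 'errors_' and drop the original string column
gdf = gdf.rename(columns={col:f'errors_{col}' for col in unique_errors})
gdf = gdf.drop(columns=['errors'])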
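
The multi-hot encoding described above can be traced on a single value in plain Python. This is only an illustrative CPU sketch, not the cuDF code path used in this notebook. The helper name `multi_hot` is hypothetical, and we assume the errors are separated by commas with no surrounding spaces.

```python
unique_errors = ['Bad CVV', 'Bad Zipcode', 'Bad PIN', 'Technical Glitch',
                 'Insufficient Balance', 'Bad Expiration', 'Bad Card Number']


def multi_hot(errors_field, vocabulary):
    """Return a 0/1 vector marking which vocabulary entries occur in the field."""
    if not errors_field:  # None or empty string: no errors recorded
        return [0] * len(vocabulary)
    # Strip stray leading/trailing commas, then split into individual errors.
    present = {e for e in errors_field.strip(',').split(',') if e}
    return [1 if name in present else 0 for name in vocabulary]


print(multi_hot('Bad CVV,Bad Zipcode', unique_errors))  # [1, 1, 0, 0, 0, 0, 0]
print(multi_hot(None, unique_errors))                   # [0, 0, 0, 0, 0, 0, 0]
```

Unlike the `groupby(...).sum()` above, this sketch records presence only, so an error repeated within one transaction would still be encoded as `1`.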

Next we extract the `hour` and `minute` from the `time` column.

In [13]:
gdf['hour'] = gdf['time'].str.slice(stop=2).astype('int32')
gdf['minute'] = gdf['time'].str.slice(start=3).astype('int32')
gdf = gdf.drop(columns=['time'])
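
The slicing above relies on `time` being a fixed-width `HH:MM` string: the first two characters are the hour and everything after the colon at position 2 is the minute. A minimal plain-Python sketch of the same indexing follows; the helper name `split_time` is hypothetical.

```python
def split_time(hhmm):
    """Split an 'HH:MM' string into integer hour and minute."""
    hour = int(hhmm[:2])    # same as .str.slice(stop=2)
    minute = int(hhmm[3:])  # same as .str.slice(start=3), skipping the ':'
    return hour, minute


print(split_time("06:21"))  # (6, 21)
print(split_time("17:45"))  # (17, 45)
```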

## Train-Test Split

Now we create `X_train` and `X_test` and the labels `y_train` and `y_test` based on the temporal train-test split we did in Notebook 1a, where we assigned all transactions in the years 1991-2017 to the training set and those in 2018-2020 to the test set. This corresponds to a roughly 85% train and 15% test split.

In [14]:
train_gdf = gdf.loc[gdf['is_train'], :]
train_gdf = train_gdf.sample(frac=1)

We also check that the class distribution of our label is similar across the train and test sets.

In [15]:
test_gdf = gdf.loc[~gdf['is_train'], :]
test_gdf = test_gdf.sample(frac=1)
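
The two cells above select rows by the boolean `is_train` flag and then shuffle each split with `sample(frac=1)`. The plain-Python sketch below shows the same idea on made-up rows. The year threshold follows the split described above, while the toy data and variable names are assumptions for illustration only.

```python
import random

# Toy rows; the years and labels are made up for illustration.
rows = [
    {"year": 2015, "is_fraud": "No"},
    {"year": 2016, "is_fraud": "Yes"},
    {"year": 2017, "is_fraud": "No"},
    {"year": 2018, "is_fraud": "No"},
    {"year": 2019, "is_fraud": "Yes"},
]

# Temporal flag as described in the text: 2017 and earlier is training data.
for row in rows:
    row["is_train"] = row["year"] <= 2017

train_rows = [r for r in rows if r["is_train"]]
test_rows = [r for r in rows if not r["is_train"]]

# Shuffle each split independently; a fixed seed keeps the order reproducible.
rng = random.Random(0)
rng.shuffle(train_rows)
rng.shuffle(test_rows)

print(len(train_rows), len(test_rows))  # 3 2
```

In the cuDF cells, passing a `random_state` to `sample` should play the same role as the fixed seed here if a reproducible shuffle is needed.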

In [16]:
train_gdf['is_fraud'].value_counts(normalize=True) * 100

No     99.877801
Yes     0.122199
Name: is_fraud, dtype: float64

In [17]:
test_gdf['is_fraud'].value_counts(normalize=True) * 100

No     99.875737
Yes     0.124263
Name: is_fraud, dtype: float64

In [18]:
X_train = train_gdf[train_gdf.columns.difference(['is_fraud', 'is_train'])]
y_train = train_gdf['is_fraud']
X_test = test_gdf[test_gdf.columns.difference(['is_fraud', 'is_train'])]
y_test = test_gdf['is_fraud']

## Label Encode the Labels

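Next we encode the string labels as integers using cuML's `LabelEncoder`, which mirrors the Scikit-learn API. The encoder is fit on the training labels only and then reused to transform the test labels.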

In [19]:
label_encoder = LabelEncoder()

In [20]:
y_train = label_encoder.fit_transform(y_train)
y_test = label_encoder.transform(y_test)
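
As a rough illustration of what label encoding does, the plain-Python sketch below maps each distinct label to an integer. The class `SimpleLabelEncoder` is hypothetical and is not cuML's implementation. It assumes classes are numbered in sorted order, as Scikit-learn's encoder does.

```python
class SimpleLabelEncoder:
    """Toy label encoder: maps each distinct label to an integer code."""

    def fit(self, values):
        # Assumption: codes follow the sorted order of the distinct labels.
        self.classes_ = sorted(set(values))
        self._index = {label: code for code, label in enumerate(self.classes_)}
        return self

    def transform(self, values):
        # Raises KeyError for labels that were not seen during fit.
        return [self._index[v] for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


encoder = SimpleLabelEncoder()
y_train_demo = encoder.fit_transform(["No", "Yes", "No"])
y_test_demo = encoder.transform(["Yes", "No"])
print(encoder.classes_, y_train_demo, y_test_demo)
```

Fitting on the training labels and only transforming the test labels keeps the integer codes consistent between the two sets.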

## Target Encoding

We have high-cardinality categorical columns in our data, such as `merchant_city`, `merchant_state`, `zip` and `mcc` (merchant category code), each with a large number of unique categories. If we were to one-hot encode these, our feature set would blow up and we would be hit hard by the curse of dimensionality. It would also lead to huge memory consumption and very sparse data.

For categorical columns with many levels, instead of one-hot encoding we can use [TargetEncoding](https://docs.rapids.ai/api/cuml/stable/api.html?highlight=target%20encoder#cuml.preprocessing.TargetEncoder.TargetEncoder), where each category in the column is replaced with the mean target value for that category. This way we can still represent the categorical column effectively while using only a single feature. cuML's implementation of TargetEncoding uses several optimizations to prevent label leakage and parallelize the execution. To learn more about Target Encoding in cuML, check out this [target encoder walkthrough](https://medium.com/rapids-ai/target-encoding-with-rapids-cuml-do-more-with-your-categorical-data-8c762c79e784).

In [21]:
from cuml.preprocessing import TargetEncoder

In [22]:
high_card_cols = ["merchant_city", "merchant_state", "zip", "mcc"]
for col in high_card_cols:
    # we append TE to column name to indicate we have target encoded it
    out_col = f'{col}_TE'
    tgt_encoder = TargetEncoder(smooth=0.001)
    X_train[out_col] = tgt_encoder.fit_transform(X_train[col], y_train)
    X_test[out_col] = tgt_encoder.transform(X_test[col])
# drop old columns
X_train.drop(columns=high_card_cols, inplace=True)
X_test.drop(columns=high_card_cols, inplace=True)
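
To build intuition for what the encoder computes, here is a simplified plain-Python sketch of mean target encoding with additive smoothing. It is not cuML's implementation: the function names are hypothetical, the smoothing formula `(sum + smooth * global_mean) / (count + smooth)` is one common formulation assumed here for illustration, and the fold-based fitting that cuML uses on the training data to limit label leakage is deliberately left out.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smooth=0.001):
    """Return ({category: smoothed mean target}, global mean target)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    # Smoothing pulls categories with few rows toward the global mean.
    mapping = {
        cat: (sums[cat] + smooth * global_mean) / (counts[cat] + smooth)
        for cat in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, default):
    """Replace each category by its encoded value; unseen ones get the default."""
    return [mapping.get(cat, default) for cat in categories]


# Toy data: fraud labels (1 = fraud) for five transactions in two states.
train_states = ["CA", "CA", "NY", "NY", "NY"]
train_labels = [0, 1, 0, 0, 0]

mapping, global_mean = fit_target_encoding(train_states, train_labels)
encoded_test = apply_target_encoding(["CA", "NY", "TX"], mapping, global_mean)
print(encoded_test)
```

With so little smoothing, `CA` is encoded close to its raw fraud rate of 0.5 and `NY` close to 0, while the unseen `TX` falls back to the global mean of 0.2.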

## One Hot Encoding

As discussed above, we now one-hot encode the remaining low-cardinality categorical columns, such as `use_chip`, which has 3 unique categories. We can easily accomplish this with cuDF's `get_dummies` function (just like in Pandas).

In [23]:
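oneh_enc_cols = ["use_chip"]
# One-hot encode only the listed low-cardinality columns
X_train = cudf.get_dummies(X_train, columns=oneh_enc_cols)
X_test = cudf.get_dummies(X_test, columns=oneh_enc_cols)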
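
One thing to watch when one-hot encoding train and test separately is that both end up with the same columns, which only holds if every category appears in both sets. The plain-Python sketch below shows one way to guarantee this by fixing the category list from the training data. The helper `one_hot` and the category values other than `Swipe Transaction` are assumptions for illustration.

```python
def one_hot(values, categories, prefix):
    """Encode each value as a dict of 0/1 indicators, one key per known category."""
    return [
        {f"{prefix}_{cat}": int(value == cat) for cat in categories}
        for value in values
    ]


train_chip = ["Swipe Transaction", "Chip Transaction", "Online Transaction"]
test_chip = ["Swipe Transaction", "Swipe Transaction"]  # two categories missing

# Fix the category list from the training data and reuse it for the test data.
categories = sorted(set(train_chip))
train_encoded = one_hot(train_chip, categories, "use_chip")
test_encoded = one_hot(test_chip, categories, "use_chip")

same_columns = list(train_encoded[0]) == list(test_encoded[0])
print(same_columns)  # True
```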

## Save Dataframes

Finally, we save the processed train and test dataframes.

In [24]:
X_train['label'] = y_train
X_test['label'] = y_test

In [25]:
X_train.to_parquet(os.path.join(processed_path, 'X_train.parquet'))

In [26]:
X_test.to_parquet(os.path.join(processed_path, 'X_test.parquet'))