# XGBoost Starter with NVTabular - LB 0.794

This notebook builds on the wonderful [XGBoost Starter - [0.793]](https://www.kaggle.com/code/cdeotte/xgboost-starter-0-793) by [Chris Deotte](https://www.kaggle.com/cdeotte).

In this notebook, we will use [NVTabular](https://github.com/NVIDIA-Merlin/NVTabular). It is a library designed specifically for processing tabular data that tightly integrates with the rest of the [Merlin Framework](https://github.com/NVIDIA-Merlin/Merlin).

NVTabular features very powerful data preprocessing techniques implemented using best practices straight from Kaggle (target encoding, for instance). It follows the `sklearn` `fit`, `transform` and `fit_transform` pattern, which allows us to write fewer lines of code using a familiar API.
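
To make the `fit` / `transform` idea concrete, here is a minimal pure-Python sketch of smoothed target encoding. It is not NVTabular's implementation: the class name `SimpleTargetEncoder` and its `smoothing` parameter are made up for illustration, and the sketch leaves out the k-fold scheme a real target encoder would normally use to avoid leaking the target.

```python
from collections import defaultdict


class SimpleTargetEncoder:
    """Toy smoothed target encoder (hypothetical, not NVTabular's code)."""

    def __init__(self, smoothing=20.0):
        self.smoothing = smoothing
        self.global_mean_ = None
        self.encodings_ = {}

    def fit(self, categories, targets):
        sums = defaultdict(float)
        counts = defaultdict(int)
        for cat, y in zip(categories, targets):
            sums[cat] += y
            counts[cat] += 1
        self.global_mean_ = sum(targets) / len(targets)
        # Shrink each category mean toward the global mean;
        # categories with few rows are shrunk the most.
        self.encodings_ = {
            cat: (sums[cat] + self.smoothing * self.global_mean_)
            / (counts[cat] + self.smoothing)
            for cat in counts
        }
        return self

    def transform(self, categories):
        # Categories never seen during fit fall back to the global mean.
        return [self.encodings_.get(cat, self.global_mean_) for cat in categories]

    def fit_transform(self, categories, targets):
        return self.fit(categories, targets).transform(categories)


cats = ["a", "a", "b", "b", "b", "c"]
ys = [1, 0, 1, 1, 1, 0]
encoder = SimpleTargetEncoder(smoothing=1.0)
encoded = encoder.fit_transform(cats, ys)
unseen = encoder.transform(["z"])
```

The statistics are learned once in `fit` and can then be reapplied to new data with `transform`, which is the same split we rely on later when the fitted workflow is reused on the test set.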

Also, NVTabular leverages `dask_cudf` as its backend. This enables running calculations on arbitrarily large amounts of data that might not fit into our GPU RAM all at once, with some limitations: certain operations might still cause OOM errors.
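
The idea behind partitioned processing can be sketched in plain Python. This is only an analogy and not how `dask_cudf` is implemented: the helper names `iter_partitions` and `partitioned_mean` are hypothetical, and a list slice stands in for a partition of a much larger dataset. The point is that a reduction such as a mean only needs small running totals, so only one partition has to be in memory at a time.

```python
def iter_partitions(values, partition_size):
    """Yield successive slices, standing in for partitions of a large dataset."""
    for start in range(0, len(values), partition_size):
        yield values[start:start + partition_size]


def partitioned_mean(partitions):
    """Compute a mean while looking at one partition at a time."""
    total, count = 0.0, 0
    for part in partitions:
        total += sum(part)   # per-partition partial result
        count += len(part)
    return total / count


data = list(range(1, 11))
mean = partitioned_mean(iter_partitions(data, partition_size=3))
```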

Let's see if we will be able to put all this to good use in this competition! 🙂

We first need to install a bunch of things that are missing from the Kaggle kernel. The installation can take several minutes so do not be alarmed.

# Load Libraries

In [None]:
%%bash

sudo apt update -y --fix-missing
sudo apt install -y --no-install-recommends libexpat1-dev libsasl2-2 libssl-dev graphviz openssl protobuf-compiler software-properties-common libopenmpi-dev
sudo apt autoremove -y
sudo apt clean
sudo rm -rf /var/lib/apt/lists/*

pip install betterproto graphviz pybind11 pydot pytest mpi4py
pip install nvidia-pyindex
pip install nvtabular

In [None]:
!git clone https://github.com/NVIDIA-Merlin/NVTabular.git

In [None]:
import os
os.chdir('NVTabular')

!git fetch origin pull/1580/head:fix_dtype_discrepancy_after_groupby_aggs
!git checkout fix_dtype_discrepancy_after_groupby_aggs
!pip install -e .

In [None]:
import pandas as pd, numpy as np # CPU libraries
import cupy, cudf # GPU libraries
import matplotlib.pyplot as plt, gc, os
import nvtabular as nvt

print('RAPIDS version',cudf.__version__)

In [None]:
# VERSION NAME FOR SAVED MODEL FILES
VER = "nvt_0"

# TRAIN RANDOM SEED
SEED = 42

# FILL NAN VALUE
NAN_VALUE = -127 # will fit in int8

# FOLDS PER MODEL
FOLDS = 5

# Process and Feature Engineer Train Data
We will load @raddar's Kaggle dataset from [here][1] with discussion [here][2]. Then we will engineer features suggested by @huseyincot in his notebooks [here][3] and [here][4]. We will use [NVTabular][6] (leveraging `cudf` by [RAPIDS][5]) and the GPU to create new features quickly.

[1]: https://www.kaggle.com/datasets/raddar/amex-data-integer-dtypes-parquet-format
[2]: https://www.kaggle.com/competitions/amex-default-prediction/discussion/328514
[3]: https://www.kaggle.com/code/huseyincot/amex-catboost-0-793
[4]: https://www.kaggle.com/code/huseyincot/amex-agg-data-how-it-created
[5]: https://rapids.ai/
[6]: https://github.com/NVIDIA-Merlin/NVTabular
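
The engineered features are per-customer summaries of each column across that customer's statements. As a rough CPU-only sketch of what such an aggregation computes (using made-up rows and plain Python, not the NVTabular ops used in this notebook):

```python
from collections import defaultdict
from statistics import mean, stdev

# Made-up statements: (customer_ID, statement order, feature value).
rows = [
    (1, 1, 0.2), (1, 2, 0.4), (1, 3, 0.9),
    (2, 1, 0.5), (2, 2, 0.1),
]

by_customer = defaultdict(list)
for customer, _, value in sorted(rows):   # sort by customer, then statement order
    by_customer[customer].append(value)

features = {
    customer: {
        "mean": mean(values),
        "std": stdev(values) if len(values) > 1 else float("nan"),
        "min": min(values),
        "max": max(values),
        "last": values[-1],               # value on the most recent statement
    }
    for customer, values in by_customer.items()
}
```

Note that `last` is only meaningful because the rows are sorted by date within each customer first.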

In [None]:
def read_file(path = '', usecols = None):
    if usecols is not None: df = cudf.read_parquet(path, columns=usecols)
    else: df = cudf.read_parquet(path)
    
    df['customer_ID'] = df['customer_ID'].str[-16:].str.hex_to_int().astype('int64')
    df.S_2 = cudf.to_datetime(df.S_2)
    df = df.fillna(NAN_VALUE) 
    print('shape of data:', df.shape)
    
    return df

train = read_file('../../input/amex-data-integer-dtypes-parquet-format/train.parquet')
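
The `customer_ID` line in `read_file` swaps the long hexadecimal string for a compact 64-bit integer key. Here is a small pure-Python sketch of the same mapping. The function name `customer_id_to_int64` and the example ID are made up, and wrapping large values into the signed range is an assumption about how the `int64` result behaves.

```python
import hashlib


def customer_id_to_int64(customer_id):
    """Map the last 16 hex characters of an ID to a signed 64-bit integer."""
    value = int(customer_id[-16:], 16)   # 16 hex chars = 64 bits, unsigned
    if value >= 2**63:                   # assumption: wrap into the signed range
        value -= 2**64
    return value


# A made-up 64-character hex string, similar in shape to the real IDs.
example_id = hashlib.sha256(b"made-up customer").hexdigest()
as_int = customer_id_to_int64(example_id)
```

An integer key uses far less memory than a 64-character string and is faster to group and merge on. The same conversion has to be applied to the labels and to the submission file so that the keys line up.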

In [None]:
all_cols = [c for c in list(train.columns) if c not in ['customer_ID','S_2']]
cat_features = ["B_30","B_38","D_114","D_116","D_117","D_120","D_126","D_63","D_64","D_66","D_68"]
num_features = [col for col in all_cols if col not in cat_features]

targets = cudf.read_csv('../../input/amex-default-prediction/train_labels.csv')
targets['customer_ID'] = targets['customer_ID'].str[-16:].str.hex_to_int().astype('int64')

train = train.merge(targets, on='customer_ID', how='left')

In [None]:
df = train.to_pandas()
fig, axes = plt.subplots(4, 3, figsize=(15,8))

for col, ax in zip(cat_features, axes.flat):
    vc = df[col].value_counts()
    vc.reset_index(drop=True, inplace=True)
    ax.bar(vc.index.astype(np.int32), vc.values)
    ax.set_title('Frequency of ' + str(col))
    ax.set_xlabel('Category rank (most frequent first)')
    ax.set_ylabel('Frequency')

axes.flatten()[-1].axis('off')    
fig.tight_layout()

In [None]:
%%time

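# Split into 5 partitions and move each customer's rows into a single partition
ds = nvt.Dataset(train, npartitions=5)
ds = ds.shuffle_by_keys(keys=['customer_ID'])
# Sort by date within each customer so that 'last' picks the most recent statement
train = ds.to_ddf()
train = train.sort_values(['customer_ID', 'S_2'])
ds = nvt.Dataset(train)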

num_aggs = num_features + ['customer_ID'] >> nvt.ops.Groupby(
    'customer_ID',
    aggs=['mean', 'std', 'min', 'max', 'last']
)

cat_aggs = cat_features + ['customer_ID'] >> nvt.ops.Groupby(
    'customer_ID',
    aggs=['count', 'last', 'nunique']
)

te = cat_features >> nvt.ops.TargetEncoding(
    'target',
    kfold=5
) 
te = te + ['customer_ID'] >> nvt.ops.Groupby(
    'customer_ID',
    aggs=['mean', 'last']
)

out = num_aggs + cat_aggs + te >> nvt.ops.ReduceDtypeSize()

wf = nvt.Workflow(out)
train = wf.fit_transform(ds).compute()

In [None]:
train = train.merge(targets, on='customer_ID', how='left')

In [None]:
# the beauty of nvt.ops.ReduceDtypeSize! our dtypes got reduced to save memory... automagically!
train.dtypes.value_counts()
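
The idea behind shrinking dtypes is simple: look at the range of values in a column and pick the narrowest type that can hold it. Below is a minimal sketch for integer columns. It is not the `ReduceDtypeSize` implementation, and the helper `smallest_int_dtype` is hypothetical.

```python
# (dtype name, smallest value, largest value)
INT_RANGES = [
    ("int8", -2**7, 2**7 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("int32", -2**31, 2**31 - 1),
    ("int64", -2**63, 2**63 - 1),
]


def smallest_int_dtype(values):
    """Return the name of the narrowest signed integer dtype that fits."""
    lo, hi = min(values), max(values)
    for name, dtype_min, dtype_max in INT_RANGES:
        if dtype_min <= lo and hi <= dtype_max:
            return name
    raise OverflowError("values do not fit in int64")


small = smallest_int_dtype([-127, 0, 12])   # -127 is our NAN_VALUE
medium = smallest_int_dtype([0, 300])
large = smallest_int_dtype([0, 2**40])
```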

# Train XGB
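We will train using `DeviceQuantileDMatrix`, which builds XGBoost's quantized (binned) training matrix directly from batches supplied by a data iterator. The full float feature matrix never has to sit on the GPU at once, so the GPU memory footprint stays very small.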
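
To get a feel for why binning saves memory, here is a toy sketch of quantile binning in plain Python. XGBoost's actual quantile sketch is a different, approximate algorithm that runs on the GPU; the helper `quantile_bin` is hypothetical and only shows the end result, where each value is replaced by a small bin index.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_bin(values, max_bin):
    """Replace each value with the index of the quantile bucket it falls in."""
    cuts = quantiles(values, n=max_bin)   # max_bin - 1 cut points
    return [bisect_right(cuts, v) for v in values], cuts


values = [0.1 * i for i in range(1000)]
bins, cuts = quantile_bin(values, max_bin=256)
```

With `max_bin=256` every bin index fits in a single byte, compared with 4 bytes for a `float32` value.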

In [None]:
# LOAD XGB LIBRARY
from sklearn.model_selection import KFold
import xgboost as xgb
print('XGB Version',xgb.__version__)

# XGB MODEL PARAMETERS
xgb_parms = { 
    'max_depth':4, 
    'learning_rate':0.05, 
    'subsample':0.8,
    'colsample_bytree':0.6, 
    'eval_metric':'logloss',
    'objective':'binary:logistic',
    'tree_method':'gpu_hist',
    'predictor':'gpu_predictor',
    'random_state':SEED
}

In [None]:
# NEEDED WITH DeviceQuantileDMatrix BELOW
class IterLoadForDMatrix(xgb.core.DataIter):
    def __init__(self, df=None, features=None, target=None, batch_size=256*1024):
        self.features = features
        self.target = target
        self.df = df
        self.it = 0 # set iterator to 0
        self.batch_size = batch_size
        self.batches = int( np.ceil( len(df) / self.batch_size ) )
        super().__init__()

    def reset(self):
        '''Reset the iterator'''
        self.it = 0

    def next(self, input_data):
        '''Yield next batch of data.'''
        if self.it == self.batches:
            return 0 # Return 0 when there's no more batch.
        
        a = self.it * self.batch_size
        b = min( (self.it + 1) * self.batch_size, len(self.df) )
        dt = cudf.DataFrame(self.df.iloc[a:b])
        input_data(data=dt[self.features], label=dt[self.target]) #, weight=dt['weight'])
        self.it += 1
        return 1

In [None]:
# https://www.kaggle.com/kyakovlev
# https://www.kaggle.com/competitions/amex-default-prediction/discussion/327534
def amex_metric_mod(y_true, y_pred):

    labels     = np.transpose(np.array([y_true, y_pred]))
    labels     = labels[labels[:, 1].argsort()[::-1]]
    weights    = np.where(labels[:,0]==0, 20, 1)
    cut_vals   = labels[np.cumsum(weights) <= int(0.04 * np.sum(weights))]
    top_four   = np.sum(cut_vals[:,0]) / np.sum(labels[:,0])

    gini = [0,0]
    for i in [1,0]:
        labels         = np.transpose(np.array([y_true, y_pred]))
        labels         = labels[labels[:, i].argsort()[::-1]]
        weight         = np.where(labels[:,0]==0, 20, 1)
        weight_random  = np.cumsum(weight / np.sum(weight))
        total_pos      = np.sum(labels[:, 0] *  weight)
        cum_pos_found  = np.cumsum(labels[:, 0] * weight)
        lorentz        = cum_pos_found / total_pos
        gini[i]        = np.sum((lorentz - weight_random) * weight)

    return 0.5 * (gini[1]/gini[0] + top_four)
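
The `top_four` term in the metric above is the share of all defaulters that land in the top 4% of predictions, where each non-default row counts 20 times. A loop-based sketch of just that term, on made-up data, may be easier to read than the vectorised version (the function name `top_four_percent_captured` is hypothetical):

```python
def top_four_percent_captured(y_true, y_pred):
    """Share of all positives found in the top 4% of predictions by weight."""
    ranked = sorted(zip(y_pred, y_true), key=lambda pair: pair[0], reverse=True)
    weights = [20 if label == 0 else 1 for _, label in ranked]
    cutoff = int(0.04 * sum(weights))
    captured, running_weight = 0, 0
    for (_, label), weight in zip(ranked, weights):
        running_weight += weight
        if running_weight > cutoff:   # past the top 4% of total weight
            break
        captured += label
    return captured / sum(y_true)


# Made-up example: 4 defaulters among 100 customers.
y_true = [1] * 4 + [0] * 96
best_scores = [1.0 - i / 100 for i in range(100)]   # defaulters ranked first
worst_scores = list(reversed(best_scores))          # defaulters ranked last

perfect = top_four_percent_captured(y_true, best_scores)
worst = top_four_percent_captured(y_true, worst_scores)
```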

In [None]:
FEATURES = train.columns[1:-1]

importances = []
oof = []
train = train.to_pandas() # free GPU memory
TRAIN_SUBSAMPLE = 1.0
gc.collect()

skf = KFold(n_splits=FOLDS)
for fold,(train_idx, valid_idx) in enumerate(skf.split(
            train, train.target )):
    
    # TRAIN WITH SUBSAMPLE OF TRAIN FOLD DATA
    if TRAIN_SUBSAMPLE<1.0:
        np.random.seed(SEED)
        train_idx = np.random.choice(train_idx, 
                       int(len(train_idx)*TRAIN_SUBSAMPLE), replace=False)
        np.random.seed(None)
    
    print('#'*25)
    print('### Fold',fold+1)
    print('### Train size',len(train_idx),'Valid size',len(valid_idx))
    print(f'### Training with {int(TRAIN_SUBSAMPLE*100)}% fold data...')
    print('#'*25)
    
    # TRAIN, VALID, TEST FOR FOLD K
    Xy_train = IterLoadForDMatrix(train.loc[train_idx], FEATURES, 'target')
    X_valid = train.loc[valid_idx, FEATURES]
    y_valid = train.loc[valid_idx, 'target']
    
    dtrain = xgb.DeviceQuantileDMatrix(Xy_train, max_bin=256)
    dvalid = xgb.DMatrix(data=X_valid, label=y_valid)
    
    # TRAIN MODEL FOLD K
    model = xgb.train(xgb_parms, 
                dtrain=dtrain,
                evals=[(dtrain,'train'),(dvalid,'valid')],
                num_boost_round=9999,
                early_stopping_rounds=100,
                verbose_eval=100) 
    model.save_model(f'XGB_v{VER}_fold{fold}.xgb')
    
    # GET FEATURE IMPORTANCE FOR FOLD K
    dd = model.get_score(importance_type='weight')
    df = pd.DataFrame({'feature':dd.keys(),f'importance_{fold}':dd.values()})
    importances.append(df)
            
    # INFER OOF FOLD K
    oof_preds = model.predict(dvalid)
    acc = amex_metric_mod(y_valid.values, oof_preds)
    print('Kaggle Metric =',acc,'\n')
    
    # SAVE OOF
    df = train.loc[valid_idx, ['customer_ID','target'] ].copy()
    df['oof_pred'] = oof_preds
    oof.append( df )
    
    del dtrain, Xy_train, dd, df
    del X_valid, y_valid, dvalid, model
    _ = gc.collect()
    
print('#'*25)
oof = pd.concat(oof,axis=0,ignore_index=True).set_index('customer_ID')
acc = amex_metric_mod(oof.target.values, oof.oof_pred.values)
print('OVERALL CV Kaggle Metric =',acc)

In [None]:
# CLEAN RAM
del train
_ = gc.collect()

# Predict on test set

In [None]:
# CALCULATE SIZE OF EACH SEPARATE TEST PART
def get_rows(customers, test, NUM_PARTS = 4, verbose = ''):
    chunk = len(customers)//NUM_PARTS
    if verbose != '':
        print(f'We will process {verbose} data as {NUM_PARTS} separate parts.')
        print(f'There will be {chunk} customers in each part (except the last part).')
        print('Below are number of rows in each part:')
    rows = []

    for k in range(NUM_PARTS):
        if k==NUM_PARTS-1: cc = customers[k*chunk:]
        else: cc = customers[k*chunk:(k+1)*chunk]
        s = test.loc[test.customer_ID.isin(cc)].shape[0]
        rows.append(s)
    if verbose != '': print( rows )
    return rows,chunk

# COMPUTE SIZE OF 4 PARTS FOR TEST DATA
NUM_PARTS = 4
TEST_PATH = '../../input/amex-data-integer-dtypes-parquet-format/test.parquet'

print(f'Reading test data...')
test = read_file(path = TEST_PATH, usecols = ['customer_ID','S_2'])
customers = test[['customer_ID']].drop_duplicates().sort_index().values.flatten()
rows,num_cust = get_rows(customers, test[['customer_ID']], NUM_PARTS = NUM_PARTS, verbose = 'test')
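
The splitting rule used by `get_rows` gives every part the same number of customers and lets the last part absorb the remainder. A stripped-down sketch of that rule, with a hypothetical helper `split_into_parts` and made-up IDs:

```python
def split_into_parts(items, num_parts):
    """Split items into num_parts slices; the last slice takes the remainder."""
    chunk = len(items) // num_parts
    parts = []
    for k in range(num_parts):
        start = k * chunk
        end = len(items) if k == num_parts - 1 else start + chunk
        parts.append(items[start:end])
    return parts


customer_ids = list(range(10))   # made-up customer IDs
parts = split_into_parts(customer_ids, num_parts=4)
sizes = [len(p) for p in parts]
```

Because every customer lands in exactly one part, the per-customer aggregations computed for each part are complete.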

In [None]:
# INFER TEST DATA IN PARTS
skip_rows = 0
skip_cust = 0
test_preds = []

for k in range(NUM_PARTS):
    
    # READ PART OF TEST DATA
    print(f'\nReading test data...')
    test = read_file(path = TEST_PATH)
    test = test.iloc[skip_rows:skip_rows+rows[k]]
    skip_rows += rows[k]
    print(f'=> Test part {k+1} has shape', test.shape )
    
    # PROCESS AND FEATURE ENGINEER PART OF TEST DATA
    # Clip out-of-range values so the int8 casts below do not overflow
    test.loc[test['D_59'] > 127, 'D_59'] = -1
    test.loc[test['D_123'] > 127, 'D_123'] = -1

    test['D_59'] = test['D_59'].astype(np.int8)
    test['D_123'] = test['D_123'].astype(np.int8)
    test['D_124'] = test['D_124'].astype(np.int16)
    test['target'] = 0 # dummy variable to get our pipeline to run
    test = wf.transform(nvt.Dataset(test)).compute()
    test = test.set_index('customer_ID')
    if k==NUM_PARTS-1: test = test.loc[customers[skip_cust:]]
    else: test = test.loc[customers[skip_cust:skip_cust+num_cust]]
    skip_cust += num_cust
    
    # TEST DATA FOR XGB
    X_test = test[FEATURES]
    dtest = xgb.DMatrix(data=X_test)
    test = test[['P_2_mean']] # reduce memory
    del X_test
    gc.collect()

    # INFER XGB MODELS ON TEST DATA
    model = xgb.Booster()
    model.load_model(f'XGB_v{VER}_fold0.xgb')
    preds = model.predict(dtest)
    for f in range(1,FOLDS):
        model.load_model(f'XGB_v{VER}_fold{f}.xgb')
        preds += model.predict(dtest)
    preds /= FOLDS
    test_preds.append(preds)

    # CLEAN MEMORY
    del dtest, model
    _ = gc.collect()

In [None]:
# WRITE SUBMISSION FILE
test_preds = np.concatenate(test_preds)
test = cudf.DataFrame(index=customers,data={'prediction':test_preds})
sub = cudf.read_csv('../../input/amex-default-prediction/sample_submission.csv')[['customer_ID']]
sub['customer_ID_hash'] = sub['customer_ID'].str[-16:].str.hex_to_int().astype('int64')
sub = sub.set_index('customer_ID_hash')
sub = sub.merge(test[['prediction']], left_index=True, right_index=True, how='left')
sub = sub.reset_index(drop=True)

# SAVE AND DISPLAY PREDICTIONS
sub.to_csv('../submission.csv', index=False)
print('Submission file shape is', sub.shape )
sub.head()

In [None]:
# PLOT PREDICTIONS
plt.hist(sub.to_pandas().prediction, bins=100)
plt.title('Test Predictions')
plt.show()